Comprehensive Python Programming: From Data Handling to AI Applications

A practical guide for mastering advanced Python concepts

Introduction

Welcome to the advanced sections of our comprehensive Python programming guide. In these chapters, we'll explore the versatility and power of Python beyond the fundamentals. Whether you're looking to analyze complex datasets, build web applications, explore artificial intelligence, or develop ethical hacking tools, Python offers the libraries and frameworks to bring your ideas to life.

Each chapter builds on core Python knowledge, introducing specialized libraries and techniques for various domains. We've designed this guide with practical applications in mind—you'll find numerous code examples, projects, and exercises to reinforce your learning.

By the end of this guide, you'll have a broad understanding of Python's capabilities across multiple domains and the confidence to apply these skills to your own innovative projects. Let's begin this exciting journey into advanced Python programming!

Chapter 8: Working with Data

Learning Objectives

  • Master file operations for structured and unstructured data
  • Process and manipulate data using powerful Python libraries
  • Analyze and visualize data to extract meaningful insights
  • Work with databases from Python applications
  • Implement data cleaning and transformation pipelines

8.1 File Operations in Python

Key Concept: File Handling

Python provides robust built-in functions for reading, writing, and manipulating files, making it an excellent language for data processing tasks.

Working with Text Files

Python's file operations are straightforward and powerful. The basic pattern involves opening a file, performing operations, and closing it when done:

# Reading a text file
with open('data.txt', 'r') as file:
    content = file.read()
    print(content)

# Writing to a text file
with open('output.txt', 'w') as file:
    file.write('This is some data\n')
    file.write('This is another line of data')

# Appending to a text file
with open('output.txt', 'a') as file:
    file.write('\nThis line is appended to the file')

Tip: Always use the with statement when working with files. It ensures proper resource management by automatically closing the file when operations are complete, even if exceptions occur.
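
To see this guarantee in action, you can raise an error inside the block and confirm the file was still closed afterward (a quick sketch, reusing data.txt from the example above):

# The file is closed automatically, even when an exception is raised inside the block
try:
    with open('data.txt', 'r') as file:
        first_line = file.readline()
        raise ValueError('simulated processing error')
except ValueError:
    pass

print(file.closed)  # True - the with statement closed the file despite the exception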

CSV and JSON Files

Most data analysis tasks involve structured data formats like CSV (Comma-Separated Values) and JSON (JavaScript Object Notation). Python provides dedicated modules for handling these formats:

# Working with CSV files
import csv

# Reading CSV
with open('data.csv', 'r') as file:
    csv_reader = csv.reader(file)
    for row in csv_reader:
        print(row)

# Writing CSV
with open('output.csv', 'w', newline='') as file:
    csv_writer = csv.writer(file)
    csv_writer.writerow(['Name', 'Age', 'City'])
    csv_writer.writerow(['Alice', 28, 'New York'])
    csv_writer.writerow(['Bob', 32, 'San Francisco'])

# Working with JSON
import json

# Reading JSON
with open('data.json', 'r') as file:
    data = json.load(file)
    print(data)

# Writing JSON
data = {
    'name': 'Alice',
    'age': 28,
    'city': 'New York',
    'skills': ['Python', 'Data Analysis', 'Machine Learning']
}

with open('output.json', 'w') as file:
    json.dump(data, file, indent=4)

8.2 Data Analysis with Pandas

Key Concept: Pandas Library

Pandas is the most popular Python library for data manipulation and analysis, offering powerful data structures and operations for manipulating numerical tables and time series.

To use Pandas, you'll first need to install it:

pip install pandas

Working with DataFrames

The DataFrame is Pandas' primary data structure—a two-dimensional labeled data structure with columns that can be of different types:

import pandas as pd
import numpy as np

# Creating a DataFrame from a dictionary
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Age': [25, 30, 35, 40, 45],
    'City': ['New York', 'San Francisco', 'Los Angeles', 'Chicago', 'Boston'],
    'Salary': [70000, 80000, 90000, 75000, 85000]
}

df = pd.DataFrame(data)
print(df)

# Reading data from a CSV file
# df = pd.read_csv('data.csv')

# Basic information about the DataFrame
df.info()             # Prints a concise summary directly (returns None, so print() isn't needed)
print(df.describe())

# Accessing data
print(df['Name'])         # Access a column
print(df[['Name', 'Age']])  # Access multiple columns
print(df.iloc[0])         # Access a row by position
print(df.loc[2])          # Access a row by label
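
Rows can also be selected with boolean conditions, a filtering pattern you'll use constantly in analysis. A short sketch using the DataFrame defined above:

# Filtering rows with boolean conditions
print(df[df['Age'] > 30])                                       # Rows where Age is greater than 30
print(df[(df['City'] == 'Boston') | (df['Salary'] >= 80000)])   # Combine conditions with | (or) and & (and)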

Data Cleaning and Transformation

Real-world data is often messy. Pandas provides functions to clean and transform data:

# Handling missing values
df = pd.DataFrame({
    'A': [1, 2, np.nan, 4],
    'B': [5, np.nan, np.nan, 8],
    'C': [9, 10, 11, 12]
})

print("DataFrame with missing values:")
print(df)

# Check for missing values
print("\nMissing values per column:")
print(df.isnull().sum())

# Fill missing values
df_filled = df.fillna(0)  # Fill with zeros
print("\nFilled with zeros:")
print(df_filled)

df_filled_mean = df.fillna(df.mean())  # Fill with column means
print("\nFilled with column means:")
print(df_filled_mean)

# Drop rows with any missing values
df_dropped = df.dropna()
print("\nRows with missing values dropped:")
print(df_dropped)

# Data transformation
# Create a new column based on existing ones
df = pd.DataFrame({
    'A': [1, 2, 3, 4],
    'B': [5, 6, 7, 8]
})
df['C'] = df['A'] + df['B']
print("\nDataFrame with calculated column:")
print(df)

# Apply a function to a column
df['D'] = df['A'].apply(lambda x: x * 2)
print("\nDataFrame with applied function:")
print(df)

Data Grouping and Aggregation

One of Pandas' most powerful features is its ability to group and aggregate data:

# Sample DataFrame with categorical data
data = {
    'Category': ['A', 'B', 'A', 'B', 'A', 'C', 'C', 'B'],
    'Value': [10, 20, 15, 25, 30, 40, 35, 22]
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)

# Group by Category and calculate statistics
grouped = df.groupby('Category')
print("\nGroup means:")
print(grouped.mean())

print("\nGroup sums:")
print(grouped.sum())

# Multiple aggregations
print("\nMultiple aggregations:")
print(grouped.agg(['min', 'max', 'mean', 'count']))

# Custom aggregation
print("\nCustom aggregation:")
print(grouped.agg({
    'Value': ['min', 'max', 'mean', lambda x: x.max() - x.min()]
}))

8.3 Data Visualization

Key Concept: Visual Data Analysis

Data visualization helps identify patterns, trends, and outliers in data that might not be apparent from raw numbers. Python offers several libraries for creating compelling visualizations.

The main visualization libraries in Python are:

  • Matplotlib: The foundation for most visualization in Python
  • Seaborn: Built on Matplotlib, providing higher-level abstractions and prettier defaults
  • Plotly: For interactive visualizations (a short sketch follows the Matplotlib and Seaborn examples below)

# Install required libraries
# pip install matplotlib seaborn

import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np

# Create sample data
np.random.seed(42)
data = {
    'x': np.random.normal(0, 1, 100),
    'y': np.random.normal(0, 1, 100),
    'category': np.random.choice(['A', 'B', 'C'], 100)
}
df = pd.DataFrame(data)

# Basic Matplotlib
plt.figure(figsize=(10, 6))
plt.plot([1, 2, 3, 4, 5], [1, 4, 9, 16, 25], 'bo-')
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.title('Simple Line Plot')
plt.grid(True)
# plt.savefig('line_plot.png')
plt.close()

# Multiple plots with Matplotlib
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# First subplot: scatter plot
axes[0].scatter(df['x'], df['y'], c='blue', alpha=0.5)
axes[0].set_title('Scatter Plot')
axes[0].set_xlabel('X-axis')
axes[0].set_ylabel('Y-axis')
axes[0].grid(True)

# Second subplot: histogram
axes[1].hist(df['x'], bins=15, alpha=0.7)
axes[1].set_title('Histogram')
axes[1].set_xlabel('Value')
axes[1].set_ylabel('Frequency')
axes[1].grid(True)

plt.tight_layout()
# plt.savefig('matplotlib_plots.png')
plt.close()

# Seaborn visualizations
plt.figure(figsize=(10, 6))
sns.set_style("whitegrid")
sns.scatterplot(data=df, x='x', y='y', hue='category', palette='viridis')
plt.title('Seaborn Scatter Plot with Categories')
# plt.savefig('seaborn_scatter.png')
plt.close()

# Seaborn distribution plots
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
sns.histplot(df['x'], kde=True)
plt.title('Histogram with KDE')

plt.subplot(1, 2, 2)
sns.boxplot(x='category', y='x', data=df)
plt.title('Box Plot by Category')

plt.tight_layout()
# plt.savefig('seaborn_distributions.png')
plt.close()
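
Plotly, mentioned in the list above, produces interactive figures that open in the browser. A minimal sketch, assuming Plotly is installed (pip install plotly) and reusing the df created earlier:

# Interactive scatter plot with Plotly Express
import plotly.express as px

fig = px.scatter(df, x='x', y='y', color='category', title='Interactive Scatter Plot')
# fig.write_html('plotly_scatter.html')  # Save as a standalone interactive HTML file
# fig.show()                             # Or open directly in the browser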

8.4 Working with Databases

Key Concept: Database Integration

For persistent storage and complex queries, databases are essential. Python can interact with virtually any database system.

SQLite: Built-in Database

Python's standard library includes SQLite, a lightweight disk-based database that requires no separate server:

import sqlite3
import pandas as pd

# Connect to a database (will be created if it doesn't exist)
conn = sqlite3.connect('example.db')

# Create a cursor object
cursor = conn.cursor()

# Create a table
cursor.execute('''
CREATE TABLE IF NOT EXISTS employees (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    department TEXT,
    salary REAL,
    hire_date TEXT
)
''')

# Insert data
employees = [
    (1, 'Alice Smith', 'Engineering', 75000.00, '2020-01-15'),
    (2, 'Bob Johnson', 'Marketing', 65000.00, '2019-03-10'),
    (3, 'Charlie Brown', 'Engineering', 80000.00, '2021-05-22'),
    (4, 'Diana Lee', 'Finance', 72000.00, '2018-11-30'),
    (5, 'Edward Wilson', 'HR', 62000.00, '2022-02-05')
]

cursor.executemany('''
INSERT OR REPLACE INTO employees (id, name, department, salary, hire_date)
VALUES (?, ?, ?, ?, ?)
''', employees)

# Commit changes
conn.commit()

# Query the database
cursor.execute('SELECT * FROM employees')
result = cursor.fetchall()
print("All employees:")
for row in result:
    print(row)

cursor.execute('SELECT name, salary FROM employees WHERE department = ?', ('Engineering',))
engineers = cursor.fetchall()
print("\nEngineers:")
for row in engineers:
    print(row)

# Using Pandas with SQLite
query = 'SELECT * FROM employees'
df = pd.read_sql_query(query, conn)
print("\nDataFrame from SQL query:")
print(df)

# Update data
cursor.execute('UPDATE employees SET salary = salary * 1.1 WHERE department = ?', ('Engineering',))
conn.commit()

# Verify the update
df_updated = pd.read_sql_query('SELECT * FROM employees', conn)
print("\nUpdated DataFrame:")
print(df_updated)

# Close the connection
conn.close()

Working with Other Databases

For larger applications, you might use PostgreSQL, MySQL, MongoDB, or other database systems. The pattern is similar, but you'll need specific libraries:

# PostgreSQL example (requires psycopg2 package)
# pip install psycopg2-binary

"""
import psycopg2

# Connect to PostgreSQL
conn = psycopg2.connect(
    host="localhost",
    database="mydatabase",
    user="myuser",
    password="mypassword"
)

cursor = conn.cursor()

# Execute SQL commands like with SQLite
cursor.execute("SELECT * FROM my_table")
results = cursor.fetchall()

conn.close()
"""

# MongoDB example (requires pymongo package)
# pip install pymongo

"""
from pymongo import MongoClient

# Connect to MongoDB
client = MongoClient('mongodb://localhost:27017/')
db = client['mydatabase']
collection = db['mycollection']

# Insert a document
collection.insert_one({"name": "Alice", "age": 30, "city": "New York"})

# Query documents
results = collection.find({"city": "New York"})
for doc in results:
    print(doc)
"""

Exercise 8.1: Data Analysis Project

Create a complete data analysis project using the provided dataset of customer information:

  1. Load the customer dataset from a CSV file using Pandas
  2. Clean the data by handling missing values and removing duplicates
  3. Perform exploratory data analysis (calculate statistics, create visualizations)
  4. Create at least three different types of plots to visualize the data
  5. Save the cleaned dataset to a SQLite database
  6. Query the database to extract specific customer segments
  7. Generate a summary report with your findings

Hint: Use functions to organize your code and document each step of your analysis.

Exercise 8.2: Data Transformation Challenge

You're given a messy dataset containing information about products sold in a store:

  1. The dataset contains inconsistent dates, missing prices, and duplicate product entries
  2. Create a data cleaning pipeline to standardize the data
  3. Calculate monthly sales totals and identify the best-selling products
  4. Create a visualization showing sales trends over time
  5. Export the cleaned and transformed data to both CSV and JSON formats

Bonus challenge: Implement a function to detect and flag suspicious sales patterns that might indicate errors in the data.

Chapter Summary

In this chapter, we explored essential techniques for working with data in Python:

  • File operations for reading and writing common data formats (text, CSV, JSON)
  • Data manipulation using Pandas, including filtering, transformation, and aggregation
  • Data visualization with Matplotlib and Seaborn for identifying patterns and trends
  • Database integration to store, query, and update structured data

These skills form the foundation of data analysis in Python and will be essential for many applications, including web development, machine learning, and scientific computing, which we'll explore in subsequent chapters.

Chapter 9: Web Development with Python

Learning Objectives

  • Understand web application architecture and HTTP fundamentals
  • Build web applications using Flask and Django frameworks
  • Develop RESTful APIs to serve data to front-end applications
  • Create dynamic web pages with templates and forms
  • Implement authentication and security best practices

9.1 Web Development Fundamentals

Key Concept: Web Architecture

Web applications involve client-server interactions using HTTP protocols. Understanding these fundamentals is essential for effective web development.

HTTP Basics

HTTP (Hypertext Transfer Protocol) is the foundation of data communication on the web. Key concepts include:

  • Request Methods: GET, POST, PUT, DELETE, etc.
  • Status Codes: 200 (OK), 404 (Not Found), 500 (Server Error), etc.
  • Headers: Metadata about the request or response
  • Body: The actual content being transferred

Python provides several ways to make HTTP requests:

# Using the requests library (install with: pip install requests)
import requests

# GET request
response = requests.get('https://api.github.com/users/python')
print(f"Status code: {response.status_code}")
print(f"Content type: {response.headers['content-type']}")
print(f"Data: {response.json()}")

# POST request
data = {'username': 'pythonuser', 'password': 'securepassword'}
response = requests.post('https://httpbin.org/post', data=data)
print(f"POST response: {response.json()}")

# Custom headers
headers = {'User-Agent': 'MyPythonApp/1.0'}
response = requests.get('https://httpbin.org/headers', headers=headers)
print(f"Headers response: {response.json()}")

Web Scraping

Web scraping involves extracting data from websites. While APIs are preferred when available, scraping is useful for sites without APIs:

# Install required libraries
# pip install beautifulsoup4 requests

import requests
from bs4 import BeautifulSoup

# Fetch a web page
url = 'https://quotes.toscrape.com/'
response = requests.get(url)
html = response.text

# Parse the HTML
soup = BeautifulSoup(html, 'html.parser')

# Extract data
quotes = soup.find_all('span', class_='text')
authors = soup.find_all('small', class_='author')

# Print the results
print("Quotes and Authors:")
for i, (quote, author) in enumerate(zip(quotes, authors), 1):
    print(f"{i}. {quote.text} - {author.text}")

# Extract specific elements
title = soup.title.text
print(f"\nPage title: {title}")

# Find elements by CSS selector
tags = soup.select('.tag')
print("\nTags:")
for tag in tags[:10]:  # Show first 10 tags
    print(f"- {tag.text}")

Important: Web scraping should be done responsibly. Always check a website's robots.txt file and terms of service before scraping. Use appropriate delays between requests to avoid overloading servers, and consider using APIs when available.
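
A minimal sketch of both habits, checking robots.txt with the standard library's urllib.robotparser and pausing between requests (reusing the quotes.toscrape.com URL from above):

import time
import requests
from urllib.robotparser import RobotFileParser

# Check whether fetching the page is allowed by the site's robots.txt
robots = RobotFileParser()
robots.set_url('https://quotes.toscrape.com/robots.txt')
robots.read()

url = 'https://quotes.toscrape.com/'
if robots.can_fetch('*', url):
    response = requests.get(url, headers={'User-Agent': 'MyPythonApp/1.0'})
    print(response.status_code)
    time.sleep(2)  # Be polite: pause between successive requests
else:
    print('Scraping disallowed by robots.txt')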

9.2 Web Development with Flask

Key Concept: Flask Framework

Flask is a lightweight, flexible web framework for Python, perfect for small to medium applications and APIs. Its "micro" design philosophy makes it easy to learn and extend as needed.

To get started with Flask, install it using pip:

pip install flask

Creating a Basic Flask Application

# app.py
from flask import Flask, render_template, request, redirect, url_for, jsonify

# Initialize Flask application
app = Flask(__name__)

# Sample data
tasks = [
    {'id': 1, 'title': 'Learn Flask', 'done': False},
    {'id': 2, 'title': 'Develop web app', 'done': False},
    {'id': 3, 'title': 'Deploy application', 'done': False}
]

# Route for the home page
@app.route('/')
def home():
    return render_template('index.html', tasks=tasks)

# Route that accepts parameters
@app.route('/task/<int:task_id>')
def task_detail(task_id):
    task = next((task for task in tasks if task['id'] == task_id), None)
    if task:
        return render_template('task_detail.html', task=task)
    return "Task not found", 404

# Route that handles form submission (POST request)
@app.route('/add_task', methods=['POST'])
def add_task():
    if request.method == 'POST':
        title = request.form.get('title')
        if title:
            # Generate a new ID (in a real app, this would be handled by a database)
            new_id = max(task['id'] for task in tasks) + 1
            tasks.append({'id': new_id, 'title': title, 'done': False})
        return redirect(url_for('home'))

# API route that returns JSON
@app.route('/api/tasks')
def get_tasks():
    return jsonify(tasks)

# Run the application
if __name__ == '__main__':
    app.run(debug=True)

For the above application to work, you would need to create HTML templates. Here's a simple example for the index.html file:

<!-- templates/index.html -->
<!DOCTYPE html>
<html>
<head>
    <title>Flask Todo App</title>
    <style>
        body { font-family: Arial, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px; }
        .task { margin-bottom: 10px; padding: 10px; border: 1px solid #ddd; border-radius: 5px; }
        form { margin: 20px 0; }
    </style>
</head>
<body>
    <h1>Task List</h1>
    
    <form action="/add_task" method="post">
        <input type="text" name="title" placeholder="New task" required>
        <button type="submit">Add Task</button>
    </form>
    
    <h2>Tasks:</h2>
    {% for task in tasks %}
        <div class="task">
            <strong>{{ task.title }}</strong>
            <p>Status: {{ "Completed" if task.done else "Pending" }}</p>
            <a href="{{ url_for('task_detail', task_id=task.id) }}">View Details</a>
        </div>
    {% endfor %}
</body>
</html>

Building a REST API with Flask

Flask is excellent for building APIs. Here's a more complete example of a RESTful API:

# api_app.py
from flask import Flask, request, jsonify

app = Flask(__name__)

# Sample data (in a real app, this would come from a database)
books = [
    {"id": 1, "title": "The Great Gatsby", "author": "F. Scott Fitzgerald", "year": 1925},
    {"id": 2, "title": "To Kill a Mockingbird", "author": "Harper Lee", "year": 1960},
    {"id": 3, "title": "1984", "author": "George Orwell", "year": 1949}
]

# GET all books
@app.route('/api/books', methods=['GET'])
def get_books():
    return jsonify(books)

# GET a specific book
@app.route('/api/books/<int:book_id>', methods=['GET'])
def get_book(book_id):
    book = next((book for book in books if book['id'] == book_id), None)
    if book:
        return jsonify(book)
    return jsonify({"error": "Book not found"}), 404

# POST a new book
@app.route('/api/books', methods=['POST'])
def add_book():
    if not request.json or 'title' not in request.json:
        return jsonify({"error": "Invalid book data"}), 400
    
    # Create a new book object
    new_id = max(book['id'] for book in books) + 1
    new_book = {
        'id': new_id,
        'title': request.json['title'],
        'author': request.json.get('author', "Unknown"),
        'year': request.json.get('year', 0)
    }
    
    # Add to our collection
    books.append(new_book)
    return jsonify(new_book), 201

# PUT (update) a book
@app.route('/api/books/<int:book_id>', methods=['PUT'])
def update_book(book_id):
    book = next((book for book in books if book['id'] == book_id), None)
    if not book:
        return jsonify({"error": "Book not found"}), 404
    
    if not request.json:
        return jsonify({"error": "Invalid book data"}), 400
    
    # Update book attributes
    book['title'] = request.json.get('title', book['title'])
    book['author'] = request.json.get('author', book['author'])
    book['year'] = request.json.get('year', book['year'])
    
    return jsonify(book)

# DELETE a book
@app.route('/api/books/<int:book_id>', methods=['DELETE'])
def delete_book(book_id):
    book = next((book for book in books if book['id'] == book_id), None)
    if not book:
        return jsonify({"error": "Book not found"}), 404
    
    books.remove(book)
    return jsonify({"result": "Book deleted"}), 200

if __name__ == '__main__':
    app.run(debug=True)
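
With the API running locally (python api_app.py serves on Flask's default http://127.0.0.1:5000), you can exercise the endpoints with the requests library. A short sketch, assuming the server is up; the sample book data here is purely illustrative:

import requests

BASE = 'http://127.0.0.1:5000/api/books'

# List all books
print(requests.get(BASE).json())

# Add a new book (the API expects JSON with at least a 'title')
new_book = {'title': 'Brave New World', 'author': 'Aldous Huxley', 'year': 1932}
response = requests.post(BASE, json=new_book)
print(response.status_code, response.json())  # 201 and the created record

# Update and then delete the book we just created
book_id = response.json()['id']
print(requests.put(f'{BASE}/{book_id}', json={'year': 1932}).json())
print(requests.delete(f'{BASE}/{book_id}').json())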

9.3 Web Development with Django

Key Concept: Django Framework

Django is a high-level Python web framework that follows the "batteries-included" philosophy. It provides a comprehensive set of features for building large-scale web applications efficiently.

To get started with Django, install it using pip:

pip install django

Creating a Django Project

Django follows a project/app structure. Here's how to create a basic Django project:

# Create a new Django project
django-admin startproject mysite

# Navigate to the project directory
cd mysite

# Create a new app within the project
python manage.py startapp blog

# Run migrations to create database tables
python manage.py migrate

# Create a superuser for the admin interface
python manage.py createsuperuser

# Run the development server
python manage.py runserver
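
One step these commands don't show: the new app must be added to INSTALLED_APPS in mysite/settings.py before Django will discover its models and templates, and once you define models (as in the blog example below) you generate their migrations:

# mysite/settings.py (excerpt)
INSTALLED_APPS = [
    # ... Django's default apps ...
    'blog',  # the app created with startapp above
]

# After defining models in blog/models.py:
# python manage.py makemigrations blog
# python manage.py migrate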

Django Project Structure

A typical Django project has the following structure:

mysite/                  # Project root directory
│
├── manage.py           # Command-line utility for administrative tasks
│
├── mysite/             # Project package
│   ├── __init__.py
│   ├── settings.py     # Project settings/configuration
│   ├── urls.py         # Project URL declarations
│   ├── asgi.py         # ASGI configuration for async servers
│   └── wsgi.py         # WSGI configuration for traditional servers
│
└── blog/               # App directory
    ├── __init__.py
    ├── admin.py        # Admin interface configuration
    ├── apps.py         # App configuration
    ├── migrations/     # Database migrations
    ├── models.py       # Data models
    ├── tests.py        # Unit tests
    └── views.py        # Request handlers

Building a Blog Application with Django

Let's create a simple blog application with Django:

# blog/models.py
from django.db import models
from django.utils import timezone
from django.contrib.auth.models import User

class Post(models.Model):
    title = models.CharField(max_length=200)
    content = models.TextField()
    date_posted = models.DateTimeField(default=timezone.now)
    author = models.ForeignKey(User, on_delete=models.CASCADE)
    
    def __str__(self):
        return self.title

# blog/views.py
from django.shortcuts import render, get_object_or_404
from django.http import HttpResponse
from .models import Post

def home(request):
    context = {
        'posts': Post.objects.all().order_by('-date_posted')
    }
    return render(request, 'blog/home.html', context)

def post_detail(request, post_id):
    post = get_object_or_404(Post, id=post_id)
    return render(request, 'blog/post_detail.html', {'post': post})

# blog/urls.py (create this file)
from django.urls import path
from . import views

urlpatterns = [
    path('', views.home, name='blog-home'),
    path('post/<int:post_id>/', views.post_detail, name='post-detail'),
]

# mysite/urls.py (update this file)
from django.contrib import admin
from django.urls import path, include

urlpatterns = [
    path('admin/', admin.site.urls),
    path('blog/', include('blog.urls')),
]

You'll also need to create HTML templates for your blog. Here's a simple example:

<!-- blog/templates/blog/base.html -->
<!DOCTYPE html>
<html>
<head>
    <title>{% block title %}Django Blog{% endblock %}</title>
    <style>
        body { font-family: Arial, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px; }
        .post { margin-bottom: 20px; padding: 15px; border: 1px solid #ddd; border-radius: 5px; }
        .post-meta { color: #666; font-size: 0.9em; }
    </style>
</head>
<body>
    <header>
        <h1>Django Blog</h1>
        <nav>
            <a href="{% url 'blog-home' %}">Home</a>
            <a href="{% url 'admin:index' %}">Admin</a>
        </nav>
    </header>
    
    <main>
        {% block content %}{% endblock %}
    </main>
    
    <footer>
        <p>© {% now "Y" %} Django Blog</p>
    </footer>
</body>
</html>

<!-- blog/templates/blog/home.html -->
{% extends "blog/base.html" %}

{% block content %}
    <h2>Latest Posts</h2>
    
    {% for post in posts %}
        <div class="post">
            <h3><a href="{% url 'post-detail' post.id %}">{{ post.title }}</a></h3>
            <div class="post-meta">
                By {{ post.author }} on {{ post.date_posted|date:"F d, Y" }}
            </div>
            <p>{{ post.content|truncatewords:30 }}</p>
        </div>
    {% empty %}
        <p>No posts available.</p>
    {% endfor %}
{% endblock %}

<!-- blog/templates/blog/post_detail.html -->
{% extends "blog/base.html" %}

{% block title %}{{ post.title }} | Django Blog{% endblock %}

{% block content %}
    <div class="post">
        <h2>{{ post.title }}</h2>
        <div class="post-meta">
            By {{ post.author }} on {{ post.date_posted|date:"F d, Y" }}
        </div>
        <div class="post-content">
            {{ post.content }}
        </div>
    </div>
    <a href="{% url 'blog-home' %}">← Back to all posts</a>
{% endblock %}

Django Admin Interface

One of Django's most powerful features is its automatic admin interface. Register your models in the admin.py file:

# blog/admin.py
from django.contrib import admin
from .models import Post

admin.site.register(Post)
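
For more control over how posts appear in the admin, you can register the model through a ModelAdmin subclass instead of the plain register call above. A brief sketch:

# blog/admin.py (customized alternative)
from django.contrib import admin
from .models import Post

@admin.register(Post)  # use instead of admin.site.register(Post)
class PostAdmin(admin.ModelAdmin):
    list_display = ('title', 'author', 'date_posted')  # columns shown in the list view
    search_fields = ('title', 'content')               # adds a search box
    list_filter = ('date_posted', 'author')            # sidebar filters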

9.4 Web Security Fundamentals

Key Concept: Secure Web Development

Security is critical in web development. Python frameworks include features to protect against common vulnerabilities.

Common Web Vulnerabilities

  • Cross-Site Scripting (XSS): Injecting malicious scripts into web pages
  • SQL Injection: Inserting malicious SQL code into database queries (see the sketch after this list)
  • Cross-Site Request Forgery (CSRF): Tricking users into executing unwanted actions
  • Authentication Weaknesses: Insecure password handling and session management
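
The SQL injection risk stems from building queries with string formatting; parameterized queries, as used with sqlite3 in Chapter 8, treat user input as data rather than as SQL. A small sketch contrasting the two patterns (the malicious input is hypothetical):

import sqlite3

conn = sqlite3.connect('example.db')   # the database created in Chapter 8
cursor = conn.cursor()
user_input = "Engineering' OR '1'='1"  # hypothetical malicious input

# Unsafe: the input is pasted directly into the SQL string
# cursor.execute(f"SELECT * FROM employees WHERE department = '{user_input}'")

# Safe: the driver binds the value as data, never as SQL
cursor.execute('SELECT * FROM employees WHERE department = ?', (user_input,))
print(cursor.fetchall())  # empty list - no department matches the literal string
conn.close()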

Security Best Practices

# Flask security example
from flask import Flask, request, render_template, redirect, url_for, session
import secrets
import hashlib
import re

app = Flask(__name__)
# In production, load a fixed secret key from configuration; regenerating it on
# every restart invalidates existing sessions.
app.secret_key = secrets.token_hex(16)  # Generate a secure secret key

# Simulated database of users
# Note: unsalted SHA-256 is used here only to keep the example short; real
# applications should use a dedicated password hashing function such as
# werkzeug.security.generate_password_hash or bcrypt.
users = {
    'admin': {
        'password_hash': hashlib.sha256('securepassword123'.encode()).hexdigest(),
        'role': 'admin'
    }
}

# CSRF protection
@app.before_request
def csrf_protect():
    if request.method == "POST":
        token = session.pop('_csrf_token', None)
        if not token or token != request.form.get('_csrf_token'):
            return "CSRF token validation failed", 400

def generate_csrf_token():
    if '_csrf_token' not in session:
        session['_csrf_token'] = secrets.token_hex(16)
    return session['_csrf_token']

# Input validation
def validate_username(username):
    return re.match(r'^[a-zA-Z0-9_]{3,20}$', username) is not None

def validate_password(password):
    # Check for minimum length, uppercase, lowercase, and digit
    if len(password) < 8:
        return False
    if not re.search(r'[A-Z]', password):
        return False
    if not re.search(r'[a-z]', password):
        return False
    if not re.search(r'\d', password):
        return False
    return True

@app.route('/login', methods=['GET', 'POST'])
def login():
    error = None
    
    if request.method == 'POST':
        username = request.form.get('username', '')
        password = request.form.get('password', '')
        
        # Validate username format
        if not validate_username(username):
            error = "Invalid username format"
        else:
            # Check if user exists and password is correct
            user = users.get(username)
            if user and user['password_hash'] == hashlib.sha256(password.encode()).hexdigest():
                session['username'] = username
                session['role'] = user['role']
                return redirect(url_for('dashboard'))
            else:
                error = "Invalid username or password"
    
    # For GET requests or if login failed
    csrf_token = generate_csrf_token()
    return render_template('login.html', csrf_token=csrf_token, error=error)

@app.route('/dashboard')
def dashboard():
    # Check if user is logged in
    if 'username' not in session:
        return redirect(url_for('login'))
    
    return render_template('dashboard.html', username=session['username'], role=session['role'])

@app.route('/logout')
def logout():
    session.clear()
    return redirect(url_for('login'))

if __name__ == '__main__':
    app.run(debug=True)

Exercise 9.1: Build a Personal Portfolio Website

Create a personal portfolio website using Flask that includes the following features:

  1. A home page with an introduction and summary of your skills
  2. A projects page displaying your work with descriptions and images
  3. A contact form that sends emails when submitted
  4. A blog section where you can add new posts through an admin interface
  5. Responsive design that works well on mobile devices

Bonus: Add authentication to protect the admin interface for adding blog posts.

Exercise 9.2: RESTful API Development

Build a RESTful API for a movie database with Flask or Django REST Framework:

  1. Design endpoints for managing movies, directors, and genres
  2. Implement CRUD (Create, Read, Update, Delete) operations for each resource
  3. Add filtering capabilities (e.g., get movies by director, genre, or release year)
  4. Implement proper error handling and status codes
  5. Add authentication and authorization to protect certain endpoints
  6. Document your API using Swagger/OpenAPI

Bonus: Implement rate limiting to prevent API abuse.

Chapter Summary

In this chapter, we explored Python's capabilities for web development:

  • Web fundamentals including HTTP requests and responses
  • Web scraping techniques for extracting data from websites
  • Building web applications with Flask, a lightweight framework
  • Developing larger applications with Django's comprehensive features
  • RESTful API design and implementation
  • Web security best practices to protect against common vulnerabilities

These skills enable you to create everything from simple websites to complex web applications and APIs. In the next chapter, we'll explore how Python powers artificial intelligence and machine learning applications.

Chapter 10: Introduction to AI and Machine Learning

Learning Objectives

  • Understand fundamental concepts in AI and machine learning
  • Set up a Python environment for machine learning development
  • Explore different types of machine learning algorithms
  • Implement basic machine learning models using scikit-learn
  • Evaluate and improve model performance

10.1 AI and Machine Learning Fundamentals

Key Concept: Machine Learning Paradigms

Machine learning is the study of computer algorithms that improve automatically through experience and data. It's a subset of artificial intelligence focused on building systems that learn from data.

Types of Machine Learning

  • Supervised Learning: The algorithm learns from labeled training data, making predictions or decisions based on that learning
  • Unsupervised Learning: The algorithm finds patterns or structures in unlabeled data
  • Reinforcement Learning: The algorithm learns by interacting with an environment, receiving rewards or penalties

Setting Up Your Environment

To get started with machine learning in Python, you'll need to install several libraries:

# Install essential libraries
pip install numpy pandas matplotlib scikit-learn

# For more advanced machine learning
pip install tensorflow keras

The Machine Learning Workflow

A typical machine learning project follows these steps:

  1. Define the problem and gather data
  2. Explore and preprocess the data
  3. Select and train a model
  4. Evaluate the model
  5. Improve the model and tune parameters
  6. Deploy the model

10.2 Supervised Learning

Key Concept: Classification and Regression

Supervised learning involves training a model on labeled data to make predictions. The two main types are classification (predicting categories) and regression (predicting continuous values).

Classification Example

Let's implement a basic classification model to predict iris flower species:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Load the iris dataset
iris = load_iris()
X = iris.data
y = iris.target
feature_names = iris.feature_names
target_names = iris.target_names

# Create a DataFrame for better visualization
iris_df = pd.DataFrame(X, columns=feature_names)
iris_df['species'] = [target_names[i] for i in y]

# Print dataset information
print("Dataset shape:", X.shape)
print("Feature names:", feature_names)
print("Target names:", target_names)
print("\nFirst 5 rows:")
print(iris_df.head())

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Standardize the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train a K-Nearest Neighbors classifier
k = 3
knn = KNeighborsClassifier(n_neighbors=k)
knn.fit(X_train_scaled, y_train)

# Make predictions
y_pred = knn.predict(X_test_scaled)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"\nAccuracy: {accuracy:.4f}")

print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=target_names))

print("\nConfusion Matrix:")
conf_matrix = confusion_matrix(y_test, y_pred)
print(conf_matrix)

# Visualize results - Sepal features
plt.figure(figsize=(10, 6))
colors = ['blue', 'green', 'red']
markers = ['o', 's', '^']

for i, species in enumerate(target_names):
    # Plot training data
    species_data = iris_df[iris_df['species'] == species]
    plt.scatter(
        species_data['sepal length (cm)'], 
        species_data['sepal width (cm)'],
        color=colors[i], 
        marker=markers[i], 
        label=f'{species} (actual)',
        alpha=0.6
    )

plt.xlabel('Sepal Length (cm)')
plt.ylabel('Sepal Width (cm)')
plt.title('Iris Species Classification - Sepal Features')
plt.legend()
plt.grid(True)
# plt.savefig('iris_classification.png')
plt.close()

# Visualize the confusion matrix
plt.figure(figsize=(8, 6))
plt.imshow(conf_matrix, interpolation='nearest', cmap=plt.cm.Blues)
plt.title('Confusion Matrix')
plt.colorbar()

tick_marks = np.arange(len(target_names))
plt.xticks(tick_marks, target_names, rotation=45)
plt.yticks(tick_marks, target_names)

plt.xlabel('Predicted Label')
plt.ylabel('True Label')
# plt.savefig('confusion_matrix.png')
plt.close()

Regression Example

Now let's implement a regression model to predict house prices:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Load the California housing dataset
housing = fetch_california_housing()
X = housing.data
y = housing.target
feature_names = housing.feature_names

# Create a DataFrame for better visualization
housing_df = pd.DataFrame(X, columns=feature_names)
housing_df['price'] = y

# Print dataset information
print("Dataset shape:", X.shape)
print("Feature names:", feature_names)
print("\nFirst 5 rows:")
print(housing_df.head())
print("\nData statistics:")
print(housing_df.describe())

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Standardize the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train a linear regression model
model = LinearRegression()
model.fit(X_train_scaled, y_train)

# Make predictions
y_pred = model.predict(X_test_scaled)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)

print(f"\nModel Performance:")
print(f"Mean Squared Error: {mse:.4f}")
print(f"Root Mean Squared Error: {rmse:.4f}")
print(f"R² Score: {r2:.4f}")

# Feature importance
coefficients = pd.DataFrame(model.coef_, index=feature_names, columns=['Coefficient'])
print("\nFeature Coefficients:")
print(coefficients.sort_values(by='Coefficient', ascending=False))

# Visualize predicted vs actual values
plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_pred, alpha=0.5)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--')
plt.xlabel('Actual Prices')
plt.ylabel('Predicted Prices')
plt.title('Linear Regression: Predicted vs Actual House Prices')
plt.grid(True)
# plt.savefig('housing_regression.png')
plt.close()

# Visualize residuals
residuals = y_test - y_pred
plt.figure(figsize=(10, 6))
plt.scatter(y_pred, residuals, alpha=0.5)
plt.hlines(y=0, xmin=y_pred.min(), xmax=y_pred.max(), colors='r', linestyles='--')
plt.xlabel('Predicted Prices')
plt.ylabel('Residuals')
plt.title('Residual Plot')
plt.grid(True)
# plt.savefig('housing_residuals.png')
plt.close()

10.3 Unsupervised Learning

Key Concept: Clustering and Dimensionality Reduction

Unsupervised learning finds patterns in unlabeled data. Common approaches include clustering (grouping similar data points) and dimensionality reduction (simplifying data while preserving key information).

Clustering Example

Let's implement K-means clustering to group data points:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Generate synthetic data with 4 clusters
n_samples = 500
n_features = 2
n_clusters = 4
random_state = 42

X, y_true = make_blobs(
    n_samples=n_samples,
    n_features=n_features,
    centers=n_clusters,
    random_state=random_state
)

# Visualize the original data
plt.figure(figsize=(10, 6))
plt.scatter(X[:, 0], X[:, 1], c=y_true, cmap='viridis', alpha=0.7, edgecolors='k', s=40)
plt.title('Original Data with True Clusters')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.grid(True)
# plt.savefig('original_clusters.png')
plt.close()

# Apply K-means clustering
kmeans = KMeans(n_clusters=n_clusters, random_state=random_state)
y_pred = kmeans.fit_predict(X)
centers = kmeans.cluster_centers_

# Evaluate the clustering
silhouette_avg = silhouette_score(X, y_pred)
print(f"Silhouette Score: {silhouette_avg:.4f}")

# Visualize the K-means clustering results
plt.figure(figsize=(10, 6))
plt.scatter(X[:, 0], X[:, 1], c=y_pred, cmap='viridis', alpha=0.7, edgecolors='k', s=40)
plt.scatter(centers[:, 0], centers[:, 1], c='red', marker='X', s=200, label='Centroids')
plt.title('K-means Clustering Results')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.legend()
plt.grid(True)
# plt.savefig('kmeans_clusters.png')
plt.close()

# Finding the optimal number of clusters using the Elbow Method
inertia = []
silhouette_scores = []
k_range = range(2, 11)

for k in k_range:
    kmeans = KMeans(n_clusters=k, random_state=random_state)
    kmeans.fit(X)
    inertia.append(kmeans.inertia_)
    
    # Silhouette score (only computed for k > 1)
    silhouette_scores.append(silhouette_score(X, kmeans.labels_))

# Plot the Elbow Method results
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
plt.plot(k_range, inertia, 'o-', linewidth=2, markersize=8)
plt.title('Elbow Method')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Inertia (Within-Cluster Sum of Squares)')
plt.grid(True)

plt.subplot(1, 2, 2)
plt.plot(k_range, silhouette_scores, 'o-', linewidth=2, markersize=8)
plt.title('Silhouette Method')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Silhouette Score')
plt.grid(True)

plt.tight_layout()
# plt.savefig('optimal_clusters.png')
plt.close()

Dimensionality Reduction Example

Now let's use Principal Component Analysis (PCA) to reduce the dimensionality of data:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Load the digits dataset
digits = load_digits()
X = digits.data
y = digits.target

# Print dataset information
print("Dataset shape:", X.shape)
print("Number of classes:", len(np.unique(y)))

# Standardize the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Apply PCA
pca = PCA()
X_pca = pca.fit_transform(X_scaled)

# Calculate explained variance ratio
explained_variance = pca.explained_variance_ratio_
cumulative_variance = np.cumsum(explained_variance)

# Plot the explained variance
plt.figure(figsize=(10, 6))
plt.bar(range(1, len(explained_variance) + 1), explained_variance, alpha=0.6, label='Individual explained variance')
plt.step(range(1, len(cumulative_variance) + 1), cumulative_variance, where='mid', label='Cumulative explained variance')
plt.axhline(y=0.9, linestyle='--', color='r', label='90% explained variance threshold')
plt.title('Explained Variance Ratio by Principal Components')
plt.xlabel('Principal Components')
plt.ylabel('Explained Variance Ratio')
plt.legend()
plt.grid(True)
# plt.savefig('pca_variance.png')
plt.close()

# Find number of components for 90% variance
n_components_90 = np.argmax(cumulative_variance >= 0.9) + 1
print(f"Number of components for 90% variance: {n_components_90}")

# Apply PCA with reduced dimensions
pca_reduced = PCA(n_components=n_components_90)
X_reduced = pca_reduced.fit_transform(X_scaled)
print(f"Reduced data shape: {X_reduced.shape}")

# Visualize first two principal components with class labels
plt.figure(figsize=(10, 8))
colors = plt.cm.rainbow(np.linspace(0, 1, len(np.unique(y))))

for i, color in enumerate(colors):
    indices = y == i
    plt.scatter(X_pca[indices, 0], X_pca[indices, 1], color=color, alpha=0.7, label=f'Digit {i}')

plt.title('PCA: First Two Principal Components')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend()
plt.grid(True)
# plt.savefig('pca_digits.png')
plt.close()

# Visualize some original digits and their reconstructions
pca = PCA(n_components=n_components_90)
X_pca_reduced = pca.fit_transform(X_scaled)
X_reconstructed = pca.inverse_transform(X_pca_reduced)
X_reconstructed = scaler.inverse_transform(X_reconstructed)

# Display original vs reconstructed digits
fig, axes = plt.subplots(4, 8, figsize=(16, 8))
indices = np.random.choice(len(X), 16, replace=False)

for i, idx in enumerate(indices):
    # Original digit (rows 0 and 2)
    ax = axes[(i // 8) * 2, i % 8]
    ax.imshow(digits.images[idx], cmap='gray')
    ax.set_title(f'Original: {y[idx]}')
    ax.axis('off')

    # Reconstructed digit (rows 1 and 3)
    ax = axes[(i // 8) * 2 + 1, i % 8]
    ax.imshow(X_reconstructed[idx].reshape(8, 8), cmap='gray')
    ax.set_title('Reconstructed')
    ax.axis('off')

plt.tight_layout()
# plt.savefig('digit_reconstruction.png')
plt.close()

10.4 Model Evaluation and Improvement

Key Concept: Model Validation and Hyperparameter Tuning

Evaluating and improving machine learning models is crucial for creating reliable AI systems. This involves selecting appropriate metrics, validation techniques, and optimization methods.

Cross-Validation

Cross-validation helps assess model performance more reliably than a single train-test split:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score, KFold, learning_curve
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_curve, auc

# Load the breast cancer dataset
cancer = load_breast_cancer()
X = cancer.data
y = cancer.target

# Standardize the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Create a model
model = RandomForestClassifier(n_estimators=100, random_state=42)

# Perform k-fold cross-validation
k_folds = 5
kf = KFold(n_splits=k_folds, shuffle=True, random_state=42)
cv_scores = cross_val_score(model, X_scaled, y, cv=kf, scoring='accuracy')

print(f"Cross-validation scores: {cv_scores}")
print(f"Mean CV accuracy: {cv_scores.mean():.4f}")
print(f"Standard deviation: {cv_scores.std():.4f}")

# Calculate multiple metrics using cross-validation
def calculate_metrics(model, X, y, cv):
    metrics = {
        'accuracy': [],
        'precision': [],
        'recall': [],
        'f1': []
    }
    
    for train_idx, test_idx in cv.split(X):
        X_train, X_test = X[train_idx], X[test_idx]
        y_train, y_test = y[train_idx], y[test_idx]
        
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        
        metrics['accuracy'].append(accuracy_score(y_test, y_pred))
        metrics['precision'].append(precision_score(y_test, y_pred))
        metrics['recall'].append(recall_score(y_test, y_pred))
        metrics['f1'].append(f1_score(y_test, y_pred))
    
    return metrics

metrics = calculate_metrics(model, X_scaled, y, kf)

print("\nDetailed Cross-Validation Metrics:")
for metric, values in metrics.items():
    print(f"{metric.capitalize()}: {np.mean(values):.4f} ± {np.std(values):.4f}")

# Plot learning curves to diagnose overfitting/underfitting
train_sizes, train_scores, test_scores = learning_curve(
    model, X_scaled, y, cv=5, n_jobs=-1, 
    train_sizes=np.linspace(0.1, 1.0, 10),
    scoring='accuracy'
)

train_mean = np.mean(train_scores, axis=1)
train_std = np.std(train_scores, axis=1)
test_mean = np.mean(test_scores, axis=1)
test_std = np.std(test_scores, axis=1)

plt.figure(figsize=(10, 6))
plt.plot(train_sizes, train_mean, color='blue', marker='o', label='Training accuracy')
plt.fill_between(train_sizes, train_mean - train_std, train_mean + train_std, alpha=0.15, color='blue')
plt.plot(train_sizes, test_mean, color='green', marker='s', label='Validation accuracy')
plt.fill_between(train_sizes, test_mean - test_std, test_mean + test_std, alpha=0.15, color='green')
plt.title('Learning Curve')
plt.xlabel('Training Set Size')
plt.ylabel('Accuracy')
plt.grid(True)
plt.legend(loc='lower right')
# plt.savefig('learning_curve.png')
plt.close()

Hyperparameter Tuning

Optimize model performance by finding the best hyperparameter values:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix

# Load the wine dataset
wine = load_wine()
X = wine.data
y = wine.target

# Split the data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Standardize the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Define the parameter grid to search
param_grid = {
    'C': [0.1, 1, 10, 100],
    'gamma': [0.001, 0.01, 0.1, 1],
    'kernel': ['rbf', 'linear']
}

# Create a base model
svm = SVC(probability=True)

# Perform grid search with cross-validation
grid_search = GridSearchCV(
    estimator=svm,
    param_grid=param_grid,
    cv=5,
    scoring='accuracy',
    verbose=0,
    n_jobs=-1
)

# Fit the grid search to the data
grid_search.fit(X_train_scaled, y_train)

# Get the best parameters and score
print("Best parameters:", grid_search.best_params_)
print("Best cross-validation score: {:.4f}".format(grid_search.best_score_))

# Get the best model
best_model = grid_search.best_estimator_

# Evaluate on the test set
y_pred = best_model.predict(X_test_scaled)
print("\nTest Set Evaluation:")
print(classification_report(y_test, y_pred, target_names=wine.target_names))

# Print confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(conf_matrix)

# Visualize the results of grid search
results = pd.DataFrame(grid_search.cv_results_)
results = results.sort_values(by='rank_test_score')

plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
for kernel in ['rbf', 'linear']:
    kernel_results = results[results['param_kernel'] == kernel]
    plt.plot(kernel_results['param_C'], kernel_results['mean_test_score'], 
             marker='o', label=f'kernel={kernel}')

plt.xlabel('C parameter')
plt.ylabel('Mean test score')
plt.title('Grid Search Results: C parameter')
plt.legend()
plt.grid(True)

plt.subplot(1, 2, 2)
rbf_results = results[results['param_kernel'] == 'rbf']
for C in [0.1, 1, 10, 100]:
    C_results = rbf_results[rbf_results['param_C'] == C]
    plt.plot(C_results['param_gamma'], C_results['mean_test_score'], 
             marker='o', label=f'C={C}')

plt.xlabel('gamma parameter')
plt.ylabel('Mean test score')
plt.title('Grid Search Results: gamma parameter (RBF kernel)')
plt.legend()
plt.grid(True)

plt.tight_layout()
# plt.savefig('grid_search_results.png')
plt.close()

Exercise 10.1: Build a Predictive Model

Develop a machine learning model to predict student performance based on various factors:

  1. Load and explore the student performance dataset
  2. Clean and preprocess the data, handling missing values and categorical variables
  3. Split the data into training and testing sets
  4. Train at least three different models (e.g., linear regression, random forest, gradient boosting)
  5. Evaluate each model using appropriate metrics
  6. Tune the hyperparameters of the best-performing model
  7. Create visualizations to interpret the model's predictions
  8. Document your process and findings

Dataset: You can use the Student Performance dataset from the UCI Machine Learning Repository or a similar educational dataset.

Exercise 10.2: Customer Segmentation

Apply unsupervised learning techniques to segment customers based on their purchasing behavior:

  1. Load and explore a retail customer dataset
  2. Preprocess the data, handling outliers and scaling features
  3. Apply PCA to reduce dimensionality if necessary
  4. Use K-means clustering to segment customers
  5. Determine the optimal number of clusters using the elbow method and silhouette score
  6. Analyze and interpret each customer segment
  7. Create visualizations to represent the clusters
  8. Propose marketing strategies for each customer segment

Bonus challenge: Try hierarchical clustering as an alternative to K-means and compare the results.

Chapter Summary

In this chapter, we explored the fundamentals of artificial intelligence and machine learning with Python:

  • Key concepts in machine learning, including supervised and unsupervised learning
  • Setting up a Python environment for machine learning development
  • Implementing classification and regression models using scikit-learn
  • Exploring clustering and dimensionality reduction for unsupervised learning
  • Evaluating models using cross-validation and various performance metrics
  • Optimizing models through hyperparameter tuning

These foundations provide the groundwork for building more complex AI applications, which we'll explore in the next chapter. By understanding these core concepts, you're well-equipped to start applying machine learning to solve real-world problems.

Chapter 11: Building AI Applications

Learning Objectives

  • Develop practical AI applications using Python
  • Implement natural language processing (NLP) techniques
  • Create computer vision applications
  • Build recommendation systems
  • Deploy machine learning models as web services

11.1 Natural Language Processing

Key Concept: Text Processing and Analysis

Natural Language Processing (NLP) is a field of AI focused on enabling computers to understand, interpret, and generate human language. Python offers powerful libraries for NLP tasks.

To get started with NLP in Python, install the necessary libraries:

pip install nltk spacy textblob gensim

Text Preprocessing

Before analyzing text data, preprocessing is essential to clean and normalize the text:

import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
import string
import re

# Download required NLTK resources
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

def preprocess_text(text):
    """
    Preprocess text data by performing multiple cleaning steps
    """
    # Convert to lowercase
    text = text.lower()
    
    # Remove numbers and punctuation
    text = re.sub(r'\d+', '', text)
    text = text.translate(str.maketrans('', '', string.punctuation))
    
    # Tokenize
    tokens = word_tokenize(text)
    
    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    tokens = [word for word in tokens if word not in stop_words]
    
    # Stemming
    stemmer = PorterStemmer()
    stemmed_tokens = [stemmer.stem(word) for word in tokens]
    
    # Lemmatization
    lemmatizer = WordNetLemmatizer()
    lemmatized_tokens = [lemmatizer.lemmatize(word) for word in tokens]
    
    return {
        'original_text': text,
        'tokens': tokens,
        'stemmed_tokens': stemmed_tokens,
        'lemmatized_tokens': lemmatized_tokens
    }

# Example usage
sample_text = """Natural Language Processing (NLP) is a field of artificial intelligence that focuses on the interaction
between computers and humans using natural language. The ultimate goal of NLP is to enable computers to understand,
interpret, and generate human language in a way that is both meaningful and useful."""

processed = preprocess_text(sample_text)

print("Original Text:")
print(sample_text)
print("\nProcessed Text:")
print("Tokens:", processed['tokens'][:10], "...")
print("Stemmed:", processed['stemmed_tokens'][:10], "...")
print("Lemmatized:", processed['lemmatized_tokens'][:10], "...")

# Sentence tokenization
sentences = sent_tokenize(sample_text)
print("\nSentences:")
for i, sentence in enumerate(sentences):
    print(f"{i+1}. {sentence}")

Sentiment Analysis

Sentiment analysis determines the emotional tone behind text, useful for analyzing customer feedback, social media, and more:

from textblob import TextBlob
import matplotlib.pyplot as plt
import numpy as np

def analyze_sentiment(text):
    """
    Analyze the sentiment of text using TextBlob
    """
    blob = TextBlob(text)
    sentiment = blob.sentiment
    
    # Polarity ranges from -1 (negative) to 1 (positive)
    # Subjectivity ranges from 0 (objective) to 1 (subjective)
    return {
        'text': text,
        'polarity': sentiment.polarity,
        'subjectivity': sentiment.subjectivity,
        'sentiment': 'positive' if sentiment.polarity > 0 else 'negative' if sentiment.polarity < 0 else 'neutral'
    }

# Example texts
texts = [
    "I absolutely love this product! It's amazing and works perfectly.",
    "The service was okay, but could be better.",
    "This is the worst experience I've ever had. Terrible customer service.",
    "The movie was neither particularly good nor bad.",
    "The staff was friendly and helpful, but the food was disappointing."
]

# Analyze sentiments
sentiments = [analyze_sentiment(text) for text in texts]

# Display results
for i, result in enumerate(sentiments):
    print(f"\nText {i+1}: {result['text']}")
    print(f"Polarity: {result['polarity']:.2f}")
    print(f"Subjectivity: {result['subjectivity']:.2f}")
    print(f"Sentiment: {result['sentiment']}")

# Visualize the results
plt.figure(figsize=(10, 6))

# Extract polarities and subjectivities
polarities = [s['polarity'] for s in sentiments]
subjectivities = [s['subjectivity'] for s in sentiments]
labels = [f"Text {i+1}" for i in range(len(texts))]

# Create scatter plot
plt.scatter(polarities, subjectivities, c=np.array(polarities), cmap='RdYlGn', s=100, alpha=0.7)

# Add labels and details
for i, (x, y) in enumerate(zip(polarities, subjectivities)):
    plt.annotate(labels[i], (x, y), xytext=(5, 5), textcoords='offset points')

plt.axhline(y=0.5, color='gray', linestyle='--', alpha=0.3)
plt.axvline(x=0, color='gray', linestyle='--', alpha=0.3)

plt.title('Sentiment Analysis Results')
plt.xlabel('Polarity (Negative → Positive)')
plt.ylabel('Subjectivity (Objective → Subjective)')
plt.xlim(-1.1, 1.1)
plt.ylim(-0.1, 1.1)
plt.grid(True, alpha=0.3)
plt.colorbar(label='Sentiment Polarity')
# plt.savefig('sentiment_analysis.png')
plt.close()
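
TextBlob is only one option; NLTK also ships a rule-based sentiment analyzer (VADER) tuned for short, informal text. A minimal sketch, assuming the vader_lexicon resource downloads successfully:

import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

# Download the lexicon used by the VADER analyzer
nltk.download('vader_lexicon')

sia = SentimentIntensityAnalyzer()

# polarity_scores returns 'neg', 'neu', 'pos', and a combined 'compound' score
sample = "I absolutely love this product! It's amazing and works perfectly."
print(sia.polarity_scores(sample))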

Topic Modeling

Topic modeling discovers abstract topics in a collection of documents, which is useful for content organization and recommendation systems:

import gensim
from gensim import corpora
from gensim.models import LdaModel
import pyLDAvis
import pyLDAvis.gensim_models

# Sample documents
documents = [
    "Machine learning is a method of data analysis that automates analytical model building.",
    "Python is a programming language that lets you work quickly and integrate systems more effectively.",
    "Artificial intelligence is intelligence demonstrated by machines.",
    "Deep learning is part of a broader family of machine learning methods based on artificial neural networks.",
    "Natural language processing is a subfield of linguistics, computer science, and artificial intelligence.",
    "Computer vision is an interdisciplinary scientific field that deals with how computers can gain understanding from digital images or videos.",
    "Data science is an inter-disciplinary field that uses scientific methods to extract knowledge from data.",
    "Python libraries like TensorFlow and PyTorch are commonly used for machine learning and AI development.",
    "Big data refers to data sets that are too large or complex to be dealt with by traditional data-processing software.",
    "Cloud computing is the on-demand availability of computer system resources."
]

# Preprocess the documents
processed_docs = []
for doc in documents:
    # Tokenize, remove punctuation and stopwords
    tokens = preprocess_text(doc)['tokens']
    processed_docs.append(tokens)

# Create a dictionary
dictionary = corpora.Dictionary(processed_docs)

# Create a document-term matrix
corpus = [dictionary.doc2bow(doc) for doc in processed_docs]

# Train the LDA model
num_topics = 3
lda_model = LdaModel(
    corpus=corpus,
    id2word=dictionary,
    num_topics=num_topics,
    passes=10,
    alpha='auto',
    random_state=42
)

# Print the topics
print("LDA Topics:")
for idx, topic in lda_model.print_topics(-1):
    print(f"Topic {idx}: {topic}")

# Show the topic distribution for each document
print("\nTopic Distribution by Document:")
for i, doc in enumerate(corpus):
    print(f"\nDocument {i+1}: \"{documents[i][:50]}...\"")
    topic_distribution = lda_model.get_document_topics(doc)
    for topic_id, prob in sorted(topic_distribution, key=lambda x: x[1], reverse=True):
        print(f"  Topic {topic_id}: {prob:.4f}")

# Function to format topics in a readable way
def format_topics_sentences(ldamodel, corpus, texts):
    sent_topics_df = []
    for i, row in enumerate(ldamodel[corpus]):
        row = sorted(row, key=lambda x: x[1], reverse=True)
        for j, (topic_num, prop_topic) in enumerate(row):
            if j == 0:  # => dominant topic
                wp = ldamodel.show_topic(topic_num)
                topic_keywords = ", ".join([word for word, prop in wp])
                sent_topics_df.append([i, int(topic_num), round(prop_topic, 4), topic_keywords, texts[i]])
            else:
                break
    return sent_topics_df

# Format the results
topic_sentences = format_topics_sentences(lda_model, corpus, documents)
print("\nDominant Topic for Each Document:")
for i, topic_num, prop_topic, keywords, text in topic_sentences:
    print(f"Document {i+1}: Topic {topic_num} (Probability: {prop_topic:.4f})")
    print(f"  Keywords: {keywords}")
    print(f"  Text: {text[:70]}...\n")

11.2 Computer Vision

Key Concept: Image Processing and Analysis

Computer vision enables computers to interpret and understand visual information from the world. Python provides powerful libraries for image processing and deep learning-based vision tasks.

To get started with computer vision in Python, install the necessary libraries:

pip install opencv-python pillow scikit-image tensorflow

Basic Image Processing

Let's explore basic image processing operations using OpenCV and Pillow (PIL):

import cv2
import numpy as np
import matplotlib.pyplot as plt
from PIL import Image, ImageFilter, ImageEnhance

# Load an image (replace with your own image path)
image_path = "sample_image.jpg"  # You can use any image for testing

try:
    # OpenCV reads images in BGR format
    img_cv = cv2.imread(image_path)
    
    if img_cv is None:
        raise FileNotFoundError(f"Could not open or find the image: {image_path}")
    
    # Convert BGR to RGB for display with matplotlib
    img_rgb = cv2.cvtColor(img_cv, cv2.COLOR_BGR2RGB)
    
    # Basic image properties
    height, width, channels = img_rgb.shape
    print(f"Image dimensions: {width}x{height}, {channels} channels")
    
    # Create a figure with multiple subplots
    plt.figure(figsize=(15, 10))
    
    # Display original image
    plt.subplot(2, 3, 1)
    plt.imshow(img_rgb)
    plt.title('Original Image')
    plt.axis('off')
    
    # Grayscale conversion
    img_gray = cv2.cvtColor(img_cv, cv2.COLOR_BGR2GRAY)
    plt.subplot(2, 3, 2)
    plt.imshow(img_gray, cmap='gray')
    plt.title('Grayscale')
    plt.axis('off')
    
    # Image blurring
    img_blur = cv2.GaussianBlur(img_rgb, (15, 15), 0)
    plt.subplot(2, 3, 3)
    plt.imshow(img_blur)
    plt.title('Gaussian Blur')
    plt.axis('off')
    
    # Edge detection
    edges = cv2.Canny(img_gray, 100, 200)
    plt.subplot(2, 3, 4)
    plt.imshow(edges, cmap='gray')
    plt.title('Edge Detection')
    plt.axis('off')
    
    # Thresholding
    _, thresh = cv2.threshold(img_gray, 127, 255, cv2.THRESH_BINARY)
    plt.subplot(2, 3, 5)
    plt.imshow(thresh, cmap='gray')
    plt.title('Thresholding')
    plt.axis('off')
    
    # Image resizing
    img_resized = cv2.resize(img_rgb, (width//2, height//2))
    plt.subplot(2, 3, 6)
    plt.imshow(img_resized)
    plt.title('Resized (50%)')
    plt.axis('off')
    
    plt.tight_layout()
    # plt.savefig('image_processing.png')
    plt.close()
    
    # Demonstrate PIL image processing
    pil_img = Image.open(image_path)
    
    plt.figure(figsize=(15, 10))
    
    # Original
    plt.subplot(2, 3, 1)
    plt.imshow(np.array(pil_img))
    plt.title('Original (PIL)')
    plt.axis('off')
    
    # Apply filters
    # Blur
    blur_img = pil_img.filter(ImageFilter.BLUR)
    plt.subplot(2, 3, 2)
    plt.imshow(np.array(blur_img))
    plt.title('Blur Filter')
    plt.axis('off')
    
    # Find edges
    edge_img = pil_img.filter(ImageFilter.FIND_EDGES)
    plt.subplot(2, 3, 3)
    plt.imshow(np.array(edge_img))
    plt.title('Edge Filter')
    plt.axis('off')
    
    # Enhance contrast
    enhancer = ImageEnhance.Contrast(pil_img)
    enhanced_img = enhancer.enhance(1.5)  # Increase contrast by 50%
    plt.subplot(2, 3, 4)
    plt.imshow(np.array(enhanced_img))
    plt.title('Enhanced Contrast')
    plt.axis('off')
    
    # Rotate image
    rotated_img = pil_img.rotate(45)
    plt.subplot(2, 3, 5)
    plt.imshow(np.array(rotated_img))
    plt.title('Rotated 45°')
    plt.axis('off')
    
    # Convert to grayscale
    gray_img = pil_img.convert('L')
    plt.subplot(2, 3, 6)
    plt.imshow(np.array(gray_img), cmap='gray')
    plt.title('Grayscale (PIL)')
    plt.axis('off')
    
    plt.tight_layout()
    # plt.savefig('pil_processing.png')
    plt.close()

except Exception as e:
    print(f"Error: {e}")
    print("Using a placeholder image for demonstration instead.")
    
    # Create a simple placeholder image
    placeholder = np.zeros((300, 400, 3), dtype=np.uint8)
    
    # Add some shapes to the placeholder
    cv2.rectangle(placeholder, (50, 50), (200, 200), (0, 255, 0), -1)
    cv2.circle(placeholder, (300, 150), 80, (0, 0, 255), -1)
    cv2.line(placeholder, (50, 250), (350, 250), (255, 255, 255), 5)
    
    # Convert BGR to RGB for display
    placeholder_rgb = cv2.cvtColor(placeholder, cv2.COLOR_BGR2RGB)
    
    plt.figure(figsize=(10, 6))
    plt.imshow(placeholder_rgb)
    plt.title('Placeholder Image')
    plt.axis('off')
    # plt.savefig('placeholder.png')
    plt.close()
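
scikit-image, installed alongside OpenCV above, offers similar operations with a NumPy-first API. A minimal sketch, assuming sample_image.jpg exists on disk (the Sobel filter and Otsu threshold are just illustrative choices):

from skimage import io, color, filters

# Read the image as an RGB NumPy array
image = io.imread("sample_image.jpg")

# Convert to grayscale (pixel values scaled to the 0-1 range)
gray = color.rgb2gray(image)

# Sobel edge detection and Otsu thresholding
edges = filters.sobel(gray)
threshold = filters.threshold_otsu(gray)
binary = gray > threshold

print(f"Otsu threshold: {threshold:.3f}, edge response range: {edges.min():.3f}-{edges.max():.3f}")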

Object Detection

Let's implement basic object detection using pre-trained models:

import cv2
import numpy as np
import matplotlib.pyplot as plt

def detect_faces(image_path):
    """
    Detect faces in an image using OpenCV's pre-trained Haar Cascade classifier
    """
    try:
        # Load the image
        img = cv2.imread(image_path)
        if img is None:
            raise FileNotFoundError(f"Could not open or find the image: {image_path}")
        
        # Convert to grayscale
        gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
        
        # Load the face detector
        face_cascade = cv2.CascadeClassifier(cv2.data.haarcascades + 'haarcascade_frontalface_default.xml')
        
        # Detect faces
        faces = face_cascade.detectMultiScale(
            gray,
            scaleFactor=1.1,
            minNeighbors=5,
            minSize=(30, 30)
        )
        
        print(f"Found {len(faces)} faces!")
        
        # Draw rectangles around the faces
        for (x, y, w, h) in faces:
            cv2.rectangle(img, (x, y), (x+w, y+h), (255, 0, 0), 2)
        
        # Convert to RGB for display
        img_rgb = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
        
        return img_rgb, faces
    
    except Exception as e:
        print(f"Error: {e}")
        return None, []

def detect_objects(image_path):
    """
    Detect objects in an image using OpenCV's pre-trained YOLO model
    """
    try:
        # Load the image
        img = cv2.imread(image_path)
        if img is None:
            raise FileNotFoundError(f"Could not open or find the image: {image_path}")
        
        # Get image dimensions
        height, width, _ = img.shape
        
        # Load YOLO model (you need to download these files separately)
        # Uncomment and use these lines if you have the YOLO weights and configuration
        """
        net = cv2.dnn.readNet("yolov3.weights", "yolov3.cfg")
        with open("coco.names", "r") as f:
            classes = [line.strip() for line in f.readlines()]
        
        # Create a blob from the image
        blob = cv2.dnn.blobFromImage(img, 1/255.0, (416, 416), swapRB=True, crop=False)
        net.setInput(blob)
        
        # Get output layer names
        output_layers = net.getUnconnectedOutLayersNames()
        
        # Forward pass
        layer_outputs = net.forward(output_layers)
        
        # Initialize lists for detected objects
        boxes = []
        confidences = []
        class_ids = []
        
        # Process each output layer
        for output in layer_outputs:
            for detection in output:
                scores = detection[5:]
                class_id = np.argmax(scores)
                confidence = scores[class_id]
                
                if confidence > 0.5:  # Confidence threshold
                    # Object detected
                    center_x = int(detection[0] * width)
                    center_y = int(detection[1] * height)
                    w = int(detection[2] * width)
                    h = int(detection[3] * height)
                    
                    # Rectangle coordinates
                    x = int(center_x - w / 2)
                    y = int(center_y - h / 2)
                    
                    boxes.append([x, y, w, h])
                    confidences.append(float(confidence))
                    class_ids.append(class_id)
        
        # Apply non-maximum suppression
        indices = cv2.dnn.NMSBoxes(boxes, confidences, 0.5, 0.4)
        
        # Draw bounding boxes and labels
        for i in indices:
            i = i[0] if isinstance(i, (list, np.ndarray)) else i
            x, y, w, h = boxes[i]
            label = f"{classes[class_ids[i]]}: {confidences[i]:.2f}"
            
            cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)
            cv2.putText(img, label, (x, y - 10), cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 2)
        """
        
        # Since YOLO requires additional files (weights/config), let's create a placeholder
        # This simulates object detection results for demonstration
        objects = [
            {"label": "Person", "confidence": 0.92, "box": [50, 50, 100, 200]},
            {"label": "Car", "confidence": 0.85, "box": [200, 150, 150, 100]},
            {"label": "Dog", "confidence": 0.78, "box": [350, 200, 80, 60]}
        ]
        
        # Draw the simulated detections on the loaded image
        for obj in objects:
            x, y, w, h = obj["box"]
            label = f"{obj['label']}: {obj['confidence']:.2f}"
            
            cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)
            cv2.putText(img, label, (x, y - 10), cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 2)
        
        # Convert to RGB for display
        img_rgb = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
        
        return img_rgb, objects
    
    except Exception as e:
        print(f"Error: {e}")
        return None, []

# Demonstration with placeholder/sample images
try:
    # Try to use the sample image for face detection
    face_result, faces = detect_faces(image_path)
    object_result, objects = detect_objects(image_path)
    
    plt.figure(figsize=(12, 6))
    
    if face_result is not None:
        plt.subplot(1, 2, 1)
        plt.imshow(face_result)
        plt.title(f'Face Detection ({len(faces)} faces)')
        plt.axis('off')
    
    if object_result is not None:
        plt.subplot(1, 2, 2)
        plt.imshow(object_result)
        plt.title(f'Object Detection ({len(objects)} objects)')
        plt.axis('off')
    
    plt.tight_layout()
    # plt.savefig('detection_results.png')
    plt.close()

except Exception as e:
    print(f"Error in demonstration: {e}")
    # Create a placeholder for the demonstration
    placeholder = np.zeros((400, 600, 3), dtype=np.uint8)
    placeholder[:] = (240, 240, 240)  # Light gray background
    
    # Add text
    font = cv2.FONT_HERSHEY_SIMPLEX
    cv2.putText(placeholder, "Object Detection Placeholder", (120, 60), font, 1, (0, 0, 0), 2)
    
    # Draw some "detected objects" with bounding boxes
    objects = [
        {"label": "Person", "confidence": 0.92, "box": [50, 100, 100, 200]},
        {"label": "Car", "confidence": 0.85, "box": [250, 150, 150, 100]},
        {"label": "Dog", "confidence": 0.78, "box": [450, 200, 80, 60]}
    ]
    
    for obj in objects:
        x, y, w, h = obj["box"]
        label = f"{obj['label']}: {obj['confidence']:.2f}"
        
        cv2.rectangle(placeholder, (x, y), (x + w, y + h), (0, 255, 0), 2)
        cv2.putText(placeholder, label, (x, y - 10), font, 0.5, (0, 0, 0), 2)
    
    # Display placeholder
    placeholder_rgb = cv2.cvtColor(placeholder, cv2.COLOR_BGR2RGB)
    plt.figure(figsize=(10, 6))
    plt.imshow(placeholder_rgb)
    plt.title('Object Detection (Placeholder)')
    plt.axis('off')
    # plt.savefig('detection_placeholder.png')
    plt.close()
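
TensorFlow, installed at the start of this section, ships pre-trained image classifiers in tf.keras.applications. A minimal sketch using MobileNetV2 with ImageNet weights (the weights download automatically on first use, and sample_image.jpg is assumed to exist):

import numpy as np
from tensorflow.keras.applications.mobilenet_v2 import (
    MobileNetV2, preprocess_input, decode_predictions
)
from tensorflow.keras.preprocessing import image

# Load the pre-trained classifier (ImageNet weights, 224x224 input)
model = MobileNetV2(weights='imagenet')

# Load and preprocess the image to the expected input shape
img = image.load_img("sample_image.jpg", target_size=(224, 224))
x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))

# Predict and show the top 3 ImageNet classes
preds = model.predict(x)
for _, label, score in decode_predictions(preds, top=3)[0]:
    print(f"{label}: {score:.2f}")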

11.3 Recommendation Systems

Key Concept: Personalized Recommendations

Recommendation systems suggest items to users based on their preferences and behavior. Python provides tools for building different types of recommenders, from simple collaborative filtering to complex deep learning models.

Building a Simple Recommender

Let's implement a basic collaborative filtering recommender system:

import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
import matplotlib.pyplot as plt
import seaborn as sns

# Create a sample user-item ratings matrix
# Each row represents a user, each column represents an item (e.g., movie)
# The values are ratings given by users to items; 0 means the user has not rated the item
ratings = pd.DataFrame({
    'Item1': [5, 4, 0, 0, 1],
    'Item2': [4, 0, 3, 4, 0],
    'Item3': [1, 0, 5, 4, 3],
    'Item4': [0, 3, 5, 0, 4],
    'Item5': [2, 5, 0, 3, 5]
}, index=['User1', 'User2', 'User3', 'User4', 'User5'])

print("User-Item Ratings Matrix:")
print(ratings)

# Visualize the ratings matrix
plt.figure(figsize=(10, 6))
sns.heatmap(ratings, annot=True, cmap="YlGnBu", cbar_kws={'label': 'Rating'})
plt.title('User-Item Ratings Matrix')
plt.tight_layout()
# plt.savefig('ratings_matrix.png')
plt.close()

# Item-based collaborative filtering
def item_based_recommendations(ratings_matrix, item_similarity_matrix, user_id, num_recommendations=2):
    """
    Generate item-based recommendations for a user
    """
    # Get user ratings
    user_ratings = ratings_matrix.loc[user_id].values.reshape(1, -1)
    
    # Create a mask for already rated items
    already_rated = user_ratings[0] > 0
    
    # Calculate the predicted ratings
    # Weighted sum of item similarities and user ratings
    weighted_sum = np.dot(item_similarity_matrix, user_ratings.T)
    
    # Sum of similarities for normalization
    similarity_sums = np.sum(np.abs(item_similarity_matrix[:, already_rated]), axis=1)
    similarity_sums[similarity_sums == 0] = 1  # Avoid division by zero
    
    # Calculate predicted ratings
    predicted_ratings = weighted_sum / similarity_sums.reshape(-1, 1)
    
    # Convert to a more usable format
    predicted_ratings = predicted_ratings.flatten()
    
    # Mask out already rated items
    predicted_ratings[already_rated] = 0
    
    # Get top recommendations
    item_indices = np.argsort(predicted_ratings)[::-1][:num_recommendations]
    
    return {
        'item_indices': item_indices,
        'predicted_ratings': predicted_ratings[item_indices]
    }

# Calculate item-item similarity matrix using cosine similarity
item_similarity = cosine_similarity(ratings.T)
item_similarity_df = pd.DataFrame(item_similarity, 
                                  index=ratings.columns, 
                                  columns=ratings.columns)

print("\nItem-Item Similarity Matrix:")
print(item_similarity_df)

# Visualize the item similarity matrix
plt.figure(figsize=(8, 6))
sns.heatmap(item_similarity_df, annot=True, cmap="coolwarm", vmin=-1, vmax=1, 
            cbar_kws={'label': 'Cosine Similarity'})
plt.title('Item-Item Similarity Matrix')
plt.tight_layout()
# plt.savefig('item_similarity.png')
plt.close()

# Generate recommendations for each user
print("\nItem-Based Collaborative Filtering Recommendations:")
for user in ratings.index:
    recs = item_based_recommendations(ratings, item_similarity, user)
    rec_items = [ratings.columns[i] for i in recs['item_indices']]
    rec_ratings = recs['predicted_ratings']
    
    print(f"\n{user}:")
    for item, rating in zip(rec_items, rec_ratings):
        print(f"  Recommended: {item} (Predicted rating: {rating:.2f})")

# User-based collaborative filtering
def user_based_recommendations(ratings_matrix, user_similarity_matrix, user_id, num_recommendations=2):
    """
    Generate user-based recommendations for a user
    """
    # Get index of the target user
    user_idx = list(ratings_matrix.index).index(user_id)
    
    # Get similarities between the target user and all other users
    user_similarities = user_similarity_matrix[user_idx]
    
    # Create a mask for the target user's already rated items
    user_ratings = ratings_matrix.loc[user_id].values
    already_rated = user_ratings > 0
    
    # Initialize predicted ratings
    predicted_ratings = np.zeros(len(ratings_matrix.columns))
    
    # For each item that the user hasn't rated
    for item_idx in range(len(ratings_matrix.columns)):
        if not already_rated[item_idx]:
            # Get ratings for this item from all users
            item_ratings = ratings_matrix.iloc[:, item_idx].values
            
            # Create a mask for users who have rated this item
            rated_mask = item_ratings > 0
            
            # If no other user has rated this item, skip
            if np.sum(rated_mask) == 0:
                continue
            
            # Calculate the weighted average rating
            # Weighted by similarity between the target user and other users
            weighted_sum = np.sum(user_similarities[rated_mask] * item_ratings[rated_mask])
            similarity_sum = np.sum(np.abs(user_similarities[rated_mask]))
            
            if similarity_sum > 0:
                predicted_ratings[item_idx] = weighted_sum / similarity_sum
    
    # Get top recommendations (items with highest predicted ratings)
    # Only consider items the user hasn't rated yet
    unrated_item_indices = np.where(~already_rated)[0]
    unrated_pred_ratings = predicted_ratings[unrated_item_indices]
    
    # Sort by predicted rating
    top_indices = np.argsort(unrated_pred_ratings)[::-1][:num_recommendations]
    
    return {
        'item_indices': unrated_item_indices[top_indices],
        'predicted_ratings': unrated_pred_ratings[top_indices]
    }

# Calculate user-user similarity matrix
user_similarity = cosine_similarity(ratings)
user_similarity_df = pd.DataFrame(user_similarity, 
                                 index=ratings.index, 
                                 columns=ratings.index)

print("\nUser-User Similarity Matrix:")
print(user_similarity_df)

# Visualize the user similarity matrix
plt.figure(figsize=(8, 6))
sns.heatmap(user_similarity_df, annot=True, cmap="coolwarm", vmin=-1, vmax=1, 
            cbar_kws={'label': 'Cosine Similarity'})
plt.title('User-User Similarity Matrix')
plt.tight_layout()
# plt.savefig('user_similarity.png')
plt.close()

# Generate user-based recommendations for each user
print("\nUser-Based Collaborative Filtering Recommendations:")
for user in ratings.index:
    recs = user_based_recommendations(ratings, user_similarity, user)
    rec_items = [ratings.columns[i] for i in recs['item_indices']]
    rec_ratings = recs['predicted_ratings']
    
    print(f"\n{user}:")
    for item, rating in zip(rec_items, rec_ratings):
        print(f"  Recommended: {item} (Predicted rating: {rating:.2f})")

# Compare recommendations from both approaches
print("\nComparison of Recommendation Approaches:")
for user in ratings.index:
    item_recs = item_based_recommendations(ratings, item_similarity, user)
    user_recs = user_based_recommendations(ratings, user_similarity, user)
    
    item_rec_items = [ratings.columns[i] for i in item_recs['item_indices']]
    user_rec_items = [ratings.columns[i] for i in user_recs['item_indices']]
    
    print(f"\n{user}:")
    print(f"  Item-based recommendations: {', '.join(item_rec_items)}")
    print(f"  User-based recommendations: {', '.join(user_rec_items)}")

11.4 Deploying Machine Learning Models

Key Concept: Model Deployment

Deploying machine learning models makes them accessible to applications via APIs, batch processing, or embedded systems. Python provides various frameworks for model deployment.

Creating a Model API with Flask

Let's build a simple API for a machine learning model:

"""
# app.py - Save this to a separate file to run it

from flask import Flask, request, jsonify
import pickle
import numpy as np
from sklearn.preprocessing import StandardScaler

app = Flask(__name__)

# Load the pre-trained model
# Assuming you have trained and saved a model using pickle
try:
    with open('model.pkl', 'rb') as f:
        model = pickle.load(f)
    with open('scaler.pkl', 'rb') as f:
        scaler = pickle.load(f)
    print("Model and scaler loaded successfully")
except FileNotFoundError:
    # For demonstration purposes, we'll create a simple model
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.datasets import load_iris
    
    print("Creating a sample model for demonstration")
    iris = load_iris()
    X, y = iris.data, iris.target
    
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)
    
    model = RandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(X_scaled, y)
    
    # Save the model and scaler (optional)
    with open('model.pkl', 'wb') as f:
        pickle.dump(model, f)
    with open('scaler.pkl', 'wb') as f:
        pickle.dump(scaler, f)

@app.route('/predict', methods=['POST'])
def predict():
    # Get request data
    data = request.get_json(force=True)
    
    # Check if 'features' is in the request
    if 'features' not in data:
        return jsonify({'error': 'No features provided in the request'}), 400
    
    # Extract features
    features = data['features']
    
    try:
        # Convert to numpy array
        features_array = np.array(features).reshape(1, -1)
        
        # Scale the features
        features_scaled = scaler.transform(features_array)
        
        # Make prediction
        prediction = model.predict(features_scaled)
        
        # For iris dataset, map class indices to names
        class_names = ['setosa', 'versicolor', 'virginica']
        predicted_class = class_names[int(prediction[0])]
        
        # Return the prediction as JSON
        return jsonify({
            'class_index': int(prediction[0]),
            'class_name': predicted_class
        })
    
    except Exception as e:
        return jsonify({'error': str(e)}), 400

if __name__ == '__main__':
    app.run(debug=True)
"""
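
Once the app is running (for example with python app.py), you can query the endpoint from another process. A minimal sketch using the requests library, assuming the server listens on the default Flask port 5000:

import requests

# Send four iris measurements to the /predict endpoint
response = requests.post(
    "http://127.0.0.1:5000/predict",
    json={"features": [5.1, 3.5, 1.4, 0.2]}
)

print(response.status_code)
print(response.json())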
