Comprehensive Python Programming
From Data Handling to AI Applications
A practical guide for mastering advanced Python concepts
Table of Contents
- Chapter 8: Working with Data
- Chapter 9: Web Development with Python
- Chapter 10: Introduction to AI and Machine Learning
- Chapter 11: Building AI Applications
- Chapter 12: Ethical Hacking with Python
- Chapter 13: Advanced Python Concepts
- Chapter 14: Thinking Like a Programmer
- Chapter 15: Real-World Python Projects
- Conclusion: Your Continuing Python Journey
Introduction
Welcome to the advanced sections of our comprehensive Python programming guide. In these chapters, we'll explore the versatility and power of Python beyond the fundamentals. Whether you're looking to analyze complex datasets, build web applications, explore artificial intelligence, or develop ethical hacking tools, Python offers the libraries and frameworks to bring your ideas to life.
Each chapter builds on core Python knowledge, introducing specialized libraries and techniques for various domains. We've designed this guide with practical applications in mind—you'll find numerous code examples, projects, and exercises to reinforce your learning.
By the end of this guide, you'll have a broad understanding of Python's capabilities across multiple domains and the confidence to apply these skills to your own innovative projects. Let's begin this exciting journey into advanced Python programming!
Chapter 8: Working with Data
Learning Objectives
- Master file operations for structured and unstructured data
- Process and manipulate data using powerful Python libraries
- Analyze and visualize data to extract meaningful insights
- Work with databases from Python applications
- Implement data cleaning and transformation pipelines
8.1 File Operations in Python
Key Concept: File Handling
Python provides robust built-in functions for reading, writing, and manipulating files, making it an excellent language for data processing tasks.
Working with Text Files
Python's file operations are straightforward and powerful. The basic pattern involves opening a file, performing operations, and closing it when done:
# Reading a text file
with open('data.txt', 'r') as file:
content = file.read()
print(content)
# Writing to a text file
with open('output.txt', 'w') as file:
file.write('This is some data\n')
file.write('This is another line of data')
# Appending to a text file
with open('output.txt', 'a') as file:
file.write('\nThis line is appended to the file')
Tip: Always use the with statement when working with files. It ensures proper resource management by automatically closing the file when operations are complete, even if exceptions occur.
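For context, the with block is roughly shorthand for the try/finally pattern below; this sketch shows what the context manager handles for you:
# Roughly what 'with open(...)' does for you
file = open('data.txt', 'r')
try:
    content = file.read()
    print(content)
finally:
    file.close()  # Runs even if an exception occurs above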
CSV and JSON Files
Most data analysis tasks involve structured data formats like CSV (Comma-Separated Values) and JSON (JavaScript Object Notation). Python provides dedicated modules for handling these formats:
# Working with CSV files
import csv
# Reading CSV
with open('data.csv', 'r') as file:
csv_reader = csv.reader(file)
for row in csv_reader:
print(row)
# Writing CSV
with open('output.csv', 'w', newline='') as file:
csv_writer = csv.writer(file)
csv_writer.writerow(['Name', 'Age', 'City'])
csv_writer.writerow(['Alice', 28, 'New York'])
csv_writer.writerow(['Bob', 32, 'San Francisco'])
# Working with JSON
import json
# Reading JSON
with open('data.json', 'r') as file:
data = json.load(file)
print(data)
# Writing JSON
data = {
'name': 'Alice',
'age': 28,
'city': 'New York',
'skills': ['Python', 'Data Analysis', 'Machine Learning']
}
with open('output.json', 'w') as file:
json.dump(data, file, indent=4)
8.2 Data Analysis with Pandas
Key Concept: Pandas Library
Pandas is the most popular Python library for data manipulation and analysis, offering powerful data structures and operations for manipulating numerical tables and time series.
To use Pandas, you'll first need to install it:
pip install pandas
Working with DataFrames
The DataFrame is Pandas' primary data structure—a two-dimensional labeled data structure with columns that can be of different types:
import pandas as pd
import numpy as np
# Creating a DataFrame from a dictionary
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
'Age': [25, 30, 35, 40, 45],
'City': ['New York', 'San Francisco', 'Los Angeles', 'Chicago', 'Boston'],
'Salary': [70000, 80000, 90000, 75000, 85000]
}
df = pd.DataFrame(data)
print(df)
# Reading data from a CSV file
# df = pd.read_csv('data.csv')
# Basic information about the DataFrame
print(df.info())
print(df.describe())
# Accessing data
print(df['Name']) # Access a column
print(df[['Name', 'Age']]) # Access multiple columns
print(df.iloc[0]) # Access a row by position
print(df.loc[2]) # Access a row by label
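Rows can also be selected with boolean conditions and sorted; a short sketch that continues with the df defined above:
# Filter rows with boolean conditions
print(df[df['Age'] > 30])  # People older than 30
print(df[(df['Age'] > 30) & (df['Salary'] > 80000)])  # Combine conditions with & (and) or | (or)
# Sort by a column
print(df.sort_values(by='Salary', ascending=False))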
Data Cleaning and Transformation
Real-world data is often messy. Pandas provides functions to clean and transform data:
# Handling missing values
df = pd.DataFrame({
'A': [1, 2, np.nan, 4],
'B': [5, np.nan, np.nan, 8],
'C': [9, 10, 11, 12]
})
print("DataFrame with missing values:")
print(df)
# Check for missing values
print("\nMissing values per column:")
print(df.isnull().sum())
# Fill missing values
df_filled = df.fillna(0) # Fill with zeros
print("\nFilled with zeros:")
print(df_filled)
df_filled_mean = df.fillna(df.mean()) # Fill with column means
print("\nFilled with column means:")
print(df_filled_mean)
# Drop rows with any missing values
df_dropped = df.dropna()
print("\nRows with missing values dropped:")
print(df_dropped)
# Data transformation
# Create a new column based on existing ones
df = pd.DataFrame({
'A': [1, 2, 3, 4],
'B': [5, 6, 7, 8]
})
df['C'] = df['A'] + df['B']
print("\nDataFrame with calculated column:")
print(df)
# Apply a function to a column
df['D'] = df['A'].apply(lambda x: x * 2)
print("\nDataFrame with applied function:")
print(df)
Data Grouping and Aggregation
One of Pandas' most powerful features is its ability to group and aggregate data:
# Sample DataFrame with categorical data
data = {
'Category': ['A', 'B', 'A', 'B', 'A', 'C', 'C', 'B'],
'Value': [10, 20, 15, 25, 30, 40, 35, 22]
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
# Group by Category and calculate statistics
grouped = df.groupby('Category')
print("\nGroup means:")
print(grouped.mean())
print("\nGroup sums:")
print(grouped.sum())
# Multiple aggregations
print("\nMultiple aggregations:")
print(grouped.agg(['min', 'max', 'mean', 'count']))
# Custom aggregation
print("\nCustom aggregation:")
print(grouped.agg({
'Value': ['min', 'max', 'mean', lambda x: x.max() - x.min()]
}))
8.3 Data Visualization
Key Concept: Visual Data Analysis
Data visualization helps identify patterns, trends, and outliers in data that might not be apparent from raw numbers. Python offers several libraries for creating compelling visualizations.
The main visualization libraries in Python are:
- Matplotlib: The foundation for most visualization in Python
- Seaborn: Built on Matplotlib, providing higher-level abstractions and prettier defaults
- Plotly: For interactive visualizations
# Install required libraries
# pip install matplotlib seaborn
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
# Create sample data
np.random.seed(42)
data = {
'x': np.random.normal(0, 1, 100),
'y': np.random.normal(0, 1, 100),
'category': np.random.choice(['A', 'B', 'C'], 100)
}
df = pd.DataFrame(data)
# Basic Matplotlib
plt.figure(figsize=(10, 6))
plt.plot([1, 2, 3, 4, 5], [1, 4, 9, 16, 25], 'bo-')
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.title('Simple Line Plot')
plt.grid(True)
# plt.savefig('line_plot.png')
plt.close()
# Multiple plots with Matplotlib
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
# First subplot: scatter plot
axes[0].scatter(df['x'], df['y'], c='blue', alpha=0.5)
axes[0].set_title('Scatter Plot')
axes[0].set_xlabel('X-axis')
axes[0].set_ylabel('Y-axis')
axes[0].grid(True)
# Second subplot: histogram
axes[1].hist(df['x'], bins=15, alpha=0.7)
axes[1].set_title('Histogram')
axes[1].set_xlabel('Value')
axes[1].set_ylabel('Frequency')
axes[1].grid(True)
plt.tight_layout()
# plt.savefig('matplotlib_plots.png')
plt.close()
# Seaborn visualizations
plt.figure(figsize=(10, 6))
sns.set_style("whitegrid")
sns.scatterplot(data=df, x='x', y='y', hue='category', palette='viridis')
plt.title('Seaborn Scatter Plot with Categories')
# plt.savefig('seaborn_scatter.png')
plt.close()
# Seaborn distribution plots
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
sns.histplot(df['x'], kde=True)
plt.title('Histogram with KDE')
plt.subplot(1, 2, 2)
sns.boxplot(x='category', y='x', data=df)
plt.title('Box Plot by Category')
plt.tight_layout()
# plt.savefig('seaborn_distributions.png')
plt.close()
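Plotly, listed above as the interactive option, follows a similar pattern. A minimal sketch using plotly.express with the same df (assumes Plotly is installed with pip install plotly):
# Interactive scatter plot with Plotly Express
import plotly.express as px
fig = px.scatter(df, x='x', y='y', color='category', title='Interactive Scatter Plot')
# fig.show()  # Opens the plot in a browser
fig.write_html('interactive_scatter.html')  # Saves a standalone interactive page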
8.4 Working with Databases
Key Concept: Database Integration
For persistent storage and complex queries, databases are essential. Python can interact with virtually any database system.
SQLite: Built-in Database
Python's standard library includes SQLite, a lightweight disk-based database that requires no separate server:
import sqlite3
import pandas as pd
# Connect to a database (will be created if it doesn't exist)
conn = sqlite3.connect('example.db')
# Create a cursor object
cursor = conn.cursor()
# Create a table
cursor.execute('''
CREATE TABLE IF NOT EXISTS employees (
id INTEGER PRIMARY KEY,
name TEXT NOT NULL,
department TEXT,
salary REAL,
hire_date TEXT
)
''')
# Insert data
employees = [
(1, 'Alice Smith', 'Engineering', 75000.00, '2020-01-15'),
(2, 'Bob Johnson', 'Marketing', 65000.00, '2019-03-10'),
(3, 'Charlie Brown', 'Engineering', 80000.00, '2021-05-22'),
(4, 'Diana Lee', 'Finance', 72000.00, '2018-11-30'),
(5, 'Edward Wilson', 'HR', 62000.00, '2022-02-05')
]
cursor.executemany('''
INSERT OR REPLACE INTO employees (id, name, department, salary, hire_date)
VALUES (?, ?, ?, ?, ?)
''', employees)
# Commit changes
conn.commit()
# Query the database
cursor.execute('SELECT * FROM employees')
result = cursor.fetchall()
print("All employees:")
for row in result:
print(row)
cursor.execute("SELECT name, salary FROM employees WHERE department = 'Engineering'")
engineers = cursor.fetchall()
print("\nEngineers:")
for row in engineers:
print(row)
# Using Pandas with SQLite
query = 'SELECT * FROM employees'
df = pd.read_sql_query(query, conn)
print("\nDataFrame from SQL query:")
print(df)
# Update data
cursor.execute("UPDATE employees SET salary = salary * 1.1 WHERE department = 'Engineering'")
conn.commit()
# Verify the update
df_updated = pd.read_sql_query('SELECT * FROM employees', conn)
print("\nUpdated DataFrame:")
print(df_updated)
# Close the connection
conn.close()
Working with Other Databases
For larger applications, you might use PostgreSQL, MySQL, MongoDB, or other database systems. The pattern is similar, but you'll need specific libraries:
# PostgreSQL example (requires psycopg2 package)
# pip install psycopg2-binary
"""
import psycopg2
# Connect to PostgreSQL
conn = psycopg2.connect(
host="localhost",
database="mydatabase",
user="myuser",
password="mypassword"
)
cursor = conn.cursor()
# Execute SQL commands like with SQLite
cursor.execute("SELECT * FROM my_table")
results = cursor.fetchall()
conn.close()
"""
# MongoDB example (requires pymongo package)
# pip install pymongo
"""
from pymongo import MongoClient
# Connect to MongoDB
client = MongoClient('mongodb://localhost:27017/')
db = client['mydatabase']
collection = db['mycollection']
# Insert a document
collection.insert_one({"name": "Alice", "age": 30, "city": "New York"})
# Query documents
results = collection.find({"city": "New York"})
for doc in results:
print(doc)
"""
Exercise 8.1: Data Analysis Project
Create a complete data analysis project using the provided dataset of customer information:
- Load the customer dataset from a CSV file using Pandas
- Clean the data by handling missing values and removing duplicates
- Perform exploratory data analysis (calculate statistics, create visualizations)
- Create at least three different types of plots to visualize the data
- Save the cleaned dataset to a SQLite database
- Query the database to extract specific customer segments
- Generate a summary report with your findings
Hint: Use functions to organize your code and document each step of your analysis.
Exercise 8.2: Data Transformation Challenge
You're given a messy dataset containing information about products sold in a store:
- The dataset contains inconsistent dates, missing prices, and duplicate product entries
- Create a data cleaning pipeline to standardize the data
- Calculate monthly sales totals and identify the best-selling products
- Create a visualization showing sales trends over time
- Export the cleaned and transformed data to both CSV and JSON formats
Bonus challenge: Implement a function to detect and flag suspicious sales patterns that might indicate errors in the data.
Chapter Summary
In this chapter, we explored essential techniques for working with data in Python:
- File operations for reading and writing common data formats (text, CSV, JSON)
- Data manipulation using Pandas, including filtering, transformation, and aggregation
- Data visualization with Matplotlib and Seaborn for identifying patterns and trends
- Database integration to store, query, and update structured data
These skills form the foundation of data analysis in Python and will be essential for many applications, including web development, machine learning, and scientific computing, which we'll explore in subsequent chapters.
Chapter 9: Web Development with Python
Learning Objectives
- Understand web application architecture and HTTP fundamentals
- Build web applications using Flask and Django frameworks
- Develop RESTful APIs to serve data to front-end applications
- Create dynamic web pages with templates and forms
- Implement authentication and security best practices
9.1 Web Development Fundamentals
Key Concept: Web Architecture
Web applications involve client-server interactions using HTTP protocols. Understanding these fundamentals is essential for effective web development.
HTTP Basics
HTTP (Hypertext Transfer Protocol) is the foundation of data communication on the web. Key concepts include:
- Request Methods: GET, POST, PUT, DELETE, etc.
- Status Codes: 200 (OK), 404 (Not Found), 500 (Server Error), etc.
- Headers: Metadata about the request or response
- Body: The actual content being transferred
Python provides several ways to make HTTP requests:
# Using the requests library (install with: pip install requests)
import requests
# GET request
response = requests.get('https://api.github.com/users/python')
print(f"Status code: {response.status_code}")
print(f"Content type: {response.headers['content-type']}")
print(f"Data: {response.json()}")
# POST request
data = {'username': 'pythonuser', 'password': 'securepassword'}
response = requests.post('https://httpbin.org/post', data=data)
print(f"POST response: {response.json()}")
# Custom headers
headers = {'User-Agent': 'MyPythonApp/1.0'}
response = requests.get('https://httpbin.org/headers', headers=headers)
print(f"Headers response: {response.json()}")
Web Scraping
Web scraping involves extracting data from websites. While APIs are preferred when available, scraping is useful for sites without APIs:
# Install required libraries
# pip install beautifulsoup4 requests
import requests
from bs4 import BeautifulSoup
# Fetch a web page
url = 'https://quotes.toscrape.com/'
response = requests.get(url)
html = response.text
# Parse the HTML
soup = BeautifulSoup(html, 'html.parser')
# Extract data
quotes = soup.find_all('span', class_='text')
authors = soup.find_all('small', class_='author')
# Print the results
print("Quotes and Authors:")
for i, (quote, author) in enumerate(zip(quotes, authors), 1):
print(f"{i}. {quote.text} - {author.text}")
# Extract specific elements
title = soup.title.text
print(f"\nPage title: {title}")
# Find elements by CSS selector
tags = soup.select('.tag')
print("\nTags:")
for tag in tags[:10]: # Show first 10 tags
print(f"- {tag.text}")
Important: Web scraping should be done responsibly. Always check a website's robots.txt file and terms of service before scraping. Use appropriate delays between requests to avoid overloading servers, and consider using APIs when available.
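A sketch of what a polite scraper might do: check robots.txt with the standard library and pause between requests (this reuses the requests import from the example above; the user agent string is just an example):
import time
from urllib.robotparser import RobotFileParser
rp = RobotFileParser()
rp.set_url('https://quotes.toscrape.com/robots.txt')
rp.read()
page_url = 'https://quotes.toscrape.com/page/2/'
if rp.can_fetch('MyPythonApp', page_url):
    response = requests.get(page_url, headers={'User-Agent': 'MyPythonApp/1.0'})
    time.sleep(1)  # Be gentle: wait between requests
else:
    print("robots.txt disallows fetching this URL")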
9.2 Web Development with Flask
Key Concept: Flask Framework
Flask is a lightweight, flexible web framework for Python, perfect for small to medium applications and APIs. Its "micro" design philosophy makes it easy to learn and extend as needed.
To get started with Flask, install it using pip:
pip install flask
Creating a Basic Flask Application
# app.py
from flask import Flask, render_template, request, redirect, url_for, jsonify
# Initialize Flask application
app = Flask(__name__)
# Sample data
tasks = [
{'id': 1, 'title': 'Learn Flask', 'done': False},
{'id': 2, 'title': 'Develop web app', 'done': False},
{'id': 3, 'title': 'Deploy application', 'done': False}
]
# Route for the home page
@app.route('/')
def home():
return render_template('index.html', tasks=tasks)
# Route that accepts parameters
@app.route('/task/<int:task_id>')
def task_detail(task_id):
task = next((task for task in tasks if task['id'] == task_id), None)
if task:
return render_template('task_detail.html', task=task)
return "Task not found", 404
# Route that handles form submission (POST request)
@app.route('/add_task', methods=['POST'])
def add_task():
if request.method == 'POST':
title = request.form.get('title')
if title:
# Generate a new ID (in a real app, this would be handled by a database)
new_id = max(task['id'] for task in tasks) + 1
tasks.append({'id': new_id, 'title': title, 'done': False})
return redirect(url_for('home'))
# API route that returns JSON
@app.route('/api/tasks')
def get_tasks():
return jsonify(tasks)
# Run the application
if __name__ == '__main__':
app.run(debug=True)
For the above application to work, you would need to create HTML templates. Here's a simple example for the index.html file:
<!-- templates/index.html -->
<!DOCTYPE html>
<html>
<head>
<title>Flask Todo App</title>
<style>
body { font-family: Arial, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px; }
.task { margin-bottom: 10px; padding: 10px; border: 1px solid #ddd; border-radius: 5px; }
form { margin: 20px 0; }
</style>
</head>
<body>
<h1>Task List</h1>
<form action="/add_task" method="post">
<input type="text" name="title" placeholder="New task" required>
<button type="submit">Add Task</button>
</form>
<h2>Tasks:</h2>
{% for task in tasks %}
<div class="task">
<strong>{{ task.title }}</strong>
<p>Status: {{ "Completed" if task.done else "Pending" }}</p>
<a href="{{ url_for('task_detail', task_id=task.id) }}">View Details</a>
</div>
{% endfor %}
</body>
</html>
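The task_detail route also expects a template. A minimal sketch of templates/task_detail.html; the exact markup is up to you:
<!-- templates/task_detail.html -->
<!DOCTYPE html>
<html>
<head>
    <title>Task {{ task.id }}</title>
</head>
<body>
    <h1>{{ task.title }}</h1>
    <p>Status: {{ "Completed" if task.done else "Pending" }}</p>
    <a href="{{ url_for('home') }}">Back to task list</a>
</body>
</html>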
Building a REST API with Flask
Flask is excellent for building APIs. Here's a more complete example of a RESTful API:
# api_app.py
from flask import Flask, request, jsonify
app = Flask(__name__)
# Sample data (in a real app, this would come from a database)
books = [
{"id": 1, "title": "The Great Gatsby", "author": "F. Scott Fitzgerald", "year": 1925},
{"id": 2, "title": "To Kill a Mockingbird", "author": "Harper Lee", "year": 1960},
{"id": 3, "title": "1984", "author": "George Orwell", "year": 1949}
]
# GET all books
@app.route('/api/books', methods=['GET'])
def get_books():
return jsonify(books)
# GET a specific book
@app.route('/api/books/<int:book_id>', methods=['GET'])
def get_book(book_id):
book = next((book for book in books if book['id'] == book_id), None)
if book:
return jsonify(book)
return jsonify({"error": "Book not found"}), 404
# POST a new book
@app.route('/api/books', methods=['POST'])
def add_book():
if not request.json or 'title' not in request.json:
return jsonify({"error": "Invalid book data"}), 400
# Create a new book object
new_id = max(book['id'] for book in books) + 1
new_book = {
'id': new_id,
'title': request.json['title'],
'author': request.json.get('author', "Unknown"),
'year': request.json.get('year', 0)
}
# Add to our collection
books.append(new_book)
return jsonify(new_book), 201
# PUT (update) a book
@app.route('/api/books/<int:book_id>', methods=['PUT'])
def update_book(book_id):
book = next((book for book in books if book['id'] == book_id), None)
if not book:
return jsonify({"error": "Book not found"}), 404
if not request.json:
return jsonify({"error": "Invalid book data"}), 400
# Update book attributes
book['title'] = request.json.get('title', book['title'])
book['author'] = request.json.get('author', book['author'])
book['year'] = request.json.get('year', book['year'])
return jsonify(book)
# DELETE a book
@app.route('/api/books/<int:book_id>', methods=['DELETE'])
def delete_book(book_id):
book = next((book for book in books if book['id'] == book_id), None)
if not book:
return jsonify({"error": "Book not found"}), 404
books.remove(book)
return jsonify({"result": "Book deleted"}), 200
if __name__ == '__main__':
app.run(debug=True)
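With the API running locally (python api_app.py), you can exercise it from another script using the requests library; a sketch that assumes Flask's default development address of http://127.0.0.1:5000:
import requests
base = 'http://127.0.0.1:5000/api/books'
print(requests.get(base).json())  # List all books
print(requests.get(f'{base}/1').json())  # Fetch one book
new_book = {'title': 'Brave New World', 'author': 'Aldous Huxley', 'year': 1932}
print(requests.post(base, json=new_book).json())  # Create a book (json= sets the Content-Type header)
print(requests.put(f'{base}/1', json={'year': 1925}).json())  # Update a book
print(requests.delete(f'{base}/2').json())  # Delete a book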
9.3 Web Development with Django
Key Concept: Django Framework
Django is a high-level Python web framework that follows the "batteries-included" philosophy. It provides a comprehensive set of features for building large-scale web applications efficiently.
To get started with Django, install it using pip:
pip install django
Creating a Django Project
Django follows a project/app structure. Here's how to create a basic Django project:
# Create a new Django project
django-admin startproject mysite
# Navigate to the project directory
cd mysite
# Create a new app within the project
python manage.py startapp blog
# Run migrations to create database tables
python manage.py migrate
# Create a superuser for the admin interface
python manage.py createsuperuser
# Run the development server
python manage.py runserver
Django Project Structure
A typical Django project has the following structure:
mysite/ # Project root directory
│
├── manage.py # Command-line utility for administrative tasks
│
├── mysite/ # Project package
│ ├── __init__.py
│ ├── settings.py # Project settings/configuration
│ ├── urls.py # Project URL declarations
│ ├── asgi.py # ASGI configuration for async servers
│ └── wsgi.py # WSGI configuration for traditional servers
│
└── blog/ # App directory
├── __init__.py
├── admin.py # Admin interface configuration
├── apps.py # App configuration
├── migrations/ # Database migrations
├── models.py # Data models
├── tests.py # Unit tests
└── views.py # Request handlers
Building a Blog Application with Django
Let's create a simple blog application with Django:
# blog/models.py
from django.db import models
from django.utils import timezone
from django.contrib.auth.models import User
class Post(models.Model):
title = models.CharField(max_length=200)
content = models.TextField()
date_posted = models.DateTimeField(default=timezone.now)
author = models.ForeignKey(User, on_delete=models.CASCADE)
def __str__(self):
return self.title
# blog/views.py
from django.shortcuts import render, get_object_or_404
from django.http import HttpResponse
from .models import Post
def home(request):
context = {
'posts': Post.objects.all().order_by('-date_posted')
}
return render(request, 'blog/home.html', context)
def post_detail(request, post_id):
post = get_object_or_404(Post, id=post_id)
return render(request, 'blog/post_detail.html', {'post': post})
# blog/urls.py (create this file)
from django.urls import path
from . import views
urlpatterns = [
path('', views.home, name='blog-home'),
path('post/<int:post_id>/', views.post_detail, name='post-detail'),
]
# mysite/urls.py (update this file)
from django.contrib import admin
from django.urls import path, include
urlpatterns = [
path('admin/', admin.site.urls),
path('blog/', include('blog.urls')),
]
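For Django to find the blog app's models and templates, the app also has to be registered in the project settings. A sketch of the relevant excerpt (your generated settings.py already contains the default entries):
# mysite/settings.py (excerpt)
INSTALLED_APPS = [
    'django.contrib.admin',
    'django.contrib.auth',
    'django.contrib.contenttypes',
    'django.contrib.sessions',
    'django.contrib.messages',
    'django.contrib.staticfiles',
    'blog',  # Register the blog app
]
After adding the Post model, run python manage.py makemigrations blog followed by python manage.py migrate to create its database table.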
You'll also need to create HTML templates for your blog. Here's a simple example:
<!-- blog/templates/blog/base.html -->
<!DOCTYPE html>
<html>
<head>
<title>{% block title %}Django Blog{% endblock %}</title>
<style>
body { font-family: Arial, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px; }
.post { margin-bottom: 20px; padding: 15px; border: 1px solid #ddd; border-radius: 5px; }
.post-meta { color: #666; font-size: 0.9em; }
</style>
</head>
<body>
<header>
<h1>Django Blog</h1>
<nav>
<a href="{% url 'blog-home' %}">Home</a>
<a href="{% url 'admin:index' %}">Admin</a>
</nav>
</header>
<main>
{% block content %}{% endblock %}
</main>
<footer>
<p>© {% now "Y" %} Django Blog</p>
</footer>
</body>
</html>
<!-- blog/templates/blog/home.html -->
{% extends "blog/base.html" %}
{% block content %}
<h2>Latest Posts</h2>
{% for post in posts %}
<div class="post">
<h3><a href="{% url 'post-detail' post.id %}">{{ post.title }}</a></h3>
<div class="post-meta">
By {{ post.author }} on {{ post.date_posted|date:"F d, Y" }}
</div>
<p>{{ post.content|truncatewords:30 }}</p>
</div>
{% empty %}
<p>No posts available.</p>
{% endfor %}
{% endblock %}
<!-- blog/templates/blog/post_detail.html -->
{% extends "blog/base.html" %}
{% block title %}{{ post.title }} | Django Blog{% endblock %}
{% block content %}
<div class="post">
<h2>{{ post.title }}</h2>
<div class="post-meta">
By {{ post.author }} on {{ post.date_posted|date:"F d, Y" }}
</div>
<div class="post-content">
{{ post.content }}
</div>
</div>
<a href="{% url 'blog-home' %}">← Back to all posts</a>
{% endblock %}
Django Admin Interface
One of Django's most powerful features is its automatic admin interface. Register your models in the admin.py file:
# blog/admin.py
from django.contrib import admin
from .models import Post
admin.site.register(Post)
9.4 Web Security Fundamentals
Key Concept: Secure Web Development
Security is critical in web development. Python frameworks include features to protect against common vulnerabilities.
Common Web Vulnerabilities
- Cross-Site Scripting (XSS): Injecting malicious scripts into web pages
- SQL Injection: Inserting malicious SQL code into database queries (see the parameterized-query sketch after this list)
- Cross-Site Request Forgery (CSRF): Tricking users into executing unwanted actions
- Authentication Weaknesses: Insecure password handling and session management
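To make the SQL injection risk concrete, here is a brief sketch contrasting an unsafe, string-formatted query with a parameterized one, reusing the sqlite3 employees table from Chapter 8 and a hypothetical malicious input:
import sqlite3
conn = sqlite3.connect('example.db')
cursor = conn.cursor()
user_input = "Engineering' OR '1'='1"  # Hypothetical malicious input
# Unsafe: the input is pasted directly into the SQL string
# cursor.execute(f"SELECT * FROM employees WHERE department = '{user_input}'")
# Safe: a parameterized query lets the driver escape the value
cursor.execute("SELECT * FROM employees WHERE department = ?", (user_input,))
print(cursor.fetchall())
conn.close()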
Security Best Practices
# Flask security example
from flask import Flask, request, render_template, redirect, url_for, session
import secrets
import hashlib
import re
app = Flask(__name__)
app.secret_key = secrets.token_hex(16) # Generate a secure secret key
# Simulated database of users
users = {
'admin': {
'password_hash': hashlib.sha256('securepassword123'.encode()).hexdigest(),
'role': 'admin'
}
}
# CSRF protection
@app.before_request
def csrf_protect():
if request.method == "POST":
token = session.pop('_csrf_token', None)
if not token or token != request.form.get('_csrf_token'):
return "CSRF token validation failed", 400
def generate_csrf_token():
if '_csrf_token' not in session:
session['_csrf_token'] = secrets.token_hex(16)
return session['_csrf_token']
# Input validation
def validate_username(username):
return re.match(r'^[a-zA-Z0-9_]{3,20}$', username) is not None
def validate_password(password):
# Check for minimum length, uppercase, lowercase, and digit
if len(password) < 8:
return False
if not re.search(r'[A-Z]', password):
return False
if not re.search(r'[a-z]', password):
return False
if not re.search(r'\d', password):
return False
return True
@app.route('/login', methods=['GET', 'POST'])
def login():
error = None
if request.method == 'POST':
username = request.form.get('username', '')
password = request.form.get('password', '')
# Validate username format
if not validate_username(username):
error = "Invalid username format"
else:
# Check if user exists and password is correct
user = users.get(username)
if user and user['password_hash'] == hashlib.sha256(password.encode()).hexdigest():
session['username'] = username
session['role'] = user['role']
return redirect(url_for('dashboard'))
else:
error = "Invalid username or password"
# For GET requests or if login failed
csrf_token = generate_csrf_token()
return render_template('login.html', csrf_token=csrf_token, error=error)
@app.route('/dashboard')
def dashboard():
# Check if user is logged in
if 'username' not in session:
return redirect(url_for('login'))
return render_template('dashboard.html', username=session['username'], role=session['role'])
@app.route('/logout')
def logout():
session.clear()
return redirect(url_for('login'))
if __name__ == '__main__':
app.run(debug=True)
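One caveat about the example above: a plain SHA-256 digest is fast to brute-force, so production code normally uses a dedicated password hashing function. A sketch using Werkzeug, which is installed alongside Flask; how you wire it into the login flow is left as an assumption:
from werkzeug.security import generate_password_hash, check_password_hash
# When creating a user, store a salted hash instead of a raw SHA-256 digest
password_hash = generate_password_hash('securepassword123')
# When logging in, compare the submitted password against the stored hash
if check_password_hash(password_hash, 'securepassword123'):
    print("Password is correct")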
Exercise 9.1: Build a Personal Portfolio Website
Create a personal portfolio website using Flask that includes the following features:
- A home page with an introduction and summary of your skills
- A projects page displaying your work with descriptions and images
- A contact form that sends emails when submitted
- A blog section where you can add new posts through an admin interface
- Responsive design that works well on mobile devices
Bonus: Add authentication to protect the admin interface for adding blog posts.
Exercise 9.2: RESTful API Development
Build a RESTful API for a movie database with Flask or Django REST Framework:
- Design endpoints for managing movies, directors, and genres
- Implement CRUD (Create, Read, Update, Delete) operations for each resource
- Add filtering capabilities (e.g., get movies by director, genre, or release year)
- Implement proper error handling and status codes
- Add authentication and authorization to protect certain endpoints
- Document your API using Swagger/OpenAPI
Bonus: Implement rate limiting to prevent API abuse.
Chapter Summary
In this chapter, we explored Python's capabilities for web development:
- Web fundamentals including HTTP requests and responses
- Web scraping techniques for extracting data from websites
- Building web applications with Flask, a lightweight framework
- Developing larger applications with Django's comprehensive features
- RESTful API design and implementation
- Web security best practices to protect against common vulnerabilities
These skills enable you to create everything from simple websites to complex web applications and APIs. In the next chapter, we'll explore how Python powers artificial intelligence and machine learning applications.
Chapter 10: Introduction to AI and Machine Learning
Learning Objectives
- Understand fundamental concepts in AI and machine learning
- Set up a Python environment for machine learning development
- Explore different types of machine learning algorithms
- Implement basic machine learning models using scikit-learn
- Evaluate and improve model performance
10.1 AI and Machine Learning Fundamentals
Key Concept: Machine Learning Paradigms
Machine learning is the study of computer algorithms that improve automatically through experience and data. It's a subset of artificial intelligence focused on building systems that learn from data.
Types of Machine Learning
- Supervised Learning: The algorithm learns from labeled training data, making predictions or decisions based on that learning
- Unsupervised Learning: The algorithm finds patterns or structures in unlabeled data
- Reinforcement Learning: The algorithm learns by interacting with an environment, receiving rewards or penalties
Setting Up Your Environment
To get started with machine learning in Python, you'll need to install several libraries:
# Install essential libraries
pip install numpy pandas matplotlib scikit-learn
# For more advanced machine learning
pip install tensorflow keras
The Machine Learning Workflow
A typical machine learning project follows these steps:
- Define the problem and gather data
- Explore and preprocess the data
- Select and train a model
- Evaluate the model
- Improve the model and tune parameters
- Deploy the model
10.2 Supervised Learning
Key Concept: Classification and Regression
Supervised learning involves training a model on labeled data to make predictions. The two main types are classification (predicting categories) and regression (predicting continuous values).
Classification Example
Let's implement a basic classification model to predict iris flower species:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
# Load the iris dataset
iris = load_iris()
X = iris.data
y = iris.target
feature_names = iris.feature_names
target_names = iris.target_names
# Create a DataFrame for better visualization
iris_df = pd.DataFrame(X, columns=feature_names)
iris_df['species'] = [target_names[i] for i in y]
# Print dataset information
print("Dataset shape:", X.shape)
print("Feature names:", feature_names)
print("Target names:", target_names)
print("\nFirst 5 rows:")
print(iris_df.head())
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=42)
# Standardize the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Train a K-Nearest Neighbors classifier
k = 3
knn = KNeighborsClassifier(n_neighbors=k)
knn.fit(X_train_scaled, y_train)
# Make predictions
y_pred = knn.predict(X_test_scaled)
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"\nAccuracy: {accuracy:.4f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=target_names))
print("\nConfusion Matrix:")
conf_matrix = confusion_matrix(y_test, y_pred)
print(conf_matrix)
# Visualize results - Sepal features
plt.figure(figsize=(10, 6))
colors = ['blue', 'green', 'red']
markers = ['o', 's', '^']
for i, species in enumerate(target_names):
# Plot training data
species_data = iris_df[iris_df['species'] == species]
plt.scatter(
species_data['sepal length (cm)'],
species_data['sepal width (cm)'],
color=colors[i],
marker=markers[i],
label=f'{species} (actual)',
alpha=0.6
)
plt.xlabel('Sepal Length (cm)')
plt.ylabel('Sepal Width (cm)')
plt.title('Iris Species Classification - Sepal Features')
plt.legend()
plt.grid(True)
# plt.savefig('iris_classification.png')
plt.close()
# Visualize the confusion matrix
plt.figure(figsize=(8, 6))
plt.imshow(conf_matrix, interpolation='nearest', cmap=plt.cm.Blues)
plt.title('Confusion Matrix')
plt.colorbar()
tick_marks = np.arange(len(target_names))
plt.xticks(tick_marks, target_names, rotation=45)
plt.yticks(tick_marks, target_names)
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
# plt.savefig('confusion_matrix.png')
plt.close()
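Once trained, the same scaler and classifier can score new measurements; a short sketch with a made-up flower:
# Classify a new, hypothetical flower: sepal length, sepal width, petal length, petal width (cm)
new_flower = np.array([[5.1, 3.5, 1.4, 0.2]])
new_flower_scaled = scaler.transform(new_flower)
prediction = knn.predict(new_flower_scaled)
print(f"Predicted species: {target_names[prediction[0]]}")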
Regression Example
Now let's implement a regression model to predict house prices:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
# Load the California housing dataset
housing = fetch_california_housing()
X = housing.data
y = housing.target
feature_names = housing.feature_names
# Create a DataFrame for better visualization
housing_df = pd.DataFrame(X, columns=feature_names)
housing_df['price'] = y
# Print dataset information
print("Dataset shape:", X.shape)
print("Feature names:", feature_names)
print("\nFirst 5 rows:")
print(housing_df.head())
print("\nData statistics:")
print(housing_df.describe())
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42)
# Standardize the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Train a linear regression model
model = LinearRegression()
model.fit(X_train_scaled, y_train)
# Make predictions
y_pred = model.predict(X_test_scaled)
# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)
print(f"\nModel Performance:")
print(f"Mean Squared Error: {mse:.4f}")
print(f"Root Mean Squared Error: {rmse:.4f}")
print(f"R² Score: {r2:.4f}")
# Feature importance
coefficients = pd.DataFrame(model.coef_, index=feature_names, columns=['Coefficient'])
print("\nFeature Coefficients:")
print(coefficients.sort_values(by='Coefficient', ascending=False))
# Visualize predicted vs actual values
plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_pred, alpha=0.5)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--')
plt.xlabel('Actual Prices')
plt.ylabel('Predicted Prices')
plt.title('Linear Regression: Predicted vs Actual House Prices')
plt.grid(True)
# plt.savefig('housing_regression.png')
plt.close()
# Visualize residuals
residuals = y_test - y_pred
plt.figure(figsize=(10, 6))
plt.scatter(y_pred, residuals, alpha=0.5)
plt.hlines(y=0, xmin=y_pred.min(), xmax=y_pred.max(), colors='r', linestyles='--')
plt.xlabel('Predicted Prices')
plt.ylabel('Residuals')
plt.title('Residual Plot')
plt.grid(True)
# plt.savefig('housing_residuals.png')
plt.close()
10.3 Unsupervised Learning
Key Concept: Clustering and Dimensionality Reduction
Unsupervised learning finds patterns in unlabeled data. Common approaches include clustering (grouping similar data points) and dimensionality reduction (simplifying data while preserving key information).
Clustering Example
Let's implement K-means clustering to group data points:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
# Generate synthetic data with 4 clusters
n_samples = 500
n_features = 2
n_clusters = 4
random_state = 42
X, y_true = make_blobs(
n_samples=n_samples,
n_features=n_features,
centers=n_clusters,
random_state=random_state
)
# Visualize the original data
plt.figure(figsize=(10, 6))
plt.scatter(X[:, 0], X[:, 1], c=y_true, cmap='viridis', alpha=0.7, edgecolors='k', s=40)
plt.title('Original Data with True Clusters')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.grid(True)
# plt.savefig('original_clusters.png')
plt.close()
# Apply K-means clustering
kmeans = KMeans(n_clusters=n_clusters, random_state=random_state)
y_pred = kmeans.fit_predict(X)
centers = kmeans.cluster_centers_
# Evaluate the clustering
silhouette_avg = silhouette_score(X, y_pred)
print(f"Silhouette Score: {silhouette_avg:.4f}")
# Visualize the K-means clustering results
plt.figure(figsize=(10, 6))
plt.scatter(X[:, 0], X[:, 1], c=y_pred, cmap='viridis', alpha=0.7, edgecolors='k', s=40)
plt.scatter(centers[:, 0], centers[:, 1], c='red', marker='X', s=200, label='Centroids')
plt.title('K-means Clustering Results')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.legend()
plt.grid(True)
# plt.savefig('kmeans_clusters.png')
plt.close()
# Finding the optimal number of clusters using the Elbow Method
inertia = []
silhouette_scores = []
k_range = range(2, 11)
for k in k_range:
kmeans = KMeans(n_clusters=k, random_state=random_state)
kmeans.fit(X)
inertia.append(kmeans.inertia_)
# Silhouette score (only computed for k > 1)
silhouette_scores.append(silhouette_score(X, kmeans.labels_))
# Plot the Elbow Method results
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
plt.plot(k_range, inertia, 'o-', linewidth=2, markersize=8)
plt.title('Elbow Method')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Inertia (Within-Cluster Sum of Squares)')
plt.grid(True)
plt.subplot(1, 2, 2)
plt.plot(k_range, silhouette_scores, 'o-', linewidth=2, markersize=8)
plt.title('Silhouette Method')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Silhouette Score')
plt.grid(True)
plt.tight_layout()
# plt.savefig('optimal_clusters.png')
plt.close()
Dimensionality Reduction Example
Now let's use Principal Component Analysis (PCA) to reduce the dimensionality of data:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
# Load the digits dataset
digits = load_digits()
X = digits.data
y = digits.target
# Print dataset information
print("Dataset shape:", X.shape)
print("Number of classes:", len(np.unique(y)))
# Standardize the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Apply PCA
pca = PCA()
X_pca = pca.fit_transform(X_scaled)
# Calculate explained variance ratio
explained_variance = pca.explained_variance_ratio_
cumulative_variance = np.cumsum(explained_variance)
# Plot the explained variance
plt.figure(figsize=(10, 6))
plt.bar(range(1, len(explained_variance) + 1), explained_variance, alpha=0.6, label='Individual explained variance')
plt.step(range(1, len(cumulative_variance) + 1), cumulative_variance, where='mid', label='Cumulative explained variance')
plt.axhline(y=0.9, linestyle='--', color='r', label='90% explained variance threshold')
plt.title('Explained Variance Ratio by Principal Components')
plt.xlabel('Principal Components')
plt.ylabel('Explained Variance Ratio')
plt.legend()
plt.grid(True)
# plt.savefig('pca_variance.png')
plt.close()
# Find number of components for 90% variance
n_components_90 = np.argmax(cumulative_variance >= 0.9) + 1
print(f"Number of components for 90% variance: {n_components_90}")
# Apply PCA with reduced dimensions
pca_reduced = PCA(n_components=n_components_90)
X_reduced = pca_reduced.fit_transform(X_scaled)
print(f"Reduced data shape: {X_reduced.shape}")
# Visualize first two principal components with class labels
plt.figure(figsize=(10, 8))
colors = plt.cm.rainbow(np.linspace(0, 1, len(np.unique(y))))
for i, color in enumerate(colors):
indices = y == i
plt.scatter(X_pca[indices, 0], X_pca[indices, 1], color=color, alpha=0.7, label=f'Digit {i}')
plt.title('PCA: First Two Principal Components')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend()
plt.grid(True)
# plt.savefig('pca_digits.png')
plt.close()
# Visualize some original digits and their reconstructions
pca = PCA(n_components=n_components_90)
X_pca_reduced = pca.fit_transform(X_scaled)
X_reconstructed = pca.inverse_transform(X_pca_reduced)
X_reconstructed = scaler.inverse_transform(X_reconstructed)
# Display original vs reconstructed digits
fig, axes = plt.subplots(4, 8, figsize=(16, 8))
indices = np.random.choice(len(X), 16, replace=False)
for i, idx in enumerate(indices):
# Original digit (rows 0 and 2 of the grid)
ax = axes[(i // 8) * 2, i % 8]
ax.imshow(digits.images[idx], cmap='gray')
ax.set_title(f'Original: {y[idx]}')
ax.axis('off')
# Reconstructed digit (rows 1 and 3 of the grid)
ax = axes[(i // 8) * 2 + 1, i % 8]
ax.imshow(X_reconstructed[idx].reshape(8, 8), cmap='gray')
ax.set_title('Reconstructed')
ax.axis('off')
plt.tight_layout()
# plt.savefig('digit_reconstruction.png')
plt.close()
10.4 Model Evaluation and Improvement
Key Concept: Model Validation and Hyperparameter Tuning
Evaluating and improving machine learning models is crucial for creating reliable AI systems. This involves selecting appropriate metrics, validation techniques, and optimization methods.
Cross-Validation
Cross-validation helps assess model performance more reliably than a single train-test split:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score, KFold, learning_curve
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_curve, auc
# Load the breast cancer dataset
cancer = load_breast_cancer()
X = cancer.data
y = cancer.target
# Standardize the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Create a model
model = RandomForestClassifier(n_estimators=100, random_state=42)
# Perform k-fold cross-validation
k_folds = 5
kf = KFold(n_splits=k_folds, shuffle=True, random_state=42)
cv_scores = cross_val_score(model, X_scaled, y, cv=kf, scoring='accuracy')
print(f"Cross-validation scores: {cv_scores}")
print(f"Mean CV accuracy: {cv_scores.mean():.4f}")
print(f"Standard deviation: {cv_scores.std():.4f}")
# Calculate multiple metrics using cross-validation
def calculate_metrics(model, X, y, cv):
metrics = {
'accuracy': [],
'precision': [],
'recall': [],
'f1': []
}
for train_idx, test_idx in cv.split(X):
X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
metrics['accuracy'].append(accuracy_score(y_test, y_pred))
metrics['precision'].append(precision_score(y_test, y_pred))
metrics['recall'].append(recall_score(y_test, y_pred))
metrics['f1'].append(f1_score(y_test, y_pred))
return metrics
metrics = calculate_metrics(model, X_scaled, y, kf)
print("\nDetailed Cross-Validation Metrics:")
for metric, values in metrics.items():
print(f"{metric.capitalize()}: {np.mean(values):.4f} ± {np.std(values):.4f}")
# Plot learning curves to diagnose overfitting/underfitting
train_sizes, train_scores, test_scores = learning_curve(
model, X_scaled, y, cv=5, n_jobs=-1,
train_sizes=np.linspace(0.1, 1.0, 10),
scoring='accuracy'
)
train_mean = np.mean(train_scores, axis=1)
train_std = np.std(train_scores, axis=1)
test_mean = np.mean(test_scores, axis=1)
test_std = np.std(test_scores, axis=1)
plt.figure(figsize=(10, 6))
plt.plot(train_sizes, train_mean, color='blue', marker='o', label='Training accuracy')
plt.fill_between(train_sizes, train_mean - train_std, train_mean + train_std, alpha=0.15, color='blue')
plt.plot(train_sizes, test_mean, color='green', marker='s', label='Validation accuracy')
plt.fill_between(train_sizes, test_mean - test_std, test_mean + test_std, alpha=0.15, color='green')
plt.title('Learning Curve')
plt.xlabel('Training Set Size')
plt.ylabel('Accuracy')
plt.grid(True)
plt.legend(loc='lower right')
# plt.savefig('learning_curve.png')
plt.close()
Hyperparameter Tuning
Optimize model performance by finding the best hyperparameter values:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix
# Load the wine dataset
wine = load_wine()
X = wine.data
y = wine.target
# Split the data
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=42)
# Standardize the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Define the parameter grid to search
param_grid = {
'C': [0.1, 1, 10, 100],
'gamma': [0.001, 0.01, 0.1, 1],
'kernel': ['rbf', 'linear']
}
# Create a base model
svm = SVC(probability=True)
# Perform grid search with cross-validation
grid_search = GridSearchCV(
estimator=svm,
param_grid=param_grid,
cv=5,
scoring='accuracy',
verbose=0,
n_jobs=-1
)
# Fit the grid search to the data
grid_search.fit(X_train_scaled, y_train)
# Get the best parameters and score
print("Best parameters:", grid_search.best_params_)
print("Best cross-validation score: {:.4f}".format(grid_search.best_score_))
# Get the best model
best_model = grid_search.best_estimator_
# Evaluate on the test set
y_pred = best_model.predict(X_test_scaled)
print("\nTest Set Evaluation:")
print(classification_report(y_test, y_pred, target_names=wine.target_names))
# Print confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(conf_matrix)
# Visualize the results of grid search
results = pd.DataFrame(grid_search.cv_results_)
results = results.sort_values(by='rank_test_score')
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
for kernel in ['rbf', 'linear']:
kernel_results = results[results['param_kernel'] == kernel]
plt.plot(kernel_results['param_C'], kernel_results['mean_test_score'],
marker='o', label=f'kernel={kernel}')
plt.xlabel('C parameter')
plt.ylabel('Mean test score')
plt.title('Grid Search Results: C parameter')
plt.legend()
plt.grid(True)
plt.subplot(1, 2, 2)
rbf_results = results[results['param_kernel'] == 'rbf']
for C in [0.1, 1, 10, 100]:
C_results = rbf_results[rbf_results['param_C'] == C]
plt.plot(C_results['param_gamma'], C_results['mean_test_score'],
marker='o', label=f'C={C}')
plt.xlabel('gamma parameter')
plt.ylabel('Mean test score')
plt.title('Grid Search Results: gamma parameter (RBF kernel)')
plt.legend()
plt.grid(True)
plt.tight_layout()
# plt.savefig('grid_search_results.png')
plt.close()
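When the parameter grid grows large, an exhaustive search becomes expensive. scikit-learn's RandomizedSearchCV samples a fixed number of parameter combinations instead; a brief sketch reusing the scaled wine data from above:
from scipy.stats import loguniform
from sklearn.model_selection import RandomizedSearchCV
param_distributions = {
    'C': loguniform(0.1, 100),
    'gamma': loguniform(0.001, 1),
    'kernel': ['rbf', 'linear']
}
random_search = RandomizedSearchCV(
    estimator=SVC(probability=True),
    param_distributions=param_distributions,
    n_iter=20,  # Number of parameter settings sampled
    cv=5,
    scoring='accuracy',
    random_state=42,
    n_jobs=-1
)
random_search.fit(X_train_scaled, y_train)
print("Best parameters (randomized search):", random_search.best_params_)
print(f"Best cross-validation score: {random_search.best_score_:.4f}")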
Exercise 10.1: Build a Predictive Model
Develop a machine learning model to predict student performance based on various factors:
- Load and explore the student performance dataset
- Clean and preprocess the data, handling missing values and categorical variables
- Split the data into training and testing sets
- Train at least three different models (e.g., linear regression, random forest, gradient boosting)
- Evaluate each model using appropriate metrics
- Tune the hyperparameters of the best-performing model
- Create visualizations to interpret the model's predictions
- Document your process and findings
Dataset: You can use the Student Performance dataset from the UCI Machine Learning Repository or a similar educational dataset.
Exercise 10.2: Customer Segmentation
Apply unsupervised learning techniques to segment customers based on their purchasing behavior:
- Load and explore a retail customer dataset
- Preprocess the data, handling outliers and scaling features
- Apply PCA to reduce dimensionality if necessary
- Use K-means clustering to segment customers
- Determine the optimal number of clusters using the elbow method and silhouette score
- Analyze and interpret each customer segment
- Create visualizations to represent the clusters
- Propose marketing strategies for each customer segment
Bonus challenge: Try hierarchical clustering as an alternative to K-means and compare the results.
Chapter Summary
In this chapter, we explored the fundamentals of artificial intelligence and machine learning with Python:
- Key concepts in machine learning, including supervised and unsupervised learning
- Setting up a Python environment for machine learning development
- Implementing classification and regression models using scikit-learn
- Exploring clustering and dimensionality reduction for unsupervised learning
- Evaluating models using cross-validation and various performance metrics
- Optimizing models through hyperparameter tuning
These foundations provide the groundwork for building more complex AI applications, which we'll explore in the next chapter. By understanding these core concepts, you're well-equipped to start applying machine learning to solve real-world problems.
Chapter 11: Building AI Applications
Learning Objectives
- Develop practical AI applications using Python
- Implement natural language processing (NLP) techniques
- Create computer vision applications
- Build recommendation systems
- Deploy machine learning models as web services
11.1 Natural Language Processing
Key Concept: Text Processing and Analysis
Natural Language Processing (NLP) is a field of AI focused on enabling computers to understand, interpret, and generate human language. Python offers powerful libraries for NLP tasks.
To get started with NLP in Python, install the necessary libraries:
pip install nltk spacy textblob gensim
Text Preprocessing
Before analyzing text data, preprocessing is essential to clean and normalize the text:
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
import string
import re
# Download required NLTK resources
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
def preprocess_text(text):
"""
Preprocess text data by performing multiple cleaning steps
"""
# Convert to lowercase
text = text.lower()
# Remove numbers and punctuation
text = re.sub(r'\d+', '', text)
text = text.translate(str.maketrans('', '', string.punctuation))
# Tokenize
tokens = word_tokenize(text)
# Remove stopwords
stop_words = set(stopwords.words('english'))
tokens = [word for word in tokens if word not in stop_words]
# Stemming
stemmer = PorterStemmer()
stemmed_tokens = [stemmer.stem(word) for word in tokens]
# Lemmatization
lemmatizer = WordNetLemmatizer()
lemmatized_tokens = [lemmatizer.lemmatize(word) for word in tokens]
return {
'original_text': text,
'tokens': tokens,
'stemmed_tokens': stemmed_tokens,
'lemmatized_tokens': lemmatized_tokens
}
# Example usage
sample_text = """Natural Language Processing (NLP) is a field of artificial intelligence that focuses on the interaction
between computers and humans using natural language. The ultimate goal of NLP is to enable computers to understand,
interpret, and generate human language in a way that is both meaningful and useful."""
processed = preprocess_text(sample_text)
print("Original Text:")
print(sample_text)
print("\nProcessed Text:")
print("Tokens:", processed['tokens'][:10], "...")
print("Stemmed:", processed['stemmed_tokens'][:10], "...")
print("Lemmatized:", processed['lemmatized_tokens'][:10], "...")
# Sentence tokenization
sentences = sent_tokenize(sample_text)
print("\nSentences:")
for i, sentence in enumerate(sentences):
print(f"{i+1}. {sentence}")
Sentiment Analysis
Sentiment analysis determines the emotional tone behind text, useful for analyzing customer feedback, social media, and more:
from textblob import TextBlob
import matplotlib.pyplot as plt
import numpy as np
def analyze_sentiment(text):
"""
Analyze the sentiment of text using TextBlob
"""
blob = TextBlob(text)
sentiment = blob.sentiment
# Polarity ranges from -1 (negative) to 1 (positive)
# Subjectivity ranges from 0 (objective) to 1 (subjective)
return {
'text': text,
'polarity': sentiment.polarity,
'subjectivity': sentiment.subjectivity,
'sentiment': 'positive' if sentiment.polarity > 0 else 'negative' if sentiment.polarity < 0 else 'neutral'
}
# Example texts
texts = [
"I absolutely love this product! It's amazing and works perfectly.",
"The service was okay, but could be better.",
"This is the worst experience I've ever had. Terrible customer service.",
"The movie was neither particularly good nor bad.",
"The staff was friendly and helpful, but the food was disappointing."
]
# Analyze sentiments
sentiments = [analyze_sentiment(text) for text in texts]
# Display results
for i, result in enumerate(sentiments):
print(f"\nText {i+1}: {result['text']}")
print(f"Polarity: {result['polarity']:.2f}")
print(f"Subjectivity: {result['subjectivity']:.2f}")
print(f"Sentiment: {result['sentiment']}")
# Visualize the results
plt.figure(figsize=(10, 6))
# Extract polarities and subjectivities
polarities = [s['polarity'] for s in sentiments]
subjectivities = [s['subjectivity'] for s in sentiments]
labels = [f"Text {i+1}" for i in range(len(texts))]
# Create scatter plot
plt.scatter(polarities, subjectivities, c=np.array(polarities), cmap='RdYlGn', s=100, alpha=0.7)
# Add labels and details
for i, (x, y) in enumerate(zip(polarities, subjectivities)):
plt.annotate(labels[i], (x, y), xytext=(5, 5), textcoords='offset points')
plt.axhline(y=0.5, color='gray', linestyle='--', alpha=0.3)
plt.axvline(x=0, color='gray', linestyle='--', alpha=0.3)
plt.title('Sentiment Analysis Results')
plt.xlabel('Polarity (Negative → Positive)')
plt.ylabel('Subjectivity (Objective → Subjective)')
plt.xlim(-1.1, 1.1)
plt.ylim(-0.1, 1.1)
plt.grid(True, alpha=0.3)
plt.colorbar(label='Sentiment Polarity')
# plt.savefig('sentiment_analysis.png')
plt.close()
Topic Modeling
Topic modeling discovers abstract topics in a collection of documents, useful for content organization and recommendation systems:
import gensim
from gensim import corpora
from gensim.models import LdaModel
# Optional: pyLDAvis can render an interactive topic visualization (pip install pyLDAvis)
# import pyLDAvis
# import pyLDAvis.gensim_models
# Sample documents
documents = [
"Machine learning is a method of data analysis that automates analytical model building.",
"Python is a programming language that lets you work quickly and integrate systems more effectively.",
"Artificial intelligence is intelligence demonstrated by machines.",
"Deep learning is part of a broader family of machine learning methods based on artificial neural networks.",
"Natural language processing is a subfield of linguistics, computer science, and artificial intelligence.",
"Computer vision is an interdisciplinary scientific field that deals with how computers can gain understanding from digital images or videos.",
"Data science is an inter-disciplinary field that uses scientific methods to extract knowledge from data.",
"Python libraries like TensorFlow and PyTorch are commonly used for machine learning and AI development.",
"Big data refers to data sets that are too large or complex to be dealt with by traditional data-processing software.",
"Cloud computing is the on-demand availability of computer system resources."
]
# Preprocess the documents
processed_docs = []
for doc in documents:
# Tokenize, remove punctuation and stopwords
tokens = preprocess_text(doc)['tokens']
processed_docs.append(tokens)
# Create a dictionary
dictionary = corpora.Dictionary(processed_docs)
# Create a document-term matrix
corpus = [dictionary.doc2bow(doc) for doc in processed_docs]
# Train the LDA model
num_topics = 3
lda_model = LdaModel(
corpus=corpus,
id2word=dictionary,
num_topics=num_topics,
passes=10,
alpha='auto',
random_state=42
)
# Print the topics
print("LDA Topics:")
for idx, topic in lda_model.print_topics(-1):
print(f"Topic {idx}: {topic}")
# Show the topic distribution for each document
print("\nTopic Distribution by Document:")
for i, doc in enumerate(corpus):
print(f"\nDocument {i+1}: \"{documents[i][:50]}...\"")
topic_distribution = lda_model.get_document_topics(doc)
for topic_id, prob in sorted(topic_distribution, key=lambda x: x[1], reverse=True):
print(f" Topic {topic_id}: {prob:.4f}")
# Function to identify the dominant topic for each document
def format_topics_sentences(ldamodel, corpus, texts):
    dominant_topics = []
    for i, row in enumerate(ldamodel[corpus]):
        # Sort this document's topics by probability, highest first
        row = sorted(row, key=lambda x: x[1], reverse=True)
        if not row:
            continue
        # Keep only the dominant (most probable) topic
        topic_num, prop_topic = row[0]
        wp = ldamodel.show_topic(topic_num)
        topic_keywords = ", ".join([word for word, prop in wp])
        dominant_topics.append([i, int(topic_num), round(prop_topic, 4), topic_keywords, texts[i]])
    return dominant_topics
# Format the results
topic_sentences = format_topics_sentences(lda_model, corpus, documents)
print("\nDominant Topic for Each Document:")
for i, topic_num, prop_topic, keywords, text in topic_sentences:
print(f"Document {i+1}: Topic {topic_num} (Probability: {prop_topic:.4f})")
print(f" Keywords: {keywords}")
print(f" Text: {text[:70]}...\n")
11.2 Computer Vision
Key Concept: Image Processing and Analysis
Computer vision enables computers to interpret and understand visual information from the world. Python provides powerful libraries for image processing and deep learning-based vision tasks.
To get started with computer vision in Python, install the necessary libraries:
pip install opencv-python pillow scikit-image tensorflow
Basic Image Processing
Let's explore basic image processing operations using OpenCV:
import cv2
import numpy as np
import matplotlib.pyplot as plt
from PIL import Image, ImageFilter, ImageEnhance
# Load an image (replace with your own image path)
image_path = "sample_image.jpg" # You can use any image for testing
try:
# OpenCV reads images in BGR format
img_cv = cv2.imread(image_path)
if img_cv is None:
raise FileNotFoundError(f"Could not open or find the image: {image_path}")
# Convert BGR to RGB for display with matplotlib
img_rgb = cv2.cvtColor(img_cv, cv2.COLOR_BGR2RGB)
# Basic image properties
height, width, channels = img_rgb.shape
print(f"Image dimensions: {width}x{height}, {channels} channels")
# Create a figure with multiple subplots
plt.figure(figsize=(15, 10))
# Display original image
plt.subplot(2, 3, 1)
plt.imshow(img_rgb)
plt.title('Original Image')
plt.axis('off')
# Grayscale conversion
img_gray = cv2.cvtColor(img_cv, cv2.COLOR_BGR2GRAY)
plt.subplot(2, 3, 2)
plt.imshow(img_gray, cmap='gray')
plt.title('Grayscale')
plt.axis('off')
# Image blurring
img_blur = cv2.GaussianBlur(img_rgb, (15, 15), 0)
plt.subplot(2, 3, 3)
plt.imshow(img_blur)
plt.title('Gaussian Blur')
plt.axis('off')
# Edge detection
edges = cv2.Canny(img_gray, 100, 200)
plt.subplot(2, 3, 4)
plt.imshow(edges, cmap='gray')
plt.title('Edge Detection')
plt.axis('off')
# Thresholding
_, thresh = cv2.threshold(img_gray, 127, 255, cv2.THRESH_BINARY)
plt.subplot(2, 3, 5)
plt.imshow(thresh, cmap='gray')
plt.title('Thresholding')
plt.axis('off')
# Image resizing
img_resized = cv2.resize(img_rgb, (width//2, height//2))
plt.subplot(2, 3, 6)
plt.imshow(img_resized)
plt.title('Resized (50%)')
plt.axis('off')
plt.tight_layout()
# plt.savefig('image_processing.png')
plt.close()
# Demonstrate PIL image processing
pil_img = Image.open(image_path)
plt.figure(figsize=(15, 10))
# Original
plt.subplot(2, 3, 1)
plt.imshow(np.array(pil_img))
plt.title('Original (PIL)')
plt.axis('off')
# Apply filters
# Blur
blur_img = pil_img.filter(ImageFilter.BLUR)
plt.subplot(2, 3, 2)
plt.imshow(np.array(blur_img))
plt.title('Blur Filter')
plt.axis('off')
# Find edges
edge_img = pil_img.filter(ImageFilter.FIND_EDGES)
plt.subplot(2, 3, 3)
plt.imshow(np.array(edge_img))
plt.title('Edge Filter')
plt.axis('off')
# Enhance contrast
enhancer = ImageEnhance.Contrast(pil_img)
enhanced_img = enhancer.enhance(1.5) # Increase contrast by 50%
plt.subplot(2, 3, 4)
plt.imshow(np.array(enhanced_img))
plt.title('Enhanced Contrast')
plt.axis('off')
# Rotate image
rotated_img = pil_img.rotate(45)
plt.subplot(2, 3, 5)
plt.imshow(np.array(rotated_img))
plt.title('Rotated 45°')
plt.axis('off')
# Convert to grayscale
gray_img = pil_img.convert('L')
plt.subplot(2, 3, 6)
plt.imshow(np.array(gray_img), cmap='gray')
plt.title('Grayscale (PIL)')
plt.axis('off')
plt.tight_layout()
# plt.savefig('pil_processing.png')
plt.close()
except Exception as e:
print(f"Error: {e}")
print("Using a placeholder image for demonstration instead.")
# Create a simple placeholder image
placeholder = np.zeros((300, 400, 3), dtype=np.uint8)
# Add some shapes to the placeholder
cv2.rectangle(placeholder, (50, 50), (200, 200), (0, 255, 0), -1)
cv2.circle(placeholder, (300, 150), 80, (0, 0, 255), -1)
cv2.line(placeholder, (50, 250), (350, 250), (255, 255, 255), 5)
# Convert BGR to RGB for display
placeholder_rgb = cv2.cvtColor(placeholder, cv2.COLOR_BGR2RGB)
plt.figure(figsize=(10, 6))
plt.imshow(placeholder_rgb)
plt.title('Placeholder Image')
plt.axis('off')
# plt.savefig('placeholder.png')
plt.close()
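To save any of these results to disk instead of (or in addition to) plotting them, OpenCV's imwrite and PIL's save method can be used. Note that imwrite expects color arrays in BGR order, so images held in RGB must be converted back first. A brief sketch, assuming the variables from the OpenCV and PIL blocks above are in scope (i.e., the sample image loaded successfully):
# Single-channel results can be written directly
cv2.imwrite('edges.png', edges)
cv2.imwrite('grayscale.png', img_gray)

# Color arrays kept in RGB order need converting back to BGR for imwrite
cv2.imwrite('blurred.png', cv2.cvtColor(img_blur, cv2.COLOR_RGB2BGR))

# PIL images are saved with their own method
enhanced_img.save('enhanced_contrast.png')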
Object Detection
Let's implement basic object detection with OpenCV's pre-trained models: a Haar cascade for face detection, plus a YOLO example (shown commented out) with simulated results for demonstration:
import cv2
import numpy as np
import matplotlib.pyplot as plt
def detect_faces(image_path):
"""
Detect faces in an image using OpenCV's pre-trained Haar Cascade classifier
"""
try:
# Load the image
img = cv2.imread(image_path)
if img is None:
raise FileNotFoundError(f"Could not open or find the image: {image_path}")
# Convert to grayscale
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
# Load the face detector
face_cascade = cv2.CascadeClassifier(cv2.data.haarcascades + 'haarcascade_frontalface_default.xml')
# Detect faces
faces = face_cascade.detectMultiScale(
gray,
scaleFactor=1.1,
minNeighbors=5,
minSize=(30, 30)
)
print(f"Found {len(faces)} faces!")
# Draw rectangles around the faces
for (x, y, w, h) in faces:
cv2.rectangle(img, (x, y), (x+w, y+h), (255, 0, 0), 2)
# Convert to RGB for display
img_rgb = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
return img_rgb, faces
except Exception as e:
print(f"Error: {e}")
return None, []
def detect_objects(image_path):
"""
    Demonstrate object detection. The YOLO-based code is included commented out,
    and simulated detections are used so the example runs without the model files.
"""
try:
# Load the image
img = cv2.imread(image_path)
if img is None:
raise FileNotFoundError(f"Could not open or find the image: {image_path}")
# Get image dimensions
height, width, _ = img.shape
# Load YOLO model (you need to download these files separately)
# Uncomment and use these lines if you have the YOLO weights and configuration
"""
net = cv2.dnn.readNet("yolov3.weights", "yolov3.cfg")
with open("coco.names", "r") as f:
classes = [line.strip() for line in f.readlines()]
# Create a blob from the image
blob = cv2.dnn.blobFromImage(img, 1/255.0, (416, 416), swapRB=True, crop=False)
net.setInput(blob)
# Get output layer names
output_layers = net.getUnconnectedOutLayersNames()
# Forward pass
layer_outputs = net.forward(output_layers)
# Initialize lists for detected objects
boxes = []
confidences = []
class_ids = []
# Process each output layer
for output in layer_outputs:
for detection in output:
scores = detection[5:]
class_id = np.argmax(scores)
confidence = scores[class_id]
if confidence > 0.5: # Confidence threshold
# Object detected
center_x = int(detection[0] * width)
center_y = int(detection[1] * height)
w = int(detection[2] * width)
h = int(detection[3] * height)
# Rectangle coordinates
x = int(center_x - w / 2)
y = int(center_y - h / 2)
boxes.append([x, y, w, h])
confidences.append(float(confidence))
class_ids.append(class_id)
# Apply non-maximum suppression
indices = cv2.dnn.NMSBoxes(boxes, confidences, 0.5, 0.4)
# Draw bounding boxes and labels
for i in indices:
i = i[0] if isinstance(i, (list, np.ndarray)) else i
x, y, w, h = boxes[i]
label = f"{classes[class_ids[i]]}: {confidences[i]:.2f}"
cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)
cv2.putText(img, label, (x, y - 10), cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 2)
"""
# Since YOLO requires additional files (weights/config), let's create a placeholder
# This simulates object detection results for demonstration
objects = [
{"label": "Person", "confidence": 0.92, "box": [50, 50, 100, 200]},
{"label": "Car", "confidence": 0.85, "box": [200, 150, 150, 100]},
{"label": "Dog", "confidence": 0.78, "box": [350, 200, 80, 60]}
]
# Draw bounding boxes and labels on the placeholder
for obj in objects:
x, y, w, h = obj["box"]
label = f"{obj['label']}: {obj['confidence']:.2f}"
cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)
cv2.putText(img, label, (x, y - 10), cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 2)
# Convert to RGB for display
img_rgb = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
return img_rgb, objects
except Exception as e:
print(f"Error: {e}")
return None, []
# Demonstration with placeholder/sample images
try:
# Try to use the sample image for face detection
face_result, faces = detect_faces(image_path)
object_result, objects = detect_objects(image_path)
plt.figure(figsize=(12, 6))
if face_result is not None:
plt.subplot(1, 2, 1)
plt.imshow(face_result)
plt.title(f'Face Detection ({len(faces)} faces)')
plt.axis('off')
if object_result is not None:
plt.subplot(1, 2, 2)
plt.imshow(object_result)
plt.title(f'Object Detection ({len(objects)} objects)')
plt.axis('off')
plt.tight_layout()
# plt.savefig('detection_results.png')
plt.close()
except Exception as e:
print(f"Error in demonstration: {e}")
# Create a placeholder for the demonstration
placeholder = np.zeros((400, 600, 3), dtype=np.uint8)
placeholder[:] = (240, 240, 240) # Light gray background
# Add text
font = cv2.FONT_HERSHEY_SIMPLEX
cv2.putText(placeholder, "Object Detection Placeholder", (120, 60), font, 1, (0, 0, 0), 2)
# Draw some "detected objects" with bounding boxes
objects = [
{"label": "Person", "confidence": 0.92, "box": [50, 100, 100, 200]},
{"label": "Car", "confidence": 0.85, "box": [250, 150, 150, 100]},
{"label": "Dog", "confidence": 0.78, "box": [450, 200, 80, 60]}
]
for obj in objects:
x, y, w, h = obj["box"]
label = f"{obj['label']}: {obj['confidence']:.2f}"
cv2.rectangle(placeholder, (x, y), (x + w, y + h), (0, 255, 0), 2)
cv2.putText(placeholder, label, (x, y - 10), font, 0.5, (0, 0, 0), 2)
# Display placeholder
placeholder_rgb = cv2.cvtColor(placeholder, cv2.COLOR_BGR2RGB)
plt.figure(figsize=(10, 6))
plt.imshow(placeholder_rgb)
plt.title('Object Detection (Placeholder)')
plt.axis('off')
# plt.savefig('detection_placeholder.png')
plt.close()
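The same Haar cascade also works on live video. The following is a minimal sketch of real-time face detection from a webcam; it assumes a camera is available at device index 0 and that a desktop environment is present for cv2.imshow:
# Real-time face detection from a webcam (press 'q' to quit)
cap = cv2.VideoCapture(0)
face_cascade = cv2.CascadeClassifier(cv2.data.haarcascades + 'haarcascade_frontalface_default.xml')
while cap.isOpened():
    ret, frame = cap.read()
    if not ret:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    for (x, y, w, h) in face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5):
        cv2.rectangle(frame, (x, y), (x + w, y + h), (255, 0, 0), 2)
    cv2.imshow('Face Detection', frame)
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break
cap.release()
cv2.destroyAllWindows()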
11.3 Recommendation Systems
Key Concept: Personalized Recommendations
Recommendation systems suggest items to users based on their preferences and behavior. Python provides tools for building different types of recommenders, from simple collaborative filtering to complex deep learning models.
Building a Simple Recommender
Let's implement a basic collaborative filtering recommender system:
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
import matplotlib.pyplot as plt
import seaborn as sns
# Create a sample user-item ratings matrix
# Each row represents a user, each column represents an item (e.g., movie)
# The values are ratings given by users to items
ratings = pd.DataFrame({
'Item1': [5, 4, 0, 0, 1],
'Item2': [4, 0, 3, 4, 0],
'Item3': [1, 0, 5, 4, 3],
'Item4': [0, 3, 5, 0, 4],
'Item5': [2, 5, 0, 3, 5]
}, index=['User1', 'User2', 'User3', 'User4', 'User5'])
print("User-Item Ratings Matrix:")
print(ratings)
# Visualize the ratings matrix
plt.figure(figsize=(10, 6))
sns.heatmap(ratings, annot=True, cmap="YlGnBu", cbar_kws={'label': 'Rating'})
plt.title('User-Item Ratings Matrix')
plt.tight_layout()
# plt.savefig('ratings_matrix.png')
plt.close()
# Item-based collaborative filtering
def item_based_recommendations(ratings_matrix, item_similarity_matrix, user_id, num_recommendations=2):
"""
Generate item-based recommendations for a user
"""
# Get user ratings
user_ratings = ratings_matrix.loc[user_id].values.reshape(1, -1)
# Create a mask for already rated items
already_rated = user_ratings[0] > 0
# Calculate the predicted ratings
# Weighted sum of item similarities and user ratings
weighted_sum = np.dot(item_similarity_matrix, user_ratings.T)
# Sum of similarities for normalization
similarity_sums = np.sum(np.abs(item_similarity_matrix[:, already_rated]), axis=1)
similarity_sums[similarity_sums == 0] = 1 # Avoid division by zero
# Calculate predicted ratings
predicted_ratings = weighted_sum / similarity_sums.reshape(-1, 1)
# Convert to a more usable format
predicted_ratings = predicted_ratings.flatten()
# Mask out already rated items
predicted_ratings[already_rated] = 0
# Get top recommendations
item_indices = np.argsort(predicted_ratings)[::-1][:num_recommendations]
return {
'item_indices': item_indices,
'predicted_ratings': predicted_ratings[item_indices]
}
# Calculate item-item similarity matrix using cosine similarity
item_similarity = cosine_similarity(ratings.T)
item_similarity_df = pd.DataFrame(item_similarity,
index=ratings.columns,
columns=ratings.columns)
print("\nItem-Item Similarity Matrix:")
print(item_similarity_df)
# Visualize the item similarity matrix
plt.figure(figsize=(8, 6))
sns.heatmap(item_similarity_df, annot=True, cmap="coolwarm", vmin=-1, vmax=1,
cbar_kws={'label': 'Cosine Similarity'})
plt.title('Item-Item Similarity Matrix')
plt.tight_layout()
# plt.savefig('item_similarity.png')
plt.close()
# Generate recommendations for each user
print("\nItem-Based Collaborative Filtering Recommendations:")
for user in ratings.index:
recs = item_based_recommendations(ratings, item_similarity, user)
rec_items = [ratings.columns[i] for i in recs['item_indices']]
rec_ratings = recs['predicted_ratings']
print(f"\n{user}:")
for item, rating in zip(rec_items, rec_ratings):
print(f" Recommended: {item} (Predicted rating: {rating:.2f})")
# User-based collaborative filtering
def user_based_recommendations(ratings_matrix, user_similarity_matrix, user_id, num_recommendations=2):
"""
Generate user-based recommendations for a user
"""
# Get index of the target user
user_idx = list(ratings_matrix.index).index(user_id)
# Get similarities between the target user and all other users
user_similarities = user_similarity_matrix[user_idx]
# Create a mask for the target user's already rated items
user_ratings = ratings_matrix.loc[user_id].values
already_rated = user_ratings > 0
# Initialize predicted ratings
predicted_ratings = np.zeros(len(ratings_matrix.columns))
# For each item that the user hasn't rated
for item_idx in range(len(ratings_matrix.columns)):
if not already_rated[item_idx]:
# Get ratings for this item from all users
item_ratings = ratings_matrix.iloc[:, item_idx].values
# Create a mask for users who have rated this item
rated_mask = item_ratings > 0
# If no other user has rated this item, skip
if np.sum(rated_mask) == 0:
continue
# Calculate the weighted average rating
# Weighted by similarity between the target user and other users
weighted_sum = np.sum(user_similarities[rated_mask] * item_ratings[rated_mask])
similarity_sum = np.sum(np.abs(user_similarities[rated_mask]))
if similarity_sum > 0:
predicted_ratings[item_idx] = weighted_sum / similarity_sum
# Get top recommendations (items with highest predicted ratings)
# Only consider items the user hasn't rated yet
    unrated_item_indices = np.where(~already_rated)[0]
unrated_pred_ratings = predicted_ratings[unrated_item_indices]
# Sort by predicted rating
top_indices = np.argsort(unrated_pred_ratings)[::-1][:num_recommendations]
return {
'item_indices': unrated_item_indices[top_indices],
'predicted_ratings': unrated_pred_ratings[top_indices]
}
# Calculate user-user similarity matrix
user_similarity = cosine_similarity(ratings)
user_similarity_df = pd.DataFrame(user_similarity,
index=ratings.index,
columns=ratings.index)
print("\nUser-User Similarity Matrix:")
print(user_similarity_df)
# Visualize the user similarity matrix
plt.figure(figsize=(8, 6))
sns.heatmap(user_similarity_df, annot=True, cmap="coolwarm", vmin=-1, vmax=1,
cbar_kws={'label': 'Cosine Similarity'})
plt.title('User-User Similarity Matrix')
plt.tight_layout()
# plt.savefig('user_similarity.png')
plt.close()
# Generate user-based recommendations for each user
print("\nUser-Based Collaborative Filtering Recommendations:")
for user in ratings.index:
recs = user_based_recommendations(ratings, user_similarity, user)
rec_items = [ratings.columns[i] for i in recs['item_indices']]
rec_ratings = recs['predicted_ratings']
print(f"\n{user}:")
for item, rating in zip(rec_items, rec_ratings):
print(f" Recommended: {item} (Predicted rating: {rating:.2f})")
# Compare recommendations from both approaches
print("\nComparison of Recommendation Approaches:")
for user in ratings.index:
item_recs = item_based_recommendations(ratings, item_similarity, user)
user_recs = user_based_recommendations(ratings, user_similarity, user)
item_rec_items = [ratings.columns[i] for i in item_recs['item_indices']]
user_rec_items = [ratings.columns[i] for i in user_recs['item_indices']]
print(f"\n{user}:")
print(f" Item-based recommendations: {', '.join(item_rec_items)}")
print(f" User-based recommendations: {', '.join(user_rec_items)}")
11.4 Deploying Machine Learning Models
Key Concept: Model Deployment
Deploying machine learning models makes them accessible to applications via APIs, batch processing, or embedded systems. Python provides various frameworks for model deployment.
Creating a Model API with Flask
Let's build a simple API for a machine learning model:
"""
# app.py - Save this to a separate file to run it
from flask import Flask, request, jsonify
import pickle
import numpy as np
from sklearn.preprocessing import StandardScaler
app = Flask(__name__)
# Load the pre-trained model
# Assuming you have trained and saved a model using pickle
try:
with open('model.pkl', 'rb') as f:
model = pickle.load(f)
with open('scaler.pkl', 'rb') as f:
scaler = pickle.load(f)
print("Model and scaler loaded successfully")
except FileNotFoundError:
# For demonstration purposes, we'll create a simple model
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
print("Creating a sample model for demonstration")
iris = load_iris()
X, y = iris.data, iris.target
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_scaled, y)
# Save the model and scaler (optional)
with open('model.pkl', 'wb') as f:
pickle.dump(model, f)
with open('scaler.pkl', 'wb') as f:
pickle.dump(scaler, f)
@app.route('/predict', methods=['POST'])
def predict():
# Get request data
data = request.get_json(force=True)
# Check if 'features' is in the request
if 'features' not in data:
return jsonify({'error': 'No features provided in the request'}), 400
# Extract features
features = data['features']
try:
# Convert to numpy array
features_array = np.array(features).reshape(1, -1)
# Scale the features
features_scaled = scaler.transform(features_array)
# Make prediction
prediction = model.predict(features_scaled)
# For iris dataset, map class indices to names
        class_names = ['setosa', 'versicolor', 'virginica']
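        # NOTE: the original listing breaks off at this point. What follows is
        # a minimal sketch of how the endpoint might finish and how the app
        # could be started; the JSON field names below are illustrative assumptions.
        predicted_class = class_names[int(prediction[0])]
        return jsonify({
            'prediction': int(prediction[0]),
            'class_name': predicted_class
        })
    except Exception as e:
        return jsonify({'error': str(e)}), 400

if __name__ == '__main__':
    app.run(debug=True, port=5000)
"""
Once the server is running (python app.py), the endpoint can be exercised from another script with the requests library; the URL and payload below assume the default Flask host and port:
import requests

sample = {'features': [5.1, 3.5, 1.4, 0.2]}  # measurements of a single iris flower
response = requests.post('http://127.0.0.1:5000/predict', json=sample)
print(response.json())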