{% extends "layout.html" %} {% block content %}
Story-style intuition: The Supermarket Detective
Imagine you're a detective hired by a supermarket. Your mission is to analyze thousands of shopping receipts (transactions) to find hidden patterns. You soon notice a classic pattern: "Customers who buy bread also tend to buy butter." This is a valuable clue! The store can place bread and butter closer together to increase sales. The Apriori Algorithm is the systematic method this detective uses to sift through all the receipts and find these "frequently bought together" item combinations and turn them into powerful rules. This whole process is called Market Basket Analysis.
The Apriori Algorithm is a classic algorithm used for association rule mining. Its main goal is to find relationships and patterns between items in large transactional datasets. It generates rules in the format "If A, then B," helping businesses understand customer behavior and make smarter decisions.
To be a good supermarket detective, you need to know the lingo. The three most important metrics are Support, Confidence, and Lift.
Example Scenario: Let's say we have 100 shopping receipts. Bread appears on 80 of them, butter on 70, and the pair {Bread, Butter} on 60.
$$ \text{Support}(X) = \frac{\text{Number of transactions containing } X}{\text{Total number of transactions}} $$
Example: The support for {Bread, Butter} is 60/100 = 0.6, or 60%. This tells us that 60% of all shoppers bought bread and butter together. High support means the itemset is frequent.
$$ \text{Confidence}(X \Rightarrow Y) = \frac{\text{Support}(X \cup Y)}{\text{Support}(X)} $$
Example: Confidence({Bread} => {Butter}) = Support({Bread, Butter}) / Support({Bread}) = 0.6 / 0.8 = 0.75, or 75%. This means that 75% of customers who bought bread also bought butter. High confidence makes the rule strong.
$$ \text{Lift}(X \Rightarrow Y) = \frac{\text{Confidence}(X \Rightarrow Y)}{\text{Support}(Y)} $$
Example: Lift({Bread} => {Butter}) = Confidence({Bread} => {Butter}) / Support({Butter}) = 0.75 / 0.7 ≈ 1.07.
• Lift > 1: Positive association (buying bread makes buying butter more likely).
• Lift = 1: No association; the items are bought independently of each other.
• Lift < 1: Negative association (buying bread makes buying butter less likely).
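To make these numbers concrete, here is a minimal Python sanity check that recomputes all three metrics from the 100-receipt scenario above (the counts of 80 bread receipts, 70 butter receipts, and 60 combined receipts are the ones assumed in the examples):
# --- Recompute Support, Confidence, and Lift by hand ---
n_receipts = 100
n_bread = 80    # receipts containing bread
n_butter = 70   # receipts containing butter
n_both = 60     # receipts containing both bread and butter

support_both = n_both / n_receipts        # 0.60
support_bread = n_bread / n_receipts      # 0.80
support_butter = n_butter / n_receipts    # 0.70

confidence = support_both / support_bread   # 0.75
lift = confidence / support_butter          # ~1.07

print(f"Support: {support_both:.2f}, Confidence: {confidence:.2f}, Lift: {lift:.2f}")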
The Detective's Golden Rule: Our detective quickly realizes a simple but powerful truth: If customers rarely buy {Milk}, then they will *definitely* rarely buy the combination {Milk, Bread, Eggs}. Why waste time checking the records for a combination containing an already unpopular item? This is the Apriori Principle.
The principle states: "All non-empty subsets of a frequent itemset must also be frequent." This is the core idea that makes the Apriori algorithm efficient. It allows the algorithm to "prune" the search space by eliminating a huge number of candidate itemsets. If {Milk} is infrequent, any larger itemset containing {Milk} is guaranteed to be infrequent and can be ignored.
The algorithm works iteratively, building up larger and larger frequent itemsets level by level: it first finds all frequent 1-itemsets, joins them into candidate 2-itemsets, prunes any candidate that has an infrequent subset, counts support for the survivors, and repeats with 3-itemsets and beyond until no new frequent itemsets appear.
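To see the generate-and-prune loop in action, here is a minimal pure-Python sketch of the level-wise search. It is a teaching toy, not the optimized implementation real libraries use; the function name apriori_sketch and its structure are our own illustration:
from itertools import combinations

def apriori_sketch(transactions, min_support):
    # Level-wise frequent-itemset search (teaching sketch, not optimized).
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    # Level 1: find all frequent single items.
    items = {item for t in transactions for item in t}
    current = [frozenset([i]) for i in items if support(frozenset([i])) >= min_support]
    frequent = list(current)

    k = 2
    while current:
        # Generate candidate k-itemsets by unioning frequent (k-1)-itemsets.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Apriori pruning: every (k-1)-subset of a candidate must itself be frequent.
        prev = set(current)
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        # Count supports and keep only the candidates that survive.
        current = [c for c in candidates if support(c) >= min_support]
        frequent.extend(current)
        k += 1
    return frequent

# Example: apriori_sketch([['Bread', 'Butter'], ['Bread'], ['Bread', 'Butter', 'Milk']], 0.6)
# returns the frequent itemsets {Bread}, {Butter}, and {Bread, Butter}.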
Here, we'll be a supermarket detective with a small set of receipts. We need to prepare our data in a specific way (a one-hot encoded format) where each row is a transaction and each column is an item. Then, we'll use the `apriori` function to find frequent itemsets and `association_rules` to find the strong relationships.
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# --- 1. Create a Sample Dataset ---
# This represents 5 shopping receipts.
dataset = [['Milk', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Dill', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Milk', 'Apple', 'Kidney Beans', 'Eggs'],
           ['Milk', 'Unicorn', 'Corn', 'Kidney Beans', 'Yogurt'],
           ['Corn', 'Onion', 'Kidney Beans', 'Ice cream', 'Eggs']]

# --- 2. Prepare Data in One-Hot Encoded Format ---
# mlxtend's apriori needs a DataFrame of True/False values:
# one row per transaction, one column per item.
te = TransactionEncoder()
te_ary = te.fit(dataset).transform(dataset)
df = pd.DataFrame(te_ary, columns=te.columns_)
# --- 3. Find Frequent Itemsets with Apriori ---
# We set min_support to 0.6, meaning we only want itemsets
# that appear in at least 60% of the transactions (3 out of 5).
frequent_itemsets = apriori(df, min_support=0.6, use_colnames=True)
print("--- Frequent Itemsets (Support >= 60%) ---")
print(frequent_itemsets)
# --- 4. Generate Association Rules ---
# We generate rules that have a confidence of at least 70%.
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.7)
# Let's sort the rules by their "lift" to see the strongest relationships.
sorted_rules = rules.sort_values(by='lift', ascending=False)
print("\n--- Strong Association Rules (Confidence >= 70%) ---")
print(sorted_rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']])
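On this toy dataset, the rule {Onion} => {Eggs} should stand out: every receipt with onions also contains eggs (confidence 1.0), and onion buyers are 25% more likely than the average shopper to buy eggs (lift 1.25). If you only want rules with a genuinely positive association, a simple pandas filter on lift does the job:
# Keep only rules where the items appear together more often than chance.
positive_rules = sorted_rules[sorted_rules['lift'] > 1.0]
print(positive_rules[['antecedents', 'consequents', 'lift']])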
Check Your Understanding
1. Q: What does the Apriori Principle state, and why is it important? A: It states that all non-empty subsets of a frequent itemset must also be frequent. This is important because it lets the algorithm prune a massive number of candidate itemsets early on, making the process much more efficient.
2. Q: If Support({A, B}) = 20% and Support({A}) = 30%, what is the confidence of {A} => {B}? A: Confidence({A} => {B}) = Support({A, B}) / Support({A}) = 20% / 30% ≈ 66.7%.
3. Q: The rule {Diapers} => {Beer} has a lift of 3.0. How do you interpret it? A: Customers who buy diapers are 3 times as likely to buy beer as a randomly chosen customer, which indicates a strong positive association.
4. Q: What is the main performance bottleneck of the Apriori algorithm? A: The candidate generation step. In each pass it can create a very large number of potential itemsets that must be checked against the entire database, which is slow and memory-intensive (see the quick calculation below).
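To see why candidate generation blows up, here is a quick back-of-the-envelope calculation using only the Python standard library (the 1,000-item figure is a hypothetical illustration):
from math import comb

# With 1,000 frequent single items, the second pass alone must consider
# every possible pair as a candidate 2-itemset.
n_frequent_items = 1_000
print(comb(n_frequent_items, 2))  # 499500 candidate pairs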
The Story: Decoding the Supermarket Detective's Notebook