milwright committed · verified · Commit 8e635f8 · Parent(s): 8597f42

Update README.md

Files changed (1): README.md (+45 −94)
README.md CHANGED
@@ -11,127 +11,78 @@ thumbnail: >-
 
  # Cloze Reader
 
- ## When Assessment Becomes Training Data, and Training Data Becomes Assessment
 
- In 1953, Wilson Taylor proposed the "cloze procedure" as a measurement tool for reading comprehension and text difficulty. The method was elegantly simple: delete every nth word from a passage, ask readers to fill in the blanks, and score their accuracy. Taylor argued that successful completion demonstrated not mere vocabulary knowledge but genuine contextual understanding—the ability to infer meaning from surrounding linguistic patterns. By the 1960s, cloze testing had become standard in educational assessment, literacy research, and language teaching. Its appeal lay in its objectivity: unlike essay grading, cloze scores could be automated, quantified, compared across populations.
 
- Sixty-five years later, Google researchers published "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." BERT's innovation was masked language modeling: randomly mask 15% of tokens in a text, train the model to predict the missing words. The researchers didn't cite Taylor. They didn't frame this as a pedagogical technique. Yet they had independently reinvented the cloze procedure as a machine learning objective. BERT learned language by solving millions of fill-in-the-blank exercises, training on the same task that had measured human comprehension since the Eisenhower administration.
 
- This convergence wasn't coincidental. Both cloze testing and masked language modeling operate on the same premise: understanding language means predicting from context. If you can accurately fill in blanks, you've demonstrated comprehension—whether you're a student in 1960s Kansas or a transformer model in 2018. The methodology traveled from educational psychology to computational linguistics because it captured something fundamental about how meaning works: language is redundant, predictable, inferable from surrounding structure.
 
- Both approaches also share the same constraints. Educational researchers identified "clozability"—the phenomenon that some words are easier to predict than others due to context salience, limited synonyms, or statistical frequency. Machine learning researchers independently discovered the same issue: certain tokens are trivially predictable from context, while others require deeper reasoning. Zhang & Hashimoto (2021) showed that masked language models learn statistical and syntactic dependencies—exactly what cloze tests aim to measure in humans. The parallel is not superficial.
 
- But now the loop has closed in an unexpected way. BERT learned language from billions of cloze-like exercises extracted from web text, books, Wikipedia; its successors, including Google's Gemma models (trained on the closely related objective of predicting each next token from context), inherit the same paradigm. Now, with Cloze Reader, those same models generate cloze tests for humans. The AI that learned language by filling in blanks now decides which blanks you should fill in. The training methodology has become the assessment tool, and the assessment tool has become training data.
 
- ## What This Game Explores
 
- Cloze Reader uses Google's open-weight Gemma-3-27b model to transform Project Gutenberg literature into dynamically generated cloze exercises. The model scans passages, selects vocabulary to remove, generates contextual hints, and provides conversational guidance. Every passage is fresh, every blank algorithmically chosen, every hint synthesized in real time.
 
- This isn't just automated test generation. It's an investigation into what happens when the twin histories of educational assessment and machine learning collapse into each other. Consider:
 
- **Standardization vs. Serendipity:** Educational cloze tests sought standardization—predictable difficulty, comparable scores, systematic progression. Machine learning cloze tasks sought diversity—randomized masking, varied contexts, statistical coverage. Using Gemma models on Project Gutenberg's 70,000-book corpus introduces radical serendipity: you might encounter Victorian Gothic prose, 1920s adventure serials, obscure 19th-century essays, forgotten feminist manifestos. The algorithm selects passages and words through statistical patterns trained on internet-scale text, not curriculum design. What does assessment mean when no human predetermined what counts as "appropriate difficulty"?
 
- **Inference vs. Memorization:** Traditional cloze tests measured whether students could infer from context rather than recall from memory. Educational researchers have long critiqued cloze procedures for measuring primarily local, sentence-level inference (surface coherence) rather than global text structure or pragmatic reasoning. Machine learning critics make parallel arguments: masked language models exploit spurious statistical regularities rather than genuine semantic understanding. When a model trained on surface patterns generates tests, and humans solve those tests using similar heuristics, where is comprehension actually happening? The distinction between understanding and statistical correlation becomes harder to maintain on both sides.
 
- **Automation and Authority:** Educational assessment historically required human expertise—teachers selecting texts, choosing appropriate blanks, evaluating answers. Automated testing promised efficiency but was criticized for reducing learning to quantifiable metrics. Now the automation is complete: an algorithmic system with no pedagogical training, no curriculum knowledge, no understanding of individual learners generates and evaluates exercises. It runs on open-weight models anyone can download, modify, or interrogate. What happens to authority over what constitutes "correct" reading comprehension when assessment moves from institutional gatekeeping to open algorithmic systems?
 
- **The Feedback Loop:** Most critically, this is a recursive system. Gemma models were trained partly on digitized books—including many from Project Gutenberg. The texts they learned from become the texts they generate exercises from. The model learned language patterns from Victorian literature, then uses those patterns to test human understanding of Victorian literature. Meanwhile, interactions with this game could theoretically become training data for future models. Assessment data becomes training data becomes assessment tools becomes training data. At what point does the distinction between learning and evaluation dissolve entirely?
 
- **The Exact-Word Problem:** Educational cloze testing has long debated whether to accept only exact matches or score semantic/grammatical equivalents (synonyms, morphological variants). This game enforces exact-word matching with some suffix normalization, mirroring how masked language models are trained on exact token prediction. Both approaches may penalize valid alternatives. When you type "sad" but the answer was "melancholy," have you failed to comprehend the passage—or simply chosen a different word from the same semantic field? This scoring problem exists identically in human assessment and algorithmic evaluation.
 
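- A minimal sketch of that scoring rule, assuming a hand-picked suffix list and hypothetical function names (the actual logic lives in `clozeGameEngine.js`):
-
- ```javascript
- // Illustrative only: exact match first, then a crude shared-stem check.
- const SUFFIXES = ["ing", "ed", "es", "s", "ly"];
-
- function normalize(word) {
-   const w = word.trim().toLowerCase();
-   for (const suffix of SUFFIXES) {
-     if (w.endsWith(suffix) && w.length - suffix.length >= 3) {
-       return w.slice(0, -suffix.length); // strip one known suffix
-     }
-   }
-   return w;
- }
-
- function isCorrect(guess, answer) {
-   if (guess.trim().toLowerCase() === answer.trim().toLowerCase()) return true;
-   return normalize(guess) === normalize(answer);
- }
-
- console.log(isCorrect("walked", "walking")); // true: shared stem "walk"
- console.log(isCorrect("sad", "melancholy")); // false: synonyms still fail
- ```
 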
- ## The Research Context
 
- Recent scholarship explicitly bridges cloze assessment and masked language modeling:
 
- - **Matsumori et al. (2023)** built CLOZER, a system using masked language models to generate open cloze questions for L2 English learners, demonstrating practical pedagogical applications
- - **Ondov et al. (2024, NAACL)** argue: "The cloze training objective of Masked Language Models makes them a natural choice for generating plausible distractors for human cloze questions"
- - **Zhang & Hashimoto (2021)** analyzed the inductive biases of masked tokens, showing that models learn statistical and syntactic dependencies through the same mechanisms cloze tests measure in humans
 
- This project sits at the intersection of these research trajectories—using the tools that now generate assessments to explore what happens when the boundary between human learning and machine training dissolves.
 
- ## A Game, Not a Conclusion
-
- Cloze Reader doesn't resolve these tensions. It stages them. Through vintage aesthetics and classic texts, it creates a space where the convergence of educational assessment and machine learning becomes palpable. You're playing a literacy game designed by an algorithm that learned literacy by playing the same game billions of times. Every passage is a historical text processed by a model trained on historical texts. Every hint comes from a system that doesn't "understand" in any human sense but can nonetheless guide you toward understanding.
-
- The experience raises more questions than it answers. Is this pedagogy or pattern replication? Assessment or performance? Human learning or collaborative prediction with a statistical engine? These aren't rhetorical questions—they're open empirical questions about what education looks like when the tools we use to measure learning are built from the same processes we're trying to measure.
-
- ## How It Works
-
- **Single-Model Architecture:** The system uses Google's Gemma-3-27b model for all operations—analyzing passages, selecting words to mask, generating contextual hints, and powering the chat interface. The model handles both assessment design and pedagogical guidance through the same algorithmic system.
-
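- To make that concrete, a sketch of one such call through an OpenAI-compatible chat endpoint; the prompt, model id, and response handling are assumptions, not the project's actual code (see `aiService.js`):
-
- ```javascript
- // Hypothetical word-selection request via OpenRouter.
- async function selectWordsToMask(passage, blankCount, apiKey) {
-   const res = await fetch("https://openrouter.ai/api/v1/chat/completions", {
-     method: "POST",
-     headers: {
-       Authorization: `Bearer ${apiKey}`,
-       "Content-Type": "application/json",
-     },
-     body: JSON.stringify({
-       model: "google/gemma-3-27b-it", // assumed OpenRouter model id
-       messages: [{
-         role: "user",
-         content: `Pick ${blankCount} words in the passage below that a reader ` +
-           `could infer from context. Reply with a JSON array of those words.\n\n` +
-           passage,
-       }],
-     }),
-   });
-   const data = await res.json();
-   // Assumes the model returns clean JSON; production code would validate.
-   return JSON.parse(data.choices[0].message.content);
- }
- ```
-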
- **Progressive Levels:** The game implements a level system (1-5 with 1 blank, 6-10 with 2 blanks, 11+ with 3 blanks) that scaffolds difficulty through word length constraints, historical period selection, and hint disclosure. Early levels use 1900s texts and show first+last letters; advanced levels draw from any era and provide only first letters. Each round presents two passages from different books, requiring consistent performance across rounds before advancing.
-
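- Read as configuration, that scaffolding might look like the following sketch (field names and the exact cutoffs for era and hint style are assumptions):
-
- ```javascript
- // Illustrative level table; the real rules live in clozeGameEngine.js.
- function levelConfig(level) {
-   return {
-     blanks: level <= 5 ? 1 : level <= 10 ? 2 : 3,
-     era: level <= 5 ? "1900s" : "any",         // early levels use 1900s texts
-     hint: level <= 5 ? "first+last" : "first", // letters revealed per blank
-     passagesPerRound: 2,
-   };
- }
-
- levelConfig(7); // { blanks: 2, era: "any", hint: "first", passagesPerRound: 2 }
- ```
-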
- **Serendipitous Selection:** Passages stream directly from Hugging Face's Project Gutenberg dataset. The model selects words based on its training rather than curricular logic—sometimes choosing obvious vocabulary, sometimes obscure terms, sometimes generating exercises that are trivially easy or frustratingly hard. This unpredictability is a feature: it reveals how algorithmic assessment differs from human-designed pedagogy.
-
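- Streaming of this kind can go through the Hugging Face datasets server's rows endpoint; in the sketch below the dataset id, config, and row field are assumptions (the real request lives in `bookDataService.js`):
-
- ```javascript
- // Fetch a window of rows from a public dataset, no auth required.
- async function fetchGutenbergTexts(offset = 0, length = 10) {
-   const url = new URL("https://datasets-server.huggingface.co/rows");
-   url.search = new URLSearchParams({
-     dataset: "manu/project_gutenberg", // assumed dataset id
-     config: "en",
-     split: "train",
-     offset: String(offset),
-     length: String(length),
-   });
-   const res = await fetch(url);
-   if (!res.ok) throw new Error(`datasets-server returned ${res.status}`);
-   const { rows } = await res.json();
-   return rows.map((r) => r.row.text); // assumed text field
- }
- ```
-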
- **Chat as Scaffold:** Click the 💬 icon beside any blank to engage the model in conversation. It attempts to guide you through Socratic questioning, semantic clues, and contextual hints—replicating what a tutor might do, constrained by what a language model trained on text prediction can actually accomplish.
-
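- Behind that interaction is ordinary conversation state; a sketch of the shape `conversationManager.js` might maintain (all names here are assumptions):
-
- ```javascript
- // Hypothetical per-blank conversation record.
- function createConversation(blankId) {
-   return {
-     blankId,
-     messages: [
-       { role: "system", content: "Guide the reader with hints; never reveal the word." },
-     ],
-   };
- }
-
- function addTurn(conversation, role, content) {
-   conversation.messages.push({ role, content });
-   return conversation.messages; // the full history is resent on each model call
- }
- ```
-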
- The system filters out dictionaries, technical documentation, and poetry—ensuring narrative prose where blanks are theoretically inferable from context, even if the model's choices sometimes suggest otherwise.
-
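- One plausible shape for such a filter, offered purely as illustration (the project's actual criteria are not shown here):
-
- ```javascript
- // Crude genre heuristics: title keywords plus line-length statistics.
- const EXCLUDED_TITLES = /dictionary|encyclopedia|manual|handbook|poems|poetry/i;
-
- function looksLikeNarrativeProse(title, text) {
-   if (EXCLUDED_TITLES.test(title)) return false;
-   const lines = text.split("\n").filter((l) => l.trim().length > 0);
-   if (lines.length === 0) return false;
-   const avgLen = lines.reduce((n, l) => n + l.length, 0) / lines.length;
-   return avgLen > 40; // short lines suggest verse, not prose
- }
- ```
-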
- ## Technology
-
- **Vanilla JavaScript, No Build Step:** The application runs entirely in the browser using ES6 modules—no webpack, no bundler, no compilation. This architectural choice mirrors the project's conceptual interests: keeping the machinery visible and modifiable rather than obscured behind layers of tooling. A minimal FastAPI backend serves static files and injects API keys; everything else happens client-side.
-
- **Open-Weight Models:** Uses Google's Gemma-3-27b model (27 billion parameters) via OpenRouter, or alternatively connects to local LLM servers (LM Studio, etc.) on port 1234 with smaller models like Gemma-3-12b. The choice of open-weight models is deliberate: these systems can be downloaded, inspected, run locally, modified. When assessment becomes algorithmic, transparency about the algorithm matters. You can examine exactly which model is generating your exercises, run the same models yourself, experiment with alternatives.
-
- **Streaming from Public Archives:** Book data streams directly from Hugging Face's mirror of Project Gutenberg's corpus—public domain texts, open dataset infrastructure, no proprietary content libraries. The entire pipeline from literature to exercises relies on openly accessible resources, making the system reproducible and auditable.
-
- ## Running Locally with Docker
-
- To run the Cloze Reader application locally using Docker:
-
- 1. **Build the Docker image**:
- ```bash
- docker build -t cloze-reader .
- ```
-
- 2. **Run the container**:
- ```bash
- docker run -p 7860:7860 cloze-reader
- ```
 
- 3. **Access the application**:
- Open your browser and navigate to `http://localhost:7860`
 
- ### Prerequisites
- - Docker installed on your system
- - Port 7860 available on your machine
 
- ## Local LLM Integration
 
- Cloze Reader supports running with a local LLM server instead of the OpenRouter API:
-
- ### Setup
- 1. **Start your local LLM server** on port 1234 (e.g., using LM Studio with Gemma-3-12b)
- 2. **Run the development server**:
- ```bash
- make dev  # or: python3 local-server.py 8000
- ```
- 3. **Access with local LLM**:
- - Navigate to `http://localhost:8000/index.html?local=true`
- - The `?local=true` parameter switches from OpenRouter to your local LLM
-
- ### Local LLM Features
- - **No API key required** - works entirely offline with your local model
- - **Automatic response cleaning** - handles local LLM output artifacts
- - **Compatible with LM Studio** and other OpenAI-compatible local servers
- - **Same game experience** - all features work identically to cloud version
-
- ### Testing Local Integration
- - Test page: `http://localhost:8000/test-local-llm.html?local=true`
- - Stress test script: `node test-local-llm.js`
- - Direct integration test available in test files
 
  ## Architecture
- This is a **vanilla JavaScript modular application** with no build step. Key architectural patterns:
 
  **Module Organization:**
- - `app.js` - Main application controller, handles UI state and round management
- - `clozeGameEngine.js` - Core game logic, word selection, and scoring
- - `bookDataService.js` - Manages book data fetching from Hugging Face Datasets API
- - `aiService.js` - OpenRouter API integration for AI-powered word selection and contextualization
- - `chatInterface.js` - Modal-based chat UI for contextual hints
- - `conversationManager.js` - AI conversation state management for chat functionality
  - `welcomeOverlay.js` - First-time user onboarding
 
  ---
  [milwright](https://huggingface.co/milwright), *Zach Muhlbauer*, CUNY Graduate Center
 
  # Cloze Reader
 
+ ## Overview
 
+ Cloze Reader transforms classic literature into interactive vocabulary exercises using AI-powered word selection and hint generation. The application uses Google's Gemma-3-27B model to analyze passages from Project Gutenberg, select contextually appropriate blanks, and provide adaptive guidance through a chat interface.
 
+ ## Background
 
+ The cloze procedure, introduced by Wilson Taylor in 1953, measures reading comprehension by having readers fill in deleted words from passages. BERT and subsequent masked language models use the same fundamental technique as their training objective: predict missing tokens from context. Cloze Reader closes this loop by using models descended from that context-prediction paradigm to generate cloze exercises for human learners.
 
+ ## Features
 
+ **Progressive Difficulty:** Levels 1-5 present single blanks with full hints from recent texts; levels 6-10 add multiple blanks with partial hints; levels 11+ use historical texts with minimal hints. Players must complete two passages per round before advancing.
 
+ **Interactive Chat:** Each blank includes a chat interface providing contextual hints through Socratic questioning and semantic guidance.
 
+ **Public Domain Content:** All passages stream from Hugging Face's Project Gutenberg dataset, filtered to exclude dictionaries, technical documentation, and poetry.
 
+ ## Technology Stack
 
+ **Frontend:** Vanilla JavaScript with ES6 modules, no build tooling. The application runs entirely in the browser.
 
+ **Backend:** Minimal FastAPI server for static file serving and API key injection.
 
+ **Models:** Google Gemma-3-27B via OpenRouter, with support for local LLM servers on port 1234.
 
+ **Data Source:** Hugging Face Datasets API streaming from Project Gutenberg corpus.
 
+ ## Running with Docker
 
+ ```bash
+ # Build the image
+ docker build -t cloze-reader .
+
+ # Run the container
+ docker run -p 7860:7860 cloze-reader
+
+ # Access at http://localhost:7860
+ ```
 
+ **Prerequisites:** Docker installed, port 7860 available.
 
+ ## Local LLM Integration
 
+ Run with a local LLM server instead of OpenRouter:
 
+ ```bash
+ # Start local LLM server on port 1234 (e.g., LM Studio with Gemma-3-27b)
+ # Run the development server
+ make dev  # or: python3 local-server.py 8000
+
+ # Access at http://localhost:8000/index.html?local=true
+ ```
 
+ **Features:**
+ - No API key required
+ - Offline operation
+ - Automatic response cleaning for local LLM output
+ - Compatible with LM Studio and OpenAI-compatible servers
+ - Testing available at `http://localhost:8000/test-local-llm.html?local=true`
 
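+ A sketch of how the `?local=true` switch can route requests (illustrative; the real routing lives in `aiService.js`, and the model ids are assumptions):
+
+ ```javascript
+ // Choose between LM Studio's default local endpoint and OpenRouter.
+ const useLocal =
+   new URLSearchParams(window.location.search).get("local") === "true";
+
+ const LLM_ENDPOINT = useLocal
+   ? "http://localhost:1234/v1/chat/completions"
+   : "https://openrouter.ai/api/v1/chat/completions";
+
+ const MODEL = useLocal ? "gemma-3-27b" : "google/gemma-3-27b-it";
+ ```
 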
  ## Architecture
 
  **Module Organization:**
+ - `app.js` - Application controller and UI state management
+ - `clozeGameEngine.js` - Game logic, word selection, scoring
+ - `bookDataService.js` - Book data fetching from Hugging Face
+ - `aiService.js` - OpenRouter API integration
+ - `chatInterface.js` - Modal-based chat UI
+ - `conversationManager.js` - AI conversation state management
  - `welcomeOverlay.js` - First-time user onboarding
 
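+ Since there is no bundler, these modules wire together directly in the browser; a minimal sketch, assuming hypothetical constructor shapes and entry point:
+
+ ```javascript
+ // app.js - loaded via <script type="module" src="app.js"> in index.html
+ import { ClozeGameEngine } from "./clozeGameEngine.js";
+ import { BookDataService } from "./bookDataService.js";
+ import { AIService } from "./aiService.js";
+
+ const books = new BookDataService();
+ const ai = new AIService();
+ const game = new ClozeGameEngine({ books, ai });
+
+ game.startRound(); // assumed entry point
+ ```
 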
+ ## Research Context
+
+ Recent work connects cloze assessment and masked language modeling:
+ - Matsumori et al. (2023) developed CLOZER using masked language models for L2 English cloze question generation
+ - Ondov et al. (2024, NAACL) demonstrated masked language models as natural generators for cloze distractors
+ - Zhang & Hashimoto (2021) analyzed inductive biases in masked tokens, showing models learn statistical and syntactic dependencies that cloze tests measure in humans
+
  ---
  [milwright](https://huggingface.co/milwright), *Zach Muhlbauer*, CUNY Graduate Center