ContextLab GPT-2 H.G. Wells Stylometry Model

Overview

This model is a GPT-2 language model trained exclusively on 12 books by H.G. Wells (1866-1946). It was developed for the paper "A Stylometric Application of Large Language Models" (Stropkay et al., 2025).

The model captures H.G. Wells's writing style through intensive training on his corpus. By learning the statistical patterns, vocabulary, syntax, and thematic elements characteristic of Wells's writing, this model enables:

Text generation in the authentic style of H.G. Wells
Authorship attribution through cross-entropy loss comparison
Stylometric analysis of literary works from late 19th to early 20th century England
Computational literary studies exploring Wells's distinctive voice

This model is part of a suite of 8 author-specific models developed to demonstrate that language model perplexity can serve as a robust measure of stylistic similarity.
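
Perplexity is the exponential of the mean per-token cross-entropy loss (measured in nats), so ranking models by loss and ranking them by perplexity give the same order. A minimal sketch of the conversion follows; the function name `perplexity_from_loss` is illustrative and is not taken from the paper's code.

```python
import math


def perplexity_from_loss(mean_cross_entropy: float) -> float:
    """Convert a mean per-token cross-entropy (natural log) to perplexity."""
    return math.exp(mean_cross_entropy)


# Final training loss reported in this model card
train_ppl = perplexity_from_loss(1.4242)
print(f"training perplexity: {train_ppl:.2f}")
```

Because the exponential is monotonic, either quantity can be used for comparison; loss is simply easier to average across passages.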

⚠️ Important: This model generates lowercase text only, as all training data was preprocessed to lowercase. Use lowercase prompts for best results.

Model Details

Model type: GPT-2 (custom compact architecture)
Language: English (lowercase)
License: MIT
Author: H.G. Wells (1866-1946)
Notable works: The Time Machine, The War of the Worlds, The Invisible Man
Training data: 12 books by H.G. Wells
Training tokens: 1,035,901
Final training loss: 1.4242
Epochs trained: 50,000

Architecture

Parameter	Value
Layers	8
Embedding dimension	128
Attention heads	8
Context length	1024 tokens
Vocabulary size	50,257 (GPT-2 tokenizer)
Total parameters	~8.1M
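
The parameter total can be checked by hand from the table. The sketch below assumes the standard GPT-2 layout (learned position embeddings, a 4x MLP expansion, biases on all linear layers, and an output head tied to the token embeddings); the card does not state these details, so treat them as assumptions. The number of attention heads does not affect the count.

```python
def gpt2_param_count(n_layer, n_embd, vocab_size, n_positions, mlp_ratio=4):
    """Count parameters for a standard GPT-2 layout with tied embeddings."""
    token_emb = vocab_size * n_embd
    pos_emb = n_positions * n_embd
    layer_norm = 2 * n_embd  # weight + bias
    # Fused QKV projection plus the attention output projection
    attn = (n_embd * 3 * n_embd + 3 * n_embd) + (n_embd * n_embd + n_embd)
    hidden = mlp_ratio * n_embd
    # Two MLP linear layers: expand, then project back
    mlp = (n_embd * hidden + hidden) + (hidden * n_embd + n_embd)
    per_layer = 2 * layer_norm + attn + mlp
    # Final layer norm is added once after the last block
    return token_emb + pos_emb + n_layer * per_layer + layer_norm


total = gpt2_param_count(n_layer=8, n_embd=128, vocab_size=50257, n_positions=1024)
print(f"{total:,} parameters")  # 8,150,400
```

Under these assumptions the count matches the ~8.1M figure in the table, and roughly 79% of the parameters sit in the token embedding matrix.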

Usage

Basic Text Generation

from transformers import GPT2LMHeadModel, GPT2Tokenizer
import torch

# Load model and tokenizer
model = GPT2LMHeadModel.from_pretrained("contextlab/gpt2-wells")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

# IMPORTANT: Use lowercase prompts (model trained on lowercase text)
prompt = "the time traveller"
inputs = tokenizer(prompt, return_tensors="pt")

# Generate text
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_length=200,
        do_sample=True,
        temperature=0.8,
        top_p=0.9,
        pad_token_id=tokenizer.eos_token_id
    )

generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)

Output: Generates text in H.G. Wells's distinctive style (all lowercase).

Stylometric Analysis

Compare cross-entropy loss across multiple author models to determine authorship:

from transformers import GPT2LMHeadModel, GPT2Tokenizer
import torch

# Load models for different authors
authors = ['wells', 'austen', 'dickens', 'twain']  # Example subset, including this model
models = {
    author: GPT2LMHeadModel.from_pretrained(f"contextlab/gpt2-{author}")
    for author in authors
}

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# Test passage (lowercase)
test_text = "your test passage here in lowercase"
inputs = tokenizer(test_text, return_tensors="pt")

# Compute loss for each model
for author, model in models.items():
    model.eval()
    with torch.no_grad():
        outputs = model(**inputs, labels=inputs['input_ids'])
        loss = outputs.loss.item()
    print(f"{author}: {loss:.4f}")

# Lower loss indicates more similar style (likely author)
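
Once a loss has been computed per author model, attribution reduces to picking the minimum. The helper below is a small sketch of that step; `attribute` is a hypothetical name, and the loss values are placeholders for illustration only, not measured results.

```python
def attribute(losses: dict[str, float]) -> tuple[str, float]:
    """Return the lowest-loss author and its margin over the runner-up."""
    if len(losses) < 2:
        raise ValueError("need at least two author models to compare")
    ranked = sorted(losses.items(), key=lambda item: item[1])
    (best_author, best_loss), (_, second_loss) = ranked[0], ranked[1]
    return best_author, second_loss - best_loss


# Placeholder losses for illustration only, not measured results
example_losses = {"austen": 4.9, "dickens": 4.6, "twain": 4.7}
predicted, margin = attribute(example_losses)
print(predicted, round(margin, 2))
```

Reporting the margin alongside the prediction is a design choice: a small gap between the top two models suggests the attribution should be treated with caution.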

Training Procedure

Dataset

The model was trained on 12 books by H.G. Wells sourced from Project Gutenberg. The text was preprocessed to:

Remove Project Gutenberg headers and footers
Convert all text to lowercase
Remove chapter headings and non-narrative text
Preserve punctuation and structure

See the Wells corpus dataset for details.
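
The preprocessing steps listed above can be sketched as follows. This is not the authors' pipeline: the function name `clean_gutenberg_text` is hypothetical, the start and end markers are the ones commonly found in Project Gutenberg plain-text files, and the chapter-heading pattern is an assumption that would need adjusting per book.

```python
import re

# Markers commonly found in Project Gutenberg plain-text files
START = re.compile(
    r"\*\*\*\s*START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*?\*\*\*",
    re.IGNORECASE,
)
END = re.compile(
    r"\*\*\*\s*END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*?\*\*\*",
    re.IGNORECASE,
)
# Assumed heading pattern: a whole line such as "CHAPTER IV" or "Chapter 12."
CHAPTER = re.compile(
    r"^[ \t]*chapter[ \t]+[ivxlcdm\d]+\.?[ \t]*$",
    re.IGNORECASE | re.MULTILINE,
)


def clean_gutenberg_text(raw: str) -> str:
    """Strip header/footer and chapter headings, then lowercase."""
    start = START.search(raw)
    end = END.search(raw)
    begin_at = start.end() if start else 0
    stop_at = end.start() if end else len(raw)
    body = raw[begin_at:stop_at]
    body = CHAPTER.sub("", body)
    # Punctuation and remaining line structure are left untouched
    return body.lower().strip()


sample = (
    "License text\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK THE TIME MACHINE ***\n"
    "CHAPTER I\n"
    "The Time Traveller was expounding.\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK THE TIME MACHINE ***\n"
    "footer"
)
print(clean_gutenberg_text(sample))
```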

Hyperparameters

Parameter	Value
Context length	1,024 tokens
Batch size	16
Learning rate	5×10⁻⁵
Optimizer	AdamW
Training tokens	1,035,901
Epochs	50,000
Final loss	1.4242
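
The card does not say how the corpus was cut into training sequences or what counts as one epoch. Purely as a back-of-the-envelope sketch, assuming non-overlapping 1,024-token sequences and one pass over all of them, the hyperparameters above imply the following scale:

```python
import math

TRAINING_TOKENS = 1_035_901
CONTEXT_LENGTH = 1024
BATCH_SIZE = 16

# Assumption: the corpus is cut into non-overlapping full-length sequences
sequences = TRAINING_TOKENS // CONTEXT_LENGTH
tokens_per_batch = BATCH_SIZE * CONTEXT_LENGTH
# The last batch of a pass is partial, hence the ceiling
batches_per_pass = math.ceil(sequences / BATCH_SIZE)

print(f"full sequences: {sequences}")           # 1011
print(f"tokens per batch: {tokens_per_batch}")  # 16384
print(f"batches per pass: {batches_per_pass}")  # 64
```

Under that assumption a single pass over the corpus is only a few dozen optimizer steps, which gives a sense of how small the dataset is relative to the context length and batch size.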

Training Method

The model uses a compact GPT-2 architecture (8 layers, 128-dimensional embeddings) and was trained exclusively on H.G. Wells's works until the training loss reached 1.4242. This intensive training enables the model to capture fine-grained stylistic patterns characteristic of Wells's writing.

See the GitHub repository for complete training code and methodology.

Intended Use

Primary Uses

Research: Stylometric analysis, authorship attribution studies
Education: Demonstrations of computational stylometry
Creative: Generate text in H.G. Wells's style
Analysis: Compare writing styles across historical periods

Out-of-Scope Uses

This model is not intended for:

Factual information retrieval
Modern language generation
Tasks requiring uppercase text
Commercial publication without attribution

Limitations

Lowercase only: All generated text is lowercase (due to preprocessing)
Historical language: Reflects late 19th to early 20th century England vocabulary and grammar
Training data bias: Limited to H.G. Wells's published works
Small model: Compact architecture prioritizes training speed over generation quality
No factual grounding: Generates stylistically similar text, not historically accurate content

Evaluation

This model achieved perfect accuracy (100%) in distinguishing H.G. Wells's works from seven other classic authors in cross-entropy loss comparisons. See the paper for detailed evaluation results.

Citation

If you use this model in your research, please cite:

@article{StroEtal25,
  title={A Stylometric Application of Large Language Models},
  author={Stropkay, Harrison F. and Chen, Jiayi and Jabelli, Mohammad J. L. and Rockmore, Daniel N. and Manning, Jeremy R.},
  journal={arXiv preprint arXiv:2510.21958},
  year={2025}
}

Contact

Paper & Code: https://github.com/ContextLab/llm-stylometry
Issues: https://github.com/ContextLab/llm-stylometry/issues
Contact: Jeremy R. Manning ([email protected])
Lab: Context Lab, Dartmouth College

Related Models

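Models for the other authors in the study follow the same naming pattern as this one (contextlab/gpt2-{author}), as used in the stylometric analysis example above.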
