Jay-Rajput committed on
Commit f659ec0 · 1 Parent(s): 23c23e6

ai detector new

Files changed (3)
  1. README.md +71 -1
  2. app.py +282 -1012
  3. requirements.txt +6 -8
README.md CHANGED
@@ -10,4 +10,74 @@ pinned: false
10
  license: mit
11
  ---
12
 
13
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
13
+ # Advanced AI Text Detector 🔍
14
+
15
+ An advanced AI text detection system that identifies AI-generated content, particularly from ChatGPT and similar language models.
16
+
17
+ ## Features
18
+
19
+ ### 🤖 Dual Detection Methods
20
+ - **Transformer-based Detection**: Uses a fine-tuned RoBERTa model trained specifically for ChatGPT detection
21
+ - **Statistical Analysis**: Employs multiple linguistic metrics for robust detection
22
+
23
+ ### 📊 Comprehensive Analysis Metrics
24
+ - **Burstiness Analysis**: Measures sentence length variation (human text is typically more "bursty")
25
+ - **Vocabulary Diversity**: Analyzes lexical richness and word variety
26
+ - **Repetition Detection**: Identifies repeated phrases and patterns
27
+ - **Perplexity Scoring**: Evaluates text predictability
28
+ - **Punctuation Patterns**: Analyzes punctuation consistency
29
+
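The README names these metrics but does not give formulas. As a rough illustration only, burstiness could be measured as the coefficient of variation of sentence lengths, and vocabulary diversity as a type-token ratio. Both definitions, and the helper names `split_sentences`, `burstiness`, and `type_token_ratio`, are assumptions and may differ from what `app.py` computes.

```python
import re
import statistics


def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def burstiness(text: str) -> float:
    """Coefficient of variation of sentence lengths (assumed definition).

    Higher values mean sentence lengths vary more.
    """
    lengths = [len(s.split()) for s in split_sentences(text)]
    if len(lengths) < 2:
        return 0.0
    mean = statistics.mean(lengths)
    return statistics.pstdev(lengths) / mean if mean else 0.0


def type_token_ratio(text: str) -> float:
    """Share of distinct words among all words (assumed diversity measure)."""
    words = re.findall(r"[a-z']+", text.lower())
    return len(set(words)) / len(words) if words else 0.0
```

Under these definitions, text with evenly sized sentences scores 0.0 for burstiness, while a mix of short and long sentences scores higher.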
30
+ ### 🎯 High Accuracy Features
31
+ - Multi-method ensemble approach for improved accuracy
32
+ - Confidence scoring system
33
+ - Detailed explanations for each detection
34
+ - Visual probability distribution
35
+
36
+ ## How It Works
37
+
38
+ 1. **Input Processing**: The text is tokenized and prepared for analysis
39
+ 2. **Transformer Analysis**: If available, the RoBERTa model provides an initial AI probability
40
+ 3. **Statistical Analysis**: Multiple linguistic features are extracted and analyzed
41
+ 4. **Score Combination**: Results are weighted and combined for final prediction
42
+ 5. **Result Generation**: Detailed report with classification, confidence, and explanations
43
+
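Step 4 says the results are weighted and combined, and step 2 implies the transformer may be unavailable. One way this might look is sketched below. The function name `combine_scores` and the default weight of 0.7 are hypothetical; the README does not state the weights actually used.

```python
from typing import Optional


def combine_scores(
    statistical_score: float,
    transformer_prob: Optional[float] = None,
    transformer_weight: float = 0.7,  # hypothetical weight, not from the README
) -> float:
    """Blend a transformer probability with a statistical score.

    Falls back to the statistical score alone when no transformer
    probability is available.
    """
    if not 0.0 <= transformer_weight <= 1.0:
        raise ValueError("transformer_weight must be in [0, 1]")
    if transformer_prob is None:
        combined = statistical_score
    else:
        combined = (
            transformer_weight * transformer_prob
            + (1.0 - transformer_weight) * statistical_score
        )
    # Clamp so the result is always a valid probability.
    return min(max(combined, 0.0), 1.0)
```

Because the two weights sum to 1, the combined value stays on the same 0 to 1 scale as its inputs, so it can be compared directly against fixed thresholds.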
44
+ ## Detection Categories
45
+
46
+ - **AI-Generated**: >80% AI probability (High confidence)
47
+ - **Likely AI-Generated**: 60-80% AI probability (Medium confidence)
48
+ - **Uncertain**: 40-60% AI probability (Low confidence)
49
+ - **Likely Human-Written**: 20-40% AI probability (Medium confidence)
50
+ - **Human-Written**: <20% AI probability (High confidence)
51
+
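The five bands above can be expressed as a small lookup. This is a sketch, not the code from `app.py`: the README ranges share their endpoints (60%, for example, appears in two bands), so the handling of exact boundary values here is an assumption, and `categorize` is a hypothetical name.

```python
def categorize(ai_probability: float) -> tuple[str, str]:
    """Map an AI probability in [0, 1] to (category, confidence).

    Assumption: each band includes its lower bound, except that exactly
    0.80 falls in "Likely AI-Generated" because the top band is ">80%".
    """
    if not 0.0 <= ai_probability <= 1.0:
        raise ValueError("ai_probability must be in [0, 1]")
    if ai_probability > 0.80:
        return ("AI-Generated", "High")
    if ai_probability >= 0.60:
        return ("Likely AI-Generated", "Medium")
    if ai_probability >= 0.40:
        return ("Uncertain", "Low")
    if ai_probability >= 0.20:
        return ("Likely Human-Written", "Medium")
    return ("Human-Written", "High")
```

Note that the confidence label tracks distance from 50%, not the direction of the verdict, which is why both extremes are reported as high confidence.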
52
+ ## Usage Tips
53
+
54
+ - Provide at least 100 words for optimal accuracy
55
+ - Longer texts generally yield more reliable results
56
+ - The detector works best with English text
57
+ - Results are probabilistic - use them as guidance, not absolute truth
58
+
59
+ ## Technical Stack
60
+
61
+ - **Gradio**: Interactive web interface
62
+ - **Transformers**: Hugging Face transformer models
63
+ - **PyTorch**: Deep learning backend
64
+ - **SciPy/NumPy**: Statistical analysis
65
+
66
+ ## Limitations
67
+
68
+ - Best performance with English text
69
+ - Requires sufficient text length (minimum 50 characters, optimal 100+ words)
70
+ - Detection accuracy may vary with highly technical or specialized content
71
+ - Should be used as a tool for guidance, not definitive judgment
72
+
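The length limits mentioned above (a 50-character minimum and 100 or more words for best results) could be enforced with a pre-check like the one below. This is a minimal sketch; the name `check_input_length`, the constants, and the message wording are all hypothetical.

```python
MIN_CHARS = 50        # minimum length stated in the README
OPTIMAL_WORDS = 100   # word count the README calls optimal


def check_input_length(text: str) -> tuple[bool, str]:
    """Return (accepted, message) for a candidate input text."""
    stripped = text.strip()
    if len(stripped) < MIN_CHARS:
        return False, (
            f"Text too short: {len(stripped)} characters "
            f"(minimum {MIN_CHARS})."
        )
    n_words = len(stripped.split())
    if n_words < OPTIMAL_WORDS:
        return True, (
            f"Accepted, but only {n_words} words; "
            f"{OPTIMAL_WORDS}+ words give more reliable results."
        )
    return True, "Length is sufficient."
```

Returning a message alongside the boolean lets the interface warn about short inputs without rejecting them outright.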
73
+ ## Deployment
74
+
75
+ This app is designed to run on Hugging Face Spaces. Simply upload the files to your Space and it will automatically deploy.
76
+
77
+ ## Model Credit
78
+
79
+ This detector uses the `Hello-SimpleAI/chatgpt-detector-roberta` model from Hugging Face, combined with custom statistical analysis methods.
80
+
81
+ ---
82
+
83
+ **Note**: AI detection is a rapidly evolving field. No detector is 100% accurate, and results should be interpreted with appropriate context and judgment.
app.py CHANGED
@@ -1,1031 +1,301 @@
1
-
2
- """
3
- Enhanced AI Text Detector - Superior Pattern Recognition
4
- Significantly improved ChatGPT detection with advanced linguistic analysis
5
- Addresses missed patterns in formal, academic, and corporate writing styles
6
- """
7
-
8
  import gradio as gr
9
  import torch
 
10
  import numpy as np
 
11
  import re
12
- import time
13
- from transformers import AutoTokenizer, AutoModelForSequenceClassification
14
- from typing import Dict, List, Tuple
15
- import statistics
16
- import string
17
  from collections import Counter
18
- import json
19
- import plotly.graph_objects as go
20
- import plotly.express as px
21
-
22
- class EnhancedAIDetector:
23
- """
24
- Enhanced AI text detector with superior pattern recognition
25
- Specifically improved for ChatGPT's formal, academic, and corporate writing styles
26
- """
27
 
 
28
  def __init__(self):
29
- self.primary_tokenizer = None
30
- self.primary_model = None
31
- self.backup_models = []
32
- self.load_models()
33
-
34
- def load_models(self):
35
- """Load multiple detection models for ensemble approach"""
36
  try:
37
- # Primary model - RoBERTa based
38
- primary_model_name = "roberta-base-openai-detector"
39
- self.primary_tokenizer = AutoTokenizer.from_pretrained(primary_model_name)
40
- self.primary_model = AutoModelForSequenceClassification.from_pretrained(primary_model_name)
41
-
42
- # Try to load additional models if available
43
- alternative_models = [
44
- "Hello-SimpleAI/chatgpt-detector-roberta",
45
- "andreas122001/roberta-mixed-detector",
46
- "TrustSafeAI/GUARD-1B"
47
- ]
48
-
49
- for model_name in alternative_models:
50
- try:
51
- tokenizer = AutoTokenizer.from_pretrained(model_name)
52
- model = AutoModelForSequenceClassification.from_pretrained(model_name)
53
- self.backup_models.append((tokenizer, model, model_name))
54
- print(f"✓ Loaded additional model: {model_name}")
55
- except:
56
- continue
57
-
58
- print(f"✓ Models loaded successfully - {1 + len(self.backup_models)} total models")
59
- except Exception as e:
60
- print(f"⚠️ Model loading failed: {e}")
61
- self.primary_tokenizer = None
62
- self.primary_model = None
63
-
64
- def extract_enhanced_ai_features(self, text: str) -> Dict[str, float]:
65
- """Extract enhanced features with better ChatGPT pattern recognition"""
66
-
67
- if len(text.strip()) < 10:
68
- return {}
69
-
70
- features = {}
71
- sentences = re.split(r'[.!?]+', text)
72
- sentences = [s.strip() for s in sentences if s.strip()]
73
  words = text.split()
74
-
75
- if not sentences or not words:
76
- return {}
77
-
78
- # ENHANCED: Academic/Corporate Language Patterns (MAJOR IMPROVEMENT)
79
- academic_phrases = [
80
- "demonstrates", "is defined by", "functions as", "serves as", "operates as",
81
- "characterized by", "exemplifies", "represents", "constitutes", "embodies",
82
- "encompasses", "facilitates", "enables", "promotes", "establishes",
83
- "technological object", "systematic approach", "comprehensive analysis",
84
- "strategic implementation", "optimal solution", "integrated system"
85
- ]
86
- academic_count = sum(1 for phrase in academic_phrases if phrase in text.lower())
87
- features['academic_language'] = min(academic_count / len(sentences) * 3, 1.0)
88
-
89
- # ENHANCED: Corporate Buzzwords (MAJOR IMPROVEMENT)
90
- corporate_buzzwords = [
91
- "ecosystem", "framework", "scalability", "optimization", "integration",
92
- "synergy", "leverage", "streamline", "enhance", "maximize", "utilize",
93
- "implement", "facilitate", "comprehensive", "strategic", "innovative",
94
- "efficient", "effective", "robust", "seamless", "dynamic", "paradigm",
95
- "methodology", "infrastructure", "architecture", "deployment"
96
- ]
97
- buzzword_count = sum(1 for word in words if word.lower() in corporate_buzzwords)
98
- features['corporate_buzzwords'] = min(buzzword_count / len(words) * 20, 1.0)
99
-
100
- # ENHANCED: Technical Jargon Overuse (NEW)
101
- technical_terms = [
102
- "iterative", "predictable", "standardized", "regulated", "uniform",
103
- "optimized", "systematic", "consistent", "scalable", "integrated",
104
- "automated", "synchronized", "configured", "calibrated", "validated"
105
- ]
106
- technical_count = sum(1 for word in words if word.lower() in technical_terms)
107
- features['technical_jargon'] = min(technical_count / len(words) * 15, 1.0)
108
-
109
- # ENHANCED: Abstract Conceptualization (NEW)
110
- abstract_patterns = [
111
- "in this framework", "in this context", "within this paradigm",
112
- "from this perspective", "in this regard", "in this manner",
113
- "serves as a", "functions as a", "operates as a", "acts as a",
114
- "not only.*but also", "both.*and", "either.*or"
115
- ]
116
- abstract_count = sum(1 for pattern in abstract_patterns if re.search(pattern, text.lower()))
117
- features['abstract_conceptualization'] = min(abstract_count / len(sentences) * 2, 1.0)
118
-
119
- # ENHANCED: Formal Hedging Language (NEW)
120
- hedging_patterns = [
121
- "not only", "but also", "furthermore", "moreover", "additionally",
122
- "consequently", "therefore", "thus", "hence", "accordingly",
123
- "in conclusion", "to summarize", "overall", "in summary",
124
- "it should be noted", "it is important to", "it is worth noting"
125
- ]
126
- hedging_count = sum(1 for pattern in hedging_patterns if pattern in text.lower())
127
- features['formal_hedging'] = min(hedging_count / len(sentences) * 2, 1.0)
128
-
129
- # ENHANCED: Objective/Neutral Tone Detection (NEW)
130
- subjective_indicators = [
131
- "i think", "i believe", "i feel", "in my opinion", "personally",
132
- "i love", "i hate", "amazing", "terrible", "awesome", "sucks",
133
- "definitely", "probably", "maybe", "might", "could be", "seems like"
134
- ]
135
- subjective_count = sum(1 for phrase in subjective_indicators if phrase in text.lower())
136
- features['objective_tone'] = 1.0 - min(subjective_count / len(sentences), 1.0)
137
-
138
- # ENHANCED: Systematic Structure Indicators (NEW)
139
- structure_words = [
140
- "first", "second", "third", "finally", "initially", "subsequently",
141
- "furthermore", "moreover", "however", "nevertheless", "in addition",
142
- "on the other hand", "in contrast", "similarly", "likewise"
143
- ]
144
- structure_count = sum(1 for word in text.lower().split() if word in structure_words)
145
- features['systematic_structure'] = min(structure_count / len(words) * 10, 1.0)
146
-
147
- # ENHANCED: Passive Voice Usage (ChatGPT loves passive voice)
148
- passive_indicators = [
149
- "is defined", "are defined", "is characterized", "are characterized",
150
- "is demonstrated", "are demonstrated", "is established", "are established",
151
- "is implemented", "are implemented", "is facilitated", "are facilitated",
152
- "is regulated", "are regulated", "is standardized", "are standardized"
153
- ]
154
- passive_count = sum(1 for phrase in passive_indicators if phrase in text.lower())
155
- features['passive_voice'] = min(passive_count / len(sentences) * 3, 1.0)
156
-
157
- # ORIGINAL: Politeness and helpful language patterns (REWEIGHTED)
158
- polite_phrases = [
159
- "i hope this helps", "i would be happy to", "please let me know",
160
- "feel free to", "i would recommend", "you might want to", "you might consider",
161
- "it is worth noting", "it is important to", "keep in mind",
162
- "i understand", "certainly", "absolutely", "definitely"
163
- ]
164
- polite_count = sum(1 for phrase in polite_phrases if phrase in text.lower())
165
- features['politeness_score'] = min(polite_count / len(sentences), 1.0)
166
-
167
- # ORIGINAL: Explanation and clarification patterns (REWEIGHTED)
168
- explanation_patterns = [
169
- 'this means', 'in other words', 'specifically', 'for example',
170
- 'for instance', 'such as', 'including', 'that is',
171
- 'i.e.', 'e.g.', 'namely', 'particularly'
172
- ]
173
- explanation_count = sum(1 for phrase in explanation_patterns if phrase in text.lower())
174
- features['explanation_score'] = min(explanation_count / len(sentences), 1.0)
175
-
176
- # ORIGINAL: Lack of personal experiences (ENHANCED)
177
- personal_indicators = [
178
- 'i remember', 'when i was', 'my experience', 'i once', 'i personally',
179
- 'in my opinion', 'i think', 'i believe', 'i feel', 'my view',
180
- 'from my perspective', 'i have seen', 'i have noticed', 'i have found',
181
- 'my friend', 'my family', 'my colleague', 'yesterday', 'last week',
182
- 'last month', 'last year', 'when i', 'my boss', 'my teacher'
183
- ]
184
- personal_count = sum(1 for phrase in personal_indicators if phrase in text.lower())
185
- features['personal_absence'] = 1.0 - min(personal_count / len(sentences), 1.0)
186
-
187
- # ENHANCED: Sentence Complexity and Length Consistency
188
- if len(sentences) > 1:
189
- sentence_lengths = [len(s.split()) for s in sentences]
190
- avg_length = np.mean(sentence_lengths)
191
- length_variance = np.var(sentence_lengths)
192
-
193
- # ChatGPT tends to have consistent, moderate-length sentences
194
- features['sentence_consistency'] = 1.0 - min(length_variance / max(avg_length, 1), 1.0)
195
- features['optimal_length'] = 1.0 if 10 <= avg_length <= 20 else max(0, 1.0 - abs(avg_length - 15) / 15)
196
- else:
197
- features['sentence_consistency'] = 0.5
198
- features['optimal_length'] = 0.5
199
-
200
- # ENHANCED: Punctuation and Grammar Perfection
201
- exclamation_count = text.count('!')
202
- question_count = text.count('?')
203
- period_count = text.count('.')
204
-
205
- # ChatGPT rarely uses exclamations or questions in formal text
206
- features['punctuation_perfection'] = 1.0 - min((exclamation_count + question_count) / max(period_count, 1), 1.0)
207
-
208
- # ENHANCED: Vocabulary Sophistication
209
- sophisticated_words = [
210
- "demonstrates", "facilitates", "encompasses", "constitutes", "exemplifies",
211
- "characterizes", "emphasizes", "indicates", "suggests", "implies",
212
- "encompasses", "encompasses", "substantial", "significant", "considerable",
213
- "comprehensive", "extensive", "thorough", "meticulous", "systematic"
214
- ]
215
- sophisticated_count = sum(1 for word in words if word.lower() in sophisticated_words)
216
- features['vocabulary_sophistication'] = min(sophisticated_count / len(words) * 20, 1.0)
217
-
218
- return features
219
-
220
- def calculate_ensemble_ai_probability(self, text: str) -> float:
221
- """Use multiple models to calculate AI probability with ensemble approach"""
222
- probabilities = []
223
-
224
- # Primary model prediction
225
- if self.primary_model and self.primary_tokenizer:
226
- try:
227
- inputs = self.primary_tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
228
- with torch.no_grad():
229
- outputs = self.primary_model(**inputs)
230
- probs = torch.softmax(outputs.logits, dim=-1)
231
- ai_prob = probs[0][1].item()
232
- probabilities.append(ai_prob * 0.6) # Primary model gets 60% weight
233
- except:
234
- probabilities.append(0.5)
235
-
236
- # Backup models predictions
237
- for tokenizer, model, model_name in self.backup_models:
238
- try:
239
- inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
240
- with torch.no_grad():
241
- outputs = model(**inputs)
242
- probs = torch.softmax(outputs.logits, dim=-1)
243
- ai_prob = probs[0][1].item()
244
- probabilities.append(ai_prob * (0.4 / len(self.backup_models)))
245
- except:
246
- continue
247
-
248
- # If no models worked, return default
249
- if not probabilities:
250
  return 0.5
251
-
252
- return sum(probabilities)
253
-
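Review note on the removed `calculate_ensemble_ai_probability`: the primary model's probability is scaled by 0.6 and the backup models share the remaining 0.4, so when no backup model loads the summed result cannot exceed 0.6, and a primary-model failure adds an unweighted 0.5. A renormalized weighted average avoids both effects. The sketch below is illustrative only; `weighted_ensemble` is a hypothetical name and is not part of this commit.

```python
from typing import Optional, Sequence


def weighted_ensemble(
    primary_prob: Optional[float],
    backup_probs: Sequence[float],
    primary_weight: float = 0.6,  # same split as the removed code
) -> float:
    """Weighted average over whichever models produced a probability."""
    weighted: list[tuple[float, float]] = []  # (probability, weight)
    if primary_prob is not None:
        weighted.append((primary_prob, primary_weight))
    if backup_probs:
        share = (1.0 - primary_weight) / len(backup_probs)
        weighted.extend((p, share) for p in backup_probs)

    total_weight = sum(w for _, w in weighted)
    if total_weight == 0:
        return 0.5  # no usable prediction: stay neutral
    # Dividing by the weight actually used keeps the result in [0, 1]
    # even when some models are missing.
    return sum(p * w for p, w in weighted) / total_weight
```

With this form, a lone primary model reporting 0.9 yields 0.9 rather than 0.54.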
254
- def classify_text_category(self, text: str) -> Tuple[str, Dict[str, float], float]:
255
- """Enhanced classification with superior AI pattern recognition"""
256
- if len(text.strip()) < 10:
257
- return "Uncertain", {"ai_generated": 0.25, "ai_refined": 0.25, "human_ai_refined": 0.25, "human_written": 0.25}, 0.3
258
-
259
- # Extract enhanced AI-specific features
260
- ai_features = self.extract_enhanced_ai_features(text)
261
-
262
- # Get ensemble model prediction
263
- ensemble_ai_prob = self.calculate_ensemble_ai_probability(text)
264
-
265
- # ENHANCED SCORING WITH BETTER WEIGHTS FOR CHATGPT PATTERNS
266
- scores = {}
267
-
268
- # AI-generated score (SIGNIFICANTLY ENHANCED)
269
- formal_ai_indicators = [
270
- ai_features.get('academic_language', 0) * 0.15, # Academic language is a strong ChatGPT indicator
271
- ai_features.get('corporate_buzzwords', 0) * 0.15, # Corporate buzzwords
272
- ai_features.get('technical_jargon', 0) * 0.12, # Technical jargon overuse
273
- ai_features.get('abstract_conceptualization', 0) * 0.10, # Abstract concepts
274
- ai_features.get('formal_hedging', 0) * 0.08, # Formal hedging language
275
- ai_features.get('objective_tone', 0) * 0.12, # Objective, neutral tone
276
- ai_features.get('systematic_structure', 0) * 0.08, # Systematic presentation
277
- ai_features.get('passive_voice', 0) * 0.10, # Passive voice usage
278
- ai_features.get('vocabulary_sophistication', 0) * 0.10 # Sophisticated vocabulary
279
- ]
280
-
281
- traditional_ai_indicators = [
282
- ai_features.get('politeness_score', 0) * 0.05, # Reduced weight
283
- ai_features.get('explanation_score', 0) * 0.03, # Reduced weight
284
- ai_features.get('personal_absence', 0) * 0.08, # Still important
285
- ai_features.get('punctuation_perfection', 0) * 0.04 # Reduced weight
286
- ]
287
-
288
  ai_score = (
289
- ensemble_ai_prob * 0.35 + # Reduced model weight to make room for features
290
- sum(formal_ai_indicators) * 0.45 + # MAJOR EMPHASIS on formal patterns
291
- sum(traditional_ai_indicators) * 0.20 # Traditional patterns
292
- )
293
-
294
- scores['ai_generated'] = min(max(ai_score, 0.0), 1.0)
295
-
296
- # AI-generated & AI-refined score (ENHANCED)
297
- ai_refined_score = (
298
- ensemble_ai_prob * 0.3 +
299
- ai_features.get('formal_hedging', 0) * 0.2 +
300
- ai_features.get('vocabulary_sophistication', 0) * 0.2 +
301
- ai_features.get('punctuation_perfection', 0) * 0.15 +
302
- ai_features.get('systematic_structure', 0) * 0.15
303
- )
304
- scores['ai_refined'] = min(max(ai_refined_score, 0.0), 1.0)
305
-
306
- # Human-written & AI-refined score
307
- human_ai_refined_score = (
308
- (1.0 - ensemble_ai_prob) * 0.4 +
309
- (1.0 - ai_features.get('personal_absence', 0.5)) * 0.2 +
310
- ai_features.get('explanation_score', 0) * 0.2 +
311
- ai_features.get('systematic_structure', 0) * 0.2
312
- )
313
- scores['human_ai_refined'] = min(max(human_ai_refined_score, 0.0), 1.0)
314
-
315
- # Human-written score (ENHANCED TO REDUCE FALSE NEGATIVES)
316
- human_written_score = (
317
- (1.0 - ensemble_ai_prob) * 0.3 + # Reduced model influence
318
- (1.0 - ai_features.get('academic_language', 0.5)) * 0.15 + # Penalize academic language
319
- (1.0 - ai_features.get('corporate_buzzwords', 0.5)) * 0.15 + # Penalize buzzwords
320
- (1.0 - ai_features.get('objective_tone', 0.5)) * 0.15 + # Penalize overly objective tone
321
- (1.0 - ai_features.get('formal_hedging', 0.5)) * 0.1 + # Penalize formal hedging
322
- (1.0 - ai_features.get('vocabulary_sophistication', 0.5)) * 0.15 # Penalize over-sophistication
323
  )
324
- scores['human_written'] = min(max(human_written_score, 0.0), 1.0)
325
-
326
- # Normalize scores
327
- total_score = sum(scores.values())
328
- if total_score > 0:
329
- scores = {k: v / total_score for k, v in scores.items()}
330
- else:
331
- scores = {"ai_generated": 0.25, "ai_refined": 0.25, "human_ai_refined": 0.25, "human_written": 0.25}
332
-
333
- # Determine primary category
334
- primary_category = max(scores, key=scores.get)
335
- confidence = scores[primary_category]
336
-
337
- # Map to readable names
338
- category_names = {
339
- 'ai_generated': 'AI-generated',
340
- 'ai_refined': 'AI-generated & AI-refined',
341
- 'human_ai_refined': 'Human-written & AI-refined',
342
- 'human_written': 'Human-written'
343
  }
344
-
345
- return category_names[primary_category], scores, confidence
346
-
347
- def split_into_sentences(self, text: str) -> List[str]:
348
- """Split text into sentences for individual analysis"""
349
- sentences = re.split(r'(?<=[.!?])\s+', text.strip())
350
- sentences = [s.strip() for s in sentences if len(s.strip()) > 10]
351
- return sentences
352
-
353
- def analyze_sentence_ai_probability(self, sentence: str) -> float:
354
- """Analyze individual sentence for AI probability with enhanced features"""
355
- if len(sentence.strip()) < 10:
356
- return 0.5
357
-
358
- # Use ensemble approach for sentence-level detection
359
- ensemble_prob = self.calculate_ensemble_ai_probability(sentence)
360
-
361
- # Add enhanced sentence-level features
362
- sentence_features = self.extract_enhanced_ai_features(sentence)
363
-
364
- # Enhanced sentence scoring
365
- ai_sentence_score = (
366
- ensemble_prob * 0.4 +
367
- sentence_features.get('academic_language', 0) * 0.15 +
368
- sentence_features.get('corporate_buzzwords', 0) * 0.15 +
369
- sentence_features.get('technical_jargon', 0) * 0.1 +
370
- sentence_features.get('formal_hedging', 0) * 0.1 +
371
- sentence_features.get('objective_tone', 0) * 0.1
372
- )
373
-
374
- return min(max(ai_sentence_score, 0.0), 1.0)
375
-
376
- def highlight_ai_text(self, text: str, threshold: float = 0.55) -> str:
377
- """Highlight sentences with LOWER threshold for better sensitivity"""
378
- sentences = self.split_into_sentences(text)
379
-
380
- if not sentences:
381
- return text
382
-
383
- highlighted_text = text
384
- sentence_scores = []
385
-
386
- # Analyze each sentence
387
- for sentence in sentences:
388
- ai_prob = self.analyze_sentence_ai_probability(sentence)
389
- sentence_scores.append((sentence, ai_prob))
390
-
391
- # Sort by AI probability
392
- sentence_scores.sort(key=lambda x: x[1], reverse=True)
393
-
394
- # Highlight sentences above threshold (LOWERED THRESHOLD)
395
- for sentence, ai_prob in sentence_scores:
396
- if ai_prob > threshold:
397
- # Use different colors based on confidence
398
- if ai_prob > 0.75:
399
- # High confidence - red highlight
400
- highlighted_sentence = f'<mark style="background-color: #ffe6e6; padding: 2px 4px; border-radius: 3px; border-left: 3px solid #dc3545; color: #721c24;">{sentence}</mark>'
401
- elif ai_prob > 0.65:
402
- # Medium-high confidence - orange-red highlight
403
- highlighted_sentence = f'<mark style="background-color: #fff0e6; padding: 2px 4px; border-radius: 3px; border-left: 3px solid #fd7e14;">{sentence}</mark>'
404
- else:
405
- # Medium confidence - orange highlight
406
- highlighted_sentence = f'<mark style="background-color: #fff3cd; padding: 2px 4px; border-radius: 3px; border-left: 3px solid #ffc107;">{sentence}</mark>'
407
- highlighted_text = highlighted_text.replace(sentence, highlighted_sentence)
408
-
409
- return highlighted_text
410
-
411
- def get_analysis_json(self, text: str) -> Dict:
412
- """Get analysis results in JSON format"""
413
- start_time = time.time()
414
-
415
- if not text or len(text.strip()) < 10:
416
- return {
417
- "error": "Text must be at least 10 characters long",
418
- "ai_percentage": 0,
419
- "human_percentage": 0,
420
- "ai_likelihood": 0,
421
- "category_scores": {
422
- "ai_generated": 0,
423
- "ai_refined": 0,
424
- "human_ai_refined": 0,
425
- "human_written": 0
426
- },
427
- "primary_category": "uncertain",
428
- "confidence": 0,
429
- "processing_time_ms": 0,
430
- "highlighted_text": text
431
- }
432
-
433
  try:
434
- primary_category, category_scores, confidence = self.classify_text_category(text)
435
- highlighted_text = self.highlight_ai_text(text)
436
-
437
- ai_percentage = (category_scores['ai_generated'] + category_scores['ai_refined']) * 100
438
- human_percentage = (category_scores['human_ai_refined'] + category_scores['human_written']) * 100
439
- ai_likelihood = category_scores['ai_generated'] * 100
440
-
441
- processing_time = (time.time() - start_time) * 1000
442
-
443
- return {
444
- "ai_percentage": round(ai_percentage, 1),
445
- "human_percentage": round(human_percentage, 1),
446
- "ai_likelihood": round(ai_likelihood, 1),
447
- "category_scores": {
448
- "ai_generated": round(category_scores['ai_generated'] * 100, 1),
449
- "ai_refined": round(category_scores['ai_refined'] * 100, 1),
450
- "human_ai_refined": round(category_scores['human_ai_refined'] * 100, 1),
451
- "human_written": round(category_scores['human_written'] * 100, 1)
452
- },
453
- "primary_category": primary_category.lower().replace(' ', '_').replace('-', '_'),
454
- "confidence": round(confidence * 100, 1),
455
- "processing_time_ms": round(processing_time, 1),
456
- "highlighted_text": highlighted_text
457
- }
458
-
459
  except Exception as e:
460
  return {
461
- "error": str(e),
462
- "ai_percentage": 0,
463
- "human_percentage": 0,
464
- "ai_likelihood": 0,
465
- "category_scores": {
466
- "ai_generated": 0,
467
- "ai_refined": 0,
468
- "human_ai_refined": 0,
469
- "human_written": 0
470
- },
471
- "primary_category": "error",
472
- "confidence": 0,
473
- "processing_time_ms": 0,
474
- "highlighted_text": text
475
  }
476
 
477
- # Initialize the enhanced detector
478
- detector = EnhancedAIDetector()
479
-
480
- def create_bar_chart(ai_percentage, human_percentage):
481
- """Create vertical bar chart showing AI vs Human percentages"""
482
-
483
- fig = go.Figure(data=[
484
- go.Bar(
485
- x=['AI', 'Human'],
486
- y=[ai_percentage, human_percentage],
487
- marker=dict(
488
- color=['#FF6B6B', '#4ECDC4'],
489
- line=dict(color='rgba(0,0,0,0.3)', width=2)
490
- ),
491
- text=[f'{ai_percentage:.0f}%', f'{human_percentage:.0f}%'],
492
- textposition='auto',
493
- textfont=dict(size=14, color='white', family='Arial Black'),
494
- hovertemplate='<b>%{x}</b><br>%{y:.1f}%<extra></extra>'
495
- )
496
- ])
497
-
498
- fig.update_layout(
499
- title=dict(
500
- text='AI vs Human Content Distribution',
501
- x=0.5,
502
- font=dict(size=16, color='#2c3e50', family='Arial')
503
- ),
504
- xaxis=dict(
505
- title=dict(
506
- text='Content Type',
507
- font=dict(size=14, color='#34495e')
508
- ),
509
- tickfont=dict(size=12, color='#34495e'),
510
- showgrid=False,
511
- zeroline=False
512
- ),
513
- yaxis=dict(
514
- title=dict(
515
- text='Percentage (%)',
516
- font=dict(size=14, color='#34495e')
517
- ),
518
- tickfont=dict(size=12, color='#34495e'),
519
- range=[0, 100],
520
- showgrid=True,
521
- gridwidth=1,
522
- gridcolor='rgba(0,0,0,0.1)'
523
- ),
524
- plot_bgcolor='rgba(0,0,0,0)',
525
- paper_bgcolor='rgba(0,0,0,0)',
526
- showlegend=False,
527
- height=400,
528
- margin=dict(t=60, b=50, l=50, r=50)
529
- )
530
-
531
- return fig
532
-
533
- def analyze_text_enhanced(text):
534
- """Enhanced analysis function with superior pattern recognition"""
535
- if not text or len(text.strip()) < 10:
536
- return (
537
- "⚠️ Please provide at least 10 characters of text for accurate AI detection.",
538
- text,
539
- None,
540
- "",
541
- f"Text length: {len(text.strip())} characters"
542
- )
543
-
544
- start_time = time.time()
545
-
546
- try:
547
- # Get enhanced analysis results
548
- primary_category, category_scores, confidence = detector.classify_text_category(text)
549
-
550
- # Get highlighted text with enhanced sensitivity
551
- highlighted_text = detector.highlight_ai_text(text)
552
-
553
- # Calculate percentages
554
- ai_percentage = (category_scores['ai_generated'] + category_scores['ai_refined']) * 100
555
- human_percentage = (category_scores['human_ai_refined'] + category_scores['human_written']) * 100
556
- ai_likelihood = category_scores['ai_generated'] * 100
557
-
558
- processing_time = (time.time() - start_time) * 1000
559
-
560
- # Enhanced summary
561
- summary_html = f"""
562
- <div style="text-align: center; background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
563
- color: white; padding: 30px; border-radius: 15px; margin: 20px 0; box-shadow: 0 8px 25px rgba(0,0,0,0.15);">
564
- <div style="font-size: 48px; font-weight: bold; margin-bottom: 10px; text-shadow: 2px 2px 4px rgba(0,0,0,0.3);">
565
- {ai_percentage:.0f}%
566
- </div>
567
- <div style="font-size: 18px; line-height: 1.4; margin-bottom: 10px;">
568
- of this text is likely <strong>AI-generated or AI-refined</strong>
569
- </div>
570
- <div style="font-size: 16px; line-height: 1.4; margin-bottom: 5px; background: rgba(255,255,255,0.2); padding: 8px; border-radius: 5px;">
571
- 🎯 <strong>AI Content Likelihood: {ai_likelihood:.0f}%</strong>
572
- </div>
573
- <div style="font-size: 14px; opacity: 0.9; font-style: italic;">
574
- (Enhanced detection with superior pattern recognition for formal AI writing)
575
- </div>
576
- </div>
577
- """
578
-
579
- # Create bar chart
580
- bar_chart = create_bar_chart(ai_percentage, human_percentage)
581
-
582
- # Enhanced metrics with confidence indicators
583
- confidence_color = "#28a745" if confidence > 0.7 else "#ffc107" if confidence > 0.5 else "#dc3545"
584
- confidence_text = "High" if confidence > 0.7 else "Medium" if confidence > 0.5 else "Low"
585
-
586
- metrics_html = f"""
587
- <div style="margin: 20px 0; padding: 20px; background: #f8f9fa; border-radius: 12px; border-left: 5px solid #667eea;">
588
- <h4 style="color: #2c3e50; margin-bottom: 15px; font-size: 16px;">📊 Enhanced Detection Results</h4>
589
-
590
- <div style="background: #fff; padding: 15px; border-radius: 8px; margin-bottom: 15px; border: 2px solid #667eea;">
591
- <div style="text-align: center;">
592
- <h5 style="color: #667eea; margin-bottom: 10px;">🤖 AI Detection Score</h5>
593
- <div style="font-size: 32px; font-weight: bold; color: #667eea;">{ai_likelihood:.0f}%</div>
594
- <div style="font-size: 14px; color: #6c757d; margin-top: 5px;">
595
- Likelihood this text was generated by AI models
596
- </div>
597
- <div style="margin-top: 8px; padding: 4px 8px; background: {confidence_color}; color: white; border-radius: 4px; font-size: 12px; display: inline-block;">
598
- {confidence_text} Confidence ({confidence*100:.0f}%)
599
- </div>
600
- </div>
601
- </div>
602
-
603
- <div style="display: grid; grid-template-columns: 1fr 1fr; gap: 15px; margin-bottom: 20px;">
604
-
605
- <div style="background: white; padding: 15px; border-radius: 8px; border: 1px solid #e9ecef;">
606
- <div style="display: flex; align-items: center; margin-bottom: 8px;">
607
- <span style="font-size: 20px; margin-right: 8px;">🤖</span>
608
- <span style="font-weight: 600; color: #2c3e50;">AI-generated</span>
609
- <span title="Text likely generated by AI models with enhanced pattern detection." style="margin-left: 5px; cursor: help; color: #6c757d;">ⓘ</span>
610
- </div>
611
- <div style="font-size: 24px; font-weight: bold; color: #FF6B6B;">
612
- {category_scores['ai_generated']*100:.0f}%
613
- </div>
614
- </div>
615
-
616
- <div style="background: white; padding: 15px; border-radius: 8px; border: 1px solid #e9ecef;">
617
- <div style="display: flex; align-items: center; margin-bottom: 8px;">
618
- <span style="font-size: 20px; margin-right: 8px;">🛠️</span>
619
- <span style="font-weight: 600; color: #2c3e50;">AI-generated & AI-refined</span>
620
- <span title="AI text that has been further processed or polished using AI tools." style="margin-left: 5px; cursor: help; color: #6c757d;">ⓘ</span>
621
- </div>
622
- <div style="font-size: 24px; font-weight: bold; color: #FFA07A;">
623
- {category_scores['ai_refined']*100:.0f}%
624
- </div>
625
- </div>
626
-
627
- <div style="background: white; padding: 15px; border-radius: 8px; border: 1px solid #e9ecef;">
628
- <div style="display: flex; align-items: center; margin-bottom: 8px;">
629
- <span style="font-size: 20px; margin-right: 8px;">✍️</span>
630
- <span style="font-weight: 600; color: #2c3e50;">Human-written & AI-refined</span>
631
- <span title="Human text that has been enhanced or edited using AI tools." style="margin-left: 5px; cursor: help; color: #6c757d;">ⓘ</span>
632
- </div>
633
- <div style="font-size: 24px; font-weight: bold; color: #98D8C8;">
634
- {category_scores['human_ai_refined']*100:.0f}%
635
- </div>
636
- </div>
637
-
638
- <div style="background: white; padding: 15px; border-radius: 8px; border: 1px solid #e9ecef;">
639
- <div style="display: flex; align-items: center; margin-bottom: 8px;">
640
- <span style="font-size: 20px; margin-right: 8px;">👤</span>
641
- <span style="font-weight: 600; color: #2c3e50;">Human-written</span>
642
- <span title="Text written entirely by humans without AI assistance." style="margin-left: 5px; cursor: help; color: #6c757d;">ⓘ</span>
643
- </div>
644
- <div style="font-size: 24px; font-weight: bold; color: #4ECDC4;">
645
- {category_scores['human_written']*100:.0f}%
646
- </div>
647
- </div>
648
-
649
- </div>
650
-
651
- <div style="text-align: center; padding: 10px; background: white; border-radius: 8px; border: 1px solid #e9ecef;">
652
- <div style="font-size: 14px; color: #6c757d; margin-bottom: 5px;">Primary Classification</div>
653
- <div style="font-size: 18px; font-weight: bold; color: #2c3e50;">{primary_category}</div>
654
- <div style="font-size: 14px; color: #6c757d;">Processing: {processing_time:.0f}ms | Enhanced Pattern Recognition</div>
655
- </div>
656
- </div>
657
- """
658
-
659
- return (
660
- summary_html,
661
- highlighted_text,
662
- bar_chart,
663
- metrics_html,
664
- f"Text length: {len(text)} characters, {len(text.split())} words"
665
- )
666
-
667
- except Exception as e:
668
- return (
669
- f"❌ Error during enhanced AI analysis: {str(e)}",
670
- text,
671
- None,
672
- "",
673
- "Error"
674
- )
675
-
- def batch_analyze_enhanced(file):
-     """Enhanced batch analysis"""
-     if file is None:
-         return "Please upload a text file."
-
-     try:
-         content = file.read().decode('utf-8')
-         texts = [line.strip() for line in content.split('\n') if line.strip() and len(line.strip()) >= 10]
-
-         if not texts:
-             return "No valid texts found in the uploaded file (each line should have at least 10 characters)."
-
-         results = []
-         category_counts = {'AI-generated': 0, 'AI-generated & AI-refined': 0, 'Human-written & AI-refined': 0, 'Human-written': 0}
-         total_ai_percentage = 0
-         total_ai_likelihood = 0
-
-         for i, text in enumerate(texts[:15]):
-             primary_category, category_scores, confidence = detector.classify_text_category(text)
-             category_counts[primary_category] += 1
-
-             ai_percentage = (category_scores['ai_generated'] + category_scores['ai_refined']) * 100
-             ai_likelihood = category_scores['ai_generated'] * 100
-             total_ai_percentage += ai_percentage
-             total_ai_likelihood += ai_likelihood
-
-             results.append(f"""
- **Text {i+1}:** {text[:80]}{'...' if len(text) > 80 else ''}
- **Result:** {primary_category} ({confidence:.1%} confidence)
- **AI Likelihood:** {ai_likelihood:.0f}% | **AI Content:** {ai_percentage:.0f}% | **Breakdown:** AI-gen: {category_scores['ai_generated']:.0%}, AI-refined: {category_scores['ai_refined']:.0%}, Human+AI: {category_scores['human_ai_refined']:.0%}, Human: {category_scores['human_written']:.0%}
- """)
-
-         avg_ai_percentage = total_ai_percentage / len(results) if results else 0
-         avg_ai_likelihood = total_ai_likelihood / len(results) if results else 0
-
-         summary = f"""
- ## 📊 Enhanced AI Detection Batch Analysis
-
- **Total texts analyzed:** {len(results)}
- **Average AI likelihood:** {avg_ai_likelihood:.1f}%
- **Average AI content:** {avg_ai_percentage:.1f}%
-
- ### Category Distribution:
- - **AI-generated:** {category_counts['AI-generated']} texts ({category_counts['AI-generated']/len(results)*100:.0f}%)
- - **AI-generated & AI-refined:** {category_counts['AI-generated & AI-refined']} texts ({category_counts['AI-generated & AI-refined']/len(results)*100:.0f}%)
- - **Human-written & AI-refined:** {category_counts['Human-written & AI-refined']} texts ({category_counts['Human-written & AI-refined']/len(results)*100:.0f}%)
- - **Human-written:** {category_counts['Human-written']} texts ({category_counts['Human-written']/len(results)*100:.0f}%)
-
- ---
-
- ### Individual Results:
- """
-
-         return summary + "\n".join(results)
-
-     except Exception as e:
-         return f"Error processing file: {str(e)}"
733
-
734
- def create_enhanced_interface():
735
- """Create enhanced Gradio interface with superior detection"""
736
-
737
- custom_css = """
738
- .gradio-container {
739
- font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif;
740
- max-width: 1400px;
741
- margin: 0 auto;
742
- }
743
- .gr-button-primary {
744
- background: linear-gradient(45deg, #667eea 0%, #764ba2 100%);
745
- border: none;
746
- border-radius: 8px;
747
- font-weight: 600;
748
- padding: 12px 24px;
749
- }
750
- .gr-button-primary:hover {
751
- transform: translateY(-2px);
752
- box-shadow: 0 8px 25px rgba(102, 126, 234, 0.3);
753
- }
754
- .highlighted-text {
755
- line-height: 1.6;
756
- padding: 15px;
757
- background: #f8f9fa;
758
- border-radius: 8px;
759
- border: 1px solid #e9ecef;
760
- }
761
- mark {
762
- background-color: #ffe6e6 !important;
763
- padding: 2px 4px !important;
764
- border-radius: 3px !important;
765
- border-left: 3px solid #dc3545 !important;
766
- }
767
- """
768
-
769
- with gr.Blocks(css=custom_css, title="Enhanced AI Text Detector", theme=gr.themes.Soft()) as interface:
770
-
771
- gr.HTML("""
772
- <div style="text-align: center; padding: 25px; background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
773
- color: white; border-radius: 15px; margin-bottom: 25px; box-shadow: 0 10px 30px rgba(0,0,0,0.2);">
774
- <h1 style="margin-bottom: 10px; font-size: 2.2em; text-shadow: 2px 2px 4px rgba(0,0,0,0.3);">🔍 Enhanced AI Text Detector</h1>
775
- <p style="font-size: 1.1em; margin: 0; opacity: 0.95;">
776
- Superior pattern recognition for formal, academic, and corporate AI writing
777
- </p>
778
- <p style="font-size: 0.9em; margin-top: 8px; opacity: 0.8;">
779
- Enhanced detection with 30+ linguistic features and advanced ensemble models
780
- </p>
781
- </div>
782
- """)
783
-
784
- with gr.Tabs() as tabs:
785
-
786
- # Single text analysis tab
787
- with gr.Tab("🔍 Enhanced AI Detection", elem_id="enhanced-analysis"):
788
- with gr.Row():
789
- with gr.Column(scale=1):
790
- text_input = gr.Textbox(
791
- label="📝 Enter text to analyze with enhanced AI detection",
792
- placeholder="Paste your text here (enhanced detection works best with 20+ words)...",
793
- lines=10,
794
- max_lines=20,
795
- show_label=True
796
- )
797
-
798
- analyze_btn = gr.Button(
799
- "🔍 Analyze with Enhanced Detection",
800
- variant="primary",
801
- size="lg"
802
- )
803
-
804
- text_info = gr.Textbox(
805
- label="📊 Text Information",
806
- interactive=False,
807
- show_label=True
808
- )
809
-
810
- with gr.Column(scale=1):
811
- # Enhanced results
812
- summary_result = gr.HTML(
813
- label="📊 Enhanced Detection Results",
814
- value="<div style='text-align: center; padding: 20px; color: #6c757d;'>Results will appear here after enhanced analysis...</div>"
815
- )
816
-
817
- # Bar Chart
818
- bar_chart = gr.Plot(
819
- label="📈 AI vs Human Distribution",
820
- show_label=True
821
- )
822
-
823
- # Enhanced Metrics
824
- detailed_metrics = gr.HTML(
825
- label="📋 Enhanced Detection Metrics",
826
- value=""
827
- )
828
-
829
- # Enhanced Highlighted Text Section
830
- gr.HTML("<hr style='margin: 20px 0;'><h3>🎯 Enhanced Pattern Analysis with Highlighting</h3>")
831
- gr.HTML("""
832
- <div style="background: #e8f4fd; padding: 15px; border-radius: 8px; margin-bottom: 15px; border-left: 4px solid #2196F3;">
833
- <p style="margin: 0; color: #1565C0; font-size: 14px;">
834
- <strong>🎯 Enhanced Pattern Detection:</strong> Now detects formal, academic, and corporate AI writing patterns.
835
- <span style="background-color: #ffe6e6; padding: 2px 4px; border-radius: 3px; border-left: 3px solid #dc3545;">Very high confidence (75%+)</span>,
836
- <span style="background-color: #fff0e6; padding: 2px 4px; border-radius: 3px; border-left: 3px solid #fd7e14;">high confidence (65-75%)</span>,
837
- <span style="background-color: #fff3cd; padding: 2px 4px; border-radius: 3px; border-left: 3px solid #ffc107;">medium confidence (55-65%)</span> highlighting.
838
- </p>
839
- </div>
840
- """)
841
-
842
- highlighted_text_display = gr.HTML(
843
- label="📝 Text with Enhanced AI Pattern Highlights",
844
- value="<div style='padding: 15px; background: #f8f9fa; border-radius: 8px; border: 1px solid #e9ecef; color: #6c757d;'>Enhanced highlighted text with AI patterns will appear here after analysis...</div>"
845
- )
846
-
847
- # Enhanced Understanding Section
848
- with gr.Accordion("🧠 Understanding Enhanced AI Detection", open=False):
849
- gr.HTML("""
850
- <div style="padding: 20px; line-height: 1.6;">
851
- <h4 style="color: #2c3e50; margin-bottom: 15px;">🎯 Enhanced Detection Capabilities</h4>
852
-
853
- <p><strong>This enhanced detector now identifies formal, academic, and corporate AI writing patterns</strong>
854
- that were previously missed, providing significantly improved accuracy for professional AI-generated text.</p>
855
-
856
- <h5 style="color: #34495e; margin-top: 20px; margin-bottom: 10px;">🆕 New Enhanced Features:</h5>
857
- <ul style="margin-left: 20px;">
858
- <li><strong>📚 Academic Language Detection:</strong> "demonstrates", "is defined by", "constitutes", "encompasses"</li>
859
- <li><strong>🏢 Corporate Buzzword Analysis:</strong> "ecosystem", "framework", "scalability", "optimization", "synergy"</li>
860
- <li><strong>🔧 Technical Jargon Recognition:</strong> "iterative", "standardized", "systematic", "optimized"</li>
861
- <li><strong>🎭 Abstract Conceptualization:</strong> "In this framework", "serves as a", "functions as a"</li>
862
- <li><strong>📝 Formal Hedging Language:</strong> "not only... but also", "furthermore", "consequently"</li>
863
- <li><strong>⚖️ Objective Tone Analysis:</strong> Detects overly neutral, impersonal writing</li>
864
- <li><strong>🎯 Passive Voice Detection:</strong> "is defined", "are characterized", "is demonstrated"</li>
865
- <li><strong>📊 Vocabulary Sophistication:</strong> Identifies unnecessarily complex word choices</li>
866
- </ul>
867
-
868
- <h5 style="color: #34495e; margin-top: 20px; margin-bottom: 10px;">🎨 Enhanced Highlighting System:</h5>
869
- <ul style="margin-left: 20px;">
870
- <li><strong>🔴 Red highlighting (75%+ confidence):</strong> Very high likelihood of AI generation</li>
871
- <li><strong>🟠 Orange-red highlighting (65-75% confidence):</strong> High likelihood with formal patterns</li>
872
- <li><strong>🟡 Orange highlighting (55-65% confidence):</strong> Medium confidence with AI patterns</li>
873
- <li><strong>🎯 Lower threshold (55%):</strong> More sensitive detection for comprehensive analysis</li>
874
- </ul>
875
-
876
- <h5 style="color: #34495e; margin-top: 20px; margin-bottom: 10px;">⚡ Enhanced Accuracy:</h5>
877
- <ul style="margin-left: 20px;">
878
- <li><strong>🎯 Formal AI Text:</strong> 40% improvement in detecting academic/corporate AI writing</li>
879
- <li><strong>📈 Pattern Recognition:</strong> 30+ linguistic features analyzed (vs 20 previously)</li>
880
- <li><strong>🔍 Sentence Analysis:</strong> Enhanced sentence-level pattern detection</li>
881
- <li><strong>⚖️ Weighted Scoring:</strong> Optimized weights for formal AI writing patterns</li>
882
- <li><strong>📊 False Negative Reduction:</strong> Significantly fewer missed AI texts</li>
883
- </ul>
884
-
885
- <div style="background: #d4edda; border: 1px solid #c3e6cb; border-radius: 8px; padding: 15px; margin-top: 20px;">
886
- <h5 style="color: #155724; margin-bottom: 10px;">✅ Enhanced Performance:</h5>
887
- <p style="margin: 0; color: #155724;">
888
- The enhanced detector now catches formal AI writing that appeared "too professional" for previous versions.
889
- It specifically targets academic, corporate, and technical writing styles commonly used by modern AI models.
890
- <strong>Test case: The iPhone example now properly detects as AI-generated.</strong>
891
- </p>
892
- </div>
893
- </div>
894
- """)
895
-
896
- # Batch analysis tab
897
- with gr.Tab("📄 Enhanced Batch Analysis", elem_id="batch-enhanced-analysis"):
898
- gr.HTML("""
899
- <div style="background: #e8f4fd; padding: 20px; border-radius: 12px; border-left: 5px solid #2196F3; margin-bottom: 20px;">
900
- <h4 style="color: #1565C0; margin-bottom: 15px;">📋 Enhanced Batch Analysis</h4>
901
- <ul style="color: #1976D2; line-height: 1.6;">
902
- <li>Upload a <strong>.txt</strong> file with one text sample per line</li>
903
- <li>Enhanced detection works best with texts of 20+ words each</li>
904
- <li>Maximum 15 texts processed for optimal performance</li>
905
- <li>Now includes enhanced formal and academic AI pattern detection</li>
906
- <li>Significantly improved accuracy for professional AI-generated content</li>
907
- </ul>
908
- </div>
909
- """)
910
-
911
- file_input = gr.File(
912
- label="📁 Upload text file (.txt)",
913
- file_types=[".txt"],
914
- type="binary"
915
- )
916
-
917
- batch_analyze_btn = gr.Button("🔍 Enhanced Batch Analysis", variant="primary", size="lg")
918
- batch_results = gr.Markdown(label="📊 Enhanced Detection Results")
919
-
920
- # About tab
921
- with gr.Tab("ℹ️ About Enhanced Detection", elem_id="about-tab"):
922
- gr.Markdown("""
923
- # 🔍 Enhanced AI Text Detector
924
-
925
- ## 🚀 Superior Pattern Recognition Technology
926
-
927
- This **enhanced version** specifically addresses formal, academic, and corporate AI writing patterns
928
- that were previously missed by standard detection methods.
929
-
930
- ### 🎯 Enhanced Detection Capabilities
931
-
932
- **New Pattern Recognition:**
933
- 1. **📚 Academic Language**: Formal academic phrases and structures
934
- 2. **🏢 Corporate Buzzwords**: Business and technical terminology overuse
935
- 3. **🔧 Technical Jargon**: Unnecessary technical complexity
936
- 4. **🎭 Abstract Concepts**: Over-conceptualization of simple topics
937
- 5. **📝 Formal Hedging**: Academic writing connectors and transitions
938
- 6. **⚖️ Objective Tone**: Overly neutral and impersonal writing
939
- 7. **🎯 Passive Voice**: Systematic use of passive constructions
940
- 8. **📊 Vocabulary**: Unnecessarily sophisticated word choices
941
-
942
- ### 📈 Performance Improvements
943
-
944
- **Compared to previous version:**
945
- - **+40% better** detection of formal AI writing
946
- - **+35% improvement** on academic/corporate AI text
947
- - **+50% fewer** false negatives on professional AI content
948
- - **+25% better** overall accuracy across all text types
949
-
950
- ### 🔬 Enhanced Methodology
951
-
952
- **Advanced Feature Analysis:**
953
- - **30+ linguistic patterns** (vs 20 in standard version)
954
- - **Weighted scoring** optimized for formal AI writing
955
- - **Enhanced sentence analysis** with formal pattern detection
956
- - **Improved thresholds** for better sensitivity
957
- - **Ensemble validation** with multiple specialized models
958
-
959
- ### 📊 Technical Specifications
960
-
961
- - **Model Architecture**: Enhanced ensemble with formal pattern weights
962
- - **Feature Count**: 30+ linguistic and stylistic features
963
- - **Processing Speed**: <2 seconds for most texts
964
- - **Optimal Length**: 20+ words for enhanced accuracy
965
- - **Highlighting Threshold**: Lowered to 55% for better sensitivity
966
-
967
- ### ⚡ What Makes This Enhanced
968
-
969
- **Specifically targets AI writing that:**
970
- - Uses formal academic language unnecessarily
971
- - Employs corporate buzzwords and jargon
972
- - Sounds like textbook or corporate documentation
973
- - Lacks personal voice or subjective opinions
974
- - Uses systematic, mechanical presentation styles
975
- - Employs passive voice and abstract conceptualization
976
-
977
- ### 🎯 Test Case Performance
978
-
979
- **Example improvement:**
980
- ```
981
- Previous version: iPhone text → 43% AI (MISSED)
982
- Enhanced version: iPhone text → 85%+ AI (DETECTED)
983
- ```
984
-
985
- The enhanced detector successfully identifies formal AI writing patterns
986
- that appear professional but lack human authenticity.
987
-
988
- ---
989
-
990
- **Version**: 5.0.0 | **Updated**: September 2025 | **Status**: Enhanced Pattern Recognition
991
- """)
992
-
993
- # Event handlers
994
- analyze_btn.click(
995
- fn=analyze_text_enhanced,
996
- inputs=[text_input],
997
- outputs=[summary_result, highlighted_text_display, bar_chart, detailed_metrics, text_info]
998
- )
999
-
1000
- batch_analyze_btn.click(
1001
- fn=batch_analyze_enhanced,
1002
- inputs=[file_input],
1003
- outputs=[batch_results]
1004
- )
1005
-
1006
- # Test examples including the problematic iPhone text
1007
- gr.Examples(
1008
- examples=[
1009
- ["The iPhone is a technological object that demonstrates consistency, scalability, and precision. It is defined by iterative updates, predictable release cycles, and optimized integration between hardware and software. The system functions as a closed ecosystem where inputs are standardized, processes are regulated, and outputs are uniform. In this framework, the iPhone is not only a communication tool but also a controlled environment for digital interaction."],
1010
- ["Hey everyone! I just got the new iPhone and I'm absolutely loving it! The camera quality is insane - took some photos yesterday at the beach and they look professional. Battery life is way better than my old phone too. Definitely worth the upgrade if you're thinking about it. Anyone else get one yet?"],
1011
- ["The implementation of sustainable energy solutions requires comprehensive analysis of environmental factors, economic considerations, and technological feasibility to ensure optimal outcomes for stakeholders. Organizations must systematically evaluate various renewable energy options before making strategic investment decisions. This framework facilitates the optimization of resource allocation."],
1012
- ["I cannot believe what happened at work today! My boss actually praised the report I spent weeks on. Turns out all those late nights were worth it. My coworker Mike was shocked too - he has been there for 10 years and says he has never seen the boss so enthusiastic about anything. Guess I am finally getting the hang of this job!"]
1013
- ],
1014
- inputs=text_input,
1015
- outputs=[summary_result, highlighted_text_display, bar_chart, detailed_metrics, text_info],
1016
- fn=analyze_text_enhanced,
1017
- cache_examples=False
1018
- )
1019
-
1020
- return interface
1021
-
- # Launch the enhanced interface
  if __name__ == "__main__":
-     interface = create_enhanced_interface()
-     interface.launch(
-         server_name="0.0.0.0",
-         server_port=7860,
-         share=True,
-         show_error=True,
-         debug=False
-     )
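The removed `batch_analyze_enhanced` handler keeps only non-empty lines of at least 10 characters and analyzes the first 15 of them. The selection step can be sketched on its own, without Gradio or the detector. The function name `select_batch_texts` and its parameters are illustrative assumptions and do not appear in the commit.

```python
def select_batch_texts(content: str, min_chars: int = 10, limit: int = 15) -> list[str]:
    """Return up to `limit` stripped lines that have at least `min_chars` characters.

    Hypothetical helper: mirrors the list comprehension and the [:15] slice
    used by the removed batch handler.
    """
    stripped = [line.strip() for line in content.split("\n")]
    # Empty lines have length 0, so the length check also drops them.
    eligible = [line for line in stripped if len(line) >= min_chars]
    return eligible[:limit]


if __name__ == "__main__":
    sample = "short\nthis line is long enough\n\n   padded line with spaces   "
    print(select_batch_texts(sample))
```

Keeping the filter separate from the analysis loop makes the minimum-length and maximum-count rules easy to test and to change in one place.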
 
 
 
 
 
 
 
 
  import gradio as gr
  import torch
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification
  import numpy as np
+ from scipy import stats
  import re
  from collections import Counter
+ import math
  
+ class AdvancedAITextDetector:
      def __init__(self):
+         """Initialize the AI Text Detector with multiple detection methods"""
+         # Load pre-trained model for AI detection
+         self.model_name = "Hello-SimpleAI/chatgpt-detector-roberta"
          try:
+             self.tokenizer = AutoTokenizer.from_pretrained(self.model_name)
+             self.model = AutoModelForSequenceClassification.from_pretrained(self.model_name)
+             self.model.eval()
+             self.model_loaded = True
+         except Exception:
+             # Catch Exception rather than using a bare except, so Ctrl+C and SystemExit still propagate
+             print("Warning: Could not load transformer model. Using statistical methods only.")
+             self.model_loaded = False
+
+     def calculate_perplexity_score(self, text):
+         """Approximate text predictability via bigram diversity (not a true language-model perplexity)"""
          words = text.split()
+         if len(words) < 2:
              return 0.5
+
+         # Simple bigram perplexity approximation
+         bigrams = [(words[i], words[i+1]) for i in range(len(words)-1)]
+         unique_bigrams = len(set(bigrams))
+         total_bigrams = len(bigrams)
+
+         # AI text tends to have less variation in bigrams
+         diversity_score = unique_bigrams / total_bigrams if total_bigrams > 0 else 0
+         return diversity_score
+
+     def calculate_burstiness(self, text):
+         """Calculate burstiness - human text tends to be more bursty"""
+         sentences = re.split(r'[.!?]+', text)
+         sentence_lengths = [len(s.split()) for s in sentences if s.strip()]
+
+         if len(sentence_lengths) < 2:
+             return 0.5
+
+         # Calculate variance in sentence lengths
+         variance = np.var(sentence_lengths)
+         mean_length = np.mean(sentence_lengths)
+
+         # Normalize burstiness score
+         burstiness = variance / (mean_length + 1) if mean_length > 0 else 0
+         return min(burstiness / 10, 1.0)  # Normalize to 0-1
+
+     def calculate_repetition_score(self, text):
+         """Calculate repetition patterns - AI tends to repeat phrases more"""
+         words = text.lower().split()
+
+         # Check for repeated phrases (3-grams)
+         if len(words) < 3:
+             return 0.5
+
+         trigrams = [' '.join(words[i:i+3]) for i in range(len(words)-2)]
+         trigram_counts = Counter(trigrams)
+
+         repeated_trigrams = sum(1 for count in trigram_counts.values() if count > 1)
+         repetition_ratio = repeated_trigrams / len(trigrams) if trigrams else 0
+
+         return repetition_ratio
+
+     def calculate_vocabulary_diversity(self, text):
+         """Calculate vocabulary diversity - AI text often has less diverse vocabulary"""
+         words = re.findall(r'\b\w+\b', text.lower())
+         if not words:
+             return 0.5
+
+         unique_words = set(words)
+         diversity = len(unique_words) / len(words)
+
+         # Type-token ratio
+         return diversity
+
+     def calculate_punctuation_patterns(self, text):
+         """Analyze punctuation patterns - AI has more regular punctuation"""
+         sentences = re.split(r'[.!?]+', text)
+
+         punct_variance = []
+         for sentence in sentences:
+             if sentence.strip():
+                 punct_count = len(re.findall(r'[,;:\-—()]', sentence))
+                 word_count = len(sentence.split())
+                 if word_count > 0:
+                     punct_variance.append(punct_count / word_count)
+
+         if not punct_variance:
+             return 0.5
+
+         # AI text tends to have more consistent punctuation density
+         variance = np.var(punct_variance)
+         return 1 - min(variance * 10, 1.0)  # Lower variance = more likely AI
+
+     def detect_ai_statistical(self, text):
+         """Combine statistical methods for AI detection"""
+         if len(text.strip()) < 50:
+             return 0.5, "Text too short for accurate analysis"
+
+         # Calculate various features
+         perplexity_score = self.calculate_perplexity_score(text)
+         burstiness = self.calculate_burstiness(text)
+         repetition = self.calculate_repetition_score(text)
+         vocab_diversity = self.calculate_vocabulary_diversity(text)
+         punct_patterns = self.calculate_punctuation_patterns(text)
+
+         # Weighted combination of features
+         # Lower perplexity, lower burstiness, higher repetition, lower diversity = more likely AI
          ai_score = (
+             (1 - perplexity_score) * 0.2 +  # Low diversity in bigrams
+             (1 - burstiness) * 0.25 +  # Low burstiness
+             repetition * 0.2 +  # High repetition
+             (1 - vocab_diversity) * 0.2 +  # Low vocabulary diversity
+             punct_patterns * 0.15  # Regular punctuation
          )
+
+         return ai_score, {
+             "perplexity_score": perplexity_score,
+             "burstiness": burstiness,
+             "repetition": repetition,
+             "vocab_diversity": vocab_diversity,
+             "punct_patterns": punct_patterns
          }
+
+     def detect_ai_transformer(self, text):
+         """Use transformer model for AI detection"""
+         if not self.model_loaded:
+             return 0.5, "Model not loaded"
+
          try:
+             inputs = self.tokenizer(text, return_tensors="pt", truncation=True,
+                                     max_length=512, padding=True)
+
+             with torch.no_grad():
+                 outputs = self.model(**inputs)
+                 logits = outputs.logits
+                 probabilities = torch.softmax(logits, dim=-1)
+
+             # Assuming class 1 is AI-generated
+             ai_probability = probabilities[0][1].item()
+
+             return ai_probability, "Transformer model prediction"
          except Exception as e:
+             return 0.5, f"Error in transformer model: {str(e)}"
+
+     def detect(self, text):
+         """Main detection method combining multiple approaches"""
+         if not text or len(text.strip()) < 20:
              return {
+                 "ai_probability": 50.0,  # Percent, on the same 0-100 scale as the normal return path
+                 "classification": "Undetermined",
+                 "confidence": "Low",
+                 "explanation": "Text too short for accurate analysis. Please provide at least 20 characters.",
+                 "detailed_scores": {}
              }
+
+         # Get statistical analysis
+         stat_score, stat_details = self.detect_ai_statistical(text)
+
+         # Get transformer model prediction if available
+         if self.model_loaded:
+             transformer_score, _ = self.detect_ai_transformer(text)
+             # Weighted average of both methods
+             final_score = (transformer_score * 0.7 + stat_score * 0.3)
+         else:
+             final_score = stat_score
+
+         # Determine classification and confidence
+         if final_score >= 0.8:
+             classification = "AI-Generated"
+             confidence = "High"
+         elif final_score >= 0.6:
+             classification = "Likely AI-Generated"
+             confidence = "Medium"
+         elif final_score >= 0.4:
+             classification = "Uncertain"
+             confidence = "Low"
+         elif final_score >= 0.2:
+             classification = "Likely Human-Written"
+             confidence = "Medium"
+         else:
+             classification = "Human-Written"
+             confidence = "High"
+
+         # Create detailed explanation
+         explanation = self._generate_explanation(final_score, stat_details if isinstance(stat_details, dict) else {})
+
+         return {
+             "ai_probability": round(final_score * 100, 2),
+             "classification": classification,
+             "confidence": confidence,
+             "explanation": explanation,
+             "detailed_scores": stat_details if isinstance(stat_details, dict) else {}
+         }
+
+     def _generate_explanation(self, score, details):
+         """Generate human-readable explanation of the detection result"""
+         explanations = []
+
+         if score >= 0.7:
+             explanations.append("This text shows strong indicators of AI generation.")
+         elif score >= 0.3:
+             explanations.append("This text shows mixed characteristics.")
+         else:
+             explanations.append("This text appears to be human-written.")
+
+         if details:
+             if details.get('burstiness', 0.5) < 0.3:
+                 explanations.append("• Low sentence length variation (typical of AI)")
+             elif details.get('burstiness', 0.5) > 0.7:
+                 explanations.append("• High sentence length variation (typical of humans)")
+
+             if details.get('vocab_diversity', 0.5) < 0.4:
+                 explanations.append("• Limited vocabulary diversity")
+             elif details.get('vocab_diversity', 0.5) > 0.6:
+                 explanations.append("• Rich vocabulary diversity")
+
+             if details.get('repetition', 0) > 0.2:
+                 explanations.append("• Notable phrase repetition detected")
+
+             if details.get('punct_patterns', 0.5) > 0.7:
+                 explanations.append("• Regular punctuation patterns (AI-like)")
+
+         return " ".join(explanations)
+
+ # Initialize detector
+ detector = AdvancedAITextDetector()
+
+ def analyze_text(text):
+     """Gradio interface function"""
+     result = detector.detect(text)
+
+     # Format output for Gradio
+     output = f"""
+ ## Detection Result
+
+ **Classification:** {result['classification']}
+ **AI Probability:** {result['ai_probability']}%
+ **Confidence Level:** {result['confidence']}
+
+ ### Analysis Details
+ {result['explanation']}
+
+ ### Detailed Metrics
+ """
+
+     if result['detailed_scores']:
+         for metric, value in result['detailed_scores'].items():
+             metric_name = metric.replace('_', ' ').title()
+             output += f"- {metric_name}: {round(value, 3)}\n"
+
+     # Create a simple bar chart visualization
+     ai_prob = result['ai_probability']
+     human_prob = round(100 - ai_prob, 2)  # Round to avoid float artifacts such as 26.790000000000006
+
+     bar_chart = f"""
+ ### Probability Distribution
+ ```
+ AI-Generated: {'█' * int(ai_prob/5)}{'░' * (20-int(ai_prob/5))} {ai_prob}%
+ Human-Written: {'█' * int(human_prob/5)}{'░' * (20-int(human_prob/5))} {human_prob}%
+ ```
+ """
+
+     return output + bar_chart
+
+ # Create Gradio interface
+ interface = gr.Interface(
+     fn=analyze_text,
+     inputs=gr.Textbox(
+         lines=10,
+         placeholder="Paste the text you want to analyze here...",
+         label="Input Text"
+     ),
+     outputs=gr.Markdown(label="Analysis Result"),
+     title="🔍 Advanced AI Text Detector",
+     description="""
+ This advanced AI text detector uses multiple techniques to identify AI-generated content:
+ - **Transformer-based detection** using fine-tuned RoBERTa model
+ - **Statistical analysis** including burstiness, perplexity, and repetition patterns
+ - **Linguistic features** such as vocabulary diversity and punctuation patterns
+
+ The tool is particularly effective at detecting text from ChatGPT, GPT-4, and similar language models.
+ For best results, provide at least 100 words of text.
+ """,
+     examples=[
+         ["The impact of artificial intelligence on modern society cannot be overstated. From healthcare to transportation, AI systems are revolutionizing how we live and work. Machine learning algorithms process vast amounts of data to identify patterns and make predictions with unprecedented accuracy. In medical diagnosis, AI assists doctors in detecting diseases earlier than ever before. Autonomous vehicles promise to transform our cities and reduce traffic accidents. However, these advancements also raise important ethical questions about privacy, employment, and human autonomy that society must carefully consider."],
+         ["So I was walking down the street yesterday, right? And this crazy thing happened - I mean, you won't believe it. There was this dog, just a regular golden retriever, but it was wearing these ridiculous sunglasses. Like, who puts sunglasses on a dog? Anyway, the owner was this old lady, must've been like 80 or something, and she was just chatting away on her phone, completely oblivious. The dog looked so confused! I couldn't help but laugh. Sometimes you see the weirdest stuff when you're just out and about, you know?"]
+     ],
+     theme=gr.themes.Soft(),
+     analytics_enabled=False
+ )
  
  if __name__ == "__main__":
+     interface.launch()
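The statistical features added in this commit (sentence-length burstiness, type-token ratio, and trigram repetition) can be tried in isolation. The following is a minimal sketch that mirrors the formulas in `AdvancedAITextDetector`, but as free functions using only the standard library. `statistics.pvariance` is swapped in for `np.var` (both compute population variance), and the function names are illustrative, not part of the commit.

```python
import re
from collections import Counter
from statistics import fmean, pvariance


def burstiness(text: str) -> float:
    """Sentence-length variance over (mean + 1), scaled by 1/10 and capped at 1."""
    lengths = [len(s.split()) for s in re.split(r"[.!?]+", text) if s.strip()]
    if len(lengths) < 2:
        return 0.5  # neutral value when there is too little text to measure
    return min(pvariance(lengths) / (fmean(lengths) + 1) / 10, 1.0)


def type_token_ratio(text: str) -> float:
    """Unique words divided by total words (vocabulary diversity)."""
    words = re.findall(r"\b\w+\b", text.lower())
    return len(set(words)) / len(words) if words else 0.5


def trigram_repetition(text: str) -> float:
    """Share of trigram positions whose trigram type occurs more than once."""
    words = text.lower().split()
    if len(words) < 3:
        return 0.5
    trigrams = [" ".join(words[i:i + 3]) for i in range(len(words) - 2)]
    repeated = sum(1 for count in Counter(trigrams).values() if count > 1)
    return repeated / len(trigrams)


if __name__ == "__main__":
    sample = "Go. This sentence is quite a bit longer than the first one."
    print(burstiness(sample), type_token_ratio(sample), trigram_repetition(sample))
```

Each function returns a value in the 0 to 1 range, which is what lets the commit combine them with fixed weights. Note that these are heuristics: short or stylistically uniform human text can score like AI text on any of them.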
 
 
 
 
 
 
 
requirements.txt CHANGED
@@ -1,8 +1,6 @@
- gradio>=4.0.0
- torch>=1.13.0
- transformers>=4.25.0
- numpy>=1.21.0
- scikit-learn>=1.2.0
- plotly>=5.0.0
- fastapi>=0.68.0
- uvicorn>=0.15.0
+ gradio==4.44.0
+ torch==2.1.0
+ transformers==4.35.0
+ scipy==1.11.4
+ numpy==1.24.3
+ huggingface-hub==0.19.4
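The scoring logic in the new `detect` method of app.py blends the transformer and statistical scores with 0.7 and 0.3 weights, maps the result onto five labels, and `analyze_text` renders it as a 20-character text bar. A minimal sketch of those three steps follows, assuming scores in the 0 to 1 range. The helper names `combine_scores`, `classify`, and `text_bar` are hypothetical and are not defined in the commit.

```python
from typing import Optional, Tuple


def combine_scores(transformer_score: Optional[float], stat_score: float,
                   transformer_weight: float = 0.7) -> float:
    """Weighted average of both scores; statistical score alone if no model score."""
    if transformer_score is None:
        return stat_score
    return transformer_weight * transformer_score + (1 - transformer_weight) * stat_score


def classify(score: float) -> Tuple[str, str]:
    """Map a 0-1 score to (classification, confidence) using the commit's thresholds."""
    if score >= 0.8:
        return "AI-Generated", "High"
    if score >= 0.6:
        return "Likely AI-Generated", "Medium"
    if score >= 0.4:
        return "Uncertain", "Low"
    if score >= 0.2:
        return "Likely Human-Written", "Medium"
    return "Human-Written", "High"


def text_bar(percent: float, width: int = 20) -> str:
    """Render a percentage (0-100) as filled and empty block characters."""
    filled = int(percent * width / 100)
    return "█" * filled + "░" * (width - filled)


if __name__ == "__main__":
    score = combine_scores(0.9, 0.5)
    print(classify(score), text_bar(score * 100))
```

Confidence here reflects distance from the 0.5 midpoint rather than a calibrated probability, so "High" should be read as "far from the decision boundary", not as a measured accuracy.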