Upload 12 files
- RAG_SETUP_GUIDE.md +267 -0
- app.py +46 -0
- debug_rag_setup.py +80 -0
- fix_oauth_setup.py +141 -0
- fix_verification_issue.py +112 -0
- get_drive_links.py +48 -0
- quick_check.py +55 -0
- rag_news_manager.py +432 -0
- setup_google_drive_rag.py +199 -0
- view_rag_news.py +283 -0
RAG_SETUP_GUIDE.md
ADDED
@@ -0,0 +1,267 @@
# Enhanced RAG System Setup Guide

This guide will help you set up the Enhanced RAG (Retrieval-Augmented Generation) system for saving high-confidence news to Google Drive.

## Overview

The Enhanced RAG system automatically saves news with **95%+ confidence** from Gemini analysis to Google Drive, allowing you to:
- View all high-confidence news entries
- Use them for better RAG analysis
- Track user input patterns
- Build a comprehensive knowledge base

## 🔧 Setup Steps

### Step 1: Google Cloud Console Setup

1. **Go to Google Cloud Console**
   - Visit: https://console.cloud.google.com/

2. **Create or Select Project**
   - Create a new project or select existing one
   - Note your project ID

3. **Enable Google Drive API**
   - Go to "APIs & Services" → "Library"
   - Search for "Google Drive API"
   - Click "Enable"

4. **Create OAuth 2.0 Credentials**
   - Go to "APIs & Services" → "Credentials"
   - Click "Create Credentials" → "OAuth 2.0 Client IDs"
   - Choose "Desktop application"
   - Download the JSON file
   - Rename it to `credentials.json`
   - Place it in your project directory

### Step 2: Local Setup

1. **Run the Setup Script**
   ```bash
   python setup_google_drive_rag.py
   ```

2. **Follow the Authentication Process**
   - A browser window will open
   - Log in with your Google account
   - Grant permissions for Google Drive access
   - The script will save your credentials

3. **Verify Setup**
   - The script will test Google Drive access
   - It will create the RAG folder and file
   - You'll see confirmation messages

### Step 3: Hugging Face Spaces Setup (Optional)

If you want to use this on Hugging Face Spaces:

1. **Add Secrets to Hugging Face**
   - Go to your Space settings
   - Add these secrets:
     - `GOOGLE_CLIENT_ID`: Your OAuth client ID
     - `GOOGLE_CLIENT_SECRET`: Your OAuth client secret
     - `GOOGLE_REFRESH_TOKEN`: Get this from your local token.json

2. **Get Refresh Token**
   - Run the setup script locally first
   - Check the `token.json` file
   - Copy the `refresh_token` value
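
How `rag_news_manager.py` reads these secrets is not shown in this guide, so the following is only a minimal sketch under assumptions. The helper name `authorized_user_info_from_env` is hypothetical; it gathers the three secrets listed above into a dictionary shaped like the contents of `token.json`.

```python
import os

# The three secret names from the list above.
REQUIRED_SECRETS = (
    "GOOGLE_CLIENT_ID",
    "GOOGLE_CLIENT_SECRET",
    "GOOGLE_REFRESH_TOKEN",
)


def authorized_user_info_from_env(env=None):
    """Collect the Space secrets into an 'authorized user' style dict.

    Hypothetical helper: the real project may load its secrets differently.
    """
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_SECRETS if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing secrets: {', '.join(missing)}")
    return {
        "client_id": env["GOOGLE_CLIENT_ID"],
        "client_secret": env["GOOGLE_CLIENT_SECRET"],
        "refresh_token": env["GOOGLE_REFRESH_TOKEN"],
        # Google's OAuth 2.0 token endpoint, used to refresh access tokens.
        "token_uri": "https://oauth2.googleapis.com/token",
    }
```

A dictionary of this shape could then be handed to `google.oauth2.credentials.Credentials.from_authorized_user_info(info, scopes)` from the `google-auth` package. Failing early on a missing secret tends to be easier to diagnose on a Space than a later authentication error.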

## File Structure

After setup, you'll have:

```
your-project/
├── credentials.json          # Google OAuth credentials
├── token.json                # Saved authentication token
├── rag_news_manager.py       # Main RAG system
├── setup_google_drive_rag.py # Setup script
├── view_rag_news.py          # News viewer
└── app.py                    # Your main app (updated)
```

## Google Drive Structure

The system creates:

```
Google Drive/
└── Vietnamese_Fake_News_RAG/
    └── high_confidence_news.json
```

## How It Works

### Automatic Saving
- When users input news, the system analyzes it
- If Gemini confidence > 95%, it's automatically saved to Google Drive
- Each entry includes:
  - News text
  - Prediction (REAL/FAKE)
  - Confidence score
  - Gemini analysis
  - Search results
  - Timestamp
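
The save decision and the entry fields above can be sketched roughly as follows. This is an illustration under assumptions, not the code in `rag_news_manager.py`: the helper names `should_save` and `build_entry` are hypothetical, and SHA-256 is only an assumed choice for the content hash.

```python
import hashlib
from datetime import datetime, timezone

RAG_CONFIDENCE_THRESHOLD = 0.95  # same value as the app.py setting


def should_save(gemini_confidence, threshold=RAG_CONFIDENCE_THRESHOLD):
    """Strictly-greater comparison, matching the `>` check in app.py."""
    return gemini_confidence > threshold


def build_entry(entry_id, news_text, prediction, gemini_confidence,
                gemini_analysis, search_results):
    """Assemble one entry with the fields listed above (hypothetical helper)."""
    return {
        "id": entry_id,
        # Assumption: a SHA-256 digest of the trimmed text, used to spot duplicates.
        "content_hash": hashlib.sha256(
            news_text.strip().encode("utf-8")
        ).hexdigest(),
        "news_text": news_text,
        "prediction": prediction,  # "REAL" or "FAKE"
        "gemini_confidence": gemini_confidence,
        "gemini_analysis": gemini_analysis,
        "search_results": search_results,
        "created_at": datetime.now(timezone.utc).isoformat(),
    }
```

Note that with a strict comparison, a confidence of exactly 0.95 is not saved.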

### Data Format
```json
{
  "metadata": {
    "created_at": "2024-01-01T00:00:00",
    "description": "High-confidence Vietnamese fake news for RAG",
    "threshold": 0.95,
    "total_entries": 10,
    "last_updated": "2024-01-01T12:00:00"
  },
  "news_entries": [
    {
      "id": 1,
      "content_hash": "abc123...",
      "news_text": "Argentina vô địch World Cup 2022...",
      "prediction": "REAL",
      "gemini_confidence": 0.98,
      "gemini_analysis": "1. KẾT LUẬN: THẬT...",
      "distilbert_confidence": 0.85,
      "search_results": [...],
      "created_at": "2024-01-01T10:00:00",
      "source": "user_input",
      "verified": true
    }
  ]
}
```

## 🖥️ Viewing Saved News

### Option 1: Command Line Viewer
```bash
python view_rag_news.py
```

Features:
- View all saved news
- Filter by prediction (REAL/FAKE)
- Search through entries
- View statistics
- Open Google Drive directly

### Option 2: Google Drive Web Interface
- Go to your Google Drive
- Find the "Vietnamese_Fake_News_RAG" folder
- Open "high_confidence_news.json"
- View the raw JSON data

### Option 3: Direct Google Drive Links
The system provides direct links:
- Folder: `https://drive.google.com/drive/folders/{folder_id}`
- File: `https://drive.google.com/file/d/{file_id}/view`

## 🔧 Configuration

### In app.py
```python
# Enhanced RAG System Configuration
ENABLE_ENHANCED_RAG = True  # Enable/disable the system
RAG_CONFIDENCE_THRESHOLD = 0.95  # 95% threshold for saving
```

### Thresholds
- **95%**: Only very high-confidence predictions are saved
- **90%**: More entries saved, but still high quality
- **85%**: More entries, but some uncertainty
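
To get a feel for the trade-off between these thresholds, the small sketch below counts how many predictions would be saved at each level. The confidence values are made up for illustration, and `count_saved` is a hypothetical helper, not part of the project.

```python
def count_saved(confidences, threshold):
    """Count the confidences that pass a strict `>` threshold check."""
    return sum(1 for c in confidences if c > threshold)


# Made-up Gemini confidence scores, for illustration only.
sample = [0.99, 0.96, 0.93, 0.91, 0.88, 0.80]

for threshold in (0.95, 0.90, 0.85):
    saved = count_saved(sample, threshold)
    print(f"{threshold:.0%}: {saved} of {len(sample)} saved")
```

Running the same count over your own recent predictions is a cheap way to estimate how much a lower `RAG_CONFIDENCE_THRESHOLD` would grow the saved set before you change the setting.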

## Statistics

The system tracks:
- Total entries saved
- Real vs Fake news count
- Average confidence score
- Latest entry timestamp
- Google Drive folder/file IDs
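
As a rough sketch, most of these figures can be derived from the JSON layout shown under "Data Format". The function name `summarize` and the `latest_entry` key are assumptions for illustration; the other keys mirror the ones printed by `get_drive_links.py`.

```python
def summarize(data):
    """Derive summary statistics from the RAG JSON structure (a sketch)."""
    entries = data.get("news_entries", [])
    confidences = [
        e["gemini_confidence"] for e in entries if "gemini_confidence" in e
    ]
    return {
        "total_entries": len(entries),
        "real_count": sum(1 for e in entries if e.get("prediction") == "REAL"),
        "fake_count": sum(1 for e in entries if e.get("prediction") == "FAKE"),
        "avg_confidence": (
            sum(confidences) / len(confidences) if confidences else 0.0
        ),
        # ISO 8601 strings in one consistent format sort chronologically.
        "latest_entry": max(
            (e["created_at"] for e in entries if "created_at" in e),
            default=None,
        ),
    }
```

The folder and file IDs are not part of the JSON file itself, so they would have to come from the Drive client rather than from a function like this.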

## 🚨 Troubleshooting

### Common Issues

1. **"credentials.json not found"**
   - Make sure you downloaded the OAuth credentials
   - Rename the file to exactly `credentials.json`
   - Place it in the project directory

2. **"Authentication failed"**
   - Check your internet connection
   - Make sure Google Drive API is enabled
   - Try running the setup script again

3. **"Permission denied"**
   - Make sure you granted all required permissions
   - Check if your Google account has Drive access

4. **"RAG system not available"**
   - Check if all dependencies are installed
   - Make sure `rag_news_manager.py` is in the same directory
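
The first issue can be checked before starting the OAuth flow. Below is a minimal preflight sketch. The function name `check_credentials_file` is hypothetical; it relies on the fact that desktop-application client files downloaded from Google Cloud Console nest their fields under an `installed` key.

```python
import json
from pathlib import Path


def check_credentials_file(path="credentials.json"):
    """Return a list of problems; an empty list means the file looks usable."""
    file_path = Path(path)
    if not file_path.is_file():
        return [f"{path} not found"]
    try:
        data = json.loads(file_path.read_text(encoding="utf-8"))
    except json.JSONDecodeError as exc:
        return [f"{path} is not valid JSON: {exc}"]
    client = data.get("installed") if isinstance(data, dict) else None
    if not isinstance(client, dict):
        return ["no 'installed' section: not a desktop-application client file"]
    # Both fields are needed to run the installed-app OAuth flow.
    return [
        f"missing '{key}' in 'installed' section"
        for key in ("client_id", "client_secret")
        if not client.get(key)
    ]


if __name__ == "__main__":
    for problem in check_credentials_file():
        print(f"Problem: {problem}")
```

Returning a list instead of raising lets a caller report every problem at once.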

### Debug Mode
Add this to see detailed logs:
```python
import logging
logging.basicConfig(level=logging.DEBUG)
```

## Integration with Existing System

The Enhanced RAG system works alongside your existing knowledge base:
- **Local Knowledge Base**: Still works as before
- **Enhanced RAG**: Additional Google Drive storage
- **Both systems**: Can be used together for comprehensive RAG

## 📱 Usage Examples

### View Recent News
```bash
python view_rag_news.py
# Select option 2: View Recent News
```

### Search for Specific Topics
```bash
python view_rag_news.py
# Select option 6: Search News
# Enter: "COVID-19"
```

### Check Statistics
```bash
python view_rag_news.py
# Select option 1: View Statistics
```

## 🎯 Benefits

1. **Automatic Collection**: No manual intervention needed
2. **High Quality**: Only 95%+ confidence entries saved
3. **Easy Access**: View through multiple interfaces
4. **Scalable**: Google Drive handles large datasets
5. **Searchable**: Find specific news entries quickly
6. **Analytics**: Track patterns and statistics

## Security

- OAuth 2.0 authentication
- Credentials stored locally in `credentials.json` and `token.json` (keep both out of version control)
- Only your Google account can access
- No sensitive data exposed

## Support

If you encounter issues:
1. Check the troubleshooting section
2. Verify all setup steps completed
3. Check Google Cloud Console for API quotas
4. Ensure proper file permissions

---

**Congratulations!** You now have a comprehensive RAG system that automatically saves high-confidence news to Google Drive for analysis and viewing!
app.py
CHANGED
@@ -31,6 +31,10 @@ KNOWLEDGE_BASE_DB = "knowledge_base.db"
 CONFIDENCE_THRESHOLD = 0.95  # 95% Gemini confidence threshold for RAG knowledge base
 ENABLE_KNOWLEDGE_BASE_SEARCH = True  # Enable knowledge base search with training data
 
+# Enhanced RAG System Configuration
+ENABLE_ENHANCED_RAG = True  # Enable enhanced RAG system for Google Drive
+RAG_CONFIDENCE_THRESHOLD = 0.95  # 95% threshold for saving to RAG
+
 # Cloud Storage Configuration
 USE_CLOUD_STORAGE = True  # Set to True to use cloud storage instead of local DB
 CLOUD_STORAGE_TYPE = "google_drive"  # Options: "google_drive", "google_cloud", "local"
@@ -440,6 +444,23 @@ def get_knowledge_base_stats():
 # Initialize knowledge base on startup
 init_knowledge_base()
 
+# Initialize Enhanced RAG System
+if ENABLE_ENHANCED_RAG:
+    try:
+        from rag_news_manager import initialize_rag_system
+        print("Initializing Enhanced RAG System...")
+        if initialize_rag_system():
+            print("✅ Enhanced RAG System initialized successfully!")
+        else:
+            print("⚠️ Enhanced RAG System initialization failed - continuing without it")
+            ENABLE_ENHANCED_RAG = False
+    except ImportError as e:
+        print(f"⚠️ Enhanced RAG System not available: {e}")
+        ENABLE_ENHANCED_RAG = False
+    except Exception as e:
+        print(f"⚠️ Enhanced RAG System initialization error: {e}")
+        ENABLE_ENHANCED_RAG = False
+
 def populate_knowledge_base_from_training_data():
     """Populate knowledge base with existing training data"""
     try:
@@ -1366,6 +1387,31 @@ def analyze_news(news_text):
                 print("✅ Successfully added to knowledge base for future RAG retrieval!")
             else:
                 print("⚠️ Failed to add to knowledge base (duplicate or error)")
+
+        # Step 8: Enhanced RAG System - Save to Google Drive if confidence is high enough
+        if ENABLE_ENHANCED_RAG and gemini_max_confidence > RAG_CONFIDENCE_THRESHOLD:
+            try:
+                from rag_news_manager import add_news_to_rag
+
+                print(f"High confidence detected ({gemini_max_confidence:.1%}) - saving to Enhanced RAG system...")
+                final_prediction = "REAL" if gemini_real_confidence > gemini_fake_confidence else "FAKE"
+
+                rag_success = add_news_to_rag(
+                    news_text=news_text,
+                    gemini_analysis=gemini_analysis,
+                    gemini_confidence=gemini_max_confidence,
+                    prediction=final_prediction,
+                    search_results=search_results,
+                    distilbert_confidence=distilbert_confidence
+                )
+
+                if rag_success:
+                    print("✅ Successfully saved to Enhanced RAG system (Google Drive)!")
+                else:
+                    print("⚠️ Failed to save to Enhanced RAG system (duplicate or error)")
+
+            except Exception as e:
+                print(f"⚠️ Enhanced RAG system error: {e}")
 
         # Build the detailed report with better formatting
         # Use combined_confidence to determine the final classification (not just DistilBERT)
debug_rag_setup.py
ADDED
@@ -0,0 +1,80 @@
#!/usr/bin/env python3
"""
Debug RAG system setup
"""

def debug_rag_setup():
    """Debug the RAG system setup step by step"""
    print("🔧 Debugging RAG System Setup")
    print("=" * 40)

    try:
        # Step 1: Test imports
        print("1. Testing imports...")
        from rag_news_manager import RAGNewsManager
        print("✅ RAGNewsManager imported successfully")

        # Step 2: Create manager instance
        print("2. Creating RAG manager...")
        manager = RAGNewsManager()
        print("✅ RAG manager created")

        # Step 3: Test authentication
        print("3. Testing authentication...")
        if manager.authenticate():
            print("✅ Authentication successful")
        else:
            print("❌ Authentication failed")
            return False

        # Step 4: Test folder setup
        print("4. Testing folder setup...")
        if manager.setup_rag_folder():
            print("✅ Folder setup successful")
            print(f"   Folder ID: {manager.rag_folder_id}")
        else:
            print("❌ Folder setup failed")
            return False

        # Step 5: Test file setup
        print("5. Testing file setup...")
        if manager.setup_rag_file():
            print("✅ File setup successful")
            print(f"   File ID: {manager.rag_file_id}")
        else:
            print("❌ File setup failed")
            return False

        # Step 6: Test data loading
        print("6. Testing data loading...")
        data = manager.load_rag_data()
        if data:
            print("✅ Data loading successful")
            print(f"   Total entries: {data.get('metadata', {}).get('total_entries', 0)}")
        else:
            print("❌ Data loading failed")
            return False

        # Step 7: Test statistics
        print("7. Testing statistics...")
        stats = manager.get_rag_statistics()
        if stats:
            print("✅ Statistics successful")
            print(f"   Total entries: {stats['total_entries']}")
            print(f"   Folder ID: {stats.get('folder_id', 'None')}")
            print(f"   File ID: {stats.get('file_id', 'None')}")
        else:
            print("❌ Statistics failed")
            return False

        print("\nAll tests passed! RAG system is working correctly.")
        return True

    except Exception as e:
        print(f"❌ Error during debugging: {e}")
        import traceback
        traceback.print_exc()
        return False

if __name__ == "__main__":
    debug_rag_setup()
fix_oauth_setup.py
ADDED
@@ -0,0 +1,141 @@
#!/usr/bin/env python3
"""
Fix OAuth setup for Google Drive RAG system
"""

import os
import json
from google.oauth2.credentials import Credentials
from google_auth_oauthlib.flow import InstalledAppFlow
from google.auth.transport.requests import Request

# Configuration
SCOPES = ['https://www.googleapis.com/auth/drive.file']
CREDENTIALS_FILE = 'credentials.json'
TOKEN_FILE = 'token.json'

def fix_oauth_setup():
    """Fix OAuth setup with proper redirect URIs"""
    print("🔧 Fixing OAuth Setup for Google Drive RAG")
    print("=" * 50)

    # Check if credentials file exists
    if not os.path.exists(CREDENTIALS_FILE):
        print(f"❌ {CREDENTIALS_FILE} not found!")
        print("\nPlease follow these steps:")
        print("1. Go to: https://console.cloud.google.com/")
        print("2. APIs & Services → Credentials")
        print("3. Create Credentials → OAuth 2.0 Client IDs")
        print("4. Application type: Desktop application")
        print("5. Download as 'credentials.json'")
        return False

    # Delete old token file if it exists
    if os.path.exists(TOKEN_FILE):
        print(f"Removing old token file: {TOKEN_FILE}")
        os.remove(TOKEN_FILE)

    print(f"✅ Found {CREDENTIALS_FILE}")

    try:
        # Load and validate credentials
        with open(CREDENTIALS_FILE, 'r') as f:
            creds_data = json.load(f)

        print("✅ Credentials file is valid")
        # The client ID is nested under the 'installed' (or 'web') key,
        # so look it up there instead of at the top level
        client_config = creds_data.get('installed') or creds_data.get('web') or {}
        print(f"   Client ID: {client_config.get('client_id', 'N/A')[:20]}...")

        # Check if it's a desktop application
        if creds_data.get('installed'):
            print("✅ Desktop application credentials detected")
        else:
            print("⚠️  Warning: This doesn't look like desktop application credentials")
            print("   Make sure you selected 'Desktop application' when creating credentials")

    except json.JSONDecodeError:
        print("❌ Invalid JSON in credentials file")
        return False
    except Exception as e:
        print(f"❌ Error reading credentials: {e}")
        return False

    # Try authentication with different ports
    ports_to_try = [8080, 8081, 8082, 0]  # 0 means let the system choose

    for port in ports_to_try:
        is_last_attempt = port == ports_to_try[-1]
        try:
            print(f"\nTrying authentication on port {port if port > 0 else 'auto'}...")

            # Create flow
            flow = InstalledAppFlow.from_client_secrets_file(CREDENTIALS_FILE, SCOPES)

            # port=0 lets the system choose a free port; otherwise use the given port
            creds = flow.run_local_server(port=port)

            print("✅ Authentication successful!")

            # Save credentials
            with open(TOKEN_FILE, 'w') as token:
                token.write(creds.to_json())
            print(f"✅ Credentials saved to {TOKEN_FILE}")

            # Test the credentials
            print("\n🧪 Testing Google Drive access...")
            from googleapiclient.discovery import build
            service = build('drive', 'v3', credentials=creds)

            results = service.files().list(pageSize=1, fields="files(id, name)").execute()
            files = results.get('files', [])

            print("✅ Google Drive access successful!")
            print(f"   Found {len(files)} file(s) in your Drive")

            return True

        except Exception as e:
            error_msg = str(e).lower()
            if "redirect_uri_mismatch" in error_msg:
                print(f"❌ Port {port} failed: redirect_uri_mismatch")
            else:
                print(f"❌ Port {port} failed: {e}")
            # Don't show this message for the last attempt
            if not is_last_attempt:
                print("   Trying next port...")

    print("\n❌ All authentication attempts failed!")
    print("\n🔧 Manual Fix Required:")
    print("1. Go to: https://console.cloud.google.com/")
    print("2. APIs & Services → Credentials")
    print("3. Edit your OAuth 2.0 Client ID")
    print("4. Add these to 'Authorized redirect URIs':")
    print("   - http://localhost:8080/")
    print("   - http://localhost:8081/")
    print("   - http://localhost:8082/")
    print("   - http://127.0.0.1:8080/")
    print("   - http://127.0.0.1:8081/")
    print("   - http://127.0.0.1:8082/")
    print("5. Save and try again")

    return False

def main():
    """Main function"""
    print("OAuth Fix for Google Drive RAG System")
    print("=" * 50)

    if fix_oauth_setup():
        print("\nOAuth setup fixed successfully!")
        print("✅ You can now run: python setup_google_drive_rag.py")
    else:
        print("\n❌ OAuth setup failed")
        print("💡 Please follow the manual fix instructions above")

if __name__ == "__main__":
    main()
fix_verification_issue.py
ADDED
@@ -0,0 +1,112 @@
#!/usr/bin/env python3
"""
Fix Google verification issue for OAuth
"""

import os
import json
from google.oauth2.credentials import Credentials
from google_auth_oauthlib.flow import InstalledAppFlow
from google.auth.transport.requests import Request

# Configuration
SCOPES = ['https://www.googleapis.com/auth/drive.file']
CREDENTIALS_FILE = 'credentials.json'
TOKEN_FILE = 'token.json'

def fix_verification_issue():
    """Fix Google verification issue"""
    print("🔧 Fixing Google Verification Issue")
    print("=" * 50)

    print("The issue is that your app needs to be in 'Testing' mode")
    print("   and you need to be added as a test user.")
    print()

    print("🔧 Manual Fix Required:")
    print("1. Go to: https://console.cloud.google.com/")
    print("2. APIs & Services → OAuth consent screen")
    print("3. Make sure 'Publishing status' is set to 'Testing'")
    print("4. Scroll down to 'Test users' section")
    print("5. Click 'Add Users'")
    print("6. Add your email: [email protected]")
    print("7. Save the changes")
    print()

    print("Alternative: Change User Type to Internal")
    print("   (If you have a Google Workspace account)")
    print()

    # Check if credentials exist
    if not os.path.exists(CREDENTIALS_FILE):
        print(f"❌ {CREDENTIALS_FILE} not found!")
        return False

    # Delete old token file
    if os.path.exists(TOKEN_FILE):
        print(f"Removing old token file: {TOKEN_FILE}")
        os.remove(TOKEN_FILE)

    print("✅ Ready to test after you add yourself as a test user")
    print()

    # Ask user if they've completed the steps
    response = input("Have you added yourself as a test user? (y/n): ").strip().lower()

    if response == 'y':
        print("\n🧪 Testing authentication...")

        try:
            # Try authentication
            flow = InstalledAppFlow.from_client_secrets_file(CREDENTIALS_FILE, SCOPES)
            creds = flow.run_local_server(port=0)

            print("✅ Authentication successful!")

            # Save credentials
            with open(TOKEN_FILE, 'w') as token:
                token.write(creds.to_json())
            print(f"✅ Credentials saved to {TOKEN_FILE}")

            # Test Google Drive access
            print("\n🧪 Testing Google Drive access...")
            from googleapiclient.discovery import build
            service = build('drive', 'v3', credentials=creds)

            results = service.files().list(pageSize=1, fields="files(id, name)").execute()
            files = results.get('files', [])

            print("✅ Google Drive access successful!")
            print(f"   Found {len(files)} file(s) in your Drive")

            return True

        except Exception as e:
            error_msg = str(e).lower()
            if "access_denied" in error_msg or "verification" in error_msg:
                print("❌ Still getting verification error")
                print("💡 Make sure you:")
                print("   1. Added yourself as a test user")
                print("   2. Set publishing status to 'Testing'")
                print("   3. Saved all changes")
                return False
            else:
                print(f"❌ Authentication failed: {e}")
                return False
    else:
        print("💡 Please complete the steps above and run this script again")
        return False

def main():
    """Main function"""
    print("Google Verification Fix for RAG System")
    print("=" * 50)

    if fix_verification_issue():
        print("\nVerification issue fixed!")
        print("✅ You can now run: python setup_google_drive_rag.py")
    else:
        print("\n❌ Please complete the manual steps above")

if __name__ == "__main__":
    main()
get_drive_links.py
ADDED
@@ -0,0 +1,48 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
#!/usr/bin/env python3
"""
Quick script to get Google Drive links for your RAG files
"""

from rag_news_manager import initialize_rag_system, get_rag_stats

def get_drive_links():
    """Get direct Google Drive links"""
    print("Getting Google Drive Links...")

    # Initialize RAG system
    if not initialize_rag_system():
        print("❌ Failed to initialize RAG system")
        return

    # Get statistics (includes folder and file IDs)
    stats = get_rag_stats()

    if not stats:
        print("❌ Could not get RAG statistics")
        return

    print("\nRAG System Statistics:")
    print(f"   Total entries: {stats['total_entries']}")
    print(f"   Real news: {stats['real_count']}")
    print(f"   Fake news: {stats['fake_count']}")
    print(f"   Average confidence: {stats['avg_confidence']:.1%}")

    print("\nGoogle Drive Links:")

    if stats['folder_id']:
        folder_url = f"https://drive.google.com/drive/folders/{stats['folder_id']}"
        print(f"RAG Folder: {folder_url}")
        print("   (Click to open in browser)")

    if stats['file_id']:
        file_url = f"https://drive.google.com/file/d/{stats['file_id']}/view"
        print(f"RAG File: {file_url}")
        print("   (Click to view the JSON data)")

    print("\n💡 Tips:")
    print("   - Use the folder link to browse all RAG files")
    print("   - Use the file link to view the raw JSON data")
    print("   - Run 'python view_rag_news.py' for a better interface")

if __name__ == "__main__":
    get_drive_links()
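The two link formats used in get_drive_links.py can be factored into small helpers. The following is a minimal sketch, not part of the commit: the function names `drive_folder_url`, `drive_file_url`, and `drive_links` are hypothetical, and the only assumption about `stats` is that it carries the `folder_id` and `file_id` keys returned by `get_rag_stats()`.

```python
def drive_folder_url(folder_id: str) -> str:
    """Build the browser URL for a Drive folder ID."""
    return f"https://drive.google.com/drive/folders/{folder_id}"


def drive_file_url(file_id: str) -> str:
    """Build the browser URL for a Drive file ID."""
    return f"https://drive.google.com/file/d/{file_id}/view"


def drive_links(stats: dict) -> dict:
    """Return only the links whose IDs are present in the stats dict."""
    links = {}
    if stats.get("folder_id"):
        links["folder"] = drive_folder_url(stats["folder_id"])
    if stats.get("file_id"):
        links["file"] = drive_file_url(stats["file_id"])
    return links


# Placeholder IDs for illustration only.
for label, url in drive_links({"folder_id": "FOLDER_ID", "file_id": "FILE_ID"}).items():
    print(f"{label}: {url}")
```

Keeping the URL templates in one place would let quick_check.py and setup_google_drive_rag.py reuse them instead of repeating the f-strings.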
quick_check.py
ADDED
@@ -0,0 +1,55 @@
#!/usr/bin/env python3
"""
Quick check to see if you have any saved RAG news
"""

from rag_news_manager import initialize_rag_system, get_rag_stats

def quick_check():
    """Quick check for saved news"""
    print("Quick RAG System Check")
    print("=" * 30)

    # Initialize RAG system
    if not initialize_rag_system():
        print("❌ RAG system not initialized")
        print("💡 Run: python setup_google_drive_rag.py")
        return

    # Get statistics
    stats = get_rag_stats()

    if not stats:
        print("❌ Could not get statistics")
        return

    print("Current Status:")
    print(f"   Total entries: {stats['total_entries']}")

    if stats['total_entries'] == 0:
        print("No news entries saved yet")
        print("💡 Try analyzing some news with your app first!")
        print("💡 News with 95%+ confidence will be automatically saved")
    else:
        print(f"✅ You have {stats['total_entries']} saved news entries!")
        print(f"   Real news: {stats['real_count']}")
        print(f"   Fake news: {stats['fake_count']}")
        print(f"   Average confidence: {stats['avg_confidence']:.1%}")

        if stats['latest_entry']:
            latest = stats['latest_entry']
            print("\n📰 Latest entry:")
            print(f"   {latest['news_text'][:80]}...")
            print(f"   {latest['prediction']} ({latest['gemini_confidence']:.1%})")

    # Show Google Drive links
    if stats['folder_id']:
        folder_url = f"https://drive.google.com/drive/folders/{stats['folder_id']}"
        print(f"\nGoogle Drive Folder: {folder_url}")

    if stats['file_id']:
        file_url = f"https://drive.google.com/file/d/{stats['file_id']}/view"
        print(f"Google Drive File: {file_url}")

if __name__ == "__main__":
    quick_check()
rag_news_manager.py
ADDED
@@ -0,0 +1,432 @@
#!/usr/bin/env python3
"""
Enhanced RAG News Manager for Google Drive
Saves high-confidence news (95%+ from Gemini) to Google Drive for RAG purposes
"""

import json
import os
import hashlib
from datetime import datetime
from google.oauth2.credentials import Credentials
from google_auth_oauthlib.flow import InstalledAppFlow
from google.auth.transport.requests import Request
from googleapiclient.discovery import build
from googleapiclient.http import MediaIoBaseDownload, MediaIoBaseUpload
import io

# Configuration
SCOPES = ['https://www.googleapis.com/auth/drive.file']
RAG_FOLDER_NAME = "Vietnamese_Fake_News_RAG"
RAG_FILE_NAME = "high_confidence_news.json"
CONFIDENCE_THRESHOLD = 0.95  # 95% threshold

class RAGNewsManager:
    def __init__(self):
        self.service = None
        self.rag_folder_id = None
        self.rag_file_id = None
        self.credentials_file = 'credentials.json'
        self.token_file = 'token.json'

    def authenticate(self):
        """Authenticate with Google Drive API"""
        try:
            creds = None

            # Check if running on Hugging Face Spaces
            is_hf_space = os.getenv('SPACE_ID') is not None

            if is_hf_space:
                # For Hugging Face Spaces, use environment variables
                client_id = os.getenv('GOOGLE_CLIENT_ID')
                client_secret = os.getenv('GOOGLE_CLIENT_SECRET')
                refresh_token = os.getenv('GOOGLE_REFRESH_TOKEN')

                if client_id and client_secret and refresh_token:
                    creds = Credentials.from_authorized_user_info({
                        'client_id': client_id,
                        'client_secret': client_secret,
                        'refresh_token': refresh_token,
                        'token_uri': 'https://oauth2.googleapis.com/token'
                    }, SCOPES)
                else:
                    print("⚠️ Google Drive credentials not found in Hugging Face secrets")
                    return False
            else:
                # For local development, use files
                if os.path.exists(self.token_file):
                    creds = Credentials.from_authorized_user_file(self.token_file, SCOPES)

            # If no valid credentials, request authorization
            if not creds or not creds.valid:
                # Refresh whenever a refresh token is available; this also covers
                # credentials built from environment variables, which carry no access token.
                if creds and creds.refresh_token:
                    creds.refresh(Request())
                else:
                    if os.path.exists(self.credentials_file):
                        flow = InstalledAppFlow.from_client_secrets_file(
                            self.credentials_file, SCOPES)
                        creds = flow.run_local_server(port=0)
                    else:
                        print("⚠️ credentials.json not found for local development")
                        return False

                # Save credentials for next run
                with open(self.token_file, 'w') as token:
                    token.write(creds.to_json())

            self.service = build('drive', 'v3', credentials=creds)
            print("✅ Google Drive authentication successful!")
            return True

        except Exception as e:
            print(f"❌ Google Drive authentication failed: {e}")
            return False

    def setup_rag_folder(self):
        """Create or find the RAG folder in Google Drive"""
        try:
            # Check if folder already exists (ignore folders in the trash)
            results = self.service.files().list(
                q=f"name='{RAG_FOLDER_NAME}' and mimeType='application/vnd.google-apps.folder' and trashed=false",
                fields="files(id, name)"
            ).execute()

            folders = results.get('files', [])

            if folders:
                self.rag_folder_id = folders[0]['id']
                print(f"✅ Found existing RAG folder: {RAG_FOLDER_NAME}")
            else:
                # Create new folder
                folder_metadata = {
                    'name': RAG_FOLDER_NAME,
                    'mimeType': 'application/vnd.google-apps.folder'
                }

                folder = self.service.files().create(
                    body=folder_metadata,
                    fields='id'
                ).execute()

                self.rag_folder_id = folder.get('id')
                print(f"✅ Created new RAG folder: {RAG_FOLDER_NAME}")

            return True

        except Exception as e:
            print(f"❌ Error setting up RAG folder: {e}")
            return False

    def setup_rag_file(self):
        """Create or find the RAG data file"""
        try:
            # Check if file already exists.
            # Drive query syntax puts the folder ID on the left: '<folder_id>' in parents
            results = self.service.files().list(
                q=f"name='{RAG_FILE_NAME}' and '{self.rag_folder_id}' in parents and trashed=false",
                fields="files(id, name)"
            ).execute()

            files = results.get('files', [])

            if files:
                self.rag_file_id = files[0]['id']
                print(f"✅ Found existing RAG file: {RAG_FILE_NAME}")
            else:
                # Create new file with empty data
                initial_data = {
                    "metadata": {
                        "created_at": datetime.now().isoformat(),
                        "description": "High-confidence Vietnamese fake news for RAG",
                        "threshold": CONFIDENCE_THRESHOLD,
                        "total_entries": 0
                    },
                    "news_entries": []
                }

                file_metadata = {
                    'name': RAG_FILE_NAME,
                    'parents': [self.rag_folder_id]
                }

                media = MediaIoBaseUpload(
                    io.BytesIO(json.dumps(initial_data, ensure_ascii=False, indent=2).encode('utf-8')),
                    mimetype='application/json'
                )

                file = self.service.files().create(
                    body=file_metadata,
                    media_body=media,
                    fields='id'
                ).execute()

                self.rag_file_id = file.get('id')
                print(f"✅ Created new RAG file: {RAG_FILE_NAME}")

            return True

        except Exception as e:
            print(f"❌ Error setting up RAG file: {e}")
            return False

    def load_rag_data(self):
        """Load existing RAG data from Google Drive"""
        try:
            if not self.rag_file_id:
                return {"metadata": {"total_entries": 0}, "news_entries": []}

            request = self.service.files().get_media(fileId=self.rag_file_id)
            file_content = io.BytesIO()
            downloader = MediaIoBaseDownload(file_content, request)

            # Download chunk by chunk until the downloader reports completion
            done = False
            while not done:
                _, done = downloader.next_chunk()

            file_content.seek(0)
            data = json.loads(file_content.read().decode('utf-8'))

            print(f"Loaded {data.get('metadata', {}).get('total_entries', 0)} entries from RAG file")
            return data

        except Exception as e:
            print(f"❌ Error loading RAG data: {e}")
            return {"metadata": {"total_entries": 0}, "news_entries": []}

    def save_rag_data(self, data):
        """Save RAG data to Google Drive"""
        try:
            if not self.rag_file_id:
                return False

            # Update metadata
            data['metadata']['last_updated'] = datetime.now().isoformat()
            data['metadata']['total_entries'] = len(data['news_entries'])

            # Convert to JSON
            json_data = json.dumps(data, ensure_ascii=False, indent=2)

            media = MediaIoBaseUpload(
                io.BytesIO(json_data.encode('utf-8')),
                mimetype='application/json'
            )

            # Update the file
            self.service.files().update(
                fileId=self.rag_file_id,
                media_body=media
            ).execute()

            print(f"✅ Saved {len(data['news_entries'])} entries to RAG file")
            return True

        except Exception as e:
            print(f"❌ Error saving RAG data: {e}")
            return False

    def add_high_confidence_news(self, news_text, gemini_analysis, gemini_confidence,
                                 prediction, search_results=None, distilbert_confidence=None):
        """Add high-confidence news to RAG system"""
        try:
            # Check confidence threshold
            if gemini_confidence < CONFIDENCE_THRESHOLD:
                print(f"⚠️ Confidence {gemini_confidence:.1%} below threshold {CONFIDENCE_THRESHOLD:.1%}")
                return False

            # Create content hash for deduplication (exact-text match only)
            content_hash = hashlib.md5(news_text.encode('utf-8')).hexdigest()

            # Load existing data
            data = self.load_rag_data()

            # Check if entry already exists
            for entry in data['news_entries']:
                if entry.get('content_hash') == content_hash:
                    print(f"⚠️ News already exists in RAG (hash: {content_hash[:8]}...)")
                    return False

            # Create new entry
            new_entry = {
                'id': len(data['news_entries']) + 1,
                'content_hash': content_hash,
                'news_text': news_text,
                'prediction': prediction,
                'gemini_confidence': gemini_confidence,
                'gemini_analysis': gemini_analysis,
                'distilbert_confidence': distilbert_confidence,
                'search_results': search_results or [],
                'created_at': datetime.now().isoformat(),
                'source': 'user_input',
                'verified': True  # High confidence means verified
            }

            # Add to data
            data['news_entries'].append(new_entry)

            # Save to Google Drive
            success = self.save_rag_data(data)

            if success:
                print("✅ Added high-confidence news to RAG:")
                print(f"   📰 News: {news_text[:100]}...")
                print(f"   🎯 Prediction: {prediction}")
                print(f"   Confidence: {gemini_confidence:.1%}")
                print(f"   Hash: {content_hash[:8]}...")
                return True
            else:
                return False

        except Exception as e:
            print(f"❌ Error adding news to RAG: {e}")
            return False

    def search_rag_news(self, query_text, limit=5):
        """Search RAG news for similar entries"""
        try:
            data = self.load_rag_data()
            if not data['news_entries']:
                return []

            results = []
            query_lower = query_text.lower()

            for entry in data['news_entries']:
                # Case-insensitive substring match on the news text and the analysis
                if (query_lower in entry.get('news_text', '').lower() or
                        query_lower in entry.get('gemini_analysis', '').lower()):

                    results.append({
                        'news_text': entry['news_text'],
                        'prediction': entry['prediction'],
                        'confidence': entry['gemini_confidence'],
                        'analysis': entry['gemini_analysis'],
                        'created_at': entry['created_at'],
                        'id': entry['id']
                    })

            # Sort by confidence and creation date
            results.sort(key=lambda x: (x['confidence'], x['created_at']), reverse=True)
            results = results[:limit]

            if results:
                print(f"Found {len(results)} similar entries in RAG")

            return results

        except Exception as e:
            print(f"❌ Error searching RAG news: {e}")
            return []

    def get_rag_statistics(self):
        """Get statistics about RAG data"""
        try:
            data = self.load_rag_data()
            entries = data['news_entries']

            if not entries:
                return {
                    'total_entries': 0,
                    'real_count': 0,
                    'fake_count': 0,
                    'avg_confidence': 0,
                    'latest_entry': None,
                    'folder_id': self.rag_folder_id,
                    'file_id': self.rag_file_id
                }

            real_count = sum(1 for entry in entries if entry['prediction'] == 'REAL')
            fake_count = sum(1 for entry in entries if entry['prediction'] == 'FAKE')
            avg_confidence = sum(entry['gemini_confidence'] for entry in entries) / len(entries)

            # Get latest entry (entries is non-empty at this point)
            latest_entry = max(entries, key=lambda x: x['created_at'])

            stats = {
                'total_entries': len(entries),
                'real_count': real_count,
                'fake_count': fake_count,
                'avg_confidence': avg_confidence,
                'latest_entry': latest_entry,
                'folder_id': self.rag_folder_id,
                'file_id': self.rag_file_id
            }

            return stats

        except Exception as e:
            print(f"❌ Error getting RAG statistics: {e}")
            return None

    def initialize(self):
        """Initialize the RAG system"""
        print("Initializing RAG News Manager...")

        if not self.authenticate():
            return False

        if not self.setup_rag_folder():
            return False

        if not self.setup_rag_file():
            return False

        print("✅ RAG News Manager initialized successfully!")
        return True

# Global instance
rag_manager = RAGNewsManager()

def initialize_rag_system():
    """Initialize the RAG system"""
    return rag_manager.initialize()

def add_news_to_rag(news_text, gemini_analysis, gemini_confidence, prediction,
                    search_results=None, distilbert_confidence=None):
    """Add news to RAG system if confidence is high enough"""
    return rag_manager.add_high_confidence_news(
        news_text, gemini_analysis, gemini_confidence, prediction,
        search_results, distilbert_confidence
    )

def search_rag_for_context(query_text, limit=3):
    """Search RAG for context to use in analysis"""
    return rag_manager.search_rag_news(query_text, limit)

def get_rag_stats():
    """Get RAG system statistics"""
    return rag_manager.get_rag_statistics()

if __name__ == "__main__":
    # Test the RAG system
    print("Testing RAG News Manager...")

    if initialize_rag_system():
        # Test adding a news entry
        test_news = "Argentina vô địch World Cup 2022 là sự thật"
        test_analysis = "1. KẾT LUẬN: THẬT\n2. ĐỘ TIN CẬY: THẬT: 98% / GIẢ: 2%"
        test_confidence = 0.98

        success = add_news_to_rag(
            news_text=test_news,
            gemini_analysis=test_analysis,
            gemini_confidence=test_confidence,
            prediction="REAL"
        )

        if success:
            print("✅ Test news added successfully!")

            # Get statistics
            stats = get_rag_stats()
            if stats:
                print("RAG Statistics:")
                print(f"   Total entries: {stats['total_entries']}")
                print(f"   Real news: {stats['real_count']}")
                print(f"   Fake news: {stats['fake_count']}")
                print(f"   Average confidence: {stats['avg_confidence']:.1%}")
                print(f"   Google Drive folder ID: {stats['folder_id']}")
                print(f"   Google Drive file ID: {stats['file_id']}")
        else:
            print("❌ Failed to add test news")
    else:
        print("❌ Failed to initialize RAG system")
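The threshold check and hash-based deduplication in `add_high_confidence_news` can be exercised without Google Drive. The following is a minimal sketch under stated assumptions: `try_add_entry` is a hypothetical helper that works on an in-memory dict with the same `metadata` / `news_entries` layout as `high_confidence_news.json`, and it keeps only a subset of the entry fields.

```python
import hashlib
from datetime import datetime

CONFIDENCE_THRESHOLD = 0.95  # mirrors the constant in rag_news_manager.py


def try_add_entry(data, news_text, prediction, confidence):
    """Append an entry unless it is low-confidence or a duplicate.

    Returns True when an entry was added, False otherwise.
    """
    if confidence < CONFIDENCE_THRESHOLD:
        return False

    # Identical text always yields the same digest, so the hash acts as a duplicate key.
    content_hash = hashlib.md5(news_text.encode("utf-8")).hexdigest()
    if any(e.get("content_hash") == content_hash for e in data["news_entries"]):
        return False

    data["news_entries"].append({
        "id": len(data["news_entries"]) + 1,
        "content_hash": content_hash,
        "news_text": news_text,
        "prediction": prediction,
        "gemini_confidence": confidence,
        "created_at": datetime.now().isoformat(),
    })
    data["metadata"]["total_entries"] = len(data["news_entries"])
    return True


store = {"metadata": {"total_entries": 0}, "news_entries": []}
print(try_add_entry(store, "Sample headline", "REAL", 0.98))   # new text above the threshold
print(try_add_entry(store, "Sample headline", "REAL", 0.99))   # same text again
print(try_add_entry(store, "Another headline", "FAKE", 0.80))  # below the threshold
```

Because the hash covers the raw text, two submissions that differ only in whitespace or capitalization count as different entries; normalizing the text before hashing would be one way to tighten this, at the cost of changing existing hashes.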
setup_google_drive_rag.py
ADDED
@@ -0,0 +1,199 @@
| 1 |
+
#!/usr/bin/env python3
|
| 2 |
+
"""
|
| 3 |
+
Setup script for Google Drive RAG system
|
| 4 |
+
This script helps you set up Google Drive authentication for the RAG news manager
|
| 5 |
+
"""
|
| 6 |
+
|
| 7 |
+
import os
|
| 8 |
+
import json
|
| 9 |
+
from google.oauth2.credentials import Credentials
|
| 10 |
+
from google_auth_oauthlib.flow import InstalledAppFlow
|
| 11 |
+
from google.auth.transport.requests import Request
|
| 12 |
+
|
| 13 |
+
# Configuration
|
| 14 |
+
SCOPES = ['https://www.googleapis.com/auth/drive.file']
|
| 15 |
+
CREDENTIALS_FILE = 'credentials.json'
|
| 16 |
+
TOKEN_FILE = 'token.json'
|
| 17 |
+
|
| 18 |
+
def setup_google_drive_credentials():
|
| 19 |
+
"""Set up Google Drive credentials for local development"""
|
| 20 |
+
print("π§ Setting up Google Drive credentials for RAG system...")
|
| 21 |
+
print("=" * 60)
|
| 22 |
+
|
| 23 |
+
# Check if credentials file exists
|
| 24 |
+
if not os.path.exists(CREDENTIALS_FILE):
|
| 25 |
+
print(f"β {CREDENTIALS_FILE} not found!")
|
| 26 |
+
print("\nπ To get Google Drive credentials:")
|
| 27 |
+
print("1. Go to Google Cloud Console: https://console.cloud.google.com/")
|
| 28 |
+
print("2. Create a new project or select existing one")
|
| 29 |
+
print("3. Enable Google Drive API")
|
| 30 |
+
print("4. Go to 'Credentials' β 'Create Credentials' β 'OAuth 2.0 Client IDs'")
|
| 31 |
+
print("5. Choose 'Desktop application'")
|
| 32 |
+
print("6. Download the JSON file and rename it to 'credentials.json'")
|
| 33 |
+
print("7. Place it in this directory")
|
| 34 |
+
return False
|
| 35 |
+
|
| 36 |
+
print(f"β
Found {CREDENTIALS_FILE}")
|
| 37 |
+
|
| 38 |
+
# Load credentials
|
| 39 |
+
try:
|
| 40 |
+
with open(CREDENTIALS_FILE, 'r') as f:
|
| 41 |
+
creds_data = json.load(f)
|
| 42 |
+
|
| 43 |
+
print("β
Credentials file is valid JSON")
|
| 44 |
+
print(f" Client ID: {creds_data.get('client_id', 'N/A')[:20]}...")
|
| 45 |
+
print(f" Project ID: {creds_data.get('project_id', 'N/A')}")
|
| 46 |
+
|
| 47 |
+
except json.JSONDecodeError:
|
| 48 |
+
print("β Invalid JSON in credentials file")
|
| 49 |
+
return False
|
| 50 |
+
except Exception as e:
|
| 51 |
+
print(f"β Error reading credentials: {e}")
|
| 52 |
+
return False
|
| 53 |
+
|
| 54 |
+
# Authenticate
|
| 55 |
+
creds = None
|
| 56 |
+
|
| 57 |
+
# Check if token file exists
|
| 58 |
+
if os.path.exists(TOKEN_FILE):
|
| 59 |
+
print(f"β
Found existing {TOKEN_FILE}")
|
| 60 |
+
try:
|
| 61 |
+
creds = Credentials.from_authorized_user_file(TOKEN_FILE, SCOPES)
|
| 62 |
+
print("β
Loaded existing credentials")
|
| 63 |
+
except Exception as e:
|
| 64 |
+
print(f"β οΈ Error loading existing credentials: {e}")
|
| 65 |
+
creds = None
|
| 66 |
+
|
| 67 |
+
# If no valid credentials, get new ones
|
| 68 |
+
if not creds or not creds.valid:
|
| 69 |
+
if creds and creds.expired and creds.refresh_token:
|
| 70 |
+
print("π Refreshing expired credentials...")
|
| 71 |
+
try:
|
| 72 |
+
creds.refresh(Request())
|
| 73 |
+
print("β
Credentials refreshed successfully")
|
| 74 |
+
except Exception as e:
|
| 75 |
+
print(f"β Error refreshing credentials: {e}")
|
| 76 |
+
creds = None
|
| 77 |
+
|
| 78 |
+
if not creds:
|
| 79 |
+
print("π Starting OAuth flow...")
|
| 80 |
+
print(" A browser window will open for authentication")
|
| 81 |
+
print(" Please log in with your Google account and grant permissions")
|
| 82 |
+
|
| 83 |
+
try:
|
| 84 |
+
flow = InstalledAppFlow.from_client_secrets_file(CREDENTIALS_FILE, SCOPES)
|
| 85 |
+
creds = flow.run_local_server(port=0)
|
| 86 |
+
print("β
Authentication successful!")
|
| 87 |
+
except Exception as e:
|
| 88 |
+
print(f"β Authentication failed: {e}")
|
| 89 |
+
return False
|
| 90 |
+
|
| 91 |
+
# Save credentials for next time
|
| 92 |
+
try:
|
| 93 |
+
with open(TOKEN_FILE, 'w') as token:
|
| 94 |
+
token.write(creds.to_json())
|
| 95 |
+
print(f"β
Credentials saved to {TOKEN_FILE}")
|
| 96 |
+
except Exception as e:
|
| 97 |
+
print(f"β οΈ Warning: Could not save credentials: {e}")
|
| 98 |
+
|
| 99 |
+
# Test the credentials
|
| 100 |
+
print("\nπ§ͺ Testing Google Drive access...")
|
| 101 |
+
try:
|
| 102 |
+
from googleapiclient.discovery import build
|
| 103 |
+
service = build('drive', 'v3', credentials=creds)
|
| 104 |
+
|
| 105 |
+
# List files to test access
|
| 106 |
+
results = service.files().list(pageSize=1, fields="files(id, name)").execute()
|
| 107 |
+
files = results.get('files', [])
|
| 108 |
+
|
| 109 |
+
print("β
Google Drive access successful!")
|
| 110 |
+
print(f" Found {len(files)} file(s) in your Drive")
|
| 111 |
+
|
| 112 |
+
if files:
|
| 113 |
+
print(f" Sample file: {files[0]['name']}")
|
| 114 |
+
|
| 115 |
+
return True
|
| 116 |
+
|
| 117 |
+
except Exception as e:
|
| 118 |
+
print(f"β Google Drive access test failed: {e}")
|
| 119 |
+
return False
|
| 120 |
+
|
| 121 |
+
def test_rag_system():
|
| 122 |
+
"""Test the RAG system"""
|
| 123 |
+
print("\nπ§ͺ Testing RAG News Manager...")
|
| 124 |
+
print("=" * 40)
|
| 125 |
+
|
| 126 |
+
try:
|
| 127 |
+
from rag_news_manager import initialize_rag_system, get_rag_stats
|
| 128 |
+
|
| 129 |
+
if initialize_rag_system():
|
| 130 |
+
print("β
RAG system initialized successfully!")
|
| 131 |
+
|
| 132 |
+
# Get statistics
|
| 133 |
+
stats = get_rag_stats()
|
| 134 |
+
if stats:
|
| 135 |
+
print(f"π Current RAG Statistics:")
|
| 136 |
+
print(f" Total entries: {stats['total_entries']}")
|
| 137 |
+
print(f" Real news: {stats['real_count']}")
|
| 138 |
+
print(f" Fake news: {stats['fake_count']}")
|
| 139 |
+
print(f" Average confidence: {stats['avg_confidence']:.1%}")
|
| 140 |
+
print(f" Google Drive folder: {stats['folder_id']}")
|
| 141 |
+
print(f" Google Drive file: {stats['file_id']}")
|
| 142 |
+
|
| 143 |
+
# Provide Google Drive links
|
| 144 |
+
if stats['folder_id']:
|
| 145 |
+
folder_url = f"https://drive.google.com/drive/folders/{stats['folder_id']}"
|
| 146 |
+
print(f"\nπ Google Drive RAG Folder: {folder_url}")
|
| 147 |
+
|
| 148 |
+
if stats['file_id']:
|
| 149 |
+
file_url = f"https://drive.google.com/file/d/{stats['file_id']}/view"
|
| 150 |
+
print(f"π Google Drive RAG File: {file_url}")
|
| 151 |
+
else:
|
| 152 |
+
print("β οΈ Could not get RAG statistics")
|
| 153 |
+
else:
|
| 154 |
+
print("β RAG system initialization failed")
|
| 155 |
+
return False
|
| 156 |
+
|
| 157 |
+
except ImportError as e:
|
| 158 |
+
print(f"β Could not import RAG system: {e}")
|
| 159 |
+
return False
|
| 160 |
+
except Exception as e:
|
| 161 |
+
print(f"β RAG system test failed: {e}")
|
| 162 |
+
return False
|
| 163 |
+
|
| 164 |
+
return True
|
| 165 |
+
|
def main():
    """Main setup function"""
    print("Google Drive RAG System Setup")
    print("=" * 50)
    print("This script will help you set up Google Drive integration")
    print("for saving high-confidence news for RAG purposes.")
    print()

    # Step 1: Setup credentials
    if not setup_google_drive_credentials():
        print("\n❌ Setup failed at credentials step")
        return False

    # Step 2: Test RAG system
    if not test_rag_system():
        print("\n❌ Setup failed at RAG system test")
        return False

    print("\nSetup completed successfully!")
    print("=" * 50)
    print("✅ Google Drive credentials configured")
    print("✅ RAG system initialized")
    print("✅ Ready to save high-confidence news!")
    print()
    print("Next steps:")
    print("1. Your app will now automatically save news with 95%+ confidence")
    print("2. Check your Google Drive for the 'Vietnamese_Fake_News_RAG' folder")
    print("3. View saved news in the 'high_confidence_news.json' file")
    print("4. The system will use this data for better RAG analysis")

    return True

if __name__ == "__main__":
    main()

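The setup script's entry point calls `main()` and discards the boolean it returns, so the process would exit with status 0 even when a setup step fails. A minimal sketch of one way to surface that result to the shell follows. `run_setup` is a hypothetical stand-in for the script's `main()`, and `exit_code` is a helper introduced only for this illustration.

```python
def run_setup() -> bool:
    """Hypothetical stand-in for the setup script's main().

    The real function runs the credential and RAG checks and returns
    False as soon as one of them fails; this placeholder always succeeds.
    """
    return True


def exit_code(ok: bool) -> int:
    """Map a success flag to a conventional process exit status."""
    return 0 if ok else 1


# Status the process would report for a successful and a failed run.
status_on_success = exit_code(run_setup())
status_on_failure = exit_code(False)
```

Under the `__main__` guard this could be wired up as `sys.exit(exit_code(main()))`, which lets shell scripts and CI jobs detect a failed setup.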
view_rag_news.py
ADDED
|
@@ -0,0 +1,283 @@
#!/usr/bin/env python3
"""
RAG News Viewer
View and manage high-confidence news saved in Google Drive
"""

import json
import os
from datetime import datetime
from rag_news_manager import initialize_rag_system, get_rag_stats, rag_manager

def format_news_entry(entry, index):
    """Format a news entry for display"""
    created_date = datetime.fromisoformat(entry['created_at'].replace('Z', '+00:00'))
    formatted_date = created_date.strftime("%Y-%m-%d %H:%M:%S")

    prediction_emoji = "✅" if entry['prediction'] == 'REAL' else "❌"
    confidence_color = "🟢" if entry['gemini_confidence'] > 0.95 else "🟡"

    print(f"\n{'='*80}")
    print(f"📰 ENTRY #{index} - {prediction_emoji} {entry['prediction']} {confidence_color}")
    print(f"{'='*80}")
    print(f"ID: {entry['id']}")
    print(f"📅 Created: {formatted_date}")
    print(f"Confidence: {entry['gemini_confidence']:.1%}")
    print(f"Hash: {entry['content_hash'][:12]}...")
    print(f"Source: {entry.get('source', 'Unknown')}")
    print(f"✅ Verified: {entry.get('verified', False)}")

    if entry.get('distilbert_confidence'):
        print(f"🤖 DistilBERT: {entry['distilbert_confidence']:.1%}")

    print(f"\n📰 NEWS TEXT:")
    print(f"{'-'*40}")
    print(entry['news_text'])

    print(f"\n🧠 GEMINI ANALYSIS:")
    print(f"{'-'*40}")
    print(entry['gemini_analysis'])

    if entry.get('search_results'):
        print(f"\nSEARCH RESULTS ({len(entry['search_results'])} sources):")
        print(f"{'-'*40}")
        for i, result in enumerate(entry['search_results'][:3], 1):
            print(f"{i}. {result.get('title', 'No title')}")
            print(f"   {result.get('snippet', 'No snippet')[:100]}...")
            print(f"   {result.get('link', 'No link')}")

    return True

def view_all_news():
    """View all saved news entries"""
    print("VIEWING ALL RAG NEWS ENTRIES")
    print("=" * 60)

    try:
        data = rag_manager.load_rag_data()
        entries = data.get('news_entries', [])

        if not entries:
            print("No news entries found in RAG system")
            return

        print(f"Found {len(entries)} news entries")
        print(f"📅 Last updated: {data.get('metadata', {}).get('last_updated', 'Unknown')}")

        # Sort by creation date (newest first)
        entries.sort(key=lambda x: x['created_at'], reverse=True)

        for i, entry in enumerate(entries, 1):
            format_news_entry(entry, i)

            if i < len(entries):
                input("\n⏸️ Press Enter to view next entry (or Ctrl+C to exit)...")

        print(f"\n✅ Displayed all {len(entries)} entries")

    except KeyboardInterrupt:
        print("\n\nViewing interrupted by user")
    except Exception as e:
        print(f"❌ Error viewing news: {e}")

def view_recent_news(limit=5):
    """View recent news entries"""
    print(f"📰 VIEWING {limit} MOST RECENT NEWS ENTRIES")
    print("=" * 50)

    try:
        data = rag_manager.load_rag_data()
        entries = data.get('news_entries', [])

        if not entries:
            print("No news entries found in RAG system")
            return

        # Sort by creation date (newest first)
        entries.sort(key=lambda x: x['created_at'], reverse=True)
        recent_entries = entries[:limit]

        print(f"Showing {len(recent_entries)} most recent entries")

        for i, entry in enumerate(recent_entries, 1):
            format_news_entry(entry, i)

            if i < len(recent_entries):
                input("\n⏸️ Press Enter to view next entry (or Ctrl+C to exit)...")

    except KeyboardInterrupt:
        print("\n\nViewing interrupted by user")
    except Exception as e:
        print(f"❌ Error viewing recent news: {e}")

def view_by_prediction(prediction):
    """View news entries by prediction type"""
    print(f"VIEWING {prediction} NEWS ENTRIES")
    print("=" * 50)

    try:
        data = rag_manager.load_rag_data()
        entries = data.get('news_entries', [])

        # Filter by prediction
        filtered_entries = [entry for entry in entries if entry['prediction'] == prediction]

        if not filtered_entries:
            print(f"No {prediction} news entries found")
            return

        print(f"Found {len(filtered_entries)} {prediction} entries")

        # Sort by confidence (highest first)
        filtered_entries.sort(key=lambda x: x['gemini_confidence'], reverse=True)

        for i, entry in enumerate(filtered_entries, 1):
            format_news_entry(entry, i)

            if i < len(filtered_entries):
                input("\n⏸️ Press Enter to view next entry (or Ctrl+C to exit)...")

    except KeyboardInterrupt:
        print("\n\nViewing interrupted by user")
    except Exception as e:
        print(f"❌ Error viewing {prediction} news: {e}")

def search_news(query):
    """Search news entries"""
    print(f"SEARCHING FOR: '{query}'")
    print("=" * 50)

    try:
        results = rag_manager.search_rag_news(query, limit=10)

        if not results:
            print("No matching news entries found")
            return

        print(f"Found {len(results)} matching entries")

        for i, entry in enumerate(results, 1):
            format_news_entry(entry, i)

            if i < len(results):
                input("\n⏸️ Press Enter to view next entry (or Ctrl+C to exit)...")

    except KeyboardInterrupt:
        print("\n\nSearch interrupted by user")
    except Exception as e:
        print(f"❌ Error searching news: {e}")

def show_statistics():
    """Show RAG system statistics"""
    print("RAG SYSTEM STATISTICS")
    print("=" * 40)

    try:
        stats = get_rag_stats()

        if not stats:
            print("❌ Could not retrieve statistics")
            return

        print(f"Total Entries: {stats['total_entries']}")
        print(f"✅ Real News: {stats['real_count']}")
        print(f"❌ Fake News: {stats['fake_count']}")
        print(f"Average Confidence: {stats['avg_confidence']:.1%}")

        if stats['latest_entry']:
            latest = stats['latest_entry']
            latest_date = datetime.fromisoformat(latest['created_at'].replace('Z', '+00:00'))
            print(f"Latest Entry: {latest_date.strftime('%Y-%m-%d %H:%M:%S')}")
            print(f"   📰 {latest['news_text'][:80]}...")
            print(f"   🎯 {latest['prediction']} ({latest['gemini_confidence']:.1%})")

        print(f"\nGoogle Drive Links:")
        if stats['folder_id']:
            folder_url = f"https://drive.google.com/drive/folders/{stats['folder_id']}"
            print(f"   RAG Folder: {folder_url}")

        if stats['file_id']:
            file_url = f"https://drive.google.com/file/d/{stats['file_id']}/view"
            print(f"   RAG File: {file_url}")

    except Exception as e:
        print(f"❌ Error getting statistics: {e}")

def main_menu():
    """Main menu for the viewer"""
    while True:
        print("\n" + "="*60)
        print("RAG NEWS VIEWER - Vietnamese Fake News Detection")
        print("="*60)
        print("1. View Statistics")
        print("2. 📰 View Recent News (5 entries)")
        print("3. View All News")
        print("4. ✅ View Real News Only")
        print("5. ❌ View Fake News Only")
        print("6. Search News")
        print("7. Open Google Drive")
        print("8. ❌ Exit")
        print("="*60)

        try:
            choice = input("Select option (1-8): ").strip()

            if choice == '1':
                show_statistics()
            elif choice == '2':
                view_recent_news(5)
            elif choice == '3':
                view_all_news()
            elif choice == '4':
                view_by_prediction('REAL')
            elif choice == '5':
                view_by_prediction('FAKE')
            elif choice == '6':
                query = input("Enter search query: ").strip()
                if query:
                    search_news(query)
                else:
                    print("❌ Please enter a search query")
            elif choice == '7':
                stats = get_rag_stats()
                if stats and stats['folder_id']:
                    folder_url = f"https://drive.google.com/drive/folders/{stats['folder_id']}"
                    print(f"Opening Google Drive: {folder_url}")
                    import webbrowser
                    webbrowser.open(folder_url)
                else:
                    print("❌ Google Drive folder not found")
            elif choice == '8':
                print("Goodbye!")
                break
            else:
                print("❌ Invalid choice. Please select 1-8.")

        except KeyboardInterrupt:
            print("\n\nGoodbye!")
            break
        except Exception as e:
            print(f"❌ Error: {e}")

def main():
    """Main function"""
    print("RAG News Viewer")
    print("=" * 30)

    # Initialize RAG system
    print("🔧 Initializing RAG system...")
    if not initialize_rag_system():
        print("❌ Failed to initialize RAG system")
        print("Please run setup_google_drive_rag.py first")
        return

    print("✅ RAG system initialized successfully!")

    # Show initial statistics
    show_statistics()

    # Start main menu
    main_menu()

if __name__ == "__main__":
    main()
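The viewer parses `created_at` with `datetime.fromisoformat` after replacing a trailing `Z` with `+00:00`, which is needed because `fromisoformat` accepts the `Z` suffix only from Python 3.11 onward. It also sorts entries on the raw `created_at` string, which gives chronological order only while every timestamp uses the same format and UTC offset. A minimal sketch of sorting on the parsed value instead is shown below. The helper `parse_created_at` and the sample entries are assumptions made for illustration, and treating naive timestamps as UTC is a choice of this sketch, not something the source states.

```python
from datetime import datetime, timezone


def parse_created_at(value: str) -> datetime:
    """Parse an ISO 8601 timestamp, accepting a trailing 'Z' for UTC."""
    parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
    # Assumption: timestamps without an offset are treated as UTC so that
    # naive and aware values can be compared with each other.
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


# Made-up entries for illustration; only 'created_at' matters here.
entries = [
    {"id": "a", "created_at": "2024-01-01T12:00:00+07:00"},  # 05:00 UTC
    {"id": "b", "created_at": "2024-01-01T06:00:00Z"},       # 06:00 UTC
    {"id": "c", "created_at": "2024-01-01T04:30:00"},        # taken as 04:30 UTC
]

# Newest first, compared as instants in time rather than as strings.
entries.sort(key=lambda e: parse_created_at(e["created_at"]), reverse=True)
newest_first = [e["id"] for e in entries]
```

With these sample values a plain string sort would put entry `a` first because `12:00` compares greater than `06:00`, even though `a` is the earlier instant once its `+07:00` offset is applied.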