| - 2024_main_document_lvl | |
| - 2024_main_paragraph_lvl | |
| - 2023_main_document_lvl | |
| - 2023_main_paragraph_lvl | |
| - Embeddings: convert PDFs | |
| - Para | |
| - Docs | |
| - HNSW / K-means fast search | |
| - K-means graphs based on the topics | |
| - Check for similarity within our own DB | |
| - Para | |
| - Docs | |
| - Get the most important ones | |
| - Get the unique sentences, such as the title and other content (?); let an LLM reason this out | |
| - Search Google using the unique sentences --> take the top 3 results and run the same check again --> result | |
| ### 1. Data Input: | |
| - **Input Data:** Collect a diverse dataset of academic papers, articles, or textual content from various sources. | |
| - **Format:** Ensure the data is in a consistent and machine-readable format, such as plain text or a format compatible with your chosen NLP library. | |
| ### 2. Data Cleaning: | |
| - **Text Cleaning:** | |
| - Remove metadata, formatting, and irrelevant details. | |
| - Handle special characters, punctuation, and stopwords. | |
| - **Normalization:** | |
| - Convert text to lowercase to ensure uniformity. | |
| - **Tokenization:** | |
| - Tokenize the text into words or subword tokens. | |
| - **Libraries:** | |
| - For Python, you can use NLTK or spaCy for tokenization. | |
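The cleaning, normalization, and tokenization steps above can be sketched with the standard library alone. This is a minimal illustration, not the project's actual preprocessing: the function name `clean_and_tokenize` and the tiny stopword set are assumptions made for the example, and NLTK or spaCy would normally supply a fuller stopword list and a better tokenizer.

```python
import re
import string

# Tiny illustrative stopword list (an assumption; NLTK and spaCy ship fuller ones).
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "are"}


def clean_and_tokenize(text: str) -> list[str]:
    """Lowercase, strip ASCII punctuation, split on whitespace, drop stopwords."""
    lowered = text.lower()
    # Replace every ASCII punctuation character with a space.
    no_punct = re.sub(f"[{re.escape(string.punctuation)}]", " ", lowered)
    return [tok for tok in no_punct.split() if tok not in STOPWORDS]


print(clean_and_tokenize("The Results, in Brief: embeddings are USEFUL!"))
# ['results', 'brief', 'embeddings', 'useful']
```

Lowercasing before stopword removal keeps the stopword set small, since it only needs lowercase entries.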
| ### 3. Embedding Generation: | |
| - **Word Level Embeddings:** | |
| - Utilize pre-trained word embeddings like Word2Vec or GloVe. | |
| - **Libraries:** | |
| - For Word2Vec: Gensim library. | |
| - For GloVe: spaCy or Gensim. | |
| - **Paragraph Level Embeddings:** | |
| - Aggregate word embeddings by averaging them, or train Doc2Vec to learn paragraph vectors directly. | |
| - **Libraries:** | |
| - Gensim for Doc2Vec. | |
| - **Document Level Embeddings:** | |
| - Consider averaging the paragraph embeddings, or encode the whole document with a more advanced model such as a transformer. | |
| - **Libraries:** | |
| - spaCy or the Hugging Face Transformers library for more advanced models. | |
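The averaging route from word vectors to paragraph and document vectors can be sketched as follows. The three-dimensional `WORD_VECTORS` table is made up for illustration; in practice the vectors would come from a pre-trained Word2Vec or GloVe model, and the helper names here are hypothetical.

```python
# Toy 3-dimensional word vectors (made up for illustration; real ones would
# come from a pre-trained Word2Vec or GloVe model).
WORD_VECTORS = {
    "neural": [1.0, 0.0, 0.0],
    "network": [0.0, 1.0, 0.0],
    "training": [0.0, 0.0, 1.0],
}


def mean_vector(vectors: list[list[float]]) -> list[float]:
    """Element-wise average of equal-length vectors."""
    if not vectors:
        raise ValueError("need at least one vector")
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


def embed_paragraph(tokens: list[str]) -> list[float]:
    """Average the vectors of known tokens; unknown tokens are skipped."""
    known = [WORD_VECTORS[t] for t in tokens if t in WORD_VECTORS]
    return mean_vector(known)


def embed_document(paragraphs: list[list[str]]) -> list[float]:
    """Average the paragraph embeddings to get one document vector."""
    return mean_vector([embed_paragraph(p) for p in paragraphs])


doc = [["neural", "network"], ["training", "unknownword"]]
print(embed_document(doc))  # [0.25, 0.25, 0.5]
```

Averaging discards word order, which is one reason to try Doc2Vec or a transformer model when paragraph-level nuance matters.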
| ### 4. Pairwise Comparison: | |
| - **Similarity Measures:** | |
| - Calculate cosine similarity between embeddings, Jaccard similarity between token sets, or other relevant measures. | |
| - **Libraries:** | |
| - scikit-learn for cosine similarity. | |
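For reference, the two measures named above are cos(a, b) = (a · b) / (‖a‖ ‖b‖) for vectors and J(A, B) = |A ∩ B| / |A ∪ B| for sets. A plain-Python sketch follows; scikit-learn's `cosine_similarity` would be the usual choice for whole matrices. Returning 0.0 for a zero vector and 1.0 for two empty sets are conventions chosen for this example.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention for this sketch: zero vectors match nothing
    return dot / (norm_a * norm_b)


def jaccard_similarity(a: set, b: set) -> float:
    """Size of the intersection divided by size of the union."""
    if not a and not b:
        return 1.0  # convention for this sketch: two empty sets are identical
    return len(a & b) / len(a | b)


print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))           # 0.0
print(jaccard_similarity({"a", "b", "c"}, {"b", "c", "d"}))  # 0.5
```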
| ### 5. Clustering: | |
| - **K-Means Clustering:** | |
| - Partition documents into K clusters. | |
| - **Libraries:** | |
| - scikit-learn for K-Means. | |
| - **Hierarchical Clustering:** | |
| - Build a hierarchy of clusters. | |
| - **Libraries:** | |
| - scipy.cluster.hierarchy for hierarchical clustering. | |
| - **DBSCAN:** | |
| - Group documents that lie in dense regions of the embedding space; points in sparse regions are labeled as noise. | |
| - **Libraries:** | |
| - scikit-learn for DBSCAN. | |
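To make the K-Means step concrete, here is a small pure-Python version of Lloyd's algorithm. It is a sketch under simplifying assumptions: the first `k` points serve as initial centroids so the run is deterministic, and the iteration count is fixed. scikit-learn's `KMeans` uses a smarter initialization (k-means++) by default and is the better choice for real data.

```python
def squared_distance(p: list[float], q: list[float]) -> float:
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(p, q))


def kmeans(points: list[list[float]], k: int, iterations: int = 10):
    """Lloyd's algorithm with the first k points as initial centroids."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids


points = [[0.0, 0.0], [10.0, 10.0], [0.0, 1.0], [10.0, 11.0]]
labels, centroids = kmeans(points, k=2)
print(labels)     # [0, 1, 0, 1]
print(centroids)  # [[0.0, 0.5], [10.0, 10.5]]
```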
| ### 6. Scoring System: | |
| - **Threshold Setting:** | |
| - Establish a threshold for similarity scores to classify documents. | |
| - Determine the threshold through experimentation. | |
| - **Scoring Logic:** | |
| - Develop a scoring system based on the results of pairwise comparison and clustering. | |
| - Decide on the scoring weights for each component. | |
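One possible shape for the scoring logic is a weighted sum of the pairwise and clustering signals, followed by a threshold. Everything numeric below is a placeholder: the weights (0.75 and 0.25), the 0.8 threshold, and the names `ScoreWeights`, `combined_score`, and `classify` are assumptions for illustration and would need tuning through the experimentation described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoreWeights:
    pairwise: float = 0.75  # placeholder weight for the best pairwise similarity
    cluster: float = 0.25   # placeholder weight for the same-cluster signal


def combined_score(max_similarity: float, same_cluster: bool,
                   weights: ScoreWeights = ScoreWeights()) -> float:
    """Weighted sum of the pairwise similarity and a 0/1 cluster signal."""
    cluster_signal = 1.0 if same_cluster else 0.0
    return weights.pairwise * max_similarity + weights.cluster * cluster_signal


def classify(score: float, threshold: float = 0.8) -> str:
    """Label a document by comparing its score with the threshold."""
    return "flagged" if score >= threshold else "ok"


score = combined_score(max_similarity=0.8, same_cluster=True)
print(classify(score))  # flagged
```

Keeping the weights in one small object makes it easy to sweep them during threshold experiments.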
| ### 7. Hybrid Approach: | |
| - **Traditional Models:** | |
| - Use traditional similarity measures for efficiency. | |
| - Implement efficient algorithms for quick pairwise comparisons. | |
| - **Large Language Models:** | |
| - Fine-tune or use pre-trained models for enhanced context understanding. | |
| - Use the Hugging Face Transformers library to access pre-trained models. | |
| - Fingerprinting Concept |
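The hybrid idea can be sketched as a two-stage check: a cheap fingerprint filter first, then a costly model only on the survivors. Reading "fingerprinting" as hashed word shingles is an assumption about what the note means. The names `fingerprint`, `overlap`, `hybrid_check`, the shingle size of 3, and the 0.2 prefilter are all hypothetical, and `expensive_scorer` stands in for whatever language-model comparison is eventually used.

```python
import hashlib
from typing import Callable


def fingerprint(tokens: list[str], k: int = 3) -> set[str]:
    """Hash every run of k consecutive tokens (a 'shingle')."""
    shingles = (" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1))
    return {hashlib.sha1(s.encode("utf-8")).hexdigest()[:12] for s in shingles}


def overlap(a: set[str], b: set[str]) -> float:
    """Jaccard overlap of two fingerprints; 0.0 if both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def hybrid_check(query: list[str], corpus: dict[str, list[str]],
                 expensive_scorer: Callable[[list[str], list[str]], float],
                 prefilter: float = 0.2) -> dict[str, float]:
    """Stage 1: cheap fingerprint filter. Stage 2: costly scorer on survivors."""
    query_fp = fingerprint(query)
    results = {}
    for doc_id, tokens in corpus.items():
        if overlap(query_fp, fingerprint(tokens)) >= prefilter:
            results[doc_id] = expensive_scorer(query, tokens)
    return results


corpus = {
    "a": "the quick brown fox sleeps".split(),
    "b": "completely different words appear here".split(),
}
query = "the quick brown fox jumps".split()
# A constant scorer stands in for a real model call.
print(hybrid_check(query, corpus, expensive_scorer=lambda q, d: 1.0))
# {'a': 1.0}
```

Only document `a` reaches the expensive scorer here, which is the point of the prefilter: the slow model runs on a short candidate list instead of the whole database.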