arxiv-plagiarism-checker-Ilm / docs.md

gamingflexer

Documentation Updated

cc93c45 almost 2 years ago

	- 2024_main_document_lvl
	- 2024_main_paragraph_lvl

	- 2023_main_document_lvl
	- 2023_main_paragraph_lvl

	- Embeddings: convert PDFs
		- Paragraphs
		- Documents
	- HNSW - K-means fast search
	- K-means graphs based on the topics
	- Check for similarity within our own db
		- Paragraphs
		- Documents
	- Get the most important ones
	- Get the unique sentences, like the title and other content (?); the LLM can work this out
	- Search Google using the unique sentences --> get the top 3 and run the same check again --> result

	### 1. Data Input:

	- Input Data: Collect a diverse dataset of academic papers, articles, or textual content from various sources.
	- Format: Ensure the data is in a consistent and machine-readable format, such as plain text or a format compatible with your chosen NLP library.

	### 2. Data Cleaning:

	- Text Cleaning:
		- Remove metadata, formatting, and irrelevant details.
		- Handle special characters, punctuation, and stopwords.

	- Normalization:
		- Convert text to lowercase to ensure uniformity.

	- Tokenization:
		- Tokenize the text into words or subword tokens.
		- Libraries:
			- For Python, you can use NLTK or spaCy for tokenization.
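
The cleaning steps above can be sketched with the standard library alone. This is a minimal illustration, not the project's actual pipeline: the `STOPWORDS` set and the `clean_and_tokenize` name are assumptions made for the example, and NLTK or spaCy would normally supply a fuller stopword list and a better tokenizer.

```python
import string

# Tiny illustrative stopword list (an assumption); NLTK and spaCy ship fuller ones.
STOPWORDS = {"a", "an", "the", "of", "and", "or", "in", "on", "to", "is", "are", "for"}

# Translation table that maps every ASCII punctuation character to a space.
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def clean_and_tokenize(text: str) -> list:
    """Lowercase the text, strip punctuation, split on whitespace, drop stopwords."""
    lowered = text.lower()
    no_punct = lowered.translate(_PUNCT_TO_SPACE)
    tokens = no_punct.split()
    return [tok for tok in tokens if tok not in STOPWORDS]


if __name__ == "__main__":
    print(clean_and_tokenize("The Detection of Plagiarism, in arXiv papers!"))
```

Whitespace splitting is the simplest possible tokenizer; subword tokenization would need a trained tokenizer and is out of scope for this sketch.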

	### 3. Embedding Generation:

	- Word Level Embeddings:
		- Utilize pre-trained word embeddings like Word2Vec or GloVe.
		- Libraries:
			- For Word2Vec: Gensim.
			- For GloVe: spaCy or Gensim.

	- Paragraph Level Embeddings:
		- Aggregate word embeddings by averaging, or learn paragraph vectors directly with Doc2Vec.
		- Libraries:
			- Gensim for Doc2Vec.

	- Document Level Embeddings:
		- Consider using the average of paragraph embeddings or more advanced models.
		- Libraries:
			- spaCy or the transformers library for more advanced models.
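
The averaging route from words to paragraphs to documents can be sketched as follows. The vectors in `TOY_VECTORS` are made up for illustration, and the function names are hypothetical; in practice the word vectors would come from a pre-trained Word2Vec or GloVe model.

```python
# Hypothetical 3-dimensional word vectors; real ones would come from Word2Vec or GloVe.
TOY_VECTORS = {
    "plagiarism": [1.0, 0.0, 0.0],
    "detection": [0.0, 1.0, 0.0],
    "paper": [0.0, 0.0, 1.0],
}


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    if not vectors:
        raise ValueError("cannot average an empty list of vectors")
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def paragraph_embedding(tokens, word_vectors):
    """Average the vectors of the tokens that have one; unknown tokens are skipped."""
    known = [word_vectors[tok] for tok in tokens if tok in word_vectors]
    return mean_vector(known)


def document_embedding(paragraphs, word_vectors):
    """Average the paragraph embeddings of a document given as a list of token lists."""
    return mean_vector([paragraph_embedding(p, word_vectors) for p in paragraphs])


if __name__ == "__main__":
    doc = [["plagiarism", "detection"], ["paper"]]
    print(document_embedding(doc, TOY_VECTORS))
```

Plain averaging ignores word order, which is one reason to consider Doc2Vec or a transformer model for the paragraph and document levels.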

	### 4. Pairwise Comparison:

	- Similarity Measures:
		- Calculate cosine similarity, Jaccard similarity, or other relevant measures.
		- Libraries:
			- scikit-learn for cosine similarity.
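
As a reference for the two measures named above, here is a dependency-free sketch. Cosine similarity works on embedding vectors, while Jaccard similarity works on token sets. The function names are chosen for this example; for large batches scikit-learn's vectorized implementation is the practical choice.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def jaccard_similarity(tokens_a, tokens_b):
    """Size of the intersection over size of the union of the two token sets."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    if not set_a and not set_b:
        return 1.0  # convention chosen here: two empty texts count as identical
    return len(set_a & set_b) / len(set_a | set_b)


if __name__ == "__main__":
    print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))
    print(jaccard_similarity(["a", "b", "c"], ["b", "c", "d"]))
```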

	### 5. Clustering:

	- K-Means Clustering:
		- Partition documents into K clusters.
		- Libraries:
			- scikit-learn for K-Means.

	- Hierarchical Clustering:
		- Build a hierarchy of clusters.
		- Libraries:
			- scipy.cluster.hierarchy for hierarchical clustering.

	- DBSCAN:
		- Density-based clustering.
		- Libraries:
			- scikit-learn for DBSCAN.
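
To make the K-Means step concrete, the following is a small pure-Python sketch of Lloyd's algorithm over document embeddings. It is a teaching aid, not a replacement for scikit-learn: using the first `k` points as initial centroids is a simplification assumed here (scikit-learn defaults to k-means++ initialization), and the `kmeans` function name and its parameters are hypothetical.

```python
import math


def kmeans(points, k, iterations=20):
    """Lloyd's algorithm; the first k points serve as the initial centroids."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    [sum(col) / len(members) for col in zip(*members)]
                )
            else:
                new_centroids.append(centroids[c])  # leave an empty cluster in place
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return labels, centroids


if __name__ == "__main__":
    embeddings = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
    labels, centroids = kmeans(embeddings, k=2)
    print(labels, centroids)
```

Documents that land in the same cluster are candidates for the more detailed pairwise comparison, which keeps the number of comparisons manageable.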

	### 6. Scoring System:

	- Threshold Setting:
		- Establish a threshold for similarity scores to classify documents.
		- Determine the threshold through experimentation.

	- Scoring Logic:
		- Develop a scoring system based on the results of pairwise comparison and clustering.
		- Decide on the scoring weights for each component.
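
One possible shape for such a scoring system is a weighted sum followed by a threshold. Everything numeric below is a placeholder: the weights `(0.5, 0.3, 0.2)`, the `0.7` threshold, and the choice of three components are assumptions for illustration, to be replaced by values found through experimentation.

```python
def plagiarism_score(paragraph_sim, document_sim, same_cluster,
                     weights=(0.5, 0.3, 0.2)):
    """Weighted sum of paragraph similarity, document similarity, and cluster match.

    The default weights are hypothetical and should be tuned on labeled data.
    """
    w_para, w_doc, w_cluster = weights
    cluster_signal = 1.0 if same_cluster else 0.0
    return w_para * paragraph_sim + w_doc * document_sim + w_cluster * cluster_signal


def classify(score, threshold=0.7):
    """Label a score; the 0.7 default threshold is a placeholder, not a tuned value."""
    return "flagged" if score >= threshold else "clear"


if __name__ == "__main__":
    score = plagiarism_score(paragraph_sim=0.9, document_sim=0.8, same_cluster=True)
    print(round(score, 2), classify(score))
```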

	### 7. Hybrid Approach:

	- Traditional Models:
		- Use traditional similarity measures for efficiency.
		- Implement efficient algorithms for quick pairwise comparisons.

	- Large Language Models:
		- Fine-tune or use pre-trained models for enhanced context understanding.
		- Hugging Face Transformers library for accessing pre-trained models.

	- Fingerprinting Concept
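
The hybrid approach in this section can be read as a two-stage search: a cheap traditional measure prunes the corpus, then a more expensive model scores only the survivors. The sketch below assumes that reading. `hybrid_search`, `prefilter_threshold`, and `token_overlap_scorer` are hypothetical names, and `token_overlap_scorer` is only a stand-in for a real model-based scorer such as cosine similarity between transformer embeddings.

```python
def jaccard(set_a, set_b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def token_overlap_scorer(query_tokens, doc_tokens):
    """Stand-in for an expensive scorer: fraction of query tokens found in the document."""
    query = set(query_tokens)
    return len(query & set(doc_tokens)) / len(query) if query else 0.0


def hybrid_search(query_tokens, corpus, expensive_scorer,
                  prefilter_threshold=0.2, top_k=3):
    """Stage 1: keep documents whose Jaccard similarity passes the threshold.
    Stage 2: score the survivors with expensive_scorer and return the top_k.

    corpus maps a document id to its list of tokens.
    """
    query = set(query_tokens)
    candidates = [
        doc_id for doc_id, tokens in corpus.items()
        if jaccard(query, set(tokens)) >= prefilter_threshold
    ]
    scored = [
        (doc_id, expensive_scorer(query_tokens, corpus[doc_id]))
        for doc_id in candidates
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


if __name__ == "__main__":
    corpus = {
        "a": ["deep", "learning", "models"],
        "b": ["cooking", "recipes"],
        "c": ["deep", "learning", "for", "text"],
    }
    print(hybrid_search(["deep", "learning", "text"], corpus, token_overlap_scorer))
```

Passing the scorer in as a callable keeps the pruning logic independent of whichever model is used in the second stage.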