table-extraction / README.md
Sudhanshu1304's picture
Adding library_name (#5)
f2f8d0d verified
---
library_name: PaddleOCR
tags:
- table-extraction
- paddleocr
- huggingface
license: mit
---
# **🌟 Table Extraction Tool: OCR & Computer Vision for Structured Data**
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Build Status](https://img.shields.io/badge/build-passing-brightgreen.svg)](https://github.com/Sudhanshu1304/table-transformer)
[![Stars](https://img.shields.io/github/stars/Sudhanshu1304/table-transformer.svg)](https://github.com/Sudhanshu1304/table-transformer/stargazers)
[![Watchers](https://img.shields.io/github/watchers/Sudhanshu1304/table-transformer.svg)](https://github.com/Sudhanshu1304/table-transformer/watchers)
## Overview
Table Transformer is an advanced open-source tool that leverages state-of-the-art OCR and computer vision techniques to extract structured tabular data from images. It is ideal for enhancing LLM preprocessing, powering data analysis pipelines, and automating your data extraction tasks.
## Features
- πŸ“Š **Automatic Table Detection**: Effortlessly detect tables in images.
- πŸ“ **OCR-based Document Processing**: Extract text with high accuracy.
- 🧠 **Integrated Models**: Seamlessly combine OCR and table detection models.
- πŸ’Ύ **Flexible Export Options**: Export data as DataFrame, HTML, CSV, and more.
---
## **Tool Overview**
<div align="center">
<!-- First Row -->
<img src="images/image1.png" alt="Image upload" width="45%" style="margin: 10px;">
<img src="images/image2.png" alt="Table detection & extraction" width="45%" style="margin: 10px;">
<!-- Second Row -->
<img src="images/image3.png" alt="Table in HTML format" width="45%" style="margin: 10px;">
<img src="images/image4.png" alt="Table exported as CSV" width="45%" style="margin: 10px;">
</div>
---
## **Open-Source Tools Used**
- **[PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR)**: For text extraction.
- **[Hugging Face Table Detection](https://huggingface.co/foduucom/table-detection-and-extraction)**: For table structure detection.
---
## **Installation**
### **Prerequisites**
- Python 3.8+
- Conda
### **Setup**
1. **Clone the Repository**
Clone the repository to your local machine:
```bash
git clone https://github.com/Sudhanshu1304/table-transformer.git
cd table-transformer
```
2. **Create and Activate Conda Environment**
Create a new conda environment and activate it:
```bash
conda create --name myenv python=3.12.7
conda activate myenv
```
3. **Install PaddlePaddle**
Install PaddlePaddle in the conda environment:
```bash
python -m pip install paddlepaddle==3.0.0rc1 -i https://www.paddlepaddle.org.cn/packages/stable/cpu/
```
4. **Install PaddleOCR**
Install PaddleOCR:
```bash
pip install paddleocr
```
5. **Install Additional Dependencies**
Install other required packages:
```bash
pip install ultralytics pandas
pip install streamlit
```
### **Project Structure**
```
project/
β”œβ”€β”€ src/
β”‚ β”œβ”€β”€ streamlit_app.py # Streamlit application
β”‚ β”œβ”€β”€ table_creator/
β”‚ β”‚ └── processing.py # Core processing logic
β”‚ β”œβ”€β”€ models/
β”‚ β”‚ └── text.py # table detection and text recognition
β”‚
β”œβ”€β”€ requirements.txt # Dependencies
β”œβ”€β”€ README.md # Project documentation
└── .gitignore # Git ignore configuration
```
### **Usage**
Run the Streamlit app to interact with the tool:
```bash
streamlit run src/streamlit_app.py
```
### **Contributions**
Contributions are welcome! Please fork the repository and submit a pull request with your improvements or new features.
### **License**
This project is licensed under the MIT License.
---
## **Connect with Us**
Stay updated and connect for any queries or contributions:
- **GitHub**: [Sudhanshu1304](https://github.com/Sudhanshu1304)
- **LinkedIn**: [Sudhanshu Pandey](https://www.linkedin.com/in/sudhanshu-pandey-847448193/)
- **Medium**: [@sudhanshu.dpandey](https://medium.com/@sudhanshu.dpandey)
---
## **Support**
If you find this tool useful, please consider giving it a ⭐ on GitHub. Your support is greatly appreciated!
Happy Extracting!