Stop Wasting Hours on Manual Data Entry Forever

If you're tired of manually entering data from documents or forms, you're not alone. Many businesses and organizations face the challenge of dealing with a lot of paperwork. LLMs (Large Language Models) can help solve this problem by automating the process of extracting and processing information from documents. In this post, we'll explore how you can use LLMs to streamline data entry, saving time and reducing errors.

The Problem

Many industries rely on processing large amounts of information from documents, whether invoices, contracts, or customer forms. This process usually involves a lot of repetitive work. Employees often spend hours extracting data such as dates, names, and amounts and entering it into spreadsheets or databases. Manual entry invites human errors, which are costly in the long run. Processing dozens or even hundreds of invoices by hand is time-consuming and inefficient.

The Solution

LLMs can automate data entry by extracting key information from documents and organizing it in a structured way. A model such as Google's Gemini can read a document's text, identify relevant details (such as invoice numbers, dates, and payment terms), and return them as structured data. With a short script, you can save hours of manual data entry and reduce the errors that come with retyping values by hand.

Here’s how you can use LLMs to automate this process:

Implementation

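Below is an example of how to set up a system that automates data entry from documents, using LLMs and a simple Python script. It assumes the python-dotenv, instructor, google-generativeai, pydantic, and pypdfium2 packages are installed and that a .env file next to the script defines GOOGLE_API_KEY: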

import os
from dotenv import load_dotenv
import instructor
import google.generativeai as genai
from pydantic import BaseModel, Field
import pypdfium2 as pdfium

load_dotenv()

# Configure API key
genai.configure(api_key=os.getenv("GOOGLE_API_KEY"))

# Wrap the Gemini Flash model with instructor so responses are parsed as JSON
client = instructor.from_gemini(
    client=genai.GenerativeModel(
        model_name="gemini-1.5-flash-latest",
    ),
    mode=instructor.Mode.GEMINI_JSON,
)

class EssentialInvoiceData(BaseModel):
    invoice_number: str = Field(description="Unique invoice ID")
    invoice_date: str = Field(description="Date in YYYY-MM-DD format")
    company_name: str = Field(description="Vendor/seller name")
    customer_name: str = Field(description="Buyer/client name")
    total_amount: float = Field(description="Final payable amount")
    due_date: str | None = Field(default=None, description="Payment deadline date")

def read_pdf_content(file_path):
    """Extract the plain text of every page in the PDF"""
    pdf = pdfium.PdfDocument(file_path)
    return " ".join(
        # With no arguments, get_text_bounded() covers the whole page
        page.get_textpage().get_text_bounded()
        for page in pdf
    )

def get_invoice_data(text):
    """Extract key fields from text"""
    return client.chat.completions.create(
        messages=[
            {
                "role": "system", 
                "content": "Extract invoice data and return as JSON:"
            },
            {
                "role": "user", 
                "content": text
            }
        ],
        response_model=EssentialInvoiceData,
    )

def process_invoice(file_path):
    """Full processing pipeline"""
    text_content = read_pdf_content(file_path)
    return get_invoice_data(text_content).model_dump()

def main():
    result = process_invoice("sample-invoice.pdf")
    print(result)

if __name__ == "__main__":
    main()

Output

{'invoice_number': '123100401', 'invoice_date': '2024-03-01', 'company_name': 'CPB Software (Germany) GmbH', 'customer_name': 'Musterkunde AG', 'total_amount': 381.12, 'due_date': None}
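
Because the model returns dates as plain strings, it can be worth checking a result like the one above before storing it. The following is a minimal sketch and not part of the original script: is_iso_date and check_invoice are hypothetical helpers, and the rules they apply (dates must be valid YYYY-MM-DD values, the total must be a non-negative number) are assumptions based on the field descriptions in EssentialInvoiceData.

```python
import re
from datetime import datetime

# Strict shape check: four digits, two digits, two digits
ISO_DATE = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")


def is_iso_date(value):
    """Return True if value is a valid calendar date written as YYYY-MM-DD."""
    if not isinstance(value, str) or not ISO_DATE.fullmatch(value):
        return False
    try:
        # Rejects impossible dates such as 2024-02-30
        datetime.strptime(value, "%Y-%m-%d")
    except ValueError:
        return False
    return True


def check_invoice(data):
    """Return a list of problems found in an extracted invoice dict (hypothetical helper)."""
    problems = []
    if not data.get("invoice_number"):
        problems.append("invoice_number is empty")
    if not is_iso_date(data.get("invoice_date")):
        problems.append("invoice_date is not a valid YYYY-MM-DD date")
    due_date = data.get("due_date")
    # due_date is optional, so None is acceptable
    if due_date is not None and not is_iso_date(due_date):
        problems.append("due_date is not a valid YYYY-MM-DD date")
    total = data.get("total_amount")
    if not isinstance(total, (int, float)) or total < 0:
        problems.append("total_amount is not a non-negative number")
    return problems
```

An empty list means the dict passed every check. Anything else is a hint that the invoice should be reviewed by a person before it reaches a spreadsheet or database.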

How It Works

Setup and Configuration: The first thing we do is load environment variables from a .env file using load_dotenv. This keeps sensitive information, like your Google API key, out of the source code instead of hardcoding it in the script. The genai.configure(api_key=...) line sets up the API with that key.
Using the LLM (Gemini Flash Model): The client object wraps Google's Gemini API with the instructor library. The "gemini-1.5-flash-latest" model reads the text, and instructor's GEMINI_JSON mode asks it to answer in JSON and parses that answer into the Pydantic model passed as response_model. This is what turns free-form document text into structured fields.
Reading PDF Files: The read_pdf_content function converts a PDF file into plain text. It calls get_text_bounded() on each page; called without arguments, the method returns all text within the page's bounding box. The page texts are then joined into a single string. Only text embedded in the PDF is read, so a scanned document without a text layer would come back empty.
Extracting Invoice Data: Once the text is extracted from the PDF, it is passed to the get_invoice_data function. Here, the LLM processes the text and extracts key invoice information like the invoice number, date, company name, customer name, total amount, and due date. This data is then returned in a structured format as defined by the EssentialInvoiceData class (a Pydantic model).
Processing the Invoice: The process_invoice function combines everything into a single workflow. It reads the PDF, extracts the text, and passes the text to the LLM for data extraction. Calling model_dump() on the result converts the Pydantic object into a plain Python dictionary containing the structured data from the invoice.
Final Output: The script is set to process a sample invoice ("sample-invoice.pdf"), extract relevant data, and print it to the console. It can easily be adapted to save the data in a database or export it to a different format (e.g., CSV, JSON) for further processing.
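
As one way to go from a single printed result to a spreadsheet, here is a minimal sketch that runs a processing function over every PDF in a folder and writes the results to a CSV file. It is an illustration rather than part of the original script: export_invoices, the source_file column, and the folder layout are assumptions, and the process argument is expected to be a function like process_invoice that takes a file path and returns a dict.

```python
import csv
from pathlib import Path

# Column names match the fields of EssentialInvoiceData
FIELDS = [
    "invoice_number", "invoice_date", "company_name",
    "customer_name", "total_amount", "due_date",
]


def export_invoices(pdf_folder, csv_path, process):
    """Run `process` on every PDF in pdf_folder and write one CSV row per file.

    Returns a list of (file name, error message) pairs for files that failed,
    so they can be reviewed by hand.
    """
    failed = []
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(
            handle,
            fieldnames=["source_file"] + FIELDS,
            extrasaction="ignore",  # skip any keys that are not CSV columns
        )
        writer.writeheader()
        for pdf_file in sorted(Path(pdf_folder).glob("*.pdf")):
            try:
                row = process(str(pdf_file))
            except Exception as error:
                # Keep going if one document fails; record it for later
                failed.append((pdf_file.name, str(error)))
                continue
            # None values (such as a missing due_date) become empty cells
            writer.writerow({"source_file": pdf_file.name, **row})
    return failed
```

With the script above, this could be called as export_invoices("invoices", "invoices.csv", process_invoice), where "invoices" is a hypothetical folder of PDF files. Passing the processing function as an argument also makes it possible to try the export with a stand-in function before spending API calls.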

Conclusion

LLMs are powerful tools that can simplify complex tasks, such as automating data entry from documents. With just a few lines of code, you can save time, reduce errors, and improve the efficiency of your workflows. Whether you're working with invoices, contracts, or any other document type, LLMs can help streamline the process and automate manual tasks.

Ready to get started? Try running the code on your own documents and see how LLMs can transform your data entry process.