Project Summary – OCR Data Integrity Research

1. Overview

  • Purpose: Automate error detection and correction for OCR-extracted receipts in Australian accounting.
  • Problem: Existing tools (ABBYY, Dext) lack multi-stage verification and integration with ABN rules.
  • Approach: Develop a hybrid system combining OCR, rule-based logic, ABN API checks, and ML correction.

2. Hypotheses & Technical Uncertainty

  • Can multi-stage correction reduce OCR error rates below 5%?
  • Can arithmetic consistency checks detect >90% of transaction mismatches?
  • Can ABN validation be automated at scale without latency issues?
  • Will hybrid ML + rules outperform either method alone?
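One way to approach the ABN latency question is to run the published ABN checksum locally, so that only numbers passing it need an API lookup. The sketch below shows that check (subtract 1 from the first digit, apply the weights 10, 1, 3, 5, 7, 9, 11, 13, 15, 17, 19, and require the weighted sum to be divisible by 89). The function name is_valid_abn and its use as a pre-filter are assumptions for illustration, not the project's confirmed design.

```python
# Weights defined by the ABN checksum algorithm.
ABN_WEIGHTS = (10, 1, 3, 5, 7, 9, 11, 13, 15, 17, 19)


def is_valid_abn(raw: str) -> bool:
    """Return True if `raw` is an 11-digit ABN that passes the checksum.

    Hypothetical helper: it only checks the number's internal consistency,
    not whether the ABN is registered or belongs to the named supplier.
    """
    cleaned = raw.replace(" ", "")
    # OCR output may contain letters (e.g. 'l' for '1'); reject those outright.
    if len(cleaned) != 11 or not (cleaned.isascii() and cleaned.isdigit()):
        return False
    digits = [int(ch) for ch in cleaned]
    digits[0] -= 1  # step 1 of the algorithm: subtract 1 from the first digit
    weighted_sum = sum(w * d for w, d in zip(ABN_WEIGHTS, digits))
    return weighted_sum % 89 == 0


if __name__ == "__main__":
    print(is_valid_abn("51 824 753 556"))  # passes the checksum
    print(is_valid_abn("51 824 753 557"))  # last digit altered, fails
```

A number that fails this check can be flagged without any network call. A number that passes still needs the API lookup to confirm registration and supplier identity.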

3. Experiments (0–5)

  • Experiment 0 – Baseline OCR Analysis
    Benchmark existing OCR outputs on Australian receipts.
  • Experiment 1 – Error Detection
    Identify common OCR misreads and their likely causes (blur, font, rotation).
  • Experiment 2 – ABN Validation
    Check extracted supplier numbers against official APIs.
  • Experiment 3 – Arithmetic Consistency
    Validate totals, GST splits, and line-item calculations.
  • Experiment 4 – Hybrid Corrections
    Apply rule-based + ML corrections to repair data.
  • Experiment 5 – Results Comparison
    Compare baseline vs corrected data for accuracy, precision, and recall.
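The arithmetic checks in Experiment 3 could look roughly like the sketch below. It assumes every line item is GST-inclusive and fully taxable at the standard 10% rate, so the GST component is one eleventh of the total. The function name check_receipt, its parameters, and the one-cent tolerance are hypothetical choices for illustration. Receipts with GST-free items would need a per-line tax flag.

```python
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")
GST_DIVISOR = Decimal("11")  # at 10% GST, GST = GST-inclusive total / 11


def check_receipt(line_items, gst, total, tolerance=CENT):
    """Return a list of mismatch descriptions; an empty list means consistent.

    Assumes all amounts are GST-inclusive strings such as "16.50".
    """
    issues = []
    items_sum = sum((Decimal(x) for x in line_items), Decimal("0"))
    gst = Decimal(gst)
    total = Decimal(total)

    # Check 1: line items should add up to the stated total.
    if abs(items_sum - total) > tolerance:
        issues.append(f"line items sum to {items_sum}, total reads {total}")

    # Check 2: stated GST should be one eleventh of the total, to the cent.
    expected_gst = (total / GST_DIVISOR).quantize(CENT, rounding=ROUND_HALF_UP)
    if abs(expected_gst - gst) > tolerance:
        issues.append(f"GST reads {gst}, expected about {expected_gst}")

    return issues


if __name__ == "__main__":
    # Consistent receipt: 5.50 + 16.50 = 22.00, GST = 22.00 / 11 = 2.00
    print(check_receipt(["5.50", "16.50"], gst="2.00", total="22.00"))
    # Same receipt with the total misread as 28.00
    print(check_receipt(["5.50", "16.50"], gst="2.00", total="28.00"))
```

Decimal is used instead of float so that cent-level comparisons are exact. When both checks fail together, the total is the likeliest misread field, which gives the correction stage in Experiment 4 a candidate to repair.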

4. Results

  • Initial testing: error rate fell from 18% to 4%.
  • ABN validation flagged 95% of supplier errors.
  • Arithmetic checks identified 98% of total mismatches.
  • Hybrid correction pipeline improved reliability across diverse receipt formats.
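Detection figures like those above are typically computed by comparing the set of receipts a check flagged against a hand-labelled set of receipts that truly contain an error. The sketch below shows that calculation for precision and recall. The function name detection_metrics and the receipt IDs are made up for illustration and are not the project's data.

```python
def detection_metrics(flagged, actual):
    """Return (precision, recall) for a detector.

    flagged: set of receipt IDs the check reported as erroneous
    actual:  set of receipt IDs that truly contain an error (ground truth)
    """
    true_pos = len(flagged & actual)   # correctly flagged
    false_pos = len(flagged - actual)  # flagged but actually fine
    false_neg = len(actual - flagged)  # errors the check missed
    precision = true_pos / (true_pos + false_pos) if flagged else 0.0
    recall = true_pos / (true_pos + false_neg) if actual else 0.0
    return precision, recall


if __name__ == "__main__":
    # Hypothetical IDs: 3 receipts flagged, 4 receipts truly erroneous.
    flagged = {"r1", "r2", "r3"}
    actual = {"r2", "r3", "r4", "r5"}
    precision, recall = detection_metrics(flagged, actual)
    print(f"precision={precision:.2f} recall={recall:.2f}")
```

Reporting both numbers matters because a check that flags every receipt reaches 100% recall while being useless in practice.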

5. Compliance & Future Work

  • Aligns with ATO and Australian accounting requirements.
  • Roadmap: Extend to international tax rules (NZ, UK, US).
  • Future research: AI-driven anomaly detection, predictive fraud prevention.