Project Summary – OCR Data Integrity Research
1. Overview
- Purpose: Automate error detection and correction for OCR-extracted receipts in Australian accounting.
- Problem: Existing tools (ABBYY, Dext) lack multi-stage verification and do not integrate Australian Business Number (ABN) validation rules.
- Approach: Develop a hybrid system combining OCR, rule-based logic, ABN API checks, and ML correction.
2. Hypotheses & Technical Uncertainty
- Can multi-stage correction reduce the OCR error rate to below 5%?
- Can arithmetic consistency checks detect >90% of transaction mismatches?
- Can ABN validation be automated at scale without latency issues?
- Will hybrid ML + rules outperform either method alone?
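To make the arithmetic consistency hypothesis concrete, the following is a minimal sketch of such a check, not the project's implementation. It assumes GST-inclusive prices, that every item is taxable at the 10% Australian GST rate (so GST is 1/11 of the total), and a 2-cent rounding allowance. The function name `check_receipt`, the tuple layout, and the tolerance are hypothetical.

```python
from decimal import Decimal

GST_DIVISOR = Decimal("11")   # GST-inclusive total / 11 = GST at the 10% rate
TOLERANCE = Decimal("0.02")   # assumed rounding allowance in dollars (hypothetical)

def check_receipt(line_items, total, gst):
    """Return a list of mismatch descriptions; an empty list means consistent.

    line_items: iterable of (quantity, unit_price, line_total) as strings,
    with GST-inclusive prices. Assumes every item is taxable.
    """
    issues = []
    total, gst = Decimal(total), Decimal(gst)
    running = Decimal("0")
    for idx, (qty, unit_price, line_total) in enumerate(line_items):
        expected = Decimal(qty) * Decimal(unit_price)
        actual = Decimal(line_total)
        # Each line total should equal quantity x unit price.
        if abs(expected - actual) > TOLERANCE:
            issues.append(f"line {idx}: {qty} x {unit_price} != {line_total}")
        running += actual
    # The line totals should add up to the printed receipt total.
    if abs(running - total) > TOLERANCE:
        issues.append(f"line items sum to {running}, total reads {total}")
    # The printed GST should be 1/11 of the GST-inclusive total.
    if abs(total / GST_DIVISOR - gst) > TOLERANCE:
        issues.append(f"GST {gst} is not 1/11 of total {total}")
    return issues

items = [("2", "4.50", "9.00"), ("1", "13.00", "13.00")]
clean = check_receipt(items, "22.00", "2.00")
misread = check_receipt(items, "27.00", "2.00")  # e.g. "22" read as "27"
```

Using `Decimal` rather than floats avoids binary rounding noise in currency comparisons. Receipts that mix taxable and GST-free items would need per-line tax flags, which this sketch does not model.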
3. Experiments (0–5)
- Experiment 0 – Baseline OCR Analysis
Benchmark existing OCR outputs on Australian receipts.
- Experiment 1 – Error Detection
Identify common OCR misreads and their causes (blur, font, rotation).
- Experiment 2 – ABN Validation
Check extracted supplier ABNs against official APIs.
- Experiment 3 – Arithmetic Consistency
Validate totals, GST splits, and line-item calculations.
- Experiment 4 – Hybrid Corrections
Apply rule-based + ML corrections to repair data.
- Experiment 5 – Results Comparison
Compare baseline vs corrected data for accuracy, precision, and recall.
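For Experiment 2, one way to limit API traffic is to run the published ABN checksum locally before any remote lookup. This pre-filter is an assumption for illustration and is not described in the experiments above. The checksum itself is the standard rule: subtract 1 from the first digit, apply the weights below, and require the sum to be divisible by 89. The function name `abn_checksum_ok` is hypothetical.

```python
ABN_WEIGHTS = (10, 1, 3, 5, 7, 9, 11, 13, 15, 17, 19)

def abn_checksum_ok(raw):
    """Return True if `raw` contains an 11-digit ABN with a valid checksum.

    Spaces and other separators are ignored. A passing checksum only means
    the number is well formed, not that the ABN is registered or active.
    """
    digits = [int(c) for c in raw if "0" <= c <= "9"]
    if len(digits) != 11:
        return False
    digits[0] -= 1  # the rule subtracts 1 from the leading digit
    weighted_sum = sum(w * d for w, d in zip(ABN_WEIGHTS, digits))
    return weighted_sum % 89 == 0

valid = abn_checksum_ok("51 824 753 556")
single_digit_misread = abn_checksum_ok("51 824 753 558")  # last digit 6 read as 8
```

A number that fails the checksum can be flagged as a likely OCR misread without a network call. Numbers that pass would still need the API check to confirm the supplier.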
4. Results
- Initial testing: error rate reduced from 18% to 4%.
- ABN validation flagged 95% of supplier errors.
- Arithmetic checks identified 98% of total mismatches.
- Hybrid correction pipeline improved reliability across diverse receipt formats.
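If the detection percentages above are read as recall (the share of known errors that a check flagged), they can be computed alongside precision as in the sketch below. It assumes each receipt has an identifier and that a hand-labelled set of truly erroneous receipts exists. The function name `detection_metrics` and the sample IDs are hypothetical and are not project data.

```python
def detection_metrics(flagged, actual_errors):
    """Return (precision, recall) for a set of flagged receipt IDs.

    flagged: IDs the check reported as erroneous.
    actual_errors: IDs labelled as erroneous in the ground truth.
    Both metrics are reported as 0.0 when their denominator is empty.
    """
    flagged, actual_errors = set(flagged), set(actual_errors)
    true_positives = len(flagged & actual_errors)
    precision = true_positives / len(flagged) if flagged else 0.0
    recall = true_positives / len(actual_errors) if actual_errors else 0.0
    return precision, recall

# Illustrative IDs only.
precision, recall = detection_metrics(
    flagged={"r1", "r2", "r3", "r5"},
    actual_errors={"r1", "r2", "r3", "r4"},
)
```

Reporting precision next to recall shows how many flags were false alarms, which a detection rate alone does not reveal.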
5. Compliance & Future Work
- Aligns with Australian Taxation Office (ATO) and Australian accounting requirements.
- Roadmap: Extend to international tax rules (NZ, UK, US).
- Future research: AI-driven anomaly detection, predictive fraud prevention.