Yaz Hobooti committed on
Commit
1baa0bd
·
1 Parent(s): 7de2fa8

Complete setup: Add app.py, update requirements.txt and README.md

Files changed (3)
  1. README.md +34 -5
  2. app.py +1382 -0
  3. requirements.txt +7 -1
README.md CHANGED
@@ -1,13 +1,42 @@
1
  ---
2
  title: ProofCheck
3
- emoji: 🔥
4
- colorFrom: indigo
5
  colorTo: purple
6
  sdk: gradio
7
- sdk_version: 5.47.2
8
  app_file: app.py
9
  pinned: false
10
- short_description: ProofCheck
11
  ---
12
 
13
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
1
  ---
2
  title: ProofCheck
3
+ emoji: 🔍
4
+ colorFrom: blue
5
  colorTo: purple
6
  sdk: gradio
7
+ sdk_version: "4.44.0"
8
  app_file: app.py
9
  pinned: false
10
+ license: apache-2.0
11
+ tags:
12
+ - document-processing
13
+ - pdf
14
+ - ocr
15
+ - comparator
16
+ task_categories:
17
+ - other
18
+ pretty_name: ProofCheck
19
  ---
20
 
21
+ # 🔍 Advanced PDF Comparison Tool
22
+
23
+ Upload two PDF files to get comprehensive analysis including:
24
+ - **Visual differences** with bounding boxes
25
+ - **OCR and spell checking**
26
+ - **Barcode/QR code detection**
27
+ - **CMYK color analysis**
28
+
29
+ ## Features
30
+ - High-DPI PDF rendering (600 DPI) for improved OCR and barcode recognition
31
+ - Rule-based text and layout comparison
32
+ - Export of comparison results
33
+
34
+ ## Usage
35
+ Run locally:
36
+
37
+ ```bash
38
+ python app.py
39
+ ```
40
+
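+ Poppler, ZBar, and the Tesseract engine are system-level dependencies of pdf2image, pyzbar, and pytesseract; a minimal setup sketch (poppler-utils and libzbar0 come from the notes inside app.py, tesseract-ocr is assumed for pytesseract, Debian/Ubuntu package names):
+
+ ```bash
+ pip install -r requirements.txt
+ sudo apt-get install poppler-utils libzbar0 tesseract-ocr
+ ```
+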
41
+ ## License
42
+ Apache-2.0
app.py ADDED
@@ -0,0 +1,1382 @@
1
+ #!/usr/bin/env python3
2
+ """
3
+ Gradio PDF Comparison Tool
4
+ Upload two PDF files and get comprehensive analysis including differences, OCR, barcodes, and CMYK analysis.
5
+ """
6
+
7
+ import os, sys, re, csv, json, io
8
+ from dataclasses import dataclass
9
+ from typing import List, Tuple, Optional, Iterable
10
+ import tempfile
11
+ import unicodedata
12
+
13
+ import numpy as np
14
+ from PIL import Image, ImageChops, ImageDraw, UnidentifiedImageError
15
+ from pdf2image import convert_from_path
16
+ from skimage.measure import label, regionprops
17
+ from skimage.morphology import dilation, rectangle
18
+ import gradio as gr
19
+
20
+ # Alternative PDF processing
21
+ try:
22
+ import fitz # PyMuPDF
23
+ HAS_PYMUPDF = True
24
+ except Exception:
25
+ fitz = None
26
+ HAS_PYMUPDF = False
27
+
28
+ # Optional features
29
+ try:
30
+ import pytesseract
31
+ HAS_OCR = True
32
+ except Exception:
33
+ pytesseract = None
34
+ HAS_OCR = False
35
+
36
+ try:
37
+ from spellchecker import SpellChecker
38
+ HAS_SPELLCHECK = True
39
+ except Exception:
40
+ SpellChecker = None
41
+ HAS_SPELLCHECK = False
42
+
43
+ try:
44
+ import regex as re
45
+ HAS_REGEX = True
46
+ except Exception:
47
+ import re
48
+ HAS_REGEX = False
49
+
50
+ try:
51
+ from pyzbar.pyzbar import decode as zbar_decode
52
+ HAS_BARCODE = True
53
+ except Exception:
54
+ zbar_decode = None
55
+ HAS_BARCODE = False
56
+
57
+ # -------------------- Core Data --------------------
58
+ @dataclass
59
+ class Box:
60
+ y1: int; x1: int; y2: int; x2: int; area: int
61
+
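+ # Example (sketch, not used by the app): fields are positional, so
+ # Box(10, 20, 50, 80, (80 - 20) * (50 - 10)) means y1=10, x1=20, y2=50, x2=80
+ # with area 2400.
+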
62
+ # ---- spell/tokenization helpers & caches ----
63
+ if HAS_REGEX:
64
+ # Improved regex: better word boundaries, handle apostrophes, hyphens, and spaces
65
+ _WORD_RE = re.compile(r"\b\p{Letter}+(?:['\-]\p{Letter}+)*\b", re.UNICODE)
66
+ else:
67
+ # Fallback regex for basic ASCII
68
+ _WORD_RE = re.compile(r"\b[A-Za-z]+(?:['\-][A-Za-z]+)*\b")
69
+
70
+ if HAS_SPELLCHECK:
71
+ # Initialize English spell checker with comprehensive dictionary
72
+ _SPELL_EN = SpellChecker(language="en")
73
+
74
+ # Try to initialize French spell checker with fallback
75
+ _SPELL_FR = None
76
+ try:
77
+ _SPELL_FR = SpellChecker(language="fr")
78
+ except Exception:
79
+ # If French dictionary fails, try alternative approach
80
+ try:
81
+ _SPELL_FR = SpellChecker()
82
+ # Load some basic French words manually if needed
83
+ except Exception:
84
+ _SPELL_FR = None
85
+ print("Warning: French spell checker not available")
86
+ else:
87
+ _SPELL_EN = None
88
+ _SPELL_FR = None
89
+
90
+ _DOMAIN_ALLOWLIST = {
91
+ # Company/Brand names
92
+ "Furry", "Fox", "Packaging", "Digitaljoint", "ProofCheck", "PDF",
93
+ "SKU", "SKUs", "ISO", "G7", "WebCenter", "Hybrid",
94
+
95
+ # Technical terms
96
+ "CMYK", "RGB", "DPI", "PPI", "TIFF", "JPEG", "PNG", "GIF", "BMP",
97
+ "Pantone", "Spot", "Process", "Offset", "Lithography", "Gravure",
98
+ "Flexography", "Digital", "Print", "Press", "Ink", "Paper", "Stock",
99
+
100
+ # Common abbreviations
101
+ "Inc", "Ltd", "LLC", "Corp", "Co", "Ave", "St", "Rd", "Blvd",
102
+ "USA", "US", "CA", "ON", "QC", "BC", "AB", "MB", "SK", "NS", "NB", "NL", "PE", "YT", "NT", "NU",
103
+
104
+ # French words (common in Canadian context)
105
+ "Québec", "Montréal", "Toronto", "Vancouver", "Ottawa", "Calgary",
106
+ "français", "française", "anglais", "anglaise", "bilingue",
107
+
108
+ # Common business terms
109
+ "Marketing", "Sales", "Customer", "Service", "Quality", "Control",
110
+ "Management", "Administration", "Production", "Manufacturing",
111
+ "Distribution", "Logistics", "Supply", "Chain", "Inventory",
112
+
113
+ # Common words that might be flagged
114
+ "Email", "Website", "Online", "Internet", "Software", "Hardware",
115
+ "Database", "System", "Network", "Server", "Client", "User",
116
+ "Password", "Login", "Logout", "Account", "Profile", "Settings",
117
+ "Configuration", "Installation", "Maintenance", "Support",
118
+
119
+ # Numbers and measurements
120
+ "mm", "cm", "m", "kg", "g", "ml", "l", "oz", "lb", "ft", "in",
121
+ "x", "by", "times", "multiply", "divide", "plus", "minus",
122
+
123
+ # British spellings that are correct in context
124
+ "colour", "favour", "honour",
125
+ "behaviour", "neighbour", "centre",
126
+ "theatre", "metre", "litre",
127
+
128
+ # Pharmaceutical terms
129
+ "glycerol", "sativa","tocophersolan", "tocopherol", "tocopheryl", "acetate",
130
+ "ascorbic", "ascorbate", "retinol", "retinyl", "palmitate",
131
+ "stearate", "oleate", "linoleate", "arachidonate", "docosahexaenoate",
132
+ "eicosapentaenoate", "alpha", "beta", "gamma", "delta", "omega",
133
+ "hydroxy", "methyl", "ethyl", "propyl", "butyl", "pentyl", "hexyl",
134
+ "phosphate", "sulfate", "nitrate", "chloride", "bromide", "iodide",
135
+ "sodium", "potassium", "calcium", "magnesium", "zinc", "iron",
136
+ "copper", "manganese", "selenium", "chromium", "molybdenum",
137
+ "thiamine", "riboflavin", "niacin", "pantothenic", "pyridoxine",
138
+ "biotin", "folate", "cobalamin", "cholecalciferol", "ergocalciferol",
139
+ "phylloquinone", "menaquinone", "ubiquinone", "coenzyme", "carnitine",
140
+ "creatine", "taurine", "glutamine", "arginine", "lysine", "leucine",
141
+ "isoleucine", "valine", "phenylalanine", "tryptophan", "methionine",
142
+ "cysteine", "tyrosine", "histidine", "proline", "serine", "threonine",
143
+ "asparagine", "glutamic", "aspartic", "alanine", "glycine",
144
+ "polysorbate", "monostearate", "distearate", "tristearate",
145
+ "polyethylene", "polypropylene", "polyvinyl", "carbomer", "carboxymethyl",
146
+ "cellulose", "hydroxypropyl", "methylcellulose", "ethylcellulose",
147
+ "microcrystalline", "lactose", "sucrose", "dextrose", "fructose",
148
+ "maltose", "galactose", "mannitol", "sorbitol", "xylitol", "erythritol",
149
+ "stearic", "palmitic", "oleic", "linoleic", "arachidonic", "docosahexaenoic",
150
+ "eicosapentaenoic", "arachidonic", "linolenic", "gamma", "linolenic",
151
+ "conjugated", "linoleic", "acid", "ester", "amide", "anhydride",
152
+ "hydrochloride", "hydrobromide", "hydroiodide", "nitrate", "sulfate",
153
+ "phosphate", "acetate", "citrate", "tartrate", "succinate", "fumarate",
154
+ "malate", "lactate", "gluconate", "ascorbate", "tocopheryl", "acetate",
155
+ "palmitate", "stearate", "oleate", "linoleate", "arachidonate"
156
+ }
157
+ _DOMAIN_ALLOWLIST_LOWER = {w.lower() for w in _DOMAIN_ALLOWLIST}
158
+
159
+ if _SPELL_EN:
160
+ _SPELL_EN.word_frequency.load_words(_DOMAIN_ALLOWLIST_LOWER)
161
+ if _SPELL_FR:
162
+ _SPELL_FR.word_frequency.load_words(_DOMAIN_ALLOWLIST_LOWER)
163
+
164
+ def _normalize_text(s: str) -> str:
165
+ """Normalize text for better word extraction"""
166
+ if not s:
167
+ return ""
168
+
169
+ # Unicode normalization
170
+ s = unicodedata.normalize("NFC", s)
171
+
172
+ # Fix common apostrophe issues
173
+ s = s.replace("\u2019", "'").replace("\u2018", "'")  # curly quotes -> straight apostrophe
174
+
175
+ # Normalize whitespace - replace multiple spaces with single space
176
+ s = re.sub(r'\s+', ' ', s)
177
+
178
+ # Remove leading/trailing whitespace
179
+ s = s.strip()
180
+
181
+ return s
182
+
183
+ def _extract_tokens(raw: str):
184
+ """Extract word tokens with improved filtering"""
185
+ s = _normalize_text(raw or "")
186
+ tokens = _WORD_RE.findall(s)
187
+
188
+ # Filter out tokens that are too short or don't look like words
189
+ filtered_tokens = []
190
+ for token in tokens:
191
+ if len(token) >= 2 and _is_likely_word(token):
192
+ filtered_tokens.append(token)
193
+
194
+ return filtered_tokens
195
+
196
+ def _looks_like_acronym(tok: str) -> bool:
197
+ """Check if token looks like a valid acronym"""
198
+ return tok.isupper() and 2 <= len(tok) <= 6
199
+
200
+ def _has_digits(tok: str) -> bool:
201
+ """Check if token contains digits"""
202
+ return any(ch.isdigit() for ch in tok)
203
+
204
+ def _is_mostly_numbers(tok: str) -> bool:
205
+ """Check if token is mostly numbers (should be ignored)"""
206
+ if not tok:
207
+ return False
208
+
209
+ # Count digits and letters
210
+ digit_count = sum(1 for ch in tok if ch.isdigit())
211
+ letter_count = sum(1 for ch in tok if ch.isalpha())
212
+ total_chars = len(tok)
213
+
214
+ # If more than 70% digits, consider it mostly numbers
215
+ if digit_count / total_chars > 0.7:
216
+ return True
217
+
218
+ # If it's a pure number (all digits), ignore it
219
+ if digit_count == total_chars:
220
+ return True
221
+
222
+ # If it's a number with common suffixes (like "1st", "2nd", "3rd", "4th")
223
+ if total_chars >= 2 and digit_count >= 1:
224
+ suffix = tok[-2:].lower()
225
+ if suffix in ['st', 'nd', 'rd', 'th']:
226
+ return True
227
+
228
+ # If it's a decimal number (contains digits and decimal point)
229
+ if '.' in tok and digit_count > 0:
230
+ return True
231
+
232
+ # If it's a percentage (ends with %)
233
+ if tok.endswith('%') and digit_count > 0:
234
+ return True
235
+
236
+ return False
237
+
238
+ def _is_likely_word(tok: str) -> bool:
239
+ """Check if token looks like a real word (not random characters)"""
240
+ if len(tok) < 2:
241
+ return False
242
+
243
+ # Filter out tokens that are mostly non-letter characters
244
+ letter_count = sum(1 for c in tok if c.isalpha())
245
+ if letter_count < len(tok) * 0.6: # At least 60% letters
246
+ return False
247
+
248
+ # Filter out tokens with too many consecutive consonants/vowels
249
+ vowels = set('aeiouAEIOU')
250
+ consonants = set('bcdfghjklmnpqrstvwxyzBCDFGHJKLMNPQRSTVWXYZ')
251
+
252
+ # Check for excessive consonant clusters (like "qwerty" or "zxcvb")
253
+ if len(tok) >= 4:
254
+ consonant_clusters = 0
255
+ vowel_clusters = 0
256
+ for i in range(len(tok) - 2):
257
+ if all(c in consonants for c in tok[i:i+3]):
258
+ consonant_clusters += 1
259
+ if all(c in vowels for c in tok[i:i+3]):
260
+ vowel_clusters += 1
261
+
262
+ # If consonant clusters make up a large share of the token, it is likely not a word
263
+ if consonant_clusters > len(tok) * 0.3:
264
+ return False
265
+
266
+ # Filter out tokens that look like random keyboard patterns
267
+ keyboard_patterns = [
268
+ 'qwerty', 'asdfgh', 'zxcvbn', 'qwertyuiop', 'asdfghjkl', 'zxcvbnm',
269
+ 'abcdef', 'bcdefg', 'cdefgh', 'defghi', 'efghij', 'fghijk',
270
+ '123456', '234567', '345678', '456789', '567890'
271
+ ]
272
+
273
+ tok_lower = tok.lower()
274
+ for pattern in keyboard_patterns:
275
+ if pattern in tok_lower or tok_lower in pattern:
276
+ return False
277
+
278
+ return True
279
+
280
+ def _is_known_word(tok: str) -> bool:
281
+ """Check if token is a known word with comprehensive filtering"""
282
+ t = tok.lower()
283
+
284
+ # First check if it looks like a real word
285
+ if not _is_likely_word(tok):
286
+ return True # Don't flag non-words as misspellings
287
+
288
+ # Ignore numbers and mostly numeric tokens
289
+ if _is_mostly_numbers(tok):
290
+ return True # Don't flag numbers as misspellings
291
+
292
+ # Check domain allowlist, acronyms, and words with digits
293
+ if t in _DOMAIN_ALLOWLIST_LOWER or _looks_like_acronym(tok) or _has_digits(tok):
294
+ return True
295
+
296
+ # Check hyphenated words - if any part is known, consider the whole word known
297
+ if '-' in tok:
298
+ parts = tok.split('-')
299
+ if all(_is_known_word(part) for part in parts):
300
+ return True
301
+
302
+ # Check against English spell checker
303
+ if _SPELL_EN:
304
+ try:
305
+ # Check if word is known in English dictionary
306
+ if not _SPELL_EN.unknown([t]):
307
+ return True
308
+ except Exception:
309
+ pass
310
+
311
+ # Check against French spell checker
312
+ if _SPELL_FR:
313
+ try:
314
+ # Check if word is known in French dictionary
315
+ if not _SPELL_FR.unknown([t]):
316
+ return True
317
+ except Exception:
318
+ pass
319
+
320
+ # Additional checks for common patterns
321
+ # Check for common suffixes/prefixes that might not be in dictionaries
322
+ common_suffixes = ['ing', 'ed', 'er', 'est', 'ly', 'tion', 'sion', 'ness', 'ment', 'able', 'ible']
323
+ common_prefixes = ['un', 're', 'pre', 'dis', 'mis', 'over', 'under', 'out', 'up', 'down']
324
+
325
+ # Check if word with common suffix/prefix is known
326
+ for suffix in common_suffixes:
327
+ if t.endswith(suffix) and len(t) > len(suffix) + 2:
328
+ base_word = t[:-len(suffix)]
329
+ if _SPELL_EN and not _SPELL_EN.unknown([base_word]):
330
+ return True
331
+
332
+ for prefix in common_prefixes:
333
+ if t.startswith(prefix) and len(t) > len(prefix) + 2:
334
+ base_word = t[len(prefix):]
335
+ if _SPELL_EN and not _SPELL_EN.unknown([base_word]):
336
+ return True
337
+
338
+ # Check for plural forms (simple 's' ending)
339
+ if t.endswith('s') and len(t) > 3:
340
+ singular = t[:-1]
341
+ if _SPELL_EN and not _SPELL_EN.unknown([singular]):
342
+ return True
343
+
344
+ return False
345
+
346
+ # (optional) keep a compatibility shim so any other code calling normalize_token() won't break
347
+ def normalize_token(token: str) -> str:
348
+ toks = _extract_tokens(token)
349
+ return (toks[0].lower() if toks else "")
350
+
351
+ # -------------------- Helpers ----------------------
352
+ def _is_pdf(path: str) -> bool:
353
+ return os.path.splitext(path.lower())[1] == ".pdf"
354
+
355
+ def _is_in_excluded_bottom_area(box: Box, image_height: int, excluded_height_mm: float = 115.0, dpi: int = 400) -> bool:
356
+ """
357
+ Check if a box is in the excluded bottom area (115mm from bottom).
358
+ Converts mm to pixels using DPI.
359
+ """
360
+ # Convert mm to pixels: 1 inch = 25.4mm, so 1mm = dpi/25.4 pixels
361
+ excluded_height_pixels = int(excluded_height_mm * dpi / 25.4)
362
+
363
+ # Calculate the top boundary of the excluded area
364
+ excluded_top = image_height - excluded_height_pixels
365
+
366
+ # Check if the box intersects with the excluded area
367
+ return box.y1 >= excluded_top
368
+
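+ # Worked example for the conversion above (a sketch, not called anywhere):
+ # at the default 400 DPI, 115 mm is int(115 * 400 / 25.4) = 1811 px, so on a
+ # 5000 px tall page any box whose top edge y1 >= 5000 - 1811 = 3189 is excluded.
+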
369
+ def _contains_validation_text(text: str) -> bool:
370
+ """Check if text contains the validation text '50 Carroll'"""
371
+ return "50 Carroll" in text
372
+
373
+ def load_pdf_pages(path: str, dpi: int = 600, max_pages: int = 15) -> List[Image.Image]:
374
+ """Load PDF pages as images with fallback options"""
375
+ if not _is_pdf(path):
376
+ return [Image.open(path).convert("RGB")]
377
+
378
+ # Try pdf2image first
379
+ poppler_paths = ["/usr/bin", "/usr/local/bin", "/bin", None]
380
+
381
+ for poppler_path in poppler_paths:
382
+ try:
383
+ if poppler_path:
384
+ imgs = convert_from_path(path, dpi=dpi, first_page=1, last_page=max_pages, poppler_path=poppler_path)
385
+ else:
386
+ imgs = convert_from_path(path, dpi=dpi, first_page=1, last_page=max_pages)
387
+
388
+ if imgs:
389
+ return [img.convert("RGB") for img in imgs]
390
+ except Exception:
391
+ if poppler_path is None: # All pdf2image attempts failed
392
+ break
393
+ continue # Try next path
394
+
395
+ # Fallback to PyMuPDF
396
+ if HAS_PYMUPDF:
397
+ try:
398
+ doc = fitz.open(path)
399
+ pages = []
400
+ for page_num in range(min(len(doc), max_pages)):
401
+ page = doc[page_num]
402
+ mat = fitz.Matrix(dpi/72, dpi/72)
403
+ pix = page.get_pixmap(matrix=mat)
404
+ img_data = pix.tobytes("ppm")
405
+ img = Image.open(io.BytesIO(img_data))
406
+ pages.append(img.convert("RGB"))
407
+ doc.close()
408
+ return pages
409
+ except Exception as e:
410
+ raise ValueError(f"Failed to convert PDF with both pdf2image and PyMuPDF. Error: {str(e)}")
411
+
412
+ raise ValueError("Failed to convert PDF to image. No working method available.")
413
+
414
+ def combine_pages_vertically(pages: List[Image.Image], spacing: int = 20) -> Image.Image:
415
+ """Combine multiple pages into a single vertical image"""
416
+ if not pages:
417
+ raise ValueError("No pages to combine")
418
+ if len(pages) == 1:
419
+ return pages[0]
420
+
421
+ # Find the maximum width
422
+ max_width = max(page.width for page in pages)
423
+
424
+ # Calculate total height
425
+ total_height = sum(page.height for page in pages) + spacing * (len(pages) - 1)
426
+
427
+ # Create combined image
428
+ combined = Image.new('RGB', (max_width, total_height), (255, 255, 255))
429
+
430
+ y_offset = 0
431
+ for page in pages:
432
+ # Center the page horizontally if it's narrower than max_width
433
+ x_offset = (max_width - page.width) // 2
434
+ combined.paste(page, (x_offset, y_offset))
435
+ y_offset += page.height + spacing
436
+
437
+ return combined
438
+
439
+ def match_sizes(a: Image.Image, b: Image.Image) -> Tuple[Image.Image, Image.Image]:
440
+ if a.size == b.size:
441
+ return a, b
442
+ w, h = min(a.width, b.width), min(a.height, b.height)
443
+ return a.crop((0, 0, w, h)), b.crop((0, 0, w, h))
444
+
445
+ def difference_map(a: Image.Image, b: Image.Image) -> Image.Image:
446
+ return ImageChops.difference(a, b)
447
+
448
+ def find_diff_boxes(diff_img: Image.Image, threshold: int = 12, min_area: int = 25) -> List[Box]:
449
+ arr = np.asarray(diff_img).astype(np.uint16)
450
+ gray = arr.max(axis=2).astype(np.uint8)
451
+ mask = (gray >= threshold).astype(np.uint8)
452
+ mask = dilation(mask, rectangle(3, 3))
453
+ labeled = label(mask, connectivity=2)
454
+ out: List[Box] = []
455
+ img_height = diff_img.height
456
+
457
+ for p in regionprops(labeled):
458
+ if p.area < min_area:
459
+ continue
460
+ minr, minc, maxr, maxc = p.bbox
461
+ box = Box(minr, minc, maxr, maxc, int(p.area))
462
+
463
+ # Skip boxes in the excluded bottom area
464
+ if _is_in_excluded_bottom_area(box, img_height):
465
+ continue
466
+
467
+ out.append(box)
468
+ return out
469
+
470
+ def draw_boxes_multi(img: Image.Image, red_boxes: List[Box], cyan_boxes: List[Box], green_boxes: List[Box] = None,
471
+ width: int = 3) -> Image.Image:
472
+ out = img.copy(); d = ImageDraw.Draw(out)
473
+ # red (diff)
474
+ for b in red_boxes:
475
+ for w in range(width):
476
+ d.rectangle([b.x1-w,b.y1-w,b.x2+w,b.y2+w], outline=(255,0,0))
477
+ # cyan (misspellings)
478
+ for b in cyan_boxes:
479
+ for w in range(width):
480
+ d.rectangle([b.x1-w,b.y1-w,b.x2+w,b.y2+w], outline=(0,255,255))
481
+ # green (barcodes)
482
+ if green_boxes:
483
+ for b in green_boxes:
484
+ for w in range(width):
485
+ d.rectangle([b.x1-w,b.y1-w,b.x2+w,b.y2+w], outline=(0,255,0))
486
+ return out
487
+
488
+ def make_red_overlay(a: Image.Image, b: Image.Image) -> Image.Image:
489
+ A = np.asarray(a).copy(); B = np.asarray(b)
490
+ mask = np.any(A != B, axis=2)
491
+ A[mask] = [255, 0, 0]
492
+ return Image.fromarray(A)
493
+
494
+ # -------------------- OCR + Spellcheck -------------
495
+ # Reuses the guarded imports (pytesseract, SpellChecker, regex) and the
+ # tokenization/spell-check helpers (_WORD_RE, _normalize_text, _extract_tokens,
+ # _is_known_word, normalize_token) defined above.
+
544
+ def _get_available_tesseract_langs():
545
+ """Get available Tesseract languages"""
546
+ try:
547
+ langs = pytesseract.get_languages()
548
+ if 'eng' in langs and 'fra' in langs:
549
+ return "eng+fra"
550
+ elif 'eng' in langs:
551
+ return "eng"
552
+ elif langs:
553
+ return langs[0]
554
+ else:
555
+ return "eng"
556
+ except Exception:
557
+ return "eng"
558
+
559
+ def prepare_for_ocr(img: Image.Image) -> Image.Image:
560
+ """Prepare image for better OCR results"""
561
+ from PIL import ImageOps, ImageFilter
562
+ g = img.convert("L")
563
+ g = ImageOps.autocontrast(g)
564
+ g = g.filter(ImageFilter.UnsharpMask(radius=1.0, percent=150, threshold=2))
565
+ return g
566
+
567
+ def extract_pdf_text(path: str, max_pages: int = 5) -> List[str]:
568
+ """Extract text directly from PDF using PyMuPDF"""
569
+ if not HAS_PYMUPDF:
570
+ return []
571
+
572
+ try:
573
+ doc = fitz.open(path)
574
+ texts = []
575
+ for page_num in range(min(len(doc), max_pages)):
576
+ page = doc[page_num]
577
+ text = page.get_text()
578
+ texts.append(text)
579
+ doc.close()
580
+ return texts
581
+ except Exception:
582
+ return []
583
+
584
+ def convert_pdf_to_image_coords(pdf_bbox, pdf_page_size, image_size, page_num=0, page_height=1000):
585
+ """Convert PDF coordinates to image coordinates"""
586
+ pdf_width, pdf_height = pdf_page_size
587
+ img_width, img_height = image_size
588
+
589
+ # Scale factors
590
+ scale_x = img_width / pdf_width
591
+ scale_y = img_height / pdf_height
592
+
593
+ # Convert PDF coordinates to image coordinates
594
+ x1 = int(pdf_bbox[0] * scale_x)
595
+ y1 = int(pdf_bbox[1] * scale_y) + (page_num * page_height)
596
+ x2 = int(pdf_bbox[2] * scale_x)
597
+ y2 = int(pdf_bbox[3] * scale_y) + (page_num * page_height)
598
+
599
+ return x1, y1, x2, y2
600
+
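+ # Example (sketch, assumed page size): a 612x792 pt page rendered to a
+ # 2550x3300 px image scales by 2550/612 = 3300/792 ~ 4.167, so a PDF bbox of
+ # (72, 72, 144, 90) on page 0 maps to roughly (300, 300, 600, 375) in pixels.
+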
601
+ def find_misspell_boxes_from_text(
602
+ pdf_path: str,
603
+ *,
604
+ extra_allow: Optional[Iterable[str]] = None,
605
+ max_pages: int = 5,
606
+ image_size: Optional[Tuple[int, int]] = None
607
+ ) -> List[Box]:
608
+ """Find misspellings by analyzing extracted PDF text directly with coordinate mapping"""
609
+ if not (HAS_SPELLCHECK and HAS_PYMUPDF):
610
+ return []
611
+
612
+ # Load extra allowed words
613
+ if extra_allow and _SPELL_EN:
614
+ _SPELL_EN.word_frequency.load_words(w.lower() for w in extra_allow)
615
+ if extra_allow and _SPELL_FR:
616
+ _SPELL_FR.word_frequency.load_words(w.lower() for w in extra_allow)
617
+
618
+ boxes: List[Box] = []
619
+
620
+ try:
621
+ doc = fitz.open(pdf_path)
622
+
623
+ for page_num in range(min(len(doc), max_pages)):
624
+ page = doc[page_num]
625
+
626
+ # Get text with position information
627
+ text_dict = page.get_text("dict")
628
+
629
+ # Process each block of text
630
+ for block in text_dict.get("blocks", []):
631
+ if "lines" not in block:
632
+ continue
633
+
634
+ for line in block["lines"]:
635
+ for span in line["spans"]:
636
+ text = span.get("text", "").strip()
637
+ if not text:
638
+ continue
639
+
640
+ # Extract tokens and check for misspellings
641
+ tokens = _extract_tokens(text)
642
+ has_misspelling = False
643
+
644
+ for token in tokens:
645
+ if len(token) >= 2 and not _is_known_word(token):
646
+ has_misspelling = True
647
+ break
648
+
649
+ # If this span has misspellings, create a box for it
650
+ if has_misspelling:
651
+ bbox = span["bbox"] # [x0, y0, x1, y1]
652
+
653
+ # Get page dimensions for coordinate conversion
654
+ page_rect = page.rect
655
+ pdf_width = page_rect.width
656
+ pdf_height = page_rect.height
657
+
658
+ # Calculate coordinates
659
+ if image_size:
660
+ img_width, img_height = image_size
661
+ # Convert PDF coordinates to image coordinates
662
+ scale_x = img_width / pdf_width
663
+ scale_y = img_height / pdf_height
664
+ x1 = int(bbox[0] * scale_x)
665
+ y1 = int(bbox[1] * scale_y) + (page_num * img_height)
666
+ x2 = int(bbox[2] * scale_x)
667
+ y2 = int(bbox[3] * scale_y) + (page_num * img_height)
668
+ else:
669
+ x1 = int(bbox[0])
670
+ y1 = int(bbox[1]) + (page_num * 1000)
671
+ x2 = int(bbox[2])
672
+ y2 = int(bbox[3]) + (page_num * 1000)
673
+
674
+ # Create box
675
+ box = Box(y1=y1, x1=x1, y2=y2, x2=x2, area=(x2 - x1) * (y2 - y1))
676
+
677
+ # Skip boxes in excluded bottom area unless they contain validation text
678
+ if image_size:
679
+ img_height = image_size[1]
680
+ if _is_in_excluded_bottom_area(box, img_height) and not _contains_validation_text(text):
681
+ continue
682
+ else:
683
+ # Without a rendered image size the 115 mm bottom-exclusion zone cannot be
+ # mapped to pixel coordinates, so keep the box.
+ pass
685
+
686
+ boxes.append(box)
687
+
688
+ doc.close()
689
+
690
+ except Exception:
691
+ # Fallback to simple text extraction if coordinate mapping fails
692
+ page_texts = extract_pdf_text(pdf_path, max_pages)
693
+ for page_num, text in enumerate(page_texts):
694
+ if not text.strip():
695
+ continue
696
+
697
+ tokens = _extract_tokens(text)
698
+ misspelled_words = [token for token in tokens if len(token) >= 2 and not _is_known_word(token)]
699
+
700
+ if misspelled_words:
701
+ # Create a placeholder box for the page
702
+ boxes.append(Box(
703
+ y1=page_num * 1000,
704
+ x1=0,
705
+ y2=(page_num + 1) * 1000,
706
+ x2=800,
707
+ area=800 * 1000
708
+ ))
709
+
710
+ return boxes
711
+
712
+ def find_misspell_boxes(
713
+ img: Image.Image,
714
+ *,
715
+ min_conf: int = 60,
716
+ lang: Optional[str] = None,
717
+ extra_allow: Optional[Iterable[str]] = None,
718
+ dpi: int = 300,
719
+ psm: int = 6,
720
+ oem: int = 3
721
+ ) -> List[Box]:
722
+ """Legacy OCR-based spell checking (kept for fallback)"""
723
+ if not (HAS_OCR and HAS_SPELLCHECK):
724
+ return []
725
+
726
+ # Auto-detect language if not provided
727
+ if lang is None:
728
+ try:
729
+ avail = set(pytesseract.get_languages(config="") or [])
730
+ except Exception:
731
+ avail = {"eng"}
732
+ lang = "eng+fra" if {"eng","fra"}.issubset(avail) else "eng"
733
+
734
+ # OPTIONAL: light upscale if the image is small (heuristic)
735
+ # target width ~ 2500-3000 px for letter-sized pages
736
+ if img.width < 1600:
737
+ scale = 2
738
+ img = img.resize((img.width*scale, img.height*scale), Image.LANCZOS)
739
+
740
+ # Prepare image for better OCR
741
+ img = prepare_for_ocr(img)
742
+
743
+ try:
744
+ if extra_allow and _SPELL_EN:
745
+ _SPELL_EN.word_frequency.load_words(w.lower() for w in extra_allow)
746
+ if extra_allow and _SPELL_FR:
747
+ _SPELL_FR.word_frequency.load_words(w.lower() for w in extra_allow)
748
+
749
+ # Build a config that sets an explicit DPI and keeps spaces
750
+ config = f"--psm {psm} --oem {oem} -c preserve_interword_spaces=1 -c user_defined_dpi={dpi}"
751
+
752
+ data = pytesseract.image_to_data(
753
+ img,
754
+ lang=lang,
755
+ config=config,
756
+ output_type=pytesseract.Output.DICT,
757
+ )
758
+ except Exception:
759
+ return []
760
+
761
+ n = len(data.get("text", [])) or 0
762
+ boxes: List[Box] = []
763
+
764
+ for i in range(n):
765
+ raw = data["text"][i]
766
+ if not raw:
767
+ continue
768
+
769
+ # confidence filter
770
+ conf_str = data.get("conf", ["-1"])[i]
771
+ try:
772
+ conf = int(float(conf_str))
773
+ except Exception:
774
+ conf = -1
775
+ if conf < min_conf:
776
+ continue
777
+
778
+ tokens = _extract_tokens(raw)
779
+ if not tokens:
780
+ continue
781
+
782
+ # flag the box if ANY token in it looks misspelled
783
+ if all(_is_known_word(tok) or len(tok) < 2 for tok in tokens):
784
+ continue
785
+
786
+ left = data.get("left", [0])[i]
787
+ top = data.get("top", [0])[i]
788
+ width = data.get("width", [0])[i]
789
+ height = data.get("height",[0])[i]
790
+ if width <= 0 or height <= 0:
791
+ continue
792
+
793
+ # NOTE: adjust to match your Box constructor if needed
794
+ b = Box(top, left, top + height, left + width, width * height)
795
+ # Exclude bottom 115mm unless the text contains the validation phrase
796
+ if _is_in_excluded_bottom_area(b, img.height) and not _contains_validation_text(raw):
797
+ continue
798
+ boxes.append(b)
799
+
800
+ return boxes
801
+
802
+
803
+
804
+
805
+
806
+
807
+
808
+
809
+ # deps: pip install zxing-cpp pyzbar pylibdmtx PyMuPDF pillow opencv-python-headless regex
810
+ # system: macOS -> brew install zbar poppler ; Ubuntu -> sudo apt-get install libzbar0 poppler-utils
811
+
812
+ import io, regex as re
813
+ from typing import List, Tuple, Dict, Any
814
+ from PIL import Image, ImageOps
815
+ import numpy as np
816
+
817
+ import fitz # PyMuPDF
818
+
819
+ # Optional backends
820
+ try:
821
+ import zxingcpp; HAS_ZXING=True
822
+ except Exception: HAS_ZXING=False
823
+ try:
824
+ from pyzbar.pyzbar import decode as zbar_decode, ZBarSymbol; HAS_ZBAR=True
825
+ except Exception: HAS_ZBAR=False; ZBarSymbol=None
826
+ try:
827
+ from pylibdmtx.pylibdmtx import decode as dmtx_decode; HAS_DMTX=True
828
+ except Exception: HAS_DMTX=False
829
+ try:
830
+ import cv2; HAS_CV2=True
831
+ except Exception: HAS_CV2=False
832
+
833
+ # your Box(y1,x1,y2,x2,area) assumed to exist
834
+
835
+ def _binarize(img: Image.Image) -> Image.Image:
836
+ g = ImageOps.grayscale(img)
837
+ g = ImageOps.autocontrast(g)
838
+ return g.point(lambda x: 255 if x > 140 else 0, mode="1").convert("L")
839
+
840
+ def _ean_checksum_ok(d: str) -> bool:
841
+ if not d.isdigit(): return False
842
+ n=len(d); nums=list(map(int,d))
843
+ if n==8:
844
+ return (10 - (sum(nums[i]*(3 if i%2==0 else 1) for i in range(7))%10))%10==nums[7]
845
+ if n==12:
846
+ return (10 - (sum(nums[i]*(3 if i%2==0 else 1) for i in range(11))%10))%10==nums[11]
847
+ if n==13:
848
+ return (10 - (sum(nums[i]*(1 if i%2==0 else 3) for i in range(12))%10))%10==nums[12]
849
+ return True
850
+
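+ # Example (sketch): _ean_checksum_ok("4006381333931") is True -- the weighted
+ # sum of the first 12 digits is 89, and (10 - 89 % 10) % 10 = 1 matches the
+ # final check digit.
+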
851
+ def _normalize_upc_ean(sym: str, text: str):
852
+ digits = re.sub(r"\D","",text or "")
853
+ s = (sym or "").upper()
854
+ if s in ("EAN13","EAN-13") and len(digits)==13 and digits.startswith("0"):
855
+ return "UPCA", digits[1:]
856
+ return s, (digits if s in ("EAN13","EAN-13","EAN8","EAN-8","UPCA","UPC-A") else text or "")
857
+
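+ # Example (sketch): _normalize_upc_ean("EAN13", "0012345678905") drops the
+ # leading zero and returns ("UPCA", "012345678905"); other symbologies pass
+ # their payload through unchanged.
+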
858
+ def _validate(sym: str, payload: str) -> bool:
859
+ s, norm = _normalize_upc_ean(sym, payload)
860
+ return _ean_checksum_ok(norm) if s in ("EAN13","EAN-13","EAN8","EAN-8","UPCA","UPC-A") else bool(payload)
861
+
862
+ def _decode_zxing(pil: Image.Image) -> List[Dict[str,Any]]:
863
+ if not HAS_ZXING: return []
864
+ arr = np.asarray(pil.convert("L"))
865
+ out=[]
866
+ for r in zxingcpp.read_barcodes(arr): # try_harder is default True in recent builds; otherwise supply options
867
+ # zxingcpp.Position may be iterable (sequence of points) or an object with corner attributes
868
+ x1=y1=x2=y2=w=h=0
869
+ pos = getattr(r, "position", None)
870
+ pts: List[Any] = []
871
+ if pos is not None:
872
+ try:
873
+ pts = list(pos) # works if iterable
874
+ except TypeError:
875
+ # Fall back to known corner attribute names across versions
876
+ corner_names = (
877
+ "top_left", "topLeft",
878
+ "top_right", "topRight",
879
+ "bottom_left", "bottomLeft",
880
+ "bottom_right", "bottomRight",
881
+ "point1", "point2", "point3", "point4",
882
+ )
883
+ seen=set()
884
+ for name in corner_names:
885
+ if hasattr(pos, name):
886
+ p = getattr(pos, name)
887
+ # avoid duplicates
888
+ if id(p) not in seen and hasattr(p, "x") and hasattr(p, "y"):
889
+ pts.append(p)
890
+ seen.add(id(p))
891
+ if pts:
892
+ xs=[int(getattr(p, "x", 0)) for p in pts]
893
+ ys=[int(getattr(p, "y", 0)) for p in pts]
894
+ x1,x2=min(xs),max(xs); y1,y2=min(ys),max(ys)
895
+ w,h=x2-x1,y2-y1
896
+ out.append({
897
+ "type": str(r.format),
898
+ "data": r.text or "",
899
+ "left": x1,
900
+ "top": y1,
901
+ "width": w,
902
+ "height": h,
903
+ })
904
+ return out
905
+
906
+ def _decode_zbar(pil: Image.Image) -> List[Dict[str,Any]]:
907
+ if not HAS_ZBAR: return []
908
+ syms=[ZBarSymbol.QRCODE,ZBarSymbol.EAN13,ZBarSymbol.EAN8,ZBarSymbol.UPCA,ZBarSymbol.CODE128] if ZBarSymbol else None
909
+ res=zbar_decode(pil, symbols=syms) if syms else zbar_decode(pil)
910
+ return [{"type": d.type, "data": (d.data.decode("utf-8","ignore") if isinstance(d.data,(bytes,bytearray)) else str(d.data)),
911
+ "left": d.rect.left, "top": d.rect.top, "width": d.rect.width, "height": d.rect.height} for d in res]
912
+
913
+ def _decode_dmtx(pil: Image.Image) -> List[Dict[str,Any]]:
914
+ if not HAS_DMTX: return []
915
+ try:
916
+ res=dmtx_decode(ImageOps.grayscale(pil))
917
+ return [{"type":"DATAMATRIX","data": r.data.decode("utf-8","ignore"),
918
+ "left": r.rect.left, "top": r.rect.top, "width": r.rect.width, "height": r.rect.height} for r in res]
919
+ except Exception:
920
+ return []
921
+
922
+ def _decode_cv2_qr(pil: Image.Image) -> List[Dict[str,Any]]:
923
+ if not HAS_CV2: return []
924
+ try:
925
+ det=cv2.QRCodeDetector()
926
+ g=np.asarray(pil.convert("L"))
927
+ val, pts, _ = det.detectAndDecode(g)
928
+ if val:
929
+ if pts is not None and len(pts)>=1:
930
+ pts=pts.reshape(-1,2); xs,ys=pts[:,0],pts[:,1]
931
+ x1,x2=int(xs.min()),int(xs.max()); y1,y2=int(ys.min()),int(ys.max())
932
+ w,h=x2-x1,y2-y1
933
+ else:
934
+ x1=y1=w=h=0
935
+ return [{"type":"QRCODE","data":val,"left":x1,"top":y1,"width":w,"height":h}]
936
+ except Exception:
937
+ pass
938
+ return []
939
+
940
+ def _decode_variants(pil: Image.Image) -> List[Dict[str,Any]]:
941
+ variants=[pil, ImageOps.grayscale(pil), _binarize(pil)]
942
+ # upsample small images with NEAREST to keep bars crisp
943
+ w,h=pil.size
944
+ if max(w,h)<1600:
945
+ up=pil.resize((w*2,h*2), resample=Image.NEAREST)
946
+ variants += [up, _binarize(up)]
947
+ for v in variants:
948
+ # ZXing first (broad coverage), then ZBar, then DMTX, then cv2 QR
949
+ res = _decode_zxing(v)
950
+ if res: return res
951
+ res = _decode_zbar(v)
952
+ if res: return res
953
+ res = _decode_dmtx(v)
954
+ if res: return res
955
+ res = _decode_cv2_qr(v)
956
+ if res: return res
957
+ # try rotations
958
+ for angle in (90,180,270):
959
+ r=v.rotate(angle, expand=True)
960
+ res = _decode_zxing(r) or _decode_zbar(r) or _decode_dmtx(r) or _decode_cv2_qr(r)
961
+ if res: return res
962
+ return []
963
+
964
+ def _pix_to_pil(pix) -> Image.Image:
965
+ # convert PyMuPDF Pixmap to grayscale PIL without alpha (avoids blur)
966
+ if pix.alpha: pix = fitz.Pixmap(pix, 0)
967
+ try:
968
+ pix = fitz.Pixmap(fitz.csGRAY, pix)
969
+ except Exception:
970
+ pass
971
+ return Image.open(io.BytesIO(pix.tobytes("png")))
972
+
973
+ def scan_pdf_barcodes(pdf_path: str, *, dpi_list=(900,1200), max_pages=10):
974
+ """Return (boxes, infos) from both rendered pages and embedded images."""
975
+ boxes=[]; infos=[]
976
+ doc=fitz.open(pdf_path)
977
+ n=min(len(doc), max_pages)
978
+ for page_idx in range(n):
979
+ page=doc[page_idx]
980
+
981
+ # A) Embedded images (often crisp)
982
+ for ix,(xref,*_) in enumerate(page.get_images(full=True)):
983
+ try:
984
+ pix=fitz.Pixmap(doc, xref)
985
+ pil=_pix_to_pil(pix)
986
+ hits=_decode_variants(pil)
987
+ for r in hits:
988
+ b = Box(r["top"], r["left"], r["top"]+r["height"], r["left"]+r["width"], r["width"]*r["height"])
989
+ # Exclude barcodes in the bottom 115mm of the page image
990
+ if _is_in_excluded_bottom_area(b, pil.height):
991
+ continue
992
+ boxes.append(b)
993
+ sym, payload = r["type"], r["data"]
994
+ infos.append({**r, "valid": _validate(sym, payload), "page": page_idx+1, "source": f"embed:{ix+1}"})
995
+ except Exception:
996
+ pass
997
+
998
+ # B) Render page raster at high DPI (grayscale)
999
+ for dpi in dpi_list:
1000
+ scale=dpi/72.0
1001
+ try:
1002
+ pix=page.get_pixmap(matrix=fitz.Matrix(scale,scale), colorspace=fitz.csGRAY, alpha=False)
1003
+ except TypeError:
1004
+ pix=page.get_pixmap(matrix=fitz.Matrix(scale,scale), alpha=False)
1005
+ pil=_pix_to_pil(pix)
1006
+ hits=_decode_variants(pil)
1007
+ for r in hits:
1008
+ b = Box(r["top"], r["left"], r["top"]+r["height"], r["left"]+r["width"], r["width"]*r["height"])
1009
+ if _is_in_excluded_bottom_area(b, pil.height):
1010
+ continue
1011
+ boxes.append(b)
1012
+ sym, payload = r["type"], r["data"]
1013
+ infos.append({**r, "valid": _validate(sym, payload), "page": page_idx+1, "source": f"page@{dpi}dpi"})
1014
+ if any(i["page"]==page_idx+1 for i in infos):
1015
+ break # found something for this page -> next page
1016
+ doc.close()
1017
+ return boxes, infos
1018
+
1019
+
1020
+
1021
+
1022
+ # -------------------- CMYK Panel -------------------
1023
+ def rgb_to_cmyk_array(img: Image.Image) -> np.ndarray:
1024
+ return np.asarray(img.convert('CMYK')).astype(np.float32) # 0..255
1025
+
1026
+ def avg_cmyk_in_box(cmyk_arr: np.ndarray, box: Box) -> Tuple[float,float,float,float]:
1027
+ y1,y2 = max(0, box.y1), min(cmyk_arr.shape[0], box.y2)
1028
+ x1,x2 = max(0, box.x1), min(cmyk_arr.shape[1], box.x2)
1029
+ if y2<=y1 or x2<=x1:
1030
+ return (0.0,0.0,0.0,0.0)
1031
+ region = cmyk_arr[y1:y2, x1:x2, :]
1032
+ mean_vals = region.reshape(-1, 4).mean(axis=0)
1033
+ return tuple(float(round(v * 100.0 / 255.0, 1)) for v in mean_vals)
1034
+
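+ # Example (sketch): average channel values of (128, 64, 0, 255) on the 0-255
+ # scale are reported as roughly (50.2, 25.1, 0.0, 100.0) percent coverage.
+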
1035
+ def compute_cmyk_diffs(a_img: Image.Image, b_img: Image.Image, red_boxes: List[Box]):
1036
+ a_cmyk = rgb_to_cmyk_array(a_img)
1037
+ b_cmyk = rgb_to_cmyk_array(b_img)
1038
+ entries = []
1039
+ for i, bx in enumerate(red_boxes):
1040
+ a_vals = avg_cmyk_in_box(a_cmyk, bx)
1041
+ b_vals = avg_cmyk_in_box(b_cmyk, bx)
1042
+ delta = tuple(round(b_vals[j] - a_vals[j], 1) for j in range(4))
1043
+ entries.append({'idx': i+1, 'A': a_vals, 'B': b_vals, 'Delta': delta})
1044
+ return entries
1045
+
1046
+ def draw_cmyk_panel(base: Image.Image, entries, title: str = 'CMYK breakdowns', panel_width: int = 260) -> Image.Image:
1047
+ w,h = base.size
1048
+ panel = Image.new('RGB', (panel_width, h), (245,245,245))
1049
+ out = Image.new('RGB', (w+panel_width, h), (255,255,255))
1050
+ out.paste(base, (0,0)); out.paste(panel, (w,0))
1051
+ d = ImageDraw.Draw(out)
1052
+ x0 = w + 8; y = 8
1053
+ d.text((x0, y), title, fill=(0,0,0)); y += 18
1054
+ if not entries:
1055
+ d.text((x0, y), 'No differing regions', fill=(80,80,80))
1056
+ return out
1057
+ for e in entries:
1058
+ idx = e['idx']; aC,aM,aY,aK = e['A']; bC,bM,bY,bK = e['B']; dC,dM,dY,dK = e['Delta']
1059
+ d.text((x0, y), f"#{idx}", fill=(0,0,0)); y += 14
1060
+ d.text((x0, y), f"A: C {aC}% M {aM}% Y {aY}% K {aK}%", fill=(0,0,0)); y += 14
1061
+ d.text((x0, y), f"B: C {bC}% M {bM}% Y {bY}% K {bK}%", fill=(0,0,0)); y += 14
1062
+ d.text((x0, y), f"Delta: C {dC}% M {dM}% Y {dY}% K {dK}%", fill=(120,0,0)); y += 18
1063
+ if y > h - 40: break
1064
+ return out
1065
+
1066
+ # -------------------- Gradio Interface -----------------
1067
+ def compare_pdfs(file_a, file_b):
1068
+ """Main comparison function for Gradio interface"""
1069
+ try:
1070
+ if file_a is None or file_b is None:
1071
+ return None, None, None, "❌ Please upload both PDF files to compare", [], []
1072
+
1073
+ # Load images with multiple pages support
1074
+ pages_a = load_pdf_pages(file_a.name, dpi=600, max_pages=15)
1075
+ pages_b = load_pdf_pages(file_b.name, dpi=600, max_pages=15)
1076
+
1077
+ # Combine pages into single images for comparison
1078
+ a = combine_pages_vertically(pages_a)
1079
+ b = combine_pages_vertically(pages_b)
1080
+
1081
+ # Match sizes
1082
+ a, b = match_sizes(a, b)
1083
+
1084
+ # Find differences with default settings
1085
+ diff = difference_map(a, b)
1086
+ red_boxes = find_diff_boxes(diff, threshold=12, min_area=25)
1087
+
1088
+ # Run all analysis features with defaults
1089
+ # Use text-based spell checking instead of OCR for better accuracy
1090
+ # Pass image dimensions for proper coordinate mapping
1091
+ image_size = (a.width, a.height)
1092
+ misspell_a = find_misspell_boxes_from_text(file_a.name, image_size=image_size) if HAS_SPELLCHECK and HAS_PYMUPDF else []
1093
+ misspell_b = find_misspell_boxes_from_text(file_b.name, image_size=image_size) if HAS_SPELLCHECK and HAS_PYMUPDF else []
1094
+
1095
+ # Debug: Print spell check results
1096
+ print(f"Spell check results - A: {len(misspell_a)} boxes, B: {len(misspell_b)} boxes")
1097
+
1098
+ if HAS_BARCODE:
1099
+ # Use PDF-based barcode detection instead of rasterized image
1100
+ bar_a, info_a = find_barcode_boxes_and_info_from_pdf(file_a.name, image_size=image_size) if HAS_PYMUPDF else ([], [])
1101
+ bar_b, info_b = find_barcode_boxes_and_info_from_pdf(file_b.name, image_size=image_size) if HAS_PYMUPDF else ([], [])
1102
+
1103
+ # Debug: Print barcode detection results
1104
+ print(f"Barcode detection results - A: {len(bar_a)} codes, B: {len(bar_b)} codes")
1105
+ else:
1106
+ bar_a, info_a = [], []
1107
+ bar_b, info_b = [], []
1108
+
1109
+ # Always enable CMYK analysis
1110
+ cmyk_entries = compute_cmyk_diffs(a, b, red_boxes)
1111
+
1112
+ # Create visualizations with default box width
1113
+ a_boxed_core = draw_boxes_multi(a, red_boxes, misspell_a, bar_a, width=3)
1114
+ b_boxed_core = draw_boxes_multi(b, red_boxes, misspell_b, bar_b, width=3)
1115
+
1116
+ # Always show CMYK panel
1117
+ a_disp = draw_cmyk_panel(a_boxed_core, cmyk_entries, title='CMYK Analysis (A vs B)')
1118
+ b_disp = draw_cmyk_panel(b_boxed_core, cmyk_entries, title='CMYK Analysis (A vs B)')
1119
+
1120
+ # Create pixel difference overlay
1121
+ overlay = make_red_overlay(a, b)
1122
+
1123
+ # Create status message
1124
+ status = f"""
1125
+ 📊 **Analysis Complete!**
1126
+ - **Pages processed:** A: {len(pages_a)}, B: {len(pages_b)}
1127
+ - **Difference regions found:** {len(red_boxes)}
1128
+ - **Misspellings detected:** A: {len(misspell_a)}, B: {len(misspell_b)}
1129
+ - **Barcodes found:** A: {len(bar_a)}, B: {len(bar_b)}
1130
+ - **Combined image dimensions:** {a.width} × {a.height} pixels
1131
+
1132
+ **Legend:**
1133
+ - 🔴 Red boxes: Visual differences
1134
+ - 🔵 Cyan boxes: Spelling errors
1135
+ - 🟢 Green boxes: Barcodes/QR codes
1136
+ """
1137
+
1138
+ # Prepare barcode data for tables
1139
+ codes_a = [[c.get('type',''), c.get('data',''), c.get('left',0), c.get('top',0),
1140
+ c.get('width',0), c.get('height',0), c.get('valid', False)] for c in info_a]
1141
+ codes_b = [[c.get('type',''), c.get('data',''), c.get('left',0), c.get('top',0),
1142
+ c.get('width',0), c.get('height',0), c.get('valid', False)] for c in info_b]
1143
+
1144
+ return overlay, a_disp, b_disp, status, codes_a, codes_b
1145
+
1146
+ except Exception as e:
1147
+ error_msg = f"❌ **Error:** {str(e)}"
1148
+ return None, None, None, error_msg, [], []
1149
+
1150
+ # -------------------- Gradio App -------------------
1151
+ def create_demo():
1152
+ # Create custom theme with light blue background
1153
+ # Create a simple, working theme with supported parameters only
1154
+ custom_theme = gr.themes.Soft(
1155
+ primary_hue="blue",
1156
+ neutral_hue="blue",
1157
+ font=gr.themes.GoogleFont("Inter"),
1158
+ ).set(
1159
+ body_background_fill="#99cfe9", # Light blue background
1160
+ body_background_fill_dark="#99cfe9",
1161
+ block_background_fill="#000000", # Black blocks for contrast
1162
+ block_background_fill_dark="#000000",
1163
+ border_color_primary="#333333", # Dark borders
1164
+ border_color_primary_dark="#333333",
1165
+ )
1166
+
1167
+ with gr.Blocks(title="PDF Comparison Tool", theme=custom_theme) as demo:
1168
+ gr.Markdown("""
1169
+ # πŸ” Advanced PDF Comparison Tool
1170
+
1171
+ Upload two PDF files to get comprehensive analysis including:
1172
+ - **Multi-page PDF support** (up to 15 pages per document)
1173
+ - **Visual differences** with bounding boxes
1174
+ - **OCR and spell checking**
1175
+ - **Barcode/QR code detection**
1176
+ - **CMYK color analysis**
1177
+ """)
1178
+
1179
+ with gr.Row():
1180
+ with gr.Column():
1181
+ file_a = gr.File(label="📄 PDF A (Reference)", file_types=[".pdf"])
1182
+ file_b = gr.File(label="📄 PDF B (Comparison)", file_types=[".pdf"])
1183
+
1184
+ compare_btn = gr.Button("🔍 Compare PDF Files", variant="primary", size="lg")
1185
+
1186
+ status_md = gr.Markdown("")
1187
+
1188
+ with gr.Row():
1189
+ overlay_img = gr.Image(label="🔴 Pixel Differences (Red = Different)", type="pil")
1190
+
1191
+ with gr.Row():
1192
+ img_a = gr.Image(label="📄 File A with Analysis", type="pil")
1193
+ img_b = gr.Image(label="📄 File B with Analysis", type="pil")
1194
+
1195
+ gr.Markdown("### 📊 Barcode Detection Results")
1196
+ with gr.Row():
1197
+ codes_a_df = gr.Dataframe(
1198
+ headers=["Type", "Data", "Left", "Top", "Width", "Height", "Valid"],
1199
+ label="Barcodes in File A",
1200
+ interactive=False
1201
+ )
1202
+ codes_b_df = gr.Dataframe(
1203
+ headers=["Type", "Data", "Left", "Top", "Width", "Height", "Valid"],
1204
+ label="Barcodes in File B",
1205
+ interactive=False
1206
+ )
1207
+
1208
+ # Event handlers
1209
+ compare_btn.click(
1210
+ fn=compare_pdfs,
1211
+ inputs=[file_a, file_b],
1212
+ outputs=[overlay_img, img_a, img_b, status_md, codes_a_df, codes_b_df]
1213
+ )
1214
+
1215
+ gr.Markdown("""
1216
+ ### πŸ“ Instructions:
1217
+ 1. Upload two PDF files
1218
+ 2. Click "Compare PDF Files"
1219
+ 3. View results with comprehensive analysis
1220
+
1221
+ ### 🎨 Color Legend:
1222
+ - **🔴 Red boxes:** Visual differences between files
1223
+ - **🔵 Cyan boxes:** Potential spelling errors (OCR)
1224
+ - **🟢 Green boxes:** Detected barcodes/QR codes
1225
+ - **📊 Side panel:** CMYK color analysis for print workflows
1226
+ """)
1227
+
1228
+ return demo
1229
+
1230
+ def _binarize(pil_img: Image.Image) -> Image.Image:
1231
+ """Create a binarized (black/white) version of the image for better barcode detection"""
1232
+ g = ImageOps.grayscale(pil_img)
1233
+ g = ImageOps.autocontrast(g)
1234
+ return g.point(lambda x: 255 if x > 140 else 0, mode='1').convert('L')
1235
+
1236
+ def _decode_once(img: Image.Image):
1237
+ """Single decode attempt with common barcode symbols"""
1238
+ if not HAS_BARCODE:
1239
+ return []
1240
+ syms = [ZBarSymbol.QRCODE, ZBarSymbol.EAN13, ZBarSymbol.EAN8, ZBarSymbol.UPCA, ZBarSymbol.CODE128]
1241
+ return zbar_decode(img, symbols=syms)
1242
+
1243
+ def debug_scan_pdf(pdf_path: str, outdir: str = "barcode_debug", max_pages=2):
1244
+ """
1245
+ Debug function to scan PDF at multiple DPIs and variants to diagnose barcode detection issues.
1246
+
1247
+ This function:
1248
+ - Renders pages at 600/900/1200 DPI
1249
+ - Tries grayscale, binarized, and rotated versions
1250
+ - Scans embedded images (XObjects)
1251
+ - Prints what it finds and writes debug PNGs
1252
+ - Helps identify if barcodes are too thin/low resolution
1253
+
1254
+ Usage:
1255
+ debug_scan_pdf("your.pdf", outdir="barcode_debug", max_pages=2)
1256
+ """
1257
+ if not (HAS_BARCODE and HAS_PYMUPDF):
1258
+ print("ERROR: Missing dependencies (pyzbar or PyMuPDF)")
1259
+ return
1260
+
1261
+ os.makedirs(outdir, exist_ok=True)
1262
+ doc = fitz.open(pdf_path)
1263
+
1264
+ for dpi in (600, 900, 1200):
1265
+ scale = dpi / 72.0
1266
+ mat = fitz.Matrix(scale, scale)
1267
+ print(f"\n=== DPI {dpi} ===")
1268
+
1269
+ for p in range(min(len(doc), max_pages)):
1270
+ page = doc[p]
1271
+ pix = page.get_pixmap(matrix=mat, alpha=False)
1272
+ img = Image.open(io.BytesIO(pix.tobytes("ppm")))
1273
+ img.save(f"{outdir}/page{p+1}_{dpi}.png")
1274
+
1275
+ # Try different image variants
1276
+ variants = [
1277
+ ("orig", img),
1278
+ ("gray", ImageOps.grayscale(img)),
1279
+ ("bin", _binarize(img)),
1280
+ ]
1281
+ found = []
1282
+
1283
+ for tag, v in variants:
1284
+ r = _decode_once(v)
1285
+ if r:
1286
+ found.extend((tag, rr.type, rr.data) for rr in r)
1287
+ else:
1288
+ # Try rotations
1289
+ for angle in (90, 180, 270):
1290
+ rr = _decode_once(v.rotate(angle, expand=True))
1291
+ if rr:
1292
+ found.extend((f"{tag}_rot{angle}", rri.type, rri.data) for rri in rr)
1293
+ break
1294
+
1295
+ print(f"Page {p+1}: {len(found)} hits at DPI {dpi} -> {found}")
1296
+
1297
+ # Scan embedded images too
1298
+ imgs = page.get_images(full=True)
1299
+ for ix, (xref, *_) in enumerate(imgs):
1300
+ try:
1301
+ ipix = fitz.Pixmap(doc, xref)
1302
+ if ipix.alpha:
1303
+ ipix = fitz.Pixmap(ipix, 0)
1304
+ pil = Image.open(io.BytesIO(ipix.tobytes("ppm")))
1305
+ pil.save(f"{outdir}/page{p+1}_embed{ix+1}.png")
1306
+ rr = _decode_once(pil) or _decode_once(_binarize(pil))
1307
+ if rr:
1308
+ print(f" Embedded image {ix+1}: {[(r.type, r.data) for r in rr]}")
1309
+ except Exception as e:
1310
+ print(" Embedded image error:", e)
1311
+
1312
+ doc.close()
1313
+ print(f"\nDebug images saved to: {outdir}/")
1314
+ print("Open the PNGs and zoom in to check bar width. If narrow bars are <2px at 600 DPI, you need 900-1200 DPI.")
1315
+
1316
+ def find_barcode_boxes_and_info_from_pdf(pdf_path: str, image_size: Optional[Tuple[int, int]] = None, max_pages: int = 10):
1317
+ """Detect barcodes from the original PDF and return boxes in the same
1318
+ coordinate space as the combined display image.
1319
+
1320
+ If image_size is provided (w,h of the vertically combined display image),
1321
+ each page is rendered so its width matches w, then decoded. Box y-coordinates
1322
+ are offset by the cumulative height of previous pages so that all boxes map
1323
+ into the combined image space correctly.
1324
+ """
1325
+ boxes: List[Box] = []
1326
+ infos: List[Dict[str, Any]] = []
1327
+ try:
1328
+ doc = fitz.open(pdf_path)
1329
+ num_pages = min(len(doc), max_pages)
1330
+ if num_pages == 0:
1331
+ return [], []
1332
+
1333
+ target_width = None
1334
+ if image_size:
1335
+ target_width = int(image_size[0])
1336
+
1337
+ y_offset = 0
1338
+ for page_idx in range(num_pages):
1339
+ page = doc[page_idx]
1340
+ # Compute scale so that rendered width matches target_width when provided
1341
+ if target_width:
1342
+ page_width_pts = float(page.rect.width) # points (72 dpi)
1343
+ scale = max(1.0, target_width / page_width_pts)
1344
+ else:
1345
+ # fallback dpi ~600
1346
+ scale = 600.0 / 72.0
1347
+ try:
1348
+ pix = page.get_pixmap(matrix=fitz.Matrix(scale, scale), colorspace=fitz.csGRAY, alpha=False)
1349
+ except TypeError:
1350
+ pix = page.get_pixmap(matrix=fitz.Matrix(scale, scale), alpha=False)
1351
+ pil = _pix_to_pil(pix)
1352
+ pw, ph = pil.size
1353
+ hits = _decode_variants(pil)
1354
+ for r in hits:
1355
+ x1 = int(r.get("left", 0))
1356
+ y1 = int(r.get("top", 0)) + y_offset
1357
+ w = int(r.get("width", 0))
1358
+ h = int(r.get("height", 0))
1359
+ x2 = x1 + w
1360
+ y2 = y1 + h
1361
+ b = Box(y1, x1, y2, x2, w * h)
1362
+ # Exclude bottom 115mm for combined image if we know full height; else per-page
1363
+ if image_size and _is_in_excluded_bottom_area(b, image_size[1]):
1364
+ continue
1365
+ if not image_size and _is_in_excluded_bottom_area(b, ph):
1366
+ continue
1367
+ boxes.append(b)
1368
+ sym, payload = r.get("type", ""), r.get("data", "")
1369
+ infos.append({**r, "valid": _validate(sym, payload), "page": page_idx + 1, "source": f"page@scale{scale:.2f}"})
1370
+ y_offset += ph
1371
+ doc.close()
1372
+ except Exception:
1373
+ return [], []
1374
+ return boxes, infos
1375
+
1376
+ if __name__ == "__main__":
1377
+ demo = create_demo()
1378
+ demo.launch(
1379
+ server_name="0.0.0.0", # Allow external access
1380
+ share=True, # Set to True to create a public link
1381
+ show_error=True
1382
+ )
requirements.txt CHANGED
@@ -4,4 +4,10 @@ pillow
4
  pdf2image
5
  gradio
6
  PyMuPDF>=1.24
7
- pytesseract
4
  pdf2image
5
  gradio
6
  PyMuPDF>=1.24
7
+ pytesseract
8
+ pyspellchecker
9
+ regex
10
+ pyzbar
11
+ zxing-cpp
12
+ pylibdmtx
13
+ scikit-image