modify scripts

2025-11-05 17:25:41 +08:00
parent b7dffc539c
commit 808dbaa985
4 changed files with 980 additions and 9 deletions
--- a/docker/paperless/plugins/redme.txt
+++ b/docker/paperless/plugins/redme.txt
@@ -34,7 +34,23 @@ environment:
  PAPERLESS_POST_CONSUME_SCRIPT: "/usr/src/paperless/scripts/parse_filename.py"


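-By default, paperless does not delete duplicate files, so a file that is added repeatedly gets scanned, loaded, and rejected with an error over and over. I found no configuration option for this, so I patched the source directly: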
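+For documents whose PDF text cannot be read directly, paperless starts an OCR scan, and in complex cases runs it twice, which is very slow and resource-intensive. The only fix I found is to patch the source: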
+/usr/src/paperless/src/paperless_tesseract/parsers.py :
+
+        # force skip ocr process.
+        if not original_has_text:
+            original_has_text = True
+            text_original = "this is default content, as we skipped ocr process..."
+            self.log.warning("Cannot read text from Document, use default message.")
+
+        if skip_archive_for_text and original_has_text:
+            self.log.debug("Document has text, skipping OCRmyPDF entirely.")
+            self.text = text_original
+            return
+
+
+
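+By default, paperless does not delete duplicate files, so a file that is added repeatedly gets scanned, loaded, and rejected with an error over and over. I found no configuration option for this, so I patched the source directly: (A configuration option now exists; see docker-compose.yml.)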

 /usr/src/paperless/src/documents/consumer.py