-------------------------------------------------------| ------------------- paperless 无纸化pdf管理 ------------| -------------------------------------------------------| ## 最好不要用命令,使用docker-compose.yml来创建,需要制定后端使用的数据库,以及redis! docker run -itd \ --name paperless \ --network devops \ --platform linux/x86_64 \ -e TZ="Asia/Shanghai" \ -v /etc/localtime:/etc/localtime:ro \ -v "$(pwd)/dockers/paperless/pdfs:/usr/src/paperless/data" \ -v "$(pwd)/dockers/paperless/db:/usr/src/paperless/db" \ -e USERMAP_UID=1000 -e USERMAP_GID=1000 \ -p 8000:8000 \ ghcr.io/paperless-ngx/paperless-ngx # 容器创建好之后,要手动设置密码(二选一操作,目前设置的 admin / admin) docker compose run --rm webserver createsuperuser python3 manage.py createsuperuser # 已有文档,放在指定目录下,等系统自动加载(或者手工启动) cd /path/to/paperless/src/ python3 manage.py document_consumer # 自动解析文件名 https://docs.paperless-ngx.com/advanced_usage/#file-name-handling https://docs.paperless-ngx.com/configuration/#PAPERLESS_POST_CONSUME_SCRIPT environment: PAPERLESS_POST_CONSUME_SCRIPT: "/usr/src/paperless/scripts/parse_filename.py" 对于无法简单读取pdf内容的文档,paperless会启动OCR扫描,且复杂情况下会执行两遍,非常慢而且消耗资源。只能通过修改源码解决: /usr/src/paperless/src/paperless_tesseract/parsers.py : # force skip ocr process. if not original_has_text: original_has_text = True text_original = "this is default content, as we skipped ocr process..." self.log.warning("Cannot read text from Document, use default message.") if skip_archive_for_text and original_has_text: self.log.debug("Document has text, skipping OCRmyPDF entirely.") self.text = text_original return paperless 默认不会删除重复的文件,这会导致如果重复添加,会不停扫描,加载,报错。没找到配置,直接修改源码解决:(已经有配置,详见 docker-compose.yml) /usr/src/paperless/src/documents/consumer.py def pre_check_duplicate(self): """ Using the MD5 of the file, check this exact file doesn't already exist """ with open(self.input_doc.original_file, "rb") as f: checksum = hashlib.md5(f.read()).hexdigest() existing_doc = Document.global_objects.filter( Q(checksum=checksum) | Q(archive_checksum=checksum), ) if existing_doc.exists(): msg = ConsumerStatusShortMessage.DOCUMENT_ALREADY_EXISTS log_msg = f"Not consuming {self.filename}: It is a duplicate of {existing_doc.get().title} (#{existing_doc.get().pk})." if existing_doc.first().deleted_at is not None: msg = ConsumerStatusShortMessage.DOCUMENT_ALREADY_EXISTS_IN_TRASH log_msg += " Note: existing document is in the trash." ## 修改这里,让它删除重复文件。 if settings.CONSUMER_DELETE_DUPLICATES or True: os.unlink(self.input_doc.original_file) self._fail( msg, log_msg, )