devops/docker/paperless/plugins/redme.txt



-------------------------------------------------------｜
------------------- paperless: digital PDF document management ------------｜
-------------------------------------------------------｜

## Prefer creating the container from docker-compose.yml rather than with the command below; the backend database and Redis must be specified there!
docker run -itd \
  --name paperless \
  --network devops \
  --platform linux/x86_64 \
  -e TZ="Asia/Shanghai"  \
  -v /etc/localtime:/etc/localtime:ro \
  -v "$(pwd)/dockers/paperless/pdfs:/usr/src/paperless/data"  \
  -v "$(pwd)/dockers/paperless/db:/usr/src/paperless/db"  \
  -e USERMAP_UID=1000 -e USERMAP_GID=1000 \
  -p 8000:8000 \
  ghcr.io/paperless-ngx/paperless-ngx


# After the container is created, set the password manually (run either one of the two commands; currently set to admin / admin)
docker compose run --rm webserver createsuperuser
python3 manage.py createsuperuser

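# Put existing documents in the designated directory and wait for the system to load them automatically (or start the consumer manually)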
cd /path/to/paperless/src/
python3 manage.py document_consumer

# Automatically parse file names
https://docs.paperless-ngx.com/advanced_usage/#file-name-handling
https://docs.paperless-ngx.com/configuration/#PAPERLESS_POST_CONSUME_SCRIPT

environment:
  PAPERLESS_POST_CONSUME_SCRIPT: "/usr/src/paperless/scripts/parse_filename.py"
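
The notes do not show parse_filename.py itself, so the following is only a minimal sketch of what such a post-consume script might look like. The naming scheme ("YYYY-MM-DD - Correspondent - Title.pdf"), the ParsedName type, and the parse_filename function are all assumptions made for illustration. paperless-ngx passes document details to the script through environment variables; the names used here (DOCUMENT_ID, DOCUMENT_ORIGINAL_FILENAME) should be checked against the documentation for the installed version.

```python
#!/usr/bin/env python3
"""Hypothetical parse_filename.py: a post-consume hook sketch."""
import os
import re
from datetime import date
from typing import NamedTuple, Optional


class ParsedName(NamedTuple):
    created: date
    correspondent: str
    title: str


# Assumed naming scheme (not taken from the original notes):
#   "2024-03-15 - ACME Corp - Invoice 42.pdf"
FILENAME_RE = re.compile(
    r"^(?P<y>\d{4})-(?P<m>\d{2})-(?P<d>\d{2})"
    r"\s+-\s+(?P<correspondent>.+?)"
    r"\s+-\s+(?P<title>.+?)"
    r"\.pdf$",
    re.IGNORECASE,
)


def parse_filename(name: str) -> Optional[ParsedName]:
    """Return the parsed parts, or None if the name does not match."""
    match = FILENAME_RE.match(os.path.basename(name))
    if match is None:
        return None
    try:
        created = date(int(match["y"]), int(match["m"]), int(match["d"]))
    except ValueError:  # digits matched but are not a real date, e.g. month 13
        return None
    return ParsedName(created, match["correspondent"].strip(), match["title"].strip())


def main() -> None:
    # Assumption: these environment variables are set by paperless-ngx
    # when it invokes the post-consume script.
    original = os.environ.get("DOCUMENT_ORIGINAL_FILENAME", "")
    doc_id = os.environ.get("DOCUMENT_ID", "?")
    parsed = parse_filename(original)
    if parsed is None:
        print(f"document {doc_id}: file name {original!r} does not match the pattern")
    else:
        print(f"document {doc_id}: {parsed.created} / {parsed.correspondent} / {parsed.title}")


if __name__ == "__main__":
    main()
```

The sketch only parses and logs. Writing the parsed values back to the document (for example through the REST API) is left out, because the notes do not say how the real script does it.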


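For documents whose PDF text cannot be read directly, paperless starts an OCR scan, and in complex cases runs it twice, which is very slow and resource-intensive. The only fix found was to patch the source: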
/usr/src/paperless/src/paperless_tesseract/parsers.py :

        # force skip ocr process.
        if not original_has_text:
            original_has_text = True
            text_original = "this is default content, as we skipped ocr process..."
            self.log.warning("Cannot read text from Document, use default message.")

        if skip_archive_for_text and original_has_text:
            self.log.debug("Document has text, skipping OCRmyPDF entirely.")
            self.text = text_original
            return


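By default paperless does not delete duplicate files, so a file that is added again gets scanned, loaded, and rejected with an error over and over. No setting was found at first, so the source was patched directly. (Update: a setting does exist; see docker-compose.yml.)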

/usr/src/paperless/src/documents/consumer.py

    def pre_check_duplicate(self):
        """
        Using the MD5 of the file, check this exact file doesn't already exist
        """
        with open(self.input_doc.original_file, "rb") as f:
            checksum = hashlib.md5(f.read()).hexdigest()
        existing_doc = Document.global_objects.filter(
            Q(checksum=checksum) | Q(archive_checksum=checksum),
        )
        if existing_doc.exists():
            msg = ConsumerStatusShortMessage.DOCUMENT_ALREADY_EXISTS
            log_msg = f"Not consuming {self.filename}: It is a duplicate of {existing_doc.get().title} (#{existing_doc.get().pk})."

            if existing_doc.first().deleted_at is not None:
                msg = ConsumerStatusShortMessage.DOCUMENT_ALREADY_EXISTS_IN_TRASH
                log_msg += " Note: existing document is in the trash."

            ## Patched here: `or True` forces deletion of the duplicate file regardless of the setting.
            if settings.CONSUMER_DELETE_DUPLICATES or True:
                os.unlink(self.input_doc.original_file)
            self._fail(
                msg,
                log_msg,
            )
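
The checksum-and-delete idea in the patch can be tried outside paperless with a small standalone sketch. This is not paperless code: file_md5 and reject_if_duplicate are hypothetical helpers, and a plain Python set stands in for the Document table lookup. The file is hashed in chunks rather than with a single f.read(), which is a deliberate change from the quoted source to avoid loading a large PDF into memory at once.

```python
import hashlib
import os
from pathlib import Path
from typing import Set


def file_md5(path: Path, chunk_size: int = 1 << 20) -> str:
    """MD5 of a file, read in chunks so large PDFs are not loaded at once."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def reject_if_duplicate(path: Path, known_checksums: Set[str], delete: bool = True) -> bool:
    """Return True if `path` duplicates an already seen file.

    `known_checksums` is a stand-in (assumption) for the database query
    on checksum / archive_checksum in the real consumer.
    """
    checksum = file_md5(path)
    if checksum in known_checksums:
        if delete:
            # Remove the duplicate so it is not picked up and rejected again.
            os.unlink(path)
        return True
    known_checksums.add(checksum)
    return False
```

Without the delete step the duplicate stays in the watched directory and is rejected again on every pass, which is the loop described above. Since a setting for this exists, enabling it in docker-compose.yml is preferable to keeping a source patch alive across image upgrades.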