Add ISO-8601 dates from filenames in Paperless-ngx post-consume script

Paperless-ngx is a simply amazing document management tool. It has made managing the thousands of documents that I have in my collection an absolute breeze.

I have a lot of documents with ISO-8601 dates in the filename. When I ingest them into Paperless-ngx, I want those dates assigned as the “Date created” for each document. To get Paperless-ngx to do this, I needed to create a post-consume script.

After finding this link to a Python script that more or less did what I wanted, I tweaked it with Claude 4 Sonnet’s help and it works well. It looks at the original filename of the ingested document, parses the date from it if one exists, then makes that date the “Date created” for the document. Additionally, it tags the document with the year the document was created. It requires the PAPERLESS_API_TOKEN environment variable to be populated, and the path to the script set as the PAPERLESS_POST_CONSUME_SCRIPT environment variable.
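
For reference, here’s a minimal sketch of the environment setup. The token value and script path are placeholders; on a Docker install these would go in your compose file’s environment section instead:

export PAPERLESS_API_TOKEN=your_token_here
export PAPERLESS_POST_CONSUME_SCRIPT=/opt/paperless/scripts/post-consume.py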

#!/usr/bin/env python3
"""
Paperless-ngx Post-Process Script with Configurable Regex Patterns
Transforms document titles and dates based on original filename patterns
Uses environment variables for configuration

Inspired by https://github.com/paperless-ngx/paperless-ngx/discussions/7580
and tweaked with Claude 4 Sonnet
"""

import os
import json
import requests
import re
from datetime import date, datetime
from typing import Dict, Optional, Tuple


# Configuration: Define your regex patterns and transformations here
FILENAME_PATTERNS = [
    {
        "name": "Brokerage Statement Pattern",
        "pattern": r"^([^_]+)_(\d{4})-(\d{2})-(\d{2})_(\d+)\.(.+)$",
        "title_transform": lambda m: f"{m.group(1)}",  # Document type
        "date_transform": lambda m: f"{m.group(2)}-{m.group(3)}-{m.group(4)}",  # YYYY-MM-DD
        "date_format": "%Y-%m-%d"
    },
    {
        "name": "Standard Date-Title Pattern",
        "pattern": r"^(\d{4}-\d{2}-\d{2}) - (.+)\.(.+)$",
        "title_transform": lambda m: m.group(2),  # Title part
        "date_transform": lambda m: m.group(1),   # Date part
        "date_format": "%Y-%m-%d"
    },
    {
        "name": "Invoice Pattern",
        "pattern": r"^Invoice_(\d{4})(\d{2})(\d{2})_(.+)_(\d+)\.(.+)$",
        "title_transform": lambda m: f"Invoice - {m.group(4)}",
        "date_transform": lambda m: f"{m.group(1)}-{m.group(2)}-{m.group(3)}",
        "date_format": "%Y-%m-%d"
    },
    {
        "name": "Bank Statement Pattern",
        "pattern": r"^(.+)_Statement_(\d{1,2})-(\d{1,2})-(\d{4})\.(.+)$",
        "title_transform": lambda m: f"{m.group(1)} Statement",
        "date_transform": lambda m: f"{m.group(4)}-{m.group(2):0>2}-{m.group(3):0>2}",
        "date_format": "%Y-%m-%d"
    }
]


def get_config_from_env():
    """Get configuration from environment variables"""
    paperless_url = os.getenv("PAPERLESS_URL", "http://localhost:8000")
    api_token = os.getenv("PAPERLESS_API_TOKEN")
    timeout = float(os.getenv("PAPERLESS_TIMEOUT", "10.0"))

    return paperless_url, api_token, timeout


def _set_auth_headers(session: requests.Session, api_token: str):
    """Set authentication headers using API token"""
    session.headers.update({
        "Authorization": f"Token {api_token}",
        "Content-Type": "application/json"
    })


def match_filename_pattern(filename: str) -> Optional[Tuple[Dict, re.Match]]:
    """
    Try to match filename against configured patterns
    Returns (pattern_config, match_object) or None
    """
    for pattern_config in FILENAME_PATTERNS:
        match = re.match(pattern_config["pattern"], filename)
        if match:
            return pattern_config, match
    return None


def extract_title_and_date(filename: str) -> Tuple[Optional[str], Optional[str], Optional[str]]:
    """
    Extract title and date from filename using configured patterns
    Returns (new_title, date_string, pattern_name) or (None, None, None)
    """
    result = match_filename_pattern(filename)
    if not result:
        return None, None, None

    pattern_config, match = result

    try:
        # Extract title using the transform function
        new_title = pattern_config["title_transform"](match)

        # Extract date using the transform function
        date_string = pattern_config["date_transform"](match)

        return new_title, date_string, pattern_config["name"]

    except Exception as e:
        print(f"Error applying pattern '{pattern_config['name']}': {e}")
        return None, None, None


def parse_date(date_string: str, date_format: str) -> Optional[date]:
    """Parse date string using the specified format"""
    try:
        return datetime.strptime(date_string, date_format).date()
    except ValueError as e:
        print(f"Failed to parse date '{date_string}' with format '{date_format}': {e}")
        return None


def get_or_create_year_tag(year: str, paperless_url: str, timeout: float, session: requests.Session) -> Optional[int]:
    """
    Get existing year tag or create a new one
    Returns tag ID or None if failed
    """
    try:
        # First, try to find existing tag (note: this searches for tags containing the name)
        search_url = paperless_url + f"/api/tags/?name={year}"

        tags_resp = session.get(search_url, timeout=timeout)
        tags_resp.raise_for_status()
        tags_data = tags_resp.json()

        if tags_data["results"]:
            # Look for exact match since API returns partial matches
            for tag in tags_data["results"]:
                if tag['name'] == year:
                    print(f"Found existing year tag '{year}' with ID: {tag['id']}")
                    return tag['id']

        # Tag doesn't exist, create it
        print(f"No existing tag found, creating new tag '{year}'")
        create_resp = session.post(
            paperless_url + "/api/tags/",
            data=json.dumps({
                "name": year,
                "color": "#007acc",  # Blue color for year tags
                "is_inbox_tag": False
            }),
            timeout=timeout
        )
        create_resp.raise_for_status()
        tag_data = create_resp.json()
        tag_id = tag_data["id"]
        print(f"Created new year tag '{year}' with ID: {tag_id}")
        return tag_id

    except requests.exceptions.RequestException as e:
        print(f"Failed to get/create year tag '{year}': {e}")
        if hasattr(e, 'response') and e.response is not None:
            print(f"Response status: {e.response.status_code}")
            print(f"Response text: {e.response.text}")
        return None


def add_year_tag_to_document(doc_pk: int, year: str, paperless_url: str, timeout: float, session: requests.Session) -> bool:
    """
    Add year tag to document
    Returns True if successful, False otherwise
    """
    # Get or create the year tag
    tag_id = get_or_create_year_tag(year, paperless_url, timeout, session)
    if not tag_id:
        return False

    try:
        # Get current document tags
        doc_resp = session.get(
            paperless_url + f"/api/documents/{doc_pk}/",
            timeout=timeout
        )
        doc_resp.raise_for_status()
        doc_data = doc_resp.json()
        current_tags = doc_data.get("tags", [])

        # Check if year tag is already assigned
        if tag_id in current_tags:
            print(f"Document {doc_pk} already has year tag '{year}'")
            return True

        # Add year tag to existing tags
        updated_tags = current_tags + [tag_id]

        # Update document with new tags
        update_resp = session.patch(
            paperless_url + f"/api/documents/{doc_pk}/",
            data=json.dumps({"tags": updated_tags}),
            timeout=timeout
        )
        update_resp.raise_for_status()
        print(f"Document {doc_pk} - Added year tag '{year}'")
        return True

    except requests.exceptions.RequestException as e:
        print(f"Failed to add year tag '{year}' to document {doc_pk}: {e}")
        return False


def test_api_connection(paperless_url: str, timeout: float, session: requests.Session) -> bool:
    """Test API connection and authentication"""
    try:
        response = session.get(
            paperless_url + "/api/documents/?page_size=1",
            timeout=timeout
        )
        response.raise_for_status()
        print("API connection successful")
        return True
    except requests.exceptions.RequestException as e:
        print(f"API connection failed: {e}")
        return False


def update_document(doc_pk: int, paperless_url: str, timeout: float, session: requests.Session):
    """Main function to update document title and date"""

    # Get document info
    try:
        doc_info_resp = session.get(
            paperless_url + f"/api/documents/{doc_pk}/",
            timeout=timeout
        )
        doc_info_resp.raise_for_status()
        doc_info = doc_info_resp.json()
    except requests.exceptions.RequestException as e:
        print(f"Failed to fetch document {doc_pk}: {e}")
        return

    original_filename = doc_info["original_file_name"]
    current_title = doc_info["title"]

    print(f"Processing document {doc_pk}: {original_filename}")

    # Try to extract title and date from filename
    new_title, date_string, pattern_name = extract_title_and_date(original_filename)

    if not new_title and not date_string:
        print(f"Document {doc_pk} - No matching pattern found for: {original_filename}")
        return

    print(f"Document {doc_pk} - Matched pattern: {pattern_name}")

    # Prepare update data
    update_data = {}
    parsed_date = None

    # Update title if extracted
    if new_title and new_title != current_title:
        update_data["title"] = new_title
        print(f"Document {doc_pk} - Title will be updated to: {new_title}")

    # Update date if extracted and valid
    if date_string:
        # Find the pattern config to get date format
        pattern_result = match_filename_pattern(original_filename)
        if pattern_result:
            pattern_config, _ = pattern_result
            parsed_date = parse_date(date_string, pattern_config["date_format"])

            if parsed_date:
                update_data["created"] = parsed_date.isoformat()
                print(f"Document {doc_pk} - Date will be updated to: {parsed_date}")
            else:
                print(f"Document {doc_pk} - Invalid date format: {date_string}")

    # Apply updates if any
    if update_data:
        try:
            resp = session.patch(
                paperless_url + f"/api/documents/{doc_pk}/",
                data=json.dumps(update_data),
                timeout=timeout,
            )
            resp.raise_for_status()
            print(f"Document {doc_pk} - Successfully updated: {update_data}")

        except requests.exceptions.RequestException as e:
            print(f"Document {doc_pk} - Failed to update: {e}")
            return
    else:
        print(f"Document {doc_pk} - No updates needed")

    # Add year tag if we have a valid date
    if parsed_date:
        year = str(parsed_date.year)
        add_year_tag_to_document(doc_pk, year, paperless_url, timeout, session)


if __name__ == "__main__":
    # Get configuration from environment variables
    paperless_url, api_token, timeout = get_config_from_env()

    # Validate required environment variables
    if not api_token:
        print("Error: PAPERLESS_API_TOKEN environment variable is required")
        print("Set it with: export PAPERLESS_API_TOKEN=your_token_here")
        exit(1)

    print(f"Using Paperless URL: {paperless_url}")
    print(f"Using timeout: {timeout}s")

    try:
        with requests.Session() as sess:
            # Set authentication headers
            _set_auth_headers(sess, api_token)

            # Test API connection
            if not test_api_connection(paperless_url, timeout, sess):
                print("Exiting due to API connection failure")
                exit(1)

            # Get document ID from environment
            doc_pk = int(os.environ["DOCUMENT_ID"])
            update_document(doc_pk, paperless_url, timeout, sess)

    except KeyError:
        print("Error: DOCUMENT_ID environment variable not found")
        exit(1)
    except ValueError:
        print("Error: DOCUMENT_ID is not a valid integer")
        exit(1)
    except Exception as e:
        print(f"Unexpected error: {e}")
        exit(1)
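
To test the script against an existing document without re-consuming anything, you can invoke it by hand; the document ID, token, and script name below are just examples:

DOCUMENT_ID=123 PAPERLESS_API_TOKEN=your_token_here python3 post-consume.py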

Increment modified date of files in a directory based on first file

I had an issue in Immich where it was sorting pictures by their modified date. The modified dates were random, but the filenames were not. I wanted the album to sort by filename, and to do that I needed each file’s modified time to follow the same order as its name. This was my solution (run within the directory in question):

date=$(date -r "$(ls | head -1)" +%s); for file in *.jpg; do touch -m -d "@$date" "$file"; ((date+=1)); done

This bash one-liner does the following:

  • Sets a date variable by taking the modified date of the first file in the directory and converting it to epoch time
  • Goes through each JPG file in the directory and executes a touch command to set the date of that file to the date variable
  • Increments the date variable by 1 before processing the next file

The end result is that the files’ modified-date order now matches their filename order.
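
To sanity-check the result, something like this (GNU ls) prints the epoch timestamps so you can confirm they increase in filename order:

ls -l --time-style=+%s *.jpg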

Rename directory contents with prefix of directory

Quick snippet to rename every file within a directory so that the name of the directory it resides in becomes a prefix of the file name. If the directory name has spaces in it, they are replaced with underscores in the file name. Run from within the directory in question.

base=$(basename "$PWD"| tr ' ' '_'); for file in *; do [ -f "$file" ] && mv "$file" "${base}_$file"; done

It does the following:

  • Gets the name of the current directory, replacing spaces with underscores, and saves into the variable base
  • Iterates through everything in the directory in a for loop
  • If the item is a regular file, execute the mv command to rename the file to include the contents of the base variable as a prefix
    • It uses Bash parameter expansion ("${base}_$file") to prepend the directory name to the new file name
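
For example, in a hypothetical directory named Tax Documents 2020:

$ ls
scan01.jpg  scan02.jpg
$ base=$(basename "$PWD" | tr ' ' '_'); for file in *; do [ -f "$file" ] && mv "$file" "${base}_$file"; done
$ ls
Tax_Documents_2020_scan01.jpg  Tax_Documents_2020_scan02.jpg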

This was helpful when dealing with a scanning project where many files had the same filename in different directories, which confused stacking images within Immich.

Wireguard client full tunnel no internet

I tried to set up a wireguard full tunnel VPN on my Debian Bookworm server and ran into issues with internet connectivity. LAN / VPN connectivity was fine, just no internet.

My first realization was that when you make changes to your config, you need to wg-quick down and wg-quick up again (if you’re using wg-quick.) Simply editing the files and reloading the service doesn’t pick up the changes.

I followed this guide to get it up, and it simply wasn’t working:

https://wiki.debian.org/WireGuard#Step_2_-Alternative_A-_Manual_Configuration

I did eventually realize I needed to enable the following in /etc/sysctl.conf:

net.ipv4.ip_forward = 1

Then reload settings with:

# sysctl -p

In the wireguard server config I added these iptables commands:

PostUp = iptables -A FORWARD -i %i -j ACCEPT; iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE
PostDown = iptables -D FORWARD -i %i -j ACCEPT; iptables -t nat -D POSTROUTING -o eth0 -j MASQUERADE

I discovered that AllowedIPs on the server side is simply the IP address(es) of the wireguard clients, nothing more.

For a full tunnel, set the client’s AllowedIPs to 0.0.0.0/0.

I did all this and it still didn’t work. Then I stumbled upon https://www.ckn.io/blog/2017/11/14/wireguard-vpn-typical-setup, which mentioned enabling conntrack with iptables:

iptables -A INPUT -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
iptables -A FORWARD -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT

When I enabled conntrack, internet connectivity worked. I decided to reboot without making the above iptables commands persistent, and to my surprise it still worked after the reboot!

Lesson learned: try rebooting the host as a wireguard troubleshooting step, especially if all the configs look like they should be working but simply aren’t.
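
If you would rather not depend on rules left over from a previous session, one option is folding the conntrack rules into wg-quick’s PostUp/PostDown so they are added and removed with the interface. A sketch of what the combined server lines could look like (untested; my working configs below don’t do this):

PostUp = iptables -A FORWARD -i %i -j ACCEPT; iptables -A FORWARD -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT; iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE
PostDown = iptables -D FORWARD -i %i -j ACCEPT; iptables -D FORWARD -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT; iptables -t nat -D POSTROUTING -o eth0 -j MASQUERADE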

Here are my working configs:

Server:

[Interface]
Address = 10.10.1.1/24
SaveConfig = true
PostUp = iptables -A FORWARD -i %i -j ACCEPT; iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE
PostDown = iptables -D FORWARD -i %i -j ACCEPT; iptables -t nat -D POSTROUTING -o eth0 -j MASQUERADE
ListenPort = 51820
PrivateKey = <server_private_key>

[Peer]
PublicKey = <client_public_key>
AllowedIPs = 10.10.1.5/32

Client:

[Interface]
Address = 10.10.1.5/32
PrivateKey = <private key>
DNS = 10.10.1.1 10.10.1.2

[Peer]
PublicKey = <server public key>
Endpoint = mx.jeppson.org:54137
AllowedIPs = 0.0.0.0/0
PersistentKeepalive = 21

Get list of offline hosts with ping, grep & awk

Here is a simple bash one-liner that takes a list of hosts to check via stdin and attempts to ping each host a single time. If no response is received within 1 second, it prints that hostname and moves on to the next host. It’s designed to work with the output of another command that outputs hostnames (for example, an inventory file.)

|awk '{print $6}'| xargs -I {} sh -c 'ping -c 1 -w 1 {} | grep -B1 100% | head -1' | awk '{print $2}'

It does the following:

  • Prints the 6th column of the output (you may or may not need this depending on what program is outputting hostnames)
  • xargs takes the output from the previous command and runs the ping command against it in a subshell
    • ping -c1 to only do it once, -w1 to wait 1 second for timeout
    • grep for 100%, grab the line before it (100% in this case means packet loss)
    • head -1 only prints the first line of the ping results
  • Awk prints only the second column in the resulting ping statistics output
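
For example, assuming a plain file of hostnames (one per line, so the first awk column filter is dropped; hosts.txt is a hypothetical name):

cat hosts.txt | xargs -I {} sh -c 'ping -c 1 -w 1 {} | grep -B1 100% | head -1' | awk '{print $2}'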

It takes output like this:

PING examplehost (10.13.12.12) 56(84) bytes of data.

--- examplehost ping statistics ---
1 packets transmitted, 0 received, 100% packet loss, time 0ms

And simply outputs this:

examplehost

but only if the ping failed. No output otherwise.

I will note that Anthropic’s Claude Sonnet AI helped me come to this conclusion, but not directly. Its suggestions for my problem didn’t work, but they were enough to point me in the right direction. The grep -B1 100% | head -1 portion needs to be grouped together with the ping command in a separate shell, not appended afterward.

Generate list of youtube links from song titles

I needed to get a list of youtube links from a list of song titles. Thanks to this reddit post I was able to get what I needed. I did have to update it to use yt, which comes from yewtube, a fork of the referenced mps-youtube package.

After installing yewtube per https://github.com/mps-youtube/yewtube#installation I was able to get what I wanted with this one-liner:

while read song; do echo "$song"; yt search "$song", i 1, q | grep -i link | awk -F ': ' '{ print $2 }'; done < playlist

The above command reads a playlist file containing only artist and song names, prints each song name to the console for reference, then uses yewtube to search youtube for that song, select the first result, grab the link, and print it to the screen.
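
For reference, the playlist file is just plain text with one entry per line; these entries are made-up examples:

Artist Name - Some Song Title
Another Artist - Another Song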

I had to double check that the correct version of the song was selected, but for the most part it did exactly what I needed!

Report proper RATGDO door state in home assistant

I have an old dry contact garage door opener. I’ve wired it to a nifty ratgdo device to control it with Home Assistant. I ran into an issue with wiring reed switches to report the garage door state (open / closed.) The wires carrying the data would pick up RFI that made my remotes stop working. I never did find a way around this wired issue, so I went wireless and installed a Zigbee door sensor.

I struggled to get Home Assistant to report the state of the door as reported by the sensor. After much reading of documentation I finally got it working! Here is my configuration.yaml:

cover:
  - platform: template
    covers:
      east_garage_door:
        device_class: garage
        friendly_name: "East Garage Door"
        unique_id: "east_garage_door"
        value_template: "{{ is_state('binary_sensor.east_garage_door_opening', 'on') }}"
        open_cover:
          - action: button.press
            target:
              entity_id: button.east_garage_door_toggle_door
        close_cover:
          - action: button.press
            target:
              entity_id: button.east_garage_door_toggle_door
        stop_cover:
          action: button.press
          target:
            entity_id: button.east_garage_door_toggle_door
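
Before restarting Home Assistant it’s worth validating the change. On a Docker install (the container name homeassistant is an assumption; adjust to your setup) something like this should work:

docker exec homeassistant python -m homeassistant --script check_config -c /config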

Add NVIDIA GPU to LXC container

I followed this guide to get NVIDIA drivers working on my Proxmox machine. However when I tried to get them working in my container I couldn’t see how to get nvidia-smi installed. Thankfully this blog had what I needed.

The step I missed was copying & installing the NVIDIA drivers into the container with this flag:

--no-kernel-module
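
In other words, run the same .run driver installer inside the container that you used on the host, but skip the kernel module. The version in the filename is a placeholder for whatever the host runs:

./NVIDIA-Linux-x86_64-<version>.run --no-kernel-module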

That got me one step closer, but I could not spin up open-webui in a container. I kept getting the error:

Error response from daemon: could not select device driver "nvidia" with capabilities: [[gpu]]

The fix was to install the NVIDIA Container Toolkit:

Configure the production repository:

curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
  && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

Update the packages list from the repository:

sudo apt-get update

Install the NVIDIA Container Toolkit packages:

sudo apt-get install -y nvidia-container-toolkit

An additional hurdle I encountered was this error:

Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running prestart hook #0: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: mount error: failed to add device rules: unable to find any existing device filters attached to the cgroup: bpf_prog_query(BPF_CGROUP_DEVICE) failed: operation not permitted: unknown

I found here that the fix is to change a line in /etc/nvidia-container-runtime/config.toml: uncomment no-cgroups and set it to true.

no-cgroups = true
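
A one-liner for that edit, assuming the line ships commented out as #no-cgroups = false (check your file first):

sudo sed -i 's/^#no-cgroups = false/no-cgroups = true/' /etc/nvidia-container-runtime/config.toml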

Success.

Not working after reboot

I had a working config until I rebooted the host. It turns out that two things need to run on the host at boot:

nvidia-persistenced
nvidia-smi

I configured a cron job to run these on reboot:

/etc/cron.d/nvidia:
@reboot root /usr/bin/nvidia-smi
@reboot root /usr/bin/nvidia-persistenced

Update 2025-05-06

I encountered an error when trying to set up alltalk tts:

nvidia-container-cli: mount error: stat failed: /dev/nvidia-modeset: no such file or directory: unknown

It turns out I needed to expose /dev/nvidia-modeset to the container as well. Thanks to this reddit post for the answer. The complete container passthrough config is now this:

lxc.cgroup2.devices.allow: c 195:* rwm
lxc.cgroup2.devices.allow: c 243:* rwm
lxc.mount.entry: /dev/dri/renderD128 dev/dri/renderD128 none bind,optional,create=file
lxc.mount.entry: /dev/nvidia0 dev/nvidia0 none bind,optional,create=file
lxc.mount.entry: /dev/nvidiactl dev/nvidiactl none bind,optional,create=file
lxc.mount.entry: /dev/nvidia-uvm dev/nvidia-uvm none bind,optional,create=file
lxc.mount.entry: /dev/nvidia-uvm-tools dev/nvidia-uvm-tools none bind,optional,create=file
lxc.mount.entry: /dev/nvidia-modeset dev/nvidia-modeset none bind,optional,create=file
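
A quick sanity check from inside the container is to confirm the device nodes exist and the driver responds:

ls -l /dev/nvidia*
nvidia-smi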

Recursively find files with the same filename

I needed a way to find files that shared the same filename but were not identical files. Thankfully Reddit had the solution I was looking for: a combination of find, sort, and a while loop with if statements.

https://www.reddit.com/r/bash/comments/fjsr8v/recursively_find_files_with_same_name_under_a/

# %f is the basename and %p the full path; sort NUL-separated records by basename
find . -type f -printf '%f/%p\0' | { sort -z -t/ -k1; printf '\0';} |
while IFS=/ read -r -d '' name file; do
    if [[ "$name" = "$oldname" ]]; then
        repeated+=("$file")  # duplicate file
        continue
    fi
    if (( ${#repeated[@]} > 1)); then
        printf '%s\n' "$oldname" "${repeated[@]}" ''
        # do something with list "${repeated[@]}"
    fi
    repeated=("$file")
    oldname=$name
done
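
In my case I also wanted to know whether the same-named files actually differed, so a quick follow-up is checksumming each reported group. For example, swapping the printf inside the second if block for:

md5sum "${repeated[@]}"

Groups whose checksums differ have the same name but different contents.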

Load balancing behind Nginx Proxy Manager

Nginx Proxy Manager is a really convenient UI wrapped around nginx. It covers the most common use cases very well; if you have more advanced needs, it requires some custom configuration. In my case, I wanted to load balance my Proxmox servers. This is how you do that, per https://nginxproxymanager.com/advanced-config/#custom-nginx-configurations and https://www.reddit.com/r/selfhosted/comments/1fp5mxz/nginx_proxy_manager_fails_when_adding_load/

  1. Make the directories and create data/nginx/custom/root_top.conf
  2. Populate it with your upstream server list:

     upstream proxmox-backend {
         server first_server.fqdn:8006;
         server second_server.fqdn:8006;
         server nth_server.fqdn:8006;
     }

  3. Set up a proxy host for http://backend and edit the proxy host to have the following in the Advanced tab:

     location / {
         proxy_pass https://proxmox-backend;
         proxy_set_header Host $host;
         proxy_set_header X-Real-IP $remote_addr;
         proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
         proxy_set_header X-Forwarded-Proto $scheme;
     }

  4. Profit

In my case it is very simple load balancing with no persistent cookies/sessions (not needed for my application.) But all those options exist in nginx; it’s just a matter of adding them in the custom configuration for NPM.
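
For example, if you later need session stickiness, nginx’s ip_hash directive can go in the upstream block (untested in my setup):

upstream proxmox-backend {
    ip_hash;
    server first_server.fqdn:8006;
    server second_server.fqdn:8006;
}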