From 78e3e234877cb10ca1088df31e831b36fa4a12c0 Mon Sep 17 00:00:00 2001
From: Yuqian Yang
Date: Fri, 23 Jan 2026 23:16:45 +0800
Subject: HALF WORK!

---
 www-2/content/posts/c-func-ext.md    | 101 +++++++++++++++++
 www-2/content/posts/nspawn.md        | 207 +++++++++++++++++++++++++++++++++++
 www-2/content/posts/use-paddleocr.md | 103 +++++++++++++++++
 3 files changed, 411 insertions(+)
 create mode 100644 www-2/content/posts/c-func-ext.md
 create mode 100644 www-2/content/posts/nspawn.md
 create mode 100644 www-2/content/posts/use-paddleocr.md

(limited to 'www-2/content/posts')

diff --git a/www-2/content/posts/c-func-ext.md b/www-2/content/posts/c-func-ext.md
new file mode 100644
index 0000000..1f5f822
--- /dev/null
+++ b/www-2/content/posts/c-func-ext.md
@@ -0,0 +1,101 @@
+---
+title: "Libc/POSIX Function \"Extensions\""
+date: 2025-03-04T13:40:33+08:00
+lastmod: 2025-03-04T13:40:33+08:00
+categories: coding
+tags:
+  - c
+  - posix
+---
+
+(I've given up on this, at least for Linux PAM.)
+
+Recently, I’ve been working on porting some libraries to GNU/Hurd. Many (old)
+libraries use [`*_MAX` constants on POSIX system
+interfaces](https://pubs.opengroup.org/onlinepubs/9699919799.2008edition/nframe.html)
+to calculate buffer sizes. However, the GNU/Hurd maintainers advise against
+using them blindly and refuse to define them in system headers. Once old APIs are
+gone, compatibility problems follow. To make my life easier, I'll put some
+reusable code snippets here to help *fix `*_MAX` bugs*.
+
+
+
+```c
+#include <errno.h>
+#include <stdarg.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+
+static inline char *xreadlink(const char *restrict path) {
+  char *buffer;
+  size_t allocated = 128;
+  ssize_t len;
+
+  while (1) {
+    buffer = (char*) malloc(allocated);
+    if (!buffer) { return NULL; }
+    len = readlink(path, buffer, allocated); /* readlink() does not NUL-terminate */
+    if (len >= 0 && (size_t) len < allocated) { buffer[len] = '\0'; return buffer; }
+    free(buffer);
+    if (len >= (ssize_t) allocated) { allocated *= 2; continue; } /* possibly truncated: retry */
+    return NULL;
+  }
+}
+
+
+static inline char *xgethostname(void) {
+  long max_host_name;
+  char *buffer;
+
+  max_host_name = sysconf(_SC_HOST_NAME_MAX);
+  if (max_host_name < 0) { max_host_name = 255; } /* assumption: no limit reported, fall back to POSIX's minimum of 255 */
+  buffer = malloc(max_host_name + 1);
+  if (!buffer || gethostname(buffer, max_host_name + 1)) {
+    free(buffer);
+    return NULL;
+  }
+
+  buffer[max_host_name] = '\0';
+  return buffer;
+}
+
+static inline char *xgetcwd(void) {
+  char *buffer;
+  size_t allocated = 128;
+
+  while (1) {
+    buffer = (char*) malloc(allocated);
+    if (!buffer) { return NULL; }
+    /* getcwd() returns NULL on failure, so test its return value, not buffer */
+    if (getcwd(buffer, allocated)) { return buffer; }
+    if (errno != ERANGE) { free(buffer); return NULL; }
+    free(buffer);
+    allocated *= 2; /* ERANGE: buffer too small, retry with twice the size */
+  }
+}
+
+static inline __attribute__((__format__(__printf__, 2, 3))) int
+xsprintf(char **buf_ptr, const char *restrict format, ...) {
+  char *buffer;
+  int ret;
+  va_list args;
+
+  /* First pass: measure the formatted length. */
+  va_start(args, format);
+  ret = vsnprintf(NULL, 0, format, args);
+  va_end(args);
+  if (ret < 0) { return ret; }
+
+  buffer = malloc((size_t) ret + 1);
+  if (!buffer) { return -1; }
+
+  /* Second pass: a consumed va_list cannot be reused, so restart it. */
+  va_start(args, format);
+  ret = vsnprintf(buffer, (size_t) ret + 1, format, args);
+  va_end(args);
+  if (ret < 0) { free(buffer); return ret; }
+  *buf_ptr = buffer;
+  return ret;
+}
+```
diff --git a/www-2/content/posts/nspawn.md b/www-2/content/posts/nspawn.md
new file mode 100644
index 0000000..866cf96
--- /dev/null
+++ b/www-2/content/posts/nspawn.md
@@ -0,0 +1,207 @@
+---
+title: "Use systemd-nspawn to Create a Development Sandbox"
+date: 2025-03-04T23:22:23+08:00
+lastmod: 2025-03-27T17:46:24+08:00
+---
+
+*systemd-nspawn* is a great tool for creating development sandboxes. Compared to
+other similar technologies, it's lightweight, flexible, and easy to use. In this
+post, I'll present a simple guide to using it.
+
+
+
+## Advantages
+
+I've been using traditional VMs and Docker for creating development
+environments. Both work fine but, performance aside, they suffer
+from being overly isolated. Two big headaches for me are host network sharing in
+traditional VMs and the immutability of Docker container ports and mounts.
+
+*systemd-nspawn* is much more flexible. Every feature can be configured
+granularly and dynamically. For example, filesystem sharing can be configured to
+work like bind mounts, and network isolation can be disabled entirely, which
+exactly solves the two headaches mentioned above. Additionally, being part of
+*systemd*, it has the same excellent design as other *systemd* components.
+
+Debian has a similarly powerful tool called *schroot*. It is the official tool for
+automatic package building. Unfortunately, it seems to be a tool specific to
+Debian.
+
+## Usage
+
+*systemd-nspawn* consists of two parts that work together to achieve its VM
+functionality:
+
+1. The program `systemd-nspawn`, which runs other programs in an isolated
+   environment with user-specified settings. Each running VM is essentially a
+   group of processes launched via `systemd-nspawn`.
+2. Components for defining and managing VMs, possibly utilizing
+   `systemd-nspawn`.
+
+*systemd-nspawn* has a user interface similar to that of *systemd* services:
+
+- `[name].service` => `[name].nspawn`: Define VMs.
+  - Should be placed in `/etc/systemd/nspawn/`, where `machinectl` scans for VM
+    definitions.
+  - `[name]` serves as the VM's name. Use it to specify the VM when calling
+    `machinectl`. Note: It's best to use a valid hostname (avoid illegal
+    characters like `.`) to prevent weird errors.
+  - The options available roughly mirror `systemd-nspawn`'s CLI arguments, with
+    some adjustments to better fit VM semantics.
+  - Isolation-related options are usually prefixed with `Private` (e.g.,
+    `PrivateUsers=`).
+- `systemctl` => `machinectl`: Manage VMs.
+  - `enable`/`disable`: Set whether the VM starts automatically at system boot.
+  - `start`/`poweroff`/`reboot`/`terminate`/`kill`: Control the VM's running
+    state.
+  - `login`/`shell`: Do things inside the VM.
+
+I'll demonstrate how to create a Debian-based VM on Arch Linux as an example.
+You should adjust the commands based on your own situation.
+
+### Create Root Filesystem
+
+The root filesystem of a distribution can be created using a special tool from
+its package manager. For Debian-based distributions, it's `debootstrap`. If your
+OS uses a different package manager ecosystem, the target distribution's tool and
+its keyrings (which might reside somewhere else) have to be installed first.
+
+```bash-session
+# pacman -S debootstrap debian-archive-keyring ubuntu-keyring
+```
+
+Regular directories work perfectly as root filesystems, but other directory-like
+things, such as a `btrfs` subvolume, should work too.
+
+```bash-session
+# btrfs subvolume create /var/lib/machines/[name]
+```
+
+Now, run `debootstrap` to create a minimal filesystem. Update the command with
+the target distribution's codename and a mirror of your choice.
+
+```bash-session
+# debootstrap --include=dbus,libpam-systemd,libnss-systemd [codename] \
+    /var/lib/machines/[name] [mirror]
+```
+
+At this point, the filesystem contains only the distribution's essential
+packages, much like a base Docker image (e.g., `debian`), so you can customize
+it in a similar way.
+
+### Configure and Customize
+
+I'll present my personal configuration here as a reference. You can create a new
+one based on it or from scratch.
+
+1. Disable user isolation: `[Exec] PrivateUsers=no`
+2. Disable network isolation: `[Network] Private=no`
+3. Create a user with the same username, group name, UID and GIDs as on the host: should be
+   done inside the VM.
+4. Only bind a subdirectory under *home*: `[Files] Bind=/home/[user]/[subdir]`
+5. Set the hostname: `[Exec] Hostname=[hostname]`
+
+I disable user isolation because it's implemented using the kernel's user
+namespace, which adds many inconveniences due to UID/GID mapping.
+
+So, the final `.nspawn` file looks like this:
+
+```systemd
+/etc/systemd/nspawn/[name].nspawn
+---
+[Exec]
+PrivateUsers=no
+Hostname=[hostname]
+
+[Files]
+Bind=/home/[user]/[subdir]
+
+[Network]
+Private=no
+```
+
+If `machinectl` can already start the VM, you can log in to customize it
+further. Otherwise, you can use `systemd-nspawn` directly to enter the VM and
+run commands inside it. `--resolv-conf=bind-host` binds the host's
+`/etc/resolv.conf` file to make the network work.
+
+```bash-session
+# systemd-nspawn --resolv-conf=bind-host -D /var/lib/machines/[name]
+```
+
+Now, inside the VM, you can do whatever you like. In my configuration, a matching
+user must be created manually.
+
+```bash-session
+# apt install locales lsb-release sudo \
+    nano vim less man bash-completion curl wget \
+    build-essential git
+# dpkg-reconfigure locales
+
+# useradd -m -G sudo -s /usr/bin/bash [user]
+# passwd [user]
+```
+
+Some setup may need to be done manually, especially steps usually handled by the
+distribution's installer.
+
+1. Update `/etc/hostname` with the VM's real hostname.
+2. Update `/etc/hosts`.
+
+```plain
+/etc/hosts
+---
+127.0.0.1 localhost [hostname]
+::1 localhost ip6-localhost ip6-loopback
+ff02::1 ip6-allnodes
+ff02::2 ip6-allrouters
+```
+
+**Ubuntu 20.04 specific:** Due to [a bug in
+systemd](https://github.com/systemd/systemd/issues/22234), the backports source
+has to be added.
+
+```plain
+/etc/apt/sources.list
+---
+deb https://mirrors.ustc.edu.cn/ubuntu focal main restricted universe multiverse
+deb https://mirrors.ustc.edu.cn/ubuntu/ focal-updates main restricted universe multiverse
+deb https://mirrors.ustc.edu.cn/ubuntu/ focal-backports main restricted universe multiverse
+deb https://mirrors.ustc.edu.cn/ubuntu/ focal-security main restricted universe multiverse
+```
+
+### Use
+
+The following command starts a new shell session for the specified user inside
+the VM, where you can run commands and perform tasks.
+
+```bash-session
+# machinectl shell [user]@[name]
+```
+
+Another way is to use the `login` command to enter the *login console*. From there,
+you can log in as a user to start a shell session.
+
+```bash-session
+# machinectl login [name]
+```
+
+To exit a VM session (especially in the *login console*), press `CTRL+]` three
+times quickly in a row.
+
+### Snapshot
+
+The easiest way to back up/snapshot a VM is to create an archive of the VM's
+filesystem. You can use any archive tool you prefer, such as the simple `tar`.
+If the VM's filesystem is a `btrfs` subvolume, native `btrfs` snapshots can be
+used here. Before creating a snapshot, you should power off the VM to avoid
+archiving runtime files.
+
+```bash-session
+# machinectl poweroff [name]
+# btrfs subvolume snapshot /var/lib/machines/[name] \
+    /var/lib/machines/btrfs-snapshots/[name]/[snapshot-name]
+```
+
+`machinectl` also provides an *image* feature similar to Docker, though I've
+never tried it. Feel free to explore it if you're interested!
diff --git a/www-2/content/posts/use-paddleocr.md b/www-2/content/posts/use-paddleocr.md
new file mode 100644
index 0000000..806df41
--- /dev/null
+++ b/www-2/content/posts/use-paddleocr.md
@@ -0,0 +1,103 @@
+---
+title: "Use PaddleOCR"
+date: 2022-11-30T13:25:36+08:00
+description: Simple steps to use PaddleOCR.
+categories: coding
+tags:
+  - AI
+  - python
+  - OCR
+---
+
+[_OCR_](https://en.wikipedia.org/wiki/Optical_character_recognition) is nothing new. There are plenty of open-source AI engines for it, but I needed an easy-to-use one.
+
+Recently I got a task to convert images into text. There were far too many images to OCR one by one by hand, so I wrote a Python script to handle this tedious task.
+
+
+
+## Basic Processing
+
+Every original image has the same useless frame around the part that I need. Cropping it away improves performance (the image is smaller) and removes the unrelated text inside the frame.
+
+Cropping is an easy problem. Just install the [`Pillow`](https://pillow.readthedocs.io/en/stable/) package with `pip`:
+
+```shell
+pip install Pillow
+```
+
+Then use `Pillow` to do the cropping:
+
+```python
+from os.path import join
+from PIL import Image
+
+dir_path = "."  ## assumption: the directory that contains the images
+image_file_list = ["image1.png", "image2.png", ...]  ## placeholder: list your own files here
+crop_file_list = [f"crop-{image_file}" for image_file in image_file_list]
+geometry = (100, 200, 300, 400)  ## left, top, width, height
+print("Target geometry:", geometry)
+## convert to (left, top, right, bottom), the box format that Image.crop() takes
+geometry_ltrb = (geometry[0], geometry[1], geometry[0] + geometry[2], geometry[1] + geometry[3])
+
+for index, image_file in enumerate(image_file_list):  ## crop every image with the same geometry
+    print(f"[{index + 1}/{len(image_file_list)}] Cropping '{image_file}' ...")
+    with Image.open(join(dir_path, image_file)) as image:
+        image.crop(geometry_ltrb).save(join(dir_path, crop_file_list[index]))
+```
+
+Now we have the cropped images, named after the original files with a `crop-` prefix.
+
+## Install PaddlePaddle
+
+It's not easy to install [`PaddlePaddle`](https://github.com/PaddlePaddle/Paddle) with `pip` because it needs to run some native compilation. `Anaconda` is also complex to install and generates a lot of garbage files. The cleanest way is to use [`Docker`](https://www.docker.com) together with the [`vscode` Remote Connect extensions](https://code.visualstudio.com/docs/devcontainers/containers).
+
+Of course, you need to install Docker first, which is out of this post's scope.
+
+Then run the following command to create and start a container from the `PaddlePaddle` image:
+
+```shell
+docker run -it --name ppocr -v "$PWD:/data" --network=host registry.baidubce.com/paddlepaddle/paddle:2.4.0-cpu /bin/bash
+```
+
+Some things to note:
+
+1. You can change the mounted volume to the directory you want to process.
+
+2. This image is pulled from the registry of [`Baidu`](https://baidu.com) (the company that created _PaddlePaddle_), which is fast in China. You can also pull it from `DockerHub`.
+
+3. This image's _PaddlePaddle_ is the CPU build. If you have a GPU with [_CUDA_](https://developer.nvidia.com/cuda-downloads), you can select another image with the matching tag, but the CPU image works almost everywhere and the GPU one is harder to configure.
+
+4. I don't know why `--network=host` is needed, since the container does not publish any ports. Maybe it makes Internet access faster, or VSCode Remote Connect needs it?
+
+## Install PaddleOCR
+
+The image above only contains _PaddlePaddle_. [_PaddleOCR_](https://github.com/PaddlePaddle/PaddleOCR) is a separate package built on top of it and has to be installed on its own. This time, however, we can just use `pip`.
+
+```shell
+pip install paddleocr
+```
+
+## Coding
+
+The next step is to write the Python code. It's also the easiest part!
+Connect to the container you just created with vscode, and happy coding!
+
+```python
+from paddleocr import PaddleOCR
+ocr = PaddleOCR(use_angle_cls=True, lang="ch")  ## change the language to what you need
+image_text_list = []
+for index, crop_image_file in enumerate(crop_file_list):
+    print(f"[{index + 1}/{len(crop_file_list)}] OCRing '{crop_image_file}' ...")
+    result = ocr.ocr(join(dir_path, crop_image_file), cls=True)[0]  ## The official docs are inconsistent here. The result is a list with a single element.
+    line_text_list = [line[1][0] for line in result or []]  ## a list of text str; "or []" guards against an empty result
+    image_text = "\n".join(line_text_list)
+    image_text_list.append(image_text)
+```
+
+Now you can do whatever else you need with `image_text_list`.
+
+## Finally
+
+Now just run the script. Or even better, customize it.
+
+By the way, in my experience `PaddleOCR` is far more accurate than [`tesseract`](https://tesseract-ocr.github.io) for __Chinese__. Maybe that's because it was created by _Baidu_, a Chinese company, or because I missed some configuration. I haven't tested English.
--
cgit v1.2.3