aboutsummaryrefslogtreecommitdiff
path: root/www/content/posts/use-paddleocr.md
diff options
context:
space:
mode:
Diffstat (limited to 'www/content/posts/use-paddleocr.md')
-rw-r--r--www/content/posts/use-paddleocr.md103
1 files changed, 103 insertions, 0 deletions
diff --git a/www/content/posts/use-paddleocr.md b/www/content/posts/use-paddleocr.md
new file mode 100644
index 0000000..f4eae32
--- /dev/null
+++ b/www/content/posts/use-paddleocr.md
@@ -0,0 +1,103 @@
+---
+title: "Use PaddleOCR"
+date: 2022-11-30T13:25:36+08:00
+description: Simple steps to use PaddleOCR.
+categories: Coding
+tags:
+ - AI
+ - Python
+ - OCR
+---
+
+I guess [_OCR_](https://en.wikipedia.org/wiki/Optical_character_recognition) is not something new for us. While there are a lot of open source artificial intelligence engines to achieve this, I need a easy-to-use one.
+
+Recently I got a task to convert images into text. The image number is fairly big. So it's just impossible to OCR them one by one manually. So I wrote a python script to handle this tedious task.
+
+<!--more-->
+
+## Basic Processing
+
+The original images contain a identical useless frame around the part that I need. So a crop is required because it will improve the performance (of course, the image is smaller) and there are unrelated texts in the frame.
+
+Cropping is a easy problem. Just install [`Pillow`](https://pillow.readthedocs.io/en/stable/) package with `pip`:
+
+```shell
+pip install Pillow
+```
+
+Then use `Pillow` to do the cropping:
+
+```python
+image_file_list = ["image1.png", "image2.png", ...]
+crop_file_list = [f"crop-{image_file}" for image_file in image_file_list]
+
+## left, top, width, height
+geometry = (100, 200, 300, 400)
+print("Target geometry:", geometry)
+## convert to (left, top, right, bottom)
+geometry_ltrb = (geometry[0], geometry[1], geometry[0] +
+ geometry[2], geometry[1] + geometry[3])
+
+## crop image with geometry
+for index, image_file in enumerate(image_file_list):
+ print(f"[{index + 1}/{len(image_file_list)}] Cropping '{image_file}' ...")
+ with Image.open(join(dir_path, image_file)) as image:
+ image.crop(geometry_ltrb).save(crop_file_list)
+```
+
+Now we have cropped images with original filename prefixed by `crop-`.
+
+## Install PaddlePaddle
+
+It's not easy to install [`PaddlePaddle`](https://github.com/PaddlePaddle/Paddle) with `pip` because it needs to run some native compilation. `Anaconda` is also complex to install and generates a lot of garbage files. The cleanest way is to use [`Docker`](https://www.docker.com) and with [`vscode` Remote Connect extensions](https://code.visualstudio.com/docs/devcontainers/containers).
+
+Of course you need to install docker first, which is basically out of this blog's scope.
+
+Then run the following command to create and run the `PaddlePaddle` image:
+
+```shell
+docker run -it --name ppocr -v "$PWD:/data" --network=host registry.baidubce.com/paddlepaddle/paddle:2.4.0-cpu /bin/bash
+```
+
+Something to note
+
+1. You can change the mounted volumes to what you want to process.
+
+2. This image is pulled from [`Baidu`](https://baidu.com) (the company creates _PaddlePaddle_) registry, which is fast in China. You can also pull it from `DockerHub`.
+
+3. This image's _PaddlePaddle_ is based on cpu. Of course you have a cpu in your computer. But if you have a GPU or even [_CUDA_](https://developer.nvidia.com/cuda-downloads), you can select another image with correct tag. But cpu image is almost always work and using GPU is harder to configure.
+
+4. I don't known why `--network=host` is needed. The container does not publish any ports. But it can access Internet faster or VSCode Remote Connect needs it?
+
+## Install PaddleOCR
+
+This image above only contain _PaddlePaddle_. [_PaddleOCR_](https://github.com/PaddlePaddle/PaddleOCR) is another package based on it and needs individual install. However, this time we can just use `pip` again.
+
+```shell
+pip install paddleocr
+```
+
+## Coding
+
+The next step is to write python codes. Also the easiest part!
+You can connect to the container you just created with vscode and then happy coding!
+
+```python
+ocr = PaddleOCR(use_angle_cls=True, lang="ch") ## change the language to what you need
+image_text_list = []
+for index, crop_image_file in enumerate(crop_file_list):
+ print(f"[{index + 1}/{len(crop_file_list)}] OCRing '{crop_image_file}' ...")
+ result = ocr.ocr(crop_image_file, cls=True)
+ result = result[0] ## There is some inconsistence of official docs. Result is a list with single element.
+ line_text_list = [line[1][0] for line in result] ## a list of text str
+ image_text = "\n".join(line_text_list)
+ image_text_list.append(paragraph)
+```
+
+Now you can do any other things to the the `image_text_list` .
+
+## Finally
+
+Now just run the script. Or even better, customize it.
+
+By the way, `PaddleOCR` is far more accurate than [`tesseract`](https://tesseract-ocr.github.io) in __Chinese__. Maybe because it is created by _Baidu_, a Chinese local company or I missed some configuration. For English, I haven't tested.