Use PaddleOCR

Nov 30, 2022 | 604 words

I guess OCR is nothing new to most of us. There are plenty of open source OCR engines out there, but I needed one that is easy to use.

Recently I got a task to convert images into text. There were far too many images to OCR one by one by hand, so I wrote a Python script to handle the tedious work.

Basic Processing

The original images all have an identical, useless frame around the part I need. Cropping it away helps in two ways: the smaller image is faster to process, and the unrelated text in the frame stays out of the result.

Cropping is an easy problem. Just install the Pillow package with pip:

pip install Pillow

Then use Pillow to do the cropping:

from os.path import join

from PIL import Image

# assumption: folder that holds the original images
dir_path = "."
# "..." stands for the rest of your file names
image_file_list = ["image1.png", "image2.png", ...]
crop_file_list = [f"crop-{image_file}" for image_file in image_file_list]

# left, top, width, height
geometry = (100, 200, 300, 400)
print("Target geometry:", geometry)
# convert to (left, top, right, bottom), the box format Image.crop expects
geometry_ltrb = (geometry[0], geometry[1], geometry[0] +
                 geometry[2], geometry[1] + geometry[3])

# crop each image and save it under its "crop-" name in the current directory
for index, image_file in enumerate(image_file_list):
    print(f"[{index + 1}/{len(image_file_list)}] Cropping '{image_file}' ...")
    with Image.open(join(dir_path, image_file)) as image:
        image.crop(geometry_ltrb).save(crop_file_list[index])

Now we have the cropped images, each named after the original file with a crop- prefix.
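With a large number of images, typing the file list by hand is not practical. Here is a minimal sketch of how the list could be built from a folder, assuming all images are .png files in one directory. The helper names collect_images and to_ltrb are my own for illustration and are not part of Pillow.

```python
from pathlib import Path


def collect_images(dir_path, pattern="*.png", crop_prefix="crop-"):
    """Return sorted file names matching pattern, skipping earlier crop output."""
    return sorted(
        p.name
        for p in Path(dir_path).glob(pattern)
        if p.is_file() and not p.name.startswith(crop_prefix)
    )


def to_ltrb(geometry):
    """Convert (left, top, width, height) to (left, top, right, bottom)."""
    left, top, width, height = geometry
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    return (left, top, left + width, top + height)
```

For example, image_file_list = collect_images(".") picks up every .png in the current folder, and to_ltrb((100, 200, 300, 400)) gives (100, 200, 400, 600). Skipping the crop- prefix keeps a second run from cropping files that are already cropped.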

Install PaddlePaddle

In my experience it’s not easy to install PaddlePaddle with pip because it needs to run some native compilation. Anaconda is also complex to install and leaves a lot of garbage files behind. The cleanest way is to use Docker together with the VS Code remote extensions.

Of course you need to install Docker first, which is out of scope for this post.

Then run the following command to create and start a container from the PaddlePaddle image:

docker run -it --name ppocr -v "$PWD:/data" --network=host registry.baidubce.com/paddlepaddle/paddle:2.4.0-cpu /bin/bash

Some things to note

You can change the mounted volume to whatever directory you want to process.
This image is pulled from the registry of Baidu (the company that created PaddlePaddle), which is fast in China. You can also pull it from Docker Hub.
This image runs PaddlePaddle on the CPU, which works on any machine. If you have a GPU with CUDA, you can pick another image with the matching tag, but the CPU image almost always works and the GPU setup is harder to configure.
I don’t know why --network=host is needed, since the container does not publish any ports. Maybe it makes Internet access faster, or maybe the VS Code remote connection needs it?

Install PaddleOCR

The image above only contains PaddlePaddle. PaddleOCR is a separate package built on top of it and has to be installed on its own. This time, however, we can simply use pip again.

pip install paddleocr
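Before writing the OCR script, it can be worth checking that both packages are really visible to the Python inside the container. This is only a sketch using the standard library. It assumes the pip distribution names are paddlepaddle and paddleocr (a GPU image may register a different name), and installed_versions is a hypothetical helper name.

```python
from importlib import metadata


def installed_versions(packages=("paddlepaddle", "paddleocr")):
    """Map each pip distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions
```

Calling installed_versions() inside the container returns a dict with one entry per package. A None value means pip did not find that package under the assumed name.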

Coding

The next step is to write the Python code, which is also the easiest part! Connect VS Code to the container you just created, and happy coding!

from paddleocr import PaddleOCR

ocr = PaddleOCR(use_angle_cls=True, lang="ch")  # change the language to what you need
image_text_list = []
for index, crop_image_file in enumerate(crop_file_list):
    print(f"[{index + 1}/{len(crop_file_list)}] OCRing '{crop_image_file}' ...")
    result = ocr.ocr(crop_image_file, cls=True)
    # The official docs are inconsistent here: the result is a list with a single element.
    result = result[0]
    line_text_list = [line[1][0] for line in result]  # a list of text str, one per detected line
    image_text = "\n".join(line_text_list)
    image_text_list.append(image_text)

Now you can do whatever else you want with image_text_list.
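One possible next step is to drop low-confidence lines and write each image’s text to its own file. The sketch below assumes each entry of a per-image result looks like [box, (text, confidence)], which matches the line[1][0] indexing used above. The names lines_from_result and save_texts are hypothetical helpers, not PaddleOCR functions.

```python
from pathlib import Path


def lines_from_result(result, min_confidence=0.0):
    """Extract text lines from one image's OCR result.

    Assumes each entry looks like [box, (text, confidence)].
    """
    if not result:  # nothing detected (empty list or None)
        return []
    return [
        text
        for _box, (text, confidence) in result
        if confidence >= min_confidence
    ]


def save_texts(file_names, texts, out_dir="."):
    """Write each text to <out_dir>/<image stem>.txt and return the written paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for file_name, text in zip(file_names, texts):
        path = out_dir / (Path(file_name).stem + ".txt")
        path.write_text(text, encoding="utf-8")
        paths.append(path)
    return paths
```

For example, save_texts(crop_file_list, image_text_list) would write crop-image1.txt next to the script for crop-image1.png. The min_confidence threshold is something to tune on your own images; I have no recommended value.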

Finally

Now just run the script. Or even better, customize it.

By the way, in my use PaddleOCR was far more accurate than Tesseract on Chinese text. Maybe that is because it was created by Baidu, a Chinese company, or maybe I missed some Tesseract configuration. I haven’t tested English.