How to train pottan for another language

Preface

Pottan-ocr uses a CNN+RNN+CTC (CRNN) model for text recognition. Training uses synthetic data generated at training time from the fonts available on the system, so Unicode fonts for the target language must be installed. Training also requires a working installation of the Torch machine learning framework (a GPU is recommended).

Quick overview

1. Create a config.yaml file similar to config.yaml.sample. In it, list the primary Unicode code points of the target language.

By primary Unicode code points I mean that we don't have to include every glyph available in the target language. Glyphs formed by combining two or more primary code points are not required. Including such glyphs will hurt performance, but it won't break the tool.
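To make "primary code points" concrete, the sketch below lists the distinct code points that appear in a piece of text, using only Python's standard library. A conjunct never shows up as its own entry, because it is stored as a sequence of primary code points. The function name `primary_code_points` and the sample string are illustrative assumptions, not part of pottan-ocr.

```python
import unicodedata


def primary_code_points(text):
    """Return the sorted distinct non-whitespace characters in `text`."""
    return sorted({ch for ch in text if not ch.isspace()})


# Hypothetical sample: the Malayalam conjunct KA + VIRAMA + SSA, followed by MA.
sample = "\u0d15\u0d4d\u0d37\u0d2e"

for ch in primary_code_points(sample):
    # unicodedata.name raises ValueError for unnamed characters, so pass a default.
    print(f"U+{ord(ch):04X} {unicodedata.name(ch, 'UNKNOWN')}")
```

Running the same function over a whole corpus gives a candidate list to compare against the code points entered in config.yaml.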

2. Install as many different Unicode fonts for the target language as you can on your system. Note each font's qualified name and add it to config.yaml. For example:

fonts:
- - AnjaliOldLipi
  - [regular, bold]
- - Chilanka
  - [regular, bold, italic]

This means the regular and bold styles of AnjaliOldLipi will be used for training, and for Chilanka all three styles (regular, bold, and italic) will be used.

3. Create a text corpus of sentences with limited length. The synthetic training data generation algorithm doesn't check for text overflow and doesn't wrap text, so if we don't limit the sentence length, the text will overflow the generated image.

The same thing can happen if we use a large fontSize. So, if a font needs a fontSize different from defaultFontSize, we must specify it in the config file; otherwise wrong training data will be generated. For example:

- - Karumbi
  - [regular, bold, italic, bold italic]
  - 15
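One way to read these entries: each font is a list of the form [name, styles] or [name, styles, fontSize], and a missing third element means the default size applies. The sketch below expands such entries into (font, style, size) combinations. The `expand_fonts` helper, the `DEFAULT_FONT_SIZE` value, and the expansion itself are assumptions made for illustration; how pottan-ocr actually reads the config is not described on this page.

```python
DEFAULT_FONT_SIZE = 20  # assumed placeholder; use the defaultFontSize from your config


def expand_fonts(fonts, default_size=DEFAULT_FONT_SIZE):
    """Expand [name, styles] or [name, styles, size] entries into tuples."""
    combos = []
    for entry in fonts:
        name, styles = entry[0], entry[1]
        # A third element, when present, overrides the default font size.
        size = entry[2] if len(entry) > 2 else default_size
        for style in styles:
            combos.append((name, style, size))
    return combos


# The nested-list structure that the YAML examples on this page load into.
fonts = [
    ["AnjaliOldLipi", ["regular", "bold"]],
    ["Chilanka", ["regular", "bold", "italic"]],
    ["Karumbi", ["regular", "bold", "italic", "bold italic"], 15],
]

for combo in expand_fonts(fonts):
    print(combo)
```

Printing the combinations this way is a quick check that every font and style you intended to train on is actually listed.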

An example text corpus used for training Malayalam is available at:
- https://github.com/harish2704/pottan-ocr-data/blob/master/train.txt.gz ( used for training )
- https://github.com/harish2704/pottan-ocr-data/blob/master/validate.txt.gz ( used for validation )
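Because the generator does not wrap or clip text, a corpus can be pre-filtered by length before training. The sketch below keeps only non-empty lines up to a maximum number of characters and can write them to a gzip file. The `MAX_CHARS` value and the helper names are assumptions; a suitable limit depends on the fonts, font sizes, and image width in use.

```python
import gzip

MAX_CHARS = 40  # assumed limit; tune it for your fonts and image width


def filter_lines(lines, max_chars=MAX_CHARS):
    """Yield stripped, non-empty lines no longer than `max_chars` characters."""
    for line in lines:
        line = line.strip()
        # len() counts code points, which only approximates rendered width.
        if line and len(line) <= max_chars:
            yield line


def write_corpus(src_path, dst_path, max_chars=MAX_CHARS):
    """Read a plain-text file and write the filtered lines as gzip text."""
    with open(src_path, encoding="utf-8") as src, \
            gzip.open(dst_path, "wt", encoding="utf-8") as dst:
        for line in filter_lines(src, max_chars):
            dst.write(line + "\n")


# In-memory demonstration: the empty and over-long lines are dropped.
sample = ["short line", "", "x" * 100]
print(list(filter_lines(sample)))
```

Calling `write_corpus("sentences.txt", "train.txt.gz")` (hypothetical file names) would produce a compressed corpus in the same one-sentence-per-line shape.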

Run training

An example command line looks like this:

./bin/pottan train \
  --cuda \
  --crnn /presession2/$1 \
  --traindata ./train.txt.gz  \
  --traindata_limit $(( 256* 512)) \
  --traindata_cache ./traindata_cache \
  --valdata ./validate.txt.gz \
  --valdata_limit $(( 256* 16 )) \
  --valdata_cache ./valdata_cache \
  --valInterval 512 \
  --batchSize 64 \
  --lr 0.00005 \
  --niter 50 \
  --outdir ./output \
  --displayInterval 50 \
  --saveInterval 1024

If we specify valdata_cache and traindata_cache, the generated images are written to disk, where they can be inspected manually. If we omit those options, no generated images are written to disk. Trained models are saved in the output directory with a timestamp in the filename.
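The shell arithmetic in the command can be checked quickly. The sketch below computes the two sample limits and, assuming one iteration processes one batch, how many batches make up one pass over each data set. Whether valInterval and saveInterval are counted in batches is an assumption here, not something this page states.

```python
train_limit = 256 * 512   # value passed to --traindata_limit
val_limit = 256 * 16      # value passed to --valdata_limit
batch_size = 64           # value passed to --batchSize

# Number of full batches in one pass over each data set.
train_batches = train_limit // batch_size
val_batches = val_limit // batch_size

print(train_limit, val_limit)      # 131072 4096
print(train_batches, val_batches)  # 2048 64
```

Under that assumption, a valInterval of 512 would trigger validation four times per pass over the training data, and a saveInterval of 1024 would save twice.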

In the early days, I used FloydHub to train pottan-ocr. Those sessions can be seen at https://www.floydhub.com/harish2704/projects/pottan-ocr/3 and include a complete bash script that runs training on a fresh Ubuntu FloydHub container.
