How to train pottan for another language

Harish K edited this page Apr 29, 2019 · 1 revision

Preface

Pottan-ocr uses a CNN+RNN+CTC (CRNN) model for text recognition. Training uses synthetic data that is generated at training time from the fonts available on the system, so Unicode fonts for the target language must be installed. Training also needs a working installation of the Torch machine learning framework (a GPU is recommended for training).

Quick overview

  1. Create a config.yaml file similar to config.yaml.sample. In it, list the primary Unicode code points of the target language.
  • By "primary Unicode code points" I mean that we don't have to include every glyph available in the target language. Glyphs formed by combining two or more primary code points are not required. Including such glyphs will hurt performance, but it won't break the tool.
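As a starting point for the code point list in step 1, the sketch below prints every assigned character in one Unicode block. It is only an illustration: the Malayalam block (U+0D00 to U+0D7F) is used as an example, and the helper name `assigned_code_points` is hypothetical and not part of pottan-ocr. The output still needs manual pruning before it goes into config.yaml, and any digits, punctuation or space characters used by your corpus would have to be added by hand.

```python
import unicodedata


def assigned_code_points(start, end):
    """Return the assigned characters in the inclusive range [start, end]."""
    chars = []
    for cp in range(start, end + 1):
        ch = chr(cp)
        # Category "Cn" marks code points that have no character assigned.
        if unicodedata.category(ch) != "Cn":
            chars.append(ch)
    return chars


# Malayalam block, chosen only as an example; replace with your language's block.
MALAYALAM_BLOCK = (0x0D00, 0x0D7F)

candidates = assigned_code_points(*MALAYALAM_BLOCK)
for ch in candidates:
    # Print the code point and its Unicode name to make manual review easier.
    print(f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}")
```

Reviewing the names makes it easier to spot entries you may not need, such as archaic letters or fraction signs.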
  2. Install as many different Unicode fonts for the target language as you can on your system. Note each font's qualified name and add it to config.yaml. For example:
fonts:
- - AnjaliOldLipi
  - [regular, bold]
- - Chilanka
  - [regular, bold, italic]

This means that the regular and bold styles of the AnjaliOldLipi font will be used for training, and for the Chilanka font all three styles (regular, bold and italic) will be used.

  3. Create a text corpus of sentences with limited length. The synthetic training data generation algorithm doesn't check for text overflow or text wrapping, so if we don't limit the length of the sentences, it will cause text overflow like this image.

  • The same thing can happen if we use a large font size. So if a font needs a fontSize different from defaultFontSize, we must specify it in the config file; otherwise wrong training data will be generated. For example:
- - Karumbi
  - [regular, bold, italic, bold italic]
  - 15
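The corpus step above (sentences of limited length) could be prepared with a small script like the following. This is a sketch under assumptions, not pottan-ocr's own tooling: it assumes the corpus is a gzip-compressed UTF-8 text file with one sentence per line (suggested by the name train.txt.gz in the training command), the 40-character limit is an arbitrary placeholder, and the function names are hypothetical. Length is counted in code points, which only approximates rendered width, so the limit should be checked against the generated images.

```python
import gzip
import textwrap


def build_corpus(text, max_chars=40):
    """Split text into lines of at most max_chars code points, breaking on spaces."""
    lines = []
    for paragraph in text.splitlines():
        pieces = textwrap.wrap(
            paragraph,
            width=max_chars,
            break_long_words=False,  # never cut a word in the middle
            break_on_hyphens=False,
        )
        for piece in pieces:
            # A single word longer than max_chars ends up on its own line; drop it.
            if len(piece) <= max_chars:
                lines.append(piece)
    return lines


def write_corpus(lines, path):
    """Write one sentence per line to a gzip-compressed UTF-8 file."""
    with gzip.open(path, "wt", encoding="utf-8") as fh:
        for line in lines:
            fh.write(line + "\n")
```

For example, `write_corpus(build_corpus(open("raw.txt", encoding="utf-8").read()), "train.txt.gz")` would turn a hypothetical raw.txt into a corpus file.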
  4. Run training
  • An example command line looks like this:
./bin/pottan train \
  --cuda \
  --crnn /presession2/$1 \
  --traindata ./train.txt.gz  \
  --traindata_limit $(( 256* 512)) \
  --traindata_cache ./traindata_cache \
  --valdata ./validate.txt.gz \
  --valdata_limit $(( 256* 16 )) \
  --valdata_cache ./valdata_cache \
  --valInterval 512 \
  --batchSize 64 \
  --lr 0.00005 \
  --niter 50 \
  --outdir ./output \
  --displayInterval 50 \
  --saveInterval 1024
  • If we specify valdata_cache and traindata_cache, the generated images will be written to disk (useful for manual inspection). If we omit those options, no generated images are written to disk. Trained models are saved in the output directory with a timestamp in the filename.
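To get a feel for the numbers in the example command, the arithmetic can be worked out as below. This rests on assumptions the page does not confirm: that the `*_limit` options count corpus lines, that `niter` is the number of passes over the training data, and that `valInterval` and `saveInterval` are counted in batches. The function name `run_plan` is hypothetical.

```python
def run_plan(train_limit, val_limit, batch_size, niter, val_interval, save_interval):
    """Estimate batch counts for a run, assuming intervals are measured in batches."""
    batches_per_pass = -(-train_limit // batch_size)  # ceiling division
    return {
        "train_samples": train_limit,
        "batches_per_pass": batches_per_pass,
        "val_batches": -(-val_limit // batch_size),
        "total_batches": batches_per_pass * niter,
        "validations_per_pass": batches_per_pass // val_interval,
        "saves_per_pass": batches_per_pass // save_interval,
    }


# Values taken from the example command line above.
plan = run_plan(
    train_limit=256 * 512,
    val_limit=256 * 16,
    batch_size=64,
    niter=50,
    val_interval=512,
    save_interval=1024,
)
for key, value in plan.items():
    print(f"{key}: {value}")
```

Under those assumptions, the example settings give 131072 training samples, 2048 batches per pass, and 102400 batches over the whole run.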

In the early days, I used FloydHub for training pottan-ocr. Those sessions can be seen here: https://www.floydhub.com/harish2704/projects/pottan-ocr/3 These files include a complete bash script that runs training on a fresh Ubuntu FloydHub container.
