DeepFont on Keras
DeepFont was first introduced by Adobe; it uses deep learning to identify font types from images. Inspired by their work, I made this reproduction in Keras.
DeepFont: Identify Your Font from An Image
Their technical contributions are listed below:
AdobeVFR Dataset A large set of labeled real-world images as well as a large corpus of unlabeled real-world data are collected for both training and testing, which can be found at the Adobe Visual Font Recognition (VFR) link.
Domain Adapted CNN The synthetic-to-real domain gap caused poor generalization to new real data in previous VFR methods. They address this domain mismatch problem by leveraging synthetic data to obtain effective classification features, while introducing a domain adaptation technique based on a Stacked Convolutional Auto-Encoder (SCAE) with the help of unlabeled real-world data.
Learning-based Model Compression They introduce a novel learning-based approach to obtain a losslessly compressible model, for a high compression ratio without sacrificing its performance. An exact low-rank constraint is enforced on the targeted weight matrix.
Datasets
To apply machine learning to the VFR problem, both synthetic and realistic text images with ground-truth font labels are required. The way to overcome the training-data challenge is to synthesize the training set by rendering text fragments for all the necessary fonts.
Synthetic Text
It’s easy to generate a dataset of custom font image patches using TextRecognitionDataGenerator.
GitHub - TextRecognitionDataGenerator
Words will be randomly chosen from a dictionary of a specific language. Then an image of those words will be generated by using font, background, and modifications (skewing, blurring, etc.) as specified.
TextRecognitionDataGenerator comes with an easy-to-use CLI and Python module. It has a nicely written tutorial.
TextRecognitionDataGenerator Tutorial
Realistic Text
The AdobeVFR dataset contains 4,384 real-world test images with reliable labels, covering 617 classes (out of 2,383). Compared to the synthetic data, these images typically have much larger appearance variations caused by scaling, background clutter, lighting, noise, perspective distortions, and compression artifacts.
Preprocessing
Fonts are different from objects, which carry rich spatial information that classifiers can use as features. To reduce this mismatch, preprocessing is required, and the paper gives examples of it.
Firstly, import the needed modules.
```python
import PIL
```
Add `%matplotlib inline` as a magic function if you use IPython, so that images render directly in the browser; leave it out otherwise, since it raises errors outside IPython.
Then write the image-loading function.
```python
def pil_image(img_path):
```
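A minimal sketch of such a function, assuming PIL and the 105x105 input patch size used by the paper, might look like this:

```python
from PIL import Image

def pil_image(img_path):
    # open the image, convert to grayscale, and resize to the input patch size
    img = Image.open(img_path).convert('L')
    return img.resize((105, 105))
```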
Legacy
It is usual to artificially augment training data using label-preserving transformations to reduce overfitting.
- Noise small Gaussian noise with mean 0 and standard deviation 3 is added to the input.
```python
def noise_image(img):
```
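A plausible implementation with NumPy, using the mean 0 / standard deviation 3 stated above and keeping the image as a PIL object:

```python
import numpy as np
from PIL import Image

def noise_image(img):
    # add Gaussian noise with mean 0 and standard deviation 3
    arr = np.asarray(img).astype(np.float32)
    noisy = arr + np.random.normal(0, 3, arr.shape)
    return Image.fromarray(np.clip(noisy, 0, 255).astype(np.uint8))
```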
- Blur a random Gaussian blur with standard deviation from 2.5 to 3.5 is added to the input.
```python
def blur_image(img):
```
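A sketch using PIL's `ImageFilter.GaussianBlur`, with the radius drawn from the 2.5 to 3.5 range above:

```python
import random
from PIL import ImageFilter

def blur_image(img):
    # apply a Gaussian blur with a random standard deviation in [2.5, 3.5]
    return img.filter(ImageFilter.GaussianBlur(radius=random.uniform(2.5, 3.5)))
```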
- Perspective Rotation a randomly-parameterized affine transformation is applied to the input.
```python
def affine_rotation(img):
```
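One way to do this with OpenCV; the three control points and the ±5-pixel jitter below are arbitrary choices of mine, not values from the paper:

```python
import numpy as np
import cv2
from PIL import Image

def affine_rotation(img):
    # warp the image with a randomly-parameterized affine transformation
    arr = np.asarray(img)
    rows, cols = arr.shape[:2]
    src = np.float32([[10, 10], [30, 10], [10, 30]])
    dst = src + np.random.uniform(-5, 5, src.shape).astype(np.float32)
    matrix = cv2.getAffineTransform(src, dst)
    return Image.fromarray(cv2.warpAffine(arr, matrix, (cols, rows)))
```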
- Shading the input background is filled with a gradient in illumination.
```python
def gradient_fill(img):
```
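A sketch that multiplies the image by a left-to-right illumination gradient; the 0.5 to 1.0 brightness range is an assumption:

```python
import numpy as np
from PIL import Image

def gradient_fill(img):
    # shade the image with a horizontal illumination gradient
    arr = np.asarray(img).astype(np.float32)
    h, w = arr.shape[:2]
    gradient = np.tile(np.linspace(0.5, 1.0, w), (h, 1))
    return Image.fromarray(np.clip(arr * gradient, 0, 255).astype(np.uint8))
```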
Additional
As a very particular type of image, text images have various real-world appearances caused by specific handlings. Based on observations in the paper, they identify two additional font-specific augmentation steps for the training data.
- Variable Character Spacing when rendering each synthetic image, set the character spacing (by pixel) to be a Gaussian random variable of mean 10 and standard deviation 40, bounded by [0, 50].
- Variable Aspect Ratio Before cropping each image into an input patch, the image, with height fixed, is squeezed in width by a random ratio drawn from a uniform distribution between 5/6 and 7/6.
It is not convenient to apply these additional steps character by character, so, loosely speaking, we can do this before the legacy steps, right at the beginning when we generate our datasets with TextRecognitionDataGenerator.
```bash
python3 run.py -c 10 -k 15 -rk -d 3 -do 2 -f 64 -ft ['Font1', 'Font2', 'Font3'] -t $(grep -c ^processor /proc/cpuinfo)
```
This generates 10 examples each for Font1, Font2, and Font3, with 64-pixel characters, a skewing angle between -15 and 15 degrees, and random distortions in both the vertical and horizontal directions, with multi-thread acceleration enabled.
Doing exactly what the paper does would be more difficult. First we would generate single characters in the same font with random aspect ratios, as the paper advises; then we would concatenate these single characters with random spacing into words, so that we get a sentence in one image labeled with its font. By repeating these steps, we would obtain image datasets for different fonts before applying the legacy steps.
However, we are supposed to do something similar to this at the end of the dataset import step, and in fact that is how I did it. To be clear about why we could and should do this: I had misunderstood something at first, and the situation is actually quite different. Imagine how people identify a font in reality: the font always appears as part of some text with strong, clear characteristics, and that is the most important connection to our datasets. The preprocessing shortcut I suggested above works against this, undermining exactly the characteristics that distinguish most printed fonts, though it may still help with handwriting font recognition.
Architecture
Domain adapted CNN employs a Convolutional Neural Network (CNN) architecture, which is further decomposed into two sub-networks:
- A “shared” low-level sub-network which is learned from the composite set of synthetic and real-world data.
- A high-level sub-network that learns a deep classifier from the low-level features.
Generate Datasets
Here we use the Text Recognition Data Generator CLI `trdg` to generate the random datasets. `ttf_path` is a folder containing all the font files, correctly named and with the `.ttf` extension. `data_path` is the folder that stores the generated datasets.
```python
import os
```
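A sketch of what the generation script might do: loop over the fonts in `ttf_path` and shell out to `trdg` with roughly the same options used earlier. The `--output_dir` flag and the per-font subfolder layout are my assumptions, not necessarily the original setup:

```python
import os

ttf_path = 'fonts/'       # folder with the .ttf files
data_path = 'datasets/'   # folder for the generated images

for ttf in os.listdir(ttf_path):
    if not ttf.endswith('.ttf'):
        continue
    font_name = os.path.splitext(ttf)[0]
    out_dir = os.path.join(data_path, font_name)
    os.makedirs(out_dir, exist_ok=True)
    # one call per font, reusing the CLI options shown above
    os.system(
        'trdg -c 10 -k 15 -rk -d 3 -do 2 -f 64 '
        '-ft {} --output_dir {}'.format(os.path.join(ttf_path, ttf), out_dir)
    )
```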
Import Datasets
Import pre-generated synthetic and realistic text images from `datasets_path` (here, especially the datasets we generated before).
```python
import os
```
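A sketch of the import step, assuming one subfolder per font under `datasets_path` (matching the generation sketch above) and reusing `pil_image`:

```python
import os
import numpy as np

datasets_path = 'datasets/'
data, labels = [], []

for font_name in os.listdir(datasets_path):
    font_dir = os.path.join(datasets_path, font_name)
    if not os.path.isdir(font_dir):
        continue
    for file_name in os.listdir(font_dir):
        # load every patch and keep the folder name as its font label
        img = pil_image(os.path.join(font_dir, file_name))
        data.append(img)
        labels.append(font_name)
```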
Tag Labels
Convert each font-name string to an integer and use the matched number as the font label when training models.
```python
def conv_label(label):
```
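A minimal sketch, assuming a fixed list of the font names used during generation (the example names are placeholders):

```python
# the font names used when generating the datasets (placeholder list)
font_names = ['Font1', 'Font2', 'Font3']

def conv_label(label):
    # map a font-name string to its integer index
    return font_names.index(label)
```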
Preprocessing Datasets
The preprocessing functions are already finished. For each font patch image, effects should be applied randomly, so we first generate random combinations of the four legacy preprocessing functions, then apply the effects to all the font patch images according to the generated combination list.
```python
import os
```
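One possible way to build and apply the random combinations, reusing the `data`/`labels` lists and `conv_label` from the sketches above; the use of `itertools.combinations` and uniform sampling is my assumption:

```python
import itertools
import random
import numpy as np

# all non-empty combinations of the four legacy augmentations
augmentations = [noise_image, blur_image, affine_rotation, gradient_fill]
combinations = []
for r in range(1, len(augmentations) + 1):
    combinations.extend(itertools.combinations(augmentations, r))

augmented_data, augmented_labels = [], []
for img, label in zip(data, labels):
    # pick one random combination per patch and apply its effects in order
    for effect in random.choice(combinations):
        img = effect(img)
    augmented_data.append(np.asarray(img))
    augmented_labels.append(conv_label(label))
```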
According to the paper, 75% of the dataset is used for training and the remaining 25% for testing, so partitioning the data into training and testing sets is required.
```python
from sklearn.model_selection import train_test_split
```
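The split itself, following the 75/25 ratio from the paper and reusing the arrays built above:

```python
from sklearn.model_selection import train_test_split
import numpy as np

X_train, X_test, y_train, y_test = train_test_split(
    np.array(augmented_data), np.array(augmented_labels),
    test_size=0.25, random_state=42)
```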
For further processing, both train and test labels of the datasets should be converted from integers to vectors.
```python
from keras.utils import to_categorical
```
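One-hot encoding of the integer labels, with `num_classes` matching the number of fonts:

```python
from keras.utils import to_categorical

y_train = to_categorical(y_train, num_classes=len(font_names))
y_test = to_categorical(y_test, num_classes=len(font_names))
```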
Then process the datasets using additional preprocessing steps.
```python
from keras.preprocessing.image import ImageDataGenerator
```
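How the additional steps might be approximated with `ImageDataGenerator`: the width-shift and zoom ranges below are my rough stand-ins for variable spacing and aspect ratio, not values from the paper:

```python
from keras.preprocessing.image import ImageDataGenerator

# reshape to (samples, height, width, channels) and scale to [0, 1]
X_train = X_train.reshape(X_train.shape[0], 105, 105, 1).astype('float32') / 255
X_test = X_test.reshape(X_test.shape[0], 105, 105, 1).astype('float32') / 255

datagen = ImageDataGenerator(width_shift_range=0.05, zoom_range=[0.85, 1.15])
train_generator = datagen.flow(X_train, y_train, batch_size=128)
```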
Create Model
When the CNN model is trained fully on a synthetic dataset, it suffers a significant performance drop when tested on real-world data, compared to when it is applied to another synthetic validation set. This alludes to discrepancies between the distributions of synthetic and real-world examples. They propose to decompose the N CNN layers into two sub-networks to be learned sequentially:
Unsupervised cross-domain sub-network Cu, which consists of the first K layers of the CNN. It accounts for extracting low-level visual features shared by both synthetic and real-world data domains. Cu will be trained in an unsupervised way, using unlabeled data from both domains. It constitutes the crucial step that further minimizes the low-level feature gap, beyond the previous data augmentation efforts.
Supervised domain-specific sub-network Cs, which consists of the remaining N − K layers. It accounts for learning higher-level discriminative features for classification, based on the shared features from Cu. Cs will be trained in a supervised way, using labeled data from the synthetic domain only.
Firstly we modify the order of the picture channels to avoid an `OverflowError`.
```python
from keras import backend as K
```
Note the difference in the format that `keras` uses across versions.
```python
# older keras: K.set_image_dim_ordering('tf')
# newer keras: K.set_image_data_format('channels_last')
```
Secondly, write the create-model function to define the architecture of the CNN layers.
```python
from keras.models import Sequential
```
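The exact DeepFont layer configuration is not reproduced here; below is a simplified sketch of a `create_model` function in the same spirit (a few shared convolutional layers followed by fully connected classification layers). The filter counts and kernel sizes are assumptions:

```python
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, BatchNormalization, Flatten, Dense, Dropout

def create_model(num_classes, input_shape=(105, 105, 1)):
    model = Sequential()
    # "shared" low-level sub-network (Cu)
    model.add(Conv2D(64, (11, 11), activation='relu', input_shape=input_shape))
    model.add(BatchNormalization())
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Conv2D(128, (3, 3), activation='relu', padding='same'))
    model.add(BatchNormalization())
    model.add(MaxPooling2D(pool_size=(2, 2)))
    # domain-specific sub-network (Cs)
    model.add(Conv2D(256, (3, 3), activation='relu', padding='same'))
    model.add(Conv2D(256, (3, 3), activation='relu', padding='same'))
    model.add(Flatten())
    model.add(Dense(1024, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(num_classes, activation='softmax'))
    return model
```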
Then create and compile the model, using the gradient descent (with momentum) optimizer and the CNN architecture we just defined.
```python
from keras import optimizers
```
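Compiling with SGD plus momentum; the learning rate and momentum values below are typical choices, not necessarily the original ones:

```python
from keras import optimizers

model = create_model(num_classes=len(font_names))
sgd = optimizers.SGD(lr=0.01, momentum=0.9, nesterov=True)
model.compile(loss='categorical_crossentropy', optimizer=sgd, metrics=['accuracy'])
```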
Periodically save the model to disk and get a view of the internal states and statistics of the model during training.
```python
from keras import callbacks
```
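A sketch of the callbacks and the training call; `model_store_path`, the batch size, and the epoch count are placeholders:

```python
from keras import callbacks

model_store_path = 'model/deepfont.h5'
my_callbacks = [
    # save the best model seen so far to disk
    callbacks.ModelCheckpoint(model_store_path, monitor='val_loss', save_best_only=True),
    # stop early if the validation loss stops improving
    callbacks.EarlyStopping(monitor='val_loss', patience=5),
    # log internal states and statistics for TensorBoard
    callbacks.TensorBoard(log_dir='logs'),
]

model.fit(X_train, y_train, batch_size=128, epochs=20,
          validation_data=(X_test, y_test), callbacks=my_callbacks)
```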
Evaluate
It’s necessary to evaluate a model after training to test whether it has met our expectations. If not, there may be some problem with our datasets or with the arguments used to compile the model.
Load Model
Load the model from `model_store_path` and print the model evaluation information on the screen.
```python
from keras.models import load_model
```
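Loading and evaluating, assuming the same `model_store_path` and held-out test arrays as above:

```python
from keras.models import load_model

model = load_model(model_store_path)
loss, accuracy = model.evaluate(X_test, y_test, verbose=0)
print('test loss: {:.4f}, test accuracy: {:.4f}'.format(loss, accuracy))
```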
Load Image
Load the test image from `image_path`, preprocess it with the `blur_image` function, and convert the image to an array.
```python
import PIL
```
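A sketch of this step, reusing `pil_image` and `blur_image` from above and shaping the array the way the model expects; the image path is a placeholder:

```python
import numpy as np

image_path = 'test.png'                  # placeholder path
img = blur_image(pil_image(image_path))  # load, resize, and blur the patch
arr = np.asarray(img).astype('float32') / 255
arr = arr.reshape(1, 105, 105, 1)        # batch of one grayscale patch
```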
Inference
Firstly, write the label-restoring function to convert the font label from an integer back to a font-name string.
```python
def rev_conv_label(label):
```
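The inverse of `conv_label`, assuming the same `font_names` list:

```python
def rev_conv_label(label):
    # map an integer font label back to the font-name string
    return font_names[label]
```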
Then use the loaded model to predict on the image array.
```python
import numpy as np
```
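Prediction is an `argmax` over the softmax output, using the array prepared above:

```python
import numpy as np

probabilities = model.predict(arr)
predicted_font = rev_conv_label(int(np.argmax(probabilities[0])))
```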
Lastly, show the prediction results.
```python
import matplotlib.cm as cm
```
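For example, display the patch with the predicted font name as the title; the grayscale colormap is an arbitrary choice:

```python
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import numpy as np

plt.imshow(np.asarray(img), cmap=cm.gray)
plt.title(predicted_font)
plt.show()
```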