| Project Title | Development of Robust Document Image Analysis and Recognition System for Printed Urdu Script |
| Sponsored by | Department of Science & Technology |
| Grant | 52.90 Lakh |
| Duration | 2010 to 2013 |
|
| Aim of Project | |
|
- To develop an Urdu OCR system with the following capabilities:
  - Recognize commonly used Urdu fonts with 95% recognition accuracy at the character level
  - Recognize common Urdu symbols and numerals
  - Handle documents with complex layouts (tables, multiple columns, etc.)
  - Process multi-color pages
- To prepare an annotated corpus of 5000 pages of Urdu script
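The 95% character-level target needs a concrete way to score OCR output against ground truth. The project text does not say how accuracy is measured, so the following is only a minimal sketch under one common assumption: accuracy is one minus the edit distance divided by the reference length. The function names are illustrative, and the count is over Unicode code points, which may not match how the project counts Urdu characters and marks.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # delete ref_char
                current[j - 1] + 1,      # insert hyp_char
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def character_accuracy(reference: str, hypothesis: str) -> float:
    """Assumed metric: 1 - (edit distance / reference length), floored at 0."""
    if not reference:
        return 1.0 if not hypothesis else 0.0
    errors = edit_distance(reference, hypothesis)
    return max(0.0, 1.0 - errors / len(reference))
```

Under this assumed metric, a page would meet the stated target when `character_accuracy(ground_truth, ocr_output) >= 0.95`.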
|
| Scope of the Project | | |
We shall also be developing the first-generation OCR for Urdu script. From an OCR point of view, Urdu is one of the most challenging scripts, because character and word shapes change according to context and the characters are usually joined together. An Urdu word grows in both the horizontal and vertical directions. An Urdu word is a combination of ligatures (characters which join together) and isolated characters. The concept of a space as a word boundary marker is not present in Urdu writing, which makes word segmentation a challenging task. Urdu font developers have estimated that there are around 18,000 ligatures in Urdu, which makes ligature classification a difficult job.
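Because spaces do not mark word boundaries, recognized ligatures must be grouped into words by some other means. As a rough illustration only, the sketch below swaps in a plain lexicon lookup with dynamic programming; the project itself refers to rules and language models, whose details are not given. The function name, the `max_word_len` parameter and the placeholder ligature labels are all hypothetical.

```python
def group_ligatures(ligatures, lexicon, max_word_len=4):
    """Split a ligature sequence into the fewest lexicon words.

    ligatures: list of ligature labels in reading order.
    lexicon: set of tuples, each tuple being one valid word as ligature labels.
    Returns a list of word tuples, or None if no full segmentation exists.
    """
    n = len(ligatures)
    # best[i] holds the fewest-word segmentation of ligatures[:i], if any.
    best = [None] * (n + 1)
    best[0] = []
    for end in range(1, n + 1):
        for start in range(max(0, end - max_word_len), end):
            if best[start] is None:
                continue  # the prefix before this word cannot be segmented
            word = tuple(ligatures[start:end])
            if word in lexicon:
                candidate = best[start] + [word]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[n]


# Toy lexicon with placeholder labels standing in for Urdu ligatures.
toy_lexicon = {("a", "b"), ("c", "d", "e"), ("c",)}
words = group_ligatures(["a", "b", "c", "d", "e"], toy_lexicon)
```

A statistical language model would replace the "fewest words" criterion with a probability score over candidate segmentations.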
|
| Project Methodology | | |
Urdu is written in the Arabic script using the Nastalique writing style. Urdu words are written from right to left, whereas numbers are written from left to right; the script is therefore bidirectional.
There has been considerable work on Arabic OCR. However, all of that work is based on the Naskh style. The Nastalique style used for Urdu presents many more difficulties and is therefore a very different OCR problem. In summary, the challenges include much greater cursiveness, diagonality, mark placement and significantly more contextual shaping. Thus, although the work on Arabic is relevant, those algorithms need to be evolved further for Urdu.
The main modules for OCR of Urdu script will be:
- Pre-processing routines for slant normalisation, smoothing, noise cleaning, skew correction, thinning, etc.
- Segmentation routines for line, ligature and character segmentation
- Feature extraction and recognition of Urdu ligatures and characters
- Development of a language model for Urdu to combine adjacent ligatures into words
- Creation of an annotated corpus of images of 5000 Urdu pages
We propose to develop an Urdu OCR system with around 95% character recognition accuracy on noise-free documents.
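To make the line segmentation module more concrete, here is a minimal sketch of one widely used baseline, the horizontal projection profile. The project does not state which method it uses, so this is an assumption for illustration. It expects a page that is already binarized (1 = ink, 0 = background) and skew-corrected; Nastalique text, with its stacked marks and vertical overlap between lines, may need something more elaborate. The function name and the `min_ink` parameter are hypothetical.

```python
def segment_lines(image, min_ink=1):
    """Find text lines as runs of consecutive rows that contain ink.

    image: list of rows, each row a list of 0/1 pixels (1 = ink).
    min_ink: minimum ink pixels for a row to count as part of a line.
    Returns a list of (top, bottom) row indices, with bottom exclusive.
    """
    lines = []
    start = None
    for y, row in enumerate(image):
        has_ink = sum(row) >= min_ink  # projection value for this row
        if has_ink and start is None:
            start = y                  # a new text line begins
        elif not has_ink and start is not None:
            lines.append((start, y))   # the current line ends at a blank row
            start = None
    if start is not None:
        lines.append((start, len(image)))  # line runs to the page bottom
    return lines


# Tiny synthetic page: two ink bands separated by a blank row.
page = [
    [0, 0, 0],
    [0, 1, 0],
    [1, 1, 0],
    [0, 0, 0],
    [0, 0, 1],
    [0, 0, 0],
]
line_bounds = segment_lines(page)
```

Raising `min_ink` is one simple way to ignore isolated noise pixels between lines, at the risk of cutting off small diacritic marks.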
|
| Intermediate Milestones | |
|
- Year-I
  - Initiation of OCR development for Urdu script using already developed tools
  - Pre-processing routines for Urdu script
  - Statistical analysis of ligatures
  - Development of line and ligature segmentation routines
  - Development of language models for Urdu characters, ligatures and words
- Year-II
  - Feature extraction routines for Urdu script
  - Multi-classifier system for ligature recognition, with separate classifiers for high-frequency and low-frequency ligatures
  - Rules and language models to combine adjacent ligatures to form valid Urdu words
- Year-III
  - Testing and development of the first-generation OCR for Urdu script
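The statistical analysis of ligatures and the split into high-frequency and low-frequency classifiers could be prototyped along the following lines. This is a sketch under an assumed criterion (cumulative corpus coverage); the project does not specify how the two groups are separated, and the function name and `coverage` threshold are hypothetical.

```python
from collections import Counter


def split_by_frequency(ligature_tokens, coverage=0.9):
    """Partition ligature types into high- and low-frequency groups.

    ligature_tokens: iterable of ligature labels as they occur in a corpus.
    coverage: fraction of all occurrences the high-frequency group should cover.
    Returns (high, low), each a list of ligature types, most frequent first.
    """
    counts = Counter(ligature_tokens)
    total = sum(counts.values())
    high, low = [], []
    covered = 0
    for ligature, count in counts.most_common():
        # Keep adding types to the high group until the target coverage is met.
        if total and covered / total < coverage:
            high.append(ligature)
            covered += count
        else:
            low.append(ligature)
    return high, low


# Placeholder labels stand in for Urdu ligatures from an annotated corpus.
corpus = ["x"] * 8 + ["y"] + ["z"]
high_group, low_group = split_by_frequency(corpus, coverage=0.75)
```

With around 18,000 ligatures estimated for Urdu, a split of this kind would let one classifier concentrate on the types that account for most of the running text.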
|
| Project Team | | |
Project staff members for development:
|
|
|
© 2009 ACTDPL Punjabi University, Patiala |
|