Punjabi University Patiala, India, Website http://www.punjabiuniversity.ac.in

http://www.punjabiuniversity.ac.in/sangam/

http://www.advancedcentrepunjabi.org/intro1.asp



	Project Title	Development of Robust Document Image Analysis and Recognition System for Printed Urdu Script


	Sponsored by	Department of Science & Technology


	Grant	52.90 Lakh


	Duration	Year 2010 to 2013

Aim of Project

To develop an Urdu OCR system with the following capabilities:

Recognize commonly used Urdu fonts with 95% recognition accuracy at the character level.
Recognize common Urdu symbols and numerals.
Handle documents with complex layouts (tables, multiple columns, etc.).
Process multi-color pages.

Prepare an annotated corpus of 5000 pages of Urdu script.

Scope of the Project

We shall be developing the first-generation OCR for Urdu script. From an OCR point of view, Urdu is one of the most challenging scripts, as character and word shapes change according to context and the characters are usually joined together. An Urdu word grows in both the horizontal and vertical directions. An Urdu word is a combination of ligatures (characters which join together) and isolated characters. The concept of a space as a word boundary marker is not present in Urdu writing, which makes word segmentation a challenging task. Urdu font developers have estimated that there are around 18,000 ligatures in Urdu, which makes ligature classification a tough job.
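To make the notion of a ligature concrete, the following is a minimal Python sketch that splits a word into ligatures using the joining behaviour of Arabic-script letters. The function name `split_ligatures` and the letter set `NON_FORWARD_JOINING` are illustrative assumptions and not part of the project; the set lists common Urdu letters that do not connect to the letter that follows them and is not claimed to be exhaustive.

```python
import unicodedata

# Assumption: letters that never connect to the FOLLOWING letter
# (alef, alef madda, alef with hamza above/below, dal, ddal, thal,
#  reh, rreh, zain, jeh, waw, waw with hamza, yeh barree).
NON_FORWARD_JOINING = set(
    "\u0627\u0622\u0623\u0625\u062f\u0688\u0630"
    "\u0631\u0691\u0632\u0698\u0648\u0624\u06d2"
)

def split_ligatures(word):
    """Split one word into ligatures (runs of connected letters)."""
    ligatures = []
    current = ""
    for ch in word:
        if unicodedata.combining(ch):
            # Diacritics stay with the letter they decorate.
            if current:
                current += ch
            elif ligatures:
                ligatures[-1] += ch
            else:
                current = ch
            continue
        current += ch
        if ch in NON_FORWARD_JOINING:
            # This letter cannot join forward, so the ligature ends here.
            ligatures.append(current)
            current = ""
    if current:
        ligatures.append(current)
    return ligatures
```

Under this rule the word پاکستان (Pakistan) splits into three ligatures, with a break after each alef. A real system would take joining types from the Unicode character database rather than a hand-written set.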

Project Methodology

Urdu is written using the Arabic script in the Nastalique writing style. Urdu words are written from right to left and numbers from left to right, so the script is bidirectional. From an OCR point of view, Urdu is one of the most challenging scripts, as character and word shapes change according to context and the characters are usually joined together. There has been considerable work on Arabic OCR. However, all that work is based on the Naskh style. The Nastalique style used for Urdu presents many more challenges and is thus a very different OCR problem. In summary, the challenges include much greater cursiveness, diagonality, mark placement and significantly more contextual shaping. This means that although the work on Arabic is relevant, those algorithms need to be further evolved for Urdu. The main modules for OCR of Urdu script will be:

Pre-processing routines for slant normalisation, smoothing, noise cleaning, skew correction, thinning, etc.
Segmentation routines for line, ligature and character segmentation.
Feature extraction and recognition of Urdu ligatures and characters.
Development of a language model for Urdu to combine adjacent ligatures into Urdu words.
Creation of an annotated corpus of images of 5000 Urdu pages.
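As an illustration of the line segmentation module, the sketch below uses a horizontal projection profile, a common baseline technique for splitting a binarized page into text lines. The source does not state which method the project uses, so this is an assumption; the function name `segment_lines` and the `min_ink` parameter are hypothetical. The page is assumed to be a list of rows of 0/1 pixels, with 1 meaning ink.

```python
def segment_lines(binary_image, min_ink=1):
    """Return (top, bottom) row ranges of text lines, bottom exclusive."""
    # Horizontal projection profile: amount of ink in each pixel row.
    profile = [sum(row) for row in binary_image]
    lines = []
    start = None
    for y, ink in enumerate(profile):
        if ink >= min_ink and start is None:
            start = y                      # first inked row of a line
        elif ink < min_ink and start is not None:
            lines.append((start, y))       # blank row closes the line
            start = None
    if start is not None:
        lines.append((start, len(profile)))
    return lines

# Tiny synthetic page: two inked bands separated by blank rows.
page = [
    [0, 0, 0],
    [1, 1, 0],
    [0, 1, 0],
    [0, 0, 0],
    [0, 0, 0],
    [1, 0, 1],
    [0, 0, 0],
]
print(segment_lines(page))  # [(1, 3), (5, 6)]
```

A plain profile like this assumes blank rows between lines. Nastalique text, with its vertical growth and mark placement, may leave no clean gap, so a practical system would likely need extra handling for such cases.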

We propose to develop an Urdu OCR system with around 95% character recognition accuracy on noise-free documents.
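One common way to measure character-level accuracy is to compare OCR output against ground-truth text using edit distance. The source does not define how its 95% figure is to be computed, so the sketch below is an assumed definition (one minus the edit distance divided by the ground-truth length); the function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def character_accuracy(truth, ocr):
    """Fraction of ground-truth characters recovered, clamped to [0, 1]."""
    if not truth:
        return 1.0 if not ocr else 0.0
    return max(0.0, 1.0 - edit_distance(truth, ocr) / len(truth))

print(character_accuracy("abcd", "abxd"))  # 0.75
```

Text from an annotated corpus could serve as `truth`, with the recognizer's output as `ocr`.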

Intermediate Milestones

Year-I

Initiation of OCR development for Urdu script using already developed tools
Pre-processing routines for Urdu script
Statistical analysis of ligatures
Development of line and ligature segmentation routines
Development of language models for Urdu characters, ligatures and words
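The statistical analysis of ligatures could start from simple frequency counts over a ligature-segmented corpus. The sketch below is a hypothetical illustration: it ranks ligatures by frequency with their cumulative share of the corpus, then splits them at a chosen coverage level, which is one possible way to draw the boundary between high- and low-frequency ligatures. The function names and the `coverage` threshold are assumptions, not values from the project.

```python
from collections import Counter

def rank_ligatures(ligature_tokens):
    """Return (ligature, count, cumulative share) in descending frequency."""
    counts = Counter(ligature_tokens)
    total = sum(counts.values())
    ranked = []
    covered = 0
    for ligature, count in counts.most_common():
        covered += count
        ranked.append((ligature, count, covered / total))
    return ranked

def split_by_coverage(ranked, coverage=0.9):
    """Split ranked ligatures into (high, low) frequency groups.

    The high group is the shortest prefix of the ranking whose
    cumulative share reaches the requested coverage.
    """
    high, low = [], []
    reached = False
    for ligature, _count, cumulative in ranked:
        (low if reached else high).append(ligature)
        if cumulative >= coverage:
            reached = True
    return high, low
```

Given around 18,000 ligatures, a ranking of this kind shows how many of them account for most of the running text.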

Year-II

Feature extraction routines for Urdu script
Multi-classifier system for ligature recognition, with separate classifiers for high- and low-frequency ligatures
Rules and language models to combine adjacent ligatures to form valid Urdu words
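Because Urdu writing does not use spaces as word boundaries, recognized ligatures have to be grouped into words. The sketch below shows one simple approach under stated assumptions: a unigram word model with dynamic programming over the ligature sequence. The project's actual rules and models are not described in the source; `group_ligatures`, `word_probs`, `max_len` and `unknown_prob` are all hypothetical names.

```python
import math

def group_ligatures(ligatures, word_probs, max_len=5, unknown_prob=1e-8):
    """Group a ligature sequence into the most probable word sequence.

    word_probs maps a word (its ligatures concatenated) to a probability.
    A single unknown ligature is allowed as a word at a small penalty,
    so a segmentation always exists.
    """
    n = len(ligatures)
    # best[i] = (log-probability, start index of last word) for prefix i
    best = [(-math.inf, None)] * (n + 1)
    best[0] = (0.0, None)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            word = "".join(ligatures[start:end])
            fallback = unknown_prob if end - start == 1 else 0.0
            p = word_probs.get(word, fallback)
            if p <= 0.0:
                continue
            score = best[start][0] + math.log(p)
            if score > best[end][0]:
                best[end] = (score, start)
    # Walk the back-pointers to recover the words in order.
    words = []
    end = n
    while end > 0:
        start = best[end][1]
        words.append("".join(ligatures[start:end]))
        end = start
    return words[::-1]
```

A unigram model ignores context between words; a bigram or higher-order model would be a natural extension once word-level statistics are available.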

Year-III

Testing and development of the first-generation OCR for Urdu script

Project Team

Project staff members for development:

	Principal Investigator	Dr. Gurpreet Singh Lehal Punjabi University, Patiala


	Co-Investigator	Dr. Dharam Veer Sharma Punjabi University, Patiala