ierolsen/Business-Card-Reader-App

Business Card Reader App

The main idea of this project is extracting entities from a scanned business card.

cover

Project Features:

  • Extract Entities (text and data) from an image of a Business Card
    • Entities: Name, Designation, Organization, Phone, Email and Web Address

Tasks:

  • 1-Location of Entity
  • 2-Text of Corresponding Entity

Examples:

  • Name
  • Designation
  • Organization
  • Phone
  • Email
  • Web Address

1

Technologies:

  • Computer Vision
    • Scanning Document
    • Identify Location of Text
    • Extract the Text from Image

Using OpenCV and Tesseract OCR


  • Natural Language Processing
    • Extract Entities from Text
    • Cleaning and Parsing

Using Pandas, spaCy, RegEx


Stages of Development

1-Setting up Project

  • Installations

2-Data Preparation

  • Extract Text and Location from Business Card

3-Labelling

  • BIO Tagging

4-Data Preprocessing

  • Text Cleaning and Processing

5-Training Named Entity Recognition

  • Train Machine Learning Model

6-Prediction

  • Parsing and Bounding Box

7-Document Scanner App

  • Automatic Document Scanner App

Architecture

Business Card -> Extract Text from Image Using OCR -> Text -> Text Cleaning -> Deep Learning Model Trained in spaCy for NER -> Entities

Training Architecture

Collected Data -> Extract Text from Image Using OCR -> Text -> Labeling -> Text Cleaning -> Train NER Model in spaCy


Installations

Environment Installation

conda create -n docscanner python=3.9
conda activate docscanner
pip install -r requirements.txt

If you do not use Anaconda, type this:

python -m venv docscanner 

Activation:

.\docscanner\Scripts\activate

For Linux or Mac:

source docscanner/bin/activate
pip install -r requirements.txt

Install Tesseract OCR and Pytesseract

Installation for Tesseract OCR

https://tesseract-ocr.github.io/tessdoc/Installation.html

For windows: https://digi.bib.uni-mannheim.de/tesseract/

And download this: tesseract-ocr-w32-setup-v4.1.0.20190314.exe

Note: when you install Tesseract OCR, note the path where it is installed. It will be required in the environment setup.

2

After the installation of Tesseract, check “Environment Variables”.

Click Path and check the entries. If the Tesseract path is not there, you can add it to the environment variables manually.

3

Installation of PyTesseract

After this installation, go to the terminal and type

pip install pytesseract

Installation of spaCy

Go to this website: https://spacy.io/usage

For Windows

pip install -U spacy
python -m spacy download en_core_web_sm


Section 1 - Data Preparation with PyTesseract

Notebook: 01_PyTesseract.ipynb

Open a new page in Jupyter Notebook and import all the libraries we installed 4 and it works without any errors!

Hierarchy of PyTesseract - How it works -

There are 5 levels in PyTesseract.

  • Level 1 This defines the page. If there is only one image, there is only one page.

  • Level 2 It defines the block.

  • Level 3 It defines the paragraph.

  • Level 4 This is for the line.

  • Level 5 It is for words.

First, Level 1 defines the page. Within the page it defines blocks; within each block it detects paragraphs; within each paragraph it detects lines; and within each line it detects words. Then it detects the letters within the words.

After all these steps, each letter is passed to the machine learning model.

Level 1 - Page

In this case, we only have one image which means, there is only one page. 5

Level 2 - Block

6

Level 3 - Paragraph

7

Level 4 - Line

8

Level 5 - Words

9

After all these steps, it will detect all the letters (I am a bit too lazy to frame each word here :) but I will draw them, and you will be able to find them below).

After the letters are detected, the machine learning model classifies each one: which letter, which digit, and so on.

Section 1.2

Now, we will get this hierarchy from the image as data using PyTesseract.

To get that information, PyTesseract provides a dedicated function called image_to_data.

When you execute the command, here is what happens: 10

And now, I will split the data line by line 11 12

Now I will take every element from the list, split it by the tab character “\t”, and create a new list. 13 Here the first element is separated; I will apply this to all elements. 14 15 Then I will turn them into a DataFrame. 16

You should also notice that one of the columns is called level; it is the hierarchy level I mentioned before, and the *_num columns hold the block, paragraph, line and word numbers. This is how we extract data from an image into a pandas DataFrame, and through this we have much clearer information.
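The split-and-DataFrame steps above can be sketched like this; the sample string is a hypothetical, trimmed version of what image_to_data returns (the real output uses the same tab-separated header):

```python
import pandas as pd

# Hypothetical sample of what pytesseract.image_to_data returns:
# a TSV string with one header row and one row per detected element.
sample = (
    "level\tpage_num\tblock_num\tpar_num\tline_num\tword_num\tleft\ttop\twidth\theight\tconf\ttext\n"
    "1\t1\t0\t0\t0\t0\t0\t0\t600\t400\t-1\t\n"
    "5\t1\t1\t1\t1\t1\t25\t30\t120\t20\t96\tjames\n"
    "5\t1\t1\t1\t1\t2\t150\t30\t100\t20\t94\tbond\n"
)

# Split into lines, then split each line by the tab character
rows = [line.split("\t") for line in sample.strip().split("\n")]

# The first row is the header, the rest are data
df = pd.DataFrame(rows[1:], columns=rows[0])
print(df[["level", "conf", "text"]])
```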

To show this, I will draw boxes at the detected positions, considering what each level means.

  • Level 2: Block
  • Level 3: Paragraph
  • Level 4: Line number
  • Level 5: Text

17

Before drawing the boxes, I should handle the missing values and convert the types to a proper form.

1- Drop Missing Values
2- Convert the Columns into integers

18

Drawing:

  • l: level
  • x: left
  • y: top
  • w: width
  • h: height
  • c: confidence score

19

20 This is what PyTesseract does :)



Section 2 - Data Preprocessing and Preparation

Notebook: 02_Data_Preparation.ipynb

Now I will apply all these steps to the whole dataset.

For this, I will open a new notebook and import the libraries I will use. Using glob I will get the paths of the images, and using os I will separate the filenames 21

As in the first notebook, I will do the same steps and get a DataFrame 22

But now, I will only keep the rows whose conf is greater than 30, and I will create a new DataFrame called businessCard. 23 24

Here is the result.

I did these steps on one card to see what would happen. Looks super! Now I will apply them to all the data.

25

After getting a new DataFrame called allBusinessCard, I will save it into a csv file.
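A rough sketch of this collection step, assuming hypothetical per-card OCR DataFrames (collect_cards and the sample frames are my own illustration, not the notebook's exact code):

```python
import pandas as pd

def collect_cards(frames):
    """Keep words whose confidence is greater than 30, tag each row
    with the id of the card it came from, then stack everything."""
    kept = []
    for card_id, df in frames.items():
        df = df.copy()
        df["conf"] = df["conf"].astype(float)
        df = df[(df["conf"] > 30) & (df["text"].str.strip() != "")]
        df["id"] = card_id
        kept.append(df[["id", "text"]])
    return pd.concat(kept, ignore_index=True)

# Hypothetical OCR output for two cards
frames = {
    "card_001": pd.DataFrame({"conf": ["96", "12"], "text": ["james", "##"]}),
    "card_002": pd.DataFrame({"conf": ["88", "91"], "text": ["MI6", "london"]}),
}
allBusinessCard = collect_cards(frames)
```

Saving is then a single `allBusinessCard.to_csv(...)` call.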

The next step is labeling this data, for example: name, organization, phone number, etc.

Labeling

Now, what I will do is tag all the words in the csv file.

BIO / IOB Format

SOURCE: https://medium.com/analytics-vidhya/bio-tagged-text-to-original-text-99b05da6664 The BIO / IOB format (short for inside, outside, beginning) is a common tagging format for tagging tokens in a chunking task in computational linguistics (ex. named-entity recognition). The B- prefix before a tag indicates that the tag is the beginning of a chunk, and an I- prefix before a tag indicates that the tag is inside a chunk. The B- tag is used only when a tag is followed by a tag of the same type without O tokens between them. An O tag indicates that a token belongs to no entity / chunk.

The following figure shows what a BIO-tagged sentence looks like:

26
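As a tiny text stand-in for the figure, a BIO-tagged card line might look like this (tokens and tags are invented for illustration):

```python
# The first token of an entity gets a B- tag, continuation tokens get I-,
# and everything that belongs to no entity gets O.
tokens = ["james", "bond", "secret", "agent", "at", "MI6"]
tags   = ["B-NAME", "I-NAME", "B-DES", "I-DES", "O", "B-ORG"]

tagged = list(zip(tokens, tags))
```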

Entities

Description Tag
Name NAME
Designation DES
Organization ORG
Phone Number PHONE
Email Address EMAIL
Website WEB

27

Unfortunately, there are no shortcuts for tagging. I have to do this manually inside the csv file.

After this long and boring process, I will prepare the data for the training.



Section 3 - Data Preprocessing and Cleaning

Notebook: 03_Data_Preprocessing.ipynb

1-Data Preprocessing

SpaCy Data Format:

28

In this example from the spaCy documentation, the annotation [(0, 11, “BUILDING”)] covers 11 characters in total: the BUILDING entity, “Tokyo Tower”, starts at character index 0 and ends at index 11. That is what I need to do to prepare the data; I will determine the entity spans like this.

Link: https://spacy.io/usage/training#basics
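The format can be checked directly in Python; this uses the documentation's own example:

```python
# The spaCy training format pairs the raw text with character-offset
# annotations; this is the example from the spaCy docs.
train_example = (
    "Tokyo Tower is 333m tall.",
    {"entities": [(0, 11, "BUILDING")]},
)

text, annotations = train_example
start, end, label = annotations["entities"][0]
assert text[start:end] == "Tokyo Tower"  # the span the offsets point at
```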

Before starting, convert the csv file to tsv (tab separated values). To convert the csv file, just click “Save As” and choose Tab Delimited txt.

And, time to open the file

29

When we look at the data, it looks like this:

30

I will apply the same methods as in the 2nd notebook

31

This is the data I have right now. I will also turn it into a pandas DataFrame

32

2-Data Cleaning

This section is the cleaning process. In this case, I will remove white space and unwanted special characters because I don't need them.

First, I will define white space. There are different ways to do that, but a useful way is the “string” library.

The next thing is defining special characters. But I will not remove all special characters here; for instance “@” is important for the Email entity.

33

In the above image, I also defined a function which removes white space and special characters.
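Since the function itself is only shown as an image, here is a minimal sketch of what such a clean_text function could look like; the exact set of characters kept or removed is my assumption (note that “@”, “.” and “-” are deliberately not removed):

```python
import string

def clean_text(txt):
    """Strip whitespace and unwanted special characters from a token.
    '@', '.' and '-' are kept because emails and web addresses need
    them (my assumption of the kept set, not the notebook's exact one)."""
    whitespace = string.whitespace
    punctuation = "!#$%&'()*+:;<=>?[\\]^`{|}~"  # note: '@' and '.' are NOT here
    table = str.maketrans("", "", whitespace + punctuation)
    return str(txt).lower().translate(table)
```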

I will apply this function to the DataFrame

34

The next thing I will do is convert the data into the spaCy format.

Converting to SpaCy Format:

35 36 37

Basically, what I am creating is: content is all the information in the text, and annotations holds the labels with their start and end positions.

I am not interested in “O” because it means outside; I am only interested in “B” and “I”.
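A minimal sketch, using my own reconstruction of the conversion, of how the (content, annotations) pair can be built from BIO-tagged tokens:

```python
# Hypothetical cleaned tokens with their BIO tags
tokens = ["james", "bond", "mi6", "007-1953"]
tags   = ["B-NAME", "I-NAME", "B-ORG", "B-PHONE"]

content = " ".join(tokens)
entities, start = [], 0
for token, tag in zip(tokens, tags):
    end = start + len(token)
    if tag != "O":                       # ignore outside tokens
        label = tag[2:]                  # strip the "B-" / "I-" prefix
        if tag.startswith("I-") and entities and entities[-1][2] == label:
            # continuation token: extend the previous span instead
            prev = entities[-1]
            entities[-1] = (prev[0], end, label)
        else:
            entities.append((start, end, label))
    start = end + 1                      # +1 for the separating space

annotations = {"entities": entities}
```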

38

Let's check whether the annotation is correct.

39

As can be seen, the start and end positions of the phone number are correct.

After this step, I will apply these steps to the whole dataset.

40 41 42

Splitting Data

First I will shuffle the dataset

43

And then, I will split the data 90% - 10%

44

The next thing is saving the data into the data folder using the “pickle” library.
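The shuffle/split/save steps can be sketched like this (the dataset and output directory here are placeholders):

```python
import os
import pickle
import random
import tempfile

# Hypothetical dataset: a list of (content, annotations) pairs
data = [(f"card text {i}", {"entities": []}) for i in range(100)]

random.seed(42)     # seeded only so this sketch is reproducible
random.shuffle(data)

split = int(len(data) * 0.9)          # 90% train, 10% test
train, test = data[:split], data[split:]

# Save both parts with pickle, as in the notebook (paths are illustrative)
outdir = tempfile.mkdtemp()
with open(os.path.join(outdir, "train.pickle"), "wb") as f:
    pickle.dump(train, f)
with open(os.path.join(outdir, "test.pickle"), "wb") as f:
    pickle.dump(test, f)
```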

45

In the next step, I will train a Named Entity Recognition (NER) model.



Section 4 - Train Named Entity Recognition (NER)

Code: preprocess.py

spaCy is one of the most popular and useful frameworks for Natural Language Processing. It is easy to use, and it offers a lot of predefined models.

https://spacy.io/ https://spacy.io/usage/training

What I will do is take the model, use the framework, and train. It is very simple.

To get the model configuration, first visit the website and use the Quickstart. Choose what you need, and spaCy will give you the predefined config.

46

Then click download. That’s all!

To fill in all the configuration details, I need to run one magic command in the terminal. When I open the downloaded base config file, there is a note giving the command:

python -m spacy init fill-config ./base_config.cfg ./config.cfg

47

I will paste it to terminal.

48

49

And it worked!

Now I will train the model by following commands.

50

As you see here, the expected format is .spacy, but earlier I saved the train and test data as .pickle, so now I will convert them into .spacy. For this, the documentation has a section called Preparing Training Data, and converting is very easy: I will copy the code and just make some changes. That's all!
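A sketch of that conversion, closely following the Preparing Training Data section of the spaCy docs (the sample pair and file name are illustrative):

```python
import spacy
from spacy.tokens import DocBin

def to_spacy_file(pairs, out_path):
    """Convert (text, {'entities': [...]}) pairs into a .spacy DocBin file."""
    nlp = spacy.blank("en")            # blank pipeline, only used for tokenizing
    db = DocBin()
    for text, annot in pairs:
        doc = nlp.make_doc(text)
        ents = []
        for start, end, label in annot["entities"]:
            span = doc.char_span(start, end, label=label,
                                 alignment_mode="contract")
            if span is not None:       # skip spans that don't align to tokens
                ents.append(span)
        doc.ents = ents
        db.add(doc)
    db.to_disk(out_path)

pairs = [("james bond mi6", {"entities": [(0, 10, "NAME"), (11, 14, "ORG")]})]
to_spacy_file(pairs, "train.spacy")
```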

51

All I need to do is run the preprocess.py file, which converts the data from .pickle to the .spacy format.

52

And It’s ready, that’s all here!

53

And as the final step of the training process, I need to train the model using the config file. Before this, I will create a folder called output to save the output files.

python -m spacy train config.cfg --output output --paths.train data/train.spacy --paths.dev data/test.spacy

When I run this command, the training starts

54

After the training, there are two new folders inside the output directory.

55

The two folders are shown in the image above: model-best contains the model with the highest score (0.64), and model-last contains the model from the last training step (0.62).

I will use the best one for prediction.



Section 5- Prediction

Notebook: 04_Predictions.ipynb

In this section, I will test the NER model that I trained with spaCy. All the steps I will apply are the same as before.

From the old notebook, I will copy and paste the function for cleaning text.

56

STEPS:

1- Load NER Model

57

2- Load Image

58

3- Extract Data from the Image using PyTesseract

59

4- Convert into DataFrame

60

5- Convert Data into Content

61

6- Get Predictions from NER Model and Render the Content

There are several ways to render the content 62

7- Render the Content

63

As can be seen here, the NER (named entity recognition) model classified the content.
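The prediction loop itself looks roughly like this; since the trained model from output/model-best isn't bundled here, this sketch substitutes a rule-based entity_ruler purely so the doc.ents loop can run:

```python
import spacy

# The notebook loads the trained model with spacy.load('output/model-best');
# a rule-based stand-in is used here only to demonstrate reading doc.ents.
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "NAME", "pattern": "james bond"},
    {"label": "ORG", "pattern": "mi6"},
])

doc = nlp("james bond works at mi6")
predictions = [(ent.text, ent.label_) for ent in doc.ents]
```

In the notebook, spacy.displacy.render(doc, style="ent") is one way to render these entities.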

Now I will tag every word again.

Tagging

There are different ways to do BIO tagging, but I will use the same doc. 64

Each token is a word. I will convert this information to a DataFrame. 65

I turned the tokens into a DataFrame, and now I will combine it with doc_text 66

And this is basically what the lambda function does: 67

After this step, I will add one more column to the DataFrame: entities. 68 69

Here you can see there are some “NaN” values; I will replace them with “O”.

70

As the next step, I will combine “label” with the data_clean column. 71

If I do this join, it will be much more convenient for drawing bounding boxes. 72

The reason for adding “+1” is that words are separated by one space. If I take the cumulative sum, I get the position just past each word, and subtracting one removes the space. Here are the correct end positions. 73

I will also create the start position: the start position is the end position minus the length of the word. 74
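The cumulative-sum trick can be sketched with a toy DataFrame (the tokens are invented):

```python
import pandas as pd

# Hypothetical cleaned tokens, one row per word as in df_clean
df = pd.DataFrame({"token": ["james", "bond", "mi6"]})

length = df["token"].str.len()
# cumulative sum of (length + 1) gives the position just past each word's
# trailing space; subtracting 1 removes that space -> the word's end index
df["end"] = (length + 1).cumsum() - 1
df["start"] = df["end"] - length

content = " ".join(df["token"])
```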

Now I will combine them inside of df_clean 75

And now, I will merge all the DataFrames into a new one 76

text and token may look the same, but let's check again by looking at the last columns of the DataFrame 77

As can be seen, token contains the clean text.

Bounding Box

To draw the bounding boxes, I need to take all the information except the rows labeled O. For this, I will filter the main DataFrame. 78

And result, 79

As next step, I will combine BIO information.

For this, I will separate the labels by applying a lambda function which removes the B-/I- prefix of each label. 80

And now I will define a class which groups tokens that belong to the same entity. 81 82

Based on this info, I will draw the boxes again. But before doing that, I will create two more columns for the right and bottom positions. 83 84

For tagging, I will also group the DataFrame by group

85

NOTE: I changed the image in order to get clearer values, so the next data will be different from the previous ones.

86

87

And the entities are drawn.

Now I will combine the text where the B and I tags are. At the same time I will also do parsing: for example, for the phone number I will only take digits, for the e-mail address I will only keep the relevant special characters, etc.

88 89

It works well! It cleans the special characters, and this is how the parser will work. Now, using the entities, I will save them into a dictionary; for this I will write a basic loop.

90

The basic idea is that if, for instance, the image above has B-NAME: james and I-NAME: bond, I will combine them.

91

The result: 92

Except for the phone number, everything looks great.

Now, to tidy up all the code, I will define a pipeline that contains all these steps plus a prediction function.

From the notebook 04_Predictions.ipynb I copied all the steps and turned them into a function. I only deleted some useless lines of code; nothing actually changed.

You can find the code in prediction.py

And the notebook 05_Final_Predictions.ipynb is my test notebook, where I test my prediction function from prediction.py

93

And here are the results:

Test 1 of Version 1

94 95

As you can see, the model got confused detecting the phone number again :)

Test 2 of Version 1

96

Test 3 of Version 1

97

NOTE:

I have created a folder called VERSION_2. In this folder I only changed the clean_text function: in the first version the text was lowercased, but in Version 2 I removed that step, because some organizations etc. contain uppercase words. It worked slightly better, and it is not just me saying this, the accuracy says it. Here are the accuracy reports:

98

In the first version the best accuracy was 0.64, but here, it is 0.72. Much better!

Here are some examples of predictions.

Test 1 of Version 2

99

Test 2 of Version 2

100

Test 3 of Version 2

101



Section 6 - Document Scanner

Notebook: Document_Scanner.ipynb

In this notebook, I will work on fixing images that are rotated, skewed, etc. This is necessary for PyTesseract to work properly; PyTesseract does not work well with rotated images.

102

Steps:

1- Resize the image and set the aspect ratio
2- Image Processing

  • Enhance
  • Gray Scale
  • Blur
  • Edge Detection
  • Morphological Transform
  • Contours
  • Find Four Points

1-Resize the image and set aspect ratio

103

2-Image Processing

Enhance

104

Edge Detection

105

As you can see, there is some noise around the image; I will apply morphological functions to clean it up.

After dilation, here is the result; as can be seen, the thickness has increased. As the second step I will apply closing 106

Closing

107

Now I will find the contours. 108

What I will do is multiply the four_points by a multiplier: the width of the original image divided by the width of the resized image. 109

Using these four points, I will warp the original image with the imutils library. 110

And now it is time to define a function which does all these steps.

111

To analyse the images, I also return the resized image (with its contours drawn) and the closing image.

Here is an example:

112

Another Example:

113

Another Example:

114

Another Example:

115

As the next step, I will also define a function for finding good brightness and contrast. 116
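One common form such a function takes (and my assumption of what is meant here) is a linear brightness/contrast adjustment:

```python
import numpy as np

def magic_image(img, alpha=1.5, beta=20):
    """Assumed form of the fix: new = alpha * old + beta, clipped back into
    the valid 0-255 range (alpha = contrast gain, beta = brightness offset)."""
    return np.clip(alpha * img.astype(np.float32) + beta, 0, 255).astype(np.uint8)

# Toy grayscale patch: every pixel 100 becomes 1.5 * 100 + 20 = 170
gray = np.full((10, 10), 100, dtype=np.uint8)
bright = magic_image(gray)
```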

Here is its example:

117 As can be seen, the color balance of the magic image is much clearer; when I apply the NER algorithm it will be easier to read and detect the text.

Another example:

118

In summary, the Magic Image function increases the contrast and brightness of the image.

Integration of NER Prediction

The first thing I do is import my best model, which is the Version 2 model, and then I read one of the images.

Here are results:

119 120

121

122

123

Unfortunately, the model cannot predict well because I didn't feed it enough data; that is the result. If I gave it more data, I would definitely get much better results. But that is not my priority; I am doing this exercise to learn and practise with PyTesseract.
