Challenge

Develop a solution to define terms in PDF files

Solution

We developed a mechanism from scratch that recognizes the font style (italic, bold, underline) and term boundaries, separating it from the following sentence and indicating the exact location in the document. The most complicated task was to recognize the distribution of text in tables since standard algorithms can not cope with this.

Our team implemented the solution based on Artificial Intelligence and Machine Learning methods

Result

The client received a useful tool that saves time searching for the necessary data in documents

Tech Stack

  • Python
  • OCR
  • Tesseract 3,4
  • OpenCV
  • Pandas
  • PostgreSQL
  • Django
  • DRF
  • AWS

Team

2 members

  • developers

Employee's feedback

comma comma

Working on artificial intelligence algorithms is always interesting. It is required to calculate and analyze all variants of the data, to direct to the solution of the problem of finding the term. Our system is constantly self-training and this makes it unique

comma comma

Other Projects