How I'm Making an App with OCR

Photo by Jon Tyson / Unsplash

I had this idea to make an app that can complete any school (maybe even work) assignment. It's a simple idea, but a difficult app to make, and there's a lot I'm learning along the way. Here's how I'm doing it.

As of right now I'm still unsure whether I want to build a mobile app or a web app. Both have their own issues (and plenty of others besides). I'm using Apple's Vision ML library in Swift to detect text in the image scans. The problem with the web route is that I'm not sure whether I can use the Vision OCR API on a website, especially in production (this could easily violate Apple's Developer Terms of Service). The problem with the mobile route is that I don't really know how to make a good UI that complies with Apple's Human Interface Guidelines, and Apple could deny my submission to the App Store (honestly, that's likely to happen).

I also need to go through the "Going Live" steps with OpenAI, since I would be using their private beta API in a production application. It's possible to use it, I just don't think they would approve my use case. On top of that, it's been very difficult to send an HTTPS request to the OpenAI API from Swift and get a response back. I'm still very new to Swift, so I haven't figured out an easy way to send the request and then return the response in my Swift ViewController. There are one or two third-party API libraries out there, but they both fall short: the documentation is sparse, and I keep running into small issues with them. The bigger problem is that I still cannot get the API response back into the Swift ViewController.

Now here's what's gone well so far. I've successfully implemented Apple's Vision API in an app. Right now the app is very bare-bones and can only scan documents; it's still a very unfinished prototype.
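Roughly, the text recognition side looks like this. This is just a sketch: the `recognizeText` helper name and the simplified error handling are my own placeholders, and the `canImport` guards only exist so the pure-Swift part stays portable.

```swift
import Foundation
#if canImport(Vision) && canImport(UIKit)
import Vision
import UIKit
#endif

// Joining the recognized lines into one string is plain Swift,
// so it lives outside the platform guard.
func joinRecognizedLines(_ lines: [String]) -> String {
    lines.joined(separator: "\n")
}

#if canImport(Vision) && canImport(UIKit)
// A minimal sketch of text recognition with Vision.
// Error handling is simplified for readability.
func recognizeText(in image: UIImage, completion: @escaping (String) -> Void) {
    guard let cgImage = image.cgImage else { return }

    let request = VNRecognizeTextRequest { request, _ in
        guard let observations = request.results as? [VNRecognizedTextObservation] else { return }
        // Take the top candidate for each detected line of text.
        let lines = observations.compactMap { $0.topCandidates(1).first?.string }
        completion(joinRecognizedLines(lines))
    }
    request.recognitionLevel = .accurate

    let handler = VNImageRequestHandler(cgImage: cgImage, options: [:])
    try? handler.perform([request])
}
#endif
```

The completion handler hands back the full recognized text as one string, which is exactly the "store the extracted text to a variable" shape I want before sending it anywhere.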

Originally I was planning to use AWS for just about all of the application, but of course there are issues with that too. The main one is cost. AWS is expensive, and while I'm sure I could find ways to save money, it would still cost money, and I don't want to add an in-app purchase. In the end it would also have been way more complicated. I would have needed to scan the document, upload the scan to an S3 bucket, and then send it from there to the Rekognition API. All of that is more complicated than it needs to be, and it incurs a real price.

Here are my next steps. The document scanning feature is done. Next, I need to figure out how to extract the text and send it via an HTTPS request to the OpenAI API. I'm thinking of storing the extracted text in a variable, which can then be passed along in the HTTPS request. Using OpenAI brings one more issue, though: how do I optimize the prompt for each question submitted? In other words, how do I ask the AI something different every time? The model will need to solve math, English, science, history, and foreign language problems. How do I analyze the question and ask the AI something that makes sense? And how do I phrase English questions like "What did Bill Gates write in this story?"

Example of what I mean:

Algebra question: 2x+9 = 21
Optimized prompt for the model: 'Solve 2x+9 = 21'
French question: What does 'Quel' mean?
Optimized prompt for the model: Translate 'Quel' from French to English.
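One rough way to sketch that mapping in code (the `Subject` cases and the template strings here are just placeholders, not a final design; a real version would first need to classify the question):

```swift
import Foundation

// Hypothetical subject categories for a scanned question.
enum Subject {
    case algebra
    case french
    case generic
}

// Build an "optimized" prompt for the model from a raw question.
func buildPrompt(for question: String, subject: Subject) -> String {
    switch subject {
    case .algebra:
        return "Solve \(question)"
    case .french:
        return "Translate \(question) from French to English."
    case .generic:
        return "Answer the following question: \(question)"
    }
}
```

So `buildPrompt(for: "2x+9 = 21", subject: .algebra)` gives `"Solve 2x+9 = 21"`. The hard, unsolved part is deciding the `Subject` automatically from the scanned text.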

I plan to use the Davinci Instruct model from OpenAI, since I've found it to be the most capable model they have for this use case. It's very good at solving generic problems like these.

After sending the problems to the OpenAI API, I then need to return the answers. This should be pretty easy to do in the app: I should be able to make a text box for each question and display its answer.
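A rough sketch of that request/response round trip. The engine path, field names, and helper names here are my assumptions based on OpenAI's completions API, and `"YOUR_API_KEY"` is a placeholder; this is the shape I'm aiming for, not working production code.

```swift
import Foundation
#if canImport(FoundationNetworking)
import FoundationNetworking
#endif

// Minimal request/response shapes for the completions endpoint (assumed).
struct CompletionRequest: Codable {
    let prompt: String
    let max_tokens: Int
}

struct CompletionResponse: Codable {
    struct Choice: Codable {
        let text: String
    }
    let choices: [Choice]
}

// Build the POST request with the auth header and JSON body.
func makeRequest(prompt: String, apiKey: String) throws -> URLRequest {
    let url = URL(string: "https://api.openai.com/v1/engines/davinci-instruct-beta/completions")!
    var request = URLRequest(url: url)
    request.httpMethod = "POST"
    request.setValue("Bearer \(apiKey)", forHTTPHeaderField: "Authorization")
    request.setValue("application/json", forHTTPHeaderField: "Content-Type")
    request.httpBody = try JSONEncoder().encode(CompletionRequest(prompt: prompt, max_tokens: 64))
    return request
}

// Send the prompt and hand back the first answer text (or nil on failure).
// Remember to hop back to the main queue before touching any UI.
func sendPrompt(_ prompt: String, apiKey: String, completion: @escaping (String?) -> Void) {
    guard let request = try? makeRequest(prompt: prompt, apiKey: apiKey) else {
        completion(nil)
        return
    }
    URLSession.shared.dataTask(with: request) { data, _, _ in
        let answer = data
            .flatMap { try? JSONDecoder().decode(CompletionResponse.self, from: $0) }?
            .choices.first?.text
        completion(answer)
    }.resume()
}
```

Passing a completion closure like this is also how I'd get the answer back into the ViewController: the ViewController calls `sendPrompt` and updates its text box inside the closure.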

But yeah, that's how I plan to make an app with OCR. If you have any suggestions or tips for a Swift newbie like me, feel free to send them my way.

(Bonus! Here's a video demo of current progress.)