Read documents with OCR in Python

I need to read structured data (PDFs and Excel spreadsheets) into a central location, from which other files can read the data to produce consistent output.
The first challenge is reading the data from the input files; the second is loading the recognized data into a central location that feeds the output files.
Any insight into OCR in Python, or alternative text and variable recognition approaches, would be highly appreciated.
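If the PDFs are scanned images, a common stack is pdf2image (to rasterize pages) plus pytesseract (Tesseract OCR); text-based PDFs and Excel files can be read directly without OCR. A minimal sketch, assuming those packages plus the Tesseract and Poppler binaries are installed, with placeholder file paths:

```python
# A minimal sketch: OCR scanned PDF pages with pytesseract, read Excel files
# with pandas, and collect everything into one "central" location.
# File paths and column names are placeholders.
import pandas as pd
import pytesseract
from pdf2image import convert_from_path

records = []

# PDFs: rasterize each page to an image, then OCR it.
for page in convert_from_path("input/report.pdf"):
    text = pytesseract.image_to_string(page)
    records.append({"source": "report.pdf", "text": text})

# Excel: no OCR needed, pandas reads the cells directly.
df_xlsx = pd.read_excel("input/data.xlsx")

# Central location: persist both to CSV for downstream files to consume.
pd.DataFrame(records).to_csv("central/ocr_text.csv", index=False)
df_xlsx.to_csv("central/spreadsheet_data.csv", index=False)
```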

Related

Correctly formatting my csv file for input into ML algorithm

I am having a lot of trouble formatting my CSV file in a way that makes it suitable for a machine learning algorithm in Python. I have followed various tutorials, but none give guidance on a column in my CSV file that contains huge arrays of data.
For context: I am collecting Channel State Information (CSI) data from various individuals, and the program collects this data into a big CSV file; the part I am interested in is presented as a huge array of numbers. I want the ML algorithm to identify individuals based on this data, and I am currently having trouble finding a way to format the CSV file.
Thanks
I have tried various tutorials, but no algorithm accepts my array column; I often get a ValueError saying it can't convert a string to a float.
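That ValueError usually means each array arrived as one string (e.g. "[0.1, 0.2, ...]") rather than as numbers. A minimal sketch of parsing such a column, assuming placeholder column names and equal-length arrays:

```python
# A minimal sketch, assuming the CSI column stores each sample's array as a
# single string like "[0.1, 0.2, 0.3]". Column names are placeholders.
import ast
import numpy as np
import pandas as pd

df = pd.read_csv("csi_data.csv")

# Parse each string into a real list of floats, then stack into a 2-D matrix
# (one row per sample) that scikit-learn-style estimators accept.
df["csi"] = df["csi"].apply(ast.literal_eval)
X = np.vstack(df["csi"].to_numpy())   # shape: (n_samples, n_features)
y = df["person_id"].to_numpy()        # labels: who produced each sample

print(X.shape, X.dtype)  # should now be numeric, e.g. float64
```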

Is there a preferred format for Python to retrieve time-series data: .txt or .xlsx?

I am using a third-party tool to extract vast amounts of time-series data (to be analysed within Python). The options are to save it as a text file or an Excel file. Which is the faster route?
You can have a look here: Faster way to read Excel files to pandas dataframe
It is also mentioned there that CSV is faster, so the text file should be the better option.
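A quick way to settle it for your own data is to time both readers. A minimal sketch, with placeholder file names:

```python
# A minimal sketch comparing read speed for the same data saved as CSV and
# as .xlsx; file names are placeholders. CSV is typically much faster because
# it is plain text, while .xlsx is a zipped XML container pandas must unpack.
import time
import pandas as pd

for path, reader in [("series.csv", pd.read_csv),
                     ("series.xlsx", pd.read_excel)]:
    start = time.perf_counter()
    df = reader(path)
    print(f"{path}: {time.perf_counter() - start:.3f}s, {len(df)} rows")
```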

Which data file output format from C++ is most size-efficient to read in Python (or other languages)?

I want to write output files containing tabular data (float values in rows and columns) from a C++ program.
I need to open those files later with other languages/software (currently Python and ParaView, but that might change).
What would be the most size-efficient output format for tabular files that is still compatible with other languages?
E.g., txt files, CSV, XML, binary or not?
Thanks for any advice.
HDF5 might be a good option for you. It’s a standard format for storing large amounts of data, and there are Python and C++ libraries for reading and writing to it.
See here for an example
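A minimal sketch of the Python side with h5py, assuming a placeholder dataset name (the C++ side would use the official HDF5 C++ API):

```python
# A minimal sketch: write and read an HDF5 float table with h5py. The
# dataset name "matrix" and file name are placeholders. Gzip chunk
# compression is what makes HDF5 size-efficient for float tables.
import h5py
import numpy as np

data = np.random.rand(1000, 50)

with h5py.File("table.h5", "w") as f:
    f.create_dataset("matrix", data=data, compression="gzip")

with h5py.File("table.h5", "r") as f:
    matrix = f["matrix"][:]   # read back as a NumPy array
print(matrix.shape)
```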
1- Your output files contain tabular data (float values in rows and columns), in other words, a kind of matrix.
2- You need to open those files later with other languages/software.
3- You want the files to be size-efficient.
That said, you should consider one of the two formats below:
CSV: if your data are very simple (a matrix of floats without particular structure)
JSON: if you need a minimum of structure in your files
These two formats are standard and supported by almost all known languages and maintained software.
Finally, if your data have a very complex structure, look at a format like XML, but the price you pay is then in the size of your files!
Hope this helps!
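If in doubt, measure: a minimal sketch comparing the on-disk size of the same matrix in a few formats, with placeholder file names:

```python
# A minimal sketch comparing on-disk size of the same float matrix as CSV,
# JSON and raw binary. Text formats spend roughly 20 bytes per float, raw
# binary exactly 8 (float64).
import json
import os
import numpy as np

data = np.random.rand(1000, 50)

np.savetxt("table.csv", data, delimiter=",")
with open("table.json", "w") as f:
    json.dump(data.tolist(), f)
data.tofile("table.bin")  # raw float64, readable from C++ with fread

for path in ("table.csv", "table.json", "table.bin"):
    print(path, os.path.getsize(path), "bytes")
```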
First of all, the efficiency of I/O operations is limited by the buffer size, so if you want higher throughput you may have to tune the input/output buffers. As for how to lay out the data in the files, that depends on your data and on which delimiters you want to use to separate the values.
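For illustration, a minimal sketch of setting an explicit write buffer in Python; the 1 MiB value is an arbitrary example:

```python
# A minimal sketch of tuning the write buffer. Larger buffers mean fewer
# system calls when streaming many small rows; 1 MiB is an example value.
rows = (f"{i},{i * 0.5}\n" for i in range(1_000_000))

with open("out.csv", "w", buffering=1 << 20) as f:  # 1 MiB buffer
    f.writelines(rows)
```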

Processing TensorFlow Records that are XML (text)

I would like to use TensorFlow to process XML strings that are stored as proper TFRecords. I'm curious how to structure code that parses each TFRecord. There is a set of input rules and data-type mappings that is applied to each record to produce an output TFRecord.
Example input TFRecord:
<PLANT><COMMON>Shooting Star</COMMON><BOTANICAL>Dodecatheon</BOTANICAL><ZONE>Annual</ZONE><LIGHT>Mostly Shady</LIGHT><PRICE>$8.60</PRICE><EXTREF><REF1><ID>608</ID><TYPE>LOOKUP</TYPE></REF1><REF2><ID>703</ID><TYPE>STD</TYPE></REF2></EXTREF><AVAILABILITY>051399</AVAILABILITY></PLANT>
The rules show what needs to be parsed and how it needs to be formatted. E.g. find the COMMON, PRICE, EXTREF>REF2>ID and AVAILABILITY elements and export their values as a TFRecord.
Example output TFRecord:
Shooting Star,8.60,703,51399
How do I add this logic to a graph so when it executes it produces the output TFRecord? My initial thoughts are that I need to translate the mapping logic into a series of tf.ops...
I believe this link will be very helpful to you. It specifies the exact format that a TFRecord needs and provides the code to turn your own dataset into a TFRecord file.
However, that link does not mention XML files; it only covers how to create a tf_example and turn it into a TFRecord. This link goes a step back and shows you how to turn an XML file into a tf_example. Note that it will need some modification to fit your needs, because it uses the Oxford Pet Dataset.
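In practice the parsing is usually done in plain Python (or inside a tf.data pipeline via tf.py_function) before the record enters the graph, rather than as a series of tf.ops. A minimal sketch of the XML-to-tf_example step, assuming ElementTree for the parsing; the feature names and output path are placeholders:

```python
# A minimal sketch, assuming the XML-to-feature mapping runs in plain Python
# rather than as tf.ops; feature names and the output path are placeholders.
import xml.etree.ElementTree as ET
import tensorflow as tf

def _bytes_feature(value: str) -> tf.train.Feature:
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value.encode()]))

def plant_to_example(xml_string: str) -> tf.train.Example:
    root = ET.fromstring(xml_string)
    # The "rules": which elements to pull out and how to format them.
    return tf.train.Example(features=tf.train.Features(feature={
        "common":       _bytes_feature(root.findtext("COMMON")),
        "price":        _bytes_feature(root.findtext("PRICE").lstrip("$")),
        "ref2_id":      _bytes_feature(root.findtext("EXTREF/REF2/ID")),
        "availability": _bytes_feature(root.findtext("AVAILABILITY")),
    }))

xml_record = ("<PLANT><COMMON>Shooting Star</COMMON><PRICE>$8.60</PRICE>"
              "<EXTREF><REF2><ID>703</ID></REF2></EXTREF>"
              "<AVAILABILITY>051399</AVAILABILITY></PLANT>")

with tf.io.TFRecordWriter("plants.tfrecord") as writer:
    writer.write(plant_to_example(xml_record).SerializeToString())
```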

Can mongoexport be used to export images stored in binary format in mongodb

I am using mongoexport to export MongoDB data, which also contains image data in binary format.
The export is done in CSV format.
I tried to read the image data from the CSV file into Python and store it as an image file in .jpg format on disk.
But it seems that the data is corrupt and the image is not being stored correctly.
Has anybody come across or resolved a similar situation?
Thanks,
One thing to watch out for is an arbitrary 2MB BSON Object size limit in several of 10gen's implementations. You might have to denormalize your image data and store it across multiple objects.
Depending on how you stored the data, it may be prefixed with 4 bytes of size. Are the corrupt exports 4 bytes per GridFS chunk longer than you'd expect?
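If the binary field was exported as a base64 string (mongoexport renders BinData as text), writing the raw string to disk will produce a corrupt file; it has to be decoded first. A minimal sketch, assuming a base64-encoded cell and placeholder column and file names:

```python
# A minimal sketch, assuming mongoexport wrote the image bytes as a base64
# string in a CSV column; column and file names are placeholders.
import base64
import csv

with open("export.csv", newline="") as f:
    for i, row in enumerate(csv.DictReader(f)):
        raw = base64.b64decode(row["image"])   # undo the text encoding
        with open(f"image_{i}.jpg", "wb") as out:
            out.write(raw)
```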
