python JSON parsing: can I load partially to speed up?

I am processing large text files using Python. Each line of a file is a complete JSON message, and might be very long. I need to insert information about each line into a database. This info is very simple: the length of the line plus a unique ID which each message contains. So each line has the form
{"field1":"val1", ..., "ID":"12345", ..., "fieldK":"valK"}
and I need to extract "12345" from the message.
Right now I load the entire string using json.loads() then find the ID and ignore the rest.
My code is too slow and I need to speed it up. I am trying to see if there is a way of extracting "ID" faster than loading the whole string. One option is to search the string for "ID" and then pull out the :"12345" that follows. But that could be brittle if the substring "ID" happens to appear somewhere else in the message.
So is there a way of somehow partially loading the line to find ID, which would be as robust as, but also faster than, loading the whole line?

I would recommend a couple of paths:
If your input is very large, loading it wholly into memory is wasteful; it may be faster to read and parse each line separately.
If that doesn't help, then devising some way to search for the ID directly isn't a bad idea. Just verify that the input is kosher whenever you actually find a candidate ID. So you should:
1. Search (regex or otherwise) for the ID you expect.
2. On a match, actually parse the line and make sure it's valid. If it isn't (say, "ID" just happens to be embedded in some other string), drop it and keep searching.
Since matches that fail the check in step 2 should be rare, the verification doesn't have to be very efficient. A sketch of this search-then-verify idea is shown below.
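For illustration, a minimal sketch adapted to the per-line extraction in the question: the regex does the cheap work, and json.loads is only used when the cheap match is ambiguous. The field name "ID" and the digits-only value are assumptions taken from the example line.

import json
import re

# Assumed shape of the field: "ID":"12345" with a digits-only value.
ID_PATTERN = re.compile(r'"ID"\s*:\s*"(\d+)"')

def extract_id(line):
    # Cheap regex first; fall back to a full parse for doubtful cases.
    matches = ID_PATTERN.findall(line)
    if len(matches) == 1:
        # Exactly one candidate: almost certainly the real top-level field.
        return matches[0]
    # Zero or several candidates: use the slow but fully robust path.
    return json.loads(line).get("ID")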

Related

Fast check for the existence of a tag in large XML using Python cElementTree

I have XML files ranging from hundreds of megabytes to tens of gigabytes and use Python's cElementTree to process them. Because of limited memory and low speed, I don't want to load all the contents into memory with et.parse and then use the find or findall methods to check whether the tag exists (I didn't actually try this). Right now I simply use et.iterparse to iterate through all tags. When the tag is located close to the end of the file, this can be very slow as well. I wonder whether there is a better way to achieve this and also get the location of the tag? If I know the top-level element (e.g., index) under which the tag is located, and that part is much smaller than the rest of the file, is it possible to iterate through the top-level tags and then target just that part for parsing? I searched online, but surprisingly no related questions are posted. Am I missing anything? Thanks in advance.
I solved this by reading the file block by block instead of parsing it with cElementTree. My tags are close to the end of the file, so, following this answer, I read a block of content of a specified size block_size at a time from the end of the file using file.seek and file.read (line = f.read(block_size)), and then simply check "<my_tag " in line (or a more specific tag name to avoid ambiguity) to see whether the tag exists. This is much faster than using iterparse to go through all the tags.
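A hedged sketch of that block-by-block approach, opening the file in binary mode so that seeking from the end is straightforward; the tag name and block size are placeholders:

import os

def tag_near_end(path, tag=b"my_tag", block_size=1024 * 1024):
    # Read fixed-size blocks backwards from the end of the file and look
    # for the opening tag.
    marker = b"<" + tag + b" "
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        pos = f.tell()
        tail = b""
        while pos > 0:
            read_size = min(block_size, pos)
            pos -= read_size
            f.seek(pos)
            # Append a little of the previously read block so a tag that
            # straddles a block boundary is still found.
            block = f.read(read_size) + tail
            if marker in block:
                return True
            tail = block[: len(marker) - 1]
        return False

# Usage: tag_near_end("huge.xml", tag=b"my_tag")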

Validating XML with large text element against XML Schema (xsd)

I have to process XML files that contain potentially large (up to 2GB) content. In these files, the 'large' part of the content is not spread over the whole file but is contained in one single element (an encrypted file, hex encoded).
I have no leverage on the source of the files, so I need to deal with that situation.
A requirement is to keep a small memory footprint (< 500MB). I was able to read and process the file's contents in streaming mode using xml.sax, which is doing its job just fine.
The problem is that these files also need to be validated against an XML schema definition (.xsd file), which does not seem to be supported by xml.sax.
I found some up-to-date libraries for schema validation like xmlschema but none for doing the validation in a streaming/lazy fashion.
Can anyone recommend a way to do this?
Many schema processors (such as Xerces and Saxon) operate in streaming mode, so there's no need to hold the data in memory while it's being validated. However, a single 2GB text node stretches Java's limits on the size of strings and arrays, and even a streaming processor is quite likely to want to hold the whole of a single text node in memory.
If there are no validation constraints on the content of this text node (e.g. you don't need to validate that it is valid xs:base64Binary) then I would suggest using a schema validator (such as Saxon) that accepts SAX input, and supplying the input via a SAX filter that eliminates or condenses the long text value. A SAX parser supplies text to the ContentHandler in multiple chunks so there should be no limit in the SAX parser on the size of a text node. Saxon will try and combine the multiple chunks into a single string (or char array) and may fail at this stage either because of Java limits or because of the amount of memory available; but if your filter cuts out the big text node, this won't happen.
Michael Kay's answer had this nice idea of a content filter that can condense long text. This helped me solve my problem.
I ended up writing a simple text shrinker that pre-processes an XML file for me by reducing the text content size in named tags (like: "only keep the first 64 bytes of the text in the 'Data' and 'CipherValue' elements, don't touch anything else").
The resulting file is then small enough to feed into a validator like xmlschema.
If anyone needs something similar: here is the code of the shrinker
If you use this, be careful: it changes the content of the XML and could cause problems if the XML schema definition contains things like minimum or maximum length checks for the affected elements.
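For reference, a minimal sketch of such a SAX-based text shrinker. This is not the code linked above; the element names 'Data'/'CipherValue' and the 64-character limit are taken from the description, everything else is an assumption.

import xml.sax
from xml.sax.saxutils import XMLGenerator

class TextShrinker(xml.sax.ContentHandler):
    # Pass SAX events through to an XMLGenerator, keeping only the first
    # `keep` characters of text inside the named elements.

    def __init__(self, out, shrink_tags=("Data", "CipherValue"), keep=64):
        super().__init__()
        self._gen = XMLGenerator(out, encoding="utf-8")
        self._shrink = set(shrink_tags)
        self._inside = 0      # nesting depth within elements being shrunk
        self._emitted = 0     # characters already passed through
        self._keep = keep

    def startDocument(self):
        self._gen.startDocument()

    def endDocument(self):
        self._gen.endDocument()

    def startElement(self, name, attrs):
        if name in self._shrink:
            self._inside += 1
            self._emitted = 0
        self._gen.startElement(name, attrs)

    def endElement(self, name):
        self._gen.endElement(name)
        if name in self._shrink:
            self._inside -= 1

    def characters(self, content):
        if self._inside:
            budget = self._keep - self._emitted
            if budget <= 0:
                return  # drop the rest of the huge text node
            content = content[:budget]
            self._emitted += len(content)
        self._gen.characters(content)

# Usage sketch:
# with open("small.xml", "w", encoding="utf-8") as out:
#     xml.sax.parse("big.xml", TextShrinker(out))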

Find width of pdf form field in python

I have a fillable pdf with fields that need to be filled out by the user. I am trying to auto-generate responses for those fields with Python, but I need to know the width/length of the form fields in order to know whether my responses will fit in the field.
How do I find the width of these fields, or at least test whether a possible response will fit?
I was thinking that if I knew the font and font size of the field, that might help.
Edit: I just realized that the pdf is encrypted, so interfacing with the pdf in a programmatic way may be impossible. Any suggestions for a quick and dirty solution are welcome though.
Link to form: http://static.e-publishing.af.mil/production/1/af_a1/form/af910/af910.pdf
I need to know the width of the comments blocks.
After some quick digging around in pdf files and one of Adobe's pdf references (source) it turns out that a text field may have a key "MaxLen" whose value is an integer representing the maximum length of the field's text, in characters (See page 444 in the reference mentioned). It appears that if no such key is present, there is no maximum length.
What one could do, then, is simply search the pdf file for the "MaxLen" keys (all of them if there are multiple text fields, otherwise just the first) and return their values. E.g.:
import re

with open('your_file.pdf', 'r', errors='ignore') as pdf_file:
    content = pdf_file.read()

# Matches every integer that directly follows "/MaxLen "
regexp = r'(?<=/MaxLen )\d+'
max_lengths = [int(match) for match in re.findall(regexp, content)]
(If the file is huge you may not be able to read it all into memory at once. If that's the case, reading it line by line could be a solution.)
max_lengths will then be a list of all the "MaxLen" values, ordered after occurrence in the file (the first occurrence will be first etc.).
However, depending on what you need, you may have to refine the search and add more conditions to my code. For example, if a file contains multiple text fields but not all of them have a maximum length, you may not know which length corresponds to which field. Also, if a pdf file has been modified and saved (not using "Save As"), the modifications are appended to the old file instead of overwriting it completely. I'm not sure exactly how this works, but I suppose it could make you pick up the max lengths of previously removed fields if you're not careful to check for that.
(Working with pdf's in this way is very new to me, please correct me if I'm wrong about anything. I'm not saying that there's no library that can do this for you, maybe PDFMiner can, though it will probably be more advanced.)
Update 23-10-2017
I'm afraid the problem just got a lot harder. I believe you should still be able to deduce the widths of the text fields by parsing the right parts of the pdf file. Why? Because Adobe's software (at least Adobe Acrobat Pro DC) can render it properly without asking for a password to decrypt it first. The problem is that I don't know how to parse it. Dig deep enough and you may find out, or not.
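One concrete way to get at "the right parts", not covered in the original answer: every form field is drawn by a widget annotation whose /Rect entry gives its bounding box in PDF points, so its width is simply urx - llx. A hedged sketch, assuming pypdf is installed and that the form (like many "encrypted" fillable forms) opens with an empty user password:

from pypdf import PdfReader  # assumption: pypdf (the PyPDF2 successor) is available

reader = PdfReader("af910.pdf")  # the downloaded form
if reader.is_encrypted:
    # Many forms only carry an owner password; an empty user password is
    # often enough to read them. If this fails, a real password is needed.
    reader.decrypt("")

for page in reader.pages:
    annots = page.get("/Annots")
    if annots is None:
        continue
    for ref in annots.get_object():  # resolve possible indirect references
        annot = ref.get_object()
        if annot.get("/Subtype") == "/Widget":
            llx, lly, urx, ury = (float(v) for v in annot["/Rect"])
            print(annot.get("/T"), "width in points:", urx - llx)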
I suppose you could solve the problem in a graphical way, opening every pdf with some viewer which can read them properly, then measuring the widths of the text fields. But, this would be fairly slow and I'm not sure how you would go about recognizing text fields.
It doesn't help that the forms don't use a monospaced font, but that's a smaller problem that definitely can be solved (find which font the text fields use, look up the width of all the characters in that font and use that information in your calculations).
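For that last step, a font-metrics library can supply the character widths. A small sketch, assuming reportlab is installed and guessing 10 pt Helvetica for the field font (substitute whatever the form actually uses):

from reportlab.pdfbase.pdfmetrics import stringWidth

response = "Example response text"
field_width_pt = 213.0  # placeholder: the field width, e.g. from its /Rect entry

# stringWidth returns the rendered width of the text in points for the
# given font name and size.
if stringWidth(response, "Helvetica", 10) <= field_width_pt:
    print("fits on one line")
else:
    print("too long; needs wrapping or truncation")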
If you do manage to solve the problem, please share. :)

Python SQLite Search Must Ignore HTML Tags

I have several SQLite databases ranging in size from 1 to 150 MB some with as many as 30,000 rows. The data being searched is basic HTML. I'm looking for the quickest way to search the HTML text while compensating for any HTML tags.
For instance, if I am searching for "the sky is blue" and a record in a database has an italics tag (i.e. "the <i>sky</i> is blue"), I need it to find it.
Obviously a straight search,
SELECT * FROM dictionary WHERE definition LIKE "%the sky is blue%"
won't work.
So I tried searching for all the individual words of the phrase, in any order, and then filtering the results with a regular expression. This works but is slow: it returns too many false matches that the regex then has to scan, especially if the search string contains common words.
I also tried searching for the individual words in order (LIKE "%the%sky%is%blue%"), but this would sometimes cause the SQL search to hang on the larger records, I think because the short common strings ("is", "at", etc.) produce thousands of hits.
An SQL regex search is also too slow for my purposes.
One option is to make another table with the data in all the records stripped of HTML tags and search that instead, but this nearly doubles the size of the database.
What other options are there to compensate for the tags?
As you have discovered, relational systems weren't designed for this kind of searching, and there's very little you can do to fix that. The best answer is indeed to store a pre-stripped version of the text purely for searching purposes. Even a 300MB file would be considered small in today's terms, so unless space is a real constraint I wouldn't bother too much about that.
There's no real need for another table, though - that would only complicate things. I'd recommend that you simply add the stripped text as an additional column on your existing table.
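A minimal sketch of that approach. The table and column names dictionary/definition come from the question's query; the new column name, the crude tag-stripping regex, and the database path are assumptions:

import re
import sqlite3

TAG_RE = re.compile(r"<[^>]+>")  # crude, but enough for simple inline markup

def strip_tags(html):
    return TAG_RE.sub("", html)

conn = sqlite3.connect("dictionary.db")  # placeholder path

# One-off migration: add the plain-text column and populate it.
conn.create_function("strip_tags", 1, strip_tags)
conn.execute("ALTER TABLE dictionary ADD COLUMN definition_plain TEXT")
conn.execute("UPDATE dictionary SET definition_plain = strip_tags(definition)")
conn.commit()

# Searches then run against the stripped copy instead of the raw HTML.
rows = conn.execute(
    "SELECT * FROM dictionary WHERE definition_plain LIKE ?",
    ("%the sky is blue%",),
).fetchall()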

Extract particular fields from json in python

Say I have a lot of json lines to process and I only care about the specific fields in a json line.
{blablabla, 'whatICare': 1, blablabla}
{blablabla, 'whatICare': 2, blablabla}
....
Is there any way to extract whatICare from these json lines without loading them in full? Since the json lines are very long, it may be slow to build complete objects from them.
Not in any reliable way, without writing your own parsing code.
But check out ujson! It can be 10x faster than python's built in json library, which is a bit on the slow side.
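For instance, the swap is a one-line change (assuming ujson is installed; the sample line mirrors the question):

import ujson  # pip install ujson

line = '{"field1": "val1", "whatICare": 1, "fieldK": "valK"}'
value = ujson.loads(line)["whatICare"]  # same call shape as json.loads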
No, you will have to load and parse the JSON before you know what’s inside and to be able to filter out the desired elements.
That being said, if you are worried about memory, you could use ijson, which is an iterative parser. Instead of loading all the content at once, it loads only what's necessary for the next iteration. So if your file contains an array of objects, you can load and parse one object at a time, reducing the memory impact (you only need to keep one object in memory, plus the data you actually care about). But it won't be faster, and it won't magically skip data you are not interested in.
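A small sketch of that ijson usage, for the case where the file holds one big JSON array of objects (the field name comes from the question; the filename is a placeholder):

import ijson  # pip install ijson

with open("data.json", "rb") as f:
    # 'item' addresses each element of the top-level array, so objects are
    # built one at a time instead of loading the whole file.
    for obj in ijson.items(f, "item"):
        print(obj["whatICare"])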
