Python process creating file with inflated size

I have a Python process that takes a file containing streamed data and converts it into a format ready to load into a database. I have just migrated this process from one Linux GCP VM to another running exactly the same code, but the final output file is nearly 4 times as big: 500 MB vs 2 GB.
When I download the files and manually inspect them, they look exactly the same to the eye.
Any ideas what could be causing this?
Edit: Thanks for the feedback. I traced it back to the input file, which is slightly different (as my stream recording process has also been migrated).
I am now trying to work out why a marginally different file creates such a different output file once it's been processed.

Related

parallelize external program in python call?

I have an external program which I can not change.
It reads an input file, does some calculations, and writes out a result file. I need to run this for a million or so combinations of input parameters.
At the moment I open a template file, change some strings in it (to input the new parameters), write it out, start the program using os.popen(), read the output file, do a chi-square test on the result, and then restart with a different set of parameters.
The external program only runs on one core, so I tried splitting my parameter space up and starting multiple instances in different folders. Different folders were necessary because the program overwrites its output file. This works, but it still took about 24 hours to finish.
Is it possible to run this as separate processes without the result file being overwritten? Or do you see anything else I could do to speed this up?
Thx.
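The folder-per-instance approach described above can be sketched as follows. This is only a rough illustration, not a tested solution for the actual program: `COMMAND` is a stand-in that upper-cases `input.txt` into `result.txt`, and the file names, `TEMPLATE`, `run_one` and `run_all` are all hypothetical. Each run gets its own temporary working directory, so the output files never collide.

```python
import subprocess
import sys
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

# Stand-in for the real external program (assumption): it reads input.txt and
# writes result.txt in its current working directory.
COMMAND = [sys.executable, "-c",
           "open('result.txt', 'w').write(open('input.txt').read().upper())"]
TEMPLATE = "alpha={alpha} beta={beta}\n"  # hypothetical template file content


def run_one(params):
    """Run one instance in a private directory and return its output text."""
    with tempfile.TemporaryDirectory() as workdir:
        work = Path(workdir)
        (work / "input.txt").write_text(TEMPLATE.format(**params))
        # cwd=... makes the program write its output inside the private directory.
        subprocess.run(COMMAND, cwd=work, check=True)
        return (work / "result.txt").read_text()


def run_all(param_sets, workers=4):
    """Run many parameter sets concurrently, preserving input order."""
    # Threads are enough here: each worker only waits for its child process.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_one, param_sets))
```

Setting `workers` to the number of CPU cores keeps every core busy without manually splitting the parameter space; the chi-square test would then be applied to each returned result.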

BigQuery script failing for large file

I am trying to load a JSON file into Google BigQuery using the script at
https://github.com/GoogleCloudPlatform/python-docs-samples/blob/master/bigquery/api/load_data_by_post.py with very little modification.
I added
,chunksize=10*1024*1024, resumable=True))
to MediaFileUpload.
The script works fine for a sample file with a few million records. The actual file is about 140 GB with approx 200,000,000 records. insert_request.execute() always fails with
socket.error: `[Errno 32] Broken pipe`
after half an hour or so. How can this be fixed? Each row is less than 1 KB, so it shouldn't be a quota issue.
When handling large files, use batch loading rather than streaming. Streaming will easily handle up to 100,000 rows per second. That's pretty good for streaming, but not for loading large files.
The sample code linked is doing the right thing (batch instead of streaming), so what we see is a different problem: This sample code is trying to load all this data straight into BigQuery, but the uploading through POST part fails.
Solution: Instead of loading big chunks of data through POST, stage them in Google Cloud Storage first, then tell BigQuery to read files from GCS.
Update: Talking to the engineering team, POST should work if you try a smaller chunksize.
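A minimal sketch of the request body for the "read from GCS" approach, assuming the file has already been uploaded to a bucket and the destination table already exists with a matching schema. The field names follow the BigQuery v2 jobs resource; the helper name `make_gcs_load_job` and the project, dataset, table and bucket names are placeholders, not taken from the question.

```python
def make_gcs_load_job(project_id, dataset_id, table_id, gcs_uris):
    """Build a jobs.insert request body that loads JSON files from GCS."""
    return {
        "configuration": {
            "load": {
                # One or more gs:// URIs; BigQuery reads them server-side,
                # so no large upload goes through the POST request.
                "sourceUris": list(gcs_uris),
                "sourceFormat": "NEWLINE_DELIMITED_JSON",
                "destinationTable": {
                    "projectId": project_id,
                    "datasetId": dataset_id,
                    "tableId": table_id,
                },
            }
        }
    }


# Placeholder names (assumptions for illustration only).
job_body = make_gcs_load_job(
    "my-project", "my_dataset", "my_table", ["gs://my-bucket/records.json"]
)
```

With the API client object that the sample script builds, the body would presumably be submitted through `jobs().insert(projectId=..., body=job_body).execute()`, this time without a `media_body` argument.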

How can I force Python code to read input files again without rebooting my computer

I am scanning through a large number of files looking for some markers. I am starting to be really confident that once I have run through the code one time Python is not rereading the actual files from disk. I find this behavior strange because I was told that one reason I needed to structure my file access in the manner I have is so that the handle and file content is flushed. But that can't be.
There are 9,568 file paths in the list I am reading from. If I shut down Python and reboot my computer it takes roughly 6 minutes to read the files and determine if there is anything returned from the regular expression.
However, if I run the code a second time it takes about 36 seconds. Just for grins, the average document has 53,000 words.
Therefore I am concluding that Python still has access to the file it read in the first iteration.
I want to also observe that the first time I do this I can hear the disk spin (E:\ - Python is on C:). E is just a spinning disk with 126 MB cache - I don't think the cache is big enough to hold the contents of these files. When I do it later I do not hear the disk spin.
Here is the code
import re
test_7A_re = re.compile(r'\n\s*ITEM\s*7\(*a\)*[.]*\s*-*\s*QUANT.*\n', re.IGNORECASE)
no7a = []
for path in path_list:
    path = path.strip()
    with open(path, 'r') as fh:
        string = fh.read()
    items = [item for item in re.finditer(test_7A_re, string)]
    if len(items) == 0:
        no7a.append(path)
        continue
I care about this for a number of reasons, one is that I was thinking about using multi-processing. But if the bottleneck is reading in the files I don't see that I will gain much. I also think this is a problem because I would be worried about the file being modified and not having the most recent version of the file available.
I am tagging this 2.7 because I have no idea if this behavior is persistent across versions.
To confirm this behavior I modified my code to run as a .py file, and added some timing code. I then rebooted my computer - the first time it ran it took 5.6 minutes and the second time (without rebooting) the time was 36 seconds. Output is the same in both cases.
The really interesting thing is that even if I shut down IDLE (but do not reboot my computer) it still takes 36 seconds to run the code.
All of this suggests to me that the files are not read from disk after the first time - this is amazing behavior to me but it seems dangerous.
To be clear, the results are the same - I believe given the timing tests I have run and the fact that I do not hear the disk spinning that somehow the files are still accessible to Python.
This is caused by caching in Windows. It is not related to Python.
To stop Windows from caching your reads, you can try one of the following:
Disable the paging file in Windows and fill the RAM up to 90%.
Use some tool to disable file caching in Windows like this one.
Run your code on a Linux VM on your Windows machine that has limited RAM. In Linux you can control the caching much better.
Make the files much bigger, so that they won't fit in cache.
I fail to see why this is a problem. I'm not 100% certain of how Windows handles file cache invalidation, but unless the "Last modified time" changes, you and I and Windows would assume that the file still holds the same content. If the file holds the same content, I don't see why reading from cache can be a problem.
I'm pretty sure that if you change the last modified date, say, by opening the file for write access then closing it right away, Windows will hold sufficient doubts over the file content and invalidate the cache.
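The worry about stale content can be checked directly. The sketch below assumes a local filesystem (network shares can behave differently) and uses a throwaway temporary file; `read_text` and `demo_cache_is_current` are hypothetical helper names. It reads a file, rewrites it, and reads it again: the second read returns the new content whether or not the first read left the file in the OS cache.

```python
import os
import tempfile


def read_text(path):
    """Read a whole file the same way the question's loop does."""
    with open(path, "r") as fh:
        return fh.read()


def demo_cache_is_current():
    fd, path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)
    try:
        with open(path, "w") as fh:
            fh.write("first version")
        before = read_text(path)  # this read may populate the OS file cache
        with open(path, "w") as fh:
            fh.write("second version")
        after = read_text(path)   # cached or not, the new content is returned
        return before, after
    finally:
        os.remove(path)
```

So the speed-up on the second run comes from where the bytes are served from, not from Python holding on to old file contents.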

Cannot find file error

I have a small Python script that splits a large data file into many smaller files (some 100,000 or so), basically map tiling.
Anyway, after a long time (20+hrs) of running the script it dies with:
IOError: [Errno 2] No such file or directory: ......
Now this strikes me as odd as NONE of the files existed before the script runs - and f.close() is called after every file has been written, so FD limits don't seem to be responsible... (and plenty of disk space)
The other odd thing was that all the files it had already created/written (about 55,000) were deleted along with the containing directory when the script died.
I used the multiprocessing module to create a process for each CPU core, and all 4 spat out the same message on dying - but with a file in the section of data they were processing (perhaps this is relevant?)
I can hopefully work around this, but am just curious as to why this might have occurred?
Edit: For more context, the script is splitting the GSHHS geo data into small 'tiles'. I have successfully done this for the lower resolution sections of the DB, into files containing larger areas of the globe. It fell over when trying to split the high resolution data into 1x1° tiles.

Python: Reading New Information Added To Massive Files

I'm working on a Python script to parse Squid (http://www.squid-cache.org/) log files. While the logs are rotated every day to stop them getting too big, they do reach between 40-90 MB by the end of each day.
Essentially what I'm doing is reading the file line by line, parsing out the data I need (IP, requested URL, time) and adding it to an sqlite database. However, this seems to be taking a very long time (it's been running over 20 minutes now).
So obviously, re-reading the file can't be done. What I would like to do is read the file and then detect all new lines written. Or even better, at the start of the day the script will simply read the data in real time as it is added so there will never be any long processing times.
How would I go about doing this?
One way to achieve this is by emulating tail -f. The script would constantly monitor the file and process each new line as it appears.
For a discussion and some recipes, see tail -f in python with no time.sleep
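A minimal sketch of the tail -f idea, using plain polling with time.sleep rather than the sleep-free recipes from the linked discussion. The function name `follow` and its parameters are hypothetical; `max_idle_polls` and `from_start` exist mainly so the loop can be stopped and tested.

```python
import time


def follow(path, poll_interval=1.0, max_idle_polls=None, from_start=False):
    """Yield complete lines as they are appended to the file at path.

    Runs forever unless max_idle_polls is set, in which case it stops after
    that many consecutive polls without new data.
    """
    with open(path, "r") as fh:
        if not from_start:
            fh.seek(0, 2)  # jump to the current end of the file
        buffer = ""
        idle = 0
        while max_idle_polls is None or idle < max_idle_polls:
            chunk = fh.readline()
            if not chunk:
                # Nothing new yet: wait and poll again.
                idle += 1
                time.sleep(poll_interval)
                continue
            idle = 0
            buffer += chunk
            if buffer.endswith("\n"):
                # Only hand out lines once the writer has finished them.
                yield buffer
                buffer = ""
```

A long-running parser would loop over `follow("access.log")` (path assumed) and insert each parsed line into the database as it arrives.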
One way to do this is to use file system monitoring with pyinotify (http://pyinotify.sourceforge.net/) and set a callback function to be executed whenever the log file size changes.
Another way to do it, without requiring external modules, is to record in the filesystem (possibly in your sqlite database itself) the offset of the end of the last line read from the log file (which you get with file.tell()), and just read the newly added lines from that offset onwards, which is done with a simple call to file.seek(offset) before looping through the lines.
The main difference between keeping track of the offset and the "tail" emulation described in the other post is that this approach allows your script to be run multiple times, i.e. it does not need to be running continually, and it can recover in case of a crash.
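The offset approach can be sketched like this, assuming the offset is kept in the same sqlite database. The table name `log_offsets` and the helper names are hypothetical, and the reset to 0 when the file has shrunk is an added assumption to cope with the daily log rotation.

```python
import os
import sqlite3


def ensure_table(conn):
    conn.execute(
        "CREATE TABLE IF NOT EXISTS log_offsets "
        "(path TEXT PRIMARY KEY, byte_offset INTEGER)"
    )


def load_offset(conn, log_path):
    """Return the saved offset for log_path, or 0 if none is stored."""
    ensure_table(conn)
    row = conn.execute(
        "SELECT byte_offset FROM log_offsets WHERE path = ?", (log_path,)
    ).fetchone()
    return row[0] if row else 0


def save_offset(conn, log_path, offset):
    ensure_table(conn)
    conn.execute(
        "INSERT OR REPLACE INTO log_offsets (path, byte_offset) VALUES (?, ?)",
        (log_path, offset),
    )
    conn.commit()


def read_new_lines(log_path, offset):
    """Return (complete lines added since offset, new offset)."""
    if offset > os.path.getsize(log_path):
        offset = 0  # file is smaller than before: assume it was rotated
    lines = []
    with open(log_path, "rb") as fh:  # binary mode keeps offsets byte-exact
        fh.seek(offset)
        while True:
            line = fh.readline()
            if not line.endswith(b"\n"):
                break  # end of file, or a line that is still being written
            lines.append(line.decode("utf-8", errors="replace"))
            offset = fh.tell()
    return lines, offset
```

Each run would call `load_offset`, then `read_new_lines`, parse and insert the returned lines, and finally `save_offset`, so a crash before the save simply causes the same lines to be read again on the next run.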
