Splitting file into small chunks and processing - python

I have three files, each containing close to 300k records. I have written a Python script that processes those files with some business logic and creates the output file successfully. That run completes in about 5 minutes.
I am now using the same script to process files with a much higher volume of data (the three input files together contain about 30 million records). The processing now takes hours and keeps running for a very long time.
So I am thinking of breaking each file into 100 small chunks based on the last two digits of the unique ID and processing the chunks in parallel. Are there any data pipeline packages I could use to do this?
BTW, I am running this process on my VDI machine.

I am not aware of a specific package or API for this, but you can try the multiprocessing (or threading) modules to process large volumes of data in parallel.
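As a rough illustration of that suggestion, here is a minimal sketch that buckets records by the last two digits of their unique ID and hands the resulting chunks to a multiprocessing pool. It assumes one comma-separated input file whose first field is the ID, and `apply_business_logic()` is a hypothetical stand-in for your actual processing:

```python
# Minimal sketch, not your actual script: bucket records by the last two
# digits of the unique ID (first comma-separated field, an assumption) and
# process the 100 buckets in parallel with a process pool.
import multiprocessing as mp
from collections import defaultdict

def apply_business_logic(record):
    # Hypothetical placeholder for the real per-record processing.
    return record.upper()

def process_chunk(item):
    chunk_key, records = item
    return chunk_key, [apply_business_logic(r) for r in records]

def main():
    # Split the input into 100 chunks keyed by the last two ID digits.
    chunks = defaultdict(list)
    with open("input.txt") as fh:
        for line in fh:
            unique_id = line.split(",", 1)[0]
            chunks[unique_id[-2:]].append(line.rstrip("\n"))

    # One worker per CPU core; each call processes a whole chunk.
    with mp.Pool() as pool:
        results = pool.map(process_chunk, chunks.items())

    with open("output.txt", "w") as out:
        for _, processed in results:
            for row in processed:
                out.write(row + "\n")

if __name__ == "__main__":
    main()
```

If you do want a dedicated pipeline package, tools such as Dask or Apache Beam can express the same chunk-and-process pattern, but for a one-off job like this the standard library is often enough.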

Related

python process creating file with inflated size

I have a Python process which takes a file containing streamed data and converts it into a format ready to load into a database. I have just migrated this process from one Linux GCP VM to another running exactly the same code, but the final output file is nearly 4 times as big: 500 MB vs 2 GB.
When I download the files and manually inspect them, they look exactly the same to the eye.
Any ideas what could be causing this?
Edit: Thanks for the feedback. I traced it back to the input file, which is slightly different (as my stream-recording process has also been migrated).
I am now trying to work out why a marginally different input file creates such a different output file once it has been processed.

Program not faster with threading

I made a program that edits some files in a game. It reads byte data from the files and stores it in a Python dict.
The program reads 30+ files, all containing the same type of data.
Originally the program processed the files one by one: for each file it saved the byte data to a Python array, read all the bytes in that file following a structure, then moved on to the next file, and so on.
That process takes about 1 minute to complete.
So I created threads: one thread per file, each doing the same work, for all files at the same time.
But it takes the same time, about 1 minute.
How is this possible?
The problem is not read speed from the HDD, because the reads finish in less than 1 second; after that, the byte data each thread processes is already in memory. Why does it take so long, and why is it not faster with threading?
Another question: is it normal that the dict where I store all the information ends up very large in memory, around 3 GB? There are about 10k keys in the dict, and each one has a great many values.
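For what it's worth, the CPU-bound parsing described above is exactly the case where CPython's GIL prevents threads from running Python code in parallel; threads only help when the work is I/O-bound. Below is a minimal sketch of doing the same per-file work with processes instead, where `parse_file()` and the file locations are hypothetical stand-ins for the real byte parsing:

```python
# Sketch only: parse_file() stands in for the per-file byte parsing described
# in the question; processes sidestep the GIL for CPU-bound work, threads do not.
from concurrent.futures import ProcessPoolExecutor
import glob

def parse_file(path):
    with open(path, "rb") as fh:
        data = fh.read()
    # ... walk the byte structure here and build the dict for this file ...
    return path, {"size": len(data)}  # placeholder result

if __name__ == "__main__":
    files = glob.glob("gamedata/*.bin")  # placeholder location for the 30+ files
    results = {}
    with ProcessPoolExecutor() as pool:  # one worker process per core by default
        for path, parsed in pool.map(parse_file, files):
            results[path] = parsed
    print(f"parsed {len(results)} files")
```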

parallelize external program in python call?

I have an external program which I cannot change.
It reads an input file, does some calculations, and writes out a result file. I need to run it for a million or so combinations of input parameters.
The way I do it at the moment: I open a template file, change some strings in it (to insert the new parameters), write it out, start the program using os.popen(), read the output file, run a chi-square test on the result, and then restart with a different set of parameters.
The external program only runs on one core, so I tried splitting my parameter space up and starting multiple instances in different folders. The different folders were necessary because the program overwrites its output file. This works, but it still took about 24 hours to finish.
Is it possible to run these as separate processes without the result file being overwritten? Or do you see anything else I could do to speed this up?
Thanks.
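One way to keep the result files from clobbering each other is to give every run its own temporary working directory and drive the runs from a process pool. The sketch below assumes a program called `external_prog`, a `template.txt` with a `{PARAM}` placeholder, and an output file named `result.txt`; those names are placeholders, not your actual setup:

```python
# Sketch: each run gets a private temp directory, so the external program's
# output file can never be overwritten by another run.
import os
import shutil
import subprocess
import tempfile
from multiprocessing import Pool

with open("template.txt") as fh:
    TEMPLATE = fh.read()

PROG = os.path.abspath("external_prog")

def run_one(param):
    workdir = tempfile.mkdtemp(prefix="run_")
    try:
        with open(os.path.join(workdir, "input.txt"), "w") as fh:
            fh.write(TEMPLATE.replace("{PARAM}", str(param)))
        subprocess.run([PROG], cwd=workdir, check=True)
        with open(os.path.join(workdir, "result.txt")) as fh:
            return param, fh.read()
    finally:
        shutil.rmtree(workdir)                 # clean up the per-run directory

if __name__ == "__main__":
    params = range(1000)                       # stand-in for the real parameter sets
    with Pool() as pool:                       # one worker per core
        for param, result in pool.imap_unordered(run_one, params):
            pass                               # e.g. run the chi-square test on `result` here
```

imap_unordered lets you consume each result as soon as its run finishes, and subprocess.run (in place of os.popen) raises if the external program exits with an error.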

BigQuery script failing for large file

I am trying to load a JSON file into Google BigQuery using the script at
https://github.com/GoogleCloudPlatform/python-docs-samples/blob/master/bigquery/api/load_data_by_post.py with very little modification.
I added `chunksize=10*1024*1024, resumable=True` to the MediaFileUpload call.
The script works fine for a sample file with a few million records. The actual file is about 140 GB with approximately 200,000,000 records. insert_request.execute() always fails with
socket.error: `[Errno 32] Broken pipe`
after half an hour or so. How can this be fixed? Each row is less than 1 KB, so it shouldn't be a quota issue.
When handling large files, don't use streaming, use a batch load: streaming will easily handle up to 100,000 rows per second, which is very good for streaming, but not for loading large files.
The linked sample code is already doing the right thing (batch load rather than streaming), so what we are seeing is a different problem: the sample tries to push all of this data straight into BigQuery, and it is the upload-through-POST part that fails.
Solution: instead of loading big chunks of data through POST, stage them in Google Cloud Storage first, then tell BigQuery to read the files from GCS.
Update: after talking to the engineering team, POST should also work if you try a smaller chunksize.
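Here is a sketch of the GCS-staging route using the newer google-cloud-storage and google-cloud-bigquery client libraries rather than the googleapiclient-based sample; the bucket name, table ID and file name are placeholders for your own:

```python
# Sketch: stage the newline-delimited JSON in Cloud Storage, then run a batch
# load job that reads it from GCS instead of pushing 140 GB through POST.
from google.cloud import bigquery, storage

# 1. Upload the file to a staging bucket.
storage_client = storage.Client()
storage_client.bucket("my-staging-bucket").blob("data.json") \
    .upload_from_filename("data.json")

# 2. Tell BigQuery to load it straight from GCS.
bq_client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    autodetect=True,
)
load_job = bq_client.load_table_from_uri(
    "gs://my-staging-bucket/data.json",
    "my_project.my_dataset.my_table",
    job_config=job_config,
)
load_job.result()  # block until the load job finishes
```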

Python Pandas MemoryError on first run, goes away on rerun

I'm running a long Python program that includes a step that reads a file into a pandas DataFrame. The program consistently fails with a MemoryError the first time it tries to read the file into memory. When I rerun the failing step (without rerunning the previous parts of the program), there is no MemoryError.
It may be a problem of accumulating lots of earlier objects in memory that aren't present on the rerun, but the amount of memory in play is below the 2 GB limit where Windows starts having problems. In particular, the previous steps of the program only leave around 400 MB in RAM, and the file I'm trying to read takes only about 400 MB.
Any ideas what's causing the MemoryError the first time around?
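Not a diagnosis, but a common mitigation when a single read hits a MemoryError is to free leftover objects first and parse the file in chunks so the parser's intermediate buffers stay small; `data.csv` below is a placeholder for the actual file, not the asker's code:

```python
# Sketch only: collect garbage left over from earlier steps, then read in
# chunks so peak memory during parsing stays closer to the final size.
import gc
import pandas as pd

gc.collect()                                          # reclaim objects from previous steps

chunks = pd.read_csv("data.csv", chunksize=100_000)   # iterator of DataFrames
df = pd.concat(chunks, ignore_index=True)             # assemble the full DataFrame
```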
