Reading large CSV files with Python and pandas - python

I have a Python server that connects to an SFTP server and pulls CSV files (there is a for loop running in a Node.js server, so a different connection comes in each time).
In that Python server I'm reading the CSV file with pandas, like this:
file = sftp.open(latestfile)
check = pd.read_csv(file).to_csv()
At the end, I return check with the CSV file data inside, and then I parse it in the Node.js server.
This process went really well and I managed to retrieve a lot of data this way, but my Python server crashed when it tried to read a big CSV file (22 MB).
I searched online and tried to solve it with chunks, the modin library, and dask.dataframe, but every time I tried one of these methods I couldn't read the file content properly (the .to_csv part).
I'm really lost right now because I can't get it to work (and there can be larger files than that).

Here is a way to process your large CSV files. It lets you process one group of rows (a chunk) at a time, and you can adapt it to your setup (for example, reading through SFTP).
Minimal example:
import pandas as pd
chunksize = 10 ** 4
for chunk in pd.read_csv(latestfile, chunksize=chunksize):
    process(chunk.to_csv())
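If the file has to come over SFTP as in the question, a sketch along these lines reads from the SFTP file object in chunks and rebuilds the CSV text incrementally, so the whole DataFrame never sits in memory at once (sftp and latestfile are the objects from the question; the chunk size is a placeholder to tune):

import io
import pandas as pd

chunksize = 10 ** 4            # rows per chunk; tune for your memory budget
buffer = io.StringIO()         # accumulate the output CSV text here

remote_file = sftp.open(latestfile)            # same call as in the question
for i, chunk in enumerate(pd.read_csv(remote_file, chunksize=chunksize)):
    # write the header only for the first chunk, then append the rest
    chunk.to_csv(buffer, header=(i == 0))
remote_file.close()

check = buffer.getvalue()      # the same CSV string you were returning before

If even the assembled text becomes too large, send each chunk's CSV to the Node.js side as it is produced instead of accumulating it all in one string.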

Related

python process creating file with inflated size

I have a Python process which takes a file containing streamed data and converts it into a format ready to load into a database. I have just migrated this process from one Linux GCP VM to another running exactly the same code, but the final output file is nearly four times as big: 500 MB vs 2 GB.
When I download the files and manually inspect them, they look exactly the same to the eye.
Any ideas what could be causing this?
Edit: Thanks for the feedback. I traced it back to the input file, which is slightly different (as my stream-recording process has also been migrated).
I am now trying to work out why a marginally different input file creates such a different output file once it has been processed.

Reading from a CSV using two programs simultaneously

Good day!
I am wondering if there is any way to allow different programming languages to access a CSV file at the same time.
I am using C# to get live stock market data, and then Python does calculations on this data and writes the results back to the CSV file to be read by C# again. It works if I use multiple steps, i.e. collect historical data, predict on the historical data, and read the historical data, but when I try to do this in one step (live) I get the following error:
[Errno 13] Permission denied: 'CurrencyPair-Minute.csv'
I think this is the result of the file being in use by the C# program, which opened it with the following parameters:
File.Open(fiName, FileMode.Open, FileAccess.ReadWrite, FileShare.Write);
I only close the data streams when the program stops, so the streams are continually open for reading and writing in the C# program.
If I close the stream while the file is not being read or written, the error I get is:
Crashed in (MethodName) with ArgumentException: Stream was not readable.
Also, this will not work since the Python program continually needs to check the file for updates.
Any help would be appreciated
Thank you
You should be able to run the Python script from your C# program after fetching the data, writing it to the CSV, and closing the CSV file in C#. See: How do I run a Python script from C#?
The flow would be something like this:
C#
Fetch data
Write to csv
Close file
Call python script
Python
Do calculation
Write to file
Close file
Exit
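For the Python half of that flow, a minimal sketch might look like the following; the output file name, the "Close" column, and the calculation are made up for illustration, and pandas is assumed since the data is CSV:

import pandas as pd

# calculate.py - invoked by the C# program after it has written and closed the CSV
data = pd.read_csv("CurrencyPair-Minute.csv")

# placeholder calculation: a 5-row moving average over an assumed "Close" column
data["prediction"] = data["Close"].rolling(window=5).mean()

data.to_csv("CurrencyPair-Minute-predicted.csv", index=False)
# the script exits here, so the output file is closed and free for C# to read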

BigQuery script failing for large file

I am trying to load a JSON file into Google BigQuery using the script at
https://github.com/GoogleCloudPlatform/python-docs-samples/blob/master/bigquery/api/load_data_by_post.py with very little modification.
I added
,chunksize=10*1024*1024, resumable=True))
to MediaFileUpload.
The script works fine for a sample file with a few million records. The actual file is about 140 GB with approx 200,000,000 records. insert_request.execute() always fails with
socket.error: `[Errno 32] Broken pipe`
after half an hour or so. How can this be fixed? Each row is less than 1 KB, so it shouldn't be a quota issue.
When handling large files don't use streaming, but batch load: Streaming will easily handle up to 100,000 rows per second. That's pretty good for streaming, but not for loading large files.
The sample code linked is doing the right thing (batch instead of streaming), so what we see is a different problem: the sample code is trying to load all this data straight into BigQuery, but the upload through POST fails.
Solution: Instead of loading big chunks of data through POST, stage them in Google Cloud Storage first, then tell BigQuery to read files from GCS.
Update: Talking to the engineering team, POST should work if you try a smaller chunksize.
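A sketch of the staging approach, using the newer google-cloud-storage and google-cloud-bigquery client libraries rather than the discovery-based client in the linked sample (bucket, dataset, and table names are placeholders):

from google.cloud import bigquery, storage

# 1. Stage the newline-delimited JSON file in Google Cloud Storage.
storage_client = storage.Client()
bucket = storage_client.bucket("my-staging-bucket")
bucket.blob("data.json").upload_from_filename("data.json")

# 2. Tell BigQuery to load the file from GCS as a batch job.
bq_client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    autodetect=True,  # or supply an explicit schema
)
load_job = bq_client.load_table_from_uri(
    "gs://my-staging-bucket/data.json",
    "my-project.my_dataset.my_table",
    job_config=job_config,
)
load_job.result()  # wait for the batch load job to finish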

Using Python File Operations

So I have two scripts: main.py, which is run at startup and runs in the background, and otherscript.py, which is run whenever the user invokes it.
main.py crunches some data and writes it out to a file on every iteration of its while loop (the data is about 1.17 MB), erasing the old data, so data.txt always contains the latest crunched data.
otherscript.py reads data.txt (the current data at that instant) and then does something with it.
main.py
while True:
    data = crunchData()
    with open("data.txt", "w") as f:   # rewrite the file with the latest data
        f.write(data)
otherscript.py
with open("data.txt") as f:
    data = f.read()
doSomethingWithData(data)
How can I make the communication between the two scripts faster? Are there any alternatives to passing the data through a file?
This is a problem of Inter-Process Communication (IPC). In your case, you basically have a producer process, and a consumer process.
One way of doing IPC, as you've found, is using files. However, it'll saturate the disk quickly if there's lots of data going through.
If you had a straight consumer that wanted to read all the data all the time, the easiest way to do this would probably be a pipe, at least if you're on a Unix platform (macOS, Linux).
If you want a cross-platform solution, my advice, in this case, would be to use a socket. Basically, you open a network port on the producer process, and every time a consumer connects, you dump the latest data. You can find a simple how-to on sockets here.
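A minimal sketch of the socket approach, assuming both scripts run on the same machine; the port number and the crunchData placeholder are made up, and the real crunching logic would come from main.py:

# producer (the main.py side): keep crunching in the background and hand the
# most recent result to any consumer that connects.
import socket
import threading

def crunchData():
    # stand-in for the real data-crunching step in main.py
    return "some freshly crunched data\n"

latest = b""

def crunch_loop():
    global latest
    while True:
        latest = crunchData().encode()

threading.Thread(target=crunch_loop, daemon=True).start()

with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind(("127.0.0.1", 50007))   # assumed local port
    server.listen()
    while True:
        conn, _ = server.accept()
        with conn:
            conn.sendall(latest)        # dump the latest data and close

# consumer (the otherscript.py side):
# with socket.create_connection(("127.0.0.1", 50007)) as conn:
#     data = b"".join(iter(lambda: conn.recv(65536), b"")).decode()
# doSomethingWithData(data)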

Python: Two scripts working with the same file, one updating it and another deleting the data when processed

Firstly, I am new to Python.
Now my question goes like this: I have a callback script running on a remote machine which sends some data and runs a script on my local machine, which processes that data and writes it to a file. Now another local script of mine needs to process the file data one by one and delete entries from the file once they are done.
The problem is that the file may be updated continuously. How do I synchronize the work so that it doesn't mess up my file?
Also, please suggest if the same work can be done in some better way.
I would suggest you look into named pipes or sockets, which seem better suited to your purpose than a file, at least if it's really just between those two applications and you have control over the source code of both.
For example, on Unix, you could create a pipe like this (see os.mkfifo):
import os
os.mkfifo("/some/unique/path")
And then access it like a file:
dest = open("/some/unique/path", "w") # on the sending side
src = open("/some/unique/path", "r") # on the reading side
The data will be queued between your processes. It's first-in, first-out really, but it behaves (mostly) like a file.
If you cannot use named pipes like this, I'd suggest IP sockets over localhost from the socket module, preferably DGRAM sockets, since you then don't need to do any connection handling. You seem to know how to do networking already.
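A minimal sketch of the DGRAM variant over localhost; the port number is an assumption and the print call stands in for the real processing:

# receiver: the script that processes entries one by one
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(("127.0.0.1", 50008))           # assumed local port
while True:
    entry, _ = sock.recvfrom(65536)       # one datagram per entry
    print("processing:", entry.decode())  # stand-in for the real processing

# sender: the script triggered by the remote callback
# sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
# sock.sendto(b"one entry of data", ("127.0.0.1", 50008))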
I would suggest using a database whose transactions allow for concurrent processing.
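If the database route is taken, SQLite from the standard library is a lightweight option; here is a sketch in which the writer inserts rows and the processor consumes and deletes them inside a transaction (the table name and columns are made up):

import sqlite3

# writer side: append new entries as they arrive
conn = sqlite3.connect("queue.db", timeout=10)   # timeout waits out the other script's lock
conn.execute("CREATE TABLE IF NOT EXISTS entries (id INTEGER PRIMARY KEY, payload TEXT)")
conn.execute("INSERT INTO entries (payload) VALUES (?)", ("some data",))
conn.commit()

# processor side: take one entry, handle it, delete it, all in one transaction
conn = sqlite3.connect("queue.db", timeout=10)
with conn:                                       # commits or rolls back the block atomically
    row = conn.execute("SELECT id, payload FROM entries ORDER BY id LIMIT 1").fetchone()
    if row is not None:
        print("processing:", row[1])             # stand-in for the real processing
        conn.execute("DELETE FROM entries WHERE id = ?", (row[0],))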
