I'm currently running a script in a Jupyter Notebook that loops over a DataFrame and manipulates the data of the current row. Since my DataFrame has thousands of rows and each loop iteration takes a while, I'm wondering whether it's safe to interrupt the script without losing all of my progress.
I'm keeping track of rows that have already been processed, so I could just start where I left off, provided the manipulations already made to the DataFrame don't get lost. I don't want to risk just trying it out right now, so advice would be appreciated.
Unless you are storing progress in external files, interrupting Jupyter risks losing your data. I strongly recommend against counting on the variables inside Jupyter being in a consistent state if you interrupt mid-way through a calculation. Save intermediate steps to files to track progress, chunking as you go.
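One possible checkpointing pattern looks like this (a minimal sketch; the DataFrame, the process() function, the chunk size and the progress.csv path are placeholder assumptions, not your actual code):

import os
import pandas as pd

# hypothetical stand-ins for your DataFrame and row manipulation
df = pd.DataFrame({"value": range(1000)})
def process(row):
    return row["value"] * 2

checkpoint = "progress.csv"
done = set()
if os.path.exists(checkpoint):
    done = set(pd.read_csv(checkpoint)["row"])    # resume where the last run stopped

buffer = []
for idx, row in df.iterrows():
    if idx in done:
        continue
    buffer.append({"row": idx, "result": process(row)})
    if len(buffer) >= 100:                        # flush a chunk to disk
        pd.DataFrame(buffer).to_csv(checkpoint, mode="a",
                                    header=not os.path.exists(checkpoint),
                                    index=False)
        buffer = []

if buffer:                                        # flush the final partial chunk
    pd.DataFrame(buffer).to_csv(checkpoint, mode="a",
                                header=not os.path.exists(checkpoint),
                                index=False)

With something like this, an interrupt only loses the rows processed since the last flush.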
I read a big CSV file into a DataFrame in a Jupyter notebook with:
df = pd.read_csv(my_file)
df.info()
> memory usage: 10.7+ GB
When I execute the same cell again, the total memory usage of my system increases. After I repeat this a few times, the Jupyter kernel eventually dies.
I would expect Python to release the memory of the old data before loading new data into the same variable, or at least once loading finishes. Why does the memory usage increase more and more? How can I make Python return that memory to the system?
In this case, as Giacoma Catenazzi explained in his comment, ipykernel (the kernel behind Jupyter notebooks) keeps every single variable in memory until you explicitly tell it to clear that space. That is one of the main reasons.
But why does it increase?
The basic idea is that in Jupyter you write procedural code, with cells meant to run one after another. Most of the time you never rerun the same cell over and over again, especially if a later variable depends on the variable being loaded (which I think is your case).
So if you are working with large quantities of data, it is recommended to use either the copy method or the del statement.
More about del:
del clears the space in memory that your variable is using: it removes that name's reference to the object, so Python can free the memory once nothing else refers to it.
You can read more about the del statement on the Programiz website.
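A minimal sketch of the del pattern (the file path is a made-up placeholder):

import gc
import pandas as pd

my_file = "big.csv"              # hypothetical path standing in for your file
df = pd.read_csv(my_file)
# ... work with df ...

del df                           # remove this name's reference to the DataFrame
gc.collect()                     # let Python free the memory before the next load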
More about copy:
It copies your DataFrame's indices and data, depending on the parameters you pass. It lets you work on a modified copy without affecting the original DataFrame and without having to restart the kernel; more about it in the pandas documentation.
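And a sketch of copy (the data is made up for illustration):

import pandas as pd

df = pd.DataFrame({"price": [1.0, 2.0, 3.0]})   # made-up data
snapshot = df.copy(deep=True)                    # independent copy of data and index

snapshot["price"] = 0                            # changing the copy...
print(df["price"].tolist())                      # ...leaves the original untouched: [1.0, 2.0, 3.0]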
The context:
I am using PyArrow to read a folder structured as exchange/symbol/date.parquet. The folder contains multiple exchanges, multiple symbols and multiple files. At the time of writing, the folder is about 30GB / 1.85M files.
If I use a single PyArrow Dataset to read/manage the entire folder, the simplest process with just the dataset defined occupies 2.3GB of RAM. The problem is that I am instantiating this dataset in multiple processes, but since every process only needs some exchanges (typically just one), I don't need to read all folders and files in every single process.
So I tried to use a UnionDataset composed of single-exchange Datasets. This way, every process only loads the required folders/files as a dataset. In a simple test, each process now occupies just 868MB of RAM, a 63% reduction.
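For illustration, a minimal way to build such a per-exchange UnionDataset (the root path, exchange names and the symbol column below are simplified placeholders, not my real values):

import pyarrow as pa
import pyarrow.dataset as ds

# assumed layout: /data/exchange/symbol/date.parquet
exchanges = ["exchange_a", "exchange_b"]          # placeholder exchange folders
part = ds.partitioning(pa.schema([("symbol", pa.string())]))

children = [ds.dataset(f"/data/{name}", format="parquet", partitioning=part)
            for name in exchanges]
union = ds.dataset(children)                      # a list of datasets yields a UnionDataset

table = union.to_table(filter=ds.field("symbol") == "BTC-USD")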
The problem:
When using a single Dataset for the entire folder/files, I have no problem at all. I can read filtered data without issues and it's fast.
But when I read the UnionDataset's filtered data, I always get the error Process finished with exit code 139 (interrupted by signal 11: SIGSEGV). After looking for every possible source of the problem, I noticed that if I create a dummy folder with multiple exchanges but only some symbols, to limit the number of files to read, I don't get that error and everything works normally. If I then copy in new symbol folders (any of them), I get that error again.
I have come to think that the problem is not in my code, but is linked instead to the number of files that the UnionDataset is able to manage.
Am I correct or am I doing something wrong? Thank you all, have a nice day and good work.
I'm writing an app that monitors an application's scanning process. Of course, to check this over time I have to log the progress (don't ask me why this isn't in the app already).
To do this, the app runs every half hour, determines what is and isn't worth logging, and adds it to a pandas DataFrame that is then saved locally as a CSV, so that the next run can determine whether progress is as we expect.
My question is: should I append the data to the DataFrame as I find it during the run, or store it in a list or another DataFrame and append it all at the end of the run before saving to CSV?
Is there a benefit to one way or the other, or is the difference between running append multiple times vs. once negligible?
The reason I ask is that this could eventually be large amounts of data being appended, so building efficiency in from the start is a good idea.
Thanks in Advance
This really depends on what you mean by large amounts of data. If it's MBs, then keeping everything as a df in memory is fine; however, if it's GBs, then it's better to save them to CSV files and concat them into a new df:
import pandas as pd
from glob import glob

df = pd.concat([pd.read_csv(i) for i in glob('/path/to/csv_files/*.csv')])
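If the data does fit in memory, a common pattern is to collect rows in a plain list during the run and build the DataFrame once at the end, instead of appending to a DataFrame repeatedly (a sketch; the scan results and column names are made up):

import pandas as pd

# hypothetical scan results gathered during one run
scan_events = [
    {"timestamp": "2024-01-01 00:00", "status": "ok"},
    {"timestamp": "2024-01-01 00:30", "status": "stalled"},
]

rows = []
for event in scan_events:
    rows.append(event)          # cheap list append during the run

df = pd.DataFrame(rows)         # build the DataFrame once at the end
df.to_csv("scan_log.csv", index=False)   # or mode="a" to append to the previous runs' file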
So I just started coding with Python. I have a lot of PDFs which are my target for data grabbing. The script is finished and it works without errors if I limit it to a small number of PDFs (~200). If I let the script run on 4000 PDFs, it is terminated without an error. A friend of mine told me that this is due to the cache.
I save the grabbed data to lists and in the last step create a DataFrame out of the different lists. The DataFrame is then exported to Excel.
So I tried to export the DataFrame after every 200 PDFs (and then clear all lists and the DataFrame), but then pandas overwrites the prior results. Is this the right way to go? Or can anyone think of a different approach to get around the termination caused by a large number of PDFs?
Right now I use:
import pandas as pd

MN = list()   # Materialnummer values grabbed from the PDFs
VdS = list()  # 'Verwendung des Stoffs' values grabbed from the PDFs

data = {'Materialnummer': MN, 'Verwendung des Stoffs': VdS}
df = pd.DataFrame(data)
df.to_excel('test.xls')
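Would appending each chunk to a CSV and converting to Excel only once at the very end be a sensible approach? Something like this rough sketch (the function, file names and chunking are placeholders, not my actual script):

import os
import pandas as pd

def flush_chunk(mn_chunk, vds_chunk, path="results.csv"):
    """Append one chunk of grabbed data to a CSV instead of overwriting it."""
    df = pd.DataFrame({"Materialnummer": mn_chunk,
                       "Verwendung des Stoffs": vds_chunk})
    # write the header only when the file does not exist yet
    df.to_csv(path, mode="a", header=not os.path.exists(path), index=False)

# after all PDFs are processed, export the combined CSV to Excel once:
# pd.read_csv("results.csv").to_excel("test.xlsx", index=False)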
I am testing a simple Python script to collect images. I set a fixed time to run the script every day, but I want a count that keeps increasing across runs.
schedule.every().day.at(time1).do(job)
I realized that if I don't do that, the new images will overwrite the old images. I want to find a way to properly count/name the images newly downloaded on the next day. Can anyone help?
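Would something like this work to persist the count between runs, so each image gets a new name? (the file names below are made up)

import json
import os

COUNTER_FILE = "image_counter.json"       # hypothetical file that survives between runs

def next_image_name(prefix="img", ext=".jpg"):
    """Return a unique, increasing file name, even across daily runs."""
    count = 0
    if os.path.exists(COUNTER_FILE):
        with open(COUNTER_FILE) as f:
            count = json.load(f)["count"]
    count += 1
    with open(COUNTER_FILE, "w") as f:
        json.dump({"count": count}, f)
    return f"{prefix}_{count:06d}{ext}"    # e.g. img_000001.jpg

# inside job(): save each downloaded image under next_image_name()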