Python Pandas memory management

Two questions:
When we delete/drop a dataframe and then call gc.collect(), does Python free the RAM it occupied?
If we do df.to_csv(filename), will df refer to the file after that, or is the data still held in memory (RAM)?

When we delete/drop a dataframe and then call gc.collect(), does Python free the RAM it occupied?
Yes, the memory the dataframe occupied becomes free space in RAM.
If we do df.to_csv(filename), will df refer to the file after that, or is the data still held in memory (RAM)?
No, it will not refer to the file; df still refers to the in-memory data. to_csv only writes a copy to disk.
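A minimal sketch of both points, with an arbitrary filename and arbitrary sizes:
import gc
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(1_000_000, 10))  # arbitrary example data

df.to_csv("out.csv")   # writes a copy to disk; df still holds the in-memory data

del df                 # drop the reference to the dataframe
gc.collect()           # ask the garbage collector to reclaim the memory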

Related

Does PySpark cache a dataframe by default or not?

If I read a file in PySpark:
data = spark.read.csv("file.csv")
then for the life of the Spark session, data is available in memory, correct? So if I call data.show() 5 times, it will not read from disk 5 times. Is that correct? If yes, why do I need:
data.cache()
If I read a file in PySpark:
data = spark.read.csv("file.csv")
then for the life of the Spark session, data is available in memory, correct?
No. Nothing happens at this point because of Spark's lazy evaluation; the file is only read when the first action runs, which in your case is the first call to show().
So if I call data.show() 5 times, it will not read from disk 5 times. Is that correct?
No. The dataframe will be re-evaluated for each call to show(). Caching the dataframe prevents that re-evaluation, forcing the data to be read from the cache instead.
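A minimal sketch of the difference, assuming a local Spark session and a placeholder file path:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Nothing is read yet: read.csv only builds a query plan (lazy evaluation)
data = spark.read.csv("file.csv", header=True)

data.cache()   # marks the dataframe for caching; still lazy
data.show()    # first action: reads the file and populates the cache
data.show()    # later actions reuse the cached data instead of re-reading the file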

Can using Pandas read_csv alter the original file?

Using this code to work with a dataset:
cities_orig = pd.read_csv("cities.csv")
I presume that now that cities_orig is assigned, I can alter that dataframe and the original csv file won't be changed. Am I correct? Are there exceptions?
Yes, you are correct, and there are no exceptions. read_csv reads the file and loads its contents into memory. Any changes to the in-memory dataframe do not change the actual CSV file.
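A minimal sketch of that point, assuming cities.csv exists; the population column is hypothetical:
import pandas as pd

cities_orig = pd.read_csv("cities.csv")   # loads the file contents into memory
cities_orig["population"] = 0             # mutates only the in-memory copy

# Re-reading the file shows the data on disk is unchanged
cities_check = pd.read_csv("cities.csv")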

Renaming the columns in Vaex

I tried to read a 4 GB CSV file, initially with pandas pd.read_csv, but my system is running out of memory (I guess) and the kernel restarts or the system hangs.
So I tried using the vaex library to convert the CSV to HDF5 and run operations (aggregations, group by) on that. For that I've used:
df = vaex.from_csv('Wager-Win_April-Jul.csv',column_names = None, convert=True, chunk_size=5000000)
and
df = vaex.from_csv('Wager-Win_April-Jul.csv',header = None, convert=True, chunk_size=5000000)
But I'm still getting the first record of the CSV file as the header (the column names, to be precise) and I'm unable to change the column names. I tried to find a function to change the names but didn't come across any. Please help me with that. Thanks :)
The column names 1559104, 10289, 991... are actually the first record in the CSV; somehow vaex is taking the first row as my column names, which I want to avoid.
vaex.from_csv is a wrapper around pandas.read_csv with a few extra options for the conversion.
Reading the pandas documentation: header='infer' (the default) tells the CSV reader to infer the column names automatically; otherwise the first row of the file is used as the header. Alternatively, you can pass the column names manually via the names kwarg. The same holds true for both vaex and pandas.
I would read the pandas.read_csv documentation to better understand all the options. You can then use those options with vaex, together with the convert and chunk_size arguments.
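For instance, a sketch with made-up column names (adjust them to your file): names tells the reader what to call the columns, and header=None keeps the first row as data.
import vaex

df = vaex.from_csv(
    'Wager-Win_April-Jul.csv',
    header=None,                               # do not take the first row as the header
    names=['wager_id', 'win_amount', 'date'],  # hypothetical column names
    convert=True,
    chunk_size=5_000_000,
)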

Pandas - memory error while importing a CSV file of size 4GB

I tried importing a 4 GB CSV file using pd.read_csv but received an out-of-memory error. I then tried dask.dataframe, but couldn't convert it to a pandas dataframe (same memory error).
import pandas as pd
import dask.dataframe as dd

df = dd.read_csv(file)   # `file` is the path to the 4 GB CSV
df = df.compute()        # converting to a pandas dataframe pulls everything into memory
Then I tried the chunksize parameter, but got the same memory error:
import pandas as pd

df = pd.read_csv(file, chunksize=1000000, low_memory=False)   # returns an iterator of chunks
df = pd.concat(df)                                            # concatenating loads everything into memory
I also tried using chunksize with a list, same error:
import pandas as pd

chunks = []
for chunk in pd.read_csv(file, chunksize=1000000, low_memory=False):
    chunks.append(chunk)
df = pd.concat(chunks)
Attempts:
Tried with file size 1.5GB - successfully imported
Tried with file size 4GB - failed (memory error)
Tried with low chunksize (2000 or 50000) - failed (memory error for 4GB file)
Please let me know how to proceed.
I use Python 3.7 and have 8 GB of RAM.
I also tried Attempt 3 on a server with 128 GB of RAM, but still got a memory error.
I cannot assign dtypes, as the CSV file to be imported can contain different columns at different times.
This has already been answered here:
How to read a 6 GB csv file with pandas
I also tried the above method with a 2GB file and it works.
Also try to keep the chunk size even smaller.
Can you share the configuration of your system as well? That would be quite useful.
I just want to record what I tried after getting enough suggestions. Thanks to Robin Nemeth and juanpa.
As juanpa pointed out, I was able to read the 4 GB CSV file on the server with 128 GB of RAM when I used a 64-bit Python executable.
As Robin pointed out, even with a 64-bit executable I could not read the 4 GB CSV file on my local machine with 8 GB of RAM.
So, no matter what we try, the machine's RAM matters, because a pandas dataframe is held entirely in memory.
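If the end goal is an aggregate rather than the full table, one workaround is to process each chunk as it is read and keep only the reduced result in memory. A sketch, with a placeholder path and a hypothetical numeric column:
import pandas as pd

total_rows = 0
running_sum = 0.0
for chunk in pd.read_csv(file, chunksize=1000000, low_memory=False):
    total_rows += len(chunk)
    running_sum += chunk["amount"].sum()   # "amount" is a hypothetical column

print(total_rows, running_sum)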

Pandas read_stata() with large .dta files

I am working with a Stata .dta file that is around 3.3 gigabytes, so it is large but not excessively large. I am interested in using IPython and tried to import the .dta file using Pandas, but something wonky is going on. My box has 32 gigabytes of RAM, and attempting to load the .dta file results in all the RAM being used (after ~30 minutes) and my computer stalling out. This doesn't 'feel' right, in that I am able to open the file in R using read.dta() from the foreign package no problem, and working with the file in Stata is fine. The code I am using is:
%time myfile = pd.read_stata(data_dir + 'my_dta_file.dta')
and I am using IPython in Enthought's Canopy program. The reason for the '%time' is because I am interested in benchmarking this against R's read.dta().
My questions are:
Is there something I am doing wrong that is resulting in Pandas having issues?
Is there a workaround to get the data into a Pandas dataframe?
Here is a little function that has been handy for me, using some pandas features that might not have been available when the question was originally posed:
import sys
import pandas as pd

def load_large_dta(fname):
    reader = pd.read_stata(fname, iterator=True)
    chunks = []
    try:
        chunk = reader.get_chunk(100 * 1000)
        while len(chunk) > 0:
            chunks.append(chunk)
            chunk = reader.get_chunk(100 * 1000)
            print('.', end='')
            sys.stdout.flush()
    except (StopIteration, KeyboardInterrupt):
        # StopIteration ends the read cleanly; Ctrl-C keeps whatever was loaded so far
        pass
    df = pd.concat(chunks, ignore_index=True) if chunks else pd.DataFrame()
    print('\nloaded {} rows'.format(len(df)))
    return df
I loaded an 11 GB Stata file in 100 minutes with this, and it's nice to have something to play with if I get tired of waiting and hit Ctrl-C.
This notebook shows it in action.
For all the people who end up on this page, please upgrade Pandas to the latest version. I had this exact problem with a stalled computer during load (a 300 MB Stata file but only 8 GB of system RAM), and upgrading from v0.14 to v0.16.2 solved the issue in a snap.
Currently, it's v0.16.2. There have been significant improvements in speed, though I don't know the specifics. See: most efficient I/O setup between Stata and Python (Pandas)
There is a simpler way to solve it using Pandas' built-in function read_stata.
Assume your large file is named large.dta.
import pandas as pd

reader = pd.read_stata("large.dta", chunksize=100000)
chunks = []
for chunk in reader:
    chunks.append(chunk)
df = pd.concat(chunks, ignore_index=True)
df.to_csv("large.csv")
Question 1.
There's not much I can say about this.
Question 2.
Consider exporting your .dta file to .csv using the Stata command outsheet or export delimited, and then using read_csv() in pandas. In fact, you could take the newly created .csv file, use it as input for R and compare it with pandas (if that's of interest). read_csv is likely to have had more testing than read_stata.
Run help outsheet for details of the export.
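For example (filenames are placeholders), the export can be done in Stata and the result read back with pandas, optionally in chunks to limit memory use:
import pandas as pd

# In Stata:
#   export delimited using "my_dta_file.csv", replace
# or, on older Stata versions:
#   outsheet using "my_dta_file.csv", comma

chunks = pd.read_csv("my_dta_file.csv", chunksize=500000)
df = pd.concat(chunks, ignore_index=True)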
You should not be reading a 3 GB+ file into an in-memory data object; that's a recipe for disaster (and has nothing to do with pandas).
The right way to do this is to memory-map the file and access the data as needed.
You should consider converting your file to a more appropriate format (CSV or HDF5), and then you can use the Dask wrapper around the pandas DataFrame to chunk-load the data as needed:
from dask import dataframe as dd

# If you don't want to use all the columns, make a selection
columns = ['column1', 'column2']
data = dd.read_csv('your_file.csv', usecols=columns)
This will transparently take care of chunk-loading, multicore data handling and all that stuff.
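As a brief follow-up (the column name above is a placeholder): operations on the Dask dataframe stay lazy, so an aggregation only materialises its reduced result when .compute() is called:
# Only the final, reduced result is brought into memory
mean_value = data['column1'].mean().compute()
print(mean_value)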
