Memory usage is at 29GB when loading a big dataframe - python

I have 15 CSV files. I load all 15 of them and concatenate them with the code below:
[Jupyter notebook]
%%time
df1 = pd.read_csv("test1.csv")
# ... repeated for all 15 files, up to df15 = pd.read_csv("test15.csv")
# Combine the dataframes
total_df = pd.concat([df1,df2,df3,df4,df5,df6,df7,df8,df9,df10,df11,df12,df13,df14,df15])
When I check my memory, 29 GB is used. I try to use the code below to delete the individual dataframes:
import gc
del [[df1, df2, df3, df4, df5, df6, df7, df8, df9, df10,df11,df12,df13,df14,df15]]
gc.collect()
But it still won't work; memory usage stays at 29 GB.
Is there a way to free the memory used by the individual dataframes and keep only total_df in memory?
I am planning to do further processing on total_df, but the kernel dies when I load such a big dataframe. I hope to reduce memory first and then run more tasks on total_df, such as PCA to reduce the dimensionality.
Also, the shape of each dataframe is about (900000, 1024), so in total there are more than ten million rows, each with over 1k columns.

You might want to try loading the files directly in concat:
files = ['test1.csv', 'test2.csv']  # ... and so on for all 15 files
total_df = pd.concat(pd.read_csv(f) for f in files)
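If the files follow the naming in the question (test1.csv through test15.csv — an assumption), a sketch like this builds the file list programmatically, so no intermediate df1...df15 names stay alive after the concat:
import pandas as pd

# Assumed file names, based on the question.
files = [f"test{i}.csv" for i in range(1, 16)]

# Each frame produced by the generator can be freed once it has been
# copied into the concatenated result.
total_df = pd.concat((pd.read_csv(f) for f in files), ignore_index=True)

# If every one of the ~1024 columns is numeric, down-casting roughly
# halves the footprint (only if the precision loss is acceptable):
# total_df = total_df.astype("float32")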

Related

How to work with a DataFrame which cannot be transformed by pandas pivot due to excessive memory usage?

I have a dataframe with this structure:
I built this dfp with 100 rows of the original for testing,
and then I tried a pivot operation to get a dataframe like this:
The problem with the pivot operation on the full data is that the result would have 131209 rows and 39123 columns. When I try it, the memory fills up and my PC restarts.
I tried segmenting the dataframe into pieces of 10 or 20. The pivot works, but when I concat the pieces the memory crashes again.
My PC has 16 GB of memory. I have also tried Colab, but it runs out of memory as well.
Is there a format or another strategy to make this operation work?
You may try this:
dfu = dfp.groupby(['order_id','product_id'])[['my_column']].sum().unstack().fillna(0)
Another way is to split product_id into two parts, process each part, and concatenate the results back together:
front_part = []
rear_part = []
dfp_f = dfp[dfp['product_id'].isin(front_part)]
dfp_r = dfp[dfp['product_id'].isin(rear_part)]
dfs_f = dfp_f.pivot(index='order_id', columns='product_id', values=['my_column']).fillna(0)
dfs_r = dfp_r.pivot(index='order_id', columns='product_id', values=['my_column']).fillna(0)
dfs = pd.concat([dfs_f, dfs_r], axis=1)
front_part and rear_part mean we want to separate product_id into two parts; the concrete product_id values have to be listed in each of those lists, as in the sketch below.
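For example, one way to build those two lists (a sketch, assuming dfp is the dataframe from the question) is to split the unique product IDs down the middle:
# Split the distinct product_id values into two halves; any other
# partition of the IDs works the same way.
unique_ids = sorted(dfp["product_id"].unique())
midpoint = len(unique_ids) // 2
front_part = unique_ids[:midpoint]
rear_part = unique_ids[midpoint:]
The same idea extends to more than two slices if two halves are still too wide to pivot at once.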

Memory Error: happening on Linux but not Mac OS

I have a big pandas dataframe (7 GiB) that I read from a csv. I need to merge this dataframe with another one, much much smaller. Let's say its size is negligible.
I'm aware that a merge operation in pandas will keep the 2 dataframes to merge + the merged dataframe. Since I have only 16 GiB of RAM, when I run the merge on Linux, it fails with a memory error (my system consumes around 3-4 GiB).
I also tried to run the merge on a Mac, with 16 GiB as well. The system consumes about 3 GiB of RAM by default. The merge completed on the Mac, with the memory going no higher than 10 GiB.
How is this possible? The version of pandas is the same, the dataframe is the same. What is happening here?
Edit:
Here is the code I use to read/merge my files:
# Read the data for the stations, stored in a separate file
stations = pd.read_csv("stations_with_id.csv", index_col=0)
stations.set_index("id_station")
list_data = list()
data = pd.DataFrame()
# Merge all pollutants data in one dataframe
# Probably not the most optimized approach ever...
for pollutant in POLLUTANTS:
    path_merged_data_per_pollutant = os.path.join("raw_data", f"{pollutant}_merged")
    print(f"Pollutant: {pollutant}")
    for f in os.listdir(path_merged_data_per_pollutant):
        if ".csv" not in f:
            print(f"passing {f}")
            continue
        print(f"loading {f}")
        df = pd.read_csv(
            os.path.join(path_merged_data_per_pollutant, f),
            sep=";",
            na_values="mq",
            dtype={"concentration": "float64"},
        )
        # Drop useless columns and translate useful ones to english
        # Do that here to limit memory usage
        df = df.rename(index=str, columns=col_to_rename)
        df = df[list(col_to_rename.values())]
        # Date formatted as YYYY-MM
        df["date"] = df["date"].str[:7]
        df.set_index("id_station")
        df = pd.merge(df, stations, left_on="id_station", right_on="id_station")
        # Filter entries to France only (only the metropolitan area) based on GPS coordinates
        df = df[(df.longitude > -5) & (df.longitude < 12)]
        list_data.append(df)
    print("\n")
data = pd.concat(list_data)
The only column that is not a string is concentration, and I specify the type when I read the csv.
The stations dataframe is < 1 MiB.
macOS has compressed memory since OS X Mavericks. If your dataframe is not literally random data, it won't take up the full 7 GiB in RAM.
There are ways to get compressed memory on Linux as well (e.g. zram or zswap), but this isn't necessarily enabled. It depends on your distro and configuration.

Pandas_Merge two large Datasets

I am using pandas for my analysis (currently in a Jupyter notebook). I have two large datasets (one is 14 GB and the second one is 4 GB). I need to merge these two datasets based on a column. I employ the following code:
df = pd.merge(aa, bb, on='column', how='outer')
Normally, this code works. However, since my datasets are large, it takes too much time. I ran my code 4 hours ago and it is still running. My machine has 8 GB of RAM.
Do you have any suggestions for that?
You can try using dask.dataframe to parallelize your task:
import dask.dataframe as dd
# define lazy readers
aa = dd.read_csv('file1.csv')
bb = dd.read_csv('file2.csv')
# define merging logic
dd_merged = aa.merge(bb, on='column', how='outer')
# apply merge and convert to dataframe
df = dd_merged.compute()
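Note that .compute() materialises the full merged result as a single pandas dataframe, which may itself not fit in 8 GB of RAM. If the goal is just to persist the merged data, a sketch like this (output pattern is an assumption) writes it out partition by partition instead:
# Write the merged result to disk without collapsing it into one
# in-memory pandas dataframe first; '*' is replaced per partition.
dd_merged.to_csv("merged-*.csv", index=False)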

How to use dask to quickly access subsets of the data?

One of the main reasons I love pandas is that it's easy to home in on subsets, e.g. df[df.sample.isin(['a', 'c', 'p'])] or df[df.age < 35]. Is dask dataframe good at (optimized for) this as well? The tutorials I've seen have focused on whole-column manipulations.
My specific application is (thousands of named GCMS samples) x (~20000 time points per sample) x (500 m/z channels) x (intensity), and I'm looking for the fastest tool to pull arbitrary subsets, e.g.
df[df.sample.isin([...]) & df.rt.lt(800) & df.rt.gt(600) & df.mz.isin(...)]
If dask is a good choice, then I would appreciate advice on how best to structure it.
What I've tried
What I've tried so far is to convert each sample to a pandas dataframe that looks like this:
smp rt 14 15 16 17 18
0 160602_JK_OFCmix:1 271.0 64088.0 9976.0 26848.0 23928.0 89600.0
1 160602_JK_OFCmix:1 271.1 65472.0 10880.0 28328.0 24808.0 91840.0
2 160602_JK_OFCmix:1 271.2 64528.0 10232.0 27672.0 25464.0 90624.0
3 160602_JK_OFCmix:1 271.3 63424.0 10272.0 27600.0 25064.0 90176.0
4 160602_JK_OFCmix:1 271.4 64816.0 10640.0 27592.0 24896.0 90624.0
('smp' is the sample name, 'rt' is retention time, and 14, 15, ..., 500 are m/z channels), save it to HDF with zlib, level=1, and then make the dask dataframe with
ddf = dd.read_hdf('*.hdf5', key='/*', chunksize=100000, lock=False)
but df = ddf[ddf.smp.isin([...a couple of samples...])].compute() is 100x slower than ddf['57'].mean().compute().
(Note: this is with dask.set_options(get=dask.multiprocessing.get))
Your dask.dataframe is backed by an HDF file, so every time you do any operation you're reading in the data from disk. This is great if your data doesn't fit in memory but wasteful if your data does fit in memory.
If your data fits in memory
Instead, if your data fits in memory then try backing your dask.dataframe off of a Pandas dataframe:
# ddf = dd.read_hdf(...)
ddf = dd.from_pandas(df, npartitions=20)
I expect you'll see better performance from the threaded or distributed schedulers: http://dask.pydata.org/en/latest/scheduler-choice.html
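With recent dask versions the scheduler can be chosen per call rather than via dask.set_options; a minimal sketch (the sample name is taken from the question's data):
# Threaded scheduler; older dask releases used
# dask.set_options(get=dask.threaded.get) for the same effect.
subset = ddf[ddf.smp.isin(["160602_JK_OFCmix:1"])].compute(scheduler="threads")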
If your data doesn't fit in memory
Try to reduce the number of bytes you have to read by specifying a set of columns to read in your read_hdf call
df = dd.read_hdf(..., columns=['57'])
Or, better yet, use a data store that lets you efficiently load individual columns. You could try something like Feather or Parquet, though both are in early stages:
https://github.com/wesm/feather
http://fastparquet.readthedocs.io/en/latest/
I suspect that if you're careful to avoid reading in all of the columns at once you could probably get by with just Pandas instead of using Dask.dataframe.
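For reference, a sketch of the Parquet route (the paths are assumptions): convert the HDF store once, then read back only the columns you need; row filtering stays lazy until compute():
import dask.dataframe as dd

# One-off conversion from the existing HDF files to Parquet.
ddf = dd.read_hdf("*.hdf5", key="/*")
ddf.to_parquet("gcms_parquet")

# Later reads pull only the requested columns off disk.
ddf_small = dd.read_parquet("gcms_parquet", columns=["smp", "rt", "57"])
subset = ddf_small[ddf_small.smp.isin(["160602_JK_OFCmix:1"])].compute()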

How to stream in and manipulate a large data file in python

I have a relatively large (1 GB) text file that I want to cut down in size by summing across categories:
Geography AgeGroup Gender Race Count
County1 1 M 1 12
County1 2 M 1 3
County1 2 M 2 0
To:
Geography Count
County1 15
County2 23
This would be a simple matter if the whole file could fit in memory, but using pandas.read_csv() gives a MemoryError. So I have been looking into other methods, and there appear to be many options - HDF5? Using itertools (which seems complicated - generators?)? Or just using the standard file methods to read in the first geography (70 lines), sum the count column, and write it out before loading in another 70 lines.
Does anyone have any suggestions on the best way to do this? I especially like the idea of streaming data in, especially because I can think of a lot of other places where this would be useful. I am most interested in this method, or one that similarly uses the most basic functionality possible.
Edit: In this small case I only want the sums of count by geography. However, it would be ideal if I could read in a chunk, specify any function (say, add 2 columns together, or take the max of a column by geography), apply the function, and write the output before reading in a new chunk.
You can use dask.dataframe, which is syntactically similar to pandas, but performs manipulations out-of-core, so memory shouldn't be an issue:
import dask.dataframe as dd
df = dd.read_csv('my_file.csv')
df = df.groupby('Geography')['Count'].sum().to_frame()
df.to_csv('my_output.csv')
Alternatively, if pandas is a requirement you can use chunked reads, as mentioned by @chrisaycock. You may want to experiment with the chunksize parameter.
# Operate on chunks.
data = []
for chunk in pd.read_csv('my_file.csv', chunksize=10**5):
chunk = chunk.groupby('Geography', as_index=False)['Count'].sum()
data.append(chunk)
# Combine the chunked data.
df = pd.concat(data, ignore_index=True)
df = df.groupby('Geography')['Count'].sum().to_frame()
df.to_csv('my_output.csv')
I do like @root's solution, but I would go a bit further in optimizing memory usage - keeping only the aggregated DF in memory and reading only those columns that you really need:
cols = ['Geography', 'Count']
df = pd.DataFrame()
chunksize = 2   # adjust it! for example --> 10**5

for chunk in pd.read_csv(filename, usecols=cols, chunksize=chunksize):
    # merge previously aggregated DF with a new portion of data and aggregate it again
    df = (pd.concat([df,
                     chunk.groupby('Geography')['Count'].sum().to_frame()])
            .groupby(level=0)['Count']
            .sum()
            .to_frame()
         )

df.reset_index().to_csv('c:/temp/result.csv', index=False)
test data:
Geography,AgeGroup,Gender,Race,Count
County1,1,M,1,12
County2,2,M,1,3
County3,2,M,2,0
County1,1,M,1,12
County2,2,M,1,33
County3,2,M,2,11
County1,1,M,1,12
County2,2,M,1,111
County3,2,M,2,1111
County5,1,M,1,12
County6,2,M,1,33
County7,2,M,2,11
County5,1,M,1,12
County8,2,M,1,111
County9,2,M,2,1111
output.csv:
Geography,Count
County1,36
County2,147
County3,1122
County5,24
County6,33
County7,11
County8,111
County9,1111
PS: using this approach you can process huge files.
PPS: the chunking approach should work unless you need to sort your data - in that case I would use classic UNIX tools like awk, sort, etc. to sort the data first.
I would also recommend using PyTables (HDF5 storage) instead of CSV files - it is very fast and allows you to read data conditionally (using the where parameter), so it's very handy, saves a lot of resources, and is usually much faster than CSV.
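A minimal sketch of that conditional read (the file, key, and dataframe names are made up for illustration): the store has to be written in table format with the query column declared as a data column:
import pandas as pd

# df: a dataframe with Geography and Count columns (assumption).
df.to_hdf("data.h5", key="geo", format="table", data_columns=["Geography"])

# Only the matching rows are read from disk.
county1 = pd.read_hdf("data.h5", key="geo", where='Geography == "County1"')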
