Pandas dataframe pivot not fitting in memory - python

I have a dataframe df with the following structure:
     val      newidx  Code
Idx
0    1.0  1220121127   706
1    1.0  1220121030   706
2    1.0  1620120122   565
It has 1,000,000 rows.
In total there are 600 unique Code values and 200,000 unique newidx values.
If I perform the following operation
df.pivot_table(values='val', index='newidx', columns='Code', aggfunc='max')
I get a MemoryError, but this seems strange because the size of the resulting dataframe should be manageable: 200,000 x 600.
How much memory does such an operation require? Is there a way to fix this memory error?

Try to see if this fits in your memory:
df.groupby(['newidx', 'Code'])['val'].max().unstack()
pivot_table is unfortunately very memory intensive as it may make multiple copies of data.
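As a rough back-of-the-envelope estimate (not an exact figure), the dense 200,000 x 600 float64 result alone is close to a gigabyte, before counting pivot_table's intermediate copies:
cells = 200_000 * 600      # unique newidx values x unique Code values
print(cells * 8 / 1e9)     # ~0.96 GB for the float64 output alone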
If the groupby does not work, you will have to split your DataFrame into smaller pieces. Try not to assign multiple times. For example, if reading from csv:
df = pd.read_csv('file.csv').groupby(['newidx', 'Code'])['val'].max().unstack()
avoids multiple assignments.
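If a single groupby still does not fit, here is a minimal sketch of the chunked approach (assuming the same file.csv and column names as above): take the per-chunk max, then combine.
import pandas as pd

pieces = []
for chunk in pd.read_csv('file.csv', chunksize=100_000):
    # per-chunk max keeps only small intermediate results in memory
    pieces.append(chunk.groupby(['newidx', 'Code'])['val'].max())
# combine the partial maxima, take the max again, then reshape
df = pd.concat(pieces).groupby(level=['newidx', 'Code']).max().unstack()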

I've had a very similar problem when carrying out a merge between 4 dataframes recently.
What worked for me was disabling the index during the groupby, then merging.
If @Kartik's answer doesn't work, try this before chunking the DataFrame:
df.groupby(['newidx', 'Code'], as_index=False)['val'].max().pivot(index='newidx', columns='Code', values='val')

Related

How to work with a DataFrame which cannot be transformed by pandas pivot due to excessive memory usage?

I have a dataframe with this structure.
I built this dfp with 100 rows of the original for testing,
and then I tried a pivot operation to get a dataframe like this.
The problem with the pivot operation using all the data is that the result would have 131,209 rows and 39,123 columns. When I try the operation, memory collapses and my PC restarts.
I tried segmenting the dataframe into 10 or 20 chunks. The pivots work, but when I do a concat operation it crashes the memory again.
My PC has 16 GB of memory. I have also tried Google Colab, but it also runs out of memory.
Is there a format or another strategy to work on this operation?
You may try this,
dfu = dfp.groupby(['order_id','product_id'])[['my_column']].sum().unstack().fillna(0)
Another way is to split product_id into parts, process them separately, and concatenate the results back together:
front_part = []
rear_part = []
dfp_f = dfp[dfp['product_id'].isin(front_part)]
dfp_r = dfp[dfp['product_id'].isin(rear_part)]
dfs_f = dfp_f.pivot(index='order_id', columns='product_id', values=['my_column']).fillna(0)
dfs_r = dfp_r.pivot(index='order_id', columns='product_id', values=['my_column']).fillna(0)
dfs = pd.concat([dfs_f, dfs_r], axis=1)
front_part and rear_part split the product_id values into two groups; you need to fill these lists with the actual product_id values you want in each part.
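A sketch that generalizes the two-way split to N chunks (assuming dfp, order_id, product_id and my_column as in the question), so the product_id values don't have to be listed by hand:
import numpy as np
import pandas as pd

parts = []
for ids in np.array_split(dfp['product_id'].unique(), 10):   # 10 chunks; tune to fit memory
    sub = dfp[dfp['product_id'].isin(ids)]
    parts.append(sub.pivot(index='order_id', columns='product_id', values='my_column').fillna(0))
dfs = pd.concat(parts, axis=1).fillna(0)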

Dask dataframe running out of memory while converting it to pandas after selecting subset

I have a parquet file with 160M records and 240 columns, so I'm using Dask to load it in Python on an EMR cluster (m5.12xlarge).
import dask.dataframe as dd
df = dd.read_parquet(file)
Now I want the value counts and normalized value counts of one of the columns:
count = df.a.value_counts()
percent = df.a.value_counts(normalize = True)
a_count = dd.concat([count,percent], axis=1, keys=['counts', '%'])
Out:
Dask DataFrame Structure:
                counts        %
npartitions=1
                 int64  float64
                   ...      ...
Dask Name: concat, 489 tasks
Note that here I have 1 npartition in total and 489 tasks.
Now I convert it into a pandas dataframe, which takes only a few seconds to execute and uses around 1.5 GB of memory.
a_count = a_count.compute()
Now, from one of the columns, I want all records with null values, and then I do the same as before: a value count.
empty_b = df[df['b'].isna()]
count = empty_b.a.value_counts()
percent = empty_b.a.value_counts(normalize = True)
empty_b = dd.concat([count,percent], axis=1, keys=['counts', '%'])
empty_b
Out:
Dask DataFrame Structure:
                counts        %
npartitions=1
                 int64  float64
                   ...      ...
Dask Name: concat, 828 tasks
This has 1 npartition in total and 828 tasks.
Now I try to convert this into a pandas dataframe by computing it; it takes a long time and runs out of memory after using 170 GB.
empty_b = empty_b.compute()
Can someone explain what's going wrong here? I'm doing the same thing as before, and on a subset of the bigger dataframe at that, but my notebook still runs out of memory and cannot finish executing.
I have a parquet file with 160M records and 240 columns
It's probably best to split up this data into multiple Parquet files
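One hedged way to do that split with Dask itself (out_dir/ is a hypothetical output directory; Dask writes one parquet file per partition):
import dask.dataframe as dd

dd.read_parquet(file).repartition(npartitions=48).to_parquet('out_dir/')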
I'm using Dask to load it in Python on an EMR cluster (m5.12xlarge)
This instance type has 48 CPUs. Is this a multi node instance or a single node?
Note that here I have 1 npartition in total and 489 tasks.
You're probably running all these computations on a single core. Try repartitioning the DataFrame into 48 partitions at least, so you can leverage parallelism on your powerful machine.
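For example (a minimal sketch, assuming df is the Dask DataFrame read above):
df = df.repartition(npartitions=48)   # roughly one partition per core on an m5.12xlarge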
Now I'm trying to convert this into a pandas dataframe
You generally don't want to convert Dask DataFrames to Pandas DataFrames, unless you've significantly reduced the number of rows of data. You lose all the benefits that the parallelism of Dask can provide once converting to Pandas.
In this example, it seems like you're reading a single Parquet file into a Dask DataFrame with one partition and then converting it back to Pandas. You might want to consider breaking up the Dask DataFrame into multiple partitions (and running computations via Dask) or just reading the file directly into a Pandas DataFrame.
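Here is a sketch of the second option, assuming only columns a and b are needed for this computation (the question uses just those two of the 240 columns); selecting columns at read time keeps the pandas footprint small:
import pandas as pd

pdf = pd.read_parquet(file, columns=['a', 'b'])
empty_b = pdf[pdf['b'].isna()]
counts = empty_b['a'].value_counts()
percent = empty_b['a'].value_counts(normalize=True)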

Pandas: How to efficiently diff() after a groupby() operation?

I do have a large dataset (around 8 million rows x 25 columns) in Pandas and I am struggling to use the diff() function in a performant manner on a subset of the data.
Here is what my dataset looks like:
                   prec type
location_id hours
135         78     12.0    A
            79     14.0    A
            80     14.3    A
            81     15.0    A
            82     15.0    A
            83     15.0    A
            84     15.5    A
I have a multi-index on [location_id, hours]. I have around 60k locations and 140 hours for each location (making up the 8 million rows).
The rest of the data is numeric (float) or categorical. I have only included 2 columns here, normally there are around 20 columns.
What I want to do is apply the diff() function for each location on the prec column. The original dataset accumulates the prec values; by applying diff() I get the appropriate prec value for each hour.
With these in mind, I have implemented the following algorithm in Pandas:
# Filter the data first
df_filtered = df_data[df_data.type == "A"] # only work on locations with 'A' type
df_filtered = df_filtered.query('hours > 0 & hours <= 120') # only work on certain hours
# Apply the diff()
for location_id, data_of_location in df_filtered.groupby(level="location_id"):
    df_data.loc[data_of_location.index, "prec"] = data_of_location.prec.diff().replace(np.nan, 0.0)
del df_filtered
This works functionally, but the performance and the memory consumption are horrible. It takes around 30 minutes on my dataset, which is currently not acceptable. The existence of the for loop is an indicator that this could be handled better.
Is there a better/faster way to implement this?
Also, the overall memory consumption of the Python script is sky-rocketing during this operation; it grows around 300%! The memory consumed by the main df_data data frame doesn't change but the overall process memory consumption rises.
With the input from @Quang Hoang and @Ben.T, I figured out a solution that is pretty fast but still consumes a lot of memory.
# Filter the data first
df_filtered = df_data[df_data.type == "A"] # only work on locations with 'A' type
df_filtered = df_filtered.query('hours > 0 & hours <= 120') # only work on certain hours
# Apply the diff()
df_diffed = df_filtered.groupby(level="location_id").prec.diff().replace(np.nan, 0.0)
df_data.loc[df_diffed.index, "prec"] = df_diffed
del df_diffed
del df_filtered
I am guessing 2 things can be done to improve memory usage:
df_filtered seems like a copy of the data; that should increase the memory a lot.
df_diffed is also a copy.
The memory usage is very intensive while computing these two variables. I am not sure if there is any in-place way to execute such operations.
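One way to drop the named intermediates (a sketch, assuming the df_data layout above; the filtered frame still exists briefly as a temporary, but it is released as soon as the groupby finishes rather than being kept in a variable):
hours = df_data.index.get_level_values("hours")
mask = (df_data["type"] == "A") & (hours > 0) & (hours <= 120)
# diff per location on the filtered rows, then write the results back in place
diffed = df_data.loc[mask].groupby(level="location_id")["prec"].diff().fillna(0.0)
df_data.loc[diffed.index, "prec"] = diffed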

innerjoin between two large pandas dataframe using dask

I have two large tables; one of them is relatively small (~8 million rows, one column) and the other is large (173 million rows, one column). The index of the first dataframe is an IntervalIndex (e.g. (0, 13], (13, 20], (20, 23], ...) and the index of the second is ordered numbers (1, 2, 3, ...). Both DataFrames are sorted:
DF1        category
(0, 13]    1
(13, 20]   2
....

Df2   Value
1     5.2
2     3.4
3     7.8

Desired:

Df3
index  value  category
1      5.2    1
2      3.4    1
3      7.8    1
I want to obtain an inner join (with a fast algorithm), similar to a MySQL inner join, on data_frame2.index.
I would like to be able to perform it on a cluster, because when I produced the inner join with a smaller second dataset, the result was extremely memory consuming: imagine 105 megabytes for 10 rows using map_partitions.
Another problem is that I cannot use scatter twice: if I first do DaskDF = client.scatter(dataframe2) followed by DaskDF = client.submit(fun1, DaskDF), I am unable to do something like client.submit(fun2, DaskDF).
You might try using smaller partitions. Recall that the memory use of joins depends on how many shared rows there are. Depending on your data, the memory use of an output partition may be much larger than the memory use of your input partitions.
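A minimal sketch of that suggestion, assuming ddf_small and ddf_large are the two Dask DataFrames (hypothetical names) and that the join key is the index:
import dask.dataframe as dd

ddf_large = ddf_large.repartition(partition_size="100MB")   # smaller inputs -> smaller output partitions
result = dd.merge(ddf_large, ddf_small, left_index=True, right_index=True, how="inner")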

increase efficiency of pandas groupby with custom aggregation function

I have a not-so-large dataframe (somewhere in the 2000 x 10000 range in terms of shape).
I am trying to group by a column and average the first N non-null entries, e.g.:
def my_part_of_interest(v, N=42):
    valid = v[~np.isnan(v)]
    return np.mean(valid.values[0:N])

mydf.groupby('key').agg(my_part_of_interest)
It now takes a long time (dozens of minutes), whereas .agg(np.nanmean) ran on the order of seconds.
How can I get it running faster?
Some things to consider:
Dropping the NaN entries on the entire df in a single operation is faster than doing it on chunks of the grouped datasets: mydf.dropna(subset=['v'], inplace=True)
Use .head to slice: mydf.groupby('key').apply(lambda x: x.head(42).agg('mean'))
I think those combined can optimize things a bit, and they are more idiomatic pandas.
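A minimal sketch combining both suggestions (assuming the group column is 'key' and the value column is 'v', as in the snippets above):
mydf = mydf.dropna(subset=['v'])                                        # drop NaNs once, up front
result = mydf.groupby('key')['v'].apply(lambda s: s.head(42).mean())    # mean of the first 42 non-null values per key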
