I have a large dataset (around 8 million rows x 25 columns) in Pandas, and I am struggling to use the diff() function in a performant manner on a subset of the data.
Here is what my dataset looks like:
                   prec type
location_id hours
135         78     12.0    A
            79     14.0    A
            80     14.3    A
            81     15.0    A
            82     15.0    A
            83     15.0    A
            84     15.5    A
I have a multi-index on [location_id, hours]. I have around 60k locations and 140 hours for each location (making up the 8 million rows).
The rest of the data is numeric (float) or categorical. I have only included 2 columns here; normally there are around 20.
What I want to do is apply the diff() function to the prec column for each location. The original dataset accumulates the prec numbers; by applying diff() I get the actual prec value for each hour.
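To make that concrete, here is a tiny illustration using just the prec values from the sample above (an isolated Series, not the real multi-indexed frame):
import pandas as pd

# Cumulative prec values for location 135, copied from the sample rows
prec = pd.Series([12.0, 14.0, 14.3, 15.0, 15.0, 15.0, 15.5])
print(prec.diff().fillna(0.0))   # -> 0.0, 2.0, 0.3, 0.7, 0.0, 0.0, 0.5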
With these in mind, I have implemented the following algorithm in Pandas:
import numpy as np

# Filter the data first
df_filtered = df_data[df_data.type == "A"]                  # only work on locations with 'A' type
df_filtered = df_filtered.query('hours > 0 & hours <= 120') # only work on certain hours

# Apply the diff()
for location_id, data_of_location in df_filtered.groupby(level="location_id"):
    df_data.loc[data_of_location.index, "prec"] = data_of_location.prec.diff().replace(np.nan, 0.0)

del df_filtered
This works well functionally, but the performance and memory consumption are horrible. It takes around 30 minutes on my dataset, which is currently not acceptable. The existence of the for loop is an indicator that this could be handled better.
Is there a better/faster way to implement this?
Also, the overall memory consumption of the Python script sky-rockets during this operation; it grows by around 300%! The memory consumed by the main df_data data frame doesn't change, but the overall process memory consumption rises.
With the input from @Quang Hoang and @Ben.T, I figured out a solution that is pretty fast but still consumes a lot of memory.
# Filter the data first
df_filtered = df_data[df_data.type == "A"]                  # only work on locations with 'A' type
df_filtered = df_filtered.query('hours > 0 & hours <= 120') # only work on certain hours

# Apply the diff()
df_diffed = df_filtered.groupby(level="location_id").prec.diff().replace(np.nan, 0.0)
df_data.loc[df_diffed.index, "prec"] = df_diffed

del df_diffed
del df_filtered
I am guessing 2 things can be done to improve memory usage:
df_filtered seems like a copy of the data; that should increase the memory a lot.
df_diffed is also a copy.
The memory usage is very intensive while computing these two variables. I am not sure if there is any in-place way to execute such operations.
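One possible way to trim the intermediate copies (a sketch only, not tested on the real data; column and index names taken from the snippets above) is to skip materializing df_filtered entirely: build a boolean mask, group just the masked prec values, and write the result back through .loc. This still allocates the mask and the diffed Series, but avoids the full row copy that df_filtered makes.
# Select the rows of interest with a boolean mask instead of a filtered copy
mask = (
    (df_data["type"] == "A")
    & (df_data.index.get_level_values("hours") > 0)
    & (df_data.index.get_level_values("hours") <= 120)
)
# Diff the selected prec values per location and write them back in place
df_diffed = df_data.loc[mask, "prec"].groupby(level="location_id").diff().fillna(0.0)
df_data.loc[df_diffed.index, "prec"] = df_diffed
del mask, df_diffed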
Related
I am trying to remove outliers from all my columns. I already used a very large instance on AWS (4 vCPUs + 16 GiB RAM), but it still couldn't get through.
num_col = data.select_dtypes(include=['int64','float64']).columns.tolist()
data[num_col] = data[num_col].apply(lambda x: x.clip(*x.quantile([0.01, 0.99])))
There are 102 columns in total from which I need to remove outliers.
Is there a more efficient way to write this so that it runs faster and uses less memory?
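One thing that may help (a sketch, assuming data is the DataFrame from the snippet above): compute the 1st and 99th percentiles once for all numeric columns and clip the whole block in a single call, rather than clipping column by column with apply.
num_col = data.select_dtypes(include=['int64', 'float64']).columns
lower = data[num_col].quantile(0.01)   # per-column 1st percentile
upper = data[num_col].quantile(0.99)   # per-column 99th percentile
# axis=1 aligns the two percentile Series with the columns
data[num_col] = data[num_col].clip(lower=lower, upper=upper, axis=1)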
I'm trying to optimize my script performance via Pandas. I'm running into a roadblock where I need to apply a large number of filters to a DataFrame and store a few totals from the results.
Currently the fastest way I can make this happen is running a For Loop on the list of filters (as strings) and using eval() to calculate the totals:
for filter_str in filter_list:
    data_filtered = data[eval(filter_str)]
    avg_change = data_filtered['NewChangePerc'].mean()
Here's my attempt at using pd.apply() to speed this up because I can't think of a vectorized way to make it happen (the filters are in a DataFrame this time instead of a list):
def applying(x):
    f = data[eval(x)]
    avg_change = f['NewChangePerc'].mean()
    return avg_change   # return the value so apply() collects it

filter_df.processed.apply(applying)
The main goal is to simply make it as fast as possible. What I don't understand is why a For Loop is faster than pd.apply(). It's about twice as fast.
Any input would be greatly appreciated.
UPDATE
Here's more specifics about what I'm trying to accomplish:
Take a data set of roughly 67 columns and 2500 rows.
      Code               Name  ...  Price  NewChangePerc
0  JOHNS33   Johnson, Matthew  ...   0.93       0.388060
1  QUEST01    Questlove, Inc.  ...  18.07       0.346498
2  773NI01  773 Entertainment  ...   1.74       0.338462
3  CLOVE03   Cloverfield, Sam  ...  21.38       0.276418
4  KITET08        Kite Teal 6  ...   0.38       0.225806
Then take a list of filters.
['Price > 5.0', 'NewChangePerc < .02']
Apply each filter to the data and calculate certain values, such as the average NewChangePerc.
For example, when applying 'Price > 5.0', the average NewChangePerc would be ~0.31.
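For concreteness, here is a tiny self-contained version of the loop on just the sample rows above (values copied from the table; DataFrame.query is used in place of the bare eval, which is only an assumption about how the filter strings get evaluated):
import pandas as pd

data = pd.DataFrame({
    'Price': [0.93, 18.07, 1.74, 21.38, 0.38],
    'NewChangePerc': [0.388060, 0.346498, 0.338462, 0.276418, 0.225806],
})
filter_list = ['Price > 5.0', 'NewChangePerc < .02']

# One mean per filter string
results = {f: data.query(f)['NewChangePerc'].mean() for f in filter_list}
print(results['Price > 5.0'])   # ~0.31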
Now grow that list of filters to a length of about 1,000, and it starts to take some time to run. I need to cut that time down as much as possible. I've run out of ideas and can't find any solutions beyond what I've listed above, but they're just too slow (~0.86s for 50 filters with the For Loop; ~1.65s for 50 filters with pd.apply()). Are there any alternatives?
I have a not-so-large dataframe (roughly 2000x10000 in shape).
I am trying to group by a column and average the first N non-null entries:
e.g.
def my_part_of_interest(v, N=42):
    valid = v[~np.isnan(v)]
    return np.mean(valid.values[0:N])

mydf.groupby('key').agg(my_part_of_interest)
It now takes a long time (dozens of minutes), whereas .agg(np.nanmean) ran in a matter of seconds.
How can I get it running faster?
Some things to consider:
Dropping the NaN entries on the entire df via a single operation is faster than doing it on chunks of the grouped dataset: mydf.dropna(subset=['v'], inplace=True)
Use .head to slice: mydf.groupby('key').apply(lambda x: x.head(42).agg('mean'))
I think those combined can optimize things a bit, and they are more idiomatic pandas.
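A combined sketch of both suggestions; the column names 'key' and 'v' and N=42 are taken from the question, and the tiny frame below is made-up data just to make it runnable:
import numpy as np
import pandas as pd

mydf = pd.DataFrame({
    'key': ['a', 'a', 'a', 'b', 'b'],
    'v':   [1.0, np.nan, 3.0, 2.0, np.nan],
})

# Drop the NaN rows once for the whole frame, then take the first 42
# remaining rows per group and average them.
result = (
    mydf.dropna(subset=['v'])
        .groupby('key')['v']
        .apply(lambda x: x.head(42).mean())
)
print(result)   # a -> 2.0, b -> 2.0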
I have a dataframe df with the following structure:
     val      newidx  Code
Idx
0    1.0  1220121127   706
1    1.0  1220121030   706
2    1.0  1620120122   565
It has 1,000,000 rows.
In total we have 600 unique Code values and 200,000 unique newidx values.
If I perform the following operation
df.pivot_table(values='val', index='newidx', columns='Code', aggfunc='max')
I get a MemoryError, which seems strange, as the size of the resulting dataframe should be manageable: 200,000 x 600.
How much memory does such an operation require? Is there a way to fix this memory error?
Try to see if this fits in your memory:
df.groupby(['newidx', 'Code'])['val'].max().unstack()
pivot_table is unfortunately very memory intensive as it may make multiple copies of data.
If the groupby does not work, you will have to split your DataFrame into smaller pieces. Try not to assign multiple times. For example, if reading from csv:
df = pd.read_csv('file.csv').groupby(['newidx', 'Code'])['val'].max().unstack()
avoids multiple assignments.
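If even that fails, here is a rough sketch of the "smaller pieces" idea (assuming the data lives in file.csv as in the example above): process the file in chunks, take the per-chunk max, then take the max again across chunks before unstacking. The max of per-chunk maxima equals the overall max, so the result is the same.
import pandas as pd

partials = []
for chunk in pd.read_csv('file.csv', chunksize=200_000):
    partials.append(chunk.groupby(['newidx', 'Code'])['val'].max())

result = (
    pd.concat(partials)
      .groupby(['newidx', 'Code'])
      .max()
      .unstack()
)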
I've had a very similar problem when carrying out a merge between 4 dataframes recently.
What worked for me was disabling the index during the groupby, then merging.
If @Kartik's answer doesn't work, try this before chunking the DataFrame.
df.groupby(['newidx', 'Code'], as_index=False)['val'].max().pivot(index='newidx', columns='Code', values='val')
I want to impute a large datamatrix (90*90000) and later an even larger one (150000*800000) using pandas.
At the moment I am testing with the smaller one on my laptop (8gb ram, Haswell core i5 2.2 GHz, the larger dataset will be run on a server).
The columns have some missing values that I want to impute with the most frequent one over all rows.
My working code for this is:
from scipy.stats import mode

# Most frequent value per column, starting from the first SNP column
# (the second row returned by 'mode' gives the actual frequencies)
freq_val = pd.Series(mode(df.iloc[:, 6:])[0][0], df.iloc[:, 6:].columns.values)
# Impute unknown SNP values with the most frequent value of the respective column
df_imputed = df.iloc[:, 6:].fillna(freq_val)
The imputation takes about 20 minutes on my machine. Is there another implementation that would increase performance?
try this:
df_imputed = df.iloc[:, 6:].fillna(df.iloc[:, 6:].apply(lambda x: x.mode()).iloc[0])
I tried different approaches. The key learning is that the mode function is really slow. Alternatively, I implemented the same functionality using np.unique (return_counts=True) and np.bincount. The latter is supposedly faster, but it doesn't work with NaN values.
The optimized code now needs about 28 s to run. MaxU's answer needs ~48 s on my machine to finish.
The code:
n_cols = np.shape(df.iloc[:, 6:])[1]
freq_val = np.zeros(n_cols)
for i in range(n_cols):
    # Count the non-missing values and keep the most frequent value itself
    # (not the index of the largest count)
    values, counts = np.unique(df.iloc[:, i + 6].dropna(), return_counts=True)
    freq_val[i] = values[counts.argmax()]
freq_val_series = pd.Series(freq_val, df.iloc[:, 6:].columns.values)
df_imputed = df.iloc[:, 6:].fillna(freq_val_series)
Thanks for the input!