Summary
I am trying to iterate over a large dataframe, identify unique groups based on several columns, and set another column to its mean within each group (its sum divided by the number of rows in the group). My current approach is very slow when iterating over a large dataset and applying the average across many columns. Is there a way I can do this more efficiently?
Example
Here's an example of the problem. I want to find unique combinations of ['A', 'B', 'C']. For each unique combination, I want column 'D' replaced by its sum divided by the number of rows in the group.
Edit:
The resulting dataframe should preserve the duplicated groups, but with the edited column 'D'.
import pandas as pd
import numpy as np
import datetime
def time_mean_rows():
    # Generate some random data
    A = np.random.randint(0, 5, 1000)
    B = np.random.randint(0, 5, 1000)
    C = np.random.randint(0, 5, 1000)
    D = np.random.randint(0, 10, 1000)

    # init dataframe
    df = pd.DataFrame(data=[A, B, C, D]).T
    df.columns = ['A', 'B', 'C', 'D']

    tstart = datetime.datetime.now()

    # Get unique combinations of A, B, C
    unique_groups = df[['A', 'B', 'C']].drop_duplicates().reset_index()

    # Iterate unique groups
    normalised_solutions = []
    for idx, row in unique_groups.iterrows():
        # Subset dataframe to the unique group
        sub_df = df[
            (df['A'] == row['A']) &
            (df['B'] == row['B']) &
            (df['C'] == row['C'])
        ]

        # If more than one solution, get mean of column D
        num_solutions = len(sub_df)
        if num_solutions > 1:
            sub_df.loc[:, 'D'] = sub_df.loc[:, 'D'].values.sum(axis=0) / num_solutions
        normalised_solutions.append(sub_df)

    # Concatenate results
    res = pd.concat(normalised_solutions)

    tend = datetime.datetime.now()
    time_elapsed = (tend - tstart).seconds
    print(time_elapsed)
I know the section causing the slowdown is the num_solutions > 1 branch. How can I do this more efficiently?
Hm, why don't you use groupby?
df_res = df.groupby(['A', 'B', 'C'])['D'].mean().reset_index()
This is a complement to AT_asks's answer, which only gave the first part of the solution.
Once we have df.groupby(['A', 'B', 'C'])['D'].mean() we can use it to change the value of column 'D' in a copy of the original dataframe, provided we use a dataframe sharing the same index. The full solution is then:
res = df.set_index(['A', 'B', 'C']).assign(
    D=df.groupby(['A', 'B', 'C'])['D'].mean()).reset_index()
This will contain the same rows (though possibly in a different order) as the res dataframe from the OP's question.
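A further sketch of the same idea, using transform: transform('mean') returns a Series already aligned with df's original index, so every duplicated row keeps its group mean without needing set_index or a merge.

# group means broadcast back onto every row of the original dataframe
res = df.assign(D=df.groupby(['A', 'B', 'C'])['D'].transform('mean'))

The same pattern extends to several columns by selecting a list of columns before the transform.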
Here's a solution I found:
Using groupby as suggested by AT, then merging back to the original df and dropping the original ['D', 'E'] columns. Nice speedup!
from timeit import default_timer as timer  # assuming timer below is timeit's default_timer
from datetime import timedelta

def time_mean_rows():
    # Generate some random data
    np.random.seed(seed=42)
    A = np.random.randint(0, 10, 10000)
    B = np.random.randint(0, 10, 10000)
    C = np.random.randint(0, 10, 10000)
    D = np.random.randint(0, 10, 10000)
    E = np.random.randint(0, 10, 10000)

    # init dataframe
    df = pd.DataFrame(data=[A, B, C, D, E]).T
    df.columns = ['A', 'B', 'C', 'D', 'E']

    tstart_grpby = timer()

    cols = ['D', 'E']
    group_df = df.groupby(['A', 'B', 'C'])[cols].mean().reset_index()

    # Merge group means back onto the original df
    df = pd.merge(df, group_df, how='left', on=['A', 'B', 'C'], suffixes=('_left', ''))

    # Get left columns (have not been normalised) and drop
    drop_cols = [x for x in df.columns if x.endswith('_left')]
    df.drop(drop_cols, inplace=True, axis='columns')

    tend_grpby = timer()
    time_elapsed_grpby = timedelta(seconds=tend_grpby - tstart_grpby).total_seconds()
    print(time_elapsed_grpby)
Related
How can I create the master DataFrame through some vectorised process? If that's not possible, what is the most time-efficient (I am not concerned about memory) method to execute this operation?
Can the for-loop be replaced with something more efficient?
As you can see, combinations very quickly produces a very large number of pairs, so I need a fast way to build this DataFrame.
Please see below a minimal reproducible example:
%%time
import pandas as pd
import string
import numpy as np
from itertools import combinations
# create dummy data
cols = list(string.ascii_uppercase)
dummy = pd.DataFrame()
for col in cols:
    dummy = dummy.append([[col, 0] + np.random.randint(2, 100, size=(1, 10)).tolist()[0]])
    dummy = dummy.append([[col, 1] + np.random.randint(2, 100, size=(1, 10)).tolist()[0]])
    dummy = dummy.append([[col, 2] + np.random.randint(2, 100, size=(1, 10)).tolist()[0]])
dummy.columns = ['name', 'id', 'v1', 'v2', 'v3', 'v4', 'v5', 'v1', 'v6', 'v7', 'v8', 'v9']
# create all possible unique combinations
combos = list(combinations(cols, 2))
# generate DataFrame with all combinations
master = pd.DataFrame()
for i, combo in enumerate(combos):
    A = dummy[dummy.name == combo[0]]
    B = dummy[dummy.name == combo[1]]
    joined = pd.merge(A, B, on=["id"], suffixes=('_A', '_B'))
    joined = joined.sort_values("id")
    joined['pair_id'] = i
    master = pd.concat([master, joined])
Output:
CPU times: total: 1.8 s
Wall time: 1.8 s
Thanks!
Since your data has a regular structure, you can drop down to numpy to take advantage of vectorized operations.
names = list(string.ascii_uppercase)
ids = [0, 1, 2]
columns = pd.Series(["v1", "v2", "v3", "v4", "v5", "v1", "v6", "v7", "v8", "v9"])
# Generate the random data
data = np.random.randint(2, 100, (len(names), len(ids), len(columns)))
# Pair data for every 2-combination of names
arr = [np.hstack([data[i], data[j]]) for i,j in combinations(range(len(names)), 2)]
# Assembling the data to final dataframe
idx = pd.MultiIndex.from_tuples([
    (p, a, b, i) for p, (a, b) in enumerate(combinations(names, 2)) for i in ids
], names=["pair_id", "name_A", "name_B", "id"])
cols = pd.concat([columns + "_A", columns + "_B"])
master = pd.DataFrame(np.vstack(arr), index=idx, columns=cols)
Original code: 4s. New code: 7ms
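If a flat table closer to the original master is wanted (with pair_id, name_A, name_B and id as ordinary columns), one possible follow-up, sketched here, is simply:

# turn the MultiIndex levels back into regular columns
master_flat = master.reset_index()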
I am having an issue with returning the original df index of a row given a groupby condition after subselecting some of the df. It's easier to understand through code.
So if we start with a toy dataframe:
headers = ['a','b']
nrows = 8
df = pd.DataFrame(columns = headers)
df['a'] = [0]*(nrows//2) + [1]*(nrows//2)
df['b'] = [2]*(nrows//4) + [4]*(nrows//4) + [2]*(nrows//4) + [4]*(nrows//4)
print(df)
Then I select the subset of data I am interested in and check that the index is retained:
sub_df = df[df['a']==1] ## selects for only group 1 (indices 4-7)
print(sub_df.index) ## looks good so far
sub_df.index returns
Int64Index([4, 5, 6, 7], dtype='int64')
Which seems great! I would like to group data from that subset and extract the original df index, and that is where the issue occurs:
For example:
g_df = sub_df.groupby('b')
g_df_idx = g_df.indices
print(g_df_idx) ## bad!
When I print(g_df_idx) I want it to return:
{2: array([4, 5]), 4: array([6, 7])}
but .indices gives positional locations within sub_df (i.e. {2: array([0, 1]), 4: array([2, 3])}) rather than the original df index labels.
Due to the way I will be using this code, I can't just groupby(['a', 'b']).
I'm going nuts with this thing. Here are some of the many solutions I have tried:
## 1
e1_idx = sub_df.groupby('b').indices
# print(e1_idx) ## issue persists
## 2
e2 = sub_df.groupby('b', as_index = True) ## also tried as_index = False
e2_idx = e2.indices
# print(e2_idx) ## issue persists
## 3
e3 = sub_df.reset_index()
e3_idx = e3.groupby('b').indices
# print(e3_idx) ## issue persists
I'm sure there must be some simple solution I'm just overlooking. Would be very grateful for any advice.
You can do it like this:
g_df_idx = g_df.apply(lambda x: x.index).to_dict()
print(g_df_idx)
# {2: Int64Index([4, 5], dtype='int64'), 4: Int64Index([6, 7], dtype='int64')}
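Alternatively, a sketch using GroupBy.groups, which (unlike .indices) already maps each key to the original index labels:

# .groups returns a dict-like {group key -> Index of original labels}
g_df_idx = {k: v.to_numpy() for k, v in sub_df.groupby('b').groups.items()}
print(g_df_idx)
# {2: array([4, 5]), 4: array([6, 7])}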
Let's say I have the following dataframe and series
import numpy as np
import pandas as pd
ser = pd.Series([1,2,2], pd.date_range(start='2020-01-01', periods=3, freq='D'))
test = pd.DataFrame([pd.Series([1, 2], index=['a', 'b'])] * len(ser), index=ser.index)
Let's say I want to compare whether each value in the dataframe is larger than the corresponding values in the series, by column. I know that I can do it in the following way:
test.apply(lambda x: x > ser)
However, is there also a way to do this using np.where that is perhaps more efficient? I know that if I were comparing the dataframe to the series by rows, the following would work:
np.where(test > [0.5,2], 1, 0)
Use gt with axis=0:
test.gt(ser, axis=0)
# a b
#2020-01-01 False True
#2020-01-02 False False
#2020-01-03 False False
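If you need the 0/1 output that your np.where version produces, a minimal sketch is to feed the boolean result into np.where (or cast it):

# same elementwise comparison, expressed as integers
np.where(test.gt(ser, axis=0), 1, 0)
# or: test.gt(ser, axis=0).astype(int)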
Quick benchmark against apply:
import time
def apply_compare():
    t1 = time.time()
    test.apply(lambda x: x > ser)
    print(time.time() - t1)

def gt_compare():
    t1 = time.time()
    test.gt(ser, axis=0)
    print(time.time() - t1)
ser = pd.Series([1,2,2] * 10000, pd.date_range(start='2020-01-01', periods=30000, freq='D'))
test = pd.DataFrame([pd.Series([1, 2], index=['a', 'b'])] * len(ser), index=ser.index)
apply_compare()
# 0.006000041961669922
gt_compare()
# 0.0009999275207519531
I have a list
a = [15, 50 , 75]
Using the above list I have to create smaller dataframes from the main dataframe by slicing rows on the index, where the number of rows in each slice is defined by the list.
Let's say my main dataframe is df.
The dataframes I'd like to have are df1 (from row index 0-15), df2 (from row index 15-65) and df3 (from row index 65-125).
Since these are just three, I can easily use something like the code below:
limit1 = a[0]
limit2 = a[1] + limit1
limit3 = a[2] + limit2
df1 = df.loc[df.index <= limit1]
df2 = df.loc[(df.index > limit1) & (df.index <= limit2)]
df2 = df2.reset_index(drop = True)
df3 = df.loc[(df.index > limit2) & (df.index <= limit3)]
df3 = df3.reset_index(drop = True)
But what if I want to implement this with a long list on the main dataframe df? I am looking for something iterative like the following (which doesn't work):
df1 = df.loc[df.index <= limit1]
for i in range(2, 3):
    for j in range(2, 3):
        for k in range(2, 3):
            df[i] = df.loc[(df.index > limit[j]) & (df.index <= limit[k])]
            df[i] = df[i].reset_index(drop=True)
            print(df[i])
You could modify your code by building the dataframes iteratively, cutting slices off the end of the main dataframe:
dfs = []  # this list contains your partitioned dataframes
a = [15, 50, 75]
for idx in a[::-1]:
    dfs.insert(0, df.iloc[idx:])
    df = df.iloc[:idx]
dfs.insert(0, df)  # add the last remaining dataframe
print(dfs)
Another option is to use a list comprehension as follows:
a = [0, 15, 50 , 75]
dfs = [df.iloc[a[i]:a[i+1]] for i in range(len(a)-1)]
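If, as in the limit1/limit2/limit3 code from the question, the list holds chunk sizes rather than absolute cut points (my assumption about the intent), a sketch using cumulative boundaries would be:

import numpy as np

a = [15, 50, 75]
bounds = [0] + list(np.cumsum(a))  # [0, 15, 65, 140]
dfs = [df.iloc[bounds[i]:bounds[i + 1]].reset_index(drop=True)
       for i in range(len(bounds) - 1)]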
This does it. It's better to use a dictionary if you want to store multiple variables and call them later; creating variables dynamically in a loop is bad practice, so always avoid it.
df = pd.DataFrame(np.linspace(1, 75, 75), columns=['a'])
a = [15, 50, 25]
d = {}
b = 0
for n, i in enumerate(a):
    d[f'df{n}'] = df.iloc[b:b + i]
    b += i
Output: d now holds three dataframes, keyed 'df0', 'df1' and 'df2'.
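The chunks can then be retrieved later by key, for example:

first_chunk = d['df0']   # rows 0-14 of df
second_chunk = d['df1']  # rows 15-64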
I have recently asked a question on applying select_dtypes for specific columns of a data frame.
I have this data frame that has different dtypes on its columns (str and int in this case).
df = pd.DataFrame([
    [-1, 3, 0],
    [5, 2, 1],
    [-6, 3, 2],
    [7, '<blank>', 3],
    ['<blank>', 2, 4],
    ['<blank>', '<blank>', '<blank>']], columns='A B C'.split())
I want to create different masks for strings and integers, and then apply styling based on these masks.
First let's define a function that will help me create my mask for different dtypes. (Thanks to #jpp)
def filter_type(s, num=True):
    s_new = pd.to_numeric(s, errors='coerce')
    if num:
        return s_new.notnull()
    else:
        return s_new.isnull()
Then our first mask will be:
mask1 = filter_type(df['A'], num=False) # working and creating the bool values
The second mask will be based on an interval of integers:
mask2 = df['A'].between(7 , 0 , inclusive=False)
But when I run mask2 it gives me the error:
TypeError: '>' not supported between instances of 'str' and 'int'
How can I overcome this issue?
Note: the styling I would like to apply looks like this:
def highlight_col(x):
    df = x.copy()
    mask1 = filter_type(df['A'], num=False)
    mask2 = df['A'].between(7, 0, inclusive=False)
    x.loc[mask1, ['A', 'B', 'C']] = 'background-color: ""'
    x.loc[mask2, ['A', 'B', 'C']] = 'background-color: #7fbf7f'
pd.DataFrame.loc is used to set values; you need pd.DataFrame.style to set styles. In addition, you can use try / except as a way of identifying when numeric comparisons fail.
Here's a minimal example:
def styler(x):
    res = []
    for i in x:
        try:
            if 0 <= i <= 7:
                res.append('background: red')
            else:
                res.append('')
        except TypeError:
            res.append('')
    return res
res = df.style.apply(styler, axis = 1)
Result: the numeric cells with values between 0 and 7 are given a red background.
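If you prefer to keep the mask-based idea from the question, a possible sketch (the helper name highlight_numeric is mine, and the 0-7 interval mirrors the styler above) coerces the whole frame to numeric once and builds a style DataFrame via Styler.apply with axis=None:

def highlight_numeric(x):
    # x is the whole DataFrame when axis=None
    num = x.apply(pd.to_numeric, errors='coerce')  # strings such as '<blank>' become NaN
    styles = pd.DataFrame('', index=x.index, columns=x.columns)
    styles[num.ge(0) & num.le(7)] = 'background-color: #7fbf7f'
    return styles

res = df.style.apply(highlight_numeric, axis=None)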