I have a pandas DataFrame that I've grouped by a column.
After this operation I need to generate all unique pairs between the rows of
each group and perform some aggregate operation on all the pairs of a group.
I've implemented the following sample algorithm to give you an idea. I want to refactor this code to use pandas operations, in order to improve performance and/or reduce code complexity.
Code:
import numpy as np
import pandas as pd
import itertools
#Construct Dataframe
samples=40
a=np.random.randint(3,size=(1,samples))
b=np.random.randint(9,size=(1,samples))
c=np.random.randn(1,samples)
d=np.append(a,b,axis=0)
e=np.append(d,c,axis=0)
e=e.transpose()
df = pd.DataFrame(e,columns=['attr1','attr2','value'])
df['attr1'] = df.attr1.astype('int')
df['attr2'] = df.attr2.astype('int')
#drop duplicate rows so (attr1,attr2) will be key
df = df.drop_duplicates(['attr1','attr2'])
#df = df.reset_index()
print(df)
for key,tup in df.groupby('attr1'):
    print('Group',key,' length ',len(tup))
    #generate pairs
    agg=[]
    for v1,v2 in itertools.combinations(list(tup['attr2']),2):
        p1_val = float(df.loc[(df['attr1']==key) & (df['attr2']==v1)]['value'])
        p2_val = float(df.loc[(df['attr1']==key) & (df['attr2']==v2)]['value'])
        agg.append([key,(v1,v2),(p1_val-p2_val)**2])
    #insert pairs to dataframe
    p = pd.DataFrame(agg,columns=['group','pair','value'])
    top = p.sort_values(by='value').head(4)
    print(top['pair'])
    #Perform some operation in df based on pair values
    #....
I am really afraid that pandas DataFrames cannot provide such sophisticated analysis functionality.
Do I have to stick to traditional python like in the example?
I'm new to Pandas so any comments/suggestions are welcome.
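For comparison, here is a minimal pandas-only sketch of the pairing step (my own assumption about one possible vectorization, not code from the question): a self-merge on attr1 produces every ordered pair within a group, and keeping only rows where attr2_1 < attr2_2 leaves each unordered pair exactly once.
import numpy as np
import pandas as pd

# Hypothetical example data shaped like the question's df (different random values).
rng = np.random.default_rng(0)
df = pd.DataFrame({'attr1': rng.integers(0, 3, 40),
                   'attr2': rng.integers(0, 9, 40),
                   'value': rng.standard_normal(40)}).drop_duplicates(['attr1', 'attr2'])

pairs = df.merge(df, on='attr1', suffixes=('_1', '_2'))       # all ordered pairs within each attr1 group
pairs = pairs[pairs['attr2_1'] < pairs['attr2_2']]            # keep each unordered pair once
pairs['value'] = (pairs['value_1'] - pairs['value_2']) ** 2   # squared difference, as in the loop
top = pairs.sort_values('value').groupby('attr1').head(4)     # 4 smallest pairs per group
print(top[['attr1', 'attr2_1', 'attr2_2', 'value']])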
Related
I came across a similar case: I build a DataFrame whose columns all share the same name, and it does not output the entire table.
My code:
import pandas as pd
data = {2:['Green','Blue'],
        2:['small','BIG'],
        2:['High','Low']}
df = pd.DataFrame(data)
print(df)
Output:
2
0 High
1 Low
A dictionary only supports unique keys in its key-value pairs.
So when you create a DataFrame from a dictionary, only the latest key-value pair is kept when a key is duplicated.
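As a quick illustration (my own example, not from the original post), the dictionary literal itself already collapses the duplicate keys before pandas ever sees it:
data = {2: ['Green', 'Blue'],
        2: ['small', 'BIG'],
        2: ['High', 'Low']}
print(data)  # {2: ['High', 'Low']} -- only the last value for key 2 survives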
If for any reason you need to create a DataFrame with the same column headers, use the following code:
import pandas as pd
df = pd.DataFrame([['Green','Blue'], ['small','BIG'], ['High','Low']], columns = [2,2])
print(df)
It will show the entire table with the same column headers.
I want to do data inspection and print the count of rows that match a certain value in one of the columns. So below is my code:
import numpy as np
import pandas as pd
data = pd.read_csv("census.csv")
The census.csv file has a column "income" which has 3 values: '<=50K', '=50K' and '>50K',
and I want to print the number of rows that have the income value '<=50K'.
I was trying it like below:
count = data['income']='<=50K'
That does not work though.
Sum the Boolean selection:
(data['income'].eq('<=50K')).sum()
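For context, a minimal self-contained version of that one-liner (assuming the same census.csv file as in the question):
import pandas as pd

data = pd.read_csv("census.csv")
# Comparing the column to '<=50K' gives a Boolean Series; True counts as 1 when summed.
count = data['income'].eq('<=50K').sum()
print(count)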
The key is to learn how to filter pandas rows.
Quick answer:
import pandas as pd
data = pd.read_csv("census.csv")
df2 = data[data['income']=='<=50K']
print(df2)
print(len(df2))
Slightly longer answer:
import pandas as pd
data = pd.read_csv("census.csv")
filter = data['income']=='<=50K'
print(filter) # notice the boolean list based on filter criteria
df2 = data[filter] # next we use that boolean list to filter data
print(df2)
print(len(df2))
Introduction
I have two dataframes. I would like to apply a function to each row of the first one. This function depends on the row and the entire second dataframe. I would like to do this efficiently.
Reproducible Example
Setting up the dataframes
import pandas as pd
import numpy as np
Let the two dataframes be:
df0 = pd.DataFrame.from_dict({'a':np.random.normal(0,1,5),'b':np.random.normal(0,1,5)})
df1 = pd.DataFrame.from_dict({'c':np.random.normal(0,1,10),'d':np.random.normal(0,1,10)})
(In the real application, they are much bigger.)
I would like to find which row from df1 is closest to each row in df0, where closest is defined as having the least squared_dist between them:
def squared_dist(x,y):
    return np.sum(np.square(x-y))
What I have tried
What I do is create two NumPy arrays from the dataframes:
df0np=df0.to_numpy()
df1np=df1.to_numpy()
Iterate through these arrays:
res=[]
for row in df0np:
    distances = [squared_dist(row,df1np[i,]) for i in range(len(df1np))]
    index=np.argmin(distances)
    res.append(index)
Add the result to df0 as a new column:
df0['res']=res
How fast is it?
The whole code in one piece, including timings for the method described above:
import time
import pandas as pd
import numpy as np

def squared_dist(x,y):
    return np.sum(np.square(x-y))

df0 = pd.DataFrame.from_dict({'a':np.random.normal(0,1,5),'b':np.random.normal(0,1,5)})
df1 = pd.DataFrame.from_dict({'c':np.random.normal(0,1,10),'d':np.random.normal(0,1,10)})
df0np=df0.to_numpy()
df1np=df1.to_numpy()
start=time.time()
df0np=df0.to_numpy()
df1np=df1.to_numpy()
res=[]
for row in df0np:
    distances = [squared_dist(row,df1np[i,]) for i in range(len(df1np))]
    index=np.argmin(distances)
    res.append(index)
df0['res']=res
end=time.time()
print(end-start) # prints 0.0014030933380126953
Question
How could I make this more efficient, i.e. how could I achieve lower execution times? This method works fine for the example above, but in my real-world application, where the dataframes are much bigger, it is unusably slow.
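One possible direction (my own sketch, not part of the original post) is to compute all pairwise squared distances at once with NumPy broadcasting and take the argmin per row, which removes the Python-level double loop:
import pandas as pd
import numpy as np

df0 = pd.DataFrame.from_dict({'a': np.random.normal(0, 1, 5), 'b': np.random.normal(0, 1, 5)})
df1 = pd.DataFrame.from_dict({'c': np.random.normal(0, 1, 10), 'd': np.random.normal(0, 1, 10)})
df0np = df0.to_numpy()
df1np = df1.to_numpy()

# diffs has shape (len(df0), len(df1), 2): every df0 row minus every df1 row.
diffs = df0np[:, None, :] - df1np[None, :, :]
sq_dists = np.square(diffs).sum(axis=2)      # shape (len(df0), len(df1))
df0['res'] = sq_dists.argmin(axis=1)         # index of the closest df1 row for each df0 row
For very large frames the full distance matrix may not fit in memory; chunking the computation or switching to a neighbor search such as scipy.spatial.cKDTree (an alternative technique, not from the post) would then be worth considering.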
Python Pandas rolling aggregate a column of lists
I have a df that has a column of lists.
import pandas as pd
import numpy as np
# Get some time series data
df = pd.read_csv("https://raw.githubusercontent.com/plotly/datasets/master/timeseries.csv")
input_cols = ['A', 'B']
df['single_input_vector'] = df[input_cols].apply(tuple, axis=1).apply(list)
I am wondering if there is a way to create a rolling aggregate of the 'single_input_vector' column for a given window. I looked at the following SO link but it does not provide a way to include a window. In my case, the desired output column for a window of 3 would be:
Row1: [[24.68, 164.93]]
Row2: [[24.68, 164.93], [24.18, 164.89]]
Row3: [[24.68, 164.93], [24.18, 164.89], [23.99, 164.63]]
Row4: [[24.18, 164.89], [23.99, 164.63], [24.14, 163.92]]
and so on.
I can't think of a more efficient way to do this, so while it does work, there may be performance constraints on massive data sets.
We are basically using rolling count to create a start:stop set of slicing indices.
import pandas as pd
import numpy as np
# Get some time series data
df = pd.read_csv("https://raw.githubusercontent.com/plotly/datasets/master/timeseries.csv")
input_cols = ['A', 'B']
df['single_input_vector'] = df[input_cols].apply(tuple, axis=1).apply(list)
window = 3
df['len'] = df['A'].rolling(window=window).count()
# For each row, slice the previous `window` rows of 'single_input_vector' (up to and
# including the current row) and store them as an array of lists.
df['vector_list'] = df.apply(lambda x: df['single_input_vector'][max(0,x.name-(window-1)):int(x.name)+1].values, axis=1)
Here is a dummy example of the DF I'm working with ('ETC' represents several columns):
df = pd.DataFrame(data={'PlotCode':['A','A','A','A','B','B','B','C','C'],
                        'INVYR':[2000,2000,2000,2005,1990,2000,1990,2005,2001],
                        'ETC':['a','b','c','d','e','f','g','h','i']})
And here is what I want to end up with:
df1 = pd.DataFrame(data={'PlotCode':['A','A','A','B','B','C'],
                         'INVYR':[2000,2000,2000,1990,1990,2001],
                         'ETC':['a','b','c','e','g','i']})
NOTE: I want ALL rows with the minimum 'INVYR' value for each 'PlotCode', not just one; otherwise I assume I could do something easier with drop_duplicates and sort.
So far, following the answer here (Appending pandas dataframes generated in a for loop), I've tried this with the following code:
df1 = []
for i in df['PlotCode'].unique():
    j = df[df['PlotCode']==i]
    k = j[j['INVYR']==j['INVYR'].min()]
    df1.append(k)
df1 = pd.concat(df1)
This code works but is very slow: my actual data contains some 40,000 different PlotCodes, so this isn't a feasible solution. Does anyone know a smoother filtering way of doing this? I feel like I'm missing something very simple.
Thank you in advance!
Try not to use for loops when using pandas; they are extremely slow compared to the vectorized operations pandas offers.
Solution 1:
Determine the minimum INVYR for every PlotCode, using .groupby():
min_invyr_per_plotcode = df.groupby('PlotCode', as_index=False)['INVYR'].min()
And use pd.merge() to do an inner join between your original df and the minimum you just found:
result_df = pd.merge(
    df,
    min_invyr_per_plotcode,
    how='inner',
    on=['PlotCode', 'INVYR'],
)
Solution 2:
Again, determine the minimum per group, but now add it as a column to your dataframe. This per-group minimum gets added to every row by using .groupby().transform():
df['min_per_group'] = (df
    .groupby('PlotCode')['INVYR']
    .transform('min')
)
Now filter your dataframe where INVYR in a row is equal to the minimum of that group:
df[df['INVYR'] == df['min_per_group']]
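As a minor variation (my own note, not part of the original answer), the helper column can be skipped by filtering against the transform directly:
import pandas as pd

df = pd.DataFrame(data={'PlotCode': ['A','A','A','A','B','B','B','C','C'],
                        'INVYR': [2000,2000,2000,2005,1990,2000,1990,2005,2001],
                        'ETC': ['a','b','c','d','e','f','g','h','i']})

# Keep rows whose INVYR equals the per-PlotCode minimum, no helper column needed.
print(df[df['INVYR'] == df.groupby('PlotCode')['INVYR'].transform('min')])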