I have a pandas DataFrame that I've grouped by a column.
After this operation I need to generate all unique pairs between the rows of
each group and perform some aggregate operation on all the pairs of a group.
I've implemented the following sample algorithm to give you an idea. I want to refactor this code to use pandas operations, in order to improve performance and/or reduce code complexity.
Code:
import numpy as np
import pandas as pd
import itertools
#Construct Dataframe
samples=40
a=np.random.randint(3,size=(1,samples))
b=np.random.randint(9,size=(1,samples))
c=np.random.randn(1,samples)
d=np.append(a,b,axis=0)
e=np.append(d,c,axis=0)
e=e.transpose()
df = pd.DataFrame(e,columns=['attr1','attr2','value'])
df['attr1'] = df.attr1.astype('int')
df['attr2'] = df.attr2.astype('int')
#drop duplicate rows so (attr1,attr2) will be key
df = df.drop_duplicates(['attr1','attr2'])
#df = df.reset_index()
print(df)
for key,tup in df.groupby('attr1'):
    print('Group',key,' length ',len(tup))
    #generate pairs
    agg=[]
    for v1,v2 in itertools.combinations(list(tup['attr2']),2):
        p1_val = float(df.loc[(df['attr1']==key) & (df['attr2']==v1)]['value'])
        p2_val = float(df.loc[(df['attr1']==key) & (df['attr2']==v2)]['value'])
        agg.append([key,(v1,v2),(p1_val-p2_val)**2])
    #insert pairs to dataframe
    p = pd.DataFrame(agg,columns=['group','pair','value'])
    top = p.sort_values(by='value').head(4)
    print(top['pair'])
    #Perform some operation in df based on pair values
    #....
I am really afraid that pandas DataFrames cannot provide such sophisticated analysis functionality.
Do I have to stick to traditional python like in the example?
I'm new to Pandas so any comments/suggestions are welcome.
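For comparison, here is a minimal pandas-only sketch of the pairing step (my own assumption about one possible vectorization, not code from the question): a self-merge on attr1 produces every ordered pair within a group, and keeping only rows where attr2_1 < attr2_2 leaves each unordered pair exactly once.
import numpy as np
import pandas as pd

# Hypothetical example data shaped like the question's df (different random values).
rng = np.random.default_rng(0)
df = pd.DataFrame({'attr1': rng.integers(0, 3, 40),
                   'attr2': rng.integers(0, 9, 40),
                   'value': rng.standard_normal(40)}).drop_duplicates(['attr1', 'attr2'])

pairs = df.merge(df, on='attr1', suffixes=('_1', '_2'))       # all ordered pairs within each attr1 group
pairs = pairs[pairs['attr2_1'] < pairs['attr2_2']]            # keep each unordered pair once
pairs['value'] = (pairs['value_1'] - pairs['value_2']) ** 2   # squared difference, as in the loop
top = pairs.sort_values('value').groupby('attr1').head(4)     # 4 smallest pairs per group
print(top[['attr1', 'attr2_1', 'attr2_2', 'value']])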
Related
I came across a similar case: I build a DataFrame whose columns all share the same name, and it does not output the entire table.
My code:
import pandas as pd
data = {2:['Green','Blue'],
        2:['small','BIG'],
        2:['High','Low']}
df = pd.DataFrame(data)
print(df)
Output:
2
0 High
1 Low
A dictionary only supports unique keys in its key-value pairs.
So when you create a DataFrame from a dictionary, only the latest key-value pair is kept when a key is duplicated.
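As a quick illustration (my own example, not from the original post), the dictionary literal itself already collapses the duplicate keys before pandas ever sees it:
data = {2: ['Green', 'Blue'],
        2: ['small', 'BIG'],
        2: ['High', 'Low']}
print(data)  # {2: ['High', 'Low']} -- only the last value for key 2 survives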
If for any reason you need to create a DataFrame with the same column headers, use the following code:
import pandas as pd
df = pd.DataFrame([['Green','Blue'], ['small','BIG'], ['High','Low']], columns = [2,2])
print(df)
It will show the entire table with the same column headers.
I want to do data inspection and print the count of rows that match a certain value in one of the columns. So below is my code:
import numpy as np
import pandas as pd
data = pd.read_csv("census.csv")
The census.csv file has a column "income" which has 3 values: '<=50K', '=50K' and '>50K',
and I want to print the number of rows that have the income value '<=50K'.
I was trying it like below:
count = data['income']='<=50K'
That does not work though.
Sum the Boolean selection:
(data['income'].eq('<=50K')).sum()
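For context, a minimal self-contained version of that one-liner (assuming the same census.csv file as in the question):
import pandas as pd

data = pd.read_csv("census.csv")
# Comparing the column to '<=50K' gives a Boolean Series; True counts as 1 when summed.
count = data['income'].eq('<=50K').sum()
print(count)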
The key is to learn how to filter pandas rows.
Quick answer:
import pandas as pd
data = pd.read_csv("census.csv")
df2 = data[data['income']=='<=50K']
print(df2)
print(len(df2))
Slightly longer answer:
import pandas as pd
data = pd.read_csv("census.csv")
filter = data['income']=='<=50K'
print(filter) # notice the boolean list based on filter criteria
df2 = data[filter] # next we use that boolean list to filter data
print(df2)
print(len(df2))
Introduction
I have two dataframes. I would like to apply a function to each row of the first one. This function depends on the row and the entire second dataframe. I would like to do this efficiently.
Reproducible Example
Setting up the dataframes
import pandas as pd
import numpy as np
Let the two dataframes be:
df0 = pd.DataFrame.from_dict({'a':np.random.normal(0,1,5),'b':np.random.normal(0,1,5)})
df1 = pd.DataFrame.from_dict({'c':np.random.normal(0,1,10),'d':np.random.normal(0,1,10)})
(In the real application, they are much bigger.)
I would like to find which row from df1 is closest to each row in df0, where closest is defined as having the least squared_dist between them:
def squared_dist(x,y):
    return np.sum(np.square(x-y))
What I have tried
What I do is create two NumPy arrays from the dataframes:
df0np=df0.to_numpy()
df1np=df1.to_numpy()
Iterate through these arrays:
res=[]
for row in df0np:
    distances = [squared_dist(row,df1np[i,]) for i in range(len(df1np))]
    index=np.argmin(distances)
    res.append(index)
Add the result to df0 as a new column:
df0['res']=res
How fast is it?
The whole code in one piece, including timings for the method described above:
import time
import pandas as pd
import numpy as np

def squared_dist(x,y):
    return np.sum(np.square(x-y))

df0 = pd.DataFrame.from_dict({'a':np.random.normal(0,1,5),'b':np.random.normal(0,1,5)})
df1 = pd.DataFrame.from_dict({'c':np.random.normal(0,1,10),'d':np.random.normal(0,1,10)})
df0np=df0.to_numpy()
df1np=df1.to_numpy()
start=time.time()
df0np=df0.to_numpy()
df1np=df1.to_numpy()
res=[]
for row in df0np:
    distances = [squared_dist(row,df1np[i,]) for i in range(len(df1np))]
    index=np.argmin(distances)
    res.append(index)
df0['res']=res
end=time.time()
print(end-start) # prints 0.0014030933380126953
Question
How could I make this more efficient, i.e. how could I achieve lower execution times? This method works fine for the example above, but in my real-world application, where the dataframes are much bigger, it is unusably slow.
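One possible direction (my own sketch, not part of the original post) is to compute all pairwise squared distances at once with NumPy broadcasting and take the argmin per row, which removes the Python-level double loop:
import pandas as pd
import numpy as np

df0 = pd.DataFrame.from_dict({'a': np.random.normal(0, 1, 5), 'b': np.random.normal(0, 1, 5)})
df1 = pd.DataFrame.from_dict({'c': np.random.normal(0, 1, 10), 'd': np.random.normal(0, 1, 10)})
df0np = df0.to_numpy()
df1np = df1.to_numpy()

# diffs has shape (len(df0), len(df1), 2): every df0 row minus every df1 row.
diffs = df0np[:, None, :] - df1np[None, :, :]
sq_dists = np.square(diffs).sum(axis=2)      # shape (len(df0), len(df1))
df0['res'] = sq_dists.argmin(axis=1)         # index of the closest df1 row for each df0 row
For very large frames the full distance matrix may not fit in memory; chunking the computation or switching to a neighbor search such as scipy.spatial.cKDTree (an alternative technique, not from the post) would then be worth considering.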
Python Pandas rolling aggregate a column of lists
I have a df that has a column of lists.
import pandas as pd
import numpy as np
# Get some time series data
df = pd.read_csv("https://raw.githubusercontent.com/plotly/datasets/master/timeseries.csv")
input_cols = ['A', 'B']
df['single_input_vector'] = df[input_cols].apply(tuple, axis=1).apply(list)
I am wondering if there is a way to create a rolling aggregate of the 'single_input_vector' column for a given window. I looked at the following SO link but it does not provide a way to include a window. In my case, the desired output column for a window of 3 would be:
Row1: [[24.68, 164.93]]
Row2: [[24.68, 164.93], [24.18, 164.89]]
Row3: [[24.68, 164.93], [24.18, 164.89], [23.99, 164.63]]
Row4: [[24.18, 164.89], [23.99, 164.63], [24.14, 163.92]]
and so on.
I can't think of a more efficient way to do this, so while it does work, there may be performance constraints on massive data sets.
We are basically using rolling count to create a start:stop set of slicing indices.
import pandas as pd
import numpy as np
# Get some time series data
df = pd.read_csv("https://raw.githubusercontent.com/plotly/datasets/master/timeseries.csv")
input_cols = ['A', 'B']
df['single_input_vector'] = df[input_cols].apply(tuple, axis=1).apply(list)
window = 3
df['len'] = df['A'].rolling(window=window).count()
# For each row, slice the previous `window` rows of 'single_input_vector' (up to and
# including the current row) and store them as an array of lists.
df['vector_list'] = df.apply(lambda x: df['single_input_vector'][max(0,x.name-(window-1)):int(x.name)+1].values, axis=1)
Here is a dummy example of the DF I'm working with ('ETC' represents several columns):
df = pd.DataFrame(data={'PlotCode':['A','A','A','A','B','B','B','C','C'],
                        'INVYR':[2000,2000,2000,2005,1990,2000,1990,2005,2001],
                        'ETC':['a','b','c','d','e','f','g','h','i']})
And here is what I want to end up with:
df1 = pd.DataFrame(data={'PlotCode':['A','A','A','B','B','C'],
                         'INVYR':[2000,2000,2000,1990,1990,2001],
                         'ETC':['a','b','c','e','g','i']})
NOTE: I want ALL rows with the minimum 'INVYR' value for each 'PlotCode', not just one; otherwise I assume I could do something easier with drop_duplicates and sort.
So far, following the answer here (Appending pandas dataframes generated in a for loop), I've tried this with the following code:
df1 = []
for i in df['PlotCode'].unique():
    j = df[df['PlotCode']==i]
    k = j[j['INVYR']==j['INVYR'].min()]
    df1.append(k)
df1 = pd.concat(df1)
This code works but is very slow: my actual data contains some 40,000 different PlotCodes, so this isn't a feasible solution. Does anyone know a smoother filtering way of doing this? I feel like I'm missing something very simple.
Thank you in advance!
Try not to use for loops when using pandas; they are extremely slow compared to the vectorized operations pandas offers.
Solution 1:
Determine the minimum INVYR for every PlotCode, using .groupby():
min_invyr_per_plotcode = df.groupby('PlotCode', as_index=False)['INVYR'].min()
And use pd.merge() to do an inner join between your original df and the minimum you just found:
result_df = pd.merge(
    df,
    min_invyr_per_plotcode,
    how='inner',
    on=['PlotCode', 'INVYR'],
)
Solution 2:
Again, determine the minimum per group, but now add it as a column to your dataframe. This per-group minimum gets added to every row by using .groupby().transform():
df['min_per_group'] = (df
    .groupby('PlotCode')['INVYR']
    .transform('min')
)
Now filter your dataframe where INVYR in a row is equal to the minimum of that group:
df[df['INVYR'] == df['min_per_group']]
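As a minor variation (my own note, not part of the original answer), the helper column can be skipped by filtering against the transform directly:
import pandas as pd

df = pd.DataFrame(data={'PlotCode': ['A','A','A','A','B','B','B','C','C'],
                        'INVYR': [2000,2000,2000,2005,1990,2000,1990,2005,2001],
                        'ETC': ['a','b','c','d','e','f','g','h','i']})

# Keep rows whose INVYR equals the per-PlotCode minimum, no helper column needed.
print(df[df['INVYR'] == df.groupby('PlotCode')['INVYR'].transform('min')])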