I have the following dataframe
import pandas as pd

x = pd.DataFrame(
    {
        'FirstGroupCriterium': [1, 1, 2, 2, 3],
        'SortingCriteria': [1, 1, 1, 2, 1],
        'Value': [10, 20, 30, 40, 50]
    }
)
x.sort_values('SortingCriteria').groupby('FirstGroupCriterium').agg(last_value=('Value', 'last'))
The latter outputs:
                     last_value
FirstGroupCriterium
1                            20
2                            40
3                            50
What I would like to have is to sum the values belonging to the last SortingCriteria in each group. So in this case:
                     last_value
FirstGroupCriterium
1                    10+20 = 30
2                            40
3                            50
My initial idea was to call a custom aggregator function that groups the data yet again, but that fails.
def last_value(group):
    return group.groupby('SortingCriteria')['Value'].sum().tail(1)
Do you have any idea how to get this to work? Thank you!
Sort by both columns first, then filter the last rows per FirstGroupCriterium using GroupBy.transform, and aggregate the sum:
df = x.sort_values(['FirstGroupCriterium','SortingCriteria'])
df1 = df[df['SortingCriteria'].eq(df.groupby('FirstGroupCriterium')['SortingCriteria'].transform('last'))]
print (df1)
FirstGroupCriterium SortingCriteria Value
0 1 1 10
1 1 1 20
3 2 2 40
4 3 1 50
df2 = df1.groupby(['FirstGroupCriterium'],as_index=False)['Value'].sum()
print (df2)
FirstGroupCriterium Value
0 1 30
1 2 40
2 3 50
Another idea is to aggregate the sum by both columns and then remove duplicates, keeping the last row, with DataFrame.drop_duplicates:
df2 = (df.groupby(['FirstGroupCriterium','SortingCriteria'],as_index=False)['Value'].sum()
         .drop_duplicates(['FirstGroupCriterium'], keep='last'))
print (df2)
FirstGroupCriterium SortingCriteria Value
0 1 1 30
2 2 2 40
3 3 1 50
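The custom-aggregator idea from the question can also be made to work: agg passes the function a single column (here the Value Series, which has no SortingCriteria column to group by), while GroupBy.apply passes each whole sub-DataFrame. A minimal sketch on the same x as above:
def last_value(group):
    # sum Value per SortingCriteria inside the group, then keep the sum for the last (highest) SortingCriteria
    return group.groupby('SortingCriteria')['Value'].sum().iloc[-1]

print(x.groupby('FirstGroupCriterium').apply(last_value))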
I have a dataframe
import pandas as pd

df1 = pd.DataFrame([["A",1,98,56,61,1,4,6], ["B",1,79,54,36,2,5,7], ["C",1,97,32,83,3,6,8],
                    ["B",1,96,31,90,4,7,9], ["C",1,45,32,12,5,8,10], ["A",1,67,33,55,6,9,11]],
                   columns=["id","date","c1","c2","c3","x","y","z"])
I have another dataframe where conditions for selected columns are present
df2 = pd.DataFrame([["c2",40], ["c1",80], ["c3",90]], columns=["col","condition"])
Perform operations on df1 based on the conditions present in df2. For example, if the condition for c1 in df2 is 80, change the values in column c1 of df1 to -1 where the value is less than 80, and to 1 where it is 80 or higher. Perform similar operations for the other columns present in df2.
Expected Output:
df_out = pd.DataFrame([["A",1,1,1,-1,1,4,6], ["B",1,-1,1,-1,2,5,7], ["C",1,1,-1,-1,3,6,8],
                       ["B",1,1,-1,1,4,7,9], ["C",1,-1,-1,-1,5,8,10], ["A",1,-1,-1,-1,6,9,11]],
                      columns=["id","date","c1","c2","c3","x","y","z"])
How to do it?
Convert df2 to a Series first, then create a mask by comparing with the column names, compare for greater or equal by DataFrame.ge with the Series, and pass the result to numpy.where:
import numpy as np

s = df2.set_index('col')['condition']
m = df1.columns.isin(s.index)
df1.loc[:, m] = np.where(df1.loc[:, m].ge(s), 1, -1)
print (df1)
id date c1 c2 c3 x y z
0 A 1 1 1 -1 1 4 6
1 B 1 -1 1 -1 2 5 7
2 C 1 1 -1 -1 3 6 8
3 B 1 1 -1 1 4 7 9
4 C 1 -1 -1 -1 5 8 10
5 A 1 -1 -1 -1 6 9 11
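If you prefer to see the same logic spelled out, a loop-based sketch over the condition rows (same df1/df2 as above) would be:
import numpy as np

# one condition column at a time; skip condition names that match no column
for col, threshold in df2.itertuples(index=False):
    if col in df1.columns:
        df1[col] = np.where(df1[col].ge(threshold), 1, -1)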
I have a table with 40 columns and 1500 rows. I want to find the maximum value among the 30th-32nd columns (3 columns). How can it be done? I want to return the maximum value among these 3 columns and the index of the dataframe.
print(Max_kVA_df.iloc[:, 30:33].max())
Hi, you can refer to this example:
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3, 4, 5],
                   'col2': [4, 5, 6, 7, 8],
                   'col3': [2, 3, 4, 5, 7]})
print(df)
# Mention the range of columns you want; in your case change 0:3 to 30:33 (33 is excluded)
ser = df.iloc[:, 0:3].max()
print(ser.max())
Output
8
Select values by positions and use np.max:
Sample, for the maximum of the first 5 rows:
import numpy as np
import pandas as pd

np.random.seed(123)
df = pd.DataFrame(np.random.randint(10, size=(10, 3)), columns=list('ABC'))
print (df)
A B C
0 2 2 6
1 1 3 9
2 6 1 0
3 1 9 0
4 0 9 3
print (df.iloc[0:5])
A B C
0 2 2 6
1 1 3 9
2 6 1 0
3 1 9 0
4 0 9 3
print (np.max(df.iloc[0:5].max()))
9
Or use iloc this way:
print(df.iloc[:, [30, 31, 32]].max().max())
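The question also asks for the index of the maximum; a small sketch on top of the same positional slice (using the question's Max_kVA_df) could be:
sub = Max_kVA_df.iloc[:, 30:33]   # the 3 columns at positions 30, 31 and 32
print(sub.max().max())            # overall maximum of those columns
print(sub.stack().idxmax())       # (row index, column name) where that maximum occurs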
I have a dataframe with about 20k rows, with headings X, Y, Z, I, R, G, B (yes, it's a point cloud).
I would like to create numerous sub-dataframes by grouping the data into rows of 100 after sorting by column X.
Subsequently I would like to sort each sub-dataframe by the Y column and break it down further into rows of 50.
The end result should be a group of sub-dataframes of 50 rows each, and I would like to pick out all the rows with the highest Z value in each sub-dataframe and write them to a CSV file.
I have reached the following point with my code, but I am not sure how to continue further.
import pandas as pd
headings = ['x', 'y', 'z']
data = pd.read_table('file.csv', sep=',', skiprows=[0], names=headings)
points = data.sort_values(by=['x'])
Considering a dummy dataframe of 1000 rows,
df.head() # first 5 rows
X Y Z I R G B
0 6 6 0 3 7 0 2
1 0 8 3 6 5 9 7
2 8 9 7 3 0 4 5
3 9 6 8 5 1 0 0
4 9 0 3 0 9 2 9
First, extract the highest value of Z from the dataframe,
import numpy as np  # np.split is used below

z_max = df['Z'].max()
df = df.sort_values('X')

# list of dataframes
dfs_X = np.split(df, len(df) / 100)

results = pd.DataFrame()
for idx, df_x in enumerate(dfs_X):
    dfs_X[idx] = df_x.sort_values('Y')
    dfs_Y = np.split(dfs_X[idx], len(dfs_X[idx]) / 50)
    for idy, df_y in enumerate(dfs_Y):
        rows = df_y[df_y['Z'] == z_max]
        results = pd.concat([results, rows])  # DataFrame.append was removed in pandas 2.x
results.head()
results will contain the rows from all sub-dataframes which have the highest value of Z.
Output: First 5 rows
X Y Z I R G B
541 0 0 9 0 3 6 2
610 0 2 9 3 0 7 6
133 0 4 9 3 3 9 9
731 0 5 9 5 1 0 2
629 0 5 9 0 9 7 7
Now, write this dataframe to CSV using results.to_csv().
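If you instead want the highest-Z rows per 50-row chunk (which is how the question phrases it) rather than the rows matching the global maximum, a sketch with the same splitting logic could look like this; the output filename is only a placeholder:
import numpy as np
import pandas as pd

df = df.sort_values('X')
chunks = []
for df_x in np.split(df, len(df) // 100):            # 100-row groups after sorting by X
    df_x = df_x.sort_values('Y')
    for df_y in np.split(df_x, len(df_x) // 50):     # 50-row groups after sorting by Y
        chunks.append(df_y[df_y['Z'] == df_y['Z'].max()])  # all rows tying this chunk's max Z

results = pd.concat(chunks)
results.to_csv('highest_z_per_chunk.csv', index=False)     # hypothetical filename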
I am trying to sort a pandas df into individual columns based on when values in columns change. For the df below I can sort the df into separate columns when a value changes in Col B. But I'm trying to add Col C so it's when values change in both Col B and Col C.
import pandas as pd
df = pd.DataFrame({
    'A': [10,20,30,40,40,30,20,10,5,10,15,20,20,15,10,5],
    'B': ['X','X','X','X','Y','Y','Y','Y','X','X','X','X','Y','Y','Y','Y'],
    'C': ['W','W','Z','Z','Z','Z','W','W','W','W','Z','Z','Z','Z','W','W'],
})
d = df['B'].ne(df['B'].shift()).cumsum()
df['C'] = d.groupby(df['B']).transform(lambda x: pd.factorize(x)[0]).add(1).astype(str)
df['D'] = df.groupby(['B','C']).cumcount()
df = df.set_index(['D','C','B'])['A'].unstack([2,1])
df.columns = df.columns.map(''.join)
Output:
X1 Y1 X2 Y2
D
0 10 40 5 20
1 20 30 10 15
2 30 20 15 10
3 40 10 20 5
As you can see, this creates a new column every time there's a new value in Col B. But I'm trying to incorporate Col C as well. So it should be every time there's a change in both Col B and Col C.
Intended output:
XW1 XZ1 YZ1 YW1 XW2 XZ2 YZ2 YW2
0 10 30 40 20 5 15 20 10
1 20 40 30 10 10 20 15 5
Just based on your output, create the helper columns one by one:
df['key'] = df.B + df.C                            # create the key
df['key2'] = (df.key != df.key.shift()).cumsum()   # give every consecutive run of the same key one group number
df.key2 = df.groupby('key').key2.apply(lambda x: x.astype('category').cat.codes + 1)  # renumber each key's runs to 1, 2, ...
df['key3'] = df.groupby(['key', 'key2']).cumcount()   # row position within each run, used as the pivot index
df['key'] = df.key + df.key2.astype(str)               # final column labels, e.g. XW1, XW2
df.pivot(index='key3', columns='key', values='A')  # yields
Out[126]:
key XW1 XW2 XZ1 XZ2 YW1 YW2 YZ1 YZ2
key3
0 10 5 30 15 20 10 40 20
1 20 10 40 20 10 5 30 15
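If you prefer to stay close to the unstack approach from the question, a sketch that builds the run key from both B and C (restated from scratch so it is self-contained) could be:
import pandas as pd

df = pd.DataFrame({
    'A': [10,20,30,40,40,30,20,10,5,10,15,20,20,15,10,5],
    'B': ['X','X','X','X','Y','Y','Y','Y','X','X','X','X','Y','Y','Y','Y'],
    'C': ['W','W','Z','Z','Z','Z','W','W','W','W','Z','Z','Z','Z','W','W'],
})

key = df['B'] + df['C']                           # combined label, e.g. 'XW'
runs = key.ne(key.shift()).cumsum()               # new run id whenever the label changes
occ = runs.groupby(key).transform(lambda x: pd.factorize(x)[0]).add(1).astype(str)  # 1st/2nd run per label
pos = df.groupby([key, occ]).cumcount()           # row position inside each run
out = df.set_index([pos, key + occ])['A'].unstack()   # columns like XW1, XZ1, ..., YW2
print(out)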