Keeping one (invariant) row from each dataframe group - python

I have a pandas DataFrame that I have grouped by a combination of three columns A, B, C.
grouped = df.groupby(["A", "B", "C"])
Several additional columns D, E, F, G are (guaranteed) identical for all elements of each group, while other columns X, Y vary within each group. (I already know which columns are fixed, and which vary.)
I would like to construct a dataframe containing one row per group, and consisting of the values for the invariant columns A, B, C, D, E, F, G only. What is the most straightforward way to do this? Since there are lots of identical values, I would prefer to specify which columns to omit, rather than the other way around.
I've come up with "aggregating" by choosing one row from each group, and then deleting the unwanted columns in a separate step:
thinned = grouped.aggregate(lambda x: x.iloc[0])
del thinned["X"], thinned["Y"]
The purpose of this is to combine the invariant values with several new summary values that I calculate, in a dataframe that has one row per (current) group.
thinned["newAA"] = grouped.apply(some_function)
thinned["newBB"] = grouped.apply(other_function)
...
But I suspect there must be a less round-about way.

You could use GroupBy.first() to select just the first record of each group. For example, this
import pandas
df = pandas.DataFrame({
    'A': [1, 1, 2, 2, 3, 3],
    'B': [1, 1, 1, 2, 2, 2],
    'C': [2, 2, 3, 3, 1, 1]
})
print(df.groupby(['A', 'B'])['C'].first())
results in
A  B
1  1    2
2  1    3
   2    3
3  2    1
Name: C, dtype: int64
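If you want one row per group with all of the invariant columns, as the question asks, the same idea extends beyond a single column; a minimal sketch with made-up data, using the question's column names:
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2], 'B': [1, 1, 2], 'C': [5, 5, 6],
                   'D': [9, 9, 8], 'E': [0, 0, 1], 'F': [2, 2, 3],
                   'G': [4, 4, 5], 'X': [10, 20, 30], 'Y': [7, 8, 9]})

grouped = df.groupby(['A', 'B', 'C'])

# first() on just the invariant columns gives one row per group;
# reset_index() turns the group keys back into ordinary columns
thinned = grouped[['D', 'E', 'F', 'G']].first().reset_index()
print(thinned)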

I think you need drop_duplicates:
import pandas as pd

df = pd.DataFrame({'A': [7, 4, 4],
                   'B': [7, 4, 4],
                   'C': [7, 4, 4],
                   'D': [7, 4, 4],
                   'E': [7, 4, 4],
                   'F': [7, 4, 4],
                   'G': [7, 4, 4],
                   'X': [1, 2, 8],
                   'Y': [5, 7, 0]})
print(df)
   A  B  C  D  E  F  G  X  Y
0  7  7  7  7  7  7  7  1  5
1  4  4  4  4  4  4  4  2  7
2  4  4  4  4  4  4  4  8  0
# filter by subset
cols = ['A', 'B', 'C', 'D', 'E', 'F', 'G']
df1 = df.drop_duplicates(subset=cols)[cols]
print(df1)
   A  B  C  D  E  F  G
0  7  7  7  7  7  7  7
1  4  4  4  4  4  4  4
# or remove the unwanted columns first
df2 = df.drop(['X', 'Y'], axis=1).drop_duplicates()
print(df2)
   A  B  C  D  E  F  G
0  7  7  7  7  7  7  7
1  4  4  4  4  4  4  4

I guess you have many options here, more or less elegant.
First of all, do you care about 'X' and 'Y'? If you don't, since you're deleting them at the end anyway, you can simply use drop_duplicates:
new_df = df[['A', 'B', 'C', 'D', 'E', 'F', 'G']].drop_duplicates()
# this will keep only the unique values of the above columns
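To then attach the new per-group summary values the question asks about, one option is to merge the deduplicated frame with a groupby aggregation on the key columns. A sketch with toy data, where the mean of X stands in for the question's some_function:
import pandas as pd

df = pd.DataFrame({'A': [7, 4, 4], 'B': [7, 4, 4], 'C': [7, 4, 4],
                   'D': [7, 4, 4], 'X': [1, 2, 8], 'Y': [5, 7, 0]})

keys = ['A', 'B', 'C']
invariant = df.drop(columns=['X', 'Y']).drop_duplicates()

# hypothetical summary column; any per-group reduction works here
summaries = df.groupby(keys).agg(newAA=('X', 'mean')).reset_index()

result = invariant.merge(summaries, on=keys)
print(result)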

Related

What is the most efficient way to swap the values of two columns of a 2D list in python when the number of rows is in the tens of thousands?

For example, if I have an original list:
A B
1 3
2 4
it should be turned into:
A B
3 1
4 2
My two cents' worth: there are three ways to do it.
You could add a third column C, copy A into C, copy B into A, copy C into B, then delete C. This takes more memory.
You could write a swap function for the values in a row, then wrap it in a loop over the rows.
You could just swap the labels of the columns. This is probably the most efficient way.
You could use rename:
df2 = df.rename(columns={'A': 'B', 'B': 'A'})
output:
   B  A
0  1  3
1  2  4
If order matters:
df2 = df.rename(columns={'A': 'B', 'B': 'A'})[df.columns]
output:
   A  B
0  3  1
1  4  2
Use DataFrame.rename with a dictionary to swap the column names, then restore the original column order by selecting the columns:
df = df.rename(columns=dict(zip(df.columns, df.columns[::-1])))[df.columns]
print(df)
   A  B
0  3  1
1  4  2
You can also simply reassign the underlying values (the .values on the right-hand side strips the labels, so no alignment happens):
import pandas as pd
df = pd.DataFrame({"A":[1,2],"B":[3,4]})
df[["A","B"]] = df[["B","A"]].values
df
   A  B
0  3  1
1  4  2
For more than 2 columns:
df = pd.DataFrame({'A':[1,2,3],'B':[4,5,6],'C':[7,8,9], 'D':[10,11,12]})
print(df)
'''
   A  B  C   D
0  1  4  7  10
1  2  5  8  11
2  3  6  9  12
'''
df = df.set_axis(df.columns[::-1],axis=1)[df.columns]
print(df)
'''
    A  B  C  D
0  10  7  4  1
1  11  8  5  2
2  12  9  6  3
'''
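If only one pair among many columns should be swapped, the values trick from above generalizes; a small sketch:
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6],
                   'C': [7, 8, 9], 'D': [10, 11, 12]})

# swap only A and D, leaving B and C in place;
# .to_numpy() drops the labels so no alignment happens
df[['A', 'D']] = df[['D', 'A']].to_numpy()
print(df)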
I assume that your list is like this:
my_list = [[1, 3], [2, 4]]
So you can use this code:
print([[each_element[1], each_element[0]] for each_element in my_list])
The output is:
[[3, 1], [4, 2]]

Filter out entire group if all values in group are zero

Using pandas, I want to filter out all groups that contain only zero values.
So in pseudo-code, something like this:
df.groupby('my_group')['values'].filter(all(iszero))
Example input dataframe could be something like this
import random
df = pd.DataFrame({'my_group': ['A', 'B', 'C', 'D']*3, 'values': [0 if (x % 4 == 0 or x == 11) else random.random() for x in range(12)]})
   my_group    values
0         A  0.000000
1         B  0.286104
2         C  0.359804
3         D  0.596152
4         A  0.000000
5         B  0.560742
6         C  0.534575
7         D  0.251302
8         A  0.000000
9         B  0.445010
10        C  0.750434
11        D  0.000000
Here, group A contains all zero values, so it should be filtered out. Group D also has a zero value in row 11, but in the other rows it has non-zero values, so it shouldn't be filtered out.
Here are possible solutions, ordered from best to worst performance:
# filter groups by != 0, then filter the original column by the mask
df1 = df[df['my_group'].isin(df.loc[df['values'].ne(0), 'my_group'])]
# create the mask with groupby.transform
df1 = df[df['values'].ne(0).groupby(df['my_group']).transform('any')]
# filter with a lambda function (slow on large data)
df1 = df.groupby('my_group').filter(lambda x: x['values'].ne(0).any())
print(df1)
   my_group    values
1         B  0.286104
2         C  0.359804
3         D  0.596152
5         B  0.560742
6         C  0.534575
7         D  0.251302
9         B  0.445010
10        C  0.750434
11        D  0.000000
IIUC, use a condition to keep the rows: if any value in the group is not equal (ne) to zero, keep the group:
df2 = df.groupby('my_group').filter(lambda g: g['values'].ne(0).any())
output:
   my_group    values
1         B  0.286104
2         C  0.359804
3         D  0.596152
5         B  0.560742
6         C  0.534575
7         D  0.251302
9         B  0.445010
10        C  0.750434
11        D  0.000000
Or to get only the indices:
idx = df.groupby('my_group')['values'].filter(lambda s: s.ne(0).any()).index
output: Int64Index([1, 2, 3, 5, 6, 7, 9, 10, 11], dtype='int64')
You can use:
>>> df[df.groupby('my_group')['values'].transform('any')]
   my_group    values
1         B  0.507089
2         C  0.846842
3         D  0.953003
5         B  0.085316
6         C  0.482732
7         D  0.764508
9         B  0.879005
10        C  0.717571
11        D  0.000000

Getting the total for some columns (independently) in a data frame with python [duplicate]

I have the following DataFrame:
In [1]:
df = pd.DataFrame({'a': [1, 2, 3],
                   'b': [2, 3, 4],
                   'c': ['dd', 'ee', 'ff'],
                   'd': [5, 9, 1]})
df
Out[1]:
   a  b   c  d
0  1  2  dd  5
1  2  3  ee  9
2  3  4  ff  1
I would like to add a column 'e' which is the sum of columns 'a', 'b' and 'd'.
Looking through other forum posts, I thought something like this would work:
df['e'] = df[['a', 'b', 'd']].map(sum)
But it didn't.
I would like to know the appropriate operation with the list of columns ['a', 'b', 'd'] and df as inputs.
You can just call sum and set the param axis=1 to sum across the rows; this will ignore non-numeric columns:
In [91]:
df = pd.DataFrame({'a': [1, 2, 3], 'b': [2, 3, 4], 'c': ['dd', 'ee', 'ff'], 'd': [5, 9, 1]})
df['e'] = df.sum(axis=1)
df
Out[91]:
   a  b   c  d   e
0  1  2  dd  5   8
1  2  3  ee  9  14
2  3  4  ff  1   8
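One caveat: in recent pandas releases, summing across rows with mixed dtypes no longer silently skips string columns and may raise a TypeError instead; passing numeric_only=True should restore the behaviour shown above:
# needed on newer pandas, where 'c' (strings) is no longer skipped silently
df['e'] = df.sum(axis=1, numeric_only=True)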
If you want to just sum specific columns then you can create a list of the columns and remove the ones you are not interested in:
In [98]:
col_list = list(df)
col_list.remove('d')
col_list
Out[98]:
['a', 'b', 'c']
In [99]:
df['e'] = df[col_list].sum(axis=1)
df
Out[99]:
   a  b   c  d  e
0  1  2  dd  5  3
1  2  3  ee  9  5
2  3  4  ff  1  7
If you have just a few columns to sum, you can write:
df['e'] = df['a'] + df['b'] + df['d']
This creates a new column e with the values:
   a  b   c  d   e
0  1  2  dd  5   8
1  2  3  ee  9  14
2  3  4  ff  1   8
For longer lists of columns, EdChum's answer is preferred.
Create a list of the column names you want to add up, then:
df['total'] = df.loc[:, list_name].sum(axis=1)
If you only want the sum for certain rows, replace the ':' before the comma with a row selection.
This is a simpler way using iloc to select which columns to sum:
df['f'] = df.iloc[:, 0:2].sum(axis=1)
df['g'] = df.iloc[:, [0, 1]].sum(axis=1)
df['h'] = df.iloc[:, [0, 3]].sum(axis=1)
Produces:
   a  b   c  d   e  f  g   h
0  1  2  dd  5   8  3  3   6
1  2  3  ee  9  14  5  5  11
2  3  4  ff  1   8  7  7   4
I can't find a way to combine a range and specific columns that works, e.g. something like:
df['i']=df.iloc[:,[[0:2],3]].sum(axis=1)
df['i']=df.iloc[:,[0:2,3]].sum(axis=1)
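One way that does combine a range and specific positions is numpy's np.r_ helper, which concatenates slices and integers into a single positional indexer:
import numpy as np

# columns 0 and 1 (the 0:2 slice) plus column 3, in one indexer
df['i'] = df.iloc[:, np.r_[0:2, 3]].sum(axis=1)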
You can simply pass your dataframe into the following function:
def sum_frame_by_column(frame, new_col_name, list_of_cols_to_sum):
    frame[new_col_name] = frame[list_of_cols_to_sum].astype(float).sum(axis=1)
    return frame
Example:
Suppose I have a dataframe awards_frame with columns award_1, award_2 and award_3, and I want to create a new column that shows the sum of the awards for each row.
Usage:
I simply pass my awards_frame into the function, also specifying the name of the new column and a list of column names that are to be summed:
sum_frame_by_column(awards_frame, 'award_sum', ['award_1', 'award_2', 'award_3'])
The following syntax helped me when my columns are in sequence:
awards_frame.values[:, 1:4].sum(axis=1)
You can use the function aggregate, or its alias agg:
df[['a','b','d']].agg('sum', axis=1)
The advantage of agg is that you can use multiple aggregation functions:
df[['a','b','d']].agg(['sum', 'prod', 'min', 'max'], axis=1)
Output:
   sum  prod  min  max
0    8    10    1    5
1   14    54    2    9
2    8    12    1    4
The shortest and simplest way here is to use
df.eval('e = a + b + d')
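Note that eval returns a new DataFrame rather than modifying df in place (unless you pass inplace=True), so assign the result back:
df = df.eval('e = a + b + d')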

pandas reorder only a specific row

I have a DataFrame in which I want to switch the order of columns in only the third row, while keeping all other rows the same.
Under some condition, I have to switch orders for my project, but here is an example that probably has no real meaning.
Suppose the dataset is
df = pd.DataFrame({'A': [0, 1, 2, 3, 4],
                   'B': [5, 6, 7, 8, 9],
                   'C': ['a', 'b', 'c', 'd', 'e']})
df
Out[1]:
   A  B  C
0  0  5  a
1  1  6  b
2  2  7  c
3  3  8  d
4  4  9  e
I want to have the output:
   A  B  C
0  0  5  a
1  1  6  b
2  **7  2**  c
3  3  8  d
4  4  9  e
How do I do it?
I have tried:
new_order = [1, 0, 2] # specify new order of the third row
i = 2 # specify row number
df.iloc[i] = df[df.columns[new_order]].loc[i] # reorder the third row only and assign new values to df
I observed from the output of the right-hand side that the columns are reordered as I wanted:
df[df.columns[new_order]].loc[i]
Out[2]:
B    7
A    2
C    c
Name: 2, dtype: object
But when assigned back to df, it did nothing. I guess that's because of the name matching (label alignment).
Can someone help me? Thanks in advance!
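For what it's worth, the diagnosis about name matching looks right: assigning a Series aligns on its labels, so the reordered values land back in their original columns. Stripping the labels with .to_numpy() before assigning should avoid that; a minimal sketch:
import pandas as pd

df = pd.DataFrame({'A': [0, 1, 2, 3, 4],
                   'B': [5, 6, 7, 8, 9],
                   'C': ['a', 'b', 'c', 'd', 'e']})

new_order = [1, 0, 2]
i = 2
# .to_numpy() drops the labels, so the values are written
# positionally instead of being re-aligned by column name
df.iloc[i] = df.iloc[i, new_order].to_numpy()
print(df)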

Creating columns dynamically. Assigning them a constant row vector

Say I have some dataframe df. I would like to add to it four columns ['A', 'B', 'C', 'D'] that do not exist yet, and that will hold a constant row vector [1, 2, 3, 4].
When I try to do:
df[new_columns] = [1, 2, 3, 4]
it fails (saying ['A', 'B', 'C', 'D'] is not in index).
How can I create multiple columns dynamically in Pandas? Do I always have to use append for something like this? I remember reading (e.g. in @Jeff's comment to this question) that in newer versions the dynamic creation of columns is supported. Am I wrong?
I think this is the way to go. Pretty clear logic here.
In [19]: pd.concat([df, pd.DataFrame([[1, 2, 3, 4]], columns=list('ABCD'), index=df.index)], axis=1)
Out[19]:
  label  somedata  A  B  C  D
0     b  1.462108  1  2  3  4
1     c -2.060141  1  2  3  4
2     e -0.322417  1  2  3  4
3     f -0.384054  1  2  3  4
4     c  1.133769  1  2  3  4
5     e -1.099891  1  2  3  4
6     d -0.172428  1  2  3  4
7     e -0.877858  1  2  3  4
8     c  0.042214  1  2  3  4
9     e  0.582815  1  2  3  4
Multi-assignment could work, but I don't think it's a great solution because it's so error-prone (e.g. say some of your columns already exist: what should happen?). And the right-hand side is very problematic, as you normally want to align, so it's not obvious that you need to broadcast.
You can do it column by column:
import pandas as pd

df = pd.DataFrame(index=range(5))
cols = ['A', 'B', 'C', 'D', 'E']
vals = [1, 2, 3, 4, 5]
for c, v in zip(cols, vals):
    df[c] = v
print(df)
Note that the last method mentioned in the other question you referred to works similarly by creating each column beforehand:
for a in attrlist:
    df[a] = 0
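On reasonably recent pandas versions, assign can also build all the columns in one call from a dict of scalars; a sketch:
import pandas as pd

df = pd.DataFrame(index=range(5))
cols = ['A', 'B', 'C', 'D']
vals = [1, 2, 3, 4]

# each scalar is broadcast down its new column
df = df.assign(**dict(zip(cols, vals)))
print(df)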
