Create a dataframe mask and sum columns - Python

Suppose I have the following dataframe:
import pandas as pd

# dictionary with list objects as values
details = {
    'A1': [1, 3, 4, 5],
    'A2': [2, 3, 5, 6],
    'A3': [4, 3, 2, 6],
}
# creating a DataFrame object
df = pd.DataFrame(details)
I want to query each column with the following conditions to obtain a boolean mask, and then perform the sum on axis=1:
A1 >= 3
A2 >= 3
A3 >= 4
I would like to end up with the following dataframe:
details = {
    'A1': [1, 3, 4, 5],
    'A2': [2, 3, 5, 6],
    'A3': [4, 3, 2, 6],
    'score': [1, 2, 2, 3]
}
# creating a DataFrame object
df = pd.DataFrame(details)
How would you do it?

Since your operators are the same, you can try numpy broadcasting:
import numpy as np

df['score'] = (df.T >= np.array([3, 3, 4])[:, None]).sum()
print(df)

   A1  A2  A3  score
0   1   2   4      1
1   3   3   3      2
2   4   5   2      2
3   5   6   6      3
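For what it's worth, the same comparison can be written without the transpose; a small sketch that broadcasts the thresholds across each row and sums along axis=1:

import numpy as np

# (4, 3) values against a (3,) threshold vector broadcasts column-wise;
# summing over axis=1 counts the True values per row
df['score'] = (df.to_numpy() >= np.array([3, 3, 4])).sum(axis=1)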

You could also do:
df.assign(score=(df >= [3, 3, 4]).sum(1))

   A1  A2  A3  score
0   1   2   4      1
1   3   3   3      2
2   4   5   2      2
3   5   6   6      3
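If the conditions did not all use the same operator, a hedged fallback is to build each mask explicitly and add them up, since booleans sum as 0/1 (column names as in the question):

# hypothetical mixed conditions: each comparison yields a boolean Series,
# and adding them counts how many conditions each row satisfies
df['score'] = (
    (df['A1'] >= 3).astype(int)
    + (df['A2'] >= 3)
    + (df['A3'] >= 4)
)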

If you want to explicitly align your comparators to each column, you can pass them as a Series (built from a dictionary) that aligns against the DataFrame's columns.
>>> comparisons = pd.Series({'A1': 3, 'A2': 3, 'A3': 4})
>>> df['score'] = df.ge(comparisons).sum(axis=1)
>>> df
   A1  A2  A3  score
0   1   2   4      1
1   3   3   3      2
2   4   5   2      2
3   5   6   6      3
For a little more manual control, you can always subset your df according to your comparators before performing the comparisons.
comparisons = pd.Series({'A1': 3, 'A2': 3, 'A3': 4})
df['score'] = df[comparisons.index].ge(comparisons).sum(axis=1)
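Because ge aligns on labels, the order of the comparators should not matter; a quick sketch with the same thresholds in a different order:

# alignment matches each threshold to the right column before comparing
comparisons = pd.Series({'A3': 4, 'A1': 3, 'A2': 3})
df['score'] = df[comparisons.index].ge(comparisons).sum(axis=1)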

Related

In pandas, how to re-arrange the dataframe to simultaneously combine groups of columns?

I hope someone can help me solve my issue.
Given a pandas dataframe (the one used in the answers below, with columns a1, a2, b1, b2, c), I would like to re-arrange it into a new dataframe, combining several sets of columns (the sets all have the same size) such that each set becomes a single column, as shown in the desired results below.
Thank you in advance for any tips.
For a general solution, you can try one of these two options:
You could try this, using OrderedDict to collect the non-numeric column letters in order of first appearance, pd.DataFrame.filter to select the columns with similar names, and then pd.DataFrame.stack to concatenate the values:
import pandas as pd
from collections import OrderedDict

df = pd.DataFrame([[0, 1, 2, 3, 4], [5, 6, 7, 8, 9]], columns=['a1', 'a2', 'b1', 'b2', 'c'])

newdf = pd.DataFrame()
for col in list(OrderedDict.fromkeys(''.join(df.columns)).keys()):
    if col.isalpha():
        newdf[col] = df.filter(like=col, axis=1).stack().reset_index(level=1, drop=True)
newdf = newdf.reset_index(drop=True)
Output:
df
   a1  a2  b1  b2  c
0   0   1   2   3  4
1   5   6   7   8  9
newdf
   a  b  c
0  0  2  4
1  1  3  4
2  5  7  9
3  6  8  9
Another way to get the column names is using re and set like this, and then sorting the columns alphabetically:
import re

newdf = pd.DataFrame()
for col in set(re.findall(r'[^\W\d_]', ''.join(df.columns))):
    newdf[col] = df.filter(like=col, axis=1).stack().reset_index(level=1, drop=True)
newdf = newdf.reindex(sorted(newdf.columns), axis=1).reset_index(drop=True)
Output:
newdf
   a  b  c
0  0  2  4
1  1  3  4
2  5  7  9
3  6  8  9
You can do this with pd.wide_to_long by renaming the 'c' column so it has a suffix too:
df_out = pd.wide_to_long(df.reset_index().rename(columns={'c': 'c1'}),
                         ['a', 'b', 'c'], 'index', 'no')
df_out = df_out.reset_index(drop=True).ffill().astype(int)
df_out
Output:
   a  b  c
0  0  2  4
1  1  3  4
2  5  7  9
3  6  8  9
Same dataframe, just sorted differently:
pd.wide_to_long(df, ['a', 'b'], 'c', 'no').reset_index().drop('no', axis=1)
Output:
   c  a  b
0  4  0  2
1  9  5  7
2  4  1  3
3  9  6  8
The fact that column c had only one column while the other letters had two made this a bit tricky. I first stacked the dataframe and removed the numbers from the column names. Then for a and b I pivoted the dataframe and dropped all NaNs. For c, I repeated each value twice to match the length of a and b and then merged it in.
input:
import pandas as pd

df = pd.DataFrame({'a1': {0: 0, 1: 5},
                   'a2': {0: 1, 1: 6},
                   'b1': {0: 2, 1: 7},
                   'b2': {0: 3, 1: 8},
                   'c': {0: 4, 1: 9}})
df
code:
import numpy as np

df1 = df.copy().stack().reset_index().replace('[0-9]+', '', regex=True)
dfab = (df1[df1['level_1'].isin(['a', 'b'])]
        .pivot(index=0, columns='level_1', values=0)
        .apply(lambda x: pd.Series(x.dropna().values)).astype(int))
dfc = pd.DataFrame(np.repeat(df['c'].values, 2, axis=0)).rename({0: 'c'}, axis=1)
df2 = pd.merge(dfab, dfc, how='left', left_index=True, right_index=True)
df2
output:
   a  b  c
0  0  2  4
1  1  3  4
2  5  7  9
3  6  8  9
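Another option, sketched under the assumption that every set of columns shares its leading letter, is to turn the column names into a (letter, copy-number) MultiIndex and stack the copy level away; the lone c column is then spread within each original row:

tmp = df.copy()
tmp.columns = pd.MultiIndex.from_arrays([
    tmp.columns.str.rstrip('0123456789'),            # letter part: a, b, c
    tmp.columns.str.replace(r'\D', '', regex=True),  # copy number: 1, 2, ''
])
out = tmp.stack(level=1)  # one row per (original row, copy number)
# within each original row, 'c' holds exactly one value; spread it
out['c'] = out['c'].groupby(level=0).transform('max')
out = out.dropna(subset=['a']).reset_index(drop=True).astype(int)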

How to sort dataframe based on column whose entries consist of letters and numbers?

I have a dataframe like this:
import pandas as pd

df = pd.DataFrame(
    {
        'pos': ['A1', 'B03', 'A2', 'B01', 'A3', 'B02'],
        'ignore': range(6)
    }
)

   pos  ignore
0   A1       0
1  B03       1
2   A2       2
3  B01       3
4   A3       4
5  B02       5
Which I would like to sort according to pos, whereby
it should first be sorted by the number and then by the letter, and
leading 0s should be ignored,
so the desired outcome is
   pos  ignore
0   A1       0
1  B01       3
2   A2       2
3  B02       5
4   A3       4
5  B03       1
I currently do it like this:
df[['let', 'num']] = df['pos'].str.extract(
    r'([A-Za-z]+)([0-9]+)'
)
df['num'] = df['num'].astype(int)
df = (
    df.sort_values(['num', 'let'])
      .drop(['let', 'num'], axis=1)
      .reset_index(drop=True)
)
That works, but what I don't like is that I need two temporary columns that I later have to drop again. Is there a more straightforward way of doing it?
You can use argsort with zfill and first sort on the numbers as 01, 02, 03, etc. This way you don't have to assign and drop columns:
val = df['pos'].str.extract(r'(\D+)(\d+)')
df.loc[(val[1].str.zfill(2) + val[0]).argsort()]

   pos  ignore
0   A1       0
3  B01       3
2   A2       2
5  B02       5
4   A3       4
1  B03       1
Here's one way:
import re

def extract_parts(x):
    groups = re.match('([A-Za-z]+)([0-9]+)', x)
    return (int(groups[2]), groups[1])

df.reindex(df.pos.transform(extract_parts).sort_values().index).reset_index(drop=True)
Output
Out[1]:
   pos  ignore
0   A1       0
1  B01       3
2   A2       2
3  B02       5
4   A3       4
5  B03       1
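On newer pandas (the key argument was added to sort_values in 1.1), a sketch that avoids both temporary columns and index gymnastics, reusing the zfill trick from the first answer:

# the key callable receives the 'pos' Series and must return an
# equally-shaped Series to sort by: zero-padded number, then letter
df.sort_values(
    'pos',
    key=lambda s: s.str.extract(r'(\D+)(\d+)')
                   .pipe(lambda p: p[1].str.zfill(2) + p[0]),
    ignore_index=True,
)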

How to merge and update 2 DataFrames

a1 = pd.DataFrame({'A': [1, 2, 3], 'B': [2, 3, 4]})
b2 = pd.DataFrame({'A': [1, 4], 'B': [3, 6]})
and I want to get
c = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [3, 3, 4, 6]})
a1 and b2 merged on the key 'A',
but when 'A' is equal and B differs, take the b2 value.
How can I make this work? I have no idea.
First concatenate both dataframes under each other to get one big dataframe:
c = pd.concat([a1, b2], axis=0)

   A  B
0  1  2
1  2  3
2  3  4
0  1  3
1  4  6
Then group on column A to keep only the unique values of A; by using last you make sure that when there is a duplicate, the value from b2 is used. This gives:
c = c.groupby('A').last()

   B
A
1  3
2  3
3  4
4  6
Then reset the index to get a nice numerical index:
c = c.reset_index()
which returns:
   A  B
0  1  3
1  2  3
2  3  4
3  4  6
To do it all in one go, just enter the following lines of code:
c = pd.concat([a1, b2], axis=0)
c = c.groupby('A').last().reset_index()
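An alternative sketch with combine_first: index both frames by A and let b2 take precedence on overlapping keys (the alignment introduces NaN in B, hence the astype at the end):

# b2 goes first, so its B values win wherever the A keys overlap
c = (b2.set_index('A')
       .combine_first(a1.set_index('A'))
       .reset_index()
       .astype(int))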

How to cut and group by letter in pandas dataframe

A     B
a0    1
a0-2  2
a1    3
a2    4
a2-2  5
a3    6
a4    7
I would like to group into the bins below and take df.B.sum() per bin:
[a0~a0-2)  3
[a1~a1-2)  3
[a2~a2-2)  9
[a3~a3-2)  6
[a4~a4-2)  7
How could this be done?
You can use groupby with a Series created by cut from the second character of column A:
print (df.A.str[1:2].astype(int))
0    0
1    0
2    1
3    2
4    2
5    3
6    4
Name: A, dtype: int32

bins = [-1, 0, 1, 2, 5]
labels = ['[a0~a0-2)', '[a1~a1-2)', '[a2~a2-2)', '[a3~a4-2)']
s = pd.cut(df.A.str[1:2].astype(int), bins=bins, labels=labels)
print (s)
0    [a0~a0-2)
1    [a0~a0-2)
2    [a1~a1-2)
3    [a2~a2-2)
4    [a2~a2-2)
5    [a3~a4-2)
6    [a3~a4-2)
Name: A, dtype: category
Categories (4, object): [[a0~a0-2) < [a1~a1-2) < [a2~a2-2) < [a3~a4-2)]
df = df.groupby(s).B.sum().reset_index()
print (df)
           A   B
0  [a0~a0-2)   3
1  [a1~a1-2)   3
2  [a2~a2-2)   9
3  [a3~a4-2)  13
A solution similar to the other answer, only using the map function:
d = {'a0': '[a0~a0-2)',
     'a1': '[a1~a1-2)',
     'a2': '[a2~a2-2)',
     'a3': '[a3~a4-2)',
     'a4': '[a3~a4-2)'}

df = df.groupby(df.A.str[:2].map(d)).B.sum().reset_index()
print (df)
           A   B
0  [a0~a0-2)   3
1  [a1~a1-2)   3
2  [a2~a2-2)   9
3  [a3~a4-2)  13
You can create a new column with a shortened version of your column and then group on it.
# take only the first two characters into the new column
df['group_col'] = df.A.str[:2]
df.groupby('group_col').B.sum()
Of course you can be creative when creating the group column:
lo = {'a0': 0, 'a1': 1, 'a2': 2, 'a3': 3, 'a4': 3}
df['group_col'] = df.A.str[:2].apply(lambda val: lo[val])
df.groupby('group_col').B.sum()

group_col
0     3
1     3
2     9
3    13
Name: B, dtype: int64
If you'd like to group elements that start with the same letter and number, you can pass a function to groupby like this:
def group_func(i):
    # groupby calls this with each index label; with the default
    # RangeIndex that is also the positional row number
    return df.iloc[i]['A'].split("-")[0]

df.groupby(group_func).B.sum()
Otherwise, if you want to group every two elements:
def group_func(i):
    return i // 2

df.groupby(group_func).B.sum()
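If the goal is exactly the five bins from the question (with a3 and a4 kept separate), a one-line variation of the first idea above is to group on the prefix before the '-':

# 'a0-2' -> 'a0', 'a1' -> 'a1', ...; B is summed within each prefix
df.groupby(df['A'].str.split('-').str[0])['B'].sum()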

Pandas index column by boolean

I want to keep columns that have 'n' or more values.
For example:
> df = pd.DataFrame({'a': [1, 2, 3], 'b': [1, None, 4]})
   a    b
0  1    1
1  2  NaN
2  3    4
3 rows × 2 columns
> df[df.count() == 3]
IndexingError: Unalignable boolean Series key provided
> df[:, df.count() == 3]
TypeError: unhashable type: 'slice'
> df[[k for (k, v) in (df.count() == 3).items() if v]]
   a
0  1
1  2
2  3
Is that the best way to do this? It seems ridiculous.
You can use a conditional list comprehension to generate the columns that exceed your threshold (e.g. 3), then select those columns from the data frame:
# Create sample DataFrame
df = pd.DataFrame({'a': [1, 2, 3, 4, 5],
                   'b': [1, None, 4, None, 2],
                   'c': [5, 4, 3, 2, None]})

>>> df_new = df[[col for col in df if df[col].count() > 3]]
>>> df_new
   a    c
0  1    5
1  2    4
2  3    3
3  4    2
4  5  NaN
Use count to produce a boolean mask and use it to select the columns:
In [10]:
df[df.columns[df.count() > 2]]
Out[10]:
   a
0  1
1  2
2  3
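A closely related idiom: the slicing the question attempted works once it goes through .loc, which accepts a boolean mask over the columns:

# df[:, mask] raises TypeError; df.loc[:, mask] is the supported spelling
df.loc[:, df.count() > 2]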
If you want to keep columns that have more than n values; for my example I am considering n = 4:
df = pd.DataFrame({'a': [1, 2, 3, 4, 6], 'b': [1, None, 4, 5, 7], 'c': [1, 2, 3, 5, 8]})
print(df)
   a    b  c
0  1    1  1
1  2  NaN  2
2  3    4  3
3  4    5  5
4  6    7  8
print(df.iloc[:, [i for i in range(len(df.columns))
                  if len(df.iloc[:, i]) - df.isnull().sum().iloc[i] > 4]])
   a  c
0  1  1
1  2  2
2  3  3
3  4  5
4  6  8
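For completeness, dropna has a thresh parameter that does this directly: keep columns with at least thresh non-null values. On the 5-row frame above, thresh=5 reproduces the a/c result:

# drop any column with fewer than 5 non-null values
df.dropna(axis=1, thresh=5)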
