Sum list of columns, even if not all there

I have a dataframe that looks like this
A B C D G
0 9 5 7 6 1
1 1 4 7 3 1
2 8 4 1 3 1
generated by this:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(0, 10, size=(10, 4)), columns=list('ABCD'))
x = np.array([[1, 2]])
df['G'] = np.repeat(x, 5)
Depending on the time frame of the data, a column 'E' sometimes exists and sometimes doesn't.
So sometimes we have
A B C D E G
0 9 5 7 6 2 1
1 1 4 7 3 3 1
2 8 4 1 3 4 1
Either way, I'd like to sum columns A, C, and E, grouped by column G. When column E exists, I just use
df.groupby('G')[['A', 'C', 'E']].sum()
but when E doesn't exist, as in the first dataframe, it doesn't work.
What do I need to do in order to sum even if a column is missing?

You could store the columns you wish to sum in a list, sum_cols = list('ACE'), and then intersect the columns of whatever DataFrame you're working with with this list.
df.groupby('G')[df.columns.intersection(sum_cols)].sum()
Demo
>>> df = pd.DataFrame(np.random.randint(0, 10, (2, 5)),
...                   columns=list('ABCDG'))
>>> df
A B C D G
0 9 5 9 2 6
1 3 1 1 1 3
>>> sum_cols = list('ACE')
>>> df.groupby('G')[df.columns.intersection(sum_cols)].sum()
A C
G
3 3 1
6 9 9
>>> df['E'] = [100, 200]
>>> df.groupby('G')[df.columns.intersection(sum_cols)].sum()
A C E
G
3 3 1 200
6 9 9 100
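The intersection idea above can be wrapped in a small helper. This is only a sketch: the function name sum_present and the sample frames are made up for illustration and are not part of the original answer.

```python
import pandas as pd

def sum_present(df, by, wanted):
    """Group by `by` and sum whichever of the `wanted` columns exist in df."""
    # Index.intersection keeps only the labels found in both collections
    present = list(df.columns.intersection(wanted))
    return df.groupby(by)[present].sum()

# Hypothetical sample data: one frame without column E, one with it
without_e = pd.DataFrame({"A": [9, 1, 8], "C": [7, 7, 1], "G": [1, 1, 2]})
with_e = without_e.assign(E=[2, 3, 4])

res_without = sum_present(without_e, "G", ["A", "C", "E"])
res_with = sum_present(with_e, "G", ["A", "C", "E"])
print(res_without)
print(res_with)
```

The same call works on both frames. The missing column is simply left out of the result rather than raising an error.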

Related

How to add occurrence of each entry to pandas data frame?

Let k be a pandas data frame with a column of letters and a column of integers:
>>> k = pd.DataFrame({
"a": numpy.random.choice([i for i in "abcde"], 10),
"b": numpy.random.choice(range(5), 10)
})
>>> k
a b
0 a 1
1 c 2
2 e 1
3 b 3
4 c 2
5 d 2
6 e 2
7 c 3
8 b 0
9 a 3
Using value_counts(), the counts of the letters are found:
>>> counts = k["a"].value_counts()
>>> counts
c 3
e 2
b 2
a 2
d 1
Name: a, dtype: int64
How can I add each count to its respective row? It should result in
>>> k
a b count
0 a 1 2
1 c 2 3
2 e 1 2
[...]
9 a 3 2
Here's an alternative to using transform:
First, you can extract the value_counts() into a dataframe:
mycounts = k['a'].value_counts().rename_axis('a').reset_index(name = 'counts')
The step above is useful in many different scenarios (and good to know in general).
Then, a left-join will put the value counts into the original dataframe:
k = k.merge(mycounts, left_on = 'a', right_on = 'a', how = 'left')
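As a rough illustration of the two merge steps on fixed data (the sample values are made up so the result is reproducible, and on='a' is used as shorthand for identical left_on/right_on keys):

```python
import pandas as pd

# Hypothetical fixed data instead of random choices
k = pd.DataFrame({"a": list("acebc"), "b": [1, 2, 1, 3, 2]})

# Step 1: value counts as a two-column frame with columns a and counts
mycounts = k["a"].value_counts().rename_axis("a").reset_index(name="counts")

# Step 2: a left join keeps every row of k in its original order
merged = k.merge(mycounts, on="a", how="left")
print(merged)
```

Note that the new column is called counts here, following the answer, rather than count as in the question.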
You can also use transform:
k['count']=k.groupby('a').a.transform('count')
k
Out[330]:
a b count
0 d 1 2
1 e 3 3
2 e 3 3
3 d 3 2
4 b 4 4
5 b 1 4
6 b 0 4
7 a 2 1
8 b 0 4
9 e 4 3
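A small reproducible sketch of the transform approach, next to a map-based variant that is not from the answers above but should give the same column. The sample data is made up.

```python
import pandas as pd

# Hypothetical fixed data instead of random choices
k = pd.DataFrame({"a": list("acebc"), "b": [1, 2, 1, 3, 2]})

# transform('count') broadcasts each group's size back onto its rows
k["count"] = k.groupby("a")["a"].transform("count")

# Variant (an assumption, not from the answers): look up each letter
# in the value_counts Series, which is indexed by letter
via_map = k["a"].map(k["a"].value_counts())
print(k)
```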

map values in a dataframe according to ranges

I have a dataframe df
import pandas
df = pandas.DataFrame(data=[1,2,3,2,2,2,3,3,4,5,10,11,12,1,2,1,1], columns=['codes'])
codes
0 1
1 2
2 3
3 2
4 2
5 2
6 3
7 3
8 4
9 5
10 10
11 11
12 12
13 1
14 2
15 1
16 1
and I would like to group the values in the column codes according to specific rules:
values == 0 become A
values in the range 1 to 4 (inclusive) become B
values == 5 become C
values in the range 6 to 16 (inclusive) become D
Is there a way to keep the logic and the dataframe separate so that it is easy to change the grouping rules in the future?
I would like to avoid writing
df.loc[df['codes'] == 0, 'codes'] = 'A'
df.loc[(df['codes'] >= 1) & (df['codes'] <= 4), 'codes'] = 'B'
The first idea is to use Series.map with merged dictionaries; the second is to use cut with right=False:
df = pd.DataFrame(data=[0,1,2,3,2,2,2,3,3,4,5,10,11,12,16,2,17,1], columns=['codes'])
d1 = {0: 'A', 5:'C'}
d2 = dict.fromkeys(range(1,5), 'B')
d3 = dict.fromkeys(range(6,17), 'D')
d = {**d1, **d2, **d3}
df['codes1'] = df['codes'].map(d)
df['codes2'] = pd.cut(df['codes'], bins=(0,1,5,6,17), labels=list('ABCD'), right=False)
print (df)
codes codes1 codes2
0 0 A A
1 1 B B
2 2 B B
3 3 B B
4 2 B B
5 2 B B
6 2 B B
7 3 B B
8 3 B B
9 4 B B
10 5 C C
11 10 D D
12 11 D D
13 12 D D
14 16 D D
15 2 B B
16 17 NaN NaN
17 1 B B
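To keep the rules apart from the dataframe, as the question asks, the dictionaries could be generated from a single rule table. This is a sketch only: RULES and build_mapping are hypothetical names, and the ranges are assumed to be inclusive integer ranges.

```python
import pandas as pd

# Hypothetical rule table: (low, high, label), both ends inclusive
RULES = [(0, 0, "A"), (1, 4, "B"), (5, 5, "C"), (6, 16, "D")]

def build_mapping(rules):
    """Expand inclusive integer ranges into a value -> label dict."""
    return {v: label
            for low, high, label in rules
            for v in range(low, high + 1)}

df = pd.DataFrame({"codes": [0, 1, 4, 5, 10, 16, 17]})
# Values not covered by any rule (17 here) become NaN
df["group"] = df["codes"].map(build_mapping(RULES))
print(df)
```

Changing the grouping later then only means editing RULES.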

Python - Pandas, split long column to multiple columns

Given the following DataFrame:
>>> pd.DataFrame(data=[['a',1],['a',2],['b',3],['b',4],['c',5],['c',6],['d',7],['d',8],['d',9],['e',10]],columns=['key','value'])
key value
0 a 1
1 a 2
2 b 3
3 b 4
4 c 5
5 c 6
6 d 7
7 d 8
8 d 9
9 e 10
I'm looking for a method that will change the structure based on the key value, like so:
a b c d e
0 1 3 5 7 10
1 2 4 6 8 10 <- 10 is duplicated
2 2 4 6 9 10 <- 10 is duplicated
The number of result rows equals the size of the largest group (d in the above example), and missing values are filled with the last available value in that column.
Create a MultiIndex with set_index, using a counter column from cumcount, reshape with unstack, replace missing values with the last non-missing ones using ffill, and finally convert all data to integers if necessary:
df = df.set_index([df.groupby('key').cumcount(),'key'])['value'].unstack().ffill().astype(int)
Another solution with custom lambda function:
df = (df.groupby('key')['value']
        .apply(lambda x: pd.Series(x.values))
        .unstack(0)
        .ffill()
        .astype(int))
print (df)
key a b c d e
0 1 3 5 7 10
1 2 4 6 8 10
2 2 4 6 9 10
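The set_index one-liner above can be unpacked step by step. The intermediate names (pos, indexed, wide, result) are introduced here only for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    data=[['a', 1], ['a', 2], ['b', 3], ['b', 4], ['c', 5],
          ['c', 6], ['d', 7], ['d', 8], ['d', 9], ['e', 10]],
    columns=['key', 'value'])

# 1. Position of each row within its key group: 0, 1, 0, 1, ...
pos = df.groupby('key').cumcount()

# 2. Index the values by (position, key)
indexed = df.set_index([pos, 'key'])['value']

# 3. Move the key level to the columns; shorter groups get NaN
wide = indexed.unstack()

# 4. Fill each gap with the last value above it, then cast back to int
result = wide.ffill().astype(int)
print(result)
```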
Using pivot with groupby + cumcount:
df.assign(key2=df.groupby('key').cumcount()).pivot(index='key2', columns='key', values='value').ffill().astype(int)
Out[214]:
key a b c d e
key2
0 1 3 5 7 10
1 2 4 6 8 10
2 2 4 6 9 10

Repeating rows of a dataframe based on a column value

I have a data frame like this:
df1 = pd.DataFrame({'a': [1,2],
'b': [3,4],
'c': [6,5]})
df1
Out[150]:
a b c
0 1 3 6
1 2 4 5
Now I want to create a df that repeats each row based on the difference between columns b and c, plus 1. The difference for the first row is 6-3 = 3, so I want to repeat that row 3+1 = 4 times. Similarly, for the second row the difference is 5-4 = 1, so I want to repeat it 1+1 = 2 times. A new column d should then run from b to c within each repeated group (for the first row, 3 -> 6). So I want to get this df:
a b c d
0 1 3 6 3
0 1 3 6 4
0 1 3 6 5
0 1 3 6 6
1 2 4 5 4
1 2 4 5 5
Do it with reindex + repeat, then use groupby cumcount to assign the new column d:
df1.reindex(df1.index.repeat(df1.eval('c-b').add(1))).\
assign(d=lambda x : x.c-x.groupby('a').cumcount(ascending=False))
Out[572]:
a b c d
0 1 3 6 3
0 1 3 6 4
0 1 3 6 5
0 1 3 6 6
1 2 4 5 4
1 2 4 5 5
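The answer above groups by column a, which works when a is unique per original row. A possible variant, sketched here as an assumption rather than as the answerer's method, groups on the duplicated index labels instead and counts upward from b:

```python
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [6, 5]})

# Repeat each index label (c - b + 1) times
repeats = df1['c'] - df1['b'] + 1
out = df1.loc[df1.index.repeat(repeats)]

# cumcount over the duplicated index labels gives 0, 1, 2, ... per
# original row; adding b makes d run from b up to c
out = out.assign(d=out['b'] + out.groupby(level=0).cumcount())
print(out)
```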

Merge and split columns in pandas dataframe

I want to know how to merge multiple columns, and split them again.
Input data
A B C
1 3 5
2 4 6
Merge A, B, C to one column X
X
1
2
3
4
5
6
Process something with X, then split X into A, B, C again. A, B, and C all have the same number of rows (2).
A B C
1 3 5
2 4 6
Is there a simple way to do this?
Start with df:
A B C
0 1 3 5
1 2 4 6
Next, get all values in one column:
df2 = df.unstack().reset_index(drop=True).rename('X').to_frame()
print(df2)
X
0 1
1 2
2 3
3 4
4 5
5 6
And, convert back to original shape:
df3 = pd.DataFrame(df2.values.reshape(2,-1, order='F'), columns=list('ABC'))
print(df3)
A B C
0 1 3 5
1 2 4 6
Setup
df=pd.DataFrame({'A': {0: 1, 1: 2}, 'B': {0: 3, 1: 4}, 'C': {0: 5, 1: 6}})
df
Out[684]:
A B C
0 1 3 5
1 2 4 6
Solution
Merge df to 1 column:
df2 = pd.DataFrame(df.values.flatten('F'), columns=['X'])
df2
Out[686]:
X
0 1
1 2
2 3
3 4
4 5
5 6
Split it back to 3 columns:
pd.DataFrame(df2.values.reshape(-1,3,order='F'),columns=['A','B','C'])
Out[701]:
A B C
0 1 3 5
1 2 4 6
To unwind in the way you'd like, you need to either unstack or ravel with order='F'.
Option 1
def proc1(df):
    v = df.values
    s = v.ravel('F')
    s = s * 2
    return pd.DataFrame(s.reshape(v.shape, order='F'), df.index, df.columns)
proc1(df)
A B C
0 2 6 10
1 4 8 12
Option 2
def proc2(df):
    return df.unstack().mul(2).unstack(0)
proc2(df)
A B C
0 2 6 10
1 4 8 12
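To see why order='F' matters here, this short sketch compares column-major and row-major flattening on the same 2x3 frame. The variable names are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4], 'C': [5, 6]})
v = df.to_numpy()

# 'F' (column-major) walks down each column: A's values, then B's, then C's
col_major = v.ravel(order='F')
# 'C' (row-major, the default) walks across each row instead
row_major = v.ravel(order='C')

# Reshaping with the same order='F' undoes the column-major flattening
restored = pd.DataFrame(col_major.reshape(v.shape, order='F'),
                        index=df.index, columns=df.columns)
print(col_major, row_major)
print(restored)
```

Only the column-major version produces the X column from the question (1 to 6 in order), and the round trip only restores A, B, C if the reshape uses the same order.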
