Pandas: Group By Elements of a Column - python

Looking for assistance to group by elements of a column in a Pandas df.
Original df:
Country Feature Number
0 US A 1
1 DE A 2
2 FR A 3
3 US B 0
4 DE B 5
5 FR B 7
6 US C 9
7 DE C 0
8 FR C 1
Desired df:
Country A B C
0 US 1 0 9
1 DE 2 5 0
2 FR 3 7 1
I'm not sure whether group by is the best choice or whether I should create a dictionary. Thanks in advance for your help!

You could use pivot_table for that:
In [39]: df.pivot_table(index='Country', columns='Feature')
Out[39]:
Number
Feature A B C
Country
DE 2 5 0
FR 3 7 1
US 1 0 9
If you want your index to be 0, 1, 2 you could use reset_index
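As a minimal sketch of that last step, assuming the sample data from the question (the name `wide` is only illustrative, and `aggfunc="sum"` is chosen here on the assumption that each Country/Feature pair occurs exactly once):

```python
import pandas as pd

df = pd.DataFrame({
    "Country": ["US", "DE", "FR"] * 3,
    "Feature": ["A"] * 3 + ["B"] * 3 + ["C"] * 3,
    "Number": [1, 2, 3, 0, 5, 7, 9, 0, 1],
})

wide = (
    df.pivot_table(index="Country", columns="Feature", values="Number", aggfunc="sum")
      .rename_axis(None, axis=1)   # drop the "Feature" label on the columns
      .reset_index()               # turn Country back into a column with a 0, 1, 2 index
)
print(wide)
```

Note that the rows come out sorted by Country (DE, FR, US) rather than in their original order.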
EDIT
If your Number column actually holds strings rather than numbers, you can convert it with astype or with pd.to_numeric:
df.Number = df.Number.astype(float)
or:
df.Number = pd.to_numeric(df.Number)
Note: pd.to_numeric is available only for pandas >= 0.17.0
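If some entries cannot be parsed as numbers, pd.to_numeric also accepts errors="coerce", which turns them into NaN instead of raising. A small sketch with made-up data (the "n/a" entry is an assumption for illustration):

```python
import pandas as pd

df = pd.DataFrame({"Number": ["1", "2", "n/a"]})

# Unparseable strings become NaN, so the column ends up as float
df["Number"] = pd.to_numeric(df["Number"], errors="coerce")
print(df["Number"].tolist())
```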


Count level 1 size per level 0 in multi index and add new column

What is a pythonic way of counting the level 1 size per level 0 in a MultiIndex and creating a new column (named counts)? I can achieve this in the following way, but would like to understand any simpler approaches:
Code
df = pd.DataFrame({'STNAME':['AL'] * 3 + ['MI'] * 4,
'CTYNAME':list('abcdefg'),
'COL': range(7) }).set_index(['STNAME','CTYNAME'])
print(df)
COL
STNAME CTYNAME
AL a 0
b 1
c 2
MI d 3
e 4
f 5
g 6
df1 = df.groupby(level=0).size().reset_index(name='count')
counts = df.merge(df1,left_on="STNAME",right_on="STNAME")["count"].values
df["counts"] = counts
This is the desired output:
COL counts
STNAME CTYNAME
AL a 0 3
b 1 3
c 2 3
MI d 3 4
e 4 4
f 5 4
g 6 4
You can use groupby.transform with size here instead of merging:
output = df.assign(Counts=df.groupby(level=0)['COL'].transform('size'))
print(output)
COL Counts
STNAME CTYNAME
AL a 0 3
b 1 3
c 2 3
MI d 3 4
e 4 4
f 5 4
g 6 4
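To see why transform fits here, it may help to compare it with size() on the same grouping. A small sketch using the frame from the question (`per_state` and `per_row` are illustrative names):

```python
import pandas as pd

df = pd.DataFrame({
    "STNAME": ["AL"] * 3 + ["MI"] * 4,
    "CTYNAME": list("abcdefg"),
    "COL": range(7),
}).set_index(["STNAME", "CTYNAME"])

grouped = df.groupby(level="STNAME")["COL"]
per_state = grouped.size()             # one value per state
per_row = grouped.transform("size")    # one value per original row, same index as df

df["counts"] = per_row
print(df)
```

Because `per_row` already shares the index of df, it can be assigned directly without a merge.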

Pandas - split text with values in parenthesis into multiple columns

I have a dataframe column with values as below:
HexNAc(6)Hex(7)Fuc(1)NeuAc(3)
HexNAc(6)Hex(7)Fuc(1)NeuAc(3)
HexNAc(5)Hex(4)NeuAc(1)
HexNAc(6)Hex(7)
I want to split this information into multiple columns:
HexNAc Hex Fuc NeuAc
6 7 1 3
6 7 1 3
5 4 0 1
6 7 0 0
What is the best way to do this?
This can be done with a combination of string splits and explode (pandas version >= 0.25), then pivot. The remaining steps clean up the columns and fill missing values.
import pandas as pd
s = pd.Series(['HexNAc(6)Hex(7)Fuc(1)NeuAc(3)', 'HexNAc(6)Hex(7)Fuc(1)NeuAc(3)',
'HexNAc(5)Hex(4)NeuAc(1)', 'HexNAc(6)Hex(7)'])
(pd.DataFrame(s.str.split(')').explode().str.split(r'\(', expand=True))
.pivot(columns=0, values=1)
.rename_axis(None, axis=1)
.dropna(how='all', axis=1)
.fillna(0, downcast='infer'))
Fuc Hex HexNAc NeuAc
0 1 7 6 3
1 1 7 6 3
2 0 4 5 1
3 0 7 6 0
Alternatively, use str.findall and build one dict per row:
pd.DataFrame(s.str.findall(r'\w+').map(lambda x : dict(zip(x[::2], x[1::2]))).tolist())
Out[207]:
Fuc Hex HexNAc NeuAc
0 1 7 6 3
1 1 7 6 3
2 NaN 4 5 1
3 NaN 7 6 NaN
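The findall result above leaves NaN where a component is absent and keeps the counts as strings. A possible variant, sketched under the assumption that every entry has the form Name(digits), converts the counts to int and fills the gaps with 0:

```python
import pandas as pd

s = pd.Series(['HexNAc(6)Hex(7)Fuc(1)NeuAc(3)', 'HexNAc(6)Hex(7)Fuc(1)NeuAc(3)',
               'HexNAc(5)Hex(4)NeuAc(1)', 'HexNAc(6)Hex(7)'])

# Each row becomes a list of (name, count) tuples, e.g. [("HexNAc", "6"), ("Hex", "7")]
pairs = s.str.findall(r"(\w+)\((\d+)\)")

out = (
    pd.DataFrame([{name: int(n) for name, n in row} for row in pairs], index=s.index)
      .fillna(0)      # components missing from a row count as 0
      .astype(int)    # NaN forced the columns to float, so cast back
)
print(out)
```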

Increment dataframe column based on condition

I have a dataframe and I want to create a new column based on a condition on a different column. The new column "ans" should start at 1 and be incremented based on the column "ix": when consecutive values in "ix" are the same, "ans" stays the same, and when the value changes, "ans" is incremented.
Thank you for your answers. I am new to Python, so I am not sure how to do this.
index ix
1 pa
2 pa
3 pa
4 pe
5 fc
6 pb
7 pb
8 df
This should result in:
index ix ans
1 pa 1
2 pa 1
3 pa 1
4 pe 2
5 fc 3
6 pb 4
7 pb 4
8 df 5
In [47]: df['ans'] = (df['ix'] != df['ix'].shift(1)).cumsum()
In [48]: df
Out[48]:
index ix ans
0 1 pa 1
1 2 pa 1
2 3 pa 1
3 4 pe 2
4 5 fc 3
5 6 pb 4
6 7 pb 4
7 8 df 5
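Broken into steps, the one-liner compares each row with the previous one and then counts the changes. A sketch on the question's data (`changed` is an illustrative name):

```python
import pandas as pd

df = pd.DataFrame({"ix": ["pa", "pa", "pa", "pe", "fc", "pb", "pb", "df"]})

# True wherever the value differs from the row above; the first row has no
# predecessor (shift gives a missing value there), so it also compares as True
changed = df["ix"] != df["ix"].shift()

# Summing the True values cumulatively numbers each run of equal values
df["ans"] = changed.cumsum()
print(df)
```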

Pandas merge on aggregated columns

Let's say I create a DataFrame:
import pandas as pd
df = pd.DataFrame({"a": [1,2,3,13,15], "b": [4,5,6,6,6], "c": ["wish", "you","were", "here", "here"]})
Like so:
a b c
0 1 4 wish
1 2 5 you
2 3 6 were
3 13 6 here
4 15 6 here
... and then group and aggregate by a couple columns ...
gb = df.groupby(['b','c']).agg({"a": lambda x: x.nunique()})
Yielding the following result:
a
b c
4 wish 1
5 you 1
6 here 2
were 1
Is it possible to merge df with the newly aggregated table gb such that I create a new column in df, containing the corresponding values from gb? Like this:
a b c nc
0 1 4 wish 1
1 2 5 you 1
2 3 6 were 1
3 13 6 here 2
4 15 6 here 2
I tried doing the simplest thing:
df.merge(gb, on=['b','c'])
But this gives the error:
KeyError: 'b'
Which makes sense because the grouped table has a Multi-index and b is not a column. So my question is two-fold:
Can I transform the multi-index of the gb DataFrame back into columns (so that it has the b and c column)?
Can I merge df with gb on the column names?
Whenever you want to add an aggregated column from a groupby operation back to the df, you should use transform. It produces a Series whose index is aligned with your original df:
In [4]:
df['nc'] = df.groupby(['b','c'])['a'].transform(pd.Series.nunique)
df
Out[4]:
a b c nc
0 1 4 wish 1
1 2 5 you 1
2 3 6 were 1
3 13 6 here 2
4 15 6 here 2
There is no need to reset the index or perform an additional merge.
There's a simple way of doing this using reset_index().
df.merge(gb.reset_index(), on=['b','c'])
gives you
a_x b c a_y
0 1 4 wish 1
1 2 5 you 1
2 3 6 were 1
3 13 6 here 2
4 15 6 here 2
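To avoid the a_x / a_y suffixes, the aggregated column can be given its own name before merging. A sketch using named aggregation (available in pandas >= 0.25); how="left" is an assumption made here to keep the row order of df:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 13, 15],
                   "b": [4, 5, 6, 6, 6],
                   "c": ["wish", "you", "were", "here", "here"]})

# Name the aggregated column "nc" up front, then turn the b/c index into columns
gb = df.groupby(["b", "c"]).agg(nc=("a", "nunique")).reset_index()

merged = df.merge(gb, on=["b", "c"], how="left")
print(merged)
```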

Adding a column to pandas data frame fills it with NA

I have this pandas dataframe:
SourceDomain 1 2 3
0 www.theguardian.com profile.theguardian.com 1 Directed
1 www.theguardian.com membership.theguardian.com 2 Directed
2 www.theguardian.com subscribe.theguardian.com 3 Directed
3 www.theguardian.com www.google.co.uk 4 Directed
4 www.theguardian.com jobs.theguardian.com 5 Directed
I would like to add a new column which is a pandas series created like this:
Weights = Weights.value_counts()
However, when I try to add the new column using edgesFile[4] = Weights it fills it with NA instead of the values:
SourceDomain 1 2 3 4
0 www.theguardian.com profile.theguardian.com 1 Directed NaN
1 www.theguardian.com membership.theguardian.com 2 Directed NaN
2 www.theguardian.com subscribe.theguardian.com 3 Directed NaN
3 www.theguardian.com www.google.co.uk 4 Directed NaN
4 www.theguardian.com jobs.theguardian.com 5 Directed NaN
How can I add the new column keeping the values?
Thanks!
Dani
You are getting NaNs because the index of Weights does not match up with the index of edgesFile. If you want Pandas to ignore Weights.index and just paste the values in order then pass the underlying NumPy array instead:
edgesFile[4] = Weights.values
Here is an example which demonstrates the difference:
In [14]: df = pd.DataFrame(np.arange(4)*10, index=list('ABCD'))
In [15]: df
Out[15]:
0
A 0
B 10
C 20
D 30
In [16]: s = pd.Series(np.arange(4), index=list('CDEF'))
In [17]: s
Out[17]:
C 0
D 1
E 2
F 3
dtype: int64
Here we see Pandas aligning the index:
In [18]: df[4] = s
In [19]: df
Out[19]:
0 4
A 0 NaN
B 10 NaN
C 20 0
D 30 1
Here, Pandas simply pastes the values in s into the column:
In [20]: df[4] = s.values
In [21]: df
Out[21]:
0 4
A 0 0
B 10 1
C 20 2
D 30 3
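Note that pasting .values only makes sense when the lengths match and the order of the values is meaningful. If the goal is to attach to each row how often its value occurs (an assumption, since the question does not show how Weights was built), mapping a column through its value_counts keeps the counts lined up. The frame and the column name `target` below are hypothetical:

```python
import pandas as pd

edges = pd.DataFrame({"target": ["a.com", "b.com", "a.com", "c.com"]})

# value_counts is indexed by the distinct values, not by row number
weights = edges["target"].value_counts()

# map looks up each row's value in that index, so every row gets its own count
edges["weight"] = edges["target"].map(weights)
print(edges)
```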
Here is a small example related to your question.
You can add a new column to an existing DataFrame by assigning a Series to a column name:
>>> import pandas as pd
>>> from pandas import DataFrame, Series
>>> df = DataFrame([[1,2,3],[4,5,6]], columns = ['A', 'B', 'C'])
>>> df
A B C
0 1 2 3
1 4 5 6
>>> s = Series([7,8])
>>> s
0 7
1 8
dtype: int64
>>> df['D']=s
>>> df
A B C D
0 1 2 3 7
1 4 5 6 8
Or you can make a DataFrame from the Series and then concat them:
>>> df = DataFrame([[1,2,3],[4,5,6]])
>>> df
0 1 2
0 1 2 3
1 4 5 6
>>> s = DataFrame(Series([7,8])) # no column name is given here, so the default name is 0; pass columns=['4'] to name the column '4'
>>> s
0
0 7
1 8
>>> df = pd.concat([df,s], axis=1)
>>> df
0 1 2 0
0 1 2 3 7
1 4 5 6 8
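When concatenating, giving the Series a name avoids the duplicate column label 0 seen above. A small sketch (the name "D" is arbitrary):

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=["A", "B", "C"])
s = pd.Series([7, 8], name="D")   # the Series name becomes the column label

result = pd.concat([df, s], axis=1)
print(result)
```

As with column assignment, concat aligns on the index, so this relies on df and s sharing the default 0, 1 index.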
Hope this will help
