I have a dataframe that looks like this:
df
Out[42]:
Unnamed: 0 Unnamed: 0.1 Region GeneID DistanceValue
0 25520 25520 Olfactory areas 69835573 -1.000000
1 25521 25521 Olfactory areas 583846 -1.000000
2 25522 25522 Olfactory areas 68667661 -1.000000
3 25523 25523 Olfactory areas 70474965 -1.000000
4 25524 25524 Olfactory areas 68341920 -1.000000
... ... ... ... ...
15662 1072369 1072369 Cerebellum unspecific 74743327 -0.960186
15663 1072370 1072370 Cerebellum unspecific 69530983 -0.960139
15664 1072371 1072371 Cerebellum unspecific 68442853 -0.960129
15665 1072372 1072372 Cerebellum unspecific 74514339 -0.960038
15666 1072373 1072373 Cerebellum unspecific 70724637 -0.960003
[15667 rows x 5 columns]
I want to count the 'GeneID' values and create a new df that only contains the rows whose GeneID appears more than 5 times, so I did:
genelist = df.pivot_table(index=['GeneID'], aggfunc='size')
sort_genelist = genelist.sort_values(axis=0,ascending=False)
sort_genelist
Out[44]:
GeneID
631707 11
68269286 10
633269 10
70302366 9
74357905 9
..
70784714 1
70784824 1
70784898 1
70784916 1
70528527 1
Length: 7875, dtype: int64
So now I want my df dataframe to contain just the rows with the IDs that were counted more than 5 times.
Use Series.isin to build a mask from the index values of sort_genelist where the count is greater than 5, and filter by boolean indexing:
df = df[df['GeneID'].isin(sort_genelist.index[sort_genelist > 5])]
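For reference, here is a minimal sketch of the same filtering built on value_counts instead of pivot_table (the names counts and df_filtered are just for illustration):
# count how often each GeneID occurs
counts = df['GeneID'].value_counts()
# keep only the rows whose GeneID occurs more than 5 times
df_filtered = df[df['GeneID'].isin(counts[counts > 5].index)]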
I think that the best way to do what you have asked is:
df['gene_id_count'] = df.groupby('GeneID')['GeneID'].transform('size')
df.loc[df['gene_id_count'] > 5, :]
Let's take this tiny example:
>>> df = pd.DataFrame({'GeneID': [1,1,1,3,4,5,5,4], 'ID': range(8)})
>>> df
GeneID ID
0 1 0
1 1 1
2 1 2
3 3 3
4 4 4
5 5 5
6 5 6
7 4 7
And consider a threshold of 2 occurrences (instead of 5):
min_gene_id_count = 2
>>> df['gene_id_count'] = df.groupby('GeneID')['GeneID'].transform('size')
>>> df
GeneID ID gene_id_count
0 1 0 3
1 1 1 3
2 1 2 3
3 3 3 1
4 4 4 2
5 5 5 2
6 5 6 2
7 4 7 2
>>> df.loc[df['gene_id_count'] > min_gene_id_count, :]
GeneID ID gene_id_count
0 1 0 3
1 1 1 3
2 1 2 3
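If you prefer not to keep the helper column at all, the same rows can be selected in one step with groupby().filter(), a sketch using the min_gene_id_count defined above:
>>> df.groupby('GeneID').filter(lambda g: len(g) > min_gene_id_count)
GeneID ID gene_id_count
0 1 0 3
1 1 1 3
2 1 2 3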
I created this dataframe and calculated the gap I was looking for, but the problem is that some flats have the same price, so I get a price difference of 0. How could I replace the 0 with the difference to the next lower price within the same group?
for example:
neighborhood: a, bed: 1, bath: 1, price: 5
neighborhood: a, bed: 1, bath: 1, price: 5
neighborhood: a, bed: 1, bath: 1, price: 3
neighborhood: a, bed: 1, bath: 1, price: 2
I get price differences of 0, 2, 1, NaN, but I'm looking for 2, 2, 1, NaN (in short, I don't want to compare two flats with the same price).
Thanks in advance and good day.
data=[
[1,'a',1,1,5],[2,'a',1,1,5],[3,'a',1,1,4],[4,'a',1,1,2],[5,'b',1,2,6],[6,'b',1,2,6],[7,'b',1,2,3]
]
df = pd.DataFrame(data, columns = ['id','neighborhoodname', 'beds', 'baths', 'price'])
df['difference_price'] = (df.dropna()
                            .sort_values('price', ascending=False)
                            .groupby(['neighborhoodname', 'beds', 'baths'])['price'].diff(-1))
I think you can first remove duplicates across all columns used for the groupby with diff, create the new column in the filtered data, and finally merge it back to the original with a left join:
df1 = (df.dropna()
         .sort_values('price', ascending=False)
         .drop_duplicates(['neighborhoodname', 'beds', 'baths', 'price']))
df1['difference_price'] = df1.groupby(['neighborhoodname','beds','baths'])['price'].diff(-1)
df = df.merge(df1[['neighborhoodname','beds','baths','price', 'difference_price']], how='left')
print (df)
id neighborhoodname beds baths price difference_price
0 1 a 1 1 5 1.0
1 2 a 1 1 5 1.0
2 3 a 1 1 4 2.0
3 4 a 1 1 2 NaN
4 5 b 1 2 6 3.0
5 6 b 1 2 6 3.0
6 7 b 1 2 3 NaN
Or you can use a lambda function to back fill the 0 values per group, which avoids wrong outputs for one-row groups (where the merge could pull data from another group):
import numpy as np

df['difference_price'] = (df.sort_values('price', ascending=False)
                            .groupby(['neighborhoodname', 'beds', 'baths'], group_keys=False)['price']
                            .apply(lambda x: x.diff(-1).replace(0, np.nan).bfill()))
print (df)
id neighborhoodname beds baths price difference_price
0 1 a 1 1 5 1.0
1 2 a 1 1 5 1.0
2 3 a 1 1 4 2.0
3 4 a 1 1 2 NaN
4 5 b 1 2 6 3.0
5 6 b 1 2 6 3.0
6 7 b 1 2 3 NaN
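For reference, a version-robust sketch of the same back fill idea written with mask and a grouped bfill instead of apply (the intermediate names tmp and gap are just for illustration):
tmp = df.sort_values('price', ascending=False)
# gap to the next lower price within each group; equal prices give 0
gap = tmp.groupby(['neighborhoodname', 'beds', 'baths'])['price'].diff(-1)
# turn the 0 gaps into NaN and back fill them within the same group
df['difference_price'] = (gap.mask(gap == 0)
                             .groupby([tmp['neighborhoodname'], tmp['beds'], tmp['baths']])
                             .bfill())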
I have a pandas DataFrame like the one below.
Input DataFrame
id ratio
0 1 5.00%
1 2 9.00%
2 3 6.00%
3 2 13.00%
4 1 19.00%
5 4 30.00%
6 3 5.5%
7 2 22.00%
How can I then group it like this:
id ratio
0 1 5.00%
4 1 19.00%
6 3 5.5%
2 3 6.00%
1 2 9.00%
3 2 13.00%
7 2 22.00%
5 4 30.00%
So essentially it first looks at the ratios, takes the lowest one, and groups together the other rows with the same id. Then it looks for the second lowest remaining ratio and groups that id's rows next, and so on.
First convert your ratio column to a numeric rank.
Then we get the lowest rank per id group using groupby.
Finally we sort based on that group rank and the numeric rank, and drop the helper columns.
df['ratio_num'] = df['ratio'].str[:-1].astype(float).rank()
df['rank'] = df.groupby('id')['ratio_num'].transform('min')
df = df.sort_values(['rank', 'ratio_num']).drop(columns=['rank', 'ratio_num'])
id ratio
0 1 5.00%
4 1 19.00%
6 3 5.5%
2 3 6.00%
1 2 9.00%
3 2 13.00%
7 2 22.00%
5 4 30.00%
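A compact sketch of the same idea that sorts on the raw numeric values instead of ranks (the helper names num, order and df_sorted are assumptions for illustration):
num = df['ratio'].str.rstrip('%').astype(float)
order = num.groupby(df['id']).transform('min')   # lowest ratio per id
df_sorted = (df.assign(order=order, num=num)
               .sort_values(['order', 'num'])
               .drop(columns=['order', 'num']))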
With the help of pd.Categorical:
d = {'id':[1, 2, 3, 2, 1, 4, 3, 2],
'ratio': ['5.00%', '9.00%', '6.00%', '13.00%', '19.00%', '30.00%', '5.5%', '22.00%']}
df = pd.DataFrame(d)
df['ratio_'] = df['ratio'].map(lambda x: float(x[:-1]))
df['id'] = pd.Categorical(df['id'],
                          categories=(df.sort_values(['id', 'ratio_'])
                                        .groupby('id').head(1)
                                        .sort_values(['ratio_', 'id'])['id']),
                          ordered=True)
print(df.sort_values(['id', 'ratio_']).drop('ratio_', axis=1))
Prints:
id ratio
0 1 5.00%
4 1 19.00%
6 3 5.5%
2 3 6.00%
1 2 9.00%
3 2 13.00%
7 2 22.00%
5 4 30.00%
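As a quick usage check, the resulting category order of id can be inspected; for this data it should follow the lowest ratio per id:
print(df['id'].cat.categories.tolist())   # [1, 3, 2, 4]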
I have a dataframe column with values as below:
HexNAc(6)Hex(7)Fuc(1)NeuAc(3)
HexNAc(6)Hex(7)Fuc(1)NeuAc(3)
HexNAc(5)Hex(4)NeuAc(1)
HexNAc(6)Hex(7)
I want to split this information into multiple columns:
HexNAc Hex Fuc NeuAc
6 7 1 3
6 7 1 3
5 4 0 1
6 7 0 0
What is the best way to do this?
This can be done with a combination of string splits and explode (pandas >= 0.25), then a pivot. The rest cleans up some of the columns and fills missing values.
import pandas as pd
s = pd.Series(['HexNAc(6)Hex(7)Fuc(1)NeuAc(3)', 'HexNAc(6)Hex(7)Fuc(1)NeuAc(3)',
'HexNAc(5)Hex(4)NeuAc(1)', 'HexNAc(6)Hex(7)'])
(pd.DataFrame(s.str.split(')').explode().str.split(r'\(', expand=True))
   .pivot(columns=0, values=1)
   .rename_axis(None, axis=1)
   .dropna(how='all', axis=1)
   .fillna(0, downcast='infer'))
Fuc Hex HexNAc NeuAc
0 1 7 6 3
1 1 7 6 3
2 0 4 5 1
3 0 7 6 0
Check this alternative with findall, which builds a dict per row:
pd.DataFrame(s.str.findall(r'\w+').map(lambda x: dict(zip(x[::2], x[1::2]))).tolist())
Out[207]:
Fuc Hex HexNAc NeuAc
0 1 7 6 3
1 1 7 6 3
2 NaN 4 5 1
3 NaN 7 6 NaN
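The findall version leaves NaN where a component is missing and keeps the counts as strings; to match the first output exactly, a minimal follow-up sketch (out is just an illustrative name):
out = (pd.DataFrame(s.str.findall(r'\w+')
                     .map(lambda x: dict(zip(x[::2], x[1::2])))
                     .tolist())
         .fillna(0)
         .astype(int))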
I have a data frame where there are several groups of numeric series where the values are cumulative. Consider the following:
df = pd.DataFrame({'Cat': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
                   'Indicator': [1, 2, 3, 4, 1, 2, 3, 4],
                   'Cumulative1': [1, 3, 6, 7, 2, 4, 6, 9],
                   'Cumulative2': [1, 3, 4, 6, 1, 5, 7, 12]})
In [74]:df
Out[74]:
Cat Cumulative1 Cumulative2 Indicator
0 A 1 1 1
1 A 3 3 2
2 A 6 4 3
3 A 7 6 4
4 B 2 1 1
5 B 4 5 2
6 B 6 7 3
7 B 9 12 4
I need to create discrete series for Cumulative1 and Cumulative2, with starting point being the earliest entry in 'Indicator'.
My approach is to use diff():
In[82]: df['Discrete1'] = df.groupby('Cat')['Cumulative1'].diff()
Out[82]: df
Cat Cumulative1 Cumulative2 Indicator Discrete1
0 A 1 1 1 NaN
1 A 3 3 2 2.0
2 A 6 4 3 3.0
3 A 7 6 4 1.0
4 B 2 1 1 NaN
5 B 4 5 2 2.0
6 B 6 7 3 2.0
7 B 9 12 4 3.0
I have 3 questions:
1. How do I avoid the NaN in an elegant/Pythonic way? The correct values are to be found in the original Cumulative series.
2. How do I elegantly apply this computation to all series, say cols = ['Cumulative1', 'Cumulative2']?
3. I have a lot of data that needs this computation -- is this the most efficient way?
You do not want to avoid NaNs, you want to fill them with the start values from the "cumulative" column:
df['Discrete1'] = df['Discrete1'].combine_first(df['Cumulative1'])
To apply the operation to all (or selected) columns, broadcast it to the columns of interest:
sources = ['Cumulative1', 'Cumulative2']
targets = ["Discrete" + x[len('Cumulative'):] for x in sources]
df[targets] = df.groupby('Cat')[sources].diff()
You still have to fill the NaNs in a loop:
for s, t in zip(sources, targets):
    df[t] = df[t].combine_first(df[s])
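As a sketch of a fully vectorized variant for the second and third questions, the loop can be replaced with a single fillna on the renamed cumulative block (assuming the sources and targets lists from above):
df[targets] = df.groupby('Cat')[sources].diff()
# fill the per-group NaN start rows with the corresponding cumulative values
df[targets] = df[targets].fillna(df[sources].rename(columns=dict(zip(sources, targets))))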
I have a dataframe with three columns. ID is the identifier on which I want to use the groupby command. I want to generate the correlation coefficient between A and B for every ID. This is what my dataframe looks like:
ID A B
1 5 7
1 3 4
2 4 5
2 7 6
2 9 1
I want to convert this into the following data frame:
ID A B Corr_Coeff
1 5 7 <Value 1>
1 3 4 <Value 1>
2 4 5 <Value 2>
2 7 6 <Value 2>
2 9 1 <Value 2>
This is the code I am currently using, but it does not seem to work:
df['Corr_Coeff'] = df.groupby('ID')[['A','B']].corr()
Would be great if somebody could help me out here! Thanks in advance.
I believe you need map after selecting rows by position with iloc; to remove the MultiIndex, use reset_index:
df1 = df.groupby('ID')[['A','B']].corr()
print (df1)
A B
ID
1 A 1.000000 1.000000
B 1.000000 1.000000
2 A 1.000000 -0.675845
B -0.675845 1.000000
df['corr'] = df['ID'].map(df1.iloc[0::2, 1].reset_index(level=1, drop=True))
print (df)
ID A B corr
0 1 5 7 1.000000
1 1 3 4 1.000000
2 2 4 5 -0.675845
3 2 7 6 -0.675845
4 2 9 1 -0.675845
An alternative for creating the mapped Series is corrwith; finally, convert the one-column DataFrame to a Series with DataFrame.squeeze:
s = (df[['A']].groupby(df['ID']).corrwith(df['B'])).squeeze()
print(s)
ID
1 1.000000
2 -0.675845
Name: A, dtype: float64
df['corr'] = df['ID'].map(s)
print (df)
ID A B corr
0 1 5 7 1.000000
1 1 3 4 1.000000
2 2 4 5 -0.675845
3 2 7 6 -0.675845
4 2 9 1 -0.675845
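For completeness, a compact sketch of the same per-ID mapping using a plain groupby().apply, which is not from the original answers but should give the same result:
# correlation of A and B within each ID group, then mapped back to every row
s = df.groupby('ID')[['A', 'B']].apply(lambda g: g['A'].corr(g['B']))
df['corr'] = df['ID'].map(s)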