Pandas join DataFrame and Series over a column - python

I have a Pandas DataFrame df that stores a mapping between labels and integers, and a Pandas Series s that contains a sequence of labels:
print(df)
label id
0 AAAAAAAAA 0
1 BBBBBBBBB 1
2 CCCCCCCCC 2
3 DDDDDDDDD 3
4 EEEEEEEEE 4
print(s)
0 AAAAAAAAA
1 BBBBBBBBB
2 CCCCCCCCC
3 CCCCCCCCC
4 EEEEEEEEE
5 EEEEEEEEE
6 DDDDDDDDD
I want to join this DataFrame and this Series to get the sequence of integers corresponding to my sequence s.
Here is the expected result for my example:
print(df.join(s)["id"])
0 0
1 1
2 2
3 2
4 4
5 4
6 3
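For reference, a minimal sketch reconstructing the example data from the printed output above:
import pandas as pd

df = pd.DataFrame({'label': ['AAAAAAAAA', 'BBBBBBBBB', 'CCCCCCCCC', 'DDDDDDDDD', 'EEEEEEEEE'],
                   'id': [0, 1, 2, 3, 4]})
s = pd.Series(['AAAAAAAAA', 'BBBBBBBBB', 'CCCCCCCCC', 'CCCCCCCCC',
               'EEEEEEEEE', 'EEEEEEEEE', 'DDDDDDDDD'])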

Use Series.map with a Series as the mapper:
print (s.map(df.set_index('label')['id']))
0 0
1 1
2 2
3 2
4 4
5 4
6 3
dtype: int64
Alternative - be careful: with duplicate labels this raises no error but returns the last duplicate row:
print (s.map(dict(zip(df['label'], df['id']))))
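A quick sketch of that duplicate pitfall, using a hypothetical frame where a label repeats: dict(zip(...)) silently keeps the value of the last row, with no warning.
dup = pd.DataFrame({'label': ['AAAAAAAAA', 'AAAAAAAAA'], 'id': [0, 99]})
print(dict(zip(dup['label'], dup['id'])))
{'AAAAAAAAA': 99}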

Related

How to append a specific string according to each value in a string pandas dataframe column?

Let's take these sample dataframes:
df = pd.DataFrame({'Id':['1','2','3','4','5'], 'Value':[9,8,7,6,5]})
Id Value
0 1 9
1 2 8
2 3 7
3 4 6
4 5 5
df_name = pd.DataFrame({'Id':['1','2','4'], 'Name':['Andrew','Jason','John']})
Id Name
0 1 Andrew
1 2 Jason
2 4 John
I would like to add the person's Name (taken from df_name), in brackets, to the Id column of df where it exists. I know how to do this with a for loop over the Id column of df, but that is inefficient on large dataframes. Is there a better way to do this?
Expected output:
Id Value
0 1 (Andrew) 9
1 2 (Jason) 8
2 3 7
3 4 (John) 6
4 5 5
Use Series.map to match the values, add the parentheses, and replace the non-matched values with the original column via Series.fillna:
df['Id'] = ((df['Id'] + ' (' + df['Id'].map(df_name.set_index('Id')['Name']) + ')')
.fillna(df['Id']))
print (df)
Id Value
0 1 (Andrew) 9
1 2 (Jason) 8
2 3 7
3 4 (John) 6
4 5 5
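A merge-based alternative should give the same result, assuming Id is unique in df_name (a sketch starting from the original frames, not from the original answer):
out = df.merge(df_name, on='Id', how='left')
out['Id'] = (out['Id'] + ' (' + out['Name'] + ')').fillna(out['Id'])
out = out.drop(columns='Name')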

return first column number that fulfills a condition in pandas

I have a dataset with several columns of cumulative sums. For every row, I want to return the first column number that satisfies a condition.
Toy example:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.array(range(20)).reshape(4,5).T).cumsum(axis=1)
>>> df
0 1 2 3
0 0 5 15 30
1 1 7 18 34
2 2 9 21 38
3 3 11 24 42
4 4 13 27 46
For instance, I want to return the number of the first column whose value is greater than 20.
Desired output:
3
3
2
2
2
Many thanks as always!
Try with idxmax
df.gt(20).idxmax(1)
Out[66]:
0 3
1 3
2 2
3 2
4 2
dtype: object
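One caveat worth noting: if no value in a row exceeds the threshold, the boolean mask is all False and idxmax still returns the first column label, which is indistinguishable from a real match there. A hedged guard using where:
mask = df.gt(20)
print(mask.idxmax(axis=1).where(mask.any(axis=1)))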
Not as short as @YOBEN_S's answer, but chaining index.get_loc and first_valid_index also works:
df[df>20].apply(lambda x: x.index.get_loc(x.first_valid_index()), axis=1)
0 3
1 3
2 2
3 2
4 2
dtype: int64
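If speed matters on large frames, a NumPy sketch with argmax on the boolean mask returns the positional index directly (with the same all-False caveat as above):
import numpy as np
print(pd.Series(np.argmax(df.to_numpy() > 20, axis=1), index=df.index))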

Assign the frequency of each value to dataframe with new column

I am trying to set up a DataFrame that contains a column called Frequency.
This column should show, for every row, how often that row's value appears in a specific column of the dataframe. Something like this:
Index Category Frequency
0 1 1
1 3 2
2 3 2
3 4 1
4 7 3
5 7 3
6 7 3
7 8 1
This is just an example.
I already tried value_counts(), but I only get the count in the last row where each number appears.
In the case of the example:
Index Category Frequency
0 1 1
1 3 N.A
2 3 2
3 4 1
4 7 N.A
5 7 N.A
6 7 3
7 8 1
It is very important that the column has the same number of rows as the dataframe, preferably appended to the same dataframe.
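For reference, a minimal construction of the example, assuming Index is simply a running counter column:
import pandas as pd

df = pd.DataFrame({'Index': range(8),
                   'Category': [1, 3, 3, 4, 7, 7, 7, 8]})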
df['Frequency'] = df.groupby('Category')['Category'].transform('count')
Use pandas.Series.map:
df['Frequency']=df['Category'].map(df['Category'].value_counts())
or pandas.Series.replace:
df['Frequency']=df['Category'].replace(df['Category'].value_counts())
Output:
Index Category Frequency
0 0 1 1
1 1 3 2
2 2 3 2
3 3 4 1
4 4 7 3
5 5 7 3
6 6 7 3
7 7 8 1
Details
df['Category'].value_counts()
7 3
3 2
4 1
1 1
8 1
Name: Category, dtype: int64
Using value_counts you get a Series whose index holds the category elements and whose values are the counts. You can then use map or pandas.Series.replace to create a Series with the category values replaced by their counts, and finally assign this Series to the Frequency column.
You can also do it using groupby, as below:
df.groupby("Category") \
.apply(lambda g: g.assign(frequency = len(g))) \
.reset_index(level=0, drop=True)

Pandas - split text with values in parenthesis into multiple columns

I have a dataframe column with values as below:
HexNAc(6)Hex(7)Fuc(1)NeuAc(3)
HexNAc(6)Hex(7)Fuc(1)NeuAc(3)
HexNAc(5)Hex(4)NeuAc(1)
HexNAc(6)Hex(7)
I want to split this information into multiple columns:
HexNAc Hex Fuc NeuAc
6 7 1 3
6 7 1 3
5 4 0 1
6 7 0 0
What is the best way to do this?
This can be done with a combination of string splits and explode (pandas version >= 0.25), then a pivot. The rest cleans up some of the columns and fills missing values.
import pandas as pd
s = pd.Series(['HexNAc(6)Hex(7)Fuc(1)NeuAc(3)', 'HexNAc(6)Hex(7)Fuc(1)NeuAc(3)',
'HexNAc(5)Hex(4)NeuAc(1)', 'HexNAc(6)Hex(7)'])
(pd.DataFrame(s.str.split(')').explode().str.split(r'\(', expand=True))
.pivot(columns=0, values=1)
.rename_axis(None, axis=1)
.dropna(how='all', axis=1)
.fillna(0, downcast='infer'))
Fuc Hex HexNAc NeuAc
0 1 7 6 3
1 1 7 6 3
2 0 4 5 1
3 0 7 6 0
Check:
pd.DataFrame(s.str.findall(r'\w+').map(lambda x: dict(zip(x[::2], x[1::2]))).tolist())
Out[207]:
Fuc Hex HexNAc NeuAc
0 1 7 6 3
1 1 7 6 3
2 NaN 4 5 1
3 NaN 7 6 NaN
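Another sketch uses str.extractall to pull every name(number) pair in one pass; the group names name and count are my own labels, not from the original answers:
tmp = s.str.extractall(r'(?P<name>\w+)\((?P<count>\d+)\)')
out = (tmp.reset_index()
          .pivot(index='level_0', columns='name', values='count')
          .rename_axis(None, axis=1)
          .fillna(0)
          .astype(int))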

Slicing a dataframe on column names or alternative column name if they are not available

I am looking for a way to eliminate KeyErrors caused by varying column names in the data that gets loaded. So, for example, I might have columns like:
import numpy as np
import pandas as pd

dummy_df = pd.DataFrame(np.random.randint(0,5,size=(5, 2)), columns=['Test','Test_v2'])
Test Test_v2
0 0 3
1 0 0
2 1 2
3 4 0
4 4 4
How can I do something like
dummy_df[ if_avail('Test') otherwise 'Test_v2']
It would be nice to be able to pass a list, where it checks for existence in item order.
I think you can check the column names and select the first matching column (note this follows the DataFrame's column order; the reindex alternative below follows the order of L):
L = ['Test_v1','Test','Test_v2']
m = dummy_df.columns.isin(L)
first = dummy_df.columns[m].values[0]
s = dummy_df[first]
print (s)
0 3
1 2
2 3
3 0
4 0
Name: Test, dtype: int32
Another solution is:
print (dummy_df.reindex(columns=L).dropna(axis=1, how='all').iloc[:, 0])
0 3
1 2
2 3
3 0
4 0
Name: Test, dtype: int32
Explanation:
First reindex by list of columns names:
print (dummy_df.reindex(columns=L))
Test_v1 Test Test_v2
0 NaN 3 2
1 NaN 2 3
2 NaN 3 1
3 NaN 0 0
4 NaN 0 2
And remove all columns with all NaNs:
print (dummy_df.reindex(columns=L).dropna(axis=1, how='all'))
Test Test_v2
0 3 2
1 2 3
2 3 1
3 0 0
4 0 2
And last select first column by iloc:
print (dummy_df.reindex(columns=L).dropna(axis=1, how='all').iloc[:, 0])
0 3
1 2
2 3
3 0
4 0
Name: Test, dtype: int32
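If this pattern recurs, a tiny helper sketch (the name first_available is hypothetical) keeps the priority order explicit:
def first_available(df, candidates):
    # Return the first column from candidates that exists in df
    for col in candidates:
        if col in df.columns:
            return df[col]
    raise KeyError(f'none of {candidates} found in columns')

s = first_available(dummy_df, ['Test_v1', 'Test', 'Test_v2'])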
