Pandas: conditionally copying a cell value - python

Working with a Pandas DataFrame, I am trying to copy data from one cell into another cell only if the recipient cell contains a specific value. The transfer should go from:
Col1 Col2
0 4 X
1 2 5
2 1 X
3 7 8
4 12 20
5 3 X
And the result should be
Col1 Col2
0 4 4
1 2 5
2 1 1
3 7 8
4 12 20
5 3 3
Is there an elegant or simple solution I am missing?

df.Col2 = df.Col1.where(df.Col2 == 'X', df.Col2)

import pandas as pd
import numpy as np
df.Col2 = np.where(df.Col2 == 'specific value', df.Col1, df.Col2)

Using pandas.DataFrame.ffill:
>>> df.replace('X', np.nan, inplace=True)
>>> df.ffill(axis=1)
Col1 Col2
0 4 4
1 2 5
2 1 1
3 7 8
4 12 20
5 3 3
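For comparison, here is a minimal sketch that runs the where and np.where answers on the question's sample data, plus the mirror-image Series.mask call. It assumes Col2 is an object column mixing integers with the string 'X'; if your column holds only strings, the untouched cells stay strings. The names is_x, via_where, via_numpy and via_mask are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Col1": [4, 2, 1, 7, 12, 3],
    "Col2": ["X", 5, "X", 8, 20, "X"],  # assumed: mixed ints and the marker 'X'
})

# Boolean mask of the cells that should receive the value from Col1.
is_x = df["Col2"] == "X"

# Series.where keeps Col1 where the mask is True and falls back to Col2 elsewhere.
via_where = df["Col1"].where(is_x, df["Col2"])

# np.where returns a plain array, so wrap it to keep the original index.
via_numpy = pd.Series(np.where(is_x, df["Col1"], df["Col2"]), index=df.index)

# Series.mask is the mirror image: replace Col2 where the mask is True.
via_mask = df["Col2"].mask(is_x, df["Col1"])

print(via_mask.tolist())
```

All three produce the same values; which one reads best is mostly a matter of taste.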


How to replace rows with a single row based on a condition

I want to replace all rows that have "A" in the name column with a single row from another DataFrame.
This is what I have:
data={"col1":[2,3,4,5,7],
"col2":[4,2,4,6,4],
"col3":[7,6,9,11,2],
"col4":[14,11,22,8,5],
"name":["A","A","V","A","B"],
"n_roll":[8,2,1,3,9]}
df=pd.DataFrame.from_dict(data)
df
This is my single row (the other DataFrame):
data2={"col1":[0]
,"col2":[1]
,"col3":[5]
,"col4":[6]
}
df2=pd.DataFrame.from_dict(data2)
df2
This is how I want it to look:
data={"col1":[0,0,4,0,7],
"col2":[1,1,4,1,4],
"col3":[5,5,9,5,2],
"col4":[6,6,22,6,5],
"name":["A","A","V","A","B"],
"n_roll":[8,2,1,3,9]}
df=pd.DataFrame.from_dict(data)
df
I tried df.loc[df["name"]=="A"][df2.columns]=df2
but it did not work.
We can try mask + combine_first
df = df.mask(df['name'].eq('A'), df2.loc[0], axis=1).combine_first(df)
df
col1 col2 col3 col4 name n_roll
0 0 1 5 6 A 8.0
1 0 1 5 6 A 2.0
2 4 4 9 22 V 1.0
3 0 1 5 6 A 3.0
4 7 4 2 5 B 9.0
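df.loc[df["name"]=="A"][df2.columns]=df2 is chained indexing: the assignment lands on a temporary copy rather than on df, so it is not expected to work. For details, see the pandas documentation on returning a view versus a copy.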
You can also use boolean indexing like this:
df.loc[df['name']=='A', df2.columns] = df2.values
Output:
col1 col2 col3 col4 name n_roll
0 0 1 5 6 A 8
1 0 1 5 6 A 2
2 4 4 9 22 V 1
3 0 1 5 6 A 3
4 7 4 2 5 B 9
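As a self-contained sketch of the boolean-indexing answer, the snippet below rebuilds the question's two frames and does the replacement in a single .loc call. The variable name rows is illustrative, and df2.to_numpy() is used here as the equivalent of df2.values.

```python
import pandas as pd

df = pd.DataFrame({
    "col1": [2, 3, 4, 5, 7],
    "col2": [4, 2, 4, 6, 4],
    "col3": [7, 6, 9, 11, 2],
    "col4": [14, 11, 22, 8, 5],
    "name": ["A", "A", "V", "A", "B"],
    "n_roll": [8, 2, 1, 3, 9],
})
df2 = pd.DataFrame({"col1": [0], "col2": [1], "col3": [5], "col4": [6]})

# Select rows and columns in ONE .loc call so the assignment hits df itself.
rows = df["name"] == "A"

# Passing the raw array skips index alignment; the single row is broadcast
# across every selected row.
df.loc[rows, df2.columns] = df2.to_numpy()

print(df)
```

Because only the four shared columns are assigned, name and n_roll are left alone and n_roll stays an integer column.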

Pandas - split text with values in parenthesis into multiple columns

I have a dataframe column with values as below:
HexNAc(6)Hex(7)Fuc(1)NeuAc(3)
HexNAc(6)Hex(7)Fuc(1)NeuAc(3)
HexNAc(5)Hex(4)NeuAc(1)
HexNAc(6)Hex(7)
I want to split this information into multiple columns:
HexNAc Hex Fuc NeuAc
6 7 1 3
6 7 1 3
5 4 0 1
6 7 0 0
What is the best way to do this?
This can be done with a combination of string splits and explode (pandas version >= 0.25), followed by a pivot. The rest cleans up the columns and fills missing values.
import pandas as pd
s = pd.Series(['HexNAc(6)Hex(7)Fuc(1)NeuAc(3)', 'HexNAc(6)Hex(7)Fuc(1)NeuAc(3)',
'HexNAc(5)Hex(4)NeuAc(1)', 'HexNAc(6)Hex(7)'])
(pd.DataFrame(s.str.split(')').explode().str.split(r'\(', expand=True))
.pivot(columns=0, values=1)
.rename_axis(None, axis=1)
.dropna(how='all', axis=1)
.fillna(0, downcast='infer'))
Fuc Hex HexNAc NeuAc
0 1 7 6 3
1 1 7 6 3
2 0 4 5 1
3 0 7 6 0
Alternatively, use str.findall; names missing from a row come out as NaN:
pd.DataFrame(s.str.findall(r'\w+').map(lambda x : dict(zip(x[::2], x[1::2]))).tolist())
Out[207]:
Fuc Hex HexNAc NeuAc
0 1 7 6 3
1 1 7 6 3
2 NaN 4 5 1
3 NaN 7 6 NaN
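Another way to get integer counts with zeros for the missing names is Series.str.extractall followed by unstack. This is only a sketch: the group names sugar and count and the variable names are made up for illustration, and the pattern assumes every entry looks like Name(digits).

```python
import pandas as pd

s = pd.Series([
    "HexNAc(6)Hex(7)Fuc(1)NeuAc(3)",
    "HexNAc(6)Hex(7)Fuc(1)NeuAc(3)",
    "HexNAc(5)Hex(4)NeuAc(1)",
    "HexNAc(6)Hex(7)",
])

# One output row per "Name(number)" match; the index gains a 'match' level.
pairs = s.str.extractall(r"(?P<sugar>\w+)\((?P<count>\d+)\)")

wide = (
    pairs.reset_index(level="match", drop=True)   # back to the original row labels
         .set_index("sugar", append=True)["count"]
         .astype(int)
         .unstack(fill_value=0)                   # names become columns, gaps become 0
         .rename_axis(columns=None)
)

print(wide)
```

If the original column order matters, reindex the result with an explicit list of column names.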

Sum pandas dataframe column values based on condition of column name

I have a DataFrame with column names in the shape of x.y, where I would like to sum up all columns with the same value on x without having to explicitly name them. That is, the value of column_name.split(".")[0] should determine their group. Here's an example:
import pandas as pd
df = pd.DataFrame({'x.1': [1,2,3,4], 'x.2': [5,4,3,2], 'y.8': [19,2,1,3], 'y.92': [10,9,2,4]})
df
Out[3]:
x.1 x.2 y.8 y.92
0 1 5 19 10
1 2 4 2 9
2 3 3 1 2
3 4 2 3 4
The result should be the same as this operation, only I shouldn't have to explicitly list the column names and how they should group.
pd.DataFrame({'x': df[['x.1', 'x.2']].sum(axis=1), 'y': df[['y.8', 'y.92']].sum(axis=1)})
x y
0 6 29
1 6 11
2 6 3
3 6 7
Another option: extract the prefix from the column names and use it as the grouping key:
df.groupby(by = df.columns.str.split('.').str[0], axis = 1).sum()
# x y
#0 6 29
#1 6 11
#2 6 3
#3 6 7
You can first create a MultiIndex with split, then group by the first level and aggregate with sum:
df.columns = df.columns.str.split('.', expand=True)
print (df)
x y
1 2 8 92
0 1 5 19 10
1 2 4 2 9
2 3 3 1 2
3 4 2 3 4
df = df.groupby(axis=1, level=0).sum()
print (df)
x y
0 6 29
1 6 11
2 6 3
3 6 7
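Note that groupby(..., axis=1) is deprecated in recent pandas releases (2.1 and later, as far as I know). A sketch of the same idea without axis=1 is to transpose, group the rows by prefix, and transpose back. The names prefix and summed are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "x.1": [1, 2, 3, 4],
    "x.2": [5, 4, 3, 2],
    "y.8": [19, 2, 1, 3],
    "y.92": [10, 9, 2, 4],
})

# The part before the first dot decides the group of each column.
prefix = df.columns.str.split(".").str[0]

# After transposing, the columns are rows, so a regular row-wise groupby works.
summed = df.T.groupby(prefix).sum().T

print(summed)
```

Transposing copies the data, so for very wide or mixed-dtype frames this may be slower than the axis=1 form was.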

What is the difference between using `[data2]` and `[[data2]]` with `groupby`

I am working through a Python for data analysis tutorial and want some clarification on the output I get from using [data2] and [[data2]] when using groupby.
If you use:
[data2]
you get a Series with a MultiIndex.
If you use a subset:
[[data2]]
you get a DataFrame with a MultiIndex.
And if you use:
df.groupby(['key1','key2'], as_index=False)['data2'].mean()
you get a DataFrame with 3 columns and no MultiIndex.
Maybe it is clearer in another form:
import pandas as pd
df = pd.DataFrame({'key1':[1,2,2,1,2,2],
'key2':[4,4,4,4,5,5],
'data2':[7,8,9,1,3,5],
'D':[1,3,5,7,9,5]})
print (df)
D data2 key1 key2
0 1 7 1 4
1 3 8 2 4
2 5 9 2 4
3 7 1 1 4
4 9 3 2 5
5 5 5 2 5
print (df['data2'].groupby([df.key1,df.key2]).mean())
key1 key2
1 4 4.0
2 4 8.5
5 4.0
Name: data2, dtype: float64
print (df[['data2']].groupby([df.key1,df.key2]).mean())
data2
key1 key2
1 4 4.0
2 4 8.5
5 4.0
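The three cases can be checked side by side with a short sketch that reuses the sample frame above. The variable names by_series, by_frame and flat are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "key1": [1, 2, 2, 1, 2, 2],
    "key2": [4, 4, 4, 4, 5, 5],
    "data2": [7, 8, 9, 1, 3, 5],
    "D": [1, 3, 5, 7, 9, 5],
})

grouped = df.groupby(["key1", "key2"])

by_series = grouped["data2"].mean()    # single brackets: Series
by_frame = grouped[["data2"]].mean()   # double brackets: one-column DataFrame

# as_index=False keeps the keys as ordinary columns instead of a MultiIndex.
flat = df.groupby(["key1", "key2"], as_index=False)["data2"].mean()

print(type(by_series).__name__, type(by_frame).__name__)
print(flat)
```

The numbers are identical in all three; only the container and the index differ.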

Adding a column to pandas data frame fills it with NA

I have this pandas dataframe:
SourceDomain 1 2 3
0 www.theguardian.com profile.theguardian.com 1 Directed
1 www.theguardian.com membership.theguardian.com 2 Directed
2 www.theguardian.com subscribe.theguardian.com 3 Directed
3 www.theguardian.com www.google.co.uk 4 Directed
4 www.theguardian.com jobs.theguardian.com 5 Directed
I would like to add a new column which is a pandas series created like this:
Weights = Weights.value_counts()
However, when I try to add the new column using edgesFile[4] = Weights it fills it with NA instead of the values:
SourceDomain 1 2 3 4
0 www.theguardian.com profile.theguardian.com 1 Directed NaN
1 www.theguardian.com membership.theguardian.com 2 Directed NaN
2 www.theguardian.com subscribe.theguardian.com 3 Directed NaN
3 www.theguardian.com www.google.co.uk 4 Directed NaN
4 www.theguardian.com jobs.theguardian.com 5 Directed NaN
How can I add the new column keeping the values?
Thanks!
Dani
You are getting NaNs because the index of Weights does not match up with the index of edgesFile. If you want Pandas to ignore Weights.index and just paste the values in order then pass the underlying NumPy array instead:
edgesFile[4] = Weights.values
Here is an example which demonstrates the difference:
In [14]: df = pd.DataFrame(np.arange(4)*10, index=list('ABCD'))
In [15]: df
Out[15]:
0
A 0
B 10
C 20
D 30
In [16]: s = pd.Series(np.arange(4), index=list('CDEF'))
In [17]: s
Out[17]:
C 0
D 1
E 2
F 3
dtype: int64
Here we see Pandas aligning the index:
In [18]: df[4] = s
In [19]: df
Out[19]:
0 4
A 0 NaN
B 10 NaN
C 20 0
D 30 1
Here, Pandas simply pastes the values in s into the column:
In [20]: df[4] = s.values
In [21]: df
Out[21]:
0 4
A 0 0
B 10 1
C 20 2
D 30 3
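Pasting .values only works when the Series has exactly as many entries as the frame has rows. If the goal is to give every row the count of its own value, which is my guess at the intent and not something the question states, mapping the counts back by value is a possible sketch. The frame edges and the column names source, target, aligned and weight are hypothetical stand-ins.

```python
import pandas as pd

# Hypothetical stand-in for edgesFile with shortened values.
edges = pd.DataFrame({
    "source": ["g", "g", "g", "g", "g"],
    "target": ["a", "b", "a", "c", "a"],
})

# value_counts is indexed by the distinct values, not by row number.
counts = edges["target"].value_counts()

# Direct assignment aligns on the index: row labels 0..4 never match
# the labels 'a', 'b', 'c', so the whole column is NaN.
edges["aligned"] = counts

# map looks up each row's value in counts, giving one count per row.
edges["weight"] = edges["target"].map(counts)

print(edges)
```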
Here is a small example for your question.
You can add a new column to an existing DataFrame under a column name:
>>> import pandas as pd
>>> from pandas import DataFrame, Series
>>> df = DataFrame([[1,2,3],[4,5,6]], columns = ['A', 'B', 'C'])
>>> df
A B C
0 1 2 3
1 4 5 6
>>> s = Series([7,8])
>>> s
0 7
1 8
dtype: int64
>>> df['D']=s
>>> df
A B C D
0 1 2 3 7
1 4 5 6 8
Or you can make a DataFrame from the Series and then concat them:
>>> df = DataFrame([[1,2,3],[4,5,6]])
>>> df
0 1 2
0 1 2 3
1 4 5 6
>>> s = DataFrame(Series([7,8]), columns=['4']) # if you don't provide a column name, the default name will be 0
>>> s
4
0 7
1 8
>>> df = pd.concat([df,s], axis=1)
>>> df
0 1 2 4
0 1 2 3 7
1 4 5 6 8
Hope this will help
