Python: replace string in a specific dataframe column

I would like to replace any value in a dataframe column that starts with the string "chaud" with the string 'Chaudière'. I would also like the first and last name after each "Chaudière" to disappear, to anonymize NameDevice.
My data frame is called df1 and the column name is NameDevice.
I have tried this:
df1.loc[df1['NameDevice'].str.startswith('chaud'), 'NameDevice'] = df1['NameDevice'].str.replace("chaud","Chaudière")
I checked with df1.head(), and it returns:
IdDevice IdDeviceType SerialDevice NameDevice IdLocation UuidAttributeDevice IdBox IsUpdateDevice
0 119 48 00001 Chaudière Maud Ferrand 4 NaN 4 0
1 120 48 00002 Chaudière Yvan Martinod 6 NaN 6 0
2 121 48 00006 Chaudière Anne-Sophie Premereur 7 NaN 7 0
3 122 48 00005 Chaudière Denis Fauser 8 NaN 8 0
4 123 48 00004 Chaudière Elariak Djilali 3 NaN 3 0

You can do the matching by calling str.lower first, then you can use str.startswith, and then just split on the spaces and take the first entry to anonymise the data:
In [14]:
df.loc[df['NameDevice'].str.lower().str.startswith('chaud'), 'NameDevice'] = df['NameDevice'].str.split().str[0]
df
Out[14]:
IdDevice IdDeviceType SerialDevice NameDevice IdLocation \
0 119 48 1 Chaudière 4
1 120 48 2 Chaudière 6
2 121 48 6 Chaudière 7
3 122 48 5 Chaudière 8
4 123 48 4 Chaudière 3
UuidAttributeDevice IdBox IsUpdateDevice
0 NaN 4 0
1 NaN 6 0
2 NaN 7 0
3 NaN 8 0
4 NaN 3 0
Another method is to use str.extract so it only takes Chaud...:
In [27]:
df.loc[df['NameDevice'].str.lower().str.startswith('chaud'), 'NameDevice'] = df['NameDevice'].str.extract(r'(Chaud\w+)', expand=False)
df
Out[27]:
IdDevice IdDeviceType SerialDevice NameDevice IdLocation \
0 119 48 1 Chaudière 4
1 120 48 2 Chaudière 6
2 121 48 6 Chaudière 7
3 122 48 5 Chaudière 8
4 123 48 4 Chaudière 3
UuidAttributeDevice IdBox IsUpdateDevice
0 NaN 4 0
1 NaN 6 0
2 NaN 7 0
3 NaN 8 0
4 NaN 3 0
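If the goal is only to anonymise, the constant can also be assigned to the matching rows directly, so the result does not depend on where the first space falls or on how the original word is capitalised. A minimal sketch, assuming a small made-up df1 (the sample rows, including the 'Thermostat salon' row, are invented for illustration) and assuming missing names should simply not match:

```python
import pandas as pd

# Hypothetical sample data; only the NameDevice column matters here.
df1 = pd.DataFrame({
    "NameDevice": [
        "chaudière Maud Ferrand",
        "Chaudiere Yvan Martinod",
        "Thermostat salon",
        None,
    ]
})

# Case-insensitive prefix test; na=False treats missing names as non-matching.
mask = df1["NameDevice"].str.lower().str.startswith("chaud", na=False)

# Overwrite the whole cell, which drops the first and last name.
df1.loc[mask, "NameDevice"] = "Chaudière"

print(df1)
```

Rows that do not start with "chaud" are left untouched.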

Related

How to use each vector entry to fill the NaNs of separate groups in a dataframe

Say I have a vector valsHR which looks like this:
valsHR=[78.8, 82.3, 91.0]
And I have a dataframe mainData:
Age Patient HR
21 1 NaN
21 1 NaN
21 1 NaN
30 2 NaN
30 2 NaN
24 3 NaN
24 3 NaN
24 3 NaN
I want to fill the NaNs so that the first value in valsHR will only fill in the NaNs for patient 1, the second will fill the NaNs for patient 2 and the third will fill in for patient 3.
So far I've tried this, but it fills all the NaNs with the first value in the vector:
mainData['HR'] = mainData['HR'].fillna(valsHR)
I've also tried this, which fills the NaNs with values that aren't in the valsHR vector at all:
mainData['HR'] = mainData.groupby('Patient').fillna(valsHR)
I was wondering if anyone knew a way to do this?
Build a dictionary that pairs each Patient value that has missing HR values with an entry of valsHR, map it onto the Patient column, and use the result to fill only the missing values:
print (df)
Age Patient HR
0 21 1 NaN
1 21 1 NaN
2 21 1 NaN
3 30 2 100.0 <- value is not replaced
4 30 2 NaN
5 24 3 NaN
6 24 3 NaN
7 24 3 NaN
p = df.loc[df.HR.isna(), 'Patient'].unique()
valsHR = [78.8, 82.3, 91.0]
df['HR'] = df['HR'].fillna(df['Patient'].map(dict(zip(p, valsHR))))
print (df)
Age Patient HR
0 21 1 78.8
1 21 1 78.8
2 21 1 78.8
3 30 2 100.0
4 30 2 82.3
5 24 3 91.0
6 24 3 91.0
7 24 3 91.0
If some groups have no NaNs:
print (df)
Age Patient HR
0 21 1 NaN
1 21 1 NaN
2 21 1 NaN
3 30 2 100.0 <- group 2 is not replaced
4 30 2 100.0 <- group 2 is not replaced
5 24 3 NaN
6 24 3 NaN
7 24 3 NaN
p = df.loc[df.HR.isna(), 'Patient'].unique()
valsHR = [78.8, 82.3, 91.0]
df['HR'] = df['HR'].fillna(df['Patient'].map(dict(zip(p, valsHR))))
print (df)
Age Patient HR
0 21 1 78.8
1 21 1 78.8
2 21 1 78.8
3 30 2 100.0
4 30 2 100.0
5 24 3 82.3
6 24 3 82.3
7 24 3 82.3
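Note that in this second case zip pairs the values only with the patients that still have NaNs, so patient 3 receives 82.3, the second entry. If instead each entry of valsHR is meant to belong to a fixed patient (an assumption about the intent, not something the question states), the dictionary can be built from all patients in order of first appearance. A minimal sketch under that assumption; fill_by_patient is a name introduced here for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Age": [21, 21, 21, 30, 30, 24, 24, 24],
    "Patient": [1, 1, 1, 2, 2, 3, 3, 3],
    "HR": [np.nan, np.nan, np.nan, 100.0, 100.0, np.nan, np.nan, np.nan],
})
valsHR = [78.8, 82.3, 91.0]

# Pair every patient (in order of first appearance) with one value,
# whether or not that patient currently has missing HR values.
fill_by_patient = dict(zip(df["Patient"].unique(), valsHR))

# fillna only touches the NaNs, so patient 2 keeps its existing 100.0.
df["HR"] = df["HR"].fillna(df["Patient"].map(fill_by_patient))

print(df)
```

Here patient 3 gets 91.0 because the mapping is keyed by patient rather than by position among the incomplete groups.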
If all of the NaNs should be replaced, it is a simple mapping:
import pandas as pd
from io import StringIO
valsHR=[78.8, 82.3, 91.0]
vals = {i:k for i,k in enumerate(valsHR, 1)}
df = pd.read_csv(StringIO("""Age Patient
21 1
21 1
21 1
30 2
30 2
24 3
24 3
24 3"""), sep=r"\s+")
df["HR"] = df["Patient"].map(vals)
>>> df
Age Patient HR
0 21 1 78.8
1 21 1 78.8
2 21 1 78.8
3 30 2 82.3
4 30 2 82.3
5 24 3 91.0
6 24 3 91.0
7 24 3 91.0

Turn Pandas MultiIndex into columns

I have a similar dataframe:
action_type value
0 0 link_click 1
1 mobile_app_install 5
2 video_view 181
3 omni_view_content 2
1 0 post_reaction 32
1 link_click 124
2 mobile_app_install 190
3 video_view 6162
4 omni_custom 2420
5 omni_activate_app 4525
2 0 comment 1
1 link_click 53
2 post_reaction 23
3 video_view 2246
4 mobile_app_install 87
5 omni_view_content 24
6 post_engagement 2323
7 page_engagement 2323
I want to transpose so:
It looks like you can try:
(df.set_index('action_type', append=True)
.reset_index(level=1, drop=True)['value']
.unstack('action_type')
)
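To see what this chain produces, here is a self-contained sketch on a reduced version of the data. The four rows and the two-level index are a hand-built assumption based on the printed frame, not the asker's actual data; wide is a name introduced here:

```python
import pandas as pd

# Two unnamed index levels, as in the printed frame (assumed structure).
idx = pd.MultiIndex.from_tuples([(0, 0), (0, 1), (1, 0), (1, 1)])
df = pd.DataFrame(
    {
        "action_type": ["link_click", "video_view", "post_reaction", "link_click"],
        "value": [1, 181, 32, 124],
    },
    index=idx,
)

wide = (
    df.set_index("action_type", append=True)  # add action_type as a third index level
    .reset_index(level=1, drop=True)["value"]  # drop the inner counter level
    .unstack("action_type")                    # pivot action_type into columns
)

print(wide)
```

Each outer index value becomes one row, each action_type becomes one column, and combinations that do not occur are NaN.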

Resolve complementary missing values between rows

I have a df that looks like this
Day ID ID_2 AS D E AS1 D1 E1
29 72 Participant 1 PS 6 42 NaN NaN NaN NaN NaN NaN
35 78 Participant 1 NaN yes 3 no 2 no 2
49 22 Participant 2 PS 1 89 NaN NaN NaN NaN NaN NaN
85 18 Participant 2 NaN yes 3 no 2 no 2
I'm looking for a way to add the ID_2 column value to all rows where ID matches (i.e., for Participant 1, fill in the NaN values with the values from the other row where ID=Participant 1). I've looked into using combine but that doesn't seem to work for this particular case.
Expected output:
Day ID ID_2 AS D E AS1 D1 E1
29 72 Participant 1 PS 6 42 yes 3 no 2 no 2
35 78 Participant 1 PS 6 42 yes 3 no 2 no 2
49 22 Participant 2 PS 1 89 yes 3 no 2 no 2
85 18 Participant 2 PS 1 89 yes 3 no 2 no 2
or
Day ID ID_2 AS D E AS1 D1 E1
29 72 Participant 1 PS 6 42 NaN NaN NaN NaN NaN NaN
35 78 Participant 1 PS 6 42 yes 3 no 2 no 2
49 22 Participant 2 PS 1 89 NaN NaN NaN NaN NaN NaN
85 18 Participant 2 PS 1 89 yes 3 no 2 no 2
I think you could try
df.ID_2 = df.groupby('ID').ID_2.ffill()
# 29 PS 6 42
# 35 PS 6 42
# 49 PS 1 89
# 85 PS 1 89
Not tested, but something like this should work - can't copy your df into my browser.
print(df)
Day ID ID_2 AS D E AS1 D1 E1
0 72 Participant_1 PS_6_42 NaN NaN NaN NaN NaN NaN
1 78 Participant_1 NaN yes 3.0 no 2.0 no 2.0
2 22 Participant_2 PS_1_89 NaN NaN NaN NaN NaN NaN
3 18 Participant_2 NaN yes 3.0 no 2.0 no 2.0
df2 = df.set_index('ID').groupby('ID').transform('ffill').transform('bfill').reset_index()
print(df2)
ID Day ID_2 AS D E AS1 D1 E1
0 Participant_1 72 PS_6_42 yes 3 no 2 no 2
1 Participant_1 78 PS_6_42 yes 3 no 2 no 2
2 Participant_2 22 PS_1_89 yes 3 no 2 no 2
3 Participant_2 18 PS_1_89 yes 3 no 2 no 2
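One caveat with the chained version: the second transform('bfill') runs on the whole frame rather than per group, so a group that starts with NaN could pick up values from the next participant. A minimal sketch that keeps both fills inside each group, assuming a cut-down frame with only some of the columns (the data below is retyped by hand from the question):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Day": [72, 78, 22, 18],
    "ID": ["Participant 1", "Participant 1", "Participant 2", "Participant 2"],
    "ID_2": ["PS 6 42", np.nan, "PS 1 89", np.nan],
    "AS": [np.nan, "yes", np.nan, "yes"],
    "D": [np.nan, 3, np.nan, 3],
})

# Columns whose gaps should be filled from the other rows of the same ID.
cols = ["ID_2", "AS", "D"]

# Forward fill then backward fill, both restricted to each ID group.
df[cols] = df.groupby("ID")[cols].transform(lambda g: g.ffill().bfill())

print(df)
```

This gives the first expected output from the question; dropping the .bfill() call gives the second one.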

How to move every element in a column by n range in a dataframe using python?

I have a dataframe df that looks like below:
No A B value
1 23 36 1
2 45 23 1
3 34 12 2
4 22 76 NaN
...
I would like to shift each of the values in the "value" column by 2, so that two empty rows follow each value. The value in the first row should not be shifted.
I have already tried the normal shift, which shifts everything by 2:
df['value']=df['value'].shift(2)
I expect the result below:
No A B value
1 23 36 1
2 45 23 NaN
3 34 12 NaN
4 22 76 1
5 10 12 NaN
6 34 2 NaN
7 21 11 2
...
In your case, you can rebuild the column with the existing values placed at every third index position:
df['Newvalue']=pd.Series(df.value.values,index=np.arange(len(df))*3)
df
Out[41]:
No A B value Newvalue
0 1 23 36 1.0 1.0
1 2 45 23 1.0 NaN
2 3 34 12 2.0 NaN
3 4 22 76 NaN 1.0
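The trick relies on index alignment: the new Series is labelled 0, 3, 6, ... and pandas matches those labels against the frame's index on assignment. A self-contained sketch, assuming the frame has a default 0..n-1 index and extending the sample to seven rows (the NaN padding in "value" is made up so that the third value has somewhere to land):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "No": range(1, 8),
    "A": [23, 45, 34, 22, 10, 34, 21],
    "B": [36, 23, 12, 76, 12, 2, 11],
    "value": [1, 1, 2, np.nan, np.nan, np.nan, np.nan],
})

# Label the i-th value with 3*i; labels beyond the frame's index are dropped
# on assignment, and index positions without a label become NaN.
spaced = pd.Series(df["value"].values, index=np.arange(len(df)) * 3)
df["Newvalue"] = spaced

print(df)
```

With a non-default index the labels would no longer line up, so the frame would need reset_index(drop=True) first.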

pandas pct_change() in reverse

Suppose we have a dataframe and we calculate the percent change between rows:
y_axis = [1,2,3,4,5,6,7,8,9]
x_axis = [100,105,115,95,90,88,110,100,0]
DF = pd.DataFrame({'Y':y_axis, 'X':x_axis})
DF = DF[['Y','X']]
DF['PCT'] = DF['X'].pct_change()
Y X PCT
0 1 100 NaN
1 2 105 0.050000
2 3 115 0.095238
3 4 95 -0.173913
4 5 90 -0.052632
5 6 88 -0.022222
6 7 110 0.250000
7 8 100 -0.090909
8 9 0 -1.000000
That way it starts from the first row.
I want to calculate pct_change() starting from the last row.
One way to do it:
DF['Reverse'] = list(reversed(x_axis))
DF['PCT_rev'] = DF['Reverse'].pct_change()
pct_rev = DF.PCT_rev.tolist()
DF['_PCT_'] = list(reversed(pct_rev))
DF2 = DF[['Y','X','PCT','_PCT_']]
Y X PCT _PCT_
0 1 100 NaN -0.047619
1 2 105 0.050000 -0.086957
2 3 115 0.095238 0.210526
3 4 95 -0.173913 0.055556
4 5 90 -0.052632 0.022727
5 6 88 -0.022222 -0.200000
6 7 110 0.250000 0.100000
7 8 100 -0.090909 inf
8 9 0 -1.000000 NaN
But that is a very ugly and inefficient solution.
I was wondering if there are more elegant solutions?
DF.assign(_PCT_=DF.X.pct_change(-1))
Y X PCT _PCT_
0 1 100 NaN -0.047619
1 2 105 0.050000 -0.086957
2 3 115 0.095238 0.210526
3 4 95 -0.173913 0.055556
4 5 90 -0.052632 0.022727
5 6 88 -0.022222 -0.200000
6 7 110 0.250000 0.100000
7 8 100 -0.090909 inf
8 9 0 -1.000000 NaN
Series.pct_change(periods=1, fill_method='pad', limit=None, freq=None, **kwargs)
periods : int, default 1 Periods to shift for forming percent change
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.pct_change.html
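As a quick check that a negative periods value does the same thing as the reverse, compute, reverse-back approach from the question, here is a small sketch on the X column alone (rev and manual are names introduced here):

```python
import numpy as np
import pandas as pd

x = pd.Series([100, 105, 115, 95, 90, 88, 110, 100, 0], dtype="float64")

# Each row is compared with the row after it instead of the row before it.
rev = x.pct_change(periods=-1)

# The question's approach: reverse, take the ordinary pct_change, reverse back.
manual = x[::-1].pct_change()[::-1]

print(rev)
print(np.allclose(rev.to_numpy(), manual.to_numpy(), equal_nan=True))
```

The last row has nothing after it, so it is NaN, and the row before the 0 is inf because it divides by zero.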
I deleted my other answer because @su79eu7k's is way better.
You can cut your time in half by using the underlying arrays. But you also have to suppress a warning.
a = DF.X.values
DF.assign(_PCT_=np.append((a[:-1] - a[1:]) / a[1:], np.nan))
Y X PCT _PCT_
0 1 100 NaN -0.047619
1 2 105 0.050000 -0.086957
2 3 115 0.095238 0.210526
3 4 95 -0.173913 0.055556
4 5 90 -0.052632 0.022727
5 6 88 -0.022222 -0.200000
6 7 110 0.250000 0.100000
7 8 100 -0.090909 inf
8 9 0 -1.000000 NaN
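The warning comes from the division by the trailing 0 in X. One way to silence it only for this computation is NumPy's errstate context manager; a minimal sketch on the raw array (pct is a name introduced here):

```python
import numpy as np

a = np.array([100, 105, 115, 95, 90, 88, 110, 100, 0], dtype=float)

# Suppress the divide-by-zero RuntimeWarning just inside this block.
with np.errstate(divide="ignore"):
    # (current - next) / next for every row except the last, then pad with NaN.
    pct = np.append((a[:-1] - a[1:]) / a[1:], np.nan)

print(pct)
```

Outside the with block, NumPy's warning settings are unchanged.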
