Convert regex values in pandas to 0 or 1 - python

I have the below pandas column. I need to convert cells containing the word 'anaphylaxis' to 1 and the cells not containing the word to 0.
So far I have tried the following, but something is missing:
df['Name']= df['Name'].replace(r"^(.(?=anaphylaxis))*?$", 1,regex=True)
df['Name']= df['Name'].replace(r"^(.(?<!anaphylaxis))*?$", 0, regex=True)
ID Name
84 Drug-induced anaphylaxis
1041 Acute anaphylaxis
1194 Anaphylactic reaction
1483 Anaphylactic reaction, due to adverse effect o...
2226 Anaphylaxis, initial encounter
2428 Anaphylaxis
2831 Anaphylactic shock
4900 Other anaphylactic reaction

Use str.contains for case-insensitive matching.
import re
df['Name'] = df['Name'].str.contains(r'anaphylaxis', flags=re.IGNORECASE).astype(int)
Or, more concisely,
df['Name'] = df['Name'].str.contains(r'(?i)anaphylaxis').astype(int)
df
ID Name
0 84 1
1 1041 1
2 1194 0
3 1483 0
4 2226 1
5 2428 1
6 2831 0
7 4900 0
str.contains is useful when you also need regex-based matching. In this case, though, the pattern is a plain substring, so you can drop the regex entirely by passing regex=False for a bit more performance.
However, for even more performance, use a list comprehension.
import numpy as np
df['Name'] = np.array(['anaphylaxis' in x.lower() for x in df['Name']], dtype=int)
Or even better,
df['Name'] = [1 if 'anaphylaxis' in x.lower() else 0 for x in df['Name'].tolist()]
df
ID Name
0 84 1
1 1041 1
2 1194 0
3 1483 0
4 2226 1
5 2428 1
6 2831 0
7 4900 0
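If the column can contain missing values, str.contains returns NaN for those rows and the later astype(int) fails. Here is a minimal sketch that passes na=False so missing names count as 0. The sample rows, including the extra row with a missing name, are made up for illustration and are not the asker's data.

```python
import pandas as pd

# Illustrative data only; the row with a missing Name is an assumption
# added to show how missing values are handled.
df = pd.DataFrame({
    "ID": [84, 1194, 2428, 9999],
    "Name": ["Drug-induced anaphylaxis", "Anaphylactic reaction", "Anaphylaxis", None],
})

# Plain, case-insensitive substring match; na=False treats missing values as "no match".
df["Name"] = (
    df["Name"]
    .str.contains("anaphylaxis", case=False, regex=False, na=False)
    .astype(int)
)
print(df)
```

Whether a missing name should really map to 0 depends on the data; if not, keep the NaN and skip the integer conversion for those rows.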

You can use pd.Series.str.contains instead of regex. This method returns a Boolean series, which we then convert to int.
df['Name']= df['Name'].str.contains('anaphylaxis', case=False, regex=False)\
.astype(int)
Result:
ID Name
0 84 1
1 1041 1
2 1194 0
3 1483 0
4 2226 1
5 2428 1
6 2831 0
7 4900 0

Related

Preserving NaN values when using groupby and lambda function on dataframe

Following on from this question, I have a dataset as such:
ChildID MotherID preDiabetes
0 20 455 No
1 20 455 Not documented
2 13 102 NaN
3 13 102 Yes
4 702 946 No
5 82 571 No
6 82 571 Yes
7 82 571 Not documented
8 60 530 NaN
Which I have transformed to the following such that each mother has a single value for preDiabetes:
ChildID MotherID preDiabetes
0 20 455 No
1 13 102 Yes
2 702 946 No
3 82 571 Yes
4 60 530 No
I did this by applying the following logic:
if preDiabetes=="Yes" for a particular MotherID, assign preDiabetes a value of "Yes" regardless of the remaining observations
else if preDiabetes != "Yes" for a particular MotherID, I will assign preDiabetes a value of "No"
However, after thinking about this again, I realised that I should preserve NaN values to impute them later on, rather than just assign them 'No".
So I should edit my logic to be:
if preDiabetes=="Yes" for a particular MotherID, assign preDiabetes a value of "Yes" regardless of the remaining observations
else if all values for preDiabetes==NaN for a particular MotherID, assign preDiabetes a single NaN value
else assign preDiabetes a value of "No"
So, in the above table MotherID=530 should have a value of NaN for preDiabetes like so:
ChildID MotherID preDiabetes
0 20 455 No
1 13 102 Yes
2 702 946 No
3 82 571 Yes
4 60 530 NaN
I tried doing this using the following line of code:
df=df.groupby(['MotherID', 'ChildID'])['preDiabetes'].apply(
lambda x: 'Yes' if 'Yes' in x.values else (np.NaN if np.NaN in x.values.all() else 'No'))
However, running this line of code is resulting in the following error:
TypeError: 'in ' requires string as left operand, not float
I'd appreciate if you guys can point out what it is I am doing wrong. Thank you.
You can try:
import pandas as pd
import numpy as np
import io
data_string = """ChildID,MotherID,preDiabetes
20,455,No
20,455,Not documented
13,102,NaN
13,102,Yes
702,946,No
82,571,No
82,571,Yes
82,571,Not documented
60,530,NaN
"""
data = io.StringIO(data_string)
df = pd.read_csv(data, sep=',', na_values=['NaN'])
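# Replace missing values with a sentinel so they can be compared like ordinary strings
df['preDiabetes'] = df['preDiabetes'].fillna('no_value')
df = df.groupby(['MotherID', 'ChildID'])['preDiabetes'].apply(
    lambda x: 'Yes' if 'Yes' in x.values else (np.nan if (x == 'no_value').all() else 'No'))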
df
Result:
MotherID ChildID
102 13 Yes
455 20 No
530 60 NaN
571 82 Yes
946 702 No
Name: preDiabetes, dtype: object
You can do this using a custom function:
def func(s):
    if s.eq('Yes').any():
        return 'Yes'
    elif s.isna().all():
        return np.nan
    else:
        return 'No'
df = (df
      .groupby(['ChildID', 'MotherID'])
      .agg({'preDiabetes': func}))
print(df)
ChildID MotherID preDiabetes
0 13 102 Yes
1 20 455 No
2 60 530 NaN
3 82 571 Yes
4 702 946 No
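For reference, here is a self-contained sketch of the custom-function approach that can be run from scratch. It rebuilds the question's sample data from a CSV string, and it calls reset_index() so the group keys come back as ordinary columns. The name collapse is only an illustrative choice.

```python
import io

import numpy as np
import pandas as pd

csv_text = """ChildID,MotherID,preDiabetes
20,455,No
20,455,Not documented
13,102,NaN
13,102,Yes
702,946,No
82,571,No
82,571,Yes
82,571,Not documented
60,530,NaN
"""
# read_csv parses the literal text "NaN" as a missing value by default.
df = pd.read_csv(io.StringIO(csv_text))


def collapse(s):
    """Reduce one group's preDiabetes values to a single value."""
    if s.eq("Yes").any():   # any "Yes" wins
        return "Yes"
    if s.isna().all():      # every value missing: keep it missing
        return np.nan
    return "No"             # everything else


out = (
    df.groupby(["ChildID", "MotherID"])["preDiabetes"]
    .agg(collapse)
    .reset_index()
)
print(out)
```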
Try:
df['preDiabetes']=df['preDiabetes'].map({'Yes': 1, 'No': 0}).fillna(-1)
df=df.groupby(['MotherID', 'ChildID'])['preDiabetes'].max().map({1: 'Yes', 0: 'No', -1: 'NaN'}).reset_index()
The first line maps preDiabetes to numbers, treating everything other than Yes or No as missing (denoted by -1).
The second line takes the maximum per group: if at least one value is Yes, the group becomes Yes; if the group contains only No and missing values, it becomes No; if all values are missing, it becomes NaN. Note that this final map writes the string 'NaN', not a real missing value.
Outputs:
>>> df
MotherID ChildID preDiabetes
0 102 13 Yes
1 455 20 No
2 530 60 NaN
3 571 82 Yes
4 946 702 No

Creating column to keep track of missing values in another column

I am adding a mock dataframe to exemplify my problem.
I have a large dataframe in which some columns are missing values.
I would like to create some extra boolean columns in which 1 corresponds to a non missing value in the row and 0 corresponds to a missing value.
names = ['Banana, Andrew Something (Maria Banana)', np.nan, 'Willis, Mr. Bruce (Demi Moore)', 'Crews, Master Terry', np.nan]
room = [100, 330, 212, 111, 222]
hotel_loon = {'Name' : pd.Series(names), 'Room' : pd.Series(room)}
hotel_loon_df = pd.DataFrame(hotel_loon)
In another question I found on Stack Overflow, the answers were thorough and clear on how to keep track of all the columns that have missing values, but not of specific ones.
I tried a few variations of that code (namely using where) but I was not successful with creating what I wanted which would be something like this:
Name Room Name_present Room_present
0 Banana, Andrew Something (Maria Banana) 100 1 1
1 NaN 330 0 1
2 Willis, Mr. Bruce (Demi Moore) 212 1 1
3 Crews, Master Terry 111 1 1
4 NaN 222 0 1
Thank you for your time, I am sure that in the end it is going to be trivial, but for some reason I got stuck.
To save some typing, use DataFrame.notnull, add some suffixes, and join the result back.
pd.concat([df, df.notnull().astype(int).add_suffix('_present')], axis=1)
Name Room Name_present Room_present
0 Banana, Andrew Something (Maria Banana) 100 1 1
1 NaN 330 0 1
2 Willis, Mr. Bruce (Demi Moore) 212 1 1
3 Crews, Master Terry 111 1 1
4 NaN 222 0 1
You can use .isnull() for your case, and change the type from bool to int:
hotel_loon_df['Name_present'] = (~hotel_loon_df['Name'].isnull()).astype(int)
hotel_loon_df['Room_present'] = (~hotel_loon_df['Room'].isnull()).astype(int)
Out[1]:
Name Room Name_present Room_present
0 Banana, Andrew Something (Maria Banana) 100 1 1
1 NaN 330 0 1
2 Willis, Mr. Bruce (Demi Moore) 212 1 1
3 Crews, Master Terry 111 1 1
4 NaN 222 0 1
The ~ operator inverts a boolean Series element-wise (logical NOT), so ~isnull() is True where a value is present.
If you only need to track NaN fields, you can use the isnull() function:
df['name_present'] =df['name'].isnull()
df['name_present'].replace(True,0, inplace=True)
df['name_present'].replace(False,1, inplace=True)
df['room_present'] =df['room'].isnull()
df['room_present'].replace(True,0, inplace=True)
df['room_present'].replace(False,1, inplace=True)
We can do this in a concise manner by using DataFrame.isnull:
hotel_loon_df[['Name_present', 'Room_present']] = (~hotel_loon_df.isnull()).astype(int)
Name Room Name_present Room_present
0 Banana, Andrew Something (Maria Banana) 100 1 1
1 NaN 330 0 1
2 Willis, Mr. Bruce (Demi Moore) 212 1 1
3 Crews, Master Terry 111 1 1
4 NaN 222 0 1
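Since the question asks about tracking specific columns rather than all of them, the same idea can be restricted to a chosen subset. Here is a minimal sketch, assuming only Name should be tracked; the list name tracked is a hypothetical choice.

```python
import numpy as np
import pandas as pd

names = ['Banana, Andrew Something (Maria Banana)', np.nan,
         'Willis, Mr. Bruce (Demi Moore)', 'Crews, Master Terry', np.nan]
room = [100, 330, 212, 111, 222]
hotel_loon_df = pd.DataFrame({'Name': names, 'Room': room})

# Hypothetical choice: only these columns get an indicator column.
tracked = ['Name']

# 1 where a value is present, 0 where it is missing.
indicators = hotel_loon_df[tracked].notna().astype(int).add_suffix('_present')

# join aligns on the row index, so the indicators line up with the original rows.
result = hotel_loon_df.join(indicators)
print(result)
```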

df.apply(sorted, axis=1) removes column names?

Working through Pandas Cookbook. Counting the Total Number of Flights Between Cities.
import pandas as pd
import numpy as np
# import matplotlib.pyplot as plt
print('NumPy: {}'.format(np.__version__))
print('Pandas: {}'.format(pd.__version__))
print('-----')
desired_width = 320
pd.set_option('display.width', desired_width)
pd.options.display.max_rows = 50
pd.options.display.max_columns = 14
# pd.options.display.float_format = '{:,.2f}'.format
file = "e:\\packt\\data_analysis_and_exploration_with_pandas\\section07\\data\\flights.csv"
flights = pd.read_csv(file)
print(flights.head(10))
print()
# This returns the total number of rows for each group.
flights_ct = flights.groupby(['ORG_AIR', 'DEST_AIR']).size()
print(flights_ct.head(10))
print()
# Get the number of flights between Atlanta and Houston in both directions.
print(flights_ct.loc[[('ATL', 'IAH'), ('IAH', 'ATL')]])
print()
# Sort the origin and destination cities:
# flights_sort = flights.sort_values(by=['ORG_AIR', 'DEST_AIR'], axis=1)
flights_sort = flights[['ORG_AIR', 'DEST_AIR']].apply(sorted, axis=1)
print(flights_sort.head(10))
print()
# Passing just the first row.
print(sorted(flights.loc[0, ['ORG_AIR', 'DEST_AIR']]))
print()
# Once each row is independently sorted, the column names are no longer correct.
# We will rename them to something generic, then again find the total number of flights between all cities.
rename_dict = {'ORG_AIR': 'AIR1', 'DEST_AIR': 'AIR2'}
flights_sort = flights_sort.rename(columns=rename_dict)
flights_ct2 = flights_sort.groupby(['AIR1', 'AIR2']).size()
print(flights_ct2.head(10))
print()
When I get to this line of code, my output differs from the author's:
flights_sort = flights[['ORG_AIR', 'DEST_AIR']].apply(sorted, axis=1)
My output does not contain any column names. As a result, when I get to:
flights_ct2 = flights_sort.groupby(['AIR1', 'AIR2']).size()
it throws a KeyError. This makes sense, as I am trying to rename columns when no column names exist.
My question is: why are the column names gone? All other output matches the author's output exactly:
Connected to pydev debugger (build 191.7141.48)
NumPy: 1.16.3
Pandas: 0.24.2
-----
MONTH DAY WEEKDAY AIRLINE ORG_AIR DEST_AIR SCHED_DEP DEP_DELAY AIR_TIME DIST SCHED_ARR ARR_DELAY DIVERTED CANCELLED
0 1 1 4 WN LAX SLC 1625 58.0 94.0 590 1905 65.0 0 0
1 1 1 4 UA DEN IAD 823 7.0 154.0 1452 1333 -13.0 0 0
2 1 1 4 MQ DFW VPS 1305 36.0 85.0 641 1453 35.0 0 0
3 1 1 4 AA DFW DCA 1555 7.0 126.0 1192 1935 -7.0 0 0
4 1 1 4 WN LAX MCI 1720 48.0 166.0 1363 2225 39.0 0 0
5 1 1 4 UA IAH SAN 1450 1.0 178.0 1303 1620 -14.0 0 0
6 1 1 4 AA DFW MSY 1250 84.0 64.0 447 1410 83.0 0 0
7 1 1 4 F9 SFO PHX 1020 -7.0 91.0 651 1315 -6.0 0 0
8 1 1 4 AA ORD STL 1845 -5.0 44.0 258 1950 -5.0 0 0
9 1 1 4 UA IAH SJC 925 3.0 215.0 1608 1136 -14.0 0 0
ORG_AIR DEST_AIR
ATL ABE 31
ABQ 16
ABY 19
ACY 6
AEX 40
AGS 83
ALB 33
ANC 2
ASE 1
ATW 10
dtype: int64
ORG_AIR DEST_AIR
ATL IAH 121
IAH ATL 148
dtype: int64
*** No columns names *** Why?
0 [LAX, SLC]
1 [DEN, IAD]
2 [DFW, VPS]
3 [DCA, DFW]
4 [LAX, MCI]
5 [IAH, SAN]
6 [DFW, MSY]
7 [PHX, SFO]
8 [ORD, STL]
9 [IAH, SJC]
dtype: object
The author's output. Note the columns names are present.
sorted returns a list object and obliterates the columns:
In [11]: df = pd.DataFrame([[1, 2], [3, 4]], columns=["A", "B"])
In [12]: df.apply(sorted, axis=1)
Out[12]:
0 [1, 2]
1 [3, 4]
dtype: object
In [13]: type(df.apply(sorted, axis=1).iloc[0])
Out[13]: list
It's possible that this wouldn't have been the case in earlier pandas... but it would still be bad code.
You can do this by passing the columns explicitly:
In [14]: df.apply(lambda x: pd.Series(sorted(x), df.columns), axis=1)
Out[14]:
A B
0 1 2
1 3 4
A more efficient way to do this is to sort the underlying numpy array:
In [21]: df = pd.DataFrame([[1, 2], [3, 1]], columns=["A", "B"])
In [22]: df
Out[22]:
A B
0 1 2
1 3 1
In [23]: arr = df[["A", "B"]].values
In [24]: arr.sort(axis=1)
In [25]: df[["A", "B"]] = arr
In [26]: df
Out[26]:
A B
0 1 2
1 1 3
As you can see this sorts each row.
A final note: I just applied @AndyHayden's numpy-based solution from above.
flights_sort = flights[["ORG_AIR", "DEST_AIR"]].values
flights_sort.sort(axis=1)
flights[["ORG_AIR", "DEST_AIR"]] = flights_sort
All I can say is: wow. What an enormous performance difference. I get exactly the same correct answer, and I get it as soon as I click the mouse, whereas the pandas lambda solution also provided by @AndyHayden takes about 20 seconds to perform the sort. The dataset has 58,000+ rows. The numpy solution returns the sort instantly.
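Putting the numpy approach together with the counting step from the question, a minimal end-to-end sketch might look like the following. The five-row flights frame is made up for illustration. It uses np.sort, which returns a sorted copy, rather than the in-place arr.sort shown above, so the original columns stay untouched.

```python
import numpy as np
import pandas as pd

# Made-up sample standing in for the flights.csv data.
flights = pd.DataFrame({
    "ORG_AIR": ["ATL", "IAH", "ATL", "LAX", "SLC"],
    "DEST_AIR": ["IAH", "ATL", "IAH", "SLC", "LAX"],
})

# Sort each row independently so (IAH, ATL) and (ATL, IAH) become the same pair.
pairs = np.sort(flights[["ORG_AIR", "DEST_AIR"]].to_numpy(), axis=1)

# Rebuild a DataFrame with generic column names, since "origin" and
# "destination" no longer apply after sorting.
flights_sort = pd.DataFrame(pairs, columns=["AIR1", "AIR2"], index=flights.index)

# Count flights between each pair of cities, regardless of direction.
flights_ct2 = flights_sort.groupby(["AIR1", "AIR2"]).size()
print(flights_ct2)
```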

Why i'm not getting my whole output in the run module?

I'm not getting my whole output, or my column names, on my screen.
import sqlite3
import pandas as pd
hello = sqlite3.connect(r"C:\Users\ravjo\Downloads\Chinook.sqlite")
rs = hello.execute("SELECT * FROM PlaylistTrack INNER JOIN Track on PlaylistTrack.TrackId = Track.TrackId WHERE Milliseconds < 250000")
df = pd.DataFrame(rs.fetchall())
hello.close()
print(df.head())
actual result:
0 1 2 3 4 ... 6 7 8 9 10
0 1 3390 3390 One and the Same 271 ... 23 None 217732 3559040 0.99
1 1 3392 3392 Until We Fall 271 ... 23 None 230758 3766605 0.99
2 1 3393 3393 Original Fire 271 ... 23 None 218916 3577821 0.99
3 1 3394 3394 Broken City 271 ... 23 None 228366 3728955 0.99
4 1 3395 3395 Somedays 271 ... 23 None 213831 3497176 0.99
[5 rows x 11 columns]
expected result:
PlaylistId TrackId TrackId Name AlbumId MediaTypeId \
0 1 3390 3390 One and the Same 271 2
1 1 3392 3392 Until We Fall 271 2
2 1 3393 3393 Original Fire 271 2
3 1 3394 3394 Broken City 271 2
4 1 3395 3395 Somedays 271 2
GenreId Composer Milliseconds Bytes UnitPrice
0 23 None 217732 3559040 0.99
1 23 None 230758 3766605 0.99
2 23 None 218916 3577821 0.99
3 23 None 228366 3728955 0.99
4 23 None 213831 3497176 0.99
The ... in the middle indicates that some columns have been omitted from the display. If you want to see all of the data, modify the pandas display options with pandas.set_option(); see the pandas documentation for details.
In your case, set display.max_columns to None so that pandas displays an unlimited number of columns. The column names are missing because the DataFrame was built from raw rows, so you will have to either read the column names from the database or set them manually.
To display all the columns, use the snippet below:
pd.set_option("display.max_columns",None)
By default, pandas limits the number of rows it displays, but you can change this to suit your needs. Here is a helper function I use whenever I need to print a full DataFrame:
def print_full(df):
    import pandas as pd
    pd.set_option('display.max_rows', len(df))
    print(df)
    pd.reset_option('display.max_rows')
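As for the missing column names, here is a sketch of two ways to get them from SQLite. It uses an in-memory database with a small made-up Track table as a stand-in for Chinook.sqlite, so the table layout is an assumption. pd.read_sql_query fills in the column names itself, and cursor.description exposes them when you build the DataFrame by hand.

```python
import sqlite3

import pandas as pd

# In-memory stand-in for the real database; table and rows are made up.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Track (TrackId INTEGER, Name TEXT, Milliseconds INTEGER)")
con.executemany(
    "INSERT INTO Track VALUES (?, ?, ?)",
    [(3390, "One and the Same", 217732), (1, "Some Long Song", 300000)],
)

query = "SELECT * FROM Track WHERE Milliseconds < 250000"

# Option 1: let pandas run the query; column names come along automatically.
df1 = pd.read_sql_query(query, con)

# Option 2: run the query yourself and take the names from cursor.description,
# where the first item of each entry is the column name.
cur = con.execute(query)
cols = [d[0] for d in cur.description]
df2 = pd.DataFrame(cur.fetchall(), columns=cols)

con.close()

# Temporarily lift the column limit just for this print.
with pd.option_context("display.max_columns", None):
    print(df1)
```

Using option_context keeps the display change local to the with block, which avoids having to reset the option afterwards.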

Reset secondary index in pandas dataframe to start at 1

Suppose I construct a multi-index dataframe like the one shown here:
prim_ind=np.array(range(0,1000))
for i in range(0,1000):
    prim_ind[i]=round(i/4)
d = {'prim_ind' :prim_ind,
'sec_ind' : np.array(range(1,1001)),
'a' : np.array(range(325,1325)),
'b' : np.array(range(8318,9318))}
df= pd.DataFrame(d).set_index(['prim_ind','sec_ind'])
The sec_ind runs sequentially from 1 upwards, but I want to reset this second index so that for each of the prim_ind levels the sec_ind always starts at 1. I have been trying to work out if I can use reset index to do this but am failing miserably.
I know I could iterate over the dataframe to get this outcome, but that would be a horrible way to do it and there must be a more pythonic way. Can anyone help?
Note: the dataframe I'm working with is actually imported from CSV; the code above is just to illustrate the question.
You can use cumcount to number the rows within each group:
df.index = [df.index.get_level_values(0), df.groupby(level=0).cumcount() + 1]
Or, better, if you also want to keep the index names, use MultiIndex.from_arrays:
df.index = pd.MultiIndex.from_arrays([df.index.get_level_values(0),
df.groupby(level=0).cumcount() + 1],
names=df.index.names)
print (df)
a b
prim_ind sec_ind
0 1 325 8318
2 326 8319
3 327 8320
1 1 328 8321
2 329 8322
3 330 8323
2 1 331 8324
So the sec_ind column is not necessary; you can also build the index directly:
d = {'prim_ind' :prim_ind,
'a' : np.array(range(325,1325)),
'b' : np.array(range(8318,9318))}
df = pd.DataFrame(d)
print (df.head(8))
a b prim_ind
0 325 8318 0
1 326 8319 0
2 327 8320 0
3 328 8321 1
4 329 8322 1
5 330 8323 1
6 331 8324 2
7 332 8325 2
df = df.set_index(['prim_ind', df.groupby('prim_ind').cumcount() + 1]) \
.rename_axis(('first','second'))
print (df.head(8))
a b
first second
0 1 325 8318
2 326 8319
3 327 8320
1 1 328 8321
2 329 8322
3 330 8323
2 1 331 8324
2 332 8325
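The same idea can be tried on a small frame. This sketch uses six made-up rows instead of the 1000-row example and names the generated level sec_ind by renaming the cumcount Series before passing it to set_index.

```python
import pandas as pd

# Small made-up frame: three rows in group 0, two in group 1, one in group 2.
df = pd.DataFrame({
    "prim_ind": [0, 0, 0, 1, 1, 2],
    "a": range(325, 331),
})

# cumcount numbers rows within each group from 0, so add 1 to start at 1.
sec = (df.groupby("prim_ind").cumcount() + 1).rename("sec_ind")

# set_index accepts a mix of column labels and Series; the Series name
# becomes the name of the second index level.
df = df.set_index(["prim_ind", sec])
print(df)
```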
