I am adding a mock dataframe to exemplify my problem.
I have a large dataframe in which some columns are missing values.
I would like to create some extra boolean columns in which 1 corresponds to a non-missing value in the row and 0 corresponds to a missing value.
import numpy as np
import pandas as pd

names = ['Banana, Andrew Something (Maria Banana)', np.nan, 'Willis, Mr. Bruce (Demi Moore)', 'Crews, Master Terry', np.nan]
room = [100, 330, 212, 111, 222]
hotel_loon = {'Name': pd.Series(names), 'Room': pd.Series(room)}
hotel_loon_df = pd.DataFrame(hotel_loon)
In another question I found on Stack Overflow, the answers were thorough and clear about how to keep track of all the columns that have missing values, but not about how to do it for specific ones.
I tried a few variations of that code (namely using where), but I was not successful in creating what I wanted, which would be something like this:
Name Room Name_present Room_present
0 Banana, Andrew Something (Maria Banana) 100 1 1
1 NaN 330 0 1
2 Willis, Mr. Bruce (Demi Moore) 212 1 1
3 Crews, Master Terry 111 1 1
4 NaN 222 0 1
Thank you for your time; I am sure that in the end it will turn out to be trivial, but for some reason I got stuck.
To save some typing, use DataFrame.notnull, add some suffixes, and join the result back.
pd.concat([df, df.notnull().astype(int).add_suffix('_present')], axis=1)
Name Room Name_present Room_present
0 Banana, Andrew Something (Maria Banana) 100 1 1
1 NaN 330 0 1
2 Willis, Mr. Bruce (Demi Moore) 212 1 1
3 Crews, Master Terry 111 1 1
4 NaN 222 0 1
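Since the answer mentions joining the result back, here is the same idea as a sketch with DataFrame.join instead of pd.concat (df being the frame from the question):
out = df.join(df.notnull().astype(int).add_suffix('_present'))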
You can use .isnull() for your case, and change the type from bool to int:
hotel_loon_df['Name_present'] = (~hotel_loon_df['Name'].isnull()).astype(int)
hotel_loon_df['Room_present'] = (~hotel_loon_df['Room'].isnull()).astype(int)
Out[1]:
Name Room Name_present Room_present
0 Banana, Andrew Something (Maria Banana) 100 1 1
1 NaN 330 0 1
2 Willis, Mr. Bruce (Demi Moore) 212 1 1
3 Crews, Master Terry 111 1 1
4 NaN 222 0 1
The ~ operator means logical NOT: it inverts each boolean, so ~df['Name'].isnull() is True exactly where a value is present.
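A minimal illustration of ~ on a boolean Series:
import pandas as pd

s = pd.Series([True, False, True])
print(~s)
# 0    False
# 1     True
# 2    False
# dtype: bool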
If you are only tracking NaN fields, you can use the isnull() function:
df['name_present'] = df['name'].isnull()
df['name_present'].replace(True, 0, inplace=True)
df['name_present'].replace(False, 1, inplace=True)
df['room_present'] = df['room'].isnull()
df['room_present'].replace(True, 0, inplace=True)
df['room_present'].replace(False, 1, inplace=True)
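A sketch of the same idea in one step per column, using Series.map instead of the two replace calls (same lowercase column names as above):
df['name_present'] = df['name'].isnull().map({True: 0, False: 1})
df['room_present'] = df['room'].isnull().map({True: 0, False: 1})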
We can do this in a concise manner by using DataFrame.isnull; the one-step assignment below works because the frame has exactly the two columns Name and Room:
hotel_loon_df[['Name_present', 'Room_present']] = (~hotel_loon_df.isnull()).astype(int)
Name Room Name_present Room_present
0 Banana, Andrew Something (Maria Banana) 100 1 1
1 NaN 330 0 1
2 Willis, Mr. Bruce (Demi Moore) 212 1 1
3 Crews, Master Terry 111 1 1
4 NaN 222 0 1
I have the following pandas dataframe, where the column 'Status' consists of 4 categorical values - 'Open', 'Closed', 'Solved' and 'Pending'.
0 250635 Comcast Cable Internet Speeds 22-04-15
1 223441 Payment disappear - service got disconnected 04-08-15
2 242732 Speed and Service 18-04-15
3 277946 Comcast Imposed a New Usage Cap of 300GB that ... 05-07-15
4 307175 Comcast not working and no service to boot 26-05-15
Date_month_year Time Received Via City State \
0 22-Apr-15 3:53:50 PM Customer Care Call Abingdon Maryland
1 04-Aug-15 10:22:56 AM Internet Acworth Georgia
2 18-Apr-15 9:55:47 AM Internet Acworth Georgia
3 05-Jul-15 11:59:35 AM Internet Acworth Georgia
4 26-May-15 1:25:26 PM Internet Acworth Georgia
Zip code Status Filing on Behalf of Someone
0 21009 Closed No
1 30102 Closed No
2 30101 Closed Yes
3 30101 Open Yes
4 30101 Solved No
I would like to combine the 'Open' and 'Pending' categories into a single 'Open' column, and 'Closed' and 'Solved' into a 'Closed' column, encoded as 0/1 binaries. If I use pd.get_dummies(df, columns=['Status']) I get the output below, with 4 new columns for the 4 values, but I only want 2, as mentioned earlier. I couldn't find any previous thread on this, so please suggest any possible method. Thank you.
0 22-Apr-15 3:53:50 PM Customer Care Call Abingdon Maryland
1 04-Aug-15 10:22:56 AM Internet Acworth Georgia
2 18-Apr-15 9:55:47 AM Internet Acworth Georgia
3 05-Jul-15 11:59:35 AM Internet Acworth Georgia
4 26-May-15 1:25:26 PM Internet Acworth Georgia
... ... ... ... ...
2219 04-Feb-15 9:13:18 AM Customer Care Call Youngstown Florida
2220 06-Feb-15 1:24:39 PM Customer Care Call Ypsilanti Michigan
2221 06-Sep-15 5:28:41 PM Internet Ypsilanti Michigan
2222 23-Jun-15 11:13:30 PM Customer Care Call Ypsilanti Michigan
2223 24-Jun-15 10:28:33 PM Customer Care Call Ypsilanti Michigan
Zip code Filing on Behalf of Someone Status_Closed Status_Open \
0 21009 No 1 0
1 30102 No 1 0
2 30101 Yes 1 0
3 30101 Yes 0 1
4 30101 No 0 0
... ... ... ...
2219 32466 No 1 0
2220 48197 No 0 0
2221 48197 No 0 0
2222 48197 No 0 0
2223 48198 Yes 0 1
Status_Pending Status_Solved
0 0 0
1 0 0
2 0 0
3 0 0
4 0 1
... ...
2219 0 0
2220 0 1
2221 0 1
2222 0 1
2223 0 0
Just go with
df['Status_open'] = 0
df['Status_closed'] = 0
df.loc[(df['Status'] == 'Open') | (df['Status'] == 'Pending'), 'Status_open'] = 1
df.loc[(df['Status'] == 'Closed') | (df['Status'] == 'Solved'), 'Status_closed'] = 1
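An equivalent sketch as a one-liner per column (same column names as above) uses Series.isin:
df['Status_open'] = df['Status'].isin(['Open', 'Pending']).astype(int)
df['Status_closed'] = df['Status'].isin(['Closed', 'Solved']).astype(int)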
Here is the underlying principle:
for i, row in df.iterrows():
    if 'Open' in row['Status']:
        df.at[i, 'Open'] = True    # or any other value
    if 'Pending' in row['Status']:
        df.at[i, 'Open'] = True    # or any other value
    if 'Closed' in row['Status']:
        df.at[i, 'Closed'] = True  # or any other value
    if 'Solved' in row['Status']:
        df.at[i, 'Closed'] = True  # or any other value
You iterate through the column, check for each value, and when a value is found you set a boolean in the new column ('Open' or 'Closed'). Naturally, you'll need to create those columns before doing this, as sketched below.
(Not tested on a PC.)
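A sketch of that missing setup, with the column names taken from the loop above:
df['Open'] = False
df['Closed'] = False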
I think it can be done in this manner:
open_ls = ['Open', 'Pending']
df['New_Status'] = df['Status'].apply(lambda x: 'Open' if x in open_ls else 'Closed')
pd.get_dummies(df, columns=['New_Status'])
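Note that pd.get_dummies returns a new DataFrame rather than modifying df in place, so assign the result. A map-based sketch of the same recoding (column names as above):
status_map = {'Open': 'Open', 'Pending': 'Open',
              'Closed': 'Closed', 'Solved': 'Closed'}
df['New_Status'] = df['Status'].map(status_map)
df = pd.get_dummies(df, columns=['New_Status'])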
I am new to this field and stuck on this problem. I have two datasets.
all_batsman_df, which has 5 columns ('years', 'team', 'pos', 'name', 'salary'):
years team pos name salary
0 1991 SF 1B Will Clark 3750000.0
1 1991 NYY 1B Don Mattingly 3420000.0
2 1991 BAL 1B Glenn Davis 3275000.0
3 1991 MIL DH Paul Molitor 3233333.0
4 1991 TOR 3B Kelly Gruber 3033333.0
all_batting_statistics_df, which has 31 columns:
Year Rk Name Age Tm Lg G PA AB R ... SLG OPS OPS+ TB GDP HBP SH SF IBB Pos Summary
0 1988 1 Glen Davis 22 SDP NL 37 89 83 6 ... 0.289 0.514 48.0 24 1 1 0 1 1 987
1 1988 2 Jim Acker 29 ATL NL 21 6 5 0 ... 0.400 0.900 158.0 2 0 0 0 0 0 1
2 1988 3 Jim Adduci* 28 MIL AL 44 97 94 8 ... 0.383 0.641 77.0 36 1 0 0 3 0 7D/93
3 1988 4 Juan Agosto* 30 HOU NL 75 6 5 0 ... 0.000 0.000 -100.0 0 0 0 1 0 0 1
4 1988 5 Luis Aguayo 29 TOT MLB 99 260 237 21 ... 0.354 0.663 88.0 84 6 1 1 1 3 564
I want to merge these two datasets on 'year' and 'name'. The problem is that the two data frames spell some names differently: the first dataset has 'Glenn Davis' while the second has 'Glen Davis'.
Now, I want to know: how can I merge the two using the difflib library even though the names differ?
Any help will be appreciated ...
Thanks in advance.
I have used the code below, which I found in another question on this platform, but it is not working for me. I am adding a new column after matching names in both of the datasets. I know this is not a good approach; kindly suggest if I can do it in a better way.
df_a = all_batting_statistics_df
df_b = all_batters
df_a = df_a.astype(str)
df_b = df_b.astype(str)
df_a['merge_year'] = df_a['Year']  # we will use these as the merge keys
df_a['merge_name'] = df_a['Name']

for comp_a, addr_a in df_a[['Year', 'Name']].values:
    for ixb, (comp_b, addr_b) in enumerate(df_b[['years', 'name']].values):
        if cdifflib.CSequenceMatcher(None, comp_a, comp_b).ratio() > .6:
            df_b.loc[ixb, 'merge_year'] = comp_a  # creates a merge key in df_b
        if cdifflib.CSequenceMatcher(None, addr_a, addr_b).ratio() > .6:
            df_b.loc[ixb, 'merge_name'] = addr_a  # creates a merge key in df_b

merged_df = pd.merge(df_a, df_b, on=['merge_name', 'merge_years'], how='inner')
You can do
import difflib
df_b['name'] = df_b['name'].apply(
    lambda x: difflib.get_close_matches(x, df_a['name'])[0])
to replace names in df_b with the closest match from df_a, then do your merge. See also this post.
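Note that difflib.get_close_matches returns an empty list when nothing clears its similarity cutoff, so the [0] above can raise IndexError. A guarded sketch (falling back to the original name is an assumption):
import difflib

def closest_name(x, candidates):
    # best fuzzy match, or x itself when there is none (assumed fallback)
    matches = difflib.get_close_matches(x, candidates)
    return matches[0] if matches else x

df_b['name'] = df_b['name'].apply(lambda x: closest_name(x, df_a['name']))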
Let me get to your problem by assuming that you have to end up with two shared key columns, 1. 'year' and 2. 'name'.
1. First, we rename all the names that are wrong.
Assuming you know the misspelled names in all_batting_statistics_df, fix them like this (note the regex has to match the whole wrong name, not just a prefix):
all_batting_statistics_df.replace(regex=r'^Glen Davis$', value='Glenn Davis')
Once you have corrected all the spellings, work from the smaller data set for the remaining steps so it doesn't take long.
2. We need both data sets to keep the key columns, i.e. 'year' and 'name'. Use this to drop the columns we don't need (drop needs axis=1 for columns, and 'Name' must be kept for the merge):
all_batsman_df_1 = all_batsman_df.drop(['team', 'pos', 'salary'], axis=1)
all_batting_statistics_df_1 = all_batting_statistics_df.drop(['Rk', 'Age', 'Tm', 'Lg', 'G', 'PA', 'AB', 'R', 'Summary'], axis=1)
I cannot see all 31 columns, so I left the rest; you will have to add them to the code above.
3. We need the column names to match those in all_batsman_df, i.e. 'years' and 'name', using DataFrame.rename:
df_new_1 = all_batting_statistics_df_1.rename(columns={'Year': 'years', 'Name': 'name'})
4. Next, merge them on both keys:
all_batsman_df_1.merge(df_new_1, on=['years', 'name'])
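Putting the steps together, a compact sketch (column names assumed from the question):
# 1. fix known misspellings
stats = all_batting_statistics_df.replace({'Name': {'Glen Davis': 'Glenn Davis'}})
# 2. keep and rename the merge keys
stats = stats[['Year', 'Name']].rename(columns={'Year': 'years', 'Name': 'name'})
# 3. merge on both keys
merged = all_batsman_df.merge(stats, on=['years', 'name'])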
FINAL THOUGHTS:
If you don't want to do all this, find a way to export the data sets to Google Sheets or Microsoft Excel and edit them there. But if you like pandas, it's not that difficult; you will find a way. All the best!
I read in a pipe-separated CSV like this
test = pd.read_csv("http://kejser.org/wp-content/uploads/2014/06/Country.csv")
test.head()
This returns
SK_Country|"Number"|"Alpha2Code"|"Alpha3Code"|"CountryName"|"TopLevelDomain"
0 1|20|"ad"|"and"|"Andorra"|".ad"
1 2|4|"af"|"afg"|"Afghanistan"|".af"
2 3|28|"ag"|"atg"|"Antigua and Barbuda"|".ag"
3 4|660|"ai"|"aia"|"Anguilla"|".ai"
4 5|8|"al"|"alb"|"Albania"|".al"
When I try and extract specific data from it, like below:
df = test[["Alpha3Code"]]
I get the following error:
KeyError: ['Alpha3Code'] not in index
I don't understand what goes wrong - I can see the value is in the CSV when I print the head, likewise when I open the CSV, everything looks fine.
I've tried to google around and read some posts regarding the issue here on Stack Overflow, and tried different approaches, but nothing seems to fix this annoying problem.
Notice how everything is crammed into one string column? That's because you didn't specify the delimiter separating columns to pd.read_csv, which in this case has to be '|'.
test = pd.read_csv("http://kejser.org/wp-content/uploads/2014/06/Country.csv",
sep='|')
test.head()
# SK_Country Number Alpha2Code Alpha3Code CountryName \
# 0 1 20 ad and Andorra
# 1 2 4 af afg Afghanistan
# 2 3 28 ag atg Antigua and Barbuda
# 3 4 660 ai aia Anguilla
# 4 5 8 al alb Albania
#
# TopLevelDomain
# 0 .ad
# 1 .af
# 2 .ag
# 3 .ai
# 4 .al
As pointed out in the comment by @chrisz, you have to specify the delimiter (delimiter is an alias for sep in pd.read_csv, so this is the same fix as above):
test = pd.read_csv("http://kejser.org/wp-content/uploads/2014/06/Country.csv",delimiter='|')
test.head()
SK_Country Number Alpha2Code Alpha3Code CountryName \
0 1 20 ad and Andorra
1 2 4 af afg Afghanistan
2 3 28 ag atg Antigua and Barbuda
3 4 660 ai aia Anguilla
4 5 8 al alb Albania
TopLevelDomain
0 .ad
1 .af
2 .ag
3 .ai
4 .al
I have a dataframe that needs a column added to it. That column needs to be a count of all the other rows in the table that meet a certain condition, where the condition depends on values from both the current row and the row being compared against.
For example, if it was a dataframe describing people, and I wanted to make a column that counted how many people were taller than the current row and lighter.
I'd want the height and weight of the current row, as well as the height and weight of each other row, passed to a function, so I can do something like:
def example_function(height1, weight1, height2, weight2):
    if height1 > height2 and weight1 < weight2:
        return True
    else:
        return False
And it would just sum up all the True's and give that sum in the column.
Is something like this possible?
Thanks in advance for any ideas!
Edit: Sample input:
id name height weight country
0 Adam 70 180 USA
1 Bill 65 190 CANADA
2 Chris 71 150 GERMANY
3 Eric 72 210 USA
4 Fred 74 160 FRANCE
5 Gary 75 220 MEXICO
6 Henry 61 230 SPAIN
The result would need to be:
id name height weight country new_column
0 Adam 70 180 USA 1
1 Bill 65 190 CANADA 1
2 Chris 71 150 GERMANY 3
3 Eric 72 210 USA 1
4 Fred 74 160 FRANCE 4
5 Gary 75 220 MEXICO 1
6 Henry 61 230 SPAIN 0
I believe it will need to be some sort of function, as the actual logic I need to use is more complicated.
Edit 2: fixed typo.
You can sum booleans, like this (height1/weight1 and height2/weight2 standing for the two rows being compared, as in your function):
count = ((df.height1 > df.height2) & (df.weight1 < df.weight2)).sum()
EDIT:
After testing a bit, I changed the conditions in a custom function:
def f(x):
    # check the boolean mask
    # print((df.height > x.height) & (df.weight < x.weight))
    return ((df.height < x.height) & (df.weight > x.weight)).sum()

df['new_column'] = df.apply(f, axis=1)
print(df)
id name height weight country new_column
0 0 Adam 70 180 USA 2
1 1 Bill 65 190 CANADA 1
2 2 Chris 71 150 GERMANY 3
3 3 Eric 72 210 USA 1
4 4 Fred 74 160 FRANCE 4
5 5 Gary 75 220 MEXICO 1
6 6 Henry 61 230 SPAIN 0
Explanation:
For each row, compare its values against all the other rows, and obtain the count by simply summing the True values.
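For larger frames, a vectorized sketch of the same count using NumPy broadcasting (equivalent to f above, assuming numeric height and weight columns):
import numpy as np

h = df['height'].to_numpy()
w = df['weight'].to_numpy()
# entry (i, j) is True when row j is shorter and heavier than row i
pair = (h[:, None] > h) & (w[:, None] < w)
df['new_column'] = pair.sum(axis=1)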
For example, if it was a dataframe describing people, and I wanted to make a column that counted how many people were taller than the current row and lighter.
As far as I understand, you want to assign to a new column something like
df['num_heigher_and_leighter'] = df.apply(lambda r: ((df.height > r.height) & (df.weight < r.weight)).sum(), axis=1)
However, your text description doesn't seem to match the outcome, which is:
0 2
1 3
2 0
3 1
4 0
5 0
6 6
dtype: int64
Edit
As in any other case, you can use a named function instead of a lambda:
df = ...

def foo(r):
    return ((df.height > r.height) & (df.weight < r.weight)).sum()

df['num_heigher_and_leighter'] = df.apply(foo, axis=1)
I'm assuming you had a typo and want to compare heights with heights and weights with weights. If so, you could count the number of persons both taller and heavier like so:
>>> for i,height,weight in zip(df.index,df.height, df.weight):
... cnt = df.loc[((df.height>height) & (df.weight>weight)), 'height'].count()
... df.loc[i,'thing'] = cnt
...
>>> df
name height weight country thing
0 Adam 70 180 USA 2.0
1 Bill 65 190 CANADA 2.0
2 Chris 71 150 GERMANY 3.0
3 Eric 72 210 USA 1.0
4 Fred 74 160 FRANCE 1.0
5 Gary 75 220 MEXICO 0.0
6 Henry 61 230 SPAIN 0.0
Here, for instance, no person is heavier than Henry, and no person is taller than Gary. If that's not what you intended, it should be easy to modify the & above to a |, or to switch the > to a <.
When you're more accustomed to pandas, I suggest you use Ami Tavory's excellent answer instead.
PS. For the love of god, use the Metric system for representing weight and height, and convert to whatever for presentation. These numbers are totally nonsensical for the world population at large. :)
I'm trying to get the length of each zipCd value in the dataframe mentioned below. When I run the code below I get 958 for every record. I'm expecting to get something more like '4'. Does anyone see what the issue is?
Code:
zipDfCopy['zipCd'].str.len()
Data:
print zipDfCopy[1:5]
Zip Code Place Name State State Abbreviation County \
1 544 Holtsville New York NY Suffolk
2 1001 Agawam Massachusetts MA Hampden
3 1002 Amherst Massachusetts MA Hampshire
4 1003 Amherst Massachusetts MA Hampshire
Latitude Longitude zipCd
1 40.8154 -73.0451 0 501\n1 544\n2 1001...
2 42.0702 -72.6227 0 501\n1 544\n2 1001...
3 42.3671 -72.4646 0 501\n1 544\n2 1001...
4 42.3919 -72.5248 0 501\n1 544\n2 1001...
One way is to convert to string and use pd.Series.map with len built-in.
pd.Series.str is used for vectorized string functions, while pd.Series.astype is used to change column type.
import pandas as pd
df = pd.DataFrame({'ZipCode': [341, 4624, 536, 123, 462, 4642]})
df['ZipLen'] = df['ZipCode'].astype(str).map(len)
# ZipCode ZipLen
# 0 341 3
# 1 4624 4
# 2 536 3
# 3 123 3
# 4 462 3
# 5 4642 4
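Equivalently, once the values are strings, pandas' vectorized string accessor gives the same result:
df['ZipLen'] = df['ZipCode'].astype(str).str.len()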
A purely numeric alternative uses np.log10 (this needs import numpy as np, assumes positive integers, and, like the string version, counts the digits of the stored number, so ZIP codes with leading zeros such as 00501 come out short):
import numpy as np
df['ZipLen'] = np.floor(np.log10(df['ZipCode'].values)).astype(int) + 1