I have a function that I want to apply to each row of a data frame to create a new column. The value returned from the function will be different if certain columns are NaN. I have several conditions in the function (more complicated than the example below), otherwise, I would use np.where.
How do I test if the column is nan using the function below? I tried row['id'] is np.nan but that doesn't work.
# data
d = {'name': {0: 'dave', 1: 'hagen'},
'id': {0: 123456.0, 1: np.nan},
'position': {0: np.nan, 1: 5600.0},
'test': {0: 'has an id', 1: 'has an id'}}
df = pd.DataFrame(d)
# function
def test_func(row):
    if row['id'] is np.nan:
        val = "missing id"
    else:
        val = 'has an id'
    return val
# apply function
df['test'] = df.apply(test_func, axis=1)
Results (I would expect the test column to say "missing id" on the second row, since id is np.nan on that row):
Use np.where:
>>> df
name id position
0 dave 123456.0 NaN
1 hagen NaN 5600.0
df['test'] = np.where(df['id'].isna(), 'missing id', 'has an id')
>>> df
name id position test
0 dave 123456.0 NaN has an id
1 hagen NaN 5600.0 missing id
numpy has its own isnan() function:
the simplest way is:
val = "missing id" if np.isnan(row['id']) else 'has an id'
You can simply use isna and replace the booleans with your custom strings using replace:
df['test'] = df['id'].isna().replace({True: 'missing id', False: 'has an id'})
output:
name id position test
0 dave 123456.0 NaN has an id
1 hagen NaN 5600.0 missing id
Using apply for something this simple is usually a code smell. You can use the isna method to do the same thing in one line:
df.loc[df['id'].isna(), 'test'] = 'missing id'
The problem with NaN is that it has multiple representations, so the is operator is not guaranteed to work, and NaN is specifically defined so that NaN != NaN.
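Both claims are easy to check in isolation. The sketch below uses two NaN objects (the variable names are made up for illustration) to show that identity and equality tests fail while pd.isna and np.isnan succeed. A value pulled out of a row is typically a different object from np.nan, which would explain why the is test in the question never matches.

```python
import numpy as np
import pandas as pd

nan_a = np.nan
nan_b = float('nan')  # a second, distinct NaN object

same_object = nan_a is nan_b   # identity check between two NaN objects
equal = nan_a == nan_a         # NaN never compares equal, even to itself
detected = [pd.isna(nan_a), pd.isna(nan_b), np.isnan(nan_b)]
```

Here same_object and equal are both False, while every entry of detected is True.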
Given that all you are looking for is the mask df['id'].isna(), you may not need an explicit test column at all. That applies to more complicated conditions as well. You can combine masks using the element-wise bitwise operators & and |. For example, to include a condition on either of the columns:
df.loc[df['id'].isna() | df['position'].isna(), 'test'] = 'missing id'
Alternatively, you can do
df.loc[df[['id', 'position']].isna().any(axis=1), 'test'] = 'missing id'
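When each condition needs its own label, one option not used in the answers here is np.select, which takes a list of masks and a matching list of values and uses the first mask that is True for each row. A minimal sketch, assuming a hypothetical third row ('ann') and a hypothetical 'missing position' label:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'name': ['dave', 'hagen', 'ann'],   # 'ann' is a made-up row
                   'id': [123456.0, np.nan, 42.0],
                   'position': [np.nan, 5600.0, 7.0]})

# masks are checked in order; the first True one decides the label
conditions = [df['id'].isna(), df['position'].isna()]
labels = ['missing id', 'missing position']

df['test'] = np.select(conditions, labels, default='has an id')
```

The default value covers rows where none of the masks match.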
Here is my dataframe:
df = pd.DataFrame({'First Name': ['George', 'Alex', 'Leo'],
'Surname' : ['Davis', 'Mulan', 'Carlitos'],
'Age': [10, 15, 20],
'Size' : [30, 40, 50]})
Output:
First Name Surname Age Size
0 George Davis 10 30
1 Alex Mulan 15 40
2 Leo Carlitos 20 50
And here is a function:
def myfunc(firstname, surname):
    print(firstname + ' ' + surname)
Now, I would like to iterate through the dataframe and check for the following conditions:
IF df['Age'] > 11 AND df['Size'] < 51
If there is a match (row 2 and row 3), I would like to call 'myfunc' and pass in the data from the applicable rows in the dataframe as the arguments of myfunc.
'myfunc()' would be called as:
myfunc(df['First Name'], df['Surname'])
So in this example the output after running the code would be:
'Alex Mulan'
'Leo Carlitos'
(The IF, AND condition was true in the second and third row.)
Please explain how I could achieve this goal and provide a working code snippet.
I would prefer a solution where no additional column is created. (If the solution remains practical. Otherwise a new column can be created and added to the dataframe if needed.)
Use .loc to filter your dataframe and apply your function. Use a lambda function as a proxy to call your function with the right signature.
def myfunc(firstname, surname):
    return firstname + ' ' + surname
out = df.loc[df['Age'].gt(11) & df['Size'].lt(51), ['First Name', 'Surname']] \
        .apply(lambda x: myfunc(*x), axis=1)
Output:
>>> out
1 Alex Mulan
2 Leo Carlitos
dtype: object
>>> type(out)
pandas.core.series.Series
Try with agg then
df.loc[(df.Age>11) & (df.Size<51),['First Name','Surname']].agg(' '.join,1)
Out[124]:
1 Alex Mulan
2 Leo Carlitos
dtype: object
First select the required records using indexing, then concatenate the names:
selected = df[(df['Age'] > 11) & (df['Size'] < 51)]
print(selected['First Name'] + " " + selected['Surname'])
Edit: to pass each row to a generic function and ensure the right columns are passed, you can write a helper like this:
def apply(df, func, kwargs):
    # kwargs maps column names to the function's parameter names
    return df[list(kwargs)].rename(columns=kwargs) \
        .apply(lambda row: func(**row), axis=1)
print(apply(df=selected,
            func=myfunc,
            kwargs={"First Name": "firstname", "Surname": "surname"}))
There is often a more efficient solution that passes whole columns to a function instead of applying it row by row.
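As a sketch of what passing whole columns can look like: the version of myfunc that returns its result only uses +, which works on Series as well as on plain strings, so it can be called once with the filtered columns.

```python
import pandas as pd

df = pd.DataFrame({'First Name': ['George', 'Alex', 'Leo'],
                   'Surname': ['Davis', 'Mulan', 'Carlitos'],
                   'Age': [10, 15, 20],
                   'Size': [30, 40, 50]})

def myfunc(firstname, surname):
    # works for plain strings and for whole Series alike
    return firstname + ' ' + surname

selected = df[(df['Age'] > 11) & (df['Size'] < 51)]

# one call on whole columns instead of one call per row
out = myfunc(selected['First Name'], selected['Surname'])
```

This only works when every operation inside the function is itself defined for Series; a function with if statements on its arguments would still need a row-wise call.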
My df looks like this:
As you can see, the user value starts with 'ff', and it can be in the access column or any other column rather than the user column.
I want to create a new column in this df called "UserID": whenever a value in any of the columns starts with 'ff', copy that value to the new column.
I have been using this method, which works fine, but I have to repeat this line for every column:
hist.loc[hist.User.str.startswith("ff",na=False),'UserId']=hist['User'].str[2:]
Is there another method I can use to cover all the columns at once?
Thanks
If you are cool with picking only the first occurrence:
df['UserID'] = df.apply(lambda x: next((v for v in x if str(v).startswith('ff')), None), axis=1)
NumPy + Pandas solution below.
In case of ambiguity (several ff-strings in a row), the leftmost occurrence is taken. In case of absence (no ff-string in a row), NaN is used.
import pandas as pd, numpy as np
df = pd.DataFrame({
'user': ['fftest', 'fwadmin', 'fshelpdesk3', 'no', 'ffone'],
'access': ['fwadmin', 'ffuser2', 'fwadmin', 'user', 'fftwo'],
'station': ['fshelpdesk', 'fshelpdesk2', 'ffuser3', 'here', 'three'],
})
sv = df.values.astype(str)  # np.str was removed from NumPy; the builtin str is equivalent
ix = np.argwhere(np.char.startswith(sv, 'ff'))[::-1].T
df.loc[ix[0], 'UserID'] = pd.Series(sv[(ix[0], ix[1])]).str[2:].values
print(df)
Output:
user access station UserID
0 fftest fwadmin fshelpdesk test
1 fwadmin ffuser2 fshelpdesk2 user2
2 fshelpdesk3 fwadmin ffuser3 user3
3 no user here NaN
4 ffone fftwo three one
Hey here is my attempt at solving the problem, hope it helps.
d = df[df.apply(lambda x: x.str.startswith('ff'))]
df['user_id'] = d['user'].fillna(d['access'].fillna(d['station']))
Result
user access station user_id
0 fftest fwadmin fshelpdesk fftest
1 fwadmin ffuser2 fshelpdesk2 ffuser2
2 fshelpdesk3 fwadmin ffuser3 ffuser3
I have a dataframe with the following structure:
raw_data = {'website': ['bbc.com', 'cnn.com', 'google.com', 'facebook.com'],
'type': ['image', 'audio', 'image', 'video'],
'source': ['bbc','google','stackoverflow','facebook']}
df = pd.DataFrame(raw_data, columns = ['website', 'type', 'source'])
I would like to modify the values in column type with a condition that if the source exists in website, then suffix type with '_1stParty' else '_3rdParty'. The dataframe should eventually look like:
Test whether each row's source value is contained in its website value, using in with apply to process each row separately:
m = df.apply(lambda x: x['source'] in x['website'], axis=1)
Or use zip with list comprehension:
m = [a in b for a, b in zip(df['source'], df['website'])]
and then add new values by numpy.where:
df['type'] += np.where(m, '_1stParty', '_3rdParty')
#'long' alternative
#df['type'] = df['type'] + np.where(m, '_1stParty', '_3rdParty')
print (df)
website type source
0 bbc.com image_1stParty bbc
1 cnn.com audio_3rdParty google
2 google.com image_3rdParty stackoverflow
3 facebook.com video_1stParty facebook
You can use the apply method for this:
df["type"] = df.apply(lambda row: f"{row.type}_1stParty" if row.source in row.website \
                      else f"{row.type}_3rdParty", axis=1)
df
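This solution should be faster than the others that use apply(). Note that it compares the part of the website before the first dot with the source for equality, rather than testing for a substring: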
df.type += df.website.str.split('.').str[0].eq(df.source).\
replace({True: '_1stParty', False: '_3rdParty'})
I need to reduce the values of a column to two groups: the Country column has several values, but I only need US and Non-US in my pandas DataFrame. Please suggest how to achieve this.
I tried the code below, but no luck:
1.
if df['Country'] != 'United-States':
    df['Country'] = 'Non-US'
2.
df.loc[df['Country'] != 'United-States', 'Country'] = 'Non-US'
You need:
df = pd.DataFrame({'Country': ['United-States', 'Canada', 'Slovakia']})
print(df)
Country
0 United-States
1 Canada
2 Slovakia
df['Country'] = np.where(df['Country'] == 'United-States', 'US', 'Non-US')
Or:
df['Country'] = np.where(df['Country'] != 'United-States', 'Non-US', 'US')
Another solution:
df['Country'] = df['Country'].map({'United-States':'US'}).fillna('Non-US')
print (df)
Country
0 US
1 Non-US
2 Non-US
Try the following:
US = df[df['Country']=='United-States']
Other = df[df['Country']!='United-States']
This splits the frame into the US rows and all other rows.
You can use the apply function as below:
df['Country'] = df['Country'].apply(lambda x: "Non-US" if x != 'United-States' else "United-States")
In addition to the NumPy version in @jezrael's answer, pandas also has its own Series.where() function:
>>> df = pd.DataFrame({'Country': ['United-States', 'Canada', 'Slovakia']})
>>> df.Country.where(df.Country == 'United-States', 'Non-US')
0 United-States
1 Non-US
2 Non-US
The where method is an application of the if-then idiom. For each
element in the calling DataFrame, if cond is True the element is used;
otherwise the corresponding element from the DataFrame other is used.
The signature for DataFrame.where() differs from numpy.where().
Roughly df1.where(m, df2) is equivalent to np.where(m, df1, df2).
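The rough equivalence quoted above can be checked on the example frame. A small sketch (the names mask, via_pandas and via_numpy are only for illustration): both calls pick the same values, but Series.where returns a Series that keeps the index, whereas np.where returns a plain NumPy array.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Country': ['United-States', 'Canada', 'Slovakia']})
mask = df['Country'] == 'United-States'

# keep the value where mask is True, otherwise use 'Non-US'
via_pandas = df['Country'].where(mask, 'Non-US')     # Series, index preserved
via_numpy = np.where(mask, df['Country'], 'Non-US')  # plain NumPy array
```

Note the argument order: Series.where takes the replacement for the False case only, while np.where takes both the True and the False values.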
I'm trying to reproduce my Stata code in Python, and I was pointed in the direction of Pandas. I am, however, having a hard time wrapping my head around how to process the data.
Let's say I want to iterate over all values in the column headed 'ID'. If an ID matches a specific number, I want to change the two corresponding values, FirstName and LastName.
In Stata it looks like this:
replace FirstName = "Matt" if ID==103
replace LastName = "Jones" if ID==103
So this replaces all values in FirstName that correspond with values of ID == 103 to Matt.
In Pandas, I'm trying something like this
df = read_csv("test.csv")
for i in df['ID']:
    if i == 103:
        ...
Not sure where to go from here. Any ideas?
One option is to use Python's slicing and indexing features to logically evaluate the places where your condition holds and overwrite the data there.
Assuming you can load your data directly into pandas with pandas.read_csv then the following code might be helpful for you.
import pandas
df = pandas.read_csv("test.csv")
df.loc[df.ID == 103, 'FirstName'] = "Matt"
df.loc[df.ID == 103, 'LastName'] = "Jones"
As mentioned in the comments, you can also do the assignment to both columns in one shot:
df.loc[df.ID == 103, ['FirstName', 'LastName']] = 'Matt', 'Jones'
Note that you'll need pandas version 0.11 or newer to make use of loc for overwrite assignment operations. Indeed, for older versions like 0.8 (despite what critics of chained assignment may say), chained assignment is the correct way to do it, hence why it's useful to know about even if it should be avoided in more modern versions of pandas.
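To try the .loc assignment without a CSV file, here is a self-contained sketch. The frame below is a made-up stand-in for test.csv, and the names in it are invented for illustration.

```python
import pandas as pd

# stand-in for test.csv; the rows are made up for illustration
df = pd.DataFrame({'ID': [103, 101, 103, 103],
                   'FirstName': ['Ann', 'Bob', 'Cy', 'Di'],
                   'LastName': ['Lee', 'Ray', 'Fox', 'Orr']})

mask = df['ID'] == 103  # build the condition once and reuse it

# one value per listed column, applied to every matching row
df.loc[mask, ['FirstName', 'LastName']] = ['Matt', 'Jones']
```

Rows where the mask is False (here the row with ID 101) are left untouched.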
Another way to do it is to use what is called chained assignment. The behavior of this is less stable and so it is not considered the best solution (it is explicitly discouraged in the docs), but it is useful to know about:
import pandas
df = pandas.read_csv("test.csv")
df['FirstName'][df.ID == 103] = "Matt"
df['LastName'][df.ID == 103] = "Jones"
You can use map; it can map values from a dictionary or even a custom function.
Suppose this is your df:
ID First_Name Last_Name
0 103 a b
1 104 c d
Create the dicts:
fnames = {103: "Matt", 104: "Mr"}
lnames = {103: "Jones", 104: "X"}
And map:
df['First_Name'] = df['ID'].map(fnames)
df['Last_Name'] = df['ID'].map(lnames)
The result will be:
ID First_Name Last_Name
0 103 Matt Jones
1 104 Mr X
Or use a custom function:
names = {103: ("Matt", "Jones"), 104: ("Mr", "X")}
df['First_Name'] = df['ID'].map(lambda x: names[x][0])
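One thing to watch with map: IDs that are missing from the dict become NaN, so rows you did not mean to touch lose their names. A possible way around this, sketched here with a hypothetical extra row (ID 105) that has no entry in the dict, is to fill the gaps from the existing column.

```python
import pandas as pd

df = pd.DataFrame({'ID': [103, 104, 105],            # 105 is a made-up row
                   'First_Name': ['a', 'c', 'e'],
                   'Last_Name': ['b', 'd', 'f']})

fnames = {103: 'Matt', 104: 'Mr'}

# IDs absent from the dict map to NaN; fillna restores the existing value
df['First_Name'] = df['ID'].map(fnames).fillna(df['First_Name'])
```

With this, the row for ID 105 keeps its original first name.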
The original question addresses a specific narrow use case. For those who need more generic answers here are some examples:
Creating a new column using data from other columns
Given the dataframe below:
import pandas as pd
import numpy as np
df = pd.DataFrame([['dog', 'hound', 5],
['cat', 'ragdoll', 1]],
columns=['animal', 'type', 'age'])
In [1]: df
Out[1]:
animal type age
----------------------
0 dog hound 5
1 cat ragdoll 1
Below we add a new description column as a concatenation of other columns, using the + operator, which is overloaded for Series. Fancy string formatting, f-strings etc. won't work here since the operands are whole Series and not scalar ('primitive') values:
df['description'] = 'A ' + df.age.astype(str) + ' years old ' \
+ df.type + ' ' + df.animal
In [2]: df
Out[2]:
animal type age description
-------------------------------------------------
0 dog hound 5 A 5 years old hound dog
1 cat ragdoll 1 A 1 years old ragdoll cat
We get 1 years for the cat (instead of 1 year) which we will be fixing below using conditionals.
Modifying an existing column with conditionals
Here we are replacing the original animal column with values from other columns, and using np.where to set a conditional substring based on the value of age:
# append 's' to 'age' if it's greater than 1
df.animal = df.animal + ", " + df.type + ", " + \
df.age.astype(str) + " year" + np.where(df.age > 1, 's', '')
In [3]: df
Out[3]:
animal type age
-------------------------------------
0 dog, hound, 5 years hound 5
1 cat, ragdoll, 1 year ragdoll 1
Modifying multiple columns with conditionals
A more flexible approach is to call .apply() on an entire dataframe rather than on a single column:
def transform_row(r):
    animal = r.animal  # keep the original value before it is overwritten
    r.animal = 'wild ' + r.type
    r.type = animal + ' creature'
    r.age = "{} year{}".format(r.age, 's' if r.age > 1 else '')
    return r
df.apply(transform_row, axis=1)
In[4]:
Out[4]:
animal type age
----------------------------------------
0 wild hound dog creature 5 years
1 wild ragdoll cat creature 1 year
In the code above the transform_row(r) function takes a Series object representing a given row (indicated by axis=1, the default value of axis=0 will provide a Series object for each column). This simplifies processing since you can access the actual 'primitive' values in the row using the column names and have visibility of other cells in the given row/column.
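The difference between the two axis values can be seen on the original animal frame. A small sketch (per_row and per_column are names chosen for illustration):

```python
import pandas as pd

df = pd.DataFrame([['dog', 'hound', 5],
                   ['cat', 'ragdoll', 1]],
                  columns=['animal', 'type', 'age'])

# axis=1: called once per row; each Series is indexed by column name
per_row = df.apply(lambda r: '{} ({})'.format(r['animal'], r['type']), axis=1)

# axis=0 (default): called once per column; each Series is indexed by row label
per_column = df.apply(lambda c: c.iloc[0])
```

per_row has one entry per row of df, while per_column has one entry per column, holding that column's first value.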
This question might still be visited often enough that it's worth offering an addendum to Mr Kassies' answer. The dict built-in class can be sub-classed so that a default is returned for 'missing' keys. This mechanism works well for pandas. But see below.
In this way it's possible to avoid key errors.
>>> import pandas as pd
>>> data = { 'ID': [ 101, 201, 301, 401 ] }
>>> df = pd.DataFrame(data)
>>> class SurnameMap(dict):
...     def __missing__(self, key):
...         return ''
...
>>> surnamemap = SurnameMap()
>>> surnamemap[101] = 'Mohanty'
>>> surnamemap[301] = 'Drake'
>>> df['Surname'] = df['ID'].apply(lambda x: surnamemap[x])
>>> df
ID Surname
0 101 Mohanty
1 201
2 301 Drake
3 401
The same thing can be done more simply in the following way. The use of the 'default' argument for the get method of a dict object makes it unnecessary to subclass a dict.
>>> import pandas as pd
>>> data = { 'ID': [ 101, 201, 301, 401 ] }
>>> df = pd.DataFrame(data)
>>> surnamemap = {}
>>> surnamemap[101] = 'Mohanty'
>>> surnamemap[301] = 'Drake'
>>> df['Surname'] = df['ID'].apply(lambda x: surnamemap.get(x, ''))
>>> df
ID Surname
0 101 Mohanty
1 201
2 301 Drake
3 401
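Both variants can also be written without a lambda, since Series.map accepts a dict directly. As a sketch: for a dict subclass that defines __missing__, map uses that default for absent keys; for a plain dict, absent keys become NaN and can be filled afterwards. The names via_subclass and via_plain are only for illustration.

```python
import pandas as pd

df = pd.DataFrame({'ID': [101, 201, 301, 401]})

class SurnameMap(dict):
    def __missing__(self, key):
        # default returned for IDs that have no surname
        return ''

surnamemap = SurnameMap({101: 'Mohanty', 301: 'Drake'})

# Series.map honours __missing__ on a dict subclass
via_subclass = df['ID'].map(surnamemap)

# with a plain dict, absent keys become NaN and are filled afterwards
via_plain = df['ID'].map({101: 'Mohanty', 301: 'Drake'}).fillna('')
```

Either result can be assigned to df['Surname'] as in the examples above.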
df['FirstName']=df['ID'].apply(lambda x: 'Matt' if x==103 else '')
df['LastName']=df['ID'].apply(lambda x: 'Jones' if x==103 else '')
In case someone is looking for a way to change the values of multiple rows based on some logical condition of each row itself, using .apply() with a function is the way to go.
df = pd.DataFrame({'col_a':[0,0], 'col_b':[1,2]})
col_a col_b
0 0 1
1 0 2
def func(row):
    if row.col_a == 0 and row.col_b <= 1:
        row.col_a = -1
        row.col_b = -1
    return row
df.apply(func, axis=1)
col_a col_b
0 -1 -1 # Modified row
1 0 2
Although .apply() is typically used to add a new row/column to a dataframe, it can be used to modify the values of existing rows/columns.
I found it much easier to debug by printing out where each row meets the condition:
for n in df.columns:
    # np.where returns a tuple, which is always truthy, so test with .any() instead
    if (df[n] == 103).any():
        print(n)
        print(df[df[n] == 103].index)