Iterate over all DataFrame rows and perform startswith() - python

My df looks like this:
As you can see, the user starts with 'ff', and it could be in the access column or any other column rather than the user column.
I want to create a new column in this df called "UserId": whenever a value in any column starts with 'ff', copy that value to the new "UserId" column.
I have been using this method, which works fine, but I have to repeat the line for every column:
hist.loc[hist.User.str.startswith("ff", na=False), 'UserId'] = hist['User'].str[2:]
Is there another method I can use to handle all columns at once?
Thanks

If you are cool with picking only the first occurrence:
df['UserID'] = df.apply(lambda x: x[x.str.startswith('ff')][:1], axis=1)
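A variant of the same first-occurrence idea that returns a scalar per row instead of a sub-Series (a sketch; the column names come from the question's sample, and NaN is used when no cell matches):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'user': ['fftest', 'fwadmin', 'no'],
    'access': ['fwadmin', 'ffuser2', 'user'],
})

# For each row, take the first value starting with 'ff' and strip the prefix;
# next() with a default gives NaN for rows with no match.
df['UserID'] = df.apply(
    lambda row: next(
        (v[2:] for v in row if isinstance(v, str) and v.startswith('ff')),
        np.nan,
    ),
    axis=1,
)
print(df)
```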

NumPy + Pandas solution below.
In case of ambiguity (several ff-strings in a row) the leftmost occurrence is taken. In case of absence (no ff-string in a row) a NaN value is used.
import pandas as pd, numpy as np

df = pd.DataFrame({
    'user': ['fftest', 'fwadmin', 'fshelpdesk3', 'no', 'ffone'],
    'access': ['fwadmin', 'ffuser2', 'fwadmin', 'user', 'fftwo'],
    'station': ['fshelpdesk', 'fshelpdesk2', 'ffuser3', 'here', 'three'],
})
# np.str was removed in NumPy 1.24; the built-in str works in all versions
sv = df.values.astype(str)
# reversed so that, with duplicate row indices, the leftmost match is assigned last
ix = np.argwhere(np.char.startswith(sv, 'ff'))[::-1].T
df.loc[ix[0], 'UserID'] = pd.Series(sv[(ix[0], ix[1])]).str[2:].values
print(df)
Output:

          user   access      station UserID
0       fftest  fwadmin   fshelpdesk   test
1      fwadmin  ffuser2  fshelpdesk2  user2
2  fshelpdesk3  fwadmin      ffuser3  user3
3           no     user         here    NaN
4        ffone    fftwo        three    one

Hey, here is my attempt at solving the problem; hope it helps.
d = df[df.apply(lambda x: x.str.startswith('ff'))]
df['user_id'] = d['user'].fillna(d['access'].fillna(d['station']))
Result:

          user   access      station  user_id
0       fftest  fwadmin   fshelpdesk   fftest
1      fwadmin  ffuser2  fshelpdesk2  ffuser2
2  fshelpdesk3  fwadmin      ffuser3  ffuser3
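The fillna chain above can be generalized to any number of columns with bfill along axis=1, which takes the leftmost surviving value per row (a sketch on a similar sample frame):

```python
import pandas as pd

df = pd.DataFrame({
    'user': ['fftest', 'fwadmin', 'fshelpdesk3'],
    'access': ['fwadmin', 'ffuser2', 'fwadmin'],
    'station': ['fshelpdesk', 'fshelpdesk2', 'ffuser3'],
})

# Keep only cells starting with 'ff' (others become NaN),
# then back-fill across columns and take the first column.
d = df[df.apply(lambda col: col.str.startswith('ff'))]
df['user_id'] = d.bfill(axis=1).iloc[:, 0]
print(df)
```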

Related

Test for np.nan in row wise function

I have a function that I want to apply to each row of a data frame to create a new column. The value returned from the function will be different if certain columns are NaN. I have several conditions in the function (more complicated than the example below), otherwise, I would use np.where.
How do I test if the column is nan using the function below? I tried row['id'] is np.nan but that doesn't work.
# data
import numpy as np
import pandas as pd

d = {'name': {0: 'dave', 1: 'hagen'},
     'id': {0: 123456.0, 1: np.nan},
     'position': {0: np.nan, 1: 5600.0},
     'test': {0: 'has an id', 1: 'has an id'}}
df = pd.DataFrame(d)

# function
def test_func(row):
    if row['id'] is np.nan:
        val = "missing id"
    else:
        val = 'has an id'
    return val

# apply function
df['test'] = df.apply(test_func, axis=1)
Results (I would expect the test column to say "missing id" on the second row, since id is np.nan on that row):
Use np.where:

>>> df
    name        id  position
0   dave  123456.0       NaN
1  hagen       NaN    5600.0

df['test'] = np.where(df['id'].isna(), 'missing id', 'has an id')

>>> df
    name        id  position        test
0   dave  123456.0       NaN   has an id
1  hagen       NaN    5600.0  missing id
NumPy has its own isnan() function; the simplest way is:
val = "missing id" if np.isnan(row['id']) else 'has an id'
You can simply use isna and replace the booleans with your custom strings using replace:
df['test'] = df['id'].isna().replace({True: 'missing id', False: 'has an id'})
output:
name id position test
0 dave 123456.0 NaN has an id
1 hagen NaN 5600.0 missing id
Using apply for something this simple is usually a code smell. You can use the isna method to do the same thing in one line:
df.loc[df['id'].isna(), 'test'] = 'missing id'
The problem with NaN is that there are multiple representations of it, so `is` is not guaranteed to work by any means, and it is specifically defined so that NaN != NaN.
Given that all you are looking for is the mask df['id'].isna(), you may not need an explicit test column at all. That applies to more complicated conditions as well. You can combine masks using & and | element-wise bit operators. For example, to include a condition on either of the columns:
df.loc[df['id'].isna() | df['position'].isna(), 'test'] = 'missing id'
Alternatively, you can do
df.loc[df[['id', 'position']].isna().any(axis=1), 'test'] = 'missing id'
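When there are more than two outcomes, np.select keeps the same vectorized style as np.where; the conditions are checked in order and the first match wins (the labels below are hypothetical, for illustration only):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'id': [123456.0, np.nan, 789.0],
                   'position': [np.nan, 5600.0, 4200.0]})

# Conditions are evaluated top to bottom; the first True one picks the label.
conditions = [
    df['id'].isna() & df['position'].isna(),
    df['id'].isna(),
    df['position'].isna(),
]
choices = ['missing both', 'missing id', 'missing position']
df['test'] = np.select(conditions, choices, default='complete')
print(df)
```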

.replace codes will not replace column with new column in python

I am trying to read a column in Python and create a new column from it.
import pandas as pd

df = pd.read_csv(r'C:\Users\User\Documents\Research\seqadv.csv')
print(df)
df = pd.DataFrame(data={'WT_RESIDUE': ['']})
codes = {'ALA':'A', 'ARG':'R', 'ASN':'N', 'ASP':'D', 'CYS':'C', 'GLU':'E', 'GLN':'Q', 'GLY':'G', 'HIS':'H', 'ILE':'I', 'LEU':'L', 'LYS':'K', 'MET':'M', 'PHE':'F', 'PRO':'P', 'SER':'S', 'THR':'T', 'TRP':'W', 'TYR':'Y', 'VAL':'V'}
df['MUTATION_CODE'] = df['WT_RESIDUE'].replace(codes)
df.to_csv(r'C:\Users\User\Documents\Research\output.csv')
I tried this, but it will not create a new column no matter what I do.
It seems like you made a small mistake:
import pandas as pd

df = pd.read_csv(r'C:\Users\User\Documents\Research\seqadv.csv')
print(df)
df = pd.DataFrame(data={'WT_RESIDUE': ['']})  # Why do you have this line?
codes = {'ALA':'A', 'ARG':'R', 'ASN':'N', 'ASP':'D', 'CYS':'C', 'GLU':'E', 'GLN':'Q', 'GLY':'G', 'HIS':'H', 'ILE':'I', 'LEU':'L', 'LYS':'K', 'MET':'M', 'PHE':'F', 'PRO':'P', 'SER':'S', 'THR':'T', 'TRP':'W', 'TYR':'Y', 'VAL':'V'}
df['MUTATION_CODE'] = df['WT_RESIDUE'].replace(codes)
df.to_csv(r'C:\Users\User\Documents\Research\output.csv')
Try removing the line with the comment. It reinitializes your DataFrame, so the WT_RESIDUE column you read from the CSV becomes empty.
Considering a sample from the provided input:
we can use the map function to map the keys of the dict to the existing column and store the corresponding values in a new column.
df = pd.DataFrame({
    'WT_RESIDUE': ['ALA', "REMARK", 'VAL', "LYS"]
})
codes = {'ALA':'A', 'ARG':'R', 'ASN':'N', 'ASP':'D', 'CYS':'C', 'GLU':'E', 'GLN':'Q', 'GLY':'G', 'HIS':'H', 'ILE':'I', 'LEU':'L', 'LYS':'K', 'MET':'M', 'PHE':'F', 'PRO':'P', 'SER':'S', 'THR':'T', 'TRP':'W', 'TYR':'Y', 'VAL':'V'}
df['MUTATION_CODE'] = df.WT_RESIDUE.map(codes)
Input
WT_RESIDUE
0 ALA
1 REMARK
2 VAL
3 LYS
Output
WT_RESIDUE MUTATION_CODE
0 ALA A
1 REMARK NaN
2 VAL V
3 LYS K
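One difference worth knowing between the two approaches: map leaves unmatched keys as NaN, while replace keeps the original value, so rows like REMARK behave differently (a small sketch):

```python
import pandas as pd

df = pd.DataFrame({'WT_RESIDUE': ['ALA', 'REMARK', 'VAL']})
codes = {'ALA': 'A', 'VAL': 'V'}

mapped = df['WT_RESIDUE'].map(codes)        # unmatched keys -> NaN
replaced = df['WT_RESIDUE'].replace(codes)  # unmatched keys -> unchanged
print(mapped.tolist())
print(replaced.tolist())
```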

Return a matching value from 2 dataframes (1 dataframe with single value in cell, 1 with a list in a cell) into 1 dataframe

I have 2 dataframes:
df1
ID Type
2456-AA Coolant
2457-AA Elec
df2
ID Task
[2456-AA, 5656-BB] Check AC
[2456-AA, 2457-AA] Check Equip.
I'm trying return the matched ID's 'Type' from df1 to df2. With the result looking something like this:
df2
ID Task Type
[2456-AA, 5656-BB] Check AC [Coolant]
[2456-AA, 2457-AA] Check Equip. [Coolant , Elec]
I tried the following for loop. I understand it isn't the fastest, but I'm struggling to work out a faster alternative:
def type_identifier(type):
    df = df1.copy()
    device_type = []
    for value in df1.ID:
        for x in type:
            if x == value:
                device_type.append(df1.Type.tolist())
            else:
                None
    return device_type

df2['test'] = df2['ID'].apply(lambda x: type_identifier(x))
Could somebody help me out, and also refer me to material that could help me better approach problems like these?
Thank you,
Use pandas' to_dict to convert df1 to a dictionary, so we can efficiently translate an ID to its type.
Then apply a lambda that, for each ID list in df2, converts the IDs to the right types and assigns the result to the test column as you wished.
import pandas as pd

df1 = pd.DataFrame({'ID': ['2456-AA', '2457-AA'],
                    'Type': ['Coolant', 'Elec']})
df2 = pd.DataFrame({'ID': [['2456-AA', '5656-BB'], ['2456-AA', '2457-AA']],
                    'Task': ['Check AC', 'Check Equip.']})

# Use to_dict to convert df1 ids to types
id_to_type = df1.set_index('ID').to_dict()['Type']
# {'2456-AA': 'Coolant', '2457-AA': 'Elec'}
print(id_to_type)

# Apply a lambda that converts each `ID` list in `df2` to the right types
df2['test'] = df2['ID'].apply(lambda x: [id_to_type[t] for t in x if t in id_to_type])
print(df2)
print(df2)
Output:
ID Task test
0 [2456-AA, 5656-BB] Check AC [Coolant]
1 [2456-AA, 2457-AA] Check Equip. [Coolant, Elec]
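If the ID lists are long, an alternative sketch is to explode df2 to one row per ID, map the types, and regroup by the original row index (frame names match the sample above):

```python
import pandas as pd

df1 = pd.DataFrame({'ID': ['2456-AA', '2457-AA'],
                    'Type': ['Coolant', 'Elec']})
df2 = pd.DataFrame({'ID': [['2456-AA', '5656-BB'], ['2456-AA', '2457-AA']],
                    'Task': ['Check AC', 'Check Equip.']})

# explode keeps the original row index, so we can regroup on it afterwards.
exploded = df2.explode('ID')
exploded['Type'] = exploded['ID'].map(df1.set_index('ID')['Type'])
df2['test'] = exploded.groupby(level=0)['Type'].apply(
    lambda s: [t for t in s if pd.notna(t)]
)
print(df2)
```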

How to compare columns of two dataframes and have consequences when they match in Python Pandas

I am trying to have Python Pandas compare two dataframes. In dataframe 1, I have two columns (AC-Cat and Origin). I am trying to compare the AC-Cat column with the contents of Dataframe 2. If a match is found between one of the columns of Dataframe 2 and the AC-Cat value being studied, I want Pandas to copy the header of the Dataframe 2 column in which the match was found to a new column in Dataframe 1.
DF1:
f = {'AC-Cat': pd.Series(['B737', 'A320', 'MD11']),
'Origin': pd.Series(['AJD', 'JFK', 'LRO'])}
Flight_df = pd.DataFrame(f)
DF2:
w = {'CAT-C': pd.Series(['DC85', 'IL76', 'MD11', 'TU22', 'TU95']),
'CAT-D': pd.Series(['A320', 'A321', 'AN12', 'B736', 'B737'])}
WCat_df = pd.DataFrame(w)
I imported pandas as pd and numpy as np and tried to define a function to compare these columns.
def get_wake_cat(AC_cat):
    try:
        Wcat = [WCat_df.columns.values[0]][WCat_df.iloc[:, 1] == AC_cat].values[0]
    except:
        Wcat = np.NAN
    return Wcat

Flight_df.loc[:, 'CAT'] = Flight_df.loc[:, 'AC-Cat'].apply(lambda CT: get_wake_cat(CT))
However, the function does not produce the desired outputs. For example, take the B737 AC-Cat value: I want Pandas to find this value in DF2, in the column CAT-D, and copy that header to the new column of DF1. This does not happen. Can someone help me find out why my code is not giving the desired results?
Not pretty, but I think I got it working. Part of the error was that the function did not receive WCat_df. I also split the indexing into two steps:
def get_wake_cat(AC_cat, WCat_df):
    try:
        d = WCat_df[WCat_df.columns.values][WCat_df.iloc[:] == AC_cat]
        Wcat = d.columns[(d == AC_cat).any()][0]
    except:
        Wcat = np.NAN
    return Wcat
Then you need to change your next line to:
Flight_df.loc[:, 'CAT'] = Flight_df.loc[:, 'AC-Cat'].apply(lambda CT: get_wake_cat(CT, WCat_df))
AC-Cat Origin CAT
0 B737 AJD CAT-D
1 A320 JFK CAT-D
2 MD11 LRO CAT-C
Hope that solves the problem
This will give you two new columns with the name(s) of the match(es) found:
Flight_df['CAT1'] = Flight_df['AC-Cat'].map(lambda x: 'CAT-C' if x in list(WCat_df['CAT-C']) else '')
Flight_df['CAT2'] = Flight_df['AC-Cat'].map(lambda x: 'CAT-D' if x in list(WCat_df['CAT-D']) else '')
Flight_df.loc[Flight_df['CAT1'] == '', 'CAT1'] = Flight_df['CAT2']
Flight_df.loc[Flight_df['CAT1'] == Flight_df['CAT2'], 'CAT2'] = ''
IIUC, you can do a stack and merge:
final = (Flight_df.merge(WCat_df.stack().reset_index(1, name='AC-Cat'),
                         on='AC-Cat', how='left')
         .rename(columns={'level_1': 'New'}))
print(final)
Or with melt:
final = Flight_df.merge(WCat_df.melt(var_name='New', value_name='AC-Cat'),
                        on='AC-Cat', how='left')
AC-Cat Origin New
0 B737 AJD CAT-D
1 A320 JFK CAT-D
2 MD11 LRO CAT-C

Change one value based on another value in pandas

I'm trying to reproduce my Stata code in Python, and I was pointed in the direction of Pandas. I am, however, having a hard time wrapping my head around how to process the data.
Let's say I want to iterate over all values in the column 'ID'. If that ID matches a specific number, then I want to change two corresponding values, FirstName and LastName.
In Stata it looks like this:
replace FirstName = "Matt" if ID==103
replace LastName = "Jones" if ID==103
So this replaces all values in FirstName that correspond with values of ID == 103 to Matt.
In Pandas, I'm trying something like this
df = pd.read_csv("test.csv")
for i in df['ID']:
    if i == 103:
        ...
Not sure where to go from here. Any ideas?
One option is to use Python's slicing and indexing features to logically evaluate the places where your condition holds and overwrite the data there.
Assuming you can load your data directly into pandas with pandas.read_csv then the following code might be helpful for you.
import pandas
df = pandas.read_csv("test.csv")
df.loc[df.ID == 103, 'FirstName'] = "Matt"
df.loc[df.ID == 103, 'LastName'] = "Jones"
As mentioned in the comments, you can also do the assignment to both columns in one shot:
df.loc[df.ID == 103, ['FirstName', 'LastName']] = 'Matt', 'Jones'
Note that you need pandas version 0.11 or newer to use loc for overwrite assignment. In very old versions such as 0.8 (despite what critics of chained assignment may say), chained assignment was the usual approach, which is why it is still worth knowing about even though it should be avoided in modern pandas.
Another way to do it is to use what is called chained assignment. The behavior of this is less stable and so it is not considered the best solution (it is explicitly discouraged in the docs), but it is useful to know about:
import pandas
df = pandas.read_csv("test.csv")
df['FirstName'][df.ID == 103] = "Matt"
df['LastName'][df.ID == 103] = "Jones"
You can use map; it can map values from a dictionary or even a custom function.
Suppose this is your df:
ID First_Name Last_Name
0 103 a b
1 104 c d
Create the dicts:
fnames = {103: "Matt", 104: "Mr"}
lnames = {103: "Jones", 104: "X"}
And map:
df['First_Name'] = df['ID'].map(fnames)
df['Last_Name'] = df['ID'].map(lnames)
The result will be:
ID First_Name Last_Name
0 103 Matt Jones
1 104 Mr X
Or use a custom function:
names = {103: ("Matt", "Jones"), 104: ("Mr", "X")}
df['First_Name'] = df['ID'].map(lambda x: names[x][0])
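Note that map sets rows whose ID is missing from the dict to NaN, wiping any existing value; if those rows should be left untouched, fill back from the original column (a sketch with hypothetical data):

```python
import pandas as pd

df = pd.DataFrame({'ID': [103, 105], 'First_Name': ['a', 'c']})
fnames = {103: 'Matt'}

# Unmapped IDs become NaN; fillna restores the original values for those rows.
df['First_Name'] = df['ID'].map(fnames).fillna(df['First_Name'])
print(df)
```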
The original question addresses a specific narrow use case. For those who need more generic answers here are some examples:
Creating a new column using data from other columns
Given the dataframe below:
import pandas as pd
import numpy as np
df = pd.DataFrame([['dog', 'hound', 5],
                   ['cat', 'ragdoll', 1]],
                  columns=['animal', 'type', 'age'])
In[1]:
Out[1]:
animal type age
----------------------
0 dog hound 5
1 cat ragdoll 1
Below we add a new description column as a concatenation of other columns, using the + operator, which is overridden for Series. Fancy string formatting (f-strings etc.) won't work here, since + applies element-wise across the Series while f-strings operate on scalar values:
df['description'] = 'A ' + df.age.astype(str) + ' years old ' \
                    + df.type + ' ' + df.animal
In [2]: df
Out[2]:
animal type age description
-------------------------------------------------
0 dog hound 5 A 5 years old hound dog
1 cat ragdoll 1 A 1 years old ragdoll cat
We get 1 years for the cat (instead of 1 year) which we will be fixing below using conditionals.
Modifying an existing column with conditionals
Here we are replacing the original animal column with values from other columns, and using np.where to set a conditional substring based on the value of age:
# append 's' to 'year' when age is greater than 1
df.animal = df.animal + ", " + df.type + ", " + \
            df.age.astype(str) + " year" + np.where(df.age > 1, 's', '')
In [3]: df
Out[3]:
animal type age
-------------------------------------
0 dog, hound, 5 years hound 5
1 cat, ragdoll, 1 year ragdoll 1
Modifying multiple columns with conditionals
A more flexible approach is to call .apply() on an entire dataframe rather than on a single column:
def transform_row(r):
    animal = r.animal  # keep the original value; r.animal is reassigned next
    r.animal = 'wild ' + r.type
    r.type = animal + ' creature'
    r.age = "{} year{}".format(r.age, 's' if r.age > 1 else '')
    return r

df = df.apply(transform_row, axis=1)
In[4]:
Out[4]:
animal type age
----------------------------------------
0 wild hound dog creature 5 years
1 wild ragdoll cat creature 1 year
In the code above the transform_row(r) function takes a Series object representing a given row (indicated by axis=1, the default value of axis=0 will provide a Series object for each column). This simplifies processing since you can access the actual 'primitive' values in the row using the column names and have visibility of other cells in the given row/column.
This question might still be visited often enough that it's worth offering an addendum to Mr Kassies' answer. The dict built-in class can be sub-classed so that a default is returned for 'missing' keys. This mechanism works well for pandas. But see below.
In this way it's possible to avoid key errors.
>>> import pandas as pd
>>> data = { 'ID': [ 101, 201, 301, 401 ] }
>>> df = pd.DataFrame(data)
>>> class SurnameMap(dict):
...     def __missing__(self, key):
...         return ''
...
>>> surnamemap = SurnameMap()
>>> surnamemap[101] = 'Mohanty'
>>> surnamemap[301] = 'Drake'
>>> df['Surname'] = df['ID'].apply(lambda x: surnamemap[x])
>>> df
ID Surname
0 101 Mohanty
1 201
2 301 Drake
3 401
The same thing can be done more simply in the following way. The use of the 'default' argument for the get method of a dict object makes it unnecessary to subclass a dict.
>>> import pandas as pd
>>> data = { 'ID': [ 101, 201, 301, 401 ] }
>>> df = pd.DataFrame(data)
>>> surnamemap = {}
>>> surnamemap[101] = 'Mohanty'
>>> surnamemap[301] = 'Drake'
>>> df['Surname'] = df['ID'].apply(lambda x: surnamemap.get(x, ''))
>>> df
ID Surname
0 101 Mohanty
1 201
2 301 Drake
3 401
df['FirstName']=df['ID'].apply(lambda x: 'Matt' if x==103 else '')
df['LastName']=df['ID'].apply(lambda x: 'Jones' if x==103 else '')
In case someone is looking for a way to change the values of multiple rows based on some logical condition of each row itself, using .apply() with a function is the way to go.
df = pd.DataFrame({'col_a': [0, 0], 'col_b': [1, 2]})

   col_a  col_b
0      0      1
1      0      2

def func(row):
    if row.col_a == 0 and row.col_b <= 1:
        row.col_a = -1
        row.col_b = -1
    return row

df.apply(func, axis=1)

   col_a  col_b
0     -1     -1   # modified row
1      0      2

Although .apply() is typically used to add a new row/column to a dataframe, it can be used to modify the values of existing rows/columns.
I found it much easier to debug by printing out where each column meets the condition. Note that np.where returns a tuple of arrays, which is always truthy, so testing its result directly does not work; use .any() on the boolean mask instead:

for n in df.columns:
    if (df[n] == 103).any():
        print(n)
        print(df[df[n] == 103].index)
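A vectorized sketch of the same debugging idea, listing every (row, column) position where the value matches in one shot:

```python
import pandas as pd

df = pd.DataFrame({'ID': [103, 104], 'Other': [0, 103]})

# Boolean mask of matches, then stack to get one entry per (row, column) hit.
hits = (df == 103).stack()
positions = hits[hits].index.tolist()
print(positions)
```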
