This is my first time trying to use a lambda function; please help me determine what I'm doing incorrectly. I wrote a function to output time zones based on zip codes. The function works, but I'm not sure how to implement it as a lambda function to create a new column in my dataframe.
import pandas as pd
from pyzipcode import ZipCodeDatabase
zcdb = ZipCodeDatabase()
def find_tz(zip_code):
    try:
        tz = zcdb[zip_code].timezone
        return tz
    except:
        return '?'
data = [['Jane','92804'], ['Bob','75014'], ['Ashley','07650']]
df = pd.DataFrame(data, columns=['Contact','Zip'])
in: df
out:
Contact Zip
0 Jane 92804
1 Bob 75014
2 Ashley 07650
Do note that the zip code column data are strings, since US zip codes can have leading 0s.
Me testing that the function I wrote works on values from df:
in: print(find_tz(df.loc[0,'Zip']))
    print(find_tz(df.loc[1,'Zip']))
    print(find_tz(df.loc[2,'Zip']))
out:
-8
-6
-5
My attempt at using a lambda function to create a new Timezone column, and the incorrect result I am getting:
in: df = df.assign(Timezone = lambda x: find_tz(x.Zip))
df
out:
Contact Zip Timezone
0 Jane 92804 ?
1 Bob 75014 ?
2 Ashley 07650 ?
My desired resulting dataframe would look like:
Contact Zip Timezone
0 Jane 92804 -8
1 Bob 75014 -6
2 Ashley 07650 -5
ETA: when I changed my find_tz() function to something like concatenating the input with another string of text, the lambda worked as I expected, so I'm not sure what I've done wrong.
You can use:
df['Timezone'] = df.Zip.apply(find_tz)
When you call lambda x: find_tz(x.Zip), the find_tz function is passed the whole Zip column as a Pandas Series, not the individual zip codes. Indexing zcdb with a Series raises an exception, so your bare except returns a single '?', which assign then broadcasts down the entire column. Concatenating a string onto the input appeared to work because + on a Series operates element-wise.
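If you want to keep the assign style, have the lambda apply find_tz element-wise; x inside the lambda is the whole dataframe, so the sketch below (using the names from the question) applies the function to each zip individually:

# x is the entire dataframe; apply find_tz to each element of its Zip column
df = df.assign(Timezone=lambda x: x['Zip'].apply(find_tz))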
I currently have 2 CSV files and am reading them both in. I need to get the IDs from one CSV and find them in the other so that I can get their rows of data. Currently I have the following code, which I believe goes through the first dataframe, but it only keeps the last match in the new dataframe. I need it to add all of the matching rows instead.
Here is my code:
patientSet = pd.read_csv("794_chips_RMA.csv")
affSet = probeset[probeset['Analysis']==1].reset_index(drop=True)
houseGenes = probeset[probeset['Analysis']==0].reset_index(drop=True)

for x in affSet['Probeset']:
    #patients = patientSet[patientSet['ID']=='1557366_at'].reset_index(drop=True)
    #patients = patientSet[patientSet['ID']=='224851_at'].reset_index(drop=True)
    patients = patientSet[patientSet['ID']==x].reset_index(drop=True)

print(affSet['Probeset'])
print(patientSet['ID'])
print(patients)
The following is the output:
0 1557366_at
1 224851_at
2 1554784_at
3 231578_at
4 1566643_a_at
5 210747_at
6 231124_x_at
7 211737_x_at
Name: Probeset, dtype: object
0 1007_s_at
1 1053_at
2 117_at
3 121_at
4 1255_g_at
...
54670 AFFX-ThrX-5_at
54671 AFFX-ThrX-M_at
54672 AFFX-TrpnX-3_at
54673 AFFX-TrpnX-5_at
54674 AFFX-TrpnX-M_at
Name: ID, Length: 54675, dtype: object
ID phchp003v1 phchp003v2 phchp003v3 ... phchp367v1 phchp367v2 phchp368v1 phchp368v2
0 211737_x_at 12.223453 11.747159 9.941889 ... 14.828389 9.322779 10.609053 10.771162
As you can see, it is only matching the very last ID from the first dataframe, and not all of them. How can I get them all to match and end up in patients? Thank you.
You probably want to use the merge function:
df_inner = pd.merge(df1, df2, on='id', how='inner')
See https://www.datacamp.com/community/tutorials/joining-dataframes-pandas and search for "inner join".
--edit--
You can specify the columns to join on (using the left_on and right_on parameters); see: https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html#database-style-dataframe-or-named-series-joining-merging
@Rui Lima already posted the correct approach, but note that the key column is named 'ID' in patientSet and 'Probeset' in affSet, so you'll need something like the following to make it work:
df = pd.merge(patientSet, affSet, left_on='ID', right_on='Probeset', how='inner')
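Alternatively, if all you need is the matching rows from patientSet rather than a joined frame, isin avoids the loop entirely (a sketch using the frames named in the question):

# keep every patientSet row whose ID appears anywhere in affSet's Probeset column
patients = patientSet[patientSet['ID'].isin(affSet['Probeset'])].reset_index(drop=True)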
I have a CSV file that looks like the one below. This is the same as my last question, but this time using Pandas.
Group Sam Dan Bori Son John Mave
A 0.00258844 0.983322 1.61479 1.2785 1.96963 10.6945
B 0.0026034 0.983305 1.61198 1.26239 1.9742 10.6838
C 0.0026174 0.983294 1.60913 1.24543 1.97877 10.6729
D 0.00263062 0.983289 1.60624 1.22758 1.98334 10.6618
E 0.00264304 0.98329 1.60332 1.20885 1.98791 10.6505
I have a function like the one below:
def getnewno(value):
    value = value + 30
    if value > 40:
        value = value - 20
    else:
        value = value
    return value
I want to send all these values to the getnewno function, get a new value back, and update the CSV file. How can this be accomplished in Pandas?
Expected output:
Group Sam Dan Bori Son John Mave
A 30.00258844 30.983322 31.61479 31.2785 31.96963 20.6945
B 30.0026034 30.983305 31.61198 31.26239 31.9742 20.6838
C 30.0026174 30.983294 31.60913 31.24543 31.97877 20.6729
D 30.00263062 30.983289 31.60624 31.22758 31.98334 20.6618
E 30.00264304 30.98329 31.60332 31.20885 31.98791 20.6505
The following should give you what you desire.
Applying a function
Your function can be simplified, and is expressed here as a lambda function.
It's then a matter of applying that function to all of the columns. There are a number of ways to do so: the first idea that comes to mind is to loop over df.columns, but we can do better than this by using the applymap or transform methods:
import pandas as pd

# Read in the data from file
df = pd.read_csv('data.csv',
                 sep='\s+',
                 index_col=0)

# Simplified function with which to transform the data
getnewno = lambda value: value + 10 if value > 10 else value + 30

# Looping over columns
#for col in df.columns:
#    df[col] = df[col].apply(getnewno)

# Apply to all columns without a loop
df = df.applymap(getnewno)

# Write out the updated data
df.to_csv('data_updated.csv')
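Note that on pandas 2.1 and later, applymap is deprecated in favour of the equivalent DataFrame.map, so there you would write:

# element-wise application, same result as applymap on older pandas
df = df.map(getnewno)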
Using broadcasting
You can achieve your result using broadcasting and a little boolean logic. This avoids looping over any columns, and should ultimately prove faster and less memory intensive (although if your dataset is small any speed-up would be negligible):
import pandas as pd

df = pd.read_csv('data.csv',
                 sep='\s+',
                 index_col=0)

# Add 30 everywhere, then pull back by 20 wherever the result exceeds 40
df += 30
make_smaller = df > 40
df[make_smaller] -= 20
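Equivalently, the whole transformation collapses into a single vectorized expression with numpy.where (a sketch assuming every column is numeric, as in the sample data):

import numpy as np

# +30 followed by -20 when over 40 is the same as: +10 if value > 10, else +30
df = pd.DataFrame(np.where(df > 10, df + 10, df + 30),
                  index=df.index, columns=df.columns)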
First of all, your getnewno function looks more complicated than it needs to be; it can be simplified to e.g.:

def getnewno(value):
    if value + 30 > 40:
        return value + 10
    else:
        return value + 30

You can even change value + 30 > 40 to value > 10.
Or even a one-liner if you want:

getnewno = lambda value: value + 10 if value > 10 else value + 30
Having the function, you can apply it to specific values/columns. For example, if you want to create a column Mark_updated based on a Mark column, it would look like this (I assume your pandas DataFrame is called df):
df['Mark_updated'] = df['Mark'].apply(getnewno)
Use the mask function to do an if-else solution before writing the data to CSV:

res = (df
       .select_dtypes('number')
       .add(30)
       # the if-else comes in here:
       # if an entry is greater than 40, subtract 20 from it,
       # else leave it as is
       .mask(lambda x: x > 40, lambda x: x.sub(20))
      )

# insert the Group column back
res.insert(0, 'Group', df.Group.array)

Write to CSV:
res.to_csv(filename)
Group Sam Dan Bori Son John Mave
0 A 30.002588 30.983322 31.61479 31.27850 31.96963 20.6945
1 B 30.002603 30.983305 31.61198 31.26239 31.97420 20.6838
2 C 30.002617 30.983294 31.60913 31.24543 31.97877 20.6729
3 D 30.002631 30.983289 31.60624 31.22758 31.98334 20.6618
4 E 30.002643 30.983290 31.60332 31.20885 31.98791 20.6505
I have the following dataframe:
id ip
1 219.237.42.155
2 75.74.144.120
3 219.237.42.155
By using the maxminddb-geolite2 package, I can find out what city a specific IP is assigned to. The following code:
from geolite2 import geolite2
reader = geolite2.reader()
reader.get('219.237.42.155')
will return a dictionary, and by looking up keys, I can actually get a city name:
reader.get('219.237.42.155')['city']['names']['en']
returns:
'Beijing'
The problem I have is that I do not know how to get the city for each IP in the dataframe and put it in a third column, so that the result would be:
id ip city
1 219.237.42.155 Beijing
2 75.74.144.120 Hollywood
3 219.237.42.155 Beijing
The farthest I got was mapping the whole dictionary to a separate column by using the code:
df['city'] = df['ip'].apply(lambda x: reader.get(x))
On the other hand:
df['city'] = df['ip'].apply(lambda x: reader.get(x)['city']['names']['en'])
throws a KeyError. What am I missing?
You can use apply to check whether the lookup succeeds before trying to access its keys (reader.get returns None when an IP is not found, and pd.isnull(None) is True):
df.apply(lambda x: reader.get(x.ip), axis=1).apply(lambda x: np.nan if pd.isnull(x) else x['city']['names']['en'])
Out[39]:
0 Beijing
1 NaN
2 Beijing
dtype: object
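A slightly more defensive variant wraps the lookup in a named helper and also guards against records that exist but lack a 'city' entry; that extra guard is an assumption about the data, not something shown in the question:

import numpy as np

def ip_to_city(ip):
    record = reader.get(ip)  # returns None when the IP is not in the database
    try:
        return record['city']['names']['en']
    except (TypeError, KeyError):  # no record at all, or no city info in it
        return np.nan

df['city'] = df['ip'].apply(ip_to_city)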
If I have a list of headers and I am using pandas:
[u'GAME_ID', u'TEAM_ID', u'TEAM_ABBREVIATION', u'TEAM_CITY', u'PLAYER_ID', u'PLAYER_NAME', u'START_POSITION', u'COMMENT', u'MIN', u'SPD', u'DIST', u'ORBC', u'DRBC', u'RBC', u'TCHS', u'SAST', u'FTAST', u'PASS', u'AST', u'CFGM', u'CFGA', u'CFG_PCT', u'UFGM', u'UFGA', u'UFG_PCT', u'FG_PCT', u'DFGM', u'DFGA', u'DFG_PCT']
Why do I get an output that is shortened, like the following?
PLAYER_NAME START_POSITION COMMENT MIN SPD ... CFGM CFGA \
0 Billy Bob G 37:42 4.12 5 ... 5 12
Why does pandas skip the other stats, even though my code states:
output = pd.DataFrame(data, columns=stts)
print output
It's done on purpose, more specifically through pandas' Options and Settings.
You can change it through display.max_columns (set to 20 by default), as well as display.max_colwidth. The full list of defaults is in the pandas options documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/options.html
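For example, to make pandas print every column (a minimal sketch):

import pandas as pd

pd.set_option('display.max_columns', None)  # show all columns, however many
pd.set_option('display.width', None)        # don't wrap to a fixed line width
print(output)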
Simple question here: how do I replace all of the whitespace entries in a column with a zero?
For example:
Name Age
John 12
Mary
Tim 15
into
Name Age
John 12
Mary 0
Tim 15
I've been trying something like this, but I am unsure how Pandas actually reads whitespace:
merged['Age'].replace(" ", 0).bfill()
Any ideas?
merged['Age'] = merged['Age'].apply(lambda x: 0 if x == ' ' else x)
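If the blank cells can contain more than a single space character, a variant that strips whitespace first is safer (an assumption about the data, since only ' ' appears in the sample):

# treat any all-whitespace string as blank, leave everything else untouched
merged['Age'] = merged['Age'].apply(
    lambda x: 0 if isinstance(x, str) and x.strip() == '' else x)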
Use the built-in method convert_objects and set the param convert_numeric=True:
In [12]:
# convert objects will handle multiple whitespace, this will convert them to NaN
# we then call fillna to convert those to 0
df.Age = df[['Age']].convert_objects(convert_numeric=True).fillna(0)
df
Out[12]:
Name Age
0 John 12
1 Mary 0
2 Tim 15
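Note that convert_objects has since been deprecated and removed from pandas; on modern versions the equivalent uses pd.to_numeric:

# coerce non-numeric entries (such as whitespace) to NaN, then fill with 0
df.Age = pd.to_numeric(df.Age, errors='coerce').fillna(0)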
Here's an answer modified from a related, more thorough question. I'll make it a little bit more Pythonic and resolve your basestring issue.
def ws_to_zero(maybe_ws):
    try:
        if maybe_ws.isspace():
            return 0
        else:
            return maybe_ws
    except AttributeError:
        return maybe_ws
d.applymap(ws_to_zero)
where d is your dataframe.
If you want to use NumPy, then you can use the below snippet:
import numpy as np

df['column_of_interest'] = np.where(df['column_of_interest'] == ' ',
                                    0,
                                    df['column_of_interest']).astype(float)
While Paulo's response is excellent, my snippet above may be useful when multiple criteria are required during advanced data manipulation.