adding a column to a pandas dataframe, based on dictionary key - python

I have the following dataframe:
id ip
1 219.237.42.155
2 75.74.144.120
3 219.237.42.155
By using the maxminddb-geolite2 package, I can find out which city a specific IP is assigned to. The following code:
from geolite2 import geolite2
reader = geolite2.reader()
reader.get('219.237.42.155')
will return a dictionary, and by looking up keys, I can actually get a city name:
reader.get('219.237.42.155')['city']['names']['en']
returns:
'Beijing'
The problem I have is that I do not know how to get the city for each ip in the dataframe and put it in the third column, so the result would be:
id ip city
1 219.237.42.155 Beijing
2 75.74.144.120 Hollywood
3 219.237.42.155 Beijing
The farthest I got was mapping the whole dictionary to a separate column by using the code:
df['city'] = df['ip'].apply(lambda x: reader.get(x))
On the other hand:
df['city'] = df['ip'].apply(lambda x: reader.get(x)['city']['names']['en'])
throws a KeyError. What am I missing?

You can use apply, guarding against missing records and missing keys before accessing the nested values (reader.get returns None for an unknown IP, and a found record may still lack the 'city' key, which is what raises your KeyError):
import numpy as np
df['ip'].apply(reader.get).apply(lambda r: r['city']['names']['en'] if r and 'city' in r else np.nan)
Out[39]:
0 Beijing
1 NaN
2 Beijing
dtype: object
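If you'd rather keep the lookup logic out of the lambda, a small try/except helper works too (a sketch; the helper name is illustrative):
import numpy as np

def ip_to_city(ip):
    """Return the English city name for an IP, or NaN when the
    record is missing or has no 'city' key."""
    try:
        return reader.get(ip)['city']['names']['en']
    except (TypeError, KeyError):  # None record, or missing nested key
        return np.nan

df['city'] = df['ip'].apply(ip_to_city)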

Related

Pandas remove every entry with a specific value

I would like to go through every row (entry) in my df and remove every entry that has the value "" (which, yes, is an empty string).
So if my data set is:
Name Gender Age
Jack 5
Anna F 6
Carl M 7
Jake M 7
Therefore Jack would be removed from the dataset.
On another note, I would also like to remove entries that have the value "Unspecified" or "Undetermined" as well.
Eg:
Name Gender Age Address
Jack 5 *address*
Anna F 6 *address*
Carl M 7 Undetermined
Jake M 7 Unspecified
Now,
Jack will be removed due to empty field.
Carl will be removed due to the value Undetermined present in a column.
Jake will be removed due to the value Unspecified present in a column.
For now, this has been my approach but I keep getting a TypeError.
list = []
for i in df.columns:
    if df[i] == "":
        # every time there is an empty string, add 1 to list
        list.append(1)
# count list to see how many entries there are with empty string
len(list)
Please help me with this. I would prefer a for loop, since there are about 22 columns and 9,000+ rows in my actual dataset.
Note - I do understand that there are other questions asked like this, it's just that none of them apply to my situation, meaning that most of them are only useful for a few columns and I do not wish to hardcode all 22 columns.
Edit - Thank you for all your feedback, you all have been incredibly helpful.
To delete a row based on a condition use the following:
df = df.drop(df[condition].index)
For example:
df = df.drop(df[df.Age == 5].index) will drop the rows where Age is 5.
I've come across a post regarding the same question dating back to 2017; it should help you understand it more clearly.
Regarding question 2, here's how to remove rows with the specified values in a given column:
df = df[~df["Address"].isin(("Undetermined", "Unspecified"))]
Let's assume we have a Pandas DataFrame object df.
To remove every row given your conditions, simply do:
df = df[~((df.Gender == "") | (df.Age == "") | df.Address.isin(["", "Undetermined", "Unspecified"]))]
(Note the element-wise | and the inverted mask ~; Python's or and in do not work on a Series.)
If the unspecified fields are NaN, you can also do:
df = df.dropna(how="any", axis = 0)
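To make that work when the blanks are empty strings rather than NaN, one option is to convert the unwanted values to NaN first (a sketch, assuming the flag values can appear in any of the 22 columns):
import numpy as np

df = df.replace(["", "Undetermined", "Unspecified"], np.nan).dropna(how="any", axis=0)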
The answers from #ThatCSFresher and #Bence will help you remove rows based on a single column... which is great!
However, I think there are multiple conditions in your query that need to be checked across multiple columns at once. So apply-lambda can do the job; try the following code:
df = pd.DataFrame({"Name":["Jack","Anna","Carl","Jake"],
"Gender":["","F","M","M"],
"Age":[5,6,7,7],
"Address":["address","address","Undetermined","Unspecified"]})
df["Noise_Tag"] = df.apply(lambda x: "Noise" if ("" in list(x)) or ("Undetermined" in list(x)) or ("Unspecified" in list(x)) else "No Noise",axis=1)
df1 = df[df["Noise_Tag"] == "No Noise"]
del df1["Noise_Tag"]
# Output of df;
Name Gender Age Address Noise_Tag
0 Jack 5 address Noise
1 Anna F 6 address No Noise
2 Carl M 7 Undetermined Noise
3 Jake M 7 Unspecified Noise
# Output of df1;
Name Gender Age Address
1 Anna F 6 address
Well, OP actually wants to delete any row containing an "empty" string.
df = df[~(df=="").any(axis=1)] # deletes all rows that have empty string in any column.
If you want to delete specifically for address column, then you can just delete using
df = df[~df["Address"].isin(("Undetermined", "Unspecified"))]
Or, if any column may contain Undetermined or Unspecified, do the same as the first solution in my post, just replacing the empty string with Undetermined or Unspecified.
df = df[~((df=="Undetermined") | (df=="Unspecified")).any(axis=1)]
You can build masks and then filter the df according to it:
m1 = df.eq('').any(axis=1)
# m1 is True if any cell in a row has an empty string
m2 = df['Address'].isin(['Undetermined', 'Unspecified'])
# m2 is True if a row has one of the values in the list in column 'Address'
out = df[~m1 & ~m2] # invert both conditions and get the desired output
print(out)
Output:
Name Gender Age Address
1 Anna F 6 *address*
Used Input:
df = pd.DataFrame({'Name': ['Jack', 'Anna', 'Carl', 'Jake'],
                   'Gender': ['', 'F', 'M', 'M'],
                   'Age': [5, 6, 7, 7],
                   'Address': ['*address*', '*address*', 'Undetermined', 'Unspecified']})
Using a lambda function:
Code:
df[df.apply(lambda x: False if (x.Address in ['Undetermined', 'Unspecified'] or '' in list(x)) else True, axis=1)]
Output:
Name Gender Age Address
1 Anna F 6 *address*

Add string in a certain position in column in dataframe

Basically this:
hash = "355879ACB6"
hash = hash[:4] + '-' + hash[4:]
print (hash)
3558-79ACB6
I got this part above from another stackoverflow post here, but I need it for a DataFrame.
I am only able to successfully add strings before and after, like this:
data ['col1'] = data['col1'] + 'teststring'
If I try the solution from the link above [:amountofcharacterstocutafter] to add values at a certain position, which would be something like:
test = data[:2] + 'zz'
print (test)
It does not seem applicable, as the [:2] operator works differently for dataframes than it does for strings: it cuts the output after the first 2 rows.
Goal:
I want to add a ' - ' at a certain position. Let's say the input row value is 'TTTT1234', output should be 'TTTT-1234'. For every row.
You can perform the operation you presented on a plain string, but you have a column in a dataframe, so it's (a bit) different.
So while you can do this:
hash = "355879ACB6"
hash = hash[:4] + '-' + hash[4:]
in order to do this on a dataframe there are at least 2 ways:
consider this dummy df:
LOCATION Hash
0 USA 355879ACB6
1 USA 455879ACB6
2 USA 388879ACB6
3 USA 800879ACB6
4 JAPAN 355870BCB6
5 JAPAN 355079ACB6
A. vectorization: the most efficient way
df['new_hash']=df['Hash'].str[:4]+'-'+df['Hash'].str[4:]
LOCATION Hash new_hash
0 USA 355879ACB6 3558-79ACB6
1 USA 455879ACB6 4558-79ACB6
2 USA 388879ACB6 3888-79ACB6
3 USA 800879ACB6 8008-79ACB6
4 JAPAN 355870BCB6 3558-70BCB6
5 JAPAN 355079ACB6 3550-79ACB6
B. apply lambda: intuitive to implement but less attractive in terms of performance
df['new_hash'] = df.apply(lambda x: x['Hash'][:4]+'-'+x['Hash'][4:], axis=1)
Use pd.Series.str. For example:
import pandas as pd
df = pd.DataFrame({
    "c": ["TTTT1234"]
})
df["c"].str[:4] + "-" + df["c"].str[4:] # It will output 'TTTT-1234'
pd.Series.str gives vectorized string functions.
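As an aside, pandas also ships a vectorized helper for exactly this insert-at-a-position pattern: Series.str.slice_replace replaces the slice [start:stop] with the given string, so replacing the empty slice [4:4] inserts the separator:
df["c"].str.slice_replace(4, 4, "-")   # 0    TTTT-1234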

determine time zone from zip code in pandas dataframe?

This is my first time trying to use a lambda function, so please help me determine what I'm doing incorrectly. I wrote a function that outputs time zones based on zip codes. The function works, but I'm not sure how to use it in a lambda to create a new column in my dataframe.
import pandas as pd
from pyzipcode import ZipCodeDatabase
zcdb = ZipCodeDatabase()
def find_tz(zip_code):
    try:
        tz = zcdb[zip_code].timezone
        return tz
    except:
        return '?'
data = [['Jane','92804'], ['Bob','75014'], ['Ashley','07650']]
df = pd.DataFrame(data, columns=['Contact','Zip'])
in: df
out:
Contact Zip
0 Jane 92804
1 Bob 75014
2 Ashley 07650
Do note that the zip code column data are strings, since US zip codes have leading 0s.
Me testing that the function I wrote works on values from df:
in: print(find_tz(df.loc[0,'Zip']))
print(find_tz(df.loc[1,'Zip']))
print(find_tz(df.loc[2,'Zip']))
out:
-8
-6
-5
My attempt at using a lambda function to create a new Timezone column, and the incorrect result I am getting:
in: df = df.assign(Timezone = lambda x: find_tz(x.Zip))
df
out:
Contact Zip Timezone
0 Jane 92804 ?
1 Bob 75014 ?
2 Ashley 07650 ?
My desired resulting dataframe would look like:
Contact Zip Timezone
0 Jane 92804 -8
1 Bob 75014 -6
2 Ashley 07650 -5
ETA: when I changed my find_tz() function to something like concatenating the input with another string of text, the lambda worked as I expected, so I'm not sure what I've done wrong.
You can use:
df['Timezone'] = df.Zip.apply(find_tz)
When you call lambda x: find_tz(x.Zip), the find_tz function is passed the whole Zip column as a Pandas Series, not the individual zip codes. (That is also why the concatenation test worked: + broadcasts across a Series, while the zcdb[...] lookup cannot handle one, so your except clause returned '?'.)
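If you want to keep the assign style from your attempt, a minimal sketch is to map the function over the column inside the lambda:
df = df.assign(Timezone=lambda d: d['Zip'].map(find_tz))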

Pandas: Replacing column values with ones as retrieved from other dataframe

I have stumbled upon a trivial problem in pandas. I have two dataframes. The first one, df_1, is as follows:
vendor_name date company_name state
PERTH is june 2019 Abc enterprise Kentucky
Megan Ent 25-april-2019 Xyz Fincorp Texas
The second one df_2 contains the correct values for each column in df_1.
df_2
Field wrong value correct value
vendor_name PERTH Perth Enterprise
date is 15 ## this means that 'is' should be read as '15'
company_name Abc enterprise ABC International Enterprise Inc.
In order to replace the values with correct ones in df_1 (except date field) I am using pandas.loc method. Below is the code snippet
vend = df_1['vendor_name'].tolist()
comp = df_1['company_name'].tolist()
state = df_1['state'].tolist()
for i in vend:
    if df_2['wrong value'].str.contains(i):
        crct = df_2.loc[df_2['wrong value'] == i, 'correct value'].tolist()
Similarly, for company and state I have followed the above way.
However, crct is returning a blank list. Ideally it should return:
['Perth Enterprise','Abc International Enterprise Inc']
The next step would be to replace the respective field values by the above list.
With the above, I have three questions:
Why is the above code generating a blank list? What am I missing here?
How can I replace the respective fields using the df_1.replace method?
What should be a correct approach to replace the portion of date in df_1 by the correct one in df_2?
Edit: when the data has looping replacements (i.e. overlapping keys and values), replacement on the whole dataframe will fail. In this case, do it column by column and concat the results together. Finally, use join to add back any missing columns from df1:
df_replace = pd.concat([df1[k].replace(val, regex=True) for k, val in d.items()], axis=1).join(df1.state)
Original:
I tried your code in my interactive session and it raises ValueError: The truth value of a Series is ambiguous on df_2['wrong value'].str.contains(i).
Assuming you have multiple vendor names, the simple way is to construct a dictionary from a groupby of df2 and use it with df.replace on df1:
d = {k: gp.set_index('wrong value')['correct value'].to_dict()
     for k, gp in df2.groupby('Field')}
Out[64]:
{'company_name': {'Abc enterprise': 'ABC International Enterprise Inc. '},
'date': {'is': '15'},
'vendor_name': {'PERTH': 'Perth Enterprise'}}
df_replace = df1.replace(d, regex=True)
print(df_replace)
vendor_name date company_name \
0 Perth Enterprise 15 june 2019 ABC International Enterprise Inc.
1 Megan Ent 25-april-2019 Xyz Fincorp
state
0 Kentucky
1 Texas
Note: your sample df2 only has a value for vendor PERTH, so it only replaces the first row. When you have all vendor_names in df2, it will replace them all in df1.
A simple way to do that is to iterate over the first dataframe and then replace the wrong values:
rows = []
for i in range(len(df1)):
    vendor_name = df1.iloc[i]['vendor_name']
    date = df1.iloc[i]['date']
    company_name = df1.iloc[i]['company_name']
    if vendor_name in df2['wrong value'].values:
        vendor_name = df2.loc[df2['wrong value'] == vendor_name]['correct value'].values[0]
    if company_name in df2['wrong value'].values:
        company_name = df2.loc[df2['wrong value'] == company_name]['correct value'].values[0]
    rows.append({'vendor_name': vendor_name, 'date': date, 'company_name': company_name})
# DataFrame.append was removed in pandas 2.0, so collect the rows and build once
Result = pd.DataFrame(rows, columns=['vendor_name', 'date', 'company_name'])
Define the following replace function:
import re

def repl(row):
    fld = row.Field
    v1 = row['wrong value']
    v2 = row['correct value']
    updInd = df_1[df_1[fld].str.contains(v1)].index
    df_1.loc[updInd, fld] = df_1.loc[updInd, fld] \
        .str.replace(re.escape(v1), v2, regex=True)
Then call it for each row in df_2:
for _, row in df_2.iterrows():
    repl(row)
Note that str.replace alone does not require importing re (Pandas imports it under the hood). But the above function calls re.escape explicitly, from our code, hence the import re.
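For completeness, a minimal driver to exercise repl (sample frames reconstructed from the question's tables):
import pandas as pd

df_1 = pd.DataFrame({'vendor_name': ['PERTH', 'Megan Ent'],
                     'date': ['is june 2019', '25-april-2019'],
                     'company_name': ['Abc enterprise', 'Xyz Fincorp'],
                     'state': ['Kentucky', 'Texas']})
df_2 = pd.DataFrame({'Field': ['vendor_name', 'date', 'company_name'],
                     'wrong value': ['PERTH', 'is', 'Abc enterprise'],
                     'correct value': ['Perth Enterprise', '15', 'ABC International Enterprise Inc.']})

for _, row in df_2.iterrows():
    repl(row)
print(df_1)   # wrong values are replaced in place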

Python reading JSON in dataframe

I have an SQL database which has two columns: one has the timestamp, the other holds data in JSON format.
For example, df:
ts data
'2017-12-18 02:30:20.553' {'name':'bob','age':10, 'location':{'town':'miami','state':'florida'}}
'2017-12-18 02:30:21.101' {'name':'dan','age':15, 'location':{'town':'new york','state':'new york'}}
'2017-12-18 02:30:21.202' {'name':'jay','age':11, 'location':{'town':'tampa','state':'florida'}}
If I do the following :
df = df['data'][0]
print (df['name'].encode('ascii', 'ignore'))
I get :
'bob'
Is there a way I can get all of the data correspondings to a JSON key for the whole column?
(i.e. for the df column 'data' get 'name')
'bob'
'dan'
'jay'
Essentially I would like to be able to make a new df column called 'name'
You can use json_normalize, i.e.
pd.json_normalize(df['data'])['name']
(on pandas < 1.0 it lives at pd.io.json.json_normalize)
0 bob
1 dan
2 jay
Name: name, dtype: object
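json_normalize also flattens the nested location dict into dotted column names, so the other keys are one step away (a sketch, assuming a default integer index so the rows line up):
flat = pd.json_normalize(df['data'])
df['name'] = flat['name']
df['town'] = flat['location.town']   # nested keys become 'location.town', 'location.state'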
IIUC, let's use apply with a lambda function to select the value from the dictionary by key:
df['data'].apply(lambda x: x['name'])
Output:
0 bob
1 dan
2 jay
Name: data, dtype: object
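One caveat (an assumption about how the data arrives): if the SQL driver returns the data column as JSON strings rather than Python dicts, parse it first with json.loads:
import json

df['data'] = df['data'].apply(json.loads)   # for dict-repr strings with single quotes, ast.literal_eval works instead
df['name'] = df['data'].apply(lambda x: x['name'])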
