How do I parse this nested JSON object? - python

I have a data set in JSON format that looks like this:
[{'session_id': ['X061RFWB06K9V'],
'unix_timestamp': [1442503708],
'cities': ['New York NY, Newark NJ'],
'user': [[{'user_id': 2024,
'joining_date': '2015-03-22',
'country': 'UK'}]]},
{'session_id': ['5AZ2X2A9BHH5U'],
'unix_timestamp': [1441353991],
'cities': ['New York NY, Jersey City NJ, Philadelphia PA'],
'user': [[{'user_id': 2853,
'joining_date': '2015-03-28',
'country': 'DE'}]]},
{'session_id': ['SHTB4IYAX4PX6'],
'unix_timestamp': [1440843490],
'cities': ['San Antonio TX'],
'user': [[{'user_id': 10958,
'joining_date': '2015-03-06',
'country': 'UK'}]]}]
I am processing it with pandas, and when I use read_json I get the following:
cities session_id unix_timestamp user
0 [New York NY, Newark NJ] [X061RFWB06K9V] [1442503708] [[{'user_id': 2024, 'joining_date': '2015-03-2...
1 [New York NY, Jersey City NJ, Philadelphia PA] [5AZ2X2A9BHH5U] [1441353991] [[{'user_id': 2853, 'joining_date': '2015-03-2...
2 [San Antonio TX] [SHTB4IYAX4PX6] [1440843490] [[{'user_id': 10958, 'joining_date': '2015-03-...
How do I process this data so that it's in a better format?
Here is the data definition:
Columns:
session_id: session id.
unix_timestamp: unix timestamp of session start time
cities: the unique cities which were searched within the same session
user:
user_id: the id of the user
joining_date: when the user created the account
country: where the user is based
I tried using json_normalize but keep getting this error:
AttributeError: 'int' object has no attribute 'values'
along with various other errors depending on what I try. Kindly help.

You could use a function that completely flattens it out, then reconstruct your dataframe:
import re
import pandas as pd
import numpy as np
jsonData = [{'session_id': ['X061RFWB06K9V'],
'unix_timestamp': [1442503708],
'cities': ['New York NY, Newark NJ'],
'user': [[{'user_id': 2024,
'joining_date': '2015-03-22',
'country': 'UK'}]]},
{'session_id': ['5AZ2X2A9BHH5U'],
'unix_timestamp': [1441353991],
'cities': ['New York NY, Jersey City NJ, Philadelphia PA'],
'user': [[{'user_id': 2853,
'joining_date': '2015-03-28',
'country': 'DE'}]]},
{'session_id': ['SHTB4IYAX4PX6'],
'unix_timestamp': [1440843490],
'cities': ['San Antonio TX'],
'user': [[{'user_id': 10958,
'joining_date': '2015-03-06',
'country': 'UK'}]]} ]
def flatten_json(y):
    out = {}
    def flatten(x, name=''):
        if type(x) is dict:
            for a in x:
                flatten(x[a], name + a + '_')
        elif type(x) is list:
            i = 0
            for a in x:
                flatten(a, name + str(i) + '_')
                i += 1
        else:
            out[name[:-1]] = x
    flatten(y)
    return out
flat = flatten_json(jsonData)
results = pd.DataFrame()
columns_list = list(flat.keys())
for item in columns_list:
    # the leading number in each flattened key is the row index
    row_idx = re.findall(r'(\d+)_', item)[0]
    column = item.replace(row_idx + '_', '', 1)
    column = column.replace('_0', '')
    row_idx = int(row_idx)
    value = flat[item]
    results.loc[row_idx, column] = value
# If you don't want to expand/split the `cities` column, remove the line below
results = results.join(results['cities'].str.split(',', expand=True).add_prefix('cities_').fillna(np.nan))
print(results)
Output:
print (results.to_string())
session_id unix_timestamp cities user_user_id user_joining_date user_country cities_0 cities_1 cities_2
0 X061RFWB06K9V 1.442504e+09 New York NY, Newark NJ 2024.0 2015-03-22 UK New York NY Newark NJ NaN
1 5AZ2X2A9BHH5U 1.441354e+09 New York NY, Jersey City NJ, Philadelphia PA 2853.0 2015-03-28 DE New York NY Jersey City NJ Philadelphia PA
2 SHTB4IYAX4PX6 1.440843e+09 San Antonio TX 10958.0 2015-03-06 UK San Antonio TX NaN NaN
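An aside on the json_normalize error from the question: it fails on the raw data because every field is wrapped in a list, so json_normalize ends up treating the bare integers and strings as records. A minimal sketch, assuming pandas >= 1.0 (where the function is exposed as pd.json_normalize), that unwraps the single-element lists first:
# unwrap the single-element lists, then let json_normalize expand `user`
records = [{k: v[0] for k, v in d.items()} for d in jsonData]
df = pd.json_normalize(records, record_path='user',
                       meta=['session_id', 'unix_timestamp', 'cities'])
After the unwrapping, each record's user value is a plain list of dicts, which is exactly what record_path expects; the session fields come along as meta columns.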

Related

Python Dict Comprehension retrieve value from 1 dataframe column if match another column value

I have a dataframe with 2 columns, ["country"] and ["city"], which give each country and its cities.
I need to create a dict using a dict comprehension, with the country as the key and a list of its city/cities as the value (some countries have only one city, others many).
I'm able to define the keys and create a list, but every existing city appears in the values; I am not able to express the condition that the value's cities must belong to the key's country:
Dic = {k: list(megacities["city"]) for k,f in megacities.groupby('country')}
for k in Dic:
    print("{}:{}\n".format(k, Dic[k]))
Part of the output that I receive is:
Argentina:['Tokyo', 'Jakarta', 'Delhi', 'Manila', 'São Paulo', 'Seoul', 'Mumbai', 'Shanghai', 'Mexico City', 'Guangzhou', 'Cairo', 'Beijing', 'New York', 'Kolkāta', 'Moscow', 'Bangkok', 'Dhaka', 'Buenos Aires', 'Ōsaka', 'Lagos', 'Istanbul', 'Karachi', 'Kinshasa', 'Shenzhen', 'Bangalore', 'Ho Chi Minh City', 'Tehran', 'Los Angeles', 'Rio de Janeiro', 'Chengdu', 'Baoding', 'Chennai', 'Lahore', 'London', 'Paris', 'Tianjin', 'Linyi', 'Shijiazhuang', 'Zhengzhou', 'Nanyang']
Bangladesh:['Tokyo', 'Jakarta', 'Delhi', 'Manila', 'São Paulo', 'Seoul', 'Mumbai', 'Shanghai', 'Mexico City', 'Guangzhou', 'Cairo', 'Beijing', 'New York', 'Kolkāta', 'Moscow', 'Bangkok', 'Dhaka', 'Buenos Aires', 'Ōsaka', 'Lagos', 'Istanbul', 'Karachi', 'Kinshasa', 'Shenzhen', 'Bangalore', 'Ho Chi Minh City', 'Tehran', 'Los Angeles', 'Rio de Janeiro', 'Chengdu', 'Baoding', 'Chennai', 'Lahore', 'London', 'Paris', 'Tianjin', 'Linyi', 'Shijiazhuang', 'Zhengzhou', 'Nanyang']
Brazil:['Tokyo', 'Jakarta', 'Delhi', 'Manila', 'São Paulo', 'Seoul', 'Mumbai', 'Shanghai', 'Mexico City', 'Guangzhou', 'Cairo', 'Beijing', 'New York', 'Kolkāta', 'Moscow', 'Bangkok', 'Dhaka', 'Buenos Aires', 'Ōsaka', 'Lagos', 'Istanbul', 'Karachi', 'Kinshasa', 'Shenzhen', 'Bangalore', 'Ho Chi Minh City', 'Tehran', 'Los Angeles', 'Rio de Janeiro', 'Chengdu', 'Baoding', 'Chennai', 'Lahore', 'London', 'Paris', 'Tianjin', 'Linyi', 'Shijiazhuang', 'Zhengzhou', 'Nanyang']
So basically the expected output would be:
Argentina:['Buenos Aires']
Bangladesh:['Dhaka']
Brazil:['São Paulo', 'Rio de Janeiro']
How should I proceed, in terms of syntax, to establish that condition for the values in the dict comprehension?
Lastly, the dataframe:
city city_ascii lat lng country iso2 iso3 admin_name capital population id
0 Tokyo Tokyo 35.6839 139.7744 Japan JP JPN Tōkyō primary 39105000 1392685764
1 Jakarta Jakarta -6.2146 106.8451 Indonesia ID IDN Jakarta primary 35362000 1360771077
2 Delhi Delhi 28.6667 77.2167 India IN IND Delhi admin 31870000 1356872604
3 Manila Manila 14.6000 120.9833 Philippines PH PHL Manila primary 23971000 1608618140
4 São Paulo Sao Paulo -23.5504 -46.6339 Brazil BR BRA São Paulo admin 22495000 1076532519
5 Seoul Seoul 37.5600 126.9900 South Korea KR KOR Seoul primary 22394000 1410836482
6 Mumbai Mumbai 19.0758 72.8775 India IN IND Mahārāshtra admin 22186000 1356226629
7 Shanghai Shanghai 31.1667 121.4667 China CN CHN Shanghai admin 22118000 1156073548
8 Mexico City Mexico City 19.4333 -99.1333 Mexico MX MEX Ciudad de México primary 21505000 1484247881
9 Guangzhou Guangzhou 23.1288 113.2590 China CN CHN Guangdong admin 21489000 1156237133
10 Cairo Cairo 30.0444 31.2358 Egypt EG EGY Al Qāhirah primary 19787000 1818253931
11 Beijing Beijing 39.9040 116.4075 China CN CHN Beijing primary 19437000 1156228865
12 New York New York 40.6943 -73.9249 United States US USA New York NaN 18713220 1840034016
13 Kolkāta Kolkata 22.5727 88.3639 India IN IND West Bengal admin 18698000 1356060520
14 Moscow Moscow 55.7558 37.6178 Russia RU RUS Moskva primary 17693000 1643318494
15 Bangkok Bangkok 13.7500 100.5167 Thailand TH THA Krung Thep Maha Nakhon primary 17573000 1764068610
16 Dhaka Dhaka 23.7289 90.3944 Bangladesh BD BGD Dhaka primary 16839000 1050529279
17 Buenos Aires Buenos Aires -34.5997 -58.3819 Argentina AR ARG Buenos Aires, Ciudad Autónoma de primary 16216000 1032717330
18 Ōsaka Osaka 34.7520 135.4582 Japan JP JPN Ōsaka admin 15490000 1392419823
19 Lagos Lagos 6.4500 3.4000 Nigeria NG NGA Lagos minor 15487000 1566593751
20 Istanbul Istanbul 41.0100 28.9603 Turkey TR TUR İstanbul admin 15311000 1792756324
21 Karachi Karachi 24.8600 67.0100 Pakistan PK PAK Sindh admin 15292000 1586129469
22 Kinshasa Kinshasa -4.3317 15.3139 Congo (Kinshasa) CD COD Kinshasa primary 15056000 1180000363
23 Shenzhen Shenzhen 22.5350 114.0540 China CN CHN Guangdong minor 14678000 1156158707
24 Bangalore Bangalore 12.9791 77.5913 India IN IND Karnātaka admin 13999000 1356410365
25 Ho Chi Minh City Ho Chi Minh City 10.8167 106.6333 Vietnam VN VNM Hồ Chí Minh admin 13954000 1704774326
26 Tehran Tehran 35.7000 51.4167 Iran IR IRN Tehrān primary 13819000 1364305026
27 Los Angeles Los Angeles 34.1139 -118.4068 United States US USA California NaN 12750807 1840020491
28 Rio de Janeiro Rio de Janeiro -22.9083 -43.1964 Brazil BR BRA Rio de Janeiro admin 12486000 1076887657
29 Chengdu Chengdu 30.6600 104.0633 China CN CHN Sichuan admin 11920000 1156421555
30 Baoding Baoding 38.8671 115.4845 China CN CHN Hebei NaN 11860000 1156256829
31 Chennai Chennai 13.0825 80.2750 India IN IND Tamil Nādu admin 11564000 1356374944
32 Lahore Lahore 31.5497 74.3436 Pakistan PK PAK Punjab admin 11148000 1586801463
33 London London 51.5072 -0.1275 United Kingdom GB GBR London, City of primary 11120000 1826645935
34 Paris Paris 48.8566 2.3522 France FR FRA Île-de-France primary 11027000 1250015082
35 Tianjin Tianjin 39.1467 117.2056 China CN CHN Tianjin admin 10932000 1156174046
36 Linyi Linyi 35.0606 118.3425 China CN CHN Shandong NaN 10820000 1156086320
37 Shijiazhuang Shijiazhuang 38.0422 114.5086 China CN CHN Hebei admin 10784600 1156217541
38 Zhengzhou Zhengzhou 34.7492 113.6605 China CN CHN Henan admin 10136000 1156183137
39 Nanyang Nanyang 32.9987 112.5292 China CN CHN Henan NaN 10013600 1156192287
Many thanks!
Try:
d = {i: g["city"].to_list() for i, g in df.groupby("country")}
print(d)
Prints:
{
"Argentina": ["Buenos Aires"],
"Bangladesh": ["Dhaka"],
"Brazil": ["São Paulo", "Rio de Janeiro"],
"China": [
"Shanghai",
"Guangzhou",
"Beijing",
"Shenzhen",
"Chengdu",
"Baoding",
"Tianjin",
"Linyi",
"Shijiazhuang",
"Zhengzhou",
"Nanyang",
],
"Congo (Kinshasa)": ["Kinshasa"],
"Egypt": ["Cairo"],
"France": ["Paris"],
"India": ["Delhi", "Mumbai", "Kolkāta", "Bangalore", "Chennai"],
"Indonesia": ["Jakarta"],
"Iran": ["Tehran"],
"Japan": ["Tokyo", "Ōsaka"],
"Mexico": ["Mexico City"],
"Nigeria": ["Lagos"],
"Pakistan": ["Karachi", "Lahore"],
"Philippines": ["Manila"],
"Russia": ["Moscow"],
"South Korea": ["Seoul"],
"Thailand": ["Bangkok"],
"Turkey": ["Istanbul"],
"United Kingdom": ["London"],
"United States": ["New York", "Los Angeles"],
"Vietnam": ["Ho Chi Minh City"],
}
Since you are already doing the groupby, you need to fetch city from each group f rather than from the whole megacities frame:
Dic = {k: f['city'].unique() for k,f in megacities.groupby('country')}
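Note that .unique() returns a NumPy array rather than a list; if you want plain lists as in the expected output, wrap it:
Dic = {k: list(f['city'].unique()) for k, f in megacities.groupby('country')}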

How to loop to consecutively go through a list of strings, assign value to each string and return it to a new list

Say instead of a dictionary I have these lists:
cities = ('New York', 'Vancouver', 'London', 'Berlin', 'Tokyo', 'Bangkok')
Europe = ('London', 'Berlin')
America = ('New York', 'Vancouver')
Asia = ('Tokyo', 'Bangkok')
I want to create a pd.DataFrame from this such as:
City       Continent
New York   America
Vancouver  America
London     Europe
Berlin     Europe
Tokyo      Asia
Bangkok    Asia
Note: this is a minimal reproducible example to keep it simple, but the real dataset is more like city -> country -> continent.
I understand with such a small sample it would be possible to manually create a dictionary, but in the real example there are many more data-points. So I need to automate it.
I've tried a for loop and a while loop with conditions such as "if Europe in cities", but that doesn't do anything; I think that's because it evaluates to false, since it compares the whole list "Europe" against the whole list "cities".
Either way, my idea was that the loops would go through every city in the cities list and return (city + continent) for each. I just don't know how to um... actually make that work.
I am very new and I wasn't able to figure anything out from looking at similar questions.
Thank you for any direction!
Problem in your code:
First, let's look at the snippet you used: if Europe in cities: returned nothing, correct!
That is because you are comparing the whole list (Europe) instead of its individual elements ('London', 'Berlin').
Solution:
First, I imported the required modules and recreated the sample data you provided.
# Import all the Important Modules
import pandas as pd
# Read Data
cities = ['New York', 'Vancouver', 'London', 'Berlin', 'Tokyo', 'Bangkok']
Europe = ['London', 'Berlin']
America = ['New York', 'Vancouver']
Asia = ['Tokyo', 'Bangkok']
Now, as you can see in your expected output, we need the 2 columns below:
City [already available in the form of the cities list]
Continent [which we have to generate based on the other lists; in our case: Europe, America, Asia]
To generate a proper continent list, follow the code below:
# Make continent list
continent = []
# Compare the Europe, America and Asia lists with cities
for city in cities:
    if city in Europe:
        continent.append('Europe')
    elif city in America:
        continent.append('America')
    elif city in Asia:
        continent.append('Asia')
    else:
        pass
# Print the continent list
continent
# Output of the above code:
['America', 'America', 'Europe', 'Europe', 'Asia', 'Asia']
As you can see, we have the expected continent list. Now let's generate the pd.DataFrame() from it:
# Make dataframe from the 'City' and 'Continent' lists
data_df = pd.DataFrame({'City': cities, 'Continent': continent})
# Print Results
data_df
# Output of the above Code:
City Continent
0 New York America
1 Vancouver America
2 London Europe
3 Berlin Europe
4 Tokyo Asia
5 Bangkok Asia
Hope this solution helps. If you are still facing errors, feel free to comment below.
1: Counting elements
You just count the number of cities in each continent and build the lists from that:
import pandas as pd
cities = ('New York', 'Vancouver', 'London', 'Berlin', 'Tokyo', 'Bangkok')
Europe = ('London', 'Berlin')
America = ('New York', 'Vancouver')
Asia = ('Tokyo', 'Bangkok')
continent = []
cities = []
for name, cont in zip(['Europe', 'America', 'Asia'], [Europe, America, Asia]):
    continent += [name for _ in range(len(cont))]
    cities += [city for city in cont]
df = pd.DataFrame({'City': cities, 'Continent': continent})
print(df)
And this gives you the following result:
City Continent
0 London Europe
1 Berlin Europe
2 New York America
3 Vancouver America
4 Tokyo Asia
5 Bangkok Asia
This is, I think, the best solution.
2: With a dictionary
You can create an intermediate dictionary.
Starting from your code:
cities = ('New York', 'Vancouver', 'London', 'Berlin', 'Tokyo', 'Bangkok')
Europe = ('London', 'Berlin')
America = ('New York', 'Vancouver')
Asia = ('Tokyo', 'Bangkok')
You would do this:
continent = dict()
for cont_name, cont_cities in zip(['Europe', 'America', 'Asia'], [Europe, America, Asia]):
    for city in cont_cities:
        continent[city] = cont_name
This gives you the following result:
{
'London': 'Europe', 'Berlin': 'Europe',
'New York': 'America', 'Vancouver': 'America',
'Tokyo': 'Asia', 'Bangkok': 'Asia'
}
Then, you can create your DataFrame:
df = pd.DataFrame(continent.items())
print(df)
0 1
0 London Europe
1 Berlin Europe
2 New York America
3 Vancouver America
4 Tokyo Asia
5 Bangkok Asia
This solution avoids overwriting your cities tuple.
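As a short follow-up sketch: once you have the city-to-continent dict, it also works directly with Series.map if your cities already live in a DataFrame (the df name below is just for illustration):
df = pd.DataFrame({'City': list(cities)})
df['Continent'] = df['City'].map(continent)  # look each city up in the dict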
I think in the long run you might want to eliminate loops for large datasets. Also, you might need to include more continents depending on the content of your data.
import pandas as pd
continent = {
    '0': 'Europe',
    '1': 'America',
    '2': 'Asia'
}
# stack the three tuples into one column; level_0 records which tuple each city came from
df = pd.DataFrame([Europe, America, Asia]).stack().reset_index()
df['continent'] = df['level_0'].astype(str).map(continent)
df.drop(['level_0', 'level_1'], inplace=True, axis=1)
You should get this output:
0 continent
0 London Europe
1 Berlin Europe
2 New York America
3 Vancouver America
4 Tokyo Asia
5 Bangkok Asia
Feel free to adjust to suit your use case
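One small usage note: the city column in this output is literally named 0, because it comes out of stack() unnamed. If you want proper column names, a rename such as the following would do it:
df = df.rename(columns={0: 'City'})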

How to read and normalize following json in pandas?

I have seen many pandas JSON-reading questions on Stack Overflow, but I still could not manage to solve this seemingly simple problem.
Data
{"session_id":{"0":["X061RFWB06K9V"],"1":["5AZ2X2A9BHH5U"]},"unix_timestamp":{"0":[1442503708],"1":[1441353991]},"cities":{"0":["New York NY, Newark NJ"],"1":["New York NY, Jersey City NJ, Philadelphia PA"]},"user":{"0":[[{"user_id":2024,"joining_date":"2015-03-22","country":"UK"}]],"1":[[{"user_id":2853,"joining_date":"2015-03-28","country":"DE"}]]}}
My attempt
import numpy as np
import pandas as pd
import json
from pandas.io.json import json_normalize
# attempt1
df = pd.read_json('a.json')
# attempt2
with open('a.json') as fi:
    data = json.load(fi)
df = json_normalize(data, record_path='user', meta=['session_id', 'unix_timestamp', 'cities'])
Neither of them gives me the required output.
Required output
session_id unix_timestamp cities user_id joining_date country
0 X061RFWB06K9V 1442503708 New York NY 2024 2015-03-22 UK
0 X061RFWB06K9V 1442503708 Newark NJ 2024 2015-03-22 UK
Preferred method
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.io.json.json_normalize.html
I would love to see an implementation using pd.io.json.json_normalize.
pandas.io.json.json_normalize(data: Union[Dict, List[Dict]], record_path: Union[str, List, NoneType] = None, meta: Union[str, List, NoneType] = None, meta_prefix: Union[str, NoneType] = None, record_prefix: Union[str, NoneType] = None, errors: Union[str, NoneType] = 'raise', sep: str = '.', max_level: Union[int, NoneType] = None)
Related links
Pandas explode list of dictionaries into rows
How to normalize json correctly by Python Pandas
JSON to pandas DataFrame
Here is another way:
df = pd.read_json(r'C:\path\file.json')
final = df.stack().str[0].unstack()
final = final.assign(cities=final['cities'].str.split(',')).explode('cities').reset_index(drop=True)
# reset_index above keeps the user dicts positionally aligned with the exploded rows
final = final.assign(**pd.DataFrame(final.pop('user').str[0].tolist()))
print(final)
      session_id  unix_timestamp            cities  user_id joining_date country
0  X061RFWB06K9V      1442503708       New York NY     2024   2015-03-22      UK
1  X061RFWB06K9V      1442503708         Newark NJ     2024   2015-03-22      UK
2  5AZ2X2A9BHH5U      1441353991       New York NY     2853   2015-03-28      DE
3  5AZ2X2A9BHH5U      1441353991    Jersey City NJ     2853   2015-03-28      DE
4  5AZ2X2A9BHH5U      1441353991   Philadelphia PA     2853   2015-03-28      DE
Here's one way to do it:
import pandas as pd
# let's say d is the parsed json dict
df = pd.DataFrame.from_dict(d, orient='index').T.reset_index(drop=True)
# unlist each element
df = df.applymap(lambda x: x[0])
# convert user column to multiple cols
df = pd.concat([df.drop('user', axis=1), df['user'].apply(lambda x: x[0]).apply(pd.Series)], axis=1)
session_id unix_timestamp \
0 X061RFWB06K9V 1442503708
1 5AZ2X2A9BHH5U 1441353991
cities user_id joining_date country
0 New York NY, Newark NJ 2024 2015-03-22 UK
1 New York NY, Jersey City NJ, Philadelphia PA 2853 2015-03-28 DE
I am using explode with join:
s = pd.DataFrame(j).apply(lambda x: x.str[0])  # j is the parsed json dict
s['cities'] = s.cities.str.split(',')
s = s.explode('cities')
s.reset_index(drop=True, inplace=True)
s = s.join(pd.DataFrame(sum(s.user.tolist(), [])))
session_id unix_timestamp ... joining_date country
0 X061RFWB06K9V 1442503708 ... 2015-03-22 UK
1 X061RFWB06K9V 1442503708 ... 2015-03-22 UK
2 5AZ2X2A9BHH5U 1441353991 ... 2015-03-28 DE
3 5AZ2X2A9BHH5U 1441353991 ... 2015-03-28 DE
4 5AZ2X2A9BHH5U 1441353991 ... 2015-03-28 DE
[5 rows x 7 columns]
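A small aside on sum(s.user.tolist(), []): it flattens the per-row lists but is quadratic in the number of rows. itertools.chain.from_iterable is the usual linear alternative, so the last line could equivalently be written as:
from itertools import chain
s = s.join(pd.DataFrame(list(chain.from_iterable(s.user.tolist()))))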
Once you have df, you can then merge the two parts:
df = pd.read_json('a.json')
df1 = df.drop('user', axis=1)
# each `user` cell holds a doubly nested list, so unwrap it before normalizing
df2 = json_normalize(df['user'].str[0].str[0].tolist())
df = df1.merge(df2, left_index=True, right_index=True)
Just thought I'd share another means of extracting data from nested JSON into pandas, for future visitors to this question. Each of the columns is extracted before reading into pandas. jmespath comes in handy here, as it allows for easy traversal of JSON data:
import jmespath
from pprint import pprint
expression = jmespath.compile('''{session_id:session_id.*[],
unix_timestamp : unix_timestamp.*[],
cities:cities.*[],
user_id : user.*[][].user_id,
joining_date : user.*[][].joining_date,
country : user.*[][].country
}''')
res = expression.search(data)
pprint(res)
{'cities': ['New York NY, Newark NJ',
'New York NY, Jersey City NJ, Philadelphia PA'],
'country': ['UK', 'DE'],
'joining_date': ['2015-03-22', '2015-03-28'],
'session_id': ['X061RFWB06K9V', '5AZ2X2A9BHH5U'],
'unix_timestamp': [1442503708, 1441353991],
'user_id': [2024, 2853]}
Read data into pandas and split the cities into individual rows:
df = (pd.DataFrame(res)
        .assign(cities=lambda x: x.cities.str.split(','))
        .explode('cities')
      )
df
session_id unix_timestamp cities user_id joining_date country
0 X061RFWB06K9V 1442503708 New York NY 2024 2015-03-22 UK
0 X061RFWB06K9V 1442503708 Newark NJ 2024 2015-03-22 UK
1 5AZ2X2A9BHH5U 1441353991 New York NY 2853 2015-03-28 DE
1 5AZ2X2A9BHH5U 1441353991 Jersey City NJ 2853 2015-03-28 DE
1 5AZ2X2A9BHH5U 1441353991 Philadelphia PA 2853 2015-03-28 DE
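Since the question specifically asked for json_normalize and none of the answers above use it directly, here is a minimal sketch, assuming pandas >= 1.0 (where it is exposed as pd.json_normalize). The trick is to unwrap the single-element lists first so that normalize sees plain records:
import json
import pandas as pd
with open('a.json') as fi:
    data = json.load(fi)
# rebuild row-wise records from the column-oriented dict, unwrapping the lists
records = [
    {'session_id': data['session_id'][k][0],
     'unix_timestamp': data['unix_timestamp'][k][0],
     'cities': data['cities'][k][0],
     'user': data['user'][k][0]}
    for k in data['session_id']
]
df = pd.json_normalize(records, record_path='user',
                       meta=['session_id', 'unix_timestamp', 'cities'])
# split the comma-separated cities into one row each
df = df.assign(cities=df['cities'].str.split(', ')).explode('cities')
This produces one row per (session, city) pair with the user fields expanded, which matches the shape of the required output.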

Difflib error when applying onto two columns in pandas dataframe

I have a DataFrame that looks like this:
Cities Cities_Dict
"San Francisco" ["San Francisco", "New York", "Boston"]
"Los Angeles" ["Los Angeles"]
"berlin" ["Munich", "Berlin"]
"Dubai" ["Dubai"]
I want to create a new column that compares the city from the first column to the list of cities from the second column and finds the closest match.
I use difflib for that:
df["new_col"]=difflib.get_close_matches(df["Cities"],df["Cities_Dict"])
However, I get an error:
TypeError: object of type 'float' has no len()
Use DataFrame.apply with a lambda function and axis=1 to process by rows:
import difflib, ast
#if necessary convert values to lists
#df['Cities_Dict'] = df['Cities_Dict'].apply(ast.literal_eval)
f = lambda x: difflib.get_close_matches(x["Cities"],x["Cities_Dict"])
df["new_col"] = df.apply(f, axis=1)
print (df)
Cities Cities_Dict new_col
0 San Francisco [San Francisco, New York, Boston] [San Francisco]
1 Los Angeles [Los Angeles] [Los Angeles]
2 berlin [Munich, Berlin] [Berlin]
3 Dubai [Dubai] [Dubai]
EDIT:
To get the first match as a scalar, with an empty string when there is no match, use:
f = lambda x: next(iter(difflib.get_close_matches(x["Cities"],x["Cities_Dict"])), '')
df["new_col"] = df.apply(f, axis=1)
print (df)
Cities Cities_Dict new_col
0 San Francisco [San Francisco, New York, Boston] San Francisco
1 Los Angeles [Los Angeles] Los Angeles
2 berlin [Munich, Berlin] Berlin
3 Dubai [Dubai] Dubai
EDIT1: If problematic data is possible, use try-except:
def f(x):
    try:
        return difflib.get_close_matches(x["Cities"], x["Cities_Dict"])[0]
    except Exception:
        return ''
df["new_col"] = df.apply(f, axis=1)
print (df)
Cities Cities_Dict new_col
0 NaN [San Francisco, New York, Boston]
1 Los Angeles [10]
2 berlin [Munich, Berlin] Berlin
3 Dubai [Dubai] Dubai
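As an aside on the original error: TypeError: object of type 'float' has no len() usually means a NaN (which pandas stores as a float) slipped into the Cities column, as in row 0 of the last example. For that specific case, cleaning first is an alternative to the try-except, a sketch:
df['Cities'] = df['Cities'].fillna('')
df['Cities_Dict'] = df['Cities_Dict'].apply(lambda x: x if isinstance(x, list) else [])
difflib.get_close_matches('') against any list simply returns [], so the cleaned rows fall through harmlessly; genuinely malformed entries such as [10] still need the try-except.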

map US state name to two letter acronyms that was given in dictionary separately

Suppose now I have a dataframe with 2 columns: State and City.
Then I have a separate dict with the two-letter acronym for each state. Now I want to add a third column mapping each state name to its two-letter acronym. What should I do in Python/Pandas? For instance, the sample question is as follows:
import pandas as pd
a = pd.Series({'State': 'Ohio', 'City':'Cleveland'})
b = pd.Series({'State':'Illinois', 'City':'Chicago'})
c = pd.Series({'State':'Illinois', 'City':'Naperville'})
d = pd.Series({'State': 'Ohio', 'City':'Columbus'})
e = pd.Series({'State': 'Texas', 'City': 'Houston'})
f = pd.Series({'State': 'California', 'City': 'Los Angeles'})
g = pd.Series({'State': 'California', 'City': 'San Diego'})
state_city = pd.DataFrame([a,b,c,d,e,f,g])
state_2 = {'OH': 'Ohio','IL': 'Illinois','CA': 'California','TX': 'Texas'}
Now I have to map the column State in the df state_city using the dictionary of state_2. The mapped df state_city should contain three columns: state, city, and state_2letter.
The original dataset I had contained multiple columns covering nearly all major US cities, so it would be inefficient to do it manually. Is there any easy way to do it?
For one, it's probably easier to store the key-value pairs like state name: abbreviation in your dictionary, like this:
state_2 = {'Ohio': 'OH', 'Illinois': 'IL', 'California': 'CA', 'Texas': 'TX'}
You can invert the existing dictionary easily:
state_2 = {state: abbrev for abbrev, state in state_2.items()}
Using pandas.DataFrame.map:
>>> state_city['abbrev'] = state_city['State'].map(state_2)
>>> state_city
City State abbrev
0 Cleveland Ohio OH
1 Chicago Illinois IL
2 Naperville Illinois IL
3 Columbus Ohio OH
4 Houston Texas TX
5 Los Angeles California CA
6 San Diego California CA
I do agree with @blacksite that the state_2 dictionary should map its values like this:
state_2 = {'Ohio': 'OH','Illinois': 'IL','California': 'CA','Texas': 'TX'}
Then, using pandas.DataFrame.replace:
state_city['state_2letter'] = state_city.State.replace(state_2)
state_city
   State       City         state_2letter
0  Ohio        Cleveland    OH
1  Illinois    Chicago      IL
2  Illinois    Naperville   IL
3  Ohio        Columbus     OH
4  Texas       Houston      TX
5  California  Los Angeles  CA
6  California  San Diego    CA
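A practical difference between the two answers above: Series.map leaves any state missing from the dictionary as NaN, while Series.replace keeps the original value untouched. If you would rather keep the state name whenever no abbreviation is known, a sketch combining the two behaviours:
state_city['state_2letter'] = state_city['State'].map(state_2).fillna(state_city['State'])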
