How to read and normalize the following JSON in pandas? - python

I have seen many JSON-reading questions on Stack Overflow that use pandas, but I still could not manage to solve this seemingly simple problem.
Data
{"session_id":{"0":["X061RFWB06K9V"],"1":["5AZ2X2A9BHH5U"]},"unix_timestamp":{"0":[1442503708],"1":[1441353991]},"cities":{"0":["New York NY, Newark NJ"],"1":["New York NY, Jersey City NJ, Philadelphia PA"]},"user":{"0":[[{"user_id":2024,"joining_date":"2015-03-22","country":"UK"}]],"1":[[{"user_id":2853,"joining_date":"2015-03-28","country":"DE"}]]}}
My attempt
import numpy as np
import pandas as pd
import json
from pandas.io.json import json_normalize
# attempt1
df = pd.read_json('a.json')
# attempt2
with open('a.json') as fi:
    data = json.load(fi)
df = json_normalize(data,record_path='user',meta=['session_id','unix_timestamp','cities'])
Both of them do not give me the required output.
Required output
      session_id  unix_timestamp       cities  user_id joining_date country
0  X061RFWB06K9V      1442503708  New York NY     2024   2015-03-22      UK
0  X061RFWB06K9V      1442503708    Newark NJ     2024   2015-03-22      UK
Preferred method
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.io.json.json_normalize.html
I would love to see an implementation that uses pd.io.json.json_normalize:
pandas.io.json.json_normalize(data: Union[Dict, List[Dict]], record_path: Union[str, List, NoneType] = None, meta: Union[str, List, NoneType] = None, meta_prefix: Union[str, NoneType] = None, record_prefix: Union[str, NoneType] = None, errors: Union[str, NoneType] = 'raise', sep: str = '.', max_level: Union[int, NoneType] = None)
Related links
Pandas explode list of dictionaries into rows
How to normalize json correctly by Python Pandas
JSON to pandas DataFrame

Here is another way:
df = pd.read_json(r'C:\path\file.json')
final = df.stack().str[0].unstack()
final = final.assign(cities=final['cities'].str.split(',')).explode('cities').reset_index(drop=True)
# reset the index so the expanded user records align row for row with the exploded frame
final = final.assign(**pd.DataFrame(final.pop('user').str[0].tolist()))
print(final)
      session_id  unix_timestamp           cities  user_id joining_date country
0  X061RFWB06K9V      1442503708      New York NY     2024   2015-03-22      UK
1  X061RFWB06K9V      1442503708        Newark NJ     2024   2015-03-22      UK
2  5AZ2X2A9BHH5U      1441353991      New York NY     2853   2015-03-28      DE
3  5AZ2X2A9BHH5U      1441353991   Jersey City NJ     2853   2015-03-28      DE
4  5AZ2X2A9BHH5U      1441353991  Philadelphia PA     2853   2015-03-28      DE

Here's one way to do it:
import pandas as pd
# let's say d is your parsed JSON dict
df = pd.DataFrame.from_dict(d, orient='index').T.reset_index(drop=True)
# unlist each element
df = df.applymap(lambda x: x[0])
# convert user column to multiple cols
df = pd.concat([df.drop('user', axis=1), df['user'].apply(lambda x: x[0]).apply(pd.Series)], axis=1)
      session_id  unix_timestamp                                        cities  user_id joining_date country
0  X061RFWB06K9V      1442503708                        New York NY, Newark NJ     2024   2015-03-22      UK
1  5AZ2X2A9BHH5U      1441353991  New York NY, Jersey City NJ, Philadelphia PA     2853   2015-03-28      DE

I am using explode with join; here j is the parsed JSON dict:
s = pd.DataFrame(j).apply(lambda x: x.str[0])
s['cities'] = s.cities.str.split(',')
s = s.explode('cities')
s.reset_index(drop=True, inplace=True)
s = s.join(pd.DataFrame(sum(s.user.tolist(), [])))
session_id unix_timestamp ... joining_date country
0 X061RFWB06K9V 1442503708 ... 2015-03-22 UK
1 X061RFWB06K9V 1442503708 ... 2015-03-22 UK
2 5AZ2X2A9BHH5U 1441353991 ... 2015-03-28 DE
3 5AZ2X2A9BHH5U 1441353991 ... 2015-03-28 DE
4 5AZ2X2A9BHH5U 1441353991 ... 2015-03-28 DE
[5 rows x 7 columns]

Once you have df, you can split off the user column, normalize it, and merge the two parts back together:
df = pd.read_json('a.json').applymap(lambda x: x[0])   # unwrap the single-element lists
df1 = df.drop('user', axis=1)
df2 = json_normalize(df['user'].str[0].tolist())       # each cell holds a list with one user dict
df = df1.merge(df2, left_index=True, right_index=True)
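Since df1 and df2 end up sharing the same index here, a column-wise concat is an equivalent way to glue the two parts back together (a small alternative sketch, not from the original answer):
df = pd.concat([df1, df2], axis=1)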

Just thought I'd share another means of extracting data from nested JSON into pandas, for future visitors to this question. Each of the columns is extracted before reading into pandas. jmespath comes in handy here, as it allows easy traversal of JSON data:
import jmespath
from pprint import pprint

expression = jmespath.compile('''{session_id: session_id.*[],
                                  unix_timestamp: unix_timestamp.*[],
                                  cities: cities.*[],
                                  user_id: user.*[][].user_id,
                                  joining_date: user.*[][].joining_date,
                                  country: user.*[][].country
                                  }''')
res = expression.search(data)
pprint(res)
{'cities': ['New York NY, Newark NJ',
            'New York NY, Jersey City NJ, Philadelphia PA'],
 'country': ['UK', 'DE'],
 'joining_date': ['2015-03-22', '2015-03-28'],
 'session_id': ['X061RFWB06K9V', '5AZ2X2A9BHH5U'],
 'unix_timestamp': [1442503708, 1441353991],
 'user_id': [2024, 2853]}
Read data into pandas and split the cities into individual rows:
df = (pd.DataFrame(res)
        .assign(cities=lambda x: x.cities.str.split(','))
        .explode('cities')
      )
df
      session_id  unix_timestamp           cities  user_id joining_date country
0  X061RFWB06K9V      1442503708      New York NY     2024   2015-03-22      UK
0  X061RFWB06K9V      1442503708        Newark NJ     2024   2015-03-22      UK
1  5AZ2X2A9BHH5U      1441353991      New York NY     2853   2015-03-28      DE
1  5AZ2X2A9BHH5U      1441353991   Jersey City NJ     2853   2015-03-28      DE
1  5AZ2X2A9BHH5U      1441353991  Philadelphia PA     2853   2015-03-28      DE
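As a side note on the json_normalize route the question asks about: it can also produce the required frame once each session is reshaped into a flat record whose 'user' key holds the inner list of dicts. A minimal sketch, assuming data is the parsed JSON dict (use pd.json_normalize on newer pandas):
records = [
    {
        'session_id': data['session_id'][k][0],
        'unix_timestamp': data['unix_timestamp'][k][0],
        'cities': data['cities'][k][0],
        'user': data['user'][k][0],   # the inner list with one user dict
    }
    for k in data['session_id']
]
df = json_normalize(records, record_path='user',
                    meta=['session_id', 'unix_timestamp', 'cities'])
df = df.assign(cities=df['cities'].str.split(',')).explode('cities')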

Related

Pandas Merge Result Output Next Row

Suppose I have two dataframes
df_1
city state salary
New York NY 85000
Chicago IL 65000
Miami FL 75000
Dallas TX 78000
Seattle WA 96000
df_2
city state taxes
New York NY 15000
Chicago IL 5000
Miami FL 6500
Next, I join the two dataframes
joined_df = df_1.merge(df_2, how='inner', left_on=['city'], right_on = ['city'])
The Result:
joined_df
city state salary city state taxes
New York NY 85000 New York NY 15000
Chicago IL 65000 Chicago IL 5000
Miami FL 75000 Miami FL 6500
Is there any way I can stack the two dataframes on top of each other, joining on the city, instead of extending the row horizontally, like below:
Requested:
joined_df
city state salary taxes
New York NY 85000
New York NY 15000
Chicago IL 65000
Chicago IL 5000
Miami FL 75000
Miami FL 6500
How can I do this in pandas?
In this case we might need to use merge to restrict to the relevant rows before concat if we need to consider both city and state.
rel_df_1 = df_1.merge(df_2)[df_1.columns]
rel_df_2 = df_2.merge(df_1)[df_2.columns]
df = pd.concat([rel_df_1, rel_df_2]).sort_values(['city', 'state'])
You can use append (a shortcut for concat) to achieve that:
result = df1.append(df2, sort=False)
If your dataframes have overlapping indexes, you can use:
df1.append(df2, ignore_index=True, sort=False)
Also, you can find more information in the pandas documentation on merging and concatenation.
UPDATE: After appending your dataframes, you can filter the result to get only the rows that contain the city in both dataframes:
result = result.loc[result['city'].isin(df1['city'])
                    & result['city'].isin(df2['city'])]
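Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on current versions the same stacking is spelled with pd.concat; a minimal equivalent of the snippet above:
result = pd.concat([df1, df2], ignore_index=True, sort=False)
result = result.loc[result['city'].isin(df1['city'])
                    & result['city'].isin(df2['city'])]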
Try with stack():
stacked = df_1.merge(df_2, on=["city", "state"]).set_index(["city", "state"]).stack()
output = pd.concat([stacked.where(stacked.index.get_level_values(-1)=="salary"),
                    stacked.where(stacked.index.get_level_values(-1)=="taxes")],
                   axis=1,
                   keys=["salary", "taxes"]) \
           .droplevel(-1) \
           .reset_index()
>>> output
city state salary taxes
0 New York NY 85000.0 NaN
1 New York NY NaN 15000.0
2 Chicago IL 65000.0 NaN
3 Chicago IL NaN 5000.0
4 Miami FL 75000.0 NaN
5 Miami FL NaN 6500.0

Function to move specific row to top or bottom of pandas dataframe

I have two functions which shift a row of a pandas dataframe to the top or bottom, respectively. After applying them more than once to a dataframe, they seem to work incorrectly.
These are the 2 functions to move the row to top / bottom:
def shift_row_to_bottom(df, index_to_shift):
    """Shift row, given by index_to_shift, to bottom of df."""
    idx = df.index.tolist()
    idx.pop(index_to_shift)
    df = df.reindex(idx + [index_to_shift])
    return df

def shift_row_to_top(df, index_to_shift):
    """Shift row, given by index_to_shift, to top of df."""
    idx = df.index.tolist()
    idx.pop(index_to_shift)
    df = df.reindex([index_to_shift] + idx)
    return df
Note: I don't want to reset_index for the returned df.
Example:
df = pd.DataFrame({'Country': ['USA', 'GE', 'Russia', 'BR', 'France'],
                   'ID': ['11', '22', '33', '44', '55'],
                   'City': ['New-York', 'Berlin', 'Moscow', 'London', 'Paris'],
                   'short_name': ['NY', 'Ber', 'Mosc', 'Lon', 'Pa']
                   })
df =
Country ID City short_name
0 USA 11 New-York NY
1 GE 22 Berlin Ber
2 Russia 33 Moscow Mosc
3 BR 44 London Lon
4 France 55 Paris Pa
This is my dataframe. Now, apply the function for the first time and move the row with index 0 to the bottom:
df_shifted = shift_row_to_bottom(df,0)
df_shifted =
Country ID City short_name
1 GE 22 Berlin Ber
2 Russia 33 Moscow Mosc
3 BR 44 London Lon
4 France 55 Paris Pa
0 USA 11 New-York NY
The result is exactly what I want.
Now, apply function again. This time move row with index 2 to the bottom:
df_shifted = shift_row_to_bottom(df_shifted,2)
df_shifted =
Country ID City short_name
1 GE 22 Berlin Ber
2 Russia 33 Moscow Mosc
4 France 55 Paris Pa
0 USA 11 New-York NY
2 Russia 33 Moscow Mosc
Well, this is not what I was expecting. There must be a problem when I apply the function a second time. The same problem occurs with shift_row_to_top.
My question is:
What's going on here?
Is there a better way to shift a specific row to top / bottom of the dataframe? Maybe a pandas-function?
If not, how would you do it?
Your problem is these two lines:
idx = df.index.tolist()
idx.pop(index_to_shift)
idx is a plain list, and idx.pop(index_to_shift) removes the item at position index_to_shift of idx, which is not necessarily the item whose label is index_to_shift, as your second call shows.
Try this function:
def shift_row_to_bottom(df, index_to_shift):
    idx = [i for i in df.index if i != index_to_shift]
    return df.loc[idx + [index_to_shift]]

# call the function twice
for i in range(2):
    df = shift_row_to_bottom(df, 2)
Output:
Country ID City short_name
0 USA 11 New-York NY
1 GE 22 Berlin Ber
3 BR 44 London Lon
4 France 55 Paris Pa
2 Russia 33 Moscow Mosc
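For completeness, the same label-based fix carries over to the shift_row_to_top function from the question; a sketch following the identical pattern:
def shift_row_to_top(df, index_to_shift):
    idx = [i for i in df.index if i != index_to_shift]
    return df.loc[[index_to_shift] + idx]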

Add a column of repeating numbers to existing dataframe

I have the following dataframe where each row is a unique state-city pair:
State City
NY Albany
NY NYC
MA Boston
MA Cambridge
I want to add a column of years ranging from 2000 to 2018:
State City Year
NY Albany 2000
NY Albany 2001
NY Albany 2002
...
NY Albany 2018
NY NYC 2000
NY NYC 2018
...
MA Cambridge 2018
I know I can create a list of numbers using Year = list(range(2000,2019))
Does anyone know how to put this list as a column in the dataframe for each state-city?
You could try adding it as a list and then performing explode. I think it should work:
df['Year'] = [list(range(2000,2019))] * len(df)
df = df.explode('Year')
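If you are on pandas 1.2 or newer, a cross join achieves the same without building the intermediate list column; a small alternative sketch, starting again from the two-column State/City frame:
years = pd.DataFrame({'Year': range(2000, 2019)})
df = df.merge(years, how='cross')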
One way is to use the DataFrame.stack() method.
Here is sample of your current data:
data = [['NY', 'Albany'],
        ['NY', 'NYC'],
        ['MA', 'Boston'],
        ['MA', 'Cambridge']]
cities = pd.DataFrame(data, columns=['State', 'City'])
print(cities)
# State City
# 0 NY Albany
# 1 NY NYC
# 2 MA Boston
# 3 MA Cambridge
First, make this into a multi-level index (this will end up in the final dataframe):
cities_index = pd.MultiIndex.from_frame(cities)
print(cities_index)
# MultiIndex([('NY', 'Albany'),
#             ('NY', 'NYC'),
#             ('MA', 'Boston'),
#             ('MA', 'Cambridge')],
#            names=['State', 'City'])
Now, make a dataframe with all the years in it (I only use 3 years for brevity):
import numpy as np
years = list(range(2000, 2003))
n_cities = len(cities)
years_data = np.repeat(years, n_cities).reshape(len(years), n_cities).T
years_data = pd.DataFrame(years_data, index=cities_index)
years_data.columns.name = 'Year index'
print(years_data)
# Year index 0 1 2
# State City
# NY Albany 2000 2001 2002
# NYC 2000 2001 2002
# MA Boston 2000 2001 2002
# Cambridge 2000 2001 2002
Finally, use stack to transform this dataframe into a vertically-stacked series which I think is what you want:
years_by_city = years_data.stack().rename('Year')
print(years_by_city.head())
# State City Year index
# NY Albany 0 2000
# 1 2001
# 2 2002
# NYC 0 2000
# 1 2001
# Name: Year, dtype: int64
If you want to remove the index and have all the values as a dataframe just do
cities_and_years = years_by_city.reset_index()
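If the helper 'Year index' column is not wanted in the final frame, it can be dropped after the reset, leaving just State, City and Year:
cities_and_years = years_by_city.reset_index().drop(columns='Year index')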

Difflib error when applying onto two columns in pandas dataframe

I have a DataFrame that looks like this:
Cities Cities_Dict
"San Francisco" ["San Francisco", "New York", "Boston"]
"Los Angeles" ["Los Angeles"]
"berlin" ["Munich", "Berlin"]
"Dubai" ["Dubai"]
I want to create a new column that compares the city from the first column to the list of cities from the second column and finds the closest match.
I use difflib for that:
df["new_col"]=difflib.get_close_matches(df["Cities"],df["Cities_Dict"])
However I get error:
TypeError: object of type 'float' has no len()
Use DataFrame.apply with a lambda function and axis=1 for row-wise processing:
import difflib, ast
#if necessary convert values to lists
#df['Cities_Dict'] = df['Cities_Dict'].apply(ast.literal_eval)
f = lambda x: difflib.get_close_matches(x["Cities"],x["Cities_Dict"])
df["new_col"] = df.apply(f, axis=1)
print (df)
Cities Cities_Dict new_col
0 San Francisco [San Francisco, New York, Boston] [San Francisco]
1 Los Angeles [Los Angeles] [Los Angeles]
2 berlin [Munich, Berlin] [Berlin]
3 Dubai [Dubai] [Dubai]
EDIT:
To take only the first match, with an empty string when there is no match, use:
f = lambda x: next(iter(difflib.get_close_matches(x["Cities"],x["Cities_Dict"])), '')
df["new_col"] = df.apply(f, axis=1)
print (df)
Cities Cities_Dict new_col
0 San Francisco [San Francisco, New York, Boston] San Francisco
1 Los Angeles [Los Angeles] Los Angeles
2 berlin [Munich, Berlin] Berlin
3 Dubai [Dubai] Dubai
EDIT 1: If problematic data is possible, use try-except:
def f(x):
    try:
        return difflib.get_close_matches(x["Cities"], x["Cities_Dict"])[0]
    except:
        return ''

df["new_col"] = df.apply(f, axis=1)
print (df)
Cities Cities_Dict new_col
0 NaN [San Francisco, New York, Boston]
1 Los Angeles [10]
2 berlin [Munich, Berlin] Berlin
3 Dubai [Dubai] Dubai

Geopy, checking cities, avoiding duplicates, pandas

I want to get the latitude of ~100k entries in a pandas dataframe. Since I can query geopy only with a one-second delay, I want to make sure I do not query duplicates (most entries should be duplicates, since there are not that many cities).
from geopy.geocoders import Nominatim
geolocator = Nominatim(user_agent="xxx")
df['loc']=0
for x in range(1, len(df)):
    for y in range(1, x):
        if df['Location'][y] == df['Location'][x]:
            df['lat'][x] = df['lat'][y]
        else:
            location = geolocator.geocode(df['Location'][x])
            time.sleep(1.2)
            df.at[x, 'lat'] = location.latitude
The idea is to check if the location is already in the list, and only if not query geopy. Somehow it is painfully slow and seems not to be doing what I intended. Any help or tip is appreciated.
Prepare the initial dataframe:
import pandas as pd
df = pd.DataFrame({
    'some_meta': [1, 2, 3, 4],
    'city': ['london', 'paris', 'London', 'moscow'],
})
df['city_lower'] = df['city'].str.lower()
df
Out[1]:
some_meta city city_lower
0 1 london london
1 2 paris paris
2 3 London london
3 4 moscow moscow
Create a new DataFrame with unique cities:
df_uniq_cities = df['city_lower'].drop_duplicates().to_frame()
df_uniq_cities
Out[2]:
city_lower
0 london
1 paris
3 moscow
Run geopy's geocode on that new DataFrame:
from geopy.geocoders import Nominatim
geolocator = Nominatim(user_agent="specify_your_app_name_here")
from geopy.extra.rate_limiter import RateLimiter
geocode = RateLimiter(geolocator.geocode, min_delay_seconds=1)
df_uniq_cities['location'] = df_uniq_cities['city_lower'].apply(geocode)
# Or, instead, do this to get a nice progress bar:
# from tqdm import tqdm
# tqdm.pandas()
# df_uniq_cities['location'] = df_uniq_cities['city_lower'].progress_apply(geocode)
df_uniq_cities
Out[3]:
city_lower location
0 london (London, Greater London, England, SW1A 2DU, UK...
1 paris (Paris, Île-de-France, France métropolitaine, ...
3 moscow (Москва, Центральный административный округ, М...
Merge the initial DataFrame with the new one:
df_final = pd.merge(df, df_uniq_cities, on='city_lower', how='left')
df_final['lat'] = df_final['location'].apply(lambda location: location.latitude if location is not None else None)
df_final['long'] = df_final['location'].apply(lambda location: location.longitude if location is not None else None)
df_final
Out[4]:
some_meta city city_lower location lat long
0 1 london london (London, Greater London, England, SW1A 2DU, UK... 51.507322 -0.127647
1 2 paris paris (Paris, Île-de-France, France métropolitaine, ... 48.856610 2.351499
2 3 London london (London, Greater London, England, SW1A 2DU, UK... 51.507322 -0.127647
3 4 moscow moscow (Москва, Центральный административный округ, М... 55.750446 37.617494
The key to resolving your issue with the required delays is geopy's RateLimiter class. Check out the docs for more details: https://geopy.readthedocs.io/en/1.18.1/#usage-with-pandas
Imports
See the geopy documentation for how to instantiate the Nominatim geocoder.
import pandas as pd
from geopy.geocoders import Nominatim
geolocator = Nominatim(user_agent="specify_your_app_name_here") # specify your application name
Generate some data with locations
d = ['New York, NY', 'Seattle, WA', 'Philadelphia, PA',
     'Richardson, TX', 'Plano, TX', 'Wylie, TX',
     'Waxahachie, TX', 'Washington, DC']
df = pd.DataFrame(d, columns=['Location'])
print(df)
Location
0 New York, NY
1 Seattle, WA
2 Philadelphia, PA
3 Richardson, TX
4 Plano, TX
5 Wylie, TX
6 Waxahachie, TX
7 Washington, DC
Use a dict to geocode only the unique Locations (as suggested in another SO post), and extract all parameters simultaneously:
first, get lat and lon in the same step (as tuples in a single column of the DataFrame);
second, split the column of tuples into separate columns.
locations = df['Location'].unique()

# Create dict of geocodings
d = dict(zip(locations,
             pd.Series(locations)
               .apply(geolocator.geocode, args=(10,))
               .apply(lambda x: (x.latitude, x.longitude))  # get tuple of latitude and longitude
             ))
# Map dict to `Location` column
df['city_coord'] = df['Location'].map(d)
# Split single column of tuples into multiple (2) columns
df[['lat','lon']] = pd.DataFrame(df['city_coord'].tolist(), index=df.index)
print(df)
Location city_coord lat lon
0 New York, NY (40.7308619, -73.9871558) 40.730862 -73.987156
1 Seattle, WA (47.6038321, -122.3300624) 47.603832 -122.330062
2 Philadelphia, PA (39.9524152, -75.1635755) 39.952415 -75.163575
3 Richardson, TX (32.9481789, -96.7297206) 32.948179 -96.729721
4 Plano, TX (33.0136764, -96.6925096) 33.013676 -96.692510
5 Wylie, TX (33.0151201, -96.5388789) 33.015120 -96.538879
6 Waxahachie, TX (32.3865312, -96.8483311) 32.386531 -96.848331
7 Washington, DC (38.8950092, -77.0365625) 38.895009 -77.036563
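If Nominatim's one-request-per-second policy is a concern for this dict-based approach as well, the RateLimiter wrapper shown in the previous answer can be dropped in; a minimal sketch:
from geopy.extra.rate_limiter import RateLimiter

geocode = RateLimiter(geolocator.geocode, min_delay_seconds=1)

# geocode each unique location once, then map the coordinates back onto the frame
d = {loc: geocode(loc) for loc in df['Location'].unique()}
df['city_coord'] = df['Location'].map(
    lambda loc: (d[loc].latitude, d[loc].longitude) if d[loc] is not None else None
)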
