I have a dataframe of this style:
id patient_full_name
7805 TOMAS FRANCONI
7810 Camila Gualtieri
7821 Lola Borrego
7823 XIMENA ALVAREZ LANUS
7824 MONICA VIVIANA RODRIGUEZ DE MARENGO
I need to save the first name from each value in the second column. I want to trim each value down to the first space, and I don't know how.
I would like it to stay in a structure like this:
patients_names = ["TOMAS", "CAMILA", "LOLA", "XIMENA", "MONICA", ..., "N-NAME"]
All of this done in pandas (Python).
You can use the split method in a list comprehension to do this:
df = pd.DataFrame([
    {"id": 7805, "patient_full_name": "TOMAS FRANCONI"},
    {"id": 7810, "patient_full_name": "Camila Gualtieri"},
    {"id": 7821, "patient_full_name": "Lola Borrego"}
])
df["first_name"] = [n.split(" ")[0] for n in df["patient_full_name"]]
That adds a column (first_name) with the output you wanted, which you can then pull off as a list or series if you want:
first_name_as_series = df["first_name"]
first_name_as_list = list(df["first_name"])
In your question, you show the desired output in all upper case. That's easy to get with a simple tweak to the list comprehension:
df["first_name"] = [n.split(" ")[0].upper() for n in df["patient_full_name"]]
You can do it with extract as well, which does not rely on a Python-level loop. Note that a greedy pattern like r"(.*) " would capture everything up to the last space, so it is safer to grab the first run of non-space characters (the column here is patient_full_name, as in the question):
(df
    .assign(first_name=lambda x: x.patient_full_name.str.extract(r"(\S+)", expand=False))
)
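As a side note, the same extraction can be done entirely with pandas' vectorized string methods, avoiding the Python-level loop. A minimal sketch, assuming the column is named patient_full_name as above:
patients_names = (
    df["patient_full_name"]
    .str.split(" ", n=1)   # split on the first space only
    .str[0]                # keep the part before it
    .str.upper()           # match the upper-case output in the question
    .tolist()
)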
I am working on a data cleaning project in which I have to remove outliers from price_per_sqft. I used groupby on location and, for each group, kept only the rows within one standard deviation of the mean, concatenating the results into the output data frame.
But in the output, the location names come back with extra words attached, so how can I get a clean location name instead?
Code:
def remove_pps_outliers(df):
    df_out = pd.DataFrame()
    for key, subdf in df.groupby('location'):
        m = np.mean(subdf.price_per_sqft)
        st = np.std(subdf.price_per_sqft)
        reduced_df = subdf[(subdf.price_per_sqft > (m - st)) & (subdf.price_per_sqft <= (m + st))]
        df_out = pd.concat([df_out, reduced_df], ignore_index=True)
    return df_out
df6 = remove_pps_outliers(df5)
df6.head()
Output:
[screenshot of df6.head(); the location values still carry prefixes such as '1st Phase' and '1st Block']
How can I get the location names without the "1st Phase" or "1st Block" prefixes, like this?
[screenshot of the desired output with the prefixes removed]
A rudimentary fix would be to just replace the characters you do not want. Luckily, in this example both '1st Phase ' and '1st Block ' contain 10 characters, so you could use:
df6['location'] = df6['location'].str.slice_replace(0,10,'')
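If the unwanted prefixes are not always exactly 10 characters long, a regex replacement is more robust. A sketch, assuming every prefix looks like an ordinal followed by 'Phase' or 'Block' (e.g. '1st Phase ', '2nd Block '):
# Strip a leading "<ordinal> Phase " or "<ordinal> Block " prefix, if present
df6['location'] = df6['location'].str.replace(
    r'^\d+(?:st|nd|rd|th)\s+(?:Phase|Block)\s+', '', regex=True
)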
I have a dataframe where the coordinates column comes in this format
[-7.821, 37.033]
I would like to create two columns, where the first is lon and the second is lat.
I've tried
my_dict = df_map['coordinates'].to_dict()
df_map_new = pd.DataFrame(list(my_dict.items()),columns = ['lon','lat'])
But the dictionary that is created does not split the values on the comma.
Instead it creates a dict in the following format:
0: '[-7.821, 37.033]'
What is the best way to extract the values within [,] and put them into two new columns in the original dataframe df_map?
Thank you in advance!
You can parse the string with a regular expression:
pattern = r"\[(?P<lon>.*),\s*(?P<lat>.*)\]"
out = df_map['coordinates'].str.extract(pattern).astype(float)
print(out)
# Output
lon lat
0 -7.821 37.033
Alternatively, convert the values to lists with ast.literal_eval, then build the DataFrame from a list of lists instead of a dict:
import ast
my_L = df_map['coordinates'].apply(ast.literal_eval).tolist()
df_map_new = pd.DataFrame(my_L,columns = ['lon','lat'])
Additionally to the answers already provided, you can also try this; note that it assumes the coordinates column already holds two-element lists rather than strings:
ser_lon = df['coordinates'].apply(lambda x: x[0])
ser_lat = df['coordinates'].apply(lambda x: x[1])
df_map['lon'] = ser_lon
df_map['lat'] = ser_lat
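If the column actually holds strings such as '[-7.821, 37.033]' (as the question suggests), plain string methods also work, with no regex or eval. A sketch, assuming every value has that exact bracketed form:
df_map[['lon', 'lat']] = (
    df_map['coordinates']
    .str.strip('[]')              # drop the surrounding brackets
    .str.split(',', expand=True)  # one column per coordinate
    .astype(float)
)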
I have a Pandas DataFrame that contains several string values.
I want to replace them with integer values in order to calculate similarities.
For example:
stores[['CNPJ_Store_Code','region','total_facings']].head()
Out[24]:
CNPJ_Store_Code region total_facings
1 93209765046613 Geo RS/SC 1.471690
16 93209765046290 Geo RS/SC 1.385636
19 93209765044084 Geo PR/SPI 0.217054
21 93209765044831 Geo RS/SC 0.804633
23 93209765045218 Geo PR/SPI 0.708165
and I want to replace region == 'Geo RS/SC' ==> 1, region == 'Geo PR/SPI' ==> 2, etc.
Clarification: I want to do the replacement automatically, without creating a dictionary first, since I don't know in advance what my regions will be.
Any ideas? I am trying to use DictVectorizer, with no success.
I'm sure there's a way to do it in intelligent way, but I just can't find it.
Anyone familiar with a solution?
You can use the .apply() function and a dictionary to map all known string values to their corresponding integer values:
region_dictionary = {'Geo RS/SC': 1, 'Geo PR/SPI' : 2, .... }
stores['region'] = stores['region'].apply(lambda x: region_dictionary[x])
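A small note on this approach: Series.map performs the same dictionary lookup without a lambda, and yields NaN for regions missing from the dictionary instead of raising a KeyError:
stores['region'] = stores['region'].map(region_dictionary)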
It looks to me like you really want pandas categories:
http://pandas-docs.github.io/pandas-docs-travis/categorical.html
I think you just need to change the dtype of your text column to "category" and you are done.
stores['region'] = stores["region"].astype('category')
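To get actual integer values out of the categories, one more step is needed: the codes live under .cat.codes. A minimal sketch (codes start at 0; add 1 for the 1-based numbering used in the question):
stores['region'] = stores['region'].astype('category')
stores['region_num'] = stores['region'].cat.codes + 1  # one integer per distinct region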
You can do:
df = pd.read_csv(filename, index_col = 0) # Assuming it's a csv file.
def region_to_numeric(a):
    if a == 'Geo RS/SC':
        return 1
    if a == 'Geo PR/SPI':
        return 2

df['region_num'] = df['region'].apply(region_to_numeric)
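Since the question asks for the mapping to be generated automatically, without knowing the regions in advance, pd.factorize is also worth mentioning: it assigns an integer to each distinct value in order of appearance. A minimal sketch:
codes, uniques = pd.factorize(stores['region'])
stores['region_num'] = codes + 1  # factorize starts at 0; shift to start at 1
print(uniques)                    # shows which region each integer stands for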
I am trying to import a json file using the function:
sku = pandas.read_json('https://cws01.worldstores.co.uk/api/product.php?product_sku=125T:FT0111')
However, I keep getting the following error:
ValueError: arrays must all be same length
What should I do to import it correctly into a dataframe?
This is the structure of the JSON:
{
    "id": "5",
    "sku": "JOSH:BECO-BRN",
    "last_updated": "2013-06-10 15:46:22",
    ...
    "propertyType1": [
        "manufacturer_colour"
    ],
    "category": [
        {
            "category_id": "10",
            "category_name": "All Products"
        },
        ...
        {
            "category_id": "238",
            "category_name": "All Sofas"
        }
    ],
    "root_categories": [
        "516"
    ],
    "url": "/p/Beco Suede Sofa Bed?product_id=5",
    "item": [
        "2"
    ],
    "image_names": "[\"https:\\/\\/cdn.worldstores.co.uk\\/images\\/products\\/L\\/19\\/Beco_Suede_Sofa_Bed-1.jpg\",\"https:\\/\\/cdn.worldstores.co.uk\\/images\\/products\\/P\\/19\\/Beco_Suede_Sofa_Bed-1.jpg\",\"https:\\/\\/cdn.worldstores.co.uk\\/images\\/products\\/SP\\/19\\/Beco_Suede_Sofa_Bed-1.jpg\",\"https:\\/\\/cdn.worldstores.co.uk\\/images\\/products\\/SS\\/19\\/Beco_Suede_Sofa_Bed-1.jpg\",\"https:\\/\\/cdn.worldstores.co.uk\\/images\\/products\\/ST\\/19\\/Beco_Suede_Sofa_Bed-1.jpg\",\"https:\\/\\/cdn.worldstores.co.uk\\/images\\/products\\/WP\\/19\\/Beco_Suede_Sofa_Bed-1.jpg\",\"https:\\/\\/cdn.worldstores.co.uk\\/images\\/products\\/L\\/19\\/Beco_Suede_Sofa_Bed-2.jpg\",\"https:\\/\\/cdn.worldstores.co.uk\\/images\\/products\\/P\\/19\\/Beco_Suede_Sofa_Bed-2.jpg\",\"https:\\/\\/cdn.worldstores.co.uk\\/images\\/products\\/SP\\/19\\/Beco_Suede_Sofa_Bed-2.jpg\",\"https:\\/\\/cdn.worldstores.co.uk\\/images\\/products\\/SS\\/19\\/Beco_Suede_Sofa_Bed-2.jpg\",\"https:\\/\\/cdn.worldstores.co.uk\\/images\\/products\\/ST\\/19\\/Beco_Suede_Sofa_Bed-2.jpg\",\"https:\\/\\/cdn.worldstores.co.uk\\/images\\/products\\/WP\\/19\\/Beco_Suede_Sofa_Bed-2.jpg\"]"
}
The pandas.read_json function takes multiple formats.
Since you did not specify which format your json file is in (orient= attribute), pandas will default to believing your data is columnar. The different formats pandas expects are discussed below.
The data that you are trying to parse from https://cws01.worldstores.co.uk/api/product.php?product_sku=125T:FT0111 does not seem to conform to any of the supported formats, as it is only a single "record" while pandas expects some kind of collection.
You probably should collect multiple entries into a single file, then parse it with the read_json function.
EDIT:
A simple way of getting multiple rows and parsing them with the pandas.read_json function:
import urllib2
import pandas as pd

url_base = "https://cws01.worldstores.co.uk/api/product.php?product_sku={}"
products = ["125T:FT0111", "125T:FT0111", "125T:FT0111"]

raw_data_list = []
for sku in products:
    url = url_base.format(sku)
    raw_data_list.append(urllib2.urlopen(url).read())

data = "[" + ",".join(raw_data_list) + "]"
data = pd.read_json(data, orient='records')
data
/EDIT
My take on the pandas.read_json function formats.
The pandas.read_json function is yet another shining example of pandas trying to jam as much functionality as possible into a single function, which of course leads to a very complicated function.
Series
If your data is a Series, pandas.read_json(orient=) defaults to 'index'
The values allowed for orient while parsing a Series are: {'split','records','index'}
Note that the Series index must be unique for orient='index'.
DataFrame
If your data is a DataFrame, pandas.read_json(orient=) defaults to 'columns'
The values allowed for orient while parsing a DataFrame are:
{'split','records','index','columns','values'}
Note that the DataFrame index must be unique for orient='index' and orient='columns', and the DataFrame columns must be unique for orient='index', orient='columns', and orient='records'.
Format
Whether your data is a DataFrame or a Series, each orient= value expects data in the same format:
Split
Expects a string representation of a dict like what the DataFrame constructor takes:
{"index":[1,2,3,4], "columns":["col1","col2"], "data":[[8,7,6,5], [5,6,7,8]]}
Records
Expects a string representation of a list of dicts like:
[{"col1":8,"col2":5},{"col1":7,"col2":6},{"col1":6,"col2":7},{"col1":5,"col2":8}]
Note there is no index set here.
Index
Expects a string representation of a nested dict like:
{"1":{"col1":8,"col2":5},"2":{"col1":7,"col2":6},"3":{"col1":6,"col2":7},"4":{"col1":5,"col2":8}}
Note that it will not accept indices of types other than strings; this may be fixed in later versions.
Columns
Expects a string representation of a nested dict like:
{"col1":{"1":8,"2":7,"3":6,"4":5},"col2":{"1":5,"2":6,"3":7,"4":8}}
Values
Expects a string representation of a list like:
[[8, 5],[7, 6],[6, 7],[5, 8]]
Resulting dataframe
In most cases, given the JSON strings above, the dataframe you get will look like this:
col1 col2
1 8 5
2 7 6
3 6 7
4 5 8
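For a quick sanity check, each of the strings above can be fed straight to read_json with the matching orient. A minimal sketch using the 'split' payload (wrapped in StringIO, since newer pandas versions deprecate passing literal JSON strings directly):
from io import StringIO
import pandas as pd

payload = '{"index":[1,2,3,4], "columns":["col1","col2"], "data":[[8,5],[7,6],[6,7],[5,8]]}'
df = pd.read_json(StringIO(payload), orient='split')
print(df)  # same 4x2 frame as shown above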
Maybe this is not the most elegant solution, but it gives me back what I want, or at least I believe so; feel free to warn me if something is wrong.
url = "https://cws01.worldstores.co.uk/api/product.php?product_sku=125T:FT0111"
data = urllib2.urlopen(url).read()
data = json.loads(data)
data = pd.DataFrame(data.items())
data = data.transpose()
Another solution is to use a try/except. Note that the payload has to be fetched first, because the except branch must wrap the JSON text itself in brackets; wrapping the URL string would not parse:
json_path = 'https://cws01.worldstores.co.uk/api/product.php?product_sku=125T:FT0111'
payload = urllib2.urlopen(json_path).read()  # fetch the raw JSON text first
try:
    a = pd.read_json(payload)
except ValueError:
    a = pd.read_json("[" + payload + "]")
Iterating on #firelynx's answer:
#! /usr/bin/env python3
from urllib.request import urlopen
import pandas as pd

products = ["125T:FT0111", "125T:FT0111", "125T:FT0111"]
raw_lines = ""
for sku in products:
    url = f"https://cws01.worldstores.co.uk/api/product.php?product_sku={sku}"
    raw_lines += urlopen(url).read().decode("utf-8") + "\n"  # .read() returns bytes; decode before concatenating

data = pd.read_json(raw_lines, lines=True)
This would support any source returning a single JSON object or a bunch of newline ('\n') separated ones.
Or this one-liner(ish) should work the same:
#! /usr/bin/env python3
import pandas as pd

products = ["125T:FT0111", "125T:FT0111", "125T:FT0111"]
data = pd.concat(
    pd.read_json(
        f"https://cws01.worldstores.co.uk/api/product.php?product_sku={sku}",
        lines=True,
    )
    for sku in products
)
PS: Python 3 is only needed here for f-string support, so use str.format if you need Python 2 compatibility.