How to extract part of JSON file in python? - python

I have been trying to extract only certain data from a JSON file using Python.
[{"id":"1", "user":"a"},
{"id":"2", "user":"b"]},
{"id":"2", "user":"c"}]
I want only the "user" data as the output.

What you pasted originally had a stray ] in the second object, so it was not valid JSON; with that removed, it is a list of dictionaries.
If you have a JSON object stored in a variable this:
>>> this = {"id": 1, "user": "a"}
you can simply do
>>> this.get("user")
'a'
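Since the question mentions a JSON file, here is a minimal sketch of loading one first ("data.json" is a hypothetical filename for illustration):
import json

# "data.json" is assumed to contain the list from the question
with open("data.json") as f:
    records = json.load(f)

print(records[0].get("user"))  # 'a'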

JSON data should be treated as dictionaries or lists that may contain nested lists or dictionaries. To address the items/values within it, you must give it the same treatment.
data = [{"id":"1", "user":"a"}, {"id":"2", "user":"b"}, {"id":"3", "user":"c"}]  # renamed from json to avoid shadowing the json module
for i in range(len(data)):
    print(data[i]['id'])
Output:
1
2
3
If you wish to have these values stored somewhere, you can create a dictionary and append the values to it:
data = [{"id":"1", "user":"a"}, {"id":"2", "user":"b"}, {"id":"3", "user":"c"}]
support_dict = {'id': [], 'user': []}
for i in range(len(data)):
    support_dict['id'].append(data[i]['id'])
    support_dict['user'].append(data[i]['user'])

import pandas as pd
df = pd.DataFrame(support_dict)
print(df)
Output:
  id user
0  1    a
1  2    b
2  3    c

As for your example, you have a list [] of three dictionaries {}:
x = [{"id":"1", "user":"a"}, {"id":"2", "user":"b"}, {"id":"3", "user":"c"}]
Each dictionary contains keys and values. If you want to print every value in each dictionary, you can try:
for item in x:
    for key, value in item.items():
        print(value)
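Since the question asks for only the "user" data, a list comprehension is the most direct route (a minimal sketch over the corrected data above):
x = [{"id":"1", "user":"a"}, {"id":"2", "user":"b"}, {"id":"3", "user":"c"}]
users = [item["user"] for item in x]  # pick out just the "user" field
print(users)  # ['a', 'b', 'c']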

Related

Merge 2 dictionaries and store them in pandas dataframe where one dictionary has variable length list elements

I'm iterating through some HTML divs like this with Beautiful Soup:
for div in soup.findAll('a', {'class': 'result'}):
    adLink = div.a.get('href')
    adInfo = {
        u'adLink': adLink,
        u'adThumbImg': ...some code...,
        u'adCounty': ...some code...
    }
    adFullInfo = getFullAdInfo(adLink)
    adInfo.update(adFullInfo)
    ads_CarsURL = pd.DataFrame(data=adInfo)  # Create pandas DF
where getFullAdInfo is a function
def getFullAdInfo(adLink):
    ...some code...
which returns a dictionary that looks something like this:
{'adID': '2027007',
'adTitle': 'Ford 750 Special',
'adDatePublished': '20.11.2009',
'adTimePublished': '14:23',
'adViewed': '102',
'carPriceEUR': '600',
'carManufacturer': 'Ford'}
So in each iteration I get values from the adInfo dict and from the getFullAdInfo function, which returns another dict, and I merge them so I have a single dictionary per record.
The idea is to create a pandas dataframe at the end.
The error I get is:
ValueError: arrays must all be same length
I don't know why, since I initially defined all variables for each dictionary key and assigned an empty string to them, like adID = "", in case they are missing.
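For reference, this error comes from handing the DataFrame constructor a dict whose values are sequences of different lengths; a minimal reproduction (toy data, not from the question):
import pandas as pd

# 'a' has 2 elements, 'b' has 3, so pandas cannot line the columns up
pd.DataFrame({'a': ['x', 'y'], 'b': ['p', 'q', 'r']})
# ValueError: arrays must all be same length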
After you get the full ad, convert it to a one-row dataframe, then append that onto a final dataframe. That takes care of the mismatched lengths, and of data that is missing on one ad but present on others. You'll have to work out the logic, as you haven't provided that part of your code to test, so here is a quick example of what I mean:
import pandas as pd

data1 = {'adID': '2027007',
         'adTitle': 'Ford 750 Special',
         'adDatePublished': '20.11.2009',
         'adTimePublished': '14:23',
         'adViewed': '102',
         'carPriceEUR': '600',
         'carManufacturer': 'Ford'}
data2 = {'adID': '20555',
         'adTitle': 'Honda',
         'adTimePublished': '11:23',
         'adViewed': '2',
         'carManufacturer': 'Honda'}

# Initialize empty dataframe
final_df = pd.DataFrame()
# Iterate through your dictionaries, convert each to a 1-row dataframe
# and concatenate it onto the final dataframe (pd.concat replaces the
# old DataFrame.append, which was removed in pandas 2.0)
for data in [data1, data2]:
    temp_df = pd.DataFrame(data, index=[0])
    final_df = pd.concat([final_df, temp_df], sort=True).reset_index(drop=True)
Specifically with what you provided, it will be something like:
ads_CarsURL = pd.DataFrame()
for div in soup.findAll('a', {'class': 'result'}):
    adLink = div.a.get('href')
    adInfo = {
        u'adLink': adLink,
        u'adThumbImg': ...some code...,
        u'adCounty': ...some code...
    }
    adFullInfo = getFullAdInfo(adLink)
    adInfo.update(adFullInfo)
    temp_df = pd.DataFrame(adInfo, index=[0])
    ads_CarsURL = pd.concat([ads_CarsURL, temp_df], sort=True).reset_index(drop=True)
Output:
print(final_df.to_string())
  adDatePublished     adID adTimePublished           adTitle adViewed carManufacturer carPriceEUR
0      20.11.2009  2027007           14:23  Ford 750 Special      102            Ford         600
1             NaN    20555           11:23             Honda        2           Honda         NaN
I tried various options, and this is what gave the best results:
I properly merged the dictionaries with adFull = {**adBasicInfo, **adOtherInfo} and appended them to the adFullList list in each iteration.
After that I could successfully create a pandas dataframe from the adFullList list.
The other solution didn't work because the second dictionary had list-type elements for some of the values. They look like this:
adFullDF.iloc[2]['carSafety']
which gives:
['Self-tightening belts', 'Rear seat belts', 'Active head restraints']
A list of dictionaries solves the shape problem that pandas raises when you try to store dictionaries in a dataframe and one of them has variable-length list items for some of the values.
The names of some dictionaries are changed for clarity:
import pandas as pd

adBasicInfo = {}  # 1st dictionary
adOtherInfo = {}  # 2nd dictionary
adFullInfo = {}   # Merged dictionary
adFullList = []   # List for appending merged dictionaries

# In each iteration, merge the dicts and append the result to the list
for div in soup.findAll('a', {'class': 'result'}):
    ...some code...
    adBasicInfo = {
        u'adLink': adLink,
        u'adThumbImg': ...some code...,
        u'adCounty': ...some code...
    }
    adOtherInfo = getFullAdInfo(adLink)  # Get complex dict
    adFull = {**adBasicInfo, **adOtherInfo}  # Merge dicts
    adFullList.append(adFull)  # Append merged dict to the list

# Save final version of the list as a pandas dataframe
adFullDF = pd.DataFrame(data=adFullList)
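To see why the list-of-dicts route sidesteps the length error, here is a self-contained sketch (toy data, not from the scraper): the DataFrame constructor fills missing keys with NaN and keeps list values as single cells.
import pandas as pd

rows = [
    {'adID': '1', 'carSafety': ['belts', 'airbags']},  # list value stays one cell
    {'adID': '2', 'adTitle': 'Honda'},                 # missing keys become NaN
]
df = pd.DataFrame(rows)
print(df)
#   adID            carSafety adTitle
# 0    1  ['belts', 'airbags']     NaN
# 1    2                  NaN   Honda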

python - How to get multiple keys with a single value for a dictionary from Excel

I have an Excel sheet, i.e.:
0 first      | second     | third       | root
1 credible   | credulous  | credibility | cred
2 cryptogram | cryptology | cryptic     | crypt
3 aquatic    | aquarium   | aqueduct    | aqua
I want to import the keys and values for a dictionary from the Excel sheet above.
For example, I have written dictionary code for storing these values:
new_dict = {
    ('credible', 'credulous', 'credibility'): 'cred',
    ('cryptogram', 'cryptology', 'cryptic'): 'crypt'
}
So instead of writing the key and value for each word in the dictionary, I want to import them from Excel, where the first 3 columns (first, second, third) should be the key and the last column (root) will be the value. Is it possible to do that?
Sorry for my English.
Thanks
Use set_index to set the index to every column besides root (the first three, in this case), then call .root.to_dict(). Note that columns.difference returns the column names in sorted order, which here happens to match first, second, third:
df.set_index(df.columns.difference(['root']).tolist()).root.to_dict()
{
 ('aquatic', 'aquarium', 'aqueduct'): 'aqua',
 ('credible', 'credulous', 'credibility'): 'cred',
 ('cryptogram', 'cryptology', 'cryptic'): 'crypt'
}
Use set_index with the first 3 columns, then Series.to_dict:
d = df.set_index(['first','second','third'])['root'].to_dict()
print (d)
{('credible', 'credulous', 'credibility'): 'cred',
('aquatic', 'aquarium', 'aqueduct'): 'aqua',
('cryptogram', 'cryptology', 'cryptic'): 'crypt'}
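Neither snippet shows the Excel import itself; a minimal end-to-end sketch (the filename words.xlsx is a hypothetical stand-in for your sheet):
import pandas as pd

# words.xlsx is assumed to hold the columns first, second, third, root
df = pd.read_excel('words.xlsx')
new_dict = df.set_index(['first', 'second', 'third'])['root'].to_dict()
print(new_dict[('credible', 'credulous', 'credibility')])  # 'cred'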

Extracting similar data from dictionary and putting into a new list or array

I have a dictionary with 5 keys, where each key refers to a list (a CSV file with 100 rows and 5 columns). Each row of a list holds one person's data. I would like to extract the corresponding row from each list and put them together into a new list or array, so at the end I have 100 lists/arrays such that each one contains a single user's data. Then I want to run some experiments on them, like machine learning and so on.
This is my example:
My_dict={0,1,2,3}
0={id,var1,var2,var3
User1,med,high,low
User2,med,low,low
…,…,..,..,
User100,high,low,med}
1={id,var1,var2,var3
User1,high,med,low
User2,high,med,low
…,…,..,..,
User100,low,low,med}
2={id,var1,var2,var3
User1,low,med,low
User2,med,med,low
…,…,..,..,
User100,med,low,med}
So I want to have a list of lists or array of arrays that I can experiment with. Something like this:
User1={id,var1,var2,var3
User1,med,high,low
User1,high,med,low
User1,low,med,low
}
User2={id,var1,var2,var3
User2,med,high,low
User2,high,med,low
User2,low,med,low
}
input_data = {"0":[["U1","med","low","high"],["U2","low","low","high"],["U3","high","low","high"]], "1": [["U1","med","low","high"],["U2","low","low","high"],["U3","high","low","high"]]}
# Assuming that above kind of data you have then below dict will be your output
users_dict = dict()
for key, users in input_data.iteritems():
for user in users:
users_dict.setdefault(user[0], []).append(user)
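For the sample input_data above, users_dict ends up grouping every row by user id:
print(users_dict["U1"])
# [['U1', 'med', 'low', 'high'], ['U1', 'med', 'low', 'high']]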

Access a json column with pandas

I have a csv file where one column is json. I want to access the information in the json column, but I can't figure out how.
My csv file looks like:
id, "letter", "json"
1,"a","{""add"": 2}"
2,"b","{""sub"": 5}"
3,"c","{""add"": {""sub"": 4}}"
I'm reading in the file like:
test = pd.read_csv(filename)
df = pd.DataFrame(test)
I'd like to be able to get all the rows that have "sub" in the json column and ultimately be able to get the values for those keys.
Here's one approach, which uses the read_csv converters argument to parse the json column as JSON. Then apply is used to select on the json field keys in each row. CustomParser is taken from this answer.
EDIT
Updated to look two levels deep, and to take a variable target parameter (so it can be "add" or "sub", as needed). This solution won't handle an arbitrary number of levels, though.
import json
import pandas as pd

def CustomParser(data):
    return json.loads(data)

df = pd.read_csv('test.csv', converters={'json': CustomParser})

def check_keys(json, target):
    if target in json:
        return True
    for key in json:
        if isinstance(json[key], dict):
            if target in json[key]:
                return True
    return False

print(df.loc[df.json.apply(check_keys, args=('sub',))])
   id letter                 json
1   2      b           {'sub': 5}
2   3      c  {'add': {'sub': 4}}
When you read the file in, the json field will still be of str type. You can use ast.literal_eval to convert the string to a dictionary, and then use the apply method to check whether a cell contains the key add:
from ast import literal_eval
df["json"] = df["json"].apply(literal_eval)
df[df["json"].apply(lambda d: "add" in d)]
# id letter json
#0 1 a {'add': 2}
#2 3 c {'add': {'sub': 4}}
In case you want to check nested keys:
def check_add(d):
    if "add" in d:
        return True
    for k in d:
        if isinstance(d[k], dict):
            if check_add(d[k]):
                return True
    return False

df[df["json"].apply(check_add)]
# id letter json
#0 1 a {'add': 2}
#2 3 c {'add': {'sub': 4}}
This doesn't check nested values other than dictionaries; if you need that, it should be straightforward to implement along the same lines for your data.
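Following the same pattern, here is a recursive getter for actually pulling the values out (an illustrative helper, not from the original answers):
def get_value(d, target):
    # Return the value stored under target at any nesting depth, or None if absent
    if target in d:
        return d[target]
    for k in d:
        if isinstance(d[k], dict):
            found = get_value(d[k], target)
            if found is not None:
                return found
    return None

print(df["json"].apply(get_value, args=("sub",)).dropna())
# 1    5
# 2    4
# Name: json, dtype: object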

Importing single record using read_json in pandas

I am trying to import a json file using the function:
sku = pandas.read_json('https://cws01.worldstores.co.uk/api/product.php?product_sku=125T:FT0111')
However, I keep getting the following error:
ValueError: arrays must all be same length
What should I do to import it correctly into a dataframe?
This is the structure of the json:
{
"id": "5",
"sku": "JOSH:BECO-BRN",
"last_updated": "2013-06-10 15:46:22",
...
"propertyType1": [
"manufacturer_colour"
],
"category": [
{
"category_id": "10",
"category_name": "All Products"
},
...
{
"category_id": "238",
"category_name": "All Sofas"
}
],
"root_categories": [
"516"
],
"url": "/p/Beco Suede Sofa Bed?product_id=5",
"item": [
"2"
],
"image_names": "[\"https:\\/\\/cdn.worldstores.co.uk\\/images\\/products\\/L\\/19\\/Beco_Suede_Sofa_Bed-1.jpg\",\"https:\\/\\/cdn.worldstores.co.uk\\/images\\/products\\/P\\/19\\/Beco_Suede_Sofa_Bed-1.jpg\",\"https:\\/\\/cdn.worldstores.co.uk\\/images\\/products\\/SP\\/19\\/Beco_Suede_Sofa_Bed-1.jpg\",\"https:\\/\\/cdn.worldstores.co.uk\\/images\\/products\\/SS\\/19\\/Beco_Suede_Sofa_Bed-1.jpg\",\"https:\\/\\/cdn.worldstores.co.uk\\/images\\/products\\/ST\\/19\\/Beco_Suede_Sofa_Bed-1.jpg\",\"https:\\/\\/cdn.worldstores.co.uk\\/images\\/products\\/WP\\/19\\/Beco_Suede_Sofa_Bed-1.jpg\",\"https:\\/\\/cdn.worldstores.co.uk\\/images\\/products\\/L\\/19\\/Beco_Suede_Sofa_Bed-2.jpg\",\"https:\\/\\/cdn.worldstores.co.uk\\/images\\/products\\/P\\/19\\/Beco_Suede_Sofa_Bed-2.jpg\",\"https:\\/\\/cdn.worldstores.co.uk\\/images\\/products\\/SP\\/19\\/Beco_Suede_Sofa_Bed-2.jpg\",\"https:\\/\\/cdn.worldstores.co.uk \\/images\\/products\\/SS\\/19\\/Beco_Suede_Sofa_Bed-2.jpg\",\"https:\\/\\/cdn.worldstores.co.uk\\/images\\/products\\/ST\\/19\\/Beco_Suede_Sofa_Bed-2.jpg\",\"https:\\/\\/cdn.worldstores.co.uk\\/images\\/products\\/WP\\/19\\/Beco_Suede_Sofa_Bed-2.jpg\"]"
}
The pandas.read_json function accepts multiple formats.
Since you did not specify which format your json file is in (the orient= attribute), pandas defaults to assuming your data is columnar. The different formats pandas expects are discussed below.
The data that you are trying to parse from https://cws01.worldstores.co.uk/api/product.php?product_sku=125T:FT0111 does not seem to conform to any of the supported formats, as it is only a single "record", while pandas expects some kind of collection.
You should probably collect multiple entries into a single file, then parse it with the read_json function.
EDIT:
A simple way of getting multiple rows and parsing them with the pandas.read_json function:
import urllib2
import pandas as pd

url_base = "https://cws01.worldstores.co.uk/api/product.php?product_sku={}"
products = ["125T:FT0111", "125T:FT0111", "125T:FT0111"]

raw_data_list = []
for sku in products:
    url = url_base.format(sku)
    raw_data_list.append(urllib2.urlopen(url).read())

data = "[" + ",".join(raw_data_list) + "]"
data = pd.read_json(data, orient='records')
data
/EDIT
My take on the pandas.read_json function formats:
The pandas.read_json function is yet another example of pandas jamming as much functionality as possible into a single function, which of course leads to a very complicated function.
Series
If your data is a Series, pandas.read_json(orient=) defaults to 'index'
The values allowed for orient while parsing a Series are: {'split','records','index'}
Note that the Series index must be unique for orient='index'.
DataFrame
If your data is a DataFrame, pandas.read_json(orient=) defaults to 'columns'
The values allowed for orient while parsing a DataFrame are:
{'split','records','index','columns','values'}
Note that the Series index must be unique for orient='index' and orient='columns', and the DataFrame columns must be unique for orient='index', orient='columns', and orient='records'.
Format
No matter if your data is a DataFrame or a Series, the orient= will expect data in the same format:
Split
Expects a string representation of a dict like what the DataFrame constructor takes:
{"index":[1,2,3,4], "columns":["col1","col2"], "data":[[8,7,6,5], [5,6,7,8]]}
Records
Expects a string representation of a list of dicts like:
[{"col1":8,"col2":5},{"col1":7,"col2":6},{"col1":6,"col2":7},{"col1":5,"col2":8}]
Note there is no index set here.
Index
Expects a string representation of a nested dict dict like:
{"1":{"col1":8,"col2":5},"2":{"col1":7,"col2":6},"3":{"col1":6,"col2":7},"4":{"col1":5,"col2":8}}
Good to note is that it won't accept indices of types other than strings. This may be fixed in later versions.
Columns
Expects a string representation of a nested dict like:
{"col1":{"1":8,"2":7,"3":6,"4":5},"col2":{"1":5,"2":6,"3":7,"4":8}}
Values
Expects a string representation of a list like:
[[8, 5],[7, 6],[6, 7],[5, 8]]
Resulting dataframe
In most cases, the dataframe you get will look like this, given the json strings above:
   col1  col2
1     8     5
2     7     6
3     6     7
4     5     8
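A quick way to see these orients in action (a sketch using the sample strings above; wrapping in StringIO keeps pandas from treating the string as a file path):
from io import StringIO
import pandas as pd

records = '[{"col1":8,"col2":5},{"col1":7,"col2":6},{"col1":6,"col2":7},{"col1":5,"col2":8}]'
print(pd.read_json(StringIO(records), orient='records'))  # default integer index 0..3

split = '{"index":[1,2,3,4], "columns":["col1","col2"], "data":[[8,5],[7,6],[6,7],[5,8]]}'
print(pd.read_json(StringIO(split), orient='split'))      # index 1..4 as given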
Maybe this is not the most elegant solution, but it gives me back what I want, or at least I believe so; feel free to warn me if something is wrong.
url = "https://cws01.worldstores.co.uk/api/product.php?product_sku=125T:FT0111"
data = urllib2.urlopen(url).read()
data = json.loads(data)
data = pd.DataFrame(data.items())
data = data.transpose()
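On recent pandas (1.0+), json_normalize is a more direct route for a single nested record; a sketch assuming Python 3:
import json
import pandas as pd
from urllib.request import urlopen

url = "https://cws01.worldstores.co.uk/api/product.php?product_sku=125T:FT0111"
record = json.loads(urlopen(url).read().decode("utf-8"))
df = pd.json_normalize(record)  # one row; nested dicts become dotted column names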
Another solution is to use a try/except: if reading the URL directly fails because it holds a single record, fetch the body and wrap it in a JSON array first:
import urllib2
json_path = 'https://cws01.worldstores.co.uk/api/product.php?product_sku=125T:FT0111'
try:
    a = pd.read_json(json_path)
except ValueError:
    raw = urllib2.urlopen(json_path).read()
    a = pd.read_json("[" + raw + "]")  # wrap the single record in a JSON array
Iterating on #firelynx's answer:
#! /usr/bin/env python3
from urllib.request import urlopen

import pandas as pd

products = ["125T:FT0111", "125T:FT0111", "125T:FT0111"]

raw_lines = ""
for sku in products:
    url = f"https://cws01.worldstores.co.uk/api/product.php?product_sku={sku}"
    raw_lines += urlopen(url).read().decode("utf-8") + "\n"  # decode: read() returns bytes

data = pd.read_json(raw_lines, lines=True)
This would support any source returning a single JSON object or a bunch of newline ('\n') separated ones.
Or this one-liner(ish) should work the same:
#! /usr/bin/env python3
import pandas as pd

products = ["125T:FT0111", "125T:FT0111", "125T:FT0111"]
data = pd.concat(
    pd.read_json(
        f"https://cws01.worldstores.co.uk/api/product.php?product_sku={sku}",
        lines=True
    ) for sku in products
)
PS: Python 3 is only needed here for f-string support, so use str.format instead if you need Python 2 compatibility.
