Importing single record using read_json in pandas - python

I am trying to import a json file using the function:
sku = pandas.read_json('https://cws01.worldstores.co.uk/api/product.php?product_sku=125T:FT0111')
However, I keep getting the following error:
ValueError: arrays must all be same length
What should I do to import it correctly into a dataframe?
This is the structure of the JSON:
{
"id": "5",
"sku": "JOSH:BECO-BRN",
"last_updated": "2013-06-10 15:46:22",
...
"propertyType1": [
"manufacturer_colour"
],
"category": [
{
"category_id": "10",
"category_name": "All Products"
},
...
{
"category_id": "238",
"category_name": "All Sofas"
}
],
"root_categories": [
"516"
],
"url": "/p/Beco Suede Sofa Bed?product_id=5",
"item": [
"2"
],
"image_names": "[\"https:\\/\\/cdn.worldstores.co.uk\\/images\\/products\\/L\\/19\\/Beco_Suede_Sofa_Bed-1.jpg\",\"https:\\/\\/cdn.worldstores.co.uk\\/images\\/products\\/P\\/19\\/Beco_Suede_Sofa_Bed-1.jpg\",\"https:\\/\\/cdn.worldstores.co.uk\\/images\\/products\\/SP\\/19\\/Beco_Suede_Sofa_Bed-1.jpg\",\"https:\\/\\/cdn.worldstores.co.uk\\/images\\/products\\/SS\\/19\\/Beco_Suede_Sofa_Bed-1.jpg\",\"https:\\/\\/cdn.worldstores.co.uk\\/images\\/products\\/ST\\/19\\/Beco_Suede_Sofa_Bed-1.jpg\",\"https:\\/\\/cdn.worldstores.co.uk\\/images\\/products\\/WP\\/19\\/Beco_Suede_Sofa_Bed-1.jpg\",\"https:\\/\\/cdn.worldstores.co.uk\\/images\\/products\\/L\\/19\\/Beco_Suede_Sofa_Bed-2.jpg\",\"https:\\/\\/cdn.worldstores.co.uk\\/images\\/products\\/P\\/19\\/Beco_Suede_Sofa_Bed-2.jpg\",\"https:\\/\\/cdn.worldstores.co.uk\\/images\\/products\\/SP\\/19\\/Beco_Suede_Sofa_Bed-2.jpg\",\"https:\\/\\/cdn.worldstores.co.uk \\/images\\/products\\/SS\\/19\\/Beco_Suede_Sofa_Bed-2.jpg\",\"https:\\/\\/cdn.worldstores.co.uk\\/images\\/products\\/ST\\/19\\/Beco_Suede_Sofa_Bed-2.jpg\",\"https:\\/\\/cdn.worldstores.co.uk\\/images\\/products\\/WP\\/19\\/Beco_Suede_Sofa_Bed-2.jpg\"]"
}

The pandas.read_json function accepts several JSON layouts.
Since you did not specify which layout your JSON is in (the orient= parameter), pandas falls back to its default and assumes your data is column-oriented. The different layouts pandas expects are discussed below.
The data that you are trying to parse from https://cws01.worldstores.co.uk/api/product.php?product_sku=125T:FT0111 does not seem to conform to any of the supported layouts, as it is only a single "record". Pandas expects some kind of collection.
You should probably collect multiple entries into a single file, then parse it with the read_json function.
EDIT:
Simple way of getting multiple rows and parsing it with the pandas.read_json function:
import urllib2
import pandas as pd
url_base = "https://cws01.worldstores.co.uk/api/product.php?product_sku={}"
products = ["125T:FT0111", "125T:FT0111", "125T:FT0111"]
raw_data_list = []
for sku in products:
    url = url_base.format(sku)
    raw_data_list.append(urllib2.urlopen(url).read())
data = "[" + (",".join(raw_data_list)) + "]"
data = pd.read_json(data, orient='records')
data
/EDIT
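For a single record, the same idea can be tried without any network access. The sketch below assumes a trimmed, made-up stand-in for the API response (the `raw` string is not the real payload). The lists under "category" and "root_categories" have different lengths, which is what the column-oriented default trips over; wrapping the parsed dict in a list turns it into one row instead.

```python
import json

import pandas as pd

# Trimmed, made-up stand-in for the API response shown in the question.
raw = """
{
  "id": "5",
  "sku": "JOSH:BECO-BRN",
  "category": [
    {"category_id": "10", "category_name": "All Products"},
    {"category_id": "238", "category_name": "All Sofas"}
  ],
  "root_categories": ["516"],
  "item": ["2"]
}
"""

record = json.loads(raw)

# A list with one dict means "one row"; list values simply stay inside their cells.
df = pd.DataFrame([record])

# If the nested categories are needed as rows, they can be expanded separately,
# carrying the sku along as a key back to the product.
categories = pd.json_normalize(record, record_path="category", meta=["sku"])
```

Here `df` has one row per product, while `categories` has one row per category entry.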
My take on the pandas.read_json function formats.
The pandas.read_json function is yet another shining example of pandas trying to jam as much functionality as possible into a single function. This, of course, leads to a very complicated function.
Series
If your data is a Series, pandas.read_json(orient=) defaults to 'index'
The values allowed for orient while parsing a Series are: {'split','records','index'}
Note that the Series index must be unique for orient='index'.
DataFrame
If your data is a DataFrame, pandas.read_json(orient=) defaults to 'columns'
The values allowed for orient while parsing a DataFrame are:
{'split','records','index','columns','values'}
Note that the DataFrame index must be unique for orient='index' and orient='columns', and the DataFrame columns must be unique for orient='index', orient='columns', and orient='records'.
Format
No matter if your data is a DataFrame or a Series, the orient= will expect data in the same format:
Split
Expects a string representation of a dict with "index", "columns" and "data" keys, where "data" holds one list per row:
{"index":[1,2,3,4], "columns":["col1","col2"], "data":[[8,5],[7,6],[6,7],[5,8]]}
Records
Expects a string representation of a list of dicts like:
[{"col1":8,"col2":5},{"col1":7,"col2":6},{"col1":6,"col2":7},{"col1":5,"col2":8}]
Note there is no index set here.
Index
Expects a string representation of a nested dict like:
{"1":{"col1":8,"col2":5},"2":{"col1":7,"col2":6},"3":{"col1":6,"col2":7},"4":{"col1":5,"col2":8}}
Note that it won't accept indices of types other than strings. This may be fixed in later versions.
Columns
Expects a string representation of a nested dict like:
{"col1":{"1":8,"2":7,"3":6,"4":5},"col2":{"1":5,"2":6,"3":7,"4":8}}
Values
Expects a string representation of a list like:
[[8, 5],[7, 6],[6, 7],[5, 8]]
Resulting dataframe
In most cases, the dataframe you get from the JSON strings above will look like this:
   col1  col2
1     8     5
2     7     6
3     6     7
4     5     8
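The five layouts can be compared side by side with a short sketch. It assumes a pandas version that accepts file-like input (hence `StringIO`), and the "split" sample here uses one inner list per row. The 'records' and 'values' layouts carry no index, so they come back with a default 0..3 index rather than 1..4.

```python
from io import StringIO

import pandas as pd

# One sample string per orient, all describing the same 4x2 table.
samples = {
    "split": '{"index":[1,2,3,4],"columns":["col1","col2"],'
             '"data":[[8,5],[7,6],[6,7],[5,8]]}',
    "records": '[{"col1":8,"col2":5},{"col1":7,"col2":6},'
               '{"col1":6,"col2":7},{"col1":5,"col2":8}]',
    "index": '{"1":{"col1":8,"col2":5},"2":{"col1":7,"col2":6},'
             '"3":{"col1":6,"col2":7},"4":{"col1":5,"col2":8}}',
    "columns": '{"col1":{"1":8,"2":7,"3":6,"4":5},'
               '"col2":{"1":5,"2":6,"3":7,"4":8}}',
    "values": "[[8,5],[7,6],[6,7],[5,8]]",
}

# Parse every sample with its matching orient.
frames = {
    orient: pd.read_json(StringIO(text), orient=orient)
    for orient, text in samples.items()
}

for orient, frame in frames.items():
    print(orient)
    print(frame)
```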

Maybe this is not the most elegant solution, but it gives me back what I want, or at least I believe so. Feel free to point out anything that is wrong.
import json
import urllib2
import pandas as pd
url = "https://cws01.worldstores.co.uk/api/product.php?product_sku=125T:FT0111"
data = urllib2.urlopen(url).read()
data = json.loads(data)
data = pd.DataFrame(data.items())
data = data.transpose()

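Another solution is to use a try/except: if read_json fails on the single record, fetch the JSON text and wrap it in [ ] so that pandas sees a list containing one record.
from urllib.request import urlopen
json_path = 'https://cws01.worldstores.co.uk/api/product.php?product_sku=125T:FT0111'
try:
    a = pd.read_json(json_path)
except ValueError:
    a = pd.read_json("[" + urlopen(json_path).read().decode("utf-8") + "]")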

Iterating on @firelynx's answer:
#! /usr/bin/env python3
from urllib.request import urlopen
import pandas as pd
products = ["125T:FT0111", "125T:FT0111", "125T:FT0111"]
raw_lines = ""
for sku in products:
    url = f"https://cws01.worldstores.co.uk/api/product.php?product_sku={sku}"
    # urlopen(...).read() returns bytes in Python 3, so decode before concatenating
    raw_lines += urlopen(url).read().decode("utf-8") + "\n"
data = pd.read_json(raw_lines, lines=True)
This would support any source returning a single JSON object or a bunch of newline ('\n') separated ones.
Or this one-liner(ish) should work the same:
#! /usr/bin/env python3
import pandas as pd
products = ["125T:FT0111", "125T:FT0111", "125T:FT0111"]
data = pd.concat(
    pd.read_json(
        f"https://cws01.worldstores.co.uk/api/product.php?product_sku={sku}",
        lines=True
    ) for sku in products
)
PS: Python 3 is only needed here for f-string support (and for urllib.request in the first version), so use str.format (and urllib2) for Python 2 compatibility.
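One caveat with lines=True is that every record has to sit on exactly one line. A minimal offline sketch, assuming two made-up pretty-printed responses (the `responses` list stands in for what the API might return), compacts each record before joining them:

```python
import json
from io import StringIO

import pandas as pd

# Made-up stand-ins for two API responses; real responses may be pretty-printed.
responses = [
    '{\n  "sku": "JOSH:BECO-BRN",\n  "item": ["2"]\n}',
    '{\n  "sku": "125T:FT0111",\n  "item": ["3", "4"]\n}',
]

# Round-trip through json so that each record ends up on a single line.
raw_lines = "\n".join(json.dumps(json.loads(text)) for text in responses)

# One JSON object per line becomes one row per object.
data = pd.read_json(StringIO(raw_lines), lines=True)
```

Passing the text through `StringIO` also avoids the warning that newer pandas versions emit for literal JSON strings.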


Cast from maptype to stringtype loses json formatting, pyspark

I have a .geojson file that I converted to a dataframe using
df = geojson_to_dataframe(raw, glob='*.geojson', batch_size=100000)
Once that is set up, I need to convert the first column, "properties", to a StringType, since PySpark labels it as a map type. To cast it to a string I run the code below.
df = df.select(df.properties.cast(StringType()).alias("properties"))
The problem is that the new column is stripped of all its quotation marks, each colon becomes "->", and the JSON formatting is lost. Does anyone know how to convert a map type to a string type without losing the JSON formatting?
The column as a map type before the cast
{
"BFE_LN_ID":"01001C_722",
"DFIRM_ID":"01001C",
"NAME":"Alabama",
"ALAND":"131185042550",
}
...
The column as a string after the cast
[
BFE_LN_ID->01001C_722,
DFIRM_ID->01001C,
NAME->Alabama,
ALAND->131185042550,
]
Thank you
Update:
Source geojson data that I converted using the above geojson_to_dataframe() code:
properties
geometry
{"BFE_LN_ID":"01001C_722","DFIRM_ID":"01001C","NAME":"Alabama","ALAND":"131185042550","Shape_Length":"0.00010778212891927651","AWATER":"4582333181","LEN_UNIT":"Feet","VERSION_ID":"1.1.1.0","GFID":"20140910","INTPTLAT":"+32.7395785","STATENS":"01779775","REGION":"3","FUNCSTAT":"A","DIVISION":"6","GEOID":"01","INTPTLON":"-086.8434469","STATEFP":"01","SOURCE_CIT":"01001C_STUDY1","ELEV":"376.0","V_DATUM":"NAVD88","LSAD":"00","STUSPS":"AL","MTFCC":"G4000"}
{"coordinates": [[-86.46788227456062, 32.487228761833364], [-86.46796895264879, 32.48717204906342], [-86.46797248221748, 32.48716977416143]], "type": "LineString"}
Code used to get incorrect result
def nfhl_string(nfhl):
    df = nfhl
    df = df.select(df.properties.cast(StringType()).alias("properties"))
    return df

How to extract a json file data into a pandas df using Python?

I am new to using JSON files and came across this data.
I am trying to import it into a pandas dataframe as columns so that I can further work with it. However I think I am running into a nested list for some columns of the data and hence my dataframe looks like this:
This is a sample of the data I am using:
{"fraudulent":false,"customer":
{"customerEmail":"josephhoward#yahoo.com","customerPhone":"400-108-5415",
"customerDevice":"yyeiaxpltf82440jnb3v","customerIPAddress":"8.129.104.40",
"customerBillingAddress":"5493 Jones Islands\nBrownside, CA 51896"},
"orders":[{"orderId":"vjbdvd","orderAmount":18,"orderState":"pending",
"orderShippingAddress":"5493 Jones Islands\nBrownside, CA 51896"},
{"orderId":"yp6x27","orderAmount":26,"orderState":"fulfilled",
"orderShippingAddress":"5493 Jones Islands\nBrownside, CA 51896"}],
"paymentMethods":[{"paymentMethodId":"wt07xm68b",
"paymentMethodRegistrationFailure":true,"paymentMethodType":"card",
"paymentMethodProvider":"JCB 16 digit",
"paymentMethodIssuer":"Citizens First Banks"}],"transactions":[
{"transactionId":"a9lcj51r","orderId":"vjbdvd",
"paymentMethodId":"wt07xm68b","transactionAmount":18,
"transactionFailed":false},{"transactionId":"y4wcv03i",
"orderId":"yp6x27","paymentMethodId":"wt07xm68b",
"transactionAmount":26,"transactionFailed":false}]}
As you can see from the image above, columns like orders contain a list of features such as orderAmount, orderState etc. I want those values to be split into their own columns, so that I get a pandas data frame with all features as separate columns with their corresponding values.
So far I have tried using json_normalize but that did not resolve my issue.
Kindly help.
Ok, given what you've said, the first thing you should do is unravel that JSON like so to make it easier to understand:
{
    "fraudulent":false,
    "customer": {
        "customerEmail":"josephhoward#yahoo.com",
        "customerPhone":"400-108-5415",
        "customerDevice":"yyeiaxpltf82440jnb3v",
        "customerIPAddress":"8.129.104.40",
        "customerBillingAddress":"5493 Jones Islands\nBrownside, CA 51896"
    },
    "orders":[
        {
            "orderId":"vjbdvd",
            "orderAmount":18,
            "orderState":"pending",
            "orderShippingAddress":"5493 Jones Islands\nBrownside, CA 51896"
        },
        {
            "orderId":"yp6x27",
            "orderAmount":26,
            "orderState":"fulfilled",
            "orderShippingAddress":"5493 Jones Islands\nBrownside, CA 51896"
        }
    ],
    "paymentMethods":[
        {
            "paymentMethodId":"wt07xm68b",
            "paymentMethodRegistrationFailure":true,
            "paymentMethodType":"card",
            "paymentMethodProvider":"JCB 16 digit",
            "paymentMethodIssuer":"Citizens First Banks"
        }
    ],
    "transactions":[
        {
            "transactionId":"a9lcj51r",
            "orderId":"vjbdvd",
            "paymentMethodId":"wt07xm68b",
            "transactionAmount":18,
            "transactionFailed":false
        },
        {
            "transactionId":"y4wcv03i",
            "orderId":"yp6x27",
            "paymentMethodId":"wt07xm68b",
            "transactionAmount":26,
            "transactionFailed":false
        }
    ]
}
The next thing to think about is how to structure the DataFrame. By the looks of your data, you will probably have to split this over multiple DataFrames, since transactions, payment methods, and orders can each have a variable number of entries per customer.
So you could have one DataFrame where each row corresponds to a customer, with an ID for each customer (or just use something unique like their email), and then a second DataFrame for all orders, with a column containing that ID to relate each order back to the customer DataFrame.
import json
import pandas as pd
customers = []
orders = []
paymentMethods = []
transactions = []
with open('data.json') as f:
    # raw_json is assumed to be a list of records like the one above
    raw_json = json.load(f)
customer_id = 0
for i in raw_json:
    customers.append([
        customer_id,
        i['fraudulent'],
        i['customer']['customerEmail'],
        i['customer']['customerPhone'],
        i['customer']['customerDevice'],
        i['customer']['customerIPAddress'],
        i['customer']['customerBillingAddress'],
    ])
    for j in i['orders']:
        orders.append([
            customer_id,
            j['orderId'],
            j['orderAmount']
            # etc
        ])
    # etc
    customer_id += 1
customers_df = pd.DataFrame(data=customers, columns=['customer_id', 'fraudulent', 'customerEmail', 'customerPhone', 'customerDevice', 'customerIPAddress', 'customerBillingAddress'])
orders_df = pd.DataFrame(data=orders, columns=['customer_id', 'orderId', 'orderAmount'])
You will need to expand that code to fully parse the JSON, but it should give you a good start. Note how customer_id is used to relate the entries in customers_df to those in orders_df.
I'd look at the pandas docs for more information about constructing a DataFrame.
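As an alternative to the hand-written loops, pandas.json_normalize can build the same kind of related tables when it is given a record_path and meta. This is only a sketch: the record is abbreviated from the sample in the question, and using the customer email as the linking key is an assumption (a generated customer_id would work the same way).

```python
import pandas as pd

# One record, abbreviated from the sample in the question.
record = {
    "fraudulent": False,
    "customer": {
        "customerEmail": "josephhoward#yahoo.com",
        "customerPhone": "400-108-5415",
    },
    "orders": [
        {"orderId": "vjbdvd", "orderAmount": 18, "orderState": "pending"},
        {"orderId": "yp6x27", "orderAmount": 26, "orderState": "fulfilled"},
    ],
    "transactions": [
        {"transactionId": "a9lcj51r", "orderId": "vjbdvd", "transactionAmount": 18},
        {"transactionId": "y4wcv03i", "orderId": "yp6x27", "transactionAmount": 26},
    ],
}
records = [record]  # a real file would contribute many of these

# Path to the value used to link child rows back to their customer (an assumption).
key = ["customer", "customerEmail"]

# One row per order / per transaction, each carrying the customer's email.
orders_df = pd.json_normalize(records, record_path="orders", meta=[key])
transactions_df = pd.json_normalize(records, record_path="transactions", meta=[key])

# One row per customer; nested "customer" fields become "customer.<field>" columns.
customers_df = pd.json_normalize(records).drop(columns=["orders", "transactions"])

# The tables can be joined again on the shared keys when a flat view is needed.
flat = orders_df.merge(transactions_df, on=["orderId", "customer.customerEmail"])
```

Keeping the tables separate avoids repeating customer fields on every order row, and the merge at the end produces the flat view the question asks for.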

converting google datastore query result to pandas dataframe in python

I need to convert a Google Cloud Datastore query result to a dataframe, to create a chart from the retrieved data. The query:
def fetch_times(limit):
    start_date = '2019-10-08'
    end_date = '2019-10-19'
    query = datastore_client.query(kind='ParticleEvent')
    query.add_filter(
        'published_at', '>', start_date)
    query.add_filter(
        'published_at', '<', end_date)
    query.order = ['-published_at']
    times = query.fetch(limit=limit)
    return times
creates a JSON-like string of the results for each entity returned by the query:
<Entity('ParticleEvent', 5942717456580608) {'gc_pub_sub_id': '438169950283983', 'data': '605', 'event': 'light intensity', 'published_at': '2019-10-11T14:37:45.407Z', 'device_id': 'e00fce6847be7713698287a1'}>
I thought I had found something that would translate this to JSON, which I could then convert to a dataframe, but I get an error that the properties attribute does not exist:
def to_json(gql_object):
    result = []
    for item in gql_object:
        result.append(dict([(p, getattr(item, p)) for p in item.properties()]))
    return json.dumps(result, cls=JSONEncoder)
Is there a way to iterate through the query results to get them into a dataframe either directly to a dataframe or by converting to json then to dataframe?
Datastore entities can be treated as Python base dictionaries! So you should be able to do something as simple as...
df = pd.DataFrame(datastore_entities)
...and pandas will do all the rest.
If you needed to convert the entity key, or any of its attributes to a column as well, you can pack them into the dictionary separately:
for e in entities:
    e['entity_key'] = e.key
    e['entity_key_name'] = e.key.name # for example
df = pd.DataFrame(entities)
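The dict-like behaviour can be illustrated without a Datastore connection. In this sketch `FakeEntity` is a hypothetical stand-in for a Datastore entity (a dict subclass carrying a `.key`), and the second row's values are made up. Building plain row dicts first also matters with a real query, because the fetch result is an iterator that can only be consumed once.

```python
import pandas as pd


class FakeEntity(dict):
    """Hypothetical stand-in for a Datastore entity: a dict with a .key attribute."""

    def __init__(self, key, **props):
        super().__init__(**props)
        self.key = key


# Made-up sample data modelled on the entity printed in the question.
entities = [
    FakeEntity(5942717456580608, data="605", event="light intensity",
               published_at="2019-10-11T14:37:45.407Z"),
    FakeEntity(5942717456580609, data="612", event="light intensity",
               published_at="2019-10-11T14:38:45.407Z"),
]

# Copy each entity into a plain dict and add the key as an extra column.
rows = [dict(e, entity_key=e.key) for e in entities]
df = pd.DataFrame(rows)

# The properties arrive as strings, so convert them before charting.
df["published_at"] = pd.to_datetime(df["published_at"])
df["data"] = df["data"].astype(int)
```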
You can also parse the printed query output into a dataframe.
Assuming the output is the string that you have shared above, the following approach can work.
import ast
# Extract everything from the start of the dictionary, dropping the trailing ">"
startPos = line.find("{")
df = pd.DataFrame([ast.literal_eval(line[startPos:-1])])
The output looks like:
     gc_pub_sub_id data            event              published_at  \
0  438169950283983  605  light intensity  2019-10-11T14:37:45.407Z
                  device_id
0  e00fce6847be7713698287a1
Here, line[startPos:-1] is essentially the entire dictionary in the string input. Using ast.literal_eval (a safer alternative to eval for plain literals), we can convert it into an actual dictionary. Once we have that, it can easily be converted into a dataframe object.
Original poster found a workaround, which is to convert each item in the query result object to string, and then manually parse the string to extract the needed data into a list.
The return value of the fetch function is google.cloud.datastore.query.Iterator which behaves like a List[dict] so the output of fetch can be passed directly into pd.DataFrame.
import pandas as pd
df = pd.DataFrame(fetch_times(10))
This is similar to @bkitej's answer, but I added the use of the original poster's function.

pandas read_json: "If using all scalar values, you must pass an index"

I have some difficulty in importing a JSON file with pandas.
import pandas as pd
map_index_to_word = pd.read_json('people_wiki_map_index_to_word.json')
This is the error that I get:
ValueError: If using all scalar values, you must pass an index
The file structure is simplified like this:
{"biennials": 522004, "lb915": 116290, "shatzky": 127647, "woode": 174106, "damfunk": 133206, "nualart": 153444, "hatefillot": 164111, "missionborn": 261765, "yeardescribed": 161075, "theoryhe": 521685}
It is from the machine learning course of University of Washington on Coursera. You can find the file here.
Try
ser = pd.read_json('people_wiki_map_index_to_word.json', typ='series')
That file only contains key value pairs where values are scalars. You can convert it to a dataframe with ser.to_frame('count').
You can also do something like this:
import json
with open('people_wiki_map_index_to_word.json', 'r') as f:
data = json.load(f)
Now data is a dictionary. You can pass it to a dataframe constructor like this:
df = pd.DataFrame({'count': data})
You can do as @ayhan mentioned, which will give you a column-based format.
Or you can enclose the object in [ ] as shown below to get a row format, which is convenient if you are loading multiple values and planning to use a matrix for your machine learning models.
df = pd.DataFrame([data])
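The options above can be compared on a shortened copy of the sample. This sketch assumes an in-memory string (`raw`) in place of the course file, read through `StringIO`.

```python
import json
from io import StringIO

import pandas as pd

# Shortened copy of the sample from the question.
raw = '{"biennials": 522004, "lb915": 116290, "shatzky": 127647}'

# The default (typ="frame") fails because every value is a scalar.
try:
    pd.read_json(StringIO(raw))
    raised = False
except ValueError:
    raised = True

# Column format: one row per word, via a Series.
ser = pd.read_json(StringIO(raw), typ="series")
col_df = ser.to_frame("count")

# Row format: one row in total, one column per word.
data = json.loads(raw)
row_df = pd.DataFrame([data])
```

`col_df` is the natural shape for a word-to-index lookup, while `row_df` suits the case where each file contributes one row of features.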
I think what is happening is that the data in
map_index_to_word = pd.read_json('people_wiki_map_index_to_word.json')
is being read as a string instead of a json
{"biennials": 522004, "lb915": 116290, "shatzky": 127647, "woode": 174106, "damfunk": 133206, "nualart": 153444, "hatefillot": 164111, "missionborn": 261765, "yeardescribed": 161075, "theoryhe": 521685}
is actually
'{"biennials": 522004, "lb915": 116290, "shatzky": 127647, "woode": 174106, "damfunk": 133206, "nualart": 153444, "hatefillot": 164111, "missionborn": 261765, "yeardescribed": 161075, "theoryhe": 521685}'
Since a string is a scalar, you have to convert it to a dict before loading it, which is exactly what the other response is doing.
The best way is to call json.loads on the string to convert it to a dict, and then load that into pandas (wrapped in a list, so that the dict becomes one row):
import json
import pandas as pd
with open('people_wiki_map_index_to_word.json') as f:
    myfile = f.read()
jsonData = json.loads(myfile)
df = pd.DataFrame([jsonData])
{
    "biennials": 522004,
    "lb915": 116290
}
df = pd.read_json('values.json')
Because pd.read_json expects a list for each key, such as
{
    "biennials": [522004],
    "lb915": [116290]
}
it returns an error saying
If using all scalar values, you must pass an index.
So you can resolve this by specifying the typ argument of pd.read_json ('series' reads the key/value pairs into a Series):
map_index_to_word = pd.read_json('Datasets/people_wiki_map_index_to_word.json', typ='series')
For newer pandas, 0.19.0 and later, use the lines parameter, set it to True.
The file is read as a json object per line.
import pandas as pd
map_index_to_word = pd.read_json('people_wiki_map_index_to_word.json', lines=True)
It fixed the following errors I encountered, especially when some of the JSON files have only one value:
ValueError: If using all scalar values, you must pass an index
JSONDecodeError: Expecting value: line 1 column 1 (char 0)
ValueError: Trailing data
For example
cat values.json
{
    "name": "Snow",
    "age": "31"
}
df = pd.read_json('values.json')
Chances are you will end up with this:
Error: if using all scalar values, you must pass an index
Pandas looks for a list or dictionary in each value. Something like
cat values.json
{
    "name": ["Snow"],
    "age": ["31"]
}
So try doing this instead. The result can later be converted to HTML with to_html():
df = pd.DataFrame([pd.read_json(report_file, typ='series')])
result = df.to_html()
I solved this by converting it into an array like so
[{"biennials": 522004, "lb915": 116290, "shatzky": 127647, "woode": 174106, "damfunk": 133206, "nualart": 153444, "hatefillot": 164111, "missionborn": 261765, "yeardescribed": 161075, "theoryhe": 521685}]

Splitting Regex response column on python

I am receiving an object array after applying re.findall for link and hashtags on Tweets data. My data looks like
b=['https://t.co/1u0dkzq2dV', 'https://t.co/3XIZ0SN05Q']
['https://t.co/CJZWjaBfJU']
['https://t.co/4GMhoXhBQO', 'https://t.co/0V']
['https://t.co/Erutsftlnq']
['https://t.co/86VvLJEzvG', 'https://t.co/zCYv5WcFDS']
Now I want to split it into columns. I am using the following:
df = pd.DataFrame(b.str.split(',',1).tolist(),columns = ['flips','row'])
But it is not working, because of the weird datatype I guess. I tried a few other solutions as well, but nothing worked. This is what I am expecting, two separate columns:
https://t.co/1u0dkzq2dV https://t.co/3XIZ0SN05Q
https://t.co/CJZWjaBfJU
https://t.co/4GMhoXhBQO https://t.co/0V
https://t.co/Erutsftlnq
https://t.co/86VvLJEzvG
It's not clear from your question what exactly is part of your data (does it include the square brackets and single quotes?). In any case, the pandas read_csv function is very versatile and can handle ragged data:
from io import StringIO
import pandas as pd
raw_data = """
['https://t.co/1u0dkzq2dV', 'https://t.co/3XIZ0SN05Q']
['https://t.co/CJZWjaBfJU']
['https://t.co/4GMhoXhBQO', 'https://t.co/0V']
['https://t.co/Erutsftlnq']
['https://t.co/86VvLJEzvG', 'https://t.co/zCYv5WcFDS']
"""
# You'll probably replace the StringIO part with the filename of your data.
df = pd.read_csv(StringIO(raw_data), header=None, names=('flips','row'))
# Get rid of the square brackets, single quotes and the space left after each comma
for col in ('flips', 'row'):
    df[col] = df[col].str.strip("[]' ")
df
Output:
                     flips                      row
0  https://t.co/1u0dkzq2dV  https://t.co/3XIZ0SN05Q
1  https://t.co/CJZWjaBfJU                      NaN
2  https://t.co/4GMhoXhBQO          https://t.co/0V
3  https://t.co/Erutsftlnq                      NaN
4  https://t.co/86VvLJEzvG  https://t.co/zCYv5WcFDS
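If the data is not text at all but a Series holding Python lists (which is what re.findall produces for each tweet), no parsing is needed. A minimal sketch under that assumption, with `b` rebuilt by hand from the values in the question:

```python
import pandas as pd

# Assumed shape of the data: one list of matches per tweet.
b = pd.Series([
    ["https://t.co/1u0dkzq2dV", "https://t.co/3XIZ0SN05Q"],
    ["https://t.co/CJZWjaBfJU"],
    ["https://t.co/4GMhoXhBQO", "https://t.co/0V"],
    ["https://t.co/Erutsftlnq"],
    ["https://t.co/86VvLJEzvG", "https://t.co/zCYv5WcFDS"],
])

# Each inner list becomes a row; shorter lists are padded with missing values.
df = pd.DataFrame(b.tolist(), columns=["flips", "row"])
```

This only works as written when no list has more than two entries; with longer lists the columns argument would need one name per position.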
