I am simply trying to read my JSON file in Python. I am in the correct folder when I do so (Downloads), and my file is called 'Books_5.json'. However, when I try to use the .read() method, I get the error
OSError: [Errno 22] Invalid argument
This is my code:
import json
config = json.loads(open('Books_5.json').read())
This also raises the same error:
books = open('Books_5.json').read()
If it helps, this is a small snippet of what my data looks like:
{"reviewerID": "A10000012B7CGYKOMPQ4L", "asin": "000100039X", "reviewerName": "Adam", "helpful": [0, 0], "reviewText": "Spiritually and mentally inspiring! A book that allows you to question your morals and will help you discover who you really are!", "overall": 5.0, "summary": "Wonderful!", "unixReviewTime": 1355616000, "reviewTime": "12 16, 2012"}
{"reviewerID": "A2S166WSCFIFP5", "asin": "000100039X", "reviewerName": "adead_poet#hotmail.com \"adead_poet#hotmail.com\"", "helpful": [0, 2], "reviewText": "This is one my must have books. It is a masterpiece of spirituality. I'll be the first to admit, its literary quality isn't much. It is rather simplistically written, but the message behind it is so powerful that you have to read it. It will take you to enlightenment.", "overall": 5.0, "summary": "close to god", "unixReviewTime": 1071100800, "reviewTime": "12 11, 2003"}
I'm using Python 3.6 on macOS.
It appears this is some kind of bug that occurs when the file is too large (mine was ~10 GB). Once I used split to break the file into 200k-line pieces, the .read() error went away. This is true even if the pieces are not strict JSON.
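If splitting isn't convenient, a workaround sketch (assuming the large-read limitation described above, and relying on the fact that this data is one JSON object per line) is to parse the file line by line instead of issuing one giant read():

```python
import json

def parse_lines(path):
    """Parse one JSON object per line, never asking for one huge read()."""
    records = []
    with open(path) as f:
        for line in f:  # iterates line by line, so no single oversized read
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records
```

For a file this size you may prefer to process each record inside the loop rather than accumulate the whole list in memory.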
Your code looks fine; the problem is that your JSON data is formatted incorrectly. As others have suggested, it should be in the form [{},{},...]. Try the following:
[{"reviewerID": "A10000012B7CGYKOMPQ4L", "asin": "000100039X",
"reviewerName": "Adam", "helpful": [0, 0], "reviewText": "Spiritually and
mentally inspiring! A book that allows you to question your morals and will
help you discover who you really are!", "overall": 5.0, "summary":
"Wonderful!", "unixReviewTime": 1355616000, "reviewTime": "12 16, 2012"},
{"reviewerID": "A2S166WSCFIFP5", "asin": "000100039X", "reviewerName":
"adead_poet#hotmail.com \"adead_poet#hotmail.com\"", "helpful": [0, 2],
"reviewText": "This is one my must have books. It is a masterpiece of
spirituality. I'll be the first to admit, its literary quality isn't much.
It is rather simplistically written, but the message behind it is so
powerful that you have to read it. It will take you to enlightenment.",
"overall": 5.0, "summary": "close to god", "unixReviewTime": 1071100800,
"reviewTime": "12 11, 2003"}]
Your code and this data worked for me on Windows 7 and Python 2.7. That's different from your setup, but it should still be fine.
To read a JSON file, you can use the following example:
import json

with open('your_data.json') as data_file:
    data = json.load(data_file)

print(data)
print(data[0]['your_key'])  # get a value via its key
Also try converting your JSON objects into a list:
[
{'reviewerID': "A10000012B7CGYKOMPQ4L", ....},
{'asin': '000100039X', .....}
]
I have an HTML file, and this file contains several scripts. Specifically, the last <script></script> contains a value that I would like to get: the hash value found here.
extend(cur, { "hash": "13334a0e457f0793ec", "loginHost": "login", "sureBoxText": false, "strongCode": 0, "joinParams": false, "validationType": 3, "resendDelay": 120, "calledPhoneLen": 4, "calledPhoneExcludeCountries": [1, 49, 200] });
How can I do this? I've tried using BeautifulSoup, but I think I'm doing it wrong. I really need to get this done; if you can help me I will be eternally grateful.
I also tried using the re library, but I don't know how to use it. For example:
re.search(html, "hash: (*?),")
Is there any way to do a search like this?
You can use .group() to access a captured group:
import re
data = """extend(cur, { "hash": "13334a0e457f0793ec", "loginHost": "login", "sureBoxText": false, "strongCode": 0, "joinParams": false, "validationType": 3, "resendDelay": 120, "calledPhoneLen": 4, "calledPhoneExcludeCountries": [1, 49, 200] });"""
print(re.search(r'{ "hash": "(.*?)",', data).group(1))
Output:
13334a0e457f0793ec
Regular expression explanation: the pattern { "hash": "(.*?)", matches the literal text { "hash": ", then (.*?) lazily captures everything up to the next ",. Calling .group(1) returns that captured value.
I'm struggling with the JSON data I get from an API. I request data from several API URLs and store the results in a list. I then want to pull out every "reputation" field, since I'm only interested in those numbers. See my code here:
import json
import requests

f = requests.get('my_api_url')
if f.ok:
    data = json.loads(f.content)

url_list = []  # the list stores a number of urls that I want to request data from
for items in data:
    url_list.append(items['details_url'])  # grab the urls that I want to enter

total_url = []  # stores all data from all urls here
for index in range(len(url_list)):
    url = requests.get(url_list[index])
    if url.ok:
        url_data = json.loads(url.content)
        total_url.append(url_data)

print(json.dumps(total_url, indent=2))  # only want to see if it's working
Thus far I'm happy and can enter all urls and get the data. It's in the next step I get trouble. The above code outputs the following json data for me:
[
[
{
"id": 316,
"name": "storabro",
"url": "https://storabro.net",
"customer": true,
"administrator": false,
"reputation": 568
}
],
[
{
"id": 541,
"name": "sega",
"url": "https://wedonthaveanyyet.com",
"customer": true,
"administrator": false,
"reputation": 45
},
{
"id": 90,
"name": "Villa",
"url": "https://brandvillas.co.uk",
"customer": true,
"administrator": false,
"reputation": 6
}
]
]
However, I only want to print out the reputation values, and I cannot get it working. If I instead use print(total_url['reputation']) it fails with "TypeError: list indices must be integers or slices, not str", and if I try:
for s in total_url:
    print(s['reputation'])
I get the same TypeError.
It feels like I've tried everything, but I can't find any answers on the web that help, though I understand I still have a lot to learn and that my error will be obvious to some people here. It seems very similar to other things I've done with Python, but this time I'm stuck. To clarify, I'm expecting an output similar to: [568, 45, 6]
Perhaps I went about this the wrong way from the beginning and that's why it isn't working. I started coding with Python in October and it's still very new to me, but I want to learn. Thank you all in advance!
It looks like your total_url is a list of lists, so you might write a function like:
def get_reputations(data):
    for url in data:
        for obj in url:
            print(obj.get('reputation'))

get_reputations(total_url)
# output:
# 568
# 45
# 6
If you'd rather not work with a list of lists in the first place, build total_url with extend instead of append so the results are flattened as they are collected.
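Alternatively, a nested comprehension flattens the list of lists and collects the values in one pass (using a small stand-in for the total_url structure shown above):

```python
# total_url is a list of lists of dicts, e.g. [[{...}], [{...}, {...}]]
total_url = [
    [{"id": 316, "reputation": 568}],
    [{"id": 541, "reputation": 45}, {"id": 90, "reputation": 6}],
]

# outer loop walks the sublists, inner loop walks the dicts in each sublist
reputations = [obj['reputation'] for sublist in total_url for obj in sublist]
print(reputations)  # -> [568, 45, 6]
```

This gives exactly the [568, 45, 6] output the question asks for.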
You can also read the response yourself and parse it with json.loads:
import json
from urllib.request import urlopen

def get_rep():
    response = urlopen(api_url)
    r = response.read().decode('utf-8')
    r_obj = json.loads(r)
    for item in r_obj['response']:
        print("Reputation: {}".format(item['reputation']))
I am trying to convert a JSON string to a CSV file that I can work on further in Excel. For that, I am using the following script: https://github.com/vinay20045/json-to-csv
I was at it for a few hours yesterday but could not get it working :(
I have reduced my JSON string to the minimum for the sake of explaining what I mean.
https://pastebin.com/Vjt799Bb
{
"page": 1,
"pages": 2270,
"limit": 10,
"total": 22693,
"items": [
{
"address": {
"city": "cityname first dataset",
"company_name": "companyname first dataset"
},
"amount": 998,
"items": [
{
"description": "first part of first dataset",
"number": "part number of first part of first dataset"
}
],
"number": "number of first dataset",
"service_date": {
"type": "DEFAULT",
"date": "2015-11-18"
},
"vat_option": null
},
{
"address": {
"city": "cityname second dataset",
"company_name": "companyname second dataset"
},
"amount": 998,
"items": [
{
"description": "first part of second dataset",
"number": "part number of first part of second dataset"
},
{
"description": "second part of second dataset",
"number": "part number of second part of second dataset"
}
],
"number": "number of second dataset",
"service_date": {
"type": "DEFAULT",
"date": "2015-11-18"
},
"vat_option": null
}
]
}
I would really appreciate if you could take a look at it.
The script now delivers the following result:
.dropbox.com/s/165zbfl8wn52syf/scriptresult.jpg?dl=0
(please add www in front to have a link)
What the script now needs to do is following (F3, G4 and so on are cell definitions from the above screenshot):
- copy F3 and G3 to D4 and E4
- remove columns F and G
- copy A3:C3 to A4:C4
- copy F3:I3 to F4:I4
Target CSV will then look like:
.dropbox.com/s/l1wj3ntrlomwmaq/target.jpg?dl=0
(please add www in front to have a link)
So all in all, the "items_items_0" / "items_items_1" columns are the problem: when the JSON data has sub-items, the current script puts them in new columns in the header, but I'd like to have them in new rows instead.
Do you see any way I can achieve that? The logic is quite clear to me, but I am an absolute newbie in Python - maybe that's the problem :(
Thank you for your great support!
Cheers,
Tom
I do agree: you're asking about the usage of a specific package without providing your actual code.
I went ahead, made some assumptions, and created a snippet which could help you solve your issue. Instead of using the script you linked, I manually create a dictionary and then use Pandas for printing, potential modification, and eventual export. Note: this does not solve your problem to the fullest extent (since I'm not entirely sure I understand it) - rather, it hopes to give you a good start with some of the tools and techniques.
See .ipynb file in this Gist, https://gist.github.com/AndiH/4d4ef85e2dec395a0ae5343c648565eb, the gist of it I'll paste below:
import pandas as pd
import json

with open("input.json") as f:
    rawjson = json.load(f)

data = []
for element in rawjson["items"]:
    data.append({
        "item_address_city": element["address"]["city"],
        "item_address_company_name": element["address"]["company_name"],
        "items_amount": element["amount"]
    })

df = pd.DataFrame(data)
df.head()
df.to_csv("output.csv")
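To get sub-items as new rows rather than new columns, one possible sketch (the column names here are my own choice, not the linked script's) expands each nested item into its own flat dictionary, repeating the parent fields on every row; the resulting list can then be fed to pd.DataFrame and .to_csv as above:

```python
def items_to_rows(rawjson):
    """One output row per sub-item: parent fields are repeated on each row."""
    rows = []
    for invoice in rawjson["items"]:
        for part in invoice["items"]:  # the nested "items" list becomes rows
            rows.append({
                "address_city": invoice["address"]["city"],
                "address_company_name": invoice["address"]["company_name"],
                "amount": invoice["amount"],
                "number": invoice["number"],
                "item_description": part["description"],
                "item_number": part["number"],
            })
    return rows
```

With the sample data above, the second dataset's two sub-items become two rows sharing the same address, amount, and number, which matches the target layout described.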
I'm trying to read a json file into a pandas dataframe:
df = pd.read_json('output.json',orient='index')
but I'm getting the error:
/usr/local/lib/python2.7/dist-packages/pandas/io/json.pyc
in read_json(path_or_buf, orient, typ, dtype, convert_axes,
convert_dates,keep_default_dates, numpy, precise_float, date_unit)
196 if exists:
197 with open(filepath_or_buffer, 'r') as fh:
--> 198 json = fh.read()
199 else:
200 json = filepath_or_buffer
MemoryError:
I've also tried reading it using gzip:
import gzip
import pandas as pd

def parse(path):
    g = gzip.open(path, 'rb')
    for l in g:
        yield eval(l)

def getDF(path):
    i = 0
    df = {}
    for d in parse(path):
        df[i] = d
        i += 1
        #if i == 10000: break ## hack for local testing
    return pd.DataFrame.from_dict(df, orient='index')

pathname = './output.json.gz'
df = getDF(pathname)
But I get a segmentation fault. How can I read in a JSON file (or json.gz) that's this large?
The head of the json file looks like this:
{"reviewerID": "ARMDSTEI0Z7YW", "asin": "0077614992", "reviewerName": "dodo", "helpful": [0, 0], "unixReviewTime": 1360886400, "reviewText": "This book was a requirement for a college class. It was okay to use although it wasn't used much for my particular class", "overall": 5.0, "reviewTime": "02 15, 2013", "summary": "great"}
{"reviewerID": "A3FYN0SZYWN74", "asin": "0615208479", "reviewerName": "Marilyn Mitzel", "helpful": [0, 0], "unixReviewTime": 1228089600, "reviewText": "This is a great gift for anyone who wants to hang on to what they've got or get back what they've lost. I bought it for my 77 year old mom who had a stroke and myself.I'm 55 and like many of us at that age my memory started slipping. You know how it goes. Can't remember where I put my keys, can't remember names and forget about numbers. As a medical reporter I was researching the importance of exercising the brain. I heard about BrainAerobics and that it can help improve and even restore memory. I had nothing to lose, nor did mom so we tried it and were actually amazed how well it works.My memory improved pretty quickly. I used to have to write notes to myself about every thing. Not any more. I can remember my grocery list and errands without writing it all down. I can even remember phone numbers now. You have to keep doing it. Just like going to the gym for your body several times a week, you must do the same for your brain.But it's a lot of fun and gives you a new sense of confidence because you just feel a lot sharper. On top of your game so to speak.That's important in this competitive world today to keep up with the younger one's in the work force. As for mom, her stroke was over two years ago and we thought she would never regain any more brain power but her mind continues to improve. We've noticed a big difference in just the last few months since she has been doing the BrainAerobics program regularly. She's hooked on it and we are believers.Marilyn Mitzel/Aventura, FL", "overall": 5.0, "reviewTime": "12 1, 2008", "summary": "AMAZING HOW QUICKLY IT WORKS!"}
{"reviewerID": "A2J0WRZSAAHUAP", "asin": "0615269990", "reviewerName": "icu-rn", "helpful": [0, 0], "unixReviewTime": 1396742400, "reviewText": "Very helpful in learning about different disease processes and easy to understand. You do not have to be a med student to play. Also you can play alone or with several players", "overall": 5.0, "reviewTime": "04 6, 2014", "summary": "Must have"}
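Since this file is one JSON object per line, a sketch that avoids both the single giant fh.read() and eval (assuming a pandas version whose read_json supports the lines= and chunksize= parameters) is to stream the file in chunks and keep only the columns you need:

```python
import pandas as pd

def load_large_jsonl(path, cols, chunksize=100_000):
    """Stream a line-delimited JSON file in chunks, keeping only `cols`."""
    # read_json with chunksize returns an iterator of DataFrames,
    # so the whole file is never held in memory as one string
    pieces = [chunk[cols]
              for chunk in pd.read_json(path, lines=True, chunksize=chunksize)]
    return pd.concat(pieces, ignore_index=True)
```

For example, df = load_large_jsonl('output.json', ['reviewerID', 'overall']) would keep only those two columns while streaming.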
I am using the Yelp dataset and I want to parse the review JSON file into a dictionary. I tried loading it into a pandas DataFrame and then creating the dictionary from that, but because the file is so big, that is time-consuming. I want to keep only the user_id and stars values. A line of the JSON file looks like this:
{
  "votes": {"funny": 0, "useful": 2, "cool": 1},
  "user_id": "Xqd0DzHaiyRqVH3WRG7hzg",
  "review_id": "15SdjuK7DmYqUAj6rjGowg",
  "stars": 5,
  "date": "2007-05-17",
  "text": "dr. goldberg offers everything i look for in a general practitioner. he's nice and easy to talk to without being patronizing; he's always on time in seeing his patients; he's affiliated with a top-notch hospital (nyu) which my parents have explained to me is very important in case something happens and you need surgery; and you can get referrals to see specialists without having to see him first. really, what more do you need? i'm sitting here trying to think of any complaints i have about him, but i'm really drawing a blank.",
  "type": "review",
  "business_id": "vcNAWiLM4dR7D2nwwJ7nCA"
}
How can I iterate over every 'field' (for lack of a better word)? So far I can only iterate over each line.
EDIT
As requested, the pandas code:
Reading the JSON:
with open('yelp_academic_dataset_review.json') as f:
    df = pd.DataFrame(json.loads(line) for line in f)
Creating the dictionary:
dict = {}
for i, row in df.iterrows():
    business_id = row['business_id']
    user_id = row['user_id']
    rating = row['stars']
    key = (business_id, user_id)
    dict[key] = rating
You don't need to read this into a DataFrame. json.load() returns a dictionary. For example:
sample.json
{
"votes": {
"funny": 0,
"useful": 2,
"cool": 1
},
"user_id": "Xqd0DzHaiyRqVH3WRG7hzg",
"review_id": "15SdjuK7DmYqUAj6rjGowg",
"stars": 5,
"date": "2007-05-17",
"text": "dr. goldberg offers everything i look for in a general practitioner. he's nice and easy to talk to without being patronizing; he's always on time in seeing his patients; he's affiliated with a top-notch hospital (nyu) which my parents have explained to me is very important in case something happens and you need surgery; and you can get referrals to see specialists without having to see him first. really, what more do you need? i'm sitting here trying to think of any complaints i have about him, but i'm really drawing a blank.",
"type": "review",
"business_id": "vcNAWiLM4dR7D2nwwJ7nCA"
}
read_json.py
import json
with open('sample.json', 'r') as fh:
    result_dict = json.load(fh)

print(result_dict['user_id'])
print(result_dict['stars'])
output
Xqd0DzHaiyRqVH3WRG7hzg
5
With that output you can easily create a DataFrame.
There are several good discussions on SO about parsing JSON as a stream, but the gist is that it's not possible natively, although some tools attempt it.
In the interest of keeping your code simple and with minimal dependencies, you might see whether reading the JSON directly into a dictionary, line by line, is a sufficient improvement.
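As a concrete sketch of that: since each line of the review file is its own JSON object, you can build the same (business_id, user_id) -> stars dictionary as your pandas code in one streaming pass, with no DataFrame at all (the helper name here is my own):

```python
import json

def build_ratings(lines):
    """Map (business_id, user_id) -> stars, one JSON object per input line."""
    ratings = {}
    for line in lines:
        review = json.loads(line)  # each line parses to a dict of fields
        ratings[(review['business_id'], review['user_id'])] = review['stars']
    return ratings

# usage:
# with open('yelp_academic_dataset_review.json') as f:
#     ratings = build_ratings(f)
```

This avoids holding the full DataFrame in memory and skips every field except the ones you need.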