Creating pandas dataframe from json file; getting memory error - python

I'm trying to read a json file into a pandas dataframe:
df = pd.read_json('output.json',orient='index')
but I'm getting the error:
/usr/local/lib/python2.7/dist-packages/pandas/io/json.pyc in read_json(path_or_buf, orient, typ, dtype, convert_axes, convert_dates, keep_default_dates, numpy, precise_float, date_unit)
    196         if exists:
    197             with open(filepath_or_buffer, 'r') as fh:
--> 198                 json = fh.read()
    199         else:
    200             json = filepath_or_buffer

MemoryError:
I've also tried reading it using gzip:
import gzip
import pandas as pd

def parse(path):
    g = gzip.open(path, 'rb')
    for l in g:
        yield eval(l)

def getDF(path):
    i = 0
    df = {}
    for d in parse(path):
        df[i] = d
        i += 1
        # if i == 10000: break  ## hack for local testing
    return pd.DataFrame.from_dict(df, orient='index')

pathname = './output.json.gz'
df = getDF(pathname)
But I get a segmentation fault. How can I read in a JSON file (or json.gz) that's this large?
The head of the json file looks like this:
{"reviewerID": "ARMDSTEI0Z7YW", "asin": "0077614992", "reviewerName": "dodo", "helpful": [0, 0], "unixReviewTime": 1360886400, "reviewText": "This book was a requirement for a college class. It was okay to use although it wasn't used much for my particular class", "overall": 5.0, "reviewTime": "02 15, 2013", "summary": "great"}
{"reviewerID": "A3FYN0SZYWN74", "asin": "0615208479", "reviewerName": "Marilyn Mitzel", "helpful": [0, 0], "unixReviewTime": 1228089600, "reviewText": "This is a great gift for anyone who wants to hang on to what they've got or get back what they've lost. I bought it for my 77 year old mom who had a stroke and myself.I'm 55 and like many of us at that age my memory started slipping. You know how it goes. Can't remember where I put my keys, can't remember names and forget about numbers. As a medical reporter I was researching the importance of exercising the brain. I heard about BrainAerobics and that it can help improve and even restore memory. I had nothing to lose, nor did mom so we tried it and were actually amazed how well it works.My memory improved pretty quickly. I used to have to write notes to myself about every thing. Not any more. I can remember my grocery list and errands without writing it all down. I can even remember phone numbers now. You have to keep doing it. Just like going to the gym for your body several times a week, you must do the same for your brain.But it's a lot of fun and gives you a new sense of confidence because you just feel a lot sharper. On top of your game so to speak.That's important in this competitive world today to keep up with the younger one's in the work force. As for mom, her stroke was over two years ago and we thought she would never regain any more brain power but her mind continues to improve. We've noticed a big difference in just the last few months since she has been doing the BrainAerobics program regularly. She's hooked on it and we are believers.Marilyn Mitzel/Aventura, FL", "overall": 5.0, "reviewTime": "12 1, 2008", "summary": "AMAZING HOW QUICKLY IT WORKS!"}
{"reviewerID": "A2J0WRZSAAHUAP", "asin": "0615269990", "reviewerName": "icu-rn", "helpful": [0, 0], "unixReviewTime": 1396742400, "reviewText": "Very helpful in learning about different disease processes and easy to understand. You do not have to be a med student to play. Also you can play alone or with several players", "overall": 5.0, "reviewTime": "04 6, 2014", "summary": "Must have"}

Related

Python Check if Key/Value exists in JSON output

I have a JSON output and I want to create an IF statement so that if it contains the value I am looking for it does something, ELSE it does something else.
JSON Blob 1
[
{
"domain":"www.slatergordon.co.uk",
"displayed_link":"https://www.slatergordon.co.uk/",
"description":"We Work With Thousands Of People Across The UK In All Areas Of Personal Legal Services. Regardless Of How You Have Been Injured Through Negligence We're Here To Help You. Personal Injury Experts.",
"position":1,
"block_position":"top",
"title":"Car Claims Solicitors - No Win No Fee Solicitors - SlaterGordon.co.uk",
"link":"https://www.slatergordon.co.uk/personal-injury-claim/road-traffic-accidents-solicitors/",
"tracking_link":"https://www.google.co.uk/aclk?sa=l&ai=DChcSEwj8-NSdjLDwAhXBEH0KHRYwA1MYABABGgJwdg&sig=AOD64_3u1ct0jmXAnvemxFHh_tfK5UK8Xg&q&adurl"
},
{
"is_phone_ad":true,
"phone_number":"0333 358 0496",
"domain":"www.accident-claimsline.co.uk",
"displayed_link":"http://www.accident-claimsline.co.uk/",
"description":"Car Insurance Claims Advice - Car Accident Claims Helpline",
"sitelinks":[
{
"title":"Replacement Vehicle Hire",
"tracking_link":"https://www.google.co.uk/aclk?sa=l&ai=DChcSEwj8-NSdjLDwAhXBEH0KHRYwA1MYABALGgJwdg&ae=2&sig=AOD64_20YjAoyMY_c6XVTnBU1vQAD2tDTA&q=&ved=2ahUKEwjvlM-djLDwAhVmJzQIHSZHDLEQvrcBegQIBRAM&adurl="
},
{
"title":"Request a Call Back",
"tracking_link":"https://www.google.co.uk/aclk?sa=l&ai=DChcSEwj8-NSdjLDwAhXBEH0KHRYwA1MYABAOGgJwdg&ae=2&sig=AOD64_36-Pd831AXrPbh1yvUyTbhXH2irg&q=&ved=2ahUKEwjvlM-djLDwAhVmJzQIHSZHDLEQvrcBegQIBRAN&adurl="
}
],
"position":6,
"block_position":"bottom",
"title":"Car Insurance Claims Advice - Car Accident Claims Helpline",
"link":"http://www.accident-claimsline.co.uk/",
"tracking_link":"https://www.google.co.uk/aclk?sa=l&ai=DChcSEwj8-NSdjLDwAhXBEH0KHRYwA1MYABAGGgJwdg&ae=2&sig=AOD64_09pMtWxFo9s8c1dL16NJo5ThOlrg&q&adurl"
}
]
JSON Blob 2
[
{
"domain":"www.slatergordon.co.uk",
"displayed_link":"https://www.slatergordon.co.uk/",
"description":"We Work With Thousands Of People Across The UK In All Areas Of Personal Legal Services. Regardless Of How You Have Been Injured Through Negligence We're Here To Help You. Personal Injury Experts.",
"position":1,
"block_position":"top",
"title":"Car Claims Solicitors - No Win No Fee Solicitors - SlaterGordon.co.uk",
"link":"https://www.slatergordon.co.uk/personal-injury-claim/road-traffic-accidents-solicitors/",
"tracking_link":"https://www.google.co.uk/aclk?sa=l&ai=DChcSEwj8-NSdjLDwAhXBEH0KHRYwA1MYABABGgJwdg&sig=AOD64_3u1ct0jmXAnvemxFHh_tfK5UK8Xg&q&adurl"
},
{
"is_phone_ad":true,
"phone_number":"0333 358 0496",
"domain":"www.accident-claimsline.co.uk",
"displayed_link":"http://www.accident-claimsline.co.uk/",
"description":"Car Insurance Claims Advice - Car Accident Claims Helpline",
"sitelinks":[
{
"title":"Replacement Vehicle Hire",
"tracking_link":"https://www.google.co.uk/aclk?sa=l&ai=DChcSEwj8-NSdjLDwAhXBEH0KHRYwA1MYABALGgJwdg&ae=2&sig=AOD64_20YjAoyMY_c6XVTnBU1vQAD2tDTA&q=&ved=2ahUKEwjvlM-djLDwAhVmJzQIHSZHDLEQvrcBegQIBRAM&adurl="
},
{
"title":"Request a Call Back",
"tracking_link":"https://www.google.co.uk/aclk?sa=l&ai=DChcSEwj8-NSdjLDwAhXBEH0KHRYwA1MYABAOGgJwdg&ae=2&sig=AOD64_36-Pd831AXrPbh1yvUyTbhXH2irg&q=&ved=2ahUKEwjvlM-djLDwAhVmJzQIHSZHDLEQvrcBegQIBRAN&adurl="
}
],
"position":6,
"block_position":"top",
"title":"Car Insurance Claims Advice - Car Accident Claims Helpline",
"link":"http://www.accident-claimsline.co.uk/",
"tracking_link":"https://www.google.co.uk/aclk?sa=l&ai=DChcSEwj8-NSdjLDwAhXBEH0KHRYwA1MYABAGGgJwdg&ae=2&sig=AOD64_09pMtWxFo9s8c1dL16NJo5ThOlrg&q&adurl"
}
]
Desired Output
if "block_position":"bottom" in JSONBlob:
do something
else:
do something else
but I can't seem to get it to trigger. I want it to search through the entire output: if it contains that key/value, do something, and if it doesn't contain it, do something else.
Blob 1 would go down the IF path
Blob 2 would go down the else path
The main problem you have here is that the JSON output is a list/array with two objects inside. As the block_position key can appear in any of the inner objects, you could do something like this:
if any(obj.get('block_position') == 'bottom' for obj in JSONBlob):
    print('I do something')
else:
    print('I do something else')
EDIT 1: OK, I think I got your point. You only need to do something for each object that has block_position set to bottom. Then the following should do it:
for obj in JSONBlob:
    if obj.get('block_position') == 'bottom':
        print('I do something with the object')
    else:
        print('I do something else with the object')
EDIT 2: As discussed in the comments, if you only want to act on the objects with block_position set to bottom, you can drop the else clause as follows:
for obj in JSONBlob:
    if obj.get('block_position') == 'bottom':
        print('I do something with the object')
You can use the JMESPath library. It's a query language for JSON.
A basic JMESPath expression for your case would be [?block_position=='bottom'] (note that the string literal must be quoted). This will select the matching objects for you.
I tried it online here with the data you provided.
If you are looking for a more nested node, you only have to alter the expression to search for that specific node.
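A minimal sketch of how that expression could be used from Python (blob.json is a hypothetical file holding the array; jmespath is the pip-installable package):
import json
import jmespath

with open('blob.json') as fh:
    data = json.load(fh)

# search() returns the list of objects whose block_position is "bottom"
bottom_ads = jmespath.search("[?block_position=='bottom']", data)
if bottom_ads:
    print('do something')
else:
    print('do something else')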

Get more general Category from the Category of a Wikipedia page

I'm using the Python wikipedia library to obtain the list of categories of a page. I saw it's a wrapper around the MediaWiki API.
Anyway, I'm wondering how to generalize the categories to macro categories, like these Main topic classifications.
For example, if I search the page Hamburger, there is a category called German-American cuisine, but I would like to get its super category, like Food and drink. How can I do that?
import wikipedia
page = wikipedia.page("Hamburger")
print(page.categories)
# how to filter only important categories?
>>>['All articles with specifically marked weasel-worded phrases', 'All articles with unsourced statements', 'American sandwiches', 'Articles with hAudio microformats', 'Articles with short description', 'Articles with specifically marked weasel-worded phrases from May 2015', 'Articles with unsourced statements from May 2017', 'CS1: Julian–Gregorian uncertainty', 'Commons category link is on Wikidata', 'Culture in Hamburg', 'Fast food', 'German-American cuisine', 'German cuisine', 'German sandwiches', 'Hamburgers (food)', 'Hot sandwiches', 'National dishes', 'Short description is different from Wikidata', 'Spoken articles', 'Use mdy dates from October 2020', 'Webarchive template wayback links', 'Wikipedia articles with BNF identifiers', 'Wikipedia articles with GND identifiers', 'Wikipedia articles with LCCN identifiers', 'Wikipedia articles with NARA identifiers', 'Wikipedia indefinitely move-protected pages', 'Wikipedia pages semi-protected against vandalism']
I didn't find an API to traverse the hierarchical tree of Wikipedia categories.
I accept both Python and API-request solutions. Thank you
EDIT:
I have found the API categorytree, which seems to do something similar to what I need.
Anyway, I didn't find a way to pass the options parameter described in the documentation. I think the options can be those expressed in this link, like mode=parents, but I can't find a way to put this parameter in the HTTP URL, because it must be a JSON object, as stated in the documentation. I was trying this https://en.wikipedia.org/w/api.php?action=categorytree&category=Category:Biscuits&format=json. How do I insert the options field?
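For the categorytree call specifically, the options parameter is itself a JSON object passed as a string, so it can be supplied by JSON-encoding a dict. A hedged sketch (the mode=parents value is an assumption based on the CategoryTree options linked above):
import json
import requests

resp = requests.get(
    'https://en.wikipedia.org/w/api.php',
    params={
        'action': 'categorytree',
        'category': 'Category:Biscuits',
        'options': json.dumps({'mode': 'parents'}),  # JSON-encoded options object
        'format': 'json',
    },
)
print(resp.json())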
This is a very hard task, since Wikipedia's category graph is a mess (technically speaking :-)). Indeed, in a tree you would expect to get to the root node in logarithmic time. But this is not a tree, since any node can have multiple parents!
Furthermore, I think it can't be accomplished using categories alone because, as you can see in the example, you are very likely to get unexpected results. Anyway, I tried to reproduce something similar to what you asked for.
Explanation of the code below:
Start from a source page (the hardcoded one is "Hamburger");
Go back visiting recursively all the parent categories;
Cache all the categories you have already met, to avoid visiting the same category twice (this also solves the cycles problem);
Cut the current branch if you find a target category;
Stop when the backlog is empty.
Starting from a given page you are likely to reach more than one target category, so I organized the result as a dictionary that tells you how many times each target category was reached.
As you may imagine, the response is not immediate, so this algorithm should be run offline. It can be improved in many ways (see below).
The code
import requests
import time
import wikipedia
def get_categories(title):
    try:
        return set(wikipedia.page(title, auto_suggest=False).categories)
    except requests.exceptions.ConnectionError:
        time.sleep(10)
        return get_categories(title)
start_page = "Hamburger"
target_categories = {"Academic disciplines", "Business", "Concepts", "Culture", "Economy", "Education", "Energy", "Engineering", "Entertainment", "Entities", "Ethics", "Events", "Food and drink", "Geography", "Government", "Health", "History", "Human nature", "Humanities", "Knowledge", "Language", "Law", "Life", "Mass media", "Mathematics", "Military", "Music", "Nature", "Objects", "Organizations", "People", "Philosophy", "Policy", "Politics", "Religion", "Science and technology", "Society", "Sports", "Universe", "World"}
result_categories = {c:0 for c in target_categories} # dictionary target category -> number of paths
cached_categories = set() # monotonically increasing
backlog = get_categories(start_page)
cached_categories.update(backlog)
while len(backlog) != 0:
    print("\nBacklog size: %d" % len(backlog))
    cat = backlog.pop()  # pick a category, removing it from the backlog
    print("Visiting category: " + cat)
    try:
        for parent in get_categories("Category:" + cat):
            if parent in target_categories:
                print("Found target category: " + parent)
                result_categories[parent] += 1
            elif parent not in cached_categories:
                backlog.add(parent)
                cached_categories.add(parent)
    except KeyError:
        pass  # the current cat may not have a "categories" attribute
result_categories = {k: v for (k, v) in result_categories.items() if v > 0}  # filter out categories that were never reached
print("\nVisited categories: %d" % len(cached_categories))
print("Result: " + str(result_categories))
Results for your example
In your example, the script would visit 12176 categories (!) and would return the following result:
{'Education': 21, 'Society': 40, 'Knowledge': 17, 'Entities': 4, 'People': 21, 'Health': 25, 'Mass media': 25, 'Philosophy': 17, 'Events': 17, 'Music': 18, 'History': 21, 'Sports': 6, 'Geography': 18, 'Life': 13, 'Government': 36, 'Food and drink': 12, 'Organizations': 16, 'Religion': 23, 'Language': 15, 'Engineering': 7, 'Law': 25, 'World': 13, 'Military': 18, 'Science and technology': 8, 'Politics': 24, 'Business': 15, 'Objects': 3, 'Entertainment': 15, 'Nature': 12, 'Ethics': 12, 'Culture': 29, 'Human nature': 3, 'Energy': 13, 'Concepts': 7, 'Universe': 2, 'Academic disciplines': 23, 'Humanities': 25, 'Policy': 14, 'Economy': 17, 'Mathematics': 10}
As you may notice, the "Food and drink" category was reached only 12 times, while, for instance, "Society" was reached 40 times. This tells us a lot about how weird Wikipedia's category graph is.
Possible improvements
There are many ways to optimize or approximate this algorithm. The first ones that come to my mind:
Consider keeping track of the path length and suppose that the target category with the shortest path is the most relevant one.
Reduce the execution time:
You can reduce the number of steps by stopping the script after the first target category occurrence (or at the N-th occurrence).
If you run this algorithm starting from multiple articles, you can keep in memory the information that associates each category you have met with the target categories it eventually leads to. For example, after your "Hamburger" run you will know that starting from "Category:Fast food" you will get to "Category:Economy", and this can be precious information. This will be expensive in terms of space, but will eventually help you reduce the execution time.
Use as labels only the target categories that occur most frequently. E.g. if your result is {"Food and drinks" : 37, "Economy" : 4}, you may want to keep only "Food and drinks" as the label. For doing this you can:
take the N most frequent target categories;
take the most relevant fraction (e.g. the first half, or third, or fourth);
take the categories which occur at least N% of the time w.r.t. the most frequent one;
use more sophisticated statistical tests to analyze the statistical significance of the frequencies.
Something a bit different you can do is to get the machine-predicted article topic, with a query like https://ores.wikimedia.org/v3/scores/enwiki/?models=articletopic&revids=1000459607
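A minimal sketch of that query with requests, assuming the usual ORES v3 response layout (enwiki -> scores -> revid -> articletopic):
import requests

resp = requests.get(
    'https://ores.wikimedia.org/v3/scores/enwiki/',
    params={'models': 'articletopic', 'revids': 1000459607},
)
scores = resp.json()
# drill down to the topic prediction for this revision
print(scores['enwiki']['scores']['1000459607']['articletopic'])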

How to read a nested Json with unique values in Pandas

I have a source json of steam reviews, in the format:
{
"reviews": {
"69245216": {
"recommendationid": "69245216",
"author": {
"steamid": "76561198166378463",
"num_games_owned": 31,
"num_reviews": 4,
"playtime_forever": 60198,
"playtime_last_two_weeks": 5899,
"last_played": 1589654367
},
"language": "english",
"review": "Me:*Playing Heroes of Hammrwatch\nAlso me 1 year later:*Playing Heroes of Hammrwatch\nIt's one of the best rougelites I've ever played. You can easly say that by the amount of hours I have on this game. I also have every achievement in the game.\nThe things I don't like about this game:\n-Limit- The game has limits like max damage you can deal. This is not that big problem because you would have to play this game as long as me to hit \"the wall\". And its because the damage is codded in 32bit number which makes the limit around 2 billion.\n-Tax- There is tax in the game for gold which scales with the amount of gold you have on you what makes no sense.\nThe things I like about this game:\n-Music- There are different themed ones depending on the act you are on.\n-Pixel Art-\n-Graphics- Game feels so smooth.\n-Classes- 9 Playable characters with unique sets.\n-Challanging gameplay- You can get far on the first run if you play good.\n-Bosses- There is a boss for every act in the game with different skills which can be harder for some characters.\n-Replayable- There are higher difficulty levels called NewGamePlus (NG+).\n-COOP- Playing with friends makes the game much better and also the game balances the difficulty.\n-DLC- There are DLCs for the game with new content (locations,game modes and playable characters).\n-Builds- There are different combination of items which makes game interesting in some situations.\n-Quality of life- Game has many quality of life improvements\n-Price- The game is very cheap. The only price is your soul beacuse you won't stop playing it! ;)\n\n\n\n",
"timestamp_created": 1589644982,
"timestamp_updated": 1589644982,
"voted_up": true,
"votes_up": 0,
"votes_funny": 0,
"weighted_vote_score": 0,
"comment_count": 0,
"steam_purchase": true,
"received_for_free": false,
"written_during_early_access": false
},
"69236471": {
"recommendationid": "69236471",
"author": {
"steamid": "76561198279405449",
"num_games_owned": 595,
"num_reviews": 46,
"playtime_forever": 1559,
"playtime_last_two_weeks": 1559,
"last_played": 1589652037
},
"language": "english",
"review": "Yes",
"timestamp_created": 1589635540,
"timestamp_updated": 1589635540,
"voted_up": true,
"votes_up": 0,
"votes_funny": 0,
"weighted_vote_score": 0,
"comment_count": 0,
"steam_purchase": true,
"received_for_free": false,
"written_during_early_access": false
},
"69226790": {
"recommendationid": "69226790",
"author": {
"steamid": "76561198004456693",
"num_games_owned": 82,
"num_reviews": 14,
"playtime_forever": 216,
"playtime_last_two_weeks": 216,
"last_played": 1589579174
},
"language": "english",
"review": "I really like how Hipshot/Crackshell is improving their formula from game to game. Altough SS Bogus Detour I didn't really like, I see how they implemented what they've learnt there to this game. Visuals just keep getting better and better and for that I really can't wait to see Hammerwatch 2 (check their YoutTube channel, early footage is out there).\nGameplay-wise I think it's a perfect match between the classic Hammerwatch feeling and a rougelike setting. My only issue with this game is the random map generator. Most of the time like 1/5 of all levels are just empty dead-ends. Otherwise highly recommend, already see huge amount of gameplay ahead of me.",
"timestamp_created": 1589623437,
"timestamp_updated": 1589623437,
"voted_up": true,
"votes_up": 0,
"votes_funny": 0,
"weighted_vote_score": 0,
"comment_count": 0,
"steam_purchase": true,
"received_for_free": false,
"written_during_early_access": false
},
and so on..
reading this in with df = pd.read_json(r'review_677120.json')
gives the following
reviews query_summary cursors
69245216 {'recommendationid': '69245216', 'author': {'s... NaN NaN
69236471 {'recommendationid': '69236471', 'author': {'s... NaN NaN
69226790 {'recommendationid': '69226790', 'author': {'s... NaN NaN
However I'd like something more along the lines of
steamid num_games_owned num_reviews playtime_forever playtime_last_two_weeks last_played language review
69245216 76561198166378463 31 4 60198 5899 1589654367 english "me..
so each line expanded to one row.
I've tried playing around with json_normalize, but none of my attempts seem to work; I either get errors like AttributeError: 'str' object has no attribute 'values' for df = json_normalize(df),
or everything ends up in one row, which isn't usable.
Would appreciate any help.
import pandas as pd
import json

with open('content.json') as f:
    # reading in json file
    d = json.load(f)

for id in d['reviews']:
    # pull the nested author information up into the main dictionary
    for key, val in d['reviews'][id]['author'].items():
        d['reviews'][id][key] = val
    del d['reviews'][id]['author']

print(d)
# OUTPUT:
{
'reviews': {
'69245216': {'recommendationid': '69245216', 'language': 'english', 'review': 'Me:*Playing Heroes of Hammrwatch\nAlso me 1 year later:*Playing Heroes of Hammrwatch\nIt\'s one of the best rougelites I\'ve ever played. You can easly say that by the amount of hours I have on this game. I also have every achievement in the game.\nThe things I don\'t like about this game:\n-Limit- The game has limits like max damage you can deal. This is not that big problem because you would have to play this game as long as me to hit "the wall". And its because the damage is codded in 32bit number which makes the limit around 2 billion.\n-Tax- There is tax in the game for gold which scales with the amount of gold you have on you what makes no sense.\nThe things I like about this game:\n-Music- There are different themed ones depending on the act you are on.\n-Pixel Art-\n-Graphics- Game feels so smooth.\n-Classes- 9 Playable characters with unique sets.\n-Challanging gameplay- You can get far on the first run if you play good.\n-Bosses- There is a boss for every act in the game with different skills which can be harder for some characters.\n-Replayable- There are higher difficulty levels called NewGamePlus (NG+).\n-COOP- Playing with friends makes the game much better and also the game balances the difficulty.\n-DLC- There are DLCs for the game with new content (locations,game modes and playable characters).\n-Builds- There are different combination of items which makes game interesting in some situations.\n-Quality of life- Game has many quality of life improvements\n-Price- The game is very cheap. The only price is your soul beacuse you won\'t stop playing it! ;)\n\n\n\n', 'timestamp_created': 1589644982, 'timestamp_updated': 1589644982, 'voted_up': True, 'votes_up': 0, 'votes_funny': 0, 'weighted_vote_score': 0, 'comment_count': 0, 'steam_purchase': True, 'received_for_free': False, 'written_during_early_access': False, 'steamid': '76561198166378463', 'num_games_owned': 31, 'num_reviews': 4, 'playtime_forever': 60198, 'playtime_last_two_weeks': 5899, 'last_played': 1589654367},
'69236471': {'recommendationid': '69236471', 'language': 'english', 'review': 'Yes', 'timestamp_created': 1589635540, 'timestamp_updated': 1589635540, 'voted_up': True, 'votes_up': 0, 'votes_funny': 0, 'weighted_vote_score': 0, 'comment_count': 0, 'steam_purchase': True, 'received_for_free': False, 'written_during_early_access': False, 'steamid': '76561198279405449', 'num_games_owned': 595, 'num_reviews': 46, 'playtime_forever': 1559, 'playtime_last_two_weeks': 1559, 'last_played': 1589652037},
'69226790': {'recommendationid': '69226790', 'language': 'english', 'review': "I really like how Hipshot/Crackshell is improving their formula from game to game. Altough SS Bogus Detour I didn't really like, I see how they implemented what they've learnt there to this game. Visuals just keep getting better and better and for that I really can't wait to see Hammerwatch 2 (check their YoutTube channel, early footage is out there).\nGameplay-wise I think it's a perfect match between the classic Hammerwatch feeling and a rougelike setting. My only issue with this game is the random map generator. Most of the time like 1/5 of all levels are just empty dead-ends. Otherwise highly recommend, already see huge amount of gameplay ahead of me.", 'timestamp_created': 1589623437, 'timestamp_updated': 1589623437, 'voted_up': True, 'votes_up': 0, 'votes_funny': 0, 'weighted_vote_score': 0, 'comment_count': 0, 'steam_purchase': True, 'received_for_free': False, 'written_during_early_access': False, 'steamid': '76561198004456693', 'num_games_owned': 82, 'num_reviews': 14, 'playtime_forever': 216, 'playtime_last_two_weeks': 216, 'last_played': 1589579174}
}
}
df = pd.DataFrame.from_dict(d['reviews'], orient='index')
print(df)
# OUTPUT:
recommendationid language ... playtime_last_two_weeks last_played
69245216 69245216 english ... 5899 1589654367
69236471 69236471 english ... 1559 1589652037
69226790 69226790 english ... 216 1589579174
[3 rows x 19 columns]
print(df.axes)
# OUTPUT:
[Index(['69245216', '69236471', '69226790'], dtype='object'), Index(['recommendationid', 'language', 'review', 'timestamp_created',
'timestamp_updated', 'voted_up', 'votes_up', 'votes_funny',
'weighted_vote_score', 'comment_count', 'steam_purchase',
'received_for_free', 'written_during_early_access', 'steamid',
'num_games_owned', 'num_reviews', 'playtime_forever',
'playtime_last_two_weeks', 'last_played'],
dtype='object')]
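As an alternative, json_normalize can do the author flattening for you. A minimal sketch, assuming pandas 1.0+ (where json_normalize is a top-level function); the nested keys become dotted columns like author.steamid:
import json
import pandas as pd

with open('content.json') as f:
    d = json.load(f)

# flatten each review dict, expanding the nested "author" dict
df = pd.json_normalize(list(d['reviews'].values()))
df.index = list(d['reviews'].keys())  # keep the review ids as the index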

OSError: [Errno 22] when I try to .read() a json file

I am simply trying to read my json file in Python. I am in the correct folder when I do so; I am in Downloads, and my file is called 'Books_5.json'. However, when I try to use the .read() function, I get the error
OSError: [Errno 22] Invalid argument
This is my code:
import json
config = json.loads(open('Books_5.json').read())
This also raises the same error:
books = open('Books_5.json').read()
If it helps, this is a small snippet of what my data looks like:
{"reviewerID": "A10000012B7CGYKOMPQ4L", "asin": "000100039X", "reviewerName": "Adam", "helpful": [0, 0], "reviewText": "Spiritually and mentally inspiring! A book that allows you to question your morals and will help you discover who you really are!", "overall": 5.0, "summary": "Wonderful!", "unixReviewTime": 1355616000, "reviewTime": "12 16, 2012"}
{"reviewerID": "A2S166WSCFIFP5", "asin": "000100039X", "reviewerName": "adead_poet#hotmail.com \"adead_poet#hotmail.com\"", "helpful": [0, 2], "reviewText": "This is one my must have books. It is a masterpiece of spirituality. I'll be the first to admit, its literary quality isn't much. It is rather simplistically written, but the message behind it is so powerful that you have to read it. It will take you to enlightenment.", "overall": 5.0, "summary": "close to god", "unixReviewTime": 1071100800, "reviewTime": "12 11, 2003"}
I'm using Python 3.6 on MacOSX
It appears that this is some kind of bug that occurs when the file is too large (my file was ~10GB). Once I used split to break the file into chunks of 200k lines, the .read() error went away. This is true even if the file is not in strict JSON format.
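Consistent with that, since the snippet shows one JSON object per line, you can stream the file line by line instead of issuing one huge read() (a single read of more than roughly 2 GB is a known failure mode on macOS). A minimal sketch:
import json

reviews = []
with open('Books_5.json') as fh:
    for line in fh:  # one JSON object per line
        reviews.append(json.loads(line))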
Your code looks fine, it just looks like your json data is formatted incorrectly. Try the following. As others have suggested, it should be in the form [{},{},...].
[{"reviewerID": "A10000012B7CGYKOMPQ4L", "asin": "000100039X",
"reviewerName": "Adam", "helpful": [0, 0], "reviewText": "Spiritually and
mentally inspiring! A book that allows you to question your morals and will
help you discover who you really are!", "overall": 5.0, "summary":
"Wonderful!", "unixReviewTime": 1355616000, "reviewTime": "12 16, 2012"},
{"reviewerID": "A2S166WSCFIFP5", "asin": "000100039X", "reviewerName":
"adead_poet#hotmail.com \"adead_poet#hotmail.com\"", "helpful": [0, 2],
"reviewText": "This is one my must have books. It is a masterpiece of
spirituality. I'll be the first to admit, its literary quality isn't much.
It is rather simplistically written, but the message behind it is so
powerful that you have to read it. It will take you to enlightenment.",
"overall": 5.0, "summary": "close to god", "unixReviewTime": 1071100800,
"reviewTime": "12 11, 2003"}]
Your code and this data worked for me on Windows 7 and Python 2.7. That's different from your setup, but it should still be OK.
To read a JSON file, you can use the following example:
import json

with open('your_data.json') as data_file:
    data = json.load(data_file)

print(data)
print(data[0]['your_key'])  # get a value via its key
Also try converting your JSON objects into a list:
[
{'reviewerID': "A10000012B7CGYKOMPQ4L", ....},
{'asin': '000100039X', .....}
]

Json file to dictionary

I am using the yelp dataset and I want to parse the review json file to a dictionary. I tried loading it on a pandas DataFrame and then creating the dictionary, but because the file is too big it is time consuming. I want to keep only the user_id and stars values. A line of the json file looks like this:
{
    "votes": {"funny": 0, "useful": 2, "cool": 1},
    "user_id": "Xqd0DzHaiyRqVH3WRG7hzg",
    "review_id": "15SdjuK7DmYqUAj6rjGowg",
    "stars": 5,
    "date": "2007-05-17",
    "text": "dr. goldberg offers everything i look for in a general practitioner. he's nice and easy to talk to without being patronizing; he's always on time in seeing his patients; he's affiliated with a top-notch hospital (nyu) which my parents have explained to me is very important in case something happens and you need surgery; and you can get referrals to see specialists without having to see him first. really, what more do you need? i'm sitting here trying to think of any complaints i have about him, but i'm really drawing a blank.",
    "type": "review",
    "business_id": "vcNAWiLM4dR7D2nwwJ7nCA"
}
How can I iterate over every 'field' (for lack of a better word)? So far I can only iterate over each line.
EDIT
As requested pandas code :
reading the json
import json
import pandas as pd

with open('yelp_academic_dataset_review.json') as f:
    df = pd.DataFrame(json.loads(line) for line in f)
Creating the dictionary
ratings = {}  # named "ratings" because "dict" would shadow the builtin
for i, row in df.iterrows():
    business_id = row['business_id']
    user_id = row['user_id']
    rating = row['stars']
    key = (business_id, user_id)
    ratings[key] = rating
You don't need to read this into a DataFrame. json.load() returns a dictionary. For example:
sample.json
{
"votes": {
"funny": 0,
"useful": 2,
"cool": 1
},
"user_id": "Xqd0DzHaiyRqVH3WRG7hzg",
"review_id": "15SdjuK7DmYqUAj6rjGowg",
"stars": 5,
"date": "2007-05-17",
"text": "dr. goldberg offers everything i look for in a general practitioner. he's nice and easy to talk to without being patronizing; he's always on time in seeing his patients; he's affiliated with a top-notch hospital (nyu) which my parents have explained to me is very important in case something happens and you need surgery; and you can get referrals to see specialists without having to see him first. really, what more do you need? i'm sitting here trying to think of any complaints i have about him, but i'm really drawing a blank.",
"type": "review",
"business_id": "vcNAWiLM4dR7D2nwwJ7nCA"
}
read_json.py
import json

with open('sample.json', 'r') as fh:
    result_dict = json.load(fh)

print(result_dict['user_id'])
print(result_dict['stars'])
output
Xqd0DzHaiyRqVH3WRG7hzg
5
With that output you can easily create a DataFrame.
There are several good discussions on SO about parsing JSON as a stream, but the gist is that it isn't possible natively, although some tools attempt it.
In the interest of keeping your code simple and the dependencies minimal, you might see whether reading the JSON directly into a dictionary is a sufficient improvement.
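For instance, since the yelp file has one JSON object per line, a minimal sketch that builds the (business_id, user_id) -> stars dictionary without pandas:
import json

ratings = {}
with open('yelp_academic_dataset_review.json') as fh:
    for line in fh:  # one JSON object per line
        review = json.loads(line)
        ratings[(review['business_id'], review['user_id'])] = review['stars']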
