Reading a file in Python: json.decoder.JSONDecodeError

I have a JSON file. This code:

    with open('list.json', "r") as f:
        r_list = json.load(f)

crashes with:

    json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
I checked the JSON in an online validator and it validates fine.
The JSON is very simple:

    {"foo": [
        {"name": "AAA\u2019s BBB CCC", "url": "/foome/foo"}
    ]}
Things I have tried:

- the file encoding
- a dummy file

I've run out of ideas - is it something where `json.load` expects a binary file?
Edit 1

The code works in a plain script, but does not work inside the scrapy class:

    import scrapy
    from scrapy.selector import Selector
    from scrapy.http import HtmlResponse
    import json

    class myScraper(scrapy.Spider):
        name = "testScraper"

        def start_requests(self):
            with open('test.json') as f:
                self.logger.info(f.read())  # shows the file content
                r_list = json.load(f)  # breaks with the error msg
            yield "foo"

        def parse(self, response):
            self.logger.info("foo")

'test.json':

    {
        "too": "foo"
    }

Most likely your file is empty.
Example:
https://repl.it/#mark_boyle_sp/SphericalImpressiveIrc
Updated:

Your iterator is exhausted, as also discussed in the comments.
Since you log the file's contents first, the iterator is at the end of the file when json.load runs, so it looks like an empty file, hence the exception.
Reset the iterator or read the contents into a local variable and operate on that:
    json_str = f.read()
    self.logger.info(json_str)  # shows the file content
    r_list = json.loads(json_str)
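If you prefer to keep json.load(f), a minimal sketch of the "reset the iterator" option is to rewind the file handle before parsing:

```
with open('test.json') as f:
    self.logger.info(f.read())  # this consumes the file
    f.seek(0)                   # rewind to the start
    r_list = json.load(f)       # now sees the full content again
```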
Updated again:

(I assume) the scrapy issue you are having is in the parse method? The response body is a bytes object; you will need to decode it and use json.loads on the resulting string, like so:
    def parse(self, response):
        self.logger.info("foo")
        resp_str = response.body.decode('utf-8')
        self.logger.info(resp_str)  # shows the response
        r_list = json.loads(resp_str)

Related

Read JSON file correctly

I am trying to read a JSON file (BioRelEx dataset: https://github.com/YerevaNN/BioRelEx/releases/tag/1.0alpha7) in Python. The JSON file is a list of objects, one per sentence.
This is how I try to do it:
    def _read(self, file_path):
        with open(cached_path(file_path), "r") as data_file:
            for line in data_file.readlines():
                if not line:
                    continue
                items = json.loads(line)
                text = items["text"]
                label = items.get("label")

My code is failing on items = json.loads(line). It looks like the data is not formatted as the code expects it to be, but how can I change it?
Thanks in advance for your time!
Best,
Julia
With json.load() you don't need to read each line; you can do either of these:

    import json

    def open_json(path):
        with open(path, 'r') as file:
            return json.load(file)

    data = open_json('./1.0alpha7.dev.json')
Or, even cooler, you can GET the JSON straight from GitHub:

    import json
    import requests

    url = 'https://github.com/YerevaNN/BioRelEx/releases/download/1.0alpha7/1.0alpha7.dev.json'
    response = requests.get(url)
    data = response.json()

These will both give the same output. The data variable will be a list of dictionaries that you can iterate over in a for loop and do your further processing.
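For instance, a minimal sketch of that loop, using the "text" field that appears in the question's code (the other fields depend on the dataset's structure):

```
for item in data:
    # "text" is taken from the question's snippet; adjust to the fields you need
    print(item["text"])
```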
Your code is reading one line at a time and parsing each line individually as JSON. Unless the creator of the file wrote it in that line-delimited format (unlikely, given the .json extension), that won't work, as JSON does not use line breaks to indicate the end of an object.
Load the whole file content as JSON instead, then process the resulting items in the array.
    def _read(self, file_path):
        with open(cached_path(file_path), "r") as data_file:
            data = json.load(data_file)
            for item in data:
                text = item["text"]
                # "label" appears to be buried in item["interaction"]

How to read start_urls from csv file in scrapy?

I have two spiders, let's say A and B. A scrapes a bunch of URLs and writes them into a CSV file, and B scrapes inside those URLs, reading from the CSV file generated by A. But B throws a FileNotFoundError before A can actually create the file. How can I make my spiders behave such that B waits until A comes back with the URLs? Any other solution would be helpful.
WriteToCsv.py file
    def write_to_csv(item):
        with open('urls.csv', 'a', newline='') as csvfile:
            fieldnames = ['url']
            writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
            writer.writerow({'url': item})

    class WriteToCsv(object):
        def process_item(self, item, spider):
            if item['url']:
                write_to_csv("http://pypi.org" + item["url"])
            return item
Pipelines.py file
    ITEM_PIPELINES = {
        'PyPi.WriteToCsv.WriteToCsv': 100,
        'PyPi.pipelines.PypiPipeline': 300,
    }
read_csv method
    def read_csv():
        x = []
        with open('urls.csv', 'r') as csvFile:
            reader = csv.reader(csvFile)
            for row in reader:
                x = [''.join(url) for url in reader]
        return x
start_urls in B spider file
    start_urls = read_csv()  # Error here
I would consider using a single spider with two methods, parse and final_parse. As far as I can tell from the context you have provided, there is no need to write the URLs to disk.
parse should contain the logic for scraping the URLs that spider A is currently writing to the CSV, and should return a new request with a callback to the final_parse method.
    def parse(self, response):
        url = do_something(response.body_as_unicode())
        return scrapy.Request(url, callback=self.final_parse)
final_parse should then contain the parsing logic that was previously in spider B.
    def final_parse(self, response):
        item = do_something_else(response.body_as_unicode())
        return item
Note: If you need to pass any additional information from parse to final_parse, you can use the meta argument of scrapy.Request.
If you do need the URLs, you could add them as a field to your item. They can be accessed with response.url. A sketch of both is below.
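A minimal sketch of both ideas, reusing the do_something/do_something_else placeholders from above; the source_url field name is just an illustration:

```
def parse(self, response):
    url = do_something(response.body_as_unicode())
    # meta carries extra data from parse through to final_parse
    return scrapy.Request(url, callback=self.final_parse,
                          meta={'source_url': response.url})

def final_parse(self, response):
    item = do_something_else(response.body_as_unicode())
    item['source_url'] = response.meta['source_url']  # value passed via meta
    item['url'] = response.url                        # URL of this response
    return item
```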

Print JSON data from csv list of multiple urls

Very new to Python and I haven't found a specific answer on SO, so apologies in advance if this is very naive or already answered elsewhere.
I am trying to print the 'IncorporationDate' JSON data from multiple URLs of a public data set. I have the URLs saved in a CSV file (snippet below). I only get as far as printing ALL the JSON data from one URL, and I am uncertain how to run that over all of the URLs in the CSV and write just the IncorporationDate values to a CSV.
Any basic guidance or edits are really welcomed!
    try:
        # For Python 3.0 and later
        from urllib.request import urlopen
    except ImportError:
        # Fall back to Python 2's urllib2
        from urllib2 import urlopen

    import json

    def get_jsonparsed_data(url):
        response = urlopen(url)
        data = response.read().decode("utf-8")
        return json.loads(data)

    url = ("http://data.companieshouse.gov.uk/doc/company/01046514.json")
    print(get_jsonparsed_data(url))

    import csv
    with open('test.csv') as f:
        lis = [line.split() for line in f]
        for i, x in enumerate(lis):
            print()

    import StringIO
    s = StringIO.StringIO()
    with open('example.csv', 'w') as f:
        for line in s:
            f.write(line)
Snippet of csv:
http://business.data.gov.uk/id/company/01046514.json
http://business.data.gov.uk/id/company/01751318.json
http://business.data.gov.uk/id/company/03164710.json
http://business.data.gov.uk/id/company/04403406.json
http://business.data.gov.uk/id/company/04405987.json
Welcome to the Python world.
For making HTTP requests, we commonly use requests because of its dead-simple API.
The code snippet below does what I believe you want:

- It grabs the data from each of the URLs you posted.
- It creates a new CSV file with each of the IncorporationDate values.
```
import csv
import requests

COMPANY_URLS = [
    'http://business.data.gov.uk/id/company/01046514.json',
    'http://business.data.gov.uk/id/company/01751318.json',
    'http://business.data.gov.uk/id/company/03164710.json',
    'http://business.data.gov.uk/id/company/04403406.json',
    'http://business.data.gov.uk/id/company/04405987.json',
]


def get_company_data():
    for url in COMPANY_URLS:
        res = requests.get(url)
        if res.status_code == 200:
            yield res.json()


if __name__ == '__main__':
    for data in get_company_data():
        try:
            incorporation_date = data['primaryTopic']['IncorporationDate']
        except KeyError:
            continue
        else:
            with open('out.csv', 'a') as csvfile:
                writer = csv.writer(csvfile)
                writer.writerow([incorporation_date])
```
First step, you have to read all the URLs in your CSV (csv.reader needs a file object, not a filename, and each row is a list, so take its first column):

    import csv

    with open('test.csv') as csv_file:
        csvReader = csv.reader(csv_file)
        # next(csvReader)  # uncomment if you have a header in the .CSV file
        all_urls = [row[0] for row in csvReader if row]
Second step, fetch the data from a URL:

    import json
    from urllib.request import urlopen

    def get_jsonparsed_data(url):
        response = urlopen(url)
        data = response.read().decode("utf-8")
        return json.loads(data)

    url_data = get_jsonparsed_data("give_your_url_here")
Third step:

- Go through all the URLs that you got from the CSV file
- Get the JSON data
- Fetch the field you need, in your case "IncorporationDate"
- Write it into an output CSV file, here named IncorporationDates.csv

Code below (the output file is opened in append mode so each URL adds a new line instead of overwriting the file on every iteration):

    for each_url in all_urls:
        url_data = get_jsonparsed_data(each_url)
        with open('IncorporationDates.csv', 'a') as abc:
            abc.write(url_data['primaryTopic']['IncorporationDate'] + '\n')

JSON.Dump doesn't capture the whole stream

So I have a simple crawler that crawls three store-locator pages and parses the locations of the stores to JSON. I print(app_data['stores']) and it prints all three pages of stores. However, when I try to write it out I only get one of the three pages, at random, written to my JSON file. I'd like everything that streams to be written to the file. Any help would be great. Here's the code:
    import scrapy
    import json
    import js2xml
    from pprint import pprint

    class StlocSpider(scrapy.Spider):
        name = "stloc"
        allowed_domains = ["bestbuy.com"]
        start_urls = (
            'http://www.bestbuy.com/site/store-locator/11356',
            'http://www.bestbuy.com/site/store-locator/46617',
            'http://www.bestbuy.com/site/store-locator/77521'
        )

        def parse(self, response):
            js = response.xpath('//script[contains(.,"window.appData")]/text()').extract_first()
            jstree = js2xml.parse(js)
            # print(js2xml.pretty_print(jstree))
            app_data_node = jstree.xpath('//assign[left//identifier[@name="appData"]]/right/*')[0]
            app_data = js2xml.make_dict(app_data_node)
            print(app_data['stores'])
            for store in app_data['stores']:
                yield store
            with open('stores.json', 'w') as f:
                json.dump(app_data['stores'], f, indent=4)
You are opening the file in write mode every time parse runs, so each response overwrites the previous one, but you want to append. Try changing the last part to this:

    with open('stores.json', 'a') as f:
        json.dump(app_data['stores'], f, indent=4)

where 'a' opens the file for appending.

Urllib/JSON request from text file

I am trying to send data from a text file to a server, looking for a match to the sent data, in order to get the matched data returned and stored in an existing text file. If I send a list of names to the server within the script, I am fine. However, I want to repeat the request using a text file as the source of the names to be matched and returned. Here is my script so far:

    import json
    import urllib2

    values = 'E:\names.txt'
    url = 'https://myurl.com/get?name=values&key=##########'

    response = json.load(urllib2.urlopen(url))
    with open('E:\data.txt', 'w') as outfile:
        json.dump(response, outfile, sort_keys=True, indent=4, ensure_ascii=False)

This code just sends back a one-line file showing nothing has matched. I am assuming that it is just looking at the literal string "values" as the name instead of the data in the values text file.
Update, trial 1: I updated my code as suggested below to include urllib.urlencode. Here is my updated code:

    import json
    import urllib
    import urllib2

    file = 'E:\names.txt'
    url = 'https://myurl.com/get'
    values = {'name': file,
              'key': '##########'}

    data = urllib.urlencode(values)
    req = urllib2.Request(url, data)
    response = json.load(urllib2.urlopen(req))
    with open('E:\data.txt', 'w') as outfile:
        json.dump(response, outfile, sort_keys=True, indent=4, ensure_ascii=False)

I fixed the traceback errors by editing the url. However, it is just passing "E:\names.txt" as the name in the request. So it seems my issue now is just sending the data that is in the names.txt file as the 'name' parameter properly. Any thoughts?
Make sure that when sending parameters to the server, they're encoded -- see urllib.urlencode().
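For example, a minimal sketch (Python 2, to match the question) that reads the names out of the file and sends those instead of the path string; how your API expects multiple names to be joined is an assumption here:

```
import json
import urllib
import urllib2

with open('E:\\names.txt') as f:
    # assumption: one name per line in names.txt
    names = [line.strip() for line in f if line.strip()]

values = {'name': ','.join(names),  # send the file's contents, not its path
          'key': '##########'}
url = 'https://myurl.com/get?' + urllib.urlencode(values)

response = json.load(urllib2.urlopen(url))
with open('E:\\data.txt', 'w') as outfile:
    json.dump(response, outfile, sort_keys=True, indent=4, ensure_ascii=False)
```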
