json.dump doesn't capture the whole stream - Python

So I have a simple crawler that crawls three store location pages and parses the store locations to JSON. When I print(app_data['stores']) it prints all three pages of stores. However, when I try to write it out, only one of the three pages, at random, ends up in my JSON file. I'd like everything that comes through to be written to the file. Any help would be great. Here's the code:
import scrapy
import json
import js2xml
from pprint import pprint

class StlocSpider(scrapy.Spider):
    name = "stloc"
    allowed_domains = ["bestbuy.com"]
    start_urls = (
        'http://www.bestbuy.com/site/store-locator/11356',
        'http://www.bestbuy.com/site/store-locator/46617',
        'http://www.bestbuy.com/site/store-locator/77521'
    )

    def parse(self, response):
        js = response.xpath('//script[contains(.,"window.appData")]/text()').extract_first()
        jstree = js2xml.parse(js)
        # print(js2xml.pretty_print(jstree))
        app_data_node = jstree.xpath('//assign[left//identifier[@name="appData"]]/right/*')[0]
        app_data = js2xml.make_dict(app_data_node)
        print(app_data['stores'])
        for store in app_data['stores']:
            yield store
        with open('stores.json', 'w') as f:
            json.dump(app_data['stores'], f, indent=4)

You are opening the file for writing every time, but you want to append. Try changing the last part to this:
with open('stores.json', 'a') as f:
    json.dump(app_data['stores'], f, indent=4)
Where 'a' opens the file for appending.
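Note that appending one json.dump per page leaves several JSON documents back to back in the same file, which a strict JSON parser won't read as a single document. If you want one valid JSON file, a minimal sketch (same spider as above, assuming you are happy to write once at the end) is to collect the stores and dump them in the spider's closed() hook, which Scrapy calls when the crawl finishes:

import json
import js2xml
import scrapy

class StlocSpider(scrapy.Spider):
    name = "stloc"
    allowed_domains = ["bestbuy.com"]
    start_urls = (
        'http://www.bestbuy.com/site/store-locator/11356',
        'http://www.bestbuy.com/site/store-locator/46617',
        'http://www.bestbuy.com/site/store-locator/77521'
    )

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.all_stores = []  # accumulates stores from every page

    def parse(self, response):
        js = response.xpath('//script[contains(.,"window.appData")]/text()').extract_first()
        jstree = js2xml.parse(js)
        app_data_node = jstree.xpath('//assign[left//identifier[@name="appData"]]/right/*')[0]
        app_data = js2xml.make_dict(app_data_node)
        self.all_stores.extend(app_data['stores'])
        for store in app_data['stores']:
            yield store

    def closed(self, reason):
        # write a single valid JSON array once all pages have been parsed
        with open('stores.json', 'w') as f:
            json.dump(self.all_stores, f, indent=4)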

Related

How to open a large csv with a list of urls and crawl through those?

I have a large csv on my local machine that contains only a list of urls, no other columns. I want to crawl each of those urls and extract a certain css element from each page. I have completed a test with just a one-off start url, without reading the csv. What I can't figure out is how to open a csv with something like a million urls in it and have scrapy go through each one, scrape it, and move on to the next.
import scrapy
from ..items import stkscrapeItem

class stkSpider(scrapy.Spider):
    name = 'stkscrape'
    start_urls = [
        'https://www.exampleurl.com'
    ]

    def parse(self, response):
        items = stkscrapeItem()
        contriburl = response.css(".b_q_e a::attr(href)").extract()
        items['contriburl'] = contriburl
        yield items
I just typed this directly here, so there may be typos, but it should be pretty close to what you are expecting.
Just to make it perform better, you can split the CSV file with millions of URLs into chunks using pandas:
create_files.py
import pandas as pd

counter = 0
for df in pd.read_csv("your_file_with_urls.csv", chunksize=100000):
    df.to_csv(f"input_{counter}.csv", index=False)
    counter += 1
This generates a set of smaller files in the same location. Now, in the same directory as the scrapy.cfg file, create a file main.py with the following:
main.py
from glob import glob
from scrapy import cmdline

for each_file in glob("input_*.csv"):
    cmdline.execute("scrapy crawl your_spider".split() + ["-a", f"each_file={each_file}"])
This way we are sending each small file's name to the spider's constructor (its __init__ method). In your spider, receive the each_file argument as follows:
your_spider.py
import pandas as pd
import scrapy

class YourSpiderName(scrapy.Spider):
    def __init__(self, each_file='', **kwargs):
        self.start_urls = set(pd.read_csv(each_file)["URL_COLUMN"].tolist())
        super().__init__(**kwargs)  # python3
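One caveat, offered as a hedge rather than a certainty: in some Scrapy versions cmdline.execute ends the Python process once the first crawl finishes, so the loop in main.py may only ever process one file. If you hit that, launching each crawl in its own process does the same job. A minimal sketch, reusing the hypothetical your_spider name and the -a each_file argument from above:

import subprocess
from glob import glob

for each_file in glob("input_*.csv"):
    # run one fresh scrapy process per chunk file
    subprocess.run(
        ["scrapy", "crawl", "your_spider", "-a", f"each_file={each_file}"],
        check=True,
    )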

Reading a file in python: json.decoder.JSONDecodeError

I have a json file.
with open('list.json', "r") as f:
    r_list = json.load(f)
crashes with:
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
I checked the JSON online and it validates. The file is very simple:
{"foo": [
{"name": "AAA\u2019s BBB CCC", "url": "/foome/foo"}
]}
Things I've tried to play with:
the file encoding
a dummy file
I've run out of ideas - is it something where `json.load` expects a binary file?
Edit 1
The code works in a plain script, but it does not work inside the scrapy class:
import scrapy
from scrapy.selector import Selector
from scrapy.http import HtmlResponse
import json

class myScraper(scrapy.Spider):
    name = "testScraper"

    def start_requests(self):
        with open('test.json') as f:
            self.logger.info(f.read())  # shows the file content
            r_list = json.load(f)  # breaks with the error msg
        yield "foo"

    def parse(self, response):
        self.logger.info("foo")
'test.json':
{
    "too": "foo"
}
Most likely your file is empty.
Example:
https://repl.it/#mark_boyle_sp/SphericalImpressiveIrc
updated:
Your iterator is exhausted, as also discussed in the comments.
Since you log the file's contents first, the file handle is already at the end of the file when json.load runs, so it sees what looks like an empty file - hence the exception.
Reset the file position, or read the contents into a local variable and operate on that:
json_str = f.read()
self.logger.info(json_str) #shows the file content
r_list = json.loads(json_str)
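The first option, rewinding the file instead of re-reading it into a string, would look roughly like this (a sketch using the same test.json as above):

with open('test.json') as f:
    self.logger.info(f.read())  # this consumes the file...
    f.seek(0)                   # ...so rewind to the start before parsing
    r_list = json.load(f)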
updated again
(I assume) the scrapy issue you are having is in the parse method? The response body is a bytes object; you will need to decode it and call json.loads on the resulting string, like so:
def parse(self, response):
    self.logger.info("foo")
    resp_str = response.body.decode('utf-8')
    self.logger.info(resp_str)  # shows the response
    r_list = json.loads(resp_str)
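As a side note (this is an assumption about your Scrapy version, not part of the original answer): recent Scrapy releases can do the decoding for you via response.text, and Scrapy 2.2+ adds response.json(), so the parse method can shrink to something like:

def parse(self, response):
    # response.text is the body decoded with the response's encoding
    r_list = json.loads(response.text)
    # or, on Scrapy 2.2+:
    # r_list = response.json()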

Can't find any way to close a csv file

I've written a script in python scrapy to get different ids and their corresponding names from a webpage. When I execute my script, I can see that the results come through correctly and the data lands in a csv file. I'm using python 3.6, so when I go for scrapy's built-in command (meant to write data to a csv file), I always get a csv file with blank lines in every alternate row. However, I tried the following to serve the purpose and it does its job: it now produces a csv file without the blank-line issue.
My question: how can I close the csv file when the job is done?
This is my try so far:
import scrapy, csv

class SuborgSpider(scrapy.Spider):
    name = "suborg"
    start_urls = ['https://www.un.org/sc/suborg/en/sanctions/1267/aq_sanctions_list/summaries?type=All&page={}'.format(page) for page in range(0,7)]

    def __init__(self):
        self.file = open("output.csv", "w", newline="")

    def parse(self, response):
        for item in response.xpath('//*[contains(@class,"views-table")]//tbody//tr'):
            idnum = item.xpath('.//*[contains(@class,"views-field-field-reference-number")]/text()').extract()[-1].strip()
            name = item.xpath('.//*[contains(@class,"views-field-title")]//span[@dir="ltr"]/text()').extract()[-1].strip()
            yield {'ID': idnum, 'Name': name}
            writer = csv.writer(self.file)
            writer.writerow([idnum, name])
You can close the actual file object instead: do it in the closed() method, which is automatically called when the spider is closed.
def closed(self, reason):
    self.file.close()
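Putting the pieces together, a sketch of how the spider from the question and this closed() hook fit (one of several reasonable arrangements; the csv writer is created once in __init__ instead of on every row):

import csv
import scrapy

class SuborgSpider(scrapy.Spider):
    name = "suborg"
    start_urls = ['https://www.un.org/sc/suborg/en/sanctions/1267/aq_sanctions_list/summaries?type=All&page={}'.format(page) for page in range(0,7)]

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.file = open("output.csv", "w", newline="")
        self.writer = csv.writer(self.file)  # one writer for the whole crawl

    def parse(self, response):
        for item in response.xpath('//*[contains(@class,"views-table")]//tbody//tr'):
            idnum = item.xpath('.//*[contains(@class,"views-field-field-reference-number")]/text()').extract()[-1].strip()
            name = item.xpath('.//*[contains(@class,"views-field-title")]//span[@dir="ltr"]/text()').extract()[-1].strip()
            yield {'ID': idnum, 'Name': name}
            self.writer.writerow([idnum, name])

    def closed(self, reason):
        self.file.close()  # called by Scrapy once the spider finishes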

Read data from api and populate .csv bug

I am trying to write a script (Python 2.7.11, Windows 10) to collect data from an API and append it to a csv file.
The API I want to use returns data in json.
It limits the number of displayed records, though, and paginates them.
So there is a maximum number of records you can get with a single query, and then you have to run another query with a different page number.
The API tells you how many pages a dataset is divided into.
Let's assume that the maximum number of records per page is 100 and the number of pages is 2.
My script:
import json
import urllib2
import csv

url = "https://some_api_address?page="
limit = "&limit=100"
myfile = open('C:\Python27\myscripts\somefile.csv', 'ab')

def api_iterate():
    for i in xrange(1, 2, 1):
        parse_url = url,(i),limit
        json_page = urllib2.urlopen(parse_url)
        data = json.load(json_page)
        for item in data['someobject']:
            print item ['some_item1'], ['some_item2'], ['some_item3']
        f = csv.writer(myfile)
        for row in data:
            f.writerow([str(row)])
This does not seem to work: it creates a csv file, but the file is not populated. There is obviously something wrong with either the part of the script that builds the address for the query, or the part that reads the json, or the part that writes the query results to csv. Or all of them.
I have tried other resources and tutorials, but at some point I got stuck and I would appreciate your assistance.
The url you have given provides a link to the next page as one of the objects. You can use this to iterate automatically over all of the pages.
The script below gets each page, extracts two of the entries from the Dataobject array and writes them to an output.csv file:
import json
import urllib2
import csv

def api_iterate(myfile):
    url = "https://api-v3.mojepanstwo.pl/dane/krs_osoby"
    csv_myfile = csv.writer(myfile)
    cols = ['id', 'url']
    csv_myfile.writerow(cols)  # Write a header

    while True:
        print url
        json_page = urllib2.urlopen(url)
        data = json.load(json_page)
        json_page.close()

        for data_object in data['Dataobject']:
            csv_myfile.writerow([data_object[col] for col in cols])

        try:
            url = data['Links']['next']  # Get the next url
        except KeyError as e:
            break

with open(r'e:\python temp\output.csv', 'wb') as myfile:
    api_iterate(myfile)
This will give you an output file looking something like:
id,url
1347854,https://api-v3.mojepanstwo.pl/dane/krs_osoby/1347854
1296239,https://api-v3.mojepanstwo.pl/dane/krs_osoby/1296239
705217,https://api-v3.mojepanstwo.pl/dane/krs_osoby/705217
802970,https://api-v3.mojepanstwo.pl/dane/krs_osoby/802970
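If you later move to Python 3, the same loop can be written with urllib.request and a text-mode file. A sketch under that assumption (Python 3.6+, same endpoint and columns as above):

import csv
import json
import urllib.request

def api_iterate(myfile):
    url = "https://api-v3.mojepanstwo.pl/dane/krs_osoby"
    csv_myfile = csv.writer(myfile)
    cols = ['id', 'url']
    csv_myfile.writerow(cols)  # Write a header

    while True:
        with urllib.request.urlopen(url) as json_page:
            data = json.load(json_page)  # json.load accepts the bytes stream on 3.6+

        for data_object in data['Dataobject']:
            csv_myfile.writerow([data_object[col] for col in cols])

        try:
            url = data['Links']['next']  # Get the next url
        except KeyError:
            break

with open('output.csv', 'w', newline='') as myfile:
    api_iterate(myfile)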

how to clean a JSON file and store it to another file in Python

I am trying to read a JSON file with Python. This file is described by the authors as not strict JSON. In order to convert it to strict JSON, they suggest this approach:
import gzip
import json

def parse(path):
    g = gzip.open(path, 'r')
    for l in g:
        yield json.dumps(eval(l))
However, not being familiar with Python, I am able to execute the script but not to produce any output file with the new clean JSON. How should I modify the script in order to produce a new JSON file? I have tried this:
import json

class Amazon():
    def parse(self, inpath, outpath):
        g = open(inpath, 'r')
        out = open(outpath, 'w')
        for l in g:
            yield json.dumps(eval(l), out)

amazon = Amazon()
amazon.parse("original.json", "cleaned.json")
but the output is an empty file. Any help is more than welcome.
import json

class Amazon():
    def parse(self, inpath, outpath):
        g = open(inpath, 'r')
        with open(outpath, 'w') as fout:
            for l in g:
                fout.write(json.dumps(eval(l)))

amazon = Amazon()
amazon.parse("original.json", "cleaned.json")
Another, shorter way of doing this:
import json

class Amazon():
    def parse(self, readpath, writepath):
        with open(readpath) as g, open(writepath, 'w') as fout:
            for l in g:
                json.dump(eval(l), fout)

amazon = Amazon()
amazon.parse("original.json", "cleaned.json")
When handling json data it is better to use the json module: json.dump(obj, output_file) to write json to a file and json.load(file_object) to load it back. This way the json structure is preserved when saving and reading the data.
For very large amounts of data, say 1k+ records, consider the python pandas module.
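For illustration, a minimal round trip with the json module (the data and file name here are made up):

import json

data = {"name": "AAA", "items": [1, 2, 3]}

with open("data.json", "w") as f:
    json.dump(data, f)        # serialize the object to the file

with open("data.json") as f:
    data_back = json.load(f)  # parse it back into Python objects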
