I am currently using Scrapy to crawl some domains from different websites, and I wonder how to save my data in a local JSON file, formatted as either a list or a dictionary with the key 'domain' and a list of domains as the value.
In the crawler file, the item is like this:
item['domain'] = 'xxx'.extract()
yield item
import json
import codecs

class ChinazPipeline(object):

    def __init__(self):
        self.file = codecs.open('save.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        line = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(line)
        return item
What I expect is:
{"domain": "['google.com', 'cnn.com', 'yahoo.com']"}
or just save all the domains I crawled as a list in JSON; either way works for me.
It's rather simple. JSON is a default Scrapy exporter.
You can use it by turning on output to a JSON file:
scrapy runspider yourspider.py -o filename.json
Scrapy will automatically determine the format you wish to have from the file extension.
Other options are .csv and .jsonlines.
It's the easy way. Otherwise you can write your own ItemExporter. Take a look at the exporters documentation.
NB:
You don't even need to open the file during spider initialization; Scrapy will manage it by itself.
Just yield items and Scrapy will write them to the file automatically.
Scrapy is most suitable for a one page -> one item schema.
What you want is to scrape all items first and then export them as a single list.
So you should have a variable like self.results, append new domains to it from every process_item() call, and then export it on the spider close event.
There's a shortcut for this signal, so you can just add:
def closed(self, reason):
    # write self.results list to JSON file.
See the documentation on the Spider.closed() method for more.
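A minimal sketch of that pattern (the spider name, start URL, and selector are made up for illustration; the answer suggests accumulating in process_item(), but for simplicity this sketch keeps everything in the spider itself):

import json
import scrapy

class DomainSpider(scrapy.Spider):
    name = "domains"                      # hypothetical spider name
    start_urls = ["http://example.com"]   # hypothetical start URL

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.results = []                 # domains accumulated across all pages

    def parse(self, response):
        # placeholder extraction logic; collect each domain into self.results
        for domain in response.css("a::attr(href)").getall():
            self.results.append(domain)

    def closed(self, reason):
        # called automatically when the spider finishes; write the whole list once
        with open("save.json", "w", encoding="utf-8") as f:
            json.dump({"domain": self.results}, f, ensure_ascii=False)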
Related
I have a large CSV on my local machine that contains only a list of URLs, no other columns. I want to crawl each of those URLs and extract a certain CSS element from each one. I have completed a test of that with just a one-off start URL, not reading the CSV. I can't figure out how to open a large CSV with around a million URLs in it and have Scrapy go through each one, scrape it, and then move on to the next.
import scrapy
from ..items import stkscrapeItem

class stkSpider(scrapy.Spider):
    name = 'stkscrape'

    start_urls = [
        'https://www.exampleurl.com'
    ]

    def parse(self, response):
        items = stkscrapeItem()
        contriburl = response.css(".b_q_e a::attr(href)").extract()
        items['contriburl'] = contriburl
        yield items
I just typed this directly here, so it may have typos,
but it should be pretty close to what you are expecting.
Just to make it perform better, you can split the CSV file with millions of URLs into chunks using pandas:
create_files.py
import pandas as pd

counter = 0
for df in pd.read_csv("your_file_with_urls.csv", chunksize=100000):
    df.to_csv(f"input_{counter}.csv", index=False)
    counter += 1
This generates several smaller files in the same location.
Now, in the same location as the scrapy.cfg file, create a file main.py with the following:
main.py
from glob import glob
from scrapy import cmdline

for each_file in glob("input_*.csv"):
    # -a expects name=value pairs, so pass the chunk filename explicitly;
    # note: cmdline.execute ends the process once the crawl finishes, so in
    # practice each chunk may need its own process (e.g. via subprocess)
    cmdline.execute("scrapy crawl your_spider".split() + ["-a", f"each_file={each_file}"])
This way we are sending each small file's name to the spider's constructor (the __init__ method).
In your_spider, receive the each_file argument as follows:
your_spider.py
import pandas as pd
import scrapy

class YourSpiderName(scrapy.Spider):
    name = "your_spider"  # matches the name used in the crawl command

    def __init__(self, each_file='', **kwargs):
        # each_file is the chunk filename passed in via -a each_file=...
        self.start_urls = set(pd.read_csv(each_file)["URL_COLUMN"].tolist())
        super().__init__(**kwargs)  # python3
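With the -a name=value form above, each pass of the loop behaves as if you had run something like (chunk filename shown only as an example):

scrapy crawl your_spider -a each_file=input_0.csv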
I'm attempting to scrape articles on 100 companies, and I want to save the content from the multiple articles to a separate csv file for each company. I have the scraper and a csv export pipeline built, and it works fine; however, the spider opens a new csv file for each company (as it should) without closing the file opened for the previous company.
The csv files close after the spider closes, but because of the amount of data I am scraping for each company, the file sizes are significant and put a strain on my machine's memory. This cannot realistically scale: if I increase the number of companies (something I eventually want to do), I will eventually run into an error for having too many files open at a time. Below is my csv exporter pipeline. I would like to find a way to close the csv file for the current company before moving on to the next company within the same spider:
I guess, theoretically, I could open the file for each article, write the content to new rows, then close it and reopen it again for the next article, but that will slow the spider down significantly. I'd like to keep the file open for a given company while the spider is still making its way through that company's articles, then close it when the spider moves on to the next company.
I'm sure there is a solution but I have not been able to figure one out. Would greatly appreciate help solving this.
from scrapy.exporters import CsvItemExporter


class PerTickerCsvExportPipeline:
    """Distribute items across multiple CSV files according to their 'ticker' field"""

    def open_spider(self, spider):
        self.ticker_to_exporter = {}

    def close_spider(self, spider):
        for exporter in self.ticker_to_exporter.values():
            exporter.finish_exporting()

    def _exporter_for_item(self, item):
        ticker = item['ticker']
        if ticker not in self.ticker_to_exporter:
            f = open('{}_article_content.csv'.format(ticker), 'wb')
            exporter = CsvItemExporter(f)
            exporter.start_exporting()
            self.ticker_to_exporter[ticker] = exporter
        return self.ticker_to_exporter[ticker]

    def process_item(self, item, spider):
        exporter = self._exporter_for_item(item)
        exporter.export_item(item)
        return item
The problem is probably that you keep all the ItemExporters and files open until the spider closes. I suggest closing the CsvItemExporter and the corresponding file for the previous company before you open a new one.
def open_spider(self, spider):
    self.ticker_to_exporter = {}
    self.files = []

def close_exporters(self):
    for exporter in self.ticker_to_exporter.values():
        exporter.finish_exporting()
    self.ticker_to_exporter.clear()

def close_files(self):
    for f in self.files:
        f.close()
    self.files.clear()

def close_spider(self, spider):
    self.close_exporters()
    self.close_files()

def _exporter_for_item(self, item):
    ticker = item['ticker']
    if ticker not in self.ticker_to_exporter:
        # close the exporter and file of the previous company before opening a new one
        self.close_exporters()
        self.close_files()
        f = open('{}_article_content.csv'.format(ticker), 'ab')  # CsvItemExporter needs a binary-mode file
        self.files.append(f)
        exporter = CsvItemExporter(f)
        exporter.start_exporting()
        self.ticker_to_exporter[ticker] = exporter
    return self.ticker_to_exporter[ticker]
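Whichever version of the pipeline you use, it only runs if it is enabled in settings.py; the module path below is just an assumption about your project layout:

ITEM_PIPELINES = {
    'yourproject.pipelines.PerTickerCsvExportPipeline': 300,
}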
I'm new to Scrapy. I have read several discussions about this tool. I have a problem exporting csv files: I'm scraping numeric values that contain commas, and the default separator of the csv exporter is a comma, so I have some problems when I open the resulting file in Excel.
How can I change the default delimiter of csv files in Scrapy to a semicolon? I read some discussions about this issue but I don't know what code I have to add and where.
Thanks in advance!
scraper/exporters.py
from scrapy.exporters import CsvItemExporter

class CsvCustomSeperator(CsvItemExporter):

    def __init__(self, *args, **kwargs):
        kwargs['encoding'] = 'utf-8'
        kwargs['delimiter'] = ';'  # set whatever delimiter you need here
        super(CsvCustomSeperator, self).__init__(*args, **kwargs)
scraper/settings.py
FEED_EXPORTERS = {
    'csv': 'scraper.exporters.CsvCustomSeperator'
}
This solution worked for me.
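With both files in place, an ordinary export should pick up the custom exporter for the csv format, for example (spider name is a placeholder):

scrapy crawl yourspider -o output.csv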
You should check if quotechar is enabled and set in your export.
https://doc.scrapy.org/en/latest/topics/spiders.html?highlight=CSV_DELIMITER#csvfeedspider-example
Usually text fields are quoted with ", so it's not an issue if the delimiter appears inside the text.
try this:
scrapy crawl yourCrawlerName -o output.csv --set delimiter=";"
I've written a script in Python Scrapy to get different IDs and their corresponding names from a webpage. When I execute my script, I can see the results coming through correctly and I get a csv file filled with data. I'm using Python 3.6, so when I use Scrapy's built-in command (meant to write data to a csv file), I always get a csv file with blank lines in every alternate row. However, I tried the following to serve the purpose, and it does its job: it now produces a csv file without the blank-line issue.
My question: how can I close the csv file when the job is done?
This is my try so far:
import scrapy, csv

class SuborgSpider(scrapy.Spider):
    name = "suborg"
    start_urls = ['https://www.un.org/sc/suborg/en/sanctions/1267/aq_sanctions_list/summaries?type=All&page={}'.format(page) for page in range(0,7)]

    def __init__(self):
        self.file = open("output.csv", "w", newline="")

    def parse(self, response):
        for item in response.xpath('//*[contains(@class,"views-table")]//tbody//tr'):
            idnum = item.xpath('.//*[contains(@class,"views-field-field-reference-number")]/text()').extract()[-1].strip()
            name = item.xpath('.//*[contains(@class,"views-field-title")]//span[@dir="ltr"]/text()').extract()[-1].strip()
            yield {'ID': idnum, 'Name': name}

            writer = csv.writer(self.file)
            writer.writerow([idnum, name])
You can close the actual file yourself instead.
You can do it in the closed() method, which is automatically called when the spider is closed.
def closed(self, reason):
    self.file.close()
So I have a simple crawler that crawls 3 store-location pages and parses the locations of the stores to JSON. I print(app_data['stores']) and it prints all three pages of stores. However, when I try to write it out, I only get one of the three pages, at random, written to my JSON file. I'd like everything that streams to be written to the file. Any help would be great. Here's the code:
import scrapy
import json
import js2xml
from pprint import pprint

class StlocSpider(scrapy.Spider):
    name = "stloc"
    allowed_domains = ["bestbuy.com"]
    start_urls = (
        'http://www.bestbuy.com/site/store-locator/11356',
        'http://www.bestbuy.com/site/store-locator/46617',
        'http://www.bestbuy.com/site/store-locator/77521'
    )

    def parse(self, response):
        js = response.xpath('//script[contains(.,"window.appData")]/text()').extract_first()
        jstree = js2xml.parse(js)
        # print(js2xml.pretty_print(jstree))
        app_data_node = jstree.xpath('//assign[left//identifier[@name="appData"]]/right/*')[0]
        app_data = js2xml.make_dict(app_data_node)
        print(app_data['stores'])
        for store in app_data['stores']:
            yield store
        with open('stores.json', 'w') as f:
            json.dump(app_data['stores'], f, indent=4)
You are opening the file for writing every time, but you want to append. Try changing the last part to this:
with open('stores.json', 'a') as f:
    json.dump(app_data['stores'], f, indent=4)
Where 'a' opens the file for appending.
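Note that appending a second json.dump() to the same file produces two JSON documents back to back rather than one valid JSON list. Since the stores are already being yielded, another option (in line with the feed-export approach mentioned at the top of this page) is to drop the file handling from parse() and let Scrapy collect everything, e.g. (assuming the spider lives in a Scrapy project):

scrapy crawl stloc -o stores.json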