I'm attempting to scrape articles on 100 companies, and I want to save the content from the articles to a separate CSV file for each company. I have the scraper and a CSV export pipeline built, and they work fine; however, the spider opens a new CSV file for each company (as it should) without closing the file it opened for the previous company.
The CSV files do close once the spider closes, but because of the amount of data I am scraping for each company, the file sizes are significant and put a strain on my machine's memory. This also can't realistically scale: if I increase the number of companies (something I eventually want to do), I will eventually run into an error for having too many files open at a time. I would like to find a way to close the CSV file for the current company before moving on to the next company within the same spider.
I guess, theoretically, I could open the file for each article, write the content to new rows, then close it and reopen it for the next article, but that would slow the spider down significantly. I'd like to keep a company's file open while the spider is still making its way through that company's articles, then close it when the spider moves on to the next company.
I'm sure there is a solution, but I have not been able to figure one out. I would greatly appreciate help solving this. Below is my CSV exporter pipeline:
from scrapy.exporters import CsvItemExporter


class PerTickerCsvExportPipeline:
    """Distribute items across multiple CSV files according to their 'ticker' field"""

    def open_spider(self, spider):
        self.ticker_to_exporter = {}

    def close_spider(self, spider):
        for exporter in self.ticker_to_exporter.values():
            exporter.finish_exporting()

    def _exporter_for_item(self, item):
        ticker = item['ticker']
        if ticker not in self.ticker_to_exporter:
            f = open('{}_article_content.csv'.format(ticker), 'wb')
            exporter = CsvItemExporter(f)
            exporter.start_exporting()
            self.ticker_to_exporter[ticker] = exporter
        return self.ticker_to_exporter[ticker]

    def process_item(self, item, spider):
        exporter = self._exporter_for_item(item)
        exporter.export_item(item)
        return item
The problem probably is that you keep all the ItemExporters and files open until the spider closes. I suggest you try to close the CsvItemExporter and the corresponding file for the previous company before you open a new one:
def open_spider(self, spider):
    self.ticker_to_exporter = {}
    self.files = []

def close_exporters(self):
    for exporter in self.ticker_to_exporter.values():
        exporter.finish_exporting()
    self.ticker_to_exporter.clear()  # don't delete entries while iterating over the dict

def close_files(self):
    for f in self.files:
        f.close()
    self.files.clear()  # likewise, empty the list only after every file is closed

def close_spider(self, spider):
    self.close_exporters()
    self.close_files()

def _exporter_for_item(self, item):
    ticker = item['ticker']
    if ticker not in self.ticker_to_exporter:
        # close the exporter and file of the previous company first
        self.close_exporters()
        self.close_files()
        f = open('{}_article_content.csv'.format(ticker), 'ab')  # CsvItemExporter expects a binary file
        self.files.append(f)
        exporter = CsvItemExporter(f)
        exporter.start_exporting()
        self.ticker_to_exporter[ticker] = exporter
    return self.ticker_to_exporter[ticker]
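Note that this assumes the spider yields all of one company's articles before moving on to the next ticker; if items for different tickers arrive interleaved, each ticker's file gets closed and reopened repeatedly, which is why the file is opened in append mode ('ab') above.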
Normally I know what this error means, but in this case I believe I did pass in the argument.
I am playing around with Scrapy, and inside a pipeline I figured that if I am scraping a few different sites or pages, I want them all to output a JSON file, but of course a different JSON file for each, so I can know which JSON belongs to which website.
So I created a services folder, and inside there's a file called pipeline.py.
Inside this pipeline.py I created the class below:
import json
import os


class JsonWriterPipeline(object):
    """
    write all items to a file, most likely a json file
    """

    def __init__(self, filename):
        print(filename)  # this does print the filename properly
        self.file = open(filename, 'w')

    def open_spider(self, spider):
        self.file.write('[')

    def close_spider(self, spider):
        # remove the last two chars, which are ',\n', then add the closing bracket ']'
        self.file.seek(self.file.seek(0, os.SEEK_END) - 2)
        self.file.write(']')

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + ",\n"
        self.file.write(line)
        return item
Then, inside the original pipeline.py under the root folder, I have something like this:

from scrape.services.pipeline import JsonWriterPipeline

JsonWriterPipeline('testing.json')  # so I have passed the filename argument as 'testing.json'

But I just keep getting the error, even though, as mentioned above, print(filename) prints the name properly.
If I don't pass in the filename and instead use a static filename, it works perfectly, but of course I want it dynamic; that's why I created a class, so I can reuse it.
Anyone have ideas?
EDIT:
As Gallaecio mentioned below, I then realized that pipelines do not take parameters this way. From some googling, the answers saying that a pipeline accepts parameters mean parameters passed through the command line, not inside the code itself.
Thanks for any suggestions and advice given.
I thought of an alternative: instead of creating a new object and passing the argument while creating it, maybe try something like inheritance.
Sample below.
Inside services/pipeline.py:
import json
import os


class JsonWriterPipeline(object):
    """
    write all items to a file, most likely a json file
    """

    filename = 'demo.json'  # instead of passing an argument, create a variable for the class

    def __init__(self):
        self.file = open(self.filename, 'w+')

    def open_spider(self, spider):
        self.file.write('[')

    def close_spider(self, spider):
        # remove the last two chars, which are ',\n', then add the closing bracket ']'
        self.file.seek(self.file.seek(0, os.SEEK_END) - 2)
        self.file.write(']')

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + ",\n"
        self.file.write(line)
        return item
Inside the original pipeline.py:
from scrape.services.pipeline import JsonWriterPipeline


class JsonWriterPipelineA(JsonWriterPipeline):
    filename = 'a.json'

    def __init__(self):
        super().__init__()


class JsonWriterPipelineB(JsonWriterPipeline):
    filename = 'b.json'

    def __init__(self):
        super().__init__()
This is the alternative way I can think of; hope this helps you.
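For completeness, the settings-based route hinted at in the question's edit would look roughly like this. This is only a minimal sketch: JSON_FILENAME is a made-up setting name that you would define in settings.py or pass on the command line with -s JSON_FILENAME=testing.json.

import json


class JsonWriterPipeline(object):

    def __init__(self, filename):
        self.file = open(filename, 'w')

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls from_crawler to build the pipeline, so the filename
        # can be read from the (hypothetical) JSON_FILENAME setting
        return cls(crawler.settings.get('JSON_FILENAME', 'items.json'))

    def process_item(self, item, spider):
        self.file.write(json.dumps(dict(item)) + "\n")
        return item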
I have a large CSV on my local machine that contains only a list of URLs, no other columns. I want to crawl each of those URLs and extract a certain CSS element from each one. I have completed a test with just a one-off start URL, without looking at the CSV. I can't figure out how to open a large CSV with around a million URLs in it and have Scrapy go through each one, scrape it, and then move on to the next.
import scrapy

from ..items import stkscrapeItem


class stkSpider(scrapy.Spider):
    name = 'stkscrape'

    start_urls = [
        'https://www.exampleurl.com'
    ]

    def parse(self, response):
        items = stkscrapeItem()
        contriburl = response.css(".b_q_e a::attr(href)").extract()
        items['contriburl'] = contriburl
        yield items
I just typed this directly here, so it can have typos,
but it should be pretty close to what you are expecting.
Just to make it perform better, you can split the CSV file with millions of URLs into chunks using pandas:
create_files.py

import pandas as pd

counter = 0
for df in pd.read_csv("your_file_with_urls.csv", chunksize=100000):
    df.to_csv(f"input_{counter}.csv", index=False)
    counter += 1
This generates a number of files in the same location.
Now, in the same location as the scrapy.cfg file, create a file main.py with the following:
main.py

from glob import glob
import subprocess

# scrapy.cmdline.execute() calls sys.exit() when the crawl finishes, so it cannot
# be looped inside one process; running each crawl as a subprocess avoids that
for each_file in glob("input_*.csv"):
    subprocess.run(["scrapy", "crawl", "your_spider", "-a", f"each_file={each_file}"])
This way we are sending each small file's name to the spider's constructor (the __init__ method).
In your spider, receive the each_file argument as follows:
your_spider.py

import pandas as pd
import scrapy

class YourSpiderName(scrapy.Spider):
    def __init__(self, each_file='', **kwargs):
        # each_file is the chunk file name passed via -a each_file=...
        self.start_urls = set(pd.read_csv(each_file)["URL_COLUMN"].tolist())
        super().__init__(**kwargs)  # python3
I have two spiders, let's say A and B. A scrapes a bunch of URLs and writes them into a CSV file, and B scrapes inside those URLs, reading from the CSV file generated by A. But it throws a FileNotFound error from B before A can actually create the file. How can I make my spiders behave so that B waits until A comes back with the URLs? Any other solution would be helpful.
WriteToCsv.py file

import csv


def write_to_csv(item):
    with open('urls.csv', 'a', newline='') as csvfile:
        fieldnames = ['url']
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writerow({'url': item})


class WriteToCsv(object):

    def process_item(self, item, spider):
        if item['url']:
            write_to_csv("http://pypi.org" + item["url"])
        return item
Pipelines.py file

ITEM_PIPELINES = {
    'PyPi.WriteToCsv.WriteToCsv': 100,
    'PyPi.pipelines.PypiPipeline': 300,
}
read_csv method

def read_csv():
    x = []
    with open('urls.csv', 'r') as csvFile:
        reader = csv.reader(csvFile)
        for row in reader:
            x = [''.join(url) for url in reader]
    return x
start_urls in B spider file

start_urls = read_csv()  # Error here
I would consider using a single spider with two methods parse and final_parse. As far as I can tell from the context you have provided there is no need to write the URLs to disk.
parse should contain the logic for scraping the URLs that spider A is currently writing to the csv and should return a new request with a callback to the final_parse method.
def parse(self, response):
    url = do_something(response.body_as_unicode())
    return scrapy.Request(url, callback=self.final_parse)
final_parse should then contain the parsing logic that was previously in spider B.
def final_parse(self, response):
    item = do_something_else(response.body_as_unicode())
    return item
Note: If you need to pass any additional information from parse to final_parse you can use the meta argument of scrapy.Request.
If you do need the URLs, you could add this as a field to your item.
It can be accessed with response.url.
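A rough sketch of both ideas, building on the snippets above (do_something, do_something_else and the source_url field name are placeholders for this example):

def parse(self, response):
    url = do_something(response.body_as_unicode())
    # carry the originating URL along with the request via meta
    return scrapy.Request(url, callback=self.final_parse,
                          meta={"source_url": response.url})

def final_parse(self, response):
    item = do_something_else(response.body_as_unicode())
    item["source_url"] = response.meta["source_url"]  # URL of the first page
    item["final_url"] = response.url                  # URL of this response
    return item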
I am currently using Scrapy to crawl some domains from different websites, and I wonder how to save my data in a local JSON file, formatted either as a list or as a dictionary with the key 'domain' and a list of domains as the value.
In the crawler file, the item is like this:
item['domain'] = 'xxx'.extract()
yield item
import json
import codecs


class ChinazPipeline(object):

    def __init__(self):
        self.file = codecs.open('save.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        line = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(line)
        return item
What I expect is:
{"domain": ["google.com", "cnn.com", "yahoo.com"]}
or to simply save all the domains that I crawled as a list in JSON; either way works for me.
It's rather simple. JSON is a default Scrapy exporter.
You can use it by turning on output to a JSON file:
scrapy runspider yourspider.py -o filename.json
Scrapy will automatically determine the format you wish to have from the file extension.
Other options are .csv and .jsonlines.
That's the easy way. Otherwise you can write your own ItemExporter; take a look at the exporters documentation.
NB:
You don't even need to open the file during spider initiation; Scrapy will manage it by itself.
Just yield items and Scrapy will write them to the file automatically.
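In newer Scrapy versions (2.1+) the same export can also be configured in settings.py instead of on the command line; a minimal sketch, with filename.json standing in for whatever output name you want:

# settings.py
FEEDS = {
    "filename.json": {
        "format": "json",
        "encoding": "utf8",
    },
}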
Scrapy is most suitable for a one page -> one item schema.
What you want is to scrape all items in advance and then export them as a single list.
So you should have some variable like self.results, append new domains to it from every process_item() call, and then export it on the spider close event.
There's a shortcut for this signal, so you can just add:

def closed(self, reason):
    # write the self.results list to a JSON file

There is more documentation on the Spider.closed() method.
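A minimal sketch of that idea applied to the ChinazPipeline from the question (in a pipeline, close_spider plays the role of the close event; if you collect the results on the spider itself, Spider.closed() is the equivalent shortcut):

import json


class ChinazPipeline(object):

    def open_spider(self, spider):
        self.results = []  # all domains collected during the crawl

    def process_item(self, item, spider):
        self.results.append(item['domain'])
        return item

    def close_spider(self, spider):
        # runs once when the spider finishes; write everything as a single list
        with open('save.json', 'w', encoding='utf-8') as f:
            json.dump({'domain': self.results}, f, ensure_ascii=False)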
I've written a script in Python Scrapy to get different IDs and their corresponding names from a webpage. When I execute my script, I can see that the results come through correctly and I get a CSV file filled with data. I'm using Python 3.6, so when I go for Scrapy's built-in command (meant to write data to a CSV file), I always get a CSV file with blank lines in every alternate row. However, I tried the following to serve the purpose, and it does its job: it produces a CSV file without the blank-line issue.
My question: how can I close the CSV file when the job is done?
This is my try so far:
import scrapy, csv


class SuborgSpider(scrapy.Spider):
    name = "suborg"

    start_urls = ['https://www.un.org/sc/suborg/en/sanctions/1267/aq_sanctions_list/summaries?type=All&page={}'.format(page) for page in range(0,7)]

    def __init__(self):
        self.file = open("output.csv", "w", newline="")

    def parse(self, response):
        for item in response.xpath('//*[contains(@class,"views-table")]//tbody//tr'):
            idnum = item.xpath('.//*[contains(@class,"views-field-field-reference-number")]/text()').extract()[-1].strip()
            name = item.xpath('.//*[contains(@class,"views-field-title")]//span[@dir="ltr"]/text()').extract()[-1].strip()
            yield {'ID': idnum, 'Name': name}

            writer = csv.writer(self.file)
            writer.writerow([idnum, name])
You can close the actual file instead:
You can do this in the closed() method, which is called automatically when the spider is closed.
def closed(self, reason):
    self.file.close()