I've written a script in Python Scrapy to get different IDs and their corresponding names from a webpage. When I execute the script, the results come through correctly and I get a CSV file filled with data. I'm using Python 3.6, and when I use Scrapy's built-in command for writing data to a CSV file, I always get a CSV file with a blank line in every alternate row. So I tried the following instead, and it does the job: it produces a CSV file without the blank-line issue.
My question: how can I close the CSV file when the job is done?
This is my attempt so far:
import scrapy, csv

class SuborgSpider(scrapy.Spider):
    name = "suborg"
    start_urls = ['https://www.un.org/sc/suborg/en/sanctions/1267/aq_sanctions_list/summaries?type=All&page={}'.format(page) for page in range(0,7)]

    def __init__(self):
        self.file = open("output.csv", "w", newline="")

    def parse(self, response):
        for item in response.xpath('//*[contains(@class,"views-table")]//tbody//tr'):
            idnum = item.xpath('.//*[contains(@class,"views-field-field-reference-number")]/text()').extract()[-1].strip()
            name = item.xpath('.//*[contains(@class,"views-field-title")]//span[@dir="ltr"]/text()').extract()[-1].strip()
            yield {'ID': idnum, 'Name': name}

            writer = csv.writer(self.file)
            writer.writerow([idnum, name])
You can close the actual file instead:
You can do it in the closed() method, which is called automatically when the spider is closed.

def closed(self, reason):
    self.file.close()
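For reference, a minimal sketch of the whole pattern (open the file in __init__(), write rows in parse(), close the file in closed()); the file name and column handling are carried over from the question, and the parse body is abbreviated:

import csv
import scrapy

class SuborgSpider(scrapy.Spider):
    name = "suborg"

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Open the file once when the spider is instantiated.
        self.file = open("output.csv", "w", newline="")
        self.writer = csv.writer(self.file)

    def parse(self, response):
        # ... extract idnum and name as in the question, then:
        # self.writer.writerow([idnum, name])
        pass

    def closed(self, reason):
        # Called automatically by Scrapy once the spider finishes.
        self.file.close()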
Related
I'm attempting to scrape articles on 100 companies, and I want to save the content from the multiple articles to a separate CSV file for each company. I have the scraper and a CSV export pipeline built, and they work fine; however, the spider opens a new CSV file for each company (as it should) without closing the file opened for the previous company.
The CSV files do close after the spider closes, but because of the amount of data I'm scraping for each company, the file sizes are significant and strain my machine's memory. This can't realistically scale: if I increase the number of companies (something I eventually want to do), I will eventually hit an error for having too many files open at a time. Below is my CSV exporter pipeline. I would like to find a way to close the CSV file for the current company before moving on to the next company within the same spider:
I guess, theoretically, I could open the file for each article, write the content to new rows, then close it and reopen it for the next article, but that would slow the spider down significantly. I'd like to keep the file open for a given company while the spider is still working through that company's articles, then close it when the spider moves on to the next company.
I'm sure there is a solution, but I haven't been able to figure one out. I would greatly appreciate help solving this.
class PerTickerCsvExportPipeline:
    """Distribute items across multiple CSV files according to their 'ticker' field"""

    def open_spider(self, spider):
        self.ticker_to_exporter = {}

    def close_spider(self, spider):
        for exporter in self.ticker_to_exporter.values():
            exporter.finish_exporting()

    def _exporter_for_item(self, item):
        ticker = item['ticker']
        if ticker not in self.ticker_to_exporter:
            f = open('{}_article_content.csv'.format(ticker), 'wb')
            exporter = CsvItemExporter(f)
            exporter.start_exporting()
            self.ticker_to_exporter[ticker] = exporter
        return self.ticker_to_exporter[ticker]

    def process_item(self, item, spider):
        exporter = self._exporter_for_item(item)
        exporter.export_item(item)
        return item
The problem is probably that you keep all the ItemExporters and files open until the spider closes. I suggest you close the CsvItemExporter and the corresponding file for the previous company before you open a new one.
def open_spider(self, spider):
    self.ticker_to_exporter = {}
    self.files = []

def close_exporters(self):
    # Finish every open exporter, then drop them all.
    for exporter in self.ticker_to_exporter.values():
        exporter.finish_exporting()
    self.ticker_to_exporter.clear()

def close_files(self):
    # Close every open file, then drop them all.
    for f in self.files:
        f.close()
    self.files.clear()

def close_spider(self, spider):
    self.close_exporters()
    self.close_files()

def _exporter_for_item(self, item):
    ticker = item['ticker']
    if ticker not in self.ticker_to_exporter:
        # A new company: close the previous exporter and file first.
        self.close_exporters()
        self.close_files()
        # Append in binary mode, since CsvItemExporter expects a bytes file
        # and the same ticker may come around again later in the crawl.
        f = open('{}_article_content.csv'.format(ticker), 'ab')
        self.files.append(f)
        exporter = CsvItemExporter(f)
        exporter.start_exporting()
        self.ticker_to_exporter[ticker] = exporter
    return self.ticker_to_exporter[ticker]
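To actually use the pipeline it has to be enabled in the project's settings; a minimal sketch, assuming the pipeline lives in a hypothetical myproject/pipelines.py module:

# settings.py (the module path is an assumption; adjust it to your project layout)
ITEM_PIPELINES = {
    'myproject.pipelines.PerTickerCsvExportPipeline': 300,
}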
I don't know why this started happening recently. I have a function that opens a new text file, writes a URL to it, then closes it, but the file is not created immediately after f.close() is executed. The problem is that a later function, open_url(), needs to read a URL from that text file, but since nothing is there, my program errors out.
Ironically, after my program errors out and I stop it, the url.txt file is created, haha. Does anyone know why this is happening with Python's .write() call? Is there another way to create a text file and write a line of text to it faster?
@staticmethod
def write_url():
    if not path.exists('url.txt'):
        url = UrlObj().url
        print(url)
        with open('url.txt', 'w') as f:
            f.write(url)
            f.close
    else:
        pass

@staticmethod
def open_url():
    x = open('url.txt', 'r')
    y = x.read()
    return y
def main():
    scraper = Job()
    scraper.write_url()
    url = scraper.open_url()
    results = scraper.load_craigslist_url(url)
    scraper.kill()
    dictionary_of_listings = scraper.organizeResults(results)
    scraper.to_csv(dictionary_of_listings)

if __name__ == '__main__':
    main()
    scheduler = BlockingScheduler()
    scheduler.add_job(main, 'interval', hours=1)
    scheduler.start()
There is another class, UrlObj, that prompts the user to add attributes to a bare URL for Selenium to use. UrlObj().url gives you the URL, which is then written to the new text file. If url.txt already exists, the function passes and open_url() reads the URL from url.txt into the url variable, which is used to start the scraping.
Just found a workaround. If the file does not exist, return the URL so it can be fed directly to load_craigslist_url(). If the text file exists, just read from the text file.
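A minimal sketch of that workaround (the get_url name is mine; UrlObj is the helper class from the question):

from os import path

def get_url():
    # If url.txt is missing, build the URL, save it for later runs,
    # and return it directly so nothing has to be read back from disk.
    if not path.exists('url.txt'):
        url = UrlObj().url
        with open('url.txt', 'w') as f:
            f.write(url)
        return url
    # Otherwise reuse the URL stored by a previous run.
    with open('url.txt') as f:
        return f.read()

main() can then call scraper.load_craigslist_url(get_url()) without a separate read step.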
I am currently using Scrapy to crawl domains from different websites, and I wonder how to save my data to a local JSON file, formatted either as a list or as a dictionary with the key 'domain' and a list of domains as the value.
In the crawler file, the item is like this:
item['domain'] = 'xxx'.extract()
yield item
import json
import codecs

class ChinazPipeline(object):

    def __init__(self):
        self.file = codecs.open('save.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        line = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(line)
        return item
What I expect is:
{"domain": "['google.com', 'cnn.com', 'yahoo.com']"}
or simply save all the domains I crawled as a list in JSON; either way works for me.
It's rather simple. JSON is one of Scrapy's built-in exporters.
You can use it by turning on output to a JSON file:

scrapy runspider yourspider.py -o filename.json

Scrapy will automatically determine the format you want from the file extension.
Other options are .csv and .jsonlines.
That's the easy way. Otherwise you can write your own ItemExporter. Take a look at the exporters documentation.
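As a point of reference, here is a minimal sketch of a pipeline that delegates to Scrapy's stock JsonItemExporter instead of writing JSON lines by hand; the class name is mine, and the save.json filename is carried over from the question:

from scrapy.exporters import JsonItemExporter

class ChinazJsonPipeline(object):
    """Sketch: write all items to save.json as a single JSON array."""

    def open_spider(self, spider):
        # JsonItemExporter expects a file opened in binary mode.
        self.file = open('save.json', 'wb')
        self.exporter = JsonItemExporter(self.file)
        self.exporter.start_exporting()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

    def close_spider(self, spider):
        self.exporter.finish_exporting()
        self.file.close()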
NB:
You don't even need to open the file during spider initialization; Scrapy will manage it by itself.
Just yield items and Scrapy will write them to the file automatically.
Scrapy is most suitable for a one page -> one item schema.
What you want is to collect all items first and then export them as a single list.
So you should have a variable like self.results, append the new domains to it on every process_item() call, and then export it when the spider closes.
There's a shortcut for this signal, so you can just add:
def closed(self, reason):
    # Write the self.results list to a JSON file.

See the documentation on the Spider.closed() method for more.
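Putting that together, a minimal sketch of the collect-then-dump approach; the spider name, start URL, and selector are placeholders, and here the list is filled directly in parse() rather than in a pipeline:

import json
import scrapy

class DomainSpider(scrapy.Spider):
    name = "domains"
    start_urls = ["https://example.com/"]  # placeholder

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.results = []

    def parse(self, response):
        for domain in response.xpath('//a/@href').extract():  # placeholder selector
            self.results.append(domain)
            yield {'domain': domain}

    def closed(self, reason):
        # Dump everything collected during the crawl as one JSON object.
        with open('save.json', 'w', encoding='utf-8') as f:
            json.dump({'domain': self.results}, f, ensure_ascii=False)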
So I have a simple crawler that crawls 3 store-locator pages and parses the store locations to JSON. When I print(app_data['stores']), it prints all three pages of stores. However, when I try to write it out, I only get one of the three pages, at random, written to my JSON file. I'd like everything that streams in to be written to the file. Any help would be great. Here's the code:
import scrapy
import json
import js2xml
from pprint import pprint

class StlocSpider(scrapy.Spider):
    name = "stloc"
    allowed_domains = ["bestbuy.com"]
    start_urls = (
        'http://www.bestbuy.com/site/store-locator/11356',
        'http://www.bestbuy.com/site/store-locator/46617',
        'http://www.bestbuy.com/site/store-locator/77521',
    )

    def parse(self, response):
        js = response.xpath('//script[contains(.,"window.appData")]/text()').extract_first()
        jstree = js2xml.parse(js)
        # print(js2xml.pretty_print(jstree))
        app_data_node = jstree.xpath('//assign[left//identifier[@name="appData"]]/right/*')[0]
        app_data = js2xml.make_dict(app_data_node)
        print(app_data['stores'])
        for store in app_data['stores']:
            yield store
        with open('stores.json', 'w') as f:
            json.dump(app_data['stores'], f, indent=4)
You are opening the file for writing every time, but you want to append. Try changing the last part to this:
with open('stores.json', 'a') as f:
    json.dump(app_data['stores'], f, indent=4)
Where 'a' opens the file for appending.
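One caveat as a side note: appending a separate json.dump() per response produces several JSON documents concatenated in one file rather than a single valid JSON array. If a single array is needed, an alternative sketch is to collect the stores on the spider and dump them once when it closes; these methods would go on StlocSpider, and the parse body is abbreviated:

def __init__(self, *args, **kwargs):
    super().__init__(*args, **kwargs)
    self.all_stores = []

def parse(self, response):
    # ... build app_data as in the question ...
    self.all_stores.extend(app_data['stores'])
    for store in app_data['stores']:
        yield store

def closed(self, reason):
    # One write containing the stores from all three pages.
    with open('stores.json', 'w') as f:
        json.dump(self.all_stores, f, indent=4)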
I'm writing a script in Python and I'm trying to wrap my head around a problem. I have a URL that, when opened, downloads a document. I'm trying to write a Python script that opens the HTTPS URL that downloads this document and automatically sends that document to a server connection I have opened using Python's pysftp module.
I can't wrap my head around how to do this... Do you think I'd be able to just do:
server.put(urllib.open('https://......./document'))
EDIT:
This is the code I've tried before; the above doesn't work...
download_file = urllib2.urlopen('https://somewebsite.com/file.csv')
file_contents = download_file.read().replace('"', '')
columns = [x.strip() for x in file_contents.split(',')]

# Write Downloaded File Contents To New CSV File
with open('file.csv', 'wb') as f:
    writer = csv.writer(f)
    writer.writerow(columns)

# Upload New File To Server
srv.put('./file.csv', './SERVERFOLDER/file.csv')
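A minimal Python 3 sketch of the download-then-upload flow, assuming srv is an already-open pysftp.Connection and reusing the URL and remote path from above (urllib.request stands in for the urllib2 call):

import urllib.request

# Download the document to a local file first...
local_path = 'file.csv'
urllib.request.urlretrieve('https://somewebsite.com/file.csv', local_path)

# ...then push that local file to the server over SFTP.
srv.put(local_path, './SERVERFOLDER/file.csv')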
ALSO:
How would I go about getting a FILE that is ONE DAY old from the server (examining the age of each file)... using paramiko?
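For the file-age part, a hedged sketch using a paramiko SFTP client; sftp, the remote directory, and the exact age cutoff are assumptions:

import time

ONE_DAY = 24 * 60 * 60

# List remote files with their attributes and keep those whose
# modification time is roughly one day ago or older.
now = time.time()
old_files = [
    attr.filename
    for attr in sftp.listdir_attr('./SERVERFOLDER')
    if now - attr.st_mtime >= ONE_DAY
]
print(old_files)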