I'm new to Scrapy. I have read several discussions about this tool. I have a problem exporting CSV files: I'm scraping numeric values that contain commas, and the default separator of the CSV exporter is also a comma, so I have problems when I open the resulting file in Excel.
How can I change the default delimiter of CSV files in Scrapy to a semicolon? I read some discussions about this issue, but I don't know what code I have to add or where.
Thanks in advance!
scraper/exporters.py
from scrapy.exporters import CsvItemExporter

class CsvCustomSeperator(CsvItemExporter):

    def __init__(self, *args, **kwargs):
        kwargs['encoding'] = 'utf-8'
        kwargs['delimiter'] = '╡'
        super(CsvCustomSeperator, self).__init__(*args, **kwargs)
scraper/settings.py
FEED_EXPORTERS = {
    'csv': 'scraper.exporters.CsvCustomSeperator'
}
This solution worked for me.
You should check if quotechar is enabled and set in your export.
https://doc.scrapy.org/en/latest/topics/spiders.html?highlight=CSV_DELIMITER#csvfeedspider-example
Usually text is enclosed in quotes ("), so a delimiter appearing inside the text is not an issue.
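For example, building on the custom exporter approach shown above, quoting can be forced for every field. This is a minimal sketch: the class name is illustrative, and it assumes CsvItemExporter forwards extra keyword arguments to csv.writer (register it in FEED_EXPORTERS the same way as above).

import csv
from scrapy.exporters import CsvItemExporter

class QuotedCsvItemExporter(CsvItemExporter):

    def __init__(self, *args, **kwargs):
        kwargs['delimiter'] = ';'          # semicolon output, as the question asks
        kwargs['quotechar'] = '"'
        kwargs['quoting'] = csv.QUOTE_ALL  # quote every field, so commas in values are safe
        super(QuotedCsvItemExporter, self).__init__(*args, **kwargs)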
try this:
scrapy crawl yourCrawlerName -o output.csv --set delimiter=";"
Related
I am currently using Scrapy to crawl domains from different websites, and I wonder how to save my data in a local JSON file, formatted either as a list or as a dictionary with the key 'domain' and a list of domains as the value.
In the crawler file, the item is like this:
item['domain'] = 'xxx'.extract()
yield item
import json
import codecs

class ChinazPipeline(object):

    def __init__(self):
        self.file = codecs.open('save.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        line = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(line)
        return item
What I expect is:
{"domain": "['google.com', 'cnn.com', 'yahoo.com']"}
or just simply save all domains that I crawled as a list in json, either way works for me.
It's rather simple: JSON is a default Scrapy exporter.
You can use it by turning on output to JSON file:
scrapy runspider yourspider.py -o filename.json
Scrapy will automatically determine the format you wish to have from the file extension.
Other options are .csv and .jsonlines.
It's an easy way. Otherwise you can write your own ItemExporter. Take a look at the exporters documentation.
NB:
You don't even need to open the file during spider initiation; Scrapy will manage it by itself.
Just yield items and Scrapy will write them to the file automatically.
Scrapy is most suitable for a one page -> one item schema.
What you want is to scrape all items first and then export them as a single list.
So you should keep a variable like self.results, append new domains to it on every process_item() call, and then export it on the spider close event, as sketched below.
There's a shortcut for this signal, so you can just add:

def closed(self, reason):
    # write self.results list to JSON file.
More documentation on Spider.closed() method.
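A minimal sketch of that approach, written as a spider using the closed() shortcut; the spider name, start URL, extraction XPath, self.results, and the output filename are all illustrative and need to be adapted:

import json
import scrapy

class DomainSpider(scrapy.Spider):
    name = 'domains'
    start_urls = ['http://example.com']  # replace with the real start pages

    def __init__(self, *args, **kwargs):
        super(DomainSpider, self).__init__(*args, **kwargs)
        self.results = []

    def parse(self, response):
        # replace with the real extraction logic
        for domain in response.xpath('//a/@href').extract():
            self.results.append(domain)

    def closed(self, reason):
        # called automatically when the spider finishes; dump all domains at once
        with open('save.json', 'w') as f:
            json.dump({'domain': self.results}, f, ensure_ascii=False)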
I've written a script in Python Scrapy to get different IDs and their corresponding names from a webpage. When I execute my script, I can see that the results are coming through correctly and I get a data-filled CSV file. I'm using Python 3.6, so when I go for Scrapy's built-in command (meant to write data to a CSV file), I always get a CSV file with blank lines in every alternate row. However, I tried the following to serve the purpose and it does its job. Now it produces a CSV file without the blank-line issue.
My question: how can I close the csv file when the job is done?
This is my try so far:
import scrapy, csv

class SuborgSpider(scrapy.Spider):
    name = "suborg"

    start_urls = ['https://www.un.org/sc/suborg/en/sanctions/1267/aq_sanctions_list/summaries?type=All&page={}'.format(page) for page in range(0,7)]

    def __init__(self):
        self.file = open("output.csv", "w", newline="")

    def parse(self, response):
        for item in response.xpath('//*[contains(@class,"views-table")]//tbody//tr'):
            idnum = item.xpath('.//*[contains(@class,"views-field-field-reference-number")]/text()').extract()[-1].strip()
            name = item.xpath('.//*[contains(@class,"views-field-title")]//span[@dir="ltr"]/text()').extract()[-1].strip()
            yield {'ID': idnum, 'Name': name}
            writer = csv.writer(self.file)
            writer.writerow([idnum, name])
You can close the actual file instead:
You can call it in the closed() method which is automatically called when the spider is closed.
def closed(self, reason):
    self.file.close()
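For context, a minimal sketch of where that method lives, reusing the spider from the question; Scrapy calls closed() automatically once the crawl finishes:

import scrapy

class SuborgSpider(scrapy.Spider):
    name = "suborg"

    def __init__(self):
        self.file = open("output.csv", "w", newline="")

    # ... parse() as in the question ...

    def closed(self, reason):
        # runs once, after the last request has been processed
        self.file.close()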
Related to
How to export text from all pages of a MediaWiki?, but I want the output to be individual text files named using the page title.
SELECT page_title, page_touched, old_text
FROM revision,page,text
WHERE revision.rev_id=page.page_latest
AND text.old_id=revision.rev_text_id;
works to dump all pages into stdout in one go.
How to split them and dump into individual files?
SOLVED
First dump into one single file:
SELECT page_title, page_touched, old_text
FROM revision,page,text
WHERE revision.rev_id=page.page_latest AND text.old_id=revision.rev_text_id AND page_namespace!='6' AND page_namespace!='8' AND page_namespace!='12'
INTO OUTFILE '/tmp/wikipages.csv'
FIELDS TERMINATED BY '\n'
ESCAPED BY ''
LINES TERMINATED BY '\n######################################\n';
Then split it into individual files, using Python:
with open('wikipages.csv', 'r') as f:
    alltxt = f.read().split('\n######################################\n')

for row in alltxt:
    one = row.split('\n')
    name = one[0].replace('/', '-')
    try:
        # drop the page_title and page_touched lines, keeping only the text
        del one[0]
        del one[0]
    except IndexError:
        continue
    txt = '\n'.join(one)
    of = open('/tmp/wikipages/' + name + '.txt', 'w')
    of.write(txt)
    of.close()
In case you have some Python knowledge, you can use the mwclient library to achieve this:
install Python 2.7: sudo apt-get install python2.7 (see https://askubuntu.com/questions/101591/how-do-i-install-python-2-7-2-on-ubuntu in case of trouble)
install mwclient via pip install mwclient
run the Python script below
import mwclient

wiki = mwclient.Site(('http', 'you-wiki-domain.com'), '/')
for page in wiki.Pages:
    file = open(page.page_title, 'w')
    file.write(page.text())
    file.close()
see mwclient page https://github.com/mwclient/mwclient for reference
From MediaWiki version 1.35, the multi-content revision model has been implemented, so the original dump code won't work correctly. Instead, you can use the following code:
SELECT page_title, page_touched, old_text
FROM revision,page,text,content,slots
WHERE page.page_latest=revision.rev_id
AND revision.rev_id=slots.slot_revision_id
AND slots.slot_content_id=convert(substring(content.content_address,4),int)
AND convert(substring(content.content_address,4),int)=text.old_id
AND page_namespace!='6' AND page_namespace!='8' AND page_namespace!='12'
INTO OUTFILE '/var/tmp/wikipages.csv'
FIELDS TERMINATED BY '\n'
ESCAPED BY ''
LINES TERMINATED BY '\n######################################\n';
I am new to Django and Python and was wondering how I could read a tab-delimited file into an HTML table by modifying my views.py, returning the separate columns as a variable, passing that variable through params, and then changing my template.html page.
So, for example:
def index(request):
    myfile = open('filename.txt')
    for row in myfile:
        columns = row.rstrip().split('\t')
    params = {
        "first": columns[0]
    }
    return render(request, 'index.html', params)
Something of this sort; any help is greatly appreciated.
My first recommendation is to use with to open files:
with open('filename.ext', 'mode') as f:
Using with automagically closes your file for you so you don't have to explicitly do so :)
Second, please visit: http://docs.python.org/3.3/library/csv.html#examples for a great explanation from the source.
The first example demos how to read your CSV file!
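Putting both together, here is a minimal sketch of the view the question describes; the file name data.txt, the template name index.html, and the context key rows are all illustrative and need to match your project:

import csv
from django.shortcuts import render

def index(request):
    rows = []
    with open('data.txt', newline='') as f:        # closed automatically when the block ends
        reader = csv.reader(f, delimiter='\t')     # tab-delimited input
        for row in reader:
            rows.append(row)
    # Iterate over `rows` in index.html to build the table, e.g. {% for row in rows %}.
    return render(request, 'index.html', {'rows': rows})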
I'm using the python-docx module to do some edits on a large number of documents. They all contain a header in which I need to replace a number, but every time I do this the document won't open, with an error that the content is unreadable. Does anyone have any ideas as to why this is happening, or sample working code snippets? Thanks.
from docx import *
#document = yourdocument.docx
filename = "NUR-ADM-2001"
relationships = relationshiplist()
document = opendocx("C:/Users/ai/My Documents/Nursing docs/" + filename + ".docx")
docbody = document.xpath('/w:document/w:body',namespaces=nsprefixes)[0]
advReplace(docbody, "NUR-NPM 101", "NUR-NPM 202")
# Create our properties, contenttypes, and other support files
coreprops = coreproperties(title='Nursing Doc', subject='Policies', creator='IA', keywords=['Policy'])
appprops = appproperties()
contenttypes = contenttypes()
websettings = websettings()
wordrelationships = wordrelationships(relationships)
# Save our document
savedocx(document,coreprops,appprops,contenttypes,websettings, wordrelationships,"C:/Users/ai/My Documents/Nursing docs/" + filename + ".docx")
Edit: So it eventually can open the document, but it says some content cannot be displayed and the headers have vanished... thoughts?
I don't know this module, but in general you should not edit a file in place. Open file "A", write file "/tmp/A". Close both files and make sure you have no errors, then move "/tmp/A" to "A". Otherwise you risk clobbering your file if something goes wrong during the write.
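A minimal sketch of that pattern applied to the question's savedocx() call; the temporary-file handling is the addition, everything else reuses the question's variables as already set up above:

import shutil

target = "C:/Users/ai/My Documents/Nursing docs/" + filename + ".docx"
tmp = target + ".tmp"

# Write to a temporary file first...
savedocx(document, coreprops, appprops, contenttypes, websettings,
         wordrelationships, tmp)

# ...and only replace the original once the save finished without errors.
shutil.move(tmp, target)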