So I'm building a scraper that imports a .csv file (exported from Excel) which has one row of ~2,400 websites (each website is in its own column) and uses these as the start_urls. I keep getting an error saying that I am passing in a list and not a string. I think this is because my list basically just contains one really long list that represents the row. How can I overcome this and put each website from my .csv as its own separate string within the list?
raise TypeError('Request url must be str or unicode, got %s:' % type(url).__name__)
exceptions.TypeError: Request url must be str or unicode, got list:
import scrapy
from scrapy.selector import HtmlXPathSelector
from scrapy.http import HtmlResponse
from tutorial.items import DanishItem
from scrapy.http import Request
import csv

with open('websites.csv', 'rbU') as csv_file:
    data = csv.reader(csv_file)
    scrapurls = []
    for row in data:
        scrapurls.append(row)

class DanishSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = []
    start_urls = scrapurls

    def parse(self, response):
        for sel in response.xpath('//link[@rel="icon" or @rel="shortcut icon"]'):
            item = DanishItem()
            item['website'] = response
            item['favicon'] = sel.xpath('./@href').extract()
            yield item
Thanks!
Joey
Just generating a list for start_urls does not work, as is clearly written in the Scrapy documentation.
From documentation:
You start by generating the initial Requests to crawl the first URLs, and specify a callback function to be called with the response downloaded from those requests.
The first requests to perform are obtained by calling the start_requests() method which (by default) generates Request for the URLs specified in the start_urls and the parse method as callback function for the Requests.
I would rather do it in this way:
def get_urls_from_csv():
    with open('websites.csv', 'rbU') as csv_file:
        data = csv.reader(csv_file)
        scrapurls = []
        for row in data:
            # each row is a list of cells; add every cell (URL) individually
            # instead of appending the whole row as one element
            scrapurls.extend(row)
        return scrapurls


class DanishSpider(scrapy.Spider):

    ...

    def start_requests(self):
        return [scrapy.http.Request(url=start_url) for start_url in get_urls_from_csv()]
I find the following useful when needed:
import csv
import scrapy

class DanishSpider(scrapy.Spider):
    name = "rei"

    with open("output.csv", "r") as f:
        reader = csv.DictReader(f)
        start_urls = [item['Link'] for item in reader]

    def parse(self, response):
        yield {"link": response.url}
Try opening the .csv file inside the class (not outside, as you have done before) and appending to start_urls. This solution worked for me. Hope this helps :-)
class DanishSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = []
    start_urls = []

    f = open('websites.csv', 'r')
    for i in f:
        u = i.split('\n')
        start_urls.append(u[0])
for row in data:
    scrapurls.append(row)
Here row is a list: [column1, column2, ...].
So I think you need to extract the columns and append them to your start_urls.
for row in data:
    # if every column in the row holds a url string
    for column in row:
        scrapurls.append(column)
Try this way as well:
filee = open("filename.csv", "r+")

# Remove the '\n' newline character from each url
r = [i for i in filee]
start_urls = [r[j].replace('\n', '') for j in range(len(r))]
Related
I have a csv file which contains the IMDb movieIDs of 300 movies. The IMDb URL for each movie is of the format: https://www.imdb.com/title/ttmovieID
I want to scrape each movie's dedicated page for the thumbnail image link, title, actors and year of release, and write it to a csv file where each row contains the data for one movie.
Since I have the movieID for each movie in a csv file, what should the start_urls of my spider be, and what should the structure of my parse function look like? Also, how do I write the results to a csv file?
I have the following approach for the IMDb top 250 page. What changes should I make to start_urls and links?
import scrapy
import csv
from example.items import MovieItem

class ImdbSpider(scrapy.Spider):
    name = "imdbtestspider"
    allowed_domains = ["imdb.com"]
    start_urls = ['http://www.imdb.com/chart/top',]

    def parse(self, response):
        links = response.xpath('//tbody[@class="lister-list"]/tr/td[@class="titleColumn"]/a/@href').extract()
        i = 1
        for link in links:
            abs_url = response.urljoin(link)
            url_next = '//*[@id="main"]/div/span/div/div/div[2]/table/tbody/tr[' + str(i) + ']/td[3]/strong/text()'
            rating = response.xpath(url_next).extract()
            if(i <= len(links)):
                i = i + 1
            yield scrapy.Request(abs_url, callback=self.parse_indetail, meta={'rating': rating})

    def parse_indetail(self, response):
        item = MovieItem()
        item['title'] = response.xpath('//div[@class="title_wrapper"]/h1/text()').extract()[0][:-1]
        item['director'] = response.xpath('//div[@class="credit_summary_item"]/span[@itemprop="director"]/a/span/text()').extract()
        return item
You could just read your .csv file in the start_requests method and yield the requests from there. The code could be something like:
import csv
from scrapy import Request
...

    def start_requests(self):
        with open('imdb_ids.csv') as csv_file:
            ids = csv.reader(csv_file, delimiter=',')
            line = 0
            for row in ids:
                if line > 0:  # skip the header line
                    # row is a list; the movieID is assumed to be in the first column
                    yield Request('https://www.imdb.com/title/tt' + row[0])
                line += 1
I'm new to all this. I managed to crawl through 3600+ items on a page and extract data such as name, address, phone and mail, all of which I wrote to a .csv file.
My excitement was cut short when I discovered that some of the distributors had missing information (information that is written on the website) and had been written incorrectly to the .csv. Furthermore, some blank columns (like 'B') were created.
Also, I couldn't find a way to stop the square brackets and the apostrophes from being written, but I can easily erase them all with LibreOffice Calc.
(In my code I only pasted a few urls out of 3600+, including the ones in the attached picture that show the problem)
import scrapy
import requests
import csv

class QuotesSpider(scrapy.Spider):
    name = "final"

    def start_requests(self):
        urls = [
            'https://www.bosch-professional.com/ar/es/dl/localizador-de-distribuidores/localizador-de-distribuidores/distribuidor/boschla01586/zarate/bodelon-edgardo-aristides/?countrySelectorCode=AR',
            'https://www.bosch-professional.com/ar/es/dl/localizador-de-distribuidores/localizador-de-distribuidores/distribuidor/boschla01778/zarate/cesario-mariano-rodrigo/?countrySelectorCode=AR',
            'https://www.bosch-professional.com/ar/es/dl/localizador-de-distribuidores/localizador-de-distribuidores/distribuidor/boschla00140/zarate/de-vicenzi-elio-mario-g.-rosana-sh/?countrySelectorCode=AR',
            'https://www.bosch-professional.com/ar/es/dl/localizador-de-distribuidores/localizador-de-distribuidores/distribuidor/boschla01941/zarate/de-vincenzi-elio-mario-y-rosana-sh/?countrySelectorCode=AR',
            'https://www.bosch-professional.com/ar/es/dl/localizador-de-distribuidores/localizador-de-distribuidores/distribuidor/boschla02168/zarate/ferreterias-indufer-s.a./?countrySelectorCode=AR',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        marca = []
        names = []
        direcc = []
        locali = []
        telef = []
        mail = []
        site = []
        for item in response.css('div.item-content'):
            marca.append('Bosch')
            names.append(item.css('p.item-name::text').extract())
            lista_direcc = item.css('p.item-address::text').extract()
            direcc.append(lista_direcc[0].strip())
            locali.append(lista_direcc[1].strip())
            telef.append(item.css('a.btn-phone.trackingElement.trackingTeaser::text').extract())
            mail.append(item.css('a.btn-email.trackingElement.trackingTeaser::text').extract())
            site.append(item.css('a.btn-website.trackingElement.trackingTeaser::text').extract())
        with open('base.csv', 'a') as csvFile:
            fieldnames = ['Empresa', 'Nombres', 'Dirección', 'Localidad', 'Teléfono', 'Mail', 'Sitio Web']
            writer = csv.DictWriter(csvFile, fieldnames=fieldnames)
            writer.writerow({'Empresa': marca, 'Nombres': names, 'Dirección': direcc, 'Localidad': locali, 'Teléfono': telef, 'Mail': mail, 'Sitio Web': site})
        csvFile.close()
You can see an example of what I'm talking about. The program created several extra columns and in some cases shifted the data one column to the left.
I assume that the solution to this is quite simple, as all my previous questions have been. But yet it's puzzling me.
So thanks a lot for any help and for tolerating my poor English. Cheers!
Firstly, use the built-in CSV feed exporter rather than your own CSV-writing method. In other words, yield the item instead and let Scrapy handle the CSV.
Secondly, don't write lists to the CSV. That is why you get [[ and [ in the output, and it is likely also the reason for the extra columns, since the lists introduce unnecessary commas into the output.
Another point: you do not need to implement start_requests(). You can just specify your URLs in a start_urls attribute.
Here is an example:
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "final"
    start_urls = [
        # ...
    ]

    def parse(self, response):
        for item in response.css('div.item-content'):
            lista_direcc = item.css('p.item-address::text').getall()
            yield {
                'Empresa': 'Bosch',
                'Nombres': item.css('p.item-name::text').get(),
                'Dirección': lista_direcc[0].strip(),
                'Localidad': lista_direcc[1].strip(),
                'Teléfono': item.css('a.btn-phone.trackingElement.trackingTeaser::text').get(),
                'Mail': item.css('a.btn-email.trackingElement.trackingTeaser::text').get(),
                'Sitio Web': item.css('a.btn-website.trackingElement.trackingTeaser::text').get(),
            }
As mentioned by @Gallaecio in the comments below, it is better to use get() instead of extract() when you expect a single item (and it is the preferred usage nowadays). Read more here: https://docs.scrapy.org/en/latest/topics/selectors.html#extract-and-extract-first
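A quick illustration of the difference, using the same item selector as in the example above (the extracted values are purely hypothetical):
# .extract() / .getall() always return a list of strings
names = item.css('p.item-name::text').getall()   # e.g. ['Some Distributor SA']
# .get() / .extract_first() return the first match as a plain string, or None
name = item.css('p.item-name::text').get()       # e.g. 'Some Distributor SA'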
To get the CSV you can run:
scrapy runspider spidername.py -o output.csv
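Alternatively, if you are on Scrapy 2.1 or later, you can keep the CSV export with the spider itself via the FEEDS setting instead of passing -o on the command line; a minimal sketch:
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "final"
    # Equivalent to running with `-o output.csv` (Scrapy 2.1+)
    custom_settings = {
        'FEEDS': {
            'output.csv': {'format': 'csv'},
        },
    }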
My code for scraping data from the Alibaba website:
import scrapy

class IndiamartSpider(scrapy.Spider):
    name = 'alibot'
    allowed_domains = ['alibaba.com']
    start_urls = ['https://www.alibaba.com/showroom/acrylic-wine-box_4.html']

    def parse(self, response):
        Title = response.xpath('//*[@class="title three-line"]/a/@title').extract()
        Price = response.xpath('//div[@class="price"]/b/text()').extract()
        Min_order = response.xpath('//div[@class="min-order"]/b/text()').extract()
        Response_rate = response.xpath('//i[@class="ui2-icon ui2-icon-skip"]/text()').extract()

        for item in zip(Title, Price, Min_order, Response_rate):
            scraped_info = {
                'Title': item[0],
                'Price': item[1],
                'Min_order': item[2],
                'Response_rate': item[3]
            }
            yield scraped_info
Notice the start URL: it only scrapes the given URL, but I want this code to scrape all the URLs present in my csv file. My csv file contains a large number of URLs.
Sample of the data.csv file:
'https://www.alibaba.com/showroom/shock-absorber.html',
'https://www.alibaba.com/showroom/shock-wheel.html',
'https://www.alibaba.com/showroom/shoes-fastener.html',
'https://www.alibaba.com/showroom/shoes-women.html',
'https://www.alibaba.com/showroom/shoes.html',
'https://www.alibaba.com/showroom/shoulder-long-strip-bag.html',
'https://www.alibaba.com/showroom/shower-hair-band.html',
...........
How do I import all the links from the csv file into the code at once?
To loop through a file correctly without loading all of it into memory you should use generators, since both file objects and the start_requests method in Python/Scrapy are generators:
from scrapy import Spider, Request

class MySpider(Spider):
    name = 'csv'

    def start_requests(self):
        with open('file.csv') as f:
            for line in f:
                if not line.strip():
                    continue
                # strip the trailing newline so the URL is a clean string
                yield Request(line.strip())
To explain further:
The Scrapy engine uses start_requests to generate requests as it goes. It will keep generating requests until the concurrent request limit is reached (controlled by settings like CONCURRENT_REQUESTS).
Also worth noting is that by default Scrapy crawls depth-first: newer requests take priority, so the start_requests loop will be the last to finish.
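If you ever need to tune that behaviour, both knobs mentioned above are ordinary settings. A hedged sketch (the values shown are the documented defaults/examples, not a recommendation):
# settings.py (or custom_settings on the spider)
CONCURRENT_REQUESTS = 16   # default; caps how many requests are in flight at once

# Switch from the default depth-first (LIFO) crawl order to breadth-first (FIFO),
# as described in the Scrapy FAQ:
DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'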
You're almost there already. The only change is in start_urls, which you want to be "all the urls in the *.csv file." The following code easily implements that change.
with open('data.csv') as file:
    start_urls = [line.strip() for line in file]
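For completeness, here is how those two lines might sit inside the spider from the question. This is only a sketch: the class and file names are taken from the question above, and the extra strip("',") is there only because the sample data.csv shown above wraps each URL in quotes with a trailing comma.
import scrapy

class IndiamartSpider(scrapy.Spider):
    name = 'alibot'
    allowed_domains = ['alibaba.com']

    # Runs once at class-definition time; start_urls becomes a class attribute.
    # strip("',") removes the surrounding quotes and trailing comma seen in the
    # sample data.csv above; drop it if your file contains bare URLs.
    with open('data.csv') as f:
        start_urls = [line.strip().strip("',") for line in f if line.strip()]

    def parse(self, response):
        # ... same parsing logic as in the question ...
        pass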
Let us assume you have stored the URL list in the form of a dataframe and you want to loop over each URL present inside the dataframe. My approach, which worked for me, is given below.
import pandas as pd
import scrapy

class IndiamartSpider(scrapy.Spider):
    name = 'alibot'
    #allowed_domains = ['alibaba.com']
    #start_urls = ['https://www.alibaba.com/showroom/acrylic-wine-box_4.html']

    def start_requests(self):
        # fileContainingUrls.csv is a csv file which has a column named 'URLS'
        # containing all the urls which you want to loop over.
        df = pd.read_csv('fileContainingUrls.csv')
        urlList = df['URLS'].to_list()
        for i in urlList:
            yield scrapy.Request(url=i, callback=self.parse)

    def parse(self, response):
        Title = response.xpath('//*[@class="title three-line"]/a/@title').extract()
        Price = response.xpath('//div[@class="price"]/b/text()').extract()
        Min_order = response.xpath('//div[@class="min-order"]/b/text()').extract()
        Response_rate = response.xpath('//i[@class="ui2-icon ui2-icon-skip"]/text()').extract()

        for item in zip(Title, Price, Min_order, Response_rate):
            scraped_info = {
                'Title': item[0],
                'Price': item[1],
                'Min_order': item[2],
                'Response_rate': item[3]
            }
            yield scraped_info
I have a list of NPIs for which I want to scrape the provider names from npidb.org.
The NPI values are stored in a csv file.
I am able to do it manually by pasting the URLs into the code. However, I am unable to figure out how to do it when I have a list of NPIs and want the provider name for each of them.
Here is my current code:
import scrapy
from scrapy.spider import BaseSpider

class MySpider(BaseSpider):
    name = "npidb"

    def start_requests(self):
        urls = [
            'https://npidb.org/npi-lookup/?npi=1366425381',
            'https://npidb.org/npi-lookup/?npi=1902873227',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-1]
        filename = 'npidb-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)
Assuming you have a list of NPIs from the csv file, you can simply use format to build the website address as follows (I also added the part that gets the list from the csv file; if you have it already, you can omit that part):
def start_requests(self):
    # get npis from csv file
    npis = []
    with open('test.csv', 'r') as f:
        for line in f.readlines():
            l = line.strip()
            npis.append(l)
    # generate the list of addresses depending on npi
    start_urls = []
    for npi in npis:
        start_urls.append('https://npidb.org/npi-lookup/?npi={}'.format(npi))
    for url in start_urls:
        yield scrapy.Request(url=url, callback=self.parse)
Well, it depends on the structure of your csv file, but if it contains the NPIs on separate lines, you could do something like:
def start_requests(self):
    with open('npis.csv') as f:
        for line in f:
            yield scrapy.Request(
                url='https://npidb.org/npi-lookup/?npi={}'.format(line.strip()),
                callback=self.parse
            )
So my question is: how do I tell Scrapy to crawl URLs that differ only by one string? For example: https://www.youtube.com/watch?v=STRING
I have the strings saved in a txt file.
with open("plz_nummer.txt") as f:
cityZIP = f.read().rsplit('\n')
for a in xrange(0,len(cityZIP)):
next_url = 'http://www.firmenfinden.de/?txtPLZ=' + cityZIP[a] + '&txtBranche=&txtKunden='
pass
I would make loading the file with the zip codes part of the start_requests method, as a generator. Something along the lines of:
import scrapy

class ZipSpider(scrapy.Spider):
    name = "zipCodes"
    city_zip_list = []

    def start_requests(self):
        with open("plz_nummer.txt") as f:
            self.city_zip_list = f.read().rsplit('\n')
        for city_zip in self.city_zip_list:
            url = 'http://www.firmenfinden.de/?txtPLZ={}&txtBranche=&txtKunden='.format(city_zip)
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # Anything else you need
        # to do in here
        pass
This should give you a good starting point. Also read this article: https://doc.scrapy.org/en/1.1/intro/tutorial.html