Python/Scrapy: how to go into a deeper link and go back

I am trying to scrape information about every firm from this website: www.canadianlawlist.com
I have finished most of it, but I am running into a small problem.
I am trying to get the results to display in the following order:
-Firm Name and Information
*Employees from that firm
But instead I am getting very random results. It will scrape the information about 2 firms and then scrape the information about the employees, like this:
-Firm Name and Information
-Firm name and information
*Employee from Firm 1
-Firm name and information
*Employee from Firm 2
It goes something like that. I am not sure what I am missing in my code:
def parse_after_submit(self, response):
    basicurl = "canadianlawlist.com/"
    products = response.xpath('//*[@class="searchresult_item_regular"]/a/@href').extract()
    for p in products:
        url = "http://canadianlawlist.com" + p
        yield scrapy.Request(url, callback=self.parse_firm_info)
    # process next page
    # for x in range(2, 6):
    #     next_page_url = "https://www.canadianlawlist.com/searchresult?searchtype=firms&city=montreal&page=" + str(x)

def parse_firm_info(self, response):
    name = response.xpath('//div[@class="listingdetail_companyname"]/h1/span/text()').extract_first()
    print name
    for info in response.xpath('//*[@class="listingdetail_contactinfo"]'):
        street_address = info.xpath('//div[@class="listingdetail_contactinfo"]/div[1]/span/div/text()').extract_first()
        city = info.xpath('//*[@itemprop="addressLocality"]/text()').extract_first()
        province = info.xpath('//*[@itemprop="addressRegion"]/text()').extract_first()
        postal_code = info.xpath('//*[@itemprop="postalCode"]/text()').extract_first()
        telephone = info.xpath('//*[@itemprop="telephone"]/text()').extract_first()
        fax_number = info.xpath('//*[@itemprop="faxNumber"]/text()').extract_first()
        email = info.xpath('//*[@itemprop="email"]/text()').extract_first()
        print street_address
        print city
        print province
        print postal_code
        print telephone
        print fax_number
        print email
    for people in response.xpath('//div[@id="main_block"]/div[1]/div[2]/div[2]'):
        pname = people.xpath('//*[@class="listingdetail_individual_item"]/h3/a/text()').extract()
        print pname
    basicurl = "canadianlawlist.com/"
    employees = response.xpath('//*[@class="listingdetail_individual_item"]/h3/a/@href').extract()
    for e in employees:
        url2 = "http://canadianlawlist.com" + e
        yield scrapy.Request(url2, callback=self.parse_employe_info)

def parse_employe_info(self, response):
    ename = response.xpath('//*[@class="listingdetail_individualname"]/h1/span/text()').extract_first()
    job_title = response.xpath('//*[@class="listingdetail_individualmaininfo"]/div/i/span/text()').extract_first()
    print ename
    print job_title

You cannot rely on the order of Python's print output when it comes to concurrent programming. If you care about the order of your standard output, you need to use the logging module.
Scrapy has a shortcut for that in the Spider class:
import scrapy
import logging

class MySpider(scrapy.Spider):

    def parse(self, response):
        self.log("first message", level=logging.INFO)
        self.log("second message", level=logging.INFO)

Scrapy runs multiple requests at the same time, so the content displayed on the console can correspond to any of the requests running in parallel.
You can go to settings.py and set
CONCURRENT_REQUESTS = 1
Now only one request will be launched at a time, so your console will show the data in a meaningful order, but this will make the scraping slower.
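If the goal is really to keep each employee tied to its firm (rather than to force the console output into order), another option is to carry the firm data along in the request's meta so the employee callback can yield both together. A minimal sketch against the question's spider, reusing its XPaths; only the firm-name and city fields are shown, so treat it as an illustration rather than the poster's full code:

def parse_firm_info(self, response):
    firm = {
        'name': response.xpath('//div[@class="listingdetail_companyname"]/h1/span/text()').extract_first(),
        'city': response.xpath('//*[@itemprop="addressLocality"]/text()').extract_first(),
    }
    for href in response.xpath('//*[@class="listingdetail_individual_item"]/h3/a/@href').extract():
        # hand the firm dict to the employee callback so the pairing survives
        # no matter in which order Scrapy schedules the requests
        yield scrapy.Request("http://canadianlawlist.com" + href,
                             callback=self.parse_employe_info,
                             meta={'firm': firm})

def parse_employe_info(self, response):
    firm = response.meta['firm']
    yield {
        'firm_name': firm['name'],
        'firm_city': firm['city'],
        'employee_name': response.xpath('//*[@class="listingdetail_individualname"]/h1/span/text()').extract_first(),
    }

With the firm attached to every employee item, CONCURRENT_REQUESTS can stay at its default and the crawl keeps its speed.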

Related

Scrapy list of links

I am building a spider with Scrapy. I want to access every item in a list and then scrape all the data inside each link, but when I run the spider it doesn't scrape the data. What am I missing?
import scrapy
from scrapy.linkextractors import LinkExtractor

from ..items import JobscraperItem


class JobscraperSpider(scrapy.Spider):
    name = 'jobspider'
    start_urls = ['https://cccc/bolsa/ofertas?oferta=&lugar=&categoria=']

    def parse(self, response):
        job_detail = response.xpath('//div[@class="list"]/div/a')
        yield from response.follow_all(job_detail, self.parse_jobspider)

    def parse(self, response):
        items = JobscraperItem()
        job_title = response.xpath('//h1/text()').extract()
        company = response.xpath('//h2/b/text()').extract()
        company_url = response.xpath('//div[@class="pull-left"]/a/text()').extract()
        description = response.xpath('//div[@class="aviso"]/text()').extract()
        salary = response.xpath('//div[@id="aviso"]/p[1]/text()').extract()
        city = response.xpath('//div[@id="aviso"]/p[2]/text()').extract()
        district = response.xpath('//div[@id="aviso"]/p[5]/text()').extract()
        publication_date = response.xpath('//div[@id="publicado"]/text()').extract()
        apply = response.xpath('//p[@class="text-center"]/b/text()').extract()
        job_type = response.xpath('//div[@id="resumen"]/p[3]/text()').extract()

        items['job_title'] = job_title
        items['company'] = company
        items['company_url'] = company_url
        items['description'] = description
        items['salary'] = salary
        items['city'] = city
        items['district'] = district
        items['publication_date'] = publication_date
        items['apply'] = apply
        items['job_type'] = job_type
        yield items
From what I can see, one of the issues is that you have created two functions called parse(). Since you refer to self.parse_jobspider in your first parse function, I'm guessing that your second parse function is simply named incorrectly.
Also, are you sure that the URL in start_urls is correct? https://cccc/bolsa/ofertas?oferta=&lugar=&categoria= doesn't lead anywhere, which would also explain why no data is being scraped.
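A minimal sketch of that rename, keeping the question's own selectors and giving the second callback the name the first one already refers to (the shortened field list is illustrative):

import scrapy

from ..items import JobscraperItem


class JobscraperSpider(scrapy.Spider):
    name = 'jobspider'
    start_urls = ['https://cccc/bolsa/ofertas?oferta=&lugar=&categoria=']

    def parse(self, response):
        job_detail = response.xpath('//div[@class="list"]/div/a')
        # follow every job link and hand it to a callback with a *different* name
        yield from response.follow_all(job_detail, callback=self.parse_jobspider)

    def parse_jobspider(self, response):
        items = JobscraperItem()
        items['job_title'] = response.xpath('//h1/text()').extract()
        items['company'] = response.xpath('//h2/b/text()').extract()
        yield items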
I resolved this by adding the following rules, which access every link and scrape the data inside:
rules = (
    Rule(LinkExtractor(allow=('/bolsa/166',)), follow=True, callback='parse_item'),
)
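For completeness, a rules tuple like that only takes effect inside a CrawlSpider. A sketch of how it might be wired up is below; the allow pattern and the parse_item callback name come from the snippet above, while the item fields are placeholders:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class JobscraperSpider(CrawlSpider):
    name = 'jobspider'
    start_urls = ['https://cccc/bolsa/ofertas?oferta=&lugar=&categoria=']

    rules = (
        # follow every link matching /bolsa/166 and send each page to parse_item
        Rule(LinkExtractor(allow=('/bolsa/166',)), follow=True, callback='parse_item'),
    )

    def parse_item(self, response):
        yield {
            'job_title': response.xpath('//h1/text()').extract(),
            'company': response.xpath('//h2/b/text()').extract(),
        }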

speed up python scrapy crawler

I'm currently writing a vacancies scraper with Scrapy to parse about 3M vacancy items.
The spider works and successfully scrapes items and stores them to PostgreSQL, but it does so pretty slowly: in 1 hour I stored only 12k vacancies, so I'm really far from 3M of them.
The thing is that in the end I'm going to need to scrape and update the data once per day, and with the current performance I'm going to need more than a day just to parse all the data.
I'm new to data scraping, so I may be doing some basic thing wrong, and I'll be very grateful if anybody can help me.
Code of my spider:
Code of my spider:
import scrapy
import urllib.request
from lxml import html
from ..items import JobItem


class AdzunaSpider(scrapy.Spider):
    name = "adzuna"
    start_urls = [
        'https://www.adzuna.ru/search?loc=136073&pp=10'
    ]

    def parse(self, response):
        job_items = JobItem()
        items = response.xpath("//div[@class='sr']/div[@class='a']")

        def get_redirect(url):
            response = urllib.request.urlopen(url)
            response_code = response.read()
            result = str(response_code, 'utf-8')
            root = html.fromstring(result)
            final_url = root.xpath('//p/a/@href')[0]
            final_final_url = final_url.split('?utm', 1)[0]
            return final_final_url

        for item in items:
            id = None
            data_aid = item.xpath(".//@data-aid").get()
            redirect = item.xpath(".//h2/a/@href").get()
            url = get_redirect(redirect)
            url_header = item.xpath(".//h2/a/strong/text()").get()
            if item.xpath(".//p[@class='as']/@data-company-name").get() == None:
                company = item.xpath(".//p[@class='as']/text()").get()
            else:
                company = item.xpath(".//p[@class='as']/@data-company-name").get()
            loc = item.xpath(".//p/span[@class='loc']/text()").get()
            text = item.xpath(".//p[@class='at']/span[@class='at_tr']/text()").get()
            salary = item.xpath(".//p[@class='at']/span[@class='at_sl']/text()").get()

            job_items['id'] = id
            job_items['data_aid'] = data_aid
            job_items['url'] = url
            job_items['url_header'] = url_header
            job_items['company'] = company
            job_items['loc'] = loc
            job_items['text'] = text
            job_items['salary'] = salary
            yield job_items

        next_page = response.css("table.pg td:last-child ::attr('href')").get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)
Use indexes in your table.
Insert in bulk instead of inserting one-by-one (a pipeline sketch follows this list).
Minimize the use of meta in your Requests.
Use tuples instead of lists where possible.
Set CONCURRENT_ITEMS=100; setting it higher decreases performance.
Try to use fewer middlewares and pipelines.
Set AUTOTHROTTLE_ENABLED=False in settings.py.
Set TELNETCONSOLE_ENABLED=False in settings.py.
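As an illustration of the "insert in bulk" point, a rough sketch of a buffered item pipeline is below. It assumes psycopg2 and a table named vacancies with columns matching the question's item fields; the DSN, table name, and batch size are placeholders, not taken from the original post:

# pipelines.py (sketch) -- buffer items and write them in batches instead of
# issuing one INSERT per item; connection string and table layout are illustrative
import psycopg2


class BulkInsertPipeline:
    batch_size = 500

    def open_spider(self, spider):
        self.conn = psycopg2.connect("dbname=jobs user=scrapy")  # placeholder DSN
        self.cur = self.conn.cursor()
        self.buffer = []

    def process_item(self, item, spider):
        self.buffer.append((item['data_aid'], item['url'], item['url_header'],
                            item['company'], item['loc'], item['text'], item['salary']))
        if len(self.buffer) >= self.batch_size:
            self._flush()
        return item

    def close_spider(self, spider):
        self._flush()
        self.cur.close()
        self.conn.close()

    def _flush(self):
        if not self.buffer:
            return
        # one round-trip per batch instead of one per item
        self.cur.executemany(
            "INSERT INTO vacancies (data_aid, url, url_header, company, loc, text, salary) "
            "VALUES (%s, %s, %s, %s, %s, %s, %s)",
            self.buffer,
        )
        self.conn.commit()
        self.buffer = []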

Stuck Scraping Multiple Domains sequentially - Python Scrapy

I am fairly new to Python as well as web scraping. My first project is scraping random Craigslist cities (5 cities total) under the transportation sub-domain (e.g. https://dallas.craigslist.org), but I am stuck having to run the script manually for each city after manually updating that city's domain in the constants (start_urls = and absolute_next_url =) in the script. Is there any way I can adjust the script to run sequentially through the cities I have defined (i.e. miami, new york, houston, chicago, etc.) and auto-populate the constants (start_urls = and absolute_next_url =) for each respective city?
Also, is there a way to adjust the script to output each city into its own .csv (i.e. miami.csv, houston.csv, chicago.csv, etc.)?
Thank you in advance
# -*- coding: utf-8 -*-
import scrapy
from scrapy import Request


class JobsSpider(scrapy.Spider):
    name = "jobs"
    allowed_domains = ["craigslist.org"]
    start_urls = ['https://dallas.craigslist.org/d/transportation/search/trp']

    def parse(self, response):
        jobs = response.xpath('//p[@class="result-info"]')
        for job in jobs:
            listing_title = job.xpath('a/text()').extract_first()
            city = job.xpath('span[@class="result-meta"]/span[@class="result-hood"]/text()').extract_first("")[2:-1]
            job_posting_date = job.xpath('time/@datetime').extract_first()
            job_posting_url = job.xpath('a/@href').extract_first()
            data_id = job.xpath('a/@data-id').extract_first()

            yield Request(job_posting_url, callback=self.parse_page,
                          meta={'job_posting_url': job_posting_url,
                                'listing_title': listing_title,
                                'city': city,
                                'job_posting_date': job_posting_date,
                                'data_id': data_id})

        relative_next_url = response.xpath('//a[@class="button next"]/@href').extract_first()
        absolute_next_url = "https://dallas.craigslist.org" + relative_next_url
        yield Request(absolute_next_url, callback=self.parse)

    def parse_page(self, response):
        job_posting_url = response.meta.get('job_posting_url')
        listing_title = response.meta.get('listing_title')
        city = response.meta.get('city')
        job_posting_date = response.meta.get('job_posting_date')
        data_id = response.meta.get('data_id')

        description = "".join(line for line in response.xpath('//*[@id="postingbody"]/text()').extract()).strip()
        compensation = response.xpath('//p[@class="attrgroup"]/span[1]/b/text()').extract_first()
        employment_type = response.xpath('//p[@class="attrgroup"]/span[2]/b/text()').extract_first()
        latitude = response.xpath('//div/@data-latitude').extract_first()
        longitude = response.xpath('//div/@data-longitude').extract_first()
        posting_id = response.xpath('//p[@class="postinginfo"]/text()').extract()

        yield {'job_posting_url': job_posting_url,
               'data_id': data_id,
               'listing_title': listing_title,
               'city': city,
               'description': description,
               'compensation': compensation,
               'employment_type': employment_type,
               'latitude': latitude,
               'longitude': longitude,
               'job_posting_date': job_posting_date,
               'posting_id': posting_id
               }
There might be a cleaner way, but check out https://docs.scrapy.org/en/latest/topics/practices.html?highlight=multiple%20spiders: you can combine multiple instances of your spider, so you can have a separate 'class' for each city. There are probably ways to consolidate the code so it's not all repeated.
As for writing to CSV, are you doing that via the command line right now? I'd add the code to the spider itself: https://realpython.com/python-csv/
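One sketch of that idea is a single spider that takes the city as an argument and builds its URLs from it. The shortened field list below is illustrative; the XPaths are the ones from the question:

import scrapy


class JobsSpider(scrapy.Spider):
    name = "jobs"
    allowed_domains = ["craigslist.org"]

    def __init__(self, city="dallas", *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.city = city
        self.start_urls = [f"https://{city}.craigslist.org/d/transportation/search/trp"]

    def parse(self, response):
        for job in response.xpath('//p[@class="result-info"]'):
            yield {
                "listing_title": job.xpath('a/text()').get(),
                "job_posting_url": job.xpath('a/@href').get(),
                "job_posting_date": job.xpath('time/@datetime').get(),
            }
        # follow pagination relative to the current city instead of a hard-coded domain
        next_page = response.xpath('//a[@class="button next"]/@href').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

Run it once per city, e.g. scrapy crawl jobs -a city=miami -o miami.csv, or loop over the cities with CrawlerProcess as shown on the linked practices page.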

scrapy not following links with no error

The URL below is used both to extract content and to be followed, but nothing happens after the content is extracted. I don't know why it is not followed; there seem to be no errors.
You run a Request for the author URL twice: first to scrape the list of authors, and a second time to scrape the current author's details. The Scrapy stats dump (at the end of the log) shows a "dupefilter/filtered" count, which means Scrapy filtered out the duplicate URLs. Scraping will work if you remove the parse_content function and write the code like this:
def parse(self, response):
    if 'tags' in response.meta:
        author = {}
        author['url'] = response.url
        name = response.css(".people-name::text").extract()
        join_date = response.css(".joined-time::text").extract()
        following_no = response.css(".following-number::text").extract()
        followed_no = response.css(".followed-number::text").extract_first()
        first_onsale = response.css(".first-onsale-date::text").extract()
        total_no = response.css(".total-number::text").extract()
        comments = total_no[0]
        onsale = total_no[1]
        columns = total_no[2]
        ebooks = total_no[3]
        essays = total_no[4]

        author['tags'] = response.meta['tags']
        author['name'] = name
        author['join_date'] = join_date
        author['following_no'] = following_no
        author['followed_no'] = followed_no
        author['first_onsale'] = first_onsale
        author['comments'] = comments
        author['onsale'] = onsale
        author['columns'] = columns
        author['ebooks'] = ebooks
        author['essays'] = essays
        yield author

    authors = response.css('section.following-agents ul.bd li.item')
    for author in authors:
        tags = author.css('div.author-tags::text').extract_first()
        url = author.css('a.lnk-avatar::attr(href)').extract_first()
        yield response.follow(url=url, callback=self.parse, meta={'tags': tags})
Be careful: I removed some lines during testing. You need to use random user agents in the HTTP headers, a request delay, or a proxy. I ran the collection and now I get a "403 Forbidden" status code.
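A minimal settings.py sketch for the "request delay / user agent" advice; the values are guesses to illustrate the knobs, not something tested against this site:

# settings.py (sketch)
DOWNLOAD_DELAY = 2               # pause between requests to the same site
RANDOMIZE_DOWNLOAD_DELAY = True  # jitter the delay between 0.5x and 1.5x
AUTOTHROTTLE_ENABLED = True      # adapt the delay to the server's response times
USER_AGENT = "Mozilla/5.0 (compatible; my-crawler)"  # placeholder; rotate real agents via a downloader middleware if needed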

Python BeautifulSoup accounting for missing data on website when writing to csv

I am practicing my web scraping skills on the following website: "http://web.californiacraftbeer.com/Brewery-Member"
The code I have so far is below. I'm able to grab the fields that I want and write the information to CSV, but the information in each row does not match the actual company details. For example, Company A has the contact name for Company D and the phone number for Company E in the same row.
Since some data does not exist for certain companies, how can I account for this when writing rows that should be separated per company to CSV? What is the best way to make sure that I am grabbing the correct information for the correct companies when writing to CSV?
"""
Grabs brewery name, contact person, phone number, website address, and email address
for each brewery listed.
"""
import requests, csv
from bs4 import BeautifulSoup
url = "http://web.californiacraftbeer.com/Brewery-Member"
res = requests.get(url)
soup = BeautifulSoup(res.content, "lxml")
company_name = soup.find_all(itemprop="name")
contact_name = soup.find_all("div", {"class": "ListingResults_Level3_MAINCONTACT"})
phone_number = soup.find_all("div", {"class": "ListingResults_Level3_PHONE1"})
website = soup.find_all("span", {"class": "ListingResults_Level3_VISITSITE"})
def scraper():
"""Grabs information and writes to CSV"""
print("Running...")
results = []
count = 0
for company, name, number, site in zip(company_name, contact_name, phone_number, website):
print("Grabbing {0} ({1})...".format(company.text, count))
count += 1
newrow = []
try:
newrow.append(company.text)
newrow.append(name.text)
newrow.append(number.text)
newrow.append(site.find('a')['href'])
except Exception as e:
error_msg = "Error on {0}-{1}".format(number.text,e)
newrow.append(error_msg)
results.append(newrow)
print("Done")
outFile = open("brewery.csv","w")
out = csv.writer(outFile, delimiter=',',quoting=csv.QUOTE_ALL, lineterminator='\n')
out.writerows(results)
outFile.close()
def main():
"""Runs web scraper"""
scraper()
if __name__ == '__main__':
main()
Any help is very much appreciated!
You need to use a zip to iterate through all those arrays simultaneously:
for company, name, number, site in zip(company_name, contact_name, phone_number, website):
Thanks for the help.
I realized that since the company details for each company are contained in the div with class "ListingResults_All_CONTAINER ListingResults_Level3_CONTAINER", I could write a nested for-loop that iterates through each of these divs and grabs the information I want from within that div (a sketch of this is below).
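A rough sketch of that per-container approach, reusing the class names from the question and answer (the CSV layout is just an example). Looking each field up inside the company's own container is what keeps the rows aligned even when a field is missing:

import csv
import requests
from bs4 import BeautifulSoup

url = "http://web.californiacraftbeer.com/Brewery-Member"
soup = BeautifulSoup(requests.get(url).content, "lxml")


def field(container, cls):
    """Text of the first div with the given class inside this container, or ''."""
    tag = container.find("div", class_=cls)
    return tag.get_text(strip=True) if tag else ""


rows = []
for container in soup.select("div.ListingResults_Level3_CONTAINER"):
    name_tag = container.find(itemprop="name")
    rows.append([
        name_tag.get_text(strip=True) if name_tag else "",
        field(container, "ListingResults_Level3_MAINCONTACT"),
        field(container, "ListingResults_Level3_PHONE1"),
    ])

with open("brewery.csv", "w", newline="") as f:
    csv.writer(f, quoting=csv.QUOTE_ALL).writerows(rows)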
