I am trying to scrape this website with Selenium:
https://startupbase.com.br/home/startups?q=&states=all&cities=all&segments=Constru%C3%A7%C3%A3o%20Civil~Imobili%C3%A1rio&targets=all&phases=all&models=all&badges=all
What I need: to enter every child page, extract a lot of information, and do this for every company shown.
The code:
import time

import pandas as pd
from bs4 import BeautifulSoup
from fake_useragent import UserAgent
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("window-size=1400,600")

ua = UserAgent()
user_agent = ua.random
print(user_agent)
options.add_argument(f'user-agent={user_agent}')

driver = webdriver.Chrome('chromedriver')
driver.get("https://startupbase.com.br/home/startups?q=&states=all&cities=all&segments=Construção%20Civil~Imobiliário&targets=all&phases=all&models=all&badges=all")
time.sleep(3)

# Accept the cookie banner
cookies_button = driver.find_element_by_xpath("//button[contains(text(), 'Accept')]")
cookies_button.click()
time.sleep(3)
# Lists that we will append to while scraping
founder_name = []
name_company = []
site_url = []
local = []
mercado = []
publico_alvo = []
modelo_receita = []
momento = []
sobre = []
fundacao = []
tamanho_time = []
linkedin_company = []
linkedin_founder = []
atualizacao = []
while True:
    time.sleep(2)
    html = driver.page_source
    soup = BeautifulSoup(html, 'html.parser')
    code = soup.prettify()
    print(code)
    containers = soup.find_all("div", {"class": "search-body__item"})
    for container in containers:
        internal_page = container.find('a', href=True)
The code is still at an early stage because I'm trying to enter the child pages and I can't manage to do that.
I've already tried:
internal_page = driver.find_element_by_xpath("/html/body/app-root/ng-component/app-layout/div/div/div/div/div/app-layout-column/ng-component/div/ais-instantsearch/div/div/div/div[2]/section/ais-infinite-hits/div/div[2]/a")
internal_page.click()
Could someone shed some light on this, please?
You can use a different approach instead of simulating clicks on all the buttons.
If you check the link of each startup, it is https://startupbase.com.br/c/startup/ followed by the startup's name with the spaces replaced by dashes.
So you can use a base URL:
base_url = 'https://startupbase.com.br/c/startup/{}'
You can get the title of every startup using the CSS selector .org__title.sb-size-6:
titles = ['-'.join(title.text.split()) for title in driver.find_elements_by_css_selector('.org__title.sb-size-6')]
After that you can iterate through all the titles and append each one to the base URL, with dashes instead of spaces:
for title in titles:
    url = base_url.format(title)
Then run whatever request/scraping code you want against the url variable.
Code:
base_url = 'https://startupbase.com.br/c/startup/{}'

titles = ['-'.join(title.text.split()) for title in driver.find_elements_by_css_selector('.org__title.sb-size-6')]

for title in titles:
    url = base_url.format(title)
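As a rough continuation, here is a minimal sketch of visiting each built URL with the same driver and parsing the page with BeautifulSoup; the detail-page selectors below are hypothetical placeholders that you would need to adjust to the real markup:
results = []
for title in titles:
    url = base_url.format(title)
    driver.get(url)
    time.sleep(2)  # crude wait; a WebDriverWait on a known element is more robust

    soup = BeautifulSoup(driver.page_source, 'html.parser')

    # Hypothetical selectors -- inspect the startup detail page and adjust.
    name_tag = soup.find('h1')
    about_tag = soup.find('div', {'class': 'about'})

    results.append({
        'url': url,
        'name': name_tag.get_text(strip=True) if name_tag else None,
        'sobre': about_tag.get_text(strip=True) if about_tag else None,
    })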
You can do this easily with Scrapy by calling the site's API directly: it returns a JSON response to a POST request.
CODE:
import scrapy
import json


class ScrollSpider(scrapy.Spider):
    name = 'scroll'
    body = '{"requests":[{"indexName":"prod_STARTUPBASE","params":"maxValuesPerFacet=100&query=&highlightPreTag=__ais-highlight__&highlightPostTag=__%2Fais-highlight__&page=0&facets=%5B%22segments.primary%22%2C%22state%22%2C%22place%22%2C%22business_target%22%2C%22business_phase%22%2C%22business_model%22%2C%22badges.name%22%5D&tagFilters=&facetFilters=%5B%5B%22segments.primary%3AConstru%C3%A7%C3%A3o%20Civil%22%2C%22segments.primary%3AImobili%C3%A1rio%22%5D%5D"},{"indexName":"prod_STARTUPBASE","params":"maxValuesPerFacet=100&query=&highlightPreTag=__ais-highlight__&highlightPostTag=__%2Fais-highlight__&page=0&hitsPerPage=1&attributesToRetrieve=%5B%5D&attributesToHighlight=%5B%5D&attributesToSnippet=%5B%5D&tagFilters=&analytics=false&clickAnalytics=false&facets=segments.primary"}]}'

    def start_requests(self):
        yield scrapy.Request(
            url='https://fwtbnxlfs6-3.algolianet.com/1/indexes/*/queries?x-algolia-agent=Algolia%20for%20JavaScript%20(3.35.1)%3B%20Browser%20(lite)%3B%20angular%20(9.0.7)%3B%20angular-instantsearch%20(3.0.0-beta.5)%3B%20instantsearch.js%20(4.7.0)%3B%20JS%20Helper%20(3.1.2)&x-algolia-application-id=FWTBNXLFS6&x-algolia-api-key=e5fef9eab51259b54d385c6f010cc399',
            callback=self.parse,
            method='POST',
            body=self.body,
            headers={'content-type': 'application/x-www-form-urlencoded',
                     'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36'}
        )

    def parse(self, response):
        resp = json.loads(response.body)
        hits = resp['results'][0]['hits']
        for hit in hits:
            yield {
                'Name': hit['name']
            }
Output:
{'Name': 'Constr Up'}
2021-09-04 08:07:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://fwtbnxlfs6-3.algolianet.com/1/indexes/*/queries?x-algolia-agent=Algolia%20for%20JavaScript%20(3.35.1)%3B%20Browser%20(lite)%3B%20angular%20(9.0.7)%3B%20angular-instantsearch%20(3.0.0-beta.5)%3B%20instantsearch.js%20(4.7.0)%3B%20JS%20Helper%20(3.1.2)&x-algolia-application-id=FWTBNXLFS6&x-algolia-api-key=e5fef9eab51259b54d385c6f010cc399>
{'Name': 'Agenciou!'}
2021-09-04 08:07:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://fwtbnxlfs6-3.algolianet.com/1/indexes/*/queries?x-algolia-agent=Algolia%20for%20JavaScript%20(3.35.1)%3B%20Browser%20(lite)%3B%20angular%20(9.0.7)%3B%20angular-instantsearch%20(3.0.0-beta.5)%3B%20instantsearch.js%20(4.7.0)%3B%20JS%20Helper%20(3.1.2)&x-algolia-application-id=FWTBNXLFS6&x-algolia-api-key=e5fef9eab51259b54d385c6f010cc399>
{'Name': 'inQuality System'}
2021-09-04 08:07:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://fwtbnxlfs6-3.algolianet.com/1/indexes/*/queries?x-algolia-agent=Algolia%20for%20JavaScript%20(3.35.1)%3B%20Browser%20(lite)%3B%20angular%20(9.0.7)%3B%20angular-instantsearch%20(3.0.0-beta.5)%3B%20instantsearch.js%20(4.7.0)%3B%20JS%20Helper%20(3.1.2)&x-algolia-application-id=FWTBNXLFS6&x-algolia-api-key=e5fef9eab51259b54d385c6f010cc399>
{'Name': 'Constructweb - Gestão eficiente de reformas'}
2021-09-04 08:07:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://fwtbnxlfs6-3.algolianet.com/1/indexes/*/queries?x-algolia-agent=Algolia%20for%20JavaScript%20(3.35.1)%3B%20Browser%20(lite)%3B%20angular%20(9.0.7)%3B%20angular-instantsearch%20(3.0.0-beta.5)%3B%20instantsearch.js%20(4.7.0)%3B%20JS%20Helper%20(3.1.2)&x-algolia-application-id=FWTBNXLFS6&x-algolia-api-key=e5fef9eab51259b54d385c6f010cc399>
{'Name': 'Apê Fácil'}
2021-09-04 08:07:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://fwtbnxlfs6-3.algolianet.com/1/indexes/*/queries?x-algolia-agent=Algolia%20for%20JavaScript%20(3.35.1)%3B%20Browser%20(lite)%3B%20angular%20(9.0.7)%3B%20angular-instantsearch%20(3.0.0-beta.5)%3B%20instantsearch.js%20(4.7.0)%3B%20JS%20Helper%20(3.1.2)&x-algolia-application-id=FWTBNXLFS6&x-algolia-api-key=e5fef9eab51259b54d385c6f010cc399>
{'Name': 'Glück Imóveis '}
2021-09-04 08:07:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://fwtbnxlfs6-3.algolianet.com/1/indexes/*/queries?x-algolia-agent=Algolia%20for%20JavaScript%20(3.35.1)%3B%20Browser%20(lite)%3B%20angular%20(9.0.7)%3B%20angular-instantsearch%20(3.0.0-beta.5)%3B%20instantsearch.js%20(4.7.0)%3B%20JS%20Helper%20(3.1.2)&x-algolia-application-id=FWTBNXLFS6&x-algolia-api-key=e5fef9eab51259b54d385c6f010cc399>
{'Name': 'ArqColab'}
... so on
Response:
downloader/response_status_count/200
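If you need more than the first page of results, one possible extension (a sketch only; it assumes the Algolia endpoint honors the page parameter that already appears in the request body) is to re-issue the same POST with the page number bumped until no hits come back:
def parse(self, response):
    resp = json.loads(response.body)
    hits = resp['results'][0]['hits']
    for hit in hits:
        yield {
            'Name': hit['name']
        }

    # Naive pagination: rewrite "page=0" in the original body to the next
    # page number and POST again. String replacement is fragile; adjust if
    # the params change.
    page = response.meta.get('page', 0)
    if hits:
        next_body = self.body.replace('page=0', 'page={}'.format(page + 1))
        yield scrapy.Request(
            url=response.url,
            callback=self.parse,
            method='POST',
            body=next_body,
            headers={'content-type': 'application/x-www-form-urlencoded'},
            meta={'page': page + 1},
            dont_filter=True,  # the URL never changes, so skip the dupe filter
        )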
As I said in the comment on your question, you are selecting the wrong element. You have to select the parent element of the <a>, since that is what carries the click action. You can use XPath to walk up to the parent element.
internal_page = driver.find_element_by_xpath("/html/body/app-root/ng-component/app-layout/div/div/div/div/div/app-layout-column/ng-component/div/ais-instantsearch/div/div/div/div[2]/section/ais-infinite-hits/div/div[2]/a")
internal_page.find_element_by_xpath("./..").click()
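If you want to visit every result this way rather than just one, a possible pattern (a rough sketch, not tested against the live page; the XPath for the result links is an assumption) is to re-locate the links on every iteration, click the parent, scrape, and go back:
links_xpath = "//ais-infinite-hits//a"  # assumption: one anchor per result card

total = len(driver.find_elements_by_xpath(links_xpath))
for i in range(total):
    # Re-locate the links on every pass: after driver.back() the old
    # element references become stale.
    link = driver.find_elements_by_xpath(links_xpath)[i]
    link.find_element_by_xpath("./..").click()
    time.sleep(2)

    # ... scrape the child page here ...

    driver.back()
    time.sleep(2)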
Related
I want to extract information from a website, such as prices, and store the values in a dictionary. However, I'm trying to learn Scrapy, so I'd like to know how to achieve this with it.
Here's how it looks with requests and BeautifulSoup:
from collections import defaultdict

import requests
from bs4 import BeautifulSoup

html = ['https://www.ebay.co.uk/b/Collectable-Card-Games-Accessories/2536/bn_2316999?LH_PrefLoc=2&mag=1&rt=nc&_pgn=1&_sop=16',
        'https://www.ebay.co.uk/b/Collectable-Card-Games-Accessories/2536/bn_2316999?LH_PrefLoc=2&mag=1&rt=nc&_pgn=2&_sop=16',
        'https://www.ebay.co.uk/b/Collectable-Card-Games-Accessories/2536/bn_2316999?LH_PrefLoc=2&mag=1&rt=nc&_pgn=3&_sop=16',
        'https://www.ebay.co.uk/b/Collectable-Card-Games-Accessories/2536/bn_2316999?LH_PrefLoc=2&mag=1&rt=nc&_pgn=4&_sop=16',
        'https://www.ebay.co.uk/b/Collectable-Card-Games-Accessories/2536/bn_2316999?LH_PrefLoc=2&mag=1&rt=nc&_pgn=5&_sop=16']

data = defaultdict(list)

for i in range(0, len(html)):
    r = requests.get(html[i])
    soup = BeautifulSoup(r.content, 'lxml')
    name = soup.select(".s-item__title")
    value = soup.select(".ITALIC")
    for n, v in zip(name, value):
        data["card"].append(n.text.strip())
        data["price"].append(v.text.strip())
Here's what I have tried with Scrapy, but I don't get any values when I look at the JSON output, only the links. How do I get output like the code above?
import numpy as np
import pandas as pd
import scrapy
from scrapy.loader import ItemLoader
from scrapy.item import Field
from itemloaders.processors import TakeFirst
from scrapy.crawler import CrawlerProcess

html = np.array(['https://www.ebay.co.uk/b/Collectable-Card-Games-Accessories/2536/bn_2316999?LH_PrefLoc=2&mag=1&rt=nc&_pgn=1&_sop=16',
                 'https://www.ebay.co.uk/b/Collectable-Card-Games-Accessories/2536/bn_2316999?LH_PrefLoc=2&mag=1&rt=nc&_pgn=2&_sop=16',
                 'https://www.ebay.co.uk/b/Collectable-Card-Games-Accessories/2536/bn_2316999?LH_PrefLoc=2&mag=1&rt=nc&_pgn=3&_sop=16',
                 'https://www.ebay.co.uk/b/Collectable-Card-Games-Accessories/2536/bn_2316999?LH_PrefLoc=2&mag=1&rt=nc&_pgn=4&_sop=16',
                 'https://www.ebay.co.uk/b/Collectable-Card-Games-Accessories/2536/bn_2316999?LH_PrefLoc=2&mag=1&rt=nc&_pgn=5&_sop=16'],
                dtype=object)

url = pd.DataFrame(html, columns=['data'])


class StatisticsItem(scrapy.Item):
    statistics_div = Field(output_processor=TakeFirst())
    url = Field(output_processor=TakeFirst())


class StatisticsSpider(scrapy.Spider):
    name = 'statistics'
    start_urls = url.data.values

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(
                url
            )

    def parse(self, response):
        table = response.xpath("//div[@class='s-item__price']").get()
        loader = ItemLoader(StatisticsItem())
        loader.add_value('values', table)
        loader.add_value('url', response.url)
        yield loader.load_item()


process = CrawlerProcess(
    settings={
        'FEED_URI': 'ebay_data.json',
        'FEED_FORMAT': 'jsonlines'
    }
)
process.crawl(StatisticsSpider)
process.start()
I set custom_settings to write to 'cards_info.json' in JSON format.
Inside parse I go through each card on the page (see the XPath), get the card's title and price, and yield them. Scrapy writes them into 'cards_info.json'.
import scrapy
from scrapy.item import Field
from itemloaders.processors import TakeFirst


class StatisticsItem(scrapy.Item):
    statistics_div = Field(output_processor=TakeFirst())
    url = Field(output_processor=TakeFirst())


class StatisticsSpider(scrapy.Spider):
    name = 'statistics'
    start_urls = ['https://www.ebay.co.uk/b/Collectable-Card-Games-Accessories/2536/bn_2316999?LH_PrefLoc=2&mag=1&rt=nc&_pgn=1&_sop=16',
                  'https://www.ebay.co.uk/b/Collectable-Card-Games-Accessories/2536/bn_2316999?LH_PrefLoc=2&mag=1&rt=nc&_pgn=2&_sop=16',
                  'https://www.ebay.co.uk/b/Collectable-Card-Games-Accessories/2536/bn_2316999?LH_PrefLoc=2&mag=1&rt=nc&_pgn=3&_sop=16',
                  'https://www.ebay.co.uk/b/Collectable-Card-Games-Accessories/2536/bn_2316999?LH_PrefLoc=2&mag=1&rt=nc&_pgn=4&_sop=16',
                  'https://www.ebay.co.uk/b/Collectable-Card-Games-Accessories/2536/bn_2316999?LH_PrefLoc=2&mag=1&rt=nc&_pgn=5&_sop=16']
    custom_settings = {
        'FEED_FORMAT': 'json',
        'FEED_URI': 'cards_info.json'
    }

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(
                url
            )

    def parse(self, response):
        all_cards = response.xpath('//div[@class="s-item__wrapper clearfix"]')
        for card in all_cards:
            name = card.xpath('.//h3/text()').get()
            price = card.xpath('.//span[@class="s-item__price"]//text()').get()
            # now do whatever you want: append to a dictionary, yield as an item.
            # example with yield:
            yield {
                'card': name,
                'price': price
            }
Output:
[scrapy.core.scraper] DEBUG: Scraped from <200 https://www.ebay.co.uk/b/Collectable-Card-Games-Accessories/2536/bn_2316999?LH_PrefLoc=2&mag=1&rt=nc&_pgn=1&_sop=16>
{'card': 'Pokemon 1st Edition Shadowless Base Set 11 Blister Booster Pack Lot - DM To Buy!', 'price': '£93,805.84'}
[scrapy.core.scraper] DEBUG: Scraped from <200 https://www.ebay.co.uk/b/Collectable-Card-Games-Accessories/2536/bn_2316999?LH_PrefLoc=2&mag=1&rt=nc&_pgn=1&_sop=16>
{'card': 'Pokemon Team Rocket Complete Complete 83/82, German, 1. Edition', 'price': '£102,026.04'}
[scrapy.core.scraper] DEBUG: Scraped from <200 https://www.ebay.co.uk/b/Collectable-Card-Games-Accessories/2536/bn_2316999?LH_PrefLoc=2&mag=1&rt=nc&_pgn=1&_sop=16>
{'card': 'Yugioh E Hero Pit Boss 2013 World Championship Prize Card BGS 9.5 Gem Mint', 'price': '£100,000.00'}
...
...
cards_info.json:
[
{"card": "1999 Pokemon Base Set Booster Box GREEN WING", "price": "\u00a340,000.00"},
{"card": "1996 MEDIA FACTORY POKEMON NO RARITY BASE SET CHARIZARD 006 BECKETT BGS MINT 9 ", "price": "\u00a339,999.99"},
{"card": "Yugioh - BGS8.5 Jump Festa Blue Eyes White Dragon -1999 - Limited - PSA", "price": "\u00a340,000.00"},
{"card": "PSA 8 CHARIZARD 1999 POKEMON 1ST EDITION THICK STAMP SHADOWLESS #4 HOLO NM-MINT", "price": "\u00a337,224.53"},
{"card": "PSA 9 MINT Pok\u00e9mon Play Promo 50000 PTS Gold Star Japanese Pokemon", "price": "\u00a338,261.06"},
...
...
]
I'm trying to extract data (title, price and description) from an AJAX endpoint, but it doesn't work, even after changing the user agent.
Link : https://scrapingclub.com/exercise/detail_header/
Ajax (data want to extract) : https://scrapingclub.com/exercise/ajaxdetail_header/
import scrapy


class UseragentSpider(scrapy.Spider):
    name = 'useragent'
    allowed_domains = ['scrapingclub.com/exercise/ajaxdetail_header/']
    start_urls = ['https://scrapingclub.com/exercise/ajaxdetail_header/']
    user_agent = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.83 Safari/537.36"

    def parse(self, response):
        cardb = response.xpath("//div[@class='card-body']")
        for thing in cardb:
            title = thing.xpath(".//h3")
            yield {'title': title}
Error log:
2020-09-07 20:34:39 [scrapy.core.engine] INFO: Spider opened
2020-09-07 20:34:39 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-09-07 20:34:39 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-09-07 20:34:40 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://scrapingclub.com/robots.txt> (referer: None)
2020-09-07 20:34:40 [scrapy.core.engine] DEBUG: Crawled (403) <GET https://scrapingclub.com/exercise/ajaxdetail_header/> (referer: None)
2020-09-07 20:34:40 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 https://scrapingclub.com/exercise/ajaxdetail_header/>: HTTP status code is not handled or not allowed
AJAX requests should send the header
'X-Requested-With': 'XMLHttpRequest'
Not all servers check for it, but this one does. It doesn't check the User-Agent, though.
The server sends the data as JSON, so XPath will be useless.
I tested it with requests instead of scrapy because it was simpler for me.
import requests

headers = {
    # 'User-Agent': 'Mozilla/5.0',
    'X-Requested-With': 'XMLHttpRequest',
}

url = 'https://scrapingclub.com/exercise/ajaxdetail_header/'

response = requests.get(url, headers=headers)
data = response.json()

print(data)
print('type:', type(data))
print('keys:', data.keys())

print('--- manually ---')
print('price:', data['price'])
print('title:', data['title'])

print('--- for-loop ---')
for key, value in data.items():
    print('{}: {}'.format(key, value))
Result:
{'img_path': '/static/img/00959-A.jpg', 'price': '$24.99', 'description': 'Blouse in airy, crinkled fabric with a printed pattern. Small stand-up collar, concealed buttons at front, and flounces at front. Long sleeves with buttons at cuffs. Rounded hem. 100% polyester. Machine wash cold.', 'title': 'Crinkled Flounced Blouse'}
type: <class 'dict'>
keys: dict_keys(['img_path', 'price', 'description', 'title'])
--- manually ---
price: $24.99
title: Crinkled Flounced Blouse
--- for-loop ---
img_path: /static/img/00959-A.jpg
price: $24.99
description: Blouse in airy, crinkled fabric with a printed pattern. Small stand-up collar, concealed buttons at front, and flounces at front. Long sleeves with buttons at cuffs. Rounded hem. 100% polyester. Machine wash cold.
title: Crinkled Flounced Blouse
EDIT:
The same with Scrapy. I use the start_requests() method to create a Request() with the 'X-Requested-With' header.
You can put all the code in one file and run python script.py without creating a project.
import scrapy
import json


class MySpider(scrapy.Spider):
    name = 'myspider'

    def start_requests(self):
        url = 'https://scrapingclub.com/exercise/ajaxdetail_header/'
        headers = {
            # 'User-Agent': 'Mozilla/5.0',
            'X-Requested-With': 'XMLHttpRequest',
        }
        yield scrapy.http.Request(url, headers=headers)

    def parse(self, response):
        print('url:', response.url)

        data = response.json()
        print(data)
        print('type:', type(data))
        print('keys:', data.keys())

        print('--- manually ---')
        print('price:', data['price'])
        print('title:', data['title'])

        print('--- for-loop ---')
        for key, value in data.items():
            print('{}: {}'.format(key, value))


# --- run without a project and save in `output.csv` ---
from scrapy.crawler import CrawlerProcess

c = CrawlerProcess({
    # 'USER_AGENT': 'Mozilla/5.0',

    # save to a CSV, JSON or XML file
    # 'FEED_FORMAT': 'csv',  # csv, json, xml
    # 'FEED_URI': 'output.csv',
})
c.crawl(MySpider)
c.start()
EDIT:
The same, using the DEFAULT_REQUEST_HEADERS setting.
import scrapy
import json


class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['https://scrapingclub.com/exercise/ajaxdetail_header/']

    def parse(self, response):
        print('url:', response.url)
        # print('headers:', response.request.headers)

        data = response.json()
        print(data)
        print('type:', type(data))
        print('keys:', data.keys())

        print('--- manually ---')
        print('price:', data['price'])
        print('title:', data['title'])

        print('--- for-loop ---')
        for key, value in data.items():
            print('{}: {}'.format(key, value))


# --- run without a project and save in `output.csv` ---
from scrapy.crawler import CrawlerProcess

c = CrawlerProcess({
    # 'USER_AGENT': 'Mozilla/5.0',

    'DEFAULT_REQUEST_HEADERS': {
        # 'User-Agent': 'Mozilla/5.0',
        'X-Requested-With': 'XMLHttpRequest',
    }

    # save to a CSV, JSON or XML file
    # 'FEED_FORMAT': 'csv',  # csv, json, xml
    # 'FEED_URI': 'output.csv',
})
c.crawl(MySpider)
c.start()
I'm kind of a newbie with Scrapy. My spider is not working properly when I try to scrape data from a forum. When I run the spider, it only prints the URLs and then stops, so I think the problem is in how the two functions parse and parse_data work together, but I may be wrong. Here is my code:
import scrapy, time


class ForumSpiderSpider(scrapy.Spider):
    name = 'forum_spider'
    allowed_domains = ['visforvoltage.org/latest_tech/']
    start_urls = ['http://visforvoltage.org/latest_tech//']

    def parse(self, response):
        for href in response.css(r"tbody a[href*='/forum/']::attr(href)").extract():
            url = response.urljoin(href)
            print(url)
            req = scrapy.Request(url, callback=self.parse_data)
            time.sleep(10)
            yield req

    def parse_data(self, response):
        for url in response.css('html').extract():
            data = {}
            data['name'] = response.css(r"div[class='author-pane-line author-name'] span[class='username']::text").extract()
            data['date'] = response.css(r"div[class='forum-posted-on']:contains('-') ::text").extract()
            data['title'] = response.css(r"div[class='section'] h1[class='title']::text").extract()
            data['body'] = response.css(r"div[class='field-items'] p::text").extract()
            yield data

        next_page = response.css(r"li[class='pager-next'] a[href*='page=']::attr(href)").extract()
        if next_page:
            yield scrapy.Request(
                response.urljoin(next_page),
                callback=self.parse)
Here is the output:
2020-07-23 23:09:58 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'visforvoltage.org': <GET https://visforvoltage.org/forum/14521-aquired-a123-m1-cells-need-charger-and-bms>
https://visforvoltage.org/forum/14448-battery-charger-problems
https://visforvoltage.org/forum/14191-vectrix-trickle-charger
https://visforvoltage.org/forum/14460-what-epoxy-would-you-recommend-loose-magnet-repair
https://visforvoltage.org/forum/14429-importance-correct-grounding-and-well-built-plugs
https://visforvoltage.org/forum/14457-147v-charger-24v-lead-acid-charger-and-dying-vectrix-cells
https://visforvoltage.org/forum/6723-lithium-safety-e-bike
https://visforvoltage.org/forum/11488-how-does-24v-4-wire-reversible-motor-work
https://visforvoltage.org/forum/14444-new-sevcon-gen-4-80v-sale
https://visforvoltage.org/forum/14443-new-sevcon-gen-4-80v-sale
https://visforvoltage.org/forum/12495-3500w-hub-motor-question-about-real-power-and-breaker
https://visforvoltage.org/forum/14402-vectrix-vx-1-battery-pack-problem
https://visforvoltage.org/forum/14068-vectrix-trickle-charger
https://visforvoltage.org/forum/2931-drill-motors
https://visforvoltage.org/forum/14384-help-repairing-gio-hub-motor-freewheel-sprocket
https://visforvoltage.org/forum/14381-zev-charger
https://visforvoltage.org/forum/8726-performance-unite-my1020-1000w-motor
https://visforvoltage.org/forum/7012-controler-mod-veloteq
https://visforvoltage.org/forum/14331-scooter-chargers-general-nfpanec
https://visforvoltage.org/forum/14320-charging-nissan-leaf-cells-lifepo4-charger
https://visforvoltage.org/forum/3763-newber-needs-help-new-gift-kollmorgan-hub-motor
https://visforvoltage.org/forum/14096-european-bldc-controller-seller
https://visforvoltage.org/forum/14242-lithium-bms-vs-manual-battery-balancing
https://visforvoltage.org/forum/14236-mosfet-wiring-ignition-key
https://visforvoltage.org/forum/2007-ok-dumb-question-time%3A-about-golf-cart-controllers
https://visforvoltage.org/forum/10524-my-mf70-recommended-powerpoles-arrived-today
https://visforvoltage.org/forum/9460-how-determine-battery-capacity
https://visforvoltage.org/forum/7705-tricking-0-5-v-hall-effect-throttle
https://visforvoltage.org/forum/13446-overcharged-lead-acid-battery-what-do
https://visforvoltage.org/forum/14157-reliable-high-performance-battery-enoeco-bt-p380
https://visforvoltage.org/forum/2702-hands-test-48-volt-20-ah-lifepo4-pack-ping-battery
https://visforvoltage.org/forum/14034-simple-and-cheap-ev-can-bus-adaptor
https://visforvoltage.org/forum/13933-zivan-ng-3-charger-specs-and-use
https://visforvoltage.org/forum/13099-controllers
https://visforvoltage.org/forum/13866-electric-motor-werks-demos-25-kilowatt-diy-chademo-leaf
https://visforvoltage.org/forum/13796-motor-theory-ac-vs-bldc
https://visforvoltage.org/forum/6184-bypass-bms-lifepo4-good-idea-or-not
https://visforvoltage.org/forum/13763-positive-feedback-kelly-controller
https://visforvoltage.org/forum/13764-any-users-smart-battery-drop-replacement-zapino-and-others
https://visforvoltage.org/forum/13760-contactor-or-fuse-position-circuit-rules-why
https://visforvoltage.org/forum/13759-contactor-or-fuse-position-circuit-rules-why
https://visforvoltage.org/forum/12725-repairing-lithium-battery-pack
https://visforvoltage.org/forum/13752-questions-sepex-motor-theory
https://visforvoltage.org/forum/13738-programming-curtis-controller-software
https://visforvoltage.org/forum/13741-making-own-simple-controller
https://visforvoltage.org/forum/12420-idea-charging-electric-car-portably-wo-relying-electricity-infrastructure
2020-07-23 23:17:28 [scrapy.extensions.logstats] INFO: Crawled 2 pages (at 2 pages/min), scraped 0 items (at 0 items/min)
2020-07-23 23:17:28 [scrapy.core.engine] INFO: Closing spider (finished)
As far as I can see, it didn't iterate over these links and collect data from them. What could be the reason for that?
I would really appreciate any help. Thank you!
This works for me.
import scrapy, time


class ForumSpiderSpider(scrapy.Spider):
    name = 'forum_spider'
    allowed_domains = ['visforvoltage.org/latest_tech/']
    start_urls = ['http://visforvoltage.org/latest_tech/']

    def parse(self, response):
        for href in response.css(r"tbody a[href*='/forum/']::attr(href)").extract():
            url = response.urljoin(href)
            req = scrapy.Request(url, callback=self.parse_data, dont_filter=True)
            yield req

    def parse_data(self, response):
        for url in response.css('html'):
            data = {}
            data['name'] = url.css(r"div[class='author-pane-line author-name'] span[class='username']::text").extract()
            data['date'] = url.css(r"div[class='forum-posted-on']:contains('-') ::text").extract()
            data['title'] = url.css(r"div[class='section'] h1[class='title']::text").extract()
            data['body'] = url.css(r"div[class='field-items'] p::text").extract()
            yield data

        next_page = response.css(r"li[class='pager-next'] a[href*='page=']::attr(href)").extract()
        if next_page:
            yield scrapy.Request(
                response.urljoin(next_page),
                callback=self.parse)
The issue is probably that the requests are getting filtered, as they are not part of the allowed domain.
allowed_domains = ['visforvoltage.org/latest_tech/']
The new request URLs are:
https://visforvoltage.org/forum/14448-battery-charger-problems
https://visforvoltage.org/forum/14191-vectrix-trickle-charger
...
Since the requests go to visforvoltage.org/forum/ and not to visforvoltage.org/latest_tech/, you can remove the allowed_domains property entirely, or change it to:
allowed_domains = ['visforvoltage.org']
This will make the spider crawl those pages, and you will see a different value on this line in your log:
2020-07-23 23:17:28 [scrapy.extensions.logstats] INFO: Crawled 2 pages (at 2 pages/min), scraped 0 items (at 0 items/min)
However, the selectors in the parsing don't seem right.
The selector below selects the whole page, and the extract() method returns it as a list, so you end up with a list containing a single string made up of all the HTML of the page:
response.css('html').extract()
You can read more about selectors and the getall()/extract() methods here.
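For example, a rough sketch of parse_data without that wrapper loop, yielding one item per thread page and reusing the selectors from the question (untested against the live forum):
def parse_data(self, response):
    yield {
        'name': response.css("div[class='author-pane-line author-name'] span[class='username']::text").getall(),
        'date': response.css("div[class='forum-posted-on']:contains('-') ::text").getall(),
        'title': response.css("div[class='section'] h1[class='title']::text").get(),
        'body': response.css("div[class='field-items'] p::text").getall(),
    }

    # Follow the pagination link, if any.
    next_page = response.css("li[class='pager-next'] a[href*='page=']::attr(href)").get()
    if next_page:
        yield scrapy.Request(response.urljoin(next_page), callback=self.parse)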
My spider starts on this page, https://finviz.com/screener.ashx, and visits every link in the table to yield some items on the other side. This worked perfectly fine. I then wanted to add another layer of depth by having my spider visit a link on the page it initially visits, like so:
start_urls > url > url_2
The spider is supposed to visit "url", yield some items along the way, then visit "url_2", yield a few more items, and then move on to the next url from the start_urls.
Here is my spider code:
import scrapy
from scrapy import Request
from dimstatistics.items import DimstatisticsItem


class StatisticsSpider(scrapy.Spider):
    name = 'statistics'

    def __init__(self):
        self.start_urls = ['https://finviz.com/screener.ashx?v=111&f=ind_stocksonly&r=01']
        npagesscreener = 1000
        for i in range(1, npagesscreener + 1):
            self.start_urls.append("https://finviz.com/screener.ashx?v=111&f=ind_stocksonly&r=" + str(i) + "1")

    def parse(self, response):
        for href in response.xpath("//td[contains(@class, 'screener-body-table-nw')]/a/@href"):
            url = "https://www.finviz.com/" + href.extract()
            yield Request(url, callback=self.parse_dir_contents)

    def parse_dir_contents(self, response):
        item = {}
        item['statisticskey'] = response.xpath("//a[contains(@class, 'fullview-ticker')]//text()").extract()[0]
        item['shares_outstanding'] = response.xpath("//table[contains(@class, 'snapshot-table2')]/tr/td/descendant::text()").extract()[9]
        item['shares_float'] = response.xpath("//table[contains(@class, 'snapshot-table2')]/tr/td/descendant::text()").extract()[21]
        item['short_float'] = response.xpath("//table[contains(@class, 'snapshot-table2')]/tr/td/descendant::text()").extract()[33]
        item['short_ratio'] = response.xpath("//table[contains(@class, 'snapshot-table2')]/tr/td/descendant::text()").extract()[45]
        item['institutional_ownership'] = response.xpath("//table[contains(@class, 'snapshot-table2')]/tr/td/descendant::text()").extract()[7]
        item['institutional_transactions'] = response.xpath("//table[contains(@class, 'snapshot-table2')]/tr/td/descendant::text()").extract()[19]
        item['employees'] = response.xpath("//table[contains(@class, 'snapshot-table2')]/tr/td/descendant::text()").extract()[97]
        item['recommendation'] = response.xpath("//table[contains(@class, 'snapshot-table2')]/tr/td/descendant::text()").extract()[133]
        yield item

        url2 = response.xpath("//table[contains(@class, 'fullview-links')]//a/@href").extract()[0]
        yield response.follow(url2, callback=self.parse_dir_stats)

    def parse_dir_stats(self, response):
        item = {}
        item['effective_tax_rate_ttm_company'] = response.xpath("//tr[td[normalize-space()='Effective Tax Rate (TTM)']]/td[2]/text()").extract()
        item['effective_tax_rate_ttm_industry'] = response.xpath("//tr[td[normalize-space()='Effective Tax Rate (TTM)']]/td[3]/text()").extract()
        item['effective_tax_rate_ttm_sector'] = response.xpath("//tr[td[normalize-space()='Effective Tax Rate (TTM)']]/td[4]/text()").extract()
        item['effective_tax_rate_5_yr_avg_company'] = response.xpath("//tr[td[normalize-space()='Effective Tax Rate - 5 Yr. Avg.']]/td[2]/text()").extract()
        item['effective_tax_rate_5_yr_avg_industry'] = response.xpath("//tr[td[normalize-space()='Effective Tax Rate - 5 Yr. Avg.']]/td[3]/text()").extract()
        item['effective_tax_rate_5_yr_avg_sector'] = response.xpath("//tr[td[normalize-space()='Effective Tax Rate - 5 Yr. Avg.']]/td[4]/text()").extract()
        yield item
All of the XPaths and links are right, I just can't seem to yield anything at all now. I have a feeling there is an obvious mistake here. This is my first try at a more elaborate spider.
Any help would be greatly appreciated! Thank you!
***EDIT 2
{'statisticskey': 'AMRB', 'shares_outstanding': '5.97M', 'shares_float':
'5.08M', 'short_float': '0.04%', 'short_ratio': '0.63',
'institutional_ownership': '10.50%', 'institutional_transactions': '2.74%',
'employees': '101', 'recommendation': '2.30'}
2019-03-06 18:45:19 [scrapy.core.scraper] DEBUG: Scraped from <200
https://www.finviz.com/quote.ashx?t=AMR&ty=c&p=d&b=1>
{'statisticskey': 'AMR', 'shares_outstanding': '154.26M', 'shares_float':
'89.29M', 'short_float': '13.99%', 'short_ratio': '4.32',
'institutional_ownership': '0.10%', 'institutional_transactions': '-',
'employees': '-', 'recommendation': '3.00'}
2019-03-06 18:45:19 [scrapy.core.scraper] DEBUG: Scraped from <200
https://www.finviz.com/quote.ashx?t=AMD&ty=c&p=d&b=1>
{'statisticskey': 'AMD', 'shares_outstanding': '1.00B', 'shares_float':
'997.92M', 'short_float': '11.62%', 'short_ratio': '1.27',
'institutional_ownership': '0.70%', 'institutional_transactions': '-83.83%',
'employees': '10100', 'recommendation': '2.50'}
2019-03-06 18:45:19 [scrapy.core.scraper] DEBUG: Scraped from <200
https://www.finviz.com/quote.ashx?t=AMCX&ty=c&p=d&b=1>
{'statisticskey': 'AMCX', 'shares_outstanding': '54.70M', 'shares_float':
'43.56M', 'short_float': '20.94%', 'short_ratio': '14.54',
'institutional_ownership': '3.29%', 'institutional_transactions': '0.00%',
'employees': '1872', 'recommendation': '3.00'}
2019-03-06 18:45:19 [scrapy.core.scraper] DEBUG: Scraped from <200
https://www.finviz.com/screener.ashx?v=111&f=geo_bermuda>
{'effective_tax_rate_ttm_company': [], 'effective_tax_rate_ttm_industry':
[], 'effective_tax_rate_ttm_sector': [],
'effective_tax_rate_5_yr_avg_company': [],
'effective_tax_rate_5_yr_avg_industry': [],
'effective_tax_rate_5_yr_avg_sector': []}
2019-03-06 18:45:25 [scrapy.core.scraper] DEBUG: Scraped from <200
https://www.finviz.com/screener.ashx?v=111&f=geo_china>
{'effective_tax_rate_ttm_company': [], 'effective_tax_rate_ttm_industry':
[], 'effective_tax_rate_ttm_sector': [],
'effective_tax_rate_5_yr_avg_company': [],
'effective_tax_rate_5_yr_avg_industry': [],
'effective_tax_rate_5_yr_avg_sector': []}
*** EDIT 3
I managed to actually have the spider travel to url2 and yield the items there. The problem is that it only does so rarely. Most of the time it redirects to the correct link and gets nothing, or doesn't seem to redirect at all and just continues on. I'm not really sure why there is such inconsistency here.
2019-03-06 20:11:57 [scrapy.core.scraper] DEBUG: Scraped from <200
https://www.reuters.com/finance/stocks/financial-highlights/BCACU.A>
{'effective_tax_rate_ttm_company': ['--'],
'effective_tax_rate_ttm_industry': ['4.63'],
'effective_tax_rate_ttm_sector': ['20.97'],
'effective_tax_rate_5_yr_avg_company': ['--'],
'effective_tax_rate_5_yr_avg_industry': ['3.98'],
'effective_tax_rate_5_yr_avg_sector': ['20.77']}
The other thing is, I know I've managed to yield a few values on url2 successfully, though they don't appear in my CSV output. I realize this could be an export issue. I have updated my code above to its current state.
url2 is a relative path, but scrapy.Request expects a full URL.
Try this:
yield Request(
    response.urljoin(url2),
    callback=self.parse_dir_stats)
Or even simpler:
yield response.follow(url2, callback=self.parse_dir_stats)
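To make the difference concrete, a toy example (the paths are made up for illustration): response.urljoin resolves a relative href against response.url, which is also what response.follow does internally, while scrapy.Request refuses a bare relative path:
from urllib.parse import urljoin

# Made-up values, for illustration only.
page_url = 'https://www.finviz.com/quote.ashx?t=AMD'
url2 = '/finance/stocks/financial-highlights/AMD'  # relative href pulled from the page

print(urljoin(page_url, url2))
# -> https://www.finviz.com/finance/stocks/financial-highlights/AMD
# response.urljoin(url2) does the same thing, using response.url as the base;
# scrapy.Request(url2) would instead raise "ValueError: Missing scheme in request url".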
Please help me optimize my Scrapy spider. In particular, the next-page pagination is not working. There are a lot of pages, and each page has 50 items.
I catch the first page's 50 items (links) in parse_items, and the next pages' items should also be scraped in parse_items.
import scrapy
from scrapy import Field
from fake_useragent import UserAgent


class DiscoItem(scrapy.Item):
    release = Field()
    images = Field()


class discoSpider(scrapy.Spider):
    name = 'myspider'
    allowed_domains = ['discogs.com']
    query = input('ENTER SEARCH MUSIC TYPE : ')
    start_urls = ['http://www.discogs.com/search?q=%s&type=release' % query]
    custome_settings = {
        'User-Agent': "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36",
        'handle_httpstatus_list': [301, 302, ],
        'download_delay': 10}

    def start_requests(self):
        yield scrapy.Request(url=self.start_urls[0], callback=self.parse)

    def parse(self, response):
        print('START parse \n')
        print("*****", response.url)

        # next page pagination
        next_page = response.css('a.pagination_next::attr(href)').extract_first()
        next_page = response.urljoin(next_page)
        yield scrapy.Request(url=next_page, callback=self.parse_items2)

        headers = {}
        for link in response.css('a.search_result_title ::attr(href)').extract():
            ua = UserAgent()  # random user agent
            headers['User-Agent'] = ua.random
            yield scrapy.Request(response.urljoin(link), headers=headers, callback=self.parse_items)

    def parse_items2(self, response):
        print('parse_items2 *******', response.url)
        yield scrapy.Request(url=response.url, callback=self.parse)

    def parse_items(self, response):
        print("parse_items**********", response.url)
        items = DiscoItem()
        for imge in response.css('div#page_content'):
            img = imge.css("span.thumbnail_center img::attr(src)").extract()[0]
            items['images'] = img
            release = imge.css('div.content a ::text').extract()
            items['release'] = release[4]
            yield items
When I try running your code (after fixing the many indentation, spelling and letter-case errors), this line shows up in Scrapy's log:
2018-03-05 00:47:28 [scrapy.dupefilters] DEBUG: Filtered duplicate request: <GET https://www.discogs.com/search/?q=rock&type=release&page=2> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
Scrapy filters duplicate requests by default, and your parse_items2() method does nothing but create duplicate requests. I fail to see any reason for that method to exist.
What you should do instead is specify the parse() method as the callback for your pagination requests, and avoid having an extra method that does nothing:
yield scrapy.Request(url=next_page, callback=self.parse)
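Putting that together, a sketch of a single parse() method that yields both the item requests and the next-page request (the selectors are taken from the question; untested against the live site):
def parse(self, response):
    # One request per search result, each scraped by parse_items().
    for link in response.css('a.search_result_title ::attr(href)').extract():
        yield scrapy.Request(response.urljoin(link), callback=self.parse_items)

    # Follow the pagination link back into parse() itself; no helper method needed.
    next_page = response.css('a.pagination_next::attr(href)').extract_first()
    if next_page:
        yield scrapy.Request(response.urljoin(next_page), callback=self.parse)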
Try this for pagination:
try:
    nextpage = response.urljoin(response.xpath("//*[contains(@rel,'next') and contains(@id,'next')]/@url")[0].extract())
    yield scrapy.Request(nextpage, callback=self.parse)
except:
    pass