I want to scrape anchors a from container class with scrapy

<div class="breadcrumbs">
<div class="container">
Home
<span class="divider"> </span>
Special Occasion Dresses
<span class="divider"> </span>
Evening Dresses
<span class="divider"> </span>
Formal Evening Dresses
<span class="divider"> </span>
<strong>Deep V-neck Yellow Long Prom Dress Sleeveless Satin Evening Dress</strong>
</div>
I want to scrape the third anchor from the container class, but I am unable to scrape it. I used the selector response.css('.breadcrumbs div.container a').getall() to scrape all the anchors, but I get only the first one. I am a beginner and need help scraping all of these anchors.

This is pretty simple using XPath expressions.
If you want to get an anchor by position:
third_url = response.xpath('//div[@class="container"]/a[3]/@href').get()
If you want to get an anchor by the text of the link:
evening_dresses_url = response.xpath('//div[@class="container"]/a[.="Evening Dresses"]/@href').get()
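The two XPath patterns above can be tried offline. The following is a minimal sketch, not Scrapy itself: it uses Python's standard-library ElementTree as a stand-in so that it runs without a crawl, and the href values in the sample markup are invented because the question's snippet lost its <a> tags. ElementTree does not support a trailing /@href step, so the attribute is read with .get("href") instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: the hrefs are invented for illustration only.
HTML = """
<div class="breadcrumbs">
  <div class="container">
    <a href="/">Home</a>
    <span class="divider"> </span>
    <a href="/special-occasion-dresses">Special Occasion Dresses</a>
    <span class="divider"> </span>
    <a href="/evening-dresses">Evening Dresses</a>
    <span class="divider"> </span>
    <a href="/formal-evening-dresses">Formal Evening Dresses</a>
    <span class="divider"> </span>
    <strong>Deep V-neck Yellow Long Prom Dress</strong>
  </div>
</div>
"""

root = ET.fromstring(HTML)
container = ".//div[@class='container']"

# All anchors directly inside the container.
all_urls = [a.get("href") for a in root.findall(container + "/a")]

# Anchor by position: a[3] is the third <a> child, counting from 1.
third_url = root.find(container + "/a[3]").get("href")

# Anchor by link text.
by_text_url = root.find(container + "/a[.='Evening Dresses']").get("href")

print(all_urls)
print(third_url, by_text_url)
```

In Scrapy itself, the same idea for every link would presumably be response.xpath('//div[@class="container"]/a/@href').getall().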

Related

How to scrape a url using a LinkExtractor that isn't a full url?

I'm trying to scrape all the matches for this 2013 tennis tournament:
https://www.oddsportal.com/tennis/argentina/atp-buenos-aires-2013/results/
It has two pages and I'm trying to scrape both of them. However, the HTML doesn't seem to provide the full links:
<div id="pagination">
<a href="#/" x-page="1">
<span class="arrow">|«</span>
</a>
<a href="#/" x-page="1">
<span class="arrow">«</span>
</a>
<span class="active-page">1</span>
<a href="#/page/2/" x-page="2">
<span>2</span>
</a>
<a href="#/page/2/" x-page="2">
<span class="arrow">»</span>
</a>
<a href="#/page/2/" x-page="2">
<span class="arrow">»|</span>
</a>
</div>
When I hover over the link in Firefox, I can see the full URL, so it must be stored somewhere!
How would I go about configuring a LinkExtractor() to scrape both the pages?
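A short sketch of why the hovered URL looks complete even though the href is only a fragment. It uses only the standard library; the base URL and href values are taken from the question. The browser resolves each href against the page address, which urljoin reproduces here.

```python
from urllib.parse import urljoin, urldefrag

# URL of the results page, taken from the question.
BASE_URL = "https://www.oddsportal.com/tennis/argentina/atp-buenos-aires-2013/results/"

# href values as they appear in the pagination markup.
hrefs = ["#/", "#/page/2/"]

# Resolving a fragment-only href against the page URL gives the address
# the browser displays on hover.
full_urls = [urljoin(BASE_URL, href) for href in hrefs]

# The fragment (everything after '#') is not sent to the server, so after
# stripping it every pagination link points at the same document.
request_urls = {urldefrag(url).url for url in full_urls}

print(full_urls)
print(request_urls)
```

Because both links reduce to the same request URL, a LinkExtractor would likely find nothing new to follow. The second page is presumably loaded by JavaScript (an assumption, not verified against the site), so the underlying request would have to be found in the browser's network tab, or the page rendered with a browser.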

python selenium get value of specific element

<li id="button1" class="on">
<div class="supply1">
<div class="buildingimg">
<a class="fastBuild tooltip js_hideTipOnMobile" title="Metallmine auf Stufe 4 ausbauen" href="javascript:void(0);" onclick="sendBuildRequest('https://s159-de.ogame.gameforge.com/game/index.php?page=resources&modus=1&type=1&menge=1&token=0c86d8a8bf9a5c559538b0e13cb462b4', null, 1);">
<img src="https://gf2.geo.gfsrv.net/cdndf/3e567d6f16d040326c7a0ea29a4f41.gif" width="22" height="14">
</a>
<a class="detail_button tooltip js_hideTipOnMobile slideIn" title="" ref="1" id="details" href="javascript:void(0);">
<span class="ecke">
<span class="level">
<span class="textlabel">
**Metallmine**
</span>
**3** </span>
</span>
</a>
</div>
</div>
</li>
<li id="button2" class="on">
<div class="supply2">
<div class="buildingimg">
<a class="fastBuild tooltip js_hideTipOnMobile" title="" href="javascript:void(0);" onclick="sendBuildRequest('https://s159-de.ogame.gameforge.com/game/index.php?page=resources&modus=1&type=2&menge=1&token=0c86d8a8bf9a5c559538b0e13cb462b4', null, 1);">
<img src="https://gf2.geo.gfsrv.net/cdndf/3e567d6f16d040326c7a0ea29a4f41.gif" width="22" height="14">
</a>
<a class="detail_button tooltip js_hideTipOnMobile slideIn" title="" ref="2" id="details" href="javascript:void(0);">
<span class="ecke">
<span class="level">
<span class="textlabel">
**Kristallmine**
</span>
**1** </span>
</span>
</a>
</div>
</div>
</li>
Dear Community,
So I want to create a bot for a browser game (just for learning purposes, of course). In the game you can build and level up metal and crystal mines to get more resources. For the best resource proportions, the metal mine should always be 2 levels higher than the crystal mine. Writing the code to compare the levels is no problem, but I'm having trouble accessing the actual "level" values of the mines, since they have no unique attribute.
In the code above you can see "Metallmine" and "Kristallmine" and the corresponding levels. I would like to write code similar to:
if LevelOfKristallmine - LevelOfMetallmine < -2:
    driver.find_element_by_whatever('upgradebutton').click()
How can I get the values of LevelOfKristallmine and LevelOfMetallmine?
Thanks a lot for your answers!

I assume you are trying to use the ID to get the values? Instead, copy the element's XPath and use it, with something like:
driver.find_element_by_xpath('//*[@id="example-xpath"]/div/nav/ol').click()
To copy the XPath: press F12, find the element you want, right-click it, and choose Copy > XPath. Then paste it inside the parentheses. Follow this other link and you should be able to figure it out, mate.
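Once the text of each "level" span has been read, the number still has to be pulled out of it. A minimal sketch of that step, assuming the span text looks like "Metallmine 3"; the helper name parse_level and the sample strings are hypothetical, and the Selenium call appears only as a comment so the block runs without a browser.

```python
import re


def parse_level(level_text: str) -> int:
    """Return the last integer found in text such as 'Metallmine 3'."""
    numbers = re.findall(r"\d+", level_text)
    if not numbers:
        raise ValueError(f"no level found in {level_text!r}")
    return int(numbers[-1])


# With Selenium 4 the text could be read with something like this
# (selectors assumed from the HTML in the question):
#   from selenium.webdriver.common.by import By
#   metal_text = driver.find_element(By.CSS_SELECTOR, "#button1 span.level").text
metal_text = "Metallmine 3"      # stand-in for the text under #button1
crystal_text = "Kristallmine 1"  # stand-in for the text under #button2

metal_level = parse_level(metal_text)
crystal_level = parse_level(crystal_text)

# How many levels the metal mine is ahead of the crystal mine.
metal_lead = metal_level - crystal_level
print(metal_level, crystal_level, metal_lead)
```

Scoping the selector to #button1 and #button2 sidesteps the problem that the level spans themselves carry no unique attribute.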

Searching text knowing <i tag class

I need to get div text with class _50x4 using 5pxsel:
<div...>
<i class="5pxsel">
<div>
<div>
<div class="_50x4">
Work in
<a>London</a>
<div class="_50x4">
Work in
<a> Germany </a>
I need to get text using class 5pxsel, not _50x4, and get only first result - 'Work in London'.

Try the following XPath:
//*[@class="5pxsel"]/following-sibling::div/div/div[@class='_50x4']
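A sketch of what that expression selects, under the assumption that the <i> is a sibling of the outer <div> (the question's markup is not closed, so the real nesting is unknown). It uses the standard-library ElementTree, which has no following-sibling axis, so that step is emulated in plain Python here.

```python
import xml.etree.ElementTree as ET

# Assumed, well-formed version of the question's markup.
HTML = """
<div>
  <i class="5pxsel"></i>
  <div>
    <div>
      <div class="_50x4">Work in <a>London</a></div>
      <div class="_50x4">Work in <a> Germany </a></div>
    </div>
  </div>
</div>
"""

root = ET.fromstring(HTML)

# ElementTree elements do not know their parent, so build a child -> parent map.
parent_of = {child: parent for parent in root.iter() for child in parent}

marker = root.find(".//i[@class='5pxsel']")
siblings = list(parent_of[marker])
following = siblings[siblings.index(marker) + 1:]  # emulates following-sibling::*

# Equivalent of .../following-sibling::div/div/div[@class='_50x4']
matches = [
    target
    for sibling in following
    if sibling.tag == "div"
    for target in sibling.findall("div/div[@class='_50x4']")
]

# Only the first result is wanted; collapse runs of whitespace in its text.
first_text = " ".join("".join(matches[0].itertext()).split())
print(first_text)
```

With a full XPath 1.0 engine, wrapping the expression in parentheses and appending [1] should give the same first match.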

How to implement following-sibling axis of xpath alternative in Beautifulsoup Python

I'm trying to collect text using bs4, Selenium, and Python. I want to get the text "Lisa Staprans" using:
name = str(profilePageSource.find(class_="hzi-font hzi-Man-Outline").div.get_text().encode("utf-8"))[2:-1]
Here is the HTML:
<div class="profile-about-right">
<div class="text-bold">
SF Peninsula Interior Design Firm
<br/>
Best of Houzz 2015
</div>
<br/>
<div class="page-tags" style="display:none">
page_type: pro_plus_profile
</div>
<div class="pro-info-horizontal-list text-m text-dt-s">
<div class="info-list-label">
<i class="hzi-font hzi-Ruler">
</i>
<div class="info-list-text">
<span class="hide" itemscope="" itemtype="http://data-vocabulary.org/Breadcrumb">
<a href="http://www.houzz.com/professionals/c/Menlo-Park--CA" itemprop="url">
<span itemprop="title">
Professionals
</span>
</a>
</span>
<span itemprop="child" itemscope="" itemtype="http://data-vocabulary.org/Breadcrumb">
<a href="http://www.houzz.com/professionals/interior-designer/c/Menlo-Park--CA" itemprop="url">
<span itemprop="title">
Interior Designers & Decorators
</span>
</a>
</span>
</div>
</div>
<div class="info-list-label">
<i class="hzi-font hzi-Man-Outline">
</i>
<div class="info-list-text">
<b>
Contact
</b>
: Lisa Staprans
</div>
</div>
</div>
</div>
Please let me know how this could be done.

I assume you are using BeautifulSoup, since you are using the class_ keyword argument.
If there is only one element with the class name hzi-font hzi-Man-Outline, then try:
str(profilePageSource.find(class_="hzi-font hzi-Man-Outline").findNext('div').get_text().split(":")[-1]).strip()
This extracts 'Lisa Staprans'.
Here findNext moves to the next div, and the text is extracted from it.
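The string handling at the end of that one-liner can be checked on its own. A small sketch, assuming get_text() returns the div's text with its surrounding whitespace; the sample string and the helper value_after_label are hypothetical.

```python
# Text as get_text() might return it for the "info-list-text" div
# (assumed from the HTML in the question).
raw_text = "\n\nContact\n\n: Lisa Staprans\n"

# Approach from the answer: keep what follows the last colon.
name_from_split = raw_text.split(":")[-1].strip()


def value_after_label(text: str) -> str:
    """Variant: keep everything after the first colon instead."""
    _label, _sep, value = text.partition(":")
    return value.strip()


name_from_partition = value_after_label(raw_text)
print(name_from_split, name_from_partition)
```

The two versions differ only when the value itself contains a colon: split(":")[-1] keeps the part after the last colon, while partition keeps everything after the first.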

I can't test it right now, but I would do:
profilePageSource.find_element_by_class_name("info-list-text").get_attribute('innerHTML')
Then you will have to split the result on the ":" (if it is always present).
For more information: https://selenium-python.readthedocs.org/en/latest/navigating.html

Maybe something is wrong with this part:
find(class_="hzi-font hzi-Man-Outline")
An easy way to get the right information is to inspect the element you need in Google Chrome, right-click it in the page source, copy its XPath, and then use:
profilePageSource.find_element_by_xpath(<xpath copied from Chrome>).text
Hope it helps.

Extract link in beautiful soup [duplicate]

This question already has answers here:
retrieve links from web page using python and BeautifulSoup [closed]
(16 answers)
Closed 7 years ago.
I am new to Beautiful Soup and am trying to figure out how to pull a website URL from nested markup. The website can be found twice under the "track-visit-website" class.
This is NOT a duplicate of the question asking how to pull hrefs. I've done that successfully on this page. I am trying to isolate the actual company website.
I've tried several approaches but can't get any of them to work. Here is an example:
print(item.contents[2].find_all("a", {"class": "track-visit-website"})[0].a)
The site is YP.com Septic Search.
Here's the code from one of the items on the site:
<div class="info">
<h3 class="n">
<div class="info-section info-primary">
<p class="adr" itemprop="address" itemtype="http://schema.org/PostalAddress" itemscope="">
<span class="street-address" itemprop="streetAddress">2806 Farview Dr</span>
<span class="locality" itemprop="addressLocality">Fort Collins, </span>
<span itemprop="addressRegion">CO</span>
<span itemprop="postalCode">80524</span>
</p>
<div class="phones phone primary" itemprop="telephone">(970) 829-0852</div>
</div>
<div class="info-section info-secondary">
<div class="categories">
<div class="links">
<a class="track-visit-website" data-analytics="{"click_id":6,"act":2,"dku":"http://www.affordablesepticanddraincleaning.com","FL":"url","TL":"off","target":"website","LOC":"http://www.affordablesepticanddraincleaning.com"}" target="_blank" rel="nofollow" href="http://www.affordablesepticanddraincleaning.com" data-impressed="1">Website</a>
<a class="track-map-it directions" data-analytics="{"click_id":13,"target":"website","act":4}" href="/listings/1000775636908/directions" data-impressed="1">Directions</a>
<a class="track-more-info" data-analytics="{"click_id":7,"target":"moreInfo","act":1,"FL":"list"}" href="/fort-collins-co/mip/affordable-septic-drain-cleaning-llc-505109997?lid=1000775636908" data-impressed="1">More Info</a>
</div>

Copy this code snippet into a Python file and run it:
import re
content = """
<div class="info">
<h3 class="n">
<div class="info-section info-primary">
<p class="adr" itemprop="address" itemtype="http://schema.org/PostalAddress" itemscope="">
<span class="street-address" itemprop="streetAddress">2806 Farview Dr</span>
<span class="locality" itemprop="addressLocality">Fort Collins, </span>
<span itemprop="addressRegion">CO</span>
<span itemprop="postalCode">80524</span>
</p>
<div class="phones phone primary" itemprop="telephone">(970) 829-0852</div>
</div>
<div class="info-section info-secondary">
<div class="categories">
<div class="links">
<a class="track-visit-website" data-analytics="{"click_id":6,"act":2,"dku":"http://www.affordablesepticanddraincleaning.com","FL":"url","TL":"off","target":"website","LOC":"http://www.affordablesepticanddraincleaning.com"}" target="_blank" rel="nofollow" href="http://www.affordablesepticanddraincleaning.com" data-impressed="1">Website</a>
<a class="track-map-it directions" data-analytics="{"click_id":13,"target":"website","act":4}" href="/listings/1000775636908/directions" data-impressed="1">Directions</a>
<a class="track-more-info" data-analytics="{"click_id":7,"target":"moreInfo","act":1,"FL":"list"}" href="/fort-collins-co/mip/affordable-septic-drain-cleaning-llc-505109997?lid=1000775636908" data-impressed="1">More Info</a>
</div>
"""
websites = set(re.findall(r'http://[a-zA-Z0-9\.]*\.[a-z]{2,}',content)) # find all urls in the content
websites = list(websites)
print(websites) # or in python2 => print websites
Now find a way to incorporate that into your code: get the HTML, save it as content, run the regex over it, and save the result to a file.
For web scraping, it helps to know regex.
Read up on regex; a good tutorial is here: regex tutorial
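As an alternative to a regex over the whole page, the href can be read from the specific anchor the question names. This is a hedged sketch using the standard library's HTMLParser rather than Beautiful Soup, so it runs without extra packages; the class WebsiteLinkParser is hypothetical, and the sample markup is a trimmed copy of the question's HTML with the data-analytics attributes left out.

```python
from html.parser import HTMLParser


class WebsiteLinkParser(HTMLParser):
    """Collect the href of every <a> whose class list has 'track-visit-website'."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        classes = (attributes.get("class") or "").split()
        href = attributes.get("href")
        if "track-visit-website" in classes and href:
            self.links.append(href)


# Trimmed sample of the "links" block from the question.
SAMPLE = """
<div class="links">
<a class="track-visit-website" target="_blank" rel="nofollow" href="http://www.affordablesepticanddraincleaning.com">Website</a>
<a class="track-map-it directions" href="/listings/1000775636908/directions">Directions</a>
<a class="track-more-info" href="/fort-collins-co/mip/affordable-septic-drain-cleaning-llc-505109997?lid=1000775636908">More Info</a>
</div>
"""

parser = WebsiteLinkParser()
parser.feed(SAMPLE)
website_links = parser.links
print(website_links)
```

Matching on the class keeps the Directions and More Info links out of the result, which a URL regex over the whole page cannot guarantee. In Beautiful Soup the same lookup would presumably be find_all("a", class_="track-visit-website") followed by reading each tag's "href".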
