I am using BeautifulSoup.
I would like to extract coordinates from a website. The page's HTML looks like this:
<a class="button button--outline link link--emphasis button-full-width js-choose-store" href="/sklep?StoreID=R034" title="Informacje o sklepie">Informacje o sklepie</a>
</div>
</div>
</div>
</div>
<div class="storelist__item ui-expandable js-accordion-store js-store" data-lat="52.225155" data-lng="20.998965" data-icon="/on/demandware.static/Sites-Hebe-Site/-/default/dw081970e9/images/map_markers/hebe.png" data-id="R379" data-coming-soon="false" data-index="81">
<div class="visually-hidden" data-popup-html>
<div class="store-popup">
<div class="store-popup__name text--uppercase">Drogeria Hebe</div>
<div class="store-popup__address">Lindleya 16</div>
<div class="store-popup__city">Warszawa, 02-013</div>
<div class="store-popup__directions">
I need to get 'data-lat' and 'data-lng'.
I had no problem getting the address or name of an object (those are plain text), using for example:
find("div", {"class": "store-popup__city"}).text
Try something along the lines of:
dat = soup.select_one('div[data-lat]')
print(dat['data-lat'],dat['data-lng'])
Output:
52.225155 20.998965
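If the page lists several stores, the same attribute selector can collect every coordinate pair. A minimal sketch, assuming each store div carries both attributes; the second store in the sample HTML is made up for illustration:

```python
from bs4 import BeautifulSoup

# Trimmed sample markup; the second store is hypothetical.
html = """
<div class="storelist__item js-store" data-lat="52.225155" data-lng="20.998965" data-id="R379">
  <div class="store-popup__address">Lindleya 16</div>
</div>
<div class="storelist__item js-store" data-lat="52.100000" data-lng="21.000000" data-id="X001">
  <div class="store-popup__address">Example 1</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# select() returns every div that has both attributes; attribute values are strings.
coords = [
    (float(div["data-lat"]), float(div["data-lng"]))
    for div in soup.select("div[data-lat][data-lng]")
]
print(coords)
```

select_one() returns only the first match, so select() is the one to use when more than one store is expected.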
Related
I read some web page content in HTML that has the following form:
<div class="cart">
<div class="cart-title">
<img src="https://ug3.technion.ac.il/rishum/img/regCourses.png" width="50" height="50" alt="My Courses">
המקצועות שלי
</div><div class="entry-spacer"></div><div class="cart-entry">
<div class="course-number">
104134
</div>
<div class="course-name">
אלגברה מודרנית ח
</div>
<div class="course-points">
2.5 נק'
</div>
<div class="entry-group">
קבוצה 11
</div><div class="change-group">
שנה קבוצה ל
<select name="UPG104134" onchange="showWaitAndSubmit('regCart')" class="change-group-options">
<option value=""> </option><option>12</option><option>13</option><option>21</option><option>22</option><option>23</option>
</select>
</div><div class="more-actions">
</div>
<div class="clear"></div></div><div class="entry-spacer"></div><div class="cart-entry">
<div class="course-number">
234118
</div>
<div class="course-name">
ארגון ותכנות המחשב
</div>
<div class="course-points">
3 נק'
</div>
<div class="entry-group">
קבוצה 22
</div><div class="change-group">
שנה קבוצה ל
<select name="UPG234118" onchange="showWaitAndSubmit('regCart')" class="change-group-options">
<option value=""> </option><option>11</option><option>12</option><option>13</option><option>14</option><option>21</option>
</select>
</div><div class="more-actions">
</div>
<div class="clear"></div></div><div>
Now the question is: how can I read the course numbers, which appear in blue in my image?
Here's an example of how course number appears in the webpage:
<div class="course-number">
104134
</div>
and I want to read: 104134 in this example
First, I'd advise using BeautifulSoup to parse the HTML and then, off the top of my head, searching for the div tags with that class name, like this:
import requests
from bs4 import BeautifulSoup
url = "<your-target>"  # placeholder: the address of the page to read
r = requests.get(url)
soup = BeautifulSoup(r.text, 'lxml')
# the number is the div's own text, so read it directly and strip the whitespace
numbers = [i.get_text(strip=True) for i in soup.find_all('div', attrs={"class": "course-number"})]
I didn't check this, but if it doesn't work as-is, it should point you toward a solution. Check BeautifulSoup's documentation for more information.
Note that the divs in the question contain no a tag, so something like i.a.text would raise an error there (i.a is None). If you don't trust the structure of the website to always stay the same, better use a normal for-loop with a try-except, or deal with missing elements in some other way.
Beware that the previous method will obtain all div tags with class course-number. You may want only a subset of those, so you should either apply more filtering or traverse the HTML tree first until you get to the root of your target content.
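One way to narrow the search is to find the container first and only look inside it. A minimal sketch, assuming the course entries live inside the div with class cart; the stray course-number div outside the cart is made up to show that it gets skipped:

```python
from bs4 import BeautifulSoup

# Trimmed sample markup; the last div (999999) is hypothetical.
html = """
<div class="cart">
  <div class="cart-entry"><div class="course-number">
104134
  </div></div>
  <div class="cart-entry"><div class="course-number">
234118
  </div></div>
</div>
<div class="course-number">999999</div>
"""

soup = BeautifulSoup(html, "html.parser")

numbers = []
cart = soup.find("div", class_="cart")  # root of the content we care about
if cart is not None:                    # guard against a changed page layout
    for div in cart.find_all("div", class_="course-number"):
        numbers.append(div.get_text(strip=True))

print(numbers)
```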
<div class="class-one">
<div class="class-two">
sample text
<div class="class-three">
<i class="fal fa-fw fa-file-word"></i><span class="button__title">search</span>
</div>
</div>
</div>
When I call driver.find_elements_by_css_selector('div.class-two'), it prints both sample text and search. How can I get only sample text using Selenium in Python?
Add .text at the end to read the element's visible text (this still includes the text of nested elements):
driver.find_element_by_css_selector('div.class-two').text
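Because .text covers descendants as well, getting only the element's own text takes one more step. A rough sketch of one approach: subtract the text of the direct children from the parent's text. own_text is a hypothetical helper, not part of Selenium, and it assumes a child's text does not also occur earlier in the parent's own text:

```python
def own_text(element):
    """Return element.text with the visible text of its direct children removed."""
    text = element.text
    # "./*" selects the direct child elements; "xpath" is the locator strategy name.
    for child in element.find_elements("xpath", "./*"):
        child_text = child.text
        if child_text:
            # Remove one occurrence per child so repeated words elsewhere survive.
            text = text.replace(child_text, "", 1)
    return text.strip()
```

With a live driver this could be called as own_text(driver.find_element("css selector", "div.class-two")); the find_element_by_* helpers used above were removed in Selenium 4.3, so newer versions need the two-argument form.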
HTML:
<div id="searchResult">
<div class="buySearchResultContent">
<div class="buySearchResultContentImg">
<a href="carinfo-333285.php">
<img src="carpics/9400180056/290x200/20180305101502854_4567823.jpg" srcset="carpics/9400180056/290x200/20180305101502854_9098765.jpg 290w, carpics/9400180056/435x300/20180305101502854_00000.jpg 435w , carpics/9400180056/720x520/20180305101502854_00001.jpg 720w" sizes="(min-width: 992px) 75vw, 90vw" alt="auto">
</a>
</div>
<div class="buySearchResultContentImg">
<a href="carinfo-333286.php">
<img src="carpics/9400180056/290x200/20180305101502854_4567824.jpg" srcset="carpics/9400180056/290x200/20180305101502854_9098766.jpg 290w, carpics/9400180056/435x300/20180305101502854_00000.jpg 436w , carpics/9400180056/720x520/20180305101502854_00001.jpg 721w" sizes="(min-width: 992px) 75vw, 90vw" alt="auto">
</a>
</div>
</div>
</div>
What I am trying to do is extract two hrefs, but with my code, I can only extract the first one.
Code:
driver.find_element_by_css_selector("buySearchResultContentImg > div").get_attribute("href")
Try the code below to get a list of @href values:
links = [link.get_attribute("href") for link in driver.find_elements_by_css_selector(".buySearchResultContentImg>a")]
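The difference is that find_element (singular) returns only the first match, while find_elements (plural) returns a list. A small sketch of the same idea in the two-argument form that Selenium 4 uses; collect_hrefs is a hypothetical helper name:

```python
def collect_hrefs(driver, css_selector):
    """Return the href of every element matching css_selector, in page order."""
    # find_elements (plural) returns a list of all matches;
    # find_element (singular) would return only the first one.
    elements = driver.find_elements("css selector", css_selector)
    return [el.get_attribute("href") for el in elements]
```

For the HTML in the question it could be called as collect_hrefs(driver, ".buySearchResultContentImg > a").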
So I'm new to Scrapy and am looking to do something which is proving a little too ambitious. I'm hoping somebody out there can help guide me on how to gather and parse the info I'm after from this website.
I need to obtain the following:
label1
4810 (this is generated dynamically)
Business name
Name
Address1
Address2
Address3
Address4
Postcode
0800 111111
me@domain.com
Is this even possible using scrapy?
Many thanks in advance.
<div class="mbg">
<a href="http://www.domain.com" aria-label="label1"> <span class="nw1">Label13345</span>
</a>
<span class="mbg-l">
4810
<img
alt="4810"
title="4810"
src="http://www.domain.com/image1"></span>
</div>
<div id="bsi-c" class=" bsi-c-uk-bislr">
<div class="bsi-cnt">
<div class="bsi-ttl section-ttl">
<h2>Info</h2>
<div class="rd-sep"></div>
</div>
<div class="bsi-bn">Business name</div>
<div class="bsi-cic">
<div id="bsi-ec" class="u-flL">
<span class="bsi-arw"></span>
<span class="bsi-cdt">Contact details</span>
</div>
<div id="e8" class="u-flL bsi-ci">
<div class="bsi-c1">
<div>Name</div>
<div>Address1</div>
<div>Address2</div>
<div>Address3</div>
<div>Address4</div>
<div>Postcode</div>
</div>
<div class="bsi-c2">
<br></br>
<div>
<span class="bsi-lbl">Phone:</span>
<span>0800 111111</span>
</div>
<div>
<span class="bsi-lbl">Email:</span>
<span>me@domain.com</span>
</div>
</div>
</div>
</div>
An example of parsing the already received page might look something like this:
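import lxml.html
page = """<div><span> . . .</span></div> """
doc = lxml.html.document_fromstring(page)
# get label1 (the aria-label attribute of the link)
label = doc.cssselect('.mbg a')[0].get('aria-label')
# get 4810 (the text of the span that follows the link)
number = doc.cssselect('.mbg .mbg-l')[0].text_content().strip()
# get address
address = doc.cssselect('.u-flL .bsi-c1')[0].text_content()
# get phone (the span right after the "Phone:" label)
phone = doc.cssselect('.bsi-c2 .bsi-lbl + span')[0].text_content()
# get mail (the span right after the "Email:" label)
mail = doc.cssselect('.bsi-c2 .bsi-lbl + span')[1].text_content()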
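If the page has to be retrieved from the network, it can be done like this:
import requests, lxml.html
# 'site_.com' is a placeholder; a real URL needs a scheme such as https://
page = requests.get('site_.com')
doc = lxml.html.document_fromstring(page.text)
# the value sits in the span right after the "Phone:" label
phone = doc.cssselect('.bsi-c2 .bsi-lbl + span')[0].text_content()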
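If the cssselect package is not installed, the same fields can also be reached with lxml's built-in XPath support. A minimal sketch on a trimmed copy of the markup from the question; the XPath expressions are one possible choice, not the only one:

```python
import lxml.html

# Trimmed copy of the markup from the question.
page = """
<div class="mbg">
  <a href="http://www.domain.com" aria-label="label1"> <span class="nw1">Label13345</span></a>
  <span class="mbg-l">
    4810
    <img alt="4810" title="4810" src="http://www.domain.com/image1"></span>
</div>
<div class="bsi-c2">
  <div><span class="bsi-lbl">Phone:</span> <span>0800 111111</span></div>
</div>
"""

doc = lxml.html.document_fromstring(page)

# The label is an attribute of the link, not its text.
label = doc.xpath('//div[@class="mbg"]/a/@aria-label')[0]
# string() concatenates the text inside the span; strip the surrounding whitespace.
number = doc.xpath('string(//span[@class="mbg-l"])').strip()
# The value is the first span that follows the "Phone:" label span.
phone = doc.xpath('//span[@class="bsi-lbl"]/following-sibling::span[1]')[0].text_content().strip()

print(label, number, phone)
```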
I have extracted the HTML content below from a URL using Scrapy:
<div id="data">
<div style="position:absolute">
<h4 class="course">Python</h4>
<h4 class="count">45</h4>
</div>
<h1 style="position:absolute">Available</h1>
<h2 style="position:absolute">Weekend</h1>
<h1 style="position:absolute">Paid Version</h1>
</div>
and, using XPath,
headerResponse = response.xpath('//div[@id="data"]').extract()
I have loaded it into the headerResponse variable. Now I want to get the values; since the elements don't have an id or class, how can I extract them?
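Elements without an id or class can still be addressed by tag name and position relative to a parent that does have one. A minimal sketch, using lxml directly (Scrapy's selectors are built on it) and assuming the h2 is closed with a matching </h2>; in a spider the same expressions could be passed to response.xpath(...).getall():

```python
import lxml.html

# Copy of the markup from the question, with the h2 closing tag corrected (assumption).
page = """
<div id="data">
  <div style="position:absolute">
    <h4 class="course">Python</h4>
    <h4 class="count">45</h4>
  </div>
  <h1 style="position:absolute">Available</h1>
  <h2 style="position:absolute">Weekend</h2>
  <h1 style="position:absolute">Paid Version</h1>
</div>
"""

doc = lxml.html.document_fromstring(page)

# All h1 / h2 children of the div, selected by tag name only.
h1_values = doc.xpath('//div[@id="data"]/h1/text()')
h2_values = doc.xpath('//div[@id="data"]/h2/text()')
# A positional predicate picks one element among siblings with the same tag.
first_h1 = doc.xpath('//div[@id="data"]/h1[1]/text()')[0]
# Elements that do have a class can be matched on it as usual.
course = doc.xpath('//div[@id="data"]//h4[@class="course"]/text()')[0]

print(h1_values, h2_values, first_h1, course)
```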