So I'm new to Scrapy and am looking to do something which is proving a little too ambitious. I'm hoping somebody out there can help guide me on how to gather and parse the info I'm after from this website.
I need to obtain the following:
label1
4810 (this is generated dynamically)
Business name
Name
Address1
Address2
Address3
Address4
Postcode
0800 111111
me#domain.com
Is this even possible using scrapy?
Many thanks in advance.
<div class="mbg">
<a href="http://www.domain.com" aria-label="label1"> <span class="nw1">Label13345</span>
</a>
<span class="mbg-l">
4810
<img
alt="4810"
title="4810"
src="http://www.domain.com/image1"></span>
</div>
<div id="bsi-c" class=" bsi-c-uk-bislr">
<div class="bsi-cnt">
<div class="bsi-ttl section-ttl">
<h2>Info</h2>
<div class="rd-sep"></div>
</div>
<div class="bsi-bn">Business name</div>
<div class="bsi-cic">
<div id="bsi-ec" class="u-flL">
<span class="bsi-arw"></span>
<span class="bsi-cdt">Contact details</span>
</div>
<div id="e8" class="u-flL bsi-ci">
<div class="bsi-c1">
<div>Name</div>
<div>Address1</div>
<div>Address2</div>
<div>Address3</div>
<div>Address4</div>
<div>Postcode</div>
</div>
<div class="bsi-c2">
<br></br>
<div>
<span class="bsi-lbl">Phone:</span>
<span>0800 111111</span>
</div>
<div>
<span class="bsi-lbl">Email:</span>
<span>me#domain.com</span>
</div>
</div>
</div>
</div>
An example of parsing the already received page might look something like this:
import lxml.html
page="""<div><span> . . .</span></div> """
doc = lxml.html.document_fromstring(page)
# get label1 4810
label = doc.cssselect('.mbg .mbg-l a')[0].text_content()
# get address
addres = doc.cssselect('.u-flL .bsi-c1')[0].text_content()
# get phone
phone = doc.cssselect('.bsi-c2 .bsi-lbl')[0].text_content()
# get mail
mail = doc.cssselect('.bsi-c2 .bsi-lbl')[1].text_content()
if a page must be retrieved from the network can make so:
import requests, lxml.html
page = requests.get('site_.com')
doc = lxml.html.document_fromstring(page.text)
phone = doc.cssselect('.bsi-c2 .bsi-lbl')[0].text_content()
Related
How can I retrieve the page encoded div class of a webpage (title html tag) using Python?
Here my sample html code.
You need to use requests to make a request (it will automatically decode the page, in most cases), and beautifulsoup to extract the data from the HTML.
Update after OP clarifications. CSS classes are not dynamically updating, they're the same (that's what I noticed). Since they're the same, you can:
grab a container with all needed data (a container (CSS selector) that wraps needed data)
for result in soup.select(".pSzOP-AhqUyc-qWD73c.GNzUNc span"):
# ...
use regex to filter (find) all needed data via re.findall() and capture group (.*): only this match will be captured and returned. .*: means to capture everything.
if re.findall(r"^Telephone\s?:\s?(.*)", result.text):
# ...
Have a look at the SelectorGadget Chrome extension to grab CSS selectors by clicking on the desired element in your browser. On that note, there's a dedicated web scraping with CSS selectors blog post of mine.
Code and example in the online IDE:
import requests, re
from bs4 import BeautifulSoup
html = requests.get("https://sites.google.com/a/arden.solihull.sch.uk/futures/home")
soup = BeautifulSoup(html.text, "html.parser")
# all regular expressions for this task
# https://regex101.com/r/cxdxgq/1
for result in soup.select(".pSzOP-AhqUyc-qWD73c.GNzUNc span"):
if re.findall(r"^Careers\s?.*\s?:\s?(.*)", result.text):
name = "".join(re.findall(r"^Careers\s?.*\s?:\s?(.*)", result.text.strip()))
print(name)
if re.findall(r"^Telephone\s?:\s?(.*)", result.text):
telephone = "".join(re.findall(r"^Telephone\s?:\s?(.*)", result.text.strip()))
print(telephone)
if re.findall(r"^Email\s?:\s?(.*)", result.text):
email = "".join(re.findall(r"^Email\s?:\s?(.*)", result.text.strip()))
print(email)
# to scrape the role you can do the same thing with regex. Test on regex101.com
'''
Mrs A. Fallis
01564 773348
afallis#arden.solihull.sch.uk
Mr S. Brady
01564 7733478
sbrady#arden.solihull.sch.uk
'''
First solutions without OP clarifications (shows only extraction part since you haven't provided a website URL):
from bs4 import BeautifulSoup
html = """
<div class="L581yb VICjCf" hjdwnd-ahquyc-r6poud="" jndksc="" l6ctce-pszop"="" l6ctce-purzt="" tabindex=" == $0
<div class=">
</div>
<div class="hJDwNd-AhqUyc-WNfPc purZT-AhqUyC-I15mzb PSzOP-AhqUyc-qWD73c JNdks <div class=" jndksc-smkayb"="">
<div class="" f570id"="" jsaction="zXBUYD: ZTPCnb; 2QF9Uc: Qxe3nd;
jsname=" jscontroller="SGWD4d">
>
<div class="oKdM2C KzvoMe">
<div class="hJDwNd-AhqUyc-WNFPC PSzOP-AhqUyc- qWD73c jXK9ad D2fZ2 Oj CsFc whaque GNzUNC" id="h.7f5e93de0cf8a767_49">
<div class="]XK9ad-SmkAyb">
<div class="ty]Ctd mGzaTb baZpAe">
<div class="GV3q8e aP9Z7e" id="h.p_9livxd801krd">
</div>
<h3 class="CDt4ke zfr3Q OmQG5e" dir="ltr" id="h.p_9livxd801krd" tabindex="-1">
.
</h3>
<div class="GV3q8e aP9z7e" id="h.p JrEgQYpyORCF">
</div>
<h3 class="CDt 4Ke zfr3Q OmQG5e" dir="ltr" id="h.p_JrEgQYPYORCF" tabindex="-1">
<div class="CjVfdc" jsaction="touchstart:UrsOsc; click:Kjs
qPd; focusout:QZoaz; mouseover:yOpDld; mouseout:dq0hvd;fvlRjc:jbFSO
d;CrflRd:SzACGe;" jscontroller="Ae65rd">
<div class="PPHIP rviiZ" jsname="haAclf">
.
</div>
<span style="font-family: 'Oswald'; font-weight: 500;">
Telephone : 01564 773348
</span>
</div>
</h3>
<div class="GV3q8e aP9z7e" id="h.p_sylefz-BOSBX">
</div>
><h3 id="h.p_sylefz-BOSBX" dir="ltr" class="CDt 4Ke zfr3Q OmQG5e"
</div>
</div>
</div>
</div>
</div>
</div>
"""
# pass HTML to BeautifulSoup object and assign a html.parser as a HTML parser
soup = BeautifulSoup(html, "html.parser")
# grab a phone number (only first occurrence will be extracted)
# https://www.crummy.com/software/BeautifulSoup/bs4/doc/#css-selectors
print(soup.select_one('.CjVfdc span').text.strip())
# Telephone : 01564 773348
# extract <div> element with .L581yb class. returns a list()
print(soup.select('.L581yb'))
'''
[<div class="L581yb VICjCf" hjdwnd-ahquyc-r6poud="" jndksc="" l6ctce-pszop"="" l6ctce-purzt="" tabindex=" == $0
<div class=">
</div>]
'''
# extract <div> element with .hJDwNd-AhqUyc-WNfPc class. returns a list()
print(soup.select('.hJDwNd-AhqUyc-WNfPc'))
'''
[<div class="hJDwNd-AhqUyc-WNfPc purZT-AhqUyC-I15mzb PSzOP-AhqUyc-qWD73c JNdks <div class=" jndksc-smkayb"="">
<div class="" f570id"="" jsaction="zXBUYD: ZTPCnb; 2QF9Uc: Qxe3nd;
jsname=" jscontroller="SGWD4d">
>
<div class="oKdM2C KzvoMe">
<div class="hJDwNd-AhqUyc-WNFPC PSzOP-AhqUyc- qWD73c jXK9ad D2fZ2 Oj CsFc whaque GNzUNC" id="h.7f5e93de0cf8a767_49">
<div class="]XK9ad-SmkAyb">
<div class="ty]Ctd mGzaTb baZpAe">
<div class="GV3q8e aP9Z7e" id="h.p_9livxd801krd">
</div>
<h3 class="CDt4ke zfr3Q OmQG5e" dir="ltr" id="h.p_9livxd801krd" tabindex="-1">
.
</h3>
<div class="GV3q8e aP9z7e" id="h.p JrEgQYpyORCF">
</div>
<h3 class="CDt 4Ke zfr3Q OmQG5e" dir="ltr" id="h.p_JrEgQYPYORCF" tabindex="-1">
<div class="CjVfdc" jsaction="touchstart:UrsOsc; click:Kjs
qPd; focusout:QZoaz; mouseover:yOpDld; mouseout:dq0hvd;fvlRjc:jbFSO
d;CrflRd:SzACGe;" jscontroller="Ae65rd">
<div class="PPHIP rviiZ" jsname="haAclf">
.
</div>
<span style="font-family: 'Oswald'; font-weight: 500;">
Telephone : 01564 773348
</span>
</div>
</h3>
<div class="GV3q8e aP9z7e" id="h.p_sylefz-BOSBX">
</div>
><h3 id="h.p_sylefz-BOSBX" dir="ltr" class="CDt 4Ke zfr3Q OmQG5e"
</div>
</div>
</div>
</div>
</div>
</div>]
'''
The html code is :
<div class="card border p-3">
<span class="small text-muted">Contact<br></span>
<div>Steven Cantrell</div>
<div class="small">Department of Justice</div>
<div class="small">Federal Bureau of Investigation</div>
<!---->
<!---->
<!---->
<div class="small">skcantrell#fbi.gov</div>
<div class="small">256-313-8835</div>
</div>
I want to get the output inside the <div> tag i.e. Steven Cantrell .
I need such a way that I should be able to get the contents of next tag. In this case, it is 'span',{'class':'small text-muted'}
What I tried is :
rfq_name = soup.find('span',{'class':'small text-muted'})
print(rfq_name.next)
But this printed Contact instead of the name.
You're nearly there, just change your print to: print(rfq_name.find_next('div').text)
Find the element that has the text "Contact". Then use .find_next() to get the next <div> tag.
from bs4 import BeautifulSoup
html = '''<div class="card border p-3">
<span class="small text-muted">Contact<br></span>
<div>Steven Cantrell</div>
<div class="small">Department of Justice</div>
<div class="small">Federal Bureau of Investigation</div>
<!---->
<!---->
<!---->
<div class="small">skcantrell#fbi.gov</div>
<div class="small">256-313-8835</div>
</div>'''
soup = BeautifulSoup(html, 'html.parser')
contact = soup.find(text='Contact').find_next('div').text
Output:
print(contact)
Steven Cantrell
I am using BeautifulSoup.
I would like to extract a coordinates from the website. The code of web looks like:
<a class="button button--outline link link--emphasis button-full-width js-choose-store" href="/sklep?StoreID=R034" title="Informacje o sklepie">Informacje o sklepie</a>
</div>
</div>
</div>
</div>
<div class="storelist__item ui-expandable js-accordion-store js-store" data-lat="52.225155" data-lng="20.998965" data-icon="/on/demandware.static/Sites-Hebe-Site/-/default/dw081970e9/images/map_markers/hebe.png" data-id="R379" data-coming-soon="false" data-index="81">
<div class="visually-hidden" data-popup-html>
<div class="store-popup">
<div class="store-popup__name text--uppercase">Drogeria Hebe</div>
<div class="store-popup__address">Lindleya 16</div>
<div class="store-popup__city">Warszawa, 02-013</div>
<div class="store-popup__directions">
I need to get 'data-lat' and 'data-lng'.
I had no problem to get address or name of object (it was a text), using for example:
find("div",{"class","store-popup__city"}).text
Try something along the lines of:
dat = soup.select_one('div[data-lat]')
print(dat['data-lat'],dat['data-lng'])
Output:
52.225155 20.998965
I'm trying to collect the text using Bs4, selenium and Python I want to get the text "Lisa Staprans" using:
name = str(profilePageSource.find(class_="hzi-font hzi-Man-Outline").div.get_text().encode("utf-8"))[2:-1]
Here is the code:
<div class="profile-about-right">
<div class="text-bold">
SF Peninsula Interior Design Firm
<br/>
Best of Houzz 2015
</div>
<br/>
<div class="page-tags" style="display:none">
page_type: pro_plus_profile
</div>
<div class="pro-info-horizontal-list text-m text-dt-s">
<div class="info-list-label">
<i class="hzi-font hzi-Ruler">
</i>
<div class="info-list-text">
<span class="hide" itemscope="" itemtype="http://data-vocabulary.org/Breadcr
umb">
<a href="http://www.houzz.com/professionals/c/Menlo-Park--CA" itemprop="url
">
<span itemprop="title">
Professionals
</span>
</a>
</span>
<span itemprop="child" itemscope="" itemtype="http://data-vocabulary.org/Bre
adcrumb">
<a href="http://www.houzz.com/professionals/interior-designer/c/Menlo-Park-
-CA" itemprop="url">
<span itemprop="title">
Interior Designers & Decorators
</span>
</a>
</span>
</div>
</div>
<div class="info-list-label">
<i class="hzi-font hzi-Man-Outline">
</i>
<div class="info-list-text">
<b>
Contact
</b>
: Lisa Staprans
</div>
</div>
</div>
</div>
Please let me know how it would be.
I assumed you are using Beautifulsoup since you are using class_ attribute dictionary-
If there is one div with class name hzi-font hzi-Man-Outline then try-
str(profilePageSource.find(class_="hzi-font hzi-Man-Outline").findNext('div').get_text().split(":")[-1]).strip()
Extracts 'Lisa Staprans'
Here findNext navigates to next div and extracts text.
I can't test it right now but I would do :
profilePageSource.find_element_by_class_name("info-list-text").get_attribute('innerHTML')
Then you will have to split the result considering the : (if it's always the case).
For more informations : https://selenium-python.readthedocs.org/en/latest/navigating.html
Maybe something is wrong with this part:
find(class_="hzi-font hzi-Man-Outline")
An easy way to get the right information can be: right click on the element you need in the page source by inspecting it with Google Chrome, copy the xpath of the element, and then use:
profilePageSource.find_element_by_xpath(<xpath copied from Chorme>).text
Hope it helps.
I want to enter a text in a text area. The HTML code is as follows:
<li class="order-unavailable string-type-key string-block clear-fix status- require_changes expanded working autogrowed activity-opened" data-string_status="require_changes" data-master_unit_count="22" data-string_id="2394473">
<div class="key-area clear-fix">
<div class="key-area-container-one clear-fix">
<div class="key-area-container-two">
<div class="col-50 col-left">
<div class="string-controls">
<a class="control-expand-toggle selected" href="#"></a>
<a class="control-activity-toggle " href="#">2</a>
<input class="control-select-string" type="checkbox">
</div>
<div class="master-content">
</div>
<div class="col-50 col-right slave-side-container">
</div>
</div>
</div>
<div class="activity-area clear-fix">
<div class="col-50 col-left">
<div class="col-50 col-right">
<div class="comment-area-inner">
<h3>Add comment</h3>
<div class="comment-container">
<textarea class="comment-content" name="comment_content"></textarea>
</div>
<div class="col-right">
<div class="clear"></div>
<strong>Notification settings</strong>
<p>The people you select will get an email when you post this comment. They'll also be notified by email every time a new comment is added.</p>
<div class="notification-settings">
</div>
</div>
</div>
The textarea component name is comment-content
The xpath of the textarea is:
/html/body/div/section/ol/li[16]/div[2]/div[2]/div/div/textarea
This is the code I am using:
driver.find_element_by_xpath("*//div[#title=\"NOTIFICATION_HOMEPAGE_REDIRECT_CHANGED_SITE\"]
/following-sibling::div[2]/div[2]/div/div/textarea").send_keys("Test comment")
Can someone hekp me how to frame the sibling tag?
div[2]/div[2]/div/div/textarea
The tag before the following-sibling keyword is correct.
Choose the textarea and enter something,
driver.find_element_by_xpath(r'//textarea[#class='comment-content']').send_keys('Test Comment')
For xpath, you can use tool Firepath plugin for Firefox