I'm reading some webpage content in HTML that has the following form:
<div class="cart">
<div class="cart-title">
<img src="https://ug3.technion.ac.il/rishum/img/regCourses.png" width="50" height="50" alt="My Courses">
המקצועות שלי
</div><div class="entry-spacer"></div><div class="cart-entry">
<div class="course-number">
104134
</div>
<div class="course-name">
אלגברה מודרנית ח
</div>
<div class="course-points">
2.5 נק'
</div>
<div class="entry-group">
קבוצה 11
</div><div class="change-group">
שנה קבוצה ל
<select name="UPG104134" onchange="showWaitAndSubmit('regCart')" class="change-group-options">
<option value=""> </option><option>12</option><option>13</option><option>21</option><option>22</option><option>23</option>
</select>
</div><div class="more-actions">
</div>
<div class="clear"></div></div><div class="entry-spacer"></div><div class="cart-entry">
<div class="course-number">
234118
</div>
<div class="course-name">
ארגון ותכנות המחשב
</div>
<div class="course-points">
3 נק'
</div>
<div class="entry-group">
קבוצה 22
</div><div class="change-group">
שנה קבוצה ל
<select name="UPG234118" onchange="showWaitAndSubmit('regCart')" class="change-group-options">
<option value=""> </option><option>11</option><option>12</option><option>13</option><option>14</option><option>21</option>
</select>
</div><div class="more-actions">
</div>
<div class="clear"></div></div><div>
Now the question is: how can I read the course numbers (the ones that appear in blue in my image)?
Here's an example of how a course number appears in the webpage:
<div class="course-number">
104134
</div>
and I want to read 104134 in this example.
First, I'd advise using BeautifulSoup to parse the HTML; then, off the top of my head, you can dig out the div tags with that class name like this:
import requests
from bs4 import BeautifulSoup
r = requests.get(<your-target>)
soup = BeautifulSoup(r.text, 'lxml')
# the course number is the direct text of each <div class="course-number">
numbers = [i.get_text(strip=True) for i in soup.find_all('div', attrs={"class": "course-number"})]
I didn't check this against the live page, but it should get you most of the way; check BeautifulSoup's documentation for more information.
Note that this assumes each of those divs directly contains the number as text. If you don't trust the structure of the website to always stay the same, better do a normal for-loop and validate each entry (or wrap it in a try/except).
Beware that this will obtain all div tags with class course-number. You may want only a subset of those, so either apply more filtering or traverse the HTML tree first until you reach the container of your target content.
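For instance, a slightly more defensive version of the same extraction (a sketch that reuses the soup object from above and keeps only values that actually look like course numbers) might be:
numbers = []
for div in soup.find_all('div', attrs={"class": "course-number"}):
    text = div.get_text(strip=True)
    if text.isdigit():  # skip anything that doesn't look like a course number
        numbers.append(text)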
Related
Is it possible to make a BeautifulSoup SoupStrainer that strains all 'order-cards' from 'container-01' only (without the 'order-cards' from other containers)?
Below is the sample HTML:
<div class="items-container" container-id="container-01">
<div class="order-card">order_01
<div class="item-card">item1</div>
<div class="item-card">item2</div>
<div class="item-card">item3</div>
<div class="item-card">item4</div>
</div>
<div class="order-card">order_02
<div class="item-card">itemA</div>
<div class="item-card">itemB</div>
<div class="item-card">itemC</div>
<div class="item-card">itemD</div>
</div>
<div class="order-card">order_03
<div class="item-card">itemW</div>
<div class="item-card">itemX</div>
<div class="item-card">itemY</div>
<div class="item-card">itemZ</div>
<div class="item-card">item</div>
</div>
</div>
<div class="items-container" container-id="container-02">
<div class="order-card">order_53
<div class="item-card">item_7</div>
<div class="item-card">item_8</div>
</div>
</div>
<div class="items-container" container-id="container-03">
<div class="order-card">order_13
<div class="item-card">item_16</div>
<div class="item-card">item_17</div>
<div class="item-card">item_18</div>
</div>
</div>
What I have so far is the code below, which strains ALL 'order-cards' from ALL containers.
The goal is that 'page_soup' contains only the 'order-card' items that are in 'container-01'.
A loop then uses that 'page_soup' to iterate through each 'order-card' and get the details from each 'item-card'.
In other words, the goal is to get the details from each 'item-card' in 'container-01' only; there is no need to parse any containers other than 'container-01'.
from bs4 import BeautifulSoup, SoupStrainer
only_item_cells = SoupStrainer('div', attrs={"class": "order-card"})
page_soup = BeautifulSoup(page_html, 'html.parser', parse_only=only_item_cells)
Following that is a loop that gets the details from ALL the 'item-cards' in ALL containers. That is NOT what I want, as the output includes items from containers other than 'container-01'.
Running Python 3.8.8 on Anaconda, Win64.
Use the appropriate attribute, as you have indicated in your HTML:
only_item_cells = SoupStrainer('div', attrs={"container-id": "container-01"})
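A sketch of how that strainer might then be combined with your loop (assuming page_html holds the HTML above, as in your snippet): the strainer keeps the matching container together with everything inside it, so a subsequent find_all only sees the order-cards from container-01.
from bs4 import BeautifulSoup, SoupStrainer
only_item_cells = SoupStrainer('div', attrs={"container-id": "container-01"})
page_soup = BeautifulSoup(page_html, 'html.parser', parse_only=only_item_cells)
for order in page_soup.find_all('div', attrs={"class": "order-card"}):
    # item-cards nested under this order-card (container-01 only, thanks to the strainer)
    items = [item.get_text(strip=True) for item in order.find_all('div', attrs={"class": "item-card"})]
    print(items)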
How can I retrieve the page encoded div class of a webpage (title html tag) using Python?
Here is my sample HTML code.
You need to use requests to fetch the page (it will automatically decode it, in most cases) and BeautifulSoup to extract the data from the HTML.
Update after OP clarifications. The CSS classes are not dynamically generated; they stay the same (that's what I noticed). Since they're the same, you can:
grab a container that wraps all the needed data (via a CSS selector):
for result in soup.select(".pSzOP-AhqUyc-qWD73c.GNzUNc span"):
    # ...
then use a regex to filter (find) the needed data via re.findall() with a capturing group (.*): only the text inside the group is captured and returned, and .* matches everything up to the end of the line:
if re.findall(r"^Telephone\s?:\s?(.*)", result.text):
    # ...
Have a look at the SelectorGadget Chrome extension to grab CSS selectors by clicking on the desired element in your browser. On that note, I have a dedicated blog post about web scraping with CSS selectors.
Code and example in the online IDE:
import requests, re
from bs4 import BeautifulSoup
html = requests.get("https://sites.google.com/a/arden.solihull.sch.uk/futures/home")
soup = BeautifulSoup(html.text, "html.parser")
# all regular expressions for this task
# https://regex101.com/r/cxdxgq/1
for result in soup.select(".pSzOP-AhqUyc-qWD73c.GNzUNc span"):
if re.findall(r"^Careers\s?.*\s?:\s?(.*)", result.text):
name = "".join(re.findall(r"^Careers\s?.*\s?:\s?(.*)", result.text.strip()))
print(name)
if re.findall(r"^Telephone\s?:\s?(.*)", result.text):
telephone = "".join(re.findall(r"^Telephone\s?:\s?(.*)", result.text.strip()))
print(telephone)
if re.findall(r"^Email\s?:\s?(.*)", result.text):
email = "".join(re.findall(r"^Email\s?:\s?(.*)", result.text.strip()))
print(email)
# to scrape the role you can do the same thing with regex. Test on regex101.com
'''
Mrs A. Fallis
01564 773348
afallis#arden.solihull.sch.uk
Mr S. Brady
01564 7733478
sbrady#arden.solihull.sch.uk
'''
First solution, written before the OP's clarifications (it shows only the extraction part, since you hadn't provided a website URL):
from bs4 import BeautifulSoup
html = """
<div class="L581yb VICjCf" hjdwnd-ahquyc-r6poud="" jndksc="" l6ctce-pszop"="" l6ctce-purzt="" tabindex=" == $0
<div class=">
</div>
<div class="hJDwNd-AhqUyc-WNfPc purZT-AhqUyC-I15mzb PSzOP-AhqUyc-qWD73c JNdks <div class=" jndksc-smkayb"="">
<div class="" f570id"="" jsaction="zXBUYD: ZTPCnb; 2QF9Uc: Qxe3nd;
jsname=" jscontroller="SGWD4d">
>
<div class="oKdM2C KzvoMe">
<div class="hJDwNd-AhqUyc-WNFPC PSzOP-AhqUyc- qWD73c jXK9ad D2fZ2 Oj CsFc whaque GNzUNC" id="h.7f5e93de0cf8a767_49">
<div class="]XK9ad-SmkAyb">
<div class="ty]Ctd mGzaTb baZpAe">
<div class="GV3q8e aP9Z7e" id="h.p_9livxd801krd">
</div>
<h3 class="CDt4ke zfr3Q OmQG5e" dir="ltr" id="h.p_9livxd801krd" tabindex="-1">
.
</h3>
<div class="GV3q8e aP9z7e" id="h.p JrEgQYpyORCF">
</div>
<h3 class="CDt 4Ke zfr3Q OmQG5e" dir="ltr" id="h.p_JrEgQYPYORCF" tabindex="-1">
<div class="CjVfdc" jsaction="touchstart:UrsOsc; click:Kjs
qPd; focusout:QZoaz; mouseover:yOpDld; mouseout:dq0hvd;fvlRjc:jbFSO
d;CrflRd:SzACGe;" jscontroller="Ae65rd">
<div class="PPHIP rviiZ" jsname="haAclf">
.
</div>
<span style="font-family: 'Oswald'; font-weight: 500;">
Telephone : 01564 773348
</span>
</div>
</h3>
<div class="GV3q8e aP9z7e" id="h.p_sylefz-BOSBX">
</div>
><h3 id="h.p_sylefz-BOSBX" dir="ltr" class="CDt 4Ke zfr3Q OmQG5e"
</div>
</div>
</div>
</div>
</div>
</div>
"""
# pass the HTML to a BeautifulSoup object and use html.parser as the HTML parser
soup = BeautifulSoup(html, "html.parser")
# grab a phone number (only first occurrence will be extracted)
# https://www.crummy.com/software/BeautifulSoup/bs4/doc/#css-selectors
print(soup.select_one('.CjVfdc span').text.strip())
# Telephone : 01564 773348
# extract <div> element with .L581yb class. returns a list()
print(soup.select('.L581yb'))
'''
[<div class="L581yb VICjCf" hjdwnd-ahquyc-r6poud="" jndksc="" l6ctce-pszop"="" l6ctce-purzt="" tabindex=" == $0
<div class=">
</div>]
'''
# extract <div> element with .hJDwNd-AhqUyc-WNfPc class. returns a list()
print(soup.select('.hJDwNd-AhqUyc-WNfPc'))
'''
[<div class="hJDwNd-AhqUyc-WNfPc purZT-AhqUyC-I15mzb PSzOP-AhqUyc-qWD73c JNdks <div class=" jndksc-smkayb"="">
<div class="" f570id"="" jsaction="zXBUYD: ZTPCnb; 2QF9Uc: Qxe3nd;
jsname=" jscontroller="SGWD4d">
>
<div class="oKdM2C KzvoMe">
<div class="hJDwNd-AhqUyc-WNFPC PSzOP-AhqUyc- qWD73c jXK9ad D2fZ2 Oj CsFc whaque GNzUNC" id="h.7f5e93de0cf8a767_49">
<div class="]XK9ad-SmkAyb">
<div class="ty]Ctd mGzaTb baZpAe">
<div class="GV3q8e aP9Z7e" id="h.p_9livxd801krd">
</div>
<h3 class="CDt4ke zfr3Q OmQG5e" dir="ltr" id="h.p_9livxd801krd" tabindex="-1">
.
</h3>
<div class="GV3q8e aP9z7e" id="h.p JrEgQYpyORCF">
</div>
<h3 class="CDt 4Ke zfr3Q OmQG5e" dir="ltr" id="h.p_JrEgQYPYORCF" tabindex="-1">
<div class="CjVfdc" jsaction="touchstart:UrsOsc; click:Kjs
qPd; focusout:QZoaz; mouseover:yOpDld; mouseout:dq0hvd;fvlRjc:jbFSO
d;CrflRd:SzACGe;" jscontroller="Ae65rd">
<div class="PPHIP rviiZ" jsname="haAclf">
.
</div>
<span style="font-family: 'Oswald'; font-weight: 500;">
Telephone : 01564 773348
</span>
</div>
</h3>
<div class="GV3q8e aP9z7e" id="h.p_sylefz-BOSBX">
</div>
><h3 id="h.p_sylefz-BOSBX" dir="ltr" class="CDt 4Ke zfr3Q OmQG5e"
</div>
</div>
</div>
</div>
</div>
</div>]
'''
(1)
</div>
<div class="n_cont5" id="nct7">
<div class="nc_tit">说明书:</div>
<div class="nc5" id="smsdiv">
正在查询请稍候......
</div>
</div>
<div class="n_page">上一篇第<span class="cur">2</span>篇下一篇共<span>53</span>篇转到第
<input type="text" name="pages" id="pages"
onkeydown="return SubmitKeyClick(this,event)"
onkeyup="value=value.replace(/[^\d]/g,'')"
onbeforepaste="clipboardData.setData('text',clipboardData.getData('text').replace(/[^\d]/g,''))"/>
篇</div>
</div>
(2)
<a href="javascript:noAction()" title="PDF下载"
onclick="pdfDownloadDetail('Unexamined_patent_for_invention/2016/20160330/CN105452223A/PDF_PID/CN112014000037041CN00001054522230APDFZH20160330CN00F.PDF,CN201480037041.6')" href="javascript:noAction()">PDF下载</a>
</dd>
</dl>
</div>
</li>
<li>打印</li>
<li><a class="icon7" href="javascript:noAction();" class="zidongfanyi"
onclick="translateToEn('CN201480037041.6', 'FMZL_EN,SYXX_EN')">中译英</a></li>
</ul>
<div class="clear"></div>
</div>
<div class="clear"></div>
</div>
<div class="n_page">上一篇第<span class="cur">2</span>篇下一篇共<span>53</span>篇转到第
<input type="text" name="pages" id="pages"
onkeydown="return SubmitKeyClick(this,event)"
onkeyup="value=value.replace(/[^\d]/g,'')"
onbeforepaste="clipboardData.setData('text',clipboardData.getData('text').replace(/[^\d]/g,''))"/>
篇</div>
Here are two very similar pieces of content in the HTML. I want to get the number "53" from the first one. I used the code below, which doesn't work. I also tried going through the div class, but that failed as well. How can I get the number "53" from the first HTML snippet?
html.xpath('//a[contains(text(),"下一篇")]/span/text()')
Why it didn't work: the span holding 53 is a sibling (not a child) of the a element, so you need the following-sibling axis:
html.xpath("//a[contains(text(),'下一篇')]/following-sibling::span/text()")
To complete @super.single430's answer, here's an alternative (in case encoding issues occur during the parsing process):
html.xpath('//span[@class="cur"]/following-sibling::span/text()')
I am using BeautifulSoup.
I would like to extract coordinates from a website. The relevant HTML looks like this:
<a class="button button--outline link link--emphasis button-full-width js-choose-store" href="/sklep?StoreID=R034" title="Informacje o sklepie">Informacje o sklepie</a>
</div>
</div>
</div>
</div>
<div class="storelist__item ui-expandable js-accordion-store js-store" data-lat="52.225155" data-lng="20.998965" data-icon="/on/demandware.static/Sites-Hebe-Site/-/default/dw081970e9/images/map_markers/hebe.png" data-id="R379" data-coming-soon="false" data-index="81">
<div class="visually-hidden" data-popup-html>
<div class="store-popup">
<div class="store-popup__name text--uppercase">Drogeria Hebe</div>
<div class="store-popup__address">Lindleya 16</div>
<div class="store-popup__city">Warszawa, 02-013</div>
<div class="store-popup__directions">
I need to get 'data-lat' and 'data-lng'.
I had no problem getting the address or the name of the store (they are plain text), using for example:
find("div", {"class": "store-popup__city"}).text
Try something along the lines of:
dat = soup.select_one('div[data-lat]')
print(dat['data-lat'],dat['data-lng'])
Output:
52.225155 20.998965
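If the page lists more than one store, a small sketch along the same lines (reusing your soup object and assuming every store div carries both attributes, like the js-store entry above) could collect all of them:
coords = [(d['data-lat'], d['data-lng']) for d in soup.select('div.js-store[data-lat]')]
print(coords)  # e.g. [('52.225155', '20.998965'), ...]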
So I'm new to Scrapy and am looking to do something which is proving a little too ambitious. I'm hoping somebody out there can help guide me on how to gather and parse the info I'm after from this website.
I need to obtain the following:
label1
4810 (this is generated dynamically)
Business name
Name
Address1
Address2
Address3
Address4
Postcode
0800 111111
me#domain.com
Is this even possible using Scrapy?
Many thanks in advance.
<div class="mbg">
<a href="http://www.domain.com" aria-label="label1"> <span class="nw1">Label13345</span>
</a>
<span class="mbg-l">
4810
<img
alt="4810"
title="4810"
src="http://www.domain.com/image1"></span>
</div>
<div id="bsi-c" class=" bsi-c-uk-bislr">
<div class="bsi-cnt">
<div class="bsi-ttl section-ttl">
<h2>Info</h2>
<div class="rd-sep"></div>
</div>
<div class="bsi-bn">Business name</div>
<div class="bsi-cic">
<div id="bsi-ec" class="u-flL">
<span class="bsi-arw"></span>
<span class="bsi-cdt">Contact details</span>
</div>
<div id="e8" class="u-flL bsi-ci">
<div class="bsi-c1">
<div>Name</div>
<div>Address1</div>
<div>Address2</div>
<div>Address3</div>
<div>Address4</div>
<div>Postcode</div>
</div>
<div class="bsi-c2">
<br></br>
<div>
<span class="bsi-lbl">Phone:</span>
<span>0800 111111</span>
</div>
<div>
<span class="bsi-lbl">Email:</span>
<span>me#domain.com</span>
</div>
</div>
</div>
</div>
An example of parsing an already-retrieved page might look something like this:
import lxml.html
page = """<div><span> . . .</span></div> """
doc = lxml.html.document_fromstring(page)
# get label1 (the aria-label on the anchor inside .mbg)
label = doc.cssselect('.mbg a')[0].get('aria-label')
# get 4810 (the text of the .mbg-l span)
number = doc.cssselect('.mbg .mbg-l')[0].text_content().strip()
# get the address block
address = doc.cssselect('.u-flL .bsi-c1')[0].text_content()
# get the phone: the value sits in the span right after the "Phone:" label
phone = doc.cssselect('.bsi-c2 .bsi-lbl + span')[0].text_content()
# get the email (the span after the "Email:" label)
mail = doc.cssselect('.bsi-c2 .bsi-lbl + span')[1].text_content()
If the page must first be retrieved from the network, you can do it like this:
import requests, lxml.html
page = requests.get('site_.com')
doc = lxml.html.document_fromstring(page.text)
# again, the "+" combinator grabs the value span next to the "Phone:" label
phone = doc.cssselect('.bsi-c2 .bsi-lbl + span')[0].text_content()
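And since the question asked about Scrapy specifically: yes, it's possible; the same kind of CSS selectors can be used inside a Scrapy spider via response.css(). Here's a rough sketch, not a drop-in solution (the spider name is made up and the start URL is the question's placeholder domain):
import scrapy
class BusinessInfoSpider(scrapy.Spider):
    name = "business_info"  # hypothetical spider name
    start_urls = ["http://www.domain.com"]  # placeholder URL from the question
    def parse(self, response):
        # value spans sit right after the "Phone:" / "Email:" label spans
        contact = response.css(".bsi-c2 .bsi-lbl + span::text").getall()
        yield {
            "label": response.css(".mbg a::attr(aria-label)").get(),
            "number": response.css(".mbg .mbg-l::text").get(default="").strip(),
            "address": [t.strip() for t in response.css(".u-flL .bsi-c1 div::text").getall()],
            "phone": contact[0] if contact else None,
            "email": contact[1] if len(contact) > 1 else None,
        }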