I'm parsing a large file; the HTML below is only a small part of it. The first div is repeated many times. Inside each of these divs I want to get several values from the <a> tag. I don't mind if I also get the elements nested inside the <a>.
This is what I tried, but it doesn't work:
from bs4 import BeautifulSoup
import requests
import re
page_url = 'https://paris-sportifs.pmu.fr/'
page = requests.get(page_url)
soup = BeautifulSoup(page.text, 'html.parser')
with open('pmu.html', 'a+') as file:
    for div in soup.find_all('div', class_='time_group', attrs={'data-time_group': re.compile("group[1-9]")}):
        event_information = div.find('a', class_='trow--event tc-track-element-events')
        print(re.sub(r'\s+', ' ', event_information.text))
An example of the HTML:
<div class="time_group" data-time_group="group0">
<div class="row">
<div class="col-sm-12">
<a class="trow--event tc-track-element-events" href="/event/522788/football/football/maroc-botola-pro-1/rsb-berkane-rapide-oued-zem" data-event_id="rsb_berkane__rapide_oued_zem" data-compet_id="maroc_-_botola_pro_1" data-sport_id="football" data-name="sportif.clic.paris_live.details" data-toggle="tooltip" data-placement="bottom" title="Football - Maroc - Botola Pro 1 - RSB Berkane // Rapide Oued Zem - 29 mars 2018 - 19h00">
<em class="trow--event--name">
<span>RSB Berkane // Rapide Oued Zem</span>
</em>
</a>
</div>
</div>
</div>
The for loop gives me the divs I'm interested in, but I don't know how to use each div for the next step: I want div.find to search only among the elements inside that div, not outside it (in the whole soup).
What I expect:
<a class="trow--event tc-track-element-events" href="/event/522788/football/football/maroc-botola-pro-1/rsb-berkane-rapide-oued-zem" data-event_id="rsb_berkane__rapide_oued_zem" data-compet_id="maroc_-_botola_pro_1" data-sport_id="football" data-name="sportif.clic.paris_live.details" data-toggle="tooltip" data-placement="bottom" title="Football - Maroc - Botola Pro 1 - RSB Berkane // Rapide Oued Zem - 29 mars 2018 - 19h00">
<em class="trow--event--name">
<span>RSB Berkane // Rapide Oued Zem</span>
</em>
</a>
Then I just have to read the different attribute values from my variable.
I hope my English isn't too bad.
Thank you in advance for your valuable assistance.
EDIT 1:
Let's take an example of the source code: https://pastebin.com/KZBp9c3y
In this file, when I run for div in soup.find_all('div', class_ = 'time_group', attrs={ 'data-time_group' : re.compile("group[1-9]") }): I find the first div, but imagine there are multiple matches in the for loop.
Then, inside this div, I want to find the <a> element with the class trow--event...: div.find('a', class_ = 'trow--event tc-track-element-events')
An example of a possible result is:
data-event_id="brescia__pescara"
data-compet_id="italie_-_serie_b"
data-sport_id="football"
score-both :
Anyway, the problem is that I don't know how to run a find starting from the div I'm currently in. I'm in <div class="time_group" data-time_group="group1"> and I want to get different pieces of information from it. I want to parse the div from top to bottom.
Concretely:
for div in soup:
    if current_div is:
        do this.....
    else if:
        do this...
How can I get the current_div?
Tell me if you don't understand what I want.
Thank you.
I've found something. It's not exactly what I wanted, but it works:
from bs4 import BeautifulSoup
import requests
import re
page_url = 'https://paris-sportifs.pmu.fr/'
page = requests.get(page_url)
soup = BeautifulSoup(page.text, 'html.parser')
soupdiv = soup.find_all('div', class_='time_group', attrs={'data-time_group': re.compile("group[1-9]")})
for div in soupdiv:
    test = div.find("a", {"class": "trow--event tc-track-element-events"})
    print(test.text)
I'm now doing my find from the current div inside the for loop.
Thank you.
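For readers who want the attribute values rather than the text, the scoped search can be sketched as follows. This is only an illustration: the inline HTML is a shortened, made-up stand-in for the real page (the group0 block and the "Brescia // Pescara" label are assumptions), and the events list and its keys are hypothetical names. Calling find on a tag searches only that tag's descendants, so each div acts as the "current div".

```python
import re
from bs4 import BeautifulSoup

# Shortened stand-in for the real page (assumption: same classes and data-* attributes).
html = """
<div class="time_group" data-time_group="group0">
  <a class="trow--event tc-track-element-events" data-event_id="skipped" data-sport_id="football"><span>Skipped</span></a>
</div>
<div class="time_group" data-time_group="group1">
  <a class="trow--event tc-track-element-events" data-event_id="brescia__pescara" data-compet_id="italie_-_serie_b" data-sport_id="football">
    <em class="trow--event--name"><span>Brescia // Pescara</span></em>
  </a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
events = []  # hypothetical container for the extracted values
for div in soup.find_all("div", class_="time_group",
                         attrs={"data-time_group": re.compile(r"group[1-9]")}):
    # div.find() only looks inside this div, never in the rest of the soup
    link = div.find("a", class_="trow--event")
    if link is None:
        continue  # this div has no matching <a>
    events.append({
        "event_id": link.get("data-event_id"),
        "compet_id": link.get("data-compet_id"),
        "sport_id": link.get("data-sport_id"),
        # collapse newlines and runs of spaces in the visible text
        "name": " ".join(link.get_text().split()),
    })

print(events)
```

Using .get() here means a missing attribute yields None instead of raising an error.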
I am new to Python and have never worked with HTML, so any help would be appreciated.
I need to extract two numbers, '1062' and '348', from a website, as seen in the browser's Inspect Element view.
This is my code:
page = requests.get("https://www.traderscockpit.com/?pageView=live-nse-advance-decline-ratio-chart")
soup = BeautifulSoup(page.content, 'html.parser')
Adv = soup.select_one ('.col-sm-6 .advDec:nth-child(1)').text[10:]
Dec = soup.select_two ('.col-sm-6 .advDec:nth-child(2)').text[10:]
The website element looks like below:
<div class="nifty-header-shade1 col-xs-12 col-sm-6 col-md-3">
<div class="row">
<div class="col-sm-12">
<h4>Stocks</h4>
</div>
<div class="col-sm-6">
<p class="advDec">Advanced: 1062</p>
</div>
<div class="col-sm-6">
<p class="advDec">Declined: 348</p>
</div>
</div>
</div>
Using my code, I am able to extract the first number (1062) but not the second one (348). Can you please help?
Assuming the pattern is always the same, you can select each element by its text and read its next_sibling:
adv = soup.select_one('a:-soup-contains("Advanced:")').next_sibling.strip()
dec = soup.select_one('a:-soup-contains("Declined:")').next_sibling.strip()
Example
import requests
from bs4 import BeautifulSoup
page = requests.get("https://www.traderscockpit.com/?pageView=live-nse-advance-decline-ratio-chart")
soup = BeautifulSoup(page.content, 'html.parser')
adv = soup.select_one('a:-soup-contains("Advanced:")').next_sibling.strip()
dec = soup.select_one('a:-soup-contains("Declined:")').next_sibling.strip()
print(adv, dec)
If there are always exactly two elements, the simplest way is probably to unpack the list of selected elements:
import requests
from bs4 import BeautifulSoup
page = requests.get("https://www.traderscockpit.com/?pageView=live-nse-advance-decline-ratio-chart")
soup = BeautifulSoup(page.content, "html.parser")
adv, dec = [elm.next_sibling.strip() for elm in soup.select(".advDec a")]
print("Advanced:", adv)
print("Declined:", dec)
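The answers above depend on an <a> element that the HTML quoted in the question does not show. As a hedged alternative, here is a minimal sketch that assumes the markup is exactly as quoted, with the label and the number in the same <p class="advDec"> text. It also suggests why :nth-child(2) finds nothing in that markup: each <p> is the first child of its own col-sm-6 div. The counts dictionary is a hypothetical name.

```python
from bs4 import BeautifulSoup

# Markup copied from the question (assumption: the live page matches it).
html = """
<div class="row">
  <div class="col-sm-12"><h4>Stocks</h4></div>
  <div class="col-sm-6"><p class="advDec">Advanced: 1062</p></div>
  <div class="col-sm-6"><p class="advDec">Declined: 348</p></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
counts = {}  # hypothetical mapping: label -> number
for p in soup.select("p.advDec"):
    # "Advanced: 1062" -> label "Advanced", value " 1062"
    label, _, value = p.get_text(strip=True).partition(":")
    counts[label] = int(value)

print(counts)
```

Splitting on the colon avoids the fixed slice text[10:], which only works while both labels happen to have the same length.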
I am working on a scraping project and I am looking to scrape the following:
<div class="spec-subcat attributes-religion">
<span class="h5">Faith:</span>
<span>Christian</span>
<span>Islam</span>
</div>
I want to extract only Christian, Islam as the output (without the 'Faith:').
This is my attempt:
faithdiv = soup.find('div', class_='spec-subcat attributes-religion')
faith = faithdiv.find('span').text.strip()
How can I get this done?
There are several ways to fix this. I would suggest the following: find all <span> elements in the <div> that do not have class="h5":
soup.select('div.spec-subcat.attributes-religion span:not(.h5)')
Example
from bs4 import BeautifulSoup
html_text = '''
<div class="spec-subcat attributes-religion">
<span class="h5">Faith:</span>
<span>Christian</span>
<span>Islam</span>
</div>
'''
soup = BeautifulSoup(html_text, 'lxml')
print(', '.join([x.get_text() for x in soup.select('div.spec-subcat.attributes-religion span:not(.h5)')]))
Output
Christian, Islam
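If CSS selectors are not an option, the same filtering can be done with find_all and a plain Python condition. This is a sketch under the assumption that the label span is always the one carrying class h5; the name faiths is made up for the example.

```python
from bs4 import BeautifulSoup

html_text = """
<div class="spec-subcat attributes-religion">
<span class="h5">Faith:</span>
<span>Christian</span>
<span>Islam</span>
</div>
"""

soup = BeautifulSoup(html_text, "html.parser")
div = soup.find("div", class_="attributes-religion")
# keep every span except the label; spans without a class attribute return [] here
faiths = [
    span.get_text(strip=True)
    for span in div.find_all("span")
    if "h5" not in span.get("class", [])
]
print(", ".join(faiths))
```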
I am trying to find a value inside a div whose child tag has multiple attributes, using BS4. The HTML is:
<div class="size ">
<a class="selectSize"
id="25746"
data-ean="884751585616"
ata-test="170"
data-test1=""
data-test2="1061364-41"
data-test3-original="41"
data-test4-eu="41"
data-test5-uk="7"
data-test6-us="8"
data-test-cm="26"
</div>
</div>
I want to find data-test5-uk. My current code is:
soup = bs(size.text, "html.parser")
sizes = soup.find_all("div", {"class": "size"})
size = sizes[0]["data-test5-uk"]
size.text comes from a GET request to the site that serves this HTML; however, it raises:
size = sizes[0]["data-test5-uk"]
File "C:\Users\ninja_000\AppData\Local\Programs\Python\Python36\lib\site-packages\bs4\element.py", line 1011, in __getitem__
return self.attrs[key]
KeyError: 'data-test5-uk'
Any help is appreciated!
Explanation first, then the solution.
.find_all('tag') finds all instances of that tag, which we can then loop through.
.find('tag') finds ONLY the first instance.
We can read the value of an attribute with either ['attr'] or .get('attr'). Both return the same value when the attribute exists; when it is missing, ['attr'] raises a KeyError while .get('attr') returns None.
from bs4 import BeautifulSoup
# the <a> tag is closed here so that the snippet parses predictably
html = '''<div class="size ">
<a class="selectSize"
id="25746"
data-ean="884751585616"
ata-test="170"
data-test1=""
data-test2="1061364-41"
data-test3-original="41"
data-test4-eu="41"
data-test5-uk="7"
data-test6-us="8"
data-test-cm="26">
</a>
</div>'''
soup = BeautifulSoup(html, 'lxml')
# class_='size' matches regardless of the trailing space in class="size "
one_div = soup.find('div', class_='size')
print(one_div.find('a')['data-test5-uk'])
# your code didn't work because you were on the div, not on the a tag
# the attribute lives on the a tag inside the div, hence .find('a')['data-test5-uk']
# for multiple divs
for each in soup.find_all('div', class_='size'):
    # we loop through each instance and do the same
    datauk = each.find('a')['data-test5-uk']
    print('data-test5-uk:', datauk)
Output:
7
data-test5-uk: 7
Additional
Why did your ['arg'] fail? You tried to read ["data-test5-uk"] from the div. <div class="size "> has no such attribute; its only attribute is class="size ".
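The difference between the two lookup styles, and between the div and the tag nested in it, can be seen in a small sketch. The HTML is a trimmed, well-formed version of the question's snippet (an assumption for illustration), and missing is a hypothetical variable name.

```python
from bs4 import BeautifulSoup

# Trimmed version of the question's markup, closed properly for the example.
html = '<div class="size"><a class="selectSize" id="25746" data-test5-uk="7"></a></div>'
soup = BeautifulSoup(html, "html.parser")

div = soup.find("div", class_="size")
link = div.find("a")  # the attribute lives on this nested tag

print(link["data-test5-uk"])     # value read from the <a> tag
print(div.get("data-test5-uk"))  # the div has no such attribute, so .get() gives None

try:
    div["data-test5-uk"]         # same lookup with [] raises instead
except KeyError as exc:
    missing = exc.args[0]
    print("KeyError for:", missing)

# A CSS selector can reach the nested tag in one step.
print(soup.select_one("div.size a[data-test5-uk]")["data-test5-uk"])
```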
This is the ResultSet of running soup[0].find_all('div', {'class':'font-160 line-110'}):
[<div class="font-160 line-110" data-container=".snippet-container" data-html="true" data-placement="top" data-template='<div class="tooltip infowin-tooltip" role="tooltip"><div class="tooltip-arrow"><div class="tooltip-arrow-inner"></div></div><div class="tooltip-inner" style="text-align: left"></div></div>' data-toggle="tooltip" title="XIAMEN [CN]">
<a class="no-underline group-ib color-inherit"
href="/en/ais/details/ports/959">
<span class="text-default">CN</span><span class="text-default text-darker">XMN
</span>
</a>
</div>]
In an attempt to pull out XIAMEN [CN] from the title attribute, I could not use a[0].find('div')['title'] (where a is the BeautifulSoup ResultSet above). However, if I copy and paste that HTML into a new string, say,
b = '''<div class="font-160 line-110" data-container=".snippet container" data-html="true" data-placement="top" data-template='<div class="tooltip infowin-tooltip" role="tooltip"><div class="tooltip-arrow"><div class="tooltip-arrow-inner"></div></div><div class="tooltip-inner" style="text-align: left"></div></div>' data-toggle="tooltip" title="XIAMEN [CN]">'''
Then do:
>>> soup = BeautifulSoup(b, 'html.parser')
>>> soup.find('div')['title']
XIAMEN [CN]  # prints the contents of title
Why do I have to re-soup the soup? Why doesn't this work on my first search?
Edit, origin of soup:
I have a list of URLs that I'm going through via grequests. One of the things I'm looking for is the title that contains XIAMEN [CN].
So soup was created when I did
soup = []
for i in range(2):  # number of pages parsed
    rawSoup = BeautifulSoup(response[i].content, 'html.parser')
    souporigin = rawSoup.find_all('div', {'class': 'bg-default bg-white no- snippet-hide'})
    soup.append(souporigin)
The urls are
[
'http://www.marinetraffic.com/en/ais/details/ships/shipid:564352/imo:9643752/mmsi:511228000/vessel:DE%20MI',
'http://www.marinetraffic.com/en/ais/details/ships/shipid:3780155/imo:9712395/mmsi:477588800/vessel:SITC%20GUANGXI?cb=2267'
]
I found out that the problem was in how I set up my BeautifulSoup. I created a list of partial search results and then had to iterate over that list to search it again. I fixed this by searching for what I wanted in one line.
I changed:
soup = []
for i in range(2):  # number of pages parsed
    rawSoup = BeautifulSoup(response[i].content, 'html.parser')
    souporigin = rawSoup.find_all('div', {'class': 'bg-default bg-white no- snippet-hide'})
    soup.append(souporigin)
to:
a = soup.find("div", class_='font-160 line-110')["title"]
I run this search as soon as I create my soup, which removes a lot of redundancy in the code. I had been creating lists of ResultSets and having to use find on them for each new field.
You are using the wrong selection.
The selection soup[0].find_all('div', {'class':'font-160 line-110'}) finds the <div> itself, and you can see that <div> when you print the result. But when you add .find(), the search starts inside that <div>, so .find('div') looks for another div nested in the current div.
You need
a[0]['title']
When you create a new soup, the root element is not the div but [document], and the div is its child (the div is inside the root "tag"), so there you can use find('div').
>>> a[0].name
div
>>> soup = BeautifulSoup(b, 'html.parser')
>>> soup.name
[document]
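The same point can be reproduced end to end with a short sketch. The HTML is a simplified version of the ResultSet shown in the question (the data-template attribute is dropped for brevity, which is an assumption that does not affect the lookup).

```python
from bs4 import BeautifulSoup

# Simplified copy of the element from the question.
html = '''<div class="font-160 line-110" data-toggle="tooltip" title="XIAMEN [CN]">
<a class="no-underline group-ib color-inherit" href="/en/ais/details/ports/959">
<span class="text-default">CN</span><span class="text-default text-darker">XMN</span>
</a>
</div>'''

soup = BeautifulSoup(html, "html.parser")
a = soup.find_all("div", {"class": "font-160 line-110"})

print(a[0].name)         # each item of the ResultSet is already the div
print(a[0]["title"])     # so the attribute is read directly from it
print(a[0].find("div"))  # no div is nested inside this div, so this is None
print(soup.name)         # the soup itself is the [document] root
print(soup.find("div")["title"])  # from the root, find('div') reaches the div
```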
I'm trying to parse the below URL. I would like to get the output of all the prices on this site. The first item would be £59.
I inspected the element and found out that the html looks as below. I believe the best way would be to search for a class 'sr_gs_rackrate_total' or alternatively for a title that starts with "Price for".
How can I do this in Beautiful Soup?
<strong class="price scarcity_color sr_gs_rackrate_price
anim_rack_rate
" title="Price for 1 night £59">
<b>
<span class="sr_gs_rackrate_total">Total: </span>
£59
</b>
</strong>
http://www.booking.com/searchresults.en-gb.html?label=gen173nr-17CAEoggJCAlhYSDNiBW5vcmVmaFCIAQGYAS64AQTIAQTYAQHoAQH4AQs&sid=1a43e0952558ac0ad0061d5b6523a7bc&dcid=1&checkin_monthday=23;checkin_year_month=2016-1;checkout_monthday=24;checkout_year_month=2016-1;&city=-2601889&class_interval=1&csflt=%7B%7D&dtdisc=0&group_adults=7&group_children=0&highlighted_hotels=1192837&hlrd=0&hp_sbox=1&hyb_red=0&inac=0&label_click=undef&nflt=ht_id%3D201%3B&nha_red=0&no_rooms=1&redirected_from_city=0&redirected_from_landmark=0&redirected_from_region=0&review_score_group=empty&room1=A%2CA%2CA%2CA%2CA%2CA%2CA&sb_price_type=total&score_min=0&si=ai%2Cco%2Cci%2Cre%2Cdi&ss=London&ss_all=0&ssafas=1&ssb=empty&sshis=0&ssne=London&ssne_untouched=London&order=price_for_two
Here is one way to do that:
from bs4 import BeautifulSoup
soup = BeautifulSoup(yourhtml, 'html.parser')
span = soup.find('span', {'class': 'sr_gs_rackrate_total'})
b = span.parent
b.span.extract()
b.text
In case there is more than one span with a price in it, use
for span in soup.find_all('span', {'class': 'sr_gs_rackrate_total'}):
    b = span.parent
    b.span.extract()
    print(b.text)
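The question also mentions searching by a title that starts with "Price for". A minimal sketch of that variant follows, assuming the title always has the form shown in the question and the price is the number after the pound sign. The prices list and the regular expressions are illustrative choices, not something taken from the site.

```python
import re
from bs4 import BeautifulSoup

# Markup copied from the question.
html = '''<strong class="price scarcity_color sr_gs_rackrate_price
anim_rack_rate
" title="Price for 1 night £59">
<b>
<span class="sr_gs_rackrate_total">Total: </span>
£59
</b>
</strong>'''

soup = BeautifulSoup(html, "html.parser")
prices = []  # hypothetical list of price strings without the currency sign
# an attribute filter accepts a compiled regex; here it must match the start of the title
for strong in soup.find_all("strong", title=re.compile(r"^Price for")):
    match = re.search(r"£\s*([\d,.]+)", strong["title"])
    if match:
        prices.append(match.group(1))

print(prices)
```

Reading the price from the title attribute leaves the parsed tree untouched, whereas extract() removes the span from it.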