html is
<div class="trn-defstat__value">
<img src="https://trackercdn.com/rainbow6-ubi/assets/images/badge-ash.16913d82e3.png" title="ASH" style="height: 35px; padding-right: 8px;">
<img src="https://trackercdn.com/rainbow6-ubi/assets/images/badge-jager.600b2773be.png" title="JÄGER" style="height: 35px; padding-right: 8px;">
<img src="https://trackercdn.com/rainbow6-ubi/assets/images/badge-bandit.385144d970.png" title="BANDIT" style="height: 35px; padding-right: 8px;">
</div>
I want to get each title value.
But before that, I write like this
from bs4 import BeautifulSoup as bs
import requests
bsURL = "https://r6.tracker.network/profile/pc/Spoit.GODSENT"
respinse = requests.get(bsURL)
html = bs(respinse.text, 'html.parser')
title = html.find_all(class_='trn-defstat__value')[4]
print(title)
Result ->
<div class="trn-defstat__value">
<img src="https://trackercdn.com/rainbow6-ubi/assets/images/badge-ash.16913d82e3.png" style="height: 35px; padding-right: 8px;" title="ASH"/>
<img src="https://trackercdn.com/rainbow6-ubi/assets/images/badge-jager.600b2773be.png" style="height: 35px; padding-right: 8px;" title="JÄGER"/>
<img src="https://trackercdn.com/rainbow6-ubi/assets/images/badge-bandit.385144d970.png" style="height: 35px; padding-right: 8px;" title="BANDIT"/>
</div>
What should I do?
This script will print all <img> titles from Top Operators section:
from bs4 import BeautifulSoup as bs
import requests
bsURL = "https://r6.tracker.network/profile/pc/Spoit.GODSENT"
respinse = requests.get(bsURL)
html = bs(respinse.text, 'html.parser')
# find Top Operators tag
operators = html.find(class_='trn-defstat__name', text='Top Operators')
for img in operators.find_next('div').find_all('img'):
print(img['title'])
Prints:
ASH
JÄGER
BANDIT
Or using CSS:
for img in html.select('.trn-defstat__name:contains("Top Operators") + * img'):
print(img['title'])
Just use the .get() function to get the attribute and pass in the attribute name.
pip install html5lib
I suggest you use that, I believe it's a better parser.
from bs4 import BeautifulSoup as bs
import requests
bsURL = "https://r6.tracker.network/profile/pc/Spoit.GODSENT"
respinse = requests.get(bsURL)
html = bs(respinse.content, 'html5lib')
container = html.find("div", class_= "trn-defstat mb0 top-operators")
imgs = container.find_all("img")
for img in imgs:
print(img.get("title"))
I did not seem to understand what part of the site you were trying to scrape but take note of it to sometimes get first the block of html code where there are the details you want to scraped :)
This should help u:
from bs4 import BeautifulSoup
html = """
<div class="trn-defstat__value">
<img src="https://trackercdn.com/rainbow6-ubi/assets/images/badge-ash.16913d82e3.png" title="ASH" style="height: 35px; padding-right: 8px;">
<img src="https://trackercdn.com/rainbow6-ubi/assets/images/badge-jager.600b2773be.png" title="JÄGER" style="height: 35px; padding-right: 8px;">
<img src="https://trackercdn.com/rainbow6-ubi/assets/images/badge-bandit.385144d970.png" title="BANDIT" style="height: 35px; padding-right: 8px;">
</div>
"""
soup = BeautifulSoup(html,'html.parser')
imgs = soup.find_all('img')
for img in imgs:
print(img['title'])
Output:
ASH
JÄGER
BANDIT
Here is the complete code:
from bs4 import BeautifulSoup as bs
import requests
bsURL = "https://r6.tracker.network/profile/pc/Spoit.GODSENT"
respinse = requests.get(bsURL)
html = bs(respinse.text, 'html.parser')
divs = html.find_all('div',class_ = "trn-defstat__value")
imgs = []
for div in divs:
try:
imgs.append(div.find_all('img'))
except:
pass
imgs = [ele for ele in imgs if ele != []]
imgs = [j for sub in imgs for j in sub]
for img in imgs:
print(img['title'])
Output:
ASH
JÄGER
BANDIT
Related
I'm interested in stripping the s3 credientials from image tags within a block of text that is represented as a string in python.
For each tag in the string (of which there can be many), I'd like to start at ".jpeg", end at the next instance of a quotation mark, and delete everything inbetween those locations.
For example, the following string:
<p><img src="https://s3beanzoid.s3.us-east-2.amazonaws.com/media/django-summernote/2019-04-30/ec707c65-aa6d-4b81-a252-2fa1c1aef087.jpeg?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAJZALJ3EN746L6QWQ%2F20190430%2Fus-east-2%2Fs3%2Faws4_request&X-Amz-Date=20190430T021347Z&X-Amz-Expires=3600&X-Amz-SignedHeaders=host&X-Amz-Signature=daf406a830d7d0f1ac2d631603b95e7e2ce0bdacd58d5a383d35f6dcd1466012" style="width: 50%; float: right;" class="note-float-right"><br></p><p><br></p><p> This is extra text in the body.</p>
Would become:
<p><img src="https://s3beanzoid.s3.us-east-2.amazonaws.com/media/django-summernote/2019-04-30/ec707c65-aa6d-4b81-a252-2fa1c1aef087.jpeg" style="width: 50%; float: right;" class="note-float-right"><br></p><p><br></p><p> This is extra text in the body.</p>
I'm struggling to figure out how to do this. Any help would be appreciated.
Thanks!
Regex is not the tool for the job. A more robust solution is using a HTML parser like BeautifulSoup to extract the src attribute of the img tag, and a URL parser to remove the query from the URL:
from bs4 import BeautifulSoup
from urllib.parse import urlsplit
input_str = '''<p><img src="https://s3beanzoid.s3.us-east-2.amazonaws.com/media/django-summernote/2019-04-30/ec707c65-aa6d-4b81-a252-2fa1c1aef087.jpeg?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAJZALJ3EN746L6QWQ%2F20190430%2Fus-east-2%2Fs3%2Faws4_request&X-Amz-Date=20190430T021347Z&X-Amz-Expires=3600&X-Amz-SignedHeaders=host&X-Amz-Signature=daf406a830d7d0f1ac2d631603b95e7e2ce0bdacd58d5a383d35f6dcd1466012" style="width: 50%; float: right;" class="note-float-right"><br></p><p><br></p><p> This is extra text in the body.</p>'''
soup = BeautifulSoup(input_str, "html.parser")
img_url = soup.find('img')['src']
new_url = urlsplit(img_url)._replace(query=None).geturl()
soup.find('img')['src'] = new_url
print(soup)
Output:
<p><img class="note-float-right" src="https://s3beanzoid.s3.us-east-2.amazonaws.com/media/django-summernote/2019-04-30/ec707c65-aa6d-4b81-a252-2fa1c1aef087.jpeg" style="width: 50%; float: right;"/><br/></p><p><br/></p><p> This is extra text in the body.</p>
Edit: if you have more than one img tag per string, you can use:
input_str = '''<p><img src="https://s3beanzoid.s3.us-east-2.amazonaws.com/media/django-summernote/2019-04-30/ec707c65-aa6d-4b81-a252-2fa1c1aef087.jpeg?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAJZALJ3EN746L6QWQ%2F20190430%2Fus-east-2%2Fs3%2Faws4_request&X-Amz-Date=20190430T021347Z&X-Amz-Expires=3600&X-Amz-SignedHeaders=host&X-Amz-Signature=daf406a830d7d0f1ac2d631603b95e7e2ce0bdacd58d5a383d35f6dcd1466012" style="width: 50%; float: right;" class="note-float-right"><br></p><p><br></p><p> This is extra text in the body.</p>
<img src="https://s3beanzoid.s3.us-east-2.amazonaws.com/media/django-summernote/2019-04-30/ec707c65-aa6d-4b81-a252-2fa1c1aef087.jpeg?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAJZALJ3EN746L6QWQ%2F20190430%2Fus-east-2%2Fs3%2Faws4_request&X-Amz-Date=20190430T021347Z&X-Amz-Expires=3600&X-Amz-SignedHeaders=host&X-Amz-Signature=daf406a830d7d0f1ac2d631603b95e7e2ce0bdacd58d5a383d35f6dcd1466012" style="width: 50%; float: right;" class="note-float-right"><br><p><br></p><p> This is extra text in the body.</p>'''
soup = BeautifulSoup(input_str, "html.parser")
for img in soup.find_all('img'):
img_url = img['src']
new_url = urlsplit(img_url)._replace(query=None).geturl()
img['src'] = new_url
print(soup)
This will update the src attribute of each img tag:
<p><img class="note-float-right" src="https://s3beanzoid.s3.us-east-2.amazonaws.com/media/django-summernote/2019-04-30/ec707c65-aa6d-4b81-a252-2fa1c1aef087.jpeg" style="width: 50%; float: right;"/><br/></p><p><br/></p><p> This is extra text in the body.</p>
<img class="note-float-right" src="https://s3beanzoid.s3.us-east-2.amazonaws.com/media/django-summernote/2019-04-30/ec707c65-aa6d-4b81-a252-2fa1c1aef087.jpeg" style="width: 50%; float: right;"/><br/><p><br/></p><p> This is extra text in the body.</p>
Assuming the string is stored in s:
import re
re.sub('\.jpeg[^\"]+\"', '.jpeg', s)
This will look for areas that start with ".jpeg" and end with quotation marks and replace them with empty string.
Using re you can find and remove all between ? and "
text = re.sub('\?[^"]+', '', text)
Example code
text = '<p><img src="https://s3beanzoid.s3.us-east-2.amazonaws.com/media/django-summernote/2019-04-30/ec707c65-aa6d-4b81-a252-2fa1c1aef087.jpeg?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAJZALJ3EN746L6QWQ%2F20190430%2Fus-east-2%2Fs3%2Faws4_request&X-Amz-Date=20190430T021347Z&X-Amz-Expires=3600&X-Amz-SignedHeaders=host&X-Amz-Signature=daf406a830d7d0f1ac2d631603b95e7e2ce0bdacd58d5a383d35f6dcd1466012" style="width: 50%; float: right;" class="note-float-right"><br></p><p><br></p><p> This is extra text in the body.</p>'
expected_result = '<p><img src="https://s3beanzoid.s3.us-east-2.amazonaws.com/media/django-summernote/2019-04-30/ec707c65-aa6d-4b81-a252-2fa1c1aef087.jpeg" style="width: 50%; float: right;" class="note-float-right"><br></p><p><br></p><p> This is extra text in the body.</p>'
import re
result = re.sub('\?[^"]+', '', text)
print(result == expected_result) # True
EDIT: if there is text with ? and " then you can add more elements in regex
result = re.sub('\.jpeg\?[^"]+', '.jpeg', text)
Use BeautifulSoup to parse the html and then use urlparse
Ex:
from bs4 import BeautifulSoup
try:
from urllib.parse import urlparse #python3
except:
from urlparse import urlparse #python2
html = """<p><img src="https://s3beanzoid.s3.us-east-2.amazonaws.com/media/django-summernote/2019-04-30/ec707c65-aa6d-4b81-a252-2fa1c1aef087.jpeg?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAJZALJ3EN746L6QWQ%2F20190430%2Fus-east-2%2Fs3%2Faws4_request&X-Amz-Date=20190430T021347Z&X-Amz-Expires=3600&X-Amz-SignedHeaders=host&X-Amz-Signature=daf406a830d7d0f1ac2d631603b95e7e2ce0bdacd58d5a383d35f6dcd1466012" style="width: 50%; float: right;" class="note-float-right"><br></p><p><br></p><p> This is extra text in the body.</p>"""
soup = BeautifulSoup(html, "html.parser")
for img in soup.find_all("img"): #Find all img tags
o = urlparse(img["src"]) #Get URL
print(o.scheme + "://" + o.netloc + o.path)
Output:
https://s3beanzoid.s3.us-east-2.amazonaws.com/media/django-summernote/2019-04-30/ec707c65-aa6d-4b81-a252-2fa1c1aef087.jpeg
I'm trying to get the following span from this website:
https://www.indeed.com/jobs?q=data&l=New+York%2C+NY&explvl=entry_level
<span class="indeed-apply-widget indeed-apply-button-container js-IndeedApplyWidget indeed-apply-status-not-applied" aria-labelledby="indeed-apply-button-label" data-indeed-apply-jobtitle="Growth Associate" data-indeed-apply-apitoken="aa102235a5ccb18bd3668c0e14aa3ea7e2503cfac2a7a9bf3d6549899e125af4" data-indeed-apply-coverletter="optional" data-indeed-apply-resume="required" data-indeed-apply-jk="40da42b64688bda8" data-indeed-apply-jobid="19c5d6a1fff8d6ba9724" data-indeed-apply-joblocation="New York, NY" data-indeed-apply-jobcompanyname="Via" data-indeed-apply-joburl="https://www.indeed.com/viewjob?jk=40da42b64688bda8" data-indeed-apply-posturl="https://dradisindeedapply.sandbox.indeed.net/process-indeedapply" data-indeed-apply-jobmeta="{"vtk":"1csimi0m80g7f002", "tk":""}" data-indeed-apply-advnum="7404493598529036" data-indeed-apply-onapplied="indeedApplyHandleApply" data-indeed-apply-onclose="indeedApplyHandleModalClose" data-indeed-apply-onclick="indeedApplyHandleButtonClick" data-indeed-apply-oncontinueclick="indeedApplyHandleModalClose" data-indeed-apply-pingbackurl="https://gdc.indeed.com/conv/orgIndApp?trk.origin=unknown&jk=40da42b64688bda8&vjtk=1csimi0m80g7f002&advn=7404493598529036&co=US&acct_key=899c31afcc98f5e9&sj=0" data-indeed-apply-skipcontinue="false" data-acc-payload="1,2,22,1,144,1,552,1,3648,1,4392,1" style="padding: 0px !important; margin: 0px !important; text-indent: 0px !important; vertical-align: top !important; position: relative; zoom: 1 !important; display: inline-block;"><a class="indeed-apply-button" href="javascript:void(0);" id="indeed-ia-1542520898760-0"><span class="indeed-apply-button-inner" id="indeed-ia-1542520898760-0inner"><span class="indeed-apply-button-label" id="indeed-ia-1542520898760-0label">Apply Now</span><span class="indeed-apply-button-cm"><img src="https://d3fw5vlhllyvee.cloudfront.net/indeedapply/s/14096d1/check.png" style="border: 0px;"></span></span></a></span>
And I tried this code:
url = "https://www.indeed.com/jobs?q=data&l=New+York%2C+NY&explvl=entry_level"
html = urlopen(url).read().decode('utf-8')
soup = BeautifulSoup(html, features = 'lxml')
soup.find_all("span", {"class":"indeed-apply-widget indeed-apply-button-container js-IndeedApplyWidget indeed-apply-status-not-applied",
"aria-labelledby":"indeed-apply-button-label"})
But the result is [].
There is no such element on URL you mentioned above but it exist in /viewjob?jk=.. page.
The class in your code is generated by javascript, if you view page source the real class is indeed-apply-widget and it only has 1 element
# https://www.indeed.com/viewjob?jk=0ee200c5fc30ce02&from=recjobs&vjtk=1csj1b3nmbi4v800
soup.find("span", {"class":"indeed-apply-widget"})
Using Python and beautifulsoup, I need help extracting information from a parent div and a child div at the same time.
Here is the first example code:
<div id="slide-609becd056bb40a7ad42607a4d1c67f5"
class="slide has-link slick-slide"
data-label="April 2 2018 Acura TLX Offer 2000x700.jpg"
data-link="/new-inventory/index.htm?model=TLX&year=2018" data-target="_self"
style="background-image: url("https://pictures.dealer.com/a/adw/0877/5eabcb338dc604c09b28a4df5a49ad78x.jpg?impolicy=resize&h=514");
width: 1897px; position: relative; left: 0px; top: 0px; z-index: 998; opacity: 0; height: 514px; transition: opacity 750ms ease;" data-slick-index="0" aria-hidden="true" tabindex="-1" role="option" aria-describedby="slick-slide00">
Here is example code 2:
<div id="slide-7ae8b29ddc9e45d1a219beffe5793b2b"
class="html-slide slide slick-slide"
data-label="March-Madness.jpg" data-link="" data-target=""
data-promo-id="" data-slick-index="2" aria-hidden="true" tabindex="-1" role="option"
aria-describedby="slick-slide02"
style="width: 1897px; position: relative; left: -3794px; top: 0px; z-index: 998; opacity: 0; height: 514px; transition: opacity 750ms ease;">
<div class="slide-background"
style="background-image: linear-gradient(rgba(0, 0, 0, 0), rgba(0, 0, 0, 0)), url("https://pictures.dealer.com/g/goodsonacuraofdallasadw/1747/13ed067a023df8ad412feea2c6eddec9x.jpg?impolicy=resize&h=514"); height: 514px;">
<img src="https://pictures.dealer.com/g/goodsonacuraofdallasadw/1747/13ed067a023df8ad412feea2c6eddec9x.jpg?impolicy=resize&h=514" class="placeholder-image pull-left"> </div>
I need to get the style element from both examples of code so I can get the background image url. The issue is that the first code has the style in the parent div and the second set of code has the style in the child div. How do I get those two style elements at the same time using Python and beautifulsoup?
Here is the code I have tried:
import bs4
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
my_url = 'https://www.goodsonacura.com/'
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html, "html.parser")
banner_info = page_soup.findAll('div',{'class':['slide has-link', 'html-slide slide has-link']})
picture = [banner.get('style') for banner in banner_info]
This code gives me the correct style element for the first example code, but it gives me the wrong style element for the second example code.
Add "slide-background" class in the find_all query. See the example below:-
banner_info = page_soup.find_all('div',{'class':['slide has-link', 'html-slide slide has-link', 'slide-background']})
It works for me. May this helps you.
I want to try to extract the product name and price from the website using beautifulsoup. But I do not know how to extract the content.
Python code:
from bs4 import BeautifulSoup
import re
div = '<div pagetype="simple_table_nonFashion" class="itemBox"
id="itemSearchResultCon_679026"><p class="proPrice"><em class="num"
id="price0_679026" productid="679026" adproductflag="0" yhdprice="49.9"
productunit="" diapernum="0" diapernumunit=""><b>¥</b>49.90</em></p><p
class="proName clearfix"><a id="pdlink2_679026" pmid="0"
href="//item.yhd.com/679026.html"><style type="text/css">.preSellOrAppoint
{border: 1px solid #FFFFFF;}</style>印尼进口</a></p></div>'
soup = BeautifulSoup(div, "lxml")
itemBox = soup.find("div", {"class": "itemBox"})
proPrice = itemBox.find("p", {"class": "proPrice"}).find("em").text
pdlink2 = itemBox.find('a',{"id": re.compile('pdlink2_*')}).text
print(proPrice)
print(pdlink2)
Print out the result:
¥49.90
.preSellOrAppoint {border: 1px solid #FFFFFF;}印尼进口
The picture:
My expected result is the content:
49.90
印尼进口
With soup.select_one() method:
from bs4 import BeautifulSoup
div = '''<div pagetype="simple_table_nonFashion" class="itemBox"
id="itemSearchResultCon_679026"><p class="proPrice"><em class="num"
id="price0_679026" productid="679026" adproductflag="0" yhdprice="49.9"
productunit="" diapernum="0" diapernumunit=""><b>¥</b>49.90</em></p><p
class="proName clearfix"><a id="pdlink2_679026" pmid="0"
href="//item.yhd.com/679026.html"><style type="text/css">.preSellOrAppoint
{border: 1px solid #FFFFFF;}</style>印尼进口</a></p></div>'''
soup = BeautifulSoup(div, "lxml")
proPrice = soup.select_one("p.proPrice em").contents[-1]
pdlink2 = soup.select_one('p.proName > a').contents[-1]
print(proPrice)
print(pdlink2)
The output:
49.90
印尼进口
https://www.crummy.com/software/BeautifulSoup/bs4/doc/#css-selectors
Here's the code based on the BeautifulSoup object you provided:
from bs4 import BeautifulSoup
import re
div = '<div pagetype="simple_table_nonFashion" class="itemBox" id="itemSearchResultCon_679026"><p class="proPrice"><em class="num" id="price0_679026" productid="679026" adproductflag="0" yhdprice="49.9" productunit="" diapernum="0" diapernumunit=""><b>¥</b>49.90</em></p><p class="proName clearfix"><a id="pdlink2_679026" pmid="0" href="//item.yhd.com/679026.html"><style type="text/css">.preSellOrAppoint {border: 1px solid #FFFFFF;}</style>印尼进口</a></p></div>'
soup = BeautifulSoup(div, "lxml")
proPrice = soup.b.next_sibling
pdlink2 = soup.style.next_sibling
print(proPrice)
print(pdlink2)
.next_sibling allows you to access the text outside of the <b> and <style> tags.
I want to change only the background-color style with BeautifulSoup :
My html :
<td style="font-size: .8em; font-family: monospace; background-color: rgb(244, 244, 244);">
</td>
I would like to do something like this :
soup_td = BeautifulSoup(html_td, "html.parser")
soup_td.td["style"]["background-color"] = "red;"
That's a rather complicated answer above; you can also just do this:
for tag in soup.findAll(attrs={'class':'example'}):
tag['style'] = "color: red;"
Combine the soup.findAll with whatever selector of BeautifulSoup you'd like to use.
Use cssutils to manipulate CSS, like this:
from bs4 import BeautifulSoup
from cssutils import parseStyle
html = '<td style="font-size: .8em; font-family: monospace; background-color: rgb(244, 244, 244);"></td>'
# Create soup from html
soup = BeautifulSoup(html, 'html.parser')
# Parse td's styles
style = parseStyle(soup.td['style'])
# Change 'background-color' to 'red'
style['background-color'] = 'red'
# Replace td's styles in the soup with the modified styles
soup.td['style'] = style.cssText
# Outputs: <td style="font-size: .8em; font-family: monospace; background-color: red"></td>
print(soup.td)
You could also use regex if you're comfortable with using it.