I want to change only the background-color style with BeautifulSoup :
My html :
<td style="font-size: .8em; font-family: monospace; background-color: rgb(244, 244, 244);">
</td>
I would like to do something like this :
soup_td = BeautifulSoup(html_td, "html.parser")
soup_td.td["style"]["background-color"] = "red;"
That's a rather complicated answer above; you can also just do this:
for tag in soup.findAll(attrs={'class':'example'}):
tag['style'] = "color: red;"
Combine the soup.findAll with whatever selector of BeautifulSoup you'd like to use.
Use cssutils to manipulate CSS, like this:
from bs4 import BeautifulSoup
from cssutils import parseStyle
html = '<td style="font-size: .8em; font-family: monospace; background-color: rgb(244, 244, 244);"></td>'
# Create soup from html
soup = BeautifulSoup(html, 'html.parser')
# Parse td's styles
style = parseStyle(soup.td['style'])
# Change 'background-color' to 'red'
style['background-color'] = 'red'
# Replace td's styles in the soup with the modified styles
soup.td['style'] = style.cssText
# Outputs: <td style="font-size: .8em; font-family: monospace; background-color: red"></td>
print(soup.td)
You could also use regex if you're comfortable with using it.
Related
I'm trying to get the text Something here I want to get inside the div element from a html file using Python and BeautifulSoup.
This is how part of the code looks like in html:
<div xmlns="" id="idp46819314579224" style="box-sizing: border-box; width: 100%; margin: 0 0 10px 0; padding: 5px 10px; background: #d43f3a; font-weight: bold; font-size: 14px; line-height: 20px; color: #fff;" class="" onclick="toggleSection('idp46819314579224-container');" onmouseover="this.style.cursor='pointer'">Something here I want to get<div id="idp46819314579224-toggletext" style="float: right; text-align: center; width: 8px;">
-
</div>
</div>
And this is how I tried to do:
vu = soup.find_all("div", {"style" : "background: #d43f3a"})
for div in vu:
print(div.text)
I use loop because there are several div with different id but all of them has the same background colour. It has no errors, but I got no output.
How can I get the text using the background colour as the condition?
The style attribute has other content inside it
style="box-sizing: ....; ....;"
Your current code is asking if style == "background: #d43f3a" which it is not.
What you can do is ask if "background: #d43f3a" in style -- a sub-string check.
One approach is passing a regular expression.
>>> import re
>>> vu = soup.find_all("div", style=re.compile("background: #d43f3a"))
...
... for div in vu:
... print(div.text.strip())
Something here I want to get
You can also say the same thing using CSS Selectors
soup.select('div[style*="background: #d43f3a"]')
Or by passing a function/lambda
>>> vu = soup.find_all("div", style=lambda style: "background: #d43f3a" in style)
...
... for div in vu:
... print(div.text.strip())
Something here I want to get
html is
<div class="trn-defstat__value">
<img src="https://trackercdn.com/rainbow6-ubi/assets/images/badge-ash.16913d82e3.png" title="ASH" style="height: 35px; padding-right: 8px;">
<img src="https://trackercdn.com/rainbow6-ubi/assets/images/badge-jager.600b2773be.png" title="JÄGER" style="height: 35px; padding-right: 8px;">
<img src="https://trackercdn.com/rainbow6-ubi/assets/images/badge-bandit.385144d970.png" title="BANDIT" style="height: 35px; padding-right: 8px;">
</div>
I want to get each title value.
But before that, I write like this
from bs4 import BeautifulSoup as bs
import requests
bsURL = "https://r6.tracker.network/profile/pc/Spoit.GODSENT"
respinse = requests.get(bsURL)
html = bs(respinse.text, 'html.parser')
title = html.find_all(class_='trn-defstat__value')[4]
print(title)
Result ->
<div class="trn-defstat__value">
<img src="https://trackercdn.com/rainbow6-ubi/assets/images/badge-ash.16913d82e3.png" style="height: 35px; padding-right: 8px;" title="ASH"/>
<img src="https://trackercdn.com/rainbow6-ubi/assets/images/badge-jager.600b2773be.png" style="height: 35px; padding-right: 8px;" title="JÄGER"/>
<img src="https://trackercdn.com/rainbow6-ubi/assets/images/badge-bandit.385144d970.png" style="height: 35px; padding-right: 8px;" title="BANDIT"/>
</div>
What should I do?
This script will print all <img> titles from Top Operators section:
from bs4 import BeautifulSoup as bs
import requests
bsURL = "https://r6.tracker.network/profile/pc/Spoit.GODSENT"
respinse = requests.get(bsURL)
html = bs(respinse.text, 'html.parser')
# find Top Operators tag
operators = html.find(class_='trn-defstat__name', text='Top Operators')
for img in operators.find_next('div').find_all('img'):
print(img['title'])
Prints:
ASH
JÄGER
BANDIT
Or using CSS:
for img in html.select('.trn-defstat__name:contains("Top Operators") + * img'):
print(img['title'])
Just use the .get() function to get the attribute and pass in the attribute name.
pip install html5lib
I suggest you use that, I believe it's a better parser.
from bs4 import BeautifulSoup as bs
import requests
bsURL = "https://r6.tracker.network/profile/pc/Spoit.GODSENT"
respinse = requests.get(bsURL)
html = bs(respinse.content, 'html5lib')
container = html.find("div", class_= "trn-defstat mb0 top-operators")
imgs = container.find_all("img")
for img in imgs:
print(img.get("title"))
I did not seem to understand what part of the site you were trying to scrape but take note of it to sometimes get first the block of html code where there are the details you want to scraped :)
This should help u:
from bs4 import BeautifulSoup
html = """
<div class="trn-defstat__value">
<img src="https://trackercdn.com/rainbow6-ubi/assets/images/badge-ash.16913d82e3.png" title="ASH" style="height: 35px; padding-right: 8px;">
<img src="https://trackercdn.com/rainbow6-ubi/assets/images/badge-jager.600b2773be.png" title="JÄGER" style="height: 35px; padding-right: 8px;">
<img src="https://trackercdn.com/rainbow6-ubi/assets/images/badge-bandit.385144d970.png" title="BANDIT" style="height: 35px; padding-right: 8px;">
</div>
"""
soup = BeautifulSoup(html,'html.parser')
imgs = soup.find_all('img')
for img in imgs:
print(img['title'])
Output:
ASH
JÄGER
BANDIT
Here is the complete code:
from bs4 import BeautifulSoup as bs
import requests
bsURL = "https://r6.tracker.network/profile/pc/Spoit.GODSENT"
respinse = requests.get(bsURL)
html = bs(respinse.text, 'html.parser')
divs = html.find_all('div',class_ = "trn-defstat__value")
imgs = []
for div in divs:
try:
imgs.append(div.find_all('img'))
except:
pass
imgs = [ele for ele in imgs if ele != []]
imgs = [j for sub in imgs for j in sub]
for img in imgs:
print(img['title'])
Output:
ASH
JÄGER
BANDIT
I want to try to extract the product name and price from the website using beautifulsoup. But I do not know how to extract the content.
Python code:
from bs4 import BeautifulSoup
import re
div = '<div pagetype="simple_table_nonFashion" class="itemBox"
id="itemSearchResultCon_679026"><p class="proPrice"><em class="num"
id="price0_679026" productid="679026" adproductflag="0" yhdprice="49.9"
productunit="" diapernum="0" diapernumunit=""><b>¥</b>49.90</em></p><p
class="proName clearfix"><a id="pdlink2_679026" pmid="0"
href="//item.yhd.com/679026.html"><style type="text/css">.preSellOrAppoint
{border: 1px solid #FFFFFF;}</style>印尼进口</a></p></div>'
soup = BeautifulSoup(div, "lxml")
itemBox = soup.find("div", {"class": "itemBox"})
proPrice = itemBox.find("p", {"class": "proPrice"}).find("em").text
pdlink2 = itemBox.find('a',{"id": re.compile('pdlink2_*')}).text
print(proPrice)
print(pdlink2)
Print out the result:
¥49.90
.preSellOrAppoint {border: 1px solid #FFFFFF;}印尼进口
The picture:
My expected result is the content:
49.90
印尼进口
With soup.select_one() method:
from bs4 import BeautifulSoup
div = '''<div pagetype="simple_table_nonFashion" class="itemBox"
id="itemSearchResultCon_679026"><p class="proPrice"><em class="num"
id="price0_679026" productid="679026" adproductflag="0" yhdprice="49.9"
productunit="" diapernum="0" diapernumunit=""><b>¥</b>49.90</em></p><p
class="proName clearfix"><a id="pdlink2_679026" pmid="0"
href="//item.yhd.com/679026.html"><style type="text/css">.preSellOrAppoint
{border: 1px solid #FFFFFF;}</style>印尼进口</a></p></div>'''
soup = BeautifulSoup(div, "lxml")
proPrice = soup.select_one("p.proPrice em").contents[-1]
pdlink2 = soup.select_one('p.proName > a').contents[-1]
print(proPrice)
print(pdlink2)
The output:
49.90
印尼进口
https://www.crummy.com/software/BeautifulSoup/bs4/doc/#css-selectors
Here's the code based on the BeautifulSoup object you provided:
from bs4 import BeautifulSoup
import re
div = '<div pagetype="simple_table_nonFashion" class="itemBox" id="itemSearchResultCon_679026"><p class="proPrice"><em class="num" id="price0_679026" productid="679026" adproductflag="0" yhdprice="49.9" productunit="" diapernum="0" diapernumunit=""><b>¥</b>49.90</em></p><p class="proName clearfix"><a id="pdlink2_679026" pmid="0" href="//item.yhd.com/679026.html"><style type="text/css">.preSellOrAppoint {border: 1px solid #FFFFFF;}</style>印尼进口</a></p></div>'
soup = BeautifulSoup(div, "lxml")
proPrice = soup.b.next_sibling
pdlink2 = soup.style.next_sibling
print(proPrice)
print(pdlink2)
.next_sibling allows you to access the text outside of the <b> and <style> tags.
I am trying to extract 'Italic' Content from a pdf in python. I have converted the pdf to html so that I can use the italic tag to extract the text.
Here is how the html looks like
<br></span></div><div style="position:absolute; border: textbox 1px
solid; writing-mode:lr-tb; left:71px; top:225px; width:422px;
height:15px;"><span style="font-family: TTPGFA+Symbol; font-
size:12px">•</span><span style="font-family: YUWTQX+ArialMT; font-
size:14px"> Kornai, Janos. 1992. </span><span style="font-family:
PUCJZV+Arial-ItalicMT; font-size:14px">The Socialist System: The
Political Economy of Communism</span><span style="font-family:
YUWTQX+ArialMT; font-size:14px">.
This is how the code looks:
from bs4 import BeautifulSoup
soup = BeautifulSoup(open("/../..myfile.html"))
bTags = []
for i in soup.findAll('span'):
bTags.append(i.text)
I am not sure how can I get only the italic text.
Try this:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html)
bTags = []
for i in soup.find_all('span', style=lambda x: x and 'Italic' in x):
bTags.append(i.text)
print bTags
Passing a function to the style argument will filter results by the result of that function, with its input as the value of the style attribute. We check to see if the string Italic is inside the attribute, and if so, return True.
You may need a more sophisticated algorithm depending on the rest of what your HTML looks like.
I have some html code where there's a lot of lines that I want to remove that look like this
<span style="position:absolute; border: black 1px solid; left:94px; top:600px; width:6px; height:10px;"></span>
Now there are also span tags that have text in between them and I want to keep.
I want to use the python re.sub function to delete those useless span tags. I wrote this but it is not working
html_code_filtered = re.sub('<span*></span>', '', html_code)
I guess I'm missing something on the regular expression to match the lines correctly?
You can use an HTML Parser like BeautifulSoup to remove the span elements with no text.
Working example:
from bs4 import BeautifulSoup
data = """
<div>
<span style="position:absolute; border: black 1px solid; left:94px; top:600px; width:6px; height:10px;"></span>
<span>useful text</span>
<span></span>
</div>
"""
soup = BeautifulSoup(data, "html.parser")
# find and remove "span" elements with empty contents
for useless in soup.find_all("span", text=lambda text: not text):
useless.extract()
print(soup.prettify())
Prints (as you can see span elements with no contents were removed):
<div>
<span>
useful text
</span>
</div>
The problem here is that n* looks for the character n repeated zero or more times. You can use .*? to match all characters until the next > character.
>>> html_code = '<span style="position:absolute; border: black 1px solid; left:94px; top:600px; width:6px; height:10px;"></span>'
>>> re.sub('<span.*?></span>', '', html_code)
''
That being said, refer to maazaa's comment and the answers using a proper html parser for more complex parsing tasks.