Python extract italic content from html - python

I am trying to extract italic content from a PDF in Python. I have converted the PDF to HTML so that I can use the italic styling to extract the text.
Here is what the HTML looks like:
<br></span></div><div style="position:absolute; border: textbox 1px
solid; writing-mode:lr-tb; left:71px; top:225px; width:422px;
height:15px;"><span style="font-family: TTPGFA+Symbol; font-
size:12px">•</span><span style="font-family: YUWTQX+ArialMT; font-
size:14px"> Kornai, Janos. 1992. </span><span style="font-family:
PUCJZV+Arial-ItalicMT; font-size:14px">The Socialist System: The
Political Economy of Communism</span><span style="font-family:
YUWTQX+ArialMT; font-size:14px">.
This is how the code looks:
from bs4 import BeautifulSoup
soup = BeautifulSoup(open("/../..myfile.html"))
bTags = []
for i in soup.findAll('span'):
    bTags.append(i.text)
I am not sure how I can get only the italic text.

Try this:
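from bs4 import BeautifulSoup
# html is assumed to hold the converted document as a string
soup = BeautifulSoup(html, "html.parser")
bTags = []
# keep only spans whose style attribute mentions an italic font
for i in soup.find_all('span', style=lambda x: x and 'Italic' in x):
    bTags.append(i.text)
print(bTags)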
Passing a function as the style argument filters tags by that function's return value. The function receives the value of the style attribute, or None when the tag has no style, which is why the lambda starts with "x and". It returns a truthy value only when the string Italic appears in the attribute.
You may need a more sophisticated algorithm depending on the rest of what your HTML looks like.
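One way such a stricter check might look is sketched below. It is only an illustration: the helper name is_italic_style is made up, and it assumes the converter marks italics either through the font name in font-family (as in the question's HTML) or through a font-style declaration.

```python
import re

# Hypothetical helper, not part of BeautifulSoup.
# Look for "italic"/"oblique" only inside the font-family value, or in font-style.
FAMILY_ITALIC = re.compile(r"font-family:\s*[^;]*(?:italic|oblique)", re.IGNORECASE)
STYLE_ITALIC = re.compile(r"font-style:\s*(?:italic|oblique)", re.IGNORECASE)

def is_italic_style(style):
    """Return True if an inline style value declares an italic font."""
    if not style:
        return False  # tags without a style attribute pass None
    return bool(FAMILY_ITALIC.search(style) or STYLE_ITALIC.search(style))

print(is_italic_style("font-family: PUCJZV+Arial-ItalicMT; font-size:14px"))
print(is_italic_style("font-family: YUWTQX+ArialMT; font-size:14px"))
```

The function can be passed in place of the lambda, for example soup.find_all('span', style=is_italic_style). Unlike a plain substring test, it ignores the word Italic when it appears in some unrelated declaration.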

Related

Selecting and stripping img src in HTML string

I'm interested in stripping the S3 credentials from image tags within a block of text that is represented as a string in Python.
For each tag in the string (of which there can be many), I'd like to start at ".jpeg", end at the next quotation mark, and delete everything in between.
For example, the following string:
<p><img src="https://s3beanzoid.s3.us-east-2.amazonaws.com/media/django-summernote/2019-04-30/ec707c65-aa6d-4b81-a252-2fa1c1aef087.jpeg?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAJZALJ3EN746L6QWQ%2F20190430%2Fus-east-2%2Fs3%2Faws4_request&X-Amz-Date=20190430T021347Z&X-Amz-Expires=3600&X-Amz-SignedHeaders=host&X-Amz-Signature=daf406a830d7d0f1ac2d631603b95e7e2ce0bdacd58d5a383d35f6dcd1466012" style="width: 50%; float: right;" class="note-float-right"><br></p><p><br></p><p> This is extra text in the body.</p>
Would become:
<p><img src="https://s3beanzoid.s3.us-east-2.amazonaws.com/media/django-summernote/2019-04-30/ec707c65-aa6d-4b81-a252-2fa1c1aef087.jpeg" style="width: 50%; float: right;" class="note-float-right"><br></p><p><br></p><p> This is extra text in the body.</p>
I'm struggling to figure out how to do this. Any help would be appreciated.
Thanks!
Regex is not the right tool for this job. A more robust solution is to use an HTML parser like BeautifulSoup to extract the src attribute of the img tag, and a URL parser to remove the query from the URL:
from bs4 import BeautifulSoup
from urllib.parse import urlsplit
input_str = '''<p><img src="https://s3beanzoid.s3.us-east-2.amazonaws.com/media/django-summernote/2019-04-30/ec707c65-aa6d-4b81-a252-2fa1c1aef087.jpeg?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAJZALJ3EN746L6QWQ%2F20190430%2Fus-east-2%2Fs3%2Faws4_request&X-Amz-Date=20190430T021347Z&X-Amz-Expires=3600&X-Amz-SignedHeaders=host&X-Amz-Signature=daf406a830d7d0f1ac2d631603b95e7e2ce0bdacd58d5a383d35f6dcd1466012" style="width: 50%; float: right;" class="note-float-right"><br></p><p><br></p><p> This is extra text in the body.</p>'''
soup = BeautifulSoup(input_str, "html.parser")
img_url = soup.find('img')['src']
new_url = urlsplit(img_url)._replace(query=None).geturl()
soup.find('img')['src'] = new_url
print(soup)
Output:
<p><img class="note-float-right" src="https://s3beanzoid.s3.us-east-2.amazonaws.com/media/django-summernote/2019-04-30/ec707c65-aa6d-4b81-a252-2fa1c1aef087.jpeg" style="width: 50%; float: right;"/><br/></p><p><br/></p><p> This is extra text in the body.</p>
Edit: if you have more than one img tag per string, you can use:
input_str = '''<p><img src="https://s3beanzoid.s3.us-east-2.amazonaws.com/media/django-summernote/2019-04-30/ec707c65-aa6d-4b81-a252-2fa1c1aef087.jpeg?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAJZALJ3EN746L6QWQ%2F20190430%2Fus-east-2%2Fs3%2Faws4_request&X-Amz-Date=20190430T021347Z&X-Amz-Expires=3600&X-Amz-SignedHeaders=host&X-Amz-Signature=daf406a830d7d0f1ac2d631603b95e7e2ce0bdacd58d5a383d35f6dcd1466012" style="width: 50%; float: right;" class="note-float-right"><br></p><p><br></p><p> This is extra text in the body.</p>
<img src="https://s3beanzoid.s3.us-east-2.amazonaws.com/media/django-summernote/2019-04-30/ec707c65-aa6d-4b81-a252-2fa1c1aef087.jpeg?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAJZALJ3EN746L6QWQ%2F20190430%2Fus-east-2%2Fs3%2Faws4_request&X-Amz-Date=20190430T021347Z&X-Amz-Expires=3600&X-Amz-SignedHeaders=host&X-Amz-Signature=daf406a830d7d0f1ac2d631603b95e7e2ce0bdacd58d5a383d35f6dcd1466012" style="width: 50%; float: right;" class="note-float-right"><br><p><br></p><p> This is extra text in the body.</p>'''
soup = BeautifulSoup(input_str, "html.parser")
for img in soup.find_all('img'):
    img_url = img['src']
    new_url = urlsplit(img_url)._replace(query=None).geturl()
    img['src'] = new_url
print(soup)
This will update the src attribute of each img tag:
<p><img class="note-float-right" src="https://s3beanzoid.s3.us-east-2.amazonaws.com/media/django-summernote/2019-04-30/ec707c65-aa6d-4b81-a252-2fa1c1aef087.jpeg" style="width: 50%; float: right;"/><br/></p><p><br/></p><p> This is extra text in the body.</p>
<img class="note-float-right" src="https://s3beanzoid.s3.us-east-2.amazonaws.com/media/django-summernote/2019-04-30/ec707c65-aa6d-4b81-a252-2fa1c1aef087.jpeg" style="width: 50%; float: right;"/><br/><p><br/></p><p> This is extra text in the body.</p>
Assuming the string is stored in s:
import re
s = re.sub(r'\.jpeg[^"]*"', '.jpeg"', s)
This looks for spans that start with ".jpeg" and end at the next quotation mark, and replaces each one with ".jpeg" plus the closing quotation mark, so the attribute stays properly quoted.
Using re you can find and remove everything between ? and the next ":
text = re.sub(r'\?[^"]+', '', text)
Example code
text = '<p><img src="https://s3beanzoid.s3.us-east-2.amazonaws.com/media/django-summernote/2019-04-30/ec707c65-aa6d-4b81-a252-2fa1c1aef087.jpeg?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAJZALJ3EN746L6QWQ%2F20190430%2Fus-east-2%2Fs3%2Faws4_request&X-Amz-Date=20190430T021347Z&X-Amz-Expires=3600&X-Amz-SignedHeaders=host&X-Amz-Signature=daf406a830d7d0f1ac2d631603b95e7e2ce0bdacd58d5a383d35f6dcd1466012" style="width: 50%; float: right;" class="note-float-right"><br></p><p><br></p><p> This is extra text in the body.</p>'
expected_result = '<p><img src="https://s3beanzoid.s3.us-east-2.amazonaws.com/media/django-summernote/2019-04-30/ec707c65-aa6d-4b81-a252-2fa1c1aef087.jpeg" style="width: 50%; float: right;" class="note-float-right"><br></p><p><br></p><p> This is extra text in the body.</p>'
import re
result = re.sub(r'\?[^"]+', '', text)
print(result == expected_result) # True
EDIT: if the text contains other ? and " characters, you can make the regex more specific:
result = re.sub(r'\.jpeg\?[^"]+', '.jpeg', text)
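If the text may also contain question marks inside other quoted attributes (link hrefs, for example), the substitution can be limited to the src attribute of img tags. This is a minimal sketch that assumes double-quoted attribute values; the name strip_img_query and the sample markup are made up for illustration, and a real HTML parser remains the sturdier choice.

```python
import re

# Hypothetical helper: drop the query string from src="..." inside <img> tags only.
IMG_SRC_QUERY = re.compile(r'(<img\b[^>]*?\ssrc="[^"?]*)\?[^"]*')

def strip_img_query(text):
    # Group 1 is everything up to (not including) the "?"; the query is discarded.
    return IMG_SRC_QUERY.sub(r'\1', text)

sample = ('<a href="/search?q=1">find</a>'
          '<img class="c" src="https://example.com/a.jpeg?X-Amz-Signature=abc">')
print(strip_img_query(sample))
```

The href keeps its query because the pattern only starts matching at an img tag.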
Use BeautifulSoup to parse the html and then use urlparse
Ex:
from bs4 import BeautifulSoup
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse  # Python 2
html = """<p><img src="https://s3beanzoid.s3.us-east-2.amazonaws.com/media/django-summernote/2019-04-30/ec707c65-aa6d-4b81-a252-2fa1c1aef087.jpeg?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAJZALJ3EN746L6QWQ%2F20190430%2Fus-east-2%2Fs3%2Faws4_request&X-Amz-Date=20190430T021347Z&X-Amz-Expires=3600&X-Amz-SignedHeaders=host&X-Amz-Signature=daf406a830d7d0f1ac2d631603b95e7e2ce0bdacd58d5a383d35f6dcd1466012" style="width: 50%; float: right;" class="note-float-right"><br></p><p><br></p><p> This is extra text in the body.</p>"""
soup = BeautifulSoup(html, "html.parser")
for img in soup.find_all("img"):  # find all img tags
    o = urlparse(img["src"])  # split the URL into its components
    print(o.scheme + "://" + o.netloc + o.path)  # rebuild it without the query
Output:
https://s3beanzoid.s3.us-east-2.amazonaws.com/media/django-summernote/2019-04-30/ec707c65-aa6d-4b81-a252-2fa1c1aef087.jpeg
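Concatenating scheme, "://", netloc and path works for absolute URLs like the one above, but it would produce a stray "://" for relative or protocol-relative src values. A small sketch of an alternative, using only the standard library, lets urlunsplit reassemble the parts; the helper name strip_query and the example URLs are illustrative.

```python
from urllib.parse import urlsplit, urlunsplit

def strip_query(url):
    """Return url without its query string and fragment."""
    parts = urlsplit(url)
    # Empty strings for query and fragment make urlunsplit omit them.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))

print(strip_query("https://example.com/media/a.jpeg?X-Amz-Signature=abc&X-Amz-Expires=3600"))
print(strip_query("//example.com/media/a.jpeg?x=1"))
```

The result can be assigned back with img["src"] = strip_query(img["src"]) inside the loop.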

How to automate scraping wikipedia info box specifically and print the data using python for any wiki page?

My task is to automate printing the Wikipedia infobox data. As an example, I am scraping the Star Trek Wikipedia page (https://en.wikipedia.org/wiki/Star_Trek); I want to extract the infobox section on the right-hand side and print it row by row on screen using Python. I specifically want the infobox. So far I have done this:
from bs4 import BeautifulSoup
import urllib.request
# specify the url
urlpage = 'https://en.wikipedia.org/wiki/Star_Trek'
# query the website and return the html to the variable 'page'
page = urllib.request.urlopen(urlpage)
# parse the html using beautiful soup and store in variable 'soup'
soup = BeautifulSoup(page, 'html.parser')
# find results within table
table = soup.find('table', attrs={'class': 'infobox vevent'})
results = table.find_all('tr')
print(type(results))
print('Number of results', len(results))
print(results)
This gives me everything from the info box. A snippet is shown below:
[<tr><th class="summary" colspan="2" style="text-align:center;font-
size:125%;font-weight:bold;font-style: italic; background: lavender;">
<i>Star Trek</i></th></tr>, <tr><td colspan="2" style="text-align:center">
<a class="image" href="/wiki/File:Star_Trek_TOS_logo.svg"><img alt="Star
Trek TOS logo.svg" data-file-height="132" data-file-width="560" height="59"
I want to extract only the data and print it on screen. So what I want is:
Created by Gene Roddenberry
Original work Star Trek: The Original Series
Print publications
Book(s)
List of reference books
List of technical manuals
Novel(s) List of novels
Comics List of comics
Magazine(s)
Star Trek: The Magazine
Star Trek Magazine
And so on until the end of the infobox. So basically, is there a way of printing every row of the infobox data so that I can automate it for any wiki page? (The class of the infobox table on all wiki pages is 'infobox vevent', as shown in the code.)
This page should help you extract the text of your HTML as a plain string without the tags: "Using BeautifulSoup Extract Text without Tags".
This is code from that page; it was posted by user 0605002.
>>> html = """
<p>
<strong class="offender">YOB:</strong> 1987<br />
<strong class="offender">RACE:</strong> WHITE<br />
<strong class="offender">GENDER:</strong> FEMALE<br />
<strong class="offender">HEIGHT:</strong> 5'05''<br />
<strong class="offender">WEIGHT:</strong> 118<br />
<strong class="offender">EYE COLOR:</strong> GREEN<br />
<strong class="offender">HAIR COLOR:</strong> BROWN<br />
</p>
"""
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html, "html.parser")
>>> print(soup.text)
YOB: 1987
RACE: WHITE
GENDER: FEMALE
HEIGHT: 5'05''
WEIGHT: 118
EYE COLOR: GREEN
HAIR COLOR: BROWN
With BeautifulSoup, you need to reformat the data yourself. Use fresult = [e.text for e in results] to get the text of each row.
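A rough sketch of that reformatting step follows. It uses a small made-up table in place of the live page, since real infobox markup is more varied, and the helper name infobox_rows is hypothetical. It also assumes that matching on the single class "infobox" is acceptable, which is looser than the exact 'infobox vevent' string used in the question.

```python
from bs4 import BeautifulSoup

# Made-up stand-in for an infobox table.
SAMPLE = """
<table class="infobox vevent">
  <tr><th colspan="2">Star Trek</th></tr>
  <tr><th>Created by</th><td>Gene Roddenberry</td></tr>
  <tr><th>Original work</th><td><i>Star Trek: The Original Series</i></td></tr>
  <tr><td colspan="2"><img alt="logo"/></td></tr>
</table>
"""

def infobox_rows(html):
    """Return one text line per non-empty infobox row."""
    soup = BeautifulSoup(html, "html.parser")
    # class_="infobox" matches any table whose class list contains "infobox"
    table = soup.find("table", class_="infobox")
    if table is None:
        return []
    rows = []
    for tr in table.find_all("tr"):
        # collect the text of header and data cells, skipping empty ones
        cells = [c.get_text(" ", strip=True) for c in tr.find_all(["th", "td"])]
        cells = [c for c in cells if c]
        if cells:
            rows.append(" ".join(cells))
    return rows

for line in infobox_rows(SAMPLE):
    print(line)
```

Rows that contain only an image produce no text and are skipped. For a live page, the same function could be given the HTML returned by urllib.request.urlopen(urlpage).read().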
If you want to read an HTML table, you can try something like this, though it uses pandas.
import pandas
urlpage = 'https://en.wikipedia.org/wiki/Star_Trek'
# read_html returns one DataFrame per table found; take the first one
data = pandas.read_html(urlpage)[0]
null = data.isnull()
for x in range(len(data)):
    first = data.iloc[x, 0]
    second = data.iloc[x, 1] if not null.iloc[x, 1] else ""
    print(first, second, "\n")

How to extract the content using beautifulsoup

I want to try to extract the product name and price from the website using beautifulsoup. But I do not know how to extract the content.
Python code:
from bs4 import BeautifulSoup
import re
div = '''<div pagetype="simple_table_nonFashion" class="itemBox"
id="itemSearchResultCon_679026"><p class="proPrice"><em class="num"
id="price0_679026" productid="679026" adproductflag="0" yhdprice="49.9"
productunit="" diapernum="0" diapernumunit=""><b>¥</b>49.90</em></p><p
class="proName clearfix"><a id="pdlink2_679026" pmid="0"
href="//item.yhd.com/679026.html"><style type="text/css">.preSellOrAppoint
{border: 1px solid #FFFFFF;}</style>印尼进口</a></p></div>'''
soup = BeautifulSoup(div, "lxml")
itemBox = soup.find("div", {"class": "itemBox"})
proPrice = itemBox.find("p", {"class": "proPrice"}).find("em").text
pdlink2 = itemBox.find('a',{"id": re.compile('pdlink2_*')}).text
print(proPrice)
print(pdlink2)
Print out the result:
¥49.90
.preSellOrAppoint {border: 1px solid #FFFFFF;}印尼进口
My expected result is the content:
49.90
印尼进口
With soup.select_one() method:
from bs4 import BeautifulSoup
div = '''<div pagetype="simple_table_nonFashion" class="itemBox"
id="itemSearchResultCon_679026"><p class="proPrice"><em class="num"
id="price0_679026" productid="679026" adproductflag="0" yhdprice="49.9"
productunit="" diapernum="0" diapernumunit=""><b>¥</b>49.90</em></p><p
class="proName clearfix"><a id="pdlink2_679026" pmid="0"
href="//item.yhd.com/679026.html"><style type="text/css">.preSellOrAppoint
{border: 1px solid #FFFFFF;}</style>印尼进口</a></p></div>'''
soup = BeautifulSoup(div, "lxml")
proPrice = soup.select_one("p.proPrice em").contents[-1]
pdlink2 = soup.select_one('p.proName > a').contents[-1]
print(proPrice)
print(pdlink2)
The output:
49.90
印尼进口
https://www.crummy.com/software/BeautifulSoup/bs4/doc/#css-selectors
Here's the code based on the BeautifulSoup object you provided:
from bs4 import BeautifulSoup
import re
div = '<div pagetype="simple_table_nonFashion" class="itemBox" id="itemSearchResultCon_679026"><p class="proPrice"><em class="num" id="price0_679026" productid="679026" adproductflag="0" yhdprice="49.9" productunit="" diapernum="0" diapernumunit=""><b>¥</b>49.90</em></p><p class="proName clearfix"><a id="pdlink2_679026" pmid="0" href="//item.yhd.com/679026.html"><style type="text/css">.preSellOrAppoint {border: 1px solid #FFFFFF;}</style>印尼进口</a></p></div>'
soup = BeautifulSoup(div, "lxml")
proPrice = soup.b.next_sibling
pdlink2 = soup.style.next_sibling
print(proPrice)
print(pdlink2)
.next_sibling returns the node that immediately follows a tag, which here is the text sitting right after the <b> and <style> tags.
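Another possible approach, sketched here under the assumption that the currency symbol and the embedded stylesheet are never wanted, is to delete those child tags first and then read the remaining text. The markup below is a trimmed copy of the question's div.

```python
from bs4 import BeautifulSoup

html = ('<div class="itemBox"><p class="proPrice"><em yhdprice="49.9"><b>¥</b>49.90</em></p>'
        '<p class="proName clearfix"><a href="//item.yhd.com/679026.html">'
        '<style type="text/css">.preSellOrAppoint {border: 1px solid #FFFFFF;}</style>'
        '印尼进口</a></p></div>')

soup = BeautifulSoup(html, "html.parser")

# Remove the tags whose contents should not appear in the extracted text.
for tag in soup.find_all(["style", "b"]):
    tag.decompose()

price = soup.find("p", class_="proPrice").em.get_text(strip=True)
name = soup.find("p", class_="proName").a.get_text(strip=True)
print(price)
print(name)
```

This does not depend on the position of the text relative to the removed tags, so it keeps working if the text comes before them instead of after.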

BeautifulSoup, change specific style attribute

I want to change only the background-color style with BeautifulSoup :
My html :
<td style="font-size: .8em; font-family: monospace; background-color: rgb(244, 244, 244);">
</td>
I would like to do something like this :
soup_td = BeautifulSoup(html_td, "html.parser")
soup_td.td["style"]["background-color"] = "red;"
You can also simply assign the whole style attribute. Note that this replaces every existing declaration, not just background-color:
for tag in soup.find_all(attrs={'class': 'example'}):
    tag['style'] = "color: red;"
Combine soup.find_all with whatever BeautifulSoup selector you'd like to use.
Use cssutils to manipulate CSS, like this:
from bs4 import BeautifulSoup
from cssutils import parseStyle
html = '<td style="font-size: .8em; font-family: monospace; background-color: rgb(244, 244, 244);"></td>'
# Create soup from html
soup = BeautifulSoup(html, 'html.parser')
# Parse td's styles
style = parseStyle(soup.td['style'])
# Change 'background-color' to 'red'
style['background-color'] = 'red'
# Replace td's styles in the soup with the modified styles
soup.td['style'] = style.cssText
# Outputs: <td style="font-size: .8em; font-family: monospace; background-color: red"></td>
print(soup.td)
You could also use regex if you're comfortable with using it.
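Along the same lines, here is a sketch that needs neither cssutils nor a regex: it splits the style value into declarations, changes one property, and joins them back. The function name set_style_property is made up, and the approach assumes no value contains a semicolon (a data: URL would break it).

```python
def set_style_property(style, name, value):
    """Return the style string with one CSS property replaced or appended."""
    declarations = {}
    for chunk in style.split(";"):
        if ":" not in chunk:
            continue  # skip empty pieces, such as the one after a trailing ';'
        prop, _, val = chunk.partition(":")
        declarations[prop.strip().lower()] = val.strip()
    # dicts keep insertion order, so an existing property stays in place
    declarations[name.lower()] = value
    return "; ".join(f"{p}: {v}" for p, v in declarations.items()) + ";"

style = "font-size: .8em; font-family: monospace; background-color: rgb(244, 244, 244);"
print(set_style_property(style, "background-color", "red"))
```

With a parsed document, this could be applied as soup.td["style"] = set_style_property(soup.td["style"], "background-color", "red").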

Filter out empty <span> tags from html code

I have some HTML code in which there are a lot of lines that I want to remove. They look like this:
<span style="position:absolute; border: black 1px solid; left:94px; top:600px; width:6px; height:10px;"></span>
There are also span tags that have text between them, which I want to keep.
I want to use the Python re.sub function to delete those useless span tags. I wrote this, but it is not working:
html_code_filtered = re.sub('<span*></span>', '', html_code)
I guess I'm missing something on the regular expression to match the lines correctly?
You can use an HTML Parser like BeautifulSoup to remove the span elements with no text.
Working example:
from bs4 import BeautifulSoup
data = """
<div>
<span style="position:absolute; border: black 1px solid; left:94px; top:600px; width:6px; height:10px;"></span>
<span>useful text</span>
<span></span>
</div>
"""
soup = BeautifulSoup(data, "html.parser")
# find and remove "span" elements with empty contents
for useless in soup.find_all("span", text=lambda text: not text):
    useless.extract()
print(soup.prettify())
Prints (as you can see span elements with no contents were removed):
<div>
<span>
useful text
</span>
</div>
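The problem here is that n* matches the character n repeated zero or more times, so <span*> only matches <spa, <span, <spann and so on, followed directly by >. You can use .*? to match any characters lazily up to the first point where the rest of the pattern, ></span>, matches.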
>>> html_code = '<span style="position:absolute; border: black 1px solid; left:94px; top:600px; width:6px; height:10px;"></span>'
>>> re.sub('<span.*?></span>', '', html_code)
''
That being said, refer to maazaa's comment and the answers using a proper html parser for more complex parsing tasks.
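One caveat with .*? is worth illustrating: when several spans sit on the same line, the lazy match can run past the first > and swallow a span that does contain text. The sketch below compares it with a character class that cannot leave the opening tag. The sample string and variable names are made up, and the pattern assumes empty spans contain at most whitespace.

```python
import re

html_code = '<span>useful text</span><span style="left:94px;"></span><span> </span>'

# ".*?" is free to cross ">" characters, so the match starting at the first
# "<span" extends to the end of the second span and removes the useful text.
overreaching = re.sub(r'<span.*?></span>', '', html_code)

# "[^>]*" stays inside the opening tag; "\s*" also drops whitespace-only spans.
safer = re.sub(r'<span[^>]*>\s*</span>', '', html_code)

print(repr(overreaching))
print(repr(safer))
```

The second pattern keeps the span with text and removes the two empty ones, though nested or malformed markup is still better handled by a parser.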
