A certain page retrieved from a URL has the following syntax:
<p>
<strong>Name:</strong> Pasan <br/>
<strong>Surname: </strong> Wijesingher <br/>
<strong>Former/AKA Name:</strong> No Former/AKA Name <br/>
<strong>Gender:</strong> Male <br/>
<strong>Language Fluency:</strong> ENGLISH <br/>
</p>
I want to extract the data in Name, Surname, etc. (I have to repeat this task for many pages.)
For that I tried using the following code:
import urllib2
url = 'http://www.my.lk/details.aspx?view=1&id=%2031'
source = urllib2.urlopen(url)
start = '<p><strong>Given Name:</strong>'
end = '<strong>Surname'
givenName=(source.read().split(start))[1].split(end)[0]
start = 'Surname: </strong>'
end = 'Former/AKA Name'
surname=(source.read().split(start))[1].split(end)[0]
print(givenName)
print(surname)
When I call source.read().split() only once it works fine, but when I use it twice it gives a list index out of range error.
Can someone suggest a solution?
You can use BeautifulSoup for parsing the HTML string.
Here is some code you might try.
It uses BeautifulSoup to get the text rendered by the HTML, then parses that string to extract the data.
from bs4 import BeautifulSoup as bs
dic = {}
data = \
"""
<p>
<strong>Name:</strong> Pasan <br/>
<strong>Surname: </strong> Wijesingher <br/>
<strong>Former/AKA Name:</strong> No Former/AKA Name <br/>
<strong>Gender:</strong> Male <br/>
<strong>Language Fluency:</strong> ENGLISH <br/>
</p>
"""
soup = bs(data)
# Get the text on the html through BeautifulSoup
text = soup.get_text()
# parsing the text
lines = text.splitlines()
for line in lines:
    # if the line has no ':', move on to the next line
    if line.find(':') == -1:
        continue
    # split the string at ':'
    parts = line.split(':')
    # You can add more tests here, like
    # if len(parts) != 2:
    #     continue
    # strip surrounding whitespace
    for i in range(len(parts)):
        parts[i] = parts[i].strip()
    # add the values to a dictionary
    dic[parts[0]] = parts[1]
    # print the data after processing
    print '%16s %20s' % (parts[0], parts[1])
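With the sample above, this should print something along these lines (the column widths come from the '%16s %20s' format):
            Name                Pasan
         Surname          Wijesingher
 Former/AKA Name   No Former/AKA Name
          Gender                 Male
Language Fluency              ENGLISH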
A tip:
If you are going to use BeautifulSoup to parse HTML, it helps if the markup carries identifying attributes such as class="input" or id="10". That is, tags holding the same kind of data should share the same id or class.
Update
Regarding your comment, see the code below.
It applies the tip above, which makes life (and coding) a lot easier.
from bs4 import BeautifulSoup as bs
c_addr = []
id_addr = []
data = \
"""
<h2>Primary Location</h2>
<div class="address" id="10">
<p>
No. 4<br>
Private Drive,<br>
Sri Lanka ON K7L LK <br>
"""
soup = bs(data)
for i in soup.find_all('div'):
    # get data using the "class" attribute
    addr = ""
    if i.get("class")[0] == u'address':  # unicode string
        text = i.get_text()
        for line in text.splitlines():  # line-wise
            line = line.strip()  # remove whitespace
            addr += line  # add to the address string
        c_addr.append(addr)
    # get data using the "id" attribute
    addr = ""
    if int(i.get("id")) == 10:  # integer
        text = i.get_text()
        # same processing as above
        for line in text.splitlines():
            line = line.strip()
            addr += line
        id_addr.append(addr)
print "id_addr"
print id_addr
print "c_addr"
print c_addr
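As a side note, bs4 can also filter on those attributes directly, so a shorter sketch of the same two lookups (reusing the data sample above) might be:
from bs4 import BeautifulSoup as bs
soup = bs(data, 'html.parser')
# filter on the "class" attribute directly
c_addr = [''.join(line.strip() for line in div.get_text().splitlines())
          for div in soup.find_all('div', class_='address')]
# filter on the "id" attribute directly
tag = soup.find('div', id='10')
id_addr = [''.join(line.strip() for line in tag.get_text().splitlines())] if tag else []
print(c_addr)
print(id_addr)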
You are calling read() twice. That is the problem. Instead of doing that you want to call read once, store the data in a variable, and use that variable where you were calling read(). Something like this:
fetched_data = source.read()
Then later...
givenName=(fetched_data.split(start))[1].split(end)[0]
and...
surname=(fetched_data.split(start))[1].split(end)[0]
That should work. The reason your code didn't work is that the first call to read() consumes the content; once it is done, the stream position is at the end. The next time you call read() there is no content remaining, so it returns an empty string, and the [1] index after split() is what raises the list index out of range error.
Check out the docs for urllib2 and methods on file objects
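If you want to see that one-shot behaviour in isolation, here is a minimal sketch (any reachable URL will do; urllib2 is Python 2, on Python 3 use urllib.request instead):
import urllib2
source = urllib2.urlopen('http://www.example.com')
first = source.read()   # reads the whole response body
second = source.read()  # the stream is now exhausted, so this returns ''
print(len(first))   # some positive number
print(len(second))  # 0, which is why the later split()[1] raises IndexError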
If you want to be quick, regexes are more useful for this kind of task. The learning curve can be harsh at first, but regexes will save your butt one day.
Try this code:
import re

# read the whole document into memory
full_source = source.read()

NAME_RE = re.compile('Name:.+?>(.*?)<')
SURNAME_RE = re.compile('Surname:.+?>(.*?)<')

name = NAME_RE.search(full_source).group(1).strip()
surname = SURNAME_RE.search(full_source).group(1).strip()
(Note that flags like re.MULTILINE belong in re.compile(); the second positional argument of a compiled pattern's search() is a start position, not a flags field.)
See the re module documentation for more info on how to use regexes in Python.
A more comprehensive solution would involve parsing the HTML (using a lib like BeautifulSoup), but that can be overkill depending on your particular application.
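For instance, a quick sanity check of those patterns against a hard-coded snippet from the question:
import re
full_source = '''<strong>Name:</strong> Pasan <br/>
<strong>Surname: </strong> Wijesingher <br/>'''
NAME_RE = re.compile('Name:.+?>(.*?)<')
SURNAME_RE = re.compile('Surname:.+?>(.*?)<')
print(NAME_RE.search(full_source).group(1).strip())     # Pasan
print(SURNAME_RE.search(full_source).group(1).strip())  # Wijesingher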
You can use HTQL:
page="""
<p>
<strong>Name:</strong> Pasan <br/>
<strong>Surname: </strong> Wijesingher <br/>
<strong>Former/AKA Name:</strong> No Former/AKA Name <br/>
<strong>Gender:</strong> Male <br/>
<strong>Language Fluency:</strong> ENGLISH <br/>
</p>
"""
import htql
print(htql.query(page, "<p>.<strong> {a=:tx; b=:xx} "))
# [('Name:', ' Pasan '),
# ('Surname: ', ' Wijesingher '),
# ('Former/AKA Name:', ' No Former/AKA Name '),
# ('Gender:', ' Male '),
# ('Language Fluency:', ' ENGLISH ')
# ]
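Assuming htql.query returns the (label, value) pairs shown above, turning them into a dict is then a one-liner:
pairs = htql.query(page, "<p>.<strong> {a=:tx; b=:xx} ")
record = {a.strip().rstrip(':'): b.strip() for a, b in pairs}
print(record['Name'])     # Pasan
print(record['Surname'])  # Wijesingher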
Related
The webpage I'm scraping has paragraphs and headings structured this way:
<p>
<strong>
<a href="https://dummy.com" class="">This is a link heading
</strong>
</p>
<p>
Content To Be Pulled
</p>
I wrote the following code to pull the link heading's content:
for anchor in soup.select('#pcl-full-content > p > strong > a'):
    signs.append(anchor.text)
The next part is confusing me because the text I want to collect is in the <p> tag after the <p> tag which contains the link. I cannot use .next_sibling on the anchor here because the target is outside of the parent <p> tag.
How do I choose the following paragraph given that the <p> before it contained a link?
One way seems to be to extract from the script tag, though you will need to split the text by horoscope:
import requests, re, json

r = requests.get('https://indianexpress.com/article/horoscope/weekly-horoscope-june-6-june-12-gemini-cancer-taurus-and-other-signs-check-astrological-prediction-7346080/',
                 headers={'User-Agent': 'Mozilla/5.0'})
data = json.loads(re.search(r'(\{"@context.*articleBody.*\})', r.text).group(1))
print(data['articleBody'])
You could get the horoscopes separately as follows. This dynamically determines which horoscopes are present, and in what order:
import requests, re, json

r = requests.get('https://indianexpress.com/article/horoscope/horoscope-today-april-6-2021-sagittarius-leo-aries-and-other-signs-check-astrological-prediction-7260276/',
                 headers={'User-Agent': 'Mozilla/5.0'})
data = json.loads(re.search(r'(\{"@context.*articleBody.*\})', r.text).group(1))
# print(data['articleBody'])

signs = ['ARIES', 'TAURUS', 'GEMINI', 'CANCER', 'LEO', 'VIRGO', 'LIBRA',
         'SCORPIO', 'SAGITTARIUS', 'CAPRICORN', 'AQUARIUS', 'PISCES']
p = re.compile('|'.join(signs))
signs = p.findall(data['articleBody'])

for number, sign in enumerate(signs):
    if number < len(signs) - 1:
        print(re.search(f'({sign}.*?){signs[number + 1]}', data['articleBody']).group(1))
    else:
        print(re.search(f'({sign}.*)', data['articleBody']).group(1))
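If you would rather stay inside BeautifulSoup, a sibling-based sketch is also possible (the markup below is reconstructed from the question, so treat the wrapper id as an assumption):
from bs4 import BeautifulSoup
html = '''
<div id="pcl-full-content">
<p><strong><a href="https://dummy.com">This is a link heading</a></strong></p>
<p>Content To Be Pulled</p>
</div>'''
soup = BeautifulSoup(html, 'html.parser')
for anchor in soup.select('#pcl-full-content > p > strong > a'):
    heading_p = anchor.find_parent('p')           # the <p> holding the link
    content_p = heading_p.find_next_sibling('p')  # the <p> right after it
    if content_p is not None:
        print(anchor.get_text(strip=True), '->', content_p.get_text(strip=True))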
I have an html file which looks like:
...
<p>
<strong>This is </strong>
<strong>a lin</strong>
<strong>e which I want to </strong>
<strong>join.</strong>
</p>
<p>
2.
<strong>But do not </strong>
<strong>touch this</strong>
<em>Maybe some other tags as well.</em>
bla bla blah...
</p>
...
What I need is: if all the tags in a 'p' block are 'strong', combine them into one line, i.e.
<p>
<strong>This is a line which I want to join.</strong>
</p>
Without touching the other block since it contains something else.
Any suggestions? I am using lxml.
UPDATE:
So far I have tried:
for p in self.tree.xpath('//body/p'):
    if p.tail is None:  # no text before first element
        children = p.getchildren()
        for child in children:
            if len(children) == 1 or child.tag != 'strong' or child.tail is not None:
                break
        else:
            etree.strip_tags(p, 'strong')
With this code I was able to strip off the strong tags in the desired part, giving:
<p>
This is a line which I want to join.
</p>
So now I just need a way to put the tag back in...
I was able to do this with bs4 (BeautifulSoup):
from bs4 import BeautifulSoup as bs
html = """<p>
<strong>This is </strong>
<strong>a lin</strong>
<strong>e which I want to </strong>
<strong>join.</strong>
</p>
<p>
<strong>But do not </strong>
<strong>touch this</strong>
</p>"""
soup = bs(html)
# note that I use the 0th <p> block ...[0],
# so make the appropriate change in your code
s = soup.find_all('p')[0].text.replace('\n', '')  # join the lines
s = '<p><strong>' + s + '</strong></p>'
print s  # prints: <p><strong>This is a line which I want to join.</strong></p>
Then use replace_with():
p_tag = soup.p
p_tag.replace_with(bs(s, 'html.parser'))
print soup
prints:
<html><body><p><strong>This is a line which I want to join.</strong></p>
<p>
<strong>But do not </strong>
<strong>touch this</strong>
</p></body></html>
I have managed to solve my own problem.
for p in self.tree.xpath('//body/p'):
    if p.tail is None:  # some conditions specific to my doc
        children = p.getchildren()
        if len(children) > 1:
            for child in children:
                # if other stuff is present, break
                if child.tag != 'strong' or child.tail is not None:
                    break
            else:
                # no break, so we found a p block to fix:
                # get rid of the stuff inside p, and put a SubElement in
                etree.strip_tags(p, 'strong')
                tmp_text = p.text_content()
                p.clear()
                subtext = etree.SubElement(p, "strong")
                subtext.text = tmp_text
Special thanks to #Scott, who helped me arrive at this solution. Although I cannot mark his answer correct, I appreciate his guidance no less.
Alternatively, you can use a more specific XPath to get the targeted p elements directly:
p_target = """
//p[strong]
[not(*[not(self::strong)])]
[not(text()[normalize-space()])]
"""
for p in self.tree.xpath(p_target):
    # the logic inside the loop can also be the same as in your `else` block
    content = p.xpath("normalize-space()")
    p.clear()
    strong = etree.SubElement(p, "strong")
    strong.text = content
A brief explanation of the XPath being used:
//p[strong] : find p element, anywhere in the XML/HTML document, having child element strong...
[not(*[not(self::strong)])] : ..and not having child element other than strong...
[not(text()[normalize-space()])] : ..and not having non-empty text node child.
normalize-space() : get all text nodes from current context element, concatenated with consecutive whitespaces normalized to single space
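For reference, a self-contained sketch of that XPath against the question's sample (lxml.html assumed for parsing):
from lxml import etree, html
doc = html.fromstring('''<html><body>
<p><strong>This is </strong><strong>a lin</strong><strong>e which I want to </strong><strong>join.</strong></p>
<p>2. <strong>But do not </strong><strong>touch this</strong><em>Maybe some other tags as well.</em> bla bla blah...</p>
</body></html>''')
# p elements whose children are all strong and which have no non-empty text of their own
for p in doc.xpath("//p[strong][not(*[not(self::strong)])][not(text()[normalize-space()])]"):
    content = p.xpath("normalize-space()")
    p.clear()
    strong = etree.SubElement(p, "strong")
    strong.text = content
print(html.tostring(doc, pretty_print=True).decode())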
I have this piece of code:
txt = """<p>Hi <span>Mark</span>, how are you?, Don't forget meeting on <strong>sunday</strong>, ok?</p>"""
soup = BeautifulSoup(txt)
for ft in soup.findAll('p'):
    print str(ft).upper()
When running I get this:
<P>HI <SPAN>MARK</SPAN>, HOW ARE YOU?, DON'T FORGET MEETING ON <STRONG>SUNDAY</STRONG>, OK?</P>
But I want to get this:
<p>HI <span>Mark</span>, HOW ARE YOU?, DON'T FORGET MEETING ON <strong>sunday</strong>, OK?</p>
I just want to change the inner text of the p tag but keep the formatting of the other tags inside it. I also want to keep tag names in lowercase.
Thanks
You can assign the modified text to the string attribute of the tag, p.string. So loop over all contents of the <p> tag, using the regular expression module to check whether each item contains the tag symbols < and >, and skip those. Something like:
from bs4 import BeautifulSoup
import re
txt = """<p>Hi <span>Mark</span>, how are you?, Don't forget meeting on <strong>sunday</strong>, ok?</p>"""
soup = BeautifulSoup(txt)
for p in soup.find_all('p'):
    p.string = ''.join(
        [str(t).upper()
         if not re.match(r'<[^>]+>', str(t))
         else str(t)
         for t in p.contents])
print soup.prettify(formatter=None)
I use the formatter option to avoid the encoding of html special symbols. It yields:
<html>
<body>
<p>
HI <span>Mark</span>, HOW ARE YOU?, DON'T FORGET MEETING ON <strong>sunday</strong>, OK?
</p>
</body>
</html>
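An alternative sketch that mutates only the direct text nodes in place, so no formatter trick is needed (bs4 assumed):
from bs4 import BeautifulSoup, NavigableString
txt = """<p>Hi <span>Mark</span>, how are you?, Don't forget meeting on <strong>sunday</strong>, ok?</p>"""
soup = BeautifulSoup(txt, 'html.parser')
for p in soup.find_all('p'):
    # upper-case only the strings that are direct children of <p>;
    # nested tags keep their own text untouched
    for t in list(p.contents):
        if isinstance(t, NavigableString):
            t.replace_with(t.upper())
print(soup)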
I have a text file which has the format below, and I am trying to get the text between the <doc> tags. But the <doc> tags repeat many times and the file is not in a standard xml or html format, and this causes a problem.
Format:
<doc id = "some_number" url = " some_link " title = " some_title " >
text here
</doc> //then again
<doc id = "some_number" url = " some_link " title = " some_title " >
text here
</doc> // and so on
I tried to get the text using BeautifulSoup in Python, but it said that this object is not callable; I am guessing that is because the file is not in HTML format. I also tried regular expressions, so I wrote
import re

pattern = re.compile("<doc.*?>(.*?)</doc>")
pattern.findall(string_text)  # string_text holds my file's contents
but no match is found.
Thanks for your help.
There is nothing visibly wrong with your HTML, and no reason BeautifulSoup shouldn't be able to parse it. For example:
from bs4 import BeautifulSoup
s = '''
<doc id = "some_number" url = " some_link " title = " some_title " >
text here
</doc>
<doc id = "some_number" url = " some_link " title = " some_title " >
text here
</doc>'''
soup = BeautifulSoup(s)
for doc in soup.find_all('doc'):
    print('{}: {}'.format(doc['title'], doc.text))
When I run this, here's what it displays:
some_title :
text here
some_title :
text here
It also works if I leave in the C++-style comments that appeared in your question but outside the code sections.
If "it said that this object is not callable", you've obviously done something wrong with your code. For example, if I do this:
for doc in soup.find_all('doc'):
    doc['title']('text')
… it will of course raise:
TypeError: 'str' object is not callable
But that's not because BS couldn't parse the HTML, it's because I got a string out of BS and tried to call it as a function.
I have no idea what you actually did wrong, because you haven't shown us the code, or even the exact error that you got.
Meanwhile, if you're wondering what's wrong with your HTML, there are three problems.
The first problem is the classic one: "You can't parse HTML with regexp."
The second problem is that you're trying to use .*? to match, among other things, newlines, and right near the very top of the re docs, it says:
'.'
(Dot.) In the default mode, this matches any character except a newline. If the DOTALL flag has been specified, this matches any character including a newline.
So, you need pattern = re.compile("<doc.*?>(.*?)</doc>", re.DOTALL).
However, if a doc can have another doc inside, or if you can have any characters inside quotes that will confuse you, or if… well, there are a lot of reasons this can fail. Which is why:
The third problem is that you didn't read "You can't parse HTML with regexp", and you need to go read it.
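For completeness, a quick check of the DOTALL version against a sample like yours (still no substitute for a real parser):
import re
s = '''<doc id = "1" title = "t1" >
text here
</doc>
<doc id = "2" title = "t2" >
more text
</doc>'''
pattern = re.compile(r"<doc.*?>(.*?)</doc>", re.DOTALL)
print(pattern.findall(s))  # ['\ntext here\n', '\nmore text\n']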
your_doc = """
<doc id = "some_number" url = " some_link " title = " some_title " >
text here
</doc> //then again
<doc id = "some_number" url = " some_link " title = " some_title " >
text here
</doc>
"""
from bs4 import BeautifulSoup as b
soup = b(your_doc)
specific_doc = soup.find('doc', {'id': 'some_number'})  # to get a doc with a given id
print specific_doc.contents  # printing the text

all_docs = soup.findAll('doc')  # list of all doc tags
for doc in all_docs:  # printing all the texts
    print 'Text in doc id:', doc['id']
    print doc.contents
I am trying to parse through an HTML page which, simplified, looks like this:
<div class="anotherclass part">
<a href="http://example.com" >
<div class="column abc"><strike>£3.99</strike><br>£3.59</div>
<div class="column def"></div>
<div class="column ghi">1 Feb 2013</div>
<div class="column jkl">
<h4>A title</h4>
<p>
<img class="image" src="http://example.com/image.jpg">A, List, Of, Terms, To, Extract - 1 Feb 2013</p>
</div>
</a>
</div>
I am a beginner at coding python and I have read and re-read the beautifulsoup documentation at http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html
I have got this code:
from BeautifulSoup import BeautifulSoup
import re

with open("file.html") as fp:
    html = fp.read()

soup = BeautifulSoup(html)
parts = soup.findAll('a', attrs={"class": re.compile('part', re.IGNORECASE)})
for part in parts:
    mypart = {}
    # ghi
    mypart['ghi'] = part.find(attrs={"class": re.compile('ghi')}).string
    # def
    mypart['def'] = part.find(attrs={"class": re.compile('def')}).string
    # h4
    mypart['title'] = part.find('h4').string
    # jkl
    mypart['other'] = part.find('p').string
    # abc
    pattern = re.compile(r'\&\#163\;(\d{1,}\.?\d{2}?)')
    theprices = re.findall(pattern, str(part))
    if len(theprices) == 2:
        mypart['price'] = theprices[1]
        mypart['rrp'] = theprices[0]
    elif len(theprices) == 1:
        mypart['price'] = theprices[0]
        mypart['rrp'] = theprices[0]
    else:
        mypart['price'] = None
        mypart['rrp'] = None
I want to extract any text from the classes def and ghi which I think my script does correctly.
I also want to extract the two prices from abc which my script does in a rather clunky fashion at the moment. Sometimes there are two prices, sometimes one and sometimes none in this part.
Finally I want to extract the "A, List, Of, Terms, To, Extract" part from class jkl which my script fails to do. I thought getting the string part of the p tag would work but I cannot understand why it does not. The date in this part always matches the date in class ghi so it should be easy to replace/remove it.
Any advice? Thank-you!
First, if you add convertEntities=bs.BeautifulSoup.HTML_ENTITIES, as in
soup = bs.BeautifulSoup(html, convertEntities=bs.BeautifulSoup.HTML_ENTITIES)
then HTML entities such as &#163; will be converted to their corresponding unicode character, such as £. This will allow you to use a simpler regex to identify the prices.
Now, given part, you can find the text content in the <div> with the prices using its contents attribute:
In [37]: part.find(attrs={"class": re.compile('abc')}).contents
Out[37]: [<strike>£3.99</strike>, <br />, u'\xa33.59']
All we need to do is extract the number from each item, or skip it if there is no number:
def parse_price(text):
    try:
        return float(re.search(r'\d*\.\d+', text).group())
    except (TypeError, ValueError, AttributeError):
        return None

price = []
for item in part.find(attrs={"class": re.compile('abc')}).contents:
    item = parse_price(item.string)
    if item:
        price.append(item)
At this point price will be a list of 0, 1, or 2 floats.
We would like to say
mypart['rrp'], mypart['price'] = price
but that would not work if price is [] or contains only one item.
Your method of handling the three cases with if..else is okay -- it is the most straightforward and arguably the most readable way to proceed. But it is also a bit mundane. If you'd like something a little more terse you could do the following:
Since we want to repeat the same price if price contains only one item, you might be led to think about itertools.cycle.
In the case where price is the empty list, [], we want itertools.cycle([None]), but otherwise we could use itertools.cycle(price).
So to combine both cases into one expression, we could use
price = itertools.cycle(price or [None])
mypart['rrp'], mypart['price'] = next(price), next(price)
The next function peels off the values in the iterator price one by one. Since price is cycling through its values, it will never end; it will just keep yielding the values in sequence and then starting over again if necessary -- which is just what we want.
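A tiny demonstration of that trick in isolation:
import itertools as IT
for sample in ([3.99, 3.59], [3.59], []):
    it = IT.cycle(sample or [None])
    rrp, price = next(it), next(it)
    print('%s -> rrp=%s price=%s' % (sample, rrp, price))
# [3.99, 3.59] -> rrp=3.99 price=3.59
# [3.59] -> rrp=3.59 price=3.59
# [] -> rrp=None price=None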
The A, List, Of, Terms, To, Extract - 1 Feb 2013 could be obtained again through the use of the contents attribute:
# jkl
mypart['other'] = [item for item in part.find('p').contents
                   if not isinstance(item, bs.Tag) and item.string.strip()]
So, the full runnable code would look like:
import BeautifulSoup as bs
import os
import re
import itertools as IT

def parse_price(text):
    try:
        return float(re.search(r'\d*\.\d+', text).group())
    except (TypeError, ValueError, AttributeError):
        return None

filename = os.path.expanduser("~/tmp/file.html")
with open(filename) as fp:
    html = fp.read()

soup = bs.BeautifulSoup(html, convertEntities=bs.BeautifulSoup.HTML_ENTITIES)

for part in soup.findAll('div', attrs={"class": re.compile('(?i)part')}):
    mypart = {}
    # abc
    price = []
    for item in part.find(attrs={"class": re.compile('abc')}).contents:
        item = parse_price(item.string)
        if item:
            price.append(item)
    price = IT.cycle(price or [None])
    mypart['rrp'], mypart['price'] = next(price), next(price)
    # jkl
    mypart['other'] = [item for item in part.find('p').contents
                       if not isinstance(item, bs.Tag) and item.string.strip()]
    print(mypart)
which yields
{'price': 3.59, 'other': [u'A, List, Of, Terms, To, Extract - 1 Feb 2013'], 'rrp': 3.99}