I am using Beautiful Soup 4 to scan through an HTML file and extract certain features. Specifically, I am using it to find soccer player names, clubs, leagues, stats, etc. Since many player and club names have accent marks, I am looking for a way to print these accent marks rather than seeing output like "Kak\xe1". I was able to make it work by using:
# open html page
fut_page = open('futhead1.html','r')
# read content from html page
fut_read = fut_page.read()
# html parsed page
fut_soup = BeautifulSoup(fut_read, "html.parser")
# grabs all players
players = fut_soup.findAll('li',{'class':'list-group-item list-group-table-row player-group-item dark-hover'})
player = players[2]
# name_tag contains tag with player's name
name_tag = player.find("span",{"class":"player-name"})
# extract just the player's name
player_name = name_tag.text
print player_name.encode('utf-8')
This prints out the proper player name: "Kaká". However, I do not see the same result when using a regex to extract the club name, for example:
regex_club = re.compile(ur'\[.*?</strong>\\n\s+\|\s\\n\s+(.*?)\\n', re.MULTILINE)
# extract club name
player_club = re.match(regex_club, str(pos_clb_lge_tag))
print player_club.group(1).encode('utf-8')
This code prints the club name as "Atl\xe9tico Madrid", but encode() does not replace "\xe9" with "é".
Below is the piece of the HTML file where I apply the regex:
<li class="list-group-item list-group-table-row player-group-item dark-hover">
<div class="content player-item font-24">
<a class="display-block padding-0" href="/fifa-mobile/17/players/33194/jan-oblak/">
<span class="player-rating stream-col-50 text-center">
<span class="revision-gradient shadowed font-12 fut elite">100</span>
</span>
<span class="player-info">
<img class="player-image" src="http://futhead.cursecdn.com/static/img/fm/17/players/200389_SASC.png">
<img class="player-program" src="http://futhead.cursecdn.com/static/img/fm/17/resources/program_17_VSATTACK.png">
<span class="player-name">Jan Oblak</span>
<span class="player-club-league-name">
<strong>GK</strong>
|
Atlético Madrid
|
LaLiga Santander
</span>
</span>
<span class="player-right text-center hidden-xs">
<span class="player-stat stream-col-60 hidden-md hidden-sm"><span class="value">83</span><span class="hover-label">PAC</span></span><span class="player-stat stream-col-60 hidden-md hidden-sm"><span class="value">50</span><span class="hover-label">SHO</span></span><span class="player-stat stream-col-60 hidden-md hidden-sm"><span class="value">66</span><span class="hover-label">PAS</span></span><span class="player-stat stream-col-60 hidden-md hidden-sm"><span class="value">55</span><span class="hover-label">DRI</span></span><span class="player-stat stream-col-60 hidden-md hidden-sm"><span class="value">58</span><span class="hover-label">DEF</span></span><span class="player-stat stream-col-60 hidden-md hidden-sm"><span class="value">85</span><span class="hover-label">PHY</span></span><span class="player-stat stream-col-60 font-12 font-medium text-upper">35</span>
</span>
<span class="player-right slide hidden-sm hidden-xs" data-direction="right" data-max="-482px">
<span class="slide-content text-upper">
<span class="trigger icon icon-dots-three-horizontal"></span>
<span class="player-stat stream-col-80">
<span class="value">+2</span>
<span class="hover-label">MRK</span>
</span>
<span class="player-stat stream-col-80">
<span class="value">+1</span>
<span class="hover-label">OVR</span>
</span>
<span class="player-stat stream-col-100"><span class="value">right</span><span class="hover-label">Strong Foot</span></span>
<span class="player-stat stream-col-100"><span class="value">18<span class="icon icon-star gold margin-l-4"></span></span><span class="hover-label">Weak Foot</span></span>
</span>
</span>
</a>
</div>
So basically, why does encode() not work when I use a regex as an intermediate step? If any further clarification is needed, please let me know. Thank you.
I suspect you haven't shown all the code (see [mcve]), but calling str on the Unicode object is the wrong thing to do, and should have given:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 40: ordinal not in range(128)
I suspect you've called sys.setdefaultencoding, which is a bad habit.
What str() did was convert the Unicode string into a byte string containing escape code text, e.g. '\\n' (two characters) instead of '\n' (one character), and it did the same for the non-ASCII character.
If your terminal is configured correctly, you should not have to manually encode the final result when printing either.
Here's a working example that uses BeautifulSoup to retrieve just the text to parse:
from bs4 import BeautifulSoup
import re
# open html page
fut_page = open('futhead1.html','r')
# read content from html page
fut_read = fut_page.read()
# html parsed page
fut_soup = BeautifulSoup(fut_read, "html.parser")
# grabs all players
players = fut_soup.findAll('li',{'class':'list-group-item list-group-table-row player-group-item dark-hover'})
player = players[0]
# name_tag contains the tag with the player's position, club and league
name_tag = player.find("span",{"class":"player-club-league-name"})
# the last child is the text node holding the club and league
pos_clb_lge_tag = name_tag.contents[-1]
regex_club = re.compile(ur'\n\s+\|\s\n\s+(.*?)\n')
# extract club name
player_club = regex_club.match(pos_clb_lge_tag)
print player_club.group(1)
Atlético Madrid
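For readers on Python 3, where str is already Unicode and print needs no manual encode(), a regex-free variant might look like the sketch below. It assumes the span's text always has the form "position | club | league" as in the snippet from the question; the inline html string stands in for the real file.

```python
from bs4 import BeautifulSoup

# Stand-in for the relevant part of futhead1.html (copied from the question).
html = '''<span class="player-club-league-name">
<strong>GK</strong>
|
Atlético Madrid
|
LaLiga Santander
</span>'''

soup = BeautifulSoup(html, "html.parser")
tag = soup.find("span", {"class": "player-club-league-name"})

# get_text() returns a Unicode string; split it on the "|" separators.
parts = [part.strip() for part in tag.get_text().split("|")]
position, club, league = parts

print(club)
```

Because the text never passes through str() on a tag or a byte string, the accented character stays intact.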
I have an XML document with multiple div and span classes, and I'm struggling to extract a text value.
<div class="line">
<span class="html-tag">
"This is a Heading that I dont want"
</span>
<span>This is the text I want</span>
So far I have written this:
html = driver.page_source
soup = BeautifulSoup(html, "lxml")
spans = soup.find_all('span', attrs={'class': 'html-tag'})[29]
print(spans.text)
This unfortunately only prints out the "This is a Heading that I dont want" value e.g.
This is the heading I dont want
Number [29] in the code is the position where the text I need will always appear.
I'm unsure how to retrieve the span value I need.
Please can you assist. Thanks
You can search by <div class="line"> and then select the second <span>.
For example:
txt = '''
# line 1
<div class="line">
<span class="html-tag">
"This is a Heading that I dont want"
</span>
<span>This is the text I dont want</span>
</div>
# line 2
<div class="line">
<span class="html-tag">
"This is a Heading that I dont want"
</span>
<span>This is the text I dont want</span>
</div>
# line 3
<div class="line">
<span class="html-tag">
"This is a Heading that I dont want"
</span>
<span>This is the text I want</span> <--- this is I want
</div>'''
soup = BeautifulSoup(txt, 'html.parser')
s = soup.select('div.line')[2].select('span')[1] # select 3rd line 2nd span
print(s.text)
Prints:
This is the text I want
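If you would rather keep indexing the html-tag spans as in the question, a possible alternative is to step from the heading span to its next sibling span with find_next_sibling. This is only a sketch: the two-row html string and the index 1 are placeholders for the real page and the [29] position.

```python
from bs4 import BeautifulSoup

# Placeholder markup with the same shape as the question's rows.
html = '''
<div class="line">
  <span class="html-tag">"Heading 1"</span>
  <span>Text 1</span>
</div>
<div class="line">
  <span class="html-tag">"Heading 2"</span>
  <span>Text 2</span>
</div>
'''

soup = BeautifulSoup(html, "html.parser")

# Pick the heading span by position, as the question does with [29].
heading = soup.find_all("span", attrs={"class": "html-tag"})[1]

# The wanted value is the next <span> at the same level as the heading.
value = heading.find_next_sibling("span")
print(value.text)
```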
I have a webpage with following code:
<li>
Thalassery (<a class="mw-redirect" href="/wiki/Malayalam_language" title="Malayalam language">Malayalam</a>: <span lang="ml">തലശ്ശേരി</span>), from
<i>Tellicherry</i></li>
<li>Thanjavur (Tamil: <span lang="ta">தஞ்சாவூர்</span>), from British name <i>Tanjore</i></li>
<li>Thane (Marathi: <span lang="mr">ठाणे</span>), from British name <i>Tannah</i></li>
<li>Thoothukudi (Tamil: <span lang="ta">தூத்துக்குடி</span>), from <i>Tuticorin</i> and its short form <i>Tuty</i></li>
I need to parse the output such that the result will be extracting words like: Thalassery, Tellicherry, Thanjavur, Tanjore, Thane, Tannah, Thoothukudi, Tuticorin
Can anyone please help with this
You can use .findAll() to get all the li elements, then use find() for the 'a' and 'i' tags:
for item in soup.findAll('li'):
    print(item.find('a').text,item.find('i').text)
>>>
Thalassery Tellicherry
Thanjavur Tanjore
Thane Tannah
Thoothukudi Tuticorin
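If the markup is exactly as shown in the question, where the place name is plain text at the start of each li rather than a link, a variant that reads the first text node instead of the first a tag may be more robust. This is a sketch under that assumption; the html string below repeats two of the question's items.

```python
from bs4 import BeautifulSoup

html = '''
<li>
Thalassery (<a class="mw-redirect" href="/wiki/Malayalam_language" title="Malayalam language">Malayalam</a>: <span lang="ml">തലശ്ശേരി</span>), from
<i>Tellicherry</i></li>
<li>Thoothukudi (Tamil: <span lang="ta">தூத்துக்குடி</span>), from <i>Tuticorin</i> and its short form <i>Tuty</i></li>
'''

soup = BeautifulSoup(html, "html.parser")

pairs = []
for li in soup.find_all("li"):
    # First non-empty text node, e.g. "Thalassery ("; keep its first word.
    current_name = next(li.stripped_strings).split()[0]
    # First <i> holds the former name; guard against items without one.
    old = li.find("i")
    pairs.append((current_name, old.get_text() if old else ""))

print(pairs)
```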
You could also try simplified_scrapy, which is fault tolerant:
from simplified_scrapy.simplified_doc import SimplifiedDoc
html='''
<li>
Thalassery (<a class="mw-redirect" href="/wiki/Malayalam_language" title="Malayalam language">Malayalam</a>: <span lang="ml">തലശ്ശേരി</span>), from
<i>Tellicherry</i></li>
<li>Thanjavur (Tamil: <span lang="ta">தஞ்சாவூர்</span>), from British name <i>Tanjore</i></li>
<li>Thane (Marathi: <span lang="mr">ठाणे</span>), from British name <i>Tannah</i></li>
<li>Thoothukudi (Tamil: <span lang="ta">தூத்துக்குடி</span>), from <i>Tuticorin</i> and its short form <i>Tuty</i></li>
'''
doc = SimplifiedDoc(html)
lis = doc.lis
print ([(li.a.text,li.i.text if li.i else '') for li in lis])
Result:
[('Thalassery', 'Tellicherry'), ('Thanjavur', 'Tanjore'), ('Thane', 'Tannah'), ('Thoothukudi', 'Tuticorin')]
I have html code that looks something like this (the soup):
<label for="02" class="highlited">"Some text here"</label>
<span class="type3 type3-display">
<label for="01" class="highlited">"Some text here"</label>
<span class="type1 type1-display">
<label> Somete text here </label>
<span class="type999 type999-display">
<span class="type1 type1-display">
I have to grab both the labels and the spans from the page, but with multiple search parameters.
For the labels, I have to grab only those that contain for= (with any text inside).
For the spans, I have to grab only those whose class contains a word from a list, e.g.
myList = ['type1', 'type2', 'type3']
The order in which they are found on the page must be respected.
The result I need would look like this:
<label for="02" class="highlited">"Some text here"</label>
<span class="type3 type3-display">
<label for="01" class="highlited">"Some text here"</label>
<span class="type1 type1-display">
<span class="type1 type1-display">
To find the labels that contain anything after "for=", I use the following code:
soup.find_all('label', {'for': re.compile('.*')}) # it works as expected
But now I also need to find all the spans with specific wording and respect the order in which they are found on the web page.
I tried this, but it didn't work:
soup.find_all(['label', 'span'], [{'for': re.compile('.*')}, {'class': 'type1'}], recursive=False) # here I just used {'class': 'type1'} because I don't know how to pass a list to soup to search for a match
Thank you in advance!
edit: I also tried to combine 2 find_all() searches with (+), but then I lose the order.
edit2: spelling
You can do that without regex as well.
from bs4 import BeautifulSoup
data='''<label for="02" class="highlited">"Some text here"</label>
<span class="type3 type3-display"></span>
<label for="01" class="highlited">"Some text here"</label>
<span class="type1 type1-display"></span>
<label> Somete text here </label>
<span class="type999 type999-display"></span>
<span class="type1 type1-display"></span>'''
myList = ['type1', 'type2', 'type3']
soup=BeautifulSoup(data,'html.parser')
for item in soup.find_all():
    if (item.name=='label') and 'for' in item.attrs :
        print(item)
    if (item.name == 'span') and item['class'][0] in myList :
        print(item)
Output:
<label class="highlited" for="02">"Some text here"</label>
<span class="type3 type3-display"></span>
<label class="highlited" for="01">"Some text here"</label>
<span class="type1 type1-display"></span>
<span class="type1 type1-display"></span>
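The same filtering can also be expressed as a single find_all call by passing a function, which keeps the matches in document order. The sketch below is one way to do it; the helper name keep and the set name wanted are made up for illustration, and it matches any class in the list rather than only the first class of each span.

```python
from bs4 import BeautifulSoup

data = '''<label for="02" class="highlited">"Some text here"</label>
<span class="type3 type3-display"></span>
<label for="01" class="highlited">"Some text here"</label>
<span class="type1 type1-display"></span>
<label> Somete text here </label>
<span class="type999 type999-display"></span>
<span class="type1 type1-display"></span>'''

wanted = {'type1', 'type2', 'type3'}

def keep(tag):
    """Return True for labels with a for= attribute and spans with a wanted class."""
    if tag.name == 'label':
        return tag.has_attr('for')
    if tag.name == 'span':
        # 'class' is a list of class names; a span without one yields [].
        return any(cls in wanted for cls in tag.get('class', []))
    return False

soup = BeautifulSoup(data, 'html.parser')
matches = soup.find_all(keep)  # results come back in document order

for tag in matches:
    print(tag)
```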
I'm trying to extract tuples from a URL, and I've managed to extract string text and tuples using re.search(pattern_str, text_str). However, I got stuck when I tried to extract a list of tuples using re.findall(pattern_str, text_str).
The text looks like:
<li>
<a href="11111">
some text 111
<span class="some-class">
#11111
</span>
</a>
</li><li>
<a href="22222">
some text 222
<span class="some-class">
#22222
</span>
</a>
</li><li>
<a href="33333">
some text 333
<span class="some-class">
#33333
</span>
</a>
... # repeating
...
...
and I'm using the following pattern & code to extract the tuples:
text_above = "..." # this is the text above
pat_str = '<a href="(\d+)">\n(.+)\n<span class'
pat = re.compile(pat_str)
# following line is supposed to return the numbers from the 2nd line
# and the string from the 3rd line for each repeating sequence
list_of_tuples = re.findall(pat, text_above)
for t in list_of_tuples:
    # supposed to print "11111 -> blah blah 111"
    print(t[0], '->', t[1])
Maybe I'm trying something weird and impossible, and maybe it's better to extract the data using primitive string manipulation... But is there a solution?
Your regex does not take into account the whitespace (indentation) between \n and <span. (It also ignores the whitespace at the start of the line you want to capture, but that's not as much of a problem.) To fix it, you could add some \s*:
pat_str = r'<a href="(\d+)">\n\s*(.+)\n\s*<span class'
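As a quick check, the adjusted pattern can be run against a small indented sample. The sample text below is an assumption about how the real page is indented; only the pattern comes from the answer.

```python
import re

# Assumed sample: same structure as the question, with indentation restored.
text_above = '''<li>
  <a href="11111">
    some text 111
    <span class="some-class">
      #11111
    </span>
  </a>
</li><li>
  <a href="22222">
    some text 222
    <span class="some-class">
      #22222
    </span>
  </a>
</li>'''

# \s* absorbs the indentation before the captured text and before <span.
pat = re.compile(r'<a href="(\d+)">\n\s*(.+)\n\s*<span class')

# With two groups, findall returns a list of (href, text) tuples.
list_of_tuples = pat.findall(text_above)

for href, label in list_of_tuples:
    print(href, '->', label)
```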
As suggested in the comments, use a html parser like BeautifulSoup:
from bs4 import BeautifulSoup
h = """<li>
<a href="11111">
some text 111
<span class="some-class">
#11111
</span>
</a>
</li><li>
<a href="22222">
some text 222
<span class="some-class">
#22222
</span>
</a>
</li><li>
<a href="33333">
some text 333
<span class="some-class">
#33333
</span>
</a>"""
soup = BeautifulSoup(h, "html.parser")
You can get the href and the previous_sibling to the span:
print([(a["href"].strip(), a.span.previous_sibling.strip()) for a in soup.find_all("a")])
[('11111', u'some text 111'), ('22222', u'some text 222'), ('33333', u'some text 333')]
Or the href and the first content from the anchor:
print([(a["href"].strip(), a.contents[0].strip()) for a in soup.find_all("a")])
Or use .find(text=True) to get only the anchor's own text and not the text of its children:
[(a["href"].strip(), a.find(text=True).strip()) for a in soup.find_all("a")]
Also if you just want the anchors inside the list tags, you can specifically parse those:
[(a["href"].strip(), a.contents[0].strip()) for a in soup.select("li a")]
I'm scraping a Wikipedia page using BeautifulSoup in Python, and I was wondering whether there is any way to know the number of text objects in an HTML object. For example, the following code gets me the following HTML:
soup.find_all(class_ = 'toctext')
<span class="toctext">Actors and actresses</span>, <span class="toctext">Archaeologists and anthropologists</span>, <span class="toctext">Architects</span>, <span class="toctext">Artists</span>, <span class="toctext">Broadcasters</span>, <span class="toctext">Businessmen</span>, <span class="toctext">Chefs</span>, <span class="toctext">Clergy</span>, <span class="toctext">Criminals</span>, <span class="toctext">Conspirators</span>, <span class="toctext">Economists</span>, <span class="toctext">Engineers</span>, <span class="toctext">Explorers</span>, <span class="toctext">Filmmakers</span>, <span class="toctext">Historians</span>, <span class="toctext">Humourists</span>, <span class="toctext">Inventors / engineers</span>, <span class="toctext">Journalists / newsreaders</span>, <span class="toctext">Military: soldiers/sailors/airmen</span>, <span class="toctext">Monarchs</span>, <span class="toctext">Musicians</span>, <span class="toctext">Philosophers</span>, <span class="toctext">Photographers</span>, <span class="toctext">Politicians</span>, <span class="toctext">Scientists</span>, <span class="toctext">Sportsmen and sportswomen</span>, <span class="toctext">Writers</span>, <span class="toctext">Other notables</span>, <span class="toctext">English expatriates</span>, <span class="toctext">References</span>, <span class="toctext">See also</span>
I can get the first text object by running the following:
soup.find_all(class_ = 'toctext')[0].text
My goal here is to get and store all of the text objects in a list. I'm doing this with a for loop, but I don't know how many text objects there are in the HTML block. Naturally, I would hit an error if I reached an index that doesn't exist. Is there an alternative?
You can use a for...in loop.
In [13]: [t.text for t in soup.find_all(class_ = 'toctext')]
Out[13]:
['Actors and actresses',
'Archaeologists and anthropologists',
'Architects',
'Artists',
'Broadcasters',
'Businessmen',
'Chefs',
'Clergy',
'Criminals',
'Conspirators',
'Economists',
'Engineers',
'Explorers',
'Filmmakers',
'Historians',
'Humourists',
'Inventors / engineers',
'Journalists / newsreaders',
'Military: soldiers/sailors/airmen',
'Monarchs',
'Musicians',
'Philosophers',
'Photographers',
'Politicians',
'Scientists',
'Sportsmen and sportswomen',
'Writers',
'Other notables',
'English expatriates',
'References',
'See also']
Try the following code:
for txt in soup.find_all(class_ = 'toctext'):
    print(txt.text)
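On the original question of how many items there are: find_all returns a list-like ResultSet, so len() works on it, although iterating directly means you rarely need the count. A minimal sketch with a shortened stand-in for the Wikipedia markup:

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the table-of-contents spans in the question.
html = (
    '<span class="toctext">Architects</span>'
    '<span class="toctext">Artists</span>'
    '<span class="toctext">Chefs</span>'
)

soup = BeautifulSoup(html, "html.parser")
spans = soup.find_all(class_="toctext")

# The result behaves like a list, so its length is available if needed.
print(len(spans))

# Iterating over it directly avoids any index bookkeeping.
titles = [span.text for span in spans]
print(titles)
```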