Extracting text :after an element with Beautiful Soup - python

I would like to extract the text :after the <strong> element.
<li data-toggle="tooltip" title="" data-original-title=""><strong>06:25</strong> vP</li>
I've tried the following
medmar_live_departures_table = list(soup.select('li.tratta'))
for li in medmar_live_departures_table:
    info = li.text
but I'm getting both texts. I could use re to split the string but I was wondering if there was a more efficient and straightforward way of doing it.
Output
16:40  vP
Desired output
vP

You can get the last string inside each li as follows:
medmar_live_departures_table = soup.select('li.tratta')
for li in medmar_live_departures_table:
    # stripped_strings yields every string in the li with whitespace trimmed
    info = list(li.stripped_strings)[-1]
    print(info)
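An alternative that targets the text directly after the <strong> element is to read its next_sibling. This is only a sketch: the class="tratta" attribute is an assumption added to the sample markup so that the selector from the question matches, and the names html, labels and trailing are illustrative.

```python
from bs4 import BeautifulSoup

# Sample markup based on the question; class="tratta" is assumed here
html = '<ul><li class="tratta" data-toggle="tooltip"><strong>06:25</strong> vP</li></ul>'
soup = BeautifulSoup(html, "html.parser")

labels = []
for li in soup.select("li.tratta"):
    strong = li.find("strong")
    # next_sibling is the node right after </strong>; here it is the text " vP"
    trailing = strong.next_sibling if strong is not None else None
    labels.append(str(trailing).strip() if trailing is not None else "")

print(labels)
```

If the node after <strong> could be another tag rather than plain text, the stripped_strings approach above is the safer choice.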

Related

Extracting text inside tags from html document

I have an html document like this: https://dropmefiles.com/wezmb
I need to extract the text between <span id="1"> and </span>, but I don't know how.
I tried this code:
from bs4 import BeautifulSoup
with open("10_01.htm") as fp:
    soup = BeautifulSoup(fp, features="html.parser")
for a in soup.find_all('span'):
    print(a.string)
But it extracts the text from every span tag. How can I extract only the text between <span id="1"> and </span> in Python?
What you need is the .contents attribute, which holds a tag's children as a list.
Find the span <span id="1"> ... </span> and loop over its contents:
for x in soup.find(id="1").contents:
    print(x)
Or take the first child directly:
x = soup.find(id="1").contents[0]  # there will only be one element with the id 1
print(x)
This will give you:
10
that is, an empty line followed by 10 followed by another empty line. This happens because the text in the HTML sits on its own line between the tags, so the string includes the surrounding newlines.
The string is exactly '\n10\n'.
If you want just x = '10' from x = '\n10\n', you can do x = x[1:-1], since '\n' is a single character. x.strip() gives the same result and also copes with any extra whitespace.
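The steps above can be put together in a small self-contained sketch. The markup is a guess at what the linked file contains (it is not available here), so treat the html string and the second span as assumptions.

```python
from bs4 import BeautifulSoup

# Assumed markup: each value sits on its own line inside its span
html = """<div>
<span id="1">
10
</span>
<span id="2">
20
</span>
</div>"""

soup = BeautifulSoup(html, "html.parser")
span = soup.find("span", id="1")

raw = span.contents[0]             # the string as written, newlines included
value = span.get_text(strip=True)  # the same text with whitespace trimmed

print(repr(raw), repr(value))
```

get_text(strip=True) avoids slicing by position, which would break if the markup had different whitespace around the number.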

How to extract text inside a li tag using Selenium in Python

This is the HTML code from where I need to extract the text:
<li class="inline t-24 t-black t-normal break-words">
Nilesh Sengupta
</li>
This is my code:
items = driver.find_elements_by_tag_name("li")
print(items.text)
driver.find_elements_by_tag_name("li") returns a list of web elements, and a list has no text attribute.
To extract the text of all the elements, iterate over the list and read text from each one:
items = driver.find_elements_by_tag_name("li")
for el in items:
    print(el.text)
If you want the text of one specific element, use find_element_by_tag_name instead of find_elements_by_tag_name:
item = driver.find_element_by_tag_name("li")
Here item is a single web element, so you can read its text directly:
print(item.text)
Using a CSS selector is the best way to target this element:
item_text = driver.find_element_by_css_selector(".inline.t-24.t-black.t-normal.break-words").text
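Note that the find_element_by_* and find_elements_by_* helpers used above were removed in Selenium 4.3; current releases use find_element(by, value) and find_elements(by, value). The sketch below wraps that call in a hypothetical helper, element_texts, which is not part of Selenium. To keep the block importable without a browser, the locator strategies are written as plain strings; they are the same values as By.TAG_NAME and By.CSS_SELECTOR in selenium.webdriver.common.by.

```python
# Locator strategy names; identical to By.TAG_NAME and By.CSS_SELECTOR in Selenium 4
TAG_NAME = "tag name"
CSS_SELECTOR = "css selector"


def element_texts(driver, by, value):
    """Return the trimmed text of every element the locator matches.

    `driver` is assumed to be a Selenium 4 WebDriver (or any object with a
    compatible find_elements(by, value) method).
    """
    return [el.text.strip() for el in driver.find_elements(by, value)]


# Usage with a real driver (not executed here):
#   names = element_texts(driver, CSS_SELECTOR, "li.inline.t-24.t-black.t-normal.break-words")
#   all_items = element_texts(driver, TAG_NAME, "li")
```

Returning a list keeps the helper usable whether the locator matches one element or many.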

How to extract text which is surrounded by two <span> element?

I am in the middle of scraping data from a website, and I encountered the following markup:
code = "<li class="price-current">
<span class="price-current-label">
</span>₹ 7,372 
<span class="price-current-range">
<abbr title="to">–</abbr>
</span>
</li> "
I need to extract only "₹ 7,372".
I have tried the following.
1. code.text
but it results in
'\n\n₹ 7,372\xa0\r\n \n–\n\n'
2. code.text.strip()
but it results in
'₹ 7,372\xa0\r\n \n–'
Is there any method that returns only the price?
Please let me know, so that I can complete my project.
OK, I managed to clean up the data you need. This approach is a little ugly, but it works.
from bs4 import BeautifulSoup as BS
html= """<li class="price-current">
<span class="price-current-label">
</span>₹ 7,372
<span class="price-current-range">
<abbr title="to">–</abbr>
</span>
</li> """
soup = BS(html, 'html.parser')
li = soup.find('li').text
for j in range(3):
    for i in ['\n',' ', '–', '\xa0', '\r','\x20','\x0a','\x09','\x0c','\x0d']:
        li = li.strip(i)
print(li)
Output:
₹ 7,372
The list in the inner loop contains all the ASCII whitespace characters (as far as I know) plus the other symbols that appear in the text.
The outer loop runs 3 times because a single pass does not clean the value: each strip call removes only one kind of character from the ends, so a character hidden behind a different one survives until a later pass. You can check the value after every iteration in a variable explorer.
Optionally, you can also try to work out exactly which character produces the run of apparent spaces between the <span> tags.
from bs4 import BeautifulSoup as bs
code = '''<li class="price-current">
<span class="price-current-label">
</span>₹ 7,372
<span class="price-current-range">
<abbr title="to">–</abbr>
</span>
</li>'''
soup = bs(code,'html.parser')
w = soup.find_all('li')
l = []
for item in w:
    l.append(item)
words = str(l)
t = words.split('\n')
print(t[2][7:])
Output:
₹ 7,372
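Both answers above depend on the exact characters or positions in this one snippet. A less brittle sketch, assuming the price is always a text node that is a direct child of the li, is to ask Beautiful Soup for only the li's own strings. The \xa0 in the sample is taken from the output shown in the question; the names direct_text and price are illustrative.

```python
from bs4 import BeautifulSoup

html = """<li class="price-current">
<span class="price-current-label">
</span>₹ 7,372\xa0
<span class="price-current-range">
<abbr title="to">–</abbr>
</span>
</li>"""

soup = BeautifulSoup(html, "html.parser")
li = soup.find("li", class_="price-current")

# recursive=False limits the search to strings that are direct children of
# the li, so the "–" inside the nested <abbr> is never visited
direct_text = [s.strip() for s in li.find_all(string=True, recursive=False)]

# Keep the first non-empty piece; str.strip() also removes the non-breaking space
price = next(s for s in direct_text if s)
print(price)
```

Because the nested spans are skipped entirely, no list of characters to strip is needed.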

Beautiful Soup / Regular Expressions: Extract a portion of text from NavigableString

I'm really new to learning python so this could be really obvious, but I have extracted a NavigableString from BeautifulSoup and I need to find data in the string. However, it's not as easy as some of the examples I've seen online.
My end goal is to create a dictionary that looks something like this:
dict = {'Fandom':'Undertale (Video Game)', 'Works':15341}
Here are two examples of the strings:
<li>
<a class="tag" href="/tags/Undertale%20(Video%20Game)/works">Undertale (Video Game)</a>
(15341)
</li>
<li>
<a class="tag" href="/tags/Sherlock%20Holmes%20*a*%20Related%20Fandoms/works">Sherlock Holmes & Related Fandoms</a>
(101015)
</li>
I've already succeeded in extracting the fandom from the string, but now I need the works count in parentheses. How would I use Beautiful Soup and/or regular expressions to do this?
I also need to do error handling because while a fandom will always be displayed, it may not have a work count next to it.
<li>
<a class="tag" href="/tags/Composer%20-%20Fandom/works">Composer - Fandom</a>
</li>
Here's the relevant pieces of code:
for each_f in cate:
    #print(each_f)
    result = each_f.find('a')
    if result !=-1:
        #here is where I grab the Fandom vals
        fandom_name = result.contents
        #print(result.contents)
NOTE: I know I'm missing the code to append to the dictionary, I haven't made it that far yet. I'm just trying to get the values to print to the screen.
Use dict.fromkeys(('Fandom', 'Works')) to build a dict with both keys set to None:
In [17]: dict.fromkeys(('Fandom', 'Works'))
Out[17]: {'Fandom': None, 'Works': None}
Use zip to pair the keys with the strings in each li tag. zip stops at the shorter input, so an li without a count yields only the 'Fandom' pair:
list(zip(('Fandom', 'Works'), li.stripped_strings))
[('Fandom', 'Undertale (Video Game)'), ('Works', '(15341)')]
[('Fandom', 'Sherlock Holmes & Related Fandoms'), ('Works', '(101015)')]
[('Fandom', 'Composer - Fandom')]
Then update the dict with those pairs:
In [20]: for li in soup.find_all('li'):
    ...:     d = dict.fromkeys(('Fandom', 'Works'))
    ...:     out = zip(('Fandom', 'Works'), li.stripped_strings)
    ...:     d.update(out)
    ...:     print(d)
Output:
{'Works': '(15341)', 'Fandom': 'Undertale (Video Game)'}
{'Works': '(101015)', 'Fandom': 'Sherlock Holmes & Related Fandoms'}
{'Works': None, 'Fandom': 'Composer - Fandom'}
You can use stripped_strings and unpack the values to get your blocks of text. You can store the results in a dict so that you can use them later.
Example:
from bs4 import BeautifulSoup
example = """<li>
<a class="tag" href="/tags/Undertale%20(Video%20Game)/works">Undertale (Video Game)</a>
(15341)
</li>
<li><a class="tag" href="/tags/Sherlock%20Holmes%20*a*%20Related%20Fandoms/works">Sherlock Holmes & Related Fandoms</a>
(101015)
</li>
<li>
<a class="tag" href="/tags/Composer%20-%20Fandom/works">Composer - Fandom</a>
</li>"""
soup = BeautifulSoup(example, "html.parser")
Fandom = {"Fandom" : []}
for li in soup.find_all("li"):
    try:
        # Unpacking succeeds only when the li holds exactly two strings
        fandom, count = li.stripped_strings
        Fandom["Fandom"].append({fandom.strip() : count[1:-1]})
    except ValueError:
        fandom = li.text.strip()
        Fandom["Fandom"].append({fandom.strip() : 0})
print (Fandom)
This outputs:
{'Fandom': [{'Undertale (Video Game)': '15341'}, {'Sherlock Holmes & Related Fandoms': '101015'}, {'Composer - Fandom': 0}]}
The try/except catches any unpacking that doesn't yield exactly two values: your fandom title and the work count.
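Since the question asks for records shaped like {'Fandom': ..., 'Works': 15341} with an integer count, here is a sketch that combines Beautiful Soup with a regular expression. It assumes the count, when present, is the text immediately after the link and is wrapped in parentheses; the names rows, link and match are illustrative, and the sample markup is a shortened copy of the question's.

```python
import re
from bs4 import BeautifulSoup

html = """<ul>
<li>
<a class="tag" href="/tags/Undertale%20(Video%20Game)/works">Undertale (Video Game)</a>
(15341)
</li>
<li>
<a class="tag" href="/tags/Composer%20-%20Fandom/works">Composer - Fandom</a>
</li>
</ul>"""

soup = BeautifulSoup(html, "html.parser")
rows = []
for li in soup.find_all("li"):
    link = li.find("a", class_="tag")
    if link is None:
        continue  # not a fandom entry
    # The text right after </a> holds "(15341)" when a count is present
    match = re.search(r"\((\d+)\)", str(link.next_sibling or ""))
    rows.append({
        "Fandom": link.get_text(strip=True),
        "Works": int(match.group(1)) if match else None,
    })

print(rows)
```

Searching only the text after the link means parentheses inside the fandom name, as in "Undertale (Video Game)", cannot be mistaken for the count.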

Parsing HTML tags using python

I am new to Python. I want to store the text of each HTML tag as a separate item in a list.
from bs4 import BeautifulSoup
text = """
<body>
<div class="product">
<div class="x">orange</div>
<div class="x">apple</div>
<p> This is text </p>
</div>
</body>"""
soup = BeautifulSoup(text)
y=[]
for i in (soup.find_all("div", class_="product")):
    y.append(i.get_text().encode("utf-8").strip())
With the code above, the length of y is 1 and all the text is stored in a single list item. However, I need to parse using div.product and store the text inside each HTML tag as a separate list item.
So y should be:
y =['orange', 'apple', 'This is text']
instead of :
y=['orange\napple\n This is text']
Don't use text if all you wanted was the directly contained string, and ask only for elements contained in the div.product tag:
for elem in soup.select("div.product *"):
    y.append(elem.string.strip().encode('utf8'))
Demo:
>>> y = []
>>> for elem in soup.select("div.product *"):
...     y.append(elem.string.strip().encode('utf8'))
...
>>> y
['orange', 'apple', 'This is text']
soup.find_all("div", class_="product")
returns a list of all the div tags with class product. There is only one such div here, so the for loop iterates once and gives the complete text of that div.
To get the text of every child, call find_all() on the single div instead of on the list:
for child in soup.find("div", class_="product").find_all():
    y.append(child.string.strip().encode('utf8'))
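The answers above encode each string to bytes, which suited Python 2. On Python 3 that step is usually unnecessary, so a sketch of the same idea without it might look like this. It assumes only the direct children of div.product are wanted, which is why recursive=False is used.

```python
from bs4 import BeautifulSoup

text = """
<body>
<div class="product">
<div class="x">orange</div>
<div class="x">apple</div>
<p> This is text </p>
</div>
</body>"""

soup = BeautifulSoup(text, "html.parser")
product = soup.find("div", class_="product")

# One list item per direct child tag, with surrounding whitespace removed
y = [child.get_text(strip=True) for child in product.find_all(recursive=False)]
print(y)
```

get_text(strip=True) also works when a child contains nested tags, where .string would return None.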
