I am facing a problem and don't know how to solve it properly.
I want to extract the price (130 € in the first example, 130 € in the second).
The problem is that the attributes change all the time, so I cannot do something like the following, because I am scraping hundreds of sites and on each site the first two characters of the "id" attribute may differ:
tag = soup_expose_html.find('span', attrs={'id' : re.compile(r'(07_content$)')})
Even if I used something like this, it wouldn't work, because nothing ties the match to the price and I would probably get some other value:
tag = soup_expose_html.find('span', attrs={'id' : re.compile(r'([0-9]{2}_content$)')})
Example html code:
<span id="07_lbl" class="lbl">Price:</span>
<span id="07_content" class="content">130 €</span>
<span id="08_lbl" class="lbl">Value:</span>
<span id="08_content" class="content">90000 €</span>
<span id="03_lbl" class="lbl">Price:</span>
<span id="03_content" class="content">130 €</span>
<span id="04_lbl" class="lbl">Value:</span>
<span id="04_content" class="content">90000 €</span>
The only thing I can think of at the moment is to identify the label tag with something like text='Price:', then get .next_sibling and extract the string. But I am not sure if there is a better way to do it. Any suggestions? :-)
How about a findAll solution?
First collect all possible id prefixes, then iterate over them and fetch the matching elements:
>>> from bs4 import BeautifulSoup
>>> import re
>>> html = """
... <span id="07_lbl" class="lbl">Price:</span>
... <span id="07_content" class="content">130 €</span>
... <span id="08_lbl" class="lbl">Value:</span>
... <span id="08_content" class="content">90000 €</span>
...
...
... <span id="03_lbl" class="lbl">Price:</span>
... <span id="03_content" class="content">130 €</span>
... <span id="04_lbl" class="lbl">Value:</span>
... <span id="04_content" class="content">90000 €</span>
... """
>>>
>>> soup = BeautifulSoup(html, 'html.parser')
>>> span_id_prefixes = [
... span['id'].replace("_content","")
... for span in soup.findAll('span', attrs={'id' : re.compile(r'(_content$)')})
... ]
>>> for prefix in span_id_prefixes:
...     lbl = soup.find('span', attrs={'id' : '%s_lbl' % prefix})
...     content = soup.find('span', attrs={'id' : '%s_content' % prefix})
...     if lbl and content:
...         print(lbl.text, content.text)
...
Price: 130 €
Value: 90000 €
Price: 130 €
Value: 90000 €
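If you only want the price rows out of that loop, you can additionally filter on the label text; a minimal extension of the loop above:
>>> for prefix in span_id_prefixes:
...     lbl = soup.find('span', attrs={'id' : '%s_lbl' % prefix})
...     content = soup.find('span', attrs={'id' : '%s_content' % prefix})
...     # keep only the rows whose label reads "Price:"
...     if lbl and content and lbl.text == 'Price:':
...         print(content.text)
...
130 €
130 €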
Here is how you would extract only the price values, along the lines you had in mind in your original post.
html = """
<span id="07_lbl" class="lbl">Price:</span>
<span id="07_content" class="content">130 €</span>
<span id="08_lbl" class="lbl">Value:</span>
<span id="08_content" class="content">90000 €</span>
<span id="03_lbl" class="lbl">Price:</span>
<span id="03_content" class="content">130 €</span>
<span id="04_lbl" class="lbl">Value:</span>
<span id="04_content" class="content">90000 €</span>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
price_texts = soup.find_all('span', text='Price:')
for element in price_texts:
    # .next_sibling would give the whitespace text node between the two
    # spans here, so find_next_sibling('span') is the more reliable choice
    price_value = element.find_next_sibling('span')
    print(price_value.get_text())
# It prints:
# 130 €
# 130 €
This solution has less code and, IMO, is clearer.
Try BeautifulSoup's select() function. It uses CSS selectors:
for span in soup_expose_html.select("span[id$=_content]"):
    print(span.text)
The result is a list of all spans whose id ends with _content.
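If you want only the prices from that list, one option is to select the label spans instead and step over to the matching content span; a minimal sketch, assuming each Price: label is immediately followed by its _content span as in the example HTML:
for lbl in soup_expose_html.select('span[id$="_lbl"]'):
    # keep only the rows whose label reads "Price:"
    if lbl.get_text(strip=True) == 'Price:':
        print(lbl.find_next_sibling('span').get_text(strip=True))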
This is the result of:
soup.find('span', {'class':'js-date-picker btn--secondary btn--secondary--no-spacing'})
<span class="js-date-picker btn--secondary btn--secondary--no-spacing" data-clear="/h/?type=ln&search=ethereum&lang=en&searchheadlines=1" data-date='{"sel":false,"latest":1599889680,"now":1599897600}' data-href="/h/?type=ln&search=ethereum&lang=en&searchheadlines=1&d=" href="javascript://">
<span class="btn--secondary__icon"><i class="far fa-calendar-alt"></i></span>
<span class="btn--secondary__label">
<span class="dtctxt"><span class="d">12 Sep</span><span class="t"> 06:48</span></span></span>
</span>
Now I want to extract {"sel":false,"latest":1599889680,"now":1599897600} from this text
How can I do that?
Try this:
import ast
from bs4 import BeautifulSoup
html = """
<span class="js-date-picker btn--secondary btn--secondary--no-spacing" data-clear="/h/?type=ln&search=ethereum&lang=en&searchheadlines=1" data-date='{"sel":false,"latest":1599889680,"now":1599897600}' data-href="/h/?type=ln&search=ethereum&lang=en&searchheadlines=1&d=" href="javascript://"></span>
<span class="btn--secondary__icon"><i class="far fa-calendar-alt"></i></span>
<span class="btn--secondary__label">
<span class="dtctxt"><span class="d">12 Sep</span><span class="t"> 06:48</span></span></span>
</span>
"""
soup = BeautifulSoup(html, 'html.parser').find("span", {"class": "js-date-picker btn--secondary btn--secondary--no-spacing"})
result = soup.get("data-date")
print(result)
Output:
{"sel":false,"latest":1599889680,"now":1599897600}
If you need to, you can convert the result to a dict object, e.g.:
data_date = ast.literal_eval(result.replace("false", "False"))
print(data_date['now'])
Output: 1599897600
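Since the data-date value is plain JSON, the standard library json module is arguably a more direct fit than ast.literal_eval and needs no false/False replacement; a small sketch:
import json

data_date = json.loads(result)   # {'sel': False, 'latest': 1599889680, 'now': 1599897600}
print(data_date['now'])          # 1599897600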
I have a webpage with the following code:
<li>
Thalassery (<a class="mw-redirect" href="/wiki/Malayalam_language" title="Malayalam language">Malayalam</a>: <span lang="ml">തലശ്ശേരി</span>), from
<i>Tellicherry</i></li>
<li>Thanjavur (Tamil: <span lang="ta">தஞ்சாவூர்</span>), from British name <i>Tanjore</i></li>
<li>Thane (Marathi: <span lang="mr">ठाणे</span>), from British name <i>Tannah</i></li>
<li>Thoothukudi (Tamil: <span lang="ta">தூத்துக்குடி</span>), from <i>Tuticorin</i> and its short form <i>Tuty</i></li>
I need to parse this so that the result is the words: Thalassery, Tellicherry, Thanjavur, Tanjore, Thane, Tannah, Thoothukudi, Tuticorin.
Can anyone please help with this?
You can use .findAll() to get all the li elements; the current name is the text before the opening parenthesis and the former name sits in the first <i> tag:
for item in soup.findAll('li'):
    print(item.get_text().split('(')[0].strip(), item.find('i').text)
>>>
Thalassery Tellicherry
Thanjavur Tanjore
Thane Tannah
Thoothukudi Tuticorin
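Note that .find('i') returns None for a <li> without an <i> tag (e.g. stray list items elsewhere on the page), and None.text raises an AttributeError; a hedged variant that guards against that:
for item in soup.findAll('li'):
    name = item.get_text().split('(')[0].strip()
    former = item.find('i')                # None if the li has no <i> tag
    print(name, former.text if former else '')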
Try simplified_scrapy's solution; it is fault tolerant:
from simplified_scrapy.simplified_doc import SimplifiedDoc
html='''
<li>
Thalassery (<a class="mw-redirect" href="/wiki/Malayalam_language" title="Malayalam language">Malayalam</a>: <span lang="ml">തലശ്ശേരി</span>), from
<i>Tellicherry</i></li>
<li>Thanjavur (Tamil: <span lang="ta">தஞ்சாவூர்</span>), from British name <i>Tanjore</i></li>
<li>Thane (Marathi: <span lang="mr">ठाणे</span>), from British name <i>Tannah</i></li>
<li>Thoothukudi (Tamil: <span lang="ta">தூத்துக்குடி</span>), from <i>Tuticorin</i> and its short form <i>Tuty</i></li>
'''
doc = SimplifiedDoc(html)
lis = doc.lis
print ([(li.a.text,li.i.text if li.i else '') for li in lis])
Result:
[('Thalassery', 'Tellicherry'), ('Thanjavur', 'Tanjore'), ('Thane', 'Tannah'), ('Thoothukudi', 'Tuticorin')]
I have the following html code:
<div>
<span class="test">
<span class="f1">
5 times
</span>
</span>
</span>
</div>
<div>
</div>
<div>
<span class="test">
<span class="f1">
6 times
</span>
</span>
</span>
</div>
I managed to navigate the tree, but when trying to print I get the following error:
AttributeError: 'list' object has no attribute 'text'
The Python code that works:
x=soup.select('.f1')
print(x)
gives the following:
[]
[]
[]
[]
[<span class="f1"> 19 times</span>]
[<span class="f1"> 12 times</span>]
[<span class="f1"> 6 times</span>]
[]
[]
[]
[<span class="f1"> 6 times</span>]
[<span class="f1"> 1 time</span>]
[<span class="f1"> 11 times</span>]
but print(x.prettify) throws the error above. I am basically trying to get the text between the span tags for all instances: blank when there is none, and the string when it is available.
select() returns a list of results, even if there are 0 matches. Since a list object has no text attribute, you get the AttributeError.
Likewise, prettify() is there to make the HTML of a single element more readable; it is not defined on a list either.
If all you're looking to do is extract the texts when available:
texts = [''.join(i.stripped_strings) for i in x if i]
# ['5 times', '6 times']
This removes the superfluous space/newline characters and gives you just the bare text. The trailing if i only keeps entries that are not empty or None.
If you actually care for the spaces/newlines, do this instead:
texts = [i.text for i in x if i]
# ['\n 5 times\n ', '\n 6 times\n ']
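If you want to keep a placeholder for the blocks that have no .f1 span (blank when none, string when available, as described in the question), you could iterate over the parent elements instead and fall back to an empty string; a rough sketch, assuming the blocks of interest are the sibling <div> elements from the snippet:
texts = []
for div in soup.select('div'):
    span = div.select_one('span.f1')          # None when the div has no .f1 span
    texts.append(span.get_text(strip=True) if span else '')
# e.g. ['5 times', '', '6 times'] for the snippet above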
from bs4 import BeautifulSoup
html = '''<div>
<span class="test">
<span class="f1">
5 times
</span>
</span>
</span>
</div>
<div>
</div>
<div>
<span class="test">
<span class="f1">
6 times
</span>
</span>
</span>
</div>'''
soup = BeautifulSoup(html, 'html.parser')
aaa = soup.find_all('span', attrs={'class':'f1'})
for i in aaa:
    print(i.text.strip())
Output:
5 times
6 times
I'd recommend using the .findAll method and looping over the matched spans.
Example:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for span in soup.findAll("span", class_="f1"):
    if span.text.isspace():
        continue
    else:
        print(span.text)
The .isspace() method checks whether the string consists only of whitespace (a simple truthiness check won't work here, since an "empty" HTML span still contains whitespace).
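Equivalently, you could strip the text first and skip whatever ends up empty; a minimal variant of the loop above:
for span in soup.findAll("span", class_="f1"):
    text = span.get_text(strip=True)
    if text:  # skips spans that contain only whitespace
        print(text)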
I have a document which contains several div.inventory siblings.
<div class="inventory">
<span class="item-number">123</span>
<span class="cost">
<a href="http://linktoitem">$1.23</a>
</span>
</div>
I would like to iterate over them to print the item number and link of the item.
123 http://linktoitem
456 http://linktoitem2
789 http://linktoitem3
How do I parse these two values after selecting the div.inventory element?
import requests
from bs4 import BeautifulSoup
htmlSource = requests.get(url).text
soup = BeautifulSoup(htmlSource)
matches = soup.select('div.inventory')
for match in matches:
    # prints 123
    # prints http://linktoitem
Also - what is the difference between the select function and find* functions?
You can find both items using find() relying on the class attributes:
soup = BeautifulSoup(data, 'html.parser')
for inventory in soup.select('div.inventory'):
    number = inventory.find('span', class_='item-number').text
    link = inventory.find('span', class_='cost').a.get('href')
    print(number, link)
Example:
from bs4 import BeautifulSoup
data = """
<body>
<div class="inventory">
<span class="item-number">123</span>
<span class="cost">
<a href="http://linktoitem">$1.23</a>
</span>
</div>
<div class="inventory">
<span class="item-number">456</span>
<span class="cost">
<a href="http://linktoitem2">$1.23</a>
</span>
</div>
<div class="inventory">
<span class="item-number">789</span>
<span class="cost">
<a href="http://linktoitem3">$1.23</a>
</span>
</div>
</body>
"""
soup = BeautifulSoup(data, 'html.parser')
for inventory in soup.select('div.inventory'):
    number = inventory.find('span', class_='item-number').text
    link = inventory.find('span', class_='cost').a.get('href')
    print(number, link)
Prints:
123 http://linktoitem
456 http://linktoitem2
789 http://linktoitem3
Note the use of select() - this method lets you search the page with CSS selectors, whereas the find*() functions take tag names, attribute dictionaries and keyword filters. Also note the class_ argument - the underscore matters, since class is a reserved keyword in Python.
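For comparison, the same extraction can be written with select_one() and CSS selectors only; a sketch, assuming the cost span contains the <a href="..."> link shown in the expected output:
for inventory in soup.select('div.inventory'):
    number = inventory.select_one('span.item-number').get_text(strip=True)
    link = inventory.select_one('span.cost a')['href']
    print(number, link)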
My code is like this:
response = urllib2.urlopen("file:///C:/data20140801.html")
page = response.read()
tree = etree.HTML(page)
data = tree.xpath("//p/span/text()")
The HTML page could have this structure:
<span style="font-size:10.0pt">Something</span>
The HTML page could also have this structure:
<p class="Normal">
<span style="font-size:10.0pt">Some</span>
<span style="font-size:10.0pt">thing<span>
</p>
Using the same code for both, I want to get "Something".
The XPath expression returns a list of values:
>>> from lxml import etree
>>> tree = etree.HTML('''\
... <p class="Normal">
... <span style="font-size:10.0pt">Some</span>
... <span style="font-size:10.0pt">thing<span>
... </p>
... ''')
>>> tree.xpath("//p/span/text()")
['Some', 'thing']
Use ''.join() to combine the two strings into one:
>>> ''.join(tree.xpath("//p/span/text()"))
'Something'
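If the page contains several such paragraphs and you want one joined string per paragraph rather than one concatenation of everything, you could join the span texts per <p>; a rough sketch, assuming the spans always sit inside <p class="Normal"> elements as in the second structure:
>>> for p in tree.xpath('//p[@class="Normal"]'):
...     # join the text of all spans inside this paragraph
...     print(''.join(p.xpath('.//span/text()')))
...
Something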