Find unknown tag containing given text - python

My HTML is like:
<body>
<div class="afds">
<span class="dfsdf">mytext</span>
</div>
<div class="sdf dzf">
<h1>some random text</h1>
</div>
</body>
I want to find all tags containing "text" & their corresponding classes. In this case, I want:
span, "dfsdf"
h1, null
Next, I want to be able to navigate through the returned tags. For example, find the div parent tag & respective classes of all the returned tags.
If I execute the following
soupx.find_all(text=re.compile(".*text.*"))
it simply returns the text part of the tags:
['mytext', ' some random text']
Please help.

You are probably looking for something along these lines:
ts = soup.find_all(text=re.compile(".*text.*"))
for t in ts:
    if len(t.parent.attrs) > 0:
        for k in t.parent.attrs.keys():
            print(t.parent.name, t.parent.attrs[k][0])
    else:
        print(t.parent.name, "null")
Output:
span dfsdf
h1 null

find_all() does not return just strings, it returns bs4.element.NavigableString.
That means you can call other beautifulsoup functions on those results.
Have a look at find_parent and find_parents in the BeautifulSoup documentation.
children = soupx.find_all(text=re.compile(".*text.*"))
for c in children:
    print(c.find_parent("div"))
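Putting the two answers together, here is a minimal sketch (using the question's HTML) that prints each matching tag, its first class (or "null"), and the classes of the enclosing div:

from bs4 import BeautifulSoup
import re

html = """
<body>
<div class="afds">
<span class="dfsdf">mytext</span>
</div>
<div class="sdf dzf">
<h1>some random text</h1>
</div>
</body>
"""

soup = BeautifulSoup(html, "html.parser")

# For every text node containing "text", report the tag that holds it,
# its first class (or "null" when it has none), and the parent <div> classes.
for t in soup.find_all(text=re.compile("text")):
    tag = t.parent
    div = tag.find_parent("div")
    print(tag.name, tag.get("class", ["null"])[0], "->", div.get("class"))

This prints span dfsdf -> ['afds'] and h1 null -> ['sdf', 'dzf'].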

Related

Python3 - Extract the text from a bs4.element.Tag and add to a dictionary

I am scraping a website which returns a bs4.element.Tag similar to the following:
<span class="attributes-value">
<span class="four-door">four door</span>
<span class="inline-4-engine">inline 4 engine</span>
<span class="24-gallons-per-mile">24 gallons per mile</span>
</span>
I am trying to extract just the text from this block and add it to a dictionary. All of the examples that I am seeing on the forum include some sort of common element like an 'id' or similar. I am not an HTML guy, so I may be using incorrect terms.
What I would like to do is get the text ("four door", "inline 4 engine", etc.) and add them as values to a dictionary with the key being a pre-designated variable, car_model.
cars = {'528i':['four door', 'inline 4 engine']}
I can't figure out a universal way to pull out the text because there may be more or fewer span classes with different text. Thanks for your help!
You need to loop through all the elements by selector and extract text value from these elements.
A selector is a specific path to the element you want. In my case, the selector is .attributes-value span, where .attributes-value allows you to access the class, and span allows you to access the tags within that class.
The get_text() method retrieves the content between the opening and closing tags. This is exactly what you need.
I also recommend using lxml because it will speed up your code.
The full code is attached below:
from bs4 import BeautifulSoup
import lxml

html = '''
<span class="attributes-value">
<span class="four-door">four door</span>
<span class="inline-4-engine">inline 4 engine</span>
<span class="24-gallons-per-mile">24 gallons per mile</span>
</span>
'''

soup = BeautifulSoup(html, 'lxml')

cars = {
    '528i': []
}

for span in soup.select(".attributes-value span"):
    cars['528i'].append(span.get_text())

print(cars)
Output:
{'528i': ['four door', 'inline 4 engine', '24 gallons per mile']}
You can use:
from collections import defaultdict
from bs4 import BeautifulSoup

out = defaultdict(list)
soup = BeautifulSoup(html_doc, 'html.parser')  # html_doc holds the snippet from the question

for tag in soup.select(".attributes-value span"):
    out["528i"].append(tag.text)

print(dict(out))
Prints:
{'528i': ['four door', 'inline 4 engine', '24 gallons per mile']}
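Since the question mentions a pre-designated car_model variable and a varying number of inner spans, here is a compact variant of the same idea (a sketch; car_model is assumed to hold the key):

from bs4 import BeautifulSoup

car_model = '528i'  # pre-designated key, per the question
html = '''
<span class="attributes-value">
<span class="four-door">four door</span>
<span class="inline-4-engine">inline 4 engine</span>
<span class="24-gallons-per-mile">24 gallons per mile</span>
</span>
'''

soup = BeautifulSoup(html, 'html.parser')

# A list comprehension copes with any number of inner <span> tags.
cars = {car_model: [span.get_text(strip=True) for span in soup.select('.attributes-value span')]}
print(cars)  # {'528i': ['four door', 'inline 4 engine', '24 gallons per mile']}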

Dynamically matching a string that starts with a substring

I need to dynamically match a string that starts with forsale_. Here, I'm finding it by hardcoding the characters that follow, but I'd like to do this dynamically:
for_sale = response.html.find('span.forsale_QoVFl > a', first=True)
I tried using startswith(), but I'm not sure how to implement it.
Sample response.html:
<section id="release-marketplace" class="section_9nUx6 open_BZ6Zt">
<header class="header_W2hzl">
<div class="header_3eShg">
<h3>Marketplace</h3>
<span class="forsale_QoVFl">2 For Sale from <span class="price_2Wkos">$355.92</span></span>
</div>
</header>
<div class="content_1TFzi">
<div class="buttons_1G_mP">Buy CDSell CD</div>
</div>
</section>
startswith() is straightforward. x = txt.startswith("forsale_") will return a bool, where txt is the string you want to test.
For more involved pattern matching, you want to look at regular expressions. Something like this is the equivalent of the startswith() line above:
import re
txt = "forsale_arbitrarychars"
x = re.search("^forsale_", txt)
If you were to replace ^forsale_ with something like ^forsale_[0-9]*$, it would only accept digits after the underscore.
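For example, a quick check of those two patterns (the test strings are made up for illustration):

import re

print(bool(re.search("^forsale_", "forsale_QoVFl")))          # True: starts with forsale_
print(bool(re.search("^forsale_[0-9]*$", "forsale_QoVFl")))   # False: letters follow the underscore
print(bool(re.search("^forsale_[0-9]*$", "forsale_123")))     # True: only digits follow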
I assume your final expected output is the link in the target <span>. If so, I would do it using lxml and xpath:
import lxml.html as lh
sale = """[your html above]"""
doc = lh.fromstring(sale)
print(doc.xpath('//span[@class[starts-with(., "forsale_")]]/a/@href')[0])
Output:
/sell/release/XXX
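If you are working with BeautifulSoup (or any CSS-selector based API), a prefix attribute selector is another option. A minimal sketch against the snippet from the question (note the sample markup has no <a> inside the span, so only the text is printed here):

from bs4 import BeautifulSoup

html = '''
<section id="release-marketplace" class="section_9nUx6 open_BZ6Zt">
<header class="header_W2hzl">
<div class="header_3eShg">
<h3>Marketplace</h3>
<span class="forsale_QoVFl">2 For Sale from <span class="price_2Wkos">$355.92</span></span>
</div>
</header>
</section>
'''

soup = BeautifulSoup(html, 'html.parser')

# [class^="forsale_"] matches any span whose class attribute starts with "forsale_",
# so the random suffix after the underscore no longer matters.
span = soup.select_one('span[class^="forsale_"]')
print(span.get_text())  # 2 For Sale from $355.92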

BeautifulSoup search attributes-value

I'm trying to search in HTML documents for specific attribute values.
e.g.
<html>
<h2 itemprop="prio1"> TEXT PRIO 1 </h2>
<span id="prio2"> TEXT PRIO 2 </span>
</html>
I want to find all items with attribute values beginning with "prio".
I know that I can do something like:
soup.find_all(itemprop=re.compile('prio.*'))
Or
soup.find_all(id=re.compile('prio.*'))
But what I am looking for is something like:
soup.find_all(*=re.compile('prio.*'))
First off, your regex is wrong: if you want to match only strings starting with prio, you should anchor it with ^; as written, the regex matches prio anywhere in the string. If you are going to test each attribute anyway, you can simply use str.startswith:
h = """<html>
<h2 itemprop="prio1"> TEXT PRIO 1 </h2>
<span id="prio2"> TEXT PRIO 2 </span>
</html>"""
soup = BeautifulSoup(h, "lxml")
tags = soup.find_all(lambda t: any(a.startswith("prio") for a in t.attrs.values()))
If you just want to check for certain attributes:
tags = soup.find_all(lambda t: t.get("id","").startswith("prio") or t.get("itemprop","").startswith("prio"))
But if you want a more efficient solution, you might want to look at lxml, which allows you to use wildcards:
from lxml import html
xml = html.fromstring(h)
tags = xml.xpath("//*[starts-with(@*, 'prio')]")
print(tags)
Or just id and itemprop:
tags = xml.xpath("//*[starts-with(@id, 'prio') or starts-with(@itemprop, 'prio')]")
I don't know if this is the best way, but this works:
>>> soup.find_all(lambda element: any(re.search('prio.*', attr) for attr in element.attrs.values()))
[<h2 itemprop="prio1"> TEXT PRIO 1 </h2>, <span id="prio2"> TEXT PRIO 2 </span>]
In this case, you can access each element through the lambda element: parameter, and re.search looks for 'prio.*' in every entry of element.attrs.values().
Then any() over those results tells us whether the element has an attribute whose value starts with 'prio'.
You can also use str.startswith here instead of a regex, since you're only checking whether an attribute value starts with 'prio', like below:
soup.find_all(lambda element: any(attr.startswith('prio') for attr in element.attrs.values()))
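For completeness, a runnable version of that startswith approach (a sketch; the isinstance check is added because bs4 stores multi-valued attributes such as class as lists):

from bs4 import BeautifulSoup

h = '''<html>
<h2 itemprop="prio1"> TEXT PRIO 1 </h2>
<span id="prio2"> TEXT PRIO 2 </span>
</html>'''

soup = BeautifulSoup(h, 'html.parser')

# Skip list-valued attributes (e.g. class) so startswith is only called on strings.
tags = soup.find_all(lambda t: any(isinstance(v, str) and v.startswith('prio')
                                   for v in t.attrs.values()))
print(tags)
# [<h2 itemprop="prio1"> TEXT PRIO 1 </h2>, <span id="prio2"> TEXT PRIO 2 </span>]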

Extract based on instances of nodes

I'm trying to extract some text using Beautiful Soup. The relevant portion looks something like this.
...
<p class="consistent"><strong>RecurringText</strong></p>
<p class="consistent">Text1</p>
<p class="consistent">Text2</p>
<p class="consistent">Text3</p>
<p class="consistent"><strong>VariableText</strong></p>
...
RecurringText, as the name implies, is consistent across all the files. However, VariableText changes; the only thing it has in common is that it is the next marked-up section. I'd like to extract Text1, Text2, and Text3. What comes before (up to and including RecurringText) and what comes after (including and after VariableText) can be left behind. I have found elsewhere how to extract the portion starting from RecurringText, but I am unsure how to drop the next item, if that makes sense.
In sum, how can I extract text based on the characteristic of VariableText (whose string varies across the URLs) consistently coming after the last item of Text1, Text2, ..., Textn (where n differs across files)?
You can basically take the items between one p element containing a strong element and the next p element containing a strong element:
from bs4 import BeautifulSoup
data = """
<div>
<p class="consistent"><strong>RecurringText</strong></p>
<p class="consistent">Text1</p>
<p class="consistent">Text2</p>
<p class="consistent">Text3</p>
<p class="consistent"><strong>VariableText</strong></p>
</div>
"""
soup = BeautifulSoup(data, "html.parser")
for p in soup.find_all(lambda elm: elm and elm.name == "p" and elm.text == "RecurringText" and
                       "consistent" in elm.get("class") and elm.strong):
    for item in p.find_next_siblings("p"):
        if item.strong:
            break
        print(item.text)
Prints:
Text1
Text2
Text3

Filtering an XML file to remove lines with certain text in them?

For example, suppose I have:
<div class="info"><p><b>Orange</b>, <b>One</b>, ...
<div class="info"><p><b>Blue</b>, <b>Two</b>, ...
<div class="info"><p><b>Red</b>, <b>Three</b>, ...
<div class="info"><p><b>Yellow</b>, <b>Four</b>, ...
And I'd like to remove all lines that have words from a list so I'll only use xpath on the lines that fit my criteria. For example, I could use the list as ['Orange', 'Red'] to mark the unwanted lines, so in the above example I'd only want to use lines 2 and 4 for further processing.
How can I do this?
Use:
//div[not(p/b[contains('|Orange|Red|',
                       concat('|', ., '|'))])]
This selects any div element in the XML document that has no p child whose b child's string value is one of the strings in the pipe-separated filter list.
This approach allows extensibility by just adding new filter values to the pipe-separated list, without changing anything else in the XPath expression.
Note: When the structure of the XML document is statically known, always avoid using the // XPath pseudo-operator, because it leads to significant inefficiency (slowdown).
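To try that expression from Python, here is a minimal sketch with lxml.etree (an assumption: the div lines are wrapped in a root element so the snippet parses as XML):

from lxml import etree

content = '''
<root>
  <div class="info"><p><b>Orange</b>, <b>One</b></p></div>
  <div class="info"><p><b>Blue</b>, <b>Two</b></p></div>
  <div class="info"><p><b>Red</b>, <b>Three</b></p></div>
  <div class="info"><p><b>Yellow</b>, <b>Four</b></p></div>
</root>
'''

tree = etree.fromstring(content)

# The pipe-delimited list makes it easy to add or remove filter words
# without touching the rest of the expression.
expr = "//div[not(p/b[contains('|Orange|Red|', concat('|', ., '|'))])]"
for div in tree.xpath(expr):
    print(etree.tostring(div).decode().strip())

This keeps only the Blue and Yellow lines.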
import lxml.html as lh
# http://lxml.de/xpathxslt.html
# http://exslt.org/regexp/functions/match/index.html
content='''\
<table>
<div class="info"><p><b>Orange</b>, <b>One</b></p></div>
<div class="info"><p><b>Blue</b>, <b>Two</b></p></div>
<div class="info"><p><b>Red</b>, <b>Three</b></p></div>
<div class="info"><p><b>Yellow</b>, <b>Four</b></p></div>
</table>
'''
NS = 'http://exslt.org/regular-expressions'
tree = lh.fromstring(content)
exclude=['Orange','Red']
for elt in tree.xpath(
        "//div[not(re:test(p/b[1]/text(), '{0}'))]".format('|'.join(exclude)),
        namespaces={'re': NS}):
    print(lh.tostring(elt).decode())  # decode() so Python 3 prints plain text, not bytes
    print('-'*80)
yields
<div class="info"><p><b>Blue</b>, <b>Two</b></p></div>
--------------------------------------------------------------------------------
<div class="info"><p><b>Yellow</b>, <b>Four</b></p></div>
--------------------------------------------------------------------------------
