Finding id with specific condition in BeautifulSoup - python

I'm scraping a website with BeautifulSoup in Python
I'd like to find all the <a href> tags whose id starts with "des " (note the trailing space) followed by 3-4 letters.
I just tried:
bsObj.findAll("a", {"id": "des "})
But it does not find what I originally intended.
Do I need to use regex or something?
I would appreciate any help. Thanks.
<div>
<a id="des 6TN" href="/en-kr/shop/men/11-by-boris-bidjan-saberi?lvrid=_gm_d6tn">
11 BY BORIS BIDJAN SABERI
</a>
<br/>
<a id="des R6L" href="/en-kr/shop/men/11-eleven?lvrid=_gm_dr6l">
11 ELEVEN
</a>
<br/>
</div>

If you go the regex route, you can pass a compiled regex pattern to the id parameter like so (an irrelevant, non-matching a tag was added for demonstration purposes):
from bs4 import BeautifulSoup
import re

soup = BeautifulSoup("""<div><a id="des 6TN" href="/en-kr/shop/men/11-by-boris-bidjan-saberi?lvrid=_gm_d6tn">11 BY BORIS BIDJAN SABERI</a><br /><a id="des R6L" href="/en-kr/shop/men/11-eleven?lvrid=_gm_dr6l">11 ELEVEN</a><a id="ds R6L" href="/en-kr/shop/men/11-eleven?lvrid=_gm_dr6l">11 ELEVEN</a><br /></div>""", "html.parser")
soup.find_all('a', id=re.compile(r'^des \w{3,4}$'))
# [<a href="/en-kr/shop/men/11-by-boris-bidjan-saberi?lvrid=_gm_d6tn" id="des 6TN">11 BY BORIS BIDJAN SABERI</a>,
#  <a href="/en-kr/shop/men/11-eleven?lvrid=_gm_dr6l" id="des R6L">11 ELEVEN</a>]

Here's another way (not using regex); I prefer to avoid regular expressions where they aren't necessary.
all_des = soup.findAll('a')  # list of every <a> tag
for i in all_des:  # loop through all of them
    # check whether the <a> has an id and whether that id starts with "des"
    if i.has_attr('id') and i['id'].startswith('des'):
        print(i)
Output:
<a href="/en-kr/shop/men/11-by-boris-bidjan-saberi?lvrid=_gm_d6tn" id="des 6TN">
11 BY BORIS BIDJAN SABERI
</a>
<a href="/en-kr/shop/men/11-eleven?lvrid=_gm_dr6l" id="des R6L">
11 ELEVEN
</a>
Hopefully that answers your question. The method above by the awesome @Psidom may be more convenient for you, but I'm fairly confident that Python's built-in string methods are faster than regular expressions. Breaking down the regex '^des \w{3,4}$':
**^** asserts position at start of the string
**des** (followed by a space) matches the characters "des " literally (case sensitive)
**\w{3,4}** matches any word character (equal to [a-zA-Z0-9_])
**{3,4}** Quantifier — Matches between 3 and 4 times, as many times as possible, giving back as needed (greedy)
**$** asserts position at the end of the string
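As an aside (not from either answer above): modern versions of BeautifulSoup also support CSS attribute selectors through select(), so the prefix match can be written without compiling a regex. A minimal sketch:

```python
from bs4 import BeautifulSoup

html = '''<div>
<a id="des 6TN" href="/en-kr/shop/men/11-by-boris-bidjan-saberi?lvrid=_gm_d6tn">11 BY BORIS BIDJAN SABERI</a>
<a id="des R6L" href="/en-kr/shop/men/11-eleven?lvrid=_gm_dr6l">11 ELEVEN</a>
<a id="ds R6L" href="#">should not match</a>
</div>'''

soup = BeautifulSoup(html, 'html.parser')
# [id^="des "] matches any id whose value starts with "des " (trailing space included)
links = soup.select('a[id^="des "]')
for a in links:
    print(a['id'])
```

The ^= operator means "attribute value starts with", and the space inside the quotes is kept, so the id="ds R6L" tag is not matched.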

Related

Get words between specific words in a Python string

I'm working on getting the words between certain words in a string.
Referring to Find string between two substrings, I succeeded in catching the words in the following way.
s = 'asdf=5;iwantthis123jasd'
result = re.search('asdf=5;(.*)123jasd', s)
print(result.group(1))
But with the string below it failed.
s = ''' <div class="prod-origin-price ">
<span class="discount-rate">
4%
</span>
<span class="origin-price">'''
result = re.search('<span class="discount-rate">(.*)</span>', s)
print(result.group(1))
I'm trying to extract '4%'. Everything else works, but I don't know why only this one fails.
Please help.
Try this (mind the whitespace and newlines):
import re
s = ''' <div class="prod-origin-price ">
<span class="discount-rate">
4%
</span>
<span class="origin-price">'''
result = re.search(r'<span class="discount-rate">\s*(.*)\s*</span>', s)
print(result.group(1))
Or use the re.DOTALL flag so that . also matches newlines:
result = re.search(r'<span class="discount-rate">(.*)</span>', s, re.DOTALL)
Documentation: https://docs.python.org/3/library/re.html
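One caveat worth adding (my note, not part of the original answer): with re.DOTALL the captured group keeps the surrounding newlines, so you typically strip() it; and if the page contains several spans, the non-greedy .*? keeps the match from running to the last </span>:

```python
import re

s = '''<div class="prod-origin-price ">
<span class="discount-rate">
4%
</span>
<span class="origin-price">'''

# non-greedy .*? stops at the first </span>; DOTALL lets . cross newlines
m = re.search(r'<span class="discount-rate">(.*?)</span>', s, re.DOTALL)
print(repr(m.group(1)))    # the raw capture still contains the newlines
print(m.group(1).strip())  # -> 4%
```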
This is structured data, not just a string, so we can use a library like Beautiful Soup to help us simplify such tasks:
from bs4 import BeautifulSoup
s = ''' <div class="prod-origin-price ">
<span class="discount-rate">
4%
</span>
<span class="origin-price">'''
soup = BeautifulSoup(s, 'html.parser')
value = soup.find(class_='discount-rate').get_text(strip=True)
print(value)
# Output: 4%

Dynamically matching a string that starts with a substring

I need to dynamically match a string that starts with forsale_. Here, I'm finding it by hardcoding the characters that follow, but I'd like to do this dynamically:
for_sale = response.html.find('span.forsale_QoVFl > a', first=True)
I tried using startswith(), but I'm not sure how to implement it.
Sample response.html:
<section id="release-marketplace" class="section_9nUx6 open_BZ6Zt">
<header class="header_W2hzl">
<div class="header_3eShg">
<h3>Marketplace</h3>
<span class="forsale_QoVFl">2 For Sale from <span class="price_2Wkos">$355.92</span></span>
</div>
</header>
<div class="content_1TFzi">
<div class="buttons_1G_mP">Buy CDSell CD</div>
</div>
</section>
startswith() is straightforward. x = txt.startswith("forsale_") will return a bool, where txt is the string you want to test.
For more involved pattern matching, you want to look at regular expressions. Something like this is the equivalent of the startswith() line above:
import re
txt = "forsale_arbitrarychars"
x = re.search("^forsale_", txt)
where if you replaced ^forsale_ with something like ^forsale_[0-9]*$, it would only accept digits after the underscore.
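A quick illustration of both checks (my example strings, mirroring the question's class name):

```python
import re

txt = "forsale_QoVFl"
plain = txt.startswith("forsale_")                            # plain string check
via_regex = bool(re.search("^forsale_", txt))                 # anchored regex equivalent
# the digits-only variant rejects letter suffixes:
digits_ok = bool(re.search("^forsale_[0-9]*$", "forsale_123"))
digits_bad = bool(re.search("^forsale_[0-9]*$", "forsale_QoVFl"))
print(plain, via_regex, digits_ok, digits_bad)  # True True True False
```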
I assume your final expected output is the link in the target <span>. If so, I would do it using lxml and xpath:
import lxml.html as lh
sale = """[your html above]"""
doc = lh.fromstring(sale)
print(doc.xpath('//span[@class[starts-with(.,"forsale_")]]/a/@href')[0])
Output:
/sell/release/XXX

regex match for the exact match not all the match in python

hi I have a string as http://www.yifysubtitles.com/subtitles/blockers2018720pwebripx264-ytsam-arabic-128849"><span class="text-muted">subtitle</span> Blockers.2018.720p.WEBRip.x264-[YTS.AM]</a></td><td class="other-cell"></td><td class="uploader-cell">SHINAWY</td><td class="download-cell"><td class="rating-cell"><span class="label">0</span></td><td class="flag-cell"><span class="flag flag-cn"></span><span class="sub-lang">Chinese</span></td><td><a href="/subtitles/blockers2018720pblurayx264-ytsmecht-chinese-128835"><span class="text-muted">subtitle</span> Blockers.2018.720p.BluRay.x264-[YTS.ME].cht </td><td class="other-cell"></td><td class="uploader-cell">osamawang</td><td class="download-cell"><td class="rating-cell"><span class="label label-success">6</span></td><td class="flag-cell"><span class="flag flag-gb"></span><span class="sub-lang">English</span></td><td><a href="/subtitles/blockers2018web-dlx264-fgt-english-128543"><span class="text-muted">subtitle</span> Blockers.2018.WEB-DL.x264-FGT</td><td class="other-cell"></td><td class="uploader-cell">sub</td><td class="download-cell"><td class="rating-cell"><span class="label">0</span></td><td class="flag-cell"><span class="flag flag-rs"></span><span class="sub-lang">Serbian</span></td><td><a href="/subtitles/blockers2018720pblurayx264ytsag-serbian-128633"><span class="text-muted">subtitle</span> Blockers.2018.720p.BluRay.x264.[YTS.AG]</td><td class="other-cell"></td><td class="uploader-cell">TesneGace</td><td class="download-cell"><td class="rating-cell"><span class="label label-success">2</span></td><td class="flag-cell"><span class="flag flag-es"></span><span class="sub-lang">Spanish</span></td><td><a href="/subtitles/blockers2018720pblurayx264ytsag-spanish-128702"><span class="text-muted">subtitle</span> Blockers.2018.720p.BluRay.x264.[YTS.AG]</td><td class="other-cell"></td><td class="uploader-cell"><a href="/subtitles/blockers-english-yify-128543
and I am trying to match the first occurrence of english-yify: /subtitles/blockers-english-yify-128543
My pattern is re.search(r'/subtitles/.+\-english\-yify-\d+', text)
but my code returns the entire string. Please help.
Your string is in fact HTML, so you should use an HTML parser instead; I suggest the excellent lxml.html parser.
To answer your question: regexes are greedy by default, which means the .+ part grabs as many characters as it can while still letting the rest of the pattern match. So you get the first /subtitles/, the last -english-yify-<digits>, and everything in between.
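The diagnosis above doesn't show a fix; one common one (my sketch, using a shortened stand-in for the question's long string) is to replace .+ with a character class that cannot cross quote or angle-bracket characters, so the match can never span across tags:

```python
import re

# shortened stand-in for the question's long HTML string
text = ('href="/subtitles/blockers2018720pwebripx264-ytsam-arabic-128849">'
        '<span>subtitle</span>'
        '<a href="/subtitles/blockers-english-yify-128543')

# [\w-]+ cannot match '"', '<' or '>', so only a single URL can be matched
m = re.search(r'/subtitles/[\w-]+-english-yify-\d+', text)
print(m.group(0))  # /subtitles/blockers-english-yify-128543
```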

BeautifulSoup search attributes-value

I'm trying to search in HTML documents for specific attribute values.
e.g.
<html>
<h2 itemprop="prio1"> TEXT PRIO 1 </h2>
<span id="prio2"> TEXT PRIO 2 </span>
</html>
I want to find all items with attribute values beginning with "prio".
I know that I can do something like:
soup.find_all(itemprop=re.compile('prio.*'))
Or
soup.find_all(id=re.compile('prio.*'))
But what I am looking for is something like:
soup.find_all(*=re.compile('prio.*'))
First off, your regex is wrong: if you want to find only strings starting with prio, you should anchor it with ^; as written, it matches prio anywhere in the string. And if you're going to test each attribute anyway, you can simply use str.startswith:
from bs4 import BeautifulSoup

h = """<html>
<h2 itemprop="prio1"> TEXT PRIO 1 </h2>
<span id="prio2"> TEXT PRIO 2 </span>
</html>"""
soup = BeautifulSoup(h, "lxml")
tags = soup.find_all(lambda t: any(a.startswith("prio") for a in t.attrs.values()))
If you just want to check for certain attributes:
tags = soup.find_all(lambda t: t.get("id","").startswith("prio") or t.get("itemprop","").startswith("prio"))
But if you wanted a more efficient solution you might want to look at lxml which allows you to use wildcards:
from lxml import html
xml = html.fromstring(h)
tags = xml.xpath("//*[starts-with(@*,'prio')]")
print(tags)
Or just id and itemprop:
tags = xml.xpath("//*[starts-with(@id,'prio') or starts-with(@itemprop, 'prio')]")
I don't know if this is the best way, but this works:
>>> soup.find_all(lambda element: any(re.search('prio.*', attr) for attr in element.attrs.values()))
[<h2 itemprop="prio1"> TEXT PRIO 1 </h2>, <span id="prio2"> TEXT PRIO 2 </span>]
Here the lambda element: gives us each element, and we run re.search('prio.*', attr) over every value in element.attrs.values(). any() then tells us whether the element has at least one attribute whose value matches 'prio'.
You can also use str.startswith here instead of a regex, since you're just checking whether an attribute value starts with 'prio':
soup.find_all(lambda element: any(attr.startswith('prio') for attr in element.attrs.values()))

Filtering xml file to remove lines with certain text in them?

For example, suppose I have:
<div class="info"><p><b>Orange</b>, <b>One</b>, ...
<div class="info"><p><b>Blue</b>, <b>Two</b>, ...
<div class="info"><p><b>Red</b>, <b>Three</b>, ...
<div class="info"><p><b>Yellow</b>, <b>Four</b>, ...
And I'd like to remove all lines that have words from a list so I'll only use xpath on the lines that fit my criteria. For example, I could use the list as ['Orange', 'Red'] to mark the unwanted lines, so in the above example I'd only want to use lines 2 and 4 for further processing.
How can I do this?
Use:
//div[not(p/b[contains('|Orange|Red|',
                       concat('|', ., '|'))])]
This selects any div element in the XML document that has no p child whose b child's string value is one of the strings in the pipe-separated list used as filters.
This approach allows extensibility by just adding new filter values to the pipe-separated list, without changing anything else in the XPath expression.
Note: When the structure of the XML document is statically known, always avoid using the // XPath pseudo-operator, because it leads to significant inefficiency (slowdown).
import lxml.html as lh
# http://lxml.de/xpathxslt.html
# http://exslt.org/regexp/functions/match/index.html
content='''\
<table>
<div class="info"><p><b>Orange</b>, <b>One</b></p></div>
<div class="info"><p><b>Blue</b>, <b>Two</b></p></div>
<div class="info"><p><b>Red</b>, <b>Three</b></p></div>
<div class="info"><p><b>Yellow</b>, <b>Four</b></p></div>
</table>
'''
NS = 'http://exslt.org/regular-expressions'
tree = lh.fromstring(content)
exclude=['Orange','Red']
for elt in tree.xpath(
        "//div[not(re:test(p/b[1]/text(), '{0}'))]".format('|'.join(exclude)),
        namespaces={'re': NS}):
    print(lh.tostring(elt))
    print('-'*80)
yields
<div class="info"><p><b>Blue</b>, <b>Two</b></p></div>
--------------------------------------------------------------------------------
<div class="info"><p><b>Yellow</b>, <b>Four</b></p></div>
--------------------------------------------------------------------------------
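For completeness (my addition, not one of the original answers), the same filter can be sketched with BeautifulSoup instead of XPath, assuming the word to test is always the first <b> inside each div:

```python
from bs4 import BeautifulSoup

content = '''<table>
<div class="info"><p><b>Orange</b>, <b>One</b></p></div>
<div class="info"><p><b>Blue</b>, <b>Two</b></p></div>
<div class="info"><p><b>Red</b>, <b>Three</b></p></div>
<div class="info"><p><b>Yellow</b>, <b>Four</b></p></div>
</table>'''

exclude = {'Orange', 'Red'}
soup = BeautifulSoup(content, 'html.parser')
# div.b is the first <b> descendant; keep divs whose word is not excluded
kept = [div for div in soup.find_all('div', class_='info')
        if div.b and div.b.get_text() not in exclude]
for div in kept:
    print(div)
```

Extending the filter is just a matter of adding words to the exclude set.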
