I am trying to create a regex to extract the telephone, streetAddress, and Pages values (9440717256, H.No. 3-11-62, RTC Colony, ...) from an HTML page in Python. All three fields are optional. I tried this regex, but the output is inconsistent:
telephone\S+>(.+)</em>.*(?:streetAddress\S+(.+)</span>)?.*(?:pages\S+>(.+)</a></span>)?
sample string
<em phone="**telephone**">9440717256</em></div></div></li><li class="row"><i class="icon-sm icon-address"></i><div class="profile-details"><strong>Address</strong><div class="profile-child"><address itemprop="address" itemscope="" itemtype="http://schema.org/PostalAddress" class="data-item"><span itemprop="**streetAddress**">H.No. 3-11-62, RTC Colony</span>, <span>Vastu Colony, </span><span class="text-black" itemprop="addressLocality">Lal Bahadur Nagar</span>
Can anyone help me build the regex, please?
Considering that your input is not valid HTML and that it may be subject to change, you can use an HTML parser like BeautifulSoup. But if your input changes, these simple selectors will have to be adapted.
from bs4 import BeautifulSoup
h = """<em phone="**telephone**">9440717256</em></div></div></li><li class="row"><i class="icon-sm icon-address"></i><div class="profile-details"><strong>Address</strong><div class="profi`enter code here`le-child"><address itemprop="address" itemscope="" itemtype="http://schema.org/PostalAddress" class="data-item"><span itemprop="**streetAddress**">H.No. 3-11-62, RTC Colony</span>, <span>Vastu Colony, </span><span class="text-black" itemprop="addressLocality">Lal Bahadur Nagar</span>"""
soup = BeautifulSoup(h, "html.parser")
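For reference, here is a minimal sketch of such simple selectors, assuming the attribute values stay exactly as in the sample (literal asterisks included):

# Direct lookups tied to the sample's structure; they break if it changes.
print(soup.find("em", phone="**telephone**").string)           # 9440717256
print(soup.find("span", itemprop="**streetAddress**").string)  # H.No. 3-11-62, RTC Colony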
Edit: Since you now tell us that you want the text of the elements that have the specified attribute values, you can use a function as a filter.
# Each filter receives a tag and returns True when the tag should match.
def find_phone(tag):
    return tag.has_attr("phone") and tag.get("phone") == "**telephone**"

def find_streetAddress(tag):
    return tag.has_attr("itemprop") and tag.get("itemprop") == "**streetAddress**"

def find_pages(tag):
    return tag.has_attr("title") and tag.get("title") == "**Pages**"

print(soup.find(find_phone).string)
print(soup.find(find_streetAddress).string)
print(soup.find(find_pages).string)
Output:
9440717256
H.No. 3-11-62, RTC Colony
Lal Bahadur Nagar
Regex is safe to use when you know the HTML provider and what the markup looks like.
Then, just use alternations and named capture groups:
telephone[^>]*>(?P<Telephone>[^<]+)|streetAddress[^>]*>(?P<Address>[^<]+)|Pages[^>]*>(?P<Pages>[^<]+)
See demo
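A quick sketch of that pattern in use (sample is assumed to hold the HTML snippet from the question):

import re

pattern = re.compile(
    r'telephone[^>]*>(?P<Telephone>[^<]+)'
    r'|streetAddress[^>]*>(?P<Address>[^<]+)'
    r'|Pages[^>]*>(?P<Pages>[^<]+)',
    re.IGNORECASE)

for m in pattern.finditer(sample):
    # lastgroup names the alternative that actually matched
    print(m.lastgroup, "=", m.group(m.lastgroup))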
In case a raw > can appear inside attribute values (i.e. it is not serialized as &gt;), you can use this regex (a more universal one; edit: now in verbose form):
telephone[^<]*> # Looking for telephone
(?P<Telephone>[^<]+) # Capture telephone (all text up to the next tag)
|
streetAddress[^<]*> # Looking for streetAddress
(?P<Address>[^<]+) # Capture address (all text up to the next tag)
|
Pages[^<]*> # Looking for Pages
(?P<Pages>[^<]+) # Capture Pages (all text up to the next tag)
Sample demo on IDEONE
Here is the regex part of the code:
import re

p = re.compile(r'''telephone[^<]*>  # Looking for telephone
    (?P<Telephone>[^<]+)            # Capture telephone (all text up to the next tag)
    |
    streetAddress[^<]*>             # Looking for streetAddress
    (?P<Address>[^<]+)              # Capture address (all text up to the next tag)
    |
    Pages[^<]*>                     # Looking for Pages
    (?P<Pages>[^<]+)                # Capture Pages (all text up to the next tag)''',
    re.IGNORECASE | re.VERBOSE)

test_str = "YOUR STRING"

print([x.group("Telephone") for x in p.finditer(test_str) if x.group("Telephone")])
print([x.group("Address") for x in p.finditer(test_str) if x.group("Address")])
print([x.group("Pages") for x in p.finditer(test_str) if x.group("Pages")])
Output (the doubled results come from my duplicating the input string with a different node order):
['9440717256', '9440717256']
['H.No. 3-11-62, RTC Colony', 'H.No. 3-11-62, RTC Colony']
['Lal Bahadur Nagar', 'Lal Bahadur Nagar']
I found this link [and a few others] which talks a bit about using BeautifulSoup to read HTML. It mostly does what I want: it grabs the title of a web page.
import requests
from bs4 import BeautifulSoup

def get_title(url):
    html = requests.get(url).text
    if len(html) > 0:
        contents = BeautifulSoup(html)
        title = contents.title.string
        return title
    return None
The issue that I run into is that sometimes articles will come back with metadata attached at the end with " - some_data". A good example is this link to a BBC Sport article which reports the title as
Jack Charlton: 1966 England World Cup winner dies aged 85 - BBC Sport
I could do something simple like cut off anything after the last '-' character
title = title.rsplit(' - ', 1)[0]
But that assumes that any metadata exists after a " - " value. I don't want to assume that there will never be an article whose title ends in " - part_of_title".
I found the Newspaper3k library but it's definitely more than I need - all I need is to grab a title and ensure that it's the same as what the user posted. My friend who pointed me to Newspaper3k also mentioned it could be buggy and didn't always find titles correctly, so I would be inclined to use something else if possible.
My current thought is to continue using BeautifulSoup and just add on fuzzywuzzy which would honestly also help with slight misspellings or punctuation differences. But, I would certainly prefer to start from a place that included comparing against accurate titles.
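For the comparison step, a minimal fuzzywuzzy sketch (the two titles below are placeholders for the user-posted and scraped values):

from fuzzywuzzy import fuzz

posted = "Jack Charlton: 1966 England World Cup winner dies aged 85"
scraped = "Jack Charlton: 1966 England World Cup winner dies aged 85 - BBC Sport"

# partial_ratio tolerates the trailing site name; plain ratio penalizes it
print(fuzz.ratio(posted, scraped))
print(fuzz.partial_ratio(posted, scraped))  # 100 when one string contains the other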
Here is how reddit handles getting title data.
https://github.com/reddit-archive/reddit/blob/40625dcc070155588d33754ef5b15712c254864b/r2/r2/lib/utils/utils.py#L255
def extract_title(data):
    """Try to extract the page title from a string of HTML.

    An og:title meta tag is preferred, but will fall back to using
    the <title> tag instead if one is not found. If using <title>,
    also attempts to trim off the site's name from the end.
    """
    bs = BeautifulSoup(data, convertEntities=BeautifulSoup.HTML_ENTITIES)
    if not bs or not bs.html.head:
        return
    head_soup = bs.html.head

    title = None

    # try to find an og:title meta tag to use
    og_title = (head_soup.find("meta", attrs={"property": "og:title"}) or
                head_soup.find("meta", attrs={"name": "og:title"}))
    if og_title:
        title = og_title.get("content")

    # if that failed, look for a <title> tag to use instead
    if not title and head_soup.title and head_soup.title.string:
        title = head_soup.title.string

        # remove end part that's likely to be the site's name
        # looks for last delimiter char between spaces in strings
        # delimiters: |, -, emdash, endash,
        # left- and right-pointing double angle quotation marks
        reverse_title = title[::-1]
        to_trim = re.search(u'\s[\u00ab\u00bb\u2013\u2014|-]\s',
                            reverse_title,
                            flags=re.UNICODE)

        # only trim if it won't take off over half the title
        if to_trim and to_trim.end() < len(title) / 2:
            title = title[:-(to_trim.end())]

    if not title:
        return

    # get rid of extraneous whitespace in the title
    title = re.sub(r'\s+', ' ', title, flags=re.UNICODE)

    return title.encode('utf-8').strip()
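For the BBC example above, the site-name trimming step behaves like this on its own (a small sketch lifted from the function):

import re

title = "Jack Charlton: 1966 England World Cup winner dies aged 85 - BBC Sport"
reverse_title = title[::-1]
to_trim = re.search(u'\s[\u00ab\u00bb\u2013\u2014|-]\s', reverse_title, flags=re.UNICODE)
# only trim if it won't take off over half the title
if to_trim and to_trim.end() < len(title) / 2:
    title = title[:-(to_trim.end())]
print(title)  # Jack Charlton: 1966 England World Cup winner dies aged 85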
I have an XML document from which I wish to extract the text contained in specific tags, such as:
<title>Four-minute warning</title>
<categories>
<category>Nuclear warfare</category>
<category>Cold War</category>
<category>Cold War military history of the United Kingdom</category>
<category>disaster preparedness in the United Kingdom</category>
<category>History of the United Kingdom</category>
</categories>
<bdy>
some text
</bdy>
In this toy example, I want to extract all the text contained in the <title> tags using the following regular-expression code in Python 3:
# Python 3 code using re
import re

file = open("some_xml_file.xml", "r")
xml_doc = file.read()
file.close()

title_text = re.findall(r'<title>.+</title>', xml_doc)
if title_text:
    print("\nMatches found!\n")
    for title in title_text:
        print(title)
else:
    print("\nNo matches found!\n\n")
It gives me the text within the XML tags ALONG with the tags. An example of a single output would be:
<title>Four-minute warning</title>
My question is, how should I frame the pattern within the re.findall() or re.search() methods so that the <title> and </title> tags are skipped and all I get is the text between them?
Thanks for your help!
Just use a capture group in your regex (re.findall() takes care of the rest in this case). For example:
import re
s = '<title>Four-minute warning</title>'
title_text = re.findall(r'<title>(.+)</title>', s)
print(title_text[0])
# OUTPUT
# Four-minute warning
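The same idea extends to the <category> tags from the question's sample; with a single capture group, re.findall() returns just the captured text for every match (xml_doc is assumed to hold the document shown above):

categories = re.findall(r'<category>(.+)</category>', xml_doc)
# ['Nuclear warfare', 'Cold War', 'Cold War military history of the United Kingdom', ...]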
I'm scraping a website with BeautifulSoup in Python.
I'd like to find all the a tags whose id starts with "des " (with a trailing space) followed by 3-4 letters.
I just tried:
bsObj.findAll("a",{"id":"des "})
But it does not find what I intended.
Do I need to use regex or something?
I would appreciate your help. Thanks.
<div>
<a id="des 6TN" href="/en-kr/shop/men/11-by-boris-bidjan-saberi?lvrid=_gm_d6tn">
11 BY BORIS BIDJAN SABERI
</a>
<br/>
<a id="des R6L" href="/en-kr/shop/men/11-eleven?lvrid=_gm_dr6l">
11 ELEVEN
</a>
<br/>
</div>
If you go the regex route, you can pass a compiled regex pattern to the id parameter like so (an irrelevant, non-matching a tag was added for demonstration purposes):
from bs4 import BeautifulSoup
import re
soup = BeautifulSoup("""<div><a id="des 6TN" href="/en-kr/shop/men/11-by-boris-bidjan-saberi?
lvrid=_gm_d6tn">11 BY BORIS BIDJAN SABERI</a><br /><a id="des R6L" href="/en-
kr/shop/men/11-eleven?lvrid=_gm_dr6l">11 ELEVEN</a><a id="ds R6L" href="/en-
kr/shop/men/11-eleven?lvrid=_gm_dr6l">11 ELEVEN</a><br />""")
soup.find_all('a', id=re.compile('^des \w{3,4}$'))
#[<a href="/en-kr/shop/men/11-by-boris-bidjan-saberi?
# lvrid=_gm_d6tn" id="des 6TN">11 BY BORIS BIDJAN SABERI</a>, <a href="/en-
# kr/shop/men/11-eleven?lvrid=_gm_dr6l" id="des R6L">11 ELEVEN</a>]
Here's another way (not using regex); I don't like regular expressions where they aren't necessary.
all_des = soup.findAll('a')  # list of every <a> tag
for i in all_des:  # loop through all of them
    if i.has_attr('id') and i['id'].startswith('des'):
        # check if there is an id within the <a> and if the id starts with des
        print(i)
Output:
<a href="/en-kr/shop/men/11-by-boris-bidjan-saberi?lvrid=_gm_d6tn" id="des 6TN">
11 BY BORIS BIDJAN SABERI
</a>
<a href="/en-kr/shop/men/11-eleven?lvrid=_gm_dr6l" id="des R6L">
11 ELEVEN
</a>
Hopefully that answers your question. The method above by the awesome @Psidom may be more convenient for you, but I'm pretty confident that Python's built-in methods are faster than regular expressions. As for the regex '^des \w{3,4}$':
**^** asserts position at the start of the string
**des ** matches the characters "des " literally (case sensitive)
**\w** matches any word character (equal to [a-zA-Z0-9_])
**{3,4}** quantifier: matches the preceding token between 3 and 4 times, as many times as possible, giving back as needed (greedy)
**$** asserts position at the end of the string
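A quick, illustrative check of that pattern against a few ids:

import re

pattern = re.compile(r'^des \w{3,4}$')
print(bool(pattern.match("des 6TN")))   # True
print(bool(pattern.match("ds R6L")))    # False: no "des " prefix
print(bool(pattern.match("design X")))  # False: "des" is not followed by a space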
I'm trying to search in HTML documents for specific attribute values.
e.g.
<html>
<h2 itemprop="prio1"> TEXT PRIO 1 </h2>
<span id="prio2"> TEXT PRIO 2 </span>
</html>
I want to find all items with attribute values beginning with "prio".
I know that I can do something like:
soup.find_all(itemprop=re.compile('prio.*'))
Or
soup.find_all(id=re.compile('prio.*'))
But what I am looking for is something like:
soup.find_all(*=re.compile('prio.*'))
First off, your regex is wrong: if you want to find only strings starting with prio, you should prefix the pattern with ^; as it is, your regex matches prio anywhere in the string. And if you are going to check each attribute anyway, you can just use str.startswith:
h = """<html>
<h2 itemprop="prio1"> TEXT PRIO 1 </h2>
<span id="prio2"> TEXT PRIO 2 </span>
</html>"""
soup = BeautifulSoup(h, "lxml")
tags = soup.find_all(lambda t: any(a.startswith("prio") for a in t.attrs.values()))
If you just want to check for certain attributes:
tags = soup.find_all(lambda t: t.get("id","").startswith("prio") or t.get("itemprop","").startswith("prio"))
But if you wanted a more efficient solution you might want to look at lxml which allows you to use wildcards:
from lxml import html
xml = html.fromstring(h)
tags = xml.xpath("//*[starts-with(@*,'prio')]")
print(tags)
Or just id and itemprop:
tags = xml.xpath("//*[starts-with(@id,'prio') or starts-with(@itemprop, 'prio')]")
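A small usage sketch for the lxml route (h is the HTML string from above; text_content() gathers the text of each matched element):

for tag in tags:
    print(tag.tag, tag.text_content().strip())
# h2 TEXT PRIO 1
# span TEXT PRIO 2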
I don't know if this is the best way, but this works:
>>> soup.find_all(lambda element: any(re.search('prio.*', attr) for attr in element.attrs.values()))
[<h2 itemprop="prio1"> TEXT PRIO 1 </h2>, <span id="prio2"> TEXT PRIO 2 </span>]
In this case, we access each element through the lambda element: parameter and run re.search with 'prio.*' against every entry in the element.attrs.values() list.
Then we use any() on those results to see whether the element has an attribute whose value matches 'prio'.
You can also use str.startswith here instead of regex, since you're just checking whether an attribute value starts with 'prio', like below:
soup.find_all(lambda element: any(attr.startswith('prio') for attr in element.attrs.values()))
I have an HTML page to scrape data from.
I need to get the item title, like 'Caliper Ball' here.
I'm getting the data from the tag where that title appears:
item_title = base_page.find_all('h1', class_='itemTitle')
It contains this tag structure:
[<h1 class="itemTitle"> <div class="l1">Caliper</div>
Ball
</h1>]
To extract 'Caliper Ball', I'm using:
collector = []
for _ in item_title:
    collector.append(_.text)
and I'm getting this ugly output in the collector list:
[u"\nCaliper\r\n Ball\r\n "]
How can I make the output clean, like "Caliper Ball"?
Don't use regex. You're adding too much overhead for something simple. BeautifulSoup4 already has something for this called stripped_strings. See my code below.
from bs4 import BeautifulSoup as bsoup

html = """[<h1 class="itemTitle"> <div class="l1">Caliper</div>
Ball
</h1>]"""

soup = bsoup(html)
soup.prettify()

item = soup.find("h1", class_="itemTitle")
base = list(item.stripped_strings)
print(" ".join(base))
Result:
Caliper Ball
Explanation: stripped_strings basically gets all the text inside a specified tag, strips them of all the spaces, line breaks, what have you. It returns a generator, which we can catch with list so it returns a list instead. Once it's a list, it's just a matter of using " ".join.
Let us know if this helps.
PS: Just to correct myself -- there's actually no need to use list on the result of stripped_strings, but it's better to show the above as such so it's explicit.
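On current BeautifulSoup 4 versions, get_text() with a separator gets the same result in one call (a short sketch reusing the item tag from above):

print(item.get_text(" ", strip=True))  # Caliper Ball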
This regex will help you get the output (Caliper Ball):
import re

s = """[<h1 class="itemTitle"> <div class="l1">Caliper</div>
Ball
</h1>]"""

regex = r'.*>([^<]*)<\/div>\s*\n\s*(\w*).*'
match = re.findall(regex, s)
new_data = (' '.join(w) for w in match)
print(''.join(new_data))  # => Caliper Ball
You can use the replace() method to replace \n and \r with nothing or a space, and after that use the strip() method (Python's counterpart to trim()) to remove the leftover surrounding spaces.
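A minimal sketch of that approach applied to the raw string from the question (the exact whitespace in raw is illustrative):

raw = u"\nCaliper\r\n        Ball\r\n    "
text = raw.replace("\r", " ").replace("\n", " ")
# split()/join() also collapses the runs of inner spaces left behind
print(" ".join(text.split()))  # Caliper Ball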