Python Regex Help - python

I am trying to sort through HTML tags and I can't seem to get it right.
What I have done so far
import urllib
import re

s = raw_input('Enter URL: ')
f = urllib.urlopen(s)
s = f.read()
f.close()  # close() is a method; without the parentheses it does nothing
# use a raw string so \b is a word boundary, not a backspace character
r = re.compile(r'<TAG\b[^>]*>(.*?)</TAG>')
result = re.findall(r, s)
print(result)
Where I replace "TAG" with tag I want to see.
Thanks in advance.

You should really try using a library that can parse HTML out of the box. Beautiful Soup is one of my favorites.

An example using Beautiful Soup:
from BeautifulSoup import BeautifulSoup
doc = ['<html><head><title>Page title</title></head>',
'<body><p id="firstpara" align="center">This is paragraph <b>one</b>.',
'<p id="secondpara" align="blah">This is paragraph <b>two</b>.',
'</html>']
soup = BeautifulSoup(''.join(doc))
soup.findAll('b')
[<b>one</b>, <b>two</b>]
As for a regular expression, you can use lookarounds to grab just the tag's contents:
aa = doc[0]
aa
'<html><head><title>Page title</title></head>'
pt = re.compile('(?<=<title>).*?(?=</title>)')
re.findall(pt,aa)
['Page title']

I'm not entirely clear on what you are trying to achieve with the regex. Capturing the contents between two div tags, for instance, works with
re.compile(r"<div.*?>.*?</div>", re.DOTALL)
(re.DOTALL lets . match newlines, so the contents can span multiple lines). You will still run into problems with nested divs with the above one, though.
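To see the nested-div problem concretely, here is a minimal sketch (hypothetical HTML, standard library only): the non-greedy .*? stops at the first closing </div>, so a nested div truncates the outer match.

```python
import re

# Hypothetical nested HTML to illustrate the problem
html = '<div class="outer"><div class="inner">a</div>b</div>'
# Non-greedy capture of div contents; re.DOTALL lets . cross newlines
pattern = re.compile(r'<div[^>]*>(.*?)</div>', re.DOTALL)
print(pattern.findall(html))
# The outer div's match stops at the inner </div>, so the trailing
# "b" before the real outer close is never captured.
```

This is exactly the kind of case where an HTML parser wins, since nesting cannot be tracked by a single regular expression.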

Related

How to get the content of a tag with Beautiful Soup?

I'm trying to extract questions from various AMC tests. Consider https://artofproblemsolving.com/wiki/index.php/2002_AMC_10B_Problems/Problem_1 for example. To get the problem text, I just need the regular string text in the first <p> element and the latex in the <img> in the first <p> element.
My code so far:
res = requests.get('https://artofproblemsolving.com/wiki/index.php/2016_AMC_10B_Problems/Problem_1')
soup = bs4.BeautifulSoup(res.text, 'html.parser')
latex_equation = soup.select('p img')[0].get('alt')
It works for getting the latex equation, but there are more parts of the question before it. Is there a way to get the other part of the question, which is "What is the value of"? I'm thinking of using a regex, but I want to see if Beautiful Soup has a feature that can get it for me.
Try using zip():
import requests
from bs4 import BeautifulSoup

URL = "https://artofproblemsolving.com/wiki/index.php/2016_AMC_10B_Problems/Problem_1"
soup = BeautifulSoup(requests.get(URL).content, "html.parser")

for text, tag in zip(soup.select_one(".mw-parser-output p"), soup.select("p img")):
    print(text, tag.get("alt"))
    break
Output:
What is the value of $\frac{2a^{-1}+\frac{a^{-1}}{2}}{a}$
Edit:
soup = BeautifulSoup(requests.get(URL).content, "html.parser")

for text, tag in zip(soup.select(".mw-parser-output p"), soup.select("p img")):
    print(text.text.strip(), tag.get("alt"))
Well, BS4 seems to be a bit buggy here; it took me a while to get this working. I don't think it is viable with these weird spacings and everything, so a regex would be your best option. Let me know if this is good. I checked it on the first 2 questions and they worked fine. The AMC does have some geometry problems that rely on images, however, so I don't think it will work for those.
import bs4
import requests
import re

res = requests.get('https://artofproblemsolving.com/wiki/index.php/2016_AMC_10B_Problems/Problem_1')
soup = bs4.BeautifulSoup(res.content, 'html.parser').find('p')
elements = [i for i in soup.prettify().split("\n") if i][1:-2]
latex_reg = re.compile(r'alt="(.*?)"')
for n, i in enumerate(elements):
    mo = latex_reg.search(i)
    if mo:
        elements[n] = mo.group(1)
    elements[n] = re.sub(' +', ' ', elements[n]).lstrip()
    if elements[n][0] == "$":
        elements[n] = " " + elements[n] + " "
print(elements)
print("".join(elements))

scraping B in <span><span> flow text </span> B </span> using BeautifulSoup

I have the following bs4 element tag :
<span><span>some content</span> B</span>
The length of string B is unknown (I named it B for simplicity).
How can I use BeautifulSoup to extract "B"? Or is my only option to extract the whole text and then use some regex techniques?
Thanks
Edit : Complete code
def get_doc_yakarouler(license_plate, url='https://www.yakarouler.com/car_search/immat?immat='):
    response = requests.get(url + license_plate)
    content = response.content
    doc = BeautifulSoup(content, 'html.parser')
    result = doc.span.text
    if 'identifié' in result:
        return doc
    else:
        return f"La plaque {license_plate} n'est pas recensé sur yakarouler"
doc = get_doc_yakarouler('AA300AA')
span = doc.find_all('span')
motorisation_tag = span[1]
I want to extract "1.6 TDI"
I found a solution using motorisation_tag.text.replace(u'\xa0', ' ').split(' ')[1], but I would like to know if it is possible directly with bs4.
Assuming you have a variable span which represents the outer <span> tag, you can do the following to extract 'B': span.contents[1]. This works because .contents returns a list of the tag's children, in this case [<span>some content</span>, ' B'], and you can then access the 'B' text as the second element of that list. Be aware that if there is a space before B, as in your HTML sample, the space will be included in the string.
from bs4 import BeautifulSoup as bs, NavigableString

html = '<span><span>some content</span> B</span>'
soup = bs(html, 'html.parser')
span = soup.find("span")
# First approach: take only the direct (non-recursive) text of the outer tag
outer_text_1 = span.find(text=True, recursive=False)
# Second approach: loop through the tag's contents and keep only the text
# nodes (NavigableString), skipping nested tags
outer_text_2 = ' '.join([t for t in span.contents if isinstance(t, NavigableString)])
print(outer_text_1)  # output: ' B' (with the leading space)
print(outer_text_2)  # output: ' B' (with the leading space)

How to get all unique HTML tags on a webpage using regular expression?

I have html source code from an HTML page:
import requests
text = requests.get("https://en.wikipedia.org/wiki/Collatz_conjecture").text
What I would like to do is to get a count of the number of unique HTML tags on this page.
For example: <head>, <title>. Closing tags do not count (<head> and </head> would be counted only once).
Yes, I know this is much easier using HTML parsers such as Beautiful Soup but I would like to accomplish this using only Regular Expression.
I've brute force counted this and the answer is in the ballpark of around 60 unique tags.
How would I go about doing this?
I've already tried using re.findall(), to no avail.
Since the answer is around 60, I would like the output to be:
"Number of unique HTML tags: 60"
The following will yield 63 URLs from the URL in question:
import requests
import re
url = "https://en.wikipedia.org/wiki/Collatz_conjecture"
text = requests.get(url).text
url_pattern = r"((http(s)?://)([\w-]+\.)+[\w-]+[.com]+([\w\-\.,#?^=%&:/~\+#]*[\w\-\#?^=%&/~\+#])?)"
# Get all matching patterns of url_pattern
# this will return a list of tuples
# where we are only interested in the first item of the tuple
urls = re.findall(url_pattern, text)
# using list comprehension to get the first item of the tuple,
# and the set function to filter out duplicates
unique_urls = set([x[0] for x in urls])
print(f'Number of unique HTML tags: {len(unique_urls)} found on {url}')
out:
Number of unique HTML tags: 63 found on https://en.wikipedia.org/wiki/Collatz_conjecture
Please do not parse HTML with regex; use modules like bs4. But if you still insist, you can do it as follows:
import requests
import re
url = 'https://en.wikipedia.org/wiki/Collatz_conjecture'
text = requests.get(url).text
tags = re.findall('<[^>]*>', text)
total = []
for i in range(len(tags)):
    total.append(re.match(r'<[^\s\>]+', tags[i]).group())
total = [elem + '>' for elem in total]
r = re.compile('</[^<]')
unwanted = list(filter(r.match, total))
un = ['<!-->', '<!--[if>', '<!DOCTYPE>', '<![endif]-->']
unwanted.extend(un)
final = [x for x in list(set(total)) if x not in set(unwanted)]
print('Number of unique HTML tags :', len(final))
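For comparison, the same counting idea can be sketched more compactly. The input here is a hypothetical snippet standing in for the downloaded page, and the tag-name pattern is an assumption: it skips closing tags, comments, and declarations because none of those have a letter right after the <.

```python
import re

# Hypothetical HTML snippet standing in for the downloaded page
html = '<html><head><title>t</title></head><body><p>x</p><P>y</P></body></html>'
# A tag name starts with a letter immediately after '<'; '</p', '<!--'
# and '<!DOCTYPE' never match because '/' and '!' are not letters.
# Lowercasing folds <P> and <p> together; a set removes duplicates.
unique_tags = {m.group(1).lower() for m in re.finditer(r'<([A-Za-z][\w-]*)', html)}
print('Number of unique HTML tags:', len(unique_tags))
```

On this snippet the set is {'html', 'head', 'title', 'body', 'p'}, so it prints 5; the filtering lists from the answer above become unnecessary.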

How to subtract from soup.find_all() output in Python 3

I want to change the output of my soup.find.all. In the original source we have this:
my soup.find_all:
href = [b.get('href') for b in soup.find_all('a', href=re.compile(r'.*\?id\=\d{4,8}'))]
gives me this:
/book/nfo/?id=4756888
but I want this:
http://127.0.0.1/book/download/?id=4756888
You can use the properties of a Python string to add and replace parts to/from it:
a='/book/nfo/?id=4756888'
b = 'http://127.0.0.1' + a.replace('nfo', 'download')
print(b)
which gives:
'http://127.0.0.1/book/download/?id=4756888'
There's no need to use regex here.
You could compile a regular expression and apply it in a list comprehension as follows:
from bs4 import BeautifulSoup
import re
soup = BeautifulSoup('', 'html.parser')
re_s = re.compile(r'(.*?\/)nfo(\/.*?)').sub
hrefs = [re_s('http://127.0.0.1' + r'\1download\2', a.get('href')) for a in soup.find_all('a', href=re.compile(r'.*\?id\=\d{4,8}'))]
print(hrefs)
Giving you:
['http://127.0.0.1/book/download/?id=4756888']
You can prepend http://127.0.0.1 and replace 'nfo' with 'download' using Python's re.sub() function.
re.sub(r'pattern_to_match',r'replacement_string', string)
You can implement it as follows:
from bs4 import BeautifulSoup
import re
soup = BeautifulSoup("""""", 'html.parser')
c = ['http://127.0.0.1'+b.get('href') for b in soup.find_all('a', href=re.compile(r'.*\?id\=\d{4,8}'))]
print([re.sub(r'nfo',r'download',q) for q in c ])
Output:
['http://127.0.0.1/book/download/?id=4756888']

beautifulsoup html parser for words outside tags

I am trying to use the beautifulsoup html parser to parse some text within and outside tags, but I get strange formatting, especially with words outside any of the tags. The texts are inside a dictionary and look like this:
([('<tag1>London</tag1>in<tag2>UK</tag2>', 'Locations')])
for key, value in s.iteritems():
    soup = BeautifulSoup(key, 'html.parser')
    print soup.text
>LondoninUK
but I noticed that if I introduce a space before and after "in", for example
<tag1>London</tag1> in <tag2>UK</tag2>, Locations
it parses it correctly as London in UK.
How can I rectify this, instead of having to manually add spaces between words outside tags?
Try this:
from BeautifulSoup import BeautifulSoup

for key, value in s.iteritems():
    soup = BeautifulSoup(key)
    print soup.getText(' ')
Adding execution:
>>> s = '<tag1>London</tag1> in <tag2>UK</tag2>'
>>> soup = BeautifulSoup(s)
>>> soup.getText(' ')
u'London in UK'
>>> s = '<tag1>London</tag1>in<tag2>UK</tag2>'
>>> soup = BeautifulSoup(s)
>>> soup.getText(' ')
u'London in UK'
Where does it add spaces?
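On the "where does it add spaces" question: getText(separator) joins every text node in the document with the separator, so when the HTML already contains spaces around a bare text node you get doubled spaces. A small sketch, using the modern bs4 import rather than the old BS3 package from the answer above:

```python
from bs4 import BeautifulSoup

with_spaces = '<tag1>London</tag1> in <tag2>UK</tag2>'
without_spaces = '<tag1>London</tag1>in<tag2>UK</tag2>'

# The text nodes of the first string are 'London', ' in ', 'UK';
# joining them with ' ' doubles the existing spaces around 'in'.
print(repr(BeautifulSoup(with_spaces, 'html.parser').get_text(' ')))
# The second string's nodes are 'London', 'in', 'UK', so the joined
# result has exactly one space between words.
print(repr(BeautifulSoup(without_spaces, 'html.parser').get_text(' ')))
```

So the separator is inserted between every pair of adjacent text nodes; passing strip=True to get_text() is one way to normalize the extra whitespace if it matters.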