Matching a URL in HTML using regex - Python

It's been a while since I've used regex, and I feel like this should be simple to figure out.
I have a web page full of links that look like the string_to_match in the code below. I want to grab just the numbers in the links, like the number 58 in string_to_match. For the life of me I can't figure it out.
import re
string_to_match = '<a href="/teams/roster?teamId=58">Roster</a>'  # href path is illustrative
re.findall('Roster', string_to_match)

Instead of using regular expressions alone, you can combine HTML parsing (with the BeautifulSoup parser) to locate the desired link and extract its href attribute value, with URL parsing, which in this case we'll do with a regular expression:
import re
from bs4 import BeautifulSoup
data = """
<body>
<a href="/teams/roster?teamId=58">Roster</a>
</body>
"""
soup = BeautifulSoup(data, "html.parser")
link = soup.find("a", text="Roster")["href"]
print(re.search(r"teamId=(\d+)", link).group(1))
Prints 58.
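Since the page contains many such links, here is a minimal sketch of the same idea applied to all of them (the markup is hypothetical, and it assumes every Roster link carries a teamId parameter):

import re
from bs4 import BeautifulSoup

data = """
<body>
<a href="/teams/roster?teamId=58">Roster</a>
<a href="/teams/roster?teamId=73">Roster</a>
</body>
"""

soup = BeautifulSoup(data, "html.parser")

# collect the numeric id from every "Roster" link on the page
team_ids = [
    re.search(r"teamId=(\d+)", a["href"]).group(1)
    for a in soup.find_all("a", text="Roster")
]
print(team_ids)  # ['58', '73']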

I would recommend using BeautifulSoup or lxml; it's worth the learning curve.
...But if you still want to use a regexp:
re.findall(r'href="[^"]*teamId=(\d+)', string_to_match)
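And if you go the lxml route mentioned above, a roughly equivalent sketch (again using the hypothetical markup from the first answer) would be:

import re
import lxml.html

string_to_match = '<a href="/teams/roster?teamId=58">Roster</a>'
tree = lxml.html.fromstring(string_to_match)

# grab the href of the link whose text is "Roster", then pull out the id
link = tree.xpath('//a[text()="Roster"]/@href')[0]
print(re.search(r"teamId=(\d+)", link).group(1))  # 58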

Related

Find a string in a string which starts and ends with different string in Python

I have the complete HTML of a page and from it I need to find its GA (Google Analytics) id. For example:
<script>ga('create', 'UA-4444444444-1', 'auto');</script>
From the above string I need to get UA-4444444444-1, which starts with "UA-" and ends with "-1". I have tried this:
re.findall(r"\"trackingId\"\s?:\s?\"(UA-\d+-\d+)\"", raw_html)
but didn't have any success. Please let me know what mistake I am making.
Thanks
It seems that you are overthinking it; you can just look for the UA token directly:
re.findall(r"UA-\d+-\d+", raw_html)
Never use regex to parse HTML. BeautifulSoup is fine for extracting text from tags. Here we extract the script tags from the HTML, then apply the regex to the text located in those script tags.
import re
from bs4 import BeautifulSoup as bs4
html = "<script>ga('create', 'UA-4444444444-1', 'auto');</script>"
soup = bs4(html, 'lxml')
pattern = re.compile("UA-[0-9]+-[0-9]+")
ids = []
for i in soup.findAll("script"):
    # keep the first GA id found in each script block
    ids.append(pattern.findall(i.text)[0])
print(ids)  # ['UA-4444444444-1']

beautiful soup parse url from messy output

I have beautiful soup code that looks like:
for item in beautifulSoupObj.find_all('cite'):
    pagelink.append(item.get_text())
the problem is, the html code I'm trying to parse looks like:
<cite>https://www.<strong>websiteurl.com/id=6</strong></cite>
My current selector above would get everything, including strong tags in it.
Thus, how can I parse only:
https://www.websiteurl.com/id=6
Note that <cite> appears multiple times throughout the page, and I want to extract and print everything.
Thank you.
Extracting only the text portion is as easy as using .text on the object.
We can use basic BeautifulSoup methods to traverse the tree hierarchy.
from bs4 import BeautifulSoup
html = '''<cite>https://www.<strong>websiteurl.com/id=6</strong></cite>'''
soup = BeautifulSoup(html, 'html.parser')
print(soup.cite.text)
# is the same as soup.find('cite').text
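Since <cite> appears multiple times on the page, a minimal sketch that collects the text of every <cite> tag (the second URL is made up for illustration) could look like:

from bs4 import BeautifulSoup

html = '''
<cite>https://www.<strong>websiteurl.com/id=6</strong></cite>
<cite>https://www.<strong>websiteurl.com/id=7</strong></cite>
'''

soup = BeautifulSoup(html, 'html.parser')

# .get_text() joins the tag's text with the text of all its children,
# so the <strong> split disappears from the output
for cite in soup.find_all('cite'):
    print(cite.get_text())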

Finding all links matching specific URL template in an HTML page

So let's say I have the following base URL: http://example.com/Stuff/preview/v/{id}/fl/1/t/. There are a number of URLs with different {id}s on the page being parsed. I want to find all the links matching this template in an HTML page.
I can use XPath to match just a part of the template, //a[contains(@href, "preview/v")], or just use regexes, but I was wondering if anyone knew a more elegant way to match the entire template using XPath and regexes, so it's fast and the matches are definitely correct.
Thanks.
Edit: I timed it on a sample page. With my internet connection and 100 trials, the iteration takes 0.467 seconds on average and BeautifulSoup takes 0.669 seconds.
Also, if you have Scrapy, you can use its Selectors.
from requests import get  # assuming requests here; the original snippet only showed get(url)
from scrapy.selector import Selector

data = get(url).text
sel = Selector(text=data, type="html")
a = sel.xpath(r'//a[re:test(@href, "/Stuff/preview/v/\d+/fl/1/t/")]//@href').extract()
The average time for this is also 0.467 seconds.
You cannot use regexes in XPath expressions with lxml, since lxml supports XPath 1.0 and XPath 1.0 doesn't support regular expression searches.
Instead, you can find all the links on a page using iterlinks(), iterate over them and check the href attribute value:
import re
import lxml.html

# data holds the raw HTML of the page
tree = lxml.html.fromstring(data)
pattern = re.compile(r"http://example.com/Stuff/preview/v/\d+/fl/1/t/")
for element, attribute, link, pos in tree.iterlinks():
    if not pattern.match(link):
        continue
    print link
An alternative option would be to use BeautifulSoup html parser:
import re
from bs4 import BeautifulSoup
data = "your html"
soup = BeautifulSoup(data)
pattern = re.compile("http://example.com/Stuff/preview/v/\d+/fl/1/t/")
print soup.find_all('a', {'href': pattern})
To make BeautifulSoup parsing faster you can let it use lxml:
soup = BeautifulSoup(data, "lxml")
Also, you can make use of the SoupStrainer class, which lets you parse only specific parts of a web page instead of the whole page.
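For instance, a minimal sketch of SoupStrainer limiting the parse to <a> tags (with data standing in for the page HTML, as above) might be:

import re
from bs4 import BeautifulSoup, SoupStrainer

data = "your html"
only_links = SoupStrainer("a")  # parse nothing but <a> tags
pattern = re.compile(r"http://example.com/Stuff/preview/v/\d+/fl/1/t/")

soup = BeautifulSoup(data, "lxml", parse_only=only_links)
print(soup.find_all("a", href=pattern))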
Hope that helps.

Python Regex scraping data from a webpage

My idea was to explore Groupon's website to extract the URLs of the deals. The problem is that I'm trying to do a findall on the Groupon page to find data like this (from this page: http://www.groupon.de/alle-deals/muenchen/restaurant-296):
"category":"RESTAURANT1","dealPermaLink":"/deals/muenchen-special/Casa-Lavecchia/24788330", and I'd like to get the 'deals/muenchen-special/Casa-Lavecchia/24788330'.
I tried the whole night but I was unable to find a correct regex. I tried:
import urllib2
import re
Page_Web = urllib2.urlopen('http://www.groupon.de/alle-deals/muenchen/restaurant-296').read()
for m in re.findall('category*RESATAURANT1*dealPermaLink*:?/*/*/*/*\d$',Page_Web):
    print m
But it doesn't print anything.
In order to extract the block that interests you, I would do it this way:
from bs4 import BeautifulSoup
import urllib2
html = urllib2.urlopen('http://www.groupon.de/alle-deals/muenchen/restaurant-296').read()
soup = BeautifulSoup(html)
scriptResults = soup('script',{'type' : 'text/javascript'})
js_block = scriptResults[12]
Starting from this you can parse with a regex if you want, or try to interpret the JS (there are some threads on Stack Overflow about that).
Anyway, like the others said, you should use the Groupon API...
P.S.
The block that you are parsing can easily be loaded as a dictionary; it is already a list of dictionaries if you look closely...
How about changing RESATAURANT1 to RESTAURANT1, for starters?
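For what it's worth, a minimal sketch of a pattern that captures just the permalink from the snippet shown in the question (assuming the page source still embeds those "dealPermaLink" entries) could be:

import re
import urllib2

html = urllib2.urlopen('http://www.groupon.de/alle-deals/muenchen/restaurant-296').read()

# capture everything between "dealPermaLink":" and the closing quote
for link in re.findall(r'"dealPermaLink":"(/deals/[^"]*)"', html):
    print(link)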

Web Scraper Not Producing Results Using Python

I am a young grasshopper in need of your help. I've done a lot of research and can't seem to find the solution. I've written the code below. When run, it doesn't pull any of the titles. I believe my regular expressions are correct. Not sure what the problem is. Probably obvious to a seasoned sensei. Thanks in advance.
from urllib import urlopen
import re
url = urlopen('http://www.realclearpolitics.com/epolls/2012/senate/ma/massachusetts_senate_brown_vs_warren-2093.html#polls').read()
'''
<a href="http://multimedia.heraldinteractive.com/misc/umlrvnov2012final.pdf">Title</a>
'''
A = '<a href.*pdf">(expression to pull everything)</a>'
B = re.compile(A)
C = re.findall(B,url)
print C
This comes up pretty often here on SO. Rather than using regular expressions, you should be using an HTML parser that allows you to search/traverse the document tree.
I would use BeautifulSoup:
Beautiful Soup parses anything you give it, and does the tree traversal stuff for you. You can tell it "Find all the links", or "Find all the links of class externalLink", or "Find all the links whose urls match "foo.com", or "Find the table heading that's got bold text, then give me that text."
>>> from bs4 import BeautifulSoup
>>> html = ? # insert your raw HTML here
>>> soup = BeautifulSoup(html)
>>> a_tags = soup.find_all("a")
>>> for anchor in a_tags:
...     print anchor.contents
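If you specifically want the titles of the PDF links, a minimal sketch along those lines (the lambda filter on href is just one way to narrow it down) could be:

from urllib import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://www.realclearpolitics.com/epolls/2012/senate/ma/massachusetts_senate_brown_vs_warren-2093.html#polls').read()
soup = BeautifulSoup(html)

# keep only anchors whose href ends in .pdf and print their text
for a in soup.find_all("a", href=lambda h: h and h.endswith(".pdf")):
    print(a.get_text())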
I'll echo the other comment about not using regex for parsing HTML, but sometimes it is quick and easy. It looks like the HTML in your example is not quite correct, but I'd try something like:
re.findall(r'href.*?pdf">(.+?)</a>', url)
