I would like to find all URLs in a string. I found various solutions on StackOverflow that vary depending on the content of the string.
For example, supposing my string contained HTML, this answer recommends using either BeautifulSoup or lxml.
On the other hand, if my string contained only a plain URL without HTML tags, this answer recommends using a regular expression.
I wasn't able to find a good solution given that my string contains both an HTML-encoded URL and a plain URL. Here is some example code:
import lxml.html
example_data = """<a href="http://www.some-random-domain.com/abc123/def.html">Click Me!</a>
http://www.another-random-domain.com/xyz.html"""
dom = lxml.html.fromstring(example_data)
for link in dom.xpath('//a/@href'):
    print "Found Link: ", link
As expected, this results in:
Found Link: http://www.some-random-domain.com/abc123/def.html
I also tried the twitter-text-python library that @Yannisp mentioned, but it doesn't seem to extract both URLs:
>>> from ttp.ttp import Parser
>>> p = Parser()
>>> r = p.parse(example_data)
>>> r.urls
['http://www.another-random-domain.com/xyz.html']
What is the best approach for extracting both kinds of URLs from a string containing a mix of HTML and non-HTML-encoded data? Is there a good module that already does this, or am I forced to combine regex with BeautifulSoup/lxml?
I upvoted because it triggered my curiosity. There seems to be a library called twitter-text-python that parses Twitter posts to detect both URLs and hrefs. Otherwise, I would go with the combination of regex + lxml.
You could use RE to find all URLs:
import re
urls = re.findall(r"(https?://[\w/$\-_.+!*'()]+)", example_data)
It includes alphanumerics, '/', and the other "characters allowed in a URL".
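For example, applied to the example_data from the question (my assumption about how it would be wired up), it picks up both the href'd URL and the bare one, since the closing double quote is not in the character class:
import re

example_data = """<a href="http://www.some-random-domain.com/abc123/def.html">Click Me!</a>
http://www.another-random-domain.com/xyz.html"""

urls = re.findall(r"(https?://[\w/$\-_.+!*'()]+)", example_data)
print urls
# ['http://www.some-random-domain.com/abc123/def.html',
#  'http://www.another-random-domain.com/xyz.html']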
Based on the answer by @YannisP, I was able to come up with this solution:
import lxml.html
from ttp.ttp import Parser
def extract_urls(data):
    urls = set()
    # First extract HTML-encoded URLs
    dom = lxml.html.fromstring(data)
    for link in dom.xpath('//a/@href'):
        urls.add(link)
    # Next, extract URLs from plain text
    parser = Parser()
    results = parser.parse(data)
    for url in results.urls:
        urls.add(url)
    return list(urls)
This results in:
>>> example_data
'<a href="http://www.some-random-domain.com/abc123/def.html">Click Me!</a>\nhttp://www.another-random-domain.com/xyz.html'
>>> urls = extract_urls(example_data)
>>> print urls
['http://www.another-random-domain.com/xyz.html', 'http://www.some-random-domain.com/abc123/def.html']
I'm not sure how well this will work on other URLs, but it seems to work for what I need it to do.
Related
I am writing a little script to get my F@H user data from a basic HTML page.
I want to locate my username on that page and the numbers before and after it.
All the data I want is between two HTML <tr> and </tr> tags.
I am currently using this:
re.search(r'<tr>(.*?)</tr>', htmlstring)
I know this works for any substring, as all the Google results for my question show. The difference here is that I need it only when that substring also contains a specific word.
However that only returns the first string between those two delimiters, not even all of them.
This pattern occurs hundreds of times on the page. I suspect it doesn't get them all because I'm not handling all the newline characters correctly but I'm not sure.
If it would return all of them, I could at least then sort them out to find one that contains my username going through each result.group(), but I can't even do that.
I have been fiddling with different regex expressions for ages now but can't figure out which one I need, much to my frustration.
TL;DR -
I need a re.search() pattern that finds a substring between two words, that also contains a specific word.
If I understand correctly, something like this might work:
<tr>(?:(?:(?:(?!<\/tr>).)*?)\bWORD\b(?:.*?))<\/tr>
<tr> find "<tr>"
(?:(?:(?!<\/tr>).)*?) Find anything except "</tr>" as few times as possible
\bWORD\b find WORD
(?:.*?) find anything as few times as possible
<\/tr> find "</tr>"
Sample
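A minimal usage sketch in Python (htmlstring and the word are placeholders of mine; re.DOTALL is needed if a row spans several lines, since '.' does not cross newlines by default):
import re

htmlstring = "<tr>\n<td>123</td><td>SomeUser</td><td>456</td>\n</tr>"  # placeholder page text
word = "SomeUser"                                                      # the word the row must contain
pattern = r'<tr>(?:(?:(?:(?!<\/tr>).)*?)\b' + re.escape(word) + r'\b(?:.*?))<\/tr>'
match = re.search(pattern, htmlstring, flags=re.DOTALL)
if match:
    print(match.group(0))  # the whole <tr>...</tr> block containing the word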
There are a few ways to do it but I prefer the pandas way:
from urllib import request
import pandas as pd # you need to install pandas
base_url = 'https://apps.foldingathome.org/teamstats/team3446.html'
web_request = request.urlopen(url=base_url).read()
web_df = pd.read_html(web_request, attrs={'class': 'members'})  # read_html returns a list of DataFrames
web_df = web_df[0].set_index(keys=['Name'])
# print(web_df)
user_name_to_find_in_table = 'SteveMoody'
user_name_df = web_df.loc[user_name_to_find_in_table]
print(user_name_df)
Then there are plenty of ways to do this: using just BeautifulSoup's find or CSS selectors, or maybe re as Peter suggests.
Using BeautifulSoup's "find" method together with re, you can do it the following way:
import re
from bs4 import BeautifulSoup as bs # you need to install beautifulsoup
from urllib import request
base_url = 'https://apps.foldingathome.org/teamstats/team3446.html'
web_request = request.urlopen(url=base_url).read()
page_soup = bs(web_request, 'lxml') # need to install lxml and bs4 (BeautifulSoup for Python 3+)
user_name_to_find_in_table = 'SteveMoody'
row_tag = page_soup.find(
    lambda t: t.name == "td"
    and re.findall(user_name_to_find_in_table, t.text, flags=re.I)
).find_parent(name="tr")
print(row_tag.get_text().strip('tr'))
Using BeautifulSoup and CSS selectors (no re, just BeautifulSoup):
from bs4 import BeautifulSoup as bs # you need to install beautifulsoup
from urllib import request
base_url = 'https://apps.foldingathome.org/teamstats/team3446.html'
web_request = request.urlopen(url=base_url).read()
page_soup = bs(web_request, 'lxml') # need to install lxml and bs4 (BeautifulSoup for Python 3+)
user_name_to_find_in_table = 'SteveMoody'
row_tag = page_soup.select_one(f'tr:has(> td:contains({user_name_to_find_in_table})) ')
print(row_tag.get_text().strip('tr'))
In your case I would favor the pandas example as you keep headers and can easily get other stats, and it runs very quickly.
Using Re:
So far, the best input is Peter's comment (link), so I just adapted it to Python code (happy to get edited), as this solution doesn't need any extra library installation.
import re
from urllib import request
base_url = 'https://apps.foldingathome.org/teamstats/team3446.html'
web_request = request.urlopen(url=base_url).read()
user_name_to_find_in_table = 'SteveMoody'
re_pattern = rf'<tr>(?:(?:(?:(?!<\/tr>).)*?)\b{user_name_to_find_in_table}\b(?:.*?))<\/tr>'
res = re.search(pattern=re_pattern, string=str(web_request))
print(res.group(0))
Helpful link on using variables in regex: Stack Overflow
My idea was to explore Groupon's website to extract the URLs of the deals. The problem is that I'm trying to do a findall on the Groupon page to find data like this (from this page: http://www.groupon.de/alle-deals/muenchen/restaurant-296):
"category":"RESTAURANT1","dealPermaLink":"/deals/muenchen-special/Casa-Lavecchia/24788330", and I'd like to get the 'deals/muenchen-special/Casa-Lavecchia/24788330'.
I tried the whole night but I'm unable to find a correct regex. I tried:
import urllib2
import re
Page_Web = urllib2.urlopen('http://www.groupon.de/alle-deals/muenchen/restaurant-296').read()
for m in re.findall('category*RESATAURANT1*dealPermaLink*:?/*/*/*/*\d$',Page_Web):
    print m
But it doesn't print anything.
In order to extract the block that interests you, I would do it this way:
from bs4 import BeautifulSoup
import urllib2
html = urllib2.urlopen('http://www.groupon.de/alle-deals/muenchen/restaurant-296').read()
soup = BeautifulSoup(html)
scriptResults = soup('script',{'type' : 'text/javascript'})
js_block = scriptResults[12]
Starting from this you can parse it with a regex if you want, or try to interpret the JS (there are some threads on Stack Overflow about that).
Anyway, like the others said, you should use the Groupon API...
P.S.
The block that you are parsing can easily be parsed as a dictionary; it is already a list of dictionaries if you look closely...
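If you do take the regex route on that block, here is a hedged sketch (the pattern is my guess based on the snippet quoted in the question, not on Groupon's actual markup):
import re
import urllib2

html = urllib2.urlopen('http://www.groupon.de/alle-deals/muenchen/restaurant-296').read()
# grab every "dealPermaLink":"..." value, e.g. /deals/muenchen-special/Casa-Lavecchia/24788330
for deal in re.findall(r'"dealPermaLink":"([^"]+)"', html):
    print deal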
How about changing RESATAURANT1 to RESTAURANT1, for starters?
I am trying to make a simple Python script to extract certain links from a webpage. I am able to extract links successfully, but now I want to extract some more information like bitrate, size, and duration given on that webpage.
I am using the below XPath to extract the above-mentioned info:
>>> doc = lxml.html.parse('http://mp3skull.com/mp3/linkin_park_faint.html')
>>> info = doc.xpath(".//*[@id='song_html']/div[1]/text()")
>>> info[0:7]
['\n\t\t\t', '\n\t\t\t\t3.71 mb\t\t\t', '\n\t\t\t', '\n\t\t\t\t3.49 mb\t\t\t', '\n\t\t\t', '\n\t\t\t\t192 kbps', '2:41']
Now what I need is that for a particular link the info I require is generated in a form of tuple like (bitrate,size,duration).
The XPath I mentioned above generates the required info, but it is so ill-formatted that it is not possible to achieve my required format with any straightforward logic; at least, I am not able to.
So, is there any way to achieve the output in my format?
I think BeautifulSoup will do the job, it parses even badly formatted HTML:
http://www.crummy.com/software/BeautifulSoup/
parsing is quite easy with BeautifulSoup - for example:
import bs4
import urllib
soup = bs4.BeautifulSoup(urllib.urlopen('http://mp3skull.com/mp3/linkin_park_faint.html').read())
print soup.find_all('a')
and it has quite good docs:
http://www.crummy.com/software/BeautifulSoup/bs4/doc/
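If the goal is the bitrate/size/duration text rather than just the anchors, a hedged variation of the snippet above (the 'song_html' id is taken from the question's XPath; the rest is my guess at the markup):
import bs4
import urllib

soup = bs4.BeautifulSoup(urllib.urlopen('http://mp3skull.com/mp3/linkin_park_faint.html').read())
# 'song_html' comes from the question's XPath; the page apparently reuses it per result
for block in soup.find_all(id='song_html'):
    # collapse the padded text of each result block into one readable line
    print(block.get_text(" ", strip=True))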
You can actually strip everything out with XPath:
translate(.//*[@id='song_html']/div[1]/text(), "\n\t,'", '')
So for your additional question, either:
info[0:len(info)]
for altogether, or:
info.rfind(" ")
The translate leaves a space character, but you could replace that with whatever you wanted.
Additional info found here.
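As a concrete illustration (my own cleanup suggestion, not part of this answer), stripping the padding from the question's sample list leaves values that are easy to regroup into (bitrate, size, duration) tuples:
raw = ['\n\t\t\t', '\n\t\t\t\t3.71 mb\t\t\t', '\n\t\t\t', '\n\t\t\t\t3.49 mb\t\t\t',
       '\n\t\t\t', '\n\t\t\t\t192 kbps', '2:41']
cleaned = [s.strip() for s in raw if s.strip()]
print(cleaned)
# ['3.71 mb', '3.49 mb', '192 kbps', '2:41']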
How are you with regular expressions and python's re module?
http://docs.python.org/library/re.html may be essential.
As far as getting the data out of the array goes, re.match(regex, info[n]) should suffice; as for the triple tuple, the Python tuple syntax takes care of it. Simply match against members of your info array with re.match.
import re
matching_re = '.*' # this re matches whole strings, rather than what you need
incoming_value_1 = re.match(matching_re,info[1])
# etc.
truple = (incoming_value_1, incoming_value_2, incoming_value_3)
#<link rel='canonical' href='http://www.samplewebsite.com/image/5434553/' />
#I am trying to grab the text in href
image = str(Soup)
image_re = re.compile('\<link rel=\'cononical\' href=')
image_pat = re.findall(image_re, image)
print image_pa
#>> []
#Thanks!
Edit: This uses the BeautifulSoup package, which I thought I saw in the previous version of this question.
Edit: More straightforward is this:
soup = BeautifulSoup(document)
links = soup.findAll('link', rel='canonical')
for link in links:
    print link['href']
Instead of all that, you can use:
soup = BeautifulSoup(document)
links = soup("link")
for link in links:
    if link.get("rel") == 'canonical':
        print link["href"]
Use two regular expressions:
import re
link_tag_re = re.compile(r'<link[^>]*>')
# capture all link tags in your text with it. Then for each of those, use:
href_capture = re.compile(r'href\s*=\s*(\'[^\']*\'|"[^"]*")')
The first regex will capture the entire <link> tag; the second one will look for href="something" or href='something'.
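Put together, a small usage sketch (the html string is my own example, modeled on the canonical-link tag from the question):
import re

html = "<head><link rel='canonical' href='http://www.samplewebsite.com/image/5434553/' /></head>"
link_tag_re = re.compile(r'<link[^>]*>')
href_capture = re.compile(r'href\s*=\s*(\'[^\']*\'|"[^"]*")')
for tag in link_tag_re.findall(html):
    m = href_capture.search(tag)
    if m:
        print(m.group(1).strip('\'"'))  # -> http://www.samplewebsite.com/image/5434553/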
In general, though, you should probably use an XML parser for HTML, even though this problem is a perfectly regular language problem. They're far simpler to use for this sort of thing and are less likely to cause you problems.
You're better off using a proper HTML parser on the data, but if you really want to go down this route then the following will do it:
>>> data = "... <link rel='canonical' href='http://www.samplewebsite.com/image/5434553/' /> ..."
>>>
>>> re.search("<link[^>]+?rel='canonical'[^>]+?href='([^']+)", data).group(1)
'http://www.samplewebsite.com/image/5434553/'
>>>
I also notice that your HTML uses single quotes rather than double quotes.
You should use an HTML parser such as lxml.html or BeautifulSoup. But if you only want to grab the href of a single link, you could use a simple regex too:
re.findall(r"href=(['\"])(.*?)\1", html)
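For example (the html variable is mine; note that with two capture groups, findall returns (quote, value) tuples):
import re

html = "<link rel='canonical' href='http://www.samplewebsite.com/image/5434553/' />"
print(re.findall(r"href=(['\"])(.*?)\1", html))
# [("'", 'http://www.samplewebsite.com/image/5434553/')]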
This would be the regex to match the example html you've given:
<link rel='canonical' href='(\S+)'
But I'm not sure if regex is the right tool: this regex will fail when double quotes (or no quotes) are used for the values, or if rel and href are turned around.
I'd recommend using something like BeautifulSoup to find and collect all rel canonical href values.
I am using BeautifulSoup in a simple function to extract links that have all uppercase text:
def findAllCapsUrls(page_contents):
    """ given HTML, returns a list of URLs that have ALL CAPS text
    """
    soup = BeautifulSoup.BeautifulSoup(page_contents)
    all_urls = soup.findAll(name='a')
    # if the text for the link is ALL CAPS then add the link to good_urls
    good_urls = []
    for url in all_urls:
        text = url.find(text=True)
        if text.upper() == text:
            good_urls.append(url['href'])
    return good_urls
Works well most of the time, but a handful of pages will not parse correctly in BeautifulSoup (or lxml, which I also tried) due to malformed HTML on the page, resulting in an object with no (or only some) links in it. A "handful" might sound like not-a-big-deal, but this function is being used in a crawler so there could be hundreds of pages that the crawler will never find...
How can the above function be refactored to not use a parser like BeautifulSoup? I've searched around for how to do this using regex, but all the answers say "use BeautifulSoup." Alternatively, I started looking at how to "fix" the malformed HTML so that it parses, but I don't think that is the best route...
What is an alternative solution, using re or something else, that can do the same as the function above?
If the HTML pages are malformed, there are not a lot of solutions that can really help you. BeautifulSoup or other parsing libraries are the way to go to parse HTML files.
If you want to avoid the library path, you could use a regexp to match all your links (see regular-expression-to-extract-url-from-an-html-link) using a range of [A-Z].
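A rough sketch of that regex route (the pattern is my own and only handles the simplest double-quoted <a href="..."> form with an ALL-CAPS label; page_contents is the HTML string from the question's function):
import re

caps_link_re = re.compile(r'<a\s+[^>]*href="([^"]*)"[^>]*>\s*([A-Z0-9 .,!?&-]+)\s*</a>')
good_urls = [href for href, text in caps_link_re.findall(page_contents)]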
When I need to parse really broken HTML and speed is not the most important factor, I automate a browser with Selenium and WebDriver.
This is the most resilient way of HTML parsing I know.
Check this tutorial; it shows how to extract Google suggestions using WebDriver (the code is in Java but it can be changed to Python).
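A minimal hedged sketch of that approach for this question (assuming the older Selenium API and a local Firefox driver; the ALL-CAPS filter mirrors the original function):
from selenium import webdriver

driver = webdriver.Firefox()        # assumes a local Firefox/geckodriver setup
driver.get("http://example.com/")   # placeholder URL
good_urls = []
for a in driver.find_elements_by_tag_name("a"):
    text = a.text
    if text and text.upper() == text:
        good_urls.append(a.get_attribute("href"))
driver.quit()
print(good_urls)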
I ended up with a combination of regex and BeautifulSoup:
import re
import BeautifulSoup

def findAllCapsUrls2(page_contents):
    """ returns a list of URLs that have ALL CAPS text, given
        the HTML from a page. Uses a combo of RE and BeautifulSoup
        to handle malformed pages.
    """
    # get all anchors on page using regex
    p = r'<a\s+href\s*=\s*"([^"]*)"[^>]*>(.*?(?=</a>))</a>'
    re_urls = re.compile(p, re.DOTALL)
    all_a = re_urls.findall(page_contents)
    # if the text for the anchor is ALL CAPS then add the link to good_urls
    good_urls = []
    for a in all_a:
        href = a[0]
        a_content = a[1]
        a_soup = BeautifulSoup.BeautifulSoup(a_content)
        text = ''.join([s.strip() for s in a_soup.findAll(text=True) if s])
        if text and text.upper() == text:
            good_urls.append(href)
    return good_urls
This is working for my use cases so far, but I wouldn't guarantee it to work on all pages. Also, I only use this function if the original one fails.