I've got the following HTML code:
<a href="http://ｂａｄ．ｃｏｍ">Bad URL</a>
The href is the string \uff42\uff41\uff44\uff0e\uff43\uff4f\uff4d (the fullwidth characters ｂａｄ．ｃｏｍ), which both
Chrome and Firefox treat as the same as http://bad.com.
I need to compare the collected URLs with a list of whitelisted URLs.
How do I convert http://ｂａｄ．ｃｏｍ to http://bad.com using Python? Do the browsers replace "confusable" characters?
Alternatively, is it possible to compare two such URLs?
You can use unicodedata:
import unicodedata

link = u'http://\uff42\uff41\uff44\uff0e\uff43\uff4f\uff4d'  # fullwidth 'bad.com'
normalized = unicodedata.normalize('NFKC', link)  # -> u'http://bad.com'
What 'NFKC' means is explained in the official unicodedata docs.
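For the whitelist comparison the question asks about, a minimal sketch along the same lines (the whitelist contents here are made up): normalize each collected URL before testing membership.

import unicodedata

# hypothetical whitelist; entries are already plain ASCII
whitelist = {'http://bad.com', 'http://good.com'}

link = u'http://\uff42\uff41\uff44\uff0e\uff43\uff4f\uff4d'
print(unicodedata.normalize('NFKC', link) in whitelist)  # True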
I have the following url 'http://www.alriyadh.com/file/278?&page=1'
I would like to write a regex to access the urls from page=2 till page=12.
For example, this url is needed: 'http://www.alriyadh.com/file/278?&page=4', but not page=14.
I reckon what will work is a function that iterates over the specified 10 pages to access all the urls within them. I have tried this regex, but it does not work:
'.*?=[2-9]'
My aim is to get the content from those urls using the newspaper package. I simply want this data for my research.
Thanks in advance
This does not require a regex; a simple preset loop will do:
import requests
from bs4 import BeautifulSoup as bs

url = 'http://www.alriyadh.com/file/278?&page='

for page in range(2, 13):  # pages 2 through 12
    html = requests.get(url + str(page)).text
    soup = bs(html, 'html.parser')
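Since the stated aim is to feed these pages to the newspaper package, here is a hedged sketch of that step. The assumption that every <a href> on a listing page is an article worth parsing is mine and will need narrowing to the right selector:

import requests
from bs4 import BeautifulSoup as bs
from newspaper import Article

url = 'http://www.alriyadh.com/file/278?&page='

for page in range(2, 13):
    soup = bs(requests.get(url + str(page)).text, 'html.parser')
    # hypothetical selector: every <a href> on the listing page
    for a in soup.find_all('a', href=True):
        article = Article(a['href'])
        article.download()   # fetch the article page
        article.parse()      # extract title, text, etc.
        print(article.text)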
Here's a regex to access the proper range (i.e. 2-12):
([2-9]|1[012])
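A sketch of how that alternation could be used, anchored to the page parameter so that page=14 is rejected (this assumes the page number ends the URL):

import re

page_re = re.compile(r'[?&]page=([2-9]|1[012])$')

print(bool(page_re.search('http://www.alriyadh.com/file/278?&page=4')))   # True
print(bool(page_re.search('http://www.alriyadh.com/file/278?&page=14')))  # False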
Judging by what you have now, I am unsure that your regex will work as you intend it to. Perhaps I am misinterpreting your regex altogether, but is the '?=' intended to be a lookahead?
Or are you actually searching for a '?' immediately followed by an '=' immediately followed by any digit 2-9?
How familiar are you with regexes in general? This particular one seems dangerously vague to find a meaningful match.
I would like to find all URLs in a string. I found various solutions on StackOverflow that vary depending on the content of the string.
For example, supposing my string contained HTML, this answer recommends using either BeautifulSoup or lxml.
On the other hand, if my string contained only a plain URL without HTML tags, this answer recommends using a regular expression.
I wasn't able to find a good solution given that my string contains both an HTML-encoded URL and a plain URL. Here is some example code:
import lxml.html
example_data = """<a href="http://www.some-random-domain.com/abc123/def.html">Click Me!</a>
http://www.another-random-domain.com/xyz.html"""
dom = lxml.html.fromstring(example_data)
for link in dom.xpath('//a/@href'):
    print "Found Link: ", link
As expected, this results in:
Found Link: http://www.some-random-domain.com/abc123/def.html
I also tried the twitter-text-python library that @Yannisp mentioned, but it doesn't seem to extract both URLs:
>>> from ttp.ttp import Parser
>>> p = Parser()
>>> r = p.parse(example_data)
>>> r.urls
['http://www.another-random-domain.com/xyz.html']
What is the best approach for extracting both kinds of URLs from a string containing a mix of HTML-encoded and plain data? Is there a good module that already does this? Or am I forced to combine regex with BeautifulSoup/lxml?
I upvoted because it triggered my curiosity. There seems to be a library called twitter-text-python that parses Twitter posts to detect both urls and hrefs. Otherwise, I would go with the combination of regex + lxml.
You could use re to find all URLs:
import re

urls = re.findall(r"(https?://[\w/$\-_.+!*'()]+)", example_data)
The character class includes alphanumerics, '/', and the other "characters allowed in a URL".
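For instance, applied to the example_data from the question (redefined here so the snippet runs on its own), it picks up both URLs:

import re

example_data = """<a href="http://www.some-random-domain.com/abc123/def.html">Click Me!</a>
http://www.another-random-domain.com/xyz.html"""

print re.findall(r"(https?://[\w/$\-_.+!*'()]+)", example_data)
# ['http://www.some-random-domain.com/abc123/def.html',
#  'http://www.another-random-domain.com/xyz.html']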
Based on the answer by @YannisP, I was able to come up with this solution:
import lxml.html
from ttp.ttp import Parser
def extract_urls(data):
    urls = set()
    # First extract HTML-encoded URLs
    dom = lxml.html.fromstring(data)
    for link in dom.xpath('//a/@href'):
        urls.add(link)
    # Next, extract URLs from plain text
    parser = Parser()
    results = parser.parse(data)
    for url in results.urls:
        urls.add(url)
    return list(urls)
This results in:
>>> example_data
'<a href="http://www.some-random-domain.com/abc123/def.html">Click Me!</a>\nhttp://www.another-random-domain.com/xyz.html'
>>> urls = extract_urls(example_data)
>>> print urls
['http://www.another-random-domain.com/xyz.html', 'http://www.some-random-domain.com/abc123/def.html']
I'm not sure how well this will work on other URLs, but it seems to work for what I need it to do.
I am trying to make a simple Python script to extract certain links from a webpage. I am able to extract the links successfully, but now I want to extract some more information, like the bitrate, size, and duration given on that webpage.
I am using the below XPath to extract the above mentioned info:
>>> doc = lxml.html.parse('http://mp3skull.com/mp3/linkin_park_faint.html')
>>> info = doc.xpath(".//*[@id='song_html']/div[1]/text()")
>>> info[0:7]
['\n\t\t\t', '\n\t\t\t\t3.71 mb\t\t\t', '\n\t\t\t', '\n\t\t\t\t3.49 mb\t\t\t', '\n\t\t\t', '\n\t\t\t\t192 kbps', '2:41']
Now what I need is that, for a particular link, the info I require is generated in the form of a tuple like (bitrate, size, duration).
The XPath I mentioned above extracts the required info, but it is ill-formatted; it is not possible to achieve my required format with any logic, at least I am not able to.
So, is there any way to achieve the output in my format?
I think BeautifulSoup will do the job; it parses even badly formatted HTML:
http://www.crummy.com/software/BeautifulSoup/
Parsing is quite easy with BeautifulSoup - for example:
import bs4
import urllib
soup = bs4.BeautifulSoup(urllib.urlopen('http://mp3skull.com/mp3/linkin_park_faint.html').read())
print soup.find_all('a')
and it has quite good docs:
http://www.crummy.com/software/BeautifulSoup/bs4/doc/
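A rough sketch of how the (bitrate, size, duration) tuples could then be assembled; it assumes the page repeats id='song_html' blocks, as the question's XPath suggests, and the field patterns are guesses from the sample output:

import re
import bs4
import urllib

html = urllib.urlopen('http://mp3skull.com/mp3/linkin_park_faint.html').read()
soup = bs4.BeautifulSoup(html)

songs = []
for cell in soup.find_all(id='song_html'):
    text = cell.get_text()
    # pick each field out of the cell's text, if present
    bitrate = re.search(r'\d+\s*kbps', text)
    size = re.search(r'[\d.]+\s*mb', text)
    duration = re.search(r'\d+:\d\d', text)
    songs.append(tuple(m.group(0) if m else None
                       for m in (bitrate, size, duration)))
print songs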
You can actually strip everything out with XPath:
translate(.//*[@id='song_html']/div[1]/text(), "\n\t,'", '')
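Evaluated through lxml (where the Python escapes turn \n and \t into real characters inside the XPath string literal), translate() returns a cleaned-up string. A toy demonstration on a made-up fragment:

from lxml import etree

doc = etree.fromstring(
    "<root><div id='song_html'><div>\n\t\t\t\t192 kbps</div></div></root>")
# translate() uses the string-value of the first node in the set
print doc.xpath("translate(.//*[@id='song_html']/div[1]/text(), '\n\t', '')")
# -> 192 kbps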
So, for your additional question, either:
info[0:len(info)]
for the whole string altogether, or:
info.rfind(" ")
since the translate leaves a space character, but you could replace that with whatever you wanted.
Additional info can be found here.
How are you with regular expressions and Python's re module?
http://docs.python.org/library/re.html may be essential.
As far as getting the data out of the array goes, re.match(regex, info[n]) should suffice; as for the triple tuple, Python's tuple syntax takes care of it. Simply match against members of your info array with re.match.
import re

matching_re = '.*'  # this re matches whole strings, rather than what you need
incoming_value_1 = re.match(matching_re, info[1])
# etc.
truple = (incoming_value_1, incoming_value_2, incoming_value_3)
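Fleshed out against the info list from the question, with patterns guessed from the sample strings shown there, that could look like:

import re

info = ['\n\t\t\t', '\n\t\t\t\t3.71 mb\t\t\t', '\n\t\t\t',
        '\n\t\t\t\t3.49 mb\t\t\t', '\n\t\t\t', '\n\t\t\t\t192 kbps', '2:41']

size = re.match(r'\s*([\d.]+ mb)', info[1]).group(1)    # '3.71 mb'
bitrate = re.match(r'\s*(\d+ kbps)', info[5]).group(1)  # '192 kbps'
duration = re.match(r'(\d+:\d\d)', info[6]).group(1)    # '2:41'
truple = (bitrate, size, duration)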
I am building a blog on Google App Engine. I would like to convert some keywords in my blog posts to links, just like what you see in many WordPress blogs.
Here is one WP plugin which does the same thing: http://wordpress.org/extend/plugins/blog-mechanics-keyword-link-plugin-v01/
A plugin that allows you to define keyword/link pairs. The keywords are automatically linked in each of your posts.
I think this is more than a simple Python Replace. What I am dealing with is HTML code. It can be quite complex sometimes.
Take the following code snippet as an example. I want to convert the word example into a link to http://example.com:
Here is an example link: <a href="http://example.com">example.com</a>
By a simple Python replace which replaces example with <a href="http://example.com">example</a>, it would output:
Here is an <a href="http://example.com">example</a> link: <a href="http://<a href="http://example.com">example</a>.com">example.com</a>
but I want:
Here is an <a href="http://example.com">example</a> link: <a href="http://example.com">example.com</a>
Is there any Python plugin that is capable of this? Thanks a lot!
This is roughly what you could do using BeautifulSoup:
from BeautifulSoup import BeautifulSoup

html_body = """
Here is an example link: <a href='http://example.com'>example.com</a>
"""

soup = BeautifulSoup(html_body)

# mark existing link texts with '|' so they are left alone
for link_tag in soup.findAll('a'):
    link_tag.string = "%s%s%s" % ('|', link_tag.string, '|')

# replace the bare word 'example' in all remaining text
for text in soup.findAll(text=True):
    text_formatted = ['<a href="http://example.com">example</a>'
                      if word == 'example' and not (word.startswith('|') and word.endswith('|'))
                      else word for word in text.split()]
    text.replaceWith(' '.join(text_formatted))

# strip the '|' markers again
for link_tag in soup.findAll('a'):
    link_tag.string = link_tag.string[1:-1]

print soup
Basically I'm stripping out all the text from the post body and replacing the word example with the given link, without touching the link texts, which are protected by the '|' characters during the parsing.
This is not 100% perfect; for example, it does not work if the word you are trying to replace ends with a period. With some patience you could fix all the edge cases.
This would probably be better suited to client-side code. You could easily modify a word highlighter to get the desired results. By keeping this client-side, you can avoid having to expire page caches when your 'tags' change.
If you really need it to be processed server-side, then you need to look at using re.sub, which lets you pass in a function, but unless you are operating on plain text you will have to first parse the HTML using something like minidom to ensure you are not replacing something in the middle of any elements.
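A minimal sketch of the re.sub-with-a-function idea on plain text; the keyword and target URL are placeholders, and real posts would need the HTML-parsing step described above first:

import re

def linkify(match):
    # wrap the matched keyword in a link
    return '<a href="http://example.com">%s</a>' % match.group(0)

text = 'Here is an example link.'
print re.sub(r'\bexample\b', linkify, text)
# -> Here is an <a href="http://example.com">example</a> link.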
I'm using cygwin and do not have BeautifulSoup installed. I found these related questions:
Getting the value of href attributes in all <a> tags on a html file with Python
python, regex to find anchor link html
Regular expression to extract URL from an HTML link
If you don't care much about performance, you can use regular expressions:
import re
linkre = re.compile(r"""href=["']([^"']+)["']""")
links = linkre.findall(your_html)
If you just want http:// links, then change the expression to:
linkre = re.compile(r"""href=["']http:([^"']+)["']""")
Or you can make the ["'] parts optional if by some chance you have HTML without quotes around the links.
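For illustration, run against a made-up snippet:

import re

linkre = re.compile(r"""href=["']([^"']+)["']""")

your_html = '<a href="http://example.com/a">A</a> <a href=\'/relative/b\'>B</a>'
print linkre.findall(your_html)
# -> ['http://example.com/a', '/relative/b']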