I read this thread about extracting URLs from a string: https://stackoverflow.com/a/840014/326905
Really nice. I got all the URLs from an XML document containing http://www.blabla.com with
>>> import re
>>> s = '''<link href="http://www.blabla.com/blah" />
... <link href="http://www.blabla.com" />'''
>>> re.findall(r'(https?://\S+)', s)
['http://www.blabla.com/blah"', 'http://www.blabla.com"']
But I can't figure out how to customize the regex to omit the double quote at the end of the URL.
First I thought this was the clue:
re.findall(r'(https?://\S+\")', s)
or this
re.findall(r'(https?://\S+\Z")', s)
but it isn't.
Can somebody help me out and tell me how to omit the double quote at the end?
Btw, the question mark after the "s" of https means the "s" may or may not occur. Am I right?
>>> from lxml import html
>>> ht = html.fromstring(s)
>>> ht.xpath('//link/@href')
['http://www.blabla.com/blah', 'http://www.blabla.com']
You're already using a character class (albeit a shorthand version). I might suggest modifying the character class a bit so that you don't need a lookahead. Simply add the quote to the character class:
re.findall(r'(https?://[^\s"]+)', s)
This still says "one or more characters that are not whitespace," with the addition of not including double quotes either. So the overall expression is "one or more characters that are neither whitespace nor a double quote."
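Run against the sample string s from the question, that should give:
>>> re.findall(r'(https?://[^\s"]+)', s)
['http://www.blabla.com/blah', 'http://www.blabla.com']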
You can match the double quote in a look-ahead:
re.findall(r'(https?://\S+)(?=\")', s)
This way it won't appear as part of the match. Also, yes, the ? means the preceding character is optional.
See example here: http://regexr.com?347nk
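Against the sample string s from the question, this should likewise drop the trailing quotes:
>>> re.findall(r'(https?://\S+)(?=\")', s)
['http://www.blabla.com/blah', 'http://www.blabla.com']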
I extract URLs from text with this piece of code:
url_rgx = re.compile(ur'(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:\'".,<>?\xab\xbb\u201c\u201d\u2018\u2019]))')
# convert string to lower case
text = text.lower()
matches = re.findall(url_rgx, text)
# patch the 'http://' part if it is missed
urls = ['http://%s'%url[0] if not url[0].startswith('http') else url[0] for url in matches]
print urls
It works great!
Thanks. I just read this https://stackoverflow.com/a/13057368/326905
and checked out this, which also works:
re.findall(r'"(https?://\S+)"', s)
Related
I'm trying to match the following URL from an HTML page in Python by its query string, but I have not been able to solve it. I'm a newbie in Python.
<a href="http://example.com/?query_id=9&user_id=49&token_id=4JGO4I394HD83E" id="838"/>
I want to match the above URL with &user_id=[any_digit_from_0_to_99]& and print this URL on the screen.
A URL without this &user_id=[any_digit_from_0_to_99]& won't be matched.
Here's my horrible, incomplete regex:
https?:\/\/.{0,30}\.+[a-zA-Z0-9\/?_+=]{0,30}&user_id=[0-9][0-9]&.*?"
I know this code has a lot wrong with it, but it somehow manages to match the above URL up to the " double quote.
The complete code looks like this:
import re
reg = re.compile(r'https?:\/\/.{0,30}\.+[a-zA-Z0-9\/?_+=]{0,30}&user_id=[0-9][0-9]&.*?"')
str = '<a href="http://example.com/?query_id=9&user_id=49&token_id=4JGO4I394HD83E" id="838"/>'
result = reg.search(str)
result = result.group()
print(result)
Output:
$ python reg.py
http://example.com/?query_id=9&user_id=49&token_id=4JGO4I394HD83E"
It shows the " at the end of the URL, and I know this is not good regex; I want a better version of my code above.
A few remarks can be made on your regexp:
/ is not a special re character, there's no need to escape it
Is the fact that the domain can't be longer than 30 characters intentional? Otherwise, you can just select as many characters as you want with .*
Do you know that the string you're working with contains a valid URL? If not, there are some things you can do, like ensuring the domain is at least 4 characters long, contains a period which is not the last character, etc.
The [0-9][0-9] part requires exactly two digits, so it also matches things like 04 while missing single digits like 9; it does not, strictly speaking, express "any number from 0 to 99".
Taking this into account, you can design this simpler regex:
reg = re.compile("https?://.*&user_id=[1-9][0-9]?&")
str = '<a href="http://example.com/?query_id=9&user_id=49&token_id=4JGO4I394HD83E" id="838"/>'
result = reg.search(str)
result = result.group()
print(result)
Using this regex on your example will print 'http://example.com/?query_id=9&user_id=49&', without the " at the end. If you want the full URL, you can look for the /> symbol:
reg = re.compile("https?://.*&user_id=[1-9][0-9]?&.*/>")
str = '<a href="http://example.com/?query_id=9&user_id=49&token_id=4JGO4I394HD83E" id="838"/>'
result = reg.search(str)
result = result.group()[:-2]
print(result)
Note the [:-2], which is used to remove the /> symbol. In that case, this code will print http://example.com/?query_id=9&user_id=49&token_id=4JGO4I394HD83E" id="838".
Note also that these regexes use the . wildcard. Depending on whether you are sure that the strings you're working with contain only valid URLs, you may want to change this. For instance, a domain name can only contain ASCII characters, so you may want to look at the \w special sequence together with the ASCII flag of the re module.
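A possible tightening along those lines might look like the sketch below; the exact character class and flags are just an illustration, not a drop-in replacement:
import re

# Sketch only: restrict the host part to ASCII word characters, dots and
# hyphens instead of the bare wildcard, and keep the 1-99 user_id check.
reg = re.compile(r"https?://[\w.-]+/\S*&user_id=[1-9][0-9]?&", re.ASCII)
s = '<a href="http://example.com/?query_id=9&user_id=49&token_id=4JGO4I394HD83E" id="838"/>'
m = reg.search(s)
if m:
    print(m.group())  # http://example.com/?query_id=9&user_id=49&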
I have the following:
\"url\": \"\\\/maru.php?superalox=1\", \"params\": {\"params\": \"EgtmbUtXdUZWdDF2SSoCCABQAQ%3D%3D\", \"session_value\": \"QUFFLUhqbnQ4eW5HeEZYdDBULU5EVk1LREU2VndMMm1nd3xBQ3Jtc0trMUlMOWRqTWxpS0pOT2pNUVN6RENVU3k0Tmc4blplodexsWkxrVDRmOUN2Q0lXVkl1N0YwUFhoV1puQ3ZFQm10X1RzNWR4Q3RUeG5kMkdLNnNobTUyRkNuaG90d2c=\"}, \"log_params\":
I want to extract the value of params, which is EgtmbUtXdUZWdDF2SSoCCABQAQ%3D%3D.
I have tried this but it didn't work:
my_text = """ \"url\": \"\\\/maru.php?superalox=1\", \"params\": {\"params\": \"EgtmbUtXdUZWdDF2SSoCCABQAQ%3D%3D\", \"session_value\": \"QUFFLUhqbnQ4eW5HeEZYdDBULU5EVk1LREU2VndMMm1nd3xBQ3Jtc0trMUlMOWRqTWxpS0pOT2pNUVN6RENVU3k0Tmc4blplodexsWkxrVDRmOUN2Q0lXVkl1N0YwUFhoV1puQ3ZFQm10X1RzNWR4Q3RUeG5kMkdLNnNobTUyRkNuaG90d2c=\"}, \"log_params\": """
extract_data = re.search(r'(\\\"params\": \\\")(\w*)', my_text)
print(extract_data)
Thanks
You can use:
re.search(r'"params": "([^"]+)"', my_text).group(1)
I've been using the following with great success:
(["'])(?:(?=(\\?))\2.)*?\1
It supports nested quotes as well.
This also works and is easier to read: (["'])(\\?.)*?\1
For those who want a deeper explanation of how this works, here's an explanation from user ephemient:
([""']) match a quote; ((?=(\\?))\2.) if backslash exists, gobble it, and whether or not that happens, match a character; *? match many times (non-greedily, as to not eat the closing quote); \1 match the same quote that was use for opening.
This is copied from another answer found here: RegEx: Grabbing values between quotation marks
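A quick sketch of the first pattern in action; the sample string here is made up for illustration:
import re

# The quoted-string pattern from the answer above; handles escaped quotes.
pattern = re.compile(r'''(["'])(?:(?=(\\?))\2.)*?\1''')
sample = r'He said "a \"quoted\" word" and then left.'
print([m.group(0) for m in pattern.finditer(sample)])
# ['"a \\"quoted\\" word"']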
I am scraping a page with Python and BeautifulSoup library.
I need to get only the URL from this string. It is actually in the href attribute of an a tag. I have scraped it but cannot seem to find a way to extract the URL from it:
javascript:void%20window.open('/Sheraton-Tucson-Hotel-177/tnc/150/24795/en','TC_POPUP','width=490,height=405,screenX=300,screenY=250,top=250,left=300,scrollbars=yes,resizable=no');
You can write a straightforward regex to extract the URL.
>>> import re
>>> href = "javascript:void%20window.open('/Sheraton-Tucson-Hotel-177/tnc/150/24795/en','TC_POPUP','width=490,height=405,screenX=300,screenY=250,top=250,left=300,scrollbars=yes,resizable=no');"
>>> re.findall(r"'(.*?)'", href)
['/Sheraton-Tucson-Hotel-177/tnc/150/24795/en', 'TC_POPUP', 'width=490,height=405,screenX=300,screenY=250,top=250,left=300,scrollbars=yes,resizable=no']
>>> _[0]
'/Sheraton-Tucson-Hotel-177/tnc/150/24795/en'
The regex in question here is
'(.*?)'
Which reads "find a single-quote, followed by whatever (and capture the whatever), followed by another single quote, and do so non-greedily because of the ? operator". This extracts the arguments of window.open; then, just pick the first one to get the URL.
You shouldn't have any nested ' in your href, since those should be escaped to %27. If you do, though, this will not work, and you may need a solution that doesn't use regexes.
I did it this way:
terms = "javascript:void%20window.open('/Sheraton-Tucson-Hotel-177/tnc/150/24795/en','TC_POPUP','width=490,height=405,screenX=300,screenY=250,top=250,left=300,scrollbars=yes,resizable=no');"
terms.split("('")[1].split("','")[0]
outputs
/Sheraton-Tucson-Hotel-177/tnc/150/24795/en
Instead of a regex, you could just partition it twice on something (e.g. '):
s.partition("'")[2].partition("'")[0]
# /Sheraton-Tucson-Hotel-177/tnc/150/24795/en
Here's a quick and ugly answer
href.split("'")[1]
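Given the same href string as above, this should likewise return '/Sheraton-Tucson-Hotel-177/tnc/150/24795/en'.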
Given the following string parsed from an email body...
s = "Keep all of this <h1>But remove this including the tags</h1> this is still good, but
<p>there also might be new lines like this that need to be removed</p>
<p> and even other lines like this all the way down here with whitespace after being parsed from email that need to be removed.</p>
But this is still okay."
How do I remove all the HTML code and those lines from the string to simply return "Keep all of this this is still good But this is still okay." on one line? I've looked at bleach and lxml, but they simply remove the HTML tags and return what's inside, whereas I don't want any of it.
You can still use lxml to get all of the root element's text nodes:
import lxml.html
html = '''
Keep all of this <h1>But remove this including the tags</h1> this is still good, but
<p>there also might be new lines like this that need to be removed</p>
<p> and even other lines like this all the way down here with whitespace after being parsed from email that need to be removed.</p>
But this is still okay.
'''
root = lxml.html.fromstring('<div>' + html + '</div>')
text = ' '.join(t.strip() for t in root.xpath('text()') if t.strip())
Seems to work fine:
>>> text
'Keep all of this this is still good, but But this is still okay.'
Simple solution that requires no external packages:
import re
while '<' in s:
    s = re.sub('<.+?>.+?<.+?>', '', s)
Not very efficient, since it passes over the target string many times, but it should work. Note that the string must contain absolutely no < or > characters outside of tags.
This one?
import re
s = # Your string here
print re.sub('[\s\n]*<.+?>.+?<.+?>[\s\n]*', ' ', s)
Edit: just made a few mods to @BoppreH's answer, albeit with an extra space.
Here is the html I am trying to parse.
<TD>Serial Number</TD><TD>AB12345678</TD>
I am attempting to use regex to parse the data. I heard about BeautifulSoup but there are around 50 items like this on the page all using the same table parameters and none of them have ID numbers. The closest they have to unique identifiers is the data in the cell before the data I need.
serialNumber = re.search("Serial Number</td><td>\n(.*?)</td>", source)
source is simply the source code of the page grabbed using urllib. There is a newline in the HTML between the second <TD> and the serial number, but I am unsure if that matters.
Pyparsing can give you a little more robust extractor for your data:
from pyparsing import makeHTMLTags, Word, alphanums
htmlfrag = """<blah></blah><TD>Serial Number</TD><TD>
AB12345678
</TD><stuff></stuff>"""
td, tdEnd = makeHTMLTags("td")
sernoFormat = (td + "Serial Number" + tdEnd +
               td + Word(alphanums)('serialNumber') + tdEnd)
for sernoData in sernoFormat.searchString(htmlfrag):
    print sernoData.serialNumber
Prints:
AB12345678
Note that pyparsing doesn't care where the extra whitespace falls, and it also handles unexpected attributes that might crop up in the defined tags, whitespace inside tags, tags in upper/lower case, etc.
In most cases it is better to work on HTML using an appropriate parser, but in some cases it is perfectly OK to use regular expressions for the job. I do not know enough about your task to judge whether that is a good solution or whether it is better to go with @Paul's solution, but here I try to fix your regex:
serialNumber = re.search("Serial Number</td><td>(.*?)</td>", source, re.S | re.I )
I removed the \n, because it is tricky to get right in my opinion (\n, \r, \r\n, ...?); instead I used the option re.S (DOTALL).
But be aware that if there is a newline, it will now be in your capturing group, i.e. you should strip whitespace from your result afterwards.
Another problem with your regex is that your string contains <TD> but you search for <td>; that is what the option re.I (IGNORECASE) is for.
You can find more explanation of regexes on docs.python.org.
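A minimal check of the corrected pattern, using a made-up source fragment with the newline the question mentions:
import re

# The sample mimics the fragment from the question, including the newline.
source = '<TD>Serial Number</TD><TD>\nAB12345678\n</TD>'
m = re.search("Serial Number</td><td>(.*?)</td>", source, re.S | re.I)
if m:
    print(m.group(1).strip())  # AB12345678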