I'm making a regex so I can find youtube links (can be multiple) in a piece of HTML text posted by an user.
Currently I'm using the following regex to change 'http://www.youtube.com/watch?v=-JyZLS2IhkQ' into displaying the corresponding youtube video:
return re.compile('(http(s|):\/\/|)(www.|)youtube.(com|nl)\/watch\?v\=([a-zA-Z0-9-_=]+)').sub(tag, value)
(where the variable 'tag' is a bit of html so the video works and 'value' a user post)
Now this works.. until the url is like this:
'http://www.youtube.com/watch?v=-JyZLS2IhkQ&feature...'
Now I'm hoping you guys could help me figure how to also match the '&feature...' part so it disappears.
Example HTML:
No replies to this post..
Youtube vid:
http://www.youtube.com/watch?v=-JyZLS2IhkQ
More blabla
Thanks for your thoughts, much appreciated
Stefan
Here how I'm solving it:
import re
def youtube_url_validation(url):
youtube_regex = (
r'(https?://)?(www\.)?'
'(youtube|youtu|youtube-nocookie)\.(com|be)/'
'(watch\?v=|embed/|v/|.+\?v=)?([^&=%\?]{11})')
youtube_regex_match = re.match(youtube_regex, url)
if youtube_regex_match:
return youtube_regex_match
return youtube_regex_match
TESTS:
youtube_urls_test = [
'http://www.youtube.com/watch?v=5Y6HSHwhVlY',
'http://youtu.be/5Y6HSHwhVlY',
'http://www.youtube.com/embed/5Y6HSHwhVlY?rel=0" frameborder="0"',
'https://www.youtube-nocookie.com/v/5Y6HSHwhVlY?version=3&hl=en_US',
'http://www.youtube.com/',
'http://www.youtube.com/?feature=ytca']
for url in youtube_urls_test:
m = youtube_url_validation(url)
if m:
print('OK {}'.format(url))
print(m.groups())
print(m.group(6))
else:
print('FAIL {}'.format(url))
You should specify your regular expressions as raw strings.
You don't have to escape every character that looks special, just the ones which are.
Instead of specifying an empty branch ((foo|)) to make something optional, you can use ?.
If you want to include - in a character set, you have to escape it or put it at right after the opening bracket.
You can use special character sets like \w (equals [a-zA-Z0-9_]) to shorten your regex.
r'(https?://)?(www\.)?youtube\.(com|nl)/watch\?v=([-\w]+)'
Now, in order to match the whole URL, you have to think about what can or cannot follow it in the input. Then you put that into a lookahead group (you don't want to consume it).
In this example I took everything except -, =, %, & and alphanumerical characters to end the URL (too lazy to think about it any harder).
Everything between the v-argument and the end of the URL is non-greedily consumed by .*?.
r'(https?://)?(www\.)?youtube\.(com|nl)/watch\?v=([\w-]+)(&.*?)?(?=[^-\w&=%])'
Still, I would not put too much faith into this general solution. User input is notoriously hard to parse robustly.
What if you used the urlparse module to pick apart the youtube address you find and put it back into the format you want? You could then simplify your regex so that it only finds the entire url and then use urlparse to do the heavy lifting of picking it apart for you.
from urlparse import urlparse,parse_qs,urlunparse
from urllib import urlencode
youtube_url = urlparse('http://www.youtube.com/watch?v=aFNzk7TVUeY&feature=grec_index')
params = parse_qs(youtube_url.query)
new_params = {'v': params['v'][0]}
cleaned_youtube_url = urlunparse((youtube_url.scheme, \
youtube_url.netloc, \
youtube_url.path,
None, \
urlencode(new_params), \
youtube_url.fragment))
It's a bit more code, but it allows you to avoid regex madness.
And as hop said, you should use raw strings for the regex.
Here's how I implemented it in my script:
string = "Hey, check out this video: https://www.youtube.com/watch?v=bS5P_LAqiVg"
youtube = re.findall(r'(https?://)?(www\.)?((youtube\.(com))/watch\?v=([-\w]+)|youtu\.be/([-\w]+))', string)
if youtube:
print youtube
That outputs:
["", "youtube.com/watch?v=BS5P_LAqiVg", ".com", "watch", "com", "bS5P_LAqiVg", ""]
If you just wanted to grab the video id, for example, you would do:
video_id = [c for c in youtube[0] if c] # Get rid of empty list objects
video_id = video_id[len(video_id)-1] # Return the last item in the list
Related
I'm trying to match the following URL by its query string from a html page in Python but could not able to solved it. I'm a newbie in python.
<a href="http://example.com/?query_id=9&user_id=49&token_id=4JGO4I394HD83E" id="838"/>
I want to match the above URL with &user_id=[any_digit_from_0_to_99]& and print this URL on the screen.
URL without this &user_id=[any_digit_from_0_to_99]& wont be match.
Here's my horror incomplete regex code:
https?:\/\/.{0,30}\.+[a-zA-Z0-9\/?_+=]{0,30}&user_id=[0-9][0-9]&.*?"
I know this code has so many wrong, but this code somehow managed to match the above URL till " double qoute.
Complete code would look like this:
import re
reg = re.compile(r'https?:\/\/.{0,30}\.+[a-zA-Z0-9\/?_+=]{0,30}&user_id=[0-9][0-9]&.*?"')
str = '<a href="http://example.com/?query_id=9&user_id=49&token_id=4JGO4I394HD83E" id="838"/>'
result = reg.search(str)
result = result.group()
print(result)
Output:
$ python reg.py
http://example.com/?query_id=9&user_id=49&token_id=4JGO4I394HD83E"
It shows the " at the end of the URL and I know this is not the good regex code I want the better version of my above code.
A few remarks can be made on your regexp:
/ is not a special re character, there's no need to escape it
Has the fact that the domain can't be larger than 30 chracters been done on purpose? Otherwise, you can just select as much characters as you want with .*
Do you know that the string you're working with contains a valid URL? If no, there are some things you can do, like ensuring the domain is at least 4 chracters long, contains a period which is not the last character, etc...
The [0-9][0-9] part will also match stuff like 04, which is not strictly speaking a digit between 0 and 99
Taking this into account, you can design this simpler regex:
reg = re.compile("https?://.*&user_id=[1-9][0-9]?&")
str = '<a href="http://example.com/?query_id=9&user_id=49&token_id=4JGO4I394HD83E" id="838"/>'
result = reg.search(str)
result = result.group()
print(result)
Using this regex on your example will print 'http://example.com/?query_id=9&user_id=4&', without the " at the end. If you want to have to full URL, then you can look for the /> symbol:
reg = re.compile("https?://.*&user_id=[1-9][0-9]?&.*/>")
str = '<a href="http://example.com/?query_id=9&user_id=49&token_id=4JGO4I394HD83E" id="838"/>'
result = reg.search(str)
result = result.group()[:-2]
print(result)
Note the [:-2] which is used to remove the /> symbol. In that case, this code will print http://example.com/?query_id=9&user_id=4&token_id=4JGO4I394HD83E" id="838".
Note also that these regexp usesthe wildcard .. Depending on whether you are sure that the strings you're working with contains only valid URLs, you may want to change this. For instance, a domain name can only contain ASCII characters. You may want to look at the \w special sequence with the ASCII flag of the re module.
Python beginner here. I'm stumped on part of this code for a bot I'm writing.
I am making a reddit bot using Praw to comb through posts and removed a specific set of characters (steam CD keys).
I made a test post here: https://www.reddit.com/r/pythonforengineers/comments/91m4l0/testing_my_reddit_scraping_bot/
This should have all the formats of keys.
Currently, my bot is able to find the post using a regex expression. I have these variables:
steamKey15 = (r'\w\w\w\w\w.\w\w\w\w\w.\w\w\w\w\w')
steamKey25 = (r'\w\w\w\w\w.\w\w\w\w\w.\w\w\w\w\w.\w\w\w\w\w.\w\w\w\w\w.')
steamKey17 = (r'\w\w\w\w\w\w\w\w\w\w\w\w\w\w\w\s\w\w')
I am finding the text using this:
subreddit = reddit.subreddit('pythonforengineers')
for submission in subreddit.new(limit=20):
if submission.id not in steamKeyPostID:
if re.search(steamKey15, submission.selftext, re.IGNORECASE):
searchLogic()
saveSteamKey()
So this is just to show that the things I should be using in a filter function is a combination of steamKey15/25/17, and submission.selftext.
So here is the part where I am confused. I cant find a function that works, or is doing what I want. My goal is to remove all the text from submission.selftext(the body of the post) BUT the keys, which will eventually be saved in a .txt file.
Any advice on a good way to go around this? I've looked into re.sub and .translate but I don't understand how the parts fit together.
I am using Python 3.7 if it helps.
can't you just get the regexp results?
m = re.search(steamKey15, submission.selftext, re.IGNORECASE)
if m:
print(m.group(0))
Also note that a dot . means any char in a regexp. If you want to match only dots, you should use \.. You can probably write your regexp like this instead:
r'\w{5}[-.]\w{5}[-.]\w{5}'
This will match the key when separated by . or by -.
Note that this will also match anything that begin or end with a key, or has a key in the middle - that can cause you problems as your 15-char key regexp is contained in the 25-key one! To fix that use negative lookahead/negative lookbehind:
r'(?<![\w.-])\w{5}[-.]\w{5}[-.]\w{5}(?![\w.-])'
that will only find the keys if there are no extraneous characters before and after them
Another hint is to use re.findall instead of re.search - some posts contain more than one steam key in the same post! findall will return all matches while search only returns the first one.
So a couple things first . means any character in regex. I think you know that, but just to be sure. Also \w\w\w\w\w can be replaced with \w{5} where this specifies 5 alphanumerics. I would use re.findall.
import re
steamKey15 = (r'(?:\w{5}.){2}\w{5}')
steamKey25 = (r'(?:\w{5}.){5}')
steamKey17 = (r'\w{15}\s\w\w')
subreddit = reddit.subreddit('pythonforengineers')
for submission in subreddit.new(limit=20):
if submission.id not in steamKeyPostID:
finds_15 = re.findall(steamKey15, submission.selftext)
finds_25 = re.findall(steamKey25, submission.selftext)
finds_17 = re.findall(steamKey17, submission.selftext)
I have a string and I want to search this string to find special pattern containing URL and it's name and then I need to change it's format:
Input string:
'Thsi is my [site](http://example.com/url) you can watch it.'
Output string:
'This is my site you can watch it.'
The string may have several URLs and I need to change the format of every one and site is in unicode and can be every character in any language.
What pattern should be used and how I can do it?
This should help
import re
A = 'Thsi is my [site](http://example.com/url) you can watch it.'
site = re.compile( "\[(.*)\]" ).search(A).group(1)
url = re.compile( "\((.*)\)" ).search(A).group(1)
print A.replace("[{0}]".format(site), "").replace("({0})".format(url), '{1}'.format(url, site))
Output:
Thsi is my site you can watch it.
Update as request in Comments:
s = 'my [site](site.com) is about programing (python language)'
site, url = s[s.find("[")+1:s.find(")")].split("](")
print s.replace("[{0}]".format(site), "").replace("({0})".format(url), '{1}'.format(url, site))
Output:
my site is about programing (python language)
I'm not a markdown expert, but if this is indeed markdown that you're trying to replace, and not your own syntax, you should use an appropriate parser. Note that, if you paste your string directly into stackoverflow - which also uses markdown - it will be transformed into a link, so it would clearly be valid markdown.
If it is indeed your own format, however, try the following to transform
'This is my [site](http://example.com/url) you can watch it.'
into
'This is my site you can watch it.'
using the following match:
\[(.*?)\]\((.*?)\)
and the following replacement regex:
<a href="\\2">\\1<\/a>
In python, re.sub(match, replace, stringThatYouWantToReplaceStuffIn) should do the trick. Don't forget to assign the return value of re.sub to whatever variable should contain the new string.
I am a newbie in python and I am trying to cut piece of string in another string at python.
I looked at other similar questions but I could not find my answer.
I have a variable which contain a domain list which the domains look like this :
http://92.230.38.21/ios/Channel767/Hotbird.mp3
http://92.230.38.21/ios/Channel9798/Coldbird.mp3
....
I want the mp3 file name (in this example Hotbird, Coldbird etc)
I know I must be able to do it with re.findall() but I have no idea about regular expressions I need to use.
Any idea?
Update:
Here is the part I used:
for final in match2:
netname=re.findall('\W+\//\W+\/\W+\/\W+\/\W+', final)
print final
print netname
Which did not work. Then I tried to do this one which only cut the ip address (92.230.28.21) but not the name:
for final in match2:
netname=re.findall('\d+\.\d+\.\d+\.\d+', final)
print final
You may just use str.split():
>>> urls = ["http://92.230.38.21/ios/Channel767/Hotbird.mp3", "http://92.230.38.21/ios/Channel9798/Coldbird.mp3"]
>>> for url in urls:
... print(url.split("/")[-1].split(".")[0])
...
Hotbird
Coldbird
And here is an example regex-based approach:
>>> import re
>>>
>>> pattern = re.compile(r"/(\w+)\.mp3$")
>>> for url in urls:
... print(pattern.search(url).group(1))
...
Hotbird
Coldbird
where we are using a capturing group (\w+) to capture the mp3 filename consisting of one or more aplhanumeric characters which is followed by a dot, mp3 at the end of the url.
How about ?
([^/]*mp3)$
I think that might work
Basically it says...
Match from the end of the line, start with mp3, then match everything back to the first slash.
Think it will perform well.
I am developing a web scraper code. The basic thing which I am retrieving is email address from the HTML source. I am using the following code
r = re.compile(r"\b[A-Z0-9._%+-]+#[A-Z0-9.-]+\.[A-Z]{2,6}\b", re.IGNORECASE)
emailAddresses = r.findall(html)
In few websites the email address is in the format abcd[at]gmail.com/abcd(at)gmail.com. I need a generic regex code which will retrieve email address in either of the three formats abcd[at]gmail.com/abcd(at)gmail.com/abcd#gmail.com. I tried the following code, but didn't get expected result. Can any one help me.
r = re.compile(r"\b[A-Z0-9._%+-]+[#|(at)|[at]][A-Z0-9.-]+\.[A-Z]{2,6}\b", re.IGNORECASE)
emailAddresses = r.findall(html)
Solution: Replace # by (#|\(at\)|\[at\]) as such:
r = re.compile(r"\b[A-Z0-9._%+-]+(#|\(at\)|\[at\])[A-Z0-9.-]+\.[A-Z]{2,6}\b", re.IGNORECASE)
emailAddresses = r.findall(html)
Explanation: In your attempt, you did [one|two|three], you cannot do that. […] is used for single characters or for sets ([a-z] is the same as [abcd…xyz]). You must use (one|two|three) instead. [1]
Also, you attempt to match () and [] which are all special characters regarding to REGEX, so they have special functionality. If you want to actually match them (and not using their special functionality), you must remember to escape them before by putting a \ in front of them. Same goes for .?+* etc.
Suggestion: You can also try to match [dot] and (dot) that very same way if you wish so.
Just remember that there are a ton of way to obfuscate email addresses out there, including some you might not be aware of.
And that, also, validating email addresses (and so trying to catch them with REGEX) can be very tricky:
The actual official REGEX is (?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")#(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\]).
(EDIT: Source: http://www.ex-parrot.com/pdw/Mail-RFC822-Address.html Looks like it could be even worse than the above REGEX!!)
[1] Beware that using (…) will capture its content, if you wish this content not being captured you have to use (?:…) instead.
r = re.compile(r"\b[A-Z0-9._%+-]+(?:#|[(\[]at[\])])[A-Z0-9.-]+\.[A-Z]{2,6}\b", re.IGNORECASE)
^^^^^^^^^^^^^^^^^^
emailAddresses = r.findall(html)
See demo.
https://regex101.com/r/nD5jY4/5#python