I am new to Python and I am trying to extract a piece of one string from another string.
I looked at other similar questions but I could not find my answer.
I have a variable that contains a list of URLs, which look like this:
http://92.230.38.21/ios/Channel767/Hotbird.mp3
http://92.230.38.21/ios/Channel9798/Coldbird.mp3
....
I want the mp3 file name (in this example Hotbird, Coldbird, etc.).
I know I should be able to do it with re.findall(), but I have no idea which regular expression I need.
Any idea?
Update:
Here is the part I used:
for final in match2:
    netname=re.findall('\W+\//\W+\/\W+\/\W+\/\W+', final)
    print final
    print netname
That did not work. Then I tried this one, which only extracts the IP address (92.230.38.21) but not the name:
for final in match2:
    netname=re.findall('\d+\.\d+\.\d+\.\d+', final)
    print final
You may just use str.split():
>>> urls = ["http://92.230.38.21/ios/Channel767/Hotbird.mp3", "http://92.230.38.21/ios/Channel9798/Coldbird.mp3"]
>>> for url in urls:
...     print(url.split("/")[-1].split(".")[0])
...
Hotbird
Coldbird
And here is an example regex-based approach:
>>> import re
>>>
>>> pattern = re.compile(r"/(\w+)\.mp3$")
>>> for url in urls:
...     print(pattern.search(url).group(1))
...
Hotbird
Coldbird
where we are using a capturing group (\w+) to capture the mp3 filename, consisting of one or more alphanumeric characters or underscores, which is followed by a dot and mp3 at the end of the URL.
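A third option, not taken from the answer above, is to let the standard library split the URL and strip the extension. This is only a sketch: the helper name file_stem is made up for illustration, and it assumes the file name is always the last path component.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

def file_stem(url):
    """Return the last path component of a URL without its extension."""
    path = urlparse(url).path           # e.g. '/ios/Channel767/Hotbird.mp3'
    return PurePosixPath(path).stem     # drops the directory and the final suffix

urls = [
    "http://92.230.38.21/ios/Channel767/Hotbird.mp3",
    "http://92.230.38.21/ios/Channel9798/Coldbird.mp3",
]
stems = [file_stem(u) for u in urls]
print(stems)
```

Unlike the \w+ pattern, this also copes with names that contain dashes or extra dots, and it ignores any query string after the file name.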
How about this?
([^/]*mp3)$
I think that might work.
Basically it says: anchor at the end of the line, require mp3 there, and match everything back to the last slash. Note that the captured text includes the .mp3 extension.
I think it will perform well.
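To see what that pattern actually captures, here is a small sketch run against one of the question's URLs. The second pattern is my own variation (an assumption about what the asker wants), which moves the extension outside the group.

```python
import re

url = "http://92.230.38.21/ios/Channel767/Hotbird.mp3"

# Pattern from the answer: the group includes the extension.
with_ext = re.search(r"([^/]*mp3)$", url).group(1)

# Variation: keep ".mp3" outside the group so only the name is captured.
without_ext = re.search(r"([^/]*)\.mp3$", url).group(1)

print(with_ext)
print(without_ext)
```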
I'm trying to match the following URL by its query string in an HTML page using Python, but I have not been able to solve it. I'm a newbie in Python.
<a href="http://example.com/?query_id=9&user_id=49&token_id=4JGO4I394HD83E" id="838"/>
I want to match the above URL with &user_id=[any_digit_from_0_to_99]& and print this URL on the screen.
A URL without this &user_id=[any_digit_from_0_to_99]& should not match.
Here's my horrible, incomplete regex:
https?:\/\/.{0,30}\.+[a-zA-Z0-9\/?_+=]{0,30}&user_id=[0-9][0-9]&.*?"
I know this code has many problems, but it somehow manages to match the above URL up to the closing double quote (").
Complete code would look like this:
import re
reg = re.compile(r'https?:\/\/.{0,30}\.+[a-zA-Z0-9\/?_+=]{0,30}&user_id=[0-9][0-9]&.*?"')
str = '<a href="http://example.com/?query_id=9&user_id=49&token_id=4JGO4I394HD83E" id="838"/>'
result = reg.search(str)
result = result.group()
print(result)
Output:
$ python reg.py
http://example.com/?query_id=9&user_id=49&token_id=4JGO4I394HD83E"
It includes the " at the end of the URL, and I know this is not a good regex. I would like a better version of the code above.
A few remarks can be made on your regexp:
/ is not a special re character, so there's no need to escape it.
Is the 30-character limit on the domain intentional? If not, you can match as many characters as you want with .*
Do you know that the string you're working with contains a valid URL? If not, there are some things you can do, like ensuring the domain is at least 4 characters long, contains a period which is not the last character, etc.
The [0-9][0-9] part requires exactly two digits, so it rejects single-digit values like 4 and accepts values like 04, which is not strictly speaking a number between 0 and 99.
Taking this into account, you can design this simpler regex:
import re
reg = re.compile(r"https?://.*&user_id=[1-9][0-9]?&")
# Named html rather than str, to avoid shadowing the built-in str type.
html = '<a href="http://example.com/?query_id=9&user_id=49&token_id=4JGO4I394HD83E" id="838"/>'
result = reg.search(html)
result = result.group()
print(result)
Using this regex on your example will print 'http://example.com/?query_id=9&user_id=49&', without the " at the end. If you want the full URL, then you can look for the /> symbol:
reg = re.compile(r"https?://.*&user_id=[1-9][0-9]?&.*/>")
html = '<a href="http://example.com/?query_id=9&user_id=49&token_id=4JGO4I394HD83E" id="838"/>'
result = reg.search(html)
result = result.group()[:-2]
print(result)
Note the [:-2], which is used to remove the /> symbol. In that case, this code will print http://example.com/?query_id=9&user_id=49&token_id=4JGO4I394HD83E" id="838", which still contains the closing quote and the id attribute.
Note also that these regexps use the wildcard . (dot). Depending on whether you are sure that the strings you're working with contain only valid URLs, you may want to change this. For instance, a domain name can only contain ASCII characters. You may want to look at the \w special sequence with the ASCII flag of the re module.
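If the URL has already been pulled out of the href attribute, the check on user_id can also be done without matching the whole URL by regex. The following is a rough sketch under that assumption. The function name has_small_user_id is hypothetical, and unlike the patterns above it does not require another parameter to follow user_id.

```python
import re
from urllib.parse import urlparse, parse_qs

# 0..99 without leading zeros: an optional tens digit 1-9, then one digit.
USER_ID = re.compile(r"[1-9]?[0-9]")

def has_small_user_id(url):
    """True if the query string carries a user_id between 0 and 99."""
    values = parse_qs(urlparse(url).query).get("user_id", [])
    return any(USER_ID.fullmatch(v) for v in values)

url = "http://example.com/?query_id=9&user_id=49&token_id=4JGO4I394HD83E"
if has_small_user_id(url):
    print(url)
```

Using fullmatch on the parsed value means 149 or 04 are rejected without having to anchor on the surrounding & characters.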
I have a string like this:
/path/to/file?_subject_ID_SOMEOTHERSTRING
The path/to/file part changes depending on the situation, but _subject_ID is always there. I am trying to write a regex that extracts only the file part of the string. Anchoring on ?_subject_ID is reliable, but I don't know how to safely get the file.
My current regex looks like (.*[\/]).*\?_subject_ID
url = '/path/to/file?_subject_ID_SOMEOTHERSTRING'
file_re = re.compile('(.*[\/]).*\?_subject_ID')
file_re.search(url)
This finds the right string, but I still can't extract the file name.
Printing _.group(1) gives me /path/to/. What is the next step to get the actual file name?
As for your '(.*[\/]).*\?_subject_ID' regex approach, you just need to add a capturing group around the second .*. You could use r'(.*/)(.*)\?_subject_ID' (then, there will be .group(1) and .group(2) parts captured), but it is not the most appropriate way to parse URLs in Python.
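As a quick check of the two-group pattern mentioned above, here is a minimal sketch using the string from the question. The variable names directory and filename are mine.

```python
import re

url = "/path/to/file?_subject_ID_SOMEOTHERSTRING"

# The first group is greedy and runs up to the last "/", so the second
# group is left with whatever sits between that slash and "?_subject_ID".
match = re.search(r"(.*/)(.*)\?_subject_ID", url)
directory, filename = match.group(1), match.group(2)

print(directory)
print(filename)
```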
You may use the non-regex approach here, here is a snippet showing how to leverage urlparse and os.path to parse the URL like yours:
import os.path
from urllib.parse import urlparse  # Python 3; in Python 2 this lived in the urlparse module
path = urlparse('/path/to/file?_subject_ID_SOMEOTHERSTRING').path
print(os.path.split(path)[1]) # => file
print(os.path.split(path)[0]) # => /path/to
It's pretty simple, really. Just match everything after the last / and before ?_subject_ID:
([^/?]*)\?_subject_ID
The [^/?]* (as opposed to .*) is because otherwise it'd match the part before, too. The ? in the character class keeps the capture from running past the ? that starts the query string.
If you want to get both the path and the file, you can do much the same thing, but also grab the part up to and including the last /:
([^?]*/)([^/?]*)\?_subject_ID
It's basically the same as the one before but with the first bit captured instead of ignored.
The program I am currently working on retrieves URLs from a website and puts them into a list. What I want to get is the last section of the URL.
So, if the first element in my list of URLs is "https://docs.python.org/3.4/tutorial/interpreter.html" I would want to remove everything before "interpreter.html".
Is there a function, library, or regex I could use to make this happen? I've looked at other Stack Overflow posts but the solutions don't seem to work.
These are two of my several attempts:
for link in link_list:
    file_names.append(link.replace('/[^/]*$',''))
print(file_names)
&
for link in link_list:
    file_names.append(link.rpartition('//')[-1])
print(file_names)
Have a look at str.rsplit.
>>> s = 'https://docs.python.org/3.4/tutorial/interpreter.html'
>>> s.rsplit('/',1)
['https://docs.python.org/3.4/tutorial', 'interpreter.html']
>>> s.rsplit('/',1)[1]
'interpreter.html'
And to use RegEx
>>> re.search(r'(.*)/(.*)',s).group(2)
'interpreter.html'
The first (.*) is greedy, so it consumes everything up to the last /. The second group is then whatever lies between the last / and the end of the string.
Small Note - The problem with link.rpartition('//')[-1] in your code is that you are trying to match // and not /. So remove the extra / as in link.rpartition('/')[-1].
That doesn't need regex.
import os
for link in link_list:
    file_names.append(os.path.basename(link))
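One caveat with calling basename on the raw URL is that a query string or fragment stays attached to the result. A possible way around that, sketched here with a made-up helper name url_basename and a made-up second URL, is to parse the URL first and take the basename of its path only.

```python
import posixpath
from urllib.parse import urlparse

def url_basename(url):
    """Last path component of a URL, ignoring query string and fragment."""
    return posixpath.basename(urlparse(url).path)

link_list = [
    "https://docs.python.org/3.4/tutorial/interpreter.html",
    "https://example.com/a/b.html?x=/y#frag",  # hypothetical URL with a query
]
file_names = [url_basename(link) for link in link_list]
print(file_names)
```

posixpath is used instead of os.path so that the split always happens on /, whatever the operating system.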
You can use rpartition():
>>> s = 'https://docs.python.org/3.4/tutorial/interpreter.html'
>>> s.rpartition('/')
('https://docs.python.org/3.4/tutorial', '/', 'interpreter.html')
And take the last part of the 3 element tuple that is returned:
>>> s.rpartition('/')[2]
'interpreter.html'
Just use str.split:
url = "/some/url/with/a/file.html"
print(url.split("/")[-1])
# Result should be "file.html"
split gives you a list of the strings that were separated by "/". The [-1] gives you the last element of the list, which is what you want.
Here's a more general, regex way of doing this:
re.sub(r'^.+/([^/]+)$', r'\1', "http://test.org/3/files/interpreter.html")
'interpreter.html'
This should work if you plan to use regex (note that str.replace() only does literal substitution, so re.sub() is needed):
import re
for link in link_list:
    file_names.append(re.sub(r'.*/', '', link))
print(file_names)
I'm writing a regex to find YouTube links (there can be several) in a piece of HTML text posted by a user.
Currently I'm using the following regex to change 'http://www.youtube.com/watch?v=-JyZLS2IhkQ' into displaying the corresponding youtube video:
return re.compile('(http(s|):\/\/|)(www.|)youtube.(com|nl)\/watch\?v\=([a-zA-Z0-9-_=]+)').sub(tag, value)
(where the variable 'tag' is a bit of html so the video works and 'value' a user post)
Now this works, until the URL looks like this:
'http://www.youtube.com/watch?v=-JyZLS2IhkQ&feature...'
I'm hoping you can help me figure out how to also match the '&feature...' part so that it disappears.
Example HTML:
No replies to this post..
Youtube vid:
http://www.youtube.com/watch?v=-JyZLS2IhkQ
More blabla
Thanks for your thoughts, much appreciated
Stefan
Here is how I solve it:
import re
def youtube_url_validation(url):
    youtube_regex = (
        r'(https?://)?(www\.)?'
        r'(youtube|youtu|youtube-nocookie)\.(com|be)/'
        r'(watch\?v=|embed/|v/|.+\?v=)?([^&=%\?]{11})')
    # re.match returns a match object, or None if the URL does not match.
    return re.match(youtube_regex, url)
TESTS:
youtube_urls_test = [
    'http://www.youtube.com/watch?v=5Y6HSHwhVlY',
    'http://youtu.be/5Y6HSHwhVlY',
    'http://www.youtube.com/embed/5Y6HSHwhVlY?rel=0" frameborder="0"',
    'https://www.youtube-nocookie.com/v/5Y6HSHwhVlY?version=3&hl=en_US',
    'http://www.youtube.com/',
    'http://www.youtube.com/?feature=ytca']
for url in youtube_urls_test:
    m = youtube_url_validation(url)
    if m:
        print('OK {}'.format(url))
        print(m.groups())
        print(m.group(6))
    else:
        print('FAIL {}'.format(url))
You should specify your regular expressions as raw strings.
You don't have to escape every character that looks special, just the ones which are.
Instead of specifying an empty branch ((foo|)) to make something optional, you can use ?.
If you want to include - in a character set, you have to escape it or put it right after the opening bracket or right before the closing one.
You can use special character sets like \w (equals [a-zA-Z0-9_]) to shorten your regex.
r'(https?://)?(www\.)?youtube\.(com|nl)/watch\?v=([-\w]+)'
Now, in order to match the whole URL, you have to think about what can or cannot follow it in the input. Then you put that into a lookahead group (you don't want to consume it).
In this example I took everything except -, =, %, & and alphanumerical characters to end the URL (too lazy to think about it any harder).
Everything between the v-argument and the end of the URL is non-greedily consumed by .*?.
r'(https?://)?(www\.)?youtube\.(com|nl)/watch\?v=([\w-]+)(&.*?)?(?=[^-\w&=%])'
Still, I would not put too much faith into this general solution. User input is notoriously hard to parse robustly.
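To make the effect of the lookahead concrete, here is a small sketch that runs the pattern over a post like the one in the question. Two things in it are my own assumptions: the "|$" added to the lookahead, so that a URL at the very end of the text still matches, and the embed_tag replacement function, which stands in for the asker's real tag.

```python
import re

# The answer's pattern, plus "|$" in the lookahead (my addition) so that a
# URL sitting at the very end of the text is still matched.
YOUTUBE_URL = re.compile(
    r'(https?://)?(www\.)?youtube\.(com|nl)/watch\?v=([\w-]+)'
    r'(&.*?)?(?=[^-\w&=%]|$)'
)

def embed_tag(match):
    # Hypothetical replacement; group 4 is the video id.
    return '[video:{}]'.format(match.group(4))

post = ("Youtube vid: http://www.youtube.com/watch?v=-JyZLS2IhkQ&feature=related\n"
        "More blabla")
cleaned = YOUTUBE_URL.sub(embed_tag, post)
print(cleaned)
```

Because the optional (&.*?) group is part of the match, the trailing &feature=related is consumed and replaced along with the rest of the URL, while the text after it is left alone.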
What if you used the urlparse module (urllib.parse in Python 3) to pick apart the YouTube address you find and put it back into the format you want? You could then simplify your regex so that it only finds the entire URL, and then let urlparse do the heavy lifting of picking it apart for you.
# Python 3 imports; in Python 2 these came from the urlparse and urllib modules.
from urllib.parse import urlparse, parse_qs, urlunparse, urlencode
youtube_url = urlparse('http://www.youtube.com/watch?v=aFNzk7TVUeY&feature=grec_index')
params = parse_qs(youtube_url.query)
new_params = {'v': params['v'][0]}  # keep only the video id
cleaned_youtube_url = urlunparse((youtube_url.scheme,
                                  youtube_url.netloc,
                                  youtube_url.path,
                                  '',  # no path parameters
                                  urlencode(new_params),
                                  youtube_url.fragment))
It's a bit more code, but it allows you to avoid regex madness.
And as hop said, you should use raw strings for the regex.
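The same idea can be wrapped in a small reusable function. This is only a sketch: keep_params is a hypothetical helper, and it relies on the fact that urlparse returns a named tuple, so the query can be swapped with _replace and the URL rebuilt with geturl().

```python
from urllib.parse import urlparse, parse_qsl, urlencode

def keep_params(url, keep=("v",)):
    """Rebuild url with only the query parameters named in keep."""
    parts = urlparse(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query) if k in keep]
    return parts._replace(query=urlencode(kept)).geturl()

cleaned_url = keep_params(
    "http://www.youtube.com/watch?v=aFNzk7TVUeY&feature=grec_index")
print(cleaned_url)
```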
Here's how I implemented it in my script:
string = "Hey, check out this video: https://www.youtube.com/watch?v=bS5P_LAqiVg"
youtube = re.findall(r'(https?://)?(www\.)?((youtube\.(com))/watch\?v=([-\w]+)|youtu\.be/([-\w]+))', string)
if youtube:
    print(youtube)
That outputs:
[('https://', 'www.', 'youtube.com/watch?v=bS5P_LAqiVg', 'youtube.com', 'com', 'bS5P_LAqiVg', '')]
If you just wanted to grab the video id, for example, you would do:
video_id = [c for c in youtube[0] if c] # Drop the empty strings from the first match
video_id = video_id[-1] # The last remaining item is the video id
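An alternative that avoids filtering empty groups is to use non-capturing groups for everything except the id. This is a sketch of that idea rather than the author's code. The named group id and the second URL in the sample text are my own additions.

```python
import re

YOUTUBE_ID = re.compile(
    r'(?:https?://)?(?:www\.)?'                 # optional scheme and www.
    r'(?:youtube\.com/watch\?v=|youtu\.be/)'    # long or short link form
    r'(?P<id>[-\w]+)'                           # the only captured part
)

text = ("Hey, check out this video: https://www.youtube.com/watch?v=bS5P_LAqiVg "
        "and https://youtu.be/5Y6HSHwhVlY")
ids = [m.group("id") for m in YOUTUBE_ID.finditer(text)]
print(ids)
```

With a single named group there is nothing to clean up afterwards, and finditer handles several links in one post.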
Sorry, I know this is probably a duplicate but having searched for 'python regular expression match between' I haven't found anything that answers my question!
The document (which to make clear, is a long HTML page) I'm searching has a whole bunch of strings in it (inside a JavaScript function) that look like this:
link: '/Hidden/SidebySideGreen/dei1=1204970159862'};
link: '/Hidden/SidebySideYellow/dei1=1204970159862'};
I want to extract the links (i.e. everything between quotes within these strings) - e.g. /Hidden/SidebySideYellow/dei1=1204970159862
To get the links, I know I need to start with:
re.matchall(regexp, doc_sting)
But what should regexp be?
The answer to your question depends on what the rest of the string looks like. If the lines all look like link: '<URL>'}; then you can do it very simply using plain string manipulation:
myString = "link: '/Hidden/SidebySideGreen/dei1=1204970159862'};"
print( myString[7:-3] )
(If you have one string containing multiple such lines, you can just split it into lines first.)
If it is a bit more complex, though, using regular expressions is fine. One example that just looks for the URL inside the quotes would be:
import re
myDoc = """link: '/Hidden/SidebySideGreen/dei1=1204970159862'};
link: '/Hidden/SidebySideYellow/dei1=1204970159862'};"""
print( re.findall( "'([^']+)'", myDoc ) )
Depending on how the whole string looks, you might have to include the link: as well:
print( re.findall( "link: '([^']+)'", myDoc ) )
I'd start with:
regexp = "'([^']+)'"
And check if it works okay. If the only condition is that the string sits on one line between '', it should be good as it is.
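Whether the bare quote pattern is enough depends on what else the page contains. The sketch below adds a made-up title line to the sample (an assumption, it is not in the question) to show the difference between matching any quoted string and anchoring on link:.

```python
import re

# The "title" line is hypothetical, added only to show a false positive.
doc = """title: 'Side by side'};
link: '/Hidden/SidebySideGreen/dei1=1204970159862'};
link: '/Hidden/SidebySideYellow/dei1=1204970159862'};"""

any_quoted = re.findall(r"'([^']+)'", doc)          # every quoted string
links_only = re.findall(r"link:\s*'([^']+)'", doc)  # only those after link:

print(any_quoted)
print(links_only)
```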
Use a few simple splits
>>> s="link: '/Hidden/SidebySideGreen/dei1=1204970159862'};"
>>> s.split("'")
['link: ', '/Hidden/SidebySideGreen/dei1=1204970159862', '};']
>>> for i in s.split("'"):
...     if "/" in i:
...         print(i)
...
/Hidden/SidebySideGreen/dei1=1204970159862
>>>