I am using Python and Flask, and I have some YouTube URLs I need to convert to their embed versions. For example, this:
https://www.youtube.com/watch?v=X3iFhLdWjqc
has to be converted into this:
https://www.youtube.com/embed/X3iFhLdWjqc
Should I use Regexp, or is there a Flask method to convert the URLs?
Assuming your URLs are just strings, you don't need regexes or special Flask functions to do it.
This code will replace a YouTube watch URL with the embedded version, based on how you said it should be handled:
url = "https://youtube.com/watch?v=TESTURLNOTTOBEUSED"
url = url.replace("watch?v=", "embed/")
All you have to do is replace url with whatever variable you store the URL in.
To do this for a list, use:
new_url_list = []
for address in old_url_list:
    new_address = address.replace("watch?v=", "embed/")
    new_url_list.append(new_address)
old_url_list = new_url_list
where old_url_list is the list which your URLs are included in.
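The same loop can also be written as a list comprehension; a minimal sketch with a one-item sample list:

```python
old_url_list = ["https://www.youtube.com/watch?v=X3iFhLdWjqc"]

# one-line equivalent of the replace loop above
new_url_list = [address.replace("watch?v=", "embed/") for address in old_url_list]

print(new_url_list)  # ['https://www.youtube.com/embed/X3iFhLdWjqc']
```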
There are two types of YouTube links:
http://www.youtube.com/watch?v=xxxxxxxxxxx
or
http://youtu.be/xxxxxxxxxxx
Use this function; it works with both kinds of YouTube links:
import re

def embed_url(video_url):
    # https? also matches plain http:// links, so the whole URL gets replaced
    regex = r"(?:https?:\/\/)?(?:www\.)?(?:youtube\.com|youtu\.be)\/(?:watch\?v=)?(.+)"
    return re.sub(regex, r"https://www.youtube.com/embed/\1", video_url)
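A quick self-contained check of the approach (using https? in the scheme so plain http:// links are rewritten in full):

```python
import re

def embed_url(video_url):
    # optional scheme and www., then either host form, then an optional watch?v=
    regex = r"(?:https?:\/\/)?(?:www\.)?(?:youtube\.com|youtu\.be)\/(?:watch\?v=)?(.+)"
    return re.sub(regex, r"https://www.youtube.com/embed/\1", video_url)

print(embed_url("https://www.youtube.com/watch?v=X3iFhLdWjqc"))
print(embed_url("http://youtu.be/X3iFhLdWjqc"))
# both print https://www.youtube.com/embed/X3iFhLdWjqc
```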
You can try this:
import re
videoUrl = "https://www.youtube.com/watch?v=X3iFhLdWjqc"
embedUrl = re.sub(r"(?ism).*?=(.*?)$", r"https://www.youtube.com/embed/\1", videoUrl)
print(embedUrl)
Output:
https://www.youtube.com/embed/X3iFhLdWjqc
I'm still a newbie in Python but I'm trying to make my first little program.
My intention is to print only the link ending with .m3u8 (if available) instead of printing the whole web page.
The code I'm currently using:
import requests
channel1 = requests.get('https://website.tv/user/111111')
print(channel1.content)
print('\n')
channel2 = requests.get('https://website.tv/user/222222')
print(channel2.content)
print('\n')
input('Press Enter to Exit...')
The link I'm looking for always has 47 characters in total, and it always follows the same pattern, just with a changing stream id (represented as X here):
https://website.tv/live/streamidXXXXXXXXX.m3u8
Can anyone help me?
You can use a regex for this problem.
Explanation: in the pattern, .*? lazily matches everything leading up to the mandatory literal suffix \.m3u8 (the dot is escaped so it matches an actual dot), and the trailing \b anchors the end of the extension.
For example:
import re

link = "https://website.tv/live/streamidXXXXXXXXX.m3u8"
p = re.findall(r'.*?\.m3u8\b', link)
print(p)
OUTPUT:
['https://website.tv/live/streamidXXXXXXXXX.m3u8']
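The same idea applied to a whole page body: anchoring the pattern on the URL scheme instead avoids capturing leading junk (the page snippet below is made up for illustration):

```python
import re

page = 'menu <div> https://website.tv/live/streamid123456789.m3u8 </div> footer'

# non-greedy run of non-whitespace between the scheme and the .m3u8 extension
matches = re.findall(r'https?://\S+?\.m3u8', page)
print(matches)  # ['https://website.tv/live/streamid123456789.m3u8']
```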
There are a few ways to go about this. One that springs to mind, which others have touched upon, is using regex with findall, which returns a list of matched URLs from our url_list.
Another option could be BeautifulSoup, but without more information about the HTML structure it may not be the best tool here.
Using Regex
from re import findall
from requests import get

def check_link(response):
    # findall returns every .m3u8 match in the page body
    result = findall(
        r'.*?\.m3u8\b',
        str(response.content),
    )
    return result

def main(url):
    response = get(url)
    if response.ok:
        link_found = check_link(response)
        if link_found:
            print('link {} found at {}'.format(
                link_found,
                url,
            ))

if __name__ == '__main__':
    url_list = [
        'http://www.test_1.com',
        'http://www.test_2.com',
        'http://www.test_3.com',
    ]
    for url in url_list:
        main(url)
    print("All finished")
If I understand your question correctly, I think you want Python's .split() string method. If your goal is to take a string like "https://website.tv/live/streamidXXXXXXXXX.m3u8" and extract just "streamidXXXXXXXXX.m3u8", you could do that with the following code:
web_address = "https://website.tv/live/streamidXXXXXXXXX.m3u8"
specific_file = web_address.split('/')[-1]
print(specific_file)
Calling .split('/') on the string like that returns a list of strings, where each item in the list is a different part of the string (the first part being "https:", etc.). The last of these (index [-1]) is the file name you want.
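If you would rather lean on the standard library's URL handling, urllib.parse gives the same result via the path component (a sketch, not the only way to do it):

```python
from urllib.parse import urlsplit

web_address = "https://website.tv/live/streamidXXXXXXXXX.m3u8"
path = urlsplit(web_address).path        # '/live/streamidXXXXXXXXX.m3u8'
specific_file = path.rsplit('/', 1)[-1]  # last path segment
print(specific_file)  # streamidXXXXXXXXX.m3u8
```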
This will extract all URLs from the webpage and keep only those that contain your required keyword ".m3u8":
import requests
import re

def get_desired_url(data):
    urls = []
    for url in re.findall(r'(https?://\S+)', data):
        if ".m3u8" in url:
            urls.append(url)
    return urls

channel1 = requests.get('https://website.tv/user/111111')
urls = get_desired_url(channel1.text)
Try this; I think it will be robust:
import re

# capture just the href value of anchor tags whose target ends in .m3u8
links = re.findall(r'<[ ]*a[ ]+[^>]*href[ ]*=[ ]*"(https?://[^"]+\.m3u8)"', channel2.text)
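Since the standard library ships an HTML parser, here is a sketch that avoids regex-over-HTML entirely, assuming the links sit in ordinary <a href> tags:

```python
from html.parser import HTMLParser

class M3U8LinkParser(HTMLParser):
    """Collects href values of <a> tags that end in .m3u8."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value and value.endswith('.m3u8'):
                    self.links.append(value)

parser = M3U8LinkParser()
parser.feed('<a href="https://website.tv/live/streamid123456789.m3u8">stream</a>')
print(parser.links)  # ['https://website.tv/live/streamid123456789.m3u8']
```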
I've created a script in Python using regular expressions to parse emails from a few websites. The pattern I've used to grab an email is \w+@\w+\.{1}\w+, which works in most cases. However, trouble comes up when it encounters items like 8b4e078a51d04e0e9efdf470027f0ec1@sentry.wixpress, Slice_1@2x.png, etc. The pattern grabs them as well, which I would like to get rid of.
I've tried with:
import re
import requests
pattern = r'\w+@\w+\.{1}\w+'
urls = (
    'https://rainforestfarms.org/contact',
    'https://www.auucvancouver.ca/',
    'http://www.bcla.bc.ca/',
    'http://www.palstudiotheatre.com/',
)
def get_email(link, pattern):
    res = requests.get(link)
    email = re.findall(pattern, res.text)
    if email:
        return link, email[0]
    else:
        return link

if __name__ == '__main__':
    for link in urls:
        print(get_email(link, pattern))
Output I'm getting:
('https://rainforestfarms.org/contact', 'rainforestfarmsllc@gmail.com')
('https://www.auucvancouver.ca/', '8b4e078a51d04e0e9efdf470027f0ec1@sentry.wixpress')
('http://www.bcla.bc.ca/', 'Slice_1@2x.png')
('http://www.palstudiotheatre.com/', 'theatre@palvancouver.org')
Output I wish to get:
('https://rainforestfarms.org/contact', 'rainforestfarmsllc@gmail.com')
https://www.auucvancouver.ca/
http://www.bcla.bc.ca/
('http://www.palstudiotheatre.com/', 'theatre@palvancouver.org')
How can I get rid of unwanted items using regex?
It depends what you mean by "unwanted".
One way to define them is to use a whitelist of allowed domain suffixes, for example 'org', 'com', etc.:
import re
import requests
pattern = r'\w+@\w+\.(?:com|org)'
urls = (
    'https://rainforestfarms.org/contact',
    'https://www.auucvancouver.ca/',
    'http://www.bcla.bc.ca/',
    'http://www.palstudiotheatre.com/',
)
def get_email(link, pattern):
    res = requests.get(link)
    email = re.findall(pattern, res.text)
    if email:
        return link, email[0]
    else:
        return link

for link in urls:
    print(get_email(link, pattern))
yields
('https://rainforestfarms.org/contact', 'rainforestfarmsllc@gmail.com')
https://www.auucvancouver.ca/
http://www.bcla.bc.ca/
('http://www.palstudiotheatre.com/', 'theatre@palvancouver.org')
You could obviously do more complex things such as blacklists or regex patterns for the suffix.
As always for this kind of question I strongly recommend using regex101 to check and understand your regex.
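If you prefer to keep the broad pattern and filter afterwards, the whitelist can also be applied to the matches themselves; a sketch using the sample addresses from the question:

```python
ALLOWED_SUFFIXES = ('com', 'org')

found = ['rainforestfarmsllc@gmail.com',
         'Slice_1@2x.png',
         '8b4e078a51d04e0e9efdf470027f0ec1@sentry.wixpress']

# keep only addresses whose last dotted part is a whitelisted suffix
valid = [e for e in found if e.rsplit('.', 1)[-1] in ALLOWED_SUFFIXES]
print(valid)  # ['rainforestfarmsllc@gmail.com']
```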
Let's say I'm passing a variable into a function, and I want to ensure it's properly formatted for my end use, with consideration for several potential unwanted formats.
Example: I want to store only lowercase representations of URL addresses, without http:// or https://.
def standardize(url):
    # Lowercase
    temp_url = url
    url = temp_url.lower()
    # Remove 'http://'
    if 'http://' in url:
        temp_url = url
        url = temp_url.replace('http://', '')
    # Remove 'https://'
    if 'https://' in url:
        temp_url = url
        url = temp_url.replace('https://', '')
    return url
I'm only just encroaching on the title of Novice, and was wondering if there is a more Pythonic approach to achieving this type of process?
The end goal is the transformation of a URL like https://myurl.com/RANDoM --> myurl.com/random.
The application of URL string formatting isn't of any particular importance.
A simple re.sub will do the trick:
import re

def standardize(url):
    # strip a leading http:// or https:// after lowercasing
    return re.sub("^https?://", '', url.lower())

# with 'https'
print(standardize('https://myurl.com/RANDoM'))  # prints 'myurl.com/random'
# with 'http'
print(standardize('http://myurl.com/RANDoM'))   # prints 'myurl.com/random'
# both work
def standardize(url):
    return url.lower().replace("https://", "").replace("http://", "")
That's as simple as I can make it, but, the chaining is a little ugly.
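On Python 3.9+, str.removeprefix avoids the chaining and, unlike replace, only strips the scheme at the start of the string (a sketch of that variant):

```python
def standardize(url):
    url = url.lower()
    for prefix in ("https://", "http://"):
        # removeprefix is a no-op when the prefix is absent (Python 3.9+)
        url = url.removeprefix(prefix)
    return url

print(standardize('https://myurl.com/RANDoM'))  # myurl.com/random
print(standardize('http://myurl.com/RANDoM'))   # myurl.com/random
```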
If you want to import regex, could also do something like this:
import re

def standardize(url):
    return re.sub("^https?://", "", url.lower())
I am using Python and trying to fetch a particular part of the URL, as below:
from urlparse import urlparse as ue
url = "https://www.google.co.in"
img_url = ue(url).hostname
Result
www.google.co.in
case1:
Actually I will have a number of URLs (stored in a list or somewhere else), so what I want is to find the domain name in the URL and fetch the part after www. and before .co.in, that is, the string between the first dot and the second dot, which gives only google in the present scenario.
So suppose the URL given is www.gmail.com: I should fetch only gmail. Whatever the URL given, the code should fetch the part between the first dot and the second dot.
case2:
Also, some URLs may be given directly like domain.com or stackoverflow.com, without www; in those cases it should fetch only stackoverflow and domain.
Finally, my intention is to fetch the main name from the URL: gmail, stackoverflow, google, and so on.
Generally, if I have one URL I can use list slicing to fetch the string, but I will have a number of URLs, so the wanted part needs to be fetched dynamically, as mentioned above.
Can anyone please let me know how to satisfy the above concept?
Why can't you just do this:
from urlparse import urlparse as ue

urls = ['https://www.google.com', 'http://stackoverflow.com']
parsed = []
for url in urls:
    decoded = ue(url).hostname
    if decoded.startswith('www.'):
        decoded = ".".join(decoded.split('.')[1:])
    parsed.append(decoded.split('.')[0])
# parsed is now your parsed list of hostnames
Also, you might want to change the if statement in the for loop, because some domains might start with other things that you would want to get rid of.
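On Python 3 the same approach works with urllib.parse, where urlparse now lives; a self-contained sketch:

```python
from urllib.parse import urlparse

urls = ['https://www.google.com', 'http://stackoverflow.com']
parsed = []
for url in urls:
    hostname = urlparse(url).hostname
    if hostname.startswith('www.'):
        hostname = hostname.split('.', 1)[1]  # drop the leading 'www'
    parsed.append(hostname.split('.')[0])

print(parsed)  # ['google', 'stackoverflow']
```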
What about using a set of predefined top-level domains?
import re
from urlparse import urlparse

# Fake top-level domains... e.g.: co.uk, co.in, co.cc
TOPLEVEL = [".co.[a-zA-Z]+", ".fake.[a-zA-Z]+"]

def TLD(rgx, host, max=4):  # 4 = co.name
    match = re.findall("(%s)" % rgx, host, re.IGNORECASE)
    if match:
        if len(match[0].split(".")[1]) <= max:
            return match[0]
    else:
        return False

parsed = []
urls = ["http://www.mywebsite.xxx.asd.com", "http://www.dd.test.fake.uk/asd"]
for url in urls:
    o = urlparse(url)
    h = o.hostname
    for j in range(len(TOPLEVEL)):
        TL = TLD(TOPLEVEL[j], h)
        if TL:
            name = h.replace(TL, "").split(".")[-1]
            parsed.append(name)
            break
        elif j + 1 == len(TOPLEVEL):
            parsed.append(h.split(".")[-2])
            break
print parsed
It's a bit hacky, and maybe cryptic for some, but it does the trick, and nothing more has to be done :)
Here is my solution; at the end, domains holds the list of domains you expected.
import urlparse
urls = [
    'https://www.google.com',
    'http://stackoverflow.com',
    'http://www.google.co.in',
    'http://domain.com',
]
hostnames = [urlparse.urlparse(url).hostname for url in urls]
hostparts = [hostname.split('.') for hostname in hostnames]
domains = [p[0] == 'www' and p[1] or p[0] for p in hostparts]
print domains # ==> ['google', 'stackoverflow', 'google', 'domain']
Discussion
First, we extract the host names from the list of URLs using urlparse.urlparse(). The hostnames list looks like this:
[ 'www.google.com', 'stackoverflow.com', ... ]
In the next line, we break each host into parts, using the dot as the separator. Each item in the hostparts looks like this:
[ ['www', 'google', 'com'], ['stackoverflow', 'com'], ... ]
The interesting work is in the next line. This line says, "if the first part before the dot is www, then the domain is the second part (p[1]). Otherwise, the domain is the first part (p[0])." The domains list looks like this:
[ 'google', 'stackoverflow', 'google', 'domain' ]
My code does not know how to handle login.gmail.com.hk. I hope someone else can solve this problem, as I am late for bed. Update: take a look at tldextract by John Kurkowski, which should do what you want.
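To make the suffix-list idea concrete: tldextract ships a full public-suffix list, while the toy suffix set below is an assumption purely for illustration of how stripping a known suffix handles cases like login.gmail.com.hk:

```python
# toy suffix set -- real code should use a full public-suffix list (e.g. tldextract)
SUFFIXES = {'com', 'org', 'co.in', 'com.hk'}

def domain_of(hostname):
    parts = hostname.split('.')
    # find the first (longest) trailing run of labels that is a known suffix,
    # then the domain is the label just before it
    for i in range(len(parts)):
        if '.'.join(parts[i:]) in SUFFIXES:
            return parts[i - 1] if i > 0 else ''
    return parts[-2] if len(parts) > 1 else parts[0]

print(domain_of('login.gmail.com.hk'))  # gmail
print(domain_of('www.google.co.in'))    # google
```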
I am using the Wikipedia API with the following request:
http://en.wikipedia.org/w/api.php?action=query&meta=globaluserinfo&guiuser='$cammer'&guiprop=groups|merged|unattached&format=json
but the problem is I am unable to escape the dollar sign and similar characters. I tried the following, but it didn't work:
r['guiprop'] = u'groups|merged|unattached'
r['guiuser'] = u'$cammer'
I found this in W3Schools, but checking every single character against the table would be painful. What would be the best way to escape these in the string? http://www.w3schools.com/tags/ref_urlencode.asp
You should take a look at using urlencode.
from urllib import urlencode
base_url = "http://en.wikipedia.org/w/api.php?"
arguments = dict(action="query",
                 meta="globaluserinfo",
                 guiuser="$cammer",
                 guiprop="groups|merged|unattached",
                 format="json")
url = base_url + urlencode(arguments)
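In Python 3, urlencode lives in urllib.parse; the same call percent-encodes $ and | automatically (a sketch):

```python
from urllib.parse import urlencode

base_url = "http://en.wikipedia.org/w/api.php?"
arguments = dict(action="query",
                 meta="globaluserinfo",
                 guiuser="$cammer",
                 guiprop="groups|merged|unattached",
                 format="json")
url = base_url + urlencode(arguments)
print(url)
# '$' becomes %24 and '|' becomes %7C in the query string
```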
If you don't need to build a complete url you can just use the quote function for a single string:
>>> import urllib
>>> urllib.quote("$cammer")
'%24cammer'
So you end up with:
r['guiprop'] = urllib.quote(u'groups|merged|unattached')
r['guiuser'] = urllib.quote(u'$cammer')