Parsing all possible YouTube urls - python

What are all the features that a YouTube URL can have?
http://www.youtube.com/watch?v=6FWUjJF1ai0&feature=related
So far I have seen feature=relmfu, related, fvst and fvwrel. Is there a list of these somewhere? Also, my ultimate aim is to extract the video ID (6FWUjJF1ai0) from all possible YouTube URLs. How can I do that? It seems difficult. Has anyone already done this?

You can use urlparse to get the query string from your url, then you can use parse_qs to get the video id from the query string.

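I wrote the code to help; the credit for the solution goes entirely to Frank, though.
from urllib.parse import urlparse, parse_qs  # Python 3; in Python 2 both functions live in the urlparse module
m = urlparse('http://www.youtube.com/watch?v=6FWUjJF1ai0&feature=related')
print(parse_qs(m.query)['v'])  # parse_qs returns a list of values for each key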

Using the following answer, https://stackoverflow.com/a/43490746/8534966, I ran 55 different test cases and it matched 51 of them. See my tests.
So I wrote some if/else code to handle the remaining cases:
import re

# Get YouTube video ID (youtube_url is assumed to hold the URL string)
youtube_id = None  # stays None if no pattern matches
if "watch%3Fv%3D" in youtube_url:
    # e.g.: https://www.youtube.com/attribution_link?a=8g8kPrPIi-ecwIsS&u=/watch%3Fv%3DyZv2daTWRZU%26feature%3Dem-uploademail
    search_pattern = re.search("watch%3Fv%3D(.*?)%", youtube_url)
    if search_pattern:
        youtube_id = search_pattern.group(1)
elif "watch?v%3D" in youtube_url:
    # e.g.: http://www.youtube.com/attribution_link?a=JdfC0C9V6ZI&u=%2Fwatch%3Fv%3DEhxJLojIE_o%26feature%3Dshare
    search_pattern = re.search("v%3D(.*?)&format", youtube_url)
    if search_pattern:
        youtube_id = search_pattern.group(1)
elif "/e/" in youtube_url:
    # e.g.: http://www.youtube.com/e/dQw4w9WgXcQ
    youtube_url += " "
    search_pattern = re.search("/e/(.*?) ", youtube_url)
    if search_pattern:
        youtube_id = search_pattern.group(1)
else:
    # All else. Raw string so the backslashes reach the regex engine unchanged.
    search_pattern = re.search(r"(?:[?&]vi?=|\/embed\/|\/\d\d?\/|\/vi?\/|https?:\/\/(?:www\.)?youtu\.be\/)([^&\n?#]+)",
                               youtube_url)
    if search_pattern:
        youtube_id = search_pattern.group(1)
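
As an alternative to the if/else chain above, here is a minimal sketch that leans on urllib.parse instead of hand-written patterns. The function name youtube_id_from_url and the set of path prefixes it recognises (embed, e, v) are assumptions made for illustration; it has only been checked against the example URLs quoted in this thread, not against all 55 test cases.

```python
from urllib.parse import urlparse, parse_qs

def youtube_id_from_url(url):
    """Best-effort video ID extraction; returns None when nothing is found."""
    parts = urlparse(url)
    query = parse_qs(parts.query)  # parse_qs also percent-decodes the values

    # attribution_link URLs carry the real watch URL in the "u" parameter
    if parts.path == '/attribution_link' and 'u' in query:
        return youtube_id_from_url(query['u'][0])

    # Ordinary watch URLs keep the ID in the "v" parameter
    if 'v' in query:
        return query['v'][0]

    segments = [s for s in parts.path.split('/') if s]
    # Short links: http://youtu.be/<id>
    if parts.hostname in ('youtu.be', 'www.youtu.be') and segments:
        return segments[0]
    # Path-style links such as /embed/<id>, /e/<id> or /v/<id> (assumed prefixes)
    if len(segments) >= 2 and segments[0] in ('embed', 'e', 'v'):
        return segments[1]
    return None

print(youtube_id_from_url('http://www.youtube.com/e/dQw4w9WgXcQ'))
```

Because parse_qs decodes the percent-encoded u parameter, the attribution_link case reduces to parsing the inner /watch?v=... URL a second time.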

You may want to consider a more comprehensive URL parser, as suggested in this Gist.
It handles more cases than urlparse can.

Related

Python requests returns incomprehensible content

I'm trying to parse a site without using Selenium. Requests copes with it, but something strange is happening: I can't extract the text I need with a regular expression. The text is there (you can see it with print(data.text)), but re doesn't find it. If I copy this text into Notepad++, it shows the following: it sees these characters as a single line.
import requests
import re
data = requests.get('https://ru.runetki3.com/?page=1')
print(data.text)
What is this, and how do I work with it? Pay attention to the line numbers.
You can try to use their Ajax API to load all usernames + thumb images:
import pandas as pd
import requests
url = 'https://ru.runetki3.com/tools/listing_v3.php?livetab=female&offset=0&limit={}'
headers = {'X-Requested-With': 'XMLHttpRequest'}
all_data = []
for p in range(1, 4): # <-- increase number of pages here
    data = requests.get(url.format(p * 144), headers=headers).json()
    for m in data['models']:
        all_data.append((m['username'], m['display_name'], m['thumb_image'].replace('{ext}', 'jpg')))
df = pd.DataFrame(all_data, columns=['username', 'display_name', 'thumb'])
print(df.head())
Prints:
    username    display_name  thumb
0   wetlilu     Little_Lilu   //i.bimbolive.com/live/034/263/131/xbig_lq/c30823.jpg
1   mellannie8  mellannieSEX  //i.bimbolive.com/live/034/24f/209/xbig_lq/314348.jpg
2   mokkoann    mokkoann      //i.bimbolive.com/live/034/270/279/xbig_lq/cb25cb.jpg
3   ogurezzi    CynEp-nuCbka  //i.bimbolive.com/live/034/269/02c/xbig_lq/3ebe2a.jpg
4   Pepetka22   _-Katya-_     //i.bimbolive.com/live/034/24f/36e/xbig_lq/18da8e.jpg
Avoid using . in a regex unless you really want to match any character; here, the usernames (as far as I can see) contain only hyphens and word characters, so you can retrieve them with:
re.findall(r'"username":"([\w-]+)"',data.text)
An even simpler way, which removes the need to deal with special characters by matching every character except ", is:
re.findall(r'"username":"([^"]+)"',data.text)
So here's a way of getting the info you seek (I joined them into a dictionary, but you can change that to whatever you prefer):
import requests
import re
data = requests.get('https://ru.runetki3.com/?page=1')
with open("return.txt", 'w', encoding='utf-8') as f:
    f.write(data.text)
names = re.findall(r'"username":"([^"]+)"',data.text)
disp_names = re.findall(r'"display_name":"([^"]+)"',data.text)
thumbs = re.findall(r'"thumb_image":"([^"]+)"',data.text)
names_dict = {name:[disp, thumb.replace('{ext}', 'jpg')] for name, disp, thumb in zip(names, disp_names, thumbs)}
Example
names_dict['JuliaCute']
# ['_Cute',
# '\\/\\/i.bimbolive.com\\/live\\/055\\/2b0\\/15d\\/xbig_lq\\/d89ef4.jpg']
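
The thumb value in the example output still contains the JSON escape \/ because the regex captures the raw text between the quotes. One way to decode it, sketched here under the assumption that the captured fragment is the inside of a valid JSON string, is to wrap it in quotes again and hand it to json.loads. The helper name unescape_json_fragment is hypothetical.

```python
import json

def unescape_json_fragment(fragment):
    """Decode JSON escapes (such as \\/) in text captured from inside a JSON string."""
    # Re-add the surrounding quotes so json.loads sees a complete string literal
    return json.loads('"' + fragment + '"')

# Raw text as the regex captures it: each slash is preceded by a backslash
captured = r'\/\/i.bimbolive.com\/live\/055\/2b0\/15d\/xbig_lq\/d89ef4.{ext}'

thumb = unescape_json_fragment(captured).replace('{ext}', 'jpg')
print(thumb)  # //i.bimbolive.com/live/055/2b0/15d/xbig_lq/d89ef4.jpg
```

This would fail on a value containing an escaped quote, since the [^"]+ pattern stops at the first " and leaves a dangling backslash; parsing the whole JSON payload avoids that case.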

I need to scrape the instagram link that is highlighted in the image

I am trying regex in Python. My problem is how to extract the portion that is
"www.instagram.com%2FMohakMeet"
I need to know which characters to use in the regex.
#python3
import re
import requests

for d in g:
    stripped = (d.rstrip())
    url = stripped+"/about"
    print("Retreiving" + url)
    response = requests.get(url)
    data = response.text
    link = re.findall('''(www.instagram.com.+?)\s?\"?''', data)
    if link == []:
        print ('No Link')
    else:
        x = str(link[0])
        print ("Insta Link", x)
        y = x.replace("%2F", '/', 3)
        print (y)
        # with open ('l.txt', 'a') as v:
        #     v.write(y)
        #     v.write("\n")
This is my code, but the main problem is that Python scrapes the description of the YouTube page shown in the second picture instead.
Please help.
This is the pattern that is not working:
(www.instagram.com.+?)\s?\"?
This link will let you debug your regex: https://regex101.com/
A common pitfall when writing a regex is using a standard string ('my string') rather than a raw string (r'my string').
See also https://docs.python.org/3/library/re.html
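
To make the failure concrete, here is a small sketch of why the pattern from the question returns almost nothing and one possible replacement. The sample data string is a made-up stand-in for the page source (the real markup may differ), and the replacement pattern assumes the handle consists only of word characters and dots.

```python
import re
from urllib.parse import unquote

# Hypothetical fragment of page source; the real page may look different
data = 'href="https://www.youtube.com/redirect?q=https%3A%2F%2Fwww.instagram.com%2FMohakMeet&v=abc"'

# The lazy .+? is followed only by optional tokens, so it stops after one character
broken = re.findall(r'(www.instagram.com.+?)\s?\"?', data)
print(broken)  # ['www.instagram.com%']

# Escape the dots and state explicitly which characters may follow the encoded slash
fixed = re.findall(r'www\.instagram\.com%2F[\w.]+', data)
link = unquote(fixed[0])  # turn %2F back into /
print(link)  # www.instagram.com/MohakMeet
```

unquote from urllib.parse replaces the manual x.replace("%2F", '/', 3) step and also handles any other percent-encoded characters.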

How to match and extract using regex - Python

I'm learning how to use regex and trying to figure out how to extract the latitude and longitude, whether the numbers are positive or negative, right after the "?ll=" as shown below:
https://maps.google.com/maps?ll=-6.148222,106.8462&q=loc:-6.148222,106.8462&
I have used the following code in python to get only the first digits marked above:
for link in soup.find_all('a', {'class': 'popup-gmaps'}):
    lnk = str(link.get('href'))
    print(lnk)
    m = re.match('-?\d+(?!.*ll=)(?!&q=loc)*', lnk)
    print(m)
    #lat, *long = m.split(',')
    #print(lat)
    #print(long)
The result I got isn't what I was expecting:
https://maps.google.com/maps?ll=-6.148222,106.8462&q=loc:-6.148222,106.8462&
None
I'm getting None rather than the value "-6.148222,106.8462". I also tried to split those numbers into two variables called lat and long, but since I always get None, Python stops with "exit code 1" unless I comment those lines out.
Cheers,
You should use re.search() instead of re.match(), because re.match() only matches at the beginning of the string, while re.search() scans the whole string for the first match.
This can solve the problem:
for link in soup.find_all('a', {'class': 'popup-gmaps'}):
    lnk = str(link.get('href'))
    m = re.search(r"(-?\d*\.\d*),(-?\d*\.\d*)", lnk)
    print(m.group())
    print("lat = "+m.group(1))
    print("lng = "+m.group(2))
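
The difference between the two functions can be seen in isolation with the URL from the question. This is a minimal sketch; the pattern here requires at least one digit on each side of the decimal point, which is a slightly stricter assumption than the \d* version.

```python
import re

url = "https://maps.google.com/maps?ll=-6.148222,106.8462&q=loc:-6.148222,106.8462&"
pattern = r"(-?\d+\.\d+),(-?\d+\.\d+)"

# re.match only tries the pattern at position 0, where the URL starts with "https"
at_start = re.match(pattern, url)

# re.search moves forward through the string until the pattern fits somewhere
anywhere = re.search(pattern, url)

print(at_start)           # None
print(anywhere.groups())  # ('-6.148222', '106.8462')

# The captured groups are strings; convert them if numbers are needed
lat, lng = (float(part) for part in anywhere.groups())
```

Checking the result for None before calling .group() avoids the crash described in the question when a link has no coordinates.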
I'd use a proper URL parser; using a regex here is asking for trouble if the URL embedded in the page you are crawling changes in a way that breaks your regex.
from urllib.parse import urlparse, parse_qs
url = 'https://maps.google.com/maps?ll=-6.148222,106.8462&q=loc:-6.148222,106.8462&'
scheme, netloc, path, params, query, fragment = urlparse(url)
# or just
# query = urlparse(url).query
parsed_query_string = parse_qs(query)
print(parsed_query_string)
lat, long = parsed_query_string['ll'][0].split(',')
print(lat)
print(long)
outputs
{'ll': ['-6.148222,106.8462'], 'q': ['loc:-6.148222,106.8462']}
-6.148222
106.8462
Use different regexes for latitude and longitude:
import re
str1="https://maps.google.com/maps?ll=6.148222,-106.8462&q=loc:-6.148222,106.8462&"
lat=re.search(r"-?\d+\.\d+",str1).group()   # first decimal number in the URL
lon=re.search(r",-?\d+\.\d+",str1).group()  # first decimal number preceded by a comma
print(lat)
print(lon[1:])  # drop the leading comma
output
6.148222
-106.8462

Get YouTube video url or YouTube video ID from a string using RegEx

So I've been stuck on this for about an hour or so now and I just cannot get it to work. So far I have been trying to extract the whole link from the string, but now I feel like it might be easier to just get the video ID.
The RegEx would need to take the ID/URL from the following link styles, no matter where they are in a string:
http://youtube.com/watch?v=iwGFalTRHDA
http://www.youtube.com/watch?v=iwGFalTRHDA&feature=related
https://youtube.com/iwGFalTRHDA
http://youtu.be/n17B_uFF4cA
youtube.com/iwGFalTRHDA
youtube.com/n17B_uFF4cA
http://www.youtube.com/embed/watch?feature=player_embedded&v=r5nB9u4jjy4
http://www.youtube.com/watch?v=t-ZRX8984sc
http://youtu.be/t-ZRX8984sc
So far, I have this RegEx:
((http(s)?\:\/\/)?(www\.)?(youtube|youtu)((\.com|\.be)\/)(watch\?v=)?([0-z]{11}|[0-z]{4}(\-|\_)[0-z]{4}|.(\-|\_)[0-z]{9}))
This catches the link, however it also breaks down the link in to multiple parts and also adds that to the list too, so if a string contains a single youtube link, the output when I print the list is something like this:
('https://www.youtube.com/watch?v=Idn7ODPMhFY', 'https://', 's', 'www.', 'youtube', '.com/', '.com', 'watch?v=', 'Idn7ODPMhFY', '', '')
I need the list to only contain the link itself, or just the video id (which would be more preferable). I have really tried doing this myself for quite a while now but I just cannot figure it out. I was wondering if someone could sort out the regex for me and tell me where I am going wrong so that I don't run in to this issue again in the future?
Instead of writing a complicated regex that probably won't work in all cases, you're better off using tools that parse the URL, like urllib:
from urllib.parse import urlparse, parse_qs
url = 'http://youtube.com/watch?v=iwGFalTRHDA'
def get_id(url):
    u_pars = urlparse(url)
    # Prefer the 'v' query parameter when it is present
    quer_v = parse_qs(u_pars.query).get('v')
    if quer_v:
        return quer_v[0]
    # Otherwise fall back to the last path segment, if it is not empty
    pth = u_pars.path.split('/')
    if pth[-1]:
        return pth[-1]
This function will return None if both attempts fail.
I tested it with the sample urls:
>>> get_id('http://youtube.com/watch?v=iwGFalTRHDA')
'iwGFalTRHDA'
>>> get_id('http://www.youtube.com/watch?v=iwGFalTRHDA&feature=related')
'iwGFalTRHDA'
>>> get_id('https://youtube.com/iwGFalTRHDA')
'iwGFalTRHDA'
>>> get_id('http://youtu.be/n17B_uFF4cA')
'n17B_uFF4cA'
>>> get_id('youtube.com/iwGFalTRHDA')
'iwGFalTRHDA'
>>> get_id('youtube.com/n17B_uFF4cA')
'n17B_uFF4cA'
>>> get_id('http://www.youtube.com/embed/watch?feature=player_embedded&v=r5nB9u4jjy4')
'r5nB9u4jjy4'
>>> get_id('http://www.youtube.com/watch?v=t-ZRX8984sc')
't-ZRX8984sc'
>>> get_id('http://youtu.be/t-ZRX8984sc')
't-ZRX8984sc'
Here's the approach I'd use, no regex needed at all.
(This is pretty much equivalent to @Willem Van Onsem's solution, plus an easy-to-run, easy-to-update unit test.)
from urllib.parse import parse_qs  # Python 3; in Python 2 these come from the urlparse module
from urllib.parse import urlparse
import re
import unittest
TEST_URLS = [
    ('iwGFalTRHDA', 'http://youtube.com/watch?v=iwGFalTRHDA'),
    ('iwGFalTRHDA', 'http://www.youtube.com/watch?v=iwGFalTRHDA&feature=related'),
    ('iwGFalTRHDA', 'https://youtube.com/iwGFalTRHDA'),
    ('n17B_uFF4cA', 'http://youtu.be/n17B_uFF4cA'),
    ('iwGFalTRHDA', 'youtube.com/iwGFalTRHDA'),
    ('n17B_uFF4cA', 'youtube.com/n17B_uFF4cA'),
    ('r5nB9u4jjy4', 'http://www.youtube.com/embed/watch?feature=player_embedded&v=r5nB9u4jjy4'),
    ('t-ZRX8984sc', 'http://www.youtube.com/watch?v=t-ZRX8984sc'),
    ('t-ZRX8984sc', 'http://youtu.be/t-ZRX8984sc'),
    (None, 'http://www.stackoverflow.com')
]
YOUTUBE_DOMAINS = [
    'youtu.be',
    'youtube.com',
]
def extract_id(url_string):
    # Make sure all URLs start with a valid scheme
    if not url_string.lower().startswith('http'):
        url_string = 'http://%s' % url_string
    url = urlparse(url_string)
    # Check host against whitelist of domains
    if url.hostname.replace('www.', '') not in YOUTUBE_DOMAINS:
        return None
    # Video ID is usually to be found in 'v' query string
    qs = parse_qs(url.query)
    if 'v' in qs:
        return qs['v'][0]
    # Otherwise fall back to path component
    return url.path.lstrip('/')
class TestExtractID(unittest.TestCase):
    def test_extract_id(self):
        for expected_id, url in TEST_URLS:
            result = extract_id(url)
            self.assertEqual(
                expected_id, result, 'Failed to extract ID from '
                'URL %r (got %r, expected %r)' % (url, result, expected_id))
if __name__ == '__main__':
    unittest.main()
I really recommend following @LukasGraf's comment; however, if you really must use a regex, you can try the following:
(?:(?:https?\:\/\/)?(?:www\.)?(?:youtube|youtu)(?:(?:\.com|\.be)\/)(?:embed\/)?(?:watch\?)?(?:feature=player_embedded)?&?(?:v=)?([0-z]{11}|[0-z]{4}(\-|\_)[0-z]{4}|.(\-|\_)[0-z]{9}))
Here is a working example in regex101:
https://regex101.com/r/5eRqn2/1
And here the python example:
In [38]: r = re.compile(r'(?:(?:https?\:\/\/)?(?:www\.)?(?:youtube|youtu)(?:(?:\.com|\.be)\/)(?:embed\/)?(?:watch\?)?(?:feature=player_embedded)?&?(?:v=)?([0-z]{11}|[0-z]{4}(?:\-|\_)[0-z]{4}|.(?:\-|\_)[0-z]{9}))')
In [39]: r.match('http://youtube.com/watch?v=iwGFalTRHDA').groups()
Out[39]: ('iwGFalTRHDA',)
In [40]: r.match('http://www.youtube.com/watch?v=iwGFalTRHDA&feature=related').groups()
Out[40]: ('iwGFalTRHDA',)
In [41]: r.match('https://youtube.com/iwGFalTRHDA').groups()
Out[41]: ('iwGFalTRHDA',)
To keep a group from being captured in a regex, use the non-capturing form: (?:...)
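
The effect on re.findall, which is what produced the tuple in the question, can be seen with a deliberately simplified pattern. The pattern below is an assumption for illustration only (it covers just the watch?v= and youtu.be forms and assumes 11-character IDs); it is not the full regex from this answer.

```python
import re

text = ("see https://www.youtube.com/watch?v=Idn7ODPMhFY "
        "and http://youtu.be/n17B_uFF4cA")

# Every (...) is a capturing group, so findall returns one tuple per match
capturing = re.findall(
    r'(https?://)?(www\.)?(youtube\.com/watch\?v=|youtu\.be/)([\w-]{11})', text)
print(capturing[0])
# ('https://', 'www.', 'youtube.com/watch?v=', 'Idn7ODPMhFY')

# With (?:...) everywhere except the ID, findall returns just the IDs
non_capturing = re.findall(
    r'(?:https?://)?(?:www\.)?(?:youtube\.com/watch\?v=|youtu\.be/)([\w-]{11})', text)
print(non_capturing)
# ['Idn7ODPMhFY', 'n17B_uFF4cA']
```

When a pattern has exactly one capturing group, findall returns a flat list of that group's text, which is the "just the video ID" output asked for.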

using find with multiple strings to search within larger string

Using Python 2.7, I am trying to scrape the title from a page, but cut it off before the closing title tag if I find one of these characters: .-_<| (I'm just trying to get the name of the company/website). I have some code working, but I'm sure there must be a simpler way. I'm open to suggestions for libraries (Beautiful Soup, Scrapy, etc.), but I would be happiest doing it without them, as I'm slowly learning my way around Python right now. You can see that my code searches for each of the characters individually rather than all at once. I was hoping there was a find(x or x) function, but I could not find one. Later I will be doing the same thing but looking for any digit in the 0-9 range.
import urllib2
opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]  # list of (name, value) tuples
def findTitle(webaddress):
    url = (webaddress)
    ourUrl = opener.open(url).read()
    ourUrlLower = ourUrl.lower()
    x=0
    positionStart = ourUrlLower.find("<title>",x)
    if positionStart == -1:
        return "Insert Title Here"
    endTitleSignals = ['.',',','-','_','#','+',':','|','<']
    positionEnd = positionStart + 50
    # Keep the earliest position at which any end signal occurs
    for e in endTitleSignals:
        positionHolder = ourUrlLower.find(e ,positionStart + 1)
        if positionHolder < positionEnd and positionHolder != -1:
            positionEnd = positionHolder
    return ourUrl[positionStart + 7:positionEnd]
print findTitle('http://www.com')
The regular expression library (re) could help, but if you'd like to learn more about general python instead of specialized libraries, you could do it with sets, which are something you'll want to know about.
string = "garbage1and2recycling"
charlist = ['1', '2']
charset = set(charlist)  # built-in set type; the separate sets module is deprecated
index = 0
for index in range(len(string)):
    if string[index] in charset: break
print(index) # 7
Note that you could do the above using just charlist instead of charset, but that would take longer to run.
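
For comparison, here is a rough sketch of the re route mentioned at the start of this answer: a character class matches any one of several characters in a single search. The helper name cut_at_first and the sample title are made up for illustration.

```python
import re

def cut_at_first(text, stop_chars):
    """Return text up to (not including) the first character found in stop_chars."""
    # re.escape keeps characters such as '-' and '|' literal inside the [...] class
    pattern = "[" + re.escape(stop_chars) + "]"
    match = re.search(pattern, text)
    return text if match is None else text[:match.start()]

print(cut_at_first("Example Corp - Home | Welcome", ".,-_#+:|<"))

# The same idea covers "any digit in the 0-9 range" via the \d class
first_digit = re.search(r"\d", "garbage1and2recycling")
print(first_digit.start())  # 7
```

The first call keeps the trailing space before the hyphen, so a .strip() on the result may be wanted when extracting a page title.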
