I want to get the content from a Google search result that fits the following format:
<h3 class="zBAuLc"><div class="BNeawe vvjwJb AP7Wnd">Google</div></h3>
How do I write a regular expression for this?
Here is what I've tried:
import requests, webbrowser
import re
userResearch = input('Enter what to search:')
print('Searching...')
searcher = requests.get("https://www.google.com/search?q="+userResearch)
results = re.findall(r'<h3 class=".+"><div class=".+">.+</div></h3>', searcher.text)
print (results)
But the re.findall call does not return what I expect.
In a Python regex the forward slash has no special meaning, so escaping it is unnecessary (though harmless). The more likely problems are that the greedy .+ swallows everything up to the last matching tag, and that Google serves different markup to clients that don't identify as a browser.
Try the regex with non-greedy quantifiers:
r'<h3 class=".+?"><div class=".+?">(.+?)</div></h3>'
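A minimal sketch of that, assuming Google still returns this simplified markup to a plain requests client (which is not guaranteed):
import re
import requests

userResearch = input('Enter what to search: ')
print('Searching...')
searcher = requests.get('https://www.google.com/search?q=' + userResearch)
# Non-greedy .+? stops at the first closing tag instead of running on
# to the last </div></h3> in the page.
results = re.findall(r'<h3 class=".+?"><div class=".+?">(.+?)</div></h3>', searcher.text)
print(results)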
I'm still a newbie in Python, but I'm trying to make my first little program.
My intention is to print only the link ending with .m3u8 (if available) instead of printing the whole web page.
The code I'm currently using:
import requests
channel1 = requests.get('https://website.tv/user/111111')
print(channel1.content)
print('\n')
channel2 = requests.get('https://website.tv/user/222222')
print(channel2.content)
print('\n')
input('Press Enter to Exit...')
The link I'm looking for is always 47 characters in total, and it always follows the same pattern, with only the stream id (represented as X) changing:
https://website.tv/live/streamidXXXXXXXXX.m3u8
Can anyone help me?
You can use regex for this problem.
Explanation:
In this expression, .*? is a non-greedy match that consumes as little as possible, while \b marks a word boundary, so the text between the boundaries must appear literally at that position. Note that the dot before m3u8 should be escaped as \. so that it matches a literal dot rather than any character.
For example:
import re
link="https://website.tv/live/streamidXXXXXXXXX.m3u8"
p = re.findall(r'.*?\b\.m3u8\b', link)
print(p)
OUTPUT:
['https://website.tv/live/streamidXXXXXXXXX.m3u8']
There are a few ways to go about this. One that springs to mind, and that others have touched upon, is using regex with findall, which returns a list of matched URLs from our url_list.
Another option could be BeautifulSoup, but without more information about the HTML structure it may not be the best tool here; a rough sketch of it follows anyway.
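A minimal BeautifulSoup sketch, assuming (the question does not confirm this) that the .m3u8 links appear in the href attribute of <a> tags:
import requests
from bs4 import BeautifulSoup

resp = requests.get('https://website.tv/user/111111')
soup = BeautifulSoup(resp.text, 'html.parser')
# Keep only anchors whose href ends with the stream extension.
links = [a['href'] for a in soup.find_all('a', href=True)
         if a['href'].endswith('.m3u8')]
print(links)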
Using Regex
from re import findall
from requests import get

def check_link(response):
    # Search the page body for anything ending in a literal ".m3u8".
    result = findall(
        r'.*?\b\.m3u8\b',
        response.text,
    )
    return result

def main(url):
    response = get(url)
    if response.ok:
        link_found = check_link(response)
        if link_found:
            print('link {} found at {}'.format(
                link_found,
                url,
            ))

if __name__ == '__main__':
    url_list = [
        'http://www.test_1.com',
        'http://www.test_2.com',
        'http://www.test_3.com',
    ]
    for url in url_list:
        main(url)
    print("All finished")
If I understand your question correctly, I think you want to use Python's .split() string method. If your goal is to take a string like "https://website.tv/live/streamidXXXXXXXXX.m3u8" and extract just "streamidXXXXXXXXX.m3u8", then you could do that with the following code:
web_address = "https://website.tv/live/streamidXXXXXXXXX.m3u8"
specific_file = web_address.split('/')[-1]
print(specific_file)
Calling .split('/') on the string like that returns a list of strings, where each item in the list is a different part of the string (the first part being "https:", etc.). The last of these (index [-1]) is the filename you want, as shown below.
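For illustration, this is what the intermediate list looks like:
web_address = "https://website.tv/live/streamidXXXXXXXXX.m3u8"
parts = web_address.split('/')
print(parts)      # ['https:', '', 'website.tv', 'live', 'streamidXXXXXXXXX.m3u8']
print(parts[-1])  # streamidXXXXXXXXX.m3u8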
This will extract all URLs from the webpage and keep only those that contain your required keyword ".m3u8":
import requests
import re

def get_desired_url(data):
    urls = []
    # Grab every http(s) URL on the page, then keep only the .m3u8 ones.
    for url in re.findall(r'(https?://\S+)', data):
        if ".m3u8" in url:
            urls.append(url)
    return urls

channel1 = requests.get('https://website.tv/user/111111')
urls = get_desired_url(channel1.text)
Try this; I think it will be more robust:
import re

# Pull the href value out of every <a> tag whose target ends in .m3u8.
links = re.findall(r'<\s*a\s+[^>]*href\s*=\s*"(https?://[^"]+\.m3u8)"', channel2.text)
I'm new to web scraping and regex syntax.
I'm trying to find all matches for videoIds in a YouTube search html file. I'm not able to do it with BeautifulSoup's parser, since the IDs were recently moved into a JS script, so I'm trying regex.
They appear in the JS script as something like: "videoId":"jNQXAC9IVRw"
Note the ID is always 11 characters long.
So far, I'm trying:
html = urllib.request.urlopen(url).read().decode('utf-8')
pattern = re.compile('<quote>(\w{11})<quote>')
matches = re.findall(pattern, html)
for i in range(3):
    print(matches[i])
But it won't find anything.
My desire is to have a list of IDs, like:
lYtFMmByfJk
d2RlyAz6VQ
utTAphB1y4Y
What am I doing wrong?
If you change <quote> to ", it should work:
import re

html = '"videoId":"jNQXAC9IVRw","videoId":"jNQXACffRwl","videoId":"jNQXAC9ffsw"'
pattern = re.compile(r'videoId":"(\w{11})"')
matches = re.findall(pattern, html)
print(matches)
for i in range(3):
    print(matches[i])
Output is:
>python .\vidIDs.py
['jNQXAC9IVRw', 'jNQXACffRwl', 'jNQXAC9ffsw']
jNQXAC9IVRw
jNQXACffRwl
jNQXAC9ffsw
I've created a script in Python using a regular expression to parse emails from a few websites. The pattern I've used to grab emails is \w+@\w+\.{1}\w+, which works in most cases. However, trouble comes up when it encounters items like 8b4e078a51d04e0e9efdf470027f0ec1@sentry.wixpress, Slice_1@2x.png, etc. The pattern grabs them as well, and I would like to get rid of them.
I've tried with:
import re
import requests
pattern = r'\w+@\w+\.{1}\w+'
urls = (
'https://rainforestfarms.org/contact',
'https://www.auucvancouver.ca/',
'http://www.bcla.bc.ca/',
'http://www.palstudiotheatre.com/',
)
def get_email(link, pattern):
    res = requests.get(link)
    email = re.findall(pattern, res.text)
    if email:
        return link, email[0]
    else:
        return link

if __name__ == '__main__':
    for link in urls:
        print(get_email(link, pattern))
Output I'm getting:
('https://rainforestfarms.org/contact', 'rainforestfarmsllc@gmail.com')
('https://www.auucvancouver.ca/', '8b4e078a51d04e0e9efdf470027f0ec1@sentry.wixpress')
('http://www.bcla.bc.ca/', 'Slice_1@2x.png')
('http://www.palstudiotheatre.com/', 'theatre@palvancouver.org')
Output I wish to get:
('https://rainforestfarms.org/contact', 'rainforestfarmsllc@gmail.com')
https://www.auucvancouver.ca/
http://www.bcla.bc.ca/
('http://www.palstudiotheatre.com/', 'theatre@palvancouver.org')
How can I get rid of unwanted items using regex?
It depends on what you mean by "unwanted".
One way to define them is to use a whitelist of allowed domain suffixes, for example 'org', 'com', etc.
import re
import requests
pattern = r'\w+@\w+\.(?:com|org)'
urls = (
'https://rainforestfarms.org/contact',
'https://www.auucvancouver.ca/',
'http://www.bcla.bc.ca/',
'http://www.palstudiotheatre.com/',
)
def get_email(link, pattern):
    res = requests.get(link)
    email = re.findall(pattern, res.text)
    if email:
        return link, email[0]
    else:
        return link

for link in urls:
    print(get_email(link, pattern))
yields
('https://rainforestfarms.org/contact', 'rainforestfarmsllc@gmail.com')
https://www.auucvancouver.ca/
http://www.bcla.bc.ca/
('http://www.palstudiotheatre.com/', 'theatre@palvancouver.org')
You could obviously do more complex things, such as blacklists or regex patterns for the suffix; a sketch of the blacklist idea follows.
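A minimal sketch of the blacklist approach (the suffixes and domains listed are illustrative, not exhaustive):
import re

pattern = r'\w+@\w+\.\w+'
blacklist = ('.png', '.jpg', 'sentry.wixpress')  # illustrative reject list

def is_real_email(candidate):
    # Drop matches that end in an image extension or a known noise domain.
    return not any(candidate.endswith(item) for item in blacklist)

matches = re.findall(pattern, 'mail theatre@palvancouver.org icon Slice_1@2x.png')
print([m for m in matches if is_real_email(m)])  # ['theatre@palvancouver.org']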
As always for this kind of question I strongly recommend using regex101 to check and understand your regex.
I am using Jupyter Notebook and trying to extract docid=PE209374738 as my output using a regex. The URL is currently stored in a dictionary in this format:
{'Url': 'https://backtoschool.com/document.php?docid=PE209374738&datasource=PHE&vid=3326&referrer=api'}.
This is my code:
results = xmldoc.getElementsByTagName("result")
dict = {}
for a in results:
    url = 'Url'
    dict[url] = a.getElementsByTagName("url")[0].childNodes[0].nodeValue
    docid = re.search(r'\?(.*?)&')
Does anyone have any suggestions on how to print that id?
The standard library already has methods for parsing URLs properly, no need for regex.
In Python 3:
from urllib.parse import urlparse, parse_qs
url = 'https://backtoschool.com/document.php?docid=PE209374738&datasource=PHE&vid=3326&referrer=api'
print(parse_qs(urlparse(url).query)['docid'][0]) # PE209374738
In Python 2 the first line is:
from urlparse import urlparse, parse_qs
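Broken into steps (Python 3), this is what each call returns:
parsed = urlparse(url)
print(parsed.query)            # docid=PE209374738&datasource=PHE&vid=3326&referrer=api
print(parse_qs(parsed.query))  # {'docid': ['PE209374738'], 'datasource': ['PHE'], 'vid': ['3326'], 'referrer': ['api']}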
@alex-hall is correct: you would probably be better off parsing this with a proper URL parser.
That said, your original question was about doing it with a regexp, so here is the solution (which you nearly nailed already):
import re

s = 'https://backtoschool.com/document.php?docid=PE209374738&datasource=PHE&vid=3326&referrer=api'
m = re.search(r'\?docid=(.*?)&', s)
print(m.groups()[0])
This will print the desired PE209374738.
I am using the Wikipedia API with the following API request:
http://en.wikipedia.org/w/api.php?action=query&meta=globaluserinfo&guiuser='$cammer'&guiprop=groups|merged|unattached&format=json
but the problem is that I am unable to escape the dollar sign and similar characters. I tried the following, but it didn't work:
r['guiprop'] = u'groups|merged|unattached'
r['guiuser'] = u'$cammer'
I found this reference on W3Schools, but checking every single character by hand would be painful. What would be the best way to escape these characters in the string? http://www.w3schools.com/tags/ref_urlencode.asp
You should take a look at urlencode (in Python 2 it lives in urllib; in Python 3 it is urllib.parse.urlencode).
from urllib import urlencode

base_url = "http://en.wikipedia.org/w/api.php?"
arguments = dict(action="query",
                 meta="globaluserinfo",
                 guiuser="$cammer",
                 guiprop="groups|merged|unattached",
                 format="json")
url = base_url + urlencode(arguments)
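The special characters come out percent-encoded in the result, e.g. guiuser=%24cammer and guiprop=groups%7Cmerged%7Cunattached.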
If you don't need to build a complete URL, you can just use the quote function on a single string:
>>> import urllib
>>> urllib.quote("$cammer")
'%24cammer'
So you end up with:
r['guiprop'] = urllib.quote(u'groups|merged|unattached')
r['guiuser'] = urllib.quote(u'$cammer')