I need to get the user ID from a GitHub link.
Example:
https://github.com/myName/blabla/ or
https://github.com/myName/
https://github.com/myName
The output should be myName.
My code:
(\/github.com\/)\/(.*?)(\/)?
You've got an extra slash in the middle, and you don't need the slash at the end, just grab characters that aren't slashes for the username:
/github\.com/([^/]+)
Your match will be in group 1.
Interactive demo: https://regex101.com/r/vEOksV/2
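For instance, a minimal sketch of that pattern against the URLs from the question:
import re

urls = [
    'https://github.com/myName/blabla/',
    'https://github.com/myName/',
    'https://github.com/myName',
]

for url in urls:
    match = re.search(r'github\.com/([^/]+)', url)
    if match:
        print(match.group(1))  # prints "myName" for each URL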
import re

text = 'https://github.com/testName/'
try:
    # use a raw string, and match non-slash characters so that URLs
    # without a trailing slash also work
    found = re.search(r'github\.com/([^/]+)', text).group(1)
except AttributeError:
    # pattern not found
    found = ''  # apply your error handling
# found: testName
Interactive IDE with this example: https://onlinegdb.com/ry1w1dSB8
I want to get the content from a Google search which fits the following format:
<h3 class="zBAuLc"><div class="BNeawe vvjwJb AP7Wnd">Google</div></h3>
How do I write a regular expression for this?
Here is what I've tried:
import requests, webbrowser
import re
userResearch = input('Enter what to search:')
print('Searching...')
searcher = requests.get("https://www.google.com/search?q="+userResearch)
results = re.findall(r'<h3 class=".+"><div class=".+">.+</div></h3>', searcher.text)
print (results)
But re.findall does not return what I expect.
You didn't escape the /.
Try the regex:
r'<h3 class=".+"><div class=".+">.+<\/div><\/h3>'
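A quick sketch of the suggested pattern, assuming the markup in the question is what the page actually returns (Google may serve different HTML to requests than to a browser):
import re

# sample markup taken from the question
html = '<h3 class="zBAuLc"><div class="BNeawe vvjwJb AP7Wnd">Google</div></h3>'

results = re.findall(r'<h3 class=".+"><div class=".+">.+<\/div><\/h3>', html)
print(results)
# ['<h3 class="zBAuLc"><div class="BNeawe vvjwJb AP7Wnd">Google</div></h3>']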
I'm still a newbie in Python but I'm trying to make my first little program.
My intention is to print only the link ending with .m3u8 (if available) instead of printing the whole web page.
The code I'm currently using:
import requests
channel1 = requests.get('https://website.tv/user/111111')
print(channel1.content)
print('\n')
channel2 = requests.get('https://website.tv/user/222222')
print(channel2.content)
print('\n')
input('Press Enter to Exit...')
The link I'm looking for always has 47 characters in total, and it always follows the same model, with only the stream id (represented as X) changing:
https://website.tv/live/streamidXXXXXXXXX.m3u8
Can anyone help me?
You can use regex for this problem.
Explanation: in the pattern, .*? lazily matches everything leading up to the required part, and the text enclosed in the \b...\b word boundaries must be present for the match to succeed.
For example:
import re

link = "https://website.tv/live/streamidXXXXXXXXX.m3u8"
# escape the dot so it matches a literal "." rather than any character
p = re.findall(r'.*?\b\.m3u8\b', link)
print(p)
OUTPUT:
['https://website.tv/live/streamidXXXXXXXXX.m3u8']
There are a few ways to go about this. One that springs to mind, which others have touched upon, is using regex with findall, which returns a list of the matched URLs from our url_list.
Another option could be BeautifulSoup, but without more information about the HTML structure it may not be the best tool here; still, a sketch of it follows the regex example below.
Using Regex
from re import findall
from requests import get

def check_link(response):
    # escape the dot so ".m3u8" is matched literally
    result = findall(
        r'.*?\b\.m3u8\b',
        response.text,
    )
    return result

def main(url):
    response = get(url)
    if response.ok:
        link_found = check_link(response)
        if link_found:
            print('link {} found at {}'.format(
                link_found,
                url,
            ))

if __name__ == '__main__':
    url_list = [
        'http://www.test_1.com',
        'http://www.test_2.com',
        'http://www.test_3.com',
    ]
    for url in url_list:
        main(url)
    print("All finished")
If I understand your question correctly, I think you want to use Python's .split() string method. If your goal is to take a string like "https://website.tv/live/streamidXXXXXXXXX.m3u8" and extract just "streamidXXXXXXXXX.m3u8", then you could do that with the following code:
web_address = "https://website.tv/live/streamidXXXXXXXXX.m3u8"
specific_file = web_address.split('/')[-1]
print(specific_file)
Calling .split('/') on the string like that returns a list of strings, where each item in the list is a different part of the string (the first part being "https:", etc.). The last of these (index [-1]) will be the file name you want.
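To make the indexing concrete, here is what the intermediate list looks like for the example string:
web_address = "https://website.tv/live/streamidXXXXXXXXX.m3u8"
parts = web_address.split('/')
print(parts)
# ['https:', '', 'website.tv', 'live', 'streamidXXXXXXXXX.m3u8']
print(parts[-1])
# streamidXXXXXXXXX.m3u8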
This will extract all URLs from the webpage and keep only those which contain your required keyword ".m3u8":
import requests
import re

def get_desired_url(data):
    urls = []
    for url in re.findall(r'(https?://\S+)', data):
        if ".m3u8" in url:
            urls.append(url)
    return urls

channel1 = requests.get('https://website.tv/user/111111')
urls = get_desired_url(channel1.text)  # pass the page text, not the Response object
Try this; I think it will be more robust:
import re

# find the .m3u8 anchor tags in the page, then strip the surrounding markup
links = [
    re.sub(r'".*$', '', re.sub(r'^<[ ]*a[ ]+.*href[ ]*=[ ]*"', '', link))
    for link in re.findall(r'<[ ]*a[ ]+.*href[ ]*=[ ]*"http[s]*://.+\.m3u8".*>', channel2.text)
]
I've created a script in Python using a regular expression to parse emails from a few websites. The pattern I've used to grab emails is \w+@\w+\.{1}\w+, which works in most cases. However, trouble comes up when it encounters items like 8b4e078a51d04e0e9efdf470027f0ec1@sentry.wixpress, Slice_1@2x.png, etc. The pattern grabs them as well, which I would like to get rid of.
I've tried with:
import re
import requests

pattern = r'\w+@\w+\.{1}\w+'

urls = (
    'https://rainforestfarms.org/contact',
    'https://www.auucvancouver.ca/',
    'http://www.bcla.bc.ca/',
    'http://www.palstudiotheatre.com/',
)

def get_email(link, pattern):
    res = requests.get(link)
    email = re.findall(pattern, res.text)
    if email:
        return link, email[0]
    else:
        return link

if __name__ == '__main__':
    for link in urls:
        print(get_email(link, pattern))
Output I'm getting:
('https://rainforestfarms.org/contact', 'rainforestfarmsllc@gmail.com')
('https://www.auucvancouver.ca/', '8b4e078a51d04e0e9efdf470027f0ec1@sentry.wixpress')
('http://www.bcla.bc.ca/', 'Slice_1@2x.png')
('http://www.palstudiotheatre.com/', 'theatre@palvancouver.org')
Output I wish to get:
('https://rainforestfarms.org/contact', 'rainforestfarmsllc@gmail.com')
https://www.auucvancouver.ca/
http://www.bcla.bc.ca/
('http://www.palstudiotheatre.com/', 'theatre@palvancouver.org')
How can I get rid of unwanted items using regex?
It depends what you mean by "unwanted".
One way to define them is to use a whitelist of allowed domain suffixes, for example 'org', 'com', etc.
import re
import requests

pattern = r'\w+@\w+\.(?:com|org)'

urls = (
    'https://rainforestfarms.org/contact',
    'https://www.auucvancouver.ca/',
    'http://www.bcla.bc.ca/',
    'http://www.palstudiotheatre.com/',
)

def get_email(link, pattern):
    res = requests.get(link)
    email = re.findall(pattern, res.text)
    if email:
        return link, email[0]
    else:
        return link

for link in urls:
    print(get_email(link, pattern))
yields
('https://rainforestfarms.org/contact', 'rainforestfarmsllc@gmail.com')
https://www.auucvancouver.ca/
http://www.bcla.bc.ca/
('http://www.palstudiotheatre.com/', 'theatre@palvancouver.org')
You could obviously do more complex things such as blacklists or regex patterns for the suffix.
As always for this kind of question I strongly recommend using regex101 to check and understand your regex.
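For instance, a minimal sketch of the blacklist idea using a negative lookahead; the blocked suffixes here are just the ones from the unwanted matches above:
import re

# reject matches whose domain part ends in a blacklisted suffix
pattern = r'\w+@\w+\.(?!png\b|wixpress\b)\w+'

samples = [
    'rainforestfarmsllc@gmail.com',
    '8b4e078a51d04e0e9efdf470027f0ec1@sentry.wixpress',
    'Slice_1@2x.png',
    'theatre@palvancouver.org',
]

for sample in samples:
    print(sample, bool(re.search(pattern, sample)))
# only the two real addresses print True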
I am able to add a PPA using it, but I cannot remove one. I cannot find the correct syntax to remove the PPA from sources.list. Here's my code:
import aptsources.sourceslist as s
repo = ('deb', 'http://ppa.launchpad.net/danielrichter2007/grub-customizer/ubuntu', 'xenial', ['main'])
sources = s.SourcesList()
sources.add(repo)
sources.save()
#doesn't work
sources.remove(repo)
I tried reading the docs found here, but I still cannot find the right way to call sources.remove(repo).
The SourcesList.remove() help text reads remove(source_entry), which indicates that what it wants is a SourceEntry object. As it happens, sources.add() returns a SourceEntry object:
import aptsources.sourceslist as sl
sources = sl.SourcesList()
entry = sources.add('deb', 'mirror://mirrors.ubuntu.com/mirrors.txt', 'xenial', ['main'])
print(type(entry))
Outputs:
<class 'aptsources.sourceslist.SourceEntry'>
To remove the entry:
sources.remove(entry)
sources.save()
You can also disable it (which will leave a commented-out entry in sources.list):
entry.set_enabled(False)
sources.save()
I'm using this to do the removing for now.
import fileinput

filename = '/etc/apt/sources.list'
word = 'grub-customizer'

remove = fileinput.input(filename, inplace=1)
for line in remove:
    # with inplace=1, whatever gets printed replaces the current line in the
    # file, so skipping the print drops any line that mentions the word
    if word not in line:
        print(line, end='')
remove.close()
After using Beautiful Soup's soup.findAll('a', {'link': 'go to'}), I extracted a list of links like:
lis_links = ['https://foo.com/019774_s009_TEV 234.xml https://foo.com/019774_s009_TEV 23.xml https://foo.com/019774_s009_TEV24.xml https://foo.com/019774_s009_TEV 120.xml https://foo.com/WERW FOR INJ.xml']
As you can see, some links have blank spaces in them. How do I replace each space with its proper encoding (I guess it's %20)? I tried to use replace(' ', '%20'), but I don't have control over where to apply it.
Use a negative lookahead to find all spaces not followed by http: \s(?!http)
Python sample:
import re

def fixLinks(text):
    # replace each whitespace character that does not start another URL
    return re.sub(r"\s(?!http)", "%20", text)

links = ["https://foo.com/019774_s009_TEV 234.xml https://foo.com/019774_s009_TEV 23.xml https://foo.com/019774_s009_TEV24.xml https://foo.com/019774_s009_TEV 120.xml https://foo.com/WERW FOR INJ.xml"]
links[0] = fixLinks(links[0])
print(links[0])