I need to get the Twitter handle from different types of data:
https://twitter.com/elonmusk
https://twitter.com/elonmusk/status/43940840234234
https://twitter.com/elonmusk?t=w5i1O32q6dM7usSQEaTGvA&s=09
#elonmusk
All of these should return elonmusk. I can also have more than one Twitter handle in a single message, so the result should be a list of all handles:
https://twitter.com/elonmusk https://twitter.com/elonmusk/status/43940840234234
This should return elonmusk,elonmusk.
import re
s = "some text..."
print(re.findall(r'(?:twitter\.com/|#)(\w+)', s))
This matches a sequence of word characters following "twitter.com/" or "#".
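Applied to a message containing several of the sample inputs (a quick sketch to show the list output):

```python
import re

def extract_handles(message):
    # Capture the word characters that follow either "twitter.com/" or "#".
    return re.findall(r'(?:twitter\.com/|#)(\w+)', message)

msg = ("https://twitter.com/elonmusk "
       "https://twitter.com/elonmusk/status/43940840234234")
print(extract_handles(msg))  # ['elonmusk', 'elonmusk']
```

The query-string variant also works, because \w+ stops matching at the "?".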
I'm still a newbie in Python but I'm trying to make my first little program.
My intention is to print only the link ending with .m3u8 (if available) instead of printing the whole web page.
The code I'm currently using:
import requests
channel1 = requests.get('https://website.tv/user/111111')
print(channel1.content)
print('\n')
channel2 = requests.get('https://website.tv/user/222222')
print(channel2.content)
print('\n')
input('Press Enter to Exit...')
The link I'm looking for always has 47 characters in total, and it's always the same model just changing the stream id represented as X:
https://website.tv/live/streamidXXXXXXXXX.m3u8
Can anyone help me?
You can use regex for this problem.
Explanation: in the expression, .*? lazily matches everything leading up to the extension, and whatever is enclosed in \b...\b must be present there. For example:
import re
link="https://website.tv/live/streamidXXXXXXXXX.m3u8"
p = re.findall(r'.*?\b\.m3u8\b', link)
print(p)
OUTPUT:
['https://website.tv/live/streamidXXXXXXXXX.m3u8']
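Since the question says the link always follows the same template, a tighter pattern anchored on the whole URL avoids accidental matches (a sketch; the \w{9} stream-id length is taken from the nine X placeholders and may need adjusting):

```python
import re

# Match the full known URL shape rather than just the extension.
pattern = r'https://website\.tv/live/streamid\w{9}\.m3u8'

page = 'lots of html ... https://website.tv/live/streamid123456789.m3u8 ... more html'
print(re.findall(pattern, page))  # ['https://website.tv/live/streamid123456789.m3u8']
```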
There are a few ways to go about this. One that springs to mind, which others have touched upon, is using regex with findall, which returns a list of matched URLs for each URL in our url_list.
Another option could be BeautifulSoup, but without more information about the HTML structure it may not be the best tool here.
Using Regex
from re import findall
from requests import get

def check_link(response):
    result = findall(
        r'.*?\b\.m3u8\b',
        response.text,
    )
    return result

def main(url):
    response = get(url)
    if response.ok:
        link_found = check_link(response)
        if link_found:
            print('link {} found at {}'.format(
                link_found,
                url,
            ))

if __name__ == '__main__':
    url_list = [
        'http://www.test_1.com',
        'http://www.test_2.com',
        'http://www.test_3.com',
    ]
    for url in url_list:
        main(url)
    print("All finished")
If I understand your question correctly I think you want to use Python's .split() string method. If your goal is to take a string like "https://website.tv/live/streamidXXXXXXXXX.m3u8" and extract just "streamidXXXXXXXXX.m3u8" then you could do that with the following code:
web_address = "https://website.tv/live/streamidXXXXXXXXX.m3u8"
specific_file = web_address.split('/')[-1]
print(specific_file)
Calling .split('/') on the string returns a list of strings where each item is a different part of the string (the first part being "https:", etc.). The last of these (index [-1]) is the file name you want.
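For instance, the intermediate list produced by the split looks like this:

```python
web_address = "https://website.tv/live/streamidXXXXXXXXX.m3u8"
parts = web_address.split('/')
print(parts)
# ['https:', '', 'website.tv', 'live', 'streamidXXXXXXXXX.m3u8']
print(parts[-1])  # streamidXXXXXXXXX.m3u8
```

Note the empty string at index 1: it comes from the two consecutive slashes in "https://".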
This will extract all URLs from the page and keep only those which contain your required keyword ".m3u8":
import requests
import re
def get_desired_url(data):
    urls = []
    for url in re.findall(r'(https?://\S+)', data):
        if ".m3u8" in url:
            urls.append(url)
    return urls

channel1 = requests.get('https://website.tv/user/111111')
urls = get_desired_url(channel1.text)
Try this; I think it will be more robust, since it only looks inside anchor tags:
import re
links = [re.sub(r'^<[ ]*a[ ]+.*href[ ]*=[ ]*', '', re.sub(r'.*>$', '', link))
         for link in re.findall(r'<[ ]*a[ ]+.*href[ ]*=[ ]*"http[s]*://.+\.m3u8".*>', channel2.text)]
I've created a script in Python using a regular expression to parse emails from a few websites. The pattern that I've used to grab emails is \w+@\w+\.{1}\w+, which works in most cases. However, trouble comes up when it encounters items like 8b4e078a51d04e0e9efdf470027f0ec1@sentry.wixpress, Slice_1@2x.png, etc. The pattern grabs them as well, which I would like to get rid of.
I've tried with:
import re
import requests
pattern = r'\w+@\w+\.{1}\w+'
urls = (
'https://rainforestfarms.org/contact',
'https://www.auucvancouver.ca/',
'http://www.bcla.bc.ca/',
'http://www.palstudiotheatre.com/',
)
def get_email(link, pattern):
    res = requests.get(link)
    email = re.findall(pattern, res.text)
    if email:
        return link, email[0]
    else:
        return link

if __name__ == '__main__':
    for link in urls:
        print(get_email(link, pattern))
Output I'm getting:
('https://rainforestfarms.org/contact', 'rainforestfarmsllc@gmail.com')
('https://www.auucvancouver.ca/', '8b4e078a51d04e0e9efdf470027f0ec1@sentry.wixpress')
('http://www.bcla.bc.ca/', 'Slice_1@2x.png')
('http://www.palstudiotheatre.com/', 'theatre@palvancouver.org')
Output I wish to get:
('https://rainforestfarms.org/contact', 'rainforestfarmsllc@gmail.com')
https://www.auucvancouver.ca/
http://www.bcla.bc.ca/
('http://www.palstudiotheatre.com/', 'theatre@palvancouver.org')
How can I get rid of unwanted items using regex?
It depends on what you mean by "unwanted".
One way to define them is to use a whitelist of allowed domain suffixes, for example 'org', 'com', etc.
import re
import requests
pattern = r'\w+@\w+\.(?:com|org)'
urls = (
'https://rainforestfarms.org/contact',
'https://www.auucvancouver.ca/',
'http://www.bcla.bc.ca/',
'http://www.palstudiotheatre.com/',
)
def get_email(link, pattern):
    res = requests.get(link)
    email = re.findall(pattern, res.text)
    if email:
        return link, email[0]
    else:
        return link

for link in urls:
    print(get_email(link, pattern))
yields
('https://rainforestfarms.org/contact', 'rainforestfarmsllc@gmail.com')
https://www.auucvancouver.ca/
http://www.bcla.bc.ca/
('http://www.palstudiotheatre.com/', 'theatre@palvancouver.org')
You could obviously do more complex things such as blacklists or regex patterns for the suffix.
As always for this kind of question I strongly recommend using regex101 to check and understand your regex.
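A blacklist variant of the same idea (a sketch using the standard @ separator; the listed suffixes are assumptions based on the unwanted matches above):

```python
import re

pattern = r'\w+@\w+\.\w+'
# Suffixes that signal a non-email match (image extensions, tracker hosts).
blacklist = ('png', 'jpg', 'gif', 'wixpress')

def filter_emails(text):
    # Keep email-shaped matches whose final dotted part is not blacklisted.
    return [m for m in re.findall(pattern, text)
            if m.rsplit('.', 1)[-1] not in blacklist]

print(filter_emails('write to theatre@palvancouver.org or see Slice_1@2x.png'))
# ['theatre@palvancouver.org']
```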
I need to get the UserID from a github link
Example:
https://github.com/myName/blabla/ or
https://github.com/myName/
https://github.com/myName
output should be myName.
My code:
(\/github.com\/)\/(.*?)(\/)?
You've got an extra slash in the middle, and you don't need the slash at the end, just grab characters that aren't slashes for the username:
/github\.com/([^/]+)
Your match will be in group 1.
Interactive demo: https://regex101.com/r/vEOksV/2
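In Python, that pattern can be wrapped in a small helper that handles all three URL forms (a sketch):

```python
import re

def github_user(url):
    # The first path segment after "github.com/" is the user name;
    # [^/]+ stops at the next slash (or the end of the string).
    match = re.search(r'github\.com/([^/]+)', url)
    return match.group(1) if match else None

for url in ('https://github.com/myName/blabla/',
            'https://github.com/myName/',
            'https://github.com/myName'):
    print(github_user(url))  # myName, three times
```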
import re
text = 'https://github.com/testName/'
try:
    found = re.search(r'github\.com/(.+?)/', text).group(1)
except AttributeError:
    # pattern not found
    found = ''  # apply your error handling

# found: testName
Interactive IDE with this example: https://onlinegdb.com/ry1w1dSB8
I'm trying to generate an XML file containing <document> tags with the following code.
string = "dasdd Wonder asdf new single, “Tomorrow” #URL# | " \
"oiojk asfddsf releases new asdfdf, “gfsg” | " \
"Identity of asfqw who dasd off asdfsdf Mainland jtyjyjui revealed #URL#"
from yattag import Doc, indent
import html, re

doc, tag, text = Doc().tagtext()

with tag('author', lang='en'):
    with tag('documents'):
        for tweet in string.split(' | '):
            with tag('document'):
                tweet = html.unescape(tweet)
                text('<![CDATA[{}]]'.format(tweet))

result = indent(doc.getvalue(), indentation=' ' * 4, newline='\n')

with open('test.xml', 'w', encoding='utf-8') as f:
    f.write(result)
I wanted to add a CDATA token around the text, but when I open the generated file using Notepad++, instead of getting the output as:
<document><![CDATA[oiojk asfddsf releases new asdfdf, “gfsg”]]></document>
it appears like (with HTML entities):
<document>&lt;![CDATA[oiojk asfddsf releases new asdfdf, “gfsg”]]</document>
I tried to use the html library (the html.unescape line) to discard the HTML entities, but it didn't work.
How can I solve this encoding issue?
The text method always replaces '<' with '&lt;'. If you wanted no escaping of that kind, you would use the asis method instead (it inserts the string "as is"). But in your case, it would be more appropriate to use yattag's cdata method.
from yattag import Doc
help(Doc.cdata)
cdata(self, strg, safe=False)
    Appends a CDATA section containing the supplied string.
    You don't have to worry about potential ]]> sequences that would terminate
    the CDATA section. They are replaced with ]]]]><![CDATA[>.
    If you're sure your string does not contain ]]>, you can pass safe=True.
    If you do that, your string won't be searched for ]]> sequences.
So, in your case, you can do:
for tweet in string.split(' | '):
    with tag('document'):
        tweet = html.unescape(tweet)
        doc.cdata(tweet)
I am using the Wikipedia API with the following request:
http://en.wikipedia.org/w/api.php?action=query&meta=globaluserinfo&guiuser=$cammer&guiprop=groups|merged|unattached&format=json
The problem is that I am unable to escape the dollar sign and similar characters. I tried the following, but it didn't work:
r['guiprop'] = u'groups|merged|unattached'
r['guiuser'] = u'$cammer'
I found this list on W3Schools, but checking it for every single character would be painful. What would be the best way to escape these in the string? http://www.w3schools.com/tags/ref_urlencode.asp
You should take a look at using urlencode.
from urllib import urlencode  # Python 2; on Python 3 it is urllib.parse.urlencode

base_url = "http://en.wikipedia.org/w/api.php?"
arguments = dict(action="query",
                 meta="globaluserinfo",
                 guiuser="$cammer",
                 guiprop="groups|merged|unattached",
                 format="json")
url = base_url + urlencode(arguments)
If you don't need to build a complete url you can just use the quote function for a single string:
>>> import urllib
>>> urllib.quote("$cammer")
'%24cammer'
So you end up with:
r['guiprop'] = urllib.quote(u'groups|merged|unattached')
r['guiuser'] = urllib.quote(u'$cammer')
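On Python 3, both helpers live in urllib.parse; the equivalent code (assuming the same request parameters as above) would be:

```python
from urllib.parse import urlencode, quote

base_url = "http://en.wikipedia.org/w/api.php?"
arguments = dict(action="query",
                 meta="globaluserinfo",
                 guiuser="$cammer",
                 guiprop="groups|merged|unattached",
                 format="json")
# urlencode percent-escapes every value, so "$" becomes "%24" and "|" becomes "%7C".
url = base_url + urlencode(arguments)
print(url)

# quote() escapes a single value on its own.
print(quote("$cammer"))  # %24cammer
```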