Repeating regex pattern in Python - python

I have a file with millions of retweets – like this:
RT #Username: Text_of_the_tweet
I just need to extract the username from this string.
Since I'm a total zero when it comes to regex, some time ago I was advised here to use
username = re.findall('#([^:]+)', retweet)
This works great for the most part, but sometimes I get lines like this:
RT #ReutersAero: Further pictures from the #MH17 crash site in in Grabovo, #Ukraine #MH17 - #reuterspictures (GRAPHIC): http://t.co/4rc7Y4…
I only need "ReutersAero" from the string, but since it contains another "#" and ":" it messes up the regex, and I get this output:
['ReutersAero', 'reuterspictures (GRAPHIC)']
Is there a way to use the regex only for the first instance it finds in the string?

You can use a regex like this:
RT #(\w+):
Match information:
MATCH 1
1. [4-15] `ReutersAero`
MATCH 2
1. [145-156] `AnotherAero`
You can use this python code:
import re
p = re.compile(r'RT #(\w+):')  # the ur'' prefix is Python 2 only; r'' works in both
test_str = u"RT #ReutersAero: Further pictures from the #MH17 crash site in in Grabovo, #Ukraine #MH17 - #reuterspictures (GRAPHIC): http://t.co/4rc7Y4…\nRT #AnotherAero: Further pictures from the #MH17 crash site in in Grabovo, #Ukraine #MH17 - #reuterspictures (GRAPHIC): http://t.co/4rc7Y4…\n"
re.findall(p, test_str)

Is there a way to use the regex only for the first instance it finds in the string?
Do not use findall; use search, which stops at the first match.
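A minimal sketch of that suggestion, keeping the asker's original pattern. The helper name first_username and the shortened sample tweet are made up for illustration.

```python
import re

# Shortened, illustrative retweet (not the full tweet from the question).
retweet = "RT #ReutersAero: Pictures from the #MH17 site - #reuterspictures (GRAPHIC): link"

def first_username(line):
    """Return the text between the first '#' and the next ':', or None."""
    match = re.search(r'#([^:]+)', line)  # search stops at the first match
    return match.group(1) if match else None

print(first_username(retweet))
```

Unlike findall, search returns a match object (or None when nothing matches), so the result has to be checked before calling group.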

Related

RegEx works in regexr but not in python re

I have this regex: If you don't want these messages, please [a-zA-Z0-9öäüÖÄÜ<>\n\-=#;&?_ "/:.#]+settings<\/a>. It works on regexr but not when I am using the re
library in Python:
data = "<my text (comes from a file)>"
search = "If you don't want these messages, please [a-zA-Z0-9öäüÖÄÜ<>\n\-=#;&?_ \"/:.#]+settings<\/a>" # this search string comes from a database, so it's not hardcoded into my script
print(re.search(search, data))
Is there something I don't see?
Thank you!
The pattern you are using on regexr contains \- but your example shows \\-, which may give an incorrect regex. (Also add the r prefix in front of the string, as jupiterby said.)
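A small sketch of why the r prefix matters for patterns written as string literals. It uses an illustrative \b pattern and sample text, not the pattern from the question.

```python
import re

text = "please open settings now"

# Without the r prefix, Python turns "\b" into a backspace character (\x08)
# before the regex engine ever sees the pattern.
plain = "\bsettings\b"
# With the r prefix, the backslashes reach the regex engine untouched,
# so \b means "word boundary".
raw = r"\bsettings\b"

print(re.search(plain, text))        # None: the text contains no backspace
print(re.search(raw, text).group())
```

The prefix only affects literals typed into source code. A pattern read from a database or a file at runtime is not a literal, so its backslashes arrive exactly as stored.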

How to find the match URL from a HTML page using RegEx Python

I'm trying to match the following URL by its query string from an HTML page in Python, but I have not been able to solve it. I'm a newbie in Python.
<a href="http://example.com/?query_id=9&user_id=49&token_id=4JGO4I394HD83E" id="838"/>
I want to match the above URL with &user_id=[any_number_from_0_to_99]& and print this URL on the screen.
A URL without this &user_id=[any_number_from_0_to_99]& should not match.
Here's my horrible, incomplete regex:
https?:\/\/.{0,30}\.+[a-zA-Z0-9\/?_+=]{0,30}&user_id=[0-9][0-9]&.*?"
I know this code has many problems, but it somehow manages to match the above URL up to the closing double quote.
Complete code would look like this:
import re
reg = re.compile(r'https?:\/\/.{0,30}\.+[a-zA-Z0-9\/?_+=]{0,30}&user_id=[0-9][0-9]&.*?"')
str = '<a href="http://example.com/?query_id=9&user_id=49&token_id=4JGO4I394HD83E" id="838"/>'
result = reg.search(str)
result = result.group()
print(result)
Output:
$ python reg.py
http://example.com/?query_id=9&user_id=49&token_id=4JGO4I394HD83E"
It shows the " at the end of the URL. I know this is not a good regex; I would like a better version of it.
A few remarks can be made on your regexp:
/ is not a special re character, so there's no need to escape it.
Is the 30-character limit on the domain intentional? If not, you can match as many characters as you want with .*
Do you know that the string you're working with contains a valid URL? If not, there are some things you can do, like ensuring the domain is at least 4 characters long, contains a period which is not the last character, etc.
The [0-9][0-9] part will also match things like 04, which is not, strictly speaking, a number between 0 and 99.
Taking this into account, you can design this simpler regex:
reg = re.compile("https?://.*&user_id=[1-9][0-9]?&")
str = '<a href="http://example.com/?query_id=9&user_id=49&token_id=4JGO4I394HD83E" id="838"/>'
result = reg.search(str)
result = result.group()
print(result)
Using this regex on your example will print 'http://example.com/?query_id=9&user_id=49&', without the " at the end. If you want the rest of the URL as well, you can look for the /> symbol:
reg = re.compile("https?://.*&user_id=[1-9][0-9]?&.*/>")
str = '<a href="http://example.com/?query_id=9&user_id=49&token_id=4JGO4I394HD83E" id="838"/>'
result = reg.search(str)
result = result.group()[:-2]
print(result)
Note the [:-2], which removes the /> symbol. In that case, this code will print http://example.com/?query_id=9&user_id=49&token_id=4JGO4I394HD83E" id="838".
Note also that these regexps use the wildcard . (dot). Depending on whether you are sure that the strings you're working with contain only valid URLs, you may want to change this. For instance, a domain name can only contain ASCII characters. You may want to look at the \w special sequence with the ASCII flag of the re module.
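As an alternative sketch (not the answer's regex), the match can be anchored on the href attribute and [^"]* used in place of the wildcard, so it can never run past the closing quote. The function name find_user_url is hypothetical, and \d{1,2} is an assumption that any one- or two-digit id, including 0 and 04, should be accepted.

```python
import re

# Capture only what sits between href=" and the next double quote.
# [^"]* cannot cross the closing quote, so no trimming is needed afterwards.
URL_RE = re.compile(r'href="(https?://[^"]*&user_id=\d{1,2}&[^"]*)"')

def find_user_url(html):
    """Return the first href URL containing &user_id=<1-2 digits>&, or None."""
    match = URL_RE.search(html)
    return match.group(1) if match else None

page = '<a href="http://example.com/?query_id=9&user_id=49&token_id=4JGO4I394HD83E" id="838"/>'
print(find_user_url(page))
```

For anything beyond a quick script, an HTML parser plus urllib.parse is usually a more robust way to read attributes and query strings than a regex.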

How to filter out specific strings from a string

Python beginner here. I'm stumped on part of this code for a bot I'm writing.
I am making a reddit bot using Praw to comb through posts and remove a specific set of characters (Steam CD keys).
I made a test post here: https://www.reddit.com/r/pythonforengineers/comments/91m4l0/testing_my_reddit_scraping_bot/
This should have all the formats of keys.
Currently, my bot is able to find the post using a regex expression. I have these variables:
steamKey15 = (r'\w\w\w\w\w.\w\w\w\w\w.\w\w\w\w\w')
steamKey25 = (r'\w\w\w\w\w.\w\w\w\w\w.\w\w\w\w\w.\w\w\w\w\w.\w\w\w\w\w.')
steamKey17 = (r'\w\w\w\w\w\w\w\w\w\w\w\w\w\w\w\s\w\w')
I am finding the text using this:
subreddit = reddit.subreddit('pythonforengineers')
for submission in subreddit.new(limit=20):
    if submission.id not in steamKeyPostID:
        if re.search(steamKey15, submission.selftext, re.IGNORECASE):
            searchLogic()
            saveSteamKey()
This is just to show that the things I should be using in a filter function are a combination of steamKey15/25/17 and submission.selftext.
Here is the part where I am confused. I can't find a function that works or does what I want. My goal is to remove all the text from submission.selftext (the body of the post) EXCEPT the keys, which will eventually be saved in a .txt file.
Any advice on a good way to go about this? I've looked into re.sub and .translate, but I don't understand how the parts fit together.
I am using Python 3.7 if it helps.
Can't you just get the regexp results?
m = re.search(steamKey15, submission.selftext, re.IGNORECASE)
if m:
    print(m.group(0))
Also note that a dot . means any character in a regexp. If you want to match only literal dots, you should use \. instead. You can probably write your regexp like this:
r'\w{5}[-.]\w{5}[-.]\w{5}'
This will match the key when separated by . or by -.
Note that this will also match anything that begins or ends with a key, or has a key in the middle. That can cause problems, because your 15-character key regexp is contained in the 25-character one! To fix that, use negative lookahead and negative lookbehind:
r'(?<![\w.-])\w{5}[-.]\w{5}[-.]\w{5}(?![\w.-])'
That will only find the keys if there are no extraneous characters directly before or after them.
Another hint is to use re.findall instead of re.search - some posts contain more than one steam key in the same post! findall will return all matches while search only returns the first one.
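The points above can be sketched as follows. The post body and the keys in it are made up, and the 25-character pattern is an assumed extension of the answer's 15-character one (five groups of five instead of three).

```python
import re

# Lookarounds reject matches that touch extra word characters, dots or dashes,
# so a 15-character pattern cannot match inside a 25-character key.
KEY15 = re.compile(r'(?<![\w.-])\w{5}(?:[-.]\w{5}){2}(?![\w.-])')
KEY25 = re.compile(r'(?<![\w.-])\w{5}(?:[-.]\w{5}){4}(?![\w.-])')

# Hypothetical post body containing one key of each length.
selftext = ("Giving away ABCDE-FGHIJ-KLMNO today\n"
            "Also AAAAA-BBBBB-CCCCC-DDDDD-EEEEE for whoever wants it")

# findall returns every match; the groups are non-capturing, so each item
# is the whole key rather than a fragment of it.
keys15 = KEY15.findall(selftext)
keys25 = KEY25.findall(selftext)
print(keys15, keys25)
```

One consequence of the lookahead is that a key followed directly by a sentence-ending period is rejected, so the character classes may need loosening for real posts.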
A couple of things first: . means any character in regex. I think you know that, but just to be sure. Also, \w\w\w\w\w can be replaced with \w{5}, which specifies exactly 5 word characters. I would use re.findall.
import re
steamKey15 = (r'(?:\w{5}.){2}\w{5}')
steamKey25 = (r'(?:\w{5}.){5}')
steamKey17 = (r'\w{15}\s\w\w')
subreddit = reddit.subreddit('pythonforengineers')
for submission in subreddit.new(limit=20):
    if submission.id not in steamKeyPostID:
        finds_15 = re.findall(steamKey15, submission.selftext)
        finds_25 = re.findall(steamKey25, submission.selftext)
        finds_17 = re.findall(steamKey17, submission.selftext)

Regex used within python giving unknown results

I've written a script in Python using a regular expression to find phone numbers on two different sites. When I try the pattern below on the two phone numbers locally, it works flawlessly. However, when I try the same on the websites, it no longer works. It only fetches two unrelated numbers, 1999 and 8211.
This is what I've tried so far:
import requests, re
links = [
    'http://www.latamcham.org/contact-us/',
    'http://www.cityscape.com.sg/?page_id=37'
]
def FetchPhone(site):
    res = requests.get(site).text
    phone = re.findall(r"\+?[\d]+\s?[\d]+\s?[\d]+",res)[0]  # I'm not sure if it is an ideal pattern. Works locally though
    print(phone)
if __name__ == '__main__':
    for link in links:
        FetchPhone(link)
The output I wish to have:
+65 6881 9083
+65 93895060
This is what I meant by locally:
import re
phonelist = "+65 6881 9083,+65 93895060"
phone = [item for item in re.findall(r"\+?[\d]+\s?[\d]+\s?[\d]+",phonelist)]
print(phone) #it can print them
Post script: the phone numbers are not generated dynamically. When I print text then I can see the numbers in the console.
In your case, the regex below should return the required output:
r"\+\d{2}\s\d{4}\s?\d{4}"
Note that it applies to the two formats mentioned:
+65 6881 9083
+65 93895060
and might not work in other cases.
You are using \d+\s?\d+ which will match 9 9, 99 and 1999 because the + quantifier allows the first \d+ to grab as many digits as it can while leaving at least one digit to the others. One solution is to state a specific number of repetitions you want (like in Andersson's answer).
I suggest you try regex101.com; it highlights matches to help you visualize what the regex is matching and capturing. There you can paste an example of the text you want to search and tweak your regex.
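To see the difference between the two patterns side by side, here is a short sketch run on a made-up snippet of page text (the real pages are not reproduced here). The variable names loose and strict are only for illustration.

```python
import re

# Hypothetical page text: a year appears before the phone numbers,
# as often happens in real HTML (copyright notices, ids, and so on).
page_text = "Copyright 1999-2018. Call +65 6881 9083 or +65 93895060."

loose = r"\+?[\d]+\s?[\d]+\s?[\d]+"   # pattern from the question
strict = r"\+\d{2}\s\d{4}\s?\d{4}"    # pattern from the answer above

# The loose pattern accepts any run of three or more digits, so the
# first hit is the year rather than a phone number.
first_loose = re.findall(loose, page_text)[0]
# The strict pattern requires the leading + and fixed group lengths.
phones = re.findall(strict, page_text)
print(first_loose, phones)
```

Taking [0] from findall is what hides the problem in the original script: the phone numbers are found too, just not first.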

How to find and replace a special URL pattern (Markdown syntax to HTML) with the re module in Python

I have a string, and I want to search it for a special pattern containing a URL and its name, and then change its format:
Input string:
'This is my [site](http://example.com/url) you can watch it.'
Output string:
'This is my <a href="http://example.com/url">site</a> you can watch it.'
The string may contain several URLs, and I need to change the format of every one. The link text is Unicode and can contain any character in any language.
What pattern should be used and how I can do it?
This should help
import re
A = 'Thsi is my [site](http://example.com/url) you can watch it.'
site = re.compile(r"\[(.*)\]").search(A).group(1)
url = re.compile(r"\((.*)\)").search(A).group(1)
print(A.replace("[{0}]".format(site), "").replace("({0})".format(url), '<a href="{0}">{1}</a>'.format(url, site)))
Output:
Thsi is my <a href="http://example.com/url">site</a> you can watch it.
Update, as requested in the comments:
s = 'my [site](site.com) is about programing (python language)'
site, url = s[s.find("[")+1:s.find(")")].split("](")
print(s.replace("[{0}]".format(site), "").replace("({0})".format(url), '<a href="{0}">{1}</a>'.format(url, site)))
Output:
my <a href="site.com">site</a> is about programing (python language)
I'm not a markdown expert, but if this is indeed markdown that you're trying to replace, and not your own syntax, you should use an appropriate parser. Note that, if you paste your string directly into stackoverflow - which also uses markdown - it will be transformed into a link, so it would clearly be valid markdown.
If it is indeed your own format, however, try the following to transform
'This is my [site](http://example.com/url) you can watch it.'
into
'This is my <a href="http://example.com/url">site</a> you can watch it.'
using the following match:
\[(.*?)\]\((.*?)\)
and the following replacement string (write it as a raw string in Python):
<a href="\2">\1</a>
In Python, re.sub(match, replace, stringThatYouWantToReplaceStuffIn) should do the trick. Don't forget to assign the return value of re.sub to whatever variable should contain the new string.
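A runnable sketch of that re.sub call. The function name md_links_to_html is hypothetical; the pattern is the non-greedy one given above, and the replacement is a raw string with \1 for the link text and \2 for the URL.

```python
import re

# [text](url): group 1 is the link text, group 2 is the URL.
# The non-greedy .*? keeps each match inside a single link.
LINK_RE = re.compile(r'\[(.*?)\]\((.*?)\)')

def md_links_to_html(text):
    """Replace every [text](url) in the string with an HTML anchor."""
    return LINK_RE.sub(r'<a href="\2">\1</a>', text)

print(md_links_to_html('This is my [site](http://example.com/url) you can watch it.'))
print(md_links_to_html('my [site](site.com) is about programing (python language)'))
```

Because sub replaces every match, strings with several links are handled in one call, and plain parentheses with no preceding [text] are left alone. In Python 3, str patterns are Unicode-aware by default, so link text in any language is matched by .*? as well.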
