Regex used within python giving unknown results - python

I've written a Python script that uses a regular expression to find phone numbers on two different sites. When I test the pattern below against the two phone numbers locally, it works flawlessly. However, when I run it against the websites, it no longer works: it only fetches two unrelated numbers, 1999 and 8211.
This is what I've tried so far:
import requests, re
links=[
'http://www.latamcham.org/contact-us/',
'http://www.cityscape.com.sg/?page_id=37'
]
def FetchPhone(site):
    res = requests.get(site).text
    phone = re.findall(r"\+?[\d]+\s?[\d]+\s?[\d]+",res)[0] #I'm not sure if it is an ideal pattern. Works locally though
    print(phone)
if __name__ == '__main__':
    for link in links:
        FetchPhone(link)
The output I wish to have:
+65 6881 9083
+65 93895060
This is what I meant by locally:
import re
phonelist = "+65 6881 9083,+65 93895060"
phone = [item for item in re.findall(r"\+?[\d]+\s?[\d]+\s?[\d]+",phonelist)]
print(phone) #it can print them
Post script: the phone numbers are not generated dynamically. When I print text then I can see the numbers in the console.

In your case, the regex below should return the required output:
r"\+\d{2}\s\d{4}\s?\d{4}"
Note that it only covers the two formats you mentioned:
+65 6881 9083
+65 93895060
and might not work in other cases.

You are using \d+\s?\d+ which will match 9 9, 99 and 1999 because the + quantifier allows the first \d+ to grab as many digits as it can while leaving at least one digit to the others. One solution is to state a specific number of repetitions you want (like in Andersson's answer).
I suggest you try regex101.com: it highlights what the regex is matching and capturing, which helps you visualize it. You can paste an example of the text you want to search and tweak your regex there.
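To see the difference between the two patterns, here is a minimal sketch. The html string is made-up sample text (not the content of either site), chosen so that it contains a stray four-digit number ahead of the phone numbers:

```python
import re

# Made-up sample markup: a year appears before the phone numbers.
html = '<p>Copyright 1999</p><span>Tel: +65 6881 9083</span> <span>+65 93895060</span>'

loose = r"\+?[\d]+\s?[\d]+\s?[\d]+"   # pattern from the question
strict = r"\+\d{2}\s\d{4}\s?\d{4}"    # pattern with fixed repetition counts

loose_matches = re.findall(loose, html)
strict_matches = re.findall(strict, html)

print(loose_matches)   # ['1999', '+65 6881 9083', '+65 93895060']
print(strict_matches)  # ['+65 6881 9083', '+65 93895060']
```

Because the question's code takes element [0] of the loose result, it returns the stray number rather than a phone number.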

Related

Regex: Identify string occurrence in first X characters after keyword

Suppose the following string:
text = """Microsoft enables digital transformation for the era of an intelligent cloud and an intelligent edge.
SOURCE Microsoft Corp."""
Goal:
I want to check if the company's name (Microsoft in the above example) occurs within the first X (250 for example) characters after the keyword "SOURCE".
Attempt:
source = re.compile(r"SOURCE.*")
re.findall(source,text)
#output ['SOURCE Microsoft Corp.']
In order to account for the character limitation in which the keyword should occur, I thought of using the .split() function on the output string and count the position at which the company's name occurs. This should work just fine if the company's name consists of one word only.
However, in cases where the company name includes multiple words (e.g., "Procter & Gamble") splitting the output string would result in ['SOURCE', 'Procter', '&', 'Gamble'] so that searching for the position of "Procter & Gamble" in this list wouldn't give back any results.
Is there a way I can implement the restriction that the company name has to occur within X characters after the keyword in the regex itself?
A performant alternative to a regex would be str.find with start and end parameters:
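limit = 250
p1 = text.find('SOURCE')
start = p1 + len('SOURCE')  # the search window begins right after the keyword
p2 = text.find('Microsoft', start, start + limit) if p1 != -1 else -1
p2 will be >= 0 if 'Microsoft' lies entirely within limit characters after 'SOURCE', and -1 otherwise.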
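The str.find approach also works for multi-word names such as "Procter & Gamble", since it searches for the whole substring. A minimal sketch, where occurs_within and its parameter names are hypothetical, and the window is assumed to start immediately after the keyword:

```python
def occurs_within(text, keyword, name, limit):
    """Return True if `name` lies entirely within `limit` characters after `keyword`."""
    start = text.find(keyword)
    if start == -1:
        return False  # keyword not present at all
    window_start = start + len(keyword)
    # str.find only reports a hit if the whole name fits inside [window_start, window_start + limit)
    return text.find(name, window_start, window_start + limit) != -1

text = "Some press release body.\nSOURCE Procter & Gamble"
found = occurs_within(text, "SOURCE", "Procter & Gamble", 250)
too_far = occurs_within(text, "SOURCE", "Procter & Gamble", 5)
print(found, too_far)  # True False
```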
You could put something between SOURCE and the company name. So if the company name (Microsoft in this example) is 9 characters and you need it to be within the first 200 characters directly following SOURCE there can be from 0 up to a maximum 200-9=191 characters before the company name. So you would write:
re.findall('SOURCE.{0,191}Microsoft', text)
The .{a,b} expression matches any character (except a newline, unless re.DOTALL is set) between a and b times.
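If the company name varies, the pattern can be built at run time. This is a sketch only: source_pattern is a hypothetical helper, re.escape is used so that characters like & or . in the name are treated literally, and re.DOTALL is an assumption that the name may sit on a later line than the keyword:

```python
import re

def source_pattern(name, limit):
    # Allow at most limit - len(name) characters between the keyword and the name,
    # so that the whole name ends within `limit` characters after SOURCE.
    gap = max(limit - len(name), 0)
    return re.compile(r"SOURCE.{0,%d}%s" % (gap, re.escape(name)), flags=re.DOTALL)

pattern = source_pattern("Procter & Gamble", 250)

hit = pattern.search("Intro paragraph.\nSOURCE Procter & Gamble")
miss = pattern.search("SOURCE " + "x" * 300 + " Procter & Gamble")

print(hit.group())  # SOURCE Procter & Gamble
print(miss)         # None
```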
Another solution using re (regex101):
import re
text = """Microsoft enables digital transformation for the era of an intelligent cloud and an intelligent edge.
SOURCE Microsoft Corp."""
pat = re.compile(r"(?<=SOURCE)(?=.{,250}Microsoft).*?Microsoft", flags=re.S)
if pat.search(text):
    print("Found")
Prints:
Found
You don't need a regex for this. Find the last occurrence of SOURCE with rfind, take the next 250 characters starting there, and check whether the word is in that slice:
text = "Microsoft enables digital transformation for the era of an intelligent cloud and an intelligent edge. SOURCE Microsoft Corp."
start = text.rfind('SOURCE')
print('Microsoft' in text[start:start + 250])
Output:
True
You can use the following regex to capture everything that comes X characters after 'SOURCE':
SOURCE.{250}(.*)
I only used 250 as an example; you can use any number you like.
For example, to capture exactly 'Microsoft Corp.' you could use SOURCE.{1}(.*), which skips the single space after the keyword.
The parentheses define a capture group, which is basically an output variable. In Python you can retrieve the capture groups using re.findall():
>>> import re
>>> r = re.compile(r'SOURCE.{1}(.*)')
>>> r.findall('SOURCE Microsoft Corp.')
['Microsoft Corp.']
EDIT: Many of this post's answers rely on using 'Microsoft' for finding the company name, but that doesn't make much sense IMO

How to find the match URL from a HTML page using RegEx Python

I'm trying to match the following URL by its query string in an HTML page using Python, but I have not been able to solve it. I'm a newbie in Python.
<a href="http://example.com/?query_id=9&user_id=49&token_id=4JGO4I394HD83E" id="838"/>
I want to match the above URL by &user_id=[any_number_from_0_to_99]& and print this URL on the screen.
A URL without &user_id=[any_number_from_0_to_99]& should not be matched.
Here's my horrible, incomplete regex:
https?:\/\/.{0,30}\.+[a-zA-Z0-9\/?_+=]{0,30}&user_id=[0-9][0-9]&.*?"
I know this code has many problems, but it somehow manages to match the above URL up to the " double quote.
Complete code would look like this:
import re
reg = re.compile(r'https?:\/\/.{0,30}\.+[a-zA-Z0-9\/?_+=]{0,30}&user_id=[0-9][0-9]&.*?"')
str = '<a href="http://example.com/?query_id=9&user_id=49&token_id=4JGO4I394HD83E" id="838"/>'
result = reg.search(str)
result = result.group()
print(result)
Output:
$ python reg.py
http://example.com/?query_id=9&user_id=49&token_id=4JGO4I394HD83E"
It shows the " at the end of the URL, and I know this is not a good regex. I want a better version of the code above.
A few remarks can be made on your regexp:
/ is not a special re character, there's no need to escape it
Is the 30-character limit on the domain intentional? If not, you can just select as many characters as you want with .*
Do you know that the string you're working with contains a valid URL? If not, there are some things you can do, like ensuring the domain is at least 4 characters long, contains a period which is not the last character, etc.
The [0-9][0-9] part will also match values like 04, which is not, strictly speaking, a number between 0 and 99, and it will not match single-digit values
Taking this into account, you can design this simpler regex:
reg = re.compile("https?://.*&user_id=[1-9][0-9]?&")
str = '<a href="http://example.com/?query_id=9&user_id=49&token_id=4JGO4I394HD83E" id="838"/>'
result = reg.search(str)
result = result.group()
print(result)
Using this regex on your example will print 'http://example.com/?query_id=9&user_id=49&', without the " at the end. If you want the full URL, you can look for the /> symbol:
reg = re.compile("https?://.*&user_id=[1-9][0-9]?&.*/>")
str = '<a href="http://example.com/?query_id=9&user_id=49&token_id=4JGO4I394HD83E" id="838"/>'
result = reg.search(str)
result = result.group()[:-2]
print(result)
Note the [:-2], which is used to remove the /> symbol. In that case, this code will print http://example.com/?query_id=9&user_id=49&token_id=4JGO4I394HD83E" id="838", so everything between the URL and the /> is still included.
Note also that these regexps use the . wildcard. Depending on whether you are sure that the strings you're working with contain only valid URLs, you may want to change this. For instance, you may want to restrict the domain name to ASCII characters by using the \w special sequence with the ASCII flag of the re module.
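One way to stop at the end of the href value is to replace the . wildcard with a character class that excludes the double quote and whitespace. This is a sketch under assumptions: the URL is assumed to sit inside a double-quoted attribute, and \d{1,2} accepts any one- or two-digit value, including ones with a leading zero such as 04:

```python
import re

html = '<a href="http://example.com/?query_id=9&user_id=49&token_id=4JGO4I394HD83E" id="838"/>'

# [^"\s]* cannot run past the closing quote of the attribute.
url_re = re.compile(r'https?://[^"\s]*&user_id=\d{1,2}&[^"\s]*')

match = url_re.search(html)
url = match.group() if match else None
print(url)  # http://example.com/?query_id=9&user_id=49&token_id=4JGO4I394HD83E

# A three-digit user_id is rejected.
no_match = url_re.search('<a href="http://example.com/?query_id=9&user_id=123&x=1"/>')
print(no_match)  # None
```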

How to search for symbols using regex

I'm learning Python and trying to get the user ID from an HTML page using regular expressions. (LTT is the website, just for practice.)
I want to be able to type 'findID username' into cmd and get back the 6-digit ID number.
I have spent hours trying different code and looking up references; maybe someone can explain it simply for me. I can configure the searchRegex object to correctly identify 6-digit numbers in the page, but it does not find the 6-digit combination I am looking for. (It grabs some other 6 digits instead of the 6 specific user ID digits.)
import re, requests, sys, time
if len(sys.argv)>1:
    search=requests.get('https://linustechtips.com/main/search/?&q='+str(sys.argv[1:])+'&type=core_members')
    searchRegex=re.compile(r"^'$\d\d\d\d\d\d^'$")
    ID=searchRegex.search(search.text)
    print(ID)
    time.sleep(10)
else:
    print('Enter a search term...')
I have tried many different ways of getting the code to recognise the ' symbol, but when I try it like this, it returns None. Why can the regex find 6 digits, but not 6 digits beginning and ending with '?
This is the HTML page I am testing it on.
view-source:https://linustechtips.com/main/search/?&q=missiontomine&type=core_members
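Try this regex: (?<=profile\/)\d{6}
The HTML text has the user ID as part of the URL, like:
https://linustechtips.com/main/profile/600895-missiontomine/?do=hovercard
(?<=profile\/) is a positive lookbehind: it requires the six digits to be preceded by profile/ without including that prefix in the match. Your pattern returns None because ^ and $ are anchors for the start and end of the string, not literal characters.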
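A minimal sketch of the lookbehind in Python. The html string here is a made-up anchor tag wrapped around the profile URL shown above, not the actual page source:

```python
import re

# Made-up markup containing the profile URL from the answer.
html = '<a href="https://linustechtips.com/main/profile/600895-missiontomine/?do=hovercard">missiontomine</a>'

id_re = re.compile(r"(?<=profile/)\d{6}")

match = id_re.search(html)
user_id = match.group() if match else None
print(user_id)  # 600895
```

search returns None when nothing matches, so the result is checked before calling group().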

Extract number (dash included) from string using regex in Python

Hello and thanks for helping.
String examples:
"Hello 43543" ---> "43543"
"John Doe 434-234" ---> "434-234"
I need a regex to extract the examples on the right.
I would do it following way:
import re
pattern = r'\d[0-9\-]*'
number1 = re.findall(pattern,'Hello 43543')
number2 = re.findall(pattern,'John Doe 434-234')
print(number1[0]) #43543
print(number2[0]) #434-234
My solution assumes that you are looking for any string that starts with a digit and whose remaining characters are digits or -. This means it will also grab, for example, 4--- or 9-2-4--- and so on; however, this might not be an issue in your use case.
Before writing a pattern, you should answer the question: what exactly should it match? My pattern works as intended for the examples you gave, but keep in mind that this does NOT automatically mean it will give the desired output for all the data you might want to process with it.
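If trailing or repeated dashes should not be part of the result, a stricter variant is to require digits after every dash. This is a sketch under the assumption that a valid number is groups of digits joined by single dashes:

```python
import re

# One or more digits, optionally followed by "-digits" groups.
pattern = re.compile(r"\d+(?:-\d+)*")

plain = pattern.findall("Hello 43543")
dashed = pattern.findall("John Doe 434-234")
messy = pattern.findall("9-2-4---")

print(plain)   # ['43543']
print(dashed)  # ['434-234']
print(messy)   # ['9-2-4']
```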
If all your strings are like this, you can achieve the same without re:
s = "John Doe 434-234"
n = s.split()[-1]
print(n)  # 434-234
It will split your string on spaces and give you the last field.

Repeating regex pattern in Python

I have a file with millions of retweets – like this:
RT #Username: Text_of_the_tweet
I just need to extract the username from this string.
Since I'm a total zero when it comes to regex, some time ago I was advised here to use
username = re.findall('#([^:]+)', retweet)
This works great for the most part, but sometimes I get lines like this:
RT #ReutersAero: Further pictures from the #MH17 crash site in in Grabovo, #Ukraine #MH17 - #reuterspictures (GRAPHIC): http://t.co/4rc7Y4…
I only need "ReutersAero" from the string, but since it contains another "#" and ":" it messes up the regex, and I get this output:
['ReutersAero', 'reuterspictures (GRAPHIC)']
Is there a way to use the regex only for the first instance it finds in the string?
You can use a regex like this:
RT #(\w+):
Match information:
MATCH 1
1. [4-15] `ReutersAero`
MATCH 2
1. [145-156] `AnotherAero`
You can use this python code:
import re
p = re.compile(r'RT #(\w+):')  # the ur'' prefix is a syntax error in Python 3
test_str = u"RT #ReutersAero: Further pictures from the #MH17 crash site in in Grabovo, #Ukraine #MH17 - #reuterspictures (GRAPHIC): http://t.co/4rc7Y4…\nRT #AnotherAero: Further pictures from the #MH17 crash site in in Grabovo, #Ukraine #MH17 - #reuterspictures (GRAPHIC): http://t.co/4rc7Y4…\n"
print(re.findall(p, test_str))  # ['ReutersAero', 'AnotherAero']
Is there a way to use the regex only for the first instance it finds in the string?
Do not use findall, but search.
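For example, keeping the pattern from the question and switching to re.search returns only the first match. A minimal sketch using the sample retweet from the question:

```python
import re

retweet = ("RT #ReutersAero: Further pictures from the #MH17 crash site in in Grabovo, "
           "#Ukraine #MH17 - #reuterspictures (GRAPHIC): http://t.co/4rc7Y4…")

# search stops at the first match; group(1) is the captured username.
match = re.search(r"#([^:]+)", retweet)
username = match.group(1) if match else None
print(username)  # ReutersAero
```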
