How to search for symbols using regex - python

I'm learning Python and trying to get the User ID from an HTML page using regular expressions. (LTT is the website, just for practice.)
I want to be able to type 'findID username' into cmd and get back the 6 digit ID number.
I have spent hours trying different code and looking up references; maybe someone can explain it simply for me. I can configure the searchRegex object to correctly identify 6 digit numbers in the page, but it does not find the 6 digit combination that I am looking for. (It grabs another, unrelated 6 digits instead of the 6 specific User ID digits.)
import re, requests, sys, time
if len(sys.argv)>1:
    search=requests.get('https://linustechtips.com/main/search/?&q='+str(sys.argv[1:])+'&type=core_members')
    searchRegex=re.compile(r"^'$\d\d\d\d\d\d^'$")
    ID=searchRegex.search(search.text)
    print(ID)
    time.sleep(10)
else:
    print('Enter a search term...')
I have tried many different ways of getting the code to recognise the ' symbol, but when I try it like this, it returns None. Why can the regex find 6 digits, but not 6 digits beginning and ending with '?
This is the HTML page I am testing it on.
view-source:https://linustechtips.com/main/search/?&q=missiontomine&type=core_members

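Try this regex: (?<=profile\/)\d{6}
The HTML text has the user ID as part of the URL, like:
https://linustechtips.com/main/profile/600895-missiontomine/?do=hovercard
(?<=profile\/) is a positive lookbehind: it requires profile/ immediately before the six digits without including it in the match.
In your original pattern, ^ and $ are anchors for the start and end of the string, not a way to escape the quote, so that pattern can never match.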
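A minimal sketch of the lookbehind in Python, run against the example URL above instead of a live request. The html string and variable names are illustrative assumptions, not the real page source.

```python
import re

# Illustrative stand-in for the downloaded page text (search.text in the question).
html = '<a href="https://linustechtips.com/main/profile/600895-missiontomine/?do=hovercard">'

# Six digits that are immediately preceded by "profile/".
pattern = re.compile(r"(?<=profile/)\d{6}")

match = pattern.search(html)
user_id = match.group(0) if match else None
print(user_id)  # 600895
```

Inside a Python string the forward slash needs no backslash, so profile/ and profile\/ behave the same here.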

Related

How can I customise error messages shown by PyInputPlus?

How can I customise error messages shown by PyInputPlus in python?
I have tried many methods but have been unable to do it.
import pyinputplus as pyip
number = pyip.inputNum("Enter your phone number : ",
                       min=1000000000,
                       max=9999999999)
number
I want the error message to be "please enter a valid 10 digit phone number".
Is there any way to do it?
I tried to use allowRegexes and blockRegexes but was unable to understand them.
If you are using this for a real-world project, I would recommend using Python's built-in input instead; this library doesn't seem very well documented or maintained, which could lead to a lot of strange errors in your code later on.
But to answer your question, you could do it using a regex with the blockRegexes parameter. If you were unable to understand it, this is more of a regex question than a Python question.
You can learn a lot about regex from this website, which I recommend; regex is a very important tool to understand.
About your problem, according to the docs:
blocklistRegexes (Sequence, None): A sequence of regex str or
(regex_str, error_msg_str) tuples that, if matched, will
explicitly fail validation.
So, in your case, the first item in the tuple should be a regex that matches everything you want to reject, that is, any input that is not exactly 10 digits. Note that ^\d{10}$ on its own would do the opposite and block the valid numbers, so negate it with a lookahead:
^(?!\d{10}$)
The second item in your tuple should be the string you want to appear when the error occurs:
"please enter a valid 10 digit phone number"
So your code would be like this:
number = pyip.inputNum("Enter your phone number : ",
                       min=1000000000,
                       max=9999999999,
                       blockRegexes=[(r"^(?!\d{10}$)", "please enter a valid 10 digit phone number")])
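Since a block regex rejects whatever it matches, it helps to check which inputs a candidate pattern actually matches before handing it to the library. Below is a minimal sketch using only the standard re module; the is_blocked helper and the sample inputs are hypothetical, and the negative lookahead is one possible way to express "not exactly ten digits".

```python
import re

# Matches at the start of any input that is NOT exactly ten digits.
BLOCK_PATTERN = re.compile(r"^(?!\d{10}$)")

def is_blocked(text):
    """Return True if the block pattern matches, i.e. the input would be rejected."""
    return BLOCK_PATTERN.search(text) is not None

samples = ["9876543210", "12345", "12345678901", "98765abcde"]
results = {sample: is_blocked(sample) for sample in samples}

for sample, blocked in results.items():
    print(sample, "-> rejected" if blocked else "-> accepted")
```

Only the ten-digit input is accepted; the shorter, longer, and non-numeric inputs all match the block pattern.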

How to get python to search for whole numbers in a string-not just digits

Okay please do not close this and send me to a similar question because I have been looking for hours at similar questions with no luck.
Python can search for digits using re.search([0-9])
However, I want to search for any whole number. It could be 547 or 2 or 16589425. I don't know how many digits there are going to be in each whole number.
Furthermore I need it to specifically find and match numbers that are going to take a form similar to this: 1005.2.15 or 100.25.1 or 5.5.72 or 1102.170.24 etc.
It may be that there isn't a way to do this using re.search but any info on what identifier I could use would be amazing.
Just use
import re
your_string = 'this is 125.156.56.531 and this is 0540505050.5 !'
result = re.findall(r'\d[\d\.]*', your_string)
print(result)
output
['125.156.56.531', '0540505050.5']
Assuming that you're looking for whole numbers only, try re.search(r"[0-9]+", your_string). The + means "one or more", so it matches a run of digits of any length.
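For the dotted form from the question (1005.2.15, 100.25.1 and so on), one possible pattern is a run of digits followed by zero or more ".digits" groups. This is a sketch with made-up sample text, not the only way to write it.

```python
import re

# Made-up sample text containing the number shapes from the question.
text = "versions 1005.2.15 and 100.25.1, then 5.5.72 and 42."

# One or more digits, then any number of ".digits" groups.
# (?:...) is a non-capturing group, so findall returns whole matches.
dotted = re.findall(r"\d+(?:\.\d+)*", text)
print(dotted)  # ['1005.2.15', '100.25.1', '5.5.72', '42']
```

Unlike \d[\d\.]*, this pattern does not swallow a trailing full stop, so the final "42." yields '42'.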

Regex used within python giving unknown results

I've written a script in Python using a regular expression to find phone numbers on two different sites. When I try the pattern below on the two phone numbers locally, it works flawlessly. However, when I try the same pattern on the websites, it no longer works. It only fetches two unrelated numbers, 1999 and 8211.
This is what I've tried so far:
import requests, re

links=[
    'http://www.latamcham.org/contact-us/',
    'http://www.cityscape.com.sg/?page_id=37'
]

def FetchPhone(site):
    res = requests.get(site).text
    phone = re.findall(r"\+?[\d]+\s?[\d]+\s?[\d]+",res)[0]  #I'm not sure if it is an ideal pattern. Works locally though
    print(phone)

if __name__ == '__main__':
    for link in links:
        FetchPhone(link)
The output I wish to have:
+65 6881 9083
+65 93895060
This is what I meant by locally:
import re
phonelist = "+65 6881 9083,+65 93895060"
phone = [item for item in re.findall(r"\+?[\d]+\s?[\d]+\s?[\d]+",phonelist)]
print(phone) #it can print them
Postscript: the phone numbers are not generated dynamically. When I print the response text, I can see the numbers in the console.
In your case, the regex below should return the required output:
r"\+\d{2}\s\d{4}\s?\d{4}"
Note that it applies to the formats mentioned:
+65 6881 9083
+65 93895060
and might not work in other cases
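A quick check of that pattern against the two numbers from the question (the variable names are illustrative):

```python
import re

phonelist = "+65 6881 9083,+65 93895060"

# "+", two digits, a space, four digits, an optional space, four digits.
strict = re.compile(r"\+\d{2}\s\d{4}\s?\d{4}")

phones = strict.findall(phonelist)
print(phones)  # ['+65 6881 9083', '+65 93895060']
```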
You are using \d+\s?\d+ which will match 9 9, 99 and 1999 because the + quantifier allows the first \d+ to grab as many digits as it can while leaving at least one digit to the others. One solution is to state a specific number of repetitions you want (like in Andersson's answer).
I suggest you try regex101.com, it will highlight to help you visualize what the regex is matching and capturing. There you can paste an example of the text you want to search and tweak your regex.
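The backtracking described above can be reproduced without fetching the sites. In this sketch the page text is invented for illustration; it only assumes that some bare number such as a year appears before the phone number.

```python
import re

# Invented page text: a year appears before the phone number.
page = "Copyright 1999. Call +65 6881 9083"

loose = r"\+?[\d]+\s?[\d]+\s?[\d]+"
strict = r"\+\d{2}\s\d{4}\s?\d{4}"

# The loose pattern splits "1999" across its three digit groups,
# so the year is returned first.
loose_hits = re.findall(loose, page)
strict_hits = re.findall(strict, page)

print(loose_hits)   # ['1999', '+65 6881 9083']
print(strict_hits)  # ['+65 6881 9083']
```

Taking element [0] of the loose result is what produces an unrelated number like 1999.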

Extracting 4 characters out of an HTML Array, Python

I am working on scraping a betting website for odds as my first web-scraping project. I have successfully scraped what I want so far and now have an array like this
[<b>+5\xbd\xa0-110</b>, <b>-5\xbd\xa0-110</b>]
[<b>+6\xa0-115</b>, <b>-6\xa0-105</b>]
[<b>+6\xa0-115</b>, <b>-6\xa0-105</b>]
Is there a way I can just pull out the -105/110/115? The numbers I am looking for are those 3 to the left of the </b> and I also need to include the positive or negative sign to the left of the three numbers. Do I need to use a regular expression?
Thanks a lot!
Weston
A regex will work, provided this is the only format the numbers come in.
Also, do you know whether the positive sign is always shown, or only the negative sign?
If it does show positive...
([+-][\d]{3})<\/b>
If it doesn't show positive use...
([+-]?[\d]{3})<\/b>
http://regexr.com/3h08d
You should be able to extract the contents inside the round brackets.
Edit: you probably want to do something like below. This code will get each string from the list and then do a regex search on the string. It will append the result to the nums list. The result will be a 3 digit number with the sign in front, since it extracts the first group inside the round brackets.
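import re

nums = []
for line in odds:  # odds: the scraped <b> elements from the question
    # str() lets the search work on tag objects as well as plain strings
    result = re.search(r'([+-]\d{3})</b>', str(line))
    if result:  # skip entries that do not match
        nums.append(result.group(1))
print(nums)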
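For a version that can be run without the scraper, here is a sketch where the scraped elements are replaced by plain strings copied from the question. The odds list is an assumption standing in for the real scraped data.

```python
import re

# Plain-string stand-ins for the scraped <b> elements (\xbd is ½, \xa0 is a non-breaking space).
odds = [
    '<b>+5\xbd\xa0-110</b>',
    '<b>-5\xbd\xa0-110</b>',
    '<b>+6\xa0-115</b>',
    '<b>-6\xa0-105</b>',
]

# A sign followed by exactly three digits, directly before the closing tag.
pattern = re.compile(r'([+-]\d{3})</b>')

nums = []
for line in odds:
    result = pattern.search(line)
    if result:
        nums.append(result.group(1))

print(nums)  # ['-110', '-110', '-115', '-105']
```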

Beginner with regular expressions; need help writing a specific query - space, followed by 1-3 numbers, followed by any number of letters

I'm working with some poorly formatted HTML and I need to find every instance of a certain type of pattern. The issue is as follows:
a space, followed by a 1 to 3 digit number, followed by letters (a word, usually). Here are some examples of what I mean.
hello 7Out
how 99In
are 123May
So I would be looking for the expression to get the "7Out", "99In", "123May", etc. The initial space does not need to be included. I hope this is descriptive enough, as I am literally just starting to expose myself to regular expressions and am still struggling a bit. In the end, I will want to count the total number of these instances and add the total count to a df that already exists, so if you have any suggestions on how to do that I would be open to that as well. Thanks for your help in advance!
Your regular expression will be: r'\w\s(\d{1,3}[a-zA-Z]+)'
So in order to get count you can use len() upon list returned by findall. The code will be
import re
string='hello 70qwqeqwfwe123 12wfgtr123 34wfegr123 dqwfrgb'
result=re.findall(r'\w\s(\d{1,3}[a-zA-Z]+)',string)
print("result =", result)  # all the occurrences found, as a list
print("len(result) =", len(result))  # the total number of occurrences
The result will be:
result = ['70qwqeqwfwe', '12wfgtr', '34wfegr']
len(result) = 3
Hint: when the pattern contains a capturing group, findall returns the text of the group rather than the whole match. I'm using that to drop the leading character and space from each result.
Try these:
re.findall(r'(\w\s((\d{1,3})[a-zA-Z]+))',string)
re.findall(r'\w\s((\d{1,3})[a-zA-Z]+)',string)
To learn more about regular expressions, see the Python re documentation and the Tutorials Point regex tutorial; an online regex tester is handy for experimenting with the matching characters.
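To see what the nested groups in the two patterns above return, here is a small sketch on part of the question's example text. With more than one capturing group, findall returns a tuple per match, one entry per group, ordered by opening parenthesis.

```python
import re

string = 'hello 7Out how 99In'

# Three groups: the whole match, the number plus word, and the number alone.
three_groups = re.findall(r'(\w\s((\d{1,3})[a-zA-Z]+))', string)

# Two groups: the number plus word, and the number alone.
two_groups = re.findall(r'\w\s((\d{1,3})[a-zA-Z]+)', string)

print(three_groups)  # [('o 7Out', '7Out', '7'), ('w 99In', '99In', '99')]
print(two_groups)    # [('7Out', '7'), ('99In', '99')]
print(len(two_groups))  # 2, the count of occurrences
```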
