I am working on scraping a betting website for odds as my first web-scraping project. I have successfully scraped what I want so far and now have an array like this:
[<b>+5\xbd\xa0-110</b>, <b>-5\xbd\xa0-110</b>]
[<b>+6\xa0-115</b>, <b>-6\xa0-105</b>]
[<b>+6\xa0-115</b>, <b>-6\xa0-105</b>]
Is there a way I can just pull out the -105/110/115? The numbers I am looking for are those 3 to the left of the </b> and I also need to include the positive or negative sign to the left of the three numbers. Do I need to use a regular expression?
Thanks a lot!
Weston
Regex will work, provided this is the only format the numbers appear in.
Also, do you know whether the positive sign is shown, or are only negative signs shown?
If it does show positive...
([+-][\d]{3})<\/b>
If it doesn't show positive use...
([+-]?[\d]{3})<\/b>
http://regexr.com/3h08d
You should be able to extract the contents inside the round brackets.
Edit: you probably want to do something like below. This code will get each string from the list and then do a regex search on the string. It will append the result to the nums list. The result will be a 3 digit number with the sign in front, since it extracts the first group inside the round brackets.
import re

nums = []
for line in odds:
    # str() turns a BeautifulSoup tag into its HTML text so re can search it
    result = re.search(r'([+-]\d{3})</b>', str(line))
    if result:
        nums.append(result.group(1))
print(nums)
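If each row holds several tags, as in the array shown in the question, re.findall can collect every match per tag. The following is only a sketch: plain strings stand in for the scraped tags, and the names rows, pattern and extract_odds are made up for illustration.

```python
import re

# Plain strings stand in for the scraped <b> tags (assumption for illustration)
rows = [
    ["<b>+5\xbd\xa0-110</b>", "<b>-5\xbd\xa0-110</b>"],
    ["<b>+6\xa0-115</b>", "<b>-6\xa0-105</b>"],
]

# A sign followed by exactly three digits, immediately before the closing tag
pattern = re.compile(r"([+-]\d{3})</b>")

def extract_odds(rows):
    """Return every signed three-digit value found in the rows."""
    found = []
    for row in rows:
        for tag in row:
            # str() also works on real BeautifulSoup tags
            found.extend(pattern.findall(str(tag)))
    return found

odds_values = extract_odds(rows)
print(odds_values)
```

The values come back as strings; int() accepts the leading sign if numbers are needed.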
To give an example of what I'm trying to do, let's say there is a website that displays results of a lottery every hour. The webpage itself is static with the surrounding text staying the same and only the numbers changing (input by human not updated dynamically).
Something like The lucky number is: X where X indicates a different number each hour.
Now I want to run a python script that parses the number(s) each hour, and then at the end of the day would print out all the numbers in a nice format.
I know how to get the webpage content and get only the text parts of it without html tags etc by using the BeautifulSoup and requests libraries, however I'm not quite sure how to get the target number.
I was thinking something like a regex which would find a static word from the text e.g. 'number is:' in this case and then grab the word (number) right after it.
Is this doable? and if yes, how?
Thank you in advance.
It's possible with regex, but if you already know the string and it's static, a simple split on that string is enough.
Let's say
var = 'The lucky number is: 123'
out = int(var.split(':')[1])
out will be 123
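If the page text has more content after the number, the split approach would need extra trimming, so the regex idea from the question may be the simpler route. A minimal sketch, assuming the label text is exactly 'number is:' and that the helper name extract_number and the sample page text are hypothetical:

```python
import re

def extract_number(text, label="number is:"):
    """Return the integer that follows the label, or None if it is absent."""
    # re.escape keeps any special characters in the label literal
    match = re.search(re.escape(label) + r"\s*(\d+)", text)
    return int(match.group(1)) if match else None

# Hypothetical page text, standing in for what BeautifulSoup would return
page_text = "Welcome! The lucky number is: 123. Come back next hour."
print(extract_number(page_text))
```

Running this once an hour and appending the result to a list would give the day's numbers to print at the end.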
Learning Python and trying to get the User ID from a HTML page, through the use of Regular Expressions. (LTT is the website, just for practice).
I want to be able to type 'findID username' into cmd and return the 6 digit ID number.
I have spent hours trying different code and looking up references; maybe someone can explain it simply for me. I can configure the searchRegex object to correctly identify 6 digit numbers in the page, but it does not find the correct 6 digit combination that I am looking for. (It grabs another random 6 digits as opposed to the 6 specific User ID digits.)
import re, requests, sys, time

if len(sys.argv)>1:
    search=requests.get('https://linustechtips.com/main/search/?&q='+str(sys.argv[1:])+'&type=core_members')
    searchRegex=re.compile(r"^'$\d\d\d\d\d\d^'$")
    ID=searchRegex.search(search.text)
    print(ID)
    time.sleep(10)
else:
    print('Enter a search term...')
I have tried many different ways of getting the code to recognise the ' symbol, but when I try it like this, it returns None. Why can the regex find 6 digits but not 6 digits beginning and ending with '?
This is the HTML page I am testing it on.
view-source:https://linustechtips.com/main/search/?&q=missiontomine&type=core_members
Try Regex: (?<=profile\/)\d{6}
The html text has the userid as part of the url like:
https://linustechtips.com/main/profile/600895-missiontomine/?do=hovercard
(?<=profile\/) is a positive lookbehind: it requires profile/ to appear immediately before the six digits without including it in the match.
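In Python this could look like the sketch below. The html string is a stand-in built from the profile URL above rather than the real page source, and user_id is just an illustrative name. Note that ^ and $ in the original pattern are start and end anchors, not ways of escaping the quote, which is why that pattern can never match.

```python
import re

# Stand-in for search.text, built from the profile URL quoted above
html = '<a href="https://linustechtips.com/main/profile/600895-missiontomine/?do=hovercard">'

# The lookbehind requires "profile/" right before the digits
# but leaves it out of the matched text
match = re.search(r"(?<=profile/)\d{6}", html)
user_id = match.group(0) if match else None
print(user_id)
```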
I'm working with some poorly formatted HTML and I need to find every instance of a certain type of pattern. The issue is as follows:
a space, followed by a 1 to 3 digit number, followed by letters (a word, usually). Here are some examples of what I mean.
hello 7Out
how 99In
are 123May
So I would be looking for the expression to get the "7Out", "99In", "123May", etc. The initial space does not need to be included. I hope this is descriptive enough, as I am literally just starting to expose myself to regular expressions and am still struggling a bit. In the end, I will want to count the total number of these instances and add the total count to a df that already exists, so if you have any suggestions on how to do that I would be open to that as well. Thanks for your help in advance!
Your regular expression will be: r'\w\s(\d{1,3}[a-zA-Z]+)'
To get the count, call len() on the list returned by findall. The code will be:
import re
string='hello 70qwqeqwfwe123 12wfgtr123 34wfegr123 dqwfrgb'
result=re.findall(r'\w\s(\d{1,3}[a-zA-Z]+)',string)
print("result =", result)  # all the occurrences found, as a list
print("len(result) =", len(result))  # total number of occurrences
The result will be:
result = ['70qwqeqwfwe', '12wfgtr', '34wfegr']
len(result) = 3
Hint: findall evaluates the regular expression and returns results based on its groups; I'm using that to solve this problem.
Try these:
re.findall(r'(\w\s((\d{1,3})[a-zA-Z]+))',string)
re.findall(r'\w\s((\d{1,3})[a-zA-Z]+)',string)
To learn more about regular expressions, refer to the Python re documentation or the Tutorials Point tutorial; to experiment with the matching characters, use an online regex tester.
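The effect of grouping on what findall returns can be seen on the question's own examples. This is a small sketch; the sample text is assembled from the question, and the variable names whole and pairs are illustrative.

```python
import re

# Sample text assembled from the examples in the question
text = "hello 7Out how 99In are 123May"

# One group: findall returns a list of strings (the group's contents)
whole = re.findall(r"\w\s(\d{1,3}[a-zA-Z]+)", text)

# Two groups: findall returns a list of tuples, one item per group
pairs = re.findall(r"\w\s((\d{1,3})[a-zA-Z]+)", text)

print(whole)
print(pairs)
print(len(whole))  # the total count asked for in the question
```

The count from len(whole) is a plain integer, so it can be assigned into an existing DataFrame like any other value.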
I'm making a script to crawl through a web page and find all upper case names, equalling a number (ex. DUP_NB_FUNC=8). The part where my regular expression has to match only upper case letters however, does not seem to be working properly.
value = re.findall(r"[A-Z0-9_]*(?==\d).{2,}", input)
|tc_apb_conf_00.v:-:DUP_NB_FUNC=2
|:-:DUP_NB_FUNC=2
|:-:DUP_NB_FUNC=4
|:-:DUP_NB_FUNC=5
|tc_apb_conf_01.v:-:DUP_NB_FUNC=8
Desired output should look something like the above. However, I am getting:
|tc_apb_conf_00.v:-:=1" name="viewport"/>
|:-:DUP_NB_FUNC=2
|:-:DUP_NB_FUNC=4
|:-:DUP_NB_FUNC=5
|tc_apb_conf_01.v:-:DUP_NB_FUNC=8
Based on the input I can see it's finding a match starting at =1. However, I don't understand why, as I've put only A-Z in the regex range. I'd really appreciate a bit of assistance in clearing this up.
This should help:
[A-Z0-9_]+(?==\d).{2,}
or
\b[A-Z0-9_]*(?==\d).{2,}\b
In any case, your regex is rather unusual; given your requirement above, I suggest this:
[A-Z0-9_]+=\d+
Instead of using
(?==\d).{2,}: any two or more characters, with a lookahead ensuring that the first two are = and a digit,
you can just use
=\d+
Try this.
value = re.findall(r"[A-Z0-9_]+(?==\d).{2,}", input)
You want the upper-case character class to match at least once, which means you want the + quantifier rather than the * quantifier, which matches between zero and unlimited times.
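The difference is easy to reproduce on a small input. The text below is a made-up stand-in combining the viewport fragment from the unwanted output with one of the desired names; the variable names are illustrative.

```python
import re

# Made-up sample: an HTML fragment like the one in the unwanted output,
# followed by a line containing a name we do want
text = 'initial-scale=1" name="viewport"/>\nDUP_NB_FUNC=8\n'

# With *, the character class may match zero characters,
# so a match can start directly at "=1"
with_star = re.findall(r"[A-Z0-9_]*(?==\d).{2,}", text)

# With +, at least one upper-case letter, digit or underscore
# must come before the "="
with_plus = re.findall(r"[A-Z0-9_]+(?==\d).{2,}", text)

# The simpler pattern suggested above
simple = re.findall(r"[A-Z0-9_]+=\d+", text)

print(with_star)
print(with_plus)
print(simple)
```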
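I suggest compiling your pattern once and checking each item of your input against it:
value = re.compile(r"[A-Z0-9_:.-]+=\d+")  # hyphen placed last so it is literal, not a range
for i in tlist:
    # match() only matches at the start of i; use search() to match anywhere
    jee = value.match(i)
    if jee is not None:
        print(i)
Here tlist contains your input.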
The data files which I have look like:
Title
10000XX 1.09876543e+02
There are many lines in this form with the column 1 values ranging from 1000000-2000099 and with column 2 values ranging from -9000 to 9000 including some values with negative exponents. I am very new to regex so any help would be useful. The rest of my program is written in python so I am using:
re.search()
Some help with this syntax would be great.
Thanks
As Robert says, you can just use the split() function.
Assuming the separator is spaces like you have in the question, you can run the code below to give a list of values, then do with that as you will:
>>> line = "10000XX 1.09876543e+02"
>>> line.split()
['10000XX', '1.09876543e+02']
You can convert the second item to a floating point number with float(), e.g. float('1.09876543e+02').
Just iterate over your lines and ignore any that don't start with a number.
Regular expressions are a bit more fiddly.
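Putting those steps together might look like the sketch below. The function name parse_lines and the sample lines are assumptions for illustration (the second data line is invented to show a negative value with a negative exponent); a real script would read the lines from the data file instead.

```python
def parse_lines(lines):
    """Return (id, value) pairs from lines of the form '10000XX 1.09876543e+02'."""
    rows = []
    for line in lines:
        parts = line.split()
        # Skip titles, blank lines and anything that doesn't start with a digit
        if len(parts) != 2 or not parts[0][0].isdigit():
            continue
        # float() understands signs and exponent notation directly
        rows.append((parts[0], float(parts[1])))
    return rows

# Hypothetical sample lines in the format described in the question
sample = ["Title", "10000XX 1.09876543e+02", "", "10000YY -3.5e-01"]
rows = parse_lines(sample)
print(rows)
```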