As part of working through 'Automate the Boring Stuff' I am trying to learn how to code in Python. One of the exercises is to create a web scraper using Beautiful Soup and Requests.
I decided to try Amazon's stock price instead of the price of a product on Amazon. I managed to get it to work, but the output was several lines.
So I wanted to use a regex to return just the stock price, and not the loss/gain and timestamp as well.
However, it kept giving me syntax errors on line 1. I've tried removing the regex part to return the code to just the bs4 and requests version I started with, but that still gave me the syntax error (I am using VSC to avoid parenthesis errors).
Where am I going wrong? And, depending on how wrong, what would the correct code look like?
My code currently looks like this:
import bs4, requests, re

def extractedStockPrice(price):
    stockPriceRegex = re.compile(r'''
        [0-9]?
        ,?
        [0-9]+
        /.
        [0-9]*
        ''', re.VERBOSE)
    return stockPriceRegex.search(price)

def getStockPrice(stockUrl):
    res = requests.get(stockUrl)
    res.raise_for_status()
    soup = bs4.BeautifulSoup(res.text, 'html.parser')
    elems = soup.select('#quote-header-info > div.My\(6px\).Pos\(r\).smartphone_Mt\(6px\)')
    return elems[0].text.strip()

stockPrice = extractedStockPrice(getStockPrice('https://finance.yahoo.com/quote/AMZN?p=AMZN&.tsrc=fin-srch'))
print('The price is ' + stockPrice)
The issue seems to be with your regex, in the function extractedStockPrice. It does not match the price string, so the search returns None, which causes the TypeError mentioned in the comments.
The price string variable, by the time it reaches the regex part, looks like this (example):
'2,042.76-0.24 (-0.01%)At close: 4:00PM EDT'
You can use a regex syntax checker to confirm your regex: https://www.regexpal.com/ (paste the above string as the "Test String" and your regex as the "Regular Expression").
It looks like your forward slash should be a backslash. Also, you need to extract the match once it is found; you can do this with group(0) (see re.search in the docs: https://docs.python.org/3/library/re.html).
The below code should work (run with Python 3.7):
import bs4, requests, re

def extractedStockPrice(price):
    # fixes here:
    # 1) use backslash "\" instead of "/".
    # 2) use ".group(0)" to extract match.
    stockPriceRegex = re.compile(r'''[0-9]?,?[0-9]+\.[0-9]*''', re.VERBOSE)
    return stockPriceRegex.search(price).group(0)

def getStockPrice(stockUrl):
    res = requests.get(stockUrl)
    res.raise_for_status()
    soup = bs4.BeautifulSoup(res.text, 'html.parser')
    elems = soup.select('#quote-header-info > div.My\(6px\).Pos\(r\).smartphone_Mt\(6px\)')
    return elems[0].text.strip()

stockPrice = extractedStockPrice(getStockPrice('https://finance.yahoo.com/quote/AMZN?p=AMZN&.tsrc=fin-srch'))
print('The price is ' + stockPrice)
Result: "The price is 2,042.76".
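One caveat: if Yahoo changes its markup, search() can return None and the .group(0) call will then raise an AttributeError. Here is a minimal defensive variant of the same function (a sketch, not verified against the live page):

def extractedStockPrice(price):
    # Guard against a failed match so a layout change gives a clear error
    # instead of "AttributeError: 'NoneType' object has no attribute 'group'".
    stockPriceRegex = re.compile(r'[0-9]?,?[0-9]+\.[0-9]*')
    match = stockPriceRegex.search(price)
    if match is None:
        raise ValueError('No price found in: %r' % price)
    return match.group(0)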
Related
I have written a regex to scrape the data from the web page. However, I am getting the mentioned error and I am not able to find a solution to it.
Someone had suggested:
try:
    code
except:
    Attribute error
Original Code:
import urllib.request
import bs4
import re

url = 'https://ipinfo.io/AS7018'

def url_to_soup(url):
    req = urllib.request.Request(url)
    opener = urllib.request.build_opener()
    html = opener.open(req)
    soup = bs4.BeautifulSoup(html, "html.parser")
    return soup

s = str(url_to_soup(url))
#print(s)

asn_code, name = re.search(r'<h3 class="font-semibold m-0 t-xs-24">(?P<ASN_CODE>AS\d+) (?P<NAME>[\w.\s]+)</h3>', s)\
    .groups()  # Error code
print(asn_code)

""" This is where the error : From above code """
country = re.search(r'.*href="/countries.*">(?P<COUNTRY>.*)?</a>', s).group("COUNTRY")
print(country)

registry = re.search(r'Registry.*?pb-md-1">(?P<REGISTRY>.*?)</p>', s, re.S).group("REGISTRY").strip()
print(registry)
# flag re.S make the '.' special character match any character at all, including a newline;

ip = re.search(r'IP Addresses.*?pb-md-1">(?P<IP>.*?)</p>', s, re.S).group("IP").strip()
print(ip)
The statement:
re.search(r'<h3 class="font-semibold m-0 t-xs-24">(?P<ASN_CODE>AS\d+) (?P<NAME>[\w.\s]+)</h3>', s)
is returning None: the pattern you're looking for has not been found in the string s.
According to the documentation for re.search:
Scan through string looking for the first location where the regular expression pattern produces a match, and return a corresponding MatchObject instance. Return None if no position in the string matches the pattern; note that this is different from finding a zero-length match at some point in the string.
You have to redesign your regex or debug your code in order to find out what s contains by the time the mentioned pattern is used.
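A quick hedged way to do that debugging (reusing the question's own url_to_soup helper) is to print the <h3> tags the page actually serves and compare them with the pattern:

soup = url_to_soup(url)
for h3 in soup.find_all('h3'):
    print(repr(h3))   # inspect the real markup before matching against it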
re.search returns None when it fails to find anything, and None has no .groups() method. Check whether a match exists before you inspect it in detail.
match = re.search(r'<h3 class="font-semibold m-0 t-xs-24">(?P<ASN_CODE>AS\d+) (?P<NAME>[\w.\s]+)</h3>', s)
if match:
    asn_code, name = match.groups()
However, since you're using Beautiful Soup, why stringify and then regex match? It's like buying a packet of instant soup, adding the powder to the water, boiling the thing, then dehydrating it back to powder. Why even use BeautifulSoup then?
soup.select('h3.font-semibold.m-0.t-xs-24')[0].text
will give you the text of that <h3> element; you can then apply a regex to that, if you need to. Regexing through HTML documents is generally a bad idea.
EDIT: What exactly gives you a TypeError? This is a typical XY problem, where you're solving the wrong thing. I verified that this works, with no TypeError (Python 3):
ast_re = re.compile(r'(?P<ASN_CODE>AS\d+) (?P<NAME>[\w.\s]+)')
soup = url_to_soup(url)
ast_h3 = next(
    (m for m in (ast_re.match(h3.text) for h3 in soup.select('h3')) if m),
    None)
if ast_h3:
    asn_code, name = ast_h3.groups()
I am trying to use re to pull out a URL from something I have scraped. I am using the code below to pull out the data, but it seems to come up empty. I am not very familiar with re. Could you show me how to pull out the URL?
match = ["http://www.stats.gov.cn/tjsj/zxfb/201811/t20181105_1631364.html';", "http://www.stats.gov.cn'+urlstr+'"]
url = re.findall('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', match)
#print url just prints both. I only need the match = "http://www.stats.gov.cn/tjsj/zxfb/ANYTHINGHERE/ANYTHINGHERE.html"
print(url)
Expected Output = ["http://www.stats.gov.cn/tjsj/zxfb/201811/t20181105_1631364.html';"]
Okay, I found the solution. The .+ looks for any number of characters between http://www.stats.gov.cn/ and .html. Thanks for your help with this.
match = ["http://www.stats.gov.cn/tjsj/zxfb/201811/t20181105_1631364.html';", "http://www.stats.gov.cn'+urlstr+'"]
url = re.findall('http://www.stats.gov.cn/.+.html', str(match))
print(url)
Expected Output = ["http://www.stats.gov.cn/tjsj/zxfb/201811/t20181105_1631364.html"]
My title may not be the most precise, but I had some trouble coming up with a better one and, considering it's work hours, I'll go with this.
What I am trying to do is get the links from this specific page, then, using a regex, find the specific links that are job ads with certain keywords in them.
Currently I find 2 ads, but I haven't been able to get all the ads that match my keyword (in this case it's "säljare", Swedish for sales).
I would appreciate it if anyone could look at my regex and point out or hint at what needs fixing. Thank you! :)
import urllib, urllib.request
import re
from bs4 import BeautifulSoup

url = "https://se.indeed.com/jobb?l=V%C3%A4stra+G%C3%B6talands+L%C3%A4n&start=10&pp=AAoAAAFd6hHqiAAAAAEX-kSOAQABQVlE682pK5mDD9vTZGjJhZBXQGaw6Nf2QaY"
reKey = re.compile('^<a.*?href=\"(.*?)\".*?>(.*säljare.*)</a>')

data = urllib.request.urlopen(url)
dataSoup = BeautifulSoup(data, 'html.parser')

for link in dataSoup.find_all('a'):
    linkMatch = re.match(reKey, str(link))
    if linkMatch:
        print(linkMatch)
        print(linkMatch.group(1), linkMatch.group(2))
If I understand your question correctly, you do not need a regex at all. Just check whether the title attribute containing the job title is present on the link, and then check it against a list of keywords (I added truckförare as a second keyword).
import urllib, urllib.request
import re
import ssl
from bs4 import BeautifulSoup

url = "https://se.indeed.com/jobb?l=V%C3%A4stra+G%C3%B6talands+L%C3%A4n&start=10&pp=AAoAAAFd6hHqiAAAAAEX-kSOAQABQVlE682pK5mDD9vTZGjJhZBXQGaw6Nf2QaY"
keywords = ['säljare', 'truckförare']

data = urllib.request.urlopen(url)
dataSoup = BeautifulSoup(data, 'html.parser')

for link in dataSoup.find_all('a'):
    # if we do have a title attribute, check for all keywords;
    # if at least one of them is present,
    # then print the title and the href attribute
    if 'title' in link.attrs:
        title = link.attrs['title'].lower()
        for kw in keywords:
            if kw in title:
                print(title, link.attrs['href'])
While I personally like regexes (yes, I'm that kind of person), most of the time you can get away with a little parsing in Python, which IMHO makes the code more readable.
Instead of using re, you can try the in keyword.
for link in dataSoup.find_all('a'):
    if keyword in link.text:  # check the link's text for the keyword
        print(link)
A working solution:
<a[^>]+href=\"([^\"]+)\"[^>]+title=\"((?=[^\"]*säljare[^\"]*)[^\"]+)\"
<a // literal
[^>]+ // 1 or more not '>'
href=\"([^\"]+)\" // href literal then 1 or more not '"' grouped
[^>]+ // 1 or more not '>'
title=\" // literal
( // start of group
(?=[^\"]*säljare[^\"]*) // look ahead and match literal enclosed by 0 or more not '"'
[^\"]+ // 1 or more not '"'
)\" // end of group
Flags: global, case insensitive
Assumes: title after href
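In Python that pattern could be applied roughly like this (a sketch; the sample anchor tag is made up, and re.I stands in for the case-insensitive flag):

import re

pattern = re.compile(r'<a[^>]+href="([^"]+)"[^>]+title="((?=[^"]*säljare[^"]*)[^"]+)"', re.I)
sample = '<a rel="nofollow" href="/rc/clk?jk=123" title="Säljare till butik">Säljare</a>'
print(pattern.findall(sample))  # [('/rc/clk?jk=123', 'Säljare till butik')]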
So when I run this code I keep getting empty brackets instead of the actual data.
I am trying to figure out why, since I don't receive any error messages.
import urllib
import re

symbolslist = ["aapl", "spy", "goog", "nflx"]

for symbol in symbolslist:
    url = "http://finance.yahoo.com/q?s=%s&ql=1" % (symbol)
    htmlfile = urllib.urlopen(url)
    htmltext = htmlfile.read()
    regex = '<span id="yfs_184_%s">(.+?)</span>' % (symbol.lower())
    pattern = re.compile(regex)
    price = re.findall(pattern, htmltext)
    print price
The empty brackets come up because the element id in your regex is not 184, it is l84: that is a lowercase L, not a one.
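So the corrected line would look something like this (the old Yahoo Finance page this targeted has since changed, so treat the id as historical):

# lowercase L in "yfs_l84", not the digit one
regex = '<span id="yfs_l84_%s">(.+?)</span>' % (symbol.lower())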
There are a number of libraries around which can help you scrape sites. Take a look at Scrapy or Beautiful Soup; they should support both Python 2 and 3 as far as I know.
How do you parse multiple lines of HTML using a regex in Python? I have managed to match patterns on the same line using the code below.
i = 0
while i < len(newschoollist):
    url = "http://profiles.doe.mass.edu/profiles/general.aspx?topNavId=1&orgcode=" + newschoollist[i] + "&orgtypecode=6&"
    htmlfile = urllib.urlopen(url)
    htmltext = htmlfile.read()
    regex = '>Phone:</td><td>(.+?)</td></tr>'
    pattern = re.compile(regex)
    value = re.findall(pattern, htmltext)
    print newschoollist[i], valuetag, value
    i += 1
However, when I try to recognize more complicated HTML like this...
<td>Attendance Rate</td>
<td class='center'> 90.1</td>
I get null values. I believe the problem is with my syntax. I have googled regex and read most of the documentation, but I am looking for some help with this kind of application. I am hoping someone can point me in the right direction. Is there a (.+?)-like combination that will tell the regex to jump down a line of HTML?
What I want findall to pick up is the 90.1 when it finds "Attendance Rate".
Thanks!
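As an aside on the literal question: the "(.+?)-like combination" that crosses line breaks is the re.S (re.DOTALL) flag, which lets . match newlines. A minimal sketch against the sample HTML above (the pattern is illustrative, not a recommendation over a parser):

import re

html = """<td>Attendance Rate</td>
<td class='center'> 90.1</td>"""
# re.S lets ".*?" run across the newline between the two <td> elements
m = re.search(r"<td>Attendance Rate</td>.*?<td[^>]*>\s*([\d.]+)</td>", html, re.S)
print(m.group(1) if m else None)  # 90.1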
Use an HTML Parser. Example using BeautifulSoup:
from urllib2 import urlopen
from bs4 import BeautifulSoup

url = 'http://profiles.doe.mass.edu/profiles/general.aspx?topNavId=1&orgcode=00350326'
soup = BeautifulSoup(urlopen(url))

for label in soup.select('div#whiteboxRight table td'):
    value = label.find_next_sibling('td')
    if not value:
        continue

    print label.get_text(strip=True), value.get_text(strip=True)
    print "----"
Prints (profile contact information):
...
----
NCES ID: 250279000331
----
Web Site: http://www.bostonpublicschools.org
----
MA School Type: Public School
----
NCES School Reconstituted: No
...
I ended up using (soup.get_text()) and it worked great. Thanks!
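For completeness, the get_text() route mentioned above could look roughly like this (a sketch; the label text and spacing are assumed):

import re

# flatten the parsed page to plain text, then regex the number after the label
text = soup.get_text(" ", strip=True)
m = re.search(r"Attendance Rate\s+([\d.]+)", text)
print(m.group(1) if m else None)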