How to scrape data from a website using Python 2?

So when I run this code I keep getting empty brackets instead of the actual data. I am trying to figure out why, since I don't receive any error messages.
import urllib
import re

symbolslist = ["aapl", "spy", "goog", "nflx"]
for symbol in symbolslist:
    url = "http://finance.yahoo.com/q?s=%s&ql=1" % (symbol)
    htmlfile = urllib.urlopen(url)
    htmltext = htmlfile.read()
    regex = '<span id="yfs_184_%s">(.+?)</span>' % (symbol.lower())
    pattern = re.compile(regex)
    price = re.findall(pattern, htmltext)
    print price

The empty brackets come up because the element id in the regex is wrong: it is not yfs_184, it is yfs_l84 (a lowercase L, not the digit one).
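In other words, the fix is a one-character change to the pattern in the loop above:

# lowercase L, not the digit one:
regex = '<span id="yfs_l84_%s">(.+?)</span>' % (symbol.lower())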

There are a number of libraries around that can help you scrape sites. Take a look at Scrapy or Beautiful Soup; as far as I know, they both support Python 2 and 3.
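For example, a minimal Beautiful Soup version of the loop above might look like this (a sketch; the yfs_l84 id is taken from the fix above and assumes the old Yahoo Finance markup):

import urllib
from bs4 import BeautifulSoup

symbolslist = ["aapl", "spy", "goog", "nflx"]
for symbol in symbolslist:
    url = "http://finance.yahoo.com/q?s=%s&ql=1" % (symbol)
    soup = BeautifulSoup(urllib.urlopen(url).read(), 'html.parser')
    # look the price span up by id instead of regex-matching raw HTML
    span = soup.find('span', id='yfs_l84_%s' % symbol.lower())
    if span is not None:
        print symbol, span.get_text()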

Related

Syntax error in web scraping program using beautifulsoup, requests and regex

As part of 'Automate the Boring Stuff' I am trying to learn how to code in Python. One of the exercises is to create a web scraper using beautifulsoup and requests.
I decided to try Amazon's stock price instead of the price of a product on Amazon. I managed to get it to work, but the output was several lines, so I wanted to use regex to return just the stock price, without the gain/loss and timestamp.
However, it kept giving me a syntax error on line 1. I've tried removing the regex part to get back to just the bs4 and requests version, but that still gave me the syntax error (I am using VS Code to avoid parenthesis errors).
Where am I going wrong? And, depending on how wrong, what would the correct code look like?
My code currently looks like this:
import bs4, requests, re

def extractedStockPrice(price):
    stockPriceRegex = re.compile(r'''
        [0-9]?
        ,?
        [0-9]+
        /.
        [0-9]*
        ''', re.VERBOSE)
    return stockPriceRegex.search(price)

def getStockPrice(stockUrl):
    res = requests.get(stockUrl)
    res.raise_for_status()
    soup = bs4.BeautifulSoup(res.text, 'html.parser')
    elems = soup.select('#quote-header-info > div.My\(6px\).Pos\(r\).smartphone_Mt\(6px\)')
    return elems[0].text.strip()

stockPrice = extractedStockPrice(getStockPrice('https://finance.yahoo.com/quote/AMZN?p=AMZN&.tsrc=fin-srch'))
print('The price is ' + stockPrice)
The issue seems to be with your regex, in the function extractedStockPrice: it does not match the price string, so the search returns None, which causes the type error mentioned in the comment.
The price string variable, when it reaches the regex part, looks like this (example):
'2,042.76-0.24 (-0.01%)At close: 4:00PM EDT'
You can use a regex syntax checker to confirm your regex: https://www.regexpal.com/ (paste the above string as the "Test String" and your regex as the "Regular Expression").
It looks like your forward slash should be a backslash. Also, you need to extract the match once it is found; you can do this with group(0) (search for re.search here: https://docs.python.org/3/library/re.html).
The below code should work (run with Python 3.7):
import bs4, requests, re

def extractedStockPrice(price):
    # fixes here:
    # 1) use backslash "\" instead of "/".
    # 2) use ".group(0)" to extract the match.
    stockPriceRegex = re.compile(r'''[0-9]?,?[0-9]+\.[0-9]*''', re.VERBOSE)
    return stockPriceRegex.search(price).group(0)

def getStockPrice(stockUrl):
    res = requests.get(stockUrl)
    res.raise_for_status()
    soup = bs4.BeautifulSoup(res.text, 'html.parser')
    elems = soup.select('#quote-header-info > div.My\(6px\).Pos\(r\).smartphone_Mt\(6px\)')
    return elems[0].text.strip()

stockPrice = extractedStockPrice(getStockPrice('https://finance.yahoo.com/quote/AMZN?p=AMZN&.tsrc=fin-srch'))
print('The price is ' + stockPrice)
Result: "The price is 2,042.76".

Extracting a specific substring from a specific hyper-reference using Python

I'm new to Python, and for my second attempt at a project I wanted to extract a substring, specifically an identifying number, from a hyper-reference in a URL.
For example, my search query results in a page containing the hyper-reference http://www.chessgames.com/perl/chessgame?gid=1012809. From this I want to extract the identifying number "1012809", use it to navigate to the URL http://www.chessgames.com/perl/chessgame?gid=1012809, and then download the file at http://www.chessgames.com/pgn/alekhine_naegeli_1932.pgn?gid=1012809. But I am currently stuck a few steps before that, because I can't figure out a way to extract the identifier.
Here is my MWE:
import re
import urllib2
from bs4 import BeautifulSoup

url = 'http://www.chessgames.com/perl/chess.pl?yearcomp=exactly&year=1932&playercomp=white&pid=&player=Alekhine&pid2=&player2=Naegeli&movescomp=exactly&moves=&opening=&eco=&result=1%2F2-1%2F2'
page = urllib2.urlopen(url)
soup = BeautifulSoup(page, 'html.parser')
y = str(soup)
x = re.findall("gid=[0-9]+", y)
print x
z = re.sub("gid=", "", x(1))  # At this point, things have completely broken down...
As Albin Paul commented, re.findall returns a list; you need to index into it to get an element. By the way, you don't need BeautifulSoup here: use urllib2.urlopen(url).read() to get the content as a string. The re.sub is also not needed, since one regex pattern, (?:gid=)([0-9]+), is enough.
import re
import urllib2
url = 'http://www.chessgames.com/perl/chess.pl?yearcomp=exactly&year=1932&playercomp=white&pid=&player=Alekhine&pid2=&player2=Naegeli&movescomp=exactly&moves=&opening=&eco=&result=1%2F2-1%2F2'
page = urllib2.urlopen(url).read()
result = re.findall(r"(?:gid=)([0-9]+)",page)
print(result[0])
#'1012809'
You don't need regex here at all. A CSS selector along with some string manipulation will lead you in the right direction. Try the script below:
import requests
from bs4 import BeautifulSoup
page_link = 'http://www.chessgames.com/perl/chess.pl?yearcomp=exactly&year=1932&playercomp=white&pid=&player=Alekhine&pid2=&player2=Naegeli&movescomp=exactly&moves=&opening=&eco=&result=1%2F2-1%2F2'
soup = BeautifulSoup(requests.get(page_link).text, 'lxml')
item_num = soup.select_one("[href*='gid=']")['href'].split("gid=")[1]
print(item_num)
Output:
1012809
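Once the gid is in hand, the download step the question describes could look like this sketch. It assumes the game page contains an href ending in .pgn (as the question's example URL suggests) and that such links may be site-relative; both are assumptions, not verified against the live site.

import re
import urllib2

gid = '1012809'  # extracted by either answer above
game_page = urllib2.urlopen('http://www.chessgames.com/perl/chessgame?gid=%s' % gid).read()
# look for a .pgn link somewhere in the game page markup (assumed to exist)
m = re.search(r'href="([^"]*\.pgn[^"]*)"', game_page)
if m:
    pgn_url = m.group(1)
    if pgn_url.startswith('/'):  # assume site-relative links need the domain prefixed
        pgn_url = 'http://www.chessgames.com' + pgn_url
    with open('game_%s.pgn' % gid, 'wb') as f:
        f.write(urllib2.urlopen(pgn_url).read())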

Python - regex lookup for multiple lines of HTML

How do you parse multiple lines of HTML using regex in Python? I have managed to match patterns on the same line using the code below.
i = 0
while i < len(newschoollist):
    url = "http://profiles.doe.mass.edu/profiles/general.aspx?topNavId=1&orgcode=" + newschoollist[i] + "&orgtypecode=6&"
    htmlfile = urllib.urlopen(url)
    htmltext = htmlfile.read()
    regex = '>Phone:</td><td>(.+?)</td></tr>'
    pattern = re.compile(regex)
    value = re.findall(pattern, htmltext)
    print newschoollist[i], valuetag, value
    i += 1
However, when I try to recognize more complicated HTML like this...
<td>Attendance Rate</td>
<td class='center'> 90.1</td>
I get null values. I believe the problem is with my syntax. I have googled regex and read most of the documentation, but am looking for some help with this kind of application. I am hoping someone can point me in the right direction. Is there a (.+?)-like combination that will tell regex to jump down a line of HTML?
What I want the findall to pick up is the 90.1 when it finds "Attendance Rate".
Thanks!
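For what it's worth, the literal answer to the "jump down a line" question: \s* bridges the newline between the two tags, and the re.DOTALL flag makes . match newlines as well. A minimal sketch against the snippet above (the parser-based answer below remains the more robust route):

# \s* matches the newline and indentation between the two <td> tags;
# re.DOTALL would also let (.+?) itself span multiple lines
regex = r"<td>Attendance Rate</td>\s*<td class='center'>(.+?)</td>"
value = re.findall(regex, htmltext, re.DOTALL)
print value  # expected: [' 90.1'] given the sample above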
Use an HTML Parser. Example using BeautifulSoup:
from urllib2 import urlopen
from bs4 import BeautifulSoup

url = 'http://profiles.doe.mass.edu/profiles/general.aspx?topNavId=1&orgcode=00350326'
soup = BeautifulSoup(urlopen(url))
for label in soup.select('div#whiteboxRight table td'):
    value = label.find_next_sibling('td')
    if not value:
        continue
    print label.get_text(strip=True), value.get_text(strip=True)
    print "----"
Prints (profile contact information):
...
----
NCES ID: 250279000331
----
Web Site: http://www.bostonpublicschools.org
----
MA School Type: Public School
----
NCES School Reconstituted: No
...
I ended up using soup.get_text() and it worked great. Thanks!

Don't Want Spaces Between Paragraphs : Python

I am web scraping a news website to get news articles by using the following code :
import mechanize
from selenium import webdriver
from bs4 import BeautifulSoup

url = "http://www.thehindu.com/archive/web/2012/06/19/"
link_dictionary = {}
driver = webdriver.Firefox()
driver.get(url)
soup = BeautifulSoup(driver.page_source)

for tag_li in soup.findAll('li', attrs={"data-section": "Editorial"}):
    for link in tag_li.findAll('a'):
        link_dictionary[link.string] = link.get('href')
        print link_dictionary[link.string]
        urlnew = link_dictionary[link.string]
        brnew = mechanize.Browser()
        htmltextnew = brnew.open(urlnew).read()
        articletext = ""
        soupnew = BeautifulSoup(htmltextnew)
        for tag in soupnew.findAll('p'):
            articletext += tag.text
        print articletext

driver.close()
I am getting the desired result, but I want each news article on a single line. For some articles I get the whole article on a single line, while for others I get separate paragraphs. Can someone help me sort out the issue? I am new to Python programming. Thanks and regards.
This is likely related to the way whitespace is managed in the particular site's HTML, and the fact that not all sites use "p" tags for their content. Your best bet is probably a regular expression replace that eliminates the extra whitespace (including newlines).
At the beginning of your file, import the regular expression module:
import re
Then after you've built your articletext, add the following code:
print re.sub(r'\s+', ' ', articletext, flags=re.M)
You might also want to extract the text from other elements that might be contained within.
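Alternatively, BeautifulSoup can do the joining itself; a sketch, assuming the same soupnew object as in the question:

import re

# join paragraph texts with single spaces, then collapse any remaining
# runs of whitespace (including newlines) into one space
articletext = " ".join(tag.get_text(" ", strip=True) for tag in soupnew.findAll('p'))
articletext = re.sub(r'\s+', ' ', articletext)
print articletext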

How to re.findall() using `daringfireball`'s regex?

I have used the code below to extract URLs from an HTML page using daringfireball's regex from http://daringfireball.net/2010/07/improved_regex_for_matching_urls, i.e.
(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))
The regex works amazingly, but re.findall() takes almost forever to run. Is there any way I can get all the URLs in the HTML quickly?
import urllib, re
seed = "http://web.archive.org/web/20100412111652/http://app.singaporeedu.gov.sg/asp/index.asp"
page = urllib.urlopen(seed).read().decode('utf8')
#print page
pattern = r'''(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))'''
match = re.search(pattern,page)
print match.group(0)
matches = re.findall(pattern,page) # this line takes more than 3 mins on my i3 laptop
print matches
Yes. By not using regex at all. Use an HTML parser such as BeautifulSoup. That's what they're for.
>>> from bs4 import BeautifulSoup as BS
>>> import urllib2
>>> seed = "http://web.archive.org/web/20100412111652/http://app.singaporeedu.gov.sg/asp/index.asp"
>>> soup = BS(urllib2.urlopen(seed))
>>> print soup.find_all('a')
Do you just want all URLs from a page? Wouldn't a simple regex like this be sufficient?
<a[^>]*href="([^"]+)"[^>]*>
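Applied to the page string from the question, that might look like the sketch below (note the pattern only handles double-quoted href values; single-quoted or unquoted attributes will be missed):

import re

# capture the href value of every anchor tag; fast, but less robust
# than a real HTML parser
links = re.findall(r'<a[^>]*href="([^"]+)"[^>]*>', page)
print links[:10]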
