I'm creating a web scraper that will be used to value stocks. The problem I have is that my code returns an object "placement" (not sure what it should be called) instead of the value.
import requests

class Guru():
    MedianPE = 0.0

    def __init__(self, ticket):
        self.ticket = ticket
        try:
            url = ("https://www.gurufocus.com/term/pettm/"+ticket+"/PE-Ratio-TTM/")
            response = requests.get(url)
            htmlText = response.text
            firstSplit = htmlText
            secondSplit = firstSplit.split("And the <strong>median</strong> was <strong>")[1]
            thirdSplit = secondSplit.split("</strong>")[0]
            lastSplit = float(thirdSplit)
            try:
                Guru.MedianPE = lastSplit
            except:
                print(ticket + ": Median PE N/A")
        except:
            print(ticket + ": Median PE N/A")

    def getMedianPE(self):
        return float(Guru.getMedianPE)

g1 = Guru("AAPL")
g1.getMedianPE
print("Median + " + str(g1))
If I print lastSplit inside __init__ it gives the value I want, 15.53, but when I try to get it via the getMedianPE function I just get Median + <__main__.Guru object at 0x0000016B0760D288>
Thanks a lot for your time!
Looks like you are trying to cast a function object to a float. Simply change return float(Guru.getMedianPE) to return float(Guru.MedianPE)
getMedianPE is a function (also called a method when it is part of a class), so you need to call it with parentheses. If you call it without parentheses, you get the method object itself rather than the result of calling it.
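For example, a minimal illustration of the difference (g1 is a Guru instance as in your code):

g1 = Guru("AAPL")
print(g1.getMedianPE)    # <bound method Guru.getMedianPE of <__main__.Guru object at 0x...>>
print(g1.getMedianPE())  # the parentheses actually call the method and print its return value
                         # (once the return statement is fixed as above)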
The other problem is that getMedianPE returns the function Guru.getMedianPE rather than the value Guru.MedianPE. I also don't think you want MedianPE to be a class variable - you probably just want to set it as a default of 0 in __init__ so that each object has its own median_PE value.
Also, it is not a good idea to include all of the scraping code in your __init__ method. That should be moved to a scrape() method (or some other name) that you call after instantiating the object.
Finally, if you are going to print an object, it is useful to have a __str__ method, so I added a basic one here.
So putting all of those comments together, here is a recommended refactor of your code.
import requests

class Guru():
    def __init__(self, ticket, median_PE=0):
        self.ticket = ticket
        self.median_PE = median_PE

    def __str__(self):
        return f'{self.ticket} {self.median_PE}'

    def scrape(self):
        try:
            url = f"https://www.gurufocus.com/term/pettm/{self.ticket}/PE-Ratio-TTM/"
            response = requests.get(url)
            htmlText = response.text
            firstSplit = htmlText
            secondSplit = firstSplit.split("And the <strong>median</strong> was <strong>")[1]
            thirdSplit = secondSplit.split("</strong>")[0]
            lastSplit = float(thirdSplit)
            self.median_PE = lastSplit
        except (IndexError, ValueError):  # marker not found in the page, or value not a number
            print(f"{self.ticket}: Median PE N/A")
Then you run the code:
>>> g1 = Guru("AAPL")
>>> g1.scrape()
>>> print(g1)
AAPL 15.53
Related
I have a problem with passing a variable from one method to another within a given class.
The code is this (I am a practicing beginner):
class Calendar():
    def __init__(self, link):
        self.link = link
        self.request = requests.get(link)
        self.request.encoding = 'UTF-8'
        self.soup = BeautifulSoup(self.request.text, 'lxml')

    def DaysMonth(self):
        Dates = []
        tds = self.soup.findAll('td', {'class': 'action'})
        for td in tds:
            check = (td.findAll('a')[0].text)
            if "Víkendová odstávka" in check:
                date = td.findAll('span')[0].text
                Dates.append(date)
        return Dates

    def PrintCal(self):
        return ['Víkendová odstávka serverů nastane ' + date + '. den v měsíci.' for date in Dates]

    def main(self):
        PrintCal(DaysMonth())
I would like to pass the list Dates from the method DaysMonth to the method PrintCal. When I instantiate the class, i.e. cal = Calendar('link'), and run cal.PrintCal(), I get that the name Dates has not been defined. If I run cal.DaysMonth(), the output is as expected.
What is the issue here? Thank you!
Dates is a local variable in the DaysMonth method, and is therefore not visible anywhere else. Fortunately, DaysMonth does return Dates, so it's easy to get the value you want. Simply add the following line to your PrintCal method (before the return statement):
Dates = self.DaysMonth()
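So a minimal version of the method might look like this (keeping your original names):

def PrintCal(self):
    Dates = self.DaysMonth()  # call the method to get the list it returns
    return ['Víkendová odstávka serverů nastane ' + date + '. den v měsíci.' for date in Dates]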
You are trying to do too much in the Calendar object, particularly in the __init__ method. You don't want to combine the scraping and parsing of a website with the instantiation of the object. I would use the Calendar object to store and display the results of the scraping/parsing. If you need everything to be object-oriented, then create a separate Scraper/Parser class that handles that part of the logic.
class Calendar():
    def __init__(self, dates):
        self.dates = dates

    def display_dates(self):
        return ['Víkendová odstávka serverů nastane ' + date + '. den v měsíci.'
                for date in self.dates]

r = requests.get(link)
r.encoding = 'UTF-8'  # requests.get() has no encoding argument; set it on the response
soup = BeautifulSoup(r.text, 'lxml')

dates = []
for td in soup.findAll('td', {'class': 'action'}):
    check = (td.findAll('a')[0].text)
    if "Víkendová odstávka" in check:
        dates.append(td.findAll('span')[0].text)

c = Calendar(dates=dates)
print(c.display_dates())
I'm trying to pass a list (teamurls) from one function to another, which I will then use in another program. I have my program working where I output a value using yield (yield full_team_urls).
How do I pass the list from the first function into the second (def team_urls)? Also, is it still possible to return the list and continue using yield as well?
Can each function only return or output one object?
Edit: I tried to pass teamurls into the second function as shown below and I get the error TypeError: team_urls() missing 1 required positional argument: 'teamurls'
def table():
    url = 'https://www.skysports.com/premier-league-table'
    base_url = 'https://www.skysports.com'
    today = str(date.today())
    premier_r = requests.get(url)
    print(premier_r.status_code)
    premier_soup = BeautifulSoup(premier_r.text, 'html.parser')
    headers = "Position, Team, Pl, W, D, L, F, A, GD, Pts\n"
    premier_soup_tr = premier_soup.find_all('tr', {'class': 'standing-table__row'})
    premier_soup_th = premier_soup.find_all('thead')
    f = open('premier_league_table.csv', 'w')
    f.write("Table as of {}\n".format(today))
    f.write(headers)
    premier_soup_tr = premier_soup.find_all('tr', {'class': 'standing-table__row'})
    result = [[r.text.strip() for r in td.find_all('td', {'class': 'standing-table__cell'})][:-1] for td in premier_soup_tr[1:]]
    teamurls = ([a.find("a", href=True)["href"] for a in premier_soup_tr[1:]])
    return teamurls
    for item in result:
        f.write(",".join(item))
        f.write("\n")
    f.close()
    print('\n Premier league teams full urls:\n')
    for item in teamurls:
        entire_team = []
        # full_team_urls.append(base_url + item)
        full_team_urls = (base_url + item + '-squad')
        yield full_team_urls

table()

def team_urls(teamurls):
    teams = [i.strip('/') for i in teamurls]
    print(teams)

team_urls()
To pass a value into a method, use arguments:
team_urls(teamurls)
You'll need to specify that argument at the definition of team_urls as such:
def team_urls(teamurls):
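For instance, the two calls could be wired together like this (a sketch; it assumes table() is changed to return the teamurls list rather than mixing return and yield):

urls = table()   # capture the list that table() returns
team_urls(urls)  # pass it to the second function as an argument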
I'm working on a small project retrieving information about books from the Google Books API using Python 3. For this I make a call to the API, read out the variables and store them in a list. For a search like "linkedin" this works perfectly. However, when I enter "Google", it reads the second title from the JSON input. How can this happen?
Please find my code below (Google_Results is the class I use to initialize the variables):
import requests

def Book_Search(search_term):
    parms = {"q": search_term, "maxResults": 3}
    r = requests.get(url="https://www.googleapis.com/books/v1/volumes", params=parms)
    print(r.url)
    results = r.json()

    i = 0
    for result in results["items"]:
        try:
            isbn13 = str(result["volumeInfo"]["industryIdentifiers"][0]["identifier"])
            isbn10 = str(result["volumeInfo"]["industryIdentifiers"][1]["identifier"])
            title = str(result["volumeInfo"]["title"])
            author = str(result["volumeInfo"]["authors"])[2:-2]
            publisher = str(result["volumeInfo"]["publisher"])
            published_date = str(result["volumeInfo"]["publishedDate"])
            description = str(result["volumeInfo"]["description"])
            pages = str(result["volumeInfo"]["pageCount"])
            genre = str(result["volumeInfo"]["categories"])[2:-2]
            language = str(result["volumeInfo"]["language"])
            image_link = str(result["volumeInfo"]["imageLinks"]["thumbnail"])
            dict = Google_Results(isbn13, isbn10, title, author, publisher, published_date,
                                  description, pages, genre, language, image_link)
            gr.append(dict)
            print(gr[i].title)
            i += 1
        except:
            pass
    return

gr = []
Book_Search("Linkedin")
I am a beginner to Python, so any help would be appreciated!
It does so because there is no publisher entry in the volumeInfo of the first result, so a KeyError is raised and your bare except swallows it. If you're going to work with fuzzy data, you have to account for the fact that it will not always have the expected structure. For simple cases you can rely on dict.get() and its default argument to return a 'valid' default entry if an entry is missing.
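For example, a minimal sketch of that pattern against the volumeInfo structure from your code:

volume_info = result.get("volumeInfo", {})        # {} instead of a KeyError if the key is missing
publisher = volume_info.get("publisher", "N/A")   # "N/A" when the entry has no publisher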
Also, there are a few conceptual problems with your function: it relies on a global gr, which is bad design; it shadows the built-in dict type; and it captures all exceptions, guaranteeing that you cannot even stop your code with a SIGINT... I'd suggest converting it to something a bit more sane:
def book_search(search_term, max_results=3):
    results = []  # a list to store the results
    parms = {"q": search_term, "maxResults": max_results}
    r = requests.get(url="https://www.googleapis.com/books/v1/volumes", params=parms)
    try:  # just in case the server doesn't return valid JSON
        for result in r.json().get("items", []):
            if "volumeInfo" not in result:  # invalid entry - missing volumeInfo
                continue
            result_dict = {}  # a dictionary to store our discovered fields
            result = result["volumeInfo"]  # all the data we're interested in is in volumeInfo
            isbns = result.get("industryIdentifiers", None)  # capture ISBNs
            if isinstance(isbns, list) and isbns:
                for i, t in enumerate(("isbn10", "isbn13")):
                    if len(isbns) > i and isinstance(isbns[i], dict):
                        result_dict[t] = isbns[i].get("identifier", None)
            result_dict["title"] = result.get("title", None)
            authors = result.get("authors", None)  # capture authors
            if isinstance(authors, list) and len(authors) > 2:  # you're slicing from 2
                result_dict["author"] = str(authors[2:-2])
            result_dict["publisher"] = result.get("publisher", None)
            result_dict["published_date"] = result.get("publishedDate", None)
            result_dict["description"] = result.get("description", None)
            result_dict["pages"] = result.get("pageCount", None)
            genres = result.get("categories", None)  # capture genres (the JSON key is 'categories')
            if isinstance(genres, list) and len(genres) > 2:  # since you're slicing from 2
                result_dict["genre"] = str(genres[2:-2])
            result_dict["language"] = result.get("language", None)
            result_dict["image_link"] = result.get("imageLinks", {}).get("thumbnail", None)
            # make sure Google_Results accepts keyword arguments like title, author...
            # and make them optional as they might not be in the returned result
            gr = Google_Results(**result_dict)
            results.append(gr)  # add it to the results list
    except ValueError:
        return None  # invalid response returned, you may raise an error instead
    return results  # return the results
Then you can easily retrieve as much info as possible for a term:
gr = book_search("Google")
And it will be far more tolerant of data omissions, provided that your Google_Results type makes most of the entries optional.
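For reference, a minimal sketch of such a Google_Results class is below; the field names simply mirror the keys built up in result_dict above, and your real class may hold more logic:

class Google_Results:
    def __init__(self, title=None, author=None, publisher=None, published_date=None,
                 description=None, pages=None, genre=None, language=None,
                 image_link=None, isbn10=None, isbn13=None):
        # every field is optional, so missing JSON entries simply stay None
        self.title = title
        self.author = author
        self.publisher = publisher
        self.published_date = published_date
        self.description = description
        self.pages = pages
        self.genre = genre
        self.language = language
        self.image_link = image_link
        self.isbn10 = isbn10
        self.isbn13 = isbn13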
Following @Coldspeed's recommendation it became clear that missing information in the JSON caused the exception to be raised. Since I only had a pass statement there, the entire result was skipped. I will therefore have to adapt the try/except statements so errors are handled properly.
Thanks for the help guys!
Is there a way to make a function that creates other functions, to be called later, named after the variables passed in?
For the example, let's pretend https://example.com/engine_list returns this XML file when I call it in get_search_engine_xml:
<engines>
    <engine address="https://www.google.com/">Google</engine>
    <engine address="https://www.bing.com/">Bing</engine>
    <engine address="https://duckduckgo.com/">DuckDuckGo</engine>
</engines>
And here's my code:
import re
import requests
import xml.etree.ElementTree as ET

base_url = 'https://example.com'

def make_safe(s):
    s = re.sub(r"[^\w\s]", '', s)
    s = re.sub(r"\s+", '_', s)
    s = str(s)
    return s

# This is what I'm trying to figure out how to do correctly: create a function
# named after the engine returned in get_search_engine_xml(), to be called later
def create_get_engine_function(function_name, address):
    def function_name():
        r = requests.get(address)
    return function_name

def get_search_engine_xml():
    url = base_url + '/engine_list'
    r = requests.get(url)
    engines_list = str(r.content)
    engines_root = ET.fromstring(engines_list)
    for child in engines_root:
        engine_name = child.text.lower()
        engine_name = make_safe(engine_name)
        engine_address = child.attrib['address']
        create_get_engine_function(engine_name, engine_address)

## Runs without error.
get_search_engine_xml()

## But if I try to call one of the functions.
google()
I get the following error.
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
NameError: name 'google' is not defined
Defining engine_name and engine_address seems to be working when I log them out. So I'm pretty sure the problem lies in create_get_engine_function, which admittedly I don't fully understand and was trying to piece together from similar questions.
Can you name a function created by another function with an argument that's passed in? Is there a better way to do this?
You can assign them to globals()
def create_get_engine_function(function_name, address):
    def function():
        r = requests.get(address)
    function.__name__ = function_name
    function.__qualname__ = function_name  # for Python 3.3+
    globals()[function_name] = function
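For instance, once get_search_engine_xml() has run with the example XML above, the generated names should resolve (a sketch):

get_search_engine_xml()  # registers google(), bing() and duckduckgo() in globals()
google()                 # now defined; performs requests.get("https://www.google.com/")
                         # (this version does not return the response)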
Although, depending on what you're actually trying to accomplish, a better design would be to store all the engine names/addresses in a dictionary and access them as needed:
# You should probably rename this to 'parse_engines_from_xml'
def get_search_engine_xml():
    ...
    search_engines = {}  # maps names to addresses
    for child in engines_root:
        ...
        search_engines[engine_name] = engine_address
    return search_engines

engines = get_search_engine_xml()

e = requests.get(engines['google'])
<do whatever>
e = requests.get(engines['bing'])
<do whatever>
def images_custom_list(args, producer_data):
    tenant, token, url = producer_data
    url = url.replace(".images", ".servers")

    url = url + '/' + 'detail'
    output = do_request(url, token)
    output = output[0].json()["images"]
    custom_images_list = [custom_images for custom_images in output
                          if custom_images["metadata"].get('user_id', None)]
    temp_image_list = []
    for image in custom_images_list:
        image_temp = ({"status": image["status"],
                       "links": image["links"][0]["href"],
                       "id": image["id"], "name": image["name"]})
        temp_image_list.append(image_temp)
    print json.dumps(temp_image_list, indent=2)

def image_list_detail(args, producer_data):
    tenant, token, url = producer_data
    url = url.replace(".images", ".servers")

    uuid = args['uuid']
    url = url + "/" + uuid
    output = do_request(url, token)
    print output[0]
I am trying to make the code more efficient and cleaner looking by using Python's function decorators. Since these 2 functions share the same first 2 lines, how could I make a function decorator out of those 2 lines and have these 2 functions decorated by it?
here's a way to solve it:
from functools import wraps

def fix_url(function):
    @wraps(function)
    def wrapper(*args, **kwarg):
        kwarg['url'] = kwarg['url'].replace(".images", ".servers")
        return function(*args, **kwarg)
    return wrapper

@fix_url
def images_custom_list(args, tenant=None, token=None, url=None):
    url = url + '/' + 'detail'
    output = do_request(url, token)
    output = output[0].json()["images"]
    custom_images_list = [custom_images for custom_images in output
                          if custom_images["metadata"].get('user_id', None)]
    temp_image_list = []
    for image in custom_images_list:
        image_temp = ({"status": image["status"],
                       "links": image["links"][0]["href"],
                       "id": image["id"], "name": image["name"]})
        temp_image_list.append(image_temp)
    print json.dumps(temp_image_list, indent=2)

@fix_url
def image_list_detail(args, tenant=None, token=None, url=None):
    uuid = args['uuid']
    url = url + "/" + uuid
    output = do_request(url, token)
    print output[0]
Sadly for you, you may notice that you need to get rid of producer_data and split it into multiple arguments: you cannot factor that part out, because each function would need to unpack it again anyway. I chose keyword arguments (by giving each a default value of None), but you could use positional arguments as well; your call.
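For example, the call site would then unpack producer_data itself and pass the pieces as keyword arguments (a sketch; args and producer_data come from your existing code):

tenant, token, url = producer_data
images_custom_list(args, tenant=tenant, token=token, url=url)
image_list_detail(args, tenant=tenant, token=token, url=url)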
BTW, note that this is not making the code more efficient. It does help make it a bit more readable (you know you're changing the URL the same way for both methods, and when you fix the URL-changing part, it's fixed everywhere), but it adds 2 more function calls each time you call a function, so it's in no way more "efficient".
N.B.: It's basically based on @joel-cornett's example (I wouldn't have used @wraps otherwise, just a plain old double-function decorator); I just specialized it. I don't think he deserves a -1, so please at least +1 his answer or accept it.
But I think a simpler way to do it would be:
def fix_url(producer_data):
    return (producer_data[0], producer_data[1], producer_data[2].replace(".images", ".servers"))

def images_custom_list(args, producer_data):
    tenant, token, url = fix_url(producer_data)
    # stuff ...

def image_list_detail(args, producer_data):
    tenant, token, url = fix_url(producer_data)
    # stuff ...
which uses a simpler syntax (no decorator) and does only one more function call.
Like this:
from functools import wraps

def my_timesaving_decorator(function):
    @wraps(function)
    def wrapper(*args, **kwargs):
        execute_code_common_to_multiple_function()
        # Now, call the "unique" code.
        # Make sure that if you modified the function args,
        # you pass the modified args here, not the original ones.
        return function(*args, **kwargs)
    return wrapper
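Applying it then looks like this (a sketch with a placeholder function body):

@my_timesaving_decorator
def my_specific_task(*args, **kwargs):
    # only the code unique to this function goes here;
    # the shared code runs in the wrapper before it
    pass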