How to get only tweets from Snscrape? - python

After scraping data from Twitter using Snscrape, I am unable to get only tweets.
Under the column for tweet.sourceLabel, I am getting a mixture of twitter, instagram and foursquare.
import snscrape.modules.twitter as sntwitter

keyword = '(COVID OR Corona Vírus)'
maxTweets = 30

tweets = []
for i, tweet in enumerate(sntwitter.TwitterSearchScraper(keyword + ' since:2020-01-01 lang:pt').get_items()):
    if i > maxTweets:
        break
    tweets.append([tweet.date, tweet.id, tweet.content, tweet.user.username, tweet.sourceLabel])

I'm not seeing any social media source other than Twitter for tweet.sourceLabel. I have also fixed a few typos in your code.
import snscrape.modules.twitter as sntwitter

keyword = '(COVID OR Corona Vírus)'
maxTweets = 30

tweets = []
for i, tweet in enumerate(sntwitter.TwitterSearchScraper(keyword + ' since:2020-01-01 lang:pt').get_items()):
    if i > maxTweets:
        break
    tweets.append([tweet.sourceLabel])
print(tweets)
Output:
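If your results really do contain a mix of source labels, one option is to filter on tweet.sourceLabel while collecting. This is a minimal sketch using the same scraper setup as above; the 'Twitter' substring check is an assumption about how the labels are worded (e.g. "Twitter for iPhone", "Twitter Web App"):

import snscrape.modules.twitter as sntwitter

keyword = '(COVID OR Corona Vírus)'
maxTweets = 30

tweets = []
for i, tweet in enumerate(sntwitter.TwitterSearchScraper(keyword + ' since:2020-01-01 lang:pt').get_items()):
    if i > maxTweets:
        break
    # Keep only items whose source label mentions Twitter (assumption about label wording)
    if 'Twitter' in (tweet.sourceLabel or ''):
        tweets.append([tweet.date, tweet.id, tweet.content, tweet.user.username, tweet.sourceLabel])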

Related

How do I return multiple 'scorers' when scraping for football results using Python?

I'm just a few hours into learning Python, so please go easy on me! I want to scrape scores and scorers off a website, and I've been able to do that. However, I'm only getting one scorer (if there is one!); when there are multiple goal scorers I only get the first. I think the problem is in how I look for multiple scorers under '# Home Scorers'.
My code:
from bs4 import BeautifulSoup
import requests
import pandas as pd

url = "https://www.skysports.com/football-results"

match_results = {}
match_details = {}
match_no = 0

response = requests.get(url)
data = response.text
soup = BeautifulSoup(data, 'html.parser')

matches = soup.find_all('div', {'class': 'fixres__item'})

for match in matches:
    try:
        match_url_get = match.find('a', {'class': 'matches__item matches__link'}).get('href')
        match_url = match_url_get if match_url_get else "unknown"
        event_id = match_url[-6:]

        match_response = requests.get(match_url)
        match_data = match_response.text
        match_soup = BeautifulSoup(match_data, 'html.parser')

        # Match Details
        match_date = match_soup.find('time', {'class': 'sdc-site-match-header__detail-time'}).text
        match_location = match_soup.find('span', {'class': 'sdc-site-match-header__detail-venue'}).text
        match_info = match_soup.find('p', {'class': 'sdc-site-match-header__detail-fixture'}).text

        # Home Scores & Team
        home_details = match_soup.find_all('span', {'class': 'sdc-site-match-header__team-name sdc-site-match-header__team-name--home'})
        for home_detail in home_details:
            home_team = home_detail.find('span', {'class': 'sdc-site-match-header__team-name-block-target'}).text
        home_score_get = match_soup.find('span', {'class': 'sdc-site-match-header__team-score-block', 'data-update': 'score-home'})
        home_score = home_score_get.text if home_score_get else "none"

        # Home Scorers
        home_scorer_details = match_soup.find_all('ul', {'class': 'sdc-site-match-header__team-synopsis', 'data-update': 'synopsis-home'})
        for home_scorer_detail in home_scorer_details:
            goal_scorer_get = home_scorer_detail.find('li', {'class': 'sdc-site-match-header__team-synopsis-line'})
            goal_scorer = goal_scorer_get.text if goal_scorer_get else "none"
            goal_score_minute_get = home_scorer_detail.find('span', {'class': 'sdc-site-match-header__event-time'})
            goal_score_minute = goal_score_minute_get.text if goal_score_minute_get else "none"

        # Away Scores & Team
        away_details = match_soup.find_all('span', {'class': 'sdc-site-match-header__team-name sdc-site-match-header__team-name--away'})
        for away_detail in away_details:
            away_team = away_detail.find('span', {'class': 'sdc-site-match-header__team-name-block-target'}).text
        away_score_get = match_soup.find('span', {'class': 'sdc-site-match-header__team-score-block', 'data-update': 'score-away'})
        away_score = away_score_get.text if away_score_get else "none"

        # Away Scorers
        away_scorer_details = match_soup.find_all('ul', {'class': 'sdc-site-match-header__team-synopsis', 'data-update': 'synopsis-away'})
        for away_scorer_detail in away_scorer_details:
            away_goal_scorer_get = away_scorer_detail.find('li', {'class': 'sdc-site-match-header__team-synopsis-line'})
            away_goal_scorer = away_goal_scorer_get.text if away_goal_scorer_get else "none"
            away_goal_score_minute_get = away_scorer_detail.find('span', {'class': 'sdc-site-match-header__event-time'})
            away_goal_score_minute = away_goal_score_minute_get.text if away_goal_score_minute_get else "none"

        print("Match: ", event_id, "Match Date:", match_date, "Match Location:", match_location, "Match Info:", match_info, "\nResult: ", home_team, home_score, away_team, away_score)
        print("Home Scorer:", goal_scorer, "Minute:", goal_score_minute, "\nAway Scorer:", away_goal_scorer, "Minute:", away_goal_score_minute)
        print(match_date)
    except:
        pass

    match_no += 1
    match_results[match_no] = [event_id, home_team, home_score, away_team, away_score, match_url, match_date, match_location, match_info]
    match_details[match_no] = [event_id, goal_scorer, goal_score_minute, away_goal_scorer, away_goal_score_minute]

Period = "2021-22"
print("Total Matches: ", match_no)

match_results = pd.DataFrame.from_dict(match_results, orient='index', columns=['Event_ID:', 'Home Team:', 'Home Score:', 'Away Team:', 'Away Score:', 'Link:', 'Match Date:', 'Match Location:', 'Match Info:'])
match_results.to_csv("Python/FL/Premier League Results (SkySports.com) " + Period + ".csv")

match_details = pd.DataFrame.from_dict(match_details, orient='index', columns=['Event_ID:', 'Home Goal:', 'Home Goal Minute:', 'Away Goal:', 'Away Goal Minute:'])
match_details.to_csv("Python/FL/Premier League Details (SkySports.com) " + Period + ".csv")
So the bit that's not working correctly is:
# Home Scorers
home_scorer_details = match_soup.find_all('ul', {'class': 'sdc-site-match-header__team-synopsis', 'data-update': 'synopsis-home'})
for home_scorer_detail in home_scorer_details:
    goal_scorer_get = home_scorer_detail.find('li', {'class': 'sdc-site-match-header__team-synopsis-line'})
    goal_scorer = goal_scorer_get.text if goal_scorer_get else "none"
    goal_score_minute_get = home_scorer_detail.find('span', {'class': 'sdc-site-match-header__event-time'})
    goal_score_minute = goal_score_minute_get.text if goal_score_minute_get else "none"
Any ideas how I can return multiple rows for that bit?!
Thanks in advance :)
home_scorer_details only has 1 item, the unordered list itself.
To get all the scorers you need to get the items in that list.
The following code, which is pretty rough, will create a list of dictionaries where each dictionary has the name of the scorer and the minute(s) they scored.
You could use similar code to get all the away scorers.
Like I said, this code is rough and needs refining, but it should give you a start.
# Home Scorers
home_scorer_details = match_soup.find_all('ul', {'class': 'sdc-site-match-header__team-synopsis', 'data-update': 'synopsis-home'})

home_scorers = []
for home_scorer_detail in home_scorer_details[0].find_all('li'):
    goal_scorer = home_scorer_detail.text
    goal_score_minute_get = home_scorer_detail.find('span', {'class': 'sdc-site-match-header__event-time'})
    goal_score_minute = goal_score_minute_get.text if goal_score_minute_get else "none"
    home_scorers.append({'scorer': goal_scorer, 'minute': goal_score_minute})
print(home_scorers)
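If it helps, the same idea can be wrapped in a small helper so the home and away synopsis lists are parsed the same way. This is only a rough sketch built from the class names already used in the question, not tested against the live page:

def get_scorers(match_soup, side):
    """Return a list of {'scorer', 'minute'} dicts for side 'home' or 'away'."""
    scorers = []
    synopsis = match_soup.find('ul', {'class': 'sdc-site-match-header__team-synopsis',
                                      'data-update': 'synopsis-' + side})
    if not synopsis:
        return scorers
    for item in synopsis.find_all('li'):
        minute = item.find('span', {'class': 'sdc-site-match-header__event-time'})
        scorers.append({'scorer': item.text.strip(),
                        'minute': minute.text.strip() if minute else 'none'})
    return scorers

home_scorers = get_scorers(match_soup, 'home')
away_scorers = get_scorers(match_soup, 'away')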

Error while scraping: "Expecting value: line 1 column 1 (char 0)"

I am scraping reviews from the Rotten Tomatoes website using the following code:
Link to the page.
import requests
import re
import json
import pandas as pd
import numpy as np

r = requests.get("https://www.rottentomatoes.com/m/avatar/reviews?type=user")
content = json.loads(re.search(r'movieReview\s=\s(.*);', r.text).group(1))
movieId = content["movieId"]

def getReviews(endCursor):
    r = requests.get(f"https://www.rottentomatoes.com/napi/movie/{movieId}/reviews/user",
                     params={
                         "direction": "next",
                         "endCursor": endCursor,
                         "startCursor": ""
                     })
    return r.json()

data = {"User_Name": [], "Rating": [], "Review": []}
result = {}

for i in range(0, 5):
    # print(f"[{i}] request review")
    result = getReviews(result["pageInfo"]["endCursor"] if i != 0 else "")
    data['User_Name'].extend(t['displayName'] for t in result["reviews"])
    data['Rating'].extend(t['score'] for t in result["reviews"])
    data['Review'].extend(t['review'] for t in result["reviews"])

df = pd.DataFrame(data)
I want to convert the above code into a separate function.
Below I have posted the code where I tried to do that, but it gives an error with json.loads():
"Expecting value: line 1 column 1 (char 0)"
I googled this error and found that adding a headers parameter should solve it, but that is not working here.
I am not able to understand what is causing this error. It would be helpful if someone could guide me.
import requests
import re
import json
import pandas as pd
import numpy as np

def getReviews(movieId, endCursor):
    r = requests.get(f"https://www.rottentomatoes.com/napi/{movieId}/reviews/user",
                     params={
                         "direction": "next",
                         "endCursor": endCursor,
                         "startCursor": ""
                     },
                     headers={'Content-Type': 'application/json'}
                     )
    return r.json()

def ScrapeReviews(movie):
    url = "https://www.rottentomatoes.com/m/" + movie + "/reviews?type=user"
    req = requests.get(url)
    content = json.loads(re.search(r'movieReview\s=\s(.*);', req.text).group(1))
    movie_id = content["movieId"]

    data = {"User_Name": [], "Rating": [], "Review": []}
    result = {}

    for i in range(0, 5):
        # print(f"[{i}] request review")
        result = getReviews(movie_id, result["pageInfo"]["endCursor"] if i != 0 else "")
        data['User_Name'].extend(t['displayName'] for t in result["reviews"])
        data['Rating'].extend(t['score'] for t in result["reviews"])
        data['Review'].extend(t['review'] for t in result["reviews"])

    df = pd.DataFrame(data)
    return df

d = ScrapeReviews('avatar')
The error is in the getReviews function: the URL is missing the "movie" path segment and should be
"https://www.rottentomatoes.com/napi/movie/{movieId}/reviews/user"
import requests
import re
import json
import pandas as pd
import numpy as np

def getReviews(movieId, endCursor):
    r = requests.get(
        f"https://www.rottentomatoes.com/napi/movie/{movieId}/reviews/user",
        params={"direction": "next", "endCursor": endCursor, "startCursor": ""},
        headers={"Content-Type": "application/json"},
    )
    return r.json()

def ScrapeReviews(movie):
    url = "https://www.rottentomatoes.com/m/" + movie + "/reviews?type=user"
    req = requests.get(url)
    content = json.loads(re.search(r"movieReview\s=\s(.*);", req.text).group(1))
    movie_id = content["movieId"]

    data = {"User_Name": [], "Rating": [], "Review": []}
    result = {}

    for i in range(0, 5):
        result = getReviews(
            movie_id, result["pageInfo"]["endCursor"] if i != 0 else ""
        )
        data["User_Name"].extend(t["displayName"] for t in result["reviews"])
        data["Rating"].extend(t["score"] for t in result["reviews"])
        data["Review"].extend(t["review"] for t in result["reviews"])

    df = pd.DataFrame(data)
    return df

d = ScrapeReviews("avatar")
print(d)
Prints:
User_Name Rating Review
0 Joe D 5.0 To me this is the most perfect blockbuster of all.\nLove Sam Worthington's empty cup, I find his everyman acting compelling, Saldana may be the most beautiful woman on the planet with her trademark perfect posture, and Sigourney adds class with extra to spare wherever she goes.\nThe planet Pandora remains the real star, and the revelation that we're the bad guys and the spiritual tree-huggers were right all along, I find genuinely touching every time.\nFirst class and I can't wait for more of Cameron's magic touch.
1 Jimmy W 1.0 The fact that this movie can make the most money of all time and also gain a following of hive-minded morons to defend it says more about the state of society than it does the movie itself. For a movie that's meant to make a point about abusive use of the environment, they sure seem to indulge in the use of massive amounts of expensive technology that no doubt utilized way more than its fair share of natural resources. Oh well, at least you can pretend to be vindicated by the box office numbers.
2 Goudkuil E 1.5 Apart from the visuals everything feels uninspired and thrown together using a old cliche of an outsider seeing what's wrong with what he's people have been doing falling in love then whiching sides. The acting is ok, the dialogue is kinda rough. The movie is padded with a lot of nice scenique views with no real narrative meaning.
3 Antonio D 4.0 o filme possui uma fotografia muito bela e, mesmo o filme sendo de 2009 não conseguimos encontrar defeitos em relação a montagem e fotografia, a história é satisfatória e é um reflexo do que sabemos que aconteceu no inicio da colonização
...and so on.
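As a side note, "Expecting value: line 1 column 1 (char 0)" is what you get when .json() is called on a response body that is not JSON (here, because the wrong URL returned an error page). A small guard makes that failure mode easier to spot; this is just a hedged sketch, and raise_for_status plus a Content-Type check is only one way to do it:

import requests

def getReviews(movieId, endCursor):
    r = requests.get(
        f"https://www.rottentomatoes.com/napi/movie/{movieId}/reviews/user",
        params={"direction": "next", "endCursor": endCursor, "startCursor": ""},
    )
    r.raise_for_status()  # fail loudly on 4xx/5xx instead of feeding an error page to .json()
    if "application/json" not in r.headers.get("Content-Type", ""):
        raise ValueError("Expected JSON but got: " + r.text[:100])
    return r.json()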

Extracting tweets using Python and Tweepy

I am trying to look at President Trump's tweets on immigration and do some sentiment analysis on them. My code is:
import pprint
import datetime

import numpy as np
import pandas as pd
import tweepy

# Assumes `api` is an authenticated tweepy.API instance set up earlier
# startDate = datetime.datetime(2019, 4, 20, 0, 0, 0)
# endDate = datetime.datetime(2020, 4, 29, 0, 0, 0)

username = "realDonaldTrump"
page = 1
stop_loop = False

finalList1 = []
curs = tweepy.Cursor(api.user_timeline, id=username)
for item in curs.items():
    finalList1.append(item)
print(len(finalList1))

data = pd.DataFrame(data=[tweet.text for tweet in finalList1], columns=['Tweets'])

# Add relevant data
data['len'] = np.array([len(tweet.text) for tweet in finalList1])
data['ID'] = np.array([tweet.id for tweet in finalList1])
data['Date'] = np.array([tweet.created_at for tweet in finalList1])
data['Source'] = np.array([tweet.source for tweet in finalList1])
data['Likes'] = np.array([tweet.favorite_count for tweet in finalList1])
data['RTs'] = np.array([tweet.retweet_count for tweet in finalList1])

# Sentiment Analysis
from textblob import TextBlob
import re

def clean_tweet(tweet):
    '''
    Utility function to clean the text in a tweet by removing
    links and special characters using regex.
    '''
    return ' '.join(re.sub(r"(#[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", " ", tweet).split())

def analize_sentiment(tweet):
    '''
    Utility function to classify the polarity of a tweet
    using textblob.
    '''
    analysis = TextBlob(clean_tweet(tweet))
    if analysis.sentiment.polarity > 0:
        return 1
    elif analysis.sentiment.polarity == 0:
        return 0
    else:
        return -1

data['SA'] = np.array([analize_sentiment(tweet) for tweet in data['Tweets']])
The code works fine. However, I have two questions:
How do I get access to tweets before these? It gives me 3,200 tweets; how do I get the ones before that?
How do I get Donald Trump's tweets that contain specific keywords like 'immigration', 'refugee', 'china', etc.?
I have been trying to figure out a way but have been unsuccessful.
For searching for specific keywords you can use API.search [1].
For example:
query = "immigration"
max_tweets = 100  # however many results you want to collect
searched_tweets = [status for status in tweepy.Cursor(api.search, q=query).items(max_tweets)]
[1]: http://docs.tweepy.org/en/latest/api.html
[2]: Managing Tweepy API Search
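For the keyword question specifically, you can also just filter the tweets you have already collected. A minimal sketch, assuming the `data` DataFrame built in the question:

keywords = ['immigration', 'refugee', 'china']
pattern = '|'.join(keywords)

# Case-insensitive match of any keyword in the tweet text
keyword_tweets = data[data['Tweets'].str.contains(pattern, case=False, na=False)]
print(len(keyword_tweets))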

How to fix IndexError when you have tried "everything"

My Python web scraper gathers a lot of data and then all of a sudden stops with an IndexError. I have tried different pages and setups, but they stop at random spots.
Part of my code is as follows:
numListings = int(re.findall(r'\d+', numListingsRaw)[0])
numPages = math.ceil(numListings / 100)
print(numPages)

for numb in range(1, numPages):
    pageSoup = make_soup("https://url" + str(numb) + "&pmax=5000&srt=df-a")
    containers = pageSoup.findAll("li", {"class": "occasion popup_click_event aec_popup_click"})
    for container in containers:
        ID = container.a["data-id"]
        titel = container["data-vrnt"].replace(",", "|")
        URL = container.a["href"]
        merk = container["data-mrk"]
        soort = container["data-mdl"]
        prijs = container.find("div", {"class": "occ_price"}).text.strip()

        ## Bouwjaar en km
        bouwjaarKM = container.span.text.strip().split(", ")
        bouwjaarRaw = bouwjaarKM[0].split(": ")
        bouwjaar = bouwjaarRaw[1]
        km_int = int(''.join(filter(str.isdigit, bouwjaarKM[1])))
        km = str(km_int)

        rest = container.find("div", {"class": "occ_extrainfo"}).text.strip()
        rest_split = rest.split(", ")
        brandstof = rest_split[0]
        inhoud = rest_split[1]
        vermogen = rest_split[2]
        transmissie = rest_split[3]
        carroserie = rest_split[4]
        kleur = rest_split[5]
This is the exact error message:
Traceback (most recent call last):
  File "Webscraper_multi2.py", line 62, in <module>
    inhoud = rest_split[1]
IndexError: list index out of range
I know it has something to do with the for loop, but I cannot get my head around it.
Your help is much appreciated.
Thanks in advance,
Tom
Check the length of the list before trying to access an index that may not exist:
rest = container.find("div", {"class": "occ_extrainfo"}).text.strip()
rest_split = rest.split(", ")
if len(rest_split) >= 6:
    brandstof = rest_split[0]
    inhoud = rest_split[1]
    vermogen = rest_split[2]
    transmissie = rest_split[3]
    carroserie = rest_split[4]
    kleur = rest_split[5]
If you know that your split list is exactly the length you want (if len(rest_split) == 6:), you can unpack the list in a single line:
brandstof, inhoud, vermogen, transmissie, carroserie, kleur = rest_split
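Putting the length check and the unpacking together, a minimal sketch (assuming the containers loop from the question; continue skips listings whose extra-info block does not have the expected fields):

for container in containers:
    rest = container.find("div", {"class": "occ_extrainfo"}).text.strip()
    rest_split = rest.split(", ")
    if len(rest_split) < 6:
        # Unexpected layout - skip this listing instead of raising an IndexError
        continue
    brandstof, inhoud, vermogen, transmissie, carroserie, kleur = rest_split[:6]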
Print the value of rest_split. You will likely find that it is a list with fewer than two elements, which is why indexing it with [1] fails.
Thank you all for the extremely fast replies! With your help I got it working.
For some context:
I was trying to scrape a second-hand automobile website. With the tips I got, I changed the output per item to print the rest_split list.
The list I am trying to scrape is 7 elements long, but for some reason a motorcycle was included in the search results on the website. That listing only had 1 element, hence the error.
The solution for people that might have a similar problem:
rest = container.find("div", {"class": "occ_extrainfo"}).text.strip()
rest_split = rest.split(", ")
if len(rest_split) == 7:
    brandstof = rest_split[0]
    inhoud = rest_split[1]
    vermogen = rest_split[2]
    transmissie = rest_split[3]
    carroserie = rest_split[4]
    kleur = rest_split[5]
Special thanks to JacobIRR who actually made life so easy that I didn't even have to think about it.

Issues scraping EV/EBITDA, Sale of Purchase Stock & Net Borrowings from Yahoo Finance

I pulled a Python script off of Github which is intended to analyze & rank stocks. I finally got it running but unfortunately the EV/EBITDA and Shareholder Yield are populating their default values, 1000 & 0 respectively.
I've spent the last few days attempting to troubleshoot, learning a lot in the process, but unfortunately have had no luck. I think it's attempting to extract data from a nonexistent line in the 'Scraper' portion, or referencing incorrect HTML. I'll paste the two code snippets I think the error may lie within, though the rest of the files are linked above.
Main File
from sys import stdout
from Stock import Stock
import Pickler
import Scraper
import Rankings
import Fixer
import Writer

# HTML error code handler - importing data is a chore, and getting a connection
# error halfway through is horribly demotivating. Use a pickler to serialize
# imported data into a hot-startable database.
pklFileName = 'tmpstocks.pkl'
pickler = Pickler.Pickler()

# Check if a pickled file exists. Load it if the user requests. If no file
# loaded, stocks is an empty list.
stocks = pickler.loadPickledFile(pklFileName)

# Scrape data from FINVIZ. Certain presets have been established (see direct
# link for more details)
url = 'http://finviz.com/screener.ashx?v=152&f=cap_smallover&' + \
      'ft=4&c=0,1,2,6,7,10,11,13,14,45,65'
html = Scraper.importHtml(url)

# Parse the HTML for the number of pages from which we'll pull data
nPages = -1
for line in html:
    if line[0:40] == '<option selected="selected" value=1>Page':
        # Find indices
        b1 = line.index('/') + 1
        b2 = b1 + line[b1:].index('<')
        # Number of pages containing stock data
        nPages = int(line[b1:b2])
        break

# Parse data from table on the first page of stocks and store in the database,
# but only if no data was pickled
if pickler.source == Pickler.PickleSource.NOPICKLE:
    Scraper.importFinvizPage(html, stocks)

# The first page of stocks (20 stocks) has been imported. Now import the
# rest of them
source = Pickler.PickleSource.FINVIZ
iS = pickler.getIndex(source, 1, nPages + 1)

for i in range(iS, nPages + 1):
    try:
        # Print dynamic progress message
        print('Importing FINVIZ metrics from page ' + str(i) + ' of ' + \
              str(nPages) + '...', file=stdout, flush=True)

        # Scrape data as before
        url = 'http://finviz.com/screener.ashx?v=152&f=cap_smallover&ft=4&r=' + \
              str(i*20+1) + '&c=0,1,2,6,7,10,11,13,14,45,65'
        html = Scraper.importHtml(url)

        # Import stock metrics from page into a buffer
        bufferList = []
        Scraper.importFinvizPage(html, bufferList)

        # If no errors encountered, extend buffer to stocks list
        stocks.extend(bufferList)
    except:
        # Error encountered. Pickle stocks for later loading
        pickler.setError(source, i, stocks)
        break

# FINVIZ stock metrics successfully imported
print('\n')

# Store number of stocks in list
nStocks = len(stocks)

# Handle pickle file
source = Pickler.PickleSource.YHOOEV
iS = pickler.getIndex(source, 0, nStocks)

# Grab EV/EBITDA metrics from Yahoo! Finance
for i in range(iS, nStocks):
    try:
        # Print dynamic progress message
        print('Importing Key Statistics for ' + stocks[i].tick +
              ' (' + str(i) + '/' + str(nStocks - 1) + ') from Yahoo! Finance...', \
              file=stdout, flush=True)

        # Scrape data from Yahoo! Finance
        url = 'http://finance.yahoo.com/q/ks?s=' + stocks[i].tick + '+Key+Statistics'
        html = Scraper.importHtml(url)

        # Parse data
        for line in html:
            # Check no value
            if 'There is no Key Statistics' in line or \
                    'Get Quotes Results for' in line or \
                    'Changed Ticker Symbol' in line or \
                    '</html>' in line:
                # Non-financial file (e.g. mutual fund) or
                # Ticker not located or
                # End of html page
                stocks[i].evebitda = 1000
                break
            elif 'Enterprise Value/EBITDA' in line:
                # Line contains EV/EBITDA data
                evebitda = Scraper.readYahooEVEBITDA(line)
                stocks[i].evebitda = evebitda
                break
    except:
        # Error encountered. Pickle stocks for later loading
        pickler.setError(source, i, stocks)
        break

# Yahoo! Finance EV/EBITDA successfully imported
print('\n')

# Handle pickle file
source = Pickler.PickleSource.YHOOBBY
iS = pickler.getIndex(source, 0, nStocks)

# Grab BBY metrics from Yahoo! Finance
for i in range(iS, nStocks):
    try:
        # Print dynamic progress message
        print('Importing Cash Flow for ' + stocks[i].tick +
              ' (' + str(i) + '/' + str(nStocks - 1) + ') from Yahoo! Finance...', \
              file=stdout, flush=True)

        # Scrape data from Yahoo! Finance
        url = 'http://finance.yahoo.com/q/cf?s=' + stocks[i].tick + '&ql=1'
        html = Scraper.importHtml(url)

        # Parse data
        totalBuysAndSells = 0
        for line in html:
            # Check no value
            if 'There is no Cash Flow' in line or \
                    'Get Quotes Results for' in line or \
                    'Changed Ticker Symbol' in line or \
                    '</html>' in line:
                # Non-financial file (e.g. mutual fund) or
                # Ticker not located or
                # End of html page
                break
            elif 'Sale Purchase of Stock' in line:
                # Line contains Sale/Purchase of Stock information
                totalBuysAndSells = Scraper.readYahooBBY(line)
                break

        # Calculate BBY as a percentage of current market cap
        bby = round(-totalBuysAndSells / stocks[i].mktcap * 100, 2)
        stocks[i].bby = bby
    except:
        # Error encountered. Pickle stocks for later loading
        pickler.setError(source, i, stocks)
        break

# Yahoo! Finance BBY successfully imported
if not pickler.hasErrorOccurred:
    # All data imported
    print('\n')

print('Fixing screener errors...')

# A number of stocks may have broken metrics. Fix these (i.e. assign out-of-
# bounds values) before sorting
stocks = Fixer.fixBrokenMetrics(stocks)

print('Ranking stocks...')

# Calculate shareholder Yield
for i in range(nStocks):
    stocks[i].shy = stocks[i].div + stocks[i].bby

# Time to rank! Lowest value gets 100
rankPE = 100 * (1 - Rankings.rankByValue([o.pe for o in stocks]) / nStocks)
rankPS = 100 * (1 - Rankings.rankByValue([o.ps for o in stocks]) / nStocks)
rankPB = 100 * (1 - Rankings.rankByValue([o.pb for o in stocks]) / nStocks)
rankPFCF = 100 * (1 - Rankings.rankByValue([o.pfcf for o in stocks]) / nStocks)
rankEVEBITDA = 100 * (1 - Rankings.rankByValue([o.evebitda for o in stocks]) / nStocks)

# Shareholder yield ranked with highest getting 100
rankSHY = 100 * (Rankings.rankByValue([o.shy for o in stocks]) / nStocks)

# Rank total stock valuation
rankStock = rankPE + rankPS + rankPB + rankPFCF + rankEVEBITDA + rankSHY

# Rank 'em
rankOverall = Rankings.rankByValue(rankStock)

# Calculate Value Composite - higher the better
valueComposite = 100 * rankOverall / len(rankStock)

# Reverse indices - lower index -> better score
rankOverall = [len(rankStock) - 1 - x for x in rankOverall]

# Assign to stocks
for i in range(nStocks):
    stocks[i].rank = rankOverall[i]
    stocks[i].vc = round(valueComposite[i], 2)

print('Sorting stocks...')

# Sort all stocks by normalized rank
stocks = [x for (y, x) in sorted(zip(rankOverall, stocks))]

# Sort top decile by momentum factor. O'Shaughnessey historically uses 25
# stocks to hold. The top decile is printed, and the user may select the top 25
# (or any n) from the .csv file.
dec = int(nStocks / 10)
topDecile = []

# Store temporary momentums from top decile for sorting reasons
moms = [o.mom for o in stocks[:dec]]

# Sort top decile by momentum
for i in range(dec):
    # Get index of top momentum performer in top decile
    topMomInd = moms.index(max(moms))
    # Sort
    topDecile.append(stocks[topMomInd])
    # Remove top momentum performer from further consideration
    moms[topMomInd] = -100

print('Saving stocks...')

# Save momentum-weighted top decile
topCsvPath = 'top.csv'
Writer.writeCSV(topCsvPath, topDecile)

# Save results to .csv
allCsvPath = 'stocks.csv'
Writer.writeCSV(allCsvPath, stocks)

print('\n')
print('Complete.')
print('Top decile (sorted by momentum) saved to: ' + topCsvPath)
print('All stocks (sorted by trending value) saved to: ' + allCsvPath)
Scraper
import re
from urllib.request import urlopen
from Stock import Stock

def importHtml(url):
    "Scrapes the HTML file from the given URL and returns line break delimited \
    strings"
    response = urlopen(url, data=None)
    html = response.read().decode('utf-8').split('\n')
    return html

def importFinvizPage(html, stocks):
    "Imports data from a FINVIZ HTML page and stores in the list of Stock \
    objects"
    isFound = False
    for line in html:
        if line[0:15] == '<td height="10"':
            isFound = True
            # Import data line into stock database
            _readFinvizLine(line, stocks)
        if isFound and len(line) < 10:
            break
    return

def _readFinvizLine(line, stocks):
    "Imports stock metrics from the data line and stores it in the list of \
    Stock objects"
    # Parse html
    (stkraw, dl) = _parseHtml(line)
    # Create new stock object
    stock = Stock()
    # Get ticker symbol
    stock.tick = stkraw[dl[1] + 1: dl[2]]
    # Get company name
    stock.name = stkraw[dl[2] + 1: dl[3]]
    # Get market cap multiplier (either MM or BB)
    if stkraw[dl[4] - 1] == 'B':
        capmult = 1000000000
    else:
        capmult = 1000000
    # Get market cap
    stock.mktcap = capmult * _toFloat(stkraw[dl[3] + 1: dl[4] - 1])
    # Get P/E ratio
    stock.pe = _toFloat(stkraw[dl[4] + 1: dl[5]])
    # Get P/S ratio
    stock.ps = _toFloat(stkraw[dl[5] + 1: dl[6]])
    # Get P/B ratio
    stock.pb = _toFloat(stkraw[dl[6] + 1: dl[7]])
    # Get P/FCF ratio
    stock.pfcf = _toFloat(stkraw[dl[7] + 1: dl[8]])
    # Get Dividend Yield
    stock.div = _toFloat(stkraw[dl[8] + 1: dl[9] - 1])
    # Get 6-mo Relative Price Strength
    stock.mom = _toFloat(stkraw[dl[9] + 1: dl[10] - 1])
    # Get Current Stock Price
    stock.price = _toFloat(stkraw[dl[11] + 1: dl[12]])
    # Append stock to list of stocks
    stocks.append(stock)
    return

def _toFloat(line):
    "Converts a string to a float. Returns NaN if the line can't be converted"
    try:
        num = float(line)
    except:
        num = float('NaN')
    return num

def readYahooEVEBITDA(line):
    "Returns EV/EBITDA data from Yahoo! Finance HTML line"
    # Parse html
    (stkraw, dl) = _parseHtml(line)
    for i in range(0, len(dl)):
        if (stkraw[dl[i] + 1: dl[i] + 24] == 'Enterprise Value/EBITDA'):
            evebitda = stkraw[dl[i + 1] + 1: dl[i + 2]]
            break
    return _toFloat(evebitda)

def readYahooBBY(line):
    "Returns total buys and sells from Yahoo! Finance HTML line. Result will \
    still need to be divided by market cap"
    # Line also contains Borrowings details - Remove it all
    if 'Net Borrowings' in line:
        # Remove extra data
        line = line[:line.find('Net Borrowings')]
    # Trim prior data
    line = line[line.find('Sale Purchase of Stock'):]
    # Determine if buys or sells, replace open parantheses:
    # (#,###) -> -#,###
    line = re.sub(r'[(]', '-', line)
    # Eliminate commas and close parantheses: -#,### -> -####
    line = re.sub(r'[,|)]', '', line)
    # Remove HTML data and markup, replacing with commas
    line = re.sub(r'[<.*?>|]', ',', line)
    line = re.sub(' ', ',', line)
    # Locate the beginnings of each quarterly Sale Purchase points
    starts = [m.start() for m in re.finditer(',\d+,|,.\d+', line)]
    # Locate the ends of each quarterly Sale Purchase points
    ends = [m.start() for m in re.finditer('\d,', line)]
    # Sum all buys and sells across year
    tot = 0
    for i in range(0, len(starts)):
        # x1000 because all numbers are in thousands
        tot = tot + float(line[starts[i] + 1: ends[i] + 1]) * 1000
    return tot

def _parseHtml(line):
    "Parses the HTML line by </td> breaks and returns the delimited string"
    # Replace </td> breaks with placeholder, '`'
    ph = '`'
    rem = re.sub('</td>', ph, line)
    # The ticker symbol initial delimiter is different
    # Remove all other remaining HTML data
    stkraw = re.sub('<.*?>', '', rem)
    # Replace unbalanced HTML
    stkraw = re.sub('">', '`', stkraw)
    # Find the placeholders
    dl = [m.start() for m in re.finditer(ph, stkraw)]
    return (stkraw, dl)
If anyone has any input, or perhaps a better method such as BeautifulSoup, I'd really appreciate it! I'm very open to any tutorials that would help as well. My intent is to both improve my programming ability and end up with an effective stock screener.
I was having the same issue scraping the Yahoo data in Python, and in Matlab as well. As a workaround, I wrote a macro in VBA to grab all of the EV/EBITDA data from Yahoo by visiting each stock's Key Statistics page. However, it takes about a day to run on all 3,000+ stocks with market caps over $200M, which is not really practical.
I've tried finding EV/EBITDA on various stock screeners online, but they either don't report it or only let you download a couple hundred stocks' data without paying. Busy Stock's screener seems the best in this regard, but their EV/EBITDA figures don't line up with Yahoo's, which makes me worry that they use a different methodology.
One solution and my recommendation to you is to use the Trending Value algorithm in Quantopian, which is free. You can find the code here: https://www.quantopian.com/posts/oshaugnessy-what-works-on-wall-street
Quantopian will let you backtest the algorithm to 2002, and live test it as well.
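Since the question mentions BeautifulSoup: the string-slicing in _parseHtml/_readFinvizLine is one of the more fragile parts, and the same table cells can be pulled out with a parser instead. This is only a rough, untested sketch of the idea; the screener URL is the one already used above, and the <tr>/<td> row structure is an assumption about the page layout:

import requests
from bs4 import BeautifulSoup

def read_screener_rows(url):
    """Yield the cell texts of each table row on an HTML page."""
    resp = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
    soup = BeautifulSoup(resp.text, 'html.parser')
    for row in soup.find_all('tr'):
        cells = [td.get_text(strip=True) for td in row.find_all('td')]
        if cells:
            yield cells

url = ('http://finviz.com/screener.ashx?v=152&f=cap_smallover&'
       'ft=4&c=0,1,2,6,7,10,11,13,14,45,65')
for i, cells in enumerate(read_screener_rows(url)):
    print(cells)
    if i > 5:
        break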
