Scraping only new elements from website using requests and bs4 - python

I want to monitor a certain website and collect data from it. On the first visit I collect all the data that is already there so I can ignore it later. I then want to perform a certain action whenever a new row is added (print it, in this example). But whenever a new item appears, the script seems to print every single row on the website, even though I'm checking whether the row already exists in the dictionary. I don't know how to fix it; can anyone take a look?
import requests
import re
import copy
import time
from datetime import date
from bs4 import BeautifulSoup


class KillStatistics:
    def __init__(self):
        self.records = {}
        self.watched_names = ["Test"]
        self.iter = 0

    def parse_records(self):
        r = requests.get("http://149.56.28.71/?subtopic=killstatistics")
        soup = BeautifulSoup(r.content, "html.parser")
        table = soup.findChildren("table")
        for record in table:
            for data in record:
                if data.text == "Last Deaths":
                    pass
                else:
                    entry = data.text
                    entry = re.split("..?(?=[0-9][A-Z]).", data.text)
                    entry[0] = entry[0].split(", ")
                    entry[0][0] = entry[0][0].split(".")
                    entry_id, day, month, year, hour = (
                        entry[0][0][0],
                        entry[0][0][1],
                        entry[0][0][2],
                        entry[0][0][3],
                        entry[0][1],
                    )
                    message = entry[1]
                    nickname = (re.findall(".+?(?=at)", message)[0]).strip()
                    killed_by = (re.findall(r"(?<=\bby).*", message)[0]).strip()
                    if self.iter < 1:
                        """It's the first visit to the website; I want it to download all the data and store it in the dictionary"""
                        self.records[
                            entry_id
                        ] = f"{nickname} was killed by {killed_by} at {day}-{month}-{year} {hour}"
                    elif (
                        self.iter > 1
                        and f"{nickname} was killed by {killed_by} at {day}-{month}-{year} {hour}"
                        not in self.records.values()
                    ):
                        """Here I want to look into the dictionary to check if the element exists in it;
                        if not, print it and add it to the dictionary at [entry_id] so we can skip it in the next iteration.
                        Don't know why, but whenever a new item appears on the website it seems to edit every item
                        in the dictionary instead of just the one that wasn't there"""
                        print(
                            f"{nickname} was killed by {killed_by} at {day}-{month}-{year} {hour}"
                        )
                        self.records[
                            entry_id
                        ] = f"{nickname} was killed by {killed_by} at {day}-{month}-{year} {hour}"
        print("---")
        self.iter += 1


ks = KillStatistics()

if __name__ == "__main__":
    while True:
        ks.parse_records()
        time.sleep(10)
The entry_id values are always the same: there are 500 rows of data with ids 1, 2, 3, ..., 500, and the newest entry always has id 1. I know I could always check id 1 to get the newest one, but sometimes, for example, 10 players can die at the same time, so I would like to check all the rows for changes and only print the new ones.
Current output:
Velerion was killed by Rat and Cave Rat at 27-12-2021 16:53
Scrappy was killed by Cursed Queen at 27-12-2021 16:52
Velerion was killed by Rat at 27-12-2021 16:28
Velerion was killed by Rat at 27-12-2021 16:22
Velerion was killed by Rat at 27-12-2021 16:21
Velerion was killed by Rat at 27-12-2021 15:51
Shade was killed by Tentacle Slayer at 27-12-2021 15:46
Mr Yahoo was killed by Immortal Hunter at 27-12-2021 15:41
Scrappy was killed by Witch Hunter at 27-12-2021 15:39
Barbudo Arqueiro was killed by Seahorse at 27-12-2021 15:23
Emperor Martino was killed by Dark Slayer at 27-12-2021 15:14
Shade was killed by Tentacle Slayer at 27-12-2021 15:11
Head Hunter was killed by Demon Blood Slayer at 27-12-2021 15:09
Expected output:
Velerion was killed by Rat and Cave Rat at 27-12-2021 16:53

Here's what I've changed in your processing. I'm using one regex to parse all of the header information. That gets me all 7 numeric fields at once, and the overall length of the match tells me where the message starts.
Then, I'm using the timestamp to determine what data is new. The newest entry is always first, so I grab the first timestamp of the lot to use as the threshold for the next.
Then, I'm storing the entries in a list instead of a dict. If you don't really need to store them forever, but just want to print them, then you don't need to track the list at all.
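As a quick editorial aside, not part of the original answer: the threshold check works because Python compares tuples element by element, so a (year, month, day, hour, minute, second) tuple orders chronologically. A tiny illustration with made-up timestamps:
# Two made-up timestamps in (year, month, day, hour, minute, second) form
newer = (2021, 12, 27, 16, 53, 0)
older = (2021, 12, 27, 16, 28, 0)
print(newer > older)  # True: tuples compare element by element, so year-first order is chronological
With that aside, here is the reworked script: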
import requests
import re
import copy
import time
from datetime import date
from bs4 import BeautifulSoup

prefix = r"(\d+)\.(\d+)\.(\d+)\.(\d+)\, (\d+):(\d+):(\d+)"


class KillStatistics:
    def __init__(self):
        self.records = []
        self.latest = (0, 0, 0, 0, 0, 0, 0)
        self.watched_names = ["Test"]

    def parse_records(self):
        r = requests.get("http://149.56.28.71/?subtopic=killstatistics")
        soup = BeautifulSoup(r.content, "html.parser")
        table = soup.findChildren("table")
        latest = None
        for record in table:
            for data in record:
                if data.text == "Last Deaths":
                    continue
                entry = data.text
                mo = re.match(prefix, entry)
                entry_id, day, month, year, hour, mm, ss = mo.groups()
                stamp = tuple(int(i) for i in (year, month, day, hour, mm, ss))
                if latest is None:
                    latest = stamp
                if stamp > self.latest:
                    rest = entry[mo.span()[1]:]
                    i = rest.find(" at ")
                    j = rest.find(" by ")
                    nickname = rest[:i]
                    killed_by = rest[j + 4:]
                    msg = f"{nickname} was killed by {killed_by} at {day}-{month}-{year} {hour}"
                    print(msg)
                    self.records.append(msg)
        print("---")
        self.latest = latest


ks = KillStatistics()

if __name__ == "__main__":
    while True:
        ks.parse_records()
        time.sleep(10)

I managed to find the solution myself. In this specific case, I needed to create a list of entries and keep a copy of it. After each lookup, I compare the newly created list to the old one and use set().difference() to return the new records.
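As a small illustration of the set-difference idea with made-up records (not the real site data):
old_table = ["A was killed by Rat", "B was killed by Witch"]
records = ["C was killed by Demon", "A was killed by Rat", "B was killed by Witch"]

# Only entries missing from the old snapshot are reported as new
new_records = list(set(records).difference(old_table))
print(new_records)  # ['C was killed by Demon']
The full solution follows: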
import requests
import re
import copy
import time
from datetime import date
from bs4 import BeautifulSoup


class KillStatistics:
    def __init__(self):
        self.records = []
        self.old_table = []
        self.watched_names = ["Test"]
        self.visited = False

    def parse_records(self):
        r = requests.get("http://149.56.28.71/?subtopic=killstatistics")
        soup = BeautifulSoup(r.content, "html.parser")
        table = soup.findChildren("table")
        self.records = []
        for record in table:
            for data in record:
                if data.text == "Last Deaths":
                    continue
                else:
                    entry = data.text
                    entry = re.split("..?(?=[0-9][A-Z]).", data.text)
                    entry[0] = entry[0].split(", ")
                    entry[0][0] = entry[0][0].split(".")
                    entry_id, day, month, year, hour = (
                        entry[0][0][0],
                        entry[0][0][1],
                        entry[0][0][2],
                        entry[0][0][3],
                        entry[0][1],
                    )
                    ## Record times are in EST; they can be converted to Brazilian time and CET.
                    message = entry[1]
                    nickname = (re.findall(".+?(?=at)", message)[0]).strip()
                    killed_by = (re.findall(r"(?<=\bby).*", message)[0]).strip()
                    record_to_add = (
                        f"{nickname},{killed_by},{day}-{month}-{year} {hour}"
                    )
                    self.records.append(record_to_add)
        if len(self.old_table) > 0:
            self.compare_records(self.old_table, self.records)
        else:
            print("Setting up initial data...")
        print("---")
        self.visited = True

    def compare_records(self, old_records, new_records):
        new_record = list(set(new_records).difference(old_records))
        if len(new_record) > 0:
            print("We got a new record")
            for i in new_record:
                print(i)
                with open("/home/sammy/gp/new_records", "a") as f:
                    f.write(i + "\n")
        else:
            print("No new records")


class DiscordBot:
    def __init__(self):
        pass


if __name__ == "__main__":
    ks = KillStatistics()
    while True:
        ks.parse_records()
        ks.old_table = copy.deepcopy(ks.records)
        time.sleep(10)

Related

IndexError: list index out of range (on Reddit data crawler)

The code below is expected to run without issues. Here is the Reddit data crawler:
import requests
import re
import praw
from datetime import date
import csv
import pandas as pd
import time
import sys


class Crawler(object):
    '''
    basic_url is the reddit site.
    headers is for requests.get method
    REX is to find submission ids.
    '''
    def __init__(self, subreddit="apple"):
        '''
        Initialize a Crawler object.
        subreddit is the topic you want to parse. default is r"apple"
        basic_url is the reddit site.
        headers is for requests.get method
        REX is to find submission ids.
        submission_ids save all the ids of submission you will parse.
        reddit is an object created using praw API. Please check it before you use.
        '''
        self.basic_url = "https://www.reddit.com"
        self.headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'}
        self.REX = re.compile(r"<div class=\" thing id-t3_[\w]+")
        self.subreddit = subreddit
        self.submission_ids = []
        self.reddit = praw.Reddit(client_id="your_id", client_secret="your_secret", user_agent="subreddit_comments_crawler")

    def get_submission_ids(self, pages=2):
        '''
        Collect all ids of submissions..
        One page has 25 submissions.
        page url: https://www.reddit.com/r/subreddit/?count25&after=t3_id
        id(after) is the last submission from last page.
        '''
        # This is page url.
        url = self.basic_url + "/r/" + self.subreddit
        if pages <= 0:
            return []
        text = requests.get(url, headers=self.headers).text
        ids = self.REX.findall(text)
        ids = list(map(lambda x: x[-6:], ids))
        if pages == 1:
            self.submission_ids = ids
            return ids
        count = 0
        after = ids[-1]
        for i in range(1, pages):
            count += 25
            temp_url = self.basic_url + "/r/" + self.subreddit + "?count=" + str(count) + "&after=t3_" + ids[-1]
            text = requests.get(temp_url, headers=self.headers).text
            temp_list = self.REX.findall(text)
            temp_list = list(map(lambda x: x[-6:], temp_list))
            ids += temp_list
            if count % 100 == 0:
                time.sleep(60)
        self.submission_ids = ids
        return ids

    def get_comments(self, submission):
        '''
        Submission is an object created using praw API.
        '''
        # Remove all "more comments".
        submission.comments.replace_more(limit=None)
        comments = []
        for each in submission.comments.list():
            try:
                comments.append((each.id, each.link_id[3:], each.author.name, date.fromtimestamp(each.created_utc).isoformat(), each.score, each.body))
            except AttributeError as e:  # Some comments are deleted, we cannot access them.
                # print(each.link_id, e)
                continue
        return comments

    def save_comments_submissions(self, pages):
        '''
        1. Save all the ids of submissions.
        2. For each submission, save information of this submission. (submission_id, #comments, score, subreddit, date, title, body_text)
        3. Save comments in this submission. (comment_id, submission_id, author, date, score, body_text)
        4. Separately, save them to two csv file.
        Note: You can link them with submission_id.
        Warning: According to the rule of Reddit API, the get action should not be too frequent. Safely, use the defalut time span in this crawler.
        '''
        print("Start to collect all submission ids...")
        self.get_submission_ids(pages)
        print("Start to collect comments...This may cost a long time depending on # of pages.")
        submission_url = self.basic_url + "/r/" + self.subreddit + "/comments/"
        comments = []
        submissions = []
        count = 0
        for idx in self.submission_ids:
            temp_url = submission_url + idx
            submission = self.reddit.submission(url=temp_url)
            submissions.append((submission.name[3:], submission.num_comments, submission.score, submission.subreddit_name_prefixed, date.fromtimestamp(submission.created_utc).isoformat(), submission.title, submission.selftext))
            temp_comments = self.get_comments(submission)
            comments += temp_comments
            count += 1
            print(str(count) + " submissions have got...")
            if count % 50 == 0:
                time.sleep(60)
        comments_fieldnames = ["comment_id", "submission_id", "author_name", "post_time", "comment_score", "text"]
        df_comments = pd.DataFrame(comments, columns=comments_fieldnames)
        df_comments.to_csv("comments.csv")
        submissions_fieldnames = ["submission_id", "num_of_comments", "submission_score", "submission_subreddit", "post_date", "submission_title", "text"]
        df_submission = pd.DataFrame(submissions, columns=submissions_fieldnames)
        df_submission.to_csv("submissions.csv")
        return df_comments


if __name__ == "__main__":
    args = sys.argv[1:]
    if len(args) != 2:
        print("Wrong number of args...")
        exit()
    subreddit, pages = args
    c = Crawler(subreddit)
    c.save_comments_submissions(int(pages))
but I got:
(base) UserAir:scrape_reddit user$ python reddit_crawler.py apple 2
Start to collect all submission ids...
Traceback (most recent call last):
  File "reddit_crawler.py", line 127, in <module>
    c.save_comments_submissions(int(pages))
  File "reddit_crawler.py", line 94, in save_comments_submissions
    self.get_submission_ids(pages)
  File "reddit_crawler.py", line 54, in get_submission_ids
    after = ids[-1]
IndexError: list index out of range
Erik's answer diagnoses the specific cause of this error, but more broadly I think this is caused by you not using PRAW to its fullest potential. Your script imports requests and performs a lot of manual requests that PRAW has methods for already. The whole point of PRAW is to prevent you from having to write these requests that do things such as paginate a listing, so I recommend you take advantage of that.
As an example, your get_submission_ids function (which scrapes the web version of Reddit and handles paginating) could be replaced by just
def get_submission_ids(self, pages=2):
    return [
        submission.id
        for submission in self.reddit.subreddit(self.subreddit).hot(
            limit=25 * pages
        )
    ]
because the .hot() function does everything you tried to do by hand.
I'm going to go one step further here and have the function just return a list of Submission objects, because the rest of your code ends up doing things that would be better done by interacting with the PRAW Submission object. Here's that code (I renamed the function to reflect its updated purpose):
def get_submissions(self, pages=2):
    return list(self.reddit.subreddit(self.subreddit).hot(limit=25 * pages))
(I've updated this function to just return its result, as your version both returns the value and sets it as self.submission_ids, unless pages is 0. That felt quite inconsistent, so I made it just return the value.)
Your get_comments function looks good.
The save_comments_submissions function, like get_submission_ids, does a lot of manual work that PRAW can handle. You construct a temp_url that has the full URL of a post, and then use that to make a PRAW Submission object, but we can replace that with directly using the one returned by get_submissions. You also have some calls to time.sleep() which I removed because PRAW will automatically sleep the appropriate amount for you. Lastly, I removed the return value of this function because the point of the function is to save data to disk, not to return it to anywhere else, and the rest of your script doesn't use the return value. Here's the updated version of that function:
def save_comments_submissions(self, pages):
    """
    1. Save all the ids of submissions.
    2. For each submission, save information of this submission. (submission_id, #comments, score, subreddit, date, title, body_text)
    3. Save comments in this submission. (comment_id, submission_id, author, date, score, body_text)
    4. Separately, save them to two csv file.
    Note: You can link them with submission_id.
    Warning: According to the rule of Reddit API, the get action should not be too frequent. Safely, use the defalut time span in this crawler.
    """
    print("Start to collect all submission ids...")
    submissions = self.get_submissions(pages)
    print(
        "Start to collect comments...This may cost a long time depending on # of pages."
    )
    comments = []
    pandas_submissions = []
    for count, submission in enumerate(submissions):
        pandas_submissions.append(
            (
                submission.name[3:],
                submission.num_comments,
                submission.score,
                submission.subreddit_name_prefixed,
                date.fromtimestamp(submission.created_utc).isoformat(),
                submission.title,
                submission.selftext,
            )
        )
        temp_comments = self.get_comments(submission)
        comments += temp_comments
        print(str(count) + " submissions have got...")
    comments_fieldnames = [
        "comment_id",
        "submission_id",
        "author_name",
        "post_time",
        "comment_score",
        "text",
    ]
    df_comments = pd.DataFrame(comments, columns=comments_fieldnames)
    df_comments.to_csv("comments.csv")
    submissions_fieldnames = [
        "submission_id",
        "num_of_comments",
        "submission_score",
        "submission_subreddit",
        "post_date",
        "submission_title",
        "text",
    ]
    df_submission = pd.DataFrame(pandas_submissions, columns=submissions_fieldnames)
    df_submission.to_csv("submissions.csv")
Here's an updated version of the whole script that uses PRAW fully:
from datetime import date
import sys

import pandas as pd
import praw


class Crawler:
    """
    basic_url is the reddit site.
    headers is for requests.get method
    REX is to find submission ids.
    """

    def __init__(self, subreddit="apple"):
        """
        Initialize a Crawler object.
        subreddit is the topic you want to parse. default is r"apple"
        basic_url is the reddit site.
        headers is for requests.get method
        REX is to find submission ids.
        submission_ids save all the ids of submission you will parse.
        reddit is an object created using praw API. Please check it before you use.
        """
        self.subreddit = subreddit
        self.submission_ids = []
        self.reddit = praw.Reddit(
            client_id="your_id",
            client_secret="your_secret",
            user_agent="subreddit_comments_crawler",
        )

    def get_submissions(self, pages=2):
        """
        Collect all submissions..
        One page has 25 submissions.
        page url: https://www.reddit.com/r/subreddit/?count25&after=t3_id
        id(after) is the last submission from last page.
        """
        return list(self.reddit.subreddit(self.subreddit).hot(limit=25 * pages))

    def get_comments(self, submission):
        """
        Submission is an object created using praw API.
        """
        # Remove all "more comments".
        submission.comments.replace_more(limit=None)
        comments = []
        for each in submission.comments.list():
            try:
                comments.append(
                    (
                        each.id,
                        each.link_id[3:],
                        each.author.name,
                        date.fromtimestamp(each.created_utc).isoformat(),
                        each.score,
                        each.body,
                    )
                )
            except AttributeError as e:  # Some comments are deleted, we cannot access them.
                # print(each.link_id, e)
                continue
        return comments

    def save_comments_submissions(self, pages):
        """
        1. Save all the ids of submissions.
        2. For each submission, save information of this submission. (submission_id, #comments, score, subreddit, date, title, body_text)
        3. Save comments in this submission. (comment_id, submission_id, author, date, score, body_text)
        4. Separately, save them to two csv file.
        Note: You can link them with submission_id.
        Warning: According to the rule of Reddit API, the get action should not be too frequent. Safely, use the defalut time span in this crawler.
        """
        print("Start to collect all submission ids...")
        submissions = self.get_submissions(pages)
        print(
            "Start to collect comments...This may cost a long time depending on # of pages."
        )
        comments = []
        pandas_submissions = []
        for count, submission in enumerate(submissions):
            pandas_submissions.append(
                (
                    submission.name[3:],
                    submission.num_comments,
                    submission.score,
                    submission.subreddit_name_prefixed,
                    date.fromtimestamp(submission.created_utc).isoformat(),
                    submission.title,
                    submission.selftext,
                )
            )
            temp_comments = self.get_comments(submission)
            comments += temp_comments
            print(str(count) + " submissions have got...")
        comments_fieldnames = [
            "comment_id",
            "submission_id",
            "author_name",
            "post_time",
            "comment_score",
            "text",
        ]
        df_comments = pd.DataFrame(comments, columns=comments_fieldnames)
        df_comments.to_csv("comments.csv")
        submissions_fieldnames = [
            "submission_id",
            "num_of_comments",
            "submission_score",
            "submission_subreddit",
            "post_date",
            "submission_title",
            "text",
        ]
        df_submission = pd.DataFrame(pandas_submissions, columns=submissions_fieldnames)
        df_submission.to_csv("submissions.csv")


if __name__ == "__main__":
    args = sys.argv[1:]
    if len(args) != 2:
        print("Wrong number of args...")
        exit()
    subreddit, pages = args
    c = Crawler(subreddit)
    c.save_comments_submissions(int(pages))
I realize that my answer here gets into Code Review territory, but I hope that this answer is helpful for understanding some of the things PRAW can do. Your "list index out of range" error would have been avoided by using the pre-existing library code, so I do consider this to be a solution to your problem.
When my_list[-1] throws an IndexError, it means that my_list is empty:
>>> ids = []
>>> ids[-1]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
IndexError: list index out of range
>>> ids = ['1']
>>> ids[-1]
'1'
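One minimal way to apply that diagnosis to the crawler above (a sketch, not taken from either answer) is to bail out of get_submission_ids before indexing when the first page yields no ids, for example because the request was blocked or the markup no longer matches the regex. Right after the ids are extracted, something like:
ids = self.REX.findall(text)
ids = list(map(lambda x: x[-6:], ids))
if not ids:
    # Nothing matched: the request may have been rate-limited or the page layout changed
    print("No submission ids found; check the response and the REX pattern.")
    self.submission_ids = []
    return []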

Optimizing selenium code

So I wrote some code to grab data about classes at a college to build an interactive scheduler. Here is the code I have to get data:
from selenium import webdriver
import os
import pwd
import shlex
import re
import time

usr = pwd.getpwuid(os.getuid()).pw_name
Path = ('/Users/%s/Downloads/chromedriver') % usr  # Have chromedriver downloaded

# Create a new instance of the Chrome driver
options = webdriver.ChromeOptions()
options.binary_location = '/Applications/Google Chrome.app/Contents/MacOS/Google Chrome'
options.add_argument('headless')  # Headless so no window is opened
options.add_argument('window-size=1200x600')
driver = webdriver.Chrome(Path, chrome_options=options)

driver.get('https://web.stevens.edu/scheduler/core/2017F/2017F.xml')  # Go to database

classes = {}

def Database(AllSelectedCourseInfo):
    ClassDict = {}
    for item in AllSelectedCourseInfo:  # Go through list of class info
        try:
            thing = item.split("=")  # Split string by = to get subject name and value
            name = thing[0]
            if any(char.isdigit() for char in thing[1]):  # Get rid of annoying Z at the end of numbers
                thing[1] = re.sub("[Z]", "", thing[1])
            value = thing[1]
            if value:  # If subject has a value, store it
                ClassDict[str(name)] = str(value)  # Store value in a dictionary with the subject as the key
        except:
            pass
    classes[str(ClassDict["Section"])] = ClassDict  # Add to dictionary

def makeDatabase(section):
    if "Title" in driver.find_element_by_xpath("//*[text()='%s']" % section).find_element_by_xpath("..").text:
        classSection = driver.find_elements_by_xpath("//*[text()='%s']" % section)  # If class name given, find class
        for i in range(0, len(classSection)):
            AllSelectedCourseInfo = shlex.split(classSection[i].find_element_by_xpath(".." + "/.." * 4).text.replace("/>", "").replace(">", ""))  # sort into a list, grouping strings in quotes and getting rid of unnecessary symbols
            Database(AllSelectedCourseInfo)
    else:
        classSection = driver.find_element_by_xpath("//*[text()='%s']" % section)  # If class section given, find class
        AllSelectedCourseInfo = shlex.split(classSection.find_element_by_xpath(".." + "/.." * 3).text.replace("/>", "").replace(">", ""))  # sort into a list, grouping strings in quotes and getting rid of unnecessary symbols
        Database(AllSelectedCourseInfo)

def printDic():
    for key in classes:
        print "\n-------------%s------------" % key
        for classkey in classes[key]:
            print "%s : %s" % (classkey, classes[key][classkey])

start = time.time()

makeDatabase("Differential Calculus")
makeDatabase("MA 124B")

printDic()

end = time.time()
print end - start

driver.quit()
It takes about 20 seconds for me to pull data from one class and one class section. If I am to make this practical, it is going to need at least 7 classes, and that would take over a minute just to create the dictionaries. Does anyone know of a way to make this run any faster?
I tried to integrate lxml and requests into my code, but it just didn't have what I was looking for. After a few days of trying to use lxml to accomplish this to no avail, I decided to try beautifulsoup4 with urllib. This worked better than I could have hoped:
from bs4 import BeautifulSoup
from HTMLParser import HTMLParser
import urllib
import shlex
import re
import time

h = HTMLParser()
page = urllib.urlopen('https://web.stevens.edu/scheduler/core/2017F/2017F.xml').read()  # Get to database
soup = BeautifulSoup(page)
RawClassData = soup.contents[10].contents[0].contents[0].contents

classes = {}
backupClasses = {}

def makeDatabase():
    for i in range(0, len(RawClassData)):  # Parse through each class
        try:
            AllSelectedCourseInfo = shlex.split(h.unescape(str(RawClassData[i]).replace(">", " ")))  # sort into a list grouping string in quotes and getting rid of unnecessary symbols
            ClassDict = {}
            for item in AllSelectedCourseInfo:  # Go through list of class info
                try:
                    thing = item.split("=")  # Split string by = to get subject name and value
                    name = thing[0]
                    if any(char.isdigit() for char in thing[1]):  # Get rid of annoying Z at the end of numbers
                        thing[1] = re.sub("[Z]", "", thing[1])
                    value = thing[1]
                    if value:  # If subject has a value, store it
                        ClassDict[str(name)] = str(value)  # Store value in a dictionary with the subject as the key
                except:
                    pass
            classes[str(ClassDict["section"])] = ClassDict
        except:
            pass

def printDic():
    with open("Classes", "w") as f:
        for key in classes:
            f.write("\n-------------%s------------" % key)
            for classkey in classes[key]:
                f.write("\n%s : %s" % (classkey, classes[key][classkey]))
            f.write("\n")

def printSection(selection):
    print "\n-------------%s------------" % selection
    for classkey in classes[selection]:
        print "%s : %s" % (classkey, classes[selection][classkey])

def printClass(selection):
    try:
        for key in classes:
            if classes[key]["title"] == selection:
                print "\n-------------%s------------" % key
                for classkey in classes[key]:
                    print "%s : %s" % (classkey, classes[key][classkey])
    finally:
        print "\n-------------%s------------" % selection
        for classkey in classes[selection]:
            print "%s : %s" % (classkey, classes[selection][classkey])

start = time.time()

makeDatabase()

end = time.time()

printClass("Circuits and Systems")
printClass("Differential Equations")
printClass("Writing & Communications Collqm")
printClass("Mechanics of Solids")
printClass("Electricity & Magnetism")
printClass("Engineering Design III")
printClass("Freshman Quiz")

printDic()

print end - start
This new code creates a library of all classes and then prints out the desired classes, all in 2 seconds. The selenium code took 89 seconds just to build the library for the desired classes and print them out, so I would say that's a slight improvement... Thanks a ton to perfect5th for the suggestion!

How to scrape entire integer in python with Beautiful Soup?

I'm working on getting some wave heights from websites, and my code fails when the wave heights get into the double-digit range.
Ex: Currently the code would scrape a 12 from the site as '1' and '2' separately, not '12'.
#Author: David Owens
#File name: soupScraper.py
#Description: html scraper that takes surf reports from various websites

import csv
import requests
from bs4 import BeautifulSoup

NUM_SITES = 2

reportsFinal = []

###################### SURFLINE URL STRINGS AND TAG ###########################
slRootUrl = 'http://www.surfline.com/surf-report/'
slSunsetCliffs = 'sunset-cliffs-southern-california_4254/'
slScrippsUrl = 'scripps-southern-california_4246/'
slBlacksUrl = 'blacks-southern-california_4245/'
slCardiffUrl = 'cardiff-southern-california_4786/'

slTagText = 'observed-wave-range'
slTag = 'id'

#list of surfline URL endings
slUrls = [slSunsetCliffs, slScrippsUrl, slBlacksUrl]
###############################################################################

#################### MAGICSEAWEED URL STRINGS AND TAG #########################
msRootUrl = 'http://magicseaweed.com/'
msSunsetCliffs = 'Sunset-Cliffs-Surf-Report/4211/'
msScrippsUrl = 'Scripps-Pier-La-Jolla-Surf-Report/296/'
msBlacksUrl = 'Torrey-Pines-Blacks-Beach-Surf-Report/295/'

msTagText = 'rating-text'
msTag = 'li'

#list of magicseaweed URL endings
msUrls = [msSunsetCliffs, msScrippsUrl, msBlacksUrl]
###############################################################################

'''
This class represents a surf break. It contains all wave, wind, & tide data
associated with that break relevant to the website
'''
class surfBreak:
    def __init__(self, name, low, high, wind, tide):
        self.name = name
        self.low = low
        self.high = high
        self.wind = wind
        self.tide = tide

    #toString method
    def __str__(self):
        return '{0}: Wave height: {1}-{2} Wind: {3} Tide: {4}'.format(self.name,
            self.low, self.high, self.wind, self.tide)
#END CLASS

'''
This returns the proper attribute from the surf report sites
'''
def reportTagFilter(tag):
    return (tag.has_attr('class') and 'rating-text' in tag['class']) \
        or (tag.has_attr('id') and tag['id'] == 'observed-wave-range')
#END METHOD

'''
This method checks if the parameter is of type int
'''
def representsInt(s):
    try:
        int(s)
        return True
    except ValueError:
        return False
#END METHOD

'''
This method extracts all ints from a list of reports

reports: The list of surf reports from a single website
returns: reportNums - A list of ints of the wave heights
'''
def extractInts(reports):
    print reports
    reportNums = []
    afterDash = False
    num = 0
    tens = 0
    ones = 0

    #extract all ints from the reports and ditch the rest
    for report in reports:
        for char in report:
            if representsInt(char) == True:
                num = int(char)
                reportNums.append(num)
            else:
                afterDash = True

    return reportNums
#END METHOD

'''
This method iterates through a list of urls and extracts the surf report from
the webpage dependent upon its tag location

rootUrl: The root url of each surf website
urlList: A list of specific urls to be appended to the root url for each
         break
tag: the html tag where the actual report lives on the page

returns: a list of strings of each breaks surf report
'''
def extractReports(rootUrl, urlList, tag, tagText):
    #empty list to hold reports
    reports = []
    reportNums = []
    index = 0

    #loop thru URLs
    for url in urlList:
        try:
            index += 1
            #request page
            request = requests.get(rootUrl + url)

            #turn into soup
            soup = BeautifulSoup(request.content, 'lxml')

            #get the tag where surflines report lives
            reportTag = soup.findAll(reportTagFilter)[0]

            reports.append(reportTag.text.strip())
        #notify if fail
        except:
            print 'scrape failure at URL ', index
            pass

    reportNums = extractInts(reports)

    return reportNums
#END METHOD

'''
This method calculates the average of the wave heights
'''
def calcAverages(reportList):
    #empty list to hold averages
    finalAverages = []
    listIndex = 0
    waveIndex = 0

    #loop thru list of reports to calc each breaks ave low and high
    for x in range(0, 6):
        #get low ave
        average = (reportList[listIndex][waveIndex]
            + reportList[listIndex + 1][waveIndex]) / NUM_SITES
        finalAverages.append(average)
        waveIndex += 1

    return finalAverages
#END METHOD

slReports = extractReports(slRootUrl, slUrls, slTag, slTagText)
msReports = extractReports(msRootUrl, msUrls, msTag, msTagText)

reportsFinal.append(slReports)
reportsFinal.append(msReports)

print 'Surfline: ', slReports
print 'Magicseaweed: ', msReports
You are not actually extracting integers but floats, it seems, since the values in reports are something like ['0.3-0.6 m']. Right now you are just going through every single character and either converting it to an int on its own or discarding it, so it is no wonder that you only get single-digit numbers.
One (arguably) simple way to extract those numbers from that string is with regexp:
import re

FLOATEXPR = re.compile(r"(\d+\.\d)-(\d+\.\d) {0,1}m")

def extractFloats(reports):
    reportNums = []
    for report in reports:
        groups = re.match(FLOATEXPR, report).groups()
        for group in groups:
            reportNums.append(float(group))
    return reportNums
This expression will match your floats and return them as a list.
In detail, the expression matches anything that has at least one digit before a '.', one digit after it, a '-' in between, another such float sequence, and an ending of 'm' or ' m'. It then groups the parts representing the floats into a tuple. For example, '12.0-3.0 m' would yield [12.0, 3.0]. If you expect more digits after the decimal point, you can add an extra '+' after the second '\d' in each group of the expression.
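For instance, assuming report strings shaped like the ones quoted above, a quick sanity check of the helper might look like this:
# Hypothetical inputs mimicking the scraped report strings
print(extractFloats(['0.3-0.6 m']))   # [0.3, 0.6]
print(extractFloats(['12.0-3.0 m']))  # [12.0, 3.0]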

Python Parsing with lxml

I've created the following scraper for NFL play-by-play data. It writes the results to a CSV file and does everything I need it to, except that I don't know how to attach a column to each line of the CSV file saying who actually has possession of the ball.
I can grab the text from the "home" and "away" <tr> tags to show who is playing in the game for query purposes later, but I need the scraper to recognize when possession changes (goes from home to away or vice versa). I'm fairly new to Python and have tried different indentation, but I don't think that's the issue. Any help would be greatly appreciated. I feel like the answer is beyond my scope of understanding.
I also realize that my code probably isn't the most Pythonic but I'm still learning. I'm using Python 2.7.9.
import lxml
from lxml import html
import csv
import urllib2
import re

game_date = raw_input('Enter game date: ')

data_html = 'http://www.cbssports.com/nfl/gametracker/playbyplay/NFL_20160109_PIT#CIN'
url = urllib2.urlopen(data_html).read()
data = lxml.html.fromstring(url)

plays = data.cssselect('tr#play')
home = data.cssselect('tr#home')
away = data.cssselect('tr#away')

csvfile = open('C:\\DATA\\PBP.csv', 'a')
writer = csv.writer(csvfile)

for play in plays:
    frame = []
    play = play.text_content()
    down = re.search(r'\d', play)
    if down == None:
        pass
    else:
        down = down.group()
    dist = re.search(r'-(\d+)', play)
    if dist == None:
        pass
    else:
        dist = dist.group(1)
    field_end = re.search(r'[A-Z]+', play)
    if field_end == None:
        pass
    else:
        field_end = field_end.group()
    yard_line = re.search(r'[A-Z]+([\d]+)', play)
    if yard_line == None:
        pass
    else:
        yard_line = yard_line.group(1)
    desc = re.search(r'\s(.*)', play)
    if desc == None:
        pass
    else:
        desc = desc.group()
    time = re.search(r'\((..*\d)\)\s', play)
    if time == None:
        pass
    else:
        time = time.group(1)
    for team in away:
        teamA = team.text_content()
        teamA = re.search(r'(\w+)\s', teamA)
        teamA = teamA.group(1)
        teamA = teamA.upper()
    for team in home:
        teamH = team.text_content()
        teamH = re.search(r'(\w+)\s', teamH)
        teamH = teamH.group(1)
        teamH = teamH.upper()
    frame.append(game_date)
    frame.append(down)
    frame.append(dist)
    frame.append(field_end)
    frame.append(yard_line)
    frame.append(time)
    frame.append(teamA)
    frame.append(teamH)
    frame.append(desc)
    writer.writerow(frame)

csvfile.close()
I guess you need to append another value to the frame for each row, which is an indication of whether the possession changed.
After:
frame.append(desc)
add:
if teamA == teamH:
    frame.append("Same possession")
else:
    frame.append("Changed possession")
(note this assumes the team names are consistent, no extra spaces/padding/formatting in the teamA/teamH values).
You don't have to use strings, for example you could put 0 for no change and 1 for a change of possession.
HTH
Barny
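A numeric variant of the same idea, under the same assumption about the teamA/teamH values, could be:
# 0 = same possession, 1 = possession changed (same caveat about consistent team name formatting)
frame.append(0 if teamA == teamH else 1)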

How do I join results of looping script into a single variable?

I have a looping script returning different filtered results. I can make this data return as an array for each of the different filter classes, but I am unsure of the best method to join all of these arrays together.
import mechanize
import urllib
import json
import re
import random
import datetime
from sched import scheduler
from time import time, sleep
from sets import Set

##### Code to loop the script and set up scheduling time
s = scheduler(time, sleep)
random.seed()

##### Code to stop duplicates part 1
userset = set()

def run_periodically(start, end, interval, func):
    event_time = start
    while event_time < end:
        s.enterabs(event_time, 0, func, ())
        event_time += interval + random.randrange(-5, 10)
    s.run()

##### Code to get the data required from the URL desired
def getData():
    post_url = "URL OF INTEREST"
    browser = mechanize.Browser()
    browser.set_handle_robots(False)
    browser.addheaders = [('User-agent', 'Firefox')]

    ##### These are the parameters you've got from checking with the aforementioned tools
    parameters = {'page': '1',
                  'rp': '250',
                  'sortname': 'race_time',
                  'sortorder': 'asc'
                  }

    ##### Encode the parameters
    data = urllib.urlencode(parameters)
    trans_array = browser.open(post_url, data).read().decode('UTF-8')

    xmlload1 = json.loads(trans_array)
    pattern2 = re.compile('/control/profile/view/(.*)\' title=')
    pattern4 = re.compile('title=\'posted: (.*) strikes:')
    pattern5 = re.compile('strikes: (.*)\'><img src=')

    for row in xmlload1['rows']:
        cell = row["cell"]

        ##### defining the Keys (key is the area from which data is pulled in the XML) for use in the pattern finding/regex
        user_delimiter = cell['username']
        selection_delimiter = cell['race_horse']

        user_numberofselections = float(re.findall(pattern4, user_delimiter)[0])
        user_numberofstrikes = float(re.findall(pattern5, user_delimiter)[0])

        strikeratecalc1 = user_numberofstrikes / user_numberofselections
        strikeratecalc2 = strikeratecalc1 * 100

        userid_delimiter_results = (re.findall(pattern2, user_delimiter)[0])

        ##### Code to stop duplicates throughout the day part 2 (skips if the id is already in the userset)
        if userid_delimiter_results in userset:
            continue
        userset.add(userid_delimiter_results)

        arraym = ""
        arrayna = ""

        if strikeratecalc2 > 50 and strikeratecalc2 < 100:
            arraym0 = "System M"
            arraym1 = "user id = ", userid_delimiter_results
            arraym2 = "percantage = ", strikeratecalc2, "%"
            arraym3 = ""
            arraym = [arraym0, arraym1, arraym2, arraym3]

        if strikeratecalc2 > 0 and strikeratecalc2 < 50:
            arrayna0 = "System NA"
            arrayna1 = "user id = ", userid_delimiter_results
            arrayna2 = "percantage = ", strikeratecalc2, "%"
            arrayna3 = ""
            arrayna = [arrayna0, arrayna1, arrayna2, arrayna3]

getData()

run_periodically(time() + 5, time() + 1000000, 10, getData)
What I want to be able to do is return both 'arraym' and 'arrayna' as one final array. However, due to the looping nature of the script, the old 'arraym'/'arrayna' are overwritten on each pass of the loop, so my attempts to yield one array containing all of the data have only resulted in the last user id for 'System M' and the last user id for 'System NA'. This is obviously because each run of the loop overwrites the old 'arraym' and 'arrayna', but I do not know of a way around this so that all of my data can be accumulated in one array. Please note, I have been coding for cumulatively two weeks now, so there may well be some simple function to overcome this problem.
Kind regards, AEA
Without looking at that huge code segment, typically you can do something like:
my_array = []  # Create an empty list
for <some loop>:
    my_array.append(some_value)

# At this point, my_array is a list containing some_value for each loop iteration
print(my_array)
Look into python's list.append()
So your code might look something like:
#...
arraym = []
arrayna = []
for row in xmlload1['rows']:
    #...
    if strikeratecalc2 > 50 and strikeratecalc2 < 100:
        arraym.append("System M")
        arraym.append("user id = %s" % userid_delimiter_results)
        arraym.append("percantage = %s%%" % strikeratecalc2)
        arraym.append("")
    if strikeratecalc2 > 0 and strikeratecalc2 < 50:
        arrayna.append("System NA")
        arrayna.append("user id = %s" % userid_delimiter_results)
        arrayna.append("percantage = %s%%" % strikeratecalc2)
        arrayna.append("")
#...
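If the goal is to also accumulate results across the repeated getData() runs that run_periodically schedules, one option (a sketch that assumes getData() is changed to end with `return arraym, arrayna`) is to extend module-level lists from a small wrapper:
all_m = []   # hypothetical accumulators that survive across scheduled runs
all_na = []

def collect():
    # assumes getData() now returns the two lists instead of discarding them
    arraym, arrayna = getData()
    all_m.extend(arraym)
    all_na.extend(arrayna)
    print(all_m + all_na)  # the combined data gathered so far

run_periodically(time() + 5, time() + 1000000, 10, collect)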
