I'm working on some NFL statistics web scraping, though honestly the subject matter doesn't matter much. I spent a ton of time debugging because I couldn't believe what the code was doing: either I'm going crazy, or there is some sort of bug in a package or in Python itself. Here's the code I'm working with:
import pandas as pd
from bs4 import BeautifulSoup as bs
import requests
import string
import numpy as np

# get player list
players = pd.DataFrame({"name": [], "url": [], "positions": [], "startYear": [], "endYear": []})
letters = list(string.ascii_uppercase)
for letter in letters:
    print(letter)
    players_html = requests.get("https://www.pro-football-reference.com/players/" + letter + "/")
    soup = bs(players_html.content, "html.parser")
    for player in soup.find("div", {"id": "div_players"}).find_all("p"):
        temp_row = {}
        temp_row["url"] = "https://www.pro-football-reference.com" + player.find("a")["href"]
        temp_row["name"] = player.text.split("(")[0].strip()
        years = player.text.split(")")[1].strip()
        temp_row["startYear"] = int(years.split("-")[0])
        temp_row["endYear"] = int(years.split("-")[1])
        temp_row["positions"] = player.text.split("(")[1].split(")")[0]
        players = players.append(temp_row, ignore_index=True)
players = players[players.endYear > 2000]
players.reset_index(inplace=True, drop=True)
game_df = pd.DataFrame()

def apply_test(row):
    #print(row)
    url = row['url']
    #print(list(range(int(row['startYear']), int(row['endYear']) + 1)))
    for yr in range(int(row['startYear']), int(row['endYear']) + 1):
        print(yr)
        content = requests.get(url.split(".htm")[0] + "/gamelog/" + str(yr)).content
        soup = bs(content, 'html.parser').find("div", {"id": "all_stats"})
        # overheader
        over_headers = []
        for over in soup.find("thead").find("tr").find_all("th"):
            if "colspan" in over.attrs:
                for i in range(0, int(over['colspan'])):
                    over_headers = over_headers + [over.text]
            else:
                over_headers = over_headers + [over.text]
        # headers
        headers = []
        for header in soup.find("thead").find_all("tr")[1].find_all("th"):
            headers = headers + [header.text]
        all_headers = [a + "___" + b for a, b in zip(over_headers, headers)]
        # remove first column, it's meaningless
        all_headers = all_headers[1:len(all_headers)]
        for row in soup.find("tbody").find_all("tr"):
            temp_row = {}
            for i, col in enumerate(row.find_all("td")):
                temp_row[all_headers[i]] = col.text
            game_df = game_df.append(temp_row, ignore_index=True)

players.apply(apply_test, axis=1)
Now, I could get into what I'm trying to do overall, but there seems to be a much higher-level issue here. startYear and endYear in the for loop are 2013 and 2014, so the loop should set yr to 2013 and then 2014. But if you look at what print(yr) produces, it prints 2013 twice. If you simply comment out the game_df = game_df.append(temp_row, ignore_index=True) line, the printouts of yr are correct. There is an error shortly after the first two lines, but that one is expected and I'm comfortable debugging it. The fact that appending to a global DataFrame causes a for loop to behave differently is what's blowing my mind right now. Can someone help with this?
Thanks.
I don't really follow what the overall aim is but I do note two things:
Inside the function, game_df is treated as a local variable because you assign to it. You either need to declare global game_df before the game_df = game_df.append(temp_row, ignore_index=True) line, or, better still, pass the DataFrame as an argument in the def signature, though you would then need to amend players.apply(apply_test, axis=1) accordingly.
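For illustration, a minimal sketch of both options, with the scraping logic elided to a placeholder (collect_rows is my name, not from the original code). Note that DataFrame.append was removed in pandas 2.0; pd.concat is the modern replacement:

import pandas as pd

game_df = pd.DataFrame()

def apply_test(row):
    global game_df  # rebind the module-level name instead of creating a local
    temp_row = {"url": row["url"]}  # placeholder for the real scraping logic
    game_df = game_df.append(temp_row, ignore_index=True)

# Better: avoid the global entirely by returning rows and building the frame once.
def collect_rows(row):
    temp_row = {"url": row["url"]}  # placeholder for the real scraping logic
    return temp_row

# rows = players.apply(collect_rows, axis=1)
# game_df = pd.DataFrame(list(rows))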
You need to handle the cases where find returns None, e.g. soup.find("thead").find_all("tr")[1].find_all("th") for the page https://www.pro-football-reference.com/players/A/AaitIs00/gamelog/2014. Perhaps put in try/except blocks with appropriate default values.
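A rough sketch of that guard, reusing the soup variable from the question's code; the empty-list defaults are assumptions you would adjust to your data:

thead = soup.find("thead") if soup is not None else None
if thead is None:
    headers = []  # assumed default; you might log and skip the page instead
else:
    rows = thead.find_all("tr")
    headers = [th.text for th in rows[1].find_all("th")] if len(rows) > 1 else []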
I was trying to break the code down to its simplest form before adding more variables and such, and I'm stuck.
I want it so that when I use itertools, the first element is one of the tricks and the second element depends on that trick's landings(), i.e. it is drawn from the trick's corresponding landing list. I want to add further variables that branch off from landings() and so on.
The simplest form should print a list that looks like:
Backflip Complete
Backflip Hyper
180 Round Complete
180 Round Mega
Gumbi Complete
My Code:
from re import I
import pandas as pd
import numpy as np
import itertools
from io import StringIO
backflip = "Backflip"
one80round = "180 Round"
gumbi = "Gumbi"
tricks = [backflip,one80round,gumbi]
complete = "Complete"
hyper = "Hyper"
mega = "Mega"
backflip_landing = [complete,hyper]
one80round_landing = [complete,mega]
gumbi_landing = [complete]
def landings(tricks):
    if tricks == backflip:
        landing = backflip_landing
    elif tricks == one80round:
        landing = one80round_landing
    elif tricks == gumbi:
        landing = gumbi_landing
    return landing

for trik, land in itertools.product(tricks, landings(tricks)):
    trick_and_landing = (trik, land)
    result = (' '.join(trick_and_landing))
    tal = StringIO(result)
    tl = (pd.DataFrame((tal)))
    print(tl)
I get the error:
UnboundLocalError: local variable 'landing' referenced before assignment
Add a landing = "" after def landings(tricks): to get rid of the error.
But the if checks in your function are wrong. You check whether tricks, which is a list, is equal to backflip etc., which are all strings. That's why none of the ifs are true and landing never gets a value assigned.
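As a sketch of a fix (my suggestion, using the values from the question): map each trick to its landing list and use a plain nested loop, since the landing choices depend on the trick and therefore aren't a simple product of two independent lists:

tricks = ["Backflip", "180 Round", "Gumbi"]
landing_map = {
    "Backflip": ["Complete", "Hyper"],
    "180 Round": ["Complete", "Mega"],
    "Gumbi": ["Complete"],
}

# The second axis depends on the first, so nested loops express it directly.
for trik in tricks:
    for land in landing_map[trik]:
        print(trik, land)

This prints exactly the five lines listed as the expected output above.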
That question was also about permutations in Python; maybe it helps.
Is it possible to split variables that have already been assigned values, and re-piece them back together to hold those same previous values?
For Example:
URLs.QA.Signin = 'https://qa.test.com'
TestEnvironment = 'QA'
CurrentURL = 'URLs.' + TestEnvironment + '.Signin'
print(CurrentURL)
Outputs as: 'URLs.QA.Signin'
but I would like it to:
Output as: 'https://qa.test.com'
The purpose is so I can plug in any value to my 'TestEnvironment' variable and thus access any of my massive list of URL's with ease =P
I am green with Python. Your time and efforts are greatly appreciated! =)
Based upon evanrelf's answer, I tried and loved the following code:
This is exactly what I'm looking for. I might be overcomplicating it; any suggestions to clean up the code?
urls = {}
environment = 'qa'
district = 'pleasanthill'
url = environment + district
urls[url] = 'https://' + environment + '.' + district + '.test.com'
print(urls[url])
Output is: https://qa.pleasanthill.test.com
I would recommend you look into Python's dictionaries.
urls = {}
urls['qa'] = 'https://qa.test.com'
test_environment = 'qa'
print(urls[test_environment])
# => https://qa.test.com
As I understand it, you are trying to input a string and get a new string (the URL) back. The simplest answer is to use a dictionary. For example:
URLS = {'sheep' : 'wool.com', 'cows' : 'beef.com'}
Either this, or use two arrays and reference a common index, but who wants to do that :p
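If the environment/district combination from the follow-up above grows, a nested dictionary keeps the two lookups separate instead of concatenating key strings; the hostnames here are made up for the sketch:

urls = {
    "qa": {"pleasanthill": "https://qa.pleasanthill.test.com"},
    "prod": {"pleasanthill": "https://pleasanthill.test.com"},  # hypothetical entry
}

environment = "qa"
district = "pleasanthill"
print(urls[environment][district])  # => https://qa.pleasanthill.test.com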
I am trying to use JSON to search through the Google Maps API. When I give the location "Plymouth", the Google Maps API shows 6 results, but when I parse the JSON I get a length of only 2. I tried multiple cities too, but I always get a result set of 2.
What is wrong below?
import urllib.request as UR
import urllib.parse as URP
import json
url = "http://maps.googleapis.com/maps/api/geocode/json?address=Plymouth&sensor=false"
uh = UR.urlopen(url)
data = uh.read()
count = 0
js1 = json.loads(data.decode('utf-8') )
print ("Length: ", len(js1))
for result in js1:
    location = js1["results"][count]["formatted_address"]
    lat = js1["results"][count]["geometry"]["location"]["lat"]
    lng = js1["results"][count]["geometry"]["location"]["lng"]
    count = count + 1
    print('lat', lat, 'lng', lng)
    print(location)
Simply replace for result in js1: with for result in js1['results']:
By the way, as posted in a comment in the question, no need to use a counter. You can rewrite your for loop as:
for result in js1['results']:
    location = result["formatted_address"]
    lat = result["geometry"]["location"]["lat"]
    lng = result["geometry"]["location"]["lng"]
    print('lat', lat, 'lng', lng)
    print(location)
If you look at the JSON that comes in, you'll see that it's a single dict with two items ("results" and "status"). Add print('result:', result) to the top of your for loop and it will print result: status and result: results, because all you are iterating over are the keys of that outer dict. That's a general debugging trick in Python: if you aren't getting the stuff you want, put in a print statement to see what you got.
The results are (not surprisingly) in a list under js1["results"]. In your for loop, you ignore the variable you are iterating and go back to the original js1 for its data. This is unnecessary, and in your case it hid the error. Had you tried to reference the cities off of result, you would have gotten an error, and it might have been easier to see that result was "status", not the array you were after.
Now a few tweaks fix the problem
import urllib.request as UR
import urllib.parse as URP
import json

url = "http://maps.googleapis.com/maps/api/geocode/json?address=Plymouth&sensor=false"
uh = UR.urlopen(url)
data = uh.read()
js1 = json.loads(data.decode('utf-8'))
print("Length: ", len(js1["results"]))  # count the results list, not the outer dict
for result in js1["results"]:
    location = result["formatted_address"]
    lat = result["geometry"]["location"]["lat"]
    lng = result["geometry"]["location"]["lng"]
    print('lat', lat, 'lng', lng)
    print(location)
test = []
sites = sel.css(".info")
for site in sites:
    money = site.xpath("./h2[@class='money']/text()").extract()
    people = site.xpath("//p[@class='poeple']/text()").extract()
    test.append('{"money":' + str(money[0]) + ',"people":' + str(people[0]) + '}')
My result test is:
['{"money":1,"people":23}',
'{"money":3,"people":21}',
'{"money":12,"people":82}',
'{"money":1,"people":54}' ]
I'm stuck on two things:
One: when I print the type of the items in test, they are strings, so the list is not actually JSON.
Two: the money value 1 appears twice, so I need to add the corresponding people values together.
The final format I want is:
[
{"money":1,"people":77},
{"money":3,"people":21},
{"money":12,"people":82},
]
How can I do this??
I'd collect the money entries as keys in a dict and add up the people as values; the output to JSON should indeed be done using a json library. (I've not tested the code, but it should give you an idea of how you can approach the problem.)
money_map = {}
sites = sel.css(".info")
for site in sites:
    money = site.xpath("./h2[@class='money']/text()").extract()[0]
    people = int(site.xpath("//p[@class='poeple']/text()").extract()[0])
    if money not in money_map:
        money_map[money] = 0
    money_map[money] += people

import json
output = [{'money': key, 'people': value} for key, value in money_map.items()]
json_output = json.dumps(output)
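To see the aggregation in isolation, here is a self-contained version that feeds in the string list from the question instead of the scrapy selector (sel only exists inside the spider):

import json

test = ['{"money":1,"people":23}',
        '{"money":3,"people":21}',
        '{"money":12,"people":82}',
        '{"money":1,"people":54}']

money_map = {}
for item in test:
    parsed = json.loads(item)  # each string becomes a dict
    money_map[parsed["money"]] = money_map.get(parsed["money"], 0) + parsed["people"]

output = [{"money": k, "people": v} for k, v in money_map.items()]
print(json.dumps(output))
# [{"money": 1, "people": 77}, {"money": 3, "people": 21}, {"money": 12, "people": 82}]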
Basically this:
import json

foo = ['{"money":1,"people":23}',
       '{"money":3,"people":21}',
       '{"money":12,"people":82}',
       '{"money":1,"people":54}']

bar = []
for i in foo:
    j = json.loads(i)  # string to json/dict
    # if j['money'] is not in bar:
    bar.append(j)
    # else:
    #     find index of duplicate and add j['people']
The above is an incomplete solution; you still have to implement the 'duplicate check and add'.
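For completeness, one sketch of that 'duplicate check and add' step, scanning bar for an entry with the same money value:

import json

foo = ['{"money":1,"people":23}',
       '{"money":3,"people":21}',
       '{"money":12,"people":82}',
       '{"money":1,"people":54}']

bar = []
for i in foo:
    j = json.loads(i)
    existing = next((entry for entry in bar if entry['money'] == j['money']), None)
    if existing is None:
        bar.append(j)
    else:
        existing['people'] += j['people']  # merge the duplicate

print(bar)  # [{'money': 1, 'people': 77}, {'money': 3, 'people': 21}, {'money': 12, 'people': 82}]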
I'm getting a "maximum recursion depth exceeded" error with the code below; the error occurs at the last line, the pickle dump. Please excuse the subject matter, I'm just practicing my Python skills. =)
from urllib.request import urlopen
from bs4 import BeautifulSoup
from pprint import pprint
from pickle import dump

moves = dict()
moves0 = set()
url = 'http://www.marriland.com/pokedex/1-bulbasaur'
print(url)
# Open url
with urlopen(url) as usock:
    # Get url data source
    data = usock.read().decode("latin-1")
# Soupify
soup = BeautifulSoup(data)
# Find move tables
for div_class1 in soup.find_all('div', {'class': 'listing-container listing-container-table'}):
    div_class2 = div_class1.find_all('div', {'class': 'listing-header'})
    if len(div_class2) > 1:
        header = div_class2[0].find_all(text=True)[1]
        # Take only moves from Level Up, TM / HM, and Tutor
        if header in ['Level Up', 'TM / HM', 'Tutor']:
            # Get rows
            for row in div_class1.find_all('tbody')[0].find_all('tr'):
                # Get cells
                cells = row.find_all('td')
                # Get move name
                move = cells[1].find_all(text=True)[0]
                # If move is new
                if move not in moves:
                    # Get type
                    typ = cells[2].find_all(text=True)[0]
                    # Get category
                    cat = cells[3].find_all(text=True)[0]
                    # Get power if not Status or Support
                    power = '--'
                    if cat != 'Status or Support':
                        try:
                            # not STAB
                            power = int(cells[4].find_all(text=True)[1].strip(' \t\r\n'))
                        except ValueError:
                            try:
                                # STAB
                                power = int(cells[4].find_all(text=True)[-2])
                            except ValueError:
                                # Moves like Return, Frustration, etc.
                                power = cells[4].find_all(text=True)[-2]
                    # Get accuracy
                    acc = cells[5].find_all(text=True)[0]
                    # Get pp
                    pp = cells[6].find_all(text=True)[0]
                    # Add move to dict
                    moves[move] = {'type': typ,
                                   'cat': cat,
                                   'power': power,
                                   'acc': acc,
                                   'pp': pp}
                # Add move to pokemon's move set
                moves0.add(move)

pprint(moves)
dump(moves, open('pkmn_moves.dump', 'wb'))
I have reduced the code as much as possible while still producing the error. The fault may be simple, but I just can't find it. In the meantime, I made a workaround by setting the recursion limit to 10000.
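That workaround presumably looks something like this (the interpreter's default limit is 1000):

import sys

sys.setrecursionlimit(10000)  # raises the recursion ceiling; treats the symptom, not the cause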
Just want to contribute an answer for anyone else who may have this issue. Specifically, I was having it with caching BeautifulSoup objects in a Django session from a remote API.
The short answer is that pickling BeautifulSoup nodes is not supported. I instead opted to store the original string data in my object and have an accessor method that parses it on the fly, so that only the original string data is pickled.
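A sketch of that pattern with hypothetical names (CachedPage is mine, not from the answer): pickle only the raw HTML string and rebuild the soup lazily:

import pickle
from bs4 import BeautifulSoup

class CachedPage:
    """Stores only the raw HTML string so the object stays picklable."""

    def __init__(self, html):
        self.html = html   # plain string: safe to pickle
        self._soup = None  # parsed tree: rebuilt on demand, never pickled

    @property
    def soup(self):
        if self._soup is None:
            self._soup = BeautifulSoup(self.html, "html.parser")
        return self._soup

    def __getstate__(self):
        # Drop the soup before pickling; keep just the string data.
        return {"html": self.html}

    def __setstate__(self, state):
        self.html = state["html"]
        self._soup = None

page = CachedPage("<html><body><p>hi</p></body></html>")
data = pickle.dumps(page)            # works: only the string is serialized
restored = pickle.loads(data)
print(restored.soup.find("p").text)  # => hi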