I want to extract information from a website (TeamRankings) and put it into a new CSV file. The site presents the data I want in columns, and I want to keep that same format in my CSV file. Whenever I run the code there are no errors; it just keeps running. I thought it was stuck in an infinite loop or something, so I added print statements and a 5-second timeout, but nothing happens. Any help would be appreciated. I am on a Mac using PyCharm, not Windows.
import requests
import pandas as pd
from bs4 import BeautifulSoup
import os

try:
    # Send a GET request to the website
    url1 = "https://www.teamrankings.com/nba/stat/points-per-game"
    response1 = requests.get(url1, timeout=5)

    # Parse the HTML content of the response
    soup = BeautifulSoup(response1.content, "html.parser")

    # Find the table containing the team points per game data
    table = soup.find("table", {"class": "tr-table datatable scrollable"})

    # Create an empty list to store the team data
    data = []

    # Extract the team data from each row of the table
    for row in table.find_all("tr")[1:]:
        print("test 1")
        cells = row.find_all("td")
        team = cells[0].get_text()
        ppg_2022 = cells[1].get_text()
        ppg_last_3 = cells[2].get_text()
        ppg_last_game = cells[3].get_text()
        ppg_home = cells[4].get_text()
        ppg_away = cells[5].get_text()
        ppg_2021 = cells[6].get_text()
        data.append([team, ppg_2022, ppg_last_3, ppg_last_game, ppg_home, ppg_away, ppg_2021])

    # Create a Pandas DataFrame from the team data
    df = pd.DataFrame(data, columns=["Team", "2022 Points Per Game", "Last 3 Games", "Last Game", "Home", "Away",
                                     "2021 Points Per Game"])

    # Save the DataFrame to a CSV file on the Desktop
    file_path = os.path.expanduser("~/Desktop/sports_betting/nba/nba_team_ppg.csv")
    df.to_csv(file_path, index=False)
    print("File successfully saved")
except Exception as e:
    print("Error occurred: ", e)
Update: Sorry to anyone who only saw half the code; the page was glitching while I was trying to update it.
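One way to narrow down where this stalls is to check, before the loop, that the request actually returns and that the table selector matches anything. A minimal sketch (the browser-style User-Agent header is an assumption; TeamRankings may serve a different page to the default requests client):

import requests
from bs4 import BeautifulSoup

url1 = "https://www.teamrankings.com/nba/stat/points-per-game"
# browser-like User-Agent is an assumption, in case the site responds
# differently to the default requests client
response1 = requests.get(url1, timeout=5, headers={"User-Agent": "Mozilla/5.0"})
print("status:", response1.status_code)    # did the request finish at all?

soup = BeautifulSoup(response1.content, "html.parser")
table = soup.find("table", {"class": "tr-table datatable scrollable"})
print("table found:", table is not None)   # did the class selector match?
if table is not None:
    print("rows:", len(table.find_all("tr")))

If the status never prints, the request itself is what hangs; if it prints but the table is not found, the selector is the thing to look at.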
I have a list of basketball players that I want to pass through a web-scraping for loop I've already set up. The list of players is the 2011 NBA Draft class. I want to loop through each player and get their college stats from their final year in college. The problem is that some drafted players did not go to college and therefore have no URL formatted with their name, so every time I pass in even one player that did not play in college, the whole script errors out. I have tried including "pass" and "continue", but nothing seems to work. This is the closest I've gotten so far:
from bs4 import BeautifulSoup
import requests
import pandas as pd

headers = {'User Agent': 'Mozilla/5.0'}

players = [
    'kyrie-irving', 'derrick-williams', 'enes-kanter',
    'tristan-thompson', 'jonas-valanciunas', 'jan-vesely',
    'bismack-biyombo', 'brandon-knight', 'kemba-walker',
    'jimmer-fredette', 'klay-thompson'
]
# the full list of players goes on for a total of 60 players, this is just the first handful

player_stats = []

for player in players:
    url = (f'https://www.sports-reference.com/cbb/players/{player}-1.html')
    res = requests.get(url)
    # if player in url:
    #     continue
    # else:
    #     print("This player has no college stats")
    # Including this if/else statement makes the error say header is not defined.
    # When it's not included, the error says NoneType object is not iterable.
    soup = BeautifulSoup(res.content, 'lxml')
    header = [th.getText() for th in soup.findAll('tr', limit=2)[0].findAll('th')]
    rows = soup.findAll('tr')
    player_stats.append([td.getText() for td in soup.find('tr', id='players_per_game.2011')])
    player_stats

graph = pd.DataFrame(player_stats, columns=header)
You can do one of two things:
1. Check the response status code. 200 is a successful response; anything else is an error. The problem is that some sites return a valid HTML page that just says "invalid page", so you can still get a 200 response for a page with nothing useful on it.
2. Use try/except. If parsing fails, continue to the next item in the list.
Because of that issue with option 1, go with option 2 here. Also, have you considered using pandas to parse the table? It's a little easier to do (and it uses BeautifulSoup under the hood).
Lastly, you're going to need a little more logic with this. There are multiple college players named "Derrick Williams", and I suspect https://www.sports-reference.com/cbb/players/derrick-williams-1.html is not the one you mean, so you will need to work out how to pick the right page.
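For reference, option 1 would look roughly like the sketch below; it assumes sports-reference answers with a non-200 status for players who have no college page, which is exactly the caveat mentioned above:

import requests

url = 'https://www.sports-reference.com/cbb/players/kyrie-irving-1.html'
res = requests.get(url)
if res.status_code != 200:
    # missing page (or a block) -- skip this player
    print('no college page found')
else:
    # safe to hand res.content to BeautifulSoup here
    pass

With that caveat in mind, here is your loop using option 2 (try/except):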
from bs4 import BeautifulSoup
import requests
import pandas as pd

headers = {'User Agent': 'Mozilla/5.0'}

players = [
    'kyrie-irving', 'derrick-williams', 'enes-kanter',
    'tristan-thompson', 'jonas-valanciunas', 'jan-vesely',
    'bismack-biyombo', 'brandon-knight', 'kemba-walker',
    'jimmer-fredette', 'klay-thompson'
]
# the full list of players goes on for a total of 60 players, this is just the first handful

player_stats = []

for player in players:
    url = (f'https://www.sports-reference.com/cbb/players/{player}-1.html')
    res = requests.get(url)
    try:
        soup = BeautifulSoup(res.content, 'lxml')
        header = [th.getText() for th in soup.findAll('tr', limit=2)[0].findAll('th')]
        rows = soup.findAll('tr')
        player_stats.append([td.getText() for td in soup.find('tr', id='players_per_game.2011')])
        player_stats
    except:
        print("%s has no college stats" % player)

graph = pd.DataFrame(player_stats, columns=header)
With Pandas:
graph = pd.DataFrame()

for player in players:
    try:
        url = (f'https://www.sports-reference.com/cbb/players/{player}-1.html')
        df = pd.read_html(url)[0]
        cols = list(df.columns)
        df = df.iloc[-2][cols]
        df['Player'] = player
        # note: DataFrame.append was removed in pandas 2.0; use pd.concat on newer versions
        graph = graph.append(df).reset_index(drop=True)
        graph = graph[['Player'] + cols]
    except:
        print("%s has no college stats" % player)
I have some code that I am working on, and I need help scheduling the program to run weekly. I also want to export my output to a CSV file and am not sure how to work that into the code I already have. I am getting stock information from https://www.eia.gov/petroleum/. Here is my code:
# Importing needed libraries
import requests
from bs4 import BeautifulSoup

URL = "https://www.eia.gov/petroleum/"  # Specify which URL/web page we are going to be scraping
res = requests.get(URL).text  # Open the URL using requests
soup = BeautifulSoup(res, 'lxml')

for items in soup.find('table', class_='basic_table').find_all('tr')[1::1]:
    data = items.find_all(['td'])  # Use 'find_all' function to bring back all instances
    try:  # Looks up information in each specified data row
        stocks = data[0].text
        third_week = data[1].text
        second_week = data[2].text
        first_week = data[3].text
    except IndexError:
        pass
    # Formatting my intended output
    print("{}| {}: {} | {}: {} | {}: {}".format(stocks, "Price in million barrels 3 weeks ago", third_week,
                                                "Price in million barrels 2 weeks ago", second_week,
                                                "Price in million barrels 1 week ago", first_week))
I've scraped the website for my research, but I couldn't find the right way to extract the results into a data frame. I believe the problem is related to the list objects marked in the code below.
The print line works nicely; I can see the final version of the data frame in the Python console.
The solution is probably really easy, but I couldn't figure it out. Thanks in advance for any help.
from time import sleep
from bs4 import BeautifulSoup, SoupStrainer
import requests
import pandas as pd

# Insert the highest page number for the website
highest_number = 12


def total_page_number(url):
    all_webpage_links = []
    all_webpage_links.insert(0, url)
    pages = [str(each_number) for each_number in range(2, highest_number)]
    for page in pages:
        link = ''.join(url + '&page=' + page)
        all_webpage_links.append(link)
    return all_webpage_links


# Use total_page_number function to create the page list for the website
All_page = total_page_number(
    'https://www.imdb.com/search/title?countries=tr&languages=tr&locations=Turkey&count=250&view=simple')


def clean_text(text):
    """ Removes white-spaces before, after, and between characters

    :param text: the string to clean
    :return: a "cleaned" string with no more than one white space between
     characters
    """
    return ' '.join(text.split())


# Create list objects for data
# Problem occurs in this line !!!!!!
actor_names = []
titles = []
dates = []


def get_cast_from_link(movie_link):
    """ Go to the IMDb Movie page in link, and find the cast overview list.
    Prints tab-separated movie_title, actor_name, and character_played to
    stdout as a result. Nothing returned

    :param movie_link: string of the link to IMDb movie page (http://imdb.com
    ...)
    :return: void
    """
    movie_page = requests.get(movie_link)

    # Use SoupStrainer to strain the cast_list table from the movie_page
    # This can save some time in bigger scraping projects
    cast_strainer = SoupStrainer('table', class_='cast_list')
    movie_soup = BeautifulSoup(movie_page.content, 'html.parser', parse_only=cast_strainer)

    # Iterate through rows and extract the name and character
    # Remember that some rows might not be a row of interest (e.g., a blank
    # row for spacing the layout). Therefore, we need to use a try-except
    # block to make sure we capture only the rows we want, without python
    # complaining.
    for row in movie_soup.find_all('tr'):
        try:
            actor = clean_text(row.find(itemprop='name').text)
            actor_names.append(actor)
            titles.append(movie_title)
            dates.append(movie_date)
            print('\t'.join([movie_title, actor, movie_date]))
        except AttributeError:
            pass


# Export data frame
# Problem occurs in this line !!!!!!
tsd_df = pd.DataFrame({'Actor_Names': actor_names,
                       'Movie_Title': titles,
                       'Movie_Date': dates})
tsd_df.to_csv('/Users/ea/Desktop/movie_df.tsv', encoding='utf-8')

for each in All_page:
    # Use requests.get('url') to load the page you want
    web_page = requests.get(each)
    # https://www.imdb.com/search/title?countries=tr&languages=tr&count=250&view=simple&page=2

    # Prepare the SoupStrainer to strain just the tbody containing the list of movies
    list_strainer = SoupStrainer('div', class_='lister-list')

    # Parse the html content of the web page with BeautifulSoup
    soup = BeautifulSoup(web_page.content, 'html.parser', parse_only=list_strainer)

    # Generate a list of the "Rank & Title" column of each row and iterate
    movie_list = soup.find_all('span', class_='lister-item-header')
    for movie in movie_list:
        movie_title = movie.a.text
        movie_date = movie.find('span', class_='lister-item-year text-muted unbold').text
        # get the link to the movie's own IMDb page, and jump over
        link = 'http://imdb.com' + movie.a.get('href')
        get_cast_from_link(link)
        # remember to be nice, and sleep a while between requests!
        sleep(15)
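A likely reason the export comes out empty is that the DataFrame is created and written before the for each in All_page: loop has run, so actor_names, titles, and dates are still empty lists at that point (the print works because it happens inside get_cast_from_link, during the loop). A minimal sketch of the fix, keeping everything else unchanged, is to move those few lines to the very end of the script:

# ...after the `for each in All_page:` loop has finished...
tsd_df = pd.DataFrame({'Actor_Names': actor_names,
                       'Movie_Title': titles,
                       'Movie_Date': dates})
# sep='\t' is optional; the original call wrote comma-separated values into a .tsv file
tsd_df.to_csv('/Users/ea/Desktop/movie_df.tsv', sep='\t', encoding='utf-8')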
I'm scraping the names of massage therapists, along with their addresses, from a directory. The addresses are all being saved into one CSV column as a single string, but the title/name of each therapist is being split one word per column across 2 or 3 columns.
What do I need to do to get the extracted name string saved in one column, the way the addresses are? (The first two lines below are example HTML from the page; the block after that is the extract from my script that targets this element.)
<span class="name">
<img src="/images/famt-placeholder-sm.jpg" class="thumb" alt="Tiffani D Abraham"> Tiffani D Abraham</span>
import mechanize
from lxml import html
import csv
import io
from time import sleep


def save_products(products, writer):
    for product in products:
        for price in product['prices']:
            writer.writerow([product["title"].encode('utf-8')])
            writer.writerow([price["contact"].encode('utf-8')])
            writer.writerow([price["services"].encode('utf-8')])


f_out = open('mtResult.csv', 'wb')
writer = csv.writer(f_out)

links = ["https://www.amtamassage.org/findamassage/results.html?match=exact&l=NY",
         "https://www.amtamassage.org/findamassage/results.html?match=exact&l=NY&PageIndex=2&PageSize=10",
         "https://www.amtamassage.org/findamassage/results.html?match=exact&l=NY&PageIndex=3&PageSize=10",
         "https://www.amtamassage.org/findamassage/results.html?match=exact&l=NY&PageIndex=4&PageSize=10",
         "https://www.amtamassage.org/findamassage/results.html?match=exact&l=NY&PageIndex=5&PageSize=10",
         "https://www.amtamassage.org/findamassage/results.html?match=exact&l=NY&PageIndex=6&PageSize=10",
         "https://www.amtamassage.org/findamassage/results.html?match=exact&l=NY&PageIndex=7&PageSize=10",
         "https://www.amtamassage.org/findamassage/results.html?match=exact&l=NY&PageIndex=8&PageSize=10",
         "https://www.amtamassage.org/findamassage/results.html?match=exact&l=NY&PageIndex=9&PageSize=10",
         "https://www.amtamassage.org/findamassage/results.html?match=exact&l=NY&PageIndex=10&PageSize=10"]

br = mechanize.Browser()

for link in links:
    print(link)
    r = br.open(link)
    content = r.read()
    products = []
    tree = html.fromstring(content)
    product_nodes = tree.xpath('//ul[@class="famt-results"]/li')
    for product_node in product_nodes:
        product = {}
        price_nodes = product_node.xpath('.//a')
        product['prices'] = []
        for price_node in price_nodes:
            price = {}
            try:
                product['title'] = product_node.xpath('.//span[1]/text()')[0]
            except:
                product['title'] = ""
            try:
                price['services'] = price_node.xpath('./span[2]/text()')[0]
            except:
                price['services'] = ""
            try:
                price['contact'] = price_node.xpath('./span[3]/text()')[0]
            except:
                price['contact'] = ""
            product['prices'].append(price)
        products.append(product)
    save_products(products, writer)

f_out.close()
I'm not positive this solves the issue you were having, but either way there are a few improvements and modifications you might be interested in.
For example, since each link varies only by a page index, you can build the links in a loop rather than copying all 50 of them into a list. Each therapist on a page also has their own index, so you can loop through the XPaths for each therapist's information as well.
# import modules
import mechanize
from lxml import html
import csv
import io

# open browser
br = mechanize.Browser()

# create file headers
titles = ["NAME"]
services = ["TECHNIQUE(S)"]
contacts = ["CONTACT INFO"]

# loop through all 50 webpages for therapist data
for link_index in range(1, 51):  # range end is exclusive, so 51 covers pages 1-50
    link = "https://www.amtamassage.org/findamassage/results.html?match=exact&l=NY&PageIndex=" + str(link_index) + "&PageSize=10"
    r = br.open(link)
    page = r.read()
    tree = html.fromstring(page)

    # loop through therapist data for each therapist per page
    for therapist_index in range(1, 11):  # 10 therapists per page
        # store names
        title = tree.xpath('//*[@id="content"]/div[2]/ul[1]/li[' + str(therapist_index) + ']/a/span[1]/text()')
        titles.append(" ".join(title))

        # store techniques and convert to unicode
        service = tree.xpath('//*[@id="content"]/div[2]/ul[1]/li[' + str(therapist_index) + ']/a/span[2]/text()')
        try:
            services.append(service[0].encode("utf-8"))
        except:
            services.append(" ")

        # store contact info and convert to unicode
        contact = tree.xpath('//*[@id="content"]/div[2]/ul[1]/li[' + str(therapist_index) + ']/a/span[3]/text()')
        try:
            contacts.append(contact[0].encode("utf-8"))
        except:
            contacts.append(" ")

# open file to write to
f_out = open('mtResult.csv', 'wb')
writer = csv.writer(f_out)

# get rows in correct format
rows = zip(titles, services, contacts)

# write csv line by line
for row in rows:
    writer.writerow(row)

f_out.close()
The script loops through all 50 result pages on the site and seems to scrape the relevant information for each therapist, where it is provided. Finally, it writes the data to a CSV with everything stored under the respective 'NAME', 'TECHNIQUE(S)', and 'CONTACT INFO' columns, if that is what you were originally struggling with.
Hope this helps!
This may end up being a really novice question, because I'm a novice, but here goes.
I have a set of .html pages obtained using wget. I want to iterate through them, extract certain info, and put it in a .csv file.
Using the code below, all the names print when my program runs, but only the info from the next-to-last page (i.e., 29.html here) makes it into the .csv file. I'm trying this with only a handful of files at first; there are about 1,200 that I'd like to get into this format.
The files are based on those here: https://www.cfis.state.nm.us/media/ReportLobbyist.aspx?id=25&el=2014 where the page numbers are the id.
Thanks for any help!
from bs4 import BeautifulSoup
import urllib2
import csv

for i in xrange(22, 30):
    try:
        page = urllib2.urlopen('file:{}.html'.format(i))
    except:
        continue
    else:
        soup = BeautifulSoup(page.read())

        n = soup.find(id='ctl00_ContentPlaceHolder1_lnkBCLobbyist')
        name = n.string
        print name

        table = soup.find('table', 'reportTbl')

        # get the rows
        list_of_rows = []
        for row in table.findAll('tr')[1:]:
            col = row.findAll('td')
            filing = col[0].string
            status = col[1].string
            cont = col[2].string
            exp = col[3].string
            record = (name, filing, status, cont, exp)
            list_of_rows.append(record)

        # write to file
        writer = csv.writer(open('lob.csv', 'wb'))
        writer.writerows(list_of_rows)
You need to append each time, not overwrite: open('lob.csv', 'wb') truncates the file on every pass through your outer loop, so only the last page's rows survive. Open the file in append mode instead:
writer = csv.writer(open('lob.csv', 'ab'))
writer.writerows(list_of_rows)
You could also declare list_of_rows = [] outside the for loops and write to the file once at the very end, as sketched below.
If you also want page 30, you need to loop over range(22, 31).
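A sketch of that second approach, collecting everything and writing a single file after the loop (the parsing is unchanged from your script):

from bs4 import BeautifulSoup
import urllib2
import csv

list_of_rows = []  # rows from every page accumulate here

for i in xrange(22, 31):  # 31 so that page 30 is included
    try:
        page = urllib2.urlopen('file:{}.html'.format(i))
    except:
        continue
    soup = BeautifulSoup(page.read())
    name = soup.find(id='ctl00_ContentPlaceHolder1_lnkBCLobbyist').string
    table = soup.find('table', 'reportTbl')
    for row in table.findAll('tr')[1:]:
        col = row.findAll('td')
        list_of_rows.append((name, col[0].string, col[1].string,
                             col[2].string, col[3].string))

# open and write the csv once, after all pages have been read
writer = csv.writer(open('lob.csv', 'wb'))
writer.writerows(list_of_rows)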