Not Parsing all data - only header of table - python

I'm just not fully understanding the datetime import yet: when I iterate over the dates to get data, I'm not able to get the full table data.
from datetime import datetime, date, timedelta
import requests
import re
from bs4 import BeautifulSoup

base_url = "http://www.harness.org.au/racing/results/?firstDate="
webpage_response = requests.get('http://www.harness.org.au/racing/results/?firstDate=')
soup = BeautifulSoup(webpage_response.content, "html.parser")

format = "%d-%m-%y"
delta = timedelta(days=1)
yesterday = datetime.today() - timedelta(days=1)
yesterday1 = yesterday.strftime(format)
enddate = datetime(2018, 1, 1)
enddate1 = enddate.strftime(format)

while enddate <= yesterday:
    enddate += timedelta(days=1)
    enddate.strftime(format)
    new_url = base_url + str(enddate)
    soup12 = requests.get(new_url)
    soup1 = BeautifulSoup(soup12.content, "html.parser")
    table1 = soup1.find('table', class_='meetingListFull')
    for table2 in table1.find('td'):
        name = table2.find('a')
I want to iterate over all names from the date list to eventually get every href and scrape data from all past results. Below is what I actually want to get from the table1 data, but it was not showing up:
Globe Derby Park
So the purpose is to build the URLs for the past 2 years, iterate over the tables on each page, and then get data from each href.

You can try the following code for your loop:
for tr in table1.find_all('tr'):
    all_cells = tr.find_all('td')
    if all_cells:
        name_cell = all_cells[0]
        try:
            text = name_cell.a.text.strip()
        except:
            continue
        else:
            print(text)
find_all returns an iterable list and since you only look for a name, just use the first cell.
Hope that helps.
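As a follow-up, here is a minimal sketch of how the date loop from the question and the row parsing above could fit together. The date format string and the meetingListFull class come from the question itself; whether the firstDate parameter actually accepts that format is an assumption, so adjust it if the site expects something else:

from datetime import datetime, timedelta
import requests
from bs4 import BeautifulSoup

base_url = "http://www.harness.org.au/racing/results/?firstDate="
fmt = "%d-%m-%y"  # format taken from the question; the site may expect a different one

day = datetime(2018, 1, 1)
yesterday = datetime.today() - timedelta(days=1)

while day <= yesterday:
    page = requests.get(base_url + day.strftime(fmt))
    soup = BeautifulSoup(page.content, "html.parser")
    table = soup.find('table', class_='meetingListFull')
    if table:  # the table can be missing on days with no meetings
        for tr in table.find_all('tr'):
            cells = tr.find_all('td')
            if cells and cells[0].a:
                # venue name (e.g. "Globe Derby Park") and its link
                print(cells[0].a.text.strip(), cells[0].a.get('href'))
    day += timedelta(days=1)

Each printed href can then be fetched in the same way to scrape the individual result pages.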

Related

How can I plug this section of code into my BeautifulSoup script?

I am new to Python and Beautiful Soup. My project I am working on is a script which scrapes the pages inside of the hyperlinks on this page:
https://bitinfocharts.com/top-100-richest-dogecoin-addresses-2.html
Currently, the script has a filter which will only scrape the pages which have a "Last Out" date which is past a certain date.
I am trying to add an additional filter to the script, which does the following:
Scrape the "Profit from price change:" section on the page inside the hyperlink (Example page: https://bitinfocharts.com/dogecoin/address/D8WhgsmFUkf4imvsrwYjdhXL45LPz3bS1S)
Convert the profit into a float
Compare the profit to a variable called "goal" which has a float assigned to it.
If the profit is greater or equal to goal, then scrape the contents of the page. If the profit is NOT greater or equal to the goal, do not scrape the webpage, and continue the script.
Here is the snippet of code I am using to try and do this:
#Get the profit
sections = soup.find_all(class_='table-striped')
for section in sections:
    oldprofit = section.find_all('td')[11].text
    removetext = oldprofit.replace('USD', '')
    removetext = removetext.replace(' ', '')
    removetext = removetext.replace(',', '')
    profit = float(removetext)
# Compare profit to goal
goal = float(50000)
if profit >= goal
Basically, what I am trying to do is run an if statement on a value on the webpage, and if the statement is true, then scrape the webpage. If the if statement is false, then do not scrape the page and continue the code.
Here is the entire script that I am trying to plug this into:
import csv
import requests
from bs4 import BeautifulSoup as bs
from datetime import datetime

headers = []
datarows = []

# define 1-1-2020 as a datetime object
after_date = datetime(2020, 1, 1)

with requests.Session() as s:
    s.headers = {"User-Agent": "Safari/537.36"}
    r = s.get('https://bitinfocharts.com/top-100-richest-dogecoin-addresses-2.html')
    soup = bs(r.content, 'lxml')
    # select all tr elements (minus the first one, which is the header)
    table_elements = soup.select('tr')[1:]
    address_links = []
    for element in table_elements:
        children = element.contents  # get children of table element
        url = children[1].a['href']
        last_out_str = children[8].text
        # check to make sure the date field isn't empty
        if last_out_str != "":
            # load date into datetime object for comparison (second part is defining the layout of the date as years-months-days hour:minute:second timezone)
            last_out = datetime.strptime(last_out_str, "%Y-%m-%d %H:%M:%S %Z")
            # if check to see if the date is after 2020/1/1
            if last_out > after_date:
                address_links.append(url)

    for url in address_links:
        r = s.get(url)
        soup = bs(r.content, 'lxml')
        table = soup.find(id="table_maina")

        # Get the Doge Address for the filename
        item = soup.find('h1').text
        newitem = item.replace('Dogecoin', '')
        finalitem = newitem.replace('Address', '')

        # Get the profit
        sections = soup.find_all(class_='table-striped')
        for section in sections:
            oldprofit = section.find_all('td')[11].text
            removetext = oldprofit.replace('USD', '')
            removetext = removetext.replace(' ', '')
            removetext = removetext.replace(',', '')
            profit = float(removetext)

        # Compare profit to goal
        goal = float(50000)
        if profit >= goal

        if table:
            for row in table.find_all('tr'):
                heads = row.find_all('th')
                if heads:
                    headers = [th.text for th in heads]
                else:
                    datarows.append([td.text for td in row.find_all('td')])
            fcsv = csv.writer(open(f'{finalitem}.csv', 'w', newline=''))
            fcsv.writerow(headers)
            fcsv.writerows(datarows)
I am familiar with if statements, however I am unsure how to plug this into the existing code and have it accomplish what I am trying to do. If anyone has any advice I would greatly appreciate it. Thank you.
From my understanding, it seems like all you are asking is how to have the script continue when a page fails that criterion, in which case you just need to do:
if profit < goal:
    continue
Note, though, that the for loop in your snippet only ends up using the final value of profit; if there are other profit values you need to look at, those values are not being evaluated.
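To make the integration concrete, here is a minimal sketch of how the profit check could be wired into the per-address loop. The selector and the td index are taken from the question's snippet, address_links, s and bs come from the question's script, and only the first table-striped section is used for simplicity:

goal = 50000.0

for url in address_links:
    r = s.get(url)
    soup = bs(r.content, 'lxml')

    # profit cell: selector and cell index taken from the question's snippet
    section = soup.find(class_='table-striped')
    if section is None:
        continue
    oldprofit = section.find_all('td')[11].text
    profit = float(oldprofit.replace('USD', '').replace(' ', '').replace(',', ''))

    # skip addresses below the goal instead of scraping them
    if profit < goal:
        continue

    # ...scrape the table_maina rows and write the CSV exactly as in the original script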

How to resolve AttributeError: 'NoneType' object has no attribute 'find'

I found this wicked piece of code that I'm going to try and use for a project to scrape song data off Spotify - only thing is, the piece of code is broken and could use a little bit of love. Any ideas as to why I receive the error "AttributeError: 'NoneType' object has no attribute 'find'" for line 36? Here is my code:
from bs4 import BeautifulSoup
import pandas as pd
import requests
from time import sleep
from datetime import date, timedelta

# create empty arrays for data we're collecting
dates = []
url_list = []
final = []

# map site
url = "https://spotifycharts.com/regional/au/weekly"
start_date = date(2016, 12, 29)
end_date = date(2020, 12, 24)
delta = end_date - start_date

for i in range(delta.days + 1):
    day = start_date + timedelta(days=i)
    day_string = day.strftime("%Y-%m-%d")
    dates.append(day_string)

def add_url():
    for date in dates:
        c_string = url + date
        url_list.append(c_string)

add_url()

# function for going through each row in each url and finding relevant song info
def song_scrape(x):
    pg = x
    for tr in songs.find("tbody").findAll("tr"):
        artist = tr.find("td", {"class": "chart-table-track"}).find("span").text
        artist = artist.replace("by ", "").strip()
        title = tr.find("td", {"class": "chart-table-track"}).find("strong").text
        songid = tr.find("td", {"class": "chart-table-image"}).find("a").get("href")
        songid = songid.split("track/")[1]
        url_date = x.split("daily/")[1]
        final.append([title, artist, songid, url_date])

# loop through urls to create array of all of our song info
for u in url_list:
    read_pg = requests.get(u)
    sleep(2)
    soup = BeautifulSoup(read_pg.text, "html.parser")
    songs = soup.find("table", {"class": "chart-table"})
    song_scrape(u)

# convert to data frame with pandas for easier data manipulation
final_df = pd.DataFrame(final, columns=["Title", "Artist", "Song ID", "Chart Date"])

# write to csv
with open('spmooddata.csv', 'w') as f:
    final_df.to_csv(f, header=True, index=False)
The function song_scrape uses the external songs variable, which is None in some cases, e.g. an invalid url, or any page that doesn't contain the specific table being searched for (a table with class chart-table).
So for example, you've got a url (https://spotifycharts.com/regional/au/weekly2016-12-29) with a page-not-found result, and so the line that searches for a table returns None. Which indeed has no find attribute...
In fact, all of your urls are invalid! So what you really need to do is to fix your url list.
Let's go deeper:
It seems that Spotify has changed the url format, to now use a "from-date--to-date" structure. If you fix up the url-building part of your code, that's a first step to setting things back on track. So for starters -
start_date = date(2021, 1, 8)
end_date = date(2021, 6, 30)
num_of_dates = (end_date - start_date).days // 7

for i in range(num_of_dates):
    start_day = start_date + timedelta(days=i * 7)
    start_day_string = start_day.strftime("%Y-%m-%d")
    end_day = start_date + timedelta(days=(i + 1) * 7)
    end_day_string = end_day.strftime("%Y-%m-%d")
    dates.append('--'.join([start_day_string, end_day_string]))
... gives valid urls!
But at this point, it seems that there are some security measures in place, because instead of loading the actual page, I'm being referred to a captcha page. But that's a whole different question :-)
Note: Before scraping web pages, it's worth looking at the robots.txt file.
In this case, the file https://spotifycharts.com/robots.txt shows -
User-agent: *
Disallow:
So one shouldn't have a problem crawling Spotify. For overcoming captchas, try looking at the python-anticaptcha package.
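Separately, it is worth guarding against the original AttributeError even once the urls are valid. A minimal sketch, assuming the same chart-table markup as the question and the corrected url_list from above, that simply skips pages where the table is missing:

import requests
from bs4 import BeautifulSoup

# url_list is assumed to be built with the corrected "from-date--to-date" ranges shown above
final = []

for u in url_list:
    soup = BeautifulSoup(requests.get(u).text, "html.parser")
    songs = soup.find("table", {"class": "chart-table"})
    if songs is None:
        # page didn't return the chart (captcha, page not found, ...) - skip instead of crashing
        continue
    for tr in songs.find("tbody").findAll("tr"):
        track_cell = tr.find("td", {"class": "chart-table-track"})
        title = track_cell.find("strong").text
        artist = track_cell.find("span").text.replace("by ", "").strip()
        final.append([title, artist, u])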

Always making a new DataFrame

I followed a tutorial but I didn't like the result, so I am trying to optimise it, but I can't seem to find a way around always making a new dataframe. And I know it is a result of the while loop.
So what I want is for the price to be appended to the dataframe I made.
Thanks in advance!
import pandas as pd
import bs4
import requests
from bs4 import BeautifulSoup
import datetime

# getting actual price
def Real_time_Price(stock):
    url = ('https://finance.yahoo.com/quote/' + stock + '?p=' + stock)
    r = requests.get(url)
    web_content = BeautifulSoup(r.text, 'lxml')
    web_content = web_content.find('div', {'class': "My(6px) Pos(r) smartphone_Mt(6px)"})
    web_content = web_content.find('span').text
    return web_content
and here is where my problem starts
while True:
    price = []
    col = []
    time_stamp = datetime.datetime.now()
    # drop the milliseconds
    time_stamp = time_stamp.strftime("%Y-%m-%d %H:%M:%S")
    # which stocks to check
    ticker_symbols = ['TSLA', 'AAPL', 'MSFT']
    for stock in ticker_symbols:
        price.append(Real_time_Price(stock))
    # getting it into a pandas dataframe
    # You want [data] for pandas to understand they're rows.
    df = pd.DataFrame(data=[price], index=[time_stamp], columns=ticker_symbols)
    print(df)
Create the dataframe once, before the loop, and use DataFrame.loc[] to append rows:
ticker_symbols = ['TSLA', 'AAPL', 'MSFT']  # which stocks to check
df = pd.DataFrame(columns=ticker_symbols)  # create the dataframe once, outside the loop

while True:
    price = []
    time_stamp = datetime.datetime.now()
    # drop the milliseconds
    time_stamp = time_stamp.strftime("%Y-%m-%d %H:%M:%S")
    for stock in ticker_symbols:
        price.append(Real_time_Price(stock))
    # assigning via .loc adds a new row labelled with the timestamp
    df.loc[time_stamp] = price
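For reference, df.loc[label] = values is pandas' "setting with enlargement": if the label isn't in the index yet, a new row is appended. A tiny standalone example (the prices are made up, purely to show the shape):

import pandas as pd

df = pd.DataFrame(columns=['TSLA', 'AAPL', 'MSFT'])
df.loc['2021-01-01 10:00:00'] = ['700.00', '130.00', '220.00']  # hypothetical values
df.loc['2021-01-01 10:00:05'] = ['700.10', '130.05', '219.95']
print(df)  # two rows, indexed by the timestamp strings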

Beautiful Soup: how to select <a href> and <td> elements with whitespaces

I'm trying to use BeautifulSoup to select the date, url, description, and additional url from the table, and am having trouble accessing them given the weird white spaces:
So far I've written:
import urllib
import urllib.request
from bs4 import BeautifulSoup

def make_soup(url):
    thepage = urllib.request.urlopen(url)
    soupdata = BeautifulSoup(thepage, "html.parser")
    return soupdata

soup = make_soup('https://www.sec.gov/litigation/litreleases/litrelarchive/litarchive2010.shtml')
test1 = soup.findAll("td", {"nowrap": "nowrap"})
test2 = [item.text.strip() for item in test1]
With bs4 4.7.1 you can use :has and nth-of-type in combination with next_sibling to get those columns
from bs4 import BeautifulSoup
import requests, re

def make_soup(url):
    the_page = requests.get(url)
    soup_data = BeautifulSoup(the_page.content, "html.parser")
    return soup_data

soup = make_soup('https://www.sec.gov/litigation/litreleases/litrelarchive/litarchive2010.shtml')
releases = []
links = []
dates = []
descs = []
addit_urls = []

for i in soup.select('td:nth-of-type(1):has([href^="/litigation/litreleases/"])'):
    sib_sib = i.next_sibling.next_sibling.next_sibling.next_sibling
    releases += [i.a.text]
    links += [i.a['href']]
    dates += [i.next_sibling.next_sibling.text.strip()]
    descs += [re.sub('\t+|\s+', ' ', sib_sib.text.strip())]
    addit_urls += ['N/A' if sib_sib.a is None else sib_sib.a['href']]

result = list(zip(releases, links, dates, descs, addit_urls))
print(result)
Unfortunately there is no class or id HTML attribute to quickly identify the table to scrape; after experimentation I found it was the table at index 4.
Next we ignore the header by separating it from the data, which still has table rows that are just separations for quarters. We can skip over these using a try-except block since those only contain one table data tag.
I noticed that the description is separated by tabs, so I split the text on \t.
For the urls, I used .get('href') rather than ['href'] since not every anchor tag has an href attribute from my experience scraping. This avoids errors should that case occur. Finally the second anchor tag does not always appear, so this is wrapped in a try-except block as well.
data = []
table = soup.find_all('table')[4]  # target the specific table
header, *rows = table.find_all('tr')

for row in rows:
    try:
        litigation, date, complaint = row.find_all('td')
    except ValueError:
        continue  # ignore quarter rows
    id = litigation.text.strip().split('-')[-1]
    date = date.text.strip()
    desc = complaint.text.strip().split('\t')[0]
    lit_url = litigation.find('a').get('href')
    try:
        comp_url = complaint.find('a').get('href')
    except AttributeError:
        comp_url = None  # complaint url is optional
    info = dict(id=id, date=date, desc=desc, lit_url=lit_url, comp_url=comp_url)
    data.append(info)
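As a small follow-up, if a tabular result is more convenient than a list of dicts, the collected records can be loaded straight into pandas. This assumes pandas as an extra dependency, and the output filename is chosen purely for illustration:

import pandas as pd

# 'data' is the list of dicts built in the loop above
df = pd.DataFrame(data, columns=['id', 'date', 'desc', 'lit_url', 'comp_url'])
df.to_csv('sec_litigation_2010.csv', index=False)  # illustrative filename
print(df.head())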

Saving multiple data frames from loop

I have been searching for a solution to my problem, but all the answers I find use print() at the end, and do NOT save the data frames as I would like to.
Below I have an (almost) functioning piece of code that prints 3 separate tables. How do I save these three tables in 3 separate data frames with the names matches_october, matches_november and matches_december?
The last line in my code is not working as I want it to. I hope it is clear what I would like the code to do (saving a data frame at the end of each of the 3 rounds in the loop).
import pandas as pd
import requests
from bs4 import BeautifulSoup

base_url = 'https://www.basketball-reference.com/leagues/NBA_2019_games-'
valid_pages = ['october', 'november', 'december']
end = '.html'

for i in valid_pages:
    url = '{}{}{}'.format(base_url, i, end)
    res = requests.get(url)
    soup = BeautifulSoup(res.content, 'lxml')
    table = soup.find_all('table')[0]
    df = pd.read_html(str(table))
    print(df)
    matches + valid_pages = df[0]
You can handle each case explicitly, but that's not very robust (and it's rather ugly):
if i == 'october':
    matches_october = pd.read_html(str(table))
if i == 'november':
    # so on and so forth
A more elegant solution is to use a dictionary. Before the loop, declare matches = {}. Then, in each iteration:
matches[i] = pd.read_html(str(table))
Then you can access the October matches DataFrame via matches['october'].
You can't compose variable names using +, try using a dict instead:
import pandas as pd
import requests
from bs4 import BeautifulSoup

matches = {}  # create an empty dict
base_url = 'https://www.basketball-reference.com/leagues/NBA_2019_games-'
valid_pages = ['october', 'november', 'december']
end = '.html'

for i in valid_pages:
    url = '{}{}{}'.format(base_url, i, end)
    res = requests.get(url)
    soup = BeautifulSoup(res.content, 'lxml')
    table = soup.find_all('table')[0]
    df = pd.read_html(str(table))
    print(df)
    matches[i] = df[0]  # store it in the dict
Thanks guys. That worked! :)
import pandas as pd
import requests
from bs4 import BeautifulSoup

matches = {}  # create an empty dict
base_url = 'https://www.basketball-reference.com/leagues/NBA_2019_games-'
valid_pages = ['october', 'november', 'december']
end = '.html'

for i in valid_pages:
    url = '{}{}{}'.format(base_url, i, end)
    res = requests.get(url)
    soup = BeautifulSoup(res.content, 'lxml')
    table = soup.find_all('table')[0]
    df = pd.read_html(str(table))
    matches[i] = df[0]  # store it in the dict

matches_october = matches['october']
