I am new to Python and Beautiful Soup. The project I am working on is a script that scrapes the pages behind the hyperlinks on this page:
https://bitinfocharts.com/top-100-richest-dogecoin-addresses-2.html
Currently, the script has a filter so that it only scrapes pages whose "Last Out" date is after a certain date.
I am trying to add an additional filter to the script, which does the following:
Scrape the "Profit from price change:" section on the page behind each hyperlink (example page: https://bitinfocharts.com/dogecoin/address/D8WhgsmFUkf4imvsrwYjdhXL45LPz3bS1S)
Convert the profit into a float
Compare the profit to a variable called "goal" which has a float assigned to it.
If the profit is greater than or equal to goal, scrape the contents of the page. Otherwise, do not scrape the page, and continue the script.
Here is the snippet of code I am using to try and do this:
#Get the profit
sections = soup.find_all(class_='table-striped')
for section in sections:
    oldprofit = section.find_all('td')[11].text
    removetext = oldprofit.replace('USD', '')
    removetext = removetext.replace(' ', '')
    removetext = removetext.replace(',', '')
    profit = float(removetext)
# Compare profit to goal
goal = float(50000)
if profit >= goal
Basically, what I am trying to do is run an if statement on a value on the webpage, and if the statement is true, then scrape the webpage. If the if statement is false, then do not scrape the page and continue the code.
Here is the entire script that I am trying to plug this into:
import csv
import requests
from bs4 import BeautifulSoup as bs
from datetime import datetime
headers = []
datarows = []
# define 1-1-2020 as a datetime object
after_date = datetime(2020, 1, 1)
with requests.Session() as s:
    s.headers = {"User-Agent": "Safari/537.36"}
    r = s.get('https://bitinfocharts.com/top-100-richest-dogecoin-addresses-2.html')
    soup = bs(r.content, 'lxml')
    # select all tr elements (minus the first one, which is the header)
    table_elements = soup.select('tr')[1:]
    address_links = []
    for element in table_elements:
        children = element.contents # get children of table element
        url = children[1].a['href']
        last_out_str = children[8].text
        # check to make sure the date field isn't empty
        if last_out_str != "":
            # load date into datetime object for comparison (second part is defining the layout of the date as years-months-days hour:minute:second timezone)
            last_out = datetime.strptime(last_out_str, "%Y-%m-%d %H:%M:%S %Z")
            # if check to see if the date is after 2020/1/1
            if last_out > after_date:
                address_links.append(url)
    for url in address_links:
        r = s.get(url)
        soup = bs(r.content, 'lxml')
        table = soup.find(id="table_maina")
        #Get the Doge Address for the filename
        item = soup.find('h1').text
        newitem = item.replace('Dogecoin', '')
        finalitem = newitem.replace('Address', '')
        #Get the profit
        sections = soup.find_all(class_='table-striped')
        for section in sections:
            oldprofit = section.find_all('td')[11].text
            removetext = oldprofit.replace('USD', '')
            removetext = removetext.replace(' ', '')
            removetext = removetext.replace(',', '')
            profit = float(removetext)
        # Compare profit to goal
        goal = float(50000)
        if profit >= goal
        if table:
            for row in table.find_all('tr'):
                heads = row.find_all('th')
                if heads:
                    headers = [th.text for th in heads]
                else:
                    datarows.append([td.text for td in row.find_all('td')])
            fcsv = csv.writer(open(f'{finalitem}.csv', 'w', newline=''))
            fcsv.writerow(headers)
            fcsv.writerows(datarows)
I am familiar with if statements; however, I am unsure how to plug this into the existing code and have it accomplish what I am trying to do. If anyone has any advice I would greatly appreciate it. Thank you.
From my understanding, all you are asking is how to have the script move on when a page fails that criterion, in which case you just need:
if profit < goal:
    continue
Note, though, that the comparison runs after the for loop in your snippet, so it only sees the final value of profit. If there are other profit values you need to look at, they are never evaluated.
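As a minimal sketch of that pattern, the snippet below parses a profit string and uses continue to skip anything under the goal. The helper name parse_usd and the sample strings are made up for illustration; they stand in for the text your script reads from each address page.

```python
def parse_usd(text):
    """Turn a string such as '1,234.50 USD' into a float."""
    cleaned = text.replace('USD', '').replace(',', '').strip()
    return float(cleaned)

goal = 50000.0

# Hypothetical (address, profit text) pairs standing in for scraped pages.
pages = [
    ('address-a', '75,000.25 USD'),
    ('address-b', '1,200 USD'),
    ('address-c', '50,000 USD'),
]

kept = []
for address, profit_text in pages:
    profit = parse_usd(profit_text)
    if profit < goal:
        continue  # below the goal: skip this page and move to the next one
    kept.append(address)  # the real script would scrape the table here

print(kept)
```

Because continue jumps straight to the next iteration, everything after it in the loop body only runs for pages that meet the goal.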
I'm a beginner to Python and am trying to create a program that will scrape the football/soccer schedule from skysports.com and will send it through SMS to my phone through Twilio. I've excluded the SMS code because I have that figured out, so here's the web scraping code I am getting stuck with so far:
import requests
from bs4 import BeautifulSoup
from collections import defaultdict
URL = "https://www.skysports.com/football-fixtures"
page = requests.get(URL)
results = BeautifulSoup(page.content, "html.parser")
d = defaultdict(list)
comp = results.find('h5', {"class": "fixres__header3"})
team1 = results.find('span', {"class": "matches__item-col matches__participant matches__participant--side1"})
date = results.find('span', {"class": "matches__date"})
team2 = results.find('span', {"class": "matches__item-col matches__participant matches__participant--side2"})
for ind in range(len(d)):
    d['comp'].append(comp[ind].text)
    d['team1'].append(team1[ind].text)
    d['date'].append(date[ind].text)
    d['team2'].append(team2[ind].text)
The code below should do the trick for you:
from bs4 import BeautifulSoup
import requests
a = requests.get('https://www.skysports.com/football-fixtures')
soup = BeautifulSoup(a.text,features="html.parser")
teams = []
for date in soup.find_all(class_="fixres__header2"): # searching in that date
    for i in soup.find_all(class_="swap-text--bp30")[1:]: #skips the first one because that's a heading
        teams.append(i.text)
date = soup.find(class_="fixres__header2").text
print(date)
teams = [i.strip('\n') for i in teams]
for x in range(0,len(teams),2):
    print (teams[x]+" vs "+ teams[x+1])
Let me further explain what I have done:
All the football team names have this class name: swap-text--bp30
So we can use find_all to extract all the elements with that class.
Once we have our results we can put them into a list "teams = []" and append them in a for loop with "teams.append(i.text)". ".text" strips the HTML.
Then we can get rid of "\n" in the list by stripping it, and print out the strings two by two.
This should be your final output:
EDIT: To scrape the title of the leagues we will do pretty much the same:
league = []
for date in soup.find_all(class_="fixres__header2"): # searching in that date
    for i in soup.find_all(class_="fixres__header3"): # each league heading
        league.append(i.text)
Strip the array and create another one:
league = [i.strip('\n') for i in league]
final = []
Then add this final bit of code which is essentially just printing the league then the two teams over and over:
for x in range(0,len(teams),5):
    final.append(teams[x]+" vs "+ teams[x+1])
for i in league:
    print(i)
for i in final:
    print(i)
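The two-by-two pairing step can be tried on its own with a plain list. This is only a sketch: the team names are invented, and it assumes the scraped list alternates first and second teams. Pairing through zip over two slices avoids an IndexError if the list happens to have an odd length.

```python
# Invented sample data; a real run would fill this from soup.find_all(...).
teams = ['\nArsenal\n', '\nChelsea\n', '\nLeeds\n', '\nEverton\n']

# Remove the surrounding newlines that come with the scraped text.
teams = [name.strip('\n') for name in teams]

# teams[0::2] holds every first team, teams[1::2] every second team;
# zip stops at the shorter slice, so an unpaired trailing entry is dropped.
fixtures = [f"{home} vs {away}" for home, away in zip(teams[0::2], teams[1::2])]

for fixture in fixtures:
    print(fixture)
```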
I'm trying to scrape a website rating. I want to get each individual rating and its particular date. However, I only get one result in my list, although there should be several.
Am I doing something wrong in the for loop?
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import time
url = "https://www.kununu.com/de/heidelpay/kommentare"
while url != " ":
    print(url)
    time.sleep(15)
    r = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
    print(r.status_code)
    soup = BeautifulSoup(r.text, "html.parser")
    #print(soup.prettify())
    #Get overall score of the company
    score_avg = soup.find("span", class_="index__aggregationValue__32exy").text
    print(score_avg)
    #get individual scores and dates of the company
    rating_list = []
    for box in soup.find_all(".index__rating__3nC2L"):
        score_ind = box.select(".index__score__16yy9").text
        date = select(".index__date__eIOxr").text
        rating = [score_ind, date]
    rating_list.append(rating)
    print(rating_list)
3,3
[['5,0', 'Januar 2017']]
Many thanks in advance!
It looks like you aren't appending the rating to the rating_list until the last loop is done. Is the printed rating perchance the very last one?
Add the append to your loop, like so:
for box in soup.select(".index__rating__3nC2L"):  # select() takes a CSS selector; find_all() expects a tag name
    score_ind = box.select_one(".index__score__16yy9").text
    date = box.select_one(".index__date__eIOxr").text
    rating = [score_ind, date]
    rating_list.append(rating)
Well, the problem is that you're only appending the last rating value, because rating_list.append(rating) is outside the for loop. What you have to do is this:
for box in soup.select(".index__rating__3nC2L"):  # select() takes a CSS selector; find_all() expects a tag name
    score_ind = box.select_one(".index__score__16yy9").text
    date = box.select_one(".index__date__eIOxr").text
    rating = [score_ind, date]
    rating_list.append(rating)
This way you append each rating value on every iteration of the for loop. Hope this helps.
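The difference is easy to reproduce without any scraping. In this sketch the (score, date) pairs are made-up stand-ins for the values read from each rating box; the only thing that changes between the two loops is the indentation of the append call.

```python
# Made-up (score, date) pairs standing in for the scraped rating boxes.
boxes = [('5,0', 'Januar 2017'), ('3,2', 'Februar 2017'), ('4,1', 'Mai 2017')]

# append() after the loop: only the rating from the last iteration is stored.
outside = []
for score, date in boxes:
    rating = [score, date]
outside.append(rating)

# append() inside the loop: one rating is stored per iteration.
inside = []
for score, date in boxes:
    rating = [score, date]
    inside.append(rating)

print(len(outside), len(inside))
```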
I am learning to scrape websites with Beautiful Soup, and was trying to fetch data from Yahoo Finance. I am stuck wondering why it successfully fetches what I want when I am not in a for loop (searching for a specific ticker), but as soon as I make it read a CSV file to search for more than one ticker, the .find() method returns an error instead of the tag I am looking for.
Here is the code when it runs well,
```
import requests
import csv
from bs4 import BeautifulSoup
# ------ FOR LOOP THAT MESSES THINGS UP ------
# with open('s&p500_tickers.csv', 'r') as tickers:
#     for ticker in tickers:
ticker = 'AAPL'  # ------ TEMPORARY TICKER TO TEST CODE
web = requests.get(f'https://ca.finance.yahoo.com/quote/{ticker}/financials?p={ticker}').text
soup = BeautifulSoup(web, 'lxml')
section = soup.find('section', class_='smartphone_Px(20px) Mb(30px)')
tbl = section.find('div', class_='M(0) Whs(n) BdEnd Bdc($seperatorColor) D(itb)')
headerRow = tbl.find("div", class_="D(tbr) C($primaryColor)")
# ------ CODE I USED TO VISUALIZE THE RESULT ------
breakdownHead = headerRow.text[0:9]
ttmHead = headerRow.text[9:12]
lastYear = headerRow.text[12:22]
twoYears = headerRow.text[22:32]
threeYears = headerRow.text[32:42]
fourYears = headerRow.text[42:52]
print(breakdownHead, ttmHead, lastYear, twoYears, threeYears, fourYears)
```
It returns this:
```
Breakdown ttm 2019-09-30 2018-09-30 2017-09-30 2016-09-30
Process finished with exit code 0
```
Here is the code that does not work:
```
import requests
import csv
from bs4 import BeautifulSoup
with open('s&p500_tickers.csv', 'r') as tickers:
    for ticker in tickers:
        web = requests.get(f'https://ca.finance.yahoo.com/quote/{ticker}/financials?p={ticker}').text
        soup = BeautifulSoup(web, 'lxml')
        section = soup.find('section', class_='smartphone_Px(20px) Mb(30px)')
        tbl = section.find('div', class_='M(0) Whs(n) BdEnd Bdc($seperatorColor) D(itb)')
        headerRow = tbl.find("div", class_="D(tbr) C($primaryColor)")
        breakdownHead = headerRow.text[0:9]
        ttmHead = headerRow.text[9:12]
        lastYear = headerRow.text[12:22]
        twoYears = headerRow.text[22:32]
        threeYears = headerRow.text[32:42]
        fourYears = headerRow.text[42:52]
        print(breakdownHead, ttmHead, lastYear, twoYears, threeYears, fourYears)
```
I welcome any feedback on my code as I am always trying to get better.
Thank you very much
So I have resolved the problem.
I realized that the .writerow() method of the csv module adds a line ending after each row, so every ticker read back from the file ends with '\n' (e.g. 'MMM\n').
Somehow, the newline was preventing the .find() method from working in the for loop. (I still don't know exactly why; my guess is that the newline ends up inside the request URL, so a different page comes back and .find() returns None.)
After that it worked for the first line, but since there were empty lines I had to get Python to skip them with an if statement.
I replaced the '\n' with '' and it worked.
Here's what it looks like:
```
for ticker in tickers.readlines():
    ticker = ticker.replace('\n', '')
    if ticker == '':
        pass
    else:
        web = requests.get(f'https://ca.finance.yahoo.com/quote/{ticker}/financials?p={ticker}').text
        soup = BeautifulSoup(web, 'lxml')
        headerRow = soup.find("div", class_="D(tbr) C($primaryColor)")
```
If any of you see a better way to do it, I would be pleased to have some of your feedback.
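One possible tidier variant, offered only as a sketch: str.strip() removes the trailing newline (and any stray spaces) in one call, and continue skips blank lines without needing an else branch. The io.StringIO object below is a stand-in for the opened CSV file so the snippet runs on its own, and the ticker values are just examples.

```python
import io

# Stand-in for open('s&p500_tickers.csv', 'r'); note the blank line.
tickers = io.StringIO("MMM\nAAPL\n\nMSFT\n")

urls = []
for line in tickers:
    ticker = line.strip()  # drops '\n', '\r' and surrounding spaces
    if not ticker:
        continue  # skip empty lines
    # Same URL pattern as in the question; the request itself is left out.
    urls.append(f'https://ca.finance.yahoo.com/quote/{ticker}/financials?p={ticker}')

print(urls[0])
```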
I am new to programming and would really like to know what I am doing wrong!
I am trying to extract data from this site: http://www.afl.com.au/fixture
in a way such that I have a dictionary having the date as key and the "Preview" links as values in a list, like
dict = {"Saturday, June 07": ["preview url-1", "preview url-2", "preview url-3", "preview url-4"]}
Please help me get it, I have used the code below:
def extractData():
    lDateInfoMatchCase = False
    # lDateInfoMatchCase = []
    global gDict
    for row in table_for_players.findAll("tr"):
        for lDateRowIndex in row.findAll("th", {"colspan" : "4"}):
            ldateList.append(lDateRowIndex.text)
    print ldateList
    for index in ldateList:
        #print index
        lPreviewLinkList = []
        for row in table_for_players.findAll("tr"):
            for lDateRowIndex in row.findAll("th", {"colspan" : "4"}):
                if lDateRowIndex.text == index:
                    lDateInfoMatchCase = True
                else:
                    lDateInfoMatchCase = False
            if lDateInfoMatchCase == True:
                for lInfoRowIndex in row.findAll("td", {"class": "info"}):
                    for link in lInfoRowIndex.findAll("a", {"class" : "preview"}):
                        lPreviewLinkList.append("http://www.afl.com.au/" + link.get('href'))
        print lPreviewLinkList
        gDict[index] = lPreviewLinkList
My main aim is to get, in a data structure, the names of all players in the home and away teams for each match, grouped by date.
I prefer using CSS Selectors. Select the first table, then all rows in the tbody for ease of processing; the rows are 'grouped' by tr th rows. From there you can select all next siblings that don't contain th headers and scan these for preview links:
previews = {}
table = soup.select('table.fixture')[0]
for group_header in table.select('tbody tr th'):
    date = group_header.string
    for next_sibling in group_header.parent.find_next_siblings('tr'):
        if next_sibling.th:
            # found a next group, end scan
            break
        for preview in next_sibling.select('a.preview'):
            previews.setdefault(date, []).append(
                "http://www.afl.com.au" + preview.get('href'))
This builds a dictionary of lists; for the current version of the page this produces:
{u'Monday, June 09': ['http://www.afl.com.au/match-centre/2014/12/melb-v-coll'],
u'Sunday, June 08': ['http://www.afl.com.au/match-centre/2014/12/gcfc-v-syd',
'http://www.afl.com.au/match-centre/2014/12/fre-v-adel',
'http://www.afl.com.au/match-centre/2014/12/nmfc-v-rich']}
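The grouping idea (a header row opens a group, and the rows that follow belong to it until the next header) can be sketched without the live page. The rows list below is invented sample data modelled on the output above: each tuple is either a ('header', date) entry or a ('match', relative link) entry, in page order.

```python
# Invented sample rows in page order.
rows = [
    ('header', 'Sunday, June 08'),
    ('match', '/match-centre/2014/12/gcfc-v-syd'),
    ('match', '/match-centre/2014/12/fre-v-adel'),
    ('header', 'Monday, June 09'),
    ('match', '/match-centre/2014/12/melb-v-coll'),
]

previews = {}
current_date = None
for kind, value in rows:
    if kind == 'header':
        current_date = value  # a new group starts here
    elif current_date is not None:
        # setdefault creates the list the first time a date is seen
        previews.setdefault(current_date, []).append('http://www.afl.com.au' + value)

print(previews)
```

This single pass keeps track of the current date instead of rescanning the table for every date, which is the main difference from the nested loops in the question.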
I am trying to use web scraping to get some temperature and precipitation data from www.wunderground.com (they have an API, but I must use web scraping in my project).
My problem is that I can't figure out how to store my data after the scrape.
Here is my code, for example:
import urllib2
from bs4 import BeautifulSoup
url = "http://www.wunderground.com/history/airport/KBUF/2014/5/25/DailyHistory.html"
soup = BeautifulSoup(urllib2.urlopen(url).read())
#Mean Temperature Values
mean_temp_row = soup.findAll('table')[0].findAll('tr')[2]
for tds in mean_temp_row.findAll('td'):
    print tds.text
The output I am getting is:
Mean Temperature
15 °C
16 °C
I would like to know how I can get something like: station = {"Temp_Mean":[15 , 16]}
Is this output format always the same?
If it is, the name of the value is in the first td of the row. Then there is an empty td, then the min, then two empty ones, and finally the max.
So you could do something like:
def celcius2float(celcius):
    return float(celcius.split('°')[0].strip())
cells = mean_temp_row.findAll('td')
name = cells[0].text
min_temp = celcius2float(cells[2].text)
max_temp = celcius2float(cells[5].text)
# Then you can do all you want with this stuff:
station = {name: [min_temp, max_temp]}
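The conversion step can be checked on its own. This sketch (written for Python 3) assumes cell text of the form '15 °C', as in the output shown in the question; the name celsius_to_float and the 'Temp_Mean' key are illustrative only.

```python
def celsius_to_float(text):
    """Return the number in front of the degree sign, e.g. '15 °C' -> 15.0."""
    return float(text.split('°')[0].strip())

# Sample cell texts copied from the output in the question.
cells = ['15 °C', '16 °C']

station = {'Temp_Mean': [celsius_to_float(cell) for cell in cells]}
print(station)
```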
After considering the answer from TurpIF, here is my code:
def collect_data(url):
    soup = BeautifulSoup(urllib2.urlopen(url).read())
    Mean_temp = soup.findAll('table')[0].findAll('tr')[2].findAll('td')
    temp = Mean_temp[1].text.split()[0].encode('utf8')
    rows = soup.findAll('table')[0].findAll('tr')
    for num,row in enumerate(rows):
        if "Precipitation" in row.text:
            preci_line = num
    Preci = soup.findAll('table')[0].findAll('tr')[preci_line].findAll('td')
    perci = Preci[1].text.split()[0].encode('utf8')
    return temp,perci
So,
url = "http://www.wunderground.com/history/airport/KBUF/2014/5/25/DailyHistory.html"
temp,perci = collect_data(url)