I am learning to scrape websites with BeautifulSoup and was trying to fetch data from Yahoo Finance. I am stuck wondering why my code successfully fetches what I want when I am not in a for loop (searching for a specific ticker), but as soon as I make it use a CSV file to search for more than one ticker, the .find() method returns an error instead of the tag I am looking for.
Here is the code that runs well:
```
import requests
import csv
from bs4 import BeautifulSoup
# ------ FOR LOOP THAT MESSES THINGS UP ------
# with open('s&p500_tickers.csv', 'r') as tickers:
# for ticker in tickers:
ticker = 'AAPL'  # ------ TEMPORARY TICKER TO TEST CODE ------
web = requests.get(f'https://ca.finance.yahoo.com/quote/{ticker}/financials?p={ticker}').text
soup = BeautifulSoup(web, 'lxml')
section = soup.find('section', class_='smartphone_Px(20px) Mb(30px)')
tbl = section.find('div', class_='M(0) Whs(n) BdEnd Bdc($seperatorColor) D(itb)')
headerRow = tbl.find("div", class_="D(tbr) C($primaryColor)")
# ------ CODE I USED TO VISUALIZE THE RESULT ------
breakdownHead = headerRow.text[0:9]
ttmHead = headerRow.text[9:12]
lastYear = headerRow.text[12:22]
twoYears = headerRow.text[22:32]
threeYears = headerRow.text[32:42]
fourYears = headerRow.text[42:52]
print(breakdownHead, ttmHead, lastYear, twoYears, threeYears, fourYears)
```
It returns this:
```
Breakdown ttm 2019-09-30 2018-09-30 2017-09-30 2016-09-30
Process finished with exit code 0
```
Here is the code that does not work:
```
import requests
import csv
from bs4 import BeautifulSoup
with open('s&p500_tickers.csv', 'r') as tickers:
    for ticker in tickers:
        web = requests.get(f'https://ca.finance.yahoo.com/quote/{ticker}/financials?p={ticker}').text
        soup = BeautifulSoup(web, 'lxml')
        section = soup.find('section', class_='smartphone_Px(20px) Mb(30px)')
        tbl = section.find('div', class_='M(0) Whs(n) BdEnd Bdc($seperatorColor) D(itb)')
        headerRow = tbl.find("div", class_="D(tbr) C($primaryColor)")
        breakdownHead = headerRow.text[0:9]
        ttmHead = headerRow.text[9:12]
        lastYear = headerRow.text[12:22]
        twoYears = headerRow.text[22:32]
        threeYears = headerRow.text[32:42]
        fourYears = headerRow.text[42:52]
        print(breakdownHead, ttmHead, lastYear, twoYears, threeYears, fourYears)
```
I welcome any feedback on my code as I am always trying to get better.
Thank you very much
So I have resolved the problem.
I realized that the csv module's .writerow() method adds '\n' at the end of each string (e.g. 'MMM\n').
That newline was what kept the .find() chain from working inside the for loop: presumably it ended up in the request URL, so Yahoo returned a page without the expected markup, soup.find() returned None, and the chained .find() call raised an AttributeError.
I replaced the '\n' with '' and it worked.
Afterward, it worked for the first line, but since the file also contained empty lines I had to make Python skip those with an if statement.
Here's what it looks like:
```
for ticker in tickers.readlines():
    ticker = ticker.replace('\n', '')
    if ticker == '':
        pass
    else:
        web = requests.get(f'https://ca.finance.yahoo.com/quote/{ticker}/financials?p={ticker}').text
        soup = BeautifulSoup(web, 'lxml')
        headerRow = soup.find("div", class_="D(tbr) C($primaryColor)")
```
If any of you see a better way to do it, I would be pleased to have some of your feedback.
I am new to programming and would really like to know what I am doing wrong!
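For reference, a slightly tighter sketch of the same loop that should handle both issues at once, using str.strip() to drop the newline along with any surrounding whitespace, and skipping blank lines up front (assuming one ticker per line in the CSV):
```
with open('s&p500_tickers.csv', 'r') as tickers:
    for line in tickers:
        ticker = line.strip()  # drops the trailing '\n' and stray whitespace
        if not ticker:
            continue  # skip empty lines instead of requesting a blank URL
        web = requests.get(f'https://ca.finance.yahoo.com/quote/{ticker}/financials?p={ticker}').text
        soup = BeautifulSoup(web, 'lxml')
        headerRow = soup.find("div", class_="D(tbr) C($primaryColor)")
```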
I am new to Python and Beautiful Soup. The project I am working on is a script that scrapes the pages linked from this page:
https://bitinfocharts.com/top-100-richest-dogecoin-addresses-2.html
Currently, the script has a filter that only scrapes pages whose "Last Out" date is past a certain date.
I am trying to add an additional filter to the script, which does the following:
1. Scrape the "Profit from price change:" section on the page inside the hyperlink (example page: https://bitinfocharts.com/dogecoin/address/D8WhgsmFUkf4imvsrwYjdhXL45LPz3bS1S).
2. Convert the profit into a float.
3. Compare the profit to a variable called "goal" which has a float assigned to it.
4. If the profit is greater than or equal to goal, scrape the contents of the page; if it is not, do not scrape the webpage and continue the script.
Here is the snippet of code I am using to try and do this:
```
#Get the profit
sections = soup.find_all(class_='table-striped')
for section in sections:
    oldprofit = section.find_all('td')[11].text
    removetext = oldprofit.replace('USD', '')
    removetext = removetext.replace(' ', '')
    removetext = removetext.replace(',', '')
    profit = float(removetext)

# Compare profit to goal
goal = float(50000)
if profit >= goal
```
Basically, what I am trying to do is run an if statement on a value on the webpage, and if the statement is true, then scrape the webpage. If the if statement is false, then do not scrape the page and continue the code.
Here is the entire script that I am trying to plug this into:
```
import csv
import requests
from bs4 import BeautifulSoup as bs
from datetime import datetime

headers = []
datarows = []

# define 1-1-2020 as a datetime object
after_date = datetime(2020, 1, 1)

with requests.Session() as s:
    s.headers = {"User-Agent": "Safari/537.36"}
    r = s.get('https://bitinfocharts.com/top-100-richest-dogecoin-addresses-2.html')
    soup = bs(r.content, 'lxml')

    # select all tr elements (minus the first one, which is the header)
    table_elements = soup.select('tr')[1:]
    address_links = []
    for element in table_elements:
        children = element.contents  # get children of table element
        url = children[1].a['href']
        last_out_str = children[8].text
        # check to make sure the date field isn't empty
        if last_out_str != "":
            # load date into datetime object for comparison (second part is defining the layout of the date as years-months-days hour:minute:second timezone)
            last_out = datetime.strptime(last_out_str, "%Y-%m-%d %H:%M:%S %Z")
            # if check to see if the date is after 2020/1/1
            if last_out > after_date:
                address_links.append(url)

    for url in address_links:
        r = s.get(url)
        soup = bs(r.content, 'lxml')
        table = soup.find(id="table_maina")

        #Get the Doge Address for the filename
        item = soup.find('h1').text
        newitem = item.replace('Dogecoin', '')
        finalitem = newitem.replace('Address', '')

        #Get the profit
        sections = soup.find_all(class_='table-striped')
        for section in sections:
            oldprofit = section.find_all('td')[11].text
            removetext = oldprofit.replace('USD', '')
            removetext = removetext.replace(' ', '')
            removetext = removetext.replace(',', '')
            profit = float(removetext)

        # Compare profit to goal
        goal = float(50000)
        if profit >= goal

        if table:
            for row in table.find_all('tr'):
                heads = row.find_all('th')
                if heads:
                    headers = [th.text for th in heads]
                else:
                    datarows.append([td.text for td in row.find_all('td')])

        fcsv = csv.writer(open(f'{finalitem}.csv', 'w', newline=''))
        fcsv.writerow(headers)
        fcsv.writerows(datarows)
```
I am familiar with if statements, however I am unsure how to plug this into the existing code and have it accomplish what I am trying to do. If anyone has any advice I would greatly appreciate it. Thank you.
From my understanding, it seems like all that you are asking is how to have the script continue if it fails that criteria, in which case you just need to do:
```
if profit < goal:
    continue
```
Note, though, that the for loop in your snippet only ends up using the final value of profit; if there are other profit values you need to look at, those are not being evaluated.
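To make that concrete, here is a minimal sketch of where that guard could sit inside the per-address loop of the posted script (keeping the question's own assumption that the twelfth td of each table-striped section holds the profit figure):
```
goal = 50000.0
for url in address_links:
    r = s.get(url)
    soup = bs(r.content, 'lxml')
    table = soup.find(id="table_maina")
    # extract the profit exactly as in the question's snippet
    profit = 0.0
    for section in soup.find_all(class_='table-striped'):
        text = section.find_all('td')[11].text
        profit = float(text.replace('USD', '').replace(' ', '').replace(',', ''))
    if profit < goal:
        continue  # below the goal: skip this address and move to the next URL
    # ... the existing table-parsing and CSV-writing code runs only past this point
```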
I'm a beginner with Python & trying to learn with a BeautifulSoup webscraping project.
I'm looking to scrape the record item title, URL of item & purchase date from this URL & export to a CSV.
I made great progress scraping the title & URL but just cannot figure out how to code the purchase date correctly in my for loop (the purchase_date variable below).
What's currently happening is that the purchase-date data in the CSV file (the p_date column) just shows blank cells with no text: no error message, just no data getting put into the CSV. Any guidance is much appreciated.
Thank you!!
```
import requests
from requests import get
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np

headers = {"Accept-Language": "en-US, en;q=0.5"}
url = "https://www.popsike.com/php/quicksearch.php?searchtext=metal+-signed+-promo+-beatles+-zeppelin+-acetate+-test+-sinatra&sortord=aprice&pagenum=1&incldescr=1&sprice=100&eprice=&endfrom=2020&endthru=2020&bidsfrom=&bidsthru=&layout=&flabel=&fcatno="
results = requests.get(url, headers=headers)
soup = BeautifulSoup(results.text, "html.parser")

title = []
date = []
URL = []

record_div = soup.find_all('div', class_='col-md-7 add-desc-box')
for container in record_div:
    description = container.a.text
    title.append(description)
    link = container.find('a')
    URL.append(link.get('href'))
    purchase_date = container.find('span', class_='info-row').text
    date.append(purchase_date)

test_data = pd.DataFrame({
    'record_description': title,
    'link': URL,
    'p_date': date
})
test_data['link'] = test_data['link'].str.replace('../', 'https://www.popsike.com/', 1)
print(test_data)
test_data.to_csv('popaaron.csv')
```
I suggest changing the parser type (the parser's name is html5lib, not html5):
```
soup = BeautifulSoup(results.text, "html5lib")
```
And fixing the search expression for the purchase date:
```
purchase_date = container.select('span.date > b')[0].text.strip(' \t\n\r')
```
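For context, a minimal sketch of how those two fixes might slot into the question's loop (assuming html5lib is installed, e.g. via pip install html5lib, and that the page structure matches the selector above):
```
from bs4 import BeautifulSoup

# html5lib is slower but more tolerant of malformed markup than html.parser
soup = BeautifulSoup(results.text, "html5lib")

record_div = soup.find_all('div', class_='col-md-7 add-desc-box')
for container in record_div:
    # assumes the sale date sits in a <b> tag inside a span with class "date"
    purchase_date = container.select('span.date > b')[0].text.strip(' \t\n\r')
    date.append(purchase_date)
```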
I am using BeautifulSoup to scrape a website but need help with this, as I am new to Python and BeautifulSoup.
How do I get VET from the following?
"[[VET]]"
This is my code so far:
```
import bs4 as bs
import urllib.request
import pandas as pd

#This is the Home page of the website
source = urllib.request.urlopen('file:///C:/Users/Aiden/Downloads/stocks/Stock%20Premarket%20Trading%20Activity%20_%20Biggest%20Movers%20Before%20the%20Market%20Opens.html').read().decode('utf-8')
soup = bs.BeautifulSoup(source, 'lxml')

#find the Div and put all info into varTable
table = soup.find('table', {"id": "decliners_tbl"}).tbody

#find all Rows in table and puts into varTableRows
tableRows = table.find_all('tr')
print("There is ", len(tableRows), "Rows in the Table")
print(tableRows)

columns = [tableRows[1].find_all('td')]
print(columns)
a = [tableRows[1].find_all("a")]
print(a)
```
So my output from print(a) is "[[<a class="mplink popup_link" href="https://marketchameleon.com/Overview/VET/">VET</a>]]", and I want to extract VET from it.
You can use a.text or a.get_text().
If you have multiple elements, you'd need a list comprehension with this function.
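For instance, a quick sketch against the output shown above, reusing tableRows from the question's code:
```
# find_all("a") returns a list of tags, so take the text of each one
links = tableRows[1].find_all("a")
tickers = [tag.get_text() for tag in links]  # e.g. ['VET']
print(tickers)
```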
Thank you for all the replies; I was able to work it out using the following code:
```
source = urllib.request.urlopen('file:///C:/Users/Aiden/Downloads/stocks/Stock%20Premarket%20Trading%20Activity%20_%20Biggest%20Movers%20Before%20the%20Market%20Opens.html').read().decode('utf-8')
soup = bs.BeautifulSoup(source, 'html.parser')

table = soup.find("table", id="decliners_tbl")
for decliners in table.find_all("tbody"):
    rows = decliners.find_all("tr")
    for row in rows:
        ticker = row.find("a").text
        volume = row.findAll("td", class_="rightcell")[3].text
        print(ticker, volume)
```
As of now, I'm only getting ['1'] as the output of what's being printed with my current code below. I want to grab 1-54 on the Team Batting table in the Rk column on the website https://www.baseball-reference.com/teams/NYY/2019.shtml.
How would I go about modifying colNum so it can print the 1-54 in the Rk column? I'm pointing out the colNum line because I feel the issue lies there but I could be wrong.
```
import pandas as pd
import requests
from bs4 import BeautifulSoup

page = requests.get('https://www.baseball-reference.com/teams/NYY/2019.shtml')
soup = BeautifulSoup(page.content, 'html.parser')  # parse as HTML page, this is the source code of the page

week = soup.find(class_='table_outer_container')
items = week.find("thead").get_text()  # grabs table headers
th = week.find("th").get_text()  # grabs Rk only.
tbody = week.find("tbody")
tr = tbody.find("tr")
thtwo = tr.find("th").get_text()
colNum = [thtwo for thtwo in thtwo]
print(colNum)
```
Your mistake was in the last few lines, as you mentioned. If I understood right, you want a list of all the values in the "Rk" column. In order to get all the rows, you have to use the find_all() function. I tweaked your code a little to get the text of the first field in each row:
```
import pandas as pd
import requests
from bs4 import BeautifulSoup

page = requests.get('https://www.baseball-reference.com/teams/NYY/2019.shtml')
soup = BeautifulSoup(page.content, 'html.parser')  # parse as HTML page, this is the source code of the page

week = soup.find(class_='table_outer_container')
items = week.find("thead").get_text()
th = week.find("th").get_text()
tbody = week.find("tbody")
tr = tbody.find_all("tr")
colnum = [row.find("th").get_text() for row in tr]
print(colnum)
```
I am very new and I am getting totally stuck with a recent task. I want the displayed stock price to refresh automatically as it changes; I am scraping the nasdaq.com website for the actual intraday price.
Here is my current code:
```
import bs4 as bs
import urllib.request

tiker = input("zadaj ticker: ")
url = urllib.request.urlopen("http://www.nasdaq.com/symbol/" + tiker + "/real-time")
stranka = url.read()
soup = bs.BeautifulSoup(stranka, 'lxml')

print(tiker.upper())
for each in soup.find('div', attrs={'id': 'qwidget_lastsale'}):
    print(each.string)
```
I was only able to make an infinite loop with while True, but each refresh prints on a new line, even though I want a single line to update as the actual price changes.
Thank you very much for your notes.
You can achieve it by printing "\b" to remove the previously printed string and then printing on the same line:
```
import bs4 as bs
import urllib.request
import time
import sys

tiker = input("zadaj ticker: ")
print(tiker.upper())
written_string = ''
while True:
    url = urllib.request.urlopen("http://www.nasdaq.com/symbol/" + tiker + "/real-time")
    stranka = url.read()
    soup = bs.BeautifulSoup(stranka, 'lxml')
    for each in soup.find('div', attrs={'id': 'qwidget_lastsale'}):
        for i in range(len(written_string)):
            sys.stderr.write("\b")
        sys.stderr.write(each.string)
        written_string = each.string
    time.sleep(1)
```
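As a side note, a common alternative to backspaces is a carriage return, which jumps back to the start of the line before rewriting it; a minimal sketch (assuming output goes to a regular terminal, with price standing in for each.string):
```
import sys

price = "123.45"  # hypothetical value; in the loop above this would be each.string
sys.stdout.write("\r" + price.ljust(20))  # pad so shorter prices overwrite leftovers
sys.stdout.flush()
```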