I am learning to use BeautifulSoup and Python to extract an HTML table. I tried using the following code to extract the balance sheet for Google; however, I can't seem to get all the rows scraped correctly.
I can't manage to omit rows that are just spacers, and I can't manage to extract the Totals rows (e.g. Total Assets).
Any advice? Advice on simplifying the code would also be valuable.
from bs4 import BeautifulSoup
import requests

def bs_extract(stock_ticker):
    url = 'https://finance.yahoo.com/q/bs?s=' + str(stock_ticker) + '&annual'
    source_code = requests.get(url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text)
    c1 = ""
    c2 = ""
    c3 = ""
    c4 = ""
    c5 = ""
    table = soup.find("table", {"class": "yfnc_tabledata1"})
    # print (table)
    for row in table.findAll("tr"):
        cells = row.findAll("td")
        if len(cells) == 5:
            c1 = cells[0].find(text=True)
            c2 = cells[1].find(text=True)
            c3 = cells[2].find(text=True)
            c4 = cells[3].find(text=True)
            c5 = cells[4].find(text=True)
        elif len(cells) == 6:
            c1 = cells[1].find(text=True)
            c2 = cells[2].find(text=True)
            c3 = cells[3].find(text=True)
            c4 = cells[4].find(text=True)
            c5 = cells[5].find(text=True)
        elif len(cells) == 1:
            c1 = cells[0].find(text=True)
            c2 = ""
            c3 = ""
            c4 = ""
            c5 = ""
        else:
            pass
        print(c1, c2, c3, c4, c5)

bs_extract('goog')
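One hedged alternative to special-casing the cell counts (I can't verify the old Yahoo page's exact markup, so treat this as a sketch): take get_text from every cell, which also descends into nested tags such as the <strong> wrappers that totals rows often use, and skip rows whose cells are all empty, which drops the spacers.

from bs4 import BeautifulSoup
import requests

def bs_extract_rows(stock_ticker):
    url = 'https://finance.yahoo.com/q/bs?s=' + str(stock_ticker) + '&annual'
    soup = BeautifulSoup(requests.get(url).text, 'html.parser')
    table = soup.find("table", {"class": "yfnc_tabledata1"})
    for row in table.find_all("tr"):
        # get_text collects all nested text, so Totals rows come through too
        cells = [td.get_text(strip=True) for td in row.find_all("td")]
        if any(cells):  # spacer rows have no text in any cell
            print(cells)

bs_extract_rows('goog')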
You might find it easier to get this data structured, through YQL. See http://goo.gl/qKeWXw
I am new to data scraping with Beautiful Soup. I would like to get data from pro-football-reference on these stats: https://www.pro-football-reference.com/boxscores/201009090nor.htm#all_pbp
I would like to iterate through every row of the 'Detail' column in the Full Play-By-Play table, so that if the Detail contains the word "Penalty" I can save it. Does anyone know how I could do this? This table seems different from the others.
# An example of how I extracted another element (Referee Name)
# from the same page, but a different table
from bs4 import BeautifulSoup, Comment

table = soup.select_one('#all_officials').find_next(text=lambda t: isinstance(t, Comment))
table = BeautifulSoup(table, 'html.parser')
for tr in table.select('tr'):
    tds = [td.get_text(strip=True) for td in tr.select('td')]
    if str(*tds) != "Officials":
        referee = str(*tds)
        break
The table is commented out. A common and reliable way is to import Comment and handle it with for comment in soup.find_all(text=lambda text: isinstance(text, Comment)), as shown here: https://stackoverflow.com/a/60381103.
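For illustration, a minimal sketch of that Comment-based approach (the variable names are my own): find the comment nodes, re-parse any that contain a table, and collect the tables they hold.

from bs4 import BeautifulSoup, Comment
import requests

r = requests.get('https://www.pro-football-reference.com/boxscores/201009090nor.htm',
                 headers={'User-Agent': 'Mozilla/5.0'})
soup = BeautifulSoup(r.text, 'html.parser')

# Each hidden table lives inside an HTML comment node; re-parse any
# comment that contains a <table> and keep the parsed table element.
tables = []
for comment in soup.find_all(text=lambda text: isinstance(text, Comment)):
    if '<table' in comment:
        tables.append(BeautifulSoup(comment, 'html.parser').table)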
For this particular instance, I am just removing the comment markers through substitution.
Then I use :-soup-contains to target the appropriate rows, keeping only those rows within the table where the text Penalty appears in the elements whose data-stat attribute has the value detail, i.e. the Detail column.
I then use pandas to reconstitute the table from the filtered trs: their HTML is joined and then book-ended by table tags.
from bs4 import BeautifulSoup as bs
import pandas as pd
import requests
import re
r = requests.get('https://www.pro-football-reference.com/boxscores/201009090nor.htm#all_pbp',
headers={'User-Agent': 'Mozilla/5.0'})
s = re.sub(r'<!--|-->', '', r.text)
soup = bs(s, 'lxml')
s2 = '<table>' + ''.join([str(i) for i in soup.select(
'#pbp tr:has([data-stat=detail]:-soup-contains("Penalty"))')]) + '</table>'
df = pd.read_html(s2)[0]
df.columns = [i.text for i in soup.select('#pbp thead > tr > th')]
df
I am using BeautifulSoup to scrape a website but need help with this, as I am new to Python and BeautifulSoup.
How do I get VET from the following
"[[VET]]"
This is my code so far
import bs4 as bs
import urllib.request
import pandas as pd
#This is the Home page of the website
source = urllib.request.urlopen('file:///C:/Users/Aiden/Downloads/stocks/Stock%20Premarket%20Trading%20Activity%20_%20Biggest%20Movers%20Before%20the%20Market%20Opens.html').read().decode('utf-8')
soup = bs.BeautifulSoup(source,'lxml')
#find the Div and put all info into varTable
table = soup.find('table',{"id":"decliners_tbl"}).tbody
#find all Rows in table and puts into varTableRows
tableRows = table.find_all('tr')
print ("There is ",len(tableRows),"Rows in the Table")
print(tableRows)
columns = [tableRows[1].find_all('td')]
print(columns)
a = [tableRows[1].find_all("a")]
print(a)
So my output from print(a) is "[[<a class="mplink popup_link" href="https://marketchameleon.com/Overview/VET/">VET</a>]]" and I want to extract VET from that.
You can use a.text or a.get_text().
If you have multiple elements, you'd need a list comprehension to apply this to each element.
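For example, a short sketch (assuming soup is the parsed page from the question):

# Grab the link text from every anchor inside the decliners table
table = soup.find('table', {'id': 'decliners_tbl'})
tickers = [a.get_text() for a in table.find_all('a')]
print(tickers)  # e.g. ['VET', ...]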
Thank you for all the replies; I was able to work it out using the following code:
source = urllib.request.urlopen('file:///C:/Users/Aiden/Downloads/stocks/Stock%20Premarket%20Trading%20Activity%20_%20Biggest%20Movers%20Before%20the%20Market%20Opens.html').read().decode('utf-8')
soup = bs.BeautifulSoup(source, 'html.parser')

table = soup.find("table", id="decliners_tbl")
for decliners in table.find_all("tbody"):
    rows = decliners.find_all("tr")
    for row in rows:
        ticker = row.find("a").text
        volume = row.findAll("td", class_="rightcell")[3].text
        print(ticker, volume)
I am new to web scraping and I'm trying to scrape the "Statistics" page of Yahoo Finance for AAPL. Here's the link: https://finance.yahoo.com/quote/AAPL/key-statistics?p=AAPL
Here is the code I have so far...
from bs4 import BeautifulSoup
from requests import get
url = 'https://finance.yahoo.com/quote/AAPL/key-statistics?p=AAPL'
response = get(url)
soup = BeautifulSoup(response.text, 'html.parser')
stock_data = soup.find_all("table")
for stock in stock_data:
    print(stock.text)
When I run that, it returns all of the table data on the page. However, I only want specific data from each table (e.g. "Market Cap", "Revenue", "Beta").
I tried messing around with the code by doing print(stock[1].text) to see if I could limit the amount of data returned to just the second value in each table, but that returned an error message. Am I on the right track by using BeautifulSoup, or do I need to use a completely different library? What would I have to do in order to return only particular data and not all of the table data on the page?
Examining the HTML code gives you the best idea of how BeautifulSoup will handle what it sees.
The web page seems to contain several tables, which in turn contain the information you are after. The tables follow a certain logic.
First scrape all the tables on the web page, then find all the table rows (<tr>) and the table data (<td>) that those rows contain.
Below is one way of achieving this. I even threw in a function to print only a specific measurement.
from bs4 import BeautifulSoup
from requests import get

url = 'https://finance.yahoo.com/quote/AAPL/key-statistics?p=AAPL'
response = get(url)
soup = BeautifulSoup(response.text, 'html.parser')

stock_data = soup.find_all("table")
# stock_data will contain multiple tables, next we examine each table one by one
for table in stock_data:
    # Scrape all table rows into variable trs
    trs = table.find_all('tr')
    for tr in trs:
        # Scrape all table data tags into variable tds
        tds = tr.find_all('td')
        # Guard against rows that don't hold a measurement/value pair
        if len(tds) < 2:
            continue
        # Index 0 of tds will contain the measurement
        print("Measure: {}".format(tds[0].get_text()))
        # Index 1 of tds will contain the value
        print("Value: {}".format(tds[1].get_text()))
        print("")

def get_measurement(table_array, measurement):
    for table in table_array:
        trs = table.find_all('tr')
        for tr in trs:
            tds = tr.find_all('td')
            if len(tds) > 1 and measurement.lower() in tds[0].get_text().lower():
                return tds[1].get_text()

# print only one measurement, e.g. operating cash flow
print(get_measurement(stock_data, "operating cash flow"))
Although this isn't Yahoo Finance, you can do something very similar, like this...
import requests
import pandas
from bs4 import BeautifulSoup

base_url = 'https://finviz.com/screener.ashx?v=152&o=price&t=MSFT,AAPL,SBUX,S,GOOG&o=price&c=0,1,2,3,4,5,6,7,8,9,25,63,64,65,66,67'
html = requests.get(base_url)
soup = BeautifulSoup(html.content, "html.parser")

main_div = soup.find('div', attrs={'id': 'screener-content'})
light_rows = main_div.find_all('tr', class_="table-light-row-cp")
dark_rows = main_div.find_all('tr', class_="table-dark-row-cp")

data = []
for rows_set in (light_rows, dark_rows):
    for row in rows_set:
        row_data = []
        for cell in row.find_all('td'):
            val = cell.a.get_text()
            row_data.append(val)
        data.append(row_data)

# sort rows to maintain original order
data.sort(key=lambda x: int(x[0]))

pandas.DataFrame(data).to_csv("C:\\your_path\\AAA.csv", header=False)
This is a nice substitute in case Yahoo decides to deprecate more of the functionality of their API. I know they cut out a lot of things (mostly historical quotes) a couple of years ago. It was sad to see that go away.
I am currently running the following Python script:
import requests
from bs4 import BeautifulSoup

origin = ["USD", "GBP", "EUR"]
i = 0
while i < len(origin):
    page = requests.get("https://www.x-rates.com/table/?from=" + origin[i] + "&amount=1")
    soup = BeautifulSoup(page.content, "html.parser")
    tables = soup.findChildren('table')
    my_table = tables[0]
    rows = my_table.findChildren(['td'])
    i = i + 1
    for rows in rows:
        cells = rows.findChildren('a')
        for cell in cells:
            value = cell.string
            print(value)
To scrape data from this HTML:
https://i.stack.imgur.com/DkX83.png
The problem I have is that I'm struggling to scrape only the first column without scraping the second one as well, because both columns are inside <a> tags and sit in the same table row as each other. The href is the only thing that differentiates the two tags, and I have tried filtering on it, but that doesn't seem to work and returns a blank value. Also, when I try to sort the data manually, the output is arranged vertically and not horizontally. I am new to coding, so any help would be appreciated :)
There is another way you might want to try to achieve the same result:
import requests
from bs4 import BeautifulSoup

keywords = ["USD", "GBP", "EUR"]

for keyword in keywords:
    page = requests.get("https://www.x-rates.com/table/?from={}&amount=1".format(keyword))
    soup = BeautifulSoup(page.content, "html.parser")
    for items in soup.select_one(".ratesTable tbody").find_all("tr"):
        data = [item.text for item in items.find_all("td")[1:2]]
        print(data)
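The [1:2] slice keeps only the second <td> of each row (the first rate column), so the second linked column is never scraped.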
It is easier to follow what happens when you print every item you get, starting from the top, e.g. in this case from the table item. The idea is to go one step at a time so you can follow along.
import requests
from bs4 import BeautifulSoup

origin = ["USD", "GBP", "EUR"]
i = 0
while i < len(origin):
    page = requests.get("https://www.x-rates.com/table/?from=" + origin[i] + "&amount=1")
    soup = BeautifulSoup(page.content, "html.parser")
    tables = soup.findChildren('table')
    my_table = tables[0]
    i = i + 1
    rows = my_table.findChildren('tr')
    for row in rows:
        cells = row.findAll('td', class_='rtRates')
        if len(cells) > 0:
            first_item = cells[0].find('a')
            value = first_item.string
            print(value)
I have an Excel file with many Chinese names in the first row.
What I am doing is scraping some more Chinese names from a web table; the names are all in the 2nd column of each row (tr). I want to see if each scraped name is already in my Excel file, so I use a boolean have to keep track; it should become True if the name is found. I also want to know the exact position (column number) of a found name, so I use name_position to keep track.
from lxml import html
from bs4 import BeautifulSoup
import requests
import openpyxl
from openpyxl.workbook import Workbook

wb = openpyxl.load_workbook('hehe.xlsx')
ws1 = wb.get_sheet_by_name('Taocan')

page = requests.get(url)
tree = html.fromstring(page.text)
web = page.text
soup = BeautifulSoup(web, 'lxml')
table = soup.find('table', {'class': "tc_table"})
trs = table.find_all('tr')

for tr in trs:
    ls = []
    for td in tr.find_all('td'):
        ls.append(td.text)
    ls = [x.encode('utf-8') for x in ls]
    try:
        name = ls[1]
        have = False
        name_position = 1
        for cell in ws1[1]:
            if name == cell:
                have = True
                break
            else:
                name_position += 1
    except IndexError:
        print("there is an index error")
However, my code doesn't seem to work, and I think the problem is the comparison of the names:
if name == cell
I changed it to:
if name == cell.value
but it still doesn't work.
Can anyone help me with this? Thanks.
Just to add on: the web page I'm scraping is also in Chinese. So when I
print(ls)
it gives a list like this
['1', '\xe4\xb8\x80\xe8\x88\xac\xe6\xa3\x80\xe6\x9f\xa5', '\xe8\xba\xab\xe9\xab\x98\xe3\x80\x81\xe4\xbd\x93\xe9\x87\x8d\xe3\x80\x81\xe4\xbd\x93\xe9\x87\x8d\xe6\x8c\x87\xe6\x95\xb0\xe3\x80\x81\xe8\x85\xb0\xe5\x9b\xb4\xe3\x80\x81\xe8\x88\x92\xe5\xbc\xa0\xe5\x8e\x8b\xe3\x80\x81\xe6\x94\xb6\xe7\xbc\xa9\xe5\x8e\x8b\xe3\x80\x81\xe8\xa1\x80\xe5\x8e\x8b\xe6\x8c\x87\xe6\x95\xb0', '\xe9\x80\x9a\xe8\xbf\x87\xe4\xbb\xaa\xe5\x99\xa8\xe6\xb5\x8b\xe9\x87\x8f\xe4\xba\xba\xe4\xbd\x93\xe8\xba\xab\xe9\xab\x98\xe3\x80\x81\xe4\xbd\x93\xe9\x87\x8d\xe3\x80\x81\xe4\xbd\x93\xe8\x84\x82\xe8\x82\xaa\xe7\x8e\x87\xe5\x8f\x8a\xe8\xa1\x80\xe5\x8e\x8b\xef\xbc\x8c\xe7\xa7\x91\xe5\xad\xa6\xe5\x88\xa4\xe6\x96\xad\xe4\xbd\x93\xe9\x87\x8d\xe6\x98\xaf\xe5\x90\xa6\xe6\xa0\x87\xe5\x87\x86\xe3\x80\x81\xe8\xa1\x80\xe5\x8e\x8b\xe6\x98\xaf\xe5\x90\xa6\xe6\xad\xa3\xe5\xb8\xb8\xe3\x80\x81\xe4\xbd\x93\xe8\x84\x82\xe8\x82\xaa\xe6\x98\xaf\xe5\x90\xa6\xe8\xb6\x85\xe6\xa0\x87\xe3\x80\x82']
but if I
print(ls[1])
it gives a Chinese name like "广州".
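A hedged guess at the cause, based on the code above: ls is re-encoded to UTF-8 byte strings (ls = [x.encode('utf-8') for x in ls]), while openpyxl's cell.value is a unicode string, and in Python 2 comparing non-ASCII bytes with unicode is always False. A minimal sketch of the fix, reusing the question's names:

for cell in ws1[1]:
    # cell.value is unicode; decode the scraped UTF-8 bytes before comparing
    # (or drop the .encode('utf-8') step and compare unicode to unicode)
    if name.decode('utf-8') == cell.value:
        have = True
        break
    else:
        name_position += 1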