What is the proper syntax for .find() in bs4?

I am trying to scrape the Bitcoin price from Coinbase and cannot find the proper syntax. When I run the program (without the line containing the question marks), I get the block of HTML that I need, but I don't know how to narrow it down and retrieve the price itself. Any help is appreciated, thanks.
import requests
from bs4 import BeautifulSoup
url = 'https://www.coinbase.com/charts'
data = requests.get(url)
nicedata = data.text
soup = BeautifulSoup(nicedata, 'html.parser')
prettysoup = soup.prettify()
bitcoin = soup.find('h4', {'class': 'Header__StyledHeader-sc-1q6y56a-0 hZxUBM TextElement__Spacer-sc-18l8wi5-0 hpeTzd'})
price = bitcoin.find('???')
print(price)
The attached image contains the HTML.

To get the text from an element:
price = bitcoin.text
However, this page has many <h4> elements with this class, and find() returns only the first one, whose text is "Bitcoin", not the price shown in your image. You may need find_all() to get a list of all matching elements; then you can use an index [index] or a slice [start:end] to pick some of them, or a for-loop to process every element in the list.
import requests
from bs4 import BeautifulSoup
url = 'https://www.coinbase.com/charts'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
all_h4 = soup.find_all('h4', {'class': 'Header__StyledHeader-sc-1q6y56a-0 hZxUBM TextElement__Spacer-sc-18l8wi5-0 hpeTzd'})
for h4 in all_h4:
    print(h4.text)
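To see the difference between find() and find_all() without hitting the live site, here is a minimal sketch on an inline HTML string. The markup, the class name cell and the prices are made-up placeholders, not what Coinbase actually serves.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: several <h4> elements share the same class
SAMPLE_HTML = """
<table>
  <tr><td><h4 class="cell">Bitcoin</h4></td><td><h4 class="cell">$9,000.00</h4></td></tr>
  <tr><td><h4 class="cell">Ethereum</h4></td><td><h4 class="cell">$200.00</h4></td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find() returns only the first match (or None when nothing matches)
first = soup.find("h4", {"class": "cell"})
first_text = first.text

# find_all() returns every match as a list-like ResultSet
all_h4 = soup.find_all("h4", {"class": "cell"})
second_text = all_h4[1].text                  # pick one item by index
first_row = [h4.text for h4 in all_h4[0:2]]   # pick a range with a slice

print(first_text)
print(second_text)
print(first_row)
```

On the real page the position of the price in the list may differ, so it is worth printing the whole list once before relying on a fixed index.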
It can be easier to work with the data if you keep it in a list of lists, an array, or a DataFrame. To build a list of lists, it is easier to find the rows <tr> first and then search for <h4> inside every row.
import requests
from bs4 import BeautifulSoup
url = 'https://www.coinbase.com/charts'
# Assumed example header; the original snippet used `headers` without defining it
headers = {'User-Agent': 'Mozilla/5.0'}
r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.text, 'html.parser')
all_tr = soup.find_all('tr')
data = []
for tr in all_tr:
    row = []
    for h4 in tr.find_all('h4'):
        row.append(h4.text)
    if row:  # skip empty row
        data.append(row)
for row in data:
    print(row)
A class is not needed to get all the h4 elements inside a row.
BTW: This page uses JavaScript to append new rows when you scroll, but requests and BeautifulSoup can't run JavaScript. If you need all the rows, you may need Selenium to control a web browser that runs the JavaScript.
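On the class question: generated class strings like the one in the question are long and brittle, so it may help to see how BeautifulSoup matches classes. This is a small sketch on made-up markup; the shortened class names are placeholders chosen for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with a multi-class <h4>, similar in shape to the question's
SAMPLE_HTML = (
    '<h4 class="Header hZxUBM Spacer hpeTzd">Bitcoin</h4>'
    '<h4 class="Header other">Name</h4>'
)

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# The full class string matches only when it equals the attribute value exactly
exact = soup.find_all("h4", {"class": "Header hZxUBM Spacer hpeTzd"})

# A single class name matches any element that carries that class
by_one_class = soup.find_all("h4", class_="hZxUBM")

# A CSS selector can require several classes, in any order
by_selector = soup.select("h4.hpeTzd.Header")

# No class filter at all returns every <h4>
every_h4 = soup.find_all("h4")

print(len(exact), len(by_one_class), len(by_selector), len(every_h4))
```

Matching on one stable class, or on the tag alone inside a row, tends to survive small changes in the page better than the full generated string.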

Related

Retrieving td value from tr that has certain other td value

I need to get the links from a td in rows that have a certain td value.
This is a tr in the table, and I want to get the link from the div "Match" if the div "Home team" has a certain value. There are many rows, and I want to find every matching link. I have tried the code below, but every time I only get the first row of the table. Here is the link: https://wp.nif.no/PageTournamentDetailWithMatches.aspx?tournamentId=403373&seasonId=200937&number=all . Note that I translated some of the values to English in the examples below.
homegames = browser.find_elements_by_xpath('//div[@data-title = "Home team"]/a[text()="Cleveland"]//parent::div//parent::td//parent::tr')
for link in homegames:
    print(link.find_element_by_xpath('//td[3]/div/a').get_attribute('href'))
<tr>
<td><div data-title="Date">23.10.2021</div></td>
<td><div data-title="Tid">16:15</div></td>
<td><div data-title="Matchnr">
2121503051
</div>
</td><td><div data-title="Home team">Cleveland</div></td>
<td><div data-title="Away team">
Ohio Travellers</div></td>
<td><div data-title="Court">F21</div></td><td><div data-title="Result">71 - 64</div></td>
<td><div data-title="Referee">John Doe<br>Will Smith<br></div></td></tr>

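The data is within the HTML source, so there is no need to use Selenium. Regardless of whether you use Selenium, you can let BeautifulSoup find the specific tags you are after.
Without Selenium, it takes a little manipulation: the rows are stored as JSON in the value attribute of an input element, and the HTML inside that JSON is escaped, so it has to be loaded and unescaped before parsing.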
import requests
from bs4 import BeautifulSoup
import json
import html
keyword = 'Askim'
url = 'https://wp.nif.no/PageTournamentDetailWithMatches.aspx?tournamentId=403373&seasonId=200937&number=all'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
jsonStr = soup.find('div', {'class':'xwp_table_bg'}).find_next('input')['value']
jsonData = json.loads(jsonStr)
links_list = []
for each in jsonData['data']:
    #each = jsonData['data'][6]
    htmlStr = ''.join(each)
    htmlStr = html.unescape(htmlStr)
    soup = BeautifulSoup(htmlStr, 'html.parser')
    if soup.find('div', {'data-title':'Hjemmelag'}, text=keyword):
        link = soup.find('div', {'data-title':'Kampnr'}).find('a')['href']
        links_list.append(link)
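The decode step can be tried offline. Below is a minimal sketch that assumes a payload shaped like the one the code above reads: a JSON object whose "data" rows hold HTML-escaped cells. The payload, the links and the helper name links_for_home_team are made up for illustration; only the data-title values are taken from the code above.

```python
import html
import json
from bs4 import BeautifulSoup

# Hypothetical payload: each row is a list of HTML-escaped cell strings
json_str = json.dumps({
    "data": [
        ["&lt;div data-title=&quot;Kampnr&quot;&gt;&lt;a href=&quot;/match/1&quot;&gt;1&lt;/a&gt;&lt;/div&gt;",
         "&lt;div data-title=&quot;Hjemmelag&quot;&gt;Askim&lt;/div&gt;"],
        ["&lt;div data-title=&quot;Kampnr&quot;&gt;&lt;a href=&quot;/match/2&quot;&gt;2&lt;/a&gt;&lt;/div&gt;",
         "&lt;div data-title=&quot;Hjemmelag&quot;&gt;Other&lt;/div&gt;"],
    ]
})

def links_for_home_team(payload, keyword):
    """Return the match links of rows whose home-team cell equals keyword."""
    links = []
    for cells in json.loads(payload)["data"]:
        # Join the cells of one row and turn &lt; / &quot; back into real markup
        row_html = html.unescape("".join(cells))
        row = BeautifulSoup(row_html, "html.parser")
        if row.find("div", {"data-title": "Hjemmelag"}, string=keyword):
            links.append(row.find("div", {"data-title": "Kampnr"}).find("a")["href"])
    return links

print(links_for_home_team(json_str, "Askim"))
```

Parsing each row as its own small document keeps the home-team check and the link lookup scoped to that row, which avoids the "always the first row" problem from the question.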

How to scrape particular data from Yahoo Finance?

I am new to web scraping and I'm trying to scrape the "statistics" page of Yahoo Finance for AAPL. Here's the link: https://finance.yahoo.com/quote/AAPL/key-statistics?p=AAPL
Here is the code I have so far...
from bs4 import BeautifulSoup
from requests import get
url = 'https://finance.yahoo.com/quote/AAPL/key-statistics?p=AAPL'
response = get(url)
soup = BeautifulSoup(response.text, 'html.parser')
stock_data = soup.find_all("table")
for stock in stock_data:
    print(stock.text)
When I run that, I get all of the table data on the page. However, I only want specific data from each table (e.g. "Market Cap", "Revenue", "Beta").
I tried messing around with the code by doing print(stock[1].text) to see if I could limit the amount of data returned to just the second value in each table but that returned an error message. Am I on the right track by using BeautifulSoup or do I need to use a completely different library? What would I have to do in order to only return particular data and not all of the table data on the page?

Examining the HTML code gives you the best idea of how BeautifulSoup will handle what it sees.
The web page seems to contain several tables, which in turn contain the information you are after. The tables follow a certain logic.
First scrape all the tables on the web page, then find all the table rows (<tr>) and the table data (<td>) that those rows contain.
Below is one way of achieving this. I even threw in a function to print only a specific measurement.
from bs4 import BeautifulSoup
from requests import get
url = 'https://finance.yahoo.com/quote/AAPL/key-statistics?p=AAPL'
response = get(url)
soup = BeautifulSoup(response.text, 'html.parser')
stock_data = soup.find_all("table")
# stock_data will contain multiple tables, next we examine each table one by one
for table in stock_data:
    # Scrape all table rows into variable trs
    trs = table.find_all('tr')
    for tr in trs:
        # Scrape all table data tags into variable tds
        tds = tr.find_all('td')
        # Skip rows without a measure/value pair (e.g. header rows)
        if len(tds) < 2:
            continue
        # Index 0 of tds will contain the measurement
        print("Measure: {}".format(tds[0].get_text()))
        # Index 1 of tds will contain the value
        print("Value: {}".format(tds[1].get_text()))
        print("")
def get_measurement(table_array, measurement):
    for table in table_array:
        trs = table.find_all('tr')
        for tr in trs:
            tds = tr.find_all('td')
            if len(tds) >= 2 and measurement.lower() in tds[0].get_text().lower():
                return tds[1].get_text()
# print only one measurement, e.g. operating cash flow
print(get_measurement(stock_data, "operating cash flow"))
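If you want several measurements at once, one option is to collect every row into a dictionary first and filter afterwards. This is only a sketch on inline HTML: the table layout, labels and values are placeholders, and table_pairs and pick are hypothetical helper names.

```python
from bs4 import BeautifulSoup

# Hypothetical two-column tables; the labels and values are placeholders
SAMPLE_HTML = """
<table>
  <tr><th>Measure</th><th>Value</th></tr>
  <tr><td>Market Cap (intraday)</td><td>1.0T</td></tr>
  <tr><td>Beta</td><td>1.2</td></tr>
</table>
<table>
  <tr><td>Revenue (ttm)</td><td>100B</td></tr>
</table>
"""

def table_pairs(soup):
    """Collect {measure: value} from every two-cell row of every table."""
    pairs = {}
    for table in soup.find_all("table"):
        for tr in table.find_all("tr"):
            tds = tr.find_all("td")
            if len(tds) >= 2:  # header rows use <th> and are skipped
                pairs[tds[0].get_text(strip=True)] = tds[1].get_text(strip=True)
    return pairs

def pick(pairs, wanted):
    """Return the entries whose label contains one of the wanted names."""
    return {label: value for label, value in pairs.items()
            if any(name.lower() in label.lower() for name in wanted)}

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
selected = pick(table_pairs(soup), ["Market Cap", "Revenue", "Beta"])
print(selected)
```

A dictionary also makes it obvious when a label is missing, because the key is simply absent instead of raising an index error.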

Although this isn't Yahoo Finance, you can do something very similar like this...
import requests
from bs4 import BeautifulSoup
base_url = 'https://finviz.com/screener.ashx?v=152&o=price&t=MSFT,AAPL,SBUX,S,GOOG&o=price&c=0,1,2,3,4,5,6,7,8,9,25,63,64,65,66,67'
html = requests.get(base_url)
soup = BeautifulSoup(html.content, "html.parser")
main_div = soup.find('div', attrs = {'id':'screener-content'})
light_rows = main_div.find_all('tr', class_="table-light-row-cp")
dark_rows = main_div.find_all('tr', class_="table-dark-row-cp")
data = []
for rows_set in (light_rows, dark_rows):
    for row in rows_set:
        row_data = []
        for cell in row.find_all('td'):
            val = cell.a.get_text()
            row_data.append(val)
        data.append(row_data)
# sort rows to maintain original order
data.sort(key=lambda x: int(x[0]))
import pandas
pandas.DataFrame(data).to_csv("C:\\your_path\\AAA.csv", header=False)
This is a nice substitute in case Yahoo decides to deprecate more of the functionality of its API. I know they cut out a lot of things (mostly historical quotes) a couple of years ago. It was sad to see that go away.

Adding objects for each item added from scraping data from a website

I am trying to retrieve data from a website and add an object for each row of data. I am new to Python and I am clearly missing something, because I only get one object. What I am trying to get is all the objects, each stored as key-value pairs:
import urllib.request
import bs4 as bs
url = 'http://freemusicarchive.org/search/?quicksearch=drake/'
search = ''
req = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
html = urllib.request.urlopen(req).read()
soup = bs.BeautifulSoup(html, 'html.parser')
tracks_info = [{}]
spans = soup.find_all('span', {'class': 'ptxt-artist'})
for span in spans:
    arts = span.find_all('a')
    for art in arts:
        print(art.text)
spans = soup.find_all('span', {'class': 'ptxt-track'})
for span in spans:
    tracks = span.find_all('a')
    for track in tracks:
        print(track.text)
for download_links in soup.find_all('a', {'title': 'Download'}):
    print(download_links.get('href'))
for info in tracks_info:
    info.update({'artist': art.text})
    info.update({'track': track.text})
    info.update({'link': download_links.get('href')})
    print(info)
I failed to add an object for each element I get from the website. I'm clearly doing something wrong or missing a step, and any help would be much appreciated!

You could use a slightly different structure and syntax, such as below.
I use a CSS attribute "contains" selector to retrieve the rows of info, as the id is different for each track.
The CSS selector div[class*="play-item gcol gid-electronic tid-"] looks for div elements whose class attribute value contains play-item gcol gid-electronic tid-.
Within each row, the columns of interest are then selected by their class name, and a descendant CSS selector is used to reach the a tag for the final download link.
import urllib.request
import bs4 as bs
import pandas as pd
url = 'http://freemusicarchive.org/search/?quicksearch=drake/'
req = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
html = urllib.request.urlopen(req).read()
soup = bs.BeautifulSoup(html, 'html.parser')
tracks_Info = []
headRow = ['Artist','TrackName','DownloadLink']
for item in soup.select('div[class*="play-item gcol gid-electronic tid-"]'):
    tracks_Info.append([item.select_one(".ptxt-artist").text.strip(), item.select_one(".ptxt-track").text, item.select_one(".playicn a").get('href')])
df = pd.DataFrame(tracks_Info,columns=headRow)
print(df)
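Since the question asked for one object per row with key-value pairs, the same selectors can also fill a list of dictionaries instead of a DataFrame. A minimal sketch, assuming markup modeled on the class names used above; the track data and links are made up.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modeled on the class names above; the track data is made up
SAMPLE_HTML = """
<div class="play-item gcol gid-electronic tid-101">
  <span class="playicn"><a href="/download/101" title="Download">dl</a></span>
  <span class="ptxt-artist"><a href="/a/1"> Artist One </a></span>
  <span class="ptxt-track"><a href="/t/101">Track A</a></span>
</div>
<div class="play-item gcol gid-electronic tid-102">
  <span class="playicn"><a href="/download/102" title="Download">dl</a></span>
  <span class="ptxt-artist"><a href="/a/2">Artist Two</a></span>
  <span class="ptxt-track"><a href="/t/102">Track B</a></span>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

tracks_info = []
# [class*="..."] matches when the class attribute contains the given substring
for item in soup.select('div[class*="play-item gcol gid-electronic tid-"]'):
    # One dictionary per row keeps artist, track and link together
    tracks_info.append({
        "artist": item.select_one(".ptxt-artist").get_text(strip=True),
        "track": item.select_one(".ptxt-track").get_text(strip=True),
        "link": item.select_one(".playicn a")["href"],
    })

for info in tracks_info:
    print(info)
```

Selecting everything relative to the row element is what keeps each artist paired with its own track and link, which the three separate loops in the question could not do.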

How to only scrape the first item in a row using Beautiful Soup

I am currently running the following python script:
import requests
from bs4 import BeautifulSoup
origin= ["USD","GBP","EUR"]
i=0
while i < len(origin):
    page = requests.get("https://www.x-rates.com/table/?from="+origin[i]+"&amount=1")
    soup = BeautifulSoup(page.content, "html.parser")
    tables = soup.findChildren('table')
    my_table = tables[0]
    rows = my_table.findChildren(['td'])
    i = i +1
    for rows in rows:
        cells = rows.findChildren('a')
        for cell in cells:
            value = cell.string
            print(value)
To scrape data from this HTML:
https://i.stack.imgur.com/DkX83.png
The problem I have is that I'm struggling to scrape only the first column without also scraping the second one, because both are under <a> tags and in the same table row. The href is the only thing that differentiates the two tags; I have tried filtering on it, but that doesn't seem to work and returns a blank value. Also, when I try to sort the data manually, the output is laid out vertically and not horizontally. I am new to coding, so any help would be appreciated :)

Here is another way you might want to try to achieve the same result:
import requests
from bs4 import BeautifulSoup
keywords = ["USD","GBP","EUR"]
for keyword in keywords:
    page = requests.get("https://www.x-rates.com/table/?from={}&amount=1".format(keyword))
    soup = BeautifulSoup(page.content, "html.parser")
    for items in soup.select_one(".ratesTable tbody").find_all("tr"):
        data = [item.text for item in items.find_all("td")[1:2]]
        print(data)
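The [1:2] slice is what limits each row to the first rate column. Here is a small offline sketch of that idea; the table below is a made-up stand-in that only borrows the ratesTable and rtRates class names, and the numbers are placeholders.

```python
from bs4 import BeautifulSoup

# Hypothetical table: a name cell followed by two rate cells per row
SAMPLE_HTML = """
<table class="ratesTable">
  <tbody>
    <tr><td>Euro</td><td class="rtRates"><a href="/a">0.9</a></td><td class="rtRates"><a href="/b">1.1</a></td></tr>
    <tr><td>Yen</td><td class="rtRates"><a href="/c">110</a></td><td class="rtRates"><a href="/d">0.009</a></td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

first_rates = []
for tr in soup.select_one(".ratesTable tbody").find_all("tr"):
    tds = tr.find_all("td")
    # tds[1:2] is a one-element list holding only the first rate cell;
    # unlike tds[1], the slice is simply empty when the row is too short
    for td in tds[1:2]:
        first_rates.append(td.get_text(strip=True))

print(first_rates)
```

Collecting the values into one list per currency, rather than printing inside the loop, also gives you the horizontal layout the question asked about.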

It is easier to follow what happens when you print every item you get, starting from the top, in this case from the table element. The idea is to go one step at a time so you can follow along.
import requests
from bs4 import BeautifulSoup
origin= ["USD","GBP","EUR"]
i=0
while i < len(origin):
    page = requests.get("https://www.x-rates.com/table/?from="+origin[i]+"&amount=1")
    soup = BeautifulSoup(page.content, "html.parser")
    tables = soup.findChildren('table')
    my_table = tables[0]
    i = i +1
    rows = my_table.findChildren('tr')
    for row in rows:
        cells = row.findAll('td',class_='rtRates')
        if len(cells) > 0:
            first_item = cells[0].find('a')
            value = first_item.string
            print(value)

Using the BeautifulSoup find method to obtain data from a table row

I am writing a Python script using BeautifulSoup to scrape values from this webpage: https://uk-air.defra.gov.uk/latest/currentlevels
I want to use soup.find() to get values for "Hourly mean Nitrogen dioxide" and "Last updated" from the table row where the "Monitoring site" is "Edinburgh St Leonards".
As I am new to web scraping, I am having a bit of trouble, so I would be grateful for any help on this.

Scrape all the HTML tables into a list of tables.
The table layout may change, so you should not rely on a fixed row/column index.
Part of the following script looks up the column index of the data being searched for. It also prints the header name, so you know which data you are getting.
from bs4 import BeautifulSoup
import urllib.request
import re
with urllib.request.urlopen('https://uk-air.defra.gov.uk/latest/currentlevels?view=region') as response:
    htmlData = response.read()
soup = BeautifulSoup(htmlData, 'html5lib')
tables = soup.find_all('table', attrs={'class':'current_levels_table'})
#what you want to check:
Iwant = ['nitrogen', 'update']
about = 'Edinburgh'
for table in tables:
    #maps the column number of each wanted measure to its real header name
    index = {}
    #get header to have the data (we're looking for) column number and table real names
    table_head = table.find('thead')
    headrows = table_head.find_all('tr')
    measures = headrows[1].find_all('th')
    for colnum, measure in enumerate(measures):
        index.update({colnum: measure.text.strip() for wanted in Iwant if re.search(wanted, measure.text, re.IGNORECASE)})
    #get table content and look for Edinburgh
    table_body = table.find('tbody')
    rows = table_body.find_all('tr')
    for row in rows:
        cels = row.find_all('td')
        rowContent = [cel.text.strip().replace(u'\xa0', u' ').replace(u'\n Timeseries Graph', u'') for cel in cels if cel]
        if re.search(about, rowContent[0], re.IGNORECASE):
            for indexwanted, measurewanted in index.items():
                print(measurewanted, ':', rowContent[indexwanted])
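The header lookup can be isolated into a small function and tried on inline HTML. This is a sketch under assumptions: the table below has a single header row and placeholder values, unlike the real page, and lookup is a hypothetical helper name.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical table; the site name comes from the question, the values are placeholders
SAMPLE_HTML = """
<table>
  <thead><tr><th>Monitoring site</th><th>Ozone</th><th>Nitrogen dioxide</th><th>Last updated</th></tr></thead>
  <tbody>
    <tr><td>Aberdeen</td><td>10</td><td>11</td><td>01/01/2000 10:00</td></tr>
    <tr><td>Edinburgh St Leonards</td><td>20</td><td>21</td><td>01/01/2000 11:00</td></tr>
  </tbody>
</table>
"""

def lookup(table, site, wanted):
    """Return {header: cell text} for the first row whose first cell matches site."""
    headers = [th.get_text(strip=True) for th in table.find("thead").find_all("th")]
    # Map column position -> header name for every header matching a wanted word
    columns = {i: h for i, h in enumerate(headers)
               if any(re.search(w, h, re.IGNORECASE) for w in wanted)}
    for row in table.find("tbody").find_all("tr"):
        cells = [td.get_text(strip=True) for td in row.find_all("td")]
        if cells and re.search(site, cells[0], re.IGNORECASE):
            return {name: cells[i] for i, name in columns.items()}
    return {}

table = BeautifulSoup(SAMPLE_HTML, "html.parser").find("table")
result = lookup(table, "edinburgh", ["nitrogen", "update"])
print(result)
```

Because the columns are found by header text, the function keeps working if a column is added or moved, as long as the header wording stays recognizable.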

Making use of the suggestion from d2718nis, you can do it in this way. Of course, many other ways would work too.
First, find the link that has the 'Edinburgh St Leonards' text in it. Then find the grandparent of that link element, which is a tr element. Now identify the td elements in the tr. When you examine the table you see that the columns you want are the 4th and 7th. Get those from all of the td elements as the (0-relative) 3rd and 6th. Finally, display the crude texts of these elements.
You will need to do something clever to extract properly readable strings from these results.
>>> import requests
>>> import bs4
>>> page = requests.get('https://uk-air.defra.gov.uk/latest/currentlevels', headers={'User-Agent': 'Not blank'}).content
>>> soup = bs4.BeautifulSoup(page, 'lxml')
>>> Edinburgh_link = soup.find_all('a',string='Edinburgh St Leonards')[0]
>>> Edinburgh_link
Edinburgh St Leonards
>>> Edinburgh_row = Edinburgh_link.findParent('td').findParent('tr')
>>> Edinburgh_columns = Edinburgh_row.findAll('td')
>>> Edinburgh_columns[3]
<td class="center"><span class="bg_low1 bold">20 (1 Low)</span></td>
>>> Edinburgh_columns[6]
<td>05/08/2017<br/>14:00:00</td>
>>> Edinburgh_columns[3].text
'20\xa0(1\xa0Low)'
>>> Edinburgh_columns[6].text
'05/08/201714:00:00'
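One possible way to get readable strings out of those two cells is to let get_text() insert a separator where the br tag splits the text and to replace the non-breaking spaces. A minimal sketch, using the two cells from the session above rebuilt as a standalone fragment; clean_text is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# The two cells shown in the session above, rebuilt as a standalone fragment
# (the non-breaking spaces are written as \xa0)
SAMPLE_HTML = (
    '<table><tr>'
    '<td class="center"><span class="bg_low1 bold">20\xa0(1\xa0Low)</span></td>'
    '<td>05/08/2017<br/>14:00:00</td>'
    '</tr></table>'
)

def clean_text(cell):
    """Join the text pieces of a cell with spaces and normalize non-breaking spaces."""
    # separator=" " puts a space where <br/> split the text; strip=True trims each piece
    return cell.get_text(separator=" ", strip=True).replace("\xa0", " ")

cells = BeautifulSoup(SAMPLE_HTML, "html.parser").find_all("td")
level = clean_text(cells[0])
updated = clean_text(cells[1])
print(level)
print(updated)
```

From there the timestamp can be parsed further if needed, but the date format used by the site is something you would have to confirm yourself.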

You can start with this:
import requests
from bs4 import BeautifulSoup
# Request the page, set headers to prevent 403 Forbidden
page = requests.get(
    url='https://uk-air.defra.gov.uk/latest/currentlevels',
    headers={'User-Agent': 'Not blank'})
# Get html from page
html = page.text
# BeautifulSoup object
soup = BeautifulSoup(html, 'html5lib')
for table in soup.find_all('table'):
    # Print all tables on the page
    print(table)
