I am new to Python and learning data analysis. I am trying to scrape data from this web page: https://bitinfocharts.com/dogecoin/address/DN5Hp2kCkvCsdwr5SPmwHpiJgjKnC5wcT7
I am able to scrape data from simple websites, but since BitInfoCharts uses tables, I think it may be a more complex HTML setup than the tutorials I am following cover.
My goal is to scrape the data from the table that includes Block, Time, Amount, Balance, etc., and save it to a CSV file. I previously tried using pandas but found it difficult to select the data I want from the HTML.
To do this, I think what I need to do is get the header/table information from the element with class="table abtb tablesorter tablesorter-default" and then pull the information from each row inside it that has class="trb". The number of trb rows changes from page to page (for example, one person may have 7 transactions and another may have 40). I am not exactly sure, though, as this is new territory for me.
I would really appreciate any help.
import requests
from bs4 import BeautifulSoup as bs
url = 'https://bitinfocharts.com/dogecoin/address/DN5Hp2kCkvCsdwr5SPmwHpiJgjKnC5wcT7'
headers = {"User-Agent":"Mozilla/5.0"}
r = requests.get(url, headers=headers)
soup = bs(r.content)
table = soup.find_all("table_maina")
print(table)
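Since pandas came up already, it may be worth one more shot with pandas.read_html before parsing by hand; it can target the table by the table_maina id your snippet refers to. A minimal sketch (not guaranteed to parse this page cleanly):

import requests
import pandas as pd

url = 'https://bitinfocharts.com/dogecoin/address/DN5Hp2kCkvCsdwr5SPmwHpiJgjKnC5wcT7'
r = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})

# read_html returns a list of DataFrames, one per matching <table>
dfs = pd.read_html(r.text, attrs={'id': 'table_maina'})
dfs[0].to_csv('transactions.csv', index=False)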
If you do decide to do it manually, this does the same thing:
import csv
import requests
from bs4 import BeautifulSoup as bs

url = 'https://bitinfocharts.com/dogecoin/address/DN5Hp2kCkvCsdwr5SPmwHpiJgjKnC5wcT7'
headers = {"User-Agent": "Mozilla/5.0"}
r = requests.get(url, headers=headers)
soup = bs(r.content, 'lxml')

table = soup.find(id="table_maina")
headers = []  # reused from here on to hold the CSV header row
datarows = []
for row in table.find_all('tr'):
    heads = row.find_all('th')
    if heads:
        headers = [th.text for th in heads]
    else:
        datarows.append([td.text for td in row.find_all('td')])

fcsv = csv.writer(open('x.csv', 'w', newline=''))
fcsv.writerow(headers)
fcsv.writerows(datarows)
There is only one table element called 'table_maina', so you should call find() vs find_all(). Also, you need to specify the "table" tag as the first argument of find().
Try:
table = soup.find('table', id='table_maina')
for tr in table.find_all('tr', class_='trb'):
    print(tr.text)
Output:
4066317 2022-01-17 15:41:22 UTC2022-01-17 15:41:22 UTC-33,000,000 DOGE (5,524,731.65 USD)220,000,005.04121223 DOGE$36,831,545 # $0.167$-28,974,248
4063353 2022-01-15 11:04:46 UTC2022-01-15 11:04:46 UTC+4,000,000 DOGE (759,634.87 USD)253,000,005.04121223 DOGE$48,046,907 # $0.19$-23,283,618
...
Next, to output each row to a CSV file, try this:
import csv
import requests
from bs4 import BeautifulSoup

url = 'https://bitinfocharts.com/dogecoin/address/DN5Hp2kCkvCsdwr5SPmwHpiJgjKnC5wcT7'
headers = {"User-Agent": "Mozilla/5.0"}
r = requests.get(url, headers=headers, verify=False)
soup = BeautifulSoup(r.content, "html.parser")

table = soup.find("table", id='table_maina')
with open('out.csv', 'w', newline='') as fout:
    csv_writer = csv.writer(fout)
    csv_writer.writerow(['Block', 'Time', 'Amount', 'Balance', 'Price', 'Profit'])
    for tr in table.find_all('tr', class_='trb'):
        tds = tr.find_all('td')
        csv_writer.writerow([x.text for x in tds])
Output:
Block,Time,Amount,Balance,Price,Profit
4066317 2022-01-17 15:41:22 UTC,2022-01-17 15:41:22 UTC,"-33,000,000 DOGE (5,524,731.65 USD)","220,000,005.04121223 DOGE","$36,831,545 # $0.167","$-28,974,248"
...
I am new to web scraping and I'm trying to scrape the "statistics" page of yahoo finance for AAPL. Here's the link: https://finance.yahoo.com/quote/AAPL/key-statistics?p=AAPL
Here is the code I have so far...
from bs4 import BeautifulSoup
from requests import get
url = 'https://finance.yahoo.com/quote/AAPL/key-statistics?p=AAPL'
response = get(url)
soup = BeautifulSoup(response.text, 'html.parser')
stock_data = soup.find_all("table")
for stock in stock_data:
    print(stock.text)
When I run that, it returns all of the table data on the page. However, I only want specific data from each table (e.g. "Market Cap", "Revenue", "Beta").
I tried messing around with the code by doing print(stock[1].text) to see if I could limit the output to just the second value in each table, but that returned an error message. Am I on the right track by using BeautifulSoup, or do I need a completely different library? What would I have to do to return only particular data rather than all of the table data on the page?
Examining the HTML code gives you the best idea of how BeautifulSoup will handle what it sees.
The web page seems to contain several tables, which in turn contain the information you are after. The tables follow a certain logic.
First scrape all the tables on the web page, then find all the table rows (<tr>) and the table data (<td>) that those rows contain.
Below is one way of achieving this. I even threw in a function to print only a specific measurement.
from bs4 import BeautifulSoup
from requests import get

url = 'https://finance.yahoo.com/quote/AAPL/key-statistics?p=AAPL'
response = get(url)
soup = BeautifulSoup(response.text, 'html.parser')
stock_data = soup.find_all("table")

# stock_data will contain multiple tables, next we examine each table one by one
for table in stock_data:
    # Scrape all table rows into variable trs
    trs = table.find_all('tr')
    for tr in trs:
        # Scrape all table data tags into variable tds
        tds = tr.find_all('td')
        # Index 0 of tds will contain the measurement
        print("Measure: {}".format(tds[0].get_text()))
        # Index 1 of tds will contain the value
        print("Value: {}".format(tds[1].get_text()))
        print("")

def get_measurement(table_array, measurement):
    for table in table_array:
        trs = table.find_all('tr')
        for tr in trs:
            tds = tr.find_all('td')
            if measurement.lower() in tds[0].get_text().lower():
                return tds[1].get_text()

# print only one measurement, e.g. operating cash flow
print(get_measurement(stock_data, "operating cash flow"))
Although this isn't Yahoo Finance, you can do something very similar like this...
import requests
from bs4 import BeautifulSoup
import pandas

base_url = 'https://finviz.com/screener.ashx?v=152&o=price&t=MSFT,AAPL,SBUX,S,GOOG&o=price&c=0,1,2,3,4,5,6,7,8,9,25,63,64,65,66,67'
html = requests.get(base_url)
soup = BeautifulSoup(html.content, "html.parser")

main_div = soup.find('div', attrs={'id': 'screener-content'})
light_rows = main_div.find_all('tr', class_="table-light-row-cp")
dark_rows = main_div.find_all('tr', class_="table-dark-row-cp")

data = []
for rows_set in (light_rows, dark_rows):
    for row in rows_set:
        row_data = []
        for cell in row.find_all('td'):
            val = cell.a.get_text()
            row_data.append(val)
        data.append(row_data)

# sort rows to maintain original order
data.sort(key=lambda x: int(x[0]))

pandas.DataFrame(data).to_csv("C:\\your_path\\AAA.csv", header=False)
This is a nice substitute in case Yahoo decides to deprecate more of the functionality of their API. I know they cut out a lot of things (mostly historical quotes) a couple of years ago. It was sad to see that go away.
I'm trying to scrape the "team per game stats" table from this website using this code:
from urllib.request import urlopen as uo
from bs4 import BeautifulSoup as BS
import pandas as pd
url = 'https://www.basketball-reference.com/leagues/NBA_2020.html'
html = uo(url)
soup = BS(html, 'html.parser')
soup.findAll('tr')
headers = [th.getText() for th in soup.findAll('tr')]
headers = headers[1:]
print(headers)
rows = soup.findAll('tr')[1:]
team_stats = [[td.getText() for td in rows[i].findAll('td')]
              for i in range(len(rows))]
stats = pd.DataFrame(team_stats, columns=headers)
But it returns this error:
AssertionError: 71 columns passed, passed data had 212 columns
The problem is that the data is hidden in a commented section of the HTML. The table you want to extract is rendered with JavaScript in your browser; requesting the page with requests or urllib just yields the raw HTML.
So be aware that you have to examine the source code of the page with "View page source" rather than the rendered page with "Inspect Element" if you search for the proper tags to find with BeautifulSoup.
Try this:
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'https://www.basketball-reference.com/leagues/NBA_2020.html'
html = requests.get(url)

section_start = '<span class="section_anchor" id="team-stats-per_game_link" data-label="Team Per Game Stats">'
block_start = html.text.split(section_start)[1].split("<!--")[1]
block = block_start.split("-->")[0]

soup = BeautifulSoup(block, 'html.parser')
data = [th.get_text(",") for th in soup.findAll('tr')]
header = data[0]
header = [x.strip() for x in header.split(",") if x.strip() != ""]
data = [x.split(",") for x in data[1:]]
pd.DataFrame(data, columns=header)
Explanation: You first need to find the commented section by simply splitting the raw HTML just before the section. You extract the section as text, convert to soup and then parse.
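As an aside, if you'd rather not split raw HTML by hand, BeautifulSoup can surface commented blocks directly via its Comment type. A sketch of that alternative (the table id is inferred from the section anchor above, so verify it against the page source):

import requests
import pandas as pd
from bs4 import BeautifulSoup, Comment

url = 'https://www.basketball-reference.com/leagues/NBA_2020.html'
soup = BeautifulSoup(requests.get(url).text, 'html.parser')

# Tables on this site are shipped inside HTML comments; walk every comment
# and re-parse the ones that contain a <table>.
for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
    if '<table' in comment:
        table = BeautifulSoup(comment, 'html.parser').find('table')
        # 'team-stats-per_game' is an assumption based on the anchor id above
        if table is not None and table.get('id') == 'team-stats-per_game':
            df = pd.read_html(str(table))[0]
            print(df.head())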
I expected a csv file created within my desktop directory.
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = "https://basketball.realgm.com/ncaa/conferences/Big-12-Conference/3/Kansas/54/nba-players"
# get permission
response = requests.get(url)
# access html files
soup = BeautifulSoup(response.text, 'html.parser')
# creating data frame
columns = ['Player', 'Position', 'Height', 'Weight', 'Draft Year', 'NBA Teams',
           'Years', 'Games Played', 'Points Per Game', 'Rebounds Per Game',
           'Assists Per Game']
df = pd.DataFrame(columns=columns)

table = soup.find(name='table', attrs={'class': 'tablesaw', 'data-tablesaw-mode': 'swipe', 'id': 'table-6615'}).tbody
trs = table.find('tr')
# rewording html
for tr in trs:
    tds = tr.find_all('td')
    row = [td.text.replace('\n', '') for td in tds]
    df = df.append(pd.Series(row, index=columns), ignore_index=True)
df.to_csv('kansas_player', index=False)
It looks like your soup.find(...) cannot find the table, and that might be why you get a None type returned. Here is my change; you can tailor it to cope with your CSV export need:
from bs4 import BeautifulSoup
import urllib.request

url = "https://basketball.realgm.com/ncaa/conferences/Big-12-Conference/3/Kansas/54/nba-players"
# get permission
response = urllib.request.urlopen(url)
# access html files
html = response.read()
soup = BeautifulSoup(html, 'html.parser')

table = soup.find("table", {"class": "tablesaw"})
At this point, table holds the full table content. From there on, you can easily extract the table row information, for example:
for tr in table.findAll('tr'):
    tds = tr.find_all('td')
    row = [td.text.replace('\n', '') for td in tds]
    .....
Each row is now a plain list of the cell texts.
Finally, you can write each row into the CSV with or without pandas; your call then.
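For the pandas-free route, a minimal sketch with the standard csv module (assuming you collect each row into a rows list inside the loop above; the filename is arbitrary):

import csv

# columns as defined in the question; rows collected in the loop above
with open('kansas_player.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(columns)
    writer.writerows(rows)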
I'm performing some data analysis, for my own knowledge, on NHL spread/betting odds information. I'm able to pull some information, but not the entire data set. I want to pull the list of games and the associated data into a pandas DataFrame, but I haven't been able to work out the proper loop around the HTML tags. I've tried the findAll option and the xpath route. I'm not successful with either.
from bs4 import BeautifulSoup
import requests
page_link = 'https://www.thespread.com/nhl-hockey-public-betting-chart'
page_response = requests.get(page_link, timeout=5)
# here, we fetch the content from the url, using the requests library
page_content = BeautifulSoup(page_response.content, "html.parser")
# Take out the <div> of name and get its value
name_box = page_content.find('div', attrs={'class': 'datarow'})
name = name_box.text.strip()
print (name)
This script goes through each datarow and pulls out each item individually and then appends them into a pandas DataFrame.
from bs4 import BeautifulSoup
import requests
import pandas as pd

page_link = 'https://www.thespread.com/nhl-hockey-public-betting-chart'
page_response = requests.get(page_link, timeout=5)
# here, we fetch the content from the url, using the requests library
page_content = BeautifulSoup(page_response.content, "html.parser")

# Find every datarow <div>; each one holds one game
tables = page_content.find_all('div', class_='datarow')

# Iterate through each datarow and pull out each home/away value separately
rows = []
for table in tables:
    # Get time and date
    time_and_date_tag = table.find_all('div', attrs={"class": "time"})[0].contents
    date = time_and_date_tag[1]
    time = time_and_date_tag[-1]
    # Get teams
    teams_tag = table.find_all('div', attrs={"class": "datacell teams"})[0].contents[-1].contents
    home_team = teams_tag[1].text
    away_team = teams_tag[-1].text
    # Get opening
    opening_tag = table.find_all('div', attrs={"class": "child-open"})[0].contents
    home_open_value = opening_tag[1]
    away_open_value = opening_tag[-1]
    # Get current
    current_tag = table.find_all('div', attrs={"class": "child-current"})[0].contents
    home_current_value = current_tag[1]
    away_current_value = current_tag[-1]
    # Append this game's values
    rows.append([time, date, home_team, away_team,
                 home_open_value, away_open_value,
                 home_current_value, away_current_value])

columns = ['time', 'date', 'home_team', 'away_team',
           'home_open', 'away_open',
           'home_current', 'away_current']
print(pd.DataFrame(rows, columns=columns))
Here is my solution to your question.
from bs4 import BeautifulSoup
import requests

page_link = 'https://www.thespread.com/nhl-hockey-public-betting-chart'
page_response = requests.get(page_link, timeout=5)
# here, we fetch the content from the url, using the requests library
page_content = BeautifulSoup(page_response.content, "html.parser")

for cell in page_content.find_all('div', attrs={'class': 'datarow'}):
    name = cell.text.strip()
    print(name)
I am trying to download the data on this website
https://coinmunity.co/
...in order to manipulate it later in Python or pandas.
I have tried to load it directly into pandas via requests, but it did not work, using this code:
import requests
import pandas as pd
from bs4 import BeautifulSoup

res = requests.get("https://coinmunity.co/")
soup = BeautifulSoup(res.content, 'lxml')
table = soup.find_all('table')[0]
dfm = pd.read_html(str(table), header=0)
dfm = dfm[0].dropna(axis=0, thresh=4)
dfm.head()
In most of the things I tried, I could only get to the info in the headers, which seems to be the only table the code can see on this page.
Seeing that this did not work, I tried to do the same scraping with requests and BeautifulSoup, but it did not work either. This is my code:
import requests
from bs4 import BeautifulSoup
res = requests.get("https://coinmunity.co/")
soup = BeautifulSoup(res.content, 'lxml')
#table = soup.find_all('table')[0]
#table = soup.find_all('div', {'class':'inner-container'})
#table = soup.find_all('tbody', {'class':'_ngcontent-c0'})
#table = soup.find_all('table')[0].findAll('tr')
#table = soup.find_all('table')[0].find('tbody')#.find_all('tbody _ngcontent-c3=""')
table = soup.find_all('p', {'class':'stats change positiveSubscribers'})
You can see in the commented lines all the things I have tried, but nothing worked.
Is there any way to download that table for use in pandas/Python, in the tidiest, easiest and quickest possible way?
Thank you
Since the content is loaded dynamically after the initial request is made, you won't be able to scrape this data with requests alone. Here's what I would do instead:
from selenium import webdriver
import pandas as pd
from bs4 import BeautifulSoup

driver = webdriver.Firefox()
driver.implicitly_wait(10)
driver.get("https://coinmunity.co/")

html = driver.page_source.encode('utf-8')
soup = BeautifulSoup(html, 'lxml')

results = []
for row in soup.find_all('tr')[2:]:
    data = row.find_all('td')
    name = data[1].find('a').text
    value = data[2].find('p').text
    # get the rest of the data you need about each coin here,
    # then add it to the dictionary that you append to results
    results.append({'name': name, 'value': value})

df = pd.DataFrame(results)
df.head()
name value
0 NULS 14,005
1 VEN 84,486
2 EDO 20,052
3 CLUB 1,996
4 HSR 8,433
You will need to make sure that geckodriver is installed and that it is in your PATH. I just scraped the name of each coin and the value but getting the rest of the information should be easy.
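As a side note, if you'd rather not have a browser window open while this runs, Firefox can usually be started headless; a sketch, with the caveat that option handling differs between Selenium versions:

from selenium import webdriver
from selenium.webdriver.firefox.options import Options

# Run Firefox without a visible window; geckodriver must still be on PATH.
options = Options()
options.add_argument('--headless')
driver = webdriver.Firefox(options=options)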