Beautifulsoup scraping table from website with requests for pandas

Beautifulsoup scraping table from website with requests for pandas - python

I am trying to download the data on this website
https://coinmunity.co/
...in order to manipulate later it in Python or Pandas
I have tried to do it directly to Pandas via Requests, but did not work, using this code:
res = requests.get("https://coinmunity.co/")
soup = BeautifulSoup(res.content, 'lxml')
table = soup.find_all('table')[0]
dfm = pd.read_html(str(table), header = 0)
dfm = dfm[0].dropna(axis=0, thresh=4)
dfm.head()
In most of the things I tried, I could only get to the info in the headers, which seems to be the only table seen in this page by the code.
Seeing that this did not work, I tried to do the same scraping with Requests and BeautifulSoup, but it did not work either. This is my code:
import requests
from bs4 import BeautifulSoup
res = requests.get("https://coinmunity.co/")
soup = BeautifulSoup(res.content, 'lxml')
#table = soup.find_all('table')[0]
#table = soup.find_all('div', {'class':'inner-container'})
#table = soup.find_all('tbody', {'class':'_ngcontent-c0'})
#table = soup.find_all('table')[0].findAll('tr')
#table = soup.find_all('table')[0].find('tbody')#.find_all('tbody _ngcontent-c3=""')
table = soup.find_all('p', {'class':'stats change positiveSubscribers'})
You can see in the lines commented, all the things I have tried, but nothing worked.
Is there any way to easily download that table to use it on Pandas/Python, in the tidiest, easier and quickest possible way?
Thank you

Since the content is loaded dynamically after the initial request is made, you won't be able to scrape this data with request. Here's what I would do instead:
from selenium import webdriver
import pandas as pd
import time
from bs4 import BeautifulSoup
driver = webdriver.Firefox()
driver.implicitly_wait(10)
driver.get("https://coinmunity.co/")
html = driver.page_source.encode('utf-8')
soup = BeautifulSoup(html, 'lxml')
results = []
for row in soup.find_all('tr')[2:]:
data = row.find_all('td')
name = data[1].find('a').text
value = data[2].find('p').text
# get the rest of the data you need about each coin here, then add it to the dictionary that you append to results
results.append({'name':name, 'value':value})
df = pd.DataFrame(results)
df.head()
name value
0 NULS 14,005
1 VEN 84,486
2 EDO 20,052
3 CLUB 1,996
4 HSR 8,433
You will need to make sure that geckodriver is installed and that it is in your PATH. I just scraped the name of each coin and the value but getting the rest of the information should be easy.

Related

How can I get an html table into python if the table has different tabs?

https://www.worldometers.info/coronavirus/#countries is the website that I'm using and I'm trying to get the table with All tab selected to pull from html into my jupyter notebook. The problem I seem to be having is if I use class = 'table' it pulls all continent tabs first then the all table and it messes up how my data gets pulled in when I try looking at rows.
import requests
import lxml.html as lh
import pandas as pd
import csv
import requests
from bs4 import BeautifulSoup
url = 'https://www.worldometers.info/coronavirus/#countries'
page = requests.get(url)
print(page.status_code) #Checking the http response status code. Should be 200
soup = BeautifulSoup(page.content, 'html.parser')
print(soup.prettify())
all_tables=soup.find_all("table")
right_table = soup.find('table',{'class':'table'})
col_headers = [th.getText() for th in right_table.findAll('th')]
data = [[td.getText() for td in right_table.findAll('td')] for tr in right_table()]
When I try to combine the col_headers and data it says I have13 columns passed, data had 2990 columns. Any guidance would be appreciated.

You have "flattened" the table - created a list of all <td>s. What you need to do is to create a nested list:
data = [ [ td.text for td in tr.find_all("td") ] for tr in right_table.find_all("tr")]
df = pd.DataFrame(data, columns=col_header)
print(df.shape) # (231, 13)

extract title from a link using BeautifulSoup

I am using beautifulsoup to scrape a website but need help with this as I am new to python and beautifulsoup
How do I get VET from the following
"[[VET]]"
This is my code so far
import bs4 as bs
import urllib.request
import pandas as pd
#This is the Home page of the website
source = urllib.request.urlopen('file:///C:/Users/Aiden/Downloads/stocks/Stock%20Premarket%20Trading%20Activity%20_%20Biggest%20Movers%20Before%20the%20Market%20Opens.html').read().decode('utf-8')
soup = bs.BeautifulSoup(source,'lxml')
#find the Div and put all info into varTable
table = soup.find('table',{"id":"decliners_tbl"}).tbody
#find all Rows in table and puts into varTableRows
tableRows = table.find_all('tr')
print ("There is ",len(tableRows),"Rows in the Table")
print(tableRows)
columns = [tableRows[1].find_all('td')]
print(columns)
a = [tableRows[1].find_all("a")]
print(a)
So my output from print(a) is "[[<a class="mplink popup_link" href="https://marketchameleon.com/Overview/VET/">VET</a>]]"
and I want to extract VET out
AD

You can use a.text or a.get_text().
If you have multiple elements you'd need list comprehension on this function

Thank you for all the reply, I was able to work it out using the following code
source = urllib.request.urlopen('file:///C:/Users/Aiden/Downloads/stocks/Stock%20Premarket%20Trading%20Activity%20_%20Biggest%20Movers%20Before%20the%20Market%20Opens.html').read().decode('utf-8')
soup = bs.BeautifulSoup(source,'html.parser')
table = soup.find("table",id="decliners_tbl")
for decliners in table.find_all("tbody"):
rows = decliners.find_all("tr")
for row in rows:
ticker = row.find("a").text
volume = row.findAll("td", class_="rightcell")[3].text
print(ticker, volume)

How to scrape particular data from Yahoo Finance?

I am new to web scraping and I'm trying to scrape the "statistics" page of yahoo finance for AAPL. Here's the link: https://finance.yahoo.com/quote/AAPL/key-statistics?p=AAPL
Here is the code I have so far...
from bs4 import BeautifulSoup
from requests import get
url = 'https://finance.yahoo.com/quote/AAPL/key-statistics?p=AAPL'
response = get(url)
soup = BeautifulSoup(response.text, 'html.parser')
stock_data = soup.find_all("table")
for stock in stock_data:
print(stock.text)
When I run that, I return all of the table data on the page. However, I only want specific data from each table (e.g. "Market Cap", "Revenue", "Beta").
I tried messing around with the code by doing print(stock[1].text) to see if I could limit the amount of data returned to just the second value in each table but that returned an error message. Am I on the right track by using BeautifulSoup or do I need to use a completely different library? What would I have to do in order to only return particular data and not all of the table data on the page?

Examining the HTML-code gives you the best idea of how BeautifulSoup will handle what it sees.
The web page seems to contain several tables, which in turn contain the information you are after. The tables follow a certain logic.
First scrape all the tables on the web page, then find all the table rows (<tr>) and the table data (<td>) that those rows contain.
Below is one way of achieving this. I even threw in a function to print only a specific measurement.
from bs4 import BeautifulSoup
from requests import get
url = 'https://finance.yahoo.com/quote/AAPL/key-statistics?p=AAPL'
response = get(url)
soup = BeautifulSoup(response.text, 'html.parser')
stock_data = soup.find_all("table")
# stock_data will contain multiple tables, next we examine each table one by one
for table in stock_data:
# Scrape all table rows into variable trs
trs = table.find_all('tr')
for tr in trs:
# Scrape all table data tags into variable tds
tds = tr.find_all('td')
# Index 0 of tds will contain the measurement
print("Measure: {}".format(tds[0].get_text()))
# Index 1 of tds will contain the value
print("Value: {}".format(tds[1].get_text()))
print("")
def get_measurement(table_array, measurement):
for table in table_array:
trs = table.find_all('tr')
for tr in trs:
tds = tr.find_all('td')
if measurement.lower() in tds[0].get_text().lower():
return(tds[1].get_text())
# print only one measurement, e.g. operating cash flow
print(get_measurement(stock_data, "operating cash flow"))

Although this isn't Yahoo Finance, you can do something very similar like this...
import requests
from bs4 import BeautifulSoup
base_url = 'https://finviz.com/screener.ashx?v=152&o=price&t=MSFT,AAPL,SBUX,S,GOOG&o=price&c=0,1,2,3,4,5,6,7,8,9,25,63,64,65,66,67'
html = requests.get(base_url)
soup = BeautifulSoup(html.content, "html.parser")
main_div = soup.find('div', attrs = {'id':'screener-content'})
light_rows = main_div.find_all('tr', class_="table-light-row-cp")
dark_rows = main_div.find_all('tr', class_="table-dark-row-cp")
data = []
for rows_set in (light_rows, dark_rows):
for row in rows_set:
row_data = []
for cell in row.find_all('td'):
val = cell.a.get_text()
row_data.append(val)
data.append(row_data)
# sort rows to maintain original order
data.sort(key=lambda x: int(x[0]))
import pandas
pandas.DataFrame(data).to_csv("C:\\your_path\\AAA.csv", header=False)
This is a nice substitute in case Yahoo decided to depreciate more of the functionality of their API. I know they cut out a lot of things (mostly historical quotes) a couple years ago. It was sad to see that go away.

How can I loop through all <th> tags within my script for web scraping?

As of now, I'm only getting ['1'] as the output of what's being printed with my current code below. I want to grab 1-54 on the Team Batting table in the Rk column on the website https://www.baseball-reference.com/teams/NYY/2019.shtml.
How would I go about modifying colNum so it can print the 1-54 in the Rk column? I'm pointing out the colNum line because I feel the issue lies there but I could be wrong.
import pandas as pd
import requests
from bs4 import BeautifulSoup
page = requests.get('https://www.baseball-reference.com/teams/NYY/2019.shtml')
soup = BeautifulSoup(page.content, 'html.parser') # parse as HTML page, this is the source code of the page
week = soup.find(class_='table_outer_container')
items = week.find("thead").get_text() # grabs table headers
th = week.find("th").get_text() # grabs Rk only.
tbody = week.find("tbody")
tr = tbody.find("tr")
thtwo = tr.find("th").get_text()
colNum = [thtwo for thtwo in thtwo]
print(colNum)

Your mistake was in the last few lines as you mentioned. If I understood right, you wanted a list of all the values in the "Rk" column. In order to get all the rows, you have to use the find_all() function. I tweaked your code a little bit in order to get the text of the first field in each row in the following lines:
import pandas as pd
import requests
from bs4 import BeautifulSoup
page = requests.get('https://www.baseball-reference.com/teams/NYY/2019.shtml')
soup = BeautifulSoup(page.content, 'html.parser')
is the source code of the page
week = soup.find(class_='table_outer_container')
items = week.find("thead").get_text()
th = week.find("th").get_text()
tbody = week.find("tbody")
tr = tbody.find_all("tr")
colnum = [row.find("th").get_text() for row in tr]
print(colnum)

How to only scrape the first item in a row using Beautiful Soup

I am currently running the following python script:
import requests
from bs4 import BeautifulSoup
origin= ["USD","GBP","EUR"]
i=0
while i < len(origin):
page = requests.get("https://www.x-rates.com/table/?from="+origin[i]+"&amount=1")
soup = BeautifulSoup(page.content, "html.parser")
tables = soup.findChildren('table')
my_table = tables[0]
rows = my_table.findChildren(['td'])
i = i +1
for rows in rows:
cells = rows.findChildren('a')
for cell in cells:
value = cell.string
print(value)
To scrape data from this HTML:
https://i.stack.imgur.com/DkX83.png
The problem I have is that I'm struggling to only scrape the first column without scraping the second one as well because they are both under tags and in the same table row as each other. The href is the only thing which differentiates between the two tags and I have tried filtering using this but it doesn't seem to work and returns a blank value. Also when i try to sort the data manually the output is amended vertically and not horizontally, I am new to coding so any help would be appreciated :)

There is another way you might wanna try as well to achieve the same:
import requests
from bs4 import BeautifulSoup
keywords = ["USD","GBP","EUR"]
for keyword in keywords:
page = requests.get("https://www.x-rates.com/table/?from={}&amount=1".format(keyword))
soup = BeautifulSoup(page.content, "html.parser")
for items in soup.select_one(".ratesTable tbody").find_all("tr"):
data = [item.text for item in items.find_all("td")[1:2]]
print(data)

It is easier to follow what happens when you print every item you got from the top e.g. in this case from table item. The idea is to go one by one so you can follow.
import requests
from bs4 import BeautifulSoup
origin= ["USD","GBP","EUR"]
i=0
while i < len(origin):
page = requests.get("https://www.x-rates.com/table/?from="+origin[i]+"&amount=1")
soup = BeautifulSoup(page.content, "html.parser")
tables = soup.findChildren('table')
my_table = tables[0]
i = i +1
rows = my_table.findChildren('tr')
for row in rows:
cells = row.findAll('td',class_='rtRates')
if len(cells) > 0:
first_item = cells[0].find('a')
value = first_item.string
print(value)

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Beautifulsoup scraping table from website with requests for pandas - python

Related

How can I get an html table into python if the table has different tabs?

extract title from a link using BeautifulSoup

How to scrape particular data from Yahoo Finance?

How can I loop through all <th> tags within my script for web scraping?

How to only scrape the first item in a row using Beautiful Soup

Categories

Resources