I am very new to using Beautiful Soup and I'm trying to import data from the below url as a pandas dataframe.
However, the final result has the correct columns names, but no numbers for the rows.
What should I be doing instead?
Here is my code:
from bs4 import BeautifulSoup
import requests
def get_tables(html):
soup = BeautifulSoup(html, 'html.parser')
table = soup.find_all('table')
return pd.read_html(str(table))[0]
url = 'https://www.cmegroup.com/trading/interest-rates/stir/eurodollar.html'
html = requests.get(url).content
get_tables(html)
The data you see in the table is loaded from another URL via JavaScript. You can use this example to save the data to csv:
import json
import requests
import pandas as pd
data = requests.get('https://www.cmegroup.com/CmeWS/mvc/Quotes/Future/1/G').json()
# uncomment this to print all data:
# print(json.dumps(data, indent=4))
df = pd.json_normalize(data['quotes'])
df.to_csv('data.csv')
Saves data.csv (screenshot from LibreOffice):
The website you're trying to scrape data from is rendering the table values dynamically and using requests.get will only return the HTML the server sends prior to JavaScript rendering.
You will have to find an alternative way of accessing the data or render the webpages JS (see this example).
A common way of doing this is to use selenium to automate a browser which allows you to render the JavaScript and get the source code that way.
Here is a quick example:
import time
import pandas as pd
from selenium.webdriver import Chrome
#Request the dynamically loaded page source
c = Chrome(r'/path/to/webdriver.exe')
c.get('https://www.cmegroup.com/trading/interest-rates/stir/eurodollar.html')
#Wait for it to render in browser
time.sleep(5)
html_data = c.page_source
#Load into pd.DataFrame
tables = pd.read_html(html_data)
df = tables[0]
df.columns = df.columns.droplevel() #Convert the MultiIndex to an Index
Note that I didn't use BeautifulSoup, you can directly pass the html to pd.read_html. You'll have to do some more cleaning from there but that's the gist.
Alternatively, you can take a peak at requests-html which is a library that offers JavaScript rendering and might be able to help, search for a way to access the data as JSON or .csv from elsewhere and use that, etc.
Related
I want to create a script that fetches the all the data in the following website : https://www.bis.doc.gov/dpl/dpl.txt and store it in a excel file and count the number of records in it, using python language. I've tried to achieve by implementing the code as:
import requests
import re
from bs4 import BeautifulSoup
URL = "https://www.bis.doc.gov/dpl/dpl.txt"
page = requests.get(URL)
soup = BeautifulSoup(page.text, "lxml")
print(soup)
I've fetched the data but didn't know the next step of storing it as excel file. Anyone pls guide or share your valuable ideas. Thank you in advance!
You can do it with pandas easily. Since the data is in tab seperated value.
Note: openpyxl needs to be installed for this to work.
import requests
import io
import pandas as pd
URL = "https://www.bis.doc.gov/dpl/dpl.txt"
page = requests.get(URL)
df = pd.read_csv(io.StringIO(page.text), sep="\t")
df.to_excel(r'i_data.xlsx', index = False)
I am learning Python and I am trying to create a function to web scrape tables of vaccination rates from several different web pages - a github repository for Our World in Data https://github.com/owid/covid-19-data/tree/master/public/data/vaccinations/country_data and https://ourworldindata.org/about. The code works perfectly when web scraping a single table and saving it into a data frame...
import requests
from bs4 import BeautifulSoup
import pandas as pd
url = "https://github.com/owid/covid-19-data/blob/master/public/data/vaccinations/country_data/Bangladesh.csv"
response = requests.get(url)
response
scraping_html_table_BD = BeautifulSoup(response.content, "lxml")
scraping_html_table_BD = scraping_html_table_BD.find_all("table", "js-csv-data csv-data js-file-line-container")
df = pd.read_html(str(scraping_html_table_BD))
BD_df = df[0]
But I have not had much luck when trying to create a function to scrape several pages. I have been following the tutorial on this website 3 in the section 'Scrape multiple pages with one script' and StackOverflow questions like 4 and 5 amongst other pages. I have tried creating a global variable first but I end up with errors like "Recursion Error: maximum recursion depth exceeded while calling a Python object". This is the best code I have managed as it doesn't generate an error but I've not managed to save the output to a global variable. I really appreciate your help.
import pandas as pd
from bs4 import BeautifulSoup
import requests
link_list = ['/Bangladesh.csv',
'/Nepal.csv',
'/Mongolia.csv']
def get_info(page_url):
page = requests.get('https://github.com/owid/covid-19-data/tree/master/public/data/vaccinations/country_data' + page_url)
scape = BeautifulSoup(page.text, 'html.parser')
vaccination_rates = scape.find_all("table", "js-csv-data csv-data js-file-line-container")
result = {}
df = pd.read_html(str(vaccination_rates))
vaccination_rates = df[0]
df = pd.DataFrame(vaccination_rates)
print(df)
df.to_csv("testdata.csv", index=False)
for link in link_list:
get_info(link)
edit: I can view the final webpage that is iterated as it saves to a csv file, but not the data from the preceding links.
new = pd.read_csv('testdata6.csv')
pd.set_option("display.max_rows", None, "display.max_columns", None)
new
This is because in every iteration your 'testdata.csv' is overwritten with a new one.
so you can do :
df.to_csv(page_url[1:], index=False)
I'm guessing you're overwriting your 'testdata.csv' each time, hence why you can see the final page. I would either add an enumerate function to add an identifier for a separate csv each time you scrape a page, eg:
for key, link in enumerate(link_list):
get_info(link, key)
...
df.to_csv(f"testdata{key}.csv", index=False)
Or, open this csv as part of your get_info function, steps of which are available in append new row to old csv file python.
I'm new to Python and am working to extract data from website https://www.screener.in/company/ABB/consolidated/ on a particular table (the last table which is Shareholding Pattern)
I'm using BeautifulSoup library for this but I do not know how to go about it.
So far, here below is my code snippet. am failing to pick the right table due to the fact that the page has multiple tables and all tables share common classes and IDs which makes it difficult for me to filter for the one table I want.
import requests import urllib.request
from bs4 import BeautifulSoup
url = "https://www.screener.in/company/ABB/consolidated/"
r = requests.get(url)
print(r.status_code)
html_content = r.text
soup = BeautifulSoup(html_content,"html.parser")
# print(soup)
#data_table = soup.find('table', class_ = "data-table")
# print(data_table) table_needed = soup.find("<h2>ShareholdingPattern</h2>")
#sub = table_needed.contents[0] print(table_needed)
Just use requests and pandas. Grab the last table and dump it to a .csv file.
Here's how:
import pandas as pd
import requests
df = pd.read_html(
requests.get("https://www.screener.in/company/ABB/consolidated/").text,
flavor="bs4",
)
df[-1].to_csv("last_table.csv", index=False)
Output from a .csv file:
I have setup BeautifulSoup to find a specific class for two webpages.
I would like to know how to write each URL's result to a unique cell in one CSV?
Also is there a limit to the number of URLs I can read as I would like to expand this to about 200 URLs once I get this working.
The class is always the same and I don't need any formatting just the raw HTML in one cell per URL.
Thanks for any ideas.
from bs4 import BeautifulSoup
import requests
urls = ['https://www.ozbargain.com.au/','https://www.ozbargain.com.au/forum']
for u in urls:
response = requests.get(u)
data = response.text
soup = BeautifulSoup(data,'lxml')
soup.find('div', class_="block")
Use pandas to work with tabular data: pd.DataFrame to create a table, and pd.to_csv to save table as csv (might also check out the documentation, append mode for example).
Basically it.
import requests
import pandas as pd
from bs4 import BeautifulSoup
def func(urls):
for url in urls:
data = requests.get(url).text
soup = BeautifulSoup(data,'lxml')
yield {
"url": url, "raw_html": soup.find('div', class_="block")
}
urls = ['https://www.ozbargain.com.au/','https://www.ozbargain.com.au/forum']
data = func(urls)
table = pd.DataFrame(data)
table.to_csv("output.csv", index=False)
I am working on web scraping using Python and BeautifulSoup. My purpose is to pull members data from https://thehia.org/directory?&tab=1. There are around 1685 records.
When I view the page source on my Chrome, I cannot find the table. Seems it dynamically pulls the data. But when I use the inspect option of Chrome, I can find the "membersTable" table in the div that I need.
How can I use BeautifulSoup to access that membersTable that I can access in the inspect.
You can mimic the POST request the page makes for content then use hjson to handle unquoted keys in string pulled out of response
import requests, hjson
import pandas as pd
data = {'formId': '3721260'}
r = requests.post('https://thehia.org/Sys/MemberDirectory/LoadMembers', data=data)
data = hjson.loads(r.text.replace('while(1); ',''))
total = data['TotalCount']
structure = data['JsonStructure']
members = hjson.loads(structure)
df = pd.DataFrame([[member[k][0]['v'] for k in member.keys()] for member in members['members'][0]]
,columns = ['Organisation', 'City', 'State','Country'])
print(df)
Try this one
import requests
from bs4 import BeautifulSoup
url = "https://thehia.org/directory?&tab=1"
response = requests.get(url)
html = response.content
soup = BeautifulSoup(html)
table = soup.find('table', attrs={'class': 'membersTable'})
row_list = []
for row in table.findAll('tr',{'class':['normal']}):
data= []
for cell in row.findAll('td'):
data.append(cell.text)
row_list.append(data)
print(row_list)