table not extracted and not saved in csv file - python

I am trying to extract the table, but my code does not seem to work: it returns None.
I wanted to extract it with XPath, but I couldn't, since I have no knowledge of XPath; I am only a little familiar with BeautifulSoup. How can I extract this table and save it to a CSV file?
The website I am using is: https://training.gov.au/Organisation/Details/31102
import requests
from bs4 import BeautifulSoup
url = 'https://training.gov.au/Organisation/Details/31102'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
table = soup.find('tabel', id='ScopeQualification')
print(table)

First, note the typo in your code: soup.find('tabel', ...) searches for a tag named tabel, which doesn't exist, so it returns None. Beyond that, if you're trying to extract the values of that table, the easiest route is pandas.
Here's a cheat sheet for it so you can get exactly what you want:
https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf
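And here is a minimal sketch of the pandas route; it assumes the table with id ScopeQualification is present in the static HTML that requests receives (if the site builds the table with JavaScript, requests alone won't see it):

import requests
import pandas as pd

url = 'https://training.gov.au/Organisation/Details/31102'
html = requests.get(url).text

# read_html parses every <table> in the page into a DataFrame;
# attrs narrows the search to the table with the id we want.
tables = pd.read_html(html, attrs={'id': 'ScopeQualification'})
tables[0].to_csv('scope_qualification.csv', index=False)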

Related

Using BeautifulSoup to extract table

I would like to extract the table from the following URL: "https://www.nordpoolgroup.com/en/Market-data1/#/nordic/table", and eventually store it in a pandas dataframe.
The code below returns:
<table day-headers="true" enable-filter="false" nps-data-table="" table-data="ctrl.data[ctrl.selectedTab].table.data"></table>
URL = "https://www.nordpoolgroup.com/en/Market-data1/#/nordic/table"
page = requests.get(URL)
soup = BeautifulSoup(page.text, 'html.parser')
print(soup.find_all('table')[0])
I am not sure how to continue. I have been able to extract the content under the table tag on other sites using this code. Could someone please give me some advice and maybe explain what is happening in this case?
The table is rendered in the browser by JavaScript, which is why requests only gets the empty <table> placeholder you printed. The data itself comes from an API endpoint that returns JSON, and you can request it directly:

from pprint import pp
import requests

def main(url):
    r = requests.get(url)
    pp(r.json())

main('https://www.nordpoolgroup.com/api/marketdata/page/11')

From here you can parse it as JSON and create your dataframe!
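As a sketch of that last step, since the exact key layout of the JSON response is not shown here, treat the normalisation call as a starting point and adjust it after inspecting the payload:

import requests
import pandas as pd

payload = requests.get('https://www.nordpoolgroup.com/api/marketdata/page/11').json()

# json_normalize flattens nested JSON into a flat table; which keys hold
# the rows you want depends on the payload, so print it first and adapt.
df = pd.json_normalize(payload)
print(df.head())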

Read table from Web using Python

I'm new to Python and am working to extract data from the website https://www.screener.in/company/ABB/consolidated/, specifically the last table on the page, the Shareholding Pattern.
I'm using the BeautifulSoup library for this, but I do not know how to go about it.
My code snippet so far is below. I am failing to pick the right table because the page has multiple tables that share common classes and IDs, which makes it difficult to filter for the one table I want.
import requests
from bs4 import BeautifulSoup
url = "https://www.screener.in/company/ABB/consolidated/"
r = requests.get(url)
print(r.status_code)
html_content = r.text
soup = BeautifulSoup(html_content,"html.parser")
# print(soup)
#data_table = soup.find('table', class_ = "data-table")
# print(data_table)
table_needed = soup.find("<h2>ShareholdingPattern</h2>")
#sub = table_needed.contents[0]
print(table_needed)
Just use requests and pandas. Grab the last table and dump it to a .csv file.
Here's how:
import pandas as pd
import requests

df = pd.read_html(
    requests.get("https://www.screener.in/company/ABB/consolidated/").text,
    flavor="bs4",
)
df[-1].to_csv("last_table.csv", index=False)
The last table on the page, the Shareholding Pattern, ends up in last_table.csv.
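If grabbing the last table by position ever feels fragile, read_html also accepts a match argument that keeps only tables whose text contains a given string. A sketch, where "Promoters" is an assumption; substitute any text you know appears inside the Shareholding Pattern table:

import pandas as pd
import requests

html = requests.get("https://www.screener.in/company/ABB/consolidated/").text

# match filters the parsed tables to those containing the given text;
# "Promoters" is assumed to appear in the Shareholding Pattern table.
tables = pd.read_html(html, flavor="bs4", match="Promoters")
tables[0].to_csv("shareholding_pattern.csv", index=False)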

Scraping HTML tables to CSVs using BS4 for use with Pandas

I have begun a pet-project creating what is essentially an indexed compilation of a plethora of NFL statistics with a nice simple GUI. Fortunately, the site https://www.pro-football-reference.com has all the data you can imagine in the form of tables which can be exported to CSV format on the site and manually copied/pasted. I started doing this, and then using the Pandas library, began reading the CSVs into DataFrames to make use of the data.
This works great; however, manually fetching all this data is quite tedious, so I decided to attempt to create a web scraper that can scrape HTML tables and convert them into a usable CSV format. I am struggling specifically to isolate individual tables, but also to get the CSV that is produced to render in a readable/usable format.
Here is what the scraper looks like right now:
from bs4 import BeautifulSoup
import requests
import csv

def table_Scrape():
    url = 'https://www.pro-football-reference.com/teams/nwe/2008.htm'
    req = requests.get(url)
    soup = BeautifulSoup(req.text, 'html.parser')
    table = soup.select_one('table.stats_table')
    headers = [th.text.encode("utf-8") for th in table.select("tr th")]
    with open("out.csv", "w", encoding='utf-8') as f:
        wr = csv.writer(f)
        wr.writerow(headers)
        wr.writerows([
            [td.text.encode("utf-8") for td in row.find_all("td")]
            for row in table.select("tr + tr")
        ])

table_Scrape()
This does properly send the request to the URL, but it doesn't fetch the data I am looking for, which is 'Rushing_and_Receiving'. Instead, it fetches the first table on the page, 'Team Stats and Ranking'. It also renders the CSV in a rather ugly, not very useful format, like so:
b'',b'',b'',b'Tot Yds & TO',b'',b'',b'Passing',b'Rushing',b'Penalties',b'',b'Average Drive',b'Player',b'PF',b'Yds',b'Ply',b'Y/P',b'TO',b'FL',b'1stD',b'Cmp',b'Att',b'Yds',b'TD',b'Int',b'NY/A',b'1stD',b'Att',b'Yds',b'TD',b'Y/A',b'1stD',b'Pen',b'Yds',b'1stPy',b'#Dr',b'Sc%',b'TO%',b'Start',b'Time',b'Plays',b'Yds',b'Pts',b'Team Stats',b'Opp. Stats',b'Lg Rank Offense',b'Lg Rank Defense'
b'309',b'4944',b'920',b'5.4',b'22',b'8',b'268',b'288',b'474',b'3222',b'27',b'14',b'6.4',b'176',b'415',b'1722',b'8',b'4.1',b'78',b'81',b'636',b'14',b'170',b'30.6',b'12.9',b'Own 27.8',b'2:38',b'5.5',b'29.1',b'1.74'
b'8',b'5',b'',b'',b'8',b'13',b'1',b'',b'12',b'12',b'13',b'5',b'13',b'',b'4',b'6',b'4',b'7',b'',b'',b'',b'',b'',b'1',b'21',b'2',b'3',b'2',b'5',b'4'
b'8',b'10',b'',b'',b'20',b'20',b'7',b'',b'7',b'11',b'31',b'15',b'21',b'',b'11',b'15',b'4',b'15',b'',b'',b'',b'',b'',b'24',b'16',b'5',b'13',b'14',b'15',b'11'
I know my issue with fetching the correct table lies within the line:
table = soup.select_one('table.stats_table')
I am what I would still consider a novice in Python, so if someone can help me be able to query and parse a specific table with BS4 into CSV format I would be beyond appreciative!
Thanks in advance!
The pandas solution didn't work for me due to the ajax load, but you can see in the browser console the URL each table loads from, and request it directly. In this case, the URL is: https://widgets.sports-reference.com/wg.fcgi?css=1&site=pfr&url=%2Fteams%2Fnwe%2F2008.htm&div=div_rushing_and_receiving
You can then get the table directly using its id rushing_and_receiving.
This seems to work.
from bs4 import BeautifulSoup
import requests
import csv

def table_Scrape():
    url = 'https://widgets.sports-reference.com/wg.fcgi?css=1&site=pfr&url=%2Fteams%2Fnwe%2F2008.htm&div=div_rushing_and_receiving'
    req = requests.get(url)
    soup = BeautifulSoup(req.text, 'html.parser')
    table = soup.find('table', id='rushing_and_receiving')
    headers = [th.text for th in table.findAll("tr")[1]]
    body = table.find('tbody')
    with open("out.csv", "w", encoding='utf-8') as f:
        wr = csv.writer(f)
        wr.writerow(headers)
        for data_row in body.findAll("tr"):
            th = data_row.find('th')
            wr.writerow([th.text] + [td.text for td in data_row.findAll("td")])

table_Scrape()
I would bypass BeautifulSoup altogether, since pandas works well for this site (at least for the first four tables I glanced at). See the pandas read_html documentation.
import pandas as pd

url = 'https://www.pro-football-reference.com/teams/nwe/2008.htm'
data = pd.read_html(url)

# data is now a list of dataframes (spreadsheets), one for each table on the page
data[0].to_csv('somefile.csv')
I wish I could credit both of these answers as correct, as they are both useful, but alas, the second answer, using BeautifulSoup, is the better one, since it allows specific tables to be isolated, whereas the way the site is structured limits the effectiveness of pandas' read_html method.
Thanks to everyone who responded!

Wikipedia table scraping using python

I am trying to scrape tables from Wikipedia. I wrote a table scraper that downloads a table and saves it as a pandas data frame.
This is the code
from bs4 import BeautifulSoup
import pandas as pd
import urllib2

headers = {'User-Agent': 'Mozilla/5.0'}
req = urllib2.Request('https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population', None, headers)
html = urllib2.urlopen(req).read()

soup = BeautifulSoup(html, 'lxml')  # Parse the HTML as a string
print soup

# Grab the first table that matches
table = soup.find("table", {"class": "wikitable sortable jquery-tablesorter"})
print table

rank = []
country = []
pop = []
date = []
per = []
source = []

for row in table.find_all('tr')[1:]:
    col = row.find_all('td')
    col1 = col[0].string.strip()
    rank.append(col1)
    col2 = col[1].string.strip()
    country.append(col2)
    col3 = col[2].string.strip()
    pop.append(col3)
    col4 = col[3].string.strip()
    date.append(col4)
    col5 = col[4].string.strip()
    per.append(col5)
    col6 = col[5].string.strip()
    source.append(col6)

columns = {'Rank': rank, 'Country': country, 'Population': pop, 'Date': date, 'Percentage': per, 'Source': source}

# Create a dataframe from the columns variable
df = pd.DataFrame(columns)
df
But it is not downloading the table. The problem is in this section:
table = soup.find("table", {"class":"wikitable sortable jquery-tablesorter"})
print table
where the output is None.
As far as I can see, there is no such element in the page source. The main table has class "wikitable sortable" but not jquery-tablesorter; that extra class is added by JavaScript in the browser after the page loads, so urllib2 never sees it.
Make sure you know which element you are trying to select, and check that your program sees the same elements you see; then write your selector.
The docs say you need to specify multiple classes like so:
soup.find("table", class_="wikitable sortable jquery-tablesorter")
Also, consider using requests instead of urllib2.
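Putting those suggestions together, a minimal sketch (it assumes the first "wikitable sortable" table on the page is the one you want):

import requests
from bs4 import BeautifulSoup
import pandas as pd

headers = {'User-Agent': 'Mozilla/5.0'}
url = 'https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population'
html = requests.get(url, headers=headers).text

soup = BeautifulSoup(html, 'lxml')

# Select by the classes that exist in the raw HTML; jquery-tablesorter
# only appears after the browser's JavaScript has run.
table = soup.select_one('table.wikitable.sortable')
df = pd.read_html(str(table))[0]
print(df.head())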

How to get the contents under a particular column in a table from Wikipedia using soup & python

I need to get the href links that the contents point to under a particular column of a table on Wikipedia. The page is "http://en.wikipedia.org/wiki/List_of_Telugu_films_of_2015". On this page there are a few tables with the class "wikitable". For each row, I need the links that the contents of the Title column point to, and I would like them copied onto an Excel sheet.
I do not know the exact code for searching under a particular column, but I got this far and am getting a "NoneType object is not callable" error. I am using bs4. I wanted to extract at least some part of the table so I could work out how to narrow down to the href links under the Title column, but I am ending up with this error. The code is below:
from urllib.request import urlopen
from bs4 import BeautifulSoup
soup = BeautifulSoup(urlopen('http://en.wikipedia.org/wiki/List_of_Telugu_films_of_2015').read())
for row in soup('table', {'class': 'wikitable'})[1].tbody('tr'):
    tds = row('td')
    print(tds[0].string, tds[0].string)
A little guidance would be appreciated. Does anyone know?
I figured out that the NoneType error might be related to the table filtering. Corrected code is below:
import urllib2
from bs4 import BeautifulSoup, SoupStrainer

content = urllib2.urlopen("http://en.wikipedia.org/wiki/List_of_Telugu_films_of_2015").read()

# Parse only the wikitable tables instead of the whole page
filter_tag = SoupStrainer("table", {"class": "wikitable"})
soup = BeautifulSoup(content, parse_only=filter_tag)

links = []
for sp in soup.find_all(align="center"):
    a_tag = sp('a')
    if a_tag:
        links.append(a_tag[0].get('href'))
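To finish the job the question asks for (getting the links into a spreadsheet), a short follow-up sketch; it reuses the links list built above and writes a CSV, which Excel opens directly:

import csv

with open('title_links.csv', 'wb') as f:  # 'wb' because this answer targets Python 2
    writer = csv.writer(f)
    writer.writerow(['Title link'])
    for link in links:
        # The hrefs are site-relative (/wiki/...); prefix the domain
        # if you need absolute URLs.
        writer.writerow([link])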
