Beautiful Soup AssertionError - python

I am trying to scrape this website into a .CSV and I am getting an error that says: AssertionError: 9 columns passed, passed data had 30 columns. My code is below; it is a little messy because I exported it from a Jupyter Notebook.
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup
import pandas as pd

url = 'https://apps.azsos.gov/apps/election/cfs/search/CandidateSearch.aspx'
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
html = urlopen(req).read()
soup = BeautifulSoup(html)
type(soup)  # we see that soup is a BeautifulSoup object
column_headers = [th.getText() for th in
                  soup.findAll('tr', limit=2)[1].findAll('th')]
column_headers  # our column headers
data_rows = soup.findAll('th')[2:]  # skip the first 2 header rows
type(data_rows)  # now we have a list of table rows
candidate_data = [[td.getText() for td in data_rows[i].findAll('td')]
                  for i in range(len(data_rows))]
df = pd.DataFrame(candidate_data, columns=column_headers)
df.head()  # head() shows the first 5 rows of the DataFrame by default
df.to_csv(r'C:/Dev/Sheets/Candiate_Search.csv', encoding='utf-8', index=False)

The data on the page definitely has a table, and you parse out the column headers and pass them to your DataFrame (and from there to your CSV). Visually that table has 8 columns, but you parse 9 headers. At this point you should probably go check your data to see what you've found - it might not be what you expect. But okay, you go and check and you see that one of them is a spacer column in the table that will be empty or garbage, and you proceed.
These lines:
data_rows = soup.findAll('th')[2:] # skip the first 2 header rows
type(data_rows) # now we have a list of table rows
candidate_data = [[td.getText() for td in data_rows[i].findAll('td')]
for i in range(len(data_rows))]
find every <th> instance in the page and then every <td> inside each <th>, and that's where it really goes off the rails. I am guessing you are not a web developer, but tables and their sub-elements (rows aka <tr>, headers aka <th>, and cells aka <td>) are used all over most pages for organizing tons of visual elements and also sometimes for organizing tabular data.
Guess what? You found a lot of tables that are not this visual table because you were searching the whole page for <th> elements.
I'd suggest you pre-filter down from using the entire soup by first finding a <table> or <div> that only contains the tabular data you're interested in, and then search within that scope.
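A minimal sketch of that scoping approach (the id used to locate the table is hypothetical; inspect the page source to find the real id or class of the results table):
# Hypothetical locator: replace 'CandidateSearchGrid' with the actual id or
# class of the results table as it appears in the page source.
results_table = soup.find('table', {'id': 'CandidateSearchGrid'})
column_headers = [th.getText() for th in results_table.find('tr').findAll('th')]
candidate_data = [[td.getText() for td in tr.findAll('td')]
                  for tr in results_table.findAll('tr')[1:]]  # skip the header row
df = pd.DataFrame(candidate_data, columns=column_headers)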

Related

How do I get the data from a table of google spreadsheet using requests in python?

I am doing a project that asks me to obtain 3 databases from a Google spreadsheet using the requests library (I do the processing afterwards). The problem is that when I send a GET request to the url and apply ".text", ".json()" or ".content", it gives me the markup of the entire spreadsheet page, but I want the row and column values. Any ideas?
Here are the spreadsheets:
https://docs.google.com/spreadsheets/d/1o8QeMOKWm4VeZ9VecgnL8BWaOlX5kdCDkXoAph37sQM/edit#gid=1691373423
https://docs.google.com/spreadsheets/d/1o8QeMOKWm4VeZ9VecgnL8BWaOlX5kdCDkXoAph37sQM/edit
https://docs.google.com/spreadsheets/d/1udwn61l_FZsFsEuU8CMVkvU2SpwPW3Krt1OML3cYMYk/edit
The best way to get the data from a Google spreadsheet in Python is gspread, which is a Python API for Google Sheets.
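A minimal gspread sketch (it assumes you have created a Google service account, downloaded its credentials file, and shared the spreadsheet with the service account's email address; the filename below is a placeholder):
import gspread

gc = gspread.service_account(filename='service_account.json')  # placeholder credentials file
sh = gc.open_by_key('1o8QeMOKWm4VeZ9VecgnL8BWaOlX5kdCDkXoAph37sQM')
rows = sh.sheet1.get_all_values()  # list of lists, one inner list per spreadsheet row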
However, there are alternatives if you aren't the owner of the spreadsheet (or you just want to do it another way as an exercise). For instance, you can do it using the requests and bs4 modules, as you can see in this answer.
Applied to your specific case, the code would look like this ("Datos Argentina - Salas de Cine" spreadsheet):
import typing
import requests
from bs4 import BeautifulSoup

def scrapeDataFromSpreadsheet() -> typing.List[typing.List[str]]:
    html = requests.get('https://docs.google.com/spreadsheets/d/1o8QeMOKWm4VeZ9VecgnL8BWaOlX5kdCDkXoAph37sQM/edit#gid=1691373423').text
    soup = BeautifulSoup(html, 'lxml')
    salas_cine = soup.find_all('table')[0]
    rows = [[td.text for td in row.find_all("td")] for row in salas_cine.find_all('tr')]
    return rows
Important note: with the link provided (and the code above) you will only be able to get the first 100 rows of data!
This can be fixed in more than one way. What I've tried is modifying the url of the spreadsheet to display the data as a simple html table (reference).
Old url: https://docs.google.com/spreadsheets/d/1o8QeMOKWm4VeZ9VecgnL8BWaOlX5kdCDkXoAph37sQM/edit#gid=1691373423
New url (remove edit#gid=1691373423 and add gviz/tq?tqx=out:html&tq&gid=1): https://docs.google.com/spreadsheets/d/1o8QeMOKWm4VeZ9VecgnL8BWaOlX5kdCDkXoAph37sQM/gviz/tq?tqx=out:html&tq&gid=1
Now you are able to obtain all the rows that the spreadsheet contains:
def scrapeDataFromSpreadsheet() -> typing.List[typing.List[str]]:
    html = requests.get('https://docs.google.com/spreadsheets/u/0/d/1o8QeMOKWm4VeZ9VecgnL8BWaOlX5kdCDkXoAph37sQM/gviz/tq?tqx=out:html&tq&gid=1').text
    soup = BeautifulSoup(html, 'lxml')
    salas_cine = soup.find_all('table')[0]
    rows = [[td.text for td in row.find_all("td")] for row in salas_cine.find_all('tr')]
    return rows
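If you then need each spreadsheet as a "database" for the later processing, the scraped rows convert directly into a pandas DataFrame (a sketch; it assumes the first scraped row is the header row):
import pandas as pd

rows = scrapeDataFromSpreadsheet()
df = pd.DataFrame(rows[1:], columns=rows[0])  # assumes row 0 holds the column names
df.to_csv('salas_de_cine.csv', index=False)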

Python BeautifulSoup parsing / crawl table

For my own interest, I want to crawl the table of properties from "https://thinkimmo.com/search?noReset=true". After clicking on "TABELLE" (TABLE), you can see all the properties listed in a table.
With the following code I am able to see the table:
driver.get("https://thinkimmo.com/search?noReset=true")
driver.find_element_by_xpath('/html/body/div[1]/div[2]/div[2]/div/div[2]/div/div[2]/div/div/div/div[1]/div/div/button[2]/span[1]').click()
Now I am able to crawl some parts of the table with the following code:
soup = BeautifulSoup(driver.page_source, 'html.parser')
htmltable = soup.find('table', {'class': 'MuiTable-root'})

def tableDataText(table):
    rows = []
    trs = table.find_all('tr')
    headerow = [td.get_text(strip=True) for td in trs[0].find_all('th')]  # header row
    if headerow:  # if there is a header row, include it first
        rows.append(headerow)
        trs = trs[1:]
    for tr in trs:  # for every table row
        rows.append([td.get_text(strip=True) for td in tr.find_all('td')])  # data row
    return rows

list_table = tableDataText(htmltable)
list_table
The result, however, is not what I expect. I only get the first 7 headings; all the other headings are not returned.
After taking a closer look at the HTML of the webpage, I am still not sure how to get all the headings and results of the table.
I would like to solve the problem of only getting part of the headings, and more specifically I am interested in why I am failing.
What I see in the result of table = soup.find("table") is that after the 7th heading title the table closes.
Thanks in advance.
Steffen
The site uses a backend API whose query you can edit to bulk-download data:
import requests
import pandas as pd
results = 1000
url = f'https://api.thinkimmo.com/immo?active=true&type=APARTMENTBUY&sortBy=publishDate,desc&from=0&size={str(results)}&grossReturnAnd=false&allowUnknown=false&excludePlatforms=ebk,immowelt&favorite=false&noReset=true&excludedFields=true&geoSearches=[]&averageAggregation=buyingPrice%3BpricePerSqm%3BsquareMeter%3BconstructionYear%3BrentPrice%3BrentPricePerSqm%3BrentPricePerSqm%3BrunningTime&termsAggregation=platforms.name.keyword,60'
resp = requests.get(url).json()
df = pd.DataFrame(resp['results'])
df.to_csv('thinkimmo.csv',index=False)
print('Saved to thinkimmo.csv')
This is a lot of unstructured data, but it should help. If you want to inspect what is in this API call and only keep certain parts of the returned JSON, open your browser's Developer Tools > Network > Fetch/XHR and reload the page to see all the backend requests fire. You are looking for the one that starts with "immo?"; look at its Payload and Preview to see all the data. That's what we are scraping above.
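For example, to keep only a few fields from the response, something like this works (a sketch; the column names are guesses based on the aggregation names in the URL, so check them against the Preview tab):
# The field names below are assumptions; confirm the real keys in the JSON preview.
wanted = ['buyingPrice', 'pricePerSqm', 'squareMeter', 'constructionYear']
subset = df[[c for c in wanted if c in df.columns]]
subset.to_csv('thinkimmo_subset.csv', index=False)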

Scraping HTML tables to CSV's using BS4 for use with Pandas

I have begun a pet-project creating what is essentially an indexed compilation of a plethora of NFL statistics with a nice simple GUI. Fortunately, the site https://www.pro-football-reference.com has all the data you can imagine in the form of tables which can be exported to CSV format on the site and manually copied/pasted. I started doing this, and then using the Pandas library, began reading the CSVs into DataFrames to make use of the data.
This works great; however, manually fetching all this data is quite tedious, so I decided to attempt to create a web scraper that can scrape HTML tables and convert them into a usable CSV format. I am struggling specifically with isolating individual tables, but also with getting the CSV that is produced to render in a readable/usable format.
Here is what the scraper looks like right now:
from bs4 import BeautifulSoup
import requests
import csv

def table_Scrape():
    url = 'https://www.pro-football-reference.com/teams/nwe/2008.htm'
    req = requests.get(url)
    soup = BeautifulSoup(req.text, 'html.parser')
    table = soup.select_one('table.stats_table')
    headers = [th.text.encode("utf-8") for th in table.select("tr th")]
    with open("out.csv", "w", encoding='utf-8') as f:
        wr = csv.writer(f)
        wr.writerow(headers)
        wr.writerows([
            [td.text.encode("utf-8") for td in row.find_all("td")]
            for row in table.select("tr + tr")
        ])

table_Scrape()
This does properly send the request to the URL, but it doesn't fetch the data I am looking for, which is 'Rushing_and_Receiving'. Instead, it fetches the first table on the page, 'Team Stats and Ranking'. It also renders the CSV in a rather ugly/not useful format, like so:
b'',b'',b'',b'Tot Yds & TO',b'',b'',b'Passing',b'Rushing',b'Penalties',b'',b'Average Drive',b'Player',b'PF',b'Yds',b'Ply',b'Y/P',b'TO',b'FL',b'1stD',b'Cmp',b'Att',b'Yds',b'TD',b'Int',b'NY/A',b'1stD',b'Att',b'Yds',b'TD',b'Y/A',b'1stD',b'Pen',b'Yds',b'1stPy',b'#Dr',b'Sc%',b'TO%',b'Start',b'Time',b'Plays',b'Yds',b'Pts',b'Team Stats',b'Opp. Stats',b'Lg Rank Offense',b'Lg Rank Defense'
b'309',b'4944',b'920',b'5.4',b'22',b'8',b'268',b'288',b'474',b'3222',b'27',b'14',b'6.4',b'176',b'415',b'1722',b'8',b'4.1',b'78',b'81',b'636',b'14',b'170',b'30.6',b'12.9',b'Own 27.8',b'2:38',b'5.5',b'29.1',b'1.74'
b'8',b'5',b'',b'',b'8',b'13',b'1',b'',b'12',b'12',b'13',b'5',b'13',b'',b'4',b'6',b'4',b'7',b'',b'',b'',b'',b'',b'1',b'21',b'2',b'3',b'2',b'5',b'4'
b'8',b'10',b'',b'',b'20',b'20',b'7',b'',b'7',b'11',b'31',b'15',b'21',b'',b'11',b'15',b'4',b'15',b'',b'',b'',b'',b'',b'24',b'16',b'5',b'13',b'14',b'15',b'11'
I know my issue with fetching the correct table lies within the line:
table = soup.select_one('table.stats_table')
I am what I would still consider a novice in Python, so if someone can help me query and parse a specific table with BS4 into CSV format, I would be beyond appreciative!
Thanks in advance!
The pandas solution didn't work for me due to the AJAX load, but you can see in the console the URL each table is loaded from and request it directly. In this case, the URL is: https://widgets.sports-reference.com/wg.fcgi?css=1&site=pfr&url=%2Fteams%2Fnwe%2F2008.htm&div=div_rushing_and_receiving
You can then get the table directly using its id rushing_and_receiving.
This seems to work.
from bs4 import BeautifulSoup
import requests
import csv

def table_Scrape():
    url = 'https://widgets.sports-reference.com/wg.fcgi?css=1&site=pfr&url=%2Fteams%2Fnwe%2F2008.htm&div=div_rushing_and_receiving'
    req = requests.get(url)
    soup = BeautifulSoup(req.text, 'html.parser')
    table = soup.find('table', id='rushing_and_receiving')
    headers = [th.text for th in table.findAll("tr")[1]]
    body = table.find('tbody')
    with open("out.csv", "w", encoding='utf-8') as f:
        wr = csv.writer(f)
        wr.writerow(headers)
        for data_row in body.findAll("tr"):
            th = data_row.find('th')
            wr.writerow([th.text] + [td.text for td in data_row.findAll("td")])

table_Scrape()
I would bypass Beautiful Soup altogether since pandas works well for this site (at least for the first 4 tables I glanced over).
Documentation here
import pandas as pd
url = 'https://www.pro-football-reference.com/teams/nwe/2008.htm'
data = pd.read_html(url)
# data is now a list of dataframes (spreadsheets) one dataframe for each table in the page
data[0].to_csv('somefile.csv')
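read_html also accepts a match argument that keeps only tables containing the given text, which can help isolate a specific table (a sketch; whether the 'Rushing and Receiving' table is visible to read_html depends on how the page embeds it, since this site loads some tables from HTML comments):
import pandas as pd

url = 'https://www.pro-football-reference.com/teams/nwe/2008.htm'
tables = pd.read_html(url, match='Rushing')  # the match text is an assumption based on the table caption
tables[0].to_csv('rushing_and_receiving.csv', index=False)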
I wish I could credit both of these answers as correct, as they are both useful, but alas, the answer using BeautifulSoup is the better one since it allows for the isolation of specific tables, whereas the way the site is structured limits the effectiveness of pandas' 'read_html' method.
Thanks to everyone who responded!

Getting the child element of a particular div element using beautiful soup

I am trying to scrape table data from this link
http://bet.hkjc.com/racing/pages/odds_wp.aspx?date=30-01-2017&venue=ST&raceno=2&lang=en
Here is my code
from lxml import html
import webbrowser
import re
import xlwt
import requests
import bs4

content = requests.get("http://bet.hkjc.com/racing/pages/odds_wp.aspx?date=30-01-2017&venue=ST&raceno=1&lang=en").text  # Get page content
soup = bs4.BeautifulSoup(content, 'lxml')  # Parse page content
table = soup.find('div', {'id': 'detailWPTable'})  # Locate that table tag
rows = table.find_all('tr')  # Find all row tags in that table
for row in rows:
    columns = row.find_all('td')  # Find all data tags in each column
    print('\n')
    for column in columns:
        print(column.text.strip(), end=' ')  # Output data in each column
It is not giving any output. Please help!
The table is generated by JavaScript; requests will only return the static HTML code, without the table data.
Use Selenium.
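A minimal Selenium sketch of that suggestion (it assumes a local Chrome/chromedriver setup, and the fixed sleep is a crude stand-in for a proper explicit wait):
import time
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()  # assumes chromedriver is available on your PATH
driver.get("http://bet.hkjc.com/racing/pages/odds_wp.aspx?date=30-01-2017&venue=ST&raceno=1&lang=en")
time.sleep(5)  # wait for the JavaScript-rendered table; an explicit wait would be more robust

soup = BeautifulSoup(driver.page_source, 'lxml')
table = soup.find('div', {'id': 'detailWPTable'})
for row in table.find_all('tr'):
    print(' '.join(td.text.strip() for td in row.find_all('td')))
driver.quit()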
I'm looking at the last line of your code:
print (column.text.strip(),end=' ') # Output data in each column
Are you sure that should read column.text? Maybe you could try column.strings or column.get_text(), or even column.stripped_strings.
I just wanted to mention that the id you are using is for the wrapping div, not for the child table element.
Maybe you could try something like:
wrapper = soup.find('div', {'id': 'detailWPTable'})
table_body = wrapper.table.tbody
rows = table_body.find_all('tr')
But thinking about it, the tr elements are also descendants of the wrapping div, so find_all should still find them.
Update: adding tbody
Update: sorry, I'm not allowed to comment yet :). Are you sure you have the correct document? Have you checked the whole soup to confirm the tags are actually there?
And I guess all those lines could be written as:
rows = soup.find('div', {'id': 'detailWPTable'}).find('tbody').find_all('tr')
Update: Yeah, the wrapper div is empty. So it seems that you don't get what's being generated by JavaScript, as the other answer said. Maybe you should try Selenium as suggested? Possibly PhantomJS as well.
You can try it with dryscrape like so:
import dryscrape
from bs4 import BeautifulSoup as BS
import re
import xlwt

ses = dryscrape.Session()
ses.visit("http://bet.hkjc.com/racing/pages/odds_wp.aspx?date=30-01-2017&venue=ST&raceno=1&lang=en")
soup = BS(ses.body(), 'lxml')  # Parse page content
table = soup.find('div', {'id': 'detailWPTable'})  # Locate that table tag
rows = table.find_all('tr')  # Find all row tags in that table
for row in rows:
    columns = row.find_all('td')  # Find all data tags in each column
    print('\n')
    for column in columns:
        print(column.text.strip())

Wikipedia table scraping using python

I am trying to scrape tables from Wikipedia. I wrote a table scraper that downloads a table and saves it as a pandas DataFrame.
This is the code
from bs4 import BeautifulSoup
import pandas as pd
import urllib2

headers = {'User-Agent': 'Mozilla/5.0'}
req = urllib2.Request('https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population', None, headers)
html = urllib2.urlopen(req).read()
soup = BeautifulSoup(html, 'lxml')  # Parse the HTML as a string
print soup

# Grab the first matching table
table = soup.find("table", {"class": "wikitable sortable jquery-tablesorter"})
print table

rank = []
country = []
pop = []
date = []
per = []
source = []
for row in table.find_all('tr')[1:]:
    col = row.find_all('td')
    col1 = col[0].string.strip()
    rank.append(col1)
    col2 = col[1].string.strip()
    country.append(col2)
    col3 = col[2].string.strip()
    pop.append(col3)
    col4 = col[3].string.strip()
    date.append(col4)
    col5 = col[4].string.strip()
    per.append(col5)
    col6 = col[5].string.strip()
    source.append(col6)

columns = {'Rank': rank, 'Country': country, 'Population': pop, 'Date': date, 'Percentage': per, 'Source': source}
# Create a dataframe from the columns variable
df = pd.DataFrame(columns)
df
But it is not downloading the table. The problem is in this section
table = soup.find("table", {"class":"wikitable sortable jquery-tablesorter"})
print table
where output is None
As far as I can see, there is no such element on that page. The main table has "class":"wikitable sortable" but not the jquery-tablesorter.
Make sure you know what element you are trying to select and check if your program sees the same elements you see, then make your selector.
The docs say you need to specify multiple classes like so:
soup.find("table", class_="wikitable sortable jquery-tablesorter")
Also, consider using requests instead of urllib2.
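Putting both suggestions together, a corrected sketch (it assumes the population table is the first "wikitable sortable" table on the page; the jquery-tablesorter class is added by JavaScript in the browser, which is why it is missing from the raw HTML):
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0'}
html = requests.get('https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population', headers=headers).text
soup = BeautifulSoup(html, 'lxml')

table = soup.find("table", class_="wikitable sortable")  # no jquery-tablesorter in the raw HTML
print(table is not None)  # confirm the table was actually found before parsing rows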
