How to split a complex html table using beautifulsoup - python

I am trying to scrape some webpages using bs4. I get the tables from the webpage as follows:
tables_list = soup.find_all("table")
However, some of my tables are complex and contain multiple small table sections. I have provided the source code for one such table here:
https://ufile.io/tee2c
Is there a way to split such a table into smaller table sections so that I can parse each one into its own dataframe? Thanks.
Here's my code so far, using the html-table-extractor package (https://pypi.python.org/pypi/html-table-extractor/1.3.0):
from html_table_extractor.extractor import Extractor
import pandas

# table_doc => source code provided at https://ufile.io/tee2c
extractor = Extractor(str(table_doc))
extractor.parse()
list_of_lists = extractor.return_list()

new_list = []
# some cleanup
for row in list_of_lists:
    stripped_list = [item.strip() for item in row]
    # skip rows whose cells are all identical, and rows with any item longer
    # than 200 characters (probably a paragraph string, not a valid table item)
    if len(set(stripped_list)) == 1 or any(len(x) > 200 for x in stripped_list):
        continue
    new_list.append(stripped_list)

df = pandas.DataFrame(new_list)
# need to understand how to get this dataframe in a more structured format;
# for complex tables (e.g. the one provided at https://ufile.io/tee2c),
# this df contains content from multiple table 'sections'

Related

Get tables from Wiki that match specific text

I'm quite new to Python and BeautifulSoup, and I've been trying to work this out for several hours...
Firstly, I want to extract all table data from the link below for tables with "general election" in the title:
https://en.wikipedia.org/wiki/Carlow%E2%80%93Kilkenny_(D%C3%A1il_constituency)
I do have another dataframe with the names of each table (e.g. "1961 general election", "1965 general election"), but I am hoping to get away with just searching for "general election" in each table to confirm it's what I need.
I then want to get all the names that are in bold (which indicates they won), and finally I want another list of the "Count 1" (or sometimes "1st Pref") values in the original order, which I want to compare to the bold list. I haven't even looked at this piece yet, as I haven't gotten past the first hurdle.
url = "https://en.wikipedia.org/wiki/Carlow%E2%80%93Kilkenny_(D%C3%A1il_constituency)"
res = requests.get(url)
soup = BeautifulSoup(res.content,'lxml')
my_tables = soup.find_all("table", {"class":"wikitable"})
for table in my_tables:
rows = table.find_all('tr', text="general election")
print(rows)
Any help on this would be greatly appreciated...
This page requires some gymnastics, but it can be done:
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd

req = requests.get('https://en.wikipedia.org/wiki/Carlow%E2%80%93Kilkenny_(D%C3%A1il_constituency)')
soup = bs(req.text, 'lxml')

# first - select all the tables on the page
tables = soup.select('table.wikitable')
for table in tables:
    ttr = table.select('tbody tr')
    # next, filter out any table that doesn't involve general elections
    if "general election" in ttr[0].text:
        # clean up the header row
        s_ttr = ttr[1].text.replace('\n', 'xxx').strip()
        # find and clean up column headings
        columns = [col.strip() for col in s_ttr.split('xxx') if len(col.strip()) > 0]
        rows = []  # initialize a list to house the table rows
        for c in ttr[2:]:
            # from here, process each row and load it into the list
            row = [a.text.strip() if len(a.text.strip()) > 0 else 'NA' for a in c.select('td')]
            if row and row[0] == "NA":
                row = row[1:]
            if len(row) > 0:
                rows.append(row)
        # load the whole thing into a dataframe
        df = pd.DataFrame(rows, columns=columns)
        print(df)
The output should be all the general election tables on the page.
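For the second part of the question (the names in bold), here is a minimal sketch reusing the table variable from the loop above, under the assumption that winning candidates are wrapped in b tags inside the table cells:

# Assumption: winners are marked with <b> tags inside <td> cells.
winners = [b.get_text(strip=True) for b in table.select('td b')]
print(winners)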

Scrape a table from multiple pages and store in a single dataframe

Problem: a website has c. 80 pages, each of which contains a single, identically structured table. I need to scrape each table and store the results in a single pandas dataframe. The table content is regularly updated, so the exercise will need to be repeated frequently.
I can scrape the table from a single page but am struggling to do it for multiple pages. All of the examples I have found are for URLs that change iteratively (e.g. www.example.com/page1, /page2, etc.), rather than for a specified list of URLs.
I have tried the following for a subset of the URLs (ideally, I would like to read in the URLs from a csv list), but it only seems to scrape the final table into the dataframe (i.e. ZZ).
Apologies if this seems dim, I’m fairly new to Python and have mainly been using pandas for data analysis, reading in directly from csv. Any help would be gratefully appreciated.
How can I read the URLs from a CSV list? My current solution scrapes only the final table rather than all of them.
from bs4 import BeautifulSoup
import requests
import pandas as pd

COLUMNS = ['ID', 'Serial', 'Aircraft', 'Notes']

urls = ['http://www.ukserials.com/results.php?serial=ZR',
        'http://www.ukserials.com/results.php?serial=ZT',
        'http://www.ukserials.com/results.php?serial=ZZ']

# scrape elements
for url in urls:
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")
    table = soup.find("table")   # Find the "table" tag in the page
    rows = table.find_all("tr")  # Find all the "tr" tags in the table
    cy_data = []
    for row in rows:
        cells = row.find_all("td")  # Find all the "td" tags in each row
        cells = cells[0:4]          # Select the correct columns
        cy_data.append([cell.text for cell in cells])  # For each "td" tag, get the text inside it
    data = pd.DataFrame(cy_data, columns=COLUMNS).drop(0, axis=0)
Can you not add each dataframe into a list, and then merge the elements of that list right at the end?
...
dataframes = []
for url in urls:
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")
    table = soup.find("table")   # Find the "table" tag in the page
    rows = table.find_all("tr")  # Find all the "tr" tags in the table
    cy_data = []
    for row in rows:
        cells = row.find_all("td")  # Find all the "td" tags in each row
        cells = cells[0:4]          # Select the correct columns
        cy_data.append([cell.text for cell in cells])  # For each "td" tag, get the text inside it
    dataframes.append(pd.DataFrame(cy_data, columns=COLUMNS).drop(0, axis=0))
data = pd.concat(dataframes)
Note: You might need to specify index offsets for each dataframe (before merging), as seen here: https://pandas.pydata.org/pandas-docs/stable/merging.html
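As for reading the URLs in from a CSV file, which the question also asks about, here is a minimal sketch; the file name urls.csv and the column name url are assumptions:

import pandas as pd

# Hypothetical file: urls.csv with a single column named "url", one URL per row.
urls = pd.read_csv('urls.csv')['url'].tolist()

# When concatenating, ignore_index=True renumbers the rows so the indices
# copied from each per-page dataframe don't collide:
# data = pd.concat(dataframes, ignore_index=True)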

Extracting tables from web

I need to extract all the tables (only the second column) from this page:
https://zh.wikipedia.org/wiki/上海证券交易所上市公司列表
Well, I don't need the last three tables...
However, my code only extracts the second column from the first table.
import pickle
import requests
import bs4 as bs

def save_china_tickers():
    resp = requests.get('https://zh.wikipedia.org/wiki/上海证券交易所上市公司列表')
    soup = bs.BeautifulSoup(resp.text, 'lxml')
    table = soup.find('table', {'class': 'wikitable'})
    tickers = []
    for row in table.findAll('tr')[1:]:
        ticker = row.findAll('td')[1].text
        tickers.append(ticker)
    with open('chinatickers.pickle', 'wb') as f:
        pickle.dump(tickers, f)
    return tickers

save_china_tickers()
I have an easy method.
Get HTTP Response
Find all tables using RegEx
Parse HTML Table to list of lists
Iterate over each list in the list
Requirements
dashtable
Code
from urllib.request import urlopen
from dashtable import html2data  # to convert an html table to a list of lists
import re

url = "https://zh.wikipedia.org/wiki/%E4%B8%8A%E6%B5%B7%E8%AF%81%E5%88%B8%E4%BA%A4%E6%98%93%E6%89%80%E4%B8%8A%E5%B8%82%E5%85%AC%E5%8F%B8%E5%88%97%E8%A1%A8"

# read the http content
data = urlopen(url).read().decode()

# now fetch all tables with the help of a regex
tables = ["<table>{}</table>".format(table)
          for table in re.findall(r"<table .*?>(.*?)</table>", data, re.M | re.S | re.I)]

# parse the data; html2data returns a tuple whose 0th index is the list of lists
parsed_tables = [html2data(table)[0] for table in tables]

# let's take the first table, i.e. 600000-600099
parsed = parsed_tables[0]

# column names of the first table
print(parsed[0])

# rows of the first table, 2nd column
for index in range(1, len(parsed)):
    print(parsed[index][1])

# Output: all the rows of table 1, column 2, excluding the headers

Scraping table data from a list of URLs (each containing a unique table) for the purposes of appending it all to a single list/dataframe?

I am scraping data from a list of hundreds of URLs, each one containing a table with statistical baseball data. Within each unique URL in the list, there is a table for all of the seasons of a single baseball player's career, like this:
https://www.baseball-reference.com/players/k/killeha01.shtml
I have successfully created a script to append the data from a single URL into a single list/dataframe. However, here is my question:
How should I adjust my code to scrape a full list of hundreds of URLs from this domain and then append all of the table rows from all of the URLs into a single list/dataframe?
My general format for scraping a single URL is as follows:
import pandas as pd
from urllib.request import urlopen
from bs4 import BeautifulSoup

url_baseball_players = ['https://www.baseball-reference.com/players/k/killeha01.shtml']

def scrape_baseball_data(url_parameter):
    html = urlopen(url_parameter)
    # create the BeautifulSoup object
    soup = BeautifulSoup(html, "lxml")
    column_headers = [SCRAPING COMMAND WITH CSS SELECTOR GADGET FOR GETTING COLUMN HEADERS]
    table_rows = soup.select(SCRAPING COMMAND WITH CSS SELECTOR GADGET FOR GETTING ALL OF THE DATA FROM THE TABLES INCLUDING HTML CHARACTERS)
    player_data = []
    for row in table_rows:
        player_list = [COMMANDS FOR SCRAPING HTML DATA FROM THE TABLES INTO AN ORGANIZED LIST]
        if not player_list:
            continue
        player_data.append(player_list)
    return player_data

list_baseball_player_data = scrape_baseball_data(url_baseball_players)
df_baseball_player_data = pd.DataFrame(list_baseball_player_data)
If url_baseball_players is a list of all the URLs you want to scrape, and your expected output is one data frame (where you append, row-wise, each new URL's data), then just keep adding with concat() as you iterate over URLs:
df = pd.DataFrame()
for url in url_baseball_players:
    df = pd.concat([df, pd.DataFrame(scrape_baseball_data(url))])
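Note that concatenating inside the loop copies the growing dataframe on every iteration. Collecting the frames first and concatenating once is a cheaper variant of the same idea:

# Same idea, but concatenating once at the end.
frames = [pd.DataFrame(scrape_baseball_data(url)) for url in url_baseball_players]
df = pd.concat(frames, ignore_index=True)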

How can I get data from a specific cell in an HTML table in python?

This link contains the table I'm trying to parse.
I'm trying to use BeautifulSoup in Python. I'm very new to BeautifulSoup and HTML. This is my attempt to solve my problem.
from bs4 import BeautifulSoup

soup = BeautifulSoup(open('BBS_student_grads.php'), 'html.parser')
data = []
table = soup.find('table')
rows = table.find_all('tr')  # list of rows in the table
for x, row in enumerate(rows[1:]):  # skips the header row
    cols = row.find_all('td')  # finds all cols in the row
    data.append([])
    for y, col in enumerate(cols):  # iterates through the cols
        data[x].append(col)  # puts the table into a 2d list called data
print(data[0][0])  # prints the top-left corner
I'm trying to extract all the names in the table, then update the names in the list, and then update the table. For now I'm working from a local copy of the HTML as a temporary fix until I learn more about web programming.
Any help is much appreciated.
I think you need just the td elements inside the tr elements with class="searchbox_black".
You can use CSS Selectors to get to the desired td elements:
for cell in soup.select('tr.searchbox_black td'):
    print(cell.text)
It prints:
BB Salsa
Adams State University Alamosa, CO
Sensei: Oneyda Maestas
Raymond Breitstein
...
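To address a specific cell by (row, column) rather than printing a flat stream of cells, the matches can be grouped row by row first. A minimal sketch:

# Group the cell text row by row so a cell can be addressed by index.
rows = [[td.get_text(strip=True) for td in tr.select('td')]
        for tr in soup.select('tr.searchbox_black')]
print(rows[0][0])  # top-left cell among the matching rows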
