I'm trying to copy search result tables from a website into an Excel sheet, but the data isn't very clean, and that's causing some issues with outputting the lists into a pandas DataFrame.
There are 15 columns of data, but the last 2 have blank headers and 2 of them have duplicate headers, which (I think) is causing me to get this error: "ValueError: All arrays must be of the same length".
Realistically I only need the first 9 columns of the table, which means there would no longer be any duplicate or blank headers in the data.
Is there a way to limit find_elements to get the first 9 columns of the table rather than all columns? Or to fix the headers so that there are no longer any duplicates or blanks?
Any help is greatly appreciated, thank you.
for x in result:
    # Enter the search term and run the search
    driver.find_element(By.XPATH, '//*[@id="sidemenu"]/table/tbody/tr[1]/td/form/div[2]/input[1]').send_keys(x)
    driver.implicitly_wait(2)
    driver.find_element(By.XPATH, '//*[@id="navsrch"]').click()
    driver.implicitly_wait(2)
    headers = []
    columns = dict()
    table_id = driver.find_element(By.ID, 'invoice')
    all_rows = table_id.find_elements(By.TAG_NAME, "tr")
    row = all_rows[0]
    # Build one dict key per header cell
    all_items = row.find_elements(By.TAG_NAME, "th")
    for item in all_items:
        name = item.text
        columns[name] = []
        headers.append(name)
    print(headers)
    # Collect the body cells under each header
    for row in all_rows[1:]:
        all_items = row.find_elements(By.TAG_NAME, "td")
        for name, item in zip(headers, all_items):
            value = item.text
            columns[name].append(value)
    print(columns)
    df = pd.DataFrame(columns)
    print(df)
driver.close()
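(A minimal, hypothetical sketch of why I think the duplicate headers cause the error: two headers with the same text collapse into a single dict key, so that key collects twice as many values as the others and pandas refuses to build the frame.)

import pandas as pd

# Hypothetical header names: "Qty" appears twice, so the dict keeps only one "Qty" key
columns = {}
for name in ["Item", "Qty", "Qty"]:
    columns[name] = []
for row in [["a", 1, 2], ["b", 3, 4]]:
    for name, value in zip(["Item", "Qty", "Qty"], row):
        columns[name].append(value)
print({k: len(v) for k, v in columns.items()})   # {'Item': 2, 'Qty': 4}
pd.DataFrame(columns)                            # ValueError: All arrays must be of the same length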
Fixed the issue by looping the specified number of times that I needed, incrementing the XPath position on each loop, and sending the data straight to Excel on each loop. Thanks for your help Aaron for the iterating idea & Arundeep for the XPath position idea.
for x in result:
    driver.find_element(By.XPATH, '//*[@id="sidemenu"]/table/tbody/tr[1]/td/form/div[2]/input[1]').send_keys(x)
    driver.implicitly_wait(2)
    driver.find_element(By.XPATH, '//*[@id="navsrch"]').click()
    driver.implicitly_wait(5)
    # Count the result rows, then read the first 9 cells of each row by XPath position
    row = driver.find_elements(By.XPATH, '//*[@id="invoice"]/tbody/tr')
    lrow = len(row) + 1
    for i in range(1, lrow):
        # Next empty row in the worksheet
        l_e_r = len(list(ws.rows)) + 1
        for y in range(1, 10):
            rows = driver.find_element(By.XPATH, '//*[@id="invoice"]/tbody/tr[' + str(i) + ']/td[' + str(y) + ']').text
            ws.cell(row=l_e_r, column=y).value = rows
    wb.save("test.xlsx")
driver.close()
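For reference, the slicing approach asked about in the question also works: keep only the first nine header cells and the first nine td cells of each row, so no blank or duplicate headers remain. A rough sketch, assuming the same invoice table as above:

table_id = driver.find_element(By.ID, 'invoice')
all_rows = table_id.find_elements(By.TAG_NAME, "tr")

# Slice each list of elements down to the first 9 columns
headers = [th.text for th in all_rows[0].find_elements(By.TAG_NAME, "th")[:9]]
data = []
for row in all_rows[1:]:
    cells = row.find_elements(By.TAG_NAME, "td")[:9]
    data.append([td.text for td in cells])

df = pd.DataFrame(data, columns=headers)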
I have a Python script that scrapes an HTML table. When I try to save my scraped data to a pandas DataFrame, I get an error. Please help me check what I am doing wrong.
Here is my code block:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
import time
import pandas as pd

def HDI():
    url = 'https://worldpopulationreview.com/country-rankings/hdi-by-country'
    service = Service(executable_path="C:/driver/new/chromedriver_win32/chromedriver.exe")
    driver = webdriver.Chrome(service=service)
    driver.get(url)
    time.sleep(5)
    btn = driver.find_element(By.CLASS_NAME, '_3p_1XEZR')
    btn.click()
    time.sleep(5)
    temp_height = 0
    while True:
        # Looping down the scroll bar
        driver.execute_script("window.scrollBy(0,500)")
        # sleep and let the scroll bar react
        time.sleep(5)
        # Get the distance of the current scroll bar from the top
        check_height = driver.execute_script("return document.documentElement.scrollTop || window.window.pageYOffset || document.body.scrollTop;")
        # If the two are equal, we have reached the end
        if check_height == temp_height:
            break
        temp_height = check_height
    time.sleep(3)
    row_headers = []
    tableheads = driver.find_elements(By.CLASS_NAME, 'datatable-th')
    for value in tableheads:
        thead_values = value.find_element(By.CLASS_NAME, 'has-tooltip-bottom').text.strip()
        row_headers.append(thead_values)
    tablebodies = driver.find_elements(By.TAG_NAME, 'tr')
    for row in tablebodies:
        tabledata = row.find_elements(By.CSS_SELECTOR, 'tr, td')
        row_data = []
        for data in tabledata:
            row_data.append(data.text)
    df = pd.DataFrame(row_data, columns=row_headers)
    df

HDI()
Here is the error I get:
File "c:\Users\LP\Documents\python\HD1 2023\HDI2023.py", line 49, in HDI
df = pd.DataFrame(row_data, columns=row_headers)
File "C:\Users\LP\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\core\internals\construction.py", line 351, in ndarray_to_mgr
_check_values_indices_shape_match(values, index, columns)
File "C:\Users\LP\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\core\internals\construction.py", line 422, in _check_values_indices_shape_match
raise ValueError(f"Shape of passed values is {passed}, indices imply {implied}")
ValueError: Shape of passed values is (9, 1), indices imply (9, 9)
I want to save the above scraped values into a pandas DataFrame. That's my aim. Please help if you can. Thanks.
In your variable row_data you are only saving one row, and you are overwriting it in every iteration. You probably want to use all rows in your DataFrame. You can, for example, create a new variable row_data_all and pass that to your DataFrame:
row_data_all = []
for row in tablebodies:
    tabledata = row.find_elements(By.CSS_SELECTOR, 'tr, td')
    row_data = []
    for data in tabledata:
        row_data.append(data.text)
    row_data_all.append(row_data)

pd.DataFrame(row_data_all, columns=row_headers)
In case you really wanted to create a DataFrame from a single row, you should use:
pd.DataFrame(row_data, index = row_headers).T
Alternative
You can also use pandas' read_html() method, which only needs the HTML source code. You can even pass it the source code of the entire page, and it will return a list of DataFrames of the tables found in the source code. This will also speed up your function a lot.
html_table = driver.find_element(By.TAG_NAME, "table").get_attribute("outerHTML")
df = pd.read_html(html_table)[0]
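If you pass the whole page instead, you get a list back and pick the table you need. A sketch (the index 0 is an assumption that the ranking table is the first table read_html finds on the page):

import pandas as pd

# Parse every <table> in the rendered page in one call
dfs = pd.read_html(driver.page_source)
df = dfs[0]  # assumption: the HDI ranking table is the first table on the page
print(df.head())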
Goal: The goal of my project is to use BeautifulSoup (aka bs4) to scrape only the necessary data from an HTML file and import it into Excel. The HTML file is heavily formatted, so unfortunately I haven't been able to tailor more common solutions to my needs.
What I have tried: I have been able to parse the HTML file to the point where I am only pulling the tables I need, and I am able to detect every column of data and print it. For example, if there are a total of 18 columns and 3 rows of data, the code will output 54 times, with each piece of table data going from row 1 col 1 to row 3 col 18.
My code is as follows:
from bs4 import BeautifulSoup
import csv
import pandas as pd

url =
output =

#define table error to detect only tables with extractable data
def iserror(func, *args, **kw):
    try:
        func(*args, **kw)
        return False
    except Exception:
        return True

#read the html
with open(url) as html_file:
    soup = BeautifulSoup(html_file, 'html.parser')

#table = soup.find_all('table')
all_tables = soup.find_all('table')
#print(len(table))

df = pd.DataFrame(columns=(pks_col_names))
col_list = []
table_list = []

for tnum, tables in enumerate(all_tables):
    if iserror(all_tables[tnum].tbody): #Finds table with data
        table = tables.findAll('tr')
        #Loops through rows of each data table
        for rnum, row in enumerate(table):
            table_row = table[rnum].findAll('td')
            if len(table_row)==17:
                #Loops through columns of each data table
                for col in table_row:
                    col_list.append(col.string)
            else:
                pass
    else:
        pass
Example of data output currently achieved
row 1 column 1 (first string in list)
row 1 column 2
row 1 column 3 ...
row 3 column 17
row 3 column 18 (last string in list)
The current code creates a single flat list with the data shown above, but I am unable to figure out a way to convert that list into a pandas DataFrame, tying each entry to the appropriate row/column. Could anyone provide ideas on how to do this, or how to otherwise rework my code to import this data into a DataFrame?
It's all a bit mixed up: your function iserror in fact checks whether there is no error (and I don't think it works at all), what you call tables are really rows, and you don't need to enumerate.
As you haven't provided the data, I only ran rough tests, but this is a bit cleaner:
row_list = []
for table in all_tables:
    if is_ok(table.tbody): #Finds table with data
        rows = table.findAll('tr')
        #Loops through rows of each data table
        for row in rows:
            cols = row.findAll('td')
            col_list = []
            if len(cols)==17:
                #Loops through columns of each data table
                for col in cols:
                    col_list.append(col.string)
                row_list.append(col_list)

df = pd.DataFrame(row_list, columns=(pks_col_names))
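If you would rather keep the single flat col_list your current code builds, you could also chunk it into rows after the fact. A sketch, assuming every data row contributed exactly 17 cells in order:

n_cols = 17  # assumption: 17 cells per data row, appended row by row
rows = [col_list[i:i + n_cols] for i in range(0, len(col_list), n_cols)]
df = pd.DataFrame(rows, columns=pks_col_names)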
Thanks everyone for the help. I was able to achieve the desired DataFrame, and the final code looks like this:
url = 'insert url here'

#define table error to detect only tables with extractable data
def iserror(func, *args, **kw):
    try:
        func(*args, **kw)
        return False
    except Exception:
        return True

#All tables
all_tables = soup.findAll('table')

#Define column names
pks_col_names = ['TBD','Trade Date', 'Settle Date', 'Source', 'B/S', 'Asset Name', 'Security Description',
                 'Ticker','AccountID','Client Name','Shares','Price','Amount','Comms',
                 'Fees', 'Payout %','Payout']

row_list = []
for table in all_tables:
    if iserror(table.tbody): #Finds table with data
        rows = table.findAll('tr')
        #Loops through rows of each data table
        for row in rows:
            cols = row.findAll('td')
            col_list = []
            if len(cols)==17:
                #Loops through columns of each data table
                for col in cols:
                    col_list.append(col.string)
                row_list.append(col_list)

df = pd.DataFrame(row_list, columns=(pks_col_names))
df.to_csv(output, index=False, encoding='utf-8-sig')
First, thanks for taking the time to review my problem.
I am trying to get a web page into a list with BeautifulSoup and then pass it to a DataFrame.
# search for tbody and print each tr to a list
table_body = soup.find('tbody')
rows = table_body.find_all('tr')
res = []
for row in rows:
    cols = row.find_all('td')
    cols = [x.text.strip() for x in cols]
    # print(cols)
    if row:
        res.append(row)

# transfer python list to dataframe
df_wocaps = pd.DataFrame(res, columns=["drop1", "Platz", "Vorher", "Wertpapier", "Kurs", "drop2", "Perf", "drop3", "1 Monat", "Suchanfragen", "drop4", "drop5", "drop6"])
df_wocaps.head()

# drop unused columns
df = df_wocaps.drop(['drop1', 'drop2', 'drop3', 'drop4', 'drop5', 'drop6'], axis=1)
df.head()
At the end it looks like this:
dataframe picture
I want to extract only the relevant information, without all the HTML stuff and brackets. Anyone have an idea how to do that? Thanks in advance.
Use variable "cols" instead of rows which gives you the input data
for row in rows:
    cols = row.find_all('td')
    cols = [x.text.strip() for x in cols]
    # print(cols)
    if row:
        res.append(cols)
I have parsed a table and would like to convert two of its variables to a pandas DataFrame to print to Excel.
FYI:
I did ask a similar question; however, it was not answered thoroughly. There was no suggestion on how to create a pandas DataFrame, which was the whole point of my question.
Caution:
There is a small issue with the data that I parsed: it contains "TEAM" and "SA/G" multiple times in the output.
The 1st variable that I would like in the DataFrame is 'TEAM'.
The 2nd variable that I would like in the DataFrame is 'SA/G'.
Here is my code so far:
# imports
from selenium import webdriver
from bs4 import BeautifulSoup

# make a webdriver object
driver = webdriver.Chrome('C:\webdrivers\chromedriver.exe')

# open some page using get method - url -- > parameters
driver.get('http://www.espn.com/nhl/statistics/team/_/stat/scoring/sort/avgGoals')

# driver.page_source
soup = BeautifulSoup(driver.page_source, 'lxml')

#close driver
driver.close()

#find table
table = soup.find('table')

#find_all table rows
t_rows = table.find_all('tr')

#loop through tr to find_all td
for tr in t_rows:
    td = tr.find_all('td')
    row = [i.text for i in td]
    # print(row)
    # print(row[9])
    # print(row[1], row[9])
    team = row[1]
    sag = row[9]
    # print(team, sag)
    data = [(team, sag)]
    print(data)
Here is the final output that I would like printed to Excel using the pandas DataFrame option:
Team SA/G
Nashville 30.1
Colorado 33.6
Washington 31.0
... ...
Thanks in advance for any help that you may offer. I am still learning and appreciate any feedback that I can get.
Looks like you want to create a DataFrame from a list of tuples, which has been answered here.
I would change your code like this:
# Initial empty list
data = []

#loop through tr to find_all td
for tr in t_rows:
    td = tr.find_all('td')
    row = [i.text for i in td]
    team = row[1]
    sag = row[9]
    # Add tuple containing one row of data
    data.append((team, sag))

# Create df from list of tuples
df = pd.DataFrame(data, columns=['Team', 'SA/G'])

# Remove lines where Team value is "TEAM"
df = df[df["Team"] != "TEAM"]
EDIT: Add line to remove ("TEAM", "SA/G") rows in df
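To then get the result into a spreadsheet, pandas can write the DataFrame straight to Excel (this assumes an Excel writer such as openpyxl is installed; the filename is only an example):

# Write the cleaned DataFrame to an .xlsx file
df.to_excel('team_sag.xlsx', index=False)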
First, inside the for loop, append the tuples to a list: instead of doing data = [(x, y)], declare the data variable before the loop as a list (data = list()) and append each tuple to it inside the loop with data.append((x, y)). Then do the following:
import pandas as pd
data=[("t1","sag1"),("t2","sag2"),("t3","sag3")]
df = pd.DataFrame(data,columns=['Team','SA/G'])
print(df)
I'm currently looping through web pages, pulling the values from each <td> element and appending them to a list as text, which I want to export into an Excel spreadsheet.
The problem is that I want to copy the values from each web page into its own row on the spreadsheet, but I can only figure out how to append all the data to one list before sending it to Excel, so everything ends up in a single row.
I really need each web page on a separate row in Excel but can't figure out how to write it.
This is what I have:
import requests, bs4, xlsxwriter

td_text = []
row = 0
col = 0

def print_table():
    for i in range(1, 10):
        base_link = 'http://some/website/%d' % (i)
        try:
            res = requests.get(base_link)
            res.raise_for_status()
            techSoup = bs4.BeautifulSoup(res.text, 'html.parser')
            table = techSoup.find('table', attrs={'class':'table borderless'})
            for div in table:
                rows = div.findAll('td')
                for string in rows:
                    td_text.append(string.text)
                    print(string.text)
            send_excel(row, col)
        except requests.exceptions.HTTPError:
            print('Error: Invalid Website \n\n.')

def send_excel(row, col):
    workbook = xlsxwriter.Workbook('list.xlsx')
    worksheet = workbook.add_worksheet()
    row += 1
    worksheet.write_row(row, col, td_text)
    workbook.close()

print_table()
All the data is pulled from the websites correctly.
I can see that my issue is that all the data is appended to the list before I call write_row(), but I'm not sure how to write it so that each website is written to its own row in the spreadsheet as the loop iterates.
If you think about where your code is executing (in terms of local scope and your loops) you'll realize that you're opening and closing this file dozens of times (which is very inefficient), never incrementing your row counter, and never clearing your text data between requests. You only need to open and close the file once, and you'll want to write the row just once for each set of data. Try something like this:
import requests, bs4, xlsxwriter

workbook = xlsxwriter.Workbook('list.xlsx')
worksheet = workbook.add_worksheet()

for i in range(1, 10):
    td_text = []
    base_link = 'http://some/website/%d' % (i)
    try:
        res = requests.get(base_link)
        res.raise_for_status()
        techSoup = bs4.BeautifulSoup(res.text, 'html.parser')
        table = techSoup.find('table', attrs={'class':'table borderless'})
        for div in table:
            rows = div.findAll('td')
            for string in rows:
                td_text.append(string.text)
        worksheet.write_row(i, 0, td_text)
    except requests.exceptions.HTTPError:
        print('Error: Invalid Website \n\n.')

workbook.close()