Why can't I save my scraped HTML table to a pandas DataFrame? - python

I have a Python script that scrapes an HTML table. When I try to save my scraped data to a pandas DataFrame, I get an error. Please help me figure out what I am doing wrong.
Here is my code block:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
import time
import pandas as pd

def HDI():
    url = 'https://worldpopulationreview.com/country-rankings/hdi-by-country'
    service = Service(executable_path="C:/driver/new/chromedriver_win32/chromedriver.exe")
    driver = webdriver.Chrome(service=service)
    driver.get(url)
    time.sleep(5)
    btn = driver.find_element(By.CLASS_NAME, '_3p_1XEZR')
    btn.click()
    time.sleep(5)
    temp_height = 0
    while True:
        # Loop down the scroll bar
        driver.execute_script("window.scrollBy(0,500)")
        # Sleep and let the scroll bar react
        time.sleep(5)
        # Get the distance of the current scroll bar from the top
        check_height = driver.execute_script("return document.documentElement.scrollTop || window.window.pageYOffset || document.body.scrollTop;")
        # If the two are equal, we have reached the end
        if check_height == temp_height:
            break
        temp_height = check_height
    time.sleep(3)
    row_headers = []
    tableheads = driver.find_elements(By.CLASS_NAME, 'datatable-th')
    for value in tableheads:
        thead_values = value.find_element(By.CLASS_NAME, 'has-tooltip-bottom').text.strip()
        row_headers.append(thead_values)
    tablebodies = driver.find_elements(By.TAG_NAME, 'tr')
    for row in tablebodies:
        tabledata = row.find_elements(By.CSS_SELECTOR, 'tr, td')
        row_data = []
        for data in tabledata:
            row_data.append(data.text)
        df = pd.DataFrame(row_data, columns=row_headers)
    df

HDI()
Here is the error I get:
File "c:\Users\LP\Documents\python\HD1 2023\HDI2023.py", line 49, in HDI
df = pd.DataFrame(row_data, columns=row_headers)
File "C:\Users\LP\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\core\internals\construction.py", line 351, in ndarray_to_mgr
_check_values_indices_shape_match(values, index, columns)
File "C:\Users\LP\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\core\internals\construction.py", line 422, in _check_values_indices_shape_match
raise ValueError(f"Shape of passed values is {passed}, indices imply {implied}")
ValueError: Shape of passed values is (9, 1), indices imply (9, 9)
I want to save the scraped values above into a pandas DataFrame. That's my aim. Please help if you can. Thanks.

In your variable row_data you are only saving one row, and you overwrite it in every iteration. You probably want to use all rows in your DataFrame. You can, for example, create a new variable row_data_all and pass that to your DataFrame:
row_data_all = []
for row in tablebodies:
    tabledata = row.find_elements(By.CSS_SELECTOR, 'tr, td')
    row_data = []
    for data in tabledata:
        row_data.append(data.text)
    row_data_all.append(row_data)

pd.DataFrame(row_data_all, columns=row_headers)
In case you really want to create a DataFrame from a single row, you should use:
pd.DataFrame(row_data, index = row_headers).T
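For context on the original error: passing a flat list of 9 strings together with 9 column names makes pandas treat the list as 9 rows of a single column, hence the "(9, 1) vs (9, 9)" mismatch. A minimal sketch (with made-up values) that reproduces it and shows the equivalent fix of wrapping the row in a list:

import pandas as pd

headers = [f'col{i}' for i in range(9)]    # 9 column names
row = [str(i) for i in range(9)]           # one flat row of 9 values

# pd.DataFrame(row, columns=headers)       # ValueError: shape (9, 1), indices imply (9, 9)
df = pd.DataFrame([row], columns=headers)  # wrapping the row in a list gives shape (1, 9)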
Alternative
You can also use pandas' read_html() method, which only needs the HTML source code. You can even pass it the source of the entire page, and it will return a list of DataFrames for all tables found in it. This will also speed up your function a lot.
html_table = driver.find_element(By.TAG_NAME, "table").get_attribute("outerHTML")
df = pd.read_html(html_table)[0]
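For completeness, a minimal end-to-end sketch of that alternative; the StringIO wrapper is only needed on newer pandas versions, which warn about passing literal HTML strings:

from io import StringIO

html_table = driver.find_element(By.TAG_NAME, "table").get_attribute("outerHTML")
df = pd.read_html(StringIO(html_table))[0]
print(df.head())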

Related

Python / Selenium Find first instances of elements

I'm trying to copy search result tables from a website into an Excel sheet, but the data isn't very clean and it's causing some issues when outputting the lists into a pandas DataFrame.
There are 15 columns of data, but the last 2 have blank headers and 2 of them have duplicate headers, which (I think) is why I get the error "ValueError: All arrays must be of the same length".
Realistically I only need the first 9 columns of the table, which in this case means there won't be any duplicate or blank headers in the data anymore.
Is there a way to limit find_elements to get the first 9 columns of the table rather than all columns? Or to fix the headers so that there are no longer any duplicates or blanks?
Any help is greatly appreciated, thank you.
for x in result:
    driver.find_element(By.XPATH, '//*[@id="sidemenu"]/table/tbody/tr[1]/td/form/div[2]/input[1]').send_keys(x)
    driver.implicitly_wait(2)
    driver.find_element(By.XPATH, '//*[@id="navsrch"]').click()
    driver.implicitly_wait(2)
    headers = []
    columns = dict()
    table_id = driver.find_element(By.ID, 'invoice')
    all_rows = table_id.find_elements(By.TAG_NAME, "tr")
    row = all_rows[0]
    all_items = row.find_elements(By.TAG_NAME, "th")
    for item in all_items:
        name = item.text
        columns[name] = []
        headers.append(name)
    print(headers)
    for row in all_rows[1:]:
        all_items = row.find_elements(By.TAG_NAME, "td")
        for name, item in zip(headers, all_items):
            value = item.text
            columns[name].append(value)
    print(columns)
    df = pd.DataFrame(columns)
    print(df)
driver.close()
Fixed the issue by looping the specified number of times, iterating the XPath position on each loop and sending the data straight to Excel on each loop. Thanks for your help Aaron for the iterating idea & Arundeep for the XPath position idea.
for x in result:
    driver.find_element(By.XPATH, '//*[@id="sidemenu"]/table/tbody/tr[1]/td/form/div[2]/input[1]').send_keys(x)
    driver.implicitly_wait(2)
    driver.find_element(By.XPATH, '//*[@id="navsrch"]').click()
    driver.implicitly_wait(5)
    row = driver.find_elements(By.XPATH, '//*[@id="invoice"]/tbody/tr')
    lrow = len(row) + 1
    for i in range(1, lrow):
        l_e_r = len(list(ws.rows)) + 1
        for y in range(1, 10):
            rows = driver.find_element(By.XPATH, '//*[@id="invoice"]/tbody/tr[' + str(i) + ']/td[' + str(y) + ']').text
            ws.cell(row=l_e_r, column=y).value = rows
    wb.save("test.xlsx")
driver.close()
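If you would rather keep the pandas approach from the first attempt, a hedged sketch that simply slices each row to its first nine cells should also avoid the blank and duplicate headers. The 'invoice' table id is taken from the question; everything else here is an assumption:

table_id = driver.find_element(By.ID, 'invoice')
all_rows = table_id.find_elements(By.TAG_NAME, "tr")

# keep only the first nine header cells so no blank or duplicate names remain
headers = [th.text for th in all_rows[0].find_elements(By.TAG_NAME, "th")][:9]

data = []
for row in all_rows[1:]:
    cells = [td.text for td in row.find_elements(By.TAG_NAME, "td")][:9]
    if cells:  # skip rows without td cells
        data.append(cells)

df = pd.DataFrame(data, columns=headers)
print(df)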

How can I get the data of this table from HackerRank and filter it by country of origin and score for then exporting it as a csv file?

I'm learning web scraping with Python and I decided to test my skills on the HackerRank Leaderboard page, so I wrote the code below, expecting no errors, before adding the country restriction to the tester function and then exporting my CSV file.
But the Python console replied:
AttributeError: 'NoneType' object has no attribute 'find_all'
The error corresponds to line 29 of my code (for i in table.find_all({'class':'ellipsis'}):). I'm afraid there could be more syntax or logic errors, so I decided to come here and ask for feedback from experts.
from bs4 import BeautifulSoup
import requests
import pandas as pd
import numpy as np
from time import sleep
from random import randint

pd.set_option('display.max_columns', None)

# Declaring a variable for looping over all the pages
pages = np.arange(1, 93, 1)
a = pd.DataFrame()

# loop cycle
for url in pages:
    # get html for each new page
    url = 'https://www.hackerrank.com/leaderboard?page=' + str(url)
    page = requests.get(url)
    sleep(randint(3, 10))
    soup = BeautifulSoup(page.text, 'lxml')
    # get the table
    table = soup.find('header', {'class': 'table-header flex'})
    headers = []
    # get the headers of the table and delete the "white space"
    for i in table.find_all({'class': 'ellipsis'}):
        title = i.text.strip()
        headers.append(title)
    # set the headers to columns in a new dataframe
    df = pd.DataFrame(columns=headers)
    rows = soup.find('div', {'class': 'table-body'})
    # get the rows of the table but omit the first row (which are headers)
    for row in rows.find_all('table-row-wrapper')[1:]:
        data = row.find_all('table-row-column ellipsis')
        row_data = [td.text.strip() for td in data]
        length = len(df)
        df.loc[length] = row_data
    # set the data of the Txn Count column to float
    Txn = df['SCORE'].values
    # combine all the data rows in one single dataframe
    a = a.append(pd.DataFrame(df))

def tester(mejora):
    mejora = mejora[(mejora['SCORE'] > 2250.0)]
    return mejora.to_csv('new_test_Score_Count.csv')

tester(a)
Do you guys have any ideas or suggestions that could fix the problem?
The error states that your table element is None. I'm guessing here, but you can't get the table from the page with bs4 because it is loaded afterwards with JavaScript. I would recommend using Selenium for this instead.
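A minimal sketch of that idea, reusing the selectors from the question; whether they still match the rendered page, and whether a fixed sleep is long enough, are assumptions:

import time
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://www.hackerrank.com/leaderboard?page=1')
time.sleep(5)  # crude wait for the JavaScript-rendered leaderboard

soup = BeautifulSoup(driver.page_source, 'lxml')
driver.quit()

# same header selector as the question, now run against the rendered markup
header = soup.find('header', {'class': 'table-header flex'})
if header is not None:
    headers = [h.text.strip() for h in header.find_all(class_='ellipsis')]
    print(headers)
else:
    print('Leaderboard header still not found; the selector may need updating.')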

How to Create Pandas DataFrame out of Parsed Code using bs4/selenium on Python?

I have parsed a table and would like to convert two of those variables to a pandas DataFrame to print to Excel.
FYI:
I did ask a similar question; however, it was not answered thoroughly. There was no suggestion on how to create a pandas DataFrame, which was the whole point of my question.
Caution:
There is a small issue with the data that I parsed: it contains "TEAM" and "SA/G" multiple times in the output.
The 1st variable that I would like in the DataFrame is 'TEAM'.
The 2nd variable that I would like in the DataFrame is 'SA/G'.
Here is my code so far:
# imports
from selenium import webdriver
from bs4 import BeautifulSoup

# make a webdriver object
driver = webdriver.Chrome('C:\webdrivers\chromedriver.exe')

# open some page using get method - url -- > parameters
driver.get('http://www.espn.com/nhl/statistics/team/_/stat/scoring/sort/avgGoals')

# driver.page_source
soup = BeautifulSoup(driver.page_source, 'lxml')

# close driver
driver.close()

# find table
table = soup.find('table')

# find_all table rows
t_rows = table.find_all('tr')

# loop through tr to find_all td
for tr in t_rows:
    td = tr.find_all('td')
    row = [i.text for i in td]
    # print(row)
    # print(row[9])
    # print(row[1], row[9])
    team = row[1]
    sag = row[9]
    # print(team, sag)
    data = [(team, sag)]
    print(data)
Here is the final output that I would like printed to excel using the Pandas DataFrame option:
Team SA/G
Nashville 30.1
Colorado 33.6
Washington 31.0
... ...
Thanks in advance for any help that you may offer. I am still learning and appreciate any feedback that I can get.
Looks like you want to create a DataFrame from a list of tuples, which has been answered here.
I would change your code like this:
# Initial empty list
data = []

# loop through tr to find_all td
for tr in t_rows:
    td = tr.find_all('td')
    row = [i.text for i in td]
    team = row[1]
    sag = row[9]
    # Add tuple containing one row of data
    data.append((team, sag))

# Create df from list of tuples
df = pd.DataFrame(data, columns=['Team', 'SA/G'])

# Remove lines where Team value is "TEAM"
df = df[df["Team"] != "TEAM"]
EDIT: Add line to remove ("TEAM", "SA/G") rows in df
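Since the end goal is printing to Excel, a short hedged follow-up: assuming an Excel engine such as openpyxl is installed, and with a made-up output filename, the cleaned DataFrame can be written out directly.

# filename is just an example
df.to_excel('nhl_team_stats.xlsx', index=False)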
First, inside the for loop, append tuples to a list: instead of doing data = [(x, y)], declare the data variable before the loop as a list (data = list()) and append the tuples to it inside the loop with data.append((x, y)). Then do the following:
import pandas as pd

data = [("t1", "sag1"), ("t2", "sag2"), ("t3", "sag3")]
df = pd.DataFrame(data, columns=['Team', 'SA/G'])
print(df)

Extracting Tables From Different Sites With BeautifulSoup IN A LOOP

I have extracted a table from a site with the help of BeautifulSoup. Now I want to keep this process going in a loop over several different URLs. If possible, I would like to extract these tables into different Excel documents, or different sheets within one document.
I have been trying to put the code through a loop and append the df:
from bs4 import BeautifulSoup
import requests
import pandas as pd

xl = pd.ExcelFile(r'path/to/file.xlsx')
link = xl.parse('Sheet1')

# this is what I can't figure out
for i in range(0, 10):
    try:
        url = link['Link'][i]
        html = requests.get(url).content
        df_list = pd.read_html(html)
        soup = BeautifulSoup(html, 'lxml')
        table = soup.select_one('table:contains("Fees Earned")')
        df = pd.read_html(str(table))
        list1.append(df)
    except ValueError:
        print('Value')
        pass

# Not as important
a = df[0]
writer = pd.ExcelWriter('mytables.xlsx')
a.to_excel(writer, 'Sheet1')
writer.save()
I get a ValueError ("no tables found") for the first nine tables, and only the last table is printed when I print my list. However, when I run the links without the for loop, one at a time, it works.
I can't append the value of df[i] because it says "index out of range".
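For the "different sheets within a document" part, a hedged sketch of how the collected tables could be written out after the loop, assuming list1 was initialised before the loop (list1 = []) and each entry is the list returned by pd.read_html; the sheet names are made up:

with pd.ExcelWriter('mytables.xlsx') as writer:
    for i, tables in enumerate(list1):
        # write the first parsed table from each URL to its own sheet
        tables[0].to_excel(writer, sheet_name='Sheet%d' % (i + 1), index=False)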

Why is my for loop overwriting instead of appending CSV?

I am trying to scrape the IB (Interactive Brokers) website. I have created the URLs to iterate over, and I am able to extract the required information, but it seems the DataFrame keeps being overwritten instead of appended to.
import pandas as pd
from pandas import DataFrame as df
from bs4 import BeautifulSoup
import csv
import requests

base_url = "https://www.interactivebrokers.com/en/index.phpf=2222&exch=mexi&showcategories=STK&p=&cc=&limit=100"

n = 1
url_list = []
while n <= 2:
    url = (base_url + "&page=%d" % n)
    url_list.append(url)
    n = n + 1

def parse_websites(url_list):
    for url in url_list:
        html_string = requests.get(url)
        soup = BeautifulSoup(html_string.text, 'lxml')  # Parse the HTML as a string
        table = soup.find('div', {'class': 'table-responsive no-margin'})  # Grab the first table
        df = pd.DataFrame(columns=range(0, 4), index=[0])  # I know the size
        for row_marker, row in enumerate(table.find_all('tr')):
            column_marker = 0
            columns = row.find_all('td')
            try:
                df.loc[row_marker] = [column.get_text() for column in columns]
            except ValueError:
                # It's a safe way when [column.get_text() for column in columns] is an empty list.
                continue
        print(df)
        df.to_csv('path_to_file\\test1.csv')

parse_websites(url_list)
Can you please take a look at my code and advise what I am doing wrong?
One solution, if you want to append the data frames to the file, is to write in append mode:
df.to_csv('path_to_file\\test1.csv', mode='a', header=False)
Otherwise you should create the data frame outside the loop, as mentioned in the comments.
If you define a data structure from within a loop, each iteration of the loop will redefine it, meaning that the previous work is overwritten. The DataFrame should be defined outside of the loop if you do not want it to be overwritten.
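A hedged sketch of that advice applied to the function above: collect all rows from every page first, then build one DataFrame and write the CSV once. The div selector is reused from the question; the output filename is simplified here:

def parse_websites(url_list):
    all_rows = []
    for url in url_list:
        html_string = requests.get(url)
        soup = BeautifulSoup(html_string.text, 'lxml')
        table = soup.find('div', {'class': 'table-responsive no-margin'})
        for row in table.find_all('tr'):
            cells = [column.get_text(strip=True) for column in row.find_all('td')]
            if cells:  # skip header/empty rows
                all_rows.append(cells)
    # one DataFrame built after the loop, written out a single time
    df = pd.DataFrame(all_rows)
    df.to_csv('test1.csv', index=False)

parse_websites(url_list)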
