AttributeError: 'ResultSet' object has no attribute 'find_all' - pd.read_html - python

I am trying to extract the data from a table from a webpage, but keep receiving the above error. I have looked at the examples on this site, as well as others, but none deal directly with my problem. Please see code below:
from bs4 import BeautifulSoup
import requests
import pandas as pd
url = 'http://www.espn.com/nhl/statistics/player/_/stat/points/sort/points/year/2015/seasontype/2'
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, "lxml")
table = soup.find_all('table', class_='dataframe')
rows = table.find_all('tr')[2:]
data = {
    'RK' : [],
    'PLAYER' : [],
    'TEAM' : [],
    'GP' : [],
    'G' : [],
    'A' : [],
    'PTS' : [],
    '+/-' : [],
    'PIM' : [],
    'PTS/G' : [],
    'SOG' : [],
    'PCT' : [],
    'GWG' : [],
    'G1' : [],
    'A1' : [],
    'G2' : [],
    'A2' : []
}
for row in rows:
    cols = row.find_all('td')
    data['RK'].append( cols[0].get_text() )
    data['PLAYER'].append( cols[1].get_text() )
    data['TEAM'].append( cols[2].get_text() )
    data['GP'].append( cols[3].get_text() )
    data['G'].append( cols[4].get_text() )
    data['A'].append( cols[5].get_text() )
    data['PTS'].append( cols[6].get_text() )
    data['+/-'].append( cols[7].get_text() )
    data['PIM'].append( cols[8].get_text() )
    data['PTS/G'].append( cols[9].get_text() )
    data['SOG'].append( cols[10].get_text() )
    data['PCT'].append( cols[11].get_text() )
    data['GWG'].append( cols[12].get_text() )
    data['G1'].append( cols[13].get_text() )
    data['A1'].append( cols[14].get_text() )
    data['G2'].append( cols[15].get_text() )
    data['A2'].append( cols[16].get_text() )
df = pd.DataFrame(data)
df.to_csv("NHL_Players_Stats.csv")
I have eliminated the error: seeing that it referred to the ResultSet (the table variable) not having the method find_all, I got the code running by commenting out the following line:
#rows = table.find_all('tr')[2:]
and changing this:
for row in rows:
This, however, does not extract any data from the webpage and simply creates a .csv file with column headers.
I have tried to extract some data directly into rows using soup.find_all, but get the following error:
data['GP'].append( cols[3].get_text() )
IndexError: list index out of range
which I have not been able to resolve.
Therefore, any help would be very much appreciated.
Also, out of curiosity, are there any ways to achieve the desired outcome using:
dataframe = pd.read_html('url')
because I have tried this also, but keep getting:
FeatureNotFound: Couldn't find a tree builder with the features you
requested: html5lib. Do you need to install a parser library?
Ideally this is the method that I would prefer, but can't find any examples online.

find_all returns a ResultSet, which is basically a list of elements. For this reason, it has no method find_all, as this is a method that belongs to an individual element.
If you only want one table, use find instead of find_all to look for it.
table = soup.find('table', class_='dataframe')
Then, getting its rows should work as you have already done:
rows = table.find_all('tr')[2:]
The second error you got is because, for some reason, one of the table's rows seems to have only 3 cells, thus your cols variable became a list with only indexes 0, 1 and 2. That's why cols[3] gives you an IndexError.
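One way to guard against such short rows (header and separator rows often have fewer cells) is to skip any row that does not have the full set of cells. A minimal sketch, reusing the loop from the question:
for row in rows:
    cols = row.find_all('td')
    if len(cols) < 17:  # skip header/separator rows that have fewer cells
        continue
    data['RK'].append(cols[0].get_text())
    # ... and so on for the remaining 16 columns, as in the original loop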

In terms of achieving the same outcome using:
dataframe = pd.read_html('url')
It can be achieved using just that, or something similar:
dataframe = pd.read_html(url, header=1, index_col=None)
The reason why I was receiving errors previously is that I had not set Spyder's IPython console backend to 'automatic' in 'Preferences'.
I am still, however, trying to resolve this problem using BeautifulSoup. So any useful comments would be appreciated.
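For reference, here is a minimal end-to-end sketch of the read_html route, assuming a parser such as lxml or html5lib is installed (e.g. pip install lxml):
import pandas as pd

url = 'http://www.espn.com/nhl/statistics/player/_/stat/points/sort/points/year/2015/seasontype/2'

# read_html returns a list of DataFrames, one per table found on the page
tables = pd.read_html(url, header=1, index_col=None)
df = tables[0]
df.to_csv("NHL_Players_Stats.csv")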

Related

Pandas DF.output write to columns (current data is written all to one row or one column)

I am using Selenium to extract data from the HTML body of a webpage and am writing the data to a .csv file using pandas.
The data is extracted and written to the file, however I would like to manipulate the formatting of the data to write to specified columns, after reading many threads and docs I am not able to understand how to do this.
The current CSV file output is as follows, with all the data in one column:
0,
B09KBFH6HM,
dropdownAvailable,
90,
1,
B09KBNJ4F1,
dropdownAvailable,
100,
2,
B09KBPFPCL,
dropdownAvailable,
110
or, if I use the [count] indexing with count += 1 method, it will all be in one row:
0,B09KBFH6HM,dropdownAvailable,90,1,B09KBNJ4F1,dropdownAvailable,100,2,B09KBPFPCL,dropdownAvailable,110
I would like the output to be formatted as follows:
/col1 /col2 /col3 /col4
0, B09KBFH6HM, dropdownAvailable, 90,
1, B09KBNJ4F1, dropdownAvailable, 100,
2, B09KBPFPCL, dropdownAvailable, 110
I have tried using the columns= option but get errors in the terminal, and after reading many threads and the docs on append I still don't understand which feature I should be using to achieve this:
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.append.html?highlight=append#pandas.DataFrame.append
A simplified version is as follows:
from selenium import webdriver
import pandas as pd
price = []
driver = webdriver.Chrome("./chromedriver")
driver.get("https://www.example.co.jp/dp/zzzzzzzzzz/")
select_box = driver.find_element_by_name("dropdown_selected_size_name")
options = [x for x in select_box.find_elements_by_tag_name("option")]
for element in options:
    price.append(element.get_attribute("value"))
    price.append(element.get_attribute("class"))
    price.append(element.get_attribute("data-a-html-content"))
output = pd.DataFrame(price)
output.to_csv("Data.csv", encoding='utf-8-sig')
driver.close()
Do I need to parse each item separately and append?
I would like each of the .get_attribute values to be written to a new column.
Is there any advice you can offer for a solution to this, as I am not very proficient at pandas? Thank you for your help.
Approach similar to #user17242583, but a little shorter:
data = [[e.get_attribute("value"), e.get_attribute("class"), e.get_attribute("data-a-html-content")] for e in options]
df = pd.DataFrame(data, columns=['ASIN', 'dropdownAvailable', 'size']) # third column maybe is the product size
df.to_csv("Data.csv", encoding='utf-8-sig')
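Building each row as one inner list in the comprehension keeps the three attributes of a single option together, so each attribute ends up in its own column of the resulting DataFrame.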
Adding all your items to the price list is going to cause them all to be in one column. Instead, store a separate list for each column, in a dict, like this (name them whatever you want):
data = {
    'values': [],
    'classes': [],
    'data_a_html_contents': [],
}
...
for element in options:
    data['values'].append(element.get_attribute("value"))
    data['classes'].append(element.get_attribute("class"))
    data['data_a_html_contents'].append(element.get_attribute("data-a-html-content"))
...
output = pd.DataFrame(data)
output.to_csv("Data.csv", encoding='utf-8-sig')
You were collecting the value, class, and data-a-html-content and appending them all to the same list, price. Hence, the list becomes:
price = [value1, class1, data-a-html-content1, value2, class2, data-a-html-content2, ...]
and within the dataframe all of these values end up stacked in a single column.
Solution
To get value, class and data-a-html-content in separate columns you can adopt either of the two approaches below:
Pass a dictionary to the dataframe.
Pass a list of lists to the dataframe.
While #user17242583 and #h.devillefletcher suggest a dictionary, you can still achieve the same using a list of lists, as follows:
from selenium import webdriver
import pandas as pd

values = []
classes = []
data_a_html_contents = []  # hyphens are not valid in Python identifiers, so use underscores

driver = webdriver.Chrome("./chromedriver")
driver.get("https://www.example.co.jp/dp/zzzzzzzzzz/")
select_box = driver.find_element_by_name("dropdown_selected_size_name")
options = [x for x in select_box.find_elements_by_tag_name("option")]
for element in options:
    values.append(element.get_attribute("value"))
    classes.append(element.get_attribute("class"))
    data_a_html_contents.append(element.get_attribute("data-a-html-content"))
df = pd.DataFrame(data=list(zip(values, classes, data_a_html_contents)), columns=['Value', 'Class', 'Data-a-Html-Content'])
df.to_csv("Data.csv", encoding='utf-8-sig')
References
You can find a couple of relevant detailed discussions in:
Selenium: Web-Scraping Historical Data from Coincodex and transform into a Pandas Dataframe
Python Selenium: How do I print the values from a website in a text file?

Unable to drop columns in table

Beginner here.
With the help of someone here I was able to extract the second and third Tables on this page (Team Statistics and Team Analytics 5-on-5) that included this last part:
for each in comments:
    if 'table' in str(each):
        try:
            tables.append(pd.read_html(each, header=1)[0])
            tables = tables[tables['Rk'].ne('Rk')]
            tables = tables.rename(columns={'Unnamed: 1':'Team'})
        except:
            for table in tables[1:3]:
                print(table)
They are standard dataframes, but I just can't figure out how to drop some columns from them. I've tried to do this using df.drop:
for each in comments:
    if 'table' in str(each):
        try:
            tables.append(pd.read_html(each, header=1)[0])
            tables = tables[tables['Rk'].ne('Rk')]
            tables = tables.rename(columns={'Unnamed: 1':'Team'})
        except:
            for table in tables[1:3]:
                df = pd.read_table = [1]
                df = df.drop({"AvAge", "GP", "W", "L", "OL", "PTS", "GF", "GA", "SOW", "SOL", "SOS", "PP", "PPO", "PP%", "PPA", "PPOA", "PK%", "SH", "SHA", "PIM/G", "oPIM/G", "S", "SA", "SO"}, 1)
                print(df)
                df = pd.read_table = [2]
                df = df.drop({"S%", "SV%", "CF", "CA", "FF", "FA", "xGF", "xGA", "aGF", "aGA", "SCF", "SCA", "HDF", "HDA", "HDGF", "HDGA"}, 1)
                print(df)
but I got this error:
AttributeError: 'list' object has no attribute 'drop'
It feels like there's a problem with how I'm using "df" and "table", but I'm not sure at all. And this is where I'm stuck for the moment.
Thanx in advance!
No, the problem is with the compound assignment statement.
df = pd.read_table = [1]
print(df)
print(pd.read_table)
Output:
[1]
[1]
This code assigns [1] to both df and pd.read_table. Then the code calls df.drop() but df is a list and list does not have a drop() method. More troubling is that the code assigns a list to the pd.read_table callable. I'm uncertain what you're trying to do here, but this is certainly the source of your error.
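As a minimal sketch of what was probably intended (assuming tables is the list of DataFrames built in the loop above), index into the list and then drop columns by name:
# Sketch, assuming `tables` holds the DataFrames parsed earlier.
df = tables[1]                                   # pick a table from the list
df = df.drop(columns=["AvAge", "GP", "W", "L"])  # drop unwanted columns by name
print(df)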

BeautifulSoup: Repeated columns

I am trying to scrape data from a webpage that contains a table to then put it in a pandas data frame.
I believe I have done everything correctly, but I get repeated columns...
Here is my code:
html_content = requests.get('http://timetables.itsligo.ie:81/reporting/individual;student+set;id;SG_KAPPL_H08%2FF%2FY1%2F1%2F%28A%29%0D%0A?t=student+set+individual&days=1-5&weeks=29&periods=3-20&template=student+set+individual').text
soup = BeautifulSoup(html_content,'html.parser')
all_tables = soup.find_all('table')
wanted_table = all_tables[6]
first_tr = wanted_table.find('tr')
following_tr = first_tr.find_next_siblings()
details = []
for tr in following_tr:
    prepare = []
    for td in tr.find_all('td'):
        prepare.append(td.text)
    details.append(prepare)
df = pd.DataFrame(details)
pd.set_option('display.max_columns', None)
display(df)
Which works great, but as you can see in the picture below (columns 1 and 2 in row 0), I'm getting repeated td's, and one always has \n repeated.
The thing I noticed is that the details list comes back with everything doubled for some reason; maybe there is a table nested in a table?
I'm doing this in Jupyter, by the way.
Thank you in advance!
The reason your details list is nested is because you are constructing it that way; that is, if you append a list (prepare) to another list (details), you get a nested list. See here. And this is okay, since it works well to be read into your DataFrame.
Still, you are correct that there is a nested table thing going on in the HTML. I won't try to format the HTML here, but each box in the schedule is a <td> within the overarching wanted_table. When there is a course in one of those cells, there is another <table> used to hold the details. So the class name, professor, etc. are more <td> elements within this nested <table>. So when finding all the cells (tr.find_all('td')), you encounter both the larger class box, as well as its nested elements. And when you get the .text on the outermost <td>, you also get the text from the innermost cells, hence the repetition.
I am not sure if this is the best way, but one option would be to prevent the search from entering the nested table, using the recursive parameter in find_all.
# all your other code above
for tr in following_tr:
    prepare = []
    for td in tr.find_all('td', recursive=False):
        prepare.append(td.text)
    details.append(prepare)
df = pd.DataFrame(details)
The above should prevent the repeated elements from appearing. However, there is still the problem of having many \n characters, as well as not accounting for the fact that some cells span multiple columns. You can start to fix the first by stripping the text. For the second, you can access the colspan attribute to pad the prepare list:
# all your other code above
for tr in following_tr:
    prepare = []
    for td in tr.find_all('td', recursive=False):
        text = td.text.strip()  # strip() removes surrounding whitespace, including newlines
        prepare += [text] + [None] * (int(td.get('colspan', 0)) - 1)
    details.append(prepare)
df = pd.DataFrame(details)
It's a little too unwieldy to post the output. And there is still formatting you will likely want to do, but that is getting beyond the scope of your original post. Hopefully something in here helps!
import pandas as pd
url = 'https://en.wikipedia.org/wiki/The_World%27s_Billionaires'
df_list = pd.read_html(url)
len(df_list)
Output:
32
After specifying na_values as below:
pd.read_html(
    url,
    na_values=["Forbes: The World's Billionaires website"]
)[0].tail()

How to add Items to a Single Dictionary from Multiple For Loops?

I started learning Python recently. What I am basically doing is scraping data from a website and adding the results to a list of dictionaries.
This is what the final structure should look like:
This is basically my scraping code. I had to use two for loops, since the elements to target are present at different positions on the webpage (one for the title and another for the description):
jobslist = []
for item in title:
    MainTitle = item.text
    mydict = {
        'title' : MainTitle,
    }
    jobslist.append(mydict)

for i in link:
    links = i['href']
    r2 = requests.get(links, headers = headers)
    soup2 = BeautifulSoup(r2.content,'lxml')
    entry_content = soup2.find('div', class_ ='entry-content')
    mydict = {
        'description' : entry_content
    }
    jobslist.append(mydict)
Finally, I save to a CSV (using the pandas library, imported as pd):
df = pd.DataFrame(jobslist)
df.to_csv('data.csv')
But the output is quite strange: the descriptions are added below the titles and not side by side. This is the screenshot:
How can I align it side by side ?
Disclaimer: It's hard to give a perfect answer because your code is not reproducible; I have no idea what your data looks like, nor what you're trying to do, so I can't really test anything.
From what I understand of your code, it looks like the dictionaries are completely unnecessary. You have a list of titles, and a list of descriptions. So be it:
titles_list = []
for item in title:
    titles_list.append(item.text)

descriptions_list = []
for i in link:
    links = i['href']
    r2 = requests.get(links, headers = headers)
    soup2 = BeautifulSoup(r2.content,'lxml')
    entry_content = soup2.find('div', class_ ='entry-content')
    descriptions_list.append(entry_content)

df = pd.DataFrame(data = {'title': titles_list, 'description': descriptions_list})  # here we use a dict of lists instead of a list of dicts
df.to_csv('data.csv')

How to pull values from a table with no defining characteristics?

I am trying to pull part numbers from a cross-reference website, but when I inspect the element, the only tags used around the table are tr, td, tbody, and table, which are used in many other places on the page. Currently I am using Beautiful Soup and Selenium, and I am looking into using lxml.html for its XPath tool, but I can't seem to get Beautiful Soup to work with it.
The website I am trying to pull values from is:
https://jdparts.deere.com/servlet/com.deere.u90.jdparts.view.servlets.searchcontroller.PartialPartNumberSearchController?action=UNSIGNED_VIEW
Technically I only want the Part Number, Make, Part No., Part Type, and Description values, but I can deal with getting the whole table.
When I use
html2 = browser.page_source
source = soup(html2, 'html.parser')
for article in source.find_all('td', valign='middle'):
    PartNumber = article.text.strip()
    number.append(PartNumber)
it gives me all the values on the page and several blank values all in a single line of text, which would be just as much work to sift through as just manually pulling the values.
Ultimately I am hoping to get the values in the table and formatted to look like the table and I can just delete the columns I don't need. What would be the best way to go about gathering the information in the table?
One approach would be to find the Qty. text, which is an element at the start of the table you want, and then look for the previous table. You could then iterate over the tr elements and produce a row of values from all the td elements in each row.
The Python itemgetter() function could be useful here, as it lets you extract the elements you want (in any order) from a bigger list. In this example, I have chosen items 1,2,3,4,5, but if say Make wasn't needed, you could provide 1,3,4,5.
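As a quick standalone illustration of itemgetter (separate from the scraping code below):
from operator import itemgetter

pick = itemgetter(1, 3)
print(pick(['a', 'b', 'c', 'd']))  # ('b', 'd')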
The search results might have multiple pages of results, if this is the case it checks for a Next Page button and if present adjusts params to get the next page of results. This continues until no next page is found:
from operator import itemgetter
import requests
from bs4 import BeautifulSoup
import csv

search_term = "AT2*"

params = {
    "userAction" : "search",
    "browse" : "",
    "screenName" : "partSearch",
    "priceIdx" : 1,
    "searchAppType" : "",
    "searchType" : "search",
    "partSearchNumber" : search_term,
    "pageIndex" : 1,
    "endPageIndex" : 100,
}

url = 'https://jdparts.deere.com/servlet/com.deere.u90.jdparts.view.servlets.searchcontroller.PartNumberSearch'
req_fields = itemgetter(1, 2, 3, 4, 5)
page_index = 1
session = requests.Session()
start_row = 0

with open('output.csv', 'w', newline='') as f_output:
    csv_output = csv.writer(f_output)

    while True:
        print(f'Page {page_index}')
        req = session.post(url, params=params)
        soup = BeautifulSoup(req.content, 'html.parser')
        table = soup.find(text='Qty.').find_previous('table')

        for tr in table.find_all('tr')[start_row:]:
            row = req_fields([value.get_text(strip=True) for value in tr.find_all('td')])

            if row[0]:
                csv_output.writerow(row)

        if soup.find(text='Next Page'):
            start_row = 2
            params = {
                "userAction" : "NextPage",
                "browse" : "NextPage",
                "pageIndex" : page_index,
                "endPageIndex" : 15,
            }
            page_index += 1
        else:
            break
Which would give you an output.csv file starting:
Part Number,Make,Part No.,Part Type,Description
AT2,Kralinator,PMTF15013,Filters,Filter
AT2,Kralinator,PMTF15013J,Filters,Filter
AT20,Berco,T139464,Undercarriage All Makes,Spring Pin
AT20061,A&I Products,A-RE29882,Clutch,Clutch Disk
Note: This makes use of requests instead of Selenium, as it is much faster.
