I'm trying to parse a number of columns in a table into a dictionary using Selenium, but what I have seems slow. I'm using python, Selenium 2.0, and webdriver.Chrome()
table = self.driver.find_element_by_id("thetable")
# now get all the TR elements from the table
all_rows = table.find_elements_by_tag_name("tr")
# and iterate over them, getting the cells
for row in all_rows:
    cells = row.find_elements_by_tag_name("td")
    # slowwwwwwwwwwwwww
    dict_value = {'0th': cells[0].text,
                  '1st': cells[1].text,
                  '2nd': cells[2].text,
                  '3rd': cells[3].text,
                  '6th': cells[6].text,
                  '7th': cells[7].text,
                  '10th': cells[10].text}
The problem seems to be getting the 'text' attribute of each td element. Is there a faster way?
Alternative option.
If, later (after the loop), you don't need the interactivity that selenium provides, you can pass the current HTML source of the page to lxml.html, which is known for its speed. Example:
import lxml.html

root = lxml.html.fromstring(driver.page_source)
for row in root.xpath('.//table[@id="thetable"]//tr'):
    cells = row.xpath('.//td/text()')
    dict_value = {'0th': cells[0],
                  '1st': cells[1],
                  '2nd': cells[2],
                  '3rd': cells[3],
                  '6th': cells[6],
                  '7th': cells[7],
                  '10th': cells[10]}
Related
I am attempting to iterate through some BeautifulSoup data and get the headers from the table header (th) elements and place them into a list. Currently, my code is extracting more than just the header text, and is pulling in part of the BS4 tags.
See attached image (result1) for current results.
Code:
column_names = []

def extract_column_from_header(row):
    if row.br:
        row.br.extract()
    if row.a:
        row.a.extract()
    if row.sup:
        row.sup.extract()
    column_name = ' '.join(row.contents)
    if not column_name.strip().isdigit():
        column_name = column_name.strip()
        return column_name
soup = first_launch_table.find_all('th')
for i in soup:
    name = extract_column_from_header(i)
    if name is not None and len(name) > 0:
        column_names.append(i)
The question needs improvement (HTML, expected result, ...), so this only points in a direction: you could use CSS selectors to select all th elements in the table and extract their content with .text while iterating in a list comprehension:
[th.text for th in soup.select('table th')]
Or based on your example:
[th.text for th in first_launch_table.find_all('th')]
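Note, too, that the loop in the question appends the tag object i rather than the extracted string name, which by itself puts BS4 tags into the list. A sketch of the corrected loop, using made-up HTML as a stand-in for first_launch_table:

```python
from bs4 import BeautifulSoup

# Illustrative stand-in for first_launch_table
first_launch_table = BeautifulSoup(
    "<table><tr><th>Flight No.</th><th>Date</th></tr></table>", "html.parser")

column_names = []
for th in first_launch_table.find_all('th'):
    name = th.text.strip()
    if name:
        column_names.append(name)   # append the string, not the tag object

print(column_names)  # ['Flight No.', 'Date']
```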
I am trying to scrape data from a webpage that contains a table, to then put it in a pandas DataFrame.
I believe I have done everything correctly, but I get repeated columns...
Here is my code:
html_content = requests.get('http://timetables.itsligo.ie:81/reporting/individual;student+set;id;SG_KAPPL_H08%2FF%2FY1%2F1%2F%28A%29%0D%0A?t=student+set+individual&days=1-5&weeks=29&periods=3-20&template=student+set+individual').text
soup = BeautifulSoup(html_content,'html.parser')
all_tables = soup.find_all('table')
wanted_table = all_tables[6]
first_tr = wanted_table.find('tr')
following_tr = first_tr.find_next_siblings()
details = []
for tr in following_tr:
    prepare = []
    for td in tr.find_all('td'):
        prepare.append(td.text)
    details.append(prepare)
df = pd.DataFrame(details)
pd.set_option('display.max_columns', None)
display(df)
Which works great, but as you can see in the picture below (columns 1 and 2 in row 0), I'm getting repeated td values, and one always has \n repeated.
The thing I noticed is that the details list contains doubles for some reason; maybe there is a table nested in a table?
I'm doing this in Jupyter, by the way.
Thank you in advance!
The reason your details list is nested is because you are constructing it that way; that is, if you append a list (prepare) to another list (details), you get a nested list. See here. And this is okay, since it works well when read into your DataFrame.
Still, you are correct that there is a nested table thing going on in the HTML. I won't try to format the HTML here, but each box in the schedule is a <td> within the overarching wanted_table. When there is a course in one of those cells, there is another <table> used to hold the details. So the class name, professor, etc. are more <td> elements within this nested <table>. So when finding all the cells (tr.find_all('td')), you encounter both the larger class box, as well as its nested elements. And when you get the .text on the outermost <td>, you also get the text from the innermost cells, hence the repetition.
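To see the duplication concretely, here is a toy nested table (illustrative, not the real timetable HTML). The default find_all descends into the inner table, and the outer cell's .text repeats the inner cells' text:

```python
from bs4 import BeautifulSoup

# Toy nested-table HTML standing in for one schedule box
html = ("<table><tr><td>Course A"
        "<table><tr><td>Course A</td><td>Dr. X</td></tr></table>"
        "</td></tr></table>")
soup = BeautifulSoup(html, 'html.parser')
outer_tr = soup.table.tr

# find_all descends into the nested table: outer cell plus both inner cells
cells = outer_tr.find_all('td')
print(len(cells))     # 3
# The outer cell's .text concatenates its own text with the inner cells' text
print(cells[0].text)  # Course ACourse ADr. X
```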
I am not sure if this is the best way, but one option would be to prevent the search from entering the nested table, using the recursive parameter in find_all.
# all your other code above
for tr in following_tr:
    prepare = []
    for td in tr.find_all('td', recursive=False):
        prepare.append(td.text)
    details.append(prepare)
df = pd.DataFrame(details)
The above should prevent the repeated elements from appearing. However, there is still the problem of the many \n characters, as well as the fact that some cells span multiple columns. You can start to fix the first by stripping the text. For the second, you can read the colspan attribute to pad the prepare list:
# all your other code above
for tr in following_tr:
    prepare = []
    for td in tr.find_all('td', recursive=False):
        text = td.text.strip()  # remove surrounding whitespace and newlines
        prepare += [text] + [None] * (int(td.get('colspan', 0)) - 1)
    details.append(prepare)
df = pd.DataFrame(details)
It's a little too unwieldy to post the output. And there is still formatting you will likely want to do, but that is getting beyond the scope of your original post. Hopefully something in here helps!
import pandas as pd
url = 'https://en.wikipedia.org/wiki/The_World%27s_Billionaires'
df_list = pd.read_html(url)
len(df_list)
Output:
32
After specifying na_values, as below:
pd.read_html(
    url,
    na_values=["Forbes: The World's Billionaires website"]
)[0].tail()
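The effect of na_values can be checked on a small literal table (this toy HTML stands in for the Wikipedia page, and assumes pandas has an HTML parser such as lxml available):

```python
import pandas as pd
from io import StringIO

# Toy table standing in for the scraped page
html = """<table>
  <tr><th>No.</th><th>Name</th></tr>
  <tr><td>1</td><td>placeholder text</td></tr>
</table>"""

# Any cell matching an na_values entry is parsed as NaN
df = pd.read_html(StringIO(html), na_values=["placeholder text"])[0]
print(df['Name'].isna().iloc[0])  # True
```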
I am trying to retrieve the code as well as the title, but somehow I am not able to. The website is
https://www.unspsc.org/search-code/default.aspx?CSS=51%&Type=desc&SS%27
Here I have tried to get the values from the table:
import requests
from bs4 import BeautifulSoup

unspsc_link = "https://www.unspsc.org/search-code/default.aspx?CSS=51%&Type=desc&SS%27"
link = requests.get(unspsc_link).text
soup = BeautifulSoup(link, 'lxml')
print(soup.prettify())

all_table = soup.find_all('table')
print(all_table)

right_table = soup.find_all('table', id="dnn_ctr1535_UNSPSCSearch_gvDetailsSearchView")
tables = right_table.find_all('td')
print(tables)
The error is: AttributeError: ResultSet object has no attribute 'find'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?
I expect to save the code as well as the title in a list, and save it in a DataFrame later.
Is there any way to continue to the next page without manually providing values in the search code (like 51%)? There are more than 20 pages inside 51%.
From the documentation:
AttributeError: 'ResultSet' object has no attribute 'foo' - This usually happens because you expected find_all() to return a single tag or string. But find_all() returns a list of tags and strings, a ResultSet object. You need to iterate over the list and look at the .foo of each one. Or, if you really only want one result, you need to use find() instead of find_all().
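The difference is easy to demonstrate on a small illustrative snippet:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<table id='t1'><td>x</td></table>", 'html.parser')

result_set = soup.find_all('table')   # a ResultSet (list-like), even with one match
one_tag = soup.find('table')          # a single Tag, or None if nothing matches

print(type(result_set).__name__)  # ResultSet
print(one_tag['id'])              # t1
# result_set.find_all('td') would raise AttributeError; index into it instead:
print(result_set[0].find_all('td')[0].text)  # x
```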
Code:
import requests
from bs4 import BeautifulSoup
import pandas as pd
unspsc_link = "https://www.unspsc.org/search-code/default.aspx?CSS=51%&Type=desc&SS%27"
link = requests.get(unspsc_link).text
soup = BeautifulSoup(link, 'lxml')
right_table = soup.find('table', id="dnn_ctr1535_UNSPSCSearch_gvDetailsSearchView")
df = pd.read_html(str(right_table))[0]
# Clean up the DataFrame
df = df[[0, 1]]
df.columns = df.iloc[0]
df = df[1:]
print(df)
Output:
0 Code Title
1 51180000 Hormones and hormone antagonists
2 51280000 Antibacterials
3 51290000 Antidepressants
4 51390000 Sympathomimetic or adrenergic drugs
5 51460000 Herbal drugs
...
Notes:
The row order may be a little different but the data seems to be the same.
You will have to remove the last one or two rows from the DataFrame as they are not relevant.
This is the data from the first page only. Look into selenium to get the data from all pages by clicking on the buttons [1] [2] .... You can also use requests to emulate the POST request, but it is a bit difficult for this site (IMHO).
I am trying to pull part numbers from a cross-reference website, but when I inspect the element, the only tags used around the table are tr, td, tbody, and table, which are used in many other places on the page. Currently I am using BeautifulSoup and selenium, and I am looking into using lxml.html for its xpath tool, but I can't seem to get BeautifulSoup to work with it.
The website I am trying to pull values from is
https://jdparts.deere.com/servlet/com.deere.u90.jdparts.view.servlets.searchcontroller.PartialPartNumberSearchController?action=UNSIGNED_VIEW
and technically I only want the Part Number, Make, Part No, Part Type, and Description Values, but I can deal with getting the whole table.
When I use
html2 = browser.page_source
source = soup(html2, 'html.parser')
for article in source.find_all('td', valign='middle'):
    PartNumber = article.text.strip()
    number.append(PartNumber)
it gives me all the values on the page and several blank values all in a single line of text, which would be just as much work to sift through as just manually pulling the values.
Ultimately I am hoping to get the values in the table and formatted to look like the table and I can just delete the columns I don't need. What would be the best way to go about gathering the information in the table?
One approach would be to find the Qty. which is an element at the start of the table you want and to then look for the previous table. You could then iterate over the tr elements and produce a row of values from all the td elements in each row.
The Python itemgetter() function could be useful here, as it lets you extract the elements you want (in any order) from a bigger list. In this example, I have chosen items 1,2,3,4,5, but if say Make wasn't needed, you could provide 1,3,4,5.
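For example, on a made-up row of cell texts (index 0 standing in for a leading cell we don't want):

```python
from operator import itemgetter

# Made-up row of cell texts; index 0 is an unwanted leading cell
row = ['', 'AT20061', 'A&I Products', 'A-RE29882', 'Clutch', 'Clutch Disk']

req_fields = itemgetter(1, 2, 3, 4, 5)
print(req_fields(row))  # ('AT20061', 'A&I Products', 'A-RE29882', 'Clutch', 'Clutch Disk')

# Dropping Make is just a different set of indices:
no_make = itemgetter(1, 3, 4, 5)
print(no_make(row))     # ('AT20061', 'A-RE29882', 'Clutch', 'Clutch Disk')
```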
The search results might span multiple pages. If so, the code checks for a Next Page button and, if present, adjusts params to request the next page of results. This continues until no next page is found:
from operator import itemgetter
import requests
from bs4 import BeautifulSoup
import csv

search_term = "AT2*"

params = {
    "userAction": "search",
    "browse": "",
    "screenName": "partSearch",
    "priceIdx": 1,
    "searchAppType": "",
    "searchType": "search",
    "partSearchNumber": search_term,
    "pageIndex": 1,
    "endPageIndex": 100,
}

url = 'https://jdparts.deere.com/servlet/com.deere.u90.jdparts.view.servlets.searchcontroller.PartNumberSearch'
req_fields = itemgetter(1, 2, 3, 4, 5)
page_index = 1
session = requests.Session()
start_row = 0

with open('output.csv', 'w', newline='') as f_output:
    csv_output = csv.writer(f_output)

    while True:
        print(f'Page {page_index}')
        req = session.post(url, params=params)
        soup = BeautifulSoup(req.content, 'html.parser')
        table = soup.find(text='Qty.').find_previous('table')

        for tr in table.find_all('tr')[start_row:]:
            row = req_fields([value.get_text(strip=True) for value in tr.find_all('td')])
            if row[0]:
                csv_output.writerow(row)

        if soup.find(text='Next Page'):
            start_row = 2
            params = {
                "userAction": "NextPage",
                "browse": "NextPage",
                "pageIndex": page_index,
                "endPageIndex": 15,
            }
            page_index += 1
        else:
            break
Which would give you an output.csv file starting:
Part Number,Make,Part No.,Part Type,Description
AT2,Kralinator,PMTF15013,Filters,Filter
AT2,Kralinator,PMTF15013J,Filters,Filter
AT20,Berco,T139464,Undercarriage All Makes,Spring Pin
AT20061,A&I Products,A-RE29882,Clutch,Clutch Disk
Note: This makes use of requests instead of using selenium as it will be much faster.
I am trying to scrape data using BeautifulSoup; however, it comes in the form of lists, and I need a pandas DataFrame. I am using a for loop to get the data, but I am unable to append the rows to a DataFrame. When I check the len of row, it says only 1.
INFY = url.urlopen("https://in.finance.yahoo.com/quote/INFY.NS/history?p=INFY.NS")
div = INFY.read()
div = soup(div,'html.parser')
div = div.find("table",{"class":"W(100%) M(0)"})
table_rows = div.findAll("tr")
print(table_rows)
for tr in table_rows:
    td = tr.findAll('td')
    row = [i.text for i in td]
    print(row)
Below is the result I get after running the code.
['30-Mar-2017', '1,034.00', '1,035.90', '1,020.25', '1,025.50', '1,010.02', '60,78,590']
['29-Mar-2017', '1,034.30', '1,041.50', '1,025.85', '1,031.85', '1,016.27', '34,90,593']
['28-Mar-2017', '1,031.50', '1,039.00', '1,030.05', '1,035.15', '1,019.52', '23,98,398']
pd.DataFrame([[i.text for i in tr.findAll('td')] for tr in table_rows])
You would then need to convert text values to their numeric equivalents.
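A sketch of that conversion on rows mirroring the printed output (the column names here are assumptions, not the real page's headers): the scraped strings contain thousands separators, so remove the commas before calling pd.to_numeric.

```python
import pandas as pd

# Sample rows copied from the printed output; column names are assumptions
df = pd.DataFrame(
    [['30-Mar-2017', '1,034.00', '60,78,590']],
    columns=['Date', 'Open', 'Volume'])

for col in ['Open', 'Volume']:
    # strip the comma separators, then convert to numbers
    df[col] = pd.to_numeric(df[col].str.replace(',', '', regex=False))

print(df['Open'].iloc[0])    # 1034.0
print(df['Volume'].iloc[0])  # 6078590
```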