How to pull values from a table with no defining characteristics?

I am trying to pull part numbers from a cross-reference website, but when I inspect the element, the only tags around the table are tr, td, tbody, and table, which are used in many other places on the page. Currently I am using Beautiful Soup and Selenium, and I am looking into lxml.html for its XPath support, but I can't seem to get Beautiful Soup to work with it.
The website I am trying to pull values from is
https://jdparts.deere.com/servlet/com.deere.u90.jdparts.view.servlets.searchcontroller.PartialPartNumberSearchController?action=UNSIGNED_VIEW
and technically I only want the Part Number, Make, Part No, Part Type, and Description Values, but I can deal with getting the whole table.
When I use
html2 = browser.page_source
source = soup(html2, 'html.parser')
for article in source.find_all('td', valign='middle'):
    PartNumber = article.text.strip()
    number.append(PartNumber)
it gives me all the values on the page and several blank values all in a single line of text, which would be just as much work to sift through as just manually pulling the values.
Ultimately I am hoping to get the values formatted like the table, so that I can just delete the columns I don't need. What would be the best way to go about gathering the information in the table?

One approach is to find the text Qty., which appears at the start of the table you want, and then look for the previous table element with find_previous('table'). You can then iterate over the tr elements and build a row of values from all the td elements in each row.
The Python itemgetter() function is useful here, as it lets you extract the elements you want (in any order) from a bigger list. In this example I have chosen items 1, 2, 3, 4, 5, but if, say, Make wasn't needed, you could provide 1, 3, 4, 5.
The search might return multiple pages of results. If so, the code checks for a Next Page button and, if one is present, adjusts params to request the next page. This continues until no next page is found:
from operator import itemgetter
import requests
from bs4 import BeautifulSoup
import csv
search_term = "AT2*"
params = {
    "userAction" : "search",
    "browse" : "",
    "screenName" : "partSearch",
    "priceIdx" : 1,
    "searchAppType" : "",
    "searchType" : "search",
    "partSearchNumber" : search_term,
    "pageIndex" : 1,
    "endPageIndex" : 100,
}
url = 'https://jdparts.deere.com/servlet/com.deere.u90.jdparts.view.servlets.searchcontroller.PartNumberSearch'
req_fields = itemgetter(1, 2, 3, 4, 5)
page_index = 1
session = requests.Session()
start_row = 0
with open('output.csv', 'w', newline='') as f_output:
    csv_output = csv.writer(f_output)
    while True:
        print(f'Page {page_index}')
        req = session.post(url, params=params)
        soup = BeautifulSoup(req.content, 'html.parser')
        table = soup.find(text='Qty.').find_previous('table')
        for tr in table.find_all('tr')[start_row:]:
            row = req_fields([value.get_text(strip=True) for value in tr.find_all('td')])
            if row[0]:
                csv_output.writerow(row)
        if soup.find(text='Next Page'):
            start_row = 2
            params = {
                "userAction" : "NextPage",
                "browse" : "NextPage",
                "pageIndex" : page_index,
                "endPageIndex" : 15,
            }
            page_index += 1
        else:
            break
Which would give you an output.csv file starting:
Part Number,Make,Part No.,Part Type,Description
AT2,Kralinator,PMTF15013,Filters,Filter
AT2,Kralinator,PMTF15013J,Filters,Filter
AT20,Berco,T139464,Undercarriage All Makes,Spring Pin
AT20061,A&I Products,A-RE29882,Clutch,Clutch Disk
Note: This makes use of requests instead of using selenium as it will be much faster.
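The column selection and blank-row check described above can be tried on their own without hitting the site. This is only a minimal sketch: the sample rows and the select_rows helper are made up for illustration (a leading unwanted cell, five wanted fields, then a quantity cell) and are not taken from the real page.

```python
import csv
import io
from operator import itemgetter

# Hypothetical rows, as they might come out of the <td> cells of each <tr>.
# Index 0 and index 6 stand in for columns we do not want.
sample_rows = [
    ['', 'Part Number', 'Make', 'Part No.', 'Part Type', 'Description', 'Qty.'],
    ['', 'AT2', 'Kralinator', 'PMTF15013', 'Filters', 'Filter', ''],
    ['', '', '', '', '', '', ''],  # spacer row, should be skipped
    ['', 'AT20', 'Berco', 'T139464', 'Undercarriage All Makes', 'Spring Pin', ''],
]

def select_rows(rows, indexes=(1, 2, 3, 4, 5)):
    """Keep only the requested columns; drop rows whose first kept field is empty."""
    pick = itemgetter(*indexes)
    selected = []
    for cells in rows:
        row = pick(cells)
        if row[0]:
            selected.append(row)
    return selected

wanted = select_rows(sample_rows)
no_make = select_rows(sample_rows, indexes=(1, 3, 4, 5))

# Write to an in-memory buffer instead of output.csv so nothing touches disk.
buffer = io.StringIO()
csv.writer(buffer).writerow(wanted[1])
csv_line = buffer.getvalue().strip()
print(csv_line)
```

Note that itemgetter() raises IndexError if a row has fewer cells than the highest index requested, so rows from the real page may need a length check first.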

Related

Cannot get the hyperlink href beautiful soup

I am trying to get the hyperlink of an anchor (a) element, but I keep getting:
https://in.finance.yahoo.com/https://in.finance.yahoo.com/
I have tried all solutions provided here: link
Here's my code:
href_links = []
symbols = []
prices = []
commodities = []
CommoditiesUrl = "https://in.finance.yahoo.com/commodities"
r = requests.get(CommoditiesUrl)
data = r.text
soup = BeautifulSoup(data)
counter = 40
for i in range(40, 404, 14):
    for row in soup.find_all('tbody'):
        for srow in row.find_all('tr'):
            for symbol in srow.find_all('td', attrs={'class':'data-col0'}):
                symbols.append(symbol.text)
                href_link = soup.find('a').get('href')
                href_links.append('https://in.finance.yahoo.com/' + href_link)
            for commodity in srow.find_all('td', attrs={'class':'data-col1'}):
                commodities.append(commodity.text)
            for price in srow.find_all('td', attrs={'class':'data-col2'}):
                prices.append(price.text)
pd.DataFrame({"Links": href_links, "Symbol": symbols, "Commodity": commodities, "Prices": prices })
Also, I would like to know whether it's feasible to have the symbol of the commodity as a hyperlink in my pandas dataframe, similar to the website.

I'm not sure what's going on with the code you posted, but you can simply get that URL by finding an a element with the attribute data-symbol set to GC=F. The html has 2 such elements. The one you want is the first one, which is what is returned by soup.find('a', {'data-symbol': 'GC=F'}).get('href').
import requests
import urllib.parse
from bs4 import BeautifulSoup
CommoditiesUrl = "https://in.finance.yahoo.com/commodities"
r = requests.get(CommoditiesUrl)
data = r.text
soup = BeautifulSoup(data, 'html.parser')
gold_href = soup.find('a', {'data-symbol': 'GC=F'}).get('href')
# If it is a relative URL, we need to transform it into an absolute URL (it always is, fwiw)
if not gold_href.startswith('http'):
    # If you insist, you can do 'https://in.finance.yahoo.com' + gold_href
    gold_href = urllib.parse.urljoin(CommoditiesUrl, gold_href)
print(gold_href)
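To see what urljoin does with relative and absolute hrefs without fetching the page, here is a small sketch. The href values are made-up examples, not copied from the Yahoo page.

```python
from urllib.parse import urljoin

base = "https://in.finance.yahoo.com/commodities"

# A root-relative href (hypothetical value) is resolved against the site root.
relative = urljoin(base, "/quote/GC=F")

# An href that is already absolute is returned unchanged, so nothing is doubled.
absolute = urljoin(base, "https://in.finance.yahoo.com/")

# Plain string concatenation with an absolute href gives a doubled URL,
# which matches the output shown in the question.
doubled = "https://in.finance.yahoo.com/" + "https://in.finance.yahoo.com/"

print(relative)
print(absolute)
print(doubled)
```

Because urljoin handles both cases, the startswith('http') check is optional when you use it.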
Also, I would like to know whether it's feasible to have the symbol of the commodity as a hyperlink in my pandas dataframe, similar to the website.
I'm not familiar with pandas, but I'd say the answer is yes. See: How to create a table with clickable hyperlink in pandas & Jupyter Notebook
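One way this is commonly done is to store HTML anchor strings in the column and render the frame as HTML. The sketch below only builds the anchor strings with the standard library. The make_link helper and the /quote/ URL pattern are assumptions for illustration, not something confirmed for this site.

```python
from html import escape

def make_link(text, url):
    """Return an HTML anchor; both parts are escaped so odd characters stay valid."""
    return '<a href="{}">{}</a>'.format(escape(url, quote=True), escape(text))

symbols = ["GC=F", "SI=F"]
# Hypothetical URL pattern: site root + /quote/ + symbol
links = [make_link(s, "https://in.finance.yahoo.com/quote/" + s) for s in symbols]
print(links[0])
```

If these strings are placed in a DataFrame column, df.to_html(escape=False) keeps the anchors as real links instead of escaping them, and in a Jupyter notebook the result can be wrapped in IPython.display.HTML to display it.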

how to return data from multiple pages from table in url using beautifulsoup

I am trying to retrieve the code as well as the title, but somehow I am not able to. The website is
https://www.unspsc.org/search-code/default.aspx?CSS=51%&Type=desc&SS%27
Here is how I have tried to get the values from the table:
import requests
unspsc_link = "https://www.unspsc.org/search-code/default.aspx?CSS=51%&Type=desc&SS%27"
link = requests.get(unspsc_link).text
from bs4 import BeautifulSoup
soup = BeautifulSoup(link, 'lxml')
print(soup.prettify())
all_table = soup.find_all('table')
print(all_table)
right_table = soup.find_all('table', id="dnn_ctr1535_UNSPSCSearch_gvDetailsSearchView")
tables = right_table.find_all('td')
print(tables)
The error is: AttributeError: ResultSet object has no attribute 'find'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?
I expect to save the code as well as the title in a list and store it in a DataFrame later.
Is there any way to continue to the next page without manually providing values in the search code (like 51%), as there are more than 20 pages of results for 51%?

From the documentation
AttributeError: 'ResultSet' object has no attribute 'foo' - This usually happens because you expected find_all() to return a single tag or string. But find_all() returns a list of tags and strings–a ResultSet object. You need to iterate over the list and look at the .foo of each one. Or, if you really only want one result, you need to use find() instead of find_all()
Code:
import requests
from bs4 import BeautifulSoup
import pandas as pd
unspsc_link = "https://www.unspsc.org/search-code/default.aspx?CSS=51%&Type=desc&SS%27"
link = requests.get(unspsc_link).text
soup = BeautifulSoup(link, 'lxml')
right_table = soup.find('table', id="dnn_ctr1535_UNSPSCSearch_gvDetailsSearchView")
df = pd.read_html(str(right_table))[0]
# Clean up the DataFrame
df = df[[0, 1]]
df.columns = df.iloc[0]
df = df[1:]
print(df)
Output:
0  Code      Title
1  51180000  Hormones and hormone antagonists
2  51280000  Antibacterials
3  51290000  Antidepressants
4  51390000  Sympathomimetic or adrenergic drugs
5  51460000  Herbal drugs
...
Notes:
The row order may be a little different but the data seems to be the same.
You will have to remove the last one or two rows from the DataFrame as they are not relevant.
This is the data from the first page only. Look into selenium to get the data from all pages by clicking on the buttons [1] [2] .... You can also use requests to emulate the POST request but it is a bit difficult for this site (IMHO).

BeautifulSoup - How do I find values from a table with no id?

Trying to scrape values from the 'retained profit' row under the 'British Land Fundamentals' heading at https://uk.advfn.com/p.php?pid=financials&symbol=L%5EBLND
Not sure how to go about this as I can't see a specific ID or class.
Thanks

This returns all tables with width 100%.
You can try:
tables = soup.find_all('table', width="100%")
for table in tables:
    for tr in table.find_all("tr"):
        cells = tr.find_all("td")
        for key, value in enumerate(cells):
            # class is a reserved word in Python, so bs4 uses class_
            anchor = value.find("a", class_="Lcc")
            if anchor and "Retained Profit PS" in anchor.text and key + 1 < len(cells):
                print(cells[key + 1].text)

Figured out a working solution by finding the specific row name in the HTML source code and iterating through the table values in said row:
import requests
import bs4 as bs
def foo(ticker):
    response = requests.get('https://uk.advfn.com/p.php?pid=financials&symbol=L%5E{}'.format(ticker))
    soup = bs.BeautifulSoup(response.text, 'lxml')
    # find all 'a' tags
    for a in soup.find_all("a"):
        # find the specific row we want, there is only one instance of the a tag with the text 'retained profit' so this works
        if a.text == "retained profit":
            # iterate through each value in the row
            for val in a.parent.find_next_siblings():
                # print only the values in bold (i.e. those tags with the class='sb' in this case);
                # get() avoids a KeyError for cells that have no class attribute
                for class_ in val.get('class', []):
                    if class_ == 'sb':
                        print(val.text)
Outputs the desired values from the table :D

How can I scrape table and find out corresponding entry for maximum number in particular column?

How can I scrape the table from "https://www.nseindia.com/live_market/dynaContent/live_watch/option_chain/optionKeys.jsp?symbolCode=-9999&symbol=BANKNIFTY&symbol=BANKNIFTY&instrument=OPTIDX&date=-&segmentLink=17&segmentLink=17",
then find the maximum "OI" under "PUTS", and finally get the corresponding entries in the row that holds that maximum OI?
I have got as far as printing the rows:
import urllib2
from urllib2 import urlopen
import bs4 as bs
url = 'https://www.nseindia.com/live_market/dynaContent/live_watch/option_chain/optionKeys.jsp?symbolCode=-9999&symbol=BANKNIFTY&symbol=BANKNIFTY&instrument=OPTIDX&date=-&segmentLink=17&segmentLink=17'
html = urllib2.urlopen(url).read()
soup = bs.BeautifulSoup(html,'lxml')
table = soup.find('div',id='octable')
rows = table.find_all('tr')
for row in rows:
    print row.text

You have to iterate over all the <td> elements inside each <tr>. You can do this with a bunch of for loops, but a list comprehension is more straightforward. Using only this:
oi_column = [
    float(t[21].text.strip().replace('-','0').replace(',',''))
    for t in (t.find_all('td') for t in tables.find_all('tr'))
    if len(t) > 20
]
you iterate over all <td> in all <tr> of your table, selecting only those rows with more than 20 cells (to exclude the last row), and perform whatever text replacement you need, here converting the text to float.
The whole code would be:
from bs4 import BeautifulSoup
import requests
url = 'https://www.nseindia.com/live_market/dynaContent/live_watch/option_chain/optionKeys.jsp?symbolCode=-9999&symbol=BANKNIFTY&symbol=BANKNIFTY&instrument=OPTIDX&date=-&segmentLink=17&segmentLink=17'
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")
tables = soup.find("table", {"id":"octable"})
oi_column = [
    float(t[21].text.strip().replace('-','0').replace(',',''))
    for t in (t.find_all('td') for t in tables.find_all('tr'))
    if len(t) > 20
]
#column to check
print(oi_column)
print("max value : {}".format(max(oi_column)))
print("index of max value : {}".format(oi_column.index(max(oi_column))))
#the row at index
root = tables.find_all('tr')[2 + oi_column.index(max(oi_column))].find_all('td')
row_items = [
    (
        root[1].text.strip(),
        root[2].text.strip()
        #etc... select index you want to extract in the corresponding rows
    )
]
print(row_items)
You can find an additional example to scrape a table like this here
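The "find the maximum, then look up its row" step can be tested on plain lists before wiring it to the page. This is a sketch with made-up rows. The parse_number helper is hypothetical and differs slightly from the replace('-','0') call above: it treats only a lone '-' as zero, so a negative value such as '-5' is not turned into '05'.

```python
def parse_number(text):
    """Convert a cell such as '1,234.5' to float; a lone '-' (no data) counts as 0."""
    cleaned = text.strip().replace(',', '')
    return 0.0 if cleaned in ('-', '') else float(cleaned)

# Hypothetical rows: (strike price, put OI as displayed on the page)
rows = [
    ('25000.00', '1,200'),
    ('25500.00', '-'),
    ('26000.00', '48,750'),
    ('26500.00', '3,100'),
]

oi_values = [parse_number(oi) for _, oi in rows]

# Index of the largest OI, found in a single pass over the values.
best_index = max(range(len(oi_values)), key=oi_values.__getitem__)
best_row = rows[best_index]
print(best_row)
```

Keeping the parsed values in the same order as the rows means the index of the maximum maps straight back to the row, with no fixed offset to maintain.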

Getting text of a table quickly in Selenium

I'm trying to parse a number of columns in a table into a dictionary using Selenium, but what I have seems slow. I'm using python, Selenium 2.0, and webdriver.Chrome()
table = self.driver.find_element_by_id("thetable")
# now get all the TR elements from the table
all_rows = table.find_elements_by_tag_name("tr")
# and iterate over them, getting the cells
for row in all_rows:
    cells = row.find_elements_by_tag_name("td")
    # slowwwwwwwwwwwwww
    dict_value = {'0th': cells[0].text,
                  '1st': cells[1].text,
                  '2nd': cells[2].text,
                  '3rd': cells[3].text,
                  '6th': cells[6].text,
                  '7th': cells[7].text,
                  '10th': cells[10].text}
The problem seems to be getting the 'text' attribute of each td element. Is there a faster way?

Alternative option.
If later (after the loop) you don't need the interactivity that selenium provides, you can pass the current HTML source of the page to lxml.html, which is known for its speed. Example:
import lxml.html
root = lxml.html.fromstring(driver.page_source)
for row in root.xpath('.//table[@id="thetable"]//tr'):
    cells = row.xpath('.//td/text()')
    dict_value = {'0th': cells[0],
                  '1st': cells[1],
                  '2nd': cells[2],
                  '3rd': cells[3],
                  '6th': cells[6],
                  '7th': cells[7],
                  '10th': cells[10]}
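One thing to watch with .//td/text() is that an empty cell has no text node, so the positions in cells can shift. The sketch below shows a per-cell variant of the same idea. To stay dependency-free it swaps lxml for the standard library's xml.etree.ElementTree, which only works on well-formed markup, and the page fragment is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment; real pages usually need lxml.html instead.
page = """
<html><body>
  <table id="other"><tr><td>ignore</td></tr></table>
  <table id="thetable">
    <tr><td>a0</td><td></td><td><b>a2</b></td></tr>
    <tr><td>b0</td><td>b1</td><td>b2</td></tr>
  </table>
</body></html>
"""

root = ET.fromstring(page)
table = root.find(".//table[@id='thetable']")

parsed = []
for row in table.iter('tr'):
    # Joining itertext() per cell gives '' for an empty cell and includes
    # text nested in child tags, so cell positions stay stable.
    cells = [''.join(td.itertext()).strip() for td in row.iter('td')]
    parsed.append({'0th': cells[0], '1st': cells[1], '2nd': cells[2]})

print(parsed)
```

With lxml the equivalent per-cell call would be td.text_content() on each element returned by row.xpath('.//td').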
