Trying to scrape values from the 'retained profit' row under the 'British Land Fundamentals' heading at https://uk.advfn.com/p.php?pid=financials&symbol=L%5EBLND
Not sure how to go about this as I can't see a specific ID or class.
Thanks
This finds all tables with width="100%", then looks for the cell whose Lcc link text contains "Retained Profit PS" and returns the cell next to it. You can try (as a function, since it returns the matching cell):

def find_retained_profit(soup):
    tables = soup.find_all('table', width="100%")
    for table in tables:
        for tr in table.find_all("tr"):
            for key, value in enumerate(tr.find_all("td")):
                # class is a reserved word in Python, hence the class_ keyword argument
                anchor = value.find("a", class_="Lcc")
                if anchor and "Retained Profit PS" in anchor.text:
                    return tr.find_all("td")[key + 1]
Figured out a working solution by finding the specific row name in the HTML source code and iterating through the table values in said row:
import requests
import bs4 as bs

def foo(ticker):
    response = requests.get('https://uk.advfn.com/p.php?pid=financials&symbol=L%5E{}'.format(ticker))
    soup = bs.BeautifulSoup(response.text, 'lxml')
    # find all 'a' tags
    for a in soup.find_all("a"):
        # find the specific row we want; there is only one instance of an 'a' tag
        # with the text 'retained profit', so this works
        if a.text == "retained profit":
            # iterate through each value in the row
            for val in a.parent.find_next_siblings():
                # print only the values in bold (i.e. those tags with class='sb' in this case);
                # .get avoids a KeyError on cells that have no class attribute
                for class_ in val.attrs.get('class', []):
                    if class_ == 'sb':
                        print(val.text)
Outputs the desired values from the table :D
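For example, a hypothetical call using the ticker from the URL in the question:

foo('BLND')  # prints each bold (class='sb') value from the 'retained profit' row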
I need to make dataframe from the following page: http://pitzavod.ru/products/upakovka/
from bs4 import BeautifulSoup
import pandas as pd
import requests
kre = requests.get(f'http://pitzavod.ru/products/upakovka/')
soup = BeautifulSoup(kre.text, 'lxml')
table1 = soup.find('table', id="tab3")
I chose "tab3", as I find in the HTML text <div class="tab-pane fade" id="tab3". But the variable table1 gives no output. How can I get the table? Thank You.
NOTE: you can get the table as a DataFrame in one statement with .read_html, but the DataFrame returned by pd.read_html('http://pitzavod.ru/products/upakovka/')[0] will not retain line breaks.
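For reference, that one-statement version is just the following (a minimal sketch; as noted, line breaks inside the cells are lost):

import pandas as pd

# read_html fetches the page and parses every <table> into a DataFrame;
# this page has only one table, so we take the first result
df = pd.read_html('http://pitzavod.ru/products/upakovka/')[0]
print(df.head())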
.find('table', id="tab3") searches for table tags with id="tab3", and there are no such elements in that page's HTML.
There's a div with id="tab3" (as you've noticed), but it does not contain any tables.
The only table on the page is contained in a div with id="tab4", so you might have used table1 = soup.find('div', id="tab4").table [although I prefer using .select with CSS selectors for targeting nested tags].
Suggested solution:
import requests
import pandas
from bs4 import BeautifulSoup

kre = requests.get('http://pitzavod.ru/products/upakovka/')
# print(kre.status_code, kre.reason, 'from', kre.url)
kre.raise_for_status()
soup = BeautifulSoup(kre.content, 'lxml')

# table = soup.select_one('div#tab4>div.table-responsive>table')
table = soup.find('table')  # soup.select_one('table')

tData = [{
    1 if 'center' in c.get('style', '') else ci: '\n'.join([
        l.strip() for l in c.get_text('\n').splitlines() if l.strip()
    ]) for ci, c in enumerate(r.select('td'))
} for r in table.select('tr')]
df = pandas.DataFrame(tData)

## combine the top 2 rows to form the header ##
df.columns = ['\n'.join([
    f'{d}' for d in df[c][:2] if pandas.notna(d)
]) for c in df.columns]
df = df.drop([0, 1], axis='rows').reset_index(drop=True)
# print(df.to_markdown(tablefmt="fancy_grid"))
(Normally, I would use this function if I wanted to specify the separator for tag-contents inside cells, but the middle cell in the 2nd header row would be shifted if I used .DataFrame(read_htmlTable(table, tSep='\n', asObj='dicts')) - the 1 if 'center' in c.get('style', '') else ci bit in the above code is for correcting that.)
I am attempting to iterate through some BeautifulSoup data and get headers from the table header elements (<th>) and place them into a list. Currently, my code is extracting more than just the header text, and is pulling in part of the BS4 tags.
See the attached image (result1) for the current results.
Code:
column_names = []

def extract_column_from_header(row):
    if (row.br):
        row.br.extract()
    if row.a:
        row.a.extract()
    if row.sup:
        row.sup.extract()
    column_name = ' '.join(row.contents)
    if not(column_name.strip().isdigit()):
        column_name = column_name.strip()
        return column_name

soup = first_launch_table.find_all('th')
for i in soup:
    name = extract_column_from_header(i)
    if name != None and len(name) > 0:
        column_names.append(i)
The question needs improvement (HTML, expected result, ...), so this only points in a direction - you could use CSS selectors to select all th elements in the table, extracting the content with .text while iterating in a list comprehension:
[th.text for th in soup.select('table th')]
Or based on your example:
[th.text for th in first_launch_table.find_all('th')]
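As an aside, the stray tag fragments in your screenshot most likely come from the last line of the posted code: column_names.append(i) appends the whole Tag object i rather than the extracted string name. A minimal fix along those lines:

for i in first_launch_table.find_all('th'):
    name = extract_column_from_header(i)
    if name is not None and len(name) > 0:
        column_names.append(name)  # append the extracted text, not the Tag object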
I am trying to get the hyperlink of an anchor (a) element, but I keep getting:

https://in.finance.yahoo.com/https://in.finance.yahoo.com/
I have tried all solutions provided here: link
Here's my code:

import requests
import pandas as pd
from bs4 import BeautifulSoup

href_links = []
symbols = []
prices = []
commodities = []

CommoditiesUrl = "https://in.finance.yahoo.com/commodities"
r = requests.get(CommoditiesUrl)
data = r.text
soup = BeautifulSoup(data)

counter = 40
for i in range(40, 404, 14):
    for row in soup.find_all('tbody'):
        for srow in row.find_all('tr'):
            for symbol in srow.find_all('td', attrs={'class': 'data-col0'}):
                symbols.append(symbol.text)
                href_link = soup.find('a').get('href')
                href_links.append('https://in.finance.yahoo.com/' + href_link)
            for commodity in srow.find_all('td', attrs={'class': 'data-col1'}):
                commodities.append(commodity.text)
            for price in srow.find_all('td', attrs={'class': 'data-col2'}):
                prices.append(price.text)

pd.DataFrame({"Links": href_links, "Symbol": symbols, "Commodity": commodities, "Prices": prices})
Also, I would like to know if it's feasible, to similarly to the website, to have the symbol of the commodity as a hyperlink in my pandas dataframe.
I'm not sure what's going on with the rest of the code you posted, but one clear issue is that href_link = soup.find('a').get('href') searches the whole document rather than the current row, so every iteration returns the href of the page's very first anchor. In any case, you can simply get that URL by finding an a element with the attribute data-symbol set to GC=F. The HTML has 2 such elements. The one you want is the first one, which is what is returned by soup.find('a', {'data-symbol': 'GC=F'}).get('href').
import requests
import urllib.parse
from bs4 import BeautifulSoup

CommoditiesUrl = "https://in.finance.yahoo.com/commodities"
r = requests.get(CommoditiesUrl)
data = r.text
soup = BeautifulSoup(data, 'html.parser')

gold_href = soup.find('a', {'data-symbol': 'GC=F'}).get('href')

# If it is a relative URL, we need to transform it into an absolute URL (it always is, fwiw)
if not gold_href.startswith('http'):
    # If you insist, you can do 'https://in.finance.yahoo.com' + gold_href
    gold_href = urllib.parse.urljoin(CommoditiesUrl, gold_href)

print(gold_href)
Also, I would like to know if it's feasible, to similarly to the website, to have the symbol of the commodity as a hyperlink in my pandas dataframe.
I'm not familiar with pandas, but I'd say the answer is yes. See: How to create a table with clickable hyperlink in pandas & Jupyter Notebook
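For example, a minimal sketch along those lines (the sample row is hypothetical; in a Jupyter notebook, to_html(escape=False) keeps the anchor tags clickable):

import pandas as pd
from IPython.display import HTML

# Hypothetical data standing in for the scraped values
df = pd.DataFrame({"Links": ["https://in.finance.yahoo.com/quote/GC%3DF"], "Symbol": ["GC=F"]})

# Wrap each symbol in an anchor tag, then render the table without escaping the HTML
df["Symbol"] = [f'<a href="{link}">{sym}</a>' for link, sym in zip(df["Links"], df["Symbol"])]
HTML(df.to_html(escape=False))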
I am trying to pull part numbers from a cross-reference website, but when I inspect the element, the only tags used around the table are tr, td, tbody, and table, which are used in many other places on the page. Currently I am using Beautiful Soup and Selenium, and I am looking into using lxml.html for its XPath tool, but I can't seem to get Beautiful Soup to work with it.
The website I am trying to pull values from is
https://jdparts.deere.com/servlet/com.deere.u90.jdparts.view.servlets.searchcontroller.PartialPartNumberSearchController?action=UNSIGNED_VIEW
and technically I only want the Part Number, Make, Part No., Part Type, and Description values, but I can deal with getting the whole table.
When I use

html2 = browser.page_source
source = soup(html2, 'html.parser')

for article in source.find_all('td', valign='middle'):
    PartNumber = article.text.strip()
    number.append(PartNumber)

it gives me all the values on the page plus several blank values, all in a single line of text, which would be just as much work to sift through as manually pulling the values.
Ultimately I am hoping to get the values in the table and formatted to look like the table and I can just delete the columns I don't need. What would be the best way to go about gathering the information in the table?
One approach would be to find the Qty. text, which is an element at the start of the table you want, and then look for the previous table element. You could then iterate over the tr elements and produce a row of values from all the td elements in each row.
The Python itemgetter() function could be useful here, as it lets you extract the elements you want (in any order) from a bigger list. In this example, I have chosen items 1,2,3,4,5, but if say Make wasn't needed, you could provide 1,3,4,5.
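A quick standalone illustration of itemgetter (the sample list mirrors a row of the output shown further down):

from operator import itemgetter

# Take items 1, 3, 4 and 5, skipping index 0 and the Make at index 2
req_fields = itemgetter(1, 3, 4, 5)
print(req_fields(['', 'AT2', 'Kralinator', 'PMTF15013', 'Filters', 'Filter']))
# ('AT2', 'PMTF15013', 'Filters', 'Filter')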
The search results might have multiple pages of results, if this is the case it checks for a Next Page button and if present adjusts params to get the next page of results. This continues until no next page is found:
from operator import itemgetter
import requests
from bs4 import BeautifulSoup
import csv

search_term = "AT2*"

params = {
    "userAction" : "search",
    "browse" : "",
    "screenName" : "partSearch",
    "priceIdx" : 1,
    "searchAppType" : "",
    "searchType" : "search",
    "partSearchNumber" : search_term,
    "pageIndex" : 1,
    "endPageIndex" : 100,
}

url = 'https://jdparts.deere.com/servlet/com.deere.u90.jdparts.view.servlets.searchcontroller.PartNumberSearch'
req_fields = itemgetter(1, 2, 3, 4, 5)
page_index = 1
session = requests.Session()
start_row = 0

with open('output.csv', 'w', newline='') as f_output:
    csv_output = csv.writer(f_output)

    while True:
        print(f'Page {page_index}')
        req = session.post(url, params=params)
        soup = BeautifulSoup(req.content, 'html.parser')
        table = soup.find(text='Qty.').find_previous('table')

        for tr in table.find_all('tr')[start_row:]:
            row = req_fields([value.get_text(strip=True) for value in tr.find_all('td')])
            if row[0]:
                csv_output.writerow(row)

        if soup.find(text='Next Page'):
            start_row = 2
            params = {
                "userAction" : "NextPage",
                "browse" : "NextPage",
                "pageIndex" : page_index,
                "endPageIndex" : 15,
            }
            page_index += 1
        else:
            break
Which would give you an output.csv file starting:
Part Number,Make,Part No.,Part Type,Description
AT2,Kralinator,PMTF15013,Filters,Filter
AT2,Kralinator,PMTF15013J,Filters,Filter
AT20,Berco,T139464,Undercarriage All Makes,Spring Pin
AT20061,A&I Products,A-RE29882,Clutch,Clutch Disk
Note: this makes use of requests instead of Selenium, as it will be much faster.
I am trying to scrape data using Beautiful Soup; however, it comes in the form of lists, and I need a pandas DataFrame. I am using a for loop to get the data, but I am unable to append the rows to a DataFrame. When I check the len of row, it says only 1.
import urllib.request as url
from bs4 import BeautifulSoup as soup

INFY = url.urlopen("https://in.finance.yahoo.com/quote/INFY.NS/history?p=INFY.NS")
div = INFY.read()
div = soup(div, 'html.parser')
div = div.find("table", {"class": "W(100%) M(0)"})
table_rows = div.findAll("tr")
print(table_rows)

for tr in table_rows:
    td = tr.findAll('td')
    row = [i.text for i in td]
    print(row)
Below is the result I get after running the code.
['30-Mar-2017', '1,034.00', '1,035.90', '1,020.25', '1,025.50', '1,010.02', '60,78,590']
['29-Mar-2017', '1,034.30', '1,041.50', '1,025.85', '1,031.85', '1,016.27', '34,90,593']
['28-Mar-2017', '1,031.50', '1,039.00', '1,030.05', '1,035.15', '1,019.52', '23,98,398']
pd.DataFrame([[i.text for i in tr.findAll('td')] for tr in table_rows])
You would then need to convert text values to their numeric equivalents.
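A minimal sketch of that conversion, assuming the seven columns visible in the printed rows above (the column names are a guess based on Yahoo's history page):

import pandas as pd

rows = [[td.text for td in tr.findAll('td')] for tr in table_rows]
rows = [r for r in rows if len(r) == 7]  # skip header rows, which have th cells rather than td

df = pd.DataFrame(rows, columns=['Date', 'Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume'])

# Strip the thousands separators and convert the price/volume columns to numbers
for col in df.columns[1:]:
    df[col] = pd.to_numeric(df[col].str.replace(',', ''), errors='coerce')

df['Date'] = pd.to_datetime(df['Date'], format='%d-%b-%Y')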