Extracting Column names from 'th' in Beautifulsoup - python

I am attempting to iterate through some BeautifulSoup Data and get headers from the table header elements () and place them into a list. Currently, my code is extracting more than the just the header element, and is pulling part of the BS4 tags.
See attached image (result1) for current results.
result1
Code:
column_names = []
def extract_column_from_header(row):
if (row.br):
row.br.extract()
if row.a:
row.a.extract()
if row.sup:
row.sup.extract()
column_name = ' '.join(row.contents)
if not(column_name.strip().isdigit()):
column_name = column_name.strip()
return column_name
soup=first_launch_table.find_all('th')
for i in soup:
name=extract_column_from_header(i)
if name != None and len(name) > 0:
column_names.append(i)

Question needs improvement (HTML, expected result, ...), so this is only pointing in a direction - You could use css selectors to select all th in table, extract content with .text while iterating with list comprehension:
[th.text for th in soup.select('table th')]
Or based on your example:
[th.text for th in first_launch_table.find_all('th')]

Related

Parsing a table from website (choosing correct HTML tag)

I need to make dataframe from the following page: http://pitzavod.ru/products/upakovka/
from bs4 import BeautifulSoup
import pandas as pd
import requests
kre = requests.get(f'http://pitzavod.ru/products/upakovka/')
soup = BeautifulSoup(kre.text, 'lxml')
table1 = soup.find('table', id="tab3")
I chose "tab3", as I find in the HTML text <div class="tab-pane fade" id="tab3". But the variable table1 gives no output. How can I get the table? Thank You.
NOTE: you can get the table as a DataFrame in one statement with .read_html, but the DataFrame returned by pd.read_html('http://pitzavod.ru/products/upakovka/')[0] will not retain line breaks.
.find('table', id="tab3") searches for table tags with id="tab3", and there are no such elements in that page's HTML.
There's a div with id="tab3" (as you've notice), but it does not contain any tables.
The only table on the page is contained in a div with id="tab4", so you might have used table1 = soup.find('div', id="tab4").table [although I prefer using .select with CSS selectors for targeting nested tags].
Suggested solution:
kre = requests.get('http://pitzavod.ru/products/upakovka/')
# print(kre.status_code, kre.reason, 'from', kre.url)
kre.raise_for_status()
soup = BeautifulSoup(kre.content, 'lxml')
# table = soup.select_one('div#tab4>div.table-responsive>table')
table = soup.find('table') # soup.select_one('table')
tData = [{
1 if 'center' in c.get('style', '') else ci: '\n'.join([
l.strip() for l in c.get_text('\n').splitlines() if l.strip()
]) for ci, c in enumerate(r.select('td'))
} for r in table.select('tr')]
df = pandas.DataFrame(tData)
## combine the top 2 rows to form header ##
df.columns = ['\n'.join([
f'{d}' for d in df[c][:2] if pandas.notna(d)
]) for c in df.columns]
df = df.drop([0,1], axis='rows').reset_index(drop=True)
# print(df.to_markdown(tablefmt="fancy_grid"))
(Normally, I would use this function if I wanted to specify the separator for tag-contents inside cells, but the middle cell in the 2nd header row would be shifted if I used .DataFrame(read_htmlTable(table, tSep='\n', asObj='dicts')) - the 1 if 'center' in c.get('style', '') else ci bit in the above code is for correcting that.)

how to return data from multiple pages from table in url using beautifulsoup

i am trying to retrieve the code as well as title but somehow i am not able to retrieve the website is
https://www.unspsc.org/search-code/default.aspx?CSS=51%&Type=desc&SS%27
here i have tried to get the value from the table
import requests
unspsc_link = "https://www.unspsc.org/search-code/default.aspx?
CSS=51%&Type=desc&SS%27"
link = requests.get(unspsc_link).text
from bs4 import BeautifulSoup
soup = BeautifulSoup(link, 'lxml')
print(soup.prettify())
all_table = soup.find_all('table')
print(all_table)
right_table = soup.find_all('table',
id="dnn_ctr1535_UNSPSCSearch_gvDetailsSearchView")
tables = right_table.find_all('td')
print(tables)
the errors AttributeError: ResultSet object has no attribute 'find'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?
i expect to save the code as well as title in a list and save it in dataframe later
is there any way to continue to next page without manually providing values like in search code like 51% as there as more than 20 pages inside 51%
From the documentation
AttributeError: 'ResultSet' object has no attribute 'foo' - This
usually happens because you expected find_all() to return a single tag
or string. But find_all() returns a list of tags and stringsā€“a
ResultSet object. You need to iterate over the list and look at the
.foo of each one. Or, if you really only want one result, you need to
use find() instead of find_all()
Code:
import requests
from bs4 import BeautifulSoup
import pandas as pd
unspsc_link = "https://www.unspsc.org/search-code/default.aspx?CSS=51%&Type=desc&SS%27"
link = requests.get(unspsc_link).text
soup = BeautifulSoup(link, 'lxml')
right_table = soup.find('table', id="dnn_ctr1535_UNSPSCSearch_gvDetailsSearchView")
df = pd.read_html(str(right_table))[0]
# Clean up the DataFrame
df = df[[0, 1]]
df.columns = df.iloc[0]
df = df[1:]
print(df)
Output:
0 Code Title
1 51180000 Hormones and hormone antagonists
2 51280000 Antibacterials
3 51290000 Antidepressants
4 51390000 Sympathomimetic or adrenergic drugs
5 51460000 Herbal drugs
...
Notes:
The row order may be a little different but the data seems to be the same.
You will have to remove the last one or two rows
from the DataFrame as they are not relevant.
This is the data from the first page only. Look into
selenium to get the data from all pages by clicking on the buttons [1] [2] .... You can also use requests to emulate the POST request but it is a bit difficult for this site (IMHO).

BeautifulSoup - How do I find values from a table with no id?

Trying to scrape values from the 'retained profit' row under the 'British Land Fundamentals' heading at https://uk.advfn.com/p.php?pid=financials&symbol=L%5EBLND
Not sure how to go about this as I can't see a specific ID or class.
Thanks
this return all tables with width 100%
u can try:
tables = soup.find_all('table',width="100%")
for tr in table.find_all("tr"):
for key,value in enumerate(tr.find_all("td")):
anchor = value.find_all("a",class="Lcc")
if "Retained Profit PS" in anchor.text:
return tr.find_all("td")[key+1]
Figured out a working solution by finding the specific row name in the HTML source code and iterating through the table values in said row:
def foo(ticker):
response = requests.get('https://uk.advfn.com/p.php?pid=financials&symbol=L%5E{}'.format(ticker))
soup = bs.BeautifulSoup(response.text, 'lxml')
# find all 'a' tags
for a in soup.find_all("a"):
# find the specific row we want, there is only one instance of the a tag with the text 'retained profit' so this works
if a.text == "retained profit":
# iterate through each value in the row
for val in a.parent.find_next_siblings():
# return only the values in bold (i.e. those tags with the class='sb' in this case)
for class_ in val.attrs['class']:
if class_ == 'sb':
print(val.text)
Outputs the desired values from the table :D

How can I scrape table and find out corresponding entry for maximum number in particular column?

How can I scrape table from "https://www.nseindia.com/live_market/dynaContent/live_watch/option_chain/optionKeys.jsp?symbolCode=-9999&symbol=BANKNIFTY&symbol=BANKNIFTY&instrument=OPTIDX&date=-&segmentLink=17&segmentLink=17"
Then find out maximum "OI" under "PUTS" and finally have corresponding entries in that row for that particular maximum OI
Reached till printing rows:
import urllib2
from urllib2 import urlopen
import bs4 as bs
url = 'https://www.nseindia.com/live_market/dynaContent/live_watch/option_chain/optionKeys.jsp?symbolCode=-9999&symbol=BANKNIFTY&symbol=BANKNIFTY&instrument=OPTIDX&date=-&segmentLink=17&segmentLink=17'
html = urllib2.urlopen(url).read()
soup = bs.BeautifulSoup(html,'lxml')
table = soup.find('div',id='octable')
rows = table.find_all('tr')
for row in rows:
print row.text
You have to iterate all the <td> inside the <tr>. You can do this with a bunch of for loop but using list comprehension is more straightforward. Using only this :
oi_column = [
float(t[21].text.strip().replace('-','0').replace(',',''))
for t in (t.find_all('td') for t in tables.find_all('tr'))
if len(t) > 20
]
to iterate all <td> in all <tr> of your table, selecting only those rows with more than 20 items (to exclude the last row) and perform text replacement or anything you want to match your requirement, here converting the text to float
The whole code would be :
from bs4 import BeautifulSoup
import requests
url = 'https://www.nseindia.com/live_market/dynaContent/live_watch/option_chain/optionKeys.jsp?symbolCode=-9999&symbol=BANKNIFTY&symbol=BANKNIFTY&instrument=OPTIDX&date=-&segmentLink=17&segmentLink=17'
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")
tables = soup.find("table", {"id":"octable"})
oi_column = [
float(t[21].text.strip().replace('-','0').replace(',',''))
for t in (t.find_all('td') for t in tables.find_all('tr'))
if len(t) > 20
]
#column to check
print(oi_column)
print("max value : {}".format(max(oi_column)))
print("index of max value : {}".format(oi_column.index(max(oi_column))))
#the row at index
root = tables.find_all('tr')[2 + oi_column.index(max(oi_column))].find_all('td')
row_items = [
(
root[1].text.strip(),
root[2].text.strip()
#etc... select index you want to extract in the corresponding rows
)
]
print(row_items)
You can find an additional example to scrap a table like this here

Beautiful Soup For loop gives me individual list, however in need a dataframe

I am trying scrap data using beautiful soup, however it comes in the form of the lists, however i need a pandas data frame. I am using a for loop to get the data however i am unable to append these to dataframe. When i check for the len of row it says only 1.
INFY = url.urlopen("https://in.finance.yahoo.com/quote/INFY.NS/history?p=INFY.NS")
div = INFY.read()
div = soup(div,'html.parser')
div = div.find("table",{"class":"W(100%) M(0)"})
table_rows = div.findAll("tr")
print(table_rows)
for tr in table_rows:
td = tr.findAll('td')
row = [i.text for i in td]
print(row)
Below is the result i get after running the code.
['30-Mar-2017', '1,034.00', '1,035.90', '1,020.25', '1,025.50', '1,010.02', '60,78,590']
['29-Mar-2017', '1,034.30', '1,041.50', '1,025.85', '1,031.85', '1,016.27', '34,90,593']
['28-Mar-2017', '1,031.50', '1,039.00', '1,030.05', '1,035.15', '1,019.52', '23,98,398']
pd.DataFrame([[i.text for i in tr.findAll('td')] for tr in table_rows])
You would then need to convert text values to their numeric equivalents.

Categories