Cannot get the hyperlink href beautiful soup - python

I am trying to get the hyperlink of an anchor (a) element, but I keep getting:
https://in.finance.yahoo.com/https://in.finance.yahoo.com/
I have tried all solutions provided here: link
Here's my code:
import requests
import pandas as pd
from bs4 import BeautifulSoup

href_links = []
symbols = []
prices = []
commodities = []
CommoditiesUrl = "https://in.finance.yahoo.com/commodities"
r = requests.get(CommoditiesUrl)
data = r.text
soup = BeautifulSoup(data)
counter = 40
for i in range(40, 404, 14):
    for row in soup.find_all('tbody'):
        for srow in row.find_all('tr'):
            for symbol in srow.find_all('td', attrs={'class':'data-col0'}):
                symbols.append(symbol.text)
                href_link = soup.find('a').get('href')
                href_links.append('https://in.finance.yahoo.com/' + href_link)
            for commodity in srow.find_all('td', attrs={'class':'data-col1'}):
                commodities.append(commodity.text)
            for price in srow.find_all('td', attrs={'class':'data-col2'}):
                prices.append(price.text)
pd.DataFrame({"Links": href_links, "Symbol": symbols, "Commodity": commodities, "Prices": prices})
Also, I would like to know whether it is feasible, similarly to the website, to have the symbol of the commodity as a hyperlink in my pandas DataFrame.

I'm not sure what's going on with the code you posted, but you can simply get that URL by finding an a element with the attribute data-symbol set to GC=F. The html has 2 such elements. The one you want is the first one, which is what is returned by soup.find('a', {'data-symbol': 'GC=F'}).get('href').
import requests
import urllib.parse
from bs4 import BeautifulSoup

CommoditiesUrl = "https://in.finance.yahoo.com/commodities"
r = requests.get(CommoditiesUrl)
data = r.text
soup = BeautifulSoup(data, 'html.parser')
gold_href = soup.find('a', {'data-symbol': 'GC=F'}).get('href')
# If it is a relative URL, we need to transform it into an absolute URL (it always is, fwiw)
if not gold_href.startswith('http'):
    # If you insist, you can do 'https://in.finance.yahoo.com' + gold_href
    gold_href = urllib.parse.urljoin(CommoditiesUrl, gold_href)
print(gold_href)
Also, I would like to know whether it is feasible, similarly to the website, to have the symbol of the commodity as a hyperlink in my pandas DataFrame.
I'm not familiar with pandas, but I'd say the answer is yes. See: How to create a table with clickable hyperlink in pandas & Jupyter Notebook
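As a minimal sketch of that approach (the column names and URLs here are hypothetical stand-ins for the scraped data): build the `<a>` tags yourself and tell pandas not to escape them when rendering HTML.

```python
import pandas as pd

# Hypothetical columns standing in for the scraped symbol text and href.
df = pd.DataFrame({
    "Symbol": ["GC=F", "SI=F"],
    "Link": ["https://in.finance.yahoo.com/quote/GC%3DF",
             "https://in.finance.yahoo.com/quote/SI%3DF"],
})

# Wrap each symbol in an <a> tag pointing at its link.
df["Symbol"] = df.apply(
    lambda r: '<a href="{}">{}</a>'.format(r["Link"], r["Symbol"]), axis=1)

# escape=False keeps the tags intact, so the frame renders with clickable
# symbols when displayed as HTML (e.g. in a Jupyter notebook).
html = df.to_html(escape=False)
```

In a notebook you would display it with `IPython.display.HTML(html)`; in a plain console the tags stay as text.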

Related

Parsing a table from website (choosing correct HTML tag)

I need to make dataframe from the following page: http://pitzavod.ru/products/upakovka/
from bs4 import BeautifulSoup
import pandas as pd
import requests
kre = requests.get(f'http://pitzavod.ru/products/upakovka/')
soup = BeautifulSoup(kre.text, 'lxml')
table1 = soup.find('table', id="tab3")
I chose "tab3", as I found <div class="tab-pane fade" id="tab3" in the HTML text. But the variable table1 gives no output. How can I get the table? Thank you.
NOTE: you can get the table as a DataFrame in one statement with .read_html, but the DataFrame returned by pd.read_html('http://pitzavod.ru/products/upakovka/')[0] will not retain line breaks.
.find('table', id="tab3") searches for table tags with id="tab3", and there are no such elements in that page's HTML.
There's a div with id="tab3" (as you've noticed), but it does not contain any tables.
The only table on the page is contained in a div with id="tab4", so you might have used table1 = soup.find('div', id="tab4").table [although I prefer using .select with CSS selectors for targeting nested tags].
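A self-contained illustration of the difference, using a stripped-down stand-in for the page's HTML (not the real markup):

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the page: tab3 has no table, tab4 does.
html = """
<div class="tab-pane fade" id="tab3"><p>no table here</p></div>
<div class="tab-pane fade" id="tab4">
  <div class="table-responsive"><table><tr><td>cell</td></tr></table></div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

missing = soup.find('table', id="tab3")   # None - no <table> has that id
table1 = soup.find('div', id="tab4").table                            # navigate from the div
table2 = soup.select_one('div#tab4 > div.table-responsive > table')   # same tag via CSS
```

Both lookups land on the same Tag; `.select_one` just expresses the nesting in one selector.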
Suggested solution:
import requests
import pandas
from bs4 import BeautifulSoup

kre = requests.get('http://pitzavod.ru/products/upakovka/')
# print(kre.status_code, kre.reason, 'from', kre.url)
kre.raise_for_status()
soup = BeautifulSoup(kre.content, 'lxml')
# table = soup.select_one('div#tab4>div.table-responsive>table')
table = soup.find('table')  # soup.select_one('table')
tData = [{
    1 if 'center' in c.get('style', '') else ci: '\n'.join([
        l.strip() for l in c.get_text('\n').splitlines() if l.strip()
    ]) for ci, c in enumerate(r.select('td'))
} for r in table.select('tr')]
df = pandas.DataFrame(tData)
## combine the top 2 rows to form header ##
df.columns = ['\n'.join([
    f'{d}' for d in df[c][:2] if pandas.notna(d)
]) for c in df.columns]
df = df.drop([0, 1], axis='rows').reset_index(drop=True)
# print(df.to_markdown(tablefmt="fancy_grid"))
(Normally, I would use this function if I wanted to specify the separator for tag-contents inside cells, but the middle cell in the 2nd header row would be shifted if I used .DataFrame(read_htmlTable(table, tSep='\n', asObj='dicts')) - the 1 if 'center' in c.get('style', '') else ci bit in the above code is for correcting that.)

how to return data from multiple pages from table in url using beautifulsoup

I am trying to retrieve the code as well as the title, but somehow I am not able to. The website is:
https://www.unspsc.org/search-code/default.aspx?CSS=51%&Type=desc&SS%27
Here I have tried to get the values from the table:
import requests
from bs4 import BeautifulSoup

unspsc_link = "https://www.unspsc.org/search-code/default.aspx?CSS=51%&Type=desc&SS%27"
link = requests.get(unspsc_link).text
soup = BeautifulSoup(link, 'lxml')
print(soup.prettify())
all_table = soup.find_all('table')
print(all_table)
right_table = soup.find_all('table', id="dnn_ctr1535_UNSPSCSearch_gvDetailsSearchView")
tables = right_table.find_all('td')
print(tables)
The error is: AttributeError: ResultSet object has no attribute 'find'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?
I expect to save the code as well as the title in a list and put it in a DataFrame later.
Is there any way to continue to the next page without manually providing values in the search code (like 51%), as there are more than 20 pages inside 51%?
From the documentation:
AttributeError: 'ResultSet' object has no attribute 'foo' - This usually happens because you expected find_all() to return a single tag or string. But find_all() returns a list of tags and strings - a ResultSet object. You need to iterate over the list and look at the .foo of each one. Or, if you really only want one result, you need to use find() instead of find_all()
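In other words (a tiny stand-alone illustration, using made-up HTML):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<table><td>a</td></table><table><td>b</td></table>",
                     "html.parser")

result_set = soup.find_all('table')   # a ResultSet (list-like) - has no .find_all
single_tag = soup.find('table')       # a single Tag - .find_all works on it
cells = single_tag.find_all('td')
```

Calling `result_set.find_all('td')` here would raise exactly the AttributeError from the question; you either iterate over the ResultSet or use `find()` to get one Tag.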
Code:
import requests
from bs4 import BeautifulSoup
import pandas as pd
unspsc_link = "https://www.unspsc.org/search-code/default.aspx?CSS=51%&Type=desc&SS%27"
link = requests.get(unspsc_link).text
soup = BeautifulSoup(link, 'lxml')
right_table = soup.find('table', id="dnn_ctr1535_UNSPSCSearch_gvDetailsSearchView")
df = pd.read_html(str(right_table))[0]
# Clean up the DataFrame
df = df[[0, 1]]
df.columns = df.iloc[0]
df = df[1:]
print(df)
Output:
0 Code Title
1 51180000 Hormones and hormone antagonists
2 51280000 Antibacterials
3 51290000 Antidepressants
4 51390000 Sympathomimetic or adrenergic drugs
5 51460000 Herbal drugs
...
Notes:
- The row order may be a little different, but the data seems to be the same.
- You will have to remove the last one or two rows from the DataFrame as they are not relevant.
- This is the data from the first page only. Look into selenium to get the data from all pages by clicking on the buttons [1] [2] .... You can also use requests to emulate the POST request, but it is a bit difficult for this site (IMHO).
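If you do want to try the requests route: ASP.NET WebForms pages like this one usually paginate via a __doPostBack() call, which means re-POSTing the page's hidden state fields. A rough, untested outline of collecting those fields is below; the actual __EVENTTARGET value would have to be read from the pager link's JavaScript on the live page, so treat every name here as an assumption.

```python
from bs4 import BeautifulSoup

def build_postback_payload(html, event_target):
    """Collect the hidden ASP.NET state fields needed to emulate a postback.

    html: the current page's HTML; event_target: the control id taken from
    the pager link's __doPostBack(...) call (site-specific, not shown here).
    """
    soup = BeautifulSoup(html, "html.parser")
    payload = {"__EVENTTARGET": event_target, "__EVENTARGUMENT": ""}
    for name in ("__VIEWSTATE", "__VIEWSTATEGENERATOR", "__EVENTVALIDATION"):
        field = soup.find("input", {"name": name})
        if field is not None:
            payload[name] = field.get("value", "")
    return payload

# Usage sketch (hypothetical):
#   session.post(unspsc_link, data=build_postback_payload(page_html, target))
```

Each response carries fresh state fields, so the payload has to be rebuilt from every page before requesting the next one.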

How to pull values from a table with no defining characteristics?

I am trying to pull part numbers from a cross-reference website, but when I inspect the element, the only tags used around the table are tr, td, tbody, and table, which are used in many other places on the page. Currently I am using BeautifulSoup and selenium, and I am looking into using lxml.html for its XPath tool, but I can't seem to get BeautifulSoup to work with it.
The website I am trying to pull values from is
https://jdparts.deere.com/servlet/com.deere.u90.jdparts.view.servlets.searchcontroller.PartialPartNumberSearchController?action=UNSIGNED_VIEW
and technically I only want the Part Number, Make, Part No., Part Type, and Description values, but I can deal with getting the whole table.
When I use
html2 = browser.page_source
source = soup(html2, 'html.parser')
for article in source.find_all('td', valign='middle'):
    PartNumber = article.text.strip()
    number.append(PartNumber)
it gives me all the values on the page and several blank values all in a single line of text, which would be just as much work to sift through as just manually pulling the values.
Ultimately I am hoping to get the values in the table and formatted to look like the table and I can just delete the columns I don't need. What would be the best way to go about gathering the information in the table?
One approach would be to find the Qty. text, which is an element at the start of the table you want, and then look for the previous table. You can then iterate over the tr elements and produce a row of values from all the td elements in each row.
The Python itemgetter() function could be useful here, as it lets you extract the elements you want (in any order) from a bigger list. In this example, I have chosen items 1,2,3,4,5, but if say Make wasn't needed, you could provide 1,3,4,5.
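For instance, given a row of scraped cell texts, itemgetter picks out just the wanted columns as a tuple (the row contents here are illustrative):

```python
from operator import itemgetter

# A hypothetical scraped row: index 0 is a blank padding cell.
row = ["", "AT2", "Kralinator", "PMTF15013", "Filters", "Filter"]

req_fields = itemgetter(1, 2, 3, 4, 5)
req_fields(row)            # ('AT2', 'Kralinator', 'PMTF15013', 'Filters', 'Filter')

# Dropping Make is just a different index tuple:
itemgetter(1, 3, 4, 5)(row)   # ('AT2', 'PMTF15013', 'Filters', 'Filter')
```

Because the result of `itemgetter(...)` is a reusable callable, it can be built once and applied to every row in the loop.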
The search results might have multiple pages of results, if this is the case it checks for a Next Page button and if present adjusts params to get the next page of results. This continues until no next page is found:
from operator import itemgetter
import requests
from bs4 import BeautifulSoup
import csv

search_term = "AT2*"

params = {
    "userAction" : "search",
    "browse" : "",
    "screenName" : "partSearch",
    "priceIdx" : 1,
    "searchAppType" : "",
    "searchType" : "search",
    "partSearchNumber" : search_term,
    "pageIndex" : 1,
    "endPageIndex" : 100,
}

url = 'https://jdparts.deere.com/servlet/com.deere.u90.jdparts.view.servlets.searchcontroller.PartNumberSearch'
req_fields = itemgetter(1, 2, 3, 4, 5)
page_index = 1
session = requests.Session()
start_row = 0

with open('output.csv', 'w', newline='') as f_output:
    csv_output = csv.writer(f_output)

    while True:
        print(f'Page {page_index}')
        req = session.post(url, params=params)
        soup = BeautifulSoup(req.content, 'html.parser')
        table = soup.find(text='Qty.').find_previous('table')

        for tr in table.find_all('tr')[start_row:]:
            row = req_fields([value.get_text(strip=True) for value in tr.find_all('td')])

            if row[0]:
                csv_output.writerow(row)

        if soup.find(text='Next Page'):
            start_row = 2
            params = {
                "userAction" : "NextPage",
                "browse" : "NextPage",
                "pageIndex" : page_index,
                "endPageIndex" : 15,
            }
            page_index += 1
        else:
            break
Which would give you an output.csv file starting:
Part Number,Make,Part No.,Part Type,Description
AT2,Kralinator,PMTF15013,Filters,Filter
AT2,Kralinator,PMTF15013J,Filters,Filter
AT20,Berco,T139464,Undercarriage All Makes,Spring Pin
AT20061,A&I Products,A-RE29882,Clutch,Clutch Disk
Note: This makes use of requests instead of using selenium as it will be much faster.

Writing extracted items of from a website onto a .xls sheet with lists of different length using Pandas module in Python

I am a beginner in Python programming and I am practicing scraping different values from websites.
I have extracted the items from a particular website and now want to write them to a .xls file.
The whole web page has 714 records including duplicates, but the Excel sheet is displaying only 707 records, because the zip() function stops when the smallest list gets exhausted. Here the smallest list is the email list, so it gets exhausted and the iteration stops due to the behaviour of zip(). I have even added a check inside an if condition for the records that have no email address, so that it displays "No Email address", but still the same result is displayed, with 704 records and duplicates. Kindly tell me where I am going wrong, and if possible suggest what to do about removing duplicate records and displaying "No Email address" where there is no email.
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd

res = requests.get('https://www.raywhite.com/contact/?type=People&target=people&suburb=Sydney%2C+NSW+2000&radius=50%27%27&firstname=&lastname=&_so=contact', headers = {'User-agent': 'Super Bot 9000'})
soup = bs(res.content, 'lxml')

names = []
positions = []
phone = []
emails = []

links = [l1['href'] for l1 in soup.select('.agent-name a')]
nlist = soup.find_all('li', class_='agent-name')
plist = soup.find_all('li', class_='agent-role')
phlist = soup.find_all('li', class_='agent-officenum')
elist = soup.find_all('a', class_='val withicon')

for n1 in nlist:
    names.append(n1.text)
for p1 in plist:
    positions.append(p1.text)
for ph1 in phlist:
    phone.append(ph1.text)
for e1 in elist:
    emails.append(e1.get('href') if e1.get('href') is not None else 'No Email address')

df = pd.DataFrame(list(zip(names, positions, phone, emails, links)), columns=['Names', 'Position', 'Phone', 'Email', 'Link'])
df.to_excel(r'C:\Users\laptop\Desktop\RayWhite.xls', sheet_name='MyData2', index=False, header=True)
The Excel sheet looks like this, where we can see that the last record's name and email address do not match:
Ray White Excel Sheet
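The truncation itself is the documented behaviour of zip(): it stops at the shortest input. itertools.zip_longest pads instead of truncating, illustrated here with hypothetical lists:

```python
from itertools import zip_longest

# Hypothetical scraped lists where one agent has no email link on the page.
names = ["Agent A", "Agent B", "Agent C"]
emails = ["a@example.com", "b@example.com"]   # one entry short

truncated = list(zip(names, emails))          # 2 pairs - the last row is silently dropped
padded = list(zip_longest(names, emails, fillvalue="No Email address"))
# padded keeps all 3 rows, with the missing email filled in
```

Note that zip_longest only fixes the row count, not the alignment: the filler always lands at the end, even if the agent missing an email was in the middle. That is why matching fields within each contact card, rather than stitching parallel lists, is the more robust fix.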
It looks like you are doing many find_all's and then stitching them together. My advice would be to do one find_all then iterate through that. It makes it a lot easier to build out the columns of your dataframe when all your data is in one place.
I have updated the below code to successfully extract links without error. With any code there is a number of ways to perform the same task. This one may not be the most elegant but it does get the job done.
import requests
from bs4 import BeautifulSoup
import pandas as pd

r = requests.get('https://www.raywhite.com/contact/?type=People&target=people&suburb=Sydney%2C+NSW+2000&radius=50%27%27&firstname=&lastname=&_so=contact', headers = {'User-agent': 'Super Bot 9000'})
soup = BeautifulSoup(r.text, 'html.parser')

get_cards = soup.find_all("div", {"class": "card horizontal-split vcard"})

agent_list = []
for item in get_cards:
    name = item.find('li', class_='agent-name').text
    position = item.find('li', class_='agent-role').text
    phone = item.find('li', class_='agent-officenum').text
    link = item.find('li', class_='agent-name').a['href']
    try:
        email = item.find('a', class_='val withicon')['href'].replace('mailto:', '')
    except:
        email = 'No Email address'
    agent_list.append({'name': name, 'position': position, 'phone': phone, 'email': email, 'link': link})

df = pd.DataFrame(agent_list)
Above is some sample code I have put together to create the DataFrame. The key here is to do one find_all on {"class": "card horizontal-split vcard"}.
Hope that has been some help.
Cheers,
Adam
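One part of the question the answer above does not cover is removing duplicate records. Once everything is in a single DataFrame, pandas can drop exact duplicates in one call (shown on hypothetical data):

```python
import pandas as pd

# Hypothetical scraped rows containing one exact duplicate.
df = pd.DataFrame([
    {'name': 'Jane Doe', 'email': 'jane@example.com'},
    {'name': 'Jane Doe', 'email': 'jane@example.com'},
    {'name': 'John Roe', 'email': 'No Email address'},
])

deduped = df.drop_duplicates().reset_index(drop=True)
# deduped has 2 rows: the repeated Jane Doe record is removed
```

By default drop_duplicates compares all columns; pass subset=['name'] (for example) to deduplicate on specific fields only.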

web data having unequal length arrays python 3

Below is my code to scrape a website. I have to create a DataFrame from arrays of unequal length; for instance, property_type has varying length: some listings have one property_type, some have two, and some have three. Similarly, the agency name also has varying length.
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np

urls = []
for i in range(1, 3):
    pages = "http://www.realcommercial.com.au/for-sale/property-offices-retail-in-vic/list-{0}?includePropertiesWithin=includesurrounding&activeSort=list-date&autoSuggest=true".format(i)
    urls.append(pages)

Data = []
for info in urls:
    page = requests.get(info)
    soup = BeautifulSoup(page.content, 'html.parser')
    links = soup.find_all('a', attrs={'class': 'details-panel'})
    hrefs = [link['href'] for link in links]
    for href in hrefs:
        pages = requests.get(href)
        soup_2 = BeautifulSoup(pages.content, 'html.parser')
        Address_1 = soup_2.find_all('p', attrs={'class': 'full-address'})
        Address = [Address.text.strip() for Address in Address_1]
        Prop_Type = soup_2.find_all('div', attrs={'class': 'propType ellipsis'})
        Property_Type = [Property_Type.text.strip() for Property_Type in Prop_Type]
        Agency_1 = soup_2.find_all('div', attrs={'class': 'agencyName ellipsis'})
        Agency_Name = [Agency_Name.text.strip() for Agency_Name in Agency_1]
        Agent_1 = soup_2.find_all('div', attrs={'class': 'agentName ellipsis'})
        Agent_Name = [Agent_Name.text.strip() for Agent_Name in Agent_1]
        raw_data = dict(A=np.array(Address), B=np.array(Property_Type), C=np.array(Agency_Name), D=np.array(Agent_Name))
raw_df = pd.DataFrame(dict([ k,series(v) for k,v in raw_data.iteritems() ]))
The error I am getting is:
File "<ipython-input-8-3a7c5fc4fb93>", line 32
raw_df = pd.DataFrame(dict([ k,series(v) for k,v in raw_data.iteritems() ]))
^
SyntaxError: invalid syntax
What should I do to get a DataFrame where only the relevant values fall under the relevant columns, i.e. property type should be in the property type column and not fall under agency name?
Any help would be highly appreciated.
Thanks!
series is undefined - perhaps you meant to turn each v into a pandas Series with pd.Series. Also, the syntax inside dict([...]) is invalid: a list comprehension producing key/value pairs would need (k, series(v)) with parentheses. If you want to create key:value pairs where each value is a Series, and then use those pairs to create a DataFrame, you can do it with a dictionary comprehension, like so:
raw_df = pd.DataFrame.from_dict({k: pd.Series(v) for k, v in raw_data.items()})
(Note that dict.iteritems() only exists in Python 2; in Python 3 use .items().)
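With that fix, pandas aligns the Series on a shared index and pads the shorter columns with NaN instead of raising a length error. A quick illustration with hypothetical list lengths:

```python
import pandas as pd

# Hypothetical scrape results with unequal lengths per column.
raw_data = {
    "A": ["Address 1", "Address 2", "Address 3"],
    "B": ["Offices", "Retail"],        # only two property types found
    "D": ["Agent One"],                # only one agent name found
}

# Each list becomes a Series; short columns are NaN-padded to 3 rows.
raw_df = pd.DataFrame({k: pd.Series(v) for k, v in raw_data.items()})
# raw_df.shape is (3, 3)
```

Each value stays in its own column this way; the NaNs simply mark where a listing had fewer entries than the longest column.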
