Not able to find a link on a product page - Python

I am trying to make a list of the links that are inside a product page.
I have multiple links through which I want to get the product-page links; I am just posting the code for a single link.
import requests
from bs4 import BeautifulSoup
r = requests.get("https://funskoolindia.com/products.php?search=9723100")
soup = BeautifulSoup(r.content)
for a_tag in soup.find_all('a', class_='product-bg-panel', href=True):
    print('href: ', a_tag['href'])
This is what it should print: https://funskoolindia.com/product_inner_page.php?product_id=1113

The site is dynamic, so you can use selenium:
from bs4 import BeautifulSoup as soup
from selenium import webdriver
d = webdriver.Chrome('/path/to/chromedriver')
d.get('https://funskoolindia.com/products.php?search=9723100')
panels = soup(d.page_source, 'html.parser').find_all('div', {'class': 'product-media light-bg'})
results = [*{i.a['href'] for i in panels}]
Output:
['product_inner_page.php?product_id=1113']
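The href is relative; if you want the absolute URL shown in the question, you can join it against the page URL with the standard library. A small follow-up sketch, reusing results from above:
from urllib.parse import urljoin
base = 'https://funskoolindia.com/products.php?search=9723100'
print([urljoin(base, href) for href in results])
# ['https://funskoolindia.com/product_inner_page.php?product_id=1113']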

The data is loaded dynamically through JavaScript from a different URL. One solution is using Selenium, which executes the JavaScript and loads the links that way.
Another solution is to use the re module and request the data URL manually:
import re
import requests
from bs4 import BeautifulSoup
url = 'https://funskoolindia.com/products.php?search=9723100'
data_url = 'https://funskoolindia.com/admin/load_data.php'
data = {'page': '1',
        'sort_val': 'new',
        'product_view_val': 'grid',
        'show_list': '12',
        'brand_id': '',
        'checkboxKey': re.findall(r'var checkboxKey = "(.*?)";', requests.get(url).text)[0]}
soup = BeautifulSoup(requests.post(data_url, data=data).text, 'lxml')
for a in soup.select('#list-view .product-bg-panel > a[href]'):
    print('https://funskoolindia.com/' + a['href'])
Prints:
https://funskoolindia.com/product_inner_page.php?product_id=1113

Try this: print('href: ', a_tag.get("href")), and add features="lxml" to the BeautifulSoup constructor.
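For reference, the original snippet with both suggestions applied might look like this (a sketch; if the links are injected by JavaScript, as the other answers explain, this alone still won't surface them):
import requests
from bs4 import BeautifulSoup
r = requests.get("https://funskoolindia.com/products.php?search=9723100")
soup = BeautifulSoup(r.content, features="lxml")  # explicit parser silences the bs4 warning
for a_tag in soup.find_all('a', class_='product-bg-panel', href=True):
    print('href: ', a_tag.get("href"))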

Related

Web scraping IMDB with Python's Beautiful Soup

I am trying to parse this page "https://www.imdb.com/title/tt0068112/?ref_=fn_al_tt_1", but I can't find the href that I need (href="/title/tt0068112/episodes?ref_=tt_eps_sm").
I tried with this code:
url="https://www.imdb.com/title/tt0068112/?ref_=fn_al_tt_1"
page(requests.get(url)
soup=BeautifulSoup(page.content,"html.parser")
for a in soup.find_all('a'):
print(a['href'])
What's wrong with this? I also tried to check "manually" with print(soup.prettify()), but it seems that the link is hidden or something like that.
You can get the page HTML with requests; the href item is in there, no need for special APIs. I tried this and it worked:
import requests
from bs4 import BeautifulSoup
page = requests.get("https://www.imdb.com/title/tt0068112/?ref_=fn_al_tt_1")
soup = BeautifulSoup(page.content, "html.parser")
scooby_link = ""
for item in soup.find_all("a", href="/title/tt0068112/episodes?ref_=tt_eps_sm"):
    print(item["href"])
    scooby_link = "https://www.imdb.com" + "/title/tt0068112/episodes?ref_=tt_eps_sm"
print(scooby_link)
I'm assuming you also wanted to save the link to a variable for further scraping so I did that as well. 🙂
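For instance, a minimal follow-up sketch reusing scooby_link (the User-Agent header is an assumption; IMDb sometimes rejects requests without a browser-like one):
# continues from the snippet above
episodes_page = requests.get(scooby_link, headers={"User-Agent": "Mozilla/5.0"})
episodes_soup = BeautifulSoup(episodes_page.content, "html.parser")
print(episodes_soup.title.text if episodes_soup.title else "no <title> found")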
To get the link containing Episodes, you can use the following example:
import requests
from bs4 import BeautifulSoup
url = "https://www.imdb.com/title/tt0068112/?ref_=fn_al_tt_1"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
print(soup.select_one("a:-soup-contains(Episodes)")["href"])
Prints:
/title/tt0068112/episodes?ref_=tt_eps_sm

How to extract url/links that are contents of a webpage with BeautifulSoup

So the website I am using is https://keithgalli.github.io/web-scraping/webpage.html and I want to extract all the social media links on the webpage.
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://keithgalli.github.io/web-scraping/webpage.html')
soup = bs(r.content)
links = soup.find_all('a', {'class':'socials'})
actual_links = [link['href'] for link in links]
I get an error, specifically:
KeyError: 'href'
For a different example and webpage, I was able to use the same code to extract the webpage link, but for some reason this time it is not working and I don't know why.
I also tried to see what the problem was specifically, and it appears that
links is a nested array where links[0] outputs the entire content of the ul tag that has class=socials. So it's not iterable, so to speak, since the first element contains all the links rather than each social li tag being a separate element inside links.
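A quick diagnostic sketch makes the cause visible: print which tags actually carry the socials class and whether they have an href (on this page the class sits on the list markup, not on the <a> tags, as the answer below explains):
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://keithgalli.github.io/web-scraping/webpage.html')
soup = bs(r.content, 'html.parser')
for tag in soup.find_all(class_='socials'):
    print(tag.name, tag.get('href'))  # a matching tag without an href explains the KeyError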
Here is a solution using CSS selectors:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://keithgalli.github.io/web-scraping/webpage.html')
soup = bs(r.content, 'lxml')
links = soup.select('ul.socials li a')
actual_links = [link['href'] for link in links]
print(actual_links)
Output:
['https://www.instagram.com/keithgalli/', 'https://twitter.com/keithgalli', 'https://www.linkedin.com/in/keithgalli/', 'https://www.tiktok.com/#keithgalli']
Why not try something like:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://keithgalli.github.io/web-scraping/webpage.html')
soup = bs(r.content, 'html.parser')
links = soup.find_all('a', {'class': 'socials'})
actual_links = [link['href'] for link in links if link.has_attr('href')]
After gaining some new information from you and visiting the webpage, I've realized that you made the following mistake:
The socials class is never used on any a element, so you won't find any such in your script. Instead you should look for the li elements with the class "social".
Thus your code should look like:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://keithgalli.github.io/web-scraping/webpage.html')
soup = bs(r.content, "lxml")
link_list_items = soup.find_all('li', {'class': 'social'})
links = [item.find('a').get('href') for item in link_list_items]
print(links)

BeautifulSoup.select classname not working

I am trying to find tags by CSS class, using BeautifulSoup.
I read the documentation and tried different ways, but the code below returns new_elem : [].
Could you help me understand what I am doing wrong? Thanks.
import requests
from bs4 import BeautifulSoup
url = "https://solanamonkeysclub.com/#/#mint"
response = requests.get(url)
response.encoding = response.apparent_encoding
soup = BeautifulSoup(response.text, 'html.parser')
new_elems = str(soup.select('.ant-card-body'))
print(f'{"new_elem":10} : {new_elems}')
As the URL is dynamic, I use selenium with bs4 and get the following output:
Code:
from bs4 import BeautifulSoup
import time
from selenium import webdriver
driver = webdriver.Chrome('chromedriver.exe')
url = "https://solanamonkeysclub.com/#/#mint"
driver.get(url)
time.sleep(8)
soup = BeautifulSoup(driver.page_source, 'html.parser')
new_elems = soup.select('.ant-card-body')
for new_elem in new_elems:
    print(f'{"new_elem":10} : {new_elem.text}')
OUTPUT:
new_elem : 0
new_elem : 0
Have you looked at the output at all? You should bring up this page in a browser and do a "view source", or do print(response.text) after you fetch it. The page as delivered contains no HTML elements; the entire page is built dynamically using JavaScript.
You will need to use something like Selenium to scrape it.
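For example, a Selenium variant that waits for the cards to render instead of sleeping for a fixed time might look like this (a sketch; the .ant-card-body selector comes from the answer above, and the 15-second timeout is an assumption):
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Chrome('chromedriver.exe')
driver.get("https://solanamonkeysclub.com/#/#mint")
# wait until at least one card body has been rendered by the page's JavaScript
WebDriverWait(driver, 15).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, '.ant-card-body')))
for elem in driver.find_elements(By.CSS_SELECTOR, '.ant-card-body'):
    print(elem.text)
driver.quit()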

BeautifulSoup extracting URLs from a given website menu

Hello everyone, I'm new to BeautifulSoup. I'm trying to write a function that will be able to extract second-level URLs from a given website.
For example, if I have this website URL: https://edition.cnn.com/, my function should be able to return:
https://edition.cnn.com/world
https://edition.cnn.com/politics
https://edition.cnn.com/business
https://edition.cnn.com/health
https://edition.cnn.com/entertainment
https://edition.cnn.com/style
https://edition.cnn.com/travel
First I tried this code to retrieve all links starting with the string of the URL:
from bs4 import BeautifulSoup as bs4
import requests
import re
def getLinks(url):
    response = requests.get(url)
    data = response.text
    soup = bs4(data, 'lxml')
    links = []
    for link in soup.find_all('a', href=re.compile(str(url))):
        links.append(link.get('href'))
    return links
But then again the actual output is giving me all the links, even links of articles, which is not what I'm looking for. Is there a method I can use to get what I want, using regular expressions or otherwise?
The links are inside the <nav> tag, so the CSS selector nav a[href] will select only links inside the <nav> tag:
import requests
from bs4 import BeautifulSoup
url = 'https://edition.cnn.com'
soup = BeautifulSoup(requests.get(url).text, 'lxml')
for a in soup.select('nav a[href]'):
    if a['href'].count('/') > 1 or '#' in a['href']:
        continue
    print(url + a['href'])
Prints:
https://edition.cnn.com/world
https://edition.cnn.com/politics
https://edition.cnn.com/business
https://edition.cnn.com/health
https://edition.cnn.com/entertainment
https://edition.cnn.com/style
https://edition.cnn.com/travel
https://edition.cnn.com/sport
https://edition.cnn.com/videos
https://edition.cnn.com/world
https://edition.cnn.com/africa
https://edition.cnn.com/americas
https://edition.cnn.com/asia
https://edition.cnn.com/australia
https://edition.cnn.com/china
https://edition.cnn.com/europe
https://edition.cnn.com/india
https://edition.cnn.com/middle-east
https://edition.cnn.com/uk
...and so on.

Extracting links from the HTML of the following website

I want to extract the link
/stocks/company_info/stock_news.php?sc_id=CHC&scat=&pageno=2&next=0&durationType=Y&Year=2018&duration=1&news_type=
from the html of the page
http://www.moneycontrol.com/company-article/piramalenterprises/news/PH05#PH05
The following is the code that I used with Beautiful Soup:
import requests
from bs4 import BeautifulSoup
url_list = "http://www.moneycontrol.com/company-article/piramalenterprises/news/PH05#PH05"
html = requests.get(url_list)
soup = BeautifulSoup(html.text, 'html.parser')
link = soup.find_all('a')
print(link)
How would I go about it? find_all('a') doesn't return the required link in the returned HTML.
Please try this to get the exact URL you want.
import bs4 as bs
import requests
import re
sauce = requests.get('https://www.moneycontrol.com/stocks/company_info/stock_news.php?sc_id=CHC&durationType=Y&Year=2018')
soup = bs.BeautifulSoup(sauce.text, 'html.parser')
for a in soup.find_all('a', href=re.compile("company_info")):
    # print(a['href'])
    if 'pageno' in a['href']:
        print(a['href'])
output:
/stocks/company_info/stock_news.php?sc_id=CHC&scat=&pageno=2&next=0&durationType=Y&Year=2018&duration=1&news_type=
/stocks/company_info/stock_news.php?sc_id=CHC&scat=&pageno=3&next=0&durationType=Y&Year=2018&duration=1&news_type=
You just have to use the get method to find the href attribute:
from bs4 import BeautifulSoup as soup
import requests
url_list = "http://www.moneycontrol.com/company-article/piramalenterprises/news/PH05#PH05"
html = requests.get(url_list)
page = soup(html.text, 'html.parser')
link = page.find_all('a')
for l in link:
    print(l.get('href'))
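Note that get('href') returns None for any <a> tag without an href attribute; if you only want real links, you can filter those out, e.g. (reusing link from the loop above):
hrefs = [l.get('href') for l in link if l.get('href')]
print(hrefs)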
