Extracting links from html from the link of the following website - python

I want to extract the link
/stocks/company_info/stock_news.php?sc_id=CHC&scat=&pageno=2&next=0&durationType=Y&Year=2018&duration=1&news_type=
from the html of the page
http://www.moneycontrol.com/company-article/piramalenterprises/news/PH05#PH05
The following is the code that is used
url_list = "http://www.moneycontrol.com/company-article/piramalenterprises/news/PH05#PH05"
html = requests.get(url_list)
soup = BeautifulSoup(html.text,'html.parser')
link = soup.find_all('a')
print(link)
using beautiful soup. How would I go about it, using find_all('a") doesn't return the required link in the returned html.

Please try this to get Exact Url you want.
import bs4 as bs
import requests
import re
sauce = requests.get('https://www.moneycontrol.com/stocks/company_info/stock_news.php?sc_id=CHC&durationType=Y&Year=2018')
soup = bs.BeautifulSoup(sauce.text, 'html.parser')
for a in soup.find_all('a', href=re.compile("company_info")):
# print(a['href'])
if 'pageno' in a['href']:
print(a['href'])
output:
/stocks/company_info/stock_news.php?sc_id=CHC&scat=&pageno=2&next=0&durationType=Y&Year=2018&duration=1&news_type=
/stocks/company_info/stock_news.php?sc_id=CHC&scat=&pageno=3&next=0&durationType=Y&Year=2018&duration=1&news_type=

You just have to use the get method to find the href attribute:
from bs4 import BeautifulSoup as soup
import requests
url_list = "http://www.moneycontrol.com/company-article/piramalenterprises/news/PH05#PH05"
html = requests.get(url_list)
page= soup(html.text,'html.parser')
link = page.find_all('a')
for l in link:
print(l.get('href'))

Related

Web scraping IMDB with Python's Beautiful Soup

I am trying to parse this page "https://www.imdb.com/title/tt0068112/?ref_=fn_al_tt_1", but I can't find the href that I need (href="/title/tt0068112/episodes?ref_=tt_eps_sm").
I tried with this code:
url="https://www.imdb.com/title/tt0068112/?ref_=fn_al_tt_1"
page(requests.get(url)
soup=BeautifulSoup(page.content,"html.parser")
for a in soup.find_all('a'):
print(a['href'])
What's wrong with this? I also tried to check "manually" with print(soup.prettify()) but it seems that that link is hidden or something like that.
You can get the page html with requests, the href item is in there, no need for special apis. I tried this and it worked:
import requests
from bs4 import BeautifulSoup
page = requests.get("https://www.imdb.com/title/tt0068112/?ref_=fn_al_tt_1")
soup = BeautifulSoup(page.content, "html.parser")
scooby_link = ""
for item in soup.findAll("a", href="/title/tt0068112/episodes?ref_=tt_eps_sm"):
print(item["href"])
scooby_link = "https://www.imdb.com" + "/title/tt0068112/episodes?ref_=tt_eps_sm"
print(scooby_link)
I'm assuming you also wanted to save the link to a variable for further scraping so I did that as well. 🙂
To get the link with Episodes you can use next example:
import requests
from bs4 import BeautifulSoup
url = "https://www.imdb.com/title/tt0068112/?ref_=fn_al_tt_1"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
print(soup.select_one("a:-soup-contains(Episodes)")["href"])
Prints:
/title/tt0068112/episodes?ref_=tt_eps_sm

How do I get certain links from a website, but not all of them?

Here's what I have so far:
import requests
from bs4 import BeautifulSoup
def linkScraper():
html = requests.get("https://www.bbc.com/").text
soup = BeautifulSoup(html, 'html.parser')
for link in soup.find_all('a'):
print(link.get('href'))
But this prints every single link on the website. How can I configure this to give me the links to the articles that appear on the BBC's homepage?
You can filter it with list comprehension:
import requests
from bs4 import BeautifulSoup
def linkScraper():
html = requests.get("https://www.bbc.com/").text
soup = BeautifulSoup(html, 'html.parser')
links = [link['href'] for link in soup.find_all('a') if link['href'].startswith('https://www.bbc.com/')]
for i in links:
print(i)

How to extract url/links that are contents of a webpage with BeautifulSoup

So the website I am using is : https://keithgalli.github.io/web-scraping/webpage.html and I want to extract all the social media links on the webpage.
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://keithgalli.github.io/web-scraping/webpage.html')
soup = bs(r.content)
links = soup.find_all('a', {'class':'socials'})
actual_links = [link['href'] for link in links]
I get an error, specifically:
KeyError: 'href'
For a different example and webpage, I was able to use the same code to extract the webpage link but for some reason this time it is not working and I don't know why.
I also tried to see what the problem was specifically and it appears that
links is a nested array where links[0] outputs the entire content of the ul tag that has class=socials so its not iterable so to speak since the first element contains all the links rather than having each social li tag be seperate elements inside links
Here is the solution using css selectors:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://keithgalli.github.io/web-scraping/webpage.html')
soup = bs(r.content, 'lxml')
links = soup.select('ul.socials li a')
actual_links = [link['href'] for link in links]
print(actual_links)
Output:
['https://www.instagram.com/keithgalli/', 'https://twitter.com/keithgalli', 'https://www.linkedin.com/in/keithgalli/', 'https://www.tiktok.com/#keithgalli']
Why not try something like:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://keithgalli.github.io/web-
scraping/webpage.html')
soup = bs(r.content)
links = soup.find_all('a', {'class':'socials'})
actual_links = [link['href'] for link in links if 'href' in link.keys()]
After gaining some new information from you and visiting the webpage, I've realized that you did the following mistake:
The socials class is never used in any a-element and thus you won't find any such in your script. Instead you should look for the li-elements with the class "social".
Thus your code should look like:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://keithgalli.github.io/web-
scraping/webpage.html')
soup = bs(r.content, "lxml")
link_list_items = soup.find_all('li', {'class':'social'})
links = [item.find('a').get('href') for item in link_list_items]
print(links)

How do I exclude certain beautifulsoup results that I don't want?

I am having issues trying to exclude results given from my beautiful soup program this is my code:
from bs4 import BeautifulSoup
import requests
URL = 'https://en.wikipedia.org/wiki/List_of_Wikipedia_mobile_applications'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
for link in soup.find_all('a'):
print(link.get('href'))
I don't want to get the results that start with a "#" for example: #cite_ref-18
I have tried using for loops but I get this error message: KeyError: 0
You can use the str.startswith() method:
from bs4 import BeautifulSoup
import requests
URL = 'https://en.wikipedia.org/wiki/List_of_Wikipedia_mobile_applications'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
for tag in soup.find_all('a'):
link = tag.get('href')
if not str(link).startswith('#'):
print(link)
You can use CSS selector a[href]:not([href^="#"]). This will select all <a> tags with href= attribute but not the ones starting with # character:
import requests
from bs4 import BeautifulSoup
URL = 'https://en.wikipedia.org/wiki/List_of_Wikipedia_mobile_applications'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
for link in soup.select('a[href]:not([href^="#"])'):
print(link['href'])

Why is Python Beautiful Soup stripping parameters from the scraped URL

I'm trying to scrape this website with Python BeautifulSoup. And my code below is first fetching all the links from the page. While fetching the links it is stripping ampersands and parameters from the original link. I wonder why? Would somebody know? I've got the code down here along with the output.
from bs4 import BeautifulSoup as bs
import requests
url = requests.get ("http://mnregaweb4.nic.in/netnrega/demand_emp_demand.aspx?lflag=eng&file1=dmd&fin=2017-2018&fin_year=2017-2018&source=national&Digest=x44uSVqhiyzomN66Te0ELQ")
soup = bs(url.text, 'xml')
state= soup.find(id = "t1")
state_links = []
for link in soup.find_all('a', href= True):
state_links.append(link['href'])
state_links = [e for e in state_links if e not in ("javascript:history.go(-1);", "http://164.100.129.6/netnrega/MISreport4.aspx?fin_year=2013-2014rpt=RP&source=national", "javascript:__doPostBack('ctl00$ContentPlaceHolder1$LinkButton1','')")]
for dis_link in state_links:
# print (dis_link)
link_new = "http://mnregaweb4.nic.in/netnrega/"+dis_link
print (link_new)
Output:
Actual Link: http://mnregaweb4.nic.in/netnrega/demand_emp_demand.aspx?file1=dmd&page1=s&lflag=eng&state_name=ANDHRA+PRADESH&state_code=02&fin_year=2017-2018&source=national&Digest=4jL5hchs+iT7xqB6T/UXzw
(Highlighted stuff in code is missing from the scraped link)
Scraped link: http://mnregaweb4.nic.in/netnrega/demand_emp_demand.aspx?file1=dmd=s=eng=ANDHRA+PRADESH=02=2017-2018=national=4jL5hchs+iT7xqB6T/UXzw
It might be because you are trying to parse it with 'xml', instead try to parse it with 'html.parser',
I am getting the following result with the code below:
from bs4 import BeautifulSoup as bs
import requests
url = requests.get ("http://mnregaweb4.nic.in/ne....")
soup = bs(url.text, 'html.parser')
state_links = []
for link in soup.find_all('a', href=True):
state_links.append(link['href'])
print(state_links)
# 'demand_emp_demand.aspx?file1=dmd&page1=s&lflag=eng&state_name=ANDHRA+PRADESH&state_code=02&fin_year=2017-2018&source=national&Digest=4jL5hchs+iT7xqB6T/UXzw'
This issue is about the parser used in Beautifulsoup.
Try with
soup = bs(url.text, 'html.parser')
or
soup = bs(url.text, 'lxml')
You might need to install some specific parser, see this chapter of the doc.

Categories