I need to crawl the following page with Python, but I need the HTML the browser renders, not the generic source code of the link.
For example, open the link plus.google.com/s/casasgrandes27%40gmail.com/top without logging in; the second-to-last thumbnail is "G Suite":
<div class="Wbuh5e" jsname="r4nke">G Suite</div>
I cannot find that line of HTML in the output of this Python code:
from bs4 import BeautifulSoup
import requests

r = requests.get("https://plus.google.com/s/casasgrandes27%40gmail.com/top")
data = r.text
soup = BeautifulSoup(data, "lxml")
print(soup)
To get the soup object, try the following:
page = requests.get(url)  # url is the address of the page you want to parse
soup = BeautifulSoup(page.content, 'html.parser')
http://docs.python-requests.org/en/master/user/quickstart/#binary-response-content
https://www.crummy.com/software/BeautifulSoup/bs4/doc/
You can also try this code to read an HTML page:
import urllib.request

urls = "https://plus.google.com/s/casasgrandes27%40gmail.com/top"
html_file = urllib.request.urlopen(urls)
html_text = html_file.read().decode("utf-8")  # decode the response bytes instead of wrapping them in str()
print(html_text)
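Note that if the "G Suite" element is injected by JavaScript after the page loads, neither requests nor urllib will ever see it: both only fetch the raw source the server sends. In that case you need a browser-driven approach such as Selenium; see the sketch under the NASA question below, which has the same root cause.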
Related
I am trying to scrape this page: https://ntrs.nasa.gov/search.
I am using the code below, and Beautiful Soup finds only 3 tags when there are many more. I have tried the html5lib, lxml and html.parser parsers, but none of them worked.
Can you advise what the problem might be, please?
import requests
import urllib.request
from bs4 import BeautifulSoup

# Set the URL
url = 'https://ntrs.nasa.gov/search'

# Connect to the URL
response = requests.get(url)

# Parse HTML and save to a BeautifulSoup object
soup = BeautifulSoup(response.content, "html5lib")
# soup = BeautifulSoup(response.text, "html5lib")
# soup = BeautifulSoup(response.content, "html.parser")
# soup = BeautifulSoup(response.content, "lxml")

# loop through all a-tags
for a_tag in soup.findAll('a'):
    if a_tag.has_attr('title'):  # 'title' in a_tag tests children, not attributes
        if a_tag['title'] == 'Download Document':
            link = a_tag['href']
            download_url = 'https://ntrs.nasa.gov' + link
            urllib.request.urlretrieve(download_url, './' + link[link.find('/citations/')+1:11])
The data is pulled dynamically from a script tag. You can regex out the JavaScript object that contains the download URL, replace the HTML-entity placeholders, parse the result as JSON, then extract the desired URL:
import requests, re, json
r = requests.get('https://ntrs.nasa.gov/search')
data = json.loads(re.search(r'(\{.*/api.*\})', r.text).group(1).replace('&q;','"'))
print('https://ntrs.nasa.gov' + data['http://ntrs-proxy-auto-deploy:3001/citations/search']['results'][0]['downloads'][0]['links']['pdf'])
You could append ?attachment=true, but I don't think that is required.
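If the goal is to download every PDF rather than print the first link, a sketch along these lines should work; it assumes each result exposes the same downloads/links structure as the first one, which may not hold for every record:

import requests, re, json

r = requests.get('https://ntrs.nasa.gov/search')
data = json.loads(re.search(r'(\{.*/api.*\})', r.text).group(1).replace('&q;', '"'))

for result in data['http://ntrs-proxy-auto-deploy:3001/citations/search']['results']:
    for download in result.get('downloads', []):  # assumes 'downloads' is a list, as in the first result
        pdf_path = download.get('links', {}).get('pdf')
        if pdf_path:
            pdf_url = 'https://ntrs.nasa.gov' + pdf_path
            filename = pdf_path.rstrip('/').rsplit('/', 1)[-1]  # derive a file name from the URL path
            with open(filename, 'wb') as f:
                f.write(requests.get(pdf_url).content)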
Your problem stems from the fact that the page is rendered using JavaScript; the actual page source is only a few script and style tags.
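If you would rather work with the rendered DOM than dig the data out of the script tag, one option (not from the original answer) is to drive a real browser with Selenium. A minimal sketch, assuming Chrome and a matching chromedriver are installed, with a fixed sleep standing in for a proper wait:

from bs4 import BeautifulSoup
from selenium import webdriver
import time

driver = webdriver.Chrome()
driver.get('https://ntrs.nasa.gov/search')
time.sleep(5)  # crude; a WebDriverWait on a known element would be more robust

# driver.page_source holds the DOM after the JavaScript has run
soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()

print(len(soup.find_all('a')))  # now includes the dynamically rendered links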
I am scraping this site: https://finance.yahoo.com/quote/MSFT/press-releases.
In the browser there are 20+ articles. However, when I pull the site's HTML down and load it into HTML Agility Pack, only the first three articles appear.
open System.Net
open HtmlAgilityPack

let client = new WebClient()
let uri = "https://finance.yahoo.com/quote/MSFT/press-releases"
let response = client.DownloadString(uri)
let doc = HtmlDocument()
doc.LoadHtml(response)
This works:
let node = doc.DocumentNode.SelectSingleNode("//*[@id=\"summaryPressStream-0-Stream\"]/ul/li[1]")
node.InnerText
This does not:
let node = doc.DocumentNode.SelectSingleNode("//*[@id=\"summaryPressStream-0-Stream\"]/ul/li[10]")
node.InnerText
Is it because there are some janky li tags on the Yahoo site? Is it a limitation of HtmlAgilityPack?
I also wrote the same script in Python using BeautifulSoup and hit the same problem:
from bs4 import BeautifulSoup
from urllib.request import urlopen

url = "https://finance.yahoo.com/quote/MSFT/press-releases?"
page = urlopen(url)
html = page.read().decode("utf-8")
soup = BeautifulSoup(html, "html.parser")

for link in soup.find_all('a', href=True):
    print(link['href'])
Thanks
I want to extract the link
/stocks/company_info/stock_news.php?sc_id=CHC&scat=&pageno=2&next=0&durationType=Y&Year=2018&duration=1&news_type=
from the HTML of the page
http://www.moneycontrol.com/company-article/piramalenterprises/news/PH05#PH05
The following is the code I used, with Beautiful Soup:
import requests
from bs4 import BeautifulSoup

url_list = "http://www.moneycontrol.com/company-article/piramalenterprises/news/PH05#PH05"
html = requests.get(url_list)
soup = BeautifulSoup(html.text, 'html.parser')
link = soup.find_all('a')
print(link)
How would I go about it? find_all('a') doesn't return the required link in the returned HTML.
Please try this to get the exact URL you want:
import bs4 as bs
import requests
import re

sauce = requests.get('https://www.moneycontrol.com/stocks/company_info/stock_news.php?sc_id=CHC&durationType=Y&Year=2018')
soup = bs.BeautifulSoup(sauce.text, 'html.parser')

for a in soup.find_all('a', href=re.compile("company_info")):
    # print(a['href'])
    if 'pageno' in a['href']:
        print(a['href'])
output:
/stocks/company_info/stock_news.php?sc_id=CHC&scat=&pageno=2&next=0&durationType=Y&Year=2018&duration=1&news_type=
/stocks/company_info/stock_news.php?sc_id=CHC&scat=&pageno=3&next=0&durationType=Y&Year=2018&duration=1&news_type=
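These links are relative; if you need absolute URLs (an extra step beyond the original answer), joining them against the site root should do it:

from urllib.parse import urljoin

base = 'https://www.moneycontrol.com'
relative = '/stocks/company_info/stock_news.php?sc_id=CHC&scat=&pageno=2&next=0&durationType=Y&Year=2018&duration=1&news_type='

# urljoin resolves the leading-slash path against the host
print(urljoin(base, relative))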
You just have to use the get method to find the href attribute:
from bs4 import BeautifulSoup as soup
import requests

url_list = "http://www.moneycontrol.com/company-article/piramalenterprises/news/PH05#PH05"
html = requests.get(url_list)
page = soup(html.text, 'html.parser')
link = page.find_all('a')

for l in link:
    print(l.get('href'))
I'm trying to scrape this website with Python and BeautifulSoup. My code below first fetches all the links from the page, but while fetching them it strips the ampersands and query parameters from the original links. I wonder why? Would somebody know? The code is below, along with the output.
from bs4 import BeautifulSoup as bs
import requests

url = requests.get("http://mnregaweb4.nic.in/netnrega/demand_emp_demand.aspx?lflag=eng&file1=dmd&fin=2017-2018&fin_year=2017-2018&source=national&Digest=x44uSVqhiyzomN66Te0ELQ")
soup = bs(url.text, 'xml')
state = soup.find(id="t1")
state_links = []

for link in soup.find_all('a', href=True):
    state_links.append(link['href'])

state_links = [e for e in state_links if e not in ("javascript:history.go(-1);", "http://164.100.129.6/netnrega/MISreport4.aspx?fin_year=2013-2014rpt=RP&source=national", "javascript:__doPostBack('ctl00$ContentPlaceHolder1$LinkButton1','')")]

for dis_link in state_links:
    # print(dis_link)
    link_new = "http://mnregaweb4.nic.in/netnrega/" + dis_link
    print(link_new)
Output:
Actual Link: http://mnregaweb4.nic.in/netnrega/demand_emp_demand.aspx?file1=dmd&page1=s&lflag=eng&state_name=ANDHRA+PRADESH&state_code=02&fin_year=2017-2018&source=national&Digest=4jL5hchs+iT7xqB6T/UXzw
(The parameter names present in the actual link are missing from the scraped link.)
Scraped link: http://mnregaweb4.nic.in/netnrega/demand_emp_demand.aspx?file1=dmd=s=eng=ANDHRA+PRADESH=02=2017-2018=national=4jL5hchs+iT7xqB6T/UXzw
It might be because you are trying to parse it with 'xml'; try parsing it with 'html.parser' instead. In XML a bare & starts an entity reference, so an XML parser recovering from that error can silently drop the parameter names from a query string.
I am getting the following result with the code below:
from bs4 import BeautifulSoup as bs
import requests

url = requests.get("http://mnregaweb4.nic.in/ne....")
soup = bs(url.text, 'html.parser')
state_links = []

for link in soup.find_all('a', href=True):
    state_links.append(link['href'])

print(state_links)
# 'demand_emp_demand.aspx?file1=dmd&page1=s&lflag=eng&state_name=ANDHRA+PRADESH&state_code=02&fin_year=2017-2018&source=national&Digest=4jL5hchs+iT7xqB6T/UXzw'
This issue is about the parser used by Beautiful Soup.
Try with
soup = bs(url.text, 'html.parser')
or
soup = bs(url.text, 'lxml')
You might need to install a specific parser; see this chapter of the docs.
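A quick way to see the difference is to feed each parser a hypothetical snippet whose href contains bare ampersands (the exact recovery behavior of the XML parser may vary with your lxml version):

from bs4 import BeautifulSoup

snippet = "<html><body><a href='demand.aspx?file1=dmd&page1=s&lflag=eng'>link</a></body></html>"

# html.parser is lenient about bare & in attribute values and keeps the query string intact
print(BeautifulSoup(snippet, 'html.parser').find('a'))

# the XML parser treats &page1 and &lflag as (undefined) entity references and may mangle or drop them
print(BeautifulSoup(snippet, 'xml').find('a'))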
When I inspect the elements in my browser, I can see the full web content. But when I run the script below, some of the page details are missing: the page shows "#document" elements in the inspector, and their contents never appear in the script's output. How can I see the details of the #document elements, or extract them with the script?
from bs4 import BeautifulSoup
import requests

response = requests.get('http://123.123.123.123/')
soup = BeautifulSoup(response.content, 'html.parser')
print(soup.prettify())
You need to make additional requests to get the frame page contents as well:
from urllib.parse import urljoin  # "from urlparse import urljoin" on Python 2
from bs4 import BeautifulSoup
import requests

BASE_URL = 'http://123.123.123.123/'

with requests.Session() as session:
    response = session.get(BASE_URL)
    soup = BeautifulSoup(response.content, 'html.parser')

    # each <frame> is a separate document, fetched with its own request
    for frame in soup.select("frameset frame"):
        frame_url = urljoin(BASE_URL, frame["src"])
        response = session.get(frame_url)
        frame_soup = BeautifulSoup(response.content, 'html.parser')
        print(frame_soup.prettify())
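The same idea applies when the "#document" nodes come from <iframe> elements rather than a <frameset>; a hedged variant (the iframe tag and src attribute are standard HTML, but your page may load its frames differently):

from urllib.parse import urljoin
from bs4 import BeautifulSoup
import requests

BASE_URL = 'http://123.123.123.123/'

with requests.Session() as session:
    soup = BeautifulSoup(session.get(BASE_URL).content, 'html.parser')

    # follow each iframe's src just like a frame's
    for iframe in soup.select("iframe[src]"):
        frame_url = urljoin(BASE_URL, iframe["src"])
        frame_soup = BeautifulSoup(session.get(frame_url).content, 'html.parser')
        print(frame_soup.prettify())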