Python Web Scraping With Problems

I am using requests-HTML and BeautifulSoup to scrape a website; the code is below. The weird thing is that I can sometimes get the text from the page using print(soup.get_text()), but I get random-looking codes when using print(soup) - see the attached image.
from requests_html import HTMLSession
from bs4 import BeautifulSoup as bs

# url is the page being scraped (not shown in the question)
session = HTMLSession()
r = session.get(url)
soup = bs(r.content, "html.parser")
print(soup.get_text())
#print(soup)
The program returns this (see the image) when I try to look at the soup.

I think the site is JavaScript-protected. Try this; it might help:
import requests
from bs4 import BeautifulSoup

# url is the target page
r = requests.get(url)
print(r.text)
# If you only want part of the content, you can slice the response stored
# in r, or do it properly with bs4:
soup = BeautifulSoup(r.text, "html.parser")
print(soup.text)
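The difference between the two print calls can be seen on a tiny static document (a standalone sketch, not the asker's page):

```python
from bs4 import BeautifulSoup

# A small static document to show what each call prints.
html = "<html><body><h1>Title</h1><p>Some text.</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

print(soup)             # the full markup, tags included
print(soup.get_text())  # only the text nodes
```

If print(soup) shows unreadable codes while get_text() sometimes works, the markup itself is likely script-generated or encoded, which is why rendering-capable tools come up in answers like the one above.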

Related

My web scraping does not work and I don't know what the problem is

import requests
from bs4 import BeautifulSoup
req = requests.get("https://www.arukereso.hu/mobiltelefon-c3277/")
soup = BeautifulSoup(req.content, "head")
print(soup.prettify())
Assuming you want the entire HTML content (it's not clear what you wanted to achieve with "head"):
Try replacing
soup = BeautifulSoup(req.content, "head")
with
soup = BeautifulSoup(req.content, "html.parser")
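For context, the second argument to BeautifulSoup names a parser ("html.parser", "lxml", "html5lib"), not a tag to extract. If the goal was the <head> section, that is selected after parsing (a small sketch with made-up markup):

```python
from bs4 import BeautifulSoup

# The parser argument selects a parsing backend; tags are extracted
# afterwards from the parsed tree.
html = "<html><head><title>Phones</title></head><body><p>items</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

print(soup.head)              # just the <head> subtree
print(soup.head.title.text)   # Phones
```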

Web scraping by BeautifulSoup in Python

I tried to retrieve the table data from the link below with Python, but unfortunately it brought back all the HTML tags without the table. Could you do me a favor and help me?
https://www150.statcan.gc.ca/n1/pub/71-607-x/2021004/exp-eng.htm?r1=(1)&r2=0&r3=0&r4=12&r5=0&r7=0&r8=2022-02-01&r9=2022-02-01
my code:
import requests
from bs4 import BeautifulSoup
url = 'https://www150.statcan.gc.ca/n1/pub/71-607-x/2021004/exp-eng.htm?r1=(1)&r2=0&r3=0&r4=12&r5=0&r7=0&r8=2022-02-01&r9=2022-02-01'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
print(soup)
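The table on that page may be filled in by JavaScript, in which case requests alone will never see it. On static HTML, a table is extracted row by row like this (synthetic markup, not the statcan page):

```python
from bs4 import BeautifulSoup

# Synthetic static table; a JavaScript-rendered table would not appear
# in the HTML that requests downloads.
html = """
<table>
  <tr><th>Country</th><th>Value</th></tr>
  <tr><td>Canada</td><td>100</td></tr>
  <tr><td>Mexico</td><td>42</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

rows = []
for tr in soup.find_all("tr"):
    cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    rows.append(cells)

print(rows)
# [['Country', 'Value'], ['Canada', '100'], ['Mexico', '42']]
```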

Beautiful Soup can't find most of the tags

I am trying to scrape the page https://ntrs.nasa.gov/search.
I am using the code below, and BeautifulSoup finds only 3 tags when there are many more. I have tried the html5lib, lxml, and html.parser parsers, but none of them has worked.
Can you please advise what the problem might be?
import requests
import urllib.request
import time
from bs4 import BeautifulSoup

# Set the URL
url = 'https://ntrs.nasa.gov/search'
# Connect to the URL
response = requests.get(url)
# Parse HTML and save to a BeautifulSoup object
soup = BeautifulSoup(response.content, "html5lib")
# soup = BeautifulSoup(response.text, "html5lib")
# soup = BeautifulSoup(response.content, "html.parser")
# soup = BeautifulSoup(response.content, "lxml")

# Loop through all a-tags
for a_tag in soup.find_all('a'):
    if a_tag.get('title') == 'Download Document':
        link = a_tag['href']
        download_url = 'https://ntrs.nasa.gov' + link
        urllib.request.urlretrieve(download_url, './' + link[link.find('/citations/')+1:11])
The data is dynamically pulled from a script tag. You can regex out the JavaScript object that contains the download URL, handle some string replacements for HTML entities, parse it as JSON, then extract the desired URL:
import requests, re, json
r = requests.get('https://ntrs.nasa.gov/search')
data = json.loads(re.search(r'(\{.*/api.*\})', r.text).group(1).replace('&q;','"'))
print('https://ntrs.nasa.gov' + data['http://ntrs-proxy-auto-deploy:3001/citations/search']['results'][0]['downloads'][0]['links']['pdf'])
You could append ?attachment=true, but I don't think that is required.
Your problem stems from the fact that the page is rendered using JavaScript, and the actual page source is only a few script and style tags.
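The regex-and-replace step from the first answer can be exercised offline on a synthetic snippet (the key names and paths here are invented; the real page uses different ones):

```python
import json
import re

# A synthetic stand-in for the state object the page embeds in a <script>
# tag; the real page encodes double quotes as "&q;".
page_text = (
    '<script>window.state = '
    '{&q;/api/search&q;:{&q;results&q;:[{&q;downloads&q;:'
    '[{&q;links&q;:{&q;pdf&q;:&q;/downloads/x.pdf&q;}}]}]}}'
    '</script>'
)

# Same pattern as the answer: grab the braced object mentioning /api,
# restore the quotes, then walk the parsed JSON.
raw = re.search(r'(\{.*/api.*\})', page_text).group(1)
data = json.loads(raw.replace('&q;', '"'))
pdf = data['/api/search']['results'][0]['downloads'][0]['links']['pdf']
print('https://ntrs.nasa.gov' + pdf)
# https://ntrs.nasa.gov/downloads/x.pdf
```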

How to get Html code after crawling with python

https://plus.google.com/s/casasgrandes27%40gmail.com/top
I need to crawl the following page with Python, but I need its rendered HTML, not the generic source code of the link.
For example:
Open the link plus.google.com/s/casasgrandes27%40gmail.com/top without logging in; the second-to-last thumbnail will be "G Suite".
<div class="Wbuh5e" jsname="r4nke">G Suite</div>
I am unable to find the above line of HTML code after executing this Python code:
from bs4 import BeautifulSoup
import requests
L = list()
r = requests.get("https://plus.google.com/s/casasgrandes27%40gmail.com/top")
data = r.text
soup = BeautifulSoup(data,"lxml")
print(soup)
To get the soup object, try the following:
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
http://docs.python-requests.org/en/master/user/quickstart/#binary-response-content
https://www.crummy.com/software/BeautifulSoup/bs4/doc/
You can try this code to read an HTML page:
import urllib.request

url = "https://plus.google.com/s/casasgrandes27%40gmail.com/top"
with urllib.request.urlopen(url) as html_file:
    html_text = html_file.read().decode('utf-8')
print(html_text)
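A side note on that snippet: urlopen().read() returns bytes, and calling str() on bytes produces the b'...' repr rather than the decoded text; .decode() is what converts bytes into a usable string:

```python
# str() on bytes yields the repr with a b prefix; .decode() yields text.
raw = b"<div>G Suite</div>"

print(str(raw))             # b'<div>G Suite</div>'  (repr, note the b prefix)
print(raw.decode("utf-8"))  # <div>G Suite</div>
```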

Could not extract #document from HTML file through Python web scraping

When I inspect the elements in my browser, I can obviously see the exact web content. But when I run the script below, I cannot see some of the web page details. The web page contains "#document" elements, and those are missing when I run the script. How can I see the details of the #document elements, or extract them with the script?
from bs4 import BeautifulSoup
import requests

response = requests.get('http://123.123.123.123/')
soup = BeautifulSoup(response.content, 'html.parser')
print(soup.prettify())
You need to make additional requests to get the frame page contents as well:
from urllib.parse import urljoin
from bs4 import BeautifulSoup
import requests

BASE_URL = 'http://123.123.123.123/'

with requests.Session() as session:
    response = session.get(BASE_URL)
    soup = BeautifulSoup(response.content, 'html.parser')
    for frame in soup.select("frameset frame"):
        frame_url = urljoin(BASE_URL, frame["src"])
        response = session.get(frame_url)
        frame_soup = BeautifulSoup(response.content, 'html.parser')
        print(frame_soup.prettify())
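The frame-discovery part of that answer can be checked without any network access, using a synthetic frameset page (the file names are made up):

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Synthetic frameset page; only the parsing and URL-resolution steps
# are shown, no requests are made.
BASE_URL = "http://123.123.123.123/"
html = """
<frameset cols="50%,50%">
  <frame src="left.html">
  <frame src="menu/right.html">
</frameset>
"""
soup = BeautifulSoup(html, "html.parser")
frame_urls = [urljoin(BASE_URL, f["src"]) for f in soup.select("frameset frame")]
print(frame_urls)
```

Each resolved URL would then be fetched and parsed exactly like the top-level page, which is what makes the "#document" contents visible.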