BeautifulSoup html -- load from memory? - python

I'm using BeautifulSoup in Python 3.5 to parse HTML. While I can load it from a file, I need to load it from memory because I get it from an HTTP request. I've googled but found nothing about loading HTML into BeautifulSoup from memory. Is it possible?

If you are using version 4 of BeautifulSoup, try passing the response text to it:
from bs4 import BeautifulSoup
import requests
# replace the following URL
response = requests.get("https://www.python.org")
soup = BeautifulSoup(response.text,"html.parser")

from bs4 import BeautifulSoup
import requests
data = requests.get('https://google.com').text
soup = BeautifulSoup(data, 'html.parser')
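More generally, BeautifulSoup parses any string or bytes object already in memory; no file is involved. A minimal sketch (the HTML literal here is just an illustration):
from bs4 import BeautifulSoup
# any HTML you already hold in memory works, e.g. a plain string
html = "<html><body><p>Hello from memory</p></body></html>"
soup = BeautifulSoup(html, "html.parser")
print(soup.p.text)  # -> Hello from memory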

How to get the url from extracted information from a website

So basically I am stuck on the problem where I don't know how to get the URL from the data extracted from a website.
Here is my code:
import requests
from bs4 import BeautifulSoup
req = requests.get('https://api.randomtube.xyz/video.get?chan=2ch.hk&board=b&page=1')
soup = BeautifulSoup(req.content, "html.parser")
print(soup.prettify())
I get a lot of information in the output, but the only thing I need is the URL. I hope someone can help me.
P.S:
It gives me this information:
{"response":{"items":[{"url":"https:\/\/2ch.hk\/b\/src\/262671212\/16440825183970.webm","type":"video\/webm","filesize":"20259","width":1280,"height":720,"name":"1521967932778.webm","board":"b","thread":"262671212"},{"url":"https:\/\/2ch.hk\/b\/src\/261549765\/16424501976450.webm","type":"video\/webm","filesize":"12055","width":1280,"height":720,"name":"1526793203110.webm","board":"b","thread":"261549765"}...
But I only need this part out of all of that:
https:\/\/2ch.hk\/b\/src\/261549765\/16424501976450.webm (Not exactly this url, but just as an example)
You can do it this way, after parsing the response as JSON rather than HTML (soup is a BeautifulSoup object and cannot be indexed like a dict):
data = req.json()
url_array = []
for item in data['response']['items']:
    url_array.append(item['url'])
I guess if the API returns JSON data then it should be better to just parse it directly.
The URL returns JSON data. BeautifulSoup can't parse JSON; to grab the JSON data, you can follow the next example.
import requests
data = requests.get('https://api.randomtube.xyz/video.get?chan=2ch.hk&board=b&page=1').json()
url = data['response']['items'][0]['url']
if url:
    url = url.replace('.webm', '.mp4')
print(url)
Output:
https://2ch.hk/b/src/263361969/16451225633240.mp4
The problem is that you are telling BeautifulSoup to parse JSON data as HTML. You can get the URL you need more directly with the following code:
import json
import requests
req = requests.get('https://api.randomtube.xyz/video.get?chan=2ch.hk&board=b&page=1')
data = json.loads(req.content)
my_url = data['response']['items'][0]['url']
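If you need every URL rather than just the first item, a list comprehension over the same parsed structure (the data variable above) works:
all_urls = [item['url'] for item in data['response']['items']]
print(all_urls)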

Web Scraping with Beautiful Soup Python - Seesaw - The output has no length error

I am trying to get the comments from a website called Seesaw but the output has no length. What am I doing wrong?
import requests
from bs4 import BeautifulSoup
html_text = requests.get("https://app.seesaw.me/#/activities/class/class.93a29acf-0eef-4d4e-9d56-9648d2623171").text
soup = BeautifulSoup(html_text, "lxml")
comments = soup.find_all("span", class_="ng-binding")
print(comments)
Because there is no span element with class ng-binding in the page source (these elements are added later via JavaScript):
import requests
html_text = requests.get("https://app.seesaw.me/#/activities/class/class.93a29acf-0eef-4d4e-9d56-9648d2623171").text
print(f'{"ng-binding" in html_text=}')
So the output is:
"ng-binding" in html_text=False
You can also check this using the "View Page Source" function in your browser. You can try to use Selenium to automate interaction with the site.
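A minimal Selenium sketch of that idea might look like the following (untested against Seesaw, which likely also requires a login; the span.ng-binding selector is taken from the question):
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()  # assumes a ChromeDriver install
driver.get("https://app.seesaw.me/#/activities/class/class.93a29acf-0eef-4d4e-9d56-9648d2623171")
# wait until the JavaScript-rendered elements actually exist
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "span.ng-binding"))
)
for comment in driver.find_elements(By.CSS_SELECTOR, "span.ng-binding"):
    print(comment.text)
driver.quit()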

bs4 object of type 'Response' has no len()

I've been trying to get this to work, but keep getting the same TypeError: object of type 'Response' has no len(). The BeautifulSoup documentation hasn't been any help. This seems to work in every tutorial I watch and read, but not for me. What am I doing wrong?
import requests
from bs4 import BeautifulSoup
http = requests.get("https://www.imdb.com/title/tt0366627/?ref_=nv_sr_1")
print(http)
This returns <Response [200]>, but if I try to add soup... I get the len error:
import requests
from bs4 import BeautifulSoup
http = requests.get("https://www.imdb.com/title/tt0366627/?ref_=nv_sr_1")
soup = BeautifulSoup(http, 'lxml')
print(soup)
As the docs say:
To parse a document, pass it into the BeautifulSoup constructor. You can pass in a string or an open filehandle:
A Response object is neither a string nor an open filehandle.
The simplest way to get one of the two, as shown in the first example in the requests docs, is the .text attribute. So:
http = requests.get("https://www.imdb.com/title/tt6738136/?ref_=inth_ov_tt")
soup = BeautifulSoup(http.text, 'lxml')
For other options see Response Content. For example, you can get the bytes with .content to let BeautifulSoup guess at the encoding instead of reading it from the headers, or get the socket (which is an open filehandle) with .raw.
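A short sketch of those options side by side (note that stream=True is required for .raw to remain readable):
import requests
from bs4 import BeautifulSoup

url = "https://www.imdb.com/title/tt0366627/?ref_=nv_sr_1"
soup_from_text = BeautifulSoup(requests.get(url).text, 'lxml')     # decoded string
soup_from_bytes = BeautifulSoup(requests.get(url).content, 'lxml') # raw bytes; BeautifulSoup guesses the encoding
soup_from_raw = BeautifulSoup(requests.get(url, stream=True).raw, 'lxml')  # open filehandle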
My final code. It just prints out the title, year and summary, which was all I wanted. Thanks to all of you for your help.
import requests
from bs4 import BeautifulSoup
http = requests.get("https://www.imdb.com/title/tt0366627/?ref_=nv_sr_1")
soup = BeautifulSoup(http.content, 'lxml')
title = soup.find("div", class_="title_wrapper").find()
summary = soup.find(class_="summary_text")
print(title.text)
print(summary.text)
The <Response [200]> that you are getting from the following code:
import requests
from bs4 import BeautifulSoup
http = requests.get("https://www.imdb.com/title/tt6738136/?ref_=inth_ov_tt")
print(http)
shows that your request succeeded and returned a response. To parse the HTML there are two ways:
Print the text/string format directly:
import requests
from bs4 import BeautifulSoup
http = requests.get("https://www.imdb.com/title/tt6738136/?ref_=inth_ov_tt")
print(http.text)
Use an HTML parser:
import requests
from bs4 import BeautifulSoup
http = requests.get("https://www.imdb.com/title/tt6738136/?ref_=inth_ov_tt")
soup = BeautifulSoup(http.text, 'lxml')
print(soup)
It is better to use BeautifulSoup, since it lets you extract the required data from the HTML in case you need it.

Accessing a website in python

I am trying to get all the urls on a website using python. At the moment I am just copying the websites html into the python program and then using code to extract all the urls. Is there a way I could do this straight from the web without having to copy the entire html?
In Python 2, you can use urllib2.urlopen:
import urllib2
response = urllib2.urlopen('http://python.org/')
html = response.read()
In Python 3, you can use urllib.request.urlopen:
import urllib.request
with urllib.request.urlopen('http://python.org/') as response:
    html = response.read()
If you have to perform more complicated tasks like authentication or passing parameters, I suggest having a look at the requests library, as sketched below.
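A minimal requests sketch of both (the URL and credentials are placeholders):
import requests

response = requests.get(
    "https://httpbin.org/get",   # placeholder URL
    params={"q": "python"},      # query parameters appended to the URL
    auth=("user", "passwd"),     # HTTP basic authentication
    timeout=10,
)
print(response.status_code)
print(response.text)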
The most straightforward would probably be urllib.urlopen if you're using Python 2, or urllib.request.urlopen if you're using Python 3 (you have to do import urllib or import urllib.request first, of course). That way you get a file-like object from which you can read (i.e. f.read()) the HTML document.
Example for Python 2:
import urllib
f = urllib.urlopen("http://stackoverflow.com")
http_document = f.read()
f.close()
The good news is that you seem to have done the hard part which is analyzing the html document for links.
You might want to use the bs4 (BeautifulSoup) library.
Beautiful Soup is a Python library for pulling data out of HTML and XML files.
You can install bs4 with the following command at the command line: pip install beautifulsoup4
import urllib2
import urlparse
from bs4 import BeautifulSoup
url = "http://www.google.com"
response = urllib2.urlopen(url)
content = response.read()
soup = BeautifulSoup(content, "html.parser")
for link in soup.find_all('a', href=True):
    print urlparse.urljoin(url, link['href'])
You can simply use the combination of requests and BeautifulSoup.
First make an HTTP request using requests to get the HTML content. You will get it as a Python string, which you can manipulate as you like.
Take the HTML content string and feed it into BeautifulSoup, which does all the work of building the DOM, and collect all URLs, i.e. <a> elements.
Here is an example of how to fetch all links from StackOverflow:
import requests
from bs4 import BeautifulSoup, SoupStrainer
response = requests.get('http://stackoverflow.com')
html_str = response.text
bs = BeautifulSoup(html_str, 'html.parser', parse_only=SoupStrainer('a'))
for a_element in bs:
    if a_element.has_attr('href'):
        print(a_element['href'])
Sample output:
/questions/tagged/facebook-javascript-sdk
/questions/31743507/facebook-app-request-dialog-keep-loading-on-mobile-after-fb-login-called
/users/3545752/user3545752
/questions/31743506/get-nuspec-file-for-existing-nuget-package
/questions/tagged/nuget
...

Using urllib and BeautifulSoup to retrieve info from web with Python

I can get the HTML page using urllib, and use BeautifulSoup to parse it, but it looks like I have to generate a file for BeautifulSoup to read.
import urllib
sock = urllib.urlopen("http://SOMEWHERE")
htmlSource = sock.read()
sock.close()
--> write to file
Is there a way to call BeautifulSoup without generating file from urllib?
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(htmlSource)
No file writing needed: just pass in the HTML string. You can also pass the object returned from urlopen directly:
f = urllib.urlopen("http://SOMEWHERE")
soup = BeautifulSoup(f)
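The same idea in Python 3 with bs4 would be a sketch like this (http://SOMEWHERE is the placeholder from the question):
from urllib.request import urlopen
from bs4 import BeautifulSoup

with urlopen("http://SOMEWHERE") as f:  # urlopen returns an open filehandle
    soup = BeautifulSoup(f, "html.parser")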
You could open the url, download the html, and make it parse-able in one shot with gazpacho:
from gazpacho import Soup
soup = Soup.get("https://www.example.com/")
