Getting the next UL element using BeautifulSoup - python

I'm trying to find the next ul element in a given webpage.
I start by passing my response to Beautiful Soup like so:
soup = BeautifulSoup(response.context)
Printing response.context gives the following:
print(response.context)
<!DOCTYPE html>
<html>
<head>
<title> | FollowUp</title>
<meta name='viewport' content='width=device-width, initial-scale=1.0'>
<link href='/static/css/bootstrap.min.css' rel='stylesheet' media='screen'>
</head>
<body>
<div class='navbar'>
<div class='navbar-inner'>
<a class='brand' href='/'>TellMe.cat</a>
<ul class='nav'>
<li><a href='list'>My Stories</a></li>
<li><a href='add'>Add Story</a></li>
<li><a href='respond'>Add Update</a></li>
</ul>
<form class='navbar-form pull-right' action='process_logout' method='post'>
<input type='hidden' name='csrfmiddlewaretoken' value='RxquwEsaS5Bn1MsKOIJP8uLtRZ9yDusH' />
Hello add!
<button class='btn btn-small'>Logout</button>
</form>
</div>
</div>
<div class='container'>
<ul id='items'>
<ul>
<li><a href='http://www.example.org'>http://www.example.org</a></li>
<ul>
<p>There have been no follow ups.</p>
</ul>
</ul>
</ul>
</div>
<script src="//ajax.googleapis.com/ajax/libs/jquery/1.9.1/jquery.min.js"></script>
<script src='/static/js/bootstrap.min.js'></script>
</body>
</html>
I'm trying to get the ul that's named 'items'. I do so with:
items = soup.find(id='items')
This gives me the correct ul and all of its children. However, calling
items.find_next('ul')
gives the error
TypeError: 'NoneType' object is not callable
even though this seems to be how it's supposed to be called according to the Beautiful Soup docs: https://beautiful-soup-4.readthedocs.org/en/latest/#find-all-next-and-find-next
What am I doing incorrectly?

Make a virtualenv, run pip install BeautifulSoup requests, and open a Python console.
import BeautifulSoup
import requests
html = requests.get("http://yahoo.com").text
b = BeautifulSoup.BeautifulSoup(html)
m = b.find(id='masthead')
item = m.findNext('ul')
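dir(m) lists the attributes and methods on m, and there you can see that the method you want is findNext. That camelCase name is the Beautiful Soup 3 spelling; the find_next in the linked docs belongs to Beautiful Soup 4, which is installed as beautifulsoup4 and imported from bs4.
You might also find IPython a more forgiving shell to run Python in. You can type the name of a variable, add a dot, and hit Tab to see its members.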
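A likely explanation for the TypeError, sketched with a toy class rather than the real library: Beautiful Soup treats an unknown attribute on a tag as a request for a child tag of that name, and returns None when no such child exists. If the installed version has no find_next method, items.find_next evaluates to None, and calling None fails. ToyTag and everything in it are hypothetical and only mimic that lookup behaviour.

```python
class ToyTag:
    """Hypothetical stand-in for a parsed tag; not the real BeautifulSoup class."""

    def __init__(self, children=None):
        self.children = children or {}

    def find(self, name):
        # Return the child tag with this name, or None if there is none
        return self.children.get(name)

    def findNext(self, name):
        # The method that does exist under the camelCase name
        return f"<{name}>"

    def __getattr__(self, name):
        # Runs only when normal attribute lookup fails:
        # the unknown name is treated as a child tag name
        return self.find(name)


items = ToyTag()

try:
    items.find_next("ul")  # items.find_next is None, so this calls None("ul")
    error = None
except TypeError as exc:
    error = str(exc)

print(error)
print(items.findNext("ul"))
```

Running dir(items) on the toy object lists findNext but not find_next, which is the same check the answer suggests for the real object.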

Call python's object method from flask jinja2 html file

I am trying to create a YouTube video downloader application using pytube and Flask. Everything is done, except that I want to call pytube's stream download method from within the HTML script tag. How can I do it?
Here's my Flask code:
from flask import Flask, render_template, request
from pytube import YouTube
app = Flask(__name__)
@app.route("/")
def index():
    return render_template("index.html", data=None)
@app.route("/download", methods=["POST", "GET"])
def downloadVideo():
    if request.method == "POST":
        url = request.form["videourl"]
        if url:
            yt = YouTube(url)
            title = yt.title
            thumbnail = yt.thumbnail_url
            # Keep only the MP4 streams
            streams = yt.streams.filter(file_extension='mp4')
            data = [title, thumbnail, streams, yt]
            return render_template("index.html", data=data)
    # Fall back to the empty form for GET requests or a blank URL
    return render_template("index.html", data=None)
if __name__ == "__main__":
    app.run(debug=True)
And here's my HTML code:
<!DOCTYPE html>
<html>
<head>
<title> Youtube Downloader </title>
<meta name="viewport" content="width=device-width,initial-scale=1.0">
<link rel="stylesheet" href="static/css/style.css">
</head>
<body>
<div class="main">
<div class="header">
<div>
<img src="static/img/icon.png" width="48" height="48">
<h2> Youtube Downloader </h2>
</div>
<div>
<p> Convert and download youtube videos </p>
<p> in MP4 for free </p>
</div>
</div>
{% if not data %}
<div class="dform">
<form action="http://127.0.0.1:5000/download", method="POST">
<div class="inputfield">
<input type="input" name="videourl" placeholder="Search or Paste youtube link here" autocomplete="off">
<button type="submit"> Download </button>
</div>
</form>
</div>
{% else %}
<div class="videoinfo">
<img src="" class="thumbnail">
<h2> {{data[0]}} </h2>
</div>
<div class="quality">
<select id="streams">
{% for stream in data[2][:3] %}
<option value="{{stream.itag}}"> {{stream.resolution}} </option>
{% endfor %}
</select>
</div>
{% endif %}
</div>
<script type="text/javascript">
const image = document.querySelector(".thumbnail");
const select = document.querySelector("select");
let url = `{{data[1]}}`;
if (image) {
image.src = `${url}`;
window.addEventListener('change', function() {
var option = select.options[select.selectedIndex].value;
console.log(option);
{% set stream = data[3].get_by_itag(option) %}
{% stream.download() %}
});
}
</script>
</body>
</html>
I am trying to download the video using itag when a user clicks an option in the select element by using pytube get_by_itag() method.
From what I understand, you want to do two things: create a route in your Flask app that serves the YouTube video for a given itag, and call that route from JavaScript. The Jinja tags inside your script block cannot do this, because the template is rendered on the server before the page reaches the browser, so get_by_itag(option) never sees the value the user selects.
This answer shows how to create a route to download the video.
To call a URL that starts a file download from JavaScript, you'll need to use the fetch method and open that link in an iframe. This answer covers it.
Let me know if that covers your question.

Request.POST.get not working for me in django, returning default value

I am trying to get input from an HTML form in Django. Python code below:
def add(request):
    n = request.POST.get('Username', 'Did not work')
    i = Item(name=n, price=0)
    i.save()
    return render(request, 'tarkovdb/test.html')
Here is my HTML code:
<html>
<head>
<meta charset="UTF-8">
<link rel="stylesheet" href="https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css" integrity="sha384-Vkoo8x4CGs0 OPaXtkKtu6ug5T0eNV6gBiFeWPGFN9Muh0f23Q9Ifjh" crossorigin="anonymous">
<title>Tarkov Database Web App</title>
</head>
<body>
<h1>This is the page to add items</h1>
<li>List of Items in DataBase</li>
<form>
<div class="form-group">
<label for='username'> Username: </label>
<input type='text' name='Username' id='username'> <br><br>
</div>
<button type="submit" class="btn btn-primary">Submit</button>
</form>
You need to set your method attribute to "post" on your HTML form tag. Like this:
<form method="post">
<!-- your fields here -->
</form>
Otherwise you'll be sending a GET request, which is the default value of the method attribute.
P.S.: Please paste your code as text to make it easy for the community to help you. Otherwise you'll get downvoted.
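To see why the default value comes back, it may help to mimic the request with a plain object. The sketch below does not use Django: the objects built with SimpleNamespace are hypothetical stand-ins for HttpRequest, on the assumption that a form submitted with GET puts its fields in request.GET and leaves request.POST empty.

```python
from types import SimpleNamespace


def read_username(request):
    """Same lookup as the add() view, applied to a request-like object."""
    return request.POST.get('Username', 'Did not work')


# Hypothetical stand-ins for Django's HttpRequest (not the real class).
# A <form> without method="post" submits with GET: the field lands in GET.
get_request = SimpleNamespace(method='GET', GET={'Username': 'alice'}, POST={})
# With method="post" the same field arrives in POST instead.
post_request = SimpleNamespace(method='POST', GET={}, POST={'Username': 'alice'})

print(read_username(get_request))   # falls back to the default
print(read_username(post_request))  # finds the submitted value
```

In a real Django template, a POST form normally also needs {% csrf_token %} inside the form tag while the CSRF middleware is enabled.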

How can i get a clean script embedded in an html page with python standard library

I'm trying to back up some code of mine from the SoloLearn site. I could copy and paste it, of course, but because I would like to repeat this for other codes, and also for learning purposes, I'd like to do it with Python code, using only the standard library if possible.
Here is my most basic attempt. I have also been struggling with HTMLParser, html.entities and xml.etree, and I've tried decoding the response as "utf-8" and passing it through html.unescape(). The result is always dirty.
This kind of dirty: \u003c!DOCTYPE html\u003e\r\n\u003chtml\u003e\r\n\u003c!--\r\
Sometimes less, but never clean.
from urllib.request import urlopen
import re
url = "https://code.sololearn.com/************/#"
with urlopen(url) as response:
    page = str(response.read())
code = re.search(r'window.code = "(.*)";.*window.cssCode',page).group(1)
print(code)
The goal is to back up my files by writing them to disk in a clean, functional form. The codes can be HTML+CSS+JS, Python, C, etc. I also tried cleaning the dirty results with regex substitutions, but I think that's impossible, because the codes may intentionally contain elements like "\r\n" that should not be modified.
It seems you have a JSON-encoded string. You can use ast.literal_eval() to decode it:
from ast import literal_eval
from urllib.request import urlopen
import re
url = "https://code.sololearn.com/************/#"
with urlopen(url) as response:
    page = response.read().decode('utf-8')
code = re.search(r'window.code = "(.*)";.*window.cssCode',page, flags=re.DOTALL).group(1)
print(literal_eval('"' + code + '"'))
Prints:
<!DOCTYPE html>
<html>
<!--
If you're interested in the tools used here:
to display a partition:
http://www.vexflow.com/
to make it sound:
https://tonejs.github.io/
-->
<head>
<link href="https://fonts.googleapis.com/css?family=Annie+Use+Your+Telescope&display=swap" rel="stylesheet">
<script src="https://cdnjs.cloudflare.com/ajax/libs/tone/13.8.12/Tone.js"></script>
<script src="https://unpkg.com/vexflow/releases/vexflow-min.js"></script>
<title>Melody Generator</title>
</head>
<body>
<div id="wrapper">
<div id="popup">
<div id="description">description gonna be here</div>
<div id="choice"></div>
</div>
<div id="input" class="blur">
<div id="melody">
<h1>Melody</h1>
<textarea id="melo_num" class="text_input" placeholder="Enter two words..."></textarea>
<p id="melo_rebased"></p>
</div>
<div id="rhythm">
<h1>Rhythm</h1>
<textarea id="rhyt_num" class="text_input" placeholder="...hear some magic !"></textarea>
<p id="rhyt_rebased"></p>
</div>
</div>
<div id="partition" class="blur"></div>
<div id="controls" class="blur">
<div id="back" class="control">back</div>
<div id="play" class="control">play</div>
<div id="stop" class="control">stop</div>
</div>
<div id="current" class="blur"></div>
<p></p>
<div id="settings" class="blur">
<div id="loop" class="blur">loop
<div class="twinkle lamp" id="loop_lamp"></div>
</div>
<div id="root" class="blur">root
<div class="lamp" id="root_lamp"></div>
</div>
<div id="mode" class="blur">mode
<div class="lamp" id="mode_lamp"></div>
</div>
<div id="range" class="blur">range
<div class="lamp" id="range_lamp"></div>
</div>
<div id="rhythm" class="blur">rhythm
<div class="lamp" id="rhythm_lamp"></div>
</div>
<div id="convert" class="blur">convert
<div class="lamp" id="convert_lamp"></div>
</div>
<div id="volume" class="blur slider_box">
volume
<input id="sound_vol" class="slider" type="range" min="-50" max="0" value="-10">
</div>
<div id="speed" class="blur slider_box">
speed
<input id="speed_lvl" class="slider" type="range" min="0" max="200" value="100">
</div>
<div id="sustain" class="blur slider_box">
sustain
<input id="sustain_lvl" class="slider" type="range" min="0" max="200" value="100">
</div>
<div id="demo" class="blur">demo
<div class="lamp" id="demo_lamp"></div>
</div>
</div>
<p></p>
</div>
</body>
</html>
Or use json.loads():
import json
print(json.loads('"' + code + '"'))
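A small offline check of the json.loads() approach, using a shortened sample in the style of the dirty output quoted in the question. The sample string is made up for illustration; it also includes \" and \/, two escapes that JSON defines and json.loads() decodes.

```python
import json

# Made-up sample in the style of the escaped page source from the question.
# The raw string keeps the backslashes literal, as they appear in the download.
code = r'\u003c!DOCTYPE html\u003e\r\n\u003cscript\u003evar s = \"a\";\u003c\/script\u003e'

# Wrap the captured text in double quotes so it is a complete JSON string
decoded = json.loads('"' + code + '"')
print(decoded)
```

The decoded text contains a real carriage return and newline where the source had \r\n, so it can be written to a file as-is.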

Scrapy Scrape element within unknown number of <div>

I am trying to scrape a list of websites on Shopee. Some examples include dudesgadget and 2ubest. Each of these Shopee shops has a different design, a different way of constructing its web elements, and a different domain as well. They look like standalone websites, but they actually are not.
The main problem is that I am trying to scrape the product details. Here is a summary of some of the different structures:
2ubest
<html>
<body>
<div id="shopify-section-announcement-bar" id="shopify-section-announcement-bar">
<main class="wrapper main-content" role="main">
<div class="grid">
<div class="grid__item">
<div id="shopify-section-product-template" class="shopify-section">
<script id="ProductJson-product-template" type="application/json">
//Things I am looking for
</script>
</div>
</div>
</div>
</main>
</div>
</body>
</html>
littleplayland
<html>
<body id="adjustable-ergonomic-laptop-stand" class="template-product">
<script>
//Things I am looking for
</script>
</body>
</html>
There are a few others, and I have discovered a pattern among them:
The thing I am looking for is always inside <body>.
The thing I am looking for is within a <script>.
The only thing I am not sure about is the distance from <body> to the <script>.
My solution is:
def parse(self, response):
    body = response.xpath("//body")
    for script in body.xpath("//script/text()").extract():
        # Manipulate the script with js2xml here
        pass
I am able to extract the data for littleplayland, dailysteals and many others where the <script> is close to <body>, but it does not work for 2ubest, which has many other HTML elements between <body> and the thing I am looking for. Is there a solution that ignores all the HTML elements in between and looks only for the <script> tag?
I need a single, generic solution that works across all Shopee websites if possible, since all of them share the characteristics mentioned above.
This means the solution should not filter on <div>, because every website has a different number of <div> elements.
This is how to get the scripts in your HTML using Scrapy:
scriptTagSelector = scrapy.Selector(text=text, type="html")
theScripts = scriptTagSelector.xpath("//script/text()").extract()
for script in theScripts:
    # Manipulate the script with js2xml here
    print("------->A SCRIPT STARTS HERE<--------")
    print(script)
    print("------->A SCRIPT ENDS HERE<--------")
Here is an example with the HTML in your question (I added an extra script :) ):
import scrapy
text="""<html>
<body>
<div id="shopify-section-announcement-bar" id="shopify-section-announcement-bar">
<main class="wrapper main-content" role="main">
<div class="grid">
<div class="grid__item">
<div id="shopify-section-product-template" class="shopify-section">
<script id="ProductJson-product-template" type="application/json">
//Things I am looking for
</script>
</div>
<script id="script 2">I am another script</script>
</div>
</div>
</main>
</div>
</body>
</html>"""
scriptTagSelector = scrapy.Selector(text=text, type="html")
theScripts = scriptTagSelector.xpath("//script/text()").extract()
for script in theScripts:
    # Manipulate the script with js2xml here
    print("------->A SCRIPT STARTS HERE<--------")
    print(script)
    print("------->A SCRIPT ENDS HERE<--------")
Try this:
//body//script/text()
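The // step matches descendants at any depth, so the number of elements between <body> and <script> does not matter. Here is a minimal sketch of that idea using the standard library's xml.etree.ElementTree as a stand-in for Scrapy's selectors; the snippet is a made-up, well-formed document, and ElementTree spells the relative descendant search as .//script.

```python
import xml.etree.ElementTree as ET

# Made-up document: one script nested four levels deep, one directly under body
doc = """<html><body>
  <div><main><div><div>
    <script id="deep">deep script</script>
  </div></div></main></div>
  <script id="shallow">shallow script</script>
</body></html>"""

root = ET.fromstring(doc)
body = root.find("body")

# ".//script" means: every script element below body, at any depth
scripts = [script.text for script in body.findall(".//script")]
print(scripts)
```

In Scrapy the same relative search from an already selected body would be written body.xpath(".//script/text()"); the leading dot keeps the query anchored at that node instead of the document root.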

Using BeautifulSoup to extract specific nested div

I have this HTML code which I'm creating the script for:
http://imgur.com/a/dPNYI
I would like to extract the highlighted text ("some text") and print it.
I tried going through every nested div in the way to the div I needed, like this:
import requests
from bs4 import BeautifulSoup
url = "the url this is from"
r = requests.get(url)
soup = BeautifulSoup(r.text, "html.parser")  # parse the downloaded page
for div in soup.find_all("div", {"id": "main"}):
    for div2 in div.find_all("div", {"id": "app"}):
        for div3 in div2.find_all("div", {"id": "right-sidebar"}):
            for div4 in div3.find_all("div", {"id": "chat"}):
                for div5 in div4.find_all("div", {"id": "chat-messages"}):
                    for div6 in div5.find_all("div", {"class": "chat-message"}):
                        for div7 in div6.find_all("div", {"class": "chat-message-content selectable"}):
                            print(div7.text.strip())
I implemented what I've seen in guides and similar questions online, but I bet this is not even close and there must be a much easier way. This doesn't work: it doesn't print anything, and I'm a bit lost. How can I print the highlighted line (which is essentially the very first div child of the div with the id "chat-messages")?
HTML CODE:
<!DOCTYPE html>
<html>
<head>
<title>
</title>
</head>
<body>
<div id="main">
<div data-reactroot="" id="app">
<div class="top-bar-authenticated" id="top-bar">
</div>
<div class="closed" id="navigation-bar">
</div>
<div id="right-sidebar">
<div id="chat">
<div id="chat-head">
</div>
<div id="chat-title">
</div>
<div id="chat-messages">
<div class="chat-message">
<div class="chat-message-avatar" style="background-image: url("https://steamcdn-a.akamaihd.net/steamcommunity/public/images/avatars/65/657dcec97cc00bc378629930ecae1776c0d981e0.jpg");">
</div>
<a class="chat-message-username clickable">
<div class="iron-color">
aloe
</div></a>
<div class="chat-message-content selectable">
<!-- react-text: 2532 -->some text<!-- /react-text -->
</div>
</div>
<div class="chat-message">
</div>
<div class="chat-message">
</div>
<div class="chat-message">
</div>
<div class="chat-message">
</div>
<div class="chat-message">
</div>
With the lxml parser (i.e. soup = BeautifulSoup(data, 'lxml')), you can pass .find a multi-class string just as easily as a single class to find nested divs:
soup.find('div',{'class':'chat-message-content selectable'}).text
The line above should work for you as long as that is the only occurrence of the class in the HTML. Note that a multi-class string is matched against the exact value of the class attribute, so the classes must appear in the same order as in the page.
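If installing Beautiful Soup or lxml is not an option, roughly the same lookup can be sketched with the standard library's html.parser instead. This is a different technique from the .find call above, not a drop-in equivalent: FirstMessageParser is a hypothetical helper, and the HTML is a trimmed copy of the snippet from the question.

```python
from html.parser import HTMLParser


class FirstMessageParser(HTMLParser):
    """Collects the text of the first div whose classes include 'chat-message-content'."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # div nesting depth inside the target div (0 = outside)
        self.done = False   # set once the first matching div has been closed
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag != "div" or self.done:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth:
            self.depth += 1          # nested div inside the target
        elif "chat-message-content" in classes:
            self.depth = 1           # entering the target div

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.done = True     # ignore any later matches

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)


html = '''<div id="chat-messages">
<div class="chat-message">
<a class="chat-message-username clickable"><div class="iron-color">aloe</div></a>
<div class="chat-message-content selectable">
<!-- react-text: 2532 -->some text<!-- /react-text -->
</div>
</div>
</div>'''

parser = FirstMessageParser()
parser.feed(html)
text = "".join(parser.parts).strip()
print(text)
```

Comments such as the react-text markers are skipped because the parser's default comment handler does nothing, so only the text nodes inside the target div are collected.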
