Input to a web page and retrieve the output - python

I am trying to enter a password into the "howsecureismypassword" website and retrieve the output using web scraping. Is this possible? Here is the HTML from the web page, and below is the code I have so far. Any help would be appreciated.
<div role="main">
<div class="input">
<input type="password" id="password" ng-model="password" ng-change="passwordChange()" placeholder="Enter Password" class="password" autofocus>
</div>
<div class="phishing" ng-hide="password">
<p>This site could be stealing your password... it's not, but it easily <em>could</em> be.<br />Be careful where you type your password.</p>
</div>
<div ng-show="password">
<ul ng-show="display.config">
<li>
<label><input type="checkbox" ng-model="config.namedNumbers" ng-change="config.changeNamedNumbers()" />Use Named Numbers</label>
</li>
<li>
<label>Calculations per second</label>
<input type="text" ng-model="config.calculations" ng-change="config.changeCalculations()" />
</li>
</ul>
<div class="results">
<span class="toggle" ng-click="display.toggleConfig()">{{display.configText}}</span>
<p ng-hide="insecure">It would take <span ng-show="config.calculationsOriginal">a desktop PC</span> about <span class="main">{{time}}</span> to crack your password</p>
<a class="tweet-me" ng-hide="insecure" href="http://twitter.com/home/?status=It would take a desktop PC about {{time}} to crack my password!%0d%0dhttp://hsim.pw">[Tweet Result]</a>
<p ng-show="insecure">Your password would be cracked almost <span class="main">Instantly</span></p>
<a class="tweet-me" ng-show="insecure" href="http://twitter.com/home/?status=My password would be cracked almost instantly!%0d%0dhttp://hsim.pw">[Tweet Result]</a>
<span class="toggle" ng-click="display.toggleDetails()">{{display.detailsText}}</span>
</div>
# Program to access a web page using httplib2
from httplib2 import Http
from urllib.parse import urlencode

# Create a web object
h = Http()
# Set the url of the webpage
url = 'https://howsecureismypassword.net/'
password = input('enter')
# Create a data dictionary
data = {'placeholder': password}
# Encode the dictionary
web_data = urlencode(data)
# Send the POST request to the site
response, content = h.request(url, 'POST', web_data)
if response.status == 200:
    # Display the contents of the web page returned
    text_content = content.decode()
    print('Contents:')
    print(text_content)
    # Turn it into "Beautiful Soup"
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(content, 'html.parser')
    print(soup.get_text())
else:
    print('Error accessing web page')

I would use splinter for this. BeautifulSoup is good for scraping, not for interacting with pages. Here is the example straight from the splinter docs:
from splinter import Browser

with Browser() as browser:
    # Visit URL
    url = "http://www.google.com"
    browser.visit(url)
    browser.fill('q', 'splinter - python acceptance testing for web applications')
    # Find and click the 'search' button
    button = browser.find_by_name('btnG')
    # Interact with elements
    button.click()
    if browser.is_text_present('splinter.cobrateam.info'):
        print("Yes, the official website was found!")
    else:
        print("No, it wasn't found... We need to improve our SEO techniques")
You can adapt something like that to suit your needs.
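For instance, adapted to the password site, a rough sketch could look like this (the id="password" input and the <span class="main"> result element come from the HTML posted above; the one-second wait is an assumption, since the Angular app re-renders the estimate after each keystroke):
import time
from splinter import Browser

with Browser() as browser:  # defaults to Firefox; needs its webdriver installed
    browser.visit('https://howsecureismypassword.net/')
    # Fill the password field from the question's HTML (id="password")
    browser.find_by_id('password').fill('correct horse battery staple')
    time.sleep(1)  # give the Angular app a moment to render the estimate
    # The crack-time estimate is rendered in <span class="main">{{time}}</span>
    result = browser.find_by_css('.results .main')
    if result:
        print(result.first.text)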

Parsing website with BeautifulSoup and Requests returns None

I'm a beginner in programming altogether and am working on a project of mine. For it, I'm trying to parse data from a website to make a tool that uses the data. I found that BeautifulSoup and Requests are common tools to do it, but unfortunately I cannot seem to make it work. It always returns the value None, or an error that says:
"TypeError: 'NoneType' object is not callable"
Did I do anything wrong? Is it maybe not possible to parse some websites' data, or am I being restricted from access?
If there are other ways to access the data, I'm happy to hear them as well.
Here is my code:
from bs4 import BeautifulSoup
import requests
pickrates = {} # dict to store winrate of champions for each position
source = requests.get("http://u.gg/lol/champions/aatrox/build?role=top").text
soup = BeautifulSoup(source, "lxml")
value = soup.find("div", class_="content-section champion-ranking-stats")
print(value.prettify())
Remember that when you request a webpage with the requests module, you only get the HTML of that page; the module is not capable of rendering JavaScript.
Try this code:
import requests
source = requests.get("http://u.gg/lol/champions/aatrox/build?role=top").text
print(source)
Then search by hand (Ctrl+F) for the class names you provided: there are no such elements at all. That means they are generated by other requests, like AJAX calls, and are created after the initial HTML page is loaded. So before Beautiful Soup comes to the party, you can't get them, even in the .text attribute of the response object.
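A quick programmatic version of that check (a small sketch; the URL and class name are the ones from your code):
import requests

source = requests.get("http://u.gg/lol/champions/aatrox/build?role=top").text
# The class never appears in the raw HTML, because it is injected later by JS
print("champion-ranking-stats" in source)  # expected: False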
One way of doing it is to use Selenium or any other library that handles the JS.
As in this question (can't find html tag when I scrape web using beautifulsoup), the problem is likely caused by a JavaScript event listener. I would suggest you use Selenium to handle this issue: let Selenium send the request and get back the page source, then use BeautifulSoup to parse it.
Don't forget to download a browser driver from https://www.selenium.dev/documentation/getting_started/installing_browser_drivers/ and place it in the same directory as your code.
The example of code below is using selenium with Firefox:
import time

from selenium import webdriver
from bs4 import BeautifulSoup

URL = 'http://u.gg/lol/champions/aatrox/build?role=top'
browser = webdriver.Firefox()
browser.get(URL)
time.sleep(1)  # give the page's JavaScript time to render
soup = BeautifulSoup(browser.page_source, 'html.parser')
browser.close()
value = soup.find("div", class_="content-section champion-ranking-stats")
print(value.prettify())
Your expected output would be like:
>>> print(value.prettify())
<div class="content-section champion-ranking-stats">
<div class="win-rate meh-tier">
<div class="value">
48.4%
</div>
<div class="label">
Win Rate
</div>
</div>
<div class="overall-rank">
<div class="value">
49 / 58
</div>
<div class="label">
Rank
</div>
</div>
<div class="pick-rate">
<div class="value">
3.6%
</div>
<div class="label">
Pick Rate
</div>
</div>
<div class="ban-rate">
<div class="value">
2.3%
</div>
<div class="label">
Ban Rate
</div>
</div>
<div class="matches">
<div class="value">
55,432
</div>
<div class="label">
Matches
</div>
</div>
</div>

Python beautifulsoup AttributeError

I'm trying to get some image URLs from HTML content using Python BeautifulSoup.
My HTML content:
<div id="photos" class="tab rel-photos multiple-photos">
<span id="watch-this" class="classified-detail-buttons">
<span id="c_id_10832265:c_type_202:watch_this">
<a href="/watchlist/classified/baby-items/10832265/1/" id="watch_this_logged" data-require-auth="favoriteAd" data-tr-event-name="dpv-add-to-favourites">
<i class="fa fa-fw fa-star-o"></i></a></span>
</span>
<span id="thumb1" class=" image">
<a href="https://images.dubizzle.com/v1/files/eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJmbiI6ImYzYWdrZm8xcDBlai1EVUJJWlpMRSIsInciOlt7ImZuIjoiNWpldWk3cWZ6aWU2MS1EVUJJWlpMRSIsInMiOjUwLCJwIjoiY2VudGVyLGNlbnRlciIsImEiOjgwfV19.s1GmifnZr0_Bx4HG8RTR4puYcxN0asqAmnBvSpIExEI/image;p=main"
id="a-photo-modal-view:263986810"
rel="photos-modal"
target="_new"
onClick="return dbzglobal_event_adapter(this);">
<div style="background-image:url(https://images.dubizzle.com/v1/files/eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJmbiI6ImYzYWdrZm8xcDBlai1EVUJJWlpMRSIsInciOlt7ImZuIjoiNWpldWk3cWZ6aWU2MS1EVUJJWlpMRSIsInMiOjUwLCJwIjoiY2VudGVyLGNlbnRlciIsImEiOjgwfV19.s1GmifnZr0_Bx4HG8RTR4puYcxN0asqAmnBvSpIExEI/image;p=main);"></div>
</a>
</span>
<ul id="thumbs-list">
<li>
<span id="thumb2" class="image2">
<a href="https://images.dubizzle.com/v1/files/eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJmbiI6Imtmc3cxMWgzNTB2cTMtRFVCSVpaTEUiLCJ3IjpbeyJmbiI6IjVqZXVpN3FmemllNjEtRFVCSVpaTEUiLCJzIjo1MCwicCI6ImNlbnRlcixjZW50ZXIiLCJhIjo4MH1dfQ.Wo2YqPdWav8shtmyVO2AdisHmLX-ZLDAiskLPAmTSPU/image;p=main" id="a-photo-modal-view:263986811" rel="photos-modal" target="_new" onClick="return dbzglobal_event_adapter(this);" >
<div style="background-image:url(https://images.dubizzle.com/v1/files/eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJmbiI6Imtmc3cxMWgzNTB2cTMtRFVCSVpaTEUiLCJ3IjpbeyJmbiI6IjVqZXVpN3FmemllNjEtRFVCSVpaTEUiLCJzIjo1MCwicCI6ImNlbnRlcixjZW50ZXIiLCJhIjo4MH1dfQ.Wo2YqPdWav8shtmyVO2AdisHmLX-ZLDAiskLPAmTSPU/image;p=thumb_retina);"></div>
</a>
</span>
</li>
<li id="thumbnails-info">
4 Photos
</li>
</ul>
<div id="photo-count">
4 Photos - Click to enlarge
</div>
</div>
My Python code:
images = soup.find("div", {"id": ["photos"]}).find_all("a")
for image in images:
    sk = image.get("href").replace("p=main", "p=thumb_retina", 1)
    print(sk)
But I'm getting this error:
Traceback (most recent call last):
File "/Users/evilslab/Documents/Websites/www.futurepoint.dev.cc/dobuyme/SCRAPE/boats.py", line 47, in <module>
images = soup.find("div", {"id": ["photos"]}).find_all("a")
AttributeError: 'NoneType' object has no attribute 'find_all'
How can I get only the URL from an a href tag?
Your code works for me; here it is more completely (given your HTML as html_doc):
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'html.parser')
images = soup.find("div", {"id": ["photos"]}).find_all("a")
for image in images:
    print(image['href'].replace("p=main", "p=thumb_retina", 1))
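Equivalently, with a CSS selector (a minor variation, same behavior):
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'html.parser')
# Select every <a> that has an href inside the photos div
for a in soup.select("div#photos a[href]"):
    print(a['href'].replace("p=main", "p=thumb_retina", 1))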
However, your problem is that the text requests returns for that URL is not the same as the HTML sample you give. Despite your attempt to supply a random user agent, the server returns:
<li>You\'re a power user moving through this website with super-human speed.</li>\n <li>You\'ve disabled JavaScript in your web browser.</li>\n <li>A third-party browser plugin, such as Ghostery or NoScript, is preventing JavaScript from running. Additional information is available in this <a title=\'Third party browser plugins that block javascript\' href=\'http://ds.tl/help-third-party-plugins\' target=\'_blank\'>support article</a>.</li>\n </ul>\n </div>\n <p class="we-could-be-wrong" >\n We could be wrong, and sorry about that! Please complete the CAPTCHA below and we’ll get you back on dubizzle right away.
Since the CAPTCHA is intended to prevent scraping, I suggest respecting the admin's wishes and not scraping it. Maybe there's an API?
Try this:
for item in soup.find_all('span'):
    try:
        link = item.find_all('a', href=True)[0].attrs.get('href', None)
    except IndexError:
        continue
    else:
        print(link)
Output:
/watchlist/classified/baby-items/10832265/1/
/watchlist/classified/baby-items/10832265/1/
https://images.dubizzle.com/v1/files/eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJmbiI6ImYzYWdrZm8xcDBlai1EVUJJWlpMRSIsInciOlt7ImZuIjoiNWpldWk3cWZ6aWU2MS1EVUJJWlpMRSIsInMiOjUwLCJwIjoiY2VudGVyLGNlbnRlciIsImEiOjgwfV19.s1GmifnZr0_Bx4HG8RTR4puYcxN0asqAmnBvSpIExEI/image;p=main
https://images.dubizzle.com/v1/files/eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJmbiI6Imtmc3cxMWgzNTB2cTMtRFVCSVpaTEUiLCJ3IjpbeyJmbiI6IjVqZXVpN3FmemllNjEtRFVCSVpaTEUiLCJzIjo1MCwicCI6ImNlbnRlcixjZW50ZXIiLCJhIjo4MH1dfQ.Wo2YqPdWav8shtmyVO2AdisHmLX-ZLDAiskLPAmTSPU/image;p=main

How can I log in with Scrapy when there is no form element?

I am trying to log in to a website, but it seems that they do not use a form to display the login dialog. So when using FormRequest, I got this error:
raise ValueError("No <form> element found in %s" % response)
So how can I log in with Scrapy in this case?
I tried to find a form element on this website (using Chrome devtools with the XPath //form), but the result is zero.
Its login element is:
<div class="loginModalBody">
<div class="coverLoginModal">
<p class="loginModalTitle">Login </p>
<div class=""><p class="login-msg"></p></div>
<!-- Email -->
<div class="loginCoverInputText">
<input class="loginInputText" id="email-login" role="presentation" autocomplete="off" type="email" name="loginEmail" placeholder="E-mail">
<span class="loginNameInputText">E-mail</span>
<span class="loginLineInputText"></span>
<!-- Error email -->
<div class="dontEnterEmail loginErrorInput"><p class="loginError">Vui lòng nhập email<span class="loginIconError"></span></p></div>
<div class="loginEmailInvalid loginErrorInput"><p class="loginError">Invalid email<span class="loginIconError"></span></p></div>
</div>
<!-- Password -->
<div class="loginCoverInputText">
<input class="loginInputText" id="password-login" autocomplete="new-password" type="password" name="loginPassword" placeholder="Password">
<span class="loginNameInputText">Password</span>
<span class="loginLineInputText"></span>
<!-- Error password -->
<div class="dontEnterPassword loginErrorInput"><p class="loginError">Enter password<span class="loginIconError"></span></p></div>
</div>
<!-- Remember password -->
<label class="loginRememberPassword" id="login-remember-pass" for="loginRememberPassword"><input id="loginRememberPassword" type="checkbox" name="loginRememberPassword"><span></span>Ghi nhớ mật khẩu</label>
<p class="loginForgotPassword forgot-password"> <span></span>forgot pass</p>
<button class="loginButtonSubmit btn-login" id="btn-login-system" type="button">Login</button>
<p class="loginDontAccount">Do not have account? <a class="not-account" href="javascript:void(0)" data-dismiss="modal" data-toggle="modal" data-target="#modal-signup-system">Register!</a></p>
<p class="loginOr">Or</p>
<button type="button" class="loginByGoogle" onclick="open_login_g()">Login with Google</button>
<button type="button" class="loginByFacebook" onclick="open_login_f()">Login with Facebook</button>
</div>
</div>
The code I use is:
import scrapy
from scrapy.http import FormRequest
from scrapy.utils.response import open_in_browser

class Spider(scrapy.Spider):
    name = "card"
    start_urls = ["https://website/auth/signin"]
    login_user = "foo"
    login_pass = "bar"

    def parse(self, response):
        '''Parse login page'''
        open_in_browser(response)
        return FormRequest.from_response(
            response,
            formdata={
                'email': "username",
                'password': "pass"
            },
            callback=self.parse_home
        )

    def parse_home(self, response):
        open_in_browser(response)
        print(response)
Web scraping is about requests and responses, so all you need to do is simulate the user's requests. FormRequest just helps us avoid extra work with forms. In this case you need to make a proper login Request yourself:
Go to the needed page and open the developer tools in your browser (e.g. Chrome).
Check the preserve log option in the Network tab.
Fill in the credentials on the page and push the login button.
Find the login request (sent after the button was pressed).
Check the Headers tab of that request and find the request type and parameters (it can be a GET with some querystring parameters, or a POST with some Form Data).
In your code, reproduce the login request using a simple scrapy Request instead of FormRequest, as in the sketch below.
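For example, if the Network tab showed the login button firing a POST with a JSON body, the spider could look roughly like this (a sketch only: the endpoint URL is a hypothetical placeholder to be replaced with the one you actually observe; the loginEmail/loginPassword keys are taken from the input names in the HTML above):
import json
import scrapy

class CardSpider(scrapy.Spider):
    name = "card"

    def start_requests(self):
        # Reproduce the login request observed in the Network tab.
        # "https://website/api/login" is a hypothetical endpoint - use the real one.
        yield scrapy.Request(
            "https://website/api/login",
            method="POST",
            body=json.dumps({"loginEmail": "foo", "loginPassword": "bar"}),
            headers={"Content-Type": "application/json"},
            callback=self.after_login,
        )

    def after_login(self, response):
        # Scrapy keeps the session cookies, so later requests stay logged in
        self.logger.info("login status: %s", response.status)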

Web Scraping with Python Request/lxml: Getting data from ul/li

So I'm pretty new to this, and I haven't been able to find anything on Google about this question.
I'm using requests and lxml with Python. I've seen that there are a lot of different modules for web scraping, but is there any reason to choose one over the other? Can you do the same things with requests/lxml as you can with, for example, BeautifulSoup?
Anyway, here's my actual question:
This is my code:
import requests
from lxml import html

# Login data
inputUrl = 'http://forum.mytestsite.com/login'
usr = 'myusername'
pwd = 'mypassword'
payload = dict(login=usr, password=pwd)

# Open session
with requests.Session() as s:
    # Login
    s.post(inputUrl, data=payload)
    # Get page data
    pageResult = s.get('http://forum.mytestsite.com/icons/', allow_redirects=False)
    pageResult = html.fromstring(pageResult.content)
    pageIcons = pageResult.xpath('//script[@id="table-icons"]/text()')
    print(pageIcons[0])
The result when printing pageIcons[0]:
<ul id="icons">
{{#each icons}}
<li data-handle="{{handle}}">
<img src="{{image_path}}" alt="{{desc_or_name this}}" title="{{desc_or_name this}}">
</li>
{{/each}}
</ul>
This is the website/js code that generates the icons:
<script id="table-icons" type="text/x-handlebars-template">
<ul id="icons">
{{#each icons}}
<li data-handle="{{handle}}">
<img src="{{image_path}}" alt="{{desc_or_name this}}" title="{{desc_or_name this}}">
</li>
{{/each}}
</ul>
</script>
And here's the result on the page:
<ul id="icons">
<li data-handle="558FSTBI" class="">
<img src="http://testsite.com/icons/558FSTBI.1.png" alt="Icon 1" title="Icon 1">
</li>
<li data-handle="310AYTZI">
<img src="http://testsite.com/icons/310AYTZI.1.png" alt="Icon 2" title="Icon 2">
</li>
<li data-handle="669PQXBI" class="">
<img src="http://testsite.com/icons/669PQXBI.1.png" alt="Icon 3" title="Icon 3">
</li>
</ul>
My goal:
What I would like to do is retrieve all of the li data-handles, but I haven't been able to figure out how to retrieve this data. So my goal is to retrieve all of the icon paths and their titles. Could anyone help me out here? I'd really appreciate any help :)
You aren't parsing the li or ul.
Start with this:
//ul[@id='icons']/li/img
And from those elements you can extract the individual information, as in the sketch below.
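A possible extraction with lxml (a sketch: rendered_html stands for the page source after the JavaScript has run, e.g. obtained via a headless browser, since plain requests only sees the Handlebars template):
from lxml import html

tree = html.fromstring(rendered_html)  # rendered_html: post-JS page source
for img in tree.xpath("//ul[@id='icons']/li/img"):
    li = img.getparent()
    # data-handle sits on the <li>; the icon path and title sit on the <img>
    print(li.get('data-handle'), img.get('src'), img.get('title'))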
Regarding the first question: BeautifulSoup optionally uses lxml as its parser. If you don't think you need it, and are comfortable with XPath, don't worry about it.
However, since it's JavaScript generating the page, you need a headless browser rather than the requests library:
Get page generated with Javascript in Python
Reading dynamically generated web pages using python

Python data scraping

I want to download a couple of songs off of http://www.youtube-mp3.org/. I'm using urllib2 and BeautifulSoup.
The problem is that when I open the site with urllib2, with my video ID plugged in (http://www.youtube-mp3.org/?c#v=lV7r8PiuecQ), I get the site, but they are tricky about it and load the info after the initial page load with some JS AJAX stuff. So when I try to scrape the url of the download link, it literally isn't on the page, because it hasn't been loaded yet.
Does anyone know how I can trigger this JS loader in my Python script, or something?
Here is the relevant empty HTML, BEFORE the content that I want is loaded into it:
<div id="link_box" style="display:none">
<div id="link_box_title" style="font-weight:bold; text-decoration:underline">
</div>
<div class="row">
<div id="link_box_bb_code_title" style="font-weight:bold">
</div>
<input type="text" id="BBCodeLink" onclick="sAll(this)" />
</div>
<div class="row">
<div id="link_box_html_code_title" style="font-weight:bold">
</div>
<input type="text" id="HTMLLink" onclick="sAll(this)" />
</div>
<div class="row">
<div id="link_box_direct_code_title" style="font-weight:bold">
</div>
<input type="text" id="DirectLink" onclick="sAll(this)" />
</div>
</div>
<div id="v-ads">
</div>
<div id="dl_link">
</div>
<div id="progress">
</div>
<div id="loader">
<img src="ajax-loader-b.gif" alt="loading.." width="16" height="11" />
</div>
</div>
<div class="clear">
</div>
</div>
The API is JSON-based, so the contents of the HTML pages won't give you any clue on where to find the files. A good idea when exploring web services like this one is to open the Network tab in Chrome's developer tools and see what requests it makes as you interact with the page. That exercise showed me that two urls in particular seem interesting:
http://www.youtube-mp3.org/api/pushItem/?item=http%3A//www.youtube.com/watch%3Fv%3DKMU0tzLwhbE&xy=trve&r=1314700829128
http://www.youtube-mp3.org/api/itemInfo/?video_id=KMU0tzLwhbE&adloc=&r=1314700829314
The first url appears to queue a file for processing, the second to get the status of the processing job.
The second url takes a video_id GET parameter, which is the id of the video on youtube (http://www.youtube.com/watch?v=KMU0tzLwhbE), and returns the status of the decoding job. The second and third GET parameters (adloc and r) seem irrelevant for this purpose, which you can verify by loading the url with and without them.
The content of the page is:
info = { "title" : "Developers",
"image" : "http://i4.ytimg.com/vi/KMU0tzLwhbE/default.jpg",
"length" : "3", "status" : "serving", "progress_speed" : "",
"progress" : "", "ads" : "",
"h" : "a0aa17294103c638fa7f5e0606f839d3" };
Which happens to be JSON data. The interesting bit is "a0aa17294103c638fa7f5e0606f839d3", which looks like a hash that the web service uses to refer to the decoded mp3 file. Also check out how the download link on the front page looks:
http://www.youtube-mp3.org/get?video_id=KMU0tzLwhbE&h=a0aa17294103c638fa7f5e0606f839d3
Now we have all the pieces of the puzzle. First, we take the url of a youtube video (http://www.youtube.com/watch?v=iKP7DZmqdbU), URL-quote it, and feed it to the api using this url:
http://www.youtube-mp3.org/api/pushItem/?item=http%3A//www.youtube.com/watch%3Fv%3DiKP7DZmqdbU&xy=trve
Then, wait a few moments until the decoding job is done:
http://www.youtube-mp3.org/api/itemInfo/?video_id=iKP7DZmqdbU
Take the hash found in the info url to construct the download url:
http://www.youtube-mp3.org/get?video_id=iKP7DZmqdbU&h=2e4b61b6ddc8bf83f5a0e4e4ee0635bb
Note that it is possible that the webmaster of the site does not want it to be scraped and will take countermeasures if people start to (in the webmaster's eyes) abuse the site. For example, it seems to use referer protection, so clicking the links in this post won't work; you have to copy them and load them in a new browser window.
Test code:
from re import findall
from time import sleep
from urllib.request import urlopen
from urllib.parse import quote

yt_code = 'gijypDkEqUA'
yt_url = 'http://www.youtube.com/watch?v=%s' % yt_code
push_url_fmt = 'http://www.youtube-mp3.org/api/pushItem/?item=%s&xy=trve'
info_url_fmt = 'http://www.youtube-mp3.org/api/itemInfo/?video_id=%s'
download_url_fmt = 'http://www.youtube-mp3.org/get?video_id=%s&h=%s'

# Queue the video for conversion
push_url = push_url_fmt % quote(yt_url)
data = urlopen(push_url).read()

# Wait for the decoding job to finish
sleep(10)

# Fetch the job info and pull out the file hash
info_url = info_url_fmt % yt_code
data = urlopen(info_url).read().decode()
res = findall('"h" : "([^"]*)"', data)

download_url = download_url_fmt % (yt_code, res[0])
print('Download here:', download_url)
You could use selenium to interact with the JS stuff and then combine it with BeautifulSoup, or do everything with selenium, just as you prefer.
http://seleniumhq.org/
Selenium is a tool for browser automation and has bindings for a few languages, including Python. It takes a running instance of Firefox/IE/Chrome and lets you script it (I suggest using the selenium webdriver for this simple problem, not the whole selenium server).
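A minimal sketch of that approach (the #dl_link container comes from the empty HTML in the question; the 30-second timeout and the assumption that the finished download link appears as an <a> inside it are guesses):
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
driver.get('http://www.youtube-mp3.org/?c#v=lV7r8PiuecQ')
# Wait until the site's JS has filled the #dl_link container with a link
link = WebDriverWait(driver, 30).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, '#dl_link a'))
)
print(link.get_attribute('href'))
driver.quit()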
You're going to have to work through http://www.youtube-mp3.org/client.js and figure out the exact information that is being passed around; that would allow you to post a request, parse the response, and download from the correct scraped url.
