The code below is so simple, so why is it printing None? Does that mean it's not finding the page?
from bs4 import BeautifulSoup as soup
from robobrowser import RoboBrowser
br = RoboBrowser()
login_url = 'https://www.cbssports.com/login'
login_page = br.open(login_url)
print(login_page)
br.open doesn't return the page content.
br.open opens the robot browser to that page, which only changes the state of the browser. If you want the content of the page, call br.open(login_url) to load it, and then print(br.state.response.text): the text sent back in the response is stored in the browser's state.
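A minimal sketch of that approach, assuming a current RoboBrowser where open() updates the state rather than returning the response:

from robobrowser import RoboBrowser

br = RoboBrowser(parser='html.parser')       # explicit parser avoids a bs4 warning
br.open('https://www.cbssports.com/login')   # returns None; updates the browser state
print(br.state.response.text)                # the HTML text stored in the browser's state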
Related
I need to download the content of a web page using Python.
What I need is the TLE of a specific satellite from the Space-Track.org website.
An example of the URL I need to scrape is the following:
https://www.space-track.org/basicspacedata/query/class/gp/NORAD_CAT_ID/44235/format/tle/emptyresult/show
Below is the unsuccessful code I wrote/copied:
import requests
url = 'https://www.space-track.org/basicspacedata/query/class/gp/NORAD_CAT_ID/44235/format/tle/emptyresult/show'
res = requests.post(url)
html_page = res.content
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_page, 'html.parser')
text = soup.find_all(text=True)
print(text)
requests.post(url) returns Response [204], and I can't access the content of the webpage.
Could this happen because of the required login?
I must admit that I am not experienced with Python and don't have the knowledge to do this myself.
What I can do is manipulate text files, and from the DevTools page I can get the HTML file and extract the text, but how can I do this programmatically?
To access the URL you mentioned, you need username and password authorization.
To do this (customize it to your needs):
import mechanize
import http.cookiejar  # this was cookielib in Python 2

cj = http.cookiejar.CookieJar()
br = mechanize.Browser()
br.set_cookiejar(cj)  # keep the session cookie across requests

br.open("https://id.arduino.cc/auth/login/")  # substitute the login page you need
br.select_form(nr=0)  # the first form on the page
br.form['username'] = 'username'
br.form['password'] = 'password'
br.submit()
print(br.response().read())
I don't have access to this API, so take my advice with a grain of salt, but you should also try using requests.get instead of requests.post.
Why? Because requests.post POSTs data to the server, while requests.get GETs data from it. GET and POST are known as HTTP methods; to learn more about them, see https://www.tutorialspoint.com/http/http_methods.htm. Since a web browser issues a GET when you simply visit a page, you should give that a try. A sketch combining both answers follows.
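Putting the two answers together, a requests.Session sketch might look like this. The login endpoint and the identity/password field names follow Space-Track's documented API, but treat them as assumptions to verify; the credentials are placeholders:

import requests

LOGIN_URL = 'https://www.space-track.org/ajaxauth/login'
QUERY_URL = ('https://www.space-track.org/basicspacedata/query/'
             'class/gp/NORAD_CAT_ID/44235/format/tle/emptyresult/show')

with requests.Session() as session:
    # Authenticate first so the session carries the login cookie.
    session.post(LOGIN_URL, data={'identity': 'your_username',
                                  'password': 'your_password'})
    res = session.get(QUERY_URL)  # GET, not POST: we are fetching data
    print(res.status_code)
    print(res.text)  # the TLE comes back as plain text, no HTML parsing needed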
I would like to parse a website with the urllib Python library. I wrote this:
from bs4 import BeautifulSoup
from urllib.request import HTTPCookieProcessor, build_opener
from http.cookiejar import FileCookieJar
def makeSoup(url):
    jar = FileCookieJar("cookies")
    opener = build_opener(HTTPCookieProcessor(jar))
    html = opener.open(url).read()
    return BeautifulSoup(html, "lxml")

def articlePage(url):
    return makeSoup(url)
Links = "http://collegeprozheh.ir/%d9%85%d9%82%d8%a7%d9%84%d9%87- %d9%85%d8%af%d9%84-%d8%b1%d9%82%d8%a7%d8%a8%d8%aa%db%8c-%d8%af%d8%b1-%d8%b5%d9%86%d8%b9%d8%aa-%d9%be%d9%86%d9%84-%d9%87%d8%a7%db%8c-%d8%ae%d9%88%d8%b1%d8%b4%db%8c%d8%af/"
print(articlePage(Links))
but the website does not return the content of the body tag.
This is the result of my program:
cURL = window.location.href;
var p = new Date();
second = p.getTime();
GetVars = getUrlVars();
setCookie("Human" , "15421469358743" , 10);
check_coockie = getCookie("Human");
if (check_coockie != "15421469358743")
document.write("Could not Set cookie!");
else
window.location.reload(true);
</script>
</head><body></body>
</html>
I think the cookie has caused this problem.
The page is using JavaScript to check the cookie and to generate the content. However, urllib does not process JavaScript and thus the page shows nothing.
You'll either need to use something like Selenium, which acts as a browser and executes JavaScript, or you'll need to set the cookie yourself before you request the page (from what I can see, that's all the JavaScript code does). You seem to be loading a file containing cookie definitions (using FileCookieJar), but you haven't included its contents. A sketch of the manual-cookie route follows.
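For illustration, a sketch of setting the cookie by hand. It hard-codes the Human cookie value shown in the question's output; the live site may generate a different value per visit, in which case this won't work as-is:

from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

url = ("http://collegeprozheh.ir/%d9%85%d9%82%d8%a7%d9%84%d9%87-"
       "%d9%85%d8%af%d9%84-%d8%b1%d9%82%d8%a7%d8%a8%d8%aa%db%8c-%d8%af%d8%b1-"
       "%d8%b5%d9%86%d8%b9%d8%aa-%d9%be%d9%86%d9%84-%d9%87%d8%a7%db%8c-"
       "%d8%ae%d9%88%d8%b1%d8%b4%db%8c%d8%af/")
# Send the cookie the page's JavaScript would otherwise have set.
req = Request(url, headers={"Cookie": "Human=15421469358743"})
html = urlopen(req).read()
print(BeautifulSoup(html, "lxml").body)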
I have the following script:
import requests
from http.cookiejar import CookieJar  # this was cookielib in Python 2

jar = CookieJar()
login_url = 'http://www.whispernumber.com/signIn.jsp?source=calendar.jsp'
acc_pwd = {'USERNAME': 'myusername',
           'PASSWORD': 'mypassword'}
r = requests.get(login_url, cookies=jar)
r = requests.post(login_url, cookies=jar, data=acc_pwd)
page = requests.get('http://www.whispernumber.com/calendar.jsp?day=20150129', cookies=jar)
print(page.text)
But print(page.text) shows that the site is trying to forward me back to the login page:
<script>location.replace('signIn.jsp?source=calendar.jsp');</script>
I have a feeling this is because of the JSP, and I am not sure how to log in to a JavaScript-driven page. Thanks for the help!
Firstly, you're posting to the wrong page. If you view the HTML at your link, you'll see the form is as follows:
<form action="ValidatePassword.jsp" method="post">
Assuming you authenticate correctly, you will probably get a cookie back that you can use for subsequent page requests. (You seem to be thinking along the right lines; a sketch follows below.)
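For illustration, a session-based sketch that posts to the form's action shown above. The USERNAME/PASSWORD field names come from the question, and the action URL is assumed to resolve relative to the sign-in page; verify both against the live form:

import requests

base = 'http://www.whispernumber.com/'
session = requests.Session()  # the Session keeps any cookies the site sets
session.get(base + 'signIn.jsp?source=calendar.jsp')
session.post(base + 'ValidatePassword.jsp',
             data={'USERNAME': 'myusername', 'PASSWORD': 'mypassword'})
page = session.get(base + 'calendar.jsp?day=20150129')
print(page.text)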
Requests isn't a web browser; it is an HTTP client that simply grabs the raw text of the page. You are going to want to use something like Selenium or another headless browser to programmatically log in to a site.
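For completeness, a Selenium sketch of that route; the field names are taken from the question's form data and are assumptions to verify against the live page:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()  # any installed WebDriver works
driver.get('http://www.whispernumber.com/signIn.jsp?source=calendar.jsp')
driver.find_element(By.NAME, 'USERNAME').send_keys('myusername')
pwd = driver.find_element(By.NAME, 'PASSWORD')
pwd.send_keys('mypassword')
pwd.submit()  # submits the enclosing form
driver.get('http://www.whispernumber.com/calendar.jsp?day=20150129')
print(driver.page_source)  # rendered HTML after any JavaScript has run
driver.quit()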
I intend to use twill to fill out a form on one page, hit the submit button, and then use BeautifulSoup to parse the resulting page. How can I feed BeautifulSoup the HTML page? I assume I have to read the current URL, but I do not know how to actually return the URL in order to do so. I have tried twill's TwillBrowser.get_url(), but it only returns None.
For any future sufferers: I have had better luck using mechanize instead of twill, since twill is an unmaintained thin shell over mechanize. The solution is as follows:
import mechanize

url = "http://foo.com"  # mechanize needs a full URL, including the scheme
br = mechanize.Browser()
br.open(url)
br.select_form(name="YOURFORMNAMEHERE")  # make sure to keep the quotation marks
br["YOURINPUTFIELDNAMEHERE"] = ["YOURVALUEHERE"]  # this must be a list even if it is only one value
response = br.submit()
print(response.geturl())
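Since the original goal was to hand the resulting page to BeautifulSoup, the response from submit() can be parsed directly. A small follow-on sketch, continuing from the snippet above:

from bs4 import BeautifulSoup

soup = BeautifulSoup(response.read(), "html.parser")
print(soup.title)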
Finally figured this out!
If you import twill like so:
import twill.commands as com
then you can get the current URL with:
url = com.browser.get_url()
Source: http://nullege.com/codes/search/twill.commands.browser.get_url
I'm submitting a form, which then has a confirmation page. On the confirmation page in a browser there is an input, rendered as an image, that the user clicks to confirm the order.
Mechanize is not recognizing the form at all, even though it is present in the HTML that mech has:
content = mech.submit().read()
soup = BeautifulSoup(content, 'html.parser')
print(soup.findAll('form'))
displays the correct form, while mech claims there are no forms present. I have tried doing:
mech.click(inputName)
and mech claims the input does not exist. Meanwhile the input shows up just fine with:
print(soup.findAll('input'))
Any ideas? I have also done this:
mech = mechanize.Browser(factory=mechanize.RobustFactory())
With no luck.
Try parsing all HTML responses with BeautifulSoup; then mechanize should recognize the form.
You can see how to do it in this answer: Is it possible to hook up a more robust HTML parser to Python mechanize? A sketch of the idea follows.
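A sketch of that approach: re-parse the response with BeautifulSoup and hand the cleaned markup back to mechanize before selecting the form. The URL is a placeholder, and the get_data/set_data/set_response calls should be checked against your mechanize version:

import mechanize
from bs4 import BeautifulSoup

br = mechanize.Browser()
response = br.open("http://example.com/order-confirmation")  # placeholder URL
# Clean up the malformed HTML, then give mechanize the cleaned copy.
soup = BeautifulSoup(response.get_data(), "html.parser")
response.set_data(soup.prettify("utf-8"))  # prettify(encoding) returns bytes
br.set_response(response)
br.select_form(nr=0)  # the form should now be recognized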