I would like to know if there is a way to display Google's search results as plain-text output in my Python program. For example, if I type "Electricity" in my program, I want it to print Google's search results for that term as plain text. Is there a way to do this?
UPDATE
import urllib2
response = urllib2.urlopen("https://en.wikipedia.org/wiki/Machine_learning")
the_page = response.read(bytes)
content = str(the_page)
print the_page
I tried the above code but it shows me errors. If I just use
the_page = response.read()
print the_page
it just prints the raw HTML of the page, not the text alone. So how do I get just the text?
import urllib2
response = urllib2.urlopen("https://en.wikipedia.org/wiki/Machine_learning")
content= response.read()
# Now it's time for parsing *content* to extract the relevant data
# Use regex or HTMLParser from the standard library,
# or BeautifulSoup or lxml (third party)
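For example, a minimal sketch of the parsing step using BeautifulSoup (third party) to strip the markup and keep only the visible text:
from bs4 import BeautifulSoup
soup = BeautifulSoup(content, "html.parser")  # parse the HTML fetched above
text = soup.get_text()                        # visible text only, tags stripped
print(text)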
Related
I am trying to use the requests library in Python to post the text content of a text file to a website, submit the text for analysis on said website, and pull the results back into Python. I have read through a number of responses here and on other websites, but have not yet figured out how to correctly adapt the code to a new website.
I'm familiar with Beautiful Soup, so pulling in webpage content and removing HTML isn't an issue; it's the submitting of the data that I don't understand.
My code currently is:
import requests
fileName = "texttoAnalyze.txt"
fileHandle = open(fileName, 'rU')
url_text = fileHandle.read()
url = "http://www.webpagefx.com/tools/read-able/"
payload = {'value':url_text}
r = requests.post(url, payload)
print r.text
This code comes back with the HTML of the website, but hasn't recognized the fact that I'm trying to submit a form.
Any help is appreciated. Thanks so much.
You need to send the same request the website itself is sending. You can usually capture these with web debugging tools (like the Chrome/Firefox developer tools).
In this case the url the request is being sent to is: http://www.webpagefx.com/tools/read-able/check.php
With the following params: tab=Test+by+Direct+Link&directInput=SOME_RANDOM_TEXT
So your code should look like this:
url = "http://www.webpagefx.com/tools/read-able/check.php"
payload = {'directInput':url_text, 'tab': 'Test by Direct Link'}
r = requests.post(url, data=payload)
print r.text
Good luck!
There are two post parameters, tab and directInput:
import requests
post = "http://www.webpagefx.com/tools/read-able/check.php"
with open("in.txt") as f:
    data = {"tab": "Test by Direct Link",
            "directInput": f.read()}
    r = requests.post(post, data=data)
    print(r.content)
I have a web form that I would like users to fill out via a Python bot. Everything I can find on this pulls all of the data from a pre-defined payload via requests or mechanize; my situation differs in that I'd like users to be able to trigger the submission with their own text (for example: .submit Ticket #1234 - Blah blah blah).
The page they are submitting to is a simple form - 1 text area and 1 submit button.
Could anyone shine some light on some tutorials or how I'd go about this?
Here's my attempt:
import re
import urllib.parse
import urllib.request
import requests
from lxml import etree
@hook.command("addquote")
def addquote(text, bot):
    """<query> -- adds a quote to the qdb."""
    url = 'example.com/?add'
    values = {'addquote': text}
    data = urllib.parse.urlencode(values).encode('utf-8')
    req = urllib.request.Request(url, data)
    response = urllib.request.urlopen(req)
    the_page = response.read()
Thanks!
Here's an example using HTTP POST. If your web form uses PUT, just change the method. It outputs the page that is returned (which probably isn't needed unless you want to check for success) and the HTTP status code and reason. This is Python 3; only small changes are needed for older versions:
import urllib.parse, urllib.request
url="http://hskhsk.pythonanywhere.com/cidian"
params={"q":"apple","secondparam":"ignored"}
encoded=urllib.parse.urlencode(params)
data=bytes(encoded, 'utf-8')
req = urllib.request.Request(url=url, data=data, method="POST")
with urllib.request.urlopen(req) as f:
    print(f.read().decode('utf-8'))
    print(f.status)
    print(f.reason)
The web form that it calls is a simple English-Chinese dictionary that accepts HTTP GET or POST requests. The text to search for (apple) is the q parameter and the other parameter is ignored.
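For comparison, here is a sketch of the same POST using the third-party requests library, against the same test URL and parameters (just an illustration, not part of the original answer):
import requests
params = {"q": "apple", "secondparam": "ignored"}
r = requests.post("http://hskhsk.pythonanywhere.com/cidian", data=params)
print(r.text)                    # the page returned by the form
print(r.status_code, r.reason)   # HTTP status code and reason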
I would recommend reading this book http://learnpythonthehardway.org/book/ex51.html
Here is a very simple example, using Grablib:
# we will use GrabLib (http://docs.grablib.org/en/latest/)
from grab import Grab
g = Grab()
g.go('someurl.com')
g.set_input('form name', 'text to be inputted')
g.submit()
That's all!
As of now I have created a basic program in Python 2.7, using urllib2 and re, that gathers the HTML of a website, prints it out for you, and indexes a keyword. I would like to create a much more complex and dynamic program that could gather data from websites such as sports or stock statistics and aggregate it into lists, which could then be used for analysis in something such as an Excel document. I'm not asking for someone to literally write the code; I simply need help understanding how to approach it: whether I require extra libraries, etc. Here is the current code. It is very simplistic as of now:
import urllib2
import re
y = 0
while(y == 0):
    x = str(raw_input("[[[Enter URL]]]"))
    keyword = str(raw_input("[[[Enter Keyword]]]"))
    wait = 0
    try:
        req = urllib2.Request(x)
        response = urllib2.urlopen(req)
        page_content = response.read()
        idall = [m.start() for m in re.finditer(keyword, page_content)]
        wait = raw_input("")
        print(idall)
        wait = raw_input("")
        print(page_content)
    except urllib2.HTTPError as e:
        print e.reason
You can use requests to handle the interaction with the website. Here is the link for it: http://docs.python-requests.org/en/latest/
Then you can use BeautifulSoup to handle the HTML content. Here is the link for it: http://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html
They're easier to use than urllib2 and re.
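For example, a rough sketch of your keyword lookup rebuilt on those two libraries (Python 2.7 to match your program; the details are only illustrative):
import requests
from bs4 import BeautifulSoup
url = raw_input("[[[Enter URL]]]")
keyword = raw_input("[[[Enter Keyword]]]")
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
page_text = soup.get_text()          # visible text only, no markup
print(page_text.count(keyword))      # number of occurrences in the text
Hope it helps.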
The printed HTML comes back as garbled text, instead of what I expect to see with "view source" in the browser.
Why is that? How do I fix it easily?
Thank you for your help.
I get the same behavior using mechanize, curl, etc.
import urllib
import urllib2
start_url = "http://www.ncert.nic.in/ncerts/textbook/textbook.htm"
response = urllib2.urlopen(start_url)
html = response.read()
print html
I got the same garbled text using curl
curl http://www.ncert.nic.in/ncerts/textbook/textbook.htm
The result appears to be gzipped, so piping it through gunzip shows the correct HTML for me:
curl http://www.ncert.nic.in/ncerts/textbook/textbook.htm | gunzip
Here's a solution for doing this in Python: Convert gzipped data fetched by urllib2 to HTML
Edited by OP:
The revised code, after reading the above, is:
import urllib
import urllib2
import gzip
import StringIO
start_url = "http://www.ncert.nic.in/ncerts/textbook/textbook.htm"
response = urllib2.urlopen(start_url)
html = response.read()
data = StringIO.StringIO(html)
gzipper = gzip.GzipFile(fileobj=data)
html = gzipper.read()
html now holds the decompressed HTML (print it to see).
Try the requests library: Python Requests.
import requests
response = requests.get("http://www.ncert.nic.in/ncerts/textbook/textbook.htm")
print response.text
The reason for this is that the site uses gzip encoding. To my knowledge urllib2 doesn't support decompression, so you end up with compressed HTML responses from certain sites that use that encoding. You can confirm this by printing the content headers from the response, like so:
print response.headers
There you will see that the "Content-Encoding" is gzip. To get around this using the standard urllib2 library you'd need to use the gzip module. Mechanize has the same problem because it uses the same urllib library underneath. Requests will handle this encoding and format it nicely for you.
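As a quick check, here is a sketch that prints just that header (requests decompresses the body for you, so response.text is already plain HTML):
import requests
response = requests.get("http://www.ncert.nic.in/ncerts/textbook/textbook.htm")
print(response.headers.get("Content-Encoding"))  # "gzip" for this site
print(response.text[:200])                       # already decompressed by requests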
I have trouble understanding the Wikipedia API.
I have isolated a link by processing the JSON that I got as a response after sending a request to http://en.wikipedia.org/w/api.php.
Assuming that I got the following link, how do I get access to information like date of birth, etc.?
I'm using Python. I tried doing:
import urllib2,simplejson
search_req = urllib2.Request(direct_url_to_required_wikipedia_page)
response = urllib2.urlopen(search_req)
I have tried reading the API documentation, but I can't figure out how to extract data from specific pages.
Try:
import urllib
import urllib2
import simplejson
url = 'http://en.wikipedia.org/w/api.php'
values = {'action' : 'query',
          'prop' : 'revisions',
          'titles' : 'Jennifer_Aniston',
          'rvprop' : 'content',
          'format' : 'json'}
data = urllib.urlencode(values)
req = urllib2.Request(url, data)
response = urllib2.urlopen(req)
json = response.read()
The variable json now holds the JSON for the Wikipedia page. You can parse it with simplejson or whatever...
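For example, a minimal sketch of pulling the raw wikitext out of that response (this assumes the classic action=query JSON shape, where each revision's content sits under the '*' key):
import simplejson
parsed = simplejson.loads(json)
for page_id, page in parsed['query']['pages'].items():
    wikitext = page['revisions'][0]['*']  # raw wikitext of the article
    print(wikitext[:200])                 # first 200 characters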
Go to MediaWiki API. It's better organized and friendlier for humans :-).
You won't get information like date of birth from the API, at least not directly. The best you can do is to get the code of the page (or rendered HTML) and parse that to get the information you need.
As an alternative, you might want to look at DBpedia.
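For instance, a sketch of asking DBpedia's public SPARQL endpoint for a birth date (the endpoint and property URI are as commonly documented; verify them before relying on this):
import requests
query = """SELECT ?birth WHERE {
  <http://dbpedia.org/resource/Jennifer_Aniston>
    <http://dbpedia.org/ontology/birthDate> ?birth .
}"""
r = requests.get("http://dbpedia.org/sparql",
                 params={"query": query, "format": "application/sparql-results+json"})
print(r.json())  # the result bindings contain the birthDate literal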