Getting Twitter news feed using Python

Hey everybody, I've been playing around with Python 2.7 and BeautifulSoup 3.2 recently, and I've gotten my code to work for Facebook: it makes a POST request to Facebook to log in, downloads the HTML of the page, then saves it to my computer. I tried doing the same with Twitter, but it doesn't seem to be working*. Here's my code:
import urllib, urllib2
from BeautifulSoup import BeautifulSoup
#I've replaced my actual username and password for obvious reasons
form = urllib.urlencode({'username':'myUsername','password':'myPassword'})
response = urllib2.urlopen('https://mobile.twitter.com/session',form)
response = response.read()
Can anyone tell me what's wrong with it? Thanks!
*After I do response = response.read() I have it write to a file on my hard drive and open it with Firefox. When I open it, all I see is whatever is on http://mobile.twitter.com/ at the time I run the script.

Don't use BeautifulSoup; use lxml instead (http://lxml.de/). It's much more powerful, faster, and more convenient.
Don't scrape the Twitter web interface; use the official Twitter API instead: http://dev.twitter.com/doc/get/statuses/home_timeline
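To illustrate the lxml side of that advice, here's a minimal sketch. It parses a saved snippet of HTML rather than a live Twitter page, and the class names are invented for illustration only (Twitter's real markup will differ):

```python
from lxml import html

# A stand-in for HTML you've already downloaded; the class names here
# are made up for the example, not Twitter's actual markup.
snippet = """
<html><body>
  <div class="tweet"><span class="user">@alice</span>
    <p class="tweet-text">Hello world</p></div>
  <div class="tweet"><span class="user">@bob</span>
    <p class="tweet-text">Second tweet</p></div>
</body></html>
"""

tree = html.fromstring(snippet)
# One XPath expression pulls out every tweet body; this is where lxml
# tends to be terser than BeautifulSoup 3.
texts = tree.xpath('//p[@class="tweet-text"]/text()')
print(texts)  # -> ['Hello world', 'Second tweet']
```

The same xpath() call works unchanged on a full page fetched with urllib2, once you pass the downloaded string to html.fromstring().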

You may want to check out the Twitter libraries for Python.

Related

Accessing Indeed through Python

My goal for this Python code is to collect job information into a folder. The first step is failing: when I run the code, I want the URL printed to be https://www.indeed.com/, but instead the code returns https://secure.indeed.com/account/login. I am open to using urllib or cookielib to resolve this ongoing issue.
import requests
import urllib
data = {
'action':'Login',
'__email':'email#gmail.com',
'__password':'password',
'remember':'1',
'hl':'en',
'continue':'/account/view?hl=en',
}
response = requests.get('https://secure.indeed.com/account/login',data=data)
print(response.url)
If you're trying to scrape information from Indeed, you should use the Selenium library for Python.
https://pypi.python.org/pypi/selenium
You can then write your program within the context of a real user browsing the site normally.
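A rough sketch of that selenium route, wrapped in a function so the page-driving steps are easy to read. The element IDs below are guesses for illustration, not Indeed's real markup (inspect the live form to find the actual ones), and a WebDriver binary such as geckodriver needs to be installed:

```python
def indeed_login(email, password):
    # selenium is imported inside the function so the sketch can be read
    # (and the function defined) even where selenium isn't installed.
    from selenium import webdriver

    driver = webdriver.Firefox()
    driver.get('https://secure.indeed.com/account/login')
    # Hypothetical element IDs -- check the live login form for the real ones.
    driver.find_element_by_id('login-email-input').send_keys(email)
    driver.find_element_by_id('login-password-input').send_keys(password)
    driver.find_element_by_id('login-submit-button').click()
    # After a successful login this should no longer be the login URL.
    return driver.current_url
```

Because selenium drives a real browser, cookies, redirects, and any JavaScript on the login page behave exactly as they do for a normal user.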

Unable to read HTML content

I'm building a web crawler which needs to read the links inside a webpage, for which I'm using Python's urllib2 library to open and read the websites.
I found a website where I'm unable to fetch any data.
The URL is "http://www.biography.com/people/michael-jordan-9358066"
My code,
import urllib2
response = urllib2.urlopen("http://www.biography.com/people/michael-jordan-9358066")
print response.read()
When I run the above code, the content I get back is very different from what I see when I open the same URL in a browser; the response from the code doesn't include any of the page's data.
I thought it could be because of a delay in loading the page, so I introduced a delay. Even after the delay, the response is the same.
response = urllib2.urlopen("http://www.biography.com/people/michael-jordan-9358066")
time.sleep(20)
print response.read()
The web page opens perfectly fine in a browser.
However, the above code works fine for reading Wikipedia or some other websites.
I'm unable to find the reason behind this odd behaviour. Please help, thanks in advance.
What you are experiencing is most likely the effect of a dynamic web page. Such pages don't serve their content as static HTML for urllib or requests to fetch; the data is loaded into the page by JavaScript after it arrives. You can use Python's selenium to solve this.
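A minimal sketch of that selenium approach. Instead of a fixed time.sleep() (which, as seen above, doesn't help), it waits explicitly until an element shows up in the JavaScript-built DOM; the CSS selector and timeout are placeholder assumptions, and a WebDriver such as chromedriver must be installed:

```python
def fetch_rendered_html(url, wait_selector='body', timeout=20):
    # Imports kept inside the function so the recipe reads standalone.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        # Unlike time.sleep(), this waits only as long as needed for the
        # element to appear, up to the timeout.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, wait_selector)))
        return driver.page_source
    finally:
        driver.quit()

# Example call (commented out -- needs a browser + driver installed):
# html = fetch_rendered_html('http://www.biography.com/people/michael-jordan-9358066')
```

page_source then contains the DOM after JavaScript has run, which is what the browser shows and what urllib2 never sees.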

Deferred Downloading using Python Requests Library

I am trying to fetch some information from Workflowy using Python Requests Library. Basically I am trying to programmatically get the content under this URL: https://workflowy.com/s/XCL9FCaH1b
The problem is that Workflowy goes through a 'loading' phase before the actual content is displayed, so when I make the request I end up getting the content of the 'loading' page. Basically I need a way to defer reading the content so I can get past the loading phase.
It seemed like the Requests library addresses this problem here: http://www.python-requests.org/en/latest/user/advanced/#body-content-workflow but I couldn't get that example to work for my purposes.
Here is the super simple block of code that ends up getting the 'loading page':
import requests
path = "https://workflowy.com/s/XCL9FCaH1b"
r = requests.get(path, stream=True)
print(r.content)
Note that I don't have to use Requests; I just picked it up because it looked like it might offer a solution to my problem. I'm currently using Python 2.7.
Thanks a lot for your time!

urllib2 geturl() does not work for some url redirects

I am learning python and trying to get the urllib2 geturl() to work. So far, I have the following skeleton, which looks like:
import urllib2
gh = urllib2.urlopen('http://somewebsite.com/').geturl()
print gh
which seems to work fine. However, when I try it on, for example, a url given here, it fails to get me the "final url" (though a browser reaches it).
I would appreciate any guidance to solve this.
This happens because you are being redirected by JavaScript, and urllib2 can't execute JavaScript. If handling JavaScript redirects is important to you, use selenium.
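One hedged stopgap before reaching for selenium: some of these "invisible" redirects aren't JavaScript at all but a plain meta-refresh tag, which you can at least detect in the body urllib2 hands you. True window.location redirects still need a real browser; this sketch covers only the simple meta-refresh case:

```python
import re

def meta_refresh_target(page_html):
    """Return the URL from a <meta http-equiv="refresh"> tag, or None.

    Only covers the simple meta-refresh case; JavaScript redirects
    (window.location = ...) need a real browser such as selenium.
    """
    match = re.search(
        r'<meta[^>]+http-equiv=["\']?refresh["\']?[^>]*'
        r'content=["\']?\s*\d+\s*;\s*url=([^"\'>\s]+)',
        page_html, re.IGNORECASE)
    return match.group(1) if match else None

page = '<meta http-equiv="refresh" content="0; url=http://example.com/final">'
print(meta_refresh_target(page))  # -> http://example.com/final
```

If this returns a URL, you can open it with a second urlopen() call and then geturl() reflects the page you actually wanted.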

How to extract text from a web page that requires logging in using python and beautiful soup?

I have to retrieve some text from a website called morningstar.com. To access that data I have to log in. But once I log in and request the URL of the page, I get the HTML that a normal (not logged-in) user sees, so I'm not able to access that information. Any solutions?
BeautifulSoup is for parsing HTML once you've already fetched it. You can fetch the HTML using any standard URL-fetching library; I prefer curl, and since you tagged your post python, the built-in urllib2 also works well.
If you're saying that the response HTML after logging in is the same as for users who are not logged in, my guess is that your login is failing for some reason. If you are using urllib2, are you making sure to store the cookie properly after your first login and then pass that cookie along when you send the request for the data?
It would help if you posted the code you are using to make the two requests (the initial login, and the attempt to fetch the data).
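For the urllib2 route, a minimal cookie-handling sketch along those lines. The login URL and form field names are placeholders (Morningstar's real ones will differ), and the try/except import just lets the same sketch run under Python 2, as in the question, or Python 3:

```python
try:  # Python 2, as used in the question
    import urllib2, cookielib
    from urllib import urlencode
except ImportError:  # Python 3 equivalents
    import urllib.request as urllib2
    import http.cookiejar as cookielib
    from urllib.parse import urlencode

cookies = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookies))

# POST the login form through this opener; the session cookie from the
# response is stored in `cookies` and replayed automatically on every
# later request made through the same opener. URL and field names below
# are placeholders -- inspect the real login form to find them:
# form = urlencode({'username': 'me', 'password': 'secret'})
# opener.open('https://www.morningstar.com/login', form.encode('utf-8'))
# html = opener.open('https://www.morningstar.com/protected-page').read()
```

The key point is to reuse the same opener for both requests; a bare urllib2.urlopen() call discards the cookie and gives you the logged-out page again.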
