Deferred Downloading using Python Requests Library - python

I am trying to fetch some information from Workflowy using Python Requests Library. Basically I am trying to programmatically get the content under this URL: https://workflowy.com/s/XCL9FCaH1b
The problem is that Workflowy goes through a 'loading phase' before the actual content is displayed when I visit this website, so I end up getting the content of the 'loading' page from the request. Basically, I need a way to defer getting the content so I can bypass the loading phase.
It seemed like the Requests library addresses this problem here: http://www.python-requests.org/en/latest/user/advanced/#body-content-workflow but I couldn't get that example to work for my purposes.
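From what I can tell, the workflow described in those docs only defers downloading the response body over the network; it does not wait for any JavaScript on the page to run. A minimal sketch of that documented pattern:
import requests

r = requests.get("https://workflowy.com/s/XCL9FCaH1b", stream=True)
# Only the headers have been fetched at this point; the body is
# downloaded on demand, e.g. chunk by chunk:
for chunk in r.iter_content(chunk_size=1024):
    pass  # nothing here executes the page's JavaScript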
Here is the super simple block of code that ends up getting the 'loading page':
import requests
path = "https://workflowy.com/s/XCL9FCaH1b"
r = requests.get(path, stream=True)
print(r.content)
Note that I don't have to use Requests; I just picked it because it looked like it might offer a solution to my problem. Also, I am currently using Python 2.7.
Thanks a lot for your time!

Related

How to retrieve html with grequests

While doing some research on Python web scraping, I learned of a package named grequests, which is said to send parallel HTTP requests and thus be faster than the normal Python requests module. Well, that sounds great, but I was not able to get the HTML of the web pages I requested, as there is no .text method like in the normal requests module. Any help would be great!
The grequests.imap function returns a generator of responses, so you'll need to loop over it (or convert it to a list first if you want to index into it):
responses = grequests.imap(session)  # session: an iterable of unsent grequests requests
for response in responses:
    print(response.text)
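A more self-contained sketch (the URLs are placeholders): grequests.get builds unsent requests, and imap fires them concurrently, yielding responses as they complete:
import grequests

urls = ['http://example.com', 'http://example.org']
pending = (grequests.get(u) for u in urls)  # unsent requests
for response in grequests.imap(pending, size=2):
    print(response.text)  # same Response objects as plain requests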

Asynchronous URL un-shortening in Python

I am currently trying to implement a feature in my program that will detect and unshorten URLs from any URL shortener, including bit.ly and old goo.gl links (now defunct). I have found a few articles, and I am going to discuss my current experiments and findings, and ask: is there even a way to do it?
I started off by reading the articles I had found, including a Stack Overflow question on how to un-shorten URLs using Python. The answer pointed to the requests library, using requests.head with allow_redirects set to True. However, requests does not work with asyncio at all, which is where I found a question about async requests with the Python requests library (found here).
That question pointed to grequests, an async version of requests. However, when I attempted the code from the first question, replacing requests with grequests, it did not show the link location after redirects. I then changed .head to .get, and while that did work, it still returned the bit.ly URL I was using rather than the un-shortened URL.
I am unsure what I could use to find the URL location after unshortening without making it synchronous rather than async. If anyone can help, that would be really useful!
A good library that I would recommend is aiohttp, which allows for asynchronous web requests.
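A minimal sketch of un-shortening with aiohttp (the short link is a placeholder); once redirects have been followed, resp.url holds the final URL:
import asyncio
import aiohttp

async def unshorten(url):
    async with aiohttp.ClientSession() as session:
        # HEAD is enough here; allow_redirects=True follows the whole chain
        async with session.head(url, allow_redirects=True) as resp:
            return str(resp.url)  # the final URL after redirects

print(asyncio.run(unshorten('https://bit.ly/example')))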
Try this, and then run it as a loop over your data frame using .apply(lambda):
import requests

def unshortenurlx(url):
    try:
        # requests follows redirects by default; response.url is the final URL
        response = requests.get(url)
        return response.url
    except Exception as e:
        return 'Bad url {url}. {e}'.format(url=url, e=e)
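Assuming your data frame has a column of shortened URLs (the column name here is hypothetical), the apply would look like:
df['unshortened'] = df['url'].apply(unshortenurlx)
Note that this approach stays synchronous: each request blocks until it finishes, unlike the aiohttp version above.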

Accessing Indeed through Python

My goal for this Python code is to build a way to collect job information into a folder. The first step is already unsuccessful: when running the code, I want the URL to print https://www.indeed.com/, but instead the code returns https://secure.indeed.com/account/login. I am open to using urllib or cookielib to resolve this ongoing issue.
import requests
import urllib
data = {
    'action': 'Login',
    '__email': 'email@gmail.com',
    '__password': 'password',
    'remember': '1',
    'hl': 'en',
    'continue': '/account/view?hl=en',
}
response = requests.get('https://secure.indeed.com/account/login', data=data)
print(response.url)
If you're trying to scrape information from Indeed, you should use the Selenium library for Python.
https://pypi.python.org/pypi/selenium
You can then write your program within the context of a real user browsing the site normally.
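A rough sketch of that approach (the field names are taken from your form data; the submit-button selector is a guess and may need adjusting to Indeed's real login form):
from selenium import webdriver

driver = webdriver.Chrome()  # assumes chromedriver is on your PATH
driver.get('https://secure.indeed.com/account/login')
driver.find_element_by_name('__email').send_keys('email@gmail.com')
driver.find_element_by_name('__password').send_keys('password')
driver.find_element_by_css_selector('button[type="submit"]').click()
print(driver.current_url)  # should leave the login page after a successful login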

Unable to scrape data from dynamic page - Python Requests

I am working to scrape the reports from this site: I hit the home page, enter a report date, and hit submit. The page is Ajax-enabled, and I cannot figure out how to get the report table. Any help will be really appreciated.
https://www.theice.com/marketdata/reports/176
I tried sending GET and POST requests using the requests module, but failed with Session Timeout or Report Not Available.
EDIT:
Steps Taken so far:
import requests

URL = "theice.com/marketdata/reports/datawarehouse/..."
with requests.Session() as sess:
    # Got 'selectionForm' by analyzing GET requests to URL
    f = sess.get(URL, params={'selectionForm': ''})
    data = {'criteria.ReportDate': --, **few more params I got from hitting submit}
    f = sess.post(URL, data=data)
    f.text  # Session timeout / No Reports Found
Since you've already identified that the data you're looking to scrape is hidden behind some AJAX calls, you're already on your way to solving this problem.
At the moment, you're using python-requests for HTTP, but that is pretty much all it does. It does not handle executing JavaScript or any other items that involve scanning the content and executing code in another language runtime. For that, you'll need to use something like Mechanize or Selenium to load those websites, interact with the JavaScript, and then scrape the data you're looking for.
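A bare-bones sketch of the Selenium route (the fixed sleep is crude; a real script would use WebDriverWait on a specific element of the report table):
import time
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://www.theice.com/marketdata/reports/176')
time.sleep(10)  # give the AJAX calls time to populate the report table
html = driver.page_source  # the rendered DOM, including AJAX-loaded content
driver.quit()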

python downloading data (urllib, urllib2)

I have a link like this that points directly to an mp3 file. When I put it in my browser, it asks me whether I want to download the file; however, when I do the same thing with Python using the following code:
data = urllib2.urlopen("http://www23.zippyshare.com/d/44123087/497548/Lil%20Wayne%20ft.%20Eminem%20-%20Drop%20The%20World.mp3").read()
I am redirected to another link, so instead of the MP3 data I am getting the HTML code for
'http://www23.zippyshare.com/v/44123087/file.html'
Any ideas?
Thanks!
urllib2 handles redirection transparently. You might want to see what the server is actually doing when it presents such a redirect while still allowing you to download. You could subclass the redirect handler to see which header property is giving you the URL, and then use urlretrieve to download from it.
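A sketch of that redirect-handler idea (Python 2; the placeholder link stands in for your mp3 URL). The Location header is what carries the redirect target:
import urllib
import urllib2

class LoggingRedirectHandler(urllib2.HTTPRedirectHandler):
    def http_error_302(self, req, fp, code, msg, headers):
        # print where the server is sending us before following the redirect
        print('Redirected to: ' + str(headers.getheader('Location')))
        return urllib2.HTTPRedirectHandler.http_error_302(
            self, req, fp, code, msg, headers)

opener = urllib2.build_opener(LoggingRedirectHandler())
opener.open('yourmp3filelink')
# once you know the real file URL: urllib.urlretrieve(file_url, 'file.mp3')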
Setting the cookies explicitly might be a good thing to try as well:
import cookielib, urllib2

# carry cookies across the redirect chain
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
opener.open('yourmp3filelink')
Your link redirects to an HTML webpage, most likely because your download request is timing out. That's often how these download websites work: you never get a static link to the download, only a temporarily assigned link.
My guess is that there's no way to get a static link using that website; you'd have to know where the file was actually coming from.
So no, nothing is wrong with your Python code, just your source.
