Using Python mechanize on websites that use DHTML, AJAX, etc.? - python

So, let's say I'm trying to create something that replies to tweets of a certain "hashtag keyword" on twitter (for example "#FirstWorldProblems") I have a script that looks like this:
# apply settings, create a mechanize.Browser, etc.
login() # log into twitter
# at this point we've logged into twitter, now, we will perform navigate to their search page and run a search query:
br.open('http://twitter.com/search?q=' + hashtag)
print(br.response().read()) # print the response
So, what I have above is sort of an abbreviated version to quickly get to the spot giving me trouble.
I set up a browser, log into twitter, all done no problemo. But, then I run a search for the hashtag (using br.open) and then I print the response.
On Twitter, the "Reply" link only appears when you hover over a specific link and leads to "#" (because it opens a little pop-up thing where you can enter your reply), how would I click on the "Reply" link, because it doesn't show up in the response?

If your problem is actually just accessing Twitter, dmedvinsky is probably right.
However, if you really want to be able to scrape websites (while allowing their javascript to run as it normally would..) you'll probably want something a bit more robust.
While it's a lot of baggage, I strongly urge you to grab Qt, PySide, and get familiar with QWebKit. You can drive a 'real' web browser from Python and get all the benefits (and problems;) one would expect. But, so far it's the best and cleanest method I've found to do what you're asking about.
http://qt.nokia.com/
http://www.pyside.org/

Related

How to scrape infinitely scrolling websites with login using python request (or similar)

I would like to scrape a website that does not have an API and is an "infinite scroller". I have been using selenium for this, but now I need to scrape a lot more pages and do that all at once. The problem is that selenium is very resource-dependant since I am running a full (headless) chrome browser in each instance and also not stable at all (probably because of limited resources but still). I know that there is a way to look for ajax requests that the site uses and access it with requests library, but I have two issues:
I can't seem to find the desired request
The ones that I try to use with requests library require the user to be logged in and I have no idea how to do that (maybe pass cookies and whatnot, I am not a web developer).
Let me take Twitter as an example since it is exactly the as what I am describing (except it has an API). You have to log in and then the feed is loaded infinitely. So the goal is to "scroll" and take the content of each tweet. How can this be done? If you can, please, provide a working example.
Thank you.

Python Data extraction from a pop-up window

I'm trying to get a specific data from a website, but this is a little bit complicated to understand so here is some images.
So, first, I'm on this page,
Image1
then I click on the icon in the middle and something pop,
popup
then I have to click on this,
almost there
And finally I land here
arrival
And I want to get all the names of the people here
So, my question is, is there a way to get directly this list with a requests ?
If yes, how do i have do to ? I can't find the URL of this kind of pop up and I'm a complete beginner with requests and all this kind of things..
(To get the name, I have to be connected on my account by the way)
So, since I don't know how to access to the pop-up windows, this is the only code I got :
import requests
x = requests.get('https://www.tiktok.com/#programm___r?lang=en', headers={'User-Agent':'test'})
print(x.text)
I checked what it prints, and i didn't see a sign of the pop-up window
you can get some sort of network interception tool like Burpsuite and watch the network traffic that comes through each time you click on each link along the way to your final destination, this should give you an endpoint you may be able to send your request too. I think this network information should also be available in the browser tools but I'm not sure. A potential issue here is that usually tokens and other information has to be passed down the chain along the way, which might make scripting something like this too hard.
So aside from that, with browser automation software like selenium, you could automate the process of getting to that point on the page, and be able to pull out the list you want once you're there. I've used selenium myself and it's really usable and well documented!

To redirect a twitter page through the Chinese firewall

My younger brother, who still lives in China is a fan of Michael Phelps. He wants to see his twitter posts. Since they can't access twitter behind the GFW and setting up a VPN is too hard for my mom. I want to write something that grabs the twitter and sends them to my mom's email.
I use python as my main language. Familiar with tweepy / request / scrapy
I have tried or thought about three ways of doing this:
Use the twitter API and grabs the user_timeline. However, this method will lost all graphical data and throws a bunch of useless links that are only visible after proper rendering
Do a web scraping and save the html content. Then send the html file as an attachment. However, this method still loses some graphical contents and is not that user friendly to someone in her 40s. In addition, it will be kinda hard to tell how many tweets I have scraped and if there's any updates.
Wrap the html content in the email and use html rendering within the email. I haven't work with this before so I am not exactly sure how its gonna work out.
I am aware that "what's the best way to do this" kinda question is always downvoted on SO but I do believe this problem is particular enough to engage meaningful Q&As. Any suggestion will be appreciated.
Have you thought of using selenium and taking screen shots of the browser window? Taking a screen shot with selenium is as easy as
browser.get('twitter.com')
browser.get_screenshot_as_file('twitter_screenshot.png')
You'd have to figure out a way to automate both watching for new tweets and running the selenium script when a new tweet is found. However in terms of preserving graphical content, taking screenshots w/ Selenium would be simple to implement.
Docs: http://selenium-python.readthedocs.io/api.html#selenium.webdriver.remote.webdriver.WebDriver.get_screenshot_as_file

How do I make Python urlib2 to cleverly avoid the security check while trying to log into a site?

I am trying to crawl a website for the first time. I am using urllib2 Python
I am currently trying to log into Foursquare social networking site using Python urlib2 and Beautifulsoup. To view a particular page, I need to provide username and password.
So,I followed the Basic Authentication described on the ducumentation page.
I guess, everything worked well, but the site throws up a security check asking me to type a text (capcha), before sending me the required page. It obviously looks like, the site is detecting that, a page is being requested not by a human, but a crawler.
So, what is the way, to avoid being detected. How to make urllib2 get the desired page, without having to stop at the security check? Pls help..
You probably want to use foursquare API instead.
You have to use the foursquare API. I guess, there is no other way. API are designed for such purposes.
Crawlers depending solely on the HTML format of the page will fail in the furture when the HTML page changes

Parsing from a website -- source code does not contain the info I need

I'm a little new to web crawlers and such, though I've been programming for a year already. So please bear with me as I try to explain my problem here.
I'm parsing info from Yahoo! News, and I've managed to get most of what I want, but there's a little portion that has stumped me.
For example: http://news.yahoo.com/record-nm-blaze-test-forest-management-225730172.html
I want to get the numbers beside the thumbs up and thumbs down icons in the comments. When I use "Inspect Element" in my Chrome browser, I can clearly see the things that I have to look for - namely, an em tag under the div class 'ugccmt-rate'. However, I'm not able to find this in my python program. In trying to track down the root of the problem, I clicked to view source of the page, and it seems that this tag is not there. Do you guys know how I should approach this problem? Does this have something to do with the javascript on the page that displays the info only after it runs? I'd appreciate some pointers in the right direction.
Thanks.
The page is being generated via JavaScript.
Check if there is a mobile version of the website first. If not, check for any APIs or RSS/Atom feeds. If there's nothing else, you'll either have to manually figure out what the JavaScript is loading and from where, or use Selenium to automate a browser that renders the JavaScript for you for parsing.
Using the Web Console in Firefox you can pretty easily see what requests the page is actually making as it runs its scripts, and figure out what URI returns the data you want. Then you can request that URI directly in your Python script and tease the data out of it. It is probably in a format that Python already has a library to parse, such as JSON.
Yahoo! may have some stuff on their server side to try to prevent you from accessing these data files in a script, such as checking the browser (user-agent header), cookies, or referrer. These can all be faked with enough perseverance, but you should take their existence as a sign that you should tread lightly. (They may also limit the number of requests you can make in a given time period, which is impossible to get around.)

Categories