Data extraction from a pop-up window in Python

I'm trying to get specific data from a website, but it's a little complicated to explain, so here are some screenshots.
First, I'm on this page:
[Image1]
Then I click on the icon in the middle and a window pops up:
[popup]
Then I have to click on this:
[almost there]
And finally I land here:
[arrival]
I want to get all the names of the people in that list.
So, my question is: is there a way to get this list directly with requests?
If yes, how do I do it? I can't find the URL of this kind of pop-up, and I'm a complete beginner with requests and this kind of thing.
(To see the names, I have to be logged into my account, by the way.)
Since I don't know how to access the pop-up window, this is the only code I have:
import requests

# Note: the original URL had '#' before the username; a TikTok profile URL
# uses '@', and anything after '#' is a fragment that is never sent to the server.
x = requests.get('https://www.tiktok.com/@programm___r?lang=en',
                 headers={'User-Agent': 'test'})
print(x.text)
I checked what it prints, and I didn't see any sign of the pop-up window.

You can use a network interception tool like Burp Suite and watch the traffic that goes through each time you click a link on the way to your final destination; this should give you an endpoint you may be able to send your request to. The same information is also available in the browser's developer tools, in the Network tab. A potential issue is that tokens and other state often have to be passed down the chain along the way, which can make scripting something like this quite hard.
Aside from that, with browser automation software like Selenium you could automate the process of getting to that point on the page and pull out the list you want once you're there; see the sketch below. I've used Selenium myself and it's very usable and well documented!
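For the second approach, a minimal Selenium sketch, assuming you are already logged in; every selector below is hypothetical, since TikTok's real markup has to be read out of the developer tools:

from selenium import webdriver
from selenium.webdriver.common.by import By
import time

driver = webdriver.Firefox()
driver.get('https://www.tiktok.com/@programm___r?lang=en')

# ... log in here (or reuse a browser profile that is already logged in) ...

# Click the icon that opens the pop-up (hypothetical selector):
driver.find_element(By.CSS_SELECTOR, 'span.popup-icon').click()
time.sleep(2)  # crude wait; WebDriverWait would be more robust

# Read the names out of the rendered pop-up (hypothetical selector):
names = [el.text for el in
         driver.find_elements(By.CSS_SELECTOR, 'div.user-list p.name')]
print(names)

driver.quit()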

Related

Handle random ForeSee popup using Python and Selenium

I'm new to coding and trying to use Selenium with Python to click through a website and fill a shopping cart. I've got things working well except for the random ForeSee survey popup. When it appears (and it doesn't always appear in the same location), my code stops working at that point.
I read the ForeSee documentation and it says "...when the invitation is displayed, the fsr.r...cookie is dropped. This cookie prevents a user from being invited again for X days (default 90)."
Hoping for a quick fix, I created a separate Firefox profile and ran through the website until I got the ForeSee pop-up invitation; after that, no more pop-ups when using that profile manually. But I still get the pop-up when using Selenium.
I used this code:
from selenium import webdriver

# Use a raw string: in 'C:\path\to\profile' the '\t' would be read as a tab.
fp = webdriver.FirefoxProfile(r'C:\path\to\profile')
browser = webdriver.Firefox(firefox_profile=fp)
EDIT: I got the cookie working. I was using the Local folder instead of the Roaming folder in C:\path\to\profile. Using the Roaming folder solved that problem.
My question, edited to remove the part about the cookie:
Can someone suggest code to permanently handle the ForeSee pop-up that appears randomly and on random pages?
I'm using Protractor with JS, so I can't give you actual Python code to handle the issue, but I can give you an idea of how to approach it.
In a nutshell
When the following script is executed in the browser's console:
window.FSR.setFSRVisibility(true);
it makes the ForeSee popup appear behind the rest of the HTML elements, so it no longer affects UI tests.
So my Protractor script looks like this:
await browser.executeScript(
    `window.FSR.setFSRVisibility(true);`
);
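In Python, a rough Selenium equivalent (assuming the page exposes window.FSR as described; the URL is a placeholder) would be:

from selenium import webdriver

driver = webdriver.Firefox()
driver.get('https://the-site-under-test.example')  # placeholder URL
# Push the ForeSee overlay behind the page so it can't block clicks:
driver.execute_script('window.FSR.setFSRVisibility(true);')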
Theory
ForeSee is one of those services that can be integrated with any web app: it pulls JS code from their API and changes the HTML of your app by executing code in the scope of the website. Another example of such a company is WalkMe.
Obviously, in the modern world, if these guys can overlay a webpage, they should have a configuration option to disable it (at least for lower environments), and they actually do; what I mentioned as the solution came from this page. But even if they didn't have such an option, one could reach out to their support and ask how to work around the popups, and they would likely consider it as a feature request.

Parsing from a website -- source code does not contain the info I need

I'm a little new to web crawlers and such, though I've been programming for a year already. So please bear with me as I try to explain my problem here.
I'm parsing info from Yahoo! News, and I've managed to get most of what I want, but there's a little portion that has stumped me.
For example: http://news.yahoo.com/record-nm-blaze-test-forest-management-225730172.html
I want to get the numbers beside the thumbs up and thumbs down icons in the comments. When I use "Inspect Element" in my Chrome browser, I can clearly see what I have to look for: an em tag under the div class 'ugccmt-rate'. However, I'm not able to find this in my Python program. In trying to track down the root of the problem, I clicked to view the page source, and it seems this tag is not there. How should I approach this problem? Does this have something to do with the JavaScript on the page displaying the info only after it runs? I'd appreciate some pointers in the right direction.
Thanks.
The page is being generated via JavaScript.
Check if there is a mobile version of the website first. If not, check for any APIs or RSS/Atom feeds. If there's nothing else, you'll either have to manually figure out what the JavaScript is loading and from where, or use Selenium to automate a browser that renders the JavaScript for you for parsing.
Using the Web Console in Firefox you can pretty easily see what requests the page is actually making as it runs its scripts, and figure out what URI returns the data you want. Then you can request that URI directly in your Python script and tease the data out of it. It is probably in a format that Python already has a library to parse, such as JSON.
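For example, once the Web Console reveals the endpoint the comment widget calls, the fetch might look like this (the URI and the response shape are hypothetical; the real ones have to be read out of the browser tools):

import requests

# Hypothetical endpoint discovered via the browser's network tools:
url = 'https://news.yahoo.com/_comments_api/ratings?id=...'
resp = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
data = resp.json()  # parse the JSON and tease the counts out of it
print(data)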
Yahoo! may have server-side checks to try to prevent you from accessing these data files from a script, such as checking the browser (User-Agent header), cookies, or referrer. These can all be faked with enough perseverance, but you should take their existence as a sign that you should tread lightly. (They may also rate-limit the number of requests you can make in a given time period, which is much harder to get around.)

How to detect which websites the user is viewing or connecting to

I'm writing a Python application that, among other things, needs to know which websites the user is looking at in the web browser or otherwise connecting to on OS X and, if possible, Linux. This is to track how long the user is accessing certain websites.
I know that on OS X there is a Cocoa call which returns the current page in Safari, but this must also work with Chrome and Firefox at a minimum, ideally with any client, known or unknown to the software.
The first thing I've looked into is pcap via libpcap, which I can use in Python with pylibpcap. pcap is for packet capture, and in theory, as I understand it, I could detect whether packets are flowing to/from certain "black-listed" IP addresses. This would sort of work, but if a static web page were open in the browser and left as is, I would not be able to detect it via this mechanism.
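For concreteness, here is roughly the idea, sketched with Scapy instead of pylibpcap (Scapy also sits on libpcap and is easier to drive from Python; the black-list address is made up, and sniffing needs root privileges):

from scapy.all import IP, sniff

BLACKLIST = {'93.184.216.34'}  # made-up "black-listed" addresses

def check(pkt):
    if IP in pkt and (pkt[IP].src in BLACKLIST or pkt[IP].dst in BLACKLIST):
        print('traffic to/from a watched host:', pkt[IP].src, '->', pkt[IP].dst)

sniff(filter='ip', prn=check, store=False)

As noted, this only sees traffic while it flows; a page that is open but idle generates no packets.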
First, will I even be able to do what I've described above with libpcap? I'm a beginner with network filtering and the like, so I'm not entirely sure.
Second, is there a better way to do this?
(The app TimeSink for OS X has an interesting approach, which is to look at what is displayed in the title bar to decide which website the user is browsing. This is not ideal for me for two reasons: (1) I may not be able to conclusively decide which domain is being visited from the title, and (2) I can only see the title of the active tab.)
Maybe use a Twisted proxy and pass all browser traffic through it?
You would be able to analyse the HTTP headers and extract the relevant information.
Here is an example: https://github.com/nbareil/twisted-proxy
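A minimal sketch of that idea with twisted.web (separate from the linked project): a proxy that prints the Host header of every request passing through it; the browser is then configured to use localhost:8080 as its HTTP proxy. Note this only sees plain HTTP; HTTPS goes through CONNECT tunnels, which this sketch does not handle.

from twisted.internet import reactor
from twisted.web import http, proxy

class LoggingProxyRequest(proxy.ProxyRequest):
    def process(self):
        # Log which host the browser is talking to, then proxy as usual:
        host = self.getHeader(b'host')
        if host:
            print('Browser requested:', host.decode())
        proxy.ProxyRequest.process(self)

class LoggingProxy(proxy.Proxy):
    requestFactory = LoggingProxyRequest

class LoggingProxyFactory(http.HTTPFactory):
    protocol = LoggingProxy

reactor.listenTCP(8080, LoggingProxyFactory())
reactor.run()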

Using Python mechanize on websites that use DHTML, AJAX, etc.?

So, let's say I'm trying to create something that replies to tweets containing a certain hashtag keyword on Twitter (for example, "#FirstWorldProblems"). I have a script that looks like this:
import mechanize

br = mechanize.Browser()  # apply settings (user agent, cookie jar), etc.
login()  # log into twitter (helper defined elsewhere)
# at this point we've logged into twitter; now navigate to the search page
# and run a search query:
br.open('http://twitter.com/search?q=' + hashtag)
print(br.response().read())  # print the response
So, what I have above is an abbreviated version that jumps straight to the part giving me trouble.
I set up a browser and log into Twitter, all done, no problem. But then I run a search for the hashtag (using br.open) and print the response.
On Twitter, the "Reply" link only appears when you hover over a specific tweet, and it leads to "#" (because it opens a little pop-up where you can enter your reply). How would I click on the "Reply" link, since it doesn't show up in the response?
If your problem is actually just accessing Twitter, dmedvinsky is probably right.
However, if you really want to be able to scrape websites while allowing their JavaScript to run as it normally would, you'll need something a bit more robust.
While it's a lot of baggage, I strongly urge you to grab Qt and PySide and get familiar with QWebKit. You can drive a "real" web browser from Python and get all the benefits (and problems ;)) one would expect. So far it's the best and cleanest method I've found to do what you're asking about.
http://qt.nokia.com/
http://www.pyside.org/
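A minimal sketch of that approach (Qt 4-era PySide APIs; the search URL is the one from the question): load the page in a real WebKit engine, let its JavaScript run, then read back the rendered HTML, in which the "Reply" links exist.

import sys
from PySide.QtCore import QUrl
from PySide.QtGui import QApplication
from PySide.QtWebKit import QWebView

app = QApplication(sys.argv)
view = QWebView()

def done(ok):
    # HTML after the page's JavaScript has run:
    print(view.page().mainFrame().toHtml())
    app.quit()

view.loadFinished.connect(done)
view.load(QUrl('http://twitter.com/search?q=%23FirstWorldProblems'))
app.exec_()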

Is there a way to save a captcha image and view it later in python?

I am scripting in Python for some web automation. I know I cannot automate captchas, but here is what I want to do:
I want to automate everything I can up to the captcha. When I open the page (using urllib2) and parse it to find that it contains a captcha, I want to open the captcha using Tkinter. I know I will have to save the image to my hard drive first and then open it, but there is an issue before that: the captcha image on screen is not directly in the source anywhere. There is a variable in the source, inside some JavaScript, that points to another page that has the link to the image, but if you load that middle page, the captcha picture for that link changes, so the image associated with that JavaScript variable is no longer valid. It may be impossible to gather the image using this method, so please enlighten me if you have any ideas on this.
If I use Firebug to load the page, there is a "GET" that is a direct link to the current captcha image I am seeing, and I'm wondering if there is any way to make Python or urllib2 see the "GET"s that happen when a page is loaded, because if that were possible, this would be simple.
Please let me know if you have any suggestions.
Of course the captcha is served by a page that serves a new one each time (if it were repeated, then once it was solved for one fake userid, a spammer could automatically make a million!). I think you need some "screenshot" functionality to capture the image you want. There is no cross-platform way to invoke such functionality, but each platform (or desktop manager, in the case of Linux, BSD, etc.) tends to have one. Or you could automate the browser (e.g. via Selenium RC) to "screenshot" (e.g. "print to PDF") things at the right time. (I believe what you're seeing in Firebug may be misleading you, because it is showing a snapshot at the HTML source or DOM level rather than at the screen/bitmap level.)
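With modern Selenium (rather than the Selenium RC mentioned above), the screenshot approach could look roughly like this; the page URL and the CSS selector for the captcha image are hypothetical:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
driver.get('http://example.com/form-with-captcha')  # hypothetical page

# Screenshot just the rendered captcha element (hypothetical selector),
# exactly as the browser is currently displaying it:
driver.find_element(By.CSS_SELECTOR, 'img.captcha').screenshot('captcha.png')
driver.quit()

The saved file can then be shown with Tkinter, as the question proposes:

import tkinter as tk

root = tk.Tk()
img = tk.PhotoImage(file='captcha.png')  # Tk 8.6+ reads PNG natively
tk.Label(root, image=img).pack()
root.mainloop()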
