I am trying to write a Python script that will check every hour for open seats in the organic chemistry class at my university and email me if it finds one, so that I can then log in and register for it (I am not quite good enough to have it register me automatically). I think that I have gotten the login process correct, but when I navigate manually to the correct page to get the URL, it has the same URL as the page before it, which doesn't make much sense.

How can I find the correct URL, or if I need to POST, how do I find the right spot and the right commands? Can I navigate within the website without opening a browser, since this will be running hourly in the background while I am trying to use the computer?

Unfortunately I cannot share the website, since it contains personal information.
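One common way to find the right request, with no browser needed at run time: the URL often stays the same because the page is fetched with a POST or updated by JavaScript, but the browser's developer tools (Network tab) show the actual request either way. Watch the Network tab while logging in and loading the seat page, then replicate those requests with a requests.Session, which carries the login cookies between calls. A minimal sketch, where every URL, form field, and page check is a hypothetical placeholder to be replaced with what the Network tab shows:

import smtplib
from email.mime.text import MIMEText

import requests

# Hypothetical placeholders - copy the real values from the Network tab
LOGIN_URL = "https://registrar.example.edu/login"
SEATS_URL = "https://registrar.example.edu/courses/CHEM2211"

session = requests.Session()  # keeps cookies across requests, like a browser
session.post(LOGIN_URL, data={"username": "me", "password": "secret"})

page = session.get(SEATS_URL)
if "0 seats available" not in page.text:  # placeholder check; match the real page text
    msg = MIMEText("A seat opened up in organic chemistry - go register!")
    msg["Subject"] = "Seat alert"
    msg["From"] = "me@example.com"
    msg["To"] = "me@example.com"
    with smtplib.SMTP("localhost") as server:  # assumes a local mail server
        server.send_message(msg)

Run from cron (or the Windows Task Scheduler) every hour, a script like this never needs to open a browser window.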
I have used many of the popular libraries to read HTML, and I have seen variations on this problem on this site, but have had no success. I go to a website, call it www.google.com. I enter my password and I am still at www.google.com. While on this website, after logging in, I hit the F12 key to view the HTML. Inside the HTML, I see a number that changes at a rate of about 2 Hz. This number is also shown on the webpage, and I want to record this changing value over a period of time.

I have tried to view that page using Python, but the result is a new login page of the same name. I am currently using an OCR system, but it is slow and sometimes reads an incorrect value. This website is a service that I pay a little to use. I could send requests for the value, but they get angry because the site cannot handle many requests. Is there a way for me to simply read the HTML in Python? I have tried to use Python to open the site, but the site can tell it is not a person.
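If the number updates in place, the page is almost certainly polling some endpoint with JavaScript, and the browser's Network tab (F12 again) will show which one. A minimal sketch of reading that endpoint directly, where the endpoint URL and the cookie name/value are hypothetical placeholders copied from your logged-in browser:

import time

import requests

session = requests.Session()
session.headers["User-Agent"] = "Mozilla/5.0"  # look like a browser, not a script
# Copy the real session cookie from your logged-in browser (hypothetical name here)
session.cookies.set("sessionid", "value-from-your-browser")

readings = []
for _ in range(60):  # record for about 30 seconds
    response = session.get("https://example.com/live/value")  # hypothetical endpoint
    readings.append(response.text)
    time.sleep(0.5)  # the value changes at roughly 2 Hz, so poll no faster than that
print(readings)

This only sends the same requests the page itself is already making, so it should be no heavier on their server than leaving the page open in a browser.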
I'm new to coding and trying to use Selenium with Python to click through a website and fill a shopping cart. I've got things working well except for the random ForeSee survey popup. When it appears (and it doesn't always appear in the same location), my code stops working at that point.
I read the ForeSee documentation and it says "...when the invitation is displayed, the fsr.r...cookie is dropped. This cookie prevents a user from being invited again for X days (default 90)."
Hoping for a quick fix, I created a separate Firefox profile, ran through the website, and got the ForeSee popup invitation; after that, there was no more popup when using that profile manually. But I still get the popup when using Selenium.
I used this code:
fp = webdriver.FirefoxProfile(r'C:\path\to\profile')  # raw string so \t etc. aren't treated as escapes
browser = webdriver.Firefox(firefox_profile=fp)
EDIT: I got the cookie working. I was using the Local folder instead of the Roaming folder in C:\path\to\profile. Using the Roaming folder solved the problem.
My question edited to delete the part about the cookie not working:
Can someone suggest code to permanently handle the ForeSee pop up that appears randomly and on random pages?
I'm using Protractor with JS, so I can't give you actual code to handle the issue, but I can give you an idea of how to approach this.
In a nutshell
When the following script is executed in the browser's console -
window.FSR.setFSRVisibility(true);
it makes the ForeSee popup appear behind the rest of the HTML elements, and it no longer affects UI tests.
So my Protractor script looks like this:
await browser.executeScript(
`window.FSR.setFSRVisibility(true);`
);
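The question is using Python rather than Protractor; the same call there would presumably be a one-line execute_script. A minimal sketch (it assumes the page has already loaded ForeSee's FSR object):

from selenium import webdriver

browser = webdriver.Firefox()
browser.get("https://example.com")  # hypothetical page that embeds ForeSee
# Push the ForeSee invitation behind the other elements so it can't block clicks
browser.execute_script("window.FSR.setFSRVisibility(true);")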
Theory
So ForeSee is one of those services that can be integrated with any web app; it pulls JS code from its API and changes the HTML of your app by executing code in the scope of the website. Another example of such a company is WalkMe.
Obviously, in the modern world, if these guys can overlay a webpage, they should have a configuration option to disable it (at least for lower environments), and they actually do; what I mentioned as a solution came from this page. But assuming they didn't have such an option, one could reach out to their support and ask how to work around their popups. Even if they didn't have such an option, they would gladly consider it as a feature for improvement.
I want to scrape data from a website where I have to log in first. The problem is that there is robot protection too (I have to verify that I am not a robot, plus a reCAPTCHA), and my chance of success (passing the captcha) is only about 30%, which is horrible for me.

Is there perhaps another possibility: log in with my browser (for example Chrome or Firefox), and then use that session ID in my Python script to scrape the data automatically?

To put it more simply: I want to scrape tables from a website, so I have to log in first. The ~30% success rate is not good enough for me, so I hope there is another possibility: log in manually, and then reuse that session in Python.

After logging in, there is a textbox on the page where I type what I want to search; it then navigates to the page where I find the table and the data.

Any ideas, or is this possible at all?

(Right now I only have a script where I have to download the HTML of the data page and then change some names in the code manually. It is a huge waste of time, and I hope I can automate it further.) - Python 2.7
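A minimal sketch of that session-reuse idea, assuming the site keeps you logged in with a session cookie. The cookie name and the URLs are hypothetical; copy the real values out of your browser's developer tools (Network or Storage tab) after logging in manually:

import requests

session = requests.Session()
session.headers["User-Agent"] = "Mozilla/5.0"  # match your browser's header
# Hypothetical cookie name - use whatever your browser shows after login
session.cookies.set("PHPSESSID", "paste-the-value-from-your-browser")

# Reuse the logged-in session to run the search directly
response = session.get("https://example.com/search",  # hypothetical search URL
                       params={"q": "what I would type in the textbox"})
print(response.text)  # the page that should contain the table

The cookie usually expires after some hours or days, so the value has to be refreshed now and then; and if the search is really a form POST, replicate it with session.post and the form's field names instead.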
In my line of work, I often need to look at campaign disclosure reports for my state from ethics.ga.gov. However, the state system is one of the shittiest webapps I've ever dealt with.
It only provides contribution data per report. There are six reports per election cycle. And to add insult to injury, the system is slow: not only do you have to download a shit ton of files, you have to wait a good minute for the damn thing to generate each one.

This is an obvious opportunity to automate the process. What I had planned on doing is writing a program where I can input the URL of the page that links to all disclosure reports, and it will download all the contribution reports.
For a given candidate, I would input a link to this page - http://media.ethics.ga.gov/Search/Campaign/Campaign_Name.aspx?NameID=5753&FilerID=C2009000086&Type=candidate (the view report links are in the dropdown list titled "campaign contribution reports"). I then plan on following each of those links to the report page, following that link to the contributions page, and downloading the csv file. Once I have the csv file, (I think) the project comes under the scope of my coding ability.
The problem I am stuck on right now is that I can't figure out how to follow the "View Report" links. The system is written in ASP. The links call a JavaScript postback function; ctl02 is the identifier of the control. It appears that the information to map that control identifier to the URL I need (in this case http://media.ethics.ga.gov/search/Campaign/Campaign_ReportOptions.aspx?NameID=5753&FilerID=C2009000086&CDRID=85776) is embedded in an encrypted __VIEWSTATE field.
I installed the Firebug debugger to try to get the data that way. While I am very new to Firebug, all I could find is that the Net tab shows a GET request to the URL that I need.
Obviously, somehow my browser is getting the next page, which means this should be automatable, but I am now at a loss. I've been working this up in Python because I'm really starting to like it, but everything's negotiable. I am doing this on a Mac (with a full GNU environment), and would prefer to keep working in the environment I am familiar with, but I do have a Windows XP VM with Visual C++ 2010 if I have to go that route.
What do y'all think?
Turns out the data wasn't in the encrypted __VIEWSTATE at all. There was a POST operation that Firebug was clearing on a redirect (despite being set not to clear things). I ran it with the Chrome dev console instead, captured the POST data, and replicated the POST operation in my application. That got me the URL I was looking for.
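For anyone reading along, a minimal sketch of replicating that kind of ASP.NET postback with requests. The hidden-field names are the standard ASP.NET ones, but the __EVENTTARGET value here is a hypothetical placeholder: use the control identifier from the captured POST data.

import requests
from bs4 import BeautifulSoup

url = ("http://media.ethics.ga.gov/Search/Campaign/Campaign_Name.aspx"
       "?NameID=5753&FilerID=C2009000086&Type=candidate")

session = requests.Session()
soup = BeautifulSoup(session.get(url).text, "html.parser")

# ASP.NET round-trips its state in hidden inputs; echo back whichever are present
data = {"__EVENTTARGET": "ctl02",  # hypothetical; copy from the captured POST
        "__EVENTARGUMENT": ""}
for name in ("__VIEWSTATE", "__VIEWSTATEGENERATOR", "__EVENTVALIDATION"):
    field = soup.find("input", {"name": name})
    if field is not None:
        data[name] = field["value"]

response = session.post(url, data=data)
print(response.url)  # after the server's redirect, this should be the report URL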
Thanks to everyone that looked at this!
I'm a little new to web crawlers and such, though I've been programming for a year already. So please bear with me as I try to explain my problem here.
I'm parsing info from Yahoo! News, and I've managed to get most of what I want, but there's a little portion that has stumped me.
For example: http://news.yahoo.com/record-nm-blaze-test-forest-management-225730172.html
I want to get the numbers beside the thumbs up and thumbs down icons in the comments. When I use "Inspect Element" in my Chrome browser, I can clearly see what I have to look for - namely, an em tag under the div class 'ugccmt-rate'. However, I'm not able to find this in my Python program. Trying to track down the root of the problem, I viewed the page source, and it seems that this tag is not there. Do you know how I should approach this problem? Does this have something to do with JavaScript on the page that displays the info only after it runs? I'd appreciate some pointers in the right direction.
Thanks.
The page is being generated via JavaScript.
Check if there is a mobile version of the website first. If not, check for any APIs or RSS/Atom feeds. If there's nothing else, you'll either have to figure out manually what the JavaScript is loading and from where, or use Selenium to automate a browser that renders the JavaScript for you so you can parse the result.
Using the Web Console in Firefox you can pretty easily see what requests the page is actually making as it runs its scripts, and figure out what URI returns the data you want. Then you can request that URI directly in your Python script and tease the data out of it. It is probably in a format that Python already has a library to parse, such as JSON.
Yahoo! may have some checks on their server side to try to prevent you from accessing these data files from a script, such as checking the browser (User-Agent header), cookies, or referrer. These can all be faked with enough perseverance, but you should take their existence as a sign that you should tread lightly. (They may also limit the number of requests you can make in a given time period, which is much harder to get around.)
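A minimal sketch of that approach, assuming the Web Console showed a JSON endpoint for the comment ratings. The URL and its parameters here are invented placeholders: substitute whatever the console actually shows.

import requests

# Invented placeholder - use the real URI from the browser's network tools
url = "http://news.yahoo.com/_api/comment-ratings"
headers = {
    "User-Agent": "Mozilla/5.0",  # look like a normal browser
    "Referer": "http://news.yahoo.com/record-nm-blaze-test-forest-management-225730172.html",
}

response = requests.get(url, params={"article": "225730172"}, headers=headers)
data = response.json()  # probably JSON; inspect the real response to be sure
print(data)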