Optimize my Python bank web scraper - python

I am using Python 3.4 to make a web scraper that logs in to my bank account, clicks into each account to copy the balance, adds up the total, then pastes it into Google Sheets.
I got it working, but as you can see from the code, it is repetitive, ugly and long-winded.
I have identified a few issues:
I believe I should be using a function to loop through the different account pages to get each balance and then assign the values to different variables. However, I couldn't think of a way to get this done.
Converting the string to a float seems messy. What I am trying to do is turn a string like $1,000.00 into a float by stripping the '$' and ','. Is there a more elegant way?
from selenium import webdriver
import time
import bs4
import gspread
from oauth2client.service_account import ServiceAccountCredentials

driver = webdriver.Chrome()
driver.get(bank_url)  # placeholder for the bank's login URL
inputElement = driver.find_element_by_id("dUsername")
inputElement.send_keys('username')
pwdElement = driver.find_element_by_id("password")
pwdElement.send_keys('password')
driver.find_element_by_id('loginBtn').click()
time.sleep(3)

# copies saving account balance
driver.find_element_by_link_text('Savings').click()
time.sleep(3)
html = driver.page_source
soup = bs4.BeautifulSoup(html, 'html.parser')
elems = soup.select('#CurrentBalanceAmount')
SavingsAcc = float(elems[0].getText().strip('$').replace(',', ''))
driver.back()

# copy cheque balance
driver.find_element_by_link_text('cheque').click()
time.sleep(3)
html = driver.page_source
soup = bs4.BeautifulSoup(html, 'html.parser')
elems = soup.select('#CurrentBalanceAmount')
ChequeAcc = float(elems[0].getText().strip('$').replace(',', ''))
Total = SavingsAcc + ChequeAcc
driver.back()

Try the following code:
from selenium import webdriver
import time
import bs4
import gspread
from oauth2client.service_account import ServiceAccountCredentials

driver = webdriver.Chrome()
driver.get(bank_url)  # placeholder for the bank's login URL
inputElement = driver.find_element_by_id("dUsername")
inputElement.send_keys('username')
pwdElement = driver.find_element_by_id("password")
pwdElement.send_keys('password')
driver.find_element_by_id('loginBtn').click()
time.sleep(3)

def getBalance(accountType):
    driver.find_element_by_link_text(accountType).click()
    time.sleep(3)
    html = driver.page_source
    soup = bs4.BeautifulSoup(html, 'html.parser')
    elems = soup.select('#CurrentBalanceAmount')
    return float(elems[0].getText().strip('$').replace(',', ''))

# copies saving account balance
SavingsAcc = getBalance('Savings')
driver.back()
# copy cheque balance
ChequeAcc = getBalance('cheque')
Total = SavingsAcc + ChequeAcc
driver.back()
I made a method getBalance where you pass in the account type and it returns the balance amount.
Note: you can keep the driver.back() call inside getBalance if that is more convenient, but it must come before the return statement.
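For example, a sketch of the same function with the back-navigation moved inside:

def getBalance(accountType):
    driver.find_element_by_link_text(accountType).click()
    time.sleep(3)
    soup = bs4.BeautifulSoup(driver.page_source, 'html.parser')
    balance = float(soup.select('#CurrentBalanceAmount')[0].getText().strip('$').replace(',', ''))
    driver.back()  # go back before returning, as noted above
    return balance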
Related to converting the string to a float, I don't know a better way than your existing logic. As it is now moved into a method, I hope it won't trouble you much. There is the float() built-in, which converts a string to a float, but '$' and ',' are not accepted; more details here.
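That said, a regular expression can do the stripping in one step; a minimal sketch (it assumes the text contains at most one decimal point):

import re

def parse_money(text):
    # Keep only digits and the decimal point: '$1,000.00' -> 1000.0
    return float(re.sub(r'[^\d.]', '', text))

parse_money('$1,000.00')  # 1000.0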
Note: if the #CurrentBalanceAmount selector changes for different account types, you can parameterize it just like accountType.

I would use several Python idioms to clean up the code:
Wrap all code in functions
Generally speaking, putting your code in functions makes it easier to read and follow
When you run a python script (python foo.py), the python interpreter runs every line it can, in order, one by one. When it encounters a function definition, it only runs the definition line (def bar():), and not the code within the function.
This article seems like a good place to get more info on it: Understanding Python's Execution Model
Use the if __name__ == "__main__": idiom to make it an importable module
Similar to the above bullet, this gives you more control on how and when your code executes, how portable it is, and how reusable it is.
"Importable module" means you can write your code in one file, and then import that code in another module.
More info on if __name__ == "__main__" here: What does if __name__ == "__main__": do?
Use try/finally to make sure your driver instances get cleaned up
Use explicit waits to interact with the page so you don't need to use sleep
By default, Selenium tries to find and return things immediately. If the element hasn't loaded yet, Selenium throws an exception because it isn't smart enough to wait for it to load.
Explicit waits are built into Selenium, and allow your code to wait for an element to load into the page. By default it checks every half a second to see if the element loaded in. If it hasn't, it simply tries again in another half second. If it has, it returns the element. If it doesn't ever load in, the Wait object throws a TimeoutException.
More here: Explicit and Implicit Waits
And here: WAIT IN SELENIUM PYTHON
Code (untested for obvious reasons):
from selenium import webdriver
from explicit import waiter, ID  # This package makes explicit waits easier to use
                                 # pip install explicit
from selenium.webdriver.common.by import By

# Are any of these needed?
# import time
# import bs4
# import gspread
# from oauth2client.service_account import ServiceAccountCredentials


def bank_login(driver, username, password):
    """Log into the bank account"""
    waiter.find_write(driver, 'dUsername', username, by=ID)
    waiter.find_write(driver, 'password', password, by=ID, send_enter=True)


def get_amount(driver, source):
    """Click the page and scrape the amount"""
    # Click the page in question
    waiter.find_element(driver, source, by=By.LINK_TEXT).click()

    # Why are you using Beautiful Soup? Because it is faster?
    # time.sleep(3)
    # html = driver.page_source
    # soup = bs4.BeautifulSoup(html)
    # elems = soup.select('#CurrentBalanceAmount')
    # SavingsAcc = float(elems[0].getText().strip('$').replace(',',''))
    # driver.back()

    # I would do it this way:
    # When using explicit waits there is no need to explicitly sleep
    amount_str = waiter.find_element(driver, "CurrentBalanceAmount", by=ID).text
    # This conversion scheme will handle non-$ characters too
    amount = float("".join([char for char in amount_str if char in "1234567890."]))
    driver.back()
    return amount


def main():
    driver = webdriver.Chrome()
    try:
        driver.get(bank_url)  # placeholder for the bank's login URL
        bank_login(driver, 'username', 'password')
        print(sum([get_amount(driver, source) for source in ['Savings', 'cheque']]))
    finally:
        driver.quit()  # Use this try/finally idiom to prevent a bunch of dead browser instances


if __name__ == "__main__":
    main()
Full disclosure: I maintain the explicit package. You could replace the waiter calls above with relatively short Wait calls if you would prefer. If you are using Selenium with any regularity it is worth investing the time to understand and use explicit waits.
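For reference, a rough sketch of the equivalent plain-Selenium Wait calls, using the same element IDs assumed above:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def find_visible(driver, locator, timeout=30):
    # Poll the DOM (every 0.5s by default) until the element is visible
    return WebDriverWait(driver, timeout).until(
        EC.visibility_of_element_located(locator)
    )

# e.g. replacing the waiter calls:
# find_visible(driver, (By.ID, 'dUsername')).send_keys('username')
# find_visible(driver, (By.LINK_TEXT, 'Savings')).click()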

Related

How to download PDF from url in python

Note: This is a very different problem compared to other SO answers (Selenium Webdriver: How to Download a PDF File with Python?) available for similar questions.
This is because the URL https://webice.ongc.co.in/pay_adv?TRACKNO=8262# does not directly return the pdf; it in turn makes several other calls, and one of them is the url that returns the pdf file.
I want to be able to call the url with a variable for the query param TRACKNO and to be able to save the pdf file using python.
I was able to do this using selenium, but my code fails to work when the browser is used in headless mode and I need it to work in headless mode. The code that I wrote is as follows:
import requests
from urllib3.exceptions import InsecureRequestWarning
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
import time

def extract_url(driver):
    advice_requests = driver.execute_script("var performance = window.performance || window.mozPerformance || window.msPerformance || window.webkitPerformance || {}; var network = performance.getEntries() || {}; return network;")
    print(advice_requests)
    for request in advice_requests:
        if(request.get('initiatorType',"") == 'object' and request.get('entryType',"") == 'resource'):
            link_split = request['name'].split('-')
            if(link_split[-1] == 'filedownload=X'):
                print("..... Successful")
                return request['name']
    print("..... Failed")

def save_advice(advice_url, tracking_num):
    requests.packages.urllib3.disable_warnings(category=InsecureRequestWarning)
    response = requests.get(advice_url, verify=False)
    with open(f'{tracking_num}.pdf', 'wb') as f:
        f.write(response.content)

def get_payment_advice(tracking_nums):
    options = webdriver.ChromeOptions()
    # options.add_argument('headless')  # DOES NOT WORK IN HEADLESS MODE SO COMMENTED OUT
    driver = webdriver.Chrome(options=options)
    for num in tracking_nums:
        print(num, end=" ")
        driver.get(f'https://webice.ongc.co.in/pay_adv?TRACKNO={num}#')
        try:
            WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.ID, 'ls-highlight-domref')))
            time.sleep(0.1)
            advice_url = extract_url(driver)
            save_advice(advice_url, num)
        except:
            pass
    driver.quit()

get_payment_advice(['8262'])
As can be seen, I get all the network calls that the browser makes in the first line of the extract_url function and then parse each request to find the correct one. However, this does not work in headless mode.
Is there any other way of doing this as this seems like a workaround? If not, can this be fixed to work in headless mode?
I fixed it; I only changed one function. The correct url is in the driver's page_source (with BeautifulSoup you can parse html, xml, etc.):
from bs4 import BeautifulSoup

def extract_url(driver):
    soup = BeautifulSoup(driver.page_source, "html.parser")
    object_element = soup.find("object")
    data = object_element.get("data")
    return f"https://webice.ongc.co.in{data}"
The hostname part can probably be extracted from the driver as well.
I don't think I changed anything else, but if it does not work for you, I can paste the full code.
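For example, a sketch of deriving the host from the driver instead of hard-coding it (assuming the driver is still on the advice page):

from urllib.parse import urlsplit

def base_url(driver):
    # Rebuild 'scheme://host' from whatever page the driver is currently on
    parts = urlsplit(driver.current_url)
    return f"{parts.scheme}://{parts.netloc}"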
Old Answer:
If you print the text of the returned page (print(driver.page_source)), I think you will get a message that says something like:
"Because of your system configuration the pdf can't be loaded"
This is because the requested site checks some preferences to decide whether you are a robot. Maybe it helps to change some arguments (screen size, user agent) to fix this. Here is some information about how to detect a headless browser.
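For example, a sketch of the kind of options worth experimenting with (no guarantee this particular site accepts them):

from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument('--headless')
# Make the headless session look more like a regular desktop browser
options.add_argument('--window-size=1920,1080')
options.add_argument('--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                     'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36')
driver = webdriver.Chrome(options=options)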
And next time, paste all relevant code (including the imports) into the question to make it easier to test.

How can I get the changing data values from a website with Beautiful Soup?

I am trying web scraping with BeautifulSoup to get the BTC-USDT data from Binance. I am actually getting what I want, but the value changes every second on the website, while what I print to my console stays the same and changes only rarely. Basically, my data is the same every time I fetch it, but on the website it changes every time, and I can't get that changing data.
What can I do?
from bs4 import BeautifulSoup
import requests
import time

while(True):
    url='https://www.binance.com/tr/trade/BTC_USDT'
    HTML=requests.get(url)
    html_content=HTML.content
    soup=BeautifulSoup(HTML.text,'html.parser')
    paper=str((soup.find('title',attrs={'data-shuvi-head':'true'})))
    print(paper)
    time.sleep(5)
This page uses JavaScript to update its data, but BeautifulSoup can't run JavaScript. You need Selenium to control a real web browser, which can run JavaScript.
from selenium import webdriver
import time

url = 'https://www.binance.com/tr/trade/BTC_USDT'  # PEP8: spaces around `=`

#driver = webdriver.Chrome()
driver = webdriver.Firefox()
driver.get(url)

while True:  # PEP8: no need `()`
    try:
        #print(driver.title)
        print(driver.title.split(' ')[0].strip())
    except Exception as ex:
        print('Exception:', ex)
    time.sleep(5)
Eventually you can check in DevTools (tab Network) in Chrome/Firefox to see the url that JavaScript uses to get new data, and then you can try to use it with requests. JavaScript usually sends data as JSON, so you will not need BeautifulSoup but the module json.
But first check if you can get it with official Binance API
PEP 8 -- Style Guide for Python Code
EDIT
Example with Binance API: Current Average Price
import requests
import time

url = 'https://api.binance.com/api/v3/avgPrice?symbol=BTCUSDT'

while True:
    response = requests.get(url)
    data = response.json()
    print(data['price'])
    time.sleep(5)

How do I make the driver navigate to a new page in selenium python

I am trying to write a script to automate job applications on Linkedin using selenium and python.
The steps are simple:
open the LinkedIn page, enter id password and log in
open https://linkedin.com/jobs and enter the search keyword and location and click search (directly opening links like https://www.linkedin.com/jobs/search/?geoId=101452733&keywords=python&location=Australia gets stuck loading, probably due to the lack of some POST information from the previous page)
the click opens the job search page, but this doesn't seem to update the driver, as it still searches on the previous page.
import selenium
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from bs4 import BeautifulSoup
import pandas as pd
import yaml

driver = webdriver.Chrome("/usr/lib/chromium-browser/chromedriver")
url = "https://linkedin.com/"
driver.get(url)
content = driver.page_source
stream = open("details.yaml", 'r')
details = yaml.safe_load(stream)

def login():
    username = driver.find_element_by_id("session_key")
    password = driver.find_element_by_id("session_password")
    username.send_keys(details["login_details"]["id"])
    password.send_keys(details["login_details"]["password"])
    driver.find_element_by_class_name("sign-in-form__submit-button").click()

def get_experience():
    return "1%2C2"

login()
jobs_url = f'https://www.linkedin.com/jobs/'
driver.get(jobs_url)
keyword = driver.find_element_by_xpath("//input[starts-with(@id, 'jobs-search-box-keyword-id-ember')]")
location = driver.find_element_by_xpath("//input[starts-with(@id, 'jobs-search-box-location-id-ember')]")
keyword.send_keys("python")
location.send_keys("Australia")
driver.find_element_by_xpath("//button[normalize-space()='Search']").click()
WebDriverWait(driver, 10)
# content = driver.page_source
# soup = BeautifulSoup(content)
# with open("a.html", 'w') as a:
#     a.write(str(soup))
print(driver.current_url)
driver.current_url returns https://linkedin.com/jobs/ instead of https://www.linkedin.com/jobs/search/?geoId=101452733&keywords=python&location=Australia as it should. I have tried to print the content to a file; it is indeed from the previous jobs page and not from the search page. I have also tried to search for elements on the page, like the experience filter and the easy apply button, but the search results in a not-found error.
I am not sure why this isn't working.
Any ideas? Thanks in Advance
UPDATE
It works if try to directly open something like https://www.linkedin.com/jobs/search/?f_AL=True&f_E=2&keywords=python&location=Australia but not https://www.linkedin.com/jobs/search/?f_AL=True&f_E=1%2C2&keywords=python&location=Australia
the difference between these links is that one of them takes only one value for the experience level while the other takes two values. This means it's probably not a POST values issue.
You are getting and printing the current URL immediately after clicking on the search button, before the page has changed with the response received from the server.
This is why it outputs https://linkedin.com/jobs/ instead of something like https://www.linkedin.com/jobs/search/?geoId=101452733&keywords=python&location=Australia.
WebDriverWait(driver, 10) or wait = WebDriverWait(driver, 20) will not cause any kind of delay the way time.sleep(10) does.
wait = WebDriverWait(driver, 20) only instantiates a wait object, an instance of the WebDriverWait class; by itself it does not wait for anything until you call .until() on it.
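A minimal sketch of the missing wait, assuming the search navigates to a /jobs/search/ URL:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver.find_element_by_xpath("//button[normalize-space()='Search']").click()
# Block until the browser has actually navigated to the results page
WebDriverWait(driver, 10).until(EC.url_contains('/jobs/search/'))
print(driver.current_url)  # now reflects the search results URL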

Python - Selenium - Print Webpage

How do I print a webpage using selenium, please?
import time
from selenium import webdriver

# Initialise the webdriver
chromeOps = webdriver.ChromeOptions()
chromeOps._binary_location = "C:\\Program Files\\Google\\Chrome\\Application\\chrome.exe"
chromeOps._arguments = ["--enable-internal-flash"]
browser = webdriver.Chrome("C:\\Program Files\\Google\\Chrome\\Application\\chromedriver.exe", port=4445, chrome_options=chromeOps)
time.sleep(3)

# Login to Webpage
browser.get('http://www.webpage.com')  # the scheme is required or selenium raises an error
Note: I am using the, at present, current version of Google Chrome: Version 32.0.1700.107 m
While it's not directly printing the webpage, it is easy to take a screenshot of the entire current page:
browser.save_screenshot("screenshot.png")
Then the image can be printed using any image printing library. I haven't personally used any such library so I can't necessarily vouch for it, but a quick search turned up win32print which looks promising.
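For example, on Windows the screenshot could be handed to the shell's default print handler; a sketch only, untested here:

import os

browser.save_screenshot("screenshot.png")
# Ask Windows to print the file with whatever application handles .png
os.startfile("screenshot.png", "print")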
The key "trick" is that we can execute JavaScript in the selenium browser window using the "execute_script" method of the selenium webdriver, and if you execute the JavaScript command "window.print();" it will activate the browsers print function.
Now, getting it to work elegantly requires setting a few preferences to print silently, remove print progress reporting, etc. Here is a small but functional example that loads up and prints whatever website you put in the last line (where 'http://www.cnn.com/' is now):
import time
from selenium import webdriver

class printing_browser(object):
    def __init__(self):
        self.profile = webdriver.FirefoxProfile()
        self.profile.set_preference("services.sync.prefs.sync.browser.download.manager.showWhenStarting", False)
        self.profile.set_preference("pdfjs.disabled", True)
        self.profile.set_preference("print.always_print_silent", True)
        self.profile.set_preference("print.show_print_progress", False)
        self.profile.set_preference("browser.download.show_plugins_in_list", False)
        self.driver = webdriver.Firefox(self.profile)
        time.sleep(5)

    def get_page_and_print(self, page):
        self.driver.get(page)
        time.sleep(5)
        self.driver.execute_script("window.print();")

if __name__ == "__main__":
    browser_that_prints = printing_browser()
    browser_that_prints.get_page_and_print('http://www.cnn.com/')
The key command you were probably missing was self.driver.execute_script("window.print();"), but one needs some of that setup in __init__ to make it run smoothly, so I thought I'd give a fuller example. I think the trick alone is in a comment above, so some credit should go there too.

Getting all visible text from a webpage using Selenium

I've been googling this all day without finding the answer, so apologies in advance if this is already answered.
I'm trying to get all visible text from a large number of different websites. The reason is that I want to process the text to eventually categorize the websites.
After a couple of days of research, I decided that Selenium was my best chance. I've found a way to grab all the text with Selenium; unfortunately, the same text is being grabbed multiple times:
from selenium import webdriver
import codecs

filen = codecs.open('outoput.txt', encoding='utf-8', mode='w+')
driver = webdriver.Firefox()
driver.get("http://www.examplepage.com")
allelements = driver.find_elements_by_xpath("//*")
ferdigtxt = []

for i in allelements:
    if i.text in ferdigtxt:
        pass
    else:
        ferdigtxt.append(i.text)
        filen.writelines(i.text)

filen.close()
driver.quit()
The if condition inside the for loop is an attempt at eliminating the problem of fetching the same text multiple times; it does, however, only work as planned on some webpages. (It also makes the script a lot slower.)
I'm guessing the reason for my problem is that - when asking for the inner text of an element - I also get the inner text of the elements nested inside the element in question.
Is there any way around this? Is there some sort of master element I grab the inner text of? Or a completely different way that would enable me to reach my goal? Any help would be greatly appreciated as I'm out of ideas for this one.
Edit: the reason I used Selenium and not Mechanize and Beautiful Soup is because I wanted JavaScript-rendered text
Using lxml, you might try something like this:
import contextlib
import selenium.webdriver as webdriver
import lxml.html as LH
import lxml.html.clean as clean

url = "http://www.yahoo.com"
ignore_tags = ('script', 'noscript', 'style')

with contextlib.closing(webdriver.Firefox()) as browser:
    browser.get(url)  # Load page
    content = browser.page_source
    cleaner = clean.Cleaner()
    content = cleaner.clean_html(content)
    with open('/tmp/source.html', 'w') as f:
        f.write(content.encode('utf-8'))
    doc = LH.fromstring(content)
    with open('/tmp/result.txt', 'w') as f:
        for elt in doc.iterdescendants():
            if elt.tag in ignore_tags:
                continue
            text = elt.text or ''
            tail = elt.tail or ''
            words = ' '.join((text, tail)).strip()
            if words:
                words = words.encode('utf-8')
                f.write(words + '\n')
This seems to get almost all of the text on www.yahoo.com, except for text in images and some text that changes with time (done with javascript and refresh perhaps).
Here's a variation on @unutbu's answer:
#!/usr/bin/env python
import sys
from contextlib import closing

import lxml.html as html  # pip install 'lxml>=2.3.1'
from lxml.html.clean import Cleaner
from selenium.webdriver import Firefox  # pip install selenium
from werkzeug.contrib.cache import FileSystemCache  # pip install werkzeug

cache = FileSystemCache('.cachedir', threshold=100000)

url = sys.argv[1] if len(sys.argv) > 1 else "https://stackoverflow.com/q/7947579"

# get page
page_source = cache.get(url)
if page_source is None:
    # use firefox to get page with javascript generated content
    with closing(Firefox()) as browser:
        browser.get(url)
        page_source = browser.page_source
    cache.set(url, page_source, timeout=60*60*24*7)  # week in seconds

# extract text
root = html.document_fromstring(page_source)
# remove flash, images, <script>, <style>, etc.
Cleaner(kill_tags=['noscript'], style=True)(root)  # lxml >= 2.3.1
print(root.text_content())  # extract text
I've separated your task into two:
get the page (including elements generated by javascript)
extract the text
The code is connected only through the cache. You can fetch pages in one process and extract text in another process, or defer it to later using a different algorithm.
