Python - Downloading PDFs and saving to disk using Selenium

I'm creating an application that downloads PDFs from a website and saves them to disk. I understand the Requests module can handle the download itself, but not the logic around it (file size, progress, time remaining, etc.).
I've built the program with Selenium so far and would eventually like to incorporate it into a Tkinter GUI app.
What would be the best way to handle the downloading and tracking, and eventually create a progress bar?
This is my code so far:
from selenium import webdriver
from time import sleep
import requests
import secrets


class manual_grabber():
    """A class creating a manual downloader for the Roger Technology website."""

    def __init__(self):
        """Initialize attributes of the manual grabber."""
        self.driver = webdriver.Chrome('\\Users\\Joel\\Desktop\\Python\\manual_grabber\\chromedriver.exe')

    def login(self):
        """Function controlling the login logic."""
        self.driver.get('https://rogertechnology.it/en/b2b')
        sleep(1)

        # Locate elements and enter login details
        user_in = self.driver.find_element_by_xpath('/html/body/div[2]/form/input[6]')
        user_in.send_keys(secrets.username)

        pass_in = self.driver.find_element_by_xpath('/html/body/div[2]/form/input[7]')
        pass_in.send_keys(secrets.password)

        enter_button = self.driver.find_element_by_xpath('/html/body/div[2]/form/div/input')
        enter_button.click()

        # Click Self Service Area button
        self_service_button = self.driver.find_element_by_xpath('//*[@id="bs-example-navbar-collapse-1"]/ul/li[1]/a')
        self_service_button.click()

    def download_file(self):
        """Access the file tree, navigate to the PDFs and download them."""
        # Wait for all elements to load
        sleep(3)

        # Find and switch to iFrame
        frame = self.driver.find_element_by_xpath('//*[@id="siteOutFrame"]/iframe')
        self.driver.switch_to.frame(frame)

        # Find and click tech manuals button
        tech_manuals_button = self.driver.find_element_by_xpath('//*[@id="fileTree_1"]/ul/li/ul/li[6]/a')
        tech_manuals_button.click()


bot = manual_grabber()
bot.login()
bot.download_file()
So in summary, I'd like this code to download PDFs from the website, store each one in a specific directory (named after its parent folder in the jQuery File Tree) and keep track of the progress (file size, time remaining, etc.).
I hope this is enough information. If any more is required, please let me know.

I would recommend using tqdm and the requests module for this.
Here is sample code that handles the download and updates a progress bar as it goes.
from tqdm import tqdm
import requests

url = "http://www.ovh.net/files/10Mb.dat"  # big file test

# Streaming, so we can iterate over the response.
response = requests.get(url, stream=True)
total_size_in_bytes = int(response.headers.get('content-length', 0))
block_size = 1024  # 1 Kibibyte

progress_bar = tqdm(total=total_size_in_bytes, unit='iB', unit_scale=True)
with open('test.dat', 'wb') as file:
    for data in response.iter_content(block_size):
        progress_bar.update(len(data))  # change this to your widget in tkinter
        file.write(data)
progress_bar.close()

if total_size_in_bytes != 0 and progress_bar.n != total_size_in_bytes:
    print("ERROR, something went wrong")
The block_size is the chunk size read on each iteration, and the time remaining can be estimated from the number of iterations performed per second relative to the number of blocks still to download. Here is an alternative - How to measure download speed and progress using requests?
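If you eventually move this into Tkinter, the same streaming loop can feed a ttk.Progressbar directly instead of tqdm. A minimal sketch, reusing the test URL above; the helper function and widget names are my own illustration, not part of the answer:

import tkinter as tk
from tkinter import ttk
import requests


def download_with_progress(url, dest, bar, root, block_size=1024):
    """Stream `url` to `dest`, advancing a ttk.Progressbar as chunks arrive."""
    response = requests.get(url, stream=True)
    total = int(response.headers.get('content-length', 0))
    bar['maximum'] = total or 1  # avoid a zero-length bar if the header is missing
    done = 0
    with open(dest, 'wb') as file:
        for chunk in response.iter_content(block_size):
            file.write(chunk)
            done += len(chunk)
            bar['value'] = done      # plays the same role as progress_bar.update() in tqdm
            root.update_idletasks()  # let Tkinter redraw between chunks


root = tk.Tk()
bar = ttk.Progressbar(root, length=300)
bar.pack(padx=10, pady=10)
root.update()  # show the window before the blocking download starts
download_with_progress("http://www.ovh.net/files/10Mb.dat", "test.dat", bar, root)
root.mainloop()

For a responsive GUI you would normally run the download in a background thread and push updates to the main loop, but this shows where the tqdm update maps onto a widget.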

Related

PywinAuto - Excel automation: cannot click the button

I'm making an Excel automation with the pywinauto library, but there is a hard challenge for me because I'm using an Excel Oracle add-in called Smart View.
I need to click the 'Private Connections' button, however I can't find any trace of it in app.Excel.print_control_identifiers().
So I tried to use inspector.exe to find the UI element for the Private Connections button, however I couldn't find any solution in its results either.
Then I used another program called UISpy, however I can only find the Private Connections pane inside that program.
I tried to find an answer but couldn't find anything. So, can you help me click this button?
By the way, here is my code:
import time

import pywinauto
from pywinauto import application
from pywinauto.keyboard import send_keys
from pywinauto.controls.common_controls import TreeViewWrapper

program_path = r"C:\Program Files\Microsoft Office\root\Office16\EXCEL.EXE"
file_path = r"C:\Users\AytugMeteBeder\Desktop\deneme.xlsx"

app = application.Application(backend="uia").start(r'{} "{}"'.format(program_path, file_path))
# sapp = application.Application(backend="uia").connect(title='deneme.xlsx - Excel')

time.sleep(7)
myExcel = app.denemeExcel.child_window(title="Smart View", control_type="TabItem").wrapper_object()
myExcel.click_input()

Panel = app.denemeExcel.child_window(title="Panel", control_type="Button").wrapper_object()
Panel.click_input()

time.sleep(1)
app.denemeExcel.print_control_identifiers()
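Purely as a debugging aid (a sketch of my own, not a confirmed fix): dump every UIA descendant of the Excel window and search the titles for "Private", to see whether the control is exposed through UI Automation at all.

# Hypothetical debugging helper: list every UIA descendant whose title mentions "Private".
# If nothing shows up, the add-in likely draws the button itself without exposing it,
# and a coordinate click or keyboard navigation may be the only option.
from pywinauto import application

app = application.Application(backend="uia").connect(path="EXCEL.EXE")
main_window = app.top_window()

for ctrl in main_window.descendants():
    try:
        text = ctrl.window_text()
    except Exception:
        continue
    if "Private" in text:
        print(ctrl.element_info.control_type, repr(text), ctrl.rectangle())

If the button does appear in that list, calling click_input() on the matching wrapper should work; if not, the control simply isn't exposed to UIA.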

How to write a proper function file in Python

I want to write a Python file that contains functions we want to use in our project. We are working on a Selenium web-scraping bot for Instagram. Right now we write all the functions in the scripts themselves, but we want to make a "functions" file which we can import and use from our scripts. The problem is that VS Code does not offer autocompletion when I want to use a webdriver function like driver.find_element_by_xpath(cookies_button_xpath).click().
The function file (not finished yet) looks like this:
import time

from selenium.webdriver.common.keys import Keys
from selenium import webdriver

# set constants for functions to run
WEBSITE_PRE_FIX = 'https://www.instagram.com/'
FORBIDDEN_CAPTION_WORDS = ['link in bio', 'buy now', 'limited time']


def open_ig(driver: webdriver):
    # opens the website and waits till it is loaded
    driver.get(WEBSITE_PRE_FIX)
    time.sleep(2)
    # accept cookies
    cookies_button_xpath = "/html/body/div[4]/div/div/button[1]"
    driver.find_element_by_xpath(cookies_button_xpath).click()


def login(driver: webdriver, username, password):
    time.sleep(2)
    # fill in user name and password and log in
    username_box_xpath = '/html/body/div[1]/section/main/article/div[2]/div[1]/div/form/div/div[1]/div/label/input'
    username_element = driver.find_element_by_xpath(username_box_xpath)
    username_element.send_keys(username)
    password_box_xpath = '/html/body/div[1]/section/main/article/div[2]/div[1]/div/form/div/div[2]/div/label/input'
    password_element = driver.find_element_by_xpath(password_box_xpath)
    password_element.send_keys(password)
    password_element.send_keys(Keys.ENTER)
    # click on do not save username and password + do not turn on notifications
    time.sleep(3)
    dont_save_username_button_password_xpath = '/html/body/div[1]/section/main/div/div/div/div/button'
    dont_save_username_button_element = driver.find_element_by_xpath(dont_save_username_button_password_xpath)
    dont_save_username_button_element.click()
So the code works (it runs and does what I want), but I would like to know if we can write the functions file another way so that things like autocompletion and syntax colouring work. I'm not completely sure it is possible. If there is another way to write the functions file, all recommendations are welcome.
Have you tried writing the functions file as a simple class?
import time

from selenium.webdriver.common.keys import Keys
from selenium import webdriver


class FunctionsFile():
    def __init__(self):
        self.website_pre_fix = 'https://www.instagram.com/'
        self.forbidden_caption_words = ['link in bio', 'buy now', 'limited time']

    def open_ig(self, driver: webdriver):
        # opens the website and waits till it is loaded
        driver.get(self.website_pre_fix)
        time.sleep(2)
        # accept cookies
        cookies_button_xpath = "/html/body/div[4]/div/div/button[1]"
        driver.find_element_by_xpath(cookies_button_xpath).click()

    def login(self, driver: webdriver, username, password):
        time.sleep(2)
        # fill in user name and password and log in
        username_box_xpath = '/html/body/div[1]/section/main/article/div[2]/div[1]/div/form/div/div[1]/div/label/input'
        username_element = driver.find_element_by_xpath(username_box_xpath)
        username_element.send_keys(username)
        password_box_xpath = '/html/body/div[1]/section/main/article/div[2]/div[1]/div/form/div/div[2]/div/label/input'
        password_element = driver.find_element_by_xpath(password_box_xpath)
        password_element.send_keys(password)
        password_element.send_keys(Keys.ENTER)
        # click on do not save username and password + do not turn on notifications
        time.sleep(3)
        dont_save_username_button_password_xpath = '/html/body/div[1]/section/main/div/div/div/div/button'
        dont_save_username_button_element = driver.find_element_by_xpath(dont_save_username_button_password_xpath)
        dont_save_username_button_element.click()
You can then instantiate the class in any file. If in same directory:
from FunctionsFile import FunctionsFile
funcs = FunctionsFile()
funcs.open_ig(driver)
That should use the standard VS Code color schemes and autocompletion. (I think anyway).
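One more note (my own suggestion, not something the answer above claims): the annotation driver: webdriver points at the selenium.webdriver module rather than a class, so editors often cannot infer driver's methods. Annotating with the actual driver class usually restores autocompletion, whether you keep plain functions or switch to a class. A minimal sketch:

from selenium import webdriver
from selenium.webdriver.remote.webdriver import WebDriver  # base class for all concrete drivers


def open_ig(driver: WebDriver) -> None:
    """VS Code can now resolve driver's methods from the WebDriver class."""
    driver.get('https://www.instagram.com/')


# works with any concrete driver (requires chromedriver on PATH to actually run)
chrome_driver = webdriver.Chrome()
open_ig(chrome_driver)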

How to save/display giphy gif using python API?

I am creating one of those cool moving photograph frames, eventually with my own pictures, but for now I just want to search Giphy and save/display a GIF.
Here's the code I gathered from their API that seemed useful.
import giphy_client as gc
from giphy_client.rest import ApiException
from random import randint

api_instance = gc.DefaultApi()
api_key = 'MY_API_KEY'
query = 'art'
fmt = 'gif'

try:
    response = api_instance.gifs_search_get(api_key, query, limit=1, offset=randint(1, 10), fmt=fmt)
    gif_id = response.data[0]
except ApiException as e:
    print("Exception when calling DefaultApi->gifs_search_get: %s\n" % e)

with open('test.txt', 'w') as f:
    f.write(str(type(gif_id)))
This gives me an object of type giphy_client.models.gif.Gif. I want to save this GIF and display it on a monitor. I understand that I'm a long way off, but I am still learning about APIs and how to use them. If anyone can help me find a way to save this GIF or display it directly from their website, that would be much appreciated!
Welcome dbarth!
I see your code does successfully retrieve a random image, that is good.
There are 3 steps needed to get the image:
Get the GIF URL.
The giphy_client client you are using is generated with Swagger, so you can access the REST response elements like any other object, or print them.
For example:
>>> print(gif_id.images.downsized.url)
'https://media0.giphy.com/media/l3nWlvtvAFHcDFKXm/giphy-downsized.gif?cid=e1bb72ff5c7dc1c67732476c2e69b2ff'
Note that when I print this, I get a URL. The Gif object you got, called gif_id, has a bunch of URLs for downloading the GIF or MP4 at different resolutions. In this case, I went with the downsized GIF. You can see all the elements retrieved using print(gif_id).
So, I will add this to your code:
url_gif = gif_id.images.downsized.url
Download the GIF
Now that you have a URL, it's time to download the GIF. I will use the requests library to do this; install it with pip if you don't have it in your environment. It seems you already tried this, but hit an error.
import requests

[...]

with open('test.gif', 'wb') as f:
    f.write(requests.get(url_gif).content)
Display the GIF
There are a bunch of GUI toolkits for Python that can do this, or you can even invoke a browser to show it. You need to investigate which GUI best fits your needs. For this case, I will use the example posted here, with a few modifications, to display the GIF using Tkinter. Install Tkinter if it isn't included with your Python installation.
Final code:
import giphy_client as gc
from giphy_client.rest import ApiException
from random import randint
import requests
from tkinter import *
import time
import os

root = Tk()

api_instance = gc.DefaultApi()
api_key = 'YOUR_OWN_API_KEY'
query = 'art'
fmt = 'gif'

try:
    response = api_instance.gifs_search_get(api_key, query, limit=1, offset=randint(1, 10), fmt=fmt)
    gif_id = response.data[0]
    url_gif = gif_id.images.downsized.url
except ApiException as e:
    print("Exception when calling DefaultApi->gifs_search_get: %s\n" % e)

with open('test.gif', 'wb') as f:
    f.write(requests.get(url_gif).content)

frames = []
i = 0
while True:  # Add frames until out of range
    try:
        frames.append(PhotoImage(file='test.gif', format='gif -index %i' % (i)))
        i = i + 1
    except TclError:
        break


def update(ind):  # Display and loop the GIF
    if ind >= len(frames):
        ind = 0
    frame = frames[ind]
    ind += 1
    label.configure(image=frame)
    root.after(100, update, ind)


label = Label(root)
label.pack()
root.after(0, update, 0)
root.mainloop()
Keep learning how to use a REST API, and Swagger, if you want to keep using the giphy_client library. If not, you can make the requests directly using the requests library.
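For completeness, a rough sketch of that last suggestion, calling Giphy's public search endpoint with requests alone. The endpoint and JSON field names are from Giphy's v1 REST API as I recall them, so verify against the current docs:

import requests
from random import randint

API_KEY = 'YOUR_OWN_API_KEY'

# Search for one random-ish 'art' GIF via the plain REST endpoint
params = {'api_key': API_KEY, 'q': 'art', 'limit': 1, 'offset': randint(1, 10)}
resp = requests.get('https://api.giphy.com/v1/gifs/search', params=params)
resp.raise_for_status()

data = resp.json()['data']
if data:
    gif_url = data[0]['images']['downsized']['url']
    with open('test.gif', 'wb') as f:
        f.write(requests.get(gif_url).content)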

python: get all youtube video urls of a channel

I want to get all the video URLs of a specific channel. I think JSON with Python or Java would be a good choice. I can get the newest video with the following code, but how can I get ALL the video links (>500)?
import urllib, json
author = 'Youtube_Username'
inp = urllib.urlopen(r'http://gdata.youtube.com/feeds/api/videos?max-results=1&alt=json&orderby=published&author=' + author)
resp = json.load(inp)
inp.close()
first = resp['feed']['entry'][0]
print first['title'] # video title
print first['link'][0]['href'] #url
After the YouTube API change, max k.'s answer does not work. As a replacement, the function below provides a list of the videos in a given channel. Please note that you need an API key for it to work.
import urllib
import json


def get_all_video_in_channel(channel_id):
    api_key = 'YOUR_API_KEY'

    base_video_url = 'https://www.youtube.com/watch?v='
    base_search_url = 'https://www.googleapis.com/youtube/v3/search?'

    first_url = base_search_url + 'key={}&channelId={}&part=snippet,id&order=date&maxResults=25'.format(api_key, channel_id)

    video_links = []
    url = first_url
    while True:
        inp = urllib.urlopen(url)
        resp = json.load(inp)

        for i in resp['items']:
            if i['id']['kind'] == "youtube#video":
                video_links.append(base_video_url + i['id']['videoId'])

        try:
            next_page_token = resp['nextPageToken']
            url = first_url + '&pageToken={}'.format(next_page_token)
        except KeyError:
            # no nextPageToken means we reached the last page of results
            break

    return video_links
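The snippet above is written for Python 2's urllib; under Python 3 the same fetch would go through urllib.request (a small adaptation of my own, not part of the original answer):

import json
import urllib.request


def fetch_json(url):
    """Python 3 equivalent of urllib.urlopen(url) followed by json.load(inp)."""
    with urllib.request.urlopen(url) as inp:
        return json.load(inp)

# usage inside the loop above:
# resp = fetch_json(url)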
Short answer:
Here's a library that can help with that.
pip install scrapetube
import scrapetube
videos = scrapetube.get_channel("UC9-y-6csu5WGm29I7JiwpnA")
for video in videos:
    print(video['videoId'])
Long answer:
The module mentioned above was created by me due to a lack of any other solutions. Here's what I tried:
Selenium. It worked but had three big drawbacks: 1. it requires a web browser and driver to be installed; 2. it has big CPU and memory requirements; 3. it can't handle big channels.
Using youtube-dl. Like this:
import youtube_dl

youtube_dl_options = {
    'skip_download': True,
    'ignoreerrors': True
}

with youtube_dl.YoutubeDL(youtube_dl_options) as ydl:
    videos = ydl.extract_info(f'https://www.youtube.com/channel/{channel_id}/videos')
This also works for small channels, but for bigger ones I would get blocked by YouTube for making so many requests in such a short time (because youtube-dl downloads more info for every video in the channel).
So I made the library scrapetube, which uses the web API to get all the videos.
Increase max-results from 1 to however many you want, but beware: they don't advise grabbing too many in one call and will limit you to 50 (https://developers.google.com/youtube/2.0/developers_guide_protocol_api_query_parameters).
Instead you could consider grabbing the data in batches of 25, say, by increasing the start-index until none come back.
EDIT: Here's the code for how I would do it
import urllib, json

author = 'Youtube_Username'

foundAll = False
ind = 1
videos = []
while not foundAll:
    inp = urllib.urlopen(r'http://gdata.youtube.com/feeds/api/videos?start-index={0}&max-results=50&alt=json&orderby=published&author={1}'.format(ind, author))
    try:
        resp = json.load(inp)
        inp.close()
        returnedVideos = resp['feed']['entry']
        for video in returnedVideos:
            videos.append(video)

        ind += 50
        print len(videos)
        if (len(returnedVideos) < 50):
            foundAll = True
    except:
        # catch the case where the number of videos in the channel is a multiple of 50
        print "error"
        foundAll = True

for video in videos:
    print video['title']  # video title
    print video['link'][0]['href']  # url
Based on the code found here and at some other places, I've written a small script that does this. My script uses v3 of YouTube's API and does not hit the 500-result limit that Google has set for searches.
The code is available over at GitHub: https://github.com/dsebastien/youtubeChannelVideosFinder
Independent way of doing things. No API, no rate limit.
import requests

username = "marquesbrownlee"
url = "https://www.youtube.com/user/{}/videos".format(username)
page = requests.get(url).content
data = str(page).split(' ')
item = 'href="/watch?'
vids = [line.replace('href="', 'youtube.com') for line in data if item in line]  # list of all videos, each listed twice

print(vids[0])  # index the latest video
The above code scrapes only a limited number of video URLs, at most about 60. How can I grab all the video URLs present in the channel? Can you please suggest?
The above snippet only returns the videos present in the initially loaded page (each listed twice), not all the video URLs in the channel.
Using Selenium Chrome Driver:
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
import time

driverPath = ChromeDriverManager().install()
driver = webdriver.Chrome(driverPath)

url = 'https://www.youtube.com/howitshouldhaveended/videos'
driver.get(url)

height = driver.execute_script("return document.documentElement.scrollHeight")
previousHeight = -1

while previousHeight < height:
    previousHeight = height
    driver.execute_script(f'window.scrollTo(0,{height + 10000})')
    time.sleep(1)
    height = driver.execute_script("return document.documentElement.scrollHeight")

vidElements = driver.find_elements_by_id('thumbnail')

vid_urls = []
for v in vidElements:
    vid_urls.append(v.get_attribute('href'))
This code has worked the few times I've tried it; however, you might need to tweak the sleep time, or add a way to recognize when the browser is still loading the extra information. It easily worked for me on a channel with 300+ videos, but it had issues with one that had 7000+ videos, because the time needed to load new videos in the browser became inconsistent.
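One way to make the stop condition more tolerant of slow loads (my own tweak to the loop above, not part of the original answer) is to only give up after the height has stayed unchanged for several consecutive checks:

import time


def scroll_to_bottom(driver, pause=1.0, max_stalls=3):
    """Keep scrolling until scrollHeight stops growing for `max_stalls` checks in a row."""
    stalls = 0
    height = driver.execute_script("return document.documentElement.scrollHeight")
    while stalls < max_stalls:
        driver.execute_script(f"window.scrollTo(0, {height + 10000})")
        time.sleep(pause)
        new_height = driver.execute_script("return document.documentElement.scrollHeight")
        if new_height == height:
            stalls += 1   # page may still be loading; try a few more times
        else:
            stalls = 0    # new content appeared, reset the counter
        height = new_height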
I modified the script originally posted by dermasmid to fit my needs. This is the result:
import scrapetube
import sys

path = '_list.txt'
sys.stdout = open(path, 'w')

videos = scrapetube.get_channel("UC9-y-6csu5WGm29I7JiwpnA")

for video in videos:
    print("https://www.youtube.com/watch?v=" + str(video['videoId']))
    # print(video['videoId'])
Basically it saves all the URLs from the channel into a "_list.txt" file. I am using this "_list.txt" file to download all the videos with yt-dlp.exe. All the downloaded files have the .mp4 extension.
Now I need to create another "_playlist.txt" file that contains all the FILENAMES corresponding to each URL from the "_list.txt".
For example, for "https://www.youtube.com/watch?v=yG1m7oGZC48" I want "Apple M1 Ultra & NUMA - Computerphile.mp4" as output in the "_playlist.txt".
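A possible way to build that "_playlist.txt" (a sketch of my own using yt-dlp's Python API; it assumes the files were downloaded with an output template like "%(title)s.%(ext)s", so adjust the formatting to match your actual download options):

import yt_dlp

ydl_opts = {'quiet': True, 'skip_download': True}

with open('_list.txt') as urls, open('_playlist.txt', 'w', encoding='utf-8') as playlist:
    with yt_dlp.YoutubeDL(ydl_opts) as ydl:
        for url in urls:
            url = url.strip()
            if not url:
                continue
            # fetch metadata only; 'title' is the video title yt-dlp would use in the filename
            info = ydl.extract_info(url, download=False)
            playlist.write(info['title'] + '.mp4\n')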
I made some further improvements, to be able to enter the channel URL in the console and print the result on screen as well as to an external file called "_list.txt".
import scrapetube
import sys

path = '_list.txt'

print('**********************\n')
print("The result will be saved in the '_list.txt' file.")
print("Enter Channel ID:")


# Prints the output in the console and into the '_list.txt' file.
class Logger:
    def __init__(self, filename):
        self.console = sys.stdout
        self.file = open(filename, 'w')

    def write(self, message):
        self.console.write(message)
        self.file.write(message)

    def flush(self):
        self.console.flush()
        self.file.flush()


sys.stdout = Logger(path)

# Remove the leading "https://www.youtube.com/channel/" if a full URL was pasted
# (str.removeprefix requires Python 3.9+)
channel_id_input = input()
channel_id = channel_id_input.removeprefix("https://www.youtube.com/channel/")

videos = scrapetube.get_channel(channel_id)

for video in videos:
    print("https://www.youtube.com/watch?v=" + str(video['videoId']))
    # print(video['videoId'])

Captchas in Scrapy

I'm working on a Scrapy app, where I'm trying to log in to a site with a form that uses a captcha (it's not spam). I am using ImagesPipeline to download the captcha, and I am printing it to the screen for the user to solve. So far so good.
My question is: how can I restart the spider to submit the captcha/form information? Right now my spider requests the captcha page, then returns an Item containing the image_url of the captcha. This is then processed/downloaded by the ImagesPipeline and displayed to the user. What's unclear is how I can resume the spider's progress and pass the solved captcha and the same session to the spider, since I believe the spider has to return the item (i.e. quit) before the ImagesPipeline goes to work.
I've looked through the docs and examples but I haven't found any ones that make it clear how to make this happen.
This is how you might get it to work inside the spider.
self.crawler.engine.pause()
process_my_captcha()
self.crawler.engine.unpause()
Once you get the request, pause the engine, display the image, read the info from the user, and resume the crawl by submitting a POST request for login.
I'd be interested to know if the approach works for your case.
I would not create an Item or use the ImagesPipeline.
import urllib
import os
import subprocess

...

def start_requests(self):
    request = Request("http://webpagewithcaptchalogin.com/", callback=self.fill_login_form)
    return [request]

def fill_login_form(self, response):
    x = HtmlXPathSelector(response)
    img_src = x.select("//img/@src").extract()

    # delete the old captcha file and use urllib to write the new one to disk
    os.remove("c:\captcha.jpg")
    urllib.urlretrieve(img_src[0], "c:\captcha.jpg")

    # I use a program here to show the jpg (actually send it somewhere)
    captcha = subprocess.check_output(r".\external_utility_solving_captcha.exe")

    # OR just get the input from the user from stdin
    captcha = raw_input("put captcha in manually>")

    # this function performs the request and calls process_home_page with
    # the response (this way you can chain pages from start_requests() to parse())
    return [FormRequest.from_response(response, formnumber=0, formdata={'user': 'xxx', 'pass': 'xxx', 'captcha': captcha}, callback=self.process_home_page)]

def process_home_page(self, response):
    # check if you logged in etc. etc.
    ...
What I do here is use urllib.urlretrieve(url) (to store the image), os.remove(file) (to delete the previous image), and subprocess.check_output() (to call an external command-line utility that solves the captcha). The whole Scrapy infrastructure is not used in this "hack", because solving a captcha like this is always a hack.
That whole calling-an-external-subprocess thing could have been done more nicely, but it works.
On some sites it's not possible to save the captcha image; you have to open the page in a browser, call a screen-capture utility and crop an exact location to "cut out" the captcha. Now that is screen scraping.
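As an illustration of that last point (my own sketch, not from the answer; the crop coordinates are made up and would need to be measured for the real page), a screen capture plus crop with Pillow might look like this:

from PIL import ImageGrab  # Pillow; on Linux you may need an alternative such as mss

# Grab the whole screen while the captcha page is visible in the browser,
# then crop the region where the captcha is rendered (left, top, right, bottom).
CAPTCHA_BOX = (100, 200, 400, 280)  # hypothetical coordinates

screenshot = ImageGrab.grab()
captcha_img = screenshot.crop(CAPTCHA_BOX)
captcha_img.save("captcha_crop.png")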
