Python script runs normally but cannot be scheduled [closed] - python

I have a Python script which runs normally but cannot be scheduled to run successfully with the Windows 7 Task Scheduler. I even created a batch file to call the Python script.
Under the task action ("Start a program"), I have:
C:\Backup\backup.bat
It is so simple that I cannot find anything I have done wrong. Is there anything else I need to take note of?

Create a directory Temp on drive C: and make sure its security permissions are set to full control for everyone.
Put this at the top of your batch file:
@echo off
echo Current directory: %CD%>C:\Temp\Environment.txt
echo.>>C:\Temp\Environment.txt
echo Environment variables:>>C:\Temp\Environment.txt
echo.>>C:\Temp\Environment.txt
set >>C:\Temp\Environment.txt
When you double-click the batch file, it writes the current directory, which will be the directory of the batch file, to C:\Temp\Environment.txt. It also writes all environment variables defined for your user account to the same file.
Now rename Environment.txt to DoubleClickedEnvironment.txt.
Then do what is necessary to run the same batch file as a scheduled task and afterwards look at C:\Temp\Environment.txt.
By comparing C:\Temp\Environment.txt with C:\Temp\DoubleClickedEnvironment.txt you will most likely see that the current directory is now C:\Windows\System32 (i.e. %SystemRoot%\System32) instead of the directory containing the batch file, and that the list of environment variables and their values differ.
The most important environment variables are PATH and PATHEXT, at least when executables are not referenced in the batch file by fully qualified file name (name with extension and full path, enclosed in double quotes if the name or path contains one or more spaces). All environment variables defined for and evaluated by Python are important for your batch file as well.
Another common mistake when running something as a scheduled task is assuming that the account used for the task has the same permissions on files and directories as the current user. This is not the case if the scheduled task is not executed using your user account.
Finally, mapped network drives are not mapped when a batch file runs as a scheduled task; Windows maps network drives only when a user logs in. So in a batch file designed to run as a scheduled task,
the UNC paths must be used, or
the commands pushd and popd are used to map a network share temporarily to a drive letter using credentials of the account defined for the scheduled task, or
%SystemRoot%\System32\net.exe use X: \\ComputerName\ShareName password /user:domain\username /persistent:no
is used at the beginning of the batch file and
%SystemRoot%\System32\net.exe use X: /delete
is used at the end of the batch file, as an example for drive X:.
The last method is very insecure, as it makes it possible for everyone with permission to read the batch file to obtain the user name and password for the share.
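If the work is done by a Python script anyway, the same map/work/unmap pattern can be driven from Python via subprocess. A minimal sketch follows; the drive letter and share name are placeholders, and the same security caveat applies if credentials are embedded:
import subprocess

SHARE = r"\\ComputerName\ShareName"  # placeholder share

subprocess.check_call(["net", "use", "X:", SHARE, "/persistent:no"])
try:
    pass  # ... work with files on X: ...
finally:
    subprocess.check_call(["net", "use", "X:", "/delete"])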

Related

Batch file not running python script in task scheduler

I have a scraping Python script and a batch file that works perfectly when run from CMD, however when I try to run it from Task Scheduler nothing happens.
I know there are a lot of questions regarding the same issue, but I have tried all of the proposed answers and none of them seem to work.
Don't know if this is relevant, but the script opens Firefox and scrapes some websites.
I have tried adding full permissions to the folders and files that I am using.
Also, in Task Scheduler I have tried setting "Run whether user is logged on or not", "Run with highest privileges", "Start in (optional): add/batch/file/path" and so on.
Batch file:
py "C:\python_test\myscript.py"
It should run the Python script, which opens Firefox, scrapes some websites, gets their links and saves them in a CSV file.
Here is myscript.py:
import datetime
import time
import csv
import os
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from bs4 import BeautifulSoup

file_path = r"C:\General\{0:%Y%m%d}\results{0:%Y%m%d%H%M%S}.csv".format(datetime.datetime.now())
directory = os.path.dirname(file_path)
try:
    os.stat(directory)
except OSError:
    os.mkdir(directory)

def init_driver():
    caps = DesiredCapabilities.FIREFOX
    caps['marionette'] = True
    driver = webdriver.Firefox(capabilities=caps)
    driver.wait = WebDriverWait(driver, 15)
    return driver

xpath = {
    'english': '//*[@id="xxx"]',
    'soort': '//*[@id="xxx"]/option[text()=\'|- Announcement of change in denominator or in thresholds\']',
    'start': '//*[@id="xxx"]',
    'end': '//*[@id="xxx"]',
    'submit': '//*[@id="xxx"]',
    'pub_sort': '//*[@id="xxx"]',
}

if __name__ == "__main__":
    driver = init_driver()
    try:
        driver.get("http://SOMEWEBSITE")
        driver.find_element_by_css_selector('[id$=hplEnglish]').click()
    except Exception:
        pass  # first click sometimes fails; retried below
    time.sleep(2)
    driver.find_element_by_css_selector('[id$=hplEnglish]').click()
    time.sleep(3)
    #alert_obj = driver.switch_to.alert
    #alert_obj.dismiss()
    #driver.find_element_by_xpath(xpath['english']).click()
    today = datetime.datetime.now()
    driver.wait.until(EC.element_to_be_clickable((By.XPATH, xpath['soort']))).click()
    driver.find_element_by_xpath(xpath['start']).send_keys((today - datetime.timedelta(weeks=1)).strftime('%d/%m/%Y'))
    driver.find_element_by_xpath(xpath['end']).send_keys(today.strftime('%d/%m/%Y'))
    driver.find_element_by_xpath(xpath['submit']).click()
    for i in range(2):
        driver.wait.until(EC.element_to_be_clickable((By.XPATH, xpath['pub_sort']))).click()
        time.sleep(5)
    html = driver.page_source
    driver.quit()
    results = BeautifulSoup(html, 'html.parser').find('div', {'id': 'SearchResults'}).table
    res = [['issuer', 'type', 'date']]
    for x in results.find_all('tr')[1:]:
        # print(x.text.strip())
        try:
            a, b, c = [i.text.strip() for i in x.find_all('td', class_='resultItemTop')[:3]]
            res.append([a, b, c])
        except ValueError:
            continue
    with open(file_path, 'w', newline='') as result:
        writer = csv.writer(result, delimiter=',')
        writer.writerows(res)
    print('finished')
Introduction
I suggest first reading What must be taken into account on executing a batch file as scheduled task?
Issue 1: Current directory
The first issue here is most likely the current directory on running the batch file.
Windows sets the directory of the batch file as the current directory on double-clicking a batch file, unless the batch file path is a UNC path starting with \\computername\share\.
The current directory for a scheduled task is by default %SystemRoot%\System32, i.e. the Windows system directory, which of course is specially protected against modification. Many batch files expect the current directory to be the directory of the batch file and not any other directory.
Solution 1: Define Start in directory in properties of scheduled task.
Start Windows Task Scheduler.
Navigate to the task and double click on it to open the Properties of the task.
Select tab Action and click on button Edit.
There is a field Start in (optional). Enter the directory of the executed batch file here.
Click OK twice to save this important modification to the task properties.
Solution 2: Make batch file directory the current directory using CD.
In the batch file, insert after the first line (usually @echo off) the line:
cd /D "%~dp0"
This command line changes the current directory from the default %SystemRoot%\System32 to the directory of the batch file, as long as the batch file was not started using a UNC path.
Open a command prompt window and run cd /? for help on command CD and option /D.
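The same fix can also be applied inside the Python script instead of the batch file; a minimal sketch of the Python-side equivalent:
import os

# Make the script's own directory the current directory, no matter
# which directory the Task Scheduler started the process in.
os.chdir(os.path.dirname(os.path.abspath(__file__)))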
Solution 3: Make batch file directory the current directory using PUSHD.
This solution is best if the batch file is stored on a network resource accessed using a UNC path, and the scheduled task is of course configured to run with credentials (user account and password) that have permission to read the batch file on the network resource.
In the batch file, insert after the first line (usually @echo off) the lines:
setlocal EnableExtensions DisableDelayedExpansion
pushd "%~dp0"
The batch file should additionally contain, as the last two lines executed before exiting batch file processing, the two lines:
popd
endlocal
Open a command prompt window and run pushd /?, popd /?, setlocal /? and endlocal /? for help on these four commands, and read also this answer for even more details on them.
Solution 4: Code everything to be independent of the current directory.
The fourth solution is writing the batch file and the Python script so that they are independent of which directory is the current directory on execution.
This solution requires that all files and directories are specified with a fully qualified file/folder name, which means full path + file/folder name + file extension.
The full file/folder paths can be derived from paths known at execution time, like the path of the batch file, which can be referenced with %~dp0 inside the batch file. %~dp0 expands to a path string always ending with a backslash, which means it can be concatenated with file/folder names without an additional backslash.
See also Wikipedia article about Windows Environment Variables.
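On the Python side the same principle means deriving absolute paths from __file__ instead of relying on the current directory; a sketch with a hypothetical data file name:
import os

SCRIPT_DIR = os.path.dirname(os.path.abspath(__file__))
# Hypothetical example: a data file that sits next to the script.
data_path = os.path.join(SCRIPT_DIR, "results.csv")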
Issue 2: Environment variables
Scheduled tasks are often executed with the built-in local system account. So all environment variables defined just for the user account used on developing and testing the batch file are either not defined at all or are defined differently.
Open from Windows Control Panel the item System and then Advanced system settings, or press the key combination Win+Break if the keyboard has a key Break (often as alternate function requiring additionally the key Fn) and click next on Advanced system settings on the left side. The System Properties dialog is opened with the tab Advanced selected, containing at the bottom the button Environment Variables... which must be clicked to open the Environment Variables window.
There are two lists of environment variables: User variables for ... and System variables. The system variables are defined for all accounts, including the built-in local system account. The user variables are defined only for the displayed user.
The user variables defined for the current user are important and could be a reason why a batch file executed with a double-click by the current user works, but does not work when executed as a scheduled task with the built-in system account. A user PATH is often the main source of a batch file not working as a scheduled task, if the scripts and executables used depend on specific folder paths present in the local PATH only via the user PATH.
Please take a look at What is the reason for "X is not recognized as an internal or external command, operable program or batch file"? for more information about the system, user and local PATH and the environment variable PATHEXT, which are used when writing in a batch file just py instead of the fully qualified file name of the script/executable.
So it is definitely better to use
"C:\Python27\python.exe" "C:\python_test\myscript.py"
instead of using
py "C:\python_test\myscript.py"
which results in cmd.exe searching for a file py using the local PATHEXT and local PATH environment variables, which can fail if the folder containing the file py is defined only in the user PATH.
I have not installed Python and so don't know what py really is. It could be a batch file with file name py.cmd or py.bat, in which case the command CALL must be used if the calling batch file contains additional command lines after the command line with py, and which perhaps depends on user environment variables. It could be a symbolic link to python.exe in the installation folder of Python. I don't know.
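To see what py actually resolves to on a given machine, one option (a sketch assuming Python 3.3+, where shutil.which honors PATH and, on Windows, PATHEXT) is:
import shutil

# Prints the full path that would be found via PATH/PATHEXT, or None.
print(shutil.which("py"))
print(shutil.which("python"))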
Issue 3: Network resources
Many scheduled tasks access files, folders or data via a network. In this case the scheduled task must be configured to run with credentials (user account and password) that have the required permissions to access the files, folders or data on the network resource. Using the built-in local system account is in this case nearly never the right choice, because the local system account usually has no permission to read/write data on any network resource.
Conclusion
It is good practice in my opinion to write a batch file executed as a scheduled task to be as independent as possible of other batch files and of environment variables not defined in the batch file itself. A design with dependencies on other script files and on variables defined outside the main script often causes, sooner or later (sometimes years later), unexpected execution problems.
The programmer of a script written for execution as a scheduled task should know very well which files, libraries, environment variables and registry keys/values the script and the called executables depend on to work properly.
A batch file containing only one line to execute one application with some parameters is completely unnecessary and just a potential source of a non-working scheduled task, because the executable to run and its parameters can in this case also be set directly in the properties of the scheduled task. There is absolutely no need to run %SystemRoot%\System32\cmd.exe with implicit option /C to process a batch file containing only one command line, because Windows Task Scheduler can run the application with its parameters directly, too.
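As a concrete example for the script above, the task action could be configured directly, with no batch file involved (field names as they appear in the Edit Action dialog):
Program/script: C:\Python27\python.exe
Add arguments (optional): "C:\python_test\myscript.py"
Start in (optional): C:\python_test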

Why the task scheduler can not fully interact with the batch file which execute a python program with selenium when user is logged off

So I have a server which runs on Windows, and I need to schedule a daily task with Windows Task Scheduler for a batch file which executes a Python script containing Selenium operations (I am using a Chrome driver). The expected outcome is to automatically download a file from online and then unzip it to another folder, i.e. the program downloads the file, stores it in C:\download, then unzips it to E:\Data.
Everything works fine if I set "Run only when user is logged on" in the scheduler or if I just run the program manually. However, when I set "Run whether user is logged on or not", I don't see any file get downloaded, so I suspect that in order for it to work, one must be logged on for the webdriver to interact with the website. I have referred to this site and it seems to be the case, though the author is using Edge (https://developer.microsoft.com/en-us/microsoft-edge/platform/issues/7290550/). Can anyone confirm this or offer me a solution?
Moreover, if I manually leave the zipped file in C:\download, then when the task starts with "Run whether user is logged on or not", I can see the file get unzipped into the destination folder E:\Data, which means part of my script (the extract part) is executed. So I am positive the task starts successfully, but there is just no interaction between Selenium and the website. Thanks.
P.S. My script does not contain any mapped drive, so this solution doesn't apply (https://social.technet.microsoft.com/Forums/windows/en-US/c03d6691-b058-4f8d-961c-e8eba25bbaed/task-scheduler-problem-run-whether-user-is-logged-on-or-not?forum=w7itprogeneral), and neither does this help (Task scheduler cannot open batch file when set to run whether user is logged on or not).

Using Python, how do I close a file in use by another user over a network?

I have a program that creates a bunch of movie files. It runs as a cron job, and every time it runs, the movies from the previous iteration are moved to a 'previous' folder so that there is always a previous version to look at.
These movie files are accessed across a network by various users and that's where I'm running into a problem.
When the script runs and tries to move the files it throws a resource busy error because the files are open by various users. Is there a way in Python to force close these files before I attempt to move them?
Further clarification:
JMax is correct when he mentions it is server level problem. I can access our windows server through Administrative Tools > Computer Management > Shared Folders > Open Files and manually close the files there, but I am wondering whether there is a Python equivalent which will achieve the same result.
Something like this:
import shutil

try:
    shutil.move(src, dst)
except OSError:
    # Close the src file on all machines that are currently
    # accessing it, and try again.
    pass
This question has nothing to do with Python, and everything to do with the particular operating system and file system you're using. Could you please provide these details?
At least on Windows you can use Sysinternals Handle to force a particular handle to a file to be closed. Especially as this file is opened by another user over a network, this operation is extremely destabilising and will probably render the network connection subsequently useless. You're looking for the "-c" command-line argument, where the documentation reads:
Closes the specified handle (interpreted as a hexadecimal number). You
must specify the process by its PID.
WARNING: Closing handles can cause application or system instability.
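If this has to be driven from a Python script, a rough sketch using subprocess follows; the pid and handle value are hypothetical and must be parsed from the listing, whose exact format varies by handle.exe version:
import subprocess

# List handles open on the file; lines look roughly like
#   someapp.exe  pid: 1234  type: File  3C: C:\path\to\file.mov
listing = subprocess.check_output(["handle.exe", r"C:\path\to\file.mov"])
# ... parse the pid and the hexadecimal handle value from the listing ...
# Force-close the handle (see the stability warning above).
subprocess.check_call(["handle.exe", "-c", "3C", "-p", "1234", "-y"])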
And if you're force-closing a file mounted over Samba in Linux, speaking from experience this is an excruciating experience in futility. However, others have tried with mixed success; see Force a Samba process to close a file.
As far as I know you have to end the processes which access the file, at least on Windows.
Doesn't the .close() method work on your file object?
See Dive Into Python for more information on file objects.
[EDIT] I've re-read your question. Your problem is that users open the same file over the network and you want them to close it? But do you have access to their OS?
[EDIT2] The problem is more at the server level: disconnect the user that accesses the file. See this example for Windows servers.
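For completeness, the server-level close can also be scripted from Python with net file, run on the Windows server itself with administrative rights; the Id below is hypothetical and has to be parsed from the listing:
import subprocess

# "net file" lists files opened over shares, one Id per open file.
listing = subprocess.check_output(["net", "file"])
# ... parse the Id of the entry for the file to close ...
subprocess.check_call(["net", "file", "42", "/close"])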

How to handle new files to process in cron job

How can I check which files I have already processed in a script, so I don't process them again? And/or:
What is wrong with the way I am doing this now?
Hello,
I am running tshark with the ring buffer option to dump to files after 5MB or 1 hour. I wrote a Python script to read these files as XML and dump them into a database; this works fine.
My issue is that this is really process intensive: one of those 5MB files can turn into a 200MB file when converted to XML, so I do not want to do any unnecessary processing.
The script runs every 10 minutes and processes ~5 files per run. Since it scans the folder where the files are created for any new entries, I dump a hash of the file into the database, and on the next run I check the hash; if it isn't in the database, I process the file.
The problem is that this does not appear to work every time: it ends up processing files that it has already done. When I check the hash of the file that it keeps trying to process, it doesn't show up anywhere in the database, hence why it tries to process it over and over.
I am printing out the filename + hash in the output of the script:
using file /var/ss01/SS01_00086_20100107100828.cap with hash: 982d664b574b84d6a8a5093889454e59
using file /var/ss02/SS02_00053_20100106125828.cap with hash: 8caceb6af7328c4aed2ea349062b74e9
using file /var/ss02/SS02_00075_20100106184519.cap with hash: 1b664b2e900d56ca9750d27ed1ec28fc
using file /var/ss02/SS02_00098_20100107104437.cap with hash: e0d7f5b004016febe707e9823f339fce
using file /var/ss02/SS02_00095_20100105132356.cap with hash: 41a3938150ec8e2d48ae9498c79a8d0c
using file /var/ss02/SS02_00097_20100107103332.cap with hash: 4e08b6926c87f5967484add22a76f220
using file /var/ss02/SS02_00090_20100105122531.cap with hash: 470b378ee5a2f4a14ca28330c2009f56
using file /var/ss03/SS03_00089_20100107104530.cap with hash: 468a01753a97a6a5dfa60418064574cc
using file /var/ss03/SS03_00086_20100105122537.cap with hash: 1fb8641f10f733384de01e94926e0853
using file /var/ss03/SS03_00090_20100107105832.cap with hash: d6209e65348029c3d211d1715301b9f8
using file /var/ss03/SS03_00088_20100107103248.cap with hash: 56a26b4e84b853e1f2128c831628c65e
using file /var/ss03/SS03_00072_20100105093543.cap with hash: dca18deb04b7c08e206a3b6f62262465
using file /var/ss03/SS03_00050_20100106140218.cap with hash: 36761e3f67017c626563601eaf68a133
using file /var/ss04/SS04_00010_20100105105912.cap with hash: 5188dc70616fa2971d57d4bfe029ec46
using file /var/ss04/SS04_00071_20100107094806.cap with hash: ab72eaddd9f368e01f9a57471ccead1a
using file /var/ss04/SS04_00072_20100107100234.cap with hash: 79dea347b04a05753cb4ff3576883494
using file /var/ss04/SS04_00070_20100107093350.cap with hash: 535920197129176c4d7a9891c71e0243
using file /var/ss04/SS04_00067_20100107084826.cap with hash: 64a88ecc1253e67d49e3cb68febb2e25
using file /var/ss04/SS04_00042_20100106144048.cap with hash: bb9bfa773f3bf94fd3af2514395d8d9e
using file /var/ss04/SS04_00007_20100105101951.cap with hash: d949e673f6138af2d388884f4a6b0f08
The only files it should be processing are one per folder, so only 4 files. This causes unnecessary processing, and I have to deal with overlapping cron jobs plus other services being affected.
What I am hoping to get from this post is a better way to do this, or hopefully someone can tell me why this is happening. I know the latter might be hard, since it can have a bunch of reasons.
Here is the code (I am not a coder but a sysadmin, so be kind :P); lines 30-32 handle the hash comparisons.
Thanks in advance.
A good way to handle/process files that are created at random times is to use incron rather than cron. (Note: since incron uses the Linux kernel's inotify syscalls, this solution only works with Linux.)
Whereas cron runs a job based on dates and times, incron runs a job based on changes in a monitored directory. For example, you can configure incron to run a job every time a new file is created or modified.
On Ubuntu, the package is called incron. I'm not sure about RedHat, but I believe this is the right package: http://rpmfind.net//linux/RPM/dag/redhat/el5/i386/incron-0.5.9-1.el5.rf.i386.html.
Once you install the incron package, read
man 5 incrontab
for information on how to setup the incrontab config file. Your incron_config file might look something like this:
/var/ss01/ IN_CLOSE_WRITE /path/to/processing/script.py $#
/var/ss02/ IN_CLOSE_WRITE /path/to/processing/script.py $#
/var/ss03/ IN_CLOSE_WRITE /path/to/processing/script.py $#
/var/ss04/ IN_CLOSE_WRITE /path/to/processing/script.py $#
Then to register this config with the incrond daemon, you'd run
incrontab /path/to/incron_config
That's all there is to it. Now whenever a file is created in /var/ss01, /var/ss02, /var/ss03 or /var/ss04, the command
/path/to/processing/script.py $#
is run, with $# replaced by the name of the newly created file.
This will obviate the need to store/compare hashes, and files will only get processed once -- immediately after they are created.
Just make sure your processing script does not write into the top level of the monitored directories.
If it does, then incrond will notice the new file created, and launch script.py again, sending you into an infinite loop.
incrond monitors individual directories, and does not recursively monitor subdirectories. So you could direct tshark to write to /var/ss01/tobeprocessed, use incron to monitor /var/ss01/tobeprocessed, and have your script.py write to /var/ss01, for example.
PS. There is also a python interface to inotify, called pyinotify. Unlike incron, pyinotify can recursively monitor subdirectories. However, in your case, I don't think the recursive monitoring feature is useful or necessary.
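For reference, a minimal pyinotify sketch of the same IN_CLOSE_WRITE trigger, using the directories from the incron config above:
import pyinotify

class Handler(pyinotify.ProcessEvent):
    def process_IN_CLOSE_WRITE(self, event):
        # Called once per capture file that has been fully written.
        print('processing', event.pathname)

wm = pyinotify.WatchManager()
for d in ('/var/ss01', '/var/ss02', '/var/ss03', '/var/ss04'):
    wm.add_watch(d, pyinotify.IN_CLOSE_WRITE)
pyinotify.Notifier(wm, Handler()).loop()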
I don't know enough about what is in these files, so this may not work for you, but if you have only one intended consumer, I would recommend using directories and moving the files to reflect their state. Specifically, you could have a dir structure like
/waiting
/progress
/done
and use the relative atomicity of mv to change the "state" of each file. (Whether mv is truly atomic depends on your filesystem, I believe.)
When your processing task wants to work on a file, it moves it from waiting to progress (and makes sure that the move succeeded). That way, no other task can pick it up, since it's no longer waiting. When the file is complete, it gets moved from progress to done, where a cleanup task might delete or archive old files that are no longer needed.
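A minimal sketch of the claim step, assuming the state directories live on one filesystem so the rename stays atomic (the base path is a hypothetical example):
import os

def claim(name, base='/var/data'):
    # Moving waiting/ -> progress/ succeeds for exactly one worker,
    # so a file cannot be picked up twice.
    try:
        os.rename(os.path.join(base, 'waiting', name),
                  os.path.join(base, 'progress', name))
        return True
    except OSError:
        return False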
I see several issues.
If you have overlapping cron jobs, you need a locking mechanism to control access. Only allow one process at a time to eliminate the overlap problem. You might set up a shell script to do that: create a 'lock' by making a directory (mkdir is atomic), process the data, then delete the lock directory. If the shell script finds the directory already exists when it tries to create it, you know another copy is already running and it can just exit (see the Python sketch below).
If you can't change the cron table(s), then just rename the executable and give your shell script the same name as the old executable.
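Here is the locking idea sketched in Python rather than a shell script (the lock directory path is an arbitrary, hypothetical choice):
import os
import sys

LOCK_DIR = '/tmp/capture_processing.lock'  # arbitrary, hypothetical path

try:
    os.mkdir(LOCK_DIR)      # atomic: only one run can succeed
except OSError:
    sys.exit(0)             # a previous run is still active
try:
    pass                    # ... process the capture files ...
finally:
    os.rmdir(LOCK_DIR)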
Hashes are not guaranteed to be unique identifiers for files; it's likely they are, but it's not absolutely guaranteed.
Why not just move a processed file to a different directory?
You mentioned overlapping cron jobs. Does this mean one conversion process can start before the previous one has finished? Then you should perform the move at the beginning of the conversion. If you are worried about an interrupted conversion, use an intermediate directory, and move to a final directory after completion.
If I'm reading the code correctly, you're updating the database (by which I mean the log of files processed) at the very end. So when you have a huge file that's being processed and not yet complete, another cron job will 'legally' start working on it, with both completing successfully and resulting in two entries in the database.
I suggest you move the logging-to-database up, so that it acts as a lock for subsequent cron jobs, and add a 'success' or 'completed' entry at the very end. The latter part is important, as something that is shown as processing but doesn't have a completed state (coupled with the notion of time) can programmatically be concluded as an error. (That is to say, a cron job tried processing it but never completed it, and the log shows it as processing for a week!)
To summarize
Move up the log-to-database so that it would act as a lock
Add a 'success' or 'completed' state, which gives you the notion of an errored state (a sketch follows below)
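Sketched with sqlite3 (the question does not show which database is actually in use), the claim-then-complete pattern could look like this:
import sqlite3

conn = sqlite3.connect('processed.db')
conn.execute("CREATE TABLE IF NOT EXISTS files"
             " (hash TEXT PRIMARY KEY, state TEXT)")

def try_claim(file_hash):
    # Inserting first makes the PRIMARY KEY act as the lock.
    try:
        with conn:
            conn.execute("INSERT INTO files VALUES (?, 'processing')",
                         (file_hash,))
        return True
    except sqlite3.IntegrityError:
        return False  # already claimed (or completed) by another run

def mark_completed(file_hash):
    with conn:
        conn.execute("UPDATE files SET state = 'completed' WHERE hash = ?",
                     (file_hash,))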
PS: Don't take it the wrong way, but the code is a little hard to understand. I am not sure whether I understand it at all.

Does python have hooks into EXT3

We have several cron jobs that ftp proxy logs to a centralized server. These files can be rather large and take some time to transfer. Part of the requirement of this project is to provide a logging mechanism in which we log the success or failure of these transfers. This is simple enough.
My question is, is there a way to check if a file is currently being written to? My first solution was to just check the file size twice within a given timeframe and see whether it changed. But a co-worker said that it may be possible to hook into the EXT3 file system via Python and check the attributes to see if the file is currently being appended to. My Google-Fu came up empty.
Is there a module for EXT3 or something else that would allow me to check the state of a file? The server is running Fedora Core 9 with EXT3 file system.
No need for ext3-specific hooks; just check lsof, or more exactly /proc/<pid>/fd/* and /proc/<pid>/fdinfo/* (that's where lsof gets its info, AFAICT). There you can check if the file is open, if it's writeable, and the 'cursor' position.
That's not the whole picture, though: anything more happens in process space by the stdlib of the writing process. Most writes are buffered and the kernel only sees bigger chunks of data, so any 'ext3-aware' monitor wouldn't see that either.
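A sketch of that check from Python, walking /proc directly (reading other processes' fd entries requires sufficient permissions, typically root):
import os

def is_open(path):
    target = os.path.realpath(path)
    for pid in filter(str.isdigit, os.listdir('/proc')):
        fd_dir = os.path.join('/proc', pid, 'fd')
        try:
            fds = os.listdir(fd_dir)
        except OSError:
            continue  # process exited or access denied
        for fd in fds:
            try:
                if os.path.realpath(os.path.join(fd_dir, fd)) == target:
                    return True
            except OSError:
                continue
    return False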
There are no ext3 hooks to check what you want directly.
I suppose you could dig through the source code of the Linux fuser command, replicate the part that finds which process owns a file, and watch that resource. When no one has the file open any longer, it's done transferring.
Another approach:
Your cron jobs should signal that they're finished.
We have our cron jobs that transfer files write an empty filename.finished after filename has been transferred. Another approach is to transfer to a temporary name, e.g. filename.part, and then rename it to filename; renaming is atomic. In both cases you check repeatedly for the presence of filename or filename.finished.
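A sketch of the temporary-name approach on the receiving side, assuming source and destination are on the same filesystem so the final rename is atomic:
import os
import shutil

def deliver(src, dst):
    tmp = dst + '.part'
    shutil.copyfile(src, tmp)  # the slow transfer targets the .part name
    os.rename(tmp, dst)        # atomic: readers only ever see a complete file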
