I found this answer, and it somewhat provides what I needed, but I wanted to ask about any problems that can occur when storing and pulling files from Dropbox based on date.
I have an employee list with the filename empList.txt sitting in a Dropbox folder named empList-20171106_183150. The folder name has the year, month, day, and time down to the second appended to it (YYYYMMDD_HHMMSS).
Locally, I have a Python script that keeps a log (a text file) containing just the date of the last time the script downloaded the updated list. The log looks like this if the script last ran on Nov 01 2017 at 9 AM:
20171101_090020
If I used Dropbox and a script written in Python to download the latest version based on the date/time, are there any disadvantages to doing this?
I just compare the date stored in the log to the date appended to the folder. If the date of the folder in Dropbox is greater, a download is needed. My only concern is that during the date comparison and download, one of the managers might upload a new list, meaning I would have to run the script again.
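For reference, the comparison itself is simple; a minimal sketch, with the two timestamps hard-coded for illustration:
from datetime import datetime

FMT = "%Y%m%d_%H%M%S"  # matches the YYYYMMDD_HHMMSS convention above
last_run = datetime.strptime("20171101_090020", FMT)      # from the local log
folder_stamp = datetime.strptime("20171106_183150", FMT)  # parsed from the folder name
if folder_stamp > last_run:
    print("download needed")
The race sits between this check and the download completing, which is exactly the window where a manager's upload can slip in.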
How do full products such as MalwareBytes or internet security software manage a user downloading an update when they make a new update available at the same time? For me, I just run the update again to make sure a new update wasn't made available while I was checking/updating.
I wouldn't recommend using date comparisons, because of potential issues such as the race condition you mentioned.
The Dropbox API exposes ways to tell if things have changed instead. Specifically, when downloading the file, you should store the metadata for the version of the file you downloaded. In particular, FileMetadata.rev or FileMetadata.content_hash would be useful.
If you then check again later and either of those values is different from the last one you downloaded, you know something has changed, and you should re-download.
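For example, with the official Dropbox Python SDK, a minimal sketch (the token, path, and log handling are placeholders):
import dropbox

dbx = dropbox.Dropbox("YOUR_ACCESS_TOKEN")  # placeholder token

def check_and_download(path, last_rev):
    # files_get_metadata returns a FileMetadata with .rev and .content_hash
    md = dbx.files_get_metadata(path)
    if md.rev != last_rev:
        # pin the download to this exact revision, so an upload that
        # happens mid-check cannot slip in unnoticed
        dbx.files_download_to_file("empList.txt", path, rev=md.rev)
    return md.rev  # persist this for the next run
Pinning rev= on the download also addresses the race in the question: if a manager uploads while you are checking, the next check will see a new rev and trigger another download.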
I'm currently working on a Python tkinter Windows-based application where I need to get the last modified time of a disk partition. My main aim is to get the latest updated time of a partition, where the host system user might have created a file/folder, deleted some files, or made other changes to some files. I have tried this using Python's os.stat(), but it only provides the modified date of existing files; it fails in the case of a deleted file. The same goes for the PowerShell command Get-ChildItem | Sort-Object -Descending -Property LastWriteTime | select -First 1: it provides the last write time with respect to the contents present in the main directory, but does not account for file/folder deletion.
In the application, I want to compare the partition's state changes, i.e. whether the user has made changes to the disk partition since the last use of the application. Another option would be calculating a hash value for the disk partition, but that is too time-consuming; I need the result in just a few seconds.
This is my first interaction on Stack Overflow as a questioner. Looking forward to getting helpful answers from the community.
I am working on a project right now that uses data from a large XML database file (usually around 8 GB) pulled from a website. The website updates this database file monthly, so every month there's a newer and more accurate database file.
I started my project about a year ago, so it is using a database file from February 2019. For the sake of people using my program, I would like for the database file to be replaced with the new one from each month when that gets rolled out.
How could I go about implementing this in my project so I don't have to manually replace the file with a newer one each month? Is it something I should write into the program? If so, it would only update when the program is run. Or is there a way to have some script automatically check once a month?
Note: this project is not being used by people yet; it has a long way to go, but I am trying to figure out how to implement these features early on, before I get to a point where I can publish it.
I would first find out whether there is an API built on top of that XML data that you could leverage, instead of downloading the XML into your own site. That way you always get the latest version of the data, since you're pulling it on demand.
However, an on-demand integration wouldn't be a good idea if you would be hitting the API with any kind of heavy frequency, or if you would be pulling large datasets from said API. In that case, you need an ETL integration. Look into open-source ETL tools (just Google it) to help move that data in an automated fashion; I would recommend importing the XML into MongoDB or some other DB, and pull the data from there instead of reading it from a flat file.
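For the import step, a minimal sketch with pymongo and a streaming XML parse (the element tag and database names are made up):
import xml.etree.ElementTree as ET
from pymongo import MongoClient

coll = MongoClient("mongodb://localhost:27017")["mydb"]["records"]  # placeholder names

def load_xml(path, tag="record"):  # "record" is a hypothetical element name
    batch = []
    for _, elem in ET.iterparse(path, events=("end",)):
        if elem.tag == tag:
            batch.append({child.tag: child.text for child in elem})
            elem.clear()  # release memory; essential for multi-GB files
            if len(batch) >= 1000:
                coll.insert_many(batch)
                batch = []
    if batch:
        coll.insert_many(batch)
iterparse keeps memory usage flat, so an 8 GB file can be loaded without reading it all at once.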
And if you absolutely have to have it as a flat file, look into using Gatsby; it's a framework for static websites that need to be rebuilt every once in a while.
It's a kind of open question, but please bear with me.
I am working on several projects (mainly with pandas) and I have created my standard approach to manage them:
1. create a main folder for all files in a project
2. create a data folder
3. have all the output in another folder
and so on.
One of my main activities is data cleaning, and in order to standardize it I have created a dictionary file where I store the various translations of the same entity (e.g. USA, US, United States), so that the files I produce are consistent.
Every time I create a new project, I copy the dictionary file in the data directory and then:
xls = pd.ExcelFile(r"data/dictionary.xlsx")
df_area = xls.parse("area")
and then, to translate the country name into my standard one, I call:
join_column, how_join = "country", "inner"
df_ct = pd.concat([
    df_ct.merge(df_area, left_on=join_column, right_on="country_name", how=how_join),
    df_ct.merge(df_area, left_on=join_column, right_on="alternative01", how=how_join),
])
and finally I check that I am not losing any records to a missed join.
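For what it's worth, that check can be made explicit; a small sketch reusing the names above:
n_before = len(df_ct)  # capture the row count before the merges
# ... run the pd.concat of the two merges shown above ...
assert len(df_ct) == n_before, "some rows were lost (or duplicated) in the join"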
Over and over the same thing.
I would like to find a way to remove all this unnecessary copy and paste (of the file and of the code). Also, the files I used on the first projects are already deprecated, and I need to update them (and sometimes the code) when I need to process new data. Sometimes I also lose track of where the latest dictionary file is! Overall it's a lot of maintenance, which I believe could be avoided.
Is creating my own package the way to go, or is it a little too ambitious?
Is there another shortcut? Overall it's not a lot of code, but multiplied across several projects.
Thanks for any insight, your time going through this is appreciated.
At the end I decided to create my own package.
It required some time, so I am happy to share the details of the process (I run Python in Jupyter on Windows).
The first step is to decide where to store the code.
In my case it was C:\Users\my_user\Documents
You need to add this directory to the list of directories where Python looks for packages. This is achieved by running the following statements:
import sys
sys.path.append("C:\\Users\\my_user\\Documents")
In order to run the above statements each time you start Python, they must be included in a file in the startup directory (this directory might vary depending on your installation):
C:\Users\my_user\.ipython\profile_default\startup
The file can be named "00-first.py" ("50-middle.py" or "99-last.py" will also work).
To verify everything is working, restart Python and run the command:
print(sys.path)
You should be able to see your directory at this point.
Create a folder with the package name in your directory, and a subfolder inside it (I prefer not to have code in the main package folder):
C:\Users\my_user\Documents\my_package\my_subfolder
Put an empty file named __init__.py in each of the two folders, my_package and my_subfolder. At this point you should already be able to import your empty package from Python:
import my_package as my_pack
Inside my_subfolder, create a file (my_code.py) which will store the actual code:
def my_function(name):
    print("Hello " + name)
Modify the outer __init__.py file to include shortcuts. Add the following:
from my_package.my_subfolder.my_code import my_function
You should now be able to run the following in Python:
my_pack.my_function("World!")
Hope you find it useful!
I need to access the date and time in Ubuntu from a program. This program may not use any commands to do this, so making a call to date is not an option.
Is there a file or files which hold this information? Where can it be found?
No; read time(7). There are some system calls (listed in syscalls(2)) to query the time (since the Unix Epoch), in particular time(2) and clock_gettime(2).
You then need to convert that time into a string, probably using localtime(3) and then strftime(3). That conversion uses some files, notably /etc/timezone (and some under /usr/share/zoneinfo/), according to the TZ variable (see environ(7) and locale(7)).
BTW, date is free software (so you could study its source code). And you could strace(1) it.
See also vdso(7) and this.
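For what it's worth, the same chain of calls surfaces in Python (a sketch; time.clock_gettime needs Python 3.3+ on Unix):
import time

ts = time.clock_gettime(time.CLOCK_REALTIME)      # wraps clock_gettime(2): seconds since the Epoch
local = time.localtime(ts)                        # wraps localtime(3), honouring TZ
print(time.strftime("%Y-%m-%d %H:%M:%S", local))  # wraps strftime(3)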
I'm writing a Python app that connects to Perforce on a daily basis. The app gets the contents of an Excel file on Perforce, parses it, and copies some data to a database. The file is rather big, so I would like to keep track of which revision of the file the app last read in the database; this way I can check whether the revision number is higher and avoid reading the file if it has not changed.
I could make do with getting the revision number, or the changelist number from when the file was last checked in/changed. Or perhaps you have another suggestion on how to accomplish my goal of avoiding an unnecessary read of the file.
I'm using Python 2.7 and the Perforce Python API.
Several options come to mind.
The simplest approach would be to always let your program use the same client workspace and let it sync the file. Your program could call p4 sync and see whether it gets a new version or not, and continue only if it does. This approach has the advantage that you don't need to remember any state/version from the previous run of your program.
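A minimal sketch of that approach with the Perforce Python API (the depot path is a placeholder):
from P4 import P4

p4 = P4()
p4.exception_level = 1  # don't raise on warnings such as "file(s) up-to-date"
p4.connect()
synced = p4.run_sync("//depot/path/yourfile")  # placeholder path
p4.disconnect()
if synced:  # an empty result means the client already had the head revision
    print("new version synced; process the file")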
If you don't like using a fixed client you could let your program always check the current head revision of the file in question:
p4 fstat //depot/path/yourfile | grep headRev | sed 's/.*headRev \(.*\)/\1/'
You could store that version for the next run of your program in some temp file and compare versions each time.
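A sketch of the same idea through the API instead of the shell (the depot path and state file are placeholders):
from P4 import P4

p4 = P4()
p4.connect()
head_rev = int(p4.run_fstat("//depot/path/yourfile")[0]["headRev"])  # placeholder path
p4.disconnect()

try:
    last_rev = int(open("last_rev.txt").read())  # placeholder state file
except IOError:
    last_rev = 0

if head_rev > last_rev:
    print("new revision %d; process the file" % head_rev)
    with open("last_rev.txt", "w") as f:
        f.write(str(head_rev))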
If you run your program at fixed times (e.g. via cron), you could check the last modification time (either with p4 filelog or with p4 fstat); if that time falls between the time of the last run and the current time, you need to process the file. This option is a bit intricate, since you need to parse the different time formats.