Python: sort files by datetime in more detail

I'm using Python 2.7 on Ubuntu. How do I sort files by modification time with more precision? I have a script that creates a number of txt files within split seconds. I modified another script so it can find the oldest and youngest file, but it seems to compare only whole seconds, not milliseconds.
My print output:
output_04.txt 06/08/12 12:00:18
output_05.txt 06/08/12 12:00:18
-----------
oldest: output_05.txt
youngest: output_05.txt
-----------
But the oldest file should actually be "output_04.txt".
Does anyone know how to do this? Thanks!
Updated:
Thanks everyone.
I tried all of the code posted, but I still can't get the output I need.
Sorry guys, I do appreciate your help, but as in my example above the files have the same time: if the full date, hour, minute, and second are all the same, it has to compare by millisecond, doesn't it? Correct me if I'm wrong. Thanks everyone! Cheers!

You can use os.path.getmtime(path_to_file) to get the modification time of the file.
One way of ordering the list of files is to create a list of them with os.listdir and get the modification time of each one. You would have a list of tuples and you could order it by the second element of the tuple (which would be the modification time).
You can also check the resolution of os.path.getmtime with os.stat_float_times(). If the latter returns True, then os.path.getmtime returns a float (this indicates you have more resolution than seconds).
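For example, a minimal sketch (assuming the output_*.txt files sit in the current directory) that sorts by the float mtime, so sub-second differences are used wherever the filesystem records them:
import os
import glob

os.stat_float_times(True)  # ask for float timestamps (already the default on recent versions)

files = glob.glob('output_*.txt')
# Sort by modification time; float mtimes keep the sub-second part
# when the filesystem stores it.
files.sort(key=os.path.getmtime)

print('oldest:   ' + files[0])
print('youngest: ' + files[-1])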

def get_files(path):
    import os
    if os.path.exists(path):
        os.chdir(path)
        files = os.listdir(path)
        items = {}

        def get_file_details(f):
            return {f: os.path.getmtime(f)}

        results = [get_file_details(f) for f in files]
        for result in results:
            for key, value in result.items():
                items[key] = value
        return items

items = get_files(path)
v = sorted(items, key=items.get)
get_files takes path as an argument and, if the path exists, changes the current directory to it and builds the list of files it contains. get_file_details returns the last-modified time for a file.
get_files returns a dict with the filename as key and the modified time as value. The standard sorted is then used to sort the filenames by those values; the reverse parameter can be passed to sort ascending or descending.
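For instance, a small usage sketch (assuming path points at the directory holding the .txt files):
path = '/tmp/output_files'  # hypothetical directory
items = get_files(path)
for name in sorted(items, key=items.get):  # oldest first
    print('%s %f' % (name, items[name]))
newest_first = sorted(items, key=items.get, reverse=True)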

You may not be able to compare the milliseconds because that information simply isn't stored.
The classic stat(2) call returns three time_t fields:
- access time (atime)
- status-change time (ctime)
- last modification time (mtime)
time_t is an integer representing the number of seconds (not milliseconds) elapsed since 00:00, Jan 1 1970 UTC.
So with those fields the finest detail you get in a file time is one second. Some filesystems do store sub-second timestamps and newer stat interfaces expose them (Python can return st_mtime as a float, see os.stat_float_times), but if your filesystem only records whole seconds, there is nothing finer to compare.
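A quick way to check what your platform actually gives you (a sketch, assuming a file named output_04.txt exists):
import os

os.stat_float_times(True)  # ask for float timestamps (Python 2.x)
mtime = os.stat('output_04.txt').st_mtime
print(repr(mtime))
# e.g. 1344225618.0      -> only whole seconds are stored
# e.g. 1344225618.734512 -> sub-second resolution is available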

Hi, try the following code:
# retrieve the file information from a selected folder
# sort the files by last modified date/time and display in order newest file first
# tested with Python24 vegaseat 21jan2006
import os, glob, time

# use a folder you have ...
root = 'D:\\Zz1\\Cartoons\\'   # one specific folder
#root = 'D:\\Zz1\\*'           # all the subfolders too

date_file_list = []
for folder in glob.glob(root):
    print "folder =", folder
    # select the type of file, for instance *.jpg or all files *.*
    for file in glob.glob(folder + '/*.*'):
        # retrieves the stats for the current file as a tuple
        # (mode, ino, dev, nlink, uid, gid, size, atime, mtime, ctime)
        # the tuple element mtime at index 8 is the last-modified-date
        stats = os.stat(file)
        # create tuple (year yyyy, month(1-12), day(1-31), hour(0-23), minute(0-59), second(0-59),
        # weekday(0-6, 0 is monday), Julian day(1-366), daylight flag(-1,0 or 1)) from seconds since epoch
        # note: this tuple can be sorted properly by date and time
        lastmod_date = time.localtime(stats[8])
        #print image_file, lastmod_date # test
        # create list of tuples ready for sorting by date
        date_file_tuple = lastmod_date, file
        date_file_list.append(date_file_tuple)

#print date_file_list # test
date_file_list.sort()
date_file_list.reverse()  # newest mod date now first

print "%-40s %s" % ("filename:", "last modified:")
for file in date_file_list:
    # extract just the filename
    folder, file_name = os.path.split(file[1])
    # convert date tuple to MM/DD/YY HH:MM:SS format
    file_date = time.strftime("%m/%d/%y %H:%M:%S", file[0])
    print "%-40s %s" % (file_name, file_date)
Hope this will help
Thank You

Related

A Pythonic way to delete older logfiles

I'm cleaning up log files whenever there are more than 50 of them, deleting the oldest first. This is the only thing I've been able to come up with, and I feel like there is a better way to do it. I'm currently getting a pylint warning for using a lambda in get_time.
def clean_logs():
    log_path = "Runtime/logs/"
    max_log_files = 50

    def sorted_log_list(path):
        get_time = lambda f: os.stat(os.path.join(path, f)).st_mtime
        return list(sorted(os.listdir(path), key=get_time))

    del_list = sorted_log_list(log_path)[0:(len(sorted_log_list(log_path)) - max_log_files)]
    for x in del_list:
        pathlib.Path(pathlib.Path(log_path).resolve() / x).unlink(missing_ok=True)

clean_logs()
The two simplified solutions below accomplish different tasks, so both are included for flexibility. Obviously, you can wrap this in a function if you like.
Both code examples break down into the following steps:
Set the date delta (as an epoch reference) for the mtime comparison: N days prior to today.
Collect the full path to all files matching a given extension.
Create a generator (or list) to hold the files to be deleted, using mtime as a reference.
Iterate the results and delete all applicable files.
Removing log files older than (n) days:
import os
from datetime import datetime as dt
from glob import glob
# Setup
path = '/tmp/logs/'
days = 5
ndays = dt.now().timestamp() - days * 86400
# Collect all files.
files = glob(os.path.join(path, '*.sql.gz'))
# Choose files to be deleted.
to_delete = (f for f in files if os.stat(f).st_mtime < ndays)
# Delete files older than (n) days.
for f in to_delete:
    os.remove(f)
Keeping the (n) latest log files
To keep the (n) latest log files, simply replace the to_delete definition above with:
n = 50
to_delete = sorted(files, key=lambda x: os.stat(x).st_mtime)[:len(files)-n]

How to divide a large image dataset into groups of pictures and save them inside subfolders using python?

I have an image dataset that looks like this: Dataset
The timestep of each image is 15 minutes (as you can see, the timestamp is in the filename).
Now I would like to group those images into 3-hour-long sequences and save those sequences inside subfolders that would each contain 12 images (= 3 hrs).
The result would ideally look like this:
Sequences
I have tried using os.walk to loop inside the folder where the image dataset is saved, then I created a dataframe using pandas because I thought I could handle the files more easily, but I think I am totally off target here.
Since you said you only need groups of 12 files (assuming the timestamps are in order and 12 is the exact number you need), the following code can help you:
import os
import shutil

output_location = "location where you want to save them"  # better not to be in the same location with the dataset
dataset_path = "your data set"

files = [os.path.join(path, file) for path, subdirs, files in os.walk(dataset_path) for file in files]

nr_of_files = 0
folder_name = ""
for index in range(len(files)):
    if nr_of_files == 0:
        folder_name = os.path.join(output_location, files[index].split("\\")[-1].split(".")[0])
        os.mkdir(folder_name)
        shutil.copy(files[index], files[index].replace(dataset_path, folder_name))
        nr_of_files += 1
    elif nr_of_files == 11:
        shutil.copy(files[index], files[index].replace(dataset_path, folder_name))
        nr_of_files = 0
    else:
        shutil.copy(files[index], files[index].replace(dataset_path, folder_name))
        nr_of_files += 1
Explaining the code:
files holds the full path of every file under dataset_path. You set that variable and files will contain the entire path to all files.
The for loop iterates over the entire length of files.
nr_of_files counts each group of 12 files. If it's 0, the code creates a folder named after files[index] in the output location and copies the file there (replacing the input path with the output path).
If it's 11 (counting from 0, index 11 means the 12th file), it copies the file and sets nr_of_files back to 0 so another folder is created.
The last else simply copies the file and increments nr_of_files.
The timestep of each image is 15 minutes (as you can see, the timestamp is in the filename).
Now I would like to group those images in 3hrs long sequences and save those sequences inside subfolders that would contain respectively 12 images (=3hrs).
I suggest exploiting the built-in datetime library to get the desired result. For each file:
get the substring holding the timestamp
parse it into a datetime.datetime instance using datetime.datetime.strptime
convert that instance into seconds since the epoch using the .timestamp method
integer-divide (//) that number by 10800 (the number of seconds in 3 hrs)
convert the value you get into a str and use it as the target subfolder name
A minimal sketch of these steps is shown below.
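For instance, a sketch of those steps, assuming hypothetical filenames like img_20220101_0015.png (the substring extraction and the format string must be adapted to your real filenames):
import os
import shutil
from datetime import datetime

dataset_path = "dataset"       # hypothetical input folder
output_location = "sequences"  # hypothetical output folder

for name in sorted(os.listdir(dataset_path)):
    # e.g. "img_20220101_0015.png" -> "20220101_0015"
    stamp = name.split("_", 1)[1].rsplit(".", 1)[0]
    dt = datetime.strptime(stamp, "%Y%m%d_%H%M")
    # Timestamps that fall in the same 3-hour window share a bucket.
    bucket = str(int(dt.timestamp()) // 10800)
    target = os.path.join(output_location, bucket)
    os.makedirs(target, exist_ok=True)
    shutil.copy(os.path.join(dataset_path, name), target)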

Is there a function for finding the differences between the last time stamp of one netCDF file and the first time stamp of the next netCDF file?

I have a list of netCDF files. I've opened each of these netCDF files in xarray like this:
files = ['file_1.nc', 'file_2.nc', 'file_3.nc', 'file_4.nc']
for file in files:
xarray_object = xr.open_dataset(file)
I next want to take the last time stamp from file_1.nc and subtract it from the first time stamp from file_2.nc, and continue this pattern throughout the entire files list (so file_2.nc[first time stamp] - file_1.nc[last time stamp], file_3.nc[first time stamp] - file_2.nc[last time stamp], and so on).
I started attacking this problem by:
time_diff = xarray_object['time'][-1] - xarray_object['time'][0]
But this only subtracts the last time stamp from the first time stamp of file_1.nc, then the last time stamp from the first time stamp of file_2.nc, and so on.
I'm not sure the best way to get the loop to look at the time stamps of two separate files at once.
Any help would be greatly appreciated!
I think you're mostly looking for zip?
https://docs.python.org/3.9/library/functions.html#zip
# glob is convenient for getting multiple paths with a wildcard e.g.
import glob
import xarray as xr
paths = sorted(glob.glob("*.nc"))
datasets = [xr.open_dataset(path) for path in paths]
# Get the first time for the second dataset onward
first_times = [ds["time"][0] for ds in datasets[1:]]
# Get the last time, except for the last dataset
last_times = [ds["time"][-1] for ds in datasets[:-1]]
# Use zip to go through both lists at once
time_deltas = [t1 - t0 for t0, t1 in zip(first_times, last_times)]
P.S.
With glob, you may want to check that the sorting goes as desired; you can always use the key argument, which takes a function, to make sure you're e.g. sorting by the number in the filename:
unsorted = ["file_02", "file_1"]
print(sorted(unsorted))
More robustly, we could split on the underscore, and turn whatever comes after the underscore into an integer:
print(sorted(unsorted, key=lambda x: int(x.split("_")[-1])))
Of course, if you just provide the list yourself, you're in full control anyway...

Time-series data analysis using scientific python: continuous analysis over multiple files

The Problem
I'm doing time-series analysis. Measured data comes from sampling the voltage output of a sensor at 50 kHz and then dumping that data to disk as separate files in hour-long chunks. Data is saved to an HDF5 file using pytables as a CArray. This format was chosen to maintain interoperability with MATLAB.
The full data set is now multiple TB, far too large to load into memory.
Some of my analysis requires me to iterate over the full data set. For analysis that requires me to grab chunks of data, I can see a path forward through creating a generator method. I'm a bit uncertain of how to proceed with analysis that requires a continuous time series.
Example
For example, let's say I'm looking to find and categorize transients using some moving-window process (e.g. wavelet analysis) or apply an FIR filter. How do I handle the boundaries, either at the end or beginning of a file or at chunk boundaries? I would like the data to appear as one continuous data set.
Request
I would love to:
Keep the memory footprint low by loading data as necessary.
Keep a map of the entire data set in memory so that I can address the data set as I would a regular pandas Series object, e.g. data[time1:time2].
I'm using scientific python (Enthought distribution) with all the regular stuff: numpy, scipy, pandas, matplotlib, etc. I only recently started incorporating pandas into my work flow and I'm still unfamiliar with all of its capabilities.
I've looked over related stackexchange threads and didn't see anything that exactly addressed my issue.
EDIT: FINAL SOLUTION.
Based upon the helpful hints, I built an iterator that steps over files and returns chunks of arbitrary size---a moving window that hopefully handles file boundaries with grace. I've added the option of padding the front and back of each of the windows with data (overlapping windows). I can then apply a succession of filters to the overlapping windows and remove the overlaps at the end. This, I hope, gives me continuity.
I haven't yet implemented __getitem__ but it's on my list of things to do.
Here's the final code. A few details are omitted for brevity.
class FolderContainer(readdata.DataContainer):

    def __init__(self,startdir):
        readdata.DataContainer.__init__(self,startdir)
        self.filelist = None
        self.fs = None
        self.nsamples_hour = None
        # Build the file list
        self._build_filelist(startdir)

    def _build_filelist(self,startdir):
        """
        Populate the filelist dictionary with active files and their associated
        file date (YYYY,MM,DD) and hour.
        Each entry in 'filelist' has the form (abs. path : datetime) where the
        datetime object contains the complete date and hour information.
        """
        print('Building file list....',end='')
        # Use the full file path instead of a relative path so that we don't
        # run into problems if we change the current working directory.
        filelist = { os.path.abspath(f):self._datetime_from_fname(f)
                     for f in os.listdir(startdir)
                     if fnmatch.fnmatch(f,'NODE*.h5')}
        # If we haven't found any files, raise an error
        if not filelist:
            msg = "Input directory does not contain Illionix h5 files."
            raise IOError(msg)
        # Filelist is a ordered dictionary. Sort before saving.
        self.filelist = OrderedDict(sorted(filelist.items(),
                                           key=lambda t: t[0]))
        print('done')

    def _datetime_from_fname(self,fname):
        """
        Return the year, month, day, and hour from a filename as a datetime
        object
        """
        # Filename has the prototype: NODE##-YY-MM-DD-HH.h5. Split this up and
        # take only the date parts. Convert the year form YY to YYYY.
        (year,month,day,hour) = [int(d) for d in re.split('-|\.',fname)[1:-1]]
        year+=2000
        return datetime.datetime(year,month,day,hour)

    def chunk(self,tstart,dt,**kwargs):
        """
        Generator expression from returning consecutive chunks of data with
        overlaps from the entire set of Illionix data files.

        Parameters
        ----------
        Arguments:
            tstart: UTC start time [provided as a datetime or date string]
            dt: Chunk size [integer number of samples]
        Keyword arguments:
            tend: UTC end time [provided as a datetime or date string].
            frontpad: Padding in front of sample [integer number of samples].
            backpad: Padding in back of sample [integer number of samples]
        Yields:
            chunk: generator expression
        """
        # PARSE INPUT ARGUMENTS

        # Ensure 'tstart' is a datetime object.
        tstart = self._to_datetime(tstart)
        # Find the offset, in samples, of the starting position of the window
        # in the first data file
        tstart_samples = self._to_samples(tstart)

        # Convert dt to samples. Because dt is a timedelta object, we can't use
        # '_to_samples' for conversion.
        if isinstance(dt,int):
            dt_samples = dt
        elif isinstance(dt,datetime.timedelta):
            dt_samples = np.int64((dt.day*24*3600 + dt.seconds +
                                   dt.microseconds*1000) * self.fs)
        else:
            # FIXME: Pandas 0.13 includes a 'to_timedelta' function. Change
            # below when EPD pushes the update.
            t = self._parse_date_str(dt)
            dt_samples = np.int64((t.minute*60 + t.second) * self.fs)

        # Read keyword arguments. 'tend' defaults to the end of the last file
        # if a time is not provided.
        default_tend = self.filelist.values()[-1] + datetime.timedelta(hours=1)
        tend = self._to_datetime(kwargs.get('tend',default_tend))
        tend_samples = self._to_samples(tend)

        frontpad = kwargs.get('frontpad',0)
        backpad = kwargs.get('backpad',0)

        # CREATE FILE LIST

        # Build the the list of data files we will iterative over based upon
        # the start and stop times.
        print('Pruning file list...',end='')
        tstart_floor = datetime.datetime(tstart.year,tstart.month,tstart.day,
                                         tstart.hour)
        filelist_pruned = OrderedDict([(k,v) for k,v in self.filelist.items()
                                       if v >= tstart_floor and v <= tend])
        print('done.')

        # Check to ensure that we're not missing files by enforcing that there
        # is exactly an hour offset between all files.
        if not all([dt == datetime.timedelta(hours=1)
                    for dt in np.diff(np.array(filelist_pruned.values()))]):
            raise readdata.DataIntegrityError("Hour gap(s) detected in data")

        # MOVING WINDOW GENERATOR ALGORITHM

        # Keep two files open, the current file and the next in line (que file)
        fname_generator = self._file_iterator(filelist_pruned)
        fname_current = fname_generator.next()
        fname_next = fname_generator.next()

        # Iterate over all the files. 'lastfile' indicates when we're
        # processing the last file in the que.
        lastfile = False
        i = tstart_samples
        while True:
            with tables.openFile(fname_current) as fcurrent, \
                 tables.openFile(fname_next) as fnext:
                # Point to the data
                data_current = fcurrent.getNode('/data/voltage/raw')
                data_next = fnext.getNode('/data/voltage/raw')
                # Process all data windows associated with the current pair of
                # files. Avoid unnecessary file access operations as we moving
                # the sliding window.
                while True:
                    # Conditionals that depend on if our slice is:
                    #   (1) completely into the next hour
                    #   (2) partially spills into the next hour
                    #   (3) completely in the current hour.
                    if i - backpad >= self.nsamples_hour:
                        # If we're already on our last file in the processing
                        # que, we can't continue to the next. Exit. Generator
                        # is finished.
                        if lastfile:
                            raise GeneratorExit
                        # Advance the active and que file names.
                        fname_current = fname_next
                        try:
                            fname_next = fname_generator.next()
                        except GeneratorExit:
                            # We've reached the end of our file processing que.
                            # Indicate this is the last file so that if we try
                            # to pull data across the next file boundary, we'll
                            # exit.
                            lastfile = True
                        # Our data slice has completely moved into the next
                        # hour.
                        i-=self.nsamples_hour
                        # Return the data
                        yield data_next[i-backpad:i+dt_samples+frontpad]
                        # Move window by amount dt
                        i+=dt_samples
                        # We've completely moved on the the next pair of files.
                        # Move to the outer scope to grab the next set of
                        # files.
                        break
                    elif i + dt_samples + frontpad >= self.nsamples_hour:
                        if lastfile:
                            raise GeneratorExit
                        # Slice spills over into the next hour
                        yield np.r_[data_current[i-backpad:],
                                    data_next[:i+dt_samples+frontpad-self.nsamples_hour]]
                        i+=dt_samples
                    else:
                        if lastfile:
                            # Exit once our slice crosses the boundary of the
                            # last file.
                            if i + dt_samples + frontpad > tend_samples:
                                raise GeneratorExit
                        # Slice is completely within the current hour
                        yield data_current[i-backpad:i+dt_samples+frontpad]
                        i+=dt_samples

    def _to_samples(self,input_time):
        """Convert input time, if not in samples, to samples"""
        if isinstance(input_time,int):
            # Input time is already in samples
            return input_time
        elif isinstance(input_time,datetime.datetime):
            # Input time is a datetime object
            return self.fs * (input_time.minute * 60 + input_time.second)
        else:
            raise ValueError("Invalid input 'tstart' parameter")

    def _to_datetime(self,input_time):
        """Return the passed time as a datetime object"""
        if isinstance(input_time,datetime.datetime):
            converted_time = input_time
        elif isinstance(input_time,str):
            converted_time = self._parse_date_str(input_time)
        else:
            raise TypeError("A datetime object or string date/time were "
                            "expected")
        return converted_time

    def _file_iterator(self,filelist):
        """Generator for iterating over file names."""
        for fname in filelist:
            yield fname
Sean, here's my 2c.
Take a look at this issue here which I created a while back. This is essentially what you are trying to do. This is a bit non-trivial.
Without knowing more details, I would offer a couple of suggestions:
HDFStore CAN read in a standard CArray type of format, see here
You can easily create a 'Series' like object that has nice properties of a) knowing where each file is and its extents, and uses __getitem__ to 'select' those files, e.g. s[time1:time2]. From a top-level view this might be a very nice abstraction, and you can then dispatch operations.
e.g.
class OutOfCoreSeries(object):

    def __init__(self, dir):
        .... load a list of the files in the dir where you have them ...

    def __getitem__(self, key):
        .... map the selection key (say its a slice, which 'time1:time2' resolves) ...
        .... to the files that make it up .... , then return a new Series that only
        .... those file pointers ....

    def apply(self, func, **kwargs):
        """ apply a function to the files """
        results = []
        for f in self.files:
            results.append(func(self.read_file(f)))
        return Results(results)
This can very easily get quite complicated. For instance, if you apply an operation that does a reduction that fits in memory, Results can simply be a pandas.Series (or Frame). However, you may be doing a transformation which requires you to write out a new set of transformed data files. If so, then you have to handle this.
Several more suggestions:
You may want to hold onto your data in possibly multiple ways that may be useful. For instance you say that you are saving multiple values in a 1-hour slice. It may be that you can split these 1-hour files instead into a file for each variable you are saving but save a much longer slice that then becomes memory readable.
You might want to resample the data to lower frequencies and work on those, loading the data in a particular slice as needed for more detailed work (see the sketch after these suggestions).
You might want to create a dataset that is queryable across time, e.g. high-low peaks at varying frequencies, maybe using the Table format (see here).
Thus you may have multiple variations of the same data. Disk space is usually much cheaper/easier to manage than main memory. It makes a lot of sense to take advantage of that.
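For instance, a rough sketch of the resampling idea mentioned above, using current pandas syntax (the timestamps, data, and aggregation choices here are hypothetical stand-ins):
import numpy as np
import pandas as pd

fs = 50_000                                  # 50 kHz sample rate
start = pd.Timestamp("2013-01-01 00:00:00")  # hypothetical file start time
raw = np.random.randn(fs * 60)               # stand-in for one minute of voltages

index = pd.date_range(start, periods=len(raw), freq="20us")  # 1 / 50 kHz = 20 microseconds
s = pd.Series(raw, index=index)

# Downsample to 1-second min/max/mean envelopes, which are small enough
# to keep in memory for the whole data set.
lowres = s.resample("1s").agg(["min", "max", "mean"])
print(lowres.head())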

Python: How do I iterate over several files with similar names (the variation in each name is the date)?

I wrote a program that filters files containing tweets to pull the location and time from specific ones. Each file contains one day's worth of tweets.
I would like to run this program over one year's worth of tweets, which would involve iterating over 365 files with names like 2011-*-*.tweets.dat.gz, with the stars representing numbers that complete the file name to make it a date for each day in the year.
Basically, I'm looking for code that will loop over 2011-01-01.tweets.dat.gz, 2011-01-02.tweets.dat.gz, ..., all the way through 2011-12-31.tweets.dat.gz.
What I'm imagining now is somehow telling the program to loop over all files with the name 2011-*.tweets.dat.gz, but I'm not sure exactly how that would work or how to structure it, or even if the * syntax is correct.
Any tips?
Easiest way is indeed with a glob:
from glob import iglob
for pathname in iglob("/path/to/folder/2011-*.tweets.dat.gz"):
    print pathname  # or do whatever
Use the datetime module:
>>> from datetime import datetime, timedelta
>>> d = datetime(2011,1,1)
>>> while d < datetime(2012,1,1):
...     filename = "{}{}".format(d.strftime("%Y-%m-%d"), '.tweets.dat.gz')
...     print filename
...     d = d + timedelta(days=1)
...
2011-01-01.tweets.dat.gz
2011-01-02.tweets.dat.gz
2011-01-03.tweets.dat.gz
2011-01-04.tweets.dat.gz
2011-01-05.tweets.dat.gz
2011-01-06.tweets.dat.gz
2011-01-07.tweets.dat.gz
2011-01-08.tweets.dat.gz
2011-01-09.tweets.dat.gz
2011-01-10.tweets.dat.gz
...
...
2011-12-27.tweets.dat.gz
2011-12-28.tweets.dat.gz
2011-12-29.tweets.dat.gz
2011-12-30.tweets.dat.gz
2011-12-31.tweets.dat.gz
