I have saved all the daily sales reports in a common folder. Each file is named with the corresponding date, e.g. 01-01-2019-Sales.csv, 02-01-2019-Sales.csv, etc. All the files are saved in the "C:\Desktop\Sales" folder. Now I want to extract and combine all the files dated between 05-01-2019 and 04-02-2019.
I know I can extract all the files with pandas using the code below:
import pandas as pd
import glob
import os
file_path = r'C:\Desktop\Sales'
all_files = glob.glob(os.path.join(file_path,'*.csv'))
df = pd.concat([pd.read_csv(f) for f in all_files], sort=False)
But my question is: how can I extract only the files between two given dates using pandas/Python, based on the date in each file name? E.g. extract only the files from 05-01-2019 to 04-02-2019.
What about this:
from datetime import datetime

date_format = '%d-%m-%Y'  # file names start with a DD-MM-YYYY stamp
start_date = datetime.strptime('05-01-2019', date_format)
end_date = datetime.strptime('04-02-2019', date_format)
all_csv_files = [x for x in os.listdir(file_path) if x.endswith('.csv')]
correct_date_files = [x for x in all_csv_files
                      if start_date <= datetime.strptime(x[:10], date_format) <= end_date]
df = pd.concat([pd.read_csv(os.path.join(file_path, f)) for f in correct_date_files],
               sort=False)
You basically just list all .csv files in the directory, parse the leading date stamp out of each name, and keep only the ones between the chosen dates. (Comparing the raw file names as strings would not work here, because DD-MM-YYYY strings do not sort chronologically.)
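As a quick standalone check of the date-parsing idea, here is a small sketch (the in_range helper is hypothetical, introduced only for illustration):

```python
from datetime import datetime

def in_range(name, start, end, fmt="%d-%m-%Y"):
    # Parse the leading DD-MM-YYYY stamp from a name like "05-01-2019-Sales.csv".
    # in_range is a hypothetical helper, not part of the answer above.
    stamp = datetime.strptime(name[:10], fmt)
    return start <= stamp <= end

start = datetime(2019, 1, 5)   # 05-01-2019
end = datetime(2019, 2, 4)     # 04-02-2019
print(in_range("05-01-2019-Sales.csv", start, end))  # True
print(in_range("04-01-2019-Sales.csv", start, end))  # False
```

Both range endpoints are inclusive, so 05-01-2019 and 04-02-2019 themselves are kept.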
I think this piece of code will help you:
import datetime
d1 = datetime.date(2019,1,1)
d2 = datetime.date(2019,2,1)
d3 = datetime.date(2019,1,20)
d4 = datetime.date(2019,2,20)
print(d1<d3<d2)
# True
print(d1<d4<d2)
# False
The dates can be compared lexically if you rearrange them to yyyy-mm-dd:
>>> L = ['01-01-2019-Sales.csv', '02-01-2019-Sales.csv']
>>> start = '2018-12-01'
>>> end = '2019-02-01'
>>> for file in L:
...     d, m, yr = file.split('-')[:3]
...     date = '-'.join([yr, m, d])
...     if start <= date <= end:
...         print(file)
...
01-01-2019-Sales.csv
02-01-2019-Sales.csv
Use the dates as comparison:
import pandas as pd
import glob
import os
from time import strptime

file_path = r'C:\Desktop\Sales'
all_files = glob.glob(os.path.join(file_path, '*.csv'))
date_format = '%d-%m-%Y-Sales.csv'  # matches e.g. 05-01-2019-Sales.csv
start_date = strptime('05-01-2019-Sales.csv', date_format)
end_date = strptime('04-02-2019-Sales.csv', date_format)
df = pd.concat([pd.read_csv(f) for f in all_files
                if start_date <= strptime(os.path.basename(f), date_format) <= end_date],
               sort=False)
strptime returns a struct_time, which compares chronologically, so no string tricks are needed.
Related
This is some code that will generate some random time-series data. Ultimately I am trying to save each day's data into a separate CSV file...
import pandas as pd
import numpy as np
from numpy.random import randint
np.random.seed(10) # added for reproductibility
rng = pd.date_range('10/9/2018 00:00', periods=1000, freq='1H')
df = pd.DataFrame({'Random_Number':randint(1, 100, 1000)}, index=rng)
I can print each day's data with this:
for idx, days in df.groupby(df.index.date):
print(days)
But how could I incorporate saving individual CSV files into a csv directory, each file named with the month & day of its first timestamp entry? (The code below does not work.)
for idx, days in df.groupby(df.index.date):
for day in days:
df2 = pd.DataFrame(list(day))
month_num = df.index.month[0]
day_num = df.index.day[0]
df2.to_csv('/csv/' + f'{month_num}' + '_' + f'{day_num}' + '.csv')
You could iterate over all available days, filter your dataframe, and then save.
# iterate over all available days
for date in set(df.index.date):
    # filter your dataframe
    filtered_df = df.loc[df.index.date == date].copy()
    # save it
    filename = date.strftime('%m_%d')  # filename represented as 'month_day'
    filtered_df.to_csv(f"./csv/{filename}.csv")
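A minimal sketch of the same idea using groupby directly; the frame here is a small stand-in for the question's random data, and the to_csv call is commented out so nothing is written to disk:

```python
import pandas as pd
import numpy as np

# 48 hourly rows spanning two calendar days, standing in for the question's data.
rng = pd.date_range('2018-10-09 00:00', periods=48, freq='H')
df = pd.DataFrame({'Random_Number': np.arange(48)}, index=rng)

# groupby(df.index.date) yields one sub-frame per calendar day,
# so each group can be written out directly under its own name.
for day, day_df in df.groupby(df.index.date):
    filename = day.strftime('%m_%d')
    # day_df.to_csv('./csv/' + filename + '.csv')  # uncomment to actually write
    print(filename, len(day_df))
```

This avoids the set/mask round trip: the groupby already hands you one filtered frame per day.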
I need to read one column from the multiple csv files present in a folder and then extract the minimum and maximum dates from that column.
For example, if I have the folder path "/usr/abc/xyz/" and multiple csv files are present as below:
aaa.csv
bbb.csv
ccc.csv
and the files contain the following data.
aaa.csv contains:
name,address,dates
xxx,11111,20190101
yyy,22222,20190201
zzz,33333,20190101
bbb.csv contains:
name,address,dates
fff,11111,20190301
ggg,22222,20190501
hhh,33333,20190601
So I need to extract the minimum and maximum dates across the files; in the above case the date range should be 20190101 to 20190601.
Can anyone please help with how I can extract the minimum and maximum dates from the files in Python?
I need to avoid pandas or any other package, as I need to read the csv files directly in Python.
import pandas as pd
dt = pd.read_csv('your_csv.csv')
print(max(dt['dates']))
print(min(dt['dates']))
If you need to avoid pandas, you can do the following (not recommended, but it works):
dt = []
with open('your_csv.csv', 'r') as f:
    data = f.readlines()
for row in data:
    dt.append(row.split(',')[2].rstrip())
dt.pop(0)  # drop the header row
print(max(dt))
print(min(dt))
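If you have to stay package-free, the stdlib csv module is a slightly safer way to do the same thing (it handles quoted fields). A sketch with the question's aaa.csv data inlined:

```python
import csv
import io

# Sample rows matching aaa.csv from the question; for a real file,
# replace io.StringIO(sample) with open('aaa.csv').
sample = """name,address,dates
xxx,11111,20190101
yyy,22222,20190201
zzz,33333,20190101
"""

dates = []
with io.StringIO(sample) as f:
    for row in csv.DictReader(f):
        dates.append(row['dates'])

print(min(dates), max(dates))  # 20190101 20190201
```

min/max on the raw strings is correct here only because YYYYMMDD is zero-padded, so lexical order matches chronological order.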
A solution only using the available core libraries. It doesn't read the whole file into memory so should have a very low footprint and will work with larger files.
pathlib is used to get all the csv files
datetime is used to convert to dates
sys is used for user input
$ python3 date_min_max.py /usr/abc/xyz/
min date: 2019-01-01 00:00:00
max date: 2019-06-01 00:00:00
date_min_max.py
from pathlib import Path
from datetime import datetime
import sys
if len(sys.argv) > 1:
    p = sys.argv[1]
else:
    p = "."

files = [x for x in Path(p).iterdir() if x.suffix == ".csv"]
date_format = "%Y%m%d"
dt_max = datetime.strptime("19000101", date_format)
dt_min = datetime.strptime("30000101", date_format)
for file in files:
    with file.open("r") as fh:
        for i, line in enumerate(fh):
            if i == 0:
                continue  # skip the header row
            t = line.strip().split(",")[2]
            dt_max = max(dt_max, datetime.strptime(t, date_format))
            dt_min = min(dt_min, datetime.strptime(t, date_format))
print("min date: {}\nmax date: {}".format(dt_min, dt_max))
I have a problem with automatically importing csv files and creating a pandas dataframe. The code I've got:
import pandas as pd
from datetime import time
from datetime import date
from datetime import datetime
import os
import fnmatch

def get_local_file(pdate, hour, path='/apps/dev_data/data/'):
    """Get date+hour processing file from local drive

    :param pdate: str Processing date
    :param hour: str Processing hour
    :param path: str Path to file location
    :return: str Path to the matching file
    """
    sdate = pdate + '-' + str(hour)
    for p_file in os.listdir(path):
        if fnmatch.fnmatch(p_file, 'RSRAN098_IP_R*' + sdate + '*.csv'):
            return path + p_file

def get_files(pdate, path='/apps/dev_data/data/'):
    hours = [time(i).strftime('%H') for i in range(24)]
    fileList = []
    for hour in hours:
        fileList.append(get_local_file(pdate, hour))
    return fileList

processing_date = datetime.strptime('20170614', '%Y%m%d').date()
a = get_files(str(processing_date).replace('-', '_'))
print a

frame = pd.DataFrame()
list_ = []
for file_ in a:
    df = pd.read_csv(file_, index_col=None, header=0, delimiter=';')
    list_.append(df)
frame = pd.concat(list_)
The only problem is that I have a fixed date; I can't find a way to use the current date.
You can get the current date with the datetime module.
Replace this
processing_date = datetime.strptime('20170614', '%Y%m%d').date()
with something like datetime.now().date() (the code does from datetime import datetime, so it is datetime.now(), not datetime.datetime.now()).
But maybe I'm missing your point, because the answer seems too straightforward.
I am trying to read file names in a folder between a start date and an end date (date stamp on the file name).
I'm trying something like this.
Is there a better or more efficient way to do this?
I have thousands of files in that folder but based on start/end date values, often I will have a small percentage files between them.
startdate = "05/05/2013"
enddate = "06/06/2013"
mypath = "C:\\somepath\\"
onlyfiles = [ f for f in listdir(mypath) if isfile(join(mypath,f)) ]
for filetoread in onlyfiles:
    filesBetweenDate = [ f for f in time.strftime('%m/%d/%Y', time.gmtime(os.path.getmtime(somepath+filetoread ))) if f > startdate and f < enddate]
Thanks
This avoids walking through the folder:
from datetime import datetime, timedelta
from os.path import isfile, join

start = datetime.strptime('05/06/2013', '%m/%d/%Y')
end = datetime.strptime('06/05/2013', '%m/%d/%Y')
filesBetweenDate = []
while start <= end:
    f = start.strftime('%m/%d/%Y')
    if isfile(join(mypath, f)):
        filesBetweenDate.append(f)
    start += timedelta(1)
This should do the trick, with a couple of nice extra features, and only a single pass through the loop.
import calendar
from datetime import datetime
import glob
import os

mypath = "/Users/craigmj/"
timefmt = "%Y%m%d %H:%M:%S"
start = calendar.timegm(datetime.strptime("20130128 00:00:00", timefmt).timetuple())
end = calendar.timegm(datetime.strptime("20130601 00:00:00", timefmt).timetuple())

def test(f):
    if not os.path.isfile(f):
        return False
    (mode, ino, dev, nlink, uid, gid, size, atime, mtime, ctime) = os.stat(f)
    return start <= ctime <= end

files = [f for f in glob.glob(os.path.join(mypath, "*")) if test(f)]
for f in files:
    print(f)
First off, I use glob.glob so that you can use a wildcard when selecting your files. This might save you time if you can be more specific about the files you want to select (e.g. if your files contain the date stamp in the filename).
Secondly, I use ctime in the test function, but you could just as easily use mtime, the last modification time.
Finally, I'm time-specific, not just date-specific.
The only thing I'm not 100% sure about is whether this is all timezone-safe. You might want to check that with an example before digging through the docs.
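One quick way to probe the timezone question: the code above builds its cutoffs with calendar.timegm, which interprets the struct_time as UTC, while time.mktime interprets the same fields as local time. Comparing the two for one timestamp shows whether they diverge on a given machine (a sketch, not part of the answer above):

```python
import calendar
import time
from datetime import datetime

stamp = datetime.strptime("20130128 00:00:00", "%Y%m%d %H:%M:%S")
as_utc = calendar.timegm(stamp.timetuple())    # fields interpreted as UTC
as_local = time.mktime(stamp.timetuple())      # fields interpreted as local time
print(as_utc - as_local)  # 0 only when the local timezone is UTC
```

Since os.stat timestamps are plain epoch seconds, the cutoffs should be built with whichever interpretation matches how you wrote the cutoff strings.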
I have a list of files (actually the files in some directory), for example:
import os
path = '/home/user/folder'
files = os.listdir(path)
so the result is as below:
files = ['backup_file_2010-06-30_category.zip', 'backup_file_2010-06-28_category.zip',
         'backup_file_2010-06-26_category.zip', 'backup_file_2010-06-24_category.zip',
         'backup_file_2010-06-23_category.zip', 'backup_file_2010-06-20_category.zip',
         'some_text_files_one.txt', 'some_text_files_two.txt']
So from this list I need to delete the zip files that contain a date in their name, on the condition that files created more than five days before today are deleted.
I mean, if the file created today is backup_file_2013-04-17_category.zip, we need to delete the files created more than five days earlier, such as backup_file_2013-04-11_category.zip.
Can anyone please let me know how to do this in Python?
You could do something like this; the filtered_files list then holds the files that need to be deleted. It works if your backup files start with the prefix.
from datetime import datetime
from datetime import timedelta
import os
path = '/home/user/folder'
files = os.listdir(path)
prefix = 'backup_file_'
days = 5
filtered_files = []
five_days_ago = datetime.now() - timedelta(days=days)
date_before = '%s%s' % (prefix, five_days_ago.strftime('%Y-%m-%d'))
for f in files:
    if f.startswith(prefix) and f < date_before:
        filtered_files.append(f)
print filtered_files
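Note that the f < date_before string comparison works only because strftime('%Y-%m-%d') zero-pads the month and day, so the names sort chronologically. A tiny self-check with hypothetical sample names:

```python
# Zero-padded dates make lexical order match chronological order.
names = ['backup_file_2010-06-30_category.zip',
         'backup_file_2010-06-20_category.zip']
cutoff = 'backup_file_2010-06-26'
old = [n for n in names if n.startswith('backup_file_') and n < cutoff]
print(old)  # ['backup_file_2010-06-20_category.zip']
```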
import datetime, os

path = '/home/user/folder'
# build the expected names for the last five days; strftime zero-pads,
# so e.g. June 3rd becomes '2010-06-03' and matches the backup file names
mydaterange = [datetime.datetime.today() - datetime.timedelta(days=x) for x in range(1, 6)]
myfilenames = ['backup_file_' + x.strftime('%Y-%m-%d') + '_category.zip' for x in mydaterange]
for files in os.listdir(path):
    if files.endswith('.zip') and files.startswith('backup_file_'):
        if files not in myfilenames:
            os.remove(os.path.join(path, files))
You can extract the date from each filename using a regex, and compare it to see if the backup file is indeed old. If there is no match from the regex then it's not a backup file.
from datetime import date
import re

OLD_FILE_DAYS = 5

def is_old_backup_file(filename, today):
    m = re.match(r'backup_file_(\d{4})-(\d{2})-(\d{2})_category\.zip', filename)
    if not m:
        return False
    year, month, day = (int(s) for s in m.groups())
    d = date(year, month, day)
    delta = today - d
    return delta.days > OLD_FILE_DAYS

files = ['backup_file_2010-06-30_category.zip', 'backup_file_2010-06-28_category.zip',
         'backup_file_2010-06-26_category.zip', 'backup_file_2010-06-24_category.zip',
         'backup_file_2010-06-23_category.zip', 'backup_file_2010-06-20_category.zip',
         'some_text_files_one.txt', 'some_text_files_two.txt']  # os.listdir(path)
today = date(2010, 7, 1)  # date.today()
filtered_files = [f for f in files if not is_old_backup_file(f, today)]
print filtered_files
Output:
['backup_file_2010-06-30_category.zip', 'backup_file_2010-06-28_category.zip',
'backup_file_2010-06-26_category.zip', 'some_text_files_one.txt',
'some_text_files_two.txt']