Python get files from N days ago and before

I am trying to write a function that can find files from a certain date and before, and delete them. I was playing around with Fabric, and I want to delete my old log files from my server. The folder has files in the following format:
['user-2015-10-16.log.gz', 'user-2015-10-19.log.gz', 'user-2015-10-22.log.gz', 'user-2015-10-25.log.gz', 'admin-2015-10-17.log.gz', 'admin-2015-10-20.log.gz', 'admin-2015-10-23.log.gz', 'requests.log', 'user-2015-10-17.log.gz', 'user-2015-10-20.log.gz', 'user-2015-10-23.log.gz', 'extra.log', 'admin-2015-10-18.log.gz', 'admin-2015-10-21.log.gz', 'admin-2015-10-24.log.gz', 'user-2015-10-18.log.gz', 'user-2015-10-21.log.gz', 'user-2015-10-24.log.gz', 'admin-2015-10-16.log.gz', 'admin-2015-10-19.log.gz', 'admin-2015-10-22.log.gz', 'admin-2015-10-25.log.gz']
What I want to do is keep files from today back to 4 days ago, i.e. keep the ones from the 25th, 24th, 23rd, and 22nd and delete the rest (while keeping extra.log and requests.log).
I tried this:
import datetime

days = 4
user = []
admin = []
for i in range(days):
    that_date = (datetime.datetime.now() - datetime.timedelta(days=i)).date()
    use = 'user-{}.log.gz'.format(that_date)
    adm = 'admin-{}.log.gz'.format(that_date)
    # user.append(use)
    # admin.append(adm)
    print use, adm
But I realized, embarrassingly late, that this gives me the files I want to keep, not the ones I want to delete.
Any help will be greatly appreciated.
Edit: if not already clear, the files are generated daily in the user-(today's date) format, so I can't hardcode anything.

You might consider using glob with user-* and admin-*, then getting each file's creation time with os.stat.
NOT TESTED, but something like:
import glob
import os
import time

target = 4 * 24 * 60 * 60  # 4 days in seconds
for fn in glob.glob('user-*') + glob.glob('admin-*'):
    if time.time() - os.path.getctime(fn) > target:
        os.remove(fn)  # delete that file
You need to change the working directory (or change the glob) to the target directory.
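Since the file names already embed the date, another option (untested; log_dir is a hypothetical placeholder) is to parse the date out of each name rather than relying on filesystem timestamps, which may not match the date in the file name:

import datetime
import glob
import os
import re

log_dir = '/path/to/logs'  # hypothetical target directory
# keep today plus the previous three days (the 25th through the 22nd)
cutoff = datetime.date.today() - datetime.timedelta(days=3)
for fn in glob.glob(os.path.join(log_dir, 'user-*.log.gz')) + glob.glob(os.path.join(log_dir, 'admin-*.log.gz')):
    m = re.search(r'\d{4}-\d{2}-\d{2}', os.path.basename(fn))
    if m and datetime.datetime.strptime(m.group(0), '%Y-%m-%d').date() < cutoff:
        os.remove(fn)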

Related

Download intraday historical stock data

I want to download historical intraday stock data. I've found that AlphaVantage offers two years of data; it's the longest free history I've found.
I'm making a script to download the full two years of data, for all ticker symbols they offer and in all timeframes. They provide the data divided into 30-day intervals counted back from the current day (or the last trading day, I'm not sure). The rows go from newest to oldest datetime. I want to reverse the order in which the data appears and concatenate all the months with the column headers appearing only once, so I would have a single csv file with two years of data for each stock and timeframe, with rows running from oldest to newest datetime.
The problem I have is that I also want to use the script to update the data, and I don't know how to append only the data that doesn't already appear in my files. The data I've downloaded runs from 2020-09-28 07:15:00 to 2020-10-26 20:00:00 in 15-minute intervals (where they exist; some are missing). When I run the script again I'd like it to update the data: somehow skip the rows that already appear and append only the rest. So if the last datetime present is, for example, 2020-10-26 20:00:00, it would continue appending from 2020-10-26 20:15:00 if that exists. How can I update the data correctly?
Also, when updating, if the file already exists it copies the column headers, which is something I don't want. Edit: I've solved this with header=(not os.path.exists(file)), but it seems very inefficient to check whether the file exists in every iteration.
I also have to make the script comply with the API's limits of 5 calls per minute and 500 calls per day. Is there a way for the script to stop when it reaches the daily limit and continue from that point the next time it runs? Or should I just add a 173-second sleep between API calls?
import os
import sys
from pathlib import Path
from typing import List

import pandas as pd

BASE_URL = 'https://www.alphavantage.co/'

def download_previous_data(
    file: str,
    ticker: str,
    timeframe: str,
    slices: List,
):
    for _slice in slices:
        url = f'{BASE_URL}query?function=TIME_SERIES_INTRADAY_EXTENDED&symbol={ticker}&interval={timeframe}&slice={_slice}&apikey=demo&datatype=csv'
        pd.read_csv(url).iloc[::-1].to_csv(file, mode='a', index=False, encoding='utf-8-sig')

def main():
    # Get a list of all ticker symbols
    print('Downloading ticker symbols:')
    #df = pd.read_csv('https://www.alphavantage.co/query?function=LISTING_STATUS&apikey=demo')
    #tickers = df['symbol'].tolist()
    tickers = ['IBM']
    timeframes = ['1min', '5min', '15min', '30min', '60min']
    # To download the data in a subdirectory where the script is located
    modpath = os.path.dirname(os.path.abspath(sys.argv[0]))
    # Make sure the download folders exist
    for timeframe in timeframes:
        download_path = f'{modpath}/{timeframe}'
        #download_path = f'/media/user/Portable Drive/Trading/data/{timeframe}'
        Path(download_path).mkdir(parents=True, exist_ok=True)
    # For each ticker symbol download all data available for each timeframe,
    # except for the last month, which would be incomplete.
    # Each download iteration has to be in a 'try except' in case the ticker
    # symbol isn't available on alphavantage.
    for ticker in tickers:
        print(f'Downloading data for {ticker}...')
        for timeframe in timeframes:
            download_path = f'{modpath}/{timeframe}'
            filepath = f'{download_path}/{ticker}.csv'
            # NOTE:
            # To ensure optimal API response speed, the trailing 2 years of intraday data is
            # evenly divided into 24 "slices" - year1month1, year1month2, year1month3, ...,
            # year1month11, year1month12, year2month1, year2month2, year2month3, ...,
            # year2month11, year2month12. Each slice is a 30-day window, with year1month1
            # being the most recent and year2month12 being the farthest from today.
            # By default, slice=year1month1.
            if Path(filepath).is_file():  # if the file already exists
                # download the previous-to-last month
                slices = ['year1month2']
                download_previous_data(filepath, ticker, timeframe, slices)
            else:  # if the file doesn't exist
                # download the two previous years
                #slices = ['year2month12', 'year2month11', 'year2month10', 'year2month9', 'year2month8', 'year2month7', 'year2month6', 'year2month5', 'year2month4', 'year2month3', 'year2month2', 'year2month1', 'year1month12', 'year1month11', 'year1month10', 'year1month9', 'year1month8', 'year1month7', 'year1month6', 'year1month5', 'year1month4', 'year1month3', 'year1month2']
                slices = ['year1month2']
                download_previous_data(filepath, ticker, timeframe, slices)

if __name__ == '__main__':
    main()
You have an awful lot of questions within your question!
These are suggestions for you to try, but I have no way to test their validity:
Read all your file names into a list and check new names against that list, rather than pinging the OS each time.
Read the data from the existing file, append everything in pandas, and write a new file. I can't tell whether you're appending the csv files correctly, but if you're having difficulty there, just read the old data and append the new data until you figure out how to append correctly. Or save new iterations to their own files and consolidate them later.
Look into drop_duplicates() if you are concerned about duplicates.
Look into the time module's time.sleep() in your for loops to reduce the call rate.
If you have 1min data you can look into resample() to build the 5min and 15min bars rather than downloading all of those timeframes. A sketch of these ideas follows.
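Untested, and append_new_rows is a hypothetical helper; it assumes the CSV has Alpha Vantage's usual time/open/high/low/close/volume columns:

import os
import time

import pandas as pd

def append_new_rows(file, new_df, time_col='time'):
    # Merge old and new data, drop rows whose timestamp already appears,
    # and rewrite the file once, so it keeps a single header.
    if os.path.exists(file):
        old_df = pd.read_csv(file)
        combined = pd.concat([old_df, new_df], ignore_index=True)
        combined = combined.drop_duplicates(subset=[time_col], keep='first')
        combined.to_csv(file, index=False)
    else:
        new_df.to_csv(file, index=False)

# 500 calls/day works out to one call every ~173 seconds, which also
# satisfies the 5-calls-per-minute limit:
time.sleep(173)

And resampling 1min bars to coarser timeframes instead of downloading each one:

bars = pd.read_csv('IBM_1min.csv', index_col='time', parse_dates=True)  # hypothetical file name
five_min = bars.resample('5min').agg(
    {'open': 'first', 'high': 'max', 'low': 'min', 'close': 'last', 'volume': 'sum'})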

Apply a written code to all CSV files across different folders using Python

I have a wide range of CSV files that give me the solar energy produced by several systems on a daily basis. Each CSV file corresponds to one day in the year, for one particular site (I have 12 sites).
My goal is to develop a code that reads all CSV files (located across different folders), extracts the daily produced solar energy for every specific day and site, stores the values in a dataframe, and finally exports the dataframe collecting all daily produced solar energy across all sites to a new Excel file.
So far I have written the code to extract the values of all CSV files stored within the same folder, which gives me the solar energy produced for all days for which a CSV file exists in that folder:
import glob

import pandas as pd

path = r"C:\Users\XX\Documents\XX\DataLogs\NameofSite\CSV\2020\02\CSV\*.csv"
Monthly_PV = []
for fname in glob.glob(path):
    df = pd.read_csv(fname, header=7, decimal=',')
    kWh_produced = df["kWh"]
    daily_start = kWh_produced.iloc[0]
    daily_end = kWh_produced.iloc[-1]
    DailyPV = daily_end - daily_start
    Monthly_PV.append(DailyPV)
print(Monthly_PV)
MonthlyTotal = sum(Monthly_PV)
Monthly_PV = pd.DataFrame(Monthly_PV)
print(MonthlyTotal)
Monthly_PV.to_excel(r"C:\Users\XXX\Documents\XXX\DataLogs\NameofSite\CSV\2020\02\CSV\Summary.xlsx")
I get the result I want: a list in which each value corresponds to the daily produced solar energy of each CSV in the one folder I called "path". My aim is to extend this code so that it also applies to CSV files located in other folders, such as sibling folders within the same parent folder.
Any tips will be much appreciated.
Thanks!
You can add an extra for loop to handle a list of paths:
import glob

import pandas as pd

paths = [r"C:\Users\XX\Documents\XX\DataLogs\NameofSite\CSV\2020\02\CSV\*.csv",
         r"C:\Foo\*.csv",
         r"..\..\Bar\*.csv"]
Monthly_PV = []
for path in paths:
    for fname in glob.glob(path):
        df = pd.read_csv(fname, header=7, decimal=',')
        kWh_produced = df["kWh"]
        daily_start = kWh_produced.iloc[0]
        daily_end = kWh_produced.iloc[-1]
        DailyPV = daily_end - daily_start
        Monthly_PV.append(DailyPV)
print(Monthly_PV)
MonthlyTotal = sum(Monthly_PV)
Monthly_PV = pd.DataFrame(Monthly_PV)
print(MonthlyTotal)
Monthly_PV.to_excel(r"C:\Users\XXX\Documents\XXX\DataLogs\NameofSite\CSV\2020\02\CSV\Summary.xlsx")
If you do not want to hardcode a list of directories in your program, maybe try something based on this?
import collections
import os
import typing

def get_input_directories(depth: int, base_directory: str) -> typing.DefaultDict[str, typing.List[DeltaFile]]:
    """
    Build a dict with keys that are directories, and values that are lists of filenames.
    Does not include blocking uninteresting tracks.
    """
    # DeltaFile, appropriate_extension and hidden are helpers from the
    # original project; substitute your own types and filters.
    result: typing.DefaultDict[str, typing.List[DeltaFile]] = collections.defaultdict(list)
    original_directory = os.getcwd()
    os.chdir(base_directory)
    try:
        for root, directories, filenames in os.walk('.'):
            if root.count('/') != depth:
                # We only want to deal with /band/album (e.g., with depth==2) in root
                continue
            assert not directories, f"root is {root}, directories is {directories}"
            for filename in filenames:
                if appropriate_extension(filename) and not hidden(filename):
                    result[root].append(DeltaFile(filename))
    finally:
        os.chdir(original_directory)
    return result
You can safely remove the type annotations if you don't want them.
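If the folder layout is regular, a recursive glob can do the same job in a couple of lines (the pattern below is hypothetical; adjust it to your tree):

import glob
# ** matches any number of intermediate directories (Python 3.5+)
csv_files = glob.glob(r"C:\Users\XX\Documents\XX\DataLogs\**\*.csv", recursive=True)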

Add current date to end of a filename when exporting using to_excel

When I run my script I would like it to export an Excel file with the current date tagged onto the end of the name. I could type the date in manually, but as I run this each day I would like it to use the current date automatically.
So, to output a normal Excel file via python/pandas I use:
df.to_excel('myfile.xlsx')
And in my working directory I get an Excel file called "myfile.xlsx".
But I would like today's current date added to the end, so if I ran the script today the file would be called "myfile 24/09/2019.xlsx".
This will get you there and employs string formatting for clean / readable code:
from datetime import datetime as dt
# Create filename from current date.
mask = '%d%m%Y'
dte = dt.now().strftime(mask)
fname = "myfile_{}.xlsx".format(dte)
df.to_excel(fname)
As mentioned in a comment above, some operating systems use / as a path separator, so I suggest a dmY (24092019) date format instead, as shown here:
Output:
myfile_24092019.xlsx
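On Python 3.6+ the same thing can be written in one line with an f-string and an inline format spec:

fname = f"myfile_{dt.now():%d%m%Y}.xlsx"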

Appending incoming csv files in Python to a master data frame

I have these data exports that populate every hour in a particular directory, and I'm hoping to have a script that reads all the files and appends them into one master dataframe in Python. The only issue is that, since they populate every hour, I don't want to re-append csv files that have already been added to the master dataframe.
I'm very new to Python, and so far have only been able to load all the files in the directory and append them all, using the below code:
import pandas as pd
import os
import glob
path = os.environ['HOME'] + "/file_location/"
allFiles = glob.glob(os.path.join(path,"name_of_files*.csv"))
df = pd.concat((pd.read_csv(f) for f in allFiles), sort=False)
With the above code, it looks into file_location and imports any files whose names start with "name_of_files", using a wildcard since the tail of each file name is different.
I could continue to do this, but I'm literally going to have hundreds of files and don't want to import and append/concat them all each and every hour. To avoid this I'd like to have that master dataframe mentioned above and have the new csv files that arrive each hour automatically appended to it.
Again, super new to Python, so not even sure what to do next. Any advice would be greatly appreciated!
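A minimal, untested sketch of one approach, reusing the path and pattern from the question: keep a set of file names that have already been concatenated, and on each hourly run only read the new ones.

import glob
import os

import pandas as pd

path = os.environ['HOME'] + "/file_location/"
seen = set()              # file names already added to the master frame
master = pd.DataFrame()

def update_master():
    global master
    new_files = [f for f in glob.glob(os.path.join(path, "name_of_files*.csv")) if f not in seen]
    if new_files:
        master = pd.concat([master] + [pd.read_csv(f) for f in new_files], sort=False)
        seen.update(new_files)

Persisting seen (e.g. to a small text file) would let the script survive restarts without re-reading everything.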

Python: finding files (png) with a creation date between two set dates

I am relatively new to Python and I'm trying to make a script that finds files (photos) created between two dates and puts them into a folder.
For that, I need to somehow get the creation date of the files (I'm on Windows).
I already have everything coded, but I just need to get the date of each picture. It would also be interesting to see in which form the date is returned; the best would be something like m/d/y or d/m/y (d=day, m=month, y=year).
Thank you all in advance! I am new to this forum
I imagine you are somehow listing files; if so, use os.stat(path).st_ctime to get the creation time on Windows, and then string-format it with the datetime module.
https://docs.python.org/2/library/stat.html#stat.ST_CTIME
https://stackoverflow.com/a/39359270/928680
This example shows how to convert the mtime (modified time), but the same applies to the ctime (creation time).
Once you have the ctime it's relatively simple to check whether it falls within a range:
https://stackoverflow.com/a/5464465/928680
You will need to do your date logic before converting to a string.
One of the solutions, not very efficient, just to show one of the ways this can be done:
import os
from datetime import datetime

def filter_files(path, start_date, end_date, date_format="%Y"):
    result = []
    start_time_obj = datetime.strptime(start_date, date_format)
    end_time_obj = datetime.strptime(end_date, date_format)
    for file in os.listdir(path):
        full_path = os.path.join(path, file)
        c_time = datetime.fromtimestamp(os.stat(full_path).st_ctime)
        if start_time_obj <= c_time <= end_time_obj:
            result.append("{}, {}".format(full_path, c_time))
    return result

if __name__ == "__main__":
    print("\n".join(filter_files("/Users/Jagadish/Desktop", "2017-05-31", "2017-06-02", "%Y-%m-%d")))
cheers!
See the Python os package for basic system commands, including directory listings with options. You'll be able to extract the file date. See the Python datetime package for date manipulation.
Also, check the available Windows commands on your version: most of them have search functions with a date parameter; you could simply have an OS system command return the needed file names.
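For example, a short sketch (p is a path you already have):

import datetime
import os

ts = os.stat(p).st_ctime  # on Windows, st_ctime is the creation time
created = datetime.datetime.fromtimestamp(ts).strftime('%d/%m/%Y')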
You can use subprocess to run a shell command on a file to get meta_data of that file.
import re
from subprocess import check_output

# Note that you have to use '\\\\' instead of '\\' when specifying the path of the file
meta_data = check_output('wmic datafile where Name="C:\\\\Users\\\\username\\\\Pictures\\\\xyz.jpg" get CreationDate', shell=True)
pattern = re.compile(r'\b(\d{14})\b.*')
re.findall(pattern, meta_data.decode())
# => ['20161007174858']  # the created date of your file, in YYYYMMDDHHMMSS format
Here is my solution. The Pillow Image module can access the image's metadata; we then read position 36867 of that metadata, which is DateTimeOriginal. Finally, I convert the returned string to a datetime object, which gives the flexibility to do whatever you need with it. Here is the code:
from PIL import Image
from datetime import datetime
# Get the creationTime
creationTime = Image.open('myImage.PNG')._getexif()[36867]
# Convert creationTime to datetime object
creationTime = datetime.strptime(creationTime, '%Y:%m:%d %H:%M:%S')
