Get the latest FTP folder name in Python - python

I am trying to write a Python script that gets the latest file from the latest sub-directory of an FTP server. My problem is that I cannot work out which sub-directory is the latest. Two pieces of information are available: each sub-directory has a ctime, and each directory name contains the date on which the directory was created. But I do not know how to get the name of the latest directory. So far I have the following, which only works if the server returns the listing sorted with the latest directory first:
import ftplib
import os
import time
ftp = ftplib.FTP('test.rebex.net','demo', 'password')
ftp.cwd(str((ftp.nlst())[0])) #if directory is sorted in descending order by date.
But is there a way to find the exact directory by ctime or by the date in the directory name?
Thanks a lot guys.

If your FTP server supports the MLSD command, the solution is easy:
If you want to base the decision on a modification timestamp:
entries = list(ftp.mlsd())
# Only interested in directories
entries = [entry for entry in entries if entry[1]["type"] == "dir"]
# Sort by timestamp
entries.sort(key = lambda entry: entry[1]['modify'], reverse = True)
# Pick the first one
latest_name = entries[0][0]
print(latest_name)
If you want to use a file name:
# Sort by filename
entries.sort(key = lambda entry: entry[0], reverse = True)
If you need to rely on the obsolete LIST command, you have to parse the proprietary listing it returns.
A common *nix listing is like:
drw-r--r-- 1 user group 4096 Mar 26 2018 folder1-20180326
drw-r--r-- 1 user group 4096 Jun 18 11:21 folder2-20180618
-rw-r--r-- 1 user group 4467 Mar 27 2018 file-20180327.zip
-rw-r--r-- 1 user group 124529 Jun 18 15:31 file-20180618.zip
With a listing like this, this code will do:
If you want to base the decision on a modification timestamp:
from dateutil import parser  # third-party package: python-dateutil

lines = []
ftp.dir("", lines.append)
latest_time = None
latest_name = None
for line in lines:
    # 8 splits give 9 tokens, so a name containing spaces stays whole in tokens[8]
    tokens = line.split(maxsplit=8)
    # Only interested in directories
    if tokens[0][0] == "d":
        time_str = tokens[5] + " " + tokens[6] + " " + tokens[7]
        time = parser.parse(time_str)
        if (latest_time is None) or (time > latest_time):
            latest_name = tokens[8]
            latest_time = time
print(latest_name)
If you want to use a file name:
lines = []
ftp.dir("", lines.append)
latest_name = None
for line in lines:
    # 8 splits give 9 tokens, so a name containing spaces stays whole in tokens[8]
    tokens = line.split(maxsplit=8)
    # Only interested in directories
    if tokens[0][0] == "d":
        name = tokens[8]
        if (latest_name is None) or (name > latest_name):
            latest_name = name
print(latest_name)
Some FTP servers may return . and .. entries in LIST results. You may need to filter those.
Partially based on: Python FTP get the most recent file by date.
If the folder does not contain any files, only subfolders, there are other easier options.
If you want to base the decision on a modification timestamp and the server supports non-standard -t switch, you can use:
lines = ftp.nlst("-t")
latest_name = lines[-1]
See How to get files in FTP folder sorted by modification time
If you want to use a file name:
lines = ftp.nlst()
latest_name = max(lines)
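Since the question says each directory name carries its creation date, a slightly more defensive variant of the name-based approach is to compare the parsed dates rather than the whole names. This is only a minimal sketch: it assumes the names end in an 8-digit YYYYMMDD stamp, as in the sample listing above, and the helper name latest_by_name_date is made up for illustration.

```python
import re
from datetime import datetime


def latest_by_name_date(names):
    """Return the name whose trailing YYYYMMDD stamp is the most recent.

    Names without a valid trailing date are ignored; returns None if
    nothing qualifies.
    """
    best_name, best_date = None, None
    for name in names:
        match = re.search(r"(\d{8})$", name)
        if not match:
            continue
        try:
            stamp = datetime.strptime(match.group(1), "%Y%m%d")
        except ValueError:
            # Eight digits that do not form a real calendar date
            continue
        if best_date is None or stamp > best_date:
            best_name, best_date = name, stamp
    return best_name


# Names taken from the sample listing above
names = ["folder1-20180326", "folder2-20180618", "file-20180327.zip"]
print(latest_by_name_date(names))
```

With a live connection, the input would be the result of ftp.nlst(), or the directory names filtered out of an MLSD or LIST listing.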

Related

Extract data using regex between specified strings

Question1: I want to extract the data between "Target Information" and the line before "Group Information" and store it as a variable or appropriately.
Question2: Next, I want to extract the data from "Group Information" till the end of the file and store it in a variable or something appropriate.
Question3: With this information in both the above cases, I want to extract the line just after the line which starts with "Name"
From the below code I was able to get the information between "Target Information" and "Group Information" and Captured the data in "required_lines" variable.
Next, I am trying to get the line after the line "Name". But this fails. And can the logic be implemented using regex call?
# Extract the lines between
with open('showrcopy.txt', 'r') as f:
    file = f.readlines()
required_lines1 = []
required_lines = []
inRecordingMode = False
for line in file:
    if not inRecordingMode:
        if line.startswith('Target Information'):
            inRecordingMode = True
    elif line.startswith('Group Information'):
        inRecordingMode = False
    else:
        required_lines.append(line.strip())
print(required_lines)
# Extract the line after the line "Name"
def gen():
    for x in required_lines:
        yield x
for line in gen():
    if "Name" in line:
        print(next(gen()))
showrcopy.txt
root#gnodee184119:/home/usr/redsuren# date; showrcopy -qw
Tue Aug 24 00:20:38 PDT 2021
Remote Copy System Information
Status: Started, Normal
Target Information
Name ID Type Status Policy QW-Server QW-Ver Q-Status Q-Status-Qual ATF-Timeout
s2976 4 IP ready mirror_config https://10.157.35.148:8443 4.0.007 Re-starting Quorum not stable 10
Link Information
Target Node Address Status Options
s2976 0:9:1 192.168.20.21 Up -
s2976 1:9:1 192.168.20.22 Up -
receive 0:9:1 192.168.10.21 Up -
receive 1:9:1 192.168.10.22 Up -
Group Information
Name Target Status Role Mode Options
SG_hpux_vgcgloack.r518634 s2976 Started Primary Sync auto_recover,auto_failover,path_management,auto_synchronize,active_active
LocalVV ID RemoteVV ID SyncStatus LastSyncTime
vgcglock_SG_cluster 13496 vgcglock_SG_cluster 28505 Synced NA
Name Target Status Role Mode Options
aix_rcg1_AA.r518634 s2976 Started Primary Sync auto_recover,auto_failover,path_management,auto_synchronize,active_active
LocalVV ID RemoteVV ID SyncStatus LastSyncTime
tpvvA_aix_r.2 20149 tpvvA_aix.2 41097 Synced NA
tpvvA_aix_r.3 20150 tpvvA_aix.3 41098 Synced NA
tpvvA_aix_r.4 20151 tpvvA_aix.4 41099 Synced NA
tpvvA_aix_r.5 20152 tpvvA_aix.5 41100 Synced NA
tpvvA_aix_r.6 20153 tpvvA_aix.6 41101 Synced NA
tpvvA_aix_r.7 20154 tpvvA_aix.7 41102 Synced NA
tpvvA_aix_r.8 20155 tpvvA_aix.8 41103 Synced NA
tpvvA_aix_r.9 20156 tpvvA_aix.9 41104 Synced NA
tpvvA_aix_r.10 20157 tpvvA_aix.10 41105 Synced NA
Here's a regex solution to pull the target info and group info:
import re
with open("./showrcopy.txt", "r") as f:
    text = f.read()
target_info_pattern = re.compile(r"Target Information([.\s\S]*)Group Information")
group_info_pattern = re.compile(r"Group Information([.\s\S]*)")
target_info = target_info_pattern.findall(text)[0].strip().split("\n")
group_info = group_info_pattern.findall(text)[0].strip().split("\n")
target_info_line_after_name = target_info[1]
group_info_line_after_name = group_info[1]
And the lines you're interested in:
>>> target_info_line_after_name
's2976 4 IP ready mirror_config https://10.157.35.148:8443 4.0.007 Re-starting Quorum not stable 10'
>>> group_info_line_after_name
'SG_hpux_vgcgloack.r518634 s2976 Started Primary Sync auto_recover,auto_failover,path_management,auto_synchronize,active_active'
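The group section contains more than one "Name" header, so indexing with [1] only returns the first group. The attempt in the question fails because next(gen()) builds a fresh generator each time and so always yields the first line. A minimal sketch of one way around both problems is to advance a single shared iterator. The helper name lines_after_name is hypothetical, and the sample lines are abbreviated from the file above.

```python
def lines_after_name(lines):
    """Return each line that directly follows a line starting with 'Name'."""
    result = []
    it = iter(lines)
    for line in it:
        if line.startswith("Name"):
            # Advance the same iterator so the following line is consumed here
            following = next(it, None)
            if following is not None:
                result.append(following)
    return result


# Abbreviated lines from the "Group Information" section
sample = [
    "Name Target Status Role Mode Options",
    "SG_hpux_vgcgloack.r518634 s2976 Started Primary Sync",
    "LocalVV ID RemoteVV ID SyncStatus LastSyncTime",
    "vgcglock_SG_cluster 13496 vgcglock_SG_cluster 28505 Synced NA",
    "Name Target Status Role Mode Options",
    "aix_rcg1_AA.r518634 s2976 Started Primary Sync",
]
print(lines_after_name(sample))
```

The same function can be applied to target_info or group_info from the regex answer, since both are lists of lines.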

NotADirectoryError: [WinError 267] The directory name is invalid: 'C:\\Users\\username\\MYD06_L2.A2008001.0000.006.2013341193524.hdf'

I am using Windows 10 and running the code in Jupyter Notebook (in Chrome).
This is my code:
if __name__ == '__main__':
    import itertools
    MOD03_path = r"C:\Users\saviosebastian\MYD03.A2008001.0000.006.2012066122450.hdf"
    MOD06_path = r"C:\Users\saviosebastian\MYD06_L2.A2008001.0000.006.2013341193524.hdf"
    satellite = 'Aqua'
    yr = [2008]
    mn = [1] #np.arange(1,13)
    dy = [1]
    # latitude and longitude boundaries of level-3 grid
    lat_bnd = np.arange(-90,91,1)
    lon_bnd = np.arange(-180,180,1)
    nlat = 180
    nlon = 360
    TOT_pix = np.zeros(nlat*nlon)
    CLD_pix = np.zeros(nlat*nlon)
    ### To use Spark in Python
    spark = SparkSession\
        .builder\
        .appName("Aggregation")\
        .getOrCreate()
    filenames0=['']*500
    i=0
    for y,m,d in itertools.product(yr,mn,dy):
        #-------------find the MODIS products--------------#
        date = datetime.datetime(y,m,d)
        JD01, JD02 = gcal2jd(y,1,1)
        JD1, JD2 = gcal2jd(y,m,d)
        JD = np.int((JD2+JD1)-(JD01+JD02) + 1)
        granule_time = datetime.datetime(y,m,d,0,0)
        while granule_time <= datetime.datetime(y,m,d,23,55): # 23,55
            print('granule time:',granule_time)
            MOD03_fp = 'MYD03.A{:04d}{:03d}.{:02d}{:02d}.006.?????????????.hdf'.format(y,JD,granule_time.hour,granule_time.minute)
            MOD06_fp = 'MYD06_L2.A{:04d}{:03d}.{:02d}{:02d}.006.?????????????.hdf'.format(y,JD,granule_time.hour,granule_time.minute)
            MOD03_fn, MOD06_fn =[],[]
            for MOD06_flist in os.listdir(MOD06_path):
                if fnmatch.fnmatch(MOD06_flist, MOD06_fp):
                    MOD06_fn = MOD06_flist
            for MOD03_flist in os.listdir(MOD03_path):
                if fnmatch.fnmatch(MOD03_flist, MOD03_fp):
                    MOD03_fn = MOD03_flist
            if MOD03_fn and MOD06_fn: # if both MOD06 and MOD03 products are in the directory
I am getting the error shown in the title (NotADirectoryError: [WinError 267] The directory name is invalid).
Do you know any solution to this problem?
I can't give you a specific answer without knowledge of the directory system on your computer, but for now it's obvious that there is something wrong with the name of the directory that you are referencing. Use File Explorer to make sure that the directory actually exists, and also make sure that you haven't misspelled the name of the file, which could easily happen given the filename.
You are passing the full path including the file name. os.listdir(path) returns the list of all files and directories in the specified directory, so it must be given a directory, not a file. If no directory is specified, the files and directories in the current working directory are returned.
Pass only the directory, "C:/Users/saviosebastian", as the path.
The same goes for os.chdir("C:/Users/saviosebastian").
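A minimal sketch of that fix, assuming you want to keep the file paths as they are in the question: take the parent directory with os.path.dirname before listing it. The helper name find_matching is hypothetical, and the demo uses a temporary directory with an empty stand-in file rather than real MODIS data.

```python
import fnmatch
import os
import tempfile


def find_matching(path, pattern):
    """List names matching `pattern` in the directory that holds `path`.

    `path` may be a directory or a file inside it; for a file, its
    parent directory is searched instead.
    """
    directory = path if os.path.isdir(path) else (os.path.dirname(path) or ".")
    return sorted(
        name for name in os.listdir(directory)
        if fnmatch.fnmatch(name, pattern)
    )


# Demo: an empty stand-in file in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    file_path = os.path.join(tmp, "MYD06_L2.A2008001.0000.006.2013341193524.hdf")
    open(file_path, "w").close()
    found = find_matching(file_path, "MYD06_L2.A2008001.0000.006.?????????????.hdf")
print(found)
```

Passing the file path no longer raises NotADirectoryError, because os.listdir only ever receives the containing directory.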

py2exe SyntaxError: invalid syntax (asyncsupport.py, line 22) [duplicate]

This command works fine on my personal computer but keeps giving me this error on my work PC. What could be going on? I can run the Char_Limits.py script directly in Powershell without a problem.
error: compiling 'C:\ProgramData\Anaconda2\lib\site-packages\jinja2\asyncsupport.py' failed
SyntaxError: invalid syntax (asyncsupport.py, line 22)
My setup.py file looks like:
from distutils.core import setup
import py2exe
setup (console=['Char_Limits.py'])
My file looks like:
import xlwings as xw
from win32com.client import constants as c
import win32api
"""
Important Notes: Header row has to be the first row. No columns without a header row. If you need/want a blank column, just place a random placeholder
header value in the first row.
Product_Article_Number column is used to determine the number of rows. It must be populated for every row.
"""
#functions, hooray!
def setRange(columnDict, columnHeader):
    column = columnDict[columnHeader]
    rngForFormatting = xw.Range((2,column), (bttm, column))
    cellReference = xw.Range((2,column)).get_address(False, False)
    return rngForFormatting, cellReference
def msg_box(message):
    win32api.MessageBox(wb.app.hwnd, message)
#Character limits for fields in Hybris
CharLimits_Fields = {"alerts":500, "certifications":255, "productTitle":300,
"teaserText":450 , "includes":1000, "compliance":255, "disclaimers":9000,
"ecommDescription100":100, "ecommDescription240":240,
"internalKeyword":1000, "metaKeywords":1000, "metaDescription":1000,
"productFeatures":7500, "productLongDescription":1500,"requires":500,
"servicePlan":255, "skuDifferentiatorText":255, "storage":255,
"techDetailsAndRefs":12000, "warranty":1000}
# Fields for which a break tag is problematic.
BreakTagNotAllowed = ["ecommDescription100", "ecommDescription240", "productTitle",
"skuDifferentiatorText"]
app = xw.apps.active
wb = xw.Book(r'C:\Users\XXXX\Documents\Import File.xlsx')
#identifies the blanket range of interest
firstCell = xw.Range('A1')
lstcolumn = firstCell.end("right").column
headers_Row = xw.Range((1,1), (1, lstcolumn)).value
columnDict = {}
for column in range(1, len(headers_Row) + 1):
    header = headers_Row[column - 1]
    columnDict[header] = column
try:
    articleColumn = columnDict["Product_Article_Number"]
except:
    articleColumn = columnDict["Family_Article_Number"]
firstCell = xw.Range((1,articleColumn))
bttm = firstCell.end("down").row
wholeRange = xw.Range((1,1),(bttm, lstcolumn))
wholeRangeVal = wholeRange.value
#Sets the font and deletes previous conditional formatting
wholeRange.api.Font.Name = "Arial Unicode MS"
wholeRange.api.FormatConditions.Delete()
for columnHeader in columnDict.keys():
    if columnHeader in CharLimits_Fields.keys():
        rng, cellRef = setRange(columnDict, columnHeader)
        rng.api.FormatConditions.Add(2,3, "=len(" + cellRef + ") >=" + str(CharLimits_Fields[columnHeader]))
        rng.api.FormatConditions(1).Interior.ColorIndex = 3
    if columnHeader in BreakTagNotAllowed:
        rng, cellRef = setRange(columnDict, columnHeader)
        rng.api.FormatConditions.Add(2,3, '=OR(ISNUMBER(SEARCH("<br>",' + cellRef + ')), ISNUMBER(SEARCH("<br/>",' + cellRef + ")))")
        rng.api.FormatConditions(2).Interior.ColorIndex = 6
searchResults = wholeRange.api.Find("~\"")
if searchResults is not None:
    msg_box("There's a double quote in this spreadsheet")
else:
    msg_box("There are no double quotes in this spreadsheet")
# app.api.FindFormat.Clear
# app.api.FindFormat.Interior.ColorIndex = 3
# foundRed = wholeRange.api.Find("*", SearchFormat=True)
# if foundRed is None:
# msg_box("There are no values exceeding character limits")
# else:
# msg_box("There are values exceeding character limits")
# app.api.FindFormat.Clear
# app.api.FindFormat.Interior.ColorIndex = 6
# foundYellow = wholeRange.api.Find("*", SearchFormat=True)
# if foundYellow is None:
# msg_box("There are no break tags in this spreadsheet")
# else:
# msg_box("There are break tags in this spreadsheet")
Note:
If you are reading this, I would try Santiago's solution first.
The issue:
Looking at what is likely at line 22 on the github package:
async def concat_async(async_gen):
This uses the async keyword, which was added in Python 3.5, whereas py2exe only supports up to Python 3.4. Jinja appears to load this module only on interpreters that support the keyword (perhaps through a check at runtime), so it is harmless on earlier versions of Python. py2exe, however, compiles every file in the package and cannot account for that.
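To confirm that the interpreter running the build is what rejects that line, you can ask it to compile a tiny async def directly. This is only an illustrative check, not part of py2exe or Jinja, and the function name supports_async_syntax is made up.

```python
import sys


def supports_async_syntax():
    """Return True if this interpreter can compile an `async def`."""
    try:
        compile("async def f():\n    pass\n", "<check>", "exec")
    except SyntaxError:
        return False
    return True


print(sys.version_info[:2], supports_async_syntax())
```

If this prints False for the Python that runs setup.py, any build step that byte-compiles asyncsupport.py with that interpreter will hit the same SyntaxError.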
The Fix:
async support was added in jinja2 version 2.9 according to the documentation. So I tried installing an earlier version of jinja (version 2.8) which I downloaded here.
I made a backup of my current jinja installation by moving the contents of %PYTHONHOME%\Lib\site-packages\jinja2 to some other place.
Then extract the previously downloaded tar.gz file and install the package with its setup.py:
cd .\Downloads\dist\Jinja2-2.8 # or wherever you extracted jinja2.8
python setup.py install
As a side note, I also had to increase my recursion limit because py2exe was reaching the default limit.
from distutils.core import setup
import py2exe
import sys
sys.setrecursionlimit(5000)
setup (console=['test.py'])
Warning:
If whatever it is you are using relies on the latest version of jinja2, then this might fail or have unintended side effects when actually running your code. I was compiling a very simple script.
I had the same trouble with Python 3.7. I fixed it by adding an excludes entry to the Analysis call in my build spec file (this is PyInstaller's spec syntax, per the linked issue):
a = Analysis(['pyinst_test.py'],
    #...
    excludes=['jinja2.asyncsupport','jinja2.asyncfilters'],
    #...
)
I took that from: https://github.com/pyinstaller/pyinstaller/issues/2393

I need to write a Python script to pull Mercurial repository details between two given dates

I have found a program that prints all the repository details. Now I want to pass two dates, a from date and a to date, as parameters and pull out only the repository details between them. How can this be done? I am not sure which Mercurial API to use.
import sys
import hglib
# repo path
# figure out what repo path to use
repo = "."
if len(sys.argv) > 3:
    # no trailing commas here: they would turn each value into a tuple
    repo = sys.argv[1]
    from_date = sys.argv[2]
    to_date = sys.argv[3]
# connect to hg
client = hglib.open(repo)
# gather some stats
revs = int(client.tip().rev)
files = len(list(client.manifest()))
heads = len(client.heads())
branches = len(client.branches())
tags = len(client.tags()) - 1 # don't count tip
authors = {}
for e in client.log():
    authors[e.author] = True
description = {}
for e in client.log():
    description[e.desc] = True
merges = 0
for e in client.log(onlymerges=True):
    merges += 1
print("%d revisions" % revs)
print("%d merges" % merges)
print("%d files" % files)
print("%d heads" % heads)
print("%d branches" % branches)
print("%d tags" % tags)
print("%d authors" % len(authors))
print("%s authors name" % authors.keys())
print("%d desc" % len(description))
This prints out everything in the repository. I need to pull only the details between two given dates, like 2015-07-13 (from date) and 2015-08-20 (to date).
Updated code not working
import sys
import hglib
import datetime as dt
# repo path
# figure out what repo path to use
repo = "."
if len(sys.argv) > 1:
    repo = sys.argv[1]
    #from_date = sys.argv[2],
    #to_date = sys.argv[3]
# connect to hg
client = hglib.open(repo)
# define time ranges
d1 = dt.datetime(2015,7,7)
d2 = dt.datetime(2015,8,31)
#if "date('>05/07/07') and date('<06/08/8')"
#logdetails = client.log()
description = {}
for e in client.log():
    if (description[e.date] >= d1 and description[e.date] <=d2):
        description[e.desc] = True
print("%s desc" % description)
You can use revsets to limit the changesets. I'm not exactly sure how it translates to the hglib API, but it has an interface for revsets as well. From the normal CLI you can do:
hg log -r"date('>2015-01-01') and date('<2015-03-25')"
Check out hg help revsets and hg help dates.
By the way: the numeric revision count revs = int(client.tip().rev) will be wrong (too big) if there are pruned or obsolete changesets, which are easily created, for instance, by hg commit --amend.
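As a rough sketch of how the revset might be passed through hglib: client.log() accepts a revrange argument, so the same expression can be built from the two dates and handed over. The helper names build_date_revset and descriptions_between are hypothetical, and encoding the revset to bytes is an assumption about what hglib expects under Python 3.

```python
from datetime import datetime


def build_date_revset(from_date, to_date):
    """Build a revset selecting changesets between two YYYY-MM-DD dates."""
    # Validate the inputs so a typo fails here rather than inside hg
    for value in (from_date, to_date):
        datetime.strptime(value, "%Y-%m-%d")
    return "date('>{0}') and date('<{1}')".format(from_date, to_date)


def descriptions_between(repo_path, from_date, to_date):
    """Return the commit messages in the date range (needs python-hglib)."""
    import hglib  # imported here so the helper above works without it

    revset = build_date_revset(from_date, to_date)
    client = hglib.open(repo_path)
    try:
        # Assumption: hglib wants the revset as bytes under Python 3
        entries = client.log(revrange=revset.encode("ascii"))
        return [e.desc for e in entries]
    finally:
        client.close()


print(build_date_revset("2015-07-13", "2015-08-20"))
```

In the script from the question, from_date and to_date would come from sys.argv[2] and sys.argv[3]. See hg help dates for whether the boundary dates themselves are included.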

Python script to alert on empty/missing logs

I am working on a project to check a file directory and automatically add log files as they are created. A file is being generated every five minutes, but some of the files are being created with a "0" filesize and I would like to alert when this happens.
So the sequence of steps I would like to have are essentially:
Get time (MM:DD:YY HH:MM:SS) *Not sure if I need to do this...
CD to Folder Directory /Netflow/YY/MM/DD
Search for filename "nfcapd.YYYYMMDDHHMM" where MM increments by 5.
If filesize is 0, then email Johnny, Sally and Jimmy
Wait 6 minutes and repeat
This is what I have pieced together thus far. How can I get the desired functionality?
import os
import time

def is_non_zero_file(fpath):
    # True only for an existing regular file with a non-zero size
    return os.path.isfile(fpath) and os.path.getsize(fpath) > 0

# I need to check storage/Netflow for files named by time e.g 13_56_05.txt
while True:
    time.sleep(360)
In addition to enumerating the files in a given path and then filtering for the zero-length ones, you probably want to maintain some type of state so that you aren't notified multiple times about the same zero-length file. That is, you probably don't want to keep getting a notification that the same file is zero-length indefinitely (although you can modify the example below if you want that behavior).
You may optionally want to verify that the file name strictly meets your naming convention. You may also want to validate that the date-stamp string included in the file name is a valid datetime.
The example below uses the glob module (itself leveraging os.listdir() and fnmatch.fnmatch()) to build up a set of possible files for inclusion. [1]
The example is intentionally simple, and uses a single class to store log sample 'state'. KEEP_SAMPLES samples are maintained (instances of logState() in the log_states list), which is achieved by list slicing.
A single alert(msg) function is supplied as a stub to something that might send mail, etc...
References:
[1] https://docs.python.org/3.2/library/glob.html
#!/usr/bin/python3
import os
import glob
import re
from datetime import datetime, timezone
import time
from pprint import pprint
class logState():
    def __init__(self, log_path, glob_patt, re_patt, dt_fmt):
        self.dt = datetime.now(timezone.utc)
        self.log_path = log_path
        self.glob_patt = glob_patt
        self.re_patt = re_patt
        self.dt_fmt = dt_fmt
        self.empty_logs = []
        self.nonempty_logs = []
        # Retrieve only files from glob
        self.files = [ f for f in
                       glob.glob(self.log_path + self.glob_patt)
                       if os.path.isfile(f) ]
        for f in self.files:
            unq_fname = os.path.basename(f)
            # Tighter pattern matching
            if re.match(re_patt, unq_fname) is None:
                continue
            # Get the datetime portion of the file name
            f_dtstamp = unq_fname.split('.')[-1]
            # Make sure the datetime stamp represents
            # a valid date (strptime raises ValueError if not)
            try:
                datetime.strptime(f_dtstamp, self.dt_fmt)
            except ValueError:
                continue
            # Check file size, add to the appropriate
            # list
            if os.path.getsize(f) <= 0:
                self.empty_logs.append(f)
            else:
                self.nonempty_logs.append(f)
def alert(msg):
    print("ALERT!: {0}".format(msg))
if __name__ == "__main__":
    # How long to sleep
    SLEEP_SECS = 5
    # How many samples to keep
    KEEP_SAMPLES = 5
    log_states = []
    # Definition for what logs states we'll look for
    log_path = './'
    glob_patt = 'nfcapd.[0-9]*'
    re_patt = 'nfcapd.([0-9]{12})'
    dt_fmt = "%Y%m%d%H%M"
    print("-- Setup --")
    print("Sample files in '{0}'".format(log_path))
    print("\t{0} samples kept:".format(KEEP_SAMPLES))
    print("\tglob pattern: '{0}'".format(glob_patt))
    print("\tregex pattern: '{0}'".format(re_patt))
    print("\tdatetime string: '{0}'".format(dt_fmt))
    print("")
    # Collect the initial state
    log_states.append(logState(log_path,
                               glob_patt,
                               re_patt, dt_fmt))
    while True:
        # Print state inventory and current state detail
        print( "-- Log States Stored --")
        for i, log_state in enumerate(log_states):
            print("Log state {0} # {1}".format(i, log_state.dt))
        print(" -- Logs size > 0 --")
        pprint(log_states[-1].nonempty_logs)
        print(" -- Logs size <= 0 --")
        pprint(log_states[-1].empty_logs)
        print("")
        time.sleep(SLEEP_SECS)
        log_states = log_states[-KEEP_SAMPLES+1:]
        log_states.append(logState(log_path,
                                   glob_patt,
                                   re_patt,
                                   dt_fmt))
        # p = previous sample, c = current
        p = set(log_states[-2].empty_logs)
        c = set(log_states[-1].empty_logs)
        # only report the items in the current sample
        # not in the last
        if len(c.difference(p)) > 0:
            alert("\nNew zero length logs: " + str(c.difference(p)) + "\n")
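The question also asks how to find the file for the current five-minute slot under /Netflow/YY/MM/DD. A minimal sketch of that path calculation follows; it assumes each file is stamped with the start of its five-minute slot, and the helper name expected_log_path and the default root are made up for illustration.

```python
import posixpath
from datetime import datetime


def expected_log_path(now, root="/Netflow"):
    """Path of the capture file for the 5-minute slot containing `now`."""
    # Round down to the start of the current 5-minute slot
    slot = now.replace(minute=now.minute - now.minute % 5,
                       second=0, microsecond=0)
    return posixpath.join(
        root,
        slot.strftime("%y"), slot.strftime("%m"), slot.strftime("%d"),
        "nfcapd." + slot.strftime("%Y%m%d%H%M"),
    )


print(expected_log_path(datetime(2021, 8, 24, 0, 23, 41)))
```

The resulting path could be set as log_path, or checked directly with os.path.getsize before calling alert().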
