Python: Reading and processing Multiple gzip files in remote server

Python: Reading and processing Multiple gzip files in remote server - python

Problem Statement:
I have multiple(1000+) *.gz files in a remote server. I have to read these files and check for certain strings. If the strings matches, I have to return the file name. I have tried the following code. The following program is working but doesnot seem efficient as there is a huge IO involved. Can you please suggest an efficient way to do this.
My Code:
import gzip
import os
import paramiko
import multiprocessing
from bisect import insort
synchObj=multiprocessing.Manager()
hostname = '192.168.1.2'
port = 22
username='may'
password='Apa$sW0rd'
def miniAnalyze():
ifile_list=synchObj.list([]) # A synchronized list to Store the File names containing the matched String.
def analyze_the_file(file_single):
strings = ("error 72","error 81",) # Hard Coded the Strings that needs to be searched.
try:
ssh=paramiko.SSHClient()
#Code to FTP the file to local system from the remote machine.
.....
........
path_f='/home/user/may/'+filename
#Read the Gzip file in local system after FTP is done
with gzip.open(path_f, 'rb') as f:
contents = f.read()
if any(s in contents for s in strings):
print "File " + str(path_f) + " is a hit."
insort(ifile_list, filename) # Push the file into the list if there is a match.
os.remove(path_f)
else:
os.remove(path_f)
except Exception, ae:
print "Error while Analyzing file "+ str(ae)
finally:
if ifile_list:
print "The Error is at "+ ifile_list
ftp.close()
ssh.close()
def assign_to_proc():
# Code to glob files matching a pattern and pass to another function via multiprocess .
apath = '/home/remotemachine/log/'
apattern = '"*.gz"'
first_command = 'find {path} -name {pattern}'
command = first_command.format(path=apath, pattern=apattern)
try:
ssh=paramiko.SSHClient()
ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
ssh.connect(hostname,username=username,password=password)
stdin, stdout, stderr = ssh.exec_command(command)
while not stdout.channel.exit_status_ready():
time.sleep(2)
filelist = stdout.read().splitlines()
jobs = []
for ifle in filelist:
p = multiprocessing.Process(target=analyze_the_file,args=(ifle,))
jobs.append(p)
p.start()
for job in jobs:
job.join()
except Exception, fe:
print "Error while getting file names "+ str(fe)
finally:
ssh.close()
if __name__ == '__main__':
miniAnalyze()
The above code is slow. There are lot of IO while getting the GZ file to local system. Kindly help me to find a better way to do it.

Execute a remote OS command such as zgrep, and process the command results locally. This way, you won't have to transfer the whole file contents on your local machine.

Related

Comparing MD5 of downloaded files against files on an SFTP server in Python

Here I am trying to list all the MD5's of the files I downloaded and compare them to the original to see if they are the same Files.
I can't access a server to test this code right now but I was really curious if it would work...
Does someone have a better solution or something they would change?
#!/usr/bin/python3
import paramiko
import pysftp
import os
import sys
print("Localpath eingeben: ")
localpath = input()
print("Remothpath eingeben: ")
remotepath = input()
k = paramiko.RSAKey.from_private_key_file("/home/abdulkarim/.ssh/id_rsa")
c = paramiko.SSHClient()
c.set_missing_host_key_policy(paramiko.AutoAddPolicy())
print("connecting")
c.connect(hostname = "do-test", username = "abdulkarim", pkey = k)
print("connected")
sftp = c.open_sftp()
sftp.Connection.get_d(localpath, remotepath)
#sftp.get_d(localpath, remotepath)
def hashCheckDir(f,r):
files = []
# r=root, d=directories, f=files
for r, d, f in os.walk(localpath):
for file in f:
if '.txt' in file:
files.append(os.path.join(r, file))
files1 = []
# r=root, d=directories, f=files
for r, d, f in os.walk(remotepath):
for file in f:
if '.txt' in file:
files.append(os.path.join(r, file))
for i in range(2):
for x in files:
localsum = os.system('md5sum ' + files)
remotesum = os.system('ssh do-test md5sum ' + files1)
if localsum == remotesum:
print ("The lists are identical")
else :
print ("The lists are not identical")
hashCheckDir(localpath,remotepath)
c.close()
I am pretty new to Python so.. Bear with me if I did some stupid mistake.
Maybe I have to sort them first?

It's an overkill to launch an external console application (ssh) to execute md5sum on the server (and open a new connection for each and every file on top of that), if you already have a native Python SSH connection to the same server.
Instead use SSHClient.exec_command:
stdin, stdout, stderr = c.exec_command('md5sum '+ files1)
checksum = stdout.read()
Note that MD5 is obsolete, use SHA-256 (sha256sum).
Though question is whether the whole checksum check isn't an overkill, see:
How to perform checksums during a SFTP file transfer for data integrity?
Obligatory warning: Do not use AutoAddPolicy – You are losing a protection against MITM attacks by doing so. For a correct solution, see Paramiko "Unknown Server".

Python ftplib error 426 when putting files on iSeries

Have a peculiar issue that I can't seem to fix on my own..
I'm attempting to FTP a list of files in a directory over to an iSeries IFS using Python's ftplib library.
Note, the files are in a single subdirectory down from the python script.
Below is an excerpt of the code that is giving me trouble:
from ftplib import FTP
import os
localpath = os.getcwd() + '/Files/'
def putFiles():
hostname = 'host.name.com'
username = 'myuser'
password = 'mypassword'
myftp = FTP(hostname)
myftp.login(username, password)
myftp.cwd('/STUFF/HERE/')
for file in os.listdir(localpath):
if file.endswith('.csv'):
try:
file = localpath + file
print 'Attempting to move ' + file
myftp.storbinary("STOR " + file, open(file, 'rb'))
except Exception as e:
print(e)
The specific error that I am getting throw is:
Attempting to move /home/doug/Files/FILE.csv
426-Unable to open or create file /home/doug/Files to receive data.
426 Data transfer ended.
What I've done so far to troubleshoot:
Initially I thought this was a permissions issue on the directory containing my files. I used chmod 777 /home/doug/Files and re-ran my script, but the same exception occured.
Next I assumed there was an issue between my machine and the iSeries. I validated that I could indeed put files by using ftp. I was successfully able to put the file on the iSeries IFS using the shell FTP.
Thanks!
Solution
from ftplib import FTP
import os
localpath = os.getcwd() + '/Files/'
def putFiles():
hostname = 'host.name.com'
username = 'myuser'
password = 'mypassword'
myftp = FTP(hostname)
myftp.login(username, password)
myftp.cwd('/STUFF/HERE/')
for csv in os.listdir(localpath):
if csv.endswith('.csv'):
try:
myftp.storbinary("STOR " + csv, open(localpath + csv, 'rb'))
except Exception as e:
print(e)

As written, your code is trying to execute the following FTP command:
STOR /home/doug/Files/FILE.csv
Meaning it is trying to create /home/doug/Files/FILE.csv on the IFS. Is this what you want? I suspect that it isn't, given that you bothered to change the remote directory to /STUFF/HERE/.
If you are trying to issue the command
STOR FILE.csv
then you have to be careful how you deal with the Python variable that you've named file. In general, it's not recommended that you reassign a variable that is the target of a for loop, precisely because this type of confusion can occur. Choose a different variable name for localpath + file, and use that in your open(..., 'rb').
Incidentally, it looks like you're using Python 2, since there is a bare print statement with no parentheses. I'm sure you're aware that Python 3 is recommended by now, but if you do stick to Python 2, it's recommended that you avoid using file as a variable name, because it actually means something in Python 2 (it's the name of a type; specifically, the return type of the open function).

Comparing files once I have hostname

I need a way to compare two files that have the same hostname in them. I have written a function that will parse out the hostnames and save that in a list. Once I have that I need to be able to compare the files.
Each file is in a different directory.
Step One: Retrieve "hostname" from each file.
Step Two: Run compare on files with same "hostname" from two directories.
Retrieve hostname Code:
def hostname_parse(directory):
results = []
try:
for filename in os.listdir(directory):
if filename.endswith(('.cfg', '.startup', '.confg')):
file_name = os.path.join(directory, filename)
with open(file_name, "r") as in_file:
for line in in_file:
match = re.search('hostname\s(\S+)', line)
if match:
results.append(match.group(1))
#print "Match Found"
return results
except IOError as (errno, strerror):
print "I/O error({0}): {1}".format(errno, strerror)
print "Error in hostname_parse function"
Sample Data:
Test File:
19-30#
!
version 12.3
service timestamps debug datetime msec
service timestamps log datetime msec
service password-encryption
!
hostname 19-30
!
boot-start-marker
boot-end-marker
!
ntp clock-period 17179738
ntp source Loopback0
!
end
19-30#
In this case the hostname is 19-30. For ease of testing I just used the same file but modified it to be the same or not the same.
As stated above. I can extract the hostname but am now looking for a way to then compare the files based on the hostname found.
At the core of things it is a file comparison. However being able to look at specific fields will be what I would like to accomplish. For starters I'm just looking to see that the files are identical. Case sensitivity shouldn't matter as these are cisco generated files that have the same formatting. The contents of the files are more important as I'm looking for "configuration" changes.

Here is some code to meet your requirements. I had no way to test, so it may have a few challenges. Is used hash lib to calculate a hash on the file contents, as a way to find changes.
import hashlib
import os
import re
HOSTNAME_RE = re.compile(r'hostname +(\S+)')
def get_file_info_from_lines(filename, file_lines):
hostname = None
a_hash = hashlib.sha1()
for line in file_lines:
a_hash.update(line.encode('utf-8'))
match = HOSTNAME_RE.match(line)
if match:
hostname = match.group(1)
return hostname, filename, a_hash.hexdigest()
def get_file_info(filename):
if filename.endswith(('.cfg', '.startup', '.confg')):
with open(filename, "r") as in_file:
return get_file_info_from_lines(filename, in_file.readlines())
def hostname_parse(directory):
results = {}
for filename in os.listdir(directory):
info = get_file_info(filename)
if info is not None:
results[info[0]] = info
return results
results1 = hostname_parse('dir1')
results2 = hostname_parse('dir2')
for hostname, filename, filehash in results1.values():
if hostname in results2:
_, filename2, filehash2 = results2[hostname]
if filehash != filehash2:
print("%s has a change (%s, %s)" % (
hostname, filehash, filehash2))
print(filename)
print(filename2)
print()

Create a temporary FIFO (named pipe) in Python?

How can you create a temporary FIFO (named pipe) in Python? This should work:
import tempfile
temp_file_name = mktemp()
os.mkfifo(temp_file_name)
open(temp_file_name, os.O_WRONLY)
# ... some process, somewhere, will read it ...
However, I'm hesitant because of the big warning in Python Docs 11.6 and potential removal because it's deprecated.
EDIT: It's noteworthy that I've tried tempfile.NamedTemporaryFile (and by extension tempfile.mkstemp), but os.mkfifo throws:
OSError -17: File already exists
when you run it on the files that mkstemp/NamedTemporaryFile have created.

os.mkfifo() will fail with exception OSError: [Errno 17] File exists if the file already exists, so there is no security issue here. The security issue with using tempfile.mktemp() is the race condition where it is possible for an attacker to create a file with the same name before you open it yourself, but since os.mkfifo() fails if the file already exists this is not a problem.
However, since mktemp() is deprecated you shouldn't use it. You can use tempfile.mkdtemp() instead:
import os, tempfile
tmpdir = tempfile.mkdtemp()
filename = os.path.join(tmpdir, 'myfifo')
print filename
try:
os.mkfifo(filename)
except OSError, e:
print "Failed to create FIFO: %s" % e
else:
fifo = open(filename, 'w')
# write stuff to fifo
print >> fifo, "hello"
fifo.close()
os.remove(filename)
os.rmdir(tmpdir)
EDIT: I should make it clear that, just because the mktemp() vulnerability is averted by this, there are still the other usual security issues that need to be considered; e.g. an attacker could create the fifo (if they had suitable permissions) before your program did which could cause your program to crash if errors/exceptions are not properly handled.

You may find it handy to use the following context manager, which creates and removes the temporary file for you:
import os
import tempfile
from contextlib import contextmanager
#contextmanager
def temp_fifo():
"""Context Manager for creating named pipes with temporary names."""
tmpdir = tempfile.mkdtemp()
filename = os.path.join(tmpdir, 'fifo') # Temporary filename
os.mkfifo(filename) # Create FIFO
try:
yield filename
finally:
os.unlink(filename) # Remove file
os.rmdir(tmpdir) # Remove directory
You can use it, for example, like this:
with temp_fifo() as fifo_file:
# Pass the fifo_file filename e.g. to some other process to read from.
# Write something to the pipe
with open(fifo_file, 'w') as f:
f.write("Hello\n")

How about using
d = mkdtemp()
t = os.path.join(d, 'fifo')

If it's for use within your program, and not with any externals, have a look at the Queue module. As an added benefit, python queues are thread-safe.

Effectively, all that mkstemp does is run mktemp in a loop and keeps attempting to exclusively create until it succeeds (see stdlib source code here). You can do the same with os.mkfifo:
import os, errno, tempfile
def mkftemp(*args, **kwargs):
for attempt in xrange(1024):
tpath = tempfile.mktemp(*args, **kwargs)
try:
os.mkfifo(tpath, 0600)
except OSError as e:
if e.errno == errno.EEXIST:
# lets try again
continue
else:
raise
else:
# NOTE: we only return the path because opening with
# os.open here would block indefinitely since there
# isn't anyone on the other end of the fifo.
return tpath
else:
raise IOError(errno.EEXIST, "No usable temporary file name found")

Why not just use mkstemp()?
For example:
import tempfile
import os
handle, filename = tempfile.mkstemp()
os.mkfifo(filename)
writer = open(filename, os.O_WRONLY)
reader = open(filename, os.O_RDONLY)
os.close(handle)

How to make Python check if ftp directory exists?

I'm using this script to connect to sample ftp server and list available directories:
from ftplib import FTP
ftp = FTP('ftp.cwi.nl') # connect to host, default port (some example server, i'll use other one)
ftp.login() # user anonymous, passwd anonymous#
ftp.retrlines('LIST') # list directory contents
ftp.quit()
How do I use ftp.retrlines('LIST') output to check if directory (for example public_html) exists, if it exists cd to it and then execute some other code and exit; if not execute code right away and exit?

Nslt will list an array for all files in ftp server. Just check if your folder name is there.
from ftplib import FTP
ftp = FTP('yourserver')
ftp.login('username', 'password')
folderName = 'yourFolderName'
if folderName in ftp.nlst():
#do needed task

you can use a list. example
import ftplib
server="localhost"
user="user"
password="test#email.com"
try:
ftp = ftplib.FTP(server)
ftp.login(user,password)
except Exception,e:
print e
else:
filelist = [] #to store all files
ftp.retrlines('LIST',filelist.append) # append to list
f=0
for f in filelist:
if "public_html" in f:
#do something
f=1
if f==0:
print "No public_html"
#do your processing here

You can send "MLST path" over the control connection.
That will return a line including the type of the path (notice 'type=dir' down here):
250-Listing "/home/user":
modify=20131113091701;perm=el;size=4096;type=dir;unique=813gc0004; /
250 End MLST.
Translated into python that should be something along these lines:
import ftplib
ftp = ftplib.FTP()
ftp.connect('ftp.somedomain.com', 21)
ftp.login()
resp = ftp.sendcmd('MLST pathname')
if 'type=dir;' in resp:
# it should be a directory
pass
Of course the code above is not 100% reliable and would need a 'real' parser.
You can look at the implementation of MLSD command in ftplib.py which is very similar (MLSD differs from MLST in that the response in sent over the data connection but the format of the lines being transmitted is the same):
http://hg.python.org/cpython/file/8af2dc11464f/Lib/ftplib.py#l577

The examples attached to ghostdog74's answer have a bit of a bug: the list you get back is the whole line of the response, so you get something like
drwxrwxrwx 4 5063 5063 4096 Sep 13 20:00 resized
This means if your directory name is something like '50' (which is was in my case), you'll get a false positive. I modified the code to handle this:
def directory_exists_here(self, directory_name):
filelist = []
self.ftp.retrlines('LIST',filelist.append)
for f in filelist:
if f.split()[-1] == directory_name:
return True
return False
N.B., this is inside an FTP wrapper class I wrote and self.ftp is the actual FTP connection.

Tom is correct, but no one voted him up
however for the satisfaction who voted up ghostdog74 I will mix and write this code, works for me, should work for you guys.
import ftplib
server="localhost"
user="user"
uploadToDir="public_html"
password="test#email.com"
try:
ftp = ftplib.FTP(server)
ftp.login(user,password)
except Exception,e:
print e
else:
filelist = [] #to store all files
ftp.retrlines('NLST',filelist.append) # append to list
num=0
for f in filelist:
if f.split()[-1] == uploadToDir:
#do something
num=1
if num==0:
print "No public_html"
#do your processing here
first of all if you follow ghost dog method, even if you say directory "public" in f, even when it doesnt exist it will evaluate to true because the word public exist in "public_html" so thats where Tom if condition can be used
so I changed it to if f.split()[-1] == uploadToDir:.
Also if you enter a directory name somethig that doesnt exist but some files and folder exist the second by ghostdog74 will never execute because its never 0 as overridden by f in for loop so I used num variable instead of f and voila the goodness follows...
Vinay and Jonathon are right about what they commented.

In 3.x nlst() method is deprecated. Use this code:
import ftplib
remote = ftplib.FTP('example.com')
remote.login()
if 'foo' in [name for name, data in list(remote.mlsd())]:
# do your stuff
The list() call is needed because mlsd() returns a generator and they do not support checking what is in them (do not have __contains__() method).
You can wrap [name for name, data in list(remote.mlsd())] list comp in a function of method and call it when you will need to just check if a directory (or file) exists.

=> I found this web-page while googling for a way to check if a file exists using ftplib in python. The following is what I figured out (hope it helps someone):
=> When trying to list non-existent files/directories, ftplib raises an exception. Even though Adding a try/except block is a standard practice and a good idea, I would prefer my FTP scripts to download file(s) only after making sure they exist. This helps in keeping my scripts simpler - at least when listing a directory on the FTP server is possible.
For example, the Edgar FTP server has multiple files that are stored under the directory /edgar/daily-index/. Each file is named liked "master.YYYYMMDD.idx". There is no guarantee that a file will exist for every date (YYYYMMDD) - there is no file dated 24th Nov 2013, but there is a file dated: 22th Nov 2013. How does listing work in these two cases?
# Code
from __future__ import print_function
import ftplib
ftp_client = ftplib.FTP("ftp.sec.gov", "anonymous", "MY.EMAIL#gmail.com")
resp = ftp_client.sendcmd("MLST /edgar/daily-index/master.20131122.idx")
print(resp)
resp = ftp_client.sendcmd("MLST /edgar/daily-index/master.20131124.idx")
print(resp)
# Output
250-Start of list for /edgar/daily-index/master.20131122.idx
modify=20131123030124;perm=adfr;size=301580;type=file;unique=11UAEAA398;
UNIX.group=1;UNIX.mode=0644;UNIX.owner=1019;
/edgar/daily-index/master.20131122.idx
250 End of list
Traceback (most recent call last):
File "", line 10, in <module>
resp = ftp_client.sendcmd("MLST /edgar/daily-index/master.20131124.idx")
File "lib/python2.7/ftplib.py", line 244, in sendcmd
return self.getresp()
File "lib/python2.7/ftplib.py", line 219, in getresp
raise error_perm, resp
ftplib.error_perm: 550 '/edgar/daily-index/master.20131124.idx' cannot be listed
As expected, listing a non-existent file generates an exception.
=> Since I know that the Edgar FTP server will surely have the directory /edgar/daily-index/, my script can do the following to avoid raising exceptions due to non-existent files:
a) list this directory.
b) download the required file(s) if they are are present in this listing - To check the listing I typically perform a regexp search, on the list of strings that the listing operation returns.
For example this script tries to download files for the past three days. If a file is found for a certain date then it is downloaded, else nothing happens.
import ftplib
import re
from datetime import date, timedelta
ftp_client = ftplib.FTP("ftp.sec.gov", "anonymous", "MY.EMAIL#gmail.com")
listing = []
# List the directory and store each directory entry as a string in an array
ftp_client.retrlines("LIST /edgar/daily-index", listing.append)
# go back 1,2 and 3 days
for diff in [1,2,3]:
today = (date.today() - timedelta(days=diff)).strftime("%Y%m%d")
month = (date.today() - timedelta(days=diff)).strftime("%Y_%m")
# the absolute path of the file we want to download - if it indeed exists
file_path = "/edgar/daily-index/master.%(date)s.idx" % { "date": today }
# create a regex to match the file's name
pattern = re.compile("master.%(date)s.idx" % { "date": today })
# filter out elements from the listing that match the pattern
found = filter(lambda x: re.search(pattern, x) != None, listing)
if( len(found) > 0 ):
ftp_client.retrbinary(
"RETR %(file_path)s" % { "file_path": file_path },
open(
'./edgar/daily-index/%(month)s/master.%(date)s.idx' % {
"date": today
}, 'wb'
).write
)
=> Interestingly, there are situations where we cannot list a directory on the FTP server. The edgar FTP server, for example, disallows listing on /edgar/data because it contains far too many sub-directories. In such cases, I wouldn't be able to use the "List and check for existence" approach described here - in these cases I would have to use exception handling in my downloader script to recover from non-existent file/directory access attempts.

from ftplib import FTP
ftp = FTP()
ftp.connect(hostname, 21)
ftp.login(username,password)
try:
ftp.cwd('your folder name')
#do the code for successfull cd
except Exception:
#do the code for folder not exists

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Python: Reading and processing Multiple gzip files in remote server - python

Execute a remote OS command such as zgrep, and process the command results locally. This way, you won't have to transfer the whole file contents on your local machine.

Related

Comparing MD5 of downloaded files against files on an SFTP server in Python

Python ftplib error 426 when putting files on iSeries

Comparing files once I have hostname

Create a temporary FIFO (named pipe) in Python?

How to make Python check if ftp directory exists?

Categories

Resources