Open file in an Apache Storm Spout with python

I am trying to make an Apache Storm spout read from a file line by line. I wrote the code below, but it doesn't work: it emits only the first line, over and over:
class SimSpout(storm.Spout):
    # Not much to do here for such a basic spout
    def initialize(self, conf, context):
        ## Open the file with read only permit
        self.f = open('data.txt', 'r')
        ## Read the first line
        self._conf = conf
        self._context = context
        storm.logInfo("Spout instance starting...")
    # Process the next tuple
    def nextTuple(self):
        # check if it reach at the EOF to close it
        for line in self.f.readlines():
            # Emit a random sentence
            storm.logInfo("Emiting %s" % line)
            storm.emit([line])
# Start the spout when it's invoked
SimSpout().run()

Disclaimer: Since I have no way to test this, this answer will simply be from inspection.
You failed to save the filehandle you opened in initialize(). This edit saves the filehandle and then uses the saved filehandle for the read. It also fixes (I hope) some indenting that looked wrong.
class SimSpout(storm.Spout):
    # Not much to do here for such a basic spout
    def initialize(self, conf, context):
        ## Open the file with read only permit
        self.f = open('mydata.txt', 'r')
        self._conf = conf
        self._context = context
        storm.logInfo("Spout instance starting...")
    # Process the next tuple
    def nextTuple(self):
        # check if it reach at the EOF to close it
        for line in self.f.readlines():
            # Emit a random sentence
            storm.logInfo("Emiting %s" % line)
            storm.emit([line])
# Start the spout when it's invoked
SimSpout().run()
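Note that readlines() consumes the whole file on the first nextTuple() call. If the goal is one tuple per call, a minimal sketch (not tested against Storm) is to keep the open file and fetch a single line each time. LineSource and next_line are hypothetical names introduced for this illustration and are not part of the storm module; the sketch assumes nextTuple() is invoked repeatedly by the framework.

```python
import os
import tempfile


class LineSource:
    """Hypothetical helper: hands out one line of a text file per call."""

    def __init__(self, path):
        self._file = open(path, "r")

    def next_line(self):
        """Return the next line without its newline, or None at end of file."""
        if self._file is None:
            return None
        line = self._file.readline()
        if line == "":
            # readline() returns an empty string only at EOF
            self._file.close()
            self._file = None
            return None
        return line.rstrip("\n")


def demo():
    # Write a small sample file, then drain it one call at a time
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
        tmp.write("first\nsecond\nthird\n")
    source = LineSource(tmp.name)
    lines = []
    while True:
        line = source.next_line()
        if line is None:
            break
        lines.append(line)
    os.remove(tmp.name)
    return lines


print(demo())
```

Inside the spout, initialize() would create the LineSource, and nextTuple() would call next_line() and emit the result whenever it is not None.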

Related

Python Callback for File Object Close

I am working on a custom file path class, which should always execute a function
after the corresponding system file has been written to and its file object
closed. The function will upload the contents of the file path to a remote location.
I want the upload functionality to happen entirely behind the scenes from a user
perspective, i.e. the user can use the class just like any other os.PathLike
class and automatically get the upload functionality. Pseudocode below for
reference.
import os
class CustomPath(os.PathLike):
    def __init__(self, remote_path: str):
        self._local_path = "/some/local/path"
        self._remote_path = remote_path
    def __fspath__(self) -> str:
        return self._local_path
    def upload(self):
        # Upload local path to remote path.
        pass
I can of course handle automatically calling the upload function when the
user calls any of the class's methods directly.
However, it is unclear to me how to automatically call the upload function if
someone writes to the file with the builtin open as follows.
custom_path = CustomPath("some remote location")
with open(custom_path, "w") as handle:
    handle.write("Here is some text.")
or
custom_path = CustomPath("some remote location")
handle = open(custom_path, "w")
handle.write("Here is some text.")
handle.close()
I desire compatibility with invocations of the open function, so that the
upload behavior will work with all third party file writers. Is this kind of
behavior possible in Python?

Yes, it is possible in Python by combining function overriding, a custom context manager, and the __getattr__ hook. Here's the basic logic:
Override the builtins.open() function with a custom open class (the class below shadows the name open in the module that defines it).
Make it work as a context manager by defining __enter__ and __exit__ methods.
Make it work for ordinary read/write calls by defining a __getattr__ method.
Call the builtins functions from the class whenever necessary.
Invoke the callback function automatically when the close() method is called.
Here's the sample code:
import builtins
import os
to_be_monitored = ['from_file1.txt', 'from_file2.txt']
# callback function (called when file closes)
def upload(content_file):
    # check for required file
    if content_file in to_be_monitored:
        # copy the contents
        with builtins.open(content_file, 'r') as ff:
            with builtins.open(remote_file, 'a') as tf:
                # some logic for writing only new contents can be used here
                tf.write('\n'+ff.read())
class open(object):
    def __init__(self, path, mode='r'):
        self.path = path
        self.mode = mode
        self.file = None
    # called when context manager invokes
    def __enter__(self):
        self.file = builtins.open(self.path, self.mode)
        return self.file
    # called when context manager returns
    def __exit__(self, *args):
        self.file.close()
        # after closing calling upload()
        upload(self.path)
        # return False so exceptions raised inside the with block propagate
        return False
    # called when normal non context manager invokes the object
    def __getattr__(self, item):
        # open the underlying file once, on first use
        if self.file is None:
            self.file = builtins.open(self.path, self.mode)
        # if close, close the file first and then call upload()
        if item == 'close':
            def close_and_upload():
                self.file.close()
                upload(self.path)
            return close_and_upload
        return getattr(self.file, item)
if __name__ == '__main__':
    remote_file = 'to_file.txt'
    local_file1 = 'from_file1.txt'
    local_file2 = 'from_file2.txt'
    # just checks and creates remote file, not related to the actual problem
    if not os.path.isfile(remote_file):
        f = builtins.open(remote_file, 'w')
        f.close()
    # DRIVER CODE
    # writing with context manager
    with open(local_file1, 'w') as f:
        f.write('some text written with context manager to file1')
    # writing without context manager
    f = open(local_file2, 'w')
    f.write('some text written without using context manager to file2')
    f.close()
    # reading file
    with open(remote_file, 'r') as f:
        print('remote file contains:\n', f.read())
What does it do:
Writes "some text written with context manager to file1" to from_file1.txt and "some text written without using context manager to file2" to from_file2.txt, and automatically copies both texts to to_file.txt without any explicit copy call.
How does it do it (context manager case):
with open(local_file1, 'w') as f: creates an object of the custom class open and initializes its path and mode variables. Because of the with ... as block, __enter__ is then called; it opens the file using builtins.open() and returns an _io.TextIOWrapper (an opened text file object). It is a normal file object, so we can use it normally for read/write operations. At the end of the block the context manager calls __exit__, which closes the file and automatically calls the required callback (here upload), passing the path of the file just closed. In this callback function we can perform any operation, such as copying.
The non-context-manager case works similarly, except that __getattr__ is the one doing the magic.
Here are the contents of the files after the code runs:
from_file1.txt
some text written with context manager to file1
from_file2.txt
some text written without using context manager to file2
to_file.txt
some text written with context manager to file1
some text written without using context manager to file2

Based on your comment to Girish Dattatray Hegde, it seems that what you would like to do is something like the following to override the default __exit__ handler for open:
import io
old_exit = io.FileIO.__exit__ # builtin __exit__ method
def upload(self):
    print(self.read()) # just print out contents
def new_exit(self):
    try:
        upload(self)
    finally:
        old_exit(self) # invoke the builtin __exit__ method
io.FileIO.__exit__ = new_exit # establish our __exit__ method
with open('test.html') as f:
    print(f.closed) # False
print(f.closed) # True
Unfortunately, the above code results in the following error:
test.py", line 18, in <module>
io.FileIO.__exit__ = new_exit # establish our __exit__ method
TypeError: can't set attributes of built-in/extension type '_io.FileIO'
So, I don't believe it is possible to do what you want to do. Ultimately you can create your own subclasses and override methods, but you cannot replace methods of the existing built-in file classes.
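The subclass route mentioned above can be sketched as follows. This is only an illustration under stated assumptions: UploadingFile and its on_close parameter are hypothetical names, the callback simply records the path instead of uploading anything, and the class only helps callers who construct it themselves; it does not hook the built-in open.

```python
import io
import os
import tempfile


class UploadingFile(io.FileIO):
    """Hypothetical FileIO subclass that runs a callback after the file closes."""

    def __init__(self, path, mode="r", on_close=None):
        super().__init__(path, mode)
        self._on_close = on_close

    def close(self):
        # close() may be called more than once; fire the callback only the
        # first time, after the data has reached the file
        was_open = not self.closed
        super().close()
        if was_open and self._on_close is not None:
            self._on_close(self.name)


def demo():
    uploaded = []  # stands in for a real upload; records the closed paths
    path = os.path.join(tempfile.mkdtemp(), "example.bin")
    with UploadingFile(path, "w", on_close=uploaded.append) as handle:
        handle.write(b"Here is some text.")
    with open(path, "rb") as check:
        content = check.read()
    return uploaded == [path], content


print(demo())
```

Leaving the with block triggers close(), so the callback runs in both the context-manager and the explicit handle.close() styles from the question.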

Write header to a python log file, but only if a record gets written

fh = logging.FileHandler('example.log',delay = True)
fh.setLevel(logging.INFO)
Since delay is True, the file will never be written unless something is logged.
At that point, the first line in the file is the first record, and it will contain the asctime, levelname etc. elements.
Using Python 2.7.10, is there a sane way to add a line (or two) the first time a record is written that doesn't include those elements?
I can just write to the file before using it for logging, but if I do that, I end up with logs that are empty except for the header.
The desired output might look like:
Using test.fil with option 7
2015-11-01 13:57:58,045 :log_example: INFO fn:main result:process 4 complete --7 knights said ni
2015-11-01 13:57:58,045 :log_example: INFO fn:main result:process 3 complete --3 bunnies attacked
Thanks,

Subclass FileHandler to create your own custom FileHandlerWithHeader, as shown below:
import os
import logging
# Create a class that extends the FileHandler class from logging.FileHandler
class FileHandlerWithHeader(logging.FileHandler):
    # Pass the file name and header string to the constructor.
    def __init__(self, filename, header, mode='a', encoding=None, delay=0):
        # Store the header information.
        self.header = header
        # Determine if the file pre-exists
        self.file_pre_exists = os.path.exists(filename)
        # Call the parent __init__
        logging.FileHandler.__init__(self, filename, mode, encoding, delay)
        # Write the header if delay is False and a file stream was created.
        if not delay and self.stream is not None:
            self.stream.write('%s\n' % header)
    def emit(self, record):
        # Create the file stream if not already created.
        if self.stream is None:
            self.stream = self._open()
            # If the file pre_exists, it should already have a header.
            # Else write the header to the file so that it is the first line.
            if not self.file_pre_exists:
                self.stream.write('%s\n' % self.header)
        # Call the parent class emit function.
        logging.FileHandler.emit(self, record)
# Create a logger and set the logging level.
logger = logging.getLogger("example")
logger.setLevel(logging.INFO)
# Create a file handler from our new FileHandlerWithHeader class and set the
# logging level.
fh = FileHandlerWithHeader('example.log', 'This is my header', delay=True)
fh.setLevel(logging.INFO)
# Add formatter to the file handler.
formatter = logging.Formatter("%(asctime)s - %(name)s - %(levelname)s - %(message)s")
fh.setFormatter(formatter)
# Add the handler to the logger.
logger.addHandler(fh)
# Since the constructor of the FileHandlerWithHeader was passed delay=True
# the file should not exist until the first log as long as the log file did
# not pre-exist.
print "Ready to write to the example.log file."
raw_input("Press Enter to continue...")
# Send 3 logs to the logger.
logger.info("First line in the file")
logger.info("Second line in the file")
logger.info("Third line in the file")
# The log file should now be created and only have a header at the beginning of
# the file.
print "The example.log file should exist and have a header."
This script should run as is in Python 2.7. If the "example.log" file already exists, it will not recreate the header.
This solution required knowledge of the logging source code and general use of the Python logging package.

I had a simpler idea. The following just uses a custom formatter. The first message formatted spits out a header record then after that just does normal formatting.
import logging
class FormatterWithHeader(logging.Formatter):
    def __init__(self, header, fmt=None, datefmt=None, style='%'):
        super().__init__(fmt, datefmt, style)
        self.header = header # This is hard coded but you could make dynamic
        # Override the normal format method
        self.format = self.first_line_format
    def first_line_format(self, record):
        # First time in, switch back to the normal format function
        self.format = super().format
        return self.header + "\n" + self.format(record)
def test_logger():
    logger = logging.getLogger("test")
    logger.setLevel(logging.DEBUG)
    formatter = FormatterWithHeader('First Line Only')
    ch = logging.StreamHandler()
    ch.setLevel(logging.DEBUG)
    ch.setFormatter(formatter)
    logger.addHandler(ch)
    logger.info("This line will kick out a header first.")
    logger.info("This line will *not* kick out a header.")
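To connect the formatter idea back to the original question, here is a hedged Python 3 sketch that pairs a header-emitting formatter with logging.FileHandler(delay=True), so the file (and the header) only appear once a record is written. It is a variant of the answer above that uses a flag rather than swapping the format method; the logger name, temporary file location, and format string are assumptions made for the example.

```python
import logging
import os
import tempfile


class FormatterWithHeader(logging.Formatter):
    """Prepends a header to the first record formatted, and only to that one."""

    def __init__(self, header, fmt=None, datefmt=None, style="%"):
        super().__init__(fmt, datefmt, style)
        self.header = header
        self._header_pending = True

    def format(self, record):
        text = super().format(record)
        if self._header_pending:
            self._header_pending = False
            return self.header + "\n" + text
        return text


def demo():
    path = os.path.join(tempfile.mkdtemp(), "example.log")
    handler = logging.FileHandler(path, delay=True)
    handler.setFormatter(
        FormatterWithHeader("Using test.fil with option 7", "%(levelname)s %(message)s")
    )
    logger = logging.getLogger("header_demo")  # logger name chosen for this sketch
    logger.setLevel(logging.INFO)
    logger.propagate = False
    logger.addHandler(handler)

    existed_before = os.path.exists(path)  # delay=True: nothing opened yet
    logger.info("process 4 complete")
    logger.info("process 3 complete")
    logger.removeHandler(handler)
    handler.close()

    with open(path) as log_file:
        return existed_before, log_file.read().splitlines()


print(demo())
```

Unlike the handler-based answer, this variant does not check whether the log file already existed, so a header is written once per formatter instance.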

When working with a named pipe is there a way to do something like readlines()

Overall goal: I am trying to read progress data from a Python exe so that another application can display the exe's progress.
I have a Python exe that is going to do some work, and I want to communicate its progress to another program. Based on several other Q&As here, I have been able to have my running application send progress data to a named pipe using the following code:
import win32pipe
import win32file
import glob
test_files = glob.glob('J:\\someDirectory\\*.htm')
# test_files has two items a.htm and b.htm
p = win32pipe.CreateNamedPipe(r'\\.\pipe\wfsr_pipe',
    win32pipe.PIPE_ACCESS_DUPLEX,
    win32pipe.PIPE_TYPE_MESSAGE | win32pipe.PIPE_WAIT,
    1,65536,65536,300,None)
# the following line is the server-side function for accepting a connection
# see the following SO question and answer
""" http://stackoverflow.com/questions/1749001/named-pipes-between-c-sharp-and-python
"""
win32pipe.ConnectNamedPipe(p, None)
for each in test_files:
    win32file.WriteFile(p,each + '\n')
#send final message
win32file.WriteFile(p,'Process Complete')
# close the connection
p.close()
In short, the example code writes the path of each globbed file to the named pipe. This is useful and can easily be extended to more logging-type events. However, the problem is how to read the content of the named pipe without knowing the size of each possible message. For example, the first file could be named J:\someDirectory\a.htm, but the second could have 300 characters in its name.
So far, the code I am using to read the contents of the pipe requires that I specify a buffer size.
First I establish the connection:
file_handle = win32file.CreateFile("\\\\.\\pipe\\wfsr_pipe",
    win32file.GENERIC_READ | win32file.GENERIC_WRITE,
    0, None,
    win32file.OPEN_EXISTING,
    0, None)
and then I have been playing around with reading from the file
data = win32file.ReadFile(file_handle,128)
This generally works, but I really want to read until I hit a newline character, do something with the content between where I started reading and the newline, and then repeat the process until I get to a line that contains Process Complete.
I have been struggling with how to read only until I find a newline character (\n). I basically want to read the file by lines and, based on the content of each line, do something (either display the line or shift the application focus).
Based on the suggestion provided by @meuh, I am updating this because I think there is a dearth of examples and guidance on how to use pipes.
My server code
import win32pipe
import win32file
import glob
import os
p = win32pipe.CreateNamedPipe(r'\\.\pipe\wfsr_pipe',
    win32pipe.PIPE_ACCESS_DUPLEX,
    win32pipe.PIPE_TYPE_MESSAGE | win32pipe.PIPE_WAIT,
    1,65536,65536,300,None)
# the following line is the server-side function for accepting a connection
# see the following SO question and answer
""" http://stackoverflow.com/questions/1749001/named-pipes-between-c-sharp-and-python
"""
win32pipe.ConnectNamedPipe(p, None)
for file_id in glob.glob('J:\\level1\\level2\\level3\\*'):
    for filer_id in glob.glob(file_id + os.sep + '*'):
        win32file.WriteFile(p,filer_id)
#send final message
win32file.WriteFile(p,'Process Complete')
# close the connection
p.close() #still not sure if this should be here, I need more testing
# I think the client can close p
The Client code
import win32pipe
import win32file
file_handle = win32file.CreateFile("\\\\.\\pipe\\wfsr_pipe",
    win32file.GENERIC_READ |
    win32file.GENERIC_WRITE,
    0, None,win32file.OPEN_EXISTING,0, None)
# this is the key, setting readmode to MESSAGE
win32pipe.SetNamedPipeHandleState(file_handle,
    win32pipe.PIPE_READMODE_MESSAGE, None, None)
# for testing purposes I am just going to write the messages to a file
out_ref = open('e:\\testpipe.txt','w')
dstring = '' # need some way to know that the messages are complete
while dstring != 'Process Complete':
    # setting the blocksize at 4096 to make sure it can handle any message I
    # might anticipate
    data = win32file.ReadFile(file_handle,4096)
    # data is a tuple, the first position seems to always be 0 but need to find
    # the docs to help understand what determines the value, the second is the
    # message
    dstring = data[1]
    out_ref.write(dstring + '\n')
out_ref.close() # got here so close my testfile
file_handle.close() # close the file_handle

I don't have Windows, but looking through the API it seems you should convert
your client to message mode by adding this call after CreateFile():
win32pipe.SetNamedPipeHandleState(file_handle,
    win32pipe.PIPE_READMODE_MESSAGE, None, None)
Then each sufficiently long read will return a single message, i.e. what the other side wrote in a single write. You already set PIPE_TYPE_MESSAGE when you created the pipe.
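Independently of message mode, a reader can also reassemble lines from fixed-size reads by buffering until a newline arrives. The following is a minimal, platform-neutral sketch of that buffering technique, not pywin32 code: iter_lines and read_chunk are hypothetical names, and the sketch assumes the chunk source returns b"" when it is exhausted.

```python
def iter_lines(read_chunk):
    """Yield complete lines from a source that returns arbitrary byte chunks.

    read_chunk is any zero-argument callable returning bytes, and b"" once
    the source is exhausted (an assumption of this sketch).
    """
    buffer = b""
    while True:
        chunk = read_chunk()
        if not chunk:
            break
        buffer += chunk
        # Hand out every complete line currently in the buffer
        while b"\n" in buffer:
            line, buffer = buffer.split(b"\n", 1)
            yield line
    if buffer:
        # Final message that was not newline-terminated
        yield buffer


def demo():
    # Chunks deliberately split in awkward places, as a small buffer would
    chunks = iter([
        b"J:\\someDir",
        b"ectory\\a.htm\nJ:\\some",
        b"Directory\\b.htm\nProcess Complete",
        b"",
    ])
    return list(iter_lines(lambda: next(chunks)))


print(demo())
```

With the pipe from the question, read_chunk would wrap the ReadFile call and return the second element of the tuple it produces; how the end of the pipe is detected is left to the caller.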

You could simply use an implementation of io.IOBase that would wrap the NamedPipe.
import io
import win32file
class PipeIO(io.RawIOBase):
    def __init__(self, handle):
        self.handle = handle
    def read(self, n=-1):
        if n == 0: return b""
        elif n == -1: return self.readall()
        # ReadFile returns a (result code, data) tuple; keep only the data
        hr, data = win32file.ReadFile(self.handle, n)
        return data
    def readinto(self, b):
        data = self.read(len(b))
        b[:len(data)] = data
        return len(data)
    def readall(self):
        data = b""
        while True:
            hr, chunk = win32file.ReadFile(self.handle, 10240)
            if len(chunk) == 0: return data
            data += chunk
BEWARE: untested, but it should work after fixing any remaining typos.
You could then do:
with PipeIO(file_handle) as fd:
    for line in fd:
        # process a line
        pass

You could use the msvcrt module and open to turn the pipe into a file object.
Sending code
import win32pipe
import os
import msvcrt
from io import open
pipe = win32pipe.CreateNamedPipe(r'\\.\pipe\wfsr_pipe',
    win32pipe.PIPE_ACCESS_OUTBOUND,
    win32pipe.PIPE_TYPE_MESSAGE | win32pipe.PIPE_WAIT,
    1,65536,65536,300,None)
# wait for another process to connect
win32pipe.ConnectNamedPipe(pipe, None)
# get a file descriptor to write to
write_fd = msvcrt.open_osfhandle(pipe, os.O_WRONLY)
with open(write_fd, "w") as writer:
    # now we have a file object that we can write to in a standard way
    for i in range(0, 10):
        # create "a\n" in the first iteration, "bb\n" in the second and so on
        text = chr(ord("a") + i) * (i + 1) + "\n"
        writer.write(text)
Receiving code
import win32file
import os
import msvcrt
from io import open
handle = win32file.CreateFile(r"\\.\pipe\wfsr_pipe",
    win32file.GENERIC_READ,
    0, None,
    win32file.OPEN_EXISTING,
    0, None)
read_fd = msvcrt.open_osfhandle(handle, os.O_RDONLY)
with open(read_fd, "r") as reader:
    # now we have a file object with the readlines and other file api methods
    lines = reader.readlines()
print(lines)
Some notes:
I've only tested this with Python 3.4, but I believe you may be using Python 2.x.
Python seems to get weird if you try to close both the file object and the pipe, so I've only used the file object (by using the with block).
I've only created the file objects to read on one end and write on the other. You can of course make the file objects duplex by:
Creating the file descriptors (read_fd and write_fd) with the os.O_RDWR flag
Creating the file objects in "r+" mode rather than "r" or "w"
Going back to creating the pipe with the win32pipe.PIPE_ACCESS_DUPLEX flag
Going back to creating the file handle object with the win32file.GENERIC_READ | win32file.GENERIC_WRITE flags.

how to read from a particular line upto a particular line in python

I have a file and I want to read a specific part of it. This is the file.
.....
.....
Admin server interface begins
...
....
....
....
....
Admin server interface ends
....
....
I want to read the part of the file between 'Admin server interface begins' and 'Admin server interface ends'. I found a way of doing it in Perl but can't find a way in Python.
In Perl:
while (<INP>)
{
    print $_ if(/^AdminServer interface definitions begins/ .. /^AdminServer interface definitions ends/);
}
Could anyone please help?

You can read the file line by line and gather what is in between your markers.
def dispatch(inputfile):
    # if the separator lines must be included, set to True
    need_separator = True
    new = False
    rec = []
    with open(inputfile) as f:
        for line in f:
            if "Admin server interface begins" in line:
                new = True
                if need_separator:
                    rec = [line]
                else:
                    rec = []
            elif "Admin server interface ends" in line:
                if need_separator:
                    rec.append(line)
                new = False
                # if you do not need to process further, uncomment the following line
                #return ''.join(rec)
            elif new:
                rec.append(line)
    return ''.join(rec)
The code above will successfully return data even if the input file does not contain the ending separator (Admin server interface ends). You can amend the last return with a condition if you want to catch such files:
    if new:
        # handle the case where there is no end separator
        print("Error in input file: no ending separator")
        return ''
    else:
        return ''.join(rec)
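The same line-by-line idea can be written as a small generator, which is roughly the shape of the Perl range operator from the question. This is a sketch rather than a drop-in replacement: lines_between and include_markers are names made up for the example, matching is by substring as in the answer above, and iteration stops at the first end marker.

```python
def lines_between(lines, begin, end, include_markers=True):
    """Yield the lines from the one containing `begin` to the one containing `end`."""
    inside = False
    for line in lines:
        if not inside and begin in line:
            inside = True
            if include_markers:
                yield line
        elif inside and end in line:
            if include_markers:
                yield line
            return  # stop at the first end marker
        elif inside:
            yield line


# Made-up sample content standing in for the file from the question
sample = [
    "header\n",
    "Admin server interface begins\n",
    "port 7001\n",
    "host localhost\n",
    "Admin server interface ends\n",
    "footer\n",
]

section = list(lines_between(sample, "Admin server interface begins",
                             "Admin server interface ends"))
print("".join(section))
```

Because any iterable of lines works, an open file object can be passed directly in place of the sample list.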

If the file isn't very big and you are not concerned about memory consumption, you can write this simple solution:
from os.path import isfile
def collect_admin_server_interface_info(filename):
    """ Collects admin server interface information from specified file. """
    if isfile(filename):
        contents = ''
        with open(filename, 'r') as f:
            contents = f.read()
        beg_str = 'Admin server interface begins'
        end_str = 'Admin server interface ends'
        beg_index = contents.find(beg_str)
        end_index = contents.find(end_str)
        if beg_index == -1 or end_index == -1:
            raise ValueError("Admin server interface not found.")
        # skip past the begin marker itself
        return contents[beg_index + len(beg_str) : end_index]
    else:
        raise IOError("File doesn't exist.")
This method will try to return a single string containing administrator server interface information.

flask+ftplib basic application

from flask import Flask
from ftplib import FTP
app = Flask(__name__)
@app.route("/")
def hello():
    address="someserver"
    global FTP
    ftp = FTP(address)
    ftp.login()
    return ftp.retrlines("LIST")
if __name__ == "__main__":
    app.run()
...this gives me the following output:
226-Options: -l 226 1 matches total
The question is: why doesn't this return the output of retrlines, and how can I make it do so?

The documentation for the ftplib.FTP class says that retrlines takes an optional callback; if no callback is provided, "The default callback prints the line to sys.stdout." This means that retrlines does not actually return the listing: it passes each line, as it is received, to the callback, and returns only the server's final response string (the 226 line you are seeing). This leaves you with a couple of options:
Pass in a callable that can store the results of being called multiple times:
def fetchlines(line=None):
    if line is not None:
        # As long as we are called with a line
        # store the line in the array we added to this function
        fetchlines.lines.append(line)
    else:
        # When we are called without a line
        # we are retrieving the lines
        # Truncate the array after copying it
        # so we can re-use this function
        lines = fetchlines.lines[:]
        fetchlines.lines = []
        return lines
fetchlines.lines = []
@app.route("/")
def hello():
    ftp = FTP("someaddress")
    ftp.login()
    ftp.dir(fetchlines)
    lines = fetchlines()
    return "<br>".join(lines)
Replace sys.stdout with a file-like object (from cStringIO for example) and then simply read the file afterwards:
from cStringIO import StringIO
import sys
@app.route("/")
def hello():
    ftp = FTP("someaddress")
    ftp.login()
    # Change sys.stdout to point to a file-like object rather than a terminal
    file_like = StringIO()
    standard_out = sys.stdout
    sys.stdout = file_like
    try:
        ftp.dir()
    finally:
        # Always restore the real stdout, even if the FTP call fails
        sys.stdout = standard_out
    # lines in this case will be a string, not a list
    lines = file_like.getvalue()
    file_like.close()
    return lines
Neither of these techniques will hold up well under a lot of load - or even under any real concurrency. There are ways to solve for that, but I'll leave that for another day.
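One way to sidestep the shared state in the first option is to collect into a fresh list on every call and pass its append method as the callback. The sketch below assumes only that the object passed in has the retrlines(cmd, callback) method of ftplib.FTP; list_directory is a name introduced here, and FakeFTP with its listing lines is a made-up stand-in so the example runs without a server.

```python
def list_directory(ftp):
    """Return the LIST output of an FTP connection as a list of lines."""
    lines = []  # fresh list per call, so concurrent requests do not share state
    ftp.retrlines("LIST", lines.append)
    return lines


class FakeFTP:
    """Stand-in for ftplib.FTP so the sketch runs without a server."""

    def retrlines(self, cmd, callback=None):
        # Made-up listing lines, fed to the callback one at a time
        for line in ("-rw-r--r-- 1 ftp ftp 12 notes.txt",
                     "drwxr-xr-x 2 ftp ftp 0 pub"):
            callback(line)
        return "226 Transfer complete"


print("<br>".join(list_directory(FakeFTP())))
```

In the Flask view, the return value would be joined with "<br>" exactly as in the first option.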
