I've been working on some Spark Streaming using Python, specifically textFileStream, and I've noticed a slightly weird behaviour. I was wondering if anybody could help explain this to me.
I currently have my code set up as follows:
import re

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

def fileName(data):
    debug = data.toDebugString()
    pattern = re.compile(r"file:/.*\.txt")
    files = pattern.findall(debug)
    return files

if __name__ == "__main__":
    sc = SparkContext(appName="PythonStreamingFileNamePrinter")
    ssc = StreamingContext(sc, 1)

    lines = ssc.textFileStream("file:///test/input/")
    files = lines.foreachRDD(fileName)
    print(files)

    ssc.start()
    ssc.awaitTermination()
The fileName function simply grabs the name of the file being processed from the debug string (see Spark Streaming: How to get the filename of a processed file in Python). However, this code only runs once, printing files exactly once. When I modify the function as follows:
def fileName(data):
    debug = data.toDebugString()
    pattern = re.compile(r"file:/.*\.txt")
    files = pattern.findall(debug)
    print(files)
it checks the directory every second, as expected. It seems the only code that 'loops' is inside foreachRDD.
Am I correct in this assumption, and all processing (including loops, conditionals etc) must occur inside map functions and the like?
Thanks,
M
A DStream is composed of many RDDs that are built over time.
lines is a DStream.
When you perform foreachRDD on lines, each RDD in your stream is transformed into a string. So when you print it, you are getting a list of strings that represent all the RDDs in the stream. In other words, this happens "at the end of the stream".
When you print the string inside the fileName function, you are doing it for each RDD in the stream while it is being processed. So you are getting it while the stream is running.
Also, as I mentioned in your previous question, foreachRDD is not necessary here. It is not "the Spark Streaming way" for this specific need, and maybe that is why it confuses you.
The more direct way here is to use a map on the DStream itself (which will affect all the RDDs in it) and then use pprint().
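For illustration, here is a minimal sketch of that map + pprint pattern, assuming lines is the DStream from the question (the tagging lambda is just a placeholder transformation, not something from the original code):

# Transform each record of every batch, then let Spark print a sample of each batch.
tagged = lines.map(lambda line: ("seen", line))
tagged.pprint()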
Remember that, unlike a regular RDD, you can't just collect (or anything similar) the RDDs in a stream and return the result while the stream is running. You need to do something with that data that either saves it to some external source (if needed) or processes it as part of the state of the whole stream.
Related
I read position data from a GPS sensor into a dictionary, which I send to a server at a cyclic interval.
If there is no coverage, the data is saved in a list.
Once the connection is re-established, all list items are transmitted.
But if a power interruption occurs, all temporary data elements are lost.
What would be the most Pythonic solution to persist this data?
I am using an SD card as storage, so I am not sure whether writing every element to a file would be the best solution.
Current implementation:
stageddata = []

position = {'lat': '1.2345', 'lon': '2.3455', 'timestamp': '2020-10-18T15:08:04'}

if not transmission(position):
    stageddata.append(position)
else:
    while stageddata:
        position = stageddata.pop()
        if not transmission(position):
            stageddata.append(position)
            return
EDIT: Finding the "best" solution may be very subjective. I agree with zvone that a power outage can be prevented; perhaps a shutdown routine should save the temporary data.
So the question may be: what is a Pythonic way to save a given list to a file?
A good solution for temporary storage in Python is tempfile.
You can use it, e.g., like the following:
import tempfile

with tempfile.TemporaryFile() as fp:
    # Store your variable (TemporaryFile is opened in binary mode, so write bytes)
    fp.write(your_variable_to_temp_store)

    # Do some other stuff

    # Read the file back
    fp.seek(0)
    fp.read()
I agree with the comment of zvone. In order to know the best solution, we would need more information.
The following would be a robust and configurable solution.
import os
import pickle

backup_interval = 2
backup_file = 'gps_position_backup.bin'

def read_backup_data():
    file_backup_data = []
    if os.path.exists(backup_file):
        with open(backup_file, 'rb') as f:
            while True:
                try:
                    coordinates = pickle.load(f)
                except EOFError:
                    break
                file_backup_data += coordinates
    return file_backup_data

# When the script is started and backup data exists, stageddata uses it
stageddata = read_backup_data()

def write_backup_data():
    tmp_backup_file = 'tmp_' + backup_file
    with open(tmp_backup_file, 'wb') as f:
        pickle.dump(stageddata, f)
    os.replace(tmp_backup_file, backup_file)
    print('Wrote data backup!')

# Mockup variable and method
transmission_return = False

def transmission(position):
    return transmission_return

def try_transmission(position):
    if not transmission(position):
        stageddata.append(position)
        if len(stageddata) % backup_interval == 0:
            write_backup_data()
    else:
        while stageddata:
            position = stageddata.pop()
            if not transmission(position):
                stageddata.append(position)
                return
            else:
                if len(stageddata) % backup_interval == 0:
                    write_backup_data()

if __name__ == '__main__':
    # transmission_return is False, so write to backup_file
    for counter in range(10):
        position = {'lat': '1.2345', 'lon': '2.3455'}
        try_transmission(position)

    # transmission_return is True, transmit positions and "update" backup_file
    transmission_return = True
    position = {'lat': '1.2345', 'lon': '2.3455'}
    try_transmission(position)
I moved your code into some functions. With the variable backup_interval, it is possible to control how often a backup is written to disk.
Additional Notes:
I use the built-in pickle module, since the data does not have to be human-readable or portable to other programming languages. Alternatives are JSON, which is human-readable, or msgpack, which might be faster but needs an extra package to be installed. Using tempfile is not a Pythonic solution here, as the file cannot easily be retrieved in case the program crashes.
stageddata is written to disk when it hits the backup_interval (obviously), but also when transmission returns True within the while loop. This is needed to "synchronize" the data on disk.
The data is written to disk completely anew every time. A more sophisticated approach would be to append only the newly added positions, but then the synchronizing part I described before would be more complicated too. Additionally, the safer temporary-file approach (see the Edit below) would not work.
Edit: I just reconsidered your use case. The main problem here is restoring data even if the program gets interrupted at any time (due to a power interruption or whatever). My first solution just wrote the data to disk (which solves part of the problem), but it could still happen that the program crashes at the very moment of writing to disk. In that case the file would probably be corrupt and the data lost. I adapted the function write_backup_data() so that it writes to a temporary file first and then replaces the old file. So now, even if a lot of data has to be written to disk and the crash happens then, the previous backup file is still available.
Maybe saving it in a binary format could help to minimize the storage. The pickle and shelve modules will help with storing objects and serializing them (to serialize an object means to convert its state to a byte stream so that the byte stream can be reverted back into a copy of the object). You should be careful that, when you recover from the power interruption, you do not overwrite the data you have already stored; opening the file with open(file, "a") ("a" == append) avoids that.
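A minimal sketch of that append idea (the file name is hypothetical, and note that pickle writes bytes, so the mode is "ab" rather than "a"):

import os
import pickle

STAGING_FILE = "staged_positions.pkl"  # hypothetical file name

def stage_to_disk(position):
    # Append-binary mode: earlier records are never overwritten.
    with open(STAGING_FILE, "ab") as f:
        pickle.dump(position, f)

def load_staged():
    staged = []
    if os.path.exists(STAGING_FILE):
        with open(STAGING_FILE, "rb") as f:
            while True:
                try:
                    staged.append(pickle.load(f))
                except EOFError:
                    break
    return staged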
I'm developing a Python app that deals with big objects, and to avoid filling the PC's RAM during execution, I chose to store my temporary objects (created in one step, used by the next step) in files with the pickle module.
While trying to optimize memory consumption, I saw a behaviour that I don't understand.
In the first case, I open my temp file, then loop over the actions I need to do, and during the loop I regularly dump objects into the file. It works well, but as the file pointer remains open, it consumes a lot of memory. Here is the code example:
tmp_file_path = "toto.txt"
with open(tmp_file_path, 'ab') as f:
    p = pickle.Pickler(f)
    for filepath in self.file_list:  # loop over the files to be processed
        try:
            my_obj = process_file(filepath)
            storage_obj = StorageObj()
            storage_obj.add(os.path.basename(filepath), my_obj)
            p.dump(storage_obj)
            [...]
In the second case, I only open my temp file when I need to write to it:
tmp_file_path = "toto.txt"
for filepath in self.file_list:  # loop over the files to be processed
    try:
        my_obj = process_file(filepath)
        storage_obj = StorageObj()
        storage_obj.add(os.path.basename(filepath), my_obj)
        with open(tmp_file_path, 'ab') as f:
            p = pickle.Pickler(f)
            p.dump(storage_obj)
            [...]
The code is the same in both versions except for the block:

with open(tmp_file_path, 'ab') as f:
    p = pickle.Pickler(f)

which moves inside/outside the loop.
And for the unpickling part:
with open("toto.txt", 'rb') as f:
try:
u = pickle.Unpickler(f)
storage_obj = u.load()
while storage_obj:
process_my_obj(storage_obj)
storage_obj = u.load()
except EOFError:
pass
When I run both versions, the first one has high memory consumption (because the temp file remains open during the whole treatment, I guess), and in the end, with a given set of inputs, the application finds 622 elements in the unpickled data.
In the second case, memory consumption is far lower, but in the end, with the same inputs, the application finds only 440 elements in the unpickled data, and sometimes crashes with random errors during the Unpickler.load() call (for example AttributeError, but it is not always reproducible and not always the same error).
With an even bigger set of inputs, the first version often crashes with a memory error, so I'd like to use the second version, but it doesn't seem to save all my objects correctly.
Does anyone have an idea why the two behave differently?
Maybe opening / dumping / closing / reopening / dumping a file in my loop doesn't guarantee the content that is dumped?
EDIT 1:
All the pickling is done in a multiprocessing context, with 10 processes each writing to their own temp file, and the unpickling is done by the main process, reading each temp file created.
EDIT 2:
I can't provide a fully reproducible example (company code), but the treatment consists of parsing C files (the process_file method, based on the pycparser module) and generating an object representing the C file's content (fields, functions, etc.) -> my_obj. my_obj is then stored in an object (StorageObj) that has a dict as an attribute, containing the my_obj object with the file it was extracted from as the key.
Thanks in advance if anyone finds the reason behind this, or can suggest a way to work around it :)
This has nothing to do with the file. It is that you are using a single shared Pickler, which retains its memo table.
The example that does not have the issue creates a new Pickler with a fresh memo table and lets the old one be collected, effectively clearing the memo table.
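If you prefer to keep the single open file from the first version, one option (a sketch reusing the question's names, not the only possible fix) is to keep one Pickler but reset its memo table after each dump with clear_memo():

with open(tmp_file_path, 'ab') as f:
    p = pickle.Pickler(f)
    for filepath in self.file_list:
        my_obj = process_file(filepath)
        storage_obj = StorageObj()
        storage_obj.add(os.path.basename(filepath), my_obj)
        p.dump(storage_obj)
        p.clear_memo()  # drop memoized references so memory does not keep growing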
But that doesn't explain why, when I create multiple Picklers, I retrieve less data in the end than with only one.
Now that is because you have written multiple pickles to the same file, and the method where you read them only reads the first one, since closing and reopening the file resets the file offset. When reading multiple objects from one open file, each call to load advances the file offset to the start of the next object.
If you are given a list of documents with strings in them, how do you go about searching the documents and returning the list of documents that contain the string you were searching for?
How would I go about implementing a program in Python or C for this problem statement? I've considered grep, but I'm not sure how implementing that inside of a native Python/C application would work.
Thought process at the moment is simply to parse through documents in a loop, then parse through all strings, etc., but it seems a little inefficient.
Any help appreciated.
The simple solution is just as you stated: loop through the files and search through each one.
Naive approach
for file in files:
    for line in file:
        if pattern in line:
            print(file.name)
If you wanted to be a little better, you could immediately bail out of the file as soon as you found a match.
Slightly better
for file in files:
    for line in file:
        if pattern in line:
            print(file.name)
            break  # found what we were looking for; continue to the next file
At this point you could attempt to distribute the problem across multiple threads (a runnable sketch follows the pseudocode below). You will probably be I/O bound and may even see worse performance, because multiple threads are trying to read different parts of the disk at the same time.
Threaded approach
for file in files:
    # create a new worker thread which does...
    for line in file:
        if pattern in line:
            # insert the filename into a shared data structure
            break  # found what we were looking for; continue to the next file
# wait for all threads to finish, then collect and display the data
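Here is a runnable sketch of that threaded idea using concurrent.futures; the helper names and the plain-substring match are assumptions for illustration, not anything from the question:

from concurrent.futures import ThreadPoolExecutor

def file_contains(path, pattern):
    # Return the path on the first matching line, else None.
    with open(path, errors="ignore") as f:
        for line in f:
            if pattern in line:
                return path
    return None

def search_files(paths, pattern):
    with ThreadPoolExecutor() as pool:
        results = pool.map(lambda p: file_contains(p, pattern), paths)
    return [p for p in results if p is not None]

# Usage: print(search_files(["a.txt", "b.txt"], "needle"))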
But if you are concerned about performance, you should either use grep or copy how it works. It saves time by reading the files as raw binary (rather than breaking them up line by line) and makes use of a string-searching algorithm called the Boyer–Moore algorithm. Refer to this other SO question about how grep runs fast.
Probably What You Want™ approach
grep -l pattern files
I'm an extreme noob to Python, so if there's a better way to do what I'm asking, please let me know.
I have one file, which works with Flask to create markers on a map. It has an array which stores these markers. I'm starting the file through the command prompt and opening said file multiple times. Basically, how would one open a file multiple times and have the instances share a variable? (Not the same as having a subfile that shares variables with a superfile.) I'm okay with creating another file that starts the instances if needed, but I'm not sure how I'd do that.
Here is an example of what I'd like to accomplish. I have a file called, let's
say, test.py:
global number
number += 1
print(number)
I'd like it so that when I start this through command prompt (python test.py) multiple times, it'd print the following:
1
2
3
4
5
The only difference between the above and what I have is that what I have will be non-terminating and continuously running.
What you seem to be looking for is some form of inter-process communication. In terms of Python, each process has its own memory space and its own variables, meaning that if I ran:
number += 1
print(number)
multiple times within the same process, I would get 1, 2, ..., 5 on new lines. But no matter how many times I start the script, number is only a global within that one process; it is not shared between separate runs.
There are a few ways where you can keep consistency.
Writing To A File (named pipe)
One of your scripts (generator.py) can have:
import os
from time import sleep

num = 1

try:
    os.mkfifo("temp.txt")
except FileExistsError:
    pass  # in case one of your other scripts already created it

while True:
    file = open("temp.txt", "w")
    file.write(str(num))
    file.close()  # Important: if you don't close the file, the operating system
                  # will lock it and your other scripts won't have access
    sleep(1)  # seconds between writes
In your other scripts (consumer.py):
from time import sleep

while True:
    file = open("temp.txt", "r")
    number = int(file.read())
    file.close()
    print(number)
    sleep(1)  # seconds between reads
You would start one generator and as many consumers as you want. Note: this does have a race condition that can't really be avoided. When you write to the file, you should use a serializer like pickle or json to properly encode and decode your array object (see the short example below).
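For instance, a minimal json sketch (the marker list here is made up purely for illustration):

import json

markers = [{"lat": "1.2345", "lon": "2.3455"}]
encoded = json.dumps(markers)   # what the generator would write
decoded = json.loads(encoded)   # what a consumer would do after reading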
Other Ways
You can also look up how to use pipes (both named and unnamed), databases, AMQP (IMHO the best way to do it, but there is a learning curve and added dependencies), and if you are feeling bold, use mmap.
Design Change
If you are willing to consider a design change: since you are making a Flask application that holds the array in memory, why don't you just make an endpoint that serves up your array, and have the other instances check that endpoint every so often?
import json  # or pickle
from flask import Flask

app = Flask(__name__)

array = [objects]
converted = method_to_convert_to_array_of_dicts(array)

@app.route("/array")
def hello():
    return json.dumps(converted)
You will need to convert the objects, but then the web server can be hosted and your clients would just need something like:
import json
import time

import requests

while True:
    result = requests.get('http://localhost:5000/array')  # Flask's default port
    array = json.loads(result.text)
    time.sleep(5)  # poll interval; adjust as needed
Your description is kind of confusing, but if I understand you correctly, one way of doing this would be to keep the value of the variable in a separate file.
When a script needs the value, read the value from the file and add one to it. If the file doesn't exist, use a default value of 1. Finally, rewrite the file with the new value.
However, you said that this value would be shared among two Python scripts, so you'd have to be careful that both scripts don't try to access the file at the same time.
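A minimal sketch of that idea (the file name is hypothetical, and there is deliberately no locking, so the caveat about simultaneous access still applies):

import os

COUNTER_FILE = "counter.txt"  # hypothetical name

def next_number():
    if os.path.exists(COUNTER_FILE):
        with open(COUNTER_FILE) as f:
            value = int(f.read()) + 1
    else:
        value = 1  # default when the file doesn't exist yet
    with open(COUNTER_FILE, "w") as f:
        f.write(str(value))
    return value

print(next_number())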
I think you could use pickle.dump(your_array, file) to serialize the data (your array) into a file. Then, the next time the script runs, you could load the data back with pickle.load(file).
I'm writing a script that gets the most recently modified file from a unix directory.
I'm certain it works, but I have to create a unittest to prove it.
The problem is the setUp function. I want to be able to predict the order the files are created in.
self.filenames = ["test1.txt", "test2.txt", "test3.txt", "filename.txt", "test4"]
newest = ''
for fn in self.filenames:
    if pattern.match(fn): newest = fn
    with open(fn, "w") as f: f.write("some text")
The pattern is "test.*.txt" so it just matches the first three in the list. In multiple tests, newest sometimes returns 'test3.txt' and sometimes 'test1.txt'.
Use os.utime to explicitly set modified time on the files that you have created. That way your test will run faster.
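A hypothetical sketch of that idea, reusing the setUp names from the question and assigning each file a strictly increasing modification time:

import os
import time

base = time.time()
for offset, fn in enumerate(self.filenames):
    with open(fn, "w") as f:
        f.write("some text")
    os.utime(fn, (base + offset, base + offset))  # (atime, mtime)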
I doubt that the filesystem you are using supports fractional seconds on file create time.
I suggest you insert a call to time.sleep(1) in your loop so that the filesystem actually has a different timestamp on each created file.
It could be due to syncing. Just because you call write() on files in a certain order, it doesn't mean the data will be updated by the OS in that order.
Try calling f.flush() followed by os.fsync(f.fileno()) on your file object before moving on to the next file. Leaving some time between calls (using sleep()) might also help.
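For example, a hypothetical adaptation of the setUp loop from the question, combining the flush/fsync call with a short sleep:

import os
import time

for fn in self.filenames:
    if pattern.match(fn): newest = fn
    with open(fn, "w") as f:
        f.write("some text")
        f.flush()
        os.fsync(f.fileno())  # ask the OS to commit this file before the next one
    time.sleep(0.01)  # optional small gap between files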