I have this class that consists of 3 functions. Each function is in charge of one part of the whole process.
.load() loads up two files, re-formats their content and writes them to two new files.
.compare() takes two files and prints out their differences in a specific format.
.final() takes the result of .compare() and creates a file for every set of values.
Please ignore the Frankenstein nature of the logic, as it is not my main concern at the moment. I know it can be written a thousand times better and that's fine by me for now, as I am still new to Python and programming in general. I do have some theoretical experience but very limited technical practice, and that is something I am working on.
Here is the code:
from collections import defaultdict
from operator import itemgetter
from itertools import groupby
from collections import deque
import os
class avs_auto:
    def load(self, fileIn1, fileIn2, fileOut1, fileOut2):
        with open(fileIn1+'.txt') as fin1, open(fileIn2+'.txt') as fin2:
            frame_rects = defaultdict(list)
            for row in (map(str, line.split()) for line in fin1):
                id, frame, rect = row[0], row[2], [row[3], row[4], row[5], row[6]]
                frame_rects[frame].append(id)
                frame_rects[frame].append(rect)
            for row in (map(str, line.split()) for line in fin2):
                id, frame, rect = row[0], row[2], [row[3], row[4], row[5], row[6]]
                frame_rects[frame].append(id)
                frame_rects[frame].append(rect)
        with open(fileOut1+'.txt', 'w') as fout1, open(fileOut2+'.txt', 'w') as fout2:
            for frame, rects in sorted(frame_rects.iteritems()):
                fout1.write('{{{}:{}}}\n'.format(frame, rects))
                fout2.write('{{{}:{}}}\n'.format(frame, rects))

    def compare(self, f1, f2):
        with open(f1+'.txt', 'r') as fin1:
            with open(f2+'.txt', 'r') as fin2:
                lines1 = fin1.readlines()
                lines2 = fin2.readlines()
        diff_lines = [l.strip() for l in lines1 if l not in lines2]
        diffs = defaultdict(list)
        with open(f1+'x'+f2+'Result.txt', 'w') as fout:
            for line in diff_lines:
                d = eval(line)
                for k in d:
                    list_ids = d[k]
                    for i in range(0, len(d[k]), 2):
                        diffs[d[k][i]].append(k)
            for id_ in diffs:
                diffs[id_].sort()
                for k, g in groupby(enumerate(diffs[id_]), lambda (i, x): i - x):
                    group = map(itemgetter(1), g)
                    fout.write('{0} {1} {2}\n'.format(id_, group[0], group[-1]))

    def final(self):
        with open('hw1load3xhw1load2Result.txt', 'r') as fin:
            lines = (line.split() for line in fin)
            for k, g in groupby(lines, itemgetter(0)):
                fst = next(g)
                lst = next(iter(deque(g, 1)), fst)
                with open('final/{}.avs'.format(k), 'w') as fout:
                    fout.write('video0=ImageSource("MovieName\original\%06d.jpeg", {}, {}, 15)\n'.format(fst[1], lst[2]))
Now to my question: how do I make it so each of the functions passes its output files as values to the next function and calls it?
So, for example:
running .load() should output two files and call the .compare() function, passing it those two files.
Then, when .compare() is done, it should pass .final() the output file and call it.
So .final() will open whatever file is passed to it from .compare(), and not the hard-coded file name as it is defined above.
I hope this all makes sense. Let me know if you need clarification. Any criticism is welcome concerning the code itself. Thanks in advance.
There are a couple of ways to do this, but I would write a master function that calls the other three in sequence. Something like:
def load_and_compare(self, input_file1, input_file2, output_file1, output_file2, result_file):
    self.load(input_file1, input_file2, output_file1, output_file2)
    self.compare(output_file1, output_file2)
    self.final(result_file)
Looking over your code, I think you have a problem in load. You only declare a single dictionary, then load the contents of both files into it and write those same contents out to two files. Because each file has the same content, compare won't do anything meaningful.
Also, do you really want to write out the file contents and then re-read it into memory? I would keep the frame definitions in memory for use in compare after loading rather than reading them back in.
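A rough sketch of that idea (purely illustrative; it reuses the defaultdict you already import, keeps one dict per input file, and hands both back instead of writing them out):
def load(self, fileIn1, fileIn2):
    # One dict per input file, so compare() has something meaningful to diff
    frames1, frames2 = defaultdict(list), defaultdict(list)
    for path, frames in ((fileIn1, frames1), (fileIn2, frames2)):
        with open(path + '.txt') as fin:
            for line in fin:
                row = line.split()
                frames[row[2]].append(row[0])    # id
                frames[row[2]].append(row[3:7])  # rect
    return frames1, frames2
compare() could then take the two dictionaries directly instead of re-reading and eval()-ing the intermediate files.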
I don't really see a reason for this to be a class at all rather than just a trio of functions, but maybe if you have to read multiple files with mildly varying formats you could get some benefit of using class attributes to define the format while inheriting the general logic.
Do you mean call with the name of the two files? Well you defined a class, so you can just do:
def load(self, fileIn1, fileIn2, fileOut1, fileOut2):
    ...  # do stuff here
    # when done
    self.compare(fileOut1, fileOut2)
And so on.
I might be totally off here, but why don't you do it exactly as you're saying?
Just call self.compare() out of your load() method.
You can also add return statements to load() and return a tuple with the files.
Then add a 4th method to your class, which then collects the returned files and pipes them to the compare() method.
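Roughly, that could look like this (bodies elided; the method names other than load() and compare() are just illustrative):
def load(self, fileIn1, fileIn2, fileOut1, fileOut2):
    # ... existing load logic, unchanged ...
    return fileOut1, fileOut2               # hand the output file names back to the caller

def run(self, fileIn1, fileIn2, fileOut1, fileOut2):
    out1, out2 = self.load(fileIn1, fileIn2, fileOut1, fileOut2)
    result_file = self.compare(out1, out2)  # assumes compare() is changed to return its result file name
    self.final(result_file)                 # and final() is changed to accept the file it should read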
Best Regards!
One of the more powerful aspects of Python is that you can return something called a tuple. To answer this in a more generic Python sense consider this code:
>>> def load(file1, file2):
...     return file1+'.txt', file2+'.txt'
>>> def convert(file1, file2):
...     return 'converted_'+file1, 'converted_'+file2
>>> convert(*load("Java", "C#"))
('converted_Java.txt', 'converted_C#.txt')
Each function takes two named arguments, but the returned tuple of the first can be "unpacked" into the input arguments of the second by adding a * in front of it.
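Applied to your class, and assuming load() and compare() are changed to return the file names they produce (and final() to accept one), the whole pipeline chains the same way:
auto = avs_auto()
result_file = auto.compare(*auto.load('in1', 'in2', 'out1', 'out2'))  # 'in1' etc. are just placeholder names
auto.final(result_file)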
Related
I am trying to read lists from a file and assign each of them to a separate value in a dictionary. The text file would look something like this:
[12, 13, 14]
[87, 45, 32]
...
and then the dictionary would look something like this:
{"score_set0": [12, 13, 14], "score_set1": [87, 45, 32]...}
This is the code I have so far, but it just returns an empty dictionary:
def readScoresFile(fileAddr):
    dic = {}
    i = 0
    with open(fileAddr, "r") as f:
        x = len(f.readlines())
        for line in f:
            dic["score_set{}".format(x[i])] = line
            i += 1
    return dic
I am only programming at GCSE level (UK OCR syllabus, if that helps) in year 10, so thanks for any help anyone can give.
Also, I am trying to do this without the pickle module.
x = len(f.readlines()) consumed your whole file, so your subsequent loop over f is iterating an exhausted file handle, sees no remaining lines, and exits immediately.
There's zero need to pre-check the length here (and the only use you make of x is trying to index it, which makes no sense; you avoided a TypeError solely because the loop never ran), so just omit that and use enumerate to get the numbers as you go:
def readScoresFile(fileAddr):
    dic = {}
    with open(fileAddr, "r") as f:
        for i, line in enumerate(f):  # Let enumerate manage the numbering for you
            dic["score_set{}".format(i)] = line  # If you're on 3.6+, dic[f'score_set{i}'] = line is nicer
    return dic
Note that this does not actually convert the input lines to lists of int (neither did your original code). If you want to do that, you can change:
dic[f'score_set{i}'] = line
to:
dic[f'score_set{i}'] = ast.literal_eval(line) # Add import ast to top of file
to interpret the line as a Python literal, or:
dic[f'score_set{i}'] = json.loads(line) # Add import json to top of file
to interpret each line as JSON (faster, but supports fewer Python types, and some legal Python literals are not legal JSON).
As a rule, you basically never want to use .readlines(); simply iterating over the file handle will get you the lines live and avoid a memory requirement proportionate to the size of the file. (Frankly, I'd have preferred if they'd gotten rid of it in Py3, since list(f) gets the same result if you really need it, and it doesn't create a visible method that encourages you to do "The Wrong Thing" so often).
By operating line-by-line, you eventually store all the data, but that's better than doubling the overhead by storing both the parsed data in the dict and all the string data it came from in the list.
If you're trying to turn the lines into actual Python lists, I suggest using the json module. (Another option would be ast.literal_eval, since the syntax happens to be the same in this case.)
import json

def read_scores_file(file_path):
    with open(file_path) as f:
        return {
            f"score_set{i}": json.loads(line)
            for i, line in enumerate(f)
        }
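With the example file from the question saved as, say, scores.txt (the name is arbitrary), this gives:
>>> read_scores_file('scores.txt')
{'score_set0': [12, 13, 14], 'score_set1': [87, 45, 32]}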
I have data stored in .h5. I use the following code to display group names and also call one of the groups (Event_[0]) to see what's inside:
with h5py.File(data_path, 'r') as f:
    ls = list(f.keys())
    print('List of datasets: \n', ls)
    data = f.get('group_1')
    dataset1 = np.array(data)
    print('Shape of dataset1: \n', dataset1.shape)
    f.close()
It works fine but I have like 2000 groups with one dataset each. How can I avoid writing the same code for every single group? Is there maybe a way to get('all groups')?
EDIT: one more example: I use
f['Event_[0]'][()]
to see one group. Can this be also applied for multiple groups?
Just iterate on the list of keys:
with h5py.File(data_path, 'r') as f:
    alist = []
    ls = list(f.keys())
    print('List of datasets: \n', ls)
    for key in ls:
        group = f.get(key)
        dataset = group.get(datasetname)[:]  # datasetname: the name of the dataset inside each group
        print('Shape of dataset: \n', dataset.shape)
        alist.append(dataset)
    # don't need f.close() in a with
There isn't an "all groups" method; there are iter and visit methods, but they end up doing the same thing - for each group in the file, fetch the desired dataset. The h5py docs should be complete; there are no hidden methods. The visit is recursive, similar to Python's os functionality for visiting directories and files.
In h5py the file and groups behave like Python dicts. It's the dataset that behaves like a numpy array.
If you know you will always have this data schema, you can work with the keys (as shown in the previous answer). That implies only Groups at the root level, and Datasets are the only objects under each Group. The "visitor" functions are very handy when you don't know the exact contents of the file.
There are 2 visitor functions: visit() and visititems(). Each recursively traverses the object tree, calling the visitor function for each object. The only difference is that the callable for visit() receives 1 value, name, while the callable for visititems() receives 2 values, name and node (an h5py object). The name is just that, an object's name, NOT its full pathname. I prefer visititems() for 2 reasons: 1) having the node object lets you test the object type (as shown below), and 2) determining the full pathname otherwise requires that you already know the path, or that you use the object's name attribute to get it.
The example below creates a simple HDF5 file, creates a few groups and datasets, then closes the file. It then reopens in read mode and uses visititems() to traverse the file object tree. (Note: the visitor functions can have any name and can be used with any object. It traverses recursively from that point in the file structure.)
Also, you don't need f.close() when you use the with / as: construct.
import h5py
import numpy as np

def visit_func(name, node):
    print('Full object pathname is:', node.name)
    if isinstance(node, h5py.Group):
        print('Object:', name, 'is a Group\n')
    elif isinstance(node, h5py.Dataset):
        print('Object:', name, 'is a Dataset\n')
    else:
        print('Object:', name, 'is an unknown type\n')

arr = np.arange(100).reshape(10, 10)

with h5py.File('SO_63315196.h5', 'w') as h5w:
    for cnt in range(3):
        grp = h5w.create_group('group_'+str(cnt))
        grp.create_dataset('data_'+str(cnt), data=arr)

with h5py.File('SO_63315196.h5', 'r') as h5r:
    h5r.visititems(visit_func)
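For the file created above, the traversal should print something along these lines (note that the name handed to the visitor has no leading slash, while node.name is the full path):
Full object pathname is: /group_0
Object: group_0 is a Group

Full object pathname is: /group_0/data_0
Object: group_0/data_0 is a Dataset

...and likewise for group_1/data_1 and group_2/data_2.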
I have two CSV files, one of which is likely to contain a few more records than the other. I am writing a function to iterate over each and determine which records are in dump but not in libr.
My code is as follows:
import csv

def update_lib(x, y):
    dump = open(x, newline='')
    libr = open(y, newline='')
    dump_reader = csv.reader(dump)
    for dump_row in dump_reader:
        libr_reader = csv.reader(libr)
        for libr_row in libr_reader:
            if dump_row[0] == libr_row[0]:
                break
I am expecting this to take the first row in dump (dump_row) and iterate over each row in library (libr_row) to see if the first elements match. If they do then I want to move to the next row in dump and if not I will do something else eventually.
My issue is that libr_reader appears to "remember" where it is and I can't get it to go back to the first row in libr, even when the break has been reached and I would therefore expect libr_reader to be re-initiated. I have even tried del libr_row and del libr_reader but this doesn't appear to make a difference. I suspect I am misunderstanding iterators, any help gratefully received.
As it's pasted in your question, you'll be creating a libr_reader object every time you iterate over a row in dump_reader.
dump_reader = csv.reader(dump)
for dump_row in dump_reader:
    libr_reader = csv.reader(libr)
dump_reader here is created once. Assuming there are 10 rows from dump_reader, you will be creating 10 libr_reader instances, all from the same file handle.
Per our discussion in the comments, you're aware of that, but what you're unaware of is that the reader object is working on the same file handle and thus, is still at the same cursor.
Consider this example:
>>> import io
>>> my_file = io.StringIO("""Line 1
... Another Line
... Finally, a third line.""")
This is creating a simulated file object. Now I'll create a "LineReader" class.
>>> class LineReader:
...     def __init__(self, file):
...         self.file = file
...     def show_me_a_line(self):
...         print(self.file.readline())
...
If I use three line readers on the same file, the file still remembers its place:
>>> line_reader = LineReader(my_file)
>>> line_reader.show_me_a_line()
Line 1
>>> second_line_reader = LineReader(my_file)
>>> second_line_reader.show_me_a_line()
Another Line
>>> third_line_reader = LineReader(my_file)
>>> third_line_reader.show_me_a_line()
Finally, a third line.
To the my_file object, there's no material difference between what I just did, and doing this directly. First, I'll "reset" the file to the beginning by calling seek(0):
>>> my_file.seek(0)
0
>>> my_file.readline()
'Line 1\n'
>>> my_file.readline()
'Another Line\n'
>>> my_file.readline()
'Finally, a third line.'
There you have it.
So, TL;DR: files have cursors and remember where they are. Think of the file handle as a thing that remembers not only which file it points to, but also where in that file your program currently is.
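As for making update_lib do what you intended, one option (a sketch, assuming you only need the first column for matching) is to read the first column of libr into a set once, then check each dump row against it:
import csv

def update_lib(x, y):
    # Read the first column of libr once, into a set, so we never rescan the file
    with open(y, newline='') as libr:
        libr_keys = {row[0] for row in csv.reader(libr) if row}
    with open(x, newline='') as dump:
        for dump_row in csv.reader(dump):
            if dump_row and dump_row[0] not in libr_keys:
                ...  # this record is in dump but not in libr; handle it here
Alternatively you could call libr.seek(0) before re-creating the reader each time, but that rescans the whole file for every dump row.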
I want to create a dictionary with a list of values for multiple keys with a single for loop in Python3. For me, the execution time and memory footprint are of utmost importance, since the file which my Python3 script is reading is rather long.
I have already tried the following simple script:
p_avg = []
p_y = []
m_avg = []
m_y = []
res_dict = {}

with open('/home/user/test', 'r') as f:
    for line in f:
        p_avg.append(float(line.split(" ")[5].split(":")[1]))
        p_y.append(float(line.split(" ")[6].split(":")[1]))
        m_avg.append(float(line.split(" ")[1].split(":")[1]))
        m_y.append(float(line.split(" ")[2].split(":")[1]))

res_dict['p_avg'] = p_avg
res_dict['p_y'] = p_y
res_dict['m_avg'] = m_avg
res_dict['m_y'] = m_y

print(res_dict)
The format of my home/user/test file is:
n:1 m_avg:7588.39 m_y:11289.73 m_u:147.92 m_v:223.53 p_avg:9.33 p_y:7.60 p_u:26.43 p_v:24.64
n:2 m_avg:7587.60 m_y:11288.54 m_u:147.92 m_v:223.53 p_avg:9.33 p_y:7.60 p_u:26.43 p_v:24.64
n:3 m_avg:7598.56 m_y:11304.50 m_u:148.01 m_v:225.33 p_avg:9.32 p_y:7.60 p_u:26.43 p_v:24.60
.
.
.
The Python script shown above works, but first, it is too long and repetitive, and second, I am not sure how efficient it is. I was thinking of doing the same with list comprehensions, something like this:
(res_dict['p_avg'], res_dict['p_y']) = [(float(line.split(" ")[5].split(":")[1]), float(line.split(" ")[6].split(":")[1])) for line in f]
But for all four dictionary keys. Do you think that using list comprehensions could reduce the memory footprint and the execution time of the script? What would be the right syntax for such a list comprehension?
[EDIT] I have changed dict -> res_dict, as it was mentioned that shadowing the built-in name is not good practice. I have also fixed a typo where p_y wasn't pointing to the right value, and added a print statement to print the resulting dictionary, as mentioned by the other users.
You can make use of defaultdict. There is no need to split the line each time, and to make it more readable you can use a lambda to extract the fields for each item.
from collections import defaultdict

res = defaultdict(list)

with open('/home/user/test', 'r') as f:
    for line in f:
        items = line.split()
        extract = lambda x: x.split(':')[1]
        res['p_avg'].append(extract(items[5]))
        res['p_y'].append(extract(items[6]))
        res['m_avg'].append(extract(items[1]))
        res['m_y'].append(extract(items[2]))
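One thing to note: unlike the original script, this keeps the extracted values as strings; if you need numbers, wrap the extracted field in float(), e.g. res['p_avg'].append(float(extract(items[5]))).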
You can initialize your dict to contain the string/list pairs, and then append directly as you iterate through every line. Also, you don't want to keep calling split() on line on each iteration. Rather, just call once and save to a local variable and index from this variable.
# Initialize dict to contain string key and list value pairs
dictionary = {'p_avg': [],
              'p_y': [],
              'm_avg': [],
              'm_y': []
              }

with open('/home/user/test', 'r') as f:
    for line in f:
        items = line.split()  # store line.split() so you don't split multiple times per line
        dictionary['p_avg'].append(float(items[5].split(':')[1]))
        dictionary['p_y'].append(float(items[6].split(':')[1]))  # I think you meant index 6 here
        dictionary['m_avg'].append(float(items[1].split(':')[1]))
        dictionary['m_y'].append(float(items[2].split(':')[1]))
You can just pre-define dict attributes:
d = {
    'p_avg': [],
    'p_y': [],
    'm_avg': [],
    'm_y': []
}
and then append directly to them:
with open('/home/user/test', 'r') as f:
    for line in f:
        splitted_line = line.split(" ")
        d['p_avg'].append(float(splitted_line[5].split(":")[1]))
        d['p_y'].append(float(splitted_line[6].split(":")[1]))
        d['m_avg'].append(float(splitted_line[1].split(":")[1]))
        d['m_y'].append(float(splitted_line[2].split(":")[1]))
P.S. Never use variable names that shadow built-ins, like dict, list, etc. It can cause many confusing errors!
I have this code where I am reading a file in ipython using pyspark. What I am trying to do is add a piece to it which forms a list based on a particular column read from the file, but when I try to execute it, the list comes out empty and nothing gets appended to it. My code is:
list1 = []

def file_read(line):
    list1.append(line[10])
    # bunch of other code which process other column indexes on `line`

inputData = sc.textFile(fileName).zipWithIndex().filter(lambda (line, rownum): rownum > 0).map(lambda (line, rownum): line)

column_val = (inputData
    .map(lambda line: line.split(","))
    .filter(lambda line: len(line) > 1)
    .map(file_read))
When I execute this part of the code, list1 still comes out empty even though there's data in line[10], which I use in other parts of the same function above. It seems as if it is just not appending to the list. How can I form the list above?
Well, it actually does append to list1; the problem is just not the one you're thinking about. Every variable referenced in a closure is serialized and sent to the workers, and that applies to list1 as well.
Every partition receives its own copy of list1; when file_read is called, data is appended to that copy, and when a given map phase is finished it goes out of scope and is discarded.
It's not a particularly elegant piece of code, but you should see that this is really what is happening here:
rdd = sc.parallelize(range(100), 5)
list1 = []

def file_read(line):
    list1.append(line)
    print len(list1)
    return line

xs = rdd.map(file_read).collect()
Edit
Spark provides two types of shared variables: broadcast variables, which are read-only from the worker perspective, and accumulators, which are write-only from the worker perspective.
By default, accumulators support only numeric values and are intended to be used mostly as counters. It is possible to define custom accumulators, though. To do that you have to extend the AccumulatorParam class and provide custom zero and addInPlace implementations:
from pyspark import AccumulatorParam

class ListParam(AccumulatorParam):
    def zero(self, v):
        return []

    def addInPlace(self, acc1, acc2):
        acc1.extend(acc2)
        return acc1
Next you can redefine file_read as follows:
def file_read1(line):
    global list1  # Required, otherwise the next line will fail
    list1 += [line]
    return line
Example usage:
list1 = sc.accumulator([], ListParam())
rdd = sc.parallelize(range(10)).map(file_read1).collect()
list1.value
Even if it is possible to use an accumulator like this, it is probably too expensive to be used in practice, and in the worst-case scenario it can crash the driver. Instead you can simply use another transformation:
tmp = (inputData
    .map(lambda line: line.split(","))
    .filter(lambda line: len(line) > 1))

def line_read2(line): return ...  # Just the core logic

line1 = tmp.map(lambda line: line[10])
column_val = tmp.map(line_read2)
Side note:
The code you've provided doesn't actually do anything yet. Transformations in Spark are just descriptions of what has to be done; until you call an action, nothing is really executed.
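A tiny illustration of that point (using the same inputData as above):
tmp = inputData.map(lambda line: line.split(","))  # just a description of work, nothing has run yet
ids = tmp.map(lambda fields: fields[10])           # still nothing has run
first_ids = ids.take(5)                            # take() is an action, so only now does Spark read and process the file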