I have two files, prefix.txt and terms.txt, each with about 100 lines. I'd like to write out a third file with the Cartesian product
http://en.wikipedia.org/wiki/Join_(SQL)#Cross_join
of the two - about 10,000 lines.
What is the best way to approach this in Python?
Secondly, is there a way to write the 10,000 lines to the third file in a random order?
You need itertools.product.
import itertools

for prefix, term in itertools.product(open('prefix.txt'), open('terms.txt')):
    print(prefix.strip() + term.strip())
Print them, or accumulate them, or write them directly. You need the .strip() because of the newline that comes with each of them.
Afterwards, you can shuffle them using random.shuffle(list(open('thirdfile.txt'))), but I don't know how fast that will be on a file of the sizes you are using.
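For example, a rough sketch of the accumulate-shuffle-write idea (using thirdfile.txt as the output name, as in the answer):

import itertools
import random

# Accumulate the combined lines instead of printing them.
combined = [p.strip() + t.strip()
            for p, t in itertools.product(open('prefix.txt'), open('terms.txt'))]

# ~10,000 short strings shuffle essentially instantly.
random.shuffle(combined)

with open('thirdfile.txt', 'w') as out:
    out.write('\n'.join(combined) + '\n')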
A Cartesian product enumerates all combinations. The easiest way to enumerate all combinations is to use nested loops.
You cannot write files in a random order very easily. To write to a "random" position, you must use file.seek(). How will you know what position to seek to? How will you know how long each part (prefix+term) will be?
You can, however, read entire files into memory (100 lines is nothing) and process the in-memory collections in a "random" order. This ensures that the output is randomized.
from random import shuffle
a = list(open('prefix.txt'))
b = list(open('terms.txt'))
c = [x.strip() + y.strip() for x in a for y in b]
shuffle(c)
open('result.txt', 'w').write('\n'.join(c))
Certainly not the best way in terms of speed and memory, but 10,000 lines is not big enough to sacrifice brevity. You should normally close your file objects, and you could loop through at least one of the files without storing its content in RAM. The .strip() removes the trailing newline from each element of a and b.
Edit: using s.strip() instead of s[:-1] to get rid of the newlines---it's more portable.
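A minimal sketch of that tidier variant (closing the files with with, and streaming terms.txt instead of holding both files in RAM; result.txt is kept as the output name from above):

from random import shuffle

with open('prefix.txt') as pf:
    prefixes = [p.strip() for p in pf]

results = []
with open('terms.txt') as tf:
    # Stream terms.txt one line at a time; only the prefixes stay in memory.
    for term in tf:
        term = term.strip()
        results.extend(prefix + term for prefix in prefixes)

shuffle(results)
with open('result.txt', 'w') as out:
    out.write('\n'.join(results) + '\n')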
As the title says, I want to find and delete duplicate lines in a file. That is pretty easy to do... the catch is that I want to know the fastest and most efficient way to do it (say you have gigabytes' worth of files and you want to do this as efficiently and as quickly as possible).
If you know some method, however complicated, that can do that, I would like to know it. I have heard of things like loop unrolling and have started to second-guess whether the simplest things really are the fastest, so I am curious.
The best solution is to keep a set of the lines seen so far, and yield only the ones not already in it. (This is essentially the unique_everseen recipe from Python's itertools documentation.)
def unique_lines(filename):
    seen = set()
    with open(filename) as f:
        for line in f:
            if line not in seen:
                yield line
                seen.add(line)
and then
for unique_line in unique_lines(filename):
    # do stuff
Of course, if you don't care about the order, you can convert the whole text to a set directly, like
set(open(filename).readlines())
Use Python's hashlib to hash every line in the file... then, to check whether a line is a duplicate, look its hash up in a set.
Lines can be kept directly in a set; however, hashing them first reduces the space required.
https://docs.python.org/3/library/hashlib.html
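A rough sketch of that idea, assuming it is acceptable to write the de-duplicated lines to a second file (the function name and paths are placeholders):

import hashlib

def write_unique_lines(src_path, dst_path):
    """Copy src_path to dst_path, skipping lines whose hash has already been seen."""
    seen = set()
    with open(src_path, 'rb') as src, open(dst_path, 'wb') as dst:
        for line in src:
            digest = hashlib.sha1(line).digest()  # 20 bytes per distinct line instead of the full line
            if digest not in seen:
                seen.add(digest)
                dst.write(line)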
I have a dictionary I created by reading in a whole lot of image files. It looks like this:
files = { 'file1.png': [data...], 'file2.png': [data...], ... 'file1000': [data...]}
I am trying to process these images to see how similar each of them are to each other. The thing is, with 1000s of files worth of data this is taking forever. I'm sure I have 20 different places I could optimize but I am trying to work through it one piece at a time to see how I can better optimize it.
My original method tested file1 against all of the rest of the files. Then I tested file2 against all of the files. But I still tested it against file1. So, by the time I get to file1000 in the above example I shouldn't even need to test anything at that point since it has already been tested 999 times.
This is what I tried:
answers = {}
for x in files:
    for y in files:
        if y not in answers or x not in answers[y]:
            if compare(files[x], files[y]) < 0.01:
                answers.setdefault(x, []).append(y)
This doesn't work, as I am getting the wrong output now. The compare function is just this:
def compare(h1, h2):  # math, functools and operator are imported elsewhere
    rms = math.sqrt(functools.reduce(operator.add, map(lambda a, b: (a - b)**2, h1[0], h2[0])) / len(h1[0]))
    return rms
I just didn't want to put that huge equation into the if statement.
Does anyone have a good method for comparing each of the data segments of the files dictionary without overlapping the comparisons?
Edit:
After trying ShadowRanger's answer I have realized that I may not have fully understood what I needed. My original answers dictionary looked like this:
{ 'file1.png': ['file1.png', 'file23.png', 'file333.png'],
'file2.png': ['file2.png'],
'file3.png': ['file3.png', 'file4.png', 'file5.png'],
'file4.png': ['file3.png', 'file4.png', 'file5.png'],
...}
And for now I am storing my results in a file like this:
file1.png file23.png file33.png
file2.png
file3.png file4.png file5.png
file6.png
...
I thought that by using combinations and only testing individual files once I would save a lot of time retesting files and not have to waste time getting rid of duplicate answers. But as far as I can tell, the combinations have actually reduced my ability to find matches and I'm not sure why.
You can avoid redundant comparisons with itertools.combinations to get order-insensitive unique pairs. Just import itertools and replace your doubly nested loop:
for x in files:
    for y in files:
with a single loop that gets the combinations:
for x, y in itertools.combinations(files, 2):
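For illustration, a minimal sketch of the rewritten loop (files and compare are as in the question; the second setdefault call is an assumption, added only because the edit above shows each match listed under both file names):

import itertools

answers = {}
for x, y in itertools.combinations(files, 2):
    if compare(files[x], files[y]) < 0.01:
        # Each unordered pair is visited exactly once, so record the match
        # under both keys to reproduce the symmetric output shown in the edit.
        answers.setdefault(x, []).append(y)
        answers.setdefault(y, []).append(x)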
Consider a large list of named items (the first line) returned from a large CSV file (80 MB), with possible empty fields interrupting the names:
name_line = ['a',,'b',,'c' .... ,,'cb','cc']
I am reading in the remainder of the data line by line, and I only need to process data with a corresponding name. A data line might look like
data_line = ['10',,'.5',,'10289' .... ,,'16.7','0']
I tried it two ways. One is popping the empty columns from each line as it is read:
blnk_cols = [1,3, ... ,97]
while data:
    ...
    for index in blnk_cols: data_line.pop(index)
The other is collecting the items associated with a name from the first line:
good_cols = [0,2,4, ... ,98,99]
while data:
    ...
    data_line = [data_line[index] for index in good_cols]
In the data I am using there will definitely be more good columns than bad ones, although it might be as high as half and half.
I used the cProfile and pstats packages to find my weakest links in speed, which suggested the pop was the slowest part. I switched to the list comprehension and the time almost doubled.
I imagine one fast way would be to slice the array retrieving only good data, but this would be complicated for files with alternating blank and good data.
What I really need is to be able to do
data_line = data_line[good_cols]
effectively passing a list of indices into a list to get back those items.
Right now my program is running in about 2.3 seconds for a 10 MB file and the pop accounts for about .3 seconds.
Is there a faster way to access certain locations in a list? In C it would just be dereferencing an array of pointers to the correct indices in the array.
Additions:
name_line in file before read
a,b,c,d,e,f,g,,,,,h,i,j,k,,,,l,m,n,
name_line after read and split(",")
['a','b','c','d','e','f','g','','','','','h','i','j','k','','','','l','m','n','\n']
Try a generator expression,
data_line = (data_line[i] for i in good_cols)
Also read about Generator Expressions vs. List Comprehensions; as the top answer there says, 'Basically, use a generator expression if all you're doing is iterating once'.
So you should benefit from this.
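A tiny self-contained sketch of what that looks like (the index and data values here are made up for illustration):

good_cols = [0, 2, 4]                      # indices of the named columns
data_line = ['10', '', '.5', '', '10289']  # one parsed line, gaps included

# Lazily pick out only the wanted columns; values are produced as you iterate.
wanted = (data_line[i] for i in good_cols)
for value in wanted:
    print(value)   # prints: 10, .5, 10289 (one per line)

Keep in mind that a generator expression can only be iterated once; if you need the selected values more than once, wrap it in list(...).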
What is the best way to read a file and break out the lines by a delimiter?
Data returned should be a list of tuples.
Can this method be beaten? Can this be done faster/using less memory?
def readfile(filepath, delim):
    with open(filepath, 'r') as f:
        return [tuple(line.split(delim)) for line in f]
Your posted code reads the entire file and builds a copy of the file in memory as a single list of all the file contents split into tuples, one tuple per line. Since you ask about how to use less memory, you may only need a generator function:
def readfile(filepath, delim):
    with open(filepath, 'r') as f:
        for line in f:
            yield tuple(line.split(delim))
BUT! There is a major caveat! You can only iterate over the tuples returned by readfile once.
lines_as_tuples = readfile(mydata, ',')
for linedata in lines_as_tuples:
    # do something
This is okay so far, and a generator and a list look the same. But let's say your file was going to contain lots of floating point numbers, and your iteration through the file computed an overall average of those numbers. You could use the "# do something" code to calculate the overall sum and number of numbers, and then compute the average. But now let's say you wanted to iterate again, this time to find the differences from the average of each value. You'd think you'd just add another for loop:
for linedata in lines_as_tuples:
    # do another thing
    # BUT - this loop never does anything because lines_as_tuples has been consumed!
BAM! This is a big difference between generators and lists. At this point in the code, the generator has been completely consumed - but there is no special exception raised; the for loop simply does nothing and continues on, silently!
In many cases, the list that you would get back is only iterated over once, in which case a conversion of readfile to a generator would be fine. But if what you want is a more persistent list, which you will access multiple times, then just using a generator will give you problems, since you can only iterate over a generator once.
My suggestion? Make readfile a generator, so that in its own little view of the world, it just yields each incremental bit of the file, nice and memory-efficient. Put the burden of retention of the data onto the caller - if the caller needs to refer to the returned data multiple times, then the caller can simply build its own list from the generator - easily done in Python using list(readfile('file.dat', ',')).
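A short sketch of that caller-side pattern, reusing the generator version of readfile above and echoing the averaging example (the file name, delimiter, and the assumption that the first field is numeric are all placeholders):

rows = list(readfile('file.dat', ','))              # the caller materializes the data once

total = sum(float(r[0]) for r in rows)              # first pass: overall sum
average = total / len(rows)
deviations = [float(r[0]) - average for r in rows]  # second pass works, because rows is a list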
Memory use could be reduced by using a generator instead of a list and a list instead of a tuple, so you don't need to read the whole file into memory at once:
def readfile(path, delim):
    return (ln.split(delim) for ln in open(path, 'r'))
You'll have to rely on the garbage collector to close the file, though. As for returning tuples: don't do it if it's not necessary, since lists are a tiny fraction faster, constructing the tuple has a minute cost and (importantly) your lines will be split into variable-size sequences, which are conceptually lists.
Speed can be improved only by going down to the C/Cython level, I guess; str.split is hard to beat since it's written in C, and list comprehensions are AFAIK the fastest loop construct in Python.
More importantly, this is very clear and Pythonic code. I wouldn't try optimizing this apart from the generator bit.
How would I search through a list of ~5 million 128-bit (or 256-bit, depending on how you look at it) strings quickly and find the duplicates (in Python)? I can turn the strings into numbers, but I don't think that's going to help much. Since I haven't learned much information theory, is there anything about this in information theory?
And since these are hashes already, there's no point in hashing them again.
If it fits into memory, use set(). I think it will be faster than sorting; O(n log n) for 5 million items is going to cost you.
If it does not fit into memory, say you have a lot more than 5 million records, divide and conquer. Break the records at the midpoint, e.g. 1 x 2^127, and apply either of the above methods to each half. I guess information theory helps by stating that a good hash function will distribute the keys evenly, so the divide-at-the-midpoint method should work great.
You can also apply divide and conquer even if it fits into memory. Sorting 2 x 2.5 million records is faster than sorting 5 million records.
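A minimal sketch of the in-memory set approach just described, assuming the hashes are held in a list called hashes:

def find_duplicates(hashes):
    """Return the set of values that appear more than once."""
    seen, dupes = set(), set()
    for h in hashes:
        if h in seen:
            dupes.add(h)
        else:
            seen.add(h)
    return dupes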
Load them into memory (5M x 64B = 320MB), sort them, and scan through them finding the duplicates.
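And a sketch of that load-sort-scan approach, with the same hypothetical hashes list:

def find_duplicates_sorted(hashes):
    """Sort a copy, then report any value equal to its predecessor."""
    ordered = sorted(hashes)  # O(n log n)
    return {a for a, b in zip(ordered, ordered[1:]) if a == b}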
In Python 2.7+ you can use collections.Counter; for older Python use collections.defaultdict(int). Either way is O(n).
First make a list with some hashes in it:
>>> import hashlib
>>> s=[hashlib.sha1(str(x)).digest() for x in (1,2,3,4,5,1,2)]
>>> s
['5j\x19+y\x13\xb0LTWM\x18\xc2\x8dF\xe69T(\xab', '\xdaK\x927\xba\xcc\xcd\xf1\x9c\x07`\xca\xb7\xae\xc4\xa85\x90\x10\xb0', 'w\xdeh\xda\xec\xd8#\xba\xbb\xb5\x8e\xdb\x1c\x8e\x14\xd7\x10n\x83\xbb', '\x1bdS\x89$s\xa4g\xd0sr\xd4^\xb0Z\xbc 1dz', '\xac4x\xd6\x9a<\x81\xfab\xe6\x0f\\6\x96\x16ZN^j\xc4', '5j\x19+y\x13\xb0LTWM\x18\xc2\x8dF\xe69T(\xab', '\xdaK\x927\xba\xcc\xcd\xf1\x9c\x07`\xca\xb7\xae\xc4\xa85\x90\x10\xb0']
If you are using Python 2.7 or later:
>>> from collections import Counter
>>> c=Counter(s)
>>> duplicates = [k for k in c if c[k]>1]
>>> print duplicates
['\xdaK\x927\xba\xcc\xcd\xf1\x9c\x07`\xca\xb7\xae\xc4\xa85\x90\x10\xb0', '5j\x19+y\x13\xb0LTWM\x18\xc2\x8dF\xe69T(\xab']
If you are using Python 2.6 or earlier:
>>> from collections import defaultdict
>>> d=defaultdict(int)
>>> for i in s:
... d[i]+=1
...
>>> duplicates = [k for k in d if d[k]>1]
>>> print duplicates
['\xdaK\x927\xba\xcc\xcd\xf1\x9c\x07`\xca\xb7\xae\xc4\xa85\x90\x10\xb0', '5j\x19+y\x13\xb0LTWM\x18\xc2\x8dF\xe69T(\xab']
Is this array sorted?
I think the fastest solution would be a heap sort or quicksort, and afterwards go through the array and find the duplicates.
You say you have a list of about 5 million strings, and the list may contain duplicates. You don't say (1) what you want to do with the duplicates (log them, delete all but one occurrence, ...), (2) what you want to do with the non-duplicates, (3) whether this list is a stand-alone structure or whether the strings are keys to some other data that you haven't mentioned, or (4) why you haven't deleted duplicates at input time instead of building a list containing duplicates.
As a Data Structures and Algorithms 101 exercise, the answer you have accepted is nonsense. If you have enough memory, detecting duplicates using a set should be faster than sorting a list and scanning it. Note that deleting M items from a list of size N is O(MN). The code for each of the various alternatives is short and rather obvious; why don't you try writing them, timing them, and reporting back?
If this is a real-world problem that you have, you need to provide much more information if you want a sensible answer.