I have a text file with many tens of thousands of short sentences like this:
go to venice
come back from grece
new york here i come
from belgium to russia and back to spain
I run a tagging algorithm which produces a tagged output of this sentence file:
go to <place>venice</place>
come back from <place>grece</place>
<place>new york</place> here i come
from <place>belgium</place> to <place>russia</place> and back to <place>spain</place>
The algorithm runs over the input multiple times, and each run produces slightly different tagging. My goal is to identify the lines where those differences occur. In other words, print all utterances whose tagging differs across the N result files.
For example, with N=10 I get 10 tagged files. Suppose line 1 is tagged the same way in all 10 tagged files - do not print it. Suppose line 2 is tagged one way once and another way 9 times - print it. And so on.
For N=2 it is easy, I just run diff. But what do I do if I have N=10 results?
If you have the tagged files - just create a counter for each line that records how many times you've seen it:
# use defaultdict for convenience
from collections import defaultdict

# start counting at 0
counter_dict = defaultdict(lambda: 0)

tagged_file_names = ['tagged1.txt', 'tagged2.txt', ...]

# add all lines of each file to the dict
for file_name in tagged_file_names:
    with open(file_name) as f:
        # use enumerate to maintain order; it produces
        # (LINE_NUMBER, LINE_CONTENT) tuples, which are hashable
        for line_with_number in enumerate(f):
            counter_dict[line_with_number] += 1

# print all entries that do not repeat in all files (in the same location)
for key, value in counter_dict.items():
    if value < len(tagged_file_names):
        print("line number %d: [%s] only repeated %d times" % (
            key[0], key[1].strip(), value
        ))
Walkthrough:
First of all, we create a data structure that lets us count our entries, which are numbered lines. This data structure is a collections.defaultdict with a default value of 0 - which is the count of a newly added line (increased to 1 on the first add).
Then, we create the actual entry as a tuple, which is hashable, so it can be used as a dictionary key, and deeply comparable to other tuples by default. This means (1, "lolz") is equal to (1, "lolz") but different from (1, "not lolz") or (2, "lolz") - so it fits our use case of deep-comparing lines, accounting for content as well as position.
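A quick interactive check of that tuple behavior:
>>> (1, "lolz") == (1, "lolz")
True
>>> (1, "lolz") == (1, "not lolz")
False
>>> (1, "lolz") == (2, "lolz")
False
>>> hash((1, "lolz")) == hash((1, "lolz"))
True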
Now all that's left to do is add all entries using a straightforward for loop and report the keys (which correspond to numbered lines) that do not appear in all files (that is, their count is less than the number of tagged files provided).
Example:
reut@tHP-EliteBook-8470p:~/python/counter$ cat tagged1.txt
123
abc
def
reut@tHP-EliteBook-8470p:~/python/counter$ cat tagged2.txt
123
def
def
reut@tHP-EliteBook-8470p:~/python/counter$ ./difference_counter.py
line number 1: [abc] only repeated 1 times
line number 1: [def] only repeated 1 times
If you compare all of them to the first text, you can get a list of all the texts that are different. This might not be the quickest way, but it would work.
import difflib

n1 = '1 2 3 4 5 6'
n2 = '1 2 3 4 5 6'
n3 = '1 2 4 5 6 7'
l = [n1, n2, n3]
# keep only the texts that differ from the first one
m = [x for x in l if x != l[0]]
for text in m:
    # unified_diff expects sequences of lines/tokens, so split the strings
    diff = difflib.unified_diff(l[0].split(), text.split())
    print('\n'.join(diff))
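Applied to the question's tagged files, the same idea would diff every result file against the first one. A sketch with placeholder filenames:

import difflib

file_names = ['tagged1.txt', 'tagged2.txt', 'tagged3.txt']  # placeholder names
with open(file_names[0]) as f:
    first = f.readlines()
for name in file_names[1:]:
    with open(name) as f:
        other = f.readlines()
    diff = difflib.unified_diff(first, other, fromfile=file_names[0], tofile=name)
    print(''.join(diff))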
I have a set of 2000 rows with many elements per row. One of the elements in a row is a string ("name") that is common to each group of 5 rows (the total number of unique names is 400).
I want rows with the same "name" to end up in the same bucket. So the function should always return the same value for a given input.
I want to use it for k-fold cross-validation, so I need to create K buckets with the number of elements per bucket as uniformly distributed as possible; +/- a few elements is fine, but more than 10% off is not.
For K = 10 I should have 10 buckets with 200 elements in each; 190 or 210 is OK, but 250 or 180 is not. I tried this answer but it did not give me a very uniform result. This may be due to the dataset itself, but it would be great to have a somewhat balanced number of elements per bucket. K is usually either 5 or 10.
An example:
name1, date1_1, location1_1, number1_1
name1, date1_2, location1_2, number1_2
...
name1, date1_5, location1_5, number1_5
name2, date2_1, location2_1, number2_1
...
name2, date2_5, location2_5, number2_5
...
name400, date400_1, location400_1, number400_1
...
name400, date400_5, location400_5, number400_5
Output example:
i,name1, date1_1, location1_1, number1_1
i,name1, date1_2, location1_2, number1_2
...
i,name1, date1_5, location1_5, number1_5
j,name2, date2_1, location2_1, number2_1
...
j,name2, date2_5, location2_5, number2_5
...
k,name400, date400_1, location400_1, number400_1
...
k,name400, date400_5, location400_5, number400_5
where 1 <= i, j, k <= K (K = 5 or K = 10)
What you want is a hash-table, yes? In that case just create a dictionary of size K, and devise a hash-function that takes your string as input and comes back with the index. In the example you provided, an appropriate one might be:
h = int(name.split(',')[0].strip("name")) % K
To be fair, this is pretty naive and doesn't take into account the distribution of your names (you could have many with name1 and very few with name400 for example) but if they are more-or-less the same then that method should work reasonably well.
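A quick sanity check of that hash, assuming K = 10 (the sample rows here are mine, modeled on the question's format):

K = 10
for row in ("name1, date1_1, ...", "name42, date42_3, ...", "name400, date400_5, ..."):
    name = row.split(',')[0]
    h = int(name.strip("name")) % K
    print(name, '->', h)  # name1 -> 1, name42 -> 2, name400 -> 0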
If your names aren't as convenient as that, you could create a secondary table that simply takes in your name and spits out a number. For instance, suppose you had the names: "Bob", "Sally", "Larry", ...
nameIndexMappings = {"Bob" : 0, "Sally" : 1, "Larry" : 2}
h = nameIndexMappings[name.split(',')[0]] % K
Then you can set up another dictionary like this:

rowMapping = dict()
index = 0
for i in range(0, K):
    rowMapping[i] = list()
for row in rows:
    name = row.split(',')[0]
    if name not in nameIndexMappings:
        nameIndexMappings[name] = index
        index += 1
    h = nameIndexMappings[name] % K
    rowMapping[h].append(row)
After doing this, rowMapping should contain K lists each with about the same number of elements in them (assuming, of course, that all your names are more-or-less equally distributed).
What you are asking for isn't feasible without more constraints.
Imagine your input consisted of the string "A" N times, with N arbitrarily large, and the string "B" only once. What would you like the output to be?
In any case, what you want to solve is a bin-packing optimization problem.
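That said, if you treat each name (a group of 5 rows) as one item, a greedy first-fit-decreasing pass is a simple way to approximate the bin-packing solution for k-fold splitting: it is deterministic for a fixed input list, though not a stateless hash. A sketch (the function name and interface are mine, not from the question):

import heapq
from collections import Counter

def balanced_buckets(names, K):
    # rows per name (5 each in the question, but any group sizes work)
    group_sizes = Counter(names)
    # min-heap of (current_bucket_size, bucket_index)
    heap = [(0, k) for k in range(K)]
    heapq.heapify(heap)
    assignment = {}
    # place the largest groups first for the most even packing
    for name, size in group_sizes.most_common():
        current, k = heapq.heappop(heap)
        assignment[name] = k
        heapq.heappush(heap, (current + size, k))
    return assignment

# usage: one bucket index per row, the same index for rows sharing a name
names = ['name%d' % (i // 5 + 1) for i in range(2000)]
buckets = balanced_buckets(names, 10)
print(Counter(buckets[n] for n in names))  # 200 rows in each of the 10 buckets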
I have two lists structured something like this:
A= [[1,27],[2,27],[3,27],[4,28],[5,29]]
B= [[6,30],[7,31],[8,31]]
and I have a file that has numbers:
1 5
2 3
3 1
4 2
5 5
6....
I want code that reads this file and maps the numbers to the lists, e.g. if the file has 1, it should look in list A and output 27; if it has 6, it should look in B and print 30, so that I get:
27 29
27 27
27 27
28 27
29 29
30 31
The problem is that my code gives an index error. I read the file line by line and have an if condition that checks whether the number read from the file is less than the maximum number in list A; if so, it outputs the second element of that sublist, and otherwise moves on. The problem is that instead of moving on to list B, my program still reads A and gives an index error.
with open(filename) as myfile:
    for line in myfile.readlines():
        parts = line.split()
        if parts[0] < maxnumforA:
            print A[int(parts[0])-1]
        else:
            print B[int(parts[0])-1]
You should turn those lists into dictionaries. For example:

_A = dict(A)
_B = dict(B)

with open(filename) as myfile:
    for line in myfile:
        parts = line.split()
        for part in parts:
            part = int(part)
            if part in _A:
                print(_A[part])
            elif part in _B:
                print(_B[part])
If the action that will take place does not need to know if it comes from A or B, both can be turned into a single dictionary:
d = dict(A + B)  # Creating the dictionary

with open(filename) as myfile:
    for line in myfile:
        parts = line.split()
        for part in parts:
            part = int(part)
            if part in d:
                print(d[part])
Creating the dictionary can be accomplished in many different ways; I will list some of them:
d = dict(A + B): First joins both lists into a single list (without modifying A or B) and then turns the result into a dictionary. It's the clearest way to do it.
d = {**dict(A), **dict(B)}: Turns both lists into two separate dictionaries (without modifying A or B), unpacks them, and packs both into a single dictionary. Slightly (and I mean really slightly) faster than the previous method and less clear. Proposed by #Nf4r.
d = dict(A) followed by d.update(B): Turns the first list into a dictionary and then updates it with the content of the second list. The fastest method; it takes one line of code per list instead of one line for all lists, but it generates no temporary objects, so it is the most memory-efficient.
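For completeness, a quick check that the three constructions agree for the lists from the question:

A = [[1, 27], [2, 27], [3, 27], [4, 28], [5, 29]]
B = [[6, 30], [7, 31], [8, 31]]

d1 = dict(A + B)             # join, then convert
d2 = {**dict(A), **dict(B)}  # convert separately, then unpack into one dict
d3 = dict(A)                 # convert the first list ...
d3.update(B)                 # ... then merge the second one in place
assert d1 == d2 == d3 == {1: 27, 2: 27, 3: 27, 4: 28, 5: 29, 6: 30, 7: 31, 8: 31}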
As everyone stated before, a dict would be much better. I don't know whether the left-hand values in each list are unique, but if they are, you could just go for:
d = {**dict(A), **dict(B)}  # available in Python 3.x

with open(filename) as file:
    for line in file.readlines():
        for num in line.split():
            if int(num) in d:
                print(d[int(num)])
I have a reference file that is about 9,000 lines and has the following structure: (index, size) - where index is unique but size may not be.
0 193532
1 10508
2 13984
3 14296
4 12572
5 12652
6 13688
7 14256
8 230172
9 16076
And I have a data file that is about 650,000 lines and has the following structure: (cluster, offset, size) - where offset is unique but size is not.
446 0xdf6ad1 34572
447 0xdf8020 132484
451 0xe1871b 11044
451 0xe1b394 7404
451 0xe1d12b 5892
451 0xe1e99c 5692
452 0xe20092 6224
452 0xe21a4b 5428
452 0xe23029 5104
452 0xe2455e 138136
I need to compare each size value in the second column of the reference file for any matches with the size values in the third column of the data file. If there is a match, output the offset hex value (second column in the data file) with the index value (first column in the reference file). Currently I am doing this with the following code and just piping it to a new file:
#!/usr/bin/python3
import sys

ref_file = sys.argv[1]
dat_file = sys.argv[2]

with open(ref_file, 'r') as ref, open(dat_file, 'r') as dat:
    for r_line in ref:
        ref_size = r_line[r_line.find(' ') + 1:-1]
        for d_line in dat:
            dat_size = d_line[d_line.rfind(' ') + 1:-1]
            if dat_size == ref_size:
                print(d_line[d_line.find('0x') : d_line.rfind(' ')]
                      + '\t'
                      + r_line[:r_line.find(' ')])
        dat.seek(0)
The typical output looks like this:
0x86ece1eb 0
0x16ff4628f 0
0x59b358020 0
0x27dfa8cb4 1
0x6f98eb88f 1
0x102cb10d4 2
0x18e2450c8 2
0x1a7aeed12 2
0x6cbb89262 2
0x34c8ad5 3
0x1c25c33e5 3
This works fine but takes about 50 minutes to complete for the given file sizes.
It has done its job, but as a novice I am always keen to learn ways to improve my coding and share these learnings. My question is: what changes could I make to improve the performance of this code?
You can do the following: take a dictionary dic and fill it as below (this is pseudocode; I also assume sizes don't repeat):
for index, size in the first file:
    dic[size] = index

for index, offset, size in second file:
    if size in dic:
        print dic[size], offset
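Under those assumptions (unique sizes, whitespace-separated columns as shown in the question), a minimal runnable version might look like this; the filenames are placeholders:

dic = {}
with open('ref_file.txt') as ref:    # placeholder filename
    for r_line in ref:
        index, size = r_line.split()
        dic[size] = index            # assumes sizes don't repeat
with open('dat_file.txt') as dat:    # placeholder filename
    for d_line in dat:
        _, offset, size = d_line.split()
        if size in dic:
            print(dic[size], offset)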
Since you look up lines in the files by size, size should be the key in any dictionary data structure. You need this dictionary to get rid of the nested loop, which is the real performance killer here. Furthermore, as your sizes are not unique, you will have to use lists of offset / index values (depending on which file you store in the dictionary). A defaultdict will help you avoid some clunky code:
from collections import defaultdict

with open(ref_file, 'r') as ref, open(dat_file, 'r') as dat:
    dat_dic = defaultdict(list)  # maintain a list of offsets for each size
    for d_line in dat:
        _, offset, size = d_line.split()
        dat_dic[size].append(offset)
    for r_line in ref:
        index, size = r_line.split()
        # dict lookup is O(1) and not O(N) ...
        # ... as looping over the dat_file is
        for offset in dat_dic[size]:
            print('{offset}\t{index}'.format(offset=offset, index=index))
If the order of your output lines does not matter, you can think about doing it the other way around (building the dictionary from the ref_file instead), because your dat_file is so much bigger, and building the defaultdict from it uses a lot more RAM.
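A minimal sketch of that reversed variant, under the same file-format assumptions; note it emits results in dat_file order instead:

from collections import defaultdict

with open(ref_file, 'r') as ref, open(dat_file, 'r') as dat:
    ref_dic = defaultdict(list)  # maintain a list of indexes for each size
    for r_line in ref:
        index, size = r_line.split()
        ref_dic[size].append(index)
    for d_line in dat:
        _, offset, size = d_line.split()
        for index in ref_dic[size]:
            print('{offset}\t{index}'.format(offset=offset, index=index))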
I have a file filled with lines like this (this is just a small bit of the file):
9 Hyphomicrobium facile Hyphomicrobiaceae
9 Hyphomicrobium facile Hyphomicrobiaceae
7 Mycobacterium kansasii Mycobacteriaceae
7 Mycobacterium gastri Mycobacteriaceae
10 Streptomyces olivaceiscleroticus Streptomycetaceae
10 Streptomyces niger Streptomycetaceae
1 Streptomyces geysiriensis Streptomycetaceae
1 Streptomyces minutiscleroticus Streptomycetaceae
0 Brucella neotomae Brucellaceae
0 Brucella melitensis Brucellaceae
2 Mycobacterium phocaicum Mycobacteriaceae
The number refers to a cluster, and then it goes 'Genus' 'Species' 'Family'.
What I want to do is write a program that will look through each line and report back to me: a list of the different genera in each cluster, and how many of each of those genera are in the cluster. So I'm interested in cluster number and the first 'word' in each line.
My trouble is that I'm not sure how to get this information. I think I need to use a for loop, starting at lines that begin with '0'. The output would be a file that looks something like:
Cluster 0: Brucella(2) # Lists cluster, followed by genera in cluster with number, something like that.
Cluster 1: Streptomyces(2)
Cluster 2: Brucella(1)
etc.
Eventually I want to do the same kind of count with the Families in each cluster, and then Genera and Species together. Any thoughts on how to start would be greatly appreciated!
I thought this would be a fun little toy project, so I wrote a little hack to read in an input file like yours from stdin, count and format the output recursively and spit out output that looks a little like yours, but with a nested format, like so:
Cluster 0:
    Brucella(2)
        melitensis(1)
            Brucellaceae(1)
        neotomae(1)
            Brucellaceae(1)
    Streptomyces(1)
        neotomae(1)
            Brucellaceae(1)
Cluster 1:
    Streptomyces(2)
        geysiriensis(1)
            Streptomycetaceae(1)
        minutiscleroticus(1)
            Streptomycetaceae(1)
Cluster 2:
    Mycobacterium(1)
        phocaicum(1)
            Mycobacteriaceae(1)
Cluster 7:
    Mycobacterium(2)
        gastri(1)
            Mycobacteriaceae(1)
        kansasii(1)
            Mycobacteriaceae(1)
Cluster 9:
    Hyphomicrobium(2)
        facile(2)
            Hyphomicrobiaceae(2)
Cluster 10:
    Streptomyces(2)
        niger(1)
            Streptomycetaceae(1)
        olivaceiscleroticus(1)
            Streptomycetaceae(1)
I also added some junk data to my table so that I could see an extra entry in Cluster 0, separated from the other two... The idea here is that you should be able to see a top level "Cluster" entry and then nested, indented entries for genus, species, family... it shouldn't be hard to extend for deeper trees, either, I hope.
# Sys for stdio stuff
import sys
# re for the re.split -- this can go if you find another way to parse your data
import re

# A global (shame on me) for storing the data we're going to parse from stdin
data = []

# read lines from standard input until it's empty (end-of-file)
for line in sys.stdin:
    # Split lines on spaces (gobbling multiple spaces for robustness)
    # and trim whitespace off the beginning and end of input (strip)
    entry = re.split(r"\s+", line.strip())
    # Throw the array into my global data array, it'll look like this:
    #   [ "0", "Brucella", "melitensis", "Brucellaceae" ]
    # A lot of this code assumes that the first field is an integer, what
    # you call "cluster" in your problem description
    data.append(entry)

# Sort, first key is expected to be an integer, and we want a numerical
# sort rather than a string sort, so convert to int, then sort by
# each subsequent column. The lambda is a function that returns a tuple
# of keys we care about for each item
data.sort(key=lambda item: (int(item[0]), item[1], item[2], item[3]))

# Our recursive function -- we're basically going to treat "data" as a tree,
# even though it's not.
# parameters:
#   start - an integer telling us what line to begin working from so we needn't
#       walk the whole tree each time to figure out where we are.
#   super - An array that captures where we are in the search. This array
#       will have more elements in it as we deepen our traversal of the "tree".
#       Initially, it will be []
#       In the next ply of the tree, it will be [ '0' ]
#       Then something like [ '0', 'Brucella' ] and so on.
#   data - The global data structure -- this never mutates after the sort above,
#       I could have just used the global directly
def groupedReport(start, super, data):
    # Figure out what ply we're on in our depth-first traversal of the tree
    depth = len(super)

    # Count entries in the super class, starting from "start" index in the array
    count = 0
    # For the few records in the data file that match our "super" exactly, we
    # count occurrences.
    if depth != 0:
        for i in range(start, len(data)):
            if data[i][0:depth] == data[start][0:depth]:
                count = count + 1
            else:
                # We can stop counting as soon as a match fails,
                # because of the way our input data is sorted
                break
    else:
        count = len(data)

    # At depth == 1, we're reporting about clusters -- this is the only piece of
    # the algorithm that's not truly abstract, and it's only for presentation
    if depth == 1:
        sys.stdout.write("Cluster " + super[0] + ":\n")
    elif depth > 0:
        # Every other depth: indent with 4 spaces for every ply of depth, then
        # output the unique field we just counted, and its count
        sys.stdout.write((' ' * ((depth - 1) * 4)) +
                         data[start][depth - 1] + '(' + str(count) + ')\n')

    # Recursion: we're going to figure out a new depth and a new "super"
    # and then call ourselves again. We break out on depth == 4 because
    # of one other assumption (I lied before about the abstract thing) I'm
    # making about our input data here. This could
    # be made more robust/flexible without a lot of work
    subsuper = None
    substart = start
    for i in range(start, start + count):
        record = data[i]  # The original record from our data
        newdepth = depth + 1
        if newdepth > 4:
            break
        # array slice creates a new copy
        newsuper = record[0:newdepth]
        if newsuper != subsuper:
            # Recursion here!
            groupedReport(substart, newsuper, data)
            # Track our new "subsuper" for subsequent comparisons
            # as we loop through matches
            subsuper = newsuper
        # Track position in the data for next recursion, so we can start on
        # the right line
        substart = substart + 1

# First call to groupedReport starts the recursion
groupedReport(0, [], data)
If you make my Python code into a file like "classifier.py", then you can run your input.txt file (or whatever you call it) through it like so:
cat input.txt | python classifier.py
Most of the magic of the recursion, if you care, is implemented using slices of arrays and leans heavily on the ability to compare array slices, as well as the fact that I can order the input data meaningfully with my sort routine. You may want to convert your input data to all-lowercase, if it is possible that case inconsistencies could yield mismatches.
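A tiny illustration of the slice comparisons the traversal relies on:

record = ["0", "Brucella", "melitensis", "Brucellaceae"]
other  = ["0", "Brucella", "neotomae",   "Brucellaceae"]
print(record[0:2] == other[0:2])  # True  -- same cluster and genus
print(record[0:3] == other[0:3])  # False -- the species differ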
It is easy to do.
Create an empty dict {} to store your result; let's call it 'result'.
Loop over the data line by line.
Split the line on spaces to get the 4 elements per your structure: cluster, genus, species, family.
Increment the count of the genus inside its cluster key each time it is found in the loop; it has to be set to 1 on the first occurrence, though.
result = {'0': {'Brucella': 2}, '1': {'Streptomyces': 2}, .....}
Code:
my_data = """9 Hyphomicrobium facile Hyphomicrobiaceae
9 Hyphomicrobium facile Hyphomicrobiaceae
7 Mycobacterium kansasii Mycobacteriaceae
7 Mycobacterium gastri Mycobacteriaceae
10 Streptomyces olivaceiscleroticus Streptomycetaceae
10 Streptomyces niger Streptomycetaceae
1 Streptomyces geysiriensis Streptomycetaceae
1 Streptomyces minutiscleroticus Streptomycetaceae
0 Brucella neotomae Brucellaceae
0 Brucella melitensis Brucellaceae
2 Mycobacterium phocaicum Mycobacteriaceae"""
result = {}
for line in my_data.split("\n"):
    cluster, genus, species, family = line.split(" ")
    result.setdefault(cluster, {}).setdefault(genus, 0)
    result[cluster][genus] += 1
print(result)
{'10': {'Streptomyces': 2}, '1': {'Streptomyces': 2}, '0': {'Brucella': 2}, '2': {'Mycobacterium': 1}, '7': {'Mycobacterium': 2}, '9': {'Hyphomicrobium': 2}}
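If you want the one-line-per-cluster format from the question, a short formatting pass over result gets you there (a sketch; clusters sorted numerically):

for cluster in sorted(result, key=int):
    genera = ', '.join('%s(%d)' % (g, n) for g, n in sorted(result[cluster].items()))
    print('Cluster %s: %s' % (cluster, genera))
# Cluster 0: Brucella(2)
# Cluster 1: Streptomyces(2)
# ...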
I have two files which I loaded into lists. The content of the first file is something like this:
d.complex.1
23
34
56
58
68
76
.
.
.
etc
d.complex.179
43
34
59
69
76
.
.
.
etc
The content of the second file is also the same but with different numerical values. Please consider everything from one d.complex.* to the next d.complex.* as one set.
Now I am interested in comparing each numerical value from one set of the first file with each numerical value in the sets of the second file. I would like to record the number of times each numerical value appears in the second file overall.
For example, the number 23 from d.complex.1 could have appeared 5 times in file 2 under different sets. All I want to do is record the number of occurrences of the number 23 in file 2, including all sets of file 2.
My initial approach was to load them into lists and compare, but I was not able to achieve this. I searched on Google and came across sets, but being a Python noob, I need some guidance. Can anyone help me?
If you feel the question is not clear, please let me know. I have also pasted the complete file 1 and file 2 here:
http://pastebin.com/mwAWEcTa
http://pastebin.com/DuXDDRYT
Open the file using Python's open function, then iterate over all its lines. Check whether the line contains a number; if so, increase its count in a defaultdict instance as described here.
Repeat this for the other file and compare the resulting dicts.
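A minimal sketch of that approach; the filenames are placeholders, and the numbers are kept as strings:

from collections import defaultdict

def count_numbers(path):
    # count how often each number line occurs, skipping the d.complex.* headers
    counts = defaultdict(int)
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith('d.complex'):
                counts[line] += 1
    return counts

counts2 = count_numbers('file2.txt')  # placeholder filename
for number in count_numbers('file1.txt'):
    print(number, counts2.get(number, 0))  # occurrences of each file-1 number in file 2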
First create a function which can load a given file. Since you may want to maintain the individual sets and also count the occurrences of each number, the best approach is a dict for the whole file where the keys are set names (e.g. d.complex.1 etc.); for each such set, keep another dict for the numbers in the set. The code below explains it better:
def file_loader(f):
    file_dict = {}
    current_set = None
    for line in f:
        if line.startswith('d.complex'):
            file_dict[line] = current_set = {}
            continue
        if current_set is not None:
            current_set[line] = current_set.get(line, 0) + 1
    return file_dict
Now you can easily write a function which counts a number in a given file_dict:
def count_number(file_dict, num):
    count = 0
    for set_name, number_set in file_dict.items():
        count += number_set.get(num, 0)
    return count
Here is a usage example:
s = """d.complex.1
10
11
12
10
11
12"""
file_dict = file_loader(s.split("\n"))
print(file_dict)
print(count_number(file_dict, '10'))
output is:
{'d.complex.1': {'11': 2, '10': 2, '12': 2}}
2
You may have to improve the file loader, e.g. skip empty lines, convert the numbers to int, etc.
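For instance, a hardened variant folding those improvements in might look like this (just a sketch; note it keys the numbers as ints, so count_number would then be called as count_number(file_dict, 10)):

def file_loader(f):
    file_dict = {}
    current_set = None
    for line in f:
        line = line.strip()
        if not line:                      # skip empty lines
            continue
        if line.startswith('d.complex'):
            file_dict[line] = current_set = {}
        elif current_set is not None:
            num = int(line)               # convert to int
            current_set[num] = current_set.get(num, 0) + 1
    return file_dict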