Efficient lookup by common words - python

I have a list of names (strings) divided into words. There are 8 million names, and each name consists of up to 20 words (tokens). The number of unique tokens is 2.2 million. I need an efficient way to find all names containing at least one word from the query (which may also contain up to 20 words, but usually only a few).
My current approach uses Python Pandas and looks like this (later referred to as original):
>>> df = pd.DataFrame([['foo', 'bar', 'joe'],
...                    ['foo'],
...                    ['bar', 'joe'],
...                    ['zoo']],
...                   index=['id1', 'id2', 'id3', 'id4'])
>>> df.index.rename('id', inplace=True)  # btw, is there a way to include this into prev line?
>>> print df
       0     1     2
id
id1  foo   bar   joe
id2  foo  None  None
id3  bar   joe  None
id4  zoo  None  None
def filter_by_tokens(df, tokens):
    # search within each column and then concatenate and dedup results
    results = [df.loc[lambda df: df[i].isin(tokens)] for i in range(df.shape[1])]
    return pd.concat(results).reset_index().drop_duplicates().set_index(df.index.name)

>>> print filter_by_tokens(df, ['foo', 'zoo'])
       0     1     2
id
id1  foo   bar   joe
id2  foo  None  None
id4  zoo  None  None
Currently such a lookup (on the full dataset) takes 5.75s on my (rather powerful) machine. I'd like to speed it up by at least, say, 10 times.
I was able to get down to 5.29s by squeezing all columns into one and performing the lookup on that (later referred to as original, squeezed):
>>> df = pd.Series([{'foo', 'bar', 'joe'},
...                 {'foo'},
...                 {'bar', 'joe'},
...                 {'zoo'}],
...                index=['id1', 'id2', 'id3', 'id4'])
>>> df.index.rename('id', inplace=True)
>>> print df
id
id1    {foo, bar, joe}
id2              {foo}
id3         {bar, joe}
id4              {zoo}
dtype: object

def filter_by_tokens(df, tokens):
    return df[df.map(lambda x: bool(x & set(tokens)))]

>>> print filter_by_tokens(df, ['foo', 'zoo'])
id
id1    {foo, bar, joe}
id2              {foo}
id4              {zoo}
dtype: object
But that's still not fast enough.
Another solution that seems easy to implement is Python multiprocessing (threading shouldn't help here because of the GIL, and there is no I/O, right?). But the problem with it is that the big dataframe needs to be copied to each process, which takes up all the memory. Another problem is that I need to call filter_by_tokens many times in a loop, so it would copy the dataframe on every call, which is inefficient.
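(For what it's worth, one way around the copy problem is to rely on fork semantics: keep the dataframe in a module-level global and create the worker pool after it is built, so the children inherit it copy-on-write instead of receiving a pickled copy per call. A minimal sketch, assuming Unix with the 'fork' start method and Python 3; the names below are made up for illustration:)

import multiprocessing as mp

DF = None  # hypothetical module-level global, filled in before the pool is created

def _scan_column(args):
    # each worker scans one column of the inherited (not pickled) dataframe
    col, tokens = args
    return DF.index[DF[col].isin(tokens)]

def filter_by_tokens_parallel(df, tokens, processes=4):
    global DF
    DF = df
    ctx = mp.get_context('fork')          # children inherit DF via copy-on-write
    with ctx.Pool(processes) as pool:
        parts = pool.map(_scan_column, [(c, tokens) for c in df.columns])
    idx = parts[0]
    for p in parts[1:]:
        idx = idx.union(p)                # dedup across columns
    return df.loc[idx]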
Note that words may occur many times in names (e.g. the most popular word occurs 600k times in names), so a reverse index would be huge.
What is a good way to write this efficiently? Python solution preferred, but I'm also open to other languages and technologies (e.g. databases).
UPD:
I've measured the execution time of my two solutions and the 5 solutions suggested by @piRSquared in his answer. Here are the results (tl;dr the best is a 2x improvement):
+--------------------+----------------+
| method | best of 3, sec |
+--------------------+----------------+
| original | 5.75 |
| original, squeezed | 5.29 |
| zip | 2.54 |
| merge | 8.87 |
| mul+any | MemoryError |
| isin | IndexingError |
| query | 3.7 |
+--------------------+----------------+
mul+any gives MemoryError on d1 = pd.get_dummies(df.stack()).groupby(level=0).sum() (on a 128GB RAM machine).
isin gives IndexingError: Unalignable boolean Series key provided on s[d1.isin({'zoo', 'foo'}).unstack().any(1)], apparently because the shape of df.stack().isin(set(tokens)).unstack() is slightly smaller than the shape of the original dataframe (8.39M vs 8.41M rows); not sure why, or how to fix that.
Note that the machine I'm using has 12 cores (though I mentioned some problems with parallelization above). All of the solutions utilize a single core.
Conclusion (as of now): there is a 2.1x improvement with zip (2.54s) over the original squeezed solution (5.29s). That's good, though I aimed for at least a 10x improvement, if possible. So I'm leaving the (still great) @piRSquared answer unaccepted for now, to welcome more suggestions.

idea 0
zip

def pir(s, token):
    return s[[bool(p & token) for p in s]]

pir(s, {'foo', 'zoo'})
idea 1
merge
token = pd.DataFrame(dict(v=['foo', 'zoo']))
d1 = df.stack().reset_index('id', name='v')
s.ix[d1.merge(token).id.unique()]
idea 2
mul + any
d1 = pd.get_dummies(df.stack()).groupby(level=0).sum()
token = pd.Series(1, ['foo', 'zoo'])
s[d1.mul(token).any(1)]
idea 3
isin
d1 = df.stack()
s[d1.isin({'zoo', 'foo'}).unstack().any(1)]
idea 4
query
token = ('foo', 'zoo')
d1 = df.stack().to_frame('s')
s.ix[d1.query('s in #token').index.get_level_values(0).unique()]

I have done similar things with the following tools:
HBase - the key can have multiple columns (very fast)
Elasticsearch - nice, easy to scale; you just need to import your data as JSON
Apache Lucene - will be very good for 8 million records

You can do it with a reverse index; the code below, run under PyPy, builds the index in 57 seconds, answers a query of 20 words in 0.00018 seconds, and uses about 3.2 GB of memory. Under Python 2.7 it builds the index in 158 seconds and answers a query in 0.0013 seconds, using about 3.41 GB of memory.
The fastest possible way to do this is with bitmapped reverse indexes, compressed to save space.
"""
8m records with between 1 and 20 words each, selected at random from 100k words
Build dictionary of sets, keyed by word number, set contains nos of all records
with that word
query merges the sets for all query words
"""
import random
import time records = 8000000
words = 100000
wordlists = {}
print "build wordlists"
starttime = time.time()
wordlimit = words - 1
total_words = 0
for recno in range(records):
for x in range(random.randint(1,20)):
wordno = random.randint(0,wordlimit)
try:
wordlists[wordno].add(recno)
except:
wordlists[wordno] = set([recno])
total_words += 1
print "build time", time.time() - starttime, "total_words", total_words
querylist = set()
query = set()
for x in range(20):
while 1:
wordno = (random.randint(0,words))
if wordno in wordlists: # only query words that were used
if not wordno in query:
query.add(wordno)
break
print "query", query
starttime = time.time()
for wordno in query:
querylist.union(wordlists[wordno])
print "query time", time.time() - starttime
print "count = ", len(querylist)
for recno in querylist:
print "record", recno, "matches"

Perhaps my first answer was a bit abstract; in the absence of real data it generated random data at approximately the required volume, to get a feel for the query time. This code is practical.
data = [['foo', 'bar', 'joe'],
        ['foo'],
        ['bar', 'joe'],
        ['zoo']]

wordlists = {}
print "build wordlists"
for x, d in enumerate(data):
    for word in d:
        try:
            wordlists[word].add(x)
        except:
            wordlists[word] = set([x])

print "query"
query = ["foo", "zoo"]
results = set()
for q in query:
    wordlist = wordlists.get(q)
    if wordlist:
        results = results.union(wordlist)
l = list(results)
l.sort()
for x in l:
    print data[x]
The cost in time and memory is in building the wordlists (inverted indices); a query is almost free. You have a 12-core machine, so presumably it has plenty of memory. For repeatability, build the wordlists, pickle each wordlist and write it to SQLite or any key/value database, with the word as the key and the pickled set as a binary blob. Then all you need is:
initialise_database()

query = ["foo", "zoo"]
results = set()
for q in query:
    wordlist = get_wordlist_from_database(q)  # get binary blob and unpickle
    if wordlist:
        results = results.union(wordlist)
l = list(results)
l.sort()
for x in l:
    print data[x]
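A minimal sketch of that persistence step, assuming the standard-library sqlite3 and pickle modules (the table name and the save_wordlists helper are made up; get_wordlist_from_database matches the placeholder used above):

import pickle
import sqlite3

def save_wordlists(wordlists, path="wordlists.db"):
    # one row per word: the record-number set is pickled into a BLOB
    con = sqlite3.connect(path)
    con.execute("CREATE TABLE IF NOT EXISTS wordlists (word TEXT PRIMARY KEY, recs BLOB)")
    con.executemany("INSERT OR REPLACE INTO wordlists VALUES (?, ?)",
                    ((w, sqlite3.Binary(pickle.dumps(s, 2))) for w, s in wordlists.items()))
    con.commit()
    con.close()

def get_wordlist_from_database(word, path="wordlists.db"):
    con = sqlite3.connect(path)
    row = con.execute("SELECT recs FROM wordlists WHERE word = ?", (word,)).fetchone()
    con.close()
    return pickle.loads(row[0]) if row else None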
Alternatively, using arrays, which is more memory efficient and probably faster for building the index. PyPy is more than 10x faster than Python 2.7 here.
import array

data = [['foo', 'bar', 'joe'],
        ['foo'],
        ['bar', 'joe'],
        ['zoo']]

wordlists = {}
print "build wordlists"
for x, d in enumerate(data):
    for word in d:
        try:
            wordlists[word].append(x)
        except:
            wordlists[word] = array.array("i", [x])

print "query"
query = ["foo", "zoo"]
results = set()
for q in query:
    wordlist = wordlists.get(q)
    if wordlist:
        for i in wordlist:
            results.add(i)
l = list(results)
l.sort()
for x in l:
    print data[x]
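For the compressed-bitmap variant mentioned at the top of this answer, a minimal sketch, assuming the third-party pyroaring package is installed (roaring-bitmap posting lists compress well and union quickly):

from pyroaring import BitMap

data = [['foo', 'bar', 'joe'],
        ['foo'],
        ['bar', 'joe'],
        ['zoo']]

# same inverted index, but each posting list is a compressed roaring bitmap
wordlists = {}
for recno, words in enumerate(data):
    for word in words:
        wordlists.setdefault(word, BitMap()).add(recno)

# union of the posting lists for the query words
hits = BitMap()
for q in ["foo", "zoo"]:
    hits |= wordlists.get(q, BitMap())
print(list(hits))  # record numbers containing at least one query word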

If you know that the number of unique tokens that you'll see is relatively small,
you can pretty easily build an efficient bitmask to query for matches.
The naive approach (in original post) will allow for up to 64 distinct tokens.
The improved code below uses the bitmask like a bloom filter (modular arithmetic in setting the bits wraps around 64). If there are more than 64 unique tokens, there will be some false positives, which the code below will automatically verify (using the original code).
Now the worst-case performance will degrade if the number of unique tokens is (much) larger than 64, or if you get particularly unlucky. Hashing could mitigate this.
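One way the hashing idea could look, as a sketch only (zlib.crc32 is just one stable hash choice; this is not part of the benchmarked code below):

import zlib

def token_bit(token, width=64):
    # stable hash -> one of `width` bit positions (Bloom-filter style)
    return zlib.crc32(token.encode('utf-8')) % width

mask = 0
for t in ['foo', 'bar', 'joe']:           # per-name mask
    mask |= 1 << token_bit(t)

def might_contain(mask, token):
    # false positives are possible, false negatives are not
    return bool(mask & (1 << token_bit(token)))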
As far as performance goes, using the benchmark data set below, I get:
Original Code: 4.67 seconds
Bitmask Code: 0.30 seconds
However, when the number of unique tokens is increased, the bitmask code remains efficient while the original code slows down considerably. With about 70 unique tokens, I get something like:
Original Code: ~15 seconds
Bitmask Code: 0.80 seconds
Note: for this latter case, building the bitmask array from the supplied list takes about as much time as building the dataframe. There's probably no real reason to build the dataframe at all; I left it in mainly for ease of comparison with the original code.
import time

import numpy as np
import pandas as pd

class WordLookerUpper(object):
    def __init__(self, token_lists):
        tic = time.time()
        self.df = pd.DataFrame(token_lists,
                               index=pd.Index(
                                   data=['id%d' % i for i in range(len(token_lists))],
                                   name='index'))
        print('took %d seconds to build dataframe' % (time.time() - tic))
        tic = time.time()
        dii = {}
        iid = 0
        self.bits = np.zeros(len(token_lists), np.int64)
        for i in range(len(token_lists)):
            for t in token_lists[i]:
                if t not in dii:
                    dii[t] = iid
                    iid += 1
                # set the bit; note that b = dii[t] % 64
                # this 'wrap around' behavior lets us use this
                # bitmask as a probabilistic filter
                b = dii[t] % 64
                self.bits[i] |= (1 << b)
        self.string_to_iid = dii
        print('took %d seconds to build bitmask' % (time.time() - tic))

    def filter_by_tokens(self, tokens, df=None):
        if df is None:
            df = self.df
        tic = time.time()
        # search within each column and then concatenate and dedup results
        results = [df.loc[lambda df: df[i].isin(tokens)] for i in range(df.shape[1])]
        results = pd.concat(results).reset_index().drop_duplicates().set_index('index')
        print('took %0.2f seconds to find %d matches using original code' % (
            time.time() - tic, len(results)))
        return results

    def filter_by_tokens_with_bitmask(self, search_tokens):
        tic = time.time()
        bitmask = np.zeros(len(self.bits), np.int64)
        verify = np.zeros(len(self.bits), np.int64)
        verification_needed = False
        for t in search_tokens:
            bit = 1 << (self.string_to_iid[t] % 64)   # same wrap-around as in __init__
            bitmask |= (self.bits & bit)
            if self.string_to_iid[t] >= 64:
                verification_needed = True
                verify |= (self.bits & bit)
        if verification_needed:
            results = self.df[(bitmask > 0) & ~verify.astype(bool)]
            results = pd.concat([results,
                                 self.filter_by_tokens(search_tokens,
                                                       self.df[(bitmask > 0) & verify.astype(bool)])])
        else:
            results = self.df[bitmask > 0]
        print('took %0.2f seconds to find %d matches using bitmask code' % (
            time.time() - tic, len(results)))
        return results
Make some test data
unique_token_lists = [
    ['foo', 'bar', 'joe'],
    ['foo'],
    ['bar', 'joe'],
    ['zoo'],
    ['ziz', 'zaz', 'zuz'],
    ['joe'],
    ['joey', 'joe'],
    ['joey', 'joe', 'joe', 'shabadoo']
]

token_lists = []
for n in range(1000000):
    token_lists.extend(unique_token_lists)
Run the original code and the bitmask code
>>> wlook = WordLookerUpper(token_lists)
took 5 seconds to build dataframe
took 10 seconds to build bitmask
>>> wlook.filter_by_tokens(['foo','zoo']).tail(n=1)
took 4.67 seconds to find 3000000 matches using original code
id7999995 zoo None None None
>>> wlook.filter_by_tokens_with_bitmask(['foo','zoo']).tail(n=1)
took 0.30 seconds to find 3000000 matches using bitmask code
id7999995 zoo None None None

Related

Nested Loop Optimisation in Python for a list of 50K items

I have a csv file with roughly 50K rows of search engine queries. Some of the search queries are the same, just in a different word order, for example "query A this is" and "this is query A".
I've tested using fuzzywuzzy's token_sort_ratio function to find matching word-order queries, which works well; however, I'm struggling with the runtime of the nested loop and am looking for optimisation tips.
Currently the nested for loops take around 60 hours to run on my machine. Does anyone know how I might speed this up?
Code below:
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
import pandas as pd
from tqdm import tqdm

filePath = '/content/queries.csv'
df = pd.read_csv(filePath)

table1 = df['keyword'].to_list()
table2 = df['keyword'].to_list()

data = []
for kw_t1 in tqdm(table1):
    for kw_t2 in table2:
        score = fuzz.token_sort_ratio(kw_t1, kw_t2)
        if score == 100 and kw_t1 != kw_t2:
            data += [[kw_t1, kw_t2, score]]

data_df = pd.DataFrame(data, columns=['query', 'queryComparison', 'score'])
Any advice would be appreciated.
Thanks!
Since what you are looking for are strings consisting of identical words (just not necessarily in the same order), there is no need to use fuzzy matching at all. You can instead use collections.Counter to create a frequency dict for each string, and group the strings in a dict of lists keyed by their frequency dicts. You can then output the sub-lists whose length is greater than 1.
Since dicts are not hashable, you can make them keys of a dict by converting them to frozensets of tuples of key-value pairs first.
This improves the time complexity from the O(n^2) of your code to O(n), while also avoiding the overhead of fuzzy matching.
from collections import Counter

matches = {}
for query in df['keyword']:
    matches.setdefault(frozenset(Counter(query.split()).items()), []).append(query)

data = [match for match in matches.values() if len(match) > 1]
Demo: https://replit.com/#blhsing/WiseAfraidBrackets
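For a quick sanity check, the grouping above behaves like this on a tiny made-up keyword list (not the asker's data):

from collections import Counter

keywords = ["query A this is", "this is query A", "unrelated query"]
matches = {}
for query in keywords:
    matches.setdefault(frozenset(Counter(query.split()).items()), []).append(query)

print([m for m in matches.values() if len(m) > 1])
# [['query A this is', 'this is query A']]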
I don't think you need fuzzywuzzy here: you are just checking for equality (score == 100) of the sorted queries, but with token_sort_ratio you are sorting the queries over and over. So I suggest that you:
create a "base" list and a "sorted-elements" one
iterate on the elements.
This will still be O(n^2), but you will be sorting 50_000 strings instead of 2_500_000_000!
import pandas as pd

filePath = '/content/queries.csv'
df = pd.read_csv(filePath)

table_base = df['keyword'].to_list()
table_sorted = [sorted(kw.split()) for kw in table_base]  # sort each query's words once, up front

data = []
ln = len(table_base)
for i in range(ln - 1):
    for j in range(i + 1, ln):
        if table_sorted[i] == table_sorted[j]:
            data += [[table_base[i], table_base[j], 100]]

data_df = pd.DataFrame(data, columns=['query', 'queryComparison', 'score'])
Apply in pandas usually works faster than a plain Python loop:

kw_list = df['keyword'].to_list()

def compare(kw_t1):
    # collect every other keyword that is a word-order duplicate of kw_t1
    found_duplicates = []
    for kw_t2 in kw_list:
        score = fuzz.token_sort_ratio(kw_t1, kw_t2)
        if score == 100 and kw_t1 != kw_t2:
            found_duplicates.append(kw_t2)
    return found_duplicates

df["duplicates"] = df['keyword'].apply(compare)

Improve performance of dataframe-like structure

I'm facing a data-structure challenge regarding a process in my code in which I need to count the frequency of strings in positive and negative examples.
It's a large bottleneck and I don't seem to be able to find a better solution.
I have to go through every long string in the dataset and extract substrings, whose frequency I need to count. In a perfect world, a pandas dataframe of the following shape would be ideal:
string | frequency positive | frequency negative
------------------------------------------------
str1   | 5                  | 7
str2   | 2                  | 4
...
However, for obvious performance limits, this is not acceptable.
My solution is to use a dictionary to track the rows, and a Nx2 numpy matrix to track the frequency. This is also done because after this, I need to have the frequency in a Nx2 numpy matrix anyway.
Currently, my solution is something like this:
str_freq = np.zeros((N, 2), dtype=np.uint32)
str_dict = {}
str_dict_counter = 0

for i, string in enumerate(dataset):
    substrings = extract(string)  # substrings is a List[str]
    for substring in substrings:
        row = str_dict.get(substring, None)
        if row is None:
            str_dict[substring] = str_dict_counter
            row = str_dict_counter
            str_dict_counter += 1
        str_freq[row, target[i]] += 1  # target[i] is equal to 1 or 0
However, it is really the bottleneck of my code, and I'd like to speed it up.
Some things about this code are incompressible, for instance the extract(string), so that loop has to remain. However, if possible there is no problem with using parallel processing.
What I'm wondering especially is whether there is a way to improve the inner loop. Python is known to be bad with loops, and this one seems a bit pointless; however, since we can't (to my knowledge) do multiple gets and sets on dictionaries like we can with numpy arrays, I don't know how I could improve it.
What do you suggest doing? Is the only solution to rewrite it in some lower-level language?
I also thought about using SQLite, but I don't know if it's worth it.
For the record, this has to take in about 10MB of data; it currently takes about 45 seconds, but needs to be done repeatedly with new data each time.
EDIT: Added example to test yourself
import random
import string
import re
import numpy as np
import pandas as pd

def get_random_alphaNumeric_string(stringLength=8):
    return bytes(bytearray(np.random.randint(0, 256, stringLength, dtype=np.uint8)))

def generate_dataset(n=10000):
    d = []
    for i in range(n):
        rnd_text = get_random_alphaNumeric_string(stringLength=1000)
        d.append(rnd_text)
    return d

def test_dict(dataset):
    pattern = re.compile(b"(q.{3})")
    target = np.random.randint(0, 2, len(dataset))
    str_freq = np.zeros((len(dataset) * len(dataset[0]), 2), dtype=np.uint32)
    str_dict = {}
    str_dict_counter = 0
    for i, string in enumerate(dataset):
        substrings = pattern.findall(string)  # substrings is a List[str]
        for substring in substrings:
            row = str_dict.get(substring, None)
            if row is None:
                str_dict[substring] = str_dict_counter
                row = str_dict_counter
                str_dict_counter += 1
            str_freq[row, target[i]] += 1  # target[i] is equal to 1 or 0
    return str_dict, str_freq[:str_dict_counter, :]
def test_df(dataset):
    pattern = re.compile(b"(q.{3})")
    target = np.random.randint(0, 2, len(dataset))
    df = pd.DataFrame(columns=["str", "pos", "neg"])
    df.astype(dtype={"str": bytes, "pos": int, "neg": int}, copy=False)
    df = df.set_index("str")
    for i, string in enumerate(dataset):
        substrings = pattern.findall(string)  # substrings is a List[str]
        for substring in substrings:
            check = substring in df.index
            if not check:
                row = [0, 0]
                row[target[i]] = 1
                df.loc[substring] = row
            else:
                df.loc[substring][target[i]] += 1
    return df
dataset = generate_dataset(1000000)
d,f = test_dict(dataset) # takes ~10 seconds on my laptop
# to get the value of some key, say b'q123'
f[d[b'q123'],:]
d = test_df(dataset) # takes several minutes (hasn't finished yet)
# the same but with a dataframe
d.loc[b'q123']

Scoring consistency within dataset

Suppose I am given a set of structured data. The data is known to be problematic, and I need to somehow "score" them on consistency. For example, I have the data as shown below:
fieldA | fieldB | fieldC
-------+--------+-------
foo | bar | baz
fooo | bar | baz
foo | bar | lorem
.. | .. | ..
lorem | ipsum | dolor
lorem | upsum | dolor
lorem | ipsum | baz
So assume the first row is considered the correct entry because there are relatively more records with that combination compared to the second and third rows. In the second row, the value for fieldA should be foo (inconsistent due to a misspelling). Then in the third row, the value of fieldC should be baz, as other entries in the dataset with similar values for fieldA (foo) and fieldB (bar) suggest.
Also, in another part of the dataset, there's a different combination that is relatively more common (lorem, ipsum, dolor). The problem in the following records is the same as the one mentioned before, just that the value combination is different.
I initially dumped everything into a SQL database and used statements with GROUP BY to check the consistency of field values. So there will be 1 query for each field I want to check for consistency, for each record.
SELECT fieldA, count(fieldA)
FROM cache
WHERE fieldB = 'bar' and fieldC = 'baz'
GROUP BY fieldA
Then I could check if the value of fieldA of a record is consistent with the rest by referring the record to the object below (processed result of the previous SQL query).
{'foo':  {'consistency': 0.99, 'count': 99, 'total': 100},
 'fooo': {'consistency': 0.01, 'count': 1,  'total': 100}}
However it was very slow (the dataset has about 2.2 million records, and I am checking 4 fields, so I'm making about 9 million queries), and it would take half a day to complete. Then I replaced the SQL storage with Elasticsearch, and the processing time shrank to about 5 hours. Can it be made faster somehow?
Also just out of curiosity, am I re-inventing a wheel here? Is there an existing tool for this? Currently it is implemented in Python3, with elasticsearch.
I just read your question and found it quite interesting. I did something similar using nltk (the Python Natural Language Toolkit).
Anyway, in this case I think you don't need sophisticated string comparison algorithms.
So I tried an approach using the Python difflib. The title sounds promising: difflib — Helpers for computing deltas.
The difflib.SequenceMatcher class says:
This is a flexible class for comparing pairs of sequences of any type, so long as the sequence elements are hashable.
By the way, I think that if you want to save time you could hold and process 2.000.000 3-tuples of (relatively short) strings easily in memory. (See the test runs and memory usage below.)
So I wrote a demo app that produces 2.000.000 (you can vary that) 3-tuples of randomly, slightly shuffled strings. The shuffled strings are based on and compared with a default pattern like yours: ['foofoo', 'bar', 'lorem']. It then compares them using difflib.SequenceMatcher. All in memory.
Here is the compare code:
import collections
import difflib

def compare(intuple, pattern_list):
    """
    compare two strings with difflib
    intuple: in this case an n-tuple of strings
    pattern_list: a given pattern list.
    n-tuple and list must be of the same length.
    return a dict (Ordered) with the tuple and the score
    """
    d = collections.OrderedDict()
    d["tuple"] = intuple
    #d["pattern"] = pattern_list
    scorelist = []
    for counter in range(0, len(pattern_list)):
        score = difflib.SequenceMatcher(None, intuple[counter].lower(), pattern_list[counter].lower()).ratio()
        scorelist.append(score)
    d["score"] = scorelist
    return d
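Called on a single made-up tuple (assuming the compare() defined above is in scope), it behaves like this:

result = compare(("fooo", "bar", "baz"), ["foofoo", "bar", "lorem"])
print(result["score"])  # one ratio per field: the identical 'bar' scores 1.0, the others score lower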
Here are the runtime and Memory usage results:
2000 3-tuples:
- compare time: 417 ms = 0,417 sec
- Mem Usage: 594 KiB
200.000 3-tuples:
- compare time: 5360 ms = 5,3 sec
- Mem Usage: 58 MiB
2.000.000 3-tuples:
- compare time: 462241 ms = 462 sec
- Mem Usage: 580 MiB
So it scales linearly in time and memory usage. And it (only) needs 462 seconds to compare 2.000.000 3-tuples of strings.
The result looks like this (example for 2.000.000 rows):
[ TIMIMG ]
build function took 53304.028034 ms
[ TIMIMG ]
compare_all function took 462241.254807 ms
[ INFO ]
num rows: 2000000
pattern: ['foofoo', 'bar', 'lorem']
[ SHOWING 10 random results ]
0: {"tuple": ["foofoo", "bar", "ewrem"], "score": [1.0, 1.0, 0.6]}
1: {"tuple": ["hoofoo", "kar", "lorem"], "score": [0.8333333333333334, 0.6666666666666666, 1.0]}
2: {"tuple": ["imofoo", "bar", "lorem"], "score": [0.6666666666666666, 1.0, 1.0]}
3: {"tuple": ["foofoo", "bar", "lorem"], "score": [1.0, 1.0, 1.0]}
....
As you can see, you get a score based on the similarity of the string compared to the pattern. 1.0 means equal, and everything below gets worse the lower the score is.
difflib is known not to be the fastest algorithm for this, but I think 7 minutes is quite an improvement over half a day or 5 hours.
I hope this helps you (and is not a complete misunderstanding), but it was a lot of fun to program this yesterday. And I learned a lot. ;)
For example, tracking memory usage using tracemalloc. I'd never done that before.
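The basic tracemalloc pattern looks like this (standard library, Python 3.4+; a sketch rather than the exact harness used for the numbers above):

import tracemalloc

tracemalloc.start()
# ... build and compare the tuples here ...
current, peak = tracemalloc.get_traced_memory()
print("current: %.1f KiB, peak: %.1f KiB" % (current / 1024.0, peak / 1024.0))
tracemalloc.stop()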
I dropped the code to github (as a one file gist).

Discover different lines across similar files

I have a text file with many tens of thousands short sentences like this:
go to venice
come back from grece
new york here i come
from belgium to russia and back to spain
I run a tagging algorithm which produces a tagged output of this sentence file:
go to <place>venice</place>
come back from <place>grece</place>
<place>new york</place> here i come
from <place>belgium</place> to <place>russia</place> and back to <place>spain</place>
The algorithm runs over the input multiple times and each time produces slightly different tagging. My goal is to identify the lines where those differences occur. In other words, print all utterances for which the tagging differs across the N result files.
For example, N=10, so I get 10 tagged files. Suppose line 1 is tagged the same way in all 10 tagged files - do not print it. Suppose line 2 is tagged once this way and 9 times another way - print it. And so on.
For N=2 it's easy, I just run diff. But what should I do if I have N=10 results?
If you have the tagged files - just create a counter for each line of how many times you've seen it:
# use defaultdict for convenience
from collections import defaultdict

# start counting at 0
counter_dict = defaultdict(lambda: 0)

tagged_file_names = ['tagged1.txt', 'tagged2.txt', ...]

# add all lines of each file to dict
for file_name in tagged_file_names:
    with open(file_name) as f:
        # use enumerate to maintain order
        # produces (LINE_NUMBER, LINE CONTENT) tuples (hashable)
        for line_with_number in enumerate(f.readlines()):
            counter_dict[line_with_number] += 1

# print all values that do not repeat in all files (in same location)
for key, value in counter_dict.iteritems():
    if value < len(tagged_file_names):
        print "line number %d: [%s] only repeated %d times" % (
            key[0], key[1].strip(), value
        )
Walkthrough:
First of all, we create a data structure that lets us count our entries, which are numbered lines. This data structure is a collections.defaultdict with a default value of 0 - the count of a newly added line (increased to 1 with each add).
Then, we create the actual entry using a tuple, which is hashable, so it can be used as a dictionary key, and is by default deeply comparable to other tuples. This means (1, "lolz") is equal to (1, "lolz") but different from (1, "not lolz") or (2, "lolz") - so it fits our use of deep-comparing lines to account for content as well as position.
Now all that's left to do is add all entries using a straightforward for loop and see which keys (which correspond to numbered lines) do not appear in all files (that is, their count is less than the number of tagged files provided).
Example:
reut@tHP-EliteBook-8470p:~/python/counter$ cat tagged1.txt
123
abc
def
reut@tHP-EliteBook-8470p:~/python/counter$ cat tagged2.txt
123
def
def
reut@tHP-EliteBook-8470p:~/python/counter$ ./difference_counter.py
line number 1: [abc] only repeated 1 times
line number 1: [def] only repeated 1 times
If you compare all of them to the first text, then you can get a list of all texts that are different. This might not be the quickest way, but it would work.

import difflib

n1 = '1 2 3 4 5 6'
n2 = '1 2 3 4 5 6'
n3 = '1 2 4 5 6 7'

l = [n1, n2, n3]
m = [x for x in l if x != l[0]]  # the texts that differ from the first one
for text in m:
    # diff the word lists so each token is treated as one "line"
    diff = difflib.unified_diff(l[0].split(), text.split(), lineterm='')
    print '\n'.join(diff)

Matching strings for multiple data set in Python

I am working in Python and I need to match the strings of several data files. First I use pickle to unpack my files and then I place them into a list. I only want to match strings that have the same conditions. These conditions are indicated at the end of the string.
My working script looks approximately like this:
import pickle

f = open("data_a.dat")
list_a = pickle.load(f)
f.close()

f = open("data_b.dat")
list_b = pickle.load(f)
f.close()

f = open("data_c.dat")
list_c = pickle.load(f)
f.close()

f = open("data_d.dat")
list_d = pickle.load(f)
f.close()

for a in list_a:
    for b in list_b:
        for c in list_c:
            for d in list_d:
                if a.GetName()[12:] in b.GetName():
                    if a.GetName()[12:] in c.GetName():
                        if a.GetName()[12:] in d.GetName():
                            "do whatever"
This seems to work fine for these 2 lists. The problems begin when I try to add 8 or 9 more data files for which I also need to match the same conditions. The script simply won't finish; it gets stuck. I appreciate your help.
Edit: Each of the lists contains histograms named after the parameters that were used to create them. The name of a histogram contains these parameters and their values at the end of the string. In the example I did it for 2 data sets; now I would like to do it for 9 data sets without using multiple loops.
Edit 2: I just expanded the code to reflect more accurately what I want to do. Now if I try to do that for 9 lists, it not only looks horrible, but it also doesn't work.
Off the top of my head:

import pickle

files = ["file_a", "file_b", "file_c"]
sets = []
for fname in files:
    f = open(fname)   # open each file in turn (the original opened "data_a.dat" every time)
    sets.append(set(pickle.load(f)))
    f.close()

intersection = sets[0].intersection(*sets[1:])
EDIT: Well I overlooked your mapping to x.GetName()[12:], but you should be able to reduce your problem to set logic.
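A sketch of that set-logic idea applied to the GetName()[12:] suffix (this assumes every histogram name carries its condition string in the same position, which may not hold for your data):

lists = [list_a, list_b, list_c, list_d]   # extend with the other data sets

# one dict per data set, keyed by the condition suffix of each histogram name
by_suffix = [dict((h.GetName()[12:], h) for h in lst) for lst in lists]

# conditions that are present in every data set
common = set(by_suffix[0]).intersection(*by_suffix[1:])

for cond in common:
    matched = [d[cond] for d in by_suffix]   # one histogram per data set
    # "do whatever" with the matched group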
Here is a small piece of code to take inspiration from. The main idea is the use of a recursive function.
For simplicity's sake, I assume that the data is already loaded into lists, but you can get it from files first:
data_files = [
    'data_a.dat',
    'data_b.dat',
    'data_c.dat',
    'data_d.dat',
    'data_e.dat',
]
lists = [pickle.load(open(f)) for f in data_files]
And because I don't really get the details of what you need to do, my goal here is to find the matches on the first four characters:
def do_wathever(string):
    print "I have match the string '%s'" % string

lists = [
    ["hello", "world", "how", "grown", "you", "today", "?"],
    ["growl", "is", "a", "now", "on", "appstore", "too bad"],
    ["I", "wish", "I", "grow", "Magnum", "mustache", "don't you?"],
]

positions = [0 for i in range(len(lists))]

def recursive_match(positions, lists):
    strings = map(lambda p, l: l[p], positions, lists)
    match = True
    searched_string = strings.pop(0)[:4]
    for string in strings:
        if searched_string not in string:
            match = False
            break
    if match:
        do_wathever(searched_string)
    # increment positions:
    new_positions = positions[:]
    lists_len = len(lists)
    for i, l in enumerate(reversed(lists)):
        max_position = len(l) - 1
        list_index = lists_len - i - 1
        current_position = positions[list_index]
        if max_position > current_position:
            new_positions[list_index] += 1
            break
        else:
            new_positions[list_index] = 0
            continue
    return new_positions, not any(new_positions)

search_is_finished = False
while not search_is_finished:
    positions, search_is_finished = recursive_match(positions, lists)

Of course you can optimize a lot of things here, this is draft code, but take a look at the recursive function, this is a major concept.
In the end I used the map built-in function. I realize now I should have been even more explicit than I was (which I will do in the future).
My data files are histograms with 5 parameters, some with 3 or 4. Something like this,
par1=["list with some values"]
par2=["list with some values"]
par3=["list with some values"]
par4=["list with some values"]
par5=["list with some values"]
I need to examine the behavior of the plotted quantity for each possible combination of the parameter values. In the end, I get a data file with ~300 histograms, each identified in its name by the corresponding parameter values and the sample name. It looks something like,
datasample1-par1=val1-par2=val2-par3=val3-par4=val4-par5=val5
datasample1-"permutation of the above values"
...
datasample9-par1=val1-par2=val2-par3=val3-par4=val4-par5=val5
datasample9-"permutation of the above values"
So I get 300 histograms for each of the 9 data files, but luckily all of these histograms are created in the same order. Hence I can pair all of them just by using the map built-in function. I unpack the data files, put each into a list and then use the map function to pair each histogram with its corresponding configuration in the other data samples.
for lst in map(None, data1_histosli, data2_histosli, ...data9_histosli):
    do_something(lst)
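(Note: map(None, ...) only exists in Python 2; on Python 3 the same pairing can be written with itertools.zip_longest. A sketch using the same list variables:)

from itertools import zip_longest

# list all nine histogram lists here, in the same order as above
for lst in zip_longest(data1_histosli, data2_histosli, data9_histosli):
    do_something(lst)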
This solves my problem. Thank you to all for your help!
