I am facing a machine learning problem; the training data consists of numeric, categorical and date features. I started training on the numerics and dates only (the dates I converted to numeric features such as epoch, weekday, hour, and so on). Apart from a poor score, performance is very good (seconds of training on one million entries).
The problem is with the categorical columns, most of which have many values, up to thousands of distinct ones.
The values are things like equipment brands and comments, and they are entered by humans, so I assume there is a lot of near-duplication. I can sacrifice a bit of real-world fidelity in the data (and hence score) for feasibility (training time).
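For reference, a minimal sketch of the kind of date-to-numeric conversion described above (assuming a pandas DataFrame with a datetime column named date; the column name is hypothetical):

import pandas as pd

# hypothetical frame with a single datetime column called "date"
df = pd.DataFrame({"date": pd.to_datetime(["2016-03-01 16:45", "2016-03-02 09:00"])})

# expand the date into simple numeric features: epoch seconds, weekday, hour
df["epoch"] = (df["date"] - pd.Timestamp("1970-01-01")) // pd.Timedelta("1s")
df["weekday"] = df["date"].dt.dayofweek
df["hour"] = df["date"].dt.hour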
Programming challenge: I came up with the following, based on this nice performance analysis:
import difflib

def gcm1(strings):
    """Greedy clustering: map each string to the first close match already seen."""
    clusters = {}
    co = 0
    for string in strings:
        if co % 10000 == 0:
            print(co)
        co = co + 1
        if string in clusters:
            clusters[string].append(string)
        else:
            match = difflib.get_close_matches(string, clusters.keys(), 1, 0.90)
            if match:
                clusters[match[0]].append(string)
            else:
                clusters[string] = [string]
    return clusters

def reduce(lines_):
    clusters = gcm1(lines_)
    # invert: map every member back to its cluster representative
    clusters = dict((v, k) for k in clusters for v in clusters[k])
    return [clusters.get(item, item) for item in lines_]
For example:
reduce(['XHSG11', 'XHSG8', 'DOIIV', 'D.OIIV ', ...])
=> ['XHSG11', 'XHSG11', 'DOIIV', 'DOIIV ', ...]
I am pretty much bound to Python, so I couldn't get other C-implemented code running.
Obviously, the call to difflib.get_close_matches in each iteration is the most expensive part.
Is there a better alternative, or a better way to structure my algorithm?
As I said, with a million entries over, let's say, 10 columns, I can't even estimate when the algorithm will finish (more than 3 hours and still running on my machine with 16 GB of RAM and an i7 4790K CPU).
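To illustrate why this degrades (a rough sketch with made-up sizes, not the real data): get_close_matches compares the query against every existing cluster key, so each insertion costs one SequenceMatcher ratio per key, and the whole pass is roughly O(n × number_of_clusters) comparisons.

import difflib
import random
import string
import time

# hypothetical: time a single lookup against 20,000 already-created cluster keys
keys = [''.join(random.choice(string.ascii_uppercase) for _ in range(8))
        for _ in range(20000)]
start = time.time()
difflib.get_close_matches('XHSG11', keys, 1, 0.90)
print('one lookup against 20k keys took %.2f s' % (time.time() - start))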
The data looks like this (extract):
Comments: [nan '1er rdv' '16H45-VE' 'VTE 2016 APRES 9H'
'ARM : SERENITE DV. RECUP.CONTRAT. VERIF TYPE APPAREIL. RECTIF TVA SI NECESSAIRE']
422227 different values
MODELE_CODE: ['VIESK02534' 'CMA6781031' 'ELMEGLM23HNATVMC' 'CMACALYDRADELTA2428FF'
'FBEZZCIAO3224SVMC']
10206 values
MARQUE_LIB: ['VIESSMANN' 'CHAFFOTEAUX ET MAURY' 'ELM LEBLANC' 'FR BG' 'CHAPPEE']
167 values
... more columns
I need to build a REST API/server which responds to more than 15,000 HTTP GET requests per second in under 80 ms. If necessary I could run multiple instances behind a load balancer.
The server gets a request with a list of criteria (around 20); these need to be parsed and compared against a ruleset (about 2,000 rules, each with its own values for the 20 criteria and a final decision), which determines the response (yes or no).
Sample Request payload:
{"Country" : "DE",
"ID" : "998423-423432-4234234-234234",
"Criteria1": "8748r78",
"Criteria2": "Some String",
[...]
}
Sample ruleset (still to be decided, but let's start with a simple design):
+--------+---------+-------------+--------------+
| RuleId | Country | Criteria1 | Criteria2 | etc...
+--------+---------+-------------+--------------+
| 1 | UK | SomeString1 | SomeString3 |
| 2 | UK | SomeString1 | SomeString2 |
| 3 | US | SomeString4 | * (Wildcard) |
+--------+---------+-------------+--------------+
Every criterion can take between 1 and probably around 400 different values, all strings (e.g. GEOs as ISO codes). Some might be empty and should be treated as wildcards. Theoretically there could be entries with all 20 criteria having the same value, but that is a topic for the yet-to-be-written rule engine to sort out.
I did some research on how to achieve this:
Using sanic as a web server for high throughput; according to my research it is the fastest option for Python, excluding japronto, which is in alpha. Edit: does anyone have experience with the performance of a Python-based web server + web framework for a similar use case? I have only read benchmarks, which usually use a very simple test case (just responding with a fixed string to a request, hence the high number of requests per second in all the benchmarks).
Using sqlite3 (in memory) for the rule lookup; I am not sure whether a SQL statement with 20 constraints is fast enough. Maybe there is another way to compare every request to the ruleset over the 20 criteria (each one a string comparison). Edit: thanks to a commenter, I might precompute the rules into hashes and use those hashes for lookup, so a database for the real-time lookup would not be needed.
Using redis or another database to store the precomputed rules (that is another topic) and have them ready to be loaded into every instance/worker of the HTTP server and thus into its sqlite3 database.
Maybe using pypy3 for an additional speedup, but I have no experience with pypy.
I would host this on Heroku.
So the question is: Which libraries and thus architecture would allow that kind of speed with python?
I will assume that
all given criteria are exact string matches
all unspecified criteria match anything (wildcard)
we can discard all rules which produce False
rules may contain None which matches anything (wildcard)
the result is True if there is at least one rule that matches all given criteria, else False
We can build a fast look-up as a dict (column) of dict (value) of set (matching rule ids):
from collections import namedtuple

WILDCARD = None

Rule = namedtuple("Rule", ["Country", "Criteria1", "Criteria2"])

rules = [
    Rule("UK", "Somestring1", "Somestring3"),
    Rule("UK", "Somestring1", "Somestring2"),
    Rule("US", "Somestring4", WILDCARD)
]

def build_lookup(rules):
    columns = Rule._fields
    # create lookup table (special handling of wildcard entries)
    lookup = {column: {WILDCARD: set()} for column in columns}
    # index rules by criteria
    for id, rule in enumerate(rules):
        for column, value in zip(columns, rule):
            if value in lookup[column]:
                lookup[column][value].add(id)
            else:
                lookup[column][value] = {id}
    return lookup

rule_lookup = build_lookup(rules)
With the given sample data, rule_lookup now contains
{
    'Country': {WILDCARD: set(), 'UK': {0, 1}, 'US': {2}},
    'Criteria1': {WILDCARD: set(), 'Somestring1': {0, 1}, 'Somestring4': {2}},
    'Criteria2': {WILDCARD: {2}, 'Somestring2': {1}, 'Somestring3': {0}}
}
then we can quickly match criteria to rules like
def all_matching_rules(criteria):
    """
    criteria is a dict of {column: value} to match.
    Return a set of all rule ids which match the criteria.
    """
    if criteria:
        result = empty = set()
        first = True
        for column, value in criteria.items():
            ids = rule_lookup[column].get(value, empty) | rule_lookup[column][WILDCARD]
            if first:
                result = ids
                first = False
            else:
                result &= ids  # intersection of sets
            # short-circuit if the result is already the empty set
            if not result:
                break
        return result
    else:
        # no criteria, return everything
        return set(range(len(rules)))

def any_rule_matches(criteria):
    """
    criteria is a dict of {column: value} to match.
    Return True if any rule matches the criteria, else False.
    """
    if criteria:
        return bool(all_matching_rules(criteria))
    else:
        return bool(len(rules))
which runs like
>>> all_matching_rules({"Country": "UK", "Criteria2": "Somestring8"})
set()
>>> all_matching_rules({"Country": "US", "Criteria2": "Somestring8"})
{2}
>>> any_rule_matches({"Country": "UK", "Criteria2": "Somestring8"})
False
>>> any_rule_matches({"Country": "US", "Criteria2": "Somestring8"})
True
Timeit reports that this runs in about 930ns on my machine - should be plenty fast enough ;-)
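To connect this back to the web-server part of the question, here is a minimal, untested sketch of how the lookup might be wired into a sanic handler; the route name and the payload filtering are my assumptions, not part of the answer above:

from sanic import Sanic
from sanic.response import json as json_response

app = Sanic("rule_checker")

@app.post("/check")
async def check(request):
    payload = request.json or {}
    # keep only the keys that are actual rule columns; ignore ID etc.
    criteria = {k: v for k, v in payload.items() if k in Rule._fields}
    return json_response({"decision": any_rule_matches(criteria)})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8000)

Since the lookup itself is sub-microsecond, the per-request budget would then be dominated by HTTP parsing and JSON handling rather than by the rule matching.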
I have this dataframe:
order_id product_id user_id
2 33120 u202279
2 28985 u202279
2 9327 u202279
4 39758 u178520
4 21351 u178520
5 6348 u156122
5 40878 u156122
Type user_id : String
Type product_id : Integer
I would like to use this dataframe to create a Doc2vec corpus, so I need to use the LabeledSentence function to create entries of the form:
{tags: user_id, words: all product ids ordered by each user_id}
But the dataframe's shape is (32434489, 3), so I should avoid using a loop to create my LabeledSentence objects.
I tried to run the function below with multiprocessing, but it takes too long.
Do you have any idea how to transform my dataframe into the right format for a Doc2vec corpus, where the tag is the user_id and the words are the list of products for that user_id?
def append_to_sequences(i):
    user_id = liste_user_id.pop(0)
    liste_produit_userID = data.ix[data["user_id"] == user_id, "product_id"].astype(str).tolist()
    return doc2vec.LabeledSentence(words=liste_produit_userID, tags=user_id)

pool = multiprocessing.Pool(processes=3)
result = pool.map_async(append_to_sequences, np.arange(len_liste_unique_user))
pool.close()
pool.join()
sentences = result.get()
Using multiprocessing is likely overkill. The forking of processes can wind up duplicating all existing memory, and involve excess communication marshalling results back into the master process.
Using a loop should be OK. 34 million rows (and far fewer unique user_ids) isn't that much, depending on your RAM.
Note that in recent versions of gensim TaggedDocument is the preferred class for Doc2Vec examples.
If we were to assume you have a list of all unique user_ids in liste_user_id, and a (new, not shown) function that gets the list-of-words for a user_id called words_for_user(), creating the documents for Doc2Vec in memory could be as simple as:
documents = [TaggedDocument(words=words_for_user(uid), tags=[uid])
             for uid in liste_user_id]
Note that tags should be a list of tags, not a single tag – even though in many common cases each document only has a single tag. (If you provide a single string tag, it will see tags as a list-of-characters, which is not what you want.)
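For completeness, a hedged sketch of what words_for_user() could look like, built once as a groupby so each lookup is just a dictionary access (the groupby approach is my assumption, not part of the answer above):

# build a mapping user_id -> list of product_id strings in one pass
products_by_user = (data.groupby("user_id")["product_id"]
                        .apply(lambda s: s.astype(str).tolist()))

def words_for_user(uid):
    return products_by_user[uid]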
I have a list of names (strings), each divided into words. There are 8 million names, and each name consists of up to 20 words (tokens). The number of unique tokens is 2.2 million. I need an efficient way to find all names containing at least one word from the query (which may also contain up to 20 words, but usually only a few).
My current approach uses Python pandas and looks like this (later referred to as original):
>>> df = pd.DataFrame([['foo', 'bar', 'joe'],
                       ['foo'],
                       ['bar', 'joe'],
                       ['zoo']],
                      index=['id1', 'id2', 'id3', 'id4'])
>>> df.index.rename('id', inplace=True) # btw, is there a way to include this into prev line?
>>> print df
0 1 2
id
id1 foo bar joe
id2 foo None None
id3 bar joe None
id4 zoo None None
def filter_by_tokens(df, tokens):
    # search within each column and then concatenate and dedup results
    results = [df.loc[lambda df: df[i].isin(tokens)] for i in range(df.shape[1])]
    return pd.concat(results).reset_index().drop_duplicates().set_index(df.index.name)
>>> print filter_by_tokens(df, ['foo', 'zoo'])
0 1 2
id
id1 foo bar joe
id2 foo None None
id4 zoo None None
Currently such a lookup (on the full dataset) takes 5.75 s on my (rather powerful) machine. I'd like to speed it up by at least, say, 10 times.
I was able to get it down to 5.29 s by squeezing all the columns into one and performing the lookup on that (later referred to as original, squeezed):
>>> df = pd.Series([{'foo', 'bar', 'joe'},
                    {'foo'},
                    {'bar', 'joe'},
                    {'zoo'}],
                   index=['id1', 'id2', 'id3', 'id4'])
>>> df.index.rename('id', inplace=True)
>>> print df
id
id1 {foo, bar, joe}
id2 {foo}
id3 {bar, joe}
id4 {zoo}
dtype: object
def filter_by_tokens(df, tokens):
    return df[df.map(lambda x: bool(x & set(tokens)))]
>>> print filter_by_tokens(df, ['foo', 'zoo'])
id
id1 {foo, bar, joe}
id2 {foo}
id4 {zoo}
dtype: object
But that's still not fast enough.
Another solution which seems to be easy to implement is to use Python multiprocessing (threading shouldn't help here because of GIL and there is no I/O, right?). But the problem with it is that the big dataframe needs to be copied to each process, which takes up all the memory. Another problem is that I need to call filter_by_tokens many times in a loop, so it would copy the dataframe on every call, which is inefficient.
Note that words may occur many times in names (e.g. the most popular word occurs 600k times in names), so a reverse index would be huge.
What is a good way to write this efficiently? Python solution preferred, but I'm also open to other languages and technologies (e.g. databases).
UPD:
I've measured the execution time of my two solutions and the 5 solutions suggested by @piRSquared in his answer. Here are the results (tl;dr: the best is a 2x improvement):
+--------------------+----------------+
| method | best of 3, sec |
+--------------------+----------------+
| original | 5.75 |
| original, squeezed | 5.29 |
| zip | 2.54 |
| merge | 8.87 |
| mul+any | MemoryError |
| isin | IndexingError |
| query | 3.7 |
+--------------------+----------------+
mul+any gives MemoryError on d1 = pd.get_dummies(df.stack()).groupby(level=0).sum() (on a 128Gb RAM machine).
isin gives IndexingError: Unalignable boolean Series key provided on s[d1.isin({'zoo', 'foo'}).unstack().any(1)], apparently because the shape of df.stack().isin(set(tokens)).unstack() is slightly smaller than the shape of the original dataframe (8.39M vs 8.41M rows); I'm not sure why, or how to fix that.
Note that the machine I'm using has 12 cores (though I mentioned some problems with parallelization above). All of the solutions utilize a single core.
Conclusion (as of now): there is a 2.1x improvement with zip (2.54 s) over the original squeezed solution (5.29 s). That's good, though I aimed for at least a 10x improvement, if possible. So I'm leaving the (still great) answer by @piRSquared unaccepted for now, to welcome more suggestions.
idea 0
zip
def pir(s, token):
    return s[[bool(p & token) for p in s]]

pir(s, {'foo', 'zoo'})
idea 1
merge
token = pd.DataFrame(dict(v=['foo', 'zoo']))
d1 = df.stack().reset_index('id', name='v')
s.ix[d1.merge(token).id.unique()]
idea 2
mul + any
d1 = pd.get_dummies(df.stack()).groupby(level=0).sum()
token = pd.Series(1, ['foo', 'zoo'])
s[d1.mul(token).any(1)]
idea 3
isin
d1 = df.stack()
s[d1.isin({'zoo', 'foo'}).unstack().any(1)]
idea 4
query
token = ('foo', 'zoo')
d1 = df.stack().to_frame('s')
s.ix[d1.query('s in #token').index.get_level_values(0).unique()]
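These snippets assume the df and the squeezed Series s from the question; a minimal setup to reproduce them might look like this (my reading of the intended shapes, not part of the original answer):

import pandas as pd

# df: the wide frame of tokens, s: the "squeezed" Series of token sets per name
df = pd.DataFrame([['foo', 'bar', 'joe'],
                   ['foo'],
                   ['bar', 'joe'],
                   ['zoo']],
                  index=pd.Index(['id1', 'id2', 'id3', 'id4'], name='id'))
s = df.stack().groupby(level=0).apply(set)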
I have done similar things with the following tools:
HBase - the key can have multiple columns (very fast)
ElasticSearch - nice and easy to scale; you just need to import your data as JSON
Apache Lucene - will be very good for 8 million records
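For illustration, a rough sketch of the ElasticSearch route using the elasticsearch-py client; the index name, field name and exact client arguments are my assumptions and vary across client versions:

from elasticsearch import Elasticsearch

es = Elasticsearch()

# index each name as a JSON document with its list of tokens
names = [['foo', 'bar', 'joe'], ['foo'], ['bar', 'joe'], ['zoo']]
for i, tokens in enumerate(names):
    es.index(index="names", id=i, body={"tokens": tokens})

# a "terms" query matches documents containing at least one of the given tokens
hits = es.search(index="names", body={"query": {"terms": {"tokens": ["foo", "zoo"]}}})
print(hits["hits"]["total"])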
You can do it with a reverse index; the code below, run in pypy, builds the index in 57 seconds, runs the query of 20 words in 0.00018 seconds and uses about 3.2 GB of memory. Python 2.7 builds the index in 158 seconds and runs the query in 0.0013 seconds, using about 3.41 GB of memory.
The fastest possible way to do this is with bitmapped reversed indexes, compressed to save space.
"""
8m records with between 1 and 20 words each, selected at random from 100k words
Build dictionary of sets, keyed by word number, set contains nos of all records
with that word
query merges the sets for all query words
"""
import random
import time records = 8000000
words = 100000
wordlists = {}
print "build wordlists"
starttime = time.time()
wordlimit = words - 1
total_words = 0
for recno in range(records):
for x in range(random.randint(1,20)):
wordno = random.randint(0,wordlimit)
try:
wordlists[wordno].add(recno)
except:
wordlists[wordno] = set([recno])
total_words += 1
print "build time", time.time() - starttime, "total_words", total_words
querylist = set()
query = set()
for x in range(20):
while 1:
wordno = (random.randint(0,words))
if wordno in wordlists: # only query words that were used
if not wordno in query:
query.add(wordno)
break
print "query", query
starttime = time.time()
for wordno in query:
querylist.union(wordlists[wordno])
print "query time", time.time() - starttime
print "count = ", len(querylist)
for recno in querylist:
print "record", recno, "matches"
Perhaps my first answer was a bit abstract; in the absence of real data it generated random data at roughly the required volume, to get a feel for the query time. This code is practical.
data = [['foo', 'bar', 'joe'],
        ['foo'],
        ['bar', 'joe'],
        ['zoo']]

wordlists = {}
print "build wordlists"
for x, d in enumerate(data):
    for word in d:
        try:
            wordlists[word].add(x)
        except KeyError:
            wordlists[word] = set([x])

print "query"
query = ["foo", "zoo"]
results = set()
for q in query:
    wordlist = wordlists.get(q)
    if wordlist:
        results = results.union(wordlist)
l = list(results)
l.sort()
for x in l:
    print data[x]
The cost in time and memory is in building the wordlists (inverted indices); the query is almost free. You have a 12-core machine, so presumably it has plenty of memory. For repeatability, build the wordlists, pickle each wordlist and write it to sqlite or any key/value database, with the word as key and the pickled set as a binary blob. Then all you need is:
initialise_database()

query = ["foo", "zoo"]
results = set()
for q in query:
    wordlist = get_wordlist_from_database(q)  # get binary blob and unpickle
    if wordlist:
        results = results.union(wordlist)
l = list(results)
l.sort()
for x in l:
    print data[x]
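A possible sketch of those two helper functions with sqlite3 and pickle (the table layout and the save_wordlist helper are my assumptions, following the placeholders above, not code from the original answer):

import pickle
import sqlite3

conn = sqlite3.connect("wordlists.db")

def initialise_database():
    conn.execute("CREATE TABLE IF NOT EXISTS wordlists (word TEXT PRIMARY KEY, recs BLOB)")
    conn.commit()

def save_wordlist(word, recnos):
    # store the pickled set of record numbers as a binary blob, keyed by word
    blob = sqlite3.Binary(pickle.dumps(recnos, pickle.HIGHEST_PROTOCOL))
    conn.execute("INSERT OR REPLACE INTO wordlists VALUES (?, ?)", (word, blob))
    conn.commit()

def get_wordlist_from_database(word):
    row = conn.execute("SELECT recs FROM wordlists WHERE word = ?", (word,)).fetchone()
    return pickle.loads(bytes(row[0])) if row else None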
Alternatively, use arrays, which are more memory-efficient and probably faster for building the index. pypy is more than 10x faster than 2.7 here.
import array

data = [['foo', 'bar', 'joe'],
        ['foo'],
        ['bar', 'joe'],
        ['zoo']]

wordlists = {}
print "build wordlists"
for x, d in enumerate(data):
    for word in d:
        try:
            wordlists[word].append(x)
        except KeyError:
            wordlists[word] = array.array("i", [x])

print "query"
query = ["foo", "zoo"]
results = set()
for q in query:
    wordlist = wordlists.get(q)
    if wordlist:
        for i in wordlist:
            results.add(i)
l = list(results)
l.sort()
for x in l:
    print data[x]
If you know that the number of unique tokens that you'll see is relatively small,
you can pretty easily build an efficient bitmask to query for matches.
The naive approach (in the original post) allows for up to 64 distinct tokens.
The improved code below uses the bitmask like a bloom filter (modular arithmetic in setting the bits wraps around 64). If there are more than 64 unique tokens, there will be some false positives, which the code below will automatically verify (using the original code).
Now the worst-case performance will degrade if the number of unique tokens is (much) larger than 64, or if you get particularly unlucky. Hashing could mitigate this.
As far as performance goes, using the benchmark data set below, I get:
Original Code: 4.67 seconds
Bitmask Code: 0.30 seconds
However, when the number of unique tokens is increased, the bitmask code remains efficient while the original code slows down considerably. With about 70 unique tokens, I get something like:
Original Code: ~15 seconds
Bitmask Code: 0.80 seconds
Note: for this latter case, building the bitmask array from the supplied list takes about as much time as building the dataframe. There's probably no real reason to build the dataframe; I left it in mainly for ease of comparison with the original code.
import time
import numpy as np
import pandas as pd

class WordLookerUpper(object):
    def __init__(self, token_lists):
        tic = time.time()
        self.df = pd.DataFrame(token_lists,
                               index=pd.Index(
                                   data=['id%d' % i for i in range(len(token_lists))],
                                   name='index'))
        print('took %d seconds to build dataframe' % (time.time() - tic))
        tic = time.time()
        dii = {}
        iid = 0
        self.bits = np.zeros(len(token_lists), np.int64)
        for i in range(len(token_lists)):
            for t in token_lists[i]:
                if t not in dii:
                    dii[t] = iid
                    iid += 1
                # set the bit; note the b = dii[t] % 64 --
                # this 'wrap around' behavior lets us use this
                # bitmask as a probabilistic filter
                b = dii[t] % 64
                self.bits[i] |= (1 << b)
        self.string_to_iid = dii
        print('took %d seconds to build bitmask' % (time.time() - tic))

    def filter_by_tokens(self, tokens, df=None):
        if df is None:
            df = self.df
        tic = time.time()
        # search within each column and then concatenate and dedup results
        results = [df.loc[lambda df: df[i].isin(tokens)] for i in range(df.shape[1])]
        results = pd.concat(results).reset_index().drop_duplicates().set_index('index')
        print('took %0.2f seconds to find %d matches using original code' % (
            time.time() - tic, len(results)))
        return results

    def filter_by_tokens_with_bitmask(self, search_tokens):
        tic = time.time()
        bitmask = np.zeros(len(self.bits), np.int64)
        verify = np.zeros(len(self.bits), np.int64)
        verification_needed = False
        for t in search_tokens:
            bitmask |= (self.bits & (1 << (self.string_to_iid[t] % 64)))
            if self.string_to_iid[t] >= 64:
                verification_needed = True
                verify |= (self.bits & (1 << (self.string_to_iid[t] % 64)))
        if verification_needed:
            # rows that cannot be false positives pass directly;
            # the possibly-false-positive rows are re-checked with the original code
            results = self.df[(bitmask > 0) & ~verify.astype(bool)]
            results = pd.concat([results,
                                 self.filter_by_tokens(search_tokens,
                                     self.df[(bitmask > 0) & verify.astype(bool)])])
        else:
            results = self.df[bitmask > 0]
        print('took %0.2f seconds to find %d matches using bitmask code' % (
            time.time() - tic, len(results)))
        return results
Make some test data
unique_token_lists = [
    ['foo', 'bar', 'joe'],
    ['foo'],
    ['bar', 'joe'],
    ['zoo'],
    ['ziz', 'zaz', 'zuz'],
    ['joe'],
    ['joey', 'joe'],
    ['joey', 'joe', 'joe', 'shabadoo']
]

token_lists = []
for n in range(1000000):
    token_lists.extend(unique_token_lists)
Run the original code and the bitmask code
>>> wlook = WordLookerUpper(token_lists)
took 5 seconds to build dataframe
took 10 seconds to build bitmask
>>> wlook.filter_by_tokens(['foo','zoo']).tail(n=1)
took 4.67 seconds to find 3000000 matches using original code
id7999995 zoo None None None
>>> wlook.filter_by_tokens_with_bitmask(['foo','zoo']).tail(n=1)
took 0.30 seconds to find 3000000 matches using bitmask code
id7999995 zoo None None None
One of my mappers produces logs distributed across files like part-0, part-1, part-2, etc. Each of these contains some queries and some associated data (a score) for each query:
part-0
q            score
1 ben 10     4.01
horse shoe   5.96
...
part-1
1 ben 10     3.23
horse shoe   2.98
...
and so on for part-2, part-3, etc.
Now the same query q, i.e. "1 ben 10" above, appears in part-0, part-1, etc.
Now I have to write a MapReduce phase in which I can collect the same queries and aggregate (add up) their scores.
My mapper function can be the identity, and the reducer will accomplish this task.
Output would be:
q            aggScore
1 ben 10     7.24
horse shoe   8.96
...
It seems to be a simple task, but I am not able to figure out how to proceed (I have read a lot but am still not really able to make progress). In terms of a generic algorithm, I would first collect the common queries and then add up their scores.
Any hints towards a Pythonic solution or the algorithm (MapReduce) would really be appreciated.
Here is the MapReduce solution:
Map input: each input file (part-0, part-1, part-2, ...) can be the input to an individual (separate) map task.
For each input line in the input file, the mapper emits <q, aggScore>. If there are multiple scores for a query in a single file, the map step sums them up; otherwise, if we know that each query appears in each file just once, the map can be an identity function, emitting <q, aggScore> for each input line as is.
The reducer input is of the form <q, list<aggScore1, aggScore2, ...>>. The reducer operation is similar to the well-known MapReduce word-count example. If you are using Hadoop, you can use the following method for the reducer.
public void reduce(Text q, Iterable<IntWritable> aggScore, Context context)
        throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : aggScore) {
        sum += val.get();
    }
    context.write(q, new IntWritable(sum));
}
The method will sum all aggScores for a particular q and give you the desired output. The Python code for the reducer should look something like this (here q is the key and the list of aggScores is the values):
def reduce(self, key, values, output, reporter):
    sum = 0
    while values.hasNext():
        sum += values.next().get()
    output.collect(key, IntWritable(sum))
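Since the question asked for a Pythonic route, here is a rough Hadoop Streaming sketch of the same idea. The file names, the header-skipping and the assumption that the score is the last whitespace-separated field on each line are mine; streaming delivers the reducer's input sorted by key, so each query's lines arrive grouped together.

# mapper.py -- turn each "query ... score" line into "query<TAB>score"
import sys

for line in sys.stdin:
    line = line.rstrip("\n")
    if not line or line.startswith("q "):   # skip blanks and the header line (assumption)
        continue
    q, score = line.rsplit(None, 1)         # score is the last whitespace-separated field
    print("%s\t%s" % (q, score))

# reducer.py -- sum scores per query; input arrives sorted by key
import sys

current_q, total = None, 0.0
for line in sys.stdin:
    q, score = line.rstrip("\n").split("\t")
    if q != current_q:
        if current_q is not None:
            print("%s\t%.2f" % (current_q, total))
        current_q, total = q, 0.0
    total += float(score)
if current_q is not None:
    print("%s\t%.2f" % (current_q, total))

It would be launched with the hadoop-streaming jar (passing -input, -output, -mapper and -reducer); the exact invocation depends on your cluster setup.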