Compare text to template to detect anomalies (reverse template) - python

I'm looking for an algorithm or even an algorithm space that deals with the problem of validating that short text (email) matches known templates. Coding will probably be python or perl, but that's flexible.
Here's the problem:
Servers with access to production data need to be able to send out email that will reach the Internet:
Dear John Smith,
We received your last payment for $123.45 on 2/4/13. We'd like you to be aware of the following charges:
$12.34 Spuznitz, LLC on 4/1
$43.21 1-800-FLOWERS on 4/2
As always, you can view these transactions in our portal.
Thank you for your business!
Obviously some of the email contents will vary - the salutation ("John Smith"), "$123.45 on 2/4/13", and the lines with transactions printed out. Other parts ("We received your last payment") are very static. I want to be able to match the static portions of the text and verify that the dynamic portions stay within certain reasonable limits (I might know, for example, that at most 5 transaction lines will ever be printed).
Because I am concerned about data exfiltration, I want to make sure email that doesn't match this template never goes out - I want to examine email and quarantine anything that doesn't look like what I expect. So I need to automate this template matching and block any email messages that are far enough away from matching.
So the question is, where do I look for a filtering mechanism? Bayesian filtering tries to verify a sufficient similarity between a specific message and a non-specific corpus, which is kind of the opposite problem. Things like Perl's Template module are a tight match - but for output, not for input or comparison. Simple 'diff' type comparisons won't handle the limited dynamic info very well.
How do I test to see if these outgoing email messages "quack like a duck"?

You could use grammars for tight matching. It is possible to organize regexes into grammars for easier abstraction: http://www.effectiveperlprogramming.com/blog/1479
Or you could use a dedicated grammar engine such as Marpa.
If you want a more statistical approach, consider n-grams. First, tokenize the text and replace variable chunks with meaningful placeholders, like CURRENCY and DATE. Then build the n-grams. Now you can use the Jaccard index to compare two texts.
Here is a Pure-Perl implementation which works on trigrams:
#!/usr/bin/env perl
use strict;
use utf8;
use warnings;

my $ngram1 = ngram(3, tokenize(<<'TEXT1'));
Dear John Smith,
We received your last payment for $123.45 on 2/4/13. We'd like you to be aware of the following charges:
$12.34 Spuznitz, LLC on 4/1
$43.21 1-800-FLOWERS on 4/2
As always, you can view these transactions in our portal.
Thank you for your business!
TEXT1

my $ngram2 = ngram(3, tokenize(<<'TEXT2'));
Dear Sally Bates,
We received your last payment for $456.78 on 6/9/12. We'd like you to be aware of the following charges:
$123,43 Gnomovision on 10/1
As always, you can view these transactions in our portal.
Thank you for your business!
TEXT2

my %intersection =
    map { exists $ngram1->[2]{$_} ? ($_ => 1) : () }
    keys %{$ngram2->[2]};
my %union =
    map { $_ => 1 }
    keys %{$ngram1->[2]}, keys %{$ngram2->[2]};

printf "Jaccard similarity coefficient: %0.3f\n", keys(%intersection) / keys(%union);

sub tokenize {
    my @words = split m{\s+}x, lc shift;
    for (@words) {
        s{\d{1,2}/\d{1,2}(?:/\d{2,4})?}{ DATE }gx;
        s{\d+(?:\,\d{3})*\.\d{1,2}}{ FLOAT }gx;
        s{\d+}{ INTEGER }gx;
        s{\$\s(?:FLOAT|INTEGER)\s}{ CURRENCY }gx;
        s{^\W+|\W+$}{}gx;
    }
    return @words;
}

sub ngram {
    my ($size, @words) = @_;
    --$size;
    my $ngram = [];
    for (my $j = 0; $j <= $#words; $j++) {
        my $k = $j + $size <= $#words ? $j + $size : $#words;
        for (my $l = $j; $l <= $k; $l++) {
            my @buf;
            for my $w (@words[$j..$l]) {
                push @buf, $w;
            }
            ++$ngram->[$#buf]{join(' ', @buf)};
        }
    }
    return $ngram;
}
You can use one text as a template and match it against your emails.
Check String::Trigram for an efficient implementation. Google Ngram Viewer is a nice resource to illustrate the n-gram matching.
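Since the question mentions Python as an option, here is a rough Python sketch of the same placeholder-plus-trigram idea; the regexes and placeholder names are illustrative, not a drop-in port of the Perl above.

import re

def tokenize(text):
    # Normalize the variable chunks into placeholders before tokenizing.
    text = text.lower()
    text = re.sub(r"\d{1,2}/\d{1,2}(?:/\d{2,4})?", " DATE ", text)
    text = re.sub(r"\$\d+(?:,\d{3})*\.\d{1,2}", " CURRENCY ", text)
    text = re.sub(r"\d+", " INTEGER ", text)
    return re.findall(r"[\w$]+", text)

def trigrams(words):
    # Set of all word trigrams.
    return {tuple(words[i:i + 3]) for i in range(len(words) - 2)}

def jaccard(text1, text2):
    a, b = trigrams(tokenize(text1)), trigrams(tokenize(text2))
    return len(a & b) / len(a | b)

print(jaccard("We received your payment for $123.45 on 2/4/13.",
              "We received your payment for $456.78 on 6/9/12."))  # 1.0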

If you want to match a pre-existing template with e.g. control flow elements like {% for x in y %} against a supposed output from it, you're going to have to parse the template language - which seems like a lot of work.
On the other hand, if you're prepared to write a second template for validation purposes - something like:
Dear {{customer}},
We received your last payment for {{currency}} on {{full-date}}\. We'd like you to be aware of the following charges:
( {{currency}} {{supplier}} on {{short-date}}
){,5}As always, you can view these transactions in our portal\.
... which is just a simple extension of regex syntax, it's pretty straightforward to hack something together that will validate against that:
import re

FIELDS = {
    "customer": r"[\w\s\.-]{,50}",
    "supplier": r"[\w\s\.,-]{,30}",
    "currency": r"[$€£]\d+\.\d{2}",
    "short-date": r"\d{,2}/\d{,2}",
    "full-date": r"\d{,2}/\d{,2}/\d{2}",
}

def validate(example, template_file):
    with open(template_file) as f:
        template = f.read()
    for tag, pattern in FIELDS.items():
        template = template.replace("{{%s}}" % tag, pattern)
    valid = re.compile(template + "$")
    return valid.match(example) is not None
The example above isn't the greatest Python code of all time by any means, but it's enough to get the general idea across.

I would go for the "longest common subsequence". A standard implementation can be found here:
http://en.wikibooks.org/wiki/Algorithm_implementation/Strings/Longest_common_subsequence
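For illustration, here is a minimal dynamic-programming sketch of the LCS length in Python, applied token by token rather than character by character (the template and email strings below are my own examples):

def lcs_length(a, b):
    # Length of the longest common subsequence of sequences a and b,
    # using the standard row-by-row dynamic programming table.
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, 1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

template = "We received your last payment for $AMOUNT on DATE.".split()
email    = "We received your last payment for $123.45 on 2/4/13.".split()
print(lcs_length(template, email))  # 7 of the 9 tokens match; only the placeholder slots differ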
If you need a better algorithm and/or lots of additional ideas for inexact matching of strings, THE standard reference is this book:
http://www.amazon.com/Algorithms-Strings-Trees-Sequences-Computational/dp/0521585198
Don't be fooled by the title. Computational biology is largely about matching large databases of long strings (also known as DNA sequences).

Related

Write a custom JSON interpreter for a file that looks like JSON but isn't, using Python

What I need to do is write a module that can read and write files that use the PDX script language. This language looks a lot like JSON but has enough differences that a custom encoder/decoder is needed to do anything with those files (without a mess of regex substitutions which would make maintenance hell). I originally went with just reading them as txt files and using regex to find and replace things to convert them to valid JSON. This led me to my current point, where any addition to the code requires me to write far more code than I would want to, just to support some small new thing. With a custom JSON-like parser I could write code that declares what the valid key:value pairs are, then use that to handle the files. To me that will be a lot less code and a lot easier to maintain.
So what does this code look like? In general it looks like this (I tried to include all possible syntax; this is not an example of a working file):
@key = value # this is the definition for the scripted variable
key = {
# This is a comment. No multiline comments
function # This is a single key, usually optimize_memory
# These are the accepted key:value pairs. The quoted version is being phased out
key = "value"
key = value
key = @key # This key is using a scripted variable, defined either in the file it's in or in the `scripted_variables` folder (see above for an example of how these are initially defined)
# type is what the key type is. Like trigger:planet_stability where planet_stability is a trigger
key = type:key
# Variables like this allow for custom names to be set. Mostly used for flags and such things
[[VARIABLE_NAME]
math_key = $VARIABLE_NAME$
]
# This is inline math; I don't actually understand how this works in the script language yet, as it's new. The "<" can be replaced with any math symbol.
# Valid example: planet_stability < @[ stabilitylevel2 + 10 ]
key < @[ key + 10 ]
# This is used a lot to handle code blocks. Valid example:
# potential = {
# exists = owner
# owner = {
# has_country_flag = flag_name
# }
# }
key = {
key = value
}
# This is just a list. Inline brackets are used a lot, which annoys me...
key = { value value }
}
The major differences between JSON and PDX script are the nearly complete lack of quotation marks, an equals sign instead of a colon for separation, and no commas at the ends of lines. Now before you ask me to change the PDX code: I can't. It's not mine. This is what I have to work with and I can't make any changes to the syntax. And no, I don't want to convert back and forth; as I have already mentioned, this would require a lot of work. I have attempted to look for examples of this, however all I can find are references for converting already valid JSON to a Python object, which is not what I want. So I can't give any examples of what I have already done, as I can't find anywhere to even start.
Some additional info:
Order of key:value pairs does not technically matter, however it is expected to be in a certain order, and when it is not, this causes issues with mods and conflict solvers
bool properties always use yes or no rather than true or false
Lowercase is expected and in some cases required
Math operators are used as separators as well, e.g. >=, <=, etc.
The list of syntax is not exhaustive, but should contain most of the syntax used in the language
Past work:
My last attempts at this all revolved around converting it from a text file to a JSON file. This was a lot of work just to get a small piece of this working.
Example:
potential = {
exists = owner
owner = {
is_regular_empire = yes
is_fallen_empire = no
}
NOR = {
has_modifier = resort_colony
has_modifier = slave_colony
uses_habitat_capitals = yes
}
}
And what I did to get most of the way to JSON (I couldn't find a way to add quotes):
test_string = test_string.replace("\n", ",")
test_string = test_string.replace("{,", "{")
test_string = test_string.replace("{", "{\n")
test_string = test_string.replace(",", ",\n")
test_string = test_string.replace("}, ", "},\n")
test_string = "{\n" + test_string + "\n}"
# Replace the equals sign with a colon
test_string = test_string.replace(" =", ":")
This resulted in this:
{
potential: {
exists: owner,
owner: {
is_regular_empire: yes,
is_fallen_empire: no,
},
NOR: {
has_modifier: resort_colony,
has_modifier: slave_colony,
uses_habitat_capitals: yes,
},
}
}
Very, very close, yes, but I could not find a way to add the quotation marks to each word (I think I did try a regex sub, but wasn't able to get it to work, since this whole thing is just one unbroken string). That left this attempt stuck, and it also showed just how much work is required to get even a very simple potential block to mostly work. However, this is not the method I want anymore: one, because it's a lot of work, and two, because I couldn't find anything to finish it. So a custom JSON interpreter is what I want.
The classical approach (potentially leading to more code, but also more "correctness"/elegance) is probably to build a "recursive descent parser": a bunch of conditionals/checks, loops, and (sometimes recursive) functions/handlers that deal with each of the elements/characters encountered on the input stream. An implicit parse/call tree might be sufficient if you directly output/print the JSON equivalent; otherwise you could also create a representation/model in memory for later output/conversion.
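To make that concrete, here is a deliberately simplified recursive-descent sketch in Python (my own sketch, not tested against real PDX files). It only handles key = value pairs, comparison operators, nested { } blocks, bare list items, and # comments, not scripted variables, $VAR$ names, or inline math.

import re

# Tokenizer: skip whitespace and #-comments, keep braces, operators, and atoms.
TOKEN = re.compile(r'\s+|#[^\n]*|(?P<brace>[{}])|(?P<op><=|>=|[=<>])|(?P<atom>"[^"]*"|[^\s{}=<>#]+)')

def tokenize(text):
    tokens = []
    for m in TOKEN.finditer(text):
        if m.lastgroup:                      # None means whitespace or a comment
            tokens.append(m.group())
    return tokens

def parse_block(tokens, pos):
    # Parse until '}' or end of input; returns (list of items, new position).
    items = []
    while pos < len(tokens) and tokens[pos] != "}":
        key = tokens[pos]
        if pos + 1 < len(tokens) and tokens[pos + 1] in ("=", "<", ">", "<=", ">="):
            op, pos = tokens[pos + 1], pos + 2
            if tokens[pos] == "{":
                value, pos = parse_block(tokens, pos + 1)
                pos += 1                     # consume the closing '}'
            else:
                value, pos = tokens[pos], pos + 1
            items.append((key, op, value))
        else:                                # bare word, e.g. a list element
            items.append(key)
            pos += 1
    return items, pos

def parse(text):
    items, _ = parse_block(tokenize(text), 0)
    return items

print(parse("potential = { exists = owner owner = { is_regular_empire = yes } }"))

From the nested list/tuple structure this returns, emitting JSON (or building any other in-memory model) is a separate and much smaller step.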
A related book recommendation would be "Language Implementation Patterns" by Terence Parr, while I avoid promoting my own interpreters and introductory materials :-) In case you need further help, feel free to write to me.

str.format a list by joining its values

Say I have a dictionary:
data = {
"user" : {
"properties" : ["i1", "i2"]
}
}
And the following string:
txt = "The user has properties {user[properties]}"
I want to have:
txt.format(**data)
to equal:
The user has properties i1, i2
I believe that to achieve this I could subclass the formatter used by str.format, but I am unfortunately unsure how to proceed. I rarely subclass standard Python classes. Note that writing {user[properties][0]}, {user[properties][1]} is not an ideal option for me here. I don't know how many items are in the list, so I would need to use a regex to identify matches, then find the relevant value in data and replace the matched text with {user[properties][0]}, {user[properties][1]}. str.format takes care of all the indexing from the string's value, so it is very practical.
Just join the items in data["user"]["properties"]
txt = "The user has properties {properties}"
txt.format(properties = ", ".join(data["user"]["properties"]))
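If you do want to keep the original {user[properties]} placeholder working unchanged, a minimal sketch of the string.Formatter subclass the question asks about (the class name and separator are my own choices) could look like this:

import string

class JoinFormatter(string.Formatter):
    # Join list/tuple values with ", " instead of rendering their repr.
    def format_field(self, value, format_spec):
        if isinstance(value, (list, tuple)):
            value = ", ".join(str(v) for v in value)
        return super().format_field(value, format_spec)

data = {"user": {"properties": ["i1", "i2"]}}
txt = "The user has properties {user[properties]}"
print(JoinFormatter().format(txt, **data))
# The user has properties i1, i2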
I ended up using the jinja2 package for all of my formatting needs. It's extremely powerful and I really recommend it!

How to improve query accuracy of Elasticsearch from Python?

How can you improve the accuracy of search results from Elasticsearch using the Python wrapper? My basic example returns results, but the results are very inaccurate.
I'm running Elasticsearch 5.2 on Ubuntu 16, and I start by creating my index and adding a few documents like:
from elasticsearch import Elasticsearch

es = Elasticsearch()
# Document A
es.index(
    index='my-test-index',
    doc_type='text',
    body=dict(
        search_key='some specific keywords',
        weight=1.0,
        data='blah1',
    ),
)
# Document B
es.index(
    index='my-test-index',
    doc_type='text',
    body=dict(
        search_key='some other specific keywords',
        weight=1.0,
        data='blah2',
    ),
)
# Document C
es.index(
    index='my-test-index',
    doc_type='text',
    body=dict(
        search_key='some other very long text that is very different yet mentions the word specific and keywords',
        weight=1.0,
        data='blah3',
    ),
)
I then query it with:
es = Elasticsearch()
es.indices.create(index='my-test-index', ignore=400)
query = 'some specific keywords'
results = es.search(
    index='my-test-index',
    body={
        'query': {
            "function_score": {
                "query": {
                    "match": {
                        "search_key": query
                    }
                },
                "functions": [{
                    "script_score": {
                        "script": "doc['weight'].value"
                    }
                }],
                "score_mode": "multiply"
            }
        },
    },
)
And although it returns all results, it returns them in the order of documents B, C, A, whereas I would expect them in the order A, B, C, because although all the documents contain all my keywords, only the first one is an exact match. I would expect C to be last because, even though it contains all my keywords, it also contains a lot of fluff I'm not explicitly searching for.
This problem compounds when I index more entries. The search returns everything that has even a single keyword from my query, and seemingly weights them all identically, causing the search results to get less and less accurate the larger my index grows.
This is making Elasticsearch almost useless. Is there any way I can fix it? Is there a problem with my search() call?
In your query, you can use a match_phrase query instead of a match query so that the order and proximity of the search terms get into the mix. Additionally, you can add a small slop in order to allow the terms to be further apart or in a different order. But documents with terms in the same order and closer together will be ranked higher than documents with terms out of order and/or further apart. Try it out:
"query": {
"match_phrase": {
"search_key": query,
"slop": 10
}
},
Note: slop is a number that indicates how many "swaps" of the search terms you need to perform in order to land on the term configuration present in the document.
Sorry for not reading your question more carefully and for the loaded answer below. I don't want to be a stick in the mud, but I think it will be clearer if you understand a bit more how Elasticsearch itself works.
Because you index your documents without specifying any index and mapping configuration, Elasticsearch will use several defaults that it provides out of the box. The indexing process will first tokenize field values in your document using the standard tokenizer and analyze them using the standard analyzer before storing them in the index. Both the standard tokenizer and analyzer work by splitting your string on word boundaries. So at the end of index time, what you have in your index for the terms in the search_key field is ["some", "specific", "keywords"], not "some specific keywords".
During search time, the match query controls relevance using a similarity algorithm called term frequency/inverse document frequency, or TF/IDF. This algorithm is very popular in text search in general and there is a Wikipedia article on it: https://en.wikipedia.org/wiki/Tf%E2%80%93idf. What's important to note here is that the more frequently your term appears in the index, the less important it is in terms of relevance. some, specific, and keywords appear in ALL 3 documents in your index, so as far as Elasticsearch is concerned, they contribute very little to the document's relevance in your search result. Since A contains only these terms, it's like having a document containing only the, an, a in an English index. It won't show up as the first result even if you search for the, an, a specifically. B ranks higher than C because B is shorter, which yields a higher norm value. This norm value is explained in the relevance documentation. This is a bit of speculation on my part, but I think it does work out this way if you explain the query using the explain API.
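As a toy illustration of that point, using the classic Lucene idf formula (real scoring also involves norms and other factors):

import math

# idf = 1 + ln(N / (df + 1)): terms appearing in many documents get a low idf.
docs = [
    "some specific keywords",
    "some other specific keywords",
    "some other very long text that is very different yet mentions the word specific and keywords",
]
N = len(docs)
for term in ["specific", "keywords", "very"]:
    df = sum(term in d.split() for d in docs)
    print(term, round(1 + math.log(N / (df + 1)), 2))
# "specific" and "keywords" occur in all 3 docs, so their idf is low (0.71);
# a rarer term like "very" scores higher (1.41).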
So, back to your need, how to favor exact match over everything else? There is, of course, the match_phrase query as Val pointed out. Another popular method to do it, which I personally prefer, is to index the raw value in a sub-field called search_key.raw using the not_analyzed option when defining your mapping: https://www.elastic.co/guide/en/elasticsearch/guide/current/mapping-intro.html#_index_2 and simply boost this raw value when you search.
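As a rough sketch of that second approach (index and field names follow the question; on Elasticsearch 5.x a keyword sub-field plays the role of the not_analyzed string from the older guide, and the boost value here is arbitrary):

from elasticsearch import Elasticsearch

es = Elasticsearch()
es.indices.create(
    index='my-test-index',
    body={
        'mappings': {
            'text': {
                'properties': {
                    'search_key': {
                        'type': 'text',
                        'fields': {'raw': {'type': 'keyword'}},  # untouched copy for exact matching
                    },
                },
            },
        },
    },
    ignore=400,
)

query = 'some specific keywords'
results = es.search(
    index='my-test-index',
    body={
        'query': {
            'bool': {
                'should': [
                    {'match': {'search_key': query}},                            # normal analyzed match
                    {'term': {'search_key.raw': {'value': query, 'boost': 5}}},  # exact match scores higher
                ],
            },
        },
    },
)

With this mapping, document A (the only exact match for the query) picks up the extra boost from the term clause and should come back first.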

Search methods and string matching in python

I have a task to search for a group of specific terms (around 138000 terms) in a table made of 4 columns and 187000 rows. The column headers are id, title, scientific_title and synonyms, where each column might contain more than one term inside it.
I should end up with a csv table with the id where a term has been found and the term itself. What could be the best and the fastest way to do so?
In my script, I tried creating phrases by iterating over the different words in a term in order and comparing each word with each row of each column of the table.
It looks something like this:
title_prepared = string_preparation(title)
sentence_array = title_prepared.split(" ")
length = len(sentence_array)
for i in range(length):
    for place_length in range(len(sentence_array)):
        last_element = place_length + 1
        phrase = ' '.join(sentence_array[0:last_element])
        if phrase in literalhash:
            final_dict.setdefault(trial_id, [])
            if phrase not in final_dict[trial_id]:
                final_dict[trial_id].append(phrase)
How should I be doing this?
The code on the website you link to is case-sensitive - it will only work when the terms in tumorabs.txt and neocl.xml are in the exact same case. If you can't change your data, then make the following two changes.
After:
for line in text:
add:
line = line.lower()
(this is indented four spaces)
And change:
phrase = ' '.join(sentence_array[0:last_element])
to:
phrase = ' '.join(sentence_array[0:last_element]).lower()
AFAICT this works with the unmodified code from the website when I change the case of some of the data in tumorabs.txt and neocl.xml.
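For what it's worth, here is a small self-contained sketch combining that lowercasing with a corrected version of the phrase-building loop from the question (it tries every contiguous phrase, not just prefixes; the term set and title below are just examples):

def find_terms(title, term_set):
    # Lowercase once, then test every contiguous word phrase against the term set.
    words = title.lower().split()
    found = []
    for start in range(len(words)):
        for end in range(start + 1, len(words) + 1):
            phrase = ' '.join(words[start:end])
            if phrase in term_set:
                found.append(phrase)
    return found

terms = {t.lower() for t in ["Heart Disease", "Diabetes Mellitus"]}
print(find_terms("Efficacy of Sarpogrelate on Ischemic Heart Disease", terms))
# ['heart disease']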
To clarify the problem: we are running a small scientific project where we need to extract all text parts containing particular keywords. We have used the coded dictionary and Python script posted at http://www.julesberman.info/coded.htm, but it seems that something is not working properly.
For example, the script does not recognize the keyword "Heart Disease" in the string "A Multicenter Randomized Trial Evaluating the Efficacy of Sarpogrelate on Ischemic Heart Disease After Drug-eluting Stent Implantation in Patients With Diabetes Mellitus or Renal Impairment".
Thanks for understanding! We are a biologist and a medical doctor, with a little bit of knowledge of Python!
If you need some more code I can post it online.

The same query on Sparql gives different results

I read some questions related to my question, like
Same sparql not returning same results, but I think this is a little different.
Consider this query, which I submit to http://live.dbpedia.org/sparql (Virtuoso endpoint) and get 34 triples as a result.
SELECT ?pred ?obj
WHERE {
<http://dbpedia.org/resource/Johann_Sebastian_Bach> ?pred ?obj
FILTER((langMatches(lang(?obj), "")) ||
(langMatches(lang(?obj), "EN"))
)
}
Then, I used the same query in a code in python:
import rdflib
import rdfextras
rdfextras.registerplugins()
g=rdflib.Graph()
g.parse("http://dbpedia.org/resource/Johann_Sebastian_Bach")
PREFIX = """
PREFIX dbp: <http://dbpedia.org/resource/>
"""
query = """
SELECT ?pred ?obj
WHERE {dbp:Johann_Sebastian_Bach ?pred ?obj
FILTER( (langMatches(lang(?obj), "")) ||
(langMatches(lang(?obj), "EN")))}
"""
query = PREFIX + query
result_set = g.query(query)
print len(result_set)
This time, I get only 27 triples! https://dl.dropboxusercontent.com/u/22943656/result.txt
I thought it could be related to the DBpedia site. I repeated these queries several times and always got the same difference. Therefore, I downloaded the RDF file to test it locally, and used the software Protégé to simulate the Virtuoso endpoint. Even so, I still got different results from the SPARQL query submitted in Protégé and in Python: 31 and 27. Is there any explanation for this difference? And how can I get the same result in both?
As the question is written, there are a few possible problems. Based on the comments, the first one described here (about lang, langMatches, etc.) seems to be what you're actually running into, but I'll leave the descriptions of the other possible problems, in case someone else finds them useful.
lang, langMatches, and the empty string
lang is defined to return "" for literals with no language tags. According to RFC 4647 §2.1, language tags are defined as follows:
2.1. Basic Language Range
A "basic language range" has the same syntax as an [RFC3066]
language tag or is the single character "*". The basic language
range was originally described by HTTP/1.1 [RFC2616] and later
[RFC3066]. It is defined by the following ABNF [RFC4234]:
language-range = (1*8ALPHA *("-" 1*8alphanum)) / "*"
alphanum = ALPHA / DIGIT
This means that "" isn't actually a legal language tag. As Jeen Broekstra pointed out on answers.semanticweb.com, the SPARQL recommendation says:
17.2 Filter Evaluation
SPARQL provides a subset of the functions and operators defined by
XQuery Operator Mapping. XQuery 1.0 section 2.2.3 Expression
Processing describes the invocation of XPath functions. The following
rules accommodate the differences in the data and execution models
between XQuery and SPARQL: …
Functions invoked with an
argument of the wrong type will produce a type error. Effective
boolean value arguments (labeled "xsd:boolean (EBV)" in the operator
mapping table below), are coerced to xsd:boolean using the EBV rules
in section 17.2.2.
Since "" isn't a legal language tag, it might be considered "an argument of the wrong type [that] will produce a type error." In that case, the langMatches invocation would produce an error, and that error will be treated as false in the filter expression. Even if it doesn't return false for this reason, RFC 4647 §3.3.1, which describes how language tags and ranges are compared, doesn't say exactly what should happen in the comparison, since it's assuming legal language tags:
Basic filtering compares basic language ranges to language tags. Each
basic language range in the language priority list is considered in
turn, according to priority. A language range matches a particular
language tag if, in a case-insensitive comparison, it exactly equals
the tag, or if it exactly equals a prefix of the tag such that the
first character following the prefix is "-". For example, the
language-range "de-de" (German as used in Germany) matches the
language tag "de-DE-1996" (German as used in Germany, orthography of
1996), but not the language tags "de-Deva" (German as written in the
Devanagari script) or "de-Latn-DE" (German, Latin script, as used in
Germany).
Based on your comments and my local experiments, it appears that langMatches(lang(?obj),"") for literals without language tags (so really, langMatches("","")) is returning true in Virtuoso (as it's installed on DBpedia), Jena's ARQ (from my experiments), and Protégé (from our experiments), and it's returning false (or an error that's coerced to false) in RDFlib.
In either case, since lang is defined to return "" for literals without a language tag, you should be able to reliably include them in your results by replacing langMatches(lang(?obj), "") with lang(?obj) = "".
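For example, the rdflib query from the question with only that one change (everything else as in the original; recent rdflib versions have SPARQL support built in, so rdfextras is not needed):

import rdflib

g = rdflib.Graph()
g.parse("http://dbpedia.org/resource/Johann_Sebastian_Bach")

query = """
PREFIX dbp: <http://dbpedia.org/resource/>
SELECT ?pred ?obj
WHERE {
    dbp:Johann_Sebastian_Bach ?pred ?obj
    FILTER( lang(?obj) = "" || langMatches(lang(?obj), "EN") )
}
"""
print(len(g.query(query)))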
Issues with the data that you're using
You're not querying the same data. The data that you download from
http://dbpedia.org/resource/Johann_Sebastian_Bach
is from DBpedia, but when you run a query against
http://live.dbpedia.org/sparql,
you're running it against DBpedia Live, which may have different data. If you run this query on the DBpedia Live endpoint and on the DBpedia endpoint, you get a different number of results:
SELECT count(*) WHERE {
dbpedia:Johann_Sebastian_Bach ?pred ?obj
FILTER( langMatches(lang(?obj), "") || langMatches(lang(?obj), "EN" ) )
}
DBpedia Live results: 31
DBpedia results: 34
Issues with distinct
Another possible problem, though it doesn't seem to be the one that you're running into, is that your second query has a distinct modifier, but your first one doesn't. That means that your second query could easily have fewer results than the first one.
If you run this query against the DBpedia SPARQL endpoint you should get 34 results, and that's the same whether or not you use the distinct modifiers, and it's the number that you should get if you download the data and run the same query against it.
select ?pred ?obj where {
dbpedia:Johann_Sebastian_Bach ?pred ?obj
filter( langMatches(lang(?obj), "") || langMatches(lang(?obj), "EN") )
}
