Storing token frequencies in Elasticsearch instead of storing text - python

From my understanding of the documentation, Elasticsearch works by scoring with term frequency * inverse document frequency (TF-IDF). It converts text to some sort of term-frequency dictionary that also includes the positions where these terms occur most frequently.
What I'm trying to do is not store text, but term frequencies for each row of data. The search works fine when I simply upload the full text, but it will not work well in a full-scale solution with 10+ million pages of text. Would it not be more effective to only store term frequencies if the text content is otherwise irrelevant?
edit: the anonymity of the data is also relevant and therefore I would not want full sentences and paragraphs stored externally.

For your purposes, you could enable a term vector on the text field to get the term frequencies. Please read the documentation here.
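A minimal mapping sketch of that idea, reusing the placeholder index, type and field names from the code below; term_vector is enabled at mapping time so the term-vector APIs can later return per-document frequencies:
# Sketch only: enable term vectors on the text field at mapping time.
from elasticsearch import Elasticsearch

es = Elasticsearch()
es.indices.create(
    index="abc",
    body={
        "mappings": {
            "your_doc_type": {
                "properties": {
                    "my_field": {
                        "type": "text",
                        "term_vector": "yes",   # store terms and their frequencies
                    }
                }
            }
        }
    },
)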
Then you could use the bulk term-vector query, mtermvectors (doc here, and Python API doc). It works with a list of ids. For example, if you have a list of all the ids of your documents that match "sky", you could proceed this way:
from elasticsearch import Elasticsearch

es = Elasticsearch()
index = "abc"
my_doc_type = "your_doc_type"
fields = ["my_field"]

# collect the ids of all documents matching "sky"
ids = []
result = es.search(index=index, doc_type=my_doc_type, body={"query": {"term": {"my_field": "sky"}}})
for res in result['hits']['hits']:
    ids.append(res['_id'])

# fetch the term vectors for those documents in one bulk call
response = es.mtermvectors(index=index, doc_type=my_doc_type,
                           body=dict(ids=ids,
                                     parameters=dict(term_statistics=True,
                                                     field_statistics=True,
                                                     fields=fields)))
for doc in response['docs']:
    for field_name, vector in doc['term_vectors'].items():
        for term, stats in vector['terms'].items():
            tf = stats["term_freq"]   # term frequency within this document
            df = stats["doc_freq"]    # document frequency across the whole index

Related

Sqlite3's create_function() method for specific searching

Hi, I'm currently working on a Python/Anki project to turn my language learning journey into a more automated process. At the moment I am searching for a way to use the "sqlite3" library to define a function that retrieves from the database only rows containing a word whose Japanese dictionary form matches the key word, so it also matches conjugated words in my database sentences. Here is my example:
import sqlite3
from sudachipy import dictionary, Tokenizer
import sudachipy

conn = sqlite3.connect("D:/ajatt/copyankidb/collection.anki2")  # my Anki database
# to retrieve specific sentences
c = conn.cursor()
tokenizer = dictionary.Dictionary().create(mode=sudachipy.SplitMode.C)
yahoosent = "走る"  # the key word

def samewordsearcher(data):
    for outword in tokenizer.tokenize(yahoosent):  # tokenize creates token objects so I can apply additional methods
        for inword in tokenizer.tokenize(data):    # like dictionary_form(), which gives the dictionary form of a Japanese word
            if outword.dictionary_form() == inword.dictionary_form():
                return data
            else:
                return None

conn.create_function("SAMEWORD", 1, samewordsearcher)
sqlsent = "SELECT SAMEWORD(sfld) FROM Notes"
c.execute(sqlsent)
results = c.fetchall()
for sen in results:
    print(sen)
But as you can see, it is quite time consuming because it needs to iterate through every sentence in my card database to find any that match the condition. My question: is there a better way to achieve my results in a much less lengthy way?
P.S.: here is the link to the library used to tokenize and give dictionary_form():
https://github.com/WorksApplications/SudachiPy
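One incremental speedup that follows directly from the question's own code, sketched here as an untested idea: tokenize the key word once outside the SQLite UDF and compare each row's tokens against a set of its dictionary forms, so the key word is not re-tokenized for every row.
# Untested sketch: precompute the key word's dictionary forms once,
# then only tokenize each database sentence inside the UDF.
target_forms = {t.dictionary_form() for t in tokenizer.tokenize(yahoosent)}

def samewordsearcher(data):
    # return the sentence if any of its tokens shares a dictionary form
    # with the key word, otherwise None
    for inword in tokenizer.tokenize(data):
        if inword.dictionary_form() in target_forms:
            return data
    return None

conn.create_function("SAMEWORD", 1, samewordsearcher)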

Select Substring from Larger String and Append to List

I'm currently doing some API work with Tenable.io, and I'm having some trouble selecting substrings. I'm sending requests for scan histories, and the API responds with a continuous string of all scans in JSON format. The response I get is a very large continuous string of data, and I need to select some substrings (a few values), and copy that to a list (just for now). Getting data into a list isn't where I'm stuck - I require some serious assistance with selecting the substrings I need. Each scan has the following attributes:
id
status
is_archived
targets
scan_uuid
reindexing
time_start (unix format)
time_end (unix format)
Each of these has a value/boolean following it (see below). I need a way to extract the values following "id":, "scan_uuid":, and "time_start": from the string (and put them in a list just for now).
I'd like to do this without string.index, as this may break the script if the response length changes. There is also a new scan every day, so the overall length of the response will change. Due to the nature of the data, I'd imagine the ideal solution would be to specify a condition that will select x amount of characters after "id":, "scan_uuid":, and "time_start":, and append them to a list, with the output looking something like:
scan_id_10_response = ["12345678", "15b6e7cd-447b-84ab-84d3-48a62b18fe6c", "1639111111", etc, etc]
String is below - I've only included the data for 4 scans for simplicity's sake. I've also changed the values for security reasons, but the length & format of the values are the same.
scan_id_10_response = '{"pagination":{"offset":0,"total":119,"sort":[{"order":"DESC","name":"start_date"}],"limit":100},"history":[\
{"id":12345678,"status":"completed","is_archived":false,"targets":{"custom":false,"default":null},"visibility":"public","scan_uuid":"15b6e7cd-447b-84ab-84d3-48a62b18fe6c","reindexing":null,"time_start":1639111111,"time_end":1639111166},\
{"id":23456789,"status":"completed","is_archived":false,"targets":{"custom":false,"default":null},"visibility":"public","scan_uuid":"8a468cff-c64f-668a-3015-101c218b68ae","reindexing":null,"time_start":1632222222,"time_end":1632222255},\
{"id":34567890,"status":"completed","is_archived":false,"targets":{"custom":false,"default":null},"visibility":"public","scan_uuid":"84ea995a-584a-cc48-e352-8742a38c12ff","reindexing":null,"time_start":1639333333,"time_end":1639333344},\
{"id":45678901,"status":"completed","is_archived":false,"targets":{"custom":false,"default":null},"visibility":"public","scan_uuid":"48a95366-48a5-e468-a444-a4486cdd61a2","reindexing":null,"time_start":1639444444,"time_end":1639444455}\
]}'
Basically you can use the standard json module to parse the JSON string.
With the snippet below you obtain a dict you can then work with.
import json
c = json.loads(scan_id_10_response)
Now you can for example create a list of list with the desired attributes:
extracted_data = [[d['id'], d['scan_uuid'], d['time_start']] for d in c['history']]
This returns for this particular example:
[[12345678, '15b6e7cd-447b-84ab-84d3-48a62b18fe6c', 1639111111],
[23456789, '8a468cff-c64f-668a-3015-101c218b68ae', 1632222222],
[34567890, '84ea995a-584a-cc48-e352-8742a38c12ff', 1639333333],
[45678901, '48a95366-48a5-e468-a444-a4486cdd61a2', 1639444444]]
If you only want one result at a time, use a generator or iterate over the list:
gen_extracted = ([d['id'], d['scan_uuid'], d['time_start']] for d in c['history'])
If you don't want to work with a dict, I would recommend taking a look at regular expressions.
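For completeness, a rough sketch of that regex route. The patterns are assumptions keyed to the sample string above, not an official response format, and the JSON parser remains the more robust choice:
# Illustrative regexes for the sample response string shown in the question.
import re

ids = re.findall(r'"id":(\d+)', scan_id_10_response)
uuids = re.findall(r'"scan_uuid":"([0-9a-f-]+)"', scan_id_10_response)
starts = re.findall(r'"time_start":(\d+)', scan_id_10_response)

# stitch the three lists back into one record per scan
extracted_data = [list(record) for record in zip(ids, uuids, starts)]
print(extracted_data)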

Faster NER extraction using SpaCy and Pandas

I have a df with a column that contains comments from which I want to extract the organisations. This article provides a great approach but it is too slow for my problem. The df I am using has over 1,000,000 rows and I am using a Google Colab notebook.
Currently my approach is (from the linked article):
def get_orgs(text):
    # process the text with our SpaCy model to get named entities
    doc = nlp(text)
    # initialize list to store identified organizations
    org_list = []
    # loop through the identified entities and append ORG entities to org_list
    for entity in doc.ents:
        if entity.label_ == 'ORG':
            org_list.append(entity.text)
    # if organization is identified more than once it will appear multiple times in list
    # we use set() to remove duplicates then convert back to list
    org_list = list(set(org_list))
    return org_list
df['organizations'] = df['body'].apply(get_orgs)
Is there a faster way to process this? And, would you advise to apply it to a Pandas df or are there better/faster alternatives?
There are a couple of things you can do in general to speed up spaCy. There's a section in the docs on this.
The first thing to try is creating docs in a pipe. You'll need to be a little creative to get this working with a dataframe:
org_lists = []
for doc in nlp.pipe(df['body']):
    org_lists.append(...)  # do your processing here
# now you can add a column in your dataframe
The other thing is to disable components you aren't using. Since it looks like you're only using NER you can do this:
for doc in nlp.pipe(texts, disable=["tok2vec", "tagger", "parser", "attribute_ruler", "lemmatizer"]):
    # Do something with the doc here
Those together should give you a significant speedup.
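Put together, a rough sketch of both suggestions applied to the dataframe from the question. Column names follow the question; the disable list assumes a standard English pipeline such as en_core_web_sm:
# Sketch combining nlp.pipe with disabled components, assuming the
# question's df['body'] column of strings.
org_lists = []
for doc in nlp.pipe(
    df['body'],
    disable=["tok2vec", "tagger", "parser", "attribute_ruler", "lemmatizer"],
):
    # keep unique ORG entities per comment, as get_orgs() does
    org_lists.append(list({ent.text for ent in doc.ents if ent.label_ == 'ORG'}))

df['organizations'] = org_lists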

Pythonic way to solve a text normalization task

Basically, I have a Hive script file, from which I need to extract the names for all the tables created. For example, from the contents
...
create table Sales ...
...
create external table Persons ...
...
Sales and Persons should be extracted. To accomplish this, my basic idea is like:
Search for key phrases create table and create external table,
Extract the next token which should be the table name.
However, the input may not be canonical. For example,
Tab/newline may be used along with space as token delimiter
There may be multiple consecutive delimiters between tokens
Mixed use of upper and lower case letters like create TABLE
Therefore, I'm thinking about first normalizing the input to a canonical form before applying the basic algorithm. With some effort, I came up with the following:
' '.join(input.split()).lower()
As a Python newcomer, I'm wondering whether this is the Pythonic way to solve the problem, or whether it is flawed in the first place. Is there a simple way to do this in a streaming fashion, i.e., avoiding loading the whole input into memory at once?
Like some comments stated, regex is a neat and easy way to get what you want. If you don't mind getting lowercase results, this one should work:
import re
my_str = """
...
create table Sales ...
create TabLE
test
create external table Persons ...
...
"""
pattern = r"table\s+(\w+)\b"
items = re.findall(pattern, my_str.lower())
print(items)
It captures the next word after "table" (separated by at least one whitespace character, which includes tabs and newlines).
To get the original case of the table names:
for x, item in enumerate(items):
    i = my_str.lower().index(item)
    items[x] = my_str[i:i+len(item)]
print(items)
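The question also asked about a streaming approach. A rough variant of the same regex idea, scanning the script line by line so the whole file is never loaded at once; the file name is just a placeholder, and it assumes the table name sits on the same line as its create statement:
# Rough streaming sketch: process the Hive script one line at a time.
# Assumes "create [external] table <name>" fits on a single line;
# "my_script.hql" is a hypothetical file name.
import re

pattern = re.compile(r"create\s+(?:external\s+)?table\s+(\w+)", re.IGNORECASE)

table_names = []
with open("my_script.hql") as f:
    for line in f:
        table_names.extend(pattern.findall(line))

print(table_names)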

How to improve query accuracy of Elasticsearch from Python?

How can you improve the accuracy of search results from Elasticsearch using the Python wrapper? My basic example returns results, but the results are very inaccurate.
I'm running Elasticsearch 5.2 on Ubuntu 16, and I start by creating my index and adding a few documents like:
from elasticsearch import Elasticsearch

es = Elasticsearch()

# Document A
es.index(
    index='my-test-index',
    doc_type='text',
    body=dict(
        search_key='some specific keywords',
        weight=1.0,
        data='blah1',
    ),
)
# Document B
es.index(
    index='my-test-index',
    doc_type='text',
    body=dict(
        search_key='some other specific keywords',
        weight=1.0,
        data='blah2',
    ),
)
# Document C
es.index(
    index='my-test-index',
    doc_type='text',
    body=dict(
        search_key='some other very long text that is very different yet mentions the word specific and keywords',
        weight=1.0,
        data='blah3',
    ),
)
I then query it with:
es = Elasticsearch()
es.indices.create(index='my-test-index', ignore=400)
query = 'some specific keywords'
results = es.search(
    index='my-test-index',
    body={
        'query': {
            "function_score": {
                "query": {
                    "match": {
                        "search_key": query
                    }
                },
                "functions": [{
                    "script_score": {
                        "script": "doc['weight'].value"
                    }
                }],
                "score_mode": "multiply"
            }
        },
    }
)
And although it returns all results, it returns them in the order of documents B, C, A, whereas I would expect them in the order A, B, C, because although all the documents contain all my keywords, only the first one is an exact match. I would expect C to be last because, even though it contains all my keywords, it also contains a lot of fluff I'm not explicitly searching for.
This problem compounds when I index more entries. The search returns everything that has even a single keyword from my query, and seemingly weights them all identically, causing the search results to get less and less accurate the larger my index grows.
This is making Elasticsearch almost useless. Is there any way I can fix it? Is there a problem with my search() call?
In your query, you can use a match_phrase query instead of a match query so that the order and proximity of the search terms get into the mix. Additionally, you can add a small slop in order to allow the terms to be further apart or in a different order. But documents with terms in the same order and closer together will be ranked higher than documents with terms out of order and/or further apart. Try it out:
"query": {
"match_phrase": {
"search_key": query,
"slop": 10
}
},
Note: slop is a number that indicates how many "swaps" of the search terms you need to perform in order to land on the term configuration present in the document.
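For reference, dropping that into the search() call from the question would look roughly like this; it is only a sketch, reusing the question's index name, query variable and function_score wrapper, with an example slop value:
# Sketch: the question's function_score query with the inner match
# swapped for a match_phrase with slop.
results = es.search(
    index='my-test-index',
    body={
        'query': {
            'function_score': {
                'query': {
                    'match_phrase': {
                        'search_key': {
                            'query': query,
                            'slop': 10,
                        }
                    }
                },
                'functions': [{
                    'script_score': {'script': "doc['weight'].value"}
                }],
                'score_mode': 'multiply',
            }
        },
    },
)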
Sorry for not reading your question more carefully and for the loaded answer below. I don't want to be a stick in the mud, but I think it will be clearer if you understand a bit more about how Elasticsearch itself works.
Because you index your documents without specifying any index or mapping configuration, Elasticsearch uses several defaults that it provides out of the box. The indexing process first tokenizes field values in your document using the standard tokenizer and analyzes them using the standard analyzer before storing them in the index. Both the standard tokenizer and analyzer work by splitting your string on word boundaries. So at the end of index time, what you have in your index for the terms in the search_key field is ["some", "specific", "keywords"], not "some specific keywords".
During search time, the match query controls relevance using a similarity algorithm called term frequency/inverse document frequency, or TF/IDF. This algorithm is very popular in text search in general and there is a Wikipedia article on it: https://en.wikipedia.org/wiki/Tf%E2%80%93idf. What's important to note here is that the more frequently a term appears in the index, the less it contributes to relevance. some, specific, and keywords appear in ALL 3 documents in your index, so as far as Elasticsearch is concerned, they contribute very little to a document's relevance in your search result. Since A contains only these terms, it's like having a document containing only the, an, a in an English index. It won't show up as the first result even if you search for the, an, a specifically. B ranks higher than C because B is shorter, which yields a higher norm value. This norm value is explained in the relevance documentation. This is a bit of speculation on my part, but I think you can confirm it works out this way by running the query through the explain API.
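As a toy illustration of that point, using the textbook IDF formula rather than Elasticsearch's actual similarity implementation: a term that occurs in every document gets an IDF of zero, so it does nothing to distinguish one document from another.
# Toy, textbook IDF over the three example documents; illustration only,
# not Elasticsearch's real scoring.
import math

docs = [
    "some specific keywords",                          # Document A
    "some other specific keywords",                    # Document B
    "some other very long text that is very different "
    "yet mentions the word specific and keywords",     # Document C
]

def idf(term):
    doc_freq = sum(term in doc.split() for doc in docs)
    return math.log(len(docs) / doc_freq)

for term in ["some", "specific", "keywords", "long"]:
    print(term, round(idf(term), 3))
# "some", "specific", "keywords" appear in all three docs -> idf == 0.0
# "long" appears only in C -> idf ~= 1.099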
So, back to your need, how to favor exact matches over everything else? There is, of course, the match_phrase query as Val pointed out. Another popular method, which I personally prefer, is to index the raw value in a sub-field called search_key.raw using the not_analyzed option when defining your mapping: https://www.elastic.co/guide/en/elasticsearch/guide/current/mapping-intro.html#_index_2 and simply boost this raw value when you search.
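A rough sketch of what that could look like. The question runs Elasticsearch 5.2, where the equivalent of not_analyzed is a keyword sub-field; the names follow the question's code, and the boost value is only an example:
# Sketch: map search_key with a keyword sub-field (the ES 5.x analogue of
# not_analyzed), then boost exact matches on it at query time.
es.indices.create(
    index='my-test-index',
    body={
        'mappings': {
            'text': {
                'properties': {
                    'search_key': {
                        'type': 'text',
                        'fields': {'raw': {'type': 'keyword'}},
                    }
                }
            }
        }
    },
)

# ...index the documents as before, then favour the exact raw value:
results = es.search(
    index='my-test-index',
    body={
        'query': {
            'bool': {
                'should': [
                    {'match': {'search_key': query}},
                    {'term': {'search_key.raw': {'value': query, 'boost': 5}}},
                ]
            }
        }
    },
)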
