Sqlite3's create_function() method for specific searching - python

Hi, I'm currently working on a Python/Anki project to make my language-learning routine more automated. I'm looking for a way to use the sqlite3 library to define a function that retrieves from the database only the rows containing a word in its Japanese dictionary form, so that it also matches conjugated words in my database sentences. Here is my example:
import sqlite3
from sudachipy import dictionary, Tokenizer
import sudachipy

conn = sqlite3.connect("D:/ajatt/copyankidb/collection.anki2")  # my Anki database
c = conn.cursor()  # to retrieve specific sentences

tokenizer = dictionary.Dictionary().create(mode=sudachipy.SplitMode.C)

yahoosent = "走る"  # the key word

def samewordsearcher(data):
    # tokenize() returns token objects, so I can apply additional methods
    # like dictionary_form(), which gives the dictionary form of a Japanese word
    for outword in tokenizer.tokenize(yahoosent):
        for inword in tokenizer.tokenize(data):
            if outword.dictionary_form() == inword.dictionary_form():
                return data
    return None

conn.create_function("SAMEWORD", 1, samewordsearcher)

sqlsent = "SELECT SAMEWORD(sfld) FROM Notes"
c.execute(sqlsent)
results = c.fetchall()
for sen in results:
    print(sen)
But as you can see, it is quite time consuming, since it needs to iterate through every sentence in my card database to find the ones matching the condition. My question: is there a better way to achieve this result in a much less lengthy way?
P.S.: here is the link to the library used to tokenize and give dictionary_form():
https://github.com/WorksApplications/SudachiPy
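One possible speed-up, sketched under assumptions (the same SudachiPy setup and a standard Anki notes table; the CONTAINSWORD name is just illustrative): tokenize the key word once outside the UDF instead of on every row, and call the function in a WHERE clause so SQLite returns only the matching rows instead of a NULL for every note.
import sqlite3
import sudachipy
from sudachipy import dictionary

conn = sqlite3.connect("D:/ajatt/copyankidb/collection.anki2")
c = conn.cursor()

tokenizer = dictionary.Dictionary().create(mode=sudachipy.SplitMode.C)

yahoosent = "走る"
# Tokenize the key word once, not once per database row.
target_forms = {t.dictionary_form() for t in tokenizer.tokenize(yahoosent)}

def contains_word(data):
    # Return 1/0 so the UDF can be used directly as a boolean filter.
    return int(any(t.dictionary_form() in target_forms
                   for t in tokenizer.tokenize(data)))

conn.create_function("CONTAINSWORD", 1, contains_word)

# Only matching rows come back; tokenization still runs once per note,
# but no NULL rows are transferred or post-processed in Python.
c.execute("SELECT sfld FROM notes WHERE CONTAINSWORD(sfld)")
for (sentence,) in c.fetchall():
    print(sentence)
Tokenizing every note on every search is still the dominant cost, so if this runs repeatedly it may be worth tokenizing each note once, storing its dictionary forms in a separate table, and querying that instead.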

Related

Using POS and PUNCT tokens in custom sentence boundaries in spaCy

I am trying to split sentences into clauses using spaCy for classification with MLlib. I have two candidate solutions that I consider the best approaches, but I haven't had much luck with either.
Option 1: Use the tokens in the doc, i.e. token.pos_, that match SCONJ and split there as a sentence boundary.
Option 2: Create a list from whatever dictionary of values spaCy identifies as SCONJ.
The issue with 1 is that I only have .text and .i, and no .pos_, since the custom-boundaries component (as far as I am aware) needs to be run before the parser.
The issue with 2 is that I can't seem to find the dictionary. It is also a really hacky approach.
import spacy
import deplacy
from spacy.language import Language

# Uncomment to visualise how the tokens are labelled
# deplacy.render(doc)

custom_EOS = ['.', ',', '!', '!']
custom_conj = ['then', 'so']

@Language.component("set_custom_boundaries")
def set_custom_boundaries(doc):
    for token in doc[:-1]:
        if token.text in custom_EOS:
            doc[token.i + 1].is_sent_start = True
        if token.text in custom_conj:
            doc[token.i].is_sent_start = True
    return doc

def set_sentence_breaks(doc):
    for token in doc:
        if token == "SCONJ":
            doc[token.i].is_sent_start = True

def main():
    text = "In the add user use case, we need to consider speed and reliability " \
           "so use of a relational DB would be better than using SQLite. Though " \
           "it may take extra effort to convert #Bot"
    nlp = spacy.load("en_core_web_sm")
    nlp.add_pipe("set_custom_boundaries", before="parser")
    doc = nlp(text)
    # for token in doc:
    #     print(token.pos_)
    print("Sentences:", [sent.text for sent in doc.sents])

if __name__ == "__main__":
    main()
Current Output
Sentences: ['In the add user use case,',
'we need to consider speed and reliability',
'so the use of a relational DB would be better than using SQLite.',
'Though it may take extra effort to convert #Bot']
I would recommend not trying to do anything clever with is_sent_start - while it is user-accessible, it's really not intended to be used in that way, and there is at least one unresolved issue related to it.
Since you just need these divisions for some other classifier, it's enough for you to just get the string, right? In that case I recommend you run the spaCy pipeline as usual and then split sentences on SCONJ tokens (if just using SCONJ is working for your use case). Something like:
out = []
for sent in doc.sents:
    last = sent[0].i
    for tok in sent:
        if tok.pos_ == "SCONJ":
            out.append(doc[last:tok.i])
            last = tok.i + 1
    # sent.end is one past the sentence's final token, so the last chunk isn't cut short
    out.append(doc[last:sent.end])
Alternately, if that's not good enough, you can identify subsentences using the dependency parse: find the verbs of the subsentences (by their relation to a SCONJ, for example), save those subsentences, and then add another sentence based on the root.
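A minimal sketch of that dependency-parse idea, assuming the en_core_web_sm pipeline and the heuristic that a subordinate clause's verb is linked to a SCONJ child via the "mark" relation:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Though it may take extra effort to convert, a relational DB "
          "would be better because we need speed and reliability.")

clauses = []
for sent in doc.sents:
    # Verbs heading a subordinate clause typically have a SCONJ child
    # attached with the "mark" dependency (e.g. "though", "because").
    heads = [tok for tok in sent
             if tok.pos_ in ("VERB", "AUX")
             and any(child.dep_ == "mark" and child.pos_ == "SCONJ"
                     for child in tok.children)]
    for head in heads:
        subtree = list(head.subtree)
        clauses.append(doc[subtree[0].i : subtree[-1].i + 1])

print([clause.text for clause in clauses])
The remainder of each sentence (everything outside those subtrees) can then be added as the clause built around the root.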

Storing token frequencies in elasticsearch, instead of storing text

From my understanding of the documentation, Elasticsearch works by scoring on term frequency * inverse document frequency. It converts text into a kind of term-frequency dictionary that also records where those terms occur.
What I'm trying to do is not store text, but term frequencies for each row of data. The search works fine when I simply upload the full text, but it will not scale well in a full solution with 10+ million pages of text. Would it not be more effective to store only term frequencies if the text content is otherwise irrelevant?
edit: the anonymity of the data is also relevant and therefore I would not want full sentences and paragraphs stored externally.
For your purposes, you could enable term vectors on the text field to get the term frequencies. Please read the documentation here.
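As a sketch, enabling term vectors when the index is created might look like this (the index, type, and field names are the illustrative ones reused below, and the type-level mapping assumes an older Elasticsearch version that still uses doc types):
from elasticsearch import Elasticsearch

es = Elasticsearch()

# "term_vector": "yes" stores per-document term frequencies for the field.
es.indices.create(index="abc", body={
    "mappings": {
        "your_doc_type": {
            "properties": {
                "my_field": {
                    "type": "text",
                    "term_vector": "yes"
                }
            }
        }
    }
})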
Then you could use the bulk term-vector query, mtermvectors (docs here, and Python API docs). It works with a list of ids. For example, if you have a list of all the ids of your documents that match "sky", you could proceed this way:
from elasticsearch import Elasticsearch

es = Elasticsearch()
index = "abc"
my_doc_type = "your_doc_type"
fields = ["my_field"]

# Collect the ids of all documents matching "sky"
ids = []
result = es.search(index=index, doc_type=my_doc_type,
                   body={"query": {"term": {"my_field": "sky"}}})
for res in result['hits']['hits']:
    ids.append(res['_id'])

# Fetch the term vectors for all of those documents in one bulk call
response = es.mtermvectors(index=index, doc_type=my_doc_type,
                           body=dict(ids=ids,
                                     parameters=dict(term_statistics=True,
                                                     field_statistics=True,
                                                     fields=fields)))
for doc in response['docs']:
    for field_name, field in doc['term_vectors'].items():
        for term, stats in field['terms'].items():
            tf = stats["term_freq"]
            df = stats["doc_freq"]

Pythonic way to solve a text normalization task

Basically, I have a Hive script file, from which I need to extract the names for all the tables created. For example, from the contents
...
create table Sales ...
...
create external table Persons ...
...
Sales and Persons should be extracted. To accomplish this, my basic idea is:
Search for key phrases create table and create external table,
Extract the next token which should be the table name.
However, the input may not be canonical. For example,
Tab/newline may be used along with space as token delimiter
There may be multiple consecutive delimiters between tokens
Mixed use of upper and lower case letters like create TABLE
Therefore, I'm thinking about first normalizing the input to a canonical form before applying the basic algorithm. With some effort, I came up with the following:
' '.join(input.split()).lower()
As a Python newcomer, I'm wondering whether this is the Pythonic way to solve the problem, or whether it is flawed from the very start. Is there a simple way to do this in a streaming fashion, i.e., avoiding loading the whole input into memory at once?
Like some comments stated, regex is a neat and easy way to get what you want. If you don't mind getting lowercase results, this one should work:
import re
my_str = """
...
create table Sales ...
create TabLE
test
create external table Persons ...
...
"""
pattern = r"table\s+(\w+)\b"
items = re.findall(pattern, my_str.lower())
print items
It captures the next word after "table " (followed by at least one whitespace / newline).
To get the original case of the table names:
for x, item in enumerate(items):
    i = my_str.lower().index(item)
    items[x] = my_str[i:i+len(item)]
print items
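For the streaming part of the question, a minimal sketch that reads the script line by line (assuming a table name never spans a line break; the file path is just illustrative):
import re

pattern = re.compile(r"create\s+(?:external\s+)?table\s+(\w+)", re.IGNORECASE)

def table_names(path):
    # Yield table names one at a time without loading the whole file.
    with open(path) as f:
        for line in f:
            for name in pattern.findall(line):
                yield name

for name in table_names("my_script.hql"):
    print(name)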

Unable to run Stanford Core NLP annotator over whole data set

I have been trying to use Stanford Core NLP over a data set but it stops at certain indexes which I am unable to find.
The data set is available on Kaggle: https://www.kaggle.com/PromptCloudHQ/amazon-reviews-unlocked-mobile-phones/data
This is a function that outputs the sentiment of a paragraph by taking the mean sentiment value of individual sentences.
import json

def funcSENT(paragraph):
    all_scores = []
    output = nlp.annotate(paragraph, properties={
        "annotators": "tokenize,ssplit,parse,sentiment",
        "outputFormat": "json",
        # Only split the sentence at End Of Line. We assume that this method only takes in one single sentence.
        #"ssplit.eolonly": "true",
        # Setting enforceRequirements to skip some annotators and make the process faster
        "enforceRequirements": "false"
    })
    all_scores = []
    for i in range(0, len(output['sentences'])):
        all_scores.append(int(json.loads(output['sentences'][i]['sentimentValue'])) + 1)
    final_score = sum(all_scores) / len(all_scores)
    return round(final_score)
Now I run this function for every review in the 'Reviews' column using this code.
import pandas as pd
data_file = 'C:\\Users\\SONY\\Downloads\\Amazon_Unlocked_Mobile.csv'
data = pd.read_csv(data_file)
from pandas import *

i = 0
my_reviews = data['Reviews'].tolist()
senti = []
while i < data.shape[0]:
    senti.append(funcSENT(my_reviews[i]))
    i = i + 1
But somehow I get this error and I am not able to find the problem. It's been many hours now, kindly help.
[1]: https://i.stack.imgur.com/qFbCl.jpg
How to avoid this error?
As I understand, you're using pycorenlp with nlp=StanfordCoreNLP(...) and a running StanfordCoreNLP server. I won't check the data you are using since it appears to require a Kaggle account.
Running with the same setup but a different paragraph, printing "output" by itself shows an error from the Java server, in my case:
java.util.concurrent.ExecutionException: java.lang.IllegalArgumentException: Input word not tagged
I THINK that because there is no part-of-speech annotator, the server cannot perform the parsing. Whenever you use parse or depparse, I think you need to have the "pos" annotator as well.
I am not sure what the sentiment annotator needs, but you may need other annotators such as "lemma" to get good sentiment results.
Print output by itself. If you get the same java error, try adding the "pos" annotator to see if you get the expected json. Otherwise, try to give a simpler example, using your own small dataset maybe, and comment or adjust your question.
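A small sketch of that check, assuming pycorenlp and a CoreNLP server on the default port; the only change from the question's properties is adding "pos" (and "lemma") before "parse" and "sentiment":
from pycorenlp import StanfordCoreNLP

nlp = StanfordCoreNLP('http://localhost:9000')

output = nlp.annotate("The phone works great.", properties={
    "annotators": "tokenize,ssplit,pos,lemma,parse,sentiment",
    "outputFormat": "json",
    "enforceRequirements": "false"
})

# Inspect the raw output first: a Java stack trace here (instead of a dict
# with a 'sentences' key) points to a missing annotator.
print(output)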

Pattern matching Twitter Streaming API

I'm trying to insert into a dictionary certain values from the Streaming API. One of these is the value of the term that is used in the filter method using track=keyword. I've written some code but on the print statement I get an "Encountered Exception: 'term'" error. This is my partial code:
for term in setTerms:
    a = re.compile(term, re.IGNORECASE)
    if re.search(a, status.text):
        message['term'] = term
    else:
        pass
print message['text'], message['term']
This is the filter code:
setTerms = ['BBC','XFactor','Obama']
streamer.filter(track = setTerms)
It matches the string, but I also need to be able to match all instances, e.g. BBC should also match #BBC, BBC1, etc.
So my question: how would I get a term in setTerms, e.g. BBC, to match all these instances with if re.search(term, status.text)?
Thanks
Have you tried putting all your search terms into a single expression?
i.e. setTerms = '(BBC)|(XFactor)|(Obama)', then seeing if it matches anywhere in the whole string (not just an individual word)?
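A short sketch of that idea, building the single expression from setTerms (the sample status text is made up):
import re

setTerms = ['BBC', 'XFactor', 'Obama']
# One alternation pattern; IGNORECASE so "bbc" and "BBC" both match.
pattern = re.compile("|".join(re.escape(term) for term in setTerms), re.IGNORECASE)

status_text = "Watching #BBC1 coverage of the XFactor final"
match = pattern.search(status_text)
if match:
    print(match.group(0))  # substring search, so "BBC" matches inside "#BBC1"
Because re.search looks for the pattern anywhere in the string, BBC will also match #BBC and BBC1 without any extra handling.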
