Suggesting similar sentences - python

I am trying to create a sentence auto-complete model which will suggest similar sentences.
Problem: I have a corpus of more than 20,000 sentences. I want to create a program that suggests similar sentences to the user as they type.
For example:
user: wh
suggestions: [{'what is your name?'},{'what is your profession?'},{'what do you want?'}, {'where are you?'}]
user: what is your
suggestions: [{'what is your name?'},{'what is your profession?'}]
Note:
The ordering of words is important, i.e. the prefix of a suggested sentence and the user's input should be the same.
The sentence suggestions come from the available text corpus.
My approach:
So far I have only come up with a solution that uses a trie data structure to store every sentence in the corpus.
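A minimal sketch of that idea (rough and untested; it stores complete sentences at every node, trading memory for simple lookups):
class TrieNode:
    def __init__(self):
        self.children = {}   # char -> TrieNode
        self.sentences = []  # corpus sentences passing through this node

def build_trie(corpus):
    root = TrieNode()
    for sentence in corpus:
        node = root
        for ch in sentence:
            node = node.children.setdefault(ch, TrieNode())
            node.sentences.append(sentence)
    return root

def suggest(root, prefix, limit=5):
    node = root
    for ch in prefix:
        if ch not in node.children:
            return []
        node = node.children[ch]
    return node.sentences[:limit]

# suggest(build_trie(corpus), 'what is your') -> ['what is your name?', ...]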
I want to know if there are any machine learning techniques that could be implemented for sentence suggestion that also take the sentence prefix into account.
I would really appreciate anyone who could point me in the right direction.

Text generation is a common application of RNNs. Given a sentence prefix, the network can be trained to predict the most probable next words.
A very interesting article written by Andrej Karpathy can be found here, along with the corresponding GitHub repo.
Another popular method utilizes Markov chains for text generation (for example, see here).
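As a taste of the Markov chain approach, here is a minimal word-level sketch (my own illustration; it generates likely continuations rather than completing exact corpus prefixes):
import random
from collections import defaultdict

def build_chain(sentences):
    chain = defaultdict(list)  # word -> observed next words
    for sentence in sentences:
        words = sentence.split()
        for current, following in zip(words, words[1:]):
            chain[current].append(following)
    return chain

def generate(chain, start, max_words=10):
    words = [start]
    while len(words) < max_words and chain[words[-1]]:
        words.append(random.choice(chain[words[-1]]))
    return ' '.join(words)

# generate(build_chain(corpus), 'what') -> e.g. 'what is your name?'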

If you want to use Lucene relevancy, MoreLikeThis can retrieve similar sentences; alternatively, you can apply cosine similarity. Hope this helps.
// Lucene 4.2 example (imports added for completeness)
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.queries.mlt.MoreLikeThis;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class Main {

    public static void main(String[] args) throws IOException {
        Main m = new Main();
        m.init();
        m.writerEntries();
        m.findSimilar("doduck prototype");
    }

    private Directory indexDir;
    private StandardAnalyzer analyzer;
    private IndexWriterConfig config;

    public void init() throws IOException {
        analyzer = new StandardAnalyzer(Version.LUCENE_42);
        config = new IndexWriterConfig(Version.LUCENE_42, analyzer);
        config.setOpenMode(OpenMode.CREATE_OR_APPEND);
        indexDir = new RAMDirectory(); // do not write on disk
    }

    public void writerEntries() throws IOException {
        IndexWriter indexWriter = new IndexWriter(indexDir, config);
        indexWriter.commit();
        Document doc1 = createDocument("1", "doduck", "prototype your idea");
        Document doc2 = createDocument("2", "doduck", "love programming");
        Document doc3 = createDocument("3", "We do", "prototype");
        Document doc4 = createDocument("4", "We love", "challange");
        indexWriter.addDocument(doc1);
        indexWriter.addDocument(doc2);
        indexWriter.addDocument(doc3);
        indexWriter.addDocument(doc4);
        indexWriter.commit();
        indexWriter.forceMerge(100, true);
        indexWriter.close();
    }

    private Document createDocument(String id, String title, String content) {
        FieldType type = new FieldType();
        type.setIndexed(true);
        type.setStored(true);
        type.setStoreTermVectors(true); // term vectors are needed for MoreLikeThis
        Document doc = new Document();
        doc.add(new StringField("id", id, Store.YES));
        doc.add(new Field("title", title, type));
        doc.add(new Field("content", content, type));
        return doc;
    }

    private void findSimilar(String searchForSimilar) throws IOException {
        IndexReader reader = DirectoryReader.open(indexDir);
        IndexSearcher indexSearcher = new IndexSearcher(reader);
        MoreLikeThis mlt = new MoreLikeThis(reader);
        mlt.setMinTermFreq(0);
        mlt.setMinDocFreq(0);
        mlt.setFieldNames(new String[]{"title", "content"});
        mlt.setAnalyzer(analyzer);
        Reader sReader = new StringReader(searchForSimilar);
        Query query = mlt.like(sReader, null);
        TopDocs topDocs = indexSearcher.search(query, 10);
        for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
            Document aSimilar = indexSearcher.doc(scoreDoc.doc);
            String similarTitle = aSimilar.get("title");
            String similarContent = aSimilar.get("content");
            System.out.println("====similar found====");
            System.out.println("title: " + similarTitle);
            System.out.println("content: " + similarContent);
        }
    }
}
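For the cosine similarity route mentioned above, here is a minimal sketch in Python (scikit-learn is my choice for the illustration, not something the answer prescribes):
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = ['what is your name?', 'what is your profession?',
          'what do you want?', 'where are you?']

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(corpus)

def most_similar(query, top_n=3):
    scores = cosine_similarity(vectorizer.transform([query]), matrix)[0]
    ranked = sorted(zip(scores, corpus), reverse=True)
    return [sentence for score, sentence in ranked[:top_n] if score > 0]

print(most_similar('what is your'))  # most similar corpus sentences first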

Related

Extract paragraphs and CFI from EPUB

I'm trying to extract all paragraphs from an EPUB with their associated CFIs. I tried computing the CFIs myself, but the documentation is really hard to follow and implement. I'm primarily looking for a Python solution, but I can work with anything.
To be precise: I want to compute the CFI for every <p> inside every chapter.
Thank you in advance!
// Node.js solution using jsdom and epub.js's EpubCFI helper.
// Assumes epub_path points at an (already extracted) XHTML chapter file.
const fs = require("fs").promises;
const { JSDOM } = require("jsdom");
const { EpubCFI } = require("epubjs");

async function createCfisForEpubHtmlParagraphs(epub_path) {
    const data = await fs.readFile(epub_path, "utf8");
    const dom = new JSDOM(data);
    const paragraphs = dom.window.document.getElementsByTagName('p');
    const nodeRange = dom.window.document.createRange();
    const cfiRanges = [...Array(paragraphs.length).keys()].map((paragraphIndex) => {
        const pNode = paragraphs.item(paragraphIndex);
        nodeRange.setStart(pNode.firstChild, 0);
        try {
            nodeRange.setEnd(pNode.firstChild, pNode.firstChild.textContent.length);
        } catch (e) {
            nodeRange.setEnd(pNode.firstChild, 1);
        }
        // '/6/4' is the spine position prefix; adjust it per chapter
        const cfirange = new EpubCFI(nodeRange, '/6/4').toString();
        return { cfirange: cfirange, p_number: paragraphIndex };
    });
    return cfiRanges;
}
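Since the question asks primarily for Python, here is a rough, untested sketch of the same idea using lxml. It is my own adaptation: it computes only the element-path portion of the CFI (element children get even-numbered steps per the EPUB CFI spec), assumes the chapter XHTML has already been extracted from the EPUB archive, and ignores IDs and text offsets; the '/6/4' spine prefix is likewise an assumption to adjust per chapter.
from lxml import html

def cfi_steps(element):
    # Walk from the element up to the root, collecting even CFI steps
    steps = []
    node = element
    while node.getparent() is not None:
        parent = node.getparent()
        # element children map to even steps: 2 * (1-based position)
        steps.append(2 * (parent.index(node) + 1))
        node = parent
    return '/' + '/'.join(str(s) for s in reversed(steps))

def paragraph_cfis(xhtml_text, spine_prefix='/6/4'):
    tree = html.fromstring(xhtml_text)
    return [{'cfirange': 'epubcfi(%s!%s)' % (spine_prefix, cfi_steps(p)),
             'p_number': i}
            for i, p in enumerate(tree.iter('p'))]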

How to query a Google My Business API for Insights

I have created a report where I want to also include all the insights of a Google My Business account.
I have already been approved and have access to the GMB API with no problem. The only thing is, now that I have full access, how do I query it successfully so I can get insight information? I have access to a team that works with PHP or Python, so I wanted to see what I should give them so that they can start querying successfully. Can anyone help?
Download the PHP client library from here.
Here is a sample function to get location insights.
Parameters required:
locationNames should be provided as input
startTime and endTime: the maximum difference is 18 months
(e.g. 2020-01-01T15:01:23Z, 2021-01-01T15:01:23Z)
public function getLocationInsights($accountName, $parameters) {
    // Replace getClientService with a method that has an access token
    $service = $this->getClientService();
    $insightReqObj = new Google_Service_MyBusiness_ReportLocationInsightsRequest();
    $locationNames = $parameters['locationNames'];
    // At least one location is mandatory (and at most 10 are allowed)
    if ($locationNames && is_array($locationNames) && count($locationNames) <= 10) {
        $insightReqObj->setLocationNames($locationNames);
    }
    $basicReqObj = new Google_Service_MyBusiness_BasicMetricsRequest();
    // The datetime range is mandatory
    // TODO :: validate to not allow more than 18 months difference
    $timeRangObj = new Google_Service_MyBusiness_TimeRange();
    $timeRangObj->setStartTime($parameters['startTime']);
    $timeRangObj->setEndTime($parameters['endTime']);
    $metricReqObj = new Google_Service_MyBusiness_MetricRequest();
    $metricReqObj->setMetric('ALL');
    $basicReqObj->setMetricRequests(array($metricReqObj));
    $basicReqObj->setTimeRange($timeRangObj);
    $insightReqObj->setBasicRequest($basicReqObj);
    $allInsights = $service->accounts_locations->reportInsights($accountName, $insightReqObj);
    return $allInsights;
}
I work with Java to do the same thing. Mine looks something like this:
ReportLocationInsightsRequest content = new ReportLocationInsightsRequest();
content.setFactory(JSON_FACTORY);
BasicMetricsRequest basicRequest = new BasicMetricsRequest();
// setLocationNames expects a list, e.g. Arrays.asList(...)
content.setLocationNames(Arrays.asList("your locationName"));
List<MetricRequest> metricRequests = new ArrayList<MetricRequest>();
MetricRequest metricR = new MetricRequest();
String metric = "ALL";
metricR.setMetric(metric);
metricRequests.add(metricR);
// attach the metric requests to the basic request
basicRequest.setMetricRequests(metricRequests);
TimeRange timeRange = new TimeRange();
timeRange.setStartTime("Desired startTime");
timeRange.setEndTime("Desired endTime");
basicRequest.setTimeRange(timeRange);
content.setBasicRequest(basicRequest);
try {
    MyBusiness.Accounts.Locations.ReportInsights locationReportInsight =
        mybusiness.accounts().locations().reportInsights(accountName, content);
    ReportLocationInsightsResponse response = locationReportInsight.execute();
    System.out.println("response is = " + response.toPrettyString());
} catch (Exception e) {
    System.out.println(e);
}
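Since the question also asks about Python, here is a rough Python sketch against the raw REST endpoint (untested; the URL and field names reflect my reading of the v4 API, which Google has since deprecated, so double-check them):
import requests

def get_location_insights(account_name, location_names,
                          start_time, end_time, access_token):
    # account_name looks like 'accounts/{accountId}'
    url = ('https://mybusiness.googleapis.com/v4/%s/locations:reportInsights'
           % account_name)
    body = {
        'locationNames': location_names,  # at most 10 locations per request
        'basicRequest': {
            'metricRequests': [{'metric': 'ALL'}],
            # startTime/endTime may differ by at most 18 months
            'timeRange': {'startTime': start_time, 'endTime': end_time},
        },
    }
    resp = requests.post(url, json=body,
                         headers={'Authorization': 'Bearer ' + access_token})
    resp.raise_for_status()
    return resp.json()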

Crypto++ Signing with PKCS1v15 padding with different algorithms

I am wondering: does Crypto++ only sign files using SHA (RSASSA_PKCS1v15_SHA_Signer)?
I have been using pyCryptodome to do the signing and verifying, but I want to make a C++ application that does the same. In Python, I can sign the files with any of the supported hashing (SHA3/BLAKE2B etc..) algorithms. At the very least I want to support signing using SHA256 in C++.
std::string Hasher::sign(const std::string& message)
{
    // rng is assumed to be an AutoSeededRandomPool member of Hasher
    // (its declaration is not shown in this snippet)
    RSASSA_PKCS1v15_SHA_Signer signer(m_privateKey);
    size_t length = signer.MaxSignatureLength();
    SecByteBlock signature(length);
    length = signer.SignMessage(rng, (const CryptoPP::byte*)message.c_str(), message.size(), signature);
    signature.resize(length);
    // Want the signature as hex
    return toHex(signature, signature.size());
}
However, I want to be able to do something similar to what I do in Python:
def sign(message, key, passphrase, hashType):
    rsakey = RSA.importKey(key, passphrase)
    signer = PKCS1_v1_5.new(rsakey)
    # Get the hash object based on the user-given hashType,
    # e.g. "SHA256" returns SHA256.new()
    digest = getHash(hashType)
    digest.update(message.encode("ascii"))
    return signer.sign(digest).hex()
If I choose the same private key and use the hashType "SHA", I get the same signature result as my C++ code produces.
So I found an answer myself, partially following the tutorial found here.
Starting from RSASS<PSS, SHA256>::Signer signer(privateKey); you can switch to PKCS1v15 padding like this: RSASS<PKCS1v15, SHA256>::Signer signer(privateKey);
So the final code would look something like this:
std::string Hasher::sign(const std::string& message)
{
    CryptoPP::RSASS<CryptoPP::PKCS1v15, CryptoPP::SHA256>::Signer signer(m_privateKey);
    size_t length = signer.MaxSignatureLength();
    SecByteBlock signature(length);
    length = signer.SignMessage(rng, (const CryptoPP::byte*)message.c_str(), message.size(), signature);
    signature.resize(length);
    // Want the signature as hex
    return toHex(signature, signature.size());
}
Super similar; I still don't know why the RSASSA_PKCS1v15_SHA_Signer class still exists. To throw in something extra, here is my final goal, since I could find nothing on it except a random 13-year-old question on a mailing list:
How to sign and verify a file with Crypto++ (like you sign files in pyCrypto).
Signing a file:
std::string Hasher::sign()
{
    CryptoPP::RSASS<CryptoPP::PKCS1v15, CryptoPP::SHA256>::Signer signer(m_privateKey);
    size_t length = signer.MaxSignatureLength();
    SecByteBlock signature(length);
    // Get the bytes of the file
    const auto& data = getFileBytes();
    // AFAIK you need to create a signer PK_MessageAccumulator object to achieve this
    CryptoPP::PK_MessageAccumulator* pSigMsgAcc = signer.NewSignatureAccumulator(rng);
    // Update it with your data; toBinary_const is just short for (const CryptoPP::byte*)data.data()
    pSigMsgAcc->Update(toBinary_const(data), data.size());
    length = signer.Sign(rng, pSigMsgAcc, signature);
    signature.resize(length);
    // I return it as hex so people can actually read the signature and verify it themselves
    return toHex(signature, signature.size());
}
Verifying a file
bool Hasher::verify(const std::string& signature, const std::string& type)
{
    CryptoPP::RSASS<CryptoPP::PKCS1v15, CryptoPP::SHA256>::Verifier verifier(m_publicKey);
    const auto& data = getFileBytes();
    // Create a verifier PK_MessageAccumulator object
    CryptoPP::PK_MessageAccumulator* pVrfyMsgAcc = verifier.NewVerificationAccumulator();
    pVrfyMsgAcc->Update(toBinary_const(data), data.size());
    // Here comes the downside of using hex as the signature: I convert it to a binary string
    auto toVerify = toBinaryString(signature);
    // Then I have to convert the binary string to binary
    // (it does not like me converting the hex directly to binary)
    verifier.InputSignature(*pVrfyMsgAcc, toBinary_const(toVerify), toVerify.size());
    return verifier.Verify(pVrfyMsgAcc);
}
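For completeness, here is a pyCryptodome sketch (my own, untested) that should verify the hex signature the C++ code above produces:
from Crypto.PublicKey import RSA
from Crypto.Signature import PKCS1_v1_5
from Crypto.Hash import SHA256

def verify_file(path, hex_signature, public_key_pem):
    key = RSA.importKey(public_key_pem)
    digest = SHA256.new()
    with open(path, 'rb') as f:
        digest.update(f.read())
    # verify() returns True/False in pyCryptodome's legacy PKCS#1 v1.5 API
    return PKCS1_v1_5.new(key).verify(digest, bytes.fromhex(hex_signature))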

Conversion of Text sentences to CONLL Format

I want to convert normal English text into CoNLL-U format for MaltParser, to find dependencies in the text, in Python. I tried it in Java but failed; below is the format I'm looking for:
String[] tokens = new String[11];
tokens[0] = "1\thappiness\t_\tN\tNN\tDD|SS\t2\tSS";
tokens[1] = "2\tis\t_\tV\tVV\tPS|SM\t0\tROOT";
tokens[2] = "3\tthe\t_\tAB\tAB\tKS\t2\t+A";
tokens[3] = "4\tkey\t_\tPR\tPR\t_\t2\tAA";
tokens[4] = "5\tof\t_\tN\tEN\t_\t7\tDT";
tokens[5] = "6\tsuccess\t_\tP\tTP\tPA\t7\tAT";
tokens[6] = "7\tin\t_\tN\tNN\t_\t4\tPA";
tokens[7] = "8\tthis\t_\tPR\tPR\t_\t7\tET";
tokens[8] = "9\tlife\t_\tR\tRO\t_\t10\tDT";
tokens[9] = "10\tfor\t_\tN\tNN\t_\t8\tPA";
tokens[10] = "11\tsure\t_\tP\tIP\t_\t2\tIP";
I have tried it in Java, but I cannot use the Stanford APIs; I want the same in Python.
// This is the Java example, but here the tokens need to be produced by code, not written manually:
MaltParserService service = new MaltParserService(true);
// Create a string array of tokens in the CoNLL data format.
String[] tokens = new String[11];
tokens[0] = "1\thappiness\t_\tN\tNN\tDD|SS\t2\tSS";
tokens[1] = "2\tis\t_\tV\tVV\tPS|SM\t0\tROOT";
tokens[2] = "3\tthe\t_\tAB\tAB\tKS\t2\t+A";
tokens[3] = "4\tkey\t_\tPR\tPR\t_\t2\tAA";
tokens[4] = "5\tof\t_\tN\tEN\t_\t7\tDT";
tokens[5] = "6\tsuccess\t_\tP\tTP\tPA\t7\tAT";
tokens[6] = "7\tin\t_\tN\tNN\t_\t4\tPA";
tokens[7] = "8\tthis\t_\tPR\tPR\t_\t7\tET";
tokens[8] = "9\tlife\t_\tR\tRO\t_\t10\tDT";
tokens[9] = "10\tfor\t_\tN\tNN\t_\t8\tPA";
tokens[10] = "11\tsure\t_\tP\tIP\t_\t2\tIP";
// Print out the string array
for (int i = 0; i < tokens.length; i++) {
    System.out.println(tokens[i]);
}
// Reads the data format specification file
DataFormatSpecification dataFormatSpecification = service.readDataFormatSpecification(args[0]);
// Use the data format specification file to build a dependency structure based on the string array
DependencyStructure graph = service.toDependencyStructure(tokens, dataFormatSpecification);
// Print the dependency structure
System.out.println(graph);
There is now a port of the Stanford library to Python (with improvements) called Stanza. You can find it here: https://stanfordnlp.github.io/stanza/
Example of usage:
>>> import stanza
>>> stanza.download('en') # download English model
>>> nlp = stanza.Pipeline('en') # initialize English neural pipeline
>>> doc = nlp("Barack Obama was born in Hawaii.") # run annotation over a sentence
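From there you can inspect the dependency parse, and recent Stanza versions ship a CoNLL-U converter in stanza.utils.conll (a sketch; the helper's exact name may vary by version, so check the docs):
>>> doc.sentences[0].print_dependencies()  # show the dependency parse
>>> from stanza.utils.conll import CoNLL
>>> CoNLL.write_doc2conll(doc, "output.conllu")  # CoNLL-U file for MaltParser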

Python convert C header file to dict

I have a C header file which contains a series of classes, and I'm trying to write a function which will take those classes and convert them to a Python dict. A sample of the file is at the bottom.
The format is something like:
class CFGFunctions {
    class ABC {
        class AA {
            file = "abc/aa/functions"
            class myFuncName{ recompile = 1; };
        };
        class BB
        {
            file = "abc/bb/functions"
            class funcName{
                recompile=1;
            }
        }
    };
};
I'm hoping to turn it into something like
{CFGFunctions:{ABC:{AA:"myFuncName"}, BB:...}}
# Or
{CFGFunctions:{ABC:{AA:{myFuncName:"string or list or something"}, BB:...}}}
In the end, I'm aiming to get the filepath string (which is actually a path to a folder... but anyway), and the class names in the same class as the file/folder path.
I've had a look on SO and Google and so on, but most things I've found have been about splitting lines into dicts, rather than n-deep 'blocks'.
I know I'll have to loop through the file; however, I'm not sure of the most efficient way to convert it to the dict.
I'm thinking I'd need to grab the outside class and its relevant brackets, then do the same for the text remaining inside.
If none of that makes sense, it's because I haven't quite made sense of the process myself, haha.
If any more info is needed, I'm happy to provide.
The following code is a quick mockup of what I'm sort of thinking...
It is most likely BROKEN and probably does NOT WORK, but it's roughly the process that I'm thinking of:
import re

def get_data():
    fh = open('CFGFunctions.h', 'r')
    data = {}  # will contain final data model
    # would probably refactor some of this into a function to allow better looping
    start = ""  # starting class name
    brackets = 0  # number of brackets
    text = ""  # temp storage for lines inside block while looping
    for line in fh:
        # find the class (start)
        mt = re.search(r'class ([\w_]+)', line)
        if mt:
            if start == "":
                start = mt.group(1)
        else:
            # once we have the first class, find all other open brackets
            mt = re.search(r'{', line)
            if mt:
                # and inc our counter
                brackets += 1
            mt2 = re.search(r'}', line)
            if mt2:
                # find the close, and decrement
                brackets -= 1
                # if we are back to the initial block, break out of the loop
                if brackets == 0:
                    break
        text += line
    data[start] = {'tempText': text}
    return data
Sample file:
class CfgFunctions {
    class ABC {
        class Control {
            file = "abc\abc_sys_1\Modules\functions";
            class assignTracker {
                description = "";
                recompile = 1;
            };
            class modulePlaceMarker {
                description = "";
                recompile = 1;
            };
        };
        class Devices
        {
            file = "abc\abc_sys_1\devices\functions";
            class registerDevice { recompile = 1; };
            class getDeviceSettings { recompile = 1; };
            class openDevice { recompile = 1; };
        };
    };
};
EDIT:
If possible, if I have to use a package, I'd like to have it in the program's directory, not the general Python libs directory.
As you detected, parsing is necessary to do the conversion. Have a look at the package PyParsing, which is a fairly easy-to-use library for implementing parsing in your Python program.
Edit: This is a very schematic version of what it would take to recognize a minimalistic grammar - somewhat like the example at the top of the question. It is only a starting point, but it might put you in the right direction:
from pyparsing import (Forward, Group, Keyword, Literal, Optional,
                       ParseException, Word, ZeroOrMore, alphanums,
                       alphas, quotedString)

test_code = """
class CFGFunctions {
    class ABC {
        class AA {
            file = "abc/aa/functions"
            class myFuncName{ recompile = 1; };
        };
        class BB
        {
            file = "abc/bb/functions"
            class funcName{
                recompile=1;
            }
        }
    };
};
"""

class_tkn = Keyword('class')
lbrace_tkn = Literal('{')
rbrace_tkn = Literal('}')
semicolon_tkn = Literal(';')
assign_tkn = Literal('=')

identifier = Word(alphas, alphanums + '_')
# an assignment such as: file = "abc/aa/functions"  or  recompile = 1;
assignment = Group(identifier + assign_tkn +
                   (quotedString | Word(alphanums)) +
                   Optional(semicolon_tkn))

class_block = Forward()  # Forward() because the rule refers to itself
class_block <<= (class_tkn + identifier + lbrace_tkn +
                 ZeroOrMore(class_block | assignment) +
                 rbrace_tkn + Optional(semicolon_tkn))

def test_parser(test):
    try:
        results = class_block.parseString(test)
        print(test, ' -> ', results)
    except ParseException as s:
        print("Syntax error:", s)

def main():
    test_parser(test_code)
    return 0

if __name__ == '__main__':
    main()
Also, this code is only the parser - it does not generate any output. As you can see in the PyParsing docs, you can later add the actions you want. But the first step is to recognize what you want to translate.
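For instance, you could replace the class_block definition above with a Group-ed version (an untested sketch of mine): each class then nests inside the parse results, which already resembles the dict from the question.
# nest each class in its own group within the parse results
class_block <<= Group(class_tkn + identifier + lbrace_tkn +
                      ZeroOrMore(class_block | assignment) +
                      rbrace_tkn + Optional(semicolon_tkn))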
And a last note: Do not underestimate the complexities of parsing code... Even with a library like PyParsing, which takes care of much of the work, there are many ways to get mired in infinite loops and other amenities of parsing. Implement things step-by-step!
EDIT: A few sources for information on PyParsing are:
http://werc.engr.uaf.edu/~ken/doc/python-pyparsing/HowToUsePyparsing.html
http://pyparsing.wikispaces.com/
(Particularly interesting is http://pyparsing.wikispaces.com/Publications, with a long list of articles - several of them introductory - on PyParsing)
http://pypi.python.org/pypi/pyparsing_helper is a GUI for debugging parsers
There is also a 'pyparsing' tag here on Stack Overflow, where Paul McGuire (the PyParsing author) seems to be a frequent guest.
NOTE:
From PaulMcG in the comments below: Pyparsing is no longer hosted on wikispaces.com. Go to github.com/pyparsing/pyparsing
