Search txt file and display data between all occurrences - python

Extremely new to Python. I started off writing a PHP script to find all occurrences of 2 strings in a txt file but it's using too much memory so i read Python would be better.
Basically what i need to do is:
- import a txt file
- go through it and return all the data between the tags below
- remove any duplicates
- output the results
The bits i'm looking for look like this:
------DATA--------------------------------------------------
DATA TO SHOW
------------------------------------------------------------
Naturally, the important bit to output is the DATA TO SHOW part.
Any help would be appreciated :)
Thanks
UPDATE -----------------------
import re
inputFile = open("small.txt", "r")
output = open("result.txt", "w")
searchStart = "----- ASSERT --------------------------------------------------------------------------------"
searchEnd = "---------------------------------------------------------------------------------------------"
match = re.findall('^----- ASSERT --------------------------------------------------------------------------------\n(.*?)---------------------------------------------------------------------------------------------', inputFile.read(), re.MULTILINE)
print match,
Any ideas how to get it to show all the lines until it hits the searchEnd tag? Example data:
----- ASSERT --------------------------------------------------------------------------------
MORE
INFO
THAT
I
NEED
TO
GET
FROM
THE
FILE
---------------------------------------------------------------------------------------------

An example with php (not tested, the idea is here) :
$handle = fopen("inputfile.txt", "r");
if ($handle) {
$record = false;
while (($line = fgets($handle)) !== false) {
if ($line == '------DATA--------------------------------------------------') {
$record = true;
$temp = '';
} elseif ($record) {
if ($line == '------------------------------------------------------------') {
$record = false;
$results[] = $temp;
$temp = '';
} else $temp .= $line;
}
}
} else {
echo 'Gargoyl, the file can\'t be opened!';
}
fclose($handle);
print_r($results);

Related

Wordpress Password Saving Functions To Python

3 weeks back I posted a question on here to understand how WordPress saves passwords to db. Mystical suggested I look at the source code, and I tried to but I am not too good with php so I am trying to convert relevant functions to python. Here is what I have so far:
Python:
import base64
from email.encoders import encode_base64
from hashlib import md5
prefix = '$P$B'
salt = 'KcFRBGXE'
password = '^zVw*wSFshV2' #password i enter to login
real_hashed_pass = '$P$BKcFRBGXEWOVYQShBC1edT7f3e3Nca1' #this is stored in wp db
hashed_pass = md5((salt + password).encode('utf-8')).hexdigest()
for i in range(8193):
hashed_pass = md5((hashed_pass + password).encode('utf-8')).hexdigest()
# for i in range(17):
# hashed_pass = base64.standard_b64encode(hashed_pass)
hashed_pass = prefix + salt + hashed_pass
print(hashed_pass == real_hashed_pass)
Relevant PHP (full code):
<?php
class PasswordHash {
var $itoa64;
var $iteration_count_log2;
var $portable_hashes;
var $random_state;
function __construct($iteration_count_log2, $portable_hashes)
{
$this->itoa64 = './0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz';
if ($iteration_count_log2 < 4 || $iteration_count_log2 > 31)
$iteration_count_log2 = 8;
$this->iteration_count_log2 = $iteration_count_log2;
$this->portable_hashes = $portable_hashes;
$this->random_state = microtime();
if (function_exists('getmypid'))
$this->random_state .= getmypid();
}
function encode64($input, $count)
{
$output = '';
$i = 0;
do {
$value = ord($input[$i++]);
$output .= $this->itoa64[$value & 0x3f];
if ($i < $count)
$value |= ord($input[$i]) << 8;
$output .= $this->itoa64[($value >> 6) & 0x3f];
if ($i++ >= $count)
break;
if ($i < $count)
$value |= ord($input[$i]) << 16;
$output .= $this->itoa64[($value >> 12) & 0x3f];
if ($i++ >= $count)
break;
$output .= $this->itoa64[($value >> 18) & 0x3f];
} while ($i < $count);
return $output;
}
function crypt_private($password, $setting)
{
$output = '*0';
if (substr($setting, 0, 2) === $output)
$output = '*1';
$id = substr($setting, 0, 3);
# We use "$P$", phpBB3 uses "$H$" for the same thing
if ($id !== '$P$' && $id !== '$H$')
return $output;
$count_log2 = strpos($this->itoa64, $setting[3]);
if ($count_log2 < 7 || $count_log2 > 30)
return $output;
$count = 1 << $count_log2;
$salt = substr($setting, 4, 8);
if (strlen($salt) !== 8)
return $output;
# We were kind of forced to use MD5 here since it's the only
# cryptographic primitive that was available in all versions
# of PHP in use. To implement our own low-level crypto in PHP
# would have resulted in much worse performance and
# consequently in lower iteration counts and hashes that are
# quicker to crack (by non-PHP code).
$hash = md5($salt . $password, TRUE);
do {
$hash = md5($hash . $password, TRUE);
} while (--$count);
$output = substr($setting, 0, 12);
$output .= $this->encode64($hash, 16);
return $output;
}
function CheckPassword($password, $stored_hash)
{
if ( strlen( $password ) > 4096 ) {
return false;
}
$hash = $this->crypt_private($password, $stored_hash);
if ($hash[0] === '*')
$hash = crypt($password, $stored_hash);
# This is not constant-time. In order to keep the code simple,
# for timing safety we currently rely on the salts being
# unpredictable, which they are at least in the non-fallback
# cases (that is, when we use /dev/urandom and bcrypt).
return $hash === $stored_hash;
}
}
My goal is to have the python code produce the same hashed password as the wordpress code. I think the error in the python code is at the commented out loop but I am not sure how to fix it.
Thank you for the help!
UPDATE
When someone enters their passcode you hash it.
$hash1 = hash('ripemd320',$passcode);
$sql = "SELECT `hash` FROM `Client` WHERE `Number` = $client LIMIT 1";
$results = mysqli_query($link,$sql);
list($hash2) = mysqli_fetch_array($results, MYSQLI_NUM);
if($hash1 == $hash2){unlock the pearly gates;}
END OF UPDATE
No 0one should ever save a password in a db. So when you ask "how WordPress saves passwords to db" the answer is they do not.
Are you working ona Word Press add on or do you want to "save passwords" in the same manner as WP?
Word Press is not the place to copy any PHP coding techniques.
When you bring Python into the equation I have to think you are not working with WP but want to do it the same was as WP. That would be a bad idea.
Passwords are not that complicated. And it only takes a few lines of code.
When the password is created you save the hash in the user's table. When they login you get the hash from the table, hash the password given and compare the two.
I recommend using only numerical usernames. Then when you get the username you convert it to an integer and SQL injection is impossible.

Python InDesign scripting: Get overflowing textbox from preflight for automatic resizing

Thanks to this great answer I was able to figure out how to run a preflight check for my documents using Python and the InDesign script API. Now I wanted to work on automatically adjusting the text size of the overflowing text boxes, but was unable to figure out how to retrieve a TextBox object from the Preflight object.
I referred to the API specification, but all the properties only seem to yield strings which do not uniquely define the TextBoxes, like in this example:
Errors Found (1):
Text Frame (R=2)
Is there any way to retrieve the violating objects from the Preflight, in order to operate on them later on? I'd be very thankful for additional input on this matter, as I am stuck!
If all you need is to find and to fix the overset errors I'd propose this solution:
Here is the simple Extendscript to fix the text overset error. It decreases the font size in the all overflowed text frames in active document:
var doc = app.activeDocument;
var frames = doc.textFrames.everyItem().getElements();
var f = frames.length
while(f--) {
var frame = frames[f];
if (frame.overflows) resize_font(frame)
}
function resize_font(frame) {
app.scriptPreferences.enableRedraw = false;
while (frame.overflows) {
var texts = frame.parentStory.texts.everyItem().getElements();
var t = texts.length;
while(t--) {
var characters = texts[t].characters.everyItem().getElements();
var c = characters.length;
while (c--) characters[c].pointSize = characters[c].pointSize * .99;
}
}
app.scriptPreferences.enableRedraw = true;
}
You can save it in any folder and run it by the Python script:
import win32com.client
app = win32com.client.Dispatch('InDesign.Application.CS6')
doc = app.Open(r'd:\temp\test.indd')
profile = app.PreflightProfiles.Item('Stackoverflow Profile')
print('Profile name:', profile.name)
process = app.PreflightProcesses.Add(doc, profile)
process.WaitForProcess()
errors = process.processResults
print('Errors:', errors)
if errors[:4] != 'None':
script = r'd:\temp\fix_overset.jsx' # <-- here is the script to fix overset
print('Run script', script)
app.DoScript(script, 1246973031) # run the jsx script
# 1246973031 --> ScriptLanguage.JAVASCRIPT
# https://www.indesignjs.de/extendscriptAPI/indesign-latest/#ScriptLanguage.html
process = app.PreflightProcesses.Add(doc, profile)
process.WaitForProcess()
errors = process.processResults
print('Errors:', errors) # it should print 'None'
if errors[:4] == 'None':
doc.Save()
doc.Close()
input('\nDone... Press <ENTER> to close the window')
Thanks to the exellent answer of Yuri I was able solve my problem, although there are still some shortcomings.
In Python, I load my documents and check if there are any problems detected during the preflight. If so, I move on to adjusting the text frames.
myDoc = app.Open(input_file_path)
profile = app.PreflightProfiles.Item(1)
process = app.PreflightProcesses.Add(myDoc, profile)
process.WaitForProcess()
results = process.processResults
if "None" not in results:
# Fix errors
script = open("data/script.jsx")
app.DoScript(script.read(), 1246973031, variables.resize_array)
process.WaitForProcess()
results = process.processResults
# Check if problems were resolved
if "None" not in results:
info_fail(card.name, "Error while running preflight")
myDoc.Close(1852776480)
return FLAG_PREFLIGHT_FAIL
I load the JavaScript file stored in script.jsx, that consists of several components. I start by extracting the arguments and loading all the pages, since I want to handle them individually. I then collect all text frames on the page in an array.
var doc = app.activeDocument;
var pages = doc.pages;
var resizeGroup = arguments[0];
var condenseGroup = arguments[1];
// Loop over all available pages separately
for (var pageIndex = 0; pageIndex < pages.length; pageIndex++) {
var page = pages[pageIndex];
var pageItems = page.allPageItems;
var textFrames = [];
// Collect all TextFrames in an array
for (var pageItemIndex = 0; pageItemIndex < pageItems.length; pageItemIndex++) {
var candidate = pageItems[pageItemIndex];
if (candidate instanceof TextFrame) {
textFrames.push(candidate);
}
}
What I wanted to achieve was a setting where if one of a group of text frames was overflowing, the text size of all the text frames in this group are adjusted as well. E.g. text frame 1 overflows when set to size 8, no longer when set to size 6. Since text frame 1 is in the same group as text frame 2, both of them will be adjusted to size 6 (assuming the second frame does not overflow at this size).
In order to handle this, I pass an array containing the groups. I now check if the text frame is contained in one of these groups (which is rather tedious, I had to write my own methods since InDesign does not support modern functions like filter() as far as I am concerned...).
// Check if TextFrame overflows, if so add all TextFrames that should be the same size
for (var textFrameIndex = 0; textFrameIndex < textFrames.length; textFrameIndex++) {
var textFrame = textFrames[textFrameIndex];
// If text frame overflows, adjust it and all the frames that are supposed to be of the same size
if (textFrame.overflows) {
var foundResizeGroup = filterArrayWithString(resizeGroup, textFrame.name);
var foundCondenseGroup = filterArrayWithString(condenseGroup, textFrame.name);
var process = false;
var chosenGroup, type;
if (foundResizeGroup.length > 0) {
chosenGroup = foundResizeGroup;
type = "resize";
process = true;
} else if (foundCondenseGroup.length > 0) {
chosenGroup = foundCondenseGroup;
type = "condense";
process = true;
}
if (process) {
var foundFrames = findTextFramesFromNames(textFrames, chosenGroup);
adjustTextFrameGroup(foundFrames, type);
}
}
}
If this is the case, I adjust either the text size or the second axis of the text (which condenses the text for my variable font). This is done using the following functions:
function adjustTextFrameGroup(resizeGroup, type) {
// Check if some overflowing textboxes
if (!someOverflowing(resizeGroup)) {
return;
}
app.scriptPreferences.enableRedraw = false;
while (someOverflowing(resizeGroup)) {
for (var textFrameIndex = 0; textFrameIndex < resizeGroup.length; textFrameIndex++) {
var textFrame = resizeGroup[textFrameIndex];
if (type === "resize") decreaseFontSize(textFrame);
else if (type === "condense") condenseFont(textFrame);
else alert("Unknown operation");
}
}
app.scriptPreferences.enableRedraw = true;
}
function someOverflowing(textFrames) {
for (var textFrameIndex = 0; textFrameIndex < textFrames.length; textFrameIndex++) {
var textFrame = textFrames[textFrameIndex];
if (textFrame.overflows) {
return true;
}
}
return false;
}
function decreaseFontSize(frame) {
var texts = frame.parentStory.texts.everyItem().getElements();
for (var textIndex = 0; textIndex < texts.length; textIndex++) {
var characters = texts[textIndex].characters.everyItem().getElements();
for (var characterIndex = 0; characterIndex < characters.length; characterIndex++) {
characters[characterIndex].pointSize = characters[characterIndex].pointSize - 0.25;
}
}
}
function condenseFont(frame) {
var texts = frame.parentStory.texts.everyItem().getElements();
for (var textIndex = 0; textIndex < texts.length; textIndex++) {
var characters = texts[textIndex].characters.everyItem().getElements();
for (var characterIndex = 0; characterIndex < characters.length; characterIndex++) {
characters[characterIndex].setNthDesignAxis(1, characters[characterIndex].designAxes[1] - 5)
}
}
}
I know that this code can be improved upon (and am open to feedback), for example if a group consists of multiple text frames, the procedure will run for all of them, even though it need only be run once. I was getting pretty frustrated with the old JavaScript, and the impact is negligible. The rest of the functions are also only helper functions, which I'd like to replace with more modern version. Sadly and as already stated, I think that they are simply not available.
Thanks once again to Yuri, who helped me immensely!

How to escape only delimiter and not the newline character in CSV

I am receiving normal comma delimited CSV files with data having new line character.
Input data
I want to convert the input data to:
Pipe (|) delimited
Without any quotes to escape (" or ')
Pipe (|) within data escaped with a caret (^) character
My file may also contain multiple lines on data (or data in newline in a single row).
Expected output data
Output file I was able to generate.
As you can see in the image that caret (^) perfectly escaped all pipes (|) in data, but also escaping the newline character in 5th and 6th line, which I don't want.
NOTE: All the carriage returns (\r, or CR) and newline (\n, LF) characters should be as it is just like shown in images.
import csv
import sys
inputPath = sys.argv[1]
outputPath = sys.argv[2]
with open(inputPath, encoding="utf-8") as inputFile:
with open(outputPath, 'w', newline='', encoding="utf-8") as outputFile:
reader = csv.DictReader(inputFile, delimiter=',')
writer = csv.DictWriter(
outputFile, reader.fieldnames, delimiter='|', quoting=csv.QUOTE_NONE, escapechar='^', doublequote=False, quotechar="")
writer.writeheader()
writer.writerows(reader)
print("Formationg complete.")
The above code has been written in Python, it would be great if I can get help in Python.
Answers in other programming languages also accepted.
There is more than 8 million records
Please find below some sample data:
"VENDOR ID","VENDOR NAME","ORGANIZATION NUMBER","ADDRESS 1","CITY","COUNTRY","ZIP","PRIMARY PHONE","FAX","EMAIL","LMS RECORD CREATED DATE","LMS RECORD MODIFY DATE","DELETE FLAG","LMS RECORD ID"
"a0E6D000001Fag8UAC","Test 'Vendor' 1","","This Vendor contains a single (') quote.","","","","","","test#test.com","2020-4-1 06:32:29","2020-4-1 06:34:43","false",""
"a0E6D000001FagDUAS","Test ""Vendor"" 2","","This Vendor contains a double("") quote.","","","","","","test#test.com","2020-4-1 06:33:38","2020-4-1 06:35:18","false",""
"a0E6D000001FagIUAS","Test Vendor | 3","","This Vendor contains a Pipe (|).","","","","","","test#test.com","2020-4-1 06:38:45","2020-4-1 06:38:45","false",""
"a0E6D000001FagNUAS","Test Vendor 4","","This Vendor contains a
carriage return, i.e
data in new line.","","","","","","test#test.com","2020-4-1 06:43:08","2020-4-1 06:43:08","false",""
NOTE: If you copy above data, please make sure that 5th and 6th line should end with only LF (i.e New Line, \n) just like shown in images, or else please try to replicate those 2 line as that's what this question is all about not escaping those 2 lines specificaly, as highlighted in the image below.
The above code is the final out come of all my findings on internet. I've even tried pandas library and it's final output is same as well.
The code below is just an alternate way to get my expected output, but still the issue exists as this script takes forever (more than 12 hours) to complete (and still not finishes, ultimately I have to kill the process) when ran on 9 Millions of records.
Batch wrapper for VBS code:
0</* :
#echo off
cscript /nologo /E:jscript "%~f0" %*
exit /b %errorlevel% */0;
var ARGS = WScript.Arguments;
if (ARGS.Length < 3 ) {
WScript.Echo("Wrong arguments");
WScript.Echo(WScript.ScriptName + " path_to_file search replace [search replace[search replace [...]]]");
WScript.Echo(WScript.ScriptName + " e?path_to_file search replace [search replace[search replace [...]]]");
WScript.Echo("if filename starts with \"e?\" search and replace string will be evaluated for special characters ")
WScript.Quit(1);
}
if (ARGS.Item(0).toLowerCase() == "-help" || ARGS.Item(0).toLowerCase() == "-h") {
WScript.Echo(WScript.ScriptName + " path_to_file search replace [search replace[search replace [...]]]");
WScript.Echo(WScript.ScriptName + " e?path_to_file search replace [search replace[search replace [...]]]");
WScript.Echo("if filename starts with \"e?\" search and replace string will be evaluated for special characters ")
WScript.Quit(0);
}
if (ARGS.Length % 2 !== 1 ) {
WScript.Echo("Wrong arguments");
WScript.Quit(2);
}
var jsEscapes = {
'n': '\n',
'r': '\r',
't': '\t',
'f': '\f',
'v': '\v',
'b': '\b'
};
//string evaluation
//http://stackoverflow.com/questions/24294265/how-to-re-enable-special-character-sequneces-in-javascript
function decodeJsEscape(_, hex0, hex1, octal, other) {
var hex = hex0 || hex1;
if (hex) { return String.fromCharCode(parseInt(hex, 16)); }
if (octal) { return String.fromCharCode(parseInt(octal, 8)); }
return jsEscapes[other] || other;
}
function decodeJsString(s) {
return s.replace(
// Matches an escape sequence with UTF-16 in group 1, single byte hex in group 2,
// octal in group 3, and arbitrary other single-character escapes in group 4.
/\\(?:u([0-9A-Fa-f]{4})|x([0-9A-Fa-f]{2})|([0-3][0-7]{0,2}|[4-7][0-7]?)|(.))/g,
decodeJsEscape);
}
function convertToPipe(find, replace, str) {
return str.replace(new RegExp('\\|','g'),"^|");
}
function removeStartingQuote(find, replace, str) {
return str.replace(new RegExp('^"', 'g'), '');
}
function removeEndQuote(find, replace, str) {
return str.replace(new RegExp('"\r\n$', 'g'), '\r\n');
}
function removeLeadingAndTrailingQuotes(find, replace, str) {
return str.replace(new RegExp('"\r\n"', 'g'), '\r\n');
}
function replaceDelimiter(find, replace, str) {
return str.replace(new RegExp('","', 'g'), '|');
}
function convertSFDCDoubleQuotes(find, replace, str) {
return str.replace(new RegExp('""', 'g'), '"');
}
function getContent(file) {
// :: http://www.dostips.com/forum/viewtopic.php?f=3&t=3855&start=15&p=28898 ::
var ado = WScript.CreateObject("ADODB.Stream");
ado.Type = 2; // adTypeText = 2
ado.CharSet = "iso-8859-1"; // code page with minimum adjustments for input
ado.Open();
ado.LoadFromFile(file);
var adjustment = "\u20AC\u0081\u201A\u0192\u201E\u2026\u2020\u2021" +
"\u02C6\u2030\u0160\u2039\u0152\u008D\u017D\u008F" +
"\u0090\u2018\u2019\u201C\u201D\u2022\u2013\u2014" +
"\u02DC\u2122\u0161\u203A\u0153\u009D\u017E\u0178" ;
var fs = new ActiveXObject("Scripting.FileSystemObject");
var size = (fs.getFile(file)).size;
var lnkBytes = ado.ReadText(size);
ado.Close();
var chars=lnkBytes.split('');
for (var indx=0;indx<size;indx++) {
if ( chars[indx].charCodeAt(0) > 255 ) {
chars[indx] = String.fromCharCode(128 + adjustment.indexOf(chars[indx]));
}
}
return chars.join("");
}
function writeContent(file,content) {
var ado = WScript.CreateObject("ADODB.Stream");
ado.Type = 2; // adTypeText = 2
ado.CharSet = "iso-8859-1"; // right code page for output (no adjustments)
//ado.Mode=2;
ado.Open();
ado.WriteText(content);
ado.SaveToFile(file, 2);
ado.Close();
}
if (typeof String.prototype.startsWith != 'function') {
// see below for better implementation!
String.prototype.startsWith = function (str){
return this.indexOf(str) === 0;
};
}
var evaluate=false;
var filename=ARGS.Item(0);
if(filename.toLowerCase().startsWith("e?")) {
filename=filename.substring(2,filename.length);
evaluate=true;
}
var content=getContent(filename);
var newContent=content;
var find="";
var replace="";
for (var i=1;i<ARGS.Length-1;i=i+2){
find=ARGS.Item(i);
replace=ARGS.Item(i+1);
if(evaluate){
find=decodeJsString(find);
replace=decodeJsString(replace);
}
newContent=convertToPipe(find,replace,newContent);
newContent=removeStartingQuote(find,replace,newContent);
newContent=removeEndQuote(find,replace,newContent);
newContent=removeLeadingAndTrailingQuotes(find,replace,newContent);
newContent=replaceDelimiter(find,replace,newContent);
newContent=convertSFDCDoubleQuotes(find,replace,newContent);
}
writeContent(filename,newContent);
Execution Steps:
> replace.bat <file_name or full_path_to_file> "." "."
This batch file is made for the purpose of any file's manipulation according to our requirement.
I've compiled and made this from lot of google searches. It's still in process as I've hardcoded my regular expressions in the file. You can make changes according to your need in the functions i've made, or even make your own functions by replicating other functions, and calling them at the end.
Another alternateive to what I want to achive I've done using Wondows Powershell script.
((Get-Content -path $args[0] -Raw) -replace '\|', '^|') | Set-Content -NoNewline -Force -Path $args[0]
((Get-Content -path $args[0] -Raw) -replace '^"', '') | Set-Content -NoNewline -Force -Path $args[0]
((Get-Content -path $args[0] -Raw) -replace "`"\r\n$", "") | Set-Content -NoNewline -Force -Path $args[0]
((Get-Content -path $args[0] -Raw) -replace '"\r\n"', "`r`n") | Set-Content -NoNewline -Force -Path $args[0]
((Get-Content -path $args[0] -Raw) -replace '","', '|') | Set-Content -NoNewline -Force -Path $args[0]
((Get-Content -path $args[0] -Raw) -replace '""', '"' ) | Set-Content -Path $args[0]
Execution Ways:
Using Powershell
replace.ps1 '< path_to_file >'
Using a Batch Script
C:\Windows\System32\WindowsPowerShell\v1.0\powershell.exe -ExecutionPolicy ByPass -command "& '< path_to_ps_script >\replace.ps1' '< path_to_csv_file >.csv'"
NOTE: Powershell V5.0 or greater required
This can process 1 Million of records in a minute or so.
What I've figured out is that we have to split bulky csv files to multiplve file with 1 Million records each and then process them all seperately.
Please correct me if I'm wrong, or there's any other alternate to it.

Convert from Tensorflow -> CoreML 3.0 for slot/intent detection

I am trying to use some of the models created by this codebase (Slot-Filling-Understanding-Using-RNNs) in my Swift application.
I was able to convert lstm_nopooling, lstm_nopooling300 and lstm to convert to CoreML.
In model.py I used this code:
def save_model(self):
joblib.dump(self.summary, 'models/' + self.name + '.txt')
self.model.save('models/' + self.name + '.h5')
try:
coreml_model = coremltools.converters.keras.convert(self.model, input_names="main_input", output_names=["intent_output","slot_output"])
coreml_model.save('models/' + self.name + '.mlmodel')
except:
pass
print("Saved model to disk")
I am trying to convert the vectors back to an intent and slots.
I have this, but
func tokenizeSentences(instr: String) -> [Int] {
let s = instr.lowercased().split(separator: " ")
var ret = [Int]()
if let filepath = Bundle.main.path(forResource: "atis.dict.vocab", ofType: "csv") {
do {
let contents = try String(contentsOfFile: filepath)
print(contents)
var lines = contents.split { $0.isNewline }
var pos = 0
for word in s {
if let index = lines.firstIndex(of: word) {
print(index.description + " " + word)
ret.append(index)
}
}
return ret
} catch {
// contents could not be loaded
}
} else {
// example.txt not found!
}
return ret
}
func predictText(instr:String) {
let model = lstm_nopooling300()
guard let mlMultiArray = try? MLMultiArray(shape:[20,1,1],
dataType:MLMultiArrayDataType.int32) else {
fatalError("Unexpected runtime error. MLMultiArray")
}
let tokens = tokenizeSentences(instr: instr)
for (index, element) in tokens.enumerated() {
mlMultiArray[index] = NSNumber(integerLiteral: element)
}
guard let m = try? model.prediction(input: lstm_nopooling300Input.init(main_input: mlMultiArray))
else {
fatalError("Unexpected runtime error. MLMultiArray")
}
let mm = m.intent_output
let length = mm.count
let doublePtr = mm.dataPointer.bindMemory(to: Double.self, capacity: length)
let doubleBuffer = UnsafeBufferPointer(start: doublePtr, count: length)
let output = Array(doubleBuffer)
print("******** intents \(mm.count) ********")
print(output)
let mn = m.slot_output
let length2 = mn.count
let doublePtr2 = mm.dataPointer.bindMemory(to: Double.self, capacity: length2)
let doubleBuffer2 = UnsafeBufferPointer(start: doublePtr2, count: length2)
let output2 = Array(doubleBuffer2)
print("******** slots \(mn.count) ********")
print(output2)
}
}
When I run my code I get this, truncated, for intents:
******** intents 540 ********
[0.0028914143331348896, 0.0057610333897173405, 4.1651015635579824e-05,
0.15935245156288147, 5.6665314332349226e-05, 5.7797817134996876e-05, 0.0044302307069301605, 0.00012486864579841495, 0.0004683282459154725, 0.003053907072171569, 3.806956738117151e-05, 0.012112349271774292, 5.861848694621585e-05, 0.0031344725284725428,
The problem, I believe, is that the ids are in a pickle file, so in atis/atis.train.pkl perhaps.
All I did was train the models and convert those I could to CoreML and now I am trying to use it, but not certain what to do next.
I have a textfield and I enter 'current weather in london' and I hope to get something similar to (this is from running example.py)
{'intent': 'weather_intent', 'slots': [{'name': 'city', 'value': 'London'}]}
Here is the coreml input/output
Thanks to #MatthijsHollemans I was able to figure out what to do.
In data_processing.py I added these:
with open('atis/wordlist.csv', 'w') as f:
for key in ids2words.keys():
f.write("%s\n"%(ids2words.keys[key]))
with open('atis/wordlist_slots.csv', 'w') as f:
for key in ids2slots.keys():
f.write("%s\n"%(ids2slots[key]))
with open('atis/wordlist_intents.csv', 'w') as f:
for key in ids2intents.keys():
f.write("%s\n"%(ids2intents[key]))
This allows me to tokenize correctly, using wordlist.csv.
Then when I get the response back, and using mm.count was wrong, it should have been output.count for example, I could see the intents.
Look for the element that has the largest value, and then you look up in wordlist_intents.csv (I turned this into an array, probably should be a dictionary) to find the likely intent.
I still need to do the slots, but the basic idea is the same.
The key was to output the dictionary used in python to a csv file and then import that into the project.
UPDATE
I realized that when mm.count was 540 that is because it can have 20 words in the sentence and so it can return that many. So in my case I needed to split the word by space and then loop that many times as I won't get more slots than I have words.
I am doing this in SwiftUI so I also had to create an observable so I could use EnvironmentObject to pass in the terms.
So, to properly loop over the double array in memory I am including the latest code that does what I expect.
func predictText(instr:String) {
let model = lstm_nopooling300()
guard let mlMultiArray = try? MLMultiArray(shape:[20,1,1],
dataType:MLMultiArrayDataType.int32) else {
fatalError("Unexpected runtime error. MLMultiArray")
}
let tokens = tokenizeSentences(instr: instr)
let sent = instr.split(separator: " ")
print(instr)
print(tokens)
for (index, element) in tokens.enumerated() {
mlMultiArray[index] = NSNumber(integerLiteral: element)
}
guard let m = try? model.prediction(input: lstm_nopooling300Input.init(main_input: mlMultiArray))
else {
fatalError("Unexpected runtime error. MLMultiArray")
}
let mm = m.intent_output
let length = mm.count
let doublePtr = mm.dataPointer.bindMemory(to: Double.self, capacity: length)
var intents = [String]()
for i in 0...sent.count - 1 {
let doubleBuffer = UnsafeBufferPointer(start: doublePtr + i * 27, count: 27)
let output = Array(doubleBuffer)
let intent = convertVectorToIntent(vector: output)
intents.append(intent)
}
print(intents)
let mn = m.slot_output
let length2 = mn.count
let doublePtr2 = mn.dataPointer.bindMemory(to: Double.self, capacity: length2)
var slots = [String]()
for i in 0...sent.count - 1 {
let doubleBuffer2 = UnsafeBufferPointer(start: doublePtr2 + i * 133, count: 133)
let output2 = Array(doubleBuffer2)
var slot = ""
slot = convertVectorToSlot(vector: output2)
slots.append(slot)
slots.append(sent[i].description)
}
print(slots)
}

extract data from gmail add to spreadsheet- Google apps script

I have searched, copied and modified code, and tried to break down what others have done and I still can't get this right.
I have email receipts for an ecommerce webiste, where I am trying to harvest particular details from each email and save to a spreadsheet with a script.
Here is the entire script as I have now.
function menu(e) {
var ui = SpreadsheetApp.getUi();
ui.createMenu('programs')
.addItem('parse mail', 'grabreceipt')
.addToUi();
}
function grabreceipt() {
var ss = SpreadsheetApp.getActiveSheet();
var ss = SpreadsheetApp.getActiveSpreadsheet();
var s = ss.getSheetByName("Sheet1");
var threads = GmailApp.search("(subject:order receipt) and (after:2016/12/01)");
var a=[];
for (var i = 0; i<threads.length; i++)
{
var messages = threads[i].getMessages();
for (var j=0; j<messages.length; j++)
{
var messages = GmailApp.getMessagesForThread(threads[i]);
for (var j = 0; j < messages.length; j++) {
a[j]=parseMail(messages[j].getPlainBody());
}
}
var nextRow=s.getDataRange().getLastRow()+1;
var numRows=a.length;
var numCols=a[0].length;
s.getRange(nextRow,1,numRows,numCols).setValues(a);
}
function parseMail(body) {
var a=[];
var keystr="Order #,Subtotal:,Shipping:,Total:";
var keys=keystr.split(",");
var i,p,r;
for (i in keys) {
//p=keys[i]+(/-?\d+(,\d+)*(\.\d+(e\d+)?)?/);
p=keys[i]+"[\r\n]*([^\r^\n]*)[\r\n]";
//p=keys[i]+"[\$]?[\d]+[\.]?[\d]+$";
r=new RegExp(p,"m");
try {a[i]=body.match(p)[1];}
catch (err) {a[i]="no match";}
}
return a;
}
}
So the email data to pluck from comes as text only like this:
Order #89076
(body content, omitted)
Subtotal: $528.31
Shipping: $42.66 via Priority MailĀ®
Payment Method: Check Payment- Money order
Total: $570.97
Note: mywebsite order 456. Customer asked about this and that... etc.
The original code regex was designed to grab content, following the keystr values which were easily found on their own line. So this made sense:
p=keys[i]+"[\r\n]*([^\r^\n]*)[\r\n]";
This works okay, but results where the lines include more data that follows as in line Shipping: $42.66 via Priority MailĀ®.
My data is more blended, where I only wish to take numbers, or numbers and decimals. So I have this instead which validates on regex101.com
p=keys[i]+"[\$]?[\d]+[\.]?\d+$";
The expression only, [\$]?[\d]+[.]?\d+$ works great but I still get "no match" for each row.
Additionally, within this search there are 22 threads returned, and it populates 39 rows in the spreadsheet. I can not figure out why 39?
The reason for your regex not working like it should is because you are not escaping the "\" in the string you use to create the regex
So a regex like this
"\s?\$?(\d+\.?\d+)"
needs to be escaped like so:
"\\s?\\$?(\\d+\\.?\\d+)"
The below code is just modified from your parseEmail() to work as a snippet here. If you copy this to your app script code delete document.getElementById() lines.
Your can try your example in the snippet below it will only give you the numbers.
function parseMail(body) {
if(body == "" || body == undefined){
var body = document.getElementById("input").value
}
var a=[];
var keystr="Order #,Subtotal:,Shipping:,Total:";
var keys=keystr.split(",");
var i,p,r;
for (i in keys) {
p=keys[i]+"\\s?\\$?(\\d+\\.?\\d+)";
r=new RegExp(p,"m");
try {a[i]=body.match(p)[1];}
catch (err) {a[i]="no match";}
}
document.getElementById("output").innerHTML = a.join(";")
return a;
}
<textarea id ="input"></textarea>
<div id= "output"></div>
<input type = "button" value = "Parse" onclick = "parseMail()">
Hope that helps

Categories