I'm trying to extract all paragraphs from and EPUB with associated CFIs. I tried computing the CFI myself but the documentation is really hard to follow and implement. I'm primarily looking for a python solution, but I can work with anything.
To be precise: I want to compute the CFI for every <p> inside every chapter.
Thank you in advance!
async function createCfisForEpubHtmlParagraphs(epub_path){
const data = await fs.readFile(epub_path, "utf8");
const dom = new JSDOM(data);
let nodeRange = dom.window.document.createRange();
const cfiRanges = [...Array(dom.window.document.getElementsByTagName('p').length).keys()].map((paragraphIndx => {
const pNode = dom.window.document.getElementsByTagName('p').item(paragraphIndx);
nodeRange.setStart(pNode.firstChild, 0);
try{
nodeRange.setEnd(pNode.firstChild, pNode.firstChild.textContent.length);
}
catch(e){
nodeRange.setEnd(pNode.firstChild, 1);
}
let cfirange = new EpubCFI(nodeRange, '/6/4').toString();
return {cfirange: cfirange, p_number:paragraphIndx}
})
return cfiRanges;
}
Related
Thanks to this great answer I was able to figure out how to run a preflight check for my documents using Python and the InDesign script API. Now I wanted to work on automatically adjusting the text size of the overflowing text boxes, but was unable to figure out how to retrieve a TextBox object from the Preflight object.
I referred to the API specification, but all the properties only seem to yield strings which do not uniquely define the TextBoxes, like in this example:
Errors Found (1):
Text Frame (R=2)
Is there any way to retrieve the violating objects from the Preflight, in order to operate on them later on? I'd be very thankful for additional input on this matter, as I am stuck!
If all you need is to find and to fix the overset errors I'd propose this solution:
Here is the simple Extendscript to fix the text overset error. It decreases the font size in the all overflowed text frames in active document:
var doc = app.activeDocument;
var frames = doc.textFrames.everyItem().getElements();
var f = frames.length
while(f--) {
var frame = frames[f];
if (frame.overflows) resize_font(frame)
}
function resize_font(frame) {
app.scriptPreferences.enableRedraw = false;
while (frame.overflows) {
var texts = frame.parentStory.texts.everyItem().getElements();
var t = texts.length;
while(t--) {
var characters = texts[t].characters.everyItem().getElements();
var c = characters.length;
while (c--) characters[c].pointSize = characters[c].pointSize * .99;
}
}
app.scriptPreferences.enableRedraw = true;
}
You can save it in any folder and run it by the Python script:
import win32com.client
app = win32com.client.Dispatch('InDesign.Application.CS6')
doc = app.Open(r'd:\temp\test.indd')
profile = app.PreflightProfiles.Item('Stackoverflow Profile')
print('Profile name:', profile.name)
process = app.PreflightProcesses.Add(doc, profile)
process.WaitForProcess()
errors = process.processResults
print('Errors:', errors)
if errors[:4] != 'None':
script = r'd:\temp\fix_overset.jsx' # <-- here is the script to fix overset
print('Run script', script)
app.DoScript(script, 1246973031) # run the jsx script
# 1246973031 --> ScriptLanguage.JAVASCRIPT
# https://www.indesignjs.de/extendscriptAPI/indesign-latest/#ScriptLanguage.html
process = app.PreflightProcesses.Add(doc, profile)
process.WaitForProcess()
errors = process.processResults
print('Errors:', errors) # it should print 'None'
if errors[:4] == 'None':
doc.Save()
doc.Close()
input('\nDone... Press <ENTER> to close the window')
Thanks to the exellent answer of Yuri I was able solve my problem, although there are still some shortcomings.
In Python, I load my documents and check if there are any problems detected during the preflight. If so, I move on to adjusting the text frames.
myDoc = app.Open(input_file_path)
profile = app.PreflightProfiles.Item(1)
process = app.PreflightProcesses.Add(myDoc, profile)
process.WaitForProcess()
results = process.processResults
if "None" not in results:
# Fix errors
script = open("data/script.jsx")
app.DoScript(script.read(), 1246973031, variables.resize_array)
process.WaitForProcess()
results = process.processResults
# Check if problems were resolved
if "None" not in results:
info_fail(card.name, "Error while running preflight")
myDoc.Close(1852776480)
return FLAG_PREFLIGHT_FAIL
I load the JavaScript file stored in script.jsx, that consists of several components. I start by extracting the arguments and loading all the pages, since I want to handle them individually. I then collect all text frames on the page in an array.
var doc = app.activeDocument;
var pages = doc.pages;
var resizeGroup = arguments[0];
var condenseGroup = arguments[1];
// Loop over all available pages separately
for (var pageIndex = 0; pageIndex < pages.length; pageIndex++) {
var page = pages[pageIndex];
var pageItems = page.allPageItems;
var textFrames = [];
// Collect all TextFrames in an array
for (var pageItemIndex = 0; pageItemIndex < pageItems.length; pageItemIndex++) {
var candidate = pageItems[pageItemIndex];
if (candidate instanceof TextFrame) {
textFrames.push(candidate);
}
}
What I wanted to achieve was a setting where if one of a group of text frames was overflowing, the text size of all the text frames in this group are adjusted as well. E.g. text frame 1 overflows when set to size 8, no longer when set to size 6. Since text frame 1 is in the same group as text frame 2, both of them will be adjusted to size 6 (assuming the second frame does not overflow at this size).
In order to handle this, I pass an array containing the groups. I now check if the text frame is contained in one of these groups (which is rather tedious, I had to write my own methods since InDesign does not support modern functions like filter() as far as I am concerned...).
// Check if TextFrame overflows, if so add all TextFrames that should be the same size
for (var textFrameIndex = 0; textFrameIndex < textFrames.length; textFrameIndex++) {
var textFrame = textFrames[textFrameIndex];
// If text frame overflows, adjust it and all the frames that are supposed to be of the same size
if (textFrame.overflows) {
var foundResizeGroup = filterArrayWithString(resizeGroup, textFrame.name);
var foundCondenseGroup = filterArrayWithString(condenseGroup, textFrame.name);
var process = false;
var chosenGroup, type;
if (foundResizeGroup.length > 0) {
chosenGroup = foundResizeGroup;
type = "resize";
process = true;
} else if (foundCondenseGroup.length > 0) {
chosenGroup = foundCondenseGroup;
type = "condense";
process = true;
}
if (process) {
var foundFrames = findTextFramesFromNames(textFrames, chosenGroup);
adjustTextFrameGroup(foundFrames, type);
}
}
}
If this is the case, I adjust either the text size or the second axis of the text (which condenses the text for my variable font). This is done using the following functions:
function adjustTextFrameGroup(resizeGroup, type) {
// Check if some overflowing textboxes
if (!someOverflowing(resizeGroup)) {
return;
}
app.scriptPreferences.enableRedraw = false;
while (someOverflowing(resizeGroup)) {
for (var textFrameIndex = 0; textFrameIndex < resizeGroup.length; textFrameIndex++) {
var textFrame = resizeGroup[textFrameIndex];
if (type === "resize") decreaseFontSize(textFrame);
else if (type === "condense") condenseFont(textFrame);
else alert("Unknown operation");
}
}
app.scriptPreferences.enableRedraw = true;
}
function someOverflowing(textFrames) {
for (var textFrameIndex = 0; textFrameIndex < textFrames.length; textFrameIndex++) {
var textFrame = textFrames[textFrameIndex];
if (textFrame.overflows) {
return true;
}
}
return false;
}
function decreaseFontSize(frame) {
var texts = frame.parentStory.texts.everyItem().getElements();
for (var textIndex = 0; textIndex < texts.length; textIndex++) {
var characters = texts[textIndex].characters.everyItem().getElements();
for (var characterIndex = 0; characterIndex < characters.length; characterIndex++) {
characters[characterIndex].pointSize = characters[characterIndex].pointSize - 0.25;
}
}
}
function condenseFont(frame) {
var texts = frame.parentStory.texts.everyItem().getElements();
for (var textIndex = 0; textIndex < texts.length; textIndex++) {
var characters = texts[textIndex].characters.everyItem().getElements();
for (var characterIndex = 0; characterIndex < characters.length; characterIndex++) {
characters[characterIndex].setNthDesignAxis(1, characters[characterIndex].designAxes[1] - 5)
}
}
}
I know that this code can be improved upon (and am open to feedback), for example if a group consists of multiple text frames, the procedure will run for all of them, even though it need only be run once. I was getting pretty frustrated with the old JavaScript, and the impact is negligible. The rest of the functions are also only helper functions, which I'd like to replace with more modern version. Sadly and as already stated, I think that they are simply not available.
Thanks once again to Yuri, who helped me immensely!
So basically I've added two custom features for coloring text to a RichTextBlock, and I'd like to make them so selecting one for a portion of text would automatically unselect the other color button, much like it's already the case for h tags.
I've searched for a bit but didn't find much, so I guess I could use some help, be it advice, instruction or even code.
My features go like this :
#hooks.register('register_rich_text_features')
def register_redtext_feature(features):
feature_name = 'redtext'
type_ = 'RED_TEXT'
tag = 'span'
control = {
'type': type_,
'label': 'Red',
'style': {'color': '#bd003f'},
}
features.register_editor_plugin(
'draftail', feature_name, draftail_features.InlineStyleFeature(control)
)
db_conversion = {
'from_database_format': {tag: InlineStyleElementHandler(type_)},
'to_database_format': {
'style_map': {
type_: {'element': tag, 'props': {'class': 'text-primary'}}
}
},
}
features.register_converter_rule(
'contentstate', feature_name, db_conversion
)
The other one is similar but color is different.
This is possible, but it requires jumping through many hoops in Wagtail. The h1…h6 tags work like this out of the box because they are block-level formatting – each block within the editor can only be of one type. Here you’re creating this RED_TEXT formatting as inline formatting ("inline style"), which, intentionally supports multiple formats being applied to the same text.
If you want to achieve this mutually exclusive implementation anyway – you’ll need to write custom JS code to auto-magically remove the desired styles from the text when attempting to add a new style.
Here is a function that does just that. It goes through all of the characters in the user’s selection, and removes the relevant styles from them:
/**
* Remove all of the COLOR_ styles from the current selection.
* This is to ensure only one COLOR_ style is applied per range of text.
* Replicated from https://github.com/thibaudcolas/draftjs-filters/blob/f997416a0c076eb6e850f13addcdebb5e52898e5/src/lib/filters/styles.js#L7,
* with additional "is the character in the selection" logic.
*/
export const filterColorStylesFromSelection = (
content: ContentState,
selection: SelectionState,
) => {
const blockMap = content.getBlockMap();
const startKey = selection.getStartKey();
const endKey = selection.getEndKey();
const startOffset = selection.getStartOffset();
const endOffset = selection.getEndOffset();
let isAfterStartKey = false;
let isAfterEndKey = false;
const blocks = blockMap.map((block) => {
const isStartBlock = block.getKey() === startKey;
const isEndBlock = block.getKey() === endKey;
isAfterStartKey = isAfterStartKey || isStartBlock;
isAfterEndKey = isAfterEndKey || isEndBlock;
const isBeforeEndKey = isEndBlock || !isAfterEndKey;
const isBlockInSelection = isAfterStartKey && isBeforeEndKey;
// Skip filtering through the block chars if out of selection.
if (!isBlockInSelection) {
return block;
}
let altered = false;
const chars = block.getCharacterList().map((char, i) => {
const isAfterStartOffset = i >= startOffset;
const isBeforeEndOffset = i < endOffset;
const isCharInSelection =
// If the selection is on a single block, the char needs to be in-between start and end offsets.
(isStartBlock &&
isEndBlock &&
isAfterStartOffset &&
isBeforeEndOffset) ||
// Start block only: after start offset
(isStartBlock && !isEndBlock && isAfterStartOffset) ||
// End block only: before end offset.
(isEndBlock && !isStartBlock && isBeforeEndOffset) ||
// Neither start nor end: just "in selection".
(isBlockInSelection && !isStartBlock && !isEndBlock);
let newChar = char;
if (isCharInSelection) {
char
.getStyle()
.filter((type) => type.startsWith("COLOR_"))
.forEach((type) => {
altered = true;
newChar = CharacterMetadata.removeStyle(newChar, type);
});
}
return newChar;
});
return altered ? block.set("characterList", chars) : block;
});
return content.merge({
blockMap: blockMap.merge(blocks),
});
};
This is taken from the Draftail ColorPicker demo, which you can see running in the Draftail Storybook’s "Custom formats" example.
To implement this kind of customisation in Draftail, you’d need to use the controls API. Unfortunately that API isn’t currently supported out of the box in Wagtail’s integration of the editor (see wagtail/wagtail#5580), so at the moment in order for this to work you’d need to customize Draftail’s initialisation within Wagtail as well.
I am trying to create a sentence auto-complete model which will suggest similar sentences.
Problem: I have a sentence corpora of more than 20000 sentences. I want to create a program that would suggest similar sentences to a user as the user types in with his/her keyboard.
for example -
user: wh
suggestions: [{'what is your name?'},{'what is your profession?'},{'what do you want?'}, {'where are you?'}]
user: what is your
suggestions: [{'what is your name?'},{'what is your profession?'}]
Note:
The ordering of words is important, i.e prefix of sentence and user input should be the same.
The sentence suggestion are from available text corpora.
My approach:-
Till now I have only come up with a solution that uses trie data structure to store every sentence in text corpora.
I want to know if there are any machine learning techniques that could be implemented for sentence suggestion that also takes sentence prefix into account.
I would really appreciate anyone who could point me in the right direction.
Text generation is a common application of RNNs. Given a sentence prefix the neural network can be trained to predict the most probable next words.
A very interesting article written by Andrej Karpathy can be found here along with the corresponding github repo.
Another popular method utilizes Markov Chains for text generation (for example see here )
if you want to use Lucene relevency, MoreLikeThis similar sentence. or you can apply the cosine similarity for same. hope this will help.
public static void main(String[] args) throws IOException {
Main m = new Main();
m.init();
m.writerEntries();
m.findSilimar("doduck prototype");
}
private Directory indexDir;
private StandardAnalyzer analyzer;
private IndexWriterConfig config;
public void init() throws IOException{
analyzer = new StandardAnalyzer(Version.LUCENE_42);
config = new IndexWriterConfig(Version.LUCENE_42, analyzer);
config.setOpenMode(OpenMode.CREATE_OR_APPEND);
indexDir = new RAMDirectory(); //do not write on disk
}
public void writerEntries() throws IOException{
IndexWriter indexWriter = new IndexWriter(indexDir, config);
indexWriter.commit();
Document doc1 = createDocument("1","doduck","prototype your idea");
Document doc2 = createDocument("2","doduck","love programming");
Document doc3 = createDocument("3","We do", "prototype");
Document doc4 = createDocument("4","We love", "challange");
indexWriter.addDocument(doc1);
indexWriter.addDocument(doc2);
indexWriter.addDocument(doc3);
indexWriter.addDocument(doc4);
indexWriter.commit();
indexWriter.forceMerge(100, true);
indexWriter.close();
}
private Document createDocument(String id, String title, String content) {
FieldType type = new FieldType();
type.setIndexed(true);
type.setStored(true);
type.setStoreTermVectors(true); //TermVectors are needed for MoreLikeThis
Document doc = new Document();
doc.add(new StringField("id", id, Store.YES));
doc.add(new Field("title", title, type));
doc.add(new Field("content", content, type));
return doc;
}
private void findSilimar(String searchForSimilar) throws IOException {
IndexReader reader = DirectoryReader.open(indexDir);
IndexSearcher indexSearcher = new IndexSearcher(reader);
MoreLikeThis mlt = new MoreLikeThis(reader);
mlt.setMinTermFreq(0);
mlt.setMinDocFreq(0);
mlt.setFieldNames(new String[]{"title", "content"});
mlt.setAnalyzer(analyzer);
Reader sReader = new StringReader(searchForSimilar);
Query query = mlt.like(sReader, null);
TopDocs topDocs = indexSearcher.search(query,10);
for ( ScoreDoc scoreDoc : topDocs.scoreDocs ) {
Document aSimilar = indexSearcher.doc( scoreDoc.doc );
String similarTitle = aSimilar.get("title");
String similarContent = aSimilar.get("content");
System.out.println("====similar finded====");
System.out.println("title: "+ similarTitle);
System.out.println("content: "+ similarContent);
}
}
I have searched, copied and modified code, and tried to break down what others have done and I still can't get this right.
I have email receipts for an ecommerce webiste, where I am trying to harvest particular details from each email and save to a spreadsheet with a script.
Here is the entire script as I have now.
function menu(e) {
var ui = SpreadsheetApp.getUi();
ui.createMenu('programs')
.addItem('parse mail', 'grabreceipt')
.addToUi();
}
function grabreceipt() {
var ss = SpreadsheetApp.getActiveSheet();
var ss = SpreadsheetApp.getActiveSpreadsheet();
var s = ss.getSheetByName("Sheet1");
var threads = GmailApp.search("(subject:order receipt) and (after:2016/12/01)");
var a=[];
for (var i = 0; i<threads.length; i++)
{
var messages = threads[i].getMessages();
for (var j=0; j<messages.length; j++)
{
var messages = GmailApp.getMessagesForThread(threads[i]);
for (var j = 0; j < messages.length; j++) {
a[j]=parseMail(messages[j].getPlainBody());
}
}
var nextRow=s.getDataRange().getLastRow()+1;
var numRows=a.length;
var numCols=a[0].length;
s.getRange(nextRow,1,numRows,numCols).setValues(a);
}
function parseMail(body) {
var a=[];
var keystr="Order #,Subtotal:,Shipping:,Total:";
var keys=keystr.split(",");
var i,p,r;
for (i in keys) {
//p=keys[i]+(/-?\d+(,\d+)*(\.\d+(e\d+)?)?/);
p=keys[i]+"[\r\n]*([^\r^\n]*)[\r\n]";
//p=keys[i]+"[\$]?[\d]+[\.]?[\d]+$";
r=new RegExp(p,"m");
try {a[i]=body.match(p)[1];}
catch (err) {a[i]="no match";}
}
return a;
}
}
So the email data to pluck from comes as text only like this:
Order #89076
(body content, omitted)
Subtotal: $528.31
Shipping: $42.66 via Priority Mail®
Payment Method: Check Payment- Money order
Total: $570.97
Note: mywebsite order 456. Customer asked about this and that... etc.
The original code regex was designed to grab content, following the keystr values which were easily found on their own line. So this made sense:
p=keys[i]+"[\r\n]*([^\r^\n]*)[\r\n]";
This works okay, but results where the lines include more data that follows as in line Shipping: $42.66 via Priority Mail®.
My data is more blended, where I only wish to take numbers, or numbers and decimals. So I have this instead which validates on regex101.com
p=keys[i]+"[\$]?[\d]+[\.]?\d+$";
The expression only, [\$]?[\d]+[.]?\d+$ works great but I still get "no match" for each row.
Additionally, within this search there are 22 threads returned, and it populates 39 rows in the spreadsheet. I can not figure out why 39?
The reason for your regex not working like it should is because you are not escaping the "\" in the string you use to create the regex
So a regex like this
"\s?\$?(\d+\.?\d+)"
needs to be escaped like so:
"\\s?\\$?(\\d+\\.?\\d+)"
The below code is just modified from your parseEmail() to work as a snippet here. If you copy this to your app script code delete document.getElementById() lines.
Your can try your example in the snippet below it will only give you the numbers.
function parseMail(body) {
if(body == "" || body == undefined){
var body = document.getElementById("input").value
}
var a=[];
var keystr="Order #,Subtotal:,Shipping:,Total:";
var keys=keystr.split(",");
var i,p,r;
for (i in keys) {
p=keys[i]+"\\s?\\$?(\\d+\\.?\\d+)";
r=new RegExp(p,"m");
try {a[i]=body.match(p)[1];}
catch (err) {a[i]="no match";}
}
document.getElementById("output").innerHTML = a.join(";")
return a;
}
<textarea id ="input"></textarea>
<div id= "output"></div>
<input type = "button" value = "Parse" onclick = "parseMail()">
Hope that helps
I have a C header file which contains a series of classes, and I'm trying to write a function which will take those classes, and convert them to a python dict. A sample of the file is down the bottom.
Format would be something like
class CFGFunctions {
class ABC {
class AA {
file = "abc/aa/functions"
class myFuncName{ recompile = 1; };
};
class BB
{
file = "abc/bb/functions"
class funcName{
recompile=1;
}
}
};
};
I'm hoping to turn it into something like
{CFGFunctions:{ABC:{AA:"myFuncName"}, BB:...}}
# Or
{CFGFunctions:{ABC:{AA:{myFuncName:"string or list or something"}, BB:...}}}
In the end, I'm aiming to get the filepath string (which is actually a path to a folder... but anyway), and the class names in the same class as the file/folder path.
I've had a look on SO, and google and so on, but most things I've found have been about splitting lines into dicts, rather then n-deep 'blocks'
I know I'll have to loop through the file, however, I'm not sure the most efficient way to convert it to the dict.
I'm thinking I'd need to grab the outside class and its relevant brackets, then do the same for the text remaining inside.
If none of that makes sense, it's cause I haven't quite made sense of the process myself haha
If any more info is needed, I'm happy to provide.
The following code is a quick mockup of what I'm sorta thinking...
It is most likely BROKEN and probably does NOT WORK. but its sort of the process that I'm thinking of
def get_data():
fh = open('CFGFunctions.h', 'r')
data = {} # will contain final data model
# would probably refactor some of this into a function to allow better looping
start = "" # starting class name
brackets = 0 # number of brackets
text= "" # temp storage for lines inside block while looping
for line in fh:
# find the class (start
mt = re.match(r'Class ([\w_]+) {', line)
if mt:
if start == "":
start = mt.group(1)
else:
# once we have the first class, find all other open brackets
mt = re.match(r'{', line)
if mt:
# and inc our counter
brackets += 1
mt2 = re.match(r'}', line)
if mt2:
# find the close, and decrement
brackets -= 1
# if we are back to the initial block, break out of the loop
if brackets == 0:
break
text += line
data[start] = {'tempText': text}
====
Sample file
class CfgFunctions {
class ABC {
class Control {
file = "abc\abc_sys_1\Modules\functions";
class assignTracker {
description = "";
recompile = 1;
};
class modulePlaceMarker {
description = "";
recompile = 1;
};
};
class Devices
{
file = "abc\abc_sys_1\devices\functions";
class registerDevice { recompile = 1; };
class getDeviceSettings { recompile = 1; };
class openDevice { recompile = 1; };
};
};
};
EDIT:
If possible, if I have to use a package, I'd like to have it in the programs directory, not the general python libs directory.
As you detected, parsing is necessary to do the conversion. Have a look at the package PyParsing, which is a fairly easy-to-use library to implement parsing in your Python program.
Edit: This is a very symbolic version of what it would take to recognize a very minimalistic grammer - somewhat like the example at the top of the question. It won't work, but it might put you in the right direction:
from pyparsing import ZeroOrMore, OneOrMore, \
Keyword, Literal
test_code = """
class CFGFunctions {
class ABC {
class AA {
file = "abc/aa/functions"
class myFuncName{ recompile = 1; };
};
class BB
{
file = "abc/bb/functions"
class funcName{
recompile=1;
}
}
};
};
"""
class_tkn = Keyword('class')
lbrace_tkn = Literal('{')
rbrace_tkn = Literal('}')
semicolon_tkn = Keyword(';')
assign_tkn = Keyword(';')
class_block = ( class_tkn + identifier + lbrace_tkn + \
OneOrMore(class_block | ZeroOrMore(assignment)) + \
rbrace_tkn + semicolon_tkn \
)
def test_parser(test):
try:
results = class_block.parseString(test)
print test, ' -> ', results
except ParseException, s:
print "Syntax error:", s
def main():
test_parser(test_code)
return 0
if __name__ == '__main__':
main()
Also, this code is only the parser - it does not generate any output. As you can see in the PyParsing docs, you can later add the actions you want. But the first step would be to recognize the what you want to translate.
And a last note: Do not underestimate the complexities of parsing code... Even with a library like PyParsing, which takes care of much of the work, there are many ways to get mired in infinite loops and other amenities of parsing. Implement things step-by-step!
EDIT: A few sources for information on PyParsing are:
http://werc.engr.uaf.edu/~ken/doc/python-pyparsing/HowToUsePyparsing.html
http://pyparsing.wikispaces.com/
(Particularly interesting is http://pyparsing.wikispaces.com/Publications, with a long list of articles - several of them introductory - on PyParsing)
http://pypi.python.org/pypi/pyparsing_helper is a GUI for debugging parsers
There is also a 'tag' Pyparsing here on stackoverflow, Where Paul McGuire (the PyParsing author) seems to be a frequent guest.
* NOTE: *
From PaulMcG in the comments below: Pyparsing is no longer hosted on wikispaces.com. Go to github.com/pyparsing/pyparsing