Extract text from Microsoft Word - python

I'm trying to extract text from specific parts of an MS Word document (link) - sample below. Essentially I need to write all text between the tags -- ASN1START and -- ASN1STOP to a file, excluding the tags themselves.
sample text
-- ASN1START

CounterCheck ::= SEQUENCE {
    rrc-TransactionIdentifier       RRC-TransactionIdentifier,
    criticalExtensions              CHOICE {
        c1                              CHOICE {
            counterCheck-r8                 CounterCheck-r8-IEs,
            spare3 NULL, spare2 NULL, spare1 NULL
        },
        criticalExtensionsFuture        SEQUENCE {}
    }
}

CounterCheck-r8-IEs ::= SEQUENCE {
    drb-CountMSB-InfoList           DRB-CountMSB-InfoList,
    nonCriticalExtension            CounterCheck-v8a0-IEs           OPTIONAL
}

CounterCheck-v8a0-IEs ::= SEQUENCE {
    lateNonCriticalExtension        OCTET STRING                    OPTIONAL,
    nonCriticalExtension            CounterCheck-v1530-IEs          OPTIONAL
}

CounterCheck-v1530-IEs ::= SEQUENCE {
    drb-CountMSB-InfoListExt-r15    DRB-CountMSB-InfoListExt-r15    OPTIONAL,   -- Need ON
    nonCriticalExtension            SEQUENCE {}                     OPTIONAL
}

DRB-CountMSB-InfoList ::= SEQUENCE (SIZE (1..maxDRB)) OF DRB-CountMSB-Info

DRB-CountMSB-InfoListExt-r15 ::= SEQUENCE (SIZE (1..maxDRBExt-r15)) OF DRB-CountMSB-Info

DRB-CountMSB-Info ::= SEQUENCE {
    drb-Identity                    DRB-Identity,
    countMSB-Uplink                 INTEGER(0..33554431),
    countMSB-Downlink               INTEGER(0..33554431)
}

-- ASN1STOP
I have tried using docx.
from docx import *
import re
import json

fileName = './data/36331-f80.docx'
document = Document(fileName)

startText = re.compile(r'-- ASN1START')
for para in document.paragraphs:
    # look at each paragraph
    text = para.text
    print(text)
    # if startText.match(para.text):
    #     print(text)
It seems that every line here, including the tag lines mentioned above, is its own paragraph. I need help extracting just the text between the tags.

You may try first reading all document/paragraph text into a single string, and then using re.findall to find all matching text in between the target tags:
text = ""
for para in document.paragraphs:
text += para.text + "\n"
matches = re.findall(r'-- ASN1START\s*(.*?)\s*-- ASN1STOP', text, flags=re.DOTALL)
Note that we use DOTALL mode with the regex to ensure that .* can match content between the tags even when it spans multiple newlines.
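Putting it together, here is a minimal end-to-end sketch that writes each extracted block to a file, which is what the question ultimately asks for (the input path comes from the question; the output filename asn_blocks.txt is an assumption):

import re
from docx import Document

fileName = './data/36331-f80.docx'
document = Document(fileName)

# Join all paragraph text into one string so a block spanning
# many paragraphs can be matched in a single pass.
text = "\n".join(para.text for para in document.paragraphs)

# Non-greedy match of everything between the tags; DOTALL lets .*? cross newlines.
matches = re.findall(r'-- ASN1START\s*(.*?)\s*-- ASN1STOP', text, flags=re.DOTALL)

# 'asn_blocks.txt' is a hypothetical output filename.
with open('asn_blocks.txt', 'w') as out:
    for block in matches:
        out.write(block + "\n")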

Related

Recursive function for nested dictionaries

I have a text file in ASN format, and I'm writing my own parser. In this example it creates a dictionary for Order initially, then goes inside the items and checks whether each value is a dictionary. The dictionaries in the file have already been identified and kept in seq_list. Now I need to write a recursive function which goes inside all the dictionaries and creates nested dictionaries.
import re

ee='\
Module-order DEFINITIONS AUTOMATIC TAGS ::=\
BEGIN\
Order ::= SEQUENCE {\
header Order-header\
}\
Order-header ::= SEQUENCE {\
reference NumericString (SIZE (12)),\
date NumericString (SIZE (8)) -- MMDDYYYY --\
}END'

seq_list=['Order','Order-header']

condition='Order ::= SEQUENCE {\
header Order-header\
}'

def rec_fn():
    ee = ee.lower()
    ee = ee.replace('\n', '')
    for i in condition:
        # Removes empty items
        i = i.split(' ')
        k.append(filter(None, i))
    for index_content, content in enumerate(k):
        for index, value in enumerate(content[1:]):
            new_value = value.replace(',', '')
            if new_value in seq_list:
                # will have the contents of all the items of the new
                # dictionary found.
                reg_value = re.findall(r'{0}\s*::=\s*sequence(.*?)(::=|end)'.format(new_value), ee)
sample.asn
ee = '''Module-order DEFINITIONS AUTOMATIC TAGS ::=
BEGIN
Order ::= SEQUENCE {
    header Order-header,
    items SEQUENCE OF Order-line }
Order-header ::= SEQUENCE {
    reference NumericString (SIZE (12)),
    date NumericString (SIZE (8)) -- MMDDYYYY --,
    client Client,
    payment Payment-method }
Client ::= SEQUENCE {
    name PrintableString (SIZE (1..20)),
    street PrintableString (SIZE (1..50)) OPTIONAL,
    postcode NumericString (SIZE (5)),
    town PrintableString (SIZE (1..30)),
    country PrintableString (SIZE (1..20)) DEFAULT "France" }
Payment-method ::= CHOICE {
    check NumericString (SIZE (15)),
    credit-card Credit-card,
    cash NULL }
Credit-card ::= SEQUENCE {
    type Card-type,
    number NumericString (SIZE (20)),
    expiry-date NumericString (SIZE (6)) -- MMYYYY -- }
Card-type ::= ENUMERATED {cb(0), visa(1), eurocard(2), diners(3), american-express(4)}END'''
You can use the following recursive function:
import re

def rec_fn(asn, key):
    def build_definitions(mapping, key, sequence_only=False):
        if not sequence_only and (key,) in mapping:
            key = (key,)
            is_choice = True
        else:
            is_choice = False
        if isinstance(mapping[key], dict):
            definitions = {}
            for variable, definition in mapping[key].items():
                if definition in mapping or (definition,) in mapping:
                    definitions[variable] = build_definitions(mapping, definition, sequence_only=is_choice)
                else:
                    definitions[variable] = definition
            return definitions
        else:
            return mapping[key]

    mapping = {}
    for name, type, definition in re.findall(r'([A-Za-z-]+)\s*::=\s*(SEQUENCE|CHOICE|ENUMERATED)\s*{(.*?)}(?=\s*(?:[A-Za-z-]+\s*::=\s*(?:SEQUENCE|CHOICE|ENUMERATED)|END)\b)', asn, flags=re.DOTALL):
        if type in ('SEQUENCE', 'CHOICE'):
            for definitions in re.sub(r'{[^}]*}', '', definition).split(','):
                definitions = re.sub(r'\bSET OF\b|\(.*\).*', '', definitions).strip().split(maxsplit=1)
                if definitions:
                    mapping.setdefault(name if type == 'SEQUENCE' else (name,), {})[definitions[0]] = definitions[1]
        elif type == 'ENUMERATED':
            mapping[name] = re.findall(r'[A-Za-z-]+', definition)
    return build_definitions(mapping, key)
so that with (note that it's better to use triple-quotes for a multi-line string literal):
ee = '''
Module-order DEFINITIONS AUTOMATIC TAGS ::=
BEGIN
Order ::= SEQUENCE {
    header Order-header
}
Order-header ::= SEQUENCE {
    reference NumericString (SIZE (12)),
    date NumericString (SIZE (8)) -- MMDDYYYY --
}END'''

seq_list = ['Order', 'Order-header']

condition = '''Order ::= SEQUENCE {
    header Order-header
}'''
rec_fn(ee, 'Order') will return:
{'header': {'reference': 'NumericString', 'date': 'NumericString'}}
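For reference, calling rec_fn(ee, 'Order') with the fuller sample.asn shown above should return a nesting along these lines (hand-traced rather than run, so treat it as a sketch; note that items stays a literal string because Order-line is never defined in the sample, and the ENUMERATED Card-type becomes a list of value names):

{'header': {'reference': 'NumericString',
            'date': 'NumericString',
            'client': {'name': 'PrintableString',
                       'street': 'PrintableString',
                       'postcode': 'NumericString',
                       'town': 'PrintableString',
                       'country': 'PrintableString'},
            'payment': {'check': 'NumericString',
                        'credit-card': {'type': ['cb', 'visa', 'eurocard',
                                                 'diners', 'american-express'],
                                        'number': 'NumericString',
                                        'expiry-date': 'NumericString'},
                        'cash': 'NULL'}},
 'items': 'SEQUENCE OF Order-line'}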

pyparsing how to SkipTo end of indented block?

I am trying to parse a structure like this with pyparsing:
identifier: some description text here which will wrap
    on to the next line. the follow-on text should be
    indented. it may contain identifier: and any text
    at all is allowed
next_identifier: more description, short this time
last_identifier: blah blah
I need something like:
import pyparsing as pp
colon = pp.Suppress(':')
term = pp.Word(pp.alphanums + "_")
description = pp.SkipTo(next_identifier)
definition = term + colon + description
grammar = pp.OneOrMore(definition)
But I am struggling to define the next_identifier of the SkipTo clause since the identifiers may appear freely in the description text.
It seems that I need to include the indentation in the grammar, so that I can SkipTo the next non-indented line.
I tried:
description = pp.Combine(
    pp.SkipTo(pp.LineEnd()) +
    pp.indentedBlock(
        pp.ZeroOrMore(
            pp.SkipTo(pp.LineEnd())
        ),
        indent_stack
    )
)
But I get the error:
ParseException: not a subentry (at char 55), (line:2, col:1)
Char 55 is at the very beginning of the run-on line:
...will wrap\n on to the next line...
^
Which seems a bit odd, because that char position is clearly followed by the whitespace which makes it an indented subentry.
My traceback in ipdb looks like:
   5311 def checkSubIndent(s,l,t):
   5312     curCol = col(l,s)
   5313     if curCol > indentStack[-1]:
   5314         indentStack.append( curCol )
   5315     else:
-> 5316         raise ParseException(s,l,"not a subentry")
   5317

ipdb> indentStack
[1]
ipdb> curCol
1
I should add that the whole structure above that I'm matching may also be indented (by an unknown amount), so a solution like:
description = pp.Combine(
    pp.SkipTo(pp.LineEnd()) + pp.LineEnd() +
    pp.ZeroOrMore(
        pp.White(' ') + pp.SkipTo(pp.LineEnd()) + pp.LineEnd()
    )
)
...which works for the example as presented, will not work in my case, as it will consume the subsequent definitions.
When you use indentedBlock, the argument you pass in is the expression for each line in the block, so it shouldn't be indentedBlock(ZeroOrMore(line_expression), stack), just indentedBlock(line_expression, stack). Pyparsing includes a built-in expression for "everything from here to the end of the line", named restOfLine, so we will just use that as the expression for each line in the indented block:
import pyparsing as pp

NL = pp.LineEnd().suppress()
label = pp.ungroup(pp.Word(pp.alphas, pp.alphanums+'_') + pp.Suppress(":"))
indent_stack = [1]

# see corrected version below
# description = pp.Group((pp.Empty()
#                         + pp.restOfLine + NL
#                         + pp.ungroup(pp.indentedBlock(pp.restOfLine, indent_stack))))

description = pp.Group(pp.restOfLine + NL
                       + pp.Optional(pp.ungroup(~pp.StringEnd()
                                                + pp.indentedBlock(pp.restOfLine,
                                                                   indent_stack))))

labeled_text = pp.Group(label("label") + pp.Empty() + description("description"))
We use ungroup to remove the extra level of nesting created by indentedBlock, but we also need to remove the per-line nesting that is created internally in indentedBlock. We do this with a parse action:
def combine_parts(tokens):
    # recombine description parts into a single list
    tt = tokens[0]
    new_desc = [tt.description[0]]
    new_desc.extend(t[0] for t in tt.description[1:])
    # reassign the rebuilt description into the parsed token structure
    tt['description'] = new_desc
    tt[1][:] = new_desc

labeled_text.addParseAction(combine_parts)
At this point, we are pretty much done. Here is your sample text parsed and dumped:
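(To make this runnable on its own, here is the text from the question assigned to a variable; the name sample is an assumption:)

sample = """\
identifier: some description text here which will wrap
    on to the next line. the follow-on text should be
    indented. it may contain identifier: and any text
    at all is allowed
next_identifier: more description, short this time
last_identifier: blah blah
"""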
parsed_data = (pp.OneOrMore(labeled_text)).parseString(sample)
print(parsed_data[0].dump())
['identifier', ['some description text here which will wrap', 'on to the next line. the follow-on text should be', 'indented. it may contain identifier: and any text', 'at all is allowed']]
- description: ['some description text here which will wrap', 'on to the next line. the follow-on text should be', 'indented. it may contain identifier: and any text', 'at all is allowed']
- label: 'identifier'
Or use this code to pull out the label and description fields:
for item in parsed_data:
    print(item.label)
    print('..' + '\n..'.join(item.description))
    print()
identifier
..some description text here which will wrap
..on to the next line. the follow-on text should be
..indented. it may contain identifier: and any text
..at all is allowed
next_identifier
..more description, short this time
last_identifier
..blah blah

Parsing a custom format (curly braces separated) text configuration with Pyparsing

I need to parse a load balancer configuration section. It's seemingly simple (at least for a human).
The config consists of several objects with their content in curly braces, like so:
ltm rule ssl-header-insert {
    when HTTP_REQUEST {
        HTTP::header insert "X-SSL-Connection" "yes"
    }
}
ltm rule some_redirect {
    priority 1
    when HTTP_REQUEST {
        if { (not [class match [IP::remote_addr] equals addresses_group ]) }
        {
            HTTP::redirect "http://some.page.example.com"
            TCP::close
            event disable all
        }
    }
The contents of each section/object is TCL code, so there will be nested curly braces. What I want to achieve is to parse this into pairs: the object identifier (after the ltm rule keywords) and its contents (the TCL code within the braces), unchanged.
I've looked at some examples and experimented a lot, but it's really giving me a hard time. I did some debugging within pyparsing (which is a bit confusing to me too), and I think I'm failing to detect closing braces somehow, but I can't figure it out.
What I came up with so far:
from pyparsing import *
import json

list_sample = """ltm rule ssl-header-insert {
    when HTTP_REQUEST {
        HTTP::header insert "X-SSL-Connection" "yes"
    }
}
ltm rule some_redirect {
    priority 1
    when HTTP_REQUEST {
        if { (not [class match [IP::remote_addr] equals addresses_group ]) }
        {
            HTTP::redirect "http://some.page.example.com"
            TCP::close
            event disable all
        }
    }
}
ltm rule http_header_replace {
    when HTTP_REQUEST {
        HTTP::header replace Host some.host.example.com
    }
}"""

ParserElement.defaultWhitespaceChars = (" \t")
NL = LineEnd()
END = StringEnd()
LBRACE, RBRACE = map(Suppress, '{}')
ANY_HEADER = Suppress("ltm rule ") + Word(alphas, alphanums + "_-")
END_MARK = Literal("ltm rule")

CONTENT_LINE = (~ANY_HEADER + (NotAny(RBRACE + FollowedBy(END_MARK)) + ~END + restOfLine) | (~ANY_HEADER + NotAny(RBRACE + FollowedBy(END)) + ~END + restOfLine)) | (~RBRACE + ~END + restOfLine)

ANY_HEADER.setName("HEADER").setDebug()
LBRACE.setName("LBRACE").setDebug()
RBRACE.setName("RBRACE").setDebug()
CONTENT_LINE.setName("LINE").setDebug()

template_defn = ZeroOrMore((ANY_HEADER + LBRACE +
                            Group(ZeroOrMore(CONTENT_LINE)) +
                            RBRACE))
template_defn.ignore(NL)

results = template_defn.parseString(list_sample).asList()

print("Raw print:")
print(results)
print("----------------------------------------------")
print("JSON pretty dump:")
print(json.dumps(results, indent=2))
I see in the debug output that some of the matches work, but in the end it fails with an empty list as the result.
On a side note, my CONTENT_LINE part of the grammar is probably overly complicated in general, but I didn't find any simpler way to cover it so far.
The next thing would be to figure out how to preserve newlines and tabs in the content part, since I need that unchanged in the output. But it looks like I have to use the ignore() function - which skips newlines - to parse the multiline text in the first place, so that's another challenge.
I'd be grateful for someone to help me find out what the issues are. Or maybe I should take some other approach?
I think nestedExpr('{', '}') will help. That will take care of the nested '{}'s, and wrapping it in originalTextFor will preserve newlines and spaces.
import pyparsing as pp

LTM, RULE = map(pp.Keyword, "ltm rule".split())
ident = pp.Word(pp.alphas, pp.alphanums+'-_')
ltm_rule_expr = pp.Group(LTM + RULE
                         + ident('name')
                         + pp.originalTextFor(pp.nestedExpr('{', '}'))('body'))
Using your sample string (after adding missing trailing '}'):
for rule, _, _ in ltm_rule_expr.scanString(list_sample):
    print(rule[0].name, rule[0].body.splitlines()[0:2])
gives
ssl-header-insert ['{', ' when HTTP_REQUEST {']
some_redirect ['{', ' priority 1']
dump() is also a good way to list out the contents of a returned ParseResults:
for rule, _, _ in ltm_rule_expr.scanString(list_sample):
    print(rule[0].dump())
    print()
['ltm', 'rule', 'ssl-header-insert', '{\n when HTTP_REQUEST {\n HTTP::header insert "X-SSL-Connection" "yes"\n}\n}']
- body: '{\n when HTTP_REQUEST {\n HTTP::header insert "X-SSL-Connection" "yes"\n}\n}'
- name: 'ssl-header-insert'
['ltm', 'rule', 'some_redirect', '{\n priority 1\n\nwhen HTTP_REQUEST {\n\n if { (not [class match [IP::remote_addr] equals addresses_group ]) }\n {\n HTTP::redirect "http://some.page.example.com"\n TCP::close\n event disable all\n }\n}}']
- body: '{\n priority 1\n\nwhen HTTP_REQUEST {\n\n if { (not [class match [IP::remote_addr] equals addresses_group ]) }\n {\n HTTP::redirect "http://some.page.example.com"\n TCP::close\n event disable all\n }\n}}'
- name: 'some_redirect'
Note that I broke up 'ltm' and 'rule' into separate keyword expressions. This guards against the case where a developer may have written valid code as ltm  rule blah, with > 1 space between "ltm" and "rule". This kind of thing happens all the time; you never know where whitespace will crop up.
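As a follow-on sketch (assuming list_sample is the full configuration string from the question), the scanned results can be collected into a name-to-body dict and pretty-printed, mirroring the json.dumps step the question was aiming for:

import json

# Collect each rule's TCL body, keyed by rule name, whitespace preserved.
rules = {r[0].name: r[0].body for r, _, _ in ltm_rule_expr.scanString(list_sample)}
print(json.dumps(rules, indent=2))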

Delete a repeating pattern in a string using Python

I have a JavaScript file with an array of data.
info = [ {
    Date = "YR-MM-DDT00:00:10"
}, ....
What I'm trying to do is remove the T and everything after it in the Date field.
Here's what I've tried:
import re

with open("info.js", "r") as myFile:
    data = myFile.read();

data = re.sub('\0-9T,'',data);
Desired output for each Date field in the array:
Date = "YR-MM-DD"
You should match the T and the characters that come after it. This works for a single timestamp:
import re
print(re.sub('T.*$', '', 'YR-MM-DDT00:00:10'))
Or if you have text containing a bunch of timestamps, match the closing double quote as well, and replace with a double quote:
import re
text = """
info = [ {
Date = "YR-MM-DDT00:00:10",
Date = "YR-MM-DDT01:02:03",
Date = "YR-MM-DDT11:22:33"
}
"""
new_text = re.sub('T.*"', '"', text)
print(new_text)
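One caveat, as a hedged refinement: T.*" is greedy, so if a line ever contained another double quote after the timestamp, the match would overrun. A character class keeps the substitution anchored to the first closing quote:

# [^"]* stops at the first closing quote instead of the last one on the line
new_text = re.sub('T[^"]*"', '"', text)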

Work with Chinese in Python

I'm trying to work with Chinese text and big data in Python.
Part of the work is cleaning the text of some unneeded data. For this goal I am using regexes. However I have met some problems, both with Python regexes and with the PyCharm application:
1) The data is stored in PostgreSQL and displays well in the columns; however, after selecting it and pulling it into a variable, it is displayed as a square in the debugger. When the value is printed to the console it looks like:
Mentholatum 曼秀雷敦 男士 深层活炭洁面乳100g（新包装）
So I presume there is no problem with the application encoding, but with the debugger's handling of the encoding; however, I did not find any solution for this behaviour.
2) An example of a regex task I need to handle is removing the values between Chinese (full-width) brackets, including the brackets themselves. The code I used is:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import re
from pprint import pprint
import sys, locale, os

columnString = row[columnName]
startFrom = valuestoremove["startsTo"]
endWith = valuestoremove["endsAt"]
isInclude = valuestoremove["include"]

escapeCharsRegex = re.compile('([\.\^\$\*\+\?\(\)\[\{\|])')
nonASCIIregex = re.compile('([^\x00-\x7F])')

if escapeCharsRegex.match(startFrom):
    startFrom = re.escape(startFrom)
if escapeCharsRegex.match(endWith):
    endWith = re.escape(endWith)

if isInclude:
    regex = startFrom + '(.*)' + endWith
else:
    regex = '(?<=' + startFrom + ').*?(?=' + endWith + ')'

if nonASCIIregex.match(regex):
    p = re.compile(ur'' + regex)
else:
    p = re.compile(regex)

row[columnName] = p.sub("", columnString).strip()
But the regex has no effect on the given string.
I've made a test with the following code:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import re

reg = re.compile(ur'（(.*)）')
string = u"巴黎欧莱雅 男士 劲能冰爽洁面啫哩（原男士劲能净爽洁面啫哩）100ml"
print string
string = reg.sub("", string)
print string
And it works fine for me.
The only difference between those two code examples is that in the first, the regex values come from a txt file with JSON, encoded as UTF-8:
{
    "between": {
        "startsTo": "（",
        "endsAt": "）",
        "include": true,
        "sequenceID": "1"
    }
}, {
    "between": {
        "startsTo": "（",
        "endsAt": "）",
        "include": true,
        "sequenceID": "2"
    }
}, {
    "between": {
        "startsTo": "（",
        "endsAt": "）",
        "include": true,
        "sequenceID": "2"
    }
}, {
    "between": {
        "startsTo": "（",
        "endsAt": "）",
        "include": true,
        "sequenceID": "2"
    }
}
The Chinese brackets from the file are also displayed as squares in the debugger.
I can't find an explanation or any solution for this behaviour, so I need the community's help.
Thanks for the help.
The problem is that the text you're reading in isn't being decoded to Unicode correctly (this is one of the big gotchas that prompted sweeping changes for Python 3). Instead of:
data_file = myfile.read()
You need to tell it to decode the file:
data_file = myfile.read().decode("utf8")
Then continue with json.loads, etc, and it should work out fine. Alternatively,
data = json.load(myfile, "utf8")
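To see the core point in isolation, here is a minimal sketch (Python 2, matching the question's environment; the sample value and the full-width brackets come from the question, while the variable names are mine):

# -*- coding: utf-8 -*-
import re

# A unicode value, as it comes out of the database column.
text = u"Mentholatum 曼秀雷敦 男士 深层活炭洁面乳100g（新包装）"

# Bracket characters as read from the UTF-8 rules file: raw bytes that
# must be decoded before they can match the unicode column value.
start = "（".decode("utf8")
end = "）".decode("utf8")

# Escape the brackets and remove everything between them, inclusive.
pattern = re.escape(start) + u'.*?' + re.escape(end)
print re.sub(pattern, u'', text).strip()
# Mentholatum 曼秀雷敦 男士 深层活炭洁面乳100g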
After many searches and consultations, here is a solution for Chinese text (it also works for mixed and non-mixed language):
import codecs

def betweencase(valuestoremove, row, columnName):
    columnString = row[columnName]
    startFrom = valuestoremove["startsTo"]
    endWith = valuestoremove["endsAt"]
    isInclude = valuestoremove["include"]

    escapeCharsRegex = re.compile('([\.\^\$\*\+\?\(\)\[\{\|])')
    if escapeCharsRegex.match(startFrom):
        startFrom = re.escape(startFrom)
    if escapeCharsRegex.match(endWith):
        endWith = re.escape(endWith)

    if isInclude:
        regex = ur'' + startFrom + '(.*)' + endWith
    else:
        regex = ur'(?<=' + startFrom + ').*?(?=' + endWith + ')'

    p = re.compile(codecs.encode(unicode(regex), "utf-8"))

    delimiter = ' '
    if localization == 'CN':
        delimiter = ''

    row[columnName] = p.sub(delimiter, columnString).strip()
As you can see, we encode the regex to UTF-8, so that it matches the value coming from the PostgreSQL database.
