How do I remove unwanted additional empty lines after adding page break? - python

I am trying to reformat this .docx document using the python-docx module. Each question ends with the specific expression "-- ans end --", and I want to insert a page break after that expression with the following code:
import docx, re
from pathlib import Path
from docx.enum.text import WD_BREAK

filename = Path("DOCUMENT_NAME")
doc = docx.Document(filename)
for para in doc.paragraphs:
    match = re.search(r"-- ans end --", para.text)
    if match:
        run = para.add_run()
        run.add_break(WD_BREAK.PAGE)
After each page break there seem to be 2 empty lines, which I tried to remove with:
para.text = para.text.strip("\n")
Stripping the empty lines before adding the page break does nothing, while stripping them after adding the page break removes the page break as well.
Please tell me how to eliminate or avoid adding the 2 empty lines. Thanks.
Update:
The page break should be added at the start of the next paragraph/section rather than after -- ans end -- (the end of the current section), because a page break added to the end of a paragraph creates a new line (try it in Word). Therefore I used this:
from docx.text.run import Run
from docx.oxml import OxmlElement

new_run_element = OxmlElement("w:r")        # build an empty <w:r> element
run = para.runs[0]                          # first run of the paragraph that should start the new page
run._element.addprevious(new_run_element)   # insert the new run before it
new_run = Run(new_run_element, run._parent)
new_run.text = ""
new_run.add_break(WD_BREAK.PAGE)
to add a page break to the start of next paragraph instead, which does not create a new line.

Have you looked at the contents of your doc before and after altering it? E.g.
for para in doc.paragraphs:
    print(repr(para.text))  # the call to repr() makes your `\n`s show up
This is helpful for figuring out what is going on.
Prior to altering your doc, there are no \ns in the paragraphs containing -- ans end --, so it makes sense that stripping the empty lines before adding your page break doesn't do anything. Also, prior to editing your doc, there is already an empty string in a paragraph right after each -- ans end --:
'-- ans --'
'-- ans end --'
''
is what stuff looks like before you edit the doc. (Except there is one case where -- ans end -- is followed by two ''s, which is annoyingly different from all the others.)
After editing the doc, those sections look like this.
'-- ans end --\n'
''
When I run this code, as I mentioned in my comment above, the page break actually shows up in the wrong spot: right after -- ans end -- instead of right before. I think that can be worked around in a fairly straightforward way; I'll leave it to you if you're also having that issue.
If you remove those '' paragraphs I think that solves your problem. It is annoying to remove a paragraph from a document, but see this GitHub answer for an incantation which does it.
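The incantation in question is only a couple of lines on top of python-docx internals. A rough sketch (the delete_paragraph helper name is mine, and the loop assumes you only want to drop the empty paragraph that follows each marker):
def delete_paragraph(paragraph):
    # Detach the underlying <w:p> element from its parent, then clear the
    # Python-side references so the Paragraph object isn't reused by accident.
    p = paragraph._element
    p.getparent().remove(p)
    paragraph._p = paragraph._element = None

paragraphs = list(doc.paragraphs)
for prev, para in zip(paragraphs, paragraphs[1:]):
    if "-- ans end --" in prev.text and para.text == "":
        delete_paragraph(para)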

Related

How to add a new line before a capital letter?

I am writing a piece of code to get lyrics from genius.com.
I have managed to extract the content from the website, but it comes out in a format where all the text is on one line.
I have used regex to add a space, but cannot figure out how to add a new line. Here is my code so far:
text_container = re.sub(r"(\w)([A-Z])", r"\1 \2", text_container.text)
This adds a space before the capital letter, but I cannot figure out how to add a new line.
It is returning [Verse 1]Leaves are fallin' down on the beautiful ground I heard a story from the man in red He said, "The leaves are fallin' down
I would like to add a new line before "He" in the command line.
Any help would be greatly appreciated.
Thanks :)
If genius.com doesn't somehow provide a separator, it will be very hard to know what to look for.
In your example, I made a regex searching for " [A-Z]", which will find " He...". But it will also find every place where " I..." appears. Sometimes a new sentence really does start with "I", but other times it doesn't, so it might make new lines where there actually shouldn't be one.
TL;DR - genius.com needs to provide some sort of separator so we know when there should be a new line.
Disclaimer: Unless I missed something in your description/example
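For example, here is a minimal sketch of that substitution applied to the string from the question; it breaks the line before "He" as wanted, but also before the standalone "I":
import re

text = ('[Verse 1]Leaves are fallin\' down on the beautiful ground '
        'I heard a story from the man in red He said, "The leaves are fallin\' down')
# Replace any space that is followed by a capital letter with a newline
print(re.sub(r" (?=[A-Z])", "\n", text))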
A quick skim of the view-source for a genius lyrics page suggests that you're stripping all the HTML markup which would otherwise contain the info about linebreaks etc.
You're probably better off posting that code (likely as a separate question) and asking how to correctly extract not just the text nodes, but also enough of the <span> structure to format it as necessary.
Looking around, I found a Python package for pulling lyrics from Genius.com; here's the link to its documentation:
https://lyricsgenius.readthedocs.io/en/master/
Just follow the instructions and it should have what you need. With more info on the problem I could provide a more detailed response.
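A minimal sketch of what using it might look like (the access token, song title and artist below are placeholders, not real values):
import lyricsgenius

genius = lyricsgenius.Genius("YOUR_GENIUS_ACCESS_TOKEN")
song = genius.search_song("SONG TITLE", "ARTIST NAME")
if song is not None:
    # The lyrics returned by the package already contain the line breaks
    print(song.lyrics)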
I'm not sure about using regex. Try this method:
text = lyrics
new_text = ''
for i, letter in enumerate(text):
    if i and letter.isupper():
        new_text += '\n'
    new_text += letter
print(new_text)
However, as oscillate123 has explained, it will create a new line for every capital letter regardless of the context.

"# this is a string", How python identifies it as a string but not a comment?

I really want to know how Python identifies a # in quotes as part of a string and a normal # as a comment.
I mean, how does the code that tells these apart actually work? Does Python read a line and exclude the string in order to find the comment?
"# this is a string" # this is a comment
How is the comment identified? Does Python exclude the string, and if so, how?
How could we write code that does the same, for example to design a compiler for our own language in Python?
I am a newbie, please help.
You need to know that whether something is a string or a comment can be determined from just its first character. That is the job of the scanner (or lexical analyzer, if you want to sound fancy).
If it starts with a ", it's a string. If it starts with #, it's a comment.
In the code that makes up Python itself, there's probably a loop that goes something like this:
# While there is still source code to read
while not done:
    # Get the current character
    current = source[pos]
    # If the current character is a pound sign
    if current == "#":
        # While we are not at the end of the line
        while current != "\n":
            # Get the next character
            pos += 1
            current = source[pos]
    elif current == '"':
        # Code to read a string omitted for brevity...
        pass
    else:
        done = True
In the real Python lexer, there are probably dozens more of those if statements, but I hope you have a better idea of how it works now. :)
Because of the quotes
# This is a comment
x = "# this is a string"
x = '# this is a also string'
x = """# this string
spans
multiple
lines"""
"# this is a string" # this is a comment
In simple terms, the interpreter sees the first ", then it takes everything that follows as part of the string until it finds the matching " which terminates the string. Then it sees the subsequent # and interprets everything to follow as a comment. The first # is ignored because it is between the two quotes, and hence is taken as part of the string.
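You can actually watch Python's lexer make that decision using the standard library's tokenize module: the quoted part comes back as a single STRING token and the trailing part as a COMMENT token.
import io
import tokenize

source = '"# this is a string" # this is a comment\n'
for tok in tokenize.generate_tokens(io.StringIO(source).readline):
    print(tokenize.tok_name[tok.type], repr(tok.string))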

Pyparsing for Paragraphs

I have run into a slight problem with pyparsing that I can't seem to solve. I'd like to write a rule that will parse a multiline paragraph for me. The end goal is to end up with a recursive grammar that will parse something like:
Heading: awesome
This is a paragraph and then
a line break is inserted
then we have more text
but this is also a different line
with more lines attached
Other: cool
This is another indented block
possibly with more paragraphs
This is another way to keep this up
and write more things
But then we can keep writing at the old level
and get this
Into something like HTML, so maybe the following (of course, with a parse tree I can transform this into whatever format I like):
<Heading class="awesome">
<p> This is a paragraph and then a line break is inserted and then we have more text </p>
<p> but this is also a different line with more lines attached</p>
<Other class="cool">
<p> This is another indented block possibly with more paragraphs</p>
<p> This is another way to keep this up and write more things</p>
</Other>
<p> But then we can keep writing at the old level and get this</p>
</Heading>
Progress
I have managed to get to the stage where I can parse the heading row and an indented block using pyparsing. But I can't:
Define a paragraph as multiple lines that should be joined
Allow a paragraph to be indented
An Example
Following from here, I can get the paragraphs to output to a single line, but there doesn't seem to be a way to turn this into a parse tree without removing the line break characters.
I believe a paragraph should be:
words = ...  # I've defined words to allow the set of characters I need
lines = OneOrMore(words)
paragraph = OneOrMore(lines) + lineEnd
But this doesn't seem to work for me. Any ideas would be awesome :)
So I managed to solve this, for anybody who stumbles upon this in the future. You can define the paragraph like this. Although it is certainly not ideal and doesn't exactly match the grammar I described, the relevant code is:
line = OneOrMore(CharsNotIn('\n')) + Suppress(lineEnd)
emptyline = ~line
paragraph = OneOrMore(line) + emptyline
paragraph.setParseAction(join_lines)
Where join_lines is defined as:
def join_lines(tokens):
    stripped = [t.strip() for t in tokens]
    joined = " ".join(stripped)
    return joined
That should point you in the right direction if this matches your needs :) I hope that helps!
A Better Empty Line
The definition of an empty line given above is definitely not ideal, and it can be improved dramatically. The best way I've found is the following:
empty_line = Suppress(LineStart() + ZeroOrMore(" ") + LineEnd())
empty_line.setWhitespaceChars("")
This allows you to have empty lines that are filled with spaces, without breaking the match.

Incorrect syntax near GO in SQL

I am concatenating many SQL statements and am running into the following errors:
"Incorrect syntax near GO" and "Incorrect syntax near "-
It seems that when I delete the trailing space, the GO, and the space after the GO, and then press CTRL+Z to put the GO back, the error goes away. It's pretty weird.
Why?
How could I handle this in Python? Thanks.
')
END TRY
BEGIN CATCH
print ERROR_MESSAGE()
END CATCH
GO
As already mentioned in the comments, GO is not part of the SQL syntax; rather, it is a batch delimiter in Management Studio.
You can get around it in two ways: use Subprocess to call SqlCmd, or cut the scripts within Python. The Subprocess + SqlCmd route will only really work for you if you don't care about query results, as you would need to parse console output to get those.
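For completeness, the Subprocess + SqlCmd route could look roughly like this; sqlcmd understands GO itself, so the script does not need to be split first (the server name, credentials and script path are placeholders, and it assumes sqlcmd is installed and on the PATH):
import subprocess

result = subprocess.run(
    ["sqlcmd", "-S", "SERVER_NAME", "-U", "USER", "-P", "PASSWORD",
     "-d", "tempdb", "-i", "your_script.sql"],
    capture_output=True, text=True,
)
# Query results only exist as console text here, which is why this route is
# awkward when you actually need the data back in Python.
print(result.stdout)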
I needed to build a database from SSMS-generated scripts in the past and created the function below as a result (updating it here, as I now have a better version that leaves comments in):
import re


def partition_script(sql_script: str) -> list:
    """Take the string provided as parameter and cut it on every line that contains only a "GO" string.

    Contents of the script are also checked for commented GO's; these are removed from the comment if found.
    If a GO was left in a multi-line comment, the cutting step would generate invalid code
    missing a multi-line comment marker in each part.
    :param sql_script: str
    :return: list
    """
    # Regex for finding GO's that are the only entry in a line
    find_go = re.compile(r'^\s*GO\s*$', re.IGNORECASE | re.MULTILINE)
    # Regex to find multi-line comments
    find_comments = re.compile(r'/\*.*?\*/', flags=re.DOTALL)
    # Get a list of multi-line comments that also contain lines with only GO
    go_check = [comment for comment in find_comments.findall(sql_script) if find_go.search(comment)]
    for comment in go_check:
        # Change the 'GO' entry to '-- GO', making it invisible for the cutting step
        sql_script = sql_script.replace(comment, re.sub(find_go, '-- GO', comment))
    # Removing single-line comments, uncomment if needed
    # sql_script = re.sub(r'--.*$', '', sql_script, flags=re.MULTILINE)
    # Returning everything besides empty strings
    return [part for part in find_go.split(sql_script) if part != '']
Using this function, you can run scripts containing GO like this:
import pymssql

conn = pymssql.connect(server, user, password, "tempdb")
cursor = conn.cursor()
for part in partition_script(your_script):
    cursor.execute(part)
conn.close()
I hope this helps.

pyparsing capturing groups of arbitrary text with given headers as nested lists

I have a text file that looks similar to;
section header 1:
some words can be anything
more words could be anything at all
etc etc lala
some other header:
as before could be anything
hey isnt this fun
I am trying to construct a grammar with pyparsing that results in the following list structure when asking for the parsed results as a list (i.e. the following should be printed when iterating through the parsed.asList() elements):
['section header 1:',[['some words can be anything'],['more words could be anything at all'],['etc etc lala']]]
['some other header:',[['as before could be anything'],['hey isnt this fun']]]
The header names are all known beforehand, and individual headers may or may not appear. If they do appear, there is always at least one line of content.
The problem I am having is that I can't get the parser to recognise where 'section header 1:' ends and 'some other header:' begins. I end up with a parsed.asList() looking like:
['section header 1:',[['some words can be anything'],['more words could be anything at all'],['etc etc lala'],['some other header'],['as before could be anything'],['hey isnt this fun']]]
(i.e. section header 1: gets seen correctly, but everything following it gets added to section header 1, including further header lines etc.)
I've tried various things, and played with leaveWhitespace() and LineEnd() in various ways, but I can't figure it out.
The base parser I am hacking about with is (a contrived example; in reality this is part of a class definition etc.):
header_1_line = Literal('section header 1:')
text_line = Group(OneOrMore(Word(printables)))
header_1_block = Group(header_1_line + Group(OneOrMore(text_line)))
header_2_line = Literal('some other header:')
header_2_block = Group(header_2_line + Group(OneOrMore(text_line)))
overall_structure = ZeroOrMore(header_1_block | header_2_block)
and is being called with
parsed=overall_structure.parseFile()
Cheers, Matt.
Matt -
Welcome to pyparsing! You have fallen into one of the most common pitfalls in working with pyparsing, and that is that people are smarter than computers. When you look at your input text, you can easily see which text can be headers and which text can't be. Unfortunately, pyparsing is not so intuitive, so you have to tell it explicitly what can and can't be text.
When you look at your sample text, you are not accepting just any line of text as possible text within a section header. How do you know that 'some other header:' is not valid as text? Because you know that that string matches one of the known header strings. But in your current code, you have told pyparsing that any collection of Word(printables) is valid text, even if that collection is a valid section header.
To fix this, you have to add some explicit lookahead to your parser. Pyparsing offers two constructs, NotAny and FollowedBy. NotAny can be abbreviated using the '~' operator, so we can write this pseudocode expression for text:
text = ~any_section_header + everything_up_to_the_end_of_the_line
Here is a complete parser using negative lookahead to make sure you read each section, breaking on section headings:
from pyparsing import ParserElement, LineEnd, Literal, restOfLine, ZeroOrMore, Group, StringEnd
test = """
section header 1:
some words can be anything
more words could be anything at all
etc etc lala
some other header:
as before could be anything
hey isnt this fun
"""
ParserElement.setDefaultWhitespaceChars(" \t")
NL = LineEnd().suppress()
END = StringEnd()

header_1 = Literal('section header 1:')
header_2 = Literal('some other header:')
any_header = (header_1 | header_2)

# text isn't just anything! don't accept a header line, and stop at the end of the input string
text = Group(~any_header + ~END + restOfLine)

overall_structure = ZeroOrMore(Group(any_header +
                                     Group(ZeroOrMore(text))))
overall_structure.ignore(NL)

from pprint import pprint
pprint(overall_structure.parseString(test).asList())
In my first attempt, I forgot to also look for the end of string, so my restOfLine expression looped forever. By adding a second lookahead for the string end, my program terminates successfully. Exercise left for you: instead of enumerating all possible headers, define a header line as any line that ends with a ':'.
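A minimal sketch of one way to tackle that exercise (this is an assumption on top of the answer above, treating any whole line that ends with a ':' as a header):
from pyparsing import Regex

# Matches an entire line whose last non-blank character is ':'
any_header = Regex(r"[^\n]+:[ \t]*(?=\n|$)")
# Drop this in place of the (header_1 | header_2) alternative above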
Good luck with your pyparsing efforts,
-- Paul
