I'm having difficulty understanding the "run" object. As described, a run identifies a contiguous stretch of text with the same style. However, when I iterate over the runs of a paragraph in which every word has the same style, "runs" still returns more than one item.
To make sure I didn't miss any possible style issues, I created a new Word document and typed the following:
Hjkhkuhu joiuiuouoiuo iouiouououoi iouiououiuiuiui hhvvhgh hgjjhjhhh hjhjhjhjhjhj hjhjhj, jjkjkjk jkjkjkjkiuio uiouiouoo! jkjkjlkjlk
Then I ran the code below:
from docx import Document
doc = Document('test.docx')
for p in doc.paragraphs:
    for run in p.runs:
        print(run.text)
And here is the result I got:
Hjkhkuhu
joiuiuouoiuo
iouiouououoi
iouiououiuiuiui
hhvvhgh
hgjjhjhhh
hjhjhjhjhjhj
hjhjhj
jjkjkjk
jkjkjkjkiuio
uiouiouoo ! jkjkjlkjlk
Why is this the case? Did I miss anything?
Having spent a few days tussling with docx runs now..
Paragraphs <w:p> contain one or more "style" runs <w:r>, each containing one or more text elements <w:t>.
But docx runs are very easy to break, and when broken they hide it very well.
Having two pieces of text in the same format isn't necessarily enough to put them in the same run. Runs don't join automatically, and changing the format of some text in a run and then changing it back is enough to leave you with two separate but identically formatted runs.
(Greater experts than I dig further into runs and text in "DOCX w:t (text) elements crossing multiple w:r (run) elements?")
This caused me a lot of problems with a 'tag substitution within runs' task, leading me to conclude that the only way to guarantee text is all in one run is to enter it yourself (with unchanging format) in one fell swoop.
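To make the split visible, here is a minimal sketch using only the standard library. The paragraph fragment and the {name} tag are made up for illustration; in a real .docx this markup lives in word/document.xml. The last two runs carry identical (bold) formatting but are still separate <w:r> elements, so a per-run search misses the tag while a search on the joined text finds it.

```python
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the <w:...> elements.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Hypothetical paragraph: the last two runs have identical formatting (bold)
# but are still separate <w:r> elements, splitting the "{name}" tag.
PARAGRAPH_XML = """
<w:p xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:r><w:t xml:space="preserve">Hello </w:t></w:r>
  <w:r><w:rPr><w:b/></w:rPr><w:t>{na</w:t></w:r>
  <w:r><w:rPr><w:b/></w:rPr><w:t>me}</w:t></w:r>
</w:p>
"""


def run_texts(paragraph):
    """Return the text of each <w:r> run, in document order."""
    return [
        "".join(t.text or "" for t in run.iter(f"{{{W}}}t"))
        for run in paragraph.iter(f"{{{W}}}r")
    ]


paragraph = ET.fromstring(PARAGRAPH_XML)
texts = run_texts(paragraph)
full_text = "".join(texts)

print(texts)                               # one entry per run
print(any("{name}" in t for t in texts))   # searching run by run
print("{name}" in full_text)               # searching the joined text
```

For substitution work this suggests matching against the joined paragraph text rather than against individual runs, at the cost of having to decide which run's formatting the replacement should keep.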
Related
I have a folder with multiple files (.doc and .docx). For the sake of this question I want to deal primarily with the .doc files, unless both file types can be accounted for in the code.
I'm writing code to read the folder and identify the .doc files. The objective is to output paragraphs 3, 4, and 7. I'm not sure why, but Python is reading each paragraph from a different spot in each file. I suspect there are spacing/formatting inconsistencies that I wasn't aware of initially. To work around the formatting issue, I was thinking I could specify the strings I want output, but I'm not sure how to do that. I tried adding a string to the code, but that didn't work.
How can I modify my code to be able to account for finding the strings that I want?
Original Code
import glob
import docx
doc = ''
for file in glob.glob(r'folderpathway*.docx'):
    doc = docx.Document(file)
    print(doc.paragraphs[3].text)
    print(doc.paragraphs[4].text)
    print(doc.paragraphs[7].text)
Code to account for the formatting issues
doc = ''
for file in glob.glob(r'folderpathway*.docx'):
    doc = docx.Document(file)
    print(doc.paragraphs["Substance Number"].text)
TypeError: list indices must be integers or slices, not str
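The error arises because doc.paragraphs is a plain list, so it can only be indexed by position. One possible approach is to search the paragraph texts for the string instead. The sketch below assumes the wanted paragraphs contain the literal text "Substance Number"; the helper name find_paragraphs and the sample data are made up for illustration. Note also that python-docx only opens .docx files, so .doc files would need converting first.

```python
def find_paragraphs(paragraph_texts, needle):
    """Return (index, text) for every paragraph that contains `needle`."""
    return [
        (index, text)
        for index, text in enumerate(paragraph_texts)
        if needle in text
    ]


# With python-docx the input would be: [p.text for p in doc.paragraphs]
# The list below is made-up sample data standing in for a real document.
sample = ["", "Report", "Substance Number: 42", "", "Substance Number: 7"]

matches = find_paragraphs(sample, "Substance Number")
for index, text in matches:
    print(index, text)
```

Because the match is on content rather than position, stray empty paragraphs in some files no longer shift the result.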
I am having trouble replacing words in the entire text document. The following code works to replace words in the main paragraphs but not when text is present in text boxes.
wdFindContinue = 1
wdReplaceAll = 2
word.Selection.Find.Execute(find_str, True, True, False, False, False, True, wdFindContinue, False, replace_str, wdReplaceAll)
Is there a way to replace words in the entire document?
See the article by Word MVPs, "Using a macro to replace text wherever it appears in a document":
A collaborative effort of MVP’s Doug Robbins and Greg Maxey with
enhancements by Peter Hewett and Jonathan West
Using the Find or Replace utility on the Edit menu you can find or
replace text "almost" anywhere it appears in the document. If you
record that action however, the scope or "range" of the resulting
recorded macro will only act on the text contained in the body of the
document (or more accurately, it will only act on the part of the
document that contains the insertion point). This means that if the
insertion point is located in the main body of the document when your
macro is executed it will have no effect on text that is in the
headers or footers of the document, for example, or in a textbox,
footnotes, or any other area that is outside the main body of the
document.
Even the Find and Replace utility has a shortcoming. For example, text
in a textbox located in a header or footer is outside the scope of the
Find and Replace utility search range.
***
To use a macro to find or replace text anywhere in a document, it is
necessary to loop through each individual part of the document. In
VBA, these parts are called StoryRanges. Each StoryRange is identified
by a unique wdStoryType constant.
There are eleven different wdStoryType constants that can form the
StoryRanges (or parts) of a document (ok, a few more in later versions
of Word, but they have no bearing in this discussion). Simple
documents may contain only one or two StoryRanges, while more complex
documents may contain more. The wdStoryTypes that have a role in find
and replace are:
wdCommentsStory, wdEndnotesStory, wdEvenPagesFooterStory,
wdEvenPagesHeaderStory, wdFirstPageFooterStory,
wdFirstPageHeaderStory, wdFootnotesStory, wdMainTextStory,
wdPrimaryFooterStory, wdPrimaryHeaderStory, and wdTextFrameStory.
The complete code to find or replace text anywhere is a bit complex.
Accordingly, let’s take it a step at a time to better illustrate the
process. In many cases the simpler code is sufficient for getting the
job done.
Step 1
The following code loops through each StoryRange in the active
document and replaces the specified .Text with .Replacement.Text:
Sub FindAndReplaceFirstStoryOfEachType()
    Dim rngStory As Range
    For Each rngStory In ActiveDocument.StoryRanges
        With rngStory.Find
            .Text = "find text"
            .Replacement.Text = "I'm found"
            .Wrap = wdFindContinue
            .Execute Replace:=wdReplaceAll
        End With
    Next rngStory
End Sub
(Note for those already familiar with VBA: whereas if you use
Selection.Find, you have to specify all of the Find and Replace
parameters, such as .Forward = True, because the settings are
otherwise taken from the Find and Replace dialog's current settings,
which are “sticky”, this is not necessary if using [Range].Find –
where the parameters use their default values if you don't specify
their values in your code).
The simple macro above has shortcomings. It only acts on the "first"
StoryRange of each of the eleven StoryTypes (i.e., the first header,
the first textbox, and so on). While a document only has one
wdMainTextStory StoryRange, it can have multiple StoryRanges in some
of the other StoryTypes. If, for example, the document contains
sections with un-linked headers and footers, or if it contains
multiple textboxes, there will be multiple StoryRanges for those
StoryTypes and the code will not act upon the second and subsequent
StoryRanges. To even further complicate matters, if your document
contains unlinked headers or footers and one of the headers or footers
are empty then VBA can have trouble "jumping" that empty header or
footer and process subsequent headers and footers.
The page has more, but the above should help.
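Since the question drives Word from Python, here is a rough sketch of the same loop on that side. It is untested against Word and rests on assumptions: `document` is taken to be a Word Document COM object (for example word.ActiveDocument obtained through win32com), and the function name replace_everywhere is made up. It reuses the positional Execute arguments from the question and also follows NextStoryRange, which is how further text boxes or unlinked headers of the same story type are reached.

```python
WD_FIND_CONTINUE = 1  # wdFindContinue
WD_REPLACE_ALL = 2    # wdReplaceAll


def replace_everywhere(document, find_str, replace_str):
    """Run a replace-all on every story range of a Word document object.

    `document` is assumed to be a Word Document COM object; that is an
    assumption of this sketch, not something shown in the question.
    Returns the number of ranges that were processed.
    """
    visited = 0
    for story in document.StoryRanges:
        rng = story
        # Walk the chain of further ranges of the same story type
        # (e.g. additional text boxes or unlinked headers).
        while rng is not None:
            # Same positional arguments as the Selection.Find.Execute call
            # in the question, applied to a Range instead of the Selection.
            rng.Find.Execute(find_str, True, True, False, False, False,
                             True, WD_FIND_CONTINUE, False,
                             replace_str, WD_REPLACE_ALL)
            visited += 1
            rng = rng.NextStoryRange
    return visited
```

The caveat quoted above about empty unlinked headers and footers still applies, and this sketch does nothing to work around it.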
Essentially what I need to do is write a program that takes in many .docx files and puts them all in one, ordered in a certain way. I have importing working via:
import docx, os, glob
finaldocname = 'Midterm-All-Questions.docx'
finaldoc=docx.Document()
docstoworkon = glob.glob('*.docx')
if finaldocname in docstoworkon:
    docstoworkon.remove(finaldocname) #dont process final doc if it exists
for f in docstoworkon:
    doc=docx.Document(f)
    fullText=[]
    for para in doc.paragraphs:
        fullText.append(para.text) #generates a long text list
    # finaldoc.styles = doc.styles
    for l in fullText:
        # if l=='u\'\\n\'':
        if '#' in l:
            print('We got here!')
            if '#1 ' not in l: #check last two characters to see if this is the first question
                finaldoc.add_section() #only add a page break between questions
        finaldoc.add_paragraph(l)
    # finaldoc.add_page_break
# finaldoc.add_page_break
finaldoc.save(finaldocname)
But I need to preserve text styles, like font colors, sizes, italics, etc., and this method loses them since it just gets the raw text and dumps it. I can't find anything in the python-docx documentation about preserving text styles or importing anything other than raw text. Does anyone know how to go about this?
Styles are a bit difficult to work with in python-docx but it can be done.
See this explanation first to understand some of the problems with styles and Word.
The Long Way
When you read in a file as a Document() it will bring in all of the paragraphs and within each of these are the runs. These runs are chunks of text with the same style attached to them.
You can find out how many paragraphs or runs there are by doing len() on the object or you can iterate through them like you did in your example with paragraphs.
You can inspect the style of any given paragraph but runs may have different styles than the paragraph as a whole, so I would skip to the run itself and inspect the style there using paragraphs[0].runs[0].style which will give you a style object. You can inspect the font object beyond that which will tell you a number of attributes like size, italic, bold, etc.
Now to the long solution:
You should first create a new blank paragraph, then call add_run() once for each run of your original, passing its text. For each of these you can set a style attribute, but it has to be a named style as described in the first link. You cannot apply a style object directly, as that won't copy the attributes over. But there is a way around that: check the attributes that you care about copying to the output, and then make sure your new run applies the same attributes.
doc_out = docx.Document()
for para in doc.paragraphs:
    p = doc_out.add_paragraph()
    for run in para.runs:
        r = p.add_run(run.text)
        if run.bold:
            r.bold = True
        if run.italic:
            r.italic = True
        # etc
Obviously this is inefficient and not a great solution, but it will work to ensure you have copied the style appropriately.
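The "# etc" part can be gathered into one helper. This is only a sketch: the name copy_run_format is made up, `src` and `dst` are assumed to be python-docx Run objects, and it covers just a handful of attributes (bold, italic, underline, font name, font size and an explicit RGB color). Assigning the values directly, rather than testing them with `if`, also carries over an explicit False or an inherited None.

```python
def copy_run_format(src, dst):
    """Copy common character formatting from one run to another.

    `src` and `dst` are assumed to be python-docx Run objects. Only the
    attributes listed below are copied; anything else is left untouched.
    """
    # Tri-state values (True, False, or None for "inherit") are copied as-is.
    dst.bold = src.bold
    dst.italic = src.italic
    dst.underline = src.underline
    dst.font.name = src.font.name
    dst.font.size = src.font.size
    # Only copy an explicit RGB colour; theme and automatic colours are
    # not handled by this sketch.
    if src.font.color.rgb is not None:
        dst.font.color.rgb = src.font.color.rgb
```

Inside the loop above it would be called as copy_run_format(run, r) right after r = p.add_run(run.text).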
Add New Styles
There is a way to add styles by name but because it isn't likely that the Word document you are getting the text and styles from is using named styles (rather than just applying bold, etc. to the words that you want), it is probably going to be a long road to adding a lot of slightly different styles or sometimes even the same ones.
Unfortunately that is the best answer I have for you on how to do this. Working with Word, Outlook, and Excel documents is not great in Python, especially for what you are trying to do.
How to get/extract number of lines added and deleted?
(Just like we do using git diff --numstat).
from git import Repo
repo_ = Repo('git-repo-path')
git_ = repo_.git
log_ = git_.diff('--numstat','HEAD~1')
print(log_)
prints the entire output (lines added/deleted and file-names) as a single string. Can this output format be modified or changed so as to extract useful information?
Output format: num(added) num(deleted) file-name
For all files modified.
If I understand you correctly, you want to extract data from your log_ variable and then re-format it and print it? If that's the case, then I think the simplest way to fix it, is with a regular expression:
import re
for line in log_.split('\n'):
    m = re.match(r"(\d+)\s+(\d+)\s+(.+)", line)
    if m:
        print("{}: rows added {}, rows deleted {}".format(m[3], m[1], m[2]))
The exact output, you can of course modify any way you want, once you have the data in a match m. Getting the hang of regular expressions may take a while but it can be very helpful for small scripts.
Be advised, though, that regular expressions tend to be write-only code and can be very hard to debug. For extracting small parts like this, they are still very helpful.
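If you would rather avoid the regular expression, the --numstat output is tab-separated, so a plain split is also an option. A small sketch follows; the function name parse_numstat and the sample string are made up. Binary files are reported by git with "-" in both count columns, which this version returns as None.

```python
def parse_numstat(output):
    """Parse `git diff --numstat` text into (added, deleted, path) tuples.

    Binary files are reported by git as "-" for both counts; they are
    returned here with None in place of the numbers.
    """
    rows = []
    for line in output.splitlines():
        if not line.strip():
            continue  # skip blank lines
        added, deleted, path = line.split("\t", 2)
        rows.append((
            None if added == "-" else int(added),
            None if deleted == "-" else int(deleted),
            path,
        ))
    return rows


# Made-up sample in the same shape as the log_ string from the question.
sample = "3\t1\tREADME.md\n-\t-\tlogo.png\n10\t0\tsrc/app.py"
for added, deleted, path in parse_numstat(sample):
    print(path, added, deleted)
```

With the variables from the question this would be called as parse_numstat(log_).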
I hope you can help me combine two pieces of text in one paragraph. My style is called "cursiva" and works perfectly. I have other styles too, but the result is the same if I change cursiva to another one. The issue is that if I use this code I get this.
As you can see, it shows a line break between them, and I need them to appear together.
I need it to look like this (one, one) together because I need to use two styles. I'm using Arial Narrow, so if I want italic or bold I have to use each font separately, because the typeface does not allow me to use "<i>italic text</i>". So I need to use two different styles, which work fine separately.
How can I achieve this?
from reportlab.lib.styles import ParagraphStyle
from reportlab.platypus import Paragraph
cursiva = ParagraphStyle('cursiva')
cursiva.fontSize = 8
cursiva.fontName= "Arialni"
incertidumbre=[]
incertidumbre.extend([Paragraph("one", cursiva), Paragraph("one", cursiva)])
Thank you guys
The question you are asking is actually caused by a workaround for a different problem, namely that you don't know how to register font families in Reportlab. Because that is what is needed to make <i> and <b> work.
You have probably already managed to add a custom font, so the first part should look familiar; the final line is probably the missing link. It registers the combination of these fonts as a family.
from reportlab.pdfbase import pdfmetrics
from reportlab.pdfbase.ttfonts import TTFont
from reportlab.pdfbase.pdfmetrics import registerFontFamily
pdfmetrics.registerFont(TTFont('Arialn', 'Arialn.ttf'))
pdfmetrics.registerFont(TTFont('Arialnb', 'Arialnb.ttf'))
pdfmetrics.registerFont(TTFont('Arialni', 'Arialni.ttf'))
pdfmetrics.registerFont(TTFont('Arialnbi', 'Arialnbi.ttf'))
registerFontFamily('Arialn',normal='Arialn',bold='Arialnb',italic='Arialni',boldItalic='Arialnbi')