I'm currently creating a part of a program: I'm supposed to take an entire HTML page and get it into a Word file. There are titles, paragraphs, graphics...
But my HTML page also contains tables (table, tr, th, td tags), which I parse into a Word table. My tables are too wide, so some words get cut into parts. I expected something like this:
|Myentireword|
but it comes out like this (because there are too many cells in one row):
|Myent|
|irewo|
|rd |
It's really ugly, so I don't see any solution other than transforming my table into an image, but I can't find a way to do this. (Use my HTML parser, create a table of strings, render it into an image, save the image, then load it into my Word document. I can do the first and second steps, and I can load an image into my document, but I have no way of creating the image itself.)
I cannot change the output format; it is required to be Word. Our customer wants to be able to modify and save it, and many customers are not really experts. So I'm not able to use some PIL technique (also, it doesn't solve my problem).
I already have a working parser and output for my HTML, so I can't use something like "create a table from HTML".
I'm not able to use CSS or PDF generation, as I said, so this approach won't work either: "Quality tables in python". But that question is otherwise exactly mine!
I cannot use Matplotlib to create the table myself, because I do not know how to print into it, and I need something (if possible) to "merge" cells. If that's not possible, I'll do without; I just need something that works. It's really necessary, please!
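For the missing step (turning a table of strings into an image), one possibility is matplotlib's table artist. This is only an untested sketch with made-up data; note that matplotlib's table has no real cell merging, so that part would still need a workaround:

import matplotlib.pyplot as plt

data = [["Myentireword", "123"],          # made-up example rows
        ["Anotherlongword", "456"]]

fig, ax = plt.subplots(figsize=(4, 1))
ax.axis('off')                            # hide the axes, keep only the table
tbl = ax.table(cellText=data, loc='center', cellLoc='left')
tbl.auto_set_font_size(False)
tbl.set_fontsize(10)
fig.savefig('table.png', dpi=200, bbox_inches='tight')   # then insert table.png into the Word document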
I'm using Camelot to extract table information from a PDF that I converted from scanned to searchable using ocrmypdf (500 dpi).
Camelot seems to be able to identify the table and extract most of the data within it, but it seems unable to extract the bottom half. In essence, it sees the top half of the table but seems unable to separate the text from the lower half.
This is the table from the PDF in question:
But when I use Camelot's visual debugging method, where I ask it to show me the words it will extract, it seems to recognize the bottom section of the table as one giant block.
Any guidance you can provide on improving Camelot's "vision" here would be helpful.
Apart from the block, the horizontal lines are also marked as text, which is odd.
Camelot uses pdfminer.six for text extraction, and you can pass LAParams (page 16) to camelot.read_pdf() to tweak that.
You should also check out camelot.plot(table, kind="grid") to see whether the lines are recognized correctly. If not, that might be where the problem lies.
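An untested sketch of both suggestions; the file name and the LAParams values are placeholders to tune for your scan:

import camelot

# layout_kwargs is forwarded to pdfminer.six's LAParams; these values are only
# a starting point for experimentation.
tables = camelot.read_pdf(
    "scanned_table.pdf",
    flavor="lattice",                                        # line-based table detection
    layout_kwargs={"char_margin": 1.0, "line_margin": 0.5},
)

# Visual debugging: check whether Camelot found the ruling lines of the grid.
camelot.plot(tables[0], kind="grid").show()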
I am planning to search for a specific heading in the document, and then I have to strike out all the contents under that heading. The document has many headings, and each heading may have paragraphs, tables, and images, together or in any combination.
I have installed python-docx, and I was able to search for the specific heading and strike out paragraphs and tables.
Now I am not able to access the images under that heading. To indicate that an image is struck out, we are trying to blur it.
Problem 1: I am able to get the image ID (resource ID) and image name for all the images in the document, but I don't know how to get the resource ID for the images under the specific heading, and then I have to blur them.
Problem 2: I have enabled the Track Changes option using a VBA macro from Python code, but whatever changes I make using python-docx (strikeout) are not highlighted for tracking.
These are two separate questions (or three, depending on how you count). I'll address the first one here; you can post the other as a separate new question. (Maybe: "How to use python-docx to track changes in a Word document".)
Regarding blurring the image, you have two challenges:
Identify images associated with a particular area in the document.
Blur the image.
There is no direct API support for either of these operations in python-docx. However, you can use python-docx to access the underlying XML and make the changes using lxml calls (which python-docx uses internally). Such efforts are commonly called "workaround functions", so if you search Google on 'python-docx OR python-pptx workaround function' you will find examples.
An in-line image is stored at the Run level. So you can iterate over all the runs in the section of interest and see if any of them have images. This analysis page from the python-docx project has some of the details you'll need: http://python-docx.readthedocs.io/en/latest/dev/analysis/features/shapes/shapes-inline.html
Basically you'd do something like this:
for run in runs:  # however you decide to get the runs
    r = run._element  # the `<w:r>` XML element for the run
    pics = r.xpath('.//w:drawing/wp:inline/a:graphic/a:graphicData/pic:pic')
    if not pics:
        continue  # skip runs that don't contain a picture
    print(r.xml)  # if you want to see the XML for this run
This will print the XML for run elements containing a picture.
Regarding the actual blurring, there are two approaches I can think of:
Replace the current picture with a "blurred" version.
Change the transparency of the image in Word to make it look much lighter. This does not remove detail from the image; the actual, unchanged image is still "behind" it if, for example, the user wants to right-click and pick "Save image...".
The second approach is much easier. You'll have to decide whether it meets your requirements.
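Here is a rough, untested sketch of the second approach, building on the run-level XML shown above. a:alphaModFix is standard DrawingML; python-docx has no API for it, so it is added with a workaround-style lxml call. The 20% value and the helper name are just illustrative:

from docx.oxml.ns import qn

def fade_pictures_in_run(run, alpha_pct=20):
    """Make every picture in the run render at roughly alpha_pct % opacity in Word."""
    for blip in run._element.xpath('.//a:blip'):
        # DrawingML expresses the amount in thousandths of a percent.
        fix = blip.makeelement(qn('a:alphaModFix'), {'amt': str(alpha_pct * 1000)})
        blip.insert(0, fix)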
Once you decide which way you want to go you can search for solutions to that problem or submit a new question focused on that topic.
I am trying to take my data and put it in tables in either Microsoft Word or LibreOffice Writer.
I need to be able to change the background of cells within the table, and I need to be able to change the page orientation to 'landscape'.
I have been looking for a library with simple code (I am a beginner in coding), but I did not find one for what I need to do.
Have you heard of anything for me? If there are examples of how to use it, that would make it easier for me to learn.
Check out this project
And here is a great quick-start guide
It's pretty simple to use. I haven't tested this, but it should work:
from docx import Document

document = Document()

r = 2  # number of rows you want
c = 2  # number of columns you want

table = document.add_table(rows=r, cols=c)
table.style = 'LightShading-Accent1'  # set your style; see the documentation for other built-in styles

for y in range(r):
    for x in range(c):
        cell = table.cell(y, x)  # look up the cell before writing to it
        cell.text = 'text goes here'

document.save('demo.docx')  # save the document
I don't think you can set the page orientation property with this library, but what you could do is create a blank Word document in landscape yourself, store it in the working directory, and make a copy of it every time you generate this document.
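Untested, but that workaround would look something like this, assuming you saved a blank landscape document as landscape_template.docx:

from docx import Document

document = Document('landscape_template.docx')   # blank document already set to landscape
# ... add your tables and text as shown above ...
document.save('demo.docx')                       # the copy keeps the landscape page setup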
The previous answer is a good one, but there is another way: create the document in Word, then hack the XML in Python to insert the content you want. I have done this several times. In fact, my current invoicing program works this way.
Disadvantages: Conditional formatting and numbered lists will require some real XML knowledge.
Advantages: No limitations or intricate style definitions to manage. EVERYTHING is supported, because it's all done in Word.
Here's the workflow:
Create a *.docx document with marker text where you want your headings, table cells, etc.
Use lxml to find those markers and copy their parent elements (along with their formatting).
Use those found elements to create templates.
Insert your data into those templates, and assemble the whole thing like a jigsaw puzzle.
Zip your XML into a new *.docx file.
I have a short sample project showing the procedure. Docx2Python does most of the work for you.
https://github.com/ShayHill/replace_docx_tables
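To make the workflow concrete, here is a stripped-down, untested sketch of steps 2-5 using only the standard library and lxml. The marker text "{{CLIENT}}" and the file names are made up for the example, and real documents often split a marker across several w:t runs, which needs extra handling:

import zipfile
from lxml import etree

W = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'

# Read every part of the original package.
with zipfile.ZipFile('template.docx') as zin:
    parts = {name: zin.read(name) for name in zin.namelist()}

# Patch the main document part: replace the marker text wherever it appears.
tree = etree.fromstring(parts['word/document.xml'])
for t in tree.iter(W + 't'):                      # every literal text node in the body
    if t.text and '{{CLIENT}}' in t.text:
        t.text = t.text.replace('{{CLIENT}}', 'ACME Corp.')
parts['word/document.xml'] = etree.tostring(tree, xml_declaration=True,
                                            encoding='UTF-8', standalone=True)

# Zip everything back up into a new .docx.
with zipfile.ZipFile('invoice.docx', 'w', zipfile.ZIP_DEFLATED) as zout:
    for name, data in parts.items():
        zout.writestr(name, data)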
I want to add a table of contents on the first page, but the data is uncertain, so the contents must be drawn based on the data.
I tried to solve this problem by drawing twice, calculating and recording the contents on the first pass, and it works well.
But it wastes time and is ungainly.
Can I insert the first page after I have finished drawing? Or is there another way to solve this problem?
Sorry, my English is poor.
If you use Platypus with ReportLab you can get a table of contents "for free": see the ReportLab manual for details, but you basically just add a TableOfContents flowable to the list of Flowables for the document and you're done. Otherwise you'll have to figure out all the logic for building a table of contents yourself.
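An untested sketch of the Platypus approach; the file name, page geometry, heading style and TOC level are assumptions to adapt:

from reportlab.lib.styles import getSampleStyleSheet
from reportlab.lib.units import cm
from reportlab.platypus import BaseDocTemplate, Frame, PageTemplate, Paragraph
from reportlab.platypus.tableofcontents import TableOfContents

class TocDocTemplate(BaseDocTemplate):
    def afterFlowable(self, flowable):
        # Register every Heading1 paragraph as a level-0 TOC entry.
        if isinstance(flowable, Paragraph) and flowable.style.name == 'Heading1':
            self.notify('TOCEntry', (0, flowable.getPlainText(), self.page))

styles = getSampleStyleSheet()
doc = TocDocTemplate('toc_demo.pdf')
doc.addPageTemplates([PageTemplate('normal', [Frame(2*cm, 2*cm, 17*cm, 25*cm)])])

story = [TableOfContents(),                              # the TOC goes on the first page
         Paragraph('Chapter 1', styles['Heading1']),
         Paragraph('Some body text.', styles['Normal'])]
doc.multiBuild(story)                                    # extra passes let the page numbers settle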
I have this problem: I need to scrape lots of different HTML data sources. Each data source contains a table with lots of rows, for example country name, phone number, price per minute.
I would like to build a semi-automatic scraper which will try to:
automatically find the right table in the HTML page,
-- probably by searching the text for some sample data and trying to find the common HTML element that contains both
extract the rows
-- by looking at the above two elements and selecting the same pattern
identify which column contains what
-- by using some fuzzy algorithm to best guess which column is what.
export it to some Python / other list
-- cleaning everything.
Does this look like a good design? What tools would you choose to do it in if you program in Python?
Does this look like a good design?
No.
What tools would you choose to do it in if you program in Python?
Beautiful Soup
automatically find the right table in the HTML page, -- probably by searching the text for some sample data and trying to find the common HTML element that contains both
Bad idea. A better idea is to write a short script to find all tables, dump the table and the XPath to the table. A person looks at the table and copies the XPath into a script.
extract the rows -- by looking at the above two elements and selecting the same pattern
Bad idea. A better idea is to write a short script to find all tables, dump the table with the headings. A person looks at the table and configures a short block of Python code to map the table columns to data elements in a namedtuple.
identify which column contains what -- by using some fuzzy algorithm to best guess which column is what.
A person can do this trivially.
export it to some Python / other list -- cleaning everything.
Almost a good idea.
A person picks the right XPath to the table. A person writes a short snippet of code to map column names to a namedtuple. Given these parameters, then a Python script can get the table, map the data and produce some useful output.
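A rough, untested sketch of that person-in-the-loop pipeline; the file name, the chosen XPath and the field names are placeholders:

from collections import namedtuple
import lxml.html

doc = lxml.html.parse('rates.html')

# Step 1: profile the page -- print every table's XPath and its headings so a
# person can decide which table is the right one.
for table in doc.getroot().iter('table'):
    headings = [th.text_content().strip() for th in table.iter('th')]
    print(doc.getpath(table), headings)

# Step 2: a person copies the chosen XPath and the column order into the
# configuration, and the script does the rest.
TABLE_XPATH = '/html/body/div[2]/table[1]'        # picked by a person from the profile
Rate = namedtuple('Rate', 'country phone_number price_per_minute')

table = doc.xpath(TABLE_XPATH)[0]
rows = [Rate(*[td.text_content().strip() for td in tr.findall('td')])
        for tr in table.findall('.//tr') if tr.findall('td')]   # assumes exactly 3 data cells per row
print(rows[:3])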
Why include a person?
Because web pages are filled with notoriously bad errors.
After having spent the last three years doing this, I'm pretty sure that fuzzy logic and magical "trying to find" and "selecting the same pattern" isn't a good idea and doesn't work.
It's easier to write a simple script to create a "data profile" of the page.
It's easier to write a simple script that reads a configuration file and does the processing.
I cannot see a better solution.
It is convenient to use XPath to find the right table.