I'm creating weekly reports, and the data all comes from a Google Sheet with the same format. Instead of entering the data manually into the Word file, I created a Word template and want to import the data from the Google Sheet into it automatically.
My Word template looks like:
The bolded data in the Word file comes from the "New" column. The green/red data in the Word file comes from the "Diff" column.
I know how to get this data from the Google Sheet using pandas, but I want to know how I should place it in the specific areas of my Word template.
I think the best way to go about it would be to go from Google Sheet -> Google Doc and take advantage of the native integration there. From there you can just export it as a .docx file or something, and it should be openable in Word as well. I did this exact thing a while back, so it's definitely doable (if not easier now), but here's a place to start.
I would like to create an interface like the one pictured to do a text search within multiple PDF files. I would like to display the matching PDF file's info, in the next column a snippet of the matching text, and in the last column the actual PDF with the matching text highlighted. I am guessing the place to start would be indexing with elasticsearch, but I would love to hear all suggestions. How best to accomplish this?
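Whatever search backend you pick, the middle "snippet" column is mostly plain string work. A rough sketch of a snippet helper (the function name and window size are my own choices, nothing elasticsearch-specific):

```python
def snippet(text, query, radius=40):
    """Return a short context window around the first case-insensitive match,
    or None when the query does not occur in the text."""
    i = text.lower().find(query.lower())
    if i == -1:
        return None
    start = max(0, i - radius)
    end = min(len(text), i + len(query) + radius)
    prefix = "..." if start > 0 else ""
    suffix = "..." if end < len(text) else ""
    return prefix + text[start:end] + suffix
```

Elasticsearch can also return highlighted fragments natively (its "highlighting" feature), which would replace this helper entirely if you go that route.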
I have one use case. Let's say there is a PDF report with data from the testing of some manufacturing components,
and this PDF report is loaded into a database using some internally developed software.
We need to develop a reconciliation program in which the data from the PDF report is compared against the database. We can assume the PDF file has a fixed template.
If there are many tables and some raw text data in the PDF, how does MySQL store this PDF data: in one table or in many tables?
Please suggest an approach (preferably in Python) for comparing the data.
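Once both sides have been loaded into pandas DataFrames (one parsed from the PDF, one queried from the database), the comparison step itself can be sketched like this; the component names, values, and column names are all made up for illustration:

```python
import pandas as pd

# Hypothetical frames: one parsed from the PDF, one queried from the DB.
pdf_df = pd.DataFrame({"component": ["A1", "B2", "C3"],
                       "value": [10.0, 20.0, 31.0]})
db_df = pd.DataFrame({"component": ["A1", "B2", "C3"],
                      "value": [10.0, 20.5, 31.0]})

# Join on the key column, keeping both value columns side by side.
merged = pdf_df.merge(db_df, on="component", suffixes=("_pdf", "_db"))

# Rows where the PDF and the database disagree.
mismatches = merged[merged["value_pdf"] != merged["value_db"]]
print(mismatches)
```

An outer merge with indicator=True would additionally surface rows that exist on only one side, which is usually the other half of a reconciliation report.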
Finding and extracting specific text from URL PDF files, without downloading or writing (solution): have a look at this example and see if it helps. I found it worked efficiently for me. It handles URL-based PDFs, but you could simply change the input source to be your DB. In your case you can remove the two if statements under the if isinstance(obj, pdfminer.layout.LTTextBoxHorizontal): line. You mention having PDFs with the same template; if you are looking to extract text from one specific area of the template, use the print statement that has been commented out to find the coordinates of the desired data, then, as is done in the example, use those coordinates in if statements.
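To make the coordinate filtering concrete, here is a sketch using pdfminer.six; the region values are placeholders you would discover for your own template by printing element.bbox, and the function names are mine:

```python
def in_region(bbox, region):
    """True if a text box's bounding box lies fully inside the target region."""
    x0, y0, x1, y1 = bbox
    rx0, ry0, rx1, ry1 = region
    return rx0 <= x0 and ry0 <= y0 and x1 <= rx1 and y1 <= ry1

def extract_region(pdf_path, region):
    """Yield text from boxes inside region (PDF points, origin bottom-left)."""
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextBoxHorizontal
    for page_layout in extract_pages(pdf_path):
        for element in page_layout:
            if isinstance(element, LTTextBoxHorizontal):
                # Uncomment to discover coordinates on your fixed template:
                # print(element.bbox, repr(element.get_text()))
                if in_region(element.bbox, region):
                    yield element.get_text()
```

Because the template is fixed, the same region tuple should work for every report, which is what makes the reconciliation repeatable.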
Here is a document (Word file), and I extract some sentences from it and write them into an Excel file with Python.
Now I want to create a hyperlink on each sentence that links to the page where the sentence belongs.
For example, if the sentence "I love python" is on page 5 of a Word file, after I extract this sentence to a cell of an Excel file with Python, is it possible to create a hyperlink linking back to page 5 of that Word file with xlsxwriter?
I'm afraid there is no way to hyperlink to a specific page in a .docx file. You can only hyperlink at the file level.
Attaching the code here.
formula = 'HYPERLINK("{path}", "Click Here")'.format(path=<PATH TO FILE>)
value = xlwt.Formula(formula)
This is done with the xlwt library.
Also, hyperlinks behave differently across spreadsheet applications: a hyperlink that works in Microsoft Excel may not work in OpenOffice.
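For completeness, the same file-level link can be written with xlsxwriter (which the question mentions) via write_url. The file names here are placeholders, and note the link still targets the whole document, not a page inside it:

```python
import xlsxwriter

workbook = xlsxwriter.Workbook("sentences.xlsx")
worksheet = workbook.add_worksheet()

# "external:" makes xlsxwriter treat the target as a file path, not a web URL.
worksheet.write_url("A1", "external:report.docx", string="I love python")
workbook.close()
```

The string argument is the text shown in the cell, so the extracted sentence itself can double as the link label.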
I am trying to iterate through all tables in a document and extract the text from them. As an intermediate step I am just trying to print the text to the console.
I have looked at other code provided by scanny in similar posts, but for some reason it is not giving me the expected output for the document I am parsing.
The document can be found at https://www.ontario.ca/laws/regulation/140300
from docx import Document
from docx.enum.text import WD_COLOR_INDEX
import os, re, sys

document = Document("path/to/doc")
tables = document.tables
for table in tables:
    for row in table.rows:
        for cell in row.cells:
            for paragraph in cell.paragraphs:
                print(paragraph.text)
I expect this to print out all the text, but instead I get nothing. If I try print(row.cells) it just prints (), which is an empty tuple I guess. My document definitely does have text in the cells though. Not sure what's wrong here.
Any help is appreciated,
It's possible that the cell text is "contained" in a wrapper element that python-docx doesn't yet understand. The most common example is revision marks.
The most direct way to diagnose the problem is to inspect the XML for the table in question, using opc-diag (as one option). But if it is revision marks, I believe accepting all revisions on the document will fix it, although I haven't actually tried that myself.
If that doesn't work and you post a sample of the table XML I can take a closer look.
Found the error. I was using a third-party tool (MultiDoc Converter) to convert old .doc files into .docx format. It works for the most part, but there must be some metadata that doesn't convert properly, because it was causing the issue. Opening the file and manually saving it as .docx solved the issue. The only problem is that I want to convert 2000+ files to .docx, so I'll need to find another solution for converting the files.
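For bulk .doc-to-.docx conversion without opening each file by hand, one commonly used route is LibreOffice's headless converter. This assumes soffice is installed and on PATH, and the directory names are placeholders:

```shell
# Convert every legacy .doc under docs/ into .docx files written to converted/
soffice --headless --convert-to docx --outdir converted/ docs/*.doc
```

Whether LibreOffice's conversion preserves the metadata any better than the original tool did is worth spot-checking on a few files before running the full batch.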
My document had hundreds of tables and only a few were coming out as empty (when in fact they weren't). So I tried to extract the data from the PDF version of the same document with tabula; same result: a few newly created tables were coming out empty!
After a bit of digging, I realized that my Word document was in "Track Changes" mode (to have the "change bars" indicate the differences from the previous version). The tables themselves were changes that had not been accepted yet, and those were the tables that didn't get extracted.
SOLUTION: In my case, I had to accept all changes to the document (in the "Review" tab of Word, in the "Accept" drop-down, click "Accept All Changes") and save the document again.
This is not a duplicate, although the issue has been raised in this forum in 2011 (Getting a hyperlink URL from an Excel document), 2013 (Extracting Hyperlinks From Excel (.xlsx) with Python) and 2014 (Getting the URL from Excel Sheet Hyper links in Python with xlrd); there is still no answer.
After a deep dive into the xlrd module, it seems the Data_sheet.hyperlink_map.get((row, col)) item trips because "xlrd cannot read the hyperlink without formatting_info, which is currently not supported for xlsx", per @alecxe at Extracting Hyperlinks From Excel (.xlsx) with Python.
Question: has anyone made progress with extracting URLs from hyperlinks stored in an Excel file? Say, among all the customer data, there is a column of hyperlinks. I was toying with the idea of dumping the Excel sheet as an HTML page and proceeding per usual scraping (file on a local drive), but that's not a production solution. Supplementary: is there any other module that can extract the URL from a .cell(row, col).value() call on the hyperlink cell? Is there a solution in mechanize? Many thanks.
I had the same problem trying to get the hyperlinks from the cells of an xlsx file. The workaround I came up with is simply converting the Excel sheet to xls format, from which I could manage to get the hyperlinks without any trouble, and once I finished the editing, I converted it back to the original xlsx format.
I don't know if this will work for your specific needs, or if the change of format implies some consequences I am not aware of, but I think it's worth a try.
I was able to read hyperlinks and use them to copy files with openpyxl. It has cell_obj.hyperlink and cell_obj.hyperlink.target, which will grab the link value. I made a list of the (row, col) values of the cells that had hyperlinks, then looped through the list to move the linked files.
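To flesh that out, here is a self-contained sketch with openpyxl; the workbook, cell contents, and URL are created in code just so the example runs on its own (in practice you would simply load your customer file):

```python
from openpyxl import Workbook, load_workbook

# Build a small workbook with one hyperlinked cell (stand-in for the real file).
wb = Workbook()
ws = wb.active
ws["A1"] = "Customer site"
ws["A1"].hyperlink = "https://example.com/customer"
wb.save("links.xlsx")

# Collect (row, column, url) for every cell that carries a hyperlink.
wb = load_workbook("links.xlsx")
links = []
for row in wb.active.iter_rows():
    for cell in row:
        if cell.hyperlink is not None:
            links.append((cell.row, cell.column, cell.hyperlink.target))
print(links)
```

Unlike xlrd, openpyxl reads xlsx hyperlinks without any formatting_info workaround, which makes it the more production-friendly option for this task.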