Python 3.6 on Windows: retrieving the clipboard CF_HTML format

I want to copy some rich text, modify its source code (changing some tags and text, using regex and/or beautifulsoup) and send it back to the clipboard. I'm looking for the easiest way to do that.
I tried win32clipboard, but it doesn't support the CF_HTML format (windows clipboard contains many formats).
So I'm looking for a module that could help me to get this format:
if the CF_HTML clipboard format contains HTML, store it in that variable, do some operation, then send it back. (Optionally: and do other stuff on other clipboard formats)
Here is a Linux equivalent of what I'm looking for. It retrieves the HTML source, when there's some in the clipboard (source)
#!/usr/bin/env python
import gtk
print (gtk.Clipboard().wait_for_contents('text/html')).data
Edit1: There is a work around with pywin32 using this script. But is there a module able to do that directly (if CF_HTML contains data, get it, and send it back)?

The Edit1 solution actually seems to be the best.
Put the script above (HtmlClipboard.py) in the Python module folder: C:\Users\xxx\AppData\Local\Programs\Python\Python36\Lib\site-packages
Install pywin32, which provides the win32clipboard module.
With the two steps above done, you can play with a script like this:
# Get the CF_HTML clipboard content
import HtmlClipboard  # HtmlClipboard.py script found on GitHub

if HtmlClipboard.HasHtml():
    dirty_HTML = HtmlClipboard.GetHtml()
    print(dirty_HTML)
    clean_HTML = dirty_HTML  # do what you want with it here
    # Put the data back on the clipboard
    HtmlClipboard.PutHtml(clean_HTML)
else:
    print('no html')
Bonus:
##get CF_TEXT from clipboard
import win32clipboard
win32clipboard.OpenClipboard()
text = win32clipboard.GetClipboardData(win32clipboard.CF_TEXT)
win32clipboard.CloseClipboard()
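For context on what a helper like HtmlClipboard.py has to deal with: CF_HTML data is not bare HTML. It starts with a small text header whose fields give byte offsets (in the UTF-8 encoded payload) to the whole HTML and to the selected fragment. The following is a minimal sketch of that layout, not the code of HtmlClipboard.py; the function names `build_cf_html` and `extract_fragment` are made up for illustration, and the wrapper markup around the fragment is an assumption about the simplest valid document.

```python
import re

MARKER_START = "<!--StartFragment-->"
MARKER_END = "<!--EndFragment-->"

# Fixed-width numeric fields keep the header length constant,
# so the offsets can be computed before the header is filled in.
HEADER = (
    "Version:0.9\r\n"
    "StartHTML:{:010d}\r\n"
    "EndHTML:{:010d}\r\n"
    "StartFragment:{:010d}\r\n"
    "EndFragment:{:010d}\r\n"
)


def build_cf_html(fragment):
    """Wrap an HTML fragment in a CF_HTML payload (UTF-8 bytes)."""
    prefix = ("<html><body>" + MARKER_START).encode("utf-8")
    body = fragment.encode("utf-8")
    suffix = (MARKER_END + "</body></html>").encode("utf-8")

    # All offsets are byte offsets from the start of the payload.
    start_html = len(HEADER.format(0, 0, 0, 0).encode("utf-8"))
    start_fragment = start_html + len(prefix)
    end_fragment = start_fragment + len(body)
    end_html = end_fragment + len(suffix)

    header = HEADER.format(start_html, end_html, start_fragment, end_fragment)
    return header.encode("utf-8") + prefix + body + suffix


def extract_fragment(payload):
    """Return the fragment located by the StartFragment/EndFragment offsets."""
    start = int(re.search(rb"StartFragment:(\d+)", payload).group(1))
    end = int(re.search(rb"EndFragment:(\d+)", payload).group(1))
    return payload[start:end].decode("utf-8")
```

With pywin32, the format itself can be obtained with win32clipboard.RegisterClipboardFormat("HTML Format"); the returned id is what GetClipboardData and SetClipboardData expect for this format, with the bytes above as the data.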

Related

Is there a way in wxPython to detect RTF format text on the clipboard?

I'm currently implementing paste from the clipboard in a wx.grid.Grid derived class.
All works fine except for pasting from Excel on a Mac, where the data string converted from the clipboard using a wx.TextDataObject has two '\n' characters separating rows instead of one.
Example code:
def Paste(self, mode=0):
    clipboard = wx.TextDataObject()
    wx.TheClipboard.GetData(clipboard)
    wx.TheClipboard.Close()
    data = clipboard.GetText()
    if data[-1] == "\n":
        data = data[:-1]
    print(repr(data))
for three rows containing 1,2,3 produces '1\n2\n3' normally, and '1\n\n2\n\n3' from Excel.
Current fix is a special paste command that splits the clipboard data string using "\n\n" instead of "\n", but I'd like to improve this by using automatic format detection instead of manual paste command choice.
The OSX clipboard viewer shows that the copy from Excel is RTF format.
Is there a function for detecting RTF format clipboard in wxPython/normal Python?
Thanks!
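Independently of clipboard format detection, the doubled separators can also be recognised in the text itself. The sketch below is only a heuristic and an assumption about the data: it treats "\n\n" as the row separator when every newline in the string belongs to such a pair. The helper name `split_rows` is hypothetical.

```python
def split_rows(data):
    """Split clipboard text into rows, tolerating doubled newlines."""
    data = data.rstrip("\n")
    # Heuristic: if removing every "\n\n" pair leaves no newline behind,
    # the source used "\n\n" between rows (as seen with Excel on a Mac).
    if "\n\n" in data and "\n" not in data.replace("\n\n", ""):
        return data.split("\n\n")
    return data.split("\n")
```

The trade-off is that a selection with a genuinely empty row between every pair of rows would be collapsed, so this is a fallback rather than a replacement for knowing the source format.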

I can't insert tab stops into a docx generated from html

I have a very specific use case where I need to insert tab stops into Word documents. My code works perfectly when using a docx that was created normally. However, the other part of my use case is that I extract the html from a text editor and turn it into a docx. The problem is with these documents that were generated from html, for some reason when running the same code to insert tab stops it does not work. The tab stop configuration gets created but it is not applied to the document. I cannot seem to find a way around it and any help would be deeply appreciated.
Below is a code sample:
from docx import Document
from docx.shared import Inches
from docx.enum.text import WD_TAB_ALIGNMENT, WD_TAB_LEADER
from htmldocx import HtmlToDocx
new_parser = HtmlToDocx()
new_html = """<p><span>some text</span></p>
<br>
<p><span>Some persons name</span></p>
<p><span>Another text</span></p>
<p><span>Some date</span></p>"""
document = Document()
new_parser.add_html_to_document(new_html, document)
for para in document.paragraphs:
    tab_stops = para.paragraph_format.tab_stops
    tab_stops.add_tab_stop(Inches(5.51),
                           WD_TAB_ALIGNMENT.RIGHT, WD_TAB_LEADER.DOTS)
document.save('new-file-name.docx')
When running this code, the tab stop configuration gets created correctly in the docx, but it is not reflected in the document itself: the tab stops are not visible.
This function is supposed to run on Azure functions, so pywin32 is not an option to convert html to docx as it does not run on linux.
I have tried manually setting the styles of the document. I have tried using the api of convertapi, as well as using the library aspose.words but nothing seems to work. It seems that there is something about converting html to docx that precludes inserting tab stops.
Thank you very much in advance and any help is deeply appreciated.

How to extract given PDF to text and tables using python and store the data in .csv file?

I need to extract the first table account number, branch name, etc and last table date, description, and amount.
pdf file: https://drive.google.com/file/d/1b537hdTUMQwWSOJHRan6ckHBUDhRBbvX/view?usp=sharing
I get blank output using the PyPDF2 library.
camelot raises OSError: Ghostscript is not installed.
import PyPDF2
file_path =open(r"E:\user\programs\28_oct_bank_statement\demo.pdf", "rb")
pdf = PyPDF2.PdfFileReader(file_path)
pageObj = pdf.getPage(0)
print(pageObj.extractText())
import camelot
data = camelot.read_pdf(r"demo.pdf", pages='all')
print(data)
Camelot has dependencies that need to be installed in order to work, such as Ghostscript. You'll first need to check that it is installed correctly. For macOS/Ubuntu:
>>> from ctypes.util import find_library
>>> find_library("gs")
"libgs.so.9"
For Windows:
>>> import ctypes
>>> from ctypes.util import find_library
>>> find_library("".join(("gsdll", str(ctypes.sizeof(ctypes.c_voidp) * 8), ".dll")))
<name-of-ghostscript-library-on-windows>
Otherwise, download Ghostscript for Windows from https://ghostscript.com/. I highly suggest reading through the camelot documentation again if you run into more issues.
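The two checks above can be folded into one small script that picks the library name by platform. This is a sketch that only combines the calls already shown; the function names `ghostscript_library_name` and `find_ghostscript` are made up for illustration.

```python
import ctypes
import sys
from ctypes.util import find_library


def ghostscript_library_name():
    """Library name to look up: 'gs' on macOS/Linux, gsdll32/64.dll on Windows."""
    if sys.platform.startswith("win"):
        # Pointer size in bits tells whether this Python is 32- or 64-bit.
        bits = ctypes.sizeof(ctypes.c_void_p) * 8
        return "gsdll{}.dll".format(bits)
    return "gs"


def find_ghostscript():
    """Return what ctypes finds for Ghostscript, or None if nothing is found."""
    return find_library(ghostscript_library_name())


if __name__ == "__main__":
    found = find_ghostscript()
    print(found if found else "Ghostscript library not found")
```

A result of None means ctypes could not locate the library, which matches the situation in which camelot reports that Ghostscript is not installed.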
I usually use Apache Tika to do this.
You can simply install it and then use a Python script:
from tika import parser
parsed_pdf = parser.from_file("sample.pdf")
text = parsed_pdf['content']
metadata = parsed_pdf['metadata']
print(text)
Note that you do need Java installed on the machine for it to run. It will return the text, and once you have the text you can look for a pattern within it to extract the exact data required.
The nice part about this is that it also returns the metadata of the PDF.
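As an illustration of the pattern step, here is a minimal sketch that pulls date, description and amount out of extracted text with a regular expression and writes them as CSV. The sample text and the row layout are assumptions, since the real statement's layout is not known here; `SAMPLE_TEXT`, `ROW_RE`, `rows_from_text` and `rows_to_csv` are hypothetical names.

```python
import csv
import io
import re

# Hypothetical extracted text; the real PDF may be laid out differently.
SAMPLE_TEXT = """\
01/10/2020 ATM WITHDRAWAL 500.00
03/10/2020 SALARY CREDIT 25,000.00
"""

# Assumed row shape: a date, a free-text description, an amount at line end.
ROW_RE = re.compile(
    r"^(\d{2}/\d{2}/\d{4})\s+(.+?)\s+([\d,]+\.\d{2})$", re.MULTILINE
)


def rows_from_text(text):
    """Return a list of (date, description, amount) tuples found in text."""
    return [match.groups() for match in ROW_RE.finditer(text)]


def rows_to_csv(rows):
    """Serialise the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["date", "description", "amount"])
    writer.writerows(rows)
    return buffer.getvalue()
```

To store the result in a file, the same writer can be pointed at open("statement.csv", "w", newline="") instead of the in-memory buffer.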

Pasting python's tabulate output to Microsoft Office editors

Quite frequently I need to copy-paste small tables from an SQL editor into Microsoft Office programs (Outlook and OneNote) and I want it to look nice as I paste it. So I wrote a short script taking the data from the clipboard, processing it with Tabulate and returning it to the clipboard.
This works very well when I paste the new table into Notepad++ and other editors. It completely messes up when I paste into Outlook.
If I paste into Notepad++ and then copy paste from there, everything's fine.
I tried the different table formats, and I tried playing around with Outlook's editor options.
Would really appreciate any insights!
Thanks!
See code:
import win32clipboard
import pandas as pd
from tabulate import tabulate
df = pd.read_clipboard()
head = df.columns.tolist()
val = df.values
table = tabulate(val,headers=head,tablefmt="grid")
# set clipboard data
win32clipboard.OpenClipboard()
win32clipboard.EmptyClipboard()
win32clipboard.SetClipboardText(str(table))
win32clipboard.CloseClipboard()
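One possible explanation, offered as an assumption rather than a confirmed diagnosis: the grid format relies on a fixed-width font to line up, while Outlook and OneNote usually render pasted plain text in a proportional font. A route that does not depend on the font is to build an HTML table instead of a text grid. The sketch below does that with the standard library only; `html_table` is a hypothetical helper, not part of tabulate or pandas.

```python
from html import escape


def html_table(headers, rows):
    """Render headers and rows as a minimal HTML table string."""
    head = "".join("<th>{}</th>".format(escape(str(h))) for h in headers)
    body = "".join(
        "<tr>{}</tr>".format(
            "".join("<td>{}</td>".format(escape(str(cell))) for cell in row)
        )
        for row in rows
    )
    return '<table border="1"><tr>{}</tr>{}</table>'.format(head, body)
```

With the script above this would be called as html_table(head, val). The resulting string would still have to be placed on the clipboard in the HTML clipboard format rather than as plain text for Office to treat it as a table.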

Is there a simple way to write an ODT using Python?

My point is that using either pod (from appy framework, which is a pain to use for me) or the OpenOffice UNO bridge that seems soon to be deprecated, and that requires OOo.org to run while launching my script is not satisfactory at all.
Can anyone point me to a neat way to produce a simple yet clean ODT (tables are my priority) without having to code it myself all over again ?
edit: I'm giving a try to ODFpy that seems to do what I need, more on that later.
Your mileage with odfpy may vary. I didn't like it: I ended up using a template ODT, created in OpenOffice, opening its content.xml with zipfile and ElementTree, and updating that. (In your case, it would create only the relevant table-row and table-cell nodes.) Then I wrote everything back.
It is actually straightforward, except for making ElementTree work properly with the XML namespaces (this is badly documented). But it can be done. I don't have the example, sorry.
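A minimal sketch of that template approach, under stated assumptions: the template already contains at least one table, new rows go into the first one, and cell styling is ignored. The function names `append_row` and `update_odt` are made up for illustration; the namespace URIs are the standard OpenDocument ones.

```python
import xml.etree.ElementTree as ET
import zipfile

NS = {
    "office": "urn:oasis:names:tc:opendocument:xmlns:office:1.0",
    "table": "urn:oasis:names:tc:opendocument:xmlns:table:1.0",
    "text": "urn:oasis:names:tc:opendocument:xmlns:text:1.0",
}
# Register the prefixes so ElementTree writes table:..., not ns0:...
for prefix, uri in NS.items():
    ET.register_namespace(prefix, uri)


def qname(prefix, tag):
    """Build the '{uri}tag' form ElementTree uses for namespaced names."""
    return "{%s}%s" % (NS[prefix], tag)


def append_row(content_xml, cells):
    """Append one row of text cells to the first table in content.xml."""
    root = ET.fromstring(content_xml)
    table = root.find(".//table:table", NS)
    if table is None:
        raise ValueError("template contains no table")
    row = ET.SubElement(table, qname("table", "table-row"))
    for value in cells:
        cell = ET.SubElement(row, qname("table", "table-cell"))
        ET.SubElement(cell, qname("text", "p")).text = str(value)
    return ET.tostring(root, encoding="utf-8")


def update_odt(template_path, output_path, cells):
    """Copy every entry of the template ODT, replacing content.xml."""
    with zipfile.ZipFile(template_path) as src, \
            zipfile.ZipFile(output_path, "w") as dst:
        for item in src.infolist():
            data = src.read(item.filename)
            if item.filename == "content.xml":
                data = append_row(data, cells)
            # Passing the ZipInfo keeps each entry's original compression.
            dst.writestr(item, data)
```

A real template declares many more namespaces than the three registered here; ElementTree will still write the file, but unregistered prefixes come out as ns0, ns1 and so on, which is the namespace difficulty mentioned above.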
To edit odt files, my answer may not help, but if you want to create new odt files, you can use QTextDocument, QTextCursor and QTextDocumentWriter in PyQt4. A simple example to show how to write to an odt file:
>>> from PyQt4 import QtGui
# Create a document object
>>> doc = QtGui.QTextDocument()
# Create a cursor pointing to the beginning of the document
>>> cursor = QtGui.QTextCursor(doc)
# Insert some text
>>> cursor.insertText('Hello world')
# Create a writer to save the document
>>> writer = QtGui.QTextDocumentWriter()
>>> writer.supportedDocumentFormats()
[PyQt4.QtCore.QByteArray(b'HTML'), PyQt4.QtCore.QByteArray(b'ODF'), PyQt4.QtCore.QByteArray(b'plaintext')]
>>> odf_format = writer.supportedDocumentFormats()[1]
>>> writer.setFormat(odf_format)
>>> writer.setFileName('hello_world.odt')
>>> writer.write(doc)  # Returns True if successful
True
QTextCursor can also insert tables, frames, blocks and images. More information at:
http://qt-project.org/doc/qt-4.8/qtextcursor.html
As a bonus, you can also print to a PDF file by using QPrinter.
