Extracting data from txt files - python

OK, I'm using this Git repository from Git Bash. After I run it, I have the .txt files of the Securities and Exchange Commission database (EDGAR) in this format on my hard drive. I am using Windows 7. The txt files have HTML tags inside.
Since the text files have followed this strict format set by the SEC since the early nineties, I was wondering whether there is a way to extract a certain item, let's say
<us-gaap:IncomeTaxExpenseBenefit contextRef="eol_PE9523----1310-K0013_STD_365_20131231_0"
decimals="-3" id="id_3914012_7F3BEF88-8CD1-49E7-8A78-91A091178D1B_1_13"
unitRef="iso4217_USD">40315000</us-gaap:IncomeTaxExpenseBenefit>
Can this be done accurately, whether with a script or an existing Git repository, given that the format is strict? How, for instance, can someone extract a whole table from the txt file? Libraries, repositories, scripts - anything that can be picked up with a little work and modification would be fine as a start.
Can any of these repositories get in and do such a job? I read the instructions (whenever there are any), but I don't understand a lot of it.

It's not HTML. It looks like XML - try using an XML parser for Python, for example ElementTree, and parsing out the relevant information. The tutorial is included on their page.
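For example, assuming the filing has been saved as a standalone, well-formed XML file (the filename below is just a placeholder), a minimal ElementTree sketch could look like this:
import xml.etree.ElementTree as ET

tree = ET.parse("filing.xml")  # placeholder filename
root = tree.getroot()

# Tags are namespace-qualified, e.g. {http://fasb.org/us-gaap/...}IncomeTaxExpenseBenefit,
# and the namespace URI changes between taxonomy years, so match on the local name instead.
for elem in root.iter():
    if elem.tag.endswith("}IncomeTaxExpenseBenefit"):
        print(elem.get("contextRef"), elem.get("unitRef"), elem.text)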

Related

Creating a kindle dictionary

I am trying to create a Kindle dictionary that can be used for offline lookup. I already have the words and their inflections, but turning this into a working dictionary is difficult.
There is some documentation about this provided by Amazon. It basically says that you should:
Create an XHTML file with their special markup specifying all inflections etc.
Turn it into an epub
Open it with Kindle Previewer
Export it with Kindle Previewer to MOBI
So I created a large XHTML file (23 MB or so) according to the Amazon specifications and opened it in Kindle Previewer, and it looked fine. However, Kindle Previewer does not let you export XHTML files to MOBI. They want you to create an intermediate epub file.
I tried using Pandoc to do the conversion, which did not work because it stripped out all the specific HTML tags and only left in paragraphs. Then I tried using calibre. The normal XHTML -> epub conversion failed because the XHTML file was too large, according to an error message. Calibre suggests to turn on the "heuristic mode" if you run into this error, which I tried, but which did not finish running after hours of runtime.
Then I attempted to create the epub file myself, using a sample file taken from this tutorial. I discovered that this is not trivial, and a check using epubcheck revealed many hard-to-understand errors in my generated file. The generation of the epub file is also a bit complicated by the fact that you probably need to split the XHTML files into many smaller files, which should maybe be 250 kb in size, because e-readers tend to struggle with parsing larger files.
So I thought there should maybe be an easier way to do this, or maybe a library that helps with this. Maybe it would even be a good idea to output the words and inflections into some other, easier dictionary format and then convert that to a MOBI using an existing library, leaving out the XHTML generation completely. Currently I am using Python, but I'd also use other languages if necessary. What could I try?
Edit: To add to the things I have tried: there is an apparently closed-source script here that unfortunately doesn't support inflections, so it does not work. And there are instructions here that advise converting the file to PRC using Mobipocket Creator and then opening it with Kindle Previewer. The problem with this approach is that Kindle Previewer throws the error:
Kindle Previewer does not support this file, which has either been created using an older version of KindleGen or a third party application. We recommend using EPUB or DOCX format directly for previewing and publishing your book on Kindle.
There are also more detailed instructions for Mobipocket Creator here, which tell you to directly move the generated .prc file onto the kindle. I tried that but it is not being recognized as a dictionary.
I figured it out. First I implemented a solution myself, then I found the pyglossary library (right now the code below only works with the version from GitHub, not the one from pip) and used it like this:
from pyglossary.glossary import Glossary

Glossary.init()

glos = Glossary()
defiFormat = "h"  # entries are HTML

base_forms = get_base_forms()
for canonical_form in base_forms:
    inflections = get_inflections(canonical_form)
    definitions = get_definition(canonical_form)
    definitionhtml = ""
    for definition in definitions:
        definitionhtml += "<p>" + definition + "</p>"
    all_forms = [canonical_form]
    all_forms.extend(inflections)
    glos.addEntryObj(glos.newEntry(all_forms, definitionhtml, defiFormat))

glos.setInfo("title", "Russian-English Dictionary")
glos.setInfo("author", "Vuizur")
glos.sourceLangName = "Russian"
glos.targetLangName = "English"
glos.write("test.mobi", format="Mobi", keep=True, kindlegen_path="path/to/kindlegen.exe")

How to convert XML Word Documents to DOCX?

I have been given a series of folders with large amounts of Word documents in .xml formatting. They each contain some VBA code, but the code on all of them has already been run, so I don't need to keep this.
I need to print all of the files in each folder, but due to constraints on XML files on the network, I can't simply mass-print them from Windows Explorer, so I need to convert them to .docx (or .doc) first.
How can I go about doing this? I tried a simple Python script using python-docx:
import os
from docx import Document

folderPath = <folderpath>
fileNamesList = os.listdir(folderPath)
for xmlFileName in fileNamesList:
    currentDoc = Document(os.path.join(folderPath, xmlFileName))
    docxFileName = xmlFileName.replace('.xml', '.docx')
    currentDoc.save(os.path.join(folderPath, docxFileName))
    currentDoc.close()
This gives:
docx.opc.exceptions.PackageNotFoundError: Package not found at <first file name>.xml
I'm guessing this is because python-docx isn't meant to open .xml files, but that's a pretty uneducated guess. Searching around for this error, all I can find are problems with it not being installed properly (which it is, as far as I can tell) or using .doc files instead of .docx.
Am I simply using python-docx incorrectly? If not, is there a more suitable package or technique I should be using?
It's not clear just what sort of files those .xml files are, but I suspect they are in the transitional format used, I think, in Word 2003, which was XML-based but not the Open Packaging Convention (OPC) format used for Word documents since Word 2007.
python-docx is not going to read those, now or ever, so you'll either need to convert them to .docx format or parse the XML directly.
If I had Windows available, I suppose I would use VBA to write a short conversion script and then work with the .docx files using python-docx. I would start by seeing if Word could load the .xml file and go from there.
You might be able to find a utility to do a bulk conversion, but I didn't find any on a quick search.
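If you'd rather not hunt for a utility, the conversion can also be scripted from Python by driving Word through COM automation with pywin32. This is only a sketch, assuming Word is installed and can actually open these .xml files; the folder path is a placeholder:
import os
import win32com.client

folderPath = r"C:\path\to\folder"  # placeholder path
word = win32com.client.Dispatch("Word.Application")
word.Visible = False
try:
    for name in os.listdir(folderPath):
        if not name.lower().endswith(".xml"):
            continue
        # Let Word open the 2003-era XML file, then save it back out as .docx.
        doc = word.Documents.Open(os.path.join(folderPath, name))
        docxPath = os.path.join(folderPath, name[:-4] + ".docx")
        doc.SaveAs2(docxPath, 16)  # 16 = wdFormatDocumentDefault (.docx)
        doc.Close(False)
finally:
    word.Quit()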
If all you're interested in is a one-time print, and Word will load the files, then a VBA script for that without the conversion step might be a good option. python-docx doesn't print .docx files; it only reads and writes them.

HTML to RTF string using Python

I am looking for a way to convert HTML text to an RTF string. Are there any libraries that do this job? I get HTML content dynamically in my project and need it rendered in RTF format. I am using an HTML parser to convert the HTML text to a plain string and have been trying to use PyRTF for the conversion to RTF. Is there a better way this can be done? Thanks in advance.
RTF seems a dicey format to convert from/to. I've tried cutting and pasting among applications on Mac OS X, for example, where RTF is something of a lingua franca. Some of those apps are Microsoft apps (relevant in that RTF is a Microsoft-developed format), others are not. Even basic formatting information like font size, font face, line spacing, and list styling (ordered or unordered) is jumbled when copying from one ostensibly RTF-speaking app to another. Simply put, it's a mess.
I have searched for ways to programmatically read, write, and transform RTF, preferably from Python. I found a number of packages on PyPI, but trying them out has been a disappointing experience. They would support RTF 1.5, say, when the current version is 1.9.1. RTF has been around a long time, but a 2005-vintage spec is not very recent. There were lots of gotchas and incompatibilities. LOTS.
Now, I'm not saying it's impossible, or that there aren't other libraries out there that would do the trick. I have not tried the zopyx.convert mentioned by others here, for example. Maybe it's great. But looking at its dependencies--Java, FOP, etc.--it looks like a pretty complex (and thus likely fragile) toolchain. I read its code on github, and the Python is really only there as a coordination veneer. It organizes external tools XFC, XINC, FOP, and PrinceXML--three of the four of which are commercial software. That includes the key XFC part that deals with RTF. Color me skeptical.
There are two converters I've found that are worth a look. If you're using a Mac, the textutil command-line program is actually one of the better and simpler tools I've seen:
textutil -convert rtf filename.html -output filename.rtf
The other formatting engine that's worth considering is LibreOffice. It's free, open source, reasonably amenable to automation, and a decent foundation as an interoperability hub. That's not just a guess; I've built complex, multi-format document workflows around it.
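As a concrete illustration of the LibreOffice route, the conversion can be driven from Python via its headless mode. A minimal sketch, assuming the soffice binary is on your PATH and the filename is just a placeholder:
import subprocess

# Convert input.html to input.rtf in the current directory using LibreOffice.
subprocess.run(
    ["soffice", "--headless", "--convert-to", "rtf", "input.html", "--outdir", "."],
    check=True,
)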
I would question why you're trying to get into RTF. That seems like a document format you'd be trying to escape from. But if you need to go there, textutil and LibreOffice are the least-worst mechanisms I've found.
There is a wonderful Python library that comes as a tarball.
You can download it at https://pypi.python.org/pypi/zopyx.convert2/2.4.5.
Good luck!
I see this question is over a year old, but figured I'd contribute anyway. I recently had a similar requirement, and turned to PyRTF, a small but powerful Python module that can construct RTF documents from a text file. You could use Beautiful Soup to scrape the HTML, going down the parse tree tag by tag, and use the PyRTF API to construct appropriate objects (table, cell, paragraph, section or document).
The API itself is quite granular, and allows for a whole bunch of custom formatting (font text, alignment, color, headers, footers etc.)
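For a rough idea of the shape of that approach, here is an untested sketch that handles only <p> tags; it assumes the classic PyRTF API (Document, Section, Paragraph, Renderer) and that the HTML is already available as a string:
from bs4 import BeautifulSoup
from PyRTF import Document, Section, Paragraph, Renderer

html = "<p>First paragraph.</p><p>Second paragraph.</p>"  # placeholder input
soup = BeautifulSoup(html, "html.parser")

doc = Document()
section = Section()
doc.Sections.append(section)

# Walk the parse tree and emit one RTF paragraph per <p> tag.
for tag in soup.find_all("p"):
    para = Paragraph(doc.StyleSheet.ParagraphStyles.Normal)
    para.append(tag.get_text())
    section.append(para)

with open("output.rtf", "w") as f:
    Renderer().Write(doc, f)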
Hope this helps.

How do I write a python script designed to analyse different XML schemas and convert XML files?

I work a lot with a certain type of XML document. Every few months a new schema for this XML document is released. Often a new schema will add or remove fields from the XML documents.
Currently I have to convert at least some of my XML documents from the old version to the new version. I do this by hand, but obviously that is not ideal.
I am relatively decent at writing Python scripts. However, I have very little experience with working with XML using Python.
I was wondering how difficult it would be to write a generic Python script that analysed the new and old schemas for these XML documents, identified the changes (which fields were removed and which ones were added), and converted old document versions to the new version.
If this is possible and not too difficult for someone with somewhere between beginner and intermediate Python skills, could someone direct me to where I should look to get started?
I just read the XML chapter of "Dive Into Python." It seems like the lxml library is the one that I would be looking to use? Is it very powerful when working with schemas?
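To make the idea more concrete, I imagine the schema-comparison step would look roughly like this untested lxml sketch, which just diffs the names of the xs:element declarations in two schema files (the filenames are placeholders):
from lxml import etree

XS = "{http://www.w3.org/2001/XMLSchema}"

def element_names(xsd_path):
    # Collect the name attribute of every xs:element declared in the schema.
    root = etree.parse(xsd_path).getroot()
    return {el.get("name") for el in root.iter(XS + "element") if el.get("name")}

old_fields = element_names("old_schema.xsd")
new_fields = element_names("new_schema.xsd")
print("removed:", sorted(old_fields - new_fields))
print("added:", sorted(new_fields - old_fields))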
Where should I go from here?

Convert CVS/SVN to a Programming Snippets Site

I use CVS to maintain all my Python snippets, notes, and C and C++ code. As the hosting provider also provides a public web server, I was thinking that I should automatically convert the CVS repository into a programming snippets website.
cvsweb is not what I mean.
doxygen is for a complete project and for browsing self-referencing code online. I think doxygen is more like a web-based ctags.
I tried rest2web, but it requires that I write rest2web headers and keep the files as .txt files, which would interfere with the programming language syntax.
An approach I have thought of is:
1) Run source-highlight and create .html pages for all the scripts.
2) Write a script to index those .html pages and create a web page.
3) Create the website from those pages.
Before proceeding, I thought I would discuss it here, in case members have any suggestions.
What do you do when you want to maintain your snippets and notes in CVS and also auto-generate a good website from them? I like rest2web for converting notes to HTML.
Run Trac on the server linked to the (svn) repository. The Trac wiki can conveniently refer to files and changesets. You get TODO tickets, too.
enscript or pygmentize (part of pygments) can be used to convert code to HTML. You can use a custom header or footer to link to the actual code for download.
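For instance, what pygmentize does on the command line can also be scripted with the pygments API; a small sketch, assuming you want a standalone HTML page per snippet (the filename is a placeholder):
from pygments import highlight
from pygments.lexers import get_lexer_for_filename
from pygments.formatters import HtmlFormatter

def snippet_to_html(path):
    # Pick a lexer from the file extension and render a standalone HTML page.
    with open(path) as f:
        code = f.read()
    return highlight(code, get_lexer_for_filename(path), HtmlFormatter(full=True, linenos=True))

with open("example.py.html", "w") as out:
    out.write(snippet_to_html("example.py"))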
I finally settled on rest2web. I had to do the following:
Use a separate Python script to recursively copy the files in CVS to a separate directory.
Add extra index.txt and template.txt files to all the directories that I wanted to appear on the website.
The best thing about rest2web is that it supports Python scripting within template.txt, so I just ran a loop over the contents and indexed them on the page.
There is still a lot more to do to automate the entire process, e.g. inline viewing of programs and colorization, which I think can be done with some more experimentation.
I have the completed website here; it is called uthcode.
