I'm being faced with the task of generating statistics about the history of a Git project, and I need to produce some specific numbers and representations for various metrics - things like commits per author, commits-over-time/date histograms, that sort of thing.
The trouble is that I need all this data generated in a format that can be dealt with via a script or similar - the output has to be text, and if I can get the numbers into a Python (or similar) script, so much the better.
My question is this: are there any existing frameworks or projects that will provide such an interface? I've seen GitStats, and it does a lot of what I want, but then it dumps the results into an HTML structure instead of just providing textual or programmatic representations back to me. Are there (for example) Python bindings for a Git log parser, or even a Git statistics generator that returns a big text dump of data?
I realize it's a very specific need, and I'm willing to do some serious coding to get the precise format I want, but I'd like to think there's a starting point out there somewhere. Ideas?
How about using XML logs instead? Then you can parse the XML in Python relatively easily and build your stats.
See this answer for how to get an XML log from git.
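For example, a rough sketch along those lines (the element names in the --pretty format are arbitrary, not anything git defines, and names containing & or < would need escaping in a real script):

import subprocess
import xml.etree.ElementTree as ET
from collections import Counter

# Ask git for an XML-ish log using a custom pretty format, then parse it.
fmt = "<commit><author>%an</author><date>%ai</date></commit>"
raw = subprocess.check_output(["git", "log", "--pretty=format:" + fmt], text=True)

# Wrap the per-commit fragments in a single root element so ElementTree accepts them.
root = ET.fromstring("<log>" + raw + "</log>")
commits_per_author = Counter(c.findtext("author") for c in root.iter("commit"))
print(commits_per_author.most_common())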
Somehow, despite the broken documentation for Arelle's Python API as of this date, I managed to make the API work and successfully load an XBRL file.
Anyway, my question is:
How do I extract only the STATEMENTS from the XBRL file?
Below is a screenshot from Arelle's Windows App.
URL used in this example: https://www.sec.gov/Archives/edgar/data/101984/000010198416000062/ueic-20151231.xml
I tried experimenting with the API and here's my code
from arelle import Cntlr

xbrl = Cntlr.Cntlr().modelManager.load('https://www.sec.gov/Archives/edgar/data/101984/000010198416000062/ueic-20151231.xml')
for fact in xbrl.facts:
    print(fact)
but after executing this snippet, I'm bombarded with these:
I tried getting the keys available per modelFact, and it's a mixture of contextRef, id, decimals and unitRef, which is not helpful for what I want to extract. With no documentation to help further with this, I'm at a loss here. Can someone enlighten me on how to extract only the statements?
I am doing something similar and have so far made some progress, which I can share:
Going through Arelle's Python source files, you can see which properties you can access on the different classes, such as ModelFact, ModelContext, ModelUnit etc.
To extract the individual data, you can, for example, put them in a pandas DataFrame as follows:
import pandas as pd

factData = pd.DataFrame(data=[(fact.concept.qname,
                               fact.value,
                               fact.isNumeric,
                               fact.contextID,
                               fact.context.isStartEndPeriod,
                               fact.context.isInstantPeriod,
                               fact.context.isForeverPeriod,
                               fact.context.startDatetime,
                               fact.context.endDatetime,
                               fact.unitID) for fact in xbrl.facts])
Now it is easier to work with all the data, filter the facts you want to use, etc. If you want to reproduce the statement tables, you will also need to incorporate the links for each of the facts and then order and sort them, but I haven't gotten that far yet either.
I'm looking to generate PDFs from a Python application.
They start relatively simple, but some may become more complex (essentially letter-like documents, but they will, for example, include watermarks later).
I've worked in raw PostScript before, and provided I can generate the correct headers etc. and a valid file at the end of it, I want to avoid complex libraries that may not do entirely what I want. Some seem to have suffered bitrot and are no longer supported (pypdf and pypdf2), especially when I know PDF/PostScript can do exactly what I need. PDF content really isn't that complex.
I can generate EPS (Encapsulated PostScript) fine by just writing the appropriate text headers and my PostScript code to a file. But inspecting PDFs, there is a little binary header I'm not sure how to generate.
I could generate an EPS and convert it. I'm not overly happy with this, as the production environment is a Windows 2008 server (dev is Ubuntu 12.04), and generating one format only to convert it seems very silly.
Has anyone done this before?
Am I being pedantic by not wanting to use a library?
borrowed from ask.yahoo
A PDF file starts with "%PDF-1.1" if it is a version 1.1 type of PDF file. You can read PDF files ok when they don't have binary data objects stored in them, and you could even make one using Notepad if you didn't need to store a binary object like a Paint bitmap in it.
But after seeing the "%PDF-1.1" you ignore what's after that (Adobe Reader does, too) and go straight to the end of the file to where there is a line that says "%%EOF". That's always the last thing in the file; and if that's there you know that just a few characters before that place in the file there's the word "startxref" followed by a number. This number tells a reader program where to look in the file to find the start of the list of items describing the structure of the file. These items in the list can be page objects, dictionary objects, or stream objects (like the binary data of a bitmap), and each one has "obj" and "endobj" marking out where its description starts and ends.
For fairly simple PDF files, you might be able to type the text in just like you did with Notepad to make a working PDF file that Adobe Reader and other PDF viewer programs could read and display correctly.
Doing something like this is a challenge, even for a simple file, and you'd really have to know what you're doing to get any binary data into the file where it's supposed to go; but for character data, you'd just be able to type it in. And all of the commands used in the PDF are in the form of strings that you could type in. The hardest part is calculating those numbers that give the file offsets for items in the file (such as the number following "startxref").
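As a rough illustration of that layout, a small Python script can assemble such a file and compute those offset numbers for you. This is a hedged, minimal sketch of my own (the object layout and text are just an example, not anything the text above prescribes beyond the header, xref, trailer, startxref and %%EOF conventions):

# Hand-assemble a one-page PDF with a single "Hello" text object.
# The optional binary comment line after the header only hints that the file
# may contain binary data; it isn't required for a text-only file like this.
stream = b"BT /F1 24 Tf 72 720 Td (Hello) Tj ET"
objects = [
    b"<< /Type /Catalog /Pages 2 0 R >>",
    b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
    b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] "
    b"/Contents 4 0 R /Resources << /Font << /F1 5 0 R >> >> >>",
    b"<< /Length %d >>\nstream\n" % len(stream) + stream + b"\nendstream",
    b"<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>",
]

out = bytearray(b"%PDF-1.1\n")
offsets = []
for i, body in enumerate(objects, start=1):
    offsets.append(len(out))                      # byte offset of this object
    out += b"%d 0 obj\n" % i + body + b"\nendobj\n"

xref_pos = len(out)                               # where the xref table starts
out += b"xref\n0 %d\n" % (len(objects) + 1)
out += b"0000000000 65535 f \n"                   # xref entries are exactly 20 bytes
for off in offsets:
    out += b"%010d 00000 n \n" % off
out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
out += b"startxref\n%d\n%%%%EOF\n" % xref_pos     # %%EOF is always the last thing

with open("hello.pdf", "wb") as f:
    f.write(out)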
If the way the file format is laid out intrigues you, go ahead and read the PDF manual, which tells the whole story.
http://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/PDF32000_2008.pdf
but really you should probably just use a library
Thanks to @LukasGraf for providing this link http://www.gnupdf.org/Introduction_to_PDF that shows how to create a simple hello world PDF from scratch.
As long as you're working in Python 2.7, Reportlab seems to be the best solution out there at the moment. It's quite full-featured, and can be a little complex to work with, depending on exactly what you're doing with it, but since you seem to be familiar with PDF internals in general hopefully the learning curve won't be too steep.
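For what it's worth, here is a hedged sketch of the letter-plus-watermark kind of document mentioned above, using ReportLab's canvas API (the file name and text are placeholders, not anything from the question):

from reportlab.lib.pagesizes import letter
from reportlab.pdfgen import canvas

c = canvas.Canvas("letter_example.pdf", pagesize=letter)
width, height = letter

# Faint diagonal watermark drawn first, so the body text sits on top of it.
c.saveState()
c.setFont("Helvetica-Bold", 60)
c.setFillColorRGB(0.85, 0.85, 0.85)
c.translate(width / 2, height / 2)
c.rotate(45)
c.drawCentredString(0, 0, "DRAFT")
c.restoreState()

# Simple letter-like body text, positioned from the top-left margin.
c.setFont("Helvetica", 12)
c.drawString(72, height - 72, "Dear Sir or Madam,")
c.drawString(72, height - 100, "This is a simple letter generated with ReportLab.")

c.showPage()
c.save()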
I recommend using a library. I spent a lot of time creating pdfme and learned a lot of things along the way, but it's not something you would do for a single project. If you want to use my library, check the docs here.
I'm working on a script that continuously analyzes data and outputs results in a multi-threaded way. Basically, the result file (an XML file) is constantly being updated/modified (sometimes 2-3 times per second).
I'm currently using lxml to parse/modify/update the XML file, which works fine right now. But from what I can tell, you have to rewrite the whole XML file even when you just add one entry/sub-entry like <weather content="sunny" /> somewhere in the file. The XML file is gradually growing bigger, and so is the overhead.
As far as efficiency/resources are concerned, is there any other way to update/modify the XML file? Or will you have to switch to an SQL database or something similar some day, when the XML file is too big to parse/modify/update?
No, you generally cannot - and not just for XML files, but for any file format.
You can only update "in place" if you overwrite bytes exactly (i.e. don't add or remove any characters, just replace some with something of the same byte length).
Using a form of database sounds like a good option.
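For instance, a minimal sketch of what that could look like with SQLite (table and column names are made up): appending one new observation becomes a single INSERT instead of rewriting a growing file.

import sqlite3

conn = sqlite3.connect("results.db")
conn.execute("CREATE TABLE IF NOT EXISTS weather (ts TEXT, content TEXT)")
# Each new result is appended in place; nothing else on disk is rewritten.
conn.execute("INSERT INTO weather VALUES (datetime('now'), ?)", ("sunny",))
conn.commit()
conn.close()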
It certainly sounds like you need some sort of database; as Li-anung Yip states, this would take care of all kinds of nasty multi-threaded sync issues.
You stated that your data is gradually increasing. How is it being consumed? Are clients forced to download the entire result file each time?
I don't know your use case, but perhaps you could consider using an Atom feed to distribute your data changes? Providing support for AtomPub would also effectively REST-enable your data. It's still XML, but in a standards-compliant format that is easy to consume and poll for changes.
I am writing a server-side process in Python that takes XML from a directory and puts it into a database. The XML that is put in the directory is generated from forms that are filled out on remote laptops and sent to the server via HTTP. When we add fields to the form, it adds tags to the XML, which allows for situations where one XML file will have more or fewer tags than another. How can I make my server-side script robust enough to handle these scenarios?
I would do something like what is described here: https://stackoverflow.com/questions/9845943/how-to-convert-xml-data-in-to-sqlite-database/9879617#9879617
There are different ways you can apply the logic in the for loop depending on any patterns in the XML, but the idea is the same. This should then let you handle the query much more smoothly depending on which values exist.
Make sure you look at http://lxml.de/tutorial.html - there are lots of great tips on using lxml.
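As a rough sketch of that idea (the tag names, table and columns are invented for illustration): read whatever child tags a submission happens to have into a dict, so extra or missing fields don't break the insert.

import sqlite3
from lxml import etree

tree = etree.parse("form_submission.xml")
record = {child.tag: child.text for child in tree.getroot()}

conn = sqlite3.connect("forms.db")
conn.execute("CREATE TABLE IF NOT EXISTS submissions (name TEXT, email TEXT, phone TEXT)")
# Missing tags simply come through as NULL; unknown extra tags are ignored.
conn.execute(
    "INSERT INTO submissions (name, email, phone) VALUES (:name, :email, :phone)",
    {"name": record.get("name"), "email": record.get("email"), "phone": record.get("phone")},
)
conn.commit()
conn.close()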
A mini example may get you started:
from xml.dom.minidom import parseString

doc = parseString('<one><two>three</two></one>')
for twoElement in doc.getElementsByTagName('two'):
    print(twoElement.firstChild.data)
Maybe you should have a look at the minidom documentation or ask further questions here. But with getElementsByTagName() you can find all elements below a given node. Of course you can be more specific than searching from doc.
I'm working on a project that is going to extract specified text from a PDF document. I have no experience with this type of extraction. One issue is that we don't just want a dump of all the text in the document. Rather, is there a way to extract only certain fields in the PDF? Is there a notion of PDF templates that could be used for something like this?
I'm trying to use Apple's Automator - it is able to get all the text but not specified text. Ideally, I would like someone in Pages to have, for example, 30 discrete rows of text, have 20 of those rows specified as 'catalog item', and have our Automator script take ONLY those twenty lines.
Any ideas on the best workflow / extraction tools for this? I would prefer that only consumer-level items be used, such as Apple Pages, Automator, and Ruby or Python as a scripting language.
thx
edit #1
looks like tagged PDFs might be one way to do this - not sure how well this is supported in Apple Pages
With Python, the best choice would probably be PDFMiner. It can extract the coordinates of every text string, so you can work out the rectangles of your form on your own and pick out whatever falls within them. It's all pretty low-level, but PDF is unfortunately a pretty low-level format.
Be warned that unless you already know a lot about the structure of PDF, you'll find the API and documentation rather scanty. Look around for usage examples, including here on SO.
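Here is a hedged sketch of that approach using pdfminer.six, the maintained fork of PDFMiner (the file name and the field rectangle coordinates are placeholders you would determine for your own form):

from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer

# Rectangle (in PDF points, origin at bottom-left) where the wanted field lives.
FIELD_BOX = (50, 600, 300, 650)   # x0, y0, x1, y1

for page_layout in extract_pages("form.pdf"):
    for element in page_layout:
        if isinstance(element, LTTextContainer):
            x0, y0, x1, y1 = element.bbox
            # Keep only text whose box starts inside the field rectangle.
            if FIELD_BOX[0] <= x0 <= FIELD_BOX[2] and FIELD_BOX[1] <= y0 <= FIELD_BOX[3]:
                print(element.get_text().strip())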
For Ruby you might try pdf-reader for parsing a PDF and accessing both metadata and content. Extracting the specific items you're interested in is another story, but how to go about doing that depends highly on what format of data you're expecting.
You can use Origami in Ruby, a framework designed to parse, analyze, and forge PDF documents, or the Python equivalent: Origapy, a simple Python interface for the Ruby-based Origami.