I am using Python 3 to calculate some statistics from language corpora. Until now I was exporting the results to a CSV file or printing them directly to the shell. A few days ago I started learning how to output the data to HTML tables. I must say I really like it: it handles cell height/width and Unicode nicely, and you can apply color to different values, although I think there are some problems when dealing with large data or tables.
Anyway, my question is: I'm not sure whether I should continue in this direction and output the results to HTML. Can someone with experience in this field help me with some pros and cons of using HTML as output?
Why not do both? Make your data available as CSV (for simple export to scripts etc.) and provide a decorated HTML version.
At some stage you may want (say) a proper Excel sheet, a PDF etc. So I would enforce a separation of the data generation from the rendering. Make your generator return a structure that can be consumed by an abstract renderer, and your concrete implementations would present CSV, PDF, HTML etc.
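For what it's worth, a minimal sketch of that separation in Python (the class and function names here are purely illustrative):

```python
import csv
import html
import io


class TableData:
    """Plain structure returned by the generator: headers plus rows."""
    def __init__(self, headers, rows):
        self.headers = headers
        self.rows = rows


def render_csv(table):
    """Concrete renderer: CSV text."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(table.headers)
    writer.writerows(table.rows)
    return buf.getvalue()


def render_html(table):
    """Concrete renderer: a plain HTML table."""
    head = "".join(f"<th>{html.escape(str(h))}</th>" for h in table.headers)
    body = "".join(
        "<tr>" + "".join(f"<td>{html.escape(str(c))}</td>" for c in row) + "</tr>"
        for row in table.rows
    )
    return f"<table><thead><tr>{head}</tr></thead><tbody>{body}</tbody></table>"


data = TableData(["token", "frequency"], [("the", 1204), ("of", 857)])
print(render_csv(data))
print(render_html(data))
```

Adding an Excel or PDF renderer later is then just another render_* function over the same structure.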
The question lists some benefits of the HTML format. These alone are sufficient for using it as one of your output formats. Used that way, it does not really matter much what you cannot easily do with HTML, as you can use other formats as needed.
Benefits include reasonable default rendering, which can be fine-tuned in many ways using CSS, possibly with alternate style sheets (now supported even by IE). You can also include links.
What you cannot do in HTML without scripting is computation, sorting, reordering, that kind of stuff. But they can be added with JavaScript – not trivial, but doable.
There’s a technical difficulty with large tables: by default, a browser will start showing the table’s content only after it has received, parsed, and processed the entire table. This may cause a delay of several seconds. A way to deal with this is to use fixed layout (table-layout: fixed) with specific widths set on the table columns (they need not be fixed in physical units; the em unit works fine, and on modern browsers you can use ch too).
Another difficulty is bad line breaks. It’s easily fixable with CSS (or HTML), but authors often miss the issue, which causes e.g. cell contents like “10 m” to be split across two lines.
Other common problems with formatting statistical data in HTML include:
Not aligning numeric fields to the right.
Using serif fonts.
Using fonts where not all digits have equal width.
Using the barely noticeable hyphen “-” instead of the proper Unicode minus “−” (U+2212).
Not indicating missing values in some reasonable way, leaving some cells empty. (Browsers may treat empty cells in odd ways.)
Insufficient horizontal padding, making cell contents (almost) hit cell border or cell background edge.
There are good and fairly easy solutions to such problems, so this is just something to be noted when using HTML as output format, not an argument against it.
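As a rough illustration only, here is one way to bake several of those fixes into the generated HTML from Python: fixed table layout, right-aligned numeric cells with tabular digits, horizontal padding, a real minus sign, and an explicit marker for missing values. The class names and column widths are just examples:

```python
import html

CSS = """
table.stats { table-layout: fixed; border-collapse: collapse; }
table.stats col.num { width: 6em; }
table.stats th, table.stats td { padding: 0.2em 0.6em; }
table.stats td.num { text-align: right; font-variant-numeric: tabular-nums; }
"""


def fmt(value):
    """Format a numeric cell: real minus sign, en dash for missing values."""
    if value is None:
        return "\u2013"                       # avoid leaving the cell empty
    return str(value).replace("-", "\u2212")  # U+2212 MINUS SIGN


def stats_table(headers, rows):
    cols = "".join('<col class="num">' for _ in headers)
    head = "".join(f"<th>{html.escape(h)}</th>" for h in headers)
    body = "".join(
        "<tr>" + "".join(f'<td class="num">{fmt(v)}</td>' for v in row) + "</tr>"
        for row in rows
    )
    return (f"<style>{CSS}</style>"
            f'<table class="stats"><colgroup>{cols}</colgroup>'
            f"<thead><tr>{head}</tr></thead><tbody>{body}</tbody></table>")


print(stats_table(["freq", "delta"], [(1204, -3.5), (857, None)]))
```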
Related
I'd like to replace assembly instructions within the code (.text) section of a PE image with semantically equivalent instructions that are of the same length. An example would be replacing an "add 5" with a "sub -5", though certainly I have much more elaborate plans than this.
I've found that LIEF is great for working with higher-level features in a PE file and can give you a dump of the code section data. The same seems to apply for the pefile library. They don't seem to be great tools for manipulating individual instructions though, unless I'm missing something.
The goal would be to perform some initial disassembly and loop through the instructions in order to locate interesting or desirable ones (e.g., the add instruction from above). Then I'd prefer not to have to figure out all of the opcodes and generate bytes by hand. If possible, I'd be looking for something more developer-friendly that computes it for you (e.g., this.instr = 'add' and this.op1 = '5'). Ideally it wouldn't try to re-assemble the entire section after the change is made... this could cause other differences to occur, and my use case requires that this individual instruction be the only change present from a bit-level perspective. Again, the idea is that I'd only be selecting semantically equivalent instructions whose lengths are equal, which is what allows this scenario to work without re-assembling from scratch.
How can I do something like this using Python?
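One hedged sketch of this workflow combines capstone (disassembly), keystone (assembling the replacement) and a raw byte patch at the instruction's file offset, so nothing outside the chosen instruction changes. The file name, target instruction and 32-bit assumption are only examples, and the PE header checksum is left stale:

```python
# Hedged sketch: capstone to find the instruction, keystone to assemble the
# replacement, then a raw byte patch so the rest of the file stays bit-identical.
# Assumes a 32-bit PE; swap in CS_MODE_64 / KS_MODE_64 for x64.
import capstone
import keystone
import pefile

PATH = "target.exe"                                   # hypothetical input

pe = pefile.PE(PATH)
text = next(s for s in pe.sections if s.Name.rstrip(b"\x00") == b".text")
code = text.get_data()
base = pe.OPTIONAL_HEADER.ImageBase + text.VirtualAddress

md = capstone.Cs(capstone.CS_ARCH_X86, capstone.CS_MODE_32)
ks = keystone.Ks(keystone.KS_ARCH_X86, keystone.KS_MODE_32)

for insn in md.disasm(code, base):
    if insn.mnemonic == "add" and insn.op_str == "eax, 5":   # "interesting" instruction
        replacement, _ = ks.asm("sub eax, -5", insn.address)
        if len(replacement) != insn.size:
            continue                                  # keep only same-length rewrites
        offset = text.PointerToRawData + (insn.address - base)
        with open(PATH, "rb") as fh:
            data = bytearray(fh.read())
        data[offset:offset + insn.size] = bytes(replacement)
        with open("patched.exe", "wb") as fh:
            fh.write(data)                            # note: header checksum now stale
        break
```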
I need a module or strategy for detecting that a piece of data is written in a programming language, not syntax highlighting where the user specifically chooses a syntax to highlight. My question has two levels, and I would greatly appreciate any help:
Is there any package in Python that takes a string (a piece of data) and tells you whether it matches the syntax of any programming language?
I don't necessarily need to recognize which language it is, just whether the string is source code at all.
Any clues are deeply appreciated.
Maybe you can use existing multi-language syntax highlighters. Many of them can detect the language a file is written in.
You could have a look at methods based on Bayesian filtering.
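Pygments, for instance, ships a lexer guesser; a small sketch (the result is heuristic and may misfire on short snippets):

```python
from pygments.lexers import guess_lexer
from pygments.util import ClassNotFound

snippet = "def greet(name):\n    return 'Hello ' + name\n"
try:
    print(guess_lexer(snippet).name)   # e.g. "Python"
except ClassNotFound:
    print("Doesn't look like any language Pygments knows")
```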
My answer somewhat depends on the amount of code you're going to be given. If you're given 30+ lines of code, it should be fairly easy to identify some unique features of each language that are fairly common. For example, tell the program that if anything matches an expression like from * import *, then it's Python (I'm not 100% sure that phrasing is unique to Python, but you get the gist). Other things you could look at that are usually slightly different would be class and method definitions (Python class definitions always start with 'class', while C starts a method with its return type, so you could check whether there is a line that starts with a data type and has the formatting of a method declaration); conditionals are usually formatted slightly differently too, and so on. If you wanted to make it more accurate, you could introduce some sort of weighting system: features that are more unique and less likely to be the result of a mismatched regexp get a higher weight, things that are commonly mismatched get a lower weight for the language, and you just calculate which language has the highest composite score at the end. You could also define features that you feel are 100% unique, and tell the program that as soon as it hits one of those, it can stop parsing because it knows the answer (things like the shebang line).
This would, of course, involve you knowing enough about the languages you want to identify to find unique features to look for, or being able to find people that do know unique structures that would help.
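A toy sketch of that weighting scheme, with invented (and certainly incomplete) feature sets:

```python
import re

# Invented feature sets: (pattern, weight) per language.
FEATURES = {
    "python": [(re.compile(r"^\s*def \w+\(.*\):", re.M), 3),
               (re.compile(r"^\s*from \w+ import ", re.M), 3),
               (re.compile(r"^#!.*\bpython", re.M), 10)],
    "c":      [(re.compile(r"#include\s*<\w+\.h>"), 10),
               (re.compile(r"\bint\s+main\s*\("), 5),
               (re.compile(r";\s*$", re.M), 1)],
}


def guess_language(source):
    scores = {lang: sum(w for pat, w in feats if pat.search(source))
              for lang, feats in FEATURES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


print(guess_language("#include <stdio.h>\nint main(void) { return 0; }\n"))  # -> c
```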
If you're given fewer than 30 or so lines of code, your answers from that kind of parsing are going to be far less accurate. In that case, the easiest way to do it would probably be to take an approach similar to Travis and just run the code in each language (in a VM, of course). If the code runs successfully in a language, you have your answer. If not, you would need a list of errors that are "acceptable" (as in, they are errors in the way the code was written, not in the interpreter). It's not a great solution, but at some point your code sample will just be too short to give an accurate answer.
I'm looking for a Reportlab wrapper which does the heavy lifting for me.
I found this one, which looks promising.
It seems cumbersome to me to deal with the low-level API of ReportLab (especially positioning of elements, etc.), and a library should facilitate at least this part.
My code for creating PDFs is currently maintenance hell: it consists of positioning elements, taking care of which things should stick together, and logic to deal with input strings of varying length.
For example while creating pdf invoices, I have to give the user the ability to adjust the distance between two paragraphs. Currently I grab this info from the UI and then re-calculate the position of paragraph A and B based upon the input.
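A rough sketch of how that adjustable gap looks with plain ReportLab platypus flowables, where the distance is a Spacer fed by the UI value (the names and sizes are invented):

```python
from reportlab.lib.pagesizes import A4
from reportlab.lib.styles import getSampleStyleSheet
from reportlab.lib.units import mm
from reportlab.platypus import Paragraph, SimpleDocTemplate, Spacer

styles = getSampleStyleSheet()
paragraph_gap_mm = 6                     # value taken from the UI (invented)

story = [
    Paragraph("Paragraph A: invoice header text.", styles["Normal"]),
    Spacer(1, paragraph_gap_mm * mm),    # the user-adjustable distance
    Paragraph("Paragraph B: payment terms.", styles["Normal"]),
]

SimpleDocTemplate("invoice.pdf", pagesize=A4).build(story)
```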
Besides looking for a wrapper to help me with this, it would be great if someone could point me to (or provide) a best-practice example of how to deal with positioning of elements, input strings of varying length, etc.
For future reference:
Having tested the lib PDFDocument, I can only recommend it. It takes away a lot of complexity, provides a lot of helper functions, and helps to keep your code clean. I found this resource really helpful to get started.
I'm interested in reading fixed width text files in Python in as efficient a manner as I can. Specifically, most of the time I'm interested in one or more columns in the flat file but not entire records.
It strikes me as inefficient to read the file a line at a time and extract the desired columns after reading the entire line into memory. I think I'd rather have the option of reading only the desired columns, top to bottom, left to right (instead of reading left to right, top to bottom).
Is such a thing desirable, and if so, is it possible?
Files are laid out as a (one-dimensional) sequence of bytes. 'Lines' are just a convenience we added to make things easy to read for humans. So, in general, what you're asking is not possible on plain files. To pull this off, you would need some way of finding where a record starts. The two most common ways are:
Search for newline symbols (in other words, read the entire file).
Use a specially spaced layout, so that each record has a fixed width. That way, you can use low-level file operations, like seek, to go directly to where you need to go. This avoids reading the entire file, but is painful to do manually.
I wouldn't worry too much about file reading performance unless it becomes a problem. Yes, you could memory map the file, but your OS probably already caches for you. Yes, you could use a database format (e.g., the sqlite3 file format through sqlalchemy), but it probably isn't worth the hassle.
Side note on "fixed width:" What precisely do you mean by this? If you really mean 'every column always starts at the same offset relative to the start of the record' then you can definitely use Python's seek to skip past data that you are not interested in.
How big are the lines? Unless each record is huge, it's likely to make little difference whether you read in only the fields you're interested in or the whole line.
For big files with fixed formatting, you might get something out of mmapping the file. I've only done this with C rather than Python, but it seems like mmapping the file then accessing the appropriate fields directly is likely to be reasonably efficient.
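In Python the same idea looks roughly like this with the standard mmap module (the record and column offsets are invented):

```python
import mmap

RECORD_LEN = 81                    # bytes per record, newline included (invented)
COL_START, COL_LEN = 20, 8         # column layout (invented)

with open("data.txt", "rb") as fh, \
     mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ) as mm:
    for i in range(len(mm) // RECORD_LEN):
        start = i * RECORD_LEN + COL_START
        field = mm[start:start + COL_LEN]      # only touched pages are read in
        print(field.decode("ascii").strip())
```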
Flat files are not well suited to what you're trying to do. My suggestion is to convert the file to an SQL database (using sqlite3) and then read just the columns you want. SQLite is blazing fast.
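A rough sketch of that conversion, with a made-up fixed-width layout and table name:

```python
import sqlite3

# Invented fixed-width layout: (column name, start, length).
LAYOUT = [("id", 0, 8), ("name", 8, 20), ("score", 28, 6)]

conn = sqlite3.connect("data.db")
conn.execute("CREATE TABLE IF NOT EXISTS records (id TEXT, name TEXT, score TEXT)")

with open("data.txt") as fh:
    rows = [tuple(line[start:start + length].strip()
                  for _, start, length in LAYOUT) for line in fh]
conn.executemany("INSERT INTO records VALUES (?, ?, ?)", rows)
conn.commit()

# Later, read only the column you care about:
for (score,) in conn.execute("SELECT score FROM records"):
    print(score)
```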
If it's truly fixed width, then you should be able to just call read(N) to skip past the fixed number of bytes from the end of your column on one line to the start of it on the next.
I have a scientific data management problem which seems general, but I can't find an existing solution or even a description of it, which I have long puzzled over. I am about to embark on a major rewrite (python) but I thought I'd cast about one last time for existing solutions, so I can scrap my own and get back to the biology, or at least learn some appropriate language for better googling.
The problem:
I have expensive (hours to days to calculate) and big (GBs) data attributes that are typically built as transformations of one or more other data attributes. I need to keep track of exactly how this data is built so I can reuse it as input for another transformation if it fits the problem (i.e. it was built with the right specification values), or construct new data as needed. Although it shouldn't matter, I typically start with 'value-added', somewhat heterogeneous molecular-biology info, for example genomes with genes and proteins annotated by other processes by other researchers. I need to combine and compare these data to make my own inferences. A number of intermediate steps are often required, and these can be expensive. In addition, the end results can become the input for additional transformations. All of these transformations can be done in multiple ways: restricting with different initial data (e.g. using different organisms), by using different parameter values in the same inferences, or by using different inference models, etc. The analyses change frequently and build on others in unplanned ways. I need to know what data I have (what parameters or specifications fully define it), both so I can reuse it if appropriate and for general scientific integrity.
My efforts in general:
I design my python classes with the problem of description in mind. All data attributes built by a class object are described by a single set of parameter values. I call these defining parameters or specifications the 'def_specs', and these def_specs with their values the 'shape' of the data atts. The entire global parameter state for the process might be quite large (eg a hundred parameters), but the data atts provided by any one class require only a small number of these, at least directly. The goal is to check whether previously built data atts are appropriate by testing if their shape is a subset of the global parameter state.
Within a class it is easy to find the needed def_specs that define the shape by examining the code. The rub arises when a module needs a data att from another module. These data atts will have their own shape, perhaps passed as args by the calling object, but more often filtered from the global parameter state. The calling class should be augmented with the shape of its dependencies in order to maintain a complete description of its data atts.
In theory this could be done manually by examining the dependency graph, but this graph can get deep, and there are many modules, which I am constantly changing and adding, and ... I'm too lazy and careless to do it by hand.
So, the program dynamically discovers the complete shape of the data atts by tracking calls to other classes' attributes and pushing their shape back up to the caller(s) through a managed stack of __get__ calls. As I rewrite, I find that I need to strictly control attribute access to my builder classes to prevent arbitrary info from influencing the data atts. Fortunately, Python makes this easy with descriptors.
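To make the descriptor idea concrete, here is a heavily simplified sketch of recording which def_specs a builder actually touches; the names and values are invented and this is not the poster's actual design:

```python
class TrackedSpec:
    """Descriptor that records which defining parameters ('def_specs')
    a builder actually reads, so the resulting data att's shape can be
    reported afterwards."""
    def __init__(self, name, value):
        self.name = name
        self.value = value

    def __get__(self, obj, objtype=None):
        if obj is not None:
            obj.shape[self.name] = self.value   # push the spec up to the caller
        return self.value


class Builder:
    window = TrackedSpec("window", 500)
    organism = TrackedSpec("organism", "E. coli")

    def __init__(self):
        self.shape = {}

    def build(self):
        # only the specs touched here end up in the shape
        return f"analysed {self.organism} with window={self.window}"


b = Builder()
b.build()
print(b.shape)   # {'organism': 'E. coli', 'window': 500}
```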
I store the shape of the data atts in a db so that I can query whether appropriate data (i.e. its shape is a subset of the current parameter state) already exists. In my rewrite I am moving from mysql via the great SQLAlchemy to an object db (ZODB or couchdb?) as the table for each class has to be altered when additional def_specs are discovered, which is a pain, and because some of the def_specs are python lists or dicts, which are a pain to translate to sql.
I don't think this data management can be separated from my data transformation code because of the need for strict attribute control, though I am trying to do so as much as possible. I can use existing classes by wrapping them with a class that provides their def_specs as class attributes, and db management via descriptors, but these classes are terminal in that no further discovery of additional dependency shape can take place.
If the data management cannot easily be separated from the data construction, I guess it is unlikely that there is an out-of-the-box solution, only a thousand specific ones. Perhaps there is an applicable pattern? I'd appreciate any hints on how to go about looking, or how to describe the problem better. To me it seems a general issue, though managing deeply layered data is perhaps at odds with the prevailing winds of the web.
I don't have specific python-related suggestions for you, but here are a few thoughts:
You're encountering a common challenge in bioinformatics. The data is large, heterogeneous, and comes in constantly changing formats as new technologies are introduced. My advice is to not overthink your pipelines, as they're likely to be changing tomorrow. Choose a few well defined file formats, and massage incoming data into those formats as often as possible. In my experience, it's also usually best to have loosely coupled tools that do one thing well, so that you can chain them together for different analyses quickly.
You might also consider taking a version of this question over to the bioinformatics stack exchange at http://biostar.stackexchange.com/
ZODB was not designed to handle massive data; it is intended for web-based applications, and in any case it is a flat-file based database.
I recommend that you try PyTables, a Python library for handling HDF5 files, which is a format used in astronomy and physics to store results from big calculations and simulations. It can be used as a hierarchical database and also has an efficient way to pickle Python objects. By the way, the author of PyTables has explained that ZODB was too slow for what he needed to do, and I can confirm that. If you are interested in HDF5, there is also another library, h5py.
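A minimal h5py sketch of storing a big result together with the parameters that define it, attached as HDF5 attributes (the dataset name and parameters are invented):

```python
import h5py
import numpy as np

result = np.random.rand(1000, 1000)          # stand-in for an expensive result

with h5py.File("results.h5", "w") as f:
    dset = f.create_dataset("alignment/scores", data=result, compression="gzip")
    # store the defining parameters (the 'shape', in the question's terms)
    dset.attrs["organism"] = "E. coli"
    dset.attrs["model"] = "hmm-v2"
    dset.attrs["threshold"] = 0.05

with h5py.File("results.h5", "r") as f:
    print(dict(f["alignment/scores"].attrs))
```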
As a tool for managing the versioning of the different calculations you have, you can try sumatra, which is something like an extension of git/trac, but designed for simulations.
You should ask this question on biostar, you will find better answers there.