I have scripts which need different configuration data. Most of the time it is a table format or a list of parameters. For the moment I read out excel tables for this.
However, not only reading excel is slightly buggy (excel is just not made for being a stable data provider), but also I'd like to include some data validation and a little help for the configurators so that input is validated partially. It doesn't have to be pretty though - just functional. Pure text files would be to hard to read and check.
Can you suggest an easy to implement way to realize that? Of course one could program complicating web interfaces and form, but maybe that's too much effort?!
Was is an easy to edit way to provide data tables and other configuration parameters?
The configuration info is just small tables with a list of parameters or a matrix with mathematical coefficients.
I like to use YAML. It's pretty flexible and python can read it as a dict using PyYAML.
Its a bit difficult to answer without knowing what the data you're working with looks like, but there are several ways you could do it. You could, for example, use something like csv or sqlite, provided the data can be easily expressed in a tabular format, but I think that you might find xml is best for your use case. It is very versatile and can be easy to work with if you find a good editor (e.g serna or oxygenxml), however, it might still be in your interests to write your own editor for it (which will probably not be as complicated as you think!). XML is easy to work with in python through the standard xml.etree module, and XML schemas can be used for validation.
Related
I am currently starting a kind of larger project in python and I am unsure about how to best structure it. Or to put it in different terms, how to build it in the most "pythonic" way. Let me try to explain the main functionality:
It is supposed to be a tool or toolset by which to extract data from different sources, at the moment mainly SQL-databases, in the future maybe also data from files stored on some network locations. It will probably consist of three main parts:
A data model which will hold all the data extracted from files / SQL. This will be some combination of classes / instances thereof. No big deal here
One or more scripts, which will control everything (Should the data be displayed? Outputted in another file? Which data exactly needs to be fetched? etc) Also pretty straightforward
And some module/class (or multiple modules) which will handle the data extraction of data. This is where I struggle mainly
So for the actual questions:
Should I place the classes of the data model and the "extractor" into one folder/package and access them from outside the package via my "control script"? Or should I place everything together?
How should I build the "extractor"? I already tried three different approaches for a SqlReader module/class: I tried making it just a simple module, not a class, but I didn't really find a clean way on how and where to initialize it. (Sql-connection needs to be set up) I tried making it a class and creating one instance, but then I need to pass around this instance into the different classes of the data model, because each needs to be able to extract data. And I tried making it a static class (defining
everything as a#classmethod) but again, I didn't like setting it up and it also kind of felt wrong.
Should the main script "know" about the extractor-module? Or should it just interact with the data model itself? If not, again the question, where, when and how to initialize the SqlReader
And last but not least, how do I make sure, I close the SQL-connection whenever my script ends? Meaning, even if it ends through an error. I am using cx_oracle by the way
I am happy about any hints / suggestions / answers etc. :)
For this project you will need the basic Data Science Toolkit: Pandas, Matplotlib, and maybe numpy. Also you will need SQLite3(built-in) or another SQL module to work with the databases.
Pandas: Used to extract, manipulate, analyze data.
Matplotlib: Visualize data, make human readable graphs for further data analyzation.
Numpy: Build fast, stable arrays of data that work much faster than python's lists.
Now, this is just a guideline, you will need to dig deeper in their documentation, then use what you need in your project.
Hope that this is what you were looking for!
Cheers
Is there a parser for PDB-files (Protein Data Bank) that can extract (most) information from the header/REMARK-section, like refinement statistics, etc.?
It might be worthwhile to note that I am mainly interested in accessing data from files right after they have been produced, not from structures that have already been deposited in the Protein Data Bank. This means that there is quite a variety of different "propriety" formats to deal with, depending on the refinement software used.
I've had a look at Biopython, but they explicitly state in the FAQ that "If you are interested in data mining the PDB header, you might want to look elsewhere because there is only limited support for this."
I am well aware that it would be a lot easier to extract this information from mmCIF-files, but unfortunately these are still not output routinely from many macromolecular crystallography programs.
The best way I have found so far is converting the PDB-file into mmcif-format using pdb_extract (http://pdb-extract.wwpdb.org/ , either online or as a standalone).
The mmcif-file can be parsed using Biopythons Bio.PDB-module.
Writing to the mmcif-file is a bit trickier, Python PDBx seems to work reasonably well.
This, and other useful PDB-/mmcif-tools can be found at http://mmcif.wwpdb.org/docs/software-resources.html
Maybe you should try that library?
https://pypi.python.org/pypi/bioservices
I have a situation where various analysis programs output large amounts of data, but I may only need to manipulate or access certain parts of the data in a particular Excel workbook.
The numbers might often change as well as newer analyses are run, and I'd like these changes to be reflected in Excel in as automated a manner as possible. Another important consideration is that I'm using Python to process some of the data too, so putting the data somewhere where it's easy for Python and Excel to access would be very beneficial.
I know only a little about databases, but I'm wondering if using one would be a good solution for what my needs - Excel has database interaction capability as far as I'm aware, as does Python. The devil is in the details of course, so I need some help figuring out what system I'd actually set up.
From what I've currently read (in the last hour), here's what I've come up with so far simple plan:
1) Set up an SQLite managed database. Why SQLite? Well, I don't need a database that can manage large volumes of concurrent accesses, but I do need something that is simple to set up, easy to maintain and good enough for use by 3-4 people at most. I can also use the SQLite Administrator to help design the database files.
2 a) Use ODBC/ADO.NET (I have yet to figure out the difference between the two) to help Excel access the database. This is going to be the trickiest part, I think.
2 b) Python already has the built in sqlite3 module, so no worries with the interface there. I can use it to set up the output data into an SQLite managed database as well!
Putting down some concrete questions:
1) Is a server-less database a good solution for managing my data given my access requirements? If not, I'd appreciate alternative suggestions. Suggested reading? Things worth looking at?
2) Excel-SQLite interaction: I could do with some help flushing out the details there...ODBC or ADO.NET? Pointers to some good tutorials? etc.
3) Last, but not least, and definitely of concern: will it be easy enough to teach a non-programmer how to setup spreadsheets using queries to the database (assuming they're willing to put in some time with familiarization, but not very much)?
I think that about covers it for now, thank you for your time!
Although you could certainly use a database to do what you're asking, I'm not sure you really want to add that complexity. I don't see much benefit of adding a database to your mix. ...if you were pulling data from a database as well, then it'd make more sense to add some tables for this & use it.
From what I currently understand of your requirements, since you're using python anyway, you could do your preprocessing in python, then just dump out the processed/augmented values into other csv files for Excel to import. For a more automated solution, you could even write the results directly to the spreadsheets from Python using something like xlwt.
I'm working with a 20 gig XML file that I would like to import into a SQL database (preferably MySQL, since that is what I am familiar with). This seems like it would be a common task, but after Googling around a bit I haven't been able to figure out how to do it. What is the best way to do this?
I know this ability is built into MySQL 6.0, but that is not an option right now because it is an alpha development release.
Also, if I have to do any scripting I would prefer to use Python because that's what I am most familiar with.
Thanks.
You can use the getiterator() function to iterate over the XML file without parsing the whole thing at once. You can do this with ElementTree, which is included in the standard library, or with lxml.
for record in root.getiterator('record'):
add_element_to_database(record) # Depends on your database interface.
# I recommend SQLAlchemy.
Take a look at the iterparse() function from ElementTree or cElementTree (I guess cElementTree would be best if you can use it)
This piece describes more or less what you need to do: http://effbot.org/zone/element-iterparse.htm#incremental-parsing
This will probably be the most efficient way to do it in Python. Make sure not to forget to call .clear() on the appropriate elements (you really don't want to build an in memory tree of a 20gig xml file: the .getiterator() method described in another answer is slightly simpler, but does require the whole tree first - I assume that the poster actually had iterparse() in mind as well)
I've done this several times with Python, but never with such a big XML file. ElementTree is an excellent XML library for Python that would be of assistance. If it was possible, I would divide the XML up into smaller files to make it easier to load into memory and parse.
It may be a common task, but maybe 20GB isn't as common with MySQL as it is with SQL Server.
I've done this using SQL Server Integration Services and a bit of custom code. Whether you need either of those depends on what you need to do with 20GB of XML in a database. Is it going to be a single column of a single row of a table? One row per child element?
SQL Server has an XML datatype if you simply want to store the XML as XML. This type allows you to do queries using XQuery, allows you to create XML indexes over the XML, and allows the XML column to be "strongly-typed" by referring it to a set of XML schemas, which you store in the database.
The MySQL documentation does not seem to indicate that XML import is restricted to version 6. It apparently works with 5, too.
This is what I've done for a project. I have a few data structures that are bascially dictionaries with some methods that operate on the data. When I save them to disk, I write them out to .py files as code that when imported as a module will load the same data into such a data structure.
Is this reasonable? Are there any big disadvantages? The advantage I see is that when I want to operate with the saved data, I can quickly import the modules I need. Also, the modules can be used seperate from the rest of the application because you don't need a separate parser or loader functionality.
By operating this way, you may gain some modicum of convenience, but you pay many kinds of price for that. The space it takes to save your data, and the time it takes to both save and reload it, go up substantially; and your security exposure is unbounded -- you must ferociously guard the paths from which you reload modules, as it would provide an easy avenue for any attacker to inject code of their choice to be executed under your userid (pickle itself is not rock-solid, security-wise, but, compared to this arrangement, it shines;-).
All in all, I prefer a simpler and more traditional arrangement: executable code lives in one module (on a typical code-loading path, that does not need to be R/W once the module's compiled) -- it gets loaded just once and from an already-compiled form. Data live in their own files (or portions of DB, etc) in any of the many suitable formats, mostly standard ones (possibly including multi-language ones such as JSON, CSV, XML, ... &c, if I want to keep the option open to easily load those data from other languages in the future).
It's reasonable, and I do it all the time. Obviously it's not a format you use to exchange data, so it's not a good format for anything like a save file.
But for example, when I do migrations of websites to Plone, I often get data about the site (such as a list of which pages should be migrated, or a list of how old urls should be mapped to new ones, aor lists of tags). These you typically get in Word och Excel format. Also the data often needs massaging a bit, and I end up with what for all intents and purposes are a dictionaries mapping one URL to some other information.
Sure, I could save that as CVS, and parse it into a dictionary. But instead I typically save it as a Python file with a dictionary. Saves code.
So, yes, it's reasonable, no it's not a format you should use for any sort of save file. It however often used for data that straddles the border to configuration, like above.
The biggest drawback is that it's a potential security problem since it's hard to guarantee that the files won't contains arbitrary code, which could be really bad. So don't use this approach if anyone else than you have write-access to the files.
A reasonable option might be to use the Pickle module, which is specifically designed to save and restore python structures to disk.
Alex Martelli's answer is absolutely insightful and I agree with him. However, I'll go one step further and make a specific recommendation: use JSON.
JSON is simple, and Python's data structures map well into it; and there are several standard libraries and tools for working with JSON. The json module in Python 3.0 and newer is based on simplejson, so I would use simplejson in Python 2.x and json in Python 3.0 and newer.
Second choice is XML. XML is more complicated, and harder to just look at (or just edit with a text editor) but there is a vast wealth of tools to validate it, filter it, edit it, etc.
Also, if your data storage and retrieval needs become at all nontrivial, consider using an actual database. SQLite is terrific: it's small, and for small databases runs very fast, but it is a real actual SQL database. I would definitely use a Python ORM instead of learning SQL to interact with the database; my favorite ORM for SQLite would be Autumn (small and simple), or the ORM from Django (you don't even need to learn how to create tables in SQL!) Then if you ever outgrow SQLite, you can move up to a real database such as PostgreSQL. If you find yourself writing lots of loops that search through your saved data, and especially if you need to enforce dependencies (such as if foo is deleted, bar must be deleted too) consider going to a database.