I'm working with a 20 gig XML file that I would like to import into a SQL database (preferably MySQL, since that is what I am familiar with). This seems like it would be a common task, but after Googling around a bit I haven't been able to figure out how to do it. What is the best way to do this?
I know this ability is built into MySQL 6.0, but that is not an option right now because it is an alpha development release.
Also, if I have to do any scripting I would prefer to use Python because that's what I am most familiar with.
Thanks.
You can use the getiterator() function to iterate over all the matching elements in the tree. You can do this with ElementTree, which is included in the standard library, or with lxml.
    import xml.etree.ElementTree as ET  # lxml.etree works the same way

    root = ET.parse('huge.xml').getroot()  # note: this parses the whole file
    for record in root.getiterator('record'):  # .iter('record') on newer Pythons
        add_element_to_database(record)  # depends on your database interface;
                                         # I recommend SQLAlchemy.
Take a look at the iterparse() function from ElementTree or cElementTree (I guess cElementTree would be best if you can use it).
This piece describes more or less what you need to do: http://effbot.org/zone/element-iterparse.htm#incremental-parsing
This will probably be the most efficient way to do it in Python. Don't forget to call .clear() on the appropriate elements: you really don't want to build an in-memory tree of a 20 GB XML file. The .getiterator() method described in another answer is slightly simpler, but it does require the whole tree first; I assume that poster actually had iterparse() in mind as well.
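A minimal sketch of the iterparse() pattern, assuming the rows you want live in <record> elements (the tag and the insert function are placeholders):

    import xml.etree.ElementTree as ET

    # iterparse streams the document; an 'end' event fires once an
    # element and all of its children have been fully parsed.
    for event, elem in ET.iterparse('huge.xml', events=('end',)):
        if elem.tag == 'record':
            add_element_to_database(elem)  # hypothetical insert, as above
            elem.clear()  # free the children; see the effbot article for
                          # also pruning references held by the root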
I've done this several times with Python, but never with such a big XML file. ElementTree is an excellent XML library for Python that would help here. If possible, I would split the XML into smaller files so they are easier to load into memory and parse.
It may be a common task, but maybe 20GB isn't as common with MySQL as it is with SQL Server.
I've done this using SQL Server Integration Services and a bit of custom code. Whether you need either of those depends on what you need to do with 20GB of XML in a database. Is it going to be a single column of a single row of a table? One row per child element?
SQL Server has an XML datatype if you simply want to store the XML as XML. This type allows you to do queries using XQuery, allows you to create XML indexes over the XML, and allows the XML column to be "strongly-typed" by associating it with a set of XML schemas, which you store in the database.
The MySQL documentation does not seem to indicate that XML import is restricted to version 6; the LOAD XML statement is documented for 5.5 as well.
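For reference, a hedged sketch of driving that statement from Python with the MySQLdb driver; the table name, tag, and connection details are made up:

    import MySQLdb  # any DB-API driver with LOCAL INFILE support works

    conn = MySQLdb.connect(db='mydb', user='me', passwd='secret',
                           local_infile=1)  # LOCAL INFILE must be enabled
    cur = conn.cursor()
    cur.execute("""
        LOAD XML LOCAL INFILE '/path/to/huge.xml'
        INTO TABLE records
        ROWS IDENTIFIED BY '<record>'
    """)
    conn.commit()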
I have scripts which need different configuration data. Most of the time it is a table or a list of parameters. At the moment I read the data out of Excel tables.
However, not only is reading Excel slightly buggy (Excel is just not made to be a stable data provider), but I'd also like to include some data validation and a little help for the configurators, so that input is at least partially validated. It doesn't have to be pretty, just functional. Pure text files would be too hard to read and check.
Can you suggest an easy-to-implement way to do this? Of course one could program complicated web interfaces and forms, but maybe that's too much effort.
What is an easy-to-edit way to provide data tables and other configuration parameters?
The configuration info is just small tables with a list of parameters or a matrix with mathematical coefficients.
I like to use YAML. It's pretty flexible, and Python can read it into a dict using PyYAML.
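A minimal sketch with PyYAML; the file name and keys are made up:

    import yaml  # PyYAML: pip install pyyaml

    # config.yaml might look like:
    #   coefficients:
    #     a: 1.5
    #     b: -0.25
    #   parameters: [alpha, beta]
    with open('config.yaml') as f:
        config = yaml.safe_load(f)  # safe_load avoids executing arbitrary tags
    print(config['coefficients']['a'])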
It's a bit difficult to answer without knowing what the data you're working with looks like, but there are several ways you could do it. You could, for example, use something like csv or sqlite, provided the data can be easily expressed in a tabular format, but I think XML may be best for your use case. It is very versatile and can be easy to work with if you find a good editor (e.g. Serna or oXygen); however, it might still be in your interest to write your own editor for it (which will probably not be as complicated as you think!). XML is easy to work with in Python through the standard xml.etree module, and XML schemas can be used for validation.
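If you go the XML route, reading such a table with xml.etree is short. A sketch, with a made-up layout:

    import xml.etree.ElementTree as ET

    # hypothetical config.xml:
    #   <config>
    #     <coefficient name="a">1.5</coefficient>
    #     <coefficient name="b">-0.25</coefficient>
    #   </config>
    root = ET.parse('config.xml').getroot()
    coefficients = {c.get('name'): float(c.text)
                    for c in root.iter('coefficient')}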
I'm looking for a tool that will play nicely with Python.
Except for my Python requirement, my question is the same as this one:
"I am looking for a tool which will take an XML instance document and output a corresponding XSD schema."
According to the PyCharm docs, PyCharm has a facility for this, though it is an IDE feature rather than an API a program can call. You are probably better off using XML Schema Learner as a separate program, since it is a command-line tool (subprocess friendly!).
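A hedged sketch of driving it from Python; the executable name and calling convention below are assumptions, so check the tool's README for the real invocation:

    import subprocess

    # Assumption: the learner is on PATH and writes the inferred XSD to
    # stdout when given sample XML documents as arguments.
    result = subprocess.run(['xml-schema-learner', 'sample.xml'],
                            capture_output=True, text=True, check=True)
    with open('sample.xsd', 'w') as out:
        out.write(result.stdout)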
Are you looking for something like pyxsd? (primarily used for validation against a schema) Or maybe PyXB? (can generate classes based on xml) Otherwise, I don't think there's a tool [yet] that will generate the schema from within Python. Can you do it on demand using something like xsd.exe? Does it have to be programmatic/repeatable?
Currently, there is no module that will run inside your Python program and do this conversion, but I see creating an XSD schema from XML as a tooling problem. It's the kind of functionality I'll use once, to get a schema started, and after that I'll maintain the schema myself. From reading a single XML file, an XSD generator can only create a starting point for a real schema; it cannot infer all the functionality and options offered by XSD.
Basically, I don't see the need to have this conversion run as a module inside my code, generating new XSDs every time the XML changes. After all, it's the schema that defines the XML, not the other way around.
As end-user pointed out, you could use xsd.exe, but you might also want to look at other tools such as trang (a bit old) for Java, or Stylus Studio (an XML tool).
I have a django site running with a mysql database backend. I accept fairly large uploads from one of the admin users to bulk import some data. The data comes in a format that is slightly different than the form it needs to be in the database so I need to do a little parsing.
I'd like to be able to pivot this data into csv and write it into a cStringIO object then simply use mysql's bulk import command to load that file. I'd prefer to skip writing the file to the disk first, but I can't seem to find a way around it. I've done basically this exact thing with postgresql in the past, but unfortunately this project is on mysql.
The short version: can I take an in-memory file-like object and somehow use MySQL's bulk import operation with it?
There is an excellent tutorial called Generator Tricks for Systems Programmers that addresses processing large log files, which is similar, but not identical, to your situation. As long as you can perform the needed transform with access to only the current (and possibly previous) data in the stream, this may work for you.
I have mentioned this gem in a number of answers because I think that it introduces a different way of thinking that can be quite valuable. There is a companion piece, A Curious Course on Coroutines and Concurrency, that can seriously twist your head around.
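The core idea is a pipeline of generators, each handling one record at a time. A sketch under made-up assumptions (pipe-delimited input, lowercasing the first field):

    import csv
    import io

    def parse(lines):
        for line in lines:
            yield line.rstrip('\n').split('|')  # hypothetical input format

    def transform(rows):
        for row in rows:
            yield [row[0].strip().lower()] + row[1:]  # massage each record

    buf = io.StringIO()
    writer = csv.writer(buf)
    with open('upload.txt') as src:  # hypothetical upload dump
        for row in transform(parse(src)):
            writer.writerow(row)
    # buf now holds the pivoted CSV entirely in memory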
If by "bulk import" you mean LOAD DATA [LOCAL] INFILE then, no, there's no way around first writing the data to some file, damn it all. You (and I) would really like to write the table directly from an array.
But some OSs, like Linux, allow a RAM-resident filesystem that eases some of the hurt. I'm not enough of a sysadmin to know how to set up one of these guys; I had to get my ISP's tech support to do it for me. I found an article that might have useful info.
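A sketch of that combination, assuming Linux's /dev/shm tmpfs and the MySQLdb driver; the table, columns, and credentials are made up:

    import csv
    import tempfile
    import MySQLdb

    rows = [('a', 1), ('b', 2)]  # your parsed/pivoted records

    # /dev/shm is a tmpfs on most Linux systems, so this "file" lives in RAM.
    with tempfile.NamedTemporaryFile('w', suffix='.csv', dir='/dev/shm',
                                     newline='') as f:
        csv.writer(f).writerows(rows)
        f.flush()
        conn = MySQLdb.connect(db='mydb', user='me', passwd='secret',
                               local_infile=1)
        conn.cursor().execute(
            "LOAD DATA LOCAL INFILE %s INTO TABLE mytable "
            "FIELDS TERMINATED BY ','", (f.name,))
        conn.commit()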
HTH
I'm writing a small Python app for distribution. I need to include simple XML validation (it's a debugging tool), but I want to avoid any dependencies on compiled C libraries such as lxml or pyxml as those will make the resulting app much harder to distribute. I can't find anything that seems to fit the bill - for DTDs, Relax NG or XML Schema. Any suggestions?
Do you mean something like minixsv? I have never used it, but from the website we can read that "minixsv is a lightweight XML schema validator package written in pure Python (at least Python 2.4 is required)", so it should work for you.
I believe that ElementTree could also be used for that goal, but I am not 100% sure.
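If you give minixsv a try, a hedged sketch of its usage, following the examples in its documentation (verify the entry point and error class against the version you install):

    from minixsv import pyxsval  # assumption: API as shown in the minixsv docs

    try:
        # validates input.xml against the schema the document references
        pyxsval.parseAndValidate('input.xml')
    except pyxsval.XsvalError as e:
        print('validation failed:', e)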
Why don't you try invoking an online XML validator and parsing the results?
I couldn't find any free REST or SOAP based services, but it would be easy enough to use a normal HTML-form-based one such as this one or this one. You just need to construct the correct request and parse the results (httplib may be of help here if you don't want to use a third-party library such as mechanize to ease the pain).
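A sketch of that approach with Python 3's urllib; the endpoint URL, form field, and success marker are all made up, so lift the real ones from whichever validator's HTML form you use:

    import urllib.parse
    import urllib.request

    xml_source = open('input.xml').read()
    data = urllib.parse.urlencode({'fragment': xml_source}).encode()
    with urllib.request.urlopen('https://validator.example.com/check',
                                data) as resp:
        body = resp.read().decode()
    print('valid' if 'No errors found' in body else body)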
I haven't looked at it in a while so it might have changed, but Fourthought's 4Suite (community edition) could still be pure Python.
Edit: just browsed through the docs and it's mostly Python which probably isn't enough for you.
The BeautifulSoup module is pure Python (and a single .py file), but I don't know if it will fulfill your validation needs; I've only used it for quickly extracting some fields from large XML files.
http://www.crummy.com/software/BeautifulSoup/
This is what I've done for a project. I have a few data structures that are basically dictionaries with some methods that operate on the data. When I save them to disk, I write them out to .py files as code that, when imported as a module, will load the same data into such a data structure.
Is this reasonable? Are there any big disadvantages? The advantage I see is that when I want to operate on the saved data, I can quickly import the modules I need. Also, the modules can be used separately from the rest of the application, because you don't need a separate parser or loader.
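For concreteness, a minimal sketch of that pattern (file and variable names made up):

    # Saving: write the structure out as importable Python source.
    data = {'name': 'example', 'values': [1, 2, 3]}
    with open('saved_data.py', 'w') as f:
        f.write('data = %r\n' % data)

    # Loading: just import the generated module.
    import saved_data
    print(saved_data.data['values'])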
By operating this way, you may gain some modicum of convenience, but you pay several kinds of price for it. The space it takes to save your data, and the time it takes to both save and reload it, go up substantially; and your security exposure is unbounded -- you must ferociously guard the paths from which you reload modules, as they would otherwise provide an easy avenue for an attacker to inject code of their choice, to be executed under your userid (pickle itself is not rock-solid, security-wise, but, compared to this arrangement, it shines ;-).
All in all, I prefer a simpler and more traditional arrangement: executable code lives in one module (on a typical code-loading path, that does not need to be R/W once the module's compiled) -- it gets loaded just once and from an already-compiled form. Data live in their own files (or portions of DB, etc) in any of the many suitable formats, mostly standard ones (possibly including multi-language ones such as JSON, CSV, XML, ... &c, if I want to keep the option open to easily load those data from other languages in the future).
It's reasonable, and I do it all the time. Obviously it's not a format you use to exchange data, so it's not a good format for anything like a save file.
But for example, when I do migrations of websites to Plone, I often get data about the site, such as a list of which pages should be migrated, a list of how old URLs should map to new ones, or lists of tags. These you typically get in Word or Excel format. Also, the data often needs a bit of massaging, and I end up with what is, for all intents and purposes, a dictionary mapping one URL to some other information.
Sure, I could save that as CSV and parse it into a dictionary. But instead I typically save it as a Python file with a dictionary. Saves code.
So, yes, it's reasonable; no, it's not a format you should use for any sort of save file. It is, however, often used for data that straddles the border with configuration, as above.
The biggest drawback is that it's a potential security problem, since it's hard to guarantee that the files won't contain arbitrary code, which could be really bad. So don't use this approach if anyone other than you has write access to the files.
A reasonable option might be to use the Pickle module, which is specifically designed to save and restore python structures to disk.
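A minimal round trip:

    import pickle

    data = {'coefficients': {'a': 1.5}, 'tags': ['x', 'y']}
    with open('data.pkl', 'wb') as f:
        pickle.dump(data, f)
    with open('data.pkl', 'rb') as f:
        restored = pickle.load(f)
    assert restored == data
    # Caveat: only unpickle files you trust; pickle can run code on load.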
Alex Martelli's answer is absolutely insightful and I agree with him. However, I'll go one step further and make a specific recommendation: use JSON.
JSON is simple, and Python's data structures map well into it; and there are good libraries and tools for working with JSON. The json module has been in the standard library since Python 2.6 and is based on simplejson, so I would use simplejson on Python 2.5 and earlier and the standard json module after that.
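A round trip is a few lines either way (simplejson exposes the same dump/load API):

    import json  # or: import simplejson as json, on older Pythons

    data = {'pages': ['index', 'about'], 'redirects': {'/old': '/new'}}
    with open('data.json', 'w') as f:
        json.dump(data, f, indent=2)
    with open('data.json') as f:
        restored = json.load(f)
    assert restored == data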
Second choice is XML. XML is more complicated, and harder to just look at (or just edit with a text editor) but there is a vast wealth of tools to validate it, filter it, edit it, etc.
Also, if your data storage and retrieval needs become at all nontrivial, consider using an actual database. SQLite is terrific: it's small, and for small databases runs very fast, but it is a real actual SQL database. I would definitely use a Python ORM instead of learning SQL to interact with the database; my favorite ORM for SQLite would be Autumn (small and simple), or the ORM from Django (you don't even need to learn how to create tables in SQL!) Then if you ever outgrow SQLite, you can move up to a real database such as PostgreSQL. If you find yourself writing lots of loops that search through your saved data, and especially if you need to enforce dependencies (such as if foo is deleted, bar must be deleted too) consider going to a database.
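And since sqlite3 ships in the standard library, trying the database route costs almost nothing. A minimal sketch (table and columns made up):

    import sqlite3

    conn = sqlite3.connect('app.db')  # the whole database is one file
    conn.execute('CREATE TABLE IF NOT EXISTS items '
                 '(id INTEGER PRIMARY KEY, name TEXT)')
    conn.execute('INSERT INTO items (name) VALUES (?)', ('example',))
    conn.commit()
    print(conn.execute('SELECT id, name FROM items').fetchall())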