I am currently starting a somewhat larger project in Python and I am unsure how best to structure it. Or, to put it differently, how to build it in the most "pythonic" way. Let me try to explain the main functionality:
It is supposed to be a tool or toolset for extracting data from different sources, at the moment mainly SQL databases, in the future maybe also files stored on network locations. It will probably consist of three main parts:
A data model which will hold all the data extracted from files / SQL. This will be some combination of classes / instances thereof. No big deal here
One or more scripts which will control everything (Should the data be displayed? Written to another file? Which data exactly needs to be fetched? etc.) Also pretty straightforward
And some module/class (or multiple modules) which will handle the actual data extraction. This is where I mainly struggle
So for the actual questions:
Should I place the classes of the data model and the "extractor" into one folder/package and access them from outside the package via my "control script"? Or should I place everything together?
How should I build the "extractor"? I already tried three different approaches for a SqlReader module/class: I tried making it just a simple module, not a class, but I didn't really find a clean way to decide how and where to initialize it (the SQL connection needs to be set up). I tried making it a class and creating one instance, but then I need to pass this instance around to the different classes of the data model, because each needs to be able to extract data. And I tried making it a static class (defining everything as a @classmethod), but again I didn't like setting it up and it also felt kind of wrong.
Should the main script "know" about the extractor module? Or should it just interact with the data model itself? If not, then again the question: where, when and how to initialize the SqlReader?
And last but not least, how do I make sure the SQL connection is closed whenever my script ends, even if it ends through an error? I am using cx_Oracle, by the way.
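To make the last two points concrete, here is roughly the class-based variant I tried, boiled down to a sketch (connection details and names are placeholders):

    import cx_Oracle

    class SqlReader:
        """Thin wrapper around one cx_Oracle connection (sketch only)."""

        def __init__(self, user, password, dsn):
            self._conn = cx_Oracle.connect(user, password, dsn)

        def fetch_all(self, query, params=None):
            cursor = self._conn.cursor()
            try:
                cursor.execute(query, params or {})
                return cursor.fetchall()
            finally:
                cursor.close()

        # Context-manager protocol, so the connection is closed even on errors.
        def __enter__(self):
            return self

        def __exit__(self, exc_type, exc_val, exc_tb):
            self._conn.close()

    # Usage: everything inside the with-block can use the reader; the
    # connection is closed when the block is left, normally or via exception.
    with SqlReader("user", "password", "host/service") as reader:
        rows = reader.fetch_all("SELECT * FROM some_table")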
I am happy about any hints / suggestions / answers etc. :)
For this project you will need the basic data science toolkit: Pandas, Matplotlib, and maybe Numpy. You will also need SQLite3 (built in) or another SQL module to work with the databases.
Pandas: Used to extract, manipulate, analyze data.
Matplotlib: Visualize data, make human-readable graphs for further analysis.
Numpy: Build fast, stable arrays of data that work much faster than python's lists.
Now, this is just a guideline; you will need to dig deeper into their documentation, then use what you need in your project.
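For example, a minimal sketch of how those pieces could fit together (file, table and column names are made up):

    import sqlite3
    import pandas as pd
    import matplotlib.pyplot as plt

    # Pull a table from an SQLite database straight into a DataFrame.
    conn = sqlite3.connect("measurements.db")          # hypothetical file
    df = pd.read_sql_query("SELECT ts, value FROM readings", conn)
    conn.close()

    # Manipulate / analyze with pandas ...
    daily = df.groupby("ts", as_index=False)["value"].mean()

    # ... and visualize with matplotlib.
    plt.plot(daily["ts"], daily["value"])
    plt.xlabel("ts")
    plt.ylabel("mean value")
    plt.show()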
Hope that this is what you were looking for!
Cheers
I am working on a Specman environment (a hardware verification language), and I want to automate my tasks.
In order to do so, I learned Python with the goal of using its file-manipulation abilities. The problem is that I only know how to manipulate .txt files. Is there a way to change other kinds of files?
Your question is way too generic. It's possible to change *.e files using string matching; maybe in some cases that makes sense as a one-time task, but there can't be any general rules for it. Writing an e parser in Python doesn't sound like a feasible task.
The only reasonable way to analyze e code is to load it and use reflection. But you can't always feed the results back to Python to let it make any meaningful modifications.
It's totally possible to use Python to generate e code based on some formally defined specs, in particular the coverage, generation constraints, etc. you mention. That can be an efficient and maintainable approach. However, there are different facilities for that, including tables.
Python certainly can be used for all kinds of smart scriptology: define environment, track installations and versions, choose flows, generate stubs, etc.
General Python question:
I have built a script using the numpy and pandas libraries. I have now been told that I cannot use any libraries, only base Python, because apparently open-source libraries are not approved.
Does this restriction make sense? Isn't base Python just as open source as the pandas/numpy libraries?
Is it possible to convert pandas/numpy code to base Python? Does this sound like a simple exercise, or does it require learning a lot of new functions? The majority of the code reads tables and then uses if/then-type statements and looks up values from other tables to generate and populate new tables.
I'm only going to address the 2nd point. Reimplementing all of numpy/pandas would certainly be a very large and useless task. But you're not reimplementing all of it; you only need some parts, and if it's only a few functions, then it's certainly possible.
I'd start from the working script, replace arrays with Python lists, and implement the needed functions one by one. For SO specifically, I suspect you're better off asking specific questions, e.g. how to implement an analog of function X in pure Python, etc.
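For example, pure-Python stand-ins for a couple of typical numpy/pandas operations might look like this (illustrative only):

    def mean(values):
        """Plain-Python replacement for numpy.mean on a flat list."""
        return sum(values) / len(values)

    def lookup(rows, key_col, key, value_col):
        """Plain-Python replacement for a pandas-style lookup: rows is a
        list of dicts (a "table"); return value_col of the first row whose
        key_col equals key."""
        for row in rows:
            if row[key_col] == key:
                return row[value_col]
        return None

    prices = [{"item": "A", "price": 10.0}, {"item": "B", "price": 12.5}]
    print(mean([1, 2, 3, 4]))                      # 2.5
    print(lookup(prices, "item", "B", "price"))    # 12.5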
This is a bit of a general question, but I wanted to know some different approaches to sharing data between machines.
Basically I have a process that generates a large reference table (often a 10+ GB Python dict), and then other machines run independent processes that reference that table. The dict does not change once it's created; all the other machines simply refer to it to do their work. I'm leaning towards storing all this in a database and then having all the servers query that server to get the data. I just suspect that having multiple 10+ GB queries running at the same time may not be the best way to do it. I have also thought about a flat file, or passing it around using a distribution tool.
Are there any other ways to share this Python dict among several machines? (General approaches are fine, but since I'm using Python, any library suggestion would also work.)
Well, yes, it would make sense to store it in some kind of shared datastore. Depending on your exact needs, you may find it preferable to store the data in some kind of NoSQL storage. For example, Redis ( http://redis.io ) is pretty reasonable and supports various data structures, including hash tables.
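A minimal sketch with the redis-py client, assuming a Redis server is reachable (host, key and field names are placeholders):

    import redis

    r = redis.Redis(host="redis-host", port=6379)   # hypothetical host

    # Producer: write the reference table once, as a Redis hash.
    reference_table = {"key1": "value1", "key2": "value2"}
    for key, value in reference_table.items():
        r.hset("reference_table", key, value)

    # Consumers on other machines: look up single entries on demand
    # instead of pulling the whole multi-gigabyte structure over the wire.
    print(r.hget("reference_table", "key1"))        # b'value1'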
I have a situation where various analysis programs output large amounts of data, but I may only need to manipulate or access certain parts of the data in a particular Excel workbook.
The numbers might often change as well as newer analyses are run, and I'd like these changes to be reflected in Excel in as automated a manner as possible. Another important consideration is that I'm using Python to process some of the data too, so putting the data somewhere where it's easy for Python and Excel to access would be very beneficial.
I know only a little about databases, but I'm wondering if using one would be a good solution for my needs - Excel has database interaction capability as far as I'm aware, as does Python. The devil is in the details of course, so I need some help figuring out what system I'd actually set up.
From what I've currently read (in the last hour), here's the simple plan I've come up with so far:
1) Set up an SQLite managed database. Why SQLite? Well, I don't need a database that can manage large volumes of concurrent accesses, but I do need something that is simple to set up, easy to maintain and good enough for use by 3-4 people at most. I can also use the SQLite Administrator to help design the database files.
2 a) Use ODBC/ADO.NET (I have yet to figure out the difference between the two) to help Excel access the database. This is going to be the trickiest part, I think.
2 b) Python already has the built-in sqlite3 module, so no worries with the interface there. I can use it to write the output data into the SQLite-managed database as well!
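To illustrate 2 b), this is roughly what I have in mind for the Python side (table and column names are invented):

    import sqlite3

    conn = sqlite3.connect("analysis.db")      # the shared database file
    cur = conn.cursor()
    cur.execute("""CREATE TABLE IF NOT EXISTS results (
                       run_id   TEXT,
                       quantity TEXT,
                       value    REAL)""")

    rows = [("run_001", "max_stress", 412.7),
            ("run_001", "min_margin", 1.32)]
    cur.executemany("INSERT INTO results VALUES (?, ?, ?)", rows)

    conn.commit()
    conn.close()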
Putting down some concrete questions:
1) Is a server-less database a good solution for managing my data given my access requirements? If not, I'd appreciate alternative suggestions. Suggested reading? Things worth looking at?
2) Excel-SQLite interaction: I could do with some help fleshing out the details there... ODBC or ADO.NET? Pointers to some good tutorials? etc.
3) Last, but not least, and definitely of concern: will it be easy enough to teach a non-programmer how to set up spreadsheets using queries to the database (assuming they're willing to put some time into familiarization, but not very much)?
I think that about covers it for now, thank you for your time!
Although you could certainly use a database to do what you're asking, I'm not sure you really want to add that complexity. I don't see much benefit in adding a database to your mix. ...if you were pulling data from a database as well, then it'd make more sense to add some tables for this and use it.
From what I currently understand of your requirements, since you're using python anyway, you could do your preprocessing in python, then just dump out the processed/augmented values into other csv files for Excel to import. For a more automated solution, you could even write the results directly to the spreadsheets from Python using something like xlwt.
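For example, a sketch of the xlwt route (sheet, file and value names are placeholders):

    import xlwt

    # Write processed values straight into an .xls workbook from Python.
    workbook = xlwt.Workbook()
    sheet = workbook.add_sheet("Results")

    sheet.write(0, 0, "quantity")
    sheet.write(0, 1, "value")

    processed = [("max_stress", 412.7), ("min_margin", 1.32)]
    for row, (name, value) in enumerate(processed, start=1):
        sheet.write(row, 0, name)
        sheet.write(row, 1, value)

    workbook.save("results.xls")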
This is what I've done for a project. I have a few data structures that are basically dictionaries with some methods that operate on the data. When I save them to disk, I write them out to .py files as code that, when imported as a module, will load the same data into such a data structure.
Is this reasonable? Are there any big disadvantages? The advantage I see is that when I want to operate on the saved data, I can quickly import the modules I need. Also, the modules can be used separately from the rest of the application, because you don't need a separate parser or loader.
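To make the idea concrete, a simplified version of the save/load round trip looks roughly like this (names are made up):

    import pprint

    def save_as_module(data, path):
        """Write a dict out as an importable Python module."""
        with open(path, "w") as f:
            f.write("data = ")
            f.write(pprint.pformat(data))
            f.write("\n")

    save_as_module({"a": 1, "b": [2, 3]}, "saved_data.py")

    # Later, anywhere in the application:
    import saved_data
    print(saved_data.data["b"])     # [2, 3]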
By operating this way, you may gain some modicum of convenience, but you pay many kinds of price for that. The space it takes to save your data, and the time it takes to both save and reload it, go up substantially; and your security exposure is unbounded -- you must ferociously guard the paths from which you reload modules, as it would provide an easy avenue for any attacker to inject code of their choice to be executed under your userid (pickle itself is not rock-solid, security-wise, but, compared to this arrangement, it shines;-).
All in all, I prefer a simpler and more traditional arrangement: executable code lives in one module (on a typical code-loading path, that does not need to be R/W once the module's compiled) -- it gets loaded just once and from an already-compiled form. Data live in their own files (or portions of DB, etc) in any of the many suitable formats, mostly standard ones (possibly including multi-language ones such as JSON, CSV, XML, ... &c, if I want to keep the option open to easily load those data from other languages in the future).
It's reasonable, and I do it all the time. Obviously it's not a format you use to exchange data, so it's not a good format for anything like a save file.
But for example, when I do migrations of websites to Plone, I often get data about the site (such as a list of which pages should be migrated, or a list of how old URLs should be mapped to new ones, or lists of tags). These you typically get in Word or Excel format. The data also often needs a bit of massaging, and I end up with what are, for all intents and purposes, dictionaries mapping one URL to some other information.
Sure, I could save that as CSV and parse it into a dictionary. But instead I typically save it as a Python file with a dictionary. It saves code.
So yes, it's reasonable; no, it's not a format you should use for any sort of save file. It is, however, often used for data that straddles the border to configuration, like the above.
The biggest drawback is that it's a potential security problem, since it's hard to guarantee that the files won't contain arbitrary code, which could be really bad. So don't use this approach if anyone other than you has write access to the files.
A reasonable option might be to use the Pickle module, which is specifically designed to save and restore python structures to disk.
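For example (file name is arbitrary):

    import pickle

    data = {"a": 1, "b": [2, 3]}

    # Save the structure to disk ...
    with open("data.pkl", "wb") as f:
        pickle.dump(data, f)

    # ... and restore it later.
    with open("data.pkl", "rb") as f:
        restored = pickle.load(f)

    print(restored == data)     # True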
Alex Martelli's answer is absolutely insightful and I agree with him. However, I'll go one step further and make a specific recommendation: use JSON.
JSON is simple, and Python's data structures map well into it; and there are several standard libraries and tools for working with JSON. The standard-library json module (available since Python 2.6) is based on simplejson, so I would use simplejson on older versions and json on 2.6 and newer.
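A quick sketch of the round trip (file name is arbitrary):

    import json

    data = {"pages": ["index", "about"], "tags": {"index": ["home"]}}

    # Dump to a human-readable text file ...
    with open("data.json", "w") as f:
        json.dump(data, f, indent=2)

    # ... and load it back, from Python or any other language with a JSON parser.
    with open("data.json") as f:
        restored = json.load(f)

    print(restored == data)     # True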
Second choice is XML. XML is more complicated, and harder to just look at (or just edit with a text editor) but there is a vast wealth of tools to validate it, filter it, edit it, etc.
Also, if your data storage and retrieval needs become at all nontrivial, consider using an actual database. SQLite is terrific: it's small, and for small databases runs very fast, but it is a real actual SQL database. I would definitely use a Python ORM instead of learning SQL to interact with the database; my favorite ORM for SQLite would be Autumn (small and simple), or the ORM from Django (you don't even need to learn how to create tables in SQL!) Then if you ever outgrow SQLite, you can move up to a real database such as PostgreSQL. If you find yourself writing lots of loops that search through your saved data, and especially if you need to enforce dependencies (such as if foo is deleted, bar must be deleted too) consider going to a database.
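For that last point, here is a sketch of how such a dependency can be enforced by SQLite itself through the built-in sqlite3 module (table names are invented):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")   # SQLite enforces FKs only if asked

    conn.execute("CREATE TABLE foo (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("""CREATE TABLE bar (
                        id INTEGER PRIMARY KEY,
                        foo_id INTEGER REFERENCES foo(id) ON DELETE CASCADE)""")

    conn.execute("INSERT INTO foo (id, name) VALUES (1, 'example')")
    conn.execute("INSERT INTO bar (id, foo_id) VALUES (10, 1)")

    # Deleting the foo row removes the dependent bar row automatically.
    conn.execute("DELETE FROM foo WHERE id = 1")
    print(conn.execute("SELECT COUNT(*) FROM bar").fetchone()[0])   # 0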