Star schema in Python Pandas

I currently have a project where I extract the data from a Firebird database and do the ETL process with Knime, then the CSV files are imported into PowerBI, where I create table relationships and develop the measures.
With Knime I summarize several tables, denormalizing.
I would like to migrate completely to Python; I am learning Pandas.
I would like to know how to deal with relational modeling in Python, star schema for example.
In Power BI there is a section dedicated to this, where I establish relationships and indicate whether they are unidirectional or bidirectional.
The only thing I can think of so far is to work in Pandas with joins in every required situation / function, but it seems to me that there must be a better way.
I would be grateful if you could point out what I should learn in order to approach this.

I think I can answer your question now that I have a better understanding of what you're trying to do in Python. My stack for reporting also involves Python for ETL operations and Power BI for the front end, so this is how I approach it even if there may be other ways that I'm not aware of.
While I create actual connections in Power BI for the data model I am using, we don't actually need to tell Python anything in advance. Power BI is declarative: you build the visualizations by specifying what information you want related, and Power BI performs the required operations on the backend to get that data. However, you need to give it some minimal information in order to do this, so you communicate to Power BI how you want the data modeled.
Python, in contrast, is imperative. Instead of telling it what you want at the end, you tell it what instructions you want it to perform. This means that you have to give all of the instructions yourself and that you need to know the data model.
So, the simple answer is that you don't need to deal with relational modeling. The more complicated and correct answer is that you need to plan your ETL tools around a logical data model. The logical data model doesn't really exist in one physical space like how Power BI stores what you tell it. It basically comes down to you knowing how the tables are supposed to relate and ensuring that the data stored within them allows those transformations to take place.
When the time comes to join tables in Python, perform join operations as needed, using the proper functions (e.g. merge()) in combination with the logical data model you have in your head (or written down).
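To make that concrete, here is a minimal sketch (the table and column names are made up) of resolving a star schema in Pandas by merging a fact table with its dimension tables and then aggregating the way a Power BI measure would:

```python
import pandas as pd

# Hypothetical fact and dimension tables standing in for the CSVs you export today.
fact_sales = pd.DataFrame({
    "product_id": [1, 2, 1],
    "customer_id": [10, 11, 10],
    "amount": [100.0, 250.0, 75.0],
})
dim_product = pd.DataFrame({"product_id": [1, 2], "product_name": ["A", "B"]})
dim_customer = pd.DataFrame({"customer_id": [10, 11], "region": ["North", "South"]})

# Resolve the star schema by merging the fact table with each dimension on its key,
# mirroring the relationships you would declare in Power BI's model view.
report = (
    fact_sales
    .merge(dim_product, on="product_id", how="left")
    .merge(dim_customer, on="customer_id", how="left")
)

# Aggregate like a measure, e.g. total sales by region.
print(report.groupby("region")["amount"].sum())
```

The "relationship" only exists in the keys you merge on, which is exactly the logical data model described above.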
The link I'm including here is a good place to start researching how to think about data modeling at the more conceptual level you will need:
https://www.guru99.com/data-modelling-conceptual-logical.html

Related

I would like to be able to compare values in one CSV with a nominal set of values in another

I have been given the task of injecting faults into a system and finding deviations from a norm. These deviations will serve as the failures of the system. So far we've had to detect these faults through observation, but I would like to develop a method for:
1.) Uploading each CSV which will include a fault of a certain magnitude.
2.) Comparing the CSV containing the fault with the nominal value.
3.) Being able to print out where the failure occurred, or if no failure occurred at all.
I was wondering which language would make the most sense for these three tasks. We've been given the system in Simulink, and have been able to output well-formatted CSV files containing information about the components which comprise the system. For each component, we have a nominal set of values and a set of values given after injecting a fault. I'd like to be able to compare these two and find where the fault has occurred. So far we've had very little luck in Python or in Matlab itself, and have been strongly considering using C to do this.
Any advice on which software will provide which advantages would be fantastic. Thank you.
If you want to store the outcomes in a database, it might be worth considering a tool like Microsoft SSIS (SQL Server Integration Services), where you could use your CSV files and sets of values as data sources, compare them / perform calculations, and store outcomes / datasets in tables. SSIS has an easy enough learning curve and easy-to-use components, as well as support for bespoke SQL / T-SQL, and you can visually separate your components into distinct processes. The package(s) can then be run either manually or in automated batches as desired.
https://learn.microsoft.com/en-us/sql/integration-services/sql-server-integration-services?view=sql-server-2017
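If you do give Python another try for the comparison itself, a minimal pandas sketch of steps 2 and 3 could look like the following (the file names, the component index column, and the tolerance are all assumptions):

```python
import pandas as pd

# Hypothetical file names and layout; adjust to match the Simulink CSV export.
nominal = pd.read_csv("nominal.csv", index_col="component")
faulty = pd.read_csv("fault_run.csv", index_col="component")

# Assumes the remaining columns are numeric and both files share the same layout.
deviation = (faulty - nominal).abs()
tolerance = 1e-3  # pick a threshold that separates noise from an injected fault
failures = deviation[(deviation > tolerance).any(axis=1)]

if failures.empty:
    print("No failure occurred.")
else:
    print("Failures detected in the following components:")
    print(failures)
```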
Good luck!

Which part of SQLAlchemy's dialect is responsible for pre-processing queries?

Apache Superset has some kind of quirk, where any temporal query is sent with datetimes, even when the respective columns are mere dates.
To work around it, I wanted to add some preprocessing to any passing query, so that datetimes are swapped by dates if the respective columns require it.
Any insight I've gained about writing the dialect came from comparing ready-made code, which is often repetitive, as other writers seem to encounter the same problems. I haven't so far found anything tangible to shed light on this particular issue.
Has someone perhaps managed to overcome a similar obstacle, and can point me to the relevant SQLAlchemy construct / function that I should implement or extend to allow this preprocessing?
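I haven't solved this for Superset specifically, but one place to hook in (an assumption on my part, not a confirmed fix) is outside the dialect proper: SQLAlchemy's before_cursor_execute event can inspect and rewrite every statement and its parameters just before execution. A minimal sketch, with the actual date-detection logic left as a stub:

```python
import datetime
from sqlalchemy import create_engine, event

engine = create_engine("sqlite://")  # in-memory placeholder; use your real engine URL

@event.listens_for(engine, "before_cursor_execute", retval=True)
def coerce_datetimes_to_dates(conn, cursor, statement, parameters, context, executemany):
    # Hypothetical preprocessing: drop the time component from datetime parameters
    # that are bound to DATE columns. Deciding which parameters belong to DATE
    # columns is the part you would still have to fill in.
    def fix(value):
        return value.date() if isinstance(value, datetime.datetime) else value

    if isinstance(parameters, dict):
        parameters = {key: fix(val) for key, val in parameters.items()}
    return statement, parameters
```

If you control the column types, a custom TypeDecorator around Date whose process_bind_param() drops the time component may be closer to what the dialect's own type handling does.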

CSV format data manipulation: why use python scripts instead of MS excel functions?

I am currently working on large data sets in csv format. In some cases, it is faster to use excel functions to get the work done. However, I want to write python scripts to read/write csv and carry out the required function. In what cases would python scripts be better than using excel functions for data manipulation tasks? What would be the long term advantages?
Using Python is recommended for the scenarios below:
Repeated action: Performing a similar set of actions over a similar dataset repeatedly. For example, say you get monthly forecast data and have to perform various slicing & dicing and plotting. Here the structure of the data and the steps of the analysis are more or less the same, but the data differs each month. Using Python and Pandas will save you a great deal of time and also reduce manual errors (see the sketch after this list).
Exploratory analysis: Once you establish a certain familiarity with Pandas, NumPy and Matplotlib, analysis using these Python libraries is faster and much more efficient than Excel analysis. One simple use case to justify this statement is backtracking. With Pandas, you can quickly trace back and restore the dataset to its original form or an earlier analysed form. With Excel, you could get lost in a maze of analysis and have no way to backtrack to an earlier form beyond Ctrl+Z.
Teaching tool: In my opinion, this is the most underutilized feature. An IPython notebook can be an excellent teaching tool and reference document for data analysis. Using it, you can efficiently transfer knowledge between colleagues rather than sharing a complicated Excel file.
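As an illustration of the repeated-action case, here is a minimal sketch (the file and column names are hypothetical) of a reusable routine applied to each month's export:

```python
import pandas as pd

def monthly_report(csv_path):
    """Hypothetical repeatable analysis: the same steps, a different file each month."""
    df = pd.read_csv(csv_path, parse_dates=["date"])  # column names are assumptions
    summary = (
        df.groupby("product")["forecast"]
          .sum()
          .sort_values(ascending=False)
    )
    summary.to_csv(csv_path.replace(".csv", "_summary.csv"))
    return summary

# Re-run the identical steps on every monthly export without any manual clicking.
for path in ["forecast_jan.csv", "forecast_feb.csv"]:
    monthly_report(path)
```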
After learning Python, you are more flexible. The operations you can do through the user interface of MS Excel are limited, whereas there are practically no limits if you use Python.
Another benefit is that you automate the modifications, e.g. you can re-use them or re-apply them to a different dataset. The speed depends heavily on the algorithm and library you use and on the operation.
You can also use VB script / macros in Excel to automate things, but Python is usually less cumbersome and more flexible.

Experience with using h5py to do analytical work on big data in Python?

I do a lot of statistical work and use Python as my main language. Some of the data sets I work with though can take 20GB of memory, which makes operating on them using in-memory functions in numpy, scipy, and PyIMSL nearly impossible. The statistical analysis language SAS has a big advantage here in that it can operate on data from hard disk as opposed to strictly in-memory processing. But, I want to avoid having to write a lot of code in SAS (for a variety of reasons) and am therefore trying to determine what options I have with Python (besides buying more hardware and memory).
I should clarify that approaches like map-reduce will not help in much of my work because I need to operate on complete sets of data (e.g. computing quantiles or fitting a logistic regression model).
Recently I started playing with h5py and think it is the best option I have found for allowing Python to act like SAS and operate on data from disk (via hdf5 files), while still being able to leverage numpy/scipy/matplotlib, etc. I would like to hear if anyone has experience using Python and h5py in a similar setting and what they have found. Has anyone been able to use Python in "big data" settings heretofore dominated by SAS?
EDIT: Buying more hardware/memory certainly can help, but from an IT perspective it is hard for me to sell Python to an organization that needs to analyze huge data sets when Python (or R, or MATLAB etc) need to hold data in memory. SAS continues to have a strong selling point here because while disk-based analytics may be slower, you can confidently deal with huge data sets. So, I am hoping that Stackoverflow-ers can help me figure out how to reduce the perceived risk around using Python as a mainstay big-data analytics language.
We use Python in conjunction with h5py, numpy/scipy and boost::python to do data analysis. Our typical datasets have sizes of up to a few hundred GBs.
HDF5 advantages:
data can be inspected conveniently using the h5view application, h5py/ipython and the h5* commandline tools
APIs are available for different platforms and languages
structure data using groups
annotating data using attributes
worry-free built-in data compression
I/O on single datasets is fast
HDF5 pitfalls:
Performance breaks down if an h5 file contains too many datasets/groups (> 1000), because traversing them is very slow. On the other hand, I/O is fast for a few big datasets.
Advanced data queries (SQL-like) are clumsy to implement and slow (consider SQLite in that case)
HDF5 is not thread-safe in all cases: one has to ensure that the library was compiled with the correct options
Changing h5 datasets (resize, delete, etc.) blows up the file size (in the best case) or is impossible (in the worst case); the whole h5 file has to be copied to flatten it again
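To illustrate the disk-based access pattern the question asks about, here is a minimal sketch (the file name, dataset name and chunk size are assumptions) of streaming a dataset that does not fit in memory:

```python
import h5py

# Minimal sketch of out-of-core processing with h5py; names are assumptions.
with h5py.File("measurements.h5", "r") as f:
    dset = f["observations"]          # dataset too large to load at once
    chunk = 1_000_000
    total, count = 0.0, 0
    # Read the dataset from disk in slices instead of pulling it all into memory.
    for start in range(0, dset.shape[0], chunk):
        block = dset[start:start + chunk]
        total += block.sum()
        count += block.shape[0]
    print("mean:", total / count)
```

Whole-dataset statistics such as quantiles still need either a sortable on-disk representation or an approximate streaming algorithm, so a sketch like this only covers accumulation-style computations.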
I don't use Python for stats and tend to deal with relatively small datasets, but it might be worth a moment to check out the CRAN Task View for high-performance computing in R, especially the "Large memory and out-of-memory data" section.
Three reasons:
you can mine the source code of any of those packages for ideas that might help you generally
you might find the package names useful in searching for Python equivalents; a lot of R users are Python users, too
under some circumstances, it might prove convenient to just link to R for a particular analysis using one of the above-linked packages and then draw the results back into Python
Again, I emphasize that this is all way out of my league, and it's certainly possible that you might already know all of this. But perhaps this will prove useful to you or someone working on the same problems.

python solutions for managing scientific data dependency graph by specification values

I have a scientific data management problem which seems general, but I can't find an existing solution or even a description of it, which I have long puzzled over. I am about to embark on a major rewrite (python) but I thought I'd cast about one last time for existing solutions, so I can scrap my own and get back to the biology, or at least learn some appropriate language for better googling.
The problem:
I have expensive (hours to days to calculate) and big (GB's) data attributes that are typically built as transformations of one or more other data attributes. I need to keep track of exactly how this data is built so I can reuse it as input for another transformation if it fits the problem (built with the right specification values) or construct new data as needed. Although it shouldn't matter, I typically start with 'value-added' somewhat heterogeneous molecular biology info, for example, genomes with genes and proteins annotated by other processes by other researchers. I need to combine and compare these data to make my own inferences. A number of intermediate steps are often required, and these can be expensive. In addition, the end results can become the input for additional transformations. All of these transformations can be done in multiple ways: by restricting with different initial data (e.g. using different organisms), by using different parameter values in the same inferences, or by using different inference models, etc. The analyses change frequently and build on others in unplanned ways. I need to know what data I have (what parameters or specifications fully define it), both so I can reuse it if appropriate, as well as for general scientific integrity.
My efforts in general:
I design my python classes with the problem of description in mind. All data attributes built by a class object are described by a single set of parameter values. I call these defining parameters or specifications the 'def_specs', and these def_specs with their values the 'shape' of the data atts. The entire global parameter state for the process might be quite large (eg a hundred parameters), but the data atts provided by any one class require only a small number of these, at least directly. The goal is to check whether previously built data atts are appropriate by testing if their shape is a subset of the global parameter state.
Within a class it is easy to find the needed def_specs that define the shape by examining the code. The rub arises when a module needs a data att from another module. These data atts will have their own shape, perhaps passed as args by the calling object, but more often filtered from the global parameter state. The calling class should be augmented with the shape of its dependencies in order to maintain a complete description of its data atts.
In theory this could be done manually by examining the dependency graph, but this graph can get deep, and there are many modules, which I am constantly changing and adding, and ... I'm too lazy and careless to do it by hand.
So, the program dynamically discovers the complete shape of the data atts by tracking calls to other classes' attributes and pushing their shape back up to the caller(s) through a managed stack of __get__ calls. As I rewrite, I find that I need to strictly control attribute access to my builder classes to prevent arbitrary info from influencing the data atts. Fortunately, Python makes this easy with descriptors.
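To make the mechanism concrete, here is a minimal sketch of that descriptor idea (the class, attribute names and def_specs are made up; the real version would also manage the caller stack and the db bookkeeping):

```python
class TrackedAttr:
    """When a data attribute is read, merge its defining specs ('shape')
    into the owning object's accumulated shape."""

    def __init__(self, def_specs):
        self.def_specs = def_specs

    def __set_name__(self, owner, name):
        self.private_name = "_" + name

    def __get__(self, obj, objtype=None):
        if obj is None:
            return self
        obj.shape.update(self.def_specs)          # push shape up to the caller
        return getattr(obj, self.private_name)

    def __set__(self, obj, value):
        setattr(obj, self.private_name, value)


class GenomeBuilder:
    # Hypothetical data att with the def_specs that define it.
    annotations = TrackedAttr({"organism": "e_coli", "annotation_model": "hmm_v2"})

    def __init__(self):
        self.shape = {}
        self.annotations = ["geneA", "geneB"]      # normally an expensive build


builder = GenomeBuilder()
_ = builder.annotations
print(builder.shape)   # {'organism': 'e_coli', 'annotation_model': 'hmm_v2'}
```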
I store the shape of the data atts in a db so that I can query whether appropriate data (i.e. data whose shape is a subset of the current parameter state) already exists. In my rewrite I am moving from MySQL via the great SQLAlchemy to an object db (ZODB or CouchDB?), because the table for each class has to be altered when additional def_specs are discovered, which is a pain, and because some of the def_specs are Python lists or dicts, which are a pain to translate to SQL.
I don't think this data management can be separated from my data transformation code because of the need for strict attribute control, though I am trying to do so as much as possible. I can use existing classes by wrapping them with a class that provides their def_specs as class attributes, and db management via descriptors, but these classes are terminal in that no further discovery of additional dependency shape can take place.
If the data management cannot easily be separated from the data construction, I guess it is unlikely that there is an out-of-the-box solution, only a thousand specific ones. Perhaps there is an applicable pattern? I'd appreciate any hints on how to go about looking, or how to better describe the problem. To me it seems a general issue, though managing deeply layered data is perhaps at odds with the prevailing winds of the web.
I don't have specific python-related suggestions for you, but here are a few thoughts:
You're encountering a common challenge in bioinformatics. The data is large, heterogeneous, and comes in constantly changing formats as new technologies are introduced. My advice is to not overthink your pipelines, as they're likely to be changing tomorrow. Choose a few well defined file formats, and massage incoming data into those formats as often as possible. In my experience, it's also usually best to have loosely coupled tools that do one thing well, so that you can chain them together for different analyses quickly.
You might also consider taking a version of this question over to the bioinformatics stack exchange at http://biostar.stackexchange.com/
ZODB has not been designed to handle massive data; it is meant for web-based applications and is, in any case, a flat-file based database.
I recommend you try PyTables, a Python library for handling HDF5 files, which is a format used in astronomy and physics to store results from big calculations and simulations. It can be used as a hierarchical database and also has an efficient way to pickle Python objects. By the way, the author of PyTables explained that ZODB was too slow for what he needed to do, and I can confirm that. If you are interested in HDF5, there is also another library, h5py.
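For instance, a minimal PyTables sketch (the group, array and attribute names are made up) of storing a derived result together with the def_specs that define it:

```python
import numpy as np
import tables

# Store a big derived array alongside the parameters ('shape') that define it.
with tables.open_file("results.h5", mode="w", title="Derived data") as h5:
    grp = h5.create_group("/", "genome_x", "Transformations for organism X")
    arr = h5.create_array(grp, "alignment_scores", np.random.rand(1000))
    arr.attrs.def_specs = {"model": "hmm", "evalue_cutoff": 1e-5}  # pickled by PyTables

# Later, reopen and check whether existing data matches the current parameter state.
with tables.open_file("results.h5", mode="r") as h5:
    node = h5.get_node("/genome_x/alignment_scores")
    print(node.attrs.def_specs)
```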
As a tool for managing the versioning of the different calculations you have, you can try sumatra, which is something like an extension to git/trac but designed for simulations.
You should ask this question on biostar, you will find better answers there.
