Power BI: How to use Python to access multiple tables?

I already read this post before:
Power BI: How to use Python with multiple tables in the Power Query Editor?
I'd like to do something like this inside PBI:
import pandas as pd
import numpy as np

proc = cad_processo
moeda = tab_moeda_data

def name_of_the_function(proc, moeda):
    df = code...
    return df
Now I'll explain more in-depth what I'm doing:
I'm working in a table called cad_processo, and I need to apply a very complex function to create a new column in cad_processo. But in order to create this new column, I need another table called tab_moeda_data.
I tried what was explained in the post quoted above, but I haven't been able to achieve anything so far.
In theory it's simple: import two tables and apply a function to create a new column. In practice, I haven't been able to bring this second table (tab_moeda_data) into the cad_processo query to apply the function.
*I know that cad_processo is called dataset in this case
I only need to import another table (tab_moeda_data) to apply the function, that's it.
Can anyone help me?

This is a feature that the UI doesn't support, so you have to use the Advanced Editor.
But you can simply pass additional tables to Python.Execute, e.g.
MainTable = ...,
RunPythonscript = Python.Execute(pythonScript, [dataset = MainTable, otherTable = OtherTable]),
and they will be available as additional pandas DataFrames in your script.
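For illustration, here is a minimal sketch of what the Python script passed to Python.Execute could look like once both tables are handed in. The join key (moeda_id) and the derived column are hypothetical placeholders; substitute your real columns and logic.
import pandas as pd

# Both tables arrive in the script as pandas DataFrames:
# 'dataset' is cad_processo and 'otherTable' is tab_moeda_data (names from the M snippet above).
def add_column(proc, moeda):
    # Hypothetical example: look up values from tab_moeda_data and derive a new column.
    merged = proc.merge(moeda, on="moeda_id", how="left")    # assumed join key
    merged["nova_coluna"] = merged["valor"] * merged["taxa"]  # assumed calculation
    return merged

result = add_column(dataset, otherTable)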

Related

How to create a property of a subclassed pandas dataframe that acts like a column

This may not be possible, or even a good idea but I wanted to see what options are available.
I want to create a subclass of a DataFrame that has a column as a property. E.g.
import pandas as pd

class DFWithProperties(pd.DataFrame):
    """A DataFrame that definitely has an 'a' column."""

    @property
    def depends_on_a(self):
        return self["a"] + 1

df = DFWithProperties({"a": [1, 2, 3, 4]})
I can obviously access this property in a similar way to other columns with
df.depends_on_a
But I also want to be able to access this column with something like
df["depends_on_a"]
or
df[["a", "depends_on_a"]]
Are there any neat tricks I can use here, or am I being too ambitious?
Yes! You can store whatever you would like as an attribute of the pandas DataFrame simply by saving it, and you can even save it as a column if need be.
For example:
df["depends_on_a"] = df.depends_on_a (or whatever you need)
However, this stored value is not linked to the property itself, so any changes to column "a" will have to be applied to both manually; the column will not update automatically.
Source: Adding meta-information/metadata to pandas DataFrame
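A short sketch of that approach, using the DFWithProperties class from the question: materialize the computed property as an ordinary column so the bracket syntax also works.
df = DFWithProperties({"a": [1, 2, 3, 4]})

# Materialize the computed property as a regular column.
df["depends_on_a"] = df.depends_on_a

# Both access styles now work.
print(df["depends_on_a"])
print(df[["a", "depends_on_a"]])

# The stored column is a snapshot: if "a" changes, reassign it manually.
df["a"] = df["a"] * 10
df["depends_on_a"] = df.depends_on_a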

Auxiliary data / description in pandas DataFrame

Imagine I have a pd.DataFrame containing a lot of rows, and I want to add some string data describing/summarizing it.
One way to do this would be to add a column with this value, but that would be a complete waste of memory (e.g. when the string is a long specification of an ML model).
I could also put it in the file name, but this has limitations and is very impractical when saving/loading.
I could store this data in a separate file, but that is not what I want.
I could make a class based on pd.DataFrame and add this field, but then I am unsure whether saving/loading with pickle would still work properly.
So is there a genuinely clean way to store something like a "description" in a pandas DataFrame? Preferably one that would survive to_pickle/read_pickle operations.
[EDIT]: Of course, this raises further questions, such as what happens if we concatenate DataFrames carrying such information, but let's stick to the saving/loading problem only.
After googling a bit: you can name a DataFrame as follows:
import pandas as pd
import numpy as np

df = pd.DataFrame(data=np.ones([10, 10]))
df.name = 'My name'
print(df.name)
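If a small dictionary of metadata is enough, newer pandas versions (1.0+) also provide the experimental DataFrame.attrs dict, which is intended for exactly this kind of auxiliary information and, as far as I can tell, survives a to_pickle/read_pickle round trip. A minimal sketch (verify on your pandas version, since the feature is marked experimental):
import pandas as pd
import numpy as np

df = pd.DataFrame(data=np.ones([10, 10]))
df.attrs["description"] = "long specification of the ML model goes here"

df.to_pickle("data.pkl")
restored = pd.read_pickle("data.pkl")
print(restored.attrs["description"])  # the description comes back with the frame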

Pandas parallel apply with koalas (pyspark)

I'm new to Koalas (PySpark), and I was trying to use Koalas for a parallel apply, but it seemed to use a single core for the whole operation (correct me if I'm wrong), so I ended up using Dask for the parallel apply (via map_partitions), which worked pretty well.
However, I would like to know whether there's a way to use Koalas for a parallel apply.
I used basic code for the operation, like below.
import pandas as pd
import databricks.koalas as ks

my_big_data = ks.read_parquet('my_big_file')  # file is a single-partition parquet file
my_big_data['new_column'] = my_big_data['string_column'].apply(my_prep)  # my_prep does string operations
my_big_data.to_parquet('my_big_file_modified')  # needed because Koalas evaluates lazily
I found a link that discusses this problem: https://github.com/databricks/koalas/issues/1280
If the number of rows processed by the applied function is less than 1,000 (the default), a plain pandas DataFrame is used to do the operation.
The user-defined function my_prep above is applied to each row, so single-core pandas was being used.
To force it to run in a PySpark (parallel) manner, the configuration should be changed as below.
import databricks.koalas as ks

ks.set_option('compute.default_index_type', 'distributed')  # when the .head() call is too slow
ks.set_option('compute.shortcut_limit', 1)  # Koalas will go through PySpark instead of the pandas shortcut
Also, explicitly specifying the return type (a type hint) on the user-defined function keeps Koalas from taking the shortcut path, so the apply runs in parallel.
def my_prep(row) -> str:
    return row

kdf['my_column'].apply(my_prep)
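Putting those pieces together, a minimal sketch using the names from the question (the body of my_prep is a placeholder for the real string preprocessing):
import databricks.koalas as ks

# Force distributed execution instead of the single-core pandas shortcut.
ks.set_option('compute.default_index_type', 'distributed')
ks.set_option('compute.shortcut_limit', 1)

def my_prep(value) -> str:
    # placeholder for the real string preprocessing
    return value.strip().lower()

my_big_data = ks.read_parquet('my_big_file')
my_big_data['new_column'] = my_big_data['string_column'].apply(my_prep)
my_big_data.to_parquet('my_big_file_modified')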

Unable to access data using pandas composite index

I'm trying to organise data using a pandas dataframe.
Given the structure of the data, it seems logical to use a composite index of 'league_id' and 'fixture_id'. I believe I have implemented this according to the examples in the docs; however, I am unable to access the data using the index.
My code can be found here;
https://repl.it/repls/OldCorruptRadius
I am very new to Pandas and programming in general, so any advice would be much appreciated! Thanks!
For multi-indexing, you would need to use the pandas MultiIndex API, which brings its own learning curve; thus, I would not recommend it for beginners. Link: https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html
The way I use multi-indexing is only to display a final product to others (i.e. to make it easy/pretty to view). Before multi-indexing, filter on fixture_id and league_id while they are still ordinary columns:
df = pd.DataFrame(fixture, columns=features)
df[(df['fixture_id'] == 592) & (df['league_id'] == 524)]
This way, you are still effectively targeting the values that would have become the index had you multi-indexed the two columns.
If you have to use multi-indexing, try the transpose (df.T) of the DataFrame. This swaps rows and columns, so the row MultiIndex becomes the column index. For example, you can do something like this:
df = pd.DataFrame(fixture, columns=features).set_index(['league_id', 'fixture_id'])
df.T[524][592].loc['event_date'] # gets you the row of `event_dates`
df.T[524][592].loc['event_date'].iloc[0] # gets you the first instance of event_dates
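For completeness, a MultiIndex can also be queried directly with .loc and a tuple, without transposing. A small sketch with hypothetical stand-in data (524 and 592 are the example ids from the question):
import pandas as pd

# Hypothetical stand-in for the question's fixture data.
fixture = [
    {'league_id': 524, 'fixture_id': 592, 'event_date': '2019-08-09', 'status': 'Finished'},
    {'league_id': 524, 'fixture_id': 593, 'event_date': '2019-08-10', 'status': 'Finished'},
]
df = pd.DataFrame(fixture).set_index(['league_id', 'fixture_id'])

# .loc with a tuple addresses one (league_id, fixture_id) combination directly.
print(df.loc[(524, 592), 'event_date'])  # -> '2019-08-09'
print(df.loc[(524, 592)])                # full row for that fixture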

Using revoscalepy to insert data into a database

Ahoy there,
is there a possibility of using the revoscalepy package to insert values into a table?
I would expect something along the lines of:
import pandas as pd
from revoscalepy import rx_write_to_db, RxOdbcData
a_df = pd.DataFrame([[0, 1], [2, 3]], columns=[...])
rx_write_to_db(RxOdbcData(connection_string=con_str, ...), data=a_df)
But I couldn't find anything like this. The closest option appears to be rx_write_object, which dumps the dataframe as a binary blob into the table. More information about its usage can be found on the R-package site. This, however, does not solve my issue, as I would prefer the data not to end up in one binary blob.
Some context on the problem: during feature generation I create multiple features which I want to store in the database for later use. In theory I could create a final dataframe with all my features and the metadata in it and use some triggers to dump the data into the right tables, but before doing that, I would rather just install pymssql.
Any clues?
Ps.: If anyone knows the correct tags for a question like this, let me know...
I think what you are looking for is rx_featurize from the microsoftml package (installed with revoscalepy).
After you have your data frame, you would create an RxSqlServerData or RxOdbcData object, with the connection string and table name arguments.
Then you simply call rx_featurize, giving it the data frame as input and the Rx...Data object as output (specifying whether you want to overwrite the table or not):
http://learn.microsoft.com/en-us/machine-learning-server/python-reference/microsoftml/rx-featurize
import pandas as pd
from revoscalepy import RxOdbcData
from microsoftml import rx_featurize

a_df = pd.DataFrame([[0, 1], [2, 3]], columns=[...])
rx_featurize(data=a_df, output_data=RxOdbcData(connection_string=con_str, table=tablename), overwrite=True)
