Using revoscalepy to insert data into a database - python

Ahoy there,
is there a possibility of using the revoscalepy package to insert values into a table?
I would expect something along the lines of:
import pandas as pd
from revoscalepy import rx_write_to_db, RxOdbcData
a_df = pd.DataFrame([[0, 1], [2, 3]], columns=[...])
rx_write_to_db(RxOdbcData(connection_string=con_str, ...), data=a_df)
But I couldn't find anything like this. The closest option appears to be rx_write_object, which dumps the dataframe as a binary into the table. More information about its usage can be found on the R-package site. This, however, does not solve my issue, as I would rather the data not be stored as one binary blob.
Some context on the problem: during feature generation I create multiple features which I want to store in the database for later use. In theory I could create a final dataframe with all my features and the metadata in it and use some triggers to dump the data into the right tables, but before I do this, I would rather install pymssql.
Any clues?
Ps.: If anyone knows the correct tags for a question like this, let me know...

I think what you are looking for is rx_featurize from the microsoftml package (installed with revoscalepy).
After you have your data frame, you would create an RxSqlServerData or RxOdbcData with the connection string and table name arguments.
Then you simply call rx_featurize, giving it the data frame as input and the Rx...Data object as output (specifying whether you want to overwrite the table or not).
http://learn.microsoft.com/en-us/machine-learning-server/python-reference/microsoftml/rx-featurize
import pandas as pd
from revoscalepy import RxOdbcData
from microsoftml import rx_featurize
a_df = pd.DataFrame([[0, 1], [2, 3]], columns=[...])
rx_featurize(data=a_df, output_data=RxOdbcData(connection_string=con_str, table=tablename), overwrite=True)


Power BI: How to use Python to access multiple tables?

I already read this post before:
Power BI: How to use Python with multiple tables in the Power Query Editor?
I'd like to do something like this inside PBI:
import pandas as pd
import numpy as np
proc = cad_processo
moeda = tab_moeda_data
def name_of_the_function(proc, moeda):
    df = code...
    return df
Now I'll explain more in-depth what I'm doing:
I'm in a table called cad_processo, I need to apply a very complex function to create a column in cad_processo, but in order to create this new column, I need another table called tab_moeda_data.
I tried what was explained in the post I quoted before, but I haven't been able to achieve anything so far.
In theory it's simple: import two tables and apply a function to create a new column. But I'm not able to import this second table (tab_moeda_data) into cad_processo to apply the function.
*I know that cad_processo is called dataset in this case
I only need to import another table (tab_moeda_data) to apply the function, that's it.
Can anyone help me?
This is a feature that the UI doesn't support, so you have to use the Advanced Editor.
But you can simply pass additional tables to Python.Execute, e.g.
MainTable = ...,
RunPythonscript = Python.Execute(pythonScript, [dataset=MainTable, otherTable=OtherTable]),
And they will be available as additional pandas DataFrames in your script.
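Inside the Python script those records show up as pandas DataFrames under the names you chose above, so the question's function can be applied roughly like this; a minimal sketch, where the merge key 'some_key' is a purely hypothetical placeholder:
import pandas as pd

# 'dataset' and 'otherTable' are provided by Power BI via Python.Execute above
proc = dataset          # cad_processo
moeda = otherTable      # tab_moeda_data

def name_of_the_function(proc, moeda):
    # hypothetical example: look values up from tab_moeda_data
    return proc.merge(moeda, how='left', on='some_key')

df = name_of_the_function(proc, moeda)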

Auxiliary data / description in pandas DataFrame

Imagine I have a pd.DataFrame containing a lot of rows, and I want to add some string data describing/summarizing it.
One way to do this would be to add a column with this value, but that would be a complete waste of memory (e.g. when the string is a long specification of an ML model).
I could also put it in the file name, but this has limitations and is very impractical when saving/loading.
I could store this data in a separate file, but that is not what I want.
I could make a class based on pd.DataFrame and add this field, but then I am unsure whether pickle save/load would still work properly.
So is there some really clean way to store something like a "description" in a pandas DataFrame? Preferably one that would survive to_pickle/read_pickle operations.
[EDIT]: Of course, this raises further questions, such as what happens if we concatenate dataframes carrying such information, but let's stick to the saving/loading problem only.
I googled it a bit; you can name a dataframe as follows:
import pandas as pd
import numpy as np
df = pd.DataFrame( data=np.ones([10,10]) )
df.name = 'My name'
print(df.name)
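Whether an ad-hoc attribute like this survives a pickle round-trip can depend on the pandas version, so it is worth verifying against the exact operations you care about. A quick sanity check (the file name is arbitrary):
import pandas as pd
import numpy as np

df = pd.DataFrame(data=np.ones([10, 10]))
df.name = 'My name'

# Round-trip through pickle and see whether the attribute comes back
df.to_pickle('df.pkl')
restored = pd.read_pickle('df.pkl')
print(getattr(restored, 'name', 'name attribute not preserved'))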

Python bson: How to create list of ObjectIds in a New Column

I have a CSV for which I need to create a column of random, unique MongoDB ObjectIds in Python.
Here's my csv file:
import pandas as pd
df = pd.read_csv('file.csv', sep=';')
print(df)
Zone
Zone_1
Zone_2
Update: I'm currently using this line of code to generate a unique ObjectId:
import bson
x = bson.objectid.ObjectId()
df['objectids'] = x
print(df)
Zone; objectids
Zone_1; 5bce2e42f6738f20cc12518d
Zone_2; 5bce2e42f6738f20cc12518d
How can I get the ObjectId to be unique for each row?
Hate to see you downvoted... Stack Overflow goes nuts with the duplicate-question nonsense, refuses to provide useful assistance, then downvotes you for having the courage to ask about something you don't know.
The referenced question clearly has nothing to do with ObjectIds, let alone adding them (or any other object not internal to NumPy or pandas) to data frames.
You probably need to use map.
This assumes the column "objectids" is not already a Series in your frame and "Zone" is a Series in your frame:
df['objectids'] = df['Zone'].map(lambda x: bson.objectid.ObjectId())
map is a super helpful (though slow) way to touch every record in your Series, and it is particularly handy as an initial way to hook in external functions.
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.map.html
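Put together, a minimal self-contained version of that approach (the Zone values are taken from the example above):
import bson
import pandas as pd

df = pd.DataFrame({'Zone': ['Zone_1', 'Zone_2']})

# The lambda runs once per element of 'Zone', so each row gets its own ObjectId
df['objectids'] = df['Zone'].map(lambda _: bson.objectid.ObjectId())
print(df)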

Is there a dataset file format for Pandas which can be indexed on multiple columns (that is, 'database-indexed'), and/or can be updated cheaply?

I'm building an interactive browser and editor for larger-than-memory datasets which will later be processed with Pandas. I'll therefore need indexes on several columns that the dataset will be interactively sorted or filtered on (database indexes, not Pandas indexing), and I'd like the dataset file format to support cheap edits without rewriting most of the file. Like a database, except that I want to be able to just send the files away afterwards in a Pandas-compatible format, without an export step.
So, I wonder if any of the formats that Pandas supports:
Have an option of building database-indexes on several columns (for sorting and filtering)
Can be updated 'in-place' or otherwise cheaply without shifting the rest of the records around
Preferably both of the above
What are my options?
I'm a complete noob in Pandas, and so far it seems that most of the formats are simply serialized sequential records, just like CSV, and at most can be sorted or indexed on one column. If nothing better comes up, I'll have to either build the indexes myself externally and juggle the edited rows manually before exporting the dataset, or dump the whole dataset in and out of a database—but I'd prefer avoiding both of those.
Edit: more specifically, it appears that Parquet has upper/lower bounds recorded for each column in each data page, and I wonder if these can be used as sort-of-indexes to speed up sorting on arbitrary columns, or whether other formats have similar features.
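(For illustration: those upper/lower bounds are stored in the Parquet metadata per row group and column chunk, and they can be inspected, for example with pyarrow. A minimal sketch, assuming a file named 'data.parquet' already exists:)
import pyarrow.parquet as pq

# 'data.parquet' is a hypothetical existing file; print the min/max
# statistics recorded per row group and column, which readers can use
# to skip irrelevant chunks when filtering.
meta = pq.ParquetFile('data.parquet').metadata
for rg in range(meta.num_row_groups):
    for col in range(meta.num_columns):
        chunk = meta.row_group(rg).column(col)
        if chunk.statistics is not None and chunk.statistics.has_min_max:
            print(rg, chunk.path_in_schema,
                  chunk.statistics.min, chunk.statistics.max)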
I would argue that parquet is indeed a good format for this situation. It maps well to the tabular nature of pandas dataframes, stores most common data in efficient binary representations (with optional compression), and is a standard, portable format. Furthermore, it allows you to load only those columns or "row groups" (chunks) you require. This latter gets to the crux of your problem.
Pandas' .to_parquet() will automatically store metadata relating to the indexing of your dataframe, and create the column max/min metadata as you suggest. If you use the fastparquet backend, you can use the filters= keyword when loading to select only some of the row-groups (this does not filter within row-groups)
pd.read_parquet('split', filters=[('cat', '==', '1')],
                engine='fastparquet')
(selects only row-groups where some values of field 'cat' are equal to '1')
This can be particularly efficient, if you have used directory-based partitioning on writing, e.g.,
out.to_parquet('another_dataset.parq', partition_on=['cat'],
               engine='fastparquet', file_scheme='hive')
Some of these options are only documented in the fastparquet docs, and maybe the API of that library implements slightly more than is available via the pandas methods; and I am not sure how well such options are implemented with the arrow backend.
Note further that you may wish to read/save your dataframes using dask's to/read_parquet methods. Dask will understand the index if it is 1D, and it will perform the equivalent of the filters= operation automatically, loading only the relevant parts of the data on disk when you do filtering operations on the index. Dask is built to deal with data that does not easily fit into memory and to do computations in parallel.
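A minimal sketch of that dask round-trip, reusing the file and column names from the fastparquet example above (so they are assumptions rather than required names):
import dask.dataframe as dd

# Lazily open the partitioned dataset written above
ddf = dd.read_parquet('another_dataset.parq', engine='fastparquet')

# Filtering on the partition/index column only reads the relevant
# row-groups from disk when the result is computed
subset = ddf[ddf['cat'] == '1'].compute()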
(in answer to some of the comments above: Pandas-SQL interaction is generally not efficient, unless you can push the harder parts of the computation into a fast DB backend - in which case you don't really have a problem)
EDIT: some specific notes:
parquet is not in general made for atomic record updating, but you could write new chunks of the whole dataset (not via the pandas API; I think this is true for ALL of the writing format methods)
the "index" you speak of is not the same thing as a pandas index, but I am thinking that the information above may show that the sort of indexing in parquet is still useful for you.
If you decide to go the database route, SQLite is perfect since it ships with Python already, the driver API is in Python's standard library, and the file format is platform independent. I use it for all my personal projects.
Example is modified from this tutorial on Pandas + sqlite3 and the pandas.io documentation:
# Code to create the db
import sqlite3
import pandas as pd
# Create a data frame
df = pd.DataFrame(dict(col1=[1,2,3], col2=['foo', 'bar', 'baz']))
df.index = ['row{}'.format(i) for i in range(df.shape[0])]
# Connect to your database
conn = sqlite3.connect("data/your_database.sqlite")
# Write the data to your table (overwrite old data)
df.to_sql('your_table', conn, if_exists='replace')
# Add more data
new_rows = pd.DataFrame(
    dict(col1=[3, 4], col2=['notbaz', 'grunt']),
    index=('row2', 'row3')
)
new_rows.to_sql('your_table', conn, if_exists='append') # `append`
This part is an aside in case you need more complex stuff:
# (oops - duplicate row 2)
# also ... need to quote "index" column name because it's a keyword.
query = 'SELECT * FROM your_table WHERE "index" = ?'
pd.read_sql(query, conn, params=('row2',))
# index col1 col2
# 0 row2 3 baz
# 1 row2 3 notbaz
# For more complex queries use pandas.io.sql
from pandas.io import sql
query = 'DELETE FROM your_table WHERE "col1" = ? AND "col2" = ?'
sql.execute(query, conn, params=(3, 'notbaz'))
conn.commit()
# close
conn.close()
When you or collaborators want to read from the database, just send them the file data/your_database.sqlite and this code:
# Code to access the db
import sqlite3
import pandas as pd
# Connect to your database
conn = sqlite3.connect("data/your_database.sqlite")
# Load the data into a DataFrame
query = 'SELECT * FROM your_table WHERE col1 BETWEEN ? and ?'
df = pd.read_sql_query(query, conn, params=(1,3))
# index col1 col2
# 0 row0 1 foo
# 1 row1 2 bar
# 2 row2 3 baz

How to override/prevent sqlalchemy from ever using floating point type?

I've been using pandas to transform raw file data and import it into a database. Often we use large integers as primary keys. When using pandas' to_sql function without explicitly specifying column types, it will sometimes automatically assign large integers as float (rather than bigint).
As you can imagine, much hair was lost when we realized our selects and joins weren't working.
Of course, we can go through and manually assign problem columns as bigint by trial and error, but we'd rather outright disable float altogether and force bigint instead, since we work with an extremely large number of tables, and sometimes an extremely large number of columns, that we can't spend time fact-checking individually. We basically never want a float type in any table definition, ever.
Any way to override the floating point type (either in pandas, sqlalchemy, or numpy) as bigint?
i.e.:
import pandas as pd
from sqlalchemy import create_engine
e = create_engine('mysql+pymysql://user:pass@host')
columns = ['foo', 'bar']
data = [
    [123456789, 'one'],
    [234567890, 'two'],
    [345678901, 'three']
]
df = pd.DataFrame(data=data, columns=columns)
df.to_sql('table', e, flavor='mysql', schema='schema', if_exists='replace')
Unfortunately, this code does not reproduce the effect; it was committed as bigint. The problem happens when loading data from certain csv or xls files, and it happens when doing a transfer from one MySQL database to another (latin1), which one would assume to be a one-to-one copy.
There's nothing to the code at all, it's just:
import pandas as pd
from sqlalchemy import create_engine
e = create_engine('mysql+pymysql://user:pass@host')
df = pd.read_sql('SELECT * FROM source_schema.source_table;', e)
df.to_sql('target_table', e, flavor='mysql', schema='target_schema')
Creating a testfile.csv:
thing1,thing2
123456789,foo
234567890,bar
345678901,baz
didn't reproduce the effect either. I know for a fact it happens with data from the NPPES Dissemination; perhaps it has to do with the encoding? I have to convert the files from WIN-1252 to UTF-8 in order for MySQL to even accept them.
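For reference, the "manually assign problem columns as bigint" workaround mentioned above is commonly expressed through to_sql's dtype= argument. A minimal sketch (column names reuse the example above, and the connection details are placeholders):
import pandas as pd
from sqlalchemy import create_engine
from sqlalchemy.types import BigInteger

e = create_engine('mysql+pymysql://user:pass@host')

df = pd.DataFrame(data=[[123456789, 'one'], [234567890, 'two']],
                  columns=['foo', 'bar'])

# Pin the key column to BIGINT so pandas cannot silently fall back to float
df.to_sql('table', e, schema='schema', if_exists='replace',
          dtype={'foo': BigInteger()})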
