I currently have a function that reads an SQL file to execute a query on Google's BigQuery.
import pandas as pd
def func1(arg1, arg2):
    with open('query.sql', 'r') as sqlfile:
        sql_query = sqlfile.read()
    df = pd.read_gbq(sql_query.format(arg1=arg1, arg2=arg2))
query.sql
SELECT *
FROM bigquery.dataset
WHERE col1 = {arg1}
AND col2 = {arg2}
The dataset location is hardcoded in the SQL file itself, which makes it hard to change: if the dataset location moves, I have to go to each SQL file individually and manually edit its FROM clause, and since I have many SQL files this quickly becomes cumbersome.
So my question is: what is the best way to make the dataset location dynamic?
Ideally the dataset location would be a variable, but the question is where to put that variable. Is it better to pass it in as a function argument, i.e. give func1 one more argument called dataset_loc?
import pandas as pd
def func1(arg1, arg2, dataset_loc):
    with open('query.sql', 'r') as sqlfile:
        sql_query = sqlfile.read()
    df = pd.read_gbq(sql_query.format(arg1=arg1, arg2=arg2, dataset_loc=dataset_loc))
query.sql
SELECT *
FROM {dataset_loc}
WHERE col1 = {arg1}
AND col2 = {arg2}
I'd like to know the best way to go about doing this. Thank you.
If you are using the same functions to operate on different datasets, it is good practice to make the function "dataset agnostic", i.e. to pass the dataset as a parameter. To me, your second example is the right approach.
Also keep in mind that your application might be small now, but you should prepare for scaling up in the future, and you definitely don't want to have to write the same SQL query file for every one of your datasets.
It depends on your use case, but as a general rule it is recommended to manage an application's parameters outside of the code. Config files are typically used for this, and since you are using Python, a config-reading library (such as the standard library's configparser) makes them easy to load.
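For instance, a minimal sketch using Python's built-in configparser (the config.ini file name and the section/key names here are just assumptions for illustration):

# config.ini (hypothetical):
# [bigquery]
# dataset_loc = my_project.my_dataset.my_table

import configparser
import pandas as pd

config = configparser.ConfigParser()
config.read('config.ini')
dataset_loc = config['bigquery']['dataset_loc']  # hypothetical section/key names

def func1(arg1, arg2):
    with open('query.sql', 'r') as sqlfile:
        sql_query = sqlfile.read()
    # the FROM clause is filled from the config file instead of being hardcoded
    return pd.read_gbq(sql_query.format(arg1=arg1, arg2=arg2, dataset_loc=dataset_loc))

This way, changing the dataset location only requires editing the config file, not every SQL file or function call.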
Related: I have already read this post:
Power BI: How to use Python with multiple tables in the Power Query Editor?
I'd like to do something like this inside PBI:
import pandas as pd
import numpy as np
proc = cad_processo
moeda = tab_moeda_data
def name_of_the_function(proc, moeda):
    df = ...  # complex logic to build the new column goes here
    return df
Now I'll explain more in-depth what I'm doing:
I'm in a table called cad_processo and I need to apply a very complex function to create a new column in it, but in order to create this column I need another table called tab_moeda_data.
I tried what was explained in the post I quoted above, but I haven't been able to achieve anything so far.
In theory it's simple: import two tables and apply a function to create a new column. In practice, I'm not able to bring this second table (tab_moeda_data) into cad_processo so I can apply the function.
*I know that cad_processo is called dataset in this case
I only need to import another table (tab_moeda_data) to apply the function, that's it.
Can anyone help me?
This is a feature that the UI doesn't support, so you have to use the Advanced Editor.
But you can simply pass additional tables to Python.Execute, e.g.
MainTable = ...,
RunPythonscript = Python.Execute(pythonScript,[dataset=MainTable, otherTable=OtherTable]),
And they will be available as additional pandas DataFrames in your script.
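For illustration, here is a minimal sketch of what the script side could then look like (the proc_id merge key and the placeholder column are purely hypothetical, not part of your data):

# 'dataset' and 'otherTable' are injected by Python.Execute as pandas DataFrames
proc = dataset        # cad_processo
moeda = otherTable    # tab_moeda_data

def name_of_the_function(proc, moeda):
    # hypothetical example: bring columns from tab_moeda_data alongside cad_processo
    merged = proc.merge(moeda, on='proc_id', how='left')
    merged['new_column'] = merged['some_value']  # placeholder for your complex logic
    return merged

df = name_of_the_function(proc, moeda)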
This is my code:
def fun(df, file):
    symbol = df.select(df.SMBL).distinct().collect()
    for i in symbol:
        csv_data = df.filter(df.SMBL == i.SMBL)
        csv_data.write.csv('%s/' % (BUCKET_PATH))
Using collect() slows the process down. How can I access the 'SMBL' column without using collect()?
As far as I understand, you are trying to write files that are named based on the SMBL column of the dataframe. I suggest writing the dataframe with partitionBy(), specifying that column; it might be necessary to add a user-defined function based on the SMBL column in order to get the right partition naming.
Doing this, you don't need to call collect() before the write action.
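A minimal sketch of that approach, based on your code above (BUCKET_PATH is assumed to be defined as in your snippet):

# one sub-folder of CSV part files is written per distinct SMBL value,
# without pulling the distinct values back to the driver with collect()
df.write.partitionBy('SMBL').csv('%s/' % BUCKET_PATH)

Each partition ends up under a path like SMBL=<value>/, which gives you the per-symbol split you were building manually.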
Recently I've been writing a lot of SQL in my Python projects via pandasql. The minor issue I encounter is the different indentation scheme used in Python (4 spaces) and SQL (2 spaces)... so I cannot just press the Tab key.
As for comments, SQL uses -- while Python uses #. It would be nice if pressing Ctrl+/ inserted the appropriate inline comment.
I'm just wondering if it is possible to somehow add a tag so that VS Code knows that a certain part of the code in the .py file is actually a SQL statement and uses 2 spaces when I press Tab?
import pandas as pd
import pandasql as psql
df = pd.read_csv('data.csv',
                 parse_dates=['date'])

q = '''
SELECT
  *
FROM
  df
--WHERE
--date > '2019-01-01'
'''
psql.sqldf(q, locals())
Cheers!
Regarding the two-vs-four-space indentation, a simple answer is that you probably don't need to do this. Do you care about having your Python code indented with four spaces and your SQL indented with two? If you do care, why? There is no benefit to you in doing that. SQL is not indentation-sensitive as Python is. You could write your SQL query all on one line and it would work the same:
q = 'SELECT * FROM df'
Or you can use four spaces just like Python. You could even indent the whole thing just for fun:
q = '''
    SELECT
        *
    FROM
        df
'''
It just doesn't matter to a SQL engine, so write your code in the way you find most readable and convenient, and don't worry about trying to match some irrelevant coding standard just because your source file includes two different languages.
I do have one suggestion on your Python formatting. You're currently using column alignment, carefully lining things up with spaces so one function argument is below the prior one:
df = pd.read_csv('data.csv',
                 parse_dates=['date'])
In virtually any programming language, you are better off using only indentation instead of alignment. In an indentation-only style, that code might look like this:
df = pd.read_csv(
    'data.csv',
    parse_dates=['date']
)
What is the advantage of this style? Suppose you later decide that pd was a little too cryptic and you would prefer to spell out pandas for clarity. After you make that text substitution, the column-aligned code will look like this:
df = pandas.read_csv('data.csv',
                 parse_dates=['date'])
Oops. Now you have to manually go through every place you used that name and add extra spaces.
If you use the indentation-only style, this problem won't happen:
df = pandas.read_csv(
    'data.csv',
    parse_dates=['date']
)
Changing the name from pd to pandas didn't affect the formatting at all.
Update: I see you just added a question about making Ctrl+/ work in the embedded SQL code as well as the Python code. That one I don't have an answer for, sorry! I was only addressing your original question about the indentation. :-)
If you put your SQL queries in a different file with a different file extension you can change your settings on a per-file type basis, e.g. "[python]": {"editor.tabSize": 4}, "[sql]": {"editor.tabSize": 2}.
I'm building an interactive browser and editor for larger-than-memory datasets which will be later processed with Pandas. Thus, I'll need to have indexes on several columns that the dataset will be interactively sorted or filtered on (database indexes, not Pandas indexing), and I'd like the dataset file format to support cheap edits without rewriting most of the file. Like a database, only I want to be able to just send the files away afterwards in a Pandas-compatible format, without exporting.
So, I wonder if any of the formats that Pandas supports:
Have an option of building database-indexes on several columns (for sorting and filtering)
Can be updated 'in-place' or otherwise cheaply without shifting the rest of the records around
Preferably both of the above
What are my options?
I'm a complete noob in Pandas, and so far it seems that most of the formats are simply serialized sequential records, just like CSV, and at most can be sorted or indexed on one column. If nothing better comes up, I'll have to either build the indexes myself externally and juggle the edited rows manually before exporting the dataset, or dump the whole dataset in and out of a database—but I'd prefer avoiding both of those.
Edit: more specifically, it appears that Parquet has upper/lower bounds recorded for each column in each data page, and I wonder if these can be used as sort-of-indexes to speed up sorting on arbitrary columns, or whether other formats have similar features.
I would argue that parquet is indeed a good format for this situation. It maps well to the tabular nature of pandas dataframes, stores most common data in efficient binary representations (with optional compression), and is a standard, portable format. Furthermore, it allows you to load only those columns or "row groups" (chunks) you require. This latter gets to the crux of your problem.
Pandas' .to_parquet() will automatically store metadata relating to the indexing of your dataframe, and create the column max/min metadata as you suggest. If you use the fastparquet backend, you can use the filters= keyword when loading to select only some of the row-groups (this does not filter within row-groups)
pd.read_parquet('split', filters=[('cat', '==', '1')],
                engine='fastparquet')
(selects only row-groups where some values of field 'cat' are equal to '1')
This can be particularly efficient, if you have used directory-based partitioning on writing, e.g.,
out.to_parquet('another_dataset.parq', partition_on=['cat'],
               engine='fastparquet', file_scheme='hive')
Some of these options are only documented in the fastparquet docs, and maybe the API of that library implements slightly more than is available via the pandas methods; and I am not sure how well such options are implemented with the arrow backend.
Note further that you may wish to read/save your dataframes using dask's to/read_parquet methods. Dask will understand the index if it is 1D and automatically perform the equivalent of the filters= operation, loading only the relevant parts of the data on disk when you do filtering operations on the index. Dask is built to deal with data that does not easily fit into memory and to do computations in parallel.
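A minimal sketch with dask (assuming the partitioned dataset written above and the same 'cat' field):

import dask.dataframe as dd

# lazily open the parquet dataset; only the matching row-groups/partitions are read
ddf = dd.read_parquet('another_dataset.parq', engine='fastparquet',
                      filters=[('cat', '==', '1')])
subset = ddf.compute()  # materialise the selected part as a pandas DataFrame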
(in answer to some of the comments above: Pandas-SQL interaction is generally not efficient, unless you can push the harder parts of the computation into a fast DB backend - in which case you don't really have a problem)
EDIT: some specific notes:
parquet is not in general made for atomic record updating, but you could write chunks (row groups) of the whole separately, not via the pandas API (I think this is true for ALL of the writing format methods); see the sketch after these notes.
the "index" you speak of is not the same thing as a pandas index, but I think the information above shows that the sort of indexing parquet offers can still be useful for you.
If you decide to go the database route, SQLite is perfect since it ships with Python already, the driver API is in Python's standard library, and the file format is platform independent. I use it for all my personal projects.
The example below is modified from this tutorial on Pandas + sqlite3 and the pandas.io documentation:
# Code to create the db
import sqlite3
import pandas as pd
# Create a data frame
df = pd.DataFrame(dict(col1=[1,2,3], col2=['foo', 'bar', 'baz']))
df.index = ['row{}'.format(i) for i in range(df.shape[0])]
# Connect to your database
conn = sqlite3.connect("data/your_database.sqlite")
# Write the data to your table (overwrite old data)
df.to_sql('your_table', conn, if_exists='replace')
# Add more data
new_rows = pd.DataFrame(
    dict(col1=[3, 4], col2=['notbaz', 'grunt']),
    index=('row2', 'row3')
)
new_rows.to_sql('your_table', conn, if_exists='append') # `append`
This part is an aside in case you need more complex stuff:
# (oops - duplicate row 2)
# also ... need to quote "index" column name because it's a keyword.
query = 'SELECT * FROM your_table WHERE "index" = ?'
pd.read_sql(query, conn, params=('row2',))
# index col1 col2
# 0 row2 3 baz
# 1 row2 3 notbaz
# For more complex queries use pandas.io.sql
from pandas.io import sql
query = 'DELETE FROM your_table WHERE "col1" = ? AND "col2" = ?'
sql.execute(query, conn, params=(3, 'notbaz'))
conn.commit()
# close
conn.close()
When you or collaborators want to read from the database, just send them the file data/your_database.sqlite and this code:
# Code to access the db
import sqlite3
import pandas as pd
# Connect to your database
conn = sqlite3.connect("data/your_database.sqlite")
# Load the data into a DataFrame
query = 'SELECT * FROM your_table WHERE col1 BETWEEN ? and ?'
df = pd.read_sql_query(query, conn, params=(1,3))
# index col1 col2
# 0 row0 1 foo
# 1 row1 2 bar
# 2 row2 3 baz
I have a scientific model which I am running in Python which produces a lookup table as output. That is, it produces a many-dimensional 'table' where each dimension is a parameter in the model and the value in each cell is the output of the model.
My question is how best to store this lookup table in Python. I am running the model in a loop over every possible parameter combination (using the fantastic itertools.product function), but I can't work out how best to store the outputs.
It would seem sensible to simply store the output as a ndarray, but I'd really like to be able to access the outputs based on the parameter values not just indices. For example, rather than accessing the values as table[16][5][17][14] I'd prefer to access them somehow using variable names/values, for example:
table[solar_z=45, solar_a=170, type=17, reflectance=0.37]
or something similar to that. It'd be brilliant if I were able to iterate over the values and get their parameter values back - that is, being able to find out that table[16]... corresponds to the outputs for solar_z = 45.
Is there a sensible way to do this in Python?
Why don't you use a database? I have found MongoDB (and the official Python driver, Pymongo) to be a wonderful tool for scientific computing. Here are some advantages:
Easy to install - simply download the executables for your platform (2 minutes tops, seriously).
Schema-less data model
Blazing fast
Provides map/reduce functionality
Very good querying functionalities
So, you could store each entry as a MongoDB entry, for example:
{"_id":"run_unique_identifier",
"param1":"val1",
"param2":"val2" # etcetera
}
Then you could query the entries as you will:
import pymongo

# Connection was removed in newer pymongo versions; MongoClient is the current API
data = pymongo.MongoClient("localhost", 27017)["mydb"]["mycollection"]

for entry in data.find():   # this will yield all results
    print(entry["param1"])  # do something with param1
Whether or not MongoDB/pymongo are the answer to your specific question, I don't know. However, you could really benefit from checking them out if you are into data-intensive scientific computing.
If you want to access the results by name, you could use a Python nested dictionary instead of an ndarray and serialize it to a JSON text file using the json module.
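A minimal sketch of that approach (the parameter values mirror the ones in your question; note that JSON keys must be strings):

import json

model_output = 0.123  # placeholder for the value produced by the model

# nested dict keyed by parameter values: solar_z -> solar_a -> type -> reflectance
table = {'45': {'170': {'17': {'0.37': model_output}}}}

with open('lookup.json', 'w') as f:
    json.dump(table, f)

with open('lookup.json') as f:
    table = json.load(f)

value = table['45']['170']['17']['0.37']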
One option is to use a numpy ndarray for the data (as you do now), and write a parser function to convert the query values into row/column indices.
For example:
solar_z_dict = {...}
solar_a_dict = {...}
...
def lookup(dataArray, solar_z, solar_a, type, reflectance):
    # translate parameter values into array indices via the dicts above
    return dataArray[solar_z_dict[solar_z], solar_a_dict[solar_a], ...]
You could also convert to string and eval, if you want to have some of the fields to be given as "None" and be translated to ":" (to give the full table for that variable).
For example, rather than accessing the values as table[16][5][17][14]
I'd prefer to access them somehow using variable names/values
That's what numpy's dtypes are for:
import numpy as np
from sys import argv

dt = [('L', 'float64'), ('T', 'float64'), ('NMSF', 'float64'), ('err', 'float64')]
data = np.loadtxt(argv[1], dtype=dt)
Now you can access the data columns by name, e.g. data['L'], data['T'], data['NMSF'].
More info on dtypes:
http://docs.scipy.org/doc/numpy/reference/generated/numpy.dtype.html