Trying to run a script that contains a SQL query:
import example_script
example_script.df.describe()
example_script.df.info()
q1 = '''
example_script.df['specific_column'])
'''
job_config = bigquery.QueryJobConfig()
query_job = client.query(q1, job_config=job_config)
q = query_job.to_dataframe()
The issues I'm having: when I import it, how do I get that specific column name used as text? The query should then run against GBQ, but instead it's stuck in pandas formatting that Google doesn't want to read. Are there other options?
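For example, what I'm aiming for would look something like this (a minimal sketch, with my_dataset.my_table as a placeholder table name), where only the column name is interpolated as text into a plain SQL string:
import example_script
from google.cloud import bigquery

client = bigquery.Client()

# take the column name as plain text from the imported DataFrame
column_name = 'specific_column'  # or: example_script.df.columns[0]

# ordinary SQL; only the column *name* goes into the string, not pandas code
q1 = f"SELECT {column_name} FROM `my_dataset.my_table` LIMIT 1000"

q = client.query(q1).to_dataframe()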
I just started playing around with bigquery, and I am trying to pass the dataset id to the python client. It should be a pretty basic operation, but I can't find it on other threads.
In practice I would like to take the following example
# import packages
import os
from google.cloud import bigquery
# set current working directory to the one containing this script.
os.chdir(os.path.dirname(os.path.abspath(__file__)))
# initialize client object using the BigQuery key I generated from Google Cloud
google_credentials_path = 'bigquery-stackoverflow-DC-fdb49371cf87.json'
client = bigquery.Client.from_service_account_json(google_credentials_path)
# create simple query
query_job = client.query(
"""
SELECT
CONCAT(
'https://stackoverflow.com/questions/',
CAST(id as STRING)) as url,
view_count
FROM `bigquery-public-data.stackoverflow.posts_questions`
WHERE tags like '%google-bigquery%'
ORDER BY view_count DESC
LIMIT 10"""
)
# store results in dataframe
dataframe_query = query_job.result().to_dataframe()
and make it look something like
# import packages
import os
from google.cloud import bigquery
# set current working directory to the one containing this script.
os.chdir(os.path.dirname(os.path.abspath(__file__)))
# initialize client object using the BigQuery key I generated from Google Cloud
google_credentials_path = 'bigquery-stackoverflow-DC-fdb49371cf87.json'
client = bigquery.Client.from_service_account_json(google_credentials_path)\
.A_function_to_specify_id(bigquery-public-data.stackoverflow)
# create simple query
query_job = client.query(
"""
SELECT
CONCAT(
'https://stackoverflow.com/questions/',
CAST(id as STRING)) as url,
view_count
FROM `posts_questions` -- No dataset ID here anymore
WHERE tags like 'google-bigquery'
ORDER BY view_count DESC
LIMIT 10"""
)
# store results in dataframe
dataframe_query = query_job.result().to_dataframe()
The documentation eludes me, so any help would be appreciated.
The closest thing to what you're asking for is the default_dataset property of the query job config. The query job config is an optional object that can be passed into the query() method of the instantiated BigQuery client.
You don't set a default dataset as part of instantiating a client, because not all resources are dataset-scoped. You're implicitly working with a query job in your example, which is a project-scoped resource.
So, to adapt your sample a bit, it might look something like this:
# skip the irrelevant bits like imports and client construction
job_config = bigquery.QueryJobConfig(default_dataset="bigquery-public-data.stackoverflow")
sql = "SELECT COUNT(1) FROM posts_questions WHERE tags like 'google-bigquery'"
dataframe = client.query(sql, job_config=job_config).to_dataframe()
If you're issuing multiple queries against this same dataset you could certainly reuse the same job config object with multiple query invocations.
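For example (a minimal sketch reusing the client from above; posts_answers is just another table assumed to exist in the same public dataset):
job_config = bigquery.QueryJobConfig(default_dataset="bigquery-public-data.stackoverflow")

questions = client.query(
    "SELECT COUNT(1) AS n FROM posts_questions WHERE tags LIKE '%google-bigquery%'",
    job_config=job_config,
).to_dataframe()

answers = client.query(
    "SELECT COUNT(1) AS n FROM posts_answers",
    job_config=job_config,
).to_dataframe()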
I would like to get only the table names from BigQuery by using a wildcard in Python.
What I want to do is something like this:
from google.cloud import bigquery
bigquery_client = bigquery.Client()
table_names = bigquery_client.get_something().something('db_name.table_prefix_*')
Can I do something like this?
If I can, what should I do?
Table metadata details are available in INFORMATION_SCHEMA. You can run the following query and get the table names using the Python SDK.
Query:
SELECT table_name FROM YOUR_DATASET.INFORMATION_SCHEMA.TABLES where table_name like 'table_prefix_%'
Reference:
https://cloud.google.com/bigquery/docs/information-schema-tables
https://cloud.google.com/bigquery/docs/running-queries#python
I did as SANN3 suggested, like the following:
query = '''select table_name from my_dataset_name.INFORMATION_SCHEMA.TABLES where table_name like 'table_prefix_%' '''
table_names = bigquery_client.query(query).result().to_dataframe()
print(table_names)
Then I got the table_names :-)
I am new to the Python-SQL connectivity world. My goal is to retrieve data from SQL in a pandas DataFrame format by executing long SQL queries thru my python script.
Most of my SQL queries are long, with multiple interim temp tables before the final SELECT statement from the last temp table. When I run such a monolithic query in Python, I get an error saying:
"pandas.io.sql.DatabaseError: Execution failed on sql"
Though they run absolutely fine in MS SQL Management Studio.
I suspect this is due to the interim temp tables, because if I split my long query into two pieces (everything before the final SELECT in the first section and the final SELECT in the second section) and run the two sections sequentially, they run fine.
Can someone explain why that is, or alternatively what the best way is to run long queries with temp tables/views and retrieve the results in a pandas DataFrame?
Here is my sample Python code that ideally should take a file name as input and run the SQL to retrieve results in a data frame; however, it fails in the case of a query with temp tables:
import pyodbc as db
import pandas as pd
filename = 'file.sql'
username = 'XXXX'
password = 'YYYYY'
driver= '{ODBC Driver 13 for SQL Server}'
database = 'DB'
server = 'local'
conn = db.connect('DRIVER=' + driver + ';PORT=1433;SERVER=' + server +
                  ';DATABASE=' + database + ';UID=' + username + ';PWD=' + password)
fd = open(filename, 'r')
sqlfile = fd.read()
fd.close()
sqlcommand1 = sqlfile
df_table = pd.read_sql(sqlcommand1, conn)
If I break my SQL query into two pieces (one with all the temp tables and the second with the final SELECT), then it runs fine. Below is a modified version that splits the long query after finding '/**/', and it works fine.
"""
This function reads a SQL script from an external file and executes the
script in SQL. If the SQL script has a bunch of temp tables/views
followed by a SELECT statement to retrieve data from those views, then the input
SQL file should have '/**/' immediately before the final
SELECT statement. This is to ensure the final SELECT statement is executed on
the temporary views already run by Python.
Input is a SQL file name and output is a DataFrame.
"""
import pyodbc as db
import pandas as pd
filename = 'filename.sql'
username = 'XXXX'
password = 'YYYYY'
driver= '{ODBC Driver 13 for SQL Server}'
database = 'DB'
server = 'local'
conn = db.connect('DRIVER=' + driver + ';PORT=1433;SERVER=' + server +
                  ';DATABASE=' + database + ';UID=' + username + ';PWD=' + password)
fd = open(filename, 'r')
sqlfile = fd.read()
fd.close()
sql = sqlfile.split('/**/')
sqlcommand1 = sql[0]  # 1st section of the query, with the temp tables
sqlcommand2 = sql[1]  # 2nd section of the query, with the final SELECT statement
conn.execute(sqlcommand1)
df_table = pd.read_sql(sqlcommand2, conn)
Quick and dirty answer: if using T-SQL, put the line SET NOCOUNT ON at the beginning of your query.
As @Parfait mentioned above, the pandas read_sql method can only support one result set. However, when you generate a temp table in T-SQL, you create a result set of the form "(XX row(s) affected)", which is what causes your original query to fail. By setting NOCOUNT you eliminate any early returns and only get the results from your final SELECT statement.
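As a sketch, using the file reading and connection from the question, you can prepend the statement in Python so the .sql files themselves don't need editing:
with open(filename, 'r') as fd:
    sqlfile = fd.read()

# SET NOCOUNT ON suppresses the "(N row(s) affected)" messages from the
# temp-table statements, so pandas only sees the final SELECT's result set
df_table = pd.read_sql('SET NOCOUNT ON;\n' + sqlfile, conn)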
Alternatively, if you use a pyodbc cursor instead of pandas, you can use nextset() to skip the result sets from the temp table(s); see the pyodbc documentation for more info.
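A minimal sketch of that cursor-based approach, assuming the same connection and script file as above:
cursor = conn.cursor()
cursor.execute(sqlfile)

# advance past result sets that have no columns (e.g. row counts produced by
# the temp-table statements) until the final SELECT's result set is reached
while cursor.description is None and cursor.nextset():
    pass

columns = [col[0] for col in cursor.description]
rows = cursor.fetchall()
df_table = pd.DataFrame.from_records(rows, columns=columns)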
I was able to connect to SQL Server Analysis Services in Python using Microsoft.AnalysisServices.dll, but now I can't execute a query on the cube.
I've tried the Execute method as follows:
amoServer.Execute('select from finance')
After issuing the Execute method I get this error:
<Microsoft.AnalysisServices.XmlaError object at 0x000000000000002B [Microsoft.AnalysisServices.XmlaError]>
Note: I'm using IronPython with Python 2.7 on Windows Server 64Bit.
What's the problem?
It's better to use Microsoft.AnalysisServices.AdomdClient.dll and an MDX query,
and to load the query result into a DataSet from the System.Data assembly,
something like this:
import clr
clr.AddReference("Microsoft.AnalysisServices.AdomdClient.dll")
clr.AddReference("System.Data")
from Microsoft.AnalysisServices.AdomdClient import AdomdConnection, AdomdDataAdapter
from System.Data import DataSet
conn = AdomdConnection("Data Source=0.0.0.0;Catalog=MyCatalog;")
conn.Open()
cmd = conn.CreateCommand()
cmd.CommandText = "your mdx query" # in your case 'select from finance'
adp = AdomdDataAdapter(cmd)
datasetParam = DataSet()
adp.Fill(datasetParam)
conn.Close()
# datasetParam holds your result as a collection of tables
# each table has rows
# and each row has columns
print datasetParam.Tables[0].Rows[0][0]
I'm trying to upload a pandas data frame to an SQL table. It seemed to me that the pandas to_sql function is the best solution for larger data frames, but I can't get it to work. I can easily extract data, but I get an error message when trying to write it to a new table:
# connect to Exasol DB
exaString='DSN=exa'
conDB = pyodbc.connect(exaString)
# get some data from somewhere, works without error
sqlString = "SELECT * FROM SOMETABLE"
data = pd.read_sql(sqlString, conDB)
# now upload this data to a new table
data.to_sql('MYTABLENAME', conDB, flavor='mysql')
conDB.close()
The error message I get is
pyodbc.ProgrammingError: ('42000', "[42000] [EXASOL][EXASolution driver]syntax error, unexpected identifier_chain2, expecting
assignment_operator or ':' [line 1, column 6] (-1)
(SQLExecDirectW)")
Unfortunately I have no idea what the query that caused this syntax error looks like, or what else is wrong. Can someone please point me in the right direction?
(Second) EDIT:
Following Humayun's and Joris' suggestions, I now use pandas version 0.14 and SQLAlchemy in combination with the Exasol dialect (?). Since I am connecting to a defined schema, I am using the metadata option, but the program crashes with "Bus error (core dumped)".
engine = create_engine('exa+pyodbc://uid:passwd@exa/mySchemaName', echo=True)
# get some data
sqlString = "SELECT * FROM SOMETABLE" # SOMETABLE is a view in mySchemaName
df = pd.read_sql(sqlString, con=engine) # works
print engine.has_table('MYTABLENAME') # MYTABLENAME is a view in mySchemaName
# prints "True"
# upload it to a new table
meta = sqlalchemy.MetaData(engine, schema='mySchemaName')
meta.reflect(engine, schema='mySchemaName')
pdsql = sql.PandasSQLAlchemy(engine, meta=meta)
pdsql.to_sql(df, 'MYTABLENAME')
I am not sure about setting "mySchemaName" in create_engine(..), but the outcome is the same.
Pandas does not support the EXASOL syntax out of the box, so it needs to be changed a bit. Here is a working example of your code without SQLAlchemy:
import pyodbc
import pandas as pd
con = pyodbc.connect('DSN=EXA')
con.execute('OPEN SCHEMA TEST2')
# configure pandas to understand EXASOL as mysql flavor
pd.io.sql._SQL_TYPES['int']['mysql'] = 'INT'
pd.io.sql._SQL_SYMB['mysql']['br_l'] = ''
pd.io.sql._SQL_SYMB['mysql']['br_r'] = ''
pd.io.sql._SQL_SYMB['mysql']['wld'] = '?'
pd.io.sql.PandasSQLLegacy.has_table = \
lambda self, name: name.upper() in [t[0].upper() for t in con.execute('SELECT table_name FROM cat').fetchall()]
data = pd.read_sql('SELECT * FROM services', con)
data.to_sql('SERVICES2', con, flavor='mysql', index=False)
If you use the EXASolution Python package, then the code would look as follows:
import exasol
con = exasol.connect(dsn='EXA') # normal pyodbc connection with additional functions
con.execute('OPEN SCHEMA TEST2')
data = con.readData('SELECT * FROM services') # pandas data frame per default
con.writeData(data, table = 'services2')
The problem is that, also in pandas 0.14, the read_sql and to_sql functions cannot deal with schemas, but using Exasol without schemas makes no sense. This will be fixed in 0.15. If you want to use it now, look at this pull request: https://github.com/pydata/pandas/pull/7952
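For reference, once that change landed (pandas 0.15 and later), the schema can be passed directly to to_sql. A short sketch, reusing the SQLAlchemy engine setup from the question:
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('exa+pyodbc://uid:passwd@exa/mySchemaName')

df = pd.read_sql('SELECT * FROM SOMETABLE', con=engine)

# schema= is supported by to_sql from pandas 0.15 onwards
df.to_sql('MYTABLENAME', engine, schema='mySchemaName', index=False, if_exists='replace')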