Creating a table to query in Spark using Python

I'm trying to load a file directly from S3 and ultimately trying to just do some SparkSQL on it. Eventually, I plan on bringing in multiple files which will be multiple tables (1:1 map between files and tables).
So I've been following this tutorial, which is fairly good at describing each step. I'm a bit stuck on declaring the proper schema and on which variable, if any, is referred to in the FROM clause of the SQL statement.
Here's my code:
sqlContext.sql("CREATE TABLE IF NOT EXISTS region (region_id INT, name STRING, comment STRING)")
region = sc.textFile("s3n://thisisnotabucketname/region.tbl")
raw_data = sc.textFile("s3n://thisisnotabucketname/region.tbl")
csv_data = raw_data.map(lambda l: l.split("|"))
row_data = csv_data.map(lambda p: Row( region_id=int(p[0]), name=p[1], comment=p[2] ))
interactions_df = sqlContext.createDataFrame(row_data)
interactions_df.registerTempTable("interactions")
tcp_interactions = sqlContext.sql(""" SELECT region_id, name, comment FROM region WHERE region_id > 1 """)
tcp_interactions = sqlContext.sql(""" SELECT * FROM region """)
tcp_interactions.show()
And here's some sample data. There is no header:
0|AFRICA|lar deposits. blithely final packages cajole. regular waters are final requests. regular accounts are according to |
1|AMERICA|hs use ironic, even requests. s|
2|ASIA|ges. thinly even pinto beans ca|
tcp_interactions.show() is returning nothing. Just the header of region_id|name|comment|. What am I doing incorrectly? In the SQL statement, is region pointing to the region variable declared in the first line of code, or is it pointing to something else?
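For reference, a minimal sketch (not from the original post), assuming the same sqlContext, Row and csv_data as above: Spark SQL resolves the name a table was registered or created under, not the Python variable name, so the FROM clause needs the name passed to registerTempTable (here the hypothetical name region_data), while FROM region hits the empty Hive table created by the CREATE TABLE statement.
row_data = csv_data.map(lambda p: Row(region_id=int(p[0]), name=p[1], comment=p[2]))
region_df = sqlContext.createDataFrame(row_data)
region_df.registerTempTable("region_data")  # this registered name is what FROM resolves
sqlContext.sql("SELECT region_id, name, comment FROM region_data WHERE region_id > 1").show()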

Related

Using a column as a list in where clause in a sql query

I created a list from a column in Python, and I am trying to use it in the WHERE clause of a SQL query. The list is a list of account numbers.
creating a list from the df
data1
acc_d1= data1['ACCOUNT_NUMBER']
t1 = tuple(acc_d1)
My code for the SQL query in Python (I am using Zeppelin):
sql="""
select id_number from table
where account_number IN {}""".format(t1)
prog_list_d1 = pd.read_sql(sql, dbc)
When I create the list by manually typing the numbers,
acc_d1 = [12129530695080,12129530755769,12129516984649......]
t = tuple(acc_d1)
sql="""
select id_number from table
where account_number IN {}""".format(t)
prog_list_d1 = pd.read_sql(sql, dbc)
it works just fine. I am using Python in a Zeppelin notebook, and the data is being pulled from an Oracle database.
You will need parentheses around the list. I don't know Python but I would guess it would be simply:
sql="""
select id_number from table
where account_number IN ({})""".format(t)
And, by the way, really try to avoid this pattern. Varying-length lists in IN clauses cause big problems for cursor sharing and are hard on the shared pool; your DBA will not be happy if this happens with high frequency. It is far better to pull one account number at a time (with real bind variables, not string replacement), or, if you need millions, to load a temp table with the account numbers and then join from it to your main table to get all the rows you want in one pull, without listing them in the SQL itself.
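To illustrate the bind-variable suggestion, here is a minimal sketch (not from the original post), assuming the same dbc connection and acc_d1 list, a hypothetical table_name, and an Oracle driver such as cx_Oracle that accepts :name bind parameters:
import pandas as pd

sql = """
select id_number from table_name
where account_number = :acc
"""
frames = [pd.read_sql(sql, dbc, params={"acc": acc}) for acc in acc_d1]
prog_list_d1 = pd.concat(frames, ignore_index=True)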
The column in the df was an object. Changing the column type to string before converting it to a list worked. I kept everything else the same.
data4['account_number'] = data4['account_number'].astype(str)
prog_d4 = list(data4['account_number'])
prog_d4 = tuple(prog_d4)
sql="""
select account_number from table
where account_number IN {}""".format(prog_d4)
prog_list_d4 = pd.read_sql(sql, dbc)

Python SQL loop variables through multiple queries

I'm having trouble with a Python Teradata (tdodbc) query: looping through the same query with different variables and merging the results. I received good direction in another post and ended up here. My issue now is that the dataframe only ends up with the query results of the final variable in the loop, "state5". Unfortunately we have 5 states, each in its own database with the same schema. I can run the same query, but want to loop the variables so I can run it for all 5 states and return an appended result. This was easy using SAS macro variables and appending, but I need to bring the data to Python for EDA and data science.
import teradata as td
import pandas as pd

udaExec = td.UdaExec(appConfigFile="udaexec.ini")
with udaExec.connect("${dataSourceName}") as session:
    state_dataframes = []
    STATES = ["state1", "state2", "state3", "state4", "state5"]
    for state in STATES:
        query1 = """database my_db_{};"""
        query2 = """
        select top 10
        '{}' as state
        ,a.*
        from table_a
        """
        session.execute(query1.format(state))
        session.execute(query2.format(state))
    state_dataframes.append(pd.read_sql(query2, session))
    all_states_df = pd.concat(state_dataframes)
I was able to finally get this to work, although it may not be the most elegant way to do it. I did try to do the drop tables as a single variable "query5" but was receiving a DDL error. Once I separated each drop table into its own session.execute, it worked.
udaExec = td.UdaExec(appConfigFile="udaexec.ini")
with udaExec.connect("${dataSourceName}") as session:
    state_dataframes = []
    STATES = ["state1", "state2", "state3", "state4", "state5"]
    for state in STATES:
        query1 = """database my_db_{};"""
        query2 = """
        create set volatile table v_table
        ,no fallback, no before journal, no after journal as
        (
        select top 10
        '{}' as state
        ,t.*
        from table t
        )
        with data
        primary index (dw_key)
        on commit preserve rows;
        """
        query3 = """
        create set volatile table v_table_2
        ,no fallback, no before journal, no after journal as
        (
        select t.*
        from v_table t
        )
        with data
        primary index (dw_key)
        on commit preserve rows;
        """
        query4 = """
        select t.*
        from v_table_2 t
        """
        session.execute(query1.format(state))
        session.execute(query2.format(state))
        session.execute(query3)
        state_dataframes.append(pd.read_sql(query4, session))
        session.execute("DROP TABLE v_table")
        session.execute("DROP TABLE v_table_2")
    all_states_df = pd.concat(state_dataframes)
Edit for clarity: correcting the query in the question only required proper indentation. In my Teradata environment I have limited spool space, which requires building many volatile tables to break queries apart. Since I spent a good amount of time trying to solve this, I added to the answer to help others who may run into this scenario.
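For anyone hitting the same symptom, a minimal sketch of the simpler indentation fix described above (without the volatile tables), assuming the session, STATES, query1 and query2 from the question: keep the read inside the for loop and format the query it runs.
state_dataframes = []
for state in STATES:
    session.execute(query1.format(state))
    # read inside the loop so every state is appended, not just the last one
    state_dataframes.append(pd.read_sql(query2.format(state), session))
all_states_df = pd.concat(state_dataframes)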

Too many server roundtrips w/ psycopg2

I am making a script that should create a schema for each customer. I'm fetching all metadata from a database that defines what each customer's schema should look like, and then creating it. Everything is well defined: the types, names of tables, etc. A customer has many tables (e.g. address, customers, contact, item, etc.), and each table has the same metadata.
My procedure now:
get everything I need from the metadataDatabase.
In a for loop, create a table, and then ALTER TABLE to add each piece of metadata (this is done for each table).
Right now my script runs in about a minute for each customer, which I think is too slow. It has something to do with me having a loop, and in that loop, I’m altering each table.
I think that instead of me altering (which might be not so clever approach), I should do something like the following:
Note that this is just a stupid but valid example:
for table in tables:
    con.execute("CREATE TABLE IF NOT EXISTS tester.%s (%s, %s);", (table, "last_seen date", "valid_from timestamp"))
But it gives me this error (it seems like it reads the table name as a string in a string..):
psycopg2.errors.SyntaxError: syntax error at or near "'billing'"
LINE 1: CREATE TABLE IF NOT EXISTS tester.'billing' ('last_seen da...
Consider creating tables with a serial type (i.e., autonumber) ID field and then using ALTER TABLE for all other fields, with a combination of sql.Identifier for identifiers (schema names, table names, column names, function names, etc.) and regular format for data types, which are not literals in the SQL statement.
from psycopg2 import sql

# CREATE TABLE
query = """CREATE TABLE IF NOT EXISTS {shm}.{tbl} (ID serial)"""
cur.execute(sql.SQL(query).format(shm=sql.Identifier("tester"),
                                  tbl=sql.Identifier("table")))

# ALTER TABLE
items = [("last_seen", "date"), ("valid_from", "timestamp")]
query = """ALTER TABLE {shm}.{tbl} ADD COLUMN {col} {typ}"""
for item in items:
    # KEEP IDENTIFIER PLACEHOLDERS
    final_query = query.format(shm="{shm}", tbl="{tbl}", col="{col}", typ=item[1])
    cur.execute(sql.SQL(final_query).format(shm=sql.Identifier("tester"),
                                            tbl=sql.Identifier("table"),
                                            col=sql.Identifier(item[0])))
Alternatively, use str.join with list comprehension for one CREATE TABLE:
query = """CREATE TABLE IF NOT EXISTS {shm}.{tbl} (
"id" serial,
{vals}
)"""
items = [("last_seen", "date"), ("valid_from", "timestamp")]
val = ",\n ".join(["{{}} {typ}".format(typ=i[1]) for i in items])
# KEEP IDENTIFIER PLACEHOLDERS
pre_query = query.format(shm="{shm}", tbl="{tbl}", vals=val)
final_query = sql.SQL(pre_query).format(*[sql.Identifier(i[0]) for i in items],
shm = sql.Identifier("tester"),
tbl = sql.Identifier("table"))
cur.execute(final_query)
SQL (sent to database)
CREATE TABLE IF NOT EXISTS "tester"."table" (
"id" serial,
"last_seen" date,
"valid_from" timestamp
)
However, this becomes heavy as there are too many server roundtrips.
How many tables with how many columns are you creating that this is slow? Could you ssh to a machine closer to your server and run the python there?
I don't get that error. Rather, I get an SQL syntax error. A values list is for conveying data. But ALTER TABLE is not about data, it is about metadata. You can't use a values list there. You need the names of the columns and types in double quotes (or no quotes) rather than single quotes. And you can't have a comma between name and type. And you can't have parentheses around each pair. And each pair needs to be introduced with "ADD"; you can't have it just once. You are using the wrong tool for the job. execute_batch is almost the right tool, except it will use single quotes rather than double quotes around the identifiers. Perhaps you could add a flag to tell it to use quote_ident.
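As a hedged illustration of the syntax that comment describes (quoted identifiers, one ADD per column, commas between clauses), a sketch building a single ALTER TABLE with psycopg2.sql, reusing the cursor from the answer above and the question's hypothetical tester/billing names:
from psycopg2 import sql

items = [("last_seen", "date"), ("valid_from", "timestamp")]
adds = sql.SQL(", ").join(
    [sql.SQL("ADD COLUMN {} {}").format(sql.Identifier(name), sql.SQL(typ))
     for name, typ in items]
)
cur.execute(sql.SQL("ALTER TABLE {}.{} {}").format(
    sql.Identifier("tester"), sql.Identifier("billing"), adds))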
Not only is execute_values the wrong tool for the job, but I think Python in general might be as well. Why not just load from a .sql file?
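And a minimal sketch of that .sql-file suggestion, assuming the DDL is kept in a hypothetical schema.sql and con is the existing psycopg2 connection; psycopg2 sends the whole multi-statement string in a single roundtrip:
with open("schema.sql") as f, con.cursor() as cur:
    cur.execute(f.read())
con.commit()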

In Python, display output of SQL query as a table, just like it does in SQL

This seems like a basic function, but I'm new to Python, so maybe I'm not googling this correctly.
In Microsoft SQL Server, when you have a statement like
SELECT top 100 * FROM dbo.Patient_eligibility
you get a result like
Patient_ID | Patient_Name | Patient_Eligibility
67456 | Smith, John | Eligible
...
etc.
Etc.
I am running a connection to SQL through Python as such, and would like the output to look exactly the same as in SQL. Specifically - with column names and all the data rows specified in the SQL query. It doesn't have to appear in the console or the log, I just need a way to access it to see what's in it.
Here is my current code attempts:
import pyodbc
conn = pyodbc.connect(connstr)
cursor = conn.cursor()
sql = "SELECT top 100 * FROM [dbo].[PATIENT_ELIGIBILITY]"
cursor.execute(sql)
data = cursor.fetchall()
#Query1
for row in data:
    print(row[1])
#Query2
print(data)
#Query3
data
My understanding is that somehow the results of PATIENT_ELIGIBILITY are stored in the variable data. Query 1, 2, and 3 represent methods of accessing that data that I've googled for - again seems like basic stuff.
The results of #Query1 give me the list of the first column, without a column name in the console. In the variable explorer, 'data' appears as type List. When I open it up, it just says 'Row object of pyodbc module' 100 times, one for each row. Not what I'm looking for. Again, I'm looking for the same kind of view output I would get if I ran it in Microsoft SQL Server.
Running #Query2 gets me a little closer to this end. The results appear like a .csv file - unreadable, but it's all there, in the console.
Running #Query3, just the 'data' variable, gets me the closest result but with no column names. How can I bring in the column names?
More directly, how do I get 'data' to appear as a clean table with column names somewhere? Since this appears to be a basic SQL function, could you direct me to a SQL-friendly library to use instead?
Also note that neither of the Queries required me to know the column names or widths. My entire method here is attempting to eyeball the results of the Query and quickly check the data - I can't see that the Patient_IDs are loading properly if I don't know which column is patient_ids.
Thanks for your help!
This is more than one question; I'll try to help you and give some advice.
I am running a connection to SQL through Python as such, and would like the output to look exactly the same as in SQL.
You are mixing SQL as a language with the formatted output of some interactive SQL tool.
SQL itself does not say anything about the "look" of the data.
Also note that neither of the Queries required me to know the column names or widths. My entire method here is attempting to eyeball the results of the Query and quickly check the data - I can't see that the Patient_IDs are loading properly if I don't know which column is patient_ids.
Correct. cursor.fetchall returns only the data.
Field information can be read from cursor.description.
Read more in PEP 249.
how do I get 'data' to appear as a clean table with column names somewhere?
It depends on how you define "appear".
Do you want text output, an HTML page, or maybe a GUI?
For text output: you can read the column names from cursor.description and print them before the data (see the sketch below the pandas example).
If you want HTML/Excel/PDF/other - find some library/framework suiting your taste.
If you want an interactive experience similar to SQL tools - I recommend looking at jupyter-notebook + pandas.
Something like:
pandas.read_sql_query(sql, conn)
will give you a "clean table", nothing worse than what SQL Developer/SSMS/DBeaver/others give you.
We don't need any external libraries.
Refer to this for more details.
Print results in MySQL format with Python
However, the latest version of MySQL gives an error with that code, so I modified it.
Below is the query for the dataset
stri = "select * from table_name"
cursor.execute(stri)
data = cursor.fetchall()
mycon.commit()
The code below will print the dataset in tabular form:
def columnnm(name):
    v = "SELECT LENGTH(" + name + ") FROM table_name WHERE LENGTH(" + name + ") = (SELECT MAX(LENGTH(" + name + ")) FROM table_name) LIMIT 1;"
    cursor.execute(v)
    data = cursor.fetchall()
    mycon.commit()
    return data[0][0]

widths = []
columns = []
tavnit = '|'
separator = '+'

for cd in cursor.description:
    widths.append(max(columnnm(cd[0]), len(cd[0])))
    columns.append(cd[0])

for w in widths:
    tavnit += " %-" + "%ss |" % (w,)
    separator += '-' * w + '--+'

print(separator)
print(tavnit % tuple(columns))
print(separator)
for row in data:
    print(tavnit % row)
print(separator)

Build normalised MSSQL dB from CSV files in Python + Pandas + sqlAlchemy

I am learning by doing - Python, Pandas, SQL & Deep Learning. I want to build a database with data for a deep learning experiment (with Keras and Tensorflow). The source data is ~10GB (total) of forex timestamped bid/ask data in 8 CSV files with source information encoded as three 3-4 char strings for categories Contributor, Region and City.
I can connect to my (empty) MSSQL database via pyodbc and sqlAlchemy; I can read my CSV files into dataframes; I can create a simple table in the dB and even create one from a simple dataframe; I can convert the date and time fields into the milliseconds since epoch I want. (And, FWIW, I have already implemented a working toy LSTM model to adapt to the price data, and I also have some analytical functions I wrote and compiled in Mathematica; I'll either call the C from Python or get Mathematica to work directly on the database.)
The issue is putting the CSV data into the database. Since there are only a dozen or so different sources in each category, I believe I should put Contributor etc. into separate tables with e.g. Contributor_ID as ints (?) so that the data is stored compactly and queries like SELECT ... WHERE Region = "SHRUBBERY" are efficient. (AFAICT I definitely shouldn't use enums because I may get more sources & categories later.)
My question is - assuming the aforementioned high level of ignorance! - how can/should I a) create the tables and relationships using python and then b) populate those tables?
Optional extra: to save space, the CSV files omit the Region and City where the row values are the same as those for the row above. Reading the CSVs to collect just the source information (which takes about 50s for each category), I know how to deduplicate and dropna, but when I want to populate the dB, how can I most efficiently replace the NAs with the values from the previous row? A simple for loop would do it, but is there e.g. some way to "propagate" the last "real" value in a column to replace the NA using pandas?
CSV example:
Date Time Bid Price Ask Price Contributor Region City
04/02/2017 00:00.1 1.00266 1.00282 DCFX ASI AKL
04/02/2017 00:00.1 1.00263 1.0028 DCFX
04/02/2017 00:00.2 1.00224 1.00285 FXN NAM NYC
04/02/2017 00:00.2 1.00223 1.00288 FXN
All input gratefully received :)
Relational databases (RDBMS) aim to store data into related, logical groupings with a system of primary key/foreign keys to normalize storage which among other advantages maintains referential integrity (i.e., no orphaned records) and avoids repetition of stored data. For your situation, consider the following:
DATABASE DESIGN: Understand the workflow or "story" of your data pieces (e.g., which comes first/after in data entry) and construct the necessary schema of tables. Classic Database 101 example is the Customers-Products-Orders where many customers can purchase multiple products to fill many orders (1-to-many and many-to-many relationships) where primary keys of parent tables are the foreign key of child tables. Hence, aim for a schema layout as below from this SO answer.
For your needs, your schema may involve Contributors, Regions, Cities, Markets, Company (Ticker), and Prices. This step will make use of DDL commands (CREATE TABLE, CREATE INDEX, CREATE SCHEMA) which can be run in pyodbc cursors or sqlAlchemy engine calls, sufficing the connected user has such privileges.
But typically, database design commands are run in a specialized admin console/IDE or command-line tool rather than application-layer code like Python, such as SQL Server's Management Studio or sqlcmd; similarly, Oracle's SQL Developer/sqlplus, MySQL's Workbench/cli or PostgreSQL's PgAdmin/psql. Below is an example of the setup for the Prices table:
# INITIALIZE SQLALCHEMY ENGINE
from sqlalchemy import create_engine

connection_string = 'mssql+pyodbc://{}:{}@{}/{}'\
                    .format(db_user, db_password, db_server, db_database)
engine = create_engine(connection_string)
sql = """
CREATE TABLE Prices (
ID INT IDENTITY(1,1) PRIMARY KEY,
DateTime DATETIME,
BidPrice DOUBLE(10,4),
AskPrice DOUBLE(10,4),
ContributorID INT,
RegionID INT,
CityID INT,
CONSTRAINT FK_Contributor FOREIGN KEY (ContributorID) REFERENCES Contributors (ID),
CONSTRAINT FK_Region FOREIGN KEY (RegionID) REFERENCES Regions (ID),
CONSTRAINT FK_City FOREIGN KEY (CityID) REFERENCES Cities (ID)
)
"""
# SQL ACTION QUERY VIA TRANSACTION
with engine.begin() as conn:
conn.execute(sql)
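Note that the foreign keys above assume the Contributors, Regions and Cities lookup tables already exist. A hedged sketch of those parent tables, which would need to be created before Prices; the name columns and sizes are assumptions chosen to match the joins used further below:
with engine.begin() as conn:
    conn.execute("CREATE TABLE Contributors (ID INT IDENTITY(1,1) PRIMARY KEY, ContributorName VARCHAR(16) UNIQUE)")
    conn.execute("CREATE TABLE Regions (ID INT IDENTITY(1,1) PRIMARY KEY, RegionName VARCHAR(16) UNIQUE)")
    conn.execute("CREATE TABLE Cities (ID INT IDENTITY(1,1) PRIMARY KEY, CityName VARCHAR(16) UNIQUE)")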
DATA POPULATION: Because a dataset/dataframe, csv, or spreadsheet are NOT equivalent to a normalized RDBMS table but are actually queries of multiple tables, migration of these sources will require some SQL wrangling to align to your above schema. Simple upload of dataframes into SQL Server tables will lead to inefficient and repetitive storage. Therefore, consider below steps:
Staging Tables (using to_sql)
Use staging (temp) tables, which would be raw dumps from pandas. For the NAs issue, use the DataFrame or Series forward fill, ffill, to populate NAs from the rows above.
# FILL IN NAs IN ALL COLUMNS FROM PREVIOUS ROW
df = df.ffill() # OR df.fillna(method='ffill')
# FILL IN NAs FOR SPECIFIC COLUMNS
df['Region'] = df['Region'].ffill()
df['City'] = df['City'].ffill()
# DUMP DATA FRAME INTO STAGING TABLE
df.to_sql(name='pandas_prices_dump', con=engine, if_exists='replace', index=False)
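Before the final insert in the next step can join by name, the lookup tables also need rows. A hedged sketch populating them with the distinct values from the staging dump, using the same assumed column names as in the lookup-table sketch above:
lookups = [("Contributors", "ContributorName", "Contributor"),
           ("Regions", "RegionName", "Region"),
           ("Cities", "CityName", "City")]
with engine.begin() as conn:
    for table, name_col, src_col in lookups:
        conn.execute(
            "INSERT INTO {t} ({c}) "
            "SELECT DISTINCT {s} FROM pandas_prices_dump "
            "WHERE {s} IS NOT NULL".format(t=table, c=name_col, s=src_col))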
Migration to Final Tables (joining lookup tables by string names)
Then, run action queries (i.e., DML commands: INSERT INTO, UPDATE, DELETE) for populating final tables from staging, temp tables.
sql = """
INSERT INTO Prices (Datetime, BidPrice, AskPrice,
ContributorID, RegionID, CityID)
SELECT pd.Datetime, pd.BidPrice, pd.AskPrice, c.ID, r.ID, cy.ID
FROM pandas_prices_dump pd
INNER JOIN Contributors c
ON c.ContributorName = pd.Contributor
INNER JOIN Regions r
ON r.RegionName = pd.Region
INNER JOIN Cities cy
ON cy.CityName = pd.City
"""
# APPEND FINAL DATA
with engine.begin() as conn:
conn.execute(sql)
# DROP STAGING TABLE
with engine.begin() as conn:
conn.execute("DROP TABLE pandas_prices_dump")
Test/Check Final Tables (using read_sql, joining lookup tables by IDs)
# IMPORT INTO PANDAS (EQUIVALENT TO ORIGINAL df)
sql = """
SELECT p.Datetime, p.BidPrice, p.AskPrice,
c.ContributorName As Contributor, r.RegionName As Region,
cy.CityName As City
FROM Prices p
INNER JOIN Contributors c
ON c.ID = p.ContributorID
INNER JOIN Regions r
ON r.ID = p.RegionID
INNER JOIN Cities cy
ON cy.ID = p.CityID
"""
prices_data = pd.read_sql(sql, engine)
