I am trying to efficiently read the contents of some legacy DBs into a numpy (rec-)array. I was following the posts "What's the most efficient way to convert a MySQL result set to a NumPy array?" and "MySQLdb query to Numpy array".
Now it happens that some entries in the DB contain NULL, which is returned as None.
So np.fromiter fails like this, e.g.:
TypeError: long() argument must be a string or a number, not 'NoneType'
I would like to tell it how it should behave when it encounters None.
Is that even possible?
Here is (something like) my code:
cur = db.cursor()
query = ("SELECT a, b, c from Table;")
cur.execute(query)
dt = np.dtype([
    ('a', int),
    ('b', int),
    ('c', float),
])
r = np.fromiter(cur.fetchall(), count=-1, dtype=dt)
And I would like to be able to specify that the resulting array should contain np.nan when None is encountered in column 'c', and the number 9999 when None is found in column 'a' or 'b'. Is something like that possible?
Or is there another (beautiful) method to get MySQL DB contents into numpy arrays when some values are unknown?
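One possible way to get the behaviour described above is to replace None per column before the rows reach numpy. This is only a sketch with made-up sample rows: fill_none and defaults are names invented here, and np.array on a list is used instead of np.fromiter to keep the example simple.

```python
import numpy as np

dt = np.dtype([('a', int), ('b', int), ('c', float)])
# Per-column replacement for NULL, in the same order as the SELECT list.
defaults = (9999, 9999, np.nan)

def fill_none(rows, defaults):
    """Yield each row with None replaced by that column's default."""
    for row in rows:
        yield tuple(d if v is None else v for v, d in zip(row, defaults))

# Stand-in for cur.fetchall(); the real result set would go here instead.
rows = [(1, None, 2.5), (None, 4, None)]
r = np.array(list(fill_none(rows, defaults)), dtype=dt)
print(r)
```

Because the replacement happens row by row in a generator, the only extra copy of the data is the list handed to np.array.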
I would be very hesitant to suggest that this is the best way of doing this, but np.rec.fromrecords has worked well for me in the past.
The fix_duplicate_field_names function is there to ensure that numpy doesn't bork when MySQL returns multiple columns with the same name (it just fudges new names).
In the get_output function, the field names for the rec array are parsed out of the cursor description, after which numpy is left to decide the data types of the MySQL data.
import numpy as np

def fix_duplicate_field_names(names):
    """Fix duplicate field names by appending an integer to repeated names."""
    used = []
    new_names = []
    for name in names:
        if name not in used:
            new_names.append(name)
        else:
            # e.g. the second "id" becomes "id_1", the third "id_2"
            new_name = "%s_%d" % (name, used.count(name))
            new_names.append(new_name)
        used.append(name)
    return new_names

def get_output(cursor):
    """Get sql data in numpy recarray form."""
    if cursor.description is None:
        return None
    names = [i[0] for i in cursor.description]
    names = fix_duplicate_field_names(names)
    output = cursor.fetchall()
    if not output:
        return None
    return np.rec.fromrecords(output, names=names)
I am using table merging in order to select items from my db against a list of parameter tuples. The query works fine, but cur.fetchall() does not return the entire table that I want.
For example:
data = (
(1, '2020-11-19'),
(1, '2020-11-20'),
(1, '2020-11-21'),
(2, '2020-11-19'),
(2, '2020-11-20'),
(2, '2020-11-21')
)
query = """
with data(song_id, date) as (
values %s
)
select t.*
from my_table t
join data d
on t.song_id = d.song_id and t.date = d.date::date
"""
execute_values(cursor, query, data)
results = cursor.fetchall()
In practice, my list of tuples to check against is thousands of rows long, and I expect the response to also be thousands of rows long.
But I am only getting 5 rows back if I call cur.fetchall() at the end of this request.
I know that this is because execute_values batches the requests, but there is some strange behavior.
If I pass page_size=10 then I get 2 items back. And if I set fetch=True then I get no results at all (even though the rowcount does not match that).
My thought was to batch these requests, but the page_size for the batch is not matching the number of items that I'm expecting per batch.
How should I change this request so that I can get all the results I'm expecting?
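For what it's worth, the symptoms above look consistent with how execute_values is documented to work: it splits the argument list into pages of page_size rows (100 by default) and runs one statement per page, so a later cursor.fetchall() only sees the result of the last page. The sketch below models just that paging in plain Python. The pages helper and the sample data are made up for illustration and are not part of psycopg2.

```python
def pages(seq, page_size):
    """Yield consecutive slices of seq, each at most page_size long."""
    for start in range(0, len(seq), page_size):
        yield seq[start:start + page_size]

# 3 songs x 28 days = 84 parameter tuples (invented sample data)
data = [(song_id, '2020-11-%02d' % day)
        for song_id in (1, 2, 3)
        for day in range(1, 29)]

batches = list(pages(data, 10))
print(len(batches))      # number of statements that would be run
print(len(batches[-1]))  # parameter rows in the last statement only
```

Assuming psycopg2 2.8 or later, two ways out would be to pass page_size=len(data) so that everything goes into a single statement, or to pass fetch=True and use the list that execute_values returns instead of calling cursor.fetchall() afterwards.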
Edit: (years later after much experience with this)
What you really want to do here is use the COPY command to bulk insert your data into a temporary table, then join against that temporary table on both of your columns as you would with a normal table. With psycopg2 you can use the copy_expert method to perform the COPY. To reiterate (according to this example), here's how you would do that...
Also... trust me when I say this... if SPEED is an issue for you, this is by far, not-even-close, the fastest method out there.
code in this example is not tested
df = pd.DataFrame('<whatever your dataframe is>')
# Start by creating the temporary table
string = '''
create temp table mydata (
    item_id int,
    date date
);
'''
cur.execute(string)
# Now you need to generate an sql string that will copy
# your data into the db (requires: from psycopg2 import sql)
string = sql.SQL("""
copy {} ({})
from stdin (
    format csv,
    null 'NaN',
    delimiter ',',
    header
)
""").format(sql.Identifier('mydata'), sql.SQL(',').join([sql.Identifier(i) for i in df.columns]))
# Write your dataframe to the disk as a csv
df.to_csv('./temp_dataframe.csv', index=False, na_rep='NaN')
# Copy into the database
with open('./temp_dataframe.csv') as csv_file:
    cur.copy_expert(string, csv_file)
# Now your data should be in your temporary table, so we can
# perform our select like normal
string = '''
select t.*
from my_table t
join mydata d
on t.item_id = d.item_id and t.date = d.date
'''
cur.execute(string)
data = cur.fetchall()
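As a side note, the trip through the disk is probably not needed, since copy_expert accepts any file-like object. Here is a minimal sketch that builds the CSV in memory with the standard library. rows_to_csv_buffer is a name made up for this example, and the sample rows are invented.

```python
import csv
import io

def rows_to_csv_buffer(header, rows, null='NaN'):
    """Return an in-memory CSV file (with header row), rewound to the start."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator='\n')
    writer.writerow(header)
    for row in rows:
        # Write the same NULL marker that the COPY statement declares.
        writer.writerow([null if v is None else v for v in row])
    buf.seek(0)
    return buf

buf = rows_to_csv_buffer(['item_id', 'date'], [(1, '2020-11-19'), (2, None)])
print(buf.getvalue())
# With a live psycopg2 cursor, this would replace the open(...) block:
#     cur.copy_expert(string, buf)
```

If the data is already in a dataframe, df.to_csv(buf, index=False, na_rep='NaN') followed by buf.seek(0) should fill the same kind of buffer.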
Hello dear stackoverflow community,
here is my problem:
A) I have data in csv with some boolean columns;
unfortunately, the values in these columns are t or f (single letter);
this is an artifact (from Redshift) that I cannot control.
B) I need to create a spark dataframe from this data,
hopefully converting t -> true and f -> false.
For that, I create a Hive DB and a temp Hive table
and then SELECT * from it, like this:
sql_str = """SELECT * FROM {db}.{s}_{t} """.format(
db=hive_db_name, s=schema, t=table)
df = sql_cxt.sql(sql_str)
This works, I can print df, and it gives me all my columns with correct data types.
But:
C) If I create the table like this:
CREATE EXTERNAL TABLE IF NOT EXISTS {db}.{schema}_{table}({cols})
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|t'
STORED AS TEXTFILE
LOCATION ...
, this converts all my t and f to Nulls.
So:
D) I found out about LazySimpleSerDe that presumably must do what I mean (convert t and f to true and false on the fly). From https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties (quote):
"""
hive.lazysimple.extended_boolean_literal
Default Value: false
Added in: Hive 0.14 with HIVE-3635
LazySimpleSerDe uses this property to determine
if it treats 'T', 't', 'F', 'f', '1', and '0' as extended,
legal boolean literals, in addition to 'TRUE' and 'FALSE'.
The default is false, which means only 'TRUE' and 'FALSE'
are treated as legal boolean literals.
"""
According to this (or at least so I think), I now create a table in Hive DB like this:
create_table_sql = """
CREATE EXTERNAL TABLE IF NOT EXISTS {db_name}.{schema}_{table}({cols})
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES ("separatorChar" = "\|")
STORED AS TEXTFILE
LOCATION '{loc}'
TBLPROPERTIES ('hive.lazysimple.extended_boolean_literal'='true')
""".format(db_name=hive_db_name,
schema=schema,
table=table,
cols=",\n".join(cols),
loc=location)
return sql_cxt.sql(create_table_sql)
This does create a table,
I can again see all the columns with proper data types,
the df.count() is correct, but df.head(3) still
gives me all values for my boolean columns == Null.
(:___
I tried for hours different variants for my CREATE TABLE...
with or without SERDEPROPERTIES,
with or without TBLPROPERTIES,
with "FIELDS TERMINATED BY..." or without,
etc.
All give me either
Null in place of 't' and 'f', or
an empty df (nothing from df.head(5)), or
a syntax error, or
some 100 pages of Java exceptions.
The real problem is, I would say, that there is no single example of CREATE TABLE with LazySimpleSerDe
that does the job that is described in the docs.
I would really, really appreciate your help or any ideas. I pulled out almost all my hair.
Thank you in advance!
According to the patches in the JIRA issue (HIVE-3635), the property is enabled with a SET statement:
SET hive.lazysimple.extended_boolean_literal=true;
So for example, if you have a tab-delimited text file containing a header row, with 't'/'f' for true/false:
create table mytable(myfield boolean)
row format delimited
fields terminated by '\t'
lines terminated by '\n'
location '/path'
tblproperties (
'skip.header.line.count' = '1'
);
...
select count(*) from mytable where myfield is null; -- returns 100% null
...
SET hive.lazysimple.extended_boolean_literal=true;
select count(*) from mytable where myfield is null; -- the serde now interprets the booleans more forgivingly, so this yields a different count
I'm inserting rows this way, data being a dictionary of several fieldname: fieldvalue items:
def add_row(self, data, table): #data is a dictionary
columns = data.keys()
values = []
for column in columns:
if isinstance(data[column], list): #checking for json values
values.append(Json(data[column]))
elif isinstance(data[column], dict):
values.append(Json(data[column]))
else:
values.append(data[column])
insert_statement = 'insert into %s ' % table + '(%s) values %s'
self.cur.execute(insert_statement, (AsIs(','.join(columns)), tuple(values)))
self.conn.commit()
print "added %s" % table
But now I'd like to insert rows in bulk to improve performance and reduce I/O usage. The problem is that I couldn't find the right way to do it. The following function throws (data being a list of the items described above):
psycopg2.ProgrammingError: syntax error at or near "["
LINE 1: ...,category_id,initial_quantity,base_price) VALUES ([u'Entrega...
def add_row_bulk(self, data, table): #data is a dictionary
columns = data[0].keys()
value_rows = []
for e in data:
columns = e.keys()
values = []
for column in columns:
if isinstance(e[column], list): #checking for json values
values.append(Json(e[column]))
elif isinstance(e[column], dict):
values.append(Json(e[column]))
else:
values.append(e[column])
value_rows.append(AsIs(values))
cols = (AsIs(','.join(columns)))
query = self.cur.mogrify("INSERT INTO item (%s) VALUES %s", (cols, tuple(value_rows)))
self.cur.execute(query)
self.conn.commit()
print "added %s" % table
You have a couple of problems with your SQL-generating code.
First off, AsIs(values) will not mogrify into a value row, like you seem to be hoping. Testing it, it seems to be equivalent to AsIs(str(values)). That's the output you're seeing in your thrown error.
What worked in your working example was using mogrify on separate tuples of values. Add tuple(values) to value_rows, not AsIs(values).
Secondly, to specify the values for inserting a number of rows in one insert statement, you need SQL syntax similar to the following:
... VALUES (1, 'x'), (2, 'y'), (3, 'z')
Note that the list of value lists doesn't have ( ) around it. There's nothing (that I'm aware of) that's magically going to mogrify into a list like that. Certainly a tuple won't.
So you need to do something like:
self.cur.mogrify('INSERT INTO item (%s) VALUES %s,%s,%s,%s',
(cols, value_row1, value_row2, value_row3, value_row4))
which means you need to do a little more work to generate the two arguments to mogrify, because the number of rows isn't known in advance. To generate the first argument, you can do something like:
'INSERT INTO item (%s) VALUES ' + ','.join(['%s'] * len(value_rows))
And the second argument needs to be a sequence with the first value cols, and the rest the contents of value_rows. One way to get that:
[cols] + value_rows
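Putting the two pieces together, a small helper could look like the sketch below. build_insert_args is a made-up name, and plain Python values stand in for the adapted ones so the example runs without a database.

```python
def build_insert_args(cols, value_rows):
    """Return (template, args) for a multi-row INSERT into the item table."""
    # One %s for the column list, then one %s per row of values.
    template = 'INSERT INTO item (%s) VALUES ' + ','.join(['%s'] * len(value_rows))
    # Each row must be a tuple so that it is rendered as (v1, v2, ...).
    args = [cols] + [tuple(row) for row in value_rows]
    return template, args

template, args = build_insert_args('id,name', [[1, 'x'], [2, 'y'], [3, 'z']])
print(template)
print(args)
# With psycopg2 the column list would be wrapped as in the question:
#     self.cur.execute(template, [AsIs(args[0])] + args[1:])
```

Building the placeholder string from len(value_rows) is what removes the need to know the number of rows in advance.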
I am loading data from various sources (csv, xls, json etc...) into Pandas dataframes and I would like to generate statements to create and fill a SQL database with this data. Does anyone know of a way to do this?
I know pandas has a to_sql function, but that only works on a database connection, it can not generate a string.
Example
What I would like is to take a dataframe like so:
import pandas as pd
import numpy as np
dates = pd.date_range('20130101',periods=6)
df = pd.DataFrame(np.random.randn(6,4),index=dates,columns=list('ABCD'))
And a function that would generate this (this example is PostgreSQL but any would be fine):
CREATE TABLE data
(
index timestamp with time zone,
"A" double precision,
"B" double precision,
"C" double precision,
"D" double precision
)
If you only want the 'CREATE TABLE' sql code (and not the insert of the data), you can use the get_schema function of the pandas.io.sql module:
In [10]: print pd.io.sql.get_schema(df.reset_index(), 'data')
CREATE TABLE "data" (
"index" TIMESTAMP,
"A" REAL,
"B" REAL,
"C" REAL,
"D" REAL
)
Some notes:
I had to use reset_index because it otherwise didn't include the index
If you provide an sqlalchemy engine of a certain database flavor, the result will be adjusted to that flavor (eg the data type names).
GENERATE SQL CREATE STATEMENT FROM DATAFRAME
SOURCE = df
TARGET = 'data'
def SQL_CREATE_STATEMENT_FROM_DATAFRAME(SOURCE, TARGET):
    # SOURCE: source dataframe
    # TARGET: name of the target table to be created in the database
    import pandas as pd
    sql_text = pd.io.sql.get_schema(SOURCE.reset_index(), TARGET)
    return sql_text
Check the SQL CREATE TABLE Statement String
sql_text = SQL_CREATE_STATEMENT_FROM_DATAFRAME(SOURCE, TARGET)
print(sql_text)
GENERATE SQL INSERT STATEMENT FROM DATAFRAME
def SQL_INSERT_STATEMENT_FROM_DATAFRAME(SOURCE, TARGET):
    sql_texts = []
    for index, row in SOURCE.iterrows():
        sql_texts.append('INSERT INTO ' + TARGET + ' (' + ', '.join(SOURCE.columns) + ') VALUES ' + str(tuple(row.values)))
    return sql_texts
Check the SQL INSERT INTO Statement String
sql_texts = SQL_INSERT_STATEMENT_FROM_DATAFRAME(SOURCE, TARGET)
print('\n\n'.join(sql_texts))
Insert Statement Solution
Not sure if this is the absolute best way to do it, but it is more efficient than using df.iterrows(), which is very slow. It also takes care of nan values with the help of regular expressions.
import re
def get_insert_query_from_df(df, dest_table):
    insert = """
INSERT INTO `{dest_table}` (
""".format(dest_table=dest_table)
    columns_string = str(list(df.columns))[1:-1]
    columns_string = re.sub(r' ', '\n ', columns_string)
    columns_string = re.sub(r'\'', '', columns_string)
    values_string = ''
    for row in df.itertuples(index=False, name=None):
        values_string += re.sub(r'nan', 'null', str(row))
        values_string += ',\n'
    return insert + columns_string + ')\n VALUES\n' + values_string[:-2] + ';'
If you want to write the file yourself, you can also retrieve the column names and dtypes and build a dictionary to convert pandas data types to SQL data types.
As an example:
import pandas as pd
import numpy as np
dates = pd.date_range('20130101',periods=6)
df = pd.DataFrame(np.random.randn(6,4),index=dates,columns=list('ABCD'))
tableName = 'table'
columnNames = df.columns.values.tolist()
columnTypes = [dtype.name for dtype in df.dtypes.values]
# Storing column names and dtypes in a dataframe
tableDef = pd.DataFrame(index = range(len(df.columns) + 1), columns=['cols', 'dtypes'])
tableDef.iloc[0] = ['index', df.index.dtype.name]
tableDef.loc[1:, 'cols'] = columnNames
tableDef.loc[1:, 'dtypes'] = columnTypes
# Defining a dictionary to convert dtypes
conversion = {'datetime64[ns]':'timestamp with time zone', 'float64':'double precision'}
# Writing sql in a file
with open(r'yourdir\%s.sql' % tableName, 'w') as f:
    f.write('CREATE TABLE %s\n' % tableName)
    f.write('(\n')
    for i, row in tableDef.iterrows():
        sep = ",\n" if i < tableDef.index[-1] else "\n"
        f.write('\t"%s" %s%s' % (row['cols'], conversion[row['dtypes']], sep))
    f.write(')')
You can populate your table with INSERT INTO statements in the same way.
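If you want the statement as a string rather than a file, the dtype-to-SQL mapping idea can also be packaged as a small function. This is only a sketch: the CONVERSION table and create_table_sql are made up for this example, and the type names assume PostgreSQL.

```python
# Hypothetical mapping from pandas dtype names to PostgreSQL type names;
# extend it with whatever dtypes your data actually uses.
CONVERSION = {
    'datetime64[ns]': 'timestamp',
    'float64': 'double precision',
    'int64': 'bigint',
    'object': 'text',
}

def create_table_sql(table_name, columns):
    """Build a CREATE TABLE statement from (column name, dtype name) pairs."""
    lines = ['    "%s" %s' % (name, CONVERSION[dtype]) for name, dtype in columns]
    return 'CREATE TABLE "%s" (\n%s\n)' % (table_name, ',\n'.join(lines))

columns = [('index', 'datetime64[ns]'), ('A', 'float64'), ('B', 'float64')]
statement = create_table_sql('data', columns)
print(statement)
```

With a dataframe, the pairs could be collected with something like [(name, str(dtype)) for name, dtype in df.dtypes.items()].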
SINGLE INSERT QUERY SOLUTION
I didn't find the above answers to suit my needs. I wanted to create one single insert statement for a dataframe with each row as the values. This can be achieved by the below:
import re
import pandas as pd
table = 'your_table_name_here'
# You can read from CSV file here... just using read_sql_query as an example
df = pd.read_sql_query(f'select * from {table}', con=db_connection)
cols = ', '.join(df.columns.to_list())
vals = []
for index, r in df.iterrows():
    row = []
    for x in r:
        row.append(f"'{str(x)}'")
    row_str = ', '.join(row)
    vals.append(row_str)
f_values = []
for v in vals:
    f_values.append(f'({v})')
# Handle inputting NULL values
f_values = ', '.join(f_values)
f_values = re.sub(r"('None')", "NULL", f_values)
sql = f"insert into {table} ({cols}) values {f_values};"
print(sql)
db_connection.dispose()
If you're just looking to generate a string with inserts based on pandas.DataFrame - I'd suggest using bulk sql insert syntax as suggested by #rup.
Here's an example of a function I wrote for that purpose:
import pandas as pd
import re
def df_to_sql_bulk_insert(df: pd.DataFrame, table: str, **kwargs) -> str:
    """Converts DataFrame to bulk INSERT sql query
    >>> data = [(1, "_suffixnan", 1), (2, "Noneprefix", 0), (3, "fooNULLbar", 1, 2.34)]
    >>> df = pd.DataFrame(data, columns=["id", "name", "is_deleted", "balance"])
    >>> df
       id        name  is_deleted  balance
    0   1  _suffixnan           1      NaN
    1   2  Noneprefix           0      NaN
    2   3  fooNULLbar           1     2.34
    >>> query = df_to_sql_bulk_insert(df, "users", status="APPROVED", address=None)
    >>> print(query)
    INSERT INTO users (id, name, is_deleted, balance, status, address)
    VALUES (1, '_suffixnan', 1, NULL, 'APPROVED', NULL),
           (2, 'Noneprefix', 0, NULL, 'APPROVED', NULL),
           (3, 'fooNULLbar', 1, 2.34, 'APPROVED', NULL);
    """
    df = df.copy().assign(**kwargs)
    columns = ", ".join(df.columns)
    tuples = map(str, df.itertuples(index=False, name=None))
    values = re.sub(r"(?<=\W)(nan|None)(?=\W)", "NULL", (",\n" + " " * 7).join(tuples))
    return f"INSERT INTO {table} ({columns})\nVALUES {values};"
By the way, it converts nan/None entries to NULL and it's possible to pass constant column=value pairs as keyword arguments (see status="APPROVED" and address=None arguments in docstring example).
Generally, it works faster because a database does a lot of work for every single insert: checking the constraints, building indices, flushing, writing to the log, etc. These operations can be optimized by the database when several rows are inserted in one statement, rather than calling the engine row by row.
Taking the user #Jaris's post to get the CREATE, I extended it further to work for any CSV
import sqlite3
import pandas as pd
db = './database.db'
csv = './data.csv'
table_name = 'data'
# create db and setup schema
df = pd.read_csv(csv)
create_table_sql = pd.io.sql.get_schema(df.reset_index(), table_name)
conn = sqlite3.connect(db)
c = conn.cursor()
c.execute(create_table_sql)
conn.commit()
# now we can insert data
def insert_data(row, c):
    values = str(row.name)+','+','.join([str('"'+str(v)+'"') for v in row])
    sql_insert=f"INSERT INTO {table_name} VALUES ({values})"
    try:
        c.execute(sql_insert)
    except Exception as e:
        print(f"SQL:{sql_insert} \n failed with Error:{e}")
# use apply to loop over dataframe and call insert_data on each row
df.apply(lambda row: insert_data(row, c), axis=1)
# finally commit all those inserts into the database
conn.commit()
Hopefully this is more simple than the alternative answers and more pythonic!
If you can forgo generating an intermediate representation of the SQL statement, you can also just execute the insert statement outright (note: one ? per column).
con.executemany("INSERT OR REPLACE INTO data (A, B, C, D) VALUES (?, ?, ?, ?)", df_.values.tolist())
This worked a little better as there is less messing around with string generation.
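To make that concrete, here is a self-contained sketch against an in-memory SQLite database. The table layout and the sample rows are invented for the example.

```python
import sqlite3

# Invented sample rows; None stands for a missing value.
rows = [
    ('2013-01-01', 0.5, None),
    ('2013-01-02', 1.5, 2.0),
]

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE data ("index" TEXT, "A" REAL, "B" REAL)')
# One "?" per column; None is stored as NULL without any string handling.
conn.executemany('INSERT INTO data VALUES (?, ?, ?)', rows)
conn.commit()

total = conn.execute('SELECT COUNT(*) FROM data').fetchone()[0]
null_count = conn.execute('SELECT COUNT(*) FROM data WHERE "B" IS NULL').fetchone()[0]
print(total, null_count)
```

Since the driver binds the values, there is no quoting or nan/None regex step to get wrong.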
Is this the correct way to get a list from a SQL query in Python 2.7? Using a loop just seems somehow spurious. Is there a neater better way?
import numpy as np
import pyodbc as SQL
from datetime import datetime
con = SQL.connect('Driver={SQL Server};Server=MyServer; Database=MyDB; UID=MyUser; PWD=MyPassword')
cursor = con.cursor()
#Function to convert the unicode dates returned by SQL Server into Python datetime objects
ConvertToDate = lambda s:datetime.strptime(s,"%Y-%m-%d")
#Parameters
Code = 'GBPZAR'
date_query = '''
SELECT DISTINCT TradeDate
FROM MTM
WHERE Code = ?
and TradeDate > '2009-04-08'
ORDER BY TradeDate
'''
#Get a list of dates from SQL
cursor.execute(date_query, [Code])
rows = cursor.fetchall()
Dates = [None]*len(rows) #Initialize array
r = 0
for row in rows:
    Dates[r] = ConvertToDate(row[0])
    r += 1
Edit:
What about when I want to put a query into a structured array? At the moment I do something like this:
#Initialize the structured array
AllData = np.zeros(num_rows, dtype=[('TradeDate', datetime),
('Expiry', datetime),
('Moneyness', float),
('Volatility', float)])
#Iterate through the record set using the cursor and populate the structure array
r = 0
for row in cursor.execute(single_date_and_expiry_query, [TradeDate, Code, Expiry]):
    AllData[r] = (ConvertToDate(row[0]), ConvertToDate(row[1])) + row[2:] #Convert the date columns and concatenate the numeric columns
    r += 1
There is no need to pre-create a list, you could use list.append() instead. This also avoids having to keep a counter to index into Dates.
I'd use a list comprehension here, looping directly over the cursor to fetch rows:
cursor.execute(date_query, [Code])
Dates = [datetime.strptime(r[0], "%Y-%m-%d") for r in cursor]
You may want to add .date() to the datetime.strptime() result to get datetime.date objects instead.
Iterating over the cursor is preferable as it avoids loading all rows as a list into memory, only to replace that list with another, processed list of dates. See the cursor.fetchall() documentation:
Since this reads all rows into memory, it should not be used if there are a lot of rows. Consider iterating over the rows instead.
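As a quick illustration of both points (looping over the rows lazily and calling .date() on the parsed value), here is a sketch in which a generator stands in for the database cursor. fake_cursor and its rows are made up.

```python
from datetime import datetime

def fake_cursor():
    """Stand-in for a DB cursor: yields one row tuple at a time."""
    yield ('2009-04-09',)
    yield ('2009-04-14',)

# Rows are consumed one by one; only the converted dates are kept.
dates = [datetime.strptime(row[0], "%Y-%m-%d").date() for row in fake_cursor()]
print(dates)
```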
To produce your numpy array, don't prepopulate. Instead, build the converted rows with a generator and hand them to numpy.array() as a list (NumPy will not consume a generator directly here):
dtype=[('TradeDate', datetime), ('Expiry', datetime),
       ('Moneyness', float), ('Volatility', float)]
dt = lambda v: datetime.strptime(v, "%Y-%m-%d")
filtered_rows = ((dt(r[0]), dt(r[1])) + tuple(r[2:]) for r in cursor)
all_values = np.array(list(filtered_rows), dtype=dtype)
For future reference, you can use enumerate() to produce a counter with a loop:
for r, row in enumerate(rows):
    # r starts at 0 and counts along