So I have an array of 200+ columns. How would I loop through this and create a table using pymysql?
Currently I am connected like this:
import pymysql
connection = pymysql.connect(
    host='my_host_name',
    user='my_username',
    password='my_password',
    port=0000,
    database='my_db')
columns = ['firstname', 'lastname', 'email', .... ]
cursor = connection.cursor()
sql = 'CREATE TABLE my_table (
# For each column in columns
)'
cursor.execute(sql)
Edit: I will loop through the columns first and append their appropriate data type
You can use join to combine all the column definitions into one statement:
columns = ['firstname VARCHAR(255)', 'lastname VARCHAR(255)'] # and so on
sql = 'CREATE TABLE my_table (' + ', '.join(columns) + ');'
Note that the resulting table is not even in 1NF (First Normal Form), as it doesn't have a PRIMARY KEY. It would be better to set one or more columns as PRIMARY KEY to reduce the risk of inconsistency.
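Putting that together with your edit, a minimal sketch might look like the following (assuming, just for illustration, that every raw column starts out as VARCHAR(255) and that a surrogate id column serves as the PRIMARY KEY):
columns = ['firstname', 'lastname', 'email']  # your 200+ raw column names
column_defs = ['id INT AUTO_INCREMENT PRIMARY KEY']
for col in columns:
    column_defs.append(col + ' VARCHAR(255)')  # substitute the appropriate type per column

sql = 'CREATE TABLE my_table (' + ', '.join(column_defs) + ');'
cursor = connection.cursor()
cursor.execute(sql)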
I have an Access table called "Cell_list" with a key column called "Cell_#". I want to read the table into a dataframe, but only the rows that match the indices specified in a Python list, "cell_numbers".
I tried several variations on:
import pyodbc
import pandas as pd
cell_numbers = [1,3,7]
cnn_str = r'Driver={Microsoft Access Driver (*.mdb,*.accdb)};DBQ=C:\folder\myfile.accdb;'
conn = pyodbc.connect(cnn_str)
query = ('SELECT * FROM Cell_list WHERE Cell_# in '+tuple(cell_numbers))
df = pd.read_sql(query, conn)
But no matter what I try I get a syntax error.
How do I do this?
Consider the best practice of parameterization, which is supported in pandas.read_sql:
# PREPARED STATEMENT, NO DATA
query = (
'SELECT * FROM Cell_list '
'WHERE [Cell_#] IN (?, ?, ?)'
)
# RUN SQL WITH BOUND PARAMS
df = pd.read_sql(query, conn, params=cell_numbers)
You can even build the qmark placeholders dynamically, depending on the length of cell_numbers:
qmarks = ', '.join('?' for _ in cell_numbers)
query = (
'SELECT * FROM Cell_list '
f'WHERE [Cell_#] IN ({qmarks})'
)
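Then run it with the same list bound as parameters, just as above:
df = pd.read_sql(query, conn, params=cell_numbers)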
Convert (join) cell_numbers to text:
cell_text = '(1,3,7)'
and concatenate this.
The finished SQL should read (you may need brackets around the weird field name Cell_#):
SELECT * FROM Cell_list WHERE [Cell_#] IN (1,3,7)
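A small sketch of that approach, building the text from the list instead of hard-coding it (bear in mind that string concatenation is less robust than the parameterized version above):
cell_text = '(' + ','.join(str(n) for n in cell_numbers) + ')'
query = 'SELECT * FROM Cell_list WHERE [Cell_#] IN ' + cell_text
df = pd.read_sql(query, conn)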
I have a table with around 10 columns. Sometimes I need to insert a row that has only 3 of the columns; the rest are not in the dict.
The data to be inserted is a dictionary named row:
(this insert is to avoid duplicates)
row = {'keyword':'abc','name':'bds'.....}
df = pd.DataFrame([row]) # df looks good, I see columns and 1 row.
engine = getEngine()
connection = engine.connect()
df.to_sql('temp_insert_data_index', connection, if_exists ='replace',index=False)
result = connection.execute(('''
INSERT INTO {t} SELECT * FROM temp_insert_data_index
ON CONFLICT DO NOTHING''').format(t=table_name))
connection.close()
Problem: when I don't have all the columns in the row (dict), it inserts the dict fields by position (a 3-key dict goes into the first 3 columns) and not into the matching columns. (I expect the keys in the dict to map to the DB columns.)
Why?
Consider explicitly naming the columns to be inserted in the INSERT INTO and SELECT clauses, which is best practice for SQL append queries. Doing so, the dynamic query should work for all columns or any subset. Below uses an f-string (available in Python 3.6+) for all interpolation into the larger SQL query:
# APPEND TO STAGING TEMP TABLE
df.to_sql('temp_insert_data_index', connection, if_exists='replace', index=False)
# STRING OF COMMA SEPARATED COLUMNS
cols = ", ".join(df.columns)
sql = (
f"INSERT INTO {table_name} ({cols}) "
f"SELECT {cols} FROM temp_insert_data_index "
"ON CONFLICT DO NOTHING"
)
result = connection.execute(sql)
connection.close()
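For the example row above (only the keys keyword and name), and assuming table_name is 'my_table' purely for illustration, the generated statement would read roughly:
INSERT INTO my_table (keyword, name)
SELECT keyword, name FROM temp_insert_data_index
ON CONFLICT DO NOTHING
so only those columns are populated and the remaining columns keep their defaults or NULL.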
I have a list that contains many lists in Python.
my_list = [['city', 'state'], ['tampa', 'florida'], ['miami','florida']]
The nested list at index 0 contains the column headers, and the rest of the nested lists contain the corresponding values. How would I insert this into SQL Server using pyodbc or sqlalchemy? I have been using pandas pd.to_sql and want to make this a process in pure Python. Any help would be greatly appreciated.
expected output table would look like:
city |state
-------------
tampa|florida
miami|florida
Since the column names are coming from your list, you have to build a query string to insert the values. Column names and table names can't be parameterized with placeholders (?).
import pyodbc
conn = pyodbc.connect(my_connection_string)
cursor = conn.cursor()
my_list = [['city', 'state'], ['tampa', 'florida'], ['miami','florida']]
columns = ','.join(my_list[0]) #String of column names
values = ','.join(['?'] * len(my_list[0])) #Placeholders for values
query = "INSERT INTO mytable({0}) VALUES ({1})".format(columns, values)
#Loop through rest of list, inserting data
for l in my_list[1:]:
    cursor.execute(query, l)
conn.commit() #save changes
Update:
If you have a large number of records to insert you can do that in one go using executemany. Change the code like this:
columns = ','.join(my_list[0]) #String of column names
values = ','.join(['?'] * len(my_list[0])) #Placeholders for values
#Bulk insert
query = "INSERT INTO mytable({0}) VALUES ({1})".format(columns, values)
cursor.executemany(query, my_list[1:])
conn.commit() #save changes
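As a side note (an assumption about your environment, not something stated in the question): if you are using pyodbc with Microsoft's ODBC Driver for SQL Server, enabling fast_executemany before the bulk insert can speed it up considerably:
cursor.fast_executemany = True  # pyodbc option; effective with drivers that support it
cursor.executemany(query, my_list[1:])
conn.commit()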
Assuming conn is already an open connection to your database:
cursor = conn.cursor()
for row in my_list[1:]:  # skip the header row
    cursor.execute('INSERT INTO my_table (city, state) VALUES (?, ?)', row)
cursor.commit()
Since the column names are the first element in the list, just do:
q = """CREATE TABLE IF NOT EXISTS stud_data (`{col1}` VARCHAR(250), `{col2}` VARCHAR(250));"""
sql_cmd = q.format(col1=my_list[0][0], col2=my_list[0][1])
mycursor.execute(sql_cmd)  # create the table with the two columns
Now to add the values to the table, do:
sql = "INSERT IGNORE INTO stud_data (city, state) VALUES (%s, %s)"
for i in range(1, len(my_list)):
    mycursor.execute(sql, (my_list[i][0], my_list[i][1]))
mydb.commit()  # commit on the connection object (assumed here to be mydb), not the cursor
print(mycursor.rowcount, "Record Inserted.")  # rowcount reflects the last insert only
I have a massive table (over 100B records), that I added an empty column to. I parse strings from another field (string) if the required string is available, extract an integer from that field, and want to update it in the new column for all rows that have that string.
At the moment, after data has been parsed and saved locally in a dataframe, I iterate on it to update the Redshift table with clean data. This takes approx 1sec/iteration, which is way too long.
My current code example:
conn = psycopg2.connect(connection_details)
cur = conn.cursor()
clean_df = raw_data.apply(clean_field_to_parse)
for ind, row in clean_df.iterrows():
    update_query = build_update_query(row.id, row.clean_int_1, row.clean_int_2)
    cur.execute(update_query)
where build_update_query is a function that generates the update query:
def build_update_query(id, int1, int2):
    query = """
        update tab_tab
        set
            clean_int_1 = {}::int,
            clean_int_2 = {}::int,
            updated_date = GETDATE()
        where id = {}
        ;
    """
    return query.format(int1, int2, id)
and where clean_df is structured like:
id | field_to_parse   | clean_int_1 | clean_int_2
1  | {'int_1': '2+1'} | 3           | np.nan
2  | {'int_2': '7-0'} | np.nan      | 7
Is there a way to update specific table fields in bulk, so that there is no need to execute one query at a time?
I'm parsing the strings and running the update statement from Python. The database is stored on Redshift.
As mentioned, consider pure SQL and avoid iterating through billions of rows: push the Pandas data frame to Postgres as a staging table and then run one single UPDATE across both tables. With SQLAlchemy, you can use DataFrame.to_sql to create a table replica of the data frame. You can even add an index on the join field, id, and drop the very large staging table at the end.
from sqlalchemy import create_engine
engine = create_engine("postgresql+psycopg2://myuser:mypwd@myhost/mydatabase")
# PUSH TO POSTGRES (SAME NAME AS DF)
clean_df.to_sql(name="clean_df", con=engine, if_exists="replace", index=False)
# SQL UPDATE (USING TRANSACTION)
with engine.begin() as conn:
    sql = "CREATE INDEX idx_clean_df_id ON clean_df(id)"  # plain Postgres only; Redshift does not support CREATE INDEX, so skip this there
    conn.execute(sql)

    sql = """UPDATE tab_tab t
             SET clean_int_1 = c.clean_int_1,
                 clean_int_2 = c.clean_int_2,
                 updated_date = GETDATE()
             FROM clean_df c
             WHERE c.id = t.id
          """
    conn.execute(sql)

    sql = "DROP TABLE IF EXISTS clean_df"
    conn.execute(sql)
engine.dispose()
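One caveat worth noting (general pandas/Redshift behavior, not part of the original code): to_sql issues plain INSERTs, which can itself be slow for a very large staging frame; batching helps, and for truly huge loads Redshift's COPY from S3 is usually the faster path. A possible tweak to the staging push:
# batched staging push; method="multi" packs many rows per INSERT statement
clean_df.to_sql(name="clean_df", con=engine, if_exists="replace",
                index=False, method="multi", chunksize=5000)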
I have created this table in Python 2.7. I use it to store unique name/value pairs. In some queries I search by name and in others I search by value. Let's say the SELECT queries are split 50-50. Is there any way to create a table with a double index (one index on names and another on values) so my program can find the data faster?
Here is the database and table creation:
import sqlite3
#-------------------------db creation ---------------------------------------#
db1 = sqlite3.connect('/my_db.db')
cursor = db1.cursor()
cursor.execute("DROP TABLE IF EXISTS my_table")
sql = '''CREATE TABLE my_table (
name TEXT DEFAULT NULL,
value INT
);'''
cursor.execute(sql)
sql = ("CREATE INDEX index_my_table ON my_table (name);")
cursor.execute(sql)
Or is there any other structure that makes value lookups faster?
You can create another index...
sql = ("CREATE INDEX index_my_table2 ON my_table (value);")
cursor.execute(sql)
I think the best way to make the searches faster is to create a single index on the two fields, like:
sql = ("CREATE INDEX index_my_table_name_value ON my_table (name, value)")
These are called Multi-Column Indices or Covering Indices.
see the (great) doc here: https://www.sqlite.org/queryplanner.html
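Whichever indexing strategy you choose, you can verify what SQLite actually uses with EXPLAIN QUERY PLAN. A minimal sketch against the table above (the literals 'foo' and 42 are just sample values):
# check which index (if any) each kind of lookup uses
cursor.execute("EXPLAIN QUERY PLAN SELECT value FROM my_table WHERE name = ?", ('foo',))
print(cursor.fetchall())
cursor.execute("EXPLAIN QUERY PLAN SELECT name FROM my_table WHERE value = ?", (42,))
print(cursor.fetchall())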