I have an SQLite database with roughly 300 tables. Currently I am iterating through a list and appending the data.
Is there a faster / more Pythonic way of doing this?
df = []
for i in Ave.columns:
    try:
        df2 = get_mcap(i)
        df.append(df2)
        #print (i)
    except:
        pass
df = pd.concat(df, axis=0)
Ave is a dataframe whose columns are the list I want to iterate through.
def get_mcap(Ticker):
    cnx = sqlite3.connect('Market_Cap.db')
    df = pd.read_sql_query("SELECT * FROM '%s'" % (Ticker), cnx)
    df.columns = ['Date', 'Mcap-Ave', 'Mcap-High', 'Mcap-Low']
    df = df.set_index('Date')
    df.index = pd.to_datetime(df.index)
    cnx.close()
    return df
Before I post my solution, a quick warning: never use string manipulation to generate SQL queries unless it is absolutely unavoidable, and in those cases you need to be certain that you control the data used to format the strings and that it cannot make the query do something unintended.
With that said, this seems like one of those situations where you do need string formatting, since table names cannot be passed as query parameters. Just make sure there is no way for users to alter what is contained in your list of tables.
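For instance, one defensive option (a rough sketch, assuming the trusted list of table names comes from Ave.columns) is to check every name against that list before it ever reaches a query string:
# Only build SQL from names that appear in the trusted list
allowed_tables = set(Ave.columns)

def checked_table(name):
    if name not in allowed_tables:
        raise ValueError('Unexpected table name: {!r}'.format(name))
    return name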
Onto the solution. It looks like you can get your list of tables using:
tables = Ave.columns.tolist()
For my simple example, I'm going to use:
tables = ['table1', 'table2', 'table3']
Then use the following code to generate a single query:
query_template = 'select * from {}'
query_parts = []
for table in tables:
    query = query_template.format(table)
    query_parts.append(query)
full_query = ' union all '.join(query_parts)
Giving:
'select * from table1 union all select * from table2 union all select * from table3'
You can then simply execute this one query to get your results:
cnx = sqlite3.connect('Market_Cap.db')
df = pd.read_sql_query(full_query, cnx)
Then from here you should be able to set the index, convert to datetime etc, but now you only need to do these operations once rather than 300 times. I imagine the overall runtime of this should now be much faster.
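Putting it all together, a rough sketch of the whole flow might look like this (untested, and it assumes every table has the same four columns that get_mcap expects):
import sqlite3
import pandas as pd

tables = Ave.columns.tolist()
full_query = ' union all '.join('select * from {}'.format(t) for t in tables)

cnx = sqlite3.connect('Market_Cap.db')
df = pd.read_sql_query(full_query, cnx)
cnx.close()

# The post-processing from get_mcap, now done once on the combined frame
df.columns = ['Date', 'Mcap-Ave', 'Mcap-High', 'Mcap-Low']
df = df.set_index('Date')
df.index = pd.to_datetime(df.index)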
I would like to use a string as the column names for a pandas DataFrame.
The problem is that pandas interprets the string variable as a single column instead of multiple ones, and thus the error:
ValueError: 1 columns passed, passed data had 11 columns
The first part of my code is intended to get the column names from the MySQL database I am about to query:
cursor1.execute("SELECT GROUP_CONCAT(COLUMN_NAME) AS cols FROM INFORMATION_SCHEMA.COLUMNS WHERE TABLE_SCHEMA = 'or_red' AND TABLE_NAME = 'nomen_prefix'")
for colsTableMysql in cursor1.fetchall():
    colsTable = colsTableMysql[0]
    colsTable = "'" + colsTable.replace(",", "','") + "'"
The second part uses the created variable "colsTable":
cursor = connection.cursor()
cursor.execute("SELECT * FROM or_red.nomen_prefix WHERE C_emp IN ("+emplazamientos+")")
tabla = pd.DataFrame(cursor.fetchall(),columns=[colsTable])
#tabla = exec("pd.DataFrame(cursor.fetchall(),columns=["+colsTable+"])")
#tabla = pd.DataFrame(cursor.fetchall())
I have tried other approaches, like the use of exec(). In that case there is no error, but there is no data either, and the result of print(tabla) is None.
Is there any direct way of passing the columns dynamically as a string to a pandas DataFrame?
Thanks in advance
I am going to answer my question since I've already found the way.
The first part of my code is intended to get the column names from the MySQL database table I am about to query:
cursor1.execute("SELECT GROUP_CONCAT(COLUMN_NAME) AS cols FROM INFORMATION_SCHEMA.COLUMNS WHERE TABLE_SCHEMA = 'or_red' AND TABLE_NAME = 'nomen_prefix'")
for colsTableMysql in cursor1.fetchall():
    colsTable = colsTableMysql[0]
    colsTable = "'" + colsTable.replace(",", "','") + "'"
The second part uses the created variable "colsTable" as input in the statement to define the columns.
cursor = connection.cursor()
cursor.execute("SELECT * FROM or_red.nomen_prefix WHERE C_emp IN ("+emplazamientos+")")
tabla = eval("pd.DataFrame(cursor.fetchall(),columns=["+colsTable+"])")
Using eval, the string is parsed and evaluated as a Python expression.
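For what it's worth, the same result can probably be had without eval, since pd.DataFrame accepts a plain list of column names; a rough sketch (reusing the objects from the question, and skipping the quote-wrapping step):
# colsTableMysql[0] is the raw "col1,col2,..." string returned by GROUP_CONCAT
column_names = colsTableMysql[0].split(',')

cursor = connection.cursor()
cursor.execute("SELECT * FROM or_red.nomen_prefix WHERE C_emp IN (" + emplazamientos + ")")
tabla = pd.DataFrame(cursor.fetchall(), columns=column_names)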
I am using table merging in order to select items from my db against a list of parameter tuples. The query works fine, but cur.fetchall() does not return the entire table that I want.
For example:
data = (
    (1, '2020-11-19'),
    (1, '2020-11-20'),
    (1, '2020-11-21'),
    (2, '2020-11-19'),
    (2, '2020-11-20'),
    (2, '2020-11-21')
)
query = """
    with data(song_id, date) as (
        values %s
    )
    select t.*
    from my_table t
    join data d
    on t.song_id = d.song_id and t.date = d.date::date
"""
execute_values(cursor, query, data)
results = cursor.fetchall()
In practice, my list of tuples to check against is thousands of rows long, and I expect the response to also be thousands of rows long.
But I am only getting 5 rows back if I call cur.fetchall() at the end of this request.
I know that this is because execute_values batches the requests, but there is some strange behavior.
If I pass page_size=10 then I get 2 items back. And if I set fetch=True then I get no results at all (even though the rowcount does not match that).
My thought was to batch these requests, but the page_size for the batch is not matching the number of items that I'm expecting per batch.
How should I change this request so that I can get all the results I'm expecting?
Edit (years later, after much experience with this):
What you really want to do here is use the COPY command to bulk insert your data into a temporary table, then join on that temporary table just as you would on a normal table. With psycopg2 you can use the copy_expert method to perform the COPY. Here's roughly how you would do that...
Also, trust me when I say this: if SPEED is an issue for you, this is by far the fastest method out there; nothing else comes close.
Note that the code in this example is not tested.
import pandas as pd
from psycopg2 import sql

# cur is assumed to be an open psycopg2 cursor
df = pd.DataFrame('<whatever your dataframe is>')

# Start by creating the temporary table
string = '''
    create temp table mydata (
        item_id int,
        date date
    );
'''
cur.execute(string)

# Now you need to generate an sql string that will copy
# your data into the db
string = sql.SQL("""
    copy {} ({})
    from stdin (
        format csv,
        null "NaN",
        delimiter ',',
        header
    )
""").format(sql.Identifier('mydata'),
            sql.SQL(',').join([sql.Identifier(i) for i in df.columns]))

# Write your dataframe to the disk as a csv
df.to_csv('./temp_dataframe.csv', index=False, na_rep='NaN')

# Copy into the database
with open('./temp_dataframe.csv') as csv_file:
    cur.copy_expert(string, csv_file)

# Now your data should be in your temporary table, so we can
# perform our select like normal
string = '''
    select t.*
    from my_table t
    join mydata d
    on t.item_id = d.item_id and t.date = d.date
'''
cur.execute(string)
data = cur.fetchall()
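If you want the joined rows back as a dataframe rather than a list of tuples, one small sketch (assuming cur is the same psycopg2 cursor as above) is to reuse the cursor's column metadata:
# cur.description holds the column names of the last select
results_df = pd.DataFrame(data, columns=[col[0] for col in cur.description])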
I need to make the following report scalable:
query = """
(SELECT
'02/11/2019' as Week_of,
media_type,
campaign,
count(ad_start_ts) as frequency
FROM usotomayor.digital
WHERE ds between 20190211 and 20190217
GROUP BY 1,2,3)
UNION ALL
(SELECT
'02/18/2019' as Week_of,
media_type,
campaign,
count(ad_start_ts) as frequency
FROM usotomayor.digital
WHERE ds between 20190211 and 20190224
GROUP BY 1,2,3)
"""
#Converting to dataframe
query2 = spark.sql(query).toPandas()
query2
However, as you can see I cannot make this report scalable if I have a long list of dates for each SQL query that I need to union.
My first attempt at looping in a list of date variables into the SQL script is as follows:
dfys = ['20190217','20190224']
df2 = ['02/11/2019','02/18/2019']

for i in df2:
    date = i
    for j in dfys:
        date2 = j

query = f"""
SELECT
    '{date}' as Week_of,
    raw.media_type,
    raw.campaign,
    count(raw.ad_start_ts) as frequency
FROM usotomayor.digital raw
WHERE raw.ds between 20190211 and {date2}
GROUP BY 1,2,3
"""

#Converting to dataframe
query2 = spark.sql(query).toPandas()
query2
However, this is not working for me. I think I need to loop through the sql query itself, but I don't know how to do this. Can someone help me?
As a commenter said, "this is not working for me" is not very specific, so let's start by specifying the problem: you need to execute a query for each pair of dates, which means running these queries in a loop and saving each result (or actually unioning them, but then you would need to change your query logic).
You could do it like this:
dfys = ['20190217', '20190224']
df2 = ['02/11/2019', '02/18/2019']

query_results = list()
for week_of, end_ds in zip(df2, dfys):
    query = f"""
    SELECT
        '{week_of}' as Week_of,
        raw.media_type,
        raw.campaign,
        count(raw.ad_start_ts) as frequency
    FROM usotomayor.digital raw
    WHERE raw.ds between 20190211 and {end_ds}
    GROUP BY 1,2,3
    """
    query_results.append(spark.sql(query).toPandas())

query_results[0]
query_results[1]
Now you get a list of your results (query_results).
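If you would rather have one unioned report than a list of per-week frames, a small sketch of one way to combine them in pandas (instead of changing the SQL) is:
import pandas as pd

# Stack the per-week results into a single report, mirroring the UNION ALL version
report = pd.concat(query_results, ignore_index=True)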
I was wondering if you could help me with my problem.
I am trying to create a function that returns a data frame based on an SQL query.
My function is:
def request_data(str, conn):
    cur = conn.cursor()
    cur.execute(str)
    data = pd.read_sql_query(str, conn)
    return data
When I apply my function using the append method, I get what I expect as a result:
Tables_to_initialize = ['sales', 'stores', 'prices']
Tables = []
for x in Tables_to_initialize:
    Tables.append(request_data(
        "SELECT * FROM {i} WHERE d_cr_bdd = (SELECT MAX(d_cr_bdd) FROM {i});".format(i=x),
        conn))
But Tables is just a list that contains the data frames returned by my query. What I really want is to assign every element of the list to its name; for example,
Tables_to_initialize[0] = 'sales'
and I want Tables[0] to be available as an object named sales (a data frame).
Is there any method to assign objects inside the function or with append automatically? Or any other solution?
I really appreciate your help
Best regards,
To get a list of data frames based on the given query:
tables = [request_data("SELECT * FROM {i} WHERE d_cr_bdd = (SELECT MAX(d_cr_bdd) FROM {i});".format(i=i), conn) for i in Tables_to_initialize]
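If the goal is to look each frame up by its table name, one option (a rough sketch, assuming the request_data function and conn from the question) is a dict comprehension instead of a list:
# Map each table name to its data frame, e.g. tables_by_name['sales']
tables_by_name = {
    name: request_data(
        "SELECT * FROM {i} WHERE d_cr_bdd = (SELECT MAX(d_cr_bdd) FROM {i});".format(i=name),
        conn)
    for name in Tables_to_initialize
}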
I have two lists : one contains the column names of categorical variables and the other numeric as shown below.
cat_cols = ['stat','zip','turned_off','turned_on']
num_cols = ['acu_m1','acu_cnt_m1','acu_cnt_m2','acu_wifi_m2']
These are column names of a table in Redshift.
I want to pass these as a parameter to pull only the numeric columns from a table in Redshift (PostgreSQL), write that to a csv, and close the csv.
Next I want to pull only the cat_cols, open the csv, append to it, and close it.
my query so far:
#1.Pull num data:
seg = ['seg1','seg2']
sql_data = str(""" SELECT {num_cols} """ + """FROM public.""" + str(seg) + """ order by random() limit 50000 ;""")
df_data = pd.read_sql(sql_data, cnxn)
# Write to csv.
df_data.to_csv("df_sample.csv",index = False)
#2.Pull cat data:
sql_data = str(""" SELECT {cat_cols} """ + """FROM public.""" + str(seg) + """ order by random() limit 50000 ;""")
df_data = pd.read_sql(sql_data, cnxn)
# Append to df_seg.csv and close the connection to csv.
with open("df_sample.csv",'rw'):
## Append to the csv ##
This is the first time I am trying to do selective querying based on Python lists, and I am stuck on how to pass a list as the column names to select from a table.
Can someone please help me with this?
If you want to build the query as a string, in your case it will be better to use the format method or f-strings (requires Python 3.6+).
Here is an example for your case, using only the built-in format function:
seg = ['seg1', 'seg2']
num_cols = ['acu_m1','acu_cnt_m1','acu_cnt_m2','acu_wifi_m2']
query = """
SELECT {} FROM public.{} order by random() limit 50000;
""".format(', '.join(num_cols), seg)
print(query)
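The equivalent with an f-string (Python 3.6+, again using the whole seg list) would be:
query = f"""
    SELECT {', '.join(num_cols)} FROM public.{seg} order by random() limit 50000;
"""
print(query)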
If you want to use only one item from the seg list, use seg[0] or seg[1] when formatting the query.
I hope this will help you!