I have the following code that creates a dataframe based on user input:
import pandas as pd

publications = pd.read_csv("C:/Users/nkambhal/data/pubmed_search_results_180730.csv", sep="|")
publications['title'] = publications['title'].fillna('')

search_term = input('Enter the term you are looking for: ')
title_mask = publications.title.str.lower().str.contains(search_term.lower())
new = publications.loc[title_mask, ['title', 'publication_ID']]
Now I want to use the publication IDs in the new dataframe to run this SQL query:
SELECT
    author_profile.*,
    pub_lst.*
FROM
    pub_lst
JOIN
    author_profile ON pub_lst.author_id = author_profile.author_id
WHERE
    pub_lst.publication_id IN (67855, 65559);
In the WHERE clause, I want to use the IDs from the new dataframe. So if the dataframe contains the publication_ids (5, 6, 4), I want them added to the query.
How can I add the appropriate publication_ids to the SQL query, run it through Python, and save the result to a csv file?
To put data into a string, you can use Python's str.format function; you can read a little more about it in the Python documentation.
For your query string, it should work out like so:
query_string = """
SELECT
    author_profile.*,
    pub_lst.*
FROM
    pub_lst
JOIN
    author_profile ON pub_lst.author_id = author_profile.author_id
WHERE
    pub_lst.publication_id IN {};
"""
print(query_string.format(str(tuple(new.publication_ID.values))))
Note one gotcha: a one-element tuple renders as (5,), which is not valid SQL, so strip the trailing comma in that case.
As for running the query, you will need a Python module for whichever database you want to connect to, such as PyMySQL for a MySQL database: https://pypi.org/project/PyMySQL/
Alternatively, you could use an ORM such as peewee or SQLAlchemy to make your life a little easier when dealing with SQL databases. Pandas and SQLAlchemy mix really well, but peewee is easier to get started with.
For creating a csv, you could use Python's built-in csv module, pandas, peewee, or SQLAlchemy, in ascending order of difficulty.
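Putting it together, here's a rough sketch of the whole round trip: build the query with the IDs, run it, and write a csv. It uses DB-API placeholders rather than string formatting, and an in-memory sqlite database with made-up pub_lst/author_profile rows as a stand-in for the real MySQL connection (with PyMySQL the placeholder style is %s rather than ?):

```python
import sqlite3
import pandas as pd

# Stand-in database; in the real case this would be your MySQL connection.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE pub_lst (publication_id INTEGER, author_id INTEGER, title TEXT);
    CREATE TABLE author_profile (author_id INTEGER, name TEXT);
    INSERT INTO pub_lst VALUES (5, 1, 'alpha'), (6, 2, 'beta'), (9, 1, 'gamma');
    INSERT INTO author_profile VALUES (1, 'Ann'), (2, 'Bob');
""")

# Hypothetical IDs, standing in for new['publication_ID'].tolist()
ids = [5, 6, 4]

# One placeholder per ID keeps the query parameterized (no string pasting)
placeholders = ", ".join("?" for _ in ids)
query = f"""
    SELECT author_profile.*, pub_lst.*
    FROM pub_lst
    JOIN author_profile ON pub_lst.author_id = author_profile.author_id
    WHERE pub_lst.publication_id IN ({placeholders})
"""

result = pd.read_sql(query, conn, params=ids)
result.to_csv("results.csv", index=False)  # save the matches to csv
```

Only IDs that actually exist in pub_lst come back (here 5 and 6; the made-up 4 simply matches nothing), so missing IDs are harmless.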
Here is my code
from sqlalchemy import create_engine
import pandas as pd
engine = create_engine("connection string")
conn_obj = engine.connect()
my_df = pd.DataFrame({'col1': ['29199'],
                      'date_created': ['2022-06-29 17:15:49.776867']})
my_df.to_sql('SomeSQLTable', conn_obj, if_exists='append', index=False)
I also created SomeSQLTable with script:
CREATE TABLE SomeSQLTable(
col1 nvarchar(90),
date_created datetime2)
GO
Everything runs fine, but no records are inserted into SQL table and no errors are displayed. I am not sure how to troubleshoot. conn_obj works fine, I was able to pull data.
This may not be exactly the answer, but I don't have the privilege to comment right now.
First, pd.DataFrame.to_sql() returns the number of rows affected by the operation; can you please check that?
Second, you are defining the data types in the table creation, so it could be a problem of casting the data types. I never create the table through SQL, as to_sql() can create it if needed.
Third, please check the table name; PascalCase names can be an issue in some databases.
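To illustrate the first point, here's a minimal sketch using an in-memory sqlite engine as a stand-in for the SQL Server connection string. It captures to_sql's return value and wraps the insert in engine.begin() so the transaction is committed; with SQLAlchemy 2.0, an uncommitted connection is a common reason for "no rows inserted, no errors":

```python
import pandas as pd
from sqlalchemy import create_engine, text

# Hypothetical engine; substitute your real mssql+pyodbc connection string
engine = create_engine("sqlite:///:memory:")

my_df = pd.DataFrame({'col1': ['29199'],
                      'date_created': ['2022-06-29 17:15:49.776867']})

# begin() opens a transaction and commits it on exit, so the insert persists
with engine.begin() as conn:
    rows = my_df.to_sql('SomeSQLTable', conn, if_exists='append', index=False)

print(rows)  # number of rows written (may be None with some drivers)

# Verify the row actually landed in the table
with engine.connect() as conn:
    count = conn.execute(text("SELECT COUNT(*) FROM SomeSQLTable")).scalar()
```

If count comes back 0 while rows was positive, the write happened but was rolled back, which points at a missing commit rather than a data-type problem.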
I'm querying my SSMS database from pandas, and the query I have is pretty huge, so I've saved it locally and want to read it into a pandas dataframe. There is also a date string in the query that I want to replace with a date I've already assigned in pandas. For reference's sake I'll shorten the query.
I'm currently following below:
query = """SELECT * FROM table where date > 'date_string' """
query_result = pd.read_sql(query, conn)
Instead of writing the select * ... query inline, I've saved it locally and want pandas to read it, and also to replace date_string with startDate_sql.
My date_string keeps changing as I'm looping through a list of dates.
The pandas code would look like
query = 'C:\Users\Admin\Desktop\Python\Open Source\query.sql'
query.replace(date_string, startDate_sql)
query_result = pd.read_sql(query, conn)
In this way I'm not writing my query in pandas as it is a huge query and consumes lot of space.
Can someone please tell me how to solve this and what is the correct syntax?
Thank you very much!
Reading a file in Python
Here's how to read in a text file in Python.
# use a raw string so '\U' in the path isn't treated as an escape sequence
query_filename = r'C:\Users\Admin\Desktop\Python\Open Source\query.sql'

# 'rt' means open for reading, in text mode
with open(query_filename, 'rt') as f:
    # read the query_filename file into a variable named query
    query = f.read()

# replace the literal string 'date_string' with the contents of the variable startDate_sql
query = query.replace('date_string', startDate_sql)

# get dataframe from database
query_result = pd.read_sql(query, conn)
Using parameterized queries
You should probably avoid string replacement when constructing queries, because it is vulnerable to SQL injection. Parameterized queries avoid this problem.
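For example, a minimal sketch of parameterization with pd.read_sql's params argument, using an in-memory sqlite database and a made-up sales table as a stand-in (sqlite and pyodbc both use ? placeholders; other drivers use %s or named styles):

```python
import sqlite3
import pandas as pd

# Stand-in for the real SSMS connection
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (id INTEGER, sale_date TEXT);
    INSERT INTO sales VALUES (1, '2021-01-05'), (2, '2021-03-10');
""")

startDate_sql = '2021-02-01'  # the value you would otherwise splice into the string

# '?' is a placeholder the driver fills in safely; no string replacement needed
query = "SELECT * FROM sales WHERE sale_date > ?"
query_result = pd.read_sql(query, conn, params=[startDate_sql])
```

When looping through a list of dates, you re-run the same query with different params values instead of rewriting the query text each time.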
I have a database that contains two tables, cdr and mtr. I want to join the two based on columns ego_id and alter_id, and output the result into another table in the same database, complete with the column names, without using pandas.
Here's my current code:
mtr_table = Table('mtr', MetaData(), autoload=True, autoload_with=engine)
print(mtr_table.columns.keys())
cdr_table = Table('cdr', MetaData(), autoload=True, autoload_with=engine)
print(cdr_table.columns.keys())
query = db.select([cdr_table])
query = query.select_from(
    mtr_table.join(
        cdr_table,
        (mtr_table.columns.ego_id == cdr_table.columns.ego_id) &
        (mtr_table.columns.alter_id == cdr_table.columns.alter_id)))
results = connection.execute(query).fetchmany()
Currently, for my test code, what I do is to convert the results as a pandas dataframe and then put it back in the original SQL database:
df = pd.DataFrame(results, columns=results[0].keys())
df.to_sql(...)
but I have two problems:
loading everything into a pandas dataframe would require too much memory once I start working with the full database
the column names are (apparently) not included in results and need to be accessed via results[0].keys()
I've checked this other stackoverflow question but it uses the ORM framework of sqlalchemy, which I unfortunately don't understand. If there's a simpler way to do this (like pandas' to_sql), I think this would be easier.
What's the easiest way to go about this?
So I found out how to do this via CREATE TABLE AS:
query = """
CREATE TABLE mtr_cdr AS
SELECT
    mtr.idx, cdr.*
FROM mtr INNER JOIN cdr
ON (mtr.ego_id = cdr.ego_id AND mtr.alter_id = cdr.alter_id)"""

with engine.connect() as conn:
    conn.execute(query)
The query string seems to be highly sensitive to parentheses, though: if I put parentheses around the whole SELECT ... FROM ... statement, it doesn't work.
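One caveat, assuming a recent SQLAlchemy: since 1.4/2.0, a raw string passed to conn.execute() must be wrapped in text(), and the transaction has to be committed. A self-contained sketch with tiny stand-in mtr/cdr tables in an in-memory sqlite database:

```python
from sqlalchemy import create_engine, text

engine = create_engine("sqlite:///:memory:")  # stand-in for the real engine

with engine.begin() as conn:  # begin() commits everything on exit
    # tiny stand-in tables so the sketch runs end to end
    conn.execute(text("CREATE TABLE mtr (idx INTEGER, ego_id INTEGER, alter_id INTEGER)"))
    conn.execute(text("CREATE TABLE cdr (ego_id INTEGER, alter_id INTEGER, val TEXT)"))
    conn.execute(text("INSERT INTO mtr VALUES (1, 10, 20)"))
    conn.execute(text("INSERT INTO cdr VALUES (10, 20, 'x'), (99, 99, 'y')"))

    # the CREATE TABLE AS join, wrapped in text()
    conn.execute(text("""
        CREATE TABLE mtr_cdr AS
        SELECT mtr.idx, cdr.*
        FROM mtr INNER JOIN cdr
          ON (mtr.ego_id = cdr.ego_id AND mtr.alter_id = cdr.alter_id)
    """))

    rows = conn.execute(text("SELECT * FROM mtr_cdr")).fetchall()
```

Only the matching (ego_id, alter_id) pair survives the join, and the new table keeps the column names, all without going through pandas.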
I'm trying to replace some old MSSQL stored procedures with python, in an attempt to take some of the heavy calculations off of the sql server. The part of the procedure I'm having issues replacing is as follows
UPDATE mytable
SET calc_value = tmp.calc_value
FROM dbo.mytable mytable INNER JOIN
#my_temp_table tmp ON mytable.a = tmp.a AND mytable.b = tmp.b AND mytable.c = tmp.c
WHERE (mytable.a = some_value)
and (mytable.x = tmp.x)
and (mytable.b = some_other_value)
Up to this point, I've made some queries with SQLAlchemy, stored those data in Dataframes, and done the requisite calculations on them. I don't know now how to put the data back into the server using SQLAlchemy, either with raw SQL or function calls. The dataframe I have on my end would essentially have to work in the place of the temporary table created in MSSQL Server, but I'm not sure how I can do that.
The difficulty is of course that I don't know of a way to join between a dataframe and a mssql table, and I'm guessing this wouldn't work so I'm looking for a workaround
As the pandas docs suggest:
from sqlalchemy import create_engine

engine = create_engine("mssql+pyodbc://user:password@DSN", echo=False)
dataframe.to_sql('tablename', engine, if_exists='replace')
The engine parameter for MSSQL is basically the connection string; see the SQLAlchemy documentation for the exact format.
The if_exists parameter is a bit tricky, since 'replace' actually drops the table first, then recreates it and inserts all the data at once.
Setting the echo attribute to True shows all the background logs and SQL statements.
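For the original UPDATE, one common workaround is exactly this: push the dataframe into a staging table with to_sql, then run the join-update in SQL so the server does the matching. A sketch assuming made-up key columns a, b, c and an in-memory sqlite stand-in (sqlite lacks the T-SQL UPDATE ... FROM join, so this uses a portable correlated subquery; on SQL Server you could keep the original join syntax):

```python
import pandas as pd
from sqlalchemy import create_engine, text

engine = create_engine("sqlite:///:memory:")  # stand-in for the mssql+pyodbc engine

# the dataframe of recalculated values plays the role of the #my_temp_table
tmp = pd.DataFrame({'a': [1], 'b': [2], 'c': [3], 'calc_value': [9.5]})

with engine.begin() as conn:
    conn.execute(text("CREATE TABLE mytable (a INT, b INT, c INT, calc_value REAL)"))
    conn.execute(text("INSERT INTO mytable VALUES (1, 2, 3, 0.0), (4, 5, 6, 0.0)"))

    # push the dataframe into a staging table on the server...
    tmp.to_sql('tmp_calc', conn, if_exists='replace', index=False)

    # ...then update by joining on the key columns; rows without a match
    # in the staging table are left untouched by the WHERE EXISTS guard
    conn.execute(text("""
        UPDATE mytable
        SET calc_value = (SELECT t.calc_value FROM tmp_calc t
                          WHERE t.a = mytable.a AND t.b = mytable.b AND t.c = mytable.c)
        WHERE EXISTS (SELECT 1 FROM tmp_calc t
                      WHERE t.a = mytable.a AND t.b = mytable.b AND t.c = mytable.c)
    """))

    updated = conn.execute(text("SELECT calc_value FROM mytable ORDER BY a")).fetchall()
```

This keeps the heavy calculation in Python while the join itself stays on the database side, so no dataframe-to-table join is ever needed.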
I'm using Python's Peewee ORM to work with a MySQL database. Peewee supplies an object called "fn" that allows you to make certain types of calls to the database. One of those calls I want to make is the following:
Blocks.select(Blocks, fn.Count(Blocks.height))
Where Blocks is a table in my database, which has a column named height. This syntax is taken straight from Peewee's documentation, namely
User.select(
User, fn.Count(Tweet.id))
located here http://peewee.readthedocs.org/en/latest/peewee/querying.html. Note that I also have the following lines at the top of my python file
import peewee
from peewee import *
from peewee import fn
Yet when I run this code, it doesn't work, and it spits out this
<class '__main__.Blocks'> SELECT t1.`height`, t1.`hash`, t1.`time`, t1.`confirmations`, t1.`size`, t1.`version`, t1.`merkleRoot`, t1.`numTX`, t1.`nonce`, t1.`bits`, t1.`difficulty`, t1.`chainwork`, t1.`previousBlockHash`, t1.`nextBlockHash`, Count(t1.`height`) FROM `blocks` AS t1 []
So this is really just printing out the column names that are returned by the select query.
What peewee code do I have to write to return the count of the number of rows in a table? I regret using peewee because it makes what should be simple queries impossibly hard to find the right syntax for.
Peewee lazily evaluates queries, so you need to coerce it to a list or iterate through it in order to retrieve results, e.g.
query = User.select(User, fn.Count(Tweet.id).alias('num_tweets'))
for user in query:
    print(user.username, user.num_tweets)

users = list(query)