I am converting a CSV file into a pandas dataframe and then writing it out to a Postgres table.
The problem is that I am able to create the table in Postgres, but I am unable to select columns by name when querying it.
This is the sample code I have:
import pandas as pd
from sqlalchemy import create_engine
import psycopg2

engine = create_engine('postgresql://postgres:pwd@localhost:5432/test')

def convertcsvtopostgres(csvfileloc, table_name, delimiter):
    data = pd.read_csv(csvfileloc, sep=delimiter, encoding='latin-1')
    data.head()
    data1 = data.rename(columns=lambda x: x.strip())
    data1.to_sql(table_name, engine, index=False)

convertcsvtopostgres("Product.csv", "t_product", "~")
I can do a select * from test.t_product; but I am unable to do a select product_id from test.t_product;
I am not sure if that is happening because of the file's encoding and the conversion that results from it. Is there any way around this, since I do not want to specify the table structure each time?
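One thing worth checking (an assumption on my part, since the file itself isn't shown): Postgres folds unquoted identifiers to lower case, so if the CSV header is Product_ID, or if the file carries a UTF-8 BOM that the latin-1 read leaves embedded in the first column name, the stored column name will not match a plain product_id in a query. A minimal sketch of a workaround that normalizes the headers before to_sql:

# lower-case and trim the headers so Postgres queries don't need quoted identifiers
data1 = data.rename(columns=lambda x: x.strip().lower())
# alternatively, read the file with encoding='utf-8-sig' so a leading BOM is stripped
data1.to_sql(table_name, engine, index=False)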
I was looking for help here (and in many other places):
How to save Pandas dataframe to a hive table?
Pandas dataframe in pyspark to hive
How to insert a pandas dataframe into an existing Hive external table using Python (without PySpark)?
But I don't think I completely understood the proposals presented, because I failed with all of them.
What I am trying to do is:
1. Extract data from a Hive table in schema1 into a Python dataframe.
2. Do some operations on the columns and save the result as a pandas dataframe.
3. Export the pandas dataframe to a Hive table in schema2.
I made points 1-2 as follows:
1. Extract data from the Hive table into a Python dataframe.
import pandas as pd
import puretransport
import sqlalchemy as db  # alias assumed from the db.create_engine call below

transport = puretransport.transport_factory(host='my_host_name',
                                            port=10000,
                                            username='my_username',
                                            password='my_password',
                                            use_ssl=True)

engine = db.create_engine(f"hive://my_username@/schema1",
                          connect_args={'thrift_transport': transport})

print("Selecting data from table", end=" ")
tab1 = []
for chunk in pd.read_sql_query("""select * from schema1.my_table""",
                               con=engine, chunksize=5):
    tab1.append(chunk)
df = pd.concat(tab1)
print("DONE")
2. Do some operations on the columns and save the result as a pandas dataframe.
my_code_returning_dataframe...
3. Export the pandas dataframe to a Hive table in schema2.
what_should_i_do_there?
Thank you in advance for any help.
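For step 3, a minimal sketch of one approach, assuming the same pyhive/SQLAlchemy dialect and thrift transport shown above also accept writes (the target table name my_target_table is made up here, and per-row INSERTs into Hive can be very slow for large frames):

# second engine pointed at the target schema; a fresh transport_factory(...) call
# may be needed if the first transport cannot be reused for another connection
engine2 = db.create_engine("hive://my_username@/schema2",
                           connect_args={'thrift_transport': transport})
df.to_sql('my_target_table', con=engine2, schema='schema2',
          if_exists='append', index=False)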
I am trying to get an Oracle SQL database into Python so I can aggregate/analyze the data. Pandas would be really useful for this task. But any time I try to use my code, it just hangs and does not output anything. I am not sure if it's because I am using the cx_Oracle package together with the pandas package?
import cx_Oracle as cxo
import pandas as pd

dsn = cxo.makedsn('host.net',
                  '1111',
                  service_name='servicename')

conn = cxo.connect(user='Username',
                   password='password',
                   dsn=dsn)

c = conn.cursor()
a = c.execute("SELECT * FROM data WHERE date like '%20%'")
conn.close
df = pd.DataFrame(a)
head(df)
However, when I use the code below, it prints out the data I am looking for. I need to convert this data into a pandas data frame.
for row in c: print(row)
conn.close()
I am very new to python so any help will be really appreciated!!
To convert a cx_Oracle cursor to a dataframe you can use the following code.
from pandas import DataFrame

with conn.cursor() as cursor:
    cursor.execute("SELECT * FROM data WHERE date like '%20%'")
    df = DataFrame(cursor.fetchall())
    df.columns = [x[0] for x in cursor.description]
    print("I got %d lines " % len(df))
Note that I'm using the cursor as a context manager, so it will be closed automatically at the end of the block.
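A shorter alternative is to let pandas run the query itself (a sketch; passing a raw DBAPI connection works, although newer pandas versions warn that only SQLAlchemy connectables are officially supported):

import pandas as pd

# pandas executes the query and takes the column names from the cursor description for you
df = pd.read_sql("SELECT * FROM data WHERE date like '%20%'", con=conn)
print(df.head())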
I have a pandas dataframe of approximately 300,000 rows (20 MB), and want to write it to a SQL Server database.
I have the following code, but it is very slow to execute. I am wondering if there is a better way?
import pandas
import sqlalchemy

engine = sqlalchemy.create_engine('mssql+pyodbc://rea-eqx-dwpb/BIWorkArea?driver=SQL+Server')
df.to_sql(name='LeadGen Imps&Clicks', con=engine, schema='BIWorkArea',
          if_exists='replace', index=False)
If you want to speed up the process of writing into the SQL database, you can pre-set the dtypes of the table in your database based on the data types of your pandas DataFrame:
from sqlalchemy import types, create_engine

d = {}
for k, v in zip(df.dtypes.index, df.dtypes):
    if v == 'object':
        d[k] = types.VARCHAR(df[k].str.len().max())
    elif v == 'float64':
        d[k] = types.FLOAT(126)
    elif v == 'int64':
        d[k] = types.INTEGER()
Then
df.to_sql(name='LeadGen Imps&Clicks', con=engine, schema='BIWorkArea', if_exists='replace', index=False, dtype=d)
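Two other knobs that often help with mssql+pyodbc, as an aside beyond the dtype mapping (fast_executemany needs a reasonably recent SQLAlchemy and pyodbc, so treat this as a sketch to test rather than a guaranteed fix):

engine = create_engine(
    'mssql+pyodbc://rea-eqx-dwpb/BIWorkArea?driver=SQL+Server',
    fast_executemany=True)  # lets pyodbc send the rows in batches instead of one at a time
df.to_sql(name='LeadGen Imps&Clicks', con=engine, schema='BIWorkArea',
          if_exists='replace', index=False, dtype=d, chunksize=10000)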
I have a CSV file and I want to import this file into my sqlite3 database using Python. The column names of the CSV are the same as the column names of the database table; the following is the code I am using now.
import pandas

df = pandas.read_csv('Data.csv')
df.to_sql(table_name, conn, index=False)
However, it seems this command will import all of the data into the database. I am trying to insert only the data that does not already exist in the database. Is there a way to do that without iterating over every row of the CSV or the database?
Use the if_exists parameter.
df = pandas.read_csv('Data.csv')
df.to_sql(table_name, conn, if_exists='append', index=False)
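If the goal is to skip rows that are already in the table, if_exists='append' alone will not do that. One sketch, assuming the table has a unique key column (called id here purely for illustration), is to read the existing keys and filter the dataframe before appending:

import sqlite3
import pandas

conn = sqlite3.connect('my.db')   # hypothetical database path
table_name = 'my_table'           # hypothetical table name
df = pandas.read_csv('Data.csv')

# fetch the keys already present, keep only the rows whose key is new, then append those
existing = pandas.read_sql_query('SELECT id FROM ' + table_name, conn)
new_rows = df[~df['id'].isin(existing['id'])]
new_rows.to_sql(table_name, conn, if_exists='append', index=False)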
I can import into a dataframe like so:
df = pd.DataFrame(cursor.fetchall(), columns=['name', 'ts', 'open', 'close'])
All of the df column dtypes will be object.
I can convert them into the proper numeric types afterwards with:
df2 = df.apply(pd.to_numeric, errors='ignore')
df2.info()
Is there any way to do it on the fly, where I specify the datatypes, without having to do multiple calculations/lines of code?
Consider pandas' read_sql using an SQLAlchemy engine as the connection:
import pandas as pd
import psycopg2
from sqlalchemy import create_engine
engine = create_engine("postgresql+psycopg2://user:password#/dbname")
df = pd.read_sql('SELECT * FROM myTable', con=engine)
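Because the database driver already returns native Python numbers for numeric columns, the resulting dataframe does not need a separate to_numeric pass; a quick check (reusing the column names from the question purely for illustration):

df = pd.read_sql('SELECT name, ts, open, close FROM myTable', con=engine)
print(df.dtypes)  # numeric database columns come back as int64/float64 rather than object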