So I am building a database for a larger program and do not have much experience in this area of coding (mostly embedded systems programming). My task is to import a large Excel file into Python. Because it is so large, I'm assuming I must convert it to a CSV, then cut it down by parsing and partitioning it, and then import it piece by piece to avoid crashing my computer. Once the file is imported, I must be able to extract/search specific information based on the column titles. There are other user-interactive aspects that are simply string based, so not very difficult. As for the rest, I am getting the picture but would like a more efficient and specific design. Can anyone offer me guidance on this?
An Excel or CSV file can be read into Python using pandas. The data is stored as rows and columns in a structure called a DataFrame. To import data into this structure, import pandas first and then read the CSV or Excel file into a DataFrame.
import pandas as pd
df1 = pd.read_csv('excelfilename.csv')
This DataFrame structure is similar to a database table, and you can join different DataFrames, group data, and so on.
I am not sure if this is what you need; let me know if you need any further clarification.
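Since the file is large, here is a minimal sketch of reading the CSV in manageable chunks and filtering on a column; the file name, column name, and search value are placeholders (pandas can also read .xlsx files directly with pd.read_excel, but chunked reading applies to CSV):
import pandas as pd

# read the CSV in chunks so the whole file never has to fit in memory at once
matches = []
for chunk in pd.read_csv('excelfilename.csv', chunksize=100000):
    # keep only the rows whose column value matches the search term (placeholder names)
    matches.append(chunk[chunk['Part Number'] == 'ABC-123'])

result = pd.concat(matches, ignore_index=True)
print(result.head())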
I would recommend actually loading it into a proper database such as MariaDB or PostgreSQL. This will allow you to access the data from other applications, and it takes the load of writing a database off of you. You can then use an ORM if you would like to interact with the data, or simply use plain SQL via Python.
Read the CSV:
import sqlite3
import pandas as pd

df = pd.read_csv('sample.csv')
Connect to a database:
conn = sqlite3.connect("Any_Database_Name.db")  # if the db does not exist, this creates an Any_Database_Name.db file in the current directory
Store your table in the database:
df.to_sql('Some_Table_Name', conn)
Read a SQL query out of your database and into a pandas DataFrame:
sql_string = 'SELECT * FROM Some_Table_Name'
df = pd.read_sql(sql_string, conn)
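If you then need to search by specific column values (as in the original question), you can push the filter into the SQL query itself; this is just a sketch, and the column name and search value are placeholders:
query = 'SELECT * FROM Some_Table_Name WHERE "Some_Column" = ?'
df_hits = pd.read_sql(query, conn, params=('search value',))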
I have some large (500+ MB) .CSV files that I need to import into a PostgreSQL database.
I am looking for a script or tool that helps me to:
Generate the CREATE TABLE SQL for the columns, ideally taking the data in the .CSV file into account in order to pick the optimal data type for each column.
Use the header of the .CSV as the name of the column.
It would be perfect if such functionality existed in PostgreSQL or could be added as an add-on.
Thank you very much
You can use the open source tool pgfutter to create a table from your CSV file.
GitHub link
PostgreSQL also has COPY functionality; however, COPY expects that the table already exists.
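If you would rather stay in Python, a rough sketch is to let pandas infer the column types from a sample of the file and map them to PostgreSQL types before generating the CREATE TABLE statement; the file name, table name, and type mapping below are assumptions, not a polished tool:
import pandas as pd

# infer types from the first 10,000 rows; 'data.csv' and 'my_table' are placeholders
sample = pd.read_csv('data.csv', nrows=10000)

type_map = {'int64': 'BIGINT', 'float64': 'DOUBLE PRECISION',
            'bool': 'BOOLEAN', 'datetime64[ns]': 'TIMESTAMP'}

cols = ', '.join(
    '"{}" {}'.format(name, type_map.get(str(dtype), 'TEXT'))
    for name, dtype in sample.dtypes.items()
)
print('CREATE TABLE my_table ({});'.format(cols))

# after creating the table, bulk-load the full file with COPY, e.g. from psql:
#   \copy my_table FROM 'data.csv' WITH (FORMAT csv, HEADER true)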
cursor.execute("SELECT * FROM Table")
I am using the above code to execute a SELECT query, but it gets stuck because the table has 93 million records.
Is there any other method to extract all the data from a Snowflake table in a Python script?
Depending on what you are trying to do with that data, it would probably be most efficient to run a COPY INTO <location> statement to unload the data into files in a stage, and then run a GET via Python to bring those files down to wherever you are running Python.
However, you might want to provide more detail on how you are using the data in python after the cursor.execute statement. Are you going to iterate over that data set to do something (in which case, you may be better off issuing SQL statements directly to Snowflake, instead), loading it into Pandas to do something (there are better Snowflake functions for pandas in that case), or something else? If you are just creating a file from it, then my suggestion above will work.
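For the unload route, a minimal sketch with the Snowflake Python connector is below; the stage name, table name, and local path are placeholders, and the connection parameters are elided:
import snowflake.connector

con = snowflake.connector.connect(
    ...  # account, user, password, warehouse, etc.
)
cur = con.cursor()

# unload the table into an internal stage as compressed CSV files
cur.execute("COPY INTO @my_stage/big_table/ FROM my_big_table "
            "FILE_FORMAT = (TYPE = CSV COMPRESSION = GZIP) OVERWRITE = TRUE")

# download the staged files to the machine running Python
cur.execute("GET @my_stage/big_table/ file:///tmp/big_table/")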
The problem is that when you fetch data from Snowflake into Python, the query gets stuck due to the volume of records and the Snowflake-to-Python data conversion.
Are you trying to fetch all the data from the table, and, most importantly, how are you using the data downstream? Restrict the number of columns if you can.
Improving Query Performance by Bypassing Data Conversion
To improve query performance, use the SnowflakeNoConverterToPython class in the snowflake.connector.converter_null module to bypass data conversions from the Snowflake internal data type to the native Python data type, e.g.:
import snowflake.connector
from snowflake.connector.converter_null import SnowflakeNoConverterToPython

con = snowflake.connector.connect(
    ...,  # account, user, password, etc.
    converter_class=SnowflakeNoConverterToPython
)
for rec in con.cursor().execute("SELECT * FROM large_table"):
    # rec includes raw Snowflake data
    print(rec)
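As an aside, if the end goal is a pandas dataframe rather than raw rows, the connector also provides pandas fetch helpers (they require the pandas extra, snowflake-connector-python[pandas]); a hedged sketch on a normal connection, not combined with the converter class above:
cur = con.cursor()
cur.execute("SELECT * FROM large_table")
for batch_df in cur.fetch_pandas_batches():
    # each batch is a pandas DataFrame; do your processing here
    print(batch_df.shape)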
I am trying to fetch data from a SQL Server database (just a simple SELECT * query).
The table contains around 3-5 million records. Performing a SELECT * on the SQL Server directly using SSMS takes around 11-15 minutes.
However, when I connect via Python and try to save the data into a pandas dataframe, it takes forever: more than 1 hour.
Here is the code I am using:
import pymssql
import pandas as pd
from datetime import datetime

startTime = datetime.now()
# instantiate a Python DB connection object (same form as the psycopg2/python-mysql drivers)
conn = pymssql.connect(server=r"xyz", database="abc", user="user", password="pwd")
print('Connecting to DB: ', datetime.now() - startTime)
stmt = "SELECT * FROM BIG_TABLE;"
# Execute query here
df_big_table = pd.read_sql(stmt,conn)
There must be a better way to do this, perhaps parallel processing or something similar to fetch the data quickly?
My end goal is to migrate this table from SQL Server to Postgres.
This is the way I am doing it:
1. Fetch data from SQL Server using Python.
2. Save it to a pandas dataframe.
3. Save this data in a CSV to disk.
4. Copy the CSV from disk to Postgres.
Probably I can combine steps 3 and 4 so that I do the transfer in memory rather than via disk I/O.
There are many complexities such as table constraints and definitions, etc., which I will take care of later on. I cannot use a third-party tool.
I am stuck at steps 1 and 2, so help with a Python script (or some other open source language) would be really appreciated.
If there is any other way to reach my end goal, I welcome suggestions!
Have you tried the 'chunksize' option of pandas.read_sql? You can gather all of the chunks into a single dataframe and produce the CSV.
If that takes too much time, you can write each chunk to its own file by using pandas.read_sql as an iterator, and then, after you have done your work, consolidate those files into a single one and submit it to Postgres.
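A minimal sketch of that chunked approach, reusing the connection and query from the question (the chunk size and file name are arbitrary choices):
import pandas as pd

# stream the table in chunks and append each chunk to a single CSV on disk
first_chunk = True
for chunk in pd.read_sql("SELECT * FROM BIG_TABLE;", conn, chunksize=50000):
    chunk.to_csv('big_table.csv', mode='a', header=first_chunk, index=False)
    first_chunk = False

# then bulk-load the CSV into Postgres, e.g. from psql:
#   \copy big_table FROM 'big_table.csv' WITH (FORMAT csv, HEADER true)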
How do I insert data stored in a dataframe into a SQL database? I've been told that I should use pandas.
Here is the question:
Get data from Quandl. Store this in a dataframe. (I've done this part)
Insert the data into an SQLite database. Create a database in SQLite and insert the data into a table with an appropriate schema. This can be done with pandas, so there is no need to go outside of your program to do this.
I only started Python coding a couple of days ago, so I'm a bit of a noob at this.
What I've got so far:
import quandl
import pandas as pd
import sqlite3

df = quandl.get("ML/AATRI", start_date="2008-01-01")
Thanks!
Pandas can write the dataframe straight into the database for you with to_sql; there is no need to convert the data to a .sql file first. Using the sqlite3 module you have already imported, note that the first argument of to_sql is a table name, not a file name ('mydata.db' and 'my_table' are just example names):
conn = sqlite3.connect('mydata.db')
df.to_sql('my_table', conn)
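If you want to confirm what pandas created, you can read it straight back with a quick query (same example names as above):
check = pd.read_sql('SELECT * FROM my_table LIMIT 5', conn)
print(check)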
I hope this works for you.
I have a task to import multiple Excel files into their respective SQL Server tables. The Excel files have different schemas, and I need a mechanism to create each table dynamically so that I don't have to write a CREATE TABLE query. I use SSIS, and I have seen some SSIS articles on this. However, it looks like I still have to define the table somehow. OPENROWSET doesn't work well with large Excel files.
You can try using BiML, which dynamically creates packages based on metadata.
The only other possible solution is to write a script task.