Using odo to migrate data to SQL - python

I have a large 3 GB CSV file, and I'd like to use Blaze to investigate the data, select down to the data I'm interesting in analyzing, with the eventual goal to migrate that data into a suitable computational backend such as SQlite, PostgresSQL etc. I can get that data into Blaze and work on it fine, but this is the part I'm having trouble with:
db = odo(bdata, 'sqlite:///report.db::report')`
I'm not sure how to properly create a db file to open with sqlite.

You can go directly from CSV to sqlite using the directions listed here.
http://odo.pydata.org/en/latest/perf.html?highlight=sqlite#csv-sqlite3-57m-31s
I think you are missing the column names as warned about here: http://odo.pydata.org/en/latest/sql.html?highlight=sqlite
dshape = discover(resource('report_2015.csv'))
t = odo('report_2015.csv', 'sqlite:///report.db::report', dshape=dshape)

Related

csv->Oracle DB autoscheduled import

I have basic csv report that is produced by other team on a daily basis, each report has 50k rows, those reports are saved on sharedrive everyday. And I have Oracle DB.
I need to create autoscheduled process (or at least less manual) to import those csv reports to Oracle DB. What solution would you recommend for it?
I did not find such solution in SQL Developer, since it is upload from file and not a query. I was thinking about python cron script, that will autoran on a daily basis and transform csv report to txt with needed SQL syntax (insert into...) and then python will connect to Oracle DB and will ran txt file as SQL command and insert data.
But this looks complicated.
Maybe you know other solution that you would recommend yo use?
Create an external table to allow you to access the content of the CSV as if it were a regular table. This assumes the file name does not change day-to-day.
Create a scheduled job to import the data in that external table and do whatever you want with it.
One common blocking issue that prevents using 'external tables' is that external tables require the data to be on the computer hosting the database. Not everyone has access to those servers. Or sometimes the external transfer of data to that machine + the data load to the DB is slower than doing a direct path load from the remote machine.
SQL*Loader with direct path load may be an option: https://docs.oracle.com/en/database/oracle/oracle-database/19/sutil/oracle-sql-loader.html#GUID-8D037494-07FA-4226-B507-E1B2ED10C144 This will be faster than Python.
If you do want to use Python, then read the cx_Oracle manual Batch Statement Execution and Bulk Loading. There is an example of reading from a CSV file.

snowflake select cursor statement fails

'''
cursor.execute(Select * From Table);
'''
Iam using the above code to execute the above select query, but this code gets stucked, because in the table, I am having 93 million records,
Do we have any other method to extract all the data from snowflake table in python script
Depending on what you are trying to do with that data, it'd probably be most efficient to run a COPY INTO location statement to extract the data into a file to a stage, and then run a GET via Python to bring that file locally to wherever you are running python.
However, you might want to provide more detail on how you are using the data in python after the cursor.execute statement. Are you going to iterate over that data set to do something (in which case, you may be better off issuing SQL statements directly to Snowflake, instead), loading it into Pandas to do something (there are better Snowflake functions for pandas in that case), or something else? If you are just creating a file from it, then my suggestion above will work.
The problem is when you are fetching data from Snowflake to Python, the query is getting stuck due to the volume of record and the SF to Python Data conversion.
Are you trying to fetch all the data from the table and how are you using the Data in the downstream which is most important. Restrict the number of columns
Improving Query Performance by Bypassing Data Conversion
To improve query performance, use the SnowflakeNoConverterToPython class in the snowflake.connector.converter_null module to bypass data conversions from the Snowflake internal data type to the native Python data type, e.g.:
con = snowflake.connector.connect(
...
converter_class=SnowflakeNoConverterToPython
)
for rec in con.cursor().execute("SELECT * FROM large_table"):
# rec includes raw Snowflake data

Create Database using Python on Jupyter Notebook

so i am building a database for a larger program and do not have much experience in this area of coding (mostly embedded system programming). My task is to import a large excel file into python. It is large so i'm assuming I must convert it to a CSV then truncate it by parsing and then partitioning and then import to avoid my computer crashing. Once the file is imported i must be able to extract/search specific information based on the column titles. There are other user interactive aspects that are simply string based so not very difficult. As for the rest, I am getting the picture but would like a more efficient and specific design. Can anyone offer me guidance on this?
An excel or csv can be read into python using pandas. The data is stored as rows and columns and is called a dataframe. To import data in such a structure, you need to import pandas first and then read the csv or excel into the dataframe structure.
import pandas as pd
df1= pd.read_csv('excelfilename.csv')
This dataframe structure is similar to tables and you can perform joining of different dataframes, grouping of data etc.
I am not sure if this is what you need, let me know if you need any further clarifications.
I would recommend actually loading it into a proper database such as Mariadb or Postgresql. This will allow you to access the data from other applications and it takes the load off of you for writing a database. You can then use a ORM if you would like to interact with the data or simply use plain SQL via python.
read the CSV
df = pd.read_csv('sample.csv')
connect to a database
conn = sqlite3.connect("Any_Database_Name.db") #if the db does not exist, this creates a Any_Database_Name.db file in the current directory
store your table in the database:
df.to_sql('Some_Table_Name', conn)
read a SQL Query out of your database and into a pandas dataframe
sql_string = 'SELECT * FROM Some_Table_Name' df = pd.read_sql(sql_string, conn)

Importing data from multiple related tables in mySQL to SQLite3 or postgreSQL

I'm updating from an ancient language to Django. I want to keep the data from the old project into the new.
But old project is mySQL. And I'm currently using SQLite3 in dev mode. But read that postgreSQL is most capable. So first question is: Is it better to set up postgreSQL while in development. Or is it an easy transition to postgreSQL from SQLite3?
And for the data in the old project. I am bumping up the table structure from the old mySQL structure. Since it got many relation db's. And this is handled internally with foreignkey and manytomany in SQLite3 (same in postgreSQL I guess).
So I'm thinking about how to transfer the data. It's not really much data. Maybe 3-5.000 rows.
Problem is that I don't want to have same table structure. So a import would be a terrible idea. I want to have the sweet functionality provided by SQLite3/postgreSQL.
One idea I had was to join all the data and create a nested json for each post. And then define into what table so the relations are kept.
But this is just my guessing. So I'm asking you if there is a proper way to do this?
Thanks!
better create the postgres database. write down the python script which take the data from the mysql database and import in postgres database.

Multiple pandas users connecting to SQL DB

New to Pandas & SQL. Haven't found an answer specific to this config, and not sure if standard SQL wisdom applies when introducing pandas to the mix.
Doing a school project that involves ~300 gb of data in ~6gb .csv chunks.
School advised syncing data via dropbox, but this seemed impractical for a 4-person team.
So, current solution is AWS EC2 & RDS instance (MySQL, I think it'll be, 1 table).
What I wanted to confirm before we start setting it up:
If multiple users are working with (and occasionally modifying) the data, can this arrangement manage conflicts? e.g., if user A uses pandas to construct a dataframe from a query, are the records in that query frozen if user B tries to work with them?
My assumption is that the data in the frame are in memory, and the records in the SQL database are free to be modified by others until the dataframe is written back to the db, but I'm hoping that either I'm wrong or there's a simple solution here (like a random sample query for each user or something).
A pandas DataFrame object does not interact directly with the db. Once you read it in it sits in memory locally. You would have to use a method like DataFrame.to_sql to write your changes back to the MySQL DB. For more information on reading and writing to SQL tables, see the pandas documentation here.

Categories