I am trying to export a large table (10 million rows) to a semicolon-separated .csv file. I am currently using the built-in Import/Export Wizard in Microsoft SQL Server Management Studio v17, and the export takes approximately 5 hours.
Is there a simple way to speed up this process?
My company limits me to R/Python solutions, besides of course SQL Server itself.
What is the size of your table in memory? I have a ~2 GB table that turns into a CSV in a couple of minutes.
Check your data source connection; I use OLEDB.
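If you have to stay in Python, a chunked export over pyodbc is usually much faster than the wizard. A minimal sketch, assuming the Microsoft ODBC driver for SQL Server is installed; the server, database, and table names below are placeholders:
import csv
import pyodbc

# Placeholder connection details - adjust driver, server, database and authentication.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myserver;DATABASE=mydb;Trusted_Connection=yes;"
)
cursor = conn.cursor()
cursor.execute("SELECT * FROM dbo.my_big_table")  # placeholder table name

with open("export.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f, delimiter=";")
    writer.writerow([col[0] for col in cursor.description])  # header row
    while True:
        rows = cursor.fetchmany(50000)  # stream in chunks instead of loading everything
        if not rows:
            break
        writer.writerows(rows)

conn.close()
The batch size is just a starting point; the main point is to stream the rows rather than materialize all 10 million at once.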
So I compared the storage and performance of MySQL and TimescaleDB (on PostgreSQL). I'm uploading hundreds of CSV files to the stock data table using a Python script (uploading with Python multiprocessing).
For MySQL I had to create the distributions myself: I created schemas y2008, y2009, ... up to y2020. Within each schema I created 10 tables (a_c, d_f, etc., to store the tickers in alphabetical groups for the best insert and query performance).
For TimescaleDB, I simply had to create_hypertable(stocks,..) which distributed the data into chunks/tables by the Date column. I did not have to 'manually' create the schemas and distributions as in MySQL.
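For reference, the TimescaleDB side boils down to roughly the sketch below (the table and column names are illustrative, not my exact schema; connection details are placeholders):
import psycopg2

# Placeholder connection string and an illustrative schema.
conn = psycopg2.connect("dbname=stocks_db user=postgres")
cur = conn.cursor()
cur.execute("""
    CREATE TABLE IF NOT EXISTS stocks (
        date   timestamptz      NOT NULL,
        ticker text             NOT NULL,
        close  double precision,
        volume bigint
    );
""")
# One call replaces all the manual schemas/tables: chunk the table by the date column.
cur.execute("SELECT create_hypertable('stocks', 'date', if_not_exists => TRUE);")
conn.commit()
conn.close()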
Currently I've tested both setups with 100 tickers, around 6 GB of data. TimescaleDB gave better insert performance (5-6 minutes) than MySQL (9-10 minutes).
Also, these comparisons are for local PC setups. I haven't compared larger data sets or cloud database performance yet.
If someone has experience storing such time-series data, please let me know what is your opinion on the two, or if you recommend something else to look into as well.
Thanks a lot
I am new to SQL. I am working on a research project where we have years' worth of data from different sources, summing up to hundreds of terabytes. I currently have the data parsed as Python data frames. I need help to literally set up SQL from scratch, and I also need help to compile all our data into a SQL database. Please tell me everything I need to know about SQL as a beginner.
Probably the easiest way to get started is with one of the free RDBMS options, MySQL (https://www.mysql.com/) or PostgreSQL (https://www.postgresql.org/).
Once you've got that installed and configured, and have created the tables you wish to load, you can go with one of two routes to get your data in.
Either you can install the appropriate Python libraries to connect to the server you've installed and then INSERT the data in.
Or, if there is a lot of data, look at dumping it out to a flat file (.csv) and then using the bulk loader to push it into your tables (this is more hassle, but for larger data sets it will be faster).
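For example, with PostgreSQL the bulk-load route from one of your already-parsed dataframes might look roughly like this (the table name, file path, and connection string are placeholders; assumes psycopg2 is installed):
import psycopg2

# df is one of your parsed dataframes; dump it to a flat file first.
df.to_csv("/tmp/chunk.csv", index=False, header=False)

conn = psycopg2.connect("dbname=research user=postgres")  # placeholder connection
cur = conn.cursor()
with open("/tmp/chunk.csv") as f:
    # COPY is PostgreSQL's bulk loader - much faster than row-by-row INSERTs.
    cur.copy_expert("COPY my_table FROM STDIN WITH (FORMAT csv)", f)
conn.commit()
conn.close()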
I am comfortable using Python / Excel / pandas for my dataframes. I do not know SQL or database languages.
I am about to start on a new project that will involve around 4,000 different Excel files I have. For all 4,000 files I will open each one, save it as a dataframe, and then do my math on it. This will include many computations such as sums, linear regression, and other normal stats.
My question is: I know how to do this with 5-10 files, no problem. Am I going to run into a problem with memory, or with the program taking hours to run? The files are around 300-600 kB. I don't use any functions in Excel, only holding data. Would I be better off having 4,000 separate files or 4,000 tabs? Or is this something a computer can handle without a problem? Thanks for looking; I have not worked with a lot of data before and would like to know if I am really screwing up before I begin.
You definitely want to use a database. At nearly 2 GB of raw data, you won't be able to do much with it without choking your computer; even reading it all in would take a while.
If you feel comfortable with Python and pandas, I guarantee you can learn SQL very quickly. The basic syntax can be learned in an hour, and you won't regret learning it for future jobs; it's a very useful skill.
I'd recommend you install PostgreSQL locally and then use SQLAlchemy to create a database connection (or engine) to it. Then you'll be happy to hear that pandas actually has df.to_sql and pd.read_sql, making it really easy to push and pull data to and from it as you need. Also, SQL can do any basic math you want, like summing, counting, etc.
Connecting and writing to a SQL database is as easy as:
from sqlalchemy import create_engine
my_db = create_engine('postgresql+psycopg2://username:password@localhost:5432/database_name')
df.to_sql('table_name', my_db, if_exists='append')
I add the last if_exists='append' because you'll most likely want to add all 4,000 files to one table.
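For your 4,000 files the whole loop could look roughly like this (the folder pattern and table name are just examples):
import glob
import pandas as pd
from sqlalchemy import create_engine

my_db = create_engine('postgresql+psycopg2://username:password@localhost:5432/database_name')

# Read each workbook and append it to a single table.
for path in glob.glob('data/*.xlsx'):        # adjust the folder/pattern to your files
    df = pd.read_excel(path)                 # reading .xlsx needs openpyxl installed
    df['source_file'] = path                 # optional: remember which file each row came from
    df.to_sql('table_name', my_db, if_exists='append', index=False)
That way only one file is in memory at a time, so the 4,000 files never have to fit in RAM together.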
I am working on querying data and then building a visualization on top of it. Currently my whole pipeline works, but it can sometimes take upwards of 10 minutes to return the results of my query, and I am fairly sure I am missing some optimization or other crucial step that is causing this slow speed.
Details:
I have about 500 GB in 3,500 CSVs. I store these in an Azure Blob Storage account and run a Spark cluster on Azure HDInsight. I am using Spark 2.1.
Here is the script (PySpark3 in an Azure Jupyter Notebook) I use to ingest the data:
csv_df = spark.read.csv('wasb://containername@storageaccountname.blob.core.windows.net/folder/*.csv', header=True, inferSchema=True)  # Read CSV
csv_df.write.parquet('wasb://containername@storageaccountname.blob.core.windows.net/folder/parquet_folder/csvdfdata.parquet')  # Write Parquet
parquet_df = spark.read.parquet('wasb://containername@storageaccountname.blob.core.windows.net/folder/parquet_folder/csvdfdata.parquet')  # Read Parquet
parquet_df.createOrReplaceTempView('temp_table')  # Create a temporary view
spark.sql("create table permanent_table as select * from temp_table")  # Create a permanent table
I then use the ODBC driver and this code to pull data. I understand ODBC can slow things down a little, but I believe 10 minutes is way more than expected.
https://github.com/Azure-Samples/hdinsight-dotnet-odbc-spark-sql/blob/master/Program.cs
My code to pull data is similar to this ^
The problem is that the pipeline works but it is way too slow for it to be of any use. The visualizations I create need to pull data in a few seconds at best.
Other details:
A good amount of the queries use DateId, which stores dates as integers, e.g. 20170629 (29 June 2017).
Sample Query = select DateId, count(PageId) as total from permanent_table where (DateId >= 20170623) and (DateId <= 20170629) group by DateId order by DateId asc
Any help would be greatly appreciated! Thanks in advance!
Thank You!
First, one clarification: what queries are you running over the ODBC connection? Are they table creation queries? Those would take a long time. Make sure you run only read queries over ODBC against a pre-created Hive table.
Now, assuming you do the above, here are a few things you can do to make queries run in a few seconds.
1. The Thrift server on HDI uses dynamic resource allocation, so the first query will take extra time while resources are allocated; after that it should be faster. You can check in Ambari -> YARN UI -> Thrift application how many resources it uses - it should use all the cores of your cluster.
2. 3,500 files is too many. When you create the parquet table, coalesce(num_partitions) (or repartition) it into a smaller number of partitions. Adjust it so there is about 100 MB per partition, or, if there is not enough data, at least one partition per core of your cluster.
3. In your data generation script you can skip one step: instead of creating a temp table, directly create the Hive table in parquet format. Replace csv_df.write.parquet with csv_df.write.mode("overwrite").saveAsTable("tablename").
4. For date queries you can partition your data by year, month, day columns (you will need to extract them first). If you do this you won't need to worry about #2. You may end up with too many files; if so, reduce the partitioning to only year and month. A rough sketch combining #2-#4 follows after this list.
5. Size of your cluster: for 500 GB of text files you should be fine with a few D14v2 nodes (maybe 2-4), but it depends on the complexity of your queries.
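Here is that sketch. The container/account names are the placeholders from your own snippet, DateId is assumed to be the integer yyyymmdd column from your sample query, spark is the SparkSession already available in the notebook, and the coalesce count is only a starting point to tune:
from pyspark.sql import functions as F

csv_df = spark.read.csv(
    'wasb://containername@storageaccountname.blob.core.windows.net/folder/*.csv',
    header=True, inferSchema=True)

# Derive partition columns from the integer DateId (e.g. 20170629 -> year 2017, month 6).
with_dates = (csv_df
              .withColumn('year', (F.col('DateId') / 10000).cast('int'))
              .withColumn('month', ((F.col('DateId') / 100) % 100).cast('int')))

# Write straight to a partitioned Hive table in parquet format - no temp view needed.
(with_dates
 .coalesce(64)                    # tune so partitions end up around 100 MB each
 .write
 .mode('overwrite')
 .format('parquet')
 .partitionBy('year', 'month')
 .saveAsTable('permanent_table'))
Once the table is partitioned this way, queries that filter on year and month only scan the matching partitions instead of all 500 GB.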
I have about a million records in a list that I would like to write to a Netezza table. I have been using the executemany() command with pyodbc, which seems to be very slow (I can load much faster if I save the records to Excel and load into Netezza from the Excel file). Are there any faster alternatives to the executemany() command for loading a list?
PS1: The list is generated by a proprietary DAG in our company, so writing to the list is very fast.
PS2: I have also tried splitting the executemany() call into chunks, with each chunk containing a list of 100 records. It takes approximately 60 seconds to load, which seems very slow.
From Python I have had great performance loading millions of rows to Netezza using transient external tables. Basically Python creates a CSV file on the local machine, and then tells the ODBC driver to load the CSV file into the remote server.
The simplest example:
INSERT INTO test_table
SELECT *
FROM EXTERNAL '/tmp/test.txt'
SAMEAS test_table
USING (DELIM ',' REMOTESOURCE 'ODBC');
Behind the scenes this is equivalent to the nzload command, but it does not require nzload. This worked great for me on Windows where I did not have nzload.
Caveat: be careful with the formatting of the CSV, the values in the CSV, and the options to the command. Netezza gives obscure error messages for invalid values.
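In Python the whole flow is roughly the sketch below; my_records, the DSN, and test_table are placeholders, and REMOTESOURCE 'ODBC' is what tells the server the file lives on the client machine rather than on the host:
import csv
import pyodbc

rows = my_records   # placeholder: the list of tuples produced by your DAG

# 1) Dump the rows to a local CSV file.
with open('/tmp/test.txt', 'w', newline='') as f:
    csv.writer(f, delimiter=',').writerows(rows)

# 2) Have the ODBC driver stream that file into the table via a transient external table.
conn = pyodbc.connect('DSN=my_netezza_dsn')   # placeholder DSN
conn.cursor().execute("""
    INSERT INTO test_table
    SELECT * FROM EXTERNAL '/tmp/test.txt'
    SAMEAS test_table
    USING (DELIM ',' REMOTESOURCE 'ODBC')
""")
conn.commit()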
Netezza is optimized for bulk loads, whereas executemany() only inserts a batch of rows at a time. The best way to load millions of rows is the "nzload" utility, which can be scheduled with VBScript or an Excel macro on Windows, or a shell script on Linux.