I am working on querying data and then building a visualization on top of it. Currently my whole pipeline works but it can take upwards of 10 minutes sometimes to return the results of my query and I am very sure I am missing some optimization or another crucial step that is causing this slow speed.
Details:
I have about 500 GB in 3,500 CSV files. I store these in an Azure Blob Storage account and run a Spark cluster on Azure HDInsight. I am using Spark 2.1.
Here is the script (PySpark3 on Azure Jupyter Notebook) I use to ingest the data:
csv_df = spark.read.csv('wasb://containername@storageaccountname.blob.core.windows.net/folder/*.csv', header=True, inferSchema=True)  # Read CSV
csv_df.write.parquet('wasb://containername@storageaccountname.blob.core.windows.net/folder/parquet_folder/csvdfdata.parquet')  # Write Parquet
parquet_df = spark.read.parquet('wasb://containername@storageaccountname.blob.core.windows.net/folder/parquet_folder/csvdfdata.parquet')  # Read Parquet
parquet_df.createOrReplaceTempView('temp_table')  # Create a temporary view
spark.sql("create table permanent_table as select * from temp_table")  # Create a permanent table
I then use the ODBC driver and this code to pull data. I understand ODBC can slow things down a little, but I believe 10 minutes is far more than expected.
https://github.com/Azure-Samples/hdinsight-dotnet-odbc-spark-sql/blob/master/Program.cs
My code to pull data is similar to this ^
The problem is that the pipeline works but it is way too slow for it to be of any use. The visualizations I create need to pull data in a few seconds at best.
Other details:
A good number of queries use DateId, which stores dates as integers, e.g. 20170629 (29 June 2017).
Sample query: select DateId, count(PageId) as total from permanent_table where (DateId >= 20170623) and (DateId <= 20170629) group by DateId order by DateId asc
Any help would be greatly appreciated! Thanks in advance!
Thank You!
First, a clarification: which queries are you running over the ODBC connection? Are they table-creation queries? Those would take a long time. Make sure you run only read queries from ODBC against a pre-created Hive table.
Now, assuming you do the above, here are a few things you can do to make queries run in a few seconds.
1. Thrift server on HDI uses dynamic resource allocation, so the first query will take extra time while resources are allocated. After that it should be faster. You can check in Ambari -> Yarn UI -> Thrift application how many resources it uses; it should use all the cores of your cluster.
2. 3500 files is too many. When you create the Parquet table, use coalesce(num_partitions) (or repartition) to reduce it to a smaller number of partitions. Adjust it so there is about 100 MB per partition or, if there is not enough data, at least one partition per core of your cluster.
3. In your data generation script you can skip one step: instead of creating a temp table, directly create a Hive table in Parquet format. Replace csv_df.write.parquet with csv_df.write.mode("overwrite").saveAsTable("tablename") (in PySpark the mode is the string "overwrite"; SaveMode.Overwrite is the Scala/Java spelling).
4. For date queries you can partition your data by year, month, and day columns (you will need to extract them first). If you do this you won't need to worry about #2. You may end up with too many files; if so, reduce the partitioning to only year and month.
5. Size of your cluster. For 500 GB of text files you should be fine with a few D14v2 nodes (maybe 2-4), but it depends on the complexity of your queries.
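The repartitioning and date-partitioning advice above can be sketched in PySpark roughly as follows. This is a minimal sketch under stated assumptions, not a tested pipeline: the helper names (target_partitions, split_date_id, write_date_partitioned_table) and the table name argument are hypothetical, the 100 MB target is the figure suggested above, and only the DateId column comes from the question.

```python
import math


def target_partitions(total_bytes, cores, target_bytes=100 * 1024**2):
    """Partition count giving roughly target_bytes each, but at least one per core."""
    return max(cores, math.ceil(total_bytes / target_bytes))


def split_date_id(date_id):
    """Split an integer date such as 20170629 into (year, month, day)."""
    return date_id // 10000, (date_id // 100) % 100, date_id % 100


def write_date_partitioned_table(df, table_name):
    """Save a Spark DataFrame as a Parquet-backed table partitioned by year and month.

    Assumes df has an integer DateId column in YYYYMMDD form.
    """
    # Imported lazily so the pure-Python helpers above work without Spark installed.
    from pyspark.sql import functions as F

    date_id = F.col("DateId")
    with_parts = (
        df.withColumn("year", F.floor(date_id / 10000).cast("int"))
          .withColumn("month", (F.floor(date_id / 100) % 100).cast("int"))
    )
    # Shuffle by the partition columns first so each year/month directory
    # is written by few tasks instead of one small file per task.
    (with_parts.repartition("year", "month")
        .write.mode("overwrite")
        .format("parquet")
        .partitionBy("year", "month")
        .saveAsTable(table_name))
```

If you prefer not to partition by date, something like df.coalesce(target_partitions(parquet_bytes, cores)) before saveAsTable would apply the 100 MB rule instead. Parquet is usually much smaller than the source CSV, so it is worth measuring the written size rather than plugging in 500 GB.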
So I compared storage and performance of both MySQL and TimescaleDB on PostgreSQL. I'm uploading hundreds of CSV files to the stock data table using a Python script (uploading with Python multiprocessing).
For MySQL I had to create the distributions myself: I created schemas y2008, y2009, ... up to y2020. Within each schema I created 10 tables (a_c, d_f, etc.) to store the tickers in alphabetical groups for the best insert and query performance.
For TimescaleDB, I simply had to call create_hypertable(stocks,..), which distributed the data into chunks/tables by the Date column. I did not have to manually create the schemas and distributions as in MySQL.
So far I've tested both setups for 100 tickers, around 6 GB of data. TimescaleDB gave better insert performance (5-6 minutes) than MySQL (9-10 minutes).
Also, these comparisons are for local PC setups. I haven't yet compared larger data sets or cloud database performance.
If someone has experience storing such time-series data, please let me know your opinion on the two, or whether you recommend something else to look into as well.
Thanks a lot
I would like to write the bulk data to BQ using software API.
My restrictions are:
I am going to use the maximum size BQ allows: 10,000 columns and ~35,000 rows (this can be bigger)
Schema autodetect is required
If possible, I would like to use some kind of parallelism to write many tables at the same time asynchronously (for that Apache-beam & dataflow might be the solution)
When using the Pandas library for BQ, there is a limit on the size of the dataframe that can be written, so the data has to be partitioned
What would be the best way to do so?
Many thanks for any advice / comment,
eilalan
Apache Beam would be the right component, as it supports processing huge volumes of data in both batch and streaming mode.
I don't think Beam has schema auto-detect. But you can use the BigQuery API to fetch the schema if the table already exists.
I am new to Google BigQuery so I'm trying to understand how to best accomplish my use case.
I have daily data of customer visits stored in BigQuery that I wish to analyse using some algorithms that I have written in Python. Since there are multiple scripts that use subsets of the daily data, I was wondering what would be the best way to fetch and temporarily store the data. Additionally, the scripts run sequentially. Each script modifies some columns of the data and the subsequent script uses this modified data. After all the scripts have run, I want to store the modified data back to BigQuery.
Some approaches I had in mind are:
Export the BigQuery table into a GAE (Google App Engine) instance as a db file and query the relevant data for each script from the db file using the sqlite3 Python package. Once all the scripts have run, store the modified table back to BigQuery and then remove the db file from the GAE instance.
Query data from BigQuery every time I want to run a script, using the google-cloud Python client library or the pandas-gbq package. Modify the BigQuery table after running each script.
Does anybody know which of these would be a better way to accomplish this (in terms of efficiency/cost), or can you suggest alternatives?
Thanks!
The answer to your question mostly depends on your use case and the size of the data that you will be processing, so there is not an absolute and correct answer for it.
However, there are some points that you may want to take into account regarding the usage of BigQuery and how some of its features can be interesting for you in the scenario you described.
Let me quickly go over the main topics you should have a look at:
Pricing: leaving aside the billing of storage and focusing on the cost of the queries themselves (which is more related to your use case), BigQuery billing is based on the number of bytes processed by each query. There is a 1 TB free quota per month, and from then on the cost is $5 per TB of processed data, with 10 MB being the minimum measurable unit.
Cache: when BigQuery returns some information, it is stored in a temporary cached table (or a permanent one if you wish), and they are maintained for approximately 24 hours with some exceptions that you may find in this same documentation link (they are also best-effort, so earlier deletion may happen too). Results returned from a cached table are not billed (because as per the definition of the billing, the cost is based on the number of bytes processed, and accessing a cached table implies that there is no processing being done), as long as you are running the exact same query. I think it would be worth having a look at this feature, because from your sentence "Since there are multiple scripts that use subsets of the daily data", maybe (but just guessing here) it applies to your use case to perform a single query once and then retrieve the results multiple times from a cached version without having to store it anywhere else.
Partitions: BigQuery offers the concept of partitioned tables, which are individual tables divided into smaller segments by date, which will make it easier to query data daily as you require.
Speed: BigQuery offers a real-time analytics platform, so you will be able to perform fast queries retrieving the information you need, applying some initial processing that you can later use in your custom Python algorithms.
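To make the pricing point concrete, here is a rough cost estimate using only the figures quoted above ($5 per TB, 1 TB free per month, 10 MB minimum). It is a sketch under assumptions: the function name is hypothetical, a decimal terabyte is assumed, the 10 MB figure is treated as a per-query minimum, and current BigQuery prices may differ, so check the pricing page before relying on it.

```python
MB = 10**6   # assumption: decimal megabyte
TB = 10**12  # assumption: decimal terabyte


def monthly_query_cost(query_sizes_bytes, price_per_tb=5.0,
                       free_bytes=TB, min_bytes=10 * MB):
    """Estimate the monthly on-demand query cost in dollars.

    query_sizes_bytes holds the bytes processed by each billed query.
    Queries answered from the cache are not billed, so leave them out.
    """
    # Each billed query counts for at least min_bytes.
    billed = sum(max(size, min_bytes) for size in query_sizes_bytes)
    # Only the volume above the free monthly quota is charged.
    billable = max(0, billed - free_bytes)
    return billable / TB * price_per_tb


# Example: 30 daily scripts that each scan 100 GB (3 TB in total).
estimate = monthly_query_cost([100 * 10**9] * 30)
```

With these assumptions, re-running the exact same query and hitting the cache costs nothing extra, which is why the caching behaviour described above matters for scripts that share the same daily subset.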
So, in general, I would say that there is no need for you to keep any other database with partial results apart from your BigQuery storage. In terms of resource and cost efficiency, BigQuery offers enough features for you to work with your data without having to deal with huge expenses or delays in data retrieval. However, again, this will ultimately depend on your use case and the amount of data you are storing and need to process simultaneously; but in general terms, I would just go with BigQuery on its own.
I am trying to read in a table from my Postgres database into Python. Table has around 8 million rows and 17 columns, and has a size of 622MB in the DB.
I can export the entire table to csv using psql, and then use pd.read_csv() to read it in. It works perfectly fine. Python process only uses around 1GB of memory and everything is good.
Now, the task we need to do requires this pull to be automated, so I thought I could read the table in using pd.read_sql_table() directly from the DB. Using the following code
import pandas as pd
import sqlalchemy
engine = sqlalchemy.create_engine("postgresql://username:password@hostname:5432/db")
the_frame = pd.read_sql_table(table_name='table_name', con=engine, schema='schemaname')
This approach starts using a lot of memory. When I track the memory usage using Task Manager, I can see the Python process memory usage climb and climb, until it hits all the way up to 16GB and freezes the computer.
Any ideas on why this might be happening are appreciated.
You need to set the chunksize argument so that pandas will iterate over smaller chunks of data. See this post: https://stackoverflow.com/a/31839639/3707607
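As a rough illustration of the chunksize approach, the sketch below processes a table chunk by chunk instead of building one large DataFrame. To keep it self-contained it uses an in-memory SQLite table as a stand-in for the Postgres table; the table name visits, its columns, and the helper iter_chunks are hypothetical. pd.read_sql_table accepts the same chunksize argument when given a SQLAlchemy engine.

```python
import sqlite3

import pandas as pd


def iter_chunks(query, con, chunksize=100_000):
    """Return an iterator of DataFrames with at most chunksize rows each."""
    return pd.read_sql_query(query, con, chunksize=chunksize)


# Stand-in for the Postgres table: 1,000 rows in an in-memory SQLite database.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE visits (id INTEGER, value REAL)")
con.executemany("INSERT INTO visits VALUES (?, ?)",
                [(i, i * 0.5) for i in range(1000)])
con.commit()

total_rows = 0
chunk_count = 0
running_sum = 0.0
for chunk in iter_chunks("SELECT id, value FROM visits", con, chunksize=300):
    # Work on each chunk and keep only the aggregate, not the chunk itself.
    total_rows += len(chunk)
    chunk_count += 1
    running_sum += chunk["value"].sum()
```

Depending on the database driver, the full result set may still be buffered on the client even when chunksize is set; if memory keeps climbing with Postgres, a server-side cursor (for example SQLAlchemy's stream_results execution option) may be needed as well.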
I want to append about 700 million rows and 2 columns to a database, using the code below:
import pandas as pd
from sqlalchemy import create_engine

disk_engine = create_engine('sqlite:///screen-user.db')
chunksize = 1000000
j = 0
index_start = 1
# Read the TSV in chunks and append each chunk to the 'data' table
for df in pd.read_csv('C:/Users/xxx/Desktop/jjj.tsv', chunksize=chunksize, header=None, names=['screen', 'user'], sep='\t', iterator=True, encoding='utf-8'):
    df.to_sql('data', disk_engine, if_exists='append')
    j += 1  # number of chunks written so far
    count = j * chunksize
    print(count)
    print(j)
It is taking a really long time (I estimate it would take days). Is there a more efficient way to do this? In R, I have been using the data.table package to load large data sets and it only takes 1 minute. Is there a similar package in Python? As a tangential point, I also want to physically store this file on my Desktop. Right now, I am assuming 'data' is being stored as a temporary file. How would I do this?
Also assuming I load the data into a database, I want the queries to execute in a minute or less. Here is some pseudocode of what I want to do using Python + SQL:
#load data(600 million rows * 2 columns) into database
#def count(screen):
#return count of distinct list of users for a given set of screens
Essentially, I am returning the number of screens for a given set of users. Is the data too big for this task? I also want to merge this table with another table. Is there a reason why the fread function in R is much faster?
If your goal is to import data from your TSV file into SQLite, you should try the native import functionality in SQLite itself. Just open the sqlite console program and do something like this:
sqlite> .separator "\t"
sqlite> .import C:/Users/xxx/Desktop/jjj.tsv screen-user
Don't forget to build appropriate indexes before doing any queries.
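For reference, here is what the load, the index, and the distinct-user count from the question could look like using only Python's built-in sqlite3 and csv modules. It is a small sketch, not a benchmark: the table name data and columns screen and user follow the question's code, while load_tsv, count_users, and the index name are hypothetical. For hundreds of millions of rows the sqlite shell's .import shown above is likely to be faster.

```python
import csv
import io
import sqlite3


def load_tsv(con, tsv_file, batch_size=100_000):
    """Bulk-insert (screen, user) rows from a TSV file object in batches."""
    con.execute("CREATE TABLE IF NOT EXISTS data (screen TEXT, user TEXT)")
    reader = csv.reader(tsv_file, delimiter="\t")
    batch = []
    for row in reader:
        batch.append(row)
        if len(batch) >= batch_size:
            con.executemany("INSERT INTO data VALUES (?, ?)", batch)
            batch = []
    if batch:
        con.executemany("INSERT INTO data VALUES (?, ?)", batch)
    # Build the index after loading so inserts are not slowed by index upkeep.
    con.execute("CREATE INDEX IF NOT EXISTS idx_data_screen ON data (screen, user)")
    con.commit()


def count_users(con, screens):
    """Count distinct users who appear with any of the given screens."""
    placeholders = ",".join("?" for _ in screens)
    sql = f"SELECT COUNT(DISTINCT user) FROM data WHERE screen IN ({placeholders})"
    return con.execute(sql, list(screens)).fetchone()[0]


# Tiny in-memory example standing in for the real TSV file.
con = sqlite3.connect(":memory:")
sample = io.StringIO("home\talice\nhome\tbob\ncart\talice\ncart\tcarol\nhelp\tdave\n")
load_tsv(con, sample)
```

The index on (screen, user) lets SQLite answer the count from the index alone for a given set of screens, which is the kind of "appropriate index" the query in the question would benefit from.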
As @John Zwinck has already said, you should probably use the RDBMS's native tools for loading this amount of data.
First of all, I think SQLite is not the proper tool/DB for 700 million rows, especially if you want to join/merge this data afterwards.
Depending on what kind of processing you want to do with your data after loading, I would either use free MySQL or, if you can afford a cluster, Apache Spark SQL, and parallelize the processing of your data across multiple cluster nodes.
For loading your data into a MySQL DB you can and should use the native LOAD DATA tool.
Here is a great article showing how to optimize the data load process for MySQL (across different MySQL versions, MySQL options, and MySQL storage engines: MyISAM, InnoDB, etc.)
Conclusion: use the DB's native tools to load large amounts of CSV/TSV data efficiently instead of pandas, especially if your data doesn't fit into memory and you want to process (join/merge/filter/etc.) your data after loading.