I've got a SQL database (MariaDB) with a timestamp index. Unfortunately there are multiple rows with identical timestamps. The best solution would be to reject the current data and start using a unique index, but sadly that is not easily done.
When querying like this:
SELECT TIMESTAMP, DEVICE, VALUE FROM df WHERE DEVICE IN ('A', 'B') ORDER BY TIMESTAMP ASC
The order of the rows with identical timestamps isn't the same as the order I see when looking at the complete data without ORDER BY TIMESTAMP ASC. I would like to get the data in the same order it was written into the database. I am querying in Python with pandas, and my workaround would be to fetch the complete data and prepare it in Python, but that is slower.
Can you help me? I know it should have been done differently in the first place, but maybe there is a workaround.
Fabian
SQL doesn't guarantee the order the data is retrieved in. You need to use another column to force retrieval in a specific order. Don't you have another column you can use in the ORDER BY?
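If there really is no other usable column, one workaround is to add an AUTO_INCREMENT column and use it as a tie-breaker. A rough sketch follows; the connection string and the row_id column name are placeholders, and note that MariaDB numbers the existing rows in whatever order it rewrites them when the column is added, so a stable insertion order is mainly guaranteed for rows written from then on.

import pandas as pd
from sqlalchemy import create_engine, text

engine = create_engine("mysql+pymysql://user:password@host/dbname")  # placeholder connection string

# One-time change: add an auto-increment column to record insertion order.
# (MariaDB requires an AUTO_INCREMENT column to be indexed.)
with engine.begin() as conn:
    conn.execute(text(
        "ALTER TABLE df "
        "ADD COLUMN row_id BIGINT NOT NULL AUTO_INCREMENT, "
        "ADD UNIQUE KEY uk_row_id (row_id)"
    ))

# Use row_id as a tie-breaker so rows with identical timestamps come back
# in a stable, insertion-like order.
result = pd.read_sql(
    "SELECT TIMESTAMP, DEVICE, VALUE FROM df "
    "WHERE DEVICE IN ('A', 'B') "
    "ORDER BY TIMESTAMP ASC, row_id ASC",
    engine,
)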
Suppose we have a very, very large table similar to this one.
If we use PySpark to read the whole table and run groupBy("id").agg({"value": "sum"}), compared with reading the table partially (filtering on date first), running groupBy("id").agg({"value": "sum"}) on each slice, and then summing all the partial values together, which one should be faster?
Can you please elaborate on your question? I am assuming that you want to perform the following operations:
case-1: sum(value) group by id (aggregates over all dates)
case-2: sum(value) group by id where date = 1
In either case, your performance will depend on the following:
The cardinality of your id column: whether you have a very large number of unique values or a small number of values that repeat.
The file format you are using: columnar (like Parquet) vs. row-based (like CSV).
The partitioning and bucketing strategy you are using for storing the files. If the date column holds only a few distinct values, go for partitioning; otherwise use bucketing.
All these factors help determine whether the above 2 cases will show similar processing times or drastically different ones. This is because reading a large amount of data takes more time than reading less (pruned) data with the given filters. Your shuffle block size and partition count also play a key role.
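To make the two cases concrete, here is a rough PySpark sketch. It assumes an existing SparkSession, a DataFrame df with id, date and value columns, and a Python list dates of the date values you care about; all of these are assumptions, not taken from your question.

from functools import reduce
from pyspark.sql import functions as F

# Case 1: aggregate over the whole table in one pass
case1 = df.groupBy("id").agg(F.sum("value").alias("value_sum"))

# Case 2: aggregate each date slice separately, then combine the partial sums
partials = [
    df.where(F.col("date") == d).groupBy("id").agg(F.sum("value").alias("value_sum"))
    for d in dates
]
case2 = (
    reduce(lambda a, b: a.unionByName(b), partials)
    .groupBy("id")
    .agg(F.sum("value_sum").alias("value_sum"))
)

With a columnar, date-partitioned source, the filters in case 2 can prune files at read time; otherwise both cases scan the same data, and case 1 avoids the extra union and re-aggregation.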
I have a question regarding reading large amounts of data from a Postgres database and processing it in parallel using Spark. Let's assume I have a table in Postgres that I would like to read into Spark using JDBC. Let's assume it has the following columns:
id (bigint)
date (datetime)
many other columns (different types)
Currently the Postgres table is not partitioned. I would like to transform a lot of data in parallel, and eventually store the transformed data somewhere else.
Question: How can we optimize parallel reading of the data from Postgres?
The documentation (https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html) suggests using a partitionColumn to process the queries in parallel; in addition, you are required to set lowerBound and upperBound. From what I understand, in my case I could use either the id or the date column as the partitionColumn. The problem, however, is how to set the lowerBound and upperBound values when partitioning on one of those columns; I noticed that data skew arises if they are not set properly. For processing in Spark, I do not care about natural partitions. I just need to transform all the data as quickly as possible, so I think optimizing for unskewed partitions would be preferred.
I have come up with a solution for this, but I am unsure whether it actually makes sense. Essentially it hashes the ids into partitions: I use mod() on the id column with a specified number of partitions. The dbtable field would then be something like:
"(SELECT *, mod(id, <<num-parallel-queries>>) as part FROM <<schema>>.<<table>>) as t"
And then I use partitionColumn="part", lowerBound=0, and upperBound=<<num-parallel-queries>> as options for the Spark JDBC read job.
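For concreteness, this is roughly how I would wire it up in PySpark (the connection details and the partition count of 16 are placeholders):

num_parallel_queries = 16  # placeholder for <<num-parallel-queries>>

query = f"(SELECT *, mod(id, {num_parallel_queries}) AS part FROM myschema.mytable) AS t"

df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://host:5432/mydb")  # placeholder
    .option("user", "me")                               # placeholder
    .option("password", "secret")                       # placeholder
    .option("dbtable", query)
    .option("partitionColumn", "part")
    .option("lowerBound", 0)
    .option("upperBound", num_parallel_queries)
    .option("numPartitions", num_parallel_queries)
    .load()
)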
Please, let me know if this makes sense!
It is a good idea to "partition" by the primary key column.
To get partitions of equal size, use the table statistics:
SELECT histogram_bounds::text::bigint[]
FROM pg_stats
WHERE tablename = 'mytable'
AND attname = 'id';
If you have default_statistics_target at its default value of 100, this will be an array of 101 values that delimit the percentiles from 0 to 100. You can use this to partition your table evenly.
For example: if the array looks like {42,10001,23066,35723,49756,...,999960} and you need 50 partitions, the first would be all rows with id < 23066, the second all rows with 23066 ≤ id < 49756, and so on.
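If you want to feed those bounds straight into Spark, one option is to build explicit JDBC predicates from consecutive bounds. This is only a sketch: the connection details, schema and table names are placeholders, it assumes an existing SparkSession named spark, and it uses psycopg2 for the one-off statistics query.

import psycopg2

# One-off query for the histogram bounds of the id column
conn = psycopg2.connect("dbname=mydb user=me password=secret host=host")  # placeholder
with conn, conn.cursor() as cur:
    cur.execute("""
        SELECT histogram_bounds::text::bigint[]
        FROM pg_stats
        WHERE tablename = 'mytable' AND attname = 'id'
    """)
    bounds = cur.fetchone()[0]

# One predicate per pair of consecutive bounds; leave the first and last
# ranges open-ended so ids outside the sampled bounds are not dropped.
predicates = []
for i in range(len(bounds) - 1):
    lo = "" if i == 0 else f"id >= {bounds[i]}"
    hi = "" if i == len(bounds) - 2 else f"id < {bounds[i + 1]}"
    predicates.append(" AND ".join(p for p in (lo, hi) if p))

df = spark.read.jdbc(
    url="jdbc:postgresql://host:5432/mydb",           # placeholder
    table="myschema.mytable",                         # placeholder
    predicates=predicates,
    properties={"user": "me", "password": "secret"},  # placeholder
)

With the default 101 bounds this yields 100 partitions; merge consecutive ranges if you want fewer, as in the 50-partition example above.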
I am trying to turn the values in a column into separate columns, with the values coming from another column, similar to this post, except dynamically. In other words, I'm trying to reshape a table from long to wide format, similar to the functionality of spread in R or pivot in pandas.
Is there a way to pivot a table in Athena dynamically, without having to hard-code the columns to pull?
No, there is no way to write a query that results in a different number of columns depending on the data. The columns must be known before query execution starts.
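Because the columns have to be known up front, the usual workaround is to build the query in two steps from the client side: first fetch the distinct values, then generate the pivot columns. A rough Python sketch, where run_query is a hypothetical helper that runs SQL on Athena (for example via PyAthena) and my_table, key_col, value_col and id are placeholder names:

# run_query, my_table, key_col and value_col are placeholders, not real objects.
keys = [row[0] for row in run_query("SELECT DISTINCT key_col FROM my_table")]

pivot_cols = ",\n  ".join(
    f"max(CASE WHEN key_col = '{k}' THEN value_col END) AS \"{k}\"" for k in keys
)
pivot_sql = f"SELECT id,\n  {pivot_cols}\nFROM my_table\nGROUP BY id"
wide_rows = run_query(pivot_sql)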
I'm not sure if this has been answered before; I didn't find anything in a quick search.
My table is built in a random order, but thereafter it is modified very rarely. I do frequent selects from the table and in each select I need to order the query by the same column. Now is there a way to sort a table permanently by a column so that it does not need to be done again for each select?
You can add an index on the column you want to sort by. The index keeps the data presorted, so the database can return rows in that order without sorting at query time.
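For example (a sketch only; the connection URL, table and column names are placeholders):

from sqlalchemy import create_engine, text

engine = create_engine("sqlite:///example.db")  # placeholder; same idea for any RDBMS

with engine.begin() as conn:
    # One-time: index the column you always sort by
    conn.execute(text("CREATE INDEX ix_mytable_sort_key ON mytable (sort_key)"))

with engine.connect() as conn:
    # Subsequent ordered reads can be satisfied from the index
    rows = conn.execute(text("SELECT * FROM mytable ORDER BY sort_key")).fetchall()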
You can have only one place where you define it, and re-use that for every query:
def base_query(session, what_for):
    return session.query(what_for).order_by(what_for.rank_or_whatever)
Expand that as needed, then for all but very complex queries you can use that like so:
some_query = base_query(session(), Employee).filter(Employee.feet > 3)
The resulting query will be ordered by Employee.rank_or_whatever. If you are always querying for the same thing, you won't have to pass it as an argument, of course.
EDIT: If you could somehow define a "permanent" order on your table that the engine observes without being given an ORDER BY, I'd think it would be an implementation feature specific to the RDBMS you use, and just a convenience. Internally it makes no sense for a DBMS to be coerced into how it stores the data, since retrieving data in a specific order is easily and efficiently accomplished by using an INDEX; forcing a specific storage order would probably decrease overall performance.
MyModel.objects.all().order_by('-timestamp')
There is a unique key on timestamp and another column:
`code` varchar(3) NOT NULL,
`timestamp` date NOT NULL,
UNIQUE KEY `exchangerate_currency_70883b95_uniq` (`code`,`timestamp`)
All I want is to obtain the latest row in the table.
The query achieves that, but I am thinking ahead to when the table grows to 100K rows.
Are there glaring performance problems with this query and schema?
Without seeing your full query and your schema, it's impossible to do more than speculate.
But note that the order of columns in a multi-column index in MySQL matters. This means that your index on (code, timestamp) is likely unusable for ordering by timestamp. If you changed the order to (timestamp, code), it probably would be usable for ORDER BY timestamp, though it might hurt performance for other queries.
If your usage requires an index for both, you may need to create a second index on just the timestamp column.
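For example, in Django you could declare the extra index on the model. This is only a sketch; the model and field names are guesses based on the constraint you showed, not your actual code.

from django.db import models

class ExchangeRate(models.Model):
    code = models.CharField(max_length=3)
    timestamp = models.DateField()

    class Meta:
        constraints = [
            models.UniqueConstraint(fields=["code", "timestamp"],
                                    name="exch_code_timestamp_uniq"),
        ]
        indexes = [
            # Second index on timestamp alone so ORDER BY timestamp can use it
            models.Index(fields=["timestamp"], name="exch_timestamp_idx"),
        ]

# Fetching the latest row can then be served almost entirely from the index:
latest = ExchangeRate.objects.order_by("-timestamp").first()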
Since your unique key is on (code, timestamp), the index can't be used if there is no code in the WHERE clause (and you'll see filesort in the EXPLAIN output). If there is a code = ... condition, the query will be fast enough. However, if 100,000+ records match the WHERE clause and you want to order by timestamp, it will be extremely fast if you also limit the timestamp range to, say, the last several days (you have to choose the limit based on your application).