I have a large table of timeseries data, roughly 800 million rows, and I need to index it properly. My UI has dropdown menu inputs as query selectors, allowing users to update the dataset/visualization. There are 7 potential user inputs that would prompt a query on the table.
Generally the query order stays consistent: Stage > Week > Team > Opponent > Map > Round > Stat. Should I be creating a single multi-column index on this sequence? Should I be applying multiple multi-column indexes? Or, as a third option, should I index each of the user-input columns individually? Which is the most efficient approach?
    def timeseries(map, stage, week, stat, team, opponent, round):
        teams = [team, opponent]
        # id_dict resolves the user's selections to a single match id.
        out = df[df.match_id == id_dict[stage][week][team][opponent]]
        out = out[out.mapname == map]
        out = out[out.stat_type == stat]
        out = out[out.team.isin(teams)]
        out = out[out.map_round == round]
        return out  # passed on to the visualization
The first filter on match_id is a bit of a workaround, as the user essentially selects a match id indirectly through their other input selectors (id_dict returns the single match id of a game).
This article may be useful depending on the version of Postgres you are running: PostGRES Indexing. To summarize:
The database will combine as many single-column indexes as it can for optimization, but it will still have to cross-reference full rows. If you know that some combinations of filters will be more popular than others, I'd create combined (multi-column) indexes on those for better performance. If you are not inserting into the table, having many indexes should not hurt.
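For illustration only (not from the answer above), here is a minimal sketch of what a combined index matching the filter pattern in the question might look like. The table name timeseries_table, the index name, and the psycopg2 connection string are placeholders; the column names are taken from the question's filter code.

    import psycopg2

    # Placeholder connection string -- replace with your own.
    conn = psycopg2.connect("dbname=stats user=postgres")
    cur = conn.cursor()

    # One combined index covering the columns the UI always filters on,
    # led by match_id since the UI resolves to a single match anyway.
    cur.execute("""
        CREATE INDEX IF NOT EXISTS idx_timeseries_lookup
        ON timeseries_table (match_id, mapname, stat_type, team, map_round);
    """)
    conn.commit()
    cur.close()
    conn.close()

Since match_id already narrows the result to a single game, an index led by match_id alone may get you most of the benefit; the trailing columns mainly help queries that skip it.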
Suppose we have a very, very large table with columns id, date, and value.
If we use PySpark to read this table and run groupby("id").agg({'value': 'sum'}), how does that compare with reading the table partially (filtering on date first), running groupby("id").agg({'value': 'sum'}) on each part, and then summing all the partial values together?
Which one should be faster?
Can you please elaborate on your question? I am assuming that you want to perform the following operations:
case-1: sum(value) group by id (summing over all dates)
case-2: sum(value) group by id where date = 1
In either case, your performance will depend on the following:
The cardinality of your id column: whether you have a very large number of unique values or a small number of unique values repeating.
The type of file format you are using: columnar (like Parquet) vs. row-based (like CSV).
The partitioning and bucketing strategy you are using for storing the files. If the date column holds few distinct values, go for partitioning; otherwise, bucketing.
All these factors will help determine whether the above two cases show similar processing times or differ drastically. This is because reading a large amount of data takes more time than reading less (pruned) data under the given filters. Your shuffle block size and number of partitions also play a key role.
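To make the two cases concrete, here is a minimal PySpark sketch; the path, the column names, and the assumption that the data is Parquet partitioned by date are all illustrative.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("groupby-comparison").getOrCreate()

    # Assumed: a Parquet dataset with columns id, date, value, partitioned by date.
    df = spark.read.parquet("/data/events")

    # Case 1: aggregate over the whole table.
    case1 = df.groupBy("id").agg(F.sum("value").alias("total"))

    # Case 2: filter a single date first, then aggregate. With partitioned
    # Parquet the filter is pushed down, so only that partition's files are read.
    case2 = (
        df.filter(F.col("date") == "2023-01-01")
          .groupBy("id")
          .agg(F.sum("value").alias("total"))
    )

If you ultimately need the full-table result, running many per-date jobs and summing the partial results yourself rarely beats letting Spark do a single shuffle, unless the per-date results are reused elsewhere.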
I want to write a system to materialize partial datasets that are repeated across the workload of a warehouse, so an expensive and common computation is computed only once, and all subsequent queries using that dataset (or a subset of that dataset) do a plain SELECT from the materialized dataset instead of running the query again. The warehouse is using Spark + HDFS.
Finding the most common queries, subqueries and CTEs was easy and already solved.
What I'm stuck on now is finding materialization candidates: I can't devise a method to identify whether one query is a superset of another (or, conversely, whether one query is a subset of another). If I knew that, it would be possible to do query rewriting and replace the expensive subquery with the already materialized table.
One way I was thinking about this is to build a graph where queries and sub-queries are nodes and the edges represent the fact that one query is answered by the other. But I'm having a hard time building the edges of this graph.
In summary, given two SQL statements, for example:
Q1: SELECT col1, col2 FROM table;
and
Q2: SELECT col1 FROM table WHERE col2 > 0;
How can I determine Q2 is a subset of Q1?
This is a simple example; joins, aggregations, unions, and WHERE and HAVING conditions must also be taken into consideration.
Evaluating the datasets themselves is not an option; this should be determined purely from the SQL statements and/or the execution plans (we are talking about multi-petabyte datasets, and comparing datasets is too resource-intensive).
My ideas so far: take the lists of projections and compare them; if the list of projected columns of Q1 contains the list of projections of Q2, then Q1 might be a superset of Q2.
Also check the columns involved in the filters and which filters are being applied; a sketch of this idea follows below.
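A minimal sketch of that heuristic, using plain Python data structures rather than a real SQL parser. In practice you would extract the column and predicate sets from a parsed AST or the execution plan; all names here are illustrative, and the check gives necessary (syntactic) conditions only, not a proof of containment.

    from dataclasses import dataclass, field

    @dataclass(frozen=True)
    class QuerySummary:
        table: str
        # All columns the query needs: its SELECT list plus any columns it filters on.
        columns: frozenset
        # Normalized filter predicates, e.g. "col2 > 0".
        predicates: frozenset = field(default_factory=frozenset)

    def may_contain(q1: QuerySummary, q2: QuerySummary) -> bool:
        """True if q1 *might* be a superset of q2 (necessary, not sufficient, conditions)."""
        return (
            q1.table == q2.table
            # q1 must project every column q2 needs, including those q2 filters on.
            and q1.columns >= q2.columns
            # Every predicate of q1 must also appear in q2, so q1 keeps all rows q2 needs.
            and q1.predicates <= q2.predicates
        )

    # The Q1 / Q2 example from above:
    q1 = QuerySummary("table", frozenset({"col1", "col2"}))
    q2 = QuerySummary("table", frozenset({"col1", "col2"}), frozenset({"col2 > 0"}))
    print(may_contain(q1, q2))  # True: Q2 could be answered from a materialized Q1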
My problem is as follows:
I have a large dataframe called details containing 900K rows, and another one containing 80M rows named attributes.
Both have a column A on which I would like to do a left outer join, the left dataframe being details.
There are only 75K unique entries in column A in the details dataframe. The attributes dataframe has 80M unique entries in column A.
What is the best possible way to achieve the join operation?
What have I tried?
The simple join, i.e. details.join(attributes, "A", how="left_outer"), just times out (or runs out of memory).
Since there are only 75K unique entries in column A in details, we don't care about the rest of the attributes dataframe. So, first I filter it using:
    uniqueA = details.select('A').distinct().collect()
    uniqueA = map(lambda x: x.A, uniqueA)
    attributes_filtered = attributes.filter(attributes.A.isin(*uniqueA))
I thought this would work out because the attributes table comes down from 80M rows to a mere 75K rows. However, the join still takes forever to complete (and it never finishes).
Next, I thought that there were too many partitions and that the data to be joined was not on the same partitions. Although I don't know how to bring all the data to the same partition, I figured repartitioning might help. So here it goes:
    details_repartitioned = details.repartition("A")
    attributes_repartitioned = attributes.repartition("A")
The above operation brings down the number of partitions in attributes from 70K to 200. The number of partitions in details is about 1100.
    # tried without broadcast too
    details_attributes = details_repartitioned.join(
        broadcast(attributes_repartitioned), "A", how='left_outer')
After all this, the join still doesn't work. I am still learning PySpark, so I might have misunderstood the fundamentals behind repartitioning. If someone could shed light on this, it would be great.
P.S. I have already seen this question, but it does not answer mine.
The details table has 900K rows with 75K distinct entries in column A. I think the filter on column A you have tried is the right direction. However, the collect followed by the map operation and the filter
    attributes_filtered = attributes.filter(attributes.A.isin(*uniqueA))
are too expensive. An alternate approach would be:
    from pyspark import StorageLevel

    uniqueA = details.select('A').distinct().persist(StorageLevel.DISK_ONLY)
    uniqueA.count()  # materialize, breaking the DAG lineage
    attrJoined = attributes.join(uniqueA, "A", "inner")
Also, you probably need to set the shuffle partitions correctly if you haven't done so yet.
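For example (the value 400 is only an illustrative starting point; tune it to your data and cluster):

    # The default of 200 shuffle partitions is often too low for joins of this size.
    spark.conf.set("spark.sql.shuffle.partitions", "400")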
One problem that could be happening in your dataset is skew: it could be that among the 75K unique values, only a few join with a large number of rows in the attributes table. In that case the join can take much longer and may never finish.
To resolve that, you need to find the skewed values of column A and process them separately.
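Not part of the answer above, but one common way to process skewed keys is salting: spread the repeated keys of the larger side over several artificial sub-keys and replicate the smaller side to match. A minimal sketch, assuming spark is the active SparkSession and details / attributes_filtered are the dataframes from the question; the salt factor is arbitrary.

    from pyspark.sql import functions as F

    SALT = 8  # assumed salt factor; increase for heavier skew

    # Add a random salt to the side whose keys repeat (details).
    details_salted = details.withColumn("salt", (F.rand() * SALT).cast("int"))

    # Replicate the small side across every salt value so each salted key finds a match.
    salts = spark.range(SALT).withColumnRenamed("id", "salt")
    attributes_salted = attributes_filtered.crossJoin(salts)

    joined = (
        details_salted
        .join(attributes_salted, on=["A", "salt"], how="left_outer")
        .drop("salt")
    )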
I was wondering how pandas handles memory usage in Python, and more specifically how memory is handled if I assign the result of a dataframe query to a variable. Under the hood, would it just hold references to the original dataframe's data, or would I be cloning all of the data?
I'm afraid of memory ballooning out of control, but I have a dataframe with non-unique fields that I can't index it by. It's incredibly slow to query and plot data from it using commands like df[(df[''] == x) & (df[''] == y)].
(They're both integer values in the rows. They're also not unique, hence the fact that it returns multiple results.)
I'm very new to pandas, but any insight into how to handle a situation where I'm looking for the rows where two conditions match would be great too. Right now I'm using an O(n) algorithm to loop through and index it myself, because even that runs faster than the search queries when I need to access the data quickly. Watching my system take twenty seconds on a dataset of only 6,000 rows is foreboding.
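One approach worth trying (not from the question itself): build a sorted MultiIndex over the two lookup columns once, then use .loc for each lookup instead of scanning the whole frame with boolean masks. A minimal sketch with placeholder column names col_a and col_b and made-up data:

    import pandas as pd

    # Small stand-in for the real data; col_a and col_b are placeholder names
    # for the two non-unique integer columns used in the lookups.
    df = pd.DataFrame({
        "col_a": [1, 1, 2, 2, 2],
        "col_b": [7, 7, 7, 8, 8],
        "value": [10, 20, 30, 40, 50],
    })

    # Build the sorted MultiIndex once, then reuse it for every lookup.
    indexed = df.set_index(["col_a", "col_b"]).sort_index()

    # All rows where col_a == 1 and col_b == 7; with a non-unique index this
    # returns a DataFrame containing every matching row.
    subset = indexed.loc[(1, 7)]

As for the memory question: filtering with a boolean mask returns a new object that copies the selected rows, so assigning a query result to a variable duplicates those rows, not the whole original frame.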
I'm not looking for count or filter expressions; I just want to select from the 5th record to the 10th record, if that makes sense.
I'm working with a very large table but only with small sections at a time. Currently, each time I need a section, I query my entire table and choose my section from the result. Is there a faster way to do it, for example by selecting only the records between index 5 and index 10? (My table is indexed, by the way.)
Looking at the documentation, it looks as if you could use the slice filter, or use limit and offset.
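As a rough illustration of the LIMIT/OFFSET route, assuming the table lives in a SQL database; the sqlite3 module, the table name, and the ordering column are placeholders for whatever you are actually using.

    import sqlite3

    conn = sqlite3.connect("example.db")  # placeholder database
    cur = conn.cursor()

    # Skip the first 4 rows and return the next 6, i.e. records 5 through 10
    # in the chosen order. Without an ORDER BY the window is not deterministic.
    cur.execute("SELECT * FROM my_table ORDER BY id LIMIT 6 OFFSET 4")
    rows = cur.fetchall()

    conn.close()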