athena - dynamically pivot rows to columns - python

I am trying to turn the values in a column into separate columns, with the values coming from another column, similar to this post, except dynamically. In other words, I'm trying to turn a table from long to wide format, similar to the functionality of spread in r or pivot in python.
Is there a way to pivot a table in athena dynamically -- without having to hard code the columns to pull?

No, there is no way to write a query that results in a different number of columns depending on the data. The columns must be known before query execution starts.
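If a client-side pivot is acceptable, one option is to pull the long-format result into pandas and pivot there. A minimal sketch; the DataFrame below stands in for a result fetched from Athena, and the column names are made up:
import pandas as pd

# Stand-in for a long-format result fetched from Athena
long_df = pd.DataFrame({
    "id":    [1, 1, 2, 2],
    "key":   ["color", "size", "color", "size"],
    "value": ["red", "L", "blue", "M"],
})

# Long -> wide: the distinct values of "key" become columns dynamically
wide_df = long_df.pivot(index="id", columns="key", values="value").reset_index()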

Related

How to check differences between column values in pandas?

I'm manually comparing two or three very similar rows using pandas. Is there a more automated way to do this? I would like a better method than using '=='.
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.diff.html
See if this will satisfy your needs.
df['sales_diff'] = df['sales'].diff()
The above snippet creates a new column in your DataFrame that contains, by default, the difference from the previous row. You can play around with the axis parameter to take differences across rows or columns, and change periods to compare against a row (or column) a given number of steps away.
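A small example of those options; the data here is made up:
import pandas as pd

df = pd.DataFrame({"sales": [100, 120, 90, 150], "returns": [5, 7, 3, 4]})

df["sales_diff"] = df["sales"].diff()             # difference from the previous row
df["sales_diff_2"] = df["sales"].diff(periods=2)  # difference from two rows back
col_diff = df[["sales", "returns"]].diff(axis=1)  # difference between adjacent columns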

Parallel reading and processing from Postgres using (Py)Spark

I have a question regarding reading large amounts of data from a Postgres database and processing it with Spark in parallel. Let's assume I have a table in Postgres I would like to read into Spark using JDBC, and that it has the following columns:
id (bigint)
date (datetime)
many other columns (different types)
Currently the Postgres table is not partitioned. I would like to transform a lot of data in parallel, and eventually store the transformed data somewhere else.
Question: How can we optimize parallel reading of the data from Postgres?
The documentation (https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html) suggests using a partitionColumn to process the queries in parallel. In addition, one is required to set lowerBound and upperBound. From what I understand, in my case I could use either the id or the date column as partitionColumn. However, the problem is how to set the lowerBound and upperBound values when partitioning on one of these columns; I noticed that data skew arises if they are not set properly. For processing in Spark, I do not care about natural partitions. I just need to transform all data as quickly as possible, so optimizing for unskewed partitions would be preferred, I think.
I have come up with a solution for this, but I am unsure whether it actually makes sense. Essentially it hashes the ids into partitions: use mod() on the id column with a specified number of partitions, so that the dbtable field would be something like:
"(SELECT *, mod(id, <<num-parallel-queries>>) as part FROM <<schema>>.<<table>>) as t"
And then I use partitionColumn="part", lowerBound=0, and upperBound=<<num-parallel-queries>> as options for the Spark read JDBC job.
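Put together, the read would look roughly like this in PySpark (a sketch; the connection details are placeholders):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

num_parallel_queries = 50  # number of partitions / concurrent queries

df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://<<host>>:5432/<<database>>")  # placeholder
    .option("dbtable",
            f"(SELECT *, mod(id, {num_parallel_queries}) AS part "
            "FROM <<schema>>.<<table>>) AS t")
    .option("partitionColumn", "part")
    .option("lowerBound", 0)
    .option("upperBound", num_parallel_queries)
    .option("numPartitions", num_parallel_queries)
    .option("user", "<<user>>")        # placeholder
    .option("password", "<<password>>")  # placeholder
    .load()
)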
Please, let me know if this makes sense!
It is a good idea to "partition" by the primary key column.
To get partitions of equal size, use the table statistics:
SELECT histogram_bounds::text::bigint[]
FROM pg_stats
WHERE tablename = 'mytable'
AND attname = 'id';
If you have default_statistics_target at its default value of 100, this will be an array of 101 values that delimit the percentiles from 0 to 100. You can use this to partition your table evenly.
For example: if the array looks like {42,10001,23066,35723,49756,...,999960} and you need 50 partitions, the first would be all rows with id < 23066, the second all rows with 23066 ≤ id < 49756, and so on.
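One way to feed those bounds to Spark is to turn consecutive bound pairs into JDBC predicates, one per partition. A rough PySpark sketch; the bounds shown and the connection details are placeholders:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical bounds from the pg_stats query above; fetch them with any
# Postgres client (e.g. psycopg2) before the Spark read.
bounds = [42, 23066, 49756, 999960]

# One WHERE predicate per partition from consecutive bound pairs;
# the first and last ranges are left open so no rows are missed.
predicates = [f"id >= {lo} AND id < {hi}" for lo, hi in zip(bounds[:-1], bounds[1:])]
predicates[0] = f"id < {bounds[1]}"
predicates[-1] = f"id >= {bounds[-2]}"

df = spark.read.jdbc(
    url="jdbc:postgresql://<<host>>:5432/<<database>>",   # placeholder
    table="<<schema>>.<<table>>",
    predicates=predicates,
    properties={"user": "<<user>>", "password": "<<password>>"},
)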

How to read two tables from single excel sheet using python?

Please refer to this image: Two tables in a single Excel sheet
I need dynamic Python code that can read two tables from a single Excel sheet without specifying the header positions. The number of columns and rows can change over time.
Please help!
It's a little hard for me personally to write the actual code for something like this without the Excel file itself, but I can definitely tell you the strategy/steps for dealing with it. As you know, pandas treats it as a single DataFrame, which means you should too. The trick is not to get fooled into thinking that this is truly structured data that works with the same logic as a structured table. Think of the task less as cleaning structured data and more as telling a computer how to measure and cut a piece of paper. Instead of approaching it as two tables, think of it as one large DataFrame whose rows fall into three categories:
Rows with nothing
Rows that you want to end up in the first table
Rows that you want to end up in the second table
The first thing to do is try to create a column that will sort the rows into those three groups. Looking at it, I would rely on the cells that say "information about table (1/2)". You can create a column that is 1 if the first column contains "table 1", 2 if it contains "table 2", and null otherwise. You may be worried that all of the actual table values end up null in this new column. Don't be yet.
Now, with the new column, you want to use the .ffill() method on the column. This will take all of the non-null values in the column and propagate them downwards to all available null values. At this point, all rows of the first table will have 1 for the column and the rows for the second table will have 2. We have the first major step out of the way.
Now, the first column should still have null values because you haven't done anything with it. Fortunately, those null values only exist where the entire row is empty, so drop all rows with a null value in the first column. You should now be able to create two new DataFrames using Boolean masking.
e.g.: df1 = df.loc[df["filter"] == 1].copy(deep=True)
You will still have the columns and headers to handle/clean up how you'd like, but at this point, it should be much easier for you to clean those up from a single table rather than two tables smashed together within a DataFrame.
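A minimal sketch of those steps, assuming the file is called two_tables.xlsx and that the marker cells in the first column contain text like "table 1" / "table 2" (both are assumptions):
import pandas as pd

df = pd.read_excel("two_tables.xlsx", header=None)  # hypothetical file name

first_col = df.iloc[:, 0].astype(str).str.lower()

# 1 for rows whose first cell mentions "table 1", 2 for "table 2", NaN otherwise
df["filter"] = float("nan")
df.loc[first_col.str.contains("table 1"), "filter"] = 1
df.loc[first_col.str.contains("table 2"), "filter"] = 2

# Propagate the marker downwards so every row knows which table it belongs to
df["filter"] = df["filter"].ffill()

# Drop the fully empty separator rows (first cell is null)
df = df[df.iloc[:, 0].notna()]

# Split into two DataFrames with Boolean masks
df1 = df.loc[df["filter"] == 1].copy(deep=True)
df2 = df.loc[df["filter"] == 2].copy(deep=True)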

Properly indexing large Time Series dataset for complex query

I have a large table of time series data, roughly 800 million rows, and I need to index it properly. My UI has dropdown menu inputs as query selectors, allowing users to update the dataset/visualization. There are 7 potential user inputs that would prompt a query on the table.
Generally the query order stays consistent: Stage > Week > Team > Opponent > Map > Round > Stat. Should I create a single multi-column index on this sequence, apply multiple multi-column indexes, or, as a third option, index each of the user-input columns individually? Which is the most efficient approach?
def timeseries(df, map, stage, week, stat, team, opponent, round):
    teams = [team, opponent]
    # id_dict maps the selectors to a single match id
    df = df[df.match_id == id_dict[stage][week][team][opponent]]
    df = df[df.mapname == map]
    df = df[df.stat_type == stat]
    df = df[df.team.isin(teams)]
    df = df[df.map_round == round]
    return df  # handed off to the visualization
The first filter of match_id is a bit of a work around, as the user essentially selects a match id indirectly based on their other input selectors. (id_dict returns a single match id of a game)
This article may be useful depending on the version of Postgres you are running: PostGRES Indexing. To summarize:
The database will combine as many single-column indexes as it can for optimization, but it will still have to cross-reference full rows. If you know that some combinations of columns will be queried more often than others, I'd create combined indexes on those for better performance. If you are not inserting into the table, having many indexes should not hurt.
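For the filter pattern shown in the code above, a combined index could be created roughly like this (a sketch using psycopg2; the connection parameters and the table name timeseries are placeholders, and the column list simply mirrors the pandas filters above):
import psycopg2

conn = psycopg2.connect("dbname=<<database>> user=<<user>> password=<<password>>")
with conn, conn.cursor() as cur:
    # One multi-column index covering the columns the UI filters on
    cur.execute("""
        CREATE INDEX IF NOT EXISTS idx_timeseries_filters
        ON timeseries (match_id, mapname, stat_type, team, map_round)
    """)
conn.close()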

SQL ORDER BY Timestamp, identical values

I've got a SQL database (MariaDB) with a timestamp index. Sadly there are multiple rows with identical timestamps. The best way would be to reject the current data and start using a unique index, but sadly that is not easily done.
When querying like this:
SELECT TIMESTAMP, DEVICE, VALUE FROM df WHERE DEVICE IN ('A', 'B') ORDER BY TIMESTAMP ASC
The order of the elements with identical timestamps isn't the same as the order when looking at the complete data without ORDER BY TIMESTAMP ASC. I would like to get the data in the same order it was written into the database. I am querying in Python with pandas, and my workaround would be to prepare the complete data in Python, but that is slower.
Can you help me? I know it should be done different in the first place but maybe there is a work-around.
Fabian
SQL doesn't guarantee the order the data is retrieved in. You need to use another column to force retrieval in a specific order. Don't you have another column you can use in the ORDER BY?
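For example, if the table has (or can be given) an AUTO_INCREMENT id column, the tie can be broken like this (a sketch; the id column and the connection URL are assumptions):
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("mysql+pymysql://<<user>>:<<password>>@<<host>>/<<database>>")

query = """
    SELECT TIMESTAMP, DEVICE, VALUE
    FROM df
    WHERE DEVICE IN ('A', 'B')
    ORDER BY TIMESTAMP ASC, id ASC  -- id breaks ties between identical timestamps
"""
data = pd.read_sql(query, engine)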
