Flatten pyspark nested structure - Pyspark - python

I want to flatten a nested column into separate columns, keeping only a few specific values.
First off, I start with an Avro capture file from Event Hub. Once I convert that to a dataframe in Python, I am left with the following column after removing the irrelevant ones.
This column has the following structure.
What I want to do next is flatten this column, keeping only specific values.
I can get this done for one cell, but because I am dealing with a nested structure, the column is not iterable.
Can anyone help me out?

You have 2 data fields in your schema, so I'd call the first one data_struct and the second one data_array. If you do avro_df_body.select('data_struct.data_array'), you'd have an ArrayType column, to which you can apply the explode function to break it down into multiple rows.
import pyspark.sql.functions as F

(avro_df_body
    # one row per element of the nested array
    .withColumn('tmp', F.explode('data_struct.data_array'))
    # pull the fields of interest out of each element
    .withColumn('speed', F.col('tmp.a'))
    .withColumn('timestamp', F.col('tmp.time'))
    .show()
)
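If you don't need the intermediate tmp column afterwards, you can drop it with .drop('tmp') before showing or saving the result.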

Related

Splitting a dataframe column into two separate data frames using pandas

I am using Python in a Jupyter notebook.
I'm trying to use pandas to split a dataframe into two separate dataframes based on the value of a column called "PostTypeId": one dataframe, to be called Questions, should contain the rows where the column value is 1, and the second dataframe, to be called Answers, should contain the rows where the column value is 2. I'm asked to do all this by defining a function called split_df.
Wondering how I would go about this.
Thanks so much :)
You can do it with boolean masks on that column:
Questions = df[df['PostTypeId'] == 1]
Answers = df[df['PostTypeId'] == 2]
When creating the function, pass the filter values as arguments.
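Since the exercise asks for a function named split_df, a minimal sketch could look like this (the posts variable and the default column name are assumptions based on the question):

def split_df(df, column='PostTypeId', question_value=1, answer_value=2):
    # rows marked as questions
    questions = df[df[column] == question_value]
    # rows marked as answers
    answers = df[df[column] == answer_value]
    return questions, answers

# posts is assumed to be the dataframe loaded earlier in the notebook
questions, answers = split_df(posts)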

Why do double square brackets create a DataFrame with loc or iloc?

Comparing:
df.loc[:,'col1']
df.loc[:,['col1']]
Why does (2) create a DataFrame, while (1) creates a Series?
In principle, when it's a list, it can be a list of more than one column name, so it's natural for pandas to give you a DataFrame, because only a DataFrame can host more than one column. However, when it's a string instead of a list, pandas can safely say that it's just one column, and thus giving you a Series won't be a problem. Take the two formats and two outcomes as reasonable flexibility to get whichever you need, a Series or a DataFrame; sometimes you specifically need one of the two.
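A quick illustration of the difference, using a small made-up frame:

import pandas as pd

df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})

print(type(df.loc[:, 'col1']))    # <class 'pandas.core.series.Series'>
print(type(df.loc[:, ['col1']]))  # <class 'pandas.core.frame.DataFrame'>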

Converting for loop to numpy calculation for pandas dataframes

So I have a Python script that compares two dataframes and finds any rows that are not in both dataframes. It currently iterates through a for loop, which is slow.
I want to improve the speed of the process, and I know that iteration is the problem. However, I haven't had much luck with various pandas and numpy methods such as merge and where.
Couple of caveats:
The column names from my file sources aren't the same, so I set their names into variables and use the variable names to compare.
I want to only use the column names from one of the dataframes.
df_new represents new information to be checked against what is currently on file (df_current)
My current code:
set_current = set(df_current[current_col_name])
df_out = pd.DataFrame(columns=df_new.columns)
for i in range(len(df_new.index)):
    # if the row entry is new, we add it to our dataset
    if not df_new[new_col_name][i] in set_current:
        df_out.loc[len(df_out)] = df_new.iloc[i]
    # if the row entry is a match, then we aren't going to do anything with it
    else:
        continue
# create a xlsx file with the new items
df_out.to_excel("data/new_products_to_examine.xlsx", index=False)
Here are some simple examples of dataframes I would be working with:
df_current
|partno|description|category|cost|price|upc|brand|color|size|year|
|:-----|:----------|:-------|:---|:----|:--|:----|:----|:---|:---|
|123|Logo T-Shirt||25|49.99||apple|red|large|2021|
|456|Knitted Shirt||35|69.99||apple|green|medium|2021|
df_new
|mfgr_num|desc|category|cost|msrp|upc|style|brand|color|size|year|
|:-------|:---|:-------|:---|:---|:--|:----|:----|:----|:---|:---|
|456|Knitted Shirt||35|69.99|||apple|green|medium|2021|
|789|Logo Vest||20|39.99|||apple|yellow|small|2022|
There are usually many more columns in the current sheet, but I wanted the table displayed to be somewhat readable. The key is that I would only want the columns in the "new" dataframe to be output.
I would want to match partno with mfgr_num since the spreadsheets will always have them, whereas some items don't have upc/gtin/ean.
It's still a bit unclear what you want without examples of each dataframe. But if you want to test unique IDs held in differently named columns of two dataframes, try an approach like this.
Find the IDs that exist in the second dataframe:
test_ids = df2['cola_id'].unique().tolist()
Then filter the first dataframe for those IDs:
df1[df1['keep_id'].isin(test_ids)]
Here is the answer that works; it was supplied to me by someone much smarter.
df_out = df_new[~df_new[new_col_name].isin(df_current[current_col_name])]
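Applied to the example above, a short end-to-end sketch might look like this (the key-column variables simply mirror the names used in the question):

current_col_name = 'partno'   # key column in df_current
new_col_name = 'mfgr_num'     # key column in df_new

# keep only the rows of df_new whose key does not appear in df_current
df_out = df_new[~df_new[new_col_name].isin(df_current[current_col_name])]

# write the new items to an xlsx file, as in the original loop version
df_out.to_excel("data/new_products_to_examine.xlsx", index=False)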

Break a dictionary out of a StringType column in a spark dataframe

I have a Spark table that I want to read in Python (I'm using Python 3 in Databricks). In effect, the structure is below. The log data is stored in a single string column, but it is a dictionary.
How do I break out the dictionary items to read them.
dfstates = spark.createDataFrame([
    ['{"EVENT_ID":"123829:0","EVENT_TS":"2020-06-22T10:16:01.000+0000","RECORD_INDEX":0}', 'texas', '24', '01/04/2019'],
    ['{"EVENT_ID":"123829:1","EVENT_TS":"2020-06-22T10:16:01.000+0000","RECORD_INDEX":1}', 'colorado', '13', '01/07/2019'],
    ['{"EVENT_ID":"123828:0","EVENT_TS":"2020-06-20T21:17:39.000+0000","RECORD_INDEX":0}', 'maine', '14', '']
]).toDF('LogData', 'State', 'Orders', 'OrdDate')
What I want to do is read the Spark table into a dataframe, find the max event timestamp, find the rows with that timestamp, then count and read just those rows into a new dataframe with the data columns, and, from the log data, add columns for event ID (without the record index), event date, and record index.
Downstream I'll be validating the data, converting from StringType to appropriate data type and filling in missing or invalid values as appropriate. All along I'll be asserting that row counts = original row counts.
The only thing I'm stuck on though is how to read this log data column and change it to something I can work with. Something in spark like pandas.series()?
You can split your single struct-type column into multiple columns using dfstates.select('LogData.*'). Refer to this answer: How to split a list to multiple columns in Pyspark?
Once you have separate columns, you can do standard PySpark operations like filtering.
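Because LogData in this example is a plain string rather than a struct, one way to get there is to parse it with from_json first and then apply the .* expansion described above. A rough sketch, assuming the schema matches the sample JSON:

from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# schema of the dictionary stored in the LogData string column
log_schema = StructType([
    StructField('EVENT_ID', StringType()),
    StructField('EVENT_TS', StringType()),
    StructField('RECORD_INDEX', IntegerType()),
])

parsed = (dfstates
    .withColumn('parsed', F.from_json('LogData', log_schema))
    .select('State', 'Orders', 'OrdDate', 'parsed.*'))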

Spark - HashingTF inputCol accepts one column but I want more

I'm trying to use HashingTF in Spark but I have one major problem.
If inputCol has only one column like this
HashingTF(inputCol="bla",outputCol="tf_features") it works fine.
But if I try to add more columns I get error message "Cannot convert list to string".
All I want to do is
HashingTF(inputCol=["a_col","b_col","c_col"], outputCol="tf_features").
Any ideas on how to fix it?
HashingTF takes one column as input. If you want to use other columns, you can create an array of these columns using the array function and then flatten them using explode, so you will have one column with values from all columns. Finally you can pass that column to HashingTF.
df2 = df.select(F.explode(F.array(F.col("a_col"), F.col("b_col"))).alias("newCol"))
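For completeness, a rough sketch of feeding a combined column into HashingTF; this variant keeps the columns together as one array instead of exploding them, since HashingTF expects an array-type input column (the column names are placeholders):

from pyspark.sql import functions as F
from pyspark.ml.feature import HashingTF

# combine the token columns into a single array column
df2 = df.withColumn('combined', F.array('a_col', 'b_col', 'c_col'))

tf = HashingTF(inputCol='combined', outputCol='tf_features')
tf_df = tf.transform(df2)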
