How to retrieve a field value from a Glue DynamicFrame by name - Python

In a Spark DataFrame you can address a column's value in the schema by using its name, like df['personId'] - but that does not work with Glue's DynamicFrame. Is there a similar way, without converting the DynamicFrame to a DataFrame, to directly access a column's values by name?

You can use select_fields, see
https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-transforms-SelectFields.html.
In your case it would be df.select_fields(["personId"]) (select_fields takes a list of field paths). Depending on what you want to do, you can save the result as a new DynamicFrame or just look at the data.
new_frame = df.select_fields(["personId"])
new_frame.show()
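For programmatic access to the values, rather than a printed preview, note that the DynamicFrame API has no collect(); the sketch below does go through toDF(), which the question hoped to avoid, but only as the final step:
# Collect the values of the selected field (drops to a DataFrame at the end)
person_ids = [row['personId'] for row in new_frame.toDF().collect()]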

Related

Is there a way to update a single pandas dataframe through a dash datatable?

I have a single pandas dataframe which is visualized through two separate dash data tables, that show different aspects of this dataframe, not the whole thing.
One of these data tables is editable, and I want the user to be able to export the changed dataframe after they're done with the editing.
I read about storing the dataframe in a hidden div and then converting it into JSON, but it seems that this approach is used when the whole dataframe is accessible through the data table, which is not the case for me.
Is there any way to accomplish this?
You can use a Dash Core Components (dcc) Store object to hold the dataframe in your layout. You can then create a callback that will update some or all of this dataframe when a change is applied to the data table.
As an example:
Given your layout file has a Store component, e.g.:
app.layout = html.Div([dcc.Store(id='my-dataframe')])
Then the following callback will read/write the dataframe (assuming import pandas as pd and the usual Dash callback imports):
@app.callback(
    Output('my-dataframe', 'data'),
    Input('my-trigger', 'n_clicks'),  # Set this to whatever should trigger the callback
    State('my-dataframe', 'data'),
)
def update_dataframe(my_trigger, my_dataframe):
    if my_dataframe is not None:
        df = pd.read_json(my_dataframe, orient='records')
    else:
        pass  # Do whatever you need to do to initialise the dataframe
    # Do whatever you need to do to update the dataframe.
    return df.to_json(orient='records')
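To then let the user export the edited dataframe, one possible follow-up (a sketch, assuming Dash >= 2.0 where dcc.Download and dcc.send_data_frame are available; the button and download ids are hypothetical and would need matching components in the layout):
@app.callback(
    Output('my-download', 'data'),           # requires dcc.Download(id='my-download') in the layout
    Input('export-button', 'n_clicks'),      # hypothetical export button
    State('my-dataframe', 'data'),
    prevent_initial_call=True,
)
def export_dataframe(n_clicks, my_dataframe):
    # Rebuild the dataframe from the Store and stream it as a CSV download
    df = pd.read_json(my_dataframe, orient='records')
    return dcc.send_data_frame(df.to_csv, 'exported.csv', index=False)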

Break a dictionary out of a StringType column in a spark dataframe

I have a Spark table that I want to read in Python (I'm using Python 3 in Databricks). In effect, the structure is below. The log data is stored in a single string column, but is a dictionary.
How do I break out the dictionary items so I can read them?
dfstates = spark.createDataFrame(
    [['{"EVENT_ID":"123829:0","EVENT_TS":"2020-06-22T10:16:01.000+0000","RECORD_INDEX":0}', 'texas', '24', '01/04/2019'],
     ['{"EVENT_ID":"123829:1","EVENT_TS":"2020-06-22T10:16:01.000+0000","RECORD_INDEX":1}', 'colorado', '13', '01/07/2019'],
     ['{"EVENT_ID":"123828:0","EVENT_TS":"2020-06-20T21:17:39.000+0000","RECORD_INDEX":0}', 'maine', '14', '']]
).toDF('LogData', 'State', 'Orders', 'OrdDate')
What I want to do is read the Spark table into a dataframe, find the max event timestamp, find the rows with that timestamp, then count and read just those rows into a new dataframe with the data columns and, from the log data, add columns for event id (without the record index), event date and record index.
Downstream I'll be validating the data, converting from StringType to appropriate data type and filling in missing or invalid values as appropriate. All along I'll be asserting that row counts = original row counts.
The only thing I'm stuck on though is how to read this log data column and change it to something I can work with. Something in spark like pandas.series()?
You can split your single struct type column into multiple columns using dfstates.select('LogData.*'). Refer to this answer: How to split a list to multiple columns in Pyspark?
Once you have separate columns, you can do standard PySpark operations like filtering.
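Since LogData in the question is a string rather than a struct, a minimal sketch of the intermediate step (schema fields taken from the sample payload above) would parse it with from_json before splitting:
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Schema for the three keys shown in the sample LogData payload
log_schema = StructType([
    StructField('EVENT_ID', StringType()),
    StructField('EVENT_TS', StringType()),
    StructField('RECORD_INDEX', IntegerType()),
])

# Parse the JSON string into a struct, then split it into columns
parsed = dfstates.withColumn('LogData', F.from_json('LogData', log_schema))
flat = parsed.select('LogData.*', 'State', 'Orders', 'OrdDate')
flat.show(truncate=False)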

How to access JSON values from PySpark dataframes with default values?

I have a Spark dataframe which has JSON in one of the columns. My task is to turn this dataframe into a columnar type of dataframe. The problem is that the JSON is dynamic and always changes structure. What I would like to do is attempt to take values from it and, in case a key does not exist, return a default value. Is there an option for this in the dataframe? This is how I am taking values out of the JSON; the problem is that if one of the levels changes name or structure, it will fail.
columnar_df = df.select(
    col('json')['level1'].alias('json_level1'),
    col('json')['level1']['level2a'].alias('json_level1_level2a'),
    col('json')['level1']['level2b'].alias('json_level1_level2b'),
)
You can do something like that with json_tuple; see
https://spark.apache.org/docs/2.2.0/api/python/pyspark.sql.html#pyspark.sql.functions.json_tuple
df.select(json_tuple(col("json"), <all the fields you want>))
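For example, a minimal sketch (key and default names are illustrative; json_tuple only reads top-level keys and returns NULL for keys that are absent, so coalesce can supply the default; nested paths need get_json_object instead):
from pyspark.sql.functions import json_tuple, get_json_object, coalesce, lit, col

# Top-level key: json_tuple yields NULL when the key is absent
flat = df.select(json_tuple(col('json'), 'level1').alias('json_level1'))

# Nested path: get_json_object also returns NULL on a miss,
# so a default can be filled in with coalesce
nested = df.select(
    coalesce(get_json_object(col('json'), '$.level1.level2a'), lit('default'))
    .alias('json_level1_level2a')
)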

Replace a column in postgres table with pandas column using sqlalchemy

I have a postgres table that I read into a pandas DataFrame. I then apply some functions that change all values in one column to other values. I need to find a way to then update the corresponding column in the postgres table. Is there a straightforward way of doing this? The basic workflow I need is like:
df = pd.read_sql(...)
df.col_to_update = apply_some_functions_to_col(df.col_to_update)
sa.update(tbl).values({"col_to_update": df.col_to_update})
Though this particular statement wouldn't work, because the API here doesn't understand pandas Series.
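One way to make this workflow concrete (a sketch, assuming the table is named tbl and has a primary key id; the connection string and apply_some_functions_to_col are placeholders) is to push the updated column into a staging table and join it back with UPDATE ... FROM:
import pandas as pd
import sqlalchemy as sa

engine = sa.create_engine('postgresql://user:pass@host/db')  # placeholder DSN

df = pd.read_sql('SELECT id, col_to_update FROM tbl', engine)
df.col_to_update = apply_some_functions_to_col(df.col_to_update)

# Write the updated rows to a staging table, then update the target in one statement
df.to_sql('tbl_staging', engine, if_exists='replace', index=False)
with engine.begin() as conn:
    conn.execute(sa.text(
        'UPDATE tbl SET col_to_update = s.col_to_update '
        'FROM tbl_staging s WHERE tbl.id = s.id'
    ))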

Viewing the content of a Spark Dataframe Column

I'm using Spark 1.3.1.
I am trying to view the values of a Spark dataframe column in Python. With a Spark dataframe, I can do df.collect() to view the contents of the dataframe, but there is no such method for a Spark dataframe column, as far as I can see.
For example, the dataframe df contains a column named 'zip_code'. So I can do df['zip_code'] and it returns a pyspark.sql.dataframe.Column type, but I can't find a way to view the values in df['zip_code'].
You can access the underlying RDD and map over it:
df.rdd.map(lambda r: r.zip_code).collect()
You can also use select if you don't mind results wrapped using Row objects:
df.select('zip_code').collect()
Finally, if you simply want to inspect the content, then the show method should be enough:
df.select('zip_code').show()
You can simply write:
df.select("your column's name").show()
In your case here, it will be:
df.select('zip_code').show()
To view the complete content of a row in Python (show only gives you a truncated overview):
for row in df.select('zip_code').take(1):
    print(row)
