Viewing the content of a Spark Dataframe Column - python

I'm using Spark 1.3.1.
I am trying to view the values of a Spark dataframe column in Python. With a Spark dataframe, I can do df.collect() to view the contents of the dataframe, but there is no such method for a Spark dataframe column as far as I can see.
For example, the dataframe df contains a column named 'zip_code'. I can do df['zip_code'] and it returns a pyspark.sql.dataframe.Column type, but I can't find a way to view the values in df['zip_code'].

You can access the underlying RDD and map over it:
df.rdd.map(lambda r: r.zip_code).collect()
You can also use select if you don't mind the results being wrapped in Row objects:
df.select('zip_code').collect()
Finally, if you simply want to inspect the content, then the show method should be enough:
df.select('zip_code').show()
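For example, collect returns a list of Row objects, which you can unwrap into plain values with a list comprehension (the zip codes below are illustrative):
rows = df.select('zip_code').collect()   # e.g. [Row(zip_code='94103'), Row(zip_code='10001')]
values = [r.zip_code for r in rows]      # e.g. ['94103', '10001']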

You can simply write:
df.select("your column's name").show()
In your case here, it will be:
df.select('zip_code').show()

To view the complete content of the column:
for row in df.select('zip_code').take(1):
    print(row)
(show will only give you an overview.)

Related

Accessing imported data from Excel in Pandas

I'm new to Python and just trying to redo my first project from MATLAB. I've written code in VS Code to import an Excel file using pandas:
import pandas as pd

filename = r'C:\Users\user\Desktop\data.xlsx'
sheet = ['data']
with pd.ExcelFile(filename) as xls:
    Dateee = pd.read_excel(xls, sheet, index_col=0)
Then I want to access data in a row and column.
I tried to print the data using the code below:
for key in dateee.keys():
    print(dateee.keys())
but this returns nothing.
Is there anyway to access the data (as a list)?
You can iterate over each column, making the contents of each a list:
for c in df:
    print(df[c].to_list())
Here df is whatever name the dataframe was assigned to. (The OP's code used inconsistent names, Dateee vs. dateee, so I didn't use them.)
Look into df.iterrows() or df.itertuples() if you want to iterate by row. Example:
for row in df.itertuples():
    print(row)
Look into df.iloc and df.loc for row and column selection of individual values; see Pandas iloc and loc – quickly select rows and columns in DataFrames.
Or df.iat and df.at for getting or setting single values; see the pandas indexing documentation.
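A quick sketch of the four accessors (the positions and labels here are illustrative):
first_cell = df.iloc[0, 0]                      # by integer position
same_cell = df.loc[df.index[0], df.columns[0]]  # by row/column labels
fast_label = df.at[df.index[0], df.columns[0]]  # fast scalar access by label
fast_pos = df.iat[0, 0]                         # fast scalar access by position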

How to access JSON values from PySpark dataframes with default values?

I have a Spark dataframe which has JSON in one of its columns. My task is to turn this dataframe into a columnar type of dataframe. The problem is that the JSON is dynamic and its structure always changes. What I would like to do is attempt to take values from it and, in case a value is not present, return a default instead. Is there an option for this in the dataframe? This is how I am taking values out of the JSON; the problem is that if one of the levels changes name or structure, it will fail.
columnar_df = df.select(
    col('json')['level1'].alias('json_level1'),
    col('json')['level1']['level2a'].alias('json_level1_level2a'),
    col('json')['level1']['level2b'].alias('json_levelb'),
)
You can do something like that with json_tuple:
https://spark.apache.org/docs/2.2.0/api/python/pyspark.sql.html#pyspark.sql.functions.json_tuple
df.select(json_tuple(col("json"), <<all_the_fields, _you_want>>))
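A minimal sketch of how that behaves, assuming the 'json' column holds a JSON string (field and default names are hypothetical). json_tuple only extracts top-level fields and returns null for missing ones, which you can turn into a default with coalesce:
from pyspark.sql.functions import col, json_tuple, coalesce, lit

extracted = df.select(json_tuple(col('json'), 'level1'))  # output column is named 'c0'
with_defaults = extracted.select(
    coalesce(col('c0'), lit('my_default')).alias('json_level1')
)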

How to Retrieve a field value from a Glue DynamicFrame by name

In a Spark DataFrame you can address a column's value in the schema by using its name, like df['personId'] - but that way does not work with Glue's DynamicFrame. Is there a similar way, without converting the DynamicFrame to a DataFrame, to directly access a column's values by name?
You can use select_fields, see
https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-transforms-SelectFields.html.
In your case it would be df.select_fields("personId"). Depending on what you want to do, you can save it as a new dynamic frame or just look at the data.
new_frame = df.select_fields("personId")
new_frame.show()

Passing pandas groupby result to html in a pretty way

I wonder how I could pass a Python pandas groupby result to HTML, formatted the way it prints in the console. to_html does not work because it says that
Series object has no attribute 'to_html'
(Screenshot omitted: the console output is on the left, my HTML view on the right.)
Using reset_index() on the result of your groupby will turn it into a normal DataFrame, i.e. you can apply to_html to it.
Alternatively, you can make sure you output a DataFrame even if the result is a single Series.
I can think of two ways.
results_series = df[column_name] # your results, returns a series
# method 1: select column from list, as a DataFrame
results_df = df[[column_name]] # returns a DataFrame
# method 2: after selection, generate a new DataFrame
results_df = pd.DataFrame(results_series)
# then, export to html
results_df.to_html('output.html')
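Putting it together, a small self-contained sketch (the data is made up):
import pandas as pd

df = pd.DataFrame({'city': ['A', 'A', 'B'], 'sales': [1, 2, 3]})
grouped = df.groupby('city')['sales'].sum()         # a Series
html = grouped.reset_index().to_html(index=False)   # reset_index gives a DataFrame
print(html)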

How to process pyspark dataframe columns

I have a pyspark df with >4k columns without any labels/headers. Based on the column values, I need to apply specific operations to each column.
I did the same using pandas, but I don't want to use pandas and would like to apply the column-wise transformations directly on the Spark dataframe.
Any idea how I can apply a column-wise transformation if the df has >4k columns without any labels? Also, I don't want to apply transformations by specific column index.
According to the Spark documentation, a dataframe does - unlike what you said - contain headers, much like a database table.
In any case, a simple for loop should do the trick:
for column in spark_dataframe.columns:
    ...  # do whatever you want to do with the column
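For instance, a hedged sketch where the cast is a stand-in for whatever transformation you need (columns read without headers get default names like _c0, _c1, ...):
from pyspark.sql import functions as F

result = spark_dataframe
for column in spark_dataframe.columns:
    result = result.withColumn(column, F.col(column).cast('double'))  # example transformation
With thousands of columns, building one select with all the transformed columns is usually faster than chaining withColumn calls.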
