I'm using Spark 1.3.1.
I am trying to view the values of a Spark dataframe column in Python. With a Spark dataframe, I can do df.collect() to view the contents of the dataframe, but there is no such method for a Spark dataframe column as best as I can see.
For example, the dataframe df contains a column named 'zip_code'. So I can do df['zip_code'] and it returns a pyspark.sql.dataframe.Column type, but I can't find a way to view the values in df['zip_code'].
You can access the underlying RDD and map over it:
df.rdd.map(lambda r: r.zip_code).collect()
You can also use select if you don't mind results wrapped using Row objects:
df.select('zip_code').collect()
Finally, if you simply want to inspect the content, then the show method should be enough:
df.select('zip_code').show()
You can simply write:
df.select("your column's name").show()
In your case here, it will be:
df.select('zip_code').show()
To view the complete content (Scala syntax):
df.select("raw").take(1).foreach(println)
(show will show you an overview).
I'm new to Python and am trying to redo my first project from MATLAB. I've written code in VS Code to import an Excel file using pandas:
filename=r'C:\Users\user\Desktop\data.xlsx'
sheet=['data']
with pd.ExcelFile(filename) as xls:
    Dateee=pd.read_excel(xls, sheet,index_col=0)
Then I want to access data in a row and column.
I tried to print data using code below:
for key in dateee.keys():
    print(dateee.keys())
but this returns nothing.
Is there any way to access the data (as a list)?
You can iterate on each column, making the contents of each a list:
for c in df:
    print(df[c].to_list())
Here df is the variable the dataframe was assigned to. (The OP's naming was inconsistent, so I didn't use it.)
Look into df.iterrows() or df.itertuples() if you want to iterate by row. Example:
for row in df.itertuples():
    print(row)
Look into df.iloc and df.loc for row and column selection of individual values, see Pandas iloc and loc – quickly select rows and columns in DataFrames.
Or use df.iat or df.at to get or set single values.
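As a minimal sketch of the selectors mentioned above, the following builds a small dataframe in place of the Excel sheet. The column names, index labels, and values are made up for illustration and are not from the original question.

```python
import pandas as pd

# Hypothetical sample data standing in for the Excel sheet.
df = pd.DataFrame(
    {"price": [10.0, 12.5, 9.75], "qty": [3, 1, 4]},
    index=["a", "b", "c"],
)

row_by_label = df.loc["b"]               # one row as a Series, selected by index label
row_by_position = df.iloc[1]             # the same row, selected by integer position
column_as_list = df["price"].to_list()   # one column as a plain Python list
single_by_label = df.at["c", "qty"]      # one scalar, by row label and column name
single_by_position = df.iat[2, 1]        # the same scalar, by integer positions

print(row_by_label.to_list())
print(column_as_list)
print(single_by_label, single_by_position)
```

loc and at work with labels, while iloc and iat work with integer positions; at and iat are meant for single values only.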
I have a Spark dataframe with JSON in one of its columns. My task is to turn this dataframe into a columnar dataframe. The problem is that the JSON is dynamic and its structure keeps changing. What I would like to do is attempt to take values from it and, if a value is not there, return a default value. Is there an option for this in the dataframe? This is how I am taking values out of the JSON; the problem is that in case one of the levels changes name or structure, it will not fail.
columnar_df = df.select(
    col('json')['level1'].alias('json_level1'),
    col('json')['level1']['level2a'].alias('json_level1_level2a'),
    col('json')['level1']['level2b'].alias('json_levelb'),
)
You can do something like that with json_tuple:
https://spark.apache.org/docs/2.2.0/api/python/pyspark.sql.html#pyspark.sql.functions.json_tuple
df.select(json_tuple(col("json"), << all_the_fields , _you_want >> ))
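If a default value is needed when a nested level is missing, one possible approach is a small plain-Python helper. This is only a sketch: get_path is a hypothetical name, and it assumes the column holds JSON as a string. A function like this could be wrapped with pyspark.sql.functions.udf, but that wiring is not shown here.

```python
import json


def get_path(json_text, path, default=None):
    """Return the value at a nested key path, or default if any level is missing."""
    try:
        node = json.loads(json_text)
    except (TypeError, ValueError):
        return default  # input is not a string, or is not valid JSON
    for key in path:
        if not isinstance(node, dict) or key not in node:
            return default  # this level is absent or has a different structure
        node = node[key]
    return node


# Hypothetical sample document using the level names from the question.
sample = '{"level1": {"level2a": 1}}'
print(get_path(sample, ["level1", "level2a"], default=0))
print(get_path(sample, ["level1", "level2b"], default=0))
```

The helper never raises on a missing level or malformed input; it falls back to the default instead, which is the behavior the question asks for.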
In a Spark DataFrame you can address a column's value in the schema by using its name, like df['personId'], but that does not work with Glue's DynamicFrame. Is there a similar way, without converting the DynamicFrame to a DataFrame, to directly access a column's values by name?
You can use select_fields, see
https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-transforms-SelectFields.html.
In your case it would be df.select_fields("personId"). Depending on what you want to do, you can save it as a new dynamic frame or just look at the data.
new_frame = df.select_fields("personId")
new_frame.show()
How can I render a pandas groupby result as HTML, formatted the way it is printed in the console (picture below)? to_html does not work; the error says:
Series object has no attribute to_html()
The one on the left is from the console; the one on the right is from my HTML view.
Calling reset_index() on the result of your groupby turns it into a normal DataFrame, so you can apply to_html to it.
You can make sure you output a DataFrame, even if the output is a single series.
I can think of two ways.
results_series = df[column_name] # your results, returns a series
# method 1: select column from list, as a DataFrame
results_df = df[[column_name]] # returns a DataFrame
# method 2: after selection, generate a new DataFrame
results_df = pd.DataFrame(results_series)
# then, export to html
results_df.to_html('output.html')
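A minimal sketch of the reset_index() route, assuming a made-up dataframe with hypothetical column names city and sales:

```python
import pandas as pd

# Hypothetical sample data; the column names are assumptions.
df = pd.DataFrame({"city": ["Oslo", "Oslo", "Rome"], "sales": [10, 5, 7]})

totals = df.groupby("city")["sales"].sum()   # a Series indexed by city
totals_df = totals.reset_index()             # a DataFrame with city and sales columns

html = totals_df.to_html(index=False)        # returns the table markup as a string
print(html)
```

Called without a file path, to_html returns the markup as a string, which can then be passed to an HTML template.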
I have a pyspark df with >4k columns without any labels/headers. Based on the column values, I need to apply specific operations to each column.
I did the same using pandas, but I don't want to use pandas and would like to apply the column-wise transformations directly on the Spark dataframe.
Any idea how I can apply column-wise transformations if the df has >4k columns without any labels? Also, I don't want to apply transformations on a specific df column index.
According to the Spark documentation, a dataframe contains - unlike what you said - headers, much like a database table.
In any case, a simple for loop should do the trick:
for column in spark_dataframe.columns:
    pass  # do whatever you want to do with each column here