I have a PySpark dataframe with >4k columns without any labels/headers. Based on the column values I need to apply specific operations on each column.
I did the same using pandas, but I don't want to use pandas and would like to apply the column-wise transformations directly on the Spark dataframe.
Any idea how I can apply a column-wise transformation when the dataframe has >4k columns without any labels? I also don't want to apply the transformations by specific column index.
According to the Spark documentation, a dataframe does have column headers (unlike what you said), much like a database table.
In any case, a simple for loop should do the trick:
for column in spark_dataframe.columns:
    # do whatever you want to do with this column
    ...
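For example, a minimal sketch (here with a hypothetical cast to double; swap in whatever operation each column actually needs):
from pyspark.sql import functions as F

df_transformed = spark_dataframe
for column in spark_dataframe.columns:
    # replace the cast with the transformation appropriate for this column's values
    df_transformed = df_transformed.withColumn(column, F.col(column).cast('double'))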
I have a dataframe with a column I want to explode ("genre") and then reorganize the dataframe because I get many duplicates.
I also don't want to lose information. I get the following dataframe after using split and explode:
https://i.stack.imgur.com/eVWzg.png
I want to get a dataframe without the duplicates while keeping the genre column as it is. I thought of stacking it or making it a multi-index, but how should I proceed?
This dataset is not exactly the one I'm using, but it's similar in the column I want to work with:
https://www.kaggle.com/PromptCloudHQ/imdb-data
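If this is a pandas frame (stacking and multi-index suggest it is), a minimal sketch of one way to collapse the duplicates, assuming hypothetical columns 'Title' and 'genre' since the exact frame isn't shown, is to group the exploded rows back so each title keeps all of its genres as a list:
import pandas as pd

# hypothetical exploded frame: one row per (title, genre) pair
exploded = pd.DataFrame({
    'Title': ['Guardians of the Galaxy', 'Guardians of the Galaxy', 'Prometheus'],
    'genre': ['Action', 'Sci-Fi', 'Adventure'],
})

# collapse duplicates: one row per title, genres gathered into a list
deduped = exploded.groupby('Title', as_index=False).agg({'genre': list})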
I have a Spark table that I want to read in Python (I'm using Python 3 in Databricks). In effect the structure is below. The log data is stored in a single string column, but it is a dictionary.
How do I break out the dictionary items so I can read them?
dfstates = spark.createDataFrame([
    ['{"EVENT_ID":"123829:0","EVENT_TS":"2020-06-22T10:16:01.000+0000","RECORD_INDEX":0}', 'texas', '24', '01/04/2019'],
    ['{"EVENT_ID":"123829:1","EVENT_TS":"2020-06-22T10:16:01.000+0000","RECORD_INDEX":1}', 'colorado', '13', '01/07/2019'],
    ['{"EVENT_ID":"123828:0","EVENT_TS":"2020-06-20T21:17:39.000+0000","RECORD_INDEX":0}', 'maine', '14', '']
]).toDF('LogData', 'State', 'Orders', 'OrdDate')
What I want to do is read the Spark table into a dataframe, find the max event timestamp, find the rows with that timestamp, count them, and read just those rows into a new dataframe with the data columns; then, from the log data, add columns for the event id (without the record index), event date and record index.
Downstream I'll be validating the data, converting from StringType to appropriate data type and filling in missing or invalid values as appropriate. All along I'll be asserting that row counts = original row counts.
The only thing I'm stuck on is how to read this log data column and turn it into something I can work with. Is there something in Spark like pandas.Series()?
You can split your single struct-type column into multiple columns using dfstates.select('LogData.*'). Refer to this answer: How to split a list to multiple columns in Pyspark?
Once you have separate columns, you can do standard PySpark operations like filtering.
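If LogData is actually a JSON string (as in the example above) rather than a struct, a minimal sketch would be to parse it with from_json first and then expand the resulting struct:
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

log_schema = StructType([
    StructField('EVENT_ID', StringType()),
    StructField('EVENT_TS', StringType()),
    StructField('RECORD_INDEX', IntegerType()),
])

# parse the JSON string into a struct, then flatten its fields into columns
parsed = dfstates.withColumn('LogData', F.from_json('LogData', log_schema))
flat = parsed.select('State', 'Orders', 'OrdDate', 'LogData.*')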
I'm trying to use HashingTF in Spark but I have one major problem.
If inputCol is a single column, like this:
HashingTF(inputCol="bla", outputCol="tf_features"), it works fine.
But if I try to add more columns I get the error message "Cannot convert list to string".
All I want to do is
HashingTF(inputCol=["a_col","b_col","c_col"], outputCol="tf_features").
Any ideas on how to fix it?
HashingTF takes a single column as input; if you want to use other columns, you can combine them into an array using the array function and then flatten it using explode, which gives you one column with the values from all of the columns. Finally you can pass that column to HashingTF.
from pyspark.sql import functions as f

df2 = df.select(f.explode(f.array(f.col("a_col"), f.col("b_col"), f.col("c_col"))).alias("newCol"))
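For completeness, a minimal sketch of the end-to-end flow with pyspark.ml, assuming hypothetical string columns a_col, b_col and c_col. Note that HashingTF in pyspark.ml expects an array-type input column, so the combined array column can also be fed to it directly:
from pyspark.ml.feature import HashingTF
from pyspark.sql import functions as f

# combine the string columns into a single array column of terms
combined = df.withColumn("all_cols", f.array(f.col("a_col"), f.col("b_col"), f.col("c_col")))

hashing_tf = HashingTF(inputCol="all_cols", outputCol="tf_features")
tf_df = hashing_tf.transform(combined)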
I have a postgres table that I read into a pandas DataFrame. I then apply some functions that change all values in 1 column to other values. I need to find a way to then update the corresponding column in the postgres table. Is there a straightforward way of doing this? The basic workflow I need is like:
df = pd.read_sql(...)
df.col_to_update = apply_some_functions_to_col(df.col_to_update)
sa.update(tbl).values({"col_to_update": df.col_to_update})
Though this particular statement wouldn't work because the API here doesn't understand pandas series.
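One possible approach, sketched below under the assumption of a hypothetical primary key column "id" and table name "tbl": stage the updated values in a temporary table with to_sql, then issue a single UPDATE ... FROM against the target table.
import pandas as pd
import sqlalchemy as sa

engine = sa.create_engine("postgresql://user:password@host/dbname")  # assumed connection string

df = pd.read_sql("SELECT id, col_to_update FROM tbl", engine)
df["col_to_update"] = apply_some_functions_to_col(df["col_to_update"])  # your own transformation

# stage only the key and the updated values
df[["id", "col_to_update"]].to_sql("tbl_staging", engine, if_exists="replace", index=False)

with engine.begin() as conn:
    conn.execute(sa.text(
        "UPDATE tbl SET col_to_update = s.col_to_update "
        "FROM tbl_staging s WHERE tbl.id = s.id"
    ))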
I'm using Spark 1.3.1.
I am trying to view the values of a Spark dataframe column in Python. With a Spark dataframe, I can do df.collect() to view the contents of the dataframe, but there is no such method for a Spark dataframe column as far as I can see.
For example, the dataframe df contains a column named 'zip_code'. I can do df['zip_code'] and it returns a pyspark.sql.dataframe.Column type, but I can't find a way to view the values in df['zip_code'].
You can access the underlying RDD and map over it:
df.rdd.map(lambda r: r.zip_code).collect()
You can also use select if you don't mind results wrapped using Row objects:
df.select('zip_code').collect()
Finally, if you simply want to inspect content then show method should be enough:
df.select('zip_code').show()
You can simply write:
df.select("your_column_name").show()
In your case here, it will be:
df.select('zip_code').show()
To view the complete content:
df.select("raw").take(1).foreach(println)
(show will show you an overview).