Spark - HashingTF inputCol accepts one column but I want more - python

I'm trying to use HashingTF in Spark, but I have one major problem.
If inputCol contains only one column, like this
HashingTF(inputCol="bla",outputCol="tf_features") it works fine.
But if I try to pass more columns, I get the error message "Cannot convert list to string".
All I want to do is
HashingTF(inputCol=["a_col","b_col","c_col"], outputCol="tf_features").
Any ideas on how to fix it?

HashingTF takes a single input column. If you want to use several columns, you can combine them into an array with the array function and then flatten it with explode; you will end up with one column holding the values from all of the original columns. Finally, you can pass that column to HashingTF.
import pyspark.sql.functions as f
df2 = df.select(f.explode(f.array(f.col("a_col"), f.col("b_col"), f.col("c_col"))).alias("newCol"))
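Note that HashingTF in pyspark.ml expects an array-of-strings input column, so if each of the three columns holds a single token, a minimal sketch (the column names are taken from the question) is to pass the combined array column to HashingTF directly, without exploding:
from pyspark.ml.feature import HashingTF
import pyspark.sql.functions as f

# Combine the three source columns into a single array column
df2 = df.withColumn("all_cols", f.array(f.col("a_col"), f.col("b_col"), f.col("c_col")))

# HashingTF accepts exactly one input column, so point it at the combined array
tf = HashingTF(inputCol="all_cols", outputCol="tf_features")
result = tf.transform(df2)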

Related

Accessing a value from Dask using .loc

For the life of me, I can't figure out how to combine these two dataframes. I am using the newest versions of all the software involved, including Python, Pandas and Dask.
#pandasframe has 10k rows and 3 columns -
['monkey','banana','furry']
#daskframe has 1.5m rows, 1 column, 135 partitions -
row.index: 'monkey_banana_furry'
row.mycolumn = 'happy flappy tuna'
My dask dataframe has a string as its index for access,
but when I do daskframe.loc[index_str] it returns a dask dataframe, whereas I thought it was supposed to return one single specific row. I don't know how to access the row/value that I need from that dataframe. What I want is to input the index and get out one specific value.
What am I doing wrong?
Even pandas.DataFrame.loc doesn't return a scalar if you don't specify a label for the columns.
Anyway, to get a scalar in your case, you first need to call dask.dataframe.DataFrame.compute so that you get a pandas dataframe (since dask.dataframe.DataFrame.loc returns a dask dataframe). Only then can you use the pandas .loc.
Assuming dfd is your dask dataframe, try this:
dfd.loc[index_str].compute().loc[index_str, "mycolumn"]
Or this:
dfd.loc[index_str, "mycolumn"].compute().iloc[0]
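As a self-contained sketch, here is a hypothetical tiny frame that mirrors the shapes described in the question (the data is made up for illustration):
import pandas as pd
import dask.dataframe as dd

# Hypothetical frame: string index, one column named mycolumn
pdf = pd.DataFrame(
    {"mycolumn": ["happy flappy tuna", "something else"]},
    index=["monkey_banana_furry", "ape_apple_bald"],
)
dfd = dd.from_pandas(pdf, npartitions=2)

# .loc on a dask dataframe is lazy, so compute() is needed to materialize the selection
value = dfd.loc["monkey_banana_furry", "mycolumn"].compute().iloc[0]
print(value)  # happy flappy tuna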

Splitting a dataframe column into two separate data frames using pandas

I am using Python to code in a Jupyter notebook.
I'm trying to use pandas to split a dataframe by a column (called "PostTypeId") into two separate dataframes, based on that column's value: one dataframe is to be called Questions and holds the rows where the value is 1, and the second dataframe is to be called Answers and holds the rows where the value is 2. I'm asked to do all this by defining it within a function called split_df.
Wondering how I would go about this.
Thanks so much:)
You can do it by the following (assuming df is the dataframe that holds the posts):
Questions = df[df["PostTypeId"] == 1]
Answers = df[df["PostTypeId"] == 2]
When creating the function, pass the filter values as arguments.
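A minimal sketch of the requested split_df function, assuming df is the dataframe holding the posts and the column is named "PostTypeId":
import pandas as pd

def split_df(df, col="PostTypeId", question_value=1, answer_value=2):
    # Return (Questions, Answers) split on the given column's values
    questions = df[df[col] == question_value].copy()
    answers = df[df[col] == answer_value].copy()
    return questions, answers

# Usage:
# Questions, Answers = split_df(posts_df)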

Unable to update new column values in rows which were derived from an existing column having multiple values separated by ','?

Original dataframe
Converted Dataframe using stack and split:
Adding new column to a converted dataframe:
What I am trying to do is add a new column using np.select(condition, values), but it is not updating the two additional rows derived from H1; it returns 0 or NaN. Can someone please help me here?
Please note I have already reset the index, but it still does not help.
I think using numpy in this situation is kind of unnecessary.
You can use something like the following code:
df.loc[df.State == 'CT', 'H3'] = 4400000
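As a quick illustration, here is a hypothetical frame with only the State and H3 columns used above (the data is made up):
import pandas as pd

df = pd.DataFrame({"State": ["CT", "NY", "CT"], "H3": [0, 0, 0]})

# .loc with a boolean mask assigns in place; chained indexing such as
# df[df.State == 'CT']['H3'] = ... only modifies a temporary copy
df.loc[df.State == "CT", "H3"] = 4400000
print(df)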

Flatten pyspark nested structure - Pyspark

I want to flatten a nested column into separate ones, containing only a few specific values.
First off, I start with an AVRO capture file from Event Hub. Once I convert that to a dataframe in Python, I am left with the following column after removing the irrelevant ones.
This column has the following structure.
What I want to do next is flatten this column, keeping only specific values.
I can get this done for one cell, but because I am dealing with a nested structure the column is not iterable.
Can anyone help me out?
You have 2 data fields in your schema, so I'd call the first one data_struct and the second one data_array. If you do avro_df_body.select('data_struct.data_array'), you'd have an ArrayType column, to which you can apply the explode function to break it down into multiple rows.
import pyspark.sql.functions as F

(avro_df_body
 .withColumn('tmp', F.explode('data_struct.data_array'))
 .withColumn('speed', F.col('tmp.a'))
 .withColumn('timestamp', F.col('tmp.time'))
 .show()
)
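If only the flattened fields are needed afterwards, one option (the field names a and time are taken from the snippet above) is to select them directly rather than keeping the intermediate tmp column:
flat_df = (avro_df_body
           .withColumn('tmp', F.explode('data_struct.data_array'))
           .select(F.col('tmp.a').alias('speed'),
                   F.col('tmp.time').alias('timestamp')))
flat_df.show()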

Convert columns from multiple dataframes to boolean

I am trying to convert columns from multiple dataframes to boolean.
What I have written to convert them is the following:
for i in range(0, 4):
    df[i][['Col1','Col2','Col3','Col4']].astype('bool')
However, it does not convert anything. Not all the columns in the dataframes need to be converted to boolean, so I have selected above only the ones that do.
When I print df[1].dtypes (I get the same result from the other dataframes as well), all the columns above are objects, not boolean.
Could you please tell me where the error is in my code?
Note that .astype returns a new object and does not make the change in place. In order to apply the change, run:
for i in range(0, 4):
    df[i][['Col1','Col2','Col3','Col4']] = df[i][['Col1','Col2','Col3','Col4']].astype('bool')
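As a self-contained check, here is a hypothetical list of four small dataframes mirroring the question's setup (the data is made up):
import pandas as pd

# Hypothetical frames: four boolean-like columns plus one unrelated column
df = [pd.DataFrame({"Col1": [0, 1], "Col2": [1, 0],
                    "Col3": [0, 0], "Col4": [1, 1],
                    "Other": ["a", "b"]}) for _ in range(4)]

cols = ['Col1', 'Col2', 'Col3', 'Col4']
for i in range(0, 4):
    # Assign the result back, since astype returns a new object
    df[i][cols] = df[i][cols].astype('bool')

print(df[1].dtypes)  # Col1..Col4 are now bool, Other stays object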
