Extracting TF-IDF features as multiple columns with pyspark - python

Usually pyspark.ml.feature.IDF returns one outputCol that contains a SparseVector. All I need is N columns with real-number values, where N is the number of features defined in IDF (to use that dataframe in CatBoost later).
I have tried to convert the column to an array
import pyspark.sql.functions as F
import pyspark.sql.types as T

def dense_to_array(v):
    # unpack the Spark vector into a plain Python list of floats
    return [float(x) for x in v]

dense_to_array_udf = F.udf(dense_to_array, T.ArrayType(T.FloatType()))
data = data.withColumn('tf_idf_features_array', dense_to_array_udf('tf_idf_features'))
and after that use pandas to convert it to columns
data = data.toPandas()
cols = [f'tf_idf_{i}' for i in range(32)]
data = pd.DataFrame(data['tf_idf_features_array'].values.tolist(), columns=cols)
I don't like that way, because I find it really slow. Is there a way to solve my problem in pyspark without pandas?
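A pandas-free route is to turn the vector column into an array column and then select each element as its own column. This is a minimal sketch assuming Spark 3.0+, where pyspark.ml.functions.vector_to_array exists; n_features is assumed to equal the IDF vector size (32 in the snippet above). On older Spark versions the dense_to_array UDF above can produce the array column instead.
from pyspark.ml.functions import vector_to_array
import pyspark.sql.functions as F

n_features = 32  # assumption: matches the number of features used upstream

data = data.withColumn('tf_idf_array', vector_to_array('tf_idf_features'))
data = data.select(
    '*',
    *[F.col('tf_idf_array')[i].alias(f'tf_idf_{i}') for i in range(n_features)]
)
The result stays a Spark DataFrame, so it can be written out or handed to CatBoost later without the toPandas round trip.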

Related

Transform DataFrame based on another DataFrame numbers

I would like to transform a dataframe based on the numbers computed from another dataframe.
Code
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.random.random_integers(0, 100, (100, 2)), columns=['M', 'A'])
max_min = 100
step = 5
df1['RnkGroup'] = df1.groupby(pd.cut(df1['M'],
    range(0, max_min, step)))['A'].transform('rank', ascending=False).round()
df1['RnkMeanGroup'] = df1.groupby(pd.cut(df1['M'],
    range(0, max_min + step, step)))['RnkGroup'].transform('mean').round()
Is it possible to transform a new df2 = pd.DataFrame(np.random.random_integers(0,100,(100,2)), columns=['M','A']) based on the previous one?
I need something like sklearn's fit_transform.
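One way to get fit_transform-like behaviour is to learn the bin edges and per-bin statistics on df1 and then apply those same edges to df2. A minimal sketch under that reading of the question; the integer bin codes from pd.cut(..., labels=False) are used instead of Interval labels to keep the lookup simple:
import numpy as np
import pandas as pd

max_min, step = 100, 5
bins = range(0, max_min + step, step)

df1 = pd.DataFrame(np.random.randint(0, 101, (100, 2)), columns=['M', 'A'])
df2 = pd.DataFrame(np.random.randint(0, 101, (100, 2)), columns=['M', 'A'])

# "fit": rank within each bin of df1 and learn the per-bin mean rank
df1['bin'] = pd.cut(df1['M'], bins, labels=False)
df1['RnkGroup'] = df1.groupby('bin')['A'].transform('rank', ascending=False).round()
mean_rank_by_bin = df1.groupby('bin')['RnkGroup'].mean().round()

# "transform": apply the same bin edges to df2 and look up the learned means
df2['bin'] = pd.cut(df2['M'], bins, labels=False)
df2['RnkMeanGroup'] = df2['bin'].map(mean_rank_by_bin)
Values of df2['M'] that fall outside the bin edges learned on df1 come back as NaN.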

Calculation on every dataset entry in Vaex

I wish to transform every column in a dataset so its entries are between 0 and 1, based on the min/max of that column. I get the min/max of each column with df.minmax(col_names) and then want to find the column width col_width = col_max - col_min. With this I wish to transform the data: df = (df - col_min) / col_width.
How can I perform this operation so that every entry is calculated based on the column it belongs to?
You can do this quite easily with vaex. Consider this example:
import vaex
import vaex.ml
df = vaex.example()
# functional API
df_with_scaled_features = df.ml.minmax_scaler()
# scikit-learn like API
scaler = vaex.ml.MinMaxScaler()
df_with_scaled_features_again = scaler.fit_transform(df)
Of course you can choose to do the maths yourself, but why bother when this is implemented on the vaex side in an efficient way.
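If you do want to do the maths yourself, the same scaling can be expressed with virtual columns, which are evaluated lazily and never copy the data. A minimal sketch, assuming the selected columns are numeric:
import vaex

df = vaex.example()
features = ['x', 'y', 'z']  # the numeric columns to scale

for col in features:
    col_min, col_max = df.minmax(col)
    # virtual column: the expression is stored, not materialised
    df[f'scaled_{col}'] = (df[col] - col_min) / (col_max - col_min)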

How to extract inside of column to several columns

I have an Excel file and import it into a dataframe. I want to extract the inside of one column into several columns.
Here is the original:
After importing into pandas in Python, I get this data with '\n'.
So, I want to extract the inside of that column. Could you share an idea or code?
My expected columns are....
Don't worry, no one is born knowing everything about SO. Considering the data you gave, especially that 'Vector:...' is not separated by '\n', the following works:
import pandas as pd
import numpy as np

data = pd.read_excel("the_data.xlsx")

ok = []
l = len(data['Details'])
for n in range(l):
    x = data['Details'][n].split()
    x[2] = x[2].replace('Vector:', '', 1)  # remove the 'Vector:' prefix (lstrip strips characters, not a prefix)
    x = [v for v in x if v not in ['Type:', 'Mission:']]  # drop the label tokens
    ok += x
values = np.array(ok).reshape(l, 3)
df = pd.DataFrame(values, columns=['Type', 'Vector', 'Mission'])
data.drop('Details', axis=1, inplace=True)
final = pd.concat([data, df], axis=1)
The process goes like this:
First, you split every element of the Details column into a list of strings. Second, you handle the 'Vector:....' special case and filter out the label tokens. Third, you collect all the values in a list, which is in turn converted to a numpy array with shape (length, 3). Finally, you drop the old 'Details' column and concatenate with the df created from the split strings.
You may get a more efficient transformation at read time by applying these ideas inside pd.read_excel using converters.
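A sketch of that converter idea, assuming each Details cell has the layout implied above ('Type: ...', 'Vector:...', 'Mission: ...' separated by whitespace); the parse_details helper is hypothetical:
import pandas as pd

def parse_details(cell):
    # assumed cell layout: "Type: X\nVector:Y\nMission: Z"
    tokens = [t for t in cell.split() if t not in ('Type:', 'Mission:')]
    tokens[1] = tokens[1].replace('Vector:', '', 1)
    return tokens  # [type, vector, mission]

data = pd.read_excel('the_data.xlsx', converters={'Details': parse_details})
details = pd.DataFrame(data['Details'].tolist(), index=data.index,
                       columns=['Type', 'Vector', 'Mission'])
final = pd.concat([data.drop(columns='Details'), details], axis=1)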

Save pandas dataframe with numpy arrays column

Let us consider the following pandas dataframe:
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, np.array([6, 7])], [4, np.array([8, 9])]], columns=['A', 'B'])
where the B column is composed of two numpy arrays.
If we save the dataframe and then load it again, the numpy arrays are converted into strings.
df.to_csv('test.csv', index=False)
df = pd.read_csv('test.csv')
Is there any simple way to solve this problem? Here is the output of the loaded dataframe.
You can pickle the data instead.
df.to_pickle('test.pkl')
df = pd.read_pickle('test.pkl')
This ensures that the format remains the same; however, it is not human readable.
If human readability is an issue, I would recommend converting it to a JSON file instead:
df.to_json('abc.json')
df = pd.read_json('abc.json')
Use the following function to format each row.
def formatting(string_numpy):
    """Convert the string representation of an array back into a list of values.

    Args:
        string_numpy (str): stringified array as read from the CSV
    Returns:
        list: list of values
    """
    list_values = string_numpy.split(", ")
    list_values[0] = list_values[0][2:]     # drop the leading bracket/quote characters
    list_values[-1] = list_values[-1][:-2]  # drop the trailing bracket/quote characters
    return list_values
Then use the following apply call to convert each row back:
df[col] = df[col].apply(formatting)
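If the CSV round trip is kept, another option is to parse the cells back into real numpy arrays at read time. A minimal sketch, assuming the default space-separated repr that to_csv writes for such arrays (e.g. "[6 7]"):
import numpy as np
import pandas as pd

def parse_array(cell):
    # "[6 7]" -> array([6., 7.])
    return np.fromstring(cell.strip('[]'), sep=' ')

df = pd.read_csv('test.csv', converters={'B': parse_array})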

Dropping an item from a DataFrame vector column

I have a DataFrame with a single column 'value'. I want to split it by space, remove the first item from the split, and recombine the remaining items into a vector column.
It's very easy to do with a UDF or by converting to and from RDD, but I want to use only DataFrame API for performance and code simplicity reasons.
The best I could do was this:
import pyspark.sql.functions as F
from pyspark.ml.feature import VectorAssembler
df = sqlContext.createDataFrame([['10 11 12']], ['value'])
df_split = df.select(F.split('value', ' ').alias('split'))
n = df_split.select(F.size(df_split['split'])).collect()[0][0]
df_columns = df_split.select([F.col('split')[i].astype('int').alias(str(i)) for i in range(1, n)])
v = VectorAssembler(inputCols=[str(i) for i in range(1, n)], outputCol='result')
df_result = v.transform(df_columns).select('result')
It works, but requires an extra action (to get the size of the column after split), and a lot of code for such a simple task. Is there a simpler way of doing this?
In addition, VectorAssembler won't work for non-numeric types.
Spark 2.0.0, python 3.5.
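For reference, on Spark 2.4+ the SQL slice function can drop the first element without the extra collect or the VectorAssembler step; this is only a sketch and will not run on the Spark 2.0.0 mentioned above.
import pyspark.sql.functions as F

df = spark.createDataFrame([['10 11 12']], ['value'])

# slice(arr, 2, size(arr) - 1) keeps everything except the first element
df_result = df.select(
    F.expr("slice(cast(split(value, ' ') as array<int>), 2, "
           "size(split(value, ' ')) - 1)").alias('result')
)
If an ML vector column is needed downstream, pyspark.ml.functions.array_to_vector (Spark 3.1+) can convert the array column back into a vector.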
