Calculation on every dataset entry in Vaex - python

I wish to transform every column in a dataset so that its entries lie between 0 and 1, based on the min/max of that column. I get the min/max of each column with df.minmax(col_names) and then want the column width col_width = col_max - col_min. With this I wish to transform the data as df = (df - col_min)/col_width.
How can I perform this operation so that every entry is calculated based on the column it belongs to?

You can do this quite easily with vaex. Consider this example:
import vaex
import vaex.ml
df = vaex.example()
# functional API
df_with_scaled_features = df.ml.minmax_scaler()
# scikit-learn like API
scaler = vaex.ml.MinMaxScaler()
df_with_scaled_features_again = scaler.fit_transform(df)
Of course you can choose to do the maths yourself, but why bother when this is implemented on the vaex side in an efficient way.
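If you do want the manual route described in the question, a minimal sketch using virtual columns could look like the following; the column names are just the ones in the bundled example dataset, and the expressions stay lazy, so no data is copied:
import vaex

df = vaex.example()
cols = ['x', 'y', 'z']      # columns to scale (example dataset names)
limits = df.minmax(cols)    # one [min, max] pair per column
for col, (col_min, col_max) in zip(cols, limits):
    # virtual column: evaluated lazily when needed
    df[f'scaled_{col}'] = (df[col] - col_min) / (col_max - col_min)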

Related

Transform DataFrame based on another DataFrame's numbers

I would like to transform a dataframe based on another dataframe's transformed numbers.
Code
df1 = pd.DataFrame(np.random.random_integers(0,100,(100,2)), columns=['M','A'])
max_min = 100
step = 5
df1['RnkGroup'] = df1.groupby(pd.cut(df1['M'],
    range(0, max_min, step)))['A'].transform('rank', ascending=False).round()
df1['RnkMeanGroup'] = df1.groupby(pd.cut(df1['M'],
    range(0, max_min + step, step)))['RnkGroup'].transform('mean').round()
Is it possible to transform a new df2 = pd.DataFrame(np.random.random_integers(0,100,(100,2)), columns=['M','A']) based on the previous one?
I need something like sklearn's fit_transform.
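A hedged sketch of that fit/transform idea, assuming what should be reused are the fixed bin edges together with the per-bin mean learned on df1 (bin_means and Bin are illustrative names, not from the question):
bins = range(0, max_min + step, step)
# "fit": learn the per-bin mean rank from df1
df1['Bin'] = pd.cut(df1['M'], bins)
bin_means = df1.groupby('Bin')['RnkGroup'].mean().round()
# "transform": apply the learned mapping to df2 without recomputing it
df2['Bin'] = pd.cut(df2['M'], bins)
df2['RnkMeanGroup'] = df2['Bin'].map(bin_means)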

Featurizer to eliminate features

I am trying to set up a featurizer which drops all but the first 10 columns of my dataset. The dataset consists of 76 columns in total. The idea is to apply PolynomialFeatures(1) to the 10 columns I would like to keep, but then I cannot see a way to smartly eliminate the remaining 66 columns (I was thinking of something like PolynomialFeatures(0), i.e. multiplying them by the constant 0, but it does not seem to work). There are basically two issues: 1) how to tell DataFrameMapper to apply the same featurizer to a range of columns (namely A_11 to A_76); 2) how to tell DataFrameMapper to apply a featurizer that eliminates those columns.
The (incomplete) code I tried so far looks as follows. In the code I denoted issue 1) (the range) as A_11 - A_76 and issue 2) as ?:
from dml_iv.utilities import SubsetWrapper, ConstantModel
from econml.sklearn_extensions.linear_model import StatsModelsLinearRegression
col = ["A_"+str(k) for k in range(XW.shape[1])]
XW_db = pd.DataFrame(XW, columns=col)
from sklearn_pandas import DataFrameMapper
subset_names = set(['A_0','A_1','A_2','A_3','A_4','A_5','A_6','A_7','A_8','A_9','A_10'])
# list of indices of features X to use in the final model
mapper = DataFrameMapper([
    ('A_0', PolynomialFeatures(1)),
    ('A_1', PolynomialFeatures(1)),
    ('A_2', PolynomialFeatures(1)),
    ('A_3', PolynomialFeatures(1)),
    ('A_4', PolynomialFeatures(1)),
    ('A_5', PolynomialFeatures(1)),
    ('A_11 - A_66', ?)])  ## PROBLEMATIC PART
Why don't you drop columns you don't want from your dataframe and map what's left?
cols_map = [...] # list of columns to map
cols_drop = [...] # list of columns to drop
XW_db = XW_db.drop(cols_drop, axis=1) # you're left with only what to map
mapper = DataFrameMapper(cols_map)
...
If the reason for not wanting to drop columns is that they will be used later, you can simply assign the result of your drop to other variables, thus creating several subset dataframes which are easier to manipulate:
df2 = df1.drop(cols_drop2,axis=1) # df2 is a subset of df1
df3 = df1.drop(cols_drop3,axis=1) # df3 is a subset of df1
# Alternative is to decide what to keep instead of what to drop
df4 = df1[cols_keep] # df4 is a subset of df1
# df1 remains the full dataframe
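Applied to the question's frame, a sketch of that approach might be (assuming sklearn's PolynomialFeatures; each selector is passed as a one-element list so the transformer receives a 2-D array, and DataFrameMapper discards columns that are not listed):
from sklearn.preprocessing import PolynomialFeatures
from sklearn_pandas import DataFrameMapper

cols_keep = ['A_' + str(k) for k in range(11)]               # A_0 .. A_10
cols_drop = [c for c in XW_db.columns if c not in cols_keep]

XW_small = XW_db.drop(cols_drop, axis=1)                     # only the columns to map are left
mapper = DataFrameMapper([([c], PolynomialFeatures(1)) for c in cols_keep])
XW_features = mapper.fit_transform(XW_small)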

Extracting TF-IDF features as multiple columns with pyspark

Usually pyspark.ml.feature.IDF returns one outputCol that contains a SparseVector. All I need is to have N columns with real-number values, where N is the number of features defined in IDF (to use that dataframe in catboost later).
I have tried converting the column to an array
def dense_to_array(v):
    new_array = list([float(x) for x in v])
    return new_array

dense_to_array_udf = F.udf(dense_to_array, T.ArrayType(T.FloatType()))
data = data.withColumn('tf_idf_features_array', dense_to_array_udf('tf_idf_features'))
and after that used Pandas to convert it to columns
data = data.toPandas()
cols = [f'tf_idf_{i}' for i in range(32)]
data = pd.DataFrame(data['tf_idf_features_array'].values.tolist(), columns=cols)
I don't like that way because I find it really slow. Is there a way to solve my problem in pyspark, without pandas?
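On Spark 3.x a UDF-free sketch is possible with vector_to_array from pyspark.ml.functions (available since Spark 3.0); the column names and the feature count of 32 are taken from the snippets above:
import pyspark.sql.functions as F
from pyspark.ml.functions import vector_to_array

n_features = 32  # number of features configured in IDF
data = data.withColumn('tf_idf_array', vector_to_array('tf_idf_features'))
data = data.select(
    '*',
    *[F.col('tf_idf_array')[i].alias(f'tf_idf_{i}') for i in range(n_features)]
)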

Python Pandas: create rank columns, move original column max rank

I need to be able to
1. calculate ranks for each column in all rows,
2. then find the max-ranked column label of each row,
3. and then, for each row, move the value of that max-ranked column of the original df into a new column.
It is trivial to do when working only with the data in the original df. But if different ranking calls are needed, it seems difficult to accomplish.
Below is my Python Pandas code to accomplish this, but it does not work: the statement df1['maxV'] = df1[df1['maxR']] is not interpreted as I expect. Suggestions on how to achieve this will be appreciated.
import pandas as pd
import numpy as np
df1 = pd.DataFrame(np.random.randn(10,3), columns=list('ABC'))
rankV = df1.pct_change(3) # calculate ranking values
df1['maxR'] = rankV.idxmax(axis=1) # add max ranked column label of rankv
df1['maxV'] = df1[df1['maxR']] # move max ranked column value to maxV
Iterate the rows and accumulate the values in an array:
maxVals = [np.nan]*3
for index, row in df1[pd.notna(df1['maxR'])].iterrows():
    maxVals.append(df1.loc[index, row['maxR']])
df1['maxV'] = maxVals
Alternative: a less intuitive way is to index df1 with the index and values of 'maxR', which returns a wider DataFrame (# of columns equal to # of rows) that has the maxes on the diagonal:
maxVals = [np.nan]*3
newDf = df1.loc[df1['maxR'][3:].index, df1['maxR'][3:].values]
maxVals.extend(np.diag(newDf))
df1['maxV'] = maxVals
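A vectorized sketch on top of the question's frame is also possible, looking each value up by position with NumPy instead of iterating (only an alternative, with the same result):
value_cols = list('ABC')
vals = df1[value_cols].to_numpy()
mask = df1['maxR'].notna()
rows = np.flatnonzero(mask)
cols = pd.Index(value_cols).get_indexer(df1.loc[mask, 'maxR'])
df1.loc[mask, 'maxV'] = vals[rows, cols]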

Dropping an item from a DataFrame vector column

I have a DataFrame with a single column 'value'. I want to split it by space, remove the first item from the split, and recombine the remaining items into a vector column.
It's very easy to do with a UDF or by converting to and from RDD, but I want to use only DataFrame API for performance and code simplicity reasons.
The best I could do was this:
import pyspark.sql.functions as F
from pyspark.ml.feature import VectorAssembler
df = sqlContext.createDataFrame([['10 11 12']], ['value'])
df_split = df.select(F.split('value', ' ').alias('split'))
n = df_split.select(F.size(df_split['split'])).collect()[0][0]
df_columns = df_split.select([F.col('split')[i].astype('int').alias(str(i)) for i in range(1, n)])
v = VectorAssembler(inputCols=[str(i) for i in range(1, n)], outputCol='result')
df_result = v.transform(df_columns).select('result')
It works, but requires an extra action (to get the size of the column after split), and a lot of code for such a simple task. Is there a simpler way of doing this?
In addition, VectorAssembler won't work for non-numeric types.
Spark 2.0.0, python 3.5.
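For reference, on newer Spark versions (slice and transform need 2.4+, array_to_vector needs 3.1+, so this will not run on the 2.0.0 mentioned above) a UDF-free sketch could be, assuming a SparkSession named spark:
import pyspark.sql.functions as F
from pyspark.ml.functions import array_to_vector

df = spark.createDataFrame([['10 11 12']], ['value'])
df_result = (
    df.withColumn('split', F.split('value', ' '))
      # drop the first element, keep the rest
      .withColumn('rest', F.expr("slice(split, 2, size(split) - 1)"))
      # cast the remaining strings to doubles so they can become a vector
      .withColumn('rest', F.expr("transform(rest, x -> cast(x as double))"))
      .select(array_to_vector('rest').alias('result'))
)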
