I would like to transform a dataframe based on the numbers computed when transforming another dataframe.
Code
import numpy as np
import pandas as pd

# np.random.random_integers is deprecated; randint(0, 101) gives the same 0-100 range
df1 = pd.DataFrame(np.random.randint(0, 101, (100, 2)), columns=['M', 'A'])
max_min = 100
step = 5
# rank A within each 5-wide bin of M (descending, so the largest A gets rank 1)
df1['RnkGroup'] = (df1.groupby(pd.cut(df1['M'], range(0, max_min + step, step)))['A']
                   .transform('rank', ascending=False).round())
# mean of those ranks within each bin of M
df1['RnkMeanGroup'] = (df1.groupby(pd.cut(df1['M'], range(0, max_min + step, step)))['RnkGroup']
                       .transform('mean').round())
Is it possible to transform a new df2 = pd.DataFrame(np.random.randint(0, 101, (100, 2)), columns=['M', 'A']) based on the previous one?
I need something like scikit-learn's fit_transform.
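There is no built-in fit/transform pair for this kind of groupby, but one way to mimic it is to "fit" the per-bin statistics on df1 and then look them up for the bins of df2. A minimal sketch of that idea (fitted_means is just an illustrative name):
bins = range(0, max_min + step, step)

# "fit": per-bin mean rank learned from df1
fitted_means = df1.groupby(pd.cut(df1['M'], bins))['RnkGroup'].mean().round()

# "transform": give each row of df2 the statistic of the bin it falls into
df2['RnkMeanGroup'] = pd.cut(df2['M'], bins).map(fitted_means)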
I wish to transform every column in a dataset so that its entries are between 0 and 1, based on the min/max of that column. I get the min and max of each column with df.min() and df.max(), and then compute the column width col_width = col_max - col_min. With this I wish to transform the data as df = (df - col_min)/col_width.
How can I perform this operation so that every entry is calculated based on the column it belongs to?
You can do this quite easily with vaex. Consider this example:
import vaex
import vaex.ml
df = vaex.example()
# functional API
df_with_scaled_features = df.ml.minmax_scaler()
# scikit-learn like API
scaler = vaex.ml.MinMaxScaler()
df_with_scaled_features_again = scaler.fit_transform(df)
Of course you can choose to do the maths yourself, but why bother when this is implemented on the vaex side in an efficient way.
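If you do want to do the math yourself in plain pandas, a minimal sketch (assuming df holds only numeric columns):
col_min = df.min()
col_max = df.max()
col_width = col_max - col_min
df_scaled = (df - col_min) / col_width  # every column ends up in [0, 1]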
I extract data from a column using:
df_filtered = pd.DataFrame(df[1].apply(extractMMYY).tolist(), columns=['MM', 'YY'])
It returns a new dataframe, but I need to get MM and YY back into the initial dataframe df.
I have tried:
df(df[1].apply(extractMMYY).tolist(), columns=['MM', 'YY'])
Or I need to join the two dataframes to be able to filter the first df by df_filtered.
It looks to me like you are trying to do
df[['MM', 'YY']] = df[1].apply(extractMMYY).tolist()
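For example, with a hypothetical extractMMYY that returns an (MM, YY) tuple per value, this assignment adds both columns to df in place:
import pandas as pd

def extractMMYY(value):
    # hypothetical helper: split strings like '12/24' into ('12', '24')
    mm, yy = str(value).split('/')
    return mm, yy

df = pd.DataFrame({1: ['01/23', '12/24']})
df[['MM', 'YY']] = df[1].apply(extractMMYY).tolist()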
I am trying to set up a featurizer which drops all but the first 10 columns of my database. The database consists of 76 columns in total. The idea is to apply PolynomialFeatures(1) to the 10 columns I would like to keep, but I cannot see a smart way to eliminate the remaining 66 columns (I was thinking of something like PolynomialFeatures(0), i.e. multiplying them by the constant 0, but it does not seem to work). The issues are basically two: 1) how to tell DataFrameMapper to apply the same featurizer to a range of columns (namely A_11 to A_76); 2) how to tell DataFrameMapper to apply a featurizer that eliminates such columns.
The (incomplete) code I tried so far looks as follows. In the code, I denoted issue 1) (i.e. the range) as A_11 - A_76 and issue 2) as ?:
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures
from dml_iv.utilities import SubsetWrapper, ConstantModel
from econml.sklearn_extensions.linear_model import StatsModelsLinearRegression
col = ["A_"+str(k) for k in range(XW.shape[1])]
XW_db = pd.DataFrame(XW, columns=col)
from sklearn_pandas import DataFrameMapper
subset_names = set(['A_0','A_1','A_2','A_3','A_4','A_5','A_6','A_7','A_8','A_9','A_10'])
# list of indices of features X to use in the final model
mapper = DataFrameMapper([
    ('A_0', PolynomialFeatures(1)),
    ('A_1', PolynomialFeatures(1)),
    ('A_2', PolynomialFeatures(1)),
    ('A_3', PolynomialFeatures(1)),
    ('A_4', PolynomialFeatures(1)),
    ('A_5', PolynomialFeatures(1)),
    ('A_11 - A_66', ?)])  ## PROBLEMATIC PART
Why don't you drop the columns you don't want from your dataframe and map what's left?
cols_map = [...] # list of columns to map
cols_drop = [...] # list of columns to drop
XW_db = XW_db.drop(cols_drop, axis=1) # you're left with only what to map
mapper = DataFrameMapper(cols_map)
...
If the reason for not wanting to drop columns is that they will be used later, you can simply assign the result of your drop to other variables, thus creating several subset dataframes which are easier to manipulate:
df2 = df1.drop(cols_drop2,axis=1) # df2 is a subset of df1
df3 = df1.drop(cols_drop3,axis=1) # df3 is a subset of df1
# Alternative is to decide what to keep instead of what to drop
df4 = df1[cols_keep] # df4 is a subset of df1
# df1 remains the full dataframe
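A minimal sketch of that drop-then-map approach for this case, assuming the XW_db from the question and PolynomialFeatures from sklearn.preprocessing (note the list selector [c], which passes a 2-D column to the transformer):
from sklearn.preprocessing import PolynomialFeatures
from sklearn_pandas import DataFrameMapper

cols_keep = ['A_' + str(k) for k in range(11)]             # A_0 .. A_10
cols_drop = [c for c in XW_db.columns if c not in cols_keep]

XW_keep = XW_db.drop(cols_drop, axis=1)                    # only the columns to map remain
mapper = DataFrameMapper([([c], PolynomialFeatures(1)) for c in cols_keep])
features = mapper.fit_transform(XW_keep)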
Usually pyspark.ml.feature.IDF returns one outputCol that contains a SparseVector. All I need is N columns with real-valued numbers, where N is the number of features defined in the IDF (to use that dataframe in CatBoost later).
I have tried to convert the column to an array:
import pyspark.sql.functions as F
import pyspark.sql.types as T

def dense_to_array(v):
    # convert a SparseVector/DenseVector to a plain Python list of floats
    new_array = [float(x) for x in v]
    return new_array

dense_to_array_udf = F.udf(dense_to_array, T.ArrayType(T.FloatType()))
data = data.withColumn('tf_idf_features_array', dense_to_array_udf('tf_idf_features'))
and after that use Pandas to convert to columns
data = data.toPandas()
cols = [f'tf_idf_{i}' for i in range(32)]
data = pd.DataFrame(data['tf_idf_features_array'].values.tolist(), columns=cols)
I don't like that approach because I find it really slow. Is there a way to solve my problem in pyspark, without pandas?
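One pandas-free alternative, sketched under the assumption of Spark >= 3.0 (where pyspark.ml.functions.vector_to_array exists) and 32 features in the IDF output:
import pyspark.sql.functions as F
from pyspark.ml.functions import vector_to_array

n_features = 32
# turn the vector column into an array column, then split it into one column per feature
data = data.withColumn('tf_idf_arr', vector_to_array('tf_idf_features'))
data = data.select(
    '*',
    *[F.col('tf_idf_arr')[i].alias(f'tf_idf_{i}') for i in range(n_features)]
)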
So I have multiple data frames that I am attempting to loop over.
I have created a list using the following code:
data_list = [df1, df2, df3]
After that I would like to filter out a predefined range of numbers in the column 'Firm_Code' in each data frame.
So far, I am able to filter out firms with a respective code between 6000 and 6999 for a single data frame as follows:
FFirms = range(6000,7000)
Non_FFirms = [b for b in df1['Firm_Code'] if b not in FFirms]
df1 = df1.loc[df1['Firm_Code'].isin(Non_FFirms)]
Now I would like to loop over the data_list. My first try looks like the following:
for i in data_list:
    i = i.loc[i.Firm_Code.isin(Non_FFirms)]
Appreciate any suggestions!
Instead of making the list of dataframes, you can concat all the data frames into a single dataframe.
data_df = pd.concat([df1,df2,df3],ignore_index=True)
In case you need to identify which dataframe a row came from, you can add a new column, say 'Df_number', to each dataframe before concatenating, as sketched below.
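A minimal sketch of that, adding the identifier before the concat (the column name Df_number is just illustrative):
import pandas as pd

data_df = pd.concat(
    [df.assign(Df_number=i + 1) for i, df in enumerate([df1, df2, df3])],
    ignore_index=True
)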
Using data_df you can then filter the data:
FFirms = range(6000,7000)
Non_FFirms = [b for b in data_df['Firm_Code'] if b not in FFirms]
filtered_data_df = data_df.loc[data_df['Firm_Code'].isin(Non_FFirms)]
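If you prefer not to build the Non_FFirms list at all, a boolean mask does the same filtering in one step (a sketch, assuming Firm_Code is numeric):
filtered_data_df = data_df.loc[~data_df['Firm_Code'].between(6000, 6999)]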