I have a large text dataset and I'm using MinMaxScaler to transform one feature. The code works fine but takes more than 3 minutes, and I want to reduce the time this step consumes. Are there any suggestions to speed it up, or an alternative method to do this transformation faster?
df = cleanData('data.csv')
scaler = MinMaxScaler(feature_range=(0, 5))
scaler.fit(pd.DataFrame(df.loc[:,'year']))
df.loc[:,'year'] = scaler.transform(pd.DataFrame(df.loc[:,'year']))
You can try doing it with dask-ml:
import dask.dataframe as dd
from dask_ml.preprocessing import MinMaxScaler
# or read directly from csv with ddf = dd.read_csv('data.csv')
ddf = dd.from_pandas(df, npartitions=10)
scaler = MinMaxScaler(feature_range=(0, 5))
scaler.fit(ddf[['year']])  # fit expects 2-D input, hence the double brackets
ddf['year'] = scaler.transform(ddf[['year']])['year']
There are also other preprocessing tools available in dask_ml, see https://ml.dask.org/modules/generated/dask_ml.preprocessing.MinMaxScaler.html?highlight=minmaxscaler
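If you then need the result back in pandas, a small addition (assuming the ddf from the snippet above):
# dask is lazy, so nothing runs until you ask for it; .compute() returns a pandas DataFrame
df = ddf.compute()
print(df['year'].min(), df['year'].max())  # should now lie roughly in the 0-5 range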
I would like to use an sklearn pipeline with a Ray cluster to parallelize the computation.
I found this example: https://docs.ray.io/en/master/ray-more-libs/joblib.html
I tried the code below, but it doesn't run in parallel:
import joblib
import pandas as pd
from ray.util.joblib import register_ray

register_ray()
with joblib.parallel_backend('ray'):
    df = pd.read_csv(filepath, sep=sep, encoding=encoding, on_bad_lines='skip', low_memory=False)
    y = df.pop('target')
    X = df.copy()
    out = pipe.fit_transform(X, y)
If I use import modin.pandas as pd instead, the fit method complains that X and y are not pandas DataFrame types.
My question has to do with a very large dataset I'm running a regression on in Python. I have categorical data (gender, industry, region, salary groupings, etc.) that I would like to regress on with statsmodels. The whole dataframe comes out to about 83 columns wide after using pd.get_dummies() on roughly 5 million rows.
Code:
import pandas as pd
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf
from datetime import datetime as dt
#Start time
print('Start Time: ', dt.now())
#Variables
groups = ['sex', 'central_age', 'group_size', 'industry', 'region', 'salary']
base_cases = ['sex_Male', 'central_age_47.0', 'group_size_F. 100-249', 'salary_A. < 25',
              'industry_H. Manufacturing - heavy, steel etc.', 'region_C. Division 3: East North Central']
aggregates = ['death_amount_exposed', 'death_claim_amount']
#Read/format data to transform it into categorical variables
df = pd.read_pickle(r'./Life_Mortality_Data.pkl')
df = df[df['death_amount_exposed']!=0]
df['central_age'] = df['central_age'].apply(str)
final = pd.get_dummies(df[groups]).join(df[aggregates]).astype(float)
final.drop(base_cases, axis=1, inplace=True)
#Prepare string of variables to regress on in the next step
var_columns = list(final.columns)
for i in aggregates:
    var_columns.remove(i)
variables = '+'.join('Q("' + i + '")' for i in var_columns)
#Training and testing with Poisson model
print('Regression Time: ', dt.now(), '\n')
res1 = smf.glm(formula='death_claim_amount ~'+variables, data=final, offset=np.log(final['death_amount_exposed']), family=sm.families.Poisson(sm.families.links.log())).fit()
#Print stats summary, base cases, and multiplicative factors
print(res1.summary())
print('Base Cases:')
for case in base_cases:
    print(case)
print('\nParameters:\n', np.exp(res1.params))
#This takes a statsmodels results object and transforms it into a dataframe
def results_summary_to_dataframe(results):
    pvals = results.pvalues
    coeff = results.params
    std_err = results.bse
    conf_lower = results.conf_int()[0]
    conf_higher = results.conf_int()[1]
    results_df = pd.DataFrame({"pvals": pvals,
                               "coeff": coeff,
                               "std_error": std_err,
                               "conf_lower": conf_lower,
                               "conf_higher": conf_higher})
    #Reordering columns
    results_df = results_df[["coeff", "std_error", "pvals", "conf_lower", "conf_higher"]]
    return results_df
#Write data to excel
results_summary_to_dataframe(res1).to_excel(r'./All_Regression_Amounts_v1.xlsx')
#End time
print('\nEnd Time: ', dt.now())
The problem I'm having is that I run out of memory when the statsmodels regression runs. I am using 64-bit Python on Windows with 32 GB of memory, which I thought would be more than enough for this kind of computation, but I'm not sure whether I'm failing to use all the available memory or whether something is wrong with my code. I'm very new to this kind of analysis and to handling this much data, so I'd really appreciate any help on how to resolve this error.
When building linear models on datasets that are too large to hold in memory, your best bet is to train the model with stochastic gradient descent. This fits the model iteratively, using repeated small samples of the data rather than all of the data at once.
Scikit-learn has SGDClassifier and SGDRegressor estimators that fit linear models this way; since you're fitting a regression, SGDRegressor is the one to look at. You could see if that works for you.
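As a rough illustration of the idea (not your exact pipeline: it reuses the final frame and aggregates list built in your code, skips the Poisson offset, and optimizes squared error rather than the Poisson likelihood), you could fit a linear model in mini-batches like this:
# Minimal sketch: incremental fitting with SGDRegressor on small batches,
# so the full design matrix never has to be solved in one shot.
import numpy as np
from sklearn.linear_model import SGDRegressor

X = final.drop(columns=aggregates).to_numpy(dtype=np.float32)  # float32 halves memory use
y = final['death_claim_amount'].to_numpy(dtype=np.float32)

sgd = SGDRegressor()
batch_size = 100_000
for start in range(0, len(X), batch_size):
    stop = start + batch_size
    sgd.partial_fit(X[start:stop], y[start:stop])  # one gradient pass per mini-batch
Note that scikit-learn's PoissonRegressor handles the Poisson case directly, but it has no partial_fit, so for out-of-core fitting SGD is the usual route.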
I'm trying to scale some data from a csv file. I'm doing this with pyspark to obtain the dataframe and sklearn for the scale part. Here is the code:
from sklearn import preprocessing
import numpy as np
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df = spark.read.option('header', 'true').csv('flights.csv')
X_scaled = preprocessing.scale(df)
If I make the dataframe with pandas the scale part doesn't have any problems, but with spark I get this error:
ValueError: setting an array element with a sequence.
So I'm guessing that the element types are different between pandas and pyspark, but how can I work with pyspark to do the scale?
sklearn works with pandas DataFrames (or NumPy arrays), so you have to convert the Spark DataFrame to pandas first:
X_scaled = preprocessing.scale(df.toPandas())
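One extra caveat (not part of the original answer): spark.read.csv loads every column as a string unless you enable schema inference, so the converted pandas frame may still not be numeric. A small sketch of the safer read:
# Read with schema inference so numeric columns arrive as numbers,
# then keep only the numeric columns before scaling.
df = spark.read.option('header', 'true').option('inferSchema', 'true').csv('flights.csv')
X_scaled = preprocessing.scale(df.toPandas().select_dtypes('number'))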
You can use the StandardScaler from pyspark.ml.feature. Here is a sample script that performs the same preprocessing as sklearn.
Step 1:
from pyspark.ml.feature import StandardScaler
scaler = StandardScaler(inputCol="features",
                        outputCol="scaled_features",
                        withStd=True, withMean=True)
scaler_model = scaler.fit(transformed_data)
scaled_data = scaler_model.transform(transformed_data)
Remember that before you perform step 1, you need to assemble all the features with VectorAssembler, so this will be your step 0.
from pyspark.ml.feature import VectorAssembler
assembler = VectorAssembler(inputCols=required_features, outputCol='features')
transformed_data = assembler.transform(df)
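To sanity-check the result (just an illustration), you can look at the assembled and scaled vector columns side by side:
# Each row now carries the raw feature vector and its standardized counterpart.
scaled_data.select('features', 'scaled_features').show(5, truncate=False)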
I am trying to do the following
df = pd.read_csv('a.csv')
scaler = MinMaxScaler()
df_copy = df.copy(deep=True)
for i in range(1, len(df)):
    df_chunk = df_copy.iloc[i:i+10]
    df_chunk = scaler.fit_transform(df_chunk)
so each df_chunk should be a scaled data frame.
The issue is that some are not scaled correctly.
If I were to plot the scaled data points, a properly scaled data frame will look like a range of numbers scattered between 0 and 1 sort of evenly. But the data frames I get are in 2 extremes, with the first ~80% of the numbers in the 0.9 range, while the others near the 0.1 range.
So it feels like the first ~80% of the data got scaled twice by the scaler. I have already tried using pandas deep copy to solve this, but it doesn't seem to help.
If you have any idea why this happens, I would really appreciate it.
I'm not too sure why you want to apply the scaler on chunks of your data. If you fear that your CSV may be too large, you would want to read the CSV by chunks in the read_csv operation and process those chunks.
Now onto your issue. You're re-fitting your scaler on every chunk, which is why you're getting the weird results. You either have to fit the scaler on the entire data, or fit it incrementally using the partial_fit method.
I'll provide both solutions.
Solution #1: read and fit the entire data
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df = pd.read_csv('a.csv')
df_scaled = scaler.fit_transform(df)
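One small note beyond the original snippet: fit_transform returns a NumPy array, so if you want a DataFrame back you can rewrap it:
# Rewrap the scaled values with the original column names and index (illustrative).
df_scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns, index=df.index)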
Solution #2: read the csv by chunks, and online train
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
# first read the csv by chunks and update the scaler
for chunk in pd.read_csv('a.csv', chunksize=10):
scaler.partial_fit(chunk)
# read the csv again by chunks to transform the chunks
for chunk in pd.read_csv('a.csv', chunksize=10):
transformed = scaler.transform(chunk)
# not too sure what you want to do after this
# but you can either print the results of the transformation
# or write the transformed chunk to a new csv
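For example, one way to handle the transformed chunks (my own illustration, with a made-up output filename) is to append them to a new CSV:
# Write each scaled chunk out, creating the file on the first chunk and appending afterwards.
for i, chunk in enumerate(pd.read_csv('a.csv', chunksize=10)):
    transformed = scaler.transform(chunk)
    pd.DataFrame(transformed, columns=chunk.columns).to_csv(
        'a_scaled.csv', mode='w' if i == 0 else 'a', header=(i == 0), index=False)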
This is a follow-on question from Subsetting Dask DataFrames. I wish to shuffle data from a dask dataframe before sending it in batches to an ML algorithm.
The answer in that question was to do the following:
for part in df.repartition(npartitions=100).to_delayed():
    batch = part.compute()
However, even if I were to shuffle the contents of batch, I'm a bit worried that it might not be ideal. The data is a time series, so datapoints would be highly correlated within each partition.
What I would ideally like is something along the lines of:
rand_idx = np.random.choice(len(df), batch_size, replace=False)
batch = df.iloc[rand_idx, :]
which would work on pandas but not dask. Any thoughts?
Edit 1: Potential Solution
I tried doing
len_df = len(df)
train_len = int(len_df*0.8)
idx = np.random.permutation(len_df)
train_idx = idx[:train_len]
test_idx = idx[train_len:]
train_df = df.loc[train_idx]
test_df = df.loc[test_idx]
However, if I try doing train_df.loc[:5, :].compute(), this returns a 124,451-row dataframe, so clearly I'm using dask wrong.
I recommend adding a column of random data to your dataframe and then using that to set the index:
df = df.map_partitions(add_random_column_to_pandas_dataframe, ...)
df = df.set_index('name-of-random-column')
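The helper isn't spelled out above; a minimal sketch of what it could look like (the function body and the _rand column name are my assumptions, not from the answer):
import numpy as np

def add_random_column_to_pandas_dataframe(pdf):
    # Runs on each pandas partition: attach a uniform random number to every row.
    pdf = pdf.copy()
    pdf['_rand'] = np.random.random(len(pdf))
    return pdf

df = df.map_partitions(add_random_column_to_pandas_dataframe)
df = df.set_index('_rand')  # the shuffle happens here: rows are redistributed by the random key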
I encountered the same issue recently and came up with a different approach using dask array and shuffle_slice introduced in this pull request
It shuffles the whole sample
import numpy as np
from dask.array.slicing import shuffle_slice
d_arr = df.to_dask_array(True)
df_len = len(df)
np.random.seed(42)
index = np.random.choice(df_len, df_len, replace=False)
d_arr = shuffle_slice(d_arr, index)
and to transform back to a dask dataframe:
df = d_arr.to_dask_dataframe(df.columns)
For me it works well for large data sets.
If you're trying to separate your dataframe into training and testing subsets, that is exactly what sklearn.model_selection.train_test_split does, and it works with pandas.DataFrame (go there for an example).
And for your case of using it with dask, you may be interested in the dklearn library, which seems to implement this function.
To do that, we can use the train_test_split function, which mirrors
the scikit-learn function of the same name. We'll hold back 20% of the
rows:
from dklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
More information here.
Note: I did not perform any test with dklearn, this is just a thing I came across, but I hope it can help.
EDIT: what about dask.DataFrame.random_split?
Examples
50/50 split
>>> a, b = df.random_split([0.5, 0.5])
80/10/10 split, consistent random_state
>>> a, b, c = df.random_split([0.8, 0.1, 0.1], random_state=123)
Use for ML applications is illustrated here
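For instance, a rough sketch of an 80/20 split with random_split (the 'target' column name is a placeholder, not from the question):
# Illustrative train/test split on a dask dataframe.
train_df, test_df = df.random_split([0.8, 0.2], random_state=123)
X_train, y_train = train_df.drop(columns=['target']), train_df['target']
X_test, y_test = test_df.drop(columns=['target']), test_df['target']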
For people here really just wanting to shuffle the rows as the title implies:
Note that this is costly:
import numpy as np

random_idx = np.random.permutation(len(sd.index))
sd = sd.assign(random_idx=random_idx)  # assign returns a new dataframe, so keep the result
sd = sd.set_index('random_idx')        # setting the index shuffles rows by the random key
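After that, the delayed-partition loop from the question should yield decorrelated batches, e.g.:
# Illustrative: pull shuffled partitions one at a time as pandas batches.
for part in sd.repartition(npartitions=100).to_delayed():
    batch = part.compute()  # each batch now mixes rows from across the original frame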