Python: Linear regression from a Pandas df - ordinal dates conversion

First time trying to forecast using basic linear regression in Python. I discovered I had to convert dates to ordinal dates and then into a 2D numpy array. I now want to convert the numpy array back to YYYY/MM/DD for a usable visual plot, but am failing. I've never used numpy before, so x_full_month.map(dt.datetime.fromordinal) is not working; .map does not seem to be valid on a numpy array.
import datetime as dt
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

model = LinearRegression()
# Convert dates to ordinal integers so they can be used as a numeric feature
df['Date_Ordinal'] = df['Date'].map(dt.datetime.toordinal)
x = df['Date_Ordinal']
y = df['Cost']
# Reshape to the 2D arrays scikit-learn expects
x_train = x.values.reshape(-1, 1)
y_train = y.values.reshape(-1, 1)
model.fit(x_train, y_train)
y_pred = model.predict(x_train)
From the predictive model, I'm then creating a new X of ordinal dates for the full month, to get a full month's response:
x_full_month = np.arange(737850, 737880, 1).reshape(-1, 1)
y_pred_new = model.predict(x_full_month)
print('predicted response:', y_pred_new.T, sep='\n')
This seems to work, but X is still in ordinal form (as expected). How would I get a nicely formatted X for plotting, or get this back into a Pandas DataFrame, which I'm more familiar with? Or am I going about this in a roundabout way?
Edit: corrected parameter name

Several hours later and I have a solution. I'm still sure I'm going about this inefficiently, but the steps below work for me.
# .flatten() turns the (n, 1) numpy arrays into 1D arrays usable as df columns
df = pd.DataFrame(y_pred_new.flatten(), x_full_month.flatten())
# Create a new index (pd.DataFrame made x_full_month the index initially)
df.reset_index(inplace=True)
# Meaningful column names
df = df.rename(columns={'index': 'ord_date', 0: 'cumul_DN'})
# Convert ordinal date to yyyy-mm-dd
df['date'] = df['ord_date'].map(dt.datetime.fromordinal)
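For reference, a slightly tidier alternative (just a sketch; df_plot is an illustrative name, and it assumes x_full_month and y_pred_new from above) is to build the plotting DataFrame in one step and convert the ordinals directly:
# Build the plotting DataFrame directly from the two flattened arrays
df_plot = pd.DataFrame({
    'date': [dt.datetime.fromordinal(int(o)) for o in x_full_month.ravel()],
    'cumul_DN': y_pred_new.ravel(),
})
df_plot.plot(x='date', y='cumul_DN')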

Related

Data imputation in Python for Google Analytics data

I have sets of Google Analytics data from a website which I plan to analyse for a project. However, due to maintenance and other factors, there are chunks of dates for which there is no data. I want to impute this data while still maintaining the integrity of the data as I plan to plot these sets and compare the curves of different sets to each-other over time.
Example
I want to use the nearest valid datapoints to each missing datapoint to impute that value in order to maintain the underlying shape that can be seen from the image.
I've already tried to use scikit-learn's KNNImputer and IterativeImputer, but I'm either misunderstanding how these imputers are supposed to be used or they're not the right tool for what I'm trying to do, potentially both.
import pandas as pd
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
import numpy as np
df = pd.read_csv('data.csv', names=['Day','Views'],delimiter=',',skiprows=3, usecols=[0,1], skipfooter=1, engine='python', quoting= 1)
df = df.replace(0, np.nan)
da = df.Views.rename_axis('ID').values
da = da.reshape(-1,1)
imputer = IterativeImputer(n_nearest_features = 100, max_iter = 10)
df_imputed = imputer.fit_transform(da)
df_imputed.reshape(1,-1)
df.Views = df_imputed
df
All of the NaN values are calculated to be the exact same number from what I have currently implemented.
Any help would be greatly appreciated.
The problem here was how I was reshaping the array. My data was just a 1D array of values, so I was making it 2D by reshaping, which caused all the NaN values to be calculated as the same number. When I added an index column and included it as an input to the imputer, the values were calculated correctly. I also ended up using a KNNImputer from sklearn instead of the IterativeImputer in this instance.
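A minimal sketch of that approach (the column name and values here are illustrative, not the asker's actual data):
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.DataFrame({'Views': [120, 130, np.nan, np.nan, 150, 160]})
# Pair each value with its row index so the imputer can use position in the series
X = np.column_stack([df.index.values, df['Views'].values])
imputer = KNNImputer(n_neighbors=2)
df['Views'] = imputer.fit_transform(X)[:, 1]
print(df)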

Normalizing all numeric columns in my dataset and compare before and after

I want to normalize all the numeric values in my dataset.
I have taken my whole dataset into a pandas dataframe.
My code to do this so far:
for column in numeric:  # numeric = df._get_numeric_data()
    x_array = np.array(df[column])
    normalized_X = preprocessing.normalize([x_array])
But how do I verify this is correct?
I tried plotting a histogram for one of the columns before normalizing and after, by adding this piece of code before and after my for loop:
x=df['Below.Primary'] #Below.Primary is one of my column names
plt.hist(x, bins=45)
The blue histogram was before the for loop and the orange, after.
My total code looked like this:
plt.hist(df['Below.Primary'], bins=45)

for column in numeric:
    x_array = np.array(df[column])
    normalized_X = preprocessing.normalize([x_array])

x = df['Below.Primary']
plt.hist(x, bins=45)
I don't see any reduction in scale. What have I done wrong? If this isn't correct, can someone point out the right way to do what I wanted to do?
Try using this:
scaler = preprocessing.StandardScaler()
df[[col]] = scaler.fit_transform(df[[col]])  # [[col]] keeps the 2D shape sklearn expects
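To verify the scaling, a quick sanity check (a sketch; numeric_cols is an illustrative list of the scaled column names) is that each standardized column now has mean ≈ 0 and standard deviation ≈ 1:
numeric_cols = df.select_dtypes('number').columns
print(df[numeric_cols].mean().round(3))  # should be ~0 for each scaled column
print(df[numeric_cols].std().round(3))   # should be ~1 for each scaled column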
A couple general things first.
If numeric is a list of column names (looks like this is the case), the for loop is not necessary.
A Pandas Series uses an ndarray under the hood, so you can just request the ndarray with Series.values instead of calling np.array(). See this page on the Pandas Series.
I am assuming you are using preprocessing from sklearn.
I recommend using sklearn.preprocessing.Normalizer for this.
import pandas as pd
from sklearn.preprocessing import Normalizer
### Without the for loop (recommended)
# this version returns array
normalizer = Normalizer()
normalized_values = normalizer.fit_transform(df[numeric])
# normalized_values is a 2D array which is useful
# for many applications
# to convert back to DataFrame
df = pd.DataFrame(normalized_values, columns = numeric)
### with the for-loop (not recommended)
for column in numeric:
    x_array = df[column].values.reshape(-1, 1)
    df[column] = normalizer.fit_transform(x_array).ravel()
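With the no-loop version, a quick way to confirm what Normalizer did (a sketch, assuming normalized_values from above) is to check that every row now has unit L2 norm:
import numpy as np
row_norms = np.linalg.norm(normalized_values, axis=1)
print(row_norms[:5])  # each value should be ~1.0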
You have to assign normalized_X back to the respective column while iterating:
for column in numeric:
    x_array = np.array(df[column])
    normalized_X = preprocessing.normalize([x_array])
    df[column] = normalized_X[0]  # normalize() returned a (1, n) array, so take row 0

x = df['Below.Primary']
plt.hist(x, bins=45)

Creating a stacked area plot in python with a Pandas DataFrame

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
dates = np.arange(1990,2061, 1)
dates = dates.astype('str').astype('datetime64')
df = pd.DataFrame(np.random.randint(0, dates.size, size=(dates.size,3)), columns=list('ABC'))
df['year'] = dates
cols = df.columns.tolist()
cols = [cols[-1]] + cols[:-1]
df = df[cols]
fig = plt.figure()
ax = fig.add_subplot(1,1,1)
ax.stackplot(df['year'], df.drop('year',axis=1))
Based on this code, I'm getting an error "TypeError: ufunc 'isfinite' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''"
I'm trying to figure out how to plot a DataFrame object with years in the first column, and then a stacked area built from the subsequent columns (A, B, C).
Also, since I'm a complete beginner here, feel free to comment on my code to make it cleaner/better. I understand that if I use Matplotlib instead of the Pandas integrated plot method, I have more flexibility to adjust things later on?
Thanks!
I run into two problems running your code.
First, stackplot seems to dislike string representations of dates; datetime data types are very finicky sometimes. Either use integers for your 'year' column, or use .values to convert from pandas to numpy datatypes, as described in this question.
Secondly, according to the documentation for stackplot, when you call stackplot(x, y), if x is an Nx1 array then y must be MxN, where M is the number of columns. Your df.drop('year', axis=1) will end up as NxM and throw another error at you. If you take the transpose, however, you can make it work.
If I just replace your final line with
ax.stackplot(df['year'].values, df.drop('year',axis=1).T)
I get a plot that looks like this:
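As an aside, if you would rather use the Pandas-integrated plotting mentioned in the question, a stacked area chart can also be produced directly from the DataFrame (a minimal sketch, assuming the df built above):
# pandas' own area plot is stacked by default
ax = df.set_index('year').plot.area()
ax.set_ylabel('value')
plt.show()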

Rolling PCA on pandas dataframe

I'm wondering if anyone knows of how to implement a rolling/moving window PCA on a pandas dataframe. I've looked around and found implementations in R and MATLAB but not Python. Any help would be appreciated!
This is not a duplicate - moving window PCA is not the same as PCA on the entire dataframe. Please see pandas.DataFrame.rolling() if you do not understand the difference.
Unfortunately, pandas.DataFrame.rolling() applies the function to each column separately rather than to whole windows of rows, so it cannot be used as one might expect to roll over the rows of the df and pass windows of rows to the PCA.
The following is a work-around for this based on rolling over indices instead of rows. It may not be very elegant but it works:
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Generate some data (1000 time points, 10 features)
data = np.random.random(size=(1000, 10))
df = pd.DataFrame(data)
# Set the window size
window = 100
# Initialize an empty df of appropriate size for the output
df_pca = pd.DataFrame(np.zeros((data.shape[0] - window + 1, data.shape[1])))
# Define PCA fit-transform function
# Note: Instead of attempting to return the result,
# it is written into the previously created output array.
def rolling_pca(window_data):
    pca = PCA()
    transf = pca.fit_transform(df.iloc[window_data])
    df_pca.iloc[int(window_data[0])] = transf[0, :]
    return True
# Create a df containing row indices for the workaround
df_idx = pd.DataFrame(np.arange(df.shape[0]))
# Use `rolling` to apply the PCA function
_ = df_idx.rolling(window).apply(rolling_pca)
# The results are now contained here:
print(df_pca)
A quick check reveals that the values produced by this are identical to control values computed by slicing appropriate windows manually and running PCA on them.
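For reference, that control check might look something like this (a sketch, assuming df, df_pca and window from above):
# Manually run PCA on one explicit window and compare with the stored row
i = 0  # start index of the window to check
manual = PCA().fit_transform(df.iloc[i:i + window])[0, :]
print(np.allclose(manual, df_pca.iloc[i].values))  # expect True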

Unable to Apply Scikit-Learn Imputer To A Dataset With Two Features

I am trying to read in the complete Titanic dataset, which can be found here:
biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic3.xls
import pandas as pd
# Importing the dataset
dataset = pd.read_excel('titanic3.xls')
y = dataset.iloc[:, 1].values
x = dataset.iloc[:, 2:14].values
# Create Dataset for Men
men_on_board = dataset[dataset['sex'] == 'male']
male_fatalities = men_on_board[men_on_board['survived'] ==0]
X_male = male_fatalities.iloc[:, [4,8]].values
# Taking care of missing data
from sklearn.preprocessing import Imputer
imputer = Imputer(missing_values = 'NaN', strategy = 'mean', axis = 0)
imputer = imputer.fit(X_male[:,0])
X_male[:,0] = imputer.transform(X_male[:,0])
When I run all but the last line, I get the following warning:
/Users/<username>/anaconda/lib/python3.6/site-packages/sklearn/utils/validation.py:395: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and will raise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
DeprecationWarning)
When I run the last line, it throws the following error:
File "<ipython-input-2-07afef05ee1c>", line 1, in <module>
X_male[:,0] = imputer.transform(X_male[:,0])
ValueError: could not broadcast input array from shape (523) into shape (682)
I've used the above code snippet for imputation on other projects; I'm not sure why it's not working here.
A quick solution is to change axis = 0 to axis = 1. This will make it work, though I'm not sure that's what you want. So I want to give some explanation of what happened here, as follows:
The warning basically tells you that sklearn estimators now require 2D data arrays rather than 1D data arrays, because interpreting data as samples (rows) vs. features (columns) matters. During this deprecation period, the requirement is enforced by np.atleast_2d, which assumes your data has a single sample (row). Meanwhile, you passed axis = 0 to the Imputer, which "imputes along columns" with strategy = 'mean'. However, you now have only 1 row, so when the imputer comes across a missing value there is no column mean to replace it with, and the entire column (which contains just that missing value) is discarded. As you can see, this is equal to
X_male[:,0][~np.isnan(X_male[:,0])].reshape(1, -1)
That's why the assignment X_male[:,0] = imputer.transform(X_male[:,0]) failed: X_male[:,0] has shape (682,) while imputer.transform(X_male[:,0]) has shape (523,). My previous solution basically changes it to "impute along rows", where you do have a mean to replace missing values. Nothing is dropped this time, and imputer.transform(X_male[:,0]) has shape (682,), which can be assigned back to X_male[:,0].
Now, I don't know why your code snippet for imputation works on other projects. For your specific case here, a (logically) better way with regard to the deprecation warning would be to use X.reshape(-1, 1), since your data has a single feature and 682 samples. However, you need to reshape the transformed data back before assigning it to X_male[:,0]:
imputer = imputer.fit(X_male[:,0].reshape(-1, 1))
X_male[:,0] = imputer.transform(X_male[:,0].reshape(-1, 1)).reshape(-1)
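Note that in newer scikit-learn versions (0.20+) Imputer has been replaced by SimpleImputer; a minimal equivalent sketch, assuming X_male as above:
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy='mean')  # missing_values defaults to NaN
# Keep the column 2D with [:, [0]] for sklearn, then flatten back for the assignment
X_male[:, 0] = imputer.fit_transform(X_male[:, [0]]).ravel()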
