Ordinary Least Squares Regression for multiple columns in Pandas Dataframe - python

I'm trying to find a way to iterate a linear regression over many columns, running all the way up to Z3. Here is a snippet of the dataframe, called df1:
   Time    A1    A2    A3    B1    B2    B3
1  1.00  6.64  6.82  6.79  6.70  6.95  7.02
2  2.00  6.70  6.86  6.92   NaN   NaN   NaN
3  3.00   NaN   NaN   NaN  7.07  7.27  7.40
4  4.00  7.15  7.26  7.26  7.19   NaN   NaN
5  5.00   NaN   NaN   NaN   NaN  7.40  7.51
6  5.50  7.44  7.63  7.58  7.54   NaN   NaN
7  6.00  7.62  7.86  7.71   NaN   NaN   NaN
This code returns the slope coefficient of a linear regression for one column at a time and concatenates the value onto a numpy array called series. Here is what it looks like for extracting the slope of the first column:
import numpy as np
from sklearn.linear_model import LinearRegression

series = np.array([])                          # empty array to collect the results
df2 = df1[~np.isnan(df1['A1'])]                # remove NaN values for this column so sklearn can fit
df3 = df2[['Time', 'A1']]
npMatrix = np.matrix(df3)
X, Y = npMatrix[:, 0], npMatrix[:, 1]
slope = LinearRegression().fit(X, Y)           # fit the regression
m = slope.coef_[0]                             # extract the slope coefficient
series = np.concatenate((series, m), axis=0)   # append it to the running array
As it stands, I am reusing this slice of code, replacing "A1" with a new column name each time, all the way up to "Z3", which is extremely inefficient. I know there are many easy ways to do this with some modules, but the intermediate NaN values in the time series seem to limit me to this method, or something like it.
I tried using a for loop such as:
for col in df1.columns:
and replacing 'A1', for example with col in the code, but this does not seem to be working.
Is there any way I can do this more efficiently?
Thank you!

One liner (or three)
time = df[['Time']]
pd.DataFrame(np.linalg.pinv(time.T.dot(time)).dot(time.T).dot(df.fillna(0)),
             ['Slope'], df.columns)
Broken down with a bit of explanation
Using the closed form of OLS, the coefficient estimate is (XᵀX)⁻¹XᵀY.
In this case X is time, where we define time as df[['Time']]. I used the double brackets to preserve the dataframe and its two dimensions. If I'd used single brackets, I'd have gotten a series and its one dimension, and then the dot products aren't as pretty.
(XᵀX)⁻¹Xᵀ is np.linalg.pinv(time.T.dot(time)).dot(time.T).
Y is df.fillna(0). Yes, we could have done one column at a time, but why do that when we can do it all at once? You have to deal with the NaNs somehow. How would you imagine dealing with them? Only fitting over the times where you had data? Instead, I placed zeroes in the NaN spots (a simplification: it is not quite the same as dropping the NaN rows, so the slopes will differ somewhat from per-column regressions).
Finally, I use pd.DataFrame(stuff, ['Slope'], df.columns) to contain all slopes in one place with the original labels.
Note that I calculated the slope of the regression for Time against itself. Why not? It was there. Its value is 1.0. Great! I probably did it right!
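For completeness, here is a minimal runnable sketch of that one-liner, using the Time and A columns copied from the question's snippet (the B columns are omitted only to keep it short):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Time': [1.00, 2.00, 3.00, 4.00, 5.00, 5.50, 6.00],
    'A1':   [6.64, 6.70, np.nan, 7.15, np.nan, 7.44, 7.62],
    'A2':   [6.82, 6.86, np.nan, 7.26, np.nan, 7.63, 7.86],
    'A3':   [6.79, 6.92, np.nan, 7.26, np.nan, 7.58, 7.71],
})

time = df[['Time']]
slopes = pd.DataFrame(np.linalg.pinv(time.T.dot(time)).dot(time.T).dot(df.fillna(0)),
                      ['Slope'], df.columns)
print(slopes)  # one row of slopes, one column per original column (Time included, = 1.0)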

Looping is a decent strategy for a modest number (say, fewer than thousands) of columns. Without seeing your implementation, I can't say what's wrong, but here's my version, which works:
slopes = []
for c in df1.columns:
    if c == "Time":
        continue                               # skip the x-variable itself
    mask = ~np.isnan(df1[c])                   # drop the NaNs for this column
    x = np.atleast_2d(df1.Time[mask].values).T
    y = np.atleast_2d(df1[c][mask].values).T
    reg = LinearRegression().fit(x, y)
    slopes.append(reg.coef_[0])
I've simplified your code a bit to avoid creating so many temporary DataFrame objects, but it should work fine your way too.
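If you want the looped results labelled like the one-liner above, here is a small follow-up sketch (it assumes the slopes list from the loop has just been filled):

import numpy as np
import pandas as pd

# Each coef_[0] is a length-1 array because y was 2-D; flatten and label by column.
slope_series = pd.Series([float(np.ravel(s)[0]) for s in slopes],
                         index=[c for c in df1.columns if c != "Time"])
print(slope_series)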

Related

pandas.Series.interpolate() along "index" shows unexpected results

A pandas.Series() called "bla" in my example contains pressures in Pa as the index and wind speeds in m/s as values:
bla
100200.0 2.0
97600.0 NaN
91100.0 NaN
85000.0 3.0
82600.0 NaN
...
6670.0 NaN
5000.0 2.0
4490.0 NaN
3880.0 NaN
3000.0 9.0
Length: 29498, dtype: float64
bla.index
Float64Index([100200.0, 97600.0, 91100.0, 85000.0, 82600.0, 81400.0,
79200.0, 73200.0, 70000.0, 68600.0,
...
11300.0, 10000.0, 9970.0, 9100.0, 7000.0, 6670.0,
5000.0, 4490.0, 3880.0, 3000.0],
dtype='float64', length=29498)
As the wind speed values are NaN more often than not, I intended to interpolate considering the different pressure levels in order to have more wind speed values to work with.
The docs of interpolate() state that there's a method called "index" which interpolates considering the index-values, but the results don't make sense as compared to the initial values:
bla.interpolate(method="index", axis=0, limit=1, limit_direction="both")
100200.0 **2.00**
97600.0 10.40
91100.0 8.00
85000.0 **3.00**
82600.0 9.75
...
6670.0 3.00
5000.0 **2.00**
4490.0 9.00
3880.0 5.00
3000.0 **9.00**
Length: 29498, dtype: float64
I marked the original values in boldface.
I'd rather expect something like when using "linear":
bla.interpolate(method="linear", axis=0, limit=1, limit_direction="both")
100200.0 **2.000000**
97600.0 2.333333
91100.0 2.666667
85000.0 **3.000000**
82600.0 4.600000
...
6670.0 4.500000
5000.0 **2.000000**
4490.0 4.333333
3880.0 6.666667
3000.0 **9.000000**
Nevertheless, I'd like to properly use "index" as the interpolation method, since it should be the most accurate one: the pressure levels mark the "distance" between the wind speed values.
By and large, I'd like to understand how the interpolation results using "index" with the pressure levels could come out so counterintuitive, and how I could make them sound.
Thanks to @ALollz in the first comment underneath my question, I figured out where the issue lay:
My dataframe had two index levels, the outer being unique measuring timestamps, the inner being a standard range index.
I should have looked at each sub-set associated with a unique timestamp separately.
Within these subsets, interpolation makes sense and the results come out just right.
Example:
# Loop over all unique timestamps in the outermost index level
for timestamp in df.index.get_level_values(level=0).unique():
    # Extract the current subset
    df_subset = df.loc[timestamp, :]
    # Carry out interpolation on a column of interest
    df_subset["column of interest"] = df_subset[
        "column of interest"].interpolate(method="linear",
                                          axis=0,
                                          limit=1,
                                          limit_direction="both")
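A more compact variant of the same idea is to group by the outer index level and interpolate each group in one pass. Here is a small self-contained sketch; the index values, column name and numbers are made up for illustration:

import numpy as np
import pandas as pd

# Toy two-level frame: outer level = timestamp, inner level = range index.
idx = pd.MultiIndex.from_product([["t0", "t1"], range(4)])
df = pd.DataFrame({"column of interest": [2.0, np.nan, np.nan, 3.0,
                                          1.0, np.nan, 5.0, np.nan]}, index=idx)

# Interpolate within each timestamp's sub-set without an explicit loop.
df["column of interest"] = (
    df.groupby(level=0)["column of interest"]
      .transform(lambda s: s.interpolate(method="linear", limit=1,
                                         limit_direction="both"))
)
print(df)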

Does storing a large amount of NaN values in a large pandas dataframe massively affect performance and memory usage?

I have several large dataframes which are built up from a vehicle log. Since only one message can be present on the CAN bus (vehicle communication protocol) at any one time, most cells in any given row end up empty.
This is a simplified dataframe without any interpolation:
time  messageA1  messageA2  messageA3  messageB1  messageB2  messageC1  messageC2
   0          1          2          1        NaN        NaN        NaN        NaN
   1        NaN        NaN        NaN        NaN        NaN          3          2
   2        NaN        NaN        NaN          3          7        NaN        NaN
And this can continue for millions of rows, with NaN values making up about 95% of the entire dataframe. I have read that when a NaN/Null/None value is in a dataframe, it is stored as a float64 value.
My questions:
Is a float64 value allocated for every NaN value?
If yes, does it do this memory efficiently?
Will having a large dataframe, with 95% of it NaN values, be inefficient when it comes to process performance?
Is a float64 value allocated for every NaN value?
Yes it is;
If yes, does it do this memory efficiently?
No it does not, instead you are supposed to use a sparse data structure;
Will having a large dataframe, with 95% of it NaN values, be inefficient when it comes to process performance?
Yes it will, for all those operations that are O(f(N)), depending on f(N). Think of averaging the data, for instance: you have to check whether each value is NaN and then skip it (or maybe treat it as 0, depending on the case), and this is pure overhead.
You might want to compare the sheer size of the dense data structure (your current implementation) against the sparse one in your case:
'dense : {:0.2f} Kbytes'.format(df.memory_usage().sum() / 1e3)
'sparse: {:0.2f} Kbytes'.format(sdf.memory_usage().sum() / 1e3)
The two numbers should be pretty different.
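For reference, here is a minimal sketch of how the two frames could be built for such a comparison. It assumes a reasonably recent pandas version (where SparseDtype replaced the older to_sparse()); the sizes and NaN share are made up to mimic the log:

import numpy as np
import pandas as pd

# Toy frame mimicking the log: roughly 95% NaN.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100_000, 8)))
df = df.mask(rng.random(df.shape) < 0.95)

# Sparse counterpart: NaN becomes the implicit fill value and is not stored.
sdf = df.astype(pd.SparseDtype("float64", fill_value=np.nan))

print('dense : {:0.2f} Kbytes'.format(df.memory_usage().sum() / 1e3))
print('sparse: {:0.2f} Kbytes'.format(sdf.memory_usage().sum() / 1e3))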

Pandas: Filling nan poor performance - avoid iterating over rows?

I have a performance problem with filling missing values in my dataset. This concerns a 500 MB / 5,000,000-row dataset (Kaggle: Expedia 2013).
It would be easiest to use df.fillna(), but it seems I cannot use this to fill every NaN with a different value.
I created a lookup table:
srch_destination_id | Value
2 0.0110
3 0.0000
5 0.0207
7 NaN
8 NaN
9 NaN
10 0.1500
12 0.0114
For each srch_destination_id, this table contains the corresponding value with which to replace NaN in the dataset.
# Iterate over the dataset row by row. If a value is missing (NaN), fill it
# with the value found in the lookup table.
for row in range(len(dataset)):
    if pd.isnull(dataset.iloc[row]['prop_location_score2']):
        cell = dataset.iloc[row]['srch_destination_id']
        dataset.set_value(row, 'prop_location_score2', lookuptable.loc[cell])
This code works when iterating over 1000 rows, but when iterating over all 5 million rows, my computer never finishes (I waited hours).
Is there a better way to do what I'm doing? Did I make a mistake somewhere?
pd.Series.fillna does accept a series or a dictionary, as well as scalar replacement values.
Therefore, you can create a series mapping from the lookup table:
s = lookuptable.set_index('srch_destination_id')['Value']
Then use this to fill in the NaN values in dataset:
dataset['prop_location_score2'] = dataset['prop_location_score2'].fillna(
    dataset['srch_destination_id'].map(s.get))
Notice that the input to fillna is built by mapping each row's identifier in dataset through s; pd.Series.map performs the necessary lookup.
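Here is a tiny self-contained example of the same pattern (the lookup values are taken from the table above; the dataset rows are made up):

import numpy as np
import pandas as pd

lookuptable = pd.DataFrame({'srch_destination_id': [2, 3, 5, 10],
                            'Value': [0.0110, 0.0000, 0.0207, 0.1500]})
dataset = pd.DataFrame({'srch_destination_id': [2, 5, 10, 3],
                        'prop_location_score2': [0.5, np.nan, np.nan, 0.7]})

# Series mapping srch_destination_id -> replacement value.
s = lookuptable.set_index('srch_destination_id')['Value']

# Fill each NaN with the value looked up from its own row's identifier.
dataset['prop_location_score2'] = dataset['prop_location_score2'].fillna(
    dataset['srch_destination_id'].map(s))
print(dataset)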

Data calculation in pandas python

I have:
   A1   A2  Random data  Random data2  Average    Stddev
0  0.1  2.0          300          3000     1.05  1.343503
1  0.5  4.5         4500           450     2.50  2.828427
2  3.0  1.2          800            80     2.10  1.272792
3  9.0  9.0          900            90     9.00  0.000000
And I would like to add a column 'ColumnX' whose values are calculated as:
ColumnX = min(df['Random data'] - df['Average'],
              (df['Random data2'] - df['Stddev']) / (3.0 * df['A2']))
I get the error:
ValueError: The truth value of a Series is ambiguous.
Your error comes from the built-in min function: comparing two Series asks for the truth value of a whole Series, which pandas refuses to resolve (it wants element-wise operations instead), so min isn't going to work row-wise.
A potential solution is to make two new calculated columns and then use the pandas DataFrame .min method.
df['calc_col_1'] = df['Random data']-df['Average']
df['calc_col_2'] = (df['Random data2']-df['Stddev'])/(3.0*df['A2'])
df['min_col'] = df[['calc_col_1','calc_col_2']].min(axis=1)
The min(axis=1) call finds the row-wise minimum of the two columns, which is then assigned to the new column. This way is efficient because you're using numpy vectorization, and it is easier to read.
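If you prefer to skip the helper columns, np.minimum gives the same result in one step, since it compares the two Series element-wise. A sketch using the frame from the question and the formula as interpreted above:

import numpy as np
import pandas as pd

df = pd.DataFrame({'A1': [0.1, 0.5, 3.0, 9.0],
                   'A2': [2.0, 4.5, 1.2, 9.0],
                   'Random data': [300, 4500, 800, 900],
                   'Random data2': [3000, 450, 80, 90],
                   'Average': [1.05, 2.50, 2.10, 9.00],
                   'Stddev': [1.343503, 2.828427, 1.272792, 0.000000]})

# Element-wise minimum of the two expressions, no intermediate columns needed.
df['ColumnX'] = np.minimum(df['Random data'] - df['Average'],
                           (df['Random data2'] - df['Stddev']) / (3.0 * df['A2']))
print(df['ColumnX'])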

pandas column division ValueError (putmask: mask and data must be the same size)

I am attempting to divide one column by another inside of a function:
lcontrib=lcontrib_lev.div(lcontrib_lev['base'],axis='index')
As can be seen, I am dividing by a column within the DataFrame, but I am getting a rather strange error:
ValueError: putmask: mask and data must be the same size
I must confess, this is the first time I have seen this error. It seems to suggest that the DF and the column are of different lengths, but clearly (since the column comes from the DataFrame) they are not.
A further twist is that I am using this function to loop a data management procedure over year-specific sets (the data are from the Quarterly Census of Employment and Wages 'singlefiles' in the beta series). The sets associated with the 1990-2000 time period go off without a hitch, but 2001 throws this error. I am afraid I have not been able to identify a difference in structure across years, and even if I could, how would it explain the length mismatch?
Any thoughts would be greatly appreciated.
EDIT (2/1/2014): Thanks for taking a look Tom. As requested, the pandas version is 0.13.0, and the data file in question is located here on the BLS FTP site. Just to clarify what I meant by consistent structure, every year has the same variable set and dtype (in addition to a consistent data code structure).
EDIT (2/1/2014): Perhaps it would be useful to share the entire function:
def qcew(f,m_dict):
    '''Function reads in file and captures county level aggregations with government contributions'''
    #Read in file
    cew=pd.read_csv(f)
    #Create string version of area fips
    cew['fips']=cew['area_fips'].astype(str)
    #Generate description variables
    cew['area']=cew['fips'].map(m_dict['area'])
    cew['industry']=cew['industry_code'].map(m_dict['industry'])
    cew['agglvl']=cew['agglvl_code'].map(m_dict['agglvl'])
    cew['own']=cew['own_code'].map(m_dict['ownership'])
    cew['size']=cew['size_code'].map(m_dict['size'])
    #Generate boolean masks
    lagg_mask=cew['agglvl_code']==73
    lsize_mask=cew['size_code']==0
    #Subset data to above specifications
    cew_super=cew[lagg_mask & lsize_mask]
    #Define column subset
    lsub_cols=['year','fips','area','industry_code','industry','own','annual_avg_estabs_count','annual_avg_emplvl',\
               'total_annual_wages','own_code']
    #Subset to desired columns
    cew_sub=cew_super[lsub_cols]
    #Rename columns
    cew_sub.columns=['year','fips','cty','ind_code','industry','own','estabs','emp','tot_wages','own_code']
    #Set index
    cew_sub.set_index(['year','fips','cty'],inplace=True)
    #Capture total wage base and the contributions of Federal, State, and Local
    cew_base=cew_sub['tot_wages'].groupby(level=['year','fips','cty']).sum()
    cew_fed=cew_sub[cew_sub['own_code']==1]['tot_wages'].groupby(level=['year','fips','cty']).sum()
    cew_st=cew_sub[cew_sub['own_code']==2]['tot_wages'].groupby(level=['year','fips','cty']).sum()
    cew_loc=cew_sub[cew_sub['own_code']==3]['tot_wages'].groupby(level=['year','fips','cty']).sum()
    #Convert to DFs for join
    lbase=DataFrame(cew_base).rename(columns={0:'base'})
    lfed=DataFrame(cew_fed).rename(columns={0:'fed_wage'})
    lstate=DataFrame(cew_st).rename(columns={0:'st_wage'})
    llocal=DataFrame(cew_loc).rename(columns={0:'loc_wage'})
    #Join these series
    lcontrib_lev=pd.concat([lbase,lfed,lstate,llocal],axis='index').fillna(0)
    #Diag prints
    print f
    print lcontrib_lev.head()
    print lcontrib_lev.describe()
    print '*****************************\n'
    #Calculate proportional contributions (failure point)
    lcontrib=lcontrib_lev.div(lcontrib_lev['base'],axis='index')
    #Group base data by year, county, and industry
    cew_g=cew_sub.reset_index().groupby(['year','fips','cty','ind_code','industry']).sum().reset_index()
    #Join contributions to joined data
    cew_contr=cew_g.set_index(['year','fips','cty']).join(lcontrib[['fed_wage','st_wage','loc_wage']])
    return cew_contr[[x for x in cew_contr.columns if x != 'own_code']]
Works ok for me (this is on 0.13.1; IIRC I don't think anything in this particular area changed, but it's possible it was a bug that was fixed).
In [48]: lcontrib_lev.div(lcontrib_lev['base'],axis='index').head()
Out[48]:
base fed_wage st_wage loc_wage
year fips cty
2001 1000 1000 NaN NaN NaN NaN
1000 NaN NaN NaN NaN
10000 10000 NaN NaN NaN NaN
10000 NaN NaN NaN NaN
10001 10001 NaN NaN NaN NaN
[5 rows x 4 columns]
In [49]: lcontrib_lev.div(lcontrib_lev['base'],axis='index').tail()
Out[49]:
base fed_wage st_wage loc_wage
year fips cty
2001 CS566 CS566 1 0.000000 0.000000 0.000000
US000 US000 1 0.022673 0.027978 0.073828
USCMS USCMS 1 0.000000 0.000000 0.000000
USMSA USMSA 1 0.000000 0.000000 0.000000
USNMS USNMS 1 0.000000 0.000000 0.000000
[5 rows x 4 columns]
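For reference, here is a self-contained toy sketch of the same division (hypothetical index values and wage numbers, not the QCEW data), which runs without the putmask error on current pandas:

import pandas as pd

# Hypothetical wage totals indexed by (year, fips, cty).
idx = pd.MultiIndex.from_tuples([(2001, '01000', '01000'),
                                 (2001, '02000', '02000')],
                                names=['year', 'fips', 'cty'])
lcontrib_lev = pd.DataFrame({'base': [100.0, 200.0],
                             'fed_wage': [10.0, 40.0],
                             'st_wage': [20.0, 60.0],
                             'loc_wage': [30.0, 80.0]}, index=idx)

# Divide every column by the 'base' column, row by row.
lcontrib = lcontrib_lev.div(lcontrib_lev['base'], axis='index')
print(lcontrib)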
