for loop in scipy.stats.linregress - python

I am using the scipy stats module to calculate the linear regression, i.e.

slope, intercept, r_value, p_value, std_err = stats.linregress(data['cov_0.0075']['num'], data['cov_0.0075']['com'])

where data is a dictionary containing several 'cov_x' keys, each corresponding to a DataFrame with columns 'num' and 'com'.
I want to loop through this dictionary and run the linear regression on each 'cov_x', but I am not sure how to do this. I tried:

for i in data:
    slope_+str(i), intercept+str(i), r_value+str(i), p_value+str(i), std_err+str(i) = stats.linregress(data[i]['num'], data[i]['com'])

Essentially I want len(data) slope_x values, one per key.

You could use a list comprehension to collect all the stats.linregress return values:
result = [stats.linregress(df['num'],df['com']) for key, df in data.items()]
result is a list of 5-tuples. To collect the first, second, third, etc. elements of each 5-tuple into separate sequences, use zip(*result):
slopes, intercepts, r_values, p_values, stderrs = zip(*result)
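If you'd rather keep the results keyed by the same 'cov_x' names, a dict comprehension is a minimal alternative (a sketch, assuming data is the dict of DataFrames from the question):

import scipy.stats as stats

# keyed results, one entry per 'cov_x' key in the question's data dict
results = {key: stats.linregress(df['num'], df['com']) for key, df in data.items()}

# in recent SciPy versions the return value is a named tuple,
# so individual statistics are also available by attribute:
slopes = {key: res.slope for key, res in results.items()}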

You should be able to do what you're trying to, but there are a couple of things to watch out for.
First, you can't build a variable name by concatenating strings onto it: an expression like slope_+str(i) can never appear on the left of the equals sign.
What you can do instead is use the dict data type, which gives you the string indexing you want:
import scipy.stats as stats
import pandas as pd
import numpy as np

# build example data: one random DataFrame per 'cov_x' key
data = {}
l = ['cov_0.0075', 'cov_0.005']
for i in l:
    x = np.random.random(100)
    y = np.random.random(100) + 15
    d = {'num': x, 'com': y}
    df = pd.DataFrame(data=d)
    data[i] = df

# one dict per regression statistic, keyed by the same 'cov_x' names
slope = {}
intercept = {}
r_value = {}
p_value = {}
std_error = {}
for i in data:
    slope[str(i)], \
    intercept[str(i)], \
    r_value[str(i)], \
    p_value[str(i)], std_error[str(i)] = stats.linregress(data[i]['num'], data[i]['com'])
print(slope, intercept, r_value, p_value, std_error)
should work just fine. Otherwise, you can store individual values and put them in a list later.
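For reference, a minimal sketch of that list-based alternative (reusing the data dict built above):

keys, slopes = [], []
for i in data:
    res = stats.linregress(data[i]['num'], data[i]['com'])
    keys.append(i)
    slopes.append(res[0])  # the first element of the result is the slope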

Related

Going from code that compares one value to all other values to all values of all other values

I've written code that takes a number for the n-grams, a specific index, and a threshold, and returns the values that fall within that threshold. However, it currently only compares the set of tokens at one given index to the set of tokens at every other index. I want to compare each set of tokens at all indices to every other set of tokens at all indices. I don't think this is a difficult question, but Python is my main language and I struggle with for loops a bit.
So essentially, the variable token in the function should iterate over each string in the column and be compared with comp_token, and the index argument would be removed, since the function would iterate over all indices.
Let me know if that isn't clear enough and I will think more about how to say this: it is just difficult because the thing I am asking is the thing I am struggling with.
import py_stringmatching as sm
import pandas as pd
import numpy as np

data = ['Time', "NY Times", 'Atlantic']
ph = pd.DataFrame(data, columns=['companies'])
ph.reset_index(inplace=True)

jac = sm.Jaccard()

def predict_label(num, index, thresh):
    qg_num_tok = sm.QgramTokenizer(qval=num)
    companies = ph.companies.to_list()
    ids = ph['index']
    companies_qg_num_token = {}
    companies_id2index = {}
    for i in range(len(companies)):
        companies_id2index[i] = companies[i]
        companies_qg_num_token[i] = qg_num_tok.tokenize(companies[i])
    predicted_label = [1]
    token = companies_qg_num_token[index]  # index you want: to get all the tokens
    for comp_name in ids[1:]:
        comp_token = companies_qg_num_token[comp_name]
        sim = jac.get_sim_score(token, comp_token)
        if sim > thresh:
            predicted_label.append(1)
        else:
            predicted_label.append(0)
    # companies_id2index must be equal to token number
    ph.loc[ph['index'] != companies_id2index[index], 'label'] = 0  # if not equal to index
    ph['prediction'] = predicted_label
    print(token)
    print(companies_id2index[index])
    return ph.query('prediction==1')

predict_label(3, 1, .5)  # qgram size, index, threshold
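For what it's worth, a minimal sketch of the all-pairs comparison described above, reusing the question's jac scorer and the companies_qg_num_token dict built inside predict_label (illustrative only, not the poster's code):

def all_pairs_above(tokens_by_index, thresh):
    # compare every token set against every other token set, each pair once
    hits = {}
    for i, tok_i in tokens_by_index.items():
        for j, tok_j in tokens_by_index.items():
            if i < j:  # skip self-comparisons and mirrored pairs
                sim = jac.get_sim_score(tok_i, tok_j)
                if sim > thresh:
                    hits[(i, j)] = sim
    return hits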

Re-arrange/combine dataframes in dictionaries

I have a dictionary which is made up of two data frames
dict_keys(['on_folds', 'global'])
each data frame is made up of two series
results['global'].keys() = Index(['rMSE', 'Bias'])
Each series has indexes with floats
results['global']['rMSE'].keys() = Index(['lasso_reg', 'lasso_class']),
I would like to build a new dictionary by combining the items of the old ones into the data frames
dict_keys(['lasso_reg', 'lasso_class'])
with series
results['lasso_reg'].keys() = Index(['rMSE', 'Bias'])
and floats
results['lasso_reg']['rMSE'].keys() = Index(['on_folds', 'global'])
This is essentially a swap of the indexes. I am not sure if transposing or combining would do it. How do I do that?
Update: here is an example of the contents of the dictionary.
I have the following dictionary
{'on_folds': rMSE 95% CI Bias SD
lasso_reg 1.690569 0.722917 3.257023 1.280235
lasso_class 2.616418 0.662500 4.025593 1.863953
xgb_class 6.899894 0.409375 9.261215 2.760382
xgb_reg 5.436057 0.482292 8.004550 2.138652
rf_reg 4.998613 0.659375 7.028649 2.291934
rf_class 3.820162 0.659375 6.436152 1.651670
and I would like to change it so it looks like this
{'lasso_reg': rMSE 95% CI Bias SD
on_folds 1.690569 0.722917 3.257023 1.280235
'lasso_class': rMSE 95% CI Bias SD
on_folds 2.616418 0.662500 4.025593 1.863953
etc.
I'm still not sure what your data looks like but try this one:
out = {}
for k, df in dct.items():
    temp = df.loc[['lasso_reg', 'lasso_class'], :]
    temp.index = [k, k]
    out.setdefault('lasso_reg', []).append(temp.iloc[0])
    out.setdefault('lasso_class', []).append(temp.iloc[1])
# each value is now a list of rows (Series) named after their source key;
# stack them side by side and transpose so each source becomes a row
out = {k: pd.concat(v, axis=1).T for k, v in out.items()}
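An alternative sketch, assuming every DataFrame in dct shares the same row labels: concatenate the whole dict into one MultiIndexed frame and slice it back out by the inner level:

import pandas as pd

combined = pd.concat(dct)  # rows become a MultiIndex of (source key, model name)
out = {model: combined.xs(model, level=1)
       for model in combined.index.get_level_values(1).unique()}
# e.g. out['lasso_reg'] is now indexed by 'on_folds', 'global', ...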

How to find which input value in a loop yielded the output min?

I am trying to solve a minimum value problem. I can obtain the min values from the two loops, but what I really need is also the exact input values that correspond to the output min.
from __future__ import division
from numpy import *

b1 = 0.9917949
b2 = 0.01911
b3 = 0.000840
b4 = 0.10175
b5 = 0.000763
mu = 1.66057*10**(-24)  # gram
c = 3.0*10**8

Mler = open("olasiM.txt", "w+")
data = zeros(0, 'float')
for A in range(1, 25):
    M2 = zeros(0, 'float')
    print 'A=', A
    for Z in range(1, A+1):
        SEMF = mu*c**2*(b1*A + b2*A**(2./3.) - b3*Z + b4*A*((1./2.) - (Z/A))**2 + (b5*Z**2)/(A**(1./3.)))
        SEMF = array(SEMF)
        M2 = hstack((M2, SEMF))
    minm2 = min(M2)
    data = hstack((data, minm2))
    data = hstack((data, A))
datalist = data.tolist()
for i in range(len(datalist)):
    Mler.write(str(datalist[i]) + '\n')
Mler.close()
Here, what I want is to see the min value of SEMF and the corresponding A, Z values; for example, it should give A=1, Z=1 and SEMF=some number.
I also don't know how to write these A and Z values to the file.
The big advantage of numpy over plain Python lists is vectorized operations, and unfortunately your code doesn't use them at all. For example, the whole inner loop over Z can easily be vectorized; instead you are computing single elements with Python floats and stacking them one by one into the numpy array M2.
So I'd refactor that part of the code by:
import numpy as np
# ...
Zs = np.arange(1, A+1, dtype=float)
SEMF = mu*c**2 * (b1*A + b2*A**(2./3.) - b3*Zs + b4*A*((1./2.) - (Zs/A))**2 + (b5*Zs**2)/(A**(1./3.)))
Here the SEMF array should be exactly what you'd obtain as the final M2 array. Now you can find the minimum and stack that value into your data array:
min_val = SEMF.min()
data = hstack((data, min_val))
data = hstack((data, A))
If you also want to keep track of which value of Z gave the minimum, you can use the argmin method (note that Zs starts at 1, so the Z value is min_pos + 1):
min_val, min_pos = SEMF.min(), SEMF.argmin()
data = hstack((data, np.array([min_val, min_pos + 1, A])))
The final code should look like:

from __future__ import division
import numpy as np

b1 = 0.9917949
b2 = 0.01911
b3 = 0.000840
b4 = 0.10175
b5 = 0.000763
mu = 1.66057*10**(-24)  # gram
c = 3.0*10**8

data = np.zeros(0, 'float')
for A in range(1, 25):
    Zs = np.arange(1, A+1, dtype=float)
    SEMF = mu*c**2 * (b1*A + b2*A**(2./3.) - b3*Zs + b4*A*((1./2.) - (Zs/A))**2 + (b5*Zs**2)/(A**(1./3.)))
    min_val, min_pos = SEMF.min(), SEMF.argmin()
    data = np.hstack((data, np.array([min_val, min_pos + 1, A])))  # min_pos + 1 is the Z value
datalist = data.tolist()
with open("olasiM.txt", "w+") as mler:
    for i in range(len(datalist)):
        mler.write(str(datalist[i]) + '\n')
Note that numpy provides functions to save/load arrays to/from files, like savetxt, so I suggest using those instead of writing the values manually.
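For instance, since data here ends up as a flat array of (min SEMF, Z, A) triples, a one-line sketch:

np.savetxt("olasiM.txt", data.reshape(-1, 3), header="min_SEMF Z A")  # one triple per row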
Probably some numpy expert could vectorize the operations over A as well. Unfortunately my numpy knowledge isn't that advanced, and I don't know how to handle the fact that the number of Z values varies with A (the range(1, A+1) thing)...
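One hedged way to do it (a sketch, not verified against the original output) is to broadcast A and Z into a 2-D grid and mask the invalid Z > A entries before taking the minimum:

A = np.arange(1, 25, dtype=float)[:, None]  # shape (24, 1): one row per A
Z = np.arange(1, 25, dtype=float)[None, :]  # shape (1, 24): one column per Z
SEMF = mu*c**2 * (b1*A + b2*A**(2./3.) - b3*Z + b4*A*((1./2.) - (Z/A))**2 + (b5*Z**2)/(A**(1./3.)))
SEMF[Z > A] = np.inf              # Z only runs from 1 to A, so mask the rest
min_vals = SEMF.min(axis=1)       # minimum SEMF for each A
min_Zs = SEMF.argmin(axis=1) + 1  # corresponding Z values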

Python looping and Pandas rank/index quirk

This question pertains to one posted here:
Sort dataframe rows independently by values in another dataframe
In the linked question, I utilize a Pandas Dataframe to sort each row independently using values in another Pandas Dataframe. The function presented there works perfectly every single time it is directly called. For example:
import pandas as pd
import numpy as np
import os

## Generate example dataset
d1 = {}
d2 = {}
d3 = {}
d4 = {}
## generate data:
np.random.seed(5)
for col in list("ABCDEF"):
    d1[col] = np.random.randn(12)
    d2[col+'2'] = np.random.random_integers(0, 100, 12)
    d3[col+'3'] = np.random.random_integers(0, 100, 12)
    d4[col+'4'] = np.random.random_integers(0, 100, 12)
t_index = pd.date_range(start='2015-01-31', periods=12, freq="M")
# place data into dataframes
dat1 = pd.DataFrame(d1, index=t_index)
dat2 = pd.DataFrame(d2, index=t_index)
dat3 = pd.DataFrame(d3, index=t_index)
dat4 = pd.DataFrame(d4, index=t_index)

## Functions
def sortByAnthr(X, Y, Xindex, Reverse=False):
    # order the subset of X.index by Y
    ordrX = [x for (x, y) in sorted(zip(Xindex, Y), key=lambda pair: pair[1], reverse=Reverse)]
    return ordrX

def OrderRow(row, df):
    ordrd_row = df.ix[row.dropna().name, row.dropna().values].tolist()
    return ordrd_row

def r_selectr(dat2, dat1, n, Reverse=False):
    ordr_cols = dat1.apply(lambda x: sortByAnthr(x, dat2.loc[x.name, :], x.index, Reverse), axis=1).iloc[:, -n:]
    ordr_cols.columns = list(range(0, n))  # assign interpretable column names
    ordr_r = ordr_cols.apply(lambda x: OrderRow(x, dat1), axis=1)
    return [ordr_cols, ordr_r]

## Call functions
ordr_cols2, ordr_r = r_selectr(dat2, dat1, 5)

## print output:
print("Ordering set:\n", dat2.iloc[-2:, :])
print("Original set:\n", dat1.iloc[-2:, :])
print("Column ordr:\n", ordr_cols2.iloc[-2:, :])
As can be checked, the columns of dat1 are correctly ordered according to the values in dat2.
However, when called from a loop over dataframes, it does not rank/index correctly and produces completely dubious results. Although I am not quite able to recreate the problem using the reduced version presented here, the idea should be the same.
## Loop test:
out_list = []
data_dicts = {'dat2': dat2, 'dat3': dat3, 'dat4': dat4}
for i in range(3):
    # this outer for loop supplies different parameter values to a wrapper
    # function that calls r_selectr.
    for key in data_dicts.keys():
        ordr_cols, _ = r_selectr(data_dicts[key], dat1, 5)
        out_list.append(ordr_cols)
        # do stuff here

## print output:
print("Ordering set:\n", dat3.iloc[-2:, :])
print("Column ordr:\n", ordr_cols2.iloc[-2:, :])
In my code (almost completely analogous to the example given here), the ordr_cols are no longer ordered correctly for any of the sorting dataframes.
I currently work around the issue by splitting the ordering and indexing operations of r_selectr into two separate functions. For some reason that resolves the issue, though I have no idea why.
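For reference, a sketch of that split (hypothetical names; it simply divides r_selectr at the boundary between the ordering and the indexing step):

def get_order(dat2, dat1, n, Reverse=False):
    # ordering step only
    ordr_cols = dat1.apply(lambda x: sortByAnthr(x, dat2.loc[x.name, :], x.index, Reverse), axis=1).iloc[:, -n:]
    ordr_cols.columns = list(range(0, n))
    return ordr_cols

def apply_order(ordr_cols, dat1):
    # indexing step only
    return ordr_cols.apply(lambda x: OrderRow(x, dat1), axis=1)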

Averaging values in a list of a list of a list in Python

I'm working on a method to average data from multiple files and put the results into a single file. Each line of the files looks like:
File #1
Test1,5,2,1,8
Test2,10,4,3,2
...
File #2
Test1,2,4,5,1
Test2,4,6,10,3
...
Here is the code I use to store the data:
totalData = []
for i in range(0, len(files)):
    data = []
    if ".csv" in files[i]:
        infile = open(files[i], "r")
        temp = infile.readline()
        while temp != "":
            data.append([c.strip() for c in temp.split(",")])
            temp = infile.readline()
        totalData.append(data)
So what I'm left with is totalData looking like the following:
totalData = [[
[Test1,5,2,1,8],
[Test2,10,4,3,2]],
[[Test1,2,4,5,1],
[Test2,4,6,10,3]]]
What I want to average is for all Test1, Test2, etc, average all the first values and then the second values and so forth. So testAverage would look like:
testAverage = [[Test1,3.5,3,3,4.5],
[Test2,7,5,6.5,2.5]]
I'm struggling to think of a concise/efficient way to do this. Any help is greatly appreciated! Also, if there are better ways to manage this type of data, please let me know.
It just needs a few nested loops:
totalData = [[['Test1', 5, 2, 1, 8], ['Test2', 10, 4, 3, 2]],
             [['Test1', 2, 4, 5, 1], ['Test2', 4, 6, 10, 3]]]
for t in range(len(totalData[0])):            # tests
    result = [totalData[0][t][0]]
    for i in range(1, len(totalData[0][0])):  # numbers
        total = 0.0                           # avoid shadowing the built-in sum
        for j in range(len(totalData)):       # files
            total += totalData[j][t][i]
        total /= len(totalData)
        result.append(total)
    print(result)
First flatten it out and sort it (chain.from_iterable returns an iterator, so materialize and sort it in one step; groupby needs sorted input):

import itertools
results = sorted(itertools.chain.from_iterable(totalData))
then use groupby
data = {}
for key, values in itertools.groupby(results, lambda x: x[0]):
    columns = list(zip(*values))[1:]  # drop the name column before averaging
    data[key] = [sum(c)*1.0/len(c) for c in columns]
and finally just print your data
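For example:

for key, avgs in sorted(data.items()):
    print(key, avgs)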
If your data structure is regular, the best option is probably numpy. You should be able to install it with pip from the terminal:
pip install numpy
Then in python:
import numpy as np
totalData = np.array(totalData)
# drop the label column ('Test1', 'Test2'), which is not numeric, and convert to floats
totalData = np.array(totalData[:, :, 1:], float)
# average across files (the first axis)
np.mean(totalData, axis=0)
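To get back the testAverage shape from the question (label plus averages per test), a short follow-up sketch, assuming totalData is still the original nested list (i.e. run this before the conversion above):

import numpy as np

arr = np.array(totalData)  # still strings at this point
labels = arr[0, :, 0]      # test names, taken from the first file
means = np.array(arr[:, :, 1:], float).mean(axis=0)
testAverage = [[lab] + row.tolist() for lab, row in zip(labels, means)]
# testAverage -> [['Test1', 3.5, 3.0, 3.0, 4.5], ['Test2', 7.0, 5.0, 6.5, 2.5]]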
