How to loop over the data with a condition? - Python

I have a set of data in which the 1st column is age (numerical), the 2nd column is gender (categorical) and the 3rd column is saving (numerical).
What I want to do is find the mean and standard deviation if a column holds numerical data, and find the mode if the column holds categorical data.
I tried to collect the indices where the type is "num" and use them in a for loop to calculate the mean and standard deviation, with the remaining indices used to calculate the mode of the categorical data (in this case the 2nd column). However, I got stuck in the loop.
import numpy as np

data = np.array([[11, "male", 1222], [23, "female", 333], [15, "male", 542]])
# type of the data above
types = ["num", "cat", "num"]

idx = []
for i in range(2):
    if (types[i] == "num"):
        idx.append(types[i].index)

for i in idx:
    np.mean(data[:, i].astype("float64"))
I hope the code can obtain the mean and standard deviation for the numerical data and the mode for the categorical data. If possible, I'd like to avoid bringing in any other package (I'm not sure whether `index` has its own package or not).

Simply remove the parentheses in the if statement.
...
...
idx = []
for i in range(2):
    if types[i] == "num":
        idx.append(i)
...
Edit:
Instead of looping over a range, I would suggest iterating over your types array with enumerate; that way you get the index of each item directly.

idx = []
for index, _type in enumerate(types):
    if _type == 'num':
        idx.append(index)
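Putting it together (this is my own sketch, not part of the original answer; it uses only numpy and the standard-library Counter, so no extra packages are needed): once the column types are known, each numeric column gets a mean and standard deviation and each categorical column gets a mode.

import numpy as np
from collections import Counter

data = np.array([[11, "male", 1222], [23, "female", 333], [15, "male", 542]])
types = ["num", "cat", "num"]

for i, t in enumerate(types):
    col = data[:, i]
    if t == "num":
        values = col.astype("float64")
        print(i, "mean =", np.mean(values), "std =", np.std(values))
    else:
        # mode = the most frequent value in the column
        mode, count = Counter(col).most_common(1)[0]
        print(i, "mode =", mode)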

Related

I'm not sure what is wrong with my code.. (linear/polynomial regression)

I have a data set (csv file) with three separate columns. Column 0 is the signal time, Column 1 is the frequency, and Column 2 is the intensity. There is a lot of noise in the data that can be sorted through by finding the variance of each signal frequency: if it is <2332 then it is the right frequency, and that is the data I want to run linear/polynomial regression on. P.S. I have to calculate the linear regression manually :(. The nested for-loop decision structure I have isn't currently working. Any solutions would be helpful! Thanks
data = csv.reader(file1)
sort = sorted(data, key=operator.itemgetter(1))  # sorted by the frequencies

for row in sort:
    x.append(float(row[0]))
    y.append(float(row[2]))
    frequencies.append(float(row[1]))

for i in range(499):
    freq_dict.update({frequencies[i]: [x[i], y[i]]})

for key in freq_dict.items():
    for row in sort:
        if key == float(row[1]):
            a.append(float(row[1]))
            b.append(float(row[2]))
            c.append(float(row[0]))
        else:
            num = np.var(a)
            if num < 2332.0:
                linearRegression(c, b, linear)
                print('yo')
                polyRegression(c, b, d, linear, py)
                mplot.plot(linear, py)
            else:
                a = []
                b = []
                c = []
Variances of 2332 or less are the frequencies I need. I used a range of 499 because that is the length of my data set. Also, I tried to clear the lists (a, b, c) if the frequency wasn't correct.
There are several issues I see going on. I am unsure why you sort your data if you already know the exact values you are looking for, and I am unsure why you split the data up into separate variables. The double for loops mean that you are repeating everything in sort for every single key in freq_dict; I'm not sure it was your intention to revisit all those values multiple times. Also, freq_dict.items() produces tuples (key, value pairs), so your key is a tuple and will never equal a float. Anyway, here is an attempt to rewrite some of the code.
import csv, numpy
import matplotlib.pyplot as plt
from scipy import stats

data = csv.reader(file1)  # Read file.
# Filter data to the condition (csv gives strings, so convert to float first).
f_data = [tuple(map(float, row)) for row in data if float(row[1]) < 2332.0]
x, _, y = zip(*f_data)  # Split data down the columns: time, frequency, intensity.
# Standard linear stats function.
slope, intercept, r_value, p_value, std_err = stats.linregress(x, y)
# Plot the data and the fit line.
plt.scatter(x, y)
plt.plot(x, numpy.array(x) * slope + intercept)
plt.show()
A solution closer to the original used the corrcoef of the lists instead. In a similar style it was as follows:
for key, value in freq_dict.items():  # 1487 keys
    for row in sort:  # when row -> goes to a new freq it calculates corrcoef of the collected lists
        if key == float(row[1]):  # 1487
            a.append(float(row[2]))
            b.append(float(row[0]))
        elif key != float(row[1]):
            if a:
                num = np.corrcoef(b, a)[0, 1]
                if (num < somenumber).any():
                    pass  # do stuff
                a = []  # clear the lists and reset the number
                b = []
                num = 0
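A different way to look at the grouping step (this is my own sketch, not from the original answers; the file name signals.csv is made up, and I'm assuming the variance threshold applies to the intensity column, which the question leaves slightly ambiguous): collect the rows for each frequency once, then test the variance per group, which avoids re-scanning sort for every key.

import csv
from collections import defaultdict
import numpy as np

groups = defaultdict(list)  # frequency -> list of (time, intensity) pairs
with open('signals.csv') as file1:  # hypothetical file name
    for row in csv.reader(file1):
        t, freq, intensity = float(row[0]), float(row[1]), float(row[2])
        groups[freq].append((t, intensity))

for freq, rows in groups.items():
    times, intensities = zip(*rows)
    if np.var(intensities) < 2332.0:  # the "right" frequencies per the question
        # fit the linear/polynomial regression on (times, intensities) here
        pass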

Monte Carlo simulation in Python - problem with looping

I am running a simple Python script for MC. Basically it reads through every row in the dataframe and selects the max and min of the two variables. Then the simulation is run 1000 times, selecting a random value between the min and max, computing the product, and writing the P50 value back to the data table.
Somehow the P50 output is the same for all rows. Any help on where I am going wrong?
import pandas as pd
import random
import numpy as np

data = [[0.075, 0.085, 120, 150], [0.055, 0.075, 150, 350], [0.045, 0.055, 175, 400]]
df = pd.DataFrame(data, columns=['P_min', 'P_max', 'H_min', 'H_max'])
NumSim = 1000

for index, row in df.iterrows():
    outdata = np.zeros(shape=(NumSim,), dtype=float)
    for k in range(NumSim):
        phi = (row['P_min'] + (row['P_max'] - row['P_min']) * random.uniform(0, 1))
        ht = (row['H_min'] + (row['H_max'] - row['H_min']) * random.uniform(0, 1))
        outdata[k] = phi * ht
    df['out_p50'] = np.percentile(outdata, 50)

print(df)
With df['out_p50'] = np.percentile(outdata, 50) you are saying that you want the whole column set to the given value, not a specific row of the column. The numbers are therefore generated and saved, but they are saved to the whole column each time, so in the end you see the last generated number in every row.
Instead, use df.loc[index, 'out_p50'] = np.percentile(outdata, 50) to specify the particular row you want to set.
Yup -- you're writing a scalar value to the entire column, and you overwrite that value on each iteration. If you want, you can simply specify the row with df.loc for a quick fix. Also consider using np.median(outdata) instead of the percentile call.
Perhaps the most important feature of pandas is its built-in support for vectorization: you work with entire columns of data rather than looping through the data frame. Think of it like a list comprehension in which you don't need the for row in df iteration at the end.
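As a rough sketch of that vectorized idea (my own illustration, not part of the original answer), you can draw all NumSim samples for every row at once and take the percentile column-wise:

import numpy as np
import pandas as pd

data = [[0.075, 0.085, 120, 150], [0.055, 0.075, 150, 350], [0.045, 0.055, 175, 400]]
df = pd.DataFrame(data, columns=['P_min', 'P_max', 'H_min', 'H_max'])
NumSim = 1000

rng = np.random.default_rng()
u1 = rng.random((NumSim, len(df)))  # one column of draws per dataframe row
u2 = rng.random((NumSim, len(df)))
phi = df['P_min'].values + (df['P_max'] - df['P_min']).values * u1
ht = df['H_min'].values + (df['H_max'] - df['H_min']).values * u2
df['out_p50'] = np.percentile(phi * ht, 50, axis=0)  # P50 per row
print(df)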

apply max to varying-dimension subsets of pandas dataframe

For a dataframe with an indexed column with repeated indexes, I'm trying to get the maximum value found in a different column, by index, and assign it to a third column, so that for any given row, we can see the maximum value found in any row with the same index.
I'm doing this over a very large data set and would like it to be vectorized if possible. For now, I can't get it to work at all.
multiindexDF = pd.DataFrame([[1, 2, 3, 3, 4, 4, 4, 4], [5, 6, 7, 10, 15, 11, 25, 89]]).transpose()
multiindexDF.columns = ['theIndex', 'theValue']
multiindexDF['maxValuePerIndex'] = 0

uniqueIndices = multiindexDF['theIndex'].unique()

for i in uniqueIndices:
    matchingIndices = multiindexDF['theIndex'] == i
    maxValue = multiindexDF[matchingIndices == i]['theValue'].max()
    multiindexDF.loc[matchingIndices]['maxValuePerIndex'] = maxValue
This fails, telling me I should use .loc, when I'm already using it. I'm not sure what the error means, and not sure how to fix it; ideally I wouldn't have to loop through everything at all and could vectorize it instead.
I'm looking for this
targetDF = pd.DataFrame([[1,2,3,3,4,4,4,4],[5,6,10,7,15,11,25,89],[5,6,10,10,89,89,89,89]]).transpose()
targetDF
Looks like this is a good case for groupby transform: it gets the maximum value per index group and broadcasts it back onto the original index (rather than the grouped index):
multiindexDF['maxValuePerIndex'] = multiindexDF.groupby("theIndex")["theValue"].transform("max")
The reason you're getting the SettingWithCopyWarning is that in your .loc call you're taking a slice of a slice and setting the value there; see the two pairs of square brackets in:
multiindexDF.loc[matchingIndices]['maxValuePerIndex'] = maxValue
So it tries to assign the value to the slice rather than to the original DataFrame, because you're doing a .loc and then another [] after it in a chain.
So using your original approach:
for i in uniqueIndices:
    matchingIndices = multiindexDF['theIndex'] == i
    maxValue = multiindexDF.loc[matchingIndices, 'theValue'].max()
    multiindexDF.loc[matchingIndices, 'maxValuePerIndex'] = maxValue
(Notice I've also changed the first lookup, where you were incorrectly comparing the boolean mask to i instead of indexing with it.)
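For reference, here is the transform approach run end-to-end on the sample frame from the question (a quick illustration; the printed output should look roughly like the comments):

import pandas as pd

multiindexDF = pd.DataFrame([[1, 2, 3, 3, 4, 4, 4, 4],
                             [5, 6, 7, 10, 15, 11, 25, 89]]).transpose()
multiindexDF.columns = ['theIndex', 'theValue']

# Maximum per group, broadcast back to every row of that group.
multiindexDF['maxValuePerIndex'] = multiindexDF.groupby('theIndex')['theValue'].transform('max')
print(multiindexDF)
#    theIndex  theValue  maxValuePerIndex
# 0         1         5                 5
# 1         2         6                 6
# 2         3         7                10
# 3         3        10                10
# 4         4        15                89
# 5         4        11                89
# 6         4        25                89
# 7         4        89                89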

Drop Pandas DataFrame lines according to a GroupBy property

I have some DataFrames with information about some elements, for instance:
my_df1=pd.DataFrame([[1,12],[1,15],[1,3],[1,6],[2,8],[2,1],[2,17]],columns=['Group','Value'])
my_df2=pd.DataFrame([[1,5],[1,7],[1,23],[2,6],[2,4]],columns=['Group','Value'])
I have used something like dfGroups = df.groupby('group').apply(my_agg).reset_index(), so now I have DataFrames with information on groups of the previous elements, say:
my_df1_Group=pd.DataFrame([[1,57],[2,63]],columns=['Group','Group_Value'])
my_df2_Group=pd.DataFrame([[1,38],[2,49]],columns=['Group','Group_Value'])
Now I want to clean my groups according to properties of their elements. Let's say that I want to discard groups containing an element with Value greater than 16. So in my_df1_Group only the first group should be left, while both groups qualify to stay in my_df2_Group.
As I don't know how to get my_df1_Group and my_df2_Group from my_df1 and my_df2 in Python (I know other languages where it would simply be name+"_Group" with name looping over [my_df1, my_df2], but how do you do that in Python?), I built a list of lists:
SampleList = [[my_df1,my_df1_Group],[my_df2,my_df2_Group]]
Then, I simply try this:
my_max = 16
Bad = []

for Sample in SampleList:
    for n in Sample[1]['Group']:
        df = Sample[0].loc[Sample[0]['Group'] == n]  # This is inelegant, but trying to work
                                                     # with Sample[1] in the for doesn't work
        if (df['Value'].max() > my_max):
            Bad.append(1)
        else:
            Bad.append(0)
    Sample[1] = Sample[1].assign(Bad_Row=pd.Series(Bad))
    Sample[1] = Sample[1].query('Bad_Row == 0')
This runs without errors, but doesn't work. In particular, it doesn't add the column Bad_Row to my df, nor does it modify my DataFrame (yet the query runs smoothly even though the Bad_Row column doesn't seem to exist...). On the other hand, if I run this technique manually on a df (i.e. not in a loop), it works.
How should I do this?
Based on your comment below, I think you want to check whether a Group in your aggregated data frame has a Value in the input data greater than 16. One solution is to perform a row-wise calculation against a criterion on the input data. To accomplish this, my_func accepts a row from the aggregated data frame and the input data as a pandas groupby object. For each group in your grouped data frame, it subsets your initial data and uses boolean logic to see if any of the Values in the input data meet your specified criterion.
def my_func(row, grouped_df1):
    if (grouped_df1.get_group(row['Group'])['Value'] > 16).any():
        return 'Bad Row'
    else:
        return 'Good Row'

my_df1 = pd.DataFrame([[1,12],[1,15],[1,3],[1,6],[2,8],[2,1],[2,17]], columns=['Group','Value'])
my_df1_Group = pd.DataFrame([[1,57],[2,63]], columns=['Group','Group_Value'])

grouped_df1 = my_df1.groupby('Group')
my_df1_Group['Bad_Row'] = my_df1_Group.apply(lambda x: my_func(x, grouped_df1), axis=1)
Returns:
   Group  Group_Value   Bad_Row
0      1           57  Good Row
1      2           63   Bad Row
Based on dubbbdan's idea, here is code that works:
my_max = 16

def my_func(row, grouped_df1):
    if (grouped_df1.get_group(row['Group'])['Value'] > my_max).any():
        return 1
    else:
        return 0

SampleList = [[my_df1, my_df1_Group], [my_df2, my_df2_Group]]

for Sample in SampleList:
    grouped_df = Sample[0].groupby('Group')
    Sample[1]['Bad_Row'] = Sample[1].apply(lambda x: my_func(x, grouped_df), axis=1)
    Sample[1].drop(Sample[1][Sample[1]['Bad_Row'] != 0].index, inplace=True)
    Sample[1].drop(['Bad_Row'], axis=1, inplace=True)
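A vectorized alternative is also possible (this is my own sketch, not from the original answers): collect the offending groups with a boolean mask on the element-level frame and drop them from the aggregated frame with isin, which avoids the row-wise apply entirely.

import pandas as pd

my_df1 = pd.DataFrame([[1,12],[1,15],[1,3],[1,6],[2,8],[2,1],[2,17]], columns=['Group','Value'])
my_df1_Group = pd.DataFrame([[1,57],[2,63]], columns=['Group','Group_Value'])

my_max = 16
# Groups that contain at least one Value above the threshold.
bad_groups = my_df1.loc[my_df1['Value'] > my_max, 'Group'].unique()
# Keep only the groups that are not flagged.
my_df1_Group = my_df1_Group[~my_df1_Group['Group'].isin(bad_groups)]
print(my_df1_Group)  # only Group 1 remains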

Iterating over entire columns and storing the result into a list

I would like to know how I could iterate through each column of a dataframe, perform some calculations, and store the results in another dataframe.
df_empty = []
m = daily.ix[:, -1]        # Columns = stocks & Rows = daily returns
stocks = daily.ix[:, :-1]

for col in range(len(stocks.columns)):
    s = daily.ix[:, col]
    covmat = np.cov(s, m)
    beta = covmat[0, 1] / covmat[1, 1]
    return (beta)

print(beta)
In the above example, I first want to calculate a covariance matrix between s (the columns representing the stocks' daily returns, which I want to iterate through one by one) and m (the market daily return, which is my reference column and the last column of my dataframe). Then I want to calculate the beta for each stock/market covariance pair.
I'm not sure why return (beta) gives me a single numerical result for one stock while print(beta) prints the betas for all stocks.
I'd like to find a way to create a dataframe with all these betas.
beta_df = df_empty.append(beta)
I have tried the above code but it returns None, as if it could not append the outcome.
Thank you for your help
The return statement within your for loop ends the function (and with it the loop) the first time the return is encountered. Moreover, you are not saving the beta values anywhere, because the for loop itself does not return a value in Python (it only has side effects).
Apart from that, you may choose a more pandas-like approach using apply on the data frame, which iterates over the columns of the data frame, passes each column to the supplied function as its first parameter, and collects the results of the function calls. Here is a minimal working example with some dummy data:
import pandas as pd
import numpy as np

# create some dummy data
daily = pd.DataFrame(np.random.randint(100, size=(100, 5)))

# define reference column
cov_column = daily.iloc[:, -1]

# setup computation function
def compute(column):
    covmat = np.cov(column, cov_column)
    return covmat[0, 1] / covmat[1, 1]

# use apply to iterate over columns
result = daily.iloc[:, :-1].apply(compute)

# show output
print(result)
0 -0.125382
1 0.024777
2 0.011324
3 -0.017622
dtype: float64
