I have what I think is a very interesting problem here, but I have little idea how to go about solving it computationally, or whether a pandas DataFrame is appropriate for the purpose. I have data like so:
SuperGroup Group Code Weight Income
8 E1 E012 a 0.5 1000
9 E1 E012 b 0.2 1000
10 E1 E013 b 0.2 1000
11 E1 E013 c 0.3 1000
Effectively, 'Code' has a one-to-one relationship with 'Weight'.
'SuperGroup' has a one-to-one relationship with 'Income'.
A SuperGroup is composed of many Groups and a Group has many Codes.
I am attempting to distribute the income according to the combined weights of the codes within each group: for E012 this is 0.5*0.2 = 0.1 and for E013 it is 0.2*0.3 = 0.06. As a proportion of their total, E012's share becomes 0.625 (0.1/(0.1+0.06)) and E013's becomes 0.375 (0.06/(0.1+0.06)).
The dataframe can be collapsed and re-written as:
SuperGroup Group Code CombinedWeight Income
8 E1 E012 a,b 0.625 1000
10 E1 E013 b,c 0.375 1000
I am capable of producing the above dataframe; my next step is to apply the weights to the income so that it still averages to 1000 within the SuperGroup but reflects the size of each group's weight.
Letting x and y be the weighted incomes for E012 and E013, the proportions give x = (0.625/0.375)y ≈ 1.67y. Additionally, (x + y)/2 = 1000. (Note: my data often has several groups in a supergroup, so there could be more than two unknowns, resulting in a system of linear equations, if my understanding is correct.)
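For the two-group case, a minimal sketch of solving that system with numpy (equivalently, each weighted income is just its proportion times the number of groups times the mean income):
import numpy as np
# Unknowns: x (E012's income) and y (E013's income)
# Equation 1: x - (0.625/0.375)*y = 0     (the proportion constraint)
# Equation 2: 0.5*x + 0.5*y = 1000        (the mean must stay 1000)
A = np.array([[1.0, -0.625 / 0.375],
              [0.5, 0.5]])
b = np.array([0.0, 1000.0])
x, y = np.linalg.solve(A, b)
print(x, y)  # 1250.0 750.0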
Solving simultaneously produces 1250 and 750 as the weighted incomes. The dataframe can be re-written as:
SuperGroup Group Code Income
8 E1 E012 a,b 1250
10 E1 E013 b,c 750
which is effectively how I need it. Any guidance is warmly appreciated.
First we agg the DataFrame on ['SuperGroup', 'Group']:
res = (df.groupby(['SuperGroup', 'Group'])
         .agg({'Weight': lambda x: x.cumprod().iloc[-1],  # product of the weights in the group
               'Code': ','.join,
               'Income': 'first'}))
Then we re-adjust the Income within each SuperGroup with the help of transform:
s = res.groupby(level='SuperGroup')
res['Income'] = s.Income.transform('sum')*res.Weight/s.Weight.transform('sum')
Weight Code Income
SuperGroup Group
E1 E012 0.10 a,b 1250.0
E013 0.06 b,c 750.0
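As a quick sanity check (my addition, not part of the answer above), the re-adjusted incomes still average back to the original 1000 within each SuperGroup:
res.groupby(level='SuperGroup')['Income'].mean()
# SuperGroup
# E1    1000.0
# Name: Income, dtype: float64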
I'm trying to print two columns against one another to see how much one depends on the other, i.e. how the chance of admission depends on research experience: print the average chance of admit against research.
I'm not sure I have the command correct, or whether size or something else should be used at the end:
df.groupby(['Chance of Admit', 'Research']).size()
this is the result when I run the above:
Chance of Admit Research
0.34 0 2
0.36 0 1
1 1
0.37 0 1
0.38 0 2
..
0.93 1 12
0.94 1 13
0.95 1 5
0.96 1 8
0.97 1 4
Length: 99, dtype: int64
To see the mean admission rate by number of research papers, you should group by 'Research' and take the mean:
df.groupby('Research').mean()
To see additional stats grouped by the number of research papers, use .describe():
df.groupby('Research').describe()
Finally, it may be useful to plot a correlation of the chance of admission vs. the amount of research:
df.plot.scatter(x='Research', y='Chance of Admit')
Syntactically speaking, to take the mean of the chance to admit column grouped by research, all you need is the following:
df[['Chance of Admit', 'Research']].groupby('Research').mean()
But be wary here, as the number may not actually tell you what the chance of getting admitted is given a particular amount of research. That is because you don't know the denominators in those probabilities.
For example,
Suppose I had a dataset that contained the following two rows:
Chance of Admit Research
0.75 1
0.25 1
The 'mean' "chance of admit" for research==1 would be 0.50 when calculated this way, but suppose the first chance came from a population of 100 students, of whom 75% (75) were admitted, and the second from a population of 900, of whom 25% (225) were admitted.
Then over all the data we have, we would see 300 students with research==1 admitted out of a total of 1000, i.e. a 30% chance to admit with research==1.
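If you do have those population sizes, a weighted mean gives the figure you actually want. A minimal sketch, assuming a hypothetical 'NumStudents' column holding the number of students behind each row:
# 'NumStudents' is hypothetical -- the population size each row's rate was computed from
weighted = df.groupby('Research').apply(
    lambda g: (g['Chance of Admit'] * g['NumStudents']).sum() / g['NumStudents'].sum())
For the two example rows above this returns 0.30 for research==1, matching the hand calculation.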
I am trying to aggregate pandas DataFrame and create 2 new columns that would be a slope and an intercept from a simple linear regression fit.
The dummy dataset looks like this:
CustomerID Month Value
a 1 10
a 2 20
a 3 20
b 1 30
b 2 40
c 1 80
c 2 90
And I want the output to look like this - which would regress Value against Month for each CustomerID:
CustomerID Slope Intercept
a 0.30 10
b 0.20 30
c 0.12 80
I know I could run a loop and then for each customerID run the linear regression model, but my dataset is huge and I need a vectorized approach. I tried using groupby and apply by passing linear regression function but didn't find a solution that would work.
Thanks in advance!
Using scipy with groupby. Here I am using a for loop (a dict comprehension) rather than apply, since apply is slower than a plain loop.
from scipy import stats
# linregress returns (slope, intercept, rvalue, pvalue, stderr); keep the first two
pd.DataFrame.from_dict(
    {cid: stats.linregress(g['Month'], g['Value'])[:2]
     for cid, g in df.groupby('CustomerID')},
    orient='index').rename(columns={0: 'Slope', 1: 'Intercept'})
Out[798]:
Slope Intercept
a 5.0 6.666667
b 10.0 20.000000
c 10.0 70.000000
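If you would rather stay within groupby.apply (likely slower, as noted above), here is a hedged alternative sketch using np.polyfit, which fits the same least-squares line:
import numpy as np
def fit(g):
    slope, intercept = np.polyfit(g['Month'], g['Value'], 1)  # degree-1 fit returns [slope, intercept]
    return pd.Series({'Slope': slope, 'Intercept': intercept})
df.groupby('CustomerID').apply(fit)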
I have 12 unique groups that I am trying to randomly sample from, each with a different number of observations. I want to randomly sample from the entire population (dataframe) with each group having the same probability of being selected from. The simplest example of this would be a dataframe with 2 groups.
groups probability
0 a 0.25
1 a 0.25
2 b 0.5
Using np.random.choice(df['groups'], p=df['probability'], size=100), each draw will now have a 50% chance of selecting group a and a 50% chance of selecting group b.
To come up with the probabilities I used the formula:
(1. / num_groups) / size_of_groups
or in Python:
num_groups = len(df['groups'].unique()) # 2
size_of_groups = df.groupby('groups').size()  # {a: 2, b: 1}
(1. / num_groups) / size_of_groups
Which returns
groups
a 0.25
b 0.50
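(Aside: as a quick sanity check on that formula, each group contributes size * (1/num_groups)/size = 1/num_groups, so the per-row probabilities should sum to exactly 1 over the whole frame.)
df['probability'].sum()  # 1.0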
This works great until I get past 10 unique groups, after which I start getting weird distributions. Here is a small example:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

np.random.seed(1234)
group_size = 12
groups = np.arange(group_size)
probs = np.random.uniform(size=group_size)
probs = probs / probs.sum()
g = np.random.choice(groups, size=10000, p=probs)
df = pd.DataFrame({'groups': g})
prob_map = ((1. / len(df['groups'].unique())) / df.groupby('groups').size()).to_dict()
df['probability'] = df['groups'].map(prob_map)
plt.hist(np.random.choice(df['groups'], p=df['probability'], size=10000, replace=True))
plt.xticks(np.arange(group_size))
plt.show()
I would expect a fairly uniform distribution with a large enough sample size, but I am getting these wings when the number of groups is 11+. If I change the group_size variable to 10 or lower, I do get the desired uniform distribution.
I can't tell if the problem is with my formula for calculating the probabilities, or possibly a floating point precision problem? Anyone know a better way to accomplish this, or a fix for this example?
Thanks in advance!
You are using hist, which defaults to 10 bins...
plt.rcParams['hist.bins']
10
Pass group_size as the bins parameter:
plt.hist(
np.random.choice(df['groups'], p=df['probability'], size=10000, replace=True),
bins=group_size)
There is no problem with your calculations. Your resulting array is:
arr = np.random.choice(df['groups'], p=df['probability'], size=10000, replace=True)
If you check the value counts:
pd.Series(arr).value_counts().sort_index()
Out:
0 855
1 800
2 856
3 825
4 847
5 835
6 790
7 847
8 834
9 850
10 806
11 855
dtype: int64
It is pretty close to a uniform distribution. The problem is with the default number of bins (10) of the histogram. Instead, give the histogram one bin per group (12 groups need 13 bin edges):
bins = np.linspace(-0.5, 11.5, num=13)
pd.Series(arr).plot.hist(bins=bins)
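Alternatively, a small sketch that sidesteps binning entirely is to plot the value counts as a bar chart:
pd.Series(arr).value_counts().sort_index().plot.bar()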
I'm trying to find a way to iterate code for a linear regression over many, many columns, all the way up to Z3. Here is a snippet of the dataframe, called df1:
Time A1 A2 A3 B1 B2 B3
1 1.00 6.64 6.82 6.79 6.70 6.95 7.02
2 2.00 6.70 6.86 6.92 NaN NaN NaN
3 3.00 NaN NaN NaN 7.07 7.27 7.40
4 4.00 7.15 7.26 7.26 7.19 NaN NaN
5 5.00 NaN NaN NaN NaN 7.40 7.51
6 5.50 7.44 7.63 7.58 7.54 NaN NaN
7 6.00 7.62 7.86 7.71 NaN NaN NaN
This code returns the slope coefficient of a linear regression for ONE column only and concatenates the value to a numpy array called series; here is what it looks like for extracting the slope of the first column:
from sklearn.linear_model import LinearRegression
series = np.array([])  # blank array to append results to
df2 = df1[~np.isnan(df1['A1'])]  # removes NaN values for the column so the sklearn fit can be applied
df3 = df2[['Time', 'A1']]
npMatrix = np.matrix(df3)
X, Y = npMatrix[:, 0], npMatrix[:, 1]
slope = LinearRegression().fit(X, Y)  # fit the regression
m = slope.coef_[0]  # extract the slope coefficient
series = np.concatenate((series, m), axis=0)
As it stands now, I am reusing this slice of code, replacing "A1" with each new column name all the way up to "Z3", which is extremely inefficient. I know there are many easy ways to do this with some modules, but the drawback of having all these intermediate NaN values in the time series seems to limit me to this method, or something like it.
I tried using a for loop such as:
for col in df1.columns:
and replacing 'A1', for example, with col in the code, but this does not seem to be working.
Is there any way I can do this more efficiently?
Thank you!
One liner (or three)
time = df[['Time']]
pd.DataFrame(np.linalg.pinv(time.T.dot(time)).dot(time.T).dot(df.fillna(0)),
['Slope'], df.columns)
Broken down with a bit of explanation
Using the closed form of OLS, the coefficient is (XᵀX)⁻¹XᵀY.
In this case X is time, where we define time as df[['Time']]. I used the double brackets to preserve the dataframe and its two dimensions. If I'd used single brackets, I'd have gotten a series and one dimension; then the dot products aren't as pretty.
(XᵀX)⁻¹Xᵀ is np.linalg.pinv(time.T.dot(time)).dot(time.T).
Y is df.fillna(0). Yes, we could have done one column at a time, but why, when we could do it all together? You have to deal with the NaNs. How would you imagine dealing with them? Only doing it over the times you had data? That is equivalent to placing zeroes in the NaN spots. So, I did.
Finally, I use pd.DataFrame(stuff, ['Slope'], df.columns) to contain all slopes in one place with the original labels.
Note that I calculated the slope of the regression for Time against itself. Why not? It was there. Its value is 1.0. Great! I probably did it right!
Looping is a decent strategy for a modest number (say, fewer than thousands) of columns. Without seeing your implementation, I can't say what's wrong, but here's my version, which works:
slopes = []
for c in df1.columns:
    if c == "Time":
        continue                        # skip the time column itself
    mask = ~np.isnan(df1[c])            # drop the NaN rows for this column
    x = np.atleast_2d(df1.Time[mask].values).T
    y = np.atleast_2d(df1[c][mask].values).T
    reg = LinearRegression().fit(x, y)
    slopes.append(reg.coef_[0])
I've simplified your code a bit to avoid creating so many temporary DataFrame objects, but it should work fine your way too.
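As an optional follow-up (my sketch, not part of the answer above), you can label the loop's results with their column names so each slope is easy to look up:
slope_series = pd.Series(
    [np.ravel(s)[0] for s in slopes],                  # each coef_ entry is a length-1 array
    index=[c for c in df1.columns if c != "Time"])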
I have a very big pandas DataFrame where I need an ordering within groups based on another column. I know how to iterate over groups, do an operation on each group, and union all those groups back into one dataframe; however, this is slow and I feel like there is a better way to achieve this. Here is the input and what I want out of it. Input:
ID price
1 100.00
1 80.00
1 90.00
2 40.00
2 40.00
2 50.00
Output:
ID price order
1 100.00 3
1 80.00 1
1 90.00 2
2 40.00 1
2 40.00 2 (could be 1, doesn't matter too much)
2 50.00 3
Since this is over about 5 million records with around 250,000 IDs, efficiency is important.
If speed is what you want, then the following should be pretty good, although it is a bit more complicated as it makes use of complex-number sorting in numpy. This is similar to the approach used (by me) when writing the aggregate-sort method in the numpy-groupies package.
# get global sort order, for sorting by ID then price
full_idx = np.argsort(df['ID'] + 1j*df['price'])
# get min of full_idx for each ID (note that there are multiple ways of doing this)
n_for_id = np.bincount(df['ID'])
first_of_idx = np.cumsum(n_for_id)-n_for_id
# subtract first_of_idx from full_idx
rank = np.empty(len(df), dtype=int)
rank[full_idx] = np.arange(len(df)) - first_of_idx[df['ID'][full_idx]]
df['rank'] = rank+1
It takes 2s for 5m rows on my machine, which is about 100x faster than using groupby.rank from pandas (although I didn't actually run the pandas version with 5m rows because it would take too long; I'm not sure how @ayhan managed to do it in only 30s, perhaps a difference in pandas versions?).
If you do use this, then I recommend testing it thoroughly, as I have not.
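A minimal way to test it (my sketch, assuming the new 'rank' column should order prices within each ID) is to check that sorting by the computed rank within each ID yields non-decreasing prices:
chk = df.sort_values(['ID', 'rank'])
assert chk.groupby('ID')['price'].apply(lambda s: s.is_monotonic_increasing).all()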
You can use rank:
df["order"] = df.groupby("ID")["price"].rank(method="first")
df
Out[47]:
ID price order
0 1 100.0 3.0
1 1 80.0 1.0
2 1 90.0 2.0
3 2 40.0 1.0
4 2 40.0 2.0
5 2 50.0 3.0
It takes about 30s on a dataset of 5m rows with 250,000 IDs (i5-3330):
df = pd.DataFrame({"price": np.random.rand(5000000), "ID": np.random.choice(np.arange(250000), size = 5000000)})
%time df["order"] = df.groupby("ID")["price"].rank(method="first")
Wall time: 36.3 s
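Note that rank returns floats; if you want integer order values like in your desired output, you can cast them (a small optional step, assuming there are no NaN prices):
df["order"] = df.groupby("ID")["price"].rank(method="first").astype(int)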