Python calculations using groups to sum

Good afternoon all,
Bit stuck with the last stage of a calculation.
I have a dataframe which outputs as such:
LaCode Group Frequency
0 718 NaN 2
1 718 3 1
2 719 1 4
3 719 2 10
I'm struggling with the percentage calculation: for each LaCode, ignore rows where Group is NaN (just output NaN or blank for those) and calculate each frequency as a percentage of the total frequency for that LaCode where Group is known.
Should output as such:
Percentage
NaN
100
28.571
71.428
Can anyone help with this? My code doesn't take into account the change in LaCode and I can't work out the correct syntax to incorporate that issue.
Thanks.
Edit: For completeness, I have converted the NaN to an integer that stands out so I can see it (in this instance 0, as that isn't a valid group in the survey).
The code I'm using for the calculation was provided to me and I tweaked it a little. It works OK when there is just one LaCode:
df['Percentage'] = df[df['Value'] != 0]['Count'].apply(lambda x: x/sum(df[df['Value'] != 0]['Count']))
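A minimal sketch of one way this could be done with a per-group transform, using the LaCode/Group/Frequency columns from the example above (just one possible approach, not the poster's original code):

import numpy as np
import pandas as pd

df = pd.DataFrame({'LaCode': [718, 718, 719, 719],
                   'Group': [np.nan, 3, 1, 2],
                   'Frequency': [2, 1, 4, 10]})

# Total frequency per LaCode, counted only over rows where Group is known
known = df['Group'].notna()
totals = df.loc[known].groupby('LaCode')['Frequency'].transform('sum')

# Division aligns on the index, so rows with an unknown Group stay NaN
df['Percentage'] = df['Frequency'].div(totals).mul(100)
print(df)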

Related

Interpolation still leaving NaN's (pandas groupby)

I have a DataFrame with the locations of some customers (so I have a Customer_id column plus Lat and Lon columns) and I am trying to interpolate the NaNs separately for each customer.
For example, if I interpolate with the nearest approach here (I made up the values here):
Customer_id Lat Lon
A 1 1
A NaN NaN
A 2 2
B NaN NaN
B 4 4
I would like the NaN for B to be 4 and not 2.
I have tried this
series.groupby('Customer_id').apply(lambda group: group.interpolate(method = 'nearest', limit_direction = 'both'))
And the number of NaN's goes down from 9003 to 94. But I'm not understanding why it is still leaving some missing values.
I checked and these 94 missing values corresponded to records from customers that were already being interpolated. For example,
Customer_id Lat
0. A 1
1. A NaN
2. A NaN
3. A NaN
4. A NaN
It would interpolate correctly until some value (let's say it interpolates 1, 2 and 3 correctly) and then leaves 4 as NaN.
I have tried to set a limit in interpolate greater than the maximum number of records per client but it is still not working out. I don't know where my mistake is, can somebody help out?
(I don't know if it's relevant to mention, but I fabricated my own NaNs for this, using the approach from "Replace some values in a dataframe with NaN's if the index of the row does not exist in another dataframe". I don't think the problem is there, but since I'm very confused about where the issue actually is, I'll just leave it here.)
When you interpolate with nearest, it can only fill missing values that lie in between known values. (You'll notice this because you get an error when there's only one non-null value, as in your example.) The remaining null values are "edges", which are taken care of with .ffill().bfill() following the nearest logic. This is also the appropriate logic for "interpolating" when there is only one non-missing value.
def my_interp(x):
    # More than one known value: interpolate the interior gaps,
    # then fill the leading/trailing edges.
    if x.notnull().sum() > 1:
        return x.interpolate(method='nearest').ffill().bfill()
    # At most one known value: just propagate it in both directions.
    else:
        return x.ffill().bfill()

df.groupby('Customer_id').transform(my_interp)
# Lat Lon
#0 1.0 1.0
#1 1.0 1.0
#2 2.0 2.0
#3 4.0 4.0
#4 4.0 4.0
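To write the filled coordinates back into the original frame, a plain assignment works (a small usage sketch, assuming the frame and function names used above):

df[['Lat', 'Lon']] = df.groupby('Customer_id')[['Lat', 'Lon']].transform(my_interp)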

Inserting missing numbers in dataframe

I have a program that ideally measures the temperature every second. However, in reality this does not happen. Sometimes it skips a second, or it breaks down for 400 seconds and then decides to start recording again. This leaves gaps in my 2-by-n dataframe, where ideally n = 86400 (the number of seconds in a day). I want to apply some sort of moving/rolling average to it to get a nicer plot, but if I do that to the "raw" data files, the number of data points becomes smaller. This is shown here (watch the x-axis). I know the "nice data" doesn't look nice yet; I'm just playing with some values.
So, I want to implement a data cleaning method, which adds data to the dataframe. I thought about it, but don't know how to implement it. I thought of it as follows:
If the index is not equal to the time, then a value needs to be added at time = index. If the gap is only one value, then the average of the previous number and the next number will do for me. But if it is bigger, say 100 seconds are missing, then a linear function is needed that increases or decreases the value steadily.
So I guess a training set could be like this:
index time temp
0 0 20.10
1 1 20.20
2 2 20.20
3 4 20.10
4 100 22.30
Here, I would like to get a value for index 3, time 3 and the values missing between time = 4 and time = 100. I'm sorry about my formatting skills, I hope it is clear.
How would I go about programming this?
Use merge with a complete time column and then interpolate:
# Create your table
time = np.array([e for e in np.arange(20) if np.random.uniform() > 0.6])
temp = np.random.uniform(20, 25, size=len(time))
temps = pd.DataFrame([time, temp]).T
temps.columns = ['time', 'temperature']
>>> temps
time temperature
0 4.0 21.662352
1 10.0 20.904659
2 15.0 20.345858
3 18.0 24.787389
4 19.0 20.719487
The above is a random table generated with missing time data.
# modify it
filled = pd.Series(np.arange(temps.iloc[0,0], temps.iloc[-1, 0]+1))
filled = filled.to_frame()
filled.columns = ['time'] # Create a fully filled time column
merged = pd.merge(filled, temps, on='time', how='left') # merge it with original, time without temperature will be null
merged.temperature = merged.temperature.interpolate() # fill nulls linearly.
# Alternatively, use reindex, this does the same thing.
final = temps.set_index('time').reindex(np.arange(temps.time.min(),temps.time.max()+1)).reset_index()
final.temperature = final.temperature.interpolate()
>>> merged # or final
time temperature
0 4.0 21.662352
1 5.0 21.536070
2 6.0 21.409788
3 7.0 21.283505
4 8.0 21.157223
5 9.0 21.030941
6 10.0 20.904659
7 11.0 20.792898
8 12.0 20.681138
9 13.0 20.569378
10 14.0 20.457618
11 15.0 20.345858
12 16.0 21.826368
13 17.0 23.306879
14 18.0 24.787389
15 19.0 20.719487
First, you can convert the second values into actual datetime values and set them as the index:
df.index = pd.to_datetime(df['time'], unit='s')
After which you can use pandas' built-in time series operations to resample and fill in the missing values:
df = df.resample('s').interpolate('time')
Optionally, if you still want to do some smoothing you can use the following operation for that:
df.rolling(5, center=True, win_type='hann').mean()
Which will smooth with a 5 element wide Hanning window. Note: any window-based smoothing will cost you value points at the edges.
Now your dataframe will have datetimes (including date) as index. This is required for the resample method. If you want to lose the date, you can simply use:
df.index = df.index.time
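Putting those steps together on the small example table from the question (a runnable sketch; the Hann-window smoothing needs scipy installed):

import pandas as pd

# Example data from the question: seconds with gaps, plus temperatures
df = pd.DataFrame({'time': [0, 1, 2, 4, 100],
                   'temp': [20.10, 20.20, 20.20, 20.10, 22.30]})

df.index = pd.to_datetime(df['time'], unit='s')  # datetime index, required for resample
df = df.resample('s').interpolate('time')        # add every missing second, fill linearly in time
smooth = df['temp'].rolling(5, center=True, win_type='hann').mean()  # optional smoothing
df.index = df.index.time                         # keep only the time-of-day part of the index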

Pandas Concat Different Sized DataFrame to End of Column

Note: Contrived example. Please don't hate on forecasting and I don't need advice on it. This is strictly a Pandas how-to question.
Example - One Solution
I have two different sized DataFrames, one representing sales and one representing a forecast.
sales = pd.DataFrame({'sales':[5,3,5,6,4,4,5,6,7,5]})
forecast = pd.DataFrame({'forecast':[5,5.5,6,5]})
The forecast needs to be aligned with the latest sales, i.e. with the end of the list of sales numbers [5, 6, 7, 5]. Other times I might want it at other locations (please don't ask why, I just need it this way).
This works:
df = pd.concat([sales, forecast], ignore_index=True, axis=1)
df.columns = ['sales', 'forecast'] # Not necessary, making next command pretty
df.forecast = df.forecast.shift(len(sales) - len(forecast))
This gives me the desired outcome.
Question
What I want to know is: Can I concatenate to the end of the sales data without performing the additional shift (the last command)? I'd like to do this in one step instead of two. concat or something similar is fine, but I'd like to skip the shift.
I'm not hung up on having two lines of code. That's okay. I want a solution with the maximum possible performance. My application is sensitive to every millisecond we throw at it on account of huge volumes.
Not sure if this is much faster, but you could do:
sales = pd.DataFrame({'sales':[5,3,5,6,4,4,5,6,7,5]})
forecast = pd.DataFrame({'forecast':[5,5.5,6,5]})
forecast.index = sales.index[-forecast.shape[0]:]
which gives
forecast
6 5.0
7 5.5
8 6.0
9 5.0
and then simply
pd.concat([sales, forecast], axis=1)
yielding the desired outcome:
sales forecast
0 5 NaN
1 3 NaN
2 5 NaN
3 6 NaN
4 4 NaN
5 4 NaN
6 5 5.0
7 6 5.5
8 7 6.0
9 5 5.0
A one-line solution using the same idea, as mentioned by @Dark in the comments, would be:
pd.concat([sales, forecast.set_axis(sales.index[-len(forecast):], inplace=False)], axis=1)
giving the same output.

Python Pandas Running Totals with Resets

I would like to perform the following task. Given two columns (good and bad), I would like to replace rows in the two columns with a running total. Here is an example of the current dataframe along with the desired dataframe.
EDIT: I should have added what my intentions are. I am trying to create an equally binned variable (in this case 20 bins) using a continuous variable as the input. I know the pandas cut and qcut functions are available; however, the returned results can have zeros for the good/bad rate (needed to compute the weight of evidence and information value), and zeros in either the numerator or denominator will not allow the mathematical calculations to work.
d={'AAA':range(0,20),
'good':[3,3,13,20,28,32,59,72,64,52,38,24,17,19,12,5,7,6,2,0],
'bad':[0,0,1,1,1,0,6,8,10,6,6,10,5,8,2,2,1,3,1,1]}
df=pd.DataFrame(data=d)
print(df)
Here is an explanation of what I need to do to the above dataframe.
Roughly speaking, whenever I encounter a zero in either column, I need to keep a running total of the non-zero column until I reach the next row where the column that contained the zero is non-zero again.
Here is the desired output:
dd={'AAA':range(0,16),
'good':[19,20,60,59,72,64,52,38,24,17,19,12,5,7,6,2],
'bad':[1,1,1,6,8,10,6,6,10,5,8,2,2,1,3,2]}
desired_df=pd.DataFrame(data=dd)
print(desired_df)
The basic idea of my solution is to create a grouping key from a cumsum over the non-zero values, so that the zero rows fall into the same group as the next non-zero row. Then you can use groupby + sum to get the desired values.
two_good = df.groupby((df['bad']!=0).cumsum().shift(1).fillna(0))['good'].sum()
two_bad = df.groupby((df['good']!=0).cumsum().shift(1).fillna(0))['bad'].sum()
two_good = two_good.loc[two_good!=0].reset_index(drop=True)
two_bad = two_bad.loc[two_bad!=0].reset_index(drop=True)
new_df = pd.concat([two_bad, two_good], axis=1).dropna()
print(new_df)
bad good
0 1 19.0
1 1 20.0
2 1 28.0
3 6 91.0
4 8 72.0
5 10 64.0
6 6 52.0
7 6 38.0
8 10 24.0
9 5 17.0
10 8 19.0
11 2 12.0
12 2 5.0
13 1 7.0
14 3 6.0
15 1 2.0
This code treats your edge case of trailing zeros differently from your desired output; it simply cuts them off. You'd have to add some extra code to handle that case with different logic.
P.Tillmann, I appreciate your assistance with this. More advanced readers will, I assume, find this code as appalling as I do. I would be more than happy to take any recommendation that makes it more streamlined.
d={'AAA':range(0,20),
'good':[3,3,13,20,28,32,59,72,64,52,38,24,17,19,12,5,7,6,2,0],
'bad':[0,0,1,1,1,0,6,8,10,6,6,10,5,8,2,2,1,3,1,1]}
df=pd.DataFrame(data=d)
print(df)
row_good=0
row_bad=0
row_bad_zero_count=0
row_good_zero_count=0
row_out='NO'
crappy_fix=pd.DataFrame()
for index,row in df.iterrows():
    if row['good']==0 or row['bad']==0:
        row_bad += row['bad']
        row_good += row['good']
        row_bad_zero_count += 1
        row_good_zero_count += 1
        output_ind='1'
        row_out='NO'
    elif index+1 < len(df) and (df.loc[index+1,'good']==0 or df.loc[index+1,'bad']==0):
        row_bad=row['bad']
        row_good=row['good']
        output_ind='2'
        row_out='NO'
    elif (row_bad_zero_count > 1 or row_good_zero_count > 1) and row['good']!=0 and row['bad']!=0:
        row_bad += row['bad']
        row_good += row['good']
        row_bad_zero_count=0
        row_good_zero_count=0
        row_out='YES'
        output_ind='3'
    else:
        row_bad=row['bad']
        row_good=row['good']
        row_bad_zero_count=0
        row_good_zero_count=0
        row_out='YES'
        output_ind='4'
    if ((row['good']==0 or row['bad']==0)
            and (index > 0 and (df.loc[index-1,'good']!=0 or df.loc[index-1,'bad']!=0))
            and row_good != 0 and row_bad != 0):
        row_out='YES'
    if row_out=='YES':
        temp_dict={'AAA':row['AAA'],
                   'good':row_good,
                   'bad':row_bad}
        crappy_fix=crappy_fix.append([temp_dict],ignore_index=True)
    print(str(row['AAA']),'-',
          str(row['good']),'-',
          str(row['bad']),'-',
          str(row_good),'-',
          str(row_bad),'-',
          str(row_good_zero_count),'-',
          str(row_bad_zero_count),'-',
          row_out,'-',
          output_ind)
print(crappy_fix)

scale numerical values for different groups in python

I want to scale the numerical values (similar to R's scale function) based on different groups.
Note: when I talk about scaling, I am referring to this metric:
(x - group_mean) / group_std
Dataset (for demonstration the ideas) for example:
advertiser_id value
10 11
10 22
10 2424
11 34
11 342342
.....
Desirable results:
advertiser_id scaled_value
10 -0.58
10 -0.57
10 1.15
11 -0.707
11 0.707
.....
Referring to this link: implementing R scale function in pandas in Python?, I used the scale function defined there and want to apply it in this fashion:
dt.groupby("advertiser_id").apply(scale)
but get an error:
ValueError: Shape of passed values is (2, 15770), indices imply (2, 23375)
My original dataset has 15770 rows, and I don't think the scale function should be mapping a single value to multiple results in my case.
I would appreciate if you can give me some sample code or some suggestions into how to modify it, thanks!
First, np.std behaves differently from most other languages in that its delta degrees of freedom (ddof) defaults to 0, whereas R's sd uses ddof=1. Therefore:
In [9]:
print df
advertiser_id value
0 10 11
1 10 22
2 10 2424
3 11 34
4 11 342342
In [10]:
print df.groupby('advertiser_id').transform(lambda x: (x-np.mean(x))/np.std(x, ddof=1))
value
0 -0.581303
1 -0.573389
2 1.154691
3 -0.707107
4 0.707107
This matches the R result.
Second, if any of your groups (by advertiser_id) happens to contain just one item, its sample standard deviation is undefined and you will get NaN. Check whether you are getting NaN for this reason; R would return NA in this case as well.
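If that is the suspicion, single-row groups are easy to spot (a small sketch using the column names from the question):

# Groups with only one row have an undefined sample standard deviation,
# so scaling them produces NaN; list such rows to confirm.
sizes = df.groupby('advertiser_id')['value'].transform('size')
print(df[sizes == 1])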
