Pandas Concat Different Sized DataFrame to End of Column - python

Note: Contrived example. Please don't hate on forecasting and I don't need advice on it. This is strictly a Pandas how-to question.
Example - One Solution
I have two different sized DataFrames, one representing sales and one representing a forecast.
sales = pd.DataFrame({'sales':[5,3,5,6,4,4,5,6,7,5]})
forecast = pd.DataFrame({'forecast':[5,5.5,6,5]})
The forecast needs to be with the latest sales, which is at the end of the list of sales numbers [5, 6, 7, 5]. Other times, I might want it at other locations (please don't ask why, I just need it this way).
This works:
df = pd.concat([sales, forecast], ignore_index=True, axis=1)
df.columns = ['sales', 'forecast'] # Not necessary, making next command pretty
df.forecast = df.forecast.shift(len(sales) - len(forecast))
This gives me the desired outcome:
Question
What I want to know is: Can I concatenate to the end of the sales data without performing the additional shift (the last command)? I'd like to do this in one step instead of two. concat or something similar is fine, but I'd like to skip the shift.
I'm not hung up on having two lines of code. That's okay. I want a solution with the maximum possible performance. My application is sensitive to every millisecond we throw at it on account of huge volumes.

Not sure if that is much faster but you could do
sales = pd.DataFrame({'sales':[5,3,5,6,4,4,5,6,7,5]})
forecast = pd.DataFrame({'forecast':[5,5.5,6,5]})
forecast.index = sales.index[-forecast.shape[0]:]
which gives
forecast
6 5.0
7 5.5
8 6.0
9 5.0
and then simply
pd.concat([sales, forecast], axis=1)
yielding the desired outcome:
sales forecast
0 5 NaN
1 3 NaN
2 5 NaN
3 6 NaN
4 4 NaN
5 4 NaN
6 5 5.0
7 6 5.5
8 7 6.0
9 5 5.0
A one-line solution using the same idea, as mentioned by #Dark in the comments, would be:
pd.concat([sales, forecast.set_axis(sales.index[-len(forecast):], inplace=False)], axis=1)
giving the same output.

Related

Calculating date difference for pandas dataframe rows with changing baseline dates

Hi I am using the date difference as a machine learning feature, analyzing how the weight of a patient changed over time.
I successfully tested a method to do that as shown below, but the question is how to extend this to a dataframe where I have to see date difference for each patient as shown in the figure above. The encircled column is what im aiming to get. So basically the baseline date from which the date difference is calculated changes every time for a new patient name so that we can track the weight progress over time for that patient! Thanks
s='17/6/2016'
s1='22/6/16'
a=pd.to_datetime(s,infer_datetime_format=True)
b=pd.to_datetime(s1,infer_datetime_format=True)
e=b.date()-a.date()
str(e)
str(e)[0:2]
I think it would be something like this, (but im not sure how to do this exactly):
def f(row):
# some logic here
return val
df['Datediff'] = df.apply(f, axis=1)
You can use transform with first
df['Datediff'] = df['Date'] - df1.groupby('Name')['Date'].transform('first')
Another solution can be using cumsum
df['Datediff'] = df.groupby('Name')['Date'].apply(lambda x:x.diff().cumsum().fillna(0))
df["Datediff"] = df.groupby("Name")["Date"].diff().fillna(0)/ np.timedelta64(1, 'D')
df["Datediff"]
0 0.0
1 12.0
2 14.0
3 66.0
4 23.0
5 0.0
6 10.0
7 15.0
8 14.0
9 0.0
10 14.0
Name: Datediff, dtype: float64

Inserting missing numbers in dataframe

I have a program that ideally measures the temperature every second. However, in reality this does not happen. Sometimes, it skips a second or it breaks down for 400 seconds and then decides to start recording again. This leaves gaps in my 2-by-n dataframe, where ideally n = 86400 (the amount of seconds in a day). I want to apply some sort of moving/rolling average to it to get a nicer plot, but if I do that to the "raw" datafiles, the amount of data points becomes less. This is shown here, watch the x-axis. I know the "nice data" doesn't look nice yet; I'm just playing with some values.
So, I want to implement a data cleaning method, which adds data to the dataframe. I thought about it, but don't know how to implement it. I thought of it as follows:
If the index is not equal to the time, then we need to add a number, at time = index. If this gap is only 1 value, then the average of the previous number and the next number will do for me. But if it is bigger, say 100 seconds are missing, then a linear function needs to be made, which will increase or decrease the value steadily.
So I guess a training set could be like this:
index time temp
0 0 20.10
1 1 20.20
2 2 20.20
3 4 20.10
4 100 22.30
Here, I would like to get a value for index 3, time 3 and the values missing between time = 4 and time = 100. I'm sorry about my formatting skills, I hope it is clear.
How would I go about programming this?
Use merge with complete time column and then interpolate:
# Create your table
time = np.array([e for e in np.arange(20) if np.random.uniform() > 0.6])
temp = np.random.uniform(20, 25, size=len(time))
temps = pd.DataFrame([time, temp]).T
temps.columns = ['time', 'temperature']
>>> temps
time temperature
0 4.0 21.662352
1 10.0 20.904659
2 15.0 20.345858
3 18.0 24.787389
4 19.0 20.719487
The above is a random table generated with missing time data.
# modify it
filled = pd.Series(np.arange(temps.iloc[0,0], temps.iloc[-1, 0]+1))
filled = filled.to_frame()
filled.columns = ['time'] # Create a fully filled time column
merged = pd.merge(filled, temps, on='time', how='left') # merge it with original, time without temperature will be null
merged.temperature = merged.temperature.interpolate() # fill nulls linearly.
# Alternatively, use reindex, this does the same thing.
final = temps.set_index('time').reindex(np.arange(temps.time.min(),temps.time.max()+1)).reset_index()
final.temperature = final.temperature.interpolate()
>>> merged # or final
time temperature
0 4.0 21.662352
1 5.0 21.536070
2 6.0 21.409788
3 7.0 21.283505
4 8.0 21.157223
5 9.0 21.030941
6 10.0 20.904659
7 11.0 20.792898
8 12.0 20.681138
9 13.0 20.569378
10 14.0 20.457618
11 15.0 20.345858
12 16.0 21.826368
13 17.0 23.306879
14 18.0 24.787389
15 19.0 20.719487
First you can set the second values to actual time values as such:
df.index = pd.to_datetime(df['time'], unit='s')
After which you can use pandas' built-in time series operations to resample and fill in the missing values:
df = df.resample('s').interpolate('time')
Optionally, if you still want to do some smoothing you can use the following operation for that:
df.rolling(5, center=True, win_type='hann').mean()
Which will smooth with a 5 element wide Hanning window. Note: any window-based smoothing will cost you value points at the edges.
Now your dataframe will have datetimes (including date) as index. This is required for the resample method. If you want to lose the date, you can simply use:
df.index = df.index.time

Create multiple dataframe columns containing calculated values from an existing column

I have a dataframe, sega_df:
Month 2016-11-01 2016-12-01
Character
Sonic 12.0 3.0
Shadow 5.0 23.0
I would like to create multiple new columns, by applying a formula for each already existing column within my dataframe (to put it shortly, pretty much double the number of columns). That formula is (100 - [5*eachcell])*0.2.
For example, for November for Sonic, (100-[5*12.0])*0.2 = 8.0, and December for Sonic, (100-[5*3.0])*0.2 = 17.0 My ideal output is:
Month 2016-11-01 2016-12-01 Weighted_2016-11-01 Weighted_2016-12-01
Character
Sonic 12.0 3.0 8.0 17.0
Shadow 5.0 23.0 15.0 -3.0
I know how to create a for loop to create one column. This is for if only one month was in consideration:
for w in range(1,len(sega_df.index)):
sega_df['Weighted'] = (100 - 5*sega_df)*0.2
sega_df[sega_df < 0] = 0
I haven't gotten the skills or experience yet to create multiple columns. I've looked for other questions that may answer what exactly I am doing but haven't gotten anything to work yet. Thanks in advance.
One vectorised approach is to drown to numpy:
A = sega_df.values
A = (100 - 5*A) * 0.2
res = pd.DataFrame(A, index=sega_df.index, columns=('Weighted_'+sega_df.columns))
Then join the result to your original dataframe:
sega_df = sega_df.join(res)

Python Pandas Running Totals with Resets

I would like to perform the following task. Given a 2 columns (good and bad) I would like to replace any rows for the two columns with a running total. Here is an example of the current dataframe along with the desired data frame.
EDIT: I should have added what my intentions are. I am trying to create equally binned (in this case 20) variable using a continuous variable as the input. I know the pandas cut and qcut functions are available, however the returned results will have zeros for the good/bad rate (needed to compute the weight of evidence and information value). Zeros in either the numerator or denominator will not allow the mathematical calculations to work.
d={'AAA':range(0,20),
'good':[3,3,13,20,28,32,59,72,64,52,38,24,17,19,12,5,7,6,2,0],
'bad':[0,0,1,1,1,0,6,8,10,6,6,10,5,8,2,2,1,3,1,1]}
df=pd.DataFrame(data=d)
print(df)
Here is an explanation of what I need to do to the above dataframe.
Roughly speaking, anytime I encounter a zero for either column, I need to use a running total for the column which is not zero to the next row which has a non-zero value for the column that contained zeros.
Here is the desired output:
dd={'AAA':range(0,16),
'good':[19,20,60,59,72,64,52,38,24,17,19,12,5,7,6,2],
'bad':[1,1,1,6,8,10,6,6,10,5,8,2,2,1,3,2]}
desired_df=pd.DataFrame(data=dd)
print(desired_df)
The basic idea of my solution is to create a column from a cumsum over non-zero values in order to get the zero values with the next non zero value into one group. Then you can use groupby + sum to get your the desired values.
two_good = df.groupby((df['bad']!=0).cumsum().shift(1).fillna(0))['good'].sum()
two_bad = df.groupby((df['good']!=0).cumsum().shift(1).fillna(0))['bad'].sum()
two_good = two_good.loc[two_good!=0].reset_index(drop=True)
two_bad = two_bad.loc[two_bad!=0].reset_index(drop=True)
new_df = pd.concat([two_bad, two_good], axis=1).dropna()
print(new_df)
bad good
0 1 19.0
1 1 20.0
2 1 28.0
3 6 91.0
4 8 72.0
5 10 64.0
6 6 52.0
7 6 38.0
8 10 24.0
9 5 17.0
10 8 19.0
11 2 12.0
12 2 5.0
13 1 7.0
14 3 6.0
15 1 2.0
This code treats your etch case of trailing zeros different from your desired output, it simple cuts it off. You'd have to add some extra code to catch that one with a different logic.
P.Tillmann. I appreciate your assistance with this. For the more advanced readers I would assume you to find this code appalling, as I do. I would be more than happy to take any recommendation which makes this more streamlined.
d={'AAA':range(0,20),
'good':[3,3,13,20,28,32,59,72,64,52,38,24,17,19,12,5,7,6,2,0],
'bad':[0,0,1,1,1,0,6,8,10,6,6,10,5,8,2,2,1,3,1,1]}
df=pd.DataFrame(data=d)
print(df)
row_good=0
row_bad=0
row_bad_zero_count=0
row_good_zero_count=0
row_out='NO'
crappy_fix=pd.DataFrame()
for index,row in df.iterrows():
if row['good']==0 or row['bad']==0:
row_bad += row['bad']
row_good += row['good']
row_bad_zero_count += 1
row_good_zero_count += 1
output_ind='1'
row_out='NO'
elif index+1 < len(df) and (df.loc[index+1,'good']==0 or df.loc[index+1,'bad']==0):
row_bad=row['bad']
row_good=row['good']
output_ind='2'
row_out='NO'
elif (row_bad_zero_count > 1 or row_good_zero_count > 1) and row['good']!=0 and row['bad']!=0:
row_bad += row['bad']
row_good += row['good']
row_bad_zero_count=0
row_good_zero_count=0
row_out='YES'
output_ind='3'
else:
row_bad=row['bad']
row_good=row['good']
row_bad_zero_count=0
row_good_zero_count=0
row_out='YES'
output_ind='4'
if ((row['good']==0 or row['bad']==0)
and (index > 0 and (df.loc[index-1,'good']!=0 or df.loc[index-1,'bad']!=0))
and row_good != 0 and row_bad != 0):
row_out='YES'
if row_out=='YES':
temp_dict={'AAA':row['AAA'],
'good':row_good,
'bad':row_bad}
crappy_fix=crappy_fix.append([temp_dict],ignore_index=True)
print(str(row['AAA']),'-',
str(row['good']),'-',
str(row['bad']),'-',
str(row_good),'-',
str(row_bad),'-',
str(row_good_zero_count),'-',
str(row_bad_zero_count),'-',
row_out,'-',
output_ind)
print(crappy_fix)

Finding the percent change of values in a Series

I have a DataFrame with 2 columns. I need to know at what point the number of questions has increased.
In [19]: status
Out[19]:
seconds questions
0 751479 9005591
1 751539 9207129
2 751599 9208994
3 751659 9210429
4 751719 9211944
5 751779 9213287
6 751839 9214916
7 751899 9215924
8 751959 9216676
9 752019 9217533
I need the change in percent of 'questions' column and then sort on it. This does not work:
status.pct_change('questions').sort('questions').head()
Any suggestions?
Try this way instead:
>>> status['change'] = status.questions.pct_change()
>>> status.sort_values('change', ascending=False)
questions seconds change
0 9005591 751479 NaN
1 9207129 751539 0.022379
2 9208994 751599 0.000203
6 9214916 751839 0.000177
4 9211944 751719 0.000164
3 9210429 751659 0.000156
5 9213287 751779 0.000146
7 9215924 751899 0.000109
9 9217533 752019 0.000093
8 9216676 751959 0.000082
pct_change can be performed on Series as well as DataFrames and accepts an integer argument for the number of periods you want to calculate the change over (the default is 1).
I've also assumed that you want to sort on the 'change' column with the greatest percentage changes showing first...

Categories