In the result below (first 5 rows), you can see the Freq column and the MMeans column, a rolling mean over a window of 3 calculated with pandas:
Freq MMeans
0 215 NaN
1 453 NaN
2 277 315.000000
3 38 256.000000
4 1 105.333333
I was expecting MMeans to start at index 1, since index 1 is the middle of the first window (rows 0, 1, 2). Is there an option of the rolling method that I am missing?
edit 1
print(pd.DataFrame({
'Freq':eff,
'MMeans': dF['Freq'].rolling(3).mean()}))
edit 2
Sorry @Yuca for not being as clear as I would have liked. Here are the columns I would like pandas to return:
Freq MMeans
0 215 NaN
1 453 315.000000
2 277 256.000000
3 38 105.333333
4 1 29.666667
which are not the results returned with min_periods=2
Use min_periods=1:
df['rol_mean'] = df['Freq'].rolling(3,min_periods=1).mean()
output:
Freq MMeans rol_mean
0 215 NaN 215.000000
1 453 NaN 334.000000
2 277 315.000000 315.000000
3 38 256.000000 256.000000
4 1 105.333333 105.333333
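The min_periods=1 result above still places each mean at the last row of its window, so it does not match the table in edit 2. A minimal sketch, assuming only the five Freq values shown in the question, that uses the center parameter of rolling so each mean is placed at the middle row of its window:

```python
import pandas as pd

# The five Freq values shown in the question
df = pd.DataFrame({'Freq': [215, 453, 277, 38, 1]})

# center=True labels each 3-row window at its middle row instead of its last row
df['MMeans'] = df['Freq'].rolling(3, center=True).mean()
print(df)
```

With only these five rows, index 4 comes out as NaN because its window would need row 5; the 29.666667 in the question presumably depends on data beyond the rows shown.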
I would like to shift a column in a Pandas DataFrame, but I haven't been able to find a method to do it from the documentation without rewriting the whole DF. Does anyone know how to do it?
DataFrame:
## x1 x2
##0 206 214
##1 226 234
##2 245 253
##3 265 272
##4 283 291
Desired output:
## x1 x2
##0 206 nan
##1 226 214
##2 245 234
##3 265 253
##4 283 272
##5 nan 291
In [18]: a
Out[18]:
x1 x2
0 0 5
1 1 6
2 2 7
3 3 8
4 4 9
In [19]: a['x2'] = a.x2.shift(1)
In [20]: a
Out[20]:
x1 x2
0 0 NaN
1 1 5
2 2 6
3 3 7
4 4 8
You need to use df.shift here.
df.shift(i) shifts the entire dataframe by i units down.
So, for i = 1:
Input:
x1 x2
0 206 214
1 226 234
2 245 253
3 265 272
4 283 291
Output:
x1 x2
0 NaN NaN
1 206 214
2 226 234
3 245 253
4 265 272
So, run this script to get the expected output:
import pandas as pd
df = pd.DataFrame({'x1': ['206', '226', '245', '265', '283'],
                   'x2': ['214', '234', '253', '272', '291']})
print(df)
df['x2'] = df['x2'].shift(1)
print(df)
Let's define the dataframe from your example by
>>> df = pd.DataFrame([[206, 214], [226, 234], [245, 253], [265, 272], [283, 291]],
columns=[1, 2])
>>> df
1 2
0 206 214
1 226 234
2 245 253
3 265 272
4 283 291
Then you could manipulate the index of a copy of the second column by
>>> s = df[2].copy()
>>> s.index = s.index + 1
and finally re-combine the single columns
>>> pd.concat([df[1], s], axis=1)
1 2
0 206.0 NaN
1 226.0 214.0
2 245.0 234.0
3 265.0 253.0
4 283.0 272.0
5 NaN 291.0
Perhaps not fast but simple to read. Consider setting variables for the column names and the actual shift required.
Edit: Generally, shifting is possible with df[2].shift(1), as already posted; however, that would cut off the carryover.
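If you don't want to lose the rows you shift past the end of your dataframe, simply append the required number of empty rows first:
offset = 5
# DataFrame.append was removed in pandas 2.0; pd.concat adds the NaN rows instead
pad = pd.DataFrame(np.nan, index=range(offset), columns=DF.columns)
DF = pd.concat([DF, pad], ignore_index=True) # ignore_index yields a sequential index
DF = DF.shift(periods=offset)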
I assume these imports:
import pandas as pd
import numpy as np
First append a new row of NaN values to the end of the DataFrame (df).
s1 = pd.Series(np.nan, index=df.columns) # a Series of NaN, one entry per column of df
df2 = pd.concat([df, s1.to_frame().T], ignore_index=True) # add s1 to the end of df (DataFrame.append was removed in pandas 2.0)
This creates a new DataFrame, df2. There may be a more elegant way, but this works.
Now you can shift it:
df2.x2 = df2.x2.shift(1) # shift what you want
While trying to solve a problem of my own similar to yours, I found this in the pandas docs, which I think answers the question:
DataFrame.shift(periods=1, freq=None, axis=0)
Shift index by desired number of periods with an optional time freq
Notes
If freq is specified then the index values are shifted but the data is not realigned. That is, use freq if you would like to extend the index when shifting and preserve the original data.
I hope this helps with future questions on this matter.
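To illustrate the note quoted above, here is a minimal sketch that assumes a daily DatetimeIndex (the dates are made up; the question's frame has a plain integer index, for which freq does not apply). With freq given, shift moves the index labels rather than the data, so nothing falls off the end:

```python
import pandas as pd

# Hypothetical daily index; the column names and values follow the question
idx = pd.date_range('2024-01-01', periods=5, freq='D')
df = pd.DataFrame({'x1': [206, 226, 245, 265, 283],
                   'x2': [214, 234, 253, 272, 291]}, index=idx)

# With freq given, the index labels move forward one day and no value is dropped
shifted_x2 = df['x2'].shift(1, freq='D')

# Re-aligning on the union of both indexes yields the extra trailing row
result = pd.concat([df['x1'], shifted_x2], axis=1)
print(result)
```

The result has six rows: x2 is NaN on the first date and x1 is NaN on the added sixth date.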
df3
1 108.210 108.231
2 108.231 108.156
3 108.156 108.196
4 108.196 108.074
... ... ...
2495 108.351 108.279
2496 108.279 108.669
2497 108.669 108.687
2498 108.687 108.915
2499 108.915 108.852
df3['yo'] = df3['yo'].shift(-1)
yo price
0 108.231 108.210
1 108.156 108.231
2 108.196 108.156
3 108.074 108.196
4 108.104 108.074
... ... ...
2495 108.669 108.279
2496 108.687 108.669
2497 108.915 108.687
2498 108.852 108.915
2499 NaN 108.852
This is how I do it:
df_ext = pd.DataFrame(index=pd.date_range(df.index[-1], periods=8, inclusive='right')) # inclusive= replaced closed= in pandas 1.4
df2 = pd.concat([df, df_ext], axis=0, sort=True)
df2["forecast"] = df2["some column"].shift(7)
Basically, I generate an empty dataframe with the desired index and then concatenate the two. But I would really like to see this as a standard feature in pandas, so I have proposed an enhancement to pandas.
I'm new to pandas, and I may not be understanding the question, but this solution worked for my problem:
# Shift contents of column 'x2' down 1 row
df['x2'] = df['x2'].shift()
Or, to create a new column with contents of 'x2' shifted down 1 row
# Create new column with contents of 'x2' shifted down 1 row
df['x3'] = df['x2'].shift()
I had a read of the official docs for shift() while trying to figure this out, but it doesn't make much sense to me, and has no examples referencing this specific behavior.
Note that the last row of column 'x2' is effectively pushed off the end of the Dataframe. I expected shift() to have a flag to change this behaviour, but I can't find anything.
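As far as I know shift() has no such flag, so one workaround is to make room before shifting. A minimal sketch, assuming the default 0..n-1 integer index and using the question's values; the names offset and extended are only illustrative:

```python
import pandas as pd

df = pd.DataFrame({'x1': [206, 226, 245, 265, 283],
                   'x2': [214, 234, 253, 272, 291]})

offset = 1  # number of rows to shift by (illustrative)

# Add `offset` empty rows at the end; assumes the default 0..n-1 integer index
extended = df.reindex(range(len(df) + offset))

# Shift only x2; its last value now lands in the added row instead of being lost
extended['x2'] = extended['x2'].shift(offset)
print(extended)
```

Both columns become float because the added row holds NaN.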
I have a dataframe that looks like this
DEP_TIME
0 1851
1 1146
2 2016
3 1350
4 916
...
607341 554
607342 633
607343 657
607344 705
607345 628
I need to get every value in this column DEP_TIME to have the format hh:mm.
All cells are of type string and can remain that type.
Some cells are only missing the colon (rows 0 to 3), others are also missing the leading 0 (rows 4+).
Some cells are empty and should ideally have string value of 0.
I need to do it in an efficient way since I have a few million records. How do I do it?
Use to_datetime with Series.dt.strftime:
df['DEP_TIME'] = (pd.to_datetime(df['DEP_TIME'], format='%H%M', errors='coerce')
.dt.strftime('%H:%M')
.fillna('00:00'))
print (df)
DEP_TIME
0 18:51
1 11:46
2 20:16
3 13:50
4 09:16
607341 05:54
607342 06:33
607343 06:57
607344 07:05
607345 06:28
import re
import numpy as np
import pandas as pd
d = [['1851'],
['1146'],
['2016'],
['916'],
['814'],
[''],
[np.nan]]
df = pd.DataFrame(d, columns=['DEP_TIME'])
df['DEP_TIME'] = df['DEP_TIME'].fillna('0')
df['DEP_TIME'] = df['DEP_TIME'].apply(lambda y: '0' if y=='' else re.sub(r'(\d{1,2})(\d{2})$', lambda x: x[1].zfill(2)+':'+x[2], y))
df
DEP_TIME
0 18:51
1 11:46
2 20:16
3 09:16
4 08:14
5 0
6 0
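For a few million rows, a per-row apply with a regex may be slow. A sketch of a vectorized alternative using only the .str accessor, assuming every non-empty value is a string of 3 or 4 digits (the sample values below are modelled on the question):

```python
import numpy as np
import pandas as pd

# Sample values modelled on the question, including an empty string and a missing value
s = pd.Series(['1851', '1146', '916', '554', '', np.nan], name='DEP_TIME')

# Left-pad to four digits, then insert the colon between hours and minutes
padded = s.str.zfill(4)
formatted = padded.str[:2] + ':' + padded.str[2:]

# Rows that were empty or missing get the string '0', as the question asks
is_blank = s.isna() | (s == '')
formatted = formatted.where(~is_blank, '0')
print(formatted)
```

Unlike the to_datetime approach, this does not validate that the digits form a real time; it only reformats them.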
I have a data file containing different foetal ultrasound measurements. The measurements are collected at different points during pregnancy, like so:
PregnancyID MotherID gestationalAgeInWeeks abdomCirc
0 0 14 150
0 0 21 200
1 1 20 294
1 1 25 315
1 1 30 350
2 2 8 170
2 2 9 180
2 2 18 NaN
Following this answer to a previous question I had asked, I used this code to summarise the ultrasound measurements using the maximum measurement recorded in a single trimester (13 weeks):
(df.assign(tm = (df['gestationalAgeInWeeks']+ 13 - 1 )// 13)
.drop(columns = 'gestationalAgeInWeeks')
.groupby(['MotherID', 'PregnancyID','tm'])
.agg('max')
.unstack()
)
This results in the following output:
tm 1 2 3
MotherID PregnancyID
0 0 NaN 200.0 NaN
1 1 NaN 315.0 350.0
2 2 180.0 NaN NaN
However, MotherID and PregnancyID no longer appear as columns in the output of df.info(). Similarly, when I output the dataframe to a csv file, I only get columns 1,2 and 3. The id columns only appear when running df.head() as can be seen in the dataframe above.
I need to preserve the id columns as I want to use them to merge this dataframe with another one using the ids. Therefore, my question is, how do I preserve these id columns as part of my dataframe after running the code above?
Chain that with reset_index:
(df.assign(tm = (df['gestationalAgeInWeeks']+ 13 - 1 )// 13)
# .drop(columns = 'gestationalAgeInWeeks') # don't need this
.groupby(['MotherID', 'PregnancyID','tm'])['abdomCirc'] # change here
.max() # here
.unstack()
.add_prefix('abdomCirc_') # after unstack, so only the trimester columns get the prefix
.reset_index() # and here
)
Or a more friendly version with pivot_table:
(df.assign(tm = (df['gestationalAgeInWeeks']+ 13 - 1 )// 13)
.pivot_table(index= ['MotherID', 'PregnancyID'], columns='tm',
values= 'abdomCirc', aggfunc='max')
.add_prefix('abdomCirc_') # remove this if you don't want the prefix
.reset_index()
)
Output:
tm MotherID PregnancyID abdomCirc_1 abdomCirc_2 abdomCirc_3
0 0 0 NaN 200.0 NaN
1 1 1 NaN 315.0 350.0
2 2 2 180.0 NaN NaN
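To check that the id columns really are ordinary columns afterwards, here is a self-contained sketch that rebuilds the question's sample rows and runs the pivot_table version; the variable name wide is only illustrative:

```python
import numpy as np
import pandas as pd

# Sample rows from the question
df = pd.DataFrame({
    'PregnancyID': [0, 0, 1, 1, 1, 2, 2, 2],
    'MotherID': [0, 0, 1, 1, 1, 2, 2, 2],
    'gestationalAgeInWeeks': [14, 21, 20, 25, 30, 8, 9, 18],
    'abdomCirc': [150, 200, 294, 315, 350, 170, 180, np.nan],
})

wide = (df.assign(tm=(df['gestationalAgeInWeeks'] + 13 - 1) // 13)
          .pivot_table(index=['MotherID', 'PregnancyID'], columns='tm',
                       values='abdomCirc', aggfunc='max')
          .add_prefix('abdomCirc_')
          .reset_index())

# The leftover 'tm' label is only the name of the column axis; clearing it is optional
wide.columns.name = None

print(wide.columns.tolist())
```

After reset_index, MotherID and PregnancyID show up in df.info(), are written by to_csv, and can be used as merge keys.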
I am trying to get the weighted mean for each column (A-F) of a pandas DataFrame with "Value" as the weight. I can only find solutions for problems with categories, which is not what I need.
The comparable solution for normal means would be
df.mean()
Note that the df has NaN in the columns and in "Value".
A B C D E F Value
0 17656 61496 83 80 117 99 2902804
1 75078 61179 14 3 6 14 3761964
2 21316 60648 86 NaN 107 93 127963
3 6422 48468 28855 26838 27319 27011 131354
4 12378 42973 47153 46062 46634 42689 3303909572
5 54292 35896 59 6 3 18 27666367
6 21272 NaN 126 12 3 5 9618047
7 26434 35787 113 17 4 8 309943
8 10508 34314 34197 7100 10 10 NaN
I can use this for a single column.
df1 = df[['A','Value']]
df1 = df1.dropna()
np.average(df1['A'], weights=df1['Value'])
There must be a simple method. It's driving me nuts I don't see it.
I would appreciate any help.
You could use masked arrays. First, drop the rows where the Value column is NaN.
In [353]: dff = df.dropna(subset=['Value'])
In [354]: dff.apply(lambda x: np.ma.average(
np.ma.MaskedArray(x, mask=np.isnan(x)), weights=dff.Value))
Out[354]:
A 1.282629e+04
B 4.295120e+04
C 4.652817e+04
D 4.545254e+04
E 4.601520e+04
F 4.212276e+04
Value 3.260246e+09
dtype: float64
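The same calculation can also be written with plain pandas arithmetic. This is a sketch on a small made-up frame, not the question's data: each column's weighted sum is divided by the sum of the weights of the rows where that column actually has a value.

```python
import numpy as np
import pandas as pd

# Small made-up frame with gaps in both a data column and the weights
df = pd.DataFrame({
    'A': [1.0, 2.0, 3.0, 4.0],
    'B': [10.0, np.nan, 30.0, 40.0],
    'Value': [1.0, 1.0, 2.0, np.nan],
})

cols = ['A', 'B']
valid = df.dropna(subset=['Value'])   # rows without a weight cannot contribute
w = valid['Value']

# Numerator: sum of value * weight per column (sum skips NaN products)
num = valid[cols].mul(w, axis=0).sum()
# Denominator: sum of the weights, counted only where the column has a value
den = valid[cols].notna().mul(w, axis=0).sum()

weighted_mean = num / den
print(weighted_mean)
```

Here A gives (1 + 2 + 6) / 4 = 2.25 and B gives (10 + 60) / 3, since the row where B is NaN contributes to neither sum.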
I have a pandas dataframe where I'm trying to append two column values if the value of the second column is not NaN. Importantly, after appending the two values I need the value from the second column set to NaN. I have managed to concatenate the values but cannot update the second column to NaN.
This is what I start with for ldc_df[['ad_StreetNo', 'ad_StreetNo2']].head(5):
ad_StreetNo ad_StreetNo2
0 284 NaN
1 51 NaN
2 136 NaN
3 196 198
4 227 NaN
This is what I currently have after appending:
ad_StreetNo ad_StreetNo2
0 284 NaN
1 51 NaN
2 136 NaN
3 196-198 198
4 227 NaN
But here is what I am trying to obtain:
ad_StreetNo ad_StreetNo2
0 284 NaN
1 51 NaN
2 136 NaN
3 196-198 NaN
4 227 NaN
Where the value for ldc_df['ad_StreetNo2'].loc[3] should be changed to NaN.
This is the code I am using currently:
def street_check(street_number_one, street_number_two):
if pd.notnull(street_number_one) and pd.notnull(street_number_two):
return str(street_number_one) + '-' + str(street_number_two)
else:
return street_number_one
ldc_df['ad_StreetNo'] = ldc_df[['ad_StreetNo', 'ad_StreetNo2']].apply(lambda x: street_check(*x),axis=1)
Does anyone have any advice as to how I can obtain my expected output?
Sam
# Convert the Street numbers to a string so that you can append the '-' character.
ldc_df['ad_StreetNo'] = ldc_df['ad_StreetNo'].astype(str)
# Create a mask of those addresses having an additional street number.
mask = ldc_df['ad_StreetNo2'].notnull()
# Use the mask to append the additional street number.
ldc_df.loc[mask, 'ad_StreetNo'] += '-' + ldc_df.loc[mask, 'ad_StreetNo2'].astype(str)
# Set the additional street number to NaN.
ldc_df.loc[mask, 'ad_StreetNo2'] = np.nan
Alternative Solution
ldc_df['ad_StreetNo'] = (
ldc_df['ad_StreetNo'].astype(str)
+ ['' if np.isnan(n) else '-{}'.format(str(int(n)))
for n in ldc_df['ad_StreetNo2']]
)
ldc_df['ad_StreetNo2'] = np.nan
pd.DataFrame.stack folds a dataframe with a single level column index into a series object. Along the way, it drops any null values by default. We can then group by the previous index levels and join with '-'.
df.stack().astype(str).groupby(level=0).apply('-'.join)
0 284
1 51
2 136
3 196-198
4 227
dtype: object
I then use assign to create a copy of df while overwriting the two columns.
df.assign(
ad_StreetNo=df.stack().astype(str).groupby(level=0).apply('-'.join),
ad_StreetNo2=np.nan
)
ad_StreetNo ad_StreetNo2
0 284 NaN
1 51 NaN
2 136 NaN
3 196-198 NaN
4 227 NaN