I have a pandas DataFrame with hourly values for January 2015, except some hours are missing both the index and the values. Ideally the DataFrame, with columns named "dates" and "values", should have 744 rows. However, it is randomly missing 10 hours and hence has only 734 rows. I want to interpolate the missing hours in the month to create the desired DataFrame with 744 "dates" and 744 "values".
Edit:
I am new to Python, so I am struggling with implementing this idea:
Create a DataFrame with the first column holding all hours in Jan 2015
Create a second column of the same size filled with NaNs
Fill the second column with the available values, so the missing hours keep their NaNs
Use the pandas interpolate function
Edit2:
I was looking for hints or code snippets. Based on the suggestion below I was able to write the following code, but it fails to fill in the values at the start of the month, i.e. for hours 1 through 5 on Jan 1, which stay at zero.
import pandas as pd
st_dt = '2015-01-01'
en_dt = '2015-01-31'
DateTimeHour = pd.date_range(pd.Timestamp(st_dt).date(),
                             pd.Timestamp(en_dt).date(), freq='H')
Pwr.index = pd.DatetimeIndex(Pwr.index)  # Pwr is the original dataframe
Pwr = Pwr.reindex(DateTimeHour, fill_value=0)
Pwr2 = pd.Series(Pwr.values)
Pwr2.interpolate(limit_direction='both')
Use df.asfreq to expand the DataFrame so that it has an hourly frequency; NaN is inserted for missing values:
df = df.asfreq('H')
Then use df.interpolate to replace the NaNs with (linearly) interpolated values based on the DatetimeIndex and the nearest non-NaN values:
df = df.interpolate(method='time')
For example,
import numpy as np
import pandas as pd
N, M = 744, 734
index = pd.date_range('2015-01-01', periods=N, freq='H')
idx = np.random.choice(np.arange(N), M, replace=False)
idx.sort()
index = index[idx]
# This creates a toy DataFrame with 734 non-null rows:
df = pd.DataFrame({'values': np.random.randint(10, size=(M,))}, index=index)
# This expands the DataFrame to 744 rows (10 null rows):
df = df.asfreq('H')
# This makes `df` have 744 non-null rows:
df = df.interpolate(method='time')
What you want requires a combination of this technique:
Add missing dates to pandas dataframe
And the pandas function pandas.Series.interpolate. From what you've said, the option 'linear' is what you want.
EDIT:
Interpolate will not work in the case where you have data points missing at the very start of the time series. One idea is to use pandas.Series.fillna with 'backfill' after the interpolation. Also, do not set fill_value to 0 when you call reindex.
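Putting that together, a minimal sketch (assuming Pwr is the original Series with a datetime-like index, as in the question):
import pandas as pd
DateTimeHour = pd.date_range('2015-01-01', '2015-01-31 23:00', freq='H')  # 744 hours
Pwr = Pwr.reindex(DateTimeHour)       # no fill_value: missing hours become NaN
Pwr = Pwr.interpolate(method='time')  # fill interior gaps
Pwr = Pwr.bfill()                     # backfill any NaNs left at the very start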
A general interpolation is the following:
If the key exists:
Return the value
else:
Find the first key before and after the required key, compute the distance (which you can define using a desired metric) to both keys, and take a weighted average of the values, weighted by the distances of the keys (closer means higher weight).
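As a toy illustration of that idea (a hypothetical helper over a plain dict with numeric keys, not a pandas API):
def interpolate_at(data, key):
    # Exact hit: return the stored value.
    if key in data:
        return data[key]
    # Nearest keys on either side of the requested key.
    before = max(k for k in data if k < key)
    after = min(k for k in data if k > key)
    d_before, d_after = key - before, after - key
    # Inverse-distance weights: the closer key gets the higher weight.
    total = d_before + d_after
    return (d_after / total) * data[before] + (d_before / total) * data[after]
interpolate_at({0: 0.0, 10: 5.0}, 4)  # 2.0, the linear interpolation at 4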
I am trying to resample a time series to get annual maximum values for different time steps (e.g., 3h, 6h, etc.). The original series is at an hourly resolution. I first converted the date format to pandas datetime format, used that column as an index, and resampled it. The final output should be the years and the corresponding maximum values at the desired time step. However, I am getting a list of NaN. I am not sure how I can incorporate a range in my code. Here is my code so far for a 3H time step:
import pandas as pd
df = pd.read_csv('data.txt', delimiter = ";")
df = pd.DataFrame(df[['yyyymmddhh', 'rainfall']])
datin["yyyymmddhh"] = pd.to_datetime(datin["yyyymmddhh"], format="%Y%M%d%H")
datin.set_index("yyyymmddhh").resample("3H").sum().resample("Y").max()
stn_n;yyyymmddhh;rainfall
xyz;1980123123;-
xyz;1981010100;0.0
xyz;1981010101;0.0
xyz;1981010102;0.0
xyz;1981010103;0.0
xyz;1981010104;0.0
xyz;1981010105;0.0
xyz;1981010106;0.0
xyz;1981010107;0.0
xyz;1981010108;0.0
xyz;1981010109;0.4
xyz;1981010110;0.6
xyz;1981010111;0.1
xyz;1981010112;0.1
xyz;1981010113;0.0
xyz;1981010114;0.1
xyz;1981010115;0.6
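Based on the description above, a hedged sketch of the same pipeline (assuming the sample data sits in data.txt and "-" marks a missing rainfall value):
import pandas as pd
df = pd.read_csv('data.txt', delimiter=';', na_values='-')
df['yyyymmddhh'] = pd.to_datetime(df['yyyymmddhh'], format='%Y%m%d%H')
annual_max = (df.set_index('yyyymmddhh')['rainfall']
                .resample('3H').sum()   # 3-hourly totals
                .resample('Y').max())   # annual maximum of those totals
print(annual_max)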
I am trying to add the values of 3 columns to come up with a new total column. Code is below:
df3[["Bronze","Gold","Silver"]] =
df3[["Bronze","Gold","Silver"]].astype("int")
df3["Total Medal"]= df3.iloc[:, -3:0].sum(axis=1)
df3[["Total Medal"]].astype("int")
I know that the Bronze, Gold, and Silver columns have 1 and 0 values and that they are the last 3 columns in the dataframe. Their original types were "uint8", so I changed them to "int".
After these lines, the Total Medal column comes out as type "float" (instead of int) and yields only the value 0. How can I properly add these columns?
To add the values of 3 columns into a new column, simply do
df['Total Medal'] = df.sum(axis=1)
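Note that the all-zero result in the question most likely comes from the slice df3.iloc[:, -3:0], which is empty because the slice stops at column 0; selecting the last three columns needs an open end:
df3["Total Medal"] = df3.iloc[:, -3:].sum(axis=1)  # -3: means "the last three columns"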
This can e.g. be done using assign:
import numpy as np
import pandas as pd
#create data frame
data = {"gold":np.random.choice([0,1],size=10),"silver":np.random.choice([0,1],size=10), "bronze":np.random.choice([0,1],size=10)}
df = pd.DataFrame(data)
#calculate new column and add to dataframe
df = df.assign(mysum=df.gold+df.silver+df.bronze)
Edit: df["mysum"] = df.sum(axis=1) only works if your dataframe only has the three relevant columns, because it sums over all columns (and not only over the three you want).
I have a DataFrame with two columns and I want to set each column's median value to zero. How can I do this without changing the standard deviation? Or better, is this the right way to do it?
suppose I have:
df = pd.DataFrame(np.random.randn(100, 2))
#first column
df0 = df[0]
#set median to zero
test = abs(df0 - df.median())
But when I then check
test.median()
it prints not zero but a different value. Do I have a mistake in my thinking?
IIUC, you want
test = df0 - df[0].median()
>>> test.median()
0.0
If you just get the absolute values of the series, you'll change the median value because of course, it depends on the ordering of elements.
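A tiny demonstration of that, on a toy series:
import pandas as pd
s = pd.Series([-2.0, -1.0, 0.0, 1.0, 5.0])
(s - s.median()).median()        # 0.0
(s - s.median()).abs().median()  # 1.0 -- taking abs() moves the median off zero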
There are mainly 2 things you need to do here:
Iterate over the columns
For each column, calculate its median and subtract it from all values of that column
And don't use the absolute value, as it will ruin the median = 0 you want.
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(100, 2))
for col in df.columns:
    df[col] = df[col] - np.median(df[col])
Testing:
for col in df.columns:
    print(np.median(df[col]))
0.0
0.0
I would like to sum columns whose names share the same prefix.
Example:
import pandas as pd
import numpy as np
df = pd.DataFrame({'product': ['TV','COMPUTER','SMARTPHONE'],
                   'price_2012': np.random.randint(100,300,3),
                   'price_2013': np.random.randint(100,300,3),
                   'price_2014': np.random.randint(100,300,3),
                   'price_2015': np.random.randint(100,300,3),
                   'price_2016': np.random.randint(100,300,3)})
For this example I want to create a new column price_2012_2016 equal to the sum of the prices from 2012 to 2016, without listing all the columns.
PS: In SAS I do it like this: price_2012_2016 = sum(of prix_2012-prix_2016);
Cordially,
Laurent A.
You could simply do the following:
df['price_2012_2016'] = df[[col for col in df.columns if col.startswith('price_')]].sum(axis=1)
This takes the sum of only the columns that start with "price_" within the df DataFrame and saves the result as the price_2012_2016 column. The axis=1 parameter makes the sum run along the column axis, i.e. per row, rather than down the rows.
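An equivalent selection can be written with df.filter, which picks columns by substring or regex:
df['price_2012_2016'] = df.filter(regex='^price_').sum(axis=1)  # same columns, same sum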
I'm stuck trying to figure out how to sum one of the columns in my dataframe by day/month/year etc. I don't want to perform the aggregation on the other columns; since the dataframe will become shorter, I would like to keep the minimum value from the other columns of the dataframe.
This is what I have, but it does not produce what I want. It only sums the first and last part and then gives me NaN values for the rest.
df = pd.DataFrame(zip(points, data, junk), columns=['Dates', 'Data', 'Junk'])
df.set_index('Dates', inplace=True)
_add = {'Data': np.sum, 'Junk': np.min}
newdf = df.resample('D', how=_add)
Thanks
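For reference, in recent pandas versions the how= keyword has been removed; the same per-column aggregation can be sketched with .agg (assuming df from above):
newdf = df.resample('D').agg({'Data': 'sum', 'Junk': 'min'})  # sum Data, keep daily min of Junk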