I have a dataset with a Time (seconds) column, and I want to create another column with the full date and time (yy/mm/dd h:m:s). I wrote the function below, but it takes too long for larger datasets. Any advice on how I can optimize it?
The Time column looks like this:
0 0.0
1 0.2
2 0.4
3 0.6
4 0.8
5 1.0
6 1.2
7 1.4
8 1.6
9 1.8
10 2.0
11 2.2
12 2.4
13 2.6
14 2.8
15 3.0
16 3.2
17 3.4
18 3.6
19 3.8
20 4.0
def calcul_datetime(ds,h,m,s,Y,M,D):
for index,row in ds.iterrows():
if row.Time.is_integer()==True and row.Time!=0:
if s==60:
if m==60:
if h==24:
ds.iloc[index,9]="{Y}-{M}-{D} {heure}:{min}:{sec}".format(Y=Y, M=M, D=D ,heure=h,min=m,sec=s)
return ds
The datetime module has pretty much everything you need to do computations on times and dates.
Specifically, the datetime.datetime.fromtimestamp method converts seconds since the epoch to a datetime.datetime object. This object then has a convenient method for string representation: datetime.datetime.strftime.
Here's an example:
import datetime
some_arbitrary_time_value = 10000000
as_a_datetime = datetime.datetime.fromtimestamp(some_arbitrary_time_value)
as_a_string = as_a_datetime.strftime("%Y-%m-%d %H:%M:%S")
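For a whole column, the same conversion can probably be done without a Python loop. Here is a minimal sketch, assuming the Time column holds seconds elapsed since a known start time. The `start` value and the `Datetime` / `Datetime_str` column names are made up for illustration and are not from the question.

```python
import pandas as pd

# Hypothetical sample: elapsed seconds sampled every 0.2 s, as in the question.
ds = pd.DataFrame({"Time": [0.0, 0.2, 0.4, 0.6, 0.8, 1.0, 61.2]})

# Assumed start of the recording; replace it with the real start time.
start = pd.Timestamp("2020-01-01 23:59:00")

# Vectorized: build one timedelta per row and add it to the start timestamp.
# Rollover of seconds, minutes, hours and days is handled by pandas.
ds["Datetime"] = start + pd.to_timedelta(ds["Time"], unit="s")

# Optional string column in "year-month-day hour:minute:second" form.
ds["Datetime_str"] = ds["Datetime"].dt.strftime("%Y-%m-%d %H:%M:%S")

print(ds)
```

Because the addition runs on the whole column at once, it should scale much better than iterrows on large datasets.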
I have a data frame whose columns contain values for different countries. I would like a function that shifts the rows of each column independently, leaving the dates untouched. For example, I have a list of profile shifters, one per country, which should be used to shift the rows.
If the profile shifter for a country is -3, that country's column is shifted 3 times downwards, while the last 3 values become the first 3 values in the dataframe. If the profile shifter is +3, the third value of the column is shifted upwards, while the first 2 values become the last values in that column.
After the rows have been shifted, instead of having the default NaN values appear in the empty cells, I want the preceding or succeeding values to fill them. The function should also return a data frame. The sample dataset, profile shifters and expected results are shown below.
Sample Dataset:
Datetime ARG AUS BRA
1/1/2050 0.00 0.1 2.1 3.1
1/1/2050 1.00 0.2 2.2 3.2
1/1/2050 2.00 0.3 2.3 3.3
1/1/2050 3.00 0.4 2.4 3.4
1/1/2050 4.00 0.5 2.5 3.5
1/1/2050 5.00 0.6 2.6 3.6
Country Profile Shifters:
UTC -3 -2 4
Desired Output:
Datetime ARG AUS BRA
1/1/2050 0.00 0.3 2.4 3.4
1/1/2050 1.00 0.4 2.5 3.5
1/1/2050 2.00 0.5 2.1 3.1
1/1/2050 3.00 0.1 2.2 3.2
1/1/2050 4.00 0.2 2.3 3.3
This is what I have been trying for days now but it's not working
cols = df1.columns
for i in cols:
    if i == 'ARG':
        x = df1.iat[0:3,0]
        df1['ARG'] = df1.ARG.shift(periods=-3)
        df1['ARG'].replace(to_replace=np.nan, x)
    elif i == 'AUS':
        df1['AUS'] = df1.AUS.shift(periods=2)
    elif i == 'BRA':
        df1['BRA'] = df1.BRA.shift(periods=1)
This works but is far from being 'good pandas'. I hope that someone will come along and give a nicer, cleaner 'more pandas' answer.
Imports used:
import pandas as pd
import datetime as datetime
Offset data setup:
offsets = pd.DataFrame({"Country" : ["ARG", "AUS", "BRA"], "UTC Offset" : [-3, -2, 4]})
Country UTC Offset
0 ARG -3
1 AUS -2
2 BRA 4
Note that the timezone offset data I've used here is in a slightly different structure from the example data (country codes by rows, rather than columns). Also worth pointing out that Australia and Brazil have several time zones, so there is no one single UTC offset which applies to those whole countries (only one in Argentina though).
Sample data setup:
rows = []
for i in range(6):
    dt = datetime.datetime(2050, 1, 1, i)
    rows.append({'Datetime' : dt,
                 'ARG' : i / 10,
                 'AUS' : (i + 10) / 10,
                 'BRA' : (i + 20) / 10})
sampleDf = pd.DataFrame(rows)  # DataFrame.append was removed in pandas 2.0
Datetime ARG AUS BRA
0 2050-01-01 00:00:00 0.0 1.0 2.0
1 2050-01-01 01:00:00 0.1 1.1 2.1
2 2050-01-01 02:00:00 0.2 1.2 2.2
3 2050-01-01 03:00:00 0.3 1.3 2.3
4 2050-01-01 04:00:00 0.4 1.4 2.4
5 2050-01-01 05:00:00 0.5 1.5 2.5
Code to shift cells:
for idx, offsetData in offsets.iterrows(): # See note 1
    countryCode = offsetData["Country"]
    utcOffset = offsetData["UTC Offset"]
    dfRowCount = sampleDf.shape[0]
    wrappedOffset = (dfRowCount + utcOffset) if utcOffset < 0 else \
                    (-dfRowCount + utcOffset) # See note 2
    countryData = sampleDf[countryCode]
    sampleDf[countryCode] = pd.concat([countryData.shift(utcOffset).dropna(),
                                       countryData.shift(wrappedOffset).dropna()]).sort_index() # See note 3
Datetime ARG AUS BRA
0 2050-01-01 00:00:00 0.3 1.2 2.2
1 2050-01-01 01:00:00 0.4 1.3 2.3
2 2050-01-01 02:00:00 0.5 1.4 2.4
3 2050-01-01 03:00:00 0.0 1.5 2.5
4 2050-01-01 04:00:00 0.1 1.0 2.0
5 2050-01-01 05:00:00 0.2 1.1 2.1
Note 1: Iterating over rows in pandas like this indicates (to me) that you've run out of pandas skill and are going against the design of pandas. What I have here works, but it won't benefit from many of the efficiencies of using pandas, and would not be appropriate for a large dataset. Using itertuples rather than iterrows is supposed to be quicker, but I think neither is great, so I went with what seemed most readable for this case.
Note 2: This solution does two shifts: one of the data shifted by the timezone offset, then a second shift of everything else to fill in what would otherwise be NaN holes left by the first shift. This line calculates the size of that second shift.
Note 3: Finally, the results of the two shifts are concatenated together (after dropping any NaN values from both of them) and assigned back to the original (unshifted) column. sort_index puts them back in order based on the index, rather than having the two shifted parts one after another.
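A shift that wraps around is a circular roll, so numpy.roll might be a shorter way to get the same result. Below is a minimal sketch using the same sample values. The `roll_columns` helper and the `shifts` dict are hypothetical names, and the sign convention is assumed to match Series.shift (a negative value moves values towards the top).

```python
import numpy as np
import pandas as pd

# Same sample values as in the setup above, without the Datetime column.
sample = pd.DataFrame({
    "ARG": [0.0, 0.1, 0.2, 0.3, 0.4, 0.5],
    "AUS": [1.0, 1.1, 1.2, 1.3, 1.4, 1.5],
    "BRA": [2.0, 2.1, 2.2, 2.3, 2.4, 2.5],
})

# Hypothetical mapping of column name to shift.
shifts = {"ARG": -3, "AUS": -2, "BRA": 4}

def roll_columns(df, shifts):
    """Return a copy of df with each listed column rolled circularly."""
    out = df.copy()
    for col, k in shifts.items():
        # np.roll wraps values around the ends, so no NaN holes appear.
        out[col] = np.roll(df[col].to_numpy(), k)
    return out

rolled = roll_columns(sample, shifts)
print(rolled)
```

This still loops, but only once per column rather than once per row, so the per-row cost is handled inside numpy.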
I have many dataframes (timeseries) that are of different lengths ranging between 28 and 179. I need to make them all of length 104. (upsampling those below 104 and downsampling those above 104)
For upsampling, the linear method is sufficient for my needs. For downsampling, the mean of the values should be good.
To get all files to be the same length, I thought that I need to make all dataframes start and end at the same dates.
I was able to downsample all of them to the size of the smallest dataframe (i.e. 28) using the lines of code below:
df.set_index(pd.date_range(start='1/1/1991' ,periods=len(df), end='1/1/2000'), inplace=True)
However, this will not give me good results when I feed them into the model I need them for as it shrinks the longer files so much thus distorting the data.
This is what I tried so far:
df.set_index(pd.date_range(start='1/1/1991' ,periods=len(df), end='1/1/2000'), inplace=True)
if df.shape[0]>100: resampled=df.resample('D').mean()
elif df.shape[0]<100: resampled=df.astype(float).resample('33D').interpolate(axis=0, method='linear')
else: break
Now, in the above lines of code, I am getting the files to be the same length (length 100). The downsampling part works fine too.
What's not working is the interpolation in the upsampling part. It just returns dataframes of length 100 with the first value of every column copied over to all the rows.
What I need is to make them all size 104 (the average size). This means any df of length>104 needs to be downsampled and any df of length<104 needs to be upsampled.
As an example, please consider the two dfs as follows:
0 3 -1 0
1 5 -3 2
2 9 -5 0
3 11 -7 -2
0 3 -1 0
1 5 -3 2
2 9 -5 0
3 6 -3 -2
4 4 0 -4
5 8 2 -6
6 10 4 -8
7 12 6 -10
Suppose the avg length is 6, the expected output would be:
df1 upsampled to length 6 using interpolation, e.g. resample(rule).interpolate().
And df2 downsampled to length 6 using resample(rule).mean() .
If I could get all the files to be upsampled to 179, that would be fine as well.
I assume the problem is that when you resample in the up-sampling case, the other values are not kept. With your example df1, you can see it by using asfreq on one column:
print (df1.set_index(pd.date_range(start='1/1/1991' ,periods=len(df1), end='1/1/2000'))[1]
          .resample('33D').asfreq().isna().sum())
#99 rows are nan on the 100 length resampled dataframe
So when you do interpolate instead of asfreq, it actually interpolates with just the first value, meaning that the first value is "repeated" over all the rows
To get the result you want, then before interpolating, use also mean even in the up-sampling case, such as:
print (df1.set_index(pd.date_range(start='1/1/1991' ,periods=len(df1), end='1/1/2000'))[1]
          .resample('33D').mean().interpolate().head())
1991-01-01 3.000000
1991-02-03 3.060606
1991-03-08 3.121212
1991-04-10 3.181818
1991-05-13 3.242424
Freq: 33D, Name: 1, dtype: float64
and you will get values as you want.
To conclude, I think in both up-sampling and down-sampling cases, you can use the same command
resampled = (df.set_index(pd.date_range(start='1/1/1991' ,periods=len(df), end='1/1/2000'))
               .resample('33D').mean().interpolate())
Because the interpolate would not affect the result in the down-sampling case.
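If the dates are only there to force a common length, another option might be to skip the date index and interpolate each column onto a fixed number of evenly spaced points with numpy.interp. This is a different technique from the resample approach above: when shortening a frame it interpolates between neighbours rather than averaging bins. The `resample_to_length` helper and the column names a, b, c are hypothetical.

```python
import numpy as np
import pandas as pd

def resample_to_length(df, n):
    """Linearly interpolate every column of df onto n evenly spaced points."""
    # Positions of the existing rows and of the requested rows on [0, 1].
    old_pos = np.linspace(0.0, 1.0, len(df))
    new_pos = np.linspace(0.0, 1.0, n)
    return pd.DataFrame({
        col: np.interp(new_pos, old_pos, df[col].to_numpy(dtype=float))
        for col in df.columns
    })

# Values of the first example frame from the question.
df1 = pd.DataFrame({"a": [3, 5, 9, 11], "b": [-1, -3, -5, -7], "c": [0, 2, 0, -2]})

up = resample_to_length(df1, 7)    # 4 rows stretched to 7
down = resample_to_length(up, 4)   # and back down to 4
print(up)
```

The first and last rows are always preserved, and the same call works for any target length, e.g. 104 or 179.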
Here is my version using skimage.transform.resize() function:
df1 = pd.DataFrame({
    'a': [3,5,9,11],
    'b': [-1,-3,-5,-7],
    'c': [0,2,0,-2]
})
a b c
0 3 -1 0
1 5 -3 2
2 9 -5 0
3 11 -7 -2
import pandas as pd
import numpy as np
from skimage.transform import resize
def df_resample(df1, num=1):
    df2 = pd.DataFrame()
    for key, value in df1.items():  # iteritems() was removed in pandas 2.0
        temp = value.to_numpy()/value.abs().max() # normalize
        resampled = resize(temp, (num,1), mode='edge')*value.abs().max() # de-normalize
        df2[key] = resampled.flatten().round(2)
    return df2
df2 = df_resample(df1, 20) # resample to 20 rows
a b c
0 3.0 -1.0 0.0
1 3.0 -1.0 0.0
2 3.0 -1.0 0.0
3 3.4 -1.4 0.4
4 3.8 -1.8 0.8
5 4.2 -2.2 1.2
6 4.6 -2.6 1.6
7 5.0 -3.0 2.0
8 5.8 -3.4 1.6
9 6.6 -3.8 1.2
10 7.4 -4.2 0.8
11 8.2 -4.6 0.4
12 9.0 -5.0 0.0
13 9.4 -5.4 -0.4
14 9.8 -5.8 -0.8
15 10.2 -6.2 -1.2
16 10.6 -6.6 -1.6
17 11.0 -7.0 -2.0
18 11.0 -7.0 -2.0
19 11.0 -7.0 -2.0
I want to perform a moving window linear fit to the columns in my dataframe.
n =5
df = pd.DataFrame(index=pd.date_range('1/1/2000', periods=n))
df['B'] = [1.9,2.3,4.4,5.6,7.3]
df['A'] = [3.2,1.3,5.6,9.4,10.4]
2000-01-01 1.9 3.2
2000-01-02 2.3 1.3
2000-01-03 4.4 5.6
2000-01-04 5.6 9.4
2000-01-05 7.3 10.4
For, say, column B, I want to perform a linear fit using the first two rows, then another linear fit using the second and third rows, and so on. And the same for column A. I am only interested in the slope of the fit, so at the end I want a new dataframe with the entries above replaced by the different rolling slopes.
After doing
I try something like
model = pd.ols(y=df['A'], x=df['index'], window_type='rolling',window=3)
But I get
KeyError: 'index'
I added a new column
df['i'] = range(0,len(df))
and I can now run
pd.ols(y=df['A'], x=df.i, window_type='rolling',window=3)
(it gives an error for window=2)
I don't understand this well, because I was expecting a series of numbers but I get just one result
-------------------------Summary of Regression Analysis---------------
Formula: Y ~ <x> + <intercept>
Number of Observations: 3
Number of Degrees of Freedom: 2
R-squared: 0.8981
Adj R-squared: 0.7963
Rmse: 1.1431
F-stat (1, 1): 8.8163, p-value: 0.2068
Degrees of Freedom: model 1, resid 1
-----------------------Summary of Estimated Coefficients--------------
Variable Coef Std Err t-stat p-value CI 2.5% CI 97.5%
x 2.4000 0.8083 2.97 0.2068 0.8158 3.9842
intercept 1.2667 2.5131 0.50 0.7028 -3.6590 6.1923
---------------------------------End of Summary---------------------------------
Now I understand better what is going on. I can access the different values of the fits using
I haven't tried it out, but I don't think you need to specify window_type='rolling'. If you set window to something, window_type will automatically be set to rolling.
I have problems doing this with the DatetimeIndex you created with pd.date_range, and find datetimes a confusing pain to work with in general due to the number of types out there and apparent incompatibility between APIs. Here's how I would do it if the date were an integer (e.g. days since 12/31/99, or years) or float in your example. It won't help your datetime problem, but hopefully it helps with the rolling linear fit part.
Generating your date with integers instead of datetimes:
df = pd.DataFrame()
df['date'] = range(1,6)
df['B'] = [1.9,2.3,4.4,5.6,7.3]
df['A'] = [3.2,1.3,5.6,9.4,10.4]
date B A
0 1 1.9 3.2
1 2 2.3 1.3
2 3 4.4 5.6
3 4 5.6 9.4
4 5 7.3 10.4
Since you want to group by 2 dates every time, then fit a linear model on each group, let's duplicate the records and number each group with the index:
df_dbl = pd.concat([df, df]).sort_index()
df_dbl = df_dbl.iloc[1:-1] # removes the first and last row
date B A
0 1 1.9 3.2 # this record is removed
0 1 1.9 3.2
1 2 2.3 1.3
1 2 2.3 1.3
2 3 4.4 5.6
2 3 4.4 5.6
3 4 5.6 9.4
3 4 5.6 9.4
4 5 7.3 10.4
4 5 7.3 10.4 # this record is removed
c = df_dbl.index[1:len(df_dbl.index)].tolist()
c.append(max(df_dbl.index))  # repeat the last label so the lengths match
df_dbl.index = c
date B A
1 1 1.9 3.2
1 2 2.3 1.3
2 2 2.3 1.3
2 3 4.4 5.6
3 3 4.4 5.6
3 4 5.6 9.4
4 4 5.6 9.4
4 5 7.3 10.4
Now it's ready to group by index to run linear models on B vs. date, which I learned from Using Pandas groupby to calculate many slopes. I use scipy.stats.linregress since I got weird results with pd.ols and couldn't find good documentation to understand why (perhaps because it's geared toward datetime).
1 0.4
2 2.1
3 1.2
4 1.7
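For reference, the same slopes can probably be obtained without duplicating rows, by fitting a line inside a rolling window. This is a minimal sketch, assuming the rows are evenly spaced (one day apart), so the slope is expressed per row. The `rolling_slope` helper is a hypothetical name, and it uses numpy.polyfit instead of scipy.stats.linregress.

```python
import numpy as np
import pandas as pd

n = 5
df = pd.DataFrame(index=pd.date_range('1/1/2000', periods=n))
df['B'] = [1.9, 2.3, 4.4, 5.6, 7.3]
df['A'] = [3.2, 1.3, 5.6, 9.4, 10.4]

def rolling_slope(frame, window):
    """Slope of a least-squares line fitted to each rolling window, per column."""
    x = np.arange(window)  # assumes evenly spaced rows
    # polyfit returns [slope, intercept] for degree 1; keep the slope only.
    return frame.rolling(window).apply(lambda y: np.polyfit(x, y, 1)[0], raw=True)

slopes = rolling_slope(df, 2)
print(slopes)
```

The first row is NaN because there is no full window yet. Changing the second argument gives fits over longer windows.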
I've got a very simple problem, but I can't seem to get it right.
Consider this dataframe
df = pd.DataFrame({'group' :
['A', 'A', 'A', 'B', 'B'], 'time' : [20, 21, 22, 20, 21],
'price' : [3.1, 3.5, 3.0, 2.3, 2.1]})
group price time
0 A 3.1 20
1 A 3.5 21
2 A 3.0 22
3 B 2.3 20
4 B 2.1 21
Now I want to take the standard deviation of the price of each group, but conditional on it being before time 22 (let's call it early_std). I want to then create a variable with that information.
The expected result is
group price time early_std
A 3.1 20 0.282843
A 3.5 21 0.282843
A 3.0 22 0.282843
B 2.3 20 0.141421
B 2.1 21 0.141421
This is what I tried:
df['early_std'] = df[df.time < 22].groupby('group').\
price.transform(lambda x : x.std())
This almost works but it gives a missing value on time = 22:
group price time early_std
0 A 3.1 20 0.282843
1 A 3.5 21 0.282843
2 A 3.0 22 NaN
3 B 2.3 20 0.141421
4 B 2.1 21 0.141421
I also tried with apply and I think it works, but I need to reset the index, which is something I'd rather avoid (I have a large dataset and I need to do this repeatedly)
early_std2 = df[df.time < 22].groupby('group').price.std()
df.set_index('group', inplace=True)
df['early_std2'] = early_std2
price time early_std early_std2
A 3.1 20 0.282843 0.282843
A 3.5 21 0.282843 0.282843
A 3.0 22 NaN 0.282843
B 2.3 20 0.141421 0.141421
B 2.1 21 0.141421 0.141421
It looks like you only need to add fillna() to your first code to expand the std values:
df['early_std'] = df[df.time < 22].groupby('group')['price'].transform(pd.Series.std)
df['early_std'] = df.groupby('group')['early_std'].transform(lambda x: x.fillna(x.max()))
To get:
group price time early_std
0 A 3.1 20 0.283
1 A 3.5 21 0.283
2 A 3.0 22 0.283
3 B 2.3 20 0.141
4 B 2.1 21 0.141
EDIT: I have changed ffill to a more general fillna, but you could also use chained .bfill().ffill() to achieve the same result.
Your second approach is very close to what you are trying to achieve.
This may not be the most efficient method but it worked for me:
df['early_std'] = 0.0
for index, value in early_std2.items():  # iteritems() was removed in pandas 2.0
    df.loc[df.group == index, 'early_std'] = value
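Another way that avoids both the loop and the index change might be to compute the per-group value once and map it back onto the group column. A minimal sketch with the data from the question (the intermediate name `early` is made up):

```python
import pandas as pd

df = pd.DataFrame({'group': ['A', 'A', 'A', 'B', 'B'],
                   'time': [20, 21, 22, 20, 21],
                   'price': [3.1, 3.5, 3.0, 2.3, 2.1]})

# One standard deviation per group, computed on the early rows only.
early = df.loc[df['time'] < 22].groupby('group')['price'].std()

# Look up each row's group in that Series, so every row gets a value,
# including the rows that were filtered out of the calculation.
df['early_std'] = df['group'].map(early)

print(df)
```

A group with no rows before time 22 would get NaN here, which may or may not be what you want.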
I want to apply a function f to many slices within each row of a pandas DataFrame.
For example, DataFrame df would look as such:
df = pandas.DataFrame(np.round(np.random.normal(size=(2,49)), 2))
So, I have a dataframe of 2 rows by 49 columns, and my function needs to be applied to every consecutive slice of 7 data points in both rows, so that the resulting dataframe has the same shape as the input dataframe.
I was doing it as such:
df1.T[:7], df1.T[7:14], df1.T[14:21],..., df1.T[43:50] = f(df.T.iloc[:7,:]), f(df.T.iloc[7:14,:]),..., f(df.T.iloc[43:50,:])
As you can see that's a whole lot of redundant code.. so I would like to create a loop or something so that it applies the function to every 7 subsequent data point...
I have no idea how to approach this. Is there a more elegant way to do this?
I thought I could maybe use a transform function for this, but in the pandas documentation I can only see that applied to a dataframe that has been grouped and not on slices of the data....
Hopefully this is clear.. let me know.
Thank you.
To avoid redundant code you can just do a loop like this:
STEP = 7
for i in range(0, df.shape[1], STEP): # step over the 49 columns; len(df) is only the 2 rows
    df1.T[i:i+STEP] = f(df1.T[i:i+STEP]) # could also do an apply here somehow, depending on what you want to do
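If f can work on numpy arrays, the loop might be avoided by reshaping each row into blocks of 7. A minimal sketch, assuming the number of columns is a multiple of STEP and that f operates along the last axis. The f below (subtract each block's mean) is only a hypothetical stand-in for the real function.

```python
import numpy as np
import pandas as pd

STEP = 7

def f(block):
    # Hypothetical stand-in: centre every block of 7 values on zero.
    return block - block.mean(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
df = pd.DataFrame(np.round(rng.normal(size=(2, 49)), 2))

# Shape (2, 49) -> (2, 7, 7): row, block number, position inside the block.
blocks = df.to_numpy().reshape(len(df), -1, STEP)

# Apply f to all blocks at once, then restore the original 2 x 49 layout.
result = pd.DataFrame(f(blocks).reshape(df.shape),
                      index=df.index, columns=df.columns)
print(result.iloc[:, :STEP])
```

The result has the same index and columns as the input, which seems to be what the question asks for.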
Don't Repeat Yourself
You don't provide any examples of your desired output, so here's my best guess at what you want...
If your data are lumped into groups of seven, then you need to come up with a way to label them as such.
In other words, if you want to work with arbitrary arrays, use numpy. If you want to work with labeled, meaningful data and its associated metadata, then use pandas.
Also, pandas works more efficiently when operating on (and displaying!) row-wise data. So that means storing data long (49x2), not wide (2x49).
Here's an example of what I mean. I have the same 49x2 random array, but assigned grouping labels to the rows ahead of time.
Let's say you're reading in some wide-ish data as follows:
import pandas
import numpy
from io import StringIO # python 3
# from StringIO import StringIO # python 2
datafile = StringIO("""\
df = pandas.read_csv(datafile)
0 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9
1 1 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9
2 2 2.1 2.2 2.3 2.4 2.5 2.6 2.7 2.8 2.9
You could add a cluster value to the columns, like so:
cluster_size = 3
col_vals = []
for n, col in enumerate(df.columns):
cluster = int(n/cluster_size)
col_vals.append((cluster, col))
df.columns = pandas.Index(col_vals)
0 1 2 3
0 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9
1 1 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9
2 2 2.1 2.2 2.3 2.4 2.5 2.6 2.7 2.8 2.9
By default, the groupby method tries to group rows, but you can group columns (I just figured this out) by passing axis=1 when you create the object. So the sum of each cluster of columns for each row is as follows:
df.groupby(axis=1, level=0).sum()
0 1 2 3
0 0.3 1.2 2.1 0.9
1 3.3 4.2 5.1 1.9
2 6.3 7.2 8.1 2.9
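Recent pandas releases (2.1 and later, as far as I know) deprecate axis=1 in groupby, so the same sums can be sketched by grouping the transposed frame instead. The frame below rebuilds the values shown above with default integer column names, which is an assumption since the original header is not shown. `labels` is a hypothetical name.

```python
import numpy as np
import pandas as pd

# Rebuild the 3 x 10 example values: 0.0 .. 0.9, 1.0 .. 1.9, 2.0 .. 2.9.
df = pd.DataFrame(np.arange(30).reshape(3, 10) / 10)

cluster_size = 3

# One cluster label per column: 0, 0, 0, 1, 1, 1, 2, 2, 2, 3.
labels = np.arange(df.shape[1]) // cluster_size

# Group the rows of the transposed frame, then transpose back.
sums = df.T.groupby(labels).sum().T
print(sums)
```

This gives one column per cluster, with the last cluster holding the single leftover column.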
But again, if all you're doing is more "global" operations, there's no need to do any of this.
In-place column cluster operation
df[0] *= 5
0 1 2 3
0 0 2.5 5 0.3 0.4 0.5 0.6 0.7 0.8 0.9
1 25 27.5 30 1.3 1.4 1.5 1.6 1.7 1.8 1.9
2 50 52.5 55 2.3 2.4 2.5 2.6 2.7 2.8 2.9
In-place row operation
df.T[0] += 20
0 1 2 3
0 20 22.5 25 20.3 20.4 20.5 20.6 20.7 20.8 20.9
1 25 27.5 30 1.3 1.4 1.5 1.6 1.7 1.8 1.9
2 50 52.5 55 2.3 2.4 2.5 2.6 2.7 2.8 2.9
Operate on the entire dataframe at once
def myFunc(x):
    return 5 + x**2
0 1 2 3
0 405 511.25 630 417.09 421.16 425.25 429.36 433.49 437.64 441.81
1 630 761.25 905 6.69 6.96 7.25 7.56 7.89 8.24 8.61
2 2505 2761.25 3030 10.29 10.76 11.25 11.76 12.29 12.84 13.41