I have a dataframe with 2 columns and 3000 rows.
The first column represents time in time-steps: the first row is 0, the second is 1, ..., the last one is 2999.
The second column represents pressure. The pressure changes as we iterate over the rows, but shows a repetitive behaviour: every few steps it drops to its minimum value (which is 375), then goes up again, then drops to 375 again, and so on.
What I want to do in Python, is to iterate over the rows and see:
1) at which time-steps the pressure is at its minimum, and
2) the frequency (number of time-steps) between those minimum values.
import numpy as np
import pandas as pd

df = pd.read_csv('test.csv')
df.columns = ["Timestamp", "Pressure"]
print(df[["Timestamp", "Pressure"]])
You don't need to iterate row-wise: you can compare the entire column against the min value to build a mask, and then use the mask to find the timestep diffs:
Data setup:
In [44]:
df = pd.DataFrame({'timestep':np.arange(20), 'value':np.random.randint(375, 400, 20)})
df
Out[44]:
timestep value
0 0 395
1 1 377
2 2 392
3 3 396
4 4 377
5 5 379
6 6 384
7 7 396
8 8 380
9 9 392
10 10 395
11 11 393
12 12 390
13 13 393
14 14 397
15 15 396
16 16 393
17 17 379
18 18 396
19 19 390
mask the df by comparing the column against the min value:
In [45]:
df[df['value']==df['value'].min()]
Out[45]:
timestep value
1 1 377
4 4 377
We can use the mask with loc to find the corresponding 'timestep' value and use diff to find the interval differences:
In [48]:
df.loc[df['value']==df['value'].min(),'timestep'].diff()
Out[48]:
1 NaN
4 3.0
Name: timestep, dtype: float64
You can then rescale these step differences to whatever frequency unit you desire (for example, relative to one minute).
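Applied to the question's dataframe (a minimal sketch, assuming df already holds the 'Timestamp' and 'Pressure' columns described above, with 375 as the minimum):
min_mask = df['Pressure'] == df['Pressure'].min()   # rows where pressure hits its minimum
min_steps = df.loc[min_mask, 'Timestamp']           # 1) time-steps at which the minimum occurs
intervals = min_steps.diff()                        # 2) number of steps between consecutive minima
print(min_steps.tolist())
print(intervals.mean())                             # average spacing between minima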
I would like to shift a column in a Pandas DataFrame, but I haven't been able to find a method to do it from the documentation without rewriting the whole DF. Does anyone know how to do it?
DataFrame:
## x1 x2
##0 206 214
##1 226 234
##2 245 253
##3 265 272
##4 283 291
Desired output:
## x1 x2
##0 206 nan
##1 226 214
##2 245 234
##3 265 253
##4 283 272
##5 nan 291
In [18]: a
Out[18]:
x1 x2
0 0 5
1 1 6
2 2 7
3 3 8
4 4 9
In [19]: a['x2'] = a.x2.shift(1)
In [20]: a
Out[20]:
x1 x2
0 0 NaN
1 1 5
2 2 6
3 3 7
4 4 8
You need to use df.shift here.
df.shift(i) shifts the entire dataframe by i units down.
So, for i = 1:
Input:
x1 x2
0 206 214
1 226 234
2 245 253
3 265 272
4 283 291
Output:
x1 x2
0 NaN NaN
1 206 214
2 226 234
3 245 253
4 265 272
So, run this script to get the expected output:
import pandas as pd

df = pd.DataFrame({'x1': [206, 226, 245, 265, 283],
                   'x2': [214, 234, 253, 272, 291]})
print(df)
df['x2'] = df['x2'].shift(1)
print(df)
Let's define the dataframe from your example by
>>> df = pd.DataFrame([[206, 214], [226, 234], [245, 253], [265, 272], [283, 291]],
columns=[1, 2])
>>> df
1 2
0 206 214
1 226 234
2 245 253
3 265 272
4 283 291
Then you could work on a copy of the second column and shift its index by one:
>>> s2 = df[2].copy()
>>> s2.index = s2.index + 1
and finally re-combine the single columns
>>> pd.concat([df[1], s2], axis=1)
1 2
0 206.0 NaN
1 226.0 214.0
2 245.0 234.0
3 265.0 253.0
4 283.0 272.0
5 NaN 291.0
Perhaps not fast, but simple to read. Consider setting variables for the column names and the actual shift required.
Edit: In general the shift could be done with df[2].shift(1), as already posted; however, that would cut off the carried-over value at the end.
If you don't want to lose the values you shift past the end of your dataframe, simply append the required number of empty rows first:
offset = 5
padding = pd.DataFrame(np.nan, index=range(offset), columns=DF.columns)
DF = pd.concat([DF, padding], ignore_index=True)  # DataFrame.append is deprecated; concat a block of NaN rows instead
DF = DF.shift(periods=offset)
DF = DF.reset_index(drop=True)  # only needed if the index was not already sequential
I assume these imports:
import pandas as pd
import numpy as np
First append a new row of NaN, NaN, ... at the end of the DataFrame (df).
s1 = df.iloc[0].copy()  # copy the 1st row to a new Series s1
s1[:] = np.nan          # set all values to NaN
df2 = pd.concat([df, s1.to_frame().T], ignore_index=True)  # add s1 to the end of df (append() is deprecated)
This creates a new DataFrame df2. Maybe there is a more elegant way, but this works.
Now you can shift it:
df2.x2 = df2.x2.shift(1) # shift what you want
While trying to solve a problem of my own, similar to yours, I found in the pandas docs what I think answers this question:
DataFrame.shift(periods=1, freq=None, axis=0)
Shift index by desired number of periods with an optional time freq
Notes
If freq is specified then the index values are shifted but the data is not realigned. That is, use freq if you would like to extend the index when shifting and preserve the original data.
Hope this helps with future questions on this matter.
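For example (a hypothetical illustration, not from the original answer), on a Series with a DatetimeIndex:
import numpy as np
import pandas as pd

# shifting with freq moves the index forward one day and keeps the data
# aligned with it; a plain shift moves the data and introduces a NaN
idx = pd.date_range('2021-01-01', periods=5, freq='D')
s = pd.Series(np.arange(5), index=idx)
print(s.shift(1, freq='D'))   # same values, index shifted by one day
print(s.shift(1))             # values shifted down, NaN at the top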
df3
            yo     price
1 108.210 108.231
2 108.231 108.156
3 108.156 108.196
4 108.196 108.074
... ... ...
2495 108.351 108.279
2496 108.279 108.669
2497 108.669 108.687
2498 108.687 108.915
2499 108.915 108.852
A negative period shifts the column up instead of down, so the first 'yo' value is dropped and a NaN appears at the end:
df3['yo'] = df3['yo'].shift(-1)
yo price
0 108.231 108.210
1 108.156 108.231
2 108.196 108.156
3 108.074 108.196
4 108.104 108.074
... ... ...
2495 108.669 108.279
2496 108.687 108.669
2497 108.915 108.687
2498 108.852 108.915
2499 NaN 108.852
This is how I do it:
df_ext = pd.DataFrame(index=pd.date_range(df.index[-1], periods=8, closed='right'))  # in pandas >= 1.4, use inclusive='right' instead of closed='right'
df2 = pd.concat([df, df_ext], axis=0, sort=True)
df2["forecast"] = df2["some column"].shift(7)
Basically I am generating an empty dataframe with the desired index and then just concatenating the two together. But I would really like to see this as a standard feature in pandas, so I have proposed an enhancement to pandas.
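A self-contained sketch of the same pattern (hypothetical daily data and column names; the extra rows are built from an explicit start date to avoid the closed/inclusive rename):
import numpy as np
import pandas as pd

# hypothetical daily series; 'some column' and the 7-day shift are just examples
df = pd.DataFrame({'some column': np.arange(30)},
                  index=pd.date_range('2021-01-01', periods=30, freq='D'))

# extend the index by 7 days, then shift the data into the new rows
df_ext = pd.DataFrame(index=pd.date_range(df.index[-1] + pd.Timedelta(days=1),
                                          periods=7, freq='D'))
df2 = pd.concat([df, df_ext], axis=0, sort=True)
df2['forecast'] = df2['some column'].shift(7)
print(df2.tail(10))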
I'm new to pandas, and I may not be understanding the question, but this solution worked for my problem:
# Shift contents of column 'x2' down 1 row
df['x2'] = df['x2'].shift()
Or, to create a new column with contents of 'x2' shifted down 1 row
# Create new column with contents of 'x2' shifted down 1 row
df['x3'] = df['x2'].shift()
I had a read of the official docs for shift() while trying to figure this out, but it doesn't make much sense to me, and has no examples referencing this specific behavior.
Note that the last row of column 'x2' is effectively pushed off the end of the DataFrame. I expected shift() to have a flag to change this behaviour, but I can't find anything.
Sorry if this is a duplicate post; I couldn't find a related post, though.
import numpy as np
import pandas as pd
from random import seed

seed(100)  # note: this seeds Python's random module, not numpy's generator
P = pd.DataFrame(np.random.randint(0, 100, size=(1000, 2)), columns=list('AB'))
What I'd like is to group P by the quartiles/quantiles/deciles/etc. of column A and then calculate an aggregate statistic (such as the mean) per group. I can define deciles of the column as
P['A'].quantile(np.arange(10) / 10)
I'm not sure how to groupby the deciles of A. Thanks in advance!
If you want to group P e.g. by quartiles, run:
gr = P.groupby(pd.qcut(P.A, 4, labels=False))
Then you can perform any operations on these groups.
For presentation, below is a printout of P limited to 20 rows:
for key, grp in gr:
    print(f'\nGroup: {key}\n{grp}')
which gives:
Group: 0
A B
0 8 24
3 10 94
10 9 93
15 4 91
17 7 49
Group: 1
A B
7 34 24
8 15 60
12 27 4
13 31 1
14 13 83
Group: 2
A B
4 52 98
5 53 66
9 58 16
16 59 67
18 47 65
Group: 3
A B
1 67 87
2 79 48
6 98 14
11 86 2
19 61 14
As you can see, each group (quartile) has 5 members, so the grouping is
correct.
As a supplement
If you are interested in the borders of each quartile, run:
pd.qcut(P.A, 4, labels=False, retbins=True)[1]
With retbins=True, qcut returns 2 results (a tuple). The first element (index 0) is
the result returned before; this time we are interested in the
second element (index 1): the bin borders.
For your data they are:
array([ 4. , 12.25, 40.5 , 59.5 , 98. ])
So e.g. the first quartile is between 4 and 12.25.
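The same pattern extends to deciles; a small sketch of the mean of B within each decile of A (only the number of bins changes):
# mean of column B within each decile (10 equal-frequency bins) of column A
decile_means = P.groupby(pd.qcut(P.A, 10, labels=False))['B'].mean()
print(decile_means)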
You can use the quantile Series to make another column marking each row with its quantile label, and then group by that column. numpy's searchsorted is very useful for this:
import numpy as np
import pandas as pd
from random import seed
seed(100)
P = pd.DataFrame(np.random.randint(0, 100, size=(1000, 2)), columns=list('AB'))
q = P['A'].quantile(np.arange(10) / 10)
P['G'] = P['A'].apply(lambda x : q.index[np.searchsorted(q, x, side='right')-1])
Since the quantile Series stores the lower bound of each quantile interval, be sure to pass side='right' to np.searchsorted: for the minimum value of A it then returns 1 instead of 0, so after subtracting 1 the row lands in the first bin rather than wrapping around to index -1.
Now you can elaborate your statistics by doing, for example:
P.groupby('G').agg(['sum', 'mean'])  # add to the list all the statistic methods you wish
I have two columns in a dataframe with overlapping values. How can I find the duplicate values of the first column that are also present in the second column, and return the corresponding row number of that value in the second column as a new column?
import pandas as pd
from io import StringIO  # pandas.compat.StringIO was removed in newer pandas

print(pd.__version__)
csvdata = StringIO("""a,b
111,122
122,3
111,9
254,395
265,245
111,395
220,111
395,305
395,8""")
df1 = pd.read_csv(csvdata, sep=",")
# find unique duplicate values in first column
col_a_dups = df1['a'][df1['a'].duplicated()].unique()
corresponding_value = df1['b'][df1['b'].isin(col_a_dups)]
print(df1.join(corresponding_value, lsuffix="_l", rsuffix="_r"))
#print(corresponding_value.index)
Produces
0.24.2
a b_l b_r
0 111 122 NaN
1 122 3 NaN
2 111 9 NaN
3 254 395 395.0
4 265 245 NaN
5 111 395 395.0
6 220 111 111.0
7 395 305 NaN
8 395 8 NaN
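One possible approach (a sketch, not from the original post): build a lookup from each value in b to its row number, then map the duplicated values of a through it. The column name row_in_b is just illustrative:
dup_a = df1['a'][df1['a'].duplicated(keep=False)]        # every duplicated value in 'a'
b_rows = pd.Series(df1.index, index=df1['b'])            # value in 'b' -> its row number
b_rows = b_rows[~b_rows.index.duplicated(keep='first')]  # keep the first match per value
df1['row_in_b'] = dup_a.map(b_rows)                      # NaN where 'a' has no match in 'b'
print(df1)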
I have a time series with hourly frequency and a label per day. I would like to fix the class imbalance by oversampling while preserving the sequence for each one day period. Ideally I would be able to use ADASYN or another method better than random oversampling. Here is what the data looks like:
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
np.random.seed(seed=1111)
date_today = datetime.now()
days = pd.date_range(date_today, date_today + timedelta(45), freq='H')
data = np.random.random(size=len(days))
data2 = np.random.random(size=len(days))
df = pd.DataFrame({'DateTime': days, 'col1': data, 'col_2' : data2})
df['Date'] = [df.loc[i,'DateTime'].floor('D') for i in range(len(df))]
class_labels = []
for i in df['Date'].unique():
    class_labels.append([i, np.random.choice((1,2,3,4,5,6,7,8,9,10), size=1,
                         p=(.175,.035,.016,.025,.2,.253,.064,.044,.072,.116))[0]])
class_labels = pd.DataFrame(class_labels)
df['class_label'] = [class_labels[class_labels.loc[:,0] == df.loc[i,'Date']].loc[:,1].values[0] for i in range(len(df))]
df = df.set_index('DateTime')
df.drop('Date',axis=1,inplace=True)
print(df['class_label'].value_counts())
df.head(15)
Out[209]:
5 264
1 240
6 145
9 120
7 120
10 72
8 72
4 24
2 24
Out[213]:
col1 col_2 class_label
DateTime
2019-02-01 18:28:29.214935 0.095549 0.307041 6
2019-02-01 19:28:29.214935 0.925004 0.981620 6
2019-02-01 20:28:29.214935 0.343573 0.610662 6
2019-02-01 21:28:29.214935 0.310477 0.482961 6
2019-02-01 22:28:29.214935 0.002010 0.242208 6
2019-02-01 23:28:29.214935 0.235595 0.355516 6
2019-02-02 00:28:29.214935 0.237792 0.028726 5
2019-02-02 01:28:29.214935 0.735916 0.221198 5
2019-02-02 02:28:29.214935 0.495468 0.712723 5
2019-02-02 03:28:29.214935 0.784425 0.818065 5
2019-02-02 04:28:29.214935 0.126506 0.414326 5
2019-02-02 05:28:29.214935 0.606649 0.264835 5
2019-02-02 06:28:29.214935 0.466121 0.244843 5
2019-02-02 07:28:29.214935 0.237132 0.298100 5
2019-02-02 08:28:29.214935 0.435159 0.621991 5
I would like to use ADASYN or SMOTE, but even random oversampling to fix the class imbalance would be good.
The desired result is in hourly increments like the original, has one label per day and classes are balanced:
print(df['class_label'].value_counts())
Out[211]:
5 264
1 264
6 264
9 264
7 264
10 264
8 264
4 264
2 264
Using a list comprehension with groupby, then sampling each subset with replacement:
newdf=pd.concat([y.sample(264,replace=True) for _, y in df.groupby('class_label')])
newdf.class_label.value_counts()
9 264
7 264
5 264
1 264
10 264
8 264
6 264
4 264
2 264
Name: class_label, dtype: int64
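Note that sample here draws individual hourly rows, so the ordering within each day is not kept. If you want the whole 24-hour sequence of each day preserved, one option is to resample entire days with replacement (a rough sketch, not from the original answer, assuming the df built in the question):
# random oversampling at the day level: each sampled unit is a whole day,
# so the hourly sequence inside the day stays intact
day_key = df.index.floor('D')                                  # one key per calendar day
day_labels = df.groupby(day_key)['class_label'].first()        # the single label of each day
target = day_labels.value_counts().max()                       # balance up to the largest class

rng = np.random.default_rng(0)
pieces = []
for label in day_labels.unique():
    class_days = day_labels.index[day_labels == label]
    picks = rng.integers(0, len(class_days), size=target)      # sample day positions with replacement
    for day in class_days[picks]:
        pieces.append(df[day_key == day])

balanced = pd.concat(pieces)
print(balanced['class_label'].value_counts())                  # roughly target*24 rows per class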
You really can't "oversample" time series data, at least not in the same sense that you can with unordered data. It wouldn't be possible to have 264 examples of every class; that would mean inserting new data into the time series between existing points and throwing all of the time-sensitive patterns out of whack.
The best option (as far as oversampling) is to synthetically generate one or more new time series based on your original data. One option: for each point, pick a random class then interpolate between the closest data points of that class from the original time series. Another option: randomly sample 24 points from each class (which will always include all of class 2 and 4) and interpolate the rest of the time series a few times until you have a set of balanced time series.
A much better option is to address class imbalance some other way, say by changing your loss/error function.
I have a dataframe with 2 columns, time and pressure, and around 3000 rows, like this:
time value
0 393
1 389
2 402
3 408
4 413
5 463
6 471
7 488
8 422
9 404
10 370
I want to find 1) the most frequent value of pressure and 2) after how many time-steps we see this value again. My code so far is this:
import numpy as np
import pandas as pd

pd.set_option('display.max_rows', 5000)

df = pd.read_csv('copy.csv')
df.columns = ["LTimestamp", "LPressure"]
## Timestep
df = pd.DataFrame({'timestep': df.LTimestamp, 'value': df.LPressure})
df['timestep'] = pd.to_datetime(df['timestep'], unit='ms').dt.time
# print(df)
## Find most seen value in pressure
count = df['value'].value_counts().sort_values(ascending=[False]).nlargest(1).values[0]
print (count)
## Mask the df by comparing the column against the most seen value.
print(df[df['value'] == count])
## Find interval differences
x = df.loc[df['value'] == count, 'timestep'].diff()
print(x)
The output is this, where 101 is the number of times the most frequent value (400) occurs.
>>> 101
>>> Empty DataFrame
>>> Columns: [timestep, value]
>>> Index: []
>>> Series([], Name: timestep, dtype: object)
>>> [Finished in 1.7s]
I don't understand why it returns an empty Index array. If instead of
print(df[df['value'] == count])
I use
print(df[df['value'] == 400])
I can see the masked df with the interval differences, as here:
50 1.0
112 62.0
215 103.0
265 50.0
276 11.0
277 1.0
278 1.0
318 40.0
366 48.0
367 1.0
But later on, I will want to calculate this for the minimum values, or the second largest etc. This is why I want to use count and not a specific number. Can someone help with this?
I'd suggest using
>>> val = df['value'].value_counts().nlargest(1).index[0]
>>> df[df['value'] == val]
time value
2 2 402
3 3 402
7 7 402
8 8 402
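With val in hand, the same masking and diff from the question works unchanged (a small sketch reusing the question's dataframe and its timestep column):
val = df['value'].value_counts().nlargest(1).index[0]      # the most frequent pressure value
intervals = df.loc[df['value'] == val, 'timestep'].diff()  # steps between its occurrences
print(intervals)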
A more general solution is to assign a frequency rank to each value in df.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'time': np.arange(20)
})
df['value'] = df.time ** 2 % 7
# rank each value by frequency: 0 = most frequent, 1 = second most frequent, ...
vcs = {v: i for i, v in enumerate(df.value.value_counts().index)}
df['freq_rank'] = df.value.apply(vcs.get)
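With this column in place, you can mask against any rank rather than a literal value (a small sketch extending the answer above):
# rank 0 = most frequent value, rank 1 = second most frequent, and so on
most_frequent = df[df['freq_rank'] == 0]
print(most_frequent['time'].diff())   # intervals between occurrences of the top value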