How to fill missing time slots in Python?

I'm trying to fill the missing time slots in a CSV file whose date and time column is stored as a string.
My input from the CSV file is:
A B C
56 2017-10-26 22:15:00 89
2 2017-10-27 00:30:00 54
20 2017-10-28 05:00:00 64
24 2017-10-29 06:00:00 2
91 2017-11-01 22:45:00 78
62 2017-11-02 15:30:00 99
91 2017-11-02 22:45:00 34
Output should be
A B C
0 2017-10-26 00:00:00 89
1 2017-10-26 00:15:00 89
.
.
.
.
.
56 2017-10-26 22:15:00 89
..
.
.
.
.
96 2017-10-26 23:45:00 89
0 2017-10-27 00:00:00 54
1 2017-10-27 00:15:00 54
2 2017-10-27 00:30:00 54
.
.
.
20 2017-10-28 05:00:00 64
21 2017-10-28 05:15:00 64
.
.
.
.
24 2017-10-29 06:00:00 2
.
91 2017-11-01 22:45:00 78
.
62 2017-11-02 15:30:00 99
.
91 2017-11-02 22:45:00 34
The output range is 15-minute time slots for the days between 2017-10-26 and 2017-11-02, and each day has 96 slots.

Using resample to get 15-min intervals and bfill to fill missing values in B:
# Index on the parsed datetimes, anchor a row at the first day's midnight,
# then resample to 15 minutes and back-fill.
df = df.set_index(pd.to_datetime(df.pop('B')))
df.loc[df.index.min().normalize()] = None
df = df.resample('15min').max().bfill()
df['A'] = 4 * df.index.hour + df.index.minute // 15  # slot number within the day
print(df)
Output:
A C
B
2017-10-26 00:00:00 0 89.0
2017-10-26 00:15:00 1 89.0
2017-10-26 00:30:00 2 89.0
... .. ...
2017-11-02 22:15:00 89 34.0
2017-11-02 22:30:00 90 34.0
2017-11-02 22:45:00 91 34.0

You need to resample your data and to fill missing values by propagating the last known value for each date. Pandas could be helpful to do that. Assuming you loaded your csv in pandas (with pandas.read_csv), and you obtained a dataframe (let's call it df) where the date column is your index (df.set_index('B')), then:
df.resample(rule='15min').ffill()
The rule parameter defines the new frequency ('15min', not '15M', which pandas would read as 15 months), and the call to .ffill() means "forward fill", i.e., replace missing data with the last previous value.
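A self-contained sketch of that idea, using a made-up two-row frame in place of the real CSV:

```python
import pandas as pd

# Two rows standing in for the real data, 45 minutes apart.
df = pd.DataFrame(
    {"C": [89, 54]},
    index=pd.to_datetime(["2017-10-26 22:15:00", "2017-10-26 23:00:00"]),
)

# Resample onto a 15-minute grid; ffill propagates the last known value
# into the newly created slots.
out = df.resample("15min").ffill()
print(out)
```

This yields four rows (22:15 through 23:00), with the 22:30 and 22:45 gaps carrying the 22:15 value forward.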

Related

Using idxmax and idxmin to change values in different rows

I am trying to find the cleanest, most pandastic way to create a new column that has the minimum values from one column in the same row as the maximum values in another column. The rest of the values can be nan as I will be interpolating.
import datetime
import numpy as np
import pandas as pd

rng = pd.date_range(start=datetime.date(2020, 8, 1), end=datetime.date(2020, 8, 3), freq='H')
df = pd.DataFrame(rng, columns=['date'])
df.index = pd.to_datetime(df['date'])
df.drop(['date'], axis=1, inplace=True)
df['val0'] = np.random.randint(0, 50, 49)
df['val1'] = np.random.randint(0, 50, 49)
One realization of df (cut and paste for reproducibility):
val0 val1
date
2020-08-01 00:00:00 17 4
2020-08-01 01:00:00 89 0
2020-08-01 02:00:00 85 48
2020-08-01 03:00:00 83 13
2020-08-01 04:00:00 56 65
2020-08-01 05:00:00 48 31
2020-08-01 06:00:00 55 11
2020-08-01 07:00:00 15 87
2020-08-01 08:00:00 92 70
2020-08-01 09:00:00 95 57
2020-08-01 10:00:00 68 79
2020-08-01 11:00:00 87 7
2020-08-01 12:00:00 43 15
2020-08-01 13:00:00 23 4
2020-08-01 14:00:00 68 13
2020-08-01 15:00:00 68 63
2020-08-01 16:00:00 28 86
2020-08-01 17:00:00 12 40
2020-08-01 18:00:00 51 20
2020-08-01 19:00:00 20 48
2020-08-01 20:00:00 79 78
2020-08-01 21:00:00 67 89
2020-08-01 22:00:00 46 52
2020-08-01 23:00:00 7 47
2020-08-02 00:00:00 14 73
2020-08-02 01:00:00 70 30
2020-08-02 02:00:00 2 39
2020-08-02 03:00:00 65 81
2020-08-02 04:00:00 65 8
2020-08-02 05:00:00 83 60
2020-08-02 06:00:00 1 64
2020-08-02 07:00:00 13 63
2020-08-02 08:00:00 45 78
2020-08-02 09:00:00 83 7
2020-08-02 10:00:00 75 0
2020-08-02 11:00:00 52 3
2020-08-02 12:00:00 59 34
2020-08-02 13:00:00 54 57
2020-08-02 14:00:00 90 66
2020-08-02 15:00:00 82 56
2020-08-02 16:00:00 9 2
2020-08-02 17:00:00 5 51
2020-08-02 18:00:00 67 96
2020-08-02 19:00:00 18 77
2020-08-02 20:00:00 28 89
2020-08-02 21:00:00 96 53
2020-08-02 22:00:00 28 46
2020-08-02 23:00:00 41 87
2020-08-03 00:00:00 26 47
Now I find idxmin and idxmax:
minidx=df.groupby(pd.Grouper(freq='D')).idxmin()
maxidx=df.groupby(pd.Grouper(freq='D')).idxmax()
minidx:
val0 val1
date
2020-08-01 2020-08-01 23:00:00 2020-08-01 01:00:00
2020-08-02 2020-08-02 06:00:00 2020-08-02 10:00:00
2020-08-03 2020-08-03 00:00:00 2020-08-03 00:00:00
maxidx:
val0 val1
date
2020-08-01 2020-08-01 09:00:00 2020-08-01 21:00:00
2020-08-02 2020-08-02 21:00:00 2020-08-02 18:00:00
2020-08-03 2020-08-03 00:00:00 2020-08-03 00:00:00
In this case, I would like to put the minimum daily value (7) located at 2020-08-01 23:00:00 into a new column at 2020-08-01 21:00:00 (i.e. adjacent to 89, the daily max of val1), and do the same for all other dates so the 'new' value on 2020-08-02 18:00:00 will be 1 (i.e. the minimum daily value occurring on 2020-08-02 06:00:00).
I tried the following, but I just get a bunch of nans:
df.loc[maxidx['val1'].values,'new']=df.loc[minidx['val0'].values,'val0']
If I just set it to an int (df.loc[maxidx['val1'].values,'new']=6), I get the int in the places I want the new values. The values I want are given by df.loc[minidx['val0'].values,'val0'], but I can't seem to get them into the dataframe.
minidx['val0'].values and maxidx['val1'].values are arrays of the same size with elements of type numpy.datetime64, and they are all generated from the same dataframe so maxidx and minidx should exist in df.index (df.index.values).
Is there an obvious reason this isn't working? Thanks
The simplest solution I have found is to loop through the idxmin and idxmax:
for v0, v1 in zip(minidx['val0'].values, maxidx['val1'].values):
    df.loc[v1, 'new'] = df.loc[v0, 'val0']
This gives me what I want, but doesn't seem very pandastic, so any other suggestions to accomplish the same thing would be great.
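The NaNs come from index alignment: the right-hand side of the .loc assignment is a Series still carrying the source timestamps, and pandas aligns it against the destination rows before assigning. Stripping the index with .values assigns positionally. A minimal sketch, with a hypothetical four-row frame standing in for the data above:

```python
import pandas as pd

df = pd.DataFrame(
    {"val0": [7, 95, 1, 96]},
    index=pd.to_datetime(["2020-08-01 23:00", "2020-08-01 09:00",
                          "2020-08-02 06:00", "2020-08-02 21:00"]),
)
src = pd.to_datetime(["2020-08-01 23:00", "2020-08-02 06:00"])  # where the minima live
dst = pd.to_datetime(["2020-08-01 09:00", "2020-08-02 21:00"])  # where they should go

# This assigns NaN: the right-hand Series keeps the *source* timestamps,
# and pandas aligns them against the destination rows before assigning.
df.loc[dst, "new"] = df.loc[src, "val0"]

# Stripping the index with .values assigns positionally, as intended.
df.loc[dst, "new"] = df.loc[src, "val0"].values
```

After the second assignment the minima (7 and 1) sit on the destination rows, no loop required.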
IIUC, you can do this using NamedAgg:
df.groupby(pd.Grouper(freq='D')).agg(val0_min_time=('val0', 'idxmin'),
                                     val0_min_value=('val0', 'min'),
                                     val0_max_time=('val0', 'idxmax'),
                                     val0_max_value=('val0', 'max'),
                                     val1_min_time=('val1', 'idxmin'),
                                     val1_min_value=('val1', 'min'),
                                     val1_max_time=('val1', 'idxmax'),
                                     val1_max_value=('val1', 'max'))
Output:
val0_min_time val0_min_value val0_max_time val0_max_value val1_min_time val1_min_value val1_max_time val1_max_value
date
2020-08-01 2020-08-01 23:00:00 7 2020-08-01 09:00:00 95 2020-08-01 01:00:00 0 2020-08-01 21:00:00 89
2020-08-02 2020-08-02 06:00:00 1 2020-08-02 21:00:00 96 2020-08-02 10:00:00 0 2020-08-02 18:00:00 96
2020-08-03 2020-08-03 00:00:00 26 2020-08-03 00:00:00 26 2020-08-03 00:00:00 47 2020-08-03 00:00:00 47

Getting wrong datetime after resampling

I want to create missing records from a time series of % humidity.
datetime humidite
0 2019-07-09 08:30:00 87
1 2019-07-09 11:00:00 87
2 2019-07-09 17:30:00 82
3 2019-07-09 23:30:00 80
4 2019-07-11 06:15:00 79
5 2019-07-19 14:30:00 39
6 2019-07-21 00:00:00 80
I tried to index with the existing datetimes (the result at this step is OK):
humdt["datetime"] = pd.to_datetime(humdt["datetime"])
humdt = humdt.set_index("datetime")
humidite
datetime
2019-07-09 08:30:00 87
2019-07-09 11:00:00 87
2019-07-09 17:30:00 82
2019-07-09 23:30:00 80
2019-07-11 06:15:00 79
2019-07-19 14:30:00 39
Then reindex at 15 min frequency (my target frequency) :
humdt.resample("15min").asfreq()
humidite
datetime
2019-06-26 10:00:00 34.0
2019-06-26 10:15:00 33.0
2019-06-26 10:30:00 32.0
2019-06-26 10:45:00 31.0
2019-06-26 11:00:00 30.0
2019-06-26 11:15:00 29.0
As a result, I get the wrong starting time and values; only the frequency is respected.
Can you help me please? I also tried merging a range of datetimes, defined as my expected records, with my data, and it doesn't work. Thank you!
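One way to guarantee the grid starts at the data's own first timestamp is to build the expected 15-minute index explicitly and reindex onto it. A sketch, with a stand-in frame in place of the full humidity data:

```python
import pandas as pd

# Stand-in for the humidity frame above, already datetime-indexed.
humdt = pd.DataFrame(
    {"humidite": [87, 87, 82]},
    index=pd.to_datetime(["2019-07-09 08:30:00", "2019-07-09 11:00:00",
                          "2019-07-09 17:30:00"]),
)

# Build the exact 15-minute grid between the data's own first and last
# timestamps, then reindex onto it; the start cannot drift this way.
grid = pd.date_range(humdt.index.min(), humdt.index.max(), freq="15min")
out = humdt.reindex(grid)
```

Existing timestamps keep their values; every other slot on the grid is NaN, ready for interpolation or filling.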

Print rows from output1 based on output2 values

df1
slot Time Location User
56 2017-10-26 22:15:00 89 1
2 2017-10-27 00:30:00 54 1
20 2017-10-28 05:00:00 64 1
24 2017-10-29 06:00:00 2 1
91 2017-11-01 22:45:00 78 1
62 2017-11-02 15:30:00 99 1
91 2017-11-02 22:45:00 34 1
47 2017-10-26 20:15:00 465 2
1 2017-10-27 00:10:00 67 2
20 2017-10-28 05:00:00 5746 2
28 2017-10-29 07:00:00 36 2
91 2017-11-01 22:45:00 786 2
58 2017-11-02 14:30:00 477 2
95 2017-11-02 23:45:00 7322 2
df2
slot
2
91
62
58
I need the output df3 as
slot Time Location User
2 2017-10-27 00:30:00 54 1
91 2017-11-01 22:45:00 78 1
91 2017-11-02 22:45:00 34 1
91 2017-11-01 22:45:00 786 2
62 2017-11-02 15:30:00 99 1
58 2017-11-02 14:30:00 477 2
If those were CSV files we could join them on the command line:
join file1 file2 > file3
But how can we do the same for the outputs in a Jupyter notebook?
Try isin:
df1[df1.slot.isin(df2.slot)]
Output:
slot Time Location User
1 2 2017-10-27 00:30:00 54 1
4 91 2017-11-01 22:45:00 78 1
5 62 2017-11-02 15:30:00 99 1
6 91 2017-11-02 22:45:00 34 1
11 91 2017-11-01 22:45:00 786 2
12 58 2017-11-02 14:30:00 477 2
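If you actually want a join-like operation (for instance when df2 carries extra columns to bring along), pandas merge gives the same filtering. A sketch with hypothetical miniature frames:

```python
import pandas as pd

# Hypothetical miniature versions of df1/df2, just to show the shape.
df1 = pd.DataFrame({"slot": [56, 2, 91], "Location": [89, 54, 78]})
df2 = pd.DataFrame({"slot": [2, 91]})

# An inner merge keeps only rows whose slot appears in df2 -- the same
# filter as isin, but it also works when df2 carries extra columns.
df3 = df1.merge(df2, on="slot", how="inner")
```

With only a key column in df2 the result matches the isin filter, preserving df1's row order.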

How to create a transition matrix for a column in python?

How do I convert column B into a transition matrix in Python?
The size of the matrix is 19, which is the number of unique values in column B.
There are a total of 432 rows in the dataset.
time A B
2017-10-26 09:00:00 36 816
2017-10-26 10:45:00 43 816
2017-10-26 12:30:00 50 998
2017-10-26 12:45:00 51 750
2017-10-26 13:00:00 52 998
2017-10-26 13:15:00 53 998
2017-10-26 13:30:00 54 998
2017-10-26 14:00:00 56 998
2017-10-26 14:15:00 57 834
2017-10-26 14:30:00 58 1285
2017-10-26 14:45:00 59 1288
2017-10-26 23:45:00 95 1285
2017-10-27 03:00:00 12 1285
2017-10-27 03:30:00 14 1285
...
2017-11-02 14:00:00 56 998
2017-11-02 14:15:00 57 998
2017-11-02 14:30:00 58 998
2017-11-02 14:45:00 59 998
2017-11-02 15:00:00 60 816
2017-11-02 15:15:00 61 275
2017-11-02 15:30:00 62 225
2017-11-02 15:45:00 63 1288
2017-11-02 16:00:00 64 1088
2017-11-02 18:15:00 73 1285
2017-11-02 20:30:00 82 1285
2017-11-02 21:00:00 84 1088
2017-11-02 21:15:00 85 1088
2017-11-02 21:30:00 86 1088
2017-11-02 22:00:00 88 1088
2017-11-02 22:30:00 90 1088
2017-11-02 23:00:00 92 1088
2017-11-02 23:30:00 94 1088
2017-11-02 23:45:00 95 1088
The matrix should contain the number of transitions between values; a partial sketch (rows and columns are the values of B):

B      ...   1088   1288   ...
...
1088           8      2
...
I used your data to create a DataFrame with only column B, but it should also work with all columns.
text = '''time A B
2017-10-26 09:00:00 36 816
2017-10-26 10:45:00 43 816
2017-10-26 12:30:00 50 998
2017-10-26 12:45:00 51 750
2017-10-26 13:00:00 52 998
2017-10-26 13:15:00 53 998
2017-10-26 13:30:00 54 998
2017-10-26 14:00:00 56 998
2017-10-26 14:15:00 57 834
2017-10-26 14:30:00 58 1285
2017-10-26 14:45:00 59 1288
2017-10-26 23:45:00 95 1285
2017-10-27 03:00:00 12 1285
2017-10-27 03:30:00 14 1285
2017-11-02 14:00:00 56 998
2017-11-02 14:15:00 57 998
2017-11-02 14:30:00 58 998
2017-11-02 14:45:00 59 998
2017-11-02 15:00:00 60 816
2017-11-02 15:15:00 61 275
2017-11-02 15:30:00 62 225
2017-11-02 15:45:00 63 1288
2017-11-02 16:00:00 64 1088
2017-11-02 18:15:00 73 1285
2017-11-02 20:30:00 82 1285
2017-11-02 21:00:00 84 1088
2017-11-02 21:15:00 85 1088
2017-11-02 21:30:00 86 1088
2017-11-02 22:00:00 88 1088
2017-11-02 22:30:00 90 1088
2017-11-02 23:00:00 92 1088
2017-11-02 23:30:00 94 1088
2017-11-02 23:45:00 95 1088'''
import pandas as pd
B = [int(row.split()[-1]) for row in text.split('\n') if 'B' not in row]  # last field is column B
df = pd.DataFrame({'B': B})
I get the unique values in the column, to use later when creating the matrix:
numbers = sorted(df['B'].unique())
print(numbers)
[225, 275, 750, 816, 834, 998, 1088, 1285, 1288]
I create a shifted column C so I have both values in every row:
df['C'] = df['B'].shift(-1)
print(df)
B C
0 816 816.0
1 816 998.0
2 998 750.0
3 750 998.0
I group by ['B', 'C'] so I can count pairs
groups = df.groupby(['B', 'C'])
counts = {i[0]:(len(i[1]) if i[0][0] != i[0][1] else 0) for i in groups} # don't count (816,816)
# counts = {i[0]:len(i[1]) for i in groups} # count even (816,816)
print(counts)
{(225, 1288.0): 2, (275, 225.0): 2, (750, 998.0): 2, (816, 275.0): 2, (816, 816.0): 2, (816, 998.0): 2, (834, 1285.0): 2, (998, 750.0): 2, (998, 816.0): 2, (998, 834.0): 2, (998, 998.0): 12, (1088, 1088.0): 14, (1088, 1285.0): 2, (1285, 998.0): 2, (1285, 1088.0): 2, (1285, 1285.0): 6, (1285, 1288.0): 2, (1288, 1088.0): 2, (1288, 1285.0): 2}
Now I can create matrix. Using numbers and counts I create column/Series (with correct index) and I add it to matrix.
matrix = pd.DataFrame()
for x in numbers:
    matrix[x] = pd.Series([counts.get((x, y), 0) for y in numbers], index=numbers)
print(matrix)
Result
225 275 750 816 834 998 1088 1285 1288
225 0 2 0 0 0 0 0 0 0
275 0 0 0 2 0 0 0 0 0
750 0 0 0 0 0 2 0 0 0
816 0 0 0 2 0 2 0 0 0
834 0 0 0 0 0 2 0 0 0
998 0 0 2 2 0 12 0 2 0
1088 0 0 0 0 0 0 14 2 2
1285 0 0 0 0 2 0 2 6 2
1288 2 0 0 0 0 0 0 2 0
Full example
text = '''time A B
2017-10-26 09:00:00 36 816
2017-10-26 10:45:00 43 816
2017-10-26 12:30:00 50 998
2017-10-26 12:45:00 51 750
2017-10-26 13:00:00 52 998
2017-10-26 13:15:00 53 998
2017-10-26 13:30:00 54 998
2017-10-26 14:00:00 56 998
2017-10-26 14:15:00 57 834
2017-10-26 14:30:00 58 1285
2017-10-26 14:45:00 59 1288
2017-10-26 23:45:00 95 1285
2017-10-27 03:00:00 12 1285
2017-10-27 03:30:00 14 1285
2017-11-02 14:00:00 56 998
2017-11-02 14:15:00 57 998
2017-11-02 14:30:00 58 998
2017-11-02 14:45:00 59 998
2017-11-02 15:00:00 60 816
2017-11-02 15:15:00 61 275
2017-11-02 15:30:00 62 225
2017-11-02 15:45:00 63 1288
2017-11-02 16:00:00 64 1088
2017-11-02 18:15:00 73 1285
2017-11-02 20:30:00 82 1285
2017-11-02 21:00:00 84 1088
2017-11-02 21:15:00 85 1088
2017-11-02 21:30:00 86 1088
2017-11-02 22:00:00 88 1088
2017-11-02 22:30:00 90 1088
2017-11-02 23:00:00 92 1088
2017-11-02 23:30:00 94 1088
2017-11-02 23:45:00 95 1088'''
import pandas as pd
B = [int(row.split()[-1]) for row in text.split('\n') if 'B' not in row]  # last field is column B
df = pd.DataFrame({'B': B})
numbers = sorted(df['B'].unique())
print(numbers)
df['C'] = df['B'].shift(-1)
print(df)
groups = df.groupby(['B', 'C'])
counts = {i[0]: (len(i[1]) if i[0][0] != i[0][1] else 0) for i in groups}  # don't count (816,816)
# counts = {i[0]: len(i[1]) for i in groups}  # count even (816,816)
print(counts)
matrix = pd.DataFrame()
for x in numbers:
    matrix[str(x)] = pd.Series([counts.get((x, y), 0) for y in numbers], index=numbers)
print(matrix)
EDIT:
counts = {i[0]:(len(i[1]) if i[0][0] != i[0][1] else 0) for i in groups} # don't count (816,816)
written as a normal for loop:
counts = {}
for pair, group in groups:
    if pair[0] != pair[1]:  # don't count (816,816)
        counts[pair] = len(group)
    else:
        counts[pair] = 0
Inverting the value when it is bigger than 10:
counts = {}
for pair, group in groups:
    if pair[0] != pair[1]:  # don't count (816,816)
        count = len(group)
        if count > 10:
            counts[pair] = -count
        else:
            counts[pair] = count
    else:
        counts[pair] = 0
EDIT:
counts = {}
for pair, group in groups:
    if pair[0] != pair[1]:  # don't count (816,816)
        # counts[(A,B)] = len((A,B)) + len((B,A))
        if pair not in counts:
            counts[pair] = len(group)  # put first value
        else:
            counts[pair] += len(group)  # add second value
        # counts[(B,A)] = len((A,B)) + len((B,A))
        if (pair[1], pair[0]) not in counts:
            counts[(pair[1], pair[0])] = len(group)  # put first value
        else:
            counts[(pair[1], pair[0])] += len(group)  # add second value
    else:
        counts[pair] = 0  # (816,816) gives 0

# counts[(A,B)] == counts[(B,A)]
counts_2 = {}
for pair, count in counts.items():
    if count > 10:
        counts_2[pair] = -count
    else:
        counts_2[pair] = count
matrix = pd.DataFrame()
for x in numbers:
    matrix[str(x)] = pd.Series([counts_2.get((x, y), 0) for y in numbers], index=numbers)
print(matrix)
An alternative, pandas-based approach. Note I've used shift(1), so column C holds the previous value and each row pairs a value with its predecessor:
text = '''time A B
2017-10-26 09:00:00 36 816
2017-10-26 10:45:00 43 816
2017-10-26 12:30:00 50 998
2017-10-26 12:45:00 51 750
2017-10-26 13:00:00 52 998
2017-10-26 13:15:00 53 998
2017-10-26 13:30:00 54 998
2017-10-26 14:00:00 56 998
2017-10-26 14:15:00 57 834
2017-10-26 14:30:00 58 1285
2017-10-26 14:45:00 59 1288
2017-10-26 23:45:00 95 1285
2017-10-27 03:00:00 12 1285
2017-10-27 03:30:00 14 1285
2017-11-02 14:00:00 56 998
2017-11-02 14:15:00 57 998
2017-11-02 14:30:00 58 998
2017-11-02 14:45:00 59 998
2017-11-02 15:00:00 60 816
2017-11-02 15:15:00 61 275
2017-11-02 15:30:00 62 225
2017-11-02 15:45:00 63 1288
2017-11-02 16:00:00 64 1088
2017-11-02 18:15:00 73 1285
2017-11-02 20:30:00 82 1285
2017-11-02 21:00:00 84 1088
2017-11-02 21:15:00 85 1088
2017-11-02 21:30:00 86 1088
2017-11-02 22:00:00 88 1088
2017-11-02 22:30:00 90 1088
2017-11-02 23:00:00 92 1088
2017-11-02 23:30:00 94 1088
2017-11-02 23:45:00 95 1088'''
import pandas as pd
B = [int(row.split()[-1]) for row in text.split('\n') if 'B' not in row]  # last field is column B
df = pd.DataFrame({'B': B})
# alternative approach
df['C'] = df['B'].shift(1)  # C holds the previous value
df['counts'] = 1  # add an arbitrary counts column for the groupby
# group the combinations together, then unstack to get the matrix
trans_matrix = df.groupby(['B', 'C']).count().unstack()
# make the columns a bit neater
trans_matrix.columns = trans_matrix.columns.droplevel()
The result is a count matrix, which I think is correct: i.e. the one time you observe 225, it then transitions to 1288. You would just divide through by the sample size to get a probability transition matrix for each value.
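For completeness, pd.crosstab can build the same kind of count matrix in one call; a sketch on a short stand-in series:

```python
import pandas as pd

# Short stand-in series of states.
b = pd.Series([816, 816, 998, 750, 998, 998])

# Cross-tabulate each value against its successor: rows are "from",
# columns are "to", entries are transition counts.
trans = pd.crosstab(b.iloc[:-1].values, b.iloc[1:].values,
                    rownames=["from"], colnames=["to"])
```

Unlike the groupby route, missing (from, to) combinations simply don't appear; reindex rows and columns with the full set of values if you need a square matrix.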

Group data by time of the day

I have a dataframe with a datetime index; df.head(6):
NUMBERES PRICE
DEAL_TIME
2015-03-02 12:40:03 5 25
2015-03-04 14:52:57 7 23
2015-03-03 08:10:09 10 43
2015-03-02 20:18:24 5 37
2015-03-05 07:50:55 4 61
2015-03-02 09:08:17 1 17
The dataframe includes one week of data. Now I need to count records per time period of the day. If the time period is 1 hour, I know the following method would work:
df_grouped = df.groupby(df.index.hour).count()
But I don't know what to do when the time period is half an hour. How can I do it?
UPDATE:
I was told that this question is similar to "How to group DataFrame by a period of time?", but I had already tried the methods mentioned there. Maybe it's my fault that I didn't say it clearly. 'DEAL_TIME' ranges from '2015-03-02 00:00:00' to '2015-03-08 23:59:59'. If I use pd.TimeGrouper(freq='30Min') or resample(), the time periods range from '2015-03-02 00:30' to '2015-03-08 23:30'. But what I want is a series like below:
COUNT
DEAL_TIME
00:00:00 53
00:30:00 49
01:00:00 31
01:30:00 22
02:00:00 1
02:30:00 24
03:00:00 27
03:30:00 41
04:00:00 41
04:30:00 76
05:00:00 33
05:30:00 16
06:00:00 15
06:30:00 4
07:00:00 60
07:30:00 85
08:00:00 3
08:30:00 37
09:00:00 18
09:30:00 29
10:00:00 31
10:30:00 67
11:00:00 35
11:30:00 60
12:00:00 95
12:30:00 37
13:00:00 30
13:30:00 62
14:00:00 58
14:30:00 44
15:00:00 45
15:30:00 35
16:00:00 94
16:30:00 56
17:00:00 64
17:30:00 43
18:00:00 60
18:30:00 52
19:00:00 14
19:30:00 9
20:00:00 31
20:30:00 71
21:00:00 21
21:30:00 32
22:00:00 61
22:30:00 35
23:00:00 14
23:30:00 21
In other words, the time period should be irrelevant to the date.
You need a 30-minute time grouper for this:
grouper = pd.TimeGrouper(freq="30T")
You also need to remove the 'date' part from the index:
df.index = df.index - df.index.normalize()
Now, you can group by time alone:
df.groupby(grouper).count()
The somewhat obscure TimeGrouper behaviour is covered by the pandas resample documentation (both features use the same rules).
In pandas, the most common way to group by time is to use the .resample() function. In v0.18.0 this function is two-stage: df.resample('M') creates an object to which we can apply other functions (mean, count, sum, etc.). The code snippet will be like:
df.resample('M').count()
See the pandas resample documentation for an example.
