set_codes in multiIndexed pandas series - python

I want to multiIndex an array of data.
Initially, I was indexing my data with datetime, but for some later applications, I had to add another numeric index (that goes from 0 to len(array)-1).
I have written these few lines:
import datetime
import pandas as pd

O = [0.701733664614, 0.699495411782, 0.572129320819, 0.613315597684, 0.58079660603, 0.596638918579, 0.48453382119]
Ab = [datetime.datetime(2018, 12, 11, 14, 0), datetime.datetime(2018, 12, 21, 10, 0), datetime.datetime(2018, 12, 21, 14, 0), datetime.datetime(2019, 1, 1, 10, 0), datetime.datetime(2019, 1, 1, 14, 0), datetime.datetime(2019, 1, 11, 10, 0), datetime.datetime(2019, 1, 11, 14, 0)]
tst = pd.Series(O,index=Ab)
ld = len(tst)
index = pd.MultiIndex.from_product([(x for x in range(0,ld)),Ab], names=['id','dtime'])
print (index)
data = pd.Series(O,index=index)
But when printing the index, I get some bizarre ``codes``:
The levels and names are perfect, but the codes go from 0 to 763... 764 times (instead of once)!
I tried to add the set_codes command:
index.set_codes([x for x in range(0,ld)], level=0)
print (index)
In vain; I get the following error:
ValueError: Unequal code lengths: [764, 583696]
the initial pandas series:
print (tst)
2005-01-01 14:00:00 0.544177
2005-01-01 14:00:00 0.544177
2005-01-21 14:00:00 0.602239
...
2019-05-21 10:00:00 0.446813
2019-05-21 14:00:00 0.466573
Length: 764, dtype: float64
the new expected one
id dtime
0 2005-01-01 14:00:00 0.544177
1 2005-01-01 14:00:00 0.544177
2 2005-01-21 14:00:00 0.602239
...
762 2019-05-21 10:00:00 0.446813
763 2019-05-21 14:00:00 0.466573
Thanks in advance

You can create a new index with MultiIndex.from_arrays and reassign it to the Series:
s.index = pd.MultiIndex.from_arrays([np.arange(len(s)), s.index], names=['id','dtime'])
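A runnable sketch with a shortened version of the question's data (three values instead of 764), showing that `from_arrays` pairs each position with its timestamp rather than taking the full product the way `from_product` does:

```python
import datetime
import numpy as np
import pandas as pd

O = [0.701733664614, 0.699495411782, 0.572129320819]
Ab = [datetime.datetime(2018, 12, 11, 14, 0),
      datetime.datetime(2018, 12, 21, 10, 0),
      datetime.datetime(2018, 12, 21, 14, 0)]

s = pd.Series(O, index=Ab)
# Pair position i with timestamp i (one code per row, not len(s)**2 rows)
s.index = pd.MultiIndex.from_arrays([np.arange(len(s)), s.index],
                                    names=['id', 'dtime'])
print(s)
```

Note that `from_product` builds the Cartesian product of its inputs, which is why the question saw 764 × 764 codes; `from_arrays` zips two equal-length arrays instead.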

Related

Jupyter Labs: Kernel Dies when Converting Tuple to Pandas Data Frame

I have the following tuple that I'm trying to create a data frame out of:
testing =
([datetime.datetime(2020, 2, 5, 0, 0),
datetime.datetime(2020, 2, 5, 2, 40),
datetime.datetime(2020, 2, 5, 5, 20),
datetime.datetime(2020, 2, 5, 8, 0),
datetime.datetime(2020, 2, 5, 10, 40),
datetime.datetime(2020, 2, 5, 13, 20),
datetime.datetime(2020, 2, 5, 16, 0),
datetime.datetime(2020, 2, 5, 18, 40),
datetime.datetime(2020, 2, 5, 21, 20),
datetime.datetime(2020, 2, 6, 0, 0)],
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
I use this snippet to create a data frame:
df_testing = pd.DataFrame(testing)
df_testing.head()
However, this causes the kernel to die every time. If I only look at one item (e.g. I do df_testing = pd.DataFrame(testing[0])), the code runs fine.
I'm not super familiar with using tuples so is there some type of property that inhibits them from being turned into a data frame?
NOTE:
There is a lot of code that generates this "testing" variable; it's just a portion of the overall data I would like to eventually convert. I filled in some dummy data for the example here. I would prefer not to modify the data type of this variable if at all possible.
Also I'm running Python 3.7 in case that matters.
EDIT:
Here is a screenshot of my attempt to run the test code I put in.
I just ran your exact code (pay attention: you wrote different variable names, test vs testing).
After changing the variable names it worked just fine:
I guess the problem is with your Jupyter Labs installation.
I would use:
new_df = pd.Series(dict(zip(*test))).to_frame('name_column')
print(new_df)
or
new_df = pd.DataFrame({'name_column':dict(zip(*test))})
print(new_df)
Output
name_column
2020-02-05 00:00:00 1
2020-02-05 02:40:00 2
2020-02-05 05:20:00 3
2020-02-05 08:00:00 4
2020-02-05 10:40:00 5
2020-02-05 13:20:00 6
2020-02-05 16:00:00 7
2020-02-05 18:40:00 8
2020-02-05 21:20:00 9
2020-02-06 00:00:00 10
You could use DataFrame.reset_index if you want to convert the index into a column.
Another option is DataFrame.transpose
new_df = pd.DataFrame(test,index=['Date','values']).T
print(new_df)
Date values
0 2020-02-05 00:00:00 1
1 2020-02-05 02:40:00 2
2 2020-02-05 05:20:00 3
3 2020-02-05 08:00:00 4
4 2020-02-05 10:40:00 5
5 2020-02-05 13:20:00 6
6 2020-02-05 16:00:00 7
7 2020-02-05 18:40:00 8
8 2020-02-05 21:20:00 9
9 2020-02-06 00:00:00 10
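Both options from the answer, collected into one self-contained sketch with a two-element version of the tuple ('name_column' is just the placeholder name from the answer):

```python
import datetime
import pandas as pd

testing = ([datetime.datetime(2020, 2, 5, 0, 0),
            datetime.datetime(2020, 2, 5, 2, 40)],
           [1, 2])

# Option 1: zip the two tuple members into (date, value) pairs,
# so the dates become the index and the numbers the single column
new_df = pd.Series(dict(zip(*testing))).to_frame('name_column')

# Option 2: keep both members as labelled rows, then transpose
df_t = pd.DataFrame(testing, index=['Date', 'values']).T

print(new_df)
print(df_t)
```

Option 1 yields a datetime index; option 2 keeps the dates as an ordinary 'Date' column, which is what reset_index would give you after option 1.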

interpolate pandas frame using time index from another data frame

So, I have 2 data frames where the first one has the following structure:
'ds', '1_sensor_id', '1_val_1', '1_val_2'
0 2019-09-13 12:40:00 33469 30 43
1 2019-09-13 12:45:00 33469 43 43
The second one has the following structure:
'ds', '2_sensor_id', '2_val_1', '2_val_2'
0 2019-09-13 12:42:00 20006 6 50
1 2019-09-13 12:47:00 20006 5 80
So what I want to do is merge the two pandas frames through interpolation. Ultimately, the merged frame should have values defined at the time stamps (ds) defined in frame 1, the 2_val_1 and 2_val_2 columns would be interpolated, and the merged frame would have a row for each value in the ds column of frame 1. What would be the best way to do this in pandas? I tried the merge_asof function, but this does nearest-neighbour matching and I did not get all the time stamps back.
You can concatenate one frame with the other and use interpolate(), for example:
import datetime
import pandas as pd
df1 = pd.DataFrame(columns=['ds', '1_sensor_id', '1_val_1', '1_val_2'],
data=[[datetime.datetime(2019, 9, 13, 12, 40, 00), 33469, 30, 43],
[datetime.datetime(2019, 9, 13, 12, 45, 00), 33469, 33, 43]])
df2 = pd.DataFrame(columns=['ds', '2_sensor_id', '2_val_1', '2_val_2'],
data=[[datetime.datetime(2019, 9, 13, 12, 42, 00), 20006, 6, 50],
[datetime.datetime(2019, 9, 13, 12, 47, 00), 20006, 5, 80]])
df = pd.concat([df1, df2], sort=False)
df.set_index('ds', inplace=True)
df.interpolate(method = 'time', limit_direction='backward', inplace=True)
print(df)
1_sensor_id 1_val_1 ... 2_val_1 2_val_2
ds ...
2019-09-13 12:40:00 33469.0 30.0 ... 6.0 50.0
2019-09-13 12:45:00 33469.0 33.0 ... 5.4 68.0
2019-09-13 12:42:00 NaN NaN ... 6.0 50.0
2019-09-13 12:47:00 NaN NaN ... 5.0 80.0
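In the output above, df2's rows sit after df1's rather than interleaved by time. A sketch (assuming pandas ≥ 1.x, where `pd.concat` replaces the removed `DataFrame.append`) that sorts the index first, so time interpolation fills between the interleaved timestamps; the value at 12:45 comes out as 5.4, matching the answer's output:

```python
import datetime
import pandas as pd

df1 = pd.DataFrame({'ds': [datetime.datetime(2019, 9, 13, 12, 40),
                           datetime.datetime(2019, 9, 13, 12, 45)],
                    '1_val_1': [30, 33]})
df2 = pd.DataFrame({'ds': [datetime.datetime(2019, 9, 13, 12, 42),
                           datetime.datetime(2019, 9, 13, 12, 47)],
                    '2_val_1': [6, 5]})

# Concatenate, index by timestamp, sort, then interpolate on the time axis;
# limit_direction='both' also fills the leading/trailing NaNs with edge values
df = (pd.concat([df1, df2], sort=False)
        .set_index('ds')
        .sort_index()
        .interpolate(method='time', limit_direction='both'))
print(df)
```

For example, 2_val_1 at 12:45 lies three fifths of the way from 6 (at 12:42) toward 5 (at 12:47), giving 5.4.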

Filter Dataframe with a list of time ranges

below is a simplified version of my setup:
import pandas as pd
import datetime as dt
df_data = pd.DataFrame({'DateTime' : [dt.datetime(2017, 9, 1, 0, 0, 0),dt.datetime(2017, 9, 1, 1, 0, 0),dt.datetime(2017, 9, 1, 2, 0, 0),dt.datetime(2017, 9, 1, 3, 0, 0)], 'Data' : [1,2,3,5]})
df_timeRanges = pd.DataFrame({'startTime':[dt.datetime(2017, 8, 30, 0, 0, 0), dt.datetime(2017, 9, 1, 1, 30, 0)], 'endTime':[dt.datetime(2017, 9, 1, 0, 30, 0), dt.datetime(2017, 9, 1, 2, 30, 0)]})
print df_data
print df_timeRanges
This gives:
Data DateTime
0 1 2017-09-01 00:00:00
1 2 2017-09-01 01:00:00
2 3 2017-09-01 02:00:00
3 5 2017-09-01 03:00:00
endTime startTime
0 2017-09-01 00:30:00 2017-08-30 00:00:00
1 2017-09-01 02:30:00 2017-09-01 01:30:00
I would like to filter df_data with df_timeRanges, with the remaining rows in a single dataframe, kind of like:
df_data_filt = df_data[(df_data['DateTime'] >= df_timeRanges['startTime']) & (df_data['DateTime'] <= df_timeRanges['endTime'])]
I did not expect the above line to work, and it returned this error:
ValueError: Can only compare identically-labeled Series objects
Would anyone be able to provide some tips on this? The df_data and df_timeRanges in my real task are much bigger.
Thanks in advance
IIUC, use
In [794]: mask = np.logical_or.reduce([
(df_data.DateTime >= x.startTime) & (df_data.DateTime <= x.endTime)
for i, x in df_timeRanges.iterrows()])
In [795]: df_data[mask]
Out[795]:
Data DateTime
0 1 2017-09-01 00:00:00
2 3 2017-09-01 02:00:00
Or, also
In [807]: func = lambda x: (df_data.DateTime >= x.startTime) & (df_data.DateTime <= x.endTime)
In [808]: df_data[df_timeRanges.apply(func, axis=1).any()]
Out[808]:
Data DateTime
0 1 2017-09-01 00:00:00
2 3 2017-09-01 02:00:00
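A self-contained version of the mask approach, using `itertuples` (which is faster than `iterrows` since it avoids building a Series per row):

```python
import datetime as dt
import numpy as np
import pandas as pd

df_data = pd.DataFrame({'DateTime': [dt.datetime(2017, 9, 1, h) for h in range(4)],
                        'Data': [1, 2, 3, 5]})
df_timeRanges = pd.DataFrame(
    {'startTime': [dt.datetime(2017, 8, 30), dt.datetime(2017, 9, 1, 1, 30)],
     'endTime': [dt.datetime(2017, 9, 1, 0, 30), dt.datetime(2017, 9, 1, 2, 30)]})

# Build one boolean mask per time range, then OR them all together
mask = np.logical_or.reduce([
    (df_data.DateTime >= r.startTime) & (df_data.DateTime <= r.endTime)
    for r in df_timeRanges.itertuples()])
df_filt = df_data[mask]
print(df_filt)
```

The error in the question arises because comparing two Series aligns them by index label; building one mask per range against the whole DateTime column sidesteps that.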

calculate time difference pandas dataframe

I have a pandas dataframe where index is as follows :
Index([16/May/2013:23:56:43, 16/May/2013:23:56:42, 16/May/2013:23:56:43, ..., 17/May/2013:23:54:45, 17/May/2013:23:54:45, 17/May/2013:23:54:45], dtype=object)
I have calculated time difference in consequent occurrences in the following method.
df2['tvalue'] = df2.index
df2['tvalue'] = np.datetime64(df2['tvalue'])
df2['delta'] = (df2['tvalue']-df2['tvalue'].shift()).fillna(0)
So I got following output
Time tvalue delta
16/May/2013:23:56:43 2013-05-01 13:23:56 00:00:00
16/May/2013:23:56:42 2013-05-01 13:23:56 00:00:00
16/May/2013:23:56:43 2013-05-01 13:23:56 00:00:00
16/May/2013:23:56:43 2013-05-01 13:23:56 00:00:00
16/May/2013:23:56:48 2013-05-01 13:23:56 00:00:00
16/May/2013:23:56:48 2013-05-01 13:23:56 00:00:00
16/May/2013:23:56:48 2013-05-01 13:23:56 00:00:00
16/May/2013:23:57:44 2013-05-01 13:23:57 00:00:01
16/May/2013:23:57:44 2013-05-01 13:23:57 00:00:00
16/May/2013:23:57:44 2013-05-01 13:23:57 00:00:00
But it has calculated the time difference taking the year as hours, and the date is also different. What can be the problem here?
Parsing your date was non-trivial; I think strptime could probably do it, but it didn't work for me. In your example above, your times are just strings, not datetimes.
In [140]: from dateutil import parser
In [130]: def parse(x):
.....: date, hh, mm, ss = x.split(':')
.....: dd, mo, yyyy = date.split('/')
.....: return parser.parse("%s %s %s %s:%s:%s" % (yyyy,mo,dd,hh,mm,ss))
.....:
In [131]: map(parse,idx)
Out[131]:
[datetime.datetime(2013, 5, 16, 23, 56, 43),
datetime.datetime(2013, 5, 16, 23, 56, 42),
datetime.datetime(2013, 5, 16, 23, 56, 43),
datetime.datetime(2013, 5, 17, 23, 54, 45),
datetime.datetime(2013, 5, 17, 23, 54, 45),
datetime.datetime(2013, 5, 17, 23, 54, 45)]
In [132]: pd.to_datetime(map(parse,idx))
Out[132]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2013-05-16 23:56:43, ..., 2013-05-17 23:54:45]
Length: 6, Freq: None, Timezone: None
In [133]: df = DataFrame(dict(time = pd.to_datetime(map(parse,idx))))
In [134]: df
Out[134]:
time
0 2013-05-16 23:56:43
1 2013-05-16 23:56:42
2 2013-05-16 23:56:43
3 2013-05-17 23:54:45
4 2013-05-17 23:54:45
5 2013-05-17 23:54:45
In [138]: df['delta'] = (df['time']-df['time'].shift()).fillna(0)
In [139]: df
Out[139]:
time delta
0 2013-05-16 23:56:43 00:00:00
1 2013-05-16 23:56:42 -00:00:01
2 2013-05-16 23:56:43 00:00:01
3 2013-05-17 23:54:45 23:58:02
4 2013-05-17 23:54:45 00:00:00
5 2013-05-17 23:54:45 00:00:00
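With current pandas, the whole index can be parsed in one vectorised call by giving `pd.to_datetime` an explicit format string (a sketch, assuming the `dd/Mon/yyyy:HH:MM:SS` layout shown in the question):

```python
import pandas as pd

idx = ['16/May/2013:23:56:43', '16/May/2013:23:56:42', '17/May/2013:23:54:45']

# %d/%b/%Y matches '16/May/2013'; the colon before %H matches the
# separator between date and time in the log-style strings
times = pd.to_datetime(idx, format='%d/%b/%Y:%H:%M:%S')

df = pd.DataFrame({'time': times})
df['delta'] = df['time'].diff().fillna(pd.Timedelta(0))
print(df)
```

This avoids the hand-rolled split-and-reassemble parse function entirely, and the deltas come out as proper Timedelta values.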

Python iterate through month, use month in between Query

I have the following model:
Deal(models.Model):
start_date = models.DateTimeField()
end_date = models.DateTimeField()
I want to iterate through a given year
year = '2010'
For each month in year I want to execute a query to see if the month is between start_date and end_date.
How can I iterate through a given year? Use the month to do a query?
SELECT * FROM deals WHERE month BETWEEN start_date AND end_date
The outcome will tell me if I had a deal in January 2010 and/or in February 2010, etc.
How can I iterate through a given year?
You could use python-dateutil's rrule. Install with command pip install python-dateutil.
Example usage:
In [1]: from datetime import datetime
In [2]: from dateutil import rrule
In [3]: list(rrule.rrule(rrule.MONTHLY, dtstart=datetime(2010,01,01,00,01), count=12))
Out[3]:
[datetime.datetime(2010, 1, 1, 0, 1),
datetime.datetime(2010, 2, 1, 0, 1),
datetime.datetime(2010, 3, 1, 0, 1),
datetime.datetime(2010, 4, 1, 0, 1),
datetime.datetime(2010, 5, 1, 0, 1),
datetime.datetime(2010, 6, 1, 0, 1),
datetime.datetime(2010, 7, 1, 0, 1),
datetime.datetime(2010, 8, 1, 0, 1),
datetime.datetime(2010, 9, 1, 0, 1),
datetime.datetime(2010, 10, 1, 0, 1),
datetime.datetime(2010, 11, 1, 0, 1),
datetime.datetime(2010, 12, 1, 0, 1)]
Use the month to do a query?
You could iterate over months like this:
In [1]: from dateutil import rrule
In [2]: from datetime import datetime
In [3]: months = list(rrule.rrule(rrule.MONTHLY, dtstart=datetime(2010,01,01,00,01), count=13))
In [4]: i = 0
In [5]: while i < len(months) - 1:
...: print "start_date", months[i], "end_date", months[i+1]
...: i += 1
...:
start_date 2010-01-01 00:01:00 end_date 2010-02-01 00:01:00
start_date 2010-02-01 00:01:00 end_date 2010-03-01 00:01:00
start_date 2010-03-01 00:01:00 end_date 2010-04-01 00:01:00
start_date 2010-04-01 00:01:00 end_date 2010-05-01 00:01:00
start_date 2010-05-01 00:01:00 end_date 2010-06-01 00:01:00
start_date 2010-06-01 00:01:00 end_date 2010-07-01 00:01:00
start_date 2010-07-01 00:01:00 end_date 2010-08-01 00:01:00
start_date 2010-08-01 00:01:00 end_date 2010-09-01 00:01:00
start_date 2010-09-01 00:01:00 end_date 2010-10-01 00:01:00
start_date 2010-10-01 00:01:00 end_date 2010-11-01 00:01:00
start_date 2010-11-01 00:01:00 end_date 2010-12-01 00:01:00
start_date 2010-12-01 00:01:00 end_date 2011-01-01 00:01:00
Replace the "print" statement with a query. Feel free to adapt it to your needs.
There is probably a better way but that could do the job.
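The session above is Python 2 (the leading-zero literals like `01` are a syntax error in Python 3). A dependency-free Python 3 sketch of the same month-pair loop, using `zip` over consecutive boundaries:

```python
from datetime import datetime

year = 2010
# 13 month boundaries: the first of each month in `year`, plus Jan 1 of the next year
months = [datetime(year + (m - 1) // 12, (m - 1) % 12 + 1, 1) for m in range(1, 14)]

for start, end in zip(months, months[1:]):
    print("start_date", start, "end_date", end)
```

Each (start, end) pair can then drive a query; for the model in the question, an overlap condition such as `Deal.objects.filter(start_date__lt=end, end_date__gte=start)` would be a plausible (assumed, untested) Django equivalent of the SQL in the question.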
