Composing a DataFrame with a datetime index - python

I have two lists, one of which contains datetimes. How can I combine them into a DataFrame that uses the datetimes as its index and the values of lista2 as its data?
lista1 = [datetime.datetime(2017, 11, 11, 0, 0), datetime.datetime(2017, 11, 12, 0, 0), datetime.datetime(2017, 11, 13, 0, 0)]
lista2 = [31488, 14335, 89]

You can use the index parameter from the constructor to specify a list as indices, and use the other one as data:
pd.DataFrame(lista2,index=lista1)
For your sample data, this gives:
>>> pd.DataFrame(lista2,index=lista1)
                0
2017-11-11  31488
2017-11-12  14335
2017-11-13     89

Zip the two lists into a list of tuples, then set the first column as the index:
pd.DataFrame(list(zip(lista1,lista2))).set_index(0)
Out[646]:
                1
0
2017-11-11  31488
2017-11-12  14335
2017-11-13     89
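If a named column is preferred over the default 0, the same idea can be sketched as below. The column name value and the index name date are illustrative choices, not part of the question.

```python
import datetime

import pandas as pd

lista1 = [datetime.datetime(2017, 11, 11, 0, 0),
          datetime.datetime(2017, 11, 12, 0, 0),
          datetime.datetime(2017, 11, 13, 0, 0)]
lista2 = [31488, 14335, 89]

# "value" and "date" are hypothetical names chosen for this sketch.
df = pd.DataFrame({"value": lista2},
                  index=pd.DatetimeIndex(lista1, name="date"))

# With only one column of data, a Series may be all that is needed.
s = pd.Series(lista2, index=lista1, name="value")
```

Either object can then be looked up by timestamp, for example df.loc[pd.Timestamp("2017-11-12"), "value"].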


convert yyyymmdd to serial number python

How do I convert a list of dates that are in the form yyyymmdd to a serial number? For example, if I have this list of dates:
t = [1898-10-12 06:00,1898-10-12 12:00,1932-09-30 08:00,1932-09-30 00:00]
How do I convert each date to a serial number? I'm currently using the datetime toordinal() method, but dates that fall on the same day all get the same serial number. How do I make the same date with different times produce different numbers?
The entries in the list are datetime.datetime objects. I then tried:
thurser = []
for i in range(len(t)):
    thurser.append(t[i].toordinal())
But I am not getting serial numbers as floats.
datetime.toordinal() considers only the 'date' part of the datetime object, not the time. So does date.toordinal() - it only has a date part. The first 2 and last 2 elements in your list have datetimes on the same date but at different times, which .toordinal ignores. So, .toordinal will give you the same value for those same-dated datetimes.
In general, the solution would be to calculate the delta between your dates and a pre-determined/fixed one. I'm using datetime.datetime(1, 1, 1), the earliest possible datetime, so all the deltas are positive:
thurser = []
# assuming t is a list of datetime objects
for d in t:
    delta = d - datetime.datetime(1, 1, 1)
    thurser.append(delta.days + delta.seconds/(24 * 3600))
>>> print(thurser)
[693149.25, 693149.5, 705555.3333333334, 705555.0]
And if you prefer ints instead of floats, then use seconds instead of days:
thurser.append(int(delta.total_seconds())) # total_seconds has microseconds in the float
>>> print(thurser)
[59888095200, 59888116800, 60959980800, 60959952000]
And to get back the original values in the 2nd example:
>>> [datetime.timedelta(seconds=d) + datetime.datetime(1, 1, 1) for d in thurser]
[datetime.datetime(1898, 10, 12, 6, 0), datetime.datetime(1898, 10, 12, 12, 0),
datetime.datetime(1932, 9, 30, 8, 0), datetime.datetime(1932, 9, 30, 0, 0)]
>>> _ == t # compare with original values
True
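If you would rather keep the numbering of toordinal() and only add the time of day as a fraction, a minimal sketch could look like this. The helper name fractional_ordinal is made up for illustration, and it assumes naive datetimes. Its results are one larger than the deltas above, because toordinal() counts 0001-01-01 as day 1 rather than day 0.

```python
import datetime

def fractional_ordinal(d):
    """Return d.toordinal() plus the elapsed fraction of that day."""
    midnight = datetime.datetime.combine(d.date(), datetime.time.min)
    # Dividing one timedelta by another yields a float.
    return d.toordinal() + (d - midnight) / datetime.timedelta(days=1)

t = [datetime.datetime(1898, 10, 12, 6, 0),
     datetime.datetime(1898, 10, 12, 12, 0),
     datetime.datetime(1932, 9, 30, 8, 0),
     datetime.datetime(1932, 9, 30, 0, 0)]
serials = [fractional_ordinal(d) for d in t]
print(serials[:2])  # [693150.25, 693150.5]
```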
Let me know if I have misunderstood. I tried the following, and it gives a distinct number for each value in the list.
I replaced
t = ['1898-10-12 06:00','1898-10-12 12:00','1932-09-30 08:00','1932-09-30 00:00']
with
t = [datetime.datetime(1898, 10, 12, 6, 0), datetime.datetime(1898, 10, 12, 12, 0), datetime.datetime(1932, 9, 30, 8, 0), datetime.datetime(1932, 9, 30, 0, 0)]
because, as mentioned in the comments, it is a list of datetime.datetime objects.
To generate the number, I take the total milliseconds between 1970-01-01 00:00:00 and the given date.
Dates before 1970-01-01 therefore give negative values, but the values are still distinct.
t = [datetime.datetime(1898, 10, 12, 6, 0), datetime.datetime(1898, 10, 12, 12, 0), datetime.datetime(1932, 9, 30, 8, 0), datetime.datetime(1932, 9, 30, 0, 0)]
thurser = []
x = []
for i in range(len(t)):
    thurser.append(t[i].toordinal())
    x.append((t[i]-datetime.datetime.utcfromtimestamp(0)).total_seconds() * 1000.0)
print(thurser)
print(x)
output:
[693150, 693150, 705556, 705556]
[-2247501600000.0, -2247480000000.0, -1175616000000.0, -1175644800000.0]
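datetime.utcfromtimestamp() is deprecated as of Python 3.12, so the same epoch can also be written out explicitly. The following is only a sketch of that variation for naive datetimes; the names EPOCH and millis_since_epoch are hypothetical.

```python
import datetime

# Naive epoch, the same value that utcfromtimestamp(0) returns.
EPOCH = datetime.datetime(1970, 1, 1)

def millis_since_epoch(d):
    """Milliseconds from the epoch to a naive datetime (negative before 1970)."""
    return (d - EPOCH) / datetime.timedelta(milliseconds=1)

t = [datetime.datetime(1898, 10, 12, 6, 0),
     datetime.datetime(1898, 10, 12, 12, 0),
     datetime.datetime(1932, 9, 30, 8, 0),
     datetime.datetime(1932, 9, 30, 0, 0)]
x = [millis_since_epoch(d) for d in t]
print(x[0])  # -2247501600000.0
```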

Calculate Average Number of Days Between Multiple Dates

Let's say I have the following data frame. I want to calculate the average number of days between all the activities for a particular account.
Below is my desired result:
Now I know how to calculate the number of days between two dates with the following code. But I don't know how to calculate what I am looking for across multiple dates.
from datetime import date
d0 = date(2016, 8, 18)
d1 = date(2016, 9, 26)
delta = d0 - d1
print(delta.days)
I would do this as follows in pandas (assuming the Date column is a datetime64):
In [11]: df
Out[11]:
  Account Activity       Date
0       A        a 2015-10-21
1       A        b 2016-07-07
2       A        c 2016-07-07
3       A        d 2016-09-14
4       A        e 2016-10-12
5       B        a 2015-11-24
6       B        b 2015-12-30
In [12]: df.groupby("Account")["Date"].apply(lambda x: x.diff().mean())
Out[12]:
Account
A   89 days 06:00:00
B   36 days 00:00:00
Name: Date, dtype: timedelta64[ns]
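If a plain number of days is easier to work with than a timedelta, the result can be divided by a one-day Timedelta. The sketch below rebuilds the frame shown above so that it runs on its own; the sort step is an added precaution, and the names mean_gap and mean_gap_days are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Account": ["A", "A", "A", "A", "A", "B", "B"],
    "Activity": ["a", "b", "c", "d", "e", "a", "b"],
    "Date": pd.to_datetime(["2015-10-21", "2016-07-07", "2016-07-07",
                            "2016-09-14", "2016-10-12",
                            "2015-11-24", "2015-12-30"]),
})

# Sort first so diff() always compares consecutive dates within an account.
mean_gap = (df.sort_values(["Account", "Date"])
              .groupby("Account")["Date"]
              .apply(lambda x: x.diff().mean()))

# Dividing by a one-day Timedelta converts the result to fractional days.
mean_gap_days = mean_gap / pd.Timedelta(days=1)
```

For the sample data this gives 89.25 days for account A and 36.0 days for account B.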
If your dates are in a list:
>>> from datetime import date
>>> dates = [date(2015, 10, 21), date(2016, 7, 7), date(2016, 7, 7), date(2016, 9, 14), date(2016, 10, 12), date(2016, 10, 12), date(2016, 11, 22), date(2016, 12, 21)]
>>> differences = [(dates[i]-dates[i-1]).days for i in range(1, len(dates))] #[260, 0, 69, 28, 0, 41, 29]
>>> float(sum(differences))/len(differences)
61.0
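Because consecutive differences telescope, the average gap of a sorted list equals the span between the first and last date divided by the number of gaps. A short sketch of that shortcut, using the same dates as the list-based answer (the name average_gap is illustrative):

```python
from datetime import date

dates = [date(2015, 10, 21), date(2016, 7, 7), date(2016, 7, 7),
         date(2016, 9, 14), date(2016, 10, 12), date(2016, 10, 12),
         date(2016, 11, 22), date(2016, 12, 21)]

# Sorting guards against input that is not already in date order.
ordered = sorted(dates)
average_gap = (ordered[-1] - ordered[0]).days / (len(ordered) - 1)
print(average_gap)  # 61.0
```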

Fill in missing dates in dataframe using the mean

I have dates that I'm pulling into a dataframe at regular intervals.
The data is generally well-formed, but sometimes an otherwise valid date column contains bad values.
I would always expect to have a date in the parsed 9-tuple form:
(tm_year=2000, tm_mon=11, tm_mday=30, tm_hour=0, tm_min=0, tm_sec=0, tm_wday=3, tm_yday=335, tm_isdst=-1)
(2015, 12, 29, 0, 30, 50, 1, 363, 0)
How should I check and fix this?
What I would like to do is replace whatever is not a date, with a date based on a variable that represents the last_update + 1/2 the update interval, so the items are not filtered out by later functions.
Data as shown is published_parsed from feedparser.
import pandas as pd
import datetime
# date with ugly data
df_date_ugly = pd.DataFrame({'date': [
(2015, 12, 29, 0, 30, 50, 1, 363, 0),
(2015, 12, 28, 23, 59, 12, 0, 362, 0),
'None', '',
(2015, 12, 28, 23, 59, 12, 0, 362, 0)
]})
# date is fine
df_date = pd.DataFrame({'date': [
(2015, 12, 29, 0, 30, 50, 1, 363, 0),
(2015, 12, 28, 23, 59, 12, 0, 362, 0),
(2015, 12, 28, 23, 59, 12, 0, 362, 0)
]})
Pseudocode
if the original_date is valid
    return original_date
else
    return substitute_date
import calendar
import numpy as np
import pandas as pd
def tuple_to_timestamp(x):
    try:
        return calendar.timegm(x) # 1
    except (TypeError, ValueError):
        return np.nan
df = pd.DataFrame({'orig': [
(2015, 12, 29, 0, 30, 50, 1, 363, 0),
(2015, 12, 28, 23, 59, 12, 0, 362, 0),
'None', '',
(2015, 12, 30, 23, 59, 12, 0, 362, 0)]})
ts = df['orig'].apply(tuple_to_timestamp) # 2
# 0 1451349050
# 1 1451347152
# 2 NaN
# 3 NaN
# 4 1451519952
# Name: orig, dtype: float64
ts = ts.interpolate() # 3
# 0 1451349050
# 1 1451347152
# 2 1451404752
# 3 1451462352
# 4 1451519952
# Name: orig, dtype: float64
df['fixed'] = pd.to_datetime(ts, unit='s') # 4
print(df)
yields
                                    orig               fixed
0   (2015, 12, 29, 0, 30, 50, 1, 363, 0) 2015-12-29 00:30:50
1  (2015, 12, 28, 23, 59, 12, 0, 362, 0) 2015-12-28 23:59:12
2                                   None 2015-12-29 15:59:12
3                                        2015-12-30 07:59:12
4  (2015, 12, 30, 23, 59, 12, 0, 362, 0) 2015-12-30 23:59:12
Explanation:
1. calendar.timegm converts each time-tuple to a timestamp. Unlike time.mktime, it interprets the time-tuple as being in UTC, not local time.
2. apply calls tuple_to_timestamp for each row of df['orig'].
3. The nice thing about timestamps is that they are numeric, so you can then use numerical methods such as Series.interpolate to fill in NaNs with interpolated values. Note that the two NaNs do not get filled with the same interpolated value; their values are linearly interpolated based on their position as given by ts.index.
4. pd.to_datetime converts the timestamps to dates.
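The pseudocode in the question can also be sketched directly, replacing anything that is not a valid time-tuple with a fixed substitute instead of interpolating. The values of last_update and update_interval below are placeholders invented for this example, as is the helper name valid_or_substitute.

```python
import datetime

import pandas as pd

# Hypothetical stand-ins for the question's last_update and update interval.
last_update = datetime.datetime(2015, 12, 28, 12, 0, 0)
update_interval = datetime.timedelta(hours=24)
substitute_date = last_update + update_interval / 2

def valid_or_substitute(original_date):
    """Return the parsed date for a valid time-tuple, else substitute_date."""
    try:
        # The first six fields are year, month, day, hour, minute, second.
        return datetime.datetime(*original_date[:6])
    except (TypeError, ValueError):
        return substitute_date

df_date_ugly = pd.DataFrame({'date': [
    (2015, 12, 29, 0, 30, 50, 1, 363, 0),
    (2015, 12, 28, 23, 59, 12, 0, 362, 0),
    'None', '',
    (2015, 12, 28, 23, 59, 12, 0, 362, 0)
]})
df_date_ugly['fixed'] = df_date_ugly['date'].apply(valid_or_substitute)
```

Strings such as 'None' and '' fail inside the datetime constructor with a TypeError, so they fall through to the substitute.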
When working with dates and times in pandas, convert them to a pandas timestamp using pandas.to_datetime. To use this function, we first convert each tuple into a string containing just the date and time elements. For your case, values that are not sequences of length 9 are considered bad and are replaced with an empty string ''.
# convert each tuple into a string with date & time
# only elements of length 9 will be parsed
dates_df = df_date_ugly.applymap(lambda x: "{0}/{1}/{2} {3}:{4}:{5}".format(x[0],x[1],x[2], x[3], x[4], x[5]) if len(x)==9 else '')
# convert to a pandas timestamp; unparseable values become NaT
dates_df['date'] = pd.to_datetime(dates_df['date'], errors = 'coerce')
                 date
0 2015-12-29 00:30:50
1 2015-12-28 23:59:12
2                 NaT
3                 NaT
4 2015-12-28 23:59:12
To find the indices where the dates are missing, use pd.isnull():
>>> missing = dates_df.index[pd.isnull(dates_df['date'])]
>>> missing
Int64Index([2, 3], dtype='int64')
To set the missing date as the midpoint between 2 dates:
start_date = dates_df.iloc[0,:]
end_date = dates_df.iloc[4,:]
missing_date = start_date + (end_date - start_date)/2
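The midpoint still has to be written back into the column. One way to do that, sketched here on a standalone Series holding the parsed values from above, is fillna. The variable names dates and filled are illustrative only.

```python
import pandas as pd

# Parsed dates from the example; None becomes NaT.
dates = pd.Series(pd.to_datetime([
    '2015-12-29 00:30:50',
    '2015-12-28 23:59:12',
    None,
    None,
    '2015-12-28 23:59:12',
]))

start_date = dates.iloc[0]
end_date = dates.iloc[4]
missing_date = start_date + (end_date - start_date) / 2

# Every NaT receives the same midpoint value.
filled = dates.fillna(missing_date)
```

For this data the midpoint works out to 2015-12-29 00:15:01, because the last date is slightly earlier than the first.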

"Parallel" indexing in pandas (not hierarchical)

Short version: I have two TimeSeries (recording start and recording end) I would like to use as indices for data in a Panel (or DataFrame). Not hierarchical, but parallel. I am uncertain how to do this.
Long version:
I am constructing a pandas Panel with some data akin to temperature and density at certain distances from an antenna. As I see it, the most natural structure is having e.g. temp and dens as items (i.e. sub-DataFrames of the Panel), recording time as major axis (index), and thus distance from the antenna as minor axis (columns).
My problem is this: For each recording, the instrument averages/integrates over some amount of time. Thus, for each data dump, two timestamps are saved: start recording and end recording. I need both of those. Thus, I would need something which might be called "parallel indexing", where two different TimeSeries (startRec and endRec) work as indices, and I can get whichever I prefer for a certain data point. Of course, I don't really need to index by both, but both need to be naturally available in the data structure. For example, for any given temperature or density recording, I need to be able to get both the start and end time of the recording.
I could of course keep the two TimeSeries in a separate DataFrame, but with the main point of pandas being automatic data alignment, this is not really ideal.
How can I best achieve this?
Example data
Sample Panel with three recordings at two distances from the antenna:
import pandas as pd
import numpy as np
data = pd.Panel(data={'temp': np.array([[21, 20],
                                        [19, 17],
                                        [15, 14]]),
                      'dens': np.array([[1001, 1002],
                                        [1000, 998],
                                        [997, 995]])},
                minor_axis=['1m', '3m'])
Output of data:
<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 3 (major_axis) x 2 (minor_axis)
Items axis: dens to temp
Major_axis axis: 0 to 2
Minor_axis axis: 1m to 3m
Here, the major axis is currently only an integer-based index (0 to 2). The minor axis is the two measurement distances from the antenna.
I have two TimeSeries I'd like to use as indices:
from datetime import datetime
startRec = pd.TimeSeries([datetime(2013, 11, 11, 15, 00, 00),
                          datetime(2013, 11, 12, 15, 00, 00),
                          datetime(2013, 11, 13, 15, 00, 00)])
endRec = pd.TimeSeries([datetime(2013, 11, 11, 15, 00, 10),
                        datetime(2013, 11, 12, 15, 00, 10),
                        datetime(2013, 11, 13, 15, 00, 10)])
Output of startRec:
0 2013-11-11 15:00:00
1 2013-11-12 15:00:00
2 2013-11-13 15:00:00
dtype: datetime64[ns]
Being in a Panel makes this a little trickier. I typically stick with DataFrames.
But how does this look:
import numpy as np
import pandas as pd
from datetime import datetime
startRec = pd.TimeSeries([datetime(2013, 11, 11, 15, 0, 0),
                          datetime(2013, 11, 12, 15, 0, 0),
                          datetime(2013, 11, 13, 15, 0, 0)])
endRec = pd.TimeSeries([datetime(2013, 11, 11, 15, 0, 10),
                        datetime(2013, 11, 12, 15, 0, 10),
                        datetime(2013, 11, 13, 15, 0, 10)])
_data1m = pd.DataFrame(data={
    'temp': np.array([21, 19, 15]),
    'dens': np.array([1001, 1000, 997]),
    'start': startRec,
    'end': endRec
    }
)
_data3m = pd.DataFrame(data={
    'temp': np.array([20, 17, 14]),
    'dens': np.array([1002, 998, 995]),
    'start': startRec,
    'end': endRec
    }
)
# use both timestamps together as the row index
_data1m.set_index(['start', 'end'], inplace=True)
_data3m.set_index(['start', 'end'], inplace=True)
data = pd.Panel(data={'1m': _data1m, '3m': _data3m})
# each row label is a (start, end) tuple
data.loc['3m'].select(lambda row: row[0] < pd.Timestamp('2013-11-12') or
                      row[1] < pd.Timestamp('2013-11-13'))
and that outputs:
                                         dens  temp
start               end
2013-11-11 15:00:00 2013-11-11 15:00:10  1002    20
2013-11-12 15:00:00 2013-11-12 15:00:10   998    17
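pd.Panel and pd.TimeSeries have since been removed from pandas, so the code above only runs on old releases. As a rough sketch of the same idea on a current version, both timestamps can form a two-level row index while the quantity and the distance become column levels. The level names quantity and distance, and the variable names, are assumptions made for this example.

```python
import numpy as np
import pandas as pd

start_rec = pd.to_datetime(["2013-11-11 15:00:00",
                            "2013-11-12 15:00:00",
                            "2013-11-13 15:00:00"])
end_rec = start_rec + pd.Timedelta(seconds=10)

# Rows carry both timestamps; columns carry (quantity, distance).
index = pd.MultiIndex.from_arrays([start_rec, end_rec], names=["start", "end"])
columns = pd.MultiIndex.from_product([["temp", "dens"], ["1m", "3m"]],
                                     names=["quantity", "distance"])
values = np.array([[21, 20, 1001, 1002],
                   [19, 17, 1000, 998],
                   [15, 14, 997, 995]])
data = pd.DataFrame(values, index=index, columns=columns)

# One quantity at one distance, still labelled by start and end.
temp_3m = data[("temp", "3m")]

# Filter on either timestamp through its index level.
starts = data.index.get_level_values("start")
early = data.loc[starts < pd.Timestamp("2013-11-12")]
```

Both timestamps stay attached to every value, and either one is available through data.index.get_level_values.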

Pandas: Combine different timespans and cumsum

I have the following DataFrame:
from datetime import datetime
from pandas import DataFrame
df = DataFrame({
'Buyer': ['Carl', 'Carl', 'Carl', 'Carl', 'Joe', 'Carl'],
'Quantity': [18, 3, 5, 1, 9, 3],
'Date': [
datetime(2013, 9, 1, 13, 0),
datetime(2013, 9, 1, 13, 5),
datetime(2013, 10, 1, 20, 0),
datetime(2013, 10, 3, 10, 0),
datetime(2013, 12, 2, 12, 0),
datetime(2013, 9, 2, 14, 0),
]
})
First: I am looking to add another column to this DataFrame which sums up the purchases of the last 5 days for each buyer. In particular the result should look like this:
                  Quantity
Buyer Date
Carl  2013-09-01        21
      2013-09-02        24
      2013-10-01         5
      2013-10-03         6
Joe   2013-12-02         9
To do so I started with the following:
df1 = (df.set_index(['Date', 'Buyer'])
         .unstack(level=[1])
         .resample('D')
         .sum()
         .fillna(0))
However, I do not know how to add another column to this DataFrame which can add up for each row the previous 5 row entries.
Second: I would like to add another column to this DataFrame which not only sums up the purchases of the last 5 days as in the first part, but also weights these purchases by their dates. For example: purchases from 5 days ago should count 20%, those from 4 days ago 40%, those from 3 days ago 60%, those from 2 days ago 80%, and those from one day ago and from today 100%.
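One possible reading of both requests is sketched below. It assumes that "the last 5 days" means today plus the five preceding calendar days, matching the six weights listed in the second part. The names WEIGHTS, lagged_sum, plain, weighted and result are invented for this sketch.

```python
from datetime import datetime

import pandas as pd

df = pd.DataFrame({
    'Buyer': ['Carl', 'Carl', 'Carl', 'Carl', 'Joe', 'Carl'],
    'Quantity': [18, 3, 5, 1, 9, 3],
    'Date': [
        datetime(2013, 9, 1, 13, 0),
        datetime(2013, 9, 1, 13, 5),
        datetime(2013, 10, 1, 20, 0),
        datetime(2013, 10, 3, 10, 0),
        datetime(2013, 12, 2, 12, 0),
        datetime(2013, 9, 2, 14, 0),
    ]
})

# One value per buyer and calendar day; days without purchases become 0.
daily = (df.set_index('Date')
           .groupby('Buyer')['Quantity']
           .resample('D')
           .sum())

# Weight by "days ago": position 0 is today, position 5 is five days ago.
WEIGHTS = [1.0, 1.0, 0.8, 0.6, 0.4, 0.2]

def lagged_sum(s, weights):
    """Add up shifted copies of a daily series, one copy per lag."""
    return sum(w * s.shift(k, fill_value=0) for k, w in enumerate(weights))

by_buyer = daily.groupby(level='Buyer')
plain = by_buyer.transform(lambda s: lagged_sum(s, [1] * len(WEIGHTS)))
weighted = by_buyer.transform(lambda s: lagged_sum(s, WEIGHTS))

# Keep only the days on which a purchase actually happened.
result = pd.DataFrame({'Quantity': plain, 'Weighted': weighted})[daily > 0]
```

With the sample data, the Quantity column reproduces the desired table from the first part (21, 24, 5, 6 for Carl and 9 for Joe).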
