I have a question about selecting a range in a pandas DataFrame in Python. I have a column with times and a column with values. I would like to select all the rows with times between 6 a.m. and 6 p.m. (so from 6:00:00 to 18:00:00). I've succeeded at selecting all the night times (between 18:00:00 and 6:00:00), but if I apply the same to the day times, it doesn't work. Is there something wrong with my syntax? Below is a minimal working example. timeslice2 returns an empty DataFrame in my case.
import pandas as pd
times = ("1:00:00", "2:00:00", "3:00:00", "4:00:00", "5:00:00", "6:00:00", "7:00:00", "8:00:00", "9:00:00", \
"10:00:00", "11:00:00", "12:00:00", "13:00:00", "14:00:00", "15:00:00", "16:00:00", "17:00:00", \
"18:00:00", "19:00:00", "20:00:00", "21:00:00", "22:00:00", "23:00:00")
values = (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23)
data = zip(times, values)
colnames = ["Time", "values"]
df = pd.DataFrame(data=data, columns=colnames)
print(df)
# selecting only night times
timeslice1 = df[(df['Time'] > '18:00:00') & (df['Time'] <= '6:00:00')]
# selecting only day times
timeslice2 = df[(df['Time'] > '6:00:00') & (df['Time'] <= '18:00:00')]
print(timeslice1)
print(timeslice2)
I've been able to select the right range with this answer, but it seems strange to me that the above doesn't work. Moreover, if I convert my 'Time' column to 'datetime', as needed, it uses the date of today and I don't want that.
This way it works, the first range if treated like datetime has no results because it will mean two different dates (days) in chronological order.
import pandas as pd
times = ("1:00:00", "2:00:00", "3:00:00", "4:00:00", "5:00:00", "6:00:00", "7:00:00", "8:00:00", "9:00:00", \
"10:00:00", "11:00:00", "12:00:00", "13:00:00", "14:00:00", "15:00:00", "16:00:00", "17:00:00", \
"18:00:00", "19:00:00", "20:00:00", "21:00:00", "22:00:00", "23:00:00")
values = (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23)
data = zip(times, values)
colnames = ["Time", "values"]
df = pd.DataFrame(data=data, columns=colnames)
print('Original df \n',df)
# selecting only night times
timeslice1 = df[(df['Time'] > '18:00:00') & (df['Time'] <= '6:00:00')]
# selecting only day times
#conver Time column to datetime
df['Time'] = pd.to_datetime(df['Time'])
timeslice2 = df[(df['Time'] > '6:00:00') & (df['Time'] <= '18:00:00')]
#convert df back to string
timeslice2["Time"] = timeslice2["Time"].dt.strftime('%H:%M:%S')
print('Slice 1 \n', timeslice1)
print('Slice 2 \n', timeslice2)
Related
I am working with a pandas dataframe with multi-index columns (two levels). I need to drop a column from level 0 and later get a list of the remaining columns in level=0. Strangely, the dropping part works fine, but somehow the dropped column shows back up if you call df.columns.levels[0].
Here's a MRE. When I call df.columns the result is this:
MultiIndex([('Week2', 'Hours'),
('Week2', 'Sales')],
)
Which sure looks like Week1 is gone. But if I call df.columns.levels[0].tolist()...
['Week1', 'Week2']
Here's the full code:
import pandas as pd
import numpy as np
n = ['Mickey', 'Minnie', 'Snow White', 'Donald',
'Goofy', 'Elsa', 'Pluto', 'Daisy', 'Mad Hatter']
df1 = pd.DataFrame(data={'Hours': [32, 30, 34, 33, 22, 12, 19, 17, 9],
'Sales': [10, 15, 12, 15, 6, 11, 9, 7, 4]},
index=n)
df2 = pd.DataFrame(data={'Hours': [40, 33, 29, 31, 17, 22, 13, 16, 12],
'Sales': [12, 14, 8, 16, 3, 12, 5, 6, 4]},
index=n)
df1.columns = pd.MultiIndex.from_product([['Week1'], df1.columns])
df2.columns = pd.MultiIndex.from_product([['Week2'], df2.columns])
df = pd.concat([df1, df2], axis=1)
#I want to remove a whole level
df = df.drop('Week1', axis=1, level=0)
print(df.columns) #Seems successful after this print, only Week 2 is left
print(df.columns.levels[0].tolist()) #This list includes Week1!
Use remove_unused_levels:
From the documentation:
Unused level(s) means levels that are not expressed in the labels. The resulting MultiIndex will have the same outward appearance, meaning the same .values and ordering. It will also be .equals() to the original.
df.columns = df.columns.remove_unused_levels()
print(df.columns.levels)
# Output
[['Week2'], ['Hours', 'Sales']]
I have dates that I'm pulling into a dataframe at regular intervals.
The data is generally well-formed, but sometimes there are bad data in an otherwise date column.
I would always expect to have a date in the parsed 9 digit form:
(tm_year=2000, tm_mon=11, tm_mday=30, tm_hour=0, tm_min=0, tm_sec=0, tm_wday=3, tm_yday=335, tm_isdst=-1)
(2015, 12, 29, 0, 30, 50, 1, 363, 0)
How should I check and fix this?
What I would like to do is replace whatever is not a date, with a date based on a variable that represents the last_update + 1/2 the update interval, so the items are not filtered out by later functions.
Data as shown is published_parsed from feedparser.
import pandas as pd
import datetime
# date with ugly data
df_date_ugly = pd.DataFrame({'date': [
(2015, 12, 29, 0, 30, 50, 1, 363, 0),
(2015, 12, 28, 23, 59, 12, 0, 362, 0),
'None', '',
(2015, 12, 28, 23, 59, 12, 0, 362, 0)
]})
# date is fine
df_date = pd.DataFrame({'date': [
(2015, 12, 29, 0, 30, 50, 1, 363, 0),
(2015, 12, 28, 23, 59, 12, 0, 362, 0),
(2015, 12, 28, 23, 59, 12, 0, 362, 0)
]})
Pseudocode
if the original_date is valid
return original_date
else
return substitute_date
import calendar
import numpy as np
import pandas as pd
def tuple_to_timestamp(x):
try:
return calendar.timegm(x) # 1
except (TypeError, ValueError):
return np.nan
df = pd.DataFrame({'orig': [
(2015, 12, 29, 0, 30, 50, 1, 363, 0),
(2015, 12, 28, 23, 59, 12, 0, 362, 0),
'None', '',
(2015, 12, 30, 23, 59, 12, 0, 362, 0)]})
ts = df['orig'].apply(tuple_to_timestamp) # 2
# 0 1451349050
# 1 1451347152
# 2 NaN
# 3 NaN
# 4 1451519952
# Name: orig, dtype: float64
ts = ts.interpolate() # 3
# 0 1451349050
# 1 1451347152
# 2 1451404752
# 3 1451462352
# 4 1451519952
# Name: orig, dtype: float64
df['fixed'] = pd.to_datetime(ts, unit='s') # 4
print(df)
yields
orig fixed
0 (2015, 12, 29, 0, 30, 50, 1, 363, 0) 2015-12-29 00:30:50
1 (2015, 12, 28, 23, 59, 12, 0, 362, 0) 2015-12-28 23:59:12
2 None 2015-12-29 15:59:12
3 2015-12-30 07:59:12
4 (2015, 12, 30, 23, 59, 12, 0, 362, 0) 2015-12-30 23:59:12
Explanation:
calendar.timegm converts each time-tuple to a timestamp. Unlike
time.mktime, it interprets the time-tuple as being in UTC, not local time.
apply calls tuple_to_timestamp for each row of df['orig'].
The nice thing about timestamps is that they are numeric, so you can then use
numerical methods such as Series.interpolate to fill in NaNs with interpolated
values. Note that the two NaNs do not get filled with same interpolated value; their values are linearly interpolated based on their position as given by ts.index.
pd.to_datetime converts to timestamps to dates.
When working with dates and times in pandas, convert them to a pandas timestamp using pandas.to_datetime. To use this function, we will convert the list into a string with just the date and time elements. For your case, values that are not lists of length 9 will be considered bad and are replaced with a empty string ''.
#convert list into string with date & time
#only elements with lists of length 9 will be parsed
dates_df = df_date_ugly.applymap(lambda x: "{0}/{1}/{2} {3}:{4}:{5}".format(x[0],x[1],x[2], x[3], x[4], x[5]) if len(x)==9 else '')
#convert to a pandas timestamp
dates_df = pd.to_datetime(dates_df['date'], errors = 'coerce'))
date
0 2015-12-29 00:30:50
1 2015-12-28 23:59:12
2 NaT
3 NaT
4 2015-12-28 23:59:12
Find the indices where the dates are missing use pd.isnull():
>>>missing = pd.isnull(dates_df['date']).index
>>>missing
Int64Index([2, 3], dtype='int64')
To set the missing date as the midpoint between 2 dates:
start_date = dates_df.iloc[0,:]
end_date = dates_df.iloc[4,:]
missing_date = start_date + (end_date - start_date)/2
I've got a column of timestamps (in ms) in a pandas DataFrame. From the timestamps, I'm trying to derive the hour, minute, day of week, and month of the timestamp in separate columns.
I've tried using the apply function across the column, but to no avail. So, I took a very naive (but not very concise) approach to create these columns:
import pandas
import datetime
df=pd.DataFrame( {'time':[1401811621559, 1402673694105, 1402673749561, 1401811615479, 1402673708254], 'person':['Harry', 'Ann', 'Sue', 'Jeremy', 'Anne']})
df['time'] = pandas.to_datetime(df.time, unit='ms')
days = []
tod = []
month = []
minutes = []
for row in df['time']:
days.append(row.strftime('%w'))
tod.append(row.strftime('%H'))
month.append(row.strftime('%m'))
minutes.append(row.strftime('%M'))
##
df['dayOfWeek'] = days
df['timeOfDay'] = tod
df['month'] = month
df['minutes'] = minutes
Is there a way to do this that is more like this?
df['dayOfWeek'] = df['time'].apply(strftime('%w'),axis = 1)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
NameError: name 'strftime' is not defined
At the moment you have to wrap the column in a DatetimeIndex:
In [11]: dti = pd.DatetimeIndex(df['time'])
In [12]: dti.dayofweek
Out[12]: array([1, 4, 4, 1, 4])
In [13]: dti.time
Out[13]:
array([datetime.time(16, 7, 1, 559000), datetime.time(15, 34, 54, 105000),
datetime.time(15, 35, 49, 561000), datetime.time(16, 6, 55, 479000),
datetime.time(15, 35, 8, 254000)], dtype=object)
In [14]: dti.month
Out[14]: array([6, 6, 6, 6, 6])
In [15]: dti.minute
Out[15]: array([ 7, 34, 35, 6, 35])
etc.
See this issue for making these methods directly available from a datetime series.
You might also make it a lambda function:
df['dayOfWeek2'] = df.time.apply(lambda x:x.strftime('%w'))
Now typing
df.dayOfWeek2 == df.dayOfWeek
yields
0 True
1 True
2 True
3 True
4 True
dtype: bool
Yes there is, modifying your code slightly...
def timeGroups(row):
row['days'] = row['time'].strftime('%w'))
#do the same thing for month,seconds,etc.
return row
df['dayOfWeek'] = df['time'].apply(timeGroups,axis = 1)
Short version: I have two TimeSeries (recording start and recording end) I would like to use as indices for data in a Panel (or DataFrame). Not hierarchical, but parallel. I am uncertain how to do this.
Long version:
I am constructing a pandas Panel with some data akin to temperature and density at certain distances from an antenna. As I see it, the most natural structure is having e.g. temp and dens as items (i.e. sub-DataFrames of the Panel), recording time as major axis (index), and thus distance from the antenna as minor axis (colums).
My problem is this: For each recording, the instrument averages/integrates over some amount of time. Thus, for each data dump, two timestamps are saved: start recording and end recording. I need both of those. Thus, I would need something which might be called "parallel indexing", where two different TimeSeries (startRec and endRec) work as indices, and I can get whichever I prefer for a certain data point. Of course, I don't really need to index by both, but both need to be naturally available in the data structure. For example, for any given temperature or density recording, I need to be able to get both the start and end time of the recording.
I could of course keep the two TimeSeries in a separate DataFrame, but with the main point of pandas being automatic data alignment, this is not really ideal.
How can I best achieve this?
Example data
Sample Panel with three recordings at two distances from the antenna:
import pandas as pd
import numpy as np
data = pd.Panel(data={'temp': np.array([[21, 20],
[19, 17],
[15, 14]]),
'dens': np.array([[1001, 1002],
[1000, 998],
[997, 995]])},
minor_axis=['1m', '3m'])
Output of data:
<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 3 (major_axis) x 2 (minor_axis)
Items axis: dens to temp
Major_axis axis: 0 to 2
Minor_axis axis: 1m to 3m
Here, the major axis is currently only an integer-based index (0 to 2). The minor axis is the two measurement distances from the antenna.
I have two TimeSeries I'd like to use as indices:
from datetime import datetime
startRec = pd.TimeSeries([datetime(2013, 11, 11, 15, 00, 00),
datetime(2013, 11, 12, 15, 00, 00),
datetime(2013, 11, 13, 15, 00, 00)])
endRec = pd.TimeSeries([datetime(2013, 11, 11, 15, 00, 10),
datetime(2013, 11, 12, 15, 00, 10),
datetime(2013, 11, 13, 15, 00, 10)])
Output of startRec:
0 2013-11-11 15:00:00
1 2013-11-12 15:00:00
2 2013-11-13 15:00:00
dtype: datetime64[ns]
Being in a Panel makes this a little trickier. I typically stick with DataFrames.
But how does this look:
import pandas as pd
from datetime import datetime
startRec = pd.TimeSeries([datetime(2013, 11, 11, 15, 0, 0),
datetime(2013, 11, 12, 15, 0, 0),
datetime(2013, 11, 13, 15, 0, 0)])
endRec = pd.TimeSeries([datetime(2013, 11, 11, 15, 0, 10),
datetime(2013, 11, 12, 15, 0, 10),
datetime(2013, 11, 13, 15, 0, 10)])
_data1m = pd.DataFrame(data={
'temp': np.array([21, 19, 15]),
'dens': np.array([1001, 1000, 997]),
'start': startRec,
'end': endRec
}
)
_data3m = pd.DataFrame(data={
'temp': np.array([20, 17, 14]),
'dens': np.array([1002, 998, 995]),
'start': startRec,
'end': endRec
}
)
_data1m.set_index(['start', 'end'], inplace=True)
_data3m.set_index(['start', 'end'], inplace=True)
data = pd.Panel(data={'1m': _data1m, '3m': _data3m})
data.loc['3m'].select(lambda row: row[0] < pd.Timestamp('2013-11-12') or
row[1] < pd.Timestamp('2013-11-13'))
and that outputs:
dens temp
start end
2013-11-11 15:00:00 2013-11-11 15:00:10 1002 20
2013-11-12 15:00:00 2013-11-12 15:00:10 998 17
I have the following DataFrame:
from datetime import datetime
from pandas import DataFrame
df = DataFrame({
'Buyer': ['Carl', 'Carl', 'Carl', 'Carl', 'Joe', 'Carl'],
'Quantity': [18, 3, 5, 1, 9, 3],
'Date': [
datetime(2013, 9, 1, 13, 0),
datetime(2013, 9, 1, 13, 5),
datetime(2013, 10, 1, 20, 0),
datetime(2013, 10, 3, 10, 0),
datetime(2013, 12, 2, 12, 0),
datetime(2013, 9, 2, 14, 0),
]
})
First: I am looking to add another column to this DataFrame which sums up the purchases of the last 5 days for each buyer. In particular the result should look like this:
Quantity
Buyer Date
Carl 2013-09-01 21
2013-09-02 24
2013-10-01 5
2013-10-03 6
Joe 2013-12-02 9
To do so I started with the following:
df1 = (df.set_index(['Date', 'Buyer'])
.unstack(level=[1])
.resample('D', how='sum')
.fillna(0))
However, I do not know how to add another column to this DataFrame which can add up for each row the previous 5 row entries.
Second:
Add another column to this DataFrame which does not only sum up the purchases of the last 5 days like in (1) but also weights these purchases based on their dates. For example: those purchases from 5 days ago should be counted 20%, those from 4 days ago 40%, those from 3 days ago 60%, those from 2 days ago 80% and those from one day ago and from today 100%