I'm working with two pandas time series objects (the second comes from a groupby on 30-minute bins):
import numpy as np
import pandas as pd

df_lookup = pd.DataFrame(np.arange(10, 16),
                         index=('2017-12-15 17:58:00', '2017-12-15 17:59:00',
                                '2017-12-15 18:00:00', '2017-12-15 18:01:00',
                                '2017-12-15 18:02:00', '2017-12-15 18:03:00',
                                )
                         )
df_lookup.index = pd.to_datetime(df_lookup.index)

avg_30min = pd.DataFrame([0.066627, 0.1234, 0.0432, 0.234],
                         index=("2017-12-15 18:00:00", "2017-12-15 18:30:00",
                                "2017-12-15 19:00:00", "2017-12-15 19:30:00",
                                )
                         )
avg_30min.index = pd.to_datetime(avg_30min.index)
I need to iterate over the second, avg_30min, and look up into the first, df_lookup, in order to extract the value at index idx:
for idx, row in avg_30min.iterrows():
    value_in_lookup_df = df_lookup.loc[idx]
    # Here I'd use the object from the lookup to add a detail into a plot.
I tried using loc and iloc; the former returns:
KeyError: 'the label [2017-12-15 18:00:00] is not in the [index]'
while the latter:
TypeError: cannot do positional indexing on <class 'pandas.core.indexes.base.Index'> with these indexers [2017-12-15 18:00:00] of <class 'pandas._libs.tslib.Timestamp'>
The expected result would be the row from df_lookup whose index matches idx, somewhat similar to a dictionary lookup in plain python (row_from_lookup = df_lookup[idx]).
What's the right method to have an exact match on a pandas Timeseries?
It looks like you want a merge on the index columns.
avg_30min.merge(df_lookup, left_index=True, right_index=True)
0_x 0_y
2017-12-15 18:00:00 0.066627 12
Alternatively, find the intersection of indexes, and concatenate.
idx = avg_30min.index.intersection(df_lookup.index)
pd.concat([avg_30min.loc[idx], df_lookup.loc[idx]], axis=1, ignore_index=True)
0 1
2017-12-15 18:00:00 0.066627 12
Given a datetime.datetime object such as:
dt_obj = datetime.datetime(2017, 12, 15, 18, 0)
which can be extracted e.g. from another dataframe such as avg_30min in the example above, a lookup into a dataframe whose index has dtype='datetime64[ns]' can be performed using get_loc on the index:
>>> df_lookup.index.get_loc(dt_obj)
2
Then the position can be used to retrieve the requested row, with df_lookup.iloc[].
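Putting that together with the loop from the question, a minimal sketch (guarding against timestamps that have no exact match in df_lookup):

for idx, row in avg_30min.iterrows():
    if idx in df_lookup.index:                 # only 2017-12-15 18:00:00 matches here
        pos = df_lookup.index.get_loc(idx)     # integer position of the exact match
        value_in_lookup_df = df_lookup.iloc[pos]
        # ...use value_in_lookup_df to annotate the plot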
Related
I am working on code that groups a data frame by date:
gk = df_HR.groupby(['date'])
I now get a data frame where the first row of each date looks like this:
2022-05-23 22:18 60 2022-05-23 22:18:00 1653344280 1.000000
2022-05-24 00:00 54 2022-05-24 00:00:00 1653350400 0.900000
....
As an example, I want to drop all the data for the date '2022-05-24'. However, when I use the .drop() function I get the error: 'DataFrameGroupBy' object has no attribute 'drop'.
How can I still drop all the data from this date?
Save your groupby result in a DataFrame df, then use the code below to drop the list of dates you want to remove.
from datetime import datetime

date_list_filter = [datetime(2009, 5, 2),
                    datetime(2010, 8, 22)]
df.drop(date_list_filter, inplace=True)
Hope this helps!
From what I gather, the goal is to group the data frame by date and drop the groups whose dates fall on a certain day.
import pandas as pd
# ...
gk = df_HR.groupby(['date'])
good_dfs = []
for date, sub_df in gk:
    if DATE_TO_DROP not in date:
        good_dfs.append(sub_df)
final_df = pd.concat(good_dfs)
Alternatively, you can just drop the rows where 'date' contains that string:
df_HR.drop(df_HR[df_HR['date'].astype(str).str.contains(DATE_TO_DROP)].index, inplace=True)
The above is for removing a single date. If you have multiple dates, here are those two options again:
option1:
dates_to_drop = []  # e.g. ['2022-05-24', ...]
gk = df_HR.groupby(['date'])
good_dfs = []
for date, sub_df in gk:
    # keep the group only if its key matches none of the unwanted dates
    if not any(bad_date in date for bad_date in dates_to_drop):
        good_dfs.append(sub_df)
final_df = pd.concat(good_dfs)
option2:
dates_to_drop = []  # e.g. ['2022-05-24', ...]
for bad_date in dates_to_drop:
    df_HR.drop(df_HR[df_HR['date'].astype(str).str.contains(bad_date)].index, inplace=True)
The reason we loop is that the values in the 'date' column contain more than just the date string we're looking for, so each row needs a substring check. Note that a bare check like bad_date in df_HR.date only tests the Series as a whole (its index), not each value, which is why the element-wise .str.contains is used in the drop variant; and since we can't test a whole list of substrings in one check, we loop over the bad dates and remove the matching rows for each.
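For reference, a small vectorized sketch that avoids the Python loop entirely; the df_HR contents below are made-up placeholders:

import pandas as pd

# hypothetical stand-in for df_HR
df_HR = pd.DataFrame({'date': ['2022-05-23 22:18', '2022-05-24 00:00', '2022-05-25 08:00'],
                      'hr': [60, 54, 70]})
dates_to_drop = ['2022-05-24', '2022-05-25']

# build one boolean mask for all unwanted dates by joining them into a regex alternation
mask = df_HR['date'].astype(str).str.contains('|'.join(dates_to_drop))
df_HR = df_HR[~mask]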
See the code below to explain further:
import pandas as pd
from datetime import datetime

my_date = [datetime(2009, 5, 2),
           datetime(2010, 8, 22),
           datetime(2022, 8, 22),
           datetime(2009, 5, 2),
           datetime(2010, 8, 22)
           ]
df = pd.DataFrame(my_date)
df.columns = ['Date']
df1 = df.groupby('Date').mean()
df1  # one row per unique date
df1.drop('2009-05-02', inplace=True)
# the row for the given date has now been dropped
df1
tldr; I have an index_date in dtype: datetime64[ns] <class 'pandas.core.series.Series'> and a list_of_dates of type <class 'list'> with individual elements in str format. What's the best way to convert these to the same data type so I can sort the dates into closest before and closest after index_date?
I have a pandas dataframe (df) with columns:
ID_string object
indexdate datetime64[ns]
XR_count int64
CT_count int64
studyid_concat object
studydate_concat object
modality_concat object
And it looks something like:
ID_string indexdate XR_count CT_count studyid_concat studydate_concat
0 55555555 2020-09-07 10 1 ['St1', 'St5'...] ['06/22/2019', '09/20/2020'...]
1 66666666 2020-06-07 5 0 ['St11', 'St17'...] ['05/22/2020', '06/24/2020'...]
Where the 0 element in studyid_concat ("St1") corresponds to the 0 element in studydate_concat, and in modality_concat, etc. I did not show modality_concat for space reasons, but it's something like ['XR', 'CT', ...]
My current goal is to find the closest X-ray study performed before and after my indexdate, as well as being able to rank studies from closest to furthest. I'm somewhat new to pandas, but here is my current attempt:
df = pd.read_excel(path_to_excel, sheet_name='Sheet1')
# Convert comma separated strings from Excel to lists of strings
df.studyid_concat = df.studyid_concat.str.split(',')
df.studydate_concat = df.studydate_concat.str.split(',')
df.modality_concat = df.modality_concat.str.split(',')

for x in df['ID_string'].values:
    index_date = df.loc[df['ID_string'] == x, 'indexdate']
    # Had to use subscript [0] below because the result of the above was a list in an array
    studyid_list = df.loc[df['ID_string'] == x, 'studyid_concat'].values[0]
    date_list = df.loc[df['ID_string'] == x, 'studydate_concat'].values[0]
    modality_list = df.loc[df['ID_string'] == x, 'modality_concat'].values[0]
    xr_date_list = [date_list[i] for i in range(len(date_list)) if modality_list[i] == "XR"]
    xr_studyid_list = [studyid_list[i] for i in range(len(studyid_list)) if modality_list[i] == "XR"]
That's about as far as I got, because I'm somewhat confused about the data types here. My indexdate is currently in dtype: datetime64[ns] <class 'pandas.core.series.Series'>, which I was thinking of converting using the datetime module, but I was having a hard time figuring out how (and wasn't sure if I needed to). My xr_date_list is a list of strings containing dates in the format 'mm/dd/yyyy'. I think I could figure out the rest if I could get the data types into the right format: I'd just compare whether the dates are >= or < indexdate to sort them into before/after, then subtract indexdate from each date and sort. Whatever I do with my xr_date_list, I'd just have to be sure to do the same with xr_studyid_list to keep track of the unique study IDs.
Edit: Desired output dataframe would look like
ID_string indexdate StudyIDBefore StudyDateBefore
0 55555555 2020-09-07 ['St33', 'St1', ...] [2020-09-06, 2019-06-22, ...]
1 66666666 2020-06-07 ['St11', 'St2', ...] [2020-05-22, 2020-05-01, ...]
Where the "before" variables would be sorted from nearest to furthest, and similar "after columns would exist. My current goal is just to check if a study exists within 3 days before and after this indexdate, but having the above dataframe would give me the flexibility if I need to start looking beyond the nearest study.
I think I found my own answer after spending some more time on it and reading more of the pandas to_datetime documentation. Basically, I realized I could convert my list of string dates using pd.to_datetime:
date_list = pd.to_datetime(df.loc[df['ID_string'] == x, 'studydate_concat'].values[0]).values
Then I could subtract my index date from this list. I opted to do this within a temporary dataframe so I could keep track of the other column values (like study ID, modality, etc.).
Full code is below:
import numpy as np

XRonindex, XRwi3days = {}, {}  # flags keyed by ID_string

for x in df['ID_string'].values:
    index_date = df.loc[df['ID_string'] == x, 'indexdate'].values[0]
    date_list = pd.to_datetime(df.loc[df['ID_string'] == x, 'studydate_concat'].values[0]).values
    modality_list = df.loc[df['ID_string'] == x, 'modality_concat'].values[0]
    studyid_list = df.loc[df['ID_string'] == x, 'studyid_concat'].values[0]
    tempdata = list(zip(studyid_list, date_list, modality_list))
    tempdf = pd.DataFrame(tempdata, columns=['studyid', 'studydate', 'modality'])
    tempdf['indexdate'] = index_date
    tempdf['timedelta'] = tempdf['studydate'] - tempdf['indexdate']
    tempdf['study_done_wi_3daysbefore'] = np.where((tempdf['timedelta'] >= np.timedelta64(-3, 'D')) & (tempdf['timedelta'] < np.timedelta64(0, 'D')), True, False)
    tempdf['study_done_wi_3daysafter'] = np.where((tempdf['timedelta'] <= np.timedelta64(3, 'D')) & (tempdf['timedelta'] >= np.timedelta64(0, 'D')), True, False)
    tempdf['study_done_onindex'] = np.where(tempdf['timedelta'] == np.timedelta64(0, 'D'), True, False)
    XRonindex[x] = len(tempdf.loc[tempdf['study_done_onindex'] & (tempdf['modality'] == 'XR'), 'studyid']) > 0
    XRwi3days[x] = len(tempdf.loc[tempdf['study_done_wi_3daysbefore'] & (tempdf['modality'] == 'XR'), 'studyid']) > 0
    # can later map these values back to my original dataframe as a new column
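To also get the nearest-to-furthest ordering described in the question, a possible follow-on for each per-ID tempdf (a sketch, not part of the original code):

# split on the sign of the offset and sort by proximity to the index date
before = tempdf[tempdf['timedelta'] < np.timedelta64(0, 'D')].sort_values('timedelta', ascending=False)
after = tempdf[tempdf['timedelta'] >= np.timedelta64(0, 'D')].sort_values('timedelta')
studyid_before = before['studyid'].tolist()      # e.g. values for a StudyIDBefore column
studydate_before = before['studydate'].tolist()  # matching StudyDateBefore values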
I have a csv file that looks like this when I load it:
# generate example data
users = ['A', 'B', 'C', 'D']
#dates = pd.date_range("2020-02-01 00:00:00", "2020-04-04 20:00:00", freq="H")
dates = pd.date_range("2020-02-01 00:00:00", "2020-02-04 20:00:00", freq="H")
idx = pd.MultiIndex.from_product([users, dates])
idx.names = ["user", "datehour"]
y = pd.Series(np.random.choice(a=[0, 1], size=len(idx)), index=idx).rename('y')
# write to csv and reload (turns out this matters)
y.to_csv('reprod_example.csv')
y = pd.read_csv('reprod_example.csv', parse_dates=['datehour'])
y = y.set_index(['user', 'datehour']).y
>>> y.head()
user datehour
A 2020-02-01 00:00:00 0
2020-02-01 01:00:00 0
2020-02-01 02:00:00 1
2020-02-01 03:00:00 0
2020-02-01 04:00:00 0
Name: y, dtype: int64
I have the following function to create a lagged feature of an index level:
def shift_index(a, dt_idx_name, lag_freq, lag):
    # get datetime index of relevant level
    ac = a.copy()
    dti = ac.index.get_level_values(dt_idx_name)
    # shift it
    dti_shifted = dti.shift(lag, freq=lag_freq)
    # put it back where you found it
    ac.index.set_levels(dti_shifted, level=dt_idx_name, inplace=True)
    return ac
But when I run:
y_lag = shift_index(y, 'datehour', 'H', 1)
I get the following error:
ValueError: Level values must be unique...
(I can actually suppress this error by adding verify_integrity=False
in .index.set_levels... in the function, but that (predictably) causes problems down the line)
Here's the weird part. If you run the example above but without saving/reloading from csv, it works. The reason seems to be, I think, that y.index.get_level_values('datehour') shows a freq='H' attribute right after it's created, but freq=None once it's reloaded from csv.
That makes sense; csv obviously doesn't save that metadata. But I've found it surprisingly difficult to set the freq attribute for a MultiIndexed series. For example, this did nothing: df.index.freq = pd.tseries.frequencies.to_offset("H"). And this answer also didn't work for my MultiIndex.
So I think I could solve this if I were able to set the freq attribute of the DateTime component of my MultiIndex. But my ultimate goal is to create a version of my y data with a shifted DateTime MultiIndex component, such as with my shift_index function above. Since I receive my data via csv, "just don't save to csv and reload" is not an option.
After much fidgeting, I was able to set an hourly frequency using asfreq('H') on grouped data, such that each group has unique values for the datehour index.
y = pd.read_csv('reprod_example.csv', parse_dates=['datehour'])
y = y.groupby('user').apply(lambda df: df.set_index('datehour').asfreq('H')).y
Peeking at an index value shows the correct frequency.
y.index[0]
# ('A', Timestamp('2020-02-01 00:00:00', freq='H'))
All this is doing is setting the index in two parts. The user goes first so that the nested datehour index can be unique within it. Once the datehour index is unique, then asfreq can be used without difficulty.
If you try asfreq on the non-unique index directly (y_load below is the flat frame as read back from the csv), it will not work.
y_load = pd.read_csv('reprod_example.csv', parse_dates=['datehour'])
y_load.set_index('datehour').asfreq('H')
# ---------------------------------------------------------------------------
# ValueError Traceback (most recent call last)
# <ipython-input-433-3ba51b619417> in <module>
# ----> 1 y_load.set_index('datehour').asfreq('H')
# ...
# ValueError: cannot reindex from a duplicate axis
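With the frequency recoverable (or simply passed explicitly), the lagging goal itself can also be sketched by shifting the unique datehour level rather than the repeated per-row values, which keeps set_levels happy. This is an assumption-laden sketch, not the original shift_index:

# shift only the unique level values; freq='H' is given explicitly, so this
# works even if the level lost its freq attribute on the csv round-trip
datehour_level = y.index.levels[1]  # level 0 is 'user', level 1 is 'datehour'
y_lag = y.copy()
y_lag.index = y_lag.index.set_levels(datehour_level.shift(1, freq='H'), level='datehour')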
I am asking for help in transforming values into date format.
I have following data structure:
ID ACT1 ACT2 ACT3 ACT4
1 154438.0 154104.0 155321.0 155321.0
2 154042.0 154073.0 154104.0 154104.0
...
The numbers in columns ACT1-ACT4 need to be converted. Some rows contain NaN values.
I found that following function helps me to get a Gregorian date:
from datetime import datetime, timedelta
gregorian = datetime.strptime('1582/10/15', "%Y/%m/%d")
modified_date = gregorian + timedelta(days=154438)
datetime.strftime(modified_date, "%Y/%m/%d")
It would be great to know how I can apply this transformation to all columns except for "ID" and whether the approach is correct (or could be improved).
After the transformation is applied, I need to extract the order of column items, sorted by date in ascending order. For instance
ID ORDER
1 ACT1, ACT3, ACT4, ACT2
2 ACT2, ACT1, ACT3, ACT4
Thank you!
It sounds like you have two questions here.
1) To change to datetime:
import numpy as np  # needed for np.isfinite below

cols = [col for col in df.columns if col != 'ID']
df.loc[:, cols] = df.loc[:, cols].applymap(lambda x: datetime.strptime('1582/10/15', "%Y/%m/%d") + timedelta(days=x) if np.isfinite(x) else x)
2) To get the sorted column names:
df['ORDER'] = df.loc[:, cols].apply(lambda dr: ','.join(dr.dropna().sort_values().index), axis=1)
Note: the dropna above will omit columns with NaT values from the order string.
First I would make the input comma separated so that it's much easier to handle, of the form:
ID,ACT1,ACT2,ACT3,ACT4
1,154438.0,154104.0,155321.0,155321.0
2,154042.0,154073.0,154104.0,154104.0
Then you can read each line using a CSV reader, which gives you key/value pairs with your column names as keys. You pop the ID off that dictionary to get its value (i.e. 1, 2, etc.), and then reorder the remaining items according to their values, which are the dates. The code is below:
#!/usr/bin/env python3
import csv
from operator import itemgetter

idAndTuple = {}
with open('time.txt') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        myID = row.pop('ID', None)
        # note: the values are compared as strings here, which works while
        # they all have the same number of digits
        reorderedList = sorted(row.items(), key=itemgetter(1))
        idAndTuple[myID] = reorderedList
        print(myID, reorderedList)
The result when you run this is:
1 [('ACT2', '154104.0'), ('ACT1', '154438.0'), ('ACT3', '155321.0'), ('ACT4', '155321.0')]
2 [('ACT1', '154042.0'), ('ACT2', '154073.0'), ('ACT3', '154104.0'), ('ACT4', '154104.0')]
which I think is what you are looking for.
I have the following dataframe
df = pd.DataFrame({
    'Column_1': ['Position', 'Start', 'End', 'Position'],
    'Original_1': ['Open', 'Barn', 'Grass', 'Bubble'],
    'Latest_1': ['Shut', 'Horn', 'Date', 'Dinner'],
    'Column_2': ['Start', 'Position', 'End', 'During'],
    'Original_2': ['Sky', 'Hold', 'Car', 'House'],
    'Latest_2': ['Pedal', 'Lap', 'Two', 'Force'],
    'Column_3': ['Start', 'End', 'Position', 'During'],
    'Original_3': ['Leave', 'Dog', 'Block', 'Hope'],
    'Latest_3': ['Sear', 'Crawl', 'Enter', 'Night']
})
For every instance where the word Position is in 'Column_1', 'Column_2', or 'Column_3', I want to capture the associated values in 'Original_1', 'Original_2', 'Original_3' and assign them to the new column named 'Original_Values'.
The following code can accomplish that, but only on a column by column basis.
df['Original_Value1'] = df.loc[df['Column_1'] == 'Position', 'Original_1']
df['Original_Value2'] = df.loc[df['Column_2'] == 'Position', 'Original_2']
df['Original_Value3'] = df.loc[df['Column_3'] == 'Position', 'Original_3']
Is there a way to recreate the above code so that it iterates over the entire data frame (not by specified columns)?
I'm hoping to create one column ('Original_Values') with the following result:
0 Open
1 Hold
2 Block
3 Bubble
Name: Original_Values, dtype: object
One way to do it, with df.apply():
def choose_orig(row):
    if row['Column_1'] == 'Position':
        return row['Original_1']
    elif row['Column_2'] == 'Position':
        return row['Original_2']
    elif row['Column_3'] == 'Position':
        return row['Original_3']
    return ''

df['Original_Values'] = df.apply(choose_orig, axis=1)
The axis=1 argument to df.apply() causes the choose_orig() function to be called once for each row of the dataframe.
Note that this uses a default value of the empty string, '', when none of the columns match the word 'Position'.
How about creating a mask from the three 'Column_N' columns (selected by name, since the columns are interleaved here) and multiplying it by the values of the 'Original_N' columns? A True times a string keeps the string and a False turns it into an empty string, so taking max(1) per row picks out the single non-empty value.
mask_cols = ['Column_1', 'Column_2', 'Column_3']
val_cols = ['Original_1', 'Original_2', 'Original_3']
df['Original_Values'] = ((df[mask_cols] == 'Position') * df[val_cols].values).max(1)
print(df['Original_Values'])
Returns:
0 Open
1 Hold
2 Block
3 Bubble
Name: Original_Values, dtype: object
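The trick relies on Python's bool-times-string behaviour:
>>> True * 'Open', False * 'Open'
('Open', '')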
Here's a kinda silly way to do it with some stacking, which might perform better if you have a very large df and need to avoid axis=1.
Stack the first three columns to create a list of the index and which 'Original' column the value corresponds to
Stack the columns from which you want to get the values. Use the above list to reindex it, so you return the appropriate value.
Bring those values back to the original df based on the original row index.
Here's the code:
import re
mask_list = ['Column_1', 'Column_2', 'Column_3']
val_list = ['Original_1', 'Original_2', 'Original_3']
idx = df[mask_list].stack()[df[mask_list].stack() == 'Position'].index.tolist()
idx = [(x , re.sub('(.*_)', 'Original_', y)) for x, y in idx]
df['Original_Values'] = df[val_list].stack().reindex(idx).reset_index(level=1).drop(columns='level_1')
df is now:
Column_1 Column_2 Column_3 ... Original_Values
0 Position Start Start ... Open
1 Start Position End ... Hold
2 End End Position ... Block
3 Position During During ... Bubble
If 'Position' is not found in any of the columns in mask_list, Original_Values becomes NaN for that row. If you need to scale it to more columns, simply add them to mask_list and val_list.
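If the number of column pairs grows, the two lists could also be derived from the column names instead of being written out by hand (a small sketch, assuming the same Column_N / Original_N naming pattern):

n = sum(c.startswith('Column_') for c in df.columns)
mask_list = [f'Column_{i}' for i in range(1, n + 1)]
val_list = [f'Original_{i}' for i in range(1, n + 1)]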