I have a csv below:
ID Date Time Flag
1 14/05/2018 00:01:00 NaN
1 14/05/2018 00:02:00 NaN
1 14/05/2018 00:03:00 NaN
1 14/05/2018 00:04:00 NaN
1 14/05/2018 00:05:00 NaN
1 14/05/2018 00:06:00 NaN
1 14/05/2018 00:07:00 NaN
1 14/05/2018 00:08:00 NaN
1 15/05/2018 00:01:00 1
1 15/05/2018 00:02:00 1
1 16/05/2018 00:01:00 1
1 16/05/2018 00:02:00 1
2 10/07/2018 00:03:00 NaN
2 10/07/2018 00:04:00 NaN
2 10/07/2018 00:05:00 NaN
2 10/07/2018 00:06:00 NaN
2 10/07/2018 00:07:00 NaN
2 10/07/2018 00:08:00 NaN
2 11/07/2018 00:01:00 1
2 11/07/2018 00:02:00 1
2 12/07/2018 00:01:00 1
2 12/07/2018 00:02:00 1
For each ID, I want to replace NaN with Flag=1 on only the 4 rows immediately above the first row that already has Flag=1 (i.e. above the first time of the first day with Flag=1), leaving the earlier rows as NaN.
Expected csv:
1 14/05/2018 00:01:00 NaN
1 14/05/2018 00:02:00 NaN
1 14/05/2018 00:03:00 NaN
1 14/05/2018 00:04:00 NaN
1 14/05/2018 00:05:00 1
1 14/05/2018 00:06:00 1
1 14/05/2018 00:07:00 1
1 14/05/2018 00:08:00 1
1 15/05/2018 00:01:00 1
1 15/05/2018 00:02:00 1
1 16/05/2018 00:01:00 1
1 16/05/2018 00:02:00 1
2 10/07/2018 00:03:00 NaN
2 10/07/2018 00:04:00 NaN
2 10/07/2018 00:05:00 1
2 10/07/2018 00:06:00 1
2 10/07/2018 00:07:00 1
2 10/07/2018 00:08:00 1
2 11/07/2018 00:01:00 1
2 11/07/2018 00:02:00 1
2 12/07/2018 00:01:00 1
2 12/07/2018 00:02:00 1
How can I do that? Thanks.
If you simply want to change all Flag values to 1:
import pandas as pd
df = pd.read_csv('path/to/csv.csv')
df['Flag'] = 1
df.to_csv('path/to/csv.csv', index=False)
If, however, you don't want to update all Flag values, check out either loc or iloc for accessing specific parts of your DataFrame.
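For example, a minimal label-based update with loc (the row labels 4:7 here are purely illustrative, not taken from your data):
# set Flag to 1 only for the rows with index labels 4 through 7 (inclusive, label-based)
df.loc[4:7, 'Flag'] = 1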
You need to combine a few different commands. To work per ID and Date, use pandas groupby on multiple columns, like this:
df = pd.read_csv(input_file)
grouped = df.groupby(['ID', 'Date'])
From each group you can then find the first row with Flag=1 and update the original dataframe based on its Date and Time.
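Fleshing that idea out, here is a rough sketch, assuming the column names from the question, that the rows are already ordered by ID, Date and Time, and that exactly 4 rows should be flagged (not tested against your real file):
import pandas as pd

df = pd.read_csv(input_file)

def flag_previous_rows(group, n=4):
    group = group.copy()
    flagged = group['Flag'].eq(1)
    if not flagged.any():
        return group
    # position of the first row in this ID that already has Flag = 1
    first = flagged.to_numpy().argmax()
    # set Flag = 1 on the n rows immediately above that row
    group.iloc[max(first - n, 0):first, group.columns.get_loc('Flag')] = 1
    return group

df = df.groupby('ID', group_keys=False).apply(flag_previous_rows)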
What I start with is a large dataframe (more than a million entries) with this structure:
id datetime indicator other_values ...
1 2020-01-14 00:12:00 0 ...
1 2020-01-17 00:23:00 1 ...
...
1 2021-02-01 00:00:00 0 ...
2 2020-01-15 00:05:00 0 ...
2 2020-03-10 00:07:00 0 ...
...
2 2021-05-22 00:00:00 1 ...
...
There is no specific order other than a sort by id and then datetime. The dataset is not complete (there is no data for every day, but there can be multiple entries on the same day).
Now, for each row where indicator == 1, I want to collect every row with the same id and a datetime that is at most 10 days earlier. All other rows that are not in range of any indicator row can be dropped. Ideally I want the result saved as a set of time series, each of which will later be fed into a neural network. (There can be more than one indicator == 1 case per id, and the other values should be kept.)
An example for one id: I want to convert this
id datetime indicator other_values ...
1 2020-01-14 00:12:00 0 ...
1 2020-01-17 00:23:00 1 ...
1 2020-01-17 00:13:00 0 ...
1 2020-01-20 00:05:00 0 ...
1 2020-03-10 00:07:00 0 ...
1 2020-05-19 00:00:00 0 ...
1 2020-05-20 00:00:00 1 ...
into this
id datetime group other_values ...
1 2020-01-14 00:12:00 A ...
1 2020-01-17 00:23:00 A ...
1 2020-01-17 00:13:00 A ...
1 2020-05-19 00:00:00 B ...
1 2020-05-20 00:00:00 B ...
or a similar way to group into group A, B, ... .
A naive Python for loop is not an option, since it takes ages on a dataset like this.
There is probably a clever way to use df.groupby('id'), df.groupby('id').agg(...), df.sort_values(...) or df.apply(), but I just do not see it.
Here is a way to do it with pd.merge_asof(). Let's create our data:
import pandas as pd

data = {'id': [1, 1, 1, 1, 1, 1, 1],
        'datetime': ['2020-01-14 00:12:00',
                     '2020-01-17 00:23:00',
                     '2020-01-17 00:13:00',
                     '2020-01-20 00:05:00',
                     '2020-03-10 00:07:00',
                     '2020-05-19 00:00:00',
                     '2020-05-20 00:00:00'],
        'ind': [0, 1, 0, 0, 0, 0, 1]
        }
df = pd.DataFrame(data)
df['datetime'] = pd.to_datetime(df['datetime'])
Data:
id datetime ind
0 1 2020-01-14 00:12:00 0
1 1 2020-01-17 00:23:00 1
2 1 2020-01-17 00:13:00 0
3 1 2020-01-20 00:05:00 0
4 1 2020-03-10 00:07:00 0
5 1 2020-05-19 00:00:00 0
6 1 2020-05-20 00:00:00 1
Next, let's add a date to the dataset and pull all dates where the indicator is 1.
df['date'] = df['datetime'].dt.normalize()  # midnight of the same day, still datetime64
df2 = df.loc[df['ind'] == 1, ['id', 'date', 'ind']].rename({'ind': 'ind2'}, axis=1)
Which gives us this:
df:
id datetime ind date
0 1 2020-01-14 00:12:00 0 2020-01-14
1 1 2020-01-17 00:23:00 1 2020-01-17
2 1 2020-01-17 00:13:00 0 2020-01-17
3 1 2020-01-20 00:05:00 0 2020-01-20
4 1 2020-03-10 00:07:00 0 2020-03-10
5 1 2020-05-19 00:00:00 0 2020-05-19
6 1 2020-05-20 00:00:00 1 2020-05-20
df2:
id date ind2
1 1 2020-01-17 1
6 1 2020-05-20 1
Now let's join them using pd.merge_asof() with direction='forward' and a tolerance of 10 days. This attaches each indicator date to all rows up to 10 days before it.
df = pd.merge_asof(df, df2, by='id', on='date', tolerance=pd.Timedelta('10d'), direction='forward')
Which gives us this:
id datetime ind date ind2
0 1 2020-01-14 00:12:00 0 2020-01-14 1.0
1 1 2020-01-17 00:23:00 1 2020-01-17 1.0
2 1 2020-01-17 00:13:00 0 2020-01-17 1.0
3 1 2020-01-20 00:05:00 0 2020-01-20 NaN
4 1 2020-03-10 00:07:00 0 2020-03-10 NaN
5 1 2020-05-19 00:00:00 0 2020-05-19 1.0
6 1 2020-05-20 00:00:00 1 2020-05-20 1.0
Next, let's work on creating groups. A row starts a new group when any of the following holds:
The previous value of ind2 is NaN
The previous row belongs to a different id (we're at the first row of that id)
The previous date is more than 10 days earlier
With these rules, we can create a Boolean which we can then cumulatively sum to create our groups.
new_group = ( df['ind2'].shift().isna()
            | (df['id'].shift() != df['id'])
            | (df['date'] - df['date'].shift() > pd.Timedelta('10d')) )
# ind2.eq(mask) is True only where ind2 is 1 and the mask is True (NaN compares as False),
# so boundaries are only counted inside indicator windows
df['group_id'] = df['ind2'].eq(new_group).cumsum()
id datetime ind date ind2 group_id
0 1 2020-01-14 00:12:00 0 2020-01-14 1.0 1
1 1 2020-01-17 00:23:00 1 2020-01-17 1.0 1
2 1 2020-01-17 00:13:00 0 2020-01-17 1.0 1
3 1 2020-01-20 00:05:00 0 2020-01-20 NaN 1
4 1 2020-03-10 00:07:00 0 2020-03-10 NaN 1
5 1 2020-05-19 00:00:00 0 2020-05-19 1.0 2
6 1 2020-05-20 00:00:00 1 2020-05-20 1.0 2
Now we drop all the NaNs from ind2, remove the date column, and we're done.
df = df.dropna(subset=['ind2']).drop(['date', 'ind2'], axis=1)
Final output:
id datetime ind group_id
0 1 2020-01-14 00:12:00 0 1
1 1 2020-01-17 00:23:00 1 1
2 1 2020-01-17 00:13:00 0 1
5 1 2020-05-19 00:00:00 0 2
6 1 2020-05-20 00:00:00 1 2
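If you then want each window as its own time series (e.g. one sequence per indicator case to feed the network), a simple follow-up is to split on the new column:
# one DataFrame per (id, group_id) window, collected in a plain Python list
series_list = [g for _, g in df.groupby(['id', 'group_id'])]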
I'm not aware of a way to do this with df.agg, but you can put your for loop inside the groupby using .apply(). That way, your comparisons/lookups can be done on smaller tables, then groupby will handle the re-concatenation:
import pandas as pd
import datetime
import uuid
df = pd.DataFrame({
    "id": [1, 1, 1, 2, 2, 2],
    "datetime": [
        '2020-01-14 00:12:00',
        '2020-01-17 00:23:00',
        '2021-02-01 00:00:00',
        '2020-01-15 00:05:00',
        '2020-03-10 00:07:00',
        '2021-05-22 00:00:00',
    ],
    "indicator": [0, 1, 0, 0, 0, 1]
})
df.datetime = pd.to_datetime(df.datetime)
timedelta = datetime.timedelta(days=10)
def consolidate(grp):
    grp = grp.copy()
    grp['Group'] = None
    for time in grp.loc[grp.indicator == 1, 'datetime']:
        # label every row in the 10-day window before this indicator row
        grp.loc[grp['datetime'].between(time - timedelta, time), 'Group'] = uuid.uuid4()
    return grp.dropna(subset=['Group'])
df.groupby('id').apply(consolidate)
If there are multiple rows with indicator == 1 in each id grouping, then the for loop will apply in index order (so a later group might overwrite an earlier group). If you can be certain that there is only one indicator == 1 in each grouping, we can simplify the consolidate function:
def consolidate(grp):
    time = grp.loc[grp.indicator == 1, 'datetime'].iloc[0]
    grp = grp[grp['datetime'].between(time - timedelta, time)].copy()
    grp['Group'] = uuid.uuid4()
    return grp
I'd like to expand my dataframe by adding a time column with one entry per hour for every hour of a month.
Original df
money food
0 1 2
1 4 5
2 5 7
Output:
money food time
0 1 2 2020-01-01 00:00:00
1 1 2 2020-01-01 01:00:00
2 1 2 2020-01-01 02:00:00
...
2230 5 7 2020-01-31 22:00:00
2231 5 7 2020-01-31 23:00:00
where 2231 = out_rows_number - 1 = month_days_number * hours_per_day * orig_rows_number - 1 (31 * 24 * 3 - 1)
What is the proper way to do this?
Use a cross join via DataFrame.merge with a new DataFrame holding every hour of the month, created by date_range:
df1 = pd.DataFrame({'a': 1,
                    'time': pd.date_range('2020-01-01', '2020-01-31 23:00:00', freq='h')})
df = df.assign(a=1).merge(df1, on='a', how='outer').drop('a', axis=1)
print (df)
money food time
0 1 2 2020-01-01 00:00:00
1 1 2 2020-01-01 01:00:00
2 1 2 2020-01-01 02:00:00
3 1 2 2020-01-01 03:00:00
4 1 2 2020-01-01 04:00:00
... ... ...
2227 5 7 2020-01-31 19:00:00
2228 5 7 2020-01-31 20:00:00
2229 5 7 2020-01-31 21:00:00
2230 5 7 2020-01-31 22:00:00
2231 5 7 2020-01-31 23:00:00
[2232 rows x 3 columns]
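On pandas 1.2 or newer the helper column can be skipped, since merge supports a cross join directly; a minimal equivalent sketch:
times = pd.DataFrame({'time': pd.date_range('2020-01-01', '2020-01-31 23:00:00', freq='h')})
df = df.merge(times, how='cross')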
I have a dataframe with datetimes as the index. There are some gaps in the index, so I upsample it to a regular grid (a 10-second grid in the example below). I want to fill the gaps by doing half forward filling (from the left side of the gap) and half backward filling (from the right side of the gap).
Input:
2000-01-01 00:00:00 0
2000-01-01 00:01:00 1
2000-01-01 00:02:00 2
2000-01-01 00:03:00 3
2000-01-01 00:04:10 4
Upsampled input, with a 10-second frequency:
2000-01-01 00:00:00 0.0
2000-01-01 00:00:10 NaN
2000-01-01 00:00:20 NaN
2000-01-01 00:00:30 NaN
2000-01-01 00:00:40 NaN
2000-01-01 00:00:50 NaN
2000-01-01 00:01:00 1.0
2000-01-01 00:01:10 NaN
2000-01-01 00:01:20 NaN
2000-01-01 00:01:30 NaN
2000-01-01 00:01:40 NaN
2000-01-01 00:01:50 NaN
2000-01-01 00:02:00 2.0
2000-01-01 00:02:10 NaN
2000-01-01 00:02:20 NaN
2000-01-01 00:02:30 NaN
2000-01-01 00:02:40 NaN
2000-01-01 00:02:50 NaN
2000-01-01 00:03:00 3.0
2000-01-01 00:04:10 4.0
Output I want:
2000-01-01 00:00:00 0.0
2000-01-01 00:00:10 0.0
2000-01-01 00:00:20 0.0
2000-01-01 00:00:30 0.0
2000-01-01 00:00:40 1.0
2000-01-01 00:00:50 1.0
2000-01-01 00:01:00 1.0
2000-01-01 00:01:10 1.0
2000-01-01 00:01:20 1.0
2000-01-01 00:01:30 1.0
2000-01-01 00:01:40 2.0
2000-01-01 00:01:50 2.0
2000-01-01 00:02:00 2.0
2000-01-01 00:02:10 2.0
2000-01-01 00:02:20 2.0
2000-01-01 00:02:30 2.0
2000-01-01 00:02:40 3.0
2000-01-01 00:02:50 3.0
2000-01-01 00:03:00 3.0
2000-01-01 00:04:10 4.0
I managed to get the result I want by finding the edges of each gap after upsampling, forward filling across the whole gap, and then overwriting just the right half with the value of the right edge. I do this with a for loop over all the identified gaps, and since my data is so large (some files have around a million gaps to fill) it takes forever to run.
Is there a way this could be done faster?
Thanks!
Edit:
I only want to upsample and fill gaps where the time difference is smaller than or equal to a given value (in the example, only gaps up to 1 minute), so the last two rows should get no upsampling or filling between them.
If your data is 1 minute apart, you can do:
df.set_index(0).asfreq('10S').ffill(limit=3).bfill(limit=2)
output:
1
0
2000-01-01 00:00:00 0.0
2000-01-01 00:00:10 0.0
2000-01-01 00:00:20 0.0
2000-01-01 00:00:30 0.0
2000-01-01 00:00:40 1.0
2000-01-01 00:00:50 1.0
2000-01-01 00:01:00 1.0
2000-01-01 00:01:10 1.0
2000-01-01 00:01:20 1.0
2000-01-01 00:01:30 1.0
2000-01-01 00:01:40 2.0
2000-01-01 00:01:50 2.0
2000-01-01 00:02:00 2.0
2000-01-01 00:02:10 2.0
2000-01-01 00:02:20 2.0
2000-01-01 00:02:30 2.0
2000-01-01 00:02:40 3.0
2000-01-01 00:02:50 3.0
2000-01-01 00:03:00 3.0
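If the raw spacing varies and, as described in the edit, you only want to fill gaps up to a given tolerance, one vectorized option is to find the nearest original timestamps on both sides of every point on the new grid and fill only where they are close enough, taking the left value up to the midpoint of the gap and the right value after it. A rough sketch of that idea (variable names and the tolerance are mine, using the example data, and assuming the original timestamps fall on the grid as they do here):
import pandas as pd

tol = pd.Timedelta('1min')

s = pd.Series([0.0, 1.0, 2.0, 3.0, 4.0],
              index=pd.to_datetime(['2000-01-01 00:00:00', '2000-01-01 00:01:00',
                                    '2000-01-01 00:02:00', '2000-01-01 00:03:00',
                                    '2000-01-01 00:04:10']))

up = s.asfreq('10s')                                  # upsample to a 10-second grid
# nearest original timestamp on each side of every grid point
prev_t = s.index.to_series().reindex(up.index).ffill()
next_t = s.index.to_series().reindex(up.index).bfill()
gap_ok = (next_t - prev_t) <= tol                     # only fill gaps within the tolerance

midpoint = prev_t + (next_t - prev_t) / 2
filled = up.copy()
take_left = gap_ok & up.isna() & (up.index.to_series() <= midpoint)
take_right = gap_ok & up.isna() & (up.index.to_series() > midpoint)
filled[take_left] = s.reindex(up.index, method='ffill')[take_left]
filled[take_right] = s.reindex(up.index, method='bfill')[take_right]
filled = filled.dropna()                              # drop grid points inside gaps larger than the tolerance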
Setup
ts = pd.Series([0, 1, 2, 3], pd.date_range('2000-01-01', periods=4, freq='min'))
merge_asof with direction='nearest'
pd.merge_asof(
ts.asfreq('10s').to_frame('left'),
ts.to_frame('right'),
left_index=True,
right_index=True,
direction='nearest'
)
left right
2000-01-01 00:00:00 0.0 0
2000-01-01 00:00:10 NaN 0
2000-01-01 00:00:20 NaN 0
2000-01-01 00:00:30 NaN 0
2000-01-01 00:00:40 NaN 1
2000-01-01 00:00:50 NaN 1
2000-01-01 00:01:00 1.0 1
2000-01-01 00:01:10 NaN 1
2000-01-01 00:01:20 NaN 1
2000-01-01 00:01:30 NaN 1
2000-01-01 00:01:40 NaN 2
2000-01-01 00:01:50 NaN 2
2000-01-01 00:02:00 2.0 2
2000-01-01 00:02:10 NaN 2
2000-01-01 00:02:20 NaN 2
2000-01-01 00:02:30 NaN 2
2000-01-01 00:02:40 NaN 3
2000-01-01 00:02:50 NaN 3
2000-01-01 00:03:00 3.0 3
reindex with method='nearest'
ts.reindex(ts.asfreq('10s').index, method='nearest')
2000-01-01 00:00:00 0
2000-01-01 00:00:10 0
2000-01-01 00:00:20 0
2000-01-01 00:00:30 1
2000-01-01 00:00:40 1
2000-01-01 00:00:50 1
2000-01-01 00:01:00 1
2000-01-01 00:01:10 1
2000-01-01 00:01:20 1
2000-01-01 00:01:30 2
2000-01-01 00:01:40 2
2000-01-01 00:01:50 2
2000-01-01 00:02:00 2
2000-01-01 00:02:10 2
2000-01-01 00:02:20 2
2000-01-01 00:02:30 3
2000-01-01 00:02:40 3
2000-01-01 00:02:50 3
2000-01-01 00:03:00 3
Freq: 10S, dtype: int64
Note that the two solutions break 'nearest' ties slightly differently, which produces slightly different results.
pd.merge_asof(
ts.asfreq('10s').to_frame('left'),
ts.to_frame('merge_asof'),
left_index=True,
right_index=True,
direction='nearest'
).assign(reindex=ts.reindex(ts.asfreq('10s').index, method='nearest'))
left merge_asof reindex
2000-01-01 00:00:00 0.0 0 0
2000-01-01 00:00:10 NaN 0 0
2000-01-01 00:00:20 NaN 0 0
2000-01-01 00:00:30 NaN 0 1 # This row is different
2000-01-01 00:00:40 NaN 1 1
2000-01-01 00:00:50 NaN 1 1
2000-01-01 00:01:00 1.0 1 1
2000-01-01 00:01:10 NaN 1 1
2000-01-01 00:01:20 NaN 1 1
2000-01-01 00:01:30 NaN 1 2 # This row is different
2000-01-01 00:01:40 NaN 2 2
2000-01-01 00:01:50 NaN 2 2
2000-01-01 00:02:00 2.0 2 2
2000-01-01 00:02:10 NaN 2 2
2000-01-01 00:02:20 NaN 2 2
2000-01-01 00:02:30 NaN 2 3 # This row is different
2000-01-01 00:02:40 NaN 3 3
2000-01-01 00:02:50 NaN 3 3
2000-01-01 00:03:00 3.0 3 3
I have a dataframe df as below:
date1 item_id
2000-01-01 00:00:00 0
2000-01-01 10:01:00 1
2000-01-01 00:02:00 2
2000-01-01 00:03:00 3
2000-01-01 00:04:00 4
2000-01-01 00:05:00 5
2000-01-01 00:06:00 6
2000-01-01 12:07:00 7
2000-01-02 00:08:00 8
2000-01-02 00:00:00 0
2000-01-02 00:01:00 1
2000-01-02 03:02:00 2
2000-01-02 00:03:00 3
2000-01-02 00:04:00 4
2000-01-02 00:05:00 5
2000-01-02 04:06:00 6
2000-01-02 00:07:00 7
2000-01-02 00:08:00 8
I need the data for a single day, i.e. 1 Jan 2000. The query below gives me the correct result, but is there a way it can be done just by passing "2000-01-01"?
result= df[(df['date1'] > '2000-01-01 00:00') & (df['date1'] < '2000-01-01 23:59')]
Use partial string indexing, but it needs a DatetimeIndex first:
df = df.set_index('date1').loc['2000-01-01']
print (df)
item_id
date1
2000-01-01 00:00:00 0
2000-01-01 10:01:00 1
2000-01-01 00:02:00 2
2000-01-01 00:03:00 3
2000-01-01 00:04:00 4
2000-01-01 00:05:00 5
2000-01-01 00:06:00 6
2000-01-01 12:07:00 7
Another solution is to convert the datetimes to strings with strftime and filter by boolean indexing:
df = df[df['date1'].dt.strftime('%Y-%m-%d') == '2000-01-01']
print (df)
date1 item_id
0 2000-01-01 00:00:00 0
1 2000-01-01 10:01:00 1
2 2000-01-01 00:02:00 2
3 2000-01-01 00:03:00 3
4 2000-01-01 00:04:00 4
5 2000-01-01 00:05:00 5
6 2000-01-01 00:06:00 6
7 2000-01-01 12:07:00 7
The other alternative would be to create a mask:
df[df.date1.dt.date.astype(str) == '2000-01-01']
Full example:
import pandas as pd
from io import StringIO
data = '''\
date1 item_id
2000-01-01T00:00:00 0
2000-01-01T10:01:00 1
2000-01-01T00:02:00 2
2000-01-01T00:03:00 3
2000-01-01T00:04:00 4
2000-01-01T00:05:00 5
2000-01-01T00:06:00 6
2000-01-01T12:07:00 7
2000-01-02T00:08:00 8
2000-01-02T00:00:00 0
2000-01-02T00:01:00 1
2000-01-02T03:02:00 2'''
df = pd.read_csv(StringIO(data), sep=r'\s+', parse_dates=['date1'])
res = df[df.date1.dt.date.astype(str) == '2000-01-01']
print(res)
Returns:
date1 item_id
0 2000-01-01 00:00:00 0
1 2000-01-01 10:01:00 1
2 2000-01-01 00:02:00 2
3 2000-01-01 00:03:00 3
4 2000-01-01 00:04:00 4
5 2000-01-01 00:05:00 5
6 2000-01-01 00:06:00 6
7 2000-01-01 12:07:00 7
Or
import datetime
df[df.date1.dt.date == datetime.date(2000,1,1)]
I have a csv-file with entries like this:
1,2014 1 1 0 1,5
2,2014 1 1 0 1,5
3,2014 1 1 0 1,5
4,2014 1 1 0 1,6
5,2014 1 1 0 1,6
6,2014 1 1 0 1,12
7,2014 1 1 0 1,17
8,2014 5 7 1 5,4
The first column is the ID, the second is the arrival date (for the last entry, for example: May 7, 1:05 a.m.), and the last column is the duration of work (in minutes).
Currently, I read in the data using pandas with the following function:
import pandas as pd
def convert_data(csv_path):
    store = pd.HDFStore(data_file)  # data_file (the HDF5 path) is defined elsewhere
    print('Loading CSV File')
    df = pd.read_csv(csv_path, parse_dates=True)
    print('CSV File Loaded, Converting Dates/Times')
    df['Arrival_time'] = df['Arrival_time'].map(convert_time)  # convert_time is defined elsewhere
    df['Rel_time'] = (df['Arrival_time'] - REF.timestamp) / 60.0  # REF is a reference time defined elsewhere
    print('Conversion Complete')
    store['orders'] = df
My question is: how can I sort the entries according to their duration while also taking the arrival date into account? That is, I'd like to sort the csv entries by "arrival date + duration". How is this possible?
Thanks for any hint! Best regards, Stan.
OK, the following shows how to convert the date times and then how to add the minutes:
In [79]:
df['Arrival_Date'] = pd.to_datetime(df['Arrival_Date'], format='%Y %m %d %H %M')
df
Out[79]:
ID Arrival_Date Duration
0 1 2014-01-01 00:01:00 5
1 2 2014-01-01 00:01:00 5
2 3 2014-01-01 00:01:00 5
3 4 2014-01-01 00:01:00 6
4 5 2014-01-01 00:01:00 6
5 6 2014-01-01 00:01:00 12
6 7 2014-01-01 00:01:00 17
7 8 2014-05-07 01:05:00 4
In [80]:
import datetime as dt
df['Arrival_and_Duration'] = df['Arrival_Date'] + df['Duration'].apply(lambda x: dt.timedelta(minutes=int(x)))
df
Out[80]:
ID Arrival_Date Duration Arrival_and_Duration
0 1 2014-01-01 00:01:00 5 2014-01-01 00:06:00
1 2 2014-01-01 00:01:00 5 2014-01-01 00:06:00
2 3 2014-01-01 00:01:00 5 2014-01-01 00:06:00
3 4 2014-01-01 00:01:00 6 2014-01-01 00:07:00
4 5 2014-01-01 00:01:00 6 2014-01-01 00:07:00
5 6 2014-01-01 00:01:00 12 2014-01-01 00:13:00
6 7 2014-01-01 00:01:00 17 2014-01-01 00:18:00
7 8 2014-05-07 01:05:00 4 2014-05-07 01:09:00
In [81]:
df.sort_values(by=['Arrival_and_Duration'])
Out[81]:
ID Arrival_Date Duration Arrival_and_Duration
0 1 2014-01-01 00:01:00 5 2014-01-01 00:06:00
1 2 2014-01-01 00:01:00 5 2014-01-01 00:06:00
2 3 2014-01-01 00:01:00 5 2014-01-01 00:06:00
3 4 2014-01-01 00:01:00 6 2014-01-01 00:07:00
4 5 2014-01-01 00:01:00 6 2014-01-01 00:07:00
5 6 2014-01-01 00:01:00 12 2014-01-01 00:13:00
6 7 2014-01-01 00:01:00 17 2014-01-01 00:18:00
7 8 2014-05-07 01:05:00 4 2014-05-07 01:09:00
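As a side note, on current pandas versions the minute conversion and the sort can be written more directly with to_timedelta and sort_values (same column names assumed):
df['Arrival_and_Duration'] = df['Arrival_Date'] + pd.to_timedelta(df['Duration'], unit='m')
df = df.sort_values('Arrival_and_Duration')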