Sort csv-data while reading, using pandas

Sort csv-data while reading, using pandas - python

I have a csv-file with entries like this:
1,2014 1 1 0 1,5
2,2014 1 1 0 1,5
3,2014 1 1 0 1,5
4,2014 1 1 0 1,6
5,2014 1 1 0 1,6
6,2014 1 1 0 1,12
7,2014 1 1 0 1,17
8,2014 5 7 1 5,4
The first column is the ID, the second the arrival-date (example of last entry: may 07, 1:05 a.m.) and the last column is the duration of work (in minutes).
Actually, I read in the data using pandas and the following function:
import pandas as pd
def convert_data(csv_path):
store = pd.HDFStore(data_file)
print('Loading CSV File')
df = pd.read_csv(csv_path, parse_dates=True)
print('CSV File Loaded, Converting Dates/Times')
df['Arrival_time'] = map(convert_time, df['Arrival_time'])
df['Rel_time'] = (df['Arrival_time'] - REF.timestamp)/60.0
print('Conversion Complete')
store['orders'] = df
My question is: How can I sort the entries according to their duration, but considering the arrival-date? So, I'd like to sort the csv-entries according to "arrival-date + duration". How is this possible?
Thanks for any hint! Best regards, Stan.

OK, the following shows you can convert the date times and then shows how to add the minutes:
In [79]:
df['Arrival_Date'] = pd.to_datetime(df['Arrival_Date'], format='%Y %m %d %H %M')
df
Out[79]:
ID Arrival_Date Duration
0 1 2014-01-01 00:01:00 5
1 2 2014-01-01 00:01:00 5
2 3 2014-01-01 00:01:00 5
3 4 2014-01-01 00:01:00 6
4 5 2014-01-01 00:01:00 6
5 6 2014-01-01 00:01:00 12
6 7 2014-01-01 00:01:00 17
7 8 2014-05-07 01:05:00 4
In [80]:
import datetime as dt
df['Arrival_and_Duration'] = df['Arrival_Date'] + df['Duration'].apply(lambda x: dt.timedelta(minutes=int(x)))
df
Out[80]:
ID Arrival_Date Duration Arrival_and_Duration
0 1 2014-01-01 00:01:00 5 2014-01-01 00:06:00
1 2 2014-01-01 00:01:00 5 2014-01-01 00:06:00
2 3 2014-01-01 00:01:00 5 2014-01-01 00:06:00
3 4 2014-01-01 00:01:00 6 2014-01-01 00:07:00
4 5 2014-01-01 00:01:00 6 2014-01-01 00:07:00
5 6 2014-01-01 00:01:00 12 2014-01-01 00:13:00
6 7 2014-01-01 00:01:00 17 2014-01-01 00:18:00
7 8 2014-05-07 01:05:00 4 2014-05-07 01:09:00
In [81]:
df.sort(columns=['Arrival_and_Duration'])
Out[81]:
ID Arrival_Date Duration Arrival_and_Duration
0 1 2014-01-01 00:01:00 5 2014-01-01 00:06:00
1 2 2014-01-01 00:01:00 5 2014-01-01 00:06:00
2 3 2014-01-01 00:01:00 5 2014-01-01 00:06:00
3 4 2014-01-01 00:01:00 6 2014-01-01 00:07:00
4 5 2014-01-01 00:01:00 6 2014-01-01 00:07:00
5 6 2014-01-01 00:01:00 12 2014-01-01 00:13:00
6 7 2014-01-01 00:01:00 17 2014-01-01 00:18:00
7 8 2014-05-07 01:05:00 4 2014-05-07 01:09:00

Related

Insert values of rows above python

I have a csv below:
ID Date Time Flag
1 14/05/2018 00:01:00 NaN
1 14/05/2018 00:02:00 NaN
1 14/05/2018 00:03:00 NaN
1 14/05/2018 00:04:00 NaN
1 14/05/2018 00:05:00 NaN
1 14/05/2018 00:06:00 NaN
1 14/05/2018 00:07:00 NaN
1 14/05/2018 00:08:00 NaN
1 15/05/2018 00:01:00 1
1 15/05/2018 00:02:00 1
1 16/05/2018 00:01:00 1
1 16/05/2018 00:02:00 1
2 10/07/2018 00:03:00 NaN
2 10/07/2018 00:04:00 NaN
2 10/07/2018 00:05:00 NaN
2 10/07/2018 00:06:00 NaN
2 10/07/2018 00:07:00 NaN
2 10/07/2018 00:08:00 NaN
2 11/07/2018 00:01:00 1
2 11/07/2018 00:02:00 1
2 12/07/2018 00:01:00 1
2 12/07/2018 00:02:00 1
I want to update NaN for only 4 rows above the first row (of only the first day and first time of that day) with Flag=1 for each ID.
Expected csv:
1 14/05/2018 00:01:00 NaN
1 14/05/2018 00:02:00 NaN
1 14/05/2018 00:03:00 NaN
1 14/05/2018 00:04:00 NaN
1 14/05/2018 00:05:00 1
1 14/05/2018 00:06:00 1
1 14/05/2018 00:07:00 1
1 14/05/2018 00:08:00 1
1 15/05/2018 00:01:00 1
1 15/05/2018 00:02:00 1
1 16/05/2018 00:01:00 1
1 16/05/2018 00:02:00 1
2 10/07/2018 00:03:00 NaN
2 10/07/2018 00:04:00 NaN
2 10/07/2018 00:05:00 1
2 10/07/2018 00:06:00 1
2 10/07/2018 00:07:00 1
2 10/07/2018 00:08:00 1
2 11/07/2018 00:01:00 1
2 11/07/2018 00:02:00 1
2 12/07/2018 00:01:00 1
2 12/07/2018 00:02:00 1
How can I do that. Thanks.

Since you're changing all Flag values to 1:
import pandas as pd
df = pd.read_csv('path/to/csv.csv')
df['Flag'] = 1
df.to_csv('path/to/csv.csv', index=False)
If, however, you don't want to update all Flag values, check out either loc or iloc for accessing specific parts of your DataFrame.

You need to combine a few different commands. To find the first row for each ID, use pandas groupby on multiple columns, ID and Date, like this:
df = pd.read_csv(input_file)
filtered_df = df.groupby(['ID', 'Date'])
After that you can copy the original dataframe based on the Date and Time of the filtered_df

Pandas: time column addition and repeating all rows for a month

I'd like to change my dataframe adding time intervals for every hour during a month
Original df
money food
0 1 2
1 4 5
2 5 7
Output:
money food time
0 1 2 2020-01-01 00:00:00
1 1 2 2020-01-01 00:01:00
2 1 2 2020-01-01 00:02:00
...
2230 5 7 2020-01-31 00:22:00
2231 5 7 2020-01-31 00:23:00
where 2231 = out_rows_number-1 = month_days_number*hours_per_day*orig_rows_number - 1
What is the proper way to perform it?

Use cross join by DataFrame.merge and new DataFrame with all hours per month created by date_range:
df1 = pd.DataFrame({'a':1,
'time':pd.date_range('2020-01-01', '2020-01-31 23:00:00', freq='h')})
df = df.assign(a=1).merge(df1, on='a', how='outer').drop('a', axis=1)
print (df)
money food time
0 1 2 2020-01-01 00:00:00
1 1 2 2020-01-01 01:00:00
2 1 2 2020-01-01 02:00:00
3 1 2 2020-01-01 03:00:00
4 1 2 2020-01-01 04:00:00
... ... ...
2227 5 7 2020-01-31 19:00:00
2228 5 7 2020-01-31 20:00:00
2229 5 7 2020-01-31 21:00:00
2230 5 7 2020-01-31 22:00:00
2231 5 7 2020-01-31 23:00:00
[2232 rows x 3 columns]

Getting data for given day from pandas Dataframe

I have a dataframe df as below:
date1 item_id
2000-01-01 00:00:00 0
2000-01-01 10:01:00 1
2000-01-01 00:02:00 2
2000-01-01 00:03:00 3
2000-01-01 00:04:00 4
2000-01-01 00:05:00 5
2000-01-01 00:06:00 6
2000-01-01 12:07:00 7
2000-01-02 00:08:00 8
2000-01-02 00:00:00 0
2000-01-02 00:01:00 1
2000-01-02 03:02:00 2
2000-01-02 00:03:00 3
2000-01-02 00:04:00 4
2000-01-02 00:05:00 5
2000-01-02 04:06:00 6
2000-01-02 00:07:00 7
2000-01-02 00:08:00 8
I need the data for single day i.e. 1st Jan 2000. Below query gives me the correct result. But is there a way it can be done just by passing "2000-01-01"?
result= df[(df['date1'] > '2000-01-01 00:00') & (df['date1'] < '2000-01-01 23:59')]

Use partial string indexing, but need DatetimeIndex first:
df = df.set_index('date1')['2000-01-01']
print (df)
item_id
date1
2000-01-01 00:00:00 0
2000-01-01 10:01:00 1
2000-01-01 00:02:00 2
2000-01-01 00:03:00 3
2000-01-01 00:04:00 4
2000-01-01 00:05:00 5
2000-01-01 00:06:00 6
2000-01-01 12:07:00 7
Another solution is convert datetimes to strings by strftime and filter by boolean indexing:
df = df[df['date1'].dt.strftime('%Y-%m-%d') == '2000-01-01']
print (df)
date1 item_id
0 2000-01-01 00:00:00 0
1 2000-01-01 10:01:00 1
2 2000-01-01 00:02:00 2
3 2000-01-01 00:03:00 3
4 2000-01-01 00:04:00 4
5 2000-01-01 00:05:00 5
6 2000-01-01 00:06:00 6
7 2000-01-01 12:07:00 7

The other alternative would be to create a mask:
df[df.date1.dt.date.astype(str) == '2000-01-01']
Full example:
import pandas as pd
data = '''\
date1 item_id
2000-01-01T00:00:00 0
2000-01-01T10:01:00 1
2000-01-01T00:02:00 2
2000-01-01T00:03:00 3
2000-01-01T00:04:00 4
2000-01-01T00:05:00 5
2000-01-01T00:06:00 6
2000-01-01T12:07:00 7
2000-01-02T00:08:00 8
2000-01-02T00:00:00 0
2000-01-02T00:01:00 1
2000-01-02T03:02:00 2'''
df = pd.read_csv(pd.compat.StringIO(data), sep='\s+', parse_dates=['date1'])
res = df[df.date1.dt.date.astype(str) == '2000-01-01']
print(res)
Returns:
date1 item_id
0 2000-01-01 00:00:00 0
1 2000-01-01 10:01:00 1
2 2000-01-01 00:02:00 2
3 2000-01-01 00:03:00 3
4 2000-01-01 00:04:00 4
5 2000-01-01 00:05:00 5
6 2000-01-01 00:06:00 6
7 2000-01-01 12:07:00 7
Or
import datetime
df[df.date1.dt.date == datetime.date(2000,1,1)]

pandas dataframe new column which checks previous day

I have a Dataframe which has a Datetime as Index and a column named "Holiday" which is an Flag with 1 or 0.
So if the datetimeindex is a holiday, the Holiday column has 1 in it and if not so 0.
I need a new column that says whether a given datetimeindex is the first day after a holiday or not.The new column should just look if its previous day has the flag "HOLIDAY" set to 1 and then set its flag to 1, otherwise 0.
EDIT
Doing:
df['DayAfter'] = df.Holiday.shift(1).fillna(0)
Has the Output:
Holiday DayAfter AnyNumber
Datum
...
2014-01-01 20:00:00 1 1.0 9
2014-01-01 20:30:00 1 1.0 2
2014-01-01 21:00:00 1 1.0 3
2014-01-01 21:30:00 1 1.0 3
2014-01-01 22:00:00 1 1.0 6
2014-01-01 22:30:00 1 1.0 1
2014-01-01 23:00:00 1 1.0 1
2014-01-01 23:30:00 1 1.0 1
2014-01-02 00:00:00 0 1.0 1
2014-01-02 00:30:00 0 0.0 2
2014-01-02 01:00:00 0 0.0 1
2014-01-02 01:30:00 0 0.0 1
...
if you check the first timestamp for 2014-01-02 the DayAfter flag is set right. But the other flags are 0. Thats wrong.

Create an array of unique days that are holidays and offset them by one day
days = pd.Series(df[df.Holiday == 1].index).add(pd.DateOffset(1)).dt.date.unique()
Create a new column with the one day holiday offsets (days)
df['DayAfter'] = np.where(pd.Series(df.index).dt.date.isin(days),1,0)
Holiday AnyNumber DayAfter
Datum
2014-01-01 20:00:00 1 9 0
2014-01-01 20:30:00 1 2 0
2014-01-01 21:00:00 1 3 0
2014-01-01 21:30:00 1 3 0
2014-01-01 22:00:00 1 6 0
2014-01-01 22:30:00 1 1 0
2014-01-01 23:00:00 1 1 0
2014-01-01 23:30:00 1 1 0
2014-01-02 00:00:00 0 1 1
2014-01-02 00:30:00 0 2 1
2014-01-02 01:00:00 0 1 1
2014-01-02 01:30:00 0 1 1

Unable to convert to datetime using pd.to_datetime

I am trying to read a csv file and convert it to a dataframe to be used as a time series.
The csv file is of this type:
#Date Time CO_T1_AHU.01_CC_CTRV_CHW__SIG_STAT
0 NaN NaN %
1 NaN NaN Cooling Coil Hydronic Valve Position
2 2014-01-01 00:00:00 0
3 2014-01-01 01:00:00 0
4 2014-01-01 02:00:00 0
5 2014-01-01 03:00:00 0
6 2014-01-01 04:00:00 0
I read the file using:
df = pd.read_csv ('filepath/file.csv', sep=';', parse_dates = [[0,1]])
producing this result:
#Date_Time FCO_T1_AHU.01_CC_CTRV_CHW__SIG_STAT
0 nan nan %
1 nan nan Cooling Coil Hydronic Valve Position
2 2014-01-01 00:00:00 0
3 2014-01-01 01:00:00 0
4 2014-01-01 02:00:00 0
5 2014-01-01 03:00:00 0
6 2014-01-01 04:00:00 0
to continue converting string to datetime and using it as index:
pd.to_datetime(df.values[:,0])
df.set_index([df.columns[0]], inplace=True)
so i get this:
FCO_T1_AHU.01_CC_CTRV_CHW__SIG_STAT
#Date_Time
nan nan %
nan nan Cooling Coil Hydronic Valve Position
2014-01-01 00:00:00 0
2014-01-01 01:00:00 0
2014-01-01 02:00:00 0
2014-01-01 03:00:00 0
2014-01-01 04:00:00 0
However, the pd.to_datetime is unable to convert to datetime. Is there a way of finding out what is the error?
Many thanks.
Luis

The string entry 'nan nan' cannot be converted using to_datetime, so replace these with an empty string so that they can now be converted to NaT:
In [122]:
df['Date_Time'].replace('nan nan', '',inplace=True)
df
Out[122]:
Date_Time index CO_T1_AHU.01_CC_CTRV_CHW__SIG_STAT
0 0 %
1 1 Cooling Coil Hydronic Valve Position
2 2014-01-01 00:00:00 2 0
3 2014-01-01 01:00:00 3 0
4 2014-01-01 02:00:00 4 0
5 2014-01-01 03:00:00 5 0
6 2014-01-01 04:00:00 6 0
In [124]:
df['Date_Time'] = pd.to_datetime(df['Date_Time'])
df
Out[124]:
Date_Time index CO_T1_AHU.01_CC_CTRV_CHW__SIG_STAT
0 NaT 0 %
1 NaT 1 Cooling Coil Hydronic Valve Position
2 2014-01-01 00:00:00 2 0
3 2014-01-01 01:00:00 3 0
4 2014-01-01 02:00:00 4 0
5 2014-01-01 03:00:00 5 0
6 2014-01-01 04:00:00 6 0
UPDATE
Actually if you just set coerce=True then it converts fine:
df['Date_Time'] = pd.to_datetime(df['Date_Time'], coerce=True)

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Sort csv-data while reading, using pandas - python

Related

Insert values of rows above python

Pandas: time column addition and repeating all rows for a month

Getting data for given day from pandas Dataframe

pandas dataframe new column which checks previous day

Unable to convert to datetime using pd.to_datetime

Categories

Resources