I have output from a Pandas DataFrame as follows.
id value exit enter time_diff
0 1 a 2012-11-27 10:41:20 2012-11-27 10:39:00 00:02:20
1 2 a 2012-12-07 06:00:10 2012-12-07 06:00:09 00:00:01
2 2 c 2012-12-27 06:05:17 2012-12-27 06:00:17 00:05:00
3 3 a 2012-12-27 06:00:13 2012-12-27 06:00:13 00:00:00
Why doesn't the following work?
df.to_csv('diff.csv', date_format='%H:%M:%S')
For the first row, the CSV contains the following for time_diff:
140000000000
time_diff is a timedelta stored as an integer number of nanoseconds, not a date, so date_format does not apply to it. I would recommend either pickling or HDF5 if you need to round-trip.
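If you just need a human-readable CSV, one workaround is to cast the column to strings before writing. A minimal sketch, assuming df is the frame above:
# stringify the timedelta column, e.g. '0 days 00:02:20'
df['time_diff'] = df['time_diff'].astype(str)
df.to_csv('diff.csv')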
I have a string like this
"""PID TTY TIME CMD
1 ? 00:00:01 systemd
2 ? 00:00:00 kthreadd
3 ? 00:00:00 rcu_gp
4 ? 00:00:00 rcu_par_gp"""
Now I want the data structured so that data["PID"] gives me 1, 2, 3, 4, and likewise for the other headers.
I have used pandas and StringIO to convert it to a dataframe, but df.columns gives ['PID TTY', 'TIME CMD'], which is not what I want.
It would be better if the logic were plain Python rather than pandas.
Use sep=r"\s+" to split on any whitespace:
import pandas as pd
from io import StringIO
temp="""PID TTY TIME CMD
1 ? 00:00:01 systemd
2 ? 00:00:00 kthreadd
3 ? 00:00:00 rcu_gp
4 ? 00:00:00 rcu_par_gp"""
df = pd.read_csv(StringIO(temp), sep=r"\s+")
print(df)
PID TTY TIME CMD
0 1 ? 00:00:01 systemd
1 2 ? 00:00:00 kthreadd
2 3 ? 00:00:00 rcu_gp
3 4 ? 00:00:00 rcu_par_gp
print(df.columns)
Index(['PID', 'TTY', 'TIME', 'CMD'], dtype='object')
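Since you said plain Python would be better, here is a minimal sketch without pandas, assuming the same temp string: split the header and each row on whitespace, then zip them into a dict of lists.
lines = temp.splitlines()
headers = lines[0].split()                      # ['PID', 'TTY', 'TIME', 'CMD']
rows = [line.split() for line in lines[1:]]
data = {h: [row[i] for row in rows] for i, h in enumerate(headers)}
print(data["PID"])                              # ['1', '2', '3', '4'] (as strings)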
I would like to calculate the mean of a timedelta series, excluding 00:00:00 values.
This is my time series:
1 00:28:00
3 01:57:00
5 00:00:00
7 01:27:00
9 00:00:00
11 01:30:00
I want to replace rows 5 and 9 with NaN and then apply .mean() to the series; mean() doesn't include NaN values, so I would get the desired result.
How can I do that?
I am trying:
df["time_column"].replace('0 days 00:00:00', np.nan).mean()
but no values are replaced.
One idea is to use a zero Timedelta object instead of the string:
out = df["time_column"].replace(pd.Timedelta(0), np.nan).mean()
print (out)
0 days 01:20:30
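An equivalent sketch, assuming df["time_column"] has dtype timedelta64[ns]: filter the zeros out with a boolean mask instead of replacing them.
# keep only non-zero timedeltas, then average
out = df.loc[df["time_column"] != pd.Timedelta(0), "time_column"].mean()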
The issue I am facing is very simple yet weird and has troubled me to no end.
I have a dataframe as follows:
df['datetime'] = df['datetime'].dt.tz_convert('US/Pacific')
# converting datetime from datetime64[ns, UTC] to datetime64[ns, US/Pacific]
df.head()
vehicle_id trip_id datetime
6760612 1000500 4f874888ce404720a203e36f1cf5b716 2017-01-01 10:00:00-08:00
6760613 1000500 4f874888ce404720a203e36f1cf5b716 2017-01-01 10:00:01-08:00
6760614 1000500 4f874888ce404720a203e36f1cf5b716 2017-01-01 10:00:02-08:00
6760615 1000500 4f874888ce404720a203e36f1cf5b716 2017-01-01 10:00:03-08:00
6760616 1000500 4f874888ce404720a203e36f1cf5b716 2017-01-01 10:00:04-08:00
df.info()
vehicle_id int64
trip_id object
datetime datetime64[ns, US/Pacific]
I am trying to find the datetime difference as follows (in two different ways):
df['datetime_diff'] = df['datetime'].diff()
df['time_diff'] = (df['datetime'] - df['datetime'].shift(1)).astype('timedelta64[s]')
For a particular trip_id, I have the following results:
df[df['trip_id'] == '4f874888ce404720a203e36f1cf5b716'][['datetime','datetime_diff','time_diff']].head()
datetime datetime_diff time_diff
6760612 2017-01-01 10:00:00-08:00 NaT NaN
6760613 2017-01-01 10:00:01-08:00 00:00:01 1.0
6760614 2017-01-01 10:00:02-08:00 00:00:01 1.0
6760615 2017-01-01 10:00:03-08:00 00:00:01 1.0
6760616 2017-01-01 10:00:04-08:00 00:00:01 1.0
But for some other trip_ids, like the ones below, you can observe that the datetime difference comes out as zero (in both columns) when it actually is not; there is a time difference in seconds.
df[df['trip_id'] == '01b8a24510cd4e4684d67b96369286e0'][['datetime','datetime_diff','time_diff']].head(4)
datetime datetime_diff time_diff
3236107 2017-01-28 03:00:00-08:00 0 days 0.0
3236108 2017-01-28 03:00:01-08:00 0 days 0.0
3236109 2017-01-28 03:00:02-08:00 0 days 0.0
3236110 2017-01-28 03:00:03-08:00 0 days 0.0
df[df['trip_id'] == '01c2a70c25e5428bb33811ca5eb19270'][['datetime','datetime_diff','time_diff']].head(4)
datetime datetime_diff time_diff
8915474 2017-01-21 10:00:00-08:00 0 days 0.0
8915475 2017-01-21 10:00:01-08:00 0 days 0.0
8915476 2017-01-21 10:00:02-08:00 0 days 0.0
8915477 2017-01-21 10:00:03-08:00 0 days 0.0
Any leads as to what the actual issue is? I will be very grateful.
If I just execute your code without the type conversion, everything looks fine:
df.timestamp - df.timestamp.shift(1)
On the example lines
rows=['2017-01-21 10:00:00-08:00',
'2017-01-21 10:00:01-08:00',
'2017-01-21 10:00:02-08:00',
'2017-01-21 10:00:03-08:00',
'2017-01-21 10:00:03-08:00'] # the above lines are from your example. I just invented this last line to have one equal entry
df = pd.DataFrame(rows, columns=['timestamp'])
df['timestamp'] = pd.to_datetime(df['timestamp'])
df.timestamp - df.timestamp.shift(1)
The last line returns
Out[40]:
0 NaT
1 00:00:01
2 00:00:01
3 00:00:01
4 00:00:00
Name: timestamp, dtype: timedelta64[ns]
That looks fine so far. Note that you already have a timedelta64 series.
If I now add your conversion, I get:
(df.timestamp - df.timestamp.shift(1)).astype('timedelta64[s]')
Out[42]:
0 NaN
1 1.0
2 1.0
3 1.0
4 0.0
Name: timestamp, dtype: float64
You see that the result is a series of floats. This is probably because there is a NaN in the series. The other thing is the added [s]: that doesn't seem to work, whereas [ns] does. If you want to get rid of the nanoseconds somehow, I guess you need to do it separately.
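If seconds are what you are after, an alternative sketch, assuming the diff series above: .dt.total_seconds() converts each timedelta to seconds explicitly, and NaT simply becomes NaN.
diff = df.timestamp - df.timestamp.shift(1)
secs = diff.dt.total_seconds()   # 1.0, 1.0, 1.0, 0.0 with a leading NaN
print(secs)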
I am getting this error
File "pandas/_libs/tslib.pyx", line 356, in pandas._libs.tslib.array_with_unit_to_datetime
pandas._libs.tslibs.np_datetime.OutOfBoundsDatetime: cannot convert input with unit 's'
when trying to convert a pandas column to datetime format.
I checked this answer Convert unix time to readable date in pandas dataframe
but it did not help me to solve the problem.
There is an issue on GitHub that appears to be closed, but at the same time people keep reporting problems:
https://github.com/pandas-dev/pandas/issues/10987
The DataFrame column is in Unix time format; here is a printout of the top rows:
0 1420096800
1 1420096800
2 1420097100
3 1420097100
4 1420097400
5 1420097400
6 1420093800
7 1420097700
8 1420097700
9 1420098000
10 1420098480
11 1420098600
12 1420099200
13 1420099500
14 1420099500
15 1420100100
16 1420100400
17 1420096800
18 1420100700
19 1420100820
20 1420101840
Any ideas about how I might solve it?
I tried changing the unit from 's' to 'ms', but it did not help.
pd.__version__
'0.24.2'
The line that fails:
df[key] = pd.to_datetime(df[key], unit='s')
It works if you add the origin='unix' parameter:
pd.to_datetime(df['date'], origin='unix', unit='s')
0 2015-01-01 07:20:00
1 2015-01-01 07:20:00
2 2015-01-01 07:25:00
3 2015-01-01 07:25:00
4 2015-01-01 07:30:00
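As a quick sanity check, assuming the values listed above, converting the first timestamp on its own gives the same result:
import pandas as pd
print(pd.to_datetime(1420096800, origin='unix', unit='s'))
# 2015-01-01 07:20:00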
I have a resampling (downsampling) problem that should be straightforward, but I can't get it to work!
Here is a simplified example:
df:
Time A
0 0.01591 0.108929
1 0.27973 0.411764
2 0.55044 0.064253
3 0.81386 0.317394
4 1.07983 0.722707
5 1.35051 1.154193
6 1.61495 1.151492
7 1.88035 0.123389
8 2.15462 0.093583
9 2.41534 0.260944
10 2.67992 1.007564
11 2.95148 0.325353
12 3.21364 0.555593
13 3.47980 0.740621
15 4.01519 1.619669
16 4.28679 0.477371
17 4.55482 0.432049
18 4.81570 0.194224
19 5.07992 0.331936
The Time column is in seconds. I would like to make the Time column the index and downsample the dataframe to 1s. Help please?
You can use reindex and choose a fill method:
In [37]: df.set_index('Time').reindex(range(0,6), method='bfill')
Out[37]:
A
0 0.108929
1 0.722707
2 0.093583
3 0.555593
4 1.619669
5 0.331936
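A variant of the same idea, assuming the frame above: method='nearest' picks the row whose Time is closest to each whole second rather than the next one after it.
df.set_index('Time').reindex(range(0, 6), method='nearest')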
First convert your index to datetime format:
df.index = pd.to_datetime(df.Time, unit='s')
Then resample by second and aggregate (mean here, but this can be changed to sum etc., e.g. use .sum() instead of .mean()):
df.resample('S').mean()
Time A
Time
1970-01-01 00:00:00 0.414985 0.225585
1970-01-01 00:00:01 1.481410 0.787945
1970-01-01 00:00:02 2.550340 0.421861
1970-01-01 00:00:03 3.346720 0.648107
1970-01-01 00:00:04 4.418125 0.680828
1970-01-01 00:00:05 5.079920 0.331936
The year/date can be changed if important.
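If you would rather have the index back as plain seconds instead of epoch dates, a small follow-up sketch assuming the resampled frame from above:
out = df.resample('S').mean()
# subtract the epoch so the DatetimeIndex becomes whole seconds again
out.index = (out.index - pd.Timestamp(0)).total_seconds().astype(int)
print(out)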