Tabular String data convert to python data - python

I have a string like this
"""PID TTY TIME CMD
1 ? 00:00:01 systemd
2 ? 00:00:00 kthreadd
3 ? 00:00:00 rcu_gp
4 ? 00:00:00 rcu_par_gp"""
I want to restructure this data so that data["PID"] gives me 1, 2, 3, 4, and likewise for the other headers.
I have used pandas and StringIO to convert it to a DataFrame, but df.columns gives ['PID TTY', 'TIME CMD'], which is not what I want.
A plain Python solution, without pandas, would be preferable.

Use sep=r"\s+" to split the columns on whitespace:
import pandas as pd
from io import StringIO
temp="""PID TTY TIME CMD
1 ? 00:00:01 systemd
2 ? 00:00:00 kthreadd
3 ? 00:00:00 rcu_gp
4 ? 00:00:00 rcu_par_gp"""
df = pd.read_csv(StringIO(temp), sep=r"\s+")
print (df)
PID TTY TIME CMD
0 1 ? 00:00:01 systemd
1 2 ? 00:00:00 kthreadd
2 3 ? 00:00:00 rcu_gp
3 4 ? 00:00:00 rcu_par_gp
print (df.columns)
Index(['PID', 'TTY', 'TIME', 'CMD'], dtype='object')
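Since the question asks for plain Python logic, here is a minimal sketch without pandas. It assumes the first line holds the headers, that fields are separated by whitespace, and that only the last column (CMD) may itself contain spaces. The names text, headers, rows and data are chosen for illustration only. Note that every value stays a string.

```python
text = """PID TTY TIME CMD
1 ? 00:00:01 systemd
2 ? 00:00:00 kthreadd
3 ? 00:00:00 rcu_gp
4 ? 00:00:00 rcu_par_gp"""

lines = text.splitlines()
headers = lines[0].split()

# Limit the number of splits so a command containing spaces
# would stay in one piece in the last column.
rows = [line.split(None, len(headers) - 1) for line in lines[1:]]

# Build one list per header: {"PID": [...], "TTY": [...], ...}
data = {header: [row[i] for row in rows] for i, header in enumerate(headers)}

print(data["PID"])  # ['1', '2', '3', '4']
print(data["CMD"])  # ['systemd', 'kthreadd', 'rcu_gp', 'rcu_par_gp']
```

If integers are needed for PID, something like [int(p) for p in data["PID"]] converts them afterwards.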

Related

How to convert time data which saved as integer type in csv file into datetime in python

I have a CSV file in which the 'Time' column stores time values as integers, like
7
20
132
4321
123456
...
and I have to convert them to datetime values in Python, like
00:00:07
00:00:20
00:01:32
00:43:21
12:34:56
...
The data has almost 250,000 rows.
How do I convert these numbers to datetimes?
I tried the following, but it failed:
change_time=str(int(df_NPA_2020['TIME'])).zfill(6)
change_time=change_time[:2]+":"+change_time[2:4]+":"+change_time[4:]
change_time
and
change_time=df_NPA_2020['ch_time'] = df_NPA_2020['TIME'].apply(lambda x: pd.to_datetime(str(x), format='%H:%M:%S'))
You're almost there. To convert a column to strings, use the .astype(str) method rather than str(df_NPA_2020['TIME']). The latter returns the printed representation of the whole Series as one string.
df_NPA_2020['ch_time'] = pd.to_datetime(df_NPA_2020['TIME'].astype(str).str.zfill(6), format='%H%M%S').dt.time
print(df_NPA_2020)
# Output
TIME ch_time
0 7 00:00:07
1 20 00:00:20
2 132 00:01:32
3 4321 00:43:21
4 123456 12:34:56
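The string slicing from the first attempt in the question can also be made to work on the whole column at once through the .str accessor. This is only a sketch on a small stand-in DataFrame (the five sample values from the question). It produces plain HH:MM:SS strings and does not validate that the digits form a real time.

```python
import pandas as pd

# Stand-in for the 'TIME' column of df_NPA_2020
df = pd.DataFrame({'TIME': [7, 20, 132, 4321, 123456]})

# Zero-pad every value to six digits: 7 -> '000007'
padded = df['TIME'].astype(str).str.zfill(6)

# Slice the padded strings column-wise and join the parts with ':'
df['ch_time'] = padded.str[:2] + ':' + padded.str[2:4] + ':' + padded.str[4:]

print(df['ch_time'].tolist())
# ['00:00:07', '00:00:20', '00:01:32', '00:43:21', '12:34:56']
```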
Parse the number into a datetime, then format it:
import pandas as pd
df = pd.DataFrame([7,20,132,4321,123456], columns=['Time'])
print(df)
df.Time = df.Time.apply(lambda x: pd.to_datetime(f'{x:06}', format='%H%M%S')).dt.strftime('%H:%M:%S')
print(df)
Output:
Time
0 7
1 20
2 132
3 4321
4 123456
Time
0 00:00:07
1 00:00:20
2 00:01:32
3 00:43:21
4 12:34:56

Why is the difference of datetime = zero for two rows in a dataframe?

This issue that I am facing is very simple yet weird and has troubled me to no end.
I have a dataframe as follows :
df['datetime'] = df['datetime'].dt.tz_convert('US/Pacific')
#converting datetime from datetime64[ns, UTC] to datetime64[ns,US/Pacific]
df.head()
vehicle_id trip_id datetime
6760612 1000500 4f874888ce404720a203e36f1cf5b716 2017-01-01 10:00:00-08:00
6760613 1000500 4f874888ce404720a203e36f1cf5b716 2017-01-01 10:00:01-08:00
6760614 1000500 4f874888ce404720a203e36f1cf5b716 2017-01-01 10:00:02-08:00
6760615 1000500 4f874888ce404720a203e36f1cf5b716 2017-01-01 10:00:03-08:00
6760616 1000500 4f874888ce404720a203e36f1cf5b716 2017-01-01 10:00:04-08:00
df.info ()
vehicle_id int64
trip_id object
datetime datetime64[ns, US/Pacific]
I am trying to compute the datetime difference between consecutive rows in two different ways:
df['datetime_diff'] = df['datetime'].diff()
df['time_diff'] = (df['datetime'] - df['datetime'].shift(1)).astype('timedelta64[s]')
For a particular trip_id, I have the results as follows :
df[trip_frame['trip_id'] == '4f874888ce404720a203e36f1cf5b716'][['datetime','datetime_diff','time_diff']].head()
datetime datetime_diff time_diff
6760612 2017-01-01 10:00:00-08:00 NaT NaN
6760613 2017-01-01 10:00:01-08:00 00:00:01 1.0
6760614 2017-01-01 10:00:02-08:00 00:00:01 1.0
6760615 2017-01-01 10:00:03-08:00 00:00:01 1.0
6760616 2017-01-01 10:00:04-08:00 00:00:01 1.0
But for some other trip_ids, like the ones below, the datetime difference is zero in both columns when it actually is not. There is a time difference of one second between rows.
df[trip_frame['trip_id'] == '01b8a24510cd4e4684d67b96369286e0'][['datetime','datetime_diff','time_diff']].head(4)
datetime datetime_diff time_diff
3236107 2017-01-28 03:00:00-08:00 0 days 0.0
3236108 2017-01-28 03:00:01-08:00 0 days 0.0
3236109 2017-01-28 03:00:02-08:00 0 days 0.0
3236110 2017-01-28 03:00:03-08:00 0 days 0.0
df[df['trip_id'] == '01c2a70c25e5428bb33811ca5eb19270'][['datetime','datetime_diff','time_diff']].head(4)
datetime datetime_diff time_diff
8915474 2017-01-21 10:00:00-08:00 0 days 0.0
8915475 2017-01-21 10:00:01-08:00 0 days 0.0
8915476 2017-01-21 10:00:02-08:00 0 days 0.0
8915477 2017-01-21 10:00:03-08:00 0 days 0.0
Any leads as to what the actual issue is? I would be very grateful.
If I just execute your code without the type conversion, everything looks fine:
df.timestamp - df.timestamp.shift(1)
On the example lines
rows=['2017-01-21 10:00:00-08:00',
'2017-01-21 10:00:01-08:00',
'2017-01-21 10:00:02-08:00',
'2017-01-21 10:00:03-08:00',
'2017-01-21 10:00:03-08:00'] # the above lines are from your example. I just invented this last line to have one equal entry
df= pd.DataFrame(rows, columns=['timestamp'])
df['timestamp'] = pd.to_datetime(df['timestamp'])
df.timestamp - df.timestamp.shift(1)
The last line returns
Out[40]:
0 NaT
1 00:00:01
2 00:00:01
3 00:00:01
4 00:00:00
Name: timestamp, dtype: timedelta64[ns]
That looks fine so far. Note that you already have a timedelta64 series.
If I now add your conversion, I get:
(df.timestamp - df.timestamp.shift(1)).astype('timedelta64[s]')
Out[42]:
0 NaN
1 1.0
2 1.0
3 1.0
4 0.0
Name: timestamp, dtype: float64
As you can see, the result is a series of floats. This is probably because there is a NaN in the series. The other thing is the added [s] unit, which doesn't seem to work as expected; with [ns] it seems to work. If you want to get rid of the nanoseconds, I guess you need to do that separately.
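One way to do that separately is to leave the difference as a timedelta series and ask for whole seconds explicitly with .dt.total_seconds(). The sketch below reuses the example rows from above. It only shows the conversion and is not a diagnosis of the zero differences in the question.

```python
import pandas as pd

rows = ['2017-01-21 10:00:00-08:00',
        '2017-01-21 10:00:01-08:00',
        '2017-01-21 10:00:02-08:00',
        '2017-01-21 10:00:03-08:00',
        '2017-01-21 10:00:03-08:00']  # last row repeated on purpose

df = pd.DataFrame({'timestamp': pd.to_datetime(rows)})

# Difference to the previous row, kept as a timedelta series
delta = df['timestamp'].diff()

# Explicit conversion to seconds as floats; the first row has no
# predecessor, so its value is NaN
seconds = delta.dt.total_seconds()

print(seconds.tolist())
```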

How to format all dates in a sheet by Pandas?

I have the following sheet data in an Excel file:
id data_1 data_2
1 2018/11/11 00:00 123
2 123 2018/11/2 00:00
The dates in Excel are actually stored as floats, so I want to convert them to str using the following:
df = df.astype(dtype=str)
But pandas changes the date format from YYYY/MM/DD to YYYY-MM-DD, so I get this output:
id data_1 data_2
1 2018-11-11 00:00 123
2 123 2018-11-2 00:00
How do I convert all dates to str while keeping the format YYYY/MM/DD?
I can't use pd.to_datetime() or something similar, because the dates are not all in one particular column, and I don't want to traverse all columns to achieve this.
The only way I know is to use a regex:
df.replace(['((?<=[0-9]{4})-(?=([0-9]{2}-[0-9]{2})))|((?<=[0-9]{4}-[0-9]{2})-(?=[0-9]{2}))'], ['/'], regex=True)
But this leads to errors when some other str data also contains YYYY-MM-DD text.
I only want to change the values that are dates in the sheet, which df.astype can do. The only problem is that I want YYYY/MM/DD instead of YYYY-MM-DD.
In general, I want to convert all dates in the sheet to str and format them as YYYY/MM/DD HH:MM:SS. astype achieves the first step.
Is there a simple and quick way to achieve this?
Thank you for reading.
Consider a DataFrame with datetime objects but also random integers:
import datetime as dt
import pandas as pd
df = pd.DataFrame(pd.date_range(dt.datetime(2018,1,1), dt.datetime(2018,1,6)))
df[0][0] = 123
print(df)
0
0 123
1 2018-01-02
2 2018-01-03
3 2018-01-04
4 2018-01-05
5 2018-01-06
Now you can create a new column with the datetime in the desired format by applying this function convert to the column:
def convert(x):
    try:
        return x.strftime('%Y/%m/%d')
    except AttributeError:
        return str(x)
df['date'] = df[0].apply(convert)
print(df)
0 date
0 123 123
1 2018-01-02 00:00:00 2018/01/02
2 2018-01-03 00:00:00 2018/01/03
3 2018-01-04 00:00:00 2018/01/04
4 2018-01-05 00:00:00 2018/01/05
5 2018-01-06 00:00:00 2018/01/06
Note: it might be a better idea to clean up the dates first to avoid unexpected behavior, for example by keeping only the rows that actually hold timestamps:
df[df[0].apply(lambda x: isinstance(x, pd.Timestamp))]
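Because the question says the dates are scattered over several columns, the same idea can be applied to every cell instead of one column. The following is a sketch under assumptions: the small DataFrame is a hand-built stand-in for the sheet in the question, the function name convert_cell is made up for illustration, and the format string follows the YYYY/MM/DD HH:MM:SS layout the question asks for.

```python
import datetime as dt
import pandas as pd

def convert_cell(x):
    # Format real datetimes; turn everything else into str unchanged.
    if isinstance(x, (pd.Timestamp, dt.datetime)) and not pd.isna(x):
        return x.strftime('%Y/%m/%d %H:%M:%S')
    return str(x)

# Stand-in for the sheet: dates and numbers mixed within columns
df = pd.DataFrame({
    'id': [1, 2],
    'data_1': [pd.Timestamp('2018-11-11'), 123],
    'data_2': [123, pd.Timestamp('2018-11-02')],
})

# Series.map runs convert_cell on each value; DataFrame.apply runs
# that over every column, so no column has to be named explicitly.
out = df.apply(lambda col: col.map(convert_cell))

print(out)
```

Every cell in out is a str, and strings that merely look like YYYY-MM-DD are left untouched because only actual datetime objects are reformatted.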

Work with and change the layout of an csv file in pandas

I read CSV data with pandas and now I would like to change the layout of my dataset. My dataset from Excel looks like this:
I run the code with df = pd.read_csv(Location2)
This is what I get:
I would like to have separate columns for time and Watt and their values.
I looked at the documentation but couldn't find anything that makes it work.
It seems as if you'd need to set up the correct delimiter that separates the two fields. Try adding delimiter=";" to the parameters
Use read_excel
df = pd.read_excel(Location2)
I think you need the sep parameter in read_csv, because the default separator is a comma:
df = pd.read_csv(Location2, sep=';')
Sample:
import pandas as pd
from io import StringIO
temp=u"""time;Watt
0;00:00:00;50
1;01:00:00;45
2;02:00:00;40
3;00:03:00;35"""
# after testing, replace StringIO(temp) with 'filename.csv'
df = pd.read_csv(StringIO(temp), sep=";")
print (df)
time Watt
0 00:00:00 50
1 01:00:00 45
2 02:00:00 40
3 00:03:00 35
Then it is possible to convert the time column with to_timedelta:
df['time'] = pd.to_timedelta(df['time'])
print (df)
time Watt
0 00:00:00 50
1 01:00:00 45
2 02:00:00 40
3 00:03:00 35
print (df.dtypes)
time timedelta64[ns]
Watt int64
dtype: object

Time format when using pandas.to_csv()

I have the following output from a pandas DataFrame.
id value exit enter time_diff
0 1 a 2012-11-27 10:41:20 2012-11-27 10:39:00 00:02:20
1 2 a 2012-12-07 06:00:10 2012-12-07 06:00:09 00:00:01
2 2 c 2012-12-27 06:05:17 2012-12-27 06:00:17 00:05:00
3 3 a 2012-12-27 06:00:13 2012-12-27 06:00:13 00:00:00
Why doesn't the following work?
df.to_csv('diff.csv', date_format='%H:%M:%S')
For the first row, the CSV contains the following value for time_diff:
140000000000
Time diff is an integer given in nanoseconds, not a date. I would recommend either pickling or hdf5 if you need to round-trip.
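If the goal is only a readable CSV rather than a round trip, one option is to turn the timedelta column into HH:MM:SS text before writing. This is a sketch, not part of the answer above: format_timedelta is a hypothetical helper, the two rows are taken from the question, and it assumes the differences are non-negative and not missing.

```python
import io
import pandas as pd

def format_timedelta(td):
    # Hypothetical helper: whole seconds -> 'HH:MM:SS'
    total = int(td.total_seconds())
    hours, rem = divmod(total, 3600)
    minutes, seconds = divmod(rem, 60)
    return f'{hours:02d}:{minutes:02d}:{seconds:02d}'

df = pd.DataFrame({
    'exit': pd.to_datetime(['2012-11-27 10:41:20', '2012-12-07 06:00:10']),
    'enter': pd.to_datetime(['2012-11-27 10:39:00', '2012-12-07 06:00:09']),
})
df['time_diff'] = df['exit'] - df['enter']

# date_format only affects datetime columns, so the timedelta
# column is converted to text explicitly before writing.
out = df.assign(time_diff=df['time_diff'].map(format_timedelta))

buf = io.StringIO()  # stands in for 'diff.csv'
out.to_csv(buf, index=False, date_format='%Y-%m-%d %H:%M:%S')
csv_text = buf.getvalue()
print(csv_text)
```

The first row's 140000000000 nanoseconds from the question correspond to 140 seconds, which this helper writes as 00:02:20.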
