I have a dataframe with the index as a Timedelta, ranging from 0 to 5 minutes, and a column of floating point numbers.
Here's an example subset:
32 0.740283
34 0.572126
36 0.524788
38 0.509685
40 0.490219
42 0.545977
44 0.444170
46 1.098387
48 2.209113
51 1.426835
53 1.536439
55 1.196625
56 1.923569
The left being the timedelta in seconds, the right being the floating point number.
The issue is when plotting with pandas I get an x axis with labels such as:
0 days 00:00:00, 0 days 00:01:10, 0 days 00:02:15
and so on. Is there any way I can maybe resample (wrong word?) the data so that I can have axes on a minute by minute basis while still maintaining the data points in the right place?
Example code/data:
import pandas as pd
import numpy as np

df = pd.DataFrame({'td': [32, 34, 36, 38, 40, 42, 44, 51, 53, 152, 283],
                   'val': np.random.rand(11)})
df.index = df.td.map(lambda x: pd.Timedelta(seconds=int(x)))
df.drop(['td'], axis=1, inplace=True)
df.val.plot()
Pandas only provides plotting functions for convenience. To have full control, you need to use Matplotlib directly.
As a workaround, you could just use datetime instead of timedelta as index. As long as your timespans are within minutes, Pandas won't plot the day or month.
To use your example, this works:
from datetime import datetime as dt, timedelta

df = pd.DataFrame({'td': [32, 34, 36, 38, 40, 42, 44, 51, 53, 152, 283],
                   'val': np.random.rand(11)})
df.index = [dt(2010, 1, 1) + timedelta(seconds=int(i)) for i in df.td]
df.drop(['td'], axis=1, inplace=True)
df.val.plot()
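If you do want full control, here is a minimal sketch that uses Matplotlib directly with the same sample data; plotting against total seconds with mm:ss tick labels every minute is my choice of formatting, not something the question specifies:

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the script runs headless
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter, MultipleLocator

df = pd.DataFrame({'val': np.random.rand(11)},
                  index=pd.to_timedelta([32, 34, 36, 38, 40, 42, 44, 51, 53, 152, 283],
                                        unit='s'))

fig, ax = plt.subplots()
# Plot against total seconds instead of the Timedelta index
seconds = df.index.total_seconds()
ax.plot(seconds, df['val'])
# One tick per minute, labelled as mm:ss
ax.xaxis.set_major_locator(MultipleLocator(60))
ax.xaxis.set_major_formatter(
    FuncFormatter(lambda s, pos: f'{int(s) // 60:02d}:{int(s) % 60:02d}'))
fig.savefig('plot.png')
```

This keeps every data point at its true position while the axis reads 00:00, 01:00, 02:00, and so on.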
Is this what you want?
import pandas as pd
import numpy as np
# Multiply by 1e9 because pd.to_datetime() treats plain integers as nanoseconds
td = np.array([32, 34, 36, 38, 40, 42, 44, 51, 53, 152, 283]) * 1e9
df = pd.DataFrame({'td': td,
                   'val': np.random.rand(11)})
df.index = pd.to_datetime(df.td)
df.index = df.index.time  # keep only the time component, ignoring the date
df.drop('td', axis=1, inplace=True)
print(df)
val
00:00:32 0.825991
00:00:34 0.578752
00:00:36 0.348558
00:00:38 0.221674
00:00:40 0.706031
00:00:42 0.912452
00:00:44 0.448185
00:00:51 0.368867
00:00:53 0.188401
00:02:32 0.855828
00:04:43 0.494732
df.plot()  # gives the plot you want
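As a side note, the 1e9 multiplication can be avoided by passing unit='s' to pd.to_datetime; a small sketch with the same sample values:

```python
import numpy as np
import pandas as pd

td = np.array([32, 34, 36, 38, 40, 42, 44, 51, 53, 152, 283])
df = pd.DataFrame({'val': np.random.rand(11)})
# unit='s' tells pandas the integers are seconds, not nanoseconds
df.index = pd.to_datetime(td, unit='s').time
print(df)
```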
I have a dataframe consisting of timestamp and some environmental values. I want to find the maximum and minimum temperature values of the day and store it with the existing dataframe into a csv file.
So far I was able to use the code below and bring the maximum and minimum values into a separate dataframe.
dff = pd.read_csv("n1p2.csv")
df = pd.DataFrame(dff)
df['timeStamp'] = pd.to_datetime(df['timeStamp'])
df = df.set_index('timeStamp')
val = df.loc[df.groupby(df.index.dayofyear).idxmax().iloc[:, 2]]
The dataframe val has the maximum values of SItemp for each day. But how do I add it to the original dataframe and save the CSV file?
import pandas as pd
d = {'id': [1, 2, 3],
     'datatime': ['16-12-19 13:23', '16-12-19 17:45', '17-12-19 13:23'],
     'temp': [35, 41, 32]}
df = pd.DataFrame(data=d)
df = df.set_index('id')
df
Sample data:
datatime temp
id
1 16-12-19 13:23 35
2 16-12-19 17:45 41
3 17-12-19 13:23 32
df['NewDate'] = pd.to_datetime(df['datatime'],format='%d-%m-%y %H:%M')
df['Date Only'] = df['NewDate'].dt.date
df
Only Date:
datatime temp NewDate Date Only
id
1 16-12-19 13:23 35 2019-12-16 13:23:00 2019-12-16
2 16-12-19 17:45 41 2019-12-16 17:45:00 2019-12-16
3 17-12-19 13:23 32 2019-12-17 13:23:00 2019-12-17
Find minimum and maximum:
import numpy as np
data = (df.set_index('Date Only').groupby(level=0)['temp']
          .agg([('Minimum', np.min), ('Maximum', np.max)]))
data
Result:
Minimum Maximum
Date Only
2019-12-16 35 41
2019-12-17 32 32
I think the easiest from this point is to generate a date column in the original dataframe and then merge val into it.
I have not tested it, but it should be something like:
df['date'] = df.index.date
dfFull = df.merge(val, how='left', left_on='date', right_on='dayofyear')
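Alternatively, groupby().transform() broadcasts the daily extremes back onto every row, so no merge is needed at all. A tested sketch with a made-up timeStamp/SItemp layout (the column names are assumptions based on the question):

```python
import pandas as pd

df = pd.DataFrame({
    'timeStamp': pd.to_datetime(['2019-12-16 13:23', '2019-12-16 17:45',
                                 '2019-12-17 13:23']),
    'SItemp': [35, 41, 32],
}).set_index('timeStamp')

# transform() returns one value per row, aligned with the original frame
df['day_max'] = df.groupby(df.index.dayofyear)['SItemp'].transform('max')
df['day_min'] = df.groupby(df.index.dayofyear)['SItemp'].transform('min')
df.to_csv('with_extremes.csv')  # original rows plus the per-day extremes
```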
I read this Excel sheet (only the 'DATEHEUREMAX' column) with pandas using this command:
xdata = pd.read_excel('Data.xlsx', 'Data', usecols=['DATEHEUREMAX'])
Now I want to turn this df into a simplified df with only hour:min rounded up to 15 min. The main idea is to plot a histogram based on hour:min.
Consider the following DataFrame, with a single column, read as datetime (not string):
Dat
0 2019-06-03 12:07:00
1 2019-06-04 10:04:00
2 2019-06-05 11:42:00
3 2019-06-06 10:17:00
To round these dates to 15 mins run:
df['Dat2'] = df.Dat.dt.round('15min').dt.time.map(lambda s: str(s)[:-3])
The result is:
Dat Dat2
0 2019-06-03 12:07:00 12:00
1 2019-06-04 10:04:00 10:00
2 2019-06-05 11:42:00 11:45
3 2019-06-06 10:17:00 10:15
For demonstration purpose, I saved the result in a new column, but you can
save it in the original column.
I think this is what you are asking for:
rounded_column = df['time_column'].dt.round('15min').dt.strftime("%H:%M")
although I agree with the commenters that you might not really need to do this and could just use a time-based Grouper instead.
There is no need to round your column in order to get a histogram of dates with your DATEHEUREMAX column. For this purpose you can just make use of pd.Grouper as detailed below.
Toy sample code
You can work out this example to get a solution with your date column:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Generating a sample of 10000 timestamps and selecting 500 to randomize them
df = pd.DataFrame(np.random.choice(pd.date_range(start=pd.to_datetime('2015-01-14'),periods = 10000, freq='S'), 500), columns=['date'])
# Setting the date as the index since the TimeGrouper works on Index, the date column is not dropped to be able to count
df.set_index('date', drop=False, inplace=True)
# Getting the histogram
df.groupby(pd.Grouper(freq='15Min')).count().plot(kind='bar')
plt.show()
This code produces a bar chart of counts per 15-minute bin.
Solution with your data
For your data you should be able to do something like:
import pandas as pd
import matplotlib.pyplot as plt
xdata = pd.read_excel('Data.xlsx', 'Data', usecols=['DATEHEUREMAX'])
xdata.set_index('DATEHEUREMAX', drop=False, inplace=True)
xdata.groupby(pd.Grouper(freq='15Min')).count().plot(kind='bar')
plt.show()
I have a dataset which looks like below
[25/May/2015:23:11:15 000]
[25/May/2015:23:11:15 000]
[25/May/2015:23:11:16 000]
[25/May/2015:23:11:16 000]
Now I have made this into a DF; df[0] has [25/May/2015:23:11:15 and df[1] has 000]. I want to send all the rows that end with the same seconds value to a file. In the above example they end with 15 and 16 seconds, so all rows ending with 15 seconds go into one file, those ending with 16 into another, and so on.
I have tried the below code
import pandas as pd
data = pd.read_csv('apache-access-log.txt', sep=" ", header=None)
df = pd.DataFrame(data)
print(df[0],df[1].str[-2:])
Converting that column to a datetime would make it easier to work on, e.g.:
df['date'] = pd.to_datetime(df['date'], format='%d/%b/%Y:%H:%M:%S')
Then you can simply iterate over a groupby(), e.g.:
In []:
for k, frame in df.groupby(df['date'].dt.second):
    # frame.to_csv('file{}.csv'.format(k))
    print('{}\n{}\n'.format(k, frame))
Out[]:
15
                 date  value
0 2015-05-25 23:11:15      0
1 2015-05-25 23:11:15      0

16
                 date  value
2 2015-05-25 23:11:16      0
3 2015-05-25 23:11:16      0
You can set your datetime as the index for the dataframe, and then use loc and to_csv Pandas' functions. Obviously, as other answers points out, you should convert your date to datetime while reading your dataframe.
Example:
df = df.set_index(['date'])
df.loc['2015-05-25 23:11:15':'2015-05-25 23:11:15'].to_csv('df_data.csv')
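A minimal, self-contained sketch of that idea (the timestamps are invented to mirror the question's log lines):

```python
import pandas as pd

df = pd.DataFrame({'value': [0, 0, 0, 0]},
                  index=pd.to_datetime(['2015-05-25 23:11:15', '2015-05-25 23:11:15',
                                        '2015-05-25 23:11:16', '2015-05-25 23:11:16']))

# A datetime-string slice on a sorted DatetimeIndex is inclusive on both ends
subset = df.loc['2015-05-25 23:11:15':'2015-05-25 23:11:15']
subset.to_csv('df_data.csv')
```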
Try this:
## Convert a new column with the seconds value
df['seconds'] = df.apply(lambda row: row[0].split(":")[3].split(" ")[0], axis=1)
for sec in df['seconds'].unique():
    ## filter by seconds
    print("Result:", df[df['seconds'] == sec])
I have a pandas Dataframe that is indexed by Date. I would like to select all consecutive gaps by period and all consecutive days by Period. How can I do this?
Example of Dataframe with No Columns but a Date Index:
In [29]: import pandas as pd
In [30]: dates = pd.to_datetime(['2016-09-19 10:23:03', '2016-08-03 10:53:39','2016-09-05 11:11:30', '2016-09-05 11:10:46','2016-09-05 10:53:39'])
In [31]: ts = pd.DataFrame(index=dates)
As you can see there is a gap from 2016-08-03 to 2016-09-19. How do I detect these gaps so I can create descriptive statistics, i.e. 40 gaps with a median gap duration of "x", etc.? Also, I can see that 2016-09-05 and 2016-09-06 form a two-day range. How can I detect these and also print descriptive stats?
Ideally the result would be returned as another Dataframe in each case since I want use other columns in the Dataframe to groupby.
Pandas has a built-in method Series.diff() which you can use to accomplish this. One benefit is that you can use pandas series functions like mean() to quickly compute summary statistics on the resulting series of gaps.
from datetime import datetime, timedelta
import pandas as pd
# Construct dummy dataframe
dates = pd.to_datetime([
    '2016-08-03',
    '2016-08-04',
    '2016-08-05',
    '2016-08-17',
    '2016-09-05',
    '2016-09-06',
    '2016-09-07',
    '2016-09-19'])
df = pd.DataFrame(dates, columns=['date'])
# Take the diff of the first column (drop 1st row since it's undefined)
deltas = df['date'].diff()[1:]
# Filter diffs (here days > 1, but could be seconds, hours, etc)
gaps = deltas[deltas > timedelta(days=1)]
# Print results
print(f'{len(gaps)} gaps with average gap duration: {gaps.mean()}')
for i, g in gaps.items():
    gap_start = df['date'][i - 1]
    print(f'Start: {datetime.strftime(gap_start, "%Y-%m-%d")} | '
          f'Duration: {str(g.to_pytimedelta())}')
Here's something to get started:
df = pd.DataFrame(np.ones(5),columns = ['ones'])
df.index = pd.DatetimeIndex(['2016-09-19 10:23:03', '2016-08-03 10:53:39', '2016-09-05 11:11:30', '2016-09-05 11:10:46', '2016-09-06 10:53:39'])
daily_rng = pd.date_range('2016-08-03 00:00:00', periods=48, freq='D')
daily_rng = daily_rng.append(df.index)
daily_rng = sorted(daily_rng)
df = df.reindex(daily_rng).fillna(0)
df = df.astype(int)
df['ones'] = df['ones'].cumsum()
The cumsum() creates a grouping variable on 'ones' partitioning your data at the points your provided. If you print df to say a spreadsheet it will make sense:
print(df.head())
ones
2016-08-03 00:00:00 0
2016-08-03 10:53:39 1
2016-08-04 00:00:00 1
2016-08-05 00:00:00 1
2016-08-06 00:00:00 1
print(df.tail())
ones
2016-09-16 00:00:00 4
2016-09-17 00:00:00 4
2016-09-18 00:00:00 4
2016-09-19 00:00:00 4
2016-09-19 10:23:03 5
now to complete:
df = df.reset_index()
df = df.groupby(['ones']).aggregate({'ones':{'gaps':'count'},'index':{'first_spotted':'min'}})
df.columns = df.columns.droplevel()
which gives:
first_spotted gaps
ones
0 2016-08-03 00:00:00 1
1 2016-08-03 10:53:39 34
2 2016-09-05 11:10:46 1
3 2016-09-05 11:11:30 2
4 2016-09-06 10:53:39 14
5 2016-09-19 10:23:03 1
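Note that the nested-dict aggregate syntax above is deprecated in modern pandas; the same table can be produced with named aggregation. A sketch of the same steps, checked against the output above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.ones(5), columns=['ones'])
df.index = pd.DatetimeIndex(['2016-09-19 10:23:03', '2016-08-03 10:53:39',
                             '2016-09-05 11:11:30', '2016-09-05 11:10:46',
                             '2016-09-06 10:53:39'])
daily_rng = pd.date_range('2016-08-03', periods=48, freq='D').append(df.index)
df = df.reindex(sorted(daily_rng)).fillna(0).astype(int)
df['ones'] = df['ones'].cumsum()

# Named aggregation replaces the deprecated {'col': {'name': func}} form
out = (df.reset_index()
         .groupby('ones')
         .agg(first_spotted=('index', 'min'), gaps=('index', 'count')))
print(out)
```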
I am plotting several pandas series objects of "total events per week". The data in the series events_per_week looks like this:
Datetime
1995-10-09 45
1995-10-16 63
1995-10-23 83
1995-10-30 91
1995-11-06 101
Freq: W-SUN, dtype: int64
My problem is as follows: all the pandas series are the same length, i.e. beginning in 1995, except one. events_per_week2003 begins in 2003:
Datetime
2003-09-08 25
2003-09-15 36
2003-09-22 74
2003-09-29 25
2003-09-05 193
Freq: W-SUN, dtype: int64
import matplotlib.pyplot as plt
fig = plt.figure(figsize=(20,5))
ax = plt.subplot(111)
plt.plot(events_per_week)
plt.plot(events_per_week2003)
I get the following value error.
ValueError: setting an array element with a sequence.
How can I do this?
I really don't get where you're having problems.
I tried to recreate a piece of the dataframe, and it plotted with no problems.
import numpy
import pandas as pd
import matplotlib.pyplot
data = numpy.array([45,63,83,91,101])
df1 = pd.DataFrame(data, index=pd.date_range('2005-10-09', periods=5, freq='W'), columns=['events'])
df2 = pd.DataFrame(numpy.arange(10,21,2), index=pd.date_range('2003-01-09', periods=6, freq='W'), columns=['events'])
matplotlib.pyplot.plot(df1.index, df1.events)
matplotlib.pyplot.plot(df2.index, df2.events)
matplotlib.pyplot.show()
Using Series instead of Dataframe:
ds1 = pd.Series(data, index=pd.date_range('2005-10-09', periods=5, freq='W'))
ds2 = pd.Series(numpy.arange(10,21,2), index=pd.date_range('2003-01-09', periods=6, freq='W'))
matplotlib.pyplot.plot(ds1)
matplotlib.pyplot.plot(ds2)
matplotlib.pyplot.show()