Type issue with histogram plot in Python

I have some data like this:
t1['Time Delay']
Out[175]:
746 0 days
747 0 days
873 0 days
874 0 days
906 8 days
907 0 days
908 0 days
909 0 days
Name: Time Delay, dtype: timedelta64[ns]
And another column as:
t1['Outcome']
Out[176]:
746 0.0
747 0.0
758 0.0
762 0.0
1422 1.0
1685 0.0
1909 0.0
1913 0.0
Name: Outcome, dtype: float64
I'm trying to plot them as histograms, following pandas - histogram from two columns?, but a bunch of type errors are coming up.
If I check type(t1['Time Delay']) it gives pandas.core.series.Series, and the same for the other column.
Do I need to convert the timedelta values to float?
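A minimal sketch of one common fix (assuming matplotlib is available and the column names above): convert the timedelta column to a float number of days before calling hist.
import pandas as pd
import matplotlib.pyplot as plt
# toy data shaped like the question (values assumed for illustration)
t1 = pd.DataFrame({'Time Delay': pd.to_timedelta([0, 0, 8, 0], unit='D'),
                   'Outcome': [0.0, 0.0, 1.0, 0.0]})
# timedelta64 columns cannot be histogrammed directly; convert to float days first
delay_days = t1['Time Delay'].dt.total_seconds() / 86400
fig, axes = plt.subplots(1, 2)
axes[0].hist(delay_days)
axes[0].set_title('Time Delay (days)')
axes[1].hist(t1['Outcome'])
axes[1].set_title('Outcome')
plt.show()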

Related

Is it possible to convert integers to time intervals in a Pandas pivot table?

This is my pivot DataFrame:
Name Tutor Student
Date
2021-04-12 310 112
2021-04-13 394 210
2021-04-14 357 3
2021-04-15 359 0
2021-04-16 392 0
2021-04-17 307 0
2021-04-18 335 0
2021-04-19 0 121
The values under the Tutor and Student columns are integers representing the number of seconds.
Is it possible to convert these values to time intervals like Python's datetime.timedelta?
It's not entirely clear what output you are looking for.
We can leverage the pd.to_timedelta() method to convert seconds to timedeltas.
Solution
df.iloc[:].apply(pd.to_timedelta, unit='s')
(This assumes you want every column in df converted to timedeltas; if not, select the relevant columns by name with df.loc - see the sketch after the dry run below.)
Dry run on provided input:
Input
Name Tutor Student
Date
2021-04-12 310 112
2021-04-13 394 210
2021-04-14 357 3
2021-04-15 359 0
2021-04-16 392 0
2021-04-17 307 0
2021-04-18 335 0
2021-04-19 0 121
Output
Name Tutor Student
Date
2021-04-12 0 days 00:05:10 0 days 00:01:52
2021-04-13 0 days 00:06:34 0 days 00:03:30
2021-04-14 0 days 00:05:57 0 days 00:00:03
2021-04-15 0 days 00:05:59 0 days 00:00:00
2021-04-16 0 days 00:06:32 0 days 00:00:00
2021-04-17 0 days 00:05:07 0 days 00:00:00
2021-04-18 0 days 00:05:35 0 days 00:00:00
2021-04-19 0 days 00:00:00 0 days 00:02:01
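A small sketch of the column-selection variant mentioned above (column names and sample seconds taken from the question); assigning back by column name converts only the listed columns:
import pandas as pd
df = pd.DataFrame({'Tutor': [310, 394, 357], 'Student': [112, 210, 3]},
                  index=pd.to_datetime(['2021-04-12', '2021-04-13', '2021-04-14']))
cols = ['Tutor', 'Student']                      # only these columns hold seconds
df[cols] = df[cols].apply(pd.to_timedelta, unit='s')
print(df)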
try this:
df["Tutor"] = pd.to_datetime(df["Tutor"], unit='s').dt.time
df["Student"] = pd.to_datetime(df["Student"], unit='s').dt.time
Result:
Name Tutor Student
1 2021-04-12 00:05:10 00:01:52
2 2021-04-13 00:06:34 00:03:30
3 2021-04-14 00:05:57 00:00:03
4 2021-04-15 00:05:59 00:00:00
5 2021-04-16 00:06:32 00:00:00
6 2021-04-17 00:05:07 00:00:00
7 2021-04-18 00:05:35 00:00:00
8 2021-04-19 00:00:00 00:02:01
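One caveat, as a hedged side note: .dt.time yields plain datetime.time objects, so the values lose timedelta arithmetic, whereas pd.to_timedelta keeps it:
import pandas as pd
s = pd.to_timedelta([310, 394], unit='s')
print(s.sum())    # timedeltas can still be summed: 0 days 00:11:44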

Why does applying the xlrd.xldate_as_datetime() function not update a subset of the dataframe as expected?

I have extracted a dataframe from an Excel file that has a datetime column, but a few values are still in Excel's serial date format, like so:
import pandas as pd
import numpy as np
import xlrd
rnd = np.random.randint(0,1000,size=(10, 1))
test = pd.DataFrame(data=rnd,index=range(0,10),columns=['rnd'])
test['Date'] = pd.date_range(start='1/1/1979', periods=len(test), freq='D')
r1 = np.random.randint(0,5)
r2 = np.random.randint(6,10)
test.loc[r1, 'Date'] = 44305
test.loc[r2, 'Date'] = 44287
test
rnd Date
0 56 1979-01-01 00:00:00
1 557 1979-01-02 00:00:00
2 851 1979-01-03 00:00:00
3 553 44305
4 258 1979-01-05 00:00:00
5 946 1979-01-06 00:00:00
6 930 1979-01-07 00:00:00
7 805 1979-01-08 00:00:00
8 362 44287
9 705 1979-01-10 00:00:00
When I attempt to convert the errant dates using the xlrd.xldate_as_datetime function in isolation I get a series with the correct format:
# Identifying the index of dates in int format
idx_ints = test[test['Date'].map(lambda x: isinstance(x, int))].index
test.loc[idx_ints, 'Date'].map(lambda x: xlrd.xldate_as_datetime(x, 0))
3 2021-04-19
8 2021-04-01
Name: Date, dtype: datetime64[ns]
However when I attempt to apply the change in place I get a wildly different int:
test.loc[idx_ints,'Date'] = test.loc[idx_ints, 'Date'].map(lambda x: xlrd.xldate_as_datetime(x, 0))
test
rnd Date
0 56 1979-01-01 00:00:00
1 557 1979-01-02 00:00:00
2 851 1979-01-03 00:00:00
3 553 1618790400000000000
4 258 1979-01-05 00:00:00
5 946 1979-01-06 00:00:00
6 930 1979-01-07 00:00:00
7 805 1979-01-08 00:00:00
8 362 1617235200000000000
9 705 1979-01-10 00:00:00
Any ideas, or perhaps an alternative solution to my date/int conversion problem? Thanks!
Reversing the logic from the answer I linked, this works fine for your test df:
# where you have numeric values, i.e. "excel datetime format":
nums = pd.to_numeric(test['Date'], errors='coerce') # timestamps will give NaN here
# now first convert the excel dates:
test.loc[nums.notna(), 'datetime'] = pd.to_datetime(nums[nums.notna()], unit='d', origin='1899-12-30')
# ...and the other, "parseable" timestamps:
test.loc[nums.isna(), 'datetime'] = pd.to_datetime(test['Date'][nums.isna()])
test
rnd Date datetime
0 840 44305 2021-04-19
1 298 1979-01-02 00:00:00 1979-01-02
2 981 1979-01-03 00:00:00 1979-01-03
3 806 1979-01-04 00:00:00 1979-01-04
4 629 1979-01-05 00:00:00 1979-01-05
5 540 1979-01-06 00:00:00 1979-01-06
6 499 1979-01-07 00:00:00 1979-01-07
7 155 1979-01-08 00:00:00 1979-01-08
8 208 44287 2021-04-01
9 737 1979-01-10 00:00:00 1979-01-10
If your input already has datetime objects instead of timestamp strings, you could skip the conversion and just transfer the values to the new column I think.
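As a quick check, the "wildly different int" appears to be the converted datetime simply re-expressed as nanoseconds since the Unix epoch when it is written back into the mixed-type column:
import pandas as pd
print(pd.Timestamp(1618790400000000000))    # 2021-04-19 00:00:00
print(pd.Timestamp('2021-04-19').value)     # 1618790400000000000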

Parsing week of year to datetime objects with pandas

A B C D yearweek
0 245 95 60 30 2014-48
1 245 15 70 25 2014-49
2 150 275 385 175 2014-50
3 100 260 170 335 2014-51
4 580 925 535 2590 2015-02
5 630 126 485 2115 2015-03
6 425 90 905 1085 2015-04
7 210 670 655 945 2015-05
The last column contains the year along with the week number. Is it possible to convert this to a datetime column with pd.to_datetime?
I've tried:
pd.to_datetime(df.yearweek, format='%Y-%U')
0 2014-01-01
1 2014-01-01
2 2014-01-01
3 2014-01-01
4 2015-01-01
5 2015-01-01
6 2015-01-01
7 2015-01-01
Name: yearweek, dtype: datetime64[ns]
But that output is incorrect, even though I believe %U is the format code for the week number. What am I missing here?
You need another component to specify the day of the week - check this:
df = pd.to_datetime(df.yearweek.add('-0'), format='%Y-%W-%w')
print (df)
0 2014-12-07
1 2014-12-14
2 2014-12-21
3 2014-12-28
4 2015-01-18
5 2015-01-25
6 2015-02-01
7 2015-02-08
Name: yearweek, dtype: datetime64[ns]
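If the week numbers are meant as ISO weeks, a hedged variant of the same trick (recent pandas/Python also accept the ISO directives %G, %V and %u, provided all three are used together) would be:
import pandas as pd
s = pd.Series(['2014-48', '2014-49', '2015-02'])
# append an ISO weekday (1 = Monday) and parse with the ISO year/week directives
print(pd.to_datetime(s.add('-1'), format='%G-%V-%u'))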

Missing data count with Pandas

I have a pandas.DataFrame of time series (all columns are cast to float), indexed by a DatetimeIndex (granularity/frequency is about 1 hour) for the rows and a MultiIndex for the columns. There are missing data within the series (but no missing rows; the frequency is regular). I would like to compute an acquisition performance (a percentage) by month.
def mapMonth(x):
    return x.replace(day=1, hour=0, minute=0, second=0, microsecond=0)
c = data.groupby(mapMonth).count()
The above code seems to count values ignoring NaN which is what I want. Now I would like to divide this aggregated DataFrame by the expected count.
n = pd.DataFrame(np.full((data.shape[0],), 1, dtype=float), index=data.index).groupby(mapMonth).sum()
It gives me the expected data count by month, but I find this approach very hacky.
In any case, I could not succeed in dividing the DataFrame c by n using:
p = c.div(n, axis=0)
DataFrames look like:
networkkey RTU
measurandkey NO2
sitekey 41B001 41B004 41B006 41B008 41B011 41MEU1 41N043 41R001 41R002
channelid 280 27 38 55 59 86 103 122 168
2012-01-01 0 728 728 0 728 732 728 728 728
2012-02-01 0 679 678 0 680 686 681 681 679
2012-03-01 0 728 727 0 727 720 726 728 722
2012-04-01 0 705 698 0 702 710 699 705 701
2012-05-01 0 728 728 0 726 728 725 724 680
2012-06-01 0 703 700 0 701 710 705 705 705
2012-07-01 0 728 728 0 728 657 707 728 728
0
2012-01-01 744.0
2012-02-01 696.0
2012-03-01 744.0
2012-04-01 720.0
2012-05-01 744.0
2012-06-01 720.0
2012-07-01 744.0
2012-08-01 744.0
2012-09-01 720.0
2012-10-01 744.0
2012-11-01 720.0
2012-12-01 744.0
I suspect the problem is caused by the MultiIndex. In any case, I do not find this method straightforward.
Is there a cleaner/cleverer way to compute this aggregate with Pandas?
I finally found the size function, which does not ignore NaN. The following code therefore performs what I want in a few lines:
# Group Data:
g = data.groupby(mapMonth)
# Compute Performance
c = g.count()
n = g.size()
d = c.div(n, axis=0)
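A self-contained sketch of the same count/size idea on hypothetical hourly data (the column name and gap positions are made up for illustration):
import numpy as np
import pandas as pd
idx = pd.date_range('2012-01-01', '2012-02-29 23:00', freq='h')
data = pd.DataFrame({'NO2': np.random.rand(len(idx))}, index=idx)
data.iloc[5:50, 0] = np.nan                     # simulate missing acquisitions
def mapMonth(x):
    return x.replace(day=1, hour=0, minute=0, second=0, microsecond=0)
g = data.groupby(mapMonth)
performance = g.count().div(g.size(), axis=0)   # fraction of non-NaN values per month
print(performance)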

Grouping daily data by month in python/pandas while first grouping by user id

I have the table below in a Pandas dataframe:
date user_id whole_cost cost1
02/10/2012 00:00:00 1 1790 12
07/10/2012 00:00:00 1 364 15
30/01/2013 00:00:00 1 280 10
02/02/2013 00:00:00 1 259 24
05/03/2013 00:00:00 1 201 39
02/10/2012 00:00:00 3 623 1
07/12/2012 00:00:00 3 90 0
30/01/2013 00:00:00 3 312 90
02/02/2013 00:00:00 5 359 45
05/03/2013 00:00:00 5 301 34
02/02/2013 00:00:00 5 359 1
05/03/2013 00:00:00 5 801 12
..
The table was extracted from a csv file using the following code:
import pandas as pd
newnames = ['date','user_id', 'whole_cost', 'cost1']
df = pd.read_csv('expenses.csv', names = newnames, index_col = 'date')
I have to analyse the profile of my users and for this purpose:
I would like to group the entries by month for each user (there are thousands of them), summing whole_cost over the entire month. For example, user_id=1 has a whole_cost of 1790 on 02/10/2012 (cost1 12) and of 364 on 07/10/2012, so the new table should have an entry of 2154 (the summed whole cost) on 31/10/2012. All dates in the transformed table will be month ends representing the whole month to which they relate.
In 0.14 you'll be able to groupby monthly and another column at the same time:
In [11]: df
Out[11]:
user_id whole_cost cost1
2012-10-02 1 1790 12
2012-10-07 1 364 15
2013-01-30 1 280 10
2013-02-02 1 259 24
2013-03-05 1 201 39
2012-10-02 3 623 1
2012-12-07 3 90 0
2013-01-30 3 312 90
2013-02-02 5 359 45
2013-03-05 5 301 34
2013-02-02 5 359 1
2013-03-05 5 801 12
In [12]: df1 = df.sort_index() # requires sorted DatetimeIndex
In [13]: df1.groupby([pd.TimeGrouper(freq='M'), 'user_id'])['whole_cost'].sum()
Out[13]:
user_id
2012-10-31 1 2154
3 623
2012-12-31 3 90
2013-01-31 1 280
3 312
2013-02-28 1 259
5 718
2013-03-31 1 201
5 1102
Name: whole_cost, dtype: int64
until 0.14 I think you're stuck with doing two groupbys:
In [14]: g = df.groupby('user_id')['whole_cost']
In [15]: g.resample('M', how='sum').dropna()
Out[15]:
user_id
1 2012-10-31 2154
2013-01-31 280
2013-02-28 259
2013-03-31 201
3 2012-10-31 623
2012-12-31 90
2013-01-31 312
5 2013-02-28 718
2013-03-31 1102
dtype: float64
With TimeGrouper being deprecated, you can replace it with pd.Grouper to get the same results:
df.groupby(['user_id', pd.Grouper(key='date', freq='M')]).agg({'whole_cost':sum})
df.groupby(['user_id', df['date'].dt.dayofweek]).agg({'whole_cost':sum})
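Putting the modern replacement together as a runnable sketch (toy rows shaped like the question; in pandas 2.2+ the month-end alias is spelled 'ME' rather than 'M'):
import pandas as pd
df = pd.DataFrame({
    'date': pd.to_datetime(['02/10/2012', '07/10/2012', '30/01/2013', '02/10/2012'],
                           format='%d/%m/%Y'),
    'user_id': [1, 1, 1, 3],
    'whole_cost': [1790, 364, 280, 623],
})
# group per user and by month end, summing whole_cost, as in the answers above
monthly = df.groupby(['user_id', pd.Grouper(key='date', freq='M')])['whole_cost'].sum()
print(monthly)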
