Pandas, mean value of multiple columns after groupby on two columns - python

I have a dataframe that looks like the following:
Date Station_nr BD_val TEMIS_val
0 2003-01-01 29 284.8 291.0
1 2003-01-02 29 302.5 291.0
2 2003-01-03 29 306.5 291.0
3 2003-01-04 29 306.8 291.0
4 2003-01-05 29 324.0 291.0
... ... ... ... ...
3539 2004-01-27 478 285.2 293.0
3540 2004-01-28 478 289.7 293.0
3541 2004-01-29 478 290.9 293.0
3542 2004-01-30 478 289.6 293.0
3543 2004-01-31 478 289.5 281.0
I want to get the monthly mean value of both BD_val and TEMIS_val for every station there is.
So far I have used groupby on two of the columns, and then wanted to select both value columns to take the mean from, using the following method:
cols = ['BD_val', 'TEMIS_val']
comp_df.groupby([pd.PeriodIndex(comp_df['Date'], freq="M"), comp_df['Station_nr']])[cols].mean()
But this just returns the mean value of BD_val, not both columns:
Date Station_nr BD_val
2003-01 29 295.448387
57 282.258065
101 310.516129
111 268.071429
232 289.806452
... ... ...
2003-12 400 294.733333
454 298.176667
473 308.433333
478 309.306667
2004-01 478 291.330000
How do I get the mean values of both columns?
Note: on a sample dataframe this method does work, so I'm not sure why it fails on this particular one. The sample dataframe where it works is shown below for reference.
# Own made sample dataframe where this method does work.
import numpy as np
import pandas as pd

rng = pd.date_range('2015-02-24', periods=100, freq='D')
df = pd.DataFrame({'Date': rng,
                   'Station': range(len(rng)),
                   'Val1': np.random.randn(len(rng)),
                   'Val2': np.random.randn(len(rng))})
cols = ['Val1', 'Val2']
df.groupby([pd.PeriodIndex(df['Date'], freq="M"), df['Station']])[cols].mean()
To be clear, the code above is how it should work; it just doesn't work on my dataframe, and I want to know what the reason could be.

The problem was that when the dataframe was created, the columns were stored as objects rather than all having the same (numeric) datatype:
>>> comp_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3544 entries, 0 to 3543
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Date 3544 non-null object
1 Station_nr 3544 non-null int64
2 BD_val 3544 non-null float64
3 TEMIS_val 3544 non-null object
So beforehand I need to make sure that both value columns are float, either with pd.to_numeric or, as I did, with .astype.
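For example, a minimal sketch (column names taken from the question; pd.to_numeric shown here, .astype(float) works the same way):
# Make sure both value columns are numeric before grouping; otherwise the
# object-typed column is silently dropped from the mean.
comp_df['TEMIS_val'] = pd.to_numeric(comp_df['TEMIS_val'], errors='coerce')
cols = ['BD_val', 'TEMIS_val']
comp_df.groupby([pd.PeriodIndex(comp_df['Date'], freq='M'),
                 comp_df['Station_nr']])[cols].mean()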

Related

How to convert the data type from object to numeric and then find the mean for each row in pandas? E.g. convert '<17,500, >=15,000' to 16250 (the mean value)

data['family_income'].value_counts()
>=35,000 2517
<27,500, >=25,000 1227
<30,000, >=27,500 994
<25,000, >=22,500 833
<20,000, >=17,500 683
<12,500, >=10,000 677
<17,500, >=15,000 634
<15,000, >=12,500 629
<22,500, >=20,000 590
<10,000, >= 8,000 563
< 8,000, >= 4,000 402
< 4,000 278
Unknown 128
The column should be shown as a mean value instead of a range of values:
data['family_income']
0 <17,500, >=15,000
1 <27,500, >=25,000
2 <30,000, >=27,500
3 <15,000, >=12,500
4 <30,000, >=27,500
...
10150 <30,000, >=27,500
10151 <25,000, >=22,500
10152 >=35,000
10153 <10,000, >= 8,000
10154 <27,500, >=25,000
Name: family_income, Length: 10155, dtype: object
Expected output: the mean-imputed values
0 16250
1 26250
3 28750
...
10152 35000
10153 9000
10154 26500
data['family_income']=data['family_income'].str.replace(',', ' ').str.replace('<',' ')
data[['income1','income2']] = data['family_income'].apply(lambda x: pd.Series(str(x).split(">=")))
data['income1']=pd.to_numeric(data['income1'], errors='coerce')
data['income1']
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
..
10150 NaN
10151 NaN
10152 NaN
10153 NaN
10154 NaN
Name: income1, Length: 10155, dtype: float64
In this case, converting the datatype from object to numeric doesn't seem to work, since all the values are returned as NaN. So how do I convert to a numeric data type and find the mean-imputed values?
You can use the following snippet:
# Importing Dependencies
import pandas as pd
import string
# Replicating Your Data
data = ['<17,500, >=15,000', '<27,500, >=25,000', '< 4,000 ', '>=35,000']
df = pd.DataFrame(data, columns = ['family_income'])
# Removing punctuation from family_income column
df['family_income'] = df['family_income'].apply(lambda x: x.translate(str.maketrans('', '', string.punctuation)))
# Splitting ranges to two columns A and B
df[['A', 'B']] = df['family_income'].str.split(' ', n=1, expand=True)
# Converting cols A and B to float (coercing empty strings to NaN)
df[['A', 'B']] = df[['A', 'B']].apply(pd.to_numeric, errors='coerce')
# Creating mean column from A and B
df['mean'] = df[['A', 'B']].mean(axis=1)
# Input DataFrame
family_income
0 <17,500, >=15,000
1 <27,500, >=25,000
2 < 4,000
3 >=35,000
# Result DataFrame
mean
0 16250.0
1 26250.0
2 4000.0
3 35000.0
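An alternative sketch using a single regex pass (this assumes df['family_income'] still holds the raw strings such as '<17,500, >=15,000'): pull every number out of each cell and average them, which also covers the one-sided '< 4,000' and '>=35,000' cases; 'Unknown' simply ends up as NaN.
import pandas as pd

# Extract every number per row, then average the numbers back per original row.
nums = (df['family_income']
        .str.replace(',', '', regex=False)      # '17,500' -> '17500'
        .str.extractall(r'(\d+)')[0]            # one row per number found
        .astype(float))
df['mean'] = nums.groupby(level=0).mean()       # rows without numbers stay NaN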

How to count values in a dataframe column based on filtering

Given this dataframe :
DriverId time SPEED
0 2021-04-16 21:40:00+00:00 58.500000
2021-04-16 21:41:00+00:00 32.850000
2021-04-16 21:42:00+00:00 89.633333
2021-04-16 21:43:00+00:00 88.166667
2021-04-16 21:44:00+00:00 118.016667
... ... ...
88 2021-04-27 07:30:00+00:00 79.566667
2021-04-27 07:31:00+00:00 59.383333
2021-04-27 07:32:00+00:00 89.133333
2021-04-27 07:33:00+00:00 59.966667
2021-04-27 07:34:00+00:00 25.72413
I want to add a column counting the number of speed readings under 40 km/h for each driver, so I've tried this:
y[y.SPEED<40].count()
which shows this:
SPEED 4721
dtype: int64
and it is not exactly what I want; the expected result should look like this:
DriverId SPEED count
0 15.20 2
32.850000
89.633333
88.166667
118.016667
... ... ...
88 79.566667 1
59.383333
89.133333
59.966667
25.72413
My dataframe was a Series which I transformed into a DataFrame:
y.info()
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 15082 entries, (0, Timestamp('2021-04-16 21:40:00+0000', tz='UTC')) to (88, Timestamp('2021-04-27 07:34:00+0000', tz='UTC'))
Data columns (total 1 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 SPEED 15082 non-null float64
dtypes: float64(1)
memory usage: 922.5 KB
At first, I would have the DriverId in each row (rather than only in the first row of the group) and then try the following:
y["Count of speed<40 for given driver"] = [sum((y.DriverId == x) & (y["SPEED"] < 40)) for x in y.DriverId]
import pandas as pd

df = pd.DataFrame([['0', '2021-04-16 21:40:00+00:00', 58.500000],
                   ['0', '2021-04-16 21:41:00+00:00', 32.850000],    # FIRST ONE
                   ['0', '2021-04-16 21:42:00+00:00', 15.633333],    # SECOND ONE
                   ['0', '2021-04-16 21:43:00+00:00', 88.166667],
                   ['0', '2021-04-16 21:44:00+00:00', 118.016667],
                   ['88', '2021-04-27 07:30:00+00:00', 79.566667],
                   ['88', '2021-04-27 07:31:00+00:00', 59.383333],
                   ['88', '2021-04-27 07:32:00+00:00', 89.133333],
                   ['88', '2021-04-27 07:33:00+00:00', 59.966667],
                   ['88', '2021-04-27 07:34:00+00:00', 25.72413]     # THIRD ONE
                  ], columns=['driver_id', 'time', 'speed'])
df = df.set_index("driver_id")
counts = df[df['speed'] < 40].groupby(["driver_id",],as_index=False).agg(
count_col=pd.NamedAgg(column="speed", aggfunc="count")
)
merged_Frame = pd.merge(df, counts, on = 'driver_id', how='inner')
Output
driver_id time speed count_col
0 0 2021-04-16 21:40:00+00:00 58.500000 2
1 0 2021-04-16 21:41:00+00:00 32.850000 2
2 0 2021-04-16 21:42:00+00:00 15.633333 2
3 0 2021-04-16 21:43:00+00:00 88.166667 2
4 0 2021-04-16 21:44:00+00:00 118.016667 2
5 88 2021-04-27 07:30:00+00:00 79.566667 1
6 88 2021-04-27 07:31:00+00:00 59.383333 1
7 88 2021-04-27 07:32:00+00:00 89.133333 1
8 88 2021-04-27 07:33:00+00:00 59.966667 1
9 88 2021-04-27 07:34:00+00:00 25.724130 1
Reference
pd.NamedAgg
Merge two data frames based on common column values in Pandas
Edit
import pandas as pd

df = pd.DataFrame([['0', '2021-04-16 21:40:00+00:00', 58.500000],
                   ['0', '2021-04-16 21:41:00+00:00', 32.850000],    # FIRST ONE
                   ['0', '2021-04-16 21:42:00+00:00', 15.633333],    # SECOND ONE
                   ['0', '2021-04-16 21:43:00+00:00', 88.166667],
                   ['0', '2021-04-16 21:44:00+00:00', 118.016667],
                   ['88', '2021-04-27 07:30:00+00:00', 79.566667],
                   ['88', '2021-04-27 07:31:00+00:00', 59.383333],
                   ['88', '2021-04-27 07:32:00+00:00', 89.133333],
                   ['88', '2021-04-27 07:33:00+00:00', 59.966667],
                   ['88', '2021-04-27 07:34:00+00:00', 25.72413]     # THIRD ONE
                  ], columns=['driver_id', 'time', 'speed'])
df = df.set_index(['driver_id', 'time'])
df['count'] = df[df['speed'] < 40].groupby('driver_id')['speed'].transform('count')
Output
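With this version, rows whose speed is 40 km/h or above end up with NaN in the count column, because the transform result only covers the filtered rows and is aligned back to the full frame by index. A small hedged follow-up in case zeros are preferred there:
# Fill the rows that did not match the filter (speed >= 40) with zero.
df['count'] = df['count'].fillna(0).astype(int)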

Pandas dataframe custom formatting string to time

I have a dataframe that looks like this
DEP_TIME
0 1851
1 1146
2 2016
3 1350
4 916
...
607341 554
607342 633
607343 657
607344 705
607345 628
I need to get every value in this column DEP_TIME to have the format hh:mm.
All cells are of type string and can remain that type.
Some cells are only missing the colon (rows 0 to 3), others are also missing the leading 0 (rows 4+).
Some cells are empty and should ideally have string value of 0.
I need to do it in an efficient way since I have a few million records. How do I do it?
Use to_datetime with Series.dt.strftime:
df['DEP_TIME'] = (pd.to_datetime(df['DEP_TIME'], format='%H%M', errors='coerce')
                    .dt.strftime('%H:%M')
                    .fillna('00:00'))
print (df)
DEP_TIME
0 18:51
1 11:46
2 20:16
3 13:50
4 09:16
607341 05:54
607342 06:33
607343 06:57
607344 07:05
607345 06:28
import re

import numpy as np
import pandas as pd

d = [['1851'],
     ['1146'],
     ['2016'],
     ['916'],
     ['814'],
     [''],
     [np.nan]]
df = pd.DataFrame(d, columns=['DEP_TIME'])
df['DEP_TIME'] = df['DEP_TIME'].fillna('0')
df['DEP_TIME'] = df['DEP_TIME'].apply(
    lambda y: '0' if y == '' else re.sub(r'(\d{1,2})(\d{2})$',
                                         lambda x: x[1].zfill(2) + ':' + x[2], y))
df
DEP_TIME
0 18:51
1 11:46
2 20:16
3 09:16
4 08:14
5 0
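If DEP_TIME really holds a few million rows, a vectorized, string-only approach may be faster than apply. A sketch, assuming DEP_TIME contains strings like '916' or '1851' with empty strings or NaN for missing values:
import numpy as np
import pandas as pd

# Pad each time to four digits, then slice in the colon; empty cells become '0'.
s = df['DEP_TIME'].fillna('').astype(str).str.strip()
padded = s.str.zfill(4)                                   # '916' -> '0916'
df['DEP_TIME'] = np.where(s != '', padded.str[:2] + ':' + padded.str[2:], '0')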

Value error when trying to create a new dataframe column with a function

I am running into a value error when trying to create a new column in my dataframe. It looks like this:
state veteran_pop pct_gulf pct_vietnam
0 Alaska 70458 20.0 31.2
1 Arizona 532634 8.8 15.8
2 Colorado 395350 10.1 20.8
3 Georgia 693809 10.8 21.8
4 Iowa 234659 7.1 13.7
So I have a function that looks like this:
def addProportions(table, col1, col2, new_col):
    for row, index in table.iterrows():
        table[new_col] = ((table[col1] + table[col2]) / 100)
    return table
Where table is the table above and col1 = "pct_gulf", col2 = "pct_vietnam", and new_col = "total_pct", like so:
addProportions(table, "pct_gulf", "pct_vietnam", "total_pct")
But when I run this function I get this error message:
ValueError: Wrong number of items passed 2, placement implies 1
--- Alternatively---
I have made my addProportions function like this:
def addProportions(table, col1, col2, new_col):
    table[new_col] = 0
    for row, index in table.iterrows():
        table[new_col] = ((table[col1] + table[col2]) / 100)
    return table
And I get this output, which seems like a step in the right direction.
state veteran_pop pct_gulf pct_vietnam total_pct
0 Alaska 70458 20.0 31.2 NaN
1 Arizona 532634 8.8 15.8 NaN
2 Colorado 395350 10.1 20.8 NaN
3 Georgia 693809 10.8 21.8 NaN
4 Iowa 234659 7.1 13.7 NaN
But the problem is that when I use type() on the two columns I'm trying to add, each one comes up as a DataFrame, and that's why I think I'm getting NaN.
---- Table Info ----
t.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 55 entries, 0 to 54
Data columns (total 4 columns):
(state,) 55 non-null object
(veteran_pop,) 55 non-null int64
(pct_gulf,) 55 non-null float64
(pct_vietnam,) 55 non-null float64
dtypes: float64(2), int64(1), object(1)
memory usage: 1.8+ KB
t.index
RangeIndex(start=0, stop=55, step=1)
t.columns
MultiIndex(levels=[[u'pct_gulf', u'pct_vietnam', u'state', u'veteran_pop']],
codes=[[2, 3, 0, 1]])
You don't need a loop. You only need (table is the name of your dataframe):
table.columns=table.columns.droplevel()
table['total_pct']=(table['pct_gulf']+table['pct_vietnam'])/100
print(table)
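If droplevel() complains that it cannot remove the only level (the columns here have just a single level), a hedged alternative is to flatten the tuple-style columns with get_level_values:
# Flatten the single-level MultiIndex columns, then add the new column as usual.
table.columns = table.columns.get_level_values(0)
table['total_pct'] = (table['pct_gulf'] + table['pct_vietnam']) / 100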
I think the problem is that you have a MultiIndex.
My DataFrame, when I construct one from your info, looks like this:
table = pd.DataFrame(data={"state": ["Alaska", "Arizona", "Colorado",
                                     "Georgia", "Iowa"],
                           "veteran_pop": [70458, 532634, 395350, 693809, 234659],
                           "pct_gulf": [20.0, 8.8, 10.1, 10.8, 7.1],
                           "pct_vietnam": [31.2, 15.8, 20.8, 21.8, 13.7]})
And table.info() shows this:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 5 columns):
state 5 non-null object
veteran_pop 5 non-null int64
pct_gulf 5 non-null float64
pct_vietnam 5 non-null float64
total_pct 5 non-null float64
dtypes: float64(3), int64(1), object(1)
memory usage: 280.0+ bytes
If I construct a MultiIndex, I get an error:
multi = pd.DataFrame(data={("state",): ["Alaska", "Arizona", "Colorado", "Georgia", "Iowa"],
                           ("veteran_pop",): [70458, 532634, 395350, 693809, 234659],
                           ("pct_gulf",): [20.0, 8.8, 10.1, 10.8, 7.1],
                           ("pct_vietnam",): [31.2, 15.8, 20.8, 21.8, 13.7]})
If I run addProportions(table, "pct_gulf", "pct_vietnam", "total_pct") on my regular DataFrame, I get the right answer:
state veteran_pop pct_gulf pct_vietnam total_pct
0 Alaska 70458 20.0 31.2 0.512
1 Arizona 532634 8.8 15.8 0.246
2 Colorado 395350 10.1 20.8 0.309
3 Georgia 693809 10.8 21.8 0.326
4 Iowa 234659 7.1 13.7 0.208
but running the same call on the MultiIndex version throws an error: with the tuple-keyed columns, table[col1] + table[col2] no longer produces a single Series, so assigning the result to one new column fails with the ValueError: Wrong number of items passed 2, placement implies 1 from the question.
Somehow, you ended up with a MultiIndex in your columns, even though you don't have hierarchical categories here. You'd only want one if you were breaking down percentages, for example by year:
columns = pd.MultiIndex.from_product([["percentage","veteran_pop"], ["army","navy"], ["2010", "2015"]])
pd.DataFrame( columns=columns, index=pd.RangeIndex(start=0, stop=5))
percentage veteran_pop
army navy army navy
2010 2015 2010 2015 2010 2015 2010 2015
0 NaN NaN NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN NaN NaN
...
You'll need to reshape your DataFrame to use the function you've written. The function works, but you have the wrong kind of index in your columns.
If you want to keep the data as multi-index, change the function to:
def addProportions(table, col1, col2, new_col):
    table[new_col] = ((table[(col1,)] + table[(col2,)]) / 100)
    # enable the return line if you need it
    # return table
If you want to reshape the data into a normal, single-level DataFrame:
def addProportions(table, col1, col2, new_col):
    table[new_col] = ((table[col1] + table[col2]) / 100)
    # enable the return line if you need it
    # return table

# build a new df without the multi-index columns
new_cols = [i[0] for i in multi.columns]
new_df = pd.DataFrame(multi.values, columns=new_cols)
# call the function
addProportions(new_df, "pct_gulf", "pct_vietnam", "total_pct")

pandas, dataframe, groupby, std

New to pandas here. A (trivial) problem: hosts, operations, execution times. I want to group by host, then by host+operation, calculate std deviation for execution time per host, then by host+operation pair. Seems simple?
It works for grouping by a single column:
df
Out[360]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 132564 entries, 0 to 132563
Data columns (total 9 columns):
datespecial 132564 non-null values
host 132564 non-null values
idnum 132564 non-null values
operation 132564 non-null values
time 132564 non-null values
...
dtypes: float32(1), int64(2), object(6)
byhost = df.groupby('host')
byhost.std()
Out[362]:
datespecial idnum time
host
ahost1.test 11946.961952 40367.033852 0.003699
host1.test 15484.975077 38206.578115 0.008800
host10.test NaN 37644.137631 0.018001
...
Nice. Now:
byhostandop = df.groupby(['host', 'operation'])
byhostandop.std()
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-364-2c2566b866c4> in <module>()
----> 1 byhostandop.std()
/home/username/anaconda/lib/python2.7/site-packages/pandas/core/groupby.pyc in std(self, ddof)
386 # todo, implement at cython level?
387 if ddof == 1:
--> 388 return self._cython_agg_general('std')
389 else:
390 f = lambda x: x.std(ddof=ddof)
/home/username/anaconda/lib/python2.7/site-packages/pandas/core/groupby.pyc in _cython_agg_general(self, how, numeric_only)
1615
1616 def _cython_agg_general(self, how, numeric_only=True):
-> 1617 new_blocks = self._cython_agg_blocks(how, numeric_only=numeric_only)
1618 return self._wrap_agged_blocks(new_blocks)
1619
/home/username/anaconda/lib/python2.7/site-packages/pandas/core/groupby.pyc in _cython_agg_blocks(self, how, numeric_only)
1653 values = com.ensure_float(values)
1654
-> 1655 result, _ = self.grouper.aggregate(values, how, axis=agg_axis)
1656
1657 # see if we can cast the block back to the original dtype
/home/username/anaconda/lib/python2.7/site-packages/pandas/core/groupby.pyc in aggregate(self, values, how, axis)
838 if is_numeric:
839 result = lib.row_bool_subset(result,
--> 840 (counts > 0).view(np.uint8))
841 else:
842 result = lib.row_bool_subset_object(result,
/home/username/anaconda/lib/python2.7/site-packages/pandas/lib.so in pandas.lib.row_bool_subset (pandas/lib.c:16540)()
ValueError: Buffer dtype mismatch, expected 'float64_t' but got 'float'
Huh?? Why do I get this exception?
More questions:
how do I calculate std deviation on dataframe.groupby([several columns])?
how can I limit calculation to a selected column? E.g. it obviously doesn't make sense to calculate std dev on dates/timestamps here.
It's important to know your version of pandas / Python. It looks like this exception could arise in pandas versions < 0.10 (see ValueError: Buffer dtype mismatch, expected 'float64_t' but got 'float'). To avoid it, you can cast your float32 column up to float64:
df['time'] = df['time'].astype('float64')
To calculate std() on selected columns, just select columns :)
>>> df = pd.DataFrame({'a':range(10), 'b':range(10,20), 'c':list('abcdefghij'), 'g':[1]*3 + [2]*3 + [3]*4})
>>> df
a b c g
0 0 10 a 1
1 1 11 b 1
2 2 12 c 1
3 3 13 d 2
4 4 14 e 2
5 5 15 f 2
6 6 16 g 3
7 7 17 h 3
8 8 18 i 3
9 9 19 j 3
>>> df.groupby('g')[['a', 'b']].std()
a b
g
1 1.000000 1.000000
2 1.000000 1.000000
3 1.290994 1.290994
Update
Digging a bit deeper, it looks like std() goes through the groupby Cython aggregation path and hits a subtle bug there (see Python Pandas: Using Aggregate vs Apply to define new columns). To avoid this, you can use apply():
byhostandop['time'].apply(lambda x: x.std())
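Putting the two parts together, a minimal sketch (column names taken from the question) that avoids the dtype issue and restricts the calculation to the column of interest:
# Cast the float32 column to float64 first, then group by both keys and take
# the std of 'time' only (dates/ids are left out of the aggregation).
df['time'] = df['time'].astype('float64')
byhostandop = df.groupby(['host', 'operation'])
print(byhostandop['time'].std())    # equivalently: byhostandop['time'].apply(lambda x: x.std())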
