dataframe groupby aggregation count function with condition for binning purpose - python

So I have a dataframe like this
df = pd.DataFrame({
    'A': [1,1,2,2,3,3,3],
    'B': [1,3,1,3,1,2,1],
    'C': [1,3,5,3,7,7,1]})
A B C
0 1 1 1
1 1 3 3
2 2 1 5
3 2 3 3
4 3 1 7
5 3 2 7
6 3 1 1
I want to create bin counts for column B with a groupby on column A:
for example, B_bin1 where B < 3 and B_bin2 for the rest (>= 3); likewise C_bin1 for C < 5 and C_bin2 for the rest.
From that example the output I want is like this
A B_bin1 B_bin2 C_bin1 C_bin2
0 1 1 1 2 0
1 2 1 1 1 1
2 3 3 0 1 2
I found the similar question Pandas groupby with bin counts; it works for a single binned column:
bins = [0,2,10]
temp_df=df.groupby(['A', pd.cut(df['B'], bins)])
temp_df.size().unstack()
B (0, 2] (2, 10]
A
1 1 1
2 1 1
3 3 0
but when I tried using more than one binned column, it does not give the shape I want (my real data has many binning groups):
bins = [0,2,10]
bins2 = [0,4,10]
temp_df=df.groupby(['A', pd.cut(df['B'], bins), pd.cut(df['C'], bins2)])
temp_df.size().unstack()
C (0, 4] (4, 10]
A B
1 (0, 2] 1 0
(2, 10] 1 0
2 (0, 2] 0 1
(2, 10] 1 0
3 (0, 2] 1 2
(2, 10] 0 0
My workaround is to create small temporary DataFrames, bin them one group at a time, and merge them at the end.
I am also still trying an aggregation approach (probably using pd.NamedAgg too), similar to this, but I wonder if it can work:
df.groupby('A').agg(
    b_count = ('B', 'count'),
    b_sum = ('B', 'sum'),
    c_count = ('C', 'count'),
    c_sum = ('C', 'sum')
)
Does anyone have another idea for this?

Because each bin needs to be processed separately, crosstab is used here instead of groupby + size + unstack, and the resulting DataFrames are joined with concat:
bins = [0,2,10]
bins2 = [0,4,10]
temp_df1=pd.crosstab(df['A'], pd.cut(df['B'], bins, labels=False)).add_prefix('B_')
temp_df2=pd.crosstab(df['A'], pd.cut(df['C'], bins2, labels=False)).add_prefix('C_')
df = pd.concat([temp_df1, temp_df2], axis=1).reset_index()
print (df)
A B_0 B_1 C_0 C_1
0 1 1 1 2 0
1 2 1 1 1 1
2 3 3 0 1 2
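The same crosstab pattern scales to many binned columns with a list comprehension; here is a sketch, assuming (hypothetically) that the bin edges are kept in a dict keyed by column name:

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 1, 2, 2, 3, 3, 3],
    'B': [1, 3, 1, 3, 1, 2, 1],
    'C': [1, 3, 5, 3, 7, 7, 1]})

# hypothetical mapping of column name -> bin edges for that column
bin_map = {'B': [0, 2, 10], 'C': [0, 4, 10]}

# one crosstab per binned column, joined side by side
parts = [
    pd.crosstab(df['A'], pd.cut(df[col], edges, labels=False)).add_prefix(f'{col}_')
    for col, edges in bin_map.items()
]
out = pd.concat(parts, axis=1).reset_index()
print(out)
```

Adding another binned column is then just another entry in `bin_map`.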

One option is get_dummies before the aggregation; this works since you have a limited number of bins (I'm skipping the bins and using comparisons instead):
temp = (df
        .assign(B = df.B.lt(3), C = df.C.lt(5))
        .replace({True: 1, False: 2})
        )
(pd
 .get_dummies(temp, columns = ['B', 'C'], prefix_sep='_bin')
 .groupby('A')
 .sum()
)
B_bin1 B_bin2 C_bin1 C_bin2
A
1 1 1 2 0
2 1 1 1 1
3 3 0 1 2
You could use the bins, along with pd.factorize and get_dummies:
temp = df.copy()
temp['B'] = pd.cut(df.B, bins)
temp['B'] = pd.factorize(temp.B)[0] + 1
temp['C'] = pd.cut(df.C, bins2)
temp['C'] = pd.factorize(temp.C)[0] + 1
(pd
 .get_dummies(temp, columns = ['B', 'C'], prefix_sep='_bin')
 .groupby('A')
 .sum()
)
B_bin1 B_bin2 C_bin1 C_bin2
A
1 1 1 2 0
2 1 1 1 1
3 3 0 1 2
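A related sketch: pd.cut also accepts a labels argument, so the bin names can be set up front and the factorize step skipped entirely (the bin edges below are the ones from the question):

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 1, 2, 2, 3, 3, 3],
    'B': [1, 3, 1, 3, 1, 2, 1],
    'C': [1, 3, 5, 3, 7, 7, 1]})

temp = df.copy()
# name the bins directly instead of factorizing afterwards
temp['B'] = pd.cut(df.B, [0, 2, 10], labels=['bin1', 'bin2'])
temp['C'] = pd.cut(df.C, [0, 4, 10], labels=['bin1', 'bin2'])

out = (pd.get_dummies(temp, columns=['B', 'C'], prefix_sep='_')
         .groupby('A')
         .sum())
print(out)
```

Unlike factorize, the labels do not depend on the order in which values first appear.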


How to extract period and variable name from dataframe column strings for multiindex panel data preparation

I'm new to Python and could not find the answer I'm looking for anywhere.
I have a DataFrame that has the following structure:
df = pd.DataFrame(index=list('abc'), data={'A1': range(3), 'A2': range(3),'B1': range(3), 'B2': range(3), 'C1': range(3), 'C2': range(3)})
df
Out[1]:
A1 A2 B1 B2 C1 C2
a 0 0 0 0 0 0
b 1 1 1 1 1 1
c 2 2 2 2 2 2
Where the numbers are periods and the letters are variables. I'd like to transform the columns so that the periods and variables are split into a MultiIndex. The desired output would look like this:
A B C
1 2 1 2 1 2
a 0 0 0 0 0 0
b 1 1 1 1 1 1
c 2 2 2 2 2 2
I've tried the following:
periods = list(range(1, 3))
df.columns = df.columns.str.replace(r'\d+', '', regex=True)
df.columns = pd.MultiIndex.from_product([df.columns, periods])
That seems to multiply the columns (from_product builds the full cross product of the six stripped names and the two periods) and raises ValueError: Length mismatch.
In my dataframe I have 72 periods and 12 variables.
Thanks in advance for your help!
Edit: I realized that I haven't been precise enough. I have several column names like Impressions1, Impressions2...Impressions72 and hhi1, hhi2...hhi72. So df.columns.str[0], df.columns.str[1] does not work for me, as the column names have different lengths. I think the solution might involve a regex, but I can't figure out how to do it. Any ideas?
Use pd.MultiIndex.from_tuples:
df.columns = pd.MultiIndex.from_tuples(list(zip(df.columns.str[0],df.columns.str[1])))
print(df)
A B C
1 2 1 2 1 2
a 0 0 0 0 0 0
b 1 1 1 1 1 1
c 2 2 2 2 2 2
Alternative:
pd.MultiIndex.from_tuples([tuple(name) for name in df.columns])
or
pd.MultiIndex.from_tuples(map(tuple, df.columns))
You can also use .str.extract and from_frame:
df.columns = pd.MultiIndex.from_frame(df.columns.str.extract('(.)(.)'), names=[None, None])
Output:
A B C
1 2 1 2 1 2
a 0 0 0 0 0 0
b 1 1 1 1 1 1
c 2 2 2 2 2 2
Here is what actually solved my issue:
df.columns = pd.MultiIndex.from_frame(df.columns.str.extract(r'([a-zA-Z]+)([0-9]+)'), names=[None, None])
Thanks @Scott Boston for the inspiration for the solution!
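Since the end goal here is panel data, it may help to go one step further: once the columns are a MultiIndex, stack can move the period level into the row index, giving one row per (entity, period). A sketch using the same regex as in the answers above:

```python
import pandas as pd

df = pd.DataFrame(index=list('abc'),
                  data={'A1': range(3), 'A2': range(3),
                        'B1': range(3), 'B2': range(3)})

# split names like 'Impressions72' into (variable, period)
extracted = df.columns.str.extract(r'([a-zA-Z]+)([0-9]+)')
df.columns = pd.MultiIndex.from_frame(extracted, names=['variable', 'period'])

# long/panel form: one row per (entity, period)
panel = df.stack(level='period')
print(panel)
```

Note the period level holds strings after the extract; wrap it in .astype(int) first if numeric periods are needed for sorting.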

How to create new column based off values from existing columns in pandas

I have a dataframe with 171 rows and 11 columns.
The 11 columns have values with either 0 or 1
How can I create a new column that will be either 0 or 1, depending on whether the existing columns have a majority of 0s or 1s?
You could do:
(df.sum(axis=1) > df.shape[1]/2) + 0
import numpy as np
import pandas as pd
X = np.asarray([(0, 0, 0),
                (0, 0, 1),
                (0, 1, 1),
                (1, 1, 1)])
df = pd.DataFrame(X)
df['majority'] = (df.mean(axis=1) > 0.5) + 0
df
Take the mean of each row and compare it against 0.5 with DataFrame.gt (greater) or DataFrame.ge (greater or equal); the choice decides the output when a row has the same number of 0s and 1s. Last, convert the mask to integers with Series.astype:
np.random.seed(20193)
df = pd.DataFrame(np.random.choice([0,1], size=(5, 4)))
df['new'] = df.mean(axis=1).gt(0.5).astype(int)
print (df)
0 1 2 3 new
0 1 1 0 0 0
1 1 1 1 0 1
2 0 0 1 0 0
3 1 1 0 1 1
4 1 1 1 1 1
np.random.seed(20193)
df = pd.DataFrame(np.random.choice([0,1], size=(5, 4)))
df['new'] = df.mean(axis=1).ge(0.5).astype(int)
print (df)
0 1 2 3 new
0 1 1 0 0 1
1 1 1 1 0 1
2 0 0 1 0 0
3 1 1 0 1 1
4 1 1 1 1 1
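If ties should be handled explicitly rather than implicitly through the gt/ge choice, the counts of 0s and 1s can be compared directly; a sketch (the tie-goes-to-0 rule here is my assumption, flip the comparison for the other convention):

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.choice([0, 1], size=(5, 4)))

ones = df.sum(axis=1)           # number of 1s per row
zeros = df.shape[1] - ones      # number of 0s per row
df['majority'] = (ones > zeros).astype(int)   # tie (2 vs 2) yields 0
print(df)
```

This makes the tie rule visible in the code instead of hiding it in gt vs ge.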

Get two return values from Pandas apply

I'm trying to return two different values from an apply method, but I can't figure out how to get the results I need.
With a function as:
def fun(row):
    s = [sum(row[i:i+2]) for i in range(len(row) - 1)]
    ps = s.index(max(s))
    return max(s), ps
and df as:
6:00 6:15 6:30
0 3 8 9
1 60 62 116
I'm trying to return the max value of the row, but i also need to get the index of the first value that produces the max combination.
df["phour"] = df.apply(fun, axis=1)
I can get the output I need, but I don't know how I can get the index in a new column. So far I'm getting both answers in a tuple:
6:00 6:15 6:30 phour
0 3 8 9 (17, 1)
1 60 62 116 (178, 1)
How can I get the index value in its own column?
You can get the index in a separate column like this:
df[['phour','index']] = df.apply(lambda row: pd.Series(list(fun(row))), axis=1)
Or if you modify fun slightly:
def fun(row):
    s = [sum(row[i:i+2]) for i in range(len(row) - 1)]
    ps = s.index(max(s))
    return [max(s), ps]
Then the code becomes a little less convoluted:
df[['phour','index']] = df.apply(lambda row: pd.Series(fun(row)), axis=1)
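Since pandas 0.23, apply also takes a result_type parameter; with result_type='expand', a function returning a tuple is expanded into columns directly, so no pd.Series wrapper is needed. A sketch using the question's fun:

```python
import pandas as pd

def fun(row):
    # pairwise rolling sums; return the max and where it starts
    s = [sum(row[i:i + 2]) for i in range(len(row) - 1)]
    return max(s), s.index(max(s))

df = pd.DataFrame({'6:00': [3, 60], '6:15': [8, 62], '6:30': [9, 116]})
df[['phour', 'index']] = df.apply(fun, axis=1, result_type='expand')
print(df)
```

For row 0 the pair sums are 11 and 17, so phour is 17 starting at index 1, matching the question's expected output.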
You can apply pd.Series:
df.drop('Double', axis=1).join(df.Double.apply(pd.Series, index=['D1', 'D2']))
A B C D1 D2
0 1 2 3 1 2
1 2 3 2 3 4
2 3 4 4 5 6
3 4 1 1 7 8
Equivalently:
df.drop('Double', axis=1).join(
    pd.DataFrame(np.array(df.Double.values.tolist()), columns=['D1', 'D2'])
)
setup
using @GordonBean's df
df = pd.DataFrame({'A':[1,2,3,4], 'B':[2,3,4,1], 'C':[3,2,4,1], 'Double': [(1,2), (3,4), (5,6), (7,8)]})
If you are just trying to get the max and argmax, I recommend using the pandas API:
DataFrame.idxmax
So:
df = pd.DataFrame({'A':[1,2,3,4], 'B':[2,3,4,1], 'C':[3,2,4,1]})
df
A B C
0 1 2 3
1 2 3 2
2 3 4 4
3 4 1 1
df['Max'] = df.max(axis=1)
df['ArgMax'] = df.idxmax(axis=1)
df
A B C Max ArgMax
0 1 2 3 3 C
1 2 3 2 3 B
2 3 4 4 4 B
3 4 1 1 4 A
Update:
And if you need the actual index value, you can use numpy.ndarray.argmax:
df['ArgMaxNum'] = df[['A','B','C']].values.argmax(axis=1)
A B C Max ArgMax ArgMaxNum
0 1 2 3 3 C 2
1 2 3 2 3 B 1
2 3 4 4 4 B 1
3 4 1 1 4 A 0
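If both the label and its integer position are needed, Index.get_indexer can translate the idxmax labels into column positions in one call; a sketch:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [2, 3, 4, 1], 'C': [3, 2, 4, 1]})

labels = df.idxmax(axis=1)                  # column label of each row's max
positions = df.columns.get_indexer(labels)  # same information as integer positions
df['ArgMax'] = labels
df['ArgMaxNum'] = positions
print(df)
```

On ties (row 2 has two 4s) idxmax, like argmax, reports the first occurrence.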
One way to split out the tuples into separate columns could be with tuple unpacking:
df = pd.DataFrame({'A':[1,2,3,4], 'B':[2,3,4,1], 'C':[3,2,4,1], 'Double': [(1,2), (3,4), (5,6), (7,8)]})
df
A B C Double
0 1 2 3 (1, 2)
1 2 3 2 (3, 4)
2 3 4 4 (5, 6)
3 4 1 1 (7, 8)
df['D1'] = [d[0] for d in df.Double]
df['D2'] = [d[1] for d in df.Double]
df
A B C Double D1 D2
0 1 2 3 (1, 2) 1 2
1 2 3 2 (3, 4) 3 4
2 3 4 4 (5, 6) 5 6
3 4 1 1 (7, 8) 7 8
There's got to be a better way, but you can do:
df.merge(pd.DataFrame(((i, j) for
                       i, j in df.apply(fun, axis=1).values),
                      columns=['phour', 'index']),
         left_index=True, right_index=True)

Pandas: Adding zero values where no rows exist (sparse)

I have a Pandas DataFrame with a MultiIndex. The MultiIndex has values in the range (0,0) to (1000,1000), and the column has two fields p and q.
However, the DataFrame is sparse. That is, if there was no measurement corresponding to a particular index (say (3,2)), there won't be any row for it. I'd like to make it not sparse by filling in these rows with p=0 and q=0. Continuing the example, if I do df.loc[3].loc[2], I want it to return p=0 q=0, not No Such Record (as it currently does).
Clarification: By "sparse", I mean it only in the sense I used it, that zero values are omitted. I'm not referring to anything in Pandas or Numpy internals.
Consider this df
data = {
    (1, 0): dict(p=1, q=1),
    (3, 2): dict(p=1, q=1),
    (5, 4): dict(p=1, q=1),
    (7, 6): dict(p=1, q=1),
}
df = pd.DataFrame(data).T
df
p q
1 0 1 1
3 2 1 1
5 4 1 1
7 6 1 1
Use reindex with fill_value=0 and an index constructed with pd.MultiIndex.from_product:
mux = pd.MultiIndex.from_product([range(8), range(8)])
df.reindex(mux, fill_value=0)
p q
0 0 0 0
1 0 0
2 0 0
3 0 0
4 0 0
5 0 0
6 0 0
7 0 0
1 0 1 1
1 0 0
2 0 0
3 0 0
4 0 0
5 0 0
6 0 0
7 0 0
2 0 0 0
1 0 0
2 0 0
3 0 0
response to comment
You can get min, max of index levels like this
def mn_mx(idx):
    return idx.min(), idx.max()

mn0, mx0 = mn_mx(df.index.levels[0])
mn1, mx1 = mn_mx(df.index.levels[1])
mux = pd.MultiIndex.from_product([range(mn0, mx0 + 1), range(mn1, mx1 + 1)])
df.reindex(mux, fill_value=0)
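For the question's actual (0,0)..(1000,1000) range, the same reindex works directly, at the cost of materializing about a million rows; a sketch:

```python
import pandas as pd

# sparse measurements, as in the question
data = {
    (1, 0): dict(p=1, q=1),
    (3, 2): dict(p=1, q=1),
}
df = pd.DataFrame(data).T

# full dense grid covering (0,0)..(1000,1000)
mux = pd.MultiIndex.from_product([range(1001), range(1001)])
dense = df.reindex(mux, fill_value=0)

print(dense.loc[(3, 2)])  # existing row survives
print(dense.loc[(3, 3)])  # previously missing row is now p=0, q=0
```

If the full grid is too large to hold in memory, keeping the frame sparse and treating lookup misses as zero may be the better trade-off.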

How to repeat a Pandas DataFrame?

This is my DataFrame that should be repeated for 5 times:
>>> x = pd.DataFrame({'a':1,'b':2}, index = range(1))
>>> x
a b
0 1 2
I want to have the result like this:
>>> x.append(x).append(x).append(x)
a b
0 1 2
0 1 2
0 1 2
0 1 2
But there must be a smarter way than appending four times; the DataFrame I'm working on actually needs to be repeated 50 times.
I haven't found anything practical, including np.repeat, which just doesn't work on a DataFrame.
Could anyone help?
You can use the concat function:
In [13]: pd.concat([x]*5)
Out[13]:
a b
0 1 2
0 1 2
0 1 2
0 1 2
0 1 2
If you only want to repeat the values and not the index, you can do:
In [14]: pd.concat([x]*5, ignore_index=True)
Out[14]:
a b
0 1 2
1 1 2
2 1 2
3 1 2
4 1 2
I think it's cleaner/faster to use iloc nowadays:
In [11]: np.full(3, 0)
Out[11]: array([0, 0, 0])
In [12]: x.iloc[np.full(3, 0)]
Out[12]:
a b
0 1 2
0 1 2
0 1 2
More generally, you can use tile or repeat with arange:
In [21]: df = pd.DataFrame([[1, 2], [3, 4]], columns=["A", "B"])
In [22]: df
Out[22]:
A B
0 1 2
1 3 4
In [23]: np.tile(np.arange(len(df)), 3)
Out[23]: array([0, 1, 0, 1, 0, 1])
In [24]: np.repeat(np.arange(len(df)), 3)
Out[24]: array([0, 0, 0, 1, 1, 1])
In [25]: df.iloc[np.tile(np.arange(len(df)), 3)]
Out[25]:
A B
0 1 2
1 3 4
0 1 2
1 3 4
0 1 2
1 3 4
In [26]: df.iloc[np.repeat(np.arange(len(df)), 3)]
Out[26]:
A B
0 1 2
0 1 2
0 1 2
1 3 4
1 3 4
1 3 4
Note: This will work with non-integer indexed DataFrames (and Series).
Try using numpy.repeat:
>>> import numpy as np
>>> df = pd.DataFrame(np.repeat(x.to_numpy(), 5, axis=0), columns=x.columns)
>>> df
a b
0 1 2
1 1 2
2 1 2
3 1 2
4 1 2
I would generally not repeat and/or append unless your problem really makes it necessary: it is highly inefficient and typically comes from not understanding the proper way to attack a problem.
I don't know your exact use case, but if you have your values stored as an array,
import numpy as np
values = np.array([1, 2])
df2 = pd.DataFrame(index=np.arange(0, 50), columns=['a', 'b'])
df2[['a', 'b']] = values
will do the job. Perhaps you want to better explain what you're trying to achieve?
Append should work too (note that DataFrame.append was removed in pandas 2.0; use pd.concat instead):
In [589]: x = pd.DataFrame({'a':1,'b':2},index = range(1))
In [590]: x
Out[590]:
a b
0 1 2
In [591]: x.append([x]*5, ignore_index=True) #Ignores the index as per your need
Out[591]:
a b
0 1 2
1 1 2
2 1 2
3 1 2
4 1 2
5 1 2
In [592]: x.append([x]*5)
Out[592]:
a b
0 1 2
0 1 2
0 1 2
0 1 2
0 1 2
0 1 2
Without numpy, we could also use Index.repeat and loc (or reindex):
x.loc[x.index.repeat(5)].reset_index(drop=True)
or
x.reindex(x.index.repeat(5)).reset_index(drop=True)
Output:
a b
0 1 2
1 1 2
2 1 2
3 1 2
4 1 2
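A note on ordering, since the two families of answers differ: concat tiles the whole frame (0, 1, 0, 1, ...), while Index.repeat repeats each row in place (0, 0, 1, 1, ...). A small sketch contrasting the two on a two-row frame:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 3], 'b': [2, 4]})

# concat tiles the whole frame: row order 0, 1, 0, 1
tiled = pd.concat([df] * 2, ignore_index=True)

# Index.repeat repeats each row in place: row order 0, 0, 1, 1
repeated = df.loc[df.index.repeat(2)].reset_index(drop=True)

print(tiled)
print(repeated)
```

For a one-row frame like the question's, both orderings coincide, which is why every answer above produces the same result.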
Apply with a repeating lambda is a universal approach in my opinion (with axis=0 the lambda is applied to each column):
df = pd.DataFrame([[1, 2], [3, 4]], columns=["A", "B"])
df.apply(lambda col: col.repeat(2), axis=0) #.reset_index()
Out[1]:
A B
0 1 2
0 1 2
1 3 4
1 3 4
