I have a DataFrame like this:
df = pd.DataFrame({
'A': [1,1,2,2,3,3,3],
'B': [1,3,1,3,1,2,1],
'C': [1,3,5,3,7,7,1]})
A B C
0 1 1 1
1 1 3 3
2 2 1 5
3 2 3 3
4 3 1 7
5 3 2 7
6 3 1 1
I want to count the values of column B per bin, grouped by column A.
For example, B_bin1 counts the rows where B < 3 and B_bin2 counts the rest (B >= 3); C_bin1 counts the rows where C < 5 and C_bin2 counts the rest.
For that example, the output I want looks like this:
A B_bin1 B_bin2 C_bin1 C_bin2
0 1 1 1 2 0
1 2 1 1 1 1
2 3 3 0 1 2
I found a similar question, "Pandas groupby with bin counts", and its approach works when only one column is binned:
bins = [0,2,10]
temp_df=df.groupby(['A', pd.cut(df['B'], bins)])
temp_df.size().unstack()
B (0, 2] (2, 10]
A
1 1 1
2 1 1
3 3 0
But when I tried binning more than one column, it did not give the layout I want (my real data has a lot of binning groups):
bins = [0,2,10]
bins2 = [0,4,10]
temp_df=df.groupby(['A', pd.cut(df['B'], bins), pd.cut(df['C'], bins2)])
temp_df.size().unstack()
C (0, 4] (4, 10]
A B
1 (0, 2] 1 0
(2, 10] 1 0
2 (0, 2] 0 1
(2, 10] 1 0
3 (0, 2] 1 2
(2, 10] 0 0
My workaround is to create small temporary DataFrames, bin them one group at a time, and merge them at the end.
I am also still trying to use aggregation (probably with pd.NamedAgg too), similar to this, but I wonder whether that can work:
df.groupby('A').agg(
b_count = ('B', 'count'),
b_sum = ('B', 'sum'),
c_count = ('C', 'count'),
c_sum = ('C', 'sum')
)
Does anyone have another idea for this?
Because each column has to be binned separately, use crosstab per column instead of groupby + size + unstack, then join the resulting DataFrames with concat:
bins = [0,2,10]
bins2 = [0,4,10]
temp_df1=pd.crosstab(df['A'], pd.cut(df['B'], bins, labels=False)).add_prefix('B_')
temp_df2=pd.crosstab(df['A'], pd.cut(df['C'], bins2, labels=False)).add_prefix('C_')
df = pd.concat([temp_df1, temp_df2], axis=1).reset_index()
print (df)
A B_0 B_1 C_0 C_1
0 1 1 1 2 0
1 2 1 1 1 1
2 3 3 0 1 2
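With many binning groups, the same crosstab-and-concat idea can be driven from a mapping of column name to bin edges. The following is only a sketch: the bin_specs dictionary and the B_bin1-style labels are assumptions chosen to match the output requested in the question, and the frame is rebuilt so the block runs on its own.

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 1, 2, 2, 3, 3, 3],
    'B': [1, 3, 1, 3, 1, 2, 1],
    'C': [1, 3, 5, 3, 7, 7, 1],
})

# Hypothetical mapping: column to bin -> bin edges (same edges as above).
bin_specs = {'B': [0, 2, 10], 'C': [0, 4, 10]}

parts = []
for col, edges in bin_specs.items():
    # One label per interval, e.g. B_bin1, B_bin2.
    labels = [f'{col}_bin{i}' for i in range(1, len(edges))]
    # Convert to plain objects so the counts get ordinary string columns.
    binned = pd.cut(df[col], edges, labels=labels).astype(object)
    counts = pd.crosstab(df['A'], binned)
    # Keep every bin as a column, even one that no row falls into.
    parts.append(counts.reindex(columns=labels, fill_value=0))

result = pd.concat(parts, axis=1).reset_index()
print(result)
```

Adding another binned column then only needs one more entry in bin_specs.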
One option is to use get_dummies before the aggregation; this works because you have a limited number of bins (I'm skipping the bins and using comparisons instead):
temp = (df
.assign(B = df.B.lt(3), C = df.C.lt(5))
.replace({True:1, False:2})
)
(pd
.get_dummies(temp, columns = ['B','C'], prefix_sep='_bin')
.groupby('A')
.sum()
)
B_bin1 B_bin2 C_bin1 C_bin2
A
1 1 1 2 0
2 1 1 1 1
3 3 0 1 2
You could use the bins, along with pd.factorize and get_dummies:
temp = df.copy()
temp['B'] = pd.cut(df.B, bins)
temp['B'] = pd.factorize(temp.B)[0] + 1
temp['C'] = pd.cut(df.C, bins2)
temp['C'] = pd.factorize(temp.C)[0] + 1
(pd
.get_dummies(temp, columns = ['B','C'], prefix_sep='_bin')
.groupby('A')
.sum()
)
B_bin1 B_bin2 C_bin1 C_bin2
A
1 1 1 2 0
2 1 1 1 1
3 3 0 1 2
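On the question of whether named aggregation can do this: a possible sketch is below. It assumes a pandas version that accepts several lambdas on the same column in named aggregation, and it uses comparisons rather than pd.cut. The output names B_bin1 etc. simply follow the question.

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 1, 2, 2, 3, 3, 3],
    'B': [1, 3, 1, 3, 1, 2, 1],
    'C': [1, 3, 5, 3, 7, 7, 1],
})

# Each output column counts the rows in a group that satisfy one condition.
out = (
    df.groupby('A')
      .agg(
          B_bin1=('B', lambda s: (s < 3).sum()),
          B_bin2=('B', lambda s: (s >= 3).sum()),
          C_bin1=('C', lambda s: (s < 5).sum()),
          C_bin2=('C', lambda s: (s >= 5).sum()),
      )
      .reset_index()
)
print(out)
```

This stays readable for a handful of bins, but one lambda per bin gets verbose quickly, so a loop over bin edges is likely the better fit for many binning groups.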
I'm new to Python and could not find the answer I'm looking for anywhere.
I have a DataFrame that has the following structure:
df = pd.DataFrame(index=list('abc'), data={'A1': range(3), 'A2': range(3),'B1': range(3), 'B2': range(3), 'C1': range(3), 'C2': range(3)})
df
Out[1]:
A1 A2 B1 B2 C1 C2
a 0 0 0 0 0 0
b 1 1 1 1 1 1
c 2 2 2 2 2 2
Where the numbers are periods and the letters are variables. I'd like to transform the columns so that the variables and periods are split into a MultiIndex. The desired output would look like this:
A B C
1 2 1 2 1 2
a 0 0 0 0 0 0
b 1 1 1 1 1 1
c 2 2 2 2 2 2
I've tried the following:
periods = list(range(1, 3))
df.columns = df.columns.str.replace(r'\d+', '', regex=True)
df.columns = pd.MultiIndex.from_product([df.columns, periods])
That seems to multiply the columns and raises a ValueError: Length mismatch.
In my DataFrame I have 72 periods and 12 variables.
Thanks in advance for your help!
Edit: I realized that I haven't been precise enough. I have several column names like Impressions1, Impressions2 ... Impressions72 and hhi1, hhi2 ... hhi72. So df.columns.str[0], df.columns.str[1] does not work for me, because the column names have different lengths. I think the solution might involve a regex, but I can't figure out how to do it. Any ideas?
Use pd.MultiIndex.from_tuples:
df.columns = pd.MultiIndex.from_tuples(list(zip(df.columns.str[0],df.columns.str[1])))
print(df)
A B C
1 2 1 2 1 2
a 0 0 0 0 0 0
b 1 1 1 1 1 1
c 2 2 2 2 2 2
Alternative:
pd.MultiIndex.from_tuples([tuple(name) for name in df.columns])
or
pd.MultiIndex.from_tuples(map(tuple, df.columns))
You can also use .str.extract and from_frame:
df.columns = pd.MultiIndex.from_frame(df.columns.str.extract('(.)(.)'), names=[None, None])
Output:
A B C
1 2 1 2 1 2
a 0 0 0 0 0 0
b 1 1 1 1 1 1
c 2 2 2 2 2 2
Here is what actually solved my issue:
df.columns = pd.MultiIndex.from_frame(df.columns.str.extract(r'([a-zA-Z]+)([0-9]+)'), names=[None, None])
Thanks @Scott Boston for the inspiration for this solution!
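To make the regex variant concrete, here is a small sketch on column names shaped like the ones in the edit. The sample values, the anchored pattern, the level names variable and period, and the conversion of the period to an integer are all assumptions added for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 2, 3, 4]],
    columns=['Impressions1', 'Impressions2', 'hhi1', 'hhi2'],
)

# Group 1: the variable name (letters), group 2: the trailing period number.
parts = df.columns.str.extract(r'^([A-Za-z]+)(\d+)$')
# Store the period as an integer rather than a string.
parts[1] = parts[1].astype(int)

df.columns = pd.MultiIndex.from_frame(parts, names=['variable', 'period'])
print(df)
```

After this, df['hhi'] selects all periods of one variable, and df[('Impressions', 2)] selects a single variable and period.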
I have a DataFrame with 171 rows and 11 columns.
Each of the 11 columns holds values of either 0 or 1.
How can I create a new column that is 0 or 1, depending on whether the existing columns in that row have a majority of 0s or 1s?
You could do:
(df.sum(axis=1)>df.shape[1]/2)+0
import numpy as np
import pandas as pd
X = np.asarray([(0, 0, 0),
(0, 0, 1),
(0, 1, 1),
(1, 1, 1)])
df = pd.DataFrame(X)
df['majority'] = (df.mean(axis=1) > 0.5) + 0
df
Take the mean of each row and compare it with 0.5 using Series.gt (greater than) or Series.ge (greater than or equal); which one you need depends on the output you want when a row has the same number of 0s and 1s. Finally, convert the boolean mask to integers with Series.astype:
np.random.seed(20193)
df = pd.DataFrame(np.random.choice([0,1], size=(5, 4)))
df['new'] = df.mean(axis=1).gt(0.5).astype(int)
print (df)
0 1 2 3 new
0 1 1 0 0 0
1 1 1 1 0 1
2 0 0 1 0 0
3 1 1 0 1 1
4 1 1 1 1 1
np.random.seed(20193)
df = pd.DataFrame(np.random.choice([0,1], size=(5, 4)))
df['new'] = df.mean(axis=1).ge(0.5).astype(int)
print (df)
0 1 2 3 new
0 1 1 0 0 1
1 1 1 1 0 1
2 0 0 1 0 0
3 1 1 0 1 1
4 1 1 1 1 1
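One detail worth guarding against is re-running the assignment: once the result column exists, it would be included in the next row mean. A minimal sketch, assuming the vote columns are named a, b, c (hypothetical names), restricts the calculation to an explicit column list and compares integer counts instead of a floating-point mean.

```python
import pandas as pd

df = pd.DataFrame(
    [[0, 0, 1],
     [1, 1, 0],
     [1, 1, 1],
     [0, 0, 0]],
    columns=['a', 'b', 'c'],
)

# Only these columns take part in the vote, even after 'majority' is added.
vote_cols = ['a', 'b', 'c']

# A row has a majority of 1s when twice the count of 1s exceeds the column count.
ones = df[vote_cols].sum(axis=1)
df['majority'] = (2 * ones > len(vote_cols)).astype(int)
print(df)
```

With an odd number of columns, such as the 11 in the question, ties cannot occur, so the strict comparison is enough.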
I'm trying to return two different values from an apply method, but I can't figure out how to get the results I need.
With a function as:
def fun(row):
s = [sum(row[i:i+2]) for i in range (len(row) -1)]
ps = s.index(max(s))
return max(s),ps
and df as:
6:00 6:15 6:30
0 3 8 9
1 60 62 116
I'm trying to return the maximum sum of two adjacent values in each row, but I also need the index of the first value of the pair that produces that maximum.
df["phour"] = t.apply(fun, axis=1)
I can get the output I need, but I don't know how to get the index into a new column. So far I'm getting both answers in a tuple:
6:00 6:15 6:30 phour
0 3 8 9 (17, 1)
1 60 62 116 (178, 1)
How can I get the index value in its own column?
You can get the index in a separate column like this:
df[['phour','index']] = df.apply(lambda row: pd.Series(list(fun(row))), axis=1)
Or if you modify fun slightly:
def fun(row):
s = [sum(row[i:i+2]) for i in range (len(row) -1)]
ps = s.index(max(s))
return [max(s),ps]
Then the code becomes a little less convoluted:
df[['phour','index']] = df.apply(lambda row: pd.Series(fun(row)), axis=1)
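Another way to spread a tuple result over columns is the result_type='expand' option of DataFrame.apply. The sketch below rebuilds the frame from the question; the function name peak_pair and the use of join are illustrative choices, not part of the original code.

```python
import pandas as pd

df = pd.DataFrame({'6:00': [3, 60], '6:15': [8, 62], '6:30': [9, 116]})

def peak_pair(row):
    """Return (largest sum of two adjacent values, position of the pair's first value)."""
    vals = row.to_numpy()
    sums = [int(vals[i] + vals[i + 1]) for i in range(len(vals) - 1)]
    best = max(sums)
    return best, sums.index(best)

# 'expand' turns each returned tuple into one row of a new DataFrame.
expanded = df.apply(peak_pair, axis=1, result_type='expand')
expanded.columns = ['phour', 'index']

out = df.join(expanded)
print(out)
```

Computing the result into a separate frame and joining it leaves the original columns untouched, so the function can be re-run safely.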
You can apply pd.Series
df.drop(columns='Double').join(df.Double.apply(pd.Series, index=['D1', 'D2']))
A B C D1 D2
0 1 2 3 1 2
1 2 3 2 3 4
2 3 4 4 5 6
3 4 1 1 7 8
Equivalently
df.drop(columns='Double').join(
pd.DataFrame(np.array(df.Double.values.tolist()), columns=['D1', 'D2'])
)
Setup, using @GordonBean's df:
df = pd.DataFrame({'A':[1,2,3,4], 'B':[2,3,4,1], 'C':[3,2,4,1], 'Double': [(1,2), (3,4), (5,6), (7,8)]})
If you are just trying to get the max and argmax, I recommend using the pandas API:
DataFrame.idxmax
So:
df = pd.DataFrame({'A':[1,2,3,4], 'B':[2,3,4,1], 'C':[3,2,4,1]})
df
A B C
0 1 2 3
1 2 3 2
2 3 4 4
3 4 1 1
df['Max'] = df.max(axis=1)
df['ArgMax'] = df.idxmax(axis=1)
df
A B C Max ArgMax
0 1 2 3 3 C
1 2 3 2 3 B
2 3 4 4 4 B
3 4 1 1 4 A
Update:
And if you need the actual index value, you can use numpy.ndarray.argmax:
df['ArgMaxNum'] = df[['A','B','C']].values.argmax(axis=1)
A B C Max ArgMax ArgMaxNum
0 1 2 3 3 C 2
1 2 3 2 3 B 1
2 3 4 4 4 B 1
3 4 1 1 4 A 0
One way to split out the tuples into separate columns could be with tuple unpacking:
df = pd.DataFrame({'A':[1,2,3,4], 'B':[2,3,4,1], 'C':[3,2,4,1], 'Double': [(1,2), (3,4), (5,6), (7,8)]})
df
A B C Double
0 1 2 3 (1, 2)
1 2 3 2 (3, 4)
2 3 4 4 (5, 6)
3 4 1 1 (7, 8)
df['D1'] = [d[0] for d in df.Double]
df['D2'] = [d[1] for d in df.Double]
df
A B C Double D1 D2
0 1 2 3 (1, 2) 1 2
1 2 3 2 (3, 4) 3 4
2 3 4 4 (5, 6) 5 6
3 4 1 1 (7, 8) 7 8
There's got to be a better way, but you can do:
df.merge(pd.DataFrame(((i, j) for
i, j in df.apply(fun, axis=1).values),
columns=['phour','index']),
left_index=True,right_index=True)
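For the adjacent-pair sum in the original question, a row-wise apply can also be avoided entirely. This is a sketch under the assumption that all columns are numeric and are the time slots in order; it uses NumPy slicing instead of the fun function from the question.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'6:00': [3, 60], '6:15': [8, 62], '6:30': [9, 116]})

vals = df.to_numpy()
# Column i of pair_sums holds value[i] + value[i + 1] for every row.
pair_sums = vals[:, :-1] + vals[:, 1:]

out = df.assign(
    phour=pair_sums.max(axis=1),      # largest adjacent-pair sum per row
    index=pair_sums.argmax(axis=1),   # position of the first value of that pair
)
print(out)
```

Like list.index(max(...)), argmax reports the first position when several pairs share the maximum.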
I have a Pandas DataFrame with a MultiIndex. The MultiIndex has values in the range (0,0) to (1000,1000), and the column has two fields p and q.
However, the DataFrame is sparse. That is, if there was no measurement corresponding to a particular index (say (3,2)), there won't be any row for (3,2). I'd like to make it not sparse by filling in these rows with p=0 and q=0. Continuing the example, if I do df.loc[3].loc[2], I want it to return p=0 q=0, not No Such Record (as it currently does).
Clarification: By "sparse", I mean it only in the sense I used it, that zero values are omitted. I'm not referring to anything in Pandas or Numpy internals.
Consider this df
data = {
(1, 0): dict(p=1, q=1),
(3, 2): dict(p=1, q=1),
(5, 4): dict(p=1, q=1),
(7, 6): dict(p=1, q=1),
}
df = pd.DataFrame(data).T
df
p q
1 0 1 1
3 2 1 1
5 4 1 1
7 6 1 1
Use reindex with fill_value=0 from a constructed pd.MultiIndex.from_product
mux = pd.MultiIndex.from_product([range(8), range(8)])
df.reindex(mux, fill_value=0)
p q
0 0 0 0
1 0 0
2 0 0
3 0 0
4 0 0
5 0 0
6 0 0
7 0 0
1 0 1 1
1 0 0
2 0 0
3 0 0
4 0 0
5 0 0
6 0 0
7 0 0
2 0 0 0
1 0 0
2 0 0
3 0 0
Response to comment:
You can get the min and max of the index levels like this:
def mn_mx(idx):
return idx.min(), idx.max()
mn0, mx0 = mn_mx(df.index.levels[0])
mn1, mx1 = mn_mx(df.index.levels[1])
mux = pd.MultiIndex.from_product([range(mn0, mx0 + 1), range(mn1, mx1 + 1)])
df.reindex(mux, fill_value=0)
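A quick way to check that the reindexed frame behaves as the question asks is to look up one measured and one unmeasured index pair. The sketch below uses a deliberately tiny range (4 by 3) and only two measurements, both chosen for illustration; the real data would use the full range instead.

```python
import pandas as pd

data = {
    (1, 0): dict(p=1, q=1),
    (3, 2): dict(p=1, q=1),
}
df = pd.DataFrame(data).T

# Every (i, j) combination in the small illustrative range.
full = pd.MultiIndex.from_product([range(4), range(3)])
dense = df.reindex(full, fill_value=0)

print(dense.loc[(3, 2)])   # a measured pair keeps its values
print(dense.loc[(0, 0)])   # an unmeasured pair now returns zeros
```

Note that reindex returns a new frame, so the result has to be assigned, as done with dense here.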
This is my DataFrame that should be repeated for 5 times:
>>> x = pd.DataFrame({'a':1,'b':2}, index = range(1))
>>> x
a b
0 1 2
I want to have the result like this:
>>> x.append(x).append(x).append(x)
a b
0 1 2
0 1 2
0 1 2
0 1 2
But there must be a smarter way than appending 4 times. Actually the DataFrame I’m working on should be repeated 50 times.
I haven't found anything practical, including those like np.repeat ---- it just doesn't work on a DataFrame.
Could anyone help?
You can use the concat function:
In [13]: pd.concat([x]*5)
Out[13]:
a b
0 1 2
0 1 2
0 1 2
0 1 2
0 1 2
If you only want to repeat the values and not the index, you can do:
In [14]: pd.concat([x]*5, ignore_index=True)
Out[14]:
a b
0 1 2
1 1 2
2 1 2
3 1 2
4 1 2
I think it's cleaner/faster to use iloc nowadays:
In [11]: np.full(3, 0)
Out[11]: array([0, 0, 0])
In [12]: x.iloc[np.full(3, 0)]
Out[12]:
a b
0 1 2
0 1 2
0 1 2
More generally, you can use tile or repeat with arange:
In [21]: df = pd.DataFrame([[1, 2], [3, 4]], columns=["A", "B"])
In [22]: df
Out[22]:
A B
0 1 2
1 3 4
In [23]: np.tile(np.arange(len(df)), 3)
Out[23]: array([0, 1, 0, 1, 0, 1])
In [24]: np.repeat(np.arange(len(df)), 3)
Out[24]: array([0, 0, 0, 1, 1, 1])
In [25]: df.iloc[np.tile(np.arange(len(df)), 3)]
Out[25]:
A B
0 1 2
1 3 4
0 1 2
1 3 4
0 1 2
1 3 4
In [26]: df.iloc[np.repeat(np.arange(len(df)), 3)]
Out[26]:
A B
0 1 2
0 1 2
0 1 2
1 3 4
1 3 4
1 3 4
Note: This will work with non-integer indexed DataFrames (and Series).
Try using numpy.repeat:
>>> import numpy as np
>>> df = pd.DataFrame(np.repeat(x.to_numpy(), 5, axis=0), columns=x.columns)
>>> df
a b
0 1 2
1 1 2
2 1 2
3 1 2
4 1 2
I would generally not repeat and/or append unless your problem really makes it necessary; it is highly inefficient and typically comes from not understanding the proper way to attack a problem.
I don't know your exact use case, but if you have your values stored as
values = np.array([1, 2])
then
df2 = pd.DataFrame(index=np.arange(0, 50), columns=['a', 'b'])
df2[['a', 'b']] = values
will do the job. Perhaps you could better explain what you're trying to achieve?
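If the repeated row is really a set of constants, the frame can also be built directly at the target length, because the DataFrame constructor broadcasts scalar values over a given index. A minimal sketch, using the same a and b values as the question:

```python
import pandas as pd

# Scalars are broadcast to every row of the supplied index.
df2 = pd.DataFrame({'a': 1, 'b': 2}, index=range(50))
print(df2.shape)
print(df2.head())
```

This only covers the single-row case; repeating a frame with several distinct rows still needs one of the repeat or concat approaches.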
Append should work too, but only on pandas versions before 2.0 (DataFrame.append was removed in pandas 2.0):
In [589]: x = pd.DataFrame({'a':1,'b':2},index = range(1))
In [590]: x
Out[590]:
a b
0 1 2
In [591]: x.append([x]*5, ignore_index=True) #Ignores the index as per your need
Out[591]:
a b
0 1 2
1 1 2
2 1 2
3 1 2
4 1 2
5 1 2
In [592]: x.append([x]*5)
Out[592]:
a b
0 1 2
0 1 2
0 1 2
0 1 2
0 1 2
0 1 2
Without numpy, we could also use Index.repeat and loc (or reindex):
x.loc[x.index.repeat(5)].reset_index(drop=True)
or
x.reindex(x.index.repeat(5)).reset_index(drop=True)
Output:
a b
0 1 2
1 1 2
2 1 2
3 1 2
4 1 2
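The positional variants can be wrapped in a small helper that makes the row order explicit. The function name repeat_frame and its block parameter are hypothetical, introduced only for this sketch.

```python
import numpy as np
import pandas as pd

def repeat_frame(frame, n, block=True):
    """Repeat the rows of `frame` n times and renumber the index.

    block=True  -> whole frame repeated as a block (rows 0, 1, 0, 1, ...)
    block=False -> each row repeated in place     (rows 0, 0, 1, 1, ...)
    """
    positions = np.arange(len(frame))
    order = np.tile(positions, n) if block else np.repeat(positions, n)
    return frame.iloc[order].reset_index(drop=True)

x = pd.DataFrame({'a': [1, 3], 'b': [2, 4]})
print(repeat_frame(x, 2))
print(repeat_frame(x, 2, block=False))
```

Because it indexes by position with iloc, the helper does not depend on the index labels being unique or numeric.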
Using apply with Series.repeat on each column is a general approach in my opinion:
df = pd.DataFrame([[1, 2], [3, 4]], columns=["A", "B"])
df.apply(lambda col: col.repeat(2), axis=0)  # .reset_index()
Out[1]:
A B
0 1 2
0 1 2
1 3 4
1 3 4