Based on given number of bins distribute column data into equal average - python

I have a dataframe with data like the table below, and the maximum number of bins to distribute the data into is 3:
x  count
a  2
b  3
c  5
d  7
e  9
The sum of count is 26 and we need to distribute it into 3 bins, which averages to 8.66, so each bin should have a total count close to 8 or 9:
cluster_id  group
0           {e}
1           {d,a}
2           {c,b}

I was able to figure it out: I used a bin-packing solution to work on it.
https://en.wikipedia.org/wiki/Bin_packing_problem
>>> import binpacking
>>> import pandas as pd
>>> df = pd.DataFrame()
>>> df['x']= ['a','b','c','d','e']
>>> df['count']=[2,3,5,7,9]
>>> df
   x  count
0  a      2
1  b      3
2  c      5
3  d      7
4  e      9
>>> map_exe_count = df.set_index('x').to_dict()['count']
>>> bins = 3
>>> groups = binpacking.to_constant_bin_number(map_exe_count, bins)
>>> exes_per_bin = [list(group.keys()) for group in groups if len(group.keys()) > 0]
>>> exes_per_bin
[['e'], ['d', 'a'], ['c', 'b']]
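If you would rather not depend on the binpacking package, a minimal greedy sketch of the same idea (my own illustration, not taken from the library) sorts items by count descending and always places the next item into the currently lightest bin:

# standalone greedy sketch: largest counts first, each into the lightest bin
counts = {'a': 2, 'b': 3, 'c': 5, 'd': 7, 'e': 9}
packed = [{'total': 0, 'items': []} for _ in range(3)]
for item, cnt in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
    lightest = min(packed, key=lambda b: b['total'])
    lightest['items'].append(item)
    lightest['total'] += cnt
print([b['items'] for b in packed])  # [['e'], ['d', 'a'], ['c', 'b']]

For this data it reproduces the same split (bin totals 9, 9, 8), though for other inputs it is only a heuristic, not an optimal packing.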

Related

How to get event wise frequency and the frequency of each event in a dataframe?

I have a dataset like:
Data
a
a
a
a
a
b
b
b
a
a
b
I want to add a column that looks like the one below. The data will be in the form a1,1, where the first part (a1) identifies the nth run of that event and the second part (,1) is the running count within that run. Is there a way to do this using Python?
Data Frequency
a a1,1
a a1,2
a a1,3
a a1,4
a a1,5
b b1,1
b b1,2
b b1,3
a a2,1
a a2,2
b b2,1
You can use:
# identify changes in Data
m = df['Data'].ne(df['Data'].shift()).cumsum()
# cumulated increments within groups
g1 = df.groupby(m).cumcount().add(1).astype(str)
# increments of different subgroups per Data
g2 = (df.loc[~m.duplicated(), 'Data']
        .groupby(df['Data']).cumcount().add(1)
        .reindex(df.index, method='ffill')
        .astype(str)
      )
df['Frequency'] = df['Data'].add(g2+','+g1)
Output:
Data Frequency
0 a a1,1
1 a a1,2
2 a a1,3
3 a a1,4
4 a a1,5
5 b b1,1
6 b b1,2
7 b b1,3
8 a a2,1
9 a a2,2
10 b b2,1
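If the reindex/ffill step above feels opaque, here is an equivalent sketch (my restatement, not part of the original answer) that builds both counters straight from the run id:

# df as defined in the question
runs = df['Data'].ne(df['Data'].shift()).cumsum()                    # run id, bumps on every value change
occ = (~runs.duplicated()).astype(int).groupby(df['Data']).cumsum()  # nth run of this value so far
within = df.groupby(runs).cumcount() + 1                             # position inside the current run
df['Frequency'] = df['Data'] + occ.astype(str) + ',' + within.astype(str)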
Code:
from itertools import groupby
k = [key for key, _group in groupby(df['Data'].tolist())] #OUTPUT ['a', 'b', 'a', 'b']
Key = [v+f'{k[:i].count(v)+1}' for i,v in enumerate(k)] #OUTPUT ['a1', 'b1', 'a2', 'b2']
Sum = [sum(1 for _ in _group) for key, _group in groupby(df['Data'].tolist())] #OUTPUT [5, 3, 2, 1]
df['Frequency'] = [f'{K},{S}' for I, K in enumerate(Key) for S in range(1, Sum[I]+1)]
Output:
Data Frequency
0 a a1,1
1 a a1,2
2 a a1,3
3 a a1,4
4 a a1,5
5 b b1,1
6 b b1,2
7 b b1,3
8 a a2,1
9 a a2,2
10 b b2,1
def function1(dd: pd.DataFrame):
    dd2 = (dd.assign(col2=dd.col1.ne(dd.col1.shift()).cumsum())
             .assign(col2=lambda dd: dd.Data + dd.col2.astype(str))
             .assign(rk=dd.groupby('col1').col1.transform('cumcount').astype(int) + 1)
             .assign(col3=lambda dd: dd.col2 + ',' + dd.rk.astype(str)))
    return dd2.loc[:, ['Data', 'col3']]

(df1.assign(col1=df1.Data.ne(df1.Data.shift()).cumsum())
    .groupby(['Data'], group_keys=False).apply(function1)
    .sort_index())
Data col3
0 a a1,1
1 a a1,2
2 a a1,3
3 a a1,4
4 a a1,5
5 b b1,1
6 b b1,2
7 b b1,3
8 a a2,1
9 a a2,2
10 b b2,1

How to remove row and rename multiindex table

I have a multi-index data frame like the one below, and I would like to remove the row above 'A' (i.e., shift the header rows up):
metric  data  data
           F     K
           C     B
A          2     3
B          4     5
C          6     7
D          8     9
desired output:
ALIAS   data  data
metric     F     K
A          2     3
B          4     5
C          6     7
D          8     9
I looked at multiple posts but could not find anything close to the desired outcome. How can I achieve it?
https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html
Let's try DataFrame.droplevel to remove level 2 from the columns, and DataFrame.rename_axis to update column axis names:
df = df.droplevel(level=2, axis=1).rename_axis(['ALIAS', 'metric'], axis=1)
Or with the index equivalent methods Index.droplevel and Index.rename:
df.columns = df.columns.droplevel(2).rename(['ALIAS', 'metric'])
df:
ALIAS  data   
metric    F  K
A         2  3
B         4  5
C         6  7
D         8  9
Setup:
import numpy as np
import pandas as pd
df = pd.DataFrame(
    np.arange(2, 10).reshape(-1, 2),
    index=list('ABCD'),
    columns=pd.MultiIndex.from_arrays([
        ['data', 'data'],
        ['F', 'K'],
        ['C', 'B']
    ], names=['metric', None, None])
)
df:
metric data   
          F  K
          C  B
A         2  3
B         4  5
C         6  7
D         8  9

Getting the total for some columns (independently) in a data frame with python [duplicate]

I have the following DataFrame:
In [1]:
df = pd.DataFrame({'a': [1, 2, 3],
                   'b': [2, 3, 4],
                   'c': ['dd', 'ee', 'ff'],
                   'd': [5, 9, 1]})
df
Out [1]:
a b c d
0 1 2 dd 5
1 2 3 ee 9
2 3 4 ff 1
I would like to add a column 'e' which is the sum of columns 'a', 'b' and 'd'.
Going across forums, I thought something like this would work:
df['e'] = df[['a', 'b', 'd']].map(sum)
But it didn't.
I would like to know the appropriate operation with the list of columns ['a', 'b', 'd'] and df as inputs.
You can just call sum and set the param axis=1 to sum across the rows; this will ignore non-numeric columns (in recent pandas versions you may need to pass numeric_only=True to skip them):
In [91]:
df = pd.DataFrame({'a': [1,2,3], 'b': [2,3,4], 'c':['dd','ee','ff'], 'd':[5,9,1]})
df['e'] = df.sum(axis=1)
df
Out[91]:
a b c d e
0 1 2 dd 5 8
1 2 3 ee 9 14
2 3 4 ff 1 8
If you want to just sum specific columns then you can create a list of the columns and remove the ones you are not interested in:
In [98]:
col_list= list(df)
col_list.remove('d')
col_list
Out[98]:
['a', 'b', 'c']
In [99]:
df['e'] = df[col_list].sum(axis=1)
df
Out[99]:
a b c d e
0 1 2 dd 5 3
1 2 3 ee 9 5
2 3 4 ff 1 7
If you have just a few columns to sum, you can write:
df['e'] = df['a'] + df['b'] + df['d']
This creates the new column e with the values:
a b c d e
0 1 2 dd 5 8
1 2 3 ee 9 14
2 3 4 ff 1 8
For longer lists of columns, EdChum's answer is preferred.
Create a list of the column names you want to add up, then sum along axis=1:
list_name = ['a', 'b', 'd']
df['total'] = df.loc[:, list_name].sum(axis=1)
If you only want the sum over certain rows, replace ':' with a row selection.
This is a simpler way using iloc to select which columns to sum:
df['f']=df.iloc[:,0:2].sum(axis=1)
df['g']=df.iloc[:,[0,1]].sum(axis=1)
df['h']=df.iloc[:,[0,3]].sum(axis=1)
Produces:
a b c d e f g h
0 1 2 dd 5 8 3 3 6
1 2 3 ee 9 14 5 5 11
2 3 4 ff 1 8 7 7 4
I can't find a way to combine a range and specific columns that works, e.g. something like:
df['i']=df.iloc[:,[[0:2],3]].sum(axis=1)  # invalid syntax
df['i']=df.iloc[:,[0:2,3]].sum(axis=1)    # invalid syntax
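One combination that does work, assuming NumPy is available, is to build the integer positions with np.r_, which concatenates slices and scalars into a single index array:

import numpy as np
# columns 0-1 plus column 3 ('a', 'b' and 'd'), summed row-wise
df['i'] = df.iloc[:, np.r_[0:2, 3]].sum(axis=1)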
You can simply pass your dataframe into the following function:
def sum_frame_by_column(frame, new_col_name, list_of_cols_to_sum):
    frame[new_col_name] = frame[list_of_cols_to_sum].astype(float).sum(axis=1)
    return frame
Example:
Suppose you have a dataframe awards_frame with columns award_1, award_2 and award_3, and you want a new column showing the sum of awards for each row.
Usage:
Simply pass awards_frame into the function, specifying the name of the new column and a list of the column names to be summed:
sum_frame_by_column(awards_frame, 'award_sum', ['award_1','award_2','award_3'])
The following syntax helped me when my columns are in sequence:
awards_frame.values[:,1:4].sum(axis =1)
You can use the function aggregate, or its alias agg:
df[['a','b','d']].agg('sum', axis=1)
The advantage of agg is that you can use multiple aggregation functions:
df[['a','b','d']].agg(['sum', 'prod', 'min', 'max'], axis=1)
Output:
sum prod min max
0 8 10 1 5
1 14 54 2 9
2 8 12 1 4
The shortest and simplest way here is to use
df.eval('e = a + b + d')
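Note that df.eval with an assignment expression returns a new DataFrame by default, so assign the result back (or pass inplace=True) to keep the new column:

df = df.eval('e = a + b + d')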

Pandas : Sum multiple columns and get results in multiple columns

I have a "sample.txt" like this.
idx A B C D cat
J 1 2 3 1 x
K 4 5 6 2 x
L 7 8 9 3 y
M 1 2 3 4 y
N 4 5 6 5 z
O 7 8 9 6 z
With this dataset, I want to get sums both down the rows and across the columns.
Aggregating the rows (by category) is not a big deal; I produced that result like this:
### MY CODE ###
import pandas as pd
df = pd.read_csv('sample.txt',sep="\t",index_col='idx')
df.info()
df2 = df.groupby('cat').sum()
print( df2 )
The result is like this.
A B C D
cat
x 5 7 9 3
y 8 10 12 7
z 11 13 15 11
But I don't know how to write code to get a result like this
(simply adding the values of columns A and B, as well as columns C and D):
AB CD
J 3 4
K 9 8
L 15 12
M 3 7
N 9 11
O 15 15
Could anybody help me write this?
By the way, I don't want to do it like this
(it looks too dull, but if it is the only way, I'll deem it acceptable):
df2 = df['A'] + df['B']
df3 = df['C'] + df['D']
df = pd.DataFrame([df2,df3],index=['AB','CD']).transpose()
print( df )
When you pass a dictionary or callable to groupby, it gets applied along an axis. I specified axis one, which is columns:
d = dict(A='AB', B='AB', C='CD', D='CD')
df.groupby(d, axis=1).sum()
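Note that groupby(..., axis=1) is deprecated in recent pandas versions. A sketch that avoids it, assuming the same mapping d, is to transpose, group the index labels through d, and transpose back:

# select the mapped columns, group the transposed index labels via d
df[list(d)].T.groupby(d).sum().T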
Use concat with sum:
df = df.set_index('idx')
df = pd.concat([df[['A', 'B']].sum(1), df[['C', 'D']].sum(1)], axis=1, keys=['AB','CD'])
print( df)
AB CD
idx
J 3 4
K 9 8
L 15 12
M 3 7
N 9 11
O 15 15
Does this do what you need? By using axis=1 with DataFrame.apply, you can use the data that you want in a row to construct a new column. Then you can drop the columns that you don't want anymore.
In [1]: import pandas as pd
In [5]: df = pd.DataFrame(columns=['A', 'B', 'C', 'D'], data=[[1, 2, 3, 4], [1, 2, 3, 4]])
In [6]: df
Out[6]:
A B C D
0 1 2 3 4
1 1 2 3 4
In [7]: df['CD'] = df.apply(lambda x: x['C'] + x['D'], axis=1)
In [8]: df
Out[8]:
A B C D CD
0 1 2 3 4 7
1 1 2 3 4 7
In [13]: df.drop(['C', 'D'], axis=1)
Out[13]:
A B CD
0 1 2 7
1 1 2 7

merge/duplicate two data sets by pandas

I am trying to merge two datasets using pandas. One is location (longitude and latitude) and the other is a time frame (0 to 24 hrs in 15 min steps = 96 data points).
Here is the sample code:
s1 = pd.Series([1, 2, 3])
s2 = pd.Series([4, 5, 6])
df = pd.DataFrame([list(s1), list(s2)], columns = ["A", "B", "C"])
timeframe = 15  # 15-minute step, as described above
timeframe_array = []
for i in range(0, 3600, timeframe):
    timeframe_array.append(i)
And I want to get the data like this:
A B C time
0 1 2 3 0
1 1 2 3 15
2 1 2 3 30
3 1 2 3 45
...
How can I get the data like this?
While not particularly elegant, this should work:
from __future__ import division # only needed if you're using Python 2
import pandas as pd
from math import ceil
# Constants
timeframe = 15
total_t = 3600
Create df1:
s1 = [1, 2, 3]
s2 = [4, 5, 6]
df1 = pd.DataFrame([s1, s2], columns=['A', 'B', 'C'])
Next, we want to build df2 such that the sequence 0-3600 (step=15) is replicated for each row in df1. We can extract the number of rows with df1.shape[0] (which is 2 in this case).
df2 = pd.DataFrame({'time': range(0, total_t * df1.shape[0], timeframe)})
Next, you need to replicate the rows in df1 to match df2.
factor = ceil(df2.shape[0] / df1.shape[0])
df1_f = pd.concat([df1] * factor).sort_index().reset_index(drop=True)
Lastly, join the two data frames together and trim off any excess rows.
df3 = df1_f.join(df2, how='left')[:df2.shape[0]]
Pandas may have a built-in way to do this, but to my knowledge both join and merge can only make up a difference in rows by filling with a constant (NaN by default).
Result:
>>> print(df3.head(4))
A B C time
0 1 2 3 0
1 1 2 3 15
2 1 2 3 30
3 1 2 3 45
>>> print(df3.tail(4))
A B C time
476 4 5 6 7140
477 4 5 6 7155
478 4 5 6 7170
479 4 5 6 7185
>>> df3.shape # (480, 4)
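Since pandas 1.2 there is also a built-in cross join that pairs every row of df1 with every time step directly, assuming the goal is to repeat each location for each time point:

times = pd.DataFrame({'time': range(0, total_t, timeframe)})
# every (row, time) combination: row 0 with all times, then row 1
df3 = df1.merge(times, how='cross')

Note this repeats the same 0-3585 range for each row (480 rows in total), rather than the single continuous 0-7185 sequence produced above.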
