Pandas groupby and sum - python

I have a pandas DataFrame with three columns A, B, C, and I need to sum up the values in C based on the row values in A and B.
Below is the scenario
A           B           C
Distance_a  distance_b  5
Distance_a  distance_c  6
distance_b  distance_c  7
distance_b  distance_d  7
distance_d  Distance_a  9
If I want to find the cumulative distance from Distance_a, my code needs to add 5 and 6, and it is also supposed to consider the last row (distance_d, Distance_a) and add 9 as well.
So the cumulative distance from Distance_a will be 5 + 6 + 9 = 20.

Hongpei's answer is certainly more efficient, but if you just want the sum for distance_a, you can do the following as well:
import pandas as pd

# initialize the sample data
data = {'A': ['distance_a', 'distance_a', 'distance_b', 'distance_b', 'distance_d'],
        'B': ['distance_b', 'distance_c', 'distance_c', 'distance_d', 'distance_a'],
        'C': [5, 6, 7, 7, 9]}

# create the pandas DataFrame
df = pd.DataFrame(data)

# group by columns A and B individually, summing only the distance column C
col_A_groupby = df.groupby('A')['C'].sum()
col_B_groupby = df.groupby('B')['C'].sum()

# add the two partial sums together
dist_a_sum = col_A_groupby.loc['distance_a'] + col_B_groupby.loc['distance_a']
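A quick sanity check with the data above (note it is all lowercase, unlike the question's mixed case):
print(dist_a_sum)  # 5 + 6 + 9 = 20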

There is an easy workaround: suppose your original DataFrame is df, then you only need to:
pd.concat([df[['A', 'C']],
           df[['B', 'C']].rename(columns={'B': 'A'})],
          sort=False).groupby('A').sum()
Basically what I did is concat df[['A','C']] and df[['B','C']] together (while renaming the second DataFrame's columns to ['A','C']), and then groupby 'A' and sum.

IIUC, a melt and sum are enough
s = df.melt('C').groupby('value').C.sum()
print(s)
Out[113]:
value
Distance_a    20
distance_b    19
distance_c    13
distance_d    16
Name: C, dtype: int64

Pandas: add number of unique values to other dataset (as shown in picture)

I need to add the number of unique values in column C of the right table to the related row of the left table, based on the values in the common column A (as shown in the picture).
Thank you in advance.
Group by column A in the second dataset and calculate the count of each unique value in column C. Merge it with the first dataset on column A. Rename column C to C-count if needed:
>>> count_df = df2.groupby('A', as_index=False).C.nunique()
>>> output = pd.merge(df1, count_df, on='A')
>>> output.rename(columns={'C': 'C-count'}, inplace=True)
>>> output
   A   B  C-count
0  2  22        3
1  3  23        2
2  5  21        1
3  1  24        1
4  6  21        1
Use DataFrameGroupBy.nunique with Series.map for a new column in df1:
df1['C-count'] = df1['A'].map(df2.groupby('A')['C'].nunique())
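For illustration, a minimal runnable sketch; the original tables come from a picture that isn't reproduced here, so the data below is hypothetical, chosen to match the output shown above:
import pandas as pd

# hypothetical left and right tables consistent with the output above
df1 = pd.DataFrame({'A': [2, 3, 5, 1, 6],
                    'B': [22, 23, 21, 24, 21]})
df2 = pd.DataFrame({'A': [2, 2, 2, 3, 3, 5, 1, 6],
                    'C': [10, 11, 12, 10, 11, 10, 10, 10]})

# count the distinct C values per A and map them onto df1
df1['C-count'] = df1['A'].map(df2.groupby('A')['C'].nunique())
print(df1)  # C-count: 3, 2, 1, 1, 1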
This may not be the most efficient way of doing this, so be careful if your tables are big.
Define the following function:
def c_value(a_value, right_table):
    # collect the distinct C values whose row shares this A value
    c_ids = []
    for index, row in right_table.iterrows():
        if row['A'] == a_value:
            if row['C'] not in c_ids:
                c_ids.append(row['C'])
    return len(c_ids)
For this function I'm supposing that right_table is a pandas.DataFrame.
Now, do the following to build the new column (assuming that the left table is also a pandas.DataFrame):
new_column = []
for index, row in left_table.iterrows():
    new_column.append(c_value(row['A'], right_table))
left_table["C-count"] = new_column
After this, the left_table DataFrame should be the one desired (as far as I understand what you need).

How to make a fixed number of groups by percentile from a dataframe in pandas

I am looking for a way to make n (e.g. 20) groups in a DataFrame by percentile of a specific column (the data type is float). I am not sure whether the groupby/quantile functions can take care of this, and if they can, what the code should look like.
There are 3 columns: a, b, c.
i.e., data are sorted by column 'a' and split into 20 groups:
Group 1  = 0 to 5 percentile
Group 2  = 5 to 10 percentile
...
Group 20 = 95 to 100 percentile.
Would there also be a way to find the mean of a, b, and c for each group, and collect those means into another DataFrame?
You can create 20 equal-size bins using this (note: 21 evenly spaced quantile edges, from 0 to 1, give 20 bins):
import numpy as np
df['newcol'] = pd.qcut(df.a, np.linspace(0, 1, 21), duplicates='drop')
Then you can group by newcol to find the summary stats of the a, b, and c columns:
df.groupby(['newcol']).mean()
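Putting both steps together, a minimal runnable sketch with made-up data (the column names follow the question; the values are illustrative):
import numpy as np
import pandas as pd

# made-up data: 1000 rows with float columns a, b, c
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((1000, 3)), columns=['a', 'b', 'c'])

# label each row with its 5%-wide quantile group on column 'a' (0..19)
df['group'] = pd.qcut(df['a'], q=20, labels=False)

# mean of a, b, and c per group, in group order, as a new DataFrame
summary = df.groupby('group')[['a', 'b', 'c']].mean().sort_index()
print(summary.head())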
# group by percentile
profitdf['quantile_a'] = pd.qcut(profitdf['a'], 20)
profitdf['quantile_b'] = pd.qcut(profitdf['b'], 20)
quantile_a = profitdf.groupby(['quantile_a']).mean()
quantile_b = profitdf.groupby(['quantile_b']).mean()
Solved. Thank you everyone.

How to select data from data-frame in a specific manner using pandas in python

I am new to Python. I want to select data from the DataFrame in the following manner, i.e.,
count
2
3
0
6
Here count is my column name and 2, 3, 0, 6, etc. are my row values.
So I want to select rows 1 to 13, then rows 2 to 14, and so on until the last data in the dataset. Is there any solution? Thanks in advance.
Use Series.between to perform boolean indexing:
for i in range(0, n):
    # note: df.count is a DataFrame method, so access the column as df['count']
    print(df[df['count'].between(1 + i, 13 + i)])

# [df[df['count'].between(1 + i, 13 + i)] for i in range(0, n)]  # to keep the frames in a list
Or by the Index:
for i in range(0, n):
    print(df[1 + i:13 + i])
You can do it by using .loc with explicit row labels and column names:
rows = [5, 10]
columns = ["Column1", "Column2"]
df.loc[rows, columns]
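If the goal is really the sliding 13-row windows described in the question, a minimal positional sketch with .iloc (assuming df is the frame shown above) could look like this:
window = 13

# rows 0-12, then rows 1-13, and so on to the end of the frame
for start in range(len(df) - window + 1):
    chunk = df.iloc[start:start + window]
    print(chunk)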

What's the fastest way to select values from columns based on keys in another columns in pandas?

I need a fast way to extract the right values from a pandas dataframe:
Given a DataFrame with (a lot of) data in several named columns, plus an additional column whose values only contain names of the other columns, how do I select values from the data columns using the additional column as the key?
It's simple to do via an explicit loop, but this is extremely slow with something like .iterrows() directly on the DataFrame. Converting to numpy arrays is faster, but still not fast. Can I combine methods from pandas to do it even faster?
Example: This is the kind of DataFrame structure, where columns A and B contain data and column keys contains the keys to select from:
import pandas
df = pandas.DataFrame(
    {'A': [1, 2, 3, 4],
     'B': [5, 6, 7, 8],
     'keys': ['A', 'B', 'B', 'A']},
)
print(df)
output:
Out[1]:
   A  B keys
0  1  5    A
1  2  6    B
2  3  7    B
3  4  8    A
Now I need some fast code that returns a DataFrame like
Out[2]:
   val_keys
0         1
1         6
2         7
3         4
I was thinking something along the lines of this:
tmp = df.melt(id_vars=['keys'], value_vars=['A', 'B'])
out = tmp.loc[tmp['keys'] == tmp['variable']]
which produces:
Out[2]:
  keys variable  value
0    A        A      1
3    A        A      4
5    B        B      6
6    B        B      7
but doesn't have the right order or index. So it's not quite a solution.
Any suggestions?
See if either of these works for you (assuming numpy is imported as np):
df['val_keys'] = np.where(df['keys'] == 'A', df['A'], df['B'])
or
df['val_keys'] = np.select([df['keys'] == 'A', df['keys'] == 'B'],
                           [df['A'], df['B']])
No need to specify the column names explicitly for the code below!
def value(row):
    # row.name is the row's index label; row['keys'] names the column to read
    a = row.name
    b = row['keys']
    c = df.loc[a, b]
    return c

df.apply(value, axis=1)
Have you tried filtering and then assigning back by index? (The column is named keys in the example; a plain dict built from it would collapse duplicate keys, so align on the index instead.)
# split the rows by which column their key points to
df_A = df[df['keys'] == 'A']
df_B = df[df['keys'] == 'B']

# write each group's values into one column, matched on the index
df.loc[df_A.index, 'val_keys'] = df_A['A']
df.loc[df_B.index, 'val_keys'] = df_B['B']
Your df['val_keys'] column will now contain the result as in your val_keys output.
If you want, you can retain just that column, as in your expected output:
df = df[['val_keys']]
Hope this helps :))
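Since the question asks for speed: here is a fully vectorized sketch using NumPy integer indexing (no loop, no apply); it assumes every value in keys names an existing column:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4],
                   'B': [5, 6, 7, 8],
                   'keys': ['A', 'B', 'B', 'A']})

# position of each row's key column within df.columns
col_idx = df.columns.get_indexer(df['keys'])

# pick one value per row via integer indexing on the underlying array
vals = df.to_numpy()[np.arange(len(df)), col_idx]
out = pd.DataFrame({'val_keys': vals}, index=df.index)
print(out)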

Ordering columns in dataframe

Recently updated to pandas 0.17.0 and I'm trying to order the columns in my dataframe alphabetically.
Here are the column labels as they currently are:
['UX2', 'RHO1', 'RHO3', 'RHO2', 'RHO4', 'UX1', 'UX4', 'UX3']
And I want them like this:
['RHO1', 'RHO2', 'RHO3', 'RHO4', 'UX1', 'UX2', 'UX3', 'UX4']
The only way I've been able to do this is following this from 3 years ago: How to change the order of DataFrame columns?
Is there a built-in way to do this in 0.17.0?
To sort the columns alphabetically here, you can just use sort_index:
df.sort_index(axis=1)
The method returns a reindexed DataFrame with the columns in the correct order.
This assumes that all of the column labels are strings (it won't work for a mix of, say, strings and integers). If this isn't the case, you may need to pass an explicit ordering to the reindex method.
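For example, a small sketch of that fallback (the string-sort order here is just illustrative):
# mixed labels: derive an explicit order, then reindex
ordered = sorted(df.columns, key=str)
df = df.reindex(columns=ordered)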
You can just sort them and put them back. Suppose you have this:
df = pd.DataFrame()
for i, n in enumerate(['UX2', 'RHO1', 'RHO3', 'RHO2', 'RHO4', 'UX1', 'UX4', 'UX3']):
    df[n] = [i]
It looks like this:
df
   UX2  RHO1  RHO3  RHO2  RHO4  UX1  UX4  UX3
0    0     1     2     3     4    5    6    7
Do this:
df = df[sorted(df.columns)]
And you should see this:
df
   RHO1  RHO2  RHO3  RHO4  UX1  UX2  UX3  UX4
0     1     3     2     4    5    0    7    6
Create a list of the column labels in the order you want.
cols = ['RHO1', 'RHO2', 'RHO3', 'RHO4', 'UX1', 'UX2', 'UX3', 'UX4']
Then assign this order to your DataFrame df:
df = df[cols]
