Sum of only certain columns in a pandas Dataframe - python

I have a dataframe similar to the one below. I need to compute the sum of only certain columns: Jan-16, Feb-16, Mar-16, Apr-16 and May-16. I have these columns in a list called months_list.
| Id      | Name        | Jan-16 | Feb-16 | Mar-16 | Apr-16 | May-16 |
|---------|-------------|--------|--------|--------|--------|--------|
| 4674393 | John Miller | 0      | 1      | 1      | 1      | 1      |
| 4674395 | Joe Smith   | 0      | 0      | 1      | 1      | 1      |
My output should look like this:
| Id      | Name        | Jan-16 | Feb-16 | Mar-16 | Apr-16 | May-16 |
|---------|-------------|--------|--------|--------|--------|--------|
| 4674393 | John Miller | 0      | 1      | 1      | 1      | 1      |
| 4674395 | Joe Smith   | 0      | 0      | 1      | 1      | 1      |
| Total   |             | 0      | 1      | 2      | 2      | 2      |
A new row called 'Total' should be introduced, with a column-wise sum for all the columns in my months_list: Jan-16, Feb-16, Mar-16, Apr-16 and May-16.
I tried the below and it did not work; I got all NaN values:
df.loc['Total', :] = df[months_list].sum(axis=1)

You are using the wrong value for the axis parameter.
`axis=0`: sums the column values (one total per column)
`axis=1`: sums the row values (one total per row)
Assuming your df to be:
In [4]: df
Out[4]:
Id Name Jan-16 Feb-16 Mar-16 Apr-16 May-16
0 4674393 John Miller 0 1 1 1 1
1 4674395 Joe Smith 0 0 1 1 1
In [10]: months_list =['Jan-16', 'Feb-16', 'Mar-16', 'Apr-16', 'May-16']
Your code should be:
In [12]: df.loc['Total'] = df[months_list].sum()
In [13]: df
Out[13]:
Id Name Jan-16 Feb-16 Mar-16 Apr-16 May-16
0 4674393.0 John Miller 0.0 1.0 1.0 1.0 1.0
1 4674395.0 Joe Smith 0.0 0.0 1.0 1.0 1.0
Total NaN NaN 0.0 1.0 2.0 2.0 2.0
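As a side note, Id and Name come out as NaN in the Total row because only the month columns are summed. If you want the Total label shown in the desired output, a small follow-up sketch (a variation added here, not part of the original answer):
# Fill the non-month cells so they don't show as NaN
# (hypothetical labels matching the desired output).
df.loc['Total', ['Id', 'Name']] = ['Total', '']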

Related

Generate date column within a range for every unique ID in python

I have a data set which has unique IDs and names.
| ID | NAME |
| -------- | -------------- |
| 1 | Jane |
| 2 | Max |
| 3 | Tom |
| 4 | Beth |
Now, I want to generate a column with dates using a date range for all the IDs. For example, if the date range is ('2019-02-11', '2019-02-15'), I want the following output.
| ID | NAME | DATE |
| -------- | -------------- | -------------- |
| 1 | Jane | 2019-02-11 |
| 1 | Jane | 2019-02-12 |
| 1 | Jane | 2019-02-13 |
| 1 | Jane | 2019-02-14 |
| 1 | Jane | 2019-02-15 |
| 2 | Max | 2019-02-11 |
| 2 | Max | 2019-02-12 |
| 2 | Max | 2019-02-13 |
| 2 | Max | 2019-02-14 |
| 2 | Max | 2019-02-15 |
and so on for all the IDs. What is the most efficient way to do this in Python?
You can do this with a pandas cross merge:
import pandas as pd

df = pd.DataFrame([[1, 'Jane'], [2, 'Max'], [3, 'Tom'], [4, 'Beth']],
                  columns=['ID', 'NAME'])
print(df)

df2 = pd.DataFrame(
    [['2022-01-01'], ['2022-01-02'], ['2022-01-03'], ['2022-01-04']],
    columns=['DATE'])
print(df2)

# how='cross' pairs every row of df with every row of df2.
df3 = pd.merge(df, df2, how='cross')
print(df3)
Output:
ID NAME
0 1 Jane
1 2 Max
2 3 Tom
3 4 Beth
DATE
0 2022-01-01
1 2022-01-02
2 2022-01-03
3 2022-01-04
ID NAME DATE
0 1 Jane 2022-01-01
1 1 Jane 2022-01-02
2 1 Jane 2022-01-03
3 1 Jane 2022-01-04
4 2 Max 2022-01-01
5 2 Max 2022-01-02
6 2 Max 2022-01-03
7 2 Max 2022-01-04
8 3 Tom 2022-01-01
9 3 Tom 2022-01-02
10 3 Tom 2022-01-03
11 3 Tom 2022-01-04
12 4 Beth 2022-01-01
13 4 Beth 2022-01-02
14 4 Beth 2022-01-03
15 4 Beth 2022-01-04
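Rather than hard-coding the dates, you could build the DATE frame from the question's range with pd.date_range; a minimal sketch (dates adapted to the range in the question):
import pandas as pd

df = pd.DataFrame([[1, 'Jane'], [2, 'Max'], [3, 'Tom'], [4, 'Beth']],
                  columns=['ID', 'NAME'])

# One row per day in the requested range.
dates = pd.DataFrame({'DATE': pd.date_range('2019-02-11', '2019-02-15')})

out = df.merge(dates, how='cross')
print(out)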

Multiplying pandas columns based on multiple conditions

I have a df like this
| count | people | A | B | C |
|---------|--------|-----|-----|-----|
| yes | siya | 4 | 2 | 0 |
| no | aish | 4 | 3 | 0 |
| total | | 4 | | 0 |
| yes | dia | 6 | 4 | 0 |
| no | dia | 6 | 2 | 0 |
| total | | 6 | | 0 |
I want an output like the one below
| count | people | A | B | C |
|---------|--------|-----|-----|-----|
| yes | siya | 4 | 2 | 8 |
| no | aish | 4 | 3 | 0 |
| total | | 4 | | 0 |
| yes | dia | 6 | 4 | 0 |
| no | dia | 6 | 2 | 2 |
| total | | 6 | | 0 |
The goal is to calculate column C by multiplying A and B, but only where the count value is "yes". However, if the same person appears with both a "yes" and a "no" row (as dia does), the calculation should be done for the "no" row instead.
I tried this much so far:
df.C = df.groupby("Host", as_index=False).apply(
    lambda dfx: df.A * df.B if (df['count'] == 'no') else df.A * df.B)
But I was not able to achieve the goal. Any idea how I can achieve this output?
import numpy as np

# Set conditions
c1 = df.groupby('people')['count'].transform('nunique').eq(1) & df['count'].eq('yes')
c2 = df.groupby('people')['count'].transform('nunique').gt(1) & df['count'].eq('no')
# Put the conditions in a list
c = [c1, c2]
# Make the choices corresponding to the condition list
choice = [df['A'] * df['B'], len(df[df['count'].eq('no')])]
# Apply np.select; rows matching neither condition get 0
df['C'] = np.select(c, choice, 0)
print(df)
count people A B C
0 yes siya 4 2.0 8.0
1 no aish 4 3.0 0.0
2 total NaN 4 0.0 0.0
3 yes dia 6 4.0 0.0
4 no dia 6 2.0 2.0
5 total NaN 6 NaN 0.0
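For reference, a minimal sketch reconstructing the question's frame so the snippet above runs as-is (assuming the blank cells are NaN):
import numpy as np
import pandas as pd

# Hypothetical reconstruction of the question's data.
df = pd.DataFrame({
    'count':  ['yes', 'no', 'total', 'yes', 'no', 'total'],
    'people': ['siya', 'aish', np.nan, 'dia', 'dia', np.nan],
    'A':      [4, 4, 4, 6, 6, 6],
    'B':      [2, 3, np.nan, 4, 2, np.nan],
})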

Change Duplicate values index

I have a dataset like:
|   | Name | Id |
|---|------|----|
| 0 | nick | 1  |
| 1 | john | 2  |
| 2 | mick | 3  |
| 3 | nick | 4  |
| 4 | mick | 5  |
| 5 | nick | 6  |
And I want to reset the Id column so duplicate names reuse their first Id, like:
|   | Name | Id |
|---|------|----|
| 0 | nick | 1  |
| 1 | john | 2  |
| 2 | mick | 3  |
| 3 | nick | 1  |
| 4 | mick | 3  |
| 5 | nick | 1  |
Use factorize on the Name column:
df['Id'] = pd.factorize(df['Name'])[0] + 1
print (df)
Name Id
0 nick 1
1 john 2
2 mick 3
3 nick 1
4 mick 3
5 nick 1
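For reference, an equivalent sketch using groupby().ngroup(); with sort=False the groups are numbered in order of first appearance, matching factorize:
df['Id'] = df.groupby('Name', sort=False).ngroup() + 1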

SQL / Python - how to return count for each attribute and sub-attribute from another table

I have a SELECT that returns a table with:
- 5 possible values for region (from 1 to 5), and
- 3 possible values for age (1 to 3), with 2 possible values (1 or 2) for gender in each age group.
So table 1. looks something like this:
+----------+-----------+--------------+---------------+---------+
| att_name | att_value | sub_att_name | sub_att_value | percent |
+----------+-----------+--------------+---------------+---------+
| region | 1 | NULL | 0 | 34 |
| region | 2 | NULL | 0 | 22 |
| region | 3 | NULL | 0 | 15 |
| region | 4 | NULL | 0 | 37 |
| region | 5 | NULL | 0 | 12 |
| age | 1 | gender | 1 | 28 |
| age | 1 | gender | 2 | 8 |
| age | 2 | gender | 1 | 13 |
| age | 2 | gender | 2 | 45 |
| age | 3 | gender | 1 | 34 |
| age | 3 | gender | 2 | 34 |
+----------+-----------+--------------+---------------+---------+
The second table holds records with values from table 1, where the unique values of att_name and sub_att_name in table 1 are the attributes (columns) of table 2:
+--------+-----+-----+
| region | age | gen |
+--------+-----+-----+
| 2 | 2 | 1 |
| 3 | 1 | 2 |
| 3 | 3 | 2 |
| 1 | 3 | 1 |
| 4 | 2 | 2 |
| 5 | 2 | 1 |
+--------+-----+-----+
I want to return the count of each unique value for the region and age/gender attributes from the second table.
Final result should look like this:
+----------+-----------+-----------------+--------------+---------------+---------------------+---------+
| att_name | att_value | att_value_count | sub_att_name | sub_att_value | sub_att_value_count | percent |
+----------+-----------+-----------------+--------------+---------------+---------------------+---------+
| region | 1 | 1 | NULL | 0 | NULL | 34 |
| region | 2 | 1 | NULL | 0 | NULL | 22 |
| region | 3 | 2 | NULL | 0 | NULL | 15 |
| region | 4 | 1 | NULL | 0 | NULL | 37 |
| region | 5 | 1 | NULL | 0 | NULL | 12 |
| age | 1 | NULL | gender | 1 | 0 | 28 |
| age | 1 | NULL | gender | 2 | 1 | 8 |
| age | 2 | NULL | gender | 1 | 2 | 13 |
| age | 2 | NULL | gender | 2 | 1 | 45 |
| age | 3 | NULL | gender | 1 | 1 | 34 |
| age | 3 | NULL | gender | 2 | 1 | 34 |
+----------+-----------+-----------------+--------------+---------------+---------------------+---------+
Explanation
Region - has no sub attribute, so sub_att_name and sub_att_value_count are NULL.
att_value_count - counts the appearances of each unique region (1 for every region except region 3, which appears 2 times).
Age/gender - counts combinations of age and gender (the groups are 1/1, 1/2, 2/1, 2/2, 3/1 and 3/2).
Since we only need to fill in values for these combinations, att_value_count is NULL for the age rows.
I'm tagging python and pandas in this question since I don't know if this is possible in SQL at all... I hope it is, since we are using analytical tools to pull tables and views from the database, which feels more natural.
EDIT
SQL - the answers look complicated; I'll test and see if it works tomorrow.
Python - seems more appealing now. Is there a way to parse att_name and sub_att_name, find level 1 and level 2 attributes, and act accordingly? I think this is only possible with Python, and we do have different attributes and attribute levels.
I'm already thankful for the given answers!
I think this is good enough to solve the issue:
import numpy as np
import pandas as pd

data_1 = {'att_name': ['region', 'region', 'region', 'region', 'region',
                       'age', 'age', 'age', 'age', 'age', 'age'],
          'att_value': [1, 2, 3, 4, 5, 1, 1, 2, 2, 3, 3],
          'sub_att_name': [np.nan, np.nan, np.nan, np.nan, np.nan,
                           'gender', 'gender', 'gender', 'gender', 'gender', 'gender'],
          'sub_att_value': [0, 0, 0, 0, 0, 1, 2, 1, 2, 1, 2],
          'percent': [34, 22, 15, 37, 12, 28, 8, 13, 45, 34, 34]}
df_1 = pd.DataFrame(data_1)

data_2 = {'region': [2, 3, 3, 1, 4, 5], 'age': [2, 1, 3, 3, 2, 2], 'gen': [1, 2, 2, 1, 2, 1]}
df_2 = pd.DataFrame(data_2)

# Count age/gen combinations in df_2, then attach them to df_1.
df_2_grouped = (df_2.groupby(['age', 'gen'], as_index=False)
                    .agg({'region': 'count'})
                    .rename(columns={'region': 'counts'}))
df_final = (df_1.merge(df_2_grouped, how='left',
                       left_on=['att_value', 'sub_att_value'],
                       right_on=['age', 'gen'])
                .drop(columns=['age', 'gen'])
                .rename(columns={'counts': 'sub_att_value_count'}))
Output of df_final:
att_name att_value sub_att_name sub_att_value percent sub_att_value_count
0 region 1 NaN 0 34 NaN
1 region 2 NaN 0 22 NaN
2 region 3 NaN 0 15 NaN
3 region 4 NaN 0 37 NaN
4 region 5 NaN 0 12 NaN
5 age 1 gender 1 28 NaN
6 age 1 gender 2 8 1.0
7 age 2 gender 1 13 2.0
8 age 2 gender 2 45 1.0
9 age 3 gender 1 34 1.0
10 age 3 gender 2 34 1.0
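Note that this leaves att_value_count out; the region counts could be attached the same way. A hedged sketch (mask is a helper name introduced here):
mask = df_final['att_name'].eq('region')
# Map each region value to how often it appears in df_2.
df_final.loc[mask, 'att_value_count'] = df_final.loc[mask, 'att_value'].map(
    df_2['region'].value_counts())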
This is a pandas solution, based on lookup and map.
# step 1: region counts via map
df['att_value_count'] = np.nan
s = df['att_name'].eq('region')
df.loc[s, 'att_value_count'] = df.loc[s, 'att_value'].map(df2['region'].value_counts())
# step 2: age/gen counts via lookup
counts = df2.groupby('age')['gen'].value_counts().unstack('gen', fill_value=0)
df['sub_att_value_count'] = np.nan
tmp = df.loc[~s, ['att_value', 'sub_att_value']]
df.loc[~s, 'sub_att_value_count'] = counts.lookup(tmp['att_value'], tmp['sub_att_value'])
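Note that DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0; an equivalent sketch using positional indexers on the counts frame:
# Translate the labels into row/column positions, then index the array.
rows = counts.index.get_indexer(tmp['att_value'])
cols = counts.columns.get_indexer(tmp['sub_att_value'])
df.loc[~s, 'sub_att_value_count'] = counts.to_numpy()[rows, cols]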
You can also use merge, which is more SQL-friendly. For example, in step 2:
counts = df2.groupby('age')['gen'].value_counts().reset_index(name='sub_att_value_count')
(df.merge(counts,
          left_on=['att_value', 'sub_att_value'],
          right_on=['age', 'gen'],
          how='outer')
   .drop(['age', 'gen'], axis=1)
)
Output:
att_name att_value sub_att_name sub_att_value percent att_value_count sub_att_value_count
-- ---------- ----------- -------------- --------------- --------- ----------------- ---------------------
0 region 1 nan 0 34 1 nan
1 region 2 nan 0 22 1 nan
2 region 3 nan 0 15 2 nan
3 region 4 nan 0 37 1 nan
4 region 5 nan 0 12 1 nan
5 age 1 gender 1 28 nan 0
6 age 1 gender 2 8 nan 1
7 age 2 gender 1 13 nan 2
8 age 2 gender 2 45 nan 1
9 age 3 gender 1 34 nan 1
10 age 3 gender 2 34 nan 1
Update: Excuse my SQL skill if this doesn't run (it should though)
select
    b.*,
    c.sub_att_value_count
from
    (select
        df1.*,
        a.att_value_count
    from
        (select
            region, count(*) as att_value_count
        from df2
        group by region
        ) as a
        full outer join df1
            on df1.att_value = a.region
    ) as b
full outer join
    (select
        age, gen, count(*) as sub_att_value_count
    from df2
    group by age, gen
    ) as c
    on b.att_value = c.age and b.sub_att_value = c.gen

How to Sum and Group Data Frame Elements?

+-----+-------+--------+
| | Buyer | Sex |
+-----+-------+--------+
| 0 | 1 | Male |
| 1 | 1 | Female |
| 2 | 0 | Male |
| 3 | 1 | Female |
| ... | ... | ... |
+-----+-------+--------+
I'd like to sum and group the data frame above into the data frame (table) below. Does pandas have any built-in functions that can accomplish this, or do I have to manually iterate, sum, and group?
+---+---------+------+
| | Female | Male |
+---+---------+------+
| 0 | 81 | 392 |
| 1 | 539 | 233 |
+---+---------+------+
Use pivot_table with 'count' as your aggfunc.
Also, since some combinations might never occur, use fillna to fill the empty cells with 0:
In [28]:
df['V'] = 1
print(df)
Buyer Sex V
0 1 Male 1
1 1 Female 1
2 0 Male 1
3 1 Female 1
In [29]:
print(df.pivot_table(index='Buyer', columns='Sex', values='V', aggfunc='count').fillna(0))
Sex Female Male
Buyer
0 0 1
1 2 1
frame.groupby('Sex').aggregate(some_function)
should work, or even, in your case:
frame.groupby('Sex').sum()
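For reference, pd.crosstab builds the same contingency table directly, without the helper column; a minimal sketch with hypothetical data:
import pandas as pd

df = pd.DataFrame({'Buyer': [1, 1, 0, 1],
                   'Sex': ['Male', 'Female', 'Male', 'Female']})

# Count each Buyer/Sex combination; missing combinations appear as 0.
print(pd.crosstab(df['Buyer'], df['Sex']))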
