Summing columns in dataframe in python

I am trying to add 3 columns' values to come up with a new column holding the total value. The code is below:
df3[["Bronze","Gold","Silver"]] =
df3[["Bronze","Gold","Silver"]].astype("int")
df3["Total Medal"]= df3.iloc[:, -3:0].sum(axis=1)
df3[["Total Medal"]].astype("int")
I know that the Bronze, Gold, and Silver columns hold only 1 and 0 values and that they are the last 3 columns in the dataframe. Their original types were "uint8", so I changed them to "int".
The Total Medal column after these lines comes out as type "float" (instead of int) and holds only the value 0. How can I properly add these columns?

To add the values of 3 columns into a new column, simply do
df['Total Medal'] = df.sum(axis=1)
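Note that df.sum(axis=1) folds in every numeric column of the frame. Here is a minimal sketch with made-up data in the asker's layout, summing only the three medal columns (assuming no other columns should count):
import pandas as pd

df3 = pd.DataFrame({"Bronze": [1, 0, 1], "Gold": [0, 1, 0], "Silver": [1, 1, 0]})
# The asker's iloc[:, -3:0] is an empty slice (it stops at column 0), which is
# why the sum came back as float zeros; iloc[:, -3:] or explicit names work.
df3["Total Medal"] = df3[["Bronze", "Gold", "Silver"]].sum(axis=1).astype("int")
print(df3)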

This can e.g. be done using assign:
import numpy as np
import pandas as pd
#create data frame
data = {"gold":np.random.choice([0,1],size=10),"silver":np.random.choice([0,1],size=10), "bronze":np.random.choice([0,1],size=10)}
df = pd.DataFrame(data)
#calculate new column and add to dataframe
df = df.assign(mysum=df.gold+df.silver+df.bronze)
Edit: df["mysum"] = df.sum(axis=1) only works if your dataframe only has the three relevant columns, because it sums over all columns (and not only over the three you want).

Related

Creating a subset from a dataframe based on a condition from another array

I have a numeric numpy array which I want to use as a condition/filter over column number 4 of a dataframe (df) to extract a subset of the dataframe (sale_data_sub). However, I am getting an empty sale_data_sub (just the column names and no rows) as the result of this code:
sale_data_sub = df.loc[df[4].isin(sale_condition_arr)].values
sale_condition_arr is a numpy array
df is the original dataframe with 100 columns
sale_data_sub is the desired sub-dataframe
Sorry that I didn't include a working sample.
The issue is that your df dataframe doesn't have headers assigned.
try:
#give your dataframe a header:
df = df.set_axis([str(i) for i in list(range(len(df.columns)))], axis='columns')
#then proceed to your usual work with df:
sale_data_sub = df.loc[df["4"].isin(sale_condition_arr)].values #be careful, it's df["4"] not df[4]
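For illustration, here is a minimal runnable sketch of that suggestion; the frame and condition array are hypothetical stand-ins for the asker's objects:
import numpy as np
import pandas as pd

# A frame created without headers, so its columns are the integers 0..5.
df = pd.DataFrame(np.random.randint(0, 10, size=(8, 6)))
sale_condition_arr = np.array([2, 5, 7])

# Rename the integer columns to strings, then filter on column "4".
df = df.set_axis([str(i) for i in range(len(df.columns))], axis='columns')
sale_data_sub = df.loc[df["4"].isin(sale_condition_arr)].values
print(sale_data_sub)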

Subtract dataframes with completely different row names and column names

My dataframe 1 looks like this:
windcodes   name           yield    perp
163197.SH   shangguo comp  2.9248   NO
154563.SH   guosheng comp  2.886    Yes
789645.IB   guoyou comp    3.418    NO
My dataframe 2 looks like this:
windcodes    CALC
1202203.IB   2.5517
1202203.IB   2.48457
1202203.IB   2.62296
and I want my result dataframe 3 to have one more column than dataframe 1, computed by subtracting the value in column 'CALC' of dataframe 2 from the value in column 'yield' of dataframe 1.
The result dataframe 3 should look like this:
windcodes   name           yield    perp  yield-CALC
163197.SH   shangguo comp  2.9248   NO    0.3731
154563.SH   guosheng comp  2.886    Yes   0.40413
789645.IB   guoyou comp    3.418    NO    0.79504
It would be really helpful if anyone can tell me how to do it in python.
Just in case you have completely different indexes, use df2's underlying numpy array:
df1['yield-CALC'] = df1['yield'] - df2['CALC'].values
You can try something like this:
df1['yield-CALC'] = df1['yield'] - df2['CALC']
I'm assuming you don't want to join the dataframes, since the windcodes are not the same.
Do we need to join the 2 dataframes on the windcodes column? The windcodes are all the same in the sample data you have given for dataframe 2. Can you explain this?
If we are going to join on the windcodes field, the code below will work.
df = pd.merge(left=df1, right=df2,how='inner',on='windcodes')
df['yield-CALC'] = df['yield']-df['CALC']
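A minimal runnable sketch of that merge route, using hypothetical frames whose windcodes actually match (in the question's sample they do not, so an inner join there would come back empty):
import pandas as pd

df1 = pd.DataFrame({"windcodes": ["163197.SH", "154563.SH"],
                    "name": ["shangguo comp", "guosheng comp"],
                    "yield": [2.9248, 2.886],
                    "perp": ["NO", "Yes"]})
df2 = pd.DataFrame({"windcodes": ["163197.SH", "154563.SH"],
                    "CALC": [2.5517, 2.48457]})

df = pd.merge(left=df1, right=df2, how='inner', on='windcodes')
df['yield-CALC'] = df['yield'] - df['CALC']
print(df)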
I will try to keep this as detailed as possible:
The environment I have used for coding is Jupyter Notebook.
Importing our required pandas library:
import pandas as pd
Getting your first table's data in the form of a list of lists (you could also use CSV, Excel, etc. here):
data_1 = [["163197.SH","shangguo comp",2.9248,"NO"],\
["154563.SH","guosheng comp",2.886,"Yes"] , ["789645.IB","guoyou comp",3.418,"NO"]]
Creating dataframe one:
df_1 = pd.DataFrame(data_1 , columns = ["windcodes","name","yield","perp"])
df_1
Getting your second table's data in the form of a list of lists (you could also use CSV, Excel, etc. here):
data_2 = [["1202203.IB",2.5517],["1202203.IB",2.48457],["1202203.IB",2.62296]]
Creating dataframe two:
df_2 = pd.DataFrame(data_2 , columns = ["windcodes","CALC"])
df_2
Now creating the third dataframe:
df_3 = df_1.copy() # because the first 4 columns are the same as our first dataframe (copy() keeps df_1 untouched)
df_3
Now calculating the fourth column, i.e. "yield-CALC":
df_3["yield-CALC"] = df_1["yield"] - df_2["CALC"] # element-wise subtraction; pandas aligns the two columns row by row on their index
df_3

How can I calculate a percentage using groupby with pandas

I have 2 questions. First, I have this dataframe:
data = {'Name': ['A', 'B', 'C', 'A', 'D', 'E', 'A', 'C', 'A', 'A', 'A'],
        'Family': ['B1', 'B', 'B', 'B3', 'B', 'B', 'B', 'B1', 'B', 'B3', 'B'],
        'Region': ['North', 'South', 'East', 'West', 'South', 'East', 'West', 'North', 'East', 'West', 'South'],
        'Cod': ['1', '2', '2', '1', '5', '1', '1', '1', '2', '1', '3'],
        'Customer number': ['A111', 'A223', 'A555', 'A333', 'A333', 'A444', 'A222', 'A111', 'A222', 'A333', 'A221'],
        'Sales': [100, 134, 53, 34, 244, 789, 213, 431, 0, 55, 23]}
and I would like to have a column which gives the percentage of sales within a groupby of the other columns.
Second, if the percentage is 0% (like in the first row) I would like to use another result based on a criterion, for example: if A222 is 0%, use the result of A221.
Well, an answer for question one could be:
# step 1: import pandas
import pandas as pd
df = pd.DataFrame(data)
# step 2: print the dataframe
df
# step 3: calculate the percentage
df['percentage of sales'] = (df['Sales'] / df['Sales'].sum()) * 100
# step 4: join this percentage column to the main dataframe
# (step 3 already added it in place, so this concat duplicates the column)
pd.concat([df, df[['percentage of sales']]], axis=1, sort=False)
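A runnable condensation of those steps on a small hypothetical frame (only two of the question's columns, to keep it short):
import pandas as pd

df = pd.DataFrame({'Customer number': ['A111', 'A223', 'A555'],
                   'Sales': [100, 134, 53]})
# Each row's share of the grand total of Sales.
df['percentage of sales'] = df['Sales'] / df['Sales'].sum() * 100
print(df)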
Answer for question 2: it depends on the condition you want to apply.
That is one way, but the easy way to answer questions 1 and 2 is to convert the dataframe into a numpy array, do the operation, and then bring it back into a dataframe.
Check this answer:
Add column for percentage of total to Pandas dataframe
# Converting the percentage column to a numpy array
npprices = df['percentage'].to_numpy()
npprices
# Loop through the rows and fill each zero with the value from the previous
# row, ASSUMING the previous row is not zero.
for i in range(len(npprices)):
    if npprices[i] == 0:
        npprices[i] = npprices[i-1]
# Converting back into a dataframe
percentage1 = pd.DataFrame({'percentage2': npprices})
# Then joining this percentage column to the dataframe
df2i = pd.concat([df, percentage1[['percentage2']]], axis=1, sort=False)
NOTE: I added it twice by mistake. But of course there could be other, easier approaches; I hope this helps.
Some answers I used:
Creating a Pandas DataFrame from a Numpy array: How do I specify the index column and column headers?
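For comparison, a more idiomatic sketch of the same zero-filling idea, swapping the explicit loop for pandas' replace and ffill (the column values here are made up):
import numpy as np
import pandas as pd

df = pd.DataFrame({'percentage': [5.0, 0.0, 7.5, 0.0]})
# Treat zeros as missing and forward-fill from the previous row.
df['percentage2'] = df['percentage'].replace(0, np.nan).ffill()
print(df)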
I think this is what you want:
import pandas as pd
df = pd.DataFrame(data)
granular_sum_df = df.groupby(['Name', 'Family', 'Region', 'Cod', 'Customer number'])['Sales'].sum().reset_index()
family_sum_df = df.groupby(['Name', 'Family'])['Sales'].sum().reset_index()
final_df = granular_sum_df.merge(family_sum_df, on=['Name', 'Family'])
final_df['Pct'] = final_df['Sales_x']/final_df['Sales_y']
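An alternative sketch, continuing from the df above: transform('sum') broadcasts each group's total back onto the original rows, so no merge is needed for the within-group percentage:
# Divide each row's Sales by its (Name, Family) group total.
df['Pct'] = df['Sales'] / df.groupby(['Name', 'Family'])['Sales'].transform('sum')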

How to keep indexes when summing by columns based on groupby in pandas

I have a dataset where each ID has 6 corresponding rows. I want this dataset grouped by the column ID and aggregated using sum. I wrote this piece of code:
col = [col for col in train.columns if col not in ['Month', 'ID']]
train.groupby('ID')[col].sum().reset_index()
Everything works fine except that I lose the ID column. The unique IDs from my initial dataset have disappeared and instead I just have enumerated ids from 0 up to the number of rows in the resulting dataset. I want to keep the initial IDs, because I will need to merge this dataset with another one later. How can I deal with this problem? Thanks very much for helping!
P.S: deleting reset_index() has no effect
P.S: You can see the two problems in the images. In the first image there is the original dataset; you can see 6 entries for each ID. In the second image there is the dataset that results from the grouped statement. First problem: the IDs are not the same as in the original table. Second problem: the sum over 6 months for each ID is not correct.
Instead of using reset_index() you can simply use the keyword argument as_index: df.groupby('ID', as_index=False)
This will preserve ID as a regular column in the aggregated output, as described in groupby()'s documentation.
as_index : boolean, default True
For aggregated output, return object with group labels as the index. Only relevant for DataFrame input. as_index=False is effectively “SQL-style” grouped output
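A minimal sketch of as_index=False keeping ID as a regular column, with a hypothetical train frame shaped like the asker's:
import pandas as pd

train = pd.DataFrame({'ID': [7, 7, 8, 8], 'Month': [1, 2, 1, 2],
                      'sales': [10, 20, 30, 40]})
col = [c for c in train.columns if c not in ['Month', 'ID']]
# ID stays as a column instead of becoming the index.
print(train.groupby('ID', as_index=False)[col].sum())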
When you group a data frame by some columns, those columns become your new index.
import pandas as pd
import numpy as np
# Create data
n = 6; m = 3
col_id = np.hstack([['id-' + str(i)] * n for i in range(m)])  # 1-D array of repeated ids
np.random.shuffle(col_id)
data = np.random.rand(m*n, m)
columns = ['v'+str(i+1) for i in range(m)]
df = pd.DataFrame(data, columns=columns)
df['ID'] = col_id
# Group by ID
print(df.groupby('ID').sum())
Will simply give you
v1 v2 v3
ID
id-0 2.099219 2.708839 2.766141
id-1 2.554117 2.183166 3.914883
id-2 2.485505 2.739834 2.250873
If you just want the column ID back, you just have to reset_index()
print(df.groupby('ID').sum().reset_index())
which will leave you with
ID v1 v2 v3
0 id-0 2.099219 2.708839 2.766141
1 id-1 2.554117 2.183166 3.914883
2 id-2 2.485505 2.739834 2.250873
Note:
groupby will sort the resulting DataFrame by its group keys. If you don't want that for any reason, just set sort=False (see also the documentation)
print(df.groupby('ID', sort=False).sum())

Pandas - Sum columns with the same start of name

I would like to sum columns with the same start of name.
Example :
import pandas as pd
import numpy as np
df = pd.DataFrame({'product': ['TV', 'COMPUTER', 'SMARTPHONE'],
                   'price_2012': np.random.randint(100, 300, 3),
                   'price_2013': np.random.randint(100, 300, 3),
                   'price_2014': np.random.randint(100, 300, 3),
                   'price_2015': np.random.randint(100, 300, 3),
                   'price_2016': np.random.randint(100, 300, 3)})
For this example I want to create a new column price_2012_2016 equal to the sum of the prices from 2012 to 2016, without listing all the columns.
PS: In SAS I do it like this: price_2012_2016=sum(of prix_2012-prix_2016);
Cordially,
Laurent A.
You could simply do the following:
df['price_2012_2016'] = df[[col for col in df.columns if col.startswith('price_')]].sum(axis=1)
This takes the sum of only the columns that start with "price_" within the df DataFrame and saves the result as the price_2012_2016 column. The axis=1 parameter makes the sum run across the columns within each row rather than down the rows.
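An equivalent sketch using filter(regex=...), which selects the columns whose names start with "price_" without listing them (two price columns stand in for the full set):
import numpy as np
import pandas as pd

df = pd.DataFrame({'product': ['TV', 'COMPUTER', 'SMARTPHONE'],
                   'price_2012': np.random.randint(100, 300, 3),
                   'price_2013': np.random.randint(100, 300, 3)})
# ^price_ anchors the match at the start of each column name.
df['price_2012_2016'] = df.filter(regex=r'^price_').sum(axis=1)
print(df)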
