Pandas - Sum columns with the same start of name - python

I would like to sum columns with the same start of name.
Example :
import pandas as pd
import numpy as np
df=pd.DataFrame({'product':['TV','COMPUTER','SMARTPHONE'],
'price_2012':np.random.randint(100,300,3),
'price_2013':np.random.randint(100,300,3),
'price_2014':np.random.randint(100,300,3),
'price_2015':np.random.randint(100,300,3),
'price_2016':np.random.randint(100,300,3)})
For this example I want to create a new column price_2012_2016 equal to the sum of the prices from 2012 to 2016, without listing all the columns.
PS: In SAS I do it like this: price_2012_2016=sum(of prix_2012-prix_2016);
Cordially,
Laurent A.

You could simply do the following:
df['price_2012_2016'] = df[[col for col in df.columns if col.startswith('price_')]].sum(axis=1)
This takes the sum of only the columns that start with "price_" within the df DataFrame and saves the result as the price_2012_2016 column. The axis=1 argument makes the sum run across the columns for each row, rather than down each column.
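An equivalent approach, if you would rather not write the list comprehension yourself, is DataFrame.filter; a small sketch reusing the df defined in the question:
# filter(like=...) keeps only the columns whose name contains the substring,
# so this sums every price_* column row by row
df['price_2012_2016'] = df.filter(like='price_').sum(axis=1)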

Related

Creating a subset from a dataframe based on a condition from another array

I have a numeric np array which I want to use as a condition/filter over column number 4 of a dataframe (df) to extract a subset of the dataframe (sale_data_sub). However, I am getting an empty sale_data_sub (just the column names and no rows) as the result of this code:
sale_data_sub = df.loc[df[4].isin(sale_condition_arr)].values
sale_condition_arr is a numpy array
df is the original dataframe with 100 columns
sale_data_subset is the desired sub_dataframe
Sorry that I didn't include a working sample.
The issue is that your df dataframe doesn't have headers assigned.
Try:
# give your dataframe a header:
df = df.set_axis([str(i) for i in list(range(len(df.columns)))], axis='columns')
# then proceed to your usual work with df:
sale_data_sub = df.loc[df["4"].isin(sale_condition_arr)].values  # be careful: it's df["4"], not df[4]
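For reference, a minimal self-contained sketch of the same idea on toy data (the shape and values here are made up, not the asker's 100-column file):
import numpy as np
import pandas as pd

# toy stand-in for the asker's dataframe, created without headers
df = pd.DataFrame(np.arange(20).reshape(4, 5))
sale_condition_arr = np.array([4, 19])

# give the dataframe string headers, as suggested above
df = df.set_axis([str(i) for i in range(len(df.columns))], axis='columns')

# keep only the rows whose value in column "4" appears in the array
sale_data_sub = df.loc[df["4"].isin(sale_condition_arr)].values
print(sale_data_sub)  # two rows match: the ones ending in 4 and 19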

Python how to filter a csv based on a column value and get the row count

I want to do data inspection and print the count of rows that match a certain value in one of the columns. Below is my code:
import numpy as np
import pandas as pd
data = pd.read_csv("census.csv")
The census.csv has a column "income" which has 3 values: '<=50K', '=50K' and '>50K',
and I want to print the number of rows that have the income value '<=50K'.
I was trying something like this:
count = data['income']='<=50K'
That does not work though.
Sum the Boolean selection:
(data['income'].eq('<=50K')).sum()
The key is to learn how to filter pandas rows.
Quick answer:
import pandas as pd
data = pd.read_csv("census.csv")
df2 = data[data['income']=='<=50K']
print(df2)
print(len(df2))
Slightly longer answer:
import pandas as pd
data = pd.read_csv("census.csv")
filter = data['income']=='<=50K'
print(filter) # notice the boolean list based on filter criteria
df2 = data[filter] # next we use that boolean list to filter data
print(df2)
print(len(df2))
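If you also want the counts for every income value at once (not just '<=50K'), value_counts is a handy complement; a short sketch on the same file:
import pandas as pd

data = pd.read_csv("census.csv")

# number of rows per distinct income value
print(data['income'].value_counts())

# count for a single value, via a summed boolean mask
print((data['income'] == '<=50K').sum())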

How can I calculate a percentage using groupby with pandas

I have 2 questions. First, I have this dataframe:
data = {'Name': ['A', 'B', 'C', 'A', 'D', 'E', 'A', 'C', 'A', 'A', 'A'],
        'Family': ['B1', 'B', 'B', 'B3', 'B', 'B', 'B', 'B1', 'B', 'B3', 'B'],
        'Region': ['North', 'South', 'East', 'West', 'South', 'East', 'West', 'North', 'East', 'West', 'South'],
        'Cod': ['1', '2', '2', '1', '5', '1', '1', '1', '2', '1', '3'],
        'Customer number': ['A111', 'A223', 'A555', 'A333', 'A333', 'A444', 'A222', 'A111', 'A222', 'A333', 'A221'],
        'Sales': [100, 134, 53, 34, 244, 789, 213, 431, 0, 55, 23]}
I would like to have a column which gives the percentage of sales within a groupby of the other columns (the desired result was shown as an image in the original question).
Second, if the percentage is 0% (like in the first row), I would like to replace it based on a criterion, for example: if A222 is 0%, use the result of A221.
Well, an answer for question one could be:
# step 1: import pandas
import pandas as pd
df = pd.DataFrame(data)
# step 2: print the dataframe
df
# step 3: calculate the percentage
df['percentage of sales'] = (df['Sales'] / df['Sales'].sum())*100
# step 4: join this percentage column to the main dataframe
pd.concat([df, df[['percentage of sales']]], axis=1, sort=False)
Answer for question 2: it depends on the condition you want to apply. Assuming the logic described, that is one way, but an easy way to handle both question 1 and question 2 is to convert the dataframe into a numpy array, do the operation there, and then bring it back into a dataframe.
Check this answer:
Add column for percentage of total to Pandas dataframe
# converting the percentage column to a numpy array
npprices = df['percentage of sales'].to_numpy()
npprices
# loop through the rows and replace any zero with the value from the previous row,
# ASSUMING the previous row is not zero
for i in range(len(npprices)):
    if npprices[i] == 0:
        npprices[i] = npprices[i-1]
# converting back into a dataframe
percentage1 = pd.DataFrame({'percentage2': npprices})
# then join this percentage column to the dataframe
df2i = pd.concat([df, percentage1[['percentage2']]], axis=1, sort=False)
NOTE: I added it twice by mistake. Of course, there could be other, easier approaches; I hope this helps.
Some answers I used:
Creating a Pandas DataFrame from a Numpy array: How do I specify the index column and column headers?
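As a pandas-only alternative to the numpy loop above, one sketch (on a hypothetical percentage column, not the exact answer code) is to mask the zeros and forward-fill from the previous row:
import pandas as pd

# hypothetical column with zeros that should inherit the previous row's value
df_pct = pd.DataFrame({'percentage': [12.5, 0.0, 30.0, 0.0]})

# mask() turns the zeros into NaN, ffill() then copies down the value from the row above
df_pct['percentage2'] = df_pct['percentage'].mask(df_pct['percentage'] == 0).ffill()
print(df_pct)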
I think this is what you want:
import pandas as pd
df = pd.DataFrame(data)
granular_sum_df = df.groupby(['Name', 'Family', 'Region', 'Cod', 'Customer number'])['Sales'].sum().reset_index()
family_sum_df = df.groupby(['Name', 'Family'])['Sales'].sum().reset_index()
final_df = granular_sum_df.merge(family_sum_df, on=['Name', 'Family'])
final_df['Pct'] = final_df['Sales_x']/final_df['Sales_y']
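If the goal is each row's share of its own Name/Family group total (rather than the two-step merge above), a more compact sketch of the same idea uses groupby().transform('sum'), reusing the data dict from the question:
import pandas as pd

df = pd.DataFrame(data)  # the data dict defined in the question
# each row's sales divided by the total sales of its Name/Family group
df['Pct'] = df['Sales'] / df.groupby(['Name', 'Family'])['Sales'].transform('sum')
print(df)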

Is there any function to assign values in a Pandas Dataframe

I am trying to assign values to some rows of a pandas dataframe. Is there any function to do this?
For a whole column:
df = df.assign(column=value)
... where column is the name of the column.
For a specific column of a specific row:
df.at[row, column] = value
... where row is the index of the row, and column is the name of the column.
The latter changes the dataframe "in place".
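A minimal sketch of both on a made-up dataframe:
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# whole column: assign returns a new dataframe with column C set for every row
df = df.assign(C=0)

# single cell: set the value at row label 1, column 'B', in place
df.at[1, 'B'] = 99
print(df)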
There is a good tutorial here.
Basically, try this:
import pandas as pd
import numpy as np
# Creating a dataframe
# Setting the seed value to re-generate the result.
np.random.seed(25)
df = pd.DataFrame(np.random.rand(10, 3), columns =['A', 'B', 'C'])
# np.random.rand(10, 3) has generated a
# random 2-Dimensional array of shape 10 * 3
# which is then converted to a dataframe
df
You will get a 10 x 3 dataframe of random values (reproducible thanks to the seed).

Summing columns in dataframe in python

I am trying to add 3 columns' values to come up with a new column as total value. Code is below:
df3[["Bronze","Gold","Silver"]] =
df3[["Bronze","Gold","Silver"]].astype("int")
df3["Total Medal"]= df3.iloc[:, -3:0].sum(axis=1)
df3[["Total Medal"]].astype("int")
I know that the Bronze, Gold, and Silver columns hold only 1 and 0 values and that they are the last 3 columns in the dataframe. Their original type was "uint8", so I changed them to "int".
After these lines, the Total Medal column comes out as type "float" (instead of int) and contains only the value 0. How can I properly add these columns?
To add the values of 3 columns into a new column, simply do
df['Total Medal'] = df.sum(axis=1)
This can e.g. be done using assign:
import numpy as np
import pandas as pd
#create data frame
data = {"gold":np.random.choice([0,1],size=10),"silver":np.random.choice([0,1],size=10), "bronze":np.random.choice([0,1],size=10)}
df = pd.DataFrame(data)
#calculate new column and add to dataframe
df = df.assign(mysum=df.gold+df.silver+df.bronze)
Edit: df["mysum"] = df.sum(axis=1) only works if your dataframe only has the three relevant columns, because it sums over all columns (and not only over the three you want).
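For completeness, the empty total in the question most likely comes from the slice iloc[:, -3:0], which selects no columns (the stop of 0 comes before the start of -3). A sketch of two possible fixes, on a toy stand-in for df3:
import pandas as pd

# toy stand-in for the asker's df3; only the three medal columns matter here
df3 = pd.DataFrame({'Bronze': [1, 0], 'Gold': [0, 1], 'Silver': [1, 1]})

# -3: means "from the third-from-last column to the end"
df3["Total Medal"] = df3.iloc[:, -3:].sum(axis=1)

# or, more robustly, select the medal columns by name
df3["Total Medal"] = df3[["Bronze", "Gold", "Silver"]].sum(axis=1)
print(df3)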
