Sum columns in pandas only if they exist, else 0 - python

I have a dataframe like

GULOSS  GRLoss
1       1
2       2
3       3

I want to sum the columns in such a way that I get

GULOSS  GRLoss  Post
6       6       0

where Post does not exist in the initial dataframe but is required in the final one, with the condition that if a column does not exist, its sum is reported as 0.

Assuming I understand your question correctly, here's how I would do it:

if 'Post' not in data.columns:
    data['Post'] = 0
datasum = data.sum()
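A minimal runnable sketch of the same idea (the dataframe and the list of required columns here are assumptions for illustration); reindex adds any missing columns and fills them with 0 in a single step:

import pandas as pd

# Hypothetical data from the question; 'Post' is missing on purpose.
data = pd.DataFrame({'GULOSS': [1, 2, 3], 'GRLoss': [1, 2, 3]})

# Guarantee every required column exists before summing;
# reindex fills the missing ones with 0.
required = ['GULOSS', 'GRLoss', 'Post']
datasum = data.reindex(columns=required, fill_value=0).sum()
print(datasum)
# GULOSS    6
# GRLoss    6
# Post      0
# dtype: int64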

Related

Filter dataframe based on matching values from two columns

I have a dataframe like as shown below
cdf = pd.DataFrame({'Id': [1, 2, 3, 4, 5],
                    'Label': [1, 2, 3, 0, 0]})
I would like to filter the dataframe based on the below criteria
cdf['Id']==cdf['Label'] # first 3 rows are matching for both columns in cdf
I tried the below
flag = np.where[cdf['Id'].eq(cdf['Label'])==True,1,0]
final_df = cdf[cdf['flag']==1]
but I got the below error
TypeError: 'function' object is not subscriptable
I expect my output to be as shown below
Id Label
0 1 1
1 2 2
2 3 3
I think you're overthinking this. Just compare the columns:
>>> cdf[cdf['Id'] == cdf['Label']]
Id Label
0 1 1
1 2 2
2 3 3
Your particular error, though, comes from the fact that you're calling np.where with square brackets (np.where[...]) rather than parentheses; indexing a function object with [] is what raises TypeError: 'function' object is not subscriptable. Use np.where(...) instead, but the boolean-mask solution above is about as fast as it gets ;)
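For reference, a corrected sketch of the original np.where approach; note that the original snippet also filtered on cdf['flag'] without ever assigning that column, so the assignment is added here:

import numpy as np
import pandas as pd

cdf = pd.DataFrame({'Id': [1, 2, 3, 4, 5],
                    'Label': [1, 2, 3, 0, 0]})

# Parentheses, not brackets: np.where is a function call.
cdf['flag'] = np.where(cdf['Id'].eq(cdf['Label']), 1, 0)
final_df = cdf[cdf['flag'] == 1]
print(final_df)
#    Id  Label  flag
# 0   1      1     1
# 1   2      2     1
# 2   3      3     1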
You can also use DataFrame.query:
cdf.query('Id == Label')
Id Label
0 1 1
1 2 2
2 3 3

How can I create a column target based on two different columns?

I have the following DataFrame with the columns low_scarcity and high_scarcity (a value is either in high or in low scarcity):

id  low_scarcity       high_scarcity
0   When I was five..
1                      I worked a lot...
2                      I went to parties...
3   1 week ago
4   2 months ago
5                      another story..
I want to create another column, 'target', such that when there's an entry in the low_scarcity column the value is 0, and when there's an entry in the high_scarcity column the value is 1. Just like this:

id  low_scarcity       high_scarcity         target
0   When I was five..                        0
1                      I worked a lot...     1
2                      I went to parties...  1
3   1 week ago                               0
4   2 months ago                             0
5                      another story..       1
I tried first replacing the entries with no value with 0 and then creating a boolean condition; however, I can't use .replace('', 0) because the cells that look empty don't actually appear as empty values.
Supposing your dataframe is called df and that a value is either in high or in low scarcity, the following line of code does it:

import numpy as np
df['target'] = 1 * np.array(df['high_scarcity'] != "")

in which the 1 * performs an integer conversion of the boolean values.
If that is not the case, then a more complex approach should be taken:

# Use an object array so the assigned integers are not coerced to strings.
res = np.full(df.shape[0], "", dtype=object)
res[df['high_scarcity'] != ""] = 1
res[df['low_scarcity'] != ""] = 0
df['target'] = res
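A shorter equivalent sketch, assuming the empty cells really are empty strings (the sample data below is reconstructed from the question):

import numpy as np
import pandas as pd

df = pd.DataFrame({'low_scarcity': ['When I was five..', '', '', '1 week ago', '2 months ago', ''],
                   'high_scarcity': ['', 'I worked a lot...', 'I went to parties...', '', '', 'another story..']})

# 1 where high_scarcity has text, 0 otherwise.
df['target'] = np.where(df['high_scarcity'].ne(''), 1, 0)
print(df)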

Pandas save counts of multiple columns in single dataframe

I have a dataframe with 3 columns now which appears like this
Model         IsJapanese  IsGerman
BenzC         0           1
BensGla       0           1
HondaAccord   1           0
HondaOdyssey  1           0
ToyotaCamry   1           0
I want to create a new dataframe that has TotalJapanese and TotalGerman as two columns in that same dataframe.
I am able to achieve this by creating 2 different dataframes, but I am wondering how to get both counts into a single dataframe.
Please suggest, thank you!
Edit: adding another, similar dataframe to this (sorry, not sure whether that's allowed, but trying).
Second dataset: I am trying to save multiple counts in a single dataframe, based on repetition of the data.
Here is my sample dataset:
Store       Address    IsLA  IsGA
Albertsons  Cross St   1     0
Safeway     LeoSt      0     1
Albertsons  Main St    0     1
RiteAid     Culver St  1     0
My aim is to prepare a new dataset with multiple counts per store. The result should be like this:

Store       TotalStores  TotalLA  TotalGA
Albertsons  2            1        1
Safeway     1            0        1
RiteAid     1            1        0

Is it possible to achieve this in a single dataframe? Thanks!
One way would be to compute the sums of Japanese cars and German cars, and manually create a dataframe from them:

j, g = sum(df['IsJapanese']), sum(df['IsGerman'])
total_df = pd.DataFrame({'TotalJapanese': j,
                         'TotalGerman': g}, index=['Totals'])
print(total_df)

        TotalJapanese  TotalGerman
Totals              3            2
Another way would be to transpose (T) your dataframe, sum(axis=1), and transpose back:

total_df_v2 = pd.DataFrame(df.set_index('Model').T.sum(axis=1)).T
print(total_df_v2)

   IsJapanese  IsGerman
0           3         2
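A simpler sketch that should give the same totals (using the question's data, reconstructed here) sums the indicator columns directly and turns the resulting Series into a one-row frame:

import pandas as pd

# Hypothetical data from the question.
df = pd.DataFrame({'Model': ['BenzC', 'BensGla', 'HondaAccord', 'HondaOdyssey', 'ToyotaCamry'],
                   'IsJapanese': [0, 0, 1, 1, 1],
                   'IsGerman': [1, 1, 0, 0, 0]})

# Sum each indicator column, then turn the Series into a one-row DataFrame.
totals = df[['IsJapanese', 'IsGerman']].sum().to_frame().T
totals.columns = ['TotalJapanese', 'TotalGerman']
print(totals)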
To answer your 2nd question, you can use DataFrameGroupBy.agg on your 'Store' column: count on Address and sum on the other two columns. Then you can rename() your columns if needed:

resulting_df = df.groupby('Store').agg({'Address': 'count',
                                        'IsLA': 'sum',
                                        'IsGA': 'sum'}) \
                 .rename({'Address': 'TotalStores',
                          'IsLA': 'TotalLA',
                          'IsGA': 'TotalGA'}, axis=1)

Prints:

            TotalStores  TotalLA  TotalGA
Store
Albertsons            2        1        1
RiteAid               1        1        0
Safeway               1        0        1
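As a variant, pandas' named aggregation (available since pandas 0.25) produces the renamed columns in one step; a sketch under the same data assumptions:

import pandas as pd

# Hypothetical data from the question.
df = pd.DataFrame({'Store': ['Albertsons', 'Safeway', 'Albertsons', 'RiteAid'],
                   'Address': ['Cross St', 'LeoSt', 'Main St', 'Culver St'],
                   'IsLA': [1, 0, 0, 1],
                   'IsGA': [0, 1, 1, 0]})

# Named aggregation: new_column=(source_column, aggregation).
resulting_df = df.groupby('Store').agg(TotalStores=('Address', 'count'),
                                       TotalLA=('IsLA', 'sum'),
                                       TotalGA=('IsGA', 'sum'))
print(resulting_df)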

Pandas Dataframe Filter Multiple Conditions

I am looking to filter a dataframe to only include values that are equal to a certain value, or greater than another value.
Example dataframe:
0 1 2
0 0 1 23
1 0 2 43
2 1 3 54
3 2 3 77
From here, I want to pull all values from column 0, where column 2 is either equal to 23, or greater than 50 (so it should return 0, 1 and 2). Here is the code I have so far:
df = df[(df[2] == 23) & (df[2] > 50)]

This returns nothing. However, when I split these apart and run them individually (df = df[df[2] == 23] and df = df[df[2] > 50]), I do get results back. Does anyone have any insight into how to get this to work?
As you said, it's or (|), not and (&):

df = df[(df[2] == 23) | (df[2] > 50)]
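A runnable version of the fix with the question's data as a sketch; note the parentheses around each comparison are required, because | binds more tightly than the comparison operators:

import pandas as pd

# Data from the question; the columns are literally named 0, 1, 2.
df = pd.DataFrame({0: [0, 0, 1, 2],
                   1: [1, 2, 3, 3],
                   2: [23, 43, 54, 77]})

# Keep rows where column 2 equals 23 OR exceeds 50,
# then pull the corresponding values from column 0.
mask = (df[2] == 23) | (df[2] > 50)
print(df.loc[mask, 0].tolist())  # [0, 1, 2]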

How can I keep all columns in a dataframe, plus add groupby, and sum?

I have a data frame with 5 fields. I want to copy 2 fields from this into a new data frame. This works fine:
df1 = df[['task_id', 'duration']]
Now in this df1, when I try to group by task_id and sum duration, the task_id field drops off.
[Screenshots omitted: "Before (what I have now)" and "After (what I'm trying to achieve)".]
So, for instance, I'm trying this:
df1['total'] = df1.groupby(['task_id'])['duration'].sum()
The result is:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
I don't know why I can't just sum the values in one column, grouped by the unique IDs in another column. Basically, all I want to do is preserve the original two columns (['task_id', 'duration']), sum duration, and calculate each task's percentage of the total duration in a new column named pct. This seems like a very simple thing, but I can't get anything working. How can I get this straightened out?
The following will retain both columns and count the occurrences of each (task_id, duration) pair:

df[['task_id', 'duration']].groupby(['task_id', 'duration']).size().reset_index(name='counts')
Setup:

import numpy as np
import pandas as pd

X = np.random.choice([0, 1, 2], 20)
Y = np.random.uniform(2, 10, 20)
df = pd.DataFrame({'task_id': X, 'duration': Y})

Calculate pct:

df = pd.merge(df, df.groupby('task_id').agg('sum').reset_index(), on='task_id')
df['pct'] = df['duration_x'].divide(df['duration_y']) * 100
df = df.drop('duration_y', axis=1)  # drops the summed duration; remove this line if you want to see it
Result:
duration_x task_id pct
0 8.751517 0 58.017921
1 6.332645 0 41.982079
2 8.828693 1 9.865355
3 2.611285 1 2.917901
4 5.806709 1 6.488531
5 8.045490 1 8.990189
6 6.285593 1 7.023645
7 7.932952 1 8.864436
8 7.440938 1 8.314650
9 7.272948 1 8.126935
10 9.162262 1 10.238092
11 7.834692 1 8.754639
12 7.989057 1 8.927129
13 3.795571 1 4.241246
14 6.485703 1 7.247252
15 5.858985 2 21.396850
16 9.024650 2 32.957771
17 3.885288 2 14.188966
18 5.794491 2 21.161322
19 2.819049 2 10.295091
Disclaimer: all data is randomly generated in the setup; however, the calculations are straightforward and should be correct for any case.
I finally got everything working in the following way:

# group by task_id and sum durations
df1 = df1.groupby('task_id', as_index=False).agg({'duration': 'sum'})
# find each task_id's duration as a relative percentage of the whole
df1['pct'] = df1['duration'] / df1['duration'].sum()
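For what the question originally asked (keeping every original row while attaching the group total), a transform-based sketch avoids both the dropped column and the copy warning; the data here is hypothetical:

import pandas as pd

# Hypothetical data shaped like the question's.
df = pd.DataFrame({'task_id': [1, 1, 2, 2, 2],
                   'duration': [4.0, 6.0, 3.0, 5.0, 2.0]})

# .copy() makes df1 an independent frame, so assigning new columns
# no longer triggers SettingWithCopyWarning.
df1 = df[['task_id', 'duration']].copy()

# transform('sum') broadcasts each group's total back onto its rows,
# so every original row (and the task_id column) is preserved.
df1['total'] = df1.groupby('task_id')['duration'].transform('sum')
df1['pct'] = df1['duration'] / df1['duration'].sum() * 100
print(df1)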
