Create several columns with default values in Salesforce - python

I have a dataframe of 1000 rows and 10 columns.
I want to add 20 columns, each holding one single repeated value (what I call a default value).
Therefore, my final df would be 1000 rows with 30 columns.
I know that I can do it 20 times by writing:
df['column 11'] = 'default value'
df['column 12'] = 'default value 2'
But I would like to do it in a cleaner way.
I have a dict of the form {'column label': 'default value'}.
How can I do so?
I've tried DataFrame.insert and pd.concat but couldn't find my way through.
thanks
regards,
Eric

One way to do so:
df_len = len(df)
new_df = pd.DataFrame({col: [val] * df_len for col, val in your_dict.items()})
df = pd.concat([df, new_df], axis=1)
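pd.concat aligns on the index, so this only works when new_df shares df's index. A variant that handles any index and skips building the lists is to pass the scalars straight to the DataFrame constructor (a sketch, assuming your dict is named your_dict):
new_df = pd.DataFrame(your_dict, index=df.index)  # scalars broadcast over the index
df = pd.concat([df, new_df], axis=1)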

Generally, if the dictionary keys (the new column names) may contain spaces, use the DataFrame constructor with DataFrame.join:
df = pd.DataFrame({'a': range(5)})
print(df)
   a
0  0
1  1
2  2
3  3
4  4
d = {'A 11': 's', 'A 12': 'c'}
df = df.join(pd.DataFrame(d, index=df.index))
print(df)
   a A 11 A 12
0  0    s    c
1  1    s    c
2  2    s    c
3  3    s    c
4  4    s    c
If the column names contain no spaces and are valid Python identifiers (e.g. they don't start with a digit), it is possible to use DataFrame.assign:
d = {'A11': 's', 'A12': 'c'}
df = df.assign(**d)
print(df)
   a A11 A12
0  0   s   c
1  1   s   c
2  2   s   c
3  3   s   c
4  4   s   c
Another solution is to loop over the dictionary and assign:
for k, v in d.items():
    df[k] = v
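For example, with the OP's mapping (values hypothetical):
defaults = {'column 11': 'default value', 'column 12': 'default value 2'}
for k, v in defaults.items():
    df[k] = v  # a scalar assigned to a new column is broadcast to every row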

Related

Python : Remove all data in a column of a dataframe and keep the last value in the first row

Let's say that I have a simple DataFrame.
import pandas as pd
data1 = [12, 34, 'fsdf', 678, '', '', 'dfs', '', '']
df1 = pd.DataFrame(data1, columns=['Data'])
print(df1)
   Data
0    12
1    34
2  fsdf
3   678
4
5
6   dfs
7
8
I want to delete all the data except the last non-empty value found in the column, which I want to keep in the first row. The column can have thousands of rows. So I would like this result:
  Data
0  dfs
1
2
3
4
5
6
7
8
And I have to keep the shape of this dataframe, so no rows are removed.
What are the simplest functions to do that efficiently?
Thank you
Get the index of the last non-empty string value and assign that value to the first row of the column:
s = df1.loc[df1['Data'].iloc[::-1].ne('').idxmax(), 'Data']
print(s)
dfs
df1['Data'] = ''
df1.loc[0, 'Data'] = s
print(df1)
  Data
0  dfs
1
2
3
4
5
6
7
8
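To unpack the chained expression, a step-by-step sketch (run on the original df1, before it is blanked):
rev = df1['Data'].iloc[::-1]   # reverse the column so the last row comes first
mask = rev.ne('')              # True where the value is not an empty string
idx = mask.idxmax()            # label of the first True, i.e. the last non-empty row (6)
s = df1.loc[idx, 'Data']       # 'dfs'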
If empty strings are missing values:
import numpy as np

data1 = [12, 34, 'fsdf', 678, np.nan, np.nan, 'dfs', np.nan, np.nan]
df1 = pd.DataFrame(data1, columns=['Data'])
print(df1)
   Data
0    12
1    34
2  fsdf
3   678
4   NaN
5   NaN
6   dfs
7   NaN
8   NaN
s = df1.loc[df1['Data'].iloc[::-1].notna().idxmax(), 'Data']
print(s)
dfs
df1['Data'] = ''
df1.loc[0, 'Data'] = s
print(df1)
  Data
0  dfs
1
2
3
4
5
6
7
8
A simple pandas condition check like this can help:
df1['Data'] = [df1.loc[df1['Data'].ne(""), "Data"].iloc[-1]] + [''] * (len(df1) - 1)
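Broken into steps (a sketch):
last = df1.loc[df1['Data'].ne(''), 'Data'].iloc[-1]  # last non-empty value ('dfs')
df1['Data'] = [last] + [''] * (len(df1) - 1)         # put it first, blank the rest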
You can replace '' with NaN using df.replace, then use df.last_valid_index:
val = df1.loc[df1.replace('', np.nan).last_valid_index(), 'Data']
# Below two lines taken from @jezrael's answer
df1.loc[0, 'Data'] = val
df1.loc[1:, 'Data'] = ''
Or you can use np.full with fill_value set to np.nan:
val = df1.loc[df1.replace("", np.nan).last_valid_index(), "Data"]
df1 = pd.DataFrame(np.full(df1.shape, np.nan),
                   index=df1.index,
                   columns=df1.columns)
df1.loc[0, "Data"] = val

Python pandas groupby agg- sum one column while getting the mean of the rest

Looking to group my fields by date and take the mean of all the columns, except a binary column which I want to sum in order to get a count.
I know I can do this by:
newdf=df.groupby('date').agg({'var_a': 'mean', 'var_b': 'mean', 'var_c': 'mean', 'binary_var':'sum'})
But there are about 50 columns (other than the binary one) that I want to average, and I feel there must be a simpler, quicker way than writing 'column title': 'mean' for each of them. I've tried passing a list of column names, but the agg function says a list is an unhashable type.
Thanks!
Something like this might work:
df = pd.DataFrame({'a': ['a', 'a', 'b', 'b', 'b', 'b'],
                   'b': [10, 20, 30, 40, 20, 10],
                   'c': [1, 1, 0, 0, 0, 1],
                   'd': [20, 30, 10, 15, 34, 10]})
df
   a   b  c   d
0  a  10  1  20
1  a  20  1  30
2  b  30  0  10
3  b  40  0  15
4  b  20  0  34
5  b  10  1  10
Assuming c is the binary variable column:
cols = [val for val in df.columns if val not in ('a', 'c')]
temp = pd.concat([df.groupby('a')[cols].mean(), df.groupby('a')['c'].sum()], axis=1).reset_index()
temp
   a     b      d  c
0  a  15.0  25.00  2
1  b  25.0  17.25  1
In general, I would build the agg dict automatically:
sum_cols = ['binary_var']
agg_dict = {col: 'sum' if col in sum_cols else 'mean'
            for col in df.columns if col != 'date'}
df.groupby('date').agg(agg_dict)
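A quick end-to-end demo (sample data hypothetical):
df = pd.DataFrame({'date': ['d1', 'd1', 'd2'],
                   'var_a': [1.0, 3.0, 5.0],
                   'binary_var': [1, 0, 1]})
sum_cols = ['binary_var']
agg_dict = {col: 'sum' if col in sum_cols else 'mean'
            for col in df.columns if col != 'date'}
print(df.groupby('date').agg(agg_dict))
#       var_a  binary_var
# date
# d1      2.0           1
# d2      5.0           1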

count unique values in groups pandas

I have a dataframe like this:
data = {'id': [1, 1, 1, 2, 2, 3],
        'value': ['a', 'a', 'a', 'b', 'b', 'c'],
        'obj_id': [1, 2, 3, 3, 3, 4]}
df = pd.DataFrame(data, columns=['id', 'value', 'obj_id'])
I would like to get the unique counts of obj_id grouped by id and value:
1 a 3
2 b 1
3 c 1
But when I do:
result = df.groupby(['id', 'value'])['obj_id'].nunique().reset_index(name='obj_counts')
the result I got was:
1 a 2
1 a 1
2 b 1
3 c 1
so the first two rows, which have the same id and value, don't group together.
How can I fix this? Many thanks!
Your solution works fine for me with the sample data.
As @YOBEN_S mentioned in the comments, the problem is possibly trailing whitespace; in that case the solution is to add Series.str.strip:
data = {'id': [1, 1, 1, 2, 2, 3],
        'value': ['a ', 'a', 'a', 'b', 'b', 'c'],
        'obj_id': [1, 2, 3, 3, 3, 4]}
df = pd.DataFrame(data, columns=['id', 'value', 'obj_id'])
df['value'] = df['value'].str.strip()
df = df.groupby(['id', 'value'])['obj_id'].nunique().reset_index(name='obj_counts')
print(df)
   id value  obj_counts
0   1     a           3
1   2     b           1
2   3     c           1
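If several string columns may carry stray whitespace, you could strip them all before grouping (a sketch, not from the original answer):
obj_cols = df.select_dtypes(include='object').columns
df[obj_cols] = df[obj_cols].apply(lambda s: s.str.strip())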

Renaming columns on slice of dataframe not performing as expected

I was trying to clean up column names in a dataframe, but only for a part of the columns.
Somehow it doesn't work when trying to replace column names on a slice of the dataframe. Why is that?
Let's say we have the following dataframe:
Note, at the bottom is copy-able code to reproduce the data:
   Value ColAfjkj ColBhuqwa ColCouiqw
0      1        a         e         i
1      2        b         f         j
2      3        c         g         k
3      4        d         h         l
I want to clean up the column names (expected output):
   Value ColA ColB ColC
0      1    a    e    i
1      2    b    f    j
2      3    c    g    k
3      4    d    h    l
Approach 1:
I can get the clean column names like this:
df.iloc[:, 1:].columns.str[:4]
Index(['ColA', 'ColB', 'ColC'], dtype='object')
Or
Approach 2:
s = df.iloc[:, 1:].columns
[col[:4] for col in s]
['ColA', 'ColB', 'ColC']
But when I try to overwrite the column names, nothing happens:
df.iloc[:, 1:].columns = df.iloc[:, 1:].columns.str[:4]
   Value ColAfjkj ColBhuqwa ColCouiqw
0      1        a         e         i
1      2        b         f         j
2      3        c         g         k
3      4        d         h         l
Same for the second approach:
s = df.iloc[:, 1:].columns
cols = [col[:4] for col in s]
df.iloc[:, 1:].columns = cols
   Value ColAfjkj ColBhuqwa ColCouiqw
0      1        a         e         i
1      2        b         f         j
2      3        c         g         k
3      4        d         h         l
This does work, but you have to manually prepend the name of the first column, which is not ideal:
df.columns = ['Value'] + df.iloc[:, 1:].columns.str[:4].tolist()
   Value ColA ColB ColC
0      1    a    e    i
1      2    b    f    j
2      3    c    g    k
3      4    d    h    l
Is there an easier way to achieve this? Am I missing something?
Dataframe for reproduction:
df = pd.DataFrame({'Value': [1, 2, 3, 4],
                   'ColAfjkj': ['a', 'b', 'c', 'd'],
                   'ColBhuqwa': ['e', 'f', 'g', 'h'],
                   'ColCouiqw': ['i', 'j', 'k', 'l']})
This is because pandas' index is immutable. If you check the documentation for class pandas.Index, you'll see that it is defined as:
Immutable ndarray implementing an ordered, sliceable set
So in order to modify it you'll have to create a new list of column names, for instance with:
df.columns = [df.columns[0]] + list(df.iloc[:, 1:].columns.str[:4])
Another option is to use rename with a dictionary containing the columns to replace:
df = df.rename(columns=dict(zip(df.columns[1:], df.columns[1:].str[:4])))
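rename also accepts a callable, which avoids building the dictionary by hand (a sketch, assuming the first column should stay untouched):
df = df.rename(columns=lambda c: c if c == 'Value' else c[:4])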
To overwrite column names you can use the .rename() method. It will look like:
df.rename(columns={'ColAfjkj': 'ColA',
                   'ColBhuqwa': 'ColB',
                   'ColCouiqw': 'ColC'},
          inplace=True)
More info regarding rename is in the docs: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rename.html
I had this problem as well and came up with this solution:
First, create a mask of the columns you want to rename:
mask = df.iloc[:, 1:4].columns
Then, use a list comprehension with a conditional to rename just those columns:
df.columns = [x if x not in mask else x[:4] for x in df.columns]
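A quick check on the reproduction frame:
print(df.columns.tolist())
# ['Value', 'ColA', 'ColB', 'ColC']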

Preserve the non-numerical columns when doing pandas.DataFrame.groupby().sum()

Can I preserve the non-numerical columns (keeping the first value that appears in each group) when doing pandas.DataFrame.groupby().sum()?
For example, I have a DataFrame like this:
df = pd.DataFrame({'A': ['aa1', 'aa2', 'aa1', 'aa2'],
                   'B': ['bb1', 'bbb1', 'bb2', 'bbb2'],
                   'C': ['cc1', 'ccc2', 'ccc3', 'ccc4'],
                   'D': [1, 2, 3, 4],
                   'E': [1, 2, 3, 4]})
>>> df
     A     B     C  D  E
0  aa1   bb1   cc1  1  1
1  aa2  bbb1  ccc2  2  2
2  aa1   bb2  ccc3  3  3
3  aa2  bbb2  ccc4  4  4
>>> df.groupby(["A"]).sum()
     D  E
A
aa1  4  4
aa2  6  6
Following is the result I want to obtain:
        B     C  D  E
A
aa1   bb1   cc1  4  4
aa2  bbb1  ccc2  6  6
Notice that the values of columns B and C are the first associated B and C values of each group key.
Just use 'first':
df.groupby(["A"]).agg({'B': 'first',
                       'C': 'first',
                       'D': 'sum',
                       'E': 'sum'})
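On the question's frame this reproduces the desired output:
        B     C  D  E
A
aa1   bb1   cc1  4  4
aa2  bbb1  ccc2  6  6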
For each key in the groupby-sum dataframe, look up the key in the original dataframe and put the associated value of column B into a new column.
# groupby and sum over columns D and E
df_1 = df.groupby(['A']).sum()
Find the first value in column B associated with each groupby key:
col_b = []
# iterate through the keys and find the first value in df['B'] whose row has that key in column A
for i in df_1.index:
    col_b.append(df['B'][df['A'] == i].iloc[0])
# insert the list of values into the new dataframe
df_1.insert(0, 'B', col_b)
>>> df_1
        B  D  E
A
aa1   bb1  4  4
aa2  bbb1  6  6
(This answer uses a different example frame, with A containing 'foo'/'bar' and B containing 'one'/'two'/'three'.)
Grouping only on column 'A' gives:
df.groupby(['A']).sum()
        C     D
A
bar  1.26  0.88
foo  0.92 -4.19
Grouping on columns 'A' and 'B' gives:
df.groupby(['A','B']).sum()
              C     D
A   B
bar one    1.38 -0.73
    three  0.26  0.80
    two   -0.38  0.81
foo one    1.96 -2.72
    three -0.42 -0.18
    two   -0.62 -1.29
If you want only the rows where column 'B' has 'one' you can do:
d = df.groupby(['A','B'], as_index=False).sum()
d[d.B == 'one'].set_index('A')
       B     C     D
A
bar  one  1.38 -0.73
foo  one  1.96 -2.72
I'm not sure I understand, but is this what you want to do?
Note: I increased the output precision just to get the same numbers shown in the post.
d = df.groupby('A').sum()
d['B'] = 'one'
d.sort_index(axis=1)
       B         C         D
A
bar  one  1.259069  0.876959
foo  one  0.921510 -4.193397
If you want to put the first sorted value of column 'B' instead, you can use:
d['B'] = df.B.sort_values().iloc[0]
So here I replaced 'one', 'two', 'three' with 'a', 'b', 'c' to see if this is what you are trying to do, and used the insert() method as suggested in the other post:
df
     A  B         C         D
0  foo  a  0.638362 -0.931817
1  bar  a  1.380706 -0.733307
2  foo  b -0.324514  0.203515
3  bar  c  0.258534  0.803298
4  foo  b -0.299485 -1.495979
5  bar  b -0.380171  0.806968
6  foo  a  1.324810 -1.792996
7  foo  c -0.417663 -0.176120
d = df.groupby('A').sum()
d.insert(0, 'B', df.B.sort_values().iloc[0])
d
     B         C         D
A
bar  a  1.259069  0.876959
foo  a  0.921510 -4.193397
All these answers seem pretty wordy; the Pandas doc [1] also doesn't seem clear on this point, despite giving an SQL example on its first page:
SELECT Column1, Column2, mean(Column3), sum(Column4)
FROM SomeTable
GROUP BY Column1, Column2
which keeps Column1 and Column2 as the group keys, while Pandas does not. As the OP points out, the non-numeric columns are simply dropped. My solution is not that pretty either, but perhaps more control is Pandas' design goal:
agg_d = {c: 'sum' if pd.api.types.is_numeric_dtype(df[c]) else 'first'
         for c in df.columns if c != 'A'}
df = df.groupby('A').agg(agg_d)
This maintains all non-numeric columns, like SQL does. It is basically the same as Phillip's answer above, but without needing to explicitly enumerate the columns.
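For the question's frame, the dict works out to (illustration):
# agg_d == {'B': 'first', 'C': 'first', 'D': 'sum', 'E': 'sum'}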
NOTES
[1] https://pandas.pydata.org/pandas-docs/stable/groupby.html
