Multiply column values by a scalar based on conditions in a DataFrame - Python

I want to multiply column values by a specific scalar based on the name of the column:
if the column name is "Math", all values in the "Math" column should be multiplied by 5;
if the column name is "Physique", values in that column should be multiplied by 4;
if the column name is "Bio", values in that column should be multiplied by 3;
all remaining columns should be multiplied by 2.
What I have (the input and expected output were shown as screenshots):
listm = ['Math', 'Physique', 'Bio']
def note_coef(row):
    for m in listm:
        if 'Math' in listm:
            result = df['Math'] * 5
            return result
df2 = df.apply(note_coef)
df2
Note that I stopped with only one if in order to test my code, but the outcome is not what I expected. I am quite new to programming, and to this site as well.

I think the most elegant solution is to define a dictionary (or a pandas.Series) with the multiplying factor for each column of your DataFrame (factors). Then you can multiply all the columns with the corresponding factor simply using df *= factors.
The multiplication is done via column axis alignment, i.e. by aligning the df.columns with the dictionary keys.
For instance, given the following DataFrame
import pandas as pd
import numpy as np
df = pd.DataFrame(np.ones(shape=(4, 5)), columns=['Math', 'Physique', 'Bio', 'Algo', 'Archi'])
>>> df
Math Physique Bio Algo Archi
0 1.0 1.0 1.0 1.0 1.0
1 1.0 1.0 1.0 1.0 1.0
2 1.0 1.0 1.0 1.0 1.0
3 1.0 1.0 1.0 1.0 1.0
You can do:
factors = {'Math': 5, 'Physique': 4, 'Bio': 3}
default_factor = 2
factors.update({col: default_factor for col in df.columns if col not in factors})
df *= factors
print(df)
Output:
Math Physique Bio Algo Archi
0 5.0 4.0 3.0 2.0 2.0
1 5.0 4.0 3.0 2.0 2.0
2 5.0 4.0 3.0 2.0 2.0
3 5.0 4.0 3.0 2.0 2.0
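If you prefer not to mutate the factors dict, an equivalent sketch builds the aligned Series explicitly with reindex, which fills every column missing from factors with the default of 2:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.ones(shape=(4, 5)),
                  columns=['Math', 'Physique', 'Bio', 'Algo', 'Archi'])

# reindex aligns the factors to df.columns and fills the gaps with 2
factors = pd.Series({'Math': 5, 'Physique': 4, 'Bio': 3})
result = df.mul(factors.reindex(df.columns, fill_value=2))
print(result)
```

df.mul with a Series aligns on column labels by default, so this is the same column-axis alignment as df *= factors.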

Fake data:
import numpy as np
import pandas as pd

n = 5
d = {'a': np.ones(n),
     'b': np.ones(n),
     'c': np.ones(n),
     'd': np.ones(n)}
df = pd.DataFrame(d)
print(df)
Select the columns and multiply by a tuple.
df[['a','c']] = df[['a','c']] * (2,4)
print(df)
a b c d
0 1.0 1.0 1.0 1.0
1 1.0 1.0 1.0 1.0
2 1.0 1.0 1.0 1.0
3 1.0 1.0 1.0 1.0
4 1.0 1.0 1.0 1.0
a b c d
0 2.0 1.0 4.0 1.0
1 2.0 1.0 4.0 1.0
2 2.0 1.0 4.0 1.0
3 2.0 1.0 4.0 1.0
4 2.0 1.0 4.0 1.0
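Note that the tuple version matches multipliers to columns by position, so (2, 4) must be listed in the same order as ['a', 'c']. A label-based variant (a small sketch) sidesteps that by aligning on column names:

```python
import numpy as np
import pandas as pd

n = 5
df = pd.DataFrame({'a': np.ones(n), 'b': np.ones(n),
                   'c': np.ones(n), 'd': np.ones(n)})

# mul with a Series aligns on column labels, not position
df[['a', 'c']] = df[['a', 'c']].mul(pd.Series({'a': 2, 'c': 4}))
print(df)
```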

You can use df['col_name'].multiply(value) to scale a whole column. The remaining columns can be modified by looping over all columns not in listm.
listm = ['Math', 'Physique', 'Bio']
for i, head in enumerate(listm):
    df[head] = df[head].multiply(5 - i)  # Math*5, Physique*4, Bio*3
for head in df.columns:
    if head not in listm:
        df[head] = df[head].multiply(2)

Here is another way to do it, using array multiplication.
The data was not provided as text, so I created test data following the pattern of the screenshot (see TEST DATA below).
mul = [5, 4, 3, 2, 2, 2, 2, 1]  # per-column multipliers; total gets 1
df1 = df.iloc[:, 1:].mul(mul)
df1['total'] = df1.iloc[:, :7].sum(axis=1)  # recompute the total
df.update(df1, join='left', overwrite=True)
df
source Math Physics Bio Algo Archi Sport eng total
0 A 50.0 60.0 60.0 50.0 60.0 70.0 80.0 430.0
1 B 55.0 64.0 63.0 52.0 62.0 72.0 82.0 450.0
2 C 5.5 8.4 9.3 NaN NaN NaN NaN 23.2
3 D NaN NaN NaN 22.0 42.0 62.0 82.0 208.0
4 E 6.0 8.8 9.6 NaN NaN NaN NaN 24.4
5 F NaN NaN NaN 24.0 44.0 64.0 84.0 216.0
TEST DATA
data_out = [
    ['A', 10, 15, 20, 25, 30, 35, 40],
    ['B', 11, 16, 21, 26, 31, 36, 41],
    ['C', 1.1, 2.1, 3.1],
    ['D', np.nan, np.nan, np.nan, 11, 21, 31, 41],
    ['E', 1.2, 2.2, 3.2],
    ['F', np.nan, np.nan, np.nan, 12, 22, 32, 42],
]
df = pd.DataFrame(data_out, columns=['source', 'Math', 'Physics', 'Bio', 'Algo', 'Archi', 'Sport', 'eng'])
df['total'] = df.iloc[:, 1:].sum(axis=1)
source Math Physics Bio Algo Archi Sport eng total
0 A 10.0 15.0 20.0 25.0 30.0 35.0 40.0 175.0
1 B 11.0 16.0 21.0 26.0 31.0 36.0 41.0 182.0
2 C 1.1 2.1 3.1 NaN NaN NaN NaN 6.3
3 D NaN NaN NaN 11.0 21.0 31.0 41.0 104.0
4 E 1.2 2.2 3.2 NaN NaN NaN NaN 6.6
5 F NaN NaN NaN 12.0 22.0 32.0 42.0 108.0

Combine list with dataframe and extend the multi-values cells into columns

Assuming I have the following input:
table = pd.DataFrame({'a': [0, 0, 0, 0], 'b': [1, 1, 1, 3], 'c': [2, 2, 5, 4], 'd': [3, 6, 6, 6]}, dtype='float64')
list = [[55, 66],
        [77]]
#output of the table
a b c d
0 0.0 1.0 2.0 3.0
1 0.0 1.0 2.0 6.0
2 0.0 1.0 5.0 6.0
3 0.0 3.0 4.0 6.0
I want to combine list with table so the final shape would be like:
a b c d ID_0 ID_1
0 0.0 1.0 2.0 3.0 55.0 66.0
1 0.0 1.0 2.0 6.0 77.0 NaN
2 0.0 1.0 5.0 6.0 NaN NaN
3 0.0 3.0 4.0 6.0 NaN NaN
I found a way but it looks a bit long and might be a shorter way to do it.
Step1:
x = pd.Series(list, name="ID")
new = pd.concat([table, x], axis=1)
# output
a b c d ID
0 0.0 1.0 2.0 3.0 [55, 66]
1 0.0 1.0 2.0 6.0 [77]
2 0.0 1.0 5.0 6.0 NaN
3 0.0 3.0 4.0 6.0 NaN
step2:
ID = new['ID'].apply(pd.Series)
ID = ID.rename(columns=lambda x: 'ID_' + str(x))
new_x = pd.concat([new, ID], axis=1)
# output
a b c d ID ID_0 ID_1
0 0.0 1.0 2.0 3.0 [55, 66] 55.0 66.0
1 0.0 1.0 2.0 6.0 [77] 77.0 NaN
2 0.0 1.0 5.0 6.0 NaN NaN NaN
3 0.0 3.0 4.0 6.0 NaN NaN NaN
step3:
new_x = new_x.drop(columns=['ID'])
Any shorter way to achieve the same result?
Assuming a default index on table (as shown in the question), we can simply create a DataFrame (either with from_records or the constructor) and join it back to table, allowing the indexes to align. (add_prefix is an easy way to add the 'ID_' prefix to the default numeric column names.)
new_df = table.join(
    pd.DataFrame.from_records(lst).add_prefix('ID_')
)
new_df:
a b c d ID_0 ID_1
0 0.0 1.0 2.0 3.0 55.0 66.0
1 0.0 1.0 2.0 6.0 77.0 NaN
2 0.0 1.0 5.0 6.0 NaN NaN
3 0.0 3.0 4.0 6.0 NaN NaN
Working with 2 DataFrames is generally easier than a DataFrame and a list. Here is what from_records does to lst:
pd.DataFrame.from_records(lst)
0 1
0 55 66.0
1 77 NaN
Index (rows) 0 and 1 will now align with the corresponding index values in table (0 and 1 respectively).
add_prefix fixes the column names before joining:
pd.DataFrame.from_records(lst).add_prefix('ID_')
ID_0 ID_1
0 55 66.0
1 77 NaN
Setup and imports:
import pandas as pd  # v1.4.4
table = pd.DataFrame({
    'a': [0, 0, 0, 0],
    'b': [1, 1, 1, 3],
    'c': [2, 2, 5, 4],
    'd': [3, 6, 6, 6]
}, dtype='float64')
lst = [[55, 66],
       [77]]

How to freeze first numbers in sequences between NaNs in Python pandas dataframe

Is there a Pythonic way, in a timeseries dataframe, to go down each column, pick the first number in a sequence and push it forward until the next NaN, then take the next non-NaN number and push that one down until the next NaN, and so on (retaining the indices and NaNs)?
For example, I would like to convert this dataframe:
DF = pd.DataFrame(data={'A': [np.nan, 1, 3, 5, 7, np.nan, 2, 4, 6, np.nan],
                        'B': [8, 6, 4, np.nan, np.nan, 9, 7, 3, np.nan, 3],
                        'C': [np.nan, np.nan, 4, 2, 6, np.nan, 1, 5, 2, 8]})
A B C
0 NaN 8.0 NaN
1 1.0 6.0 NaN
2 3.0 4.0 4.0
3 5.0 NaN 2.0
4 7.0 NaN 6.0
5 NaN 9.0 NaN
6 2.0 7.0 1.0
7 4.0 3.0 5.0
8 6.0 NaN 2.0
9 NaN 3.0 8.0
To this dataframe:
Result = pd.DataFrame(data={'A': [np.nan, 1, 1, 1, 1, np.nan, 2, 2, 2, np.nan],
                            'B': [8, 8, 8, np.nan, np.nan, 9, 9, 9, np.nan, 3],
                            'C': [np.nan, np.nan, 4, 4, 4, np.nan, 1, 1, 1, 1]})
A B C
0 NaN 8.0 NaN
1 1.0 8.0 NaN
2 1.0 8.0 4.0
3 1.0 NaN 4.0
4 1.0 NaN 4.0
5 NaN 9.0 NaN
6 2.0 9.0 1.0
7 2.0 9.0 1.0
8 2.0 NaN 1.0
9 NaN 3.0 1.0
I know I can use a loop to iterate down the columns to do this, but I would appreciate some help on how to do it more efficiently, in a Pythonic way, on a very large dataframe. Thank you.
IIUC:
# where DF is not NaN
mask = DF.notna()
Result = (DF.shift(-1)           # fill the original NaNs with their next value
            .mask(mask)          # replace all the original non-NaNs with NaN
            .ffill()             # forward fill
            .fillna(DF.iloc[0])  # handle columns that start with a non-NaN
            .where(mask))        # put the original NaNs back
Output:
A B C
0 NaN 8.0 NaN
1 1.0 8.0 NaN
2 1.0 8.0 4.0
3 1.0 NaN 4.0
4 1.0 NaN 4.0
5 NaN 9.0 NaN
6 2.0 9.0 1.0
7 2.0 9.0 1.0
8 2.0 NaN 1.0
9 NaN 3.0 1.0
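A groupby-based alternative (a sketch that gives the same result on this data): let each NaN start a new group via isna().cumsum(), broadcast each group's first non-NaN value over the group with transform('first'), and restore the original NaNs at the end:

```python
import numpy as np
import pandas as pd

DF = pd.DataFrame(data={'A': [np.nan, 1, 3, 5, 7, np.nan, 2, 4, 6, np.nan],
                        'B': [8, 6, 4, np.nan, np.nan, 9, 7, 3, np.nan, 3],
                        'C': [np.nan, np.nan, 4, 2, 6, np.nan, 1, 5, 2, 8]})

# per column: NaNs delimit the runs, so cumsum over isna() labels each run;
# transform('first') spreads the run's first non-NaN value across the run
Result = (DF.apply(lambda s: s.groupby(s.isna().cumsum()).transform('first'))
            .where(DF.notna()))  # put the original NaNs back
print(Result)
```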

How do I select rows with at least a value in some selected columns? [duplicate]

I have a big dataframe with many columns (like 1000). I have a list of columns (generated by a script ~10). And I would like to select all the rows in the original dataframe where at least one of my list of columns is not null.
So if I would know the number of my columns in advance, I could do something like this:
list_of_cols = ['col1', ...]
df[
    df[list_of_cols[0]].notnull() |
    df[list_of_cols[1]].notnull() |
    ...
    df[list_of_cols[6]].notnull()
]
I can also iterate over the list of columns and create a mask which I would then apply to df, but this looks too tedious. Knowing how powerful pandas is with respect to dealing with NaN, I would expect there is a much easier way to achieve what I want.
Use the thresh parameter in the dropna() method. By setting thresh=1, you specify that a row with at least 1 non-null item should not be dropped.
df = pd.DataFrame(np.random.choice((1., np.nan), (1000, 1000), p=(.3, .7)))
list_of_cols = list(range(10))
df[list_of_cols].dropna(thresh=1).head()
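One caveat: df[list_of_cols].dropna(thresh=1) returns only those 10 columns. To keep every column of the original dataframe, you can instead point dropna at the columns via its subset parameter (a minor variant):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.choice((1., np.nan), (1000, 1000), p=(.3, .7)))
list_of_cols = list(range(10))

# thresh=1 + subset: keep a row if at least one of the listed columns is non-null
kept = df.dropna(subset=list_of_cols, thresh=1)
print(kept.shape)
```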
Starting with this:
data = {'a': [np.nan, 0, 0, 0, 0, 0, np.nan, 0, 0, 0, 0, 0, 9, 9],
        'b': [np.nan, np.nan, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 1, 7],
        'c': [np.nan, np.nan, 1, 1, 2, 2, 3, 3, 3, 1, 1, 1, 1, 1],
        'd': [np.nan, np.nan, 7, 9, 6, 9, 7, np.nan, 6, 6, 7, 6, 9, 6]}
df = pd.DataFrame(data, columns=['a', 'b', 'c', 'd'])
df
a b c d
0 NaN NaN NaN NaN
1 0.0 NaN NaN NaN
2 0.0 1.0 1.0 7.0
3 0.0 1.0 1.0 9.0
4 0.0 1.0 2.0 6.0
5 0.0 1.0 2.0 9.0
6 NaN 1.0 3.0 7.0
7 0.0 1.0 3.0 NaN
8 0.0 1.0 3.0 6.0
9 0.0 2.0 1.0 6.0
10 0.0 2.0 1.0 7.0
11 0.0 2.0 1.0 6.0
12 9.0 1.0 1.0 9.0
13 9.0 7.0 1.0 6.0
Rows where not all values are null (this removes row index 0):
df[~df.isnull().all(axis=1)]
a b c d
1 0.0 NaN NaN NaN
2 0.0 1.0 1.0 7.0
3 0.0 1.0 1.0 9.0
4 0.0 1.0 2.0 6.0
5 0.0 1.0 2.0 9.0
6 NaN 1.0 3.0 7.0
7 0.0 1.0 3.0 NaN
8 0.0 1.0 3.0 6.0
9 0.0 2.0 1.0 6.0
10 0.0 2.0 1.0 7.0
11 0.0 2.0 1.0 6.0
12 9.0 1.0 1.0 9.0
13 9.0 7.0 1.0 6.0
One can use boolean indexing
df[~pd.isnull(df[list_of_cols]).all(axis=1)]
Explanation:
The expression ~pd.isnull(df[list_of_cols]).all(axis=1) returns a boolean array that is applied as a filter to the dataframe:
isnull() applied to df[list_of_cols] creates a boolean mask with True for the null elements in df[list_of_cols], False otherwise;
all(axis=1) returns True for a row if all of its elements are True;
so, by negation ~ (not all null = at least one is non-null), one gets a mask for all rows that have at least one non-null element in the given list of columns.
An example:
Dataframe:
>>> list_of_cols = ['B', 'C']
>>> df = pd.DataFrame({'A': [11, 22, 33, np.nan],
...                    'B': ['x', np.nan, np.nan, 'w'],
...                    'C': ['2016-03-13', np.nan, '2016-03-14', '2016-03-15']})
>>> df
A B C
0 11 x 2016-03-13
1 22 NaN NaN
2 33 NaN 2016-03-14
3 NaN w 2016-03-15
isnull mask:
>>> ~pd.isnull(df[list_of_cols])
B C
0 True True
1 False False
2 False True
3 True True
apply all(axis=1) row-wise:
>>> ~pd.isnull(df[list_of_cols]).all(axis=1)
0 True
1 False
2 True
3 True
dtype: bool
Boolean selection from dataframe:
>>> df[~pd.isnull(df[list_of_cols]).all(axis=1)]
A B C
0 11 x 2016-03-13
2 33 NaN 2016-03-14
3 NaN w 2016-03-15
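An equivalent, arguably more direct spelling of the same mask uses notna with any ("at least one non-null" instead of "not all null"):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [11, 22, 33, np.nan],
                   'B': ['x', np.nan, np.nan, 'w'],
                   'C': ['2016-03-13', np.nan, '2016-03-14', '2016-03-15']})
list_of_cols = ['B', 'C']

# notna().any(axis=1) is the De Morgan twin of ~isnull().all(axis=1)
result = df[df[list_of_cols].notna().any(axis=1)]
print(result)
```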

Create a column that has the same length as the longest column in the data, at the same time

I have the following data:
data = [[1,2,3], [1,2,3,4,5], [1,2,3,4,5,6,7]]
dataFrame = pandas.DataFrame(data).transpose()
Output:
0 1 2
0 1.0 1.0 1.0
1 2.0 2.0 2.0
2 3.0 3.0 3.0
3 NaN 4.0 4.0
4 NaN 5.0 5.0
5 NaN NaN 6.0
6 NaN NaN 7.0
Is it possible to create a 4th column, AT THE SAME TIME as the other columns are created in data, which has the same length as the longest column of this dataframe (the 3rd one)?
The data in this column doesn't matter. Assume it's 8. So the desired output can be:
0 1 2 3
0 1.0 1.0 1.0 8.0
1 2.0 2.0 2.0 8.0
2 3.0 3.0 3.0 8.0
3 NaN 4.0 4.0 8.0
4 NaN 5.0 5.0 8.0
5 NaN NaN 6.0 8.0
6 NaN NaN 7.0 8.0
In my script the dataframe keeps changing every time. This means the longest columns keeps changing with it.
Thanks for reading
This is quite similar to answers from #jpp, #Cleb, and maybe some other answers here, just slightly simpler:
data = [[1,2,3], [1,2,3,4,5], [1,2,3,4,5,6,7]] + [[]]
This will automatically give you a column of NaNs that is the same length as the longest column, so you don't need the extra work of calculating the length of the longest column. Resulting dataframe:
0 1 2 3
0 1.0 1.0 1.0 NaN
1 2.0 2.0 2.0 NaN
2 3.0 3.0 3.0 NaN
3 NaN 4.0 4.0 NaN
4 NaN 5.0 5.0 NaN
5 NaN NaN 6.0 NaN
6 NaN NaN 7.0 NaN
Note that this answer is less general than some others here (such as by #jpp & #Cleb) in that it will only fill with NaNs. If you want some default fill values other than NaN, you should use one of their answers.
You can append to a list which then immediately feeds the pd.DataFrame constructor:
import pandas as pd
data = [[1,2,3], [1,2,3,4,5], [1,2,3,4,5,6,7]]
df = pd.DataFrame(data + [[8]*max(map(len, data))]).transpose()
print(df)
0 1 2 3
0 1.0 1.0 1.0 8.0
1 2.0 2.0 2.0 8.0
2 3.0 3.0 3.0 8.0
3 NaN 4.0 4.0 8.0
4 NaN 5.0 5.0 8.0
5 NaN NaN 6.0 8.0
6 NaN NaN 7.0 8.0
But this is inefficient. Pandas uses NumPy to hold underlying series and setting a series to a constant value is trivial and efficient; you can simply use:
df[3] = 8
It is not entirely clear what you mean by at the same time, but the following would work:
import pandas as pd
data = [[1,2,3], [1,2,3,4,5], [1,2,3,4,5,6,7]]
# get the longest list in data
data.append([8] * max(map(len, data)))
pd.DataFrame(data).transpose()
yielding
0 1 2 3
0 1.0 1.0 1.0 8.0
1 2.0 2.0 2.0 8.0
2 3.0 3.0 3.0 8.0
3 NaN 4.0 4.0 8.0
4 NaN 5.0 5.0 8.0
5 NaN NaN 6.0 8.0
6 NaN NaN 7.0 8.0
If you'd like to do it as you create the DataFrame, simply chain a call to assign:
pd.DataFrame(data).T.assign(**{'3': 8})
0 1 2 3
0 1.0 1.0 1.0 8
1 2.0 2.0 2.0 8
2 3.0 3.0 3.0 8
3 NaN 4.0 4.0 8
4 NaN 5.0 5.0 8
5 NaN NaN 6.0 8
6 NaN NaN 7.0 8
You can define a function (read the comments):
def f(df):
    # count() of the fullest column gives the length of the longest column
    l = [8] * df[max(df, key=lambda x: df[x].count())].count()
    df[3] = l + [np.nan] * (len(df) - len(l))
    # the above two lines can be replaced by any other solution to this problem
    return df
dataFrame = f(pandas.DataFrame(data).transpose())
Then now:
print(dataFrame)
Returns:
0 1 2 3
0 1.0 1.0 1.0 8
1 2.0 2.0 2.0 8
2 3.0 3.0 3.0 8
3 NaN 4.0 4.0 8
4 NaN 5.0 5.0 8
5 NaN NaN 6.0 8
6 NaN NaN 7.0 8
If you mean at the same time as running pd.DataFrame, the data has to be prepped before it is loaded into your frame.
data = [[1, 2, 3], [1, 2, 3, 4, 5], [1, 2, 3, 4, 5, 6, 7]]
longest = max(len(i) for i in data)
dummy = [8 for i in range(longest)]  # dummy data filled with 8
data.append(dummy)
dataFrame = pd.DataFrame(data).transpose()
The example above finds the length of the longest element in your list and creates a dummy list to append before creating your dataframe.
One solution is to add an element to the list that is passed to the dataframe:
pd.DataFrame(data + [[np.hstack(data).max() + 1] * len(max(data, key=len))]).T
0 1 2 3
0 1.0 1.0 1.0 8.0
1 2.0 2.0 2.0 8.0
2 3.0 3.0 3.0 8.0
3 NaN 4.0 4.0 8.0
4 NaN 5.0 5.0 8.0
5 NaN NaN 6.0 8.0
6 NaN NaN 7.0 8.0
If data is to be modified just:
data = [[1,2,3], [1,2,3,4,5], [1,2,3,4,5,6,7]]
data = data + [[np.hstack(data).max() + 1] * len(max(data, key=len))]
pd.DataFrame(data).T

Pandas shift based on different values to calculate percentages

I am trying to calculate percentages of first down from a dataframe.
Here is the dataframe
down distance
1 1.0 10.0
2 2.0 13.0
3 3.0 15.0
4 3.0 20.0
5 4.0 1.0
6 1.0 10.0
7 2.0 9.0
8 3.0 3.0
9 1.0 10.0
I would like to calculate the percentage relative to first down: for second down, the percentage of yards gained relative to first down; for third down, the percentage of third down based on first.
For example, I would like to have the following output.
down distance percentage
1 1.0 10.0 NaN
2 2.0 13.0 (13-10)/13
3 3.0 15.0 (15-10)/15
4 3.0 20.0 (20-10)/20
5 4.0 1.0 (1-10)/20
6 1.0 10.0 NaN # New calculation
7 2.0 9.0 (9-10)/9
8 3.0 3.0 (3-10)/3
9 1.0 10.0 NaN
Thanks
Current solutions all work correctly for the first question.
Here's a vectorised solution:
# define condition
cond = df['down'] == 1
# calculate value to subtract
first = df['distance'].where(cond).ffill().mask(cond)
# perform calculation
df['percentage'] = (df['distance'] - first) / df['distance']
print(df)
down distance percentage
1 1.0 10.0 NaN
2 2.0 13.0 0.230769
3 3.0 15.0 0.333333
4 3.0 20.0 0.500000
5 4.0 1.0 -9.000000
6 1.0 10.0 NaN
7 2.0 9.0 -0.111111
8 3.0 3.0 -2.333333
9 1.0 10.0 NaN
Using groupby and transform:
s = df.groupby(df.down.eq(1).cumsum()).distance.transform('first')
s = df.distance.sub(s).div(df.distance)
df['percentage'] = s.mask(s.eq(0))
down distance percentage
1 1.0 10.0 NaN
2 2.0 13.0 0.230769
3 3.0 15.0 0.333333
4 3.0 20.0 0.500000
5 4.0 1.0 -9.000000
6 1.0 10.0 NaN
7 2.0 9.0 -0.111111
8 3.0 3.0 -2.333333
9 1.0 10.0 NaN
With Numpy Bits
Should be pretty zippy!
m = df.down.values == 1                 # mask where down equals 1
i = np.flatnonzero(m)                   # positions where down equals 1
d = df.distance.values                  # NumPy array of distances
j = np.diff(np.append(i, len(df)))      # distances between positions equal to 1;
                                        # len(df) is appended as a terminal value
k = i.repeat(j)                         # repeat each first-down position to fill in
p = np.where(m, np.nan, 1 - d[k] / d)   # reduction of the % formula while masking
df.assign(percentage=p)
down distance percentage
1 1.0 10.0 NaN
2 2.0 13.0 0.230769
3 3.0 15.0 0.333333
4 3.0 20.0 0.500000
5 4.0 1.0 -9.000000
6 1.0 10.0 NaN
7 2.0 9.0 -0.111111
8 3.0 3.0 -2.333333
9 1.0 10.0 NaN
Use groupby to group rows each time down is equal to 1, then transform with your desired calculation. Then find where down is 1 again and convert those entries to NaN (as the calculation is meaningless there, as per your example):
df['percentage'] = (df.groupby(df.down.eq(1).cumsum())['distance']
                      .transform(lambda x: (x - x.iloc[0]) / x))
df.loc[df.down.eq(1), 'percentage'] = np.nan
>>> df
down distance percentage
1 1.0 10.0 NaN
2 2.0 13.0 0.230769
3 3.0 15.0 0.333333
4 3.0 20.0 0.500000
5 4.0 1.0 -9.000000
6 1.0 10.0 NaN
7 2.0 9.0 -0.111111
8 3.0 3.0 -2.333333
9 1.0 10.0 NaN
