Efficiently scale df columns to zero - python

I am trying to develop a process that automatically scales each Series in a pandas df to zero. For instance, if we use the df below:
import pandas as pd
d = {
    'A': [0, 1, 2, 3],
    'B': [6, 7, 8, 9],
    'C': [10, 11, 12, 13],
    'D': [-4, -5, -4, -3],
}
df = pd.DataFrame(data=d)
I'm manually adjusting each column so it begins at zero. You'll notice the increments are either +1 or -1, but the starting integers vary.
df['B'] = df['B'] - 6
df['C'] = df['C'] - 10
df['D'] = df['D'] + 4
Output:
A B C D
0 0 0 0 0
1 1 1 1 -1
2 2 2 2 0
3 3 3 3 1
This isn't very efficient, as I have to go through each Series to determine the scaling factor. Is there a more efficient way to do this?

You can subtract the first row, selected with iloc, using sub:
df = df.sub(df.iloc[0])
#same as
#df = df - df.iloc[0]
print (df)
A B C D
0 0 0 0 0
1 1 1 1 -1
2 2 2 2 0
3 3 3 3 1
Detail:
print (df.iloc[0])
A 0
B 6
C 10
D -4
Name: 0, dtype: int64
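For completeness, here's a quick sanity check (a sketch reusing the question's data) that the vectorized subtraction reproduces the manual per-column scaling:
import pandas as pd

d = {'A': [0, 1, 2, 3], 'B': [6, 7, 8, 9],
     'C': [10, 11, 12, 13], 'D': [-4, -5, -4, -3]}
df = pd.DataFrame(data=d)

# Manual scaling, as in the question
manual = df.copy()
manual['B'] = manual['B'] - 6
manual['C'] = manual['C'] - 10
manual['D'] = manual['D'] + 4

# Vectorized: subtract the first row from every row
assert df.sub(df.iloc[0]).equals(manual)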

Related

Comparing the value of a column with the previous value of a new column using Apply in Python (Pandas)

I have a dataframe with these values in column A:
import pandas as pd
A = [0, 5, 1, 7, 0, 2, 1, 3, 0]
df = pd.DataFrame(A, columns=['A'])
A
0 0
1 5
2 1
3 7
4 0
5 2
6 1
7 3
8 0
I need to create a new column (called B) and populate it using the following conditions:
Condition 1: If the value of A is equal to 0, then the value of B must be 0.
Condition 2: If the value of A is not 0, then I compare its value to the previous value of B. If A is higher than the previous value of B, I take A; otherwise I take the previous B.
The result should be this:
A B
0 0 0
1 5 5
2 1 5
3 7 7
4 0 0
5 2 2
6 1 2
7 3 3
8 0 0
The dataset is huge, and using loops would be too slow. I need to solve this without loops and without the pandas .loc function. Could anyone help me solve this using the apply function? I have tried different things without success.
Thanks a lot.
One way to do this, I guess, could be the following:
def do_your_stuff(row):
    global value
    # fancy stuff here
    value = row['B']
    [...]

value = df.iloc[0]['B']
df['B'] = df.apply(lambda row: do_your_stuff(row), axis=1)
Try this:
df['B'] = df['A'].shift(fill_value=0)
df['B'] = df.apply(lambda x: 0 if x.A == 0 else x.A if x.A > x.B else x.B, axis=1)
Note that this only compares A with the immediately preceding value of A (the shifted copy stored in B), so it will not carry the running value across more than one row; it happens to reproduce the expected output for this particular data.
Use .shift() to shift your column one cell down, and check whether the previous value is larger and the current value is not 0. Then use .mask() to replace the values with the previous one where the condition holds.
from io import StringIO
import pandas as pd
wt = StringIO("""A
0 0
1 2
2 3
3 1
4 2
5 7
6 0
""")
df = pd.read_csv(wt, sep=r'\s+')
df
A
0 0
1 2
2 3
3 1
4 2
5 7
6 0
def func(df, col):
    df['B'] = df[col].mask(cond=((df[col].shift(1) > df[col]) & (df[col] != 0)), other=df[col].shift(1))
    if col == 'B':
        while ((df[col].shift(1) > df[col]) & (df[col] != 0)).any():
            df['B'] = df[col].mask(cond=((df[col].shift(1) > df[col]) & (df[col] != 0)), other=df[col].shift(1))
    return df
(df.pipe(func, 'A').pipe(func, 'B'))
Output:
A B
0 0 0
1 2 2
2 3 3
3 1 3
4 2 3
5 7 7
6 0 0
Using the solution of Achille I solved it this way:
import pandas as pd
A = [0,2,3,0,2,7,2,3,2,20,1,0,2,5,4,3,1]
df = pd.DataFrame(A,columns =['A'])
df['B'] = 0
def function(row):
    global value
    global prev
    if row['A'] == 0:
        value = 0
    elif row['A'] > value:
        value = row['A']
    else:
        value = prev
    prev = value
    return value
value = df.iloc[0]['B']
prev = value
df["B"] = df.apply(lambda row: function(row), axis=1)
df
output:
A B
0 0 0
1 2 2
2 3 3
3 0 0
4 2 2
5 7 7
6 2 7
7 3 7
8 2 7
9 20 20
10 1 20
11 0 0
12 2 2
13 5 5
14 4 5
15 3 5
16 1 5
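For what it's worth, since B is effectively a running maximum of A that resets whenever A is 0, the row-wise apply can be avoided entirely. A vectorized sketch (assuming, as in the examples above, that the reset happens exactly at the zeros):
import pandas as pd

A = [0,2,3,0,2,7,2,3,2,20,1,0,2,5,4,3,1]
df = pd.DataFrame(A, columns=['A'])

# Each zero in A starts a new segment; within a segment, B is the running max of A
segment = df['A'].eq(0).cumsum()
df['B'] = df.groupby(segment)['A'].cummax()
This reproduces the output above without any Python-level loop.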

How to keep columns based on a given row's values

Here is how the data looks in the df dataframe:
A B C D
0.js 2 1 1 -1
1.js 3 -5 1 -4
total 5 -4 2 -5
And I would like to get a new dataframe df1:
A C
0.js 2 1
1.js 3 1
total 5 2
So basically it should look like this:
df1 = df[df["total"] > 0]
but it should filter on a row instead of a column, and I can't figure it out.
You want to use .loc[:, column_mask], i.e.
In [11]: df.loc[:, df.sum() > 0]
Out[11]:
       A  C
0.js   2  1
1.js   3  1
total  5  2
# or, selecting on the "total" row explicitly
In [12]: df.loc[:, df.loc['total'] > 0]
Out[12]:
       A  C
0.js   2  1
1.js   3  1
total  5  2
Use .where to set non-positive values to NaN, then dropna with axis=1:
df.where(df.gt(0)).dropna(axis=1)
       A  C
0.js   2  1
1.js   3  1
total  5  2
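Note that where/dropna keeps only the columns that are positive in every row. That happens to coincide with "total > 0" for this data, but it is not the same condition in general; a small sketch with hypothetical data:
import pandas as pd

df2 = pd.DataFrame({'A': [5, -1, 4]}, index=['0.js', '1.js', 'total'])
# total is 4 > 0, but the column is still dropped because of the -1
print(df2.where(df2.gt(0)).dropna(axis=1))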
You can use loc with boolean indexing, or reindex:
df.loc[:, df.columns[(df.loc['total'] > 0)]]
OR
df.reindex(df.columns[(df.loc['total'] > 0)], axis=1)
Output:
A C
0.js 2 1
1.js 3 1
total 5 2

map DataFrame index and forward fill nan values

I have a DataFrame with an integer index that is missing some values (i.e. not equally spaced). I want to create a new DataFrame with equally spaced index values and forward-fill the column values. Below is a simple example:
have
import pandas as pd
df = pd.DataFrame(['A', 'B', 'C'], index=[0, 2, 4])
0
0 A
2 B
4 C
want to use above and create:
0
0 A
1 A
2 B
3 B
4 C
Use reindex with method='ffill':
import numpy as np
df = df.reindex(np.arange(0, df.index.max()+1), method='ffill')
Or:
df = df.reindex(np.arange(df.index.min(), df.index.max() + 1), method='ffill')
print (df)
0
0 A
1 A
2 B
3 B
4 C
Using reindex and ffill:
df = df.reindex(range(df.index[0],df.index[-1]+1)).ffill()
print(df)
0
0 A
1 A
2 B
3 B
4 C
You can do this:
In [319]: df.reindex(list(range(df.index.min(),df.index.max()+1))).ffill()
Out[319]:
0
0 A
1 A
2 B
3 B
4 C

Splitting a concatenated string into separate columns using pandas

I have a pandas dataframe consisting of one column containing strings separated by "/". I would like to split these separated strings into new columns denoted by a boolean (if they exist):
import pandas as pd
d = {'col1': ["A/B/C", "B/C", "D/B/A", "C/B"]}
dataFrame = pd.DataFrame(data=d)
col1
0 A/B/C
1 B/C
2 D/B/A
3 C/B
the result would be as follows:
d = {'A': [1, 0, 1, 0], 'B':[1,1,1,1], 'C':[1,1,0,1], 'D':[0,0,1,0]}
dataFrame = pd.DataFrame(data=d)
A B C D
0 1 1 1 0
1 0 1 1 0
2 1 1 0 1
3 0 1 1 0
I have attempted this with pandas.Series.str.split and pandas.pivot, but nothing quite returns the result I am looking for. Any help or nudges in the right direction would be highly appreciated!
Use pandas.Series.str.get_dummies
df.col1.str.get_dummies('/')
A B C D
0 1 1 1 0
1 0 1 1 0
2 1 1 0 1
3 0 1 1 0
Setup
d = {'col1': ["A/B/C", "B/C", "D/B/A", "C/B"]}
df = pd.DataFrame(data=d)
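If you want to see what get_dummies('/') is doing under the hood, an equivalent long-hand sketch (assuming a reasonably recent pandas with Series.explode) is to split into lists, explode to long form, and pivot to indicators:
import pandas as pd

df = pd.DataFrame({'col1': ["A/B/C", "B/C", "D/B/A", "C/B"]})
s = df['col1'].str.split('/').explode()
out = pd.crosstab(s.index, s).rename_axis(index=None, columns=None)
print(out)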

Python: given list of columns and list of values, return subset of dataframe that meets all criteria

I have a dataframe like the following.
df = pd.DataFrame({'A' : ['Bob','Jean','Sally','Sue'], 'B' : [1,2,3, 2],'C' : [7,8,9,8] })
Assume that column A will always be in the dataframe, but sometimes there could be a column B; columns B and C; or any number of columns. I have written code to save the column names (other than A) in a list, as well as the unique permutations of the values in the other columns. For instance, in this example, columns B and C are saved into col:
col = ['B','C']
The permutations in the simple df are 1,7; 2,8; 3,9. For simplicity assume one permutation is saved as follows:
permutation = [2,8]
How do I select the entire rows (and only those) that equal that permutation?
Right now, I am using:
df[df[col].isin(permutation)]
Unfortunately, I don't get the values in column A. (I know how to drop the NaN values later.) But how should I do this to keep it dynamic? Sometimes there will be multiple columns. (Ultimately, I'll run through a loop and save the different iterations, based upon multiple permutations in the columns other than A.)
Use the intersection of boolean series (where both conditions are true) - first setup code:
import pandas as pd
df = pd.DataFrame({'A' : ['Bob','Jean','Sally','Sue'], 'B' : [1,2,3, 2],'C' : [7,8,9,8] })
col = ['B','C']
permutation = [2,8]
And here's the solution for this limited example:
>>> df[(df[col[0]] == permutation[0]) & (df[col[1]] == permutation[1])]
A B C
1 Jean 2 8
3 Sue 2 8
To break that down:
>>> b, c = col
>>> per_b, per_c = permutation
>>> column_b_matches = df[b] == per_b
>>> column_c_matches = df[c] == per_c
>>> intersection = column_b_matches & column_c_matches
>>> df[intersection]
A B C
1 Jean 2 8
3 Sue 2 8
Additional columns and values
To take any number of columns and values, I would create a function:
def select_rows(df, columns, values):
    if not columns or not values:
        raise Exception('must pass columns and values')
    if len(columns) != len(values):
        raise Exception('columns and values must be same length')
    intersection = True
    for c, v in zip(columns, values):
        intersection &= df[c] == v
    return df[intersection]
and to use it:
>>> select_rows(df, col, permutation)
A B C
1 Jean 2 8
3 Sue 2 8
Or you can coerce the permutation to an array and accomplish this with a single comparison, assuming numeric values:
import numpy as np
def select_rows(df, columns, values):
    return df[(df[columns] == np.array(values)).all(axis=1)]
But this does not work with your code sample as given.
I figured out a solution. Aaron's answer above works well if I only have two columns, but I need a solution that works regardless of the size of the df (it will have 3-7 columns).
df = pd.DataFrame({'A' : ['Bob','Jean','Sally','Sue'], 'B' : [1,2,3, 2],'C' : [7,8,9,8] })
permutation = [2,8]
col = ['B','C']
interim = df[col].isin(permutation)
df[df.index.isin(interim[(interim != 0).all(1)].index)]
you can do it this way:
In [77]: permutation = np.array([0,2,2])
In [78]: col
Out[78]: ['a', 'b', 'c']
In [79]: df.loc[(df[col] == permutation).all(axis=1)]
Out[79]:
a b c
10 0 2 2
15 0 2 2
16 0 2 2
your solution will not always work properly:
sample DF:
In [71]: df
Out[71]:
a b c
0 0 2 1
1 1 1 1
2 0 1 2
3 2 0 1
4 0 1 0
5 2 0 0
6 2 0 0
7 0 1 0
8 2 1 0
9 0 0 0
10 0 2 2
11 1 0 1
12 2 1 1
13 1 0 0
14 2 1 0
15 0 2 2
16 0 2 2
17 1 0 2
18 0 1 1
19 1 2 0
In [67]: col = ['a','b','c']
In [68]: permutation = [0,2,2]
In [69]: interim = df[col].isin(permutation)
pay attention to the result:
In [70]: df[df.index.isin(interim[(interim != 0).all(1)].index)]
Out[70]:
a b c
5 2 0 0
6 2 0 0
9 0 0 0
10 0 2 2
15 0 2 2
16 0 2 2
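To summarize for the original example, the row-wise comparison shown above is the robust approach regardless of how many columns are compared; a sketch with the question's own data:
import pandas as pd

df = pd.DataFrame({'A': ['Bob', 'Jean', 'Sally', 'Sue'], 'B': [1, 2, 3, 2], 'C': [7, 8, 9, 8]})
col = ['B', 'C']
permutation = [2, 8]

# Compare the selected columns against the aligned value list, row by row
print(df.loc[(df[col] == permutation).all(axis=1)])
#       A  B  C
# 1  Jean  2  8
# 3   Sue  2  8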
