Select columns based on row 0 value range - python

From a Pandas DataFrame, I want to select the columns where the value of the first row falls within a certain range (e.g., 0.5 to 1.1).
I can select columns where row 0 is greater than or less than a certain number by doing this:
df = pd.DataFrame(example).T
Result = df[df.iloc[:, 0] > 0.5].T
How do I do this for a range (i.e., greater than 0.5 and less than 1.1)?
Thanks.

You can use Series.between:
print(df[df.iloc[:, 0].between(0.5, 1.1)])
Another solution combines two conditions with & (element-wise and):
print(df[(df.iloc[:, 0] > 0.5) & (df.iloc[:, 0] < 1.1)])
Sample:
df = pd.DataFrame({'a':[1.1,1.4,0.7,0,0.5]})
print (df)
a
0 1.1
1 1.4
2 0.7
3 0.0
4 0.5
# inclusive=True is the default
print (df[df.iloc[:, 0].between(0.5, 1.1)])
a
0 1.1
2 0.7
4 0.5
# with inclusive=False both bounds are excluded
print (df[df.iloc[:, 0].between(0.5, 1.1, inclusive=False)])
a
2 0.7
print (df[(df.iloc[:, 0] > 0.5) & (df.iloc[:, 0] < 1.1)])
a
2 0.7
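Note: in newer pandas (1.3+), the boolean inclusive flag was replaced by the strings 'both', 'neither', 'left', 'right'; a minimal sketch of the strict-bounds case:

```python
import pandas as pd

df = pd.DataFrame({'a': [1.1, 1.4, 0.7, 0.0, 0.5]})

# 'neither' excludes both bounds, like the old inclusive=False
strict = df[df.iloc[:, 0].between(0.5, 1.1, inclusive='neither')]
print(strict)  # only the 0.7 row survives
```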
But if you need to select columns by the first row, add DataFrame.loc:
df = pd.DataFrame({'A':[1.1,2,3],
                   'B':[.4,5,6],
                   'C':[.7,8,9],
                   'D':[1.0,3,5],
                   'E':[.5,3,6],
                   'F':[.7,4,3]})
print (df)
A B C D E F
0 1.1 0.4 0.7 1.0 0.5 0.7
1 2.0 5.0 8.0 3.0 3.0 4.0
2 3.0 6.0 9.0 5.0 6.0 3.0
print (df.loc[:, df.iloc[0, :].between(0.5, 1.1)])
A C D E F
0 1.1 0.7 1.0 0.5 0.7
1 2.0 8.0 3.0 3.0 4.0
2 3.0 9.0 5.0 6.0 3.0
print (df.loc[:, df.iloc[0, :].between(0.5, 1.1, inclusive=False)])
C D F
0 0.7 1.0 0.7
1 8.0 3.0 4.0
2 9.0 5.0 3.0
print (df.loc[:, (df.iloc[0, :] > 0.5) & (df.iloc[0, :] < 1.1)])
C D F
0 0.7 1.0 0.7
1 8.0 3.0 4.0
2 9.0 5.0 3.0

Related

Multiply every 2nd row by -1 in pandas col

I'm trying to multiply every 2nd row by -1, but in a specified column only. Using the code below, I'm hoping to multiply every 2nd row in column c by -1.
df = pd.DataFrame({
    'a' : [2.0,1.0,3.5,2.0,5.0,3.0,1.0,1.0],
    'b' : [1.0,-1.0,3.5,3.0,4.0,2.0,3.0,2.0],
    'c' : [2.0,2.0,2.0,2.0,-1.0,-1.0,-2.0,-2.0],
})
df['c'] = df['c'][::2] * -1
Intended output:
a b c
0 2.0 1.0 2.0
1 1.0 -1.0 -2.0
2 3.5 3.5 2.0
3 2.0 3.0 -2.0
4 5.0 4.0 -1.0
5 3.0 2.0 1.0
6 1.0 3.0 -2.0
7 1.0 2.0 2.0
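For reference, the reason the attempt above does not work as hoped: assignment aligns on the index, so rows missing from the sliced Series become NaN. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({'c': [2.0, 2.0, -1.0, -2.0]})
df['c'] = df['c'][::2] * -1   # the slice only contains rows 0 and 2
print(df['c'].tolist())       # rows 1 and 3 are now NaN
```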
One way using pandas.DataFrame.update:
df.update(df['c'][1::2] * -1)
print(df)
Output:
a b c
0 2.0 1.0 2.0
1 1.0 -1.0 -2.0
2 3.5 3.5 2.0
3 2.0 3.0 -2.0
4 5.0 4.0 -1.0
5 3.0 2.0 1.0
6 1.0 3.0 -2.0
7 1.0 2.0 2.0
Use DataFrame.iloc for slicing, with Index.get_loc for the position of column c:
df.iloc[1::2, df.columns.get_loc('c')] *= -1
# works the same as:
# df.iloc[1::2, df.columns.get_loc('c')] = df.iloc[1::2, df.columns.get_loc('c')] * -1
Or use DataFrame.loc, selecting label positions from df.index:
df.loc[df.index[1::2], 'c'] *= -1
Or:
df.loc[df.index % 2 == 1, 'c'] *= -1
print (df)
a b c
0 2.0 1.0 2.0
1 1.0 -1.0 -2.0
2 3.5 3.5 2.0
3 2.0 3.0 -2.0
4 5.0 4.0 -1.0
5 3.0 2.0 1.0
6 1.0 3.0 -2.0
7 1.0 2.0 2.0
Or you can write your own function:
def multiple(df):
    new_rows = []
    for i in range(len(df)):
        row = df.iloc[i].copy()
        if i % 2 == 1:          # every 2nd row (odd position)
            row['c'] *= -1      # flip the sign in column c only
        new_rows.append(row)
    # build the result in one go (DataFrame.append was removed in pandas 2.0)
    return pd.DataFrame(new_rows).reset_index(drop=True)
You can use this code. Starting from the original df:
a b c
0 2.0 1.0 2.0
1 1.0 -1.0 2.0
2 3.5 3.5 2.0
3 2.0 3.0 2.0
4 5.0 4.0 -1.0
5 3.0 2.0 -1.0
6 1.0 3.0 -2.0
7 1.0 2.0 -2.0
df.loc[df.index % 2 == 1, "c"] = df.c * -1
the result is:
a b c
0 2.0 1.0 2.0
1 1.0 -1.0 -2.0
2 3.5 3.5 2.0
3 2.0 3.0 -2.0
4 5.0 4.0 -1.0
5 3.0 2.0 1.0
6 1.0 3.0 -2.0
7 1.0 2.0 2.0
You can build a sign series with modulo arithmetic:
s = 2 * (np.arange(len(df)) % 2) - 1   # -1, 1, -1, 1, ...
df["c"] = -df.c * s                    # flips the sign on odd rows only
a b c
0 2.0 1.0 2.0
1 1.0 -1.0 -2.0
2 3.5 3.5 2.0
3 2.0 3.0 -2.0
4 5.0 4.0 -1.0
5 3.0 2.0 1.0
6 1.0 3.0 -2.0
7 1.0 2.0 2.0
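Equivalently, np.where can produce the sign vector directly; a sketch on the same column:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'c': [2.0, 2.0, 2.0, 2.0, -1.0, -1.0, -2.0, -2.0]})
# multiply odd positions by -1, even positions by 1
df['c'] *= np.where(df.index % 2 == 1, -1, 1)
```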

How to apply a function between two pandas data frames

How can a custom function be applied to two data frames? The .apply method seems to iterate over rows or columns of a given dataframe, but I am not sure how to use this over two data frames at once. For example,
df1
m1 m2
x y x y z
0 0 10.0 12.0 16.0 17.0 9.0
0 10.0 13.0 15.0 12.0 4.0
1 0 11.0 14.0 14.0 11.0 5.0
1 3.0 14.0 12.0 10.0 9.0
df2
m1 m2
x y x y
0 0.5 0.1 1 0
In general, how can a function mapping df1 and df2 produce a new df3? For example, multiplication (but I am looking for a generalized solution where I can just pass a function).
def custFunc(d1,d2):
    return (d1 * d2) - d2

df1.apply(lambda x: custFunc(x,df2[0]),axis=1)
# df2[0] meaning it is explicitly the first row
and a df3 would be
m1 m2
x y x y z
0 0 5.5 1.3 16.0 0.0 9.0
0 5.5 1.4 15.0 0.0 4.0
1 0 6.0 1.5 14.0 0.0 5.0
1 2.0 1.5 12.0 0.0 9.0
If you need your function, pass it the DataFrame and a Series (selecting the row with DataFrame.loc); finally, replace the resulting missing values with the original values using DataFrame.fillna:
def custFunc(d1,d2):
    return (d1 * d2) - d2
df = custFunc(df1, df2.loc[0]).fillna(df1)
print (df)
m1 m2
x y x y z
0 0 4.5 1.1 15.0 0.0 9.0
0 4.5 1.2 14.0 0.0 4.0
1 0 5.0 1.3 13.0 0.0 5.0
1 1.0 1.3 11.0 0.0 9.0
Detail:
print (df2.loc[0])
m1 x 0.5
y 0.1
m2 x 1.0
y 0.0
Name: 0, dtype: float64
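If you prefer the row-wise apply form from the question, the same label alignment works per row; a minimal sketch with flat columns (hypothetical small data, but the MultiIndex case behaves the same way):

```python
import pandas as pd

def cust_func(d1, d2):
    return (d1 * d2) - d2

df1 = pd.DataFrame({'x': [10.0, 10.0], 'y': [12.0, 13.0]})
df2 = pd.DataFrame({'x': [0.5], 'y': [0.1]})

# each row of df1 is a Series that aligns with df2.loc[0] on the labels
df3 = df1.apply(lambda row: cust_func(row, df2.loc[0]), axis=1)
```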

Is there a way to fill missing values in multiple columns sharing part of their name with values from another column?

I am trying to fill NaN in multiple columns (too many for the solution to be hardcoded) that share part of their name with values from another column in the same pandas dataframe.
I know that I can fill multiple columns using a constant value and also that I can fill a single column using another from the same dataframe. It's the combination of these two that is not working for me.
For example, consider the data frame:
df = pd.DataFrame({'Val': [1.2,5.4,3.1,4], 'Col - 1': [None,5,1,None], 'Col - 2': [None,None,6,None]})
print(df)
Val Col - 1 Col - 2
0 1.2 NaN NaN
1 5.4 5.0 NaN
2 3.1 1.0 6.0
3 4.0 NaN NaN
Filling multiple columns with a constant value works:
df.loc[:,df.columns.str.contains('Col')] = df.loc[:,df.columns.str.contains('Col')].fillna(value=15)
print(df)
Val Col - 1 Col - 2
0 1.2 15.0 15.0
1 5.4 5.0 15.0
2 3.1 1.0 6.0
3 4.0 15.0 15.0
Filling a single column with values from another column also works:
df['Col - 2'] = df['Col - 2'].fillna(value=df['Val'])
print(df)
Val Col - 1 Col - 2
0 1.2 NaN 1.2
1 5.4 5.0 5.4
2 3.1 1.0 6.0
3 4.0 NaN 4.0
What doesn't work is a combination of the two:
df.loc[:,df.columns.str.contains('Col')] = df.loc[:,df.columns.str.contains('Col')].fillna(value=df['Val'])
The above does nothing and returns the original dataframe. What I am expecting is this:
Val Col - 1 Col - 2
0 1.2 1.2 1.2
1 5.4 5.0 5.4
2 3.1 1.0 6.0
3 4.0 4.0 4.0
You should add apply with a lambda. DataFrame.fillna aligns a Series filler against the column names; here the filler df['Val'] has an integer index that matches no column label, so fillna fills nothing. Applying column-wise turns each fill into Series.fillna, which aligns on the row index instead (remember to assign the result back):
df.loc[:,df.columns.str.contains('Col')] = df.loc[:,df.columns.str.contains('Col')].apply(lambda x : x.fillna(value=df['Val']))
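To see why the plain fillna call does nothing here: when the filler is a Series, DataFrame.fillna treats its index as column labels. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({'Val': [1.2, 5.4], 'Col - 1': [None, 5.0]})

# df['Val'] has index [0, 1]; no column is labeled 0 or 1, so nothing is filled
out = df[['Col - 1']].fillna(df['Val'])
print(out['Col - 1'].isna().sum())   # still one NaN left
```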
You can use df.filter() here:
m=df.filter(like='Col')
df[m.columns]=m.apply(lambda x: x.fillna(df.Val))
print(df)
Val Col - 1 Col - 2
0 1.2 1.2 1.2
1 5.4 5.0 5.4
2 3.1 1.0 6.0
3 4.0 4.0 4.0
Here's a way around the problem with np.where:
cols = [col for col in df.columns if 'Col' in col]
df[cols] = np.where(df[cols].isna(), df.Val.values[:,None], df[cols])
Output:
Val Col - 1 Col - 2
-- ----- --------- ---------
0 1.2 1.2 1.2
1 5.4 5 5.4
2 3.1 1 6
3 4 4 4
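Another option along the same lines: fillna also accepts a dict mapping column names to fillers, and a Series filler given per column aligns on the row index, as desired. A sketch:

```python
import pandas as pd

df = pd.DataFrame({'Val': [1.2, 5.4, 3.1, 4.0],
                   'Col - 1': [None, 5.0, 1.0, None],
                   'Col - 2': [None, None, 6.0, None]})

cols = df.columns[df.columns.str.contains('Col')]
df = df.fillna({c: df['Val'] for c in cols})   # per-column Series fillers
```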

Replacing empty values in a DataFrame with value of a column

Say I have the following pandas dataframe:
df = pd.DataFrame([[3, 2, np.nan, 0],
                   [5, 4, 2, np.nan],
                   [7, np.nan, np.nan, 5],
                   [9, 3, np.nan, 4]],
                  columns=list('ABCD'))
which returns this:
A B C D
0 3 2.0 NaN 0.0
1 5 4.0 2.0 NaN
2 7 NaN NaN 5.0
3 9 3.0 NaN 4.0
I'd like every np.nan to be replaced by the value in column A of the same row, so the result would be this:
A B C D
0 3 2.0 3.0 0.0
1 5 4.0 2.0 5.0
2 7 7.0 7.0 5.0
3 9 3.0 9.0 4.0
I've tried multiple things, but I could not get anything to work. Can anyone help?
A double transpose is necessary here:
cols = ['B','C', 'D']
df[cols] = df[cols].T.fillna(df['A']).T
print(df)
A B C D
0 3 2.0 3.0 0.0
1 5 4.0 2.0 5.0
2 7 7.0 7.0 5.0
3 9 3.0 9.0 4.0
because the direct fillna raises an error:
df[cols] = df[cols].fillna(df['A'], axis=1)
NotImplementedError: Currently only can fill with dict/Series column by column
Another solution with numpy.where and broadcasting column A:
df = pd.DataFrame(np.where(df.isnull(), df['A'].values[:, None], df),
                  index=df.index,
                  columns=df.columns)
print (df)
A B C D
0 3.0 2.0 3.0 0.0
1 5.0 4.0 2.0 5.0
2 7.0 7.0 7.0 5.0
3 9.0 3.0 9.0 4.0
Thanks to @pir for another solution:
df = pd.DataFrame(np.where(df.isnull(), df[['A']], df),
                  index=df.index,
                  columns=df.columns)
Currently, fillna doesn't allow for broadcasting a series across columns while aligning the indices.
pandas.DataFrame.mask
This works exactly like what we'd want fillna to do: it finds the nulls and fills them with df.A along axis=0.
df.mask(df.isna(), df.A, axis=0)
A B C D
0 3 2.0 3.0 0.0
1 5 4.0 2.0 5.0
2 7 7.0 7.0 5.0
3 9 3.0 9.0 4.0
pandas.DataFrame.fillna using a dictionary
However, you can pass a dictionary to fillna that tells it what to do for each column.
df.fillna({k: df.A for k in df})
A B C D
0 3 2.0 3.0 0.0
1 5 4.0 2.0 5.0
2 7 7.0 7.0 5.0
3 9 3.0 9.0 4.0
Or fillna with reindex:
df.fillna(df[['A']].reindex(columns=df.columns).ffill(axis=1))
Out[20]:
A B C D
0 3 2.0 3.0 0.0
1 5 4.0 2.0 5.0
2 7 7.0 7.0 5.0
3 9 3.0 9.0 4.0
Or combine_first:
df.combine_first(df.fillna(0).add(df.A, axis=0))
Out[35]:
A B C D
0 3 2.0 3.0 0.0
1 5 4.0 2.0 5.0
2 7 7.0 7.0 5.0
3 9 3.0 9.0 4.0
Or simply loop:
# for each column...
for col in df.columns:
    # select the NaNs and replace them with the value of A
    df.loc[df[col].isnull(), col] = df["A"]

How to fill a particular value with mean value of the column between first row and the corresponding row in pandas dataframe

I have a df like this,
A B C D E
1 2 3 0 2
2 0 7 1 1
3 4 0 3 0
0 0 3 4 3
I am trying to replace every 0 with the mean of the values between the first row and the row containing that 0, for the corresponding column.
My expected output is,
A B C D E
1.0 2.00 3.000000 0.0 2.0
2.0 1.00 7.000000 1.0 1.0
3.0 4.00 3.333333 3.0 1.0
1.5 1.75 3.000000 4.0 3.0
The main problem here: when a column contains multiple 0s, each replacement depends on previously replaced values, so a vectorized solution is genuinely hard to create:
def f(x):
    for i, v in enumerate(x):
        if v == 0:
            x.iloc[i] = x.iloc[:i+1].mean()
    return x
df1 = df.astype(float).apply(f)
print (df1)
A B C D E
0 1.0 2.00 3.000000 0.0 2.0
1 2.0 1.00 7.000000 1.0 1.0
2 3.0 4.00 3.333333 3.0 1.0
3 1.5 1.75 3.000000 4.0 3.0
Better solution:
# collect the indices of zero values into a helper DataFrame
a, b = np.where(df.values == 0)
df1 = pd.DataFrame({'rows':a, 'cols':b})
# for the first row there is no need to compute means
df1 = df1[df1['rows'] != 0]
print (df1)
rows cols
1 1 1
2 2 2
3 2 4
4 3 0
5 3 1
# loop over each row of the helper df and assign means
for i in df1.itertuples():
    df.iloc[i.rows, i.cols] = df.iloc[:i.rows+1, i.cols].mean()
print (df)
A B C D E
0 1.0 2.00 3.000000 0 2.0
1 2.0 1.00 7.000000 1 1.0
2 3.0 4.00 3.333333 3 1.0
3 1.5 1.75 3.000000 4 3.0
Another similar solution (iterating over all zero positions directly):
for i, j in zip(*np.where(df.values == 0)):
    df.iloc[i, j] = df.iloc[:i+1, j].mean()
print (df)
A B C D E
0 1.0 2.00 3.000000 0.0 2.0
1 2.0 1.00 7.000000 1.0 1.0
2 3.0 4.00 3.333333 3.0 1.0
3 1.5 1.75 3.000000 4.0 3.0
IIUC
def f(x):
    for z in range(x.size):
        if x[z] == 0: x[z] = np.mean(x[:z+1])
    return x

df.astype(float).apply(f)
A B C D E
0 1.0 2.00 3.000000 0.0 2.0
1 2.0 1.00 7.000000 1.0 1.0
2 3.0 4.00 3.333333 3.0 1.0
3 1.5 1.75 3.000000 4.0 3.0
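Note the subtlety these solutions share: the slice x.iloc[:i+1] includes the current 0 itself in the mean, and an earlier replacement feeds into later means. A minimal trace on column B:

```python
import pandas as pd

s = pd.Series([2.0, 0.0, 4.0, 0.0])
for i in range(len(s)):
    if s.iloc[i] == 0:
        # the slice includes the current 0 itself
        s.iloc[i] = s.iloc[:i + 1].mean()
# first 0 -> mean(2, 0) = 1.0; second 0 -> mean(2, 1, 4, 0) = 1.75
```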
