Selecting all column names where value is greater than 0 - python

I have an undefined number of columns with some values. For example, let's say there are 4 columns, [a, b, c, d], and there is a value associated with each column name, like this:
a   b   c   d
0   23  11  0
11  43  33  22
12  0   12  0
I want to write another column right next to d that lists the names of all columns whose value is greater than 0, like this:
a   b   c   d   e
0   23  11  0   b,c
11  43  33  22  a,b,c,d
12  0   12  0   a,c
My attempt:
import pandas as pd

dic2 = {'a': [12, 0, 23], 'b': [21, 23, 0], 'c': [0, 22, 33], 'd': [0, 22, 0]}
df = pd.DataFrame(dic2)
df[df > 0]
This returns NaN wherever there is a zero, but I don't know how to get the column names from this masked result.

You can compare the values against 0 to get a boolean DataFrame, then use DataFrame.dot for matrix multiplication with the column names, and finally remove the trailing separator by indexing with str:
df['e'] = df.gt(0).dot(df.columns + ',').str[:-1]
print (df)
a b c d e
0 12 21 0 0 a,b
1 0 23 22 22 b,c,d
2 23 0 33 0 a,c
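As a cross-check, here is a more explicit (if slower) sketch of the same idea, building the string per row with a list comprehension; it assumes the dic2 frame from the question:
import pandas as pd

df = pd.DataFrame({'a': [12, 0, 23], 'b': [21, 23, 0],
                   'c': [0, 22, 33], 'd': [0, 22, 0]})

# for each row, join the names of the columns whose value is > 0
df['e'] = [','.join(row.index[row > 0]) for _, row in df.iterrows()]
print(df)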

You can create a new column and use the max function on all the other columns:
df['D'] = df.max(axis=1)
This checks all columns. If you want the max only over specific columns, select them first:
df['D'] = df[column].max(axis=1)
or with a list of columns:
df['D'] = df[[column1, column2]].max(axis=1)
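For instance, with the frame from the question:
import pandas as pd

df = pd.DataFrame({'a': [12, 0, 23], 'b': [21, 23, 0],
                   'c': [0, 22, 33], 'd': [0, 22, 0]})

df['D'] = df.max(axis=1)                 # max over all columns
df['D_ab'] = df[['a', 'b']].max(axis=1)  # max over a subset only
print(df)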

Related

Pandas: Groupby.transform -> assign specific values to column

In general terms, is there a way to assign specific values to a column via groupby.transform(), where the groupby size is known in advance?
For example:
df = pd.DataFrame(data = {'A':[10,10,20,20],'B':['abc','def','ghi','jkl'],'GroupID':[1,1,2,2]})
funcDict = {'A':'sum','B':['specific_val_1', 'specific_val_2']}
df = df.groupby('GroupID').transform(funcDict)
where the result would be:
index  A   B
1      20  specific_val_1
2      20  specific_val_2
3      40  specific_val_1
4      40  specific_val_2
transform cannot accept a dict, so we can do agg with merge instead:
out = df.groupby('GroupID',as_index=False)[['A']].sum()
out = out.merge(pd.DataFrame({'B':['specific_val_1', 'specific_val_2']}),how='cross')
Out[90]:
GroupID A B
0 1 20 specific_val_1
1 1 20 specific_val_2
2 2 40 specific_val_1
3 2 40 specific_val_2
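If you instead want to keep the original row count, here is a sketch under the assumption that every group has exactly two rows that receive specific_val_1 and specific_val_2 in order:
import pandas as pd

df = pd.DataFrame({'A': [10, 10, 20, 20],
                   'B': ['abc', 'def', 'ghi', 'jkl'],
                   'GroupID': [1, 1, 2, 2]})

out = df.copy()
out['A'] = out.groupby('GroupID')['A'].transform('sum')           # per-group sums, broadcast back
out['B'] = ['specific_val_1', 'specific_val_2'] * (len(out) // 2) # fixed values, cycling per group
print(out)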

How to create a new column based on a condition in another column

In pandas, how can I create a new column B based on a column A in df, such that:
B=1 if A_(i+1) - A_(i) > 5 or A_(i) <= 10
B=0 if A_(i+1) - A_(i) <= 5
However, the first B_i value is always one.
Example:
A   B
5   1 (the first B_i)
12  1
14  0
22  1
20  0
33  1
Use diff with a comparison to your threshold and a conversion from boolean to int, using le:
N = 5
df['B'] = (~df['A'].diff().le(N)).astype(int)
NB. using a le(N) comparison with inversion yields 1 for the first value (the NaN produced by diff compares as False, so the inversion turns it into True).
output:
A B
0 5 1
1 12 1
2 14 0
3 22 1
4 20 0
5 33 1
Updated answer: simply combine a second condition with OR (|):
df['B'] = (~df['A'].diff().le(5)|df['A'].lt(10)).astype(int)
output: same as above with the provided data
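To see why the inversion handles the first row, it can help to inspect the intermediate steps; a small sketch on the question's data:
import pandas as pd

df = pd.DataFrame({'A': [5, 12, 14, 22, 20, 33]})

d = df['A'].diff()   # NaN, 7, 2, 8, -2, 13 - the first value is NaN
print(d.le(5))       # comparisons with NaN are False, so row 0 is False
print(~d.le(5))      # the inversion turns row 0 into True -> B = 1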
I was a little confused by your row numbering, because we should have a missing value in the last row instead of the first if we compute B_i from the condition A_(i+1)-A_(i) (the first row has both A_(i) and A_(i+1), while the last row is missing A_(i+1)).
Anyway, based on your example I assumed that we calculate B_(i+1).
import pandas as pd

df = pd.DataFrame(columns=["A"], data=[5, 12, 14, 22, 20, 33])
df['shifted_A'] = df['A'].shift(1)  # This row can be removed - it was added only to show how shift works on the final dataframe
df['B'] = ''
df.loc[((df['A'] - df['A'].shift(1)) > 5) + (df['A'].shift(1) <= 10), 'B'] = 1  # Update rows that fulfill one of the conditions with 1
df.loc[(df['A'] - df['A'].shift(1)) <= 5, 'B'] = 0  # Update rows that fulfill the condition with 0
df.loc[df.index == 0, 'B'] = 1  # Update the first row of column B
print(df)
That prints:
A shifted_A B
0 5 NaN 1
1 12 5.0 1
2 14 12.0 0
3 22 14.0 1
4 20 22.0 0
5 33 20.0 1
I am not sure if it is the fastest way, but I think it is one of the easier ones to understand.
A little explanation:
df.loc[mask, columnname] = newvalue allows us to update values in the given column where the condition (mask) is fulfilled.
((df['A'] - df['A'].shift(1)) > 5) + (df['A'].shift(1) <= 10)
Each condition here returns True or False. If we add them, the result is True if either one is True (which is simply OR). If we need AND instead, we can multiply the conditions.
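A quick demonstration of addition as OR and multiplication as AND on boolean Series:
import pandas as pd

m1 = pd.Series([True, False, False])
m2 = pd.Series([True, True, False])

print(m1 + m2)  # elementwise OR:  True, True, False
print(m1 * m2)  # elementwise AND: True, False, False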
Use Series.diff, replace the first missing value with N so it passes the comparison, then compare for greater or equal with Series.ge:
N = 5
df['B'] = (df.A.diff().fillna(N).ge(N) | df.A.lt(10)).astype(int)
print (df)
A B
0 5 1
1 12 1
2 14 0
3 22 1
4 20 0
5 33 1
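Here too, the intermediate steps clarify the fillna trick; a small sketch:
import pandas as pd

df = pd.DataFrame({'A': [5, 12, 14, 22, 20, 33]})

N = 5
d = df['A'].diff()        # NaN, 7, 2, 8, -2, 13
print(d.fillna(N))        # the first NaN becomes N, so ge(N) holds for row 0
print(d.fillna(N).ge(N))  # True, True, False, True, False, True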

Customized multi-level sorting based on max values in Pandas

I have a dataframe df1:
Label  Value
A      -1
B      15
C      5
B      -5
C      30
D      20
D      11
I need to sort the dataframe so that it is ordered by the max value for a given Label.
So df2:
Label  Value
C      30
C      5
D      20
D      11
B      15
B      -5
A      -1
I could create another column holding the max for each label and then sort by it (and by Value), but that method seems a bit slow. Is there a faster/more efficient way to do this?
Use the key parameter, mapping each label to its aggregated max from Series s:
s = df.groupby('Label')['Value'].max()
df = df.sort_values('Label', key=lambda x: x.map(s), ascending=False)
print (df)
Label Value
2 C 5
4 C 30
5 D 20
6 D 11
1 B 15
3 B -5
0 A -1
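Note that values within each Label keep their original order (the sort is stable and the key only ranks labels). If you also want values descending within each label, as in the desired df2, one sketch is to sort by both the mapped max and Value:
import pandas as pd

df = pd.DataFrame({'Label': ['A', 'B', 'C', 'B', 'C', 'D', 'D'],
                   'Value': [-1, 15, 5, -5, 30, 20, 11]})

s = df.groupby('Label')['Value'].max()
out = (df.assign(key=df['Label'].map(s))  # temporary column with each label's max
         .sort_values(['key', 'Value'], ascending=False)
         .drop(columns='key'))
print(out)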

Sort Dataframe by Descending Rows AND Columns at the Same Time

I currently have a dataframe of countries by series, with values ranging from 0 to 25.
I want to sort the df so that the highest values appear in the top left (first), while the lowest appear in the bottom right (last).
FROM
A B C D ...
USA 4 0 10 16
CHN 2 3 13 22
UK 2 1 8 14
...
TO
D C A B ...
CHN 22 13 2 3
USA 16 10 4 0
UK 14 8 2 1
...
In this, the column with the highest values is now first, and the same is true with the index.
I have considered reindexing, but this loses the 'Countries' Index.
D C A B ...
0 22 13 2 3
1 16 10 4 0
2 14 8 2 1
...
I have thought about creating a new column and row holding the mean or sum of values for the respective column/row, but is this the most efficient way?
How would I then sort the DF after I have the new rows/columns?
Is there a way to reindex using...
df_mv.reindex(df_mv.mean(or sum)().sort_values(ascending = False).index, axis=1)
... that would allow me to keep the country index, and simply sort it accordingly?
Thanks for any and all advice or assistance.
EDIT
Intended result organizes columns AND rows from largest to smallest.
Regarding the first row of the A and B columns in the intended output, these are supposed to be 2, 3 respectively. This is because the intended result interprets the A column as greater than the B column in both sum and mean (even though either sum or mean can be considered for the 'value' of a row/column).
By saying the higher numbers would be in the top left, while the lower ones would be in the bottom right, I simply meant this as a general trend for the resulting df. It is the columns and rows as whole however, that are the intended focus. I apologize for the confusion.
You could use:
rows_index=df.max(axis=1).sort_values(ascending=False).index
col_index=df.max().sort_values(ascending=False).index
new_df=df.loc[rows_index,col_index]
print(new_df)
D C A B
CHN 22 13 2 3
USA 16 10 4 0
UK 14 8 2 1
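Per the EDIT, if you would rather rank rows and columns by their sum or mean than by their max, the same pattern works; a sketch with sum, using df from the question (it yields the same ordering on this data):
rows_index = df.sum(axis=1).sort_values(ascending=False).index
col_index = df.sum().sort_values(ascending=False).index
new_df = df.loc[rows_index, col_index]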
Use .T to transpose rows to columns and vice versa:
df = df.sort_values(df.max().idxmax(), ascending=False)
df = df.T
df = df.sort_values(df.columns[0], ascending=False).T
Result:
>>> df
D C B A
CHN 22 13 3 2
USA 16 10 0 4
UK 14 8 1 2
Here's another way, this time without transposing but using axis=1 as an argument:
df = df.sort_values(df.max().idxmax(), ascending=False)
df = df.sort_values(df.index[0], axis=1, ascending=False)
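A runnable sketch of this variant on the question's data:
import pandas as pd

df = pd.DataFrame({'A': [4, 2, 2], 'B': [0, 3, 1],
                   'C': [10, 13, 8], 'D': [16, 22, 14]},
                  index=['USA', 'CHN', 'UK'])

# sort rows by the column containing the overall maximum (here D), descending
df = df.sort_values(df.max().idxmax(), ascending=False)
# then sort columns by the first (now largest) row, descending
df = df.sort_values(df.index[0], axis=1, ascending=False)
print(df)  # columns come out D, C, B, A; rows CHN, USA, UK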
Using numpy:
import numpy as np

arr = df.to_numpy()
order = np.max(arr, axis=1).argsort()[::-1]  # row order by max, descending
arr = arr[order, :]
arr = np.sort(arr, axis=1)[:, ::-1]          # sort each row descending
# note: after per-row sorting the original column labels no longer correspond to the values
df1 = pd.DataFrame(arr, index=df.index[order], columns=df.columns)
print(df1)
Output:
      A   B  C  D
CHN  22  13  3  2
USA  16  10  4  0
UK   14   8  2  1

Pandas Multiindex get values from first entry of index

I have the following multiindex dataframe:
from io import StringIO
import pandas as pd
datastring = StringIO("""File,no,runtime,value1,value2
A,0, 0,12,34
A,0, 1,13,34
A,0, 2,23,34
A,1, 6,23,38
A,1, 7,22,38
B,0,17,15,35
B,0,18,17,35
C,0,34,23,32
C,0,35,21,32
""")
df = pd.read_csv(datastring, sep=',')
df.set_index(['File','no',df.index], inplace=True)
>>> df
runtime value1 value2
File no
A 0 0 0 12 34
1 1 13 34
2 2 23 34
1 3 6 23 38
4 7 22 38
B 0 5 17 15 35
6 18 17 35
C 0 7 34 23 32
8 35 21 32
What I would like to get is just the first value2 for every combination of File and no:
A 0 34
A 1 38
B 0 35
C 0 32
The most similar questions I could find were these:
Resample pandas dataframe only knowing result measurement count
MultiIndex-based indexing in pandas
Select rows in pandas MultiIndex DataFrame
but I was unable to construct a solution from them. The best I got was slicing with pd.IndexSlice, but as the values are technically still there (just not displayed), the expression
idx = pd.IndexSlice
df.loc[idx[:,0],:]
can, for example, filter for the 0 value of no, but still returns the entire rest of those groups.
Is a multiindex even the right tool for the task at hand? How to solve this?
Use GroupBy.first on the first and second levels of the MultiIndex:
s = df.groupby(level=[0,1])['value2'].first()
print (s)
File no
A 0 34
1 38
B 0 35
C 0 32
Name: value2, dtype: int64
If you need a one-column DataFrame, use a one-element list:
df1 = df.groupby(level=[0,1])[['value2']].first()
print (df1)
value2
File no
A 0 34
1 38
B 0 35
C 0 32
Another idea is to remove the 3rd level with DataFrame.reset_index and then drop duplicated index entries with Index.duplicated and boolean indexing:
df2 = df.reset_index(level=2, drop=True)
s = df2.loc[~df2.index.duplicated(), 'value2']
print (s)
File no
A 0 34
1 38
B 0 35
C 0 32
Name: value2, dtype: int64
For the sake of completeness, I would like to add another method (which I would not have found without the answer by jezrael):
s = df.groupby(level=[0,1])['value2'].nth(0)
This can be generalized to finding the nth entry, not merely the first:
t = df.groupby(level=[0,1])['value1'].nth(1)
Note that the selection was changed from value2 to value1 because, for the former, the results of nth(0) and nth(1) would have been identical.
Pandas documentation link: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.groupby.GroupBy.nth.html
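A runnable sketch of the nth approach on the frame built in the question; note that the shape of the result depends on the pandas version (before 2.0, nth returned one row per group indexed by the group keys; from 2.0 on, it acts as a filter and keeps the original index):
# assumes df is the MultiIndex frame constructed above
s = df.groupby(level=[0, 1])['value2'].nth(0)  # first entry per (File, no) group
t = df.groupby(level=[0, 1])['value1'].nth(1)  # second entry per (File, no) group
print(s)
print(t)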
