Pandas, editing dataframe - python

I have a df similar to the following:
   value  is_1  is_2  is_3
       5     0     1     0
       7     0     0     1
       4     1     0     0
(it is guaranteed that, in each row, the values of columns is_1 ... is_n sum to 1)
I need to get the following result:
   is_1  is_2  is_3
      0     5     0
      0     0     7
      4     0     0
(for each row, I should find the column is_k that is greater than 0 and fill it with the value from the "value" column)
What is the best way to achieve it?

I'd do it this way:
In [16]: df = df.mul(df.pop('value').values, axis=0)
In [17]: df
Out[17]:
   is_1  is_2  is_3
0     0     5     0
1     0     0     7
2     4     0     0
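For reference, df.pop('value') removes the value column and returns it, and mul(..., axis=0) multiplies each row of the remaining 0/1 indicator columns by that row's value, so the single 1 picks up the value and the zeros stay zero. A non-destructive variant that leaves df intact (a sketch; the is_ column naming is taken from the sample above):
import pandas as pd

df = pd.DataFrame({'value': [5, 7, 4],
                   'is_1': [0, 0, 1],
                   'is_2': [1, 0, 0],
                   'is_3': [0, 1, 0]})

# select only the indicator columns and scale each row by its value
result = df.filter(like='is_').mul(df['value'], axis=0)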

Related

Pandas: New column value based on the matching multi-level column's conditions

I have the following dataframe with multi-level columns
In [1]: data = {('A', '10'): [1, 3, 0, 1],
                ('A', '20'): [3, 2, 0, 0],
                ('A', '30'): [0, 0, 3, 0],
                ('B', '10'): [3, 0, 0, 0],
                ('B', '20'): [0, 5, 0, 0],
                ('B', '30'): [0, 0, 1, 0],
                ('C', '10'): [0, 0, 0, 2],
                ('C', '20'): [1, 0, 0, 0],
                ('C', '30'): [0, 0, 0, 0]}
        df = pd.DataFrame(data)
        df
Out[1]:
    A           B           C
   10  20  30  10  20  30  10  20  30
0   1   3   0   3   0   0   0   1   0
1   3   2   0   0   5   0   0   0   0
2   0   0   3   0   0   1   0   0   0
3   1   0   0   0   0   0   2   0   0
In a new column results I want to return the combined column names holding the maximum value within each first-level column group (A, B, C).
My desired output should look like the below:
Out[2]:
    A           B           C               results
   10  20  30  10  20  30  10  20  30
0   1   3   0   3   0   0   0   1   0  A20&B10&C20
1   3   2   0   0   5   0   0   0   0      A10&B20
2   0   0   3   0   0   1   0   0   0      A30&B30
3   1   0   0   0   0   0   2   0   0      A10&C10
For example, in the first row:
for column 'A' the max value is under column '20',
for column 'B' the only value is under '10', and
for column 'C' the only value is under '20',
so the result would be A20&B10&C20.
Edit: replaced "+" with "&" in the results column; apparently I was misunderstood and some of you thought I needed the summation, while I need the column names joined by a separator.
Edit2:
The solution provided by @A.B below didn't work for me for some reason, although it works on his side and on the sample data in Google Colab.
Somehow the .idxmax(skipna=True) step raises ValueError: No axis named 1 for object type Series.
I found a workaround: transpose the data before this step, then transpose it back afterwards.
map_res = lambda x: ",".join(filter(None, ['' if isinstance(x[a], float) else (x[a][0] + x[a][1]) for a in x.keys()]))
df['results'] = (df.replace(0, np.nan)
                   .T                      # transpose here
                   .groupby(level=0)       # axis=1 removed from here
                   .idxmax(skipna=True)
                   .T                      # transpose back here
                   .apply(map_res, axis=1))
I am still interested to know why it does not work without the transpose, though.
The idea is to replace 0 with NaN, so that DataFrame.stack removes all the NaN entries. Then get the indices of the maxima with DataFrameGroupBy.idxmax, map the second and third tuple values with map, and aggregate with join into a new column per the first index level:
df['results'] = (df.replace(0, np.nan)
                   .stack([0, 1])
                   .groupby(level=[0, 1])
                   .idxmax()
                   .map(lambda x: f'{x[1]}{x[2]}')
                   .groupby(level=0)
                   .agg('&'.join))
print (df)
    A           B           C               results
   10  20  30  10  20  30  10  20  30
0   1   3   0   3   0   0   0   1   0  A20&B10&C20
1   3   2   0   0   5   0   0   0   0      A10&B20
2   0   0   3   0   0   1   0   0   0      A30&B30
3   1   0   0   0   0   0   2   0   0      A10&C10
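To see the intermediate shapes (a sketch, not part of the original answer):
s = df.replace(0, np.nan).stack([0, 1])   # Series indexed by (row, letter, number)
ix = s.groupby(level=[0, 1]).idxmax()     # e.g. group (0, 'A') -> (0, 'A', '20')
# mapping x -> f'{x[1]}{x[2]}' then turns (0, 'A', '20') into 'A20',
# and the final groupby(level=0).agg('&'.join) joins the labels per row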
Try (note: this follows the original, pre-edit reading of the question and sums the per-group maxima):
df["results"] = df.groupby(level=0, axis=1).max().sum(1)
print(df)
Prints:
    A           B           C           results
   10  20  30  10  20  30  10  20  30
0   1   3   0   3   0   0   0   1   0        7
1   3   2   0   0   5   0   0   0   0        8
2   0   0   3   0   0   1   0   0   0        4
3   1   0   0   0   0   0   2   0   0        3
Group by level 0 with axis=1.
Use idxmax to get the max sub-level indexes as tuples (while skipping NaNs).
Apply a function to rows (axis=1) to concatenate the names.
In the function that you apply to rows, iterate over the keys/columns and concatenate the column levels; NaNs (which have type float) are replaced with an empty string and filtered out later.
You won't need df.replace(0, np.nan) if you initially have NaNs and let them remain.
map_res = lambda x: ",".join(filter(None, ['' if isinstance(x[a], float) else (x[a][0] + x[a][1]) for a in x.keys()]))
df['results'] = (df.replace(0, np.nan)
                   .groupby(level=0, axis=1)
                   .idxmax(skipna=True)
                   .apply(map_res, axis=1))
Here's the output:
    A           B           C               results
   10  20  30  10  20  30  10  20  30
0   1   3   0   3   0   0   0   1   0  A20,B10,C20
1   3   2   0   0   5   0   0   0   0      A10,B20
2   0   0   3   0   0   1   0   0   0      A30,B30
3   1   0   0   0   0   0   2   0   0      A10,C10

Pandas merge columns with similar prefixes

I have a pandas dataframe with binary columns that looks like this:
   DEM_HEALTH_PRIV  DEM_HEALTH_PRE  DEM_HEALTH_HOS  DEM_HEALTH_OUT
0                0               1               0               0
1                0               0               1               1
I want to take the suffix of each variable and convert the binary variables into one categorical variable that corresponds to the prefix. For example, merge all DEM_HEALTH variables into a list of "PRE", "HOS", "OTH", etc., wherever the value of the column equals 1.
Output:
   DEM_HEALTH
0         ['PRE']
1  ['HOS', 'OUT']
Any help would be much appreciated!
Try this -
# original dataframe is called df
new_cols = [tuple(i.rsplit('_', 1)) for i in df.columns]
new_cols = pd.MultiIndex.from_tuples(new_cols)
df.columns = new_cols
data = df[df == 1]\
    .stack()\
    .reset_index(-1)\
    .groupby(level=0)['level_1']\
    .apply(list)
Explanation
IIUC your data looks something like the following
print(df)
   DEM_HEALTH_PRIV  DEM_HEALTH_OUT  DEM_HEALTH_PRE  DEM_HEALTH_HOS
0                0               1               1               1
1                0               1               0               0
2                0               0               1               0
3                0               1               0               0
4                1               0               0               0
5                0               0               1               1
6                1               0               1               0
7                1               0               0               1
8                0               1               0               0
9                0               1               1               0
1. Create multi-index by rsplit
First step is to rsplit (reverse split) the columns on the last occurrence of the "_" substring, then create a MultiIndex: DEM_HEALTH is level 0 and PRIV, OUT, PRE, HOS are level 1.
new_cols = [tuple(i.rsplit('_', 1)) for i in df.columns]
new_cols = pd.MultiIndex.from_tuples(new_cols)
df.columns = new_cols
print(df)
  DEM_HEALTH
        PRIV OUT PRE HOS
0          0   1   1   1
1          0   1   0   0
2          0   0   1   0
3          0   1   0   0
4          1   0   0   0
5          0   0   1   1
6          1   0   1   0
7          1   0   0   1
8          0   1   0   0
9          0   1   1   0
2. Stack and Groupby over level=0
data = df[df == 1]\
    .stack()\
    .reset_index(-1)\
    .groupby(level=0)['level_1']\
    .apply(list)
0 [HOS, OUT, PRE]
1 [OUT]
2 [PRE]
3 [OUT]
4 [PRIV]
5 [HOS, PRE]
6 [PRE, PRIV]
7 [HOS, PRIV]
8 [OUT]
9 [OUT, PRE]
Name: level_1, dtype: object
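A compact alternative that skips the MultiIndex entirely (my own sketch, not from the answer above): map each column to its suffix and collect the suffixes wherever the row holds a 1.
suffixes = [c.rsplit('_', 1)[-1] for c in df.columns]
data = df.apply(lambda row: [s for s, v in zip(suffixes, row) if v == 1], axis=1)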

Delete rows that are all zeros in every column except a single non-zero column in a pandas DF

I have a pandas Df with 2 million rows x 10 columns.
I want to delete the rows that contain only zero elements in all columns except the single column with non-zero elements (Time).
E.g., my Df looks like:
Índex  Time  a  b  c  d  e
0         1  0  0  0  0  0
1         2  1  2  0  0  0
2         3  0  0  0  0  0
3         4  5  0  0  0  0
4         5  0  0  0  0  0
5         6  7  0  0  0  0
What I needed:
Índex  Time  a  b  c  d  e
0         2  1  2  0  0  0
1         4  5  0  0  0  0
2         6  7  0  0  0  0
My Requirement:
Requirement 1:
Leaving out the 1st column (Time), it should check for zero elements in every row. If all the remaining column values are zero, delete that row.
Requirement 2:
Finally, I want my index to be updated properly.
What I tried:
I have been looking at this link.
I understood the logic used but I wasn't able to reproduce the result for my requirement.
I hope there is a simple method to do the operation...
Use iloc to select all columns except the first, compare for not equal with ne, test for at least one True per row with any, filter by boolean indexing, and last reset_index:
df = df[df.iloc[:, 1:].ne(0).any(axis=1)].reset_index(drop=True)
Alternative with remove column Time:
df = df[df.drop('Time', axis=1).ne(0).any(axis=1)].reset_index(drop=True)
print (df)
   Time  a  b  c  d  e
0     2  1  2  0  0  0
1     4  5  0  0  0  0
2     6  7  0  0  0  0
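A self-contained run on the sample data (a sketch; the values are copied from the question):
import pandas as pd

df = pd.DataFrame({'Time': [1, 2, 3, 4, 5, 6],
                   'a': [0, 1, 0, 5, 0, 7],
                   'b': [0, 2, 0, 0, 0, 0],
                   'c': [0, 0, 0, 0, 0, 0],
                   'd': [0, 0, 0, 0, 0, 0],
                   'e': [0, 0, 0, 0, 0, 0]})

# keep rows where any column after Time is non-zero, then renumber the index
df = df[df.iloc[:, 1:].ne(0).any(axis=1)].reset_index(drop=True)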

Replace two rows in a Pandas DataFrame with five rows? [duplicate]

I want to start with an empty data frame and then add one row to it each time.
I can even start with a zero data frame, data = pd.DataFrame(np.zeros(shape=(10, 2)), columns=["a", "b"]), and then replace one line each time.
How can I do that?
Use .loc for label-based selection. It is important that you understand how to slice properly (http://pandas.pydata.org/pandas-docs/stable/indexing.html#selection-by-label) and why you should avoid chained assignment (http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy).
In [14]:
data=pd.DataFrame(np.zeros(shape=(10,2)),columns=["a","b"])
data
Out[14]:
   a  b
0  0  0
1  0  0
2  0  0
3  0  0
4  0  0
5  0  0
6  0  0
7  0  0
8  0  0
9  0  0
[10 rows x 2 columns]
In [15]:
data.loc[2:2, 'a':'b'] = 5, 6
data
Out[15]:
   a  b
0  0  0
1  0  0
2  5  6
3  0  0
4  0  0
5  0  0
6  0  0
7  0  0
8  0  0
9  0  0
[10 rows x 2 columns]
If you are replacing the entire row, then you can just use an index and don't need row/column slices.
...
data.loc[2]=5,6
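As for starting empty and adding one row each time, .loc also enlarges the frame when the label doesn't exist yet (a sketch; note that repeated enlargement is slow for large frames, so prefer building from a list of rows when possible):
import pandas as pd

data = pd.DataFrame(columns=["a", "b"])
for i, row in enumerate([(1, 2), (3, 4), (5, 6)]):
    data.loc[i] = row    # setting with enlargement appends a new row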

Python Convert data to pivot

I am trying to convert a data set with 100,000 rows and 3 columns into a pivot table. While the following code runs without an error, the values are displayed as NaN.
df1 = pd.pivot_table(df_TEST, values='actions', index=['sku'], columns=['user'])
It is not picking up the values (which range from 1 to 36) from the DataFrame. Has anyone come across this situation?
This can happen when you are doing a pivot, since not all combinations of index and column values might be present, e.g.:
In [10]: df_TEST
Out[10]:
   a  b  c
0  0  0  0
1  0  1  0
2  0  2  0
3  1  1  1
4  1  2  3
5  1  4  5
Now, when you do pivot on this,
In [9]: df_TEST.pivot_table(index='a', values='c', columns='b')
Out[9]:
b    0  1  2    4
a
0    0  0  0  NaN
1  NaN  1  3    5
Note that you got NaN at index 0 and column 4, since there is no entry in df_TEST with column a = 0 and column b = 4.
Typically you fill such values with zeros.
In [11]: df_TEST.pivot_table(index='a', values='c', columns='b').fillna(0)
Out[11]:
b  0  1  2  4
a
0  0  0  0  0
1  0  1  3  5
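Alternatively, pivot_table accepts a fill_value argument, which avoids the separate fillna step:
In [12]: df_TEST.pivot_table(index='a', values='c', columns='b', fill_value=0)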
