I have a DataFrame:
df.head()
Index Value
0 1.0,1.0,1.0,1.0
1 1.0,1.0
2 1.0,1.0
3 3.0,3.0,3.0,3.0,3.0,3.0,4.0,4.0
4 4
I'd like to count the occurrences of values in the Value column:
Index Value 1 2 3 4
0 1.0,1.0,1.0,1.0 4 0 0 0
1 1.0,1.0 2 0 0 0
2 1.0,1.0 2 0 0 0
3 3.0,3.0,3.0,3.0,3.0,3.0,4.0,4.0 0 0 6 2
4 4 0 0 0 1
I've done this before with string values using Counter, but I found that you can't use it with floats?
df_counts = df['Value'].apply(lambda x: pd.Series(Counter(x.split(',')))).fillna(0).astype(int)
Use map to convert the split values to floats, then cast the counts to integers and the column labels to ints:
from collections import Counter
import pandas as pd

df_counts = (df['Value'].apply(lambda x: pd.Series(Counter(map(float, x.split(',')))))
                        .fillna(0)
                        .astype(int)
                        .rename(columns=int))
print(df_counts)
1 3 4
0 4 0 0
1 2 0 0
2 2 0 0
3 0 6 2
4 0 0 1
Finally, if necessary, add all the missing categories with reindex and join back to the original:
import numpy as np

# cover every integer between the smallest and largest observed value
cols = np.arange(df_counts.columns.min(), df_counts.columns.max() + 1)
df = df.join(df_counts.reindex(columns=cols, fill_value=0))
print(df)
Value 1 2 3 4
Index
0 1.0,1.0,1.0,1.0 4 0 0 0
1 1.0,1.0 2 0 0 0
2 1.0,1.0 2 0 0 0
3 3.0,3.0,3.0,3.0,3.0,3.0,4.0,4.0 0 0 6 2
4 4 0 0 0 1
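A Counter-free alternative sketch using explode and crosstab (the intermediate name s is mine; this assumes Value holds comma-separated strings as above):
import pandas as pd

# one row per individual value, cast to int via float
s = df['Value'].str.split(',').explode().astype(float).astype(int)
# tabulate value counts per original row, then reindex/join as above
df_counts = pd.crosstab(s.index, s)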
I have a problem in which I want to take Table 1 and turn it into Table 2 using Python.
Does anybody have any ideas? I've tried to split the Value column from Table 1, but ran into issues because each value has a different length, so I can't always define how much to split it.
Equally, I have not been able to think through how to create a new column that records the position of each character within the string.
Table 1, before:
ID Value
1  000000S
2  000FY
Table 2, after:
ID Position Value
1  1  0
1  2  0
1  3  0
1  4  0
1  5  0
1  6  0
1  7  S
2  1  0
2  2  0
2  3  0
2  4  F
2  5  Y
You can split the string into individual characters and explode:
out = (df
.assign(Value=df['Value'].apply(list))
.explode('Value')
)
output:
ID Value
0 1 0
0 1 0
0 1 0
0 1 0
0 1 0
0 1 0
0 1 S
1 2 0
1 2 0
1 2 0
1 2 F
1 2 Y
Given:
ID Value
0 1 000000S
1 2 000FY
Doing:
df.Value = df.Value.apply(list)
df = df.explode('Value')
df['Position'] = df.groupby('ID').cumcount() + 1
Output:
ID Value Position
0 1 0 1
0 1 0 2
0 1 0 3
0 1 0 4
0 1 0 5
0 1 0 6
0 1 S 7
1 2 0 1
1 2 0 2
1 2 0 3
1 2 F 4
1 2 Y 5
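Putting both steps together, a self-contained sketch that also reorders the columns to match Table 2 exactly (the name out is mine):
import pandas as pd

df = pd.DataFrame({'ID': [1, 2], 'Value': ['000000S', '000FY']})
df['Value'] = df['Value'].map(list)           # split into single characters
df = df.explode('Value')                      # one row per character
df['Position'] = df.groupby('ID').cumcount() + 1
out = df.reset_index(drop=True)[['ID', 'Position', 'Value']]
print(out)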
I have the following dataframe:
name code 1 2 3 4 5 6 7 .........155 days
0 Lari EH214 0 5 2 1 0 0 0 0 3
1 Suzi FK362 0 0 0 0 2 3 0 0 108
2 Jil LM121 0 0 4 2 1 0 0 0 5
...
I want to sum the columns from column 1 up to the column whose number appears in "days". For example:
for row 1, days is 3, so I sum 0+5+2,
for row 2, days is 108,
for row 3, days is 5, so I sum 0+4+2+1+0.
How can I do something like this? I'm looking for a method.
For a vectorized solution, select the value columns by position first, build a mask by comparing the column numbers against days with numpy broadcasting, replace the non-matching cells with 0 in DataFrame.where, and finally sum:
df1 = df.iloc[:, 2:-1]                    # value columns sit between 'code' and 'days'
m = df1.columns.astype(int).to_numpy() <= df['days'].to_numpy()[:, None]
df['sum'] = df1.where(m, 0).sum(axis=1)   # zero outside each row's window, then sum
print(df)
name code 1 2 3 4 5 6 7 155 days sum
0 Lari EH214 0 5 2 1 0 0 0 0 3 7
1 Suzi FK362 0 0 0 0 2 3 0 0 108 5
2 Jil LM121 0 0 4 2 1 0 0 0 5 7
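For reference, a minimal construction of a sample frame to try the above on (a sketch: it only includes the day-columns visible in the question, whereas the original data has every column from 1 to 155):
import pandas as pd

df = pd.DataFrame(
    {'name': ['Lari', 'Suzi', 'Jil'],
     'code': ['EH214', 'FK362', 'LM121'],
     1: [0, 0, 0], 2: [5, 0, 0], 3: [2, 0, 4], 4: [1, 0, 2],
     5: [0, 2, 1], 6: [0, 3, 0], 7: [0, 0, 0], 155: [0, 0, 0],
     'days': [3, 108, 5]})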
IIUC, use:
df['sum'] = df.apply(lambda r: r.loc[1: r['days']].sum(), axis=1)
or, if the column names are strings:
df['sum'] = df.apply(lambda r: r.loc['1': str(r['days'])].sum(), axis=1)
output:
name code 1 2 3 4 5 6 7 155 days sum
0 Lari EH214 0 5 2 1 0 0 0 0 3 7
1 Suzi FK362 0 0 0 0 2 3 0 0 108 5
2 Jil LM121 0 0 4 2 1 0 0 0 5 7
I have a DataFrame like this (with labels on rows and columns):
0 1 2 3
0 1 1 0 0
1 0 1 1 0
2 1 0 1 0
-1 5 6 3 2
I would like to order the columns according to the last row (and then drop the row):
0 1 2 3
0 1 1 0 0
1 1 0 1 0
2 0 1 1 0
Try np.argsort to get the column order from the last row, then iloc to rearrange the columns and drop that row:
import numpy as np

# negate the last row so argsort gives a descending column order
df.iloc[:-1, np.argsort(-df.iloc[-1])]
Output:
1 0 2 3
0 1 1 0 0
1 1 0 1 0
2 0 1 1 0
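An equivalent sketch without numpy, using sort_values along axis=1 (assuming the last row is labeled -1, as in the example; the name out is mine):
import pandas as pd

df = pd.DataFrame([[1, 1, 0, 0], [0, 1, 1, 0], [1, 0, 1, 0], [5, 6, 3, 2]],
                  index=[0, 1, 2, -1])
# sort the columns by the values in row -1, descending, then drop that row
out = df.sort_values(by=-1, axis=1, ascending=False).drop(index=-1)
print(out)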
In a pandas DataFrame, the one-hot encoded vectors are present as columns, i.e.:
Rows A B C D E
0 0 0 0 1 0
1 0 0 1 0 0
2 0 1 0 0 0
3 0 0 0 1 0
4 1 0 0 0 0
5 0 0 0 0 1
How can I convert these columns into one DataFrame column by label encoding them in Python? i.e.:
Rows A
0 4
1 3
2 2
3 4
4 1
5 5
I also need a suggestion on rows that have multiple 1s: how should those rows be handled, given that we can have only one category at a time?
Try with argmax:
# df = df.set_index('Rows')   # if Rows is still a regular column
df['New'] = df.values.argmax(1) + 1
df
Out[231]:
A B C D E New
Rows
0 0 0 0 1 0 4
1 0 0 1 0 0 3
2 0 1 0 0 0 2
3 0 0 0 1 0 4
4 1 0 0 0 0 1
5 0 0 0 0 1 5
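Note that values.argmax(1) returns the position of the first maximum, so a row containing several 1s silently maps to its first 1; the powers-of-2 answer further down handles such rows explicitly.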
argmax is the way to go; adding another way using idxmax and get_indexer:
df['New'] = df.columns.get_indexer(df.idxmax(1)) + 1
# or: df.idxmax(1).map(df.columns.get_loc) + 1
print(df)
Rows A B C D E New
0 0 0 0 1 0 4
1 0 0 1 0 0 3
2 0 1 0 0 0 2
3 0 0 0 1 0 4
4 1 0 0 0 0 1
5 0 0 0 0 1 5
I also need a suggestion on rows that have multiple 1s: how should those rows be handled, given that we can have only one category at a time?
In this case you dot your DataFrame of dummies with a vector of the powers of 2 (one per column). This ensures that any unique combination of dummies (A, A+B, A+B+C, B+C, ...) gets a unique category label. (I added a few rows at the bottom to illustrate the unique encoding.)
import numpy as np

# column i contributes 2**i, so each combination yields a distinct code
df['Category'] = df.dot(2**np.arange(df.shape[1]))
A B C D E Category
Rows
0 0 0 0 1 0 8
1 0 0 1 0 0 4
2 0 1 0 0 0 2
3 0 0 0 1 0 8
4 1 0 0 0 0 1
5 0 0 0 0 1 16
6 1 0 0 0 1 17
7 0 1 0 0 1 18
8 1 1 0 0 1 19
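If you ever need to map a code back to its column combination, the bits can be unpacked again; a minimal sketch (the helper name decode is mine, not part of the answer above):
def decode(code, columns):
    # bit i set in the code -> column i was 1 in that row
    return [c for i, c in enumerate(columns) if code & (1 << i)]

print(decode(19, list('ABCDE')))   # ['A', 'B', 'E'] (19 = 1 + 2 + 16)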
Another readable solution on top of the other great solutions; it works for any column dtype, provided each row contains exactly one nonzero entry:
import numpy as np

# column positions of the nonzero entries, 1-based
df['variables'] = np.where(df.values)[1] + 1
output:
A B C D E variables
0 0 0 0 1 0 4
1 0 0 1 0 0 3
2 0 1 0 0 0 2
3 0 0 0 1 0 4
4 1 0 0 0 0 1
5 0 0 0 0 1 5
I have a df in Python that looks something like this:
'A'
0
1
0
0
1
1
1
1
0
I want to create another column that adds up cumulative 1s from column A, starting over whenever the value in column A becomes 0 again. So the desired output is:
'A' 'B'
0 0
1 1
0 0
0 0
1 1
1 2
1 3
1 4
0 0
This is what I am trying, but it's just replicating column A:
df.B[df.A ==0] = 0
df.B[df.A !=0] = df.A + df.B.shift(1)
Let us do cumsum to mark each block that starts at a 0, then groupby with cumcount:
# cumcount restarts in every block; keep it only where A is 1
df['B'] = (df.groupby(df.A.eq(0).cumsum()).cumcount()).where(df.A == 1, 0)
Out[81]:
0 0
1 1
2 0
3 0
4 1
5 2
6 3
7 4
8 0
dtype: int64
Use shift with ne and cumsum to label each consecutive run, then a grouped cumsum:
# the cumulative sum of A restarts in every run; runs of 0 sum to 0
df['B'] = df.groupby(df['A'].shift().ne(df['A']).cumsum())['A'].cumsum()
print(df)
A B
0 0 0
1 1 1
2 0 0
3 0 0
4 1 1
5 1 2
6 1 3
7 1 4
8 0 0
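To see why this works, it helps to print the intermediate grouping key; a minimal sketch on the sample data (the name group is mine):
import pandas as pd

df = pd.DataFrame({'A': [0, 1, 0, 0, 1, 1, 1, 1, 0]})
# a new group starts wherever A differs from the previous row
group = df['A'].shift().ne(df['A']).cumsum()
print(group.tolist())   # [1, 2, 3, 3, 4, 4, 4, 4, 5]
# the cumulative sum of A restarts in every group
df['B'] = df.groupby(group)['A'].cumsum()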