Not-quite gradient of dataframe - python

I have a dataframe of ints:
mydf = pd.DataFrame([[0,0,0,1,0,2,2,5,2,4],
                     [0,1,0,0,2,2,4,5,3,3],
                     [1,1,1,1,2,2,0,4,4,4]])
I'd like to calculate something that resembles the gradient given by pd.Series.diff() for each row, but with one big change: my ints represent categorical data, so I'm only interested in detecting a change, not its magnitude. So the step from 0 to 1 should count the same as the step from 0 to 4.
Is there a way for pandas to interpret my data as categorical in the data frame, and then calculate a Series.diff() on that? Or could you "flatten" the output of Series.diff() to be only 0s and 1s?

If I understand you correctly, this is what you are trying to achieve:
import pandas as pd
mydf = pd.DataFrame([[0,0,0,1,0,2,2,5,2,4],
                     [0,1,0,0,2,2,4,5,3,3],
                     [1,1,1,1,2,2,0,4,4,4]])
mydf = mydf.astype("category")
diff_df = mydf.apply(lambda x: x.diff().ne(0), axis=1).astype(int)
The .ne(0) call returns a boolean Series indicating whether the difference between consecutive values is non-zero, and .astype(int) converts the booleans to integers (0s and 1s). The result is a dataframe with the same number of rows and columns as the original, but with binary values marking a change in the categorical value from one step to the next.
0 1 2 3 4 5 6 7 8 9
0 1 0 0 1 1 1 0 1 1 1
1 1 1 1 0 1 0 1 1 1 0
2 1 0 0 0 1 0 1 1 0 0
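If you would rather avoid apply, a minimal vectorized sketch (run on the original integer frame, before any astype("category")) compares each row with itself shifted one column; since no subtraction is involved, it also works for non-numeric labels and produces the same 0/1 frame:
diff_df = mydf.ne(mydf.shift(axis=1)).astype(int)  # first column compares against NaN and so comes out 1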

Related

Python: counting frequency for two columns with the same possible values

I have two columns with two possible values (0 or 1). One column is the predicted value and the other is the real value. Something like this.
ID Predicted Real
1 1 1
2 1 0
3 0 0
4 0 1
5 1 0
6 1 0
I want to count the frequency for 0 and 1 on each column. Something like this
Value Predicted Real
1 4 2
0 2 4
And I want to make a vertical bar plot with the results
You can apply pd.value_counts to the dataframe (assuming ID is the index and not a column; if it is a column, set it as the index first):
out = df.apply(pd.value_counts).rename_axis('Value').reset_index()
Value Predicted Real
0 0 2 4
1 1 4 2
df.apply(pd.value_counts).rename_axis('Value').plot(kind='bar')  # customize as you want
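If you prefer to reshape first, an equivalent sketch (assuming the columns are literally named Predicted and Real) melts the frame and tabulates it with pd.crosstab; the result plots the same way:
long = df.melt(value_vars=['Predicted', 'Real'], var_name='column', value_name='Value')
counts = pd.crosstab(long['Value'], long['column'])  # rows are the values 0/1, columns are Predicted/Real
counts.plot(kind='bar')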

Summarize Pandas DataFrame by Column Values

I have a Pandas DataFrame in which each column is a binary 1/0 indicator. It has 4 columns and 7 rows. I would like to produce a DataFrame that groups identical rows, with a last (5th) column showing how many rows fall into each group. Please see the sample below:
df = pd.DataFrame([[0,1,1,0],
                   [0,1,1,0],
                   [0,0,0,1],
                   [0,0,0,1],
                   [1,1,1,0],
                   [1,1,1,1],
                   [1,1,1,0]])
res = pd.DataFrame([[0,1,1,0,2],
                    [0,0,0,1,2],
                    [1,1,1,0,2],
                    [1,1,1,1,1]])
I need to create the "res" DataFrame from df.
This is groupby + size
df.groupby(list(df)).size().to_frame('size').reset_index()
Out[612]:
0 1 2 3 size
0 0 0 0 1 2
1 0 1 1 0 2
2 1 1 1 0 2
3 1 1 1 1 1
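On pandas 1.1 or newer, DataFrame.value_counts gives the same row counts in one call; a minimal sketch (note it returns the groups sorted by count rather than by value):
res = df.value_counts().reset_index(name='size')  # one row per unique combination of the four columns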

Converting pandas column of comma-separated strings into dummy variables

In my dataframe, I have a categorical variable that I'd like to convert into dummy variables. This column however has multiple values separated by commas:
0 'a'
1 'a,b,c'
2 'a,b,d'
3 'd'
4 'c,d'
Ultimately, I'd want to have binary columns for each possible discrete value; in other words, the final column count equals the number of unique values in the original column. I imagine I'd have to use split() to get each separate value, but I'm not sure what to do afterwards. Any hint much appreciated!
Edit: Additional twist. Column has null values. And in response to comment, the following is the desired output. Thanks!
a b c d
0 1 0 0 0
1 1 1 1 0
2 1 1 0 1
3 0 0 0 1
4 0 0 1 1
Use str.get_dummies
df['col'].str.get_dummies(sep=',')
a b c d
0 1 0 0 0
1 1 1 1 0
2 1 1 0 1
3 0 0 0 1
4 0 0 1 1
Edit: Updating the answer to address some questions.
Qn 1: Why does the Series method get_dummies not accept the prefix=... argument, while pandas.get_dummies() does?
Series.str.get_dummies is a Series-level method (as the name suggests!). We are one-hot encoding the values of a single Series (or DataFrame column), so there is no need for a prefix. pandas.get_dummies, on the other hand, can one-hot encode multiple columns, in which case the prefix parameter identifies the original column.
If you want to apply a prefix to the output of str.get_dummies, you can always use DataFrame.add_prefix:
df['col'].str.get_dummies(sep=',').add_prefix('col_')
Qn 2: If you have more than one column to begin with, how do you merge the dummies back into the original frame?
You can use pd.concat to merge the one-hot encoded columns back with the rest of the columns in the dataframe:
df = pd.DataFrame({'other':['x','y','x','x','q'],'col':['a','a,b,c','a,b,d','d','c,d']})
df = pd.concat([df, df['col'].str.get_dummies(sep=',')], axis=1).drop(columns='col')
other a b c d
0 x 1 0 0 0
1 y 1 1 1 0
2 x 1 1 0 1
3 x 0 0 0 1
4 q 0 0 1 1
The str.get_dummies method does not accept a prefix parameter, but you can rename the columns of the returned dummy DataFrame:
data['col'].str.get_dummies(sep=',').rename(lambda x: 'col_' + x, axis='columns')
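As for the null values mentioned in the edit: str.get_dummies should simply produce an all-zero row for a missing entry, so no special handling is needed. A small sketch with made-up data:
s = pd.Series(['a', 'a,b,c', None, 'c,d'])
s.str.get_dummies(sep=',')  # the None row comes back as all zeros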

Copy pandas DataFrame row to multiple other rows

Simple and practical question, yet I can't find a solution.
The questions I took a look were the following:
Modifying a subset of rows in a pandas dataframe
Changing certain values in multiple columns of a pandas DataFrame at once
Fastest way to copy columns from one DataFrame to another using pandas?
Selecting with complex criteria from pandas.DataFrame
The key difference between those and mine is that I need to insert not a single value, but a whole row.
My problem is this: I pick a row from a dataframe, say df1, so I have a Series. I also have another dataframe, df2, in which I have selected multiple rows according to some criterion, and I want to replicate that Series into all of those rows.
df1:
Index/Col A B C
1 0 0 0
2 0 0 0
3 1 2 3
4 0 0 0
df2:
Index/Col A B C
1 0 0 0
2 0 0 0
3 0 0 0
4 0 0 0
What I want to accomplish is inserting row 3 of df1 into rows 2 and 3 of df2, for example. So something like the non-working code:
series = df1[3]
df2[df2.index>=2 and df2.index<=3] = series
returning
df2:
Index/Col A B C
1 0 0 0
2 1 2 3
3 1 2 3
4 0 0 0
Use loc and pass a list of the index labels of interest; after the comma, the : indicates that we want to set all column values. We then assign the Series, but via its .values attribute so that it is a numpy array. Otherwise you will get a ValueError from the shape mismatch: you intend to overwrite 2 rows with a single row, and a Series would not align the way you want.
In [76]:
df2.loc[[2,3],:] = df1.loc[3].values
df2
Out[76]:
A B C
1 0 0 0
2 1 2 3
3 1 2 3
4 0 0 0
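If you would rather keep the boolean condition from the question, the same assignment works with a mask, using & instead of and for element-wise logic; a minimal sketch:
mask = (df2.index >= 2) & (df2.index <= 3)  # element-wise & rather than Python's and
df2.loc[mask, :] = df1.loc[3].values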
If you need to copy certain rows and columns from one dataframe into another dataframe, you can do this:
df2 = df.loc[x:y, a:b]  # x and y are the row bounds, a and b the column bounds you want to select
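Note that, unlike regular Python slicing, .loc slices are inclusive on both ends; for example, with the hypothetical labels below, both row 3 and column 'B' are included:
subset = df.loc[2:3, 'A':'B']  # rows 2 and 3, columns A through B, all inclusive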

Pandas - Get dummies for only certain values

I have a Pandas series of 10000 rows, each populated with a single letter from A to Z.
However, I want to create dummy columns for only A, B, and C, using Pandas get_dummies.
How do I go around doing that?
I don't want to get dummies for all the row values in the column and then select the specific columns, as the column contains other redundant data which eventually causes a Memory Error.
try this:
# create mock dataframe
df = pd.DataFrame( {'alpha':['a','a','b','b','c','e','f','g']})
# use replace with a regex to set characters d-z to None
pd.get_dummies(df.replace({'[^a-c]': None}, regex=True))
output:
alpha_a alpha_b alpha_c
0 1 0 0
1 1 0 0
2 0 1 0
3 0 1 0
4 0 0 1
5 0 0 0
6 0 0 0
7 0 0 0
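Another sketch that never creates the unwanted columns is to declare the allowed categories up front (the column name alpha comes from the mock dataframe above); values outside the list become NaN and get_dummies skips them:
df['alpha'] = pd.Categorical(df['alpha'], categories=['a', 'b', 'c'])
pd.get_dummies(df['alpha'], prefix='alpha')  # only alpha_a, alpha_b, alpha_c are created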
