Python: counting frequency for two columns with the same possible values

I have two columns with two possible values (0 or 1). One column is the predicted value and the other is the real value. Something like this.
ID Predicted Real
1 1 1
2 1 0
3 0 0
4 0 1
5 1 0
6 1 0
I want to count the frequency for 0 and 1 on each column. Something like this
Value Predicted Real
1 4 2
0 2 4
And I want to make a vertical bar plot with the results

You can apply pd.value_counts to the dataframe (assuming ID is the index and not a column; if not, set ID as the index first):
out = df.apply(pd.value_counts).rename_axis('Value').reset_index()
   Value  Predicted  Real
0      0          2     4
1      1          4     2
df.apply(pd.value_counts).rename_axis('Value').plot(kind='bar') #customize as you want
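For reference, a minimal end-to-end sketch of the above, assuming the sample data from the question (matplotlib is only needed for the plot):
import pandas as pd
import matplotlib.pyplot as plt

# sample data from the question, with ID as the index
df = pd.DataFrame({'Predicted': [1, 1, 0, 0, 1, 1],
                   'Real':      [1, 0, 0, 1, 0, 0]},
                  index=pd.Index(range(1, 7), name='ID'))

# count how many 0s and 1s each column contains
counts = df.apply(pd.value_counts).rename_axis('Value')
print(counts.reset_index())

# vertical bar plot of the counts
counts.plot(kind='bar')
plt.show()
Note that in recent pandas versions pd.value_counts is deprecated; df.apply(lambda s: s.value_counts()) does the same thing.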


Not-quite gradient of dataframe

I have a dataframe of ints:
mydf = pd.DataFrame([[0,0,0,1,0,2,2,5,2,4],
[0,1,0,0,2,2,4,5,3,3],
[1,1,1,1,2,2,0,4,4,4]])
I'd like to calculate something that resembles the gradient given by pd.Series.diff() for each row, but with one big change: my ints represent categorical data, so I'm only interested in detecting a change, not the magnitude of it. So the step from 0 to 1 should be the same as the step from 0 to 4.
Is there a way for pandas to interpret my data as categorical in the data frame, and then calculate a Series.diff() on that? Or could you "flatten" the output of Series.diff() to be only 0s and 1s?
If I understand you correctly, this is what you are trying to achieve:
import pandas as pd
mydf = pd.DataFrame([[0,0,0,1,0,2,2,5,2,4],
[0,1,0,0,2,2,4,5,3,3],
[1,1,1,1,2,2,0,4,4,4]])
mydf = mydf.astype("category")
diff_df = mydf.apply(lambda x: x.diff().ne(0), axis=1).astype(int)
ne(0) returns a boolean array indicating whether the difference between consecutive values is non-zero, and astype(int) converts those booleans to integers (0s and 1s). The result is a dataframe of the same shape as the original, with binary values marking a change in the categorical value from one step to the next.
   0  1  2  3  4  5  6  7  8  9
0  1  0  0  1  1  1  0  1  1  1
1  1  1  1  0  1  0  1  1  1  0
2  1  0  0  0  1  0  1  1  0  0
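If the category conversion is not needed for anything else, an equivalent vectorized sketch on the plain integer frame (same mydf as above) should give the same 0/1 output:
import pandas as pd

mydf = pd.DataFrame([[0, 0, 0, 1, 0, 2, 2, 5, 2, 4],
                     [0, 1, 0, 0, 2, 2, 4, 5, 3, 3],
                     [1, 1, 1, 1, 2, 2, 0, 4, 4, 4]])
# row-wise diff, then flag any non-zero change as 1
# (the first column has no predecessor, so it is always flagged)
diff_df = mydf.diff(axis=1).ne(0).astype(int)
print(diff_df)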

python find the nearest nonzero element in df column

I have df:
id number
1 5
1 0
1 0
1 2
2 0
3 1
I want to write a function to fill the 0 values. For each id (each group), when the value in the number column is zero, I want to find the closest non-zero value in the column and use it. For example, for id 1, fill the second and third rows with 2. If there is no such value, as with id 2, leave it as is.
How can I do that?
You can mask the 0s, bfill per group, and finally fillna with the original value for the groups that only have zeros:
df['number2'] = (df['number']
                 .mask(df['number'].eq(0))
                 .groupby(df['id'])
                 .bfill()
                 .fillna(df['number'], downcast='infer')
                 )
output:
id number number2
0 1 5 5
1 1 0 2
2 1 0 2
3 1 2 2
4 2 0 0
5 3 1 1
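A self-contained sketch with the sample data from the question; it uses .astype(int) in place of the downcast='infer' argument, which newer pandas versions deprecate:
import pandas as pd

df = pd.DataFrame({'id':     [1, 1, 1, 1, 2, 3],
                   'number': [5, 0, 0, 2, 0, 1]})
df['number2'] = (df['number']
                 .mask(df['number'].eq(0))   # hide the zeros
                 .groupby(df['id'])
                 .bfill()                    # next non-zero within each id
                 .fillna(df['number'])       # all-zero groups keep their 0
                 .astype(int)
                 )
print(df)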

Pandas: Row wise count for multiple columns after filter based on condition [duplicate]

I have a dataframe df as
Decile 2_Con 3_Con 6_Con Pred
1 0 0 0 0
1 0 0 1 1
1 0 0 1 1
2 0 0 0 0
2 0 1 0 1
2 0 1 1 1
Objective:
For each Decile, I want to count the rows in which 2_Con, 3_Con, 6_Con (and Pred) are set to 1, i.e. filter the above df on each column and count the 1s per Decile.
So the resultant Dataframe should look like:
Decile 2_Con 3_Con 6_Con Pred
1 0 0 2 2 <-- we are only counting the 1s in the columns for each Decile
2 0 2 1 2
How to achieve this? I am not able to generate a plausible code here.
Thanks in advance
Did you try:
df = df.groupby('Decile').sum()
This works because the values are 0/1, so the sum is the count of non-zero rows for each Decile.
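A runnable sketch with the sample data from the question; as_index=False keeps Decile as a column so the result matches the layout above:
import pandas as pd

df = pd.DataFrame({'Decile': [1, 1, 1, 2, 2, 2],
                   '2_Con':  [0, 0, 0, 0, 0, 0],
                   '3_Con':  [0, 0, 0, 0, 1, 1],
                   '6_Con':  [0, 1, 1, 0, 0, 1],
                   'Pred':   [0, 1, 1, 0, 1, 1]})
# with 0/1 columns, summing per Decile counts the rows set to 1
out = df.groupby('Decile', as_index=False).sum()
print(out)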

Multiple Condition Apply Function that iterates over itself

So I have a Dataframe that is the same thing 348 times, but with a different date as a static column. What I would like to do is add a column that checks against that date and then counts the number of rows that are within 20 miles using a lat/lon column and geopy.
My frame is like this:
What I am looking to do is something like an apply function that takes all of the identifying dates that are equal to the column and then run this:
geopy.distance.vincenty(x, y).miles
X would be the location's lat/lon and y would be the iterative lat/lon. I'd want the count of locations in which the above is < 20. I'd then like to store this count as a column in the initial Dataframe.
I'm ok with Pandas, but this is just outside my comfort zone. Thanks.
I started with this DataFrame (because I did not want to type that much by hand and you did not provide any code for the data):
df
Index Number la ID
0 0 1 [43.3948, -23.9483] 1/1/90
1 1 2 [22.8483, -34.3948] 1/1/90
2 2 3 [44.9584, -14.4938] 1/1/90
3 3 4 [22.39458, -55.34924] 1/1/90
4 4 5 [33.9383, -23.4938] 1/1/90
5 5 6 [22.849, -34.397] 1/1/90
Now I introduced an artificial column which is only there to help us get the cartesian product of the distances
df['join'] = 1
df_c = pd.merge(df, df[['la', 'join','Index']], on='join')
The next step is to apply the vincenty function via .apply and store the result in an extra column
from geopy import distance
df_c['distance'] = df_c.apply(lambda x: distance.vincenty(x.la_x, x.la_y).miles, 1)
Now we have the Cartesian product of the original frame, which means we also compare each city with itself. We account for that in the next step by subtracting 1: we group by Index_x and count the distances smaller than 20 miles.
df['num_close_cities'] = df_c.groupby('Index_x').apply(lambda x: sum((x.distance < 20))) -1
df.drop('join', 1)
Index Number la ID num_close_cities
0 0 1 [43.3948, -23.9483] 1/1/90 0
1 1 2 [22.8483, -34.3948] 1/1/90 1
2 2 3 [44.9584, -14.4938] 1/1/90 0
3 3 4 [22.39458, -55.34924] 1/1/90 0
4 4 5 [33.9383, -23.4938] 1/1/90 0
5 5 6 [22.849, -34.397] 1/1/90 1
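Putting the steps together as a runnable sketch: vincenty was removed in geopy 2.0, so this version uses distance.geodesic instead, and the coordinates are the made-up ones from the answer above:
import pandas as pd
from geopy import distance

# sample frame mirroring the one built above
df = pd.DataFrame({'Index': range(6),
                   'Number': range(1, 7),
                   'la': [[43.3948, -23.9483], [22.8483, -34.3948],
                          [44.9584, -14.4938], [22.39458, -55.34924],
                          [33.9383, -23.4938], [22.849, -34.397]],
                   'ID': ['1/1/90'] * 6})

# constant join key gives the Cartesian product of all row pairs
df['join'] = 1
df_c = pd.merge(df, df[['la', 'join', 'Index']], on='join')

# geodesic replaces vincenty (removed in geopy 2.0)
df_c['distance'] = df_c.apply(
    lambda x: distance.geodesic(x.la_x, x.la_y).miles, axis=1)

# count neighbours within 20 miles; subtract 1 for the row itself
df['num_close_cities'] = df_c.groupby('Index_x')['distance'].apply(
    lambda d: (d < 20).sum()) - 1
print(df.drop('join', axis=1))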

pandas: Grouping or filtering based on values in list, instead of dataframe

I want to get a row count of the frequency of each value, even if that value doesn't exist in the dataframe.
d = {'light' : pd.Series(['b','b','c','a','a','a','a'], index=[1,2,3,4,5,6,9]),'injury' : pd.Series([1,5,5,5,2,2,4], index=[1,2,3,4,5,6,9])}
testdf = pd.DataFrame(d)
injury light
1 1 b
2 5 b
3 5 c
4 5 a
5 2 a
6 2 a
9 4 a
I want to get a count of the number of occurrences of each unique value of 'injury' for each unique value in 'light'.
Normally I would just use groupby(), or (in this case, since I want it to be in a specific format), pivot_table:
testdf.reset_index().pivot_table(index='light',columns='injury',fill_value=0,aggfunc='count')
index
injury 1 2 4 5
light
a 0 2 1 1
b 1 0 0 1
c 0 0 0 1
But in this case I actually want to compare the records in the dataframe to an external list of values-- in this case, ['a','b','c','d']. So if 'd' doesn't exist in this dataframe, then I want it to return a count of zero:
index
injury 1 2 4 5
light
a 0 2 1 1
b 1 0 0 1
c 0 0 0 1
d 0 0 0 0
The closest I've come is filtering the dataframe based on each value, and then getting the size of that dataframe:
for v in sorted(['a','b','c','d']):
    idx2 = (df['light'].isin([v]))
    df2 = df[idx2]
    print(df2.shape[0])
4
2
1
0
But that only returns counts from the 'light' column-- instead of a cross-tabulation of both columns.
Is there a way to make a pivot table, or a groupby() object, that groups things based on values in a list, rather than in a column in a dataframe? Or is there a better way to do this?
Try this:
df = pd.crosstab(df.light, df.injury,margins=True)
df
injury 1 2 4 5 All
light
a 0 2 1 1 4
b 1 0 0 1 2
c 0 0 0 1 1
All 1 2 1 3 7
df["All"]
light
a 4
b 2
c 1
All 7
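To also get a zero row for values that never occur in the dataframe (the 'd' case in the question), one option is to reindex the crosstab against the external list; a minimal sketch using the sample data above:
import pandas as pd

d = {'light':  pd.Series(['b', 'b', 'c', 'a', 'a', 'a', 'a'], index=[1, 2, 3, 4, 5, 6, 9]),
     'injury': pd.Series([1, 5, 5, 5, 2, 2, 4], index=[1, 2, 3, 4, 5, 6, 9])}
testdf = pd.DataFrame(d)

# values missing from the dataframe (here 'd') become all-zero rows
values = ['a', 'b', 'c', 'd']
out = pd.crosstab(testdf.light, testdf.injury).reindex(values, fill_value=0)
print(out)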
