I have a pandas DataFrame with collected data:
import pandas as pd
df = pd.DataFrame({'Group': ['A','A','A','A','A','A','A','B','B','B','B','B','B','B'], 'Subgroup': ['Blue', 'Blue','Blue','Red','Red','Red','Red','Blue','Blue','Blue','Blue','Red','Red','Red'],'Obs':[1,2,4,1,2,3,4,1,2,3,6,1,2,3]})
+-------+----------+-----+
| Group | Subgroup | Obs |
+-------+----------+-----+
| A | Blue | 1 |
| A | Blue | 2 |
| A | Blue | 4 |
| A | Red | 1 |
| A | Red | 2 |
| A | Red | 3 |
| A | Red | 4 |
| B | Blue | 1 |
| B | Blue | 2 |
| B | Blue | 3 |
| B | Blue | 6 |
| B | Red | 1 |
| B | Red | 2 |
| B | Red | 3 |
+-------+----------+-----+
The Observations ('Obs') are supposed to be numbered without gaps, but you can see we have 'missed' Blue 3 in group A and Blue 4 and 5 in group B. The desired outcome is a percentage of all 'missed' Observations ('Obs') per group, so in the example:
+-------+--------------------+--------+--------+
| Group | Total Observations | Missed | % |
+-------+--------------------+--------+--------+
| A | 8 | 1 | 12.5% |
| B | 9 | 2 | 22.22% |
+-------+--------------------+--------+--------+
I tried both with for loops and with groupby (for example:
groups = df.groupby(['Group', 'Subgroup']).sum()
print(groups.head())
) but I can't seem to get that to work any way I try. Am I going about this the wrong way?
From another answer (big shoutout to @Lie Ryan) I found a function to look for missing elements; however, I don't quite understand how to apply it here yet:
from itertools import islice, chain

def window(seq, n=2):
    "Returns a sliding window (of width n) over data from the iterable"
    " s -> (s0,s1,...s[n-1]), (s1,s2,...,sn), ... "
    it = iter(seq)
    result = tuple(islice(it, n))
    if len(result) == n:
        yield result
    for elem in it:
        result = result[1:] + (elem,)
        yield result

def missing_elements(L):
    missing = chain.from_iterable(range(x + 1, y) for x, y in window(L) if (y - x) > 1)
    return list(missing)
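For context, applied to a single subgroup's sorted observations it behaves like this (just a small check, not yet the per-group percentage I'm after):
obs = sorted(df.loc[(df.Group == 'A') & (df.Subgroup == 'Blue'), 'Obs'])
print(missing_elements(obs))  # [3] -> the one missed Blue observation in group A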
Can anyone give me a pointer in the right direction?
Simple enough, you'll need groupby here:
Using groupby + diff, figure out how many observations are missing per Group and Subgroup
Group df on Group, and compute the size and sum of the column computed in the previous step
A couple more straightforward steps (calculating the %) give you your intended output.
f = [  # declare an aggfunc list in advance, we'll need it later
    ('Total Observations', 'size'),
    ('Missed', 'sum')
]

g = df.groupby(['Group', 'Subgroup'])\
      .Obs.diff()\
      .sub(1)\
      .groupby(df.Group)\
      .agg(f)
g['Total Observations'] += g['Missed']
g['%'] = g['Missed'] / g['Total Observations'] * 100
g
Total Observations Missed %
Group
A 8.0 1.0 12.500000
B 9.0 2.0 22.222222
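If your pandas version rejects the list-of-tuples agg spelling, a rough equivalent using named aggregation (available since pandas 0.25) would be:
gaps = df.groupby(['Group', 'Subgroup']).Obs.diff().sub(1)  # gap after each observation; NaN for the first row of each subgroup
g = gaps.groupby(df.Group).agg(total='size', missed='sum')
g['total'] += g['missed']
g['%'] = g['missed'] / g['total'] * 100
g.columns = ['Total Observations', 'Missed', '%']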
A similar approach using groupby, apply and assign:
(
    df.groupby(['Group', 'Subgroup']).Obs
      .apply(lambda x: [x.max() - x.min() + 1, x.max() - x.min() + 1 - len(x)])
      .apply(pd.Series)
      .groupby(level=0).sum()
      .assign(pct=lambda x: x[1] / x[0] * 100)
      .set_axis(['Total Observations', 'Missed', '%'], axis=1, inplace=False)
)
Out[75]:
Total Observations Missed %
Group
A 8 1 12.500000
B 9 2 22.222222
from collections import Counter

gs = ['Group', 'Subgroup']
# set of every (Group, Subgroup, Obs) row actually present
old_tups = set(zip(*df.values.T))

# count, per Group, every Obs value in each subgroup's min..max range that is absent
missed = pd.Series(Counter(
    g for (g, s), d in df.groupby(gs)
    for o in range(d.Obs.min(), d.Obs.max() + 1)
    if (g, s, o) not in old_tups
), name='Missed')

hit = df.set_index(gs).Obs.count(level=0)  # observations actually recorded per Group
total = hit.add(missed).rename('Total')
ratio = missed.div(total).rename('%')

pd.concat([total, missed, ratio], axis=1).reset_index()
Group Total Missed %
0 A 8 1 0.125000
1 B 9 2 0.222222
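Note the '%' column here is a fraction rather than a percentage; multiplying by 100, e.g. ratio = missed.div(total).mul(100).rename('%'), would match the requested format.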
I have three dataframes. df1 contains 46 columns; df2 and df3 contain 41 columns each and hold threshold values for the columns in df1, indicating whether a row needs a repeat or a repeat with addition. Below are simplified examples:
df1:
|Name | A | B | C |.......
------------------------------
0|ID1 | 10 | 2 | 3 |
1|ID2 | 400 | 1 | 6 |
2|ID3 | 7 | 8 | 9 |
3|ID4 | 12 | 300 | 55 |
4|ID5 | 0 | 1 | 2 |
df2:
| A | B | C |.......
------------------------------
Repeat | 10 | 2 | 50 |
df3:
| A | B | C |.......
------------------------------
Repeat w Addition| 100 | 200 | 500 |
What I'd like to do is create a new column in df1 with the values "Repeat with Addition", "Repeat" or "No" based on the following conditions:
for each row, if any value in cols A-C is greater than or equal to the 'Repeat w Addition' threshold -> Repeat with Addition
for each row, if any value in cols A-C is greater than or equal to the 'Repeat' threshold but below the 'Repeat w Addition' one -> Repeat
otherwise -> No
desired output:
df1:
|Name | A | B | C |.......|Repeat Required?|
--------------------------------------------------
0|ID1 | 10 | 2 | 3 |.......| Repeat
1|ID2 | 400 | 1 | 6 |.......| Repeat with Addition
2|ID3 | 7 | 8 | 9 |.......| Repeat
3|ID4 | 12 | 300 | 55 |.......| Repeat with Addition
4|ID5 | 0 | 1 | 2 |.......| No
what I have so far:
I tried using a function with np.select to fill the column, but it produces a bunch of 'No' values where that is not correct:
def repeat_required(df):
    conds = [df >= df3.loc["Repeat w Addition"], df >= df2.loc["Repeat"]]
    labels = ['Repeat with Addition', 'Repeat']
    return np.select(conds, labels, default='No')

df1["Repeat Required?"] = ""
df1["Repeat Required?"] = repeat_required(df1.iloc[:, 4:-1])  # the first 4 columns contain strings
You're right that you want to use np.select, but the conditions you provide need to be Boolean Series that are the same length as df1. To get those, compare against the rows of df2 and df3 as Series (so that the comparison aligns on columns) and then check whether any value in the row satisfies your condition.
You can specify the columns to compare manually, as I do with a list below, or you can leave that out and rely on pandas aligning the .ge comparison on columns automatically, so that only the overlapping columns feed into the any check.
import numpy as np
cols = ['A', 'B', 'C']
conds = [df1[cols].ge(df3[cols].loc['Repeat w Addition']).any(1),
         df1[cols].ge(df2[cols].loc['Repeat']).any(1)]
choices = ['Repeat w Addition', 'Repeat']
df1['Repeat Required'] = np.select(conds, choices, default='No')
print(df1)
Name A B C Repeat Required
0 ID1 10 2 3 Repeat
1 ID2 400 1 6 Repeat w Addition
2 ID3 7 8 9 Repeat
3 ID4 12 300 55 Repeat w Addition
4 ID5 0 1 2 No
I have a dataframe that contains car accidents; each one can be 'L' for light or 'S' for strong:
|car_id|type_acc|datetime_acc|
------------------------------
| 1 | L | 2020-01-01 |
| 1 | L | 2020-01-05 |
| 1 | S | 2020-01-07 |
| 1 | L | 2020-01-09 |
| 2 | L | 2020-01-04 |
| 2 | L | 2020-01-10 |
| 2 | L | 2020-01-12 |
I would like to get a column that counts the occurrences up to the first 'S' and then divides the days between the max and min 'L' by that count, so the output is:
|car_id|freq_acc|
-----------------
| 1 | 2 | # 4 days (from 1 to 5) / 2 number of 'L' before first 'S'
| 2 | 8 | # 8 days (from 4 to 12) and no 'S'
Is such a thing possible? Thanks
You could use np.ptp to compute the max-min difference:
# find first by S by car_id
df['eq_s'] = df.groupby('car_id')['type_acc'].transform(lambda x: x.eq('S').cumsum())
# compute stats based on previous computation, keeping only the first group
groups = df[df['eq_s'].eq(0)].groupby(['car_id']).agg({'datetime_acc': np.ptp}).reset_index()
# rename
res = groups.rename(columns={'datetime_acc': 'freq_acc'})
print(res)
Output
car_id freq_acc
0 1 4 days
1 2 8 days
Before anything, make sure that datetime_acc is a datetime column, by doing:
df['datetime_acc'] = pd.to_datetime(df['datetime_acc'])
The first step:
# find first by S by car_id
df['eq_s'] = df.groupby('car_id')['type_acc'].transform(lambda x: x.eq('S').cumsum())
will create a new column where the only values of interest are the 0s, i.e. the rows that come before the first 'S'. In the second step we keep only those rows and perform a standard groupby:
# compute stats based on previous computation, keeping only the first group
groups = df[df['eq_s'].eq(0)].groupby(['car_id']).agg({'datetime_acc': np.ptp}).reset_index()
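For the sample data above, the intermediate eq_s column would look roughly like this (0 marks the rows before each car's first 'S'):
   car_id type_acc datetime_acc  eq_s
0       1        L   2020-01-01     0
1       1        L   2020-01-05     0
2       1        S   2020-01-07     1
3       1        L   2020-01-09     1
4       2        L   2020-01-04     0
5       2        L   2020-01-10     0
6       2        L   2020-01-12     0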
I have the following dataframe
+-------+------------+--+
| index | keep | |
+-------+------------+--+
| 0 | not useful | |
| 1 | start_1 | |
| 2 | useful | |
| 3 | end_1 | |
| 4 | not useful | |
| 5 | start_2 | |
| 6 | useful | |
| 7 | useful | |
| 8 | end_2 | |
+-------+------------+--+
There are two pairs of strings (start_1/end_1 and start_2/end_2) that indicate that the rows between them are the only relevant ones in the data. Hence, for the dataframe below, the output would be composed only of the rows at index 2, 6 and 7 (since 2 is between start_1 and end_1, and 6 and 7 are between start_2 and end_2):
d = {'keep': ["not useful", "start_1", "useful", "end_1", "not useful", "start_2", "useful", "useful", "end_2"]}
df = pd.DataFrame(data=d)
What is the most Pythonic/Pandas approach to this problem?
Thanks
Here's one way to do that (in a couple of steps, for clarity). There might be others:
df["sections"] = 0
df.loc[df.keep.str.startswith("start"), "sections"] = 1
df.loc[df.keep.str.startswith("end"), "sections"] = -1
df["in_section"] = df.sections.cumsum()
res = df[(df.in_section == 1) & ~df.keep.str.startswith("start")]
Output:
     keep  sections  in_section
2  useful         0           1
6  useful         0           1
7  useful         0           1
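If you only need the original column afterwards, you could drop the helpers, e.g. res = res[['keep']] or res.drop(columns=['sections', 'in_section']).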
I have multiple columns in my dataframe, of which I am using two: "customer_id" and "trip_id". I used the groupby function data.groupby(['customer_id','trip_id']). There are multiple trips taken by each customer. I want to count how many trips each customer took, but when I use an aggregate function along with groupby I get 1 in every row. How should I proceed?
I want something in this format.
Example:
Customer_id, Trip_Id, Count
CustID1, trip1,   3
         trip2
         trip3
CustID2, Trip450, 2
         Trip23
You can group by customer and count the number of unique trips using the built-in nunique:
data.groupby('Customer_id').agg(Count=('Trip_id', 'nunique'))
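For instance, on a small frame shaped like the example above (column names assumed from the question):
data = pd.DataFrame({'Customer_id': ['CustID1'] * 3 + ['CustID2'] * 2,
                     'Trip_id': ['trip1', 'trip2', 'trip3', 'Trip450', 'Trip23']})
print(data.groupby('Customer_id').agg(Count=('Trip_id', 'nunique')))
#              Count
# Customer_id
# CustID1          3
# CustID2          2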
You can use data.groupby('customer_id')['trip_id'].count()
Example:
df1 = pd.DataFrame(columns=["c1", "c1a", "c1b"], data=[["x", 2, 3], ["z", 5, 6], ["z", 8, 9]])
print(df1)
# | c1 | c1a | c1b |
# |----|-----|-----|
# | x | 2 | 3 |
# | z | 5 | 6 |
# | z | 8 | 9 |
df2 = df1.groupby("c1").count()
print(df2)
# | | c1a | c1b |
# |----|-----|-----|
# | x | 1 | 1 |
# | z | 2 | 2 |
So I have a dataframe with some values. This is my dataframe:
|in|x|y|z|
+--+-+-+-+
| 1|a|a|b|
| 2|a|b|b|
| 3|a|b|c|
| 4|b|b|c|
I would like to get the number of unique values in each row, and the number of values that are not equal to the value in column x. The result should look like this:
|in | x | y | z  | count of not x | unique |
+---+---+---+----+----------------+--------+
| 1 | a | a | b  |       1        |   2    |
| 2 | a | b | b  |       2        |   2    |
| 3 | a | b | c  |       2        |   3    |
| 4 | b | b | nan|       0        |   1    |
I could come up with some dirty solutions here, but there must be a more elegant way of doing it. My mind keeps circling around drop_duplicates (which does not work on a Series); converting to an array and using .unique(); df.iterrows(), which I want to avoid; and .apply on each row.
Here are solutions using apply.
df['count of not x'] = df.apply(lambda x: (x[['y','z']] != x['x']).sum(), axis=1)
df['unique'] = df.apply(lambda x: x[['x','y','z']].nunique(), axis=1)
A non-apply solution for getting count of not x:
df['count of not x'] = (~df[['y','z']].isin(df['x'])).sum(1)
Can't think of anything great for unique. This uses apply, but may be faster, depending on the shape of the data.
df['unique'] = df[['x','y','z']].T.apply(lambda x: x.nunique())
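If your pandas version has it (0.20+), DataFrame.nunique also accepts axis=1, which would avoid the transpose:
df['unique'] = df[['x','y','z']].nunique(axis=1)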