I have a pandas dataframe (this is an example, actual dataframe is a lot larger):
data = [['345', 1, '2022_Jan'], ['678', 1, '2022_Jan'], ['123', 1, '2022_Feb'], ['123', 1, '2022_Feb'], ['345', 0, '2022_Mar'], ['678', 1, '2022_Mar'], ['901', 0, '2022_Mar'], ['678', 1, '2022_Mar']]
df = pd.DataFrame(data, columns = ['ID', 'Error Count', 'Year_Month'])
The question I want answered is: how many IDs have errors?
I want to get an output that groups by 'Year_Month' and counts 1 for each ID that occurs in each month. In other words, I want to count only 1 for each ID in a single month.
When I group by 'Year_Month' and 'ID': df.groupby(['Year_Month', 'ID']).count()
it gives me the following output (current output link below) with the total Error Count for each ID, but I only want to count each ID once. I also want Year_Month to be ordered chronologically; I'm not sure why it isn't, since my original dataframe is in month order in the Year_Month column.
My current output
Desired output
Are these actually duplicate records? Are you sure you don't want to record that user 123 had two errors in February?
If they really are duplicates, drop them first, then group by and sum over Error Count. Note that .count() counts the non-null rows in each group; it does not add up the Error Count values:
df.drop_duplicates(["ID", "Year_Month"]) \
.groupby(["Year_Month", "ID"])["Error Count"] \
.sum()
Output:
In [3]: counts = df.drop_duplicates(["ID", "Year_Month"]) \
...: .groupby(["Year_Month", "ID"])["Error Count"] \
...: .sum()
In [4]: counts
Out[4]:
Year_Month ID
2022_Feb 123 1
2022_Jan 345 1
678 1
2022_Mar 345 0
678 1
901 0
Name: Error Count, dtype: int64
As for sorting, you'd want to convert "Year_Month" to datetimes, because right now the labels are just being sorted as strings:
In [5]: "2022_Feb" < "2022_Jan"
Out[5]: True
Here's how you could do that:
In [6]: counts.sort_index(level=0, key=lambda ym: pd.to_datetime(ym, format="%Y_%b"))
Out[6]:
Year_Month ID
2022_Jan 345 1
678 1
2022_Feb 123 1
2022_Mar 345 0
678 1
901 0
Name: Error Count, dtype: int64
Here's one way:
(df
.groupby(['Year_Month', 'ID'])['Error Count'] # group by the two columns, select Error Count
.sum() # aggregate the sum over error count
.apply(lambda x: int(bool(x))) # convert to boolean and back to int
.to_frame('Error Count') # add name back to applied column
)
Error Count
Year_Month ID
2022_Feb 123 1
2022_Jan 345 1
678 1
2022_Mar 345 0
678 1
901 0
Here is another way to do that.
Convert the sum to a boolean with astype(bool), which gives False for 0 and True for any non-zero value, and then to 0 or 1 with astype(int):
df.groupby(['Year_Month','ID'])['Error Count'].sum().astype(bool).astype(int)
Year_Month ID
2022_Feb 123 1
2022_Jan 345 1
678 1
2022_Mar 345 0
678 1
901 0
Name: Error Count, dtype: int32
To sort, assign the result to a variable (it is a Series) and then apply ddejohn's solution:
counts = df.groupby(['Year_Month','ID'])['Error Count'].sum().astype(bool).astype(int)
counts.sort_index(level=0, key=lambda ym: pd.to_datetime(ym, format="%Y_%b")) # ddejohn: https://stackoverflow.com/a/71927886/3494754 answer above
Year_Month ID
2022_Jan 345 1
678 1
2022_Feb 123 1
2022_Mar 345 0
678 1
901 0
Name: Error Count, dtype: int32
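If the end goal is a single number per month ("how many IDs have errors?"), one possible sketch is to keep only the rows with at least one error and count distinct IDs per month. This assumes that an ID "has errors" when any of its rows in that month has Error Count > 0; the variable name ids_with_errors is made up for illustration. Months in which no ID has an error would not appear in the result at all.

```python
import pandas as pd

data = [['345', 1, '2022_Jan'], ['678', 1, '2022_Jan'],
        ['123', 1, '2022_Feb'], ['123', 1, '2022_Feb'],
        ['345', 0, '2022_Mar'], ['678', 1, '2022_Mar'],
        ['901', 0, '2022_Mar'], ['678', 1, '2022_Mar']]
df = pd.DataFrame(data, columns=['ID', 'Error Count', 'Year_Month'])

ids_with_errors = (
    df[df['Error Count'] > 0]          # keep rows that report an error
    .groupby('Year_Month')['ID']
    .nunique()                         # count each ID once per month
    # sort the string labels chronologically (assumes English month abbreviations)
    .sort_index(key=lambda ym: pd.to_datetime(ym, format='%Y_%b'))
)
print(ids_with_errors)
# 2022_Jan -> 2, 2022_Feb -> 1, 2022_Mar -> 1
```

The key argument of sort_index requires pandas 1.1 or later.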
I am trying to find the index at which the timedelta value in one dataframe (dataframe1) equals the timedelta value in another (dataframe2), and then trim the dataframe that has more values so that both start at the same time:
Dataset1:
TimeStamp Col1 ... Col2500
0 days 10:37:34 346 ... 635
0 days 10:38:34 124 ... 546
0 days 10:39:34 346 ... 745
Dataset2:
TimeStamp Col1 ... Col50
0 days 10:25:20 123 ... 789
0 days 10:25:45 183 ... 787
...
...
0 days 10:37:40 223 ... 789
for i in df2.index:
    if str(df1.index[0])[7:12] == str(df2.index[i])[7:12]:
        index_value = i
        break
df2 = df2.drop(df2.index[[0,i-1]])
The expected output is Dataset2 starting at the same time (to the nearest minute) as Dataset1.
You can use searchsorted to find the position of the first value in df2.index that is greater than or equal to the first value of df1.index. Then select the rest of df2 by position with iloc:
# both indices need to be sorted
df1 = df1.sort_index()
df2 = df2.sort_index()
a = df2.index.searchsorted(df1.index[0])
print (a)
2
df2 = df2.iloc[a:]
print (df2)
Col1 ... Col50
TimeStamp
10:37:40 223 ... 789
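As a self-contained sketch of the same idea, the snippet below rebuilds small versions of the two datasets (only Col1, with values taken from the question) and floors the first timestamp of df1 to the minute before searching. The flooring step is an assumption about what "nearest to the minute" should mean; drop it to match the exact timestamp instead.

```python
import pandas as pd

# Reduced versions of the two datasets, indexed by timedelta
df1 = pd.DataFrame(
    {"Col1": [346, 124, 346]},
    index=pd.to_timedelta(["10:37:34", "10:38:34", "10:39:34"]),
)
df2 = pd.DataFrame(
    {"Col1": [123, 183, 223]},
    index=pd.to_timedelta(["10:25:20", "10:25:45", "10:37:40"]),
)

# searchsorted requires sorted indices
df1 = df1.sort_index()
df2 = df2.sort_index()

# Start of the minute in which df1 begins (10:37:00)
start = df1.index[0].floor("min")

# Position of the first df2 timestamp that is >= start
pos = df2.index.searchsorted(start)
trimmed = df2.iloc[pos:]
print(trimmed)
```

Unlike the string slicing in the question, this compares the timedeltas themselves, so it does not depend on how they are printed.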
I have dataframe as follows:
df =
DATA TYPE_1 TYPE_2 TYPE_3 EVALUATED
WW A234 456 456 0
AA 456 123A 567 1
BB 456 123A 456 1
df = pd.DataFrame({"DATA":["WW","AA","BB"],"TYPE_1":["A234","456","456"],"TYPE_2":["456","123A","123A"],"TYPE_3":["456","567","456"],"EVALUATED":[0,1,1]})
I am calculating the number of times each TYPE* value appeared with EVALUATED equal to 0 and 1. This is the code that I am using:
grouped = (df.set_index('EVALUATED')
.filter(like='TYPE_')
.stack()
.to_frame('TYPE')
.reset_index()
.pivot_table(index='TYPE', columns='EVALUATED', aggfunc='size', fill_value=0)
).reset_index()
I need to slightly improve this code by counting only unique occurrences of TYPE* per row.
For example, 456 appeared two times in row 1, but it must be counted only once.
The result should be this one:
grouped =
TYPE 0 1
------------
A234 1 0
456 1 2
123A 0 2
567 0 1
My solution is similar to yours; I only added .apply(lambda x: pd.Series(x.unique()), axis=1) to remove duplicates within each row. Another possible solution, which gets the sizes quickly, uses groupby with size and unstack:
grouped = (df.set_index('EVALUATED')
.filter(like='TYPE_')
.apply(lambda x: pd.Series(x.unique()), axis=1)
.stack()
.to_frame('TYPE')
.reset_index()
.groupby(['TYPE', 'EVALUATED'])
.size()
.unstack(fill_value=0))
print (grouped)
EVALUATED 0 1
TYPE
123A 0 2
456 1 2
567 0 1
A234 1 0
Your solution:
grouped = (df.set_index('EVALUATED')
.filter(like='TYPE_')
.apply(lambda x: pd.Series(x.unique()), axis=1) # added: keep only unique values within each row
.stack()
.to_frame('TYPE')
.reset_index()
.pivot_table(index='TYPE', columns='EVALUATED', aggfunc='size', fill_value=0)
).reset_index()
print (grouped)
EVALUATED TYPE 0 1
0 123A 0 2
1 456 1 2
2 567 0 1
3 A234 1 0
Steps:
1) Select the columns whose names start with "TYPE_".
2) Construct a new DF from their unique values, taken row by row across those columns. Let its new index be the contents of the EVALUATED column.
3) Group by the index by specifying level=0. Using apply, flatten each group's values into a single array and take its distinct counts with value_counts. Unstack the result with fill_value=0. Finally, transpose the resulting dataframe.
TYPE_cols = df.filter(like="TYPE_")
d = pd.DataFrame([pd.unique(x) for x in TYPE_cols.values], df['EVALUATED'].values)
result = d.groupby(level=0).apply(lambda x: pd.value_counts(x.values.ravel()))
result.unstack(fill_value=0).T.reset_index().rename(columns={"index":"TYPE"})
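Another way to express "count each value once per row" is to reshape to long format, drop duplicates per original row, and cross-tabulate. This is only a sketch of an alternative, not one of the answers above; it assumes the frame has a default index (reset_index then produces a column named 'index'), and the names type_cols and long_df are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"DATA": ["WW", "AA", "BB"],
                   "TYPE_1": ["A234", "456", "456"],
                   "TYPE_2": ["456", "123A", "123A"],
                   "TYPE_3": ["456", "567", "456"],
                   "EVALUATED": [0, 1, 1]})

type_cols = [c for c in df.columns if c.startswith("TYPE_")]

long_df = (
    df.reset_index()                                  # 'index' identifies the original row
      .melt(id_vars=["index", "EVALUATED"],
            value_vars=type_cols, value_name="TYPE")  # one line per (row, TYPE_* cell)
      .drop_duplicates(["index", "TYPE"])             # count a value once per original row
)

# Rows: TYPE values, columns: EVALUATED values, cells: number of rows
grouped = pd.crosstab(long_df["TYPE"], long_df["EVALUATED"])
print(grouped)
```

For the sample data this gives 456 a count of 1 under EVALUATED 0 and 2 under EVALUATED 1, matching the desired result.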
I am joining different values from a dataset into one column in a pandas DataFrame, but there is a lot of duplication. How can I get rid of the duplicates without deleting any rows?
example:
newCol
------
123,456,129,123,123
237,438,365,432,438
using df.newCol.drop_duplicates() removes the entire rows but I want the result to be:
newCol
------
123,456,129
237,438,365,432
...
thank you
You need to split first, then apply set, and then join (note that a set does not preserve the original order):
df.newCol = df.newCol.apply(lambda x: ','.join(set(str(x).split(','))))
print (df)
newCol
0 129,123,456
1 432,365,438,237
But if the values are still in separate columns, you can apply set before joining:
print (df)
0 1 2 3 4
0 123 456 129 123 123
1 237 438 365 432 438
df = df.apply(lambda x: ','.join(set(x.astype(str))), axis=1)
print (df)
0 129,123,456
1 432,365,438,237
dtype: object
Or use unique, which keeps the original order:
df = df.apply(lambda x: ','.join((x.astype(str)).unique()), axis=1)
print (df)
0 123,456,129
1 237,438,365,432
dtype: object
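If the values are already joined into one string and the original order matters, a small sketch using dict.fromkeys keeps the first occurrence of each item. This relies on dicts preserving insertion order (Python 3.7+); the column name newCol comes from the question, the rest is illustrative.

```python
import pandas as pd

df = pd.DataFrame({"newCol": ["123,456,129,123,123",
                              "237,438,365,432,438"]})

# dict.fromkeys keeps the first occurrence of each item, in order
df["newCol"] = df["newCol"].apply(
    lambda s: ",".join(dict.fromkeys(str(s).split(",")))
)
print(df)
#             newCol
# 0      123,456,129
# 1  237,438,365,432
```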
I have the following dataframe named ttm:
usersidid clienthostid eventSumTotal LoginDaysSum score
0 12 1 60 3 1728
1 11 1 240 3 1331
3 5 1 5 3 125
4 6 1 16 2 216
2 10 3 270 3 1000
5 8 3 18 2 512
When I do
ttm.groupby(['clienthostid'], as_index=False, sort=False)['LoginDaysSum'].count()
I get what I expected (though I would've wanted the results to be under a new label named 'ratio'):
clienthostid LoginDaysSum
0 1 4
1 3 2
But when I do
ttm.groupby(['clienthostid'], as_index=False, sort=False)['LoginDaysSum'].apply(lambda x: x.iloc[0] / x.iloc[1])
I get:
0 1.0
1 1.5
Why did the labels go? I still need the 'clienthostid' column, and I also need the results of the apply to be under a label.
Sometimes when I do a groupby some of the other columns still appear. Why do columns sometimes disappear and sometimes stay? Is there a flag I'm missing that controls this?
In the example I gave, the count results were shown under the label 'LoginDaysSum'. Is there a way to add a new label for the results instead?
Thank you,
To get a DataFrame back after groupby there are 2 possible solutions:
the parameter as_index=False, which works nicely with the count, sum and mean functions
reset_index, which creates new columns from the index levels and is the more general solution
df = ttm.groupby(['clienthostid'], as_index=False, sort=False)['LoginDaysSum'].count()
print (df)
clienthostid LoginDaysSum
0 1 4
1 3 2
df = ttm.groupby(['clienthostid'], sort=False)['LoginDaysSum'].count().reset_index()
print (df)
clienthostid LoginDaysSum
0 1 4
1 3 2
For the second case, remove as_index=False and add reset_index instead:
#output is `Series`
a = ttm.groupby(['clienthostid'], sort=False)['LoginDaysSum'] \
.apply(lambda x: x.iloc[0] / x.iloc[1])
print (a)
clienthostid
1 1.0
3 1.5
Name: LoginDaysSum, dtype: float64
print (type(a))
<class 'pandas.core.series.Series'>
print (a.index)
Int64Index([1, 3], dtype='int64', name='clienthostid')
df1 = ttm.groupby(['clienthostid'], sort=False)['LoginDaysSum'] \
.apply(lambda x: x.iloc[0] / x.iloc[1]).reset_index(name='ratio')
print (df1)
clienthostid ratio
0 1 1.0
1 3 1.5
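Since the question also asks how to put results under a new label, named aggregation (available from pandas 0.25) is one option worth sketching. The output column names n_rows and ratio are made up for illustration, and ttm is rebuilt here from the table in the question.

```python
import pandas as pd

ttm = pd.DataFrame(
    {"usersidid": [12, 11, 5, 6, 10, 8],
     "clienthostid": [1, 1, 1, 1, 3, 3],
     "eventSumTotal": [60, 240, 5, 16, 270, 18],
     "LoginDaysSum": [3, 3, 3, 2, 3, 2],
     "score": [1728, 1331, 125, 216, 1000, 512]},
    index=[0, 1, 3, 4, 2, 5],
)

summary = (
    ttm.groupby("clienthostid", sort=False)
       .agg(
           # new_label=(source column, aggregation)
           n_rows=("LoginDaysSum", "count"),
           ratio=("LoginDaysSum", lambda x: x.iloc[0] / x.iloc[1]),
       )
       .reset_index()   # move clienthostid from the index back to a column
)
print(summary)
#    clienthostid  n_rows  ratio
# 0             1       4    1.0
# 1             3       2    1.5
```

Each keyword becomes the output column name, so no renaming step is needed afterwards.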
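Why are some columns gone?
I think the cause is the automatic exclusion of nuisance columns: columns the aggregation cannot handle (for example, strings under sum) are silently dropped. This describes older pandas versions; in pandas 2.0 and later the silent dropping was removed, and passing numeric_only=True gives the output shown here.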
#convert column to str
ttm.usersidid = ttm.usersidid.astype(str) + 'aa'
print (ttm)
usersidid clienthostid eventSumTotal LoginDaysSum score
0 12aa 1 60 3 1728
1 11aa 1 240 3 1331
3 5aa 1 5 3 125
4 6aa 1 16 2 216
2 10aa 3 270 3 1000
5 8aa 3 18 2 512
#the str column usersidid was removed
a = ttm.groupby(['clienthostid'], sort=False).sum()
print (a)
eventSumTotal LoginDaysSum score
clienthostid
1 321 11 3400
3 288 5 1512
What is the difference between size and count in pandas?
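As a quick illustration of that difference (a toy example, not taken from the question): size counts all rows in a group, while count only counts non-null values. The frame and column names g and v below are hypothetical.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"g": ["a", "a", "b"],
                    "v": [1.0, np.nan, 3.0]})

sizes = toy.groupby("g")["v"].size()    # rows per group, NaN included
counts = toy.groupby("g")["v"].count()  # non-null values per group

print(sizes)   # a -> 2, b -> 1
print(counts)  # a -> 1, b -> 1
```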
count is a built-in method of the groupby object, so pandas knows what to do with it. Two other things you specify determine what the output looks like: the as_index argument, and whether you select the column with single or double brackets.
# For a built-in method, when you don't want the group column as the index
# (as_index=False), pandas keeps it as a regular column.
ttm.groupby(['clienthostid'], as_index=False, sort=False)['LoginDaysSum'].count()
clienthostid LoginDaysSum
0 1 4
1 3 2
# For a built-in method, when you do want the group column as the index
# (as_index=True), the result depends on the brackets.
# Single brackets tell pandas to operate on a Series; in this case, count the Series.
ttm.groupby(['clienthostid'], as_index=True, sort=False)['LoginDaysSum'].count()
clienthostid
1 4
3 2
Name: LoginDaysSum, dtype: int64
ttm.groupby(['clienthostid'], as_index=True, sort=False)[['LoginDaysSum']].count()
# Double brackets tell pandas to operate on the dataframe made up of
# the listed columns, so it returns a dataframe.
LoginDaysSum
clienthostid
1 4
3 2
When you use apply with as_index=False, pandas no longer knows what to do with the group column. It has to trust that the function you pass to apply returns exactly what you want returned, so it just throws the group column away. Also, the single brackets around your column tell pandas to operate on a Series. Instead, use as_index=True to keep the grouping column in the index, then follow up with reset_index to move it from the index back into the dataframe. At that point it no longer matters that you used single brackets, because after reset_index you have a dataframe again.
ttm.groupby(['clienthostid'], as_index=True, sort=False)['LoginDaysSum'].apply(lambda x: x.iloc[0] / x.iloc[1])
clienthostid
1 1.0
3 1.5
Name: LoginDaysSum, dtype: float64
ttm.groupby(['clienthostid'], as_index=True, sort=False)['LoginDaysSum'].apply(lambda x: x.iloc[0] / x.iloc[1]).reset_index()
clienthostid LoginDaysSum
0 1 1.0
1 3 1.5
Reading the groupby documentation, I found that the automatic exclusion of columns after groupby is usually caused by null values in the excluded columns.
Try filling the nulls with some value.
Like this:
df.fillna('')
You simply need this instead:
ttm.groupby(['clienthostid'], as_index=False, sort=False)[['LoginDaysSum']].apply(lambda x: x.iloc[0] / x.iloc[1]).reset_index()
The double [[]] will turn the output into a pd.DataFrame instead of a pd.Series.
I have two pandas dataframes.
noclickDF = pd.DataFrame([[0, 123, 321], [0, 1543, 432]],
                         columns=['click', 'id', 'location'])
clickDF = pd.DataFrame([[1, 123, 421], [1, 1543, 436]],
                       columns=['click', 'location', 'id'])
I simply want to join such that the final DF will look like:
click | id | location
0 123 321
0 1543 432
1 421 123
1 436 1543
As you can see, the column names of both original DFs are the same, but not in the same order. Also, there is no column to join on.
You could also use pd.concat:
In [36]: pd.concat([noclickDF, clickDF], ignore_index=True)
Out[36]:
click id location
0 0 123 321
1 0 1543 432
2 1 421 123
3 1 436 1543
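Under the hood, DataFrame.append calls pd.concat.
DataFrame.append has code for handling various types of input, such as Series, tuples, lists and dicts. If you pass it a DataFrame, it passes straight through to pd.concat, so using pd.concat is a bit more direct. Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on current versions pd.concat is the only one of the two that works.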
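A self-contained version of the concat call may make the key point easier to check: pd.concat lines the frames up by column name, not by position, which is why the swapped 'location' and 'id' columns in clickDF end up in the right place. The variable name combined is only for illustration.

```python
import pandas as pd

noclickDF = pd.DataFrame([[0, 123, 321], [0, 1543, 432]],
                         columns=['click', 'id', 'location'])
clickDF = pd.DataFrame([[1, 123, 421], [1, 1543, 436]],
                       columns=['click', 'location', 'id'])

# Rows are stacked; columns are matched by name, not by position
combined = pd.concat([noclickDF, clickDF], ignore_index=True)
print(combined)
#    click    id  location
# 0      0   123       321
# 1      0  1543       432
# 2      1   421       123
# 3      1   436      1543
```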
For future users (sometime after pandas 0.23.0):
You may also need to add sort=True to sort the non-concatenation axis when it is not already aligned (i.e. to retain the OP's desired concatenation behavior). I used the code contributed above and got a warning; see Python Pandas User Warning. The code below works and does not raise a warning.
In [36]: pd.concat([noclickDF, clickDF], ignore_index=True, sort=True)
Out[36]:
click id location
0 0 123 321
1 0 1543 432
2 1 421 123
3 1 436 1543
You can use append for that (on pandas versions before 2.0, where DataFrame.append still exists)
df = noclickDF.append(clickDF)
print(df)
click id location
0 0 123 321
1 0 1543 432
0 1 421 123
1 1 436 1543
and if you need to, you can reset the index with
df = df.reset_index(drop=True)
print(df)
click id location
0 0 123 321
1 0 1543 432
2 1 421 123
3 1 436 1543