I'm trying to get the maximum count of consecutive 0 values per id from a pandas data frame with id, date, value columns that looks like this:
id date value
354 2019-03-01 0
354 2019-03-02 0
354 2019-03-03 0
354 2019-03-04 5
354 2019-03-05 5
354 2019-03-09 7
354 2019-03-10 0
357 2019-03-01 5
357 2019-03-02 5
357 2019-03-03 8
357 2019-03-04 0
357 2019-03-05 0
357 2019-03-06 7
357 2019-03-07 7
540 2019-03-02 7
540 2019-03-03 8
540 2019-03-04 9
540 2019-03-05 8
540 2019-03-06 7
540 2019-03-07 5
540 2019-03-08 2
540 2019-03-09 3
540 2019-03-10 2
The desired result is grouped by id and looks like this:
id max_consecutive_zeros
354 3
357 2
540 0
I've achieved what I want with a for loop, but it gets really slow on huge pandas dataframes. I've found some similar solutions, but they didn't work for my problem at all.
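For reference, here is a minimal reconstruction of the sample frame above (the snippets below assume it is called df; the exact dtypes are an assumption):
import pandas as pd

# Hypothetical reconstruction of the sample data shown above
df = pd.DataFrame({
    "id": [354]*7 + [357]*7 + [540]*9,
    "date": pd.to_datetime(
        ["2019-03-01", "2019-03-02", "2019-03-03", "2019-03-04",
         "2019-03-05", "2019-03-09", "2019-03-10",
         "2019-03-01", "2019-03-02", "2019-03-03", "2019-03-04",
         "2019-03-05", "2019-03-06", "2019-03-07",
         "2019-03-02", "2019-03-03", "2019-03-04", "2019-03-05",
         "2019-03-06", "2019-03-07", "2019-03-08", "2019-03-09", "2019-03-10"]),
    "value": [0, 0, 0, 5, 5, 7, 0,
              5, 5, 8, 0, 0, 7, 7,
              7, 8, 9, 8, 7, 5, 2, 3, 2],
})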
Create a group ID m for consecutive rows of the same value. Next, groupby on id and m, call value_counts, and use .loc on the multiindex to slice only the 0 values of the right-most index level. Finally, filter out duplicated id index entries with duplicated and reindex to create a 0 count for any id that has no zeros.
m = df.value.diff().ne(0).cumsum().rename('gid')
# Consecutive rows having the same value are assigned the same group number by this command.
# This is how we identify a run of consecutive rows with the same value, hence the name groupID.
df1 = df.groupby(['id', m]).value.value_counts().loc[:,:,0].droplevel(-1)
# This groupby puts consecutive rows of the same value, per id, into separate groups.
# Within each group, count the occurrences of each value and use `.loc` to pick only `0`, because we only care about the count of value `0`.
df1[~df1.index.duplicated()].reindex(df.id.unique(), fill_value=0)
# There may be several runs of value `0` per `id`; we want only the run with the highest count.
# `value_counts` already sorts counts in descending order, so we just need to pick
# the top one among duplicates by slicing on the True/False mask from `duplicated`.
# Finally, `reindex` adds a 0 for any `id` that has no value 0 in the original `df`.
# Note: `id` is the column `id` in `df`; it is different from the groupID `m` we created for the groupby.
Out[315]:
id
354 3
357 2
540 0
Name: value, dtype: int64
Here is one way: we need to create an additional key for the groupby, then we just need to group by this key and id.
s = df.groupby('id').value.apply(lambda x: x.ne(0).cumsum())
df[df.value == 0].groupby([df.id, s]).size().max(level=0).reindex(df.id.unique(), fill_value=0)
Out[267]:
id
354 3
357 2
540 0
dtype: int64
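Note that Series.max(level=0) has been deprecated and removed in newer pandas releases; a sketch of the equivalent using a groupby on the index level (reusing the s defined above):
(df[df.value == 0].groupby([df.id, s]).size()
   .groupby(level=0).max()
   .reindex(df.id.unique(), fill_value=0))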
You could do:
import numpy as np

df.groupby('id').value.apply(
    lambda x: (x.diff() != 0).cumsum().where(x == 0, np.nan).value_counts().max()
).fillna(0)
Output
id
354 3.0
357 2.0
540 0.0
Name: value, dtype: float64
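The counts come back as floats because of the NaN placeholder used for ids with no zeros; if integer counts are preferred, a small follow-up (not part of the original answer) is to cast after the fillna:
counts = df.groupby('id').value.apply(
    lambda x: (x.diff() != 0).cumsum().where(x == 0, np.nan).value_counts().max()
).fillna(0).astype(int)   # "counts" is just an illustrative name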
I have a 4-column data frame with numerical values and NaN. What I need is to put the largest numbers in the first columns, so that the first column always has the maximum value of each row and the second column the next largest value. Here is my attempt:
for x in Exapand_re_metrs[0]:
    for y in Exapand_re_metrs[1]:
        for z in Exapand_re_metrs[2]:
            for a in Exapand_re_metrs[3]:
                lista = [x, y, z, a]
                lista.sort()
                df["AREA_Mayor"] = lista[0]
                df["AREA_Menor"] = lista[1]
I'm not quite sure what you want to do, but here is a solution according to what I understood.
From what I see, you have a dataframe with several columns and you would like its values gathered into a single column ordered from highest to lowest, so I will create a dataframe with almost the same characteristics as follows:
import pandas as pd
import numpy as np
cols = 3
rows = 4
np.random.seed(0)
df = pd.DataFrame(np.random.randint(0, 1000, (rows, cols)), columns= ["A","B","C"])
print(df)
A B C
0 684 559 629
1 192 835 763
2 707 359 9
3 723 277 754
Now I will gather all the columns into a single column and organize the values in descending order like this:
data = df.to_numpy().flatten()
data = pd.DataFrame(data)
data.sort_values(by=[0],ascending=False)
As a result we obtain an n x 1 frame where the values are in descending order:
0
4 835
5 763
11 754
9 723
6 707
0 684
2 629
1 559
7 359
10 277
3 192
8 9
Note: this code fragment should be adapted to your script; I didn't do that because I don't know your dataset. Also, my English is not that good, sorry for any grammatical errors.
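Alternatively, if the intent is to keep one row per original row but with each row's values sorted in descending order across the columns, here is a vectorized sketch (my reading of the question, not part of the original answer; NaN handling would need extra care because np.sort places NaN last):
import numpy as np
import pandas as pd

# Sort each row's values, then reverse so the largest value lands in the first column
sorted_vals = np.sort(df.to_numpy(), axis=1)[:, ::-1]
df_sorted = pd.DataFrame(sorted_vals, columns=df.columns, index=df.index)
print(df_sorted)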
So I want to cluster the records in this table to find which records are 'similar' (i.e. have enough in common). An example of the table is as follows:
author beginpage endpage volume publication year id_old id_new
0 NaN 495 497 NaN 1975 1 1
1 NaN 306 317 14 1997 2 2
2 lowry 265 275 193 1951 3 3
3 smith p k 76 85 150 1985 4 4
4 NaN 248 254 NaN 1976 5 5
5 hamill p 85 100 391 1981 6 6
6 NaN 1513 1523 7 1979 7 7
7 b oregan 737 740 353 1991 8 8
8 NaN 503 517 98 1975 9 9
9 de wijs 503 517 98 1975 10 10
In this small table, the last row should get 'id_new' equal to 9, to show that these two records are similar.
To make this happen I wrote the code below, which works fine for a small number of records. However, I want to use my code for a table with 15000 records. And of course, if you do the maths, with this code this is going to take way too long.
Anyone who could help me make this code more efficient? Thanks in advance!
My code, where 'dfhead' is the table with the records:
# c is presumably the list of columns being compared (not shown in the question)
for r in range(0, len(dfhead)):
    for o_r in range(r+1, len(dfhead)):
        if (dfhead.loc[r, c] == dfhead.loc[o_r, c]).sum() >= 3:
            if (dfhead.loc[o_r, ['id_new']] > dfhead.loc[r, ['id_new']]).sum() == 1:
                dfhead.loc[o_r, ['id_new']] = dfhead.loc[r, ['id_new']]
If you are only trying to detect exact equality across "beginpage", "endpage", "volume", "publication" and "year", you should try working with duplicates. I'm not sure about this, as your code is still a mystery to me.
Something like this might work (your column "id" needs to be named "id_old" at first in the dataframe though):
cols = ["beginpage", "endpage","volume", "publication", "year"]
#isolate duplicated rows
duplicated = df[df.duplicated(cols, keep=False)]
#find the minimum key to keep
temp = duplicated.groupby(cols, as_index=False)['id_old'].min()
temp.rename({'id_old':'id_new'}, inplace=True, axis=1)
#import the "minimum key" to duplicated by merging the dataframes
duplicated = duplicated.merge(temp, on=cols, how="left")
#gather the "un-duplicated" rows
unduplicated = df[~df.duplicated(cols, keep=False)]
#concatenate both datasets and reset the index
new_df = pd.concat([unduplicated, duplicated])
new_df.reset_index(drop=True, inplace=True)
#where "id_new" is empty, then the data comes from "unduplicated"
#and you could fill the datas from id_old
ix = new_df[new_df.id_new.isnull()].index
new_df.loc[ix, 'id_new'] = new_df.loc[ix, 'id_old']
I would like to extract all rows with the min and max value of a specific column.
Here is a piece of my df:
id time value
1 16:23:37.006155 406
2 16:23:37.320417 410
3 16:23:37.917598 415
4 16:23:51.049987 420
5 16:23:52.595148 425
6 16:27:13.880722 430
7 16:27:17.258117 435
8 16:28:31.529722 455
9 16:28:37.640527 460
10 16:28:47.782197 405
11 16:28:48.085 410
The goal is to create another df with the time and value columns, under these conditions:
Save the first value
If the value is less than the previous value, then save it and the previous one too.
So I tried this:
df['BeforeDiff'] = df['value'] < df['value'].shift()
But that only gives me the minimums, without the first row.
In other words, I would like the minimums and maximums of each increasing sequence, so I can then take differences on the time column. The result must be:
id time value
1 16:23:37.006155 406
9 16:28:37.640527 460
10 16:28:47.782197 405
Thanks for your time !
Assuming that the id in your dataframe is an actual column, and not the index:
df.loc[(df['id'] == 1)|(df.value.diff()<0)|((df.value.diff()<0).shift(-1))]
out:
id time value
0 1 16:23:37.006155 406
8 9 16:28:37.640527 460
9 10 16:28:47.782197 405
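For readability, the same condition can be split into named masks (an equivalent sketch; the intermediate names are just illustrative):
drops = df['value'].diff() < 0                      # rows where the value decreased
before_drops = drops.shift(-1, fill_value=False)    # the row right before each decrease
first_row = df['id'] == 1                           # always keep the first value
result = df.loc[first_row | drops | before_drops]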
Following on from this question, I have this dataframe:
ChildID MotherID preWeight
0 20 455 3500
1 20 455 4040
2 13 102 NaN
3 702 946 5000
4 82 571 2000
5 82 571 3500
6 82 571 3800
where I transformed feature 'preWeight' that has multiple observations per MotherID to feature 'preMacro' with a single observation per MotherID, based on the following rules:
if preWeight>=4000 for a particular MotherID, I assigned preMacro a value of "Yes" regardless of the remaining observations
Otherwise I assigned preMacro a value of "No"
Using this line of code:
df.groupby(['ChildID','MotherID']).agg(lambda x: 'Yes' if (x>4000).any() else 'No').reset_index().rename(columns={"preWeight": "preMacro"})
However, I realised that this way I am not preserving the NaN values in the dataset, which ideally should be imputed rather than just assigning them "No" values. So I tried changing the above line to:
df=df.groupby(['MotherID', 'ChildID'])['preWeight'].agg(
lambda x: 'Yes' if (x>4000).any() else (np.NaN if 'no_value' in x.values.all() else 'No')).reset_index().rename(
columns={"preWeight": "preMacro"})
I wanted this line to transform the above dataframe to this:
ChildID MotherID preMacro
0 20 455 Yes
1 13 102 NaN
2 702 946 Yes
3 82 571 No
However I got this error when running it:
TypeError: argument of type 'float' is not iterable
I understand that, in the case of non-missing values, x.values.all() returns a float, which is not iterable, but I am not sure how else to code this. Any ideas?
Thanks.
For performance, don't test inside a custom function per group; it is better to aggregate helper boolean-mask columns with GroupBy.agg using GroupBy.any and GroupBy.all, and then set the preMacro column with numpy.select:
import numpy as np

df = (df.assign(testconst=df['preWeight'] > 4000,
                testna=df['preWeight'].notna())
        .groupby(['ChildID','MotherID'], sort=False)
        .agg({'testconst':'any', 'testna':'all'}))

masks = [df['testconst'] & df['testna'], df['testconst'] | df['testna']]
df['preMacro'] = np.select(masks, ['Yes','No'], default=None)
df = df.drop(['testconst','testna'], axis=1).reset_index()
print (df)
ChildID MotherID preMacro
0 20 455 Yes
1 13 102 None <- None is used instead of np.NaN to avoid converting NaN to the string 'nan'
2 702 946 Yes
3 82 571 No
If the DataFrame is small or performance is not important:
f = lambda x: 'Yes' if (x>4000).any() else ('No' if x.notna().all() else np.NaN)
df1 = (df.groupby(['ChildID','MotherID'], sort=False)['preWeight']
.agg(f)
.reset_index(name='preMacro'))
print (df1)
ChildID MotherID preMacro
0 20 455 Yes
1 13 102 NaN
2 702 946 Yes
3 82 571 No
I have a dataframe df that looks like this:
id Category Time
1 176 12 00:00:00
2 4956 2 00:00:00
3 583 4 00:00:04
4 9395 2 00:00:24
5 176 12 00:03:23
which is basically a set of ids and the category of item they used at a particular Time. I use df.groupby('id') and then I want to see whether they used the same category as before or a different one, and assign True or False respectively (or NaN if that was the first item for that particular id). I also filtered the data to remove all the ids with only one Time.
For example one of the groups may look like
id Category Time
1 176 12 00:00:00
2 176 12 00:03:23
3 176 2 00:04:34
4 176 2 00:04:54
5 176 2 00:05:23
and I want to perform an operation to get
id Category Time Transition
1 176 12 00:00:00 NaN
2 176 12 00:03:23 False
3 176 2 00:04:34 True
4 176 2 00:04:54 False
5 176 2 00:05:23 False
I thought about applying some sort of function to the Category column after the groupby, but I am having trouble figuring out the right one.
You don't need a groupby here; you just need sort and shift.
import numpy as np

df.sort_values(['id', 'Time'], inplace=True)
df['Transition'] = df.Category != df.Category.shift(1)
df.loc[df.id != df.id.shift(1), 'Transition'] = np.nan
I haven't tested this, but it should do the trick.
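A quick check against the example group above (a sketch; the frame construction is an assumption based on the sample shown, and Time is kept as strings since HH:MM:SS sorts correctly lexicographically):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "id": [176, 176, 176, 176, 176],
    "Category": [12, 12, 2, 2, 2],
    "Time": ["00:00:00", "00:03:23", "00:04:34", "00:04:54", "00:05:23"],
})

df.sort_values(['id', 'Time'], inplace=True)
df['Transition'] = df.Category != df.Category.shift(1)
df.loc[df.id != df.id.shift(1), 'Transition'] = np.nan
print(df)   # Transition should read: NaN, False, True, False, False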