Pandas time-series groupby with custom function - python

I have a time series with several products. For each product I want to remove the leading and trailing zero rows, and in the middle I want to replace runs of exactly two consecutive 0s with np.nan. Here is an example:
Date  Id  Units  Should be
1     a   0      remove row
2     a   5      5
3     a   0      np.nan
4     a   0      np.nan
5     a   1      1
6     a   3      3
1     b   4      4
2     b   2      2
3     b   0      0
4     b   4      4
5     b   0      remove row
6     b   0      remove row
I tried using groupby and for loops to get the indexes, but I wasn't able to combine the rules.

You can use:
## PART 1: remove the external 0s
# mask of the non-zero rows
m = df['Units'].ne(0)
# masks that identify the "internal" rows of each Id
# (everything from the first to the last non-zero value)
m1 = m.groupby(df['Id']).cummax()
m2 = m[::-1].groupby(df['Id']).cummax()
# slice the internal rows
out = df[m1 & m2]

## PART 2: replace stretches of exactly two 0s
# label consecutive runs of equal mask values within each Id
g = m.ne(m.groupby(df['Id']).shift()).cumsum()
# flag rows belonging to a run of length 2
m3 = df.groupby(['Id', g])['Units'].transform('size').eq(2)
# set the zeros of those runs to NA
out.loc[m3 & ~m, 'Units'] = pd.NA
output:
   Date Id  Units Should be
1     2  a    5.0         5
2     3  a    NaN    np.nan
3     4  a    NaN    np.nan
4     5  a    1.0         1
5     6  a    3.0         3
6     1  b    4.0         4
7     2  b    2.0         2
8     3  b    0.0         0
9     4  b    4.0         4
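For reference, a minimal end-to-end sketch, assuming the sample frame from the question is rebuilt by hand (np.nan is used instead of pd.NA so Units stays a float column, matching the output above):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Date': [1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6],
    'Id': list('aaaaaa') + list('bbbbbb'),
    'Units': [0, 5, 0, 0, 1, 3, 4, 2, 0, 4, 0, 0],
})

m = df['Units'].ne(0)
m1 = m.groupby(df['Id']).cummax()
m2 = m[::-1].groupby(df['Id']).cummax()
out = df[m1 & m2].copy()          # copy so the assignment below is safe

g = m.ne(m.groupby(df['Id']).shift()).cumsum()
m3 = df.groupby(['Id', g])['Units'].transform('size').eq(2)
out.loc[m3 & ~m, 'Units'] = np.nan
print(out)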

Related

Flatten a dataframe with vector/list elements python

Let's say I have a dataframe like this:
A B C Profile
0 1 4 4 [1,2,3,4]
1 2 4 5 [2,2,4,1]
3 2 4 5 [2,2,4,1]
How can I go about making it become this:
A B C Profile[0] Profile[1] Profile[2] Profile[3]
0 1 4 4 1 2 3 4
1 2 4 5 2 2 4 1
3 2 4 5 2 2 4 1
I have tried this:
flat_list = [sublist for sublist in df['Profile']]
flat_df = pd.DataFrame(flat_list)
pd.concat([df.iloc[:,0:3], flat_df], axis=1)
But I have some NaN values, and I need to retain the original index for the flat list. This method just appends everything and moves the NaN rows to the bottom instead of matching the indices.
I.e. I end up with this:
A B C Profile[0] Profile[1] Profile[2] Profile[3]
0 1 4 4 1 2 3 4
1 2 4 5 2 2 4 1
2 NaN NaN NaN 2 2 4 1
3 2 4 5 NaN NaN NaN NaN
TIA
Change your line to pass the index:
flat_df = pd.DataFrame(flat_list, index = df.index)
out = pd.concat([df.iloc[:,0:3], flat_df], axis = 1)
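As a quick check, here is a small runnable version (the values below are assumed from the question, including the non-consecutive index):

import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 2], 'B': [4, 4, 4], 'C': [4, 5, 5],
    'Profile': [[1, 2, 3, 4], [2, 2, 4, 1], [2, 2, 4, 1]],
}, index=[0, 1, 3])   # non-consecutive index, as in the question

flat_df = pd.DataFrame(df['Profile'].tolist(), index=df.index)
flat_df.columns = [f'Profile[{i}]' for i in flat_df.columns]
out = pd.concat([df.iloc[:, 0:3], flat_df], axis=1)
print(out)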

how to get the average of values for one column based on another column value in python (pandas, jupyter)

I am using a small test dataset to verify that the right averages are being calculated.
I want to be able to get the average of the corresponding values in the 'G' column based on the filtered values in the 'T' column.
So I select the values of the 'T' column over which I want to sum the values in the 'G' column, then divide the total by the count to get an average, which is appended to a list.
However, the average is not calculated correctly. See below:
total = 0
g_avg = []
output = []
counter = 0
for i, row in df_new.iterrows():
    if row['T'] > 2:
        counter += 1
        total += row['G']
    if counter != 0 and row['T'] == 10:
        g_avg.append(total / counter)
        counter = 0
        total = 0
print(g_avg)
Below is a better set of data, as there is repetition in the 'T' values, so I would need a counter in order to get my average of the G values when the T value is in a certain range, i.e. from 2am to 10am.
Sorry, it won't let me paste the dataset directly, so I took a snip of it.
If you want the average of column G values when T is between 2 and 7:
df_new.loc[(df_new['T']>2) & (df_new['T']<7), 'G'].mean()
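Equivalently, Series.between with both bounds excluded does the same filtering:
avg = df_new.loc[df_new['T'].between(2, 7, inclusive='neither'), 'G'].mean()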
Update
It's difficult to know exactly what you want without any expected output. If you have some data that looks like this:
print(df)
T G
0 0 0
1 0 0
2 1 0
3 2 1
4 3 3
5 4 0
6 5 4
7 6 5
8 7 0
9 8 6
10 9 7
And you want something like this:
print(df)
T G
0 0 0
1 0 0
2 1 0
3 2 1
4 3 3
5 4 3
6 5 3
7 6 3
8 7 0
9 8 6
10 9 7
Then you could use boolean indexing and DataFrame.loc:
avg = df.loc[(df['T']>2) & (df['T']<7), 'G'].mean()
df.loc[(df['T']>2) & (df['T']<7), 'G'] = avg
print(df)
T G
0 0 0.0
1 0 0.0
2 1 0.0
3 2 1.0
4 3 3.0
5 4 3.0
6 5 3.0
7 6 3.0
8 7 0.0
9 8 6.0
10 9 7.0
Update 2
If you have some sample data:
print(df)
T G
0 0 1
1 2 2
2 3 3
3 3 1
4 3 2
5 10 4
6 2 5
7 2 5
8 2 5
9 10 5
Method 1: To simply get a list of those means, you could create groups for your interval and filter on m:
m = df['T'].between(0, 5, inclusive='neither')  # inclusive=False on pandas < 1.3
g = m.ne(m.shift()).cumsum()[m]
lst = df.groupby(g).mean()['G'].tolist()
print(lst)
[2.0, 5.0]
Method 2: If you want to include these means at their respective T values, then you could do this instead:
m = df['T'].between(0, 5, inclusive='neither')  # inclusive=False on pandas < 1.3
g = m.ne(m.shift()).cumsum()
df['G_new'] = df.groupby(g)['G'].transform('mean')
print(df)
T G G_new
0 0 1 1
1 2 2 2
2 3 3 2
3 3 1 2
4 3 2 2
5 10 4 4
6 2 5 5
7 2 5 5
8 2 5 5
9 10 5 5
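For reference, a self-contained sketch of both methods, assuming the sample data above:

import pandas as pd

df = pd.DataFrame({'T': [0, 2, 3, 3, 3, 10, 2, 2, 2, 10],
                   'G': [1, 2, 3, 1, 2, 4, 5, 5, 5, 5]})

m = df['T'].between(0, 5, inclusive='neither')
g = m.ne(m.shift()).cumsum()

# Method 1: one mean per contiguous in-range stretch
print(df[m].groupby(g[m])['G'].mean().tolist())   # [2.0, 5.0]

# Method 2: broadcast each stretch's mean back onto all rows
df['G_new'] = df.groupby(g)['G'].transform('mean')
print(df)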

add nan if missing consecutive values

I have a dataframe like
df2 = pandas.DataFrame(data=[[1,4],[2,2],[2,1],[5,2],[5,3]],columns=['A','B'])
df2
Out[117]:
A B
0 1 4
1 2 2
2 2 1
3 5 2
4 5 3
and I would like to add NaN rows to column B where consecutive values are missing in column A.
The dataframe should become:
df2
Out[117]:
A B
0 1 4
1 2 2
2 2 1
4 3 np.nan
5 4 np.nan
6 5 2
7 5 3
Could you please help me?
You can construct a dataframe to append, concatenate, then sort:
import numpy as np
import pandas as pd

df = pd.DataFrame(data=[[1, 4], [2, 2], [2, 1], [5, 2], [5, 3]], columns=['A', 'B'])

# construct dataframe of the missing A values to append
arr = np.arange(df['A'].min(), df['A'].max() + 1)
arr = arr[~np.isin(arr, df['A'].values)]  # np.in1d on older NumPy
df_append = pd.DataFrame({'A': arr})

# concatenate and sort
res = pd.concat([df, df_append]).sort_values('A')
print(res)
A B
0 1 4.0
1 2 2.0
2 2 1.0
0 3 NaN
1 4 NaN
3 5 2.0
4 5 3.0
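An alternative sketch (a variant of my own, not from the answer above): build the full range of A and left-merge, which keeps duplicated A rows and leaves NaN in B for the missing values:

import numpy as np
import pandas as pd

df = pd.DataFrame(data=[[1, 4], [2, 2], [2, 1], [5, 2], [5, 3]], columns=['A', 'B'])

full = pd.DataFrame({'A': np.arange(df['A'].min(), df['A'].max() + 1)})
res = full.merge(df, on='A', how='left')
print(res)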

How to fill values based on data present in column and an array? Pandas

Let's say I have a dataframe with NaNs in each group, like
df = pd.DataFrame({'data':[0,1,2,0,np.nan,2,np.nan,0,1],'group':[1,1,1,2,2,2,3,3,3]})
and a numpy array like
x = np.array([0,1,2])
Now, based on the groups, how do I fill in the missing values using the values in the numpy array, i.e.
df = pd.DataFrame({'data':[0,1,2,0,1,2,2,0,1],'group':[1,1,1,2,2,2,3,3,3]})
data group
0 0 1
1 1 1
2 2 1
3 0 2
4 1 2
5 2 2
6 2 3
7 0 3
8 1 3
Let me explain a bit of how the data should be filled. Consider group 2: the values of data are 0, np.nan, 2. The np.nan is the missing value from the array [0, 1, 2], so the value to fill in place of the NaN is 1.
For multiple NaN values, take for example a group whose data is [np.nan, 0, np.nan]; the values to be filled in place of the NaNs are 1 and 2, resulting in [1, 0, 2].
First find the value missing from the group, then pass it to fillna:
def f(y):
    # value(s) in x that are absent from this group
    a = list(set(x) - set(y))
    a = 1 if len(a) == 0 else a[0]
    return y.fillna(a)

df['data'] = df.groupby('group')['data'].apply(f).astype(int)
print(df)
data group
0 0 1
1 1 1
2 2 1
3 0 2
4 1 2
5 2 2
6 2 3
7 0 3
8 1 3
EDIT:
df = pd.DataFrame({'data': [0, 1, 2, 0, np.nan, 2, np.nan, np.nan, 1, np.nan, np.nan, np.nan],
                   'group': [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4]})
x = np.array([0,1,2])
print (df)
data group
0 0.0 1
1 1.0 1
2 2.0 1
3 0.0 2
4 NaN 2
5 2.0 2
6 NaN 3
7 NaN 3
8 1.0 3
9 NaN 4
10 NaN 4
11 NaN 4
def f(y):
    # value(s) in x that are absent from this group
    a = list(set(x) - set(y))
    if len(a) == 1:
        return y.fillna(a[0])
    elif len(a) == 2:
        # fill the first NaN with the first missing value, the rest with the second
        return y.fillna(a[0], limit=1).fillna(a[1])
    elif len(a) == 3:
        # the whole group is NaN, so use x directly
        return pd.Series(x, index=y.index)
    else:
        return y

df['data'] = df.groupby('group')['data'].apply(f).astype(int)
print(df)
data group
0 0 1
1 1 1
2 2 1
3 0 2
4 1 2
5 2 2
6 0 3
7 2 3
8 1 3
9 0 4
10 1 4
11 2 4
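The branching above hardcodes up to three missing values. A sketch of a general version (my own generalization; it assumes each group has at least as many missing candidates in x as NaN slots, and fills them in sorted order):

import numpy as np
import pandas as pd

def make_filler(x):
    def f(y):
        # candidate values not yet present in this group, smallest first
        missing = sorted(set(x) - set(y.dropna()))
        out = y.copy()
        n = out.isna().sum()
        if n:
            out.loc[out.isna()] = missing[:n]
        return out
    return f

df['data'] = df.groupby('group')['data'].apply(make_filler(x)).astype(int)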

drop all rows after first occurrence of NaN in specific column (pandas)

I am trying to use the dropna function in pandas. I would like to use it for a specific column.
I can only figure out how to use it to drop NaN if ALL rows have ALL NaN values.
I have a dataframe (see below) from which I would like to drop all rows after the first occurrence of a NaN in a specific column, column "A".
My current code, which only works if ALL values in a row are NaN:
data.dropna(axis = 0, how = 'all')
data
Original Dataframe
data = pd.DataFrame({"A": (1, 2, 3, 4, 5, 6, 7, np.nan, np.nan, np.nan),
                     "B": (1, 2, 3, 4, 5, 6, 7, np.nan, 9, 10),
                     "C": range(10)})
data
A B C
0 1 1 0
1 2 2 1
2 3 3 2
3 4 4 3
4 5 5 4
5 6 6 5
6 7 7 6
7 NaN NaN 7
8 NaN 9 8
9 NaN 10 9
What I would like the output to look like:
A B C
0 1 1 0
1 2 2 1
2 3 3 2
3 4 4 3
4 5 5 4
5 6 6 5
6 7 7 6
Any help on this is appreciated.
Obviously I would like to do it in the cleanest, most efficient way possible.
Thanks!
Use iloc + argmax:
data.iloc[:data.A.isnull().values.argmax()]
A B C
0 1.0 1 0
1 2.0 2 1
2 3.0 3 2
3 4.0 4 3
4 5.0 5 4
5 6.0 6 5
6 7.0 7 6
or with a different syntax
top_data = data[:data['A'].isnull().argmax()]
Re: the accepted answer: if the column in question has no NaNs, argmax returns 0 and thus df[:argmax] will return an empty dataframe.
Here's my workaround:
max_ = data.A.isnull().argmax()
max_ = len(data) if max_ == 0 else max_
top_data = data[:max_]
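Note that this workaround still misfires when the very first row is NaN (argmax is then legitimately 0). A sketch that handles both edge cases by checking whether any NaN exists at all:

mask = data['A'].isnull()
cut = mask.to_numpy().argmax() if mask.any() else len(data)
top_data = data.iloc[:cut]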
