Pandas time-series groupby with custom function - python

I have a time series with several products. For each product I want to remove the leading and trailing zero rows, and in the middle I want to replace runs of exactly two consecutive 0s with np.nan. Here is an example:
Date  Id  Units  Should be
1     a   0      remove row
2     a   5      5
3     a   0      np.nan
4     a   0      np.nan
5     a   1      1
6     a   3      3
1     b   4      4
2     b   2      2
3     b   0      0
4     b   4      4
5     b   0      remove row
6     b   0      remove row
I tried using groupby and for loops to get the indexes, but I wasn't able to combine the rules.

You can use:
## PART 1: remove the external 0s
# mask of the non-zero rows
m = df['Units'].ne(0)
# masks that identify the "internal" rows of each Id
# (everything from the first to the last non-zero value)
m1 = m.groupby(df['Id']).cummax()
m2 = m[::-1].groupby(df['Id']).cummax()
# slice the internal rows
out = df[m1 & m2]

## PART 2: replace stretches of exactly two 0s
# label consecutive runs of equal mask values within each Id
g = m.ne(m.groupby(df['Id']).shift()).cumsum()
# flag rows belonging to a run of length 2
m3 = df.groupby(['Id', g])['Units'].transform('size').eq(2)
# set the zeros of those runs to NA
out.loc[m3 & ~m, 'Units'] = pd.NA
output:
   Date Id  Units Should be
1     2  a    5.0         5
2     3  a    NaN    np.nan
3     4  a    NaN    np.nan
4     5  a    1.0         1
5     6  a    3.0         3
6     1  b    4.0         4
7     2  b    2.0         2
8     3  b    0.0         0
9     4  b    4.0         4
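For reference, a minimal end-to-end sketch, assuming the sample frame from the question is rebuilt by hand (np.nan is used instead of pd.NA so Units stays a float column, matching the output above):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Date': [1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6],
    'Id': list('aaaaaa') + list('bbbbbb'),
    'Units': [0, 5, 0, 0, 1, 3, 4, 2, 0, 4, 0, 0],
})

m = df['Units'].ne(0)
m1 = m.groupby(df['Id']).cummax()
m2 = m[::-1].groupby(df['Id']).cummax()
out = df[m1 & m2].copy()          # copy so the assignment below is safe

g = m.ne(m.groupby(df['Id']).shift()).cumsum()
m3 = df.groupby(['Id', g])['Units'].transform('size').eq(2)
out.loc[m3 & ~m, 'Units'] = np.nan
print(out)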

Related

Flatten a dataframe with vector/list elements python

Let's say I have a dataframe like this:
A B C Profile
0 1 4 4 [1,2,3,4]
1 2 4 5 [2,2,4,1]
3 2 4 5 [2,2,4,1]
How can I go about making it become this:
A B C Profile[0] Profile[1] Profile[2] Profile[3]
0 1 4 4 1 2 3 4
1 2 4 5 2 2 4 1
3 2 4 5 2 2 4 1
I have tried this:
flat_list = [sublist for sublist in df['Profile']]
flat_df = pd.DataFrame(flat_list)
pd.concat([df.iloc[:,0:3], flat_df], axis=1)
But I have some NaN values, and I need to retain the original index for the flat list. This method just appends everything and moves the NaN rows to the bottom instead of matching the indices.
I.e. I end up with this:
A B C Profile[0] Profile[1] Profile[2] Profile[3]
0 1 4 4 1 2 3 4
1 2 4 5 2 2 4 1
2 NaN NaN NaN 2 2 4 1
3 2 4 5 NaN NaN NaN NaN
TIA
Change your line to pass the index:
flat_df = pd.DataFrame(flat_list, index = df.index)
out = pd.concat([df.iloc[:,0:3], flat_df], axis = 1)
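As a quick check, here is a small runnable version (the values below are assumed from the question, including the non-consecutive index):

import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 2], 'B': [4, 4, 4], 'C': [4, 5, 5],
    'Profile': [[1, 2, 3, 4], [2, 2, 4, 1], [2, 2, 4, 1]],
}, index=[0, 1, 3])   # non-consecutive index, as in the question

flat_df = pd.DataFrame(df['Profile'].tolist(), index=df.index)
flat_df.columns = [f'Profile[{i}]' for i in flat_df.columns]
out = pd.concat([df.iloc[:, 0:3], flat_df], axis=1)
print(out)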

how to get the average of values for one column based on another column value in python (pandas, jupyter)

I am using a small test dataset to verify that the right averages are being calculated.
I want to be able to get the average of the corresponding values in the 'G' column based on the filtered values in the 'T' column.
So I select the values of the 'T' column over which I want to sum the values in the 'G' column, then divide the total by the count to get an average, which is appended to a list.
However, the average is not calculated correctly. See below:
total = 0
g_avg = []
output = []
counter = 0
for i, row in df_new.iterrows():
    if row['T'] > 2:
        counter += 1
        total += row['G']
    if counter != 0 and row['T'] == 10:
        g_avg.append(total / counter)
        counter = 0
        total = 0
print(g_avg)
Below is a better set of data, as there is repetition in the 'T' values, so I would need a counter in order to get my average of the G values when the T value is in a certain range, i.e. from 2am to 10am.
Sorry, it won't let me paste the dataset directly, so I took a snip of it.
If you want the average of column G values when T is between 2 and 7:
df_new.loc[(df_new['T']>2) & (df_new['T']<7), 'G'].mean()
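Equivalently, Series.between with both bounds excluded does the same filtering:
avg = df_new.loc[df_new['T'].between(2, 7, inclusive='neither'), 'G'].mean()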
Update
It's difficult to know exactly what you want without any expected output. If you have some data that looks like this:
print(df)
T G
0 0 0
1 0 0
2 1 0
3 2 1
4 3 3
5 4 0
6 5 4
7 6 5
8 7 0
9 8 6
10 9 7
And you want something like this:
print(df)
T G
0 0 0
1 0 0
2 1 0
3 2 1
4 3 3
5 4 3
6 5 3
7 6 3
8 7 0
9 8 6
10 9 7
Then you could use boolean indexing and DataFrame.loc:
avg = df.loc[(df['T']>2) & (df['T']<7), 'G'].mean()
df.loc[(df['T']>2) & (df['T']<7), 'G'] = avg
print(df)
T G
0 0 0.0
1 0 0.0
2 1 0.0
3 2 1.0
4 3 3.0
5 4 3.0
6 5 3.0
7 6 3.0
8 7 0.0
9 8 6.0
10 9 7.0
Update 2
If you have some sample data:
print(df)
T G
0 0 1
1 2 2
2 3 3
3 3 1
4 3 2
5 10 4
6 2 5
7 2 5
8 2 5
9 10 5
Method 1: To simply get a list of those means, you could create groups for your interval and filter on m:
m = df['T'].between(0, 5, inclusive='neither')  # inclusive=False on pandas < 1.3
g = m.ne(m.shift()).cumsum()[m]
lst = df.groupby(g).mean()['G'].tolist()
print(lst)
[2.0, 5.0]
Method 2: If you want to include these means at their respective T values, then you could do this instead:
m = df['T'].between(0, 5, inclusive='neither')  # inclusive=False on pandas < 1.3
g = m.ne(m.shift()).cumsum()
df['G_new'] = df.groupby(g)['G'].transform('mean')
print(df)
T G G_new
0 0 1 1
1 2 2 2
2 3 3 2
3 3 1 2
4 3 2 2
5 10 4 4
6 2 5 5
7 2 5 5
8 2 5 5
9 10 5 5
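For reference, a self-contained sketch of both methods, assuming the sample data above:

import pandas as pd

df = pd.DataFrame({'T': [0, 2, 3, 3, 3, 10, 2, 2, 2, 10],
                   'G': [1, 2, 3, 1, 2, 4, 5, 5, 5, 5]})

m = df['T'].between(0, 5, inclusive='neither')
g = m.ne(m.shift()).cumsum()

# Method 1: one mean per contiguous in-range stretch
print(df[m].groupby(g[m])['G'].mean().tolist())   # [2.0, 5.0]

# Method 2: broadcast each stretch's mean back onto all rows
df['G_new'] = df.groupby(g)['G'].transform('mean')
print(df)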

add nan if missing consecutive values

I have a dataframe like
df2 = pandas.DataFrame(data=[[1,4],[2,2],[2,1],[5,2],[5,3]],columns=['A','B'])
df2
Out[117]:
A B
0 1 4
1 2 2
2 2 1
3 5 2
4 5 3
and I would like to add NaN rows to column B where consecutive values are missing in column A.
The dataframe should become:
df2
Out[117]:
A B
0 1 4
1 2 2
2 2 1
4 3 np.nan
5 4 np.nan
6 5 2
7 5 3
Could you please help me?
You can construct a dataframe to append, concatenate, then sort:
import numpy as np
import pandas as pd

df = pd.DataFrame(data=[[1, 4], [2, 2], [2, 1], [5, 2], [5, 3]], columns=['A', 'B'])

# construct dataframe of the missing A values to append
arr = np.arange(df['A'].min(), df['A'].max() + 1)
arr = arr[~np.isin(arr, df['A'].values)]  # np.in1d on older NumPy
df_append = pd.DataFrame({'A': arr})

# concatenate and sort
res = pd.concat([df, df_append]).sort_values('A')
print(res)
A B
0 1 4.0
1 2 2.0
2 2 1.0
0 3 NaN
1 4 NaN
3 5 2.0
4 5 3.0
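An alternative sketch (a variant of my own, not from the answer above): build the full range of A and left-merge, which keeps duplicated A rows and leaves NaN in B for the missing values:

import numpy as np
import pandas as pd

df = pd.DataFrame(data=[[1, 4], [2, 2], [2, 1], [5, 2], [5, 3]], columns=['A', 'B'])

full = pd.DataFrame({'A': np.arange(df['A'].min(), df['A'].max() + 1)})
res = full.merge(df, on='A', how='left')
print(res)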

How to fill values based on data present in column and an array? Pandas

Let's say I have a dataframe with NaNs in each group, like
df = pd.DataFrame({'data':[0,1,2,0,np.nan,2,np.nan,0,1],'group':[1,1,1,2,2,2,3,3,3]})
and a numpy array like
x = np.array([0,1,2])
Now, based on the groups, how do I fill in the missing values using the values in the numpy array, i.e.
df = pd.DataFrame({'data':[0,1,2,0,1,2,2,0,1],'group':[1,1,1,2,2,2,3,3,3]})
data group
0 0 1
1 1 1
2 2 1
3 0 2
4 1 2
5 2 2
6 2 3
7 0 3
8 1 3
Let me explain a bit of how the data should be filled. Consider group 2: the values of data are 0, np.nan, 2. The np.nan is the missing value from the array [0, 1, 2], so the value to fill in place of the NaN is 1.
For multiple NaN values, take for example a group whose data is [np.nan, 0, np.nan]; the values to be filled in place of the NaNs are 1 and 2, resulting in [1, 0, 2].
First find the value missing from the group, then pass it to fillna:
def f(y):
    # value(s) in x that are absent from this group
    a = list(set(x) - set(y))
    a = 1 if len(a) == 0 else a[0]
    return y.fillna(a)

df['data'] = df.groupby('group')['data'].apply(f).astype(int)
print(df)
data group
0 0 1
1 1 1
2 2 1
3 0 2
4 1 2
5 2 2
6 2 3
7 0 3
8 1 3
EDIT:
df = pd.DataFrame({'data': [0, 1, 2, 0, np.nan, 2, np.nan, np.nan, 1, np.nan, np.nan, np.nan],
                   'group': [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4]})
x = np.array([0,1,2])
print (df)
data group
0 0.0 1
1 1.0 1
2 2.0 1
3 0.0 2
4 NaN 2
5 2.0 2
6 NaN 3
7 NaN 3
8 1.0 3
9 NaN 4
10 NaN 4
11 NaN 4
def f(y):
    # value(s) in x that are absent from this group
    a = list(set(x) - set(y))
    if len(a) == 1:
        return y.fillna(a[0])
    elif len(a) == 2:
        # fill the first NaN with the first missing value, the rest with the second
        return y.fillna(a[0], limit=1).fillna(a[1])
    elif len(a) == 3:
        # the whole group is NaN, so use x directly
        return pd.Series(x, index=y.index)
    else:
        return y

df['data'] = df.groupby('group')['data'].apply(f).astype(int)
print(df)
data group
0 0 1
1 1 1
2 2 1
3 0 2
4 1 2
5 2 2
6 0 3
7 2 3
8 1 3
9 0 4
10 1 4
11 2 4
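The branching above hardcodes up to three missing values. A sketch of a general version (my own generalization; it assumes each group has at least as many missing candidates in x as NaN slots, and fills them in sorted order):

import numpy as np
import pandas as pd

def make_filler(x):
    def f(y):
        # candidate values not yet present in this group, smallest first
        missing = sorted(set(x) - set(y.dropna()))
        out = y.copy()
        n = out.isna().sum()
        if n:
            out.loc[out.isna()] = missing[:n]
        return out
    return f

df['data'] = df.groupby('group')['data'].apply(make_filler(x)).astype(int)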

drop all rows after first occurrence of NaN in specific column (pandas)

I am trying to use the dropna function in pandas. I would like to use it for a specific column.
I can only figure out how to use it to drop NaN if ALL rows have ALL NaN values.
I have a dataframe (see below) from which I would like to drop all rows after the first occurrence of a NaN in a specific column, column "A".
My current code, which only works if ALL values in a row are NaN:
data.dropna(axis = 0, how = 'all')
data
Original Dataframe
data = pd.DataFrame({"A": (1, 2, 3, 4, 5, 6, 7, np.nan, np.nan, np.nan),
                     "B": (1, 2, 3, 4, 5, 6, 7, np.nan, 9, 10),
                     "C": range(10)})
data
A B C
0 1 1 0
1 2 2 1
2 3 3 2
3 4 4 3
4 5 5 4
5 6 6 5
6 7 7 6
7 NaN NaN 7
8 NaN 9 8
9 NaN 10 9
What I would like the output to look like:
A B C
0 1 1 0
1 2 2 1
2 3 3 2
3 4 4 3
4 5 5 4
5 6 6 5
6 7 7 6
Any help on this is appreciated.
Obviously I would like to do it in the cleanest, most efficient way possible.
Thanks!
Use iloc + argmax:
data.iloc[:data.A.isnull().values.argmax()]
A B C
0 1.0 1 0
1 2.0 2 1
2 3.0 3 2
3 4.0 4 3
4 5.0 5 4
5 6.0 6 5
6 7.0 7 6
or with a different syntax
top_data = data[:data['A'].isnull().argmax()]
Re: the accepted answer: if the column in question has no NaNs, argmax returns 0 and thus df[:argmax] will return an empty dataframe.
Here's my workaround:
max_ = data.A.isnull().argmax()
max_ = len(data) if max_ == 0 else max_
top_data = data[:max_]
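Note that this workaround still misfires when the very first row is NaN (argmax is then legitimately 0). A sketch that handles both edge cases by checking whether any NaN exists at all:

mask = data['A'].isnull()
cut = mask.to_numpy().argmax() if mask.any() else len(data)
top_data = data.iloc[:cut]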
