Following up on this answer: Is there a way to do a weight-average rolling sum over a grouping?
rsum = pd.rolling_apply(g.values,p,lambda x: np.nansum(w*x),min_periods=p)
rolling_apply is deprecated now. How would you change this to work with current functionality?
As of pandas 0.18+, use Series.rolling.apply:
w = np.array([0.1,0.1,0.2,0.6])
df.groupby('ID').VALUE.apply(
    lambda s: s.rolling(window=4).apply(lambda x: np.dot(x, w), raw=False))
0 NaN
1 NaN
2 NaN
3 146.0
4 166.0
5 NaN
6 NaN
7 NaN
8 2.5
9 NaN
10 NaN
11 NaN
12 35.5
13 21.4
14 NaN
15 NaN
16 NaN
17 8.3
18 9.8
19 NaN
Name: VALUE, dtype: float64
The raw argument is new in 0.23 (raw=False passes a Series to the function, raw=True passes a NumPy array), so remove it if you're having trouble on older versions.
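As a side note, raw=True passes plain NumPy arrays to the inner function instead of Series, which is usually faster for np.dot. Below is a minimal, self-contained sketch of the same pattern with made-up data (the values are not from the linked question):

import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
                   'VALUE': [10, 20, 30, 40, 50, 1, 2, 3, 4, 5]})
w = np.array([0.1, 0.1, 0.2, 0.6])

# Weighted rolling sum of the last 4 values within each ID group;
# the first 3 rows of each group are NaN because the window is incomplete.
rsum = df.groupby('ID').VALUE.apply(
    lambda s: s.rolling(window=4).apply(lambda a: np.dot(a, w), raw=True))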
I'm working on this raw data frame that needs some cleaning. So far, I have transformed this xlsx file into this pandas dataframe:
print(df.head(16))
date technician alkalinity colour uv ph turbidity \
0 2020-02-01 00:00:00 Catherine 24.5 33 0.15 7.24 1.53
1 Unnamed: 2 NaN NaN NaN NaN NaN 2.31
2 Unnamed: 3 NaN NaN NaN NaN NaN 2.08
3 Unnamed: 4 NaN NaN NaN NaN NaN 2.2
4 Unnamed: 5 Michel 24 35 0.152 7.22 1.59
5 Unnamed: 6 NaN NaN NaN NaN NaN 1.66
6 Unnamed: 7 NaN NaN NaN NaN NaN 1.71
7 Unnamed: 8 NaN NaN NaN NaN NaN 1.53
8 2020-02-02 00:00:00 Catherine 24 NaN 0.145 7.21 1.44
9 Unnamed: 10 NaN NaN NaN NaN NaN 1.97
10 Unnamed: 11 NaN NaN NaN NaN NaN 1.91
11 Unnamed: 12 NaN NaN 33.0 NaN NaN 2.07
12 Unnamed: 13 Michel 24 34 0.15 7.24 1.76
13 Unnamed: 14 NaN NaN NaN NaN NaN 1.84
14 Unnamed: 15 NaN NaN NaN NaN NaN 1.72
15 Unnamed: 16 NaN NaN NaN NaN NaN 1.85
temperature
0 3
1 NaN
2 NaN
3 NaN
4 3
5 NaN
6 NaN
7 NaN
8 3
9 NaN
10 NaN
11 NaN
12 3
13 NaN
14 NaN
15 NaN
From here, I want to combine the rows so that I only have one row for each date. The values in each row will be the mean of the respective columns, i.e.:
print(new_df.head(2))
date time alkalinity colour uv ph turbidity temperature
0 2020-02-01 00:00:00 24.25 34 0.151 7.23 1.83 3
1 2020-02-02 00:00:00 24 33.5 0.148 7.23 1.82 3
How can I accomplish this when I have Unnamed values in my date column? Thanks!
Try setting the values to NaN and then use ffill:
df.loc[df.date.str.contains('Unnamed', na=False), 'date'] = np.nan
df.date = df.date.ffill()
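To then collapse to one row per date, a follow-up along these lines should work (the aggregation step is my addition, not part of the answer above; the column list comes from the question, and coercing the columns to numeric is an assumption in case some were read in as strings):

import pandas as pd

cols = ['alkalinity', 'colour', 'uv', 'ph', 'turbidity', 'temperature']
df[cols] = df[cols].apply(pd.to_numeric, errors='coerce')  # stray strings become NaN
new_df = df.groupby('date', as_index=False)[cols].mean()   # NaNs are skipped in the mean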
If I understand correctly, you want to drop rows that contain 'Unnamed' in the date column, right?
Please look here:
https://stackoverflow.com/a/27360130/12790501
The solution would be something like this:
df = df.drop(df[df.date.str.contains('Unnamed', na=False)].index)
Edit:
No, I would like to replace those Unnamed values with the date so I
could then use the groupby('date') function to return the mean values
for the columns
So in that case you should just iterate over the whole table:
last_date = ''
for i in df.index:
    if 'Unnamed' not in df.at[i, 'date']:
        last_date = df.at[i, 'date']
    else:
        df.at[i, 'date'] = last_date
If the 'date' column is of type object, i.e. string, then just write a loop over the numbers seen in the image, provided they follow a certain pattern:
for _ in range(2, 9):
    df.loc[df['date'] == 'Unnamed: ' + str(_), 'date'] = your_value
I have a dataset similar to this
Serial   A     B
1        12
1        31
1
1        12
1        31    203
1        10
1        2
2        32    100
2        32    242
2        3
3        2
3        23    100
3
3        23
I group the dataframe by Serial and find the maximum value of the A column with df['A_MAX'] = df.groupby('Serial')['A'].transform('max').values, and retain only the first value per group with df['A_MAX'] = df['A_MAX'].mask(df['Serial'].duplicated(), ''):
Serial   A     B      A_MAX   B_corresponding
1        12           31      203
1        31
1
1        12
1        31    203
1        10
1        2
2        32    100    32      100
2        32    242
2        3
3        2             23     100
3        23    100
3
3        23
Now, for the B_corresponding column, I would like to get the B values corresponding to A_MAX. I thought of locating the A_MAX values in A, but the max A value can appear several times per group. As an additional condition: in Serial 2, for example, where the max A (32) appears twice, I would prefer to get the smaller of the two B values.
The idea is to use DataFrame.sort_values so the maximal A values come first within each group, then remove missing values with DataFrame.dropna, keep the first row per Serial with DataFrame.drop_duplicates, build a mapping Series with DataFrame.set_index, and finally use Series.map:
df['A_MAX'] = df.groupby('Serial')['A'].transform('max')
df['A_MAX'] = df['A_MAX'].mask(df['Serial'].duplicated())
s = (df.sort_values(['Serial','A'], ascending=[True, False])
       .dropna(subset=['B'])
       .drop_duplicates('Serial')
       .set_index('Serial')['B'])
df['B_corresponding'] = df['Serial'].map(s).mask(df['Serial'].duplicated())
print (df)
Serial A B A_MAX B_corresponding
0 1 12.0 NaN 31.0 203.0
1 1 31.0 NaN NaN NaN
2 1 NaN NaN NaN NaN
3 1 12.0 NaN NaN NaN
4 1 31.0 203.0 NaN NaN
5 1 10.0 NaN NaN NaN
6 1 2.0 NaN NaN NaN
7 2 32.0 100.0 32.0 100.0
8 2 32.0 242.0 NaN NaN
9 2 3.0 NaN NaN NaN
10 3 2.0 NaN 23.0 100.0
11 3 23.0 100.0 NaN NaN
12 3 NaN NaN NaN NaN
13 3 23.0 NaN NaN NaN
Converting missing values to empty strings is possible, but you end up with mixed values (numeric and strings) in the columns, so further processing can be problematic:
df['A_MAX'] = df.groupby('Serial')['A'].transform('max')
df['A_MAX'] = df['A_MAX'].mask(df['Serial'].duplicated(), '')
s = (df.sort_values(['Serial','A'], ascending=[True, False])
       .dropna(subset=['B'])
       .drop_duplicates('Serial')
       .set_index('Serial')['B'])
df['B_corresponding'] = df['Serial'].map(s).mask(df['Serial'].duplicated(), '')
print (df)
Serial A B A_MAX B_corresponding
0 1 12.0 NaN 31 203
1 1 31.0 NaN
2 1 NaN NaN
3 1 12.0 NaN
4 1 31.0 203.0
5 1 10.0 NaN
6 1 2.0 NaN
7 2 32.0 100.0 32 100
8 2 32.0 242.0
9 2 3.0 NaN
10 3 2.0 NaN 23 100
11 3 23.0 100.0
12 3 NaN NaN
13 3 23.0 NaN
You could also use dictionaries to achieve the same result if you are not inclined to use pandas exclusively.
a_to_b_mapping = df.groupby('A')['B'].min().to_dict()
serial_to_a_mapping = df.groupby('Serial')['A'].max().to_dict()

agg_df = []
for serial, a in serial_to_a_mapping.items():
    agg_df.append((serial, a, a_to_b_mapping.get(a, None)))
agg_df = pd.DataFrame(agg_df, columns=['Serial', 'A_max', 'B_corresponding'])
agg_df.head()
Serial A_max B_corresponding
0 1 31.0 203.0
1 2 32.0 100.0
2 3 23.0 100.0
If you want, you could join this to the original dataframe and mask the duplicates.
dft = df.join(agg_df.set_index('Serial'), on='Serial', how='left')
dft['A_max'] = dft['A_max'].mask(dft['Serial'].duplicated(), '')
dft['B_corresponding'] = dft['B_corresponding'].mask(dft['Serial'].duplicated(), '')
dft
I am asking this question to learn a new method.
I have a dataframe like the one below:
ID Value
0 1 10
1 1 12
2 1 14
3 1 16
4 1 18
5 2 32
6 2 12
7 2 -8
8 2 -28
9 2 -48
10 2 -68
11 3 12
12 3 1
13 3 43
I want to convert this into:
ID Value ID Value ID Value
0 1.0 10.0 2 32 3.0 12.0
1 1.0 12.0 2 12 3.0 1.0
2 1.0 14.0 2 -8 3.0 43.0
3 1.0 16.0 2 -28 NaN NaN
4 1.0 18.0 2 -48 NaN NaN
5 NaN NaN 2 -68 NaN NaN
One way to solve this:
pd.concat([df[df['ID'] == 1].reset_index(drop=True),
           df[df['ID'] == 2].reset_index(drop=True),
           df[df['ID'] == 3].reset_index(drop=True)], axis=1)
But can I do the same concat operation on each groupby result instead of filtering by each ID value?
Any better/newer approaches are appreciated.
Thanks in advance.
Yup, very possible and quite simple with pd.concat, in fact.
df = pd.concat({k : g.reset_index(drop=True) for k, g in df.groupby('ID')}, axis=1)
df.columns = df.columns.droplevel(0)
Or, a minor variation on Dark's (now deleted) answer (which does not give you the option of specifying column suffixes automatically):
pd.concat([g.reset_index(drop=True) for _, g in df.groupby('ID')], axis=1)
df
ID Value ID Value ID Value
0 1.0 10.0 2 32 3.0 12.0
1 1.0 12.0 2 12 3.0 1.0
2 1.0 14.0 2 -8 3.0 43.0
3 1.0 16.0 2 -28 NaN NaN
4 1.0 18.0 2 -48 NaN NaN
5 NaN NaN 2 -68 NaN NaN
Those column names are terrible, though. Rather than dropping the first level, you should consider concatenating them to form a pre/suf-fix for the second level. That should be a good exercise for you with df.columns.map.
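For example, a minimal sketch of that exercise, starting again from the original long-format df (the label format is my own choice, not from the answer above):

out = pd.concat({k: g.reset_index(drop=True) for k, g in df.groupby('ID')}, axis=1)
# Flatten the (ID, column) MultiIndex into labels like 'ID_1', 'Value_1', 'ID_2', ...
out.columns = out.columns.map(lambda c: '{}_{}'.format(c[1], c[0]))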
I have some data, in which the index is a threshold, and the values are tnrs (true negative rates) for two classes, 0 and 1.
I want to get a dataframe, indexed by the tnr, of the threshold that corresponds to that tnr, for each class. Essentially, I want this:
I am able to achieve this effect by using the following:
pd.concat([pd.Series(data[0].index.values, index=data[0]),
           pd.Series(data[1].index.values, index=data[1])],
          axis=1)
Or, generalizing to any number of columns:
def invert_dataframe(df):
    return pd.concat([pd.Series(df[col].index.values, index=df[col])
                      for col in df.columns], axis=1)
However, this seems extremely hacky and error prone. Is there a better way to do this, and is there maybe native Pandas functionality that would do this?
You can use stack with pivot:
data = pd.DataFrame({0: [10, 20, 31], 10: [4, 22, 36], 1: [7, 5, 6]},
                    index=[2.1, 1.07, 2.13])
print (data)
0 1 10
2.10 10 7 4
1.07 20 5 22
2.13 31 6 36
df = data.stack().reset_index()
df.columns = list('abc')
df = df.pivot(index='c', columns='b', values='a')
print (df)
b 0 1 10
c
4 NaN NaN 2.10
5 NaN 1.07 NaN
6 NaN 2.13 NaN
7 NaN 2.10 NaN
10 2.10 NaN NaN
20 1.07 NaN NaN
22 NaN NaN 1.07
31 2.13 NaN NaN
36 NaN NaN 2.13
I am sure there must be a very simple solution to this problem, but I am failing to find it (and browsing through previously asked questions, I didn't find the answer I wanted or didn't understand it).
I have a dataframe similar to this (just much bigger, with many more rows and columns):
x val1 val2 val3
0 0.0 10.0 NaN NaN
1 0.5 10.5 NaN NaN
2 1.0 11.0 NaN NaN
3 1.5 11.5 NaN 11.60
4 2.0 12.0 NaN 12.08
5 2.5 12.5 12.2 12.56
6 3.0 13.0 19.8 13.04
7 3.5 13.5 13.3 13.52
8 4.0 14.0 19.8 14.00
9 4.5 14.5 14.4 14.48
10 5.0 15.0 19.8 14.96
11 5.5 15.5 15.5 15.44
12 6.0 16.0 19.8 15.92
13 6.5 16.5 16.6 16.40
14 7.0 17.0 19.8 18.00
15 7.5 17.5 17.7 NaN
16 8.0 18.0 19.8 NaN
17 8.5 18.5 18.8 NaN
18 9.0 19.0 19.8 NaN
19 9.5 19.5 19.9 NaN
20 10.0 20.0 19.8 NaN
In the next step, I need to compute the derivative dVal/dx for each of the value columns (in reality I have more than 3 columns, so I need a robust solution in a loop; I can't select the rows manually each time). But because of the NaN values in some of the columns, I am facing the problem that x and val are not of the same dimension. I feel the way to overcome this would be to select only those x intervals for which val is not null, but I am not able to do that. I am probably making some very stupid mistakes (I am not a programmer and I am very untalented, so please be patient with me :)).
Here is the code so far (now that I think of it, I may have introduced some mistakes just by leaving some old pieces of code because I've been messing with it for a while, trying different things):
import pandas as pd
import numpy as np
df = pd.read_csv('H:/DocumentsRedir/pokus/dataframe.csv', delimiter=',')
vals = list(df.columns.values)[1:]
for i in vals:
    V = np.asarray(pd.notnull(df[i]))
    mask = pd.notnull(df[i])
    X = np.asarray(df.loc[mask]['x'])
    derivative = np.diff(V)/np.diff(X)
But I am getting this error:
ValueError: operands could not be broadcast together with shapes (20,) (15,)
So, apparently, it did not select only the notnull values...
Is there an obvious mistake that I am making or a different approach that I should adopt? Thanks!
(And another less important question: is np.diff the right function to use here or had I better calculated it manually by finite differences? I'm not finding numpy documentation very helpful.)
To calculate dVal/dX:
dVal = df.iloc[:, 1:].diff() # `x` is in column 0.
dX = df['x'].diff()
>>> dVal.apply(lambda series: series / dX)
val1 val2 val3
0 NaN NaN NaN
1 1 NaN NaN
2 1 NaN NaN
3 1 NaN NaN
4 1 NaN 0.96
5 1 NaN 0.96
6 1 15.2 0.96
7 1 -13.0 0.96
8 1 13.0 0.96
9 1 -10.8 0.96
10 1 10.8 0.96
11 1 -8.6 0.96
12 1 8.6 0.96
13 1 -6.4 0.96
14 1 6.4 3.20
15 1 -4.2 NaN
16 1 4.2 NaN
17 1 -2.0 NaN
18 1 2.0 NaN
19 1 0.2 NaN
20 1 -0.2 NaN
We difference all columns (except the first one), and then apply a lambda function to each column which divides it by the difference in column x.
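As an aside, the broadcasting error in the question comes from assigning the boolean mask to V instead of the values themselves. A minimal sketch of both points, assuming the df and dVal/dX defined above: the division can also be written with DataFrame.div aligned on the index, and the question's mask-then-diff loop works once V holds the actual values:

import numpy as np

# Equivalent to the apply above: divide every value column by dX, row by row.
dVal.div(dX, axis=0)

# The question's approach, fixed: take the non-null values, not the mask.
# Note this differences across NaN gaps instead of producing NaN there.
derivatives = {}
for col in df.columns[1:]:                     # skip the 'x' column
    mask = df[col].notnull()
    V = np.asarray(df.loc[mask, col])          # the values themselves
    X = np.asarray(df.loc[mask, 'x'])          # the matching x positions
    derivatives[col] = np.diff(V) / np.diff(X)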