I have a data frame like the one below:
import pandas as pd
Name = ['A','A','A','A','A','A','B','B','B','B','B','B','B']
Id = ['10','10','10','10','10','10','20','20','20','20','20','20','20']
Depth_Feet = ['69.1','70.5','71.4','72.8','73.2','74.2','208.0','209.2','210.2','211.0','211.2','211.7','212.5']
Val = ['2','3.1','1.1','2.1','6.0','1.1','1.2','1.3','3.1','2.9','5.0','6.1','3.2']
d = {'Name':Name,'Id':Id,'Depth_Feet':Depth_Feet,'Val':Val}
df = pd.DataFrame(d)
print (df.head(20))
Depth_Feet Id Name Val
0 69.1 10 A 2
1 70.5 10 A 3.1
2 71.4 10 A 1.1
3 72.8 10 A 2.1
4 73.2 10 A 6.0
5 74.2 10 A 1.1
6 208.0 20 B 1.2
7 209.2 20 B 1.3
8 210.2 20 B 3.1
9 211.0 20 B 2.9
10 211.2 20 B 5.0
11 211.7 20 B 6.1
12 212.5 20 B 3.2
I want to reduce the size of the data frame along the Depth_Feet column (let's say every 2 feet).
Desired output is
Depth_Feet Id Name Val
0 69.1 10 A 2
1 71.4 10 A 1.1
2 73.2 10 A 6.0
3 208.0 20 B 1.2
4 210.2 20 B 3.1
5 212.5 20 B 3.2
I have tried a few options like round and groupby etc., but I'm not able to get the result I want.
If you need every second row per group:
df1 = df[df.groupby('Name').cumcount() % 2 == 0]
print (df1)
Name Id Depth_Feet Val
0 A 10 69.1 2
2 A 10 71.4 1.1
4 A 10 73.2 6.0
6 B 20 208.0 1.2
8 B 20 210.2 3.1
10 B 20 211.2 5.0
12 B 20 212.5 3.2
If you need to resample by 2 per group, convert the values to a TimedeltaIndex:
df2 = (df.set_index(pd.to_timedelta(df.Depth_Feet.astype(float), unit='D'))
         .groupby('Name')
         .resample('2D')
         .first()
         .reset_index(drop=True))
print (df2)
Name Id Depth_Feet Val
0 A 10 69.1 2
1 A 10 71.4 1.1
2 A 10 73.2 6.0
3 B 20 208.0 1.2
4 B 20 210.2 3.1
5 B 20 212.5 3.2
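As a side note, if the goal is literal 2-foot bins rather than every second row, the round/groupby idea from the question can be sketched as below (an assumption on my part: keep the first row of each 2-foot depth bucket per Name, after casting Depth_Feet to float; the rows kept then differ slightly from the desired output above):
depth = df['Depth_Feet'].astype(float)
# label each row with its 2-foot bucket, then keep the first row per (Name, bucket)
df3 = (df.assign(bucket=(depth // 2).astype(int))
         .groupby(['Name', 'bucket'], as_index=False)
         .first()
         .drop(columns='bucket'))
print (df3)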
Related
I am fairly new to Python and I have the following dataframe:
setting_id subject_id seconds result_id owner_id average duration_id
0 7 1 0 1680.5 2.0 24.000 1.0
1 7 1 3600 1690.5 2.0 46.000 2.0
2 7 1 10800 1700.5 2.0 101.000 4.0
3 7 2 0 1682.5 2.0 12.500 1.0
4 7 2 3600 1692.5 2.0 33.500 2.0
5 7 2 10800 1702.5 2.0 86.500 4.0
6 7 3 0 1684.5 2.0 8.500 1.0
7 7 3 3600 1694.5 2.0 15.000 2.0
8 7 3 10800 1704.5 2.0 34.000 4.0
What I need to do is calculate the deviation (%) of the averages with a "seconds" value not equal to 0 from the averages with a seconds value of zero, where the subject_id and setting_id are the same,
i.e. setting_id ==7 & subject_id ==1 would be:
(result/baseline)*100
------> for 3600 seconds: (46/24)*100 = +192%
------> for 10800 seconds: (101/24)*100 = +421%
.... baseline = average-result with a seconds value of 0
.... result = average-result with a seconds value other than 0
The resulting df should look like this
setting_id subject_id seconds owner_id average deviation duration_id
0 7 1 0 2 24 0 1
1 7 1 3600 2 46 192 2
2 7 1 10800 2 101 421 4
I then want to use these calculations to plot a regression graph (with seaborn) of the deviations from baseline.
I have played around with this df for 2 days now and tried different for loops, but I just can't figure out the correct way.
You can use:
# identify rows with 0
m = df['seconds'].eq(0)
# compute the per-group baseline: the sum of 'average' over the rows with seconds == 0
s = (df['average'].where(m)
       .groupby([df['setting_id'], df['subject_id']])
       .sum())
# compute the deviation per group
deviation = (
    df[['setting_id', 'subject_id']]
      .merge(s, left_on=['setting_id', 'subject_id'], right_index=True, how='left')
      ['average']
      .rdiv(df['average']).mul(100)
      .round().astype(int)  # optional
      .mask(m, 0)
)
df['deviation'] = deviation
# or
# out = df.assign(deviation=deviation)
Output:
setting_id subject_id seconds result_id owner_id average duration_id deviation
0 7 1 0 1680.5 2.0 24.0 1.0 0
1 7 1 3600 1690.5 2.0 46.0 2.0 192
2 7 1 10800 1700.5 2.0 101.0 4.0 421
3 7 2 0 1682.5 2.0 12.5 1.0 0
4 7 2 3600 1692.5 2.0 33.5 2.0 268
5 7 2 10800 1702.5 2.0 86.5 4.0 692
6 7 3 0 1684.5 2.0 8.5 1.0 0
7 7 3 3600 1694.5 2.0 15.0 2.0 176
8 7 3 10800 1704.5 2.0 34.0 4.0 400
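A shorter equivalent sketch uses groupby.transform to broadcast the baseline, assuming each (setting_id, subject_id) group has exactly one row with seconds == 0:
m = df['seconds'].eq(0)
# broadcast the baseline (the 'average' of the seconds == 0 row) across each group
baseline = (df['average'].where(m)
              .groupby([df['setting_id'], df['subject_id']])
              .transform('first'))
df['deviation'] = (df['average'].div(baseline).mul(100)
                     .round().astype(int)
                     .mask(m, 0))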
Let's say I have the following sample dataframe:
import random
import pandas as pd
df = pd.DataFrame({'depth': list(range(0, 21)),
'time': list(range(0, 21)),
'metric': random.choices(range(10), k=21)})
df
Out[65]:
depth time metric
0 0 0 2
1 1 1 3
2 2 2 8
3 3 3 0
4 4 4 8
5 5 5 9
6 6 6 5
7 7 7 1
8 8 8 6
9 9 9 6
10 10 10 7
11 11 11 2
12 12 12 7
13 13 13 0
14 14 14 6
15 15 15 0
16 16 16 5
17 17 17 6
18 18 18 9
19 19 19 6
20 20 20 8
I want to average every ten rows of the "metric" column (preserving the first row as is) and pull the tenth item from the depth and time columns. For example:
depth time metric
0 0 0 2
10 10 10 5.3
20 20 20 4.9
I know that groupby is usually used in these situations, but I do not know how to tweak it to get my desired outcome:
df[['metric']].groupby(df.index //10).mean()
Out[66]:
metric
0 4.8
1 4.8
2 8.0
#BENY's answer is on the right track but not quite right. Should be:
df.groupby((df.index+9)//10).agg({'depth':'last','time':'last','metric':'mean'})
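For reference, computed from the sample frame above this groups row 0 by itself, then rows 1-10 and 11-20:
   depth  time  metric
0      0     0     2.0
1     10    10     5.3
2     20    20     4.9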
You can do rolling with reindex + fillna:
df.rolling(10).mean().reindex(df.index[::10]).fillna(df)
depth time metric
0 0.0 0.0 2.0
10 5.5 5.5 5.3
20 15.5 15.5 4.9
Or to match output for depth and time:
out = (df.assign(metric=df['metric'].rolling(10).mean()
                          .reindex(df.index[::10]).fillna(df['metric']))
         .dropna(subset=['metric']))
print(out)
depth time metric
0 0 0 2.0
10 10 10 5.3
20 20 20 4.9
Let us do agg
g = df.index.isin(df.index[::10]).cumsum()[::-1]
df.groupby(g).agg({'depth':'last','time':'last','metric':'mean'})
Out[263]:
depth time metric
1 20 20 4.9
2 10 10 5.3
3 0 0 2.0
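Because of the [::-1] the group labels come out in reverse order; if the original depth order matters, a small follow-up sketch is to sort afterwards:
out = (df.groupby(g).agg({'depth':'last','time':'last','metric':'mean'})
         .sort_values('depth')
         .reset_index(drop=True))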
I currently have code which turns this:
A B C D E F G H I J
0 1.1.1 amba 50 1 131 4 40 3 150 5
1 2.2.2 erto 50 7 40 8 150 8 131 2
2 3.3.3 gema 131 2 150 5 40 1 50 3
Into this:
ID User 40 50 131 150
0 1.1.1 amba 3 1 4 5
1 2.2.2 erto 8 7 2 8
2 3.3.3 gema 1 3 2 5
And here you can check the code:
import pandas as pd
import io
df1 = pd.read_csv(io.StringIO(""" A B C D E F G H I J
1.1.1 amba 50 1 131 4 40 3 150 5
2.2.2 erto 50 7 40 8 150 8 131 2
3.3.3 gema 131 2 150 5 40 1 50 3"""), sep="\s+")
print(df1)
df2 = (pd.concat([df1.drop(columns=["C","D","E","F","G","H"]).rename(columns={"I":"key","J":"val"}),
                  df1.drop(columns=["C","D","E","F","I","J"]).rename(columns={"G":"key","H":"val"}),
                  df1.drop(columns=["C","D","G","H","I","J"]).rename(columns={"E":"key","F":"val"}),
                  df1.drop(columns=["E","F","G","H","I","J"]).rename(columns={"C":"key","D":"val"}),
                  ])
         .rename(columns={"A":"ID","B":"User"})
         .set_index(["ID","User","key"])
         .unstack(2)
         .reset_index()
      )
# flatten the columns..
df2.columns = [c[1] if c[0]=="val" else c[0] for c in df2.columns.to_flat_index()]
df2
The program works correctly if the key columns have unique values, but it fails if there are duplicate values. The issue I have is that my actual dataframe has rows with 30 columns, others with 60, others with 63, etc. So the program detects the empty values as duplicates and fails.
Please check this example:
A B C D E F G H I J
0 1.1.1 amba 50 1 131 4 NaN NaN NaN NaN
1 2.2.2 erto 50 7 40 8 150.0 8.0 131.0 2.0
2 3.3.3 gema 131 2 150 5 40.0 1.0 50.0 3.0
And I would like to get something like this:
ID User 40 50 131 150
0 1.1.1 amba 1 4
1 2.2.2 erto 8 7 2 8
2 3.3.3 gema 1 3 2 5
If I try to unstack this, I get the error "Index contains duplicate entries, cannot reshape". I have been reading about this, and df.drop_duplicates, pivot_table, etc. could help in this situation, but I cannot make any of them work with my current code. Any idea about how to fix this? Thanks.
The idea is to convert the first 2 columns to a MultiIndex, then use concat on the key and value columns selected by DataFrame.iloc, reshape by DataFrame.stack, and remove the unnecessary third level of the MultiIndex with reset_index:
df = df.set_index(['A','B'])
df = pd.concat([df.iloc[:, ::2].stack().reset_index(level=2, drop=True),
                df.iloc[:, 1::2].stack().reset_index(level=2, drop=True)],
               axis=1, keys=('key','val'))
Last, add the key column to the MultiIndex by DataFrame.set_index, reshape by Series.unstack, convert the MultiIndex to columns by reset_index, rename the columns, and finally remove the columns' level name with DataFrame.rename_axis:
df = (df.set_index('key', append=True)['val']
        .unstack()
        .reset_index()
        .rename(columns={"A":"ID","B":"User"})
        .rename_axis(None, axis=1))
print (df)
ID User 40 50 131 150
0 1.1.1 amba 3 1 4 5
1 2.2.2 erto 8 7 2 8
2 3.3.3 gema 1 3 2 5
It also works well for the second example, because the missing rows are removed by stack; a rename is also added to convert the column names to int where possible:
df = df.set_index(['A','B'])
df = pd.concat([df.iloc[:, ::2].stack().reset_index(level=2, drop=True),
                df.iloc[:, 1::2].stack().reset_index(level=2, drop=True)],
               axis=1, keys=('key','val'))
print (df)
key val
A B
1.1.1 amba 50.0 1.0
amba 131.0 4.0
2.2.2 erto 50.0 7.0
erto 40.0 8.0
erto 150.0 8.0
erto 131.0 2.0
3.3.3 gema 131.0 2.0
gema 150.0 5.0
gema 40.0 1.0
gema 50.0 3.0
df = (df.set_index('key', append=True)['val']
        .unstack()
        .rename(columns=int)
        .reset_index()
        .rename(columns={"A":"ID","B":"User"})
        .rename_axis(None, axis=1))
print (df)
ID User 40 50 131 150
0 1.1.1 amba NaN 1.0 4.0 NaN
1 2.2.2 erto 8.0 7.0 2.0 8.0
2 3.3.3 gema 1.0 3.0 2.0 5.0
EDIT1: Add a helper column with a counter to avoid duplicates:
print (df)
A B C D E F G H I J
0 1.1.1 amba 50 1 50 4 40 3 150 5 <- E=50
1 2.2.2 erto 50 7 40 8 150 8 131 2
2 3.3.3 gema 131 2 150 5 40 1 50 3
df = df.set_index(['A','B'])
df = pd.concat([df.iloc[:, ::2].stack().reset_index(level=2, drop=True),
                df.iloc[:, 1::2].stack().reset_index(level=2, drop=True)],
               axis=1, keys=('key','val'))
df['g'] = df.groupby(['A','B','key']).cumcount()
print (df)
key val g
A B
1.1.1 amba 50 1 0
amba 50 4 1
amba 40 3 0
amba 150 5 0
2.2.2 erto 50 7 0
erto 40 8 0
erto 150 8 0
erto 131 2 0
3.3.3 gema 131 2 0
gema 150 5 0
gema 40 1 0
gema 50 3 0
df = (df.set_index(['g','key'], append=True)['val']
        .unstack()
        .reset_index()
        .rename(columns={"A":"ID","B":"User"})
        .rename_axis(None, axis=1))
print (df)
ID User g 40 50 131 150
0 1.1.1 amba 0 3.0 1.0 NaN 5.0
1 1.1.1 amba 1 NaN 4.0 NaN NaN
2 2.2.2 erto 0 8.0 7.0 2.0 8.0
3 3.3.3 gema 0 1.0 3.0 2.0 5.0
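As a side note, starting again from the wide frame (df1 in the question's code), the pairing step can also be sketched with pd.lreshape, which melts the key/value column pairs in one call; this assumes the fixed column letters from the example, and note that pivot_table with aggfunc='first' collapses duplicate keys rather than keeping them apart like the counter column above:
long_df = pd.lreshape(df1.rename(columns={'A':'ID','B':'User'}),
                      {'key': ['C','E','G','I'], 'val': ['D','F','H','J']})
df_out = (long_df.pivot_table(index=['ID','User'], columns='key',
                              values='val', aggfunc='first')
                 .reset_index()
                 .rename_axis(None, axis=1))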
What you are trying to do seems too complex. May I suggest a simpler solution which just converts each row to a dictionary of the desired result and then binds them back together:
pd.DataFrame(list(map(lambda row: {'ID':row['A'], 'User':row['B'], row['C']:row['D'],
                                   row['E']:row['F'], row['G']:row['H'], row['I']:row['J']},
                      df1.to_dict('records'))))
I have two data frames, df1 and df2.
df1 has the following data (N rows):
Time(s) sv-01 sv-02 sv-03 Val1 val2 val3
1339.4 1 4 12 1.6 0.6 1.3
1340.4 1 12 4 -0.5 0.5 1.4
1341.4 1 6 8 0.4 5 1.6
1342.4 2 5 14 1.2 3.9 11
...... ..... .... ... ..
df2 has the following data, which has more rows than df1:
Time(msec) channel svid value-1 value-2 valu-03
1000 1 2 0 5 1
1000 2 5 1 4 2
1000 3 2 3 4 7
..... .....................................
1339400 1 1 1.6 0.4 5.3
1339400 2 12 0.5 1.8 -4.4
1339400 3 4 -0.20 1.6 -7.9
1340400 1 1 0.3 0.3 1.5
1340400 2 6 2.3 -4.3 1.0
1340400 3 4 2.0 1.1 -0.45
1341400 1 1 2 2.1 0
1341400 2 8 3.4 -0.3 1
1341400 3 6 0 4.1 2.3
.... .... .. ... ... ...
What I am trying to achieve is:
1. First, multiply the Time(s) column by 1000 so that it matches the df2 millisecond column.
2. In df1, sv-01, sv-02 and sv-03 are in independent columns, but in df2 those sv values all appear in one column, svid. So the goal is: when the (converted) time of df1 matches a time in df2, copy the next three consecutive lines, i.e. copy all matched lines for that time instant.
Basically I want to iterate over the times of df1 in the df2 time column and, if there is a match, copy the next three rows to a new df.
I have seen examples using the pandas merge function, but in my case the two frames have different headers.
Thanks.
I think you need double boolean indexing - first filter df2 with isin (mul is used to multiply the df1 times by 1000),
and then count values per group with cumcount and keep the first 3 rows per matched time:
df = df2[df2['Time(msec)'].isin(df1['Time(s)'].mul(1000))]
df = df[df.groupby('Time(msec)').cumcount() < 3]
print (df)
Time(msec) channel svid value-1 value-2 valu-03
3 1339400 1 1 1.6 0.4 5.30
4 1339400 2 12 0.5 1.8 -4.40
5 1339400 3 4 -0.2 1.6 -7.90
6 1340400 1 1 0.3 0.3 1.50
7 1340400 2 6 2.3 -4.3 1.00
8 1340400 3 4 2.0 1.1 -0.45
9 1341400 1 1 2.0 2.1 0.00
10 1341400 2 8 3.4 -0.3 1.00
11 1341400 3 6 0.0 4.1 2.30
Detail:
print (df.groupby('Time(msec)').cumcount())
3 0
4 1
5 2
6 0
7 1
8 2
9 0
10 1
11 2
dtype: int64
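A merge-based sketch is also possible; the different headers only need to be aligned on a common key column (this assumes Time(s) * 1000 lands exactly on the integer Time(msec) values, hence the defensive round):
# build the matching key from df1, rounding to avoid float drift after * 1000
key = df1['Time(s)'].mul(1000).round().astype(int).rename('Time(msec)')
# inner merge keeps only df2 rows whose time appears in df1, then keep 3 rows per time
out = df2.merge(key.to_frame().drop_duplicates(), on='Time(msec)', how='inner')
out = out[out.groupby('Time(msec)').cumcount() < 3]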
Hi all, my dataframe looks like this:
A | B | C | D | E
'USD'
'trading expenses-total'
8.10 2.3 5.5
9.1 1.4 6.1
5.4 5.1 7.8
I haven't found anything quite like this, so apologies if this is a duplicate. But essentially I am trying to locate the column that contains the string 'total' (column B) and its adjacent columns (C and D) and turn them into a dataframe. I feel like I am close with the following code:
test.loc[:,test.columns.str.contains('total')]
which isolates the correct column, but I can't quite figure out how to grab the adjacent two columns. My desired output is:
B | C | D
'USD'
'trading expenses-total'
8.10 2.3 5.5
9.1 1.4 6.1
5.4 5.1 7.8
OLD answer:
Pandas approach:
In [36]: df = pd.DataFrame(np.random.rand(3,5), columns=['A','total','C','D','E'])
In [37]: df
Out[37]:
A total C D E
0 0.789482 0.427260 0.169065 0.112993 0.142648
1 0.303391 0.484157 0.454579 0.410785 0.827571
2 0.984273 0.001532 0.676777 0.026324 0.094534
In [38]: idx = np.argmax(df.columns.str.contains('total'))
In [39]: df.iloc[:, idx:idx+3]
Out[39]:
total C D
0 0.427260 0.169065 0.112993
1 0.484157 0.454579 0.410785
2 0.001532 0.676777 0.026324
UPDATE:
In [118]: df
Out[118]:
A B C D E
0 NaN USD NaN NaN NaN
1 NaN trading expenses-total NaN NaN NaN
2 A 8.10 2.3 5.5 10.0
3 B 9.1 1.4 6.1 11.0
4 C 5.4 5.1 7.8 12.0
In [119]: col = df.select_dtypes(['object']).apply(lambda x: x.str.contains('total').any()).idxmax()
In [120]: cols = df.columns.to_series().loc[col:].head(3).tolist()
In [121]: col
Out[121]: 'B'
In [122]: cols
Out[122]: ['B', 'C', 'D']
In [123]: df[cols]
Out[123]:
B C D
0 USD NaN NaN
1 trading expenses-total NaN NaN
2 8.10 2.3 5.5
3 9.1 1.4 6.1
4 5.4 5.1 7.8
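A small follow-up sketch: the selected block still contains the 'USD' / 'trading expenses-total' label rows as strings, so if you only want the numbers you could coerce and drop them (assuming those label rows can simply be discarded):
block = df[cols].apply(pd.to_numeric, errors='coerce').dropna(how='all')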
Here's one approach -
from scipy.ndimage import binary_dilation as bind
mask = test.columns.str.contains('total')
test_out = test.iloc[:,bind(mask,[1,1,1],origin=-1)]
If you don't have access to SciPy, you can also use np.convolve, like so -
test_out = test.iloc[:,np.convolve(mask,[1,1,1])[:-2]>0]
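For reference, my reading of why the convolution trick works, sketched with the same mask as above:
# np.convolve(mask, [1, 1, 1]) in 'full' mode has length len(mask) + 2 and its
# entry i equals mask[i] + mask[i-1] + mask[i-2]; dropping the last two entries
# leaves len(mask) terms, and entry i is > 0 exactly when column i, i-1 or i-2
# matched 'total' -- i.e. each matching column plus the two columns to its right.
test_out = test.iloc[:, np.convolve(mask, [1, 1, 1])[:-2] > 0]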
Sample runs
Case #1 :
In [390]: np.random.seed(1234)
In [391]: test = pd.DataFrame(np.random.randint(0,9,(3,5)))
In [392]: test.columns = ['P','total001','g','r','t']
In [393]: test
Out[393]:
P total001 g r t
0 3 6 5 4 8
1 1 7 6 8 0
2 5 0 6 2 0
In [394]: mask = test.columns.str.contains('total')
In [395]: test.iloc[:,bind(mask,[1,1,1],origin=-1)]
Out[395]:
total001 g r
0 6 5 4
1 7 6 8
2 0 6 2
Case #2 :
This also works if you have multiple matching columns, and also if you run out of columns at the edge and don't have two columns to the right of a matching column -
In [401]: np.random.seed(1234)
In [402]: test = pd.DataFrame(np.random.randint(0,9,(3,7)))
In [403]: test.columns = ['P','total001','g','r','t','total002','k']
In [406]: test
Out[406]:
P total001 g r t total002 k
0 3 6 5 4 8 1 7
1 6 8 0 5 0 6 2
2 0 5 2 6 3 7 0
In [407]: mask = test.columns.str.contains('total')
In [408]: test.iloc[:,bind(mask,[1,1,1],origin=-1)]
Out[408]:
total001 g r total002 k
0 6 5 4 1 7
1 8 0 5 6 2
2 5 2 6 7 0
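For completeness, the same selection can be sketched with plain NumPy indexing, which also clips at the right edge when fewer than two columns follow a match:
import numpy as np
mask = test.columns.str.contains('total')
idx = np.flatnonzero(mask)                          # positions of matching columns
take = np.unique(np.clip(np.add.outer(idx, np.arange(3)).ravel(),
                         0, len(mask) - 1))         # each match plus the next two
test_out = test.iloc[:, take]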