Binning two-dimensional data by its index in Python

How would I bin some data based on its index, in Python 3?
Let's say I have the following data:
1 0.5
3 0.6
5 0.7
6 0.8
8 0.9
10 1
11 1.1
12 1.2
14 1.3
15 1.4
17 1.5
18 1.6
19 1.7
20 1.8
22 1.9
24 2
25 2.1
28 2.2
31 2.3
35 2.4
How would I take this data and bin both columns such that each bin contains n values, then average the numbers in each bin and output them?
For example, if I wanted to bin the values in groups of 4,
I would take the first four data points:
1 0.5
3 0.6
5 0.7
6 0.8
and the averages of these would be: 3.75 0.65
I would continue down the columns by taking the next set of four, and so on, until I had averaged all of the sets of four to get this:
3.75 0.65
10.25 1.05
16 1.45
21.25 1.85
29.75 2.25
How can I do this using Python?

Based on numpy reshape:
import numpy as np
import pandas as pd

pd.DataFrame([np.mean(x.reshape(len(df) // 4, -1), axis=1) for x in df.values.T]).T
0 1
0 3.75 0.65
1 10.25 1.05
2 16.00 1.45
3 21.25 1.85
4 29.75 2.25

You can "bin" the index into groups of 4 and call groupby on the index:
df.groupby(df.index // 4).mean()
0 1
0 3.75 0.65
1 10.25 1.05
2 16.00 1.45
3 21.25 1.85
4 29.75 2.25
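A note on the two approaches: the reshape one-liner requires the row count to be an exact multiple of the bin size, while the groupby version also averages a trailing partial bin. A minimal runnable sketch (my own setup, using the sample data from the question):
import pandas as pd

df = pd.DataFrame({0: [1, 3, 5, 6, 8, 10, 11, 12, 14, 15,
                       17, 18, 19, 20, 22, 24, 25, 28, 31, 35],
                   1: [0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.2, 1.3, 1.4,
                       1.5, 1.6, 1.7, 1.8, 1.9, 2.0, 2.1, 2.2, 2.3, 2.4]})

n = 4
# Integer-dividing the default RangeIndex labels rows 0-3 as bin 0,
# rows 4-7 as bin 1, and so on; mean() then averages within each bin.
binned = df.groupby(df.index // n).mean()
print(binned)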

Related

How to compute mean absolute deviation row-wise in pandas

A snippet of the dataframe is as follows, but the actual dataset is 200000 x 130.
ID 1-jan 2-jan 3-jan 4-jan
1. 4 5 7 8
2. 2 0 1 9
3. 5 8 0 1
4. 3 4 0 0
I am trying to compute the mean absolute deviation for each row, like this.
ID 1-jan 2-jan 3-jan 4-jan mean
1. 4 5 7 8 12.5
1_MAD 8.5 7.5 5.5 4.5
2. 2 0 1 9 6
2_MAD 4 6 5 3
...
I tried this:
new_df = pd.DataFrame()
for row in df['ID']:
    new_df[str(row) + '_mad'] = mad(df.loc[row][1:])
new_df.T
where mad is a function that compares the mean to each value.
But this is very time consuming, since I have a large dataset and need to do it in the quickest way possible.
pd.concat([
    df1.assign(mean1=df1.mean(axis=1)).set_index(df1.index.astype('str')),
    df1.assign(mean1=df1.mean(axis=1))
       .apply(lambda ss: ss.mean1 - ss, axis=1)
       .T.add_suffix('_MAD').T
       .assign(mean1='')
]).sort_index().pipe(print)
1-jan 2-jan 3-jan 4-jan mean1
ID
1.0 4.00 5.00 7.00 8.00 6.0
1.0_MAD 2.00 1.00 -1.00 -2.00
2.0 2.00 0.00 1.00 9.00 3.0
2.0_MAD 1.00 3.00 2.00 -6.00
3.0 5.00 8.00 0.00 1.00 3.5
3.0_MAD -1.50 -4.50 3.50 2.50
4.0 3.00 4.00 0.00 0.00 1.75
4.0_MAD -1.25 -2.25 1.75 1.75
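Note that ss.mean1 - ss keeps the sign of each deviation, which is why negative values appear in the output above; taking (ss.mean1 - ss).abs() instead would give true absolute deviations, as the next answer does.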
IIUC use:
# convert ID to index
df = df.set_index('ID')
# row means as a Series
mean = df.mean(axis=1)
from toolz import interleave
# subtract the mean from every column, take absolute values, add a suffix
df1 = df.sub(mean, axis=0).abs().rename(index=lambda x: f'{x}_MAD')
# join with the original (plus the mean column) and interleave the indices
df = pd.concat([df.assign(mean=mean), df1]).loc[list(interleave([df.index, df1.index]))]
print(df)
1-jan 2-jan 3-jan 4-jan mean
ID
1.0 4.00 5.00 7.00 8.00 6.00
1.0_MAD 2.00 1.00 1.00 2.00 NaN
2.0 2.00 0.00 1.00 9.00 3.00
2.0_MAD 1.00 3.00 2.00 6.00 NaN
3.0 5.00 8.00 0.00 1.00 3.50
3.0_MAD 1.50 4.50 3.50 2.50 NaN
4.0 3.00 4.00 0.00 0.00 1.75
4.0_MAD 1.25 2.25 1.75 1.75 NaN
It's possible to specify axis=1 to apply the mean calculation across columns:
df['mean_across_cols'] = df.mean(axis=1)
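If a single MAD number per row is all that is needed (rather than the interleaved deviation rows above), the whole computation can be vectorized with no Python-level loop, which matters at 200000 x 130. A sketch using the sample frame from the question:
import pandas as pd

df = pd.DataFrame({'1-jan': [4, 2, 5, 3], '2-jan': [5, 0, 8, 4],
                   '3-jan': [7, 1, 0, 0], '4-jan': [8, 9, 1, 0]},
                  index=pd.Index([1, 2, 3, 4], name='ID'))

# Mean of each row, then the mean of the absolute deviations from it.
row_mean = df.mean(axis=1)
row_mad = df.sub(row_mean, axis=0).abs().mean(axis=1)
print(row_mad)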

How to re-sample and interpolate new data in Python

I have a csv file containing the following information:
Time(s) Variable
0.003 1
0.009 2
0.056 3
0.094 4
0.4 5
0.98 6
1.08 7
1.45 8
1.89 9
2.45 10
2.73 11
3.2 12
3.29 13
3.5 14
I would like to be able to change the time column into 0.25 s intervals starting from 0, and have the associated variable data change along with it (i.e. if v = 10 at 2.45 s, then v = 10.2 at 2.5 s). The variable data would have to be interpolated against the change in the time data, I assume? I need to be able to do this straight from the csv rather than writing out the data in Python, as the real data-set has thousands of rows.
Not sure if what I want is exactly possible, but some thoughts would go a long way. Thanks!
How about SciPy's interp1d?
from scipy.interpolate import interp1d
import numpy as np
import pandas as pd

interp = interp1d(df['Time(s)'], df['Variable'])
new_times = np.arange(0.25, 3.5, 0.25)
pd.DataFrame({'Time(s)': new_times, 'Variable': interp(new_times)})
Output:
Time(s) Variable
0 0.25 4.509804
1 0.50 5.172414
2 0.75 5.603448
3 1.00 6.200000
4 1.25 7.459459
5 1.50 8.113636
6 1.75 8.681818
7 2.00 9.196429
8 2.25 9.642857
9 2.50 10.178571
10 2.75 11.042553
11 3.00 11.574468
12 3.25 12.555556
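Since the question asks to work straight from the csv, here is a self-contained sketch of the same idea; the filename data.csv is an assumption, and the column names are taken from the snippet above. It uses pandas-only interpolate(method='index') in place of SciPy:
import numpy as np
import pandas as pd

df = pd.read_csv('data.csv')  # assumed filename; columns 'Time(s)' and 'Variable'

# Index the variable by time, insert the 0.25 s grid points as NaN,
# interpolate linearly against the index values, then keep only the grid.
s = pd.Series(df['Variable'].values, index=df['Time(s)'].values)
grid = np.arange(0.25, s.index.max(), 0.25)
resampled = (s.reindex(s.index.union(grid))
               .interpolate(method='index')
               .reindex(grid))
print(resampled)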

If a value of a particular ID does not exist for another ID, insert a row with that value for that ID

I would like to insert a new row whenever a D1 value exists for one ID but is missing for another, with df['Value'] left blank (NaN) for the inserted rows. Your help is appreciated.
Input
D1 ID Value
0.02 1 1.2
0.04 1 1.6
0.06 1 1.9
0.08 1 2.8
0.02 2 4.5
0.04 2 4.1
0.08 2 3.6
0.02 3 2.7
0.04 3 2.9
0.06 3 2.4
0.08 3 2.1
0.1 3 1.9
Expected output:
D1 ID Value
0.02 1 1.2
0.04 1 1.6
0.06 1 1.9
0.08 1 2.8
0.1 1
0.02 2 4.5
0.04 2 4.1
0.06 2
0.08 2 3.6
0.1 2
0.02 3 2.7
0.04 3 2.9
0.06 3 2.4
0.08 3 2.1
0.1 3 1.9
Unfortunately, the code I have written has been way off or simply produces multiple error messages; unlike my other questions, I do not have examples to show.
Use unstack and stack. Chain additional sort_index and reset_index to achieve the desired order:
df_final = (df.set_index(['D1', 'ID']).unstack().stack(dropna=False)
.sort_index(level=[1,0]).reset_index())
Output:
D1 ID Value
0 0.02 1 1.2
1 0.04 1 1.6
2 0.06 1 1.9
3 0.08 1 2.8
4 0.10 1 NaN
5 0.02 2 4.5
6 0.04 2 4.1
7 0.06 2 NaN
8 0.08 2 3.6
9 0.10 2 NaN
10 0.02 3 2.7
11 0.04 3 2.9
12 0.06 3 2.4
13 0.08 3 2.1
14 0.10 3 1.9
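An equivalent route (not from the answer above) builds the full ID x D1 grid explicitly with MultiIndex.from_product and reindexes onto it; a sketch assuming df is the input frame from the question and every D1 value of interest appears for at least one ID:
import pandas as pd

# Cartesian product of all IDs with all observed D1 values.
full_idx = pd.MultiIndex.from_product(
    [df['ID'].unique(), df['D1'].unique()], names=['ID', 'D1'])

# Reindex onto the complete grid; missing (ID, D1) pairs become NaN rows.
df_full = (df.set_index(['ID', 'D1'])
             .reindex(full_idx)
             .reset_index()[['D1', 'ID', 'Value']])
print(df_full)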

Python pivot table: iterate through each value

I have an example of the data below:
Temperature Voltage Data
25 3.3 2
25 3.3 2.5
25 3.3 3.7
25 3.3 3.5
25 3.3 2.7
25 3.45 1.9
25 3.45 1.7
25 3.45 1.5
25 3.45 2
25 3.45 2.9
105 3.3 3
105 3.3 3.5
105 3.3 4.7
105 3.3 4.5
105 3.3 3.7
105 3.45 2.5
105 3.45 2.3
105 3.45 2.1
105 3.45 3.3
105 3.45 4
I would like to iterate through each row to calculate the difference between two consecutive data points, then count how many times that difference is equal to or greater than 1.
Then, print out the number of times that happens per Temperature per Voltage.
Edit: Added np.abs to make sure to take the absolute value of the difference.
You can use pandas diff for that, and then np.where for the condition:
import numpy as np
import pandas as pd
data = {
    'Temperature': [25, 25, 25, 25, 25, 25, 25, 25, 25, 25,
                    105, 105, 105, 105, 105, 105, 105, 105, 105, 105],
    'Voltage': [3.3, 3.3, 3.3, 3.3, 3.3, 3.45, 3.45, 3.45, 3.45, 3.45,
                3.3, 3.3, 3.3, 3.3, 3.3, 3.45, 3.45, 3.45, 3.45, 3.45],
    'Data': [2, 2.5, 3.7, 3.5, 2.7, 1.9, 1.7, 1.5, 2, 2.9,
             3, 3.5, 4.7, 4.5, 3.7, 2.5, 2.3, 2.1, 3.3, 4]
}
df = pd.DataFrame(data)
df['difference'] = df['Data'].diff(1)
df['flag'] = np.where(np.abs(df['difference']) >= 1, 'More than 1', 'Less than one')
print(df)
Output:
Temperature Voltage Data difference flag
0 25 3.30 2.0 NaN Less than one
1 25 3.30 2.5 0.5 Less than one
2 25 3.30 3.7 1.2 More than 1
3 25 3.30 3.5 -0.2 Less than one
4 25 3.30 2.7 -0.8 Less than one
5 25 3.45 1.9 -0.8 Less than one
6 25 3.45 1.7 -0.2 Less than one
7 25 3.45 1.5 -0.2 Less than one
8 25 3.45 2.0 0.5 Less than one
9 25 3.45 2.9 0.9 Less than one
10 105 3.30 3.0 0.1 Less than one
11 105 3.30 3.5 0.5 Less than one
12 105 3.30 4.7 1.2 More than 1
13 105 3.30 4.5 -0.2 Less than one
14 105 3.30 3.7 -0.8 Less than one
15 105 3.45 2.5 -1.2 More than 1
16 105 3.45 2.3 -0.2 Less than one
17 105 3.45 2.1 -0.2 Less than one
18 105 3.45 3.3 1.2 More than 1
19 105 3.45 4.0 0.7 Less than one
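To get the per-Temperature, per-Voltage counts the question actually asks for, a groupby can be chained on. A sketch continuing from the df built above, taking diff within each group so that a boundary between groups is not counted as a jump:
# Difference within each (Temperature, Voltage) group, then count |diff| >= 1.
jump = df.groupby(['Temperature', 'Voltage'])['Data'].diff().abs() >= 1
counts = jump.groupby([df['Temperature'], df['Voltage']]).sum()
print(counts)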

How to merge columns and duplicate row values to match in pandas

I want to join two dataframes on 'time', but one df uses 0.25 second intervals and the other uses 1 second intervals. I want to join the values from the 1 second interval df to the 0.25 second interval df, repeating values while within the corresponding second.
Below are small snippets of the 2 dataframes I want to merge:
time speaker
0.25 1
0.25 2
0.50 1
0.50 2
0.75 1
0.75 2
1.00 1
1.00 2
1.25 1
1.25 2
1.50 1
1.50 2
1.75 1
1.75 2
2.00 1
2.00 2
and:
time label
0 10
1 11
and I want:
time speaker label
0.25 1 10
0.25 2 10
0.50 1 10
0.50 2 10
0.75 1 10
0.75 2 10
1.00 1 10
1.00 2 10
1.25 1 11
1.25 2 11
1.50 1 11
1.50 2 11
1.75 1 11
1.75 2 11
2.00 1 11
2.00 2 11
Thanks!
Here is one way, using merge_asof:
pd.merge_asof(df1, df2.astype(float), on='time', allow_exact_matches=False)
Output:
time speaker label
0 0.25 1 10.0
1 0.25 2 10.0
2 0.50 1 10.0
3 0.50 2 10.0
4 0.75 1 10.0
5 0.75 2 10.0
6 1.00 1 10.0
7 1.00 2 10.0
8 1.25 1 11.0
9 1.25 2 11.0
10 1.50 1 11.0
11 1.50 2 11.0
12 1.75 1 11.0
13 1.75 2 11.0
14 2.00 1 11.0
15 2.00 2 11.0
IIUC, this is a case of pd.cut:
df1['label'] = pd.cut(df1['time'],
                      bins=list(df2['time']) + [np.inf],
                      labels=df2['label'])
Output:
time speaker label
0 0.25 1 10
1 0.25 2 10
2 0.50 1 10
3 0.50 2 10
4 0.75 1 10
5 0.75 2 10
6 1.00 1 10
7 1.00 2 10
8 1.25 1 11
9 1.25 2 11
10 1.50 1 11
11 1.50 2 11
12 1.75 1 11
13 1.75 2 11
14 2.00 1 11
15 2.00 2 11
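Both answers produce the same assignment here. Two design details worth noting: merge_asof requires both frames to be sorted on the key and to share a key dtype (hence the .astype(float) to match df1's float times, which is also why label comes back as 10.0/11.0), while pd.cut returns a Categorical that keeps the original integer labels, which is why its output shows 10/11.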
