Find columns within a certain percentile of a DataFrame - python

I have a multi-column dataframe and I'm interested in how to keep only the part of it that falls between the 25th and 75th percentiles of each column.
I need to remove the rows (which are just time steps) that have values outside the 25-75 percentile range.
import numpy as np
import pandas as pd
df1 = pd.DataFrame({
'400.0': [13.909261, 13.758734, 13.513627, 13.095409, 13.628918, 12.782643, 13.278548, 13.160153, 12.155895, 12.152373, 12.147820, 13.023997, 15.010729, 13.006050, 13.002356],
'401.0': [14.581624, 14.173803, 13.757856, 14.223524, 14.695623, 13.818065, 13.300235, 13.173674, 14.145402, 14.144456, 13.142969, 13.022471, 14.010802, 14.006181, 14.002641],
'402.0': [15.253988, 15.588872, 15.002085, 15.351638, 14.762327, 14.853486, 15.321922, 14.187195, 15.134910, 15.136539, 15.138118, 15.020945, 15.010875, 15.006313, 15.002927],
'403.0': [15.633908, 14.833914, 15.146499, 15.431543, 15.798185, 14.874350, 14.333470, 14.192128, 15.130119, 15.134795, 15.136049, 15.019307, 15.012037, 15.006674, 15.003002],
})
I expect to end up with fewer rows, since I have to eliminate a range of measurements that act as outliers of the time series.
This is from the original data set, where the x-axis shows the rows, so I need to somehow remove this blob by setting a percentile criterion.
In the end I'd take the strictest criterion and apply it to the entire dataframe.

I'm not 100% sure this is what you want, but IIUC, you can create a mask, then apply it to your dataframe.
df1[df1.apply(lambda x: x.between(x.quantile(.25), x.quantile(.75))).all(1)]
400.0 401.0 402.0 403.0
8 12.155895 14.145402 15.134910 15.130119
9 12.152373 14.144456 15.136539 15.134795
That will drop any row which contains any value in any column that falls outside your range.
If instead you want to drop only rows in which all of the values fall outside your range, you can use:
df1[df1.apply(lambda x: x.between(x.quantile(.25), x.quantile(.75))).any(1)]
400.0 401.0 402.0 403.0
2 13.513627 13.757856 15.002085 15.146499
3 13.095409 14.223524 15.351638 15.431543
5 12.782643 13.818065 14.853486 14.874350
6 13.278548 13.300235 15.321922 14.333470
7 13.160153 13.173674 14.187195 14.192128
8 12.155895 14.145402 15.134910 15.130119
9 12.152373 14.144456 15.136539 15.134795
10 12.147820 13.142969 15.138118 15.136049
11 13.023997 13.022471 15.020945 15.019307
12 15.010729 14.010802 15.010875 15.012037
13 13.006050 14.006181 15.006313 15.006674
14 13.002356 14.002641 15.002927 15.003002
Rows are retained here if any of the values in any column falls within the percentile range in its respective column.
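As a variant (not part of the original answer), you could compute the two quantile rows once with DataFrame.quantile and compare the whole frame against them, avoiding the per-column apply. A minimal sketch using the df1 defined in the question:
# quantile() with a list gives one row per quantile, one column per column of df1
q = df1.quantile([0.25, 0.75])
within = df1.ge(q.loc[0.25]) & df1.le(q.loc[0.75])   # inclusive, like between()

strict = df1[within.all(axis=1)]   # every column inside its own 25-75% band
loose  = df1[within.any(axis=1)]   # at least one column inside its band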

It is going to be much faster to operate on the underlying numpy arrays here:
a = df1.values
q1 = np.quantile(a, q=0.25, axis=0)
q2 = np.quantile(a, q=0.75, axis=0)
mask = ((q1 < a) & (a < q2)).all(1)
df1[mask]
400.0 401.0 402.0 403.0
8 12.155895 14.145402 15.134910 15.130119
9 12.152373 14.144456 15.136539 15.134795
Invert the mask (df1[~mask]) if you want to exclude those rows instead.
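One detail worth noting: the strict < here excludes values that sit exactly on a quantile, whereas Series.between is inclusive on both ends by default. If you want the NumPy version to match, a small sketch of the adjustment (same df1 as above):
import numpy as np

a = df1.to_numpy()   # .to_numpy() is the newer spelling of .values
q1, q2 = np.quantile(a, q=[0.25, 0.75], axis=0)
mask = ((q1 <= a) & (a <= q2)).all(axis=1)   # inclusive bounds, like between()
df1[mask]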

Related

Run values of one dataframe through another and find the index of similar value from dataframe

I have two dataframes, each consisting of a single column with 62 values:
Distance_obs = [
0.0
0.9084
2.1931
2.85815
3.3903
3.84815
4.2565
4.6287
4.97295
5.29475
5.598
5.8856
6.15975
6.4222
6.67435
6.9173
7.152
7.37925
7.5997
7.8139
8.02235
8.22555
8.42385
8.61755
8.807
8.99245
9.17415
9.35235
9.5272
9.6989
9.86765
10.0335
10.1967
10.3574
10.5156
10.6714
10.825
10.9765
11.1259
11.2732
11.4187
11.5622
11.7041
11.8442
11.9827
12.1197
12.2552
12.3891
12.5216
12.6527
12.7825
12.9109
13.0381
13.1641
13.2889
13.4126
13.5351
13.6565
13.7768
13.8961
14.0144
14.0733
]
and
Cell_mid = [0.814993
1.96757
2.56418
3.04159
3.45236
3.8187
4.15258
4.46142
4.75013
5.02221
5.28026
5.52624
5.76172
5.98792
6.20588
6.41642
6.62027
6.81802
7.01019
7.19723
7.37952
7.55742
7.73122
7.9012
8.0676
8.23063
8.39049
8.54736
8.70141
8.85277
9.00159
9.14798
9.29207
9.43396
9.57374
9.71152
9.84736
9.98136
10.1136
10.2441
10.373
10.5003
10.626
10.7503
10.8732
10.9947
11.1149
11.2337
11.3514
11.4678
11.5831
11.6972
11.8102
11.9222
12.0331
12.143
12.2519
12.3599
12.4669
12.573
12.6782
12.7826
]
I want to run each element of Distance_obs through the values in Cell_mid and find the index of the nearest matching value.
I have been trying using the following:
for i in Distance_obs:
    Required_value = (np.abs(Cell_mid - i)).idxmin()
But I get this error:
ufunc 'subtract' did not contain a loop with signature matching types (dtype('<U32'), dtype('<U32')) -> dtype('<U32')
One way to do this could be as follows (the error itself suggests the values are being read in as strings, dtype('<U32'), rather than floats):
Use pd.merge_asof, passing "nearest" to the direction parameter.
Then, from the merged result, select column Cell_mid and use Series.map with a pd.Series in which the values and index of the original df2 are swapped.
df['Cell_mid_index'] = pd.merge_asof(df, df2,
                                     left_on='Distance_obs',
                                     right_on='Cell_mid',
                                     direction='nearest')\
                       ['Cell_mid'].map(pd.Series(df2['Cell_mid'].index.values,
                                                  index=df2['Cell_mid']))
print(df.head())
Distance_obs Cell_mid_index
0 0.00000 0
1 0.90840 0
2 2.19310 1
3 2.85815 3
4 3.39030 4
So, at the intermediate step, we had a merged df like this:
print(pd.merge_asof(df, df2, left_on='Distance_obs',
                    right_on='Cell_mid', direction='nearest').head())
Distance_obs Cell_mid
0 0.00000 0.814993
1 0.90840 0.814993
2 2.19310 1.967570
3 2.85815 3.041590
4 3.39030 3.452360
And then with .map we are retrieving the appropriate index values from df2.
Data used
import pandas as pd
Distance_obs = [0.0, 0.9084, 2.1931, 2.85815, 3.3903, 3.84815, 4.2565,
4.6287, 4.97295, 5.29475, 5.598, 5.8856, 6.15975, 6.4222,
6.67435, 6.9173, 7.152, 7.37925, 7.5997, 7.8139, 8.02235,
8.22555, 8.42385, 8.61755, 8.807, 8.99245, 9.17415, 9.35235,
9.5272, 9.6989, 9.86765, 10.0335, 10.1967, 10.3574, 10.5156,
10.6714, 10.825, 10.9765, 11.1259, 11.2732, 11.4187, 11.5622,
11.7041, 11.8442, 11.9827, 12.1197, 12.2552, 12.3891, 12.5216,
12.6527, 12.7825, 12.9109, 13.0381, 13.1641, 13.2889, 13.4126,
13.5351, 13.6565, 13.7768, 13.8961, 14.0144, 14.0733]
df = pd.DataFrame(Distance_obs, columns=['Distance_obs'])
Cell_mid = [0.814993, 1.96757, 2.56418, 3.04159, 3.45236, 3.8187, 4.15258,
4.46142, 4.75013, 5.02221, 5.28026, 5.52624, 5.76172, 5.98792,
6.20588, 6.41642, 6.62027, 6.81802, 7.01019, 7.19723, 7.37952,
7.55742, 7.73122, 7.9012, 8.0676, 8.23063, 8.39049, 8.54736,
8.70141, 8.85277, 9.00159, 9.14798, 9.29207, 9.43396, 9.57374,
9.71152, 9.84736, 9.98136, 10.1136, 10.2441, 10.373, 10.5003,
10.626, 10.7503, 10.8732, 10.9947, 11.1149, 11.2337, 11.3514,
11.4678, 11.5831, 11.6972, 11.8102, 11.9222, 12.0331, 12.143,
12.2519, 12.3599, 12.4669, 12.573, 12.6782, 12.7826]
df2 = pd.DataFrame(Cell_mid, columns=['Cell_mid'])
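As an aside, with df and df2 loaded as above, a plain NumPy broadcasting sketch is another way to build the nearest-index column (this is not what merge_asof does internally, just an alternative for data this small):
import numpy as np

obs = df['Distance_obs'].to_numpy()
mid = df2['Cell_mid'].to_numpy()

# 62 x 62 matrix of absolute differences; argmin along axis 1 picks the nearest Cell_mid row
df['Cell_mid_index'] = np.abs(obs[:, None] - mid[None, :]).argmin(axis=1)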

Pandas vlookup like operation to a list

I am unable to properly explain my requirement, but I can show the expected result.
I have a dataframe that looks like so:
Series1   Series2
1370307   1370306
927092    927091
925392    925391
925390    925389
2344089   2344088
1827855   1827854
1715793   1715792
2356467   2356466
1463264   1463263
1712684   1712683
actual dataframe size: 902811 rows × 2 columns
Then I have another dataframe of the unique values of Series2, which I've created using value_counts:
df2 = df['Series2'].value_counts().rename_axis('Series2').to_frame('counts').reset_index()
Then I need a list of matching Series1 values for each Series2 value:
The expected result is:
Series2  counts  Series1_List
2543113  6       [2543114, 2547568, 2559207, 2563778, 2564330, 2675803]
2557212  6       [2557213, 2557301, 2559192, 2576080, 2675693, 2712790]
2432032  5       [2432033, 2444169, 2490928, 2491392, 2528056]
2559269  5       [2559270, 2576222, 2588034, 2677710, 2713207]
2439554  5       [2439555, 2441882, 2442272, 2443590, 2443983]
2335180  5       [2335181, 2398282, 2527060, 2527321, 2565487]
2494111  4       [2494112, 2495321, 2526026, 2528492]
2559195  4       [2559196, 2570172, 2634537, 2675718]
2408775  4       [2408776, 2409117, 2563765, 2564320]
2408773  4       [2408774, 2409116, 2563764, 2564319]
I achieve this (although only for a subset of 50 rows) using the following code:
df2.loc[:50,'Series1_List'] = df2.loc[:50,'Series2'].apply(lambda x: df[df['Series2']==x]['Series1'].tolist())
If I do this for the whole dataframe, it doesn't complete even in 20 minutes.
So the question is whether there is a faster, more efficient way of achieving this result.
IIUC, use:
df2 = (df.groupby('Series2', as_index=False)
         .agg(counts=('Series1', 'count'), Series1_List=('Series1', list))
      )
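If you also want the rows ordered by group size, as in the expected output, you can sort after aggregating. A minimal sketch with a tiny made-up frame (the column names match the question, the values don't):
import pandas as pd

df = pd.DataFrame({'Series1': [11, 21, 22, 31],
                   'Series2': [10, 20, 20, 30]})

df2 = (df.groupby('Series2', as_index=False)
         .agg(counts=('Series1', 'count'), Series1_List=('Series1', list))
         .sort_values('counts', ascending=False)   # largest groups first
         .reset_index(drop=True))
print(df2)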

How to use one dataframe's index to reindex another one in pandas

I'm sorry, I truly don't know what title I should use, but here is my question.
Stocks_Open
d-1 d-2 d-3 d-4
000001.HR 1817.670960 1808.937405 1796.928768 1804.570628
000002.ZH 4867.910878 4652.713598 4652.713598 4634.904168
000004.HD 92.046474 92.209029 89.526880 96.435445
000005.SS 28.822245 28.636893 28.358865 28.729569
000006.SH 192.362963 189.174626 185.986290 187.403328
000007.SH 79.190528 80.515892 81.509916 78.693516
Stocks_Volume
d-1 d-2 d-3 d-4
000001.HR 324234 345345 657546 234234
000002.ZH 4867343 465234 4652598 4634168
000004.HD 9246474 929029 826880 965445
000005.SS 2822245 2836893 2858865 2829569
000006.SH 19262963 1897466 1886290 183328
000007.SH 7190528 803892 809916 7693516
Above are the data I parsed from a database. What I want to do is obtain the correlation of open price and volume over 4 days for each stock (the first column consists of the codes of the different stocks). In other words, I am trying to calculate the correlation of the corresponding rows of the two DataFrames. (This is only a simplified example; the real data extends to more than 1000 different stocks.)
My attempt is to create a dataframe and run a loop, assigning the results to that dataframe. But there is a problem: the index of the created dataframe is not exactly what I want. When I tried to append the correlation column, the bug occurred. (Please ignore the correlation values, which I concocted here just to give an example.)
r = pd.DataFrame(index=range(6), columns=['c'])
for i in range(6):
    r.iloc[i-1,:] = Stocks_Open.iloc[i-1].corr(Stocks_Volume.iloc[i-1])
Correlation_in_4days = pd.concat([Stocks_Open, Stocks_Volume], axis=1)
Correlation_in_4days['corr'] = r['c']
for i in range(6):
    Correlation_in_4days.iloc[i-1,8] = r.iloc[i-1,:]
r
        c
1   0.654
2  -0.454
3  0.3321
4  0.2166
5 -0.8772
6  0.3256
The bug occurred.
"ValueError: Incompatible indexer with Series"
I realized that my correlation dataframe's index is an integer and not the stock code, but I don't know how to fix it. Can anyone help?
My ideal result is:
corr
000001.HR 0.654
000002.ZH -0.454
000004.HD 0.3321
000005.SS 0.2166
000006.SH -0.8772
000007.SH 0.3256
Try assigning the index back:
r.index = Stocks_Open.index
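As a side note, the loop can be avoided entirely: DataFrame.corrwith with axis=1 computes the row-wise correlations directly and keeps the stock codes as the index. A minimal sketch with made-up numbers (assuming Stocks_Open and Stocks_Volume share the same index and columns):
import pandas as pd

idx = ['000001.HR', '000002.ZH', '000004.HD']
cols = ['d-1', 'd-2', 'd-3', 'd-4']
Stocks_Open = pd.DataFrame([[1817.7, 1808.9, 1796.9, 1804.6],
                            [4867.9, 4652.7, 4652.7, 4634.9],
                            [92.0, 92.2, 89.5, 96.4]], index=idx, columns=cols)
Stocks_Volume = pd.DataFrame([[324234, 345345, 657546, 234234],
                              [4867343, 465234, 4652598, 4634168],
                              [9246474, 929029, 826880, 965445]], index=idx, columns=cols)

# one correlation per row (per stock), indexed by the stock code
corr = Stocks_Open.corrwith(Stocks_Volume, axis=1).to_frame('corr')
print(corr)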

How to find the average of data samples at random intervals in python?

I have temperature data stored in a csv file which, when plotted, looks like the image below. How do I find the average during each interval when the temperature goes above 12? The result should be T1, T2, T3: the average temperature during each interval in which the value is above 12.
Could you please suggest how to achieve this in python?
I've highlighted the approximate areas over which I need to calculate the average:
Please find below sample data:
R3,R4
1,11
2,11
3,11
4,11
5,11
6,15.05938512
7,15.12975992
8,15.05850141
18,15.1677708
19,15.00921862
20,15.00686921
21,15.01168888
22,11
23,11
24,11
25,11
26,11
27,15.05938512
28,15.12975992
29,15.05850141
30,15.00328706
31,15.12622611
32,15.01479819
33,15.17778891
34,15.01411488
35,9
36,9
37,9
38,9
39,16.16042435
40,16.00091253
41,16.00419677
42,16.15381827
43,16.0471766
44,16.03725301
45,16.13925003
46,16.00072279
47,11
48,1
In pandas, an idea would be to group the data based on the condition T > 12 and use mean as the aggregation function. For example:
import pandas as pd
# a dummy df:
df = pd.DataFrame({'T': [11, 13, 13, 10, 14]})
# set the condition
m = df['T'] > 12
# define groups
grouper = (~m).cumsum().where(m)
# ...looks like
# 0 NaN
# 1 1.0
# 2 1.0
# 3 NaN
# 4 2.0
# Name: T, dtype: float64
# now we can easily calculate the mean for each group:
grp_mean = df.groupby(grouper)['T'].mean()
# T
# 1.0    13.0
# 2.0    14.0
# Name: T, dtype: float64
Note: if you have noisy data (T jumps up and down), it might be clever to apply a filter first (savgol, median etc. - whatever is appropriate) so you don't end up with groups caused by the noise.
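For reference, a minimal sketch applying the same idea to the sample data in the question, where column R4 holds the temperature (the file name sample.csv is an assumption):
import pandas as pd

df = pd.read_csv('sample.csv')         # columns R3 (time step) and R4 (temperature)

m = df['R4'] > 12                      # condition: temperature above 12
grouper = (~m).cumsum().where(m)       # one label per contiguous run above 12
interval_means = df.groupby(grouper)['R4'].mean()
print(interval_means)                  # T1, T2, T3, ... one mean per interval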
I couldn't find a good pattern for this - here's a clunky bit of code that does what you want, though.
In general, use .shift() to find transition points, and use groupby with transform to get your means.
import pandas as pd
import numpy as np

#if you had a csv with Dates and Temps, do this
#tempsDF = pd.read_csv("temps.csv", names=["Date","Temp"])
#tempsDF.set_index("Date", inplace=True)
#Using fake data since I don't have your csv
tempsDF = pd.DataFrame({'Temp': [0,13,14,13,8,7,5,0,14,16,16,0,0,0]})
#This is a bit clunky - I bet there's a more elegant way to do it
tempsDF["CumulativeFlag"] = 0
tempsDF.loc[tempsDF["Temp"]>12, "CumulativeFlag"]=1
tempsDF.loc[tempsDF["CumulativeFlag"] > tempsDF["CumulativeFlag"].shift(), "HighTempGroup"] = list(range(1,len(tempsDF.loc[tempsDF["CumulativeFlag"] > tempsDF["CumulativeFlag"].shift()])+1))
tempsDF["HighTempGroup"].fillna(method='ffill', inplace=True)
tempsDF.loc[tempsDF["Temp"]<=12, "HighTempGroup"]= None
tempsDF["HighTempMean"] = tempsDF.groupby("HighTempGroup").transform(np.mean)["Temp"]

Inexpensive way to add time series intensity in python pandas dataframe

I am trying to sum (and plot) a total from functions which change states at different times using Python's Pandas.DataFrame. For example:
Suppose we have 3 people whose states can be a) holding nothing, b) holding a 5 pound weight, and c) holding a 10 pound weight. Over time, these people pick weights up and put them down. I want to plot the total amount of weight being held. So, given:
My brute force attempt:
import pandas as ps
import math
import numpy as np
person1=[3,0,10,10,10,10,10]
person2=[4,0,20,20,25,25,40]
person3=[5,0,5,5,15,15,40]
allPeopleDf=ps.DataFrame(np.array(zip(person1,person2,person3)).T)
allPeopleDf.columns=['count','start1', 'end1', 'start2', 'end2', 'start3','end3']
allPeopleDfNoCount=allPeopleDf[['start1', 'end1', 'start2', 'end2', 'start3','end3']]
uniqueTimes=sorted(ps.unique(allPeopleDfNoCount.values.ravel()))
possibleStates=[-1,0,1,2] #extra state 0 for initialization
stateData={}
comboStates={}
#initialize dict to add up all of the stateData
for time in uniqueTimes:
    comboStates[time]=0.0
allPeopleDf['track']=-1
allPeopleDf['status']=-1
numberState=len(possibleStates)
starti=-1
endi=0
startState=0
for i in range(3):
    starti=starti+2
    print starti
    endi=endi+2
    for time in uniqueTimes:
        def helper(row):
            start=row[starti]
            end=row[endi]
            track=row[7]
            if start <= time and time < end:
                return possibleStates[i+1]
            else:
                return possibleStates[0]
        def trackHelp(row):
            status=row[8]
            track=row[7]
            if track<=status:
                return status
            else:
                return track
        def Multiplier(row):
            x=row[8]
            if x==0:
                return 0.0*row[0]
            if x==1:
                return 5.0*row[0]
            if x==2:
                return 10.0*row[0]
            if x==-1: #numeric place holder for non-contributing
                return 0.0*row[0]
        allPeopleDf['status']=allPeopleDf.apply(helper,axis=1)
        allPeopleDf['track']=allPeopleDf.apply(trackHelp,axis=1)
        stateData[time]=allPeopleDf.apply(Multiplier,axis=1).sum()
    for k,v in stateData.iteritems():
        comboStates[k]=comboStates.get(k,0)+v
print allPeopleDf
print stateData
print comboStates
Plots of weight being held over time might look like the following:
And the sum of the intensities over time might look like the black line in the following:
with the black line defined with the Cartesian points: (0,0 lbs),(5,0 lbs),(5,5 lbs),(15,5 lbs),(15,10 lbs),(20,10 lbs),(20,15 lbs),(25,15 lbs),(25,20 lbs),(40,20 lbs). However, I'm flexible and don't necessarily need to define the combined intensity line as a set of Cartesian points. The unique times can be found with:
print list(set(uniqueTimes).intersection(allNoCountT[1].values.ravel())).sort()
but I can't come up with a slick way of getting the corresponding intensity values.
I started out with a very ugly function to break apart each "person's" graph so that all people had start and stop times (albeit many without a state change) at the same times, and then I could add up all the "chunks" of time. This was cumbersome; there has to be a slick pandas way of handling this. If anyone can offer a suggestion or point me to another SO question like this that I might have missed, I'd appreciate the help!
In case my simplified example isn't clear, another might be plotting the intensity of sound coming from a piano: there are many notes being played for different durations with different intensities. I would like the sum of intensity coming from the piano over time. While my example is simplistic, I need a solution that is more on the scale of a piano song: thousands of discrete intensity levels per key, and many keys contributing over the course of a song.
Edit--Implementation of mgab's provided solution:
import pandas as ps
import math
import numpy as np
person1=['person1',3,0.0,10.0,10.0,10.0,10.0,10.0]
person2=['person2',4,0,20,20,25,25,40]
person3=['person3',5,0,5,5,15,15,40]
allPeopleDf=ps.DataFrame(np.array(zip(person1,person2,person3)).T)
allPeopleDf.columns=['id','intensity','start1', 'end1', 'start2', 'end2', 'start3','end3']
allPeopleDf=ps.melt(allPeopleDf,id_vars=['intensity','id'])
allPeopleDf.columns=['intensity','id','timeid','time']
df=ps.DataFrame(allPeopleDf).drop('timeid',1)
df[df.id=='person1'].drop('id',1) #easier to visualize one id for check
df['increment']=df.groupby('id')['intensity'].transform( lambda x: x.sub(x.shift(), fill_value= 0 ))
TypeError: unsupported operand type(s) for -: 'str' and 'int'
End Edit
Going with the piano keys example, let's assume you have three keys, with 30 levels of intensity.
I would try to keep the data in this format:
import pandas as pd
df = pd.DataFrame([[10,'A',5],
                   [10,'B',7],
                   [13,'C',10],
                   [15,'A',15],
                   [20,'A',7],
                   [23,'C',0]], columns=["time", "key", "intensity"])
time key intensity
0 10 A 5
1 10 B 7
2 13 C 10
3 15 A 15
4 20 A 7
5 23 C 0
where you record every change in intensity of any of the keys. From here you can already get the Cartesian coordinates for each individual key as (time,intensity) pairs
df[df.key=="A"].drop('key',1)
time intensity
0 10 5
3 15 15
4 20 7
Then you can easily create a new column, increment, that indicates the change in intensity that occurred for that key at that time point (intensity holds just the new value of the intensity):
df["increment"]=df.groupby("key")["intensity"].transform(
lambda x: x.sub(x.shift(), fill_value= 0 ))
df
time key intensity increment
0 10 A 5 5
1 10 B 7 7
2 13 C 10 10
3 15 A 15 10
4 20 A 7 -8
5 23 C 0 -10
And then, using this new column, you can generate the (time, total_intensity) pairs to use as Cartesian coordinates
df.groupby("time").sum()["increment"].cumsum()
time
10 12
13 22
15 32
20 24
23 14
dtype: int64
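If you then want to plot that running total as the step-like black line described in the question, one possible sketch (assuming matplotlib is available):
import matplotlib.pyplot as plt

total = df.groupby("time")["increment"].sum().cumsum()

# each change only takes effect at its time stamp, so a post-step plot matches the description
total.plot(drawstyle="steps-post")
plt.xlabel("time")
plt.ylabel("total intensity")
plt.show()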
EDIT: applying the specific data presented in the question
I assume the data comes as a list of values, starting with the element id (person/piano key), then a factor multiplying the measured weights/intensities for this element, and then pairs of time values indicating the start and end of a series of known states (weight being carried/intensity being emitted). Not sure if I got the data format right. From your question:
data1=['person1',3,0.0,10.0,10.0,10.0,10.0,10.0]
data2=['person2',4,0,20,20,25,25,40]
data3=['person3',5,0,5,5,15,15,40]
And if we know the weight/intensity of each one of the states, we can define:
known_states = [5, 10, 15]
DF_columns = ["time", "id", "intensity"]
Then, the easiest way I came up to load the data includes this function:
import pandas as pd
def read_data(data, states, columns):
    id = data[0]
    factor = data[1]
    reshaped_data = []
    for i in xrange(len(states)):
        j = 2 + 2*i  # index of the start time for state i
        if not data[j] == data[j+1]:
            reshaped_data.append([data[j], id, factor*states[i]])
            reshaped_data.append([data[j+1], id, -1*factor*states[i]])
    return pd.DataFrame(reshaped_data, columns=columns)
Notice that the if not data[j] == data[j+1]: check avoids loading data into the dataframe when the start and end times for a given state are equal (that seems uninformative, and wouldn't appear in your plots anyway). Take it out if you still want these entries.
Then, you load the data:
df = read_data(data1, known_states, DF_columns)
df = df.append(read_data(data2, known_states, DF_columns), ignore_index=True)
df = df.append(read_data(data3, known_states, DF_columns), ignore_index=True)
# and so on...
And then you're right back at the beginning of this answer (substituting 'key' with 'id' and the key names with the ids, of course).
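If I read read_data correctly, its intensity column already holds the signed changes (positive at a state's start time, negative at its end), so under that assumption the running total could also be computed directly from the loaded frame (a sketch):
# one total change per time stamp, then a cumulative sum over time
total = df.groupby("time")["intensity"].sum().cumsum()
print(total)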
Appears to be what .sum() is for:
In [10]:
allPeopleDf.sum()
Out[10]:
aStart 0
aEnd 35
bStart 35
bEnd 50
cStart 50
cEnd 90
dtype: int32
