Array Split in data frame column without warning - python

The data frame has the columns 'date_sq' and 'value'; the 'value' column is a single tab-separated string holding 201 values.
| date_sq | value |
|------------|-------|
| 2022-05-04 |13360\t12597\t13896\t8262\t12851\t12345\t12849\t0\t0\t0\t0\t0\t21320\t21301\t22597\t13624\t3\t0\t0\t0\t13360\t12597\t13896\t8262\t12851\t12345\t12849\t0\t0\t0\t0\t0\t21320\t21301\t22597\t13624\t2\t0\t0\t0\t13360\t12597\t13896\t8262\t12851\t12345\t12849\t0\t0\t0\t0\t0\t21320\t21301\t22597\t13624\t1\t0\t0\t0\t13360\t12597\t13896\t8262\t12851\t12345\t12849\t0\t0\t0\t0\t0\t21320\t21301\t22597\t13624\t0\t0\t0\t0\t13360\t12341\t13379\t8257\t14641\t13106\t12854\t0\t0\t0\t0\t0\t13123\t13139\t17473\t13105\t4\t0\t0\t0\t13360\t12341\t13379\t8257\t14641\t13106\t12854\t0\t0\t0\t0\t0\t13123\t13139\t17473\t13105\t3\t0\t0\t1\t13360\t12341\t13379\t8257\t14641\t13106\t12854\t0\t0\t0\t0\t0\t13123\t13139\t17473\t13105\t2\t0\t0\t1\t13360\t12341\t13379\t8257\t14641\t13106\t12854\t0\t0\t0\t0\t0\t13123\t13139\t17473\t13105\t1\t0\t0\t1\t13360\t12341\t13379\t8257\t14641\t13106\t12854\t0\t0\t0\t0\t0\t13123\t13139\t17473\t13105\t0\t0\t0\t1\t13360\t14644\t22099\t8257\t13105\t13879\t12853\t0\t0\t0\t0\t0\t16706\t21558\t17474\t13873\t4\t0\t0\t1\t2949\r\n |
The code below works. It does the following:
Splits the 'value' column into 201 separate columns
Creates a numeric date
Creates a 'key' column
Drops the unnecessary columns
df_spec[[f'{x}' for x in range(total_cols)]] = df_spec['value'].str.split(pat="\t", expand=True).replace(r'\s+|\\n', ' ', regex=True).fillna('0').apply(pd.to_numeric)
df_spec['date'] = pd.to_datetime(df_spec['date_sq']).dt.strftime("%Y%m%d")
df_spec['key'] = (df_spec['date'].astype(str) + df_spec['200'].astype(str)).apply(pd.to_numeric)
df_spec.drop(['value','date_sq','date'], axis=1, inplace=True)
Requirement:
The code above works, but it throws several warning messages.
Is there an optimized way to do this without the warnings?
Warning:
<ipython-input-3-70211686c759>:2: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
df_spec[[f'{x}' for x in range(total_cols)]] = df_spec['value'].str.split \
<ipython-input-3-70211686c759>:2: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
df_spec[[f'{x}' for x in range(total_cols)]] = df_spec['value'].str.split \
... goes on for some lines...
Final dataframe:
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 | 32 | 33 | 34 | 35 | 36 | 37 | 38 | 39 | 40 | 41 | 42 | 43 | 44 | 45 | 46 | 47 | 48 | 49 | 50 | 51 | 52 | 53 | 54 | 55 | 56 | 57 | 58 | 59 | 60 | 61 | 62 | 63 | 64 | 65 | 66 | 67 | 68 | 69 | 70 | 71 | 72 | 73 | 74 | 75 | 76 | 77 | 78 | 79 | 80 | 81 | 82 | 83 | 84 | 85 | 86 | 87 | 88 | 89 | 90 | 91 | 92 | 93 | 94 | 95 | 96 | 97 | 98 | 99 | 100 | 101 | 102 | 103 | 104 | 105 | 106 | 107 | 108 | 109 | 110 | 111 | 112 | 113 | 114 | 115 | 116 | 117 | 118 | 119 | 120 | 121 | 122 | 123 | 124 | 125 | 126 | 127 | 128 | 129 | 130 | 131 | 132 | 133 | 134 | 135 | 136 | 137 | 138 | 139 | 140 | 141 | 142 | 143 | 144 | 145 | 146 | 147 | 148 | 149 | 150 | 151 | 152 | 153 | 154 | 155 | 156 | 157 | 158 | 159 | 160 | 161 | 162 | 163 | 164 | 165 | 166 | 167 | 168 | 169 | 170 | 171 | 172 | 173 | 174 | 175 | 176 | 177 | 178 | 179 | 180 | 181 | 182 | 183 | 184 | 185 | 186 | 187 | 188 | 189 | 190 | 191 | 192 | 193 | 194 | 195 | 196 | 197 | 198 | 199 | 200 | key |
|-------|-------|-------|------|-------|-------|-------|---|---|---|----|----|-------|-------|-------|-------|----|----|----|----|-------|-------|-------|------|-------|-------|-------|----|----|----|----|----|-------|-------|-------|-------|----|----|----|----|-------|-------|-------|------|-------|-------|-------|----|----|----|----|----|-------|-------|-------|-------|----|----|----|----|-------|-------|-------|------|-------|-------|-------|----|----|----|----|----|-------|-------|-------|-------|----|----|----|----|-------|-------|-------|------|-------|-------|-------|----|----|----|----|----|-------|-------|-------|-------|----|----|----|----|-------|-------|-------|------|-------|-------|-------|-----|-----|-----|-----|-----|-------|-------|-------|-------|-----|-----|-----|-----|-------|-------|-------|------|-------|-------|-------|-----|-----|-----|-----|-----|-------|-------|-------|-------|-----|-----|-----|-----|-------|-------|-------|------|-------|-------|-------|-----|-----|-----|-----|-----|-------|-------|-------|-------|-----|-----|-----|-----|-------|-------|-------|------|-------|-------|-------|-----|-----|-----|-----|-----|-------|-------|-------|-------|-----|-----|-----|-----|-------|-------|-------|------|-------|-------|-------|-----|-----|-----|-----|-----|-------|-------|-------|-------|-----|-----|-----|-----|------|-------------|
| 13360 | 12597 | 13896 | 8262 | 12851 | 12345 | 12849 | 0 | 0 | 0 | 0 | 0 | 21320 | 21301 | 22597 | 13624 | 3 | 0 | 0 | 0 | 13360 | 12597 | 13896 | 8262 | 12851 | 12345 | 12849 | 0 | 0 | 0 | 0 | 0 | 21320 | 21301 | 22597 | 13624 | 2 | 0 | 0 | 0 | 13360 | 12597 | 13896 | 8262 | 12851 | 12345 | 12849 | 0 | 0 | 0 | 0 | 0 | 21320 | 21301 | 22597 | 13624 | 1 | 0 | 0 | 0 | 13360 | 12597 | 13896 | 8262 | 12851 | 12345 | 12849 | 0 | 0 | 0 | 0 | 0 | 21320 | 21301 | 22597 | 13624 | 0 | 0 | 0 | 0 | 13360 | 12341 | 13379 | 8257 | 14641 | 13106 | 12854 | 0 | 0 | 0 | 0 | 0 | 13123 | 13139 | 17473 | 13105 | 4 | 0 | 0 | 0 | 13360 | 12341 | 13379 | 8257 | 14641 | 13106 | 12854 | 0 | 0 | 0 | 0 | 0 | 13123 | 13139 | 17473 | 13105 | 3 | 0 | 0 | 1 | 13360 | 12341 | 13379 | 8257 | 14641 | 13106 | 12854 | 0 | 0 | 0 | 0 | 0 | 13123 | 13139 | 17473 | 13105 | 2 | 0 | 0 | 1 | 13360 | 12341 | 13379 | 8257 | 14641 | 13106 | 12854 | 0 | 0 | 0 | 0 | 0 | 13123 | 13139 | 17473 | 13105 | 1 | 0 | 0 | 1 | 13360 | 12341 | 13379 | 8257 | 14641 | 13106 | 12854 | 0 | 0 | 0 | 0 | 0 | 13123 | 13139 | 17473 | 13105 | 0 | 0 | 0 | 1 | 13360 | 14644 | 22099 | 8257 | 13105 | 13879 | 12853 | 0 | 0 | 0 | 0 | 0 | 16706 | 21558 | 17474 | 13873 | 4 | 0 | 0 | 1 | 2949 | 202205052949 |

I tried the following and got no warnings; I hope it helps in your case:
df = pd.DataFrame([['2022-05-04', '13360\t12597\t13896\t8262\t12851\t12345\t12849\t0\t0\t0\t0\t0\t21320\t21301\t22597\t13624\t3\t0\t0\t0\t13360\t12597\t13896\t8262\t12851\t12345\t12849\t0\t0\t0\t0\t0\t21320\t21301\t22597\t13624\t2\t0\t0\t0\t13360\t12597\t13896\t8262\t12851\t12345\t12849\t0\t0\t0\t0\t0\t21320\t21301\t22597\t13624\t1\t0\t0\t0\t13360\t12597\t13896\t8262\t12851\t12345\t12849\t0\t0\t0\t0\t0\t21320\t21301\t22597\t13624\t0\t0\t0\t0\t13360\t12341\t13379\t8257\t14641\t13106\t12854\t0\t0\t0\t0\t0\t13123\t13139\t17473\t13105\t4\t0\t0\t0\t13360\t12341\t13379\t8257\t14641\t13106\t12854\t0\t0\t0\t0\t0\t13123\t13139\t17473\t13105\t3\t0\t0\t1\t13360\t12341\t13379\t8257\t14641\t13106\t12854\t0\t0\t0\t0\t0\t13123\t13139\t17473\t13105\t2\t0\t0\t1\t13360\t12341\t13379\t8257\t14641\t13106\t12854\t0\t0\t0\t0\t0\t13123\t13139\t17473\t13105\t1\t0\t0\t1\t13360\t12341\t13379\t8257\t14641\t13106\t12854\t0\t0\t0\t0\t0\t13123\t13139\t17473\t13105\t0\t0\t0\t1\t13360\t14644\t22099\t8257\t13105\t13879\t12853\t0\t0\t0\t0\t0\t16706\t21558\t17474\t13873\t4\t0\t0\t1\t2949\r\n']], columns=['date_sq', 'value'])
date = df['date_sq']
df = df['value'].str.split('\t', expand=True).fillna('0').apply(pd.to_numeric)# .explode().T.reset_index(drop=True)
df['date'] = pd.to_datetime(date).dt.strftime("%Y%m%d")
df['key'] = (df['date'].astype(str) + df[200].astype(str)).apply(pd.to_numeric)
df.drop(['date'], axis=1, inplace=True)
df
I think df_spec[[f'{x}' for x in range(total_cols)]] is unnecessary when using split(..., expand=True). Good luck!
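If you need to keep the other columns of df_spec, another option (a sketch that follows the warning's own suggestion; it assumes the df_spec, total_cols and column names from the question) is to build all the split columns in a separate frame and attach them with a single pd.concat, which avoids the many frame.insert calls that trigger the PerformanceWarning:
import pandas as pd

# Split once into a separate frame; strip trailing '\r\n' whitespace and convert to numbers.
split_cols = (
    df_spec['value']
    .str.split(pat='\t', expand=True)
    .replace(r'\s+', '', regex=True)
    .fillna('0')
    .apply(pd.to_numeric)
)
split_cols.columns = [f'{x}' for x in range(split_cols.shape[1])]

# One concat instead of 201 individual column inserts.
df_spec = pd.concat([df_spec, split_cols], axis=1)

df_spec['date'] = pd.to_datetime(df_spec['date_sq']).dt.strftime("%Y%m%d")
df_spec['key'] = pd.to_numeric(df_spec['date'] + df_spec['200'].astype(str))
df_spec.drop(['value', 'date_sq', 'date'], axis=1, inplace=True)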

Related

How do I change the values of a column with respect to certain conditions using pandas? [duplicate]

This question already has answers here:
vectorize conditional assignment in pandas dataframe
(5 answers)
Closed 1 year ago.
The data frame contains 5 columns named V, W, X, Y, Z.
I'm supposed to change the values in column X according to these rules:
if 1 to 100, change to 1
if 101 to 200, change to 2
if 201 to 300, change to 3
otherwise, change to 4
What's the most efficient way this can be done?
Using an example df and #Pygirl's idea:
df = pd.DataFrame({'a': np.random.randint(0, 400, 10), 'b': np.random.randint(0, 400, 10), 'X': np.zeros(10)})
gives us:
| | a | b | X |
|---:|----:|----:|----:|
| 0 | 237 | 188 | 0 |
| 1 | 212 | 147 | 0 |
| 2 | 135 | 30 | 0 |
| 3 | 296 | 154 | 0 |
| 4 | 133 | 219 | 0 |
| 5 | 185 | 317 | 0 |
| 6 | 365 | 5 | 0 |
| 7 | 108 | 189 | 0 |
| 8 | 358 | 34 | 0 |
| 9 | 105 | 2 | 0 |
With pd.cut (see the docs) we can get:
df['X'] = pd.cut(df.a, [0, 100, 200, 300, np.inf], labels=[1, 2, 3, 4])
which results in:
| | a | b | X |
|---:|----:|----:|----:|
| 0 | 377 | 232 | 4 |
| 1 | 61 | 11 | 1 |
| 2 | 52 | 217 | 1 |
| 3 | 191 | 42 | 2 |
| 4 | 178 | 228 | 2 |
| 5 | 235 | 206 | 3 |
| 6 | 39 | 222 | 1 |
| 7 | 316 | 210 | 4 |
| 8 | 135 | 390 | 2 |
| 9 | 44 | 311 | 1 |
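For completeness, the same mapping can also be written with np.select, another common way to vectorize this kind of conditional assignment (a sketch using the example column 'a' from above; it is not part of the original answer):
import numpy as np

conditions = [
    df['a'].between(1, 100),    # 1 to 100   -> 1
    df['a'].between(101, 200),  # 101 to 200 -> 2
    df['a'].between(201, 300),  # 201 to 300 -> 3
]
# Everything else (including 0 and values above 300) falls through to 4.
df['X'] = np.select(conditions, [1, 2, 3], default=4)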

Binning Pandas value_counts

I have a Pandas Series produced by df.column.value_counts().sort_index().
| N Months | Count |
|------|------|
| 0 | 15 |
| 1 | 9 |
| 2 | 78 |
| 3 | 151 |
| 4 | 412 |
| 5 | 181 |
| 6 | 543 |
| 7 | 175 |
| 8 | 409 |
| 9 | 594 |
| 10 | 137 |
| 11 | 202 |
| 12 | 170 |
| 13 | 446 |
| 14 | 29 |
| 15 | 39 |
| 16 | 44 |
| 17 | 253 |
| 18 | 17 |
| 19 | 34 |
| 20 | 18 |
| 21 | 37 |
| 22 | 147 |
| 23 | 12 |
| 24 | 31 |
| 25 | 15 |
| 26 | 117 |
| 27 | 8 |
| 28 | 38 |
| 29 | 23 |
| 30 | 198 |
| 31 | 29 |
| 32 | 122 |
| 33 | 50 |
| 34 | 60 |
| 35 | 357 |
| 36 | 329 |
| 37 | 457 |
| 38 | 609 |
| 39 | 4744 |
| 40 | 1120 |
| 41 | 591 |
| 42 | 328 |
| 43 | 148 |
| 44 | 46 |
| 45 | 10 |
| 46 | 1 |
| 47 | 1 |
| 48 | 7 |
| 50 | 2 |
My desired output is:
| bin | Total |
|-------|--------|
| 0-13 | 3522 |
| 14-26 | 793 |
| 27-50 | 9278 |
I tried df.column.value_counts(bins=3).sort_index() but got
| bin | Total |
|---------------------------------|-------|
| (-0.051000000000000004, 16.667] | 3634 |
| (16.667, 33.333] | 1149 |
| (33.333, 50.0] | 8810 |
I can get the correct result with:
a = df.column.value_counts().sort_index()[:14].sum()
b = df.column.value_counts().sort_index()[14:27].sum()
c = df.column.value_counts().sort_index()[27:].sum()
print(a, b, c)
Output: 3522 793 9278
But I am wondering if there is a pandas method that can do what I want. Any advice is very welcome. :-)
You can use pd.cut:
pd.cut(df['N Months'], [0,13, 26, 50], include_lowest=True).value_counts()
Update: you should be able to pass custom bins directly to value_counts:
df['N Months'].value_counts(bins = [0,13, 26, 50])
Output:
N Months
(-0.001, 13.0] 3522
(13.0, 26.0] 793
(26.0, 50.0] 9278
Name: Count, dtype: int64
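If you also want the human-readable bin labels from the desired output, pd.cut accepts a labels argument (a small sketch based on the answer above; the bins/labels lists and the 'bin'/'Total' names are mine):
import pandas as pd

bins = [0, 13, 26, 50]
labels = ['0-13', '14-26', '27-50']

# Bin the raw column, count per bin, and keep the bins in order.
out = (
    pd.cut(df['N Months'], bins=bins, labels=labels, include_lowest=True)
    .value_counts()
    .sort_index()
    .rename_axis('bin')
    .rename('Total')
)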

Pandas: sum multiple columns based on similar consecutive numbers in another column

Given the following table
+----+--------+--------+--------------+
| Nr | Price | Volume | Transactions |
+----+--------+--------+--------------+
| 1 | 194.6 | 100 | 1 |
| 2 | 195 | 10 | 1 |
| 3 | 194.92 | 100 | 1 |
| 4 | 194.92 | 52 | 1 |
| 5 | 194.9 | 99 | 1 |
| 6 | 194.86 | 74 | 1 |
| 7 | 194.85 | 900 | 1 |
| 8 | 194.85 | 25 | 1 |
| 9 | 194.85 | 224 | 1 |
| 10 | 194.6 | 101 | 1 |
| 11 | 194.85 | 19 | 1 |
| 12 | 194.6 | 10 | 1 |
| 13 | 194.6 | 25 | 1 |
| 14 | 194.53 | 12 | 1 |
| 15 | 194.85 | 14 | 1 |
| 16 | 194.6 | 11 | 1 |
| 17 | 194.85 | 93 | 1 |
| 18 | 195 | 90 | 1 |
| 19 | 195 | 100 | 1 |
| 20 | 195 | 50 | 1 |
| 21 | 195 | 50 | 1 |
| 22 | 195 | 25 | 1 |
| 23 | 195 | 5 | 1 |
| 24 | 195 | 500 | 1 |
| 25 | 195 | 100 | 1 |
| 26 | 195.09 | 100 | 1 |
| 27 | 195 | 120 | 1 |
| 28 | 195 | 60 | 1 |
| 29 | 195 | 40 | 1 |
| 30 | 195 | 10 | 1 |
| 31 | 194.6 | 1 | 1 |
| 32 | 194.99 | 1 | 1 |
| 33 | 194.81 | 20 | 1 |
| 34 | 194.81 | 50 | 1 |
| 35 | 194.97 | 17 | 1 |
| 36 | 194.99 | 25 | 1 |
| 37 | 195 | 75 | 1 |
+----+--------+--------+--------------+
For faster testing, here is the same table as a pandas DataFrame:
pd_data_before = pd.DataFrame([[1,194.6,100,1],[2,195,10,1],[3,194.92,100,1],[4,194.92,52,1],[5,194.9,99,1],[6,194.86,74,1],[7,194.85,900,1],[8,194.85,25,1],[9,194.85,224,1],[10,194.6,101,1],[11,194.85,19,1],[12,194.6,10,1],[13,194.6,25,1],[14,194.53,12,1],[15,194.85,14,1],[16,194.6,11,1],[17,194.85,93,1],[18,195,90,1],[19,195,100,1],[20,195,50,1],[21,195,50,1],[22,195,25,1],[23,195,5,1],[24,195,500,1],[25,195,100,1],[26,195.09,100,1],[27,195,120,1],[28,195,60,1],[29,195,40,1],[30,195,10,1],[31,194.6,1,1],[32,194.99,1,1],[33,194.81,20,1],[34,194.81,50,1],[35,194.97,17,1],[36,194.99,25,1],[37,195,75,1]],columns=['Nr','Price','Volume','Transactions'])
The question is: how do we sum up Volume and Transactions over runs of identical consecutive prices? The end result would look like this:
+----+--------+--------+--------------+
| Nr | Price | Volume | Transactions |
+----+--------+--------+--------------+
| 1 | 194.6 | 100 | 1 |
| 2 | 195 | 10 | 1 |
| 4 | 194.92 | 152 | 2 |
| 5 | 194.9 | 99 | 1 |
| 6 | 194.86 | 74 | 1 |
| 9 | 194.85 | 1149 | 3 |
| 10 | 194.6 | 101 | 1 |
| 11 | 194.85 | 19 | 1 |
| 13 | 194.6 | 35 | 2 |
| 14 | 194.53 | 12 | 1 |
| 15 | 194.85 | 14 | 1 |
| 16 | 194.6 | 11 | 1 |
| 17 | 194.85 | 93 | 1 |
| 25 | 195 | 920 | 8 |
| 26 | 195.09 | 100 | 1 |
| 30 | 195 | 230 | 4 |
| 31 | 194.6 | 1 | 1 |
| 32 | 194.99 | 1 | 1 |
| 34 | 194.81 | 70 | 2 |
| 35 | 194.97 | 17 | 1 |
| 36 | 194.99 | 25 | 1 |
| 37 | 195 | 75 | 1 |
+----+--------+--------+--------------+
The result is also available ready-made as a pandas DataFrame:
pd_data_after = pd.DataFrame([[1,194.6,100,1],[2,195,10,1],[4,194.92,152,2],[5,194.9,99,1],[6,194.86,74,1],[9,194.85,1149,3],[10,194.6,101,1],[11,194.85,19,1],[13,194.6,35,2],[14,194.53,12,1],[15,194.85,14,1],[16,194.6,11,1],[17,194.85,93,1],[25,195,920,8],[26,195.09,100,1],[30,195,230,4],[31,194.6,1,1],[32,194.99,1,1],[34,194.81,70,2],[35,194.97,17,1],[36,194.99,25,1],[37,195,75,1]],columns=['Nr','Price','Volume','Transactions'])
I managed to achieve this with a for loop, but iterating over each row is very slow and my data set is huge, around 50 million rows.
Is there any way to achieve this without looping?
A common trick to groupby consecutive values is the following:
df.col.ne(df.col.shift()).cumsum()
We can use that here, then use agg to keep a representative value for the columns we aren't summing ('last' for Nr, 'first' for Price) and to sum the ones we are.
(df.groupby(df.Price.ne(df.Price.shift()).cumsum())
.agg({'Nr': 'last', 'Price': 'first', 'Volume':'sum', 'Transactions': 'sum'})
).reset_index(drop=True)
Nr Price Volume Transactions
0 1 194.60 100 1
1 2 195.00 10 1
2 4 194.92 152 2
3 5 194.90 99 1
4 6 194.86 74 1
5 9 194.85 1149 3
6 10 194.60 101 1
7 11 194.85 19 1
8 13 194.60 35 2
9 14 194.53 12 1
10 15 194.85 14 1
11 16 194.60 11 1
12 17 194.85 93 1
13 25 195.00 920 8
14 26 195.09 100 1
15 30 195.00 230 4
16 31 194.60 1 1
17 32 194.99 1 1
18 34 194.81 70 2
19 35 194.97 17 1
20 36 194.99 25 1
21 37 195.00 75 1
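To make the trick more concrete, here is what the grouping key looks like on the first rows (an illustrative snippet using pd_data_before from the question; group_id is just a throwaway name):
# A new id starts whenever Price differs from the previous row.
group_id = pd_data_before.Price.ne(pd_data_before.Price.shift()).cumsum()

print(group_id.head(10).tolist())
# [1, 2, 3, 3, 4, 5, 6, 6, 6, 7]  -> Nr 3-4 (194.92) and Nr 7-9 (194.85) share an id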

Assigning states of Hidden Markov Models by idealized intensity values

I'm running the pomegranate HMM (http://pomegranate.readthedocs.io/en/latest/HiddenMarkovModel.html) on my data, loading the results into a pandas DataFrame, and defining the idealized intensity as the median of all the points in that state: df["hmm_idealized"] = df.groupby(["hmm_state"], as_index=False)["Raw"].transform("median"). Sample data:
+-----+-----------------+-------------+------------+
| | hmm_idealized | hmm_state | hmm_diff |
|-----+-----------------+-------------+------------|
| 0 | 99862 | 3 | nan |
| 1 | 99862 | 3 | 0 |
| 2 | 99862 | 3 | 0 |
| 3 | 99862 | 3 | 0 |
| 4 | 99862 | 3 | 0 |
| 5 | 99862 | 3 | 0 |
| 6 | 117759 | 4 | 1 |
| 7 | 117759 | 4 | 0 |
| 8 | 117759 | 4 | 0 |
| 9 | 117759 | 4 | 0 |
| 10 | 117759 | 4 | 0 |
| 11 | 117759 | 4 | 0 |
| 12 | 117759 | 4 | 0 |
| 13 | 117759 | 4 | 0 |
| 14 | 124934 | 2 | -2 |
| 15 | 124934 | 2 | 0 |
| 16 | 124934 | 2 | 0 |
| 17 | 124934 | 2 | 0 |
| 18 | 124934 | 2 | 0 |
| 19 | 117759 | 4 | 2 |
| 20 | 117759 | 4 | 0 |
| 21 | 117759 | 4 | 0 |
| 22 | 117759 | 4 | 0 |
| 23 | 117759 | 4 | 0 |
| 24 | 117759 | 4 | 0 |
| 25 | 117759 | 4 | 0 |
| 26 | 117759 | 4 | 0 |
| 27 | 117759 | 4 | 0 |
| 28 | 117759 | 4 | 0 |
| 29 | 117759 | 4 | 0 |
| 30 | 117759 | 4 | 0 |
| 31 | 117759 | 4 | 0 |
| 32 | 117759 | 4 | 0 |
| 33 | 117759 | 4 | 0 |
| 34 | 117759 | 4 | 0 |
| 35 | 117759 | 4 | 0 |
| 36 | 117759 | 4 | 0 |
| 37 | 117759 | 4 | 0 |
| 38 | 117759 | 4 | 0 |
| 39 | 117759 | 4 | 0 |
| 40 | 106169 | 1 | -3 |
| 41 | 106169 | 1 | 0 |
| 42 | 106169 | 1 | 0 |
| 43 | 106169 | 1 | 0 |
| 44 | 106169 | 1 | 0 |
| 45 | 106169 | 1 | 0 |
| 46 | 106169 | 1 | 0 |
| 47 | 106169 | 1 | 0 |
| 48 | 106169 | 1 | 0 |
| 49 | 106169 | 1 | 0 |
| 50 | 106169 | 1 | 0 |
| 51 | 106169 | 1 | 0 |
| 52 | 106169 | 1 | 0 |
| 53 | 106169 | 1 | 0 |
| 54 | 106169 | 1 | 0 |
| 55 | 106169 | 1 | 0 |
| 56 | 106169 | 1 | 0 |
| 57 | 106169 | 1 | 0 |
| 58 | 106169 | 1 | 0 |
| 59 | 106169 | 1 | 0 |
| 60 | 106169 | 1 | 0 |
| 61 | 106169 | 1 | 0 |
| 62 | 106169 | 1 | 0 |
| 63 | 106169 | 1 | 0 |
| 64 | 106169 | 1 | 0 |
| 65 | 106169 | 1 | 0 |
| 66 | 106169 | 1 | 0 |
| 67 | 106169 | 1 | 0 |
| 68 | 106169 | 1 | 0 |
| 69 | 106169 | 1 | 0 |
| 70 | 106169 | 1 | 0 |
| 71 | 106169 | 1 | 0 |
| 72 | 106169 | 1 | 0 |
| 73 | 106169 | 1 | 0 |
| 74 | 106169 | 1 | 0 |
| 75 | 99862 | 3 | 2 |
| 76 | 99862 | 3 | 0 |
| 77 | 99862 | 3 | 0 |
| 78 | 99862 | 3 | 0 |
| 79 | 99862 | 3 | 0 |
| 80 | 99862 | 3 | 0 |
| 81 | 99862 | 3 | 0 |
| 82 | 99862 | 3 | 0 |
| 83 | 99862 | 3 | 0 |
| 84 | 99862 | 3 | 0 |
| 85 | 99862 | 3 | 0 |
| 86 | 99862 | 3 | 0 |
| 87 | 99862 | 3 | 0 |
| 88 | 99862 | 3 | 0 |
| 89 | 99862 | 3 | 0 |
| 90 | 99862 | 3 | 0 |
| 91 | 99862 | 3 | 0 |
| 92 | 99862 | 3 | 0 |
| 93 | 99862 | 3 | 0 |
| 94 | 99862 | 3 | 0 |
| 95 | 99862 | 3 | 0 |
| 96 | 99862 | 3 | 0 |
| 97 | 99862 | 3 | 0 |
| 98 | 99862 | 3 | 0 |
| 99 | 99862 | 3 | 0 |
| 100 | 99862 | 3 | 0 |
| 101 | 99862 | 3 | 0 |
| 102 | 99862 | 3 | 0 |
| 103 | 99862 | 3 | 0 |
| 104 | 99862 | 3 | 0 |
| 105 | 99862 | 3 | 0 |
| 106 | 99862 | 3 | 0 |
| 107 | 99862 | 3 | 0 |
| 108 | 94127 | 0 | -3 |
| 109 | 94127 | 0 | 0 |
| 110 | 94127 | 0 | 0 |
| 111 | 94127 | 0 | 0 |
| 112 | 94127 | 0 | 0 |
| 113 | 94127 | 0 | 0 |
| 114 | 94127 | 0 | 0 |
| 115 | 94127 | 0 | 0 |
| 116 | 94127 | 0 | 0 |
| 117 | 94127 | 0 | 0 |
| 118 | 94127 | 0 | 0 |
| 119 | 94127 | 0 | 0 |
| 120 | 94127 | 0 | 0 |
| 121 | 94127 | 0 | 0 |
| 122 | 94127 | 0 | 0 |
| 123 | 94127 | 0 | 0 |
| 124 | 94127 | 0 | 0 |
| 125 | 94127 | 0 | 0 |
| 126 | 94127 | 0 | 0 |
| 127 | 94127 | 0 | 0 |
| 128 | 94127 | 0 | 0 |
| 129 | 94127 | 0 | 0 |
| 130 | 94127 | 0 | 0 |
| 131 | 94127 | 0 | 0 |
| 132 | 94127 | 0 | 0 |
| 133 | 94127 | 0 | 0 |
| 134 | 94127 | 0 | 0 |
| 135 | 94127 | 0 | 0 |
| 136 | 94127 | 0 | 0 |
| 137 | 94127 | 0 | 0 |
| 138 | 94127 | 0 | 0 |
| 139 | 94127 | 0 | 0 |
| 140 | 94127 | 0 | 0 |
| 141 | 94127 | 0 | 0 |
| 142 | 94127 | 0 | 0 |
| 143 | 94127 | 0 | 0 |
| 144 | 94127 | 0 | 0 |
| 145 | 94127 | 0 | 0 |
| 146 | 94127 | 0 | 0 |
| 147 | 94127 | 0 | 0 |
| 148 | 94127 | 0 | 0 |
| 149 | 94127 | 0 | 0 |
| 150 | 94127 | 0 | 0 |
| 151 | 94127 | 0 | 0 |
| 152 | 94127 | 0 | 0 |
| 153 | 94127 | 0 | 0 |
| 154 | 94127 | 0 | 0 |
| 155 | 94127 | 0 | 0 |
| 156 | 94127 | 0 | 0 |
| 157 | 94127 | 0 | 0 |
| 158 | 94127 | 0 | 0 |
| 159 | 94127 | 0 | 0 |
| 160 | 94127 | 0 | 0 |
| 161 | 94127 | 0 | 0 |
| 162 | 94127 | 0 | 0 |
| 163 | 94127 | 0 | 0 |
| 164 | 94127 | 0 | 0 |
| 165 | 94127 | 0 | 0 |
| 166 | 94127 | 0 | 0 |
| 167 | 94127 | 0 | 0 |
| 168 | 94127 | 0 | 0 |
| 169 | 94127 | 0 | 0 |
| 170 | 94127 | 0 | 0 |
| 171 | 94127 | 0 | 0 |
| 172 | 94127 | 0 | 0 |
| 173 | 94127 | 0 | 0 |
| 174 | 94127 | 0 | 0 |
| 175 | 94127 | 0 | 0 |
| 176 | 94127 | 0 | 0 |
| 177 | 94127 | 0 | 0 |
| 178 | 94127 | 0 | 0 |
| 179 | 94127 | 0 | 0 |
| 180 | 94127 | 0 | 0 |
| 181 | 94127 | 0 | 0 |
| 182 | 94127 | 0 | 0 |
| 183 | 94127 | 0 | 0 |
| 184 | 94127 | 0 | 0 |
| 185 | 94127 | 0 | 0 |
| 186 | 94127 | 0 | 0 |
| 187 | 94127 | 0 | 0 |
| 188 | 94127 | 0 | 0 |
| 189 | 94127 | 0 | 0 |
| 190 | 94127 | 0 | 0 |
| 191 | 94127 | 0 | 0 |
| 192 | 94127 | 0 | 0 |
| 193 | 94127 | 0 | 0 |
| 194 | 94127 | 0 | 0 |
| 195 | 94127 | 0 | 0 |
| 196 | 94127 | 0 | 0 |
| 197 | 94127 | 0 | 0 |
| 198 | 94127 | 0 | 0 |
| 199 | 94127 | 0 | 0 |
| 200 | 94127 | 0 | 0 |
+-----+-----------------+-------------+------------+
When analyzing the results, I want to count the number of state increases in the model. I would like to use df.where(df.hmm_diff > 0).count() to read off how many times the model increments by one state. However, an increase sometimes spans two states (i.e. it skips the middle state), so I need to reassign the HMM-state labels by sorting the idealized values, so that the lowest-intensity state becomes 0, the highest 4, and so on. Is there a way to reassign the hmm_state label from an arbitrary one to one that depends on the idealized intensity?
For example, the hmm_state labeled 1 lies between hmm_states 3 and 4 in intensity.
It looks like you just need to define a sorted HMM state like this:
state_orders = {v: i for i, v in enumerate(sorted(df.hmm_idealized.unique()))}
df['sorted_state'] = df.hmm_idealized.map(state_orders)
Then you can continue as you did in the question, but taking a diff on this column, and counting the jumps on it.
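As a hedged sketch of that follow-up step (the sorted_diff / n_up / n_steps names are mine, not from the answer): once sorted_state is ordered by intensity, a diff gives both the number of upward transitions and how many states each one climbs:
# Difference between consecutive intensity-ordered state labels.
df['sorted_diff'] = df['sorted_state'].diff()

# Number of upward transitions, regardless of how many states they skip.
n_up = (df['sorted_diff'] > 0).sum()

# Total number of single-state steps climbed (a jump of +2 counts as two steps).
n_steps = df['sorted_diff'].clip(lower=0).sum()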

where condition is met, get last row pandas

+------------+----+----+----+-----+----+----+----+-----+
| WS | 1 | 2 | 3 | 4 | 1 | 2 | 3 | 4 |
+------------+----+----+----+-----+----+----+----+-----+
| w1 | 0 | 0 | 0 | 50 | 0 | 0 | 0 | 50 |
| w2 | 0 | 30 | 0 | 0 | 0 | 30 | 0 | 0 |
| d1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| d2 | 62 | 0 | 0 | 0 | 62 | 0 | 0 | 0 |
| Total | 62 | 30 | 0 | 50 | 62 | 30 | 0 | 50 |
| Cumulative | 62 | 92 | 92 | 142 | 62 | 92 | 92 | 142 |
+------------+----+----+----+-----+----+----+----+-----+
Based on the condition that a column has a value greater than 0, I would like to get the corresponding value from the "Cumulative" row.
For example, where the value 50 > 0 in the table above, I would like to get the corresponding "Cumulative" value of 142.
+------------+----+----+---+-----+----+----+---+-----+
| WS | 1 | 2 | 3 | 4 | 1 | 2 | 3 | 4 |
+------------+----+----+---+-----+----+----+---+-----+
| Cumulative | 62 | 92 | 0 | 142 | 62 | 92 | 0 | 142 |
+------------+----+----+---+-----+----+----+---+-----+
I have tried pandas loc and iloc but they cannot do what I want.
Thank you in advance!
You definitely should have designed your question better, but anyway, here is a possible solution:
Assuming df is your DataFrame:
df[df>5].loc['Cumulative'].dropna()
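If the goal is to reproduce the desired output exactly (keeping a 0 where the column never exceeds 0, as in column 3), one alternative reading is to mask the 'Cumulative' row with the 'Total' row. This is my interpretation, not part of the original answer, and it assumes 'WS' has been set as the index (e.g. df = df.set_index('WS')):
# Keep the cumulative value only where the column's Total is positive, else 0.
# The condition is passed as a NumPy array so the duplicated column labels align positionally.
result = df.loc['Cumulative'].where(df.loc['Total'].to_numpy() > 0, 0)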
