Binning Pandas value_counts - python

I have a Pandas Series produced by df.column.value_counts().sort_index().
| N Months | Count |
|------|------|
| 0 | 15 |
| 1 | 9 |
| 2 | 78 |
| 3 | 151 |
| 4 | 412 |
| 5 | 181 |
| 6 | 543 |
| 7 | 175 |
| 8 | 409 |
| 9 | 594 |
| 10 | 137 |
| 11 | 202 |
| 12 | 170 |
| 13 | 446 |
| 14 | 29 |
| 15 | 39 |
| 16 | 44 |
| 17 | 253 |
| 18 | 17 |
| 19 | 34 |
| 20 | 18 |
| 21 | 37 |
| 22 | 147 |
| 23 | 12 |
| 24 | 31 |
| 25 | 15 |
| 26 | 117 |
| 27 | 8 |
| 28 | 38 |
| 29 | 23 |
| 30 | 198 |
| 31 | 29 |
| 32 | 122 |
| 33 | 50 |
| 34 | 60 |
| 35 | 357 |
| 36 | 329 |
| 37 | 457 |
| 38 | 609 |
| 39 | 4744 |
| 40 | 1120 |
| 41 | 591 |
| 42 | 328 |
| 43 | 148 |
| 44 | 46 |
| 45 | 10 |
| 46 | 1 |
| 47 | 1 |
| 48 | 7 |
| 50 | 2 |
My desired output is:
| bin | Total |
|-------|--------|
| 0-13 | 3522 |
| 14-26 | 793 |
| 27-50 | 9278 |
I tried df.column.value_counts(bins=3).sort_index() but got
| bin | Total |
|---------------------------------|-------|
| (-0.051000000000000004, 16.667] | 3634 |
| (16.667, 33.333] | 1149 |
| (33.333, 50.0] | 8810 |
I can get the correct result with:
a = df.column.value_counts().sort_index()[:14].sum()
b = df.column.value_counts().sort_index()[14:27].sum()
c = df.column.value_counts().sort_index()[27:].sum()
print(a, b, c)
Output: 3522 793 9278
But I am wondering if there is a pandas method that can do what I want. Any advice is very welcome. :-)

You can use pd.cut:
pd.cut(df['N Months'], [0, 13, 26, 50], include_lowest=True).value_counts()
Update: you should also be able to pass custom bins to value_counts directly:
df['N Months'].value_counts(bins=[0, 13, 26, 50])
Output:
N Months
(-0.001, 13.0] 3522
(13.0, 26.0] 793
(26.0, 50.0] 9278
Name: Count, dtype: int64
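If you want bin labels that match the desired output ('0-13', '14-26', '27-50'), pd.cut also accepts a labels argument. A minimal sketch, assuming the raw values are in df['N Months']:
labels = ['0-13', '14-26', '27-50']
binned = pd.cut(df['N Months'], bins=[0, 13, 26, 50], labels=labels, include_lowest=True)
binned.value_counts().sort_index()
# Expected totals from the table above: 0-13 -> 3522, 14-26 -> 793, 27-50 -> 9278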

Related

Array Split in data frame column without warning

My data frame has the columns 'date_sq' and 'value'; the 'value' column is a tab-separated string of 201 values.
| date_sq | value |
|---------|-------|
| 2022-05-04 |13360\t12597\t13896\t8262\t12851\t12345\t12849\t0\t0\t0\t0\t0\t21320\t21301\t22597\t13624\t3\t0\t0\t0\t13360\t12597\t13896\t8262\t12851\t12345\t12849\t0\t0\t0\t0\t0\t21320\t21301\t22597\t13624\t2\t0\t0\t0\t13360\t12597\t13896\t8262\t12851\t12345\t12849\t0\t0\t0\t0\t0\t21320\t21301\t22597\t13624\t1\t0\t0\t0\t13360\t12597\t13896\t8262\t12851\t12345\t12849\t0\t0\t0\t0\t0\t21320\t21301\t22597\t13624\t0\t0\t0\t0\t13360\t12341\t13379\t8257\t14641\t13106\t12854\t0\t0\t0\t0\t0\t13123\t13139\t17473\t13105\t4\t0\t0\t0\t13360\t12341\t13379\t8257\t14641\t13106\t12854\t0\t0\t0\t0\t0\t13123\t13139\t17473\t13105\t3\t0\t0\t1\t13360\t12341\t13379\t8257\t14641\t13106\t12854\t0\t0\t0\t0\t0\t13123\t13139\t17473\t13105\t2\t0\t0\t1\t13360\t12341\t13379\t8257\t14641\t13106\t12854\t0\t0\t0\t0\t0\t13123\t13139\t17473\t13105\t1\t0\t0\t1\t13360\t12341\t13379\t8257\t14641\t13106\t12854\t0\t0\t0\t0\t0\t13123\t13139\t17473\t13105\t0\t0\t0\t1\t13360\t14644\t22099\t8257\t13105\t13879\t12853\t0\t0\t0\t0\t0\t16706\t21558\t17474\t13873\t4\t0\t0\t1\t2949\r\n |
My code below works; it does the following:
1. Split the value column into 201 separate columns
2. Create the date in numeric form
3. Create a 'key' column
4. Drop unnecessary columns
df_spec[[f'{x}' for x in range(total_cols)]] = df_spec['value'].str.split(pat="\t", expand=True).replace(r'\s+|\\n', ' ', regex=True).fillna('0').apply(pd.to_numeric)
df_spec['date'] = pd.to_datetime(df_spec['date_sq']).dt.strftime("%Y%m%d")
df_spec['key'] = (df_spec['date'].astype(str) + df_spec['200'].astype(str)).apply(pd.to_numeric)
df_spec.drop(['value','date_sq','date'], axis=1, inplace=True)
Requirement:
My above code works, but it throws some warning messages.
Is there an optimized way without warnings?
Warning:
<ipython-input-3-70211686c759>:2: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
df_spec[[f'{x}' for x in range(total_cols)]] = df_spec['value'].str.split \
<ipython-input-3-70211686c759>:2: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
df_spec[[f'{x}' for x in range(total_cols)]] = df_spec['value'].str.split \
... goes on for some lines...
Final dataframe:
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 | 32 | 33 | 34 | 35 | 36 | 37 | 38 | 39 | 40 | 41 | 42 | 43 | 44 | 45 | 46 | 47 | 48 | 49 | 50 | 51 | 52 | 53 | 54 | 55 | 56 | 57 | 58 | 59 | 60 | 61 | 62 | 63 | 64 | 65 | 66 | 67 | 68 | 69 | 70 | 71 | 72 | 73 | 74 | 75 | 76 | 77 | 78 | 79 | 80 | 81 | 82 | 83 | 84 | 85 | 86 | 87 | 88 | 89 | 90 | 91 | 92 | 93 | 94 | 95 | 96 | 97 | 98 | 99 | 100 | 101 | 102 | 103 | 104 | 105 | 106 | 107 | 108 | 109 | 110 | 111 | 112 | 113 | 114 | 115 | 116 | 117 | 118 | 119 | 120 | 121 | 122 | 123 | 124 | 125 | 126 | 127 | 128 | 129 | 130 | 131 | 132 | 133 | 134 | 135 | 136 | 137 | 138 | 139 | 140 | 141 | 142 | 143 | 144 | 145 | 146 | 147 | 148 | 149 | 150 | 151 | 152 | 153 | 154 | 155 | 156 | 157 | 158 | 159 | 160 | 161 | 162 | 163 | 164 | 165 | 166 | 167 | 168 | 169 | 170 | 171 | 172 | 173 | 174 | 175 | 176 | 177 | 178 | 179 | 180 | 181 | 182 | 183 | 184 | 185 | 186 | 187 | 188 | 189 | 190 | 191 | 192 | 193 | 194 | 195 | 196 | 197 | 198 | 199 | 200 | key |
|-------|-------|-------|------|-------|-------|-------|---|---|---|----|----|-------|-------|-------|-------|----|----|----|----|-------|-------|-------|------|-------|-------|-------|----|----|----|----|----|-------|-------|-------|-------|----|----|----|----|-------|-------|-------|------|-------|-------|-------|----|----|----|----|----|-------|-------|-------|-------|----|----|----|----|-------|-------|-------|------|-------|-------|-------|----|----|----|----|----|-------|-------|-------|-------|----|----|----|----|-------|-------|-------|------|-------|-------|-------|----|----|----|----|----|-------|-------|-------|-------|----|----|----|----|-------|-------|-------|------|-------|-------|-------|-----|-----|-----|-----|-----|-------|-------|-------|-------|-----|-----|-----|-----|-------|-------|-------|------|-------|-------|-------|-----|-----|-----|-----|-----|-------|-------|-------|-------|-----|-----|-----|-----|-------|-------|-------|------|-------|-------|-------|-----|-----|-----|-----|-----|-------|-------|-------|-------|-----|-----|-----|-----|-------|-------|-------|------|-------|-------|-------|-----|-----|-----|-----|-----|-------|-------|-------|-------|-----|-----|-----|-----|-------|-------|-------|------|-------|-------|-------|-----|-----|-----|-----|-----|-------|-------|-------|-------|-----|-----|-----|-----|------|-------------|
| 13360 | 12597 | 13896 | 8262 | 12851 | 12345 | 12849 | 0 | 0 | 0 | 0 | 0 | 21320 | 21301 | 22597 | 13624 | 3 | 0 | 0 | 0 | 13360 | 12597 | 13896 | 8262 | 12851 | 12345 | 12849 | 0 | 0 | 0 | 0 | 0 | 21320 | 21301 | 22597 | 13624 | 2 | 0 | 0 | 0 | 13360 | 12597 | 13896 | 8262 | 12851 | 12345 | 12849 | 0 | 0 | 0 | 0 | 0 | 21320 | 21301 | 22597 | 13624 | 1 | 0 | 0 | 0 | 13360 | 12597 | 13896 | 8262 | 12851 | 12345 | 12849 | 0 | 0 | 0 | 0 | 0 | 21320 | 21301 | 22597 | 13624 | 0 | 0 | 0 | 0 | 13360 | 12341 | 13379 | 8257 | 14641 | 13106 | 12854 | 0 | 0 | 0 | 0 | 0 | 13123 | 13139 | 17473 | 13105 | 4 | 0 | 0 | 0 | 13360 | 12341 | 13379 | 8257 | 14641 | 13106 | 12854 | 0 | 0 | 0 | 0 | 0 | 13123 | 13139 | 17473 | 13105 | 3 | 0 | 0 | 1 | 13360 | 12341 | 13379 | 8257 | 14641 | 13106 | 12854 | 0 | 0 | 0 | 0 | 0 | 13123 | 13139 | 17473 | 13105 | 2 | 0 | 0 | 1 | 13360 | 12341 | 13379 | 8257 | 14641 | 13106 | 12854 | 0 | 0 | 0 | 0 | 0 | 13123 | 13139 | 17473 | 13105 | 1 | 0 | 0 | 1 | 13360 | 12341 | 13379 | 8257 | 14641 | 13106 | 12854 | 0 | 0 | 0 | 0 | 0 | 13123 | 13139 | 17473 | 13105 | 0 | 0 | 0 | 1 | 13360 | 14644 | 22099 | 8257 | 13105 | 13879 | 12853 | 0 | 0 | 0 | 0 | 0 | 16706 | 21558 | 17474 | 13873 | 4 | 0 | 0 | 1 | 2949 | 202205052949 |
I tried the following and got no warnings; I hope it helps in your case:
df = pd.DataFrame([['2022-05-04', '13360\t12597\t13896\t8262\t12851\t12345\t12849\t0\t0\t0\t0\t0\t21320\t21301\t22597\t13624\t3\t0\t0\t0\t13360\t12597\t13896\t8262\t12851\t12345\t12849\t0\t0\t0\t0\t0\t21320\t21301\t22597\t13624\t2\t0\t0\t0\t13360\t12597\t13896\t8262\t12851\t12345\t12849\t0\t0\t0\t0\t0\t21320\t21301\t22597\t13624\t1\t0\t0\t0\t13360\t12597\t13896\t8262\t12851\t12345\t12849\t0\t0\t0\t0\t0\t21320\t21301\t22597\t13624\t0\t0\t0\t0\t13360\t12341\t13379\t8257\t14641\t13106\t12854\t0\t0\t0\t0\t0\t13123\t13139\t17473\t13105\t4\t0\t0\t0\t13360\t12341\t13379\t8257\t14641\t13106\t12854\t0\t0\t0\t0\t0\t13123\t13139\t17473\t13105\t3\t0\t0\t1\t13360\t12341\t13379\t8257\t14641\t13106\t12854\t0\t0\t0\t0\t0\t13123\t13139\t17473\t13105\t2\t0\t0\t1\t13360\t12341\t13379\t8257\t14641\t13106\t12854\t0\t0\t0\t0\t0\t13123\t13139\t17473\t13105\t1\t0\t0\t1\t13360\t12341\t13379\t8257\t14641\t13106\t12854\t0\t0\t0\t0\t0\t13123\t13139\t17473\t13105\t0\t0\t0\t1\t13360\t14644\t22099\t8257\t13105\t13879\t12853\t0\t0\t0\t0\t0\t16706\t21558\t17474\t13873\t4\t0\t0\t1\t2949\r\n']], columns=['date_sq', 'value'])
date = df['date_sq']
df = df['value'].str.split('\t', expand=True).fillna('0').apply(pd.to_numeric)# .explode().T.reset_index(drop=True)
df['date'] = pd.to_datetime(date).dt.strftime("%Y%m%d")
df['key'] = (df['date'].astype(str) + df[200].astype(str)).apply(pd.to_numeric)
df.drop(['date'], axis=1, inplace=True)
df
I think df_spec[[f'{x}' for x in range(total_cols)]] is unnecessary when using split(..., expand=True). Good luck!
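For completeness, the warning itself points at another option: build all of the split columns first and attach them in a single pd.concat(axis=1) call instead of assigning many columns one by one. A rough sketch, assuming df_spec and total_cols as in the question:
# Split once, then join everything back in one concat to avoid fragmentation
split_cols = (df_spec['value'].str.split(pat='\t', expand=True)
              .fillna('0')
              .apply(pd.to_numeric))
split_cols.columns = [f'{x}' for x in range(split_cols.shape[1])]
df_spec = pd.concat([df_spec.drop(columns=['value']), split_cols], axis=1)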

Filter rows based on condition in Pandas

I have a dataframe df_groups that contains sample number, group number, and accuracy.
Table 1: Samples with their groups
+----+----------+------------+------------+
| | sample | group | Accuracy |
|----+----------+------------+------------|
| 0 | 0 | 6 | 91.6 |
| 1 | 1 | 4 | 92.9333 |
| 2 | 2 | 2 | 91 |
| 3 | 3 | 2 | 90.0667 |
| 4 | 4 | 4 | 91.8 |
| 5 | 5 | 5 | 92.5667 |
| 6 | 6 | 6 | 91.1 |
| 7 | 7 | 5 | 92.3333 |
| 8 | 8 | 2 | 92.7667 |
| 9 | 9 | 0 | 91.1333 |
| 10 | 10 | 4 | 92.5 |
| 11 | 11 | 5 | 92.4 |
| 12 | 12 | 7 | 93.1333 |
| 13 | 13 | 7 | 93.5333 |
| 14 | 14 | 2 | 92.1 |
| 15 | 15 | 6 | 93.2 |
| 16 | 16 | 8 | 92.7333 |
| 17 | 17 | 8 | 90.8 |
| 18 | 18 | 3 | 91.9 |
| 19 | 19 | 3 | 93.3 |
| 20 | 20 | 5 | 90.6333 |
| 21 | 21 | 9 | 92.9333 |
| 22 | 22 | 4 | 93.3333 |
| 23 | 23 | 9 | 91.5333 |
| 24 | 24 | 9 | 92.9333 |
| 25 | 25 | 1 | 92.3 |
| 26 | 26 | 9 | 92.2333 |
| 27 | 27 | 6 | 91.9333 |
| 28 | 28 | 5 | 92.1 |
| 29 | 29 | 8 | 84.8 |
+----+----------+------------+------------+
I want to return a dataframe containing only the rows whose accuracy is above a given value (e.g. 92),
so the result will be like this:
Table 2: Samples with their groups when accuracy is above 92.
+----+----------+------------+------------+
| | sample | group | Accuracy |
|----+----------+------------+------------|
| 1 | 1 | 4 | 92.9333 |
| 2 | 5 | 5 | 92.5667 |
| 3 | 7 | 5 | 92.3333 |
| 4 | 8 | 2 | 92.7667 |
| 5 | 10 | 4 | 92.5 |
| 6 | 11 | 5 | 92.4 |
| 7 | 12 | 7 | 93.1333 |
| 8 | 13 | 7 | 93.5333 |
| 9 | 14 | 2 | 92.1 |
| 10 | 15 | 6 | 93.2 |
| 11 | 16 | 8 | 92.7333 |
| 12 | 19 | 3 | 93.3 |
| 13 | 21 | 9 | 92.9333 |
| 14 | 22 | 4 | 93.3333 |
| 15 | 24 | 9 | 92.9333 |
| 16 | 25 | 1 | 92.3 |
| 17 | 26 | 9 | 92.2333 |
| 18 | 28 | 5 | 92.1 |
+----+----------+------------+------------+
So the result should be filtered on the condition of being greater than or equal to a predefined accuracy (e.g. 92, 90, 85, etc.).
You can use df.loc[df['Accuracy'] >= predefined_accuracy].
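A minimal, self-contained sketch of the same idea (column names taken from the question, data abbreviated to the first few rows):
import pandas as pd

df_groups = pd.DataFrame({'sample': [0, 1, 2, 3, 5],
                          'group': [6, 4, 2, 2, 5],
                          'Accuracy': [91.6, 92.9333, 91.0, 90.0667, 92.5667]})

predefined_accuracy = 92
result = df_groups.loc[df_groups['Accuracy'] >= predefined_accuracy]
print(result)  # keeps only the rows with Accuracy >= 92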

Pandas: sum multiple columns based on similar consecutive numbers in another column

Given the following table
+----+--------+--------+--------------+
| Nr | Price | Volume | Transactions |
+----+--------+--------+--------------+
| 1 | 194.6 | 100 | 1 |
| 2 | 195 | 10 | 1 |
| 3 | 194.92 | 100 | 1 |
| 4 | 194.92 | 52 | 1 |
| 5 | 194.9 | 99 | 1 |
| 6 | 194.86 | 74 | 1 |
| 7 | 194.85 | 900 | 1 |
| 8 | 194.85 | 25 | 1 |
| 9 | 194.85 | 224 | 1 |
| 10 | 194.6 | 101 | 1 |
| 11 | 194.85 | 19 | 1 |
| 12 | 194.6 | 10 | 1 |
| 13 | 194.6 | 25 | 1 |
| 14 | 194.53 | 12 | 1 |
| 15 | 194.85 | 14 | 1 |
| 16 | 194.6 | 11 | 1 |
| 17 | 194.85 | 93 | 1 |
| 18 | 195 | 90 | 1 |
| 19 | 195 | 100 | 1 |
| 20 | 195 | 50 | 1 |
| 21 | 195 | 50 | 1 |
| 22 | 195 | 25 | 1 |
| 23 | 195 | 5 | 1 |
| 24 | 195 | 500 | 1 |
| 25 | 195 | 100 | 1 |
| 26 | 195.09 | 100 | 1 |
| 27 | 195 | 120 | 1 |
| 28 | 195 | 60 | 1 |
| 29 | 195 | 40 | 1 |
| 30 | 195 | 10 | 1 |
| 31 | 194.6 | 1 | 1 |
| 32 | 194.99 | 1 | 1 |
| 33 | 194.81 | 20 | 1 |
| 34 | 194.81 | 50 | 1 |
| 35 | 194.97 | 17 | 1 |
| 36 | 194.99 | 25 | 1 |
| 37 | 195 | 75 | 1 |
+----+--------+--------+--------------+
For faster testing, you can also find the same table as a pandas DataFrame here:
pd_data_before = pd.DataFrame([[1,194.6,100,1],[2,195,10,1],[3,194.92,100,1],[4,194.92,52,1],[5,194.9,99,1],[6,194.86,74,1],[7,194.85,900,1],[8,194.85,25,1],[9,194.85,224,1],[10,194.6,101,1],[11,194.85,19,1],[12,194.6,10,1],[13,194.6,25,1],[14,194.53,12,1],[15,194.85,14,1],[16,194.6,11,1],[17,194.85,93,1],[18,195,90,1],[19,195,100,1],[20,195,50,1],[21,195,50,1],[22,195,25,1],[23,195,5,1],[24,195,500,1],[25,195,100,1],[26,195.09,100,1],[27,195,120,1],[28,195,60,1],[29,195,40,1],[30,195,10,1],[31,194.6,1,1],[32,194.99,1,1],[33,194.81,20,1],[34,194.81,50,1],[35,194.97,17,1],[36,194.99,25,1],[37,195,75,1]],columns=['Nr','Price','Volume','Transactions'])
The question is: how do we sum up the volume and transactions over runs of identical consecutive prices? The end result would be something like this:
+----+--------+--------+--------------+
| Nr | Price | Volume | Transactions |
+----+--------+--------+--------------+
| 1 | 194.6 | 100 | 1 |
| 2 | 195 | 10 | 1 |
| 4 | 194.92 | 152 | 2 |
| 5 | 194.9 | 99 | 1 |
| 6 | 194.86 | 74 | 1 |
| 9 | 194.85 | 1149 | 3 |
| 10 | 194.6 | 101 | 1 |
| 11 | 194.85 | 19 | 1 |
| 13 | 194.6 | 35 | 2 |
| 14 | 194.53 | 12 | 1 |
| 15 | 194.85 | 14 | 1 |
| 16 | 194.6 | 11 | 1 |
| 17 | 194.85 | 93 | 1 |
| 25 | 195 | 920 | 8 |
| 26 | 195.09 | 100 | 1 |
| 30 | 195 | 230 | 4 |
| 31 | 194.6 | 1 | 1 |
| 32 | 194.99 | 1 | 1 |
| 34 | 194.81 | 70 | 2 |
| 35 | 194.97 | 17 | 1 |
| 36 | 194.99 | 25 | 1 |
| 37 | 195 | 75 | 1 |
+----+--------+--------+--------------+
You can also find the result, ready-made, in a pandas DataFrame below:
pd_data_after = pd.DataFrame([[1,194.6,100,1],[2,195,10,1],[4,194.92,152,2],[5,194.9,99,1],[6,194.86,74,1],[9,194.85,1149,3],[10,194.6,101,1],[11,194.85,19,1],[13,194.6,35,2],[14,194.53,12,1],[15,194.85,14,1],[16,194.6,11,1],[17,194.85,93,1],[25,195,920,8],[26,195.09,100,1],[30,195,230,4],[31,194.6,1,1],[32,194.99,1,1],[34,194.81,70,2],[35,194.97,17,1],[36,194.99,25,1],[37,195,75,1]],columns=['Nr','Price','Volume','Transactions'])
I managed to achieve this with a for loop, but the problem is that it is very slow because it iterates over each row, and my data set is huge: around 50 million rows.
Is there any way to achieve this without looping?
A common trick for grouping by consecutive values is the following:
df.col.ne(df.col.shift()).cumsum()
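To see what that expression produces, here is a tiny illustration on made-up prices (not the question's full data):
s = pd.Series([194.6, 195.0, 194.92, 194.92, 194.9])
s.ne(s.shift()).cumsum()
# 0    1
# 1    2
# 2    3
# 3    3
# 4    4
# dtype: int64 -- rows with the same consecutive price share a group label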
We can use that here, then use agg to keep the last Nr and first Price of each run and to sum Volume and Transactions.
(df.groupby(df.Price.ne(df.Price.shift()).cumsum())
   .agg({'Nr': 'last', 'Price': 'first', 'Volume': 'sum', 'Transactions': 'sum'})
).reset_index(drop=True)
Nr Price Volume Transactions
0 1 194.60 100 1
1 2 195.00 10 1
2 4 194.92 152 2
3 5 194.90 99 1
4 6 194.86 74 1
5 9 194.85 1149 3
6 10 194.60 101 1
7 11 194.85 19 1
8 13 194.60 35 2
9 14 194.53 12 1
10 15 194.85 14 1
11 16 194.60 11 1
12 17 194.85 93 1
13 25 195.00 920 8
14 26 195.09 100 1
15 30 195.00 230 4
16 31 194.60 1 1
17 32 194.99 1 1
18 34 194.81 70 2
19 35 194.97 17 1
20 36 194.99 25 1
21 37 195.00 75 1

pandas column shift with day 0 value as 0

I've got a pandas dataframe (pivoted) with the columns customer_name, current_date, and current_day_count:
+----------+--------------+-------------------+
| customer | current_date | current_day_count |
+----------+--------------+-------------------+
| Mark | 2018_02_06 | 15 |
| | 2018_02_09 | 42 |
| | 2018_02_12 | 33 |
| | 2018_02_21 | 82 |
| | 2018_02_27 | 72 |
| Bob | 2018_02_02 | 76 |
| | 2018_02_23 | 11 |
| | 2018_03_04 | 59 |
| | 2018_03_13 | 68 |
| Shawn | 2018_02_11 | 71 |
| | 2018_02_15 | 39 |
| | 2018_02_18 | 65 |
| | 2018_02_24 | 38 |
+----------+--------------+-------------------+
Now, I want an additional previous_day_count column for each customer, where the value on a customer's first day should be 0. Something like this (customer, current_date, current_day_count, previous_day_count, with the first-day value as 0):
+----------+--------------+-------------------+--------------------+
| customer | current_date | current_day_count | previous_day_count |
+----------+--------------+-------------------+--------------------+
| Mark | 2018_02_06 | 15 | 0 |
| | 2018_02_09 | 42 | 33 |
| | 2018_02_12 | 33 | 82 |
| | 2018_02_21 | 82 | 72 |
| | 2018_02_27 | 72 | 0 |
| Bob | 2018_02_02 | 76 | 0 |
| | 2018_02_23 | 11 | 59 |
| | 2018_03_04 | 59 | 68 |
| | 2018_03_13 | 68 | 0 |
| Shawn | 2018_02_11 | 71 | 0 |
| | 2018_02_15 | 39 | 65 |
| | 2018_02_18 | 65 | 38 |
| | 2018_02_24 | 38 | 0 |
+----------+--------------+-------------------+--------------------+
Try this:
import pandas as pd
import numpy as np
df = pd.DataFrame({'name': ['Mark','Mark','Mark','Mark','Bob','Bob','Bob','Bob'], 'current_day_count': [18,28,29,10,19,92,7,43]})
# Shift the counts within each name group (shift(-1), to match the expected output above)
df['previous_day_count'] = df.groupby('name')['current_day_count'].shift(-1)
# Force the first row of each group to NaN, then fill every NaN with 0
df.loc[df.groupby('name', as_index=False).head(1).index, 'previous_day_count'] = np.nan
df['previous_day_count'] = df['previous_day_count'].fillna(0)
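Applied to the question's own column names, the same pattern would look roughly like this (a sketch, assuming customer is filled in on every row and the frame is already sorted by customer and current_date):
df['previous_day_count'] = df.groupby('customer')['current_day_count'].shift(-1)
df.loc[df.groupby('customer').head(1).index, 'previous_day_count'] = np.nan
df['previous_day_count'] = df['previous_day_count'].fillna(0)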

Parsing out indices and values from a pandas multi-index dataframe

I have a dataframe in a similar format to this:
+--------+--------+----------+------+------+------+------+
| | | | | day1 | day2 | day3 |
+--------+--------+----------+------+------+------+------+
| id_one | id_two | id_three | date | | | |
| 18273 | 50 | 1 | 3 | 9 | 11 | 3 |
| | | | 4 | 26 | 27 | 68 |
| | | | 5 | 92 | 25 | 4 |
| | | | 6 | 60 | 72 | 83 |
| | 60 | 2 | 5 | 69 | 93 | 84 |
| | | | 6 | 69 | 30 | 12 |
| | | | 7 | 65 | 65 | 59 |
| | | | 8 | 57 | 88 | 59 |
| | 70 | 3 | 5 | 22 | 95 | 7 |
| | | | 6 | 40 | 24 | 20 |
| | | | 7 | 73 | 81 | 57 |
| | | | 8 | 43 | 8 | 66 |
+--------+--------+----------+------+------+------+------+
I am trying to create a tuple that contains id_one, id_two, and the values that each grouping contains.
To test this, I am simply trying to print the ids and values like this:
for id_two, data in df.head(100).groupby(level='id_two'):
    print id_two, data.values.ravel()
Which gives me the id_two and the data exactly as it should.
I am running into problems when I try to incorporate id_one. I tried this, but was met with the error ValueError: need more than 2 values to unpack:
for id_one, id_two, data in df.head(100).groupby(level='id_two'):
    print id_one, id_two, data.values.ravel()
How can I print id_one, id_two and the data?
You can pass a list of levels to the level parameter:
df.head(100).groupby(level=['id_one', 'id_two'])
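When grouping on two levels, each group key comes back as a tuple, so the loop can unpack it as a pair. A minimal sketch (using a Python 3 print call, unlike the Python 2 code in the question):
for (id_one, id_two), data in df.head(100).groupby(level=['id_one', 'id_two']):
    print(id_one, id_two, data.values.ravel())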
