My data frame has the columns 'date_sq' and 'value'; the 'value' column is a tab-separated string containing 201 values.
| date_sq | value |
| 2022-05-04 |13360\t12597\t13896\t8262\t12851\t12345\t12849\t0\t0\t0\t0\t0\t21320\t21301\t22597\t13624\t3\t0\t0\t0\t13360\t12597\t13896\t8262\t12851\t12345\t12849\t0\t0\t0\t0\t0\t21320\t21301\t22597\t13624\t2\t0\t0\t0\t13360\t12597\t13896\t8262\t12851\t12345\t12849\t0\t0\t0\t0\t0\t21320\t21301\t22597\t13624\t1\t0\t0\t0\t13360\t12597\t13896\t8262\t12851\t12345\t12849\t0\t0\t0\t0\t0\t21320\t21301\t22597\t13624\t0\t0\t0\t0\t13360\t12341\t13379\t8257\t14641\t13106\t12854\t0\t0\t0\t0\t0\t13123\t13139\t17473\t13105\t4\t0\t0\t0\t13360\t12341\t13379\t8257\t14641\t13106\t12854\t0\t0\t0\t0\t0\t13123\t13139\t17473\t13105\t3\t0\t0\t1\t13360\t12341\t13379\t8257\t14641\t13106\t12854\t0\t0\t0\t0\t0\t13123\t13139\t17473\t13105\t2\t0\t0\t1\t13360\t12341\t13379\t8257\t14641\t13106\t12854\t0\t0\t0\t0\t0\t13123\t13139\t17473\t13105\t1\t0\t0\t1\t13360\t12341\t13379\t8257\t14641\t13106\t12854\t0\t0\t0\t0\t0\t13123\t13139\t17473\t13105\t0\t0\t0\t1\t13360\t14644\t22099\t8257\t13105\t13879\t12853\t0\t0\t0\t0\t0\t16706\t21558\t17474\t13873\t4\t0\t0\t1\t2949\r\n |
My code below works. It does the following:
Split the 'value' column into 201 separate columns
Create the date in numeric form
Create a 'key' column
Drop the unnecessary columns
df_spec[[f'{x}' for x in range(total_cols)]] = df_spec['value'].str.split(pat="\t", expand=True).replace(r'\s+|\\n', ' ', regex=True).fillna('0').apply(pd.to_numeric)
df_spec['date'] = pd.to_datetime(df_spec['date_sq']).dt.strftime("%Y%m%d")
df_spec['key'] = (df_spec['date'].astype(str) + df_spec['200'].astype(str)).apply(pd.to_numeric)
df_spec.drop(['value','date_sq','date'], axis=1, inplace=True)
Requirement:
The code above works, but it throws some warning messages.
Is there an optimized way to do this without the warnings?
Warning:
<ipython-input-3-70211686c759>:2: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
df_spec[[f'{x}' for x in range(total_cols)]] = df_spec['value'].str.split \
<ipython-input-3-70211686c759>:2: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
df_spec[[f'{x}' for x in range(total_cols)]] = df_spec['value'].str.split \
... goes on for some lines...
Final dataframe:
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 | 32 | 33 | 34 | 35 | 36 | 37 | 38 | 39 | 40 | 41 | 42 | 43 | 44 | 45 | 46 | 47 | 48 | 49 | 50 | 51 | 52 | 53 | 54 | 55 | 56 | 57 | 58 | 59 | 60 | 61 | 62 | 63 | 64 | 65 | 66 | 67 | 68 | 69 | 70 | 71 | 72 | 73 | 74 | 75 | 76 | 77 | 78 | 79 | 80 | 81 | 82 | 83 | 84 | 85 | 86 | 87 | 88 | 89 | 90 | 91 | 92 | 93 | 94 | 95 | 96 | 97 | 98 | 99 | 100 | 101 | 102 | 103 | 104 | 105 | 106 | 107 | 108 | 109 | 110 | 111 | 112 | 113 | 114 | 115 | 116 | 117 | 118 | 119 | 120 | 121 | 122 | 123 | 124 | 125 | 126 | 127 | 128 | 129 | 130 | 131 | 132 | 133 | 134 | 135 | 136 | 137 | 138 | 139 | 140 | 141 | 142 | 143 | 144 | 145 | 146 | 147 | 148 | 149 | 150 | 151 | 152 | 153 | 154 | 155 | 156 | 157 | 158 | 159 | 160 | 161 | 162 | 163 | 164 | 165 | 166 | 167 | 168 | 169 | 170 | 171 | 172 | 173 | 174 | 175 | 176 | 177 | 178 | 179 | 180 | 181 | 182 | 183 | 184 | 185 | 186 | 187 | 188 | 189 | 190 | 191 | 192 | 193 | 194 | 195 | 196 | 197 | 198 | 199 | 200 | key |
|-------|-------|-------|------|-------|-------|-------|---|---|---|----|----|-------|-------|-------|-------|----|----|----|----|-------|-------|-------|------|-------|-------|-------|----|----|----|----|----|-------|-------|-------|-------|----|----|----|----|-------|-------|-------|------|-------|-------|-------|----|----|----|----|----|-------|-------|-------|-------|----|----|----|----|-------|-------|-------|------|-------|-------|-------|----|----|----|----|----|-------|-------|-------|-------|----|----|----|----|-------|-------|-------|------|-------|-------|-------|----|----|----|----|----|-------|-------|-------|-------|----|----|----|----|-------|-------|-------|------|-------|-------|-------|-----|-----|-----|-----|-----|-------|-------|-------|-------|-----|-----|-----|-----|-------|-------|-------|------|-------|-------|-------|-----|-----|-----|-----|-----|-------|-------|-------|-------|-----|-----|-----|-----|-------|-------|-------|------|-------|-------|-------|-----|-----|-----|-----|-----|-------|-------|-------|-------|-----|-----|-----|-----|-------|-------|-------|------|-------|-------|-------|-----|-----|-----|-----|-----|-------|-------|-------|-------|-----|-----|-----|-----|-------|-------|-------|------|-------|-------|-------|-----|-----|-----|-----|-----|-------|-------|-------|-------|-----|-----|-----|-----|------|-------------|
| 13360 | 12597 | 13896 | 8262 | 12851 | 12345 | 12849 | 0 | 0 | 0 | 0 | 0 | 21320 | 21301 | 22597 | 13624 | 3 | 0 | 0 | 0 | 13360 | 12597 | 13896 | 8262 | 12851 | 12345 | 12849 | 0 | 0 | 0 | 0 | 0 | 21320 | 21301 | 22597 | 13624 | 2 | 0 | 0 | 0 | 13360 | 12597 | 13896 | 8262 | 12851 | 12345 | 12849 | 0 | 0 | 0 | 0 | 0 | 21320 | 21301 | 22597 | 13624 | 1 | 0 | 0 | 0 | 13360 | 12597 | 13896 | 8262 | 12851 | 12345 | 12849 | 0 | 0 | 0 | 0 | 0 | 21320 | 21301 | 22597 | 13624 | 0 | 0 | 0 | 0 | 13360 | 12341 | 13379 | 8257 | 14641 | 13106 | 12854 | 0 | 0 | 0 | 0 | 0 | 13123 | 13139 | 17473 | 13105 | 4 | 0 | 0 | 0 | 13360 | 12341 | 13379 | 8257 | 14641 | 13106 | 12854 | 0 | 0 | 0 | 0 | 0 | 13123 | 13139 | 17473 | 13105 | 3 | 0 | 0 | 1 | 13360 | 12341 | 13379 | 8257 | 14641 | 13106 | 12854 | 0 | 0 | 0 | 0 | 0 | 13123 | 13139 | 17473 | 13105 | 2 | 0 | 0 | 1 | 13360 | 12341 | 13379 | 8257 | 14641 | 13106 | 12854 | 0 | 0 | 0 | 0 | 0 | 13123 | 13139 | 17473 | 13105 | 1 | 0 | 0 | 1 | 13360 | 12341 | 13379 | 8257 | 14641 | 13106 | 12854 | 0 | 0 | 0 | 0 | 0 | 13123 | 13139 | 17473 | 13105 | 0 | 0 | 0 | 1 | 13360 | 14644 | 22099 | 8257 | 13105 | 13879 | 12853 | 0 | 0 | 0 | 0 | 0 | 16706 | 21558 | 17474 | 13873 | 4 | 0 | 0 | 1 | 2949 | 202205052949 |
I tried the following and got no warnings; I hope it helps in your case:
df = pd.DataFrame([['2022-05-04', '13360\t12597\t13896\t8262\t12851\t12345\t12849\t0\t0\t0\t0\t0\t21320\t21301\t22597\t13624\t3\t0\t0\t0\t13360\t12597\t13896\t8262\t12851\t12345\t12849\t0\t0\t0\t0\t0\t21320\t21301\t22597\t13624\t2\t0\t0\t0\t13360\t12597\t13896\t8262\t12851\t12345\t12849\t0\t0\t0\t0\t0\t21320\t21301\t22597\t13624\t1\t0\t0\t0\t13360\t12597\t13896\t8262\t12851\t12345\t12849\t0\t0\t0\t0\t0\t21320\t21301\t22597\t13624\t0\t0\t0\t0\t13360\t12341\t13379\t8257\t14641\t13106\t12854\t0\t0\t0\t0\t0\t13123\t13139\t17473\t13105\t4\t0\t0\t0\t13360\t12341\t13379\t8257\t14641\t13106\t12854\t0\t0\t0\t0\t0\t13123\t13139\t17473\t13105\t3\t0\t0\t1\t13360\t12341\t13379\t8257\t14641\t13106\t12854\t0\t0\t0\t0\t0\t13123\t13139\t17473\t13105\t2\t0\t0\t1\t13360\t12341\t13379\t8257\t14641\t13106\t12854\t0\t0\t0\t0\t0\t13123\t13139\t17473\t13105\t1\t0\t0\t1\t13360\t12341\t13379\t8257\t14641\t13106\t12854\t0\t0\t0\t0\t0\t13123\t13139\t17473\t13105\t0\t0\t0\t1\t13360\t14644\t22099\t8257\t13105\t13879\t12853\t0\t0\t0\t0\t0\t16706\t21558\t17474\t13873\t4\t0\t0\t1\t2949\r\n']], columns=['date_sq', 'value'])
# keep the date aside, then build a brand-new frame from the split result,
# so no columns are inserted into an existing frame one at a time
date = df['date_sq']
df = df['value'].str.split('\t', expand=True).fillna('0').apply(pd.to_numeric)
df['date'] = pd.to_datetime(date).dt.strftime("%Y%m%d")
df['key'] = (df['date'].astype(str) + df[200].astype(str)).apply(pd.to_numeric)
df.drop(['date'], axis=1, inplace=True)
df
I think the df_spec[[f'{x}' for x in range(total_cols)]] assignment is unnecessary when using split(..., expand=True): the expanded result is already its own DataFrame, so building on it avoids inserting 201 columns one at a time, which is what triggers the PerformanceWarning. Good luck
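If you would rather keep the original df_spec frame, you can also follow the suggestion in the warning itself and join all the new columns in a single pd.concat call. A minimal, untested sketch, assuming df_spec is defined as in the question:

# build all split columns in a separate frame first, then join them in one concat
split_cols = (df_spec['value'].str.split(pat='\t', expand=True)
              .fillna('0').apply(pd.to_numeric))
split_cols.columns = [f'{x}' for x in range(split_cols.shape[1])]
df_spec = pd.concat([df_spec.drop(['value'], axis=1), split_cols], axis=1)
df_spec['key'] = pd.to_numeric(
    pd.to_datetime(df_spec['date_sq']).dt.strftime('%Y%m%d') + df_spec['200'].astype(str))
df_spec.drop(['date_sq'], axis=1, inplace=True)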
Given the following table
+----+--------+--------+--------------+
| Nr | Price | Volume | Transactions |
+----+--------+--------+--------------+
| 1 | 194.6 | 100 | 1 |
| 2 | 195 | 10 | 1 |
| 3 | 194.92 | 100 | 1 |
| 4 | 194.92 | 52 | 1 |
| 5 | 194.9 | 99 | 1 |
| 6 | 194.86 | 74 | 1 |
| 7 | 194.85 | 900 | 1 |
| 8 | 194.85 | 25 | 1 |
| 9 | 194.85 | 224 | 1 |
| 10 | 194.6 | 101 | 1 |
| 11 | 194.85 | 19 | 1 |
| 12 | 194.6 | 10 | 1 |
| 13 | 194.6 | 25 | 1 |
| 14 | 194.53 | 12 | 1 |
| 15 | 194.85 | 14 | 1 |
| 16 | 194.6 | 11 | 1 |
| 17 | 194.85 | 93 | 1 |
| 18 | 195 | 90 | 1 |
| 19 | 195 | 100 | 1 |
| 20 | 195 | 50 | 1 |
| 21 | 195 | 50 | 1 |
| 22 | 195 | 25 | 1 |
| 23 | 195 | 5 | 1 |
| 24 | 195 | 500 | 1 |
| 25 | 195 | 100 | 1 |
| 26 | 195.09 | 100 | 1 |
| 27 | 195 | 120 | 1 |
| 28 | 195 | 60 | 1 |
| 29 | 195 | 40 | 1 |
| 30 | 195 | 10 | 1 |
| 31 | 194.6 | 1 | 1 |
| 32 | 194.99 | 1 | 1 |
| 33 | 194.81 | 20 | 1 |
| 34 | 194.81 | 50 | 1 |
| 35 | 194.97 | 17 | 1 |
| 36 | 194.99 | 25 | 1 |
| 37 | 195 | 75 | 1 |
+----+--------+--------+--------------+
For faster testing, you can also find the same table as a pandas DataFrame here:
pd_data_before = pd.DataFrame([[1,194.6,100,1],[2,195,10,1],[3,194.92,100,1],[4,194.92,52,1],[5,194.9,99,1],[6,194.86,74,1],[7,194.85,900,1],[8,194.85,25,1],[9,194.85,224,1],[10,194.6,101,1],[11,194.85,19,1],[12,194.6,10,1],[13,194.6,25,1],[14,194.53,12,1],[15,194.85,14,1],[16,194.6,11,1],[17,194.85,93,1],[18,195,90,1],[19,195,100,1],[20,195,50,1],[21,195,50,1],[22,195,25,1],[23,195,5,1],[24,195,500,1],[25,195,100,1],[26,195.09,100,1],[27,195,120,1],[28,195,60,1],[29,195,40,1],[30,195,10,1],[31,194.6,1,1],[32,194.99,1,1],[33,194.81,20,1],[34,194.81,50,1],[35,194.97,17,1],[36,194.99,25,1],[37,195,75,1]],columns=['Nr','Price','Volume','Transactions'])
The question is: how do we sum up the volume and transactions over runs of equal consecutive prices? The end result would be something like this:
+----+--------+--------+--------------+
| Nr | Price | Volume | Transactions |
+----+--------+--------+--------------+
| 1 | 194.6 | 100 | 1 |
| 2 | 195 | 10 | 1 |
| 4 | 194.92 | 152 | 2 |
| 5 | 194.9 | 99 | 1 |
| 6 | 194.86 | 74 | 1 |
| 9 | 194.85 | 1149 | 3 |
| 10 | 194.6 | 101 | 1 |
| 11 | 194.85 | 19 | 1 |
| 13 | 194.6 | 35 | 2 |
| 14 | 194.53 | 12 | 1 |
| 15 | 194.85 | 14 | 1 |
| 16 | 194.6 | 11 | 1 |
| 17 | 194.85 | 93 | 1 |
| 25 | 195 | 920 | 8 |
| 26 | 195.09 | 100 | 1 |
| 30 | 195 | 230 | 4 |
| 31 | 194.6 | 1 | 1 |
| 32 | 194.99 | 1 | 1 |
| 34 | 194.81 | 70 | 2 |
| 35 | 194.97 | 17 | 1 |
| 36 | 194.99 | 25 | 1 |
| 37 | 195 | 75 | 1 |
+----+--------+--------+--------------+
You can also find the result ready made in a pandas dataframe below:
pd_data_after = pd.DataFrame([[1,194.6,100,1],[2,195,10,1],[4,194.92,152,2],[5,194.9,99,1],[6,194.86,74,1],[9,194.85,1149,3],[10,194.6,101,1],[11,194.85,19,1],[13,194.6,35,2],[14,194.53,12,1],[15,194.85,14,1],[16,194.6,11,1],[17,194.85,93,1],[25,195,920,8],[26,195.09,100,1],[30,195,230,4],[31,194.6,1,1],[32,194.99,1,1],[34,194.81,70,2],[35,194.97,17,1],[36,194.99,25,1],[37,195,75,1]],columns=['Nr','Price','Volume','Transactions'])
I managed to achieve this with a for loop, but it is very slow because it iterates over every row, and my data set is huge: around 50 million rows.
Is there any way to achieve this without looping?
A common trick to groupby consecutive values is the following:
df.col.ne(df.col.shift()).cumsum()
We can use that here, then use agg to keep the last Nr and the first Price of each run, and to sum the Volume and Transactions.
(df.groupby(df.Price.ne(df.Price.shift()).cumsum())
.agg({'Nr': 'last', 'Price': 'first', 'Volume':'sum', 'Transactions': 'sum'})
).reset_index(drop=True)
Nr Price Volume Transactions
0 1 194.60 100 1
1 2 195.00 10 1
2 4 194.92 152 2
3 5 194.90 99 1
4 6 194.86 74 1
5 9 194.85 1149 3
6 10 194.60 101 1
7 11 194.85 19 1
8 13 194.60 35 2
9 14 194.53 12 1
10 15 194.85 14 1
11 16 194.60 11 1
12 17 194.85 93 1
13 25 195.00 920 8
14 26 195.09 100 1
15 30 195.00 230 4
16 31 194.60 1 1
17 32 194.99 1 1
18 34 194.81 70 2
19 35 194.97 17 1
20 36 194.99 25 1
21 37 195.00 75 1
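To see why this works, here is what the group-id series produced by the cumsum trick looks like on the first rows of pd_data_before (a small illustrative sketch):

group_id = pd_data_before.Price.ne(pd_data_before.Price.shift()).cumsum()
print(group_id.head(10).tolist())
# [1, 2, 3, 3, 4, 5, 6, 6, 6, 7]
# Nr 3 and 4 (both 194.92) share group 3 and Nr 7-9 (194.85) share group 6,
# so groupby(group_id) aggregates exactly one run of equal consecutive prices per group

Since everything is vectorized, this should scale to tens of millions of rows far better than a row-by-row loop.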
I've got a pandas DataFrame (pivoted) with the columns customer, current_date, current_day_count:
+----------+--------------+-------------------+
| customer | current_date | current_day_count |
+----------+--------------+-------------------+
| Mark | 2018_02_06 | 15 |
| | 2018_02_09 | 42 |
| | 2018_02_12 | 33 |
| | 2018_02_21 | 82 |
| | 2018_02_27 | 72 |
| Bob | 2018_02_02 | 76 |
| | 2018_02_23 | 11 |
| | 2018_03_04 | 59 |
| | 2018_03_13 | 68 |
| Shawn | 2018_02_11 | 71 |
| | 2018_02_15 | 39 |
| | 2018_02_18 | 65 |
| | 2018_02_24 | 38 |
+----------+--------------+-------------------+
Now I want another new column, previous_day_count, for each customer, where the customer's first day should have the value 0. Something like this: customer, current_date, current_day_count, previous_day_count (with the first day's value as 0):
+----------+--------------+-------------------+--------------------+
| customer | current_date | current_day_count | previous_day_count |
+----------+--------------+-------------------+--------------------+
| Mark | 2018_02_06 | 15 | 0 |
| | 2018_02_09 | 42 | 33 |
| | 2018_02_12 | 33 | 82 |
| | 2018_02_21 | 82 | 72 |
| | 2018_02_27 | 72 | 0 |
| Bob | 2018_02_02 | 76 | 0 |
| | 2018_02_23 | 11 | 59 |
| | 2018_03_04 | 59 | 68 |
| | 2018_03_13 | 68 | 0 |
| Shawn | 2018_02_11 | 71 | 0 |
| | 2018_02_15 | 39 | 65 |
| | 2018_02_18 | 65 | 38 |
| | 2018_02_24 | 38 | 0 |
+----------+--------------+-------------------+--------------------+
Try this:
import pandas as pd
import numpy as np
df = pd.DataFrame({'name': ['Mark','Mark','Mark','Mark','Bob','Bob','Bob','Bob'], 'current_day_count': [18,28,29,10,19,92,7,43]})
# take the next row's count within each customer group
df['previous_day_count'] = df.groupby('name')['current_day_count'].shift(-1)
# blank out the first row of each group, then fill every gap with 0
df.loc[df.groupby('name', as_index=False).head(1).index, 'previous_day_count'] = np.nan
df['previous_day_count'] = df['previous_day_count'].fillna(0)
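Adapted to the column names from the question, a minimal sketch (assuming the customer name is filled on every row and the frame is already ordered by customer and current_date); this reproduces the previous_day_count column in the expected output above:

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'customer': ['Mark']*5 + ['Bob']*4 + ['Shawn']*4,
    'current_date': ['2018_02_06','2018_02_09','2018_02_12','2018_02_21','2018_02_27',
                     '2018_02_02','2018_02_23','2018_03_04','2018_03_13',
                     '2018_02_11','2018_02_15','2018_02_18','2018_02_24'],
    'current_day_count': [15,42,33,82,72,76,11,59,68,71,39,65,38]})

# next day's count per customer, with the first row of each customer forced to 0
df['previous_day_count'] = df.groupby('customer')['current_day_count'].shift(-1)
df.loc[df.groupby('customer').head(1).index, 'previous_day_count'] = np.nan
df['previous_day_count'] = df['previous_day_count'].fillna(0).astype(int)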