Pandas: Loop for specific column values - python
This is a very short trace; the original file is far too large to post in full.
highest_layer,transport_layer,src_ip,dst_ip,src_port,dst_port,ip_flag,packet_length,transport_flag,time,timestamp,geo_country,data
DNS,UDP,192.168.1.6,172.217.12.131,32631,53,0,89,-1,2020-06-10 19:38:08.863846,1591832288863,Unknown,
DNS,UDP,192.168.1.6,192.168.1.1,31708,53,0,79,-1,2020-06-10 19:38:08.864186,1591832288864,Unknown,
DNS,UDP,192.168.1.6,172.217.12.131,32631,53,0,79,-1,2020-06-10 19:38:08.866492,1591832288866,Unknown,
SSDP,UDP,192.168.1.6,172.217.12.131,32631,1900,0,216,-1,2020-06-10 19:38:08.887298,1591832288887,Unknown,
DNS,UDP,192.168.1.1,192.168.1.6,53,32631,16384,105,-1,2020-06-10 19:38:08.888232,1591832288888,Unknown,
TCP,TCP,192.168.1.6,172.217.12.131,32631,443,16384,78,2,2020-06-10 19:38:08.888553,1591832288888,Unknown,
DNS,UDP,192.168.1.1,192.168.1.6,53,31708,16384,95,-1,2020-06-10 19:38:08.895148,1591832288895,Unknown,
TCP,TCP,192.168.1.6,172.217.12.131,16807,443,16384,78,2,2020-06-10 19:38:08.895594,1591832288895,Unknown,
DNS,UDP,192.168.1.1,192.168.1.6,53,16807,16384,119,-1,2020-06-10 19:38:08.896202,1591832288896,Unknown,
TCP,TCP,192.168.1.6,172.217.12.131,16807,443,16384,78,2,2020-06-10 19:38:08.896540,1591832288896,Unknown,
DNS,UDP,192.168.1.6,172.217.12.131,16807,53,0,75,-1,2020-06-10 19:38:08.911968,1591832288911,Unknown,
DATA,UDP,192.168.1.3,192.168.1.6,51216,58185,16384,558,-1,2020-06-10 19:38:08.913276,1591832288913,Unknown,
TCP,TCP,172.217.12.131,192.168.1.6,443,53717,0,74,18,2020-06-10 19:38:08.916735,1591832288916,Unknown,
TCP,TCP,192.168.1.6,172.217.12.131,58185,443,16384,66,16,2020-06-10 19:38:08.916860,1591832288916,Unknown,
TLS,TCP,192.168.1.6,172.217.12.131,58185,443,16384,583,24,2020-06-10 19:38:08.917442,1591832288917,Unknown,
TCP,TCP,172.217.10.237,192.168.1.6,443,53718,0,74,18,2020-06-10 19:38:08.919293,1591832288919,Unknown,
TCP,TCP,192.168.1.6,172.217.12.131,58185,443,16384,66,16,2020-06-10 19:38:08.919423,1591832288919,Unknown,
TLS,TCP,192.168.1.6,172.217.12.131,32631,443,16384,583,24,2020-06-10 19:38:08.919593,1591832288919,Unknown,
TCP,TCP,172.217.11.14,192.168.1.6,443,53719,0,74,18,2020-06-10 19:38:08.928819,1591832288928,Unknown,
TCP,TCP,192.168.1.6,172.217.12.131,16807,443,16384,66,16,2020-06-10 19:38:08.928922,1591832288928,Unknown,
TLS,TCP,192.168.1.6,172.217.12.131,58185,443,16384,583,24,2020-06-10 19:38:08.929100,1591832288929,Unknown,
I have dropped a few unwanted columns, and I want the cumulative packet length for a specific src_ip (192.168.1.6), destination IP address (172.217.12.131), and src_port (32631, 16807, 58185).
I want to iterate through the src_port values for the given src_ip and dst_ip. In this case, for each of the 3 src_ports, I need to calculate the cumulative packet length and plot it with the relative timestamp (which is the index here) on the x-axis and the cumulative packet length on the y-axis. I expect the graph to contain 3 lines, one per port's cumulative packet length.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('read.csv', sep=',')
#Calculate relative time for each dataframe
df.index = df['timestamp'] - df.loc[0,'timestamp']
#Drop unwanted columns
drop = df.drop(columns=['highest_layer', 'transport_layer','ip_flag', 'transport_flag','geo_country','data'])
df1 = drop[(drop.src_ip == '192.168.1.6') & (drop.dst_ip == '172.217.12.131')]
for i in df1['src_port']:
    df_cumsum = df1.groupby(['src_ip'])['packet_length'].cumsum()
    plt.plot(df.index, df_cumsum, label='i')
If I give the port numbers explicitly and plot without the for loop, it works, but when I iterate through src_port nothing happens. What am I missing here? Any thoughts, please?
I created a table by grouping on src_port, taking the cumulative sum, and concatenating the result back onto the original DataFrame, then made the graph from that table.
import matplotlib.pyplot as plt
import seaborn as sns

# Cumulative packet length within each src_port group
df2 = df1[['src_port','packet_length']].groupby('src_port')['packet_length'].transform('cumsum').to_frame()
df2.columns = ['cumsum_packets']
# Attach the cumulative sums back to the filtered frame
df3 = pd.concat([df1, df2], axis=1)
sns.lineplot(x=df3.index, y=df3['cumsum_packets'], hue=df3['src_port'], data=df3, legend='full')
| relative_time (index) | src_ip | dst_ip | src_port | dst_port | packet_length | time | timestamp | cumsum_packets |
|------------:|:------------|:---------------|-----------:|-----------:|----------------:|:---------------------------|--------------:|-----------------:|
| 0 | 192.168.1.6 | 172.217.12.131 | 32631 | 53 | 89 | 2020-06-10 19:38:08.863846 | 1591832288863 | 89 |
| 3 | 192.168.1.6 | 172.217.12.131 | 32631 | 53 | 79 | 2020-06-10 19:38:08.866492 | 1591832288866 | 168 |
| 24 | 192.168.1.6 | 172.217.12.131 | 32631 | 1900 | 216 | 2020-06-10 19:38:08.887298 | 1591832288887 | 384 |
| 25 | 192.168.1.6 | 172.217.12.131 | 32631 | 443 | 78 | 2020-06-10 19:38:08.888553 | 1591832288888 | 462 |
| 32 | 192.168.1.6 | 172.217.12.131 | 16807 | 443 | 78 | 2020-06-10 19:38:08.895594 | 1591832288895 | 78 |
| 33 | 192.168.1.6 | 172.217.12.131 | 16807 | 443 | 78 | 2020-06-10 19:38:08.896540 | 1591832288896 | 156 |
| 48 | 192.168.1.6 | 172.217.12.131 | 16807 | 53 | 75 | 2020-06-10 19:38:08.911968 | 1591832288911 | 231 |
| 53 | 192.168.1.6 | 172.217.12.131 | 58185 | 443 | 66 | 2020-06-10 19:38:08.916860 | 1591832288916 | 66 |
| 54 | 192.168.1.6 | 172.217.12.131 | 58185 | 443 | 583 | 2020-06-10 19:38:08.917442 | 1591832288917 | 649 |
| 56 | 192.168.1.6 | 172.217.12.131 | 58185 | 443 | 66 | 2020-06-10 19:38:08.919423 | 1591832288919 | 715 |
| 56 | 192.168.1.6 | 172.217.12.131 | 32631 | 443 | 583 | 2020-06-10 19:38:08.919593 | 1591832288919 | 1045 |
| 65 | 192.168.1.6 | 172.217.12.131 | 16807 | 443 | 66 | 2020-06-10 19:38:08.928922 | 1591832288928 | 297 |
| 66 | 192.168.1.6 | 172.217.12.131 | 58185 | 443 | 583 | 2020-06-10 19:38:08.929100 | 1591832288929 | 1298 |
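For reference, the original loop fails to produce three lines because it iterates over every row's src_port value (not the three unique ports), recomputes the same src_ip-level cumulative sum on each pass, and passes the literal string 'i' as the label. A minimal sketch of a loop-based alternative, assuming the filtered df1 from the question (whose index already holds the relative timestamp):

for port in df1['src_port'].unique():
    # Rows for this port only
    port_rows = df1[df1['src_port'] == port]
    plt.plot(port_rows.index, port_rows['packet_length'].cumsum(), label=str(port))
plt.xlabel('relative timestamp (ms)')
plt.ylabel('cumulative packet length')
plt.legend()
plt.show()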
Related
Array Split in data frame column without warning
The data frame has columns 'date_sq' and 'value'; the 'value' column is a tab-separated string of 201 values.

| date_sq | value |
|------------|-------|
| 2022-05-04 | 13360\t12597\t13896\t8262\t12851\t12345\t12849\t0\t0\t0\t0\t0\t21320\t21301\t22597\t13624\t3\t0\t0\t0\t13360\t12597\t13896\t8262\t12851\t12345\t12849\t0\t0\t0\t0\t0\t21320\t21301\t22597\t13624\t2\t0\t0\t0\t13360\t12597\t13896\t8262\t12851\t12345\t12849\t0\t0\t0\t0\t0\t21320\t21301\t22597\t13624\t1\t0\t0\t0\t13360\t12597\t13896\t8262\t12851\t12345\t12849\t0\t0\t0\t0\t0\t21320\t21301\t22597\t13624\t0\t0\t0\t0\t13360\t12341\t13379\t8257\t14641\t13106\t12854\t0\t0\t0\t0\t0\t13123\t13139\t17473\t13105\t4\t0\t0\t0\t13360\t12341\t13379\t8257\t14641\t13106\t12854\t0\t0\t0\t0\t0\t13123\t13139\t17473\t13105\t3\t0\t0\t1\t13360\t12341\t13379\t8257\t14641\t13106\t12854\t0\t0\t0\t0\t0\t13123\t13139\t17473\t13105\t2\t0\t0\t1\t13360\t12341\t13379\t8257\t14641\t13106\t12854\t0\t0\t0\t0\t0\t13123\t13139\t17473\t13105\t1\t0\t0\t1\t13360\t12341\t13379\t8257\t14641\t13106\t12854\t0\t0\t0\t0\t0\t13123\t13139\t17473\t13105\t0\t0\t0\t1\t13360\t14644\t22099\t8257\t13105\t13879\t12853\t0\t0\t0\t0\t0\t16706\t21558\t17474\t13873\t4\t0\t0\t1\t2949\r\n |

My code below works: it splits the value column into 201 separate columns, creates the date in numeric form, creates a column 'key', and drops the unnecessary columns.

df_spec[[f'{x}' for x in range(total_cols)]] = df_spec['value'].str.split(pat="\t", expand=True).replace(r'\s+|\\n', ' ', regex=True).fillna('0').apply(pd.to_numeric)
df_spec['date'] = pd.to_datetime(df_spec['date_sq']).dt.strftime("%Y%m%d")
df_spec['key'] = (df_spec['date'].astype(str) + df_spec['200'].astype(str)).apply(pd.to_numeric)
df_spec.drop(['value','date_sq','date'], axis=1, inplace=True)

Requirement: my code above works, but it throws some warning messages. Is there an optimized way without warnings?

Warning:

<ipython-input-3-70211686c759>:2: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  df_spec[[f'{x}' for x in range(total_cols)]] = df_spec['value'].str.split \
... goes on for some lines...
Final dataframe: one row with the 201 split values in columns 0-200 (the same numbers as in the 'value' string above) followed by the key column; for this row, column 200 is 2949 and key is 202205052949.
I tried the following and got no warnings. I hope it helps in your case.

df = pd.DataFrame([['2022-05-04', '13360\t12597\t13896\t8262\t12851\t12345\t12849\t0\t0\t0\t0\t0\t21320\t21301\t22597\t13624\t3\t0\t0\t0\t13360\t12597\t13896\t8262\t12851\t12345\t12849\t0\t0\t0\t0\t0\t21320\t21301\t22597\t13624\t2\t0\t0\t0\t13360\t12597\t13896\t8262\t12851\t12345\t12849\t0\t0\t0\t0\t0\t21320\t21301\t22597\t13624\t1\t0\t0\t0\t13360\t12597\t13896\t8262\t12851\t12345\t12849\t0\t0\t0\t0\t0\t21320\t21301\t22597\t13624\t0\t0\t0\t0\t13360\t12341\t13379\t8257\t14641\t13106\t12854\t0\t0\t0\t0\t0\t13123\t13139\t17473\t13105\t4\t0\t0\t0\t13360\t12341\t13379\t8257\t14641\t13106\t12854\t0\t0\t0\t0\t0\t13123\t13139\t17473\t13105\t3\t0\t0\t1\t13360\t12341\t13379\t8257\t14641\t13106\t12854\t0\t0\t0\t0\t0\t13123\t13139\t17473\t13105\t2\t0\t0\t1\t13360\t12341\t13379\t8257\t14641\t13106\t12854\t0\t0\t0\t0\t0\t13123\t13139\t17473\t13105\t1\t0\t0\t1\t13360\t12341\t13379\t8257\t14641\t13106\t12854\t0\t0\t0\t0\t0\t13123\t13139\t17473\t13105\t0\t0\t0\t1\t13360\t14644\t22099\t8257\t13105\t13879\t12853\t0\t0\t0\t0\t0\t16706\t21558\t17474\t13873\t4\t0\t0\t1\t2949\r\n']], columns=['date_sq', 'value'])
date = df['date_sq']
df = df['value'].str.split('\t', expand=True).fillna('0').apply(pd.to_numeric)  # .explode().T.reset_index(drop=True)
df['date'] = pd.to_datetime(date).dt.strftime("%Y%m%d")
df['key'] = (df['date'].astype(str) + df[200].astype(str)).apply(pd.to_numeric)
df.drop(['date'], axis=1, inplace=True)
df

I think df_spec[[f'{x}' for x in range(total_cols)]] is unnecessary when using split(..., expand=True). Good luck
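Not part of the original answer, but the PerformanceWarning itself suggests building all the new columns at once and joining them with a single pd.concat(axis=1), which keeps df_spec and avoids the repeated frame.insert calls. A minimal sketch, assuming the same df_spec with 'date_sq' and 'value' columns as in the question:

import pandas as pd

# Split once into a separate frame; columns come out as integers 0..200
split_cols = (df_spec['value'].str.split('\t', expand=True)
              .fillna('0')
              .apply(pd.to_numeric))
# Build the derived columns on the split frame
split_cols['date'] = pd.to_datetime(df_spec['date_sq']).dt.strftime('%Y%m%d')
split_cols['key'] = (split_cols['date'] + split_cols[200].astype(str)).astype('int64')
# Join everything back in one concat, dropping the columns that are no longer needed
df_spec = pd.concat(
    [df_spec.drop(columns=['value', 'date_sq']), split_cols.drop(columns=['date'])],
    axis=1,
)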
Calculating quarterly growth
I have some daily data in a df, which goes back as far as 1st January 2020. It looks similar to the below, but with many id1s on each day.

| yyyy_mm_dd | id1 | id2  | cost  |
|------------|-----|------|-------|
| 2020-01-01 | 23  | 7253 | 5003  |
| 2020-01-01 | 23  | 7743 | 30340 |
| 2020-01-02 | 23  | 7253 | 450   |
| 2020-01-02 | 23  | 7743 | 4500  |
| ...        | ... | ...  | ...   |
| 2021-01-01 | 23  | 7253 | 5675  |
| 2021-01-01 | 23  | 134  | 1030  |
| 2021-01-01 | 23  | 3445 | 564   |
| 2021-01-01 | 23  | 4534 | 345   |
| ...        | ... | ...  | ...   |

I would like to calculate (1) the summed cost grouped by quarter and id1, and (2) the growth % compared to the same quarter in the previous year. I have grouped and calculated the summed cost like so:

grouped_quarterly = (
    df
    .withColumn('year_quarter', (F.year(sf.col('yyyy_mm_dd')) * 100 + F.quarter(F.col('yyyy_mm_dd'))))
    .groupby('id1', 'year_quarter')
    .agg(
        F.sum('cost').alias('cost')
    )
)

But I am unsure how to get the growth compared to the previous year. Expected output based on the above sample:

| year_quarter | id1 | cost | cost_growth |
|--------------|-----|------|-------------|
| 202101       | 23  | 7614 | -81         |

It would also be nice to set cost_growth to 0 if the id1 has no rows in the previous year's quarter.

Edit: Below is an attempt to make the comparison, but I get an error that there is no attribute prev_value:

grouped_quarterly = (
    df
    .withColumn('year_quarter', (F.year(sf.col('yyyy_mm_dd')) * 100 + F.quarter(F.col('yyyy_mm_dd'))))
    .groupby('id1', 'year_quarter')
    .agg(
        F.sum('cost').alias('cost')
    )
)

w = Window.partitionBy('id1').orderBy('year_quarter')
growth = (
    grouped_quarterly
    .withColumn('prev_value', sf.lag(grouped_quarterly.cost).over(w))
    .withColumn('diff', sf.when(sf.isnull(grouped_quarterly.cost - grouped_quarterly.prev_value), 0).otherwise(grouped_quarterly.cost - grouped_quarterly.cost))
)

Edit #2: The window function seems to take the previous quarter, regardless of year. This means my prev_value column is the previous quarter rather than the same quarter from the previous year:

grouped_quarterly.where(sf.col('id1') == 222).sort('year_quarter').show(10,False)

| id1 | year_quarter | cost |
|-----|--------------|------|
| 222 | 202001       | 73   |
| 222 | 202002       | 246  |
| 222 | 202003       | 525  |
| 222 | 202004       | -27  |
| 222 | 202101       | 380  |

w = Window.partitionBy('id1').orderBy('year_quarter')
growth = (
    grouped_quarterly
    .withColumn('prev_value', sf.lag(sf.col('cost')).over(w))
    .withColumn('diff', sf.when(sf.isnull(sf.col('cost') - sf.col('prev_value')), 0).otherwise(sf.col('cost') - sf.col('prev_value')))
)

growth.where(sf.col('id1') == 222).sort('year_quarter').show(10,False)

| id1 | year_quarter | cost | prev_value | diff |
|-----|--------------|------|------------|------|
| 222 | 202001       | 73   | null       | 0    |
| 222 | 202002       | 246  | 73         | 173  |
| 222 | 202003       | 525  | 246        | 279  |
| 222 | 202004       | -27  | 525        | -522 |
| 222 | 202101       | 380  | -27        | 407  |

Edit #3: Using the quarter in the partitioning results in a null prev_value for all rows:

grouped_quarterly.where(sf.col('id1') == 222).sort('year_quarter').show(10,False)

| id1 | year_quarter | cost |
|-----|--------------|------|
| 222 | 202001       | 73   |
| 222 | 202002       | 246  |
| 222 | 202003       | 525  |
| 222 | 202004       | -27  |
| 222 | 202101       | 380  |

w = Window.partitionBy(sf.col('id1'), sf.expr('substring(string(year_quarter), 2)')).orderBy('year_quarter')
growth = (
    grouped_quarterly
    .withColumn('prev_value', sf.lag(sf.col('cost')).over(w))
    .withColumn('diff', sf.when(sf.isnull(sf.col('cost') - sf.col('prev_value')), 0).otherwise(sf.col('cost') - sf.col('prev_value')))
)

growth.where(sf.col('id1') == 222).sort('year_quarter').show(10,False)

| id1 | year_quarter | cost | prev_value | diff |
|-----|--------------|------|------------|------|
| 222 | 202001       | 73   | null       | 0    |
| 222 | 202002       | 246  | null       | 0    |
| 222 | 202003       | 525  | null       | 0    |
| 222 | 202004       | -27  | null       | 0    |
| 222 | 202101       | 380  | null       | 0    |
Try using the quarter in the partitioning as well, so that lag gives you the value from the same quarter last year:

w = Window.partitionBy(sf.col('id1'), sf.expr('substring(string(year_quarter), -2)')).orderBy('year_quarter')
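A minimal end-to-end sketch of that suggestion, assuming grouped_quarterly is built as in the question and pyspark.sql.functions is imported as sf (the zero-fill for missing previous quarters follows the question; whether cost_growth should be a difference or a percentage is left to the asker):

from pyspark.sql import Window
import pyspark.sql.functions as sf

# Partition by id1 and the quarter digits ("01".."04"); within each partition,
# ordering by year_quarter makes lag(1) return the same quarter of the previous year.
w = Window.partitionBy(sf.col('id1'), sf.expr("substring(string(year_quarter), -2)")).orderBy('year_quarter')

growth = (
    grouped_quarterly
    .withColumn('prev_value', sf.lag('cost').over(w))
    .withColumn('cost_growth',
                sf.when(sf.col('prev_value').isNull(), sf.lit(0))
                  .otherwise(sf.col('cost') - sf.col('prev_value')))
)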
Binning Pandas value_counts
I have a Pandas Series produced by df.column.value_counts().sort_index():

| N Months | Count |
|----------|-------|
| 0  | 15   |
| 1  | 9    |
| 2  | 78   |
| 3  | 151  |
| 4  | 412  |
| 5  | 181  |
| 6  | 543  |
| 7  | 175  |
| 8  | 409  |
| 9  | 594  |
| 10 | 137  |
| 11 | 202  |
| 12 | 170  |
| 13 | 446  |
| 14 | 29   |
| 15 | 39   |
| 16 | 44   |
| 17 | 253  |
| 18 | 17   |
| 19 | 34   |
| 20 | 18   |
| 21 | 37   |
| 22 | 147  |
| 23 | 12   |
| 24 | 31   |
| 25 | 15   |
| 26 | 117  |
| 27 | 8    |
| 28 | 38   |
| 29 | 23   |
| 30 | 198  |
| 31 | 29   |
| 32 | 122  |
| 33 | 50   |
| 34 | 60   |
| 35 | 357  |
| 36 | 329  |
| 37 | 457  |
| 38 | 609  |
| 39 | 4744 |
| 40 | 1120 |
| 41 | 591  |
| 42 | 328  |
| 43 | 148  |
| 44 | 46   |
| 45 | 10   |
| 46 | 1    |
| 47 | 1    |
| 48 | 7    |
| 50 | 2    |

My desired output is:

| bin   | Total |
|-------|-------|
| 0-13  | 3522  |
| 14-26 | 793   |
| 27-50 | 9278  |

I tried df.column.value_counts(bins=3).sort_index() but got:

| bin                             | Total |
|---------------------------------|-------|
| (-0.051000000000000004, 16.667] | 3634  |
| (16.667, 33.333]                | 1149  |
| (33.333, 50.0]                  | 8810  |

I can get the correct result with:

a = df.column.value_counts().sort_index()[:14].sum()
b = df.column.value_counts().sort_index()[14:27].sum()
c = df.column.value_counts().sort_index()[28:].sum()
print(a, b, c)

Output: 3522 793 9270

But I am wondering if there is a pandas method that can do what I want. Any advice is very welcome. :-)
You can use pd.cut:

pd.cut(df['N Months'], [0, 13, 26, 50], include_lowest=True).value_counts()

Update: you should be able to pass custom bins to value_counts:

df['N Months'].value_counts(bins=[0, 13, 26, 50])

Output:

N Months
(-0.001, 13.0]    3522
(13.0, 26.0]       793
(26.0, 50.0]      9278
Name: Count, dtype: int64
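If the exact "0-13 / 14-26 / 27-50" labels are wanted instead of interval notation, a small sketch working directly from the value-counts series (df.column is the placeholder name used in the question):

import pandas as pd

counts = df.column.value_counts().sort_index()   # months -> count
# Cut the month index into the three labelled buckets, then sum the counts per bucket
bins = pd.cut(counts.index, [0, 13, 26, 50], labels=['0-13', '14-26', '27-50'], include_lowest=True)
totals = counts.groupby(bins).sum()
print(totals)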
Histogram to show sampling rate
I'm having confusion with plotting some data, but here is what I want to do. I have a dataframe with this sample data:

>>> df.head(20)

| user_id | trip_id | lat | lon | sampling_rate |
|---------|---------|-----|-----|---------------|
| 126 | 125020080511025052 | 39.9531666666667 | 116.452566666667 | 7  |
| 126 | 125020080511025052 | 39.95305         | 116.452683333333 | 16 |
| 126 | 125020080511025052 | 39.9530666666667 | 116.452916666667 | 44 |
| 126 | 125020080511025052 | 39.9530833333333 | 116.453183333333 | 40 |
| 126 | 125020080511025052 | 39.95335         | 116.45365        | 21 |
| 126 | 125020080511025052 | 39.9532833333333 | 116.453816666667 | 16 |
| 126 | 125020080511025052 | 39.9533166666667 | 116.45405        | 13 |
| 126 | 125020080511025052 | 39.9535666666667 | 116.454383333333 | 24 |
| 126 | 125020080511025052 | 39.9537166666667 | 116.4546         | 16 |
| 126 | 125020080511025052 | 39.9538333333333 | 116.454733333333 | 17 |
| 126 | 125020080511025052 | 39.9540166666667 | 116.454966666667 | 37 |
| 126 | 125020080511025052 | 39.9541833333333 | 116.455133333333 | 18 |
| 126 | 125020080511025052 | 39.95405         | 116.455216666667 | 23 |
| 126 | 125020080511025052 | 39.9539          | 116.455266666667 | 19 |
| 126 | 125020080511025052 | 39.9537333333333 | 116.455333333333 | 42 |
| 126 | 125020080511025052 | 39.95365         | 116.455416666667 | 23 |
| 126 | 125020080512015529 | 40.00705         | 116.32225        |    |
| 126 | 125020080512015529 | 40.0073          | 116.3225         | 19 |
| 126 | 125020080512015529 | 40.0068          | 116.322083333333 | 66 |
| 126 | 125020080512015529 | 40.0064333333333 | 116.321666666667 | 2  |

This table contains GPS traces of users' trips; the sampling_rate column shows the GPS sampling interval for each trip point. I want to plot the sampling rate so that I can see trips with a 1-second interval, trips with a 2-5 second interval, trips with a 5-10 second interval, and so on. I would like the number of trips on the y-axis and the interval on the x-axis.
IIUC, use np.arange to create bins with an interval of 5 seconds:

import numpy as np
import matplotlib.pyplot as plt

plt.figure(figsize=(6, 4))
plt.hist(df['sampling_rate'], bins=np.arange(1, df['sampling_rate'].max() + 5, 5))
plt.xlabel('Sampling Rate')
plt.ylabel('Frequency')
plt.title('Distribution of sampling rate')
plt.show()

Result: (histogram of the sampling-rate distribution)
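Since the question asks for the number of trips (not GPS points) per interval bucket, one hedged option is to first reduce each trip to a single sampling-rate figure and then histogram that. A sketch, assuming the mean rate per trip is an acceptable summary (that choice is mine, not the asker's):

import matplotlib.pyplot as plt

# One sampling-rate value per trip (mean over that trip's points)
per_trip = df.groupby('trip_id')['sampling_rate'].mean()

# Bucket edges roughly matching 1s, 2-5s, 5-10s, ... intervals
edges = [1, 2, 5, 10, 20, 40, max(41, per_trip.max() + 1)]
plt.hist(per_trip, bins=edges)
plt.xlabel('sampling interval (s)')
plt.ylabel('number of trips')
plt.show()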
Parsing out indices and values from pandas multi index dataframe
I have a dataframe in a similar format to this:

|        |        |          |      | day1 | day2 | day3 |
|--------|--------|----------|------|------|------|------|
| id_one | id_two | id_three | date |      |      |      |
| 18273  | 50     | 1        | 3    | 9    | 11   | 3    |
|        |        |          | 4    | 26   | 27   | 68   |
|        |        |          | 5    | 92   | 25   | 4    |
|        |        |          | 6    | 60   | 72   | 83   |
|        | 60     | 2        | 5    | 69   | 93   | 84   |
|        |        |          | 6    | 69   | 30   | 12   |
|        |        |          | 7    | 65   | 65   | 59   |
|        |        |          | 8    | 57   | 88   | 59   |
|        | 70     | 3        | 5    | 22   | 95   | 7    |
|        |        |          | 6    | 40   | 24   | 20   |
|        |        |          | 7    | 73   | 81   | 57   |
|        |        |          | 8    | 43   | 8    | 66   |

I am trying to create a tuple that contains id_one, id_two, and the values that each grouping contains. To test this, I am simply trying to print the ids and values like this:

for id_two, data in df.head(100).groupby(level='id_two'):
    print id_two, data.values.ravel()

which gives me the id_two and the data exactly as it should. I run into problems when I try to incorporate id_one. I tried this, but was met with the error ValueError: need more than 2 values to unpack:

for id_one, id_two, data in df.head(100).groupby(level='id_two'):
    print id_one, id_two, data.values.ravel()

How can I print id_one, id_two, and the data?
You can pass a list of level names into the level parameter:

df.head(100).groupby(level=['id_one', 'id_two'])
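The group key then becomes a tuple, so the loop from the question would unpack it like this (a sketch assuming the same df, written with print() so it also runs on Python 3):

# Group by both index levels; each key is an (id_one, id_two) tuple
for (id_one, id_two), data in df.head(100).groupby(level=['id_one', 'id_two']):
    print(id_one, id_two, data.values.ravel())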