Pandas: Loop for specific column values - python

This is a very short trace; the original file is far too large to post.
highest_layer,transport_layer,src_ip,dst_ip,src_port,dst_port,ip_flag,packet_length,transport_flag,time,timestamp,geo_country,data
DNS,UDP,192.168.1.6,172.217.12.131,32631,53,0,89,-1,2020-06-10 19:38:08.863846,1591832288863,Unknown,
DNS,UDP,192.168.1.6,192.168.1.1,31708,53,0,79,-1,2020-06-10 19:38:08.864186,1591832288864,Unknown,
DNS,UDP,192.168.1.6,172.217.12.131,32631,53,0,79,-1,2020-06-10 19:38:08.866492,1591832288866,Unknown,
SSDP,UDP,192.168.1.6,172.217.12.131,32631,1900,0,216,-1,2020-06-10 19:38:08.887298,1591832288887,Unknown,
DNS,UDP,192.168.1.1,192.168.1.6,53,32631,16384,105,-1,2020-06-10 19:38:08.888232,1591832288888,Unknown,
TCP,TCP,192.168.1.6,172.217.12.131,32631,443,16384,78,2,2020-06-10 19:38:08.888553,1591832288888,Unknown,
DNS,UDP,192.168.1.1,192.168.1.6,53,31708,16384,95,-1,2020-06-10 19:38:08.895148,1591832288895,Unknown,
TCP,TCP,192.168.1.6,172.217.12.131,16807,443,16384,78,2,2020-06-10 19:38:08.895594,1591832288895,Unknown,
DNS,UDP,192.168.1.1,192.168.1.6,53,16807,16384,119,-1,2020-06-10 19:38:08.896202,1591832288896,Unknown,
TCP,TCP,192.168.1.6,172.217.12.131,16807,443,16384,78,2,2020-06-10 19:38:08.896540,1591832288896,Unknown,
DNS,UDP,192.168.1.6,172.217.12.131,16807,53,0,75,-1,2020-06-10 19:38:08.911968,1591832288911,Unknown,
DATA,UDP,192.168.1.3,192.168.1.6,51216,58185,16384,558,-1,2020-06-10 19:38:08.913276,1591832288913,Unknown,
TCP,TCP,172.217.12.131,192.168.1.6,443,53717,0,74,18,2020-06-10 19:38:08.916735,1591832288916,Unknown,
TCP,TCP,192.168.1.6,172.217.12.131,58185,443,16384,66,16,2020-06-10 19:38:08.916860,1591832288916,Unknown,
TLS,TCP,192.168.1.6,172.217.12.131,58185,443,16384,583,24,2020-06-10 19:38:08.917442,1591832288917,Unknown,
TCP,TCP,172.217.10.237,192.168.1.6,443,53718,0,74,18,2020-06-10 19:38:08.919293,1591832288919,Unknown,
TCP,TCP,192.168.1.6,172.217.12.131,58185,443,16384,66,16,2020-06-10 19:38:08.919423,1591832288919,Unknown,
TLS,TCP,192.168.1.6,172.217.12.131,32631,443,16384,583,24,2020-06-10 19:38:08.919593,1591832288919,Unknown,
TCP,TCP,172.217.11.14,192.168.1.6,443,53719,0,74,18,2020-06-10 19:38:08.928819,1591832288928,Unknown,
TCP,TCP,192.168.1.6,172.217.12.131,16807,443,16384,66,16,2020-06-10 19:38:08.928922,1591832288928,Unknown,
TLS,TCP,192.168.1.6,172.217.12.131,58185,443,16384,583,24,2020-06-10 19:38:08.929100,1591832288929,Unknown,
I have dropped a few unwanted columns, and I want the cumulative packet length for a specific src_ip (192.168.1.6), destination IP (172.217.12.131), and src_ports (32631, 16807, 58185).
I want to iterate through the src_port values for the given src_ip and dst_ip. In this case, for each of the 3 src_ports, I need to calculate the cumulative packet length, then plot the relative timestamp (which is the index here) on the x-axis against the cumulative packet length on the y-axis. I expect the graph to contain 3 lines, one per port's cumulative packet length.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('read.csv', sep=',')
# Calculate relative time for each dataframe
df.index = df['timestamp'] - df.loc[0, 'timestamp']
# Drop unwanted columns
drop = df.drop(columns=['highest_layer', 'transport_layer', 'ip_flag', 'transport_flag', 'geo_country', 'data'])
df1 = drop[(drop.src_ip == '192.168.1.6') & (drop.dst_ip == '172.217.12.131')]
for i in df1['src_port']:
    df_cumsum = df1.groupby(['src_ip'])['packet_length'].cumsum()
    plt.plot(df.index, df_cumsum, label='i')
If I give the port numbers explicitly and plot without the for loop, it works; but when I iterate through src_port, nothing happens. What am I missing here? Any thoughts, please?

I grouped on src_port using your code, combined the cumulative sums back with the original DataFrame into one table, and made a graph from that table.
import matplotlib.pyplot as plt
import seaborn as sns

# Cumulative packet length per src_port, aligned with df1's index
df2 = df1[['src_port', 'packet_length']].groupby('src_port')['packet_length'].transform('cumsum').to_frame()
df2.columns = ['cumsum_packets']
df3 = pd.concat([df1, df2], axis=1)
sns.lineplot(x=df3.index, y=df3['cumsum_packets'], hue=df3['src_port'], data=df3, legend='full')
| rel_time (index) | src_ip | dst_ip | src_port | dst_port | packet_length | time | timestamp | cumsum_packets |
|------------:|:------------|:---------------|-----------:|-----------:|----------------:|:---------------------------|--------------:|-----------------:|
| 0 | 192.168.1.6 | 172.217.12.131 | 32631 | 53 | 89 | 2020-06-10 19:38:08.863846 | 1591832288863 | 89 |
| 3 | 192.168.1.6 | 172.217.12.131 | 32631 | 53 | 79 | 2020-06-10 19:38:08.866492 | 1591832288866 | 168 |
| 24 | 192.168.1.6 | 172.217.12.131 | 32631 | 1900 | 216 | 2020-06-10 19:38:08.887298 | 1591832288887 | 384 |
| 25 | 192.168.1.6 | 172.217.12.131 | 32631 | 443 | 78 | 2020-06-10 19:38:08.888553 | 1591832288888 | 462 |
| 32 | 192.168.1.6 | 172.217.12.131 | 16807 | 443 | 78 | 2020-06-10 19:38:08.895594 | 1591832288895 | 78 |
| 33 | 192.168.1.6 | 172.217.12.131 | 16807 | 443 | 78 | 2020-06-10 19:38:08.896540 | 1591832288896 | 156 |
| 48 | 192.168.1.6 | 172.217.12.131 | 16807 | 53 | 75 | 2020-06-10 19:38:08.911968 | 1591832288911 | 231 |
| 53 | 192.168.1.6 | 172.217.12.131 | 58185 | 443 | 66 | 2020-06-10 19:38:08.916860 | 1591832288916 | 66 |
| 54 | 192.168.1.6 | 172.217.12.131 | 58185 | 443 | 583 | 2020-06-10 19:38:08.917442 | 1591832288917 | 649 |
| 56 | 192.168.1.6 | 172.217.12.131 | 58185 | 443 | 66 | 2020-06-10 19:38:08.919423 | 1591832288919 | 715 |
| 56 | 192.168.1.6 | 172.217.12.131 | 32631 | 443 | 583 | 2020-06-10 19:38:08.919593 | 1591832288919 | 1045 |
| 65 | 192.168.1.6 | 172.217.12.131 | 16807 | 443 | 66 | 2020-06-10 19:38:08.928922 | 1591832288928 | 297 |
| 66 | 192.168.1.6 | 172.217.12.131 | 58185 | 443 | 583 | 2020-06-10 19:38:08.929100 | 1591832288929 | 1298 |
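For reference, the original loop can also be made to work directly: the missing pieces were iterating over the unique ports, filtering per port, and passing the variable (not the string 'i') as the label. A minimal sketch of that idea, using the same df1 as above:
import matplotlib.pyplot as plt

# One cumulative-length line per unique source port
for port in df1['src_port'].unique():
    subset = df1[df1['src_port'] == port]
    plt.plot(subset.index, subset['packet_length'].cumsum(), label=str(port))

plt.xlabel('relative timestamp')
plt.ylabel('cumulative packet length')
plt.legend(title='src_port')
plt.show()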

Related

Array Split in data frame column without warning

The data frame has columns 'date_sq' and 'value'; the 'value' column holds an array of 201 tab-separated values.
| date_sq | value |
|---------|-------|
| 2022-05-04 |13360\t12597\t13896\t8262\t12851\t12345\t12849\t0\t0\t0\t0\t0\t21320\t21301\t22597\t13624\t3\t0\t0\t0\t13360\t12597\t13896\t8262\t12851\t12345\t12849\t0\t0\t0\t0\t0\t21320\t21301\t22597\t13624\t2\t0\t0\t0\t13360\t12597\t13896\t8262\t12851\t12345\t12849\t0\t0\t0\t0\t0\t21320\t21301\t22597\t13624\t1\t0\t0\t0\t13360\t12597\t13896\t8262\t12851\t12345\t12849\t0\t0\t0\t0\t0\t21320\t21301\t22597\t13624\t0\t0\t0\t0\t13360\t12341\t13379\t8257\t14641\t13106\t12854\t0\t0\t0\t0\t0\t13123\t13139\t17473\t13105\t4\t0\t0\t0\t13360\t12341\t13379\t8257\t14641\t13106\t12854\t0\t0\t0\t0\t0\t13123\t13139\t17473\t13105\t3\t0\t0\t1\t13360\t12341\t13379\t8257\t14641\t13106\t12854\t0\t0\t0\t0\t0\t13123\t13139\t17473\t13105\t2\t0\t0\t1\t13360\t12341\t13379\t8257\t14641\t13106\t12854\t0\t0\t0\t0\t0\t13123\t13139\t17473\t13105\t1\t0\t0\t1\t13360\t12341\t13379\t8257\t14641\t13106\t12854\t0\t0\t0\t0\t0\t13123\t13139\t17473\t13105\t0\t0\t0\t1\t13360\t14644\t22099\t8257\t13105\t13879\t12853\t0\t0\t0\t0\t0\t16706\t21558\t17474\t13873\t4\t0\t0\t1\t2949\r\n |
My code below works. It does the following:
1. Splits the value column into 201 separate columns
2. Creates the date in numeric form
3. Creates a 'key' column
4. Drops unnecessary columns
total_cols = 201  # the value column expands into 201 fields
df_spec[[f'{x}' for x in range(total_cols)]] = df_spec['value'].str.split(pat="\t", expand=True).replace(r'\s+|\\n', ' ', regex=True).fillna('0').apply(pd.to_numeric)
df_spec['date'] = pd.to_datetime(df_spec['date_sq']).dt.strftime("%Y%m%d")
df_spec['key'] = (df_spec['date'].astype(str) + df_spec['200'].astype(str)).apply(pd.to_numeric)
df_spec.drop(['value', 'date_sq', 'date'], axis=1, inplace=True)
Requirement:
My above code works, but it throws some warning messages.
Is there an optimized way without warnings?
Warning:
<ipython-input-3-70211686c759>:2: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
df_spec[[f'{x}' for x in range(total_cols)]] = df_spec['value'].str.split \
<ipython-input-3-70211686c759>:2: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
df_spec[[f'{x}' for x in range(total_cols)]] = df_spec['value'].str.split \
... goes on for some lines...
Final dataframe:
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 | 32 | 33 | 34 | 35 | 36 | 37 | 38 | 39 | 40 | 41 | 42 | 43 | 44 | 45 | 46 | 47 | 48 | 49 | 50 | 51 | 52 | 53 | 54 | 55 | 56 | 57 | 58 | 59 | 60 | 61 | 62 | 63 | 64 | 65 | 66 | 67 | 68 | 69 | 70 | 71 | 72 | 73 | 74 | 75 | 76 | 77 | 78 | 79 | 80 | 81 | 82 | 83 | 84 | 85 | 86 | 87 | 88 | 89 | 90 | 91 | 92 | 93 | 94 | 95 | 96 | 97 | 98 | 99 | 100 | 101 | 102 | 103 | 104 | 105 | 106 | 107 | 108 | 109 | 110 | 111 | 112 | 113 | 114 | 115 | 116 | 117 | 118 | 119 | 120 | 121 | 122 | 123 | 124 | 125 | 126 | 127 | 128 | 129 | 130 | 131 | 132 | 133 | 134 | 135 | 136 | 137 | 138 | 139 | 140 | 141 | 142 | 143 | 144 | 145 | 146 | 147 | 148 | 149 | 150 | 151 | 152 | 153 | 154 | 155 | 156 | 157 | 158 | 159 | 160 | 161 | 162 | 163 | 164 | 165 | 166 | 167 | 168 | 169 | 170 | 171 | 172 | 173 | 174 | 175 | 176 | 177 | 178 | 179 | 180 | 181 | 182 | 183 | 184 | 185 | 186 | 187 | 188 | 189 | 190 | 191 | 192 | 193 | 194 | 195 | 196 | 197 | 198 | 199 | 200 | key |
|-------|-------|-------|------|-------|-------|-------|---|---|---|----|----|-------|-------|-------|-------|----|----|----|----|-------|-------|-------|------|-------|-------|-------|----|----|----|----|----|-------|-------|-------|-------|----|----|----|----|-------|-------|-------|------|-------|-------|-------|----|----|----|----|----|-------|-------|-------|-------|----|----|----|----|-------|-------|-------|------|-------|-------|-------|----|----|----|----|----|-------|-------|-------|-------|----|----|----|----|-------|-------|-------|------|-------|-------|-------|----|----|----|----|----|-------|-------|-------|-------|----|----|----|----|-------|-------|-------|------|-------|-------|-------|-----|-----|-----|-----|-----|-------|-------|-------|-------|-----|-----|-----|-----|-------|-------|-------|------|-------|-------|-------|-----|-----|-----|-----|-----|-------|-------|-------|-------|-----|-----|-----|-----|-------|-------|-------|------|-------|-------|-------|-----|-----|-----|-----|-----|-------|-------|-------|-------|-----|-----|-----|-----|-------|-------|-------|------|-------|-------|-------|-----|-----|-----|-----|-----|-------|-------|-------|-------|-----|-----|-----|-----|-------|-------|-------|------|-------|-------|-------|-----|-----|-----|-----|-----|-------|-------|-------|-------|-----|-----|-----|-----|------|-------------|
| 13360 | 12597 | 13896 | 8262 | 12851 | 12345 | 12849 | 0 | 0 | 0 | 0 | 0 | 21320 | 21301 | 22597 | 13624 | 3 | 0 | 0 | 0 | 13360 | 12597 | 13896 | 8262 | 12851 | 12345 | 12849 | 0 | 0 | 0 | 0 | 0 | 21320 | 21301 | 22597 | 13624 | 2 | 0 | 0 | 0 | 13360 | 12597 | 13896 | 8262 | 12851 | 12345 | 12849 | 0 | 0 | 0 | 0 | 0 | 21320 | 21301 | 22597 | 13624 | 1 | 0 | 0 | 0 | 13360 | 12597 | 13896 | 8262 | 12851 | 12345 | 12849 | 0 | 0 | 0 | 0 | 0 | 21320 | 21301 | 22597 | 13624 | 0 | 0 | 0 | 0 | 13360 | 12341 | 13379 | 8257 | 14641 | 13106 | 12854 | 0 | 0 | 0 | 0 | 0 | 13123 | 13139 | 17473 | 13105 | 4 | 0 | 0 | 0 | 13360 | 12341 | 13379 | 8257 | 14641 | 13106 | 12854 | 0 | 0 | 0 | 0 | 0 | 13123 | 13139 | 17473 | 13105 | 3 | 0 | 0 | 1 | 13360 | 12341 | 13379 | 8257 | 14641 | 13106 | 12854 | 0 | 0 | 0 | 0 | 0 | 13123 | 13139 | 17473 | 13105 | 2 | 0 | 0 | 1 | 13360 | 12341 | 13379 | 8257 | 14641 | 13106 | 12854 | 0 | 0 | 0 | 0 | 0 | 13123 | 13139 | 17473 | 13105 | 1 | 0 | 0 | 1 | 13360 | 12341 | 13379 | 8257 | 14641 | 13106 | 12854 | 0 | 0 | 0 | 0 | 0 | 13123 | 13139 | 17473 | 13105 | 0 | 0 | 0 | 1 | 13360 | 14644 | 22099 | 8257 | 13105 | 13879 | 12853 | 0 | 0 | 0 | 0 | 0 | 16706 | 21558 | 17474 | 13873 | 4 | 0 | 0 | 1 | 2949 | 202205052949 |
I tried the following and got no warnings; I hope it helps in your case:
df = pd.DataFrame([['2022-05-04', '13360\t12597\t13896\t8262\t12851\t12345\t12849\t0\t0\t0\t0\t0\t21320\t21301\t22597\t13624\t3\t0\t0\t0\t13360\t12597\t13896\t8262\t12851\t12345\t12849\t0\t0\t0\t0\t0\t21320\t21301\t22597\t13624\t2\t0\t0\t0\t13360\t12597\t13896\t8262\t12851\t12345\t12849\t0\t0\t0\t0\t0\t21320\t21301\t22597\t13624\t1\t0\t0\t0\t13360\t12597\t13896\t8262\t12851\t12345\t12849\t0\t0\t0\t0\t0\t21320\t21301\t22597\t13624\t0\t0\t0\t0\t13360\t12341\t13379\t8257\t14641\t13106\t12854\t0\t0\t0\t0\t0\t13123\t13139\t17473\t13105\t4\t0\t0\t0\t13360\t12341\t13379\t8257\t14641\t13106\t12854\t0\t0\t0\t0\t0\t13123\t13139\t17473\t13105\t3\t0\t0\t1\t13360\t12341\t13379\t8257\t14641\t13106\t12854\t0\t0\t0\t0\t0\t13123\t13139\t17473\t13105\t2\t0\t0\t1\t13360\t12341\t13379\t8257\t14641\t13106\t12854\t0\t0\t0\t0\t0\t13123\t13139\t17473\t13105\t1\t0\t0\t1\t13360\t12341\t13379\t8257\t14641\t13106\t12854\t0\t0\t0\t0\t0\t13123\t13139\t17473\t13105\t0\t0\t0\t1\t13360\t14644\t22099\t8257\t13105\t13879\t12853\t0\t0\t0\t0\t0\t16706\t21558\t17474\t13873\t4\t0\t0\t1\t2949\r\n']], columns=['date_sq', 'value'])
date = df['date_sq']
# Build a fresh frame from the split instead of inserting columns into the old one
df = df['value'].str.split('\t', expand=True).fillna('0').apply(pd.to_numeric)
df['date'] = pd.to_datetime(date).dt.strftime("%Y%m%d")
# Note: the split produces integer column names, hence df[200] rather than df['200']
df['key'] = (df['date'].astype(str) + df[200].astype(str)).apply(pd.to_numeric)
df.drop(['date'], axis=1, inplace=True)
df
I think the df_spec[[f'{x}' for x in range(total_cols)]] assignment is unnecessary when using split(..., expand=True). Good luck!
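Alternatively, the PerformanceWarning's own suggestion works here too: build the expanded columns as a separate frame and join everything in a single pd.concat(axis=1) instead of inserting 201 columns one by one. A sketch under the same assumptions (tab-separated value column, string column names as in the original):
import pandas as pd

# Expand once into a separate frame, then join all columns at once
expanded = (
    df_spec['value']
    .str.split('\t', expand=True)
    .fillna('0')
    .apply(pd.to_numeric)
)
expanded.columns = [str(x) for x in range(expanded.shape[1])]

df_spec = pd.concat([df_spec.drop(columns=['value']), expanded], axis=1)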

Calculating quarterly growth

I have some daily data in a df, which goes back as far as 1st January 2020. It looks similar to the below but with many id1s on each day.
| yyyy_mm_dd | id1 | id2 | cost |
|------------|-----|------|-------|
| 2020-01-01 | 23 | 7253 | 5003 |
| 2020-01-01 | 23 | 7743 | 30340 |
| 2020-01-02 | 23 | 7253 | 450 |
| 2020-01-02 | 23 | 7743 | 4500 |
| ... | ... | ... | ... |
| 2021-01-01 | 23 | 7253 | 5675 |
| 2021-01-01 | 23 | 134 | 1030 |
| 2021-01-01 | 23 | 3445 | 564 |
| 2021-01-01 | 23 | 4534 | 345 |
| ... | ... | ... | ... |
I would like to calculate (1) the summed cost grouped by quarter and id1, and (2) the growth % compared to the same quarter in the previous year.
I have grouped and calculated the summed cost like so:
from pyspark.sql import functions as sf
from pyspark.sql.window import Window

grouped_quarterly = (
    df
    .withColumn('year_quarter', sf.year(sf.col('yyyy_mm_dd')) * 100 + sf.quarter(sf.col('yyyy_mm_dd')))
    .groupby('id1', 'year_quarter')
    .agg(
        sf.sum('cost').alias('cost')
    )
)
But I am unsure how to get the growth compared to the previous year. Expected output based on the above sample:
| year_quarter | id1 | cost | cost_growth |
|--------------|-----|------|-------------|
| 202101 | 23 | 7614 | -81 |
It would also be nice to set cost_growth to 0 if the id1 has no rows in the previous year's quarter.
Edit: Below is an attempt to make the comparison, but I get an error that there is no attribute prev_value:
grouped_quarterly = (
    df
    .withColumn('year_quarter', sf.year(sf.col('yyyy_mm_dd')) * 100 + sf.quarter(sf.col('yyyy_mm_dd')))
    .groupby('id1', 'year_quarter')
    .agg(
        sf.sum('cost').alias('cost')
    )
)
w = Window.partitionBy('id1').orderBy('year_quarter')
growth = (
    grouped_quarterly
    .withColumn('prev_value', sf.lag(grouped_quarterly.cost).over(w))
    .withColumn('diff', sf.when(sf.isnull(grouped_quarterly.cost - grouped_quarterly.prev_value), 0).otherwise(grouped_quarterly.cost - grouped_quarterly.prev_value))
)
Edit #2: The window function seems to take the previous quarter, regardless of year. This means my prev_value column is the previous quarter rather than the same quarter from the previous year:
grouped_quarterly.where(sf.col('id1') == 222).sort('year_quarter').show(10,False)
| id1 | year_quarter | cost |
|-----|--------------|------|
| 222 | 202001 | 73 |
| 222 | 202002 | 246 |
| 222 | 202003 | 525 |
| 222 | 202004 | -27 |
| 222 | 202101 | 380 |
w = Window.partitionBy('id1').orderBy('year_quarter')
growth = (
    grouped_quarterly
    .withColumn('prev_value', sf.lag(sf.col('cost')).over(w))
    .withColumn('diff', sf.when(sf.isnull(sf.col('cost') - sf.col('prev_value')), 0).otherwise(sf.col('cost') - sf.col('prev_value')))
)
growth.where(sf.col('id1') == 222).sort('year_quarter').show(10, False)
| id1 | year_quarter | cost | prev_value | diff |
|-----|--------------|------|------------|------|
| 222 | 202001 | 73 | null | 0 |
| 222 | 202002 | 246 | 73 | 173 |
| 222 | 202003 | 525 | 246 | 279 |
| 222 | 202004 | -27 | 525 | -522 |
| 222 | 202101 | 380 | -27 | 407 |
Edit #3: Using the quarter in the partitioning results in a null prev_value for all rows:
grouped_quarterly.where(sf.col('id1') == 222).sort('year_quarter').show(10,False)
| id1 | year_quarter | cost |
|-----|--------------|------|
| 222 | 202001 | 73 |
| 222 | 202002 | 246 |
| 222 | 202003 | 525 |
| 222 | 202004 | -27 |
| 222 | 202101 | 380 |
w = Window.partitionBy(sf.col('id1'), sf.expr('substring(string(year_quarter), 2)')).orderBy('year_quarter')
growth = (
    grouped_quarterly
    .withColumn('prev_value', sf.lag(sf.col('cost')).over(w))
    .withColumn('diff', sf.when(sf.isnull(sf.col('cost') - sf.col('prev_value')), 0).otherwise(sf.col('cost') - sf.col('prev_value')))
)
growth.where(sf.col('id1') == 222).sort('year_quarter').show(10, False)
| id1 | year_quarter | cost | prev_value | diff |
|-----|--------------|------|------------|-------|
| 222 | 202001 | 73 | null | 0 |
| 222 | 202002 | 246 | null | 0 |
| 222 | 202003 | 525 | null | 0 |
| 222 | 202004 | -27 | null | 0 |
| 222 | 202101 | 380 | null | 0 |
Try using the quarter in the partitioning as well, so that lag gives you the value from the same quarter of the previous year. Note the -2, which takes the last two digits (the quarter); your Edit #3 used 2, which takes everything from the second character onward and therefore makes nearly every partition unique:
w = Window.partitionBy(sf.col('id1'), sf.expr('substring(string(year_quarter), -2)')).orderBy('year_quarter')
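Putting it together with your diff logic, a sketch (using coalesce so cost_growth falls back to 0 when there is no matching quarter in the previous year, and a rounded percentage to match your expected output):
from pyspark.sql import functions as sf
from pyspark.sql.window import Window

# Partition by id1 plus the quarter digits, so lag() steps back exactly one year
w = Window.partitionBy('id1', sf.expr('substring(string(year_quarter), -2)')).orderBy('year_quarter')

growth = (
    grouped_quarterly
    .withColumn('prev_value', sf.lag('cost').over(w))
    # Percentage growth vs the same quarter last year; 0 when there is no prior-year row
    .withColumn('cost_growth', sf.coalesce(sf.round(100 * (sf.col('cost') - sf.col('prev_value')) / sf.col('prev_value')), sf.lit(0)))
    .drop('prev_value')
)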

Binning Pandas value_counts

I have a Pandas Series produced by df.column.value_counts().sort_index().
| N Months | Count |
|------|------|
| 0 | 15 |
| 1 | 9 |
| 2 | 78 |
| 3 | 151 |
| 4 | 412 |
| 5 | 181 |
| 6 | 543 |
| 7 | 175 |
| 8 | 409 |
| 9 | 594 |
| 10 | 137 |
| 11 | 202 |
| 12 | 170 |
| 13 | 446 |
| 14 | 29 |
| 15 | 39 |
| 16 | 44 |
| 17 | 253 |
| 18 | 17 |
| 19 | 34 |
| 20 | 18 |
| 21 | 37 |
| 22 | 147 |
| 23 | 12 |
| 24 | 31 |
| 25 | 15 |
| 26 | 117 |
| 27 | 8 |
| 28 | 38 |
| 29 | 23 |
| 30 | 198 |
| 31 | 29 |
| 32 | 122 |
| 33 | 50 |
| 34 | 60 |
| 35 | 357 |
| 36 | 329 |
| 37 | 457 |
| 38 | 609 |
| 39 | 4744 |
| 40 | 1120 |
| 41 | 591 |
| 42 | 328 |
| 43 | 148 |
| 44 | 46 |
| 45 | 10 |
| 46 | 1 |
| 47 | 1 |
| 48 | 7 |
| 50 | 2 |
My desired output is:
| bin | Total |
|-------|--------|
| 0-13 | 3522 |
| 14-26 | 793 |
| 27-50 | 9278 |
I tried df.column.value_counts(bins=3).sort_index() but got
| bin | Total |
|---------------------------------|-------|
| (-0.051000000000000004, 16.667] | 3634 |
| (16.667, 33.333] | 1149 |
| (33.333, 50.0] | 8810 |
I can get the correct result with:
a = df.column.value_counts().sort_index()[:14].sum()
b = df.column.value_counts().sort_index()[14:27].sum()
c = df.column.value_counts().sort_index()[27:].sum()
print(a, b, c)
Output: 3522 793 9278
But I am wondering if there is a pandas method that can do what I want. Any advice is very welcome. :-)
You can use pd.cut:
pd.cut(df['N Months'], [0, 13, 26, 50], include_lowest=True).value_counts()
Update: you should be able to pass custom bins to value_counts:
df['N Months'].value_counts(bins=[0, 13, 26, 50])
Output:
N Months
(-0.001, 13.0] 3522
(13.0, 26.0] 793
(26.0, 50.0] 9278
Name: Count, dtype: int64
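If the '0-13' style labels from the desired output matter, pd.cut also accepts a labels argument — a sketch, assuming the raw column is named 'N Months' as above:
import pandas as pd

# Explicit bin labels instead of interval notation
totals = (
    pd.cut(df['N Months'], bins=[0, 13, 26, 50], labels=['0-13', '14-26', '27-50'], include_lowest=True)
    .value_counts()
    .sort_index()
)
print(totals)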

Histogram to show sampling rate

I'm a bit confused about plotting some data; here is what I want to do. I have a dataframe with this sample data:
>>df.head(20)
user_id | trip_id | lat | lon | sampling_rate
---------+--------------------+------------------+------------------+---------------
126 | 125020080511025052 | 39.9531666666667 | 116.452566666667 | 7
126 | 125020080511025052 | 39.95305 | 116.452683333333 | 16
126 | 125020080511025052 | 39.9530666666667 | 116.452916666667 | 44
126 | 125020080511025052 | 39.9530833333333 | 116.453183333333 | 40
126 | 125020080511025052 | 39.95335 | 116.45365 | 21
126 | 125020080511025052 | 39.9532833333333 | 116.453816666667 | 16
126 | 125020080511025052 | 39.9533166666667 | 116.45405 | 13
126 | 125020080511025052 | 39.9535666666667 | 116.454383333333 | 24
126 | 125020080511025052 | 39.9537166666667 | 116.4546 | 16
126 | 125020080511025052 | 39.9538333333333 | 116.454733333333 | 17
126 | 125020080511025052 | 39.9540166666667 | 116.454966666667 | 37
126 | 125020080511025052 | 39.9541833333333 | 116.455133333333 | 18
126 | 125020080511025052 | 39.95405 | 116.455216666667 | 23
126 | 125020080511025052 | 39.9539 | 116.455266666667 | 19
126 | 125020080511025052 | 39.9537333333333 | 116.455333333333 | 42
126 | 125020080511025052 | 39.95365 | 116.455416666667 | 23
126 | 125020080512015529 | 40.00705 | 116.32225 |
126 | 125020080512015529 | 40.0073 | 116.3225 | 19
126 | 125020080512015529 | 40.0068 | 116.322083333333 | 66
126 | 125020080512015529 | 40.0064333333333 | 116.321666666667 | 2
This table contains users' GPS trip traces; the sampling_rate column shows the GPS sampling interval for each trip.
I want to plot the sampling rate such that I can see trips with a 1-second interval, trips with a 2-5 second interval, trips with a 5-10 second interval, and so on.
I want the number of trips on the y-axis and the interval on the x-axis.
IIUC, use np.arange to create bins at 5-second intervals:
import numpy as np
import matplotlib.pyplot as plt

plt.figure(figsize=(6, 4))
plt.hist(df['sampling_rate'], bins=np.arange(1, df['sampling_rate'].max() + 5, 5))
plt.xlabel('Sampling Rate')
plt.ylabel('Frequency')
plt.title('Distribution of sampling rate')
plt.show()
Result: (histogram of the sampling-rate distribution)
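If the specific ranges from the question (1 second, 2-5 seconds, 5-10 seconds, ...) are wanted instead of uniform 5-second bins, explicit edges can be passed — a sketch with hypothetical upper buckets:
import matplotlib.pyplot as plt

# Hypothetical edges: [1,2) = 1s, [2,5) = 2-5s, [5,10) = 5-10s, then 10s and up
edges = [1, 2, 5, 10, df['sampling_rate'].max() + 1]
plt.hist(df['sampling_rate'].dropna(), bins=edges)
plt.xlabel('Sampling interval (s)')
plt.ylabel('Number of trips')
plt.show()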

Parsing out indices and values from a pandas multi-index dataframe

I have a dataframe in a similar format to this:
+--------+--------+----------+------+------+------+------+
| | | | | day1 | day2 | day3 |
+--------+--------+----------+------+------+------+------+
| id_one | id_two | id_three | date | | | |
| 18273 | 50 | 1 | 3 | 9 | 11 | 3 |
| | | | 4 | 26 | 27 | 68 |
| | | | 5 | 92 | 25 | 4 |
| | | | 6 | 60 | 72 | 83 |
| | 60 | 2 | 5 | 69 | 93 | 84 |
| | | | 6 | 69 | 30 | 12 |
| | | | 7 | 65 | 65 | 59 |
| | | | 8 | 57 | 88 | 59 |
| | 70 | 3 | 5 | 22 | 95 | 7 |
| | | | 6 | 40 | 24 | 20 |
| | | | 7 | 73 | 81 | 57 |
| | | | 8 | 43 | 8 | 66 |
+--------+--------+----------+------+------+------+------+
I am trying to create a tuple that contains id_one, id_two, and the values that each grouping contains.
To test this, I am simply trying to print the ids and values like this:
for id_two, data in df.head(100).groupby(level='id_two'):
    print(id_two, data.values.ravel())
Which gives me the id_two and the data exactly as it should.
I am running into problems when I try to incorporate id_one. I tried this, but was met with the error ValueError: need more than 2 values to unpack:
for id_one, id_two, data in df.head(100).groupby(level='id_two'):
    print(id_one, id_two, data.values.ravel())
How can I print id_one, id_two and the data?
You can pass a list of level names into the level parameter:
df.head(100).groupby(level=['id_one', 'id_two'])
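When grouping by multiple levels, the group key becomes a tuple, so it can be unpacked directly in the loop — a minimal sketch:
# The key is a (id_one, id_two) tuple; unpack it in the for statement
for (id_one, id_two), data in df.head(100).groupby(level=['id_one', 'id_two']):
    print(id_one, id_two, data.values.ravel())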
