Say I have a yearly cumulative dataframe as follows:
date v1 v2
0 2019-10 109.23 126.17
1 2019-09 108.90 121.07
2 2019-08 95.96 85.40
3 2019-07 91.30 82.92
4 2019-06 80.19 26.04
5 2019-05 65.98 18.58
6 2019-04 38.80 9.87
7 2019-03 3.01 2.51
8 2019-02 3.01 2.49
9 2018-12 221.31 249.87
10 2018-11 215.59 137.92
11 2018-10 195.16 110.69
12 2018-09 160.45 101.15
13 2018-08 124.70 75.57
14 2018-07 122.98 52.48
15 2018-06 73.46 34.82
16 2018-05 42.22 34.61
17 2018-04 9.94 28.52
18 2018-03 4.07 28.52
19 2018-02 2.04 21.84
I wonder whether it is possible to generate cum_v1 and cum_v2 for each year of data.
The calculation logic is: cum_v1 for 2019-10 is the value in 2019-10 minus the value in 2019-09, and so on for each earlier month. For 2019-02 (the first month with data) cum_v1 stays equal to v1, and all values for 2019-01 are set to 0. The same logic applies to 2018.
The desired output looks like this:
date v1 cum_v1 v2 cum_v2
0 2019-10 109.23 0.33 126.17 5.10
1 2019-09 108.90 12.94 121.07 35.67
2 2019-08 95.96 4.66 85.40 2.48
3 2019-07 91.30 11.11 82.92 56.88
4 2019-06 80.19 14.21 26.04 7.46
5 2019-05 65.98 27.18 18.58 8.71
6 2019-04 38.80 35.79 9.87 7.36
7 2019-03 3.01 0.00 2.51 0.02
8 2019-02 3.01 3.01 2.49 2.49
9 2019-01 0 0 0 0
10 2018-12 221.31 5.72 249.87 111.95
11 2018-11 215.59 20.43 137.92 27.23
12 2018-10 195.16 34.71 110.69 9.54
13 2018-09 160.45 35.75 101.15 25.58
14 2018-08 124.70 1.72 75.57 23.09
15 2018-07 122.98 49.52 52.48 17.66
16 2018-06 73.46 31.24 34.82 0.21
17 2018-05 42.22 32.28 34.61 6.09
18 2018-04 9.94 5.87 28.52 0.00
19 2018-03 4.07 2.03 28.52 6.68
20 2018-02 2.04 2.04 21.84 21.84
21 2018-01 0 0 0 0
Use pandas GroupBy with diff, selecting only the numeric columns so that the string date column is not differenced:
df[['cum_v1', 'cum_v2']] = df.groupby(df['date'].str[:4])[['v1', 'v2']].diff(-1).fillna(df[['v1', 'v2']])
print(df)
Output:
date v1 v2 cum_v1 cum_v2
0 2019-10 109.23 126.17 0.33 5.10
1 2019-09 108.90 121.07 12.94 35.67
2 2019-08 95.96 85.40 4.66 2.48
3 2019-07 91.30 82.92 11.11 56.88
4 2019-06 80.19 26.04 14.21 7.46
5 2019-05 65.98 18.58 27.18 8.71
6 2019-04 38.80 9.87 35.79 7.36
7 2019-03 3.01 2.51 0.00 0.02
8 2019-02 3.01 2.49 3.01 2.49
9 2018-12 221.31 249.87 5.72 111.95
10 2018-11 215.59 137.92 20.43 27.23
11 2018-10 195.16 110.69 34.71 9.54
12 2018-09 160.45 101.15 35.75 25.58
13 2018-08 124.70 75.57 1.72 23.09
14 2018-07 122.98 52.48 49.52 17.66
15 2018-06 73.46 34.82 31.24 0.21
16 2018-05 42.22 34.61 32.28 6.09
17 2018-04 9.94 28.52 5.87 0.00
18 2018-03 4.07 28.52 2.03 6.68
19 2018-02 2.04 21.84 2.04 21.84
Group by the year taken from Series.dt.year and call DataFrameGroupBy.diff on a list of columns. Replace the missing value at the end of each year with the original value using DataFrame.fillna, add a prefix with DataFrame.add_prefix, and finally join the result to the original with DataFrame.join:
df['date'] = pd.to_datetime(df['date']).dt.to_period('m')
cols = ['v1','v2']
df = df.join(df.groupby(df['date'].dt.year)[cols].diff(-1).fillna(df[cols]).add_prefix('cum'))
print(df)
date v1 v2 cumv1 cumv2
0 2019-10 109.23 126.17 0.33 5.10
1 2019-09 108.90 121.07 12.94 35.67
2 2019-08 95.96 85.40 4.66 2.48
3 2019-07 91.30 82.92 11.11 56.88
4 2019-06 80.19 26.04 14.21 7.46
5 2019-05 65.98 18.58 27.18 8.71
6 2019-04 38.80 9.87 35.79 7.36
7 2019-03 3.01 2.51 0.00 0.02
8 2019-02 3.01 2.49 3.01 2.49
9 2018-12 221.31 249.87 5.72 111.95
10 2018-11 215.59 137.92 20.43 27.23
11 2018-10 195.16 110.69 34.71 9.54
12 2018-09 160.45 101.15 35.75 25.58
13 2018-08 124.70 75.57 1.72 23.09
14 2018-07 122.98 52.48 49.52 17.66
15 2018-06 73.46 34.82 31.24 0.21
16 2018-05 42.22 34.61 32.28 6.09
17 2018-04 9.94 28.52 5.87 0.00
18 2018-03 4.07 28.52 2.03 6.68
19 2018-02 2.04 21.84 2.04 21.84
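EDIT: To add the missing months (for example 2019-01), resample by month start first. Note that resampling sorts the dates in ascending order, so diff(-1) now subtracts the following month from each row: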
df['date'] = pd.to_datetime(df['date'])
df = df.set_index('date').resample('MS').sum()
cols = ['v1','v2']
df = (df.join(df.groupby(df.index.year)[cols].diff(-1).fillna(df[cols])
.add_prefix('cum')).to_period('m'))
print(df)
v1 v2 cumv1 cumv2
date
2018-02 2.04 21.84 -2.03 -6.68
2018-03 4.07 28.52 -5.87 0.00
2018-04 9.94 28.52 -32.28 -6.09
2018-05 42.22 34.61 -31.24 -0.21
2018-06 73.46 34.82 -49.52 -17.66
2018-07 122.98 52.48 -1.72 -23.09
2018-08 124.70 75.57 -35.75 -25.58
2018-09 160.45 101.15 -34.71 -9.54
2018-10 195.16 110.69 -20.43 -27.23
2018-11 215.59 137.92 -5.72 -111.95
2018-12 221.31 249.87 221.31 249.87
2019-01 0.00 0.00 -3.01 -2.49
2019-02 3.01 2.49 0.00 -0.02
2019-03 3.01 2.51 -35.79 -7.36
2019-04 38.80 9.87 -27.18 -8.71
2019-05 65.98 18.58 -14.21 -7.46
2019-06 80.19 26.04 -11.11 -56.88
2019-07 91.30 82.92 -4.66 -2.48
2019-08 95.96 85.40 -12.94 -35.67
2019-09 108.90 121.07 -0.33 -5.10
2019-10 109.23 126.17 109.23 126.17
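Because the resampled index is ascending, the monthly increments from the question can probably be recovered by differencing against the previous month instead, i.e. diff() rather than diff(-1). The following is only a sketch on a four-row excerpt of the data; the names monthly, inc and out and the cum_ prefix are my own choices, not part of the answer above.

```python
import pandas as pd

# Small excerpt of the question's data; 2019-01 is deliberately missing.
df = pd.DataFrame({
    "date": ["2019-03", "2019-02", "2018-12", "2018-11"],
    "v1": [3.01, 3.01, 221.31, 215.59],
    "v2": [2.51, 2.49, 249.87, 137.92],
})
df["date"] = pd.to_datetime(df["date"])

# Resample to month start: missing months appear with a sum of 0,
# and the index ends up in ascending order.
monthly = df.set_index("date").resample("MS").sum()

cols = ["v1", "v2"]
# With ascending dates, diff() subtracts the previous month within each year.
# The first month of each year has no predecessor, so keep its original value.
inc = (monthly.groupby(monthly.index.year)[cols]
              .diff()
              .fillna(monthly[cols])
              .add_prefix("cum_"))

# Join back, show months as periods, newest first as in the question.
out = monthly.join(inc).to_period("M").sort_index(ascending=False)
print(out)
```

In this sketch 2019-01 shows up as a row of zeros and 2019-02 keeps its original value, which is what the question asked for. Months before the first observation (2018-01 in the full data) are not created by resample and would have to be added separately.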
Related
df_1 is as follows -
date id score
2019-05 5 78.9
2019-06 5 77.5
2019-07 5 80.2
2019-08 5 82.0
2019-05 2 79.9
2019-06 2 69.3
2019-07 2 75.2
2019-08 2 80.0
2019-05 70 68.8
2019-06 70 67.5
2019-07 70 70.2
2019-08 70 86.0
df_2 is as follows -
date id score
2019-01 2 79.1
2019-02 2 79.2
2019-03 2 75.2
2019-04 2 80.0
2019-01 5 78.9
2019-02 5 78.5
2019-03 5 80.8
2019-04 5 82.8
2019-01 70 68.4
2019-02 70 72.2
2019-03 70 70.5
2019-04 70 81.0
How can I merge them into one dataframe according to date and id, resulting in -
date id score
2019-01 2 79.1
2019-02 2 79.2
2019-03 2 75.2
2019-04 2 80.0
2019-05 2 79.9
2019-06 2 69.3
2019-07 2 75.2
2019-08 2 80.0
2019-01 5 78.9
2019-02 5 78.5
2019-03 5 80.8
2019-04 5 82.8
2019-05 5 78.9
2019-06 5 77.5
2019-07 5 80.2
2019-08 5 82.0
2019-01 70 68.4
2019-02 70 72.2
2019-03 70 70.5
2019-04 70 81.0
2019-05 70 68.8
2019-06 70 67.5
2019-07 70 70.2
2019-08 70 86.0
Use pd.concat, then sort by id and date to match the desired order:
pd.concat([df_1, df_2]).sort_values(["id", "date"]).reset_index(drop=True)
Concatenate and sort the values:
pd.concat([df_1, df_2]).sort_values(['id', 'date'])
date id score
0 2019-01 2 79.1
1 2019-02 2 79.2
2 2019-03 2 75.2
3 2019-04 2 80.0
4 2019-05 2 79.9
5 2019-06 2 69.3
6 2019-07 2 75.2
7 2019-08 2 80.0
4 2019-01 5 78.9
5 2019-02 5 78.5
6 2019-03 5 80.8
7 2019-04 5 82.8
0 2019-05 5 78.9
1 2019-06 5 77.5
2 2019-07 5 80.2
3 2019-08 5 82.0
8 2019-01 70 68.4
9 2019-02 70 72.2
10 2019-03 70 70.5
11 2019-04 70 81.0
8 2019-05 70 68.8
9 2019-06 70 67.5
10 2019-07 70 70.2
11 2019-08 70 86.0
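The output above keeps the original row labels from both frames, so index values repeat. If a clean 0..n-1 index is wanted, the index can be reset after sorting. A minimal sketch on a shortened version of the data (the variable name merged is just for illustration):

```python
import pandas as pd

# Shortened versions of df_1 and df_2 from the question.
df_1 = pd.DataFrame({
    "date": ["2019-05", "2019-06", "2019-05", "2019-06"],
    "id": [5, 5, 2, 2],
    "score": [78.9, 77.5, 79.9, 69.3],
})
df_2 = pd.DataFrame({
    "date": ["2019-01", "2019-02", "2019-01", "2019-02"],
    "id": [2, 2, 5, 5],
    "score": [79.1, 79.2, 78.9, 78.5],
})

# Stack the rows, order by id and then date, and rebuild the index.
merged = (pd.concat([df_1, df_2])
            .sort_values(["id", "date"])
            .reset_index(drop=True))
print(merged)
```

Sorting on the date strings works here only because they are in zero-padded YYYY-MM form; converting them with pd.to_datetime first would be the safer choice for other formats.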
I have this dataframe called "df_pressure":
Ranking Squad Press Succ Succ% Fail Fail%
11 1 Manchester City 4254 1381 32.5 2873 67.5
10 2 Liverpool 5360 1731 32.3 3629 67.7
5 3 Chelsea 5533 1702 30.8 3831 69.2
16 4 Tottenham 5477 1523 27.8 3954 72.2
0 5 Arsenal 4772 1440 30.2 3332 69.8
12 6 Manchester Utd 5069 1462 28.8 3607 71.2
18 7 West Ham 4917 1372 27.9 3545 72.1
9 8 Leicester City 5982 1719 28.7 4263 71.3
3 9 Brighton 5670 1832 32.3 3838 67.7
19 10 Wolves 5529 1633 29.5 3896 70.5
13 11 Newcastle Utd 5430 1460 26.9 3970 73.1
6 12 Crystal Palace 6041 1809 29.9 4232 70.1
2 13 Brentford 5566 1609 28.9 3957 71.1
1 14 Aston Villa 5515 1524 27.6 3991 72.4
15 15 Southampton 5869 1806 30.8 4063 69.2
7 16 Everton 6346 1892 29.8 4454 70.2
8 17 Leeds United 7078 2118 29.9 4960 70.1
4 18 Burnley 5527 1499 27.1 4028 72.9
17 19 Watford 5730 1656 28.9 4074 71.1
14 20 Norwich City 6146 1570 25.5 4576 74.5
I then decided to create another dataframe for some columns only:
df_pressure_perc=df_pressure[['Squad','Succ%','Fail%']]
df_pressure_perc.reset_index(drop=True, inplace=True)
df_pressure_perc.set_index('Squad')
print(df_pressure_perc)
Output:
Squad Succ% Fail%
0 Manchester City 32.5 67.5
1 Liverpool 32.3 67.7
2 Chelsea 30.8 69.2
3 Tottenham 27.8 72.2
4 Arsenal 30.2 69.8
5 Manchester Utd 28.8 71.2
6 West Ham 27.9 72.1
7 Leicester City 28.7 71.3
8 Brighton 32.3 67.7
9 Wolves 29.5 70.5
10 Newcastle Utd 26.9 73.1
11 Crystal Palace 29.9 70.1
12 Brentford 28.9 71.1
13 Aston Villa 27.6 72.4
14 Southampton 30.8 69.2
15 Everton 29.8 70.2
16 Leeds United 29.9 70.1
17 Burnley 27.1 72.9
18 Watford 28.9 71.1
19 Norwich City 25.5 74.5
Based on this new dataframe "df_pressure_perc", I decided to create a stacked bar plot with the following code: df_pressure_perc.plot(kind='barh', stacked=True, ylabel='Squad', colormap='tab10', figsize=(10, 6))
I then realised that the Y axis of my plot was not labelled with the Squad names. How can I show the Squad names on the Y axis instead of 0-19?
Visualization(stacked barplot)
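A likely cause is that set_index returns a new dataframe rather than modifying the existing one, so the result of df_pressure_perc.set_index('Squad') in the code above is discarded and the plot still uses the 0-19 index. A minimal sketch, assuming a three-row excerpt of the data; the helper name plot_pressure is hypothetical and only wraps the plotting call:

```python
import pandas as pd

# Three-row excerpt of df_pressure, limited to the columns being plotted.
df_pressure = pd.DataFrame({
    "Squad": ["Manchester City", "Liverpool", "Chelsea"],
    "Succ%": [32.5, 32.3, 30.8],
    "Fail%": [67.5, 67.7, 69.2],
})

# set_index returns a new frame, so its result has to be assigned.
df_pressure_perc = df_pressure[["Squad", "Succ%", "Fail%"]].set_index("Squad")


def plot_pressure(frame):
    """Draw stacked horizontal bars; the bar labels come from the index.

    Requires matplotlib to be installed.
    """
    ax = frame.plot(kind="barh", stacked=True, colormap="tab10", figsize=(10, 6))
    ax.set_ylabel("Squad")
    return ax
```

Calling plot_pressure(df_pressure_perc) should then label each bar with its squad name, since pandas uses the index for the category axis of a bar plot.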
I have a dataframe that contains cell phone minutes usage, logged by date of call and duration.
It looks like this (30 row sample):
id user_id call_date duration
0 1000_93 1000 2018-12-27 8.52
1 1000_145 1000 2018-12-27 13.66
2 1000_247 1000 2018-12-27 14.48
3 1000_309 1000 2018-12-28 5.76
4 1000_380 1000 2018-12-30 4.22
5 1000_388 1000 2018-12-31 2.20
6 1000_510 1000 2018-12-27 5.75
7 1000_521 1000 2018-12-28 14.18
8 1000_530 1000 2018-12-28 5.77
9 1000_544 1000 2018-12-26 4.40
10 1000_693 1000 2018-12-31 4.31
11 1000_705 1000 2018-12-31 12.78
12 1000_735 1000 2018-12-29 1.70
13 1000_778 1000 2018-12-28 3.29
14 1000_826 1000 2018-12-26 9.96
15 1000_842 1000 2018-12-27 5.85
16 1001_0 1001 2018-09-06 10.06
17 1001_1 1001 2018-10-12 1.00
18 1001_2 1001 2018-10-17 15.83
19 1001_4 1001 2018-12-05 0.00
20 1001_5 1001 2018-12-13 6.27
21 1001_6 1001 2018-12-04 7.19
22 1001_8 1001 2018-11-17 2.45
23 1001_9 1001 2018-11-19 2.40
24 1001_11 1001 2018-11-09 1.00
25 1001_13 1001 2018-12-24 0.00
26 1001_19 1001 2018-11-15 30.00
27 1001_20 1001 2018-09-21 5.75
28 1001_23 1001 2018-10-27 0.98
29 1001_26 1001 2018-10-28 5.90
30 1001_29 1001 2018-09-30 14.78
I want to group by user_id AND call_date with the ultimate goal of calculating the number of minutes used per month over the course of the year, per user.
I thought I could accomplish this by using:
calls.groupby(['user_id','call_date'])['duration'].sum()
but the results aren't what I expected:
user_id call_date
1000 2018-12-26 14.36
2018-12-27 48.26
2018-12-28 29.00
2018-12-29 1.70
2018-12-30 4.22
2018-12-31 19.29
1001 2018-08-14 13.86
2018-08-16 23.46
2018-08-17 8.11
2018-08-18 1.74
2018-08-19 10.73
2018-08-20 7.32
2018-08-21 0.00
2018-08-23 8.50
2018-08-24 8.63
2018-08-25 35.39
2018-08-27 10.57
2018-08-28 19.91
2018-08-29 0.54
2018-08-31 22.38
2018-09-01 7.53
2018-09-02 10.27
2018-09-03 30.66
2018-09-04 0.00
2018-09-05 9.09
2018-09-06 10.06
I'd hoped that it would be grouped like this: for user_id 1000, all calls for January with duration summed, all calls for February with duration summed, and so on.
I am really new to Python and programming in general, and I am not sure what my next step should be to get these grouped by user_id and month of the year.
Thanks in advance for any insight you can offer.
Regards,
Jared
Something is not quite right in your setup. First of all, both of your tables are the same, so I am not sure if this is a cut-and-paste error or something else. Here is what I do with your data. Load it like so; note that we explicitly convert call_date to datetime:
from io import StringIO
import pandas as pd
df = pd.read_csv(StringIO(
"""
id user_id call_date duration
0 1000_93 1000 2018-12-27 8.52
1 1000_145 1000 2018-12-27 13.66
2 1000_247 1000 2018-12-27 14.48
3 1000_309 1000 2018-12-28 5.76
4 1000_380 1000 2018-12-30 4.22
5 1000_388 1000 2018-12-31 2.20
6 1000_510 1000 2018-12-27 5.75
7 1000_521 1000 2018-12-28 14.18
8 1000_530 1000 2018-12-28 5.77
9 1000_544 1000 2018-12-26 4.40
10 1000_693 1000 2018-12-31 4.31
11 1000_705 1000 2018-12-31 12.78
12 1000_735 1000 2018-12-29 1.70
13 1000_778 1000 2018-12-28 3.29
14 1000_826 1000 2018-12-26 9.96
15 1000_842 1000 2018-12-27 5.85
16 1001_0 1001 2018-09-06 10.06
17 1001_1 1001 2018-10-12 1.00
18 1001_2 1001 2018-10-17 15.83
19 1001_4 1001 2018-12-05 0.00
20 1001_5 1001 2018-12-13 6.27
21 1001_6 1001 2018-12-04 7.19
22 1001_8 1001 2018-11-17 2.45
23 1001_9 1001 2018-11-19 2.40
24 1001_11 1001 2018-11-09 1.00
25 1001_13 1001 2018-12-24 0.00
26 1001_19 1001 2018-11-15 30.00
27 1001_20 1001 2018-09-21 5.75
28 1001_23 1001 2018-10-27 0.98
29 1001_26 1001 2018-10-28 5.90
30 1001_29 1001 2018-09-30 14.78
"""), sep=r'\s+', index_col=0)
df['call_date'] = pd.to_datetime(df['call_date'])
Then using
df.groupby(['user_id','call_date'])['duration'].sum()
does the expected grouping by user and by each date:
user_id call_date
1000 2018-12-26 14.36
2018-12-27 48.26
2018-12-28 29.00
2018-12-29 1.70
2018-12-30 4.22
2018-12-31 19.29
1001 2018-09-06 10.06
2018-09-21 5.75
2018-09-30 14.78
2018-10-12 1.00
2018-10-17 15.83
2018-10-27 0.98
2018-10-28 5.90
2018-11-09 1.00
2018-11-15 30.00
2018-11-17 2.45
2018-11-19 2.40
2018-12-04 7.19
2018-12-05 0.00
2018-12-13 6.27
2018-12-24 0.00
If you want to group by month as you seem to suggest you can use the Grouper functionality:
df.groupby(['user_id',pd.Grouper(key='call_date', freq='1M')])['duration'].sum()
which produces
user_id call_date
1000 2018-12-31 116.83
1001 2018-09-30 30.59
2018-10-31 23.71
2018-11-30 35.85
2018-12-31 13.46
Let me know if you are getting different results from following these steps
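As an alternative to pd.Grouper, the month can also be derived from the date column with Series.dt.to_period, which labels each group by the month itself rather than by the month-end date. This is only a sketch on a five-row excerpt of the data; the variable name monthly is my own:

```python
import pandas as pd

# Five-row excerpt of the calls data from the question.
calls = pd.DataFrame({
    "user_id": [1000, 1000, 1001, 1001, 1001],
    "call_date": ["2018-12-27", "2018-12-28", "2018-09-06", "2018-09-21", "2018-10-12"],
    "duration": [8.52, 5.76, 10.06, 5.75, 1.00],
})
calls["call_date"] = pd.to_datetime(calls["call_date"])

# Group by user and by calendar month (a monthly Period such as 2018-12).
monthly = (calls
           .groupby(["user_id", calls["call_date"].dt.to_period("M")])["duration"]
           .sum())
print(monthly)
```

The result is a Series indexed by (user_id, month). Calling monthly.unstack() would turn the months into columns if a one-row-per-user layout is preferred.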
CODE
import pandas
df = pandas.read_csv('biharpopulation.txt', delim_whitespace=True)
df.columns = ['SlNo','District','Total','Male','Female','Total','Male','Female','SC','ST','SC','ST']
DATA
SlNo District Total Male Female Total Male Female SC ST SC ST
1 Patna 729988 386991 342997 9236 5352 3884 15.5 0.2 38.6 68.7
2 Nalanda 473786 248246 225540 970 524 446 20.2 0.0 29.4 29.8
3 Bhojpur 343598 181372 162226 8337 4457 3880 15.3 0.4 39.1 46.7
4 Buxar 198014 104761 93253 8428 4573 3855 14.1 0.6 37.9 44.6
5 Rohtas 444333 233512 210821 25663 13479 12184 18.1 1.0 41.3 30.0
6 Kaimur 286291 151031 135260 35662 18639 17023 22.2 2.8 40.5 38.6
7 Gaya 1029675 529230 500445 2945 1526 1419 29.6 0.1 26.3 49.1
8 Jehanabad 174738 90485 84253 1019 530 489 18.9 0.07 32.6 32.4
9 Arawal 11479 57677 53802 294 179 115 18.8 0.04
10 Nawada 435975 223929 212046 2158 1123 1035 24.1 0.1 22.4 20.5
11 Aurangabad 472766 244761 228005 1640 865 775 23.5 0.1 35.7 49.7
Saran
12 Saran 389933 199772 190161 6667 3384 3283 12 0.2 33.6 48.5
13 Siwan 309013 153558 155455 13822 6856 6966 11.4 0.5 35.6 44.0
14 Gopalganj 267250 134796 132454 6157 2984 3173 12.4 0.3 32.1 37.8
15 Muzaffarpur 594577 308894 285683 3472 1789 1683 15.9 0.1 28.9 50.4
16 E. Champaran 514119 270968 243151 4812 2518 2294 13.0 0.1 20.6 34.3
17 W. Champaran 434714 228057 206657 44912 23135 21777 14.3 1.5 22.3 24.1
18 Sitamarhi 315646 166607 149039 1786 952 834 11.8 0.1 22.1 31.4
19 Sheohar 74391 39405 34986 64 35 29 14.4 0.0 16.9 38.8
20 Vaishali 562123 292711 269412 3068 1595 1473 20.7 0.1 29.4 29.9
21 Darbhanga 511125 266236 244889 841 467 374 15.5 0.0 24.7 49.5
22 Madhubani 481922 248774 233148 1260 647 613 13.5 0.0 22.2 35.8
23 Samastipur 628838 325101 303737 3362 2724 638 18.5 0.1 25.1 22.0
24 Munger 150947 80031 70916 18060 9297 8763 13.3 1.6 42.6 37.3
25 Begusarai 341173 177897 163276 1505 823 682 14.5 0.1 31.4 78.6
26 Shekhapura 103732 54327 49405 211 115 96 19.7 0.0 25.2 45.6
27 Lakhisarai 126575 65781 60794 5636 2918 2718 15.8 0.7 26.8 12.9
28 Jamui 242710 124538 118172 67357 34689 32668 17.4 4.8 24.5 26.7
The issue is with these two lines, where the district name itself contains a space and is therefore split into two fields:
16 E. Champaran 514119 270968 243151 4812 2518 2294 13.0 0.1 20.6 34.3
17 W. Champaran 434714 228057 206657 44912 23135 21777 14.3 1.5 22.3 24.1
If you can somehow remove the space inside "E. Champaran" and "W. Champaran" in the file, then you can do this:
df = pd.read_csv('test.csv', sep=r'\s+', skip_blank_lines=True, skipinitialspace=True)
print(df)
SlNo District Total Male Female Total.1 Male.1 Female.1 SC ST SC.1 ST.1
0 1 Patna 729988 386991 342997 9236 5352 3884 15.5 0.20 38.6 68.7
1 2 Nalanda 473786 248246 225540 970 524 446 20.2 0.00 29.4 29.8
2 3 Bhojpur 343598 181372 162226 8337 4457 3880 15.3 0.40 39.1 46.7
3 4 Buxar 198014 104761 93253 8428 4573 3855 14.1 0.60 37.9 44.6
4 5 Rohtas 444333 233512 210821 25663 13479 12184 18.1 1.00 41.3 30.0
5 6 Kaimur 286291 151031 135260 35662 18639 17023 22.2 2.80 40.5 38.6
6 7 Gaya 1029675 529230 500445 2945 1526 1419 29.6 0.10 26.3 49.1
7 8 Jehanabad 174738 90485 84253 1019 530 489 18.9 0.07 32.6 32.4
8 9 Arawal 11479 57677 53802 294 179 115 18.8 0.04 NaN NaN
9 10 Nawada 435975 223929 212046 2158 1123 1035 24.1 0.10 22.4 20.5
10 11 Aurangabad 472766 244761 228005 1640 865 775 23.5 0.10 35.7 49.7
11 12 Saran 389933 199772 190161 6667 3384 3283 12.0 0.20 33.6 48.5
12 13 Siwan 309013 153558 155455 13822 6856 6966 11.4 0.50 35.6 44.0
13 14 Gopalganj 267250 134796 132454 6157 2984 3173 12.4 0.30 32.1 37.8
14 15 Muzaffarpur 594577 308894 285683 3472 1789 1683 15.9 0.10 28.9 50.4
15 16 E.Champaran 514119 270968 243151 4812 2518 2294 13.0 0.10 20.6 34.3
16 17 W.Champaran 434714 228057 206657 44912 23135 21777 14.3 1.50 22.3 24.1
17 18 Sitamarhi 315646 166607 149039 1786 952 834 11.8 0.10 22.1 31.4
18 19 Sheohar 74391 39405 34986 64 35 29 14.4 0.00 16.9 38.8
19 20 Vaishali 562123 292711 269412 3068 1595 1473 20.7 0.10 29.4 29.9
20 21 Darbhanga 511125 266236 244889 841 467 374 15.5 0.00 24.7 49.5
21 22 Madhubani 481922 248774 233148 1260 647 613 13.5 0.00 22.2 35.8
22 23 Samastipur 628838 325101 303737 3362 2724 638 18.5 0.10 25.1 22.0
23 24 Munger 150947 80031 70916 18060 9297 8763 13.3 1.60 42.6 37.3
24 25 Begusarai 341173 177897 163276 1505 823 682 14.5 0.10 31.4 78.6
25 26 Shekhapura 103732 54327 49405 211 115 96 19.7 0.00 25.2 45.6
26 27 Lakhisarai 126575 65781 60794 5636 2918 2718 15.8 0.70 26.8 12.9
27 28 Jamui 242710 124538 118172 67357 34689 32668 17.4 4.80 24.5 26.7
Your problem is that the file is being parsed as whitespace-delimited, but some of your district names also contain whitespace. Luckily, none of the district names contain '\t' characters, so if the columns are separated by tabs we can fix this:
df = pandas.read_csv('biharpopulation.txt', delimiter='\t')
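If the file turns out to be space-separated rather than tab-separated, and editing it by hand is not an option, the rows can also be split with a regular expression that treats everything between the serial number and the first numeric field as the district name. The following is a sketch under the assumptions that the district name never starts with a digit and that every data row has the same number of numeric fields; the inline sample and the names row_pattern and records are made up for illustration:

```python
import re
import pandas as pd

# Inline sample with fewer columns than the real file, for brevity.
raw = """SlNo District Total Male Female
1 Patna 729988 386991 342997
Saran
16 E. Champaran 514119 270968 243151
17 W. Champaran 434714 228057 206657
"""

# Serial number, then the district (lazy, may contain spaces),
# then the remaining numeric fields starting at the first digit.
row_pattern = re.compile(r"^(\d+)\s+(.+?)\s+(\d.*)$")

records = []
for line in raw.splitlines():
    match = row_pattern.match(line.strip())
    if match is None:
        continue  # skips the header and stray lines such as "Saran"
    slno, district, rest = match.groups()
    records.append([int(slno), district, *map(float, rest.split())])

df = pd.DataFrame(records, columns=["SlNo", "District", "Total", "Male", "Female"])
print(df)
```

Rows with missing trailing values, such as the Arawal row in the question's data, would need padding before being passed to the DataFrame constructor.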
I apologize, as there are some quite similar questions. I went through them but could not work it out, so it would be nice if someone could help me with this.
I want to match every character (including blanks) except:
8-digit substrings (e.g. 20110101)
substrings such as 0.68G or 10.76B (1 or 2 digits, dot, 2 digits, 1 letter)
in the text:
b'STN--- WBAN YEARMODA TEMP DEWP SLP STP VISIB WDSP MXSPD GUST MAX MIN
PRCP SNDP FRSHTT\n486200 99999 20110101 79.3 24 74.5 24 1007.2 8
1006.2 8 6.6 24 2.2 24 7.0 999.9 87.8 74.1 0.00G 999.9 010000\n486200 99999 20110102 79.7 24 74.9 24 1007.8 8 1006.9 8 6.1 24 2.8 24 8.0
15.0 91.9 74.8 0.00G 999.9 010010\n486200 99999 20110103 77.5 24 73.6 24 1008.5 8 1007.6 8 6.0 24 2.8 24 6.0 999.9 83.7 73.4* 0.68G 999.9
010000\n486200 99999 20110104 81.2 24 75.0 24 1007.7 8 1006.8 8 6.3 24
3.0 24 5.1 999.9 89.6* 73.0 0.14G 999.9 010010\n486200 99999 20110105 79.7 24 74.8 24 1007.8 8 1006.8 8 7.0 24 2.4 24 6.0 999.9 87.8 73.0 0.57G 999.9 010000\n486200 99999 20110106 77.4 24 74.6 24 1008.8 8 1007.9 8 6.0 24 1.5 24 4.1 999.9 81.0 73.2 0.16G 999.9 010000\n486200 99999 20110107 77.7 24 75.0 24 1008.9
I came up with the regex (\d{8}|\d{1,2}\.\d{1,2}[ABCDEFG]), which finds all occurrences of (1) and (2).
I now need to 'negate' this. I tried several possibilities such as (?! ... ), but that doesn't seem to work.
My expected output is: 20110101 0.00G 20110102 0.00G 20110103 0.68G 20110104 89.6* 20110105 0.57G 20110106 0.16G20110107
Do you have suggestions, please?
You don't actually need to negate the pattern. Use the same regex with the re.findall function and join the resulting list items with a space character.
>>> s = '''STN--- WBAN YEARMODA TEMP DEWP SLP STP VISIB WDSP MXSPD GUST MAX MIN
PRCP SNDP FRSHTT\n486200 99999 20110101 79.3 24 74.5 24 1007.2 8
1006.2 8 6.6 24 2.2 24 7.0 999.9 87.8 74.1 0.00G 999.9 010000\n486200 99999 20110102 79.7 24 74.9 24 1007.8 8 1006.9 8 6.1 24 2.8 24 8.0
15.0 91.9 74.8 0.00G 999.9 010010\n486200 99999 20110103 77.5 24 73.6 24 1008.5 8 1007.6 8 6.0 24 2.8 24 6.0 999.9 83.7 73.4* 0.68G 999.9
010000\n486200 99999 20110104 81.2 24 75.0 24 1007.7 8 1006.8 8 6.3 24
3.0 24 5.1 999.9 89.6* 73.0 0.14G 999.9 010010\n486200 99999 20110105 79.7 24 74.8 24 1007.8 8 1006.8 8 7.0 24 2.4 24 6.0 999.9 87.8 73.0 0.57G 999.9 010000\n486200 99999 20110106 77.4 24 74.6 24 1008.8 8 1007.9 8 6.0 24 1.5 24 4.1 999.9 81.0 73.2 0.16G 999.9 010000\n486200 99999 20110107 77.7 24 75.0 24 1008.9'''
>>> import re
>>> ' '.join(re.findall(r'(\b\d{8}\b|\b\d{1,2}\.\d{1,2}[ABCDEFG])', s))
'20110101 0.00G 20110102 0.00G 20110103 0.68G 20110104 0.14G 20110105 0.57G 20110106 0.16G 20110107'
(?<!\d)\d{8}(?!\d)|\d{1,2}\.\d{2}[a-zA-Z]
Use this pattern with re.findall. See the demo:
https://www.regex101.com/r/rK5lU1/27
import re
p = re.compile(r'(?<!\d)\d{8}(?!\d)|\d{1,2}\.\d{2}[a-zA-Z]', re.MULTILINE | re.IGNORECASE)
test_str = "b'STN--- WBAN YEARMODA TEMP DEWP SLP STP VISIB WDSP MXSPD GUST MAX MIN PRCP SNDP FRSHTT\n486200 99999 20110101 79.3 24 74.5 24 1007.2 8 1006.2 8 6.6 24 2.2 24 7.0 999.9 87.8 74.1 0.00G 999.9 010000\n486200 99999 20110102 79.7 24 74.9 24 1007.8 8 1006.9 8 6.1 24 2.8 24 8.0 15.0 91.9 74.8 0.00G 999.9 010010\n486200 99999 20110103 77.5 24 73.6 24 1008.5 8 1007.6 8 6.0 24 2.8 24 6.0 999.9 83.7 73.4* 0.68G 999.9 010000\n486200 99999 20110104 81.2 24 75.0 24 1007.7 8 1006.8 8 6.3 24 3.0 24 5.1 999.9 89.6* 73.0 0.14G 999.9 010010\n486200 99999 20110105 79.7 24 74.8 24 1007.8 8 1006.8 8 7.0 24 2.4 24 6.0 999.9 87.8 73.0 0.57G 999.9 010000\n486200 99999 20110106 77.4 24 74.6 24 1008.8 8 1007.9 8 6.0 24 1.5 24 4.1 999.9 81.0 73.2 0.16G 999.9 010000\n486200 99999 20110107 77.7 24 75.0 24 1008.9\n"
re.findall(p, test_str)
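To get the single space-separated string the question asks for, the matches can be joined afterwards. A short sketch on a two-line made-up sample in the same layout as the weather data, using the same pattern as above (the names text, pattern and result are illustrative):

```python
import re

# Two sample records in the same layout as the question's text.
text = (
    "486200 99999 20110103 77.5 24 73.4* 0.68G 999.9 010000\n"
    "486200 99999 20110104 81.2 24 89.6* 10.76B 999.9 010010\n"
)

# Either exactly eight digits not embedded in a longer number,
# or 1-2 digits, a dot, two digits and one letter.
pattern = re.compile(r"(?<!\d)\d{8}(?!\d)|\d{1,2}\.\d{2}[a-zA-Z]")

# findall returns the matches in order; join them with single spaces.
result = " ".join(pattern.findall(text))
print(result)  # 20110103 0.68G 20110104 10.76B
```

Values such as 73.4* or 999.9 are skipped because they do not have two digits followed by a letter after the dot.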