How to split a string of text by conjunction in Python?

I have a dataframe which is a transcript of a 2 person conversation. In the df are words, their timestamps, and the label of the speaker. It looks like this.
word start stop speaker
0 but 2.72 2.85 2
1 that's 2.85 3.09 2
2 alright 3.09 3.47 2
3 we'll 8.43 8.69 1
4 have 8.69 8.97 1
5 to 8.97 9.07 1
6 okay 9.19 10.01 2
7 sure 10.02 11.01 2
8 what? 11.02 12.00 1
9 i 12.01 13.00 2
10 agree 13.01 14.00 2
11 but 14.01 15.00 2
12 i 15.01 16.00 2
13 disagree 16.01 17.00 2
14 thats 17.01 18.00 1
15 fine 18.01 19.00 1
16 however 19.01 20.00 1
17 you 20.01 21.00 1
18 are 21.01 22.00 1
19 like 22.01 23.00 1
20 this 23.01 24.00 1
21 and 24.01 25.00 1
I have code that combines all words per speaker turn into one utterance while preserving the timestamps and speaker label, using this code:
df.groupby([(df['speaker'] != df['speaker'].shift()).cumsum(), df['speaker']], as_index=False).agg({
    'word': ' '.join,
    'start': 'min',
    'stop': 'max'
})
I get this:
word start stop speaker
0 but that's alright 2.72 3.47 2
1 we'll have to 8.43 9.07 1
2 okay sure 9.19 11.01 2
3 what? 11.02 12.00 1
However, I want to split these combined utterances into sub-utterances based on the presence of a conjunctive adverb ('however', 'and', 'but', etc.). As a result, I want this:
word start stop speaker
0 but that's alright 2.72 3.47 2
1 we'll have to 8.43 9.07 1
2 okay sure 9.19 11.01 2
3 what? 11.02 12.00 1
4 I agree 12.01 14.00 2
5 but i disagree 14.01 17.00 2
6 thats fine 17.01 19.00 1
7 however you are 19.01 22.00 1
8 like this 22.01 24.00 1
9 and 24.01 25.00 1
Any recommendations on accomplishing this task would be appreciated.

You can add an OR (|) to the grouping condition and check whether the word is in a specific list before grouping (e.g. with df['word'].isin(['however', 'and', 'but'])):
df.groupby([((df['speaker'] != df['speaker'].shift()) | df['word'].isin(['however', 'and', 'but'])).cumsum(), df['speaker']], as_index=False).agg({
    'word': ' '.join,
    'start': 'min',
    'stop': 'max'
})
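Putting it all together, here is a self-contained sketch. The conjunction list above is only an example; note that the question's expected output also splits at 'like', so it is added to the list here to reproduce that exact result:
import pandas as pd

# minimal reconstruction of the transcript dataframe from the question
words = ['but', "that's", 'alright', "we'll", 'have', 'to', 'okay', 'sure',
         'what?', 'i', 'agree', 'but', 'i', 'disagree', 'thats', 'fine',
         'however', 'you', 'are', 'like', 'this', 'and']
starts = [2.72, 2.85, 3.09, 8.43, 8.69, 8.97, 9.19, 10.02, 11.02, 12.01, 13.01,
          14.01, 15.01, 16.01, 17.01, 18.01, 19.01, 20.01, 21.01, 22.01, 23.01, 24.01]
stops = [2.85, 3.09, 3.47, 8.69, 8.97, 9.07, 10.01, 11.01, 12.00, 13.00, 14.00,
         15.00, 16.00, 17.00, 18.00, 19.00, 20.00, 21.00, 22.00, 23.00, 24.00, 25.00]
speakers = [2, 2, 2, 1, 1, 1, 2, 2, 1, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1]
df = pd.DataFrame({'word': words, 'start': starts, 'stop': stops, 'speaker': speakers})

# a new group starts on every speaker change OR whenever the word is a splitter
splitters = ['however', 'and', 'but', 'like']
breaks = (df['speaker'] != df['speaker'].shift()) | df['word'].isin(splitters)
out = df.groupby([breaks.cumsum(), df['speaker']], as_index=False).agg({
    'word': ' '.join,
    'start': 'min',
    'stop': 'max'
})
print(out)  # yields the ten sub-utterances shown in the question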

Related

Avoid iteration over rows for computation in pandas

County date available_wheat usage rate (%) consumption
A 1/2/2021 100.00 3
A 1/3/2021 3
A 1/4/2021 2
A 1/5/2021 5
A 1/6/2021 1
A 1/7/2021 2
A 1/8/2021 5
A 1/9/2021 6
A 1/10/2021 7
A 1/11/2021 8
A 1/12/2021 1
A 1/13/2021 2
Above is my dataframe. I need to fill in the available_wheat column: the available amount needs to be reduced each day by the usage rate (%). I am able to do this using iterrows (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iterrows.html).
My dataframe is quite big compared to what is displayed, so the question is: is it possible to vectorize the calculation, using either a lambda or something else?
Expected output:
County date available_wheat usage rate (%) consumption
A 1/2/2021 100.00 3 3.00
A 1/3/2021 97.00 3 2.91
A 1/4/2021 94.09 2 1.88
A 1/5/2021 92.21 5 4.61
A 1/6/2021 87.60 1 0.88
A 1/7/2021 86.72 2 1.73
A 1/8/2021 84.99 5 4.25
A 1/9/2021 80.74 6 4.84
A 1/10/2021 75.89 7 5.31
A 1/11/2021 70.58 8 5.65
A 1/12/2021 64.93 1 0.65
A 1/13/2021 64.29 2 1.29
You need to use a shifted cumprod of your usage rate:
factor = df['usage rate (%)'].shift(fill_value=0).rsub(100).div(100).cumprod()
df['available_wheat'] = df['available_wheat'].ffill().mul(factor)
df['consumption'] = df['usage rate (%)'].mul(df['available_wheat']).div(100)
NB: if you have several counties and want to handle them independently, perform all of this within a groupby (shown below). Add round(2) to get 2 decimal places.
output:
County date available_wheat usage rate (%) consumption
0 A 1/2/2021 100.000000 3 3.000000
1 A 1/3/2021 97.000000 3 2.910000
2 A 1/4/2021 94.090000 2 1.881800
3 A 1/5/2021 92.208200 5 4.610410
4 A 1/6/2021 87.597790 1 0.875978
5 A 1/7/2021 86.721812 2 1.734436
6 A 1/8/2021 84.987376 5 4.249369
7 A 1/9/2021 80.738007 6 4.844280
8 A 1/10/2021 75.893727 7 5.312561
9 A 1/11/2021 70.581166 8 5.646493
10 A 1/12/2021 64.934673 1 0.649347
11 A 1/13/2021 64.285326 2 1.285707
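To see why this reproduces the numbers above: each day keeps (100 - rate)/100 of the previous day's stock, and shift(fill_value=0) makes day 1 start at the full 100. A quick hand-check of the first few factors:
# 100 * 1.00 = 100.00, then 100 * 0.97 = 97.00,
# then 100 * 0.97 * 0.97 = 94.09, then 100 * 0.9409 * 0.98 = 92.21, ...
factor = 1.0
for rate in [3, 3, 2, 5]:
    print(round(100 * factor, 2))
    factor *= (100 - rate) / 100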
Grouped per County
Same logic in a groupby:
factor = (df.groupby('County')['usage rate (%)']
          .apply(lambda s: s.shift(fill_value=0).rsub(100).div(100).cumprod())
          )
df['available_wheat'] = df.groupby('County')['available_wheat'].ffill().mul(factor)
df['consumption'] = df['usage rate (%)'].mul(df['available_wheat']).div(100)
An alternative is a row-by-row apply with a running total. Note that this executes Python code for every row, so the vectorized cumprod above will be much faster on a large frame:
import numpy as np
import pandas as pd

available_wheat2 = 100  # running stock, updated as each row is processed

def function1(ss: pd.Series):
    global available_wheat2
    ss['available_wheat'] = available_wheat2
    ss['consumption'] = np.round(available_wheat2 * ss['usage rate (%)'] / 100, 2)
    available_wheat2 = available_wheat2 - ss['consumption']
    return ss

# df1 is the input dataframe from the question
df1.apply(function1, axis=1)
out:
County date available_wheat usage rate (%) consumption
0 A 1/2/2021 100.00 3 3.00
1 A 1/3/2021 97.00 3 2.91
2 A 1/4/2021 94.09 2 1.88
3 A 1/5/2021 92.21 5 4.61
4 A 1/6/2021 87.60 1 0.88
5 A 1/7/2021 86.72 2 1.73
6 A 1/8/2021 84.99 5 4.25
7 A 1/9/2021 80.74 6 4.84
8 A 1/10/2021 75.90 7 5.31
9 A 1/11/2021 70.59 8 5.65
10 A 1/12/2021 64.94 1 0.65
11 A 1/13/2021 64.29 2 1.29

How to read webpage dataset in pandas?

I am trying to read a table on this webpage: https://datahub.io/sports-data/german-bundesliga
I am using this code:
import pandas as pd
url="https://datahub.io/sports-data/german-bundesliga"
pd.read_html(url)[2]
It reads other tables on the page, but not tables of this type.
Also there is a link to this specific table:
https://datahub.io/sports-data/german-bundesliga/r/0.html
I also tried this:
import pandas as pd
url="https://datahub.io/sports-data/german-bundesliga/r/0.html"
pd.read_html(url)
But it says that there are no tables to read.
There is no need to use the HTML form of the table, because the table is also available in CSV format:
pd.read_csv('https://datahub.io/sports-data/german-bundesliga/r/season-1819.csv').head()
output:
Div Date HomeTeam AwayTeam FTHG FTAG FTR HTHG HTAG HTR ... BbAv<2.5 BbAH BbAHh BbMxAHH BbAvAHH BbMxAHA BbAvAHA PSCH PSCD PSCA
0 D1 24/08/2018 Bayern Munich Hoffenheim 3 1 H 1 0 H ... 3.55 22 -2.00 1.92 1.87 2.05 1.99 1.23 7.15 14.10
1 D1 25/08/2018 Fortuna Dusseldorf Augsburg 1 2 A 1 0 H ... 1.76 20 0.00 1.80 1.76 2.17 2.11 2.74 3.33 2.78
2 D1 25/08/2018 Freiburg Ein Frankfurt 0 2 A 0 1 A ... 1.69 20 -0.25 2.02 1.99 1.92 1.88 2.52 3.30 3.07
3 D1 25/08/2018 Hertha Nurnberg 1 0 H 1 0 H ... 1.76 20 -0.25 1.78 1.74 2.21 2.14 1.79 3.61 5.21
4 D1 25/08/2018 M'gladbach Leverkusen 2 0 H 0 0 D ... 2.32 20 0.00 2.13 2.07 1.84 1.78 2.63 3.70 2.69
5 rows × 61 columns
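If you want to discover the dataset's CSV resources programmatically rather than hard-coding the season file, datahub datasets are Frictionless Data packages and usually expose a datapackage.json descriptor. A sketch, assuming this dataset follows that convention (verify the descriptor exists before relying on it):
import requests

# list the CSV resources declared in the package descriptor
# (the datapackage.json path is assumed from datahub's usual layout)
meta = requests.get('https://datahub.io/sports-data/german-bundesliga/datapackage.json').json()
for resource in meta.get('resources', []):
    if resource.get('format') == 'csv':
        print(resource.get('name'), resource.get('path'))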

Add zero in between values in column (dataframe)

I have a dataframe:
ID Date Volume Sales
1 20191 3.33 1.33
1 20192 3.33 1.33
1 20193 3.33 1.33
1 20194 2.66 2
1 20195 2.66 2
1 20196 2.66 2
1 20197 2 2.66
1 20198 2 2.66
1 20199 2 2.66
1 201910 1.33 3.33
1 201911 1.33 3.33
1 201912 1.33 3.33
I would like to add a 0 right after the year 2019 so that the date looks like 201901, etc., while 201910 and above remain the same.
My initial thought was to use:
np.where(df['Date'].str.len() == 5,
where, if the string length equals 5, we add the zero; otherwise, the data stays the same.
Expected output:
ID Date Volume Sales
1 201901 3.33 1.33
1 201902 3.33 1.33
1 201903 3.33 1.33
1 201904 2.66 2
1 201905 2.66 2
1 201906 2.66 2
1 201907 2 2.66
1 201908 2 2.66
1 201909 2 2.66
1 201910 1.33 3.33
1 201911 1.33 3.33
1 201912 1.33 3.33
Assuming these are strings:
df.Date = df.Date.apply(lambda x: x if len(x) == 6 else f"{x[:4]}0{x[-1]}")
But I concur that you should convert these to proper dates, as suggested by @0 0 in the comments.
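For completeness, the np.where idea from the question can be finished along these lines, and zfill offers a third option (a sketch, assuming Date holds strings):
import numpy as np

# insert a '0' after the 4-digit year when the month part is a single digit
df['Date'] = np.where(df['Date'].str.len() == 5,
                      df['Date'].str[:4] + '0' + df['Date'].str[4:],
                      df['Date'])
# or, equivalently, left-pad the month part to two digits
# df['Date'] = df['Date'].str[:4] + df['Date'].str[4:].str.zfill(2)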

pandas - groupby multiple values?

I have a dataframe that contains cell phone minutes usage, logged by date of call and duration.
It looks like this (30-row sample):
id user_id call_date duration
0 1000_93 1000 2018-12-27 8.52
1 1000_145 1000 2018-12-27 13.66
2 1000_247 1000 2018-12-27 14.48
3 1000_309 1000 2018-12-28 5.76
4 1000_380 1000 2018-12-30 4.22
5 1000_388 1000 2018-12-31 2.20
6 1000_510 1000 2018-12-27 5.75
7 1000_521 1000 2018-12-28 14.18
8 1000_530 1000 2018-12-28 5.77
9 1000_544 1000 2018-12-26 4.40
10 1000_693 1000 2018-12-31 4.31
11 1000_705 1000 2018-12-31 12.78
12 1000_735 1000 2018-12-29 1.70
13 1000_778 1000 2018-12-28 3.29
14 1000_826 1000 2018-12-26 9.96
15 1000_842 1000 2018-12-27 5.85
16 1001_0 1001 2018-09-06 10.06
17 1001_1 1001 2018-10-12 1.00
18 1001_2 1001 2018-10-17 15.83
19 1001_4 1001 2018-12-05 0.00
20 1001_5 1001 2018-12-13 6.27
21 1001_6 1001 2018-12-04 7.19
22 1001_8 1001 2018-11-17 2.45
23 1001_9 1001 2018-11-19 2.40
24 1001_11 1001 2018-11-09 1.00
25 1001_13 1001 2018-12-24 0.00
26 1001_19 1001 2018-11-15 30.00
27 1001_20 1001 2018-09-21 5.75
28 1001_23 1001 2018-10-27 0.98
29 1001_26 1001 2018-10-28 5.90
30 1001_29 1001 2018-09-30 14.78
I want to group by user_id AND call_date with the ultimate goal of calculating the number of minutes used per month over the course of the year, per user.
I thought I could accomplish this by using:
calls.groupby(['user_id','call_date'])['duration'].sum()
but the results aren't what I expected:
user_id call_date
1000 2018-12-26 14.36
2018-12-27 48.26
2018-12-28 29.00
2018-12-29 1.70
2018-12-30 4.22
2018-12-31 19.29
1001 2018-08-14 13.86
2018-08-16 23.46
2018-08-17 8.11
2018-08-18 1.74
2018-08-19 10.73
2018-08-20 7.32
2018-08-21 0.00
2018-08-23 8.50
2018-08-24 8.63
2018-08-25 35.39
2018-08-27 10.57
2018-08-28 19.91
2018-08-29 0.54
2018-08-31 22.38
2018-09-01 7.53
2018-09-02 10.27
2018-09-03 30.66
2018-09-04 0.00
2018-09-05 9.09
2018-09-06 10.06
I'd hoped that it would be grouped like: user_id 1000, all calls for January with duration summed, all calls for February with duration summed, etc.
I am really new to Python and programming in general, and am not sure what my next step should be to get these grouped by user_id and month of the year.
Thanks in advance for any insight you can offer.
Regards,
Jared
Something is not quite right in your setup. First of all, both of your tables are the same, so I am not sure if this is a cut-and-paste error or something else. Here is what I do with your data. Load it up like so; note we explicitly convert call_date to datetime:
from io import StringIO
import pandas as pd
df = pd.read_csv(StringIO(
"""
id user_id call_date duration
0 1000_93 1000 2018-12-27 8.52
1 1000_145 1000 2018-12-27 13.66
2 1000_247 1000 2018-12-27 14.48
3 1000_309 1000 2018-12-28 5.76
4 1000_380 1000 2018-12-30 4.22
5 1000_388 1000 2018-12-31 2.20
6 1000_510 1000 2018-12-27 5.75
7 1000_521 1000 2018-12-28 14.18
8 1000_530 1000 2018-12-28 5.77
9 1000_544 1000 2018-12-26 4.40
10 1000_693 1000 2018-12-31 4.31
11 1000_705 1000 2018-12-31 12.78
12 1000_735 1000 2018-12-29 1.70
13 1000_778 1000 2018-12-28 3.29
14 1000_826 1000 2018-12-26 9.96
15 1000_842 1000 2018-12-27 5.85
16 1001_0 1001 2018-09-06 10.06
17 1001_1 1001 2018-10-12 1.00
18 1001_2 1001 2018-10-17 15.83
19 1001_4 1001 2018-12-05 0.00
20 1001_5 1001 2018-12-13 6.27
21 1001_6 1001 2018-12-04 7.19
22 1001_8 1001 2018-11-17 2.45
23 1001_9 1001 2018-11-19 2.40
24 1001_11 1001 2018-11-09 1.00
25 1001_13 1001 2018-12-24 0.00
26 1001_19 1001 2018-11-15 30.00
27 1001_20 1001 2018-09-21 5.75
28 1001_23 1001 2018-10-27 0.98
29 1001_26 1001 2018-10-28 5.90
30 1001_29 1001 2018-09-30 14.78
"""), delim_whitespace = True, index_col=0)
df['call_date'] = pd.to_datetime(df['call_date'])
Then using
df.groupby(['user_id','call_date'])['duration'].sum()
does the expected grouping by user and by each date:
user_id call_date
1000 2018-12-26 14.36
2018-12-27 48.26
2018-12-28 29.00
2018-12-29 1.70
2018-12-30 4.22
2018-12-31 19.29
1001 2018-09-06 10.06
2018-09-21 5.75
2018-09-30 14.78
2018-10-12 1.00
2018-10-17 15.83
2018-10-27 0.98
2018-10-28 5.90
2018-11-09 1.00
2018-11-15 30.00
2018-11-17 2.45
2018-11-19 2.40
2018-12-04 7.19
2018-12-05 0.00
2018-12-13 6.27
2018-12-24 0.00
If you want to group by month, as you seem to suggest, you can use the Grouper functionality:
df.groupby(['user_id',pd.Grouper(key='call_date', freq='1M')])['duration'].sum()
which produces
user_id call_date
1000 2018-12-31 116.83
1001 2018-09-30 30.59
2018-10-31 23.71
2018-11-30 35.85
2018-12-31 13.46
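An equivalent spelling, if you prefer period labels like 2018-12 over month-end timestamps (a minor variation, not from the original answer):
# group by user and calendar month using a PeriodIndex key
df.groupby(['user_id', df['call_date'].dt.to_period('M')])['duration'].sum()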
Let me know if you are getting different results from following these steps.

Is there an option in pandas to see if value in column was less than another column in one row and then it changed over time?

I need to find cases where "price of Y" was less than 3.5 until time 30:00, and after that "price of X" jumped above 3.5.
I made a "Demical time" column to make this easier for me (less than 30:00 is less than 1800 seconds in decimal).
I managed to find all the cases where price of Y was under 3.5 (and above 0), but I failed to write code that gives the cases where price of Y was under 3.5 AND price of X was greater than 3.5 after 30:00.
df1 = df[(df['price_of_Y']<3.5)&(df['price_of_Y']>0)& (df['Demical time']<1800)]
# the cases for price of Y under 3.5 before time is 30:00 (Demical time = 1800)
df2 = df[(df['price_of_X']>3.5) & (df['Demical time'] >1800)]
# the cases for price of X above 3.5 after time is 30:00 (Demical time = 1800)
# the question is: how do I combine them into one line?
price_of_X time price_of_Y Demical time
0 3.30 0 4.28 0
1 3.30 0:00 4.28 0
2 3.30 0:00 4.28 0
3 3.30 0:00 4.28 0
4 3.30 0:00 4.28 0
5 3.30 0:00 4.28 0
6 3.30 0:00 4.28 0
7 3.30 0:00 4.28 0
8 3.30 0:00 4.28 0
9 3.30 0:00 4.28 0
10 3.30 0:00 4.28 0
11 3.25 0:26 4.28 26
12 3.40 1:43 4.28 103
13 3.25 3:00 4.28 180
14 3.25 4:16 4.28 256
15 3.40 5:34 4.28 334
16 3.40 6:52 4.28 412
17 3.40 8:09 4.28 489
18 3.40 9:31 4.28 571
19 5.00 10:58 8.57 658
20 5.00 12:13 8.57 733
21 5.00 13:31 7.38 811
22 5.00 14:47 7.82 887
23 5.00 16:01 7.82 961
24 5.00 17:18 7.38 1038
25 5.00 18:33 7.38 1113
26 5.00 19:50 7.38 1190
27 5.00 21:09 7.38 1269
28 5.00 22:22 7.38 1342
29 5.00 23:37 8.13 1417
... ... ... ... ...
18138 7.50 59:03:00 28.61 3543
18139 7.50 60:19:00 28.61 3619
18140 7.50 61:35:00 34.46 3695
18141 8.00 62:48:00 30.16 3768
18142 7.50 64:03:00 34.46 3843
18143 8.00 65:20:00 30.16 3920
18144 7.50 66:34:00 28.61 3994
18145 7.50 67:53:00 30.16 4073
18146 8.00 69:08:00 26.19 4148
18147 7.00 70:23:00 23.10 4223
18148 7.00 71:38:00 23.10 4298
18149 8.00 72:50:00 30.16 4370
18150 7.50 74:09:00 26.19 4449
18151 7.50 75:23:00 25.58 4523
18152 7.00 76:40:00 19.07 4600
18153 7.00 77:53:00 19.07 4673
18154 9.00 79:11:00 31.44 4751
18155 9.00 80:27:00 27.11 4827
18156 10.00 81:41:00 34.52 4901
18157 10.00 82:56:00 34.52 4976
18158 11.00 84:16:00 43.05 5056
18159 10.00 85:35:00 29.42 5135
18160 10.00 86:49:00 29.42 5209
18161 11.00 88:04:00 35.70 5284
18162 13.00 89:19:00 70.38 5359
18163 15.00 90:35:00 70.42 5435
18164 19.00 91:48:00 137.70 5508
18165 23.00 93:01:00 511.06 5581
18166 NaN NaN NaN 0
18167 NaN NaN NaN 0
[18168 rows x 4 columns]
This should solve it.
I have used slightly different data and condition values, but you should get the idea of what I am doing.
import pandas as pd
df = pd.DataFrame({'price_of_X': [3.30, 3.25, 3.40, 3.25, 3.25, 3.40],
                   'price_of_Y': [2.28, 1.28, 4.28, 4.28, 1.18, 3.28],
                   'Decimal_time': [0, 26, 103, 180, 256, 334]})
print(df)
df1 = df.loc[(df['price_of_Y']<3.5)&(df['price_of_X']>3.3)&(df['Decimal_time']>103),:]
print(df1)
output:
df
price_of_X price_of_Y Decimal_time
0 3.30 2.28 0
1 3.25 1.28 26
2 3.40 4.28 103
3 3.25 4.28 180
4 3.25 1.18 256
5 3.40 3.28 334
df1
price_of_X price_of_Y Decimal_time
5 3.4 3.28 334
Similar to what @IMCoins suggested in a comment, use two boolean masks to achieve the selection that you require:
mask1 = (df['price_of_Y'] < 3.5) & (df['price_of_Y'] > 0) & (df['Demical time'] < 1800)
mask2 = (df['price_of_X'] > 3.5) & (df['Demical time'] > 1800)
df[mask1 | mask2]
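Note that mask1 | mask2 keeps the rows matching either condition. If what you actually need to know is whether both situations occurred in the data (Y under 3.5 before 30:00 and X above 3.5 after it), which is closer to how the question is phrased, test the masks separately, e.g.:
# True only if each condition is met somewhere in the frame
both_happened = mask1.any() and mask2.any()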
