Append all columns from one row into another row - python

I am trying to append every column from one row onto another row. I want to do this for every row, but some rows will not have any values. Take a look at my code; it will be clearer:
Here is my data
date day_of_week day_of_month day_of_year month_of_year
5/1/2017 0 1 121 5
5/2/2017 1 2 122 5
5/3/2017 2 3 123 5
5/4/2017 3 4 124 5
5/8/2017 0 8 128 5
5/9/2017 1 9 129 5
5/10/2017 2 10 130 5
5/11/2017 3 11 131 5
5/12/2017 4 12 132 5
5/15/2017 0 15 135 5
5/16/2017 1 16 136 5
5/17/2017 2 17 137 5
5/18/2017 3 18 138 5
5/19/2017 4 19 139 5
5/23/2017 1 23 143 5
5/24/2017 2 24 144 5
5/25/2017 3 25 145 5
5/26/2017 4 26 146 5
Here is my current code:
def GetNextDayMarketData(row, dataframe):
    if row['next_calendarday'] is pd.NaT:
        return
    key = row['next_calendarday'].strftime("%Y-%m-%d")
    nextrow = dataframe.loc[key]
    for index, val in nextrow.iteritems():
        if index != "next_calendarday":
            dataframe.loc[row.name, index + '_nextday'] = val

s = df_md['date'].shift(-1)
df_md['next_calendarday'] = s.mask(s.dt.dayofweek.diff().lt(0))
df_md.set_index('date', inplace=True)
df_md.apply(lambda row: GetNextDayMarketData(row, df_md), axis=1)
This works, but it's so slow it might as well not work. Here is what the result should look like; you can see that the values from the next row have been added to the previous row. The kicker is that it's the next calendar date, not just the next row in the sequence. If a row does not have an entry for the next calendar date, the new columns are simply left blank.
Here is the expected result in csv
date day_of_week day_of_month day_of_year month_of_year next_workingday day_of_week_nextday day_of_month_nextday day_of_year_nextday month_of_year_nextday
5/1/2017 0 1 121 5 5/2/2017 1 2 122 5
5/2/2017 1 2 122 5 5/3/2017 2 3 123 5
5/3/2017 2 3 123 5 5/4/2017 3 4 124 5
5/4/2017 3 4 124 5
5/8/2017 0 8 128 5 5/9/2017 1 9 129 5
5/9/2017 1 9 129 5 5/10/2017 2 10 130 5
5/10/2017 2 10 130 5 5/11/2017 3 11 131 5
5/11/2017 3 11 131 5 5/12/2017 4 12 132 5
5/12/2017 4 12 132 5
5/15/2017 0 15 135 5 5/16/2017 1 16 136 5
5/16/2017 1 16 136 5 5/17/2017 2 17 137 5
5/17/2017 2 17 137 5 5/18/2017 3 18 138 5
5/18/2017 3 18 138 5 5/19/2017 4 19 139 5
5/19/2017 4 19 139 5
5/23/2017 1 23 143 5 5/24/2017 2 24 144 5
5/24/2017 2 24 144 5 5/25/2017 3 25 145 5
5/25/2017 3 25 145 5 5/26/2017 4 26 146 5
5/26/2017 4 26 146 5
5/30/2017 1 30 150 5

Use DataFrame.join and then drop the column next_calendarday_nextday:
df = df.set_index('date')
df = (df.join(df, on='next_calendarday', rsuffix='_nextday')
        .drop('next_calendarday_nextday', axis=1))
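For reference, here is a minimal, self-contained sketch of this approach on a handful of the dates above (the tiny frame is an assumption for illustration, not the asker's full data):
import pandas as pd

# small illustrative frame (assumption, not the asker's data)
df_md = pd.DataFrame({'date': pd.to_datetime(['2017-05-01', '2017-05-02',
                                              '2017-05-03', '2017-05-04']),
                      'day_of_week': [0, 1, 2, 3]})

# next calendar date, blanked out when the week wraps around
s = df_md['date'].shift(-1)
df_md['next_calendarday'] = s.mask(s.dt.dayofweek.diff().lt(0))

# self-join: each row picks up the columns of its next calendar date
df = df_md.set_index('date')
out = (df.join(df, on='next_calendarday', rsuffix='_nextday')
         .drop('next_calendarday_nextday', axis=1))
print(out)
Rows whose next_calendarday is NaT simply get NaN in the *_nextday columns, which matches the blank rows in the expected output.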

Related

Get Row in Other Table

I have a dataframe 'df'. Using the validation data validData, I want to compute the response rate (Florence = 1/Yes) using rfm_aboveavg (the RFM combinations whose response rates are above the overall response rate). The response rate is computed from the 0/No and 1/Yes counts, so it is rfm_crosstab[1] / rfm_crosstab['All'].
Using the results from the validation data, I want to display only the rows whose RFM value also appears in the training data output. How do I do this?
Data: 'df'
Seq# ID# Gender M R F FirstPurch ChildBks YouthBks CookBks ... ItalCook ItalAtlas ItalArt Florence Related Purchase Mcode Rcode Fcode Yes_Florence No_Florence
0 1 25 1 297 14 2 22 0 1 1 ... 0 0 0 0 0 5 4 2 0 1
1 2 29 0 128 8 2 10 0 0 0 ... 0 0 0 0 0 4 3 2 0 1
2 3 46 1 138 22 7 56 2 1 2 ... 1 0 0 0 2 4 4 3 0 1
3 4 47 1 228 2 1 2 0 0 0 ... 0 0 0 0 0 5 1 1 0 1
4 5 51 1 257 10 1 10 0 0 0 ... 0 0 0 0 0 5 3 1 0 1
My code: Crosstab for training data trainData
trainData, validData = train_test_split(df, test_size=0.4, random_state=1)
# Response rate for training data as a whole
responseRate = (sum(trainData.Florence == 1) / sum(trainData.Florence == 0)) * 100
# Response rate for RFM categories
# RFM: Combine R, F, M categories into one category
trainData['RFM'] = trainData['Mcode'].astype(str) + trainData['Rcode'].astype(str) + trainData['Fcode'].astype(str)
rfm_crosstab = pd.crosstab(index = [trainData['RFM']], columns = trainData['Florence'], margins = True)
rfm_crosstab['Percentage of 1/Yes'] = 100 * (rfm_crosstab[1] / rfm_crosstab['All'])
# RFM combinations response rates above the overall response
rfm_aboveavg = rfm_crosstab['Percentage of 1/Yes'] > responseRate
rfm_crosstab[rfm_aboveavg]
Output: Training data
Florence 0 1 All Percentage of 1/Yes
RFM
121 3 2 5 40.000000
131 9 1 10 10.000000
212 1 2 3 66.666667
221 6 3 9 33.333333
222 6 1 7 14.285714
313 2 1 3 33.333333
321 17 3 20 15.000000
322 20 4 24 16.666667
323 2 1 3 33.333333
341 61 10 71 14.084507
343 17 2 19 10.526316
411 12 3 15 20.000000
422 26 5 31 16.129032
423 32 8 40 20.000000
441 96 12 108 11.111111
511 19 4 23 17.391304
513 44 8 52 15.384615
521 24 5 29 17.241379
523 74 16 90 17.777778
533 177 28 205 13.658537
My code: Crosstab for validation data validData
# Response rate for RFM categories
# RFM: Combine R, F, M categories into one category
validData['RFM'] = validData['Mcode'].astype(str) + validData['Rcode'].astype(str) + validData['Fcode'].astype(str)
rfm_crosstab1 = pd.crosstab(index = [validData['RFM']], columns = validData['Florence'], margins = True)
rfm_crosstab1['Percentage of 1/Yes'] = 100 * (rfm_crosstab1[1] / rfm_crosstab1['All'])
rfm_crosstab1
Output: Validation data
Florence 0 1 All Percentage of 1/Yes
RFM
131 3 1 4 25.000000
141 8 0 8 0.000000
211 2 1 3 33.333333
212 2 0 2 0.000000
213 0 1 1 100.000000
221 5 0 5 0.000000
222 2 0 2 0.000000
231 21 1 22 4.545455
232 3 0 3 0.000000
233 1 0 1 0.000000
241 11 1 12 8.333333
242 8 0 8 0.000000
243 2 0 2 0.000000
311 7 0 7 0.000000
312 8 0 8 0.000000
313 1 0 1 0.000000
321 12 0 12 0.000000
322 13 0 13 0.000000
323 4 1 5 20.000000
331 19 1 20 5.000000
332 25 2 27 7.407407
333 11 1 12 8.333333
341 36 2 38 5.263158
342 30 2 32 6.250000
343 12 0 12 0.000000
411 8 2 10 20.000000
412 7 0 7 0.000000
413 13 1 14 7.142857
421 21 2 23 8.695652
422 30 1 31 3.225806
423 26 1 27 3.703704
431 51 3 54 5.555556
432 42 7 49 14.285714
433 41 5 46 10.869565
441 68 2 70 2.857143
442 78 3 81 3.703704
443 70 5 75 6.666667
511 17 0 17 0.000000
512 13 1 14 7.142857
513 26 6 32 18.750000
521 19 1 20 5.000000
522 25 6 31 19.354839
523 50 6 56 10.714286
531 66 3 69 4.347826
532 65 3 68 4.411765
533 128 24 152 15.789474
541 86 7 93 7.526882
542 100 6 106 5.660377
543 178 17 195 8.717949
All 1474 126 1600 7.875000
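One way to display only the validation rows whose RFM value also appears in the training output is to filter on the crosstab index. A sketch, assuming rfm_crosstab, rfm_aboveavg and rfm_crosstab1 are built exactly as above (not a definitive answer):
# RFM values kept in the filtered training output
train_rfm = rfm_crosstab[rfm_aboveavg].index

# keep only validation rows whose RFM also appears there
valid_filtered = rfm_crosstab1[rfm_crosstab1.index.isin(train_rfm)]
print(valid_filtered)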

Filter rows with consecutive numbers

I have some data.
I want to keep only the rows where an ID has at least 4 consecutive numbers. For example, if ID 1 has rows 100, 101, 102, 103, 105, the "105" should be excluded.
Data:
ID X
0 1 100
1 1 101
2 1 102
3 1 103
4 1 105
5 2 100
6 2 102
7 2 103
8 2 104
9 3 100
10 3 101
11 3 102
12 3 103
13 3 106
14 3 107
15 3 108
16 3 109
17 3 110
18 3 112
19 4 100
20 4 102
21 4 103
22 4 104
23 4 105
24 4 107
Expected results:
ID X
0 1 100
1 1 101
2 1 102
3 1 103
4 3 100
5 3 101
6 3 102
7 3 103
8 3 106
9 3 107
10 3 108
11 3 109
12 3 110
13 4 102
14 4 103
15 4 104
16 4 105
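For anyone who wants to reproduce the answers below, a quick sketch reconstructing the sample frame from the data above:
import pandas as pd

df = pd.DataFrame({
    'ID': [1]*5 + [2]*4 + [3]*10 + [4]*6,
    'X': [100, 101, 102, 103, 105,
          100, 102, 103, 104,
          100, 101, 102, 103, 106, 107, 108, 109, 110, 112,
          100, 102, 103, 104, 105, 107],
})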
You can identify the consecutive values, then filter the groups by size with groupby.filter:
# group consecutive X
g = df['X'].diff().gt(1).cumsum() # no need to group here, we'll group later
# filter groups
out = df.groupby(['ID', g]).filter(lambda g: len(g)>=4)#.reset_index(drop=True)
output:
ID X
0 1 100
1 1 101
2 1 102
3 1 103
9 3 100
10 3 101
11 3 102
12 3 103
13 3 106
14 3 107
15 3 108
16 3 109
17 3 110
20 4 102
21 4 103
22 4 104
23 4 105
Another method:
out = df.groupby(df.groupby('ID')['X'].diff().ne(1).cumsum()).filter(lambda x: len(x) >= 4)
print(out)
# Output
ID X
0 1 100
1 1 101
2 1 102
3 1 103
9 3 100
10 3 101
11 3 102
12 3 103
13 3 106
14 3 107
15 3 108
16 3 109
17 3 110
20 4 102
21 4 103
22 4 104
23 4 105
def function1(dd: pd.DataFrame):
    return dd.assign(rk=(dd.assign(col1=(dd.X.diff() > 1).cumsum()).groupby('col1').transform('size')))

df1.groupby('ID').apply(function1).loc[lambda x: x.rk > 3, :'X']
ID X
0 1 100
1 1 101
2 1 102
3 1 103
9 3 100
10 3 101
11 3 102
12 3 103
13 3 106
14 3 107
15 3 108
16 3 109
17 3 110
20 4 102
21 4 103
22 4 104
23 4 105

In Pandas, given a datetime index with rows on all work days, how to determine if a row is the beginning or end of the week?

I have a set of stock information with a datetime index. The stock market is only open on weekdays, so all my rows are weekdays, which is fine. I would like to determine whether a row is the start or the end of the week, which might NOT always fall on Monday/Friday due to holidays. A better idea might be to determine whether there is a row entry for the next/previous day in the dataframe (since my data is guaranteed to exist only for working days), but I don't know how to calculate this. Here is an example of my data:
date day_of_week day_of_month day_of_year month_of_year
5/1/2017 0 1 121 5
5/2/2017 1 2 122 5
5/3/2017 2 3 123 5
5/4/2017 3 4 124 5
5/8/2017 0 8 128 5
5/9/2017 1 9 129 5
5/10/2017 2 10 130 5
5/11/2017 3 11 131 5
5/12/2017 4 12 132 5
5/15/2017 0 15 135 5
5/16/2017 1 16 136 5
5/17/2017 2 17 137 5
5/18/2017 3 18 138 5
5/19/2017 4 19 139 5
5/23/2017 1 23 143 5
5/24/2017 2 24 144 5
5/25/2017 3 25 145 5
5/26/2017 4 26 146 5
5/30/2017 1 30 150 5
Here is my current code
# Date fields
def DateFields(df_input):
    dates = df_input.index.to_series()
    df_input['day_of_week'] = dates.dt.dayofweek
    df_input['day_of_month'] = dates.dt.day
    df_input['day_of_year'] = dates.dt.dayofyear
    df_input['month_of_year'] = dates.dt.month
    df_input['isWeekStart'] = "No"  # <--- Need help here
    df_input['isWeekEnd'] = "No"    # <--- Need help here
    df_input['date'] = dates.dt.strftime('%Y-%m-%d')
    return df_input
How can I calculate whether a row is the beginning or the end of the week?
Example of what I am looking for:
date day_of_week day_of_month day_of_year month_of_year isWeekStart isWeekEnd
5/1/2017 0 1 121 5 1 0
5/2/2017 1 2 122 5 0 0
5/3/2017 2 3 123 5 0 0
5/4/2017 3 4 124 5 0 1 # short week, Thursday is last work day
5/8/2017 0 8 128 5 1 0
5/9/2017 1 9 129 5 0 0
5/10/2017 2 10 130 5 0 0
5/11/2017 3 11 131 5 0 0
5/12/2017 4 12 132 5 0 1
5/15/2017 0 15 135 5 1 0
5/16/2017 1 16 136 5 0 0
5/17/2017 2 17 137 5 0 0
5/18/2017 3 18 138 5 0 0
5/19/2017 4 19 139 5 0 1
5/23/2017 1 23 143 5 1 0 # short week, Tuesday is first work day
5/24/2017 2 24 144 5 0 0
5/25/2017 3 25 145 5 0 0
5/26/2017 4 26 146 5 0 1
5/30/2017 1 30 150 5 1 0
EDIT: I forgot that some holidays fall in the middle of the week. In this situation, it would be good if it could treat these as a separate "week", with the days before and after marked accordingly. Although if it's not smart enough to figure this out, just handling the long weekends would be a good start.
Here's an idea with BusinessDay:
prev_working_day = df['date'] - pd.tseries.offsets.BusinessDay(1)
df['isFirstWeekDay'] = (df['date'].dt.isocalendar().week !=
                        prev_working_day.dt.isocalendar().week)
And similarly for the last business day (see the sketch after the output below). Note that the default holiday calendar is the US one. Check out this post for a different one.
Output:
date day_of_week day_of_month day_of_year month_of_year isFirstWeekDay
0 2017-05-01 0 1 121 5 True
1 2017-05-02 1 2 122 5 False
2 2017-05-03 2 3 123 5 False
3 2017-05-04 3 4 124 5 False
4 2017-05-08 0 8 128 5 True
5 2017-05-09 1 9 129 5 False
6 2017-05-10 2 10 130 5 False
7 2017-05-11 3 11 131 5 False
8 2017-05-12 4 12 132 5 False
9 2017-05-15 0 15 135 5 True
10 2017-05-16 1 16 136 5 False
11 2017-05-17 2 17 137 5 False
12 2017-05-18 3 18 138 5 False
13 2017-05-19 4 19 139 5 False
14 2017-05-23 1 23 143 5 False
15 2017-05-24 2 24 144 5 False
16 2017-05-25 3 25 145 5 False
17 2017-05-26 4 26 146 5 False
18 2017-05-30 1 30 150 5 False
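As noted above, the last working day can be flagged the same way; a minimal sketch mirroring that approach (the column name isLastWeekDay is an assumption, and it shares the same calendar-based caveats):
# symmetric check: does the next business day fall in a different ISO week?
next_working_day = df['date'] + pd.tseries.offsets.BusinessDay(1)
df['isLastWeekDay'] = (df['date'].dt.isocalendar().week !=
                       next_working_day.dt.isocalendar().week)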
Here's an approach using weekly groupby.
df['date'] = pd.to_datetime(df['date'])
business_days = df.assign(date_copy = df['date']).groupby(pd.Grouper(key='date_copy', freq='W'))['date'].apply(list).to_frame()
business_days['isWeekStart'] = business_days['date'].apply(lambda x: [1 if i == min(x) else 0 for i in x])
business_days['isWeekEnd'] = business_days['date'].apply(lambda x: [1 if i == max(x) else 0 for i in x])
business_days = business_days.apply(pd.Series.explode)
pd.merge(df, business_days, left_on='date', right_on='date')
output:
date day_of_week day_of_month day_of_year month_of_year isWeekStart isWeekEnd
0 2017-05-01 0 1 121 5 1 0
1 2017-05-02 1 2 122 5 0 0
2 2017-05-03 2 3 123 5 0 0
3 2017-05-04 3 4 124 5 0 1
4 2017-05-08 0 8 128 5 1 0
5 2017-05-09 1 9 129 5 0 0
6 2017-05-10 2 10 130 5 0 0
7 2017-05-11 3 11 131 5 0 0
8 2017-05-12 4 12 132 5 0 1
9 2017-05-15 0 15 135 5 1 0
10 2017-05-16 1 16 136 5 0 0
11 2017-05-17 2 17 137 5 0 0
12 2017-05-18 3 18 138 5 0 0
13 2017-05-19 4 19 139 5 0 1
14 2017-05-23 1 23 143 5 1 0
15 2017-05-24 2 24 144 5 0 0
16 2017-05-25 3 25 145 5 0 0
17 2017-05-26 4 26 146 5 0 1
18 2017-05-30 1 30 150 5 1 1
Note that 2017-05-30 is marked as both WeekStart and WeekEnd because it is the only date of that week.

Pandas data reshaping, turn multiple rows with the same index but different values into many columns based on incidence

I have the following table in pandas
user_id idaggregate_info num_events num_lark_convo_events num_meals_logged num_breakfasts num_lunches num_dinners num_snacks total_activity sleep_duration num_activity_events num_weights num_notifs idusermission completed mission_delta
0 0 406 94 20 7 2 2 2 1 4456 47738 72 0 18 1426 0 NaT
1 1 1247 121 48 26 8 7 2 9 48695 37560 53 14 48 1379 1 7 days 10:04:28
2 1 1247 121 48 26 8 7 2 9 48695 37560 53 14 48 1379 1 NaT
3 2 2088 356 32 15 6 6 1 2 41598 184113 314 1 21 967 1 8 days 00:03:05
4 2 2088 356 32 15 6 6 1 2 41598 184113 314 1 21 967 1 NaT
Some user_ids have multiple lines that are identical except for their different mission_delta values. How do I transform this into one line for each id, with columns named "mission_delta_1", "mission_delta_2", and so on? (The number of them varies; it could be from 1 per user_id to maybe 5 per user_id, so the naming has to be iterative.) The output would be:
user_id idaggregate_info num_events num_lark_convo_events num_meals_logged num_breakfasts num_lunches num_dinners num_snacks total_activity sleep_duration num_activity_events num_weights num_notifs idusermission completed mission_delta_1 mission_delta_2
0 0 406 94 20 7 2 2 2 1 4456 47738 72 0 18 1426 0 NaT
1 1 1247 121 48 26 8 7 2 9 48695 37560 53 14 48 1379 1 7 days 10:04:28 NaT
2 2 2088 356 32 15 6 6 1 2 41598 184113 314 1 21 967 1 8 days 00:03:05 NaT
This is not a duplicate, as those questions address exploding all columns; here there is just one column that needs to be unstacked. The solutions offered in the duplicate link fail:
df.groupby(level=0).apply(lambda x: pd.Series(x.values.flatten()))
produces the same df as the original with different labels
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
0 0 406 94 20 7 2 2 2 1 4456 47738 72 0 18 1426 0 NaT
1 1 1247 121 48 26 8 7 2 9 48695 37560 53 14 48 1379 1 7 days 10:04:28
2 1 1247 121 48 26 8 7 2 9 48695 37560 53 14 48 1379 1 NaT
3 2 2088 356 32 15 6 6 1 2 41598 184113 314 1 21 967 1 8 days 00:03:05
The next options:
result2.groupby(level=0).apply(lambda x: pd.Series(x.stack().values))
produces:
0 0 0
1 406
2 94
3 20
4 7
and
df.groupby(level=0).apply(lambda x: x.values.ravel()).apply(pd.Series)
produces the original dataframe:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
0 0 406 94 20 7 2 2 2 1 4456 47738 72 0 18 1426 0 NaT
1 1 1247 121 48 26 8 7 2 9 48695 37560 53 14 48 1379 1 7 days 10:04:28
2 1 1247 121 48 26 8 7 2 9 48695 37560 53 14 48 1379 1 NaT
3 2 2088 356 32 15 6 6 1 2 41598 184113 314 1 21 967 1 8 days 00:03:05
In essence, I want to turn a df:
id mission_delta
0 NaT
1 1 day
1 2 days
1 1 day
2 5 days
2 NaT
into
id mission_delta1 mission_delta_2 mission_delta_3
0 NaT NaT NaT
1 1 day 2 days 1 day
2 5 days NaT NaT
You might try this:
grp = df.groupby('id')
df_res = grp['mission_delta'].apply(lambda x: pd.Series(x.values)).unstack().fillna('NaT')
df_res = df_res.rename(columns={i: 'mission_delta_{}'.format(i + 1) for i in range(df_res.shape[1])})
print(df_res)
mission_delta_1 mission_delta_2 mission_delta_3
id
0 NaT NaT NaT
1 1 day 2 days 1 day
2 5 days NaT NaT
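An alternative sketch (not from the answer above) that numbers each id's rows with cumcount and pivots them into columns, assuming the small df with columns 'id' and 'mission_delta' shown in the question:
# number the deltas per id, then spread them into one column per position
out = (df.assign(n=df.groupby('id').cumcount() + 1)
         .pivot(index='id', columns='n', values='mission_delta')
         .add_prefix('mission_delta_'))
print(out)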

How to find adjective frequency from a specific categories in brown corpus in NLTK

I'm a beginner to this, and I would like to know whether it is possible to extract adjective frequencies from specific categories of the Brown corpus and create a list of these adjectives with Python.
from collections import Counter
from nltk.corpus import brown
# Split the words and POS tags
words, poss = zip(*brown.tagged_words())
# Put them into a Counter object
pos_freq = Counter(poss)
for pos in pos_freq:
    print(pos, pos_freq[pos])
[out]:
' 317
'' 8789
( 2264
(-HL 162
) 2273
)-HL 184
* 4603
*-HL 8
*-NC 1
*-TL 1
, 58156
,-HL 171
,-NC 5
,-TL 4
-- 3405
---HL 26
. 60638
.-HL 598
.-NC 16
.-TL 2
: 1558
:-HL 138
:-TL 22
ABL 357
ABN 3010
ABN-HL 4
ABN-NC 1
ABN-TL 7
ABX 730
AP 9522
AP$ 9
AP+AP-NC 1
AP-HL 40
AP-NC 2
AP-TL 18
AT 97959
AT-HL 332
AT-NC 35
AT-TL 746
AT-TL-HL 5
BE 6360
BE-HL 13
BE-TL 1
BED 3282
BED* 22
BED-NC 3
BEDZ 9806
BEDZ* 154
BEDZ-HL 1
BEDZ-NC 8
BEG 686
BEM 226
BEM* 9
BEM-NC 2
BEN 2470
BEN-TL 2
BER 4379
BER* 47
BER*-NC 1
BER-HL 11
BER-NC 5
BER-TL 6
BEZ 10066
BEZ* 117
BEZ-HL 30
BEZ-NC 5
BEZ-TL 8
CC 37718
CC-HL 119
CC-NC 5
CC-TL 307
CC-TL-HL 2
CD 13510
CD$ 5
CD-HL 444
CD-NC 5
CD-TL 898
CD-TL-HL 17
CS 22143
CS-HL 25
CS-NC 5
CS-TL 2
DO 1353
DO* 485
DO*-HL 3
DO+PPSS 1
DO-HL 4
DO-NC 2
DO-TL 5
DOD 1047
DOD* 402
DOD*-TL 1
DOD-NC 1
DOZ 467
DOZ* 89
DOZ*-TL 1
DOZ-HL 16
DOZ-TL 2
DT 8957
DT$ 5
DT+BEZ 179
DT+BEZ-NC 1
DT+MD 3
DT-HL 6
DT-NC 7
DT-TL 9
DTI 2921
DTI-HL 6
DTI-TL 2
DTS 2435
DTS+BEZ 2
DTS-HL 2
DTX 104
EX 2164
EX+BEZ 105
EX+HVD 3
EX+HVZ 2
EX+MD 4
EX-HL 1
EX-NC 1
FW-* 6
FW-*-TL 2
FW-AT 24
FW-AT+NN-TL 13
FW-AT+NP-TL 2
FW-AT-HL 1
FW-AT-TL 44
FW-BE 1
FW-BER 3
FW-BEZ 4
FW-CC 27
FW-CC-TL 14
FW-CD 7
FW-CD-TL 2
FW-CS 3
FW-DT 2
FW-DT+BEZ 2
FW-DTS 1
FW-HV 1
FW-IN 84
FW-IN+AT 4
FW-IN+AT-T 3
FW-IN+AT-TL 18
FW-IN+NN 5
FW-IN+NN-TL 2
FW-IN+NP-TL 2
FW-IN-TL 40
FW-JJ 53
FW-JJ-NC 2
FW-JJ-TL 74
FW-JJR 1
FW-JJT 1
FW-NN 288
FW-NN$ 9
FW-NN$-TL 4
FW-NN-NC 6
FW-NN-TL 170
FW-NN-TL-NC 1
FW-NNS 83
FW-NNS-NC 2
FW-NNS-TL 36
FW-NP 7
FW-NP-TL 4
FW-NPS 2
FW-NPS-TL 1
FW-NR 1
FW-NR-TL 3
FW-OD-NC 1
FW-OD-TL 4
FW-PN 1
FW-PP$ 3
FW-PP$-NC 1
FW-PP$-TL 2
FW-PPL 9
FW-PPL+VBZ 2
FW-PPO 4
FW-PPO+IN 3
FW-PPS 1
FW-PPSS 6
FW-PPSS+HV 1
FW-QL 1
FW-RB 32
FW-RB+CC 1
FW-RB-TL 3
FW-TO+VB 1
FW-UH 8
FW-UH-NC 1
FW-UH-TL 1
FW-VB 26
FW-VB-NC 3
FW-VB-TL 1
FW-VBD 2
FW-VBD-TL 1
FW-VBG 7
FW-VBG-TL 1
FW-VBN 12
FW-VBZ 4
FW-WDT 16
FW-WPO 1
FW-WPS 1
HV 3928
HV* 42
HV+TO 3
HV-HL 3
HV-NC 11
HV-TL 3
HVD 4895
HVD* 99
HVD-HL 1
HVG 281
HVG-HL 1
HVN 237
HVZ 2433
HVZ* 22
HVZ-NC 2
HVZ-TL 4
IN 120557
IN+IN 1
IN+PPO 1
IN-HL 508
IN-NC 41
IN-TL 1477
IN-TL-HL 6
JJ 64028
JJ$-TL 1
JJ+JJ-NC 2
JJ-HL 396
JJ-NC 41
JJ-TL 4107
JJ-TL-HL 26
JJ-TL-NC 1
JJR 1958
JJR+CS 1
JJR-HL 17
JJR-NC 5
JJR-TL 15
JJS 359
JJS-HL 1
JJS-TL 20
JJT 1005
JJT-HL 6
JJT-NC 1
JJT-TL 4
MD 12431
MD* 866
MD*-HL 1
MD+HV 7
MD+PPSS 1
MD+TO 2
MD-HL 27
MD-NC 2
MD-TL 8
NIL 157
NN 152470
NN$ 1480
NN$-HL 20
NN$-TL 361
NN+BEZ 34
NN+BEZ-TL 2
NN+HVD-TL 1
NN+HVZ 5
NN+HVZ-TL 1
NN+IN 1
NN+MD 2
NN+NN-NC 1
NN-HL 1471
NN-NC 118
NN-TL 13372
NN-TL-HL 129
NN-TL-NC 3
NNS 55110
NNS$ 257
NNS$-HL 4
NNS$-NC 2
NNS$-TL 74
NNS$-TL-HL 1
NNS+MD 2
NNS-HL 609
NNS-NC 26
NNS-TL 2226
NNS-TL-HL 14
NNS-TL-NC 3
NP 34476
NP$ 2565
NP$-HL 8
NP$-TL 141
NP+BEZ 25
NP+BEZ-NC 3
NP+HVZ 6
NP+HVZ-NC 1
NP+MD 2
NP-HL 517
NP-NC 15
NP-TL 4019
NP-TL-HL 7
NPS 1275
NPS$ 38
NPS$-HL 1
NPS$-TL 3
NPS-HL 8
NPS-NC 2
NPS-TL 67
NR 1566
NR$ 66
NR$-TL 11
NR+MD 1
NR-HL 10
NR-NC 4
NR-TL 309
NR-TL-HL 5
NRS 16
NRS-TL 1
OD 1935
OD-HL 8
OD-NC 1
OD-TL 201
PN 2573
PN$ 89
PN+BEZ 7
PN+HVD 1
PN+HVZ 3
PN+MD 3
PN-HL 2
PN-NC 2
PN-TL 5
PP$ 16872
PP$$ 164
PP$-HL 10
PP$-NC 13
PP$-TL 35
PPL 1233
PPL-HL 1
PPL-NC 2
PPL-TL 1
PPLS 345
PPO 11181
PPO-HL 5
PPO-NC 9
PPO-TL 13
PPS 18253
PPS+BEZ 430
PPS+BEZ-HL 1
PPS+BEZ-NC 3
PPS+HVD 83
PPS+HVZ 43
PPS+MD 144
PPS-HL 19
PPS-NC 9
PPS-TL 6
PPSS 13802
PPSS+BEM 270
PPSS+BER 278
PPSS+BER-N 1
PPSS+BER-NC 1
PPSS+BER-TL 1
PPSS+BEZ 1
PPSS+BEZ* 1
PPSS+HV 241
PPSS+HV-TL 1
PPSS+HVD 83
PPSS+MD 484
PPSS+MD-NC 2
PPSS+VB 2
PPSS-HL 25
PPSS-NC 31
PPSS-TL 9
QL 8735
QL-HL 4
QL-NC 2
QL-TL 6
QLP 261
RB 36464
RB$ 9
RB+BEZ 11
RB+BEZ-HL 1
RB+BEZ-NC 1
RB+CS 3
RB-HL 49
RB-NC 26
RB-TL 40
RBR 1182
RBR+CS 1
RBR-NC 1
RBT 101
RN 9
RP 6009
RP+IN 4
RP-HL 14
RP-NC 5
RP-TL 4
TO 14918
TO+VB 2
TO-HL 55
TO-NC 13
TO-TL 10
UH 608
UH-HL 1
UH-NC 5
UH-TL 15
VB 33693
VB+AT 2
VB+IN 3
VB+JJ-NC 1
VB+PPO 71
VB+RP 2
VB+TO 4
VB+VB-NC 1
VB-HL 125
VB-NC 41
VB-TL 96
VBD 26167
VBD-HL 8
VBD-NC 11
VBD-TL 6
VBG 17893
VBG+TO 17
VBG-HL 146
VBG-NC 16
VBG-TL 133
VBN 29186
VBN+TO 5
VBN-HL 137
VBN-NC 9
VBN-TL 591
VBN-TL-HL 6
VBN-TL-NC 3
VBZ 7373
VBZ-HL 72
VBZ-NC 7
VBZ-TL 17
WDT 5539
WDT+BER 1
WDT+BER+PP 1
WDT+BEZ 47
WDT+BEZ-HL 1
WDT+BEZ-NC 2
WDT+BEZ-TL 1
WDT+DO+PPS 1
WDT+DOD 1
WDT+HVZ 2
WDT-HL 30
WDT-NC 7
WP$ 252
WPO 280
WPO-NC 1
WPO-TL 4
WPS 3924
WPS+BEZ 21
WPS+BEZ-NC 2
WPS+BEZ-TL 1
WPS+HVD 6
WPS+HVZ 2
WPS+MD 8
WPS-HL 2
WPS-NC 3
WPS-TL 12
WQL 176
WQL-TL 5
WRB 4509
WRB+BER 1
WRB+BEZ 11
WRB+BEZ-TL 3
WRB+DO 1
WRB+DOD 6
WRB+DOD* 1
WRB+DOZ 1
WRB+IN 1
WRB+MD 1
WRB-HL 36
WRB-NC 7
WRB-TL 9
`` 8837
And then:
# POS tags that start with JJ are adjectives; sum their counts up
print(sum(pos_freq[i] for i in pos_freq if i.startswith('JJ')))
[out]:
71994
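To restrict the count to a specific category and get the adjectives themselves (as the question asks), a sketch along the same lines, using the 'news' category as an assumed example:
from collections import Counter
from nltk.corpus import brown

# count adjective tokens (tags starting with JJ) in one category only
adj_freq = Counter(word.lower()
                   for word, tag in brown.tagged_words(categories='news')
                   if tag.startswith('JJ'))

# distinct adjectives and the most frequent ones
adjectives = list(adj_freq)
print(adj_freq.most_common(10))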
