I have a dataframe 'df'. Using the validation data validData, I want to compute the response rate (Florence = 1/Yes) using the rfm_aboveavg (RFM combinations response rates above the overall response). Response rate is given by considering 0/No and 1/Yes, so it would be rfm_crosstab[1] / rfm_crosstab['All'].
Using the results from the validation data, I want to only display the rows that are also shown in the training data output by the RFM column. How do I do this?
Data: 'df'
Seq# ID# Gender M R F FirstPurch ChildBks YouthBks CookBks ... ItalCook ItalAtlas ItalArt Florence Related Purchase Mcode Rcode Fcode Yes_Florence No_Florence
0 1 25 1 297 14 2 22 0 1 1 ... 0 0 0 0 0 5 4 2 0 1
1 2 29 0 128 8 2 10 0 0 0 ... 0 0 0 0 0 4 3 2 0 1
2 3 46 1 138 22 7 56 2 1 2 ... 1 0 0 0 2 4 4 3 0 1
3 4 47 1 228 2 1 2 0 0 0 ... 0 0 0 0 0 5 1 1 0 1
4 5 51 1 257 10 1 10 0 0 0 ... 0 0 0 0 0 5 3 1 0 1
My code: Crosstab for training data trainData
trainData, validData = train_test_split(df, test_size=0.4, random_state=1)
# Response rate for training data as a whole
responseRate = (sum(trainData.Florence == 1) / sum(trainData.Florence == 0)) * 100
# Response rate for RFM categories
# RFM: Combine R, F, M categories into one category
trainData['RFM'] = trainData['Mcode'].astype(str) + trainData['Rcode'].astype(str) + trainData['Fcode'].astype(str)
rfm_crosstab = pd.crosstab(index = [trainData['RFM']], columns = trainData['Florence'], margins = True)
rfm_crosstab['Percentage of 1/Yes'] = 100 * (rfm_crosstab[1] / rfm_crosstab['All'])
# RFM combinations response rates above the overall response
rfm_aboveavg = rfm_crosstab['Percentage of 1/Yes'] > responseRate
rfm_crosstab[rfm_aboveavg]
Output: Training data
Florence 0 1 All Percentage of 1/Yes
RFM
121 3 2 5 40.000000
131 9 1 10 10.000000
212 1 2 3 66.666667
221 6 3 9 33.333333
222 6 1 7 14.285714
313 2 1 3 33.333333
321 17 3 20 15.000000
322 20 4 24 16.666667
323 2 1 3 33.333333
341 61 10 71 14.084507
343 17 2 19 10.526316
411 12 3 15 20.000000
422 26 5 31 16.129032
423 32 8 40 20.000000
441 96 12 108 11.111111
511 19 4 23 17.391304
513 44 8 52 15.384615
521 24 5 29 17.241379
523 74 16 90 17.777778
533 177 28 205 13.658537
My code: Crosstab for validation data validData
# Response rate for RFM categories
# RFM: Combine R, F, M categories into one category
validData['RFM'] = validData['Mcode'].astype(str) + validData['Rcode'].astype(str) + validData['Fcode'].astype(str)
rfm_crosstab1 = pd.crosstab(index = [validData['RFM']], columns = validData['Florence'], margins = True)
rfm_crosstab1['Percentage of 1/Yes'] = 100 * (rfm_crosstab1[1] / rfm_crosstab1['All'])
rfm_crosstab1
Output: Validation data
Florence 0 1 All Percentage of 1/Yes
RFM
131 3 1 4 25.000000
141 8 0 8 0.000000
211 2 1 3 33.333333
212 2 0 2 0.000000
213 0 1 1 100.000000
221 5 0 5 0.000000
222 2 0 2 0.000000
231 21 1 22 4.545455
232 3 0 3 0.000000
233 1 0 1 0.000000
241 11 1 12 8.333333
242 8 0 8 0.000000
243 2 0 2 0.000000
311 7 0 7 0.000000
312 8 0 8 0.000000
313 1 0 1 0.000000
321 12 0 12 0.000000
322 13 0 13 0.000000
323 4 1 5 20.000000
331 19 1 20 5.000000
332 25 2 27 7.407407
333 11 1 12 8.333333
341 36 2 38 5.263158
342 30 2 32 6.250000
343 12 0 12 0.000000
411 8 2 10 20.000000
412 7 0 7 0.000000
413 13 1 14 7.142857
421 21 2 23 8.695652
422 30 1 31 3.225806
423 26 1 27 3.703704
431 51 3 54 5.555556
432 42 7 49 14.285714
433 41 5 46 10.869565
441 68 2 70 2.857143
442 78 3 81 3.703704
443 70 5 75 6.666667
511 17 0 17 0.000000
512 13 1 14 7.142857
513 26 6 32 18.750000
521 19 1 20 5.000000
522 25 6 31 19.354839
523 50 6 56 10.714286
531 66 3 69 4.347826
532 65 3 68 4.411765
533 128 24 152 15.789474
541 86 7 93 7.526882
542 100 6 106 5.660377
543 178 17 195 8.717949
All 1474 126 1600 7.875000
Related
How can I merge and sum the columns with the same name?
So the output should be 1 Column named Canada as a result of the sum of the 4 columns named Canada.
Country/Region Brazil Canada Canada Canada Canada
Week 1 0 3 0 0 0
Week 2 0 17 0 0 0
Week 3 0 21 0 0 0
Week 4 0 21 0 0 0
Week 5 0 23 0 0 0
Week 6 0 80 0 5 0
Week 7 0 194 0 20 0
Week 8 12 702 3 199 20
Week 9 182 2679 16 2395 260
Week 10 737 8711 80 17928 892
Week 11 1674 25497 153 48195 1597
Week 12 2923 46392 175 85563 2003
Week 13 4516 76095 182 122431 2180
Week 14 6002 105386 183 163539 2431
Week 15 6751 127713 189 210409 2995
Week 16 7081 147716 189 258188 3845
From its current state, this should give the outcome you're looking for:
df = df.set_index('Country/Region') # optional
df.groupby(df.columns, axis=1).sum() # Stolen from Scott Boston as it's a superior method.
Output:
index Brazil Canada
Country/Region
Week 1 0 3
Week 2 0 17
Week 3 0 21
Week 4 0 21
Week 5 0 23
Week 6 0 85
Week 7 0 214
Week 8 12 924
Week 9 182 5350
Week 10 737 27611
Week 11 1674 75442
Week 12 2923 134133
Week 13 4516 200888
Week 14 6002 271539
Week 15 6751 341306
Week 16 7081 409938
I found your dataset interesting, here's how I would clean it up from step 1:
df = pd.read_csv('file.csv')
df = df.set_index(['Province/State', 'Country/Region', 'Lat', 'Long']).stack().reset_index()
df.columns = ['Province/State', 'Country/Region', 'Lat', 'Long', 'date', 'value']
df['date'] = pd.to_datetime(df['date'])
df = df.set_index('date')
df = df.pivot_table(index=df.index, columns='Country/Region', values='value', aggfunc=np.sum)
print(df)
Output:
Country/Region Afghanistan Albania Algeria Andorra Angola ... West Bank and Gaza Western Sahara Yemen Zambia Zimbabwe
date ...
2020-01-22 0 0 0 0 0 ... 0 0 0 0 0
2020-01-23 0 0 0 0 0 ... 0 0 0 0 0
2020-01-24 0 0 0 0 0 ... 0 0 0 0 0
2020-01-25 0 0 0 0 0 ... 0 0 0 0 0
2020-01-26 0 0 0 0 0 ... 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ...
2020-07-30 36542 5197 29831 922 1109 ... 11548 10 1726 5555 3092
2020-07-31 36675 5276 30394 925 1148 ... 11837 10 1728 5963 3169
2020-08-01 36710 5396 30950 925 1164 ... 12160 10 1730 6228 3659
2020-08-02 36710 5519 31465 925 1199 ... 12297 10 1734 6347 3921
2020-08-03 36747 5620 31972 937 1280 ... 12541 10 1734 6580 4075
If you now want to do weekly aggregations, it's as simple as:
print(df.resample('w').sum())
Output:
Country/Region Afghanistan Albania Algeria Andorra Angola ... West Bank and Gaza Western Sahara Yemen Zambia Zimbabwe
date ...
2020-01-26 0 0 0 0 0 ... 0 0 0 0 0
2020-02-02 0 0 0 0 0 ... 0 0 0 0 0
2020-02-09 0 0 0 0 0 ... 0 0 0 0 0
2020-02-16 0 0 0 0 0 ... 0 0 0 0 0
2020-02-23 0 0 0 0 0 ... 0 0 0 0 0
2020-03-01 7 0 6 0 0 ... 0 0 0 0 0
2020-03-08 10 0 85 7 0 ... 43 0 0 0 0
2020-03-15 57 160 195 7 0 ... 209 0 0 0 0
2020-03-22 175 464 705 409 5 ... 309 0 0 11 7
2020-03-29 632 1142 2537 1618 29 ... 559 0 0 113 31
2020-04-05 1783 2000 6875 2970 62 ... 1178 4 0 262 59
2020-04-12 3401 2864 11629 4057 128 ... 1847 30 3 279 84
2020-04-19 5838 3603 16062 4764 143 ... 2081 42 7 356 154
2020-04-26 8918 4606 21211 5087 174 ... 2353 42 7 541 200
2020-05-03 15149 5391 27943 5214 208 ... 2432 42 41 738 244
2020-05-10 25286 5871 36315 5265 274 ... 2607 42 203 1260 241
2020-05-17 39634 6321 45122 5317 327 ... 2632 42 632 3894 274
2020-05-24 61342 6798 54185 5332 402 ... 2869 45 1321 5991 354
2020-05-31 91885 7517 62849 5344 536 ... 3073 63 1932 7125 894
2020-06-07 126442 8378 68842 5868 609 ... 3221 63 3060 7623 1694
2020-06-14 159822 9689 74147 5967 827 ... 3396 63 4236 8836 2335
2020-06-21 191378 12463 79737 5981 1142 ... 4466 63 6322 9905 3089
2020-06-28 210487 15349 87615 5985 1522 ... 10242 70 7360 10512 3813
2020-07-05 224560 18707 102918 5985 2186 ... 21897 70 8450 11322 4426
2020-07-12 237087 22399 124588 5985 2940 ... 36949 70 9489 13002 6200
2020-07-19 245264 26845 149611 6098 4279 ... 52323 70 10855 16350 9058
2020-07-26 250970 31255 178605 6237 5919 ... 68154 70 11571 26749 14933
2020-08-02 255739 36370 208457 6429 7648 ... 80685 70 12023 38896 22241
2020-08-09 36747 5620 31972 937 1280 ... 12541 10 1734 6580 4075
Try:
np.random.seed(0)
df = pd.DataFrame(np.random.randint(0,100,(20,5)), columns=[*'ZAABC'])
df.groupby(df.columns, axis=1, sort=False).sum()
Output:
Z A B C
0 44 111 67 67
1 9 104 36 87
2 70 176 12 58
3 65 126 46 88
4 81 62 77 72
5 9 100 69 79
6 47 146 99 88
7 49 48 19 14
8 39 97 9 57
9 32 105 23 35
10 75 83 34 0
11 0 89 5 38
12 17 83 42 58
13 31 66 41 57
14 35 57 82 91
15 0 113 53 12
16 42 159 68 6
17 68 50 76 52
18 78 35 99 58
19 23 92 85 48
You can try a transpose and groupby, e.g. something similar to the below.
df_T = df.tranpose()
df_T.groupby(df_T.index).sum()['Canada']
Here's a way to do it:
df.columns = [(col + str(i)) if col.startswith('Canada') else col for i, col in enumerate(df.columns)]
df = df.assign(Canada=df.filter(like='Canada').sum(axis=1)).drop(columns=[x for x in df.columns if x.startswith('Canada') and x != 'Canada'])
First we rename the columns starting with Canada by appending their integer position, which ensures they are no longer duplicates.
Then we use sum() to add across columns like Canada, put the result in a new column named Canada, and drop the columns that were originally named Canada.
Full test code is:
import pandas as pd
df = pd.DataFrame(
columns=[x.strip() for x in 'Brazil Canada Canada Canada Canada'.split()],
index=['Week ' + str(i) for i in range(1, 17)],
data=[[i] * 5 for i in range(1, 17)])
df.columns.names=['Country/Region']
print(df)
df.columns = [(col + str(i)) if col.startswith('Canada') else col for i, col in enumerate(df.columns)]
df = df.assign(Canada=df.filter(like='Canada').sum(axis=1)).drop(columns=[x for x in df.columns if x.startswith('Canada') and x != 'Canada'])
print(df)
Output:
Country/Region Brazil Canada Canada Canada Canada
Week 1 1 1 1 1 1
Week 2 2 2 2 2 2
Week 3 3 3 3 3 3
Week 4 4 4 4 4 4
Week 5 5 5 5 5 5
Week 6 6 6 6 6 6
Week 7 7 7 7 7 7
Week 8 8 8 8 8 8
Week 9 9 9 9 9 9
Week 10 10 10 10 10 10
Week 11 11 11 11 11 11
Week 12 12 12 12 12 12
Week 13 13 13 13 13 13
Week 14 14 14 14 14 14
Week 15 15 15 15 15 15
Week 16 16 16 16 16 16
Brazil Canada
Week 1 1 4
Week 2 2 8
Week 3 3 12
Week 4 4 16
Week 5 5 20
Week 6 6 24
Week 7 7 28
Week 8 8 32
Week 9 9 36
Week 10 10 40
Week 11 11 44
Week 12 12 48
Week 13 13 52
Week 14 14 56
Week 15 15 60
Week 16 16 64
I am trying to append every column from one row into another row, I want to do this for every row, but some row will not have any values, take a look at my code it will be more clear:
Here is my data
date day_of_week day_of_month day_of_year month_of_year
5/1/2017 0 1 121 5
5/2/2017 1 2 122 5
5/3/2017 2 3 123 5
5/4/2017 3 4 124 5
5/8/2017 0 8 128 5
5/9/2017 1 9 129 5
5/10/2017 2 10 130 5
5/11/2017 3 11 131 5
5/12/2017 4 12 132 5
5/15/2017 0 15 135 5
5/16/2017 1 16 136 5
5/17/2017 2 17 137 5
5/18/2017 3 18 138 5
5/19/2017 4 19 139 5
5/23/2017 1 23 143 5
5/24/2017 2 24 144 5
5/25/2017 3 25 145 5
5/26/2017 4 26 146 5
Here is my current code:
s = df_md['date'].shift(-1)
df_md['next_calendarday'] = s.mask(s.dt.dayofweek.diff().lt(0))
df_md.set_index('date', inplace=True)
df_md.apply(lambda row: GetNextDayMarketData(row, df_md), axis=1)
def GetNextDayMarketData(row, dataframe):
if(row['next_calendarday'] is pd.NaT):
return
key = row['next_calendarday'].strftime("%Y-%m-%d")
nextrow = dataframe.loc[key]
for index, val in nextrow.iteritems():
if(index != "next_calendarday"):
dataframe.loc[row.name, index+'_nextday'] = val
This works but it's so slow it might as well not work. Here is what the result should look like, you can see that the value from the next row has been added to the previous row. The kicker is that it's the next calendar date and not just the next row in the sequence. If a row does not have an entry for next calendar date, it will simply be blank.
Here is the expected result in csv
date day_of_week day_of_month day_of_year month_of_year next_workingday day_of_week_nextday day_of_month_nextday day_of_year_nextday month_of_year_nextday
5/1/2017 0 1 121 5 5/2/2017 1 2 122 5
5/2/2017 1 2 122 5 5/3/2017 2 3 123 5
5/3/2017 2 3 123 5 5/4/2017 3 4 124 5
5/4/2017 3 4 124 5
5/8/2017 0 8 128 5 5/9/2017 1 9 129 5
5/9/2017 1 9 129 5 5/10/2017 2 10 130 5
5/10/2017 2 10 130 5 5/11/2017 3 11 131 5
5/11/2017 3 11 131 5 5/12/2017 4 12 132 5
5/12/2017 4 12 132 5
5/15/2017 0 15 135 5 5/16/2017 1 16 136 5
5/16/2017 1 16 136 5 5/17/2017 2 17 137 5
5/17/2017 2 17 137 5 5/18/2017 3 18 138 5
5/18/2017 3 18 138 5 5/19/2017 4 19 139 5
5/19/2017 4 19 139 5
5/23/2017 1 23 143 5 5/24/2017 2 24 144 5
5/24/2017 2 24 144 5 5/25/2017 3 25 145 5
5/25/2017 3 25 145 5 5/26/2017 4 26 146 5
5/26/2017 4 26 146 5
5/30/2017 1 30 150 5
Use DataFrame.join with remove column next_calendarday_nextday:
df = df.set_index('date')
df = (df.join(df, on='next_calendarday', rsuffix='_nextday')
.drop('next_calendarday_nextday', axis=1))
I have an set of stock information, with datetime set as index, stock market only open on weekdays so all my rows are weekdays, which is fine, I would like to determine if a row is start of the week or end of week, which might NOT always fall on Monday/Friday due to holidays. A better idea is to determine if there is an row entry on the next/previous day in the dataframe ( since my data is guaranteed to only exist for workday), but I dont know how to calculate this. Here is an example of my data:
date day_of_week day_of_month day_of_year month_of_year
5/1/2017 0 1 121 5
5/2/2017 1 2 122 5
5/3/2017 2 3 123 5
5/4/2017 3 4 124 5
5/8/2017 0 8 128 5
5/9/2017 1 9 129 5
5/10/2017 2 10 130 5
5/11/2017 3 11 131 5
5/12/2017 4 12 132 5
5/15/2017 0 15 135 5
5/16/2017 1 16 136 5
5/17/2017 2 17 137 5
5/18/2017 3 18 138 5
5/19/2017 4 19 139 5
5/23/2017 1 23 143 5
5/24/2017 2 24 144 5
5/25/2017 3 25 145 5
5/26/2017 4 26 146 5
5/30/2017 1 30 150 5
Here is my current code
# Date fields
def DateFields(df_input):
dates = df_input.index.to_series()
df_input['day_of_week'] = dates.dt.dayofweek
df_input['day_of_month'] = dates.dt.day
df_input['day_of_year'] = dates.dt.dayofyear
df_input['month_of_year'] = dates.dt.month
df_input['isWeekStart'] = "No" #<--- Need help here
df_input['isWeekEnd'] = "No" #<--- Need help here
df_input['date'] = dates.dt.strftime('%Y-%m-%d')
return df_input
How can I calculate if a row is beginning of week and end of week?
Example of what I am looking for:
date day_of_week day_of_month day_of_year month_of_year isWeekStart isWeekEnd
5/1/2017 0 1 121 5 1 0
5/2/2017 1 2 122 5 0 0
5/3/2017 2 3 123 5 0 0
5/4/2017 3 4 124 5 0 1 # short week, Thursday is last work day
5/8/2017 0 8 128 5 1 0
5/9/2017 1 9 129 5 0 0
5/10/2017 2 10 130 5 0 0
5/11/2017 3 11 131 5 0 0
5/12/2017 4 12 132 5 0 1
5/15/2017 0 15 135 5 1 0
5/16/2017 1 16 136 5 0 0
5/17/2017 2 17 137 5 0 0
5/18/2017 3 18 138 5 0 0
5/19/2017 4 19 139 5 0 1
5/23/2017 1 23 143 5 1 0 # short week, Tuesday is first work day
5/24/2017 2 24 144 5 0 0
5/25/2017 3 25 145 5 0 0
5/26/2017 4 26 146 5 0 1
5/30/2017 1 30 150 5 1 0
EDIT: I forgot that some holidays fall during the middle of week, in this situation, it would be good if it can treat these as a separate "week" with before and after marked accordingly. Although if it's not smart enough to figure this out, just getting the long weekend would be a good start.
Here's an idea with BusinessDay:
prev_working_day = df['date'] - pd.tseries.offsets.BusinessDay(1)
df['isFirstWeekDay'] = (df['date'].dt.isocalendar().week !=
prev_working_day.dt.isocalendar().week)
And similar for last business day. Note that the default holiday calendar is US'. Check out this post for a different one.
Output:
date day_of_week day_of_month day_of_year month_of_year isFirstWeekDay
0 2017-05-01 0 1 121 5 True
1 2017-05-02 1 2 122 5 False
2 2017-05-03 2 3 123 5 False
3 2017-05-04 3 4 124 5 False
4 2017-05-08 0 8 128 5 True
5 2017-05-09 1 9 129 5 False
6 2017-05-10 2 10 130 5 False
7 2017-05-11 3 11 131 5 False
8 2017-05-12 4 12 132 5 False
9 2017-05-15 0 15 135 5 True
10 2017-05-16 1 16 136 5 False
11 2017-05-17 2 17 137 5 False
12 2017-05-18 3 18 138 5 False
13 2017-05-19 4 19 139 5 False
14 2017-05-23 1 23 143 5 False
15 2017-05-24 2 24 144 5 False
16 2017-05-25 3 25 145 5 False
17 2017-05-26 4 26 146 5 False
18 2017-05-30 1 30 150 5 False
Here's an approach using weekly groupby.
df['date'] = pd.to_datetime(df['date'])
business_days = df.assign(date_copy = df['date']).groupby(pd.Grouper(key='date_copy', freq='W'))['date'].apply(list).to_frame()
business_days['isWeekStart'] = business_days['date'].apply(lambda x: [1 if i == min(x) else 0 for i in x])
business_days['isWeekEnd'] = business_days['date'].apply(lambda x: [1 if i == max(x) else 0 for i in x])
business_days = business_days.apply(pd.Series.explode)
pd.merge(df, business_days, left_on='date', right_on='date')
output:
date day_of_week day_of_month day_of_year month_of_year isWeekStart isWeekEnd
0 2017-05-01 0 1 121 5 1 0
1 2017-05-02 1 2 122 5 0 0
2 2017-05-03 2 3 123 5 0 0
3 2017-05-04 3 4 124 5 0 1
4 2017-05-08 0 8 128 5 1 0
5 2017-05-09 1 9 129 5 0 0
6 2017-05-10 2 10 130 5 0 0
7 2017-05-11 3 11 131 5 0 0
8 2017-05-12 4 12 132 5 0 1
9 2017-05-15 0 15 135 5 1 0
10 2017-05-16 1 16 136 5 0 0
11 2017-05-17 2 17 137 5 0 0
12 2017-05-18 3 18 138 5 0 0
13 2017-05-19 4 19 139 5 0 1
14 2017-05-23 1 23 143 5 1 0
15 2017-05-24 2 24 144 5 0 0
16 2017-05-25 3 25 145 5 0 0
17 2017-05-26 4 26 146 5 0 1
18 2017-05-30 1 30 150 5 1 1
Note that 2017-05-30 is marked as both WeekStart and WeekEnd because it is the only date of that week.
I have the following table in pandas
user_id idaggregate_info num_events num_lark_convo_events num_meals_logged num_breakfasts num_lunches num_dinners num_snacks total_activity sleep_duration num_activity_events num_weights num_notifs idusermission completed mission_delta
0 0 406 94 20 7 2 2 2 1 4456 47738 72 0 18 1426 0 NaT
1 1 1247 121 48 26 8 7 2 9 48695 37560 53 14 48 1379 1 7 days 10:04:28
2 1 1247 121 48 26 8 7 2 9 48695 37560 53 14 48 1379 1 NaT
3 2 2088 356 32 15 6 6 1 2 41598 184113 314 1 21 967 1 8 days 00:03:05
4 2 2088 356 32 15 6 6 1 2 41598 184113 314 1 21 967 1 NaT
Some user_ids have multiple lines that are identical except for their different mission_delta values. How do I transform this into one line for each id, with a columns named "mission_delta_1", "mission_delta_2" (the number of them vary, it could be 1 per user_id to maybe 5 per user_id so naming has to be iterative_ etc so output would be:
user_id idaggregate_info num_events num_lark_convo_events num_meals_logged num_breakfasts num_lunches num_dinners num_snacks total_activity sleep_duration num_activity_events num_weights num_notifs idusermission completed mission_delta_1 mission_delta_2
0 0 406 94 20 7 2 2 2 1 4456 47738 72 0 18 1426 0 NaT
1 1 1247 121 48 26 8 7 2 9 48695 37560 53 14 48 1379 1 7 days 10:04:28 NaT
2 2 2088 356 32 15 6 6 1 2 41598 184113 314 1 21 967 1 8 days 00:03:05 NaT
Not a duplicate as those address exploding all columns, there is just one that needs to be unstacked. The solutions offered in the duplicate link fail:
df.groupby(level=0).apply(lambda x: pd.Series(x.values.flatten()))
produces the same df as the original with different labels
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
0 0 406 94 20 7 2 2 2 1 4456 47738 72 0 18 1426 0 NaT
1 1 1247 121 48 26 8 7 2 9 48695 37560 53 14 48 1379 1 7 days 10:04:28
2 1 1247 121 48 26 8 7 2 9 48695 37560 53 14 48 1379 1 NaT
3 2 2088 356 32 15 6 6 1 2 41598 184113 314 1 21 967 1 8 days 00:03:05
The next options:
result2.groupby(level=0).apply(lambda x: pd.Series(x.stack().values))
produces:
0 0 0
1 406
2 94
3 20
4 7
and
df.groupby(level=0).apply(lambda x: x.values.ravel()).apply(pd.Series)
produces the original dataframe:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
0 0 406 94 20 7 2 2 2 1 4456 47738 72 0 18 1426 0 NaT
1 1 1247 121 48 26 8 7 2 9 48695 37560 53 14 48 1379 1 7 days 10:04:28
2 1 1247 121 48 26 8 7 2 9 48695 37560 53 14 48 1379 1 NaT
3 2 2088 356 32 15 6 6 1 2 41598 184113 314 1 21 967 1 8 days 00:03:05
In essence, I want to turn a df:
id mission_delta
0 NaT
1 1 day
1 2 days
1 1 day
2 5 days
2 NaT
into
id mission_delta1 mission_delta_2 mission_delta_3
0 NaT NaT NaT
1 1 day 2 days 1 day
2 5 days NaT NaT
You might try this;
grp = df.groupby('id')
df_res = grp['mission_delta'].apply(lambda x: pd.Series(x.values)).unstack().fillna('NaT')
df_res = df_res.rename(columns={i: 'mission_delta_{}'.format(i + 1) for i in range(len(df_res))})
print(df_res)
mission_delta_1 mission_delta_2 mission_delta_3
id
0 NaT NaT NaT
1 1 day 2 days 1 day
2 5 days NaT NaT
I'm a beginner to this and I would like to know if there is a possibility to extract an adjective frequency from categories in brown corpus and create a list of this adjectives with Python.
from collections import Counter
from nltk.corpus import brown
# Split the words and POS tags
words, poss = zip(*brown.tagged_words())
# Put them into a Counter object
pos_freq = Counter(poss)
for pos in pos_freq:
print pos, pos_freq[pos]
[out]:
' 317
'' 8789
( 2264
(-HL 162
) 2273
)-HL 184
* 4603
*-HL 8
*-NC 1
*-TL 1
, 58156
,-HL 171
,-NC 5
,-TL 4
-- 3405
---HL 26
. 60638
.-HL 598
.-NC 16
.-TL 2
: 1558
:-HL 138
:-TL 22
ABL 357
ABN 3010
ABN-HL 4
ABN-NC 1
ABN-TL 7
ABX 730
AP 9522
AP$ 9
AP+AP-NC 1
AP-HL 40
AP-NC 2
AP-TL 18
AT 97959
AT-HL 332
AT-NC 35
AT-TL 746
AT-TL-HL 5
BE 6360
BE-HL 13
BE-TL 1
BED 3282
BED* 22
BED-NC 3
BEDZ 9806
BEDZ* 154
BEDZ-HL 1
BEDZ-NC 8
BEG 686
BEM 226
BEM* 9
BEM-NC 2
BEN 2470
BEN-TL 2
BER 4379
BER* 47
BER*-NC 1
BER-HL 11
BER-NC 5
BER-TL 6
BEZ 10066
BEZ* 117
BEZ-HL 30
BEZ-NC 5
BEZ-TL 8
CC 37718
CC-HL 119
CC-NC 5
CC-TL 307
CC-TL-HL 2
CD 13510
CD$ 5
CD-HL 444
CD-NC 5
CD-TL 898
CD-TL-HL 17
CS 22143
CS-HL 25
CS-NC 5
CS-TL 2
DO 1353
DO* 485
DO*-HL 3
DO+PPSS 1
DO-HL 4
DO-NC 2
DO-TL 5
DOD 1047
DOD* 402
DOD*-TL 1
DOD-NC 1
DOZ 467
DOZ* 89
DOZ*-TL 1
DOZ-HL 16
DOZ-TL 2
DT 8957
DT$ 5
DT+BEZ 179
DT+BEZ-NC 1
DT+MD 3
DT-HL 6
DT-NC 7
DT-TL 9
DTI 2921
DTI-HL 6
DTI-TL 2
DTS 2435
DTS+BEZ 2
DTS-HL 2
DTX 104
EX 2164
EX+BEZ 105
EX+HVD 3
EX+HVZ 2
EX+MD 4
EX-HL 1
EX-NC 1
FW-* 6
FW-*-TL 2
FW-AT 24
FW-AT+NN-TL 13
FW-AT+NP-TL 2
FW-AT-HL 1
FW-AT-TL 44
FW-BE 1
FW-BER 3
FW-BEZ 4
FW-CC 27
FW-CC-TL 14
FW-CD 7
FW-CD-TL 2
FW-CS 3
FW-DT 2
FW-DT+BEZ 2
FW-DTS 1
FW-HV 1
FW-IN 84
FW-IN+AT 4
FW-IN+AT-T 3
FW-IN+AT-TL 18
FW-IN+NN 5
FW-IN+NN-TL 2
FW-IN+NP-TL 2
FW-IN-TL 40
FW-JJ 53
FW-JJ-NC 2
FW-JJ-TL 74
FW-JJR 1
FW-JJT 1
FW-NN 288
FW-NN$ 9
FW-NN$-TL 4
FW-NN-NC 6
FW-NN-TL 170
FW-NN-TL-NC 1
FW-NNS 83
FW-NNS-NC 2
FW-NNS-TL 36
FW-NP 7
FW-NP-TL 4
FW-NPS 2
FW-NPS-TL 1
FW-NR 1
FW-NR-TL 3
FW-OD-NC 1
FW-OD-TL 4
FW-PN 1
FW-PP$ 3
FW-PP$-NC 1
FW-PP$-TL 2
FW-PPL 9
FW-PPL+VBZ 2
FW-PPO 4
FW-PPO+IN 3
FW-PPS 1
FW-PPSS 6
FW-PPSS+HV 1
FW-QL 1
FW-RB 32
FW-RB+CC 1
FW-RB-TL 3
FW-TO+VB 1
FW-UH 8
FW-UH-NC 1
FW-UH-TL 1
FW-VB 26
FW-VB-NC 3
FW-VB-TL 1
FW-VBD 2
FW-VBD-TL 1
FW-VBG 7
FW-VBG-TL 1
FW-VBN 12
FW-VBZ 4
FW-WDT 16
FW-WPO 1
FW-WPS 1
HV 3928
HV* 42
HV+TO 3
HV-HL 3
HV-NC 11
HV-TL 3
HVD 4895
HVD* 99
HVD-HL 1
HVG 281
HVG-HL 1
HVN 237
HVZ 2433
HVZ* 22
HVZ-NC 2
HVZ-TL 4
IN 120557
IN+IN 1
IN+PPO 1
IN-HL 508
IN-NC 41
IN-TL 1477
IN-TL-HL 6
JJ 64028
JJ$-TL 1
JJ+JJ-NC 2
JJ-HL 396
JJ-NC 41
JJ-TL 4107
JJ-TL-HL 26
JJ-TL-NC 1
JJR 1958
JJR+CS 1
JJR-HL 17
JJR-NC 5
JJR-TL 15
JJS 359
JJS-HL 1
JJS-TL 20
JJT 1005
JJT-HL 6
JJT-NC 1
JJT-TL 4
MD 12431
MD* 866
MD*-HL 1
MD+HV 7
MD+PPSS 1
MD+TO 2
MD-HL 27
MD-NC 2
MD-TL 8
NIL 157
NN 152470
NN$ 1480
NN$-HL 20
NN$-TL 361
NN+BEZ 34
NN+BEZ-TL 2
NN+HVD-TL 1
NN+HVZ 5
NN+HVZ-TL 1
NN+IN 1
NN+MD 2
NN+NN-NC 1
NN-HL 1471
NN-NC 118
NN-TL 13372
NN-TL-HL 129
NN-TL-NC 3
NNS 55110
NNS$ 257
NNS$-HL 4
NNS$-NC 2
NNS$-TL 74
NNS$-TL-HL 1
NNS+MD 2
NNS-HL 609
NNS-NC 26
NNS-TL 2226
NNS-TL-HL 14
NNS-TL-NC 3
NP 34476
NP$ 2565
NP$-HL 8
NP$-TL 141
NP+BEZ 25
NP+BEZ-NC 3
NP+HVZ 6
NP+HVZ-NC 1
NP+MD 2
NP-HL 517
NP-NC 15
NP-TL 4019
NP-TL-HL 7
NPS 1275
NPS$ 38
NPS$-HL 1
NPS$-TL 3
NPS-HL 8
NPS-NC 2
NPS-TL 67
NR 1566
NR$ 66
NR$-TL 11
NR+MD 1
NR-HL 10
NR-NC 4
NR-TL 309
NR-TL-HL 5
NRS 16
NRS-TL 1
OD 1935
OD-HL 8
OD-NC 1
OD-TL 201
PN 2573
PN$ 89
PN+BEZ 7
PN+HVD 1
PN+HVZ 3
PN+MD 3
PN-HL 2
PN-NC 2
PN-TL 5
PP$ 16872
PP$$ 164
PP$-HL 10
PP$-NC 13
PP$-TL 35
PPL 1233
PPL-HL 1
PPL-NC 2
PPL-TL 1
PPLS 345
PPO 11181
PPO-HL 5
PPO-NC 9
PPO-TL 13
PPS 18253
PPS+BEZ 430
PPS+BEZ-HL 1
PPS+BEZ-NC 3
PPS+HVD 83
PPS+HVZ 43
PPS+MD 144
PPS-HL 19
PPS-NC 9
PPS-TL 6
PPSS 13802
PPSS+BEM 270
PPSS+BER 278
PPSS+BER-N 1
PPSS+BER-NC 1
PPSS+BER-TL 1
PPSS+BEZ 1
PPSS+BEZ* 1
PPSS+HV 241
PPSS+HV-TL 1
PPSS+HVD 83
PPSS+MD 484
PPSS+MD-NC 2
PPSS+VB 2
PPSS-HL 25
PPSS-NC 31
PPSS-TL 9
QL 8735
QL-HL 4
QL-NC 2
QL-TL 6
QLP 261
RB 36464
RB$ 9
RB+BEZ 11
RB+BEZ-HL 1
RB+BEZ-NC 1
RB+CS 3
RB-HL 49
RB-NC 26
RB-TL 40
RBR 1182
RBR+CS 1
RBR-NC 1
RBT 101
RN 9
RP 6009
RP+IN 4
RP-HL 14
RP-NC 5
RP-TL 4
TO 14918
TO+VB 2
TO-HL 55
TO-NC 13
TO-TL 10
UH 608
UH-HL 1
UH-NC 5
UH-TL 15
VB 33693
VB+AT 2
VB+IN 3
VB+JJ-NC 1
VB+PPO 71
VB+RP 2
VB+TO 4
VB+VB-NC 1
VB-HL 125
VB-NC 41
VB-TL 96
VBD 26167
VBD-HL 8
VBD-NC 11
VBD-TL 6
VBG 17893
VBG+TO 17
VBG-HL 146
VBG-NC 16
VBG-TL 133
VBN 29186
VBN+TO 5
VBN-HL 137
VBN-NC 9
VBN-TL 591
VBN-TL-HL 6
VBN-TL-NC 3
VBZ 7373
VBZ-HL 72
VBZ-NC 7
VBZ-TL 17
WDT 5539
WDT+BER 1
WDT+BER+PP 1
WDT+BEZ 47
WDT+BEZ-HL 1
WDT+BEZ-NC 2
WDT+BEZ-TL 1
WDT+DO+PPS 1
WDT+DOD 1
WDT+HVZ 2
WDT-HL 30
WDT-NC 7
WP$ 252
WPO 280
WPO-NC 1
WPO-TL 4
WPS 3924
WPS+BEZ 21
WPS+BEZ-NC 2
WPS+BEZ-TL 1
WPS+HVD 6
WPS+HVZ 2
WPS+MD 8
WPS-HL 2
WPS-NC 3
WPS-TL 12
WQL 176
WQL-TL 5
WRB 4509
WRB+BER 1
WRB+BEZ 11
WRB+BEZ-TL 3
WRB+DO 1
WRB+DOD 6
WRB+DOD* 1
WRB+DOZ 1
WRB+IN 1
WRB+MD 1
WRB-HL 36
WRB-NC 7
WRB-TL 9
`` 8837
And then:
# POS that starts with JJ are adjectives, sum the counts up
print sum(pos_freq[i] for i in pos_freq if i.startswith('JJ'))
[out]:
71994