How to join/merge and sum columns with the same name - python

How can I merge and sum the columns with the same name?
So the output should be 1 Column named Canada as a result of the sum of the 4 columns named Canada.
Country/Region Brazil Canada Canada Canada Canada
Week 1 0 3 0 0 0
Week 2 0 17 0 0 0
Week 3 0 21 0 0 0
Week 4 0 21 0 0 0
Week 5 0 23 0 0 0
Week 6 0 80 0 5 0
Week 7 0 194 0 20 0
Week 8 12 702 3 199 20
Week 9 182 2679 16 2395 260
Week 10 737 8711 80 17928 892
Week 11 1674 25497 153 48195 1597
Week 12 2923 46392 175 85563 2003
Week 13 4516 76095 182 122431 2180
Week 14 6002 105386 183 163539 2431
Week 15 6751 127713 189 210409 2995
Week 16 7081 147716 189 258188 3845

From its current state, this should give the outcome you're looking for:
df = df.set_index('Country/Region') # optional
df.groupby(df.columns, axis=1).sum() # Stolen from Scott Boston as it's a superior method.
Output:
index Brazil Canada
Country/Region
Week 1 0 3
Week 2 0 17
Week 3 0 21
Week 4 0 21
Week 5 0 23
Week 6 0 85
Week 7 0 214
Week 8 12 924
Week 9 182 5350
Week 10 737 27611
Week 11 1674 75442
Week 12 2923 134133
Week 13 4516 200888
Week 14 6002 271539
Week 15 6751 341306
Week 16 7081 409938
I found your dataset interesting, here's how I would clean it up from step 1:
df = pd.read_csv('file.csv')
df = df.set_index(['Province/State', 'Country/Region', 'Lat', 'Long']).stack().reset_index()
df.columns = ['Province/State', 'Country/Region', 'Lat', 'Long', 'date', 'value']
df['date'] = pd.to_datetime(df['date'])
df = df.set_index('date')
df = df.pivot_table(index=df.index, columns='Country/Region', values='value', aggfunc=np.sum)
print(df)
Output:
Country/Region Afghanistan Albania Algeria Andorra Angola ... West Bank and Gaza Western Sahara Yemen Zambia Zimbabwe
date ...
2020-01-22 0 0 0 0 0 ... 0 0 0 0 0
2020-01-23 0 0 0 0 0 ... 0 0 0 0 0
2020-01-24 0 0 0 0 0 ... 0 0 0 0 0
2020-01-25 0 0 0 0 0 ... 0 0 0 0 0
2020-01-26 0 0 0 0 0 ... 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ...
2020-07-30 36542 5197 29831 922 1109 ... 11548 10 1726 5555 3092
2020-07-31 36675 5276 30394 925 1148 ... 11837 10 1728 5963 3169
2020-08-01 36710 5396 30950 925 1164 ... 12160 10 1730 6228 3659
2020-08-02 36710 5519 31465 925 1199 ... 12297 10 1734 6347 3921
2020-08-03 36747 5620 31972 937 1280 ... 12541 10 1734 6580 4075
If you now want to do weekly aggregations, it's as simple as:
print(df.resample('w').sum())
Output:
Country/Region Afghanistan Albania Algeria Andorra Angola ... West Bank and Gaza Western Sahara Yemen Zambia Zimbabwe
date ...
2020-01-26 0 0 0 0 0 ... 0 0 0 0 0
2020-02-02 0 0 0 0 0 ... 0 0 0 0 0
2020-02-09 0 0 0 0 0 ... 0 0 0 0 0
2020-02-16 0 0 0 0 0 ... 0 0 0 0 0
2020-02-23 0 0 0 0 0 ... 0 0 0 0 0
2020-03-01 7 0 6 0 0 ... 0 0 0 0 0
2020-03-08 10 0 85 7 0 ... 43 0 0 0 0
2020-03-15 57 160 195 7 0 ... 209 0 0 0 0
2020-03-22 175 464 705 409 5 ... 309 0 0 11 7
2020-03-29 632 1142 2537 1618 29 ... 559 0 0 113 31
2020-04-05 1783 2000 6875 2970 62 ... 1178 4 0 262 59
2020-04-12 3401 2864 11629 4057 128 ... 1847 30 3 279 84
2020-04-19 5838 3603 16062 4764 143 ... 2081 42 7 356 154
2020-04-26 8918 4606 21211 5087 174 ... 2353 42 7 541 200
2020-05-03 15149 5391 27943 5214 208 ... 2432 42 41 738 244
2020-05-10 25286 5871 36315 5265 274 ... 2607 42 203 1260 241
2020-05-17 39634 6321 45122 5317 327 ... 2632 42 632 3894 274
2020-05-24 61342 6798 54185 5332 402 ... 2869 45 1321 5991 354
2020-05-31 91885 7517 62849 5344 536 ... 3073 63 1932 7125 894
2020-06-07 126442 8378 68842 5868 609 ... 3221 63 3060 7623 1694
2020-06-14 159822 9689 74147 5967 827 ... 3396 63 4236 8836 2335
2020-06-21 191378 12463 79737 5981 1142 ... 4466 63 6322 9905 3089
2020-06-28 210487 15349 87615 5985 1522 ... 10242 70 7360 10512 3813
2020-07-05 224560 18707 102918 5985 2186 ... 21897 70 8450 11322 4426
2020-07-12 237087 22399 124588 5985 2940 ... 36949 70 9489 13002 6200
2020-07-19 245264 26845 149611 6098 4279 ... 52323 70 10855 16350 9058
2020-07-26 250970 31255 178605 6237 5919 ... 68154 70 11571 26749 14933
2020-08-02 255739 36370 208457 6429 7648 ... 80685 70 12023 38896 22241
2020-08-09 36747 5620 31972 937 1280 ... 12541 10 1734 6580 4075

Try:
np.random.seed(0)
df = pd.DataFrame(np.random.randint(0,100,(20,5)), columns=[*'ZAABC'])
df.groupby(df.columns, axis=1, sort=False).sum()
Output:
Z A B C
0 44 111 67 67
1 9 104 36 87
2 70 176 12 58
3 65 126 46 88
4 81 62 77 72
5 9 100 69 79
6 47 146 99 88
7 49 48 19 14
8 39 97 9 57
9 32 105 23 35
10 75 83 34 0
11 0 89 5 38
12 17 83 42 58
13 31 66 41 57
14 35 57 82 91
15 0 113 53 12
16 42 159 68 6
17 68 50 76 52
18 78 35 99 58
19 23 92 85 48

You can try a transpose and groupby, e.g. something similar to the below.
df_T = df.tranpose()
df_T.groupby(df_T.index).sum()['Canada']

Here's a way to do it:
df.columns = [(col + str(i)) if col.startswith('Canada') else col for i, col in enumerate(df.columns)]
df = df.assign(Canada=df.filter(like='Canada').sum(axis=1)).drop(columns=[x for x in df.columns if x.startswith('Canada') and x != 'Canada'])
First we rename the columns starting with Canada by appending their integer position, which ensures they are no longer duplicates.
Then we use sum() to add across columns like Canada, put the result in a new column named Canada, and drop the columns that were originally named Canada.
Full test code is:
import pandas as pd
df = pd.DataFrame(
columns=[x.strip() for x in 'Brazil Canada Canada Canada Canada'.split()],
index=['Week ' + str(i) for i in range(1, 17)],
data=[[i] * 5 for i in range(1, 17)])
df.columns.names=['Country/Region']
print(df)
df.columns = [(col + str(i)) if col.startswith('Canada') else col for i, col in enumerate(df.columns)]
df = df.assign(Canada=df.filter(like='Canada').sum(axis=1)).drop(columns=[x for x in df.columns if x.startswith('Canada') and x != 'Canada'])
print(df)
Output:
Country/Region Brazil Canada Canada Canada Canada
Week 1 1 1 1 1 1
Week 2 2 2 2 2 2
Week 3 3 3 3 3 3
Week 4 4 4 4 4 4
Week 5 5 5 5 5 5
Week 6 6 6 6 6 6
Week 7 7 7 7 7 7
Week 8 8 8 8 8 8
Week 9 9 9 9 9 9
Week 10 10 10 10 10 10
Week 11 11 11 11 11 11
Week 12 12 12 12 12 12
Week 13 13 13 13 13 13
Week 14 14 14 14 14 14
Week 15 15 15 15 15 15
Week 16 16 16 16 16 16
Brazil Canada
Week 1 1 4
Week 2 2 8
Week 3 3 12
Week 4 4 16
Week 5 5 20
Week 6 6 24
Week 7 7 28
Week 8 8 32
Week 9 9 36
Week 10 10 40
Week 11 11 44
Week 12 12 48
Week 13 13 52
Week 14 14 56
Week 15 15 60
Week 16 16 64

Related

How can I turn this into a DataFrame?

I am new to Python, and was trying to run a basic web scraper. My code looks like this
import requests
import pandas as pd
x = requests.get('https://www.baseball-reference.com/players/p/penaje02.shtml')
dfs = pd.read_html(x.content)
print(dfs)
df = pd.DataFrame(dfs)
when printing dfs it looks like this. I only want the second table.
[ Year Age Tm Lg G PA AB \
0 2018 20 HOU-min A- 36 156 136
1 2019 21 HOU-min A,A+ 109 473 409
2 2021 23 HOU-min AAA,Rk 37 160 145
3 2022 24 HOU AL 136 558 521
4 1 Yr 1 Yr 1 Yr 1 Yr 136 558 521
5 162 Game Avg. 162 Game Avg. 162 Game Avg. 162 Game Avg. 162 665 621
R H 2B ... OPS OPS+ TB GDP HBP SH SF IBB Pos \
0 22 34 5 ... 0.649 NaN 42 0 1 0 1 0 NaN
1 72 124 21 ... 0.825 NaN 180 4 11 0 6 0 NaN
2 25 43 5 ... 0.942 NaN 84 0 7 0 0 0 NaN
3 72 132 20 ... 0.715 101.0 222 5 6 1 5 0 *6/H
4 72 132 20 ... 0.715 101.0 222 5 6 1 5 0 NaN
5 86 157 24 ... 0.715 101.0 264 6 7 1 6 0 NaN
Awards
0 TRC · NYPL
1 DAV,FAY · MIDW,CARL
2 SKT,AST · AAAW,FCL
3 GG
4 NaN
5 NaN
[6 rows x 30 columns]]
however, i end up with error Must pass 2-d input. shape=(1, 6, 30) after my last line. I have tried using df=dfs[1], but got the error list index our of range. Any way i can turn dfs from a list to a datframe?
What do you mean you only want the second table? There's only one table, it's 6 rows and 30 columns. The backslashes show up when whatever you're trying to print to isn't wide enough to contain the dataframe without line wrapping. Here's the dataframe printed in a wider terminal:
The pd.read_html() function returns a List[DataFrame] so you first need to grab your dataframe from the list, and then you can subset it to get the columns you care about:
df = dfs[0]
columns = ['R', 'H', '2B', '3B', 'HR', 'RBI', 'SB', 'CS', 'BB', 'SO', 'BA', 'OBP', 'SLG', 'OPS', 'OPS+', 'TB', 'GDP', 'HBP', 'SH', 'SF', 'IBB', 'Pos']
print(df[columns])
Output:
R H 2B 3B HR RBI SB CS BB SO BA OBP SLG OPS OPS+ TB GDP HBP SH SF IBB Pos
0 22 34 5 0 1 10 3 0 18 19 0.250 0.340 0.309 0.649 NaN 42 0 1 0 1 0 NaN
1 72 124 21 7 7 54 20 10 47 90 0.303 0.385 0.440 0.825 NaN 180 4 11 0 6 0 NaN
2 25 43 5 3 10 21 6 1 8 41 0.297 0.363 0.579 0.942 NaN 84 0 7 0 0 0 NaN
3 72 132 20 2 22 63 11 2 22 135 0.253 0.289 0.426 0.715 101.0 222 5 6 1 5 0 *6/H
4 72 132 20 2 22 63 11 2 22 135 0.253 0.289 0.426 0.715 101.0 222 5 6 1 5 0 NaN
5 86 157 24 2 26 75 13 2 26 161 0.253 0.289 0.426 0.715 101.0 264 6 7 1 6 0 NaN

Get Row in Other Table

I have a dataframe 'df'. Using the validation data validData, I want to compute the response rate (Florence = 1/Yes) using the rfm_aboveavg (RFM combinations response rates above the overall response). Response rate is given by considering 0/No and 1/Yes, so it would be rfm_crosstab[1] / rfm_crosstab['All'].
Using the results from the validation data, I want to only display the rows that are also shown in the training data output by the RFM column. How do I do this?
Data: 'df'
Seq# ID# Gender M R F FirstPurch ChildBks YouthBks CookBks ... ItalCook ItalAtlas ItalArt Florence Related Purchase Mcode Rcode Fcode Yes_Florence No_Florence
0 1 25 1 297 14 2 22 0 1 1 ... 0 0 0 0 0 5 4 2 0 1
1 2 29 0 128 8 2 10 0 0 0 ... 0 0 0 0 0 4 3 2 0 1
2 3 46 1 138 22 7 56 2 1 2 ... 1 0 0 0 2 4 4 3 0 1
3 4 47 1 228 2 1 2 0 0 0 ... 0 0 0 0 0 5 1 1 0 1
4 5 51 1 257 10 1 10 0 0 0 ... 0 0 0 0 0 5 3 1 0 1
My code: Crosstab for training data trainData
trainData, validData = train_test_split(df, test_size=0.4, random_state=1)
# Response rate for training data as a whole
responseRate = (sum(trainData.Florence == 1) / sum(trainData.Florence == 0)) * 100
# Response rate for RFM categories
# RFM: Combine R, F, M categories into one category
trainData['RFM'] = trainData['Mcode'].astype(str) + trainData['Rcode'].astype(str) + trainData['Fcode'].astype(str)
rfm_crosstab = pd.crosstab(index = [trainData['RFM']], columns = trainData['Florence'], margins = True)
rfm_crosstab['Percentage of 1/Yes'] = 100 * (rfm_crosstab[1] / rfm_crosstab['All'])
# RFM combinations response rates above the overall response
rfm_aboveavg = rfm_crosstab['Percentage of 1/Yes'] > responseRate
rfm_crosstab[rfm_aboveavg]
Output: Training data
Florence 0 1 All Percentage of 1/Yes
RFM
121 3 2 5 40.000000
131 9 1 10 10.000000
212 1 2 3 66.666667
221 6 3 9 33.333333
222 6 1 7 14.285714
313 2 1 3 33.333333
321 17 3 20 15.000000
322 20 4 24 16.666667
323 2 1 3 33.333333
341 61 10 71 14.084507
343 17 2 19 10.526316
411 12 3 15 20.000000
422 26 5 31 16.129032
423 32 8 40 20.000000
441 96 12 108 11.111111
511 19 4 23 17.391304
513 44 8 52 15.384615
521 24 5 29 17.241379
523 74 16 90 17.777778
533 177 28 205 13.658537
My code: Crosstab for validation data validData
# Response rate for RFM categories
# RFM: Combine R, F, M categories into one category
validData['RFM'] = validData['Mcode'].astype(str) + validData['Rcode'].astype(str) + validData['Fcode'].astype(str)
rfm_crosstab1 = pd.crosstab(index = [validData['RFM']], columns = validData['Florence'], margins = True)
rfm_crosstab1['Percentage of 1/Yes'] = 100 * (rfm_crosstab1[1] / rfm_crosstab1['All'])
rfm_crosstab1
Output: Validation data
Florence 0 1 All Percentage of 1/Yes
RFM
131 3 1 4 25.000000
141 8 0 8 0.000000
211 2 1 3 33.333333
212 2 0 2 0.000000
213 0 1 1 100.000000
221 5 0 5 0.000000
222 2 0 2 0.000000
231 21 1 22 4.545455
232 3 0 3 0.000000
233 1 0 1 0.000000
241 11 1 12 8.333333
242 8 0 8 0.000000
243 2 0 2 0.000000
311 7 0 7 0.000000
312 8 0 8 0.000000
313 1 0 1 0.000000
321 12 0 12 0.000000
322 13 0 13 0.000000
323 4 1 5 20.000000
331 19 1 20 5.000000
332 25 2 27 7.407407
333 11 1 12 8.333333
341 36 2 38 5.263158
342 30 2 32 6.250000
343 12 0 12 0.000000
411 8 2 10 20.000000
412 7 0 7 0.000000
413 13 1 14 7.142857
421 21 2 23 8.695652
422 30 1 31 3.225806
423 26 1 27 3.703704
431 51 3 54 5.555556
432 42 7 49 14.285714
433 41 5 46 10.869565
441 68 2 70 2.857143
442 78 3 81 3.703704
443 70 5 75 6.666667
511 17 0 17 0.000000
512 13 1 14 7.142857
513 26 6 32 18.750000
521 19 1 20 5.000000
522 25 6 31 19.354839
523 50 6 56 10.714286
531 66 3 69 4.347826
532 65 3 68 4.411765
533 128 24 152 15.789474
541 86 7 93 7.526882
542 100 6 106 5.660377
543 178 17 195 8.717949
All 1474 126 1600 7.875000

In Pandas, giving a datetime index, with rows on all work days, how to determine if a row is beginning of week or end of week?

I have an set of stock information, with datetime set as index, stock market only open on weekdays so all my rows are weekdays, which is fine, I would like to determine if a row is start of the week or end of week, which might NOT always fall on Monday/Friday due to holidays. A better idea is to determine if there is an row entry on the next/previous day in the dataframe ( since my data is guaranteed to only exist for workday), but I dont know how to calculate this. Here is an example of my data:
date day_of_week day_of_month day_of_year month_of_year
5/1/2017 0 1 121 5
5/2/2017 1 2 122 5
5/3/2017 2 3 123 5
5/4/2017 3 4 124 5
5/8/2017 0 8 128 5
5/9/2017 1 9 129 5
5/10/2017 2 10 130 5
5/11/2017 3 11 131 5
5/12/2017 4 12 132 5
5/15/2017 0 15 135 5
5/16/2017 1 16 136 5
5/17/2017 2 17 137 5
5/18/2017 3 18 138 5
5/19/2017 4 19 139 5
5/23/2017 1 23 143 5
5/24/2017 2 24 144 5
5/25/2017 3 25 145 5
5/26/2017 4 26 146 5
5/30/2017 1 30 150 5
Here is my current code
# Date fields
def DateFields(df_input):
dates = df_input.index.to_series()
df_input['day_of_week'] = dates.dt.dayofweek
df_input['day_of_month'] = dates.dt.day
df_input['day_of_year'] = dates.dt.dayofyear
df_input['month_of_year'] = dates.dt.month
df_input['isWeekStart'] = "No" #<--- Need help here
df_input['isWeekEnd'] = "No" #<--- Need help here
df_input['date'] = dates.dt.strftime('%Y-%m-%d')
return df_input
How can I calculate if a row is beginning of week and end of week?
Example of what I am looking for:
date day_of_week day_of_month day_of_year month_of_year isWeekStart isWeekEnd
5/1/2017 0 1 121 5 1 0
5/2/2017 1 2 122 5 0 0
5/3/2017 2 3 123 5 0 0
5/4/2017 3 4 124 5 0 1 # short week, Thursday is last work day
5/8/2017 0 8 128 5 1 0
5/9/2017 1 9 129 5 0 0
5/10/2017 2 10 130 5 0 0
5/11/2017 3 11 131 5 0 0
5/12/2017 4 12 132 5 0 1
5/15/2017 0 15 135 5 1 0
5/16/2017 1 16 136 5 0 0
5/17/2017 2 17 137 5 0 0
5/18/2017 3 18 138 5 0 0
5/19/2017 4 19 139 5 0 1
5/23/2017 1 23 143 5 1 0 # short week, Tuesday is first work day
5/24/2017 2 24 144 5 0 0
5/25/2017 3 25 145 5 0 0
5/26/2017 4 26 146 5 0 1
5/30/2017 1 30 150 5 1 0
EDIT: I forgot that some holidays fall during the middle of week, in this situation, it would be good if it can treat these as a separate "week" with before and after marked accordingly. Although if it's not smart enough to figure this out, just getting the long weekend would be a good start.
Here's an idea with BusinessDay:
prev_working_day = df['date'] - pd.tseries.offsets.BusinessDay(1)
df['isFirstWeekDay'] = (df['date'].dt.isocalendar().week !=
prev_working_day.dt.isocalendar().week)
And similar for last business day. Note that the default holiday calendar is US'. Check out this post for a different one.
Output:
date day_of_week day_of_month day_of_year month_of_year isFirstWeekDay
0 2017-05-01 0 1 121 5 True
1 2017-05-02 1 2 122 5 False
2 2017-05-03 2 3 123 5 False
3 2017-05-04 3 4 124 5 False
4 2017-05-08 0 8 128 5 True
5 2017-05-09 1 9 129 5 False
6 2017-05-10 2 10 130 5 False
7 2017-05-11 3 11 131 5 False
8 2017-05-12 4 12 132 5 False
9 2017-05-15 0 15 135 5 True
10 2017-05-16 1 16 136 5 False
11 2017-05-17 2 17 137 5 False
12 2017-05-18 3 18 138 5 False
13 2017-05-19 4 19 139 5 False
14 2017-05-23 1 23 143 5 False
15 2017-05-24 2 24 144 5 False
16 2017-05-25 3 25 145 5 False
17 2017-05-26 4 26 146 5 False
18 2017-05-30 1 30 150 5 False
Here's an approach using weekly groupby.
df['date'] = pd.to_datetime(df['date'])
business_days = df.assign(date_copy = df['date']).groupby(pd.Grouper(key='date_copy', freq='W'))['date'].apply(list).to_frame()
business_days['isWeekStart'] = business_days['date'].apply(lambda x: [1 if i == min(x) else 0 for i in x])
business_days['isWeekEnd'] = business_days['date'].apply(lambda x: [1 if i == max(x) else 0 for i in x])
business_days = business_days.apply(pd.Series.explode)
pd.merge(df, business_days, left_on='date', right_on='date')
output:
date day_of_week day_of_month day_of_year month_of_year isWeekStart isWeekEnd
0 2017-05-01 0 1 121 5 1 0
1 2017-05-02 1 2 122 5 0 0
2 2017-05-03 2 3 123 5 0 0
3 2017-05-04 3 4 124 5 0 1
4 2017-05-08 0 8 128 5 1 0
5 2017-05-09 1 9 129 5 0 0
6 2017-05-10 2 10 130 5 0 0
7 2017-05-11 3 11 131 5 0 0
8 2017-05-12 4 12 132 5 0 1
9 2017-05-15 0 15 135 5 1 0
10 2017-05-16 1 16 136 5 0 0
11 2017-05-17 2 17 137 5 0 0
12 2017-05-18 3 18 138 5 0 0
13 2017-05-19 4 19 139 5 0 1
14 2017-05-23 1 23 143 5 1 0
15 2017-05-24 2 24 144 5 0 0
16 2017-05-25 3 25 145 5 0 0
17 2017-05-26 4 26 146 5 0 1
18 2017-05-30 1 30 150 5 1 1
Note that 2017-05-30 is marked as both WeekStart and WeekEnd because it is the only date of that week.

How to read .data format data with Python?

I have uploaded data from https://archive.ics.uci.edu/ml/machine-learning-databases/arrhythmia/ . As you see it has .data format. How to read it as pandas datframe in Python?
I try this. but it dens work:
with open("arrhythmia.data", "r") as f:
arryth_df = pd.DataFrame(f.read())
It says ValueError: DataFrame constructor not properly called!
You can pass url of file to read_csv because here .data is csv format, but no header, so added header=None:
#if want see all data
pd.options.display.max_columns = None
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/arrhythmia/arrhythmia.data'
df = pd.read_csv(url, header=None)
print (df.head())
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 \
0 75 0 190 80 91 193 371 174 121 -16 13 64 -2 ? 63 0
1 56 1 165 64 81 174 401 149 39 25 37 -17 31 ? 53 0
2 54 0 172 95 138 163 386 185 102 96 34 70 66 23 75 0
3 55 0 175 94 100 202 380 179 143 28 11 -5 20 ? 71 0
4 75 0 190 80 88 181 360 177 103 -16 13 61 3 ? ? 0
16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 \
0 52 44 0 0 32 0 0 0 0 0 0 0 44 20 36
1 48 0 0 0 24 0 0 0 0 0 0 0 64 0 0
2 40 80 0 0 24 0 0 0 0 0 0 20 56 52 0
3 72 20 0 0 48 0 0 0 0 0 0 0 64 36 0
4 48 40 0 0 28 0 0 0 0 0 0 0 40 24 0
...
...
...
If want also convert ? to missing values NaNs add na_values='?' parameter:
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/arrhythmia/arrhythmia.data'
df = pd.read_csv(url, header=None, na_values='?')
print (df.head())
0 1 2 3 4 5 6 7 8 9 10 11 12 13 \
0 75 0 190 80 91 193 371 174 121 -16 13.0 64.0 -2.0 NaN
1 56 1 165 64 81 174 401 149 39 25 37.0 -17.0 31.0 NaN
2 54 0 172 95 138 163 386 185 102 96 34.0 70.0 66.0 23.0
3 55 0 175 94 100 202 380 179 143 28 11.0 -5.0 20.0 NaN
4 75 0 190 80 88 181 360 177 103 -16 13.0 61.0 3.0 NaN
14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 \
0 63.0 0 52 44 0 0 32 0 0 0 0 0 0 0 44
1 53.0 0 48 0 0 0 24 0 0 0 0 0 0 0 64
2 75.0 0 40 80 0 0 24 0 0 0 0 0 0 20 56
3 71.0 0 72 20 0 0 48 0 0 0 0 0 0 0 64
4 NaN 0 48 40 0 0 28 0 0 0 0 0 0 0 40
...
...
Do it this way with StringIO:
from io import StringIO
import pandas as pd
with open("arrhythmia.data", "r") as f:
data = StringIO(f.read())
arryth_df = pd.read_csv(data)

Pandas data reshaping, turn multiple rows with the same index but different values into many columns based on incidence

I have the following table in pandas
user_id idaggregate_info num_events num_lark_convo_events num_meals_logged num_breakfasts num_lunches num_dinners num_snacks total_activity sleep_duration num_activity_events num_weights num_notifs idusermission completed mission_delta
0 0 406 94 20 7 2 2 2 1 4456 47738 72 0 18 1426 0 NaT
1 1 1247 121 48 26 8 7 2 9 48695 37560 53 14 48 1379 1 7 days 10:04:28
2 1 1247 121 48 26 8 7 2 9 48695 37560 53 14 48 1379 1 NaT
3 2 2088 356 32 15 6 6 1 2 41598 184113 314 1 21 967 1 8 days 00:03:05
4 2 2088 356 32 15 6 6 1 2 41598 184113 314 1 21 967 1 NaT
Some user_ids have multiple lines that are identical except for their different mission_delta values. How do I transform this into one line for each id, with a columns named "mission_delta_1", "mission_delta_2" (the number of them vary, it could be 1 per user_id to maybe 5 per user_id so naming has to be iterative_ etc so output would be:
user_id idaggregate_info num_events num_lark_convo_events num_meals_logged num_breakfasts num_lunches num_dinners num_snacks total_activity sleep_duration num_activity_events num_weights num_notifs idusermission completed mission_delta_1 mission_delta_2
0 0 406 94 20 7 2 2 2 1 4456 47738 72 0 18 1426 0 NaT
1 1 1247 121 48 26 8 7 2 9 48695 37560 53 14 48 1379 1 7 days 10:04:28 NaT
2 2 2088 356 32 15 6 6 1 2 41598 184113 314 1 21 967 1 8 days 00:03:05 NaT
Not a duplicate as those address exploding all columns, there is just one that needs to be unstacked. The solutions offered in the duplicate link fail:
df.groupby(level=0).apply(lambda x: pd.Series(x.values.flatten()))
produces the same df as the original with different labels
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
0 0 406 94 20 7 2 2 2 1 4456 47738 72 0 18 1426 0 NaT
1 1 1247 121 48 26 8 7 2 9 48695 37560 53 14 48 1379 1 7 days 10:04:28
2 1 1247 121 48 26 8 7 2 9 48695 37560 53 14 48 1379 1 NaT
3 2 2088 356 32 15 6 6 1 2 41598 184113 314 1 21 967 1 8 days 00:03:05
The next options:
result2.groupby(level=0).apply(lambda x: pd.Series(x.stack().values))
produces:
0 0 0
1 406
2 94
3 20
4 7
and
df.groupby(level=0).apply(lambda x: x.values.ravel()).apply(pd.Series)
produces the original dataframe:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
0 0 406 94 20 7 2 2 2 1 4456 47738 72 0 18 1426 0 NaT
1 1 1247 121 48 26 8 7 2 9 48695 37560 53 14 48 1379 1 7 days 10:04:28
2 1 1247 121 48 26 8 7 2 9 48695 37560 53 14 48 1379 1 NaT
3 2 2088 356 32 15 6 6 1 2 41598 184113 314 1 21 967 1 8 days 00:03:05
In essence, I want to turn a df:
id mission_delta
0 NaT
1 1 day
1 2 days
1 1 day
2 5 days
2 NaT
into
id mission_delta1 mission_delta_2 mission_delta_3
0 NaT NaT NaT
1 1 day 2 days 1 day
2 5 days NaT NaT
You might try this;
grp = df.groupby('id')
df_res = grp['mission_delta'].apply(lambda x: pd.Series(x.values)).unstack().fillna('NaT')
df_res = df_res.rename(columns={i: 'mission_delta_{}'.format(i + 1) for i in range(len(df_res))})
print(df_res)
mission_delta_1 mission_delta_2 mission_delta_3
id
0 NaT NaT NaT
1 1 day 2 days 1 day
2 5 days NaT NaT

Categories