I have a table that looks like
A     B   C
2017  9   65
2017  10  72
2017  11  88
2017  12  97
2018  1   85
2018  2   67
2018  3   76
2018  4   51
2018  5   69
2018  6   97
2018  7   101
2018  8   22
2019  1   56
2019  2   34
2019  3   71
2019  4   122
2019  5   167
2019  6   34
2019  7   17
2019  8   99
2019  9   20
2019  10  26
2019  11  39
2019  12  30
2020  1   56
2020  2   34
2020  3   71
2020  4   122
2020  5   167
2020  6   34
2020  7   17
2020  8   99
2020  9   20
2020  10  26
2020  11  39
2020  12  30
2021  1   56
2021  2   34
2021  3   71
2021  4   122
2021  5   167
2021  6   34
2021  7   17
2021  8   99
2021  9   20
2021  10  26
2021  11  39
2021  12  30
Now what I want is:
A     B   C    D
2017  9   65   890
2017  10  72   890
2017  11  88   890
2017  12  97   890
2018  1   85   890
2018  2   67   890
2018  3   76   890
2018  4   51   890
2018  5   69   890
2018  6   97   890
2018  7   101  890
2018  8   22   890
2019  1   56   715
2019  2   34   715
2019  3   71   715
2019  4   122  715
2019  5   167  715
2019  6   34   715
2019  7   17   715
2019  8   99   715
2019  9   20   715
2019  10  26   715
2019  11  39   715
2019  12  30   715
2020  1   56   715
2020  2   34   715
2020  3   71   715
2020  4   122  715
2020  5   167  715
2020  6   34   715
2020  7   17   715
2020  8   99   715
2020  9   20   715
2020  10  26   715
2020  11  39   715
2020  12  30   715
2021  1   56   715
2021  2   34   715
2021  3   71   715
2021  4   122  715
2021  5   167  715
2021  6   34   715
2021  7   17   715
2021  8   99   715
2021  9   20   715
2021  10  26   715
2021  11  39   715
2021  12  30   715
Here 890 is the sum of all the values from 9/2017 through 8/2018, and 715 is the sum of all the values from 1/2019 through 12/2019; likewise 715 is the sum of all the values from 1/2020 through 12/2020 and from 1/2021 through 12/2021. For ease of calculation the numbers in column C have been kept the same, i.e. (56, 34, 71, 122, 167, 34, 17, 99, 20, 26, 39, 30), for each of 2019, 2020 and 2021. These numbers may vary from year to year, and so may their sums: for example, we could have values like (67, 87, 99, 100, 76, 11, 23, 44, 56, 78, 87, 5) for 2020 and (12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23) for 2021, for the months (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12) respectively.
Now my efforts:
count_months_in_each_year = data.groupby('CALENDAR_YEAR').agg({'CALMONTH': 'count'})
count_months_in_each_year.reset_index(inplace=True)
count_months_in_each_year.rename({'CALMONTH': 'Count_of_Months'}, axis=1, inplace=True)
data = pd.merge(data, count_months_in_each_year, on='CALENDAR_YEAR', how='left', indicator=True)
data.drop(columns=['_merge'], inplace=True)
Now, how do I get the sum of the values, especially in the case where I have to consider 9/2017 through 8/2018, given that I have the count?
Based on this, what logic can be used to generalize the code and get the result?
I also tried this:
# Number of records whose year has fewer than 12 months of data:
# total records minus records where Count_of_Months == 12
number_of_less_than_12_records = data.shape[0] - data[data['Count_of_Months'] == 12].shape[0]
# number_of_less_than_12_records = 144
# Total records = 576
Can we make use of this somehow?
I think what you are looking for is to make groups of 12 rows and transform with the group sum:
df['D'] = df.groupby(df.index // 12)['C'].transform('sum')
A B C D
0 2017 9 65 890
1 2017 10 72 890
2 2017 11 88 890
3 2017 12 97 890
4 2018 1 85 890
5 2018 2 67 890
6 2018 3 76 890
7 2018 4 51 890
8 2018 5 69 890
9 2018 6 97 890
10 2018 7 101 890
11 2018 8 22 890
12 2019 1 56 715
13 2019 2 34 715
14 2019 3 71 715
15 2019 4 122 715
16 2019 5 167 715
17 2019 6 34 715
18 2019 7 17 715
19 2019 8 99 715
20 2019 9 20 715
21 2019 10 26 715
22 2019 11 39 715
23 2019 12 30 715
24 2020 1 56 715
25 2020 2 34 715
26 2020 3 71 715
27 2020 4 122 715
28 2020 5 167 715
29 2020 6 34 715
30 2020 7 17 715
31 2020 8 99 715
32 2020 9 20 715
33 2020 10 26 715
34 2020 11 39 715
35 2020 12 30 715
36 2021 1 56 715
37 2021 2 34 715
38 2021 3 71 715
39 2021 4 122 715
40 2021 5 167 715
41 2021 6 34 715
42 2021 7 17 715
43 2021 8 99 715
44 2021 9 20 715
45 2021 10 26 715
46 2021 11 39 715
47 2021 12 30 715
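Note that df.index // 12 relies on the default RangeIndex starting at 0. If your frame carries some other index, a positional grouper does the same thing (a minimal sketch, assuming the usual numpy import):
import numpy as np
# group every 12 consecutive rows regardless of what the index looks like
df['D'] = df.groupby(np.arange(len(df)) // 12)['C'].transform('sum')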
You can use the pandas rolling window function: https://pandas.pydata.org/docs/user_guide/window.html
df['D'] = df['C'].rolling(window=12).sum()
This calculates the sum of the current month and the 11 rows before it, but it fills the beginning with NaN values until there are enough months to look back on.
So we can shift up 11 rows to get the desired result.
df['D'] = df['D'].shift(-11)
And if you don't want any NaNs at the end, you can interpolate or pad them out.
df['D'] = df['D'].interpolate()
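The three steps can also be chained into a single expression; this is just the calls above composed, nothing new:
# rolling sum, shifted so each group's total sits on its first row, NaNs interpolated
df['D'] = df['C'].rolling(window=12).sum().shift(-11).interpolate()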
How can I merge and sum the columns with the same name?
So the output should be one column named Canada that is the sum of the four columns named Canada.
Country/Region Brazil Canada Canada Canada Canada
Week 1 0 3 0 0 0
Week 2 0 17 0 0 0
Week 3 0 21 0 0 0
Week 4 0 21 0 0 0
Week 5 0 23 0 0 0
Week 6 0 80 0 5 0
Week 7 0 194 0 20 0
Week 8 12 702 3 199 20
Week 9 182 2679 16 2395 260
Week 10 737 8711 80 17928 892
Week 11 1674 25497 153 48195 1597
Week 12 2923 46392 175 85563 2003
Week 13 4516 76095 182 122431 2180
Week 14 6002 105386 183 163539 2431
Week 15 6751 127713 189 210409 2995
Week 16 7081 147716 189 258188 3845
From its current state, this should give the outcome you're looking for:
df = df.set_index('Country/Region') # optional
df.groupby(df.columns, axis=1).sum() # Stolen from Scott Boston as it's a superior method.
Output:
                Brazil  Canada
Country/Region
Week 1 0 3
Week 2 0 17
Week 3 0 21
Week 4 0 21
Week 5 0 23
Week 6 0 85
Week 7 0 214
Week 8 12 924
Week 9 182 5350
Week 10 737 27611
Week 11 1674 75442
Week 12 2923 134133
Week 13 4516 200888
Week 14 6002 271539
Week 15 6751 341306
Week 16 7081 409938
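One caveat worth noting: newer pandas releases deprecate groupby(..., axis=1). If you are on such a version, a transpose-based sketch of the same column-wise sum is:
# sum columns sharing a name, then restore the original orientation
df.T.groupby(level=0).sum().T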
I found your dataset interesting, here's how I would clean it up from step 1:
import numpy as np
import pandas as pd

df = pd.read_csv('file.csv')
df = df.set_index(['Province/State', 'Country/Region', 'Lat', 'Long']).stack().reset_index()
df.columns = ['Province/State', 'Country/Region', 'Lat', 'Long', 'date', 'value']
df['date'] = pd.to_datetime(df['date'])
df = df.set_index('date')
df = df.pivot_table(index=df.index, columns='Country/Region', values='value', aggfunc=np.sum)
print(df)
Output:
Country/Region Afghanistan Albania Algeria Andorra Angola ... West Bank and Gaza Western Sahara Yemen Zambia Zimbabwe
date ...
2020-01-22 0 0 0 0 0 ... 0 0 0 0 0
2020-01-23 0 0 0 0 0 ... 0 0 0 0 0
2020-01-24 0 0 0 0 0 ... 0 0 0 0 0
2020-01-25 0 0 0 0 0 ... 0 0 0 0 0
2020-01-26 0 0 0 0 0 ... 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ...
2020-07-30 36542 5197 29831 922 1109 ... 11548 10 1726 5555 3092
2020-07-31 36675 5276 30394 925 1148 ... 11837 10 1728 5963 3169
2020-08-01 36710 5396 30950 925 1164 ... 12160 10 1730 6228 3659
2020-08-02 36710 5519 31465 925 1199 ... 12297 10 1734 6347 3921
2020-08-03 36747 5620 31972 937 1280 ... 12541 10 1734 6580 4075
If you now want to do weekly aggregations, it's as simple as:
print(df.resample('w').sum())
Output:
Country/Region Afghanistan Albania Algeria Andorra Angola ... West Bank and Gaza Western Sahara Yemen Zambia Zimbabwe
date ...
2020-01-26 0 0 0 0 0 ... 0 0 0 0 0
2020-02-02 0 0 0 0 0 ... 0 0 0 0 0
2020-02-09 0 0 0 0 0 ... 0 0 0 0 0
2020-02-16 0 0 0 0 0 ... 0 0 0 0 0
2020-02-23 0 0 0 0 0 ... 0 0 0 0 0
2020-03-01 7 0 6 0 0 ... 0 0 0 0 0
2020-03-08 10 0 85 7 0 ... 43 0 0 0 0
2020-03-15 57 160 195 7 0 ... 209 0 0 0 0
2020-03-22 175 464 705 409 5 ... 309 0 0 11 7
2020-03-29 632 1142 2537 1618 29 ... 559 0 0 113 31
2020-04-05 1783 2000 6875 2970 62 ... 1178 4 0 262 59
2020-04-12 3401 2864 11629 4057 128 ... 1847 30 3 279 84
2020-04-19 5838 3603 16062 4764 143 ... 2081 42 7 356 154
2020-04-26 8918 4606 21211 5087 174 ... 2353 42 7 541 200
2020-05-03 15149 5391 27943 5214 208 ... 2432 42 41 738 244
2020-05-10 25286 5871 36315 5265 274 ... 2607 42 203 1260 241
2020-05-17 39634 6321 45122 5317 327 ... 2632 42 632 3894 274
2020-05-24 61342 6798 54185 5332 402 ... 2869 45 1321 5991 354
2020-05-31 91885 7517 62849 5344 536 ... 3073 63 1932 7125 894
2020-06-07 126442 8378 68842 5868 609 ... 3221 63 3060 7623 1694
2020-06-14 159822 9689 74147 5967 827 ... 3396 63 4236 8836 2335
2020-06-21 191378 12463 79737 5981 1142 ... 4466 63 6322 9905 3089
2020-06-28 210487 15349 87615 5985 1522 ... 10242 70 7360 10512 3813
2020-07-05 224560 18707 102918 5985 2186 ... 21897 70 8450 11322 4426
2020-07-12 237087 22399 124588 5985 2940 ... 36949 70 9489 13002 6200
2020-07-19 245264 26845 149611 6098 4279 ... 52323 70 10855 16350 9058
2020-07-26 250970 31255 178605 6237 5919 ... 68154 70 11571 26749 14933
2020-08-02 255739 36370 208457 6429 7648 ... 80685 70 12023 38896 22241
2020-08-09 36747 5620 31972 937 1280 ... 12541 10 1734 6580 4075
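A side note on the semantics: the daily values here look cumulative (each day's figure includes all previous days), so resample('w').sum() adds up seven running totals per week. If what you actually want is the level at each week's end, or the week-over-week increase, a sketch:
weekly_level = df.resample('w').last()  # cumulative total as of each week's end
weekly_new = weekly_level.diff()        # week-over-week increase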
Try:
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.randint(0, 100, (20, 5)), columns=[*'ZAABC'])
df.groupby(df.columns, axis=1, sort=False).sum()
Output:
Z A B C
0 44 111 67 67
1 9 104 36 87
2 70 176 12 58
3 65 126 46 88
4 81 62 77 72
5 9 100 69 79
6 47 146 99 88
7 49 48 19 14
8 39 97 9 57
9 32 105 23 35
10 75 83 34 0
11 0 89 5 38
12 17 83 42 58
13 31 66 41 57
14 35 57 82 91
15 0 113 53 12
16 42 159 68 6
17 68 50 76 52
18 78 35 99 58
19 23 92 85 48
You can try a transpose and groupby, e.g. something similar to the below.
df_T = df.transpose()
df_T.groupby(df_T.index).sum().loc['Canada']
Here's a way to do it:
df.columns = [(col + str(i)) if col.startswith('Canada') else col for i, col in enumerate(df.columns)]
df = df.assign(Canada=df.filter(like='Canada').sum(axis=1)).drop(columns=[x for x in df.columns if x.startswith('Canada') and x != 'Canada'])
First we rename the columns starting with Canada by appending their integer position, which ensures they are no longer duplicates.
Then we use sum() to add across columns like Canada, put the result in a new column named Canada, and drop the columns that were originally named Canada.
Full test code is:
import pandas as pd

df = pd.DataFrame(
    columns=[x.strip() for x in 'Brazil Canada Canada Canada Canada'.split()],
    index=['Week ' + str(i) for i in range(1, 17)],
    data=[[i] * 5 for i in range(1, 17)])
df.columns.names = ['Country/Region']
print(df)
df.columns = [(col + str(i)) if col.startswith('Canada') else col for i, col in enumerate(df.columns)]
df = df.assign(Canada=df.filter(like='Canada').sum(axis=1)).drop(columns=[x for x in df.columns if x.startswith('Canada') and x != 'Canada'])
print(df)
Output:
Country/Region Brazil Canada Canada Canada Canada
Week 1 1 1 1 1 1
Week 2 2 2 2 2 2
Week 3 3 3 3 3 3
Week 4 4 4 4 4 4
Week 5 5 5 5 5 5
Week 6 6 6 6 6 6
Week 7 7 7 7 7 7
Week 8 8 8 8 8 8
Week 9 9 9 9 9 9
Week 10 10 10 10 10 10
Week 11 11 11 11 11 11
Week 12 12 12 12 12 12
Week 13 13 13 13 13 13
Week 14 14 14 14 14 14
Week 15 15 15 15 15 15
Week 16 16 16 16 16 16
Brazil Canada
Week 1 1 4
Week 2 2 8
Week 3 3 12
Week 4 4 16
Week 5 5 20
Week 6 6 24
Week 7 7 28
Week 8 8 32
Week 9 9 36
Week 10 10 40
Week 11 11 44
Week 12 12 48
Week 13 13 52
Week 14 14 56
Week 15 15 60
Week 16 16 64
I'm scraping National Hockey League (NHL) data for multiple seasons from this URL:
https://www.hockey-reference.com/leagues/NHL_2018_skaters.html
I'm only getting a few instances here and have tried moving my dict statements throughout the for loops. I've also tried utilizing solutions I found on other posts with no luck. Any help is appreciated. Thank you!
import requests
from bs4 import BeautifulSoup
import pandas as pd
dict = {}
for i in range(2010, 2020):
    year = str(i)
    source = requests.get('https://www.hockey-reference.com/leagues/NHL_' + year + '_skaters.html').text
    soup = BeautifulSoup(source, features='lxml')
    # identifying table in html
    table = soup.find('table', id="stats")
    # grabbing <tr> tags in html
    rows = table.findAll("tr")
    # creating passable values for each "stat" in td tag
    data_stats = [
        "player",
        "age",
        "team_id",
        "pos",
        "games_played",
        "goals",
        "assists",
        "points",
        "plus_minus",
        "pen_min",
        "ps",
        "goals_ev",
        "goals_pp",
        "goals_sh",
        "goals_gw",
        "assists_ev",
        "assists_pp",
        "assists_sh",
        "shots",
        "shot_pct",
        "time_on_ice",
        "time_on_ice_avg",
        "blocks",
        "hits",
        "faceoff_wins",
        "faceoff_losses",
        "faceoff_percentage"
    ]
    for rownum in rows:
        # grabbing player name and using as key
        filter = {"data-stat": 'player'}
        cell = rows[3].findAll("td", filter)
        nameval = cell[0].string
        list = []
        for data in data_stats:
            # iterating through data_stat to grab values
            filter = {"data-stat": data}
            cell = rows[3].findAll("td", filter)
            value = cell[0].string
            list.append(value)
        dict[nameval] = list
        dict[nameval].append(year)
# conversion to numeric values and creating dataframe
columns = [
"player",
"age",
"team_id",
"pos",
"games_played",
"goals",
"assists",
"points",
"plus_minus",
"pen_min",
"ps",
"goals_ev",
"goals_pp",
"goals_sh",
"goals_gw",
"assists_ev",
"assists_pp",
"assists_sh",
"shots",
"shot_pct",
"time_on_ice",
"time_on_ice_avg",
"blocks",
"hits",
"faceoff_wins",
"faceoff_losses",
"faceoff_percentage",
"year"
]
df = pd.DataFrame.from_dict(dict,orient='index',columns=columns)
cols = df.columns.drop(['player','team_id','pos','year'])
df[cols] = df[cols].apply(pd.to_numeric, errors='coerce')
print(df)
Output
Craig Adams Craig Adams 32 ... 43.9 2010
Luke Adam Luke Adam 22 ... 100.0 2013
Justin Abdelkader Justin Abdelkader 29 ... 29.4 2017
Will Acton Will Acton 27 ... 50.0 2015
Noel Acciari Noel Acciari 24 ... 44.1 2016
Pontus Aberg Pontus Aberg 25 ... 10.5 2019
[6 rows x 28 columns]
I'd just use pandas' .read_html(); it does the hard work of parsing tables for you (it uses BeautifulSoup under the hood).
Code:
import pandas as pd

result = pd.DataFrame()
for i in range(2010, 2020):
    print(i)
    year = str(i)
    url = 'https://www.hockey-reference.com/leagues/NHL_' + year + '_skaters.html'
    #source = requests.get('https://www.hockey-reference.com/leagues/NHL_'+year+'_skaters.html').text
    df = pd.read_html(url, header=1)[0]
    df['year'] = year
    result = result.append(df, sort=False)
result = result[~result['Age'].str.contains("Age")]
result = result.reset_index(drop=True)
You can then save to a file with result.to_csv('filename.csv', index=False)
Output:
print (result)
Rk Player Age Tm Pos GP ... BLK HIT FOW FOL FO% year
0 1 Justin Abdelkader 22 DET LW 50 ... 20 152 148 170 46.5 2010
1 2 Craig Adams 32 PIT RW 82 ... 58 193 243 311 43.9 2010
2 3 Maxim Afinogenov 30 ATL RW 82 ... 21 32 1 2 33.3 2010
3 4 Andrew Alberts 28 TOT D 76 ... 88 216 0 1 0.0 2010
4 4 Andrew Alberts 28 CAR D 62 ... 67 172 0 0 NaN 2010
5 4 Andrew Alberts 28 VAN D 14 ... 21 44 0 1 0.0 2010
6 5 Daniel Alfredsson 37 OTT RW 70 ... 36 41 14 25 35.9 2010
7 6 Bryan Allen 29 FLA D 74 ... 137 120 0 0 NaN 2010
8 7 Cody Almond 20 MIN C 7 ... 5 7 18 12 60.0 2010
9 8 Karl Alzner 21 WSH D 21 ... 21 15 0 0 NaN 2010
10 9 Artem Anisimov 21 NYR C 82 ... 41 45 310 380 44.9 2010
11 10 Nik Antropov 29 ATL C 76 ... 35 82 481 627 43.4 2010
12 11 Colby Armstrong 27 ATL RW 79 ... 29 74 10 10 50.0 2010
13 12 Derek Armstrong 36 STL C 6 ... 0 4 7 8 46.7 2010
14 13 Jason Arnott 35 NSH C 63 ... 17 24 526 551 48.8 2010
15 14 Dean Arsene 29 EDM D 13 ... 13 18 0 0 NaN 2010
16 15 Evgeny Artyukhin 26 TOT RW 54 ... 10 127 1 1 50.0 2010
17 15 Evgeny Artyukhin 26 ANA RW 37 ... 8 90 0 1 0.0 2010
18 15 Evgeny Artyukhin 26 ATL RW 17 ... 2 37 1 0 100.0 2010
19 16 Arron Asham 31 PHI RW 72 ... 16 92 2 11 15.4 2010
20 17 Adrian Aucoin 36 PHX D 82 ... 67 131 1 0 100.0 2010
21 18 Keith Aucoin 31 WSH C 9 ... 0 2 31 25 55.4 2010
22 19 Sean Avery 29 NYR C 69 ... 17 145 4 10 28.6 2010
23 20 David Backes 25 STL RW 79 ... 60 266 504 561 47.3 2010
24 21 Mikael Backlund 20 CGY C 23 ... 4 12 100 86 53.8 2010
25 22 Nicklas Backstrom 22 WSH C 82 ... 61 90 657 660 49.9 2010
26 23 Josh Bailey 20 NYI C 73 ... 36 67 171 255 40.1 2010
27 24 Keith Ballard 27 FLA D 82 ... 201 156 0 0 NaN 2010
28 25 Krys Barch 29 DAL RW 63 ... 13 120 0 3 0.0 2010
29 26 Cam Barker 23 TOT D 70 ... 53 75 0 0 NaN 2010
... ... .. ... .. .. ... ... ... ... ... ... ...
10251 885 Chris Wideman 29 TOT D 25 ... 26 35 0 0 NaN 2019
10252 885 Chris Wideman 29 OTT D 19 ... 25 26 0 0 NaN 2019
10253 885 Chris Wideman 29 EDM D 5 ... 1 7 0 0 NaN 2019
10254 885 Chris Wideman 29 FLA D 1 ... 0 2 0 0 NaN 2019
10255 886 Justin Williams 37 CAR RW 82 ... 32 55 92 150 38.0 2019
10256 887 Colin Wilson 29 COL C 65 ... 31 55 20 32 38.5 2019
10257 888 Garrett Wilson 27 PIT LW 50 ... 16 114 3 4 42.9 2019
10258 889 Scott Wilson 26 BUF C 15 ... 2 29 1 2 33.3 2019
10259 890 Tom Wilson 24 WSH RW 63 ... 52 200 29 24 54.7 2019
10260 891 Luke Witkowski 28 DET D 34 ... 27 67 0 0 NaN 2019
10261 892 Christian Wolanin 23 OTT D 30 ... 31 11 0 0 NaN 2019
10262 893 Miles Wood 23 NJD LW 63 ... 27 97 0 2 0.0 2019
10263 894 Egor Yakovlev 27 NJD D 25 ... 22 12 0 0 NaN 2019
10264 895 Kailer Yamamoto 20 EDM RW 17 ... 11 18 0 0 NaN 2019
10265 896 Keith Yandle 32 FLA D 82 ... 76 47 0 0 NaN 2019
10266 897 Pavel Zacha 21 NJD C 61 ... 24 68 348 364 48.9 2019
10267 898 Filip Zadina 19 DET RW 9 ... 3 6 3 3 50.0 2019
10268 899 Nikita Zadorov 23 COL D 70 ... 67 228 0 0 NaN 2019
10269 900 Nikita Zaitsev 27 TOR D 81 ... 151 139 0 0 NaN 2019
10270 901 Travis Zajac 33 NJD C 80 ... 38 66 841 605 58.2 2019
10271 902 Jakub Zboril 21 BOS D 2 ... 0 3 0 0 NaN 2019
10272 903 Mika Zibanejad 25 NYR C 82 ... 66 134 830 842 49.6 2019
10273 904 Mats Zuccarello 31 TOT LW 48 ... 43 57 10 20 33.3 2019
10274 904 Mats Zuccarello 31 NYR LW 46 ... 42 57 10 20 33.3 2019
10275 904 Mats Zuccarello 31 DAL LW 2 ... 1 0 0 0 NaN 2019
10276 905 Jason Zucker 27 MIN LW 81 ... 38 87 2 11 15.4 2019
10277 906 Valentin Zykov 23 TOT LW 28 ... 6 26 2 7 22.2 2019
10278 906 Valentin Zykov 23 CAR LW 13 ... 2 6 2 6 25.0 2019
10279 906 Valentin Zykov 23 VEG LW 10 ... 3 18 0 1 0.0 2019
10280 906 Valentin Zykov 23 EDM LW 5 ... 1 2 0 0 NaN 2019
[10281 rows x 29 columns]
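A portability note on the loop above: DataFrame.append was removed in pandas 2.0, so on current versions the idiomatic equivalent is to collect the frames in a list and concatenate once (a sketch of the same loop):
import pandas as pd

frames = []
for i in range(2010, 2020):
    url = 'https://www.hockey-reference.com/leagues/NHL_' + str(i) + '_skaters.html'
    df = pd.read_html(url, header=1)[0]  # first table on the page
    df['year'] = str(i)
    frames.append(df)
result = pd.concat(frames, ignore_index=True)
result = result[~result['Age'].str.contains("Age")]  # drop repeated header rows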
Scraping heavily formatted tables is positively painful with Beautiful Soup (not to bash Beautiful Soup, it's wonderful for several use cases). There's a bit of a 'hack' I use for scraping data surrounded by dense markup, if you're willing to be a bit utilitarian about it:
1. Select entire table on web page
2. Copy + paste into Evernote (simplifies and reformats the HTML)
3. Copy + paste from Evernote to Excel or another spreadsheet software (removes the HTML)
4. Save as .csv
It isn't perfect. There will be blank lines in the CSV, but blank lines are easier and far less time-consuming to remove than such data is to scrape. Good luck!
I have a dataframe like the following, with a multi-index of integers that represents months and days of the year, along with maximum and minimum temperature recordings from those days.
df
Min Temp Max Temp
Date Date
1 1 -88 139
2 -115 150
3 -110 139
4 -81 156
5 -80 172
... ... ...
12 2 -94 156
3 -97 172
4 -120 156
5 -124 144
6 -161 130
7 -167 135
8 -141 167
9 -135 178
10 -106 194
11 -106 161
12 -94 144
13 -92 133
14 -149 117
15 -158 117
16 -119 122
17 -111 160
18 -142 133
19 -185 130
20 -190 161
21 -167 161
22 -98 150
23 -162 139
24 -90 183
25 -125 183
26 -119 144
27 -76 130
28 -81 134
29 -117 113
30 -127 106
31 -111 122
How can I convert this multi-index to a single index that is of type datetime? Something like this conversion is what I am looking for:
1 1 ---> January 1
1 2 ---> January 2
...
12 31 ---> December 31
Using the top of your dataframe as an example:
>>> df
Min Temp Max Temp
Date Date
1 1 -88 139
2 -115 150
3 -110 139
4 -81 156
5 -80 172
Use pd.to_datetime on the individual levels of your MultiIndex, then strftime with your desired format:
df.index = pd.to_datetime(df.index.get_level_values(0).astype(str) + '-' +
df.index.get_level_values(1).astype(str),
format='%m-%d').strftime('%B %d')
>>> df
Min Temp Max Temp
January 01 -88 139
January 02 -115 150
January 03 -110 139
January 04 -81 156
January 05 -80 172
However, because this is a formatted string, the index will no longer be of datetime type. If you want it to stay datetime, you need to include a year. You can omit the strftime and it will use the default of 1900:
df.index = pd.to_datetime(df.index.get_level_values(0).astype(str) + '-' +
df.index.get_level_values(1).astype(str),
format='%m-%d')
>>> df
Min Temp Max Temp
1900-01-01 -88 139
1900-01-02 -115 150
1900-01-03 -110 139
1900-01-04 -81 156
1900-01-05 -80 172
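If you would rather anchor the index to a real year than the 1900 default, you can prepend one explicitly. A sketch, starting from the original MultiIndex frame; 2015 here is an arbitrary choice, not something from the question:
df.index = pd.to_datetime('2015-' + df.index.get_level_values(0).astype(str) + '-' +
                          df.index.get_level_values(1).astype(str),
                          format='%Y-%m-%d')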
Let's take this sample dataframe:
import pandas as pd
import numpy as np
arrays = [[1, 1, 1, 1, 2, 2, 2, 2], [28, 29, 30, 31 , 1, 2, 3, 4]]
index = pd.MultiIndex.from_arrays(arrays, names=('Month', 'Day'))
df = pd.DataFrame(np.random.randn(8,2), index=index)
Yields:
                  0         1
Month Day
1     28  -0.295065 -0.843433
      29   0.367759  0.837147
      30   0.051956  0.430499
      31   1.917990  1.066545
2     1    1.345338 -0.600304
      2   -0.475890  0.763301
      3    0.560985  1.747668
      4    0.377741 -0.310094
Simply use reset_index(), combine the columns and convert to datetime:
new = df.reset_index()
new['Date'] = pd.to_datetime(new['Month'].astype(str) + '/' + new['Day'].astype(str), format='%m/%d')
Yields:
Month Day 0 1 Date
0 1 28 -0.295065 -0.843433 1900-01-28
1 1 29 0.367759 0.837147 1900-01-29
2 1 30 0.051956 0.430499 1900-01-30
3 1 31 1.917990 1.066545 1900-01-31
4 2 1 1.345338 -0.600304 1900-02-01
5 2 2 -0.475890 0.763301 1900-02-02
6 2 3 0.560985 1.747668 1900-02-03
7 2 4 0.377741 -0.310094 1900-02-04
Finally, use set_index() and drop() columns:
new = new.set_index('Date').drop(['Month','Day'], axis=1)
Yields:
0 1
Date
1900-01-28 0.503419 -1.197496
1900-01-29 -0.059114 0.552766
1900-01-30 0.365710 -0.079030
1900-01-31 -2.782296 1.027040
1900-02-01 1.343155 -0.846419
1900-02-02 1.334560 0.392820
1900-02-03 0.537082 1.486579
1900-02-04 0.506200 0.138864
I have a dataframe object with date and calltime columns.
Was trying to build a histogram based on the second column. E.g.
df.groupby('calltime').head(10).plot(kind='hist', y='calltime')
Got the following: [histogram screenshot: nearly all of the data lands in the first bar]
The thing is that I want more detail for the first bar. The 0-2500 range itself is huge, and all the data is hidden in there... Is it possible to split it into smaller ranges, e.g. bins of 50 or something like that?
UPD
date calltime
0 1491928756414930 4643
1 1491928756419607 166
2 1491928756419790 120
3 1491928756419927 142
4 1491928756420083 121
5 1491928756420217 109
6 1491928756420409 52
7 1491928756420476 105
8 1491928756420605 35
9 1491928756420654 120
10 1491928756420787 105
11 1491928756420907 93
12 1491928756421013 37
13 1491928756421062 112
14 1491928756421187 41
15 1491928756421240 122
16 1491928756421375 28
17 1491928756421416 158
18 1491928756421587 65
19 1491928756421667 108
20 1491928756421790 55
21 1491928756421858 145
22 1491928756422018 37
23 1491928756422068 63
24 1491928756422145 57
25 1491928756422214 43
26 1491928756422270 73
27 1491928756422357 90
28 1491928756422460 72
29 1491928756422546 77
... ... ...
9845 1491928759997328 670
9846 1491928759998255 372
9848 1491928759999116 659
9849 1491928759999897 369
9850 1491928760000380 746
9851 1491928760001245 823
9852 1491928760002189 634
9853 1491928760002869 335
9856 1491928760003929 4162
9865 1491928760009368 531
Use the bins argument:
import numpy as np
import pandas as pd

s = pd.Series(np.abs(np.random.randn(100)) ** 3 * 2000)
s.hist(bins=20)
Or you can use pd.cut to produce your own custom bins.
pd.cut(
    s, [-np.inf] + [100 * i for i in range(10)] + [np.inf]
).value_counts(sort=False).plot.bar()
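Applied to the calltime column from the question, fixed-width bins of 50 would look like this (a sketch, assuming df is the frame shown in the question):
import numpy as np

bins = np.arange(0, df['calltime'].max() + 50, 50)  # bin edges every 50 units
df['calltime'].hist(bins=bins)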