summing rows in multi-index pandas dataframe

I have a pandas DataFrame with a MultiIndex:
           A   B
year age
1895 0    10  12
1895 1    13  14
...
1965 0    34  45
1965 1    41  34
...
1965 50   56  22
1966 0    10  34
...
I would like to sum column A (and B) over all ages between two values (e.g. 10 and 20), per year. I played around a bit with .xs, e.g.
pops.xs(20, level='age')
which gives all the rows with age 20 for each year, but I cannot get this for multiple ages (and summed).
E.g. for ages 0 and 1 I would like to get
       A   B
year
1895  23  26
...
1965  75  79
...
Any suggestions for an elegant (efficient) way to do that?

Use query to select the rows, then sum per first level (year):
print (df)
           A   B
year age
1895 8    10  12
     12   13  14
1965 0    34  45
     14   41  34
     12   56  22
1966 0    10  34
# sum(level=0) was removed in pandas 2.0; groupby(level=0).sum() is the equivalent
df = df.query('10 <= age <= 20').groupby(level=0).sum()
print (df)
       A   B
year
1895  13  14
1965  97  56
Detail:
print (df.query('10 <= age <= 20'))
           A   B
year age
1895 12   13  14
1965 14   41  34
     12   56  22
Another solution is to use Index.get_level_values to get the age level and filter by boolean indexing:
i = df.index.get_level_values('age')
print (i)
Int64Index([8, 12, 0, 14, 12, 0], dtype='int64', name='age')
# groupby(level=0).sum() replaces sum(level=0), which pandas 2.0 removed
df = df[(i >= 10) & (i <= 20)].groupby(level=0).sum()
print (df)
       A   B
year
1895  13  14
1965  97  56
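For anyone who wants to run both variants end to end, here is a minimal sketch. It rebuilds the sample frame printed above by hand (the construction code is my assumption, not from the answer) and uses groupby(level=...).sum(), which should work on both older and current pandas.

```python
import pandas as pd

# Sample data copied from the printed frame in the answer above.
index = pd.MultiIndex.from_tuples(
    [(1895, 8), (1895, 12), (1965, 0), (1965, 14), (1965, 12), (1966, 0)],
    names=["year", "age"],
)
df = pd.DataFrame(
    {"A": [10, 13, 34, 41, 56, 10], "B": [12, 14, 45, 34, 22, 34]},
    index=index,
)

# Variant 1: filter with query (index level names can be used like columns),
# then sum per year.
by_query = df.query("10 <= age <= 20").groupby(level="year").sum()

# Variant 2: filter with a boolean mask built from the 'age' level.
age = df.index.get_level_values("age")
by_mask = df[(age >= 10) & (age <= 20)].groupby(level="year").sum()

print(by_query)
```

Both variants give the same frame. Years with no rows in the age range (1966 here) simply drop out of the result.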

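You can use loc with slice objects to select the part of the DataFrame you want, for example:
df.sort_index().loc[(slice(None),slice(10,20)),:].groupby(level=0).sum()
where (slice(None),slice(10,20)) keeps all years and the ages from 10 to 20 inclusive. Slicing a MultiIndex by label requires a lexsorted index, hence the sort_index() call; sum(level=0) from older pandas is written here as groupby(level=0).sum().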
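The same selection can also be written with pd.IndexSlice, which some people find easier to read than nested slice objects. This is only a sketch on hand-built sample data (the frame construction is my own assumption):

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [(1895, 8), (1895, 12), (1965, 0), (1965, 14), (1965, 12), (1966, 0)],
    names=["year", "age"],
)
df = pd.DataFrame(
    {"A": [10, 13, 34, 41, 56, 10], "B": [12, 14, 45, 34, 22, 34]},
    index=index,
)

# Label slicing on a MultiIndex needs a lexsorted index, so sort first.
df_sorted = df.sort_index()

# idx[:, 10:20] is shorthand for (slice(None), slice(10, 20)).
idx = pd.IndexSlice
result = df_sorted.loc[idx[:, 10:20], :].groupby(level="year").sum()
print(result)
```

Note that label slices include both endpoints, so age 20 would be part of the selection if it were present.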

Related

How to split in train and test by month

I have a dataframe structured like this
Time Z X Y
01-01-18 1 20 10
02-01-18 20 4 15
03-01-18 34 16 21
04-01-18 67 38 8
05-01-18 89 10 18
06-01-18 45 40 4
07-01-18 22 10 13
08-01-18 1 46 11
...
24-12-20 56 28 9
25-12-20 6 14 22
26-12-20 9 5 40
27-12-20 56 11 10
28-12-21 78 61 35
29-12-21 33 23 29
30-12-21 2 35 12
31-12-21 0 31 7
I have data for all days and months from 2018 to 2021, with around 50k observations.
How can I group all the data for the same month and perform a train-test split for each month, i.e. for all the January data, all the February data, and so on?

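Try this to extract the month from each date string, which you can then group on:
df['month'] = df.Time.apply(lambda x: x.split('-')[1])  # get month (middle field of dd-mm-yy)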
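Once the month column exists, one possible way to do the split itself is to sample a fixed fraction inside each month group. This is a rough sketch, not the only way: the miniature data, the 80/20 ratio and the random_state are all assumptions of mine, and it needs a pandas version that has GroupBy.sample (1.1 or later).

```python
import pandas as pd

# Hypothetical miniature of the question's data: one row per day, dd-mm-yy strings.
dates = pd.date_range("2018-01-01", "2018-03-31", freq="D")
df = pd.DataFrame({
    "Time": dates.strftime("%d-%m-%y"),
    "Z": range(len(dates)),
})

# Month taken from the middle field of the dd-mm-yy string.
df["month"] = df["Time"].str.split("-").str[1]

# Sample 80% of the rows within each month for training;
# whatever is left over becomes the test set.
train = df.groupby("month").sample(frac=0.8, random_state=0)
test = df.drop(train.index)

print(train["month"].value_counts().sort_index())
print(test["month"].value_counts().sort_index())
```

Because the grouping key is the month alone, all Januaries from every year end up in the same group. If you want a separate model per month, you can loop over train.groupby("month") afterwards.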

Sorting Columns By Ascending Order

Given this example dataframe,
Date 01012019 01022019 02012019 02022019 03012019 03022019
Period
1 45 21 43 23 32 23
2 42 12 43 11 14 65
3 11 43 24 23 21 12
I would like to sort the columns by date, month first (the dates are in ddmmyyyy format). However, the column labels are strings, as type(date) shows. I tried pd.to_datetime, but it failed with the error "month must be in 1..12".
Any advice? Thank you!

Specify the format of the datetimes in to_datetime and then use sort_index:
df.columns = pd.to_datetime(df.columns, format='%d%m%Y')
df = df.sort_index(axis=1)
print (df)
      2019-01-01  2019-01-02  2019-01-03  2019-02-01  2019-02-02  2019-02-03
Date
1             45          43          32          21          23          23
2             42          43          14          12          11          65
3             11          24          21          43          23          12
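If you would rather keep the original ddmmyyyy strings as column labels and only use the parsed dates for ordering, sort_index accepts a key function (pandas 1.1 or later). A small sketch, with the frame rebuilt by hand from the question:

```python
import pandas as pd

df = pd.DataFrame(
    [[45, 21, 43, 23, 32, 23],
     [42, 12, 43, 11, 14, 65],
     [11, 43, 24, 23, 21, 12]],
    index=pd.Index([1, 2, 3], name="Period"),
    columns=pd.Index(
        ["01012019", "01022019", "02012019", "02022019", "03012019", "03022019"],
        name="Date",
    ),
)

# Order the columns by the parsed dates, but leave the string labels untouched.
sorted_df = df.sort_index(
    axis=1, key=lambda labels: pd.to_datetime(labels, format="%d%m%Y")
)
print(sorted_df.columns.tolist())
```

The key function receives the column Index and must return something of the same length, which pd.to_datetime does here.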

python pandas assign yyyy-mm-dd from multiple years into accumulated week numbers

Given a file with the following columns:
date, userid, amount
where date is in yyyy-mm-dd format. I am trying to use python pandas to assign yyyy-mm-dd from multiple years into accumulated week numbers. For example:
2017-01-01 => 1
2017-12-31 => 52
2018-01-01 => 53
df_counts_dates=pd.read_csv("counts.csv")
print (df_counts_dates['date'].unique())
df = pd.to_datetime(df_counts_dates['date'])
print (df.unique())
print (df.dt.week.unique())
Since the data contains dates from Aug 2017 to Aug 2018, the above returns
[33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 1 2 3 4 5
6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
31 32]
I am wondering if there is any easy way to make the first date "week 1", and make the week number accumulate across years instead of becoming 1 at the beginning of each year?

I believe you need a slightly different approach: subtract the first value from every value in the column, convert the resulting timedeltas to days, floor-divide by 7, and finally add 1 so the numbering does not start at 0:
rng = pd.date_range('2017-08-01', periods=365)
df = pd.DataFrame({'date': rng, 'a': range(365)})
print (df.head())
        date  a
0 2017-08-01  0
1 2017-08-02  1
2 2017-08-03  2
3 2017-08-04  3
4 2017-08-05  4
w = ((df['date'] - df['date'].iloc[0]).dt.days // 7 + 1).unique()
print (w)
[ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48
49 50 51 52 53]
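Applied to a file like the one in the question, the same idea could look like the sketch below. The rows are made up for illustration, and I subtract the minimum date instead of the first row so that the order of rows in the file does not matter (that is my own variation, not part of the answer).

```python
import pandas as pd

# Hypothetical rows in arbitrary order, using the question's column names.
df = pd.DataFrame({
    "date": ["2018-01-01", "2017-08-01", "2017-08-08", "2017-12-31"],
    "userid": [1, 2, 3, 4],
    "amount": [5.0, 6.0, 7.0, 8.0],
})
df["date"] = pd.to_datetime(df["date"])

# Count 7-day blocks from the earliest date; the earliest date gets week 1.
df["week"] = (df["date"] - df["date"].min()).dt.days // 7 + 1
print(df.sort_values("date"))
```

Keep in mind these are 7-day blocks counted from the first date, not calendar weeks, so a block can start on any weekday.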

Python / Pandas - merging two dataframes based in a non-index column

I want to join two dataframes. I have already tried concat, merge and join, but I must be doing something wrong.
df 1:
index cnpj country state
1 7468 34 23
4 3421 23 12
7 2314 12 45
df 2:
index cnpj street number
2 7468 32 34
5 3421 18 89
546 2314 92 73
I want them to be merged using 'cnpj' as a 'joining key' and preserving the index of df1. It should look like this:
df 1:
index cnpj country state street number
1 7468 34 23 32 34
4 3421 23 12 18 89
7 2314 12 45 92 73
Any suggestions on how to do that?

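Let's use merge with suffixes and drop. Both frames have an 'index' column, so suffixes=('','_y') keeps df1's name unchanged and renames df2's to 'index_y', which is then dropped: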
df1.merge(df2, on='cnpj',suffixes=('','_y')).drop('index_y',axis=1)
Output:
index cnpj country state street number
0 1 7468 34 23 32 34
1 4 3421 23 12 18 89
2 7 2314 12 45 92 73
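If 'index' is the actual DataFrame index rather than a column, merge on a column will discard it. One way around that, sketched here on hand-built copies of the question's frames (an assumption about how the data is stored), is to move the index into a column before merging and restore it afterwards:

```python
import pandas as pd

df1 = pd.DataFrame(
    {"cnpj": [7468, 3421, 2314], "country": [34, 23, 12], "state": [23, 12, 45]},
    index=[1, 4, 7],
)
df2 = pd.DataFrame(
    {"cnpj": [7468, 3421, 2314], "street": [32, 18, 92], "number": [34, 89, 73]},
    index=[2, 5, 546],
)

# reset_index() turns df1's unnamed index into a column called 'index';
# after the merge, set_index('index') puts it back.
merged = (
    df1.reset_index()
       .merge(df2, on="cnpj", how="left")
       .set_index("index")
)
print(merged)
```

how="left" keeps every row of df1 in its original order, even when a cnpj has no match in df2.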

Combine duplicated columns within a DataFrame

If I have a dataframe that has columns that include the same name, is there a way to combine the columns that have the same name with some sort of function (i.e. sum)?
For instance with:
In [186]:
df["NY-WEB01"].head()
Out[186]:
                     NY-WEB01  NY-WEB01
DateTime
2012-10-18 16:00:00       5.6       2.8
2012-10-18 17:00:00      18.6      12.0
2012-10-18 18:00:00      18.4      12.0
2012-10-18 19:00:00      18.2      12.0
2012-10-18 20:00:00      19.2      12.0
How might I collapse the NY-WEB01 columns (there are a bunch of duplicate columns, not just NY-WEB01) by summing each row where the column name is the same?

I believe this does what you are after:
df.groupby(lambda x:x, axis=1).sum()
Alternatively, between 3% and 15% faster depending on the length of the df:
df.groupby(df.columns, axis=1).sum()
EDIT: To extend this beyond sums, use .agg() (short for .aggregate()), for example with the name of the aggregation:
df.groupby(df.columns, axis=1).agg('max')
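As far as I know, recent pandas releases (2.1 onward) warn that axis=1 in groupby is deprecated. A variant that avoids it is to transpose, group on the index, and transpose back. The sketch below uses a small made-up frame; the NY-WEB02 column is hypothetical.

```python
import pandas as pd

# Hypothetical frame with repeated column names.
df = pd.DataFrame(
    [[5.6, 2.8, 1.0], [18.6, 12.0, 2.0]],
    columns=["NY-WEB01", "NY-WEB01", "NY-WEB02"],
)

# Transpose so the duplicated names become index labels, group on them,
# aggregate, and transpose back.
summed = df.T.groupby(level=0).sum().T
maxed = df.T.groupby(level=0).max().T
print(summed)
```

Transposing copies the data, so on very wide or very long frames this may be slower than the axis=1 form.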

pandas >= 0.20: df.groupby(level=0, axis=1)
You don't need a lambda here, nor do you explicitly have to query df.columns; groupby accepts a level argument you can specify in conjunction with the axis argument. This is cleaner, IMO.
# Setup
np.random.seed(0)
df = pd.DataFrame(np.random.choice(50, (5, 5)), columns=list('AABBB'))
df
A A B B B
0 44 47 0 3 3
1 39 9 19 21 36
2 23 6 24 24 12
3 1 38 39 23 46
4 24 17 37 25 13
df.groupby(level=0, axis=1).sum()
A B
0 91 6
1 48 76
2 29 60
3 39 108
4 41 75
Handling MultiIndex columns
Another case to consider is when dealing with MultiIndex columns. Consider
df.columns = pd.MultiIndex.from_arrays([['one']*3 + ['two']*2, df.columns])
df
  one         two
    A   A   B   B   B
0  44  47   0   3   3
1  39   9  19  21  36
2  23   6  24  24  12
3   1  38  39  23  46
4  24  17  37  25  13
To aggregate by the lower-level names, combining columns across the upper level, use
df.groupby(level=1, axis=1).sum()
A B
0 91 6
1 48 76
2 29 60
3 39 108
4 41 75
or, to aggregate duplicates only within each upper-level group, use
df.groupby(level=[0, 1], axis=1).sum()
  one      two
    A   B    B
0  91   0    6
1  48  19   57
2  29  24   36
3  39  39   69
4  41  37   38
Alternate Interpretation: Dropping Duplicate Columns
If you came here looking to find out how to simply drop duplicate columns (without performing any aggregation), use Index.duplicated:
df.loc[:,~df.columns.duplicated()]
A B
0 44 0
1 39 19
2 23 24
3 1 39
4 24 37
Or, to keep the last ones, specify keep='last' (default is 'first'),
df.loc[:,~df.columns.duplicated(keep='last')]
A B
0 47 3
1 9 36
2 6 12
3 38 46
4 17 13
The groupby alternatives for the two solutions above are df.groupby(level=0, axis=1).first(), and ... .last(), respectively.

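Here is a possibly simpler solution for common aggregation functions like sum, mean, median, max, min and std: pass axis=1 (to work with columns) together with level. Note that the level parameter of these methods was removed in pandas 2.0, so this only works on older versions:
# sample data from coldspeed's answer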
np.random.seed(0)
df = pd.DataFrame(np.random.choice(50, (5, 5)), columns=list('AABBB'))
print (df)
print (df.sum(axis=1, level=0))
A B
0 91 6
1 48 76
2 29 60
3 39 108
4 41 75
df.columns = pd.MultiIndex.from_arrays([['one']*3 + ['two']*2, df.columns])
print (df.sum(axis=1, level=1))
A B
0 91 6
1 48 76
2 29 60
3 39 108
4 41 75
print (df.sum(axis=1, level=[0,1]))
  one      two
    A   B    B
0  91   0    6
1  48  19   57
2  29  24   36
3  39  39   69
4  41  37   38
It works similarly for the index; use axis=0 instead of axis=1:
np.random.seed(0)
df = pd.DataFrame(np.random.choice(50, (5, 5)), columns=list('ABCDE'), index=list('aabbc'))
print (df)
A B C D E
a 44 47 0 3 3
a 39 9 19 21 36
b 23 6 24 24 12
b 1 38 39 23 46
c 24 17 37 25 13
print (df.min(axis=0, level=0))
A B C D E
a 39 9 0 3 3
b 1 6 24 23 12
c 24 17 37 25 13
df.index = pd.MultiIndex.from_arrays([['bar']*3 + ['foo']*2, df.index])
print (df.mean(axis=0, level=1))
A B C D E
a 41.5 28.0 9.5 12.0 19.5
b 12.0 22.0 31.5 23.5 29.0
c 24.0 17.0 37.0 25.0 13.0
print (df.max(axis=0, level=[0,1]))
        A   B   C   D   E
bar a  44  47  19  21  36
    b  23   6  24  24  12
foo b   1  38  39  23  46
    c  24  17  37  25  13
If you need other functions such as first, last, size or count, use the groupby approach from coldspeed's answer.
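On pandas 2.0 and later, where the level argument is gone, the same results can be obtained with groupby. A minimal sketch, with the values typed in from the frame printed above rather than regenerated from the random seed:

```python
import pandas as pd

# Values copied from the printed frame above.
df = pd.DataFrame(
    [[44, 47, 0, 3, 3],
     [39, 9, 19, 21, 36],
     [23, 6, 24, 24, 12],
     [1, 38, 39, 23, 46],
     [24, 17, 37, 25, 13]],
    columns=list("ABCDE"),
    index=list("aabbc"),
)

# Equivalent of df.min(axis=0, level=0): group rows by their index label.
row_min = df.groupby(level=0).min()

# Equivalent of df.sum(axis=1, level=0) for duplicated column labels:
# transpose, group the former columns, transpose back.
dup = df.reset_index(drop=True)
dup.columns = list("AABBB")
col_sum = dup.T.groupby(level=0).sum().T

print(row_min)
print(col_sum)
```

For a MultiIndex, pass the level number, name, or a list of them to groupby(level=...) in the same way.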
