I have Excel files with multiple sheets, each of which looks a little like this (but much longer):
Sample CD4 CD8
Day 1 8311 17.3 6.44
8312 13.6 3.50
8321 19.8 5.88
8322 13.5 4.09
Day 2 8311 16.0 4.92
8312 5.67 2.28
8321 13.0 4.34
8322 10.6 1.95
The first column is actually four cells merged vertically.
When I read this using pandas.read_excel, I get a DataFrame that looks like this:
Sample CD4 CD8
Day 1 8311 17.30 6.44
NaN 8312 13.60 3.50
NaN 8321 19.80 5.88
NaN 8322 13.50 4.09
Day 2 8311 16.00 4.92
NaN 8312 5.67 2.28
NaN 8321 13.00 4.34
NaN 8322 10.60 1.95
How can I either get Pandas to understand merged cells, or quickly and easily remove the NaN and group by the appropriate value? (One approach would be to reset the index, step through to find the values and replace NaNs with values, pass in the list of days, then set the index to the column. But it seems like there should be a simpler approach.)
You could use the Series.fillna method to forward-fill the NaN values:
df.index = pd.Series(df.index).fillna(method='ffill')
For example,
In [42]: df
Out[42]:
Sample CD4 CD8
Day 1 8311 17.30 6.44
NaN 8312 13.60 3.50
NaN 8321 19.80 5.88
NaN 8322 13.50 4.09
Day 2 8311 16.00 4.92
NaN 8312 5.67 2.28
NaN 8321 13.00 4.34
NaN 8322 10.60 1.95
[8 rows x 3 columns]
In [43]: df.index = pd.Series(df.index).fillna(method='ffill')
In [44]: df
Out[44]:
Sample CD4 CD8
Day 1 8311 17.30 6.44
Day 1 8312 13.60 3.50
Day 1 8321 19.80 5.88
Day 1 8322 13.50 4.09
Day 2 8311 16.00 4.92
Day 2 8312 5.67 2.28
Day 2 8321 13.00 4.34
Day 2 8322 10.60 1.95
[8 rows x 3 columns]
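Note that in recent pandas (2.x), fillna(method='ffill') is deprecated in favour of the dedicated Series.ffill method. A minimal sketch of the same index fix with the modern spelling, on a made-up index mirroring the example:

```python
import numpy as np
import pandas as pd

# Index as read from merged cells: only the first row of each group is labeled
idx = pd.Series(['Day 1', np.nan, np.nan, np.nan,
                 'Day 2', np.nan, np.nan, np.nan])

filled = idx.ffill()  # forward-fill the missing labels
print(filled.tolist())
```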
df = df.fillna(method='ffill', axis=0)  # resolves it by filling in the missing row entries
To casually come back 8 years later, pandas.read_excel() can solve this internally for you with the index_col parameter.
df = pd.read_excel('path_to_file.xlsx', index_col=[0])
Passing index_col as a list makes pandas look for a MultiIndex. When the list has length one, pandas creates a regular Index and fills in the merged values.
I have this dataframe df:
alpha1 week_day calendar_week
0 2.49 Freitag 2022-04-(01/07)
1 1.32 Samstag 2022-04-(01/07)
2 2.70 Sonntag 2022-04-(01/07)
3 3.81 Montag 2022-04-(01/07)
4 3.58 Dienstag 2022-04-(01/07)
5 3.48 Mittwoch 2022-04-(01/07)
6 1.79 Donnerstag 2022-04-(01/07)
7 2.12 Freitag 2022-04-(08/14)
8 2.41 Samstag 2022-04-(08/14)
9 1.78 Sonntag 2022-04-(08/14)
10 3.19 Montag 2022-04-(08/14)
11 3.33 Dienstag 2022-04-(08/14)
12 2.88 Mittwoch 2022-04-(08/14)
13 2.98 Donnerstag 2022-04-(08/14)
14 3.01 Freitag 2022-04-(15/21)
15 3.04 Samstag 2022-04-(15/21)
16 2.72 Sonntag 2022-04-(15/21)
17 4.11 Montag 2022-04-(15/21)
18 3.90 Dienstag 2022-04-(15/21)
19 3.16 Mittwoch 2022-04-(15/21)
and so on, with ascending calendar weeks.
I performed a pivot table to generate a heatmap.
df_pivot = pd.pivot_table(df, values=['alpha1'], index=['week_day'], columns=['calendar_week'])
What I get is:
alpha1 \
calendar_week 2022-(04-29/05-05) 2022-(05-27/06-02) 2022-(07-29/08-04)
week_day
Dienstag 3.32 2.09 4.04
Donnerstag 3.27 2.21 4.65
Freitag 2.83 3.08 4.19
Mittwoch 3.22 3.14 4.97
Montag 2.83 2.86 4.28
Samstag 2.62 3.62 3.88
Sonntag 2.81 3.25 3.77
\
calendar_week 2022-(08-26/09-01) 2022-04-(01/07) 2022-04-(08/14)
week_day
Dienstag 2.92 3.58 3.33
Donnerstag 3.58 1.79 2.98
Freitag 3.96 2.49 2.12
Mittwoch 3.09 3.48 2.88
Montag 3.85 3.81 3.19
Samstag 3.10 1.32 2.41
Sonntag 3.39 2.70 1.78
As you can see, the column sorting of the pivot table is messed up. I need the columns (calendar weeks) in the same order as in the original dataframe.
I have been looking all over but couldn't find how to achieve this.
It would also be very nice if the row order stayed the same.
Any help will be greatly appreciated.
UPDATE
I didn't paste all the data; it would have been too long.
The calendar_week column consists of the following elements:
'2022-04-(01/07)',
'2022-04-(08/14)',
'2022-04-(15/21)',
'2022-04-(22/28)',
'2022-(04-29/05-05)',
'2022-05-(06/12)',
'2022-05-(13/19)',
'2022-05-(20/26)',
'2022-(05-27/06-02)',
'2022-06-(03/09)',
'2022-06-(10/16)',
'2022-06-(17/23)',
'2022-06-(24/30)',
'2022-07-(01/07)',
'2022-07-(08/14)',
'2022-07-(15/21)',
'2022-07-(22/28)',
'2022-(07-29/08-04)',
'2022-08-(05/11)',
etc.
Each occurs 7 times in df and represents a calendar week.
The sorting is the natural time order.
After pivoting the dataframe, the column sorting gets messed up, and I guess it's due to the two different label formats: 2022-(07-29/08-04) and 2022-07-(15/21).
Try running this:
df_pivot.sort_values(by=['calendar_week'], axis=1, ascending=True)
I got the following output. Is this what you wanted?
calendar_week  2022-04-(01/07)  2022-04-(08/14)  2022-04-(15/21)
week_day
Dienstag                  3.58             3.33             3.90
Donnerstag                1.79             2.98              NaN
Freitag                   2.49             2.12             3.01
Mittwoch                  3.48             2.88             3.16
Montag                    3.81             3.19             4.11
Be sure to remove the NaN values using the fillna() function.
I hope that answers it. :)
You can use an ordered Categorical for your week days and sort the dates after pivoting with sort_index:
# define the desired order of the days
days = ['Montag', 'Dienstag', 'Mittwoch', 'Donnerstag',
        'Freitag', 'Samstag', 'Sonntag']

df_pivot = (df
    .assign(week_day=pd.Categorical(df['week_day'], categories=days,
                                    ordered=True))
    .pivot_table(values='alpha1', index='week_day',
                 columns='calendar_week')
    .sort_index(axis=1)
)
output:
calendar_week 2022-04-(01/07) 2022-04-(08/14) 2022-04-(15/21)
week_day
Montag 3.81 3.19 4.11
Dienstag 3.58 3.33 3.90
Mittwoch 3.48 2.88 3.16
Donnerstag 1.79 2.98 NaN
Freitag 2.49 2.12 3.01
Samstag 1.32 2.41 3.04
Sonntag 2.70 1.78 2.72
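Since the two mixed label formats defeat lexicographic sorting, another option is to freeze the chronological column order from the original frame as an ordered Categorical before pivoting. A sketch on made-up miniature data (two weekdays across three weeks):

```python
import pandas as pd

# Miniature frame: chronological rows, but with the two mixed label formats
df = pd.DataFrame({
    'alpha1': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    'week_day': ['Montag', 'Dienstag'] * 3,
    'calendar_week': ['2022-04-(22/28)', '2022-04-(22/28)',
                      '2022-(04-29/05-05)', '2022-(04-29/05-05)',
                      '2022-05-(06/12)', '2022-05-(06/12)'],
})

# Freeze the original (chronological) column order before pivoting, so the
# mixed string formats are never sorted lexicographically
week_order = df['calendar_week'].unique()
df['calendar_week'] = pd.Categorical(df['calendar_week'],
                                     categories=week_order, ordered=True)

pivot = df.pivot_table(values='alpha1', index='week_day',
                       columns='calendar_week', observed=False)
print(list(pivot.columns))
```

pivot_table sorts its group keys, and for an ordered Categorical "sorted" means category order, so the columns come out chronological rather than alphabetical.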
I have a dataframe that I'd like to export to a CSV file with all the columns stacked on top of one another. I want to label each value with its column header plus the year, in this format: Allu_1_2013.
date Allu_1 Allu_2 Alluv_3 year
2013-01-01 2.00 1.45 3.54 2013
2014-01-01 3.09 2.35 9.01 2014
2015-01-01 4.79 4.89 10.04 2015
The final CSV text file should look like:
Allu_1_2013 2.00
Allu_1_2014 3.09
Allu_1_2015 4.79
Allu_2_2013 1.45
Allu_2_2014 2.35
Allu_2_2015 4.89
Allu_3_2013 3.54
Allu_3_2014 9.01
Allu_3_2015 10.04
You can use melt:
new_df = df.melt(id_vars=["date", "year"],
var_name="Date",
value_name="Value").drop(columns=['date'])
new_df['idx'] = new_df['Date'] + '_' + new_df['year'].astype(str)
new_df = new_df.drop(columns=['year', 'Date'])
   Value           idx
0   2.00   Allu_1_2013
1   3.09   Allu_1_2014
2   4.79   Allu_1_2015
3   1.45   Allu_2_2013
4   2.35   Allu_2_2014
5   4.89   Allu_2_2015
6   3.54  Alluv_3_2013
7   9.01  Alluv_3_2014
8  10.04  Alluv_3_2015
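Putting the whole pipeline together, including the CSV serialization the question asks for (here to_csv renders to a string; pass a path instead to write a file):

```python
import pandas as pd

df = pd.DataFrame({
    'date': ['2013-01-01', '2014-01-01', '2015-01-01'],
    'Allu_1': [2.00, 3.09, 4.79],
    'Allu_2': [1.45, 2.35, 4.89],
    'Alluv_3': [3.54, 9.01, 10.04],
    'year': [2013, 2014, 2015],
})

# melt stacks the three Allu columns; the label combines header and year
long_df = df.melt(id_vars=['date', 'year'], var_name='col', value_name='Value')
long_df['label'] = long_df['col'] + '_' + long_df['year'].astype(str)

# Render the requested two-column layout without header or index
csv_text = long_df[['label', 'Value']].to_csv(index=False, header=False)
print(csv_text)
```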
In the dataframe, I want to create 3 new columns labeled A-hat, B-hat and C-hat. Each should hold the original value where it is the highest of columns A, B and C in that row, and NaN otherwise. Hence in each row, the 3 new columns should contain two NaNs and one highest value.
input df:
A B C
Date
2020-01-05 3.57 5.29 6.23
2020-01-04 4.98 9.64 7.58
2020-01-03 3.79 5.25 6.26
2020-01-02 3.95 5.65 6.61
2020-01-01 -3.10 -7.20 -8.16
output df:
A B C A-hat B-hat C-hat
Date
2020-01-05 3.57 5.29 6.23 NaN NaN 6.23
2020-01-04 4.98 9.64 7.58 NaN 9.64 NaN
2020-01-03 3.79 5.25 6.26 NaN NaN 6.26
2020-01-02 3.95 5.65 6.61 NaN NaN 6.61
2020-01-01 -3.10 -7.20 -8.16 -3.10 NaN NaN
How can I achieve this output?
You can compare each cell against its row maximum using DataFrame.max with DataFrame.eq, then use DataFrame.where to set the cells that don't match the mask to NaN:
df = df.join(df.where(df.eq(df.max(axis=1), axis=0)).add_suffix('-hat'))
print (df)
A B C A-hat B-hat C-hat
Date
2020-01-05 3.57 5.29 6.23 NaN NaN 6.23
2020-01-04 4.98 9.64 7.58 NaN 9.64 NaN
2020-01-03 3.79 5.25 6.26 NaN NaN 6.26
2020-01-02 3.95 5.65 6.61 NaN NaN 6.61
2020-01-01 -3.10 -7.20 -8.16 -3.1 NaN NaN
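Read inside-out, the one-liner is three steps; a minimal reproduction on a trimmed copy of the sample data (the Date index is omitted, which doesn't change the logic):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [3.57, 4.98, -3.10],
                   'B': [5.29, 9.64, -7.20],
                   'C': [6.23, 7.58, -8.16]})

# 1) row-wise max, 2) broadcast-compare against each column (axis=0 aligns
#    on the index), 3) blank out everything that isn't the row maximum
mask = df.eq(df.max(axis=1), axis=0)
out = df.join(df.where(mask).add_suffix('-hat'))
print(out)
```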
I am trying to create new columns where each row holds the value of the previous row (the day before).
My data is formatted like this (the original file has 12 columns plus the timestamp, and thousands of rows):
import numpy as np
import pandas as pd
df = pd.DataFrame({"Timestamp" : ['1993-11-01' ,'1993-11-02', '1993-11-03', '1993-11-04','1993-11-15'], "Austria" : [6.11 ,6.18, 6.17, 6.17, 6.40],"Belgium" : [7.01, 7.05, 7.2, 7.5, 7.6],"France" : [7.69, 7.61, 7.67, 7.91, 8.61]},index = [1, 2, 3,4,5])
What I have:
Timestamp Austria Belgium France
1 1993-11-01 6.11 7.01 7.69
2 1993-11-02 6.18 7.05 7.61
3 1993-11-03 6.17 7.20 7.67
4 1993-11-04 6.17 7.50 7.91
5 1993-11-15 6.40 7.60 8.61
What I want:
Timestamp Austria t-1 Belgium t-1 France t-1
1 1993-11-01 NaN NaN NaN
2 1993-11-02 6.11 7.01 7.69
3 1993-11-03 6.18 7.05 7.61
4 1993-11-04 6.17 7.20 7.67
5 1993-11-15 6.17 7.50 7.91
It's easy in Excel, but I cannot find a way to do it in Python. Surely there is a way. Does anyone know how to do it?
Use shift on the columns you want to lag:
cols = ["Austria", "Belgium", "France"]
df[cols] = df[cols].shift()
print(df)
Output
Timestamp Austria Belgium France
1 1993-11-01 NaN NaN NaN
2 1993-11-02 6.11 7.01 7.69
3 1993-11-03 6.18 7.05 7.61
4 1993-11-04 6.17 7.20 7.67
5 1993-11-15 6.17 7.50 7.91
As an alternative:
df.iloc[:, 1:] = df.iloc[:, 1:].shift()
print(df)
First use df.set_index on the Timestamp column, then df.shift:
In [4400]: d = df.set_index('Timestamp').shift()
In [4403]: d.columns = [i + ' t-1' for i in d.columns]
In [4406]: d.reset_index(inplace=True)
In [4407]: d
Out[4407]:
Timestamp Austria t-1 Belgium t-1 France t-1
0 1993-11-01 NaN NaN NaN
1 1993-11-02 6.11 7.01 7.69
2 1993-11-03 6.18 7.05 7.61
3 1993-11-04 6.17 7.20 7.67
4 1993-11-15 6.17 7.50 7.91
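The set_index/shift/rename steps above can also be collapsed into a single chain with DataFrame.add_suffix; a sketch on a trimmed copy of the sample data:

```python
import pandas as pd

df = pd.DataFrame({
    'Timestamp': ['1993-11-01', '1993-11-02', '1993-11-03'],
    'Austria': [6.11, 6.18, 6.17],
    'Belgium': [7.01, 7.05, 7.20],
})

# set_index keeps Timestamp out of the shift; add_suffix renames every
# remaining column in one step; reset_index restores Timestamp as a column
lagged = df.set_index('Timestamp').shift().add_suffix(' t-1').reset_index()
print(lagged)
```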