Pandas Dataframe multiindex - python

I'm new to Python, Pandas, Dash, etc. I'm trying to structure a dataframe so I can create some dash components for graphing that will allow the user to see and filter data.
At the top are the aggregation characteristics: the first 3 are required and the remaining ones are sparse, depending on whether or not the data was aggregated by that characteristic. After the first ellipsis there are some summary characteristics for the day, and after the second ellipsis is the time series data for the aggregation. There are about 3800 pre-calculated aggregate groupings in this example.
Should I try to make the aggregate characteristics into a MultiIndex?
The RUNID is the identifier of the analysis run that created the output (the same number for all 3819 columns). The UID should be unique for each column of a single run, but multiple runs will have the same UID with different RUNIDs. The UID is the unique combination of CHAR1 through CHAR20 for that RUNID and AGGLEVEL. The AGGLEVEL is the analysis grouping, which may have one or more columns of output. For example, CHAR3_CHAR6_UNADJ covers the unique combinations of CHAR3 and CHAR6, so those two rows are populated while the remaining CHAR rows are null (well, NaN). CHAR1 through CHAR20 are only populated if that column has data aggregated by that characteristic. My current example is just one run, but there are tens of thousands of runs. I usually focus on one at a time and probably won't deal with more than 10-20 at a time, each for a subset of its data.
My dataframe example:
print(dft)
0 ... 3818
UID 32 ... 19980
RUNID 1234 ... 1234
AGGLEVEL CHAR12_ADJ ... CHAR3_CHAR6_UNADJ
CHAR1 NaN ... NaN
CHAR2 NaN ... NaN
CHAR3 NaN ... 1234
CHAR4 NaN ... NaN
CHAR5 NaN ... NaN
CHAR6 NaN ... ABCD
CHAR7 NaN ... NaN
CHAR8 NaN ... NaN
CHAR9 NaN ... NaN
CHAR10 NaN ... NaN
CHAR11 NaN ... NaN
CHAR12 IJKL ... NaN
CHAR13 NaN ... NaN
CHAR14 NaN ... NaN
CHAR15 NaN ... NaN
CHAR16 NaN ... NaN
CHAR17 NaN ... NaN
CHAR18 NaN ... NaN
CHAR19 NaN ... NaN
CHAR20 NaN ... NaN
...
STARTTIME 2018-08-22 00:00:00 ... 2018-08-22 00:00:00
MAXIMUM 2.676 ... 0.654993
MINIMUM 0.8868 ... 0.258181
...
00:00 1.2288 ... 0.335217
01:00 1.2828 ... 0.337848
02:00 1.2876 ... 0.324639
03:00 1.194 ... 0.314569
04:00 1.2876 ... 0.258181
05:00 1.1256 ... 0.284699
06:00 1.4016 ... 0.364655
07:00 1.122 ... 0.388968
08:00 1.0188 ... 0.452711
09:00 1.008 ... 0.507032
10:00 1.0272 ... 0.546807
11:00 0.972 ... 0.605359
12:00 1.062 ... 0.641152
13:00 0.8868 ... 0.625082
14:00 1.1076 ... 0.623865
15:00 0.9528 ... 0.654993
16:00 1.014 ... 0.645511
17:00 2.676 ... 0.62638
18:00 0.9888 ... 0.551629
19:00 1.038 ... 0.518322
20:00 1.2528 ... 0.50793
21:00 1.08 ... 0.456993
22:00 1.1724 ... 0.387063
23:00 1.1736 ... 0.345045
[62 rows x 3819 columns]

You should try transposing it with dft.T. The index will then be the sample number, from 0 to 3818, and each characteristic becomes a column, so it is easier to select data with dft['STARTTIME'], for instance.
For the NaN values, run dft = dft.replace('NaN', np.nan) so that pandas treats them as real missing values and not as strings (don't forget import numpy as np first). You can then use pd.isna(dft) to check for NaN in your DataFrame, or dft.dropna() to keep only the complete rows.
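As a rough sketch of how the transposed frame could then be indexed by the required characteristics, the example below builds a tiny stand-in for dft (two columns and a handful of rows copied from the printout; the real frame is much larger) and sets RUNID, AGGLEVEL and UID as a MultiIndex. The name wide and the choice of those three levels are assumptions made for illustration, not something the answer prescribes.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for dft: characteristics are rows, aggregations are columns.
dft = pd.DataFrame(
    {
        0: [32, 1234, "CHAR12_ADJ", "NaN", "IJKL", 1.2288, 1.2828],
        1: [19980, 1234, "CHAR3_CHAR6_UNADJ", "1234", "NaN", 0.335217, 0.337848],
    },
    index=["UID", "RUNID", "AGGLEVEL", "CHAR3", "CHAR12", "00:00", "01:00"],
)

# Transpose so each aggregation is a row, then turn 'NaN' strings into real NaN.
wide = dft.T.replace("NaN", np.nan)

# Assumed choice of index levels: the three required characteristics.
wide = wide.set_index(["RUNID", "AGGLEVEL", "UID"])

# Select one aggregation by its full key, then its time series values.
row = wide.loc[(1234, "CHAR12_ADJ", 32), :]
print(row[["00:00", "01:00"]])

# Select every aggregation for one AGGLEVEL.
print(wide.xs("CHAR12_ADJ", level="AGGLEVEL"))
```

Keeping the sparse CHAR columns as ordinary columns, rather than index levels, avoids index levels that are mostly NaN.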

Related

how to take the average for certain rows (python)

As you can see in the outcome picture, there is a column called 'dt', and the land/sea temperature is measured for every month of the year. I want to make the graph shorter, so I want to average the 12 months and plot the measurements yearly instead of monthly.
This is the outcome: [image]
This is what I want: [image]
You can use:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('GlobalTemperatures.csv', parse_dates=['dt'])
out = df.groupby(df['dt'].dt.year).mean(numeric_only=True)
out.plot()
plt.show()
Output:
>>> out
LandAverageTemperature LandAverageTemperatureUncertainty LandMaxTemperature ... LandMinTemperatureUncertainty LandAndOceanAverageTemperature LandAndOceanAverageTemperatureUncertainty
dt ...
1750 8.719364 2.637818 NaN ... NaN NaN NaN
1751 7.976143 2.781143 NaN ... NaN NaN NaN
1752 5.779833 2.977000 NaN ... NaN NaN NaN
1753 8.388083 3.176000 NaN ... NaN NaN NaN
1754 8.469333 3.494250 NaN ... NaN NaN NaN
... ... ... ... ... ... ... ...
2011 9.516000 0.082000 15.284833 ... 0.136583 15.769500 0.059000
2012 9.507333 0.083417 15.332833 ... 0.145333 15.802333 0.061500
2013 9.606500 0.097667 15.373833 ... 0.149833 15.854417 0.064667
2014 9.570667 0.090167 15.313583 ... 0.139000 15.913000 0.063167
2015 9.831000 0.092167 15.572667 ... 0.141750 16.058583 0.060833
[266 rows x 8 columns]
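The same grouping can be checked on a small synthetic frame, without the CSV file. This is only a sketch: the column name is borrowed from the output above and the values are made up so the yearly means are easy to verify by hand.

```python
import numpy as np
import pandas as pd

# Two years of made-up monthly values: 0..11 for 2014 and 12..23 for 2015.
dates = pd.date_range("2014-01-01", periods=24, freq="MS")
demo = pd.DataFrame(
    {"dt": dates, "LandAverageTemperature": np.arange(24, dtype=float)}
)

# Group the monthly rows by calendar year and average each group.
yearly = demo.groupby(demo["dt"].dt.year)["LandAverageTemperature"].mean()
print(yearly)
```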

Filling NaN rows in big pandas datetime indexed dataframe using other not NaN rows values

I have a big weather CSV dataframe containing several hundred thousand rows as well as many columns. The rows are a time series sampled every 10 minutes over many years. The index column represents the datetime and consists of year, month, day, hour, minute and second. Unfortunately, several thousand rows are missing and show up as rows containing only NaNs. The goal is to fill these rows using the values of rows collected at the same time in other years, provided those are not NaN.
I wrote a Python for loop, but it is very time consuming. I need help finding a more efficient and faster solution.
The raw dataframe is as follows:
print(df)
p (mbar) T (degC) Tpot (K) Tdew (degC) rh (%)
datetime
2004-01-01 00:10:00 996.52 -8.02 265.40 -8.90 93.30
2004-01-01 00:20:00 996.57 -8.41 265.01 -9.28 93.40
2004-01-01 00:40:00 996.51 -8.31 265.12 -9.07 94.20
2004-01-01 00:50:00 996.51 -8.27 265.15 -9.04 94.10
2004-01-01 01:00:00 996.53 -8.51 264.91 -9.31 93.90
... ... ... ... ... ...
2020-12-31 23:20:00 1000.07 -4.05 269.10 -8.13 73.10
2020-12-31 23:30:00 999.93 -3.35 269.81 -8.06 69.71
2020-12-31 23:40:00 999.82 -3.16 270.01 -8.21 67.91
2020-12-31 23:50:00 999.81 -4.23 268.94 -8.53 71.80
2021-01-01 00:00:00 999.82 -4.82 268.36 -8.42 75.70
[820551 rows x 5 columns]
For some reason, there are missing rows in the df dataframe. To identify them, the function below can be applied:
findnanrows(df.groupby(pd.Grouper(freq='10T')).mean())
p (mbar) T (degC) Tpot (K) Tdew (degC) rh (%)
datetime
2004-01-01 00:30:00 NaN NaN NaN NaN NaN
2009-10-08 09:50:00 NaN NaN NaN NaN NaN
2009-10-08 10:00:00 NaN NaN NaN NaN NaN
2013-05-16 09:00:00 NaN NaN NaN NaN NaN
2014-07-30 08:10:00 NaN NaN NaN NaN NaN
... ... ... ... ... ...
2016-10-28 12:00:00 NaN NaN NaN NaN NaN
2016-10-28 12:10:00 NaN NaN NaN NaN NaN
2016-10-28 12:20:00 NaN NaN NaN NaN NaN
2016-10-28 12:30:00 NaN NaN NaN NaN NaN
2016-10-28 12:40:00 NaN NaN NaN NaN NaN
[5440 rows x 5 columns]
The aim is to fill all these NaN rows. As an example, the first NaN row, which corresponds to the datetime 2004-01-01 00:30:00, should be filled with the non-NaN values of another row collected at the same datetime xxxx-01-01 00:30:00 in another year, such as 2005-01-01 00:30:00 or 2006-01-01 00:30:00 and so on, or even 2003-01-01 00:30:00 or 2002-01-01 00:30:00 if they exist. It would also be acceptable to use an average over all these other years.
Seeing the values of the row with the datetime index 2005-01-01 00:30:00:
print(df.loc["2005-01-01 00:30:00", :])
p (mbar) T (degC) Tpot (K) Tdew (degC) rh (%)
datetime
2005-01-01 00:30:00 996.36 12.67 286.13 7.11 68.82
After filling the row corresponding to the index datetime 2004-01-01 00:30:00 using the values of the row having the index datetime 2005-01-01 00:30:00, the df dataframe will have the following row:
print(df.loc["2004-01-01 00:30:00", :])
p (mbar) T (degC) Tpot (K) Tdew (degC) rh (%)
datetime
2004-01-01 00:30:00 996.36 12.67 286.13 7.11 68.82
The two functions that I created are the following. The first is to identify the NaN rows. The second is to fill them.
import numpy as np
import pandas as pd

def findnanrows(df):
    # Return the rows that contain at least one NaN.
    is_NaN = df.isnull()
    row_has_NaN = is_NaN.any(axis=1)
    rows_with_NaN = df[row_has_NaN]
    return rows_with_NaN

def filldata(weatherdata):
    fillweatherdata = weatherdata.copy()
    allyears = fillweatherdata.index.year.unique().tolist()
    # Resampling to a regular 10-minute grid turns missing timestamps into NaN rows.
    dfnan = findnanrows(fillweatherdata.groupby(pd.Grouper(freq='10T')).mean())
    for i in range(dfnan.shape[0]):
        dnan = dfnan.index[i]
        if dnan.year == min(allyears):
            # First year: search forward for a year where this timestamp is present.
            y = 0
            dnew = dnan.replace(year=dnan.year+y)
            while dnew in dfnan.index:
                dnew = dnew.replace(year=dnew.year+y)
                y += 1
        else:
            # Other years: search backward.
            y = 0
            dnew = dnan.replace(year=dnan.year-y)
            while dnew in dfnan.index:
                dnew = dnew.replace(year=dnew.year-y)
                y += 1
        new_row = pd.DataFrame(np.array([fillweatherdata.loc[dnew, :]]).tolist(), columns=fillweatherdata.columns.tolist(), index=[dnan])
        fillweatherdata = pd.concat([fillweatherdata, pd.DataFrame(new_row)], ignore_index=False)
    #fillweatherdata = fillweatherdata.drop_duplicates()
    fillweatherdata = fillweatherdata.sort_index()
    return fillweatherdata
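One possible vectorized alternative to the loop, offered as a sketch rather than a tested answer to the question: reindex to a regular grid, compute the mean of each (month, day, hour, minute) slot across all years, and use that mean only where a value is missing. The function name fill_from_other_years is hypothetical, and the sketch assumes the index is a DatetimeIndex without duplicate timestamps. The demo uses made-up daily data so it stays small; for the weather data the frequency would be 10 minutes.

```python
import numpy as np
import pandas as pd

def fill_from_other_years(df, freq="10min"):
    """Fill missing rows with the mean of the same calendar slot in other years."""
    # Put the data on a regular grid; absent timestamps become all-NaN rows.
    full_index = pd.date_range(df.index.min(), df.index.max(), freq=freq)
    regular = df.reindex(full_index)
    # Same slot in any year: identical month, day, hour and minute.
    idx = regular.index
    keys = [idx.month, idx.day, idx.hour, idx.minute]
    # transform keeps the original shape; NaN values are skipped by the mean.
    slot_mean = regular.groupby(keys).transform("mean")
    # Only the missing values are replaced; existing values stay untouched.
    return regular.fillna(slot_mean)

# Made-up demo: three years of daily values with one day removed.
days = pd.date_range("2004-01-01", "2006-12-31", freq="D")
demo = pd.DataFrame({"T (degC)": np.arange(len(days), dtype=float)}, index=days)
demo = demo.drop(pd.Timestamp("2005-03-10"))

filled = fill_from_other_years(demo, freq="D")
print(filled.loc[pd.Timestamp("2005-03-10")])
```

A slot that is missing in every year (or a 29 February that exists in only one year) stays NaN with this approach.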

Pandas trying to make values within a column into new columns after groupby on column

My original dataframe looked like:
timestamp variables value
1 2017-05-26 19:46:41.289 inf 0.000000
2 2017-05-26 20:40:41.243 tubavg 225.489639
... ... ... ...
899541 2017-05-02 20:54:41.574 caspre 684.486450
899542 2017-04-29 11:17:25.126 tvol 50.895000
Now I want to bucket this dataset by time, which can be done with the code:
df['timestamp'] = pd.to_datetime(df['timestamp'], errors='coerce')
df = df.groupby(pd.Grouper(key='timestamp', freq='5min'))
But I also want all the different metrics to become columns in the new dataframe. For example the first two rows from the original dataframe would look like:
timestamp inf tubavg caspre tvol ...
1 2017-05-26 19:46:41.289 0.000000 225.489639 xxxxxxx xxxxx
... ... ... ...
xxxxx 2017-05-02 20:54:41.574 xxxxxx xxxxxx 684.486450 50.895000
As can be seen, the time has been bucketed into 5-minute intervals, and every distinct value of variables should become a column holding that metric's value for each bucket. Each bucket is labelled with the first timestamp that fell into it.
I have tried a couple of different solutions, but can't seem to find anything that works without constant errors.
Try unstacking the variables column from rows to columns with .unstack(1). The parameter is 1, because we want the second index column (0 would be the first)
Then, drop the level of the multi-index you just created to make it a little bit cleaner with .droplevel().
Finally, use pd.Grouper. Since the date/time is on the index, you don't need to specify a key.
df['timestamp'] = pd.to_datetime(df['timestamp'], errors='coerce')
df = df.set_index(['timestamp','variables']).unstack(1)
df.columns = df.columns.droplevel()
df = df.groupby(pd.Grouper(freq='5min')).mean().reset_index()
df
Out[1]:
variables timestamp caspre inf tubavg tvol
0 2017-04-29 11:15:00 NaN NaN NaN 50.895
1 2017-04-29 11:20:00 NaN NaN NaN NaN
2 2017-04-29 11:25:00 NaN NaN NaN NaN
3 2017-04-29 11:30:00 NaN NaN NaN NaN
4 2017-04-29 11:35:00 NaN NaN NaN NaN
... ... ... ... ...
7885 2017-05-26 20:20:00 NaN NaN NaN NaN
7886 2017-05-26 20:25:00 NaN NaN NaN NaN
7887 2017-05-26 20:30:00 NaN NaN NaN NaN
7888 2017-05-26 20:35:00 NaN NaN NaN NaN
7889 2017-05-26 20:40:00 NaN NaN 225.489639 NaN
Another way would be to .groupby the variables as well and then .unstack(1) again:
df['timestamp'] = pd.to_datetime(df['timestamp'], errors='coerce')
df = df.groupby([pd.Grouper(freq='5min', key='timestamp'), 'variables']).mean().unstack(1)
df.columns = df.columns.droplevel()
df = df.reset_index()
df
Out[1]:
variables timestamp caspre inf tubavg tvol
0 2017-04-29 11:15:00 NaN NaN NaN 50.895
1 2017-05-02 20:50:00 684.48645 NaN NaN NaN
2 2017-05-26 19:45:00 NaN 0.0 NaN NaN
3 2017-05-26 20:40:00 NaN NaN 225.489639 NaN
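A third option, sketched here as an assumption rather than part of the answer above, is pivot_table, which can take a pd.Grouper as its index and so buckets and reshapes in one call. The four sample rows are copied from the question.

```python
import pandas as pd

# The four rows shown in the question.
df = pd.DataFrame(
    {
        "timestamp": [
            "2017-05-26 19:46:41.289",
            "2017-05-26 20:40:41.243",
            "2017-05-02 20:54:41.574",
            "2017-04-29 11:17:25.126",
        ],
        "variables": ["inf", "tubavg", "caspre", "tvol"],
        "value": [0.0, 225.489639, 684.486450, 50.895],
    }
)
df["timestamp"] = pd.to_datetime(df["timestamp"], errors="coerce")

# Bucket by 5 minutes and spread the metrics into columns, averaging duplicates.
result = df.pivot_table(
    index=pd.Grouper(key="timestamp", freq="5min"),
    columns="variables",
    values="value",
    aggfunc="mean",
)
print(result)
```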

Groupby time bins in multilevel index

I have a sparsely filled data frame that looks like this:
entity_id 59e75f2b9e182f68cf25721d 59e75f2bc0bd722a5f395ee9 59e75f2c05e40310ebe1f433 ...
organisation_id group_id datetime ...
59e7515edb84e482acce8339 59e75177575fc94638c1f8e7 2018-04-01 02:01:00 NaN NaN NaN ...
2018-04-01 02:02:00 NaN 2.15 NaN ...
2018-04-01 02:03:00 NaN NaN 3.689 ...
2018-04-01 02:04:00 NaN NaN NaN ...
2018-04-01 02:05:00 NaN NaN NaN ...
... ... ... ... ...
5cb590649f18c69541d34f7a 2019-04-01 01:55:00 NaN NaN NaN ...
2019-04-01 01:56:00 NaN NaN NaN ...
2019-04-01 01:57:00 NaN NaN NaN ...
2019-04-01 01:58:00 NaN NaN NaN ...
2019-04-01 01:59:00 NaN NaN NaN ...
I would like to group this frame by group_id and by 10-minute bins applied to the datetime index. For each group, I want the values that occurred inside the same 10-minute window to be grouped so I can take the mean over the columns, essentially disregarding the minute portion of the datetime index.
I have tried using pd.Grouper(freq='10T'), but it doesn't seem to work in conjunction with multilevel indices.
group_mean = frame.groupby(
pd.Grouper(freq='10T'), level='datetime').mean(axis=1)
This gives me the error message
TypeError: Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of 'MultiIndex'
For reference, my wanted output should look something like this:
group_mean
organisation_id group_id datetime
59e7515edb84e482acce8339 59e75177575fc94638c1f8e7 2018-04-01 02:10:00 mean(axis=1)
2018-04-01 02:20:00 mean(axis=1)
...
5cb590649f18c69541d34f7a 2019-04-01 01:50:00 mean(axis=1)
2019-04-01 02:00:00 mean(axis=1)
...
where mean(axis=1) is the mean of all columns that are not NaN for that specific group and time bin.
The solution needs a DatetimeIndex, so first convert the other index levels to columns and pass them to groupby in a list together with the Grouper.
Note: the mean is computed per group for each column separately, not across columns.
group_mean = (frame.reset_index(['organisation_id','group_id'])
.groupby(['organisation_id',
'group_id',
pd.Grouper(freq='10T',level='datetime')])
.mean())
If you need the mean across the columns of each row instead:
df = frame.mean(axis=1)
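The question asks for a single mean over all non-NaN values in each group and time bin. One way to get that, sketched below on made-up data, is to divide the summed values by the number of non-NaN values per bin. The entity column names e1 and e2 and the variable names total and count are assumptions for the demo. Note that pandas labels each bin by its left edge by default, so 02:01 falls in the bin labelled 02:00.

```python
import numpy as np
import pandas as pd

# Small made-up frame with the same index layout as in the question.
index = pd.MultiIndex.from_tuples(
    [
        ("o1", "g1", pd.Timestamp("2018-04-01 02:01:00")),
        ("o1", "g1", pd.Timestamp("2018-04-01 02:02:00")),
        ("o1", "g1", pd.Timestamp("2018-04-01 02:11:00")),
        ("o1", "g2", pd.Timestamp("2018-04-01 02:03:00")),
    ],
    names=["organisation_id", "group_id", "datetime"],
)
frame = pd.DataFrame(
    {"e1": [np.nan, 4.0, 6.0, 1.0], "e2": [2.0, np.nan, 8.0, np.nan]},
    index=index,
)

# Same grouping as in the answer: ids as columns, datetime left on the index.
grouped = frame.reset_index(["organisation_id", "group_id"]).groupby(
    ["organisation_id", "group_id", pd.Grouper(freq="10min", level="datetime")]
)

# Sum and count of non-NaN values per bin, pooled over all entity columns.
total = grouped.sum().sum(axis=1)
count = grouped.count().sum(axis=1)
group_mean = total / count
print(group_mean)
```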

Select a (non-indexed) column based on text content of a cell in a python/pandas dataframe

TL;DR - how do I create a dataframe/series from one or more columns in an existing non-indexed dataframe, based on the column(s) containing a specific piece of text?
I am relatively new to Python and data analysis, and this is my first time posting a question on Stack Overflow. I've been hunting for an answer for a long time (and used to code regularly) without any success.
I have a dataframe imported from an Excel file that doesn't have named/indexed columns. I am trying to extract data from nearly 2000 of these files, which all have slightly different columns of data (of course - why make it simple... or follow a template... or simply use something other than poorly formatted Excel spreadsheets...).
The original dataframe (from a poorly structured XLS file) looks a bit like this:
0 NaN RIGHT NaN
1 Date UCVA Sph
2 2007-01-13 00:00:00 6/38 [-2.00]
3 2009-11-05 00:00:00 6/9 NaN
4 2009-11-18 00:00:00 6/12 NaN
5 2009-12-14 00:00:00 6/9 [-1.25]
6 2018-04-24 00:00:00 worn CL [-5.50]
3 4 5 6 7 8 9 \
0 NaN NaN NaN NaN NaN NaN NaN
1 Cyl Axis BSCVA Pentacam remarks K1 K2 K2 back
2 [-2.75] 65 6/9 NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN NaN NaN
4 NaN NaN 6/5 Pentacam 46 43.9 -6.6
5 [-5.75] 60 6/6-1 NaN NaN NaN NaN
6 [+7.00} 170 6/7.5 NaN NaN NaN NaN
... 17 18 19 20 21 22 \
0 ... NaN NaN NaN NaN NaN NaN
1 ... BSCVA Pentacam remarks K1 K2 K2 back K max
2 ... 6/5 NaN NaN NaN NaN NaN
3 ... NaN NaN NaN NaN NaN NaN
4 ... NaN Pentacam 44.3 43.7 -6.2 45.5
5 ... 6/4-4 NaN NaN NaN NaN NaN
6 ... 6/5 NaN NaN NaN NaN NaN
I want to extract a set of dataframes/series that I can then combine back together to get a 'tidy' dataframe e.g.:
1 Date R-UCVA R-Sph
2 2007-01-13 00:00:00 6/38 [-2.00]
3 2009-11-05 00:00:00 6/9 NaN
4 2009-11-18 00:00:00 6/12 NaN
5 2009-12-14 00:00:00 6/9 [-1.25]
6 2018-04-24 00:00:00 worn CL [-5.50]
1 R-Cyl R-Axis R-BSCVA R-Penta R-K1 R-K2 R-K2 back
2 [-2.75] 65 6/9 NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN NaN NaN
4 NaN NaN 6/5 Pentacam 46 43.9 -6.6
5 [-5.75] 60 6/6-1 NaN NaN NaN NaN
6 [+7.00} 170 6/7.5 NaN NaN NaN NaN
etc. etc. so I'm trying to write some code that will pull a series of columns that I define by looking for the words "Date" or "UCVA" etc. etc. Then I plan to stitch them back together into a single dataframe with patient identifier as an extra column. And then cycle through all the XLS files, appending the whole lot to a single CSV file that I can then do useful stuff on (like put into an Access Database - yes, I know, but it has to be easy to use and already installed on an NHS computer - and statistical analysis).
Any suggestions? I hope that's enough information.
Thanks very much in advance.
Kind regards
Vicky
Here is something that will hopefully get you started.
I have prepared a text.xlsx file, which I can read as follows:
import pandas as pd
path = 'text.xlsx'
df = pd.read_excel(path, header=[0,1])
# Deal with two levels of headers, here I just join them together crudely
df.columns = df.columns.map(lambda h: ' '.join(h))
# Slight hack because I messed with the column names
# I create two dataframes, one with the first column, one with the second column
df1 = df[[df.columns[0],df.columns[1]]]
df2 = df[[df.columns[0], df.columns[2]]]
# Stacking them on top of each other
result = pd.concat([df1, df2])
print(result)
#Merging them on the Date column
result = pd.merge(left=df1, right=df2, on=df1.columns[0])
print(result)
This gives the output
RIGHT Sph RIGHT UCVA Unnamed: 0_level_0 Date
0 NaN 6/38 2007-01-13 00:00:00
1 NaN 6/37 2009-11-05 00:00:00
2 NaN 9/56 2009-11-18 00:00:00
0 [-2.00] NaN 2007-01-13 00:00:00
1 NaN NaN 2009-11-05 00:00:00
2 NaN NaN 2009-11-18 00:00:00
and
Unnamed: 0_level_0 Date RIGHT UCVA RIGHT Sph
0 2007-01-13 00:00:00 6/38 [-2.00]
1 2009-11-05 00:00:00 6/37 NaN
2 2009-11-18 00:00:00 9/56 NaN
Some pointers:
How to merge two header rows? See this question and answer.
How to select pandas columns conditionally? See e.g. this or this
How to merge dataframes? There is a very good guide in the pandas doc
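To select columns by the text in a header cell, as the question asks, one option is to compare the row that holds the header labels against the wanted words and use the resulting boolean mask to pick columns. The sketch below uses a small made-up frame shaped like the one in the question; the names raw, wanted and selected are hypothetical, and it assumes the labels sit in row 1.

```python
import pandas as pd

# Made-up frame with no real column names; row 1 holds the header labels.
raw = pd.DataFrame(
    [
        [None, "RIGHT", None, None],
        ["Date", "UCVA", "Sph", "Cyl"],
        ["2007-01-13", "6/38", "[-2.00]", "[-2.75]"],
        ["2009-11-05", "6/9", None, None],
    ]
)

header_row = raw.iloc[1]
wanted = ["Date", "UCVA"]

# Boolean mask over the columns: True where the header cell is a wanted label.
mask = header_row.isin(wanted)

# Keep the data rows below the header, only for the matching columns.
selected = raw.loc[2:, mask].copy()
selected.columns = header_row[mask].tolist()
selected = selected.reset_index(drop=True)
print(selected)
```

For partial matches, header_row.astype(str).str.contains("UCVA") could replace isin.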
