Pandas DataFrame combine rows by column value, where rows can have NaNs - python

I have a Pandas DataFrame like the following:
timestamp A B C D E F
0 1607594400000 83.69 NaN NaN NaN 1003.20 8.66
1 1607594400000 NaN 2.57 44.35 17.18 NaN NaN
2 1607595000000 83.07 NaN NaN NaN 1003.32 8.68
3 1607595000000 NaN 3.00 42.31 20.08 NaN NaN
.. ... ... ... ... ... ... ...
325 1607691600000 90.19 NaN NaN NaN 997.32 10.22
326 1607691600000 NaN 1.80 30.10 14.85 NaN NaN
328 1607692200000 NaN 1.60 26.06 12.78 NaN NaN
327 1607692200000 91.33 NaN NaN NaN 997.52 10.21
I need to combine the rows that have the same timestamp value, so that where one row has NaN and the other has a value, the value is kept, and where both rows have values, their average is taken.
I tried the solution from the following question, but it is not exactly my situation and I don't know how to adapt it:
pandas, combine rows based on certain column values and NAN

Just use groupby:
df.groupby('timestamp', as_index=False).mean()

Try with first; it will pick the first non-null value in each group:
out = df.groupby('timestamp', as_index=False).first()
Or:
out = df.set_index('timestamp').mean(level=0)
(On recent pandas, where mean(level=...) has been removed, the equivalent is df.set_index('timestamp').groupby(level=0).mean().)
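For reference, a minimal sketch on a cut-down version of the frame above (columns A and B only), showing that mean skips NaN within each group, so NaN/value pairs keep the value and value/value pairs are averaged:
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'timestamp': [1607594400000, 1607594400000, 1607595000000, 1607595000000],
    'A': [83.69, np.nan, 83.07, np.nan],
    'B': [np.nan, 2.57, np.nan, 3.00],
})
out = df.groupby('timestamp', as_index=False).mean()
print(out)
#        timestamp      A     B
# 0  1607594400000  83.69  2.57
# 1  1607595000000  83.07  3.00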

Related

Row by row mapping keys of dictionary of dataframes to new dictionary of dataframes

I have two dictionaries of dataframes, LP3 and ExeedenceDict. ExeedenceDict is a dictionary of 4 dataframes with keys 'two', 'ten', 'twentyfive', 'onehundred'. The LP3 dictionary has keys 'LP_DevilMalad', 'LP_Bloomington', 'LP_DevilEvans', 'LP_Deep', 'LP_Maple', 'LP_CubMaple', 'LP_Cottonwood', 'LP_Mill', 'LP_CubNrPreston'.
Edit: I am not sure of the most concise way to title this question, but I think the title suits what I am asking.
There is a column in each dataframe within the ExeedenceDict that has row values equal to the keys in the LP3 dictionary.
Below is a 'blank' dataframe for 'two' in the ExeedenceDict that I created, using the code:
ExeedenceDF = []
cols = ['Location','Size','Annual Exceedence', 'With Reg Skew','Without Reg Skew','5% Lower','95% Upper']
for i in range(5):
    i = pd.DataFrame(columns=cols)
    i['Location'] = LP_names
    i['Size'] = [39.8,24,34,29.7,21.2,53.7,61.7,27.6,31.6]
    ExeedenceDF.append(i)
ExeedenceDict = {'two':ExeedenceDF[0], 'ten':ExeedenceDF[1], 'twentyfive':ExeedenceDF[2], 'onehundred':ExeedenceDF[3]}
Location Size Annual Exceedence With Reg Skew Without Reg Skew 5% Lower 95% Upper
0 LP_DevilMalad 39.8 NaN NaN NaN NaN NaN
1 LP_Bloomington 24.0 NaN NaN NaN NaN NaN
2 LP_DevilEvans 34.0 NaN NaN NaN NaN NaN
3 LP_Deep 29.7 NaN NaN NaN NaN NaN
4 LP_Maple 21.2 NaN NaN NaN NaN NaN
5 LP_CubMaple 53.7 NaN NaN NaN NaN NaN
6 LP_Cottonwood 61.7 NaN NaN NaN NaN NaN
7 LP_Mill 27.6 NaN NaN NaN NaN NaN
8 LP_CubNrPreston 31.6 NaN NaN NaN NaN NaN
Below is the dataframe for the key LP_DevilMalad in the LP3 dictionary. This dictionary was built by reading in data from 10 Excel spreadsheets, using the code:
LP_names = ['LP_DevilMalad', 'LP_Bloomington', 'LP_DevilEvans', 'LP_Deep', 'LP_Maple', 'LP_CubMaple', 'LP_Cottonwood', 'LP_Mill', 'LP_CubNrPreston']
for i, df in enumerate(LP_Data):
    LP_Data[i] = LP_Data[i].dropna()
    LP_Data[i]['Annual Exceedence'] = 1 / LP_Data[i]['Annual Exceedence']
    LP_Data[i] = LP_Data[i].loc[LP_Data[i]['Annual Exceedence'].isin([2, 10, 25, 100])]
LP3 = {k:v for (k,v) in zip(LP_names, LP_Data)}
'LP_DevilMalad': Annual Exceedence With Reg Skew Without Reg Skew Log Variance of Est \
6 2.0 21.4 22.4 0.0091
9 10.0 46.5 44.7 0.0119
10 25.0 60.2 54.6 0.0166
12 100.0 81.4 67.4 0.0270
5% Lower 95% Upper
6 14.1 31.2
9 32.1 85.7
10 40.6 136.2
12 51.3 250.6
I am having issues matching the keys of LP3 to the values in the Location column of the dataframes in ExeedenceDict, with the goal of a script that does all of this iteratively, perhaps with some sort of dictionary comprehension.
The caveat is that the 'two' dataframe takes the row with index label 6 from the LP3 dataframes, 'ten' takes index 9, 'twentyfive' takes index 10, and 'onehundred' takes index 12.
The goal dataframe for key 'two' in ExeedenceDict, based on the two dataframes above, would look something like this (the rest of the dataframe would be filled with the index-6 rows from the remaining dataframes in the LP3 dictionary):
Location Size Annual Exceedence With Reg Skew Without Reg Skew 5% Lower 95% Upper
0 LP_DevilMalad 39.8 2 21.4 22.4 14.1 31.2
1 LP_Bloomington 24.0 NaN NaN NaN NaN NaN
2 LP_DevilEvans 34.0 NaN NaN NaN NaN NaN
3 LP_Deep 29.7 NaN NaN NaN NaN NaN
4 LP_Maple 21.2 NaN NaN NaN NaN NaN
5 LP_CubMaple 53.7 NaN NaN NaN NaN NaN
6 LP_Cottonwood 61.7 NaN NaN NaN NaN NaN
7 LP_Mill 27.6 NaN NaN NaN NaN NaN
8 LP_CubNrPreston 31.6 NaN NaN NaN NaN NaN
Can't test it without a reproducible example, but I would do something along the lines of:
index_map = {
    "two": 6,
    "ten": 9,
    "twentyfive": 10,
    "onehundred": 12
}
col_of_interest = ["Annual Exceedence", "With Reg Skew", "Without Reg Skew", "5% Lower", "95% Upper"]
for index_key, df in ExeedenceDict.items():
    lp_index = index_map[index_key]
    for lp_val in df['Location'].values:
        df.loc[df['Location'] == lp_val, col_of_interest] = LP3[lp_val].loc[lp_index, col_of_interest].values
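If you want to lean a bit less on explicit row-by-row loops, a hypothetical variant (it assumes every Location value is a key in LP3 and that the lp_index label exists in each of those dataframes) assigns all rows of one exceedence dataframe in a single step:
for index_key, df in ExeedenceDict.items():
    lp_index = index_map[index_key]
    # one row of values per Location, assigned to all target columns at once
    df[col_of_interest] = [
        LP3[loc].loc[lp_index, col_of_interest].values for loc in df['Location']
    ]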

How to filter column based on another column date range

I currently have a dataframe whose first column is dates (1990 - 2020) and whose subsequent columns are 'stocks', which are NaN where they are not yet being traded. Is there any way to filter the columns based on a date range? For example, if 2 years is selected, all stocks that have no nulls from 2019 to 2020 (2 years) would be kept.
import pandas as pd
df = pd.read_csv("prices.csv")
df.head()
display(df)
date s_0000 s_0001 s_0002 s_0003 s_0004 s_0005 s_0006 s_0007 s_0008 ... s_2579 s_2580 s_2581 s_2582 s_2583 s_2584 s_2585 s_2586 s_2587 s_2588
0 1990-01-02 NaN 13.389421 NaN NaN NaN NaN NaN 0.266812 NaN ... NaN 1.950358 NaN 7.253997 NaN NaN NaN NaN NaN NaN
1 1990-01-03 NaN 13.588601 NaN NaN NaN NaN NaN 0.268603 NaN ... NaN 1.985185 NaN 7.253997 NaN NaN NaN NaN NaN NaN
2 1990-01-04 NaN 13.610730 NaN NaN NaN NaN NaN 0.269499 NaN ... NaN 1.985185 NaN 7.188052 NaN NaN NaN NaN NaN NaN
3 1990-01-05 NaN 13.477942 NaN NaN NaN NaN NaN 0.270394 NaN ... NaN 1.985185 NaN 7.188052 NaN NaN NaN NaN NaN NaN
4 1990-01-08 NaN 13.477942 NaN NaN NaN NaN NaN 0.272185 NaN ... NaN 1.985185 NaN 7.385889 NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
7806 2020-12-23 116.631310 22.171579 15.890000 16.577030 9.00 65.023491 157.495850 130.347580 27.481012 ... 19.870001 42.675430 2.90 8.850000 9.93 NaN 0.226 207.470001 158.974014 36.650002
7807 2020-12-24 116.641243 21.912146 15.660000 16.606722 8.77 65.292725 158.870193 131.352829 27.813406 ... 20.180000 42.508686 2.88 8.810000 9.91 NaN 0.229 205.270004 159.839264 36.009998
7808 2020-12-28 117.158287 22.191536 16.059999 16.200956 8.93 66.429459 157.011383 136.050766 28.272888 ... 19.959999 42.528305 2.69 8.760000 9.73 NaN 0.251 199.369995 161.500122 36.709999
7809 2020-12-29 116.561714 21.991972 15.860000 16.745275 8.80 66.529175 154.925140 134.239273 27.705866 ... 19.530001 41.949623 2.59 8.430000 9.61 NaN 0.243 197.839996 162.226105 36.610001
7810 2020-12-30 116.720795 22.899990 16.150000 17.932884 8.60 66.299828 155.884232 133.094650 27.725418 ... 19.870001 42.390987 2.65 8.540000 9.72 NaN 0.230 201.309998 163.369812 36.619999
so I want to do something like:
year = int(input('Enter number of years: '))
# e.g. year = 3
If year is 3, the date range selected would be the 3 years up to 2020 (2018-2020).
You could try the following code:
df[(df['date'] >= '2019-01-01') & (df['date'] <= '2020-12-30')]
Once you have filtered, you can remove all rows that include NaN:
df.dropna()
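To tie this to the year input, here is a small sketch (it assumes prices.csv and its date column as shown above): compute the window start from the latest date, filter the rows to that window, then drop every stock column that still contains NaN inside it:
import pandas as pd

df = pd.read_csv("prices.csv", parse_dates=["date"])

years = int(input("Enter number of years: "))   # e.g. 3 -> 2018-2020
end = df["date"].max()
start = end - pd.DateOffset(years=years)

window = df[(df["date"] >= start) & (df["date"] <= end)]
# keep 'date' plus only the stocks that trade over the whole window
filtered = window.dropna(axis=1, how="any")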

Merging different dataframes together but index might not always be the same

I have 11 different areas (P01, P02, ..., P11) and each area has some equipment identified by a code (INV 1-1, INV 1-2, ..., INV 8-4). The problem is that the set of equipment changes from area to area; for example, P01 doesn't have the code INV 6-4, but P02 does. Every code that does exist, however, appears in the index array defined below.
I have a dataframe called allEquipAllAreas which holds float values for every INV for each area. Here is an example:
P01-INV-1-1 P01-INV-1-2 P01-INV-1-3 P01-INV-1-4 P11-INV-7-2 P11-INV-7-3 P11-INV-7-4
-0.52 1.89 1.61 1.59 2.02 1.29 -0.89
I created a for loop to go through all areas and fetch all equipment related to each area. I would like to end up with a final dataframe (heatMapInvdf) like the one below, but with the values from allEquipAllAreas in the respective columns instead of NaN:
P01 P02 P03 P04 P05 P06 P07 P08 P09 P10 P11
INV 1-1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
INV 1-2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
INV 1-3 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
... ...
INV 8-2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
INV 8-3 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
INV 8-4 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
I have tried to merge them but couldn't achieve what I want; this is what I have so far:
index = ['INV 1-1','INV 1-2','INV 1-3','INV 1-4','INV 2-1','INV 2-2','INV 2-3','INV 2-4',
         'INV 3-1','INV 3-2','INV 3-3','INV 3-4','INV 4-1','INV 4-2','INV 4-3','INV 4-4',
         'INV 5-1','INV 5-2','INV 5-3','INV 5-4','INV 6-1','INV 6-2','INV 6-3','INV 6-4',
         'INV 7-1','INV 7-2','INV 7-3','INV 7-4','INV 8-1','INV 8-2','INV 8-3','INV 8-4']
columns = ['P01','P02','P03','P04','P05','P06','P07','P08','P09','P10','P11']
heatMapInvdf = pd.DataFrame(index=index, columns=columns)
for area in areas:
    equipInArea = allEquipAllAreas.loc[:, allEquipAllAreas.columns.str.contains('P'+area+'-')]
    equipInArea = equipInArea.reindex(sorted(equipInArea.columns), axis=1).T
    equipInArea.index = equipInArea.index.str.replace(r'P'+area+'-', '')
    heatMapInvdf.merge(equipInArea, how='inner', right_index=True, left_index=True)
Any help is really appreciated!
You have everything you want in your source DF. Systematically re-shape it:
- transpose
- index with a MultiIndex built by splitting the original column names
- unstack() to get the structure you want
- droplevel() to clean up
import io
import numpy as np
import pandas as pd

df = pd.read_csv(io.StringIO("""P01-INV-1-1 P01-INV-1-2 P01-INV-1-3 P01-INV-1-4 P11-INV-7-2 P11-INV-7-3 P11-INV-7-4
-0.52 1.89 1.61 1.59 2.02 1.29 -0.89"""), sep=r"\s+")

heatMapInvdf = (
    # transpose for the primary shape that is wanted
    df.T
    # index by a MultiIndex built from the original column names
    .set_index(pd.MultiIndex.from_arrays(np.array([c.split("-", 1) for c in df.columns]).T))
    # unstack the P0n part of the index
    .unstack(0)
    # remove the transient level from the column index
    .droplevel(0, axis=1)
)
           P01   P11
INV-1-1  -0.52   NaN
INV-1-2   1.89   NaN
INV-1-3   1.61   NaN
INV-1-4   1.59   NaN
INV-7-2    NaN  2.02
INV-7-3    NaN  1.29
INV-7-4    NaN -0.89
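If you then want this on the full 32-row by 11-column skeleton built in the question, one possible follow-up (assuming the skeleton's index uses 'INV 1-1'-style labels while the reshaped index uses 'INV-1-1') is:
# rename the index labels to match the skeleton, then conform to the full grid
heatMapInvdf.index = heatMapInvdf.index.str.replace('INV-', 'INV ', regex=False)
heatMapInvdf = heatMapInvdf.reindex(index=index, columns=columns)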

Filter forward col 1 without iteration

I am dealing with a "waterfall" structure DataFrame in Pandas, Python.
Column 1 is full, while the rest of the data set is mostly empty representing series available for only a subset of the total period considered:
Instrument AUPRATE. AIB0411 AIB0511 AIB0611 ... AIB1120 AIB1220 AIB0121 AIB0221
Field ...
Date ...
2011-03-31 4.75 4.730 4.710 4.705 ... NaN NaN NaN NaN
2011-04-29 4.75 4.745 4.750 4.775 ... NaN NaN NaN NaN
2011-05-31 4.75 NaN 4.745 4.755 ... NaN NaN NaN NaN
2011-06-30 4.75 NaN NaN 4.745 ... NaN NaN NaN NaN
2011-07-29 4.75 NaN NaN NaN ... NaN NaN NaN NaN
... ... ... ... ... ... ... ... ...
2019-05-31 1.50 NaN NaN NaN ... NaN NaN NaN NaN
2019-06-28 1.25 NaN NaN NaN ... 0.680 NaN NaN NaN
2019-07-31 1.00 NaN NaN NaN ... 0.520 0.530 NaN NaN
2019-08-30 1.00 NaN NaN NaN ... 0.395 0.405 0.405 NaN
2019-09-30 1.00 NaN NaN NaN ... 0.435 0.445 0.445 0.45
What I would like to do is to push the values from "AUPRATE" to the start of the data in every row (such that they effectively represent the zeroth observation). Where the AUPRATE values are not adjacent to the dataset, they should be replaced with NaN.
I could probably write a junky loop to do this but I was wondering if there was an efficient way of achieving the same outcome.
I am very much a novice in pandas and Python. Thank you in advance.
[edit]
Desired output:
Instrument AUPRATE. AIB0411 AIB0511 AIB0611 ... AIB1120 AIB1220 AIB0121 AIB0221
Field ...
Date ...
2011-03-31 4.75 4.730 4.710 4.705 ... NaN NaN NaN NaN
2011-04-29 4.75 4.745 4.750 4.775 ... NaN NaN NaN NaN
2011-05-31 NaN 4.75 4.745 4.755 ... NaN NaN NaN NaN
2011-06-30 NaN NaN 4.75 4.745 ... NaN NaN NaN NaN
2011-07-29 NaN NaN NaN NaN ... NaN NaN NaN NaN
I have implemented the following, based on the suggestion below. I would still be happy if there was a way of doing this without iteration.
for i in range(AU_furures_rates.shape[0]):            # iterate over rows
    for j in range(AU_furures_rates.shape[1] - 1):     # iterate over cols
        if pd.notnull(AU_furures_rates.iloc[i, j+1]) and pd.isnull(AU_furures_rates.iloc[i, 1]):  # move rate when needed
            AU_furures_rates.iloc[i, j] = AU_furures_rates.iloc[i, 0]
            AU_furures_rates.iloc[i, 0] = np.nan
            break
Maybe someone will find a cleaner solution, but what I thought of was first iterating over the columns (backwards, so that it ends up with the first occurrence) to record, for each row, which column's value needs to be replaced:
import numpy as np

df['column_to_move'] = np.nan
cols = df.columns.tolist()
for i in range(len(cols) - 2, 1, -1):
    df.loc[pd.isna(df[cols[i]]) & pd.notna(df[cols[i + 1]]), 'column_to_move'] = cols[i]
Then iterate over the columns to fill in the value from AUPRATE. where it's needed, and replace AUPRATE. itself with np.nan:
for col in cols[2:-1]:
    df.loc[df['column_to_move'] == col, col] = df['AUPRATE.']
    df.loc[df['column_to_move'] == col, 'AUPRATE.'] = np.nan
df.drop('column_to_move', axis=1, inplace=True)
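For completeness, here is a vectorised sketch of the same idea with no Python-level loop over rows (it assumes the first column is 'AUPRATE.' and the remaining columns are the futures series):
import numpy as np

rate_col = df.columns[0]           # 'AUPRATE.'
data_cols = df.columns[1:]

obs = df[data_cols].notna().to_numpy()
first_pos = obs.argmax(axis=1)     # position of the first observation in each row (0 if the row is empty)
has_data = obs.any(axis=1)

move = has_data & (first_pos > 0)  # rows where the rate is not already adjacent to the data
arr = df[data_cols].to_numpy()
arr[np.where(move)[0], first_pos[move] - 1] = df.loc[move, rate_col].to_numpy()
df[data_cols] = arr

# clear the rate where it was moved, and where the row has no observations at all
df.loc[move | ~has_data, rate_col] = np.nan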

Create Pandas Dataframe from List of Dictionaries with missing values for some keys

Hi everyone.
Below is the code I'm using to parse a text file:
import pandas as pd

tags = ['129','30','32','851','9730','9882']
rows = []
file = open('D:\\python\\redi_fix\\redi_august.txt','r')
content = file.readlines()
for line in content:
    for message in line.split('\t'):
        try:
            row_dict = {}
            tag, val = message.split('=')
            if tag in tags:
                row_dict[tag] = val
                rows.append(row_dict)
        except:
            pass
Creating a pandas dataframe from rows yields the following result:
129 30 32 851 9730 9882
r170557 NaN NaN NaN NaN NaN
NaN ARCA NaN NaN NaN NaN
NaN NaN 100 NaN NaN NaN
r170557 NaN NaN NaN NaN NaN
NaN ARCA NaN NaN NaN NaN
NaN NaN 300 NaN NaN NaN
Looks like every value for a key is on a different row.
The result I'm struggling to achieve is for all the values to be on the same row - see below for an example:
129 30 32 851 9730 9882
r170557 ARCA 100 NaN NaN NaN
r170557 ARCA 300 NaN NaN NaN
Using your result dataframe, we need sorted and dropna:
result.apply(lambda x : sorted(x,key=pd.isnull)).dropna(thresh=1)
Out[1171]:
129 30 32 851 9730 9882
0 r170557 ARCA 100.0 NaN NaN NaN
1 r170557 ARCA 300.0 NaN NaN NaN
If you want to "collapse" your NaNs, you can perform a groupby + agg on first/last:
df.groupby(df['129'].notnull().cumsum(), as_index=False).agg('first')
129 30 32 851 9730 9882
0 r170557 ARCA 100.0 NaN NaN NaN
1 r170557 ARCA 300.0 NaN NaN NaN
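Alternatively, the NaN pattern can be avoided at the parsing stage. A sketch (assuming each line of the text file holds one complete message) that builds one dict per line instead of one per tag, so every row comes out already filled:
import pandas as pd

tags = ['129', '30', '32', '851', '9730', '9882']
rows = []
with open('D:\\python\\redi_fix\\redi_august.txt', 'r') as file:
    for line in file:
        row_dict = {}
        for message in line.split('\t'):
            try:
                tag, val = message.split('=')
            except ValueError:
                continue
            if tag in tags:
                row_dict[tag] = val
        if row_dict:
            rows.append(row_dict)

df = pd.DataFrame(rows, columns=tags)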
