Merging different dataframes together but index might not always be the same - python

I have 11 different areas (P01, P02, ..., P11) and each area has some equipment identified by a code (INV 1-1, INV 1-2, ..., INV 8-4). The problem is that the number of equipment items changes from area to area, so, for example, P01 doesn't have the code INV 6-4 but P02 does. The codes that do occur, though, always come from the fixed index list further down.
I have a dataframe called allEquipAllAreas which holds float values for every INV for each area. Here is an example:
P01-INV-1-1 P01-INV-1-2 P01-INV-1-3 P01-INV-1-4 P11-INV-7-2 P11-INV-7-3 P11-INV-7-4
-0.52 1.89 1.61 1.59 2.02 1.29 -0.89
I created a for loop to go through all areas and fetch all equipment related to each area, so I would like to end up with a final dataframe (heatMapInvdf) as below, but with the values from allEquipAllAreas in the respective columns instead of "NaN":
P01 P02 P03 P04 P05 P06 P07 P08 P09 P10 P11
INV 1-1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
INV 1-2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
INV 1-3 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
... ...
INV 8-2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
INV 8-3 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
INV 8-4 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
I have tried to merge them but couldn't achieve what I want, so this is what I have so far:
index = ['INV 1-1','INV 1-2','INV 1-3','INV 1-4','INV 2-1','INV 2-2','INV 2-3','INV 2-4',
         'INV 3-1','INV 3-2','INV 3-3','INV 3-4','INV 4-1','INV 4-2','INV 4-3','INV 4-4',
         'INV 5-1','INV 5-2','INV 5-3','INV 5-4','INV 6-1','INV 6-2','INV 6-3','INV 6-4',
         'INV 7-1','INV 7-2','INV 7-3','INV 7-4','INV 8-1','INV 8-2','INV 8-3','INV 8-4']
columns = ['P01','P02','P03','P04','P05','P06','P07','P08','P09','P10','P11']
heatMapInvdf = pd.DataFrame(index=index, columns=columns)
for area in areas:
    equipInArea = allEquipAllAreas.loc[:, allEquipAllAreas.columns.str.contains('P' + area + '-')]
    equipInArea = equipInArea.reindex(sorted(equipInArea.columns), axis=1).T
    equipInArea.index = equipInArea.index.str.replace(r'P' + area + '-', '')
    # note: merge returns a new DataFrame, so this result is discarded
    heatMapInvdf.merge(equipInArea, how='inner', right_index=True, left_index=True)
Any help is really appreciated!

You have everything you want in your source DF. Systematically reshape it:
- transpose
- index with a MultiIndex built by splitting the original column names
- unstack() to get the structure you want
- droplevel() to clean up
import io
import numpy as np
import pandas as pd

df = pd.read_csv(io.StringIO("""P01-INV-1-1 P01-INV-1-2 P01-INV-1-3 P01-INV-1-4 P11-INV-7-2 P11-INV-7-3 P11-INV-7-4
-0.52 1.89 1.61 1.59 2.02 1.29 -0.89"""), sep=r"\s+")
heatMapInvdf = (
    # transpose for the primary shape that is wanted
    df.T
    # index by a MultiIndex built from the original column names
    .set_index(pd.MultiIndex.from_arrays(np.array([c.split("-", 1) for c in df.columns]).T))
    # unstack the P0n part of the index
    .unstack(0)
    # remove the transient level from the column index
    .droplevel(0, axis=1)
)
           P01   P11
INV-1-1  -0.52   NaN
INV-1-2   1.89   NaN
INV-1-3   1.61   NaN
INV-1-4   1.59   NaN
INV-7-2    NaN  2.02
INV-7-3    NaN  1.29
INV-7-4    NaN -0.89
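If you want the values dropped into the full skeleton from the question instead (index labels like 'INV 1-1', all columns P01 through P11), a short sketch building on the result above, assuming the index and columns lists defined in the question:

# 'INV-1-1' -> 'INV 1-1' so the labels match the question's skeleton
heatMapInvdf.index = heatMapInvdf.index.str.replace('INV-', 'INV ', regex=False)
skeleton = pd.DataFrame(index=index, columns=columns)  # the lists from the question
skeleton.update(heatMapInvdf)                          # fills matching cells, leaves the rest NaN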

Related

Row by row mapping keys of dictionary of dataframes to new dictionary of dataframes

I have two dictionaries of data frames, LP3 and ExeedenceDict. The ExeedenceDict is a dictionary of 4 dataframes with keys 'two', 'ten', 'twentyfive', 'onehundred'. The LP3 dictionary has keys 'LP_DevilMalad', 'LP_Bloomington', 'LP_DevilEvans', 'LP_Deep', 'LP_Maple', 'LP_CubMaple', 'LP_Cottonwood', 'LP_Mill', 'LP_CubNrPreston'.
Edit: I am not sure of the most concise way to title this question, but I think the title suits what I am asking.
There is a column in each dataframe within the ExeedenceDict whose row values equal the keys of the LP3 dictionary.
Below is a 'blank' dataframe for 'two' in the ExeedenceDict, which I created using the code:
ExeedenceDF = []
cols = ['Location','Size','Annual Exceedence', 'With Reg Skew','Without Reg Skew','5% Lower','95% Upper']
for i in range(5):
    i = pd.DataFrame(columns=cols)
    i['Location'] = LP_names
    i['Size'] = [39.8,24,34,29.7,21.2,53.7,61.7,27.6,31.6]
    ExeedenceDF.append(i)
ExeedenceDict = {'two':ExeedenceDF[0], 'ten':ExeedenceDF[1], 'twentyfive':ExeedenceDF[2], 'onehundred':ExeedenceDF[3]}
Location Size Annual Exceedence With Reg Skew Without Reg Skew 5% Lower 95% Upper
0 LP_DevilMalad 39.8 NaN NaN NaN NaN NaN
1 LP_Bloomington 24.0 NaN NaN NaN NaN NaN
2 LP_DevilEvans 34.0 NaN NaN NaN NaN NaN
3 LP_Deep 29.7 NaN NaN NaN NaN NaN
4 LP_Maple 21.2 NaN NaN NaN NaN NaN
5 LP_CubMaple 53.7 NaN NaN NaN NaN NaN
6 LP_Cottonwood 61.7 NaN NaN NaN NaN NaN
7 LP_Mill 27.6 NaN NaN NaN NaN NaN
8 LP_CubNrPreston 31.6 NaN NaN NaN NaN NaN
Below is the dataframe for the key LP_DevilMalad in the LP3 dictionary. This dictionary was built by reading in data from 10 excel spreadsheets using the code:
LP_names = ['LP_DevilMalad', 'LP_Bloomington', 'LP_DevilEvans', 'LP_Deep', 'LP_Maple', 'LP_CubMaple', 'LP_Cottonwood', 'LP_Mill', 'LP_CubNrPreston']
for i, df in enumerate(LP_Data):
    LP_Data[i] = LP_Data[i].dropna()
    LP_Data[i]['Annual Exceedence'] = 1 / LP_Data[i]['Annual Exceedence']
    LP_Data[i] = LP_Data[i].loc[LP_Data[i]['Annual Exceedence'].isin([2, 10, 25, 100])]
LP3 = {k:v for (k,v) in zip(LP_names, LP_Data)}
'LP_DevilMalad':
    Annual Exceedence  With Reg Skew  Without Reg Skew  Log Variance of Est  5% Lower  95% Upper
6                 2.0           21.4              22.4               0.0091      14.1       31.2
9                10.0           46.5              44.7               0.0119      32.1       85.7
10               25.0           60.2              54.6               0.0166      40.6      136.2
12              100.0           81.4              67.4               0.0270      51.3      250.6
I am having issues matching the keys of LP3 to the row values of the Location column in the ExeedenceDict dataframes, with the goal of a script that fills everything in iteratively, ideally with some sort of dictionary comprehension.
The caveat is that 'two' takes the row labeled 6 in the LP3 dataframes, 'ten' the row labeled 9, 'twentyfive' the row labeled 10, and 'onehundred' the row labeled 12.
The goal dataframe for key 'two', based on the two dataframes above, would look something like this (the remaining rows would be filled with the label-6 values from the corresponding dataframes in the LP3 dictionary):
Location Size Annual Exceedence With Reg Skew Without Reg Skew 5% Lower 95% Upper
0 LP_DevilMalad 39.8 2 21.4 22.4 14.1 31.2
1 LP_Bloomington 24.0 NaN NaN NaN NaN NaN
2 LP_DevilEvans 34.0 NaN NaN NaN NaN NaN
3 LP_Deep 29.7 NaN NaN NaN NaN NaN
4 LP_Maple 21.2 NaN NaN NaN NaN NaN
5 LP_CubMaple 53.7 NaN NaN NaN NaN NaN
6 LP_Cottonwood 61.7 NaN NaN NaN NaN NaN
7 LP_Mill 27.6 NaN NaN NaN NaN NaN
8 LP_CubNrPreston 31.6 NaN NaN NaN NaN NaN
Can't test it without a reproducible example, but I would do something along these lines:
index_map = {
    "two": 6,
    "ten": 9,
    "twentyfive": 10,
    "onehundred": 12
}
col_of_interest = ["Annual Exceedence", "With Reg Skew", "Without Reg Skew", "5% Lower", "95% Upper"]
for index_key, df in ExeedenceDict.items():
    lp_index = index_map[index_key]
    for lp_val in df['Location'].values:
        df.loc[df['Location'] == lp_val, col_of_interest] = LP3[lp_val].loc[lp_index, col_of_interest].values
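Since the question asks for "some sort of dictionary comprehension", the same fill can be wrapped in a small helper and the dictionary rebuilt in one comprehension. A sketch, assuming every Location value is a key of LP3:

def fill_from_lp3(df, lp_index):
    out = df.copy()
    for loc in out['Location']:
        out.loc[out['Location'] == loc, col_of_interest] = (
            LP3[loc].loc[lp_index, col_of_interest].values
        )
    return out

ExeedenceDict = {key: fill_from_lp3(df, index_map[key]) for key, df in ExeedenceDict.items()}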

Pandas DataFrame combine rows by column value, where rows can have NaNs

I have a Pandas DataFrame like the following:
timestamp A B C D E F
0 1607594400000 83.69 NaN NaN NaN 1003.20 8.66
1 1607594400000 NaN 2.57 44.35 17.18 NaN NaN
2 1607595000000 83.07 NaN NaN NaN 1003.32 8.68
3 1607595000000 NaN 3.00 42.31 20.08 NaN NaN
.. ... ... ... ... ... ... ...
325 1607691600000 90.19 NaN NaN NaN 997.32 10.22
326 1607691600000 NaN 1.80 30.10 14.85 NaN NaN
328 1607692200000 NaN 1.60 26.06 12.78 NaN NaN
327 1607692200000 91.33 NaN NaN NaN 997.52 10.21
I need to combine the rows that have the same timestamp so that, where only one row has a value, that value is kept, and where both rows have values, their average is taken.
I tried the solution from the following question, but it is not exactly my situation and I don't know how to adapt it:
pandas, combine rows based on certain column values and NAN
Just use groupby:
df.groupby('timestamp', as_index=False).mean()
Try with first; it will pick the first non-null value in each group:
out = df.groupby('timestamp', as_index=False).first()
Or:
out = df.set_index('timestamp').mean(level=0)
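A quick check on the first pair of rows from the sample (values copied from the question) shows groupby-mean collapsing each timestamp while skipping the NaNs. Note that mean(level=0) is deprecated in recent pandas; df.set_index('timestamp').groupby(level=0).mean() is the modern spelling.

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'timestamp': [1607594400000, 1607594400000],
    'A': [83.69, np.nan], 'B': [np.nan, 2.57], 'C': [np.nan, 44.35],
    'D': [np.nan, 17.18], 'E': [1003.20, np.nan], 'F': [8.66, np.nan],
})
out = df.groupby('timestamp', as_index=False).mean()  # NaNs are skipped, pairs are averaged
#        timestamp      A     B      C      D       E     F
# 0  1607594400000  83.69  2.57  44.35  17.18  1003.2  8.66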

Filter forward col 1 without iteration

I am dealing with a "waterfall" structure DataFrame in Pandas, Python.
Column 1 is full, while the rest of the data set is mostly empty, representing series that are available for only a subset of the total period considered:
Instrument AUPRATE. AIB0411 AIB0511 AIB0611 ... AIB1120 AIB1220 AIB0121 AIB0221
Field ...
Date ...
2011-03-31 4.75 4.730 4.710 4.705 ... NaN NaN NaN NaN
2011-04-29 4.75 4.745 4.750 4.775 ... NaN NaN NaN NaN
2011-05-31 4.75 NaN 4.745 4.755 ... NaN NaN NaN NaN
2011-06-30 4.75 NaN NaN 4.745 ... NaN NaN NaN NaN
2011-07-29 4.75 NaN NaN NaN ... NaN NaN NaN NaN
... ... ... ... ... ... ... ... ...
2019-05-31 1.50 NaN NaN NaN ... NaN NaN NaN NaN
2019-06-28 1.25 NaN NaN NaN ... 0.680 NaN NaN NaN
2019-07-31 1.00 NaN NaN NaN ... 0.520 0.530 NaN NaN
2019-08-30 1.00 NaN NaN NaN ... 0.395 0.405 0.405 NaN
2019-09-30 1.00 NaN NaN NaN ... 0.435 0.445 0.445 0.45
What I would like to do is to push the values from "AUPRATE" to the start of the data in every row (such that they effectively represent the zeroth observation). Where the AUPRATE values are not adjacent to the dataset, they should be replaced with NaN.
I could probably write a junky loop to do this but I was wondering if there was an efficient way of achieving the same outcome.
I am very much a novice in pandas and Python. Thank you in advance.
[edit]
Desired output:
Instrument AUPRATE. AIB0411 AIB0511 AIB0611 ... AIB1120 AIB1220 AIB0121 AIB0221
Field ...
Date ...
2011-03-31 4.75 4.730 4.710 4.705 ... NaN NaN NaN NaN
2011-04-29 4.75 4.745 4.750 4.775 ... NaN NaN NaN NaN
2011-05-31 NaN 4.75 4.745 4.755 ... NaN NaN NaN NaN
2011-06-30 NaN NaN 4.75 4.745 ... NaN NaN NaN NaN
2011-07-29 NaN NaN NaN NaN ... NaN NaN NaN NaN
I have implemented the following, based on the suggestion below. I would still be happy if there was a way of doing this without iteration.
for i in range(AU_furures_rates.shape[0]):          # iterate over rows
    for j in range(AU_furures_rates.shape[1] - 1):  # iterate over cols
        if pd.notnull(AU_furures_rates.iloc[i, j + 1]) and pd.isnull(AU_furures_rates.iloc[i, 1]):  # move rate when needed
            AU_furures_rates.iloc[i, j] = AU_furures_rates.iloc[i, 0]
            AU_furures_rates.iloc[i, 0] = np.nan    # np.nan, not the string "NaN"
            break
Maybe someone will find a cleaner solution, but what I thought of was to first iterate over the columns to check, for each row, which column's value needs to be replaced (backwards, so that it ends up at the first occurrence):
df['column_to_move'] = np.nan
cols = df.columns.tolist()
for i in range(len(cols) - 2, 1, -1):
    df.loc[pd.isna(df[cols[i]]) & pd.notna(df[cols[i + 1]]), 'column_to_move'] = cols[i]
And then iterate over the columns to fill the value from AUPRATE. where it's needed, replacing AUPRATE. itself with np.nan:
for col in cols[2:-1]:
    df.loc[df['column_to_move'] == col, col] = df['AUPRATE.']
    df.loc[df['column_to_move'] == col, 'AUPRATE.'] = np.nan
df.drop('column_to_move', axis=1, inplace=True)
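For the no-iteration route the question asks about, here is a vectorized sketch, assuming every column is numeric float: build a notna mask, locate each row's first observation with argmax, write the AUPRATE. value into the slot just before it, and blank AUPRATE. wherever it moved or the row has no data at all.

import numpy as np
import pandas as pd

def push_rate_to_edge(df):
    arr = df.to_numpy(dtype=float, copy=True)
    rest = arr[:, 1:]                                # everything after AUPRATE.
    observed = ~np.isnan(rest)
    has_data = observed.any(axis=1)
    first = observed.argmax(axis=1)                  # offset of first observation per row
    rows = np.arange(len(arr))
    move = has_data & (first > 0)                    # rate not already adjacent to the data
    # rest offset k sits at arr column k + 1, so column `first` is the slot just before it
    arr[rows[move], first[move]] = arr[move, 0]
    arr[move | ~has_data, 0] = np.nan
    return pd.DataFrame(arr, index=df.index, columns=df.columns)

Under the stated all-float assumption, this should reproduce the desired output rows shown in the edit above.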

How to do join of multiindex dataframe with another multiindex dataframe?

This is to go further from the following thread:
How to do join of multiindex dataframe with a single index dataframe?
The multi-indices of df1 are sublevel indices of df2.
In [1]: import pandas as pd
In [2]: import numpy as np
In [3]: import itertools
In [4]: inner = ('a','b')
In [5]: outer = ((10,20), (1,2))
In [6]: cols = ('one','two','three','four')
In [7]: sngl = pd.DataFrame(np.random.randn(2,4), index=inner, columns=cols)
In [8]: index_tups = list(itertools.product(*(outer + (inner,))))
In [9]: index_mult = pd.MultiIndex.from_tuples(index_tups)
In [10]: mult = pd.DataFrame(index=index_mult, columns=cols)
In [11]: sngl
Out[11]:
one two three four
a 2.946876 -0.751171 2.306766 0.323146
b 0.192558 0.928031 1.230475 -0.256739
In [12]: mult
Out[12]:
one two three four
10 1 a NaN NaN NaN NaN
b NaN NaN NaN NaN
2 a NaN NaN NaN NaN
b NaN NaN NaN NaN
20 1 a NaN NaN NaN NaN
b NaN NaN NaN NaN
2 a NaN NaN NaN NaN
b NaN NaN NaN NaN
In [13]: mult.ix[(10,1)] = sngl
In [14]: mult
Out[14]:
one two three four
10 1 a NaN NaN NaN NaN
b NaN NaN NaN NaN
2 a NaN NaN NaN NaN
b NaN NaN NaN NaN
20 1 a NaN NaN NaN NaN
b NaN NaN NaN NaN
2 a NaN NaN NaN NaN
b NaN NaN NaN NaN
# the new dataframes
sng2 = pd.concat([sngl, sngl], keys=['X','Y'])
mult2 = pd.concat([mult, mult], keys=['X','Y'])
In [110]:
sng2
Out[110]:
one two three four
X a 0.206810 -1.056264 -0.572809 -0.314475
b 0.514873 -0.941380 0.132694 -0.682903
Y a 0.206810 -1.056264 -0.572809 -0.314475
b 0.514873 -0.941380 0.132694 -0.682903
In [121]: mult2
Out[121]:
one two three four
X 10 1 a NaN NaN NaN NaN
b NaN NaN NaN NaN
2 a NaN NaN NaN NaN
b NaN NaN NaN NaN
20 1 a NaN NaN NaN NaN
b NaN NaN NaN NaN
2 a NaN NaN NaN NaN
b NaN NaN NaN NaN
Y 10 1 a NaN NaN NaN NaN
b NaN NaN NaN NaN
2 a NaN NaN NaN NaN
b NaN NaN NaN NaN
20 1 a NaN NaN NaN NaN
b NaN NaN NaN NaN
2 a NaN NaN NaN NaN
b NaN NaN NaN NaN
the code above is long, please scroll
The two index levels of sng2 match the 1st and 4th levels of mult2 (('X','a'), for example).
@DSM proposed a solution that works with a multi-index df2 and a single-index df1:
mult[:] = sngl.loc[mult.index.get_level_values(2)].values
But DataFrame.index.get_level_values(2) can only pull out a single index level.
It's not clear from the question which index levels the data frames are meant to share, but given that sng2's two levels match the first and fourth levels of mult2, you can drop the two middle levels from mult2's index and use the result to index into sng2:
mult2[:] = sng2.loc[mult2.index.droplevel([1, 2])].values
On a side note, you can construct a MultiIndex from a product directly using pd.MultiIndex.from_product rather than using itertools.
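For example, the set-up's index_tups/index_mult pair collapses to a single call, using the inner and outer tuples defined above:

index_mult = pd.MultiIndex.from_product(outer + (inner,))  # same result as the itertools version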

Calculating similarity between rows of pandas dataframe

Goal is to identify top 10 similar rows for each row in dataframe.
I start with following dictionary:
import pandas as pd
import numpy as np
from scipy.spatial.distance import cosine

d = {'0001': [('skiing',0.789),('snow',0.65),('winter',0.56)],
     '0002': [('drama', 0.89),('comedy', 0.678),('action',-0.42),('winter',-0.12),('kids',0.12)],
     '0003': [('action', 0.89),('funny', 0.58),('sports',0.12)],
     '0004': [('dark', 0.89),('Mystery', 0.678),('crime',0.12),('adult',-0.423)],
     '0005': [('cartoon', -0.89),('comedy', 0.678),('action',0.12)],
     '0006': [('drama', -0.49),('funny', 0.378),('Suspense',0.12),('Thriller',0.78)],
     '0007': [('dark', 0.79),('Mystery', 0.88),('crime',0.32),('adult',-0.423)]}
To put it in dataframe I do following:
col_headers = []
entities = []
for key, scores in d.items():
    entities.append(key)
    d[key] = dict(scores)
    col_headers.extend(d[key].keys())
col_headers = list(set(col_headers))
populate the dataframe:
df = pd.DataFrame(columns=col_headers, index=entities)
for k in d:
    df.loc[k] = pd.Series(d[k])
df.fillna(0.0, axis=1)
One issue, in addition to my main goal, that I have at this point of the code is that my dataframe still has NaNs. This is probably why my result matrix is filled with NaNs.
Mystery drama kids winter funny snow crime dark sports Suspense adult skiing action comedy cartoon Thriller
0004 0.678 NaN NaN NaN NaN NaN 0.12 0.89 NaN NaN -0.423 NaN NaN NaN NaN NaN
0005 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.12 0.678 -0.89 NaN
0006 NaN -0.49 NaN NaN 0.378 NaN NaN NaN NaN 0.12 NaN NaN NaN NaN NaN 0.78
0007 0.88 NaN NaN NaN NaN NaN 0.32 0.79 NaN NaN -0.423 NaN NaN NaN NaN NaN
0001 NaN NaN NaN 0.56 NaN 0.65 NaN NaN NaN NaN NaN 0.789 NaN NaN NaN NaN
0002 NaN 0.89 0.12 -0.12 NaN NaN NaN NaN NaN NaN NaN NaN -0.42 0.678 NaN NaN
0003 NaN NaN NaN NaN 0.58 NaN NaN NaN 0.12 NaN NaN NaN 0.89 NaN NaN NaN
To calculate cosine similarity and generate the similarity matrix between rows I do the following:
data = df.values
m, k = data.shape
mat = np.zeros((m, m))
for i in range(m):
    for j in range(m):
        if i != j:
            mat[i][j] = cosine(data[i, :], data[j, :])
        else:
            mat[i][j] = 0.
Here is what mat looks like:
[[ 0. nan nan nan nan nan nan]
[ nan 0. nan nan nan nan nan]
[ nan nan 0. nan nan nan nan]
[ nan nan nan 0. nan nan nan]
[ nan nan nan nan 0. nan nan]
[ nan nan nan nan nan 0. nan]
[ nan nan nan nan nan nan 0.]]
Assuming the NaN issue gets fixed and mat holds a meaningful similarity matrix, how can I get an output like the following:
{0001:[003,005,002],0002:[0001, 0004, 0007]....}
One issue, in addition to my main goal, that I have at this point of the code is that my dataframe still has NaNs.
That's because df.fillna does not modify the DataFrame in place; it returns a new one. Assign the result back (df = df.fillna(0.0)) and your result will be fine.
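With that fixed, the id-to-nearest-ids mapping the question asks for can be read straight off mat. A sketch, keeping in mind that scipy's cosine() is a distance, so smaller values mean more similar rows:

df = df.fillna(0.0)  # assign the result back

top = {}
for i, row_id in enumerate(df.index):
    order = np.argsort(mat[i])                            # ascending distance
    top[row_id] = [df.index[j] for j in order if j != i][:10]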
