Two dataframes with the same index (dates) have been concatenated under different keys into one multiindex dataframe. Each dataframe has different products as column names, with their prices as values. I need the correlation between the two dataframes and the count of overlapping periods. The correlation is done, but how do I count the overlapping rows for each pair of products (one from each dataframe)? The result should be a matrix: a dataframe with the products from dataframe 1 as column names, the products from dataframe 2 as row names, and the number of overlapping rows for the same period as values.
For example: Dataframe1:
df1 = pd.DataFrame(data = {'col1' : ['1/12/2020', '2/12/2020', '3/12/2020'],
                           'col2' : [10, 11, 12], 'col3' : [13, 14, 10]})
df2 = pd.DataFrame(data = {'col1' : ['1/12/2020', '2/12/2020', '3/12/2020'],
                           'A' : [10, 9, 12], 'B' : [4, 14, 2]})
df1=df1.set_index('col1')
df2=df2.set_index('col1')
concat_data1 = pd.concat([df1, df2], axis=1, keys=['df1', 'df2'])
concat_data1
           df1       df2
          col2 col3    A   B
col1
1/12/2020   10   13   10   4
2/12/2020   11   14    9  14
3/12/2020   12   10   12   2
Desired output (overlapping period counts):
   col2  col3
A     2     0
B     0     1
This is a way of doing it:
import itertools
import pandas as pd
data1 = {
'col1': ['1/12/2020', '2/12/2020', '3/12/2020', '4/12/2020'],
'col2': [10, 11, 12, 14],
'col3': [13, 14, 10, 6],
'col4': [10, 9, 15, 10],
'col5': [10, 9, 15, 5],
}
data2 = {
'col1': ['1/12/2020', '2/12/2020', '3/12/2020', '4/12/2020'],
'A': [10, 9, 12, 14],
'B' :[4, 14, 2, 9],
'C': [6, 9, 1, 3],
'D': [6, 9, 1, 8]
}
df1 = pd.DataFrame(data1).set_index('col1')
df2 = pd.DataFrame(data2).set_index('col1')
concat_data = pd.concat([df1, df2], axis=1, keys=['df1', 'df2'])
columns = {df: list(concat_data[df].columns) for df in set(concat_data.columns.get_level_values(0))}
matrix = pd.DataFrame(data=0, columns=columns['df1'], index=columns['df2'])
# For every row, compare each df1 column with each df2 column and count matches
for _, row in concat_data.iterrows():
    for c1, c2 in itertools.product(columns['df1'], columns['df2']):
        matrix.loc[c2, c1] += row['df1'][c1] == row['df2'][c2]
print(matrix)
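The row-by-row loop above can probably be replaced with NumPy broadcasting when the frames are large. This is only a sketch under the same assumption the answer makes, namely that "overlapping" means the two products have equal values on the same date; the names left, right, counts and matrix are illustrative.

```python
import pandas as pd

dates = ['1/12/2020', '2/12/2020', '3/12/2020']
df1 = pd.DataFrame({'col2': [10, 11, 12], 'col3': [13, 14, 10]}, index=dates)
df2 = pd.DataFrame({'A': [10, 9, 12], 'B': [4, 14, 2]}, index=dates)

# Add a trailing axis to df1 and a middle axis to df2 so that every
# df1 column is compared with every df2 column on each date.
left = df1.to_numpy()[:, :, None]    # shape (rows, n_df1_cols, 1)
right = df2.to_numpy()[:, None, :]   # shape (rows, 1, n_df2_cols)

# Summing over the row axis counts the dates on which the values match.
counts = (left == right).sum(axis=0)  # shape (n_df1_cols, n_df2_cols)

# Transpose so df2 products become rows and df1 products become columns.
matrix = pd.DataFrame(counts.T, index=df2.columns, columns=df1.columns)
print(matrix)
```

The intermediate boolean array has rows x n_df1_cols x n_df2_cols elements, so for very wide frames the loop may still be the more memory-friendly option.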
I want to replace value 0 in each row in the pandas dataframe with a value that comes from a list that has the same index as the row index of the dataframe.
# here is my dataframe
df = pd.DataFrame({'a': [12, 52, 0], 'b': [33, 0, 110], 'c':[0, 15, 134]})
#here is the list
maxValueInRow = [3,5,34]
# the desired output would be:
df_updated = pd.DataFrame({'a': [12, 52, 3], 'b': [33, 5, 110], 'c':[34, 15, 134]})
I thought it could be something like
df.apply(lambda row: maxValueInRow[row.name] if row==0 else row, axis=1)
but that didn't work and raised the error 'The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().'
Any thoughts would be greatly appreciated.
Here is what you need:
# here is my dataframe
df = pd.DataFrame({'a': [12, 52, 0], 'b': [33, 0, 110], 'c':[0, 15, 134]})
#here is the list
maxValueInRow = [3,5,34]
for index, row in df.iterrows():
    for column in df.columns:
        if row[column] == 0:
            # Use .loc with both labels; chained df.iloc[index][column] may write to a copy
            df.loc[index, column] = maxValueInRow[index]
df
Output
    a    b    c
0  12   33    3
1  52    5   15
2  34  110  134
Update
As per your comments, it seems by replacing the values with the same index, you meant something else. Anyway, here is an update to your problem:
# here is my dataframe
df = pd.DataFrame({'a': [12, 52, 0], 'b': [33, 0, 110], 'c':[0, 15, 134]})
data = df.to_dict()
maxValueInRow = [3,5,34]
i = 0
for col, inner in data.items():
    for index in range(len(inner)):
        value = inner[index]
        if value == 0:
            data[col][index] = maxValueInRow[i]
    i += 1  # move to the next list entry for the next column
df = pd.DataFrame(data)
df
Output
    a    b    c
0  12   33   34
1  52    5   15
2   3  110  134
You could use .replace:
df = pd.DataFrame({'a': [12, 52, 0], 'b': [33, 0, 110], 'c':[0, 15, 134]})
maxValueInRow = [3,5,34]
repl = {col: {0: value} for col, value in zip(df.columns, maxValueInRow)}
df_updated = df.replace(repl)
Result:
a b c
0 12 33 34
1 52 5 15
2 3 110 134
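Since the thread reads "same index" in two ways (one list entry per column, or one per row), a vectorized sketch covering both may help. It relies on NumPy broadcasting rather than on any approach from the answers above; fill, by_column and by_row are illustrative names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [12, 52, 0], 'b': [33, 0, 110], 'c': [0, 15, 134]})
maxValueInRow = [3, 5, 34]
fill = np.array(maxValueInRow)

is_zero = df.eq(0).to_numpy()
values = df.to_numpy()

# A 1-D array broadcasts across columns: entry j fills the zeros in column j.
by_column = pd.DataFrame(np.where(is_zero, fill, values),
                         index=df.index, columns=df.columns)

# Reshaped to a column vector, entry i fills the zeros in row i instead.
by_row = pd.DataFrame(np.where(is_zero, fill[:, None], values),
                      index=df.index, columns=df.columns)

print(by_column)
print(by_row)
```

Here by_column reproduces df_updated from the question, while by_row matches the output of the first loop-based answer.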
I have been searching the net but cannot find any resource on dropping duplicate column names in a multiindex.
Given a multiindex as below
      level1
      level2
           A   B   C   B   A   C
ONE       11  12  13  11  12  13
TWO       21  22  23  21  22  23
THREE     31  32  33  31  32  33
Drop duplicated B and C
Expected output
      level1
      level2
           A   B   C   A
ONE       11  12  13  11
TWO       21  22  23  21
THREE     31  32  33  31
Code
import pandas as pd
df = pd.DataFrame({'A': [11, 21, 31],
'B': [12, 22, 32],
'C': [13, 23, 33]},
index=['ONE', 'TWO', 'THREE'])
df2 = pd.DataFrame({'B': [11, 21, 31],
'A': [12, 22, 32],
'C': [13, 23, 33]},
index=['ONE', 'TWO', 'THREE'])
df.columns = pd.MultiIndex.from_product([['level1'],['level2'],df.columns ])
df2.columns = pd.MultiIndex.from_product([['level1'],['level2'],df2.columns ])
df=pd.concat([df,df2],axis=1)
Dropping by index did not work.
You can try:
mask=(df.T.duplicated() | (df.columns.get_level_values(2).isin(['A','D'])))
Finally:
df=df.loc[:, mask]
#OR
#df=df.T.loc[mask].T
This adapts the df.T.drop_duplicates().T approach suggested by Anurag Dabas.
Select only the unique columns and their values
drop_col=['B','C']
drop_single=[df.loc[:, (slice(None), slice(None), DCOL)].T.drop_duplicates().T for DCOL in drop_col]
Drop those columns from the df
df=df.drop(drop_col, axis=1, level=2)
Combine everything to get the intended output
df=pd.concat([df,*drop_single],axis=1)
Complete solution
import pandas as pd
df = pd.DataFrame({'A': [11, 21, 31],
'B': [12, 22, 32],
'C': [13, 23, 33]},
index=['ONE', 'TWO', 'THREE'])
df2 = pd.DataFrame({'B': [11, 21, 31],
'A': [12, 22, 32],
'C': [13, 23, 33]},
index=['ONE', 'TWO', 'THREE'])
df.columns = pd.MultiIndex.from_product([['level1'],['level2'],df.columns ])
df2.columns = pd.MultiIndex.from_product([['level1'],['level2'],df2.columns ])
df=pd.concat([df,df2],axis=1)
drop_col=['B','C']
drop_single=[df.loc[:, (slice(None), slice(None), DCOL)].iloc[:, 1] for DCOL in drop_col]
df=df.drop(drop_col, axis=1, level=2)
df_unique=pd.concat(drop_single,axis=1)
df=pd.concat([df,df_unique],axis=1)
print(df)
You can try this:
# Drop the last 2 columns by position; dropping by label would also
# remove the earlier columns that share the same names
df = df.iloc[:, :-2]
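If "duplicate" is taken to mean a repeated column label (an assumption; the question does not spell it out), a boolean mask built from Index.duplicated may be enough. The sketch below keeps the first occurrence of every label and additionally keeps repeats of the labels listed in keep_repeats, which is a hypothetical name introduced here.

```python
import pandas as pd

idx = ['ONE', 'TWO', 'THREE']
df = pd.DataFrame({'A': [11, 21, 31], 'B': [12, 22, 32], 'C': [13, 23, 33]}, index=idx)
df2 = pd.DataFrame({'B': [11, 21, 31], 'A': [12, 22, 32], 'C': [13, 23, 33]}, index=idx)
df.columns = pd.MultiIndex.from_product([['level1'], ['level2'], df.columns])
df2.columns = pd.MultiIndex.from_product([['level1'], ['level2'], df2.columns])
df = pd.concat([df, df2], axis=1)

keep_repeats = ['A']  # hypothetical: labels whose repeated columns should survive

# True for every column whose full label tuple already appeared further left.
is_repeat = df.columns.duplicated()
# True for columns whose innermost label is exempt from dropping.
is_exempt = df.columns.get_level_values(2).isin(keep_repeats)

# Keep first occurrences plus any exempt repeats; selection is positional,
# so columns that share a label are handled independently.
result = df.loc[:, ~is_repeat | is_exempt]
print(result)
```

Because the mask works by position, it avoids the pitfall of label-based drop calls, which remove every column carrying the given label.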
I have two pandas dataframes with following format:
df_ts = pd.DataFrame([
[10, 20, 1, 'id1'],
[11, 22, 5, 'id1'],
[20, 54, 5, 'id2'],
[22, 53, 7, 'id2'],
[15, 24, 8, 'id1'],
[16, 25, 10, 'id1']
], columns = ['x', 'y', 'ts', 'id'])
df_statechange = pd.DataFrame([
['id1', 2, 'ok'],
['id2', 4, 'not ok'],
['id1', 9, 'not ok']
], columns = ['id', 'ts', 'state'])
I am trying to get it to the format, such as:
df_out = pd.DataFrame([
[10, 20, 1, 'id1', None ],
[11, 22, 5, 'id1', 'ok' ],
[20, 54, 5, 'id2', 'not ok'],
[22, 53, 7, 'id2', 'not ok'],
[15, 24, 8, 'id1', 'ok' ],
[16, 25, 10, 'id1', 'not ok']
], columns = ['x', 'y', 'ts', 'id', 'state'])
I understand how to accomplish this iteratively by grouping by id, then iterating through each row and changing the state when a change appears. Is there a built-in, more scalable way of doing this in pandas?
Unfortunately, pandas merge supports only equality joins. See the following thread for more details:
merge pandas dataframes where one value is between two others
If you want to merge by interval, you need to work around this, for example by adding another filter after the merge:
joined = a.merge(b,on='id')
joined = joined[joined.ts.between(joined.ts1,joined.ts2)]
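For the state-change data in the question, an as-of join may avoid the merge-then-filter step altogether. The sketch below uses pd.merge_asof, which for each left row picks the latest right row whose ts is less than or equal to the left ts, matched within each id. It is one possible approach rather than the method of the answer above; left, right and merged are illustrative names.

```python
import pandas as pd

df_ts = pd.DataFrame([
    [10, 20, 1, 'id1'],
    [11, 22, 5, 'id1'],
    [20, 54, 5, 'id2'],
    [22, 53, 7, 'id2'],
    [15, 24, 8, 'id1'],
    [16, 25, 10, 'id1'],
], columns=['x', 'y', 'ts', 'id'])
df_statechange = pd.DataFrame([
    ['id1', 2, 'ok'],
    ['id2', 4, 'not ok'],
    ['id1', 9, 'not ok'],
], columns=['id', 'ts', 'state'])

# merge_asof requires both frames to be sorted by the "on" key.
# reset_index() keeps the original row order in an 'index' column.
left = df_ts.reset_index().sort_values('ts', kind='stable')
right = df_statechange.sort_values('ts')

# Backward search (the default): most recent state change at or before ts, per id.
merged = pd.merge_asof(left, right, on='ts', by='id')

# Restore the original row order of df_ts.
df_out = merged.sort_values('index').set_index('index').rename_axis(None)
print(df_out)
```

Rows that precede the first state change of their id (here the row with ts 1) get NaN in the state column.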
You can merge pandas dataframes on two columns:
pd.merge(df_ts,df_statechange, how='left',on=['id','ts'])
In the df_statechange you shared, no (id, ts) pair also appears in df_ts; perhaps the dataframe you posted is incomplete. So I got this output:
x y ts id state
0 10 20 1 id1 NaN
1 11 22 5 id1 NaN
2 20 54 5 id2 NaN
3 22 53 7 id2 NaN
4 15 24 8 id1 NaN
5 16 25 10 id1 NaN
But if the two dataframes do share ts values, the merge fills in state for the matching rows. For example:
df_statechange = pd.DataFrame([
['id1', 5, 'ok'],
['id1', 8, 'ok'],
['id2', 5, 'not ok'],
['id2',7, 'not ok'],
['id1', 9, 'not ok']
], columns = ['id', 'ts', 'state'])
The output:
x y ts id state
0 10 20 1 id1 NaN
1 11 22 5 id1 ok
2 20 54 5 id2 not ok
3 22 53 7 id2 not ok
4 15 24 8 id1 ok
5 16 25 10 id1 NaN
I have n small dataframes which I combine into one multiindex dataframe.
d1 = DataFrame(np.array([
['a', 5, 9],
['b', 4, 61],
['c', 24, 9]]),
columns=['name', 'attr11', 'attr22'])
d2 = DataFrame(np.array([
['a', 5, 19],
['b', 14, 16],
['c', 4, 9]]),
columns=['name', 'attr21', 'attr22'])
d3 = DataFrame(np.array([
['a', 5, 19],
['b', 14, 16],
['d', 10, 14],
['c', 4, 9]]),
columns=['name', 'attr21', 'attr22'])
How to combine these into one dataframe?
Result:
name attr11 attr21 attr22
d1 a 5 NULL 9
b 4 NULL 61
c 24 NULL 9
d2 a NULL 5 19
b NULL 14 16
c NULL 4 9
d3 a NULL 5 19
b NULL 14 16
c NULL 4 9
d NULL 10 14
You can build the multiindex after the concatenation. You just need to add a column with the dataframe id to each frame:
frames = [d1, d2, d3]
for i, frame in enumerate(frames, start=1):
    frame['frame_id'] = 'd' + str(i)
Then concatenate the list of frames and set the index on the desired columns:
pd.concat(frames).set_index(['frame_id','name'])
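An alternative that leaves the input frames unmodified is to let pd.concat build the outer index level itself by passing a dict of frames. The sketch below assumes numeric attribute columns (the question builds them via np.array, which would turn every value into a string), and the names frames and combined are illustrative.

```python
import pandas as pd

d1 = pd.DataFrame({'name': ['a', 'b', 'c'], 'attr11': [5, 4, 24], 'attr22': [9, 61, 9]})
d2 = pd.DataFrame({'name': ['a', 'b', 'c'], 'attr21': [5, 14, 4], 'attr22': [19, 16, 9]})
d3 = pd.DataFrame({'name': ['a', 'b', 'd', 'c'], 'attr21': [5, 14, 10, 4],
                   'attr22': [19, 16, 14, 9]})

frames = {'d1': d1, 'd2': d2, 'd3': d3}

# The dict keys become the outer index level; 'name' becomes the inner level.
# Columns missing from a frame are filled with NaN.
combined = pd.concat(
    {key: frame.set_index('name') for key, frame in frames.items()},
    names=['frame_id', 'name'],
)
print(combined)
```

A single block can then be pulled out with combined.loc['d2'], and one row with combined.loc[('d3', 'd')].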