Pandas: Multiindex dataframe indexing - set values to dataframe subset - python

I have a Multiindex dataframe with 2 index levels and 2 column levels.
The first level of the index and the first level of the columns are the same. The second levels share elements but are not equal, which gives me a non-square dataframe (I have more elements in my second-level columns than in my second-level index).
I want to set all elements of my dataframe to 0 wherever the first-level index is not equal to the first-level column. I have done it recursively, but I am sure there is a better way.
Can you help?
Thanks
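One vectorized sketch, assuming a small hypothetical frame with the shape described above (shared first levels, more second-level columns than index entries): broadcast a comparison of the first index level against the first column level to build a boolean mask, then zero out the mismatching cells with DataFrame.mask.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 2-level index and 2-level columns sharing level 0,
# with more second-level columns than second-level index entries.
idx = pd.MultiIndex.from_product([['A', 'B'], [1, 2]])
cols = pd.MultiIndex.from_product([['A', 'B'], [1, 2, 3]])
df = pd.DataFrame(np.ones((len(idx), len(cols))), index=idx, columns=cols)

# Broadcast-compare the first index level against the first column level.
row_l0 = df.index.get_level_values(0).to_numpy()
col_l0 = df.columns.get_level_values(0).to_numpy()
mask = row_l0[:, None] != col_l0[None, :]

# Zero out every cell whose first-level index and column disagree.
df = df.mask(mask, 0)
```

This avoids any Python-level recursion: the mask is computed once for the whole frame, and `df.mask(mask, 0)` replaces every flagged cell in one shot.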

Related

Want to drop duplicate based on one column but want to keep first two rows

Hi, I am dropping duplicates from a dataframe based on one column, i.e. "ID". Till now I have been dropping duplicates and keeping the first occurrence, but I want to keep the first (top) two occurrences instead of only one, so I can compare the values of the first two rows of another column, "similarity_score".
data_2 = data.sort_values('similarity_score' , ascending = False)
data_2.drop_duplicates(subset=['ID'], keep='first').reset_index()
Let us sort the values then do groupby + head
data.sort_values('similarity_score', ascending=False).groupby('ID').head(2)
Alternatively, you can use groupby + nlargest which will also give you the desired result:
data.groupby('ID')['similarity_score'].nlargest(2).droplevel(1)
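A minimal runnable sketch of the sort-then-head approach, using made-up demo data (the `data` frame here is hypothetical, with the question's "ID" and "similarity_score" columns):

```python
import pandas as pd

# Hypothetical demo data: three rows for 'a', two for 'b'.
data = pd.DataFrame({
    'ID': ['a', 'a', 'a', 'b', 'b'],
    'similarity_score': [0.9, 0.8, 0.7, 0.6, 0.5],
})

# Sort so the best scores come first, then keep the top 2 rows per ID.
top2 = (data.sort_values('similarity_score', ascending=False)
            .groupby('ID')
            .head(2))
```

Unlike `drop_duplicates(keep='first')`, which keeps exactly one row per ID, `groupby(...).head(2)` keeps up to two rows per group while preserving the sorted order, so the 0.7 row for 'a' is the one that gets dropped.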

Pandas DataFrame sorting issues by value and index

I have two DataFrames that I append together ignoring the index so the rows from the appended DataFrame remain the same.
One DataFrame index goes from 0 to 200 and the second DataFrame index goes from 0 to 76
After appending them, I try to sort with .sort_values then .sort_index, because I want the same dates to be together, but I also want the larger index above the smaller index for the same date, as shown in the image below from my output. The red and green are correct, but not the blue highlight.
I think what is happening is that I have the process in reverse. I think I am sorting by index then by Date and the index order just lands randomly.
lookForwardData = lookForwardData.append(lookForwardDataShell,
                                         ignore_index=True).sort_values("Date", ignore_index=False)
IIUC, you could do sort_values after resetting the index, so it sorts on both the Date column and the index (Date ascending, index descending):
lookForwardData = lookForwardData.append(lookForwardDataShell, ignore_index=True)
output = (lookForwardData.reset_index()
                         .sort_values(['Date', 'index'], ascending=[True, False])
                         .set_index('index'))

Merge/Join Multi-index Dataframes and combine columns

I am trying to merge two dataframes that are multi-indexed, while preserving the highest-level index. The problem is that merging on axis=1 results in the below two columns. Merging/joining on axis=0 drops any value in the 0_y column that has the same sub-index as an entry in 0_x. An example below is (226, 0), where the value 1510123295301 gets dropped if I merge/join on axis=0.
Is there any way to merge two multi-index dataframes into a single column, preserving the primary index (e.g. 226), but expanding to include non-duplicates in the right-hand column (e.g. 226(0-6))?
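One possible sketch, using hypothetical data shaped like the (226, 0) example: concatenate the two columns along axis=0 and then renumber the second index level with a per-group cumulative count, so overlapping sub-indices are kept and the level-0 key (e.g. 226) expands to cover all entries instead of dropping duplicates.

```python
import pandas as pd

# Hypothetical stand-ins for the 0_x and 0_y columns under key 226.
left = pd.Series([10, 11, 12], index=pd.MultiIndex.from_tuples(
    [(226, 0), (226, 1), (226, 2)]))
right = pd.Series([1510123295301, 20, 21, 22], index=pd.MultiIndex.from_tuples(
    [(226, 0), (226, 1), (226, 2), (226, 3)]))

# Stack both columns, then renumber the second level per level-0 group
# so colliding sub-indices like (226, 0) no longer overwrite each other.
combined = pd.concat([left, right])
combined.index = pd.MultiIndex.from_arrays([
    combined.index.get_level_values(0),
    combined.groupby(level=0).cumcount(),
])
```

With this renumbering, 226 ends up spanning sub-indices 0 through 6 and the value 1510123295301 survives, rather than being dropped by the axis=0 join.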

Python Pandas Dataframes comparison on 2 columns (with where clause)

I'm stuck on a particular Python question here. I have 2 dataframes, DF1 and DF2. In both, I have 2 columns, pID and yID (which are not indexed, just default). I'm looking to add a column Found in DF1 wherever the respective values of the columns (pID and yID) are found in DF2. Also, I would like to restrict this to just values in DF2 where aID == 'Text'.
I believe the below gets me the first part of this question; however, I'm unsure how to incorporate the where clause.
DF1['Found'] = (DF1[['pID', 'yID']] == DF2[['pID','yID']]).all(axis=1).astype(bool)
Suggestions or answers would be most appreciated. Thanks.
You could subset the second dataframe containing aID == 'Text' to get a reduced DF from which select those portions of columns to be compared against the first dataframe.
Use DF.isin() to check if the values that are present under these column names match or not. And, .all(axis=1) returns True if both the columns happen to be True, else they become False. Convert the boolean series to integers via astype(int) and assign the result to the new column, Found.
df1_sub = df1[['pID', 'yID']]
df2_sub = df2.query('aID=="Text"')[['pID', 'yID']]
df1['Found'] = df1_sub.isin(df2_sub).all(axis=1).astype(int)
df1
Demo DF's used:
df1 = pd.DataFrame(dict(pID=[1, 2, 3, 4, 5],
                        yID=[10, 20, 30, 40, 50]))
df2 = pd.DataFrame(dict(pID=[1, 2, 8, 4, 5],
                        yID=[10, 12, 30, 40, 50],
                        aID=['Text', 'Best', 'Text', 'Best', 'Text']))
If it does not matter where those matches occur, then merge the two dataframes on the common columns 'pID' and 'yID' as the key, keeping the bigger DF's index (right_index=True) as the index axis that is emitted and aligned after the merge operation is over.
Access these indices, which indicate matches found, and assign the value 1 to a new column named Found, filling its missing elements with 0s throughout.
df1.loc[pd.merge(df1_sub, df2_sub, on=['pID', 'yID'], right_index=True).index, 'Found'] = 1
df1['Found'].fillna(0, inplace=True)
df1 should be modified accordingly after the above steps.
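An end-to-end runnable variant of the merge-based idea, using the demo frames from above. Newer pandas versions reject combining `on=` with `right_index=True` in a single merge call, so this sketch instead promotes df1's index to a column with reset_index() before merging, which I'm confident works across versions:

```python
import pandas as pd

# The demo frames from the answer above.
df1 = pd.DataFrame(dict(pID=[1, 2, 3, 4, 5],
                        yID=[10, 20, 30, 40, 50]))
df2 = pd.DataFrame(dict(pID=[1, 2, 8, 4, 5],
                        yID=[10, 12, 30, 40, 50],
                        aID=['Text', 'Best', 'Text', 'Best', 'Text']))

df1_sub = df1[['pID', 'yID']]
df2_sub = df2.query('aID == "Text"')[['pID', 'yID']]

# Keep df1's row labels through the merge via reset_index(),
# then flag the rows that found a value match in df2_sub.
matched = df1_sub.reset_index().merge(df2_sub, on=['pID', 'yID'])['index']
df1['Found'] = 0
df1.loc[matched, 'Found'] = 1
```

Only the (1, 10) and (5, 50) pairs exist among the aID == 'Text' rows of df2, so only the first and last rows of df1 get Found = 1.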

Pandas merge DataFrames based on index/column combination

I have two DataFrames that I want to merge. I have read about merging on multiple columns, and preserving the index when merging. My problem needs to cater for both, and I am having difficulty figuring out the best way to do this.
The first DataFrame looks like this
and the second looks like this
I want to merge these based on the Date and the ID. In the first DataFrame the Date is the index and the ID is a column; in the second DataFrame both Date and ID are part of a MultiIndex.
Essentially, as a result I want a DataFrame that looks like DataFrame 2 with an additional column for the Events from DataFrame 1.
I'd suggest resetting the index (reset_index) and then merging the DataFrames, as you've read. Then you can set the index (set_index) to reproduce your desired MultiIndex.
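The suggestion above can be sketched with two small hypothetical frames shaped like the question describes (the column names `Events` and `Value` are stand-ins, since the actual frames are only shown as images):

```python
import pandas as pd

# Hypothetical DataFrame 1: Date index, ID and Events columns.
df_events = pd.DataFrame({'ID': [1, 2], 'Events': ['x', 'y']},
                         index=pd.to_datetime(['2021-01-01', '2021-01-01']))
df_events.index.name = 'Date'

# Hypothetical DataFrame 2: MultiIndex on (Date, ID).
df_multi = pd.DataFrame({'Value': [100, 200]},
                        index=pd.MultiIndex.from_tuples(
                            [(pd.Timestamp('2021-01-01'), 1),
                             (pd.Timestamp('2021-01-01'), 2)],
                            names=['Date', 'ID']))

# reset_index -> merge on the now-ordinary columns -> set_index to
# rebuild the (Date, ID) MultiIndex with the Events column attached.
merged = (df_multi.reset_index()
                  .merge(df_events.reset_index(), on=['Date', 'ID'], how='left')
                  .set_index(['Date', 'ID']))
```

The round trip through reset_index lets merge treat Date and ID as plain key columns regardless of which frame keeps them in its index, and set_index restores the MultiIndex afterwards.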
