I have two DataFrames that I append together, ignoring the index, so the rows from the appended DataFrame stay unchanged.
One DataFrame's index runs from 0 to 200 and the second's from 0 to 76.
After appending them I try to sort with .sort_values and then .sort_index, because I want rows with the same date to be together, but I also want the larger index above the smaller index within the same date, as shown in the image below from my output. The red and green highlights are correct, but the blue one is not.
I think I have the process in reverse: I am sorting by index first and then by Date, and the index order just lands randomly.
lookForwardData = lookForwardData.append(lookForwardDataShell,
                                         ignore_index=True).sort_values("Date", ignore_index=False)
IIUC, you could run sort_values after resetting the index so it sorts on both the Date column and the index (Date ascending, index descending):
lookForwardData = lookForwardData.append(lookForwardDataShell, ignore_index=True)
output = (lookForwardData.reset_index()
                         .sort_values(['Date', 'index'], ascending=[True, False])
                         .set_index("index"))
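As a minimal sketch of that approach on hypothetical data (df1/df2 stand in for lookForwardData and lookForwardDataShell; note that DataFrame.append was removed in pandas 2.0, so pd.concat is used here instead):

```python
import pandas as pd

# Hypothetical stand-ins for lookForwardData / lookForwardDataShell
df1 = pd.DataFrame({"Date": ["2021-01-01", "2021-01-02"], "val": [1, 2]})
df2 = pd.DataFrame({"Date": ["2021-01-01", "2021-01-02"], "val": [3, 4]})

# pd.concat replaces the deprecated DataFrame.append
combined = pd.concat([df1, df2], ignore_index=True)

# Sort Date ascending, original row position descending
out = (combined.reset_index()
               .sort_values(["Date", "index"], ascending=[True, False])
               .set_index("index"))
print(out)
```

Within each date, the row that landed later in the combined frame (larger index) now comes first.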
I'm trying to create an Excel file with value counts and percentages. I'm almost finished, but when I run my for loop, the percentage is added as a new df.to_frame with two more columns, while I only want one. This is how it looks in Excel:
I want the blue square not to appear in the Excel file or the df, and the music percentage to sit next to the counts of the music column. I would also like the music percentage in percentage format, e.g. 0.81 --> 81%. Below is my code.
li = []
for i in range(0, len(df.columns)):
    value_counts = df.iloc[:, i].value_counts().to_frame().reset_index()
    value_percentage = df.iloc[:, i].value_counts(normalize=True).to_frame().reset_index()  # .style.format('{:.2%}')
    li.append(value_counts)
    li.append(value_percentage)
data = pd.concat(li, axis=1)
The .reset_index() function creates a column in your dataframe called index. So you are appending two-column dataframes each time, one of which is the index. You could add .drop(columns='index') after .reset_index() to drop the index column at each step and therefore also in your final dataframe.
However, depending on your application you may want to be careful with resetting the index, because it looks like you are appending in a way where your rows do not align (i.e. your index columns are not all the same).
To change your dataframe values to strings with percentages you can use:
value_counts = (value_counts*100).astype(str)+'%'
Hi, I am dropping duplicates from a dataframe based on one column, "ID". So far I drop the duplicates and keep the first occurrence, but I want to keep the first (top) two occurrences instead of only one, so I can compare the values of the first two rows in another column, "similarity_score".
data_2 = data.sort_values('similarity_score' , ascending = False)
data_2.drop_duplicates(subset=['ID'], keep='first').reset_index()
Let us sort the values, then do groupby + head:
data.sort_values('similarity_score', ascending=False).groupby('ID').head(2)
Alternatively, you can use groupby + nlargest, which will also give you the desired result:
data.groupby('ID')['similarity_score'].nlargest(2).droplevel(1)
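A runnable sketch of the sort + groupby + head approach, using a small made-up frame with the question's column names:

```python
import pandas as pd

# Hypothetical data: several rows per ID with a similarity score
data = pd.DataFrame({
    "ID": ["a", "a", "a", "b", "b"],
    "similarity_score": [0.9, 0.5, 0.7, 0.8, 0.6],
})

# Keep the two highest-scoring rows per ID
top2 = (data.sort_values("similarity_score", ascending=False)
            .groupby("ID")
            .head(2))
print(top2)
```

head(2) takes the first two rows of each group in the already-sorted order, so each ID keeps its two largest scores.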
I know how to delete rows and columns from a dataframe using .drop() method, by passing axis and labels.
Here's the DataFrame:
Now, if I want to remove all rows whose STNAME falls in the range from Arizona all the way to Colorado, how should I do it?
I know I could do it by passing row labels 2 to 7 to the .drop() method, but if I have a lot of data and don't know the starting and ending indexes, that won't be possible.
Might be kinda hacky, but here is an option:
import numpy as np

index1 = df.index[df['STNAME'] == 'Arizona'].tolist()[0]
index2 = df.index[df['STNAME'] == 'Colorado'].tolist()[-1]
df = df.drop(np.arange(index1, index2 + 1))
This basically takes the first index number of Arizona and the last index number of Colorado, and deletes every row from the data frame between these indexes.
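A self-contained sketch of that idea, assuming the default RangeIndex and a small invented STNAME column:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a default RangeIndex
df = pd.DataFrame({
    "STNAME": ["Alabama", "Alaska", "Arizona", "Arizona",
               "Arkansas", "California", "Colorado", "Connecticut"],
})

# First row labelled Arizona and last row labelled Colorado
index1 = df.index[df["STNAME"] == "Arizona"][0]
index2 = df.index[df["STNAME"] == "Colorado"][-1]

# Drop every row between those positions, inclusive
df = df.drop(np.arange(index1, index2 + 1))
print(df)
```

Note this relies on the index labels being consecutive integers; after filtering or reordering you would want df.reset_index(drop=True) first.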
I have a MultiIndex dataframe with 2 index levels and 2 column levels.
The first-level index and first-level columns are the same. The second levels share elements but are not equal, which gives me a non-square dataframe (I have more elements in my 2nd-level columns than in my 2nd-level index).
I want to set all elements of my dataframe to 0 wherever the first-level index is not equal to the first-level column. I have done it recursively, but I am sure there is a better way.
Can you help?
Thanks
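One vectorized sketch, assuming a small hypothetical frame of the shape described: build a boolean mask by broadcasting the level-0 row labels against the level-0 column labels, then zero the matching cells with DataFrame.mask.

```python
import numpy as np
import pandas as pd

# Hypothetical non-square MultiIndex frame: level-0 labels match across
# rows and columns, level-1 labels differ (more columns than rows)
rows = pd.MultiIndex.from_tuples([("A", "x"), ("A", "y"), ("B", "x")])
cols = pd.MultiIndex.from_tuples([("A", "x"), ("A", "y"), ("A", "z"), ("B", "x")])
df = pd.DataFrame(np.ones((3, 4)), index=rows, columns=cols)

# True where the row's level-0 label differs from the column's level-0 label
mask = (df.index.get_level_values(0).to_numpy()[:, None]
        != df.columns.get_level_values(0).to_numpy()[None, :])

# Replace those cells with 0, keep everything else
df = df.mask(mask, 0)
print(df)
```

The broadcasted comparison builds the whole off-block mask in one step, so no loop over rows or columns is needed.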
I have such a data frame df:
a b
10 2
3 1
0 0
0 4
....
# about 50,000+ rows
I wish to select df.loc[:5, 'a']. But when I call df.loc[:5, 'a'], I get an error: KeyError: 'Cannot get right slice bound for non-unique label: 5'. When I call df.loc[5], the result contains 250 rows, while there is just one when I use df.iloc[5]. Why does this happen, and how can I index it properly? Thank you in advance!
The error message is explained here: if the index is not monotonic, then both slice bounds must be unique members of the index.
The difference between .loc and .iloc is label- vs. integer-position-based indexing - see the docs. .loc is intended to select individual labels or slices of labels; that's why .loc[5] selects all rows where the index has the value 5 (250 of them in your data, which is also why the error complains about a non-unique label). .iloc, in contrast, selects row number 5 (0-indexed); that's why you get a single row, whose index value may or may not be 5. Hope this helps!
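The distinction can be seen on a tiny frame with deliberately duplicated labels (the values here are invented for illustration):

```python
import pandas as pd

# Index with duplicate labels, mimicking the non-unique case
df = pd.DataFrame({"a": [10, 3, 0, 0]}, index=[5, 0, 5, 1])

print(df.loc[5])   # every row whose LABEL is 5 -> two rows here
print(df.iloc[1])  # the single row at POSITION 1 (its label is 0)

# After sorting, label slicing works even with duplicate labels,
# because the index is now monotonic
print(df.sort_index().loc[:1])
```

The last line is why the answer above suggests sort_index: slicing a monotonic index no longer requires the bounds to be unique.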
To filter with non-unique indexes, try something like this:
df.loc[(df.index>0)&(df.index<2)]
The issue with the way you are indexing is that there are multiple rows with index 5, so the loc attribute does not know which one to pick; just do df.loc[5] and you will see how many rows share that index.
Either sort it using sort_index, or first aggregate the data based on the index and then retrieve.
Hope this helps.