Making a Multiindexed Pandas Dataframe Non-Symmetric - python

I have a multi-indexed dataframe which looks roughly like this:
import pandas as pd
test = pd.DataFrame({('A', 'a'):[1,2,3,4,5], ('A', 'b'):[5,4,3,2,1], ('B', 'a'):[5,2,3,4,1], ('B','b'):[1,4,3,2,5]})
>>> print(test)
   A     B
   a  b  a  b
0  1  5  5  1
1  2  4  2  4
2  3  3  3  3
3  4  2  4  2
4  5  1  1  5
In this dataframe, rows 0 and 4 are symmetric in the sense that if the entire A and B blocks of row 0 are swapped, it becomes identical to row 4. Similarly, row 2 is symmetric with itself.
I am planning to remove such rows from my original dataframe, thus making it 'non-symmetric'. The specific plan is as follows:
If a row with a higher index is symmetric with a row with a lower index, keep the lower one and remove the higher one. For example, from the above dataframe, keep row 0 and remove row 4.
If a row is symmetric with itself, remove that row. For example, from the above dataframe, remove row 2.
My attempt was to first zip the four lists into a list of tuples, remove the symmetric tuples with a simple if-statement, unzip them, and merge them back into a dataframe. However, this turned out to be inefficient, making it unscalable for large dataframes.
How can I achieve this efficiently? I guess utilizing several built-in pandas methods is necessary, but it seems quite complicated.

Namudon'tdie,
Try this solution:
import pandas as pd
test = pd.DataFrame({('A', 'a'):[1,2,3,4,5], ('A', 'b'):[5,4,3,2,1], ('B', 'a'):[5,2,3,4,1], ('B','b'):[1,4,3,2,5]})
test['idx'] = test.index * 2           # auxiliary column 'idx' (all even)
test2 = test.iloc[:, [2, 3, 0, 1, 4]]  # flipped copy: B columns first, then A, keeping 'idx'
test2.columns = test.columns           # fix the column names
test2['idx'] = test2.index * 2 + 1     # in the flipped copy, 'idx' is all odd
df = pd.concat([test, test2])
df = df.sort_values(by='idx')
df = df.set_index('idx')
print(df)
     A     B
     a  b  a  b
idx
0    1  5  5  1
1    5  1  1  5
2    2  4  2  4
3    2  4  2  4
4    3  3  3  3
5    3  3  3  3
6    4  2  4  2
7    4  2  4  2
8    5  1  1  5
9    1  5  5  1
df = df.drop_duplicates()      # drop duplicate rows, keeping the first occurrence
df = df[df.index % 2 == 0]     # drop rows with odd idx (the flipped copies)
df = df.reset_index()[['A', 'B']]
print(df)
   A     B
   a  b  a  b
0  1  5  5  1
1  2  4  2  4
2  3  3  3  3
3  4  2  4  2
The idea is to give the flipped rows odd indices, so that after sorting each flipped row is placed directly under its original row. Then delete duplicates, keeping the rows with lower indices. For cleanup, simply delete the remaining rows with odd indices.
Note that the row [3,3,3,3] stayed. There should be a separate filter to take care of self-symmetric rows. Since your definition of self-symmetric is unclear (other rows have a certain degree of symmetry too), I leave this part to you. It should be straightforward.
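As a sketch of that separate filter (my own assumption, not part of the answer): if "symmetric with itself" means that swapping the A and B blocks leaves the row unchanged, the filter is a simple elementwise comparison. Note that under this reading rows 1 and 3 of the example also qualify, which is exactly the ambiguity the answer points out:

```python
import pandas as pd

test = pd.DataFrame({('A', 'a'): [1, 2, 3, 4, 5], ('A', 'b'): [5, 4, 3, 2, 1],
                     ('B', 'a'): [5, 2, 3, 4, 1], ('B', 'b'): [1, 4, 3, 2, 5]})

# A row is self-symmetric when swapping the A and B blocks changes nothing:
# (A, a) == (B, a) and (A, b) == (B, b)
self_sym = (test[('A', 'a')] == test[('B', 'a')]) & (test[('A', 'b')] == test[('B', 'b')])
filtered = test[~self_sym]  # under this definition, drops rows 1, 2 and 3
```

If only the all-equal row [3,3,3,3] should go, the stricter condition `test.nunique(axis=1) == 1` could be used instead.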

Related

Python Pandas - How to get group by counts by values from multiple columns with multiple values

My data includes a few variables holding data from multi-answer questions. These are stored as strings (comma separated) and aren't ordered by value.
I need to run different counts across 2 or more of these variables at the same time, i.e. get the frequencies of each combination of their unique values.
I also have a second dataframe with the available codes for each variable
df_meta['a']['Categories'] = ['1', '2', '3','4']
df_meta['b']['Categories'] = ['1', '2']
If this is my data
df = pd.DataFrame(np.array([["1,3", "1"], ["3", "1,2"], ["1,3,2", "1"], ["3,1", "2,1"]]),
                  columns=['a', 'b'])
index  a      b
1      1,3    1
2      3      1,2
3      1,3,2  1
4      3,1    2,1
Ideally, this is what the output would look like
a b count
1 1 3
1 2 1
2 1 1
2 2 0
3 1 4
3 2 2
4 1 0
4 2 0
Although if it's not possible to get the zero counts, this would be just fine
a b count
1 1 3
1 2 1
2 1 1
3 1 4
3 2 2
So far, I got the counts for each of these variables individually, by using split and value_counts
df["a"].str.split(',',expand=True).stack().value_counts()
3 4
1 3
2 1
df["b"].str.split(',',expand=True).stack().value_counts()
1 4
2 2
But I can't figure out how to group by them, because of the differences in the indexes.
df2 = pd.DataFrame()
df2["a"] = df["a"].str.split(',',expand=True).stack()
df2["b"] = df["b"].str.split(',',expand=True).stack()
df2.groupby(['a','b']).size()
a  b
1  1    3
3  1    1
   2    1
dtype: int64
Is there a way to adjust the groupby to only count the instances of the first index, or another way to count the unique combinations more efficiently?
I can alternatively iterate through all codes using the df_meta dataframe, but some of the actual variables have 300-400 codes and it's very slow when I try to cross 2-3 of them; if it's possible to use groupby or another function, it should work much faster.
First, recreate your starting dataframe.
df = pd.DataFrame(np.array([["1,3", "1"], ["3", "1,2"], ["1,3,2", "1"], ["3,1", "2,1"]]),
                  columns=['a', 'b'])
Then split columns to separate dataframes.
da = df["a"].str.split(',',expand=True)
db = df["b"].str.split(',',expand=True)
Loop through all rows of both dataframes. Make a temporary dataframe for each combination and add it to a list.
ab = list()
for r in range(len(da)):
    for i in da.iloc[r, :]:
        for j in db.iloc[r, :]:
            if i is not None and j is not None:
                daf = pd.DataFrame({'a': [i], 'b': [j]})
                ab.append(daf)
Concatenate the list of temporary dataframes into one new dataframe.
dfn = pd.concat(ab)
Groupby with 'a' and 'b' columns and size() gives you the answer.
print(dfn.groupby(['a', 'b']).size().reset_index(name='count'))
a b count
0 1 1 3
1 1 2 1
2 2 1 1
3 3 1 4
4 3 2 2
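For larger frames, the row-wise Python loop can become a bottleneck. A loop-free sketch of the same idea (my own variation, not part of the answer above): stack both split columns into long form, keep the original row label, and let a merge on that label produce every within-row (a, b) combination.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([["1,3", "1"], ["3", "1,2"], ["1,3,2", "1"], ["3,1", "2,1"]]),
                  columns=['a', 'b'])

# Split each column, stack into long form, and keep the original row label
da = df['a'].str.split(',', expand=True).stack().droplevel(1).rename('a').reset_index()
db = df['b'].str.split(',', expand=True).stack().droplevel(1).rename('b').reset_index()

# Merging on the row label yields the cartesian product within each row
pairs = da.merge(db, on='index')
counts = pairs.groupby(['a', 'b']).size().reset_index(name='count')
```

This gives the same counts as the loop, without constructing one tiny dataframe per combination.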

Making a long Dataframe based on columns and row in pandas

So suppose I have a dataframe like:
A B
0 1 1
1 2 4
2 3 9
I want to have one long dataframe where there are three columns row, col, value like:
row col value
0 0 A 1
1 1 A 2
2 2 A 3
3 0 B 1
4 1 B 4
5 2 B 9
Basically making a 2D array into 1D and remembering the row and column of each entry so the resulting dataframe would be of shape (n*m , 3).
How is this possible with Pandas?
Actually the order of entries in the resulting dataframe isn't important for me.
Use melt:
df = df.reset_index()
df = df.melt(id_vars=['index'], value_vars=['A', 'B'])
It should give you the thing you want. Let me know if it works.
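Putting it together with the exact column names asked for (row, col, value) might look like this; var_name and value_name are standard melt parameters:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [1, 4, 9]})

# reset_index exposes the row labels as a column; melt stacks A and B,
# and var_name/value_name set the names of the new columns
long_df = (df.reset_index()
             .melt(id_vars='index', var_name='col', value_name='value')
             .rename(columns={'index': 'row'}))
```

Since the order of entries doesn't matter here, `df.stack().reset_index()` would be an equally valid route.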

Attempting to delete multiple rows from Pandas Dataframe but more rows than intended are being deleted

I have a list, to_delete, of row indexes that I want to delete from both of my two Pandas Dataframes, df1 & df2. They both have 500 rows. to_delete has 50 entries.
I run this:
df1.drop(df1.index[to_delete], inplace=True)
df2.drop(df2.index[to_delete], inplace=True)
But this results in df1 and df2 having 250 rows each. It deletes 250 rows from each, and not the 50 specific rows that I want it to...
to_delete is ordered in descending order.
The full method:
def method(results):
    # results is a 500 x 1 matrix of 1s and -1s
    global df1, df2
    deletions = []
    for i in xrange(len(results) - 1, -1, -1):
        if results[i] == -1:
            deletions.append(i)
    df1.drop(df1.index[deletions], inplace=True)
    df2.drop(df2.index[deletions], inplace=True)
Any suggestions as to what I'm doing wrong?
(I've also tried using .iloc instead of .index, and deleting inside the if statement instead of appending to a list first.)
Your index values are not unique, and when you use drop it removes all rows with those index values. to_delete may have had length 50, but there were 250 rows that had those particular index values.
Consider the example
df = pd.DataFrame(dict(A=range(10)), [0, 1, 2, 3, 4] * 2)
df
A
0 0
1 1
2 2
3 3
4 4
0 5
1 6
2 7
3 8
4 9
Let's say you want to remove the first, third, and fourth rows.
to_del = [0, 2, 3]
Using your method
df.drop(df.index[to_del])
A
1 1
4 4
1 6
4 9
Which is a problem
Option 1
use np.in1d to find complement of to_del
This is more self-explanatory than the others. I look at an array of positions from 0 to n and check whether each one is in to_del. The result is a boolean array the same length as df. I use ~ to get the negation and use that to slice the dataframe.
df[~np.in1d(np.arange(len(df)), to_del)]
A
1 1
4 4
0 5
1 6
2 7
3 8
4 9
Option 2
use np.bincount to find complement of to_del
This accomplishes the same thing as option 1 by counting the positions defined in to_del. I end up with an array of 0s and 1s, with a 1 in each position defined in to_del and 0 elsewhere. I want to keep the 0s, so I make a boolean array by finding where the count is equal to 0. I then use this to slice the dataframe.
df[np.bincount(to_del, minlength=len(df)) == 0]
A
1 1
4 4
0 5
1 6
2 7
3 8
4 9
Option 3
use np.setdiff1d to find positions
This uses set logic to find the difference between a full array of positions and just the ones I want to delete. I then use iloc to select.
df.iloc[np.setdiff1d(np.arange(len(df)), to_del)]
A
1 1
4 4
0 5
1 6
2 7
3 8
4 9
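For completeness, a self-contained sketch of Option 1; np.isin is the modern spelling of np.in1d and behaves the same way here:

```python
import numpy as np
import pandas as pd

# Duplicate index labels, as in the example above
df = pd.DataFrame(dict(A=range(10)), index=[0, 1, 2, 3, 4] * 2)
to_del = [0, 2, 3]  # positions (not labels) to delete

# Boolean mask over positions: True for every position not listed in to_del
keep = ~np.isin(np.arange(len(df)), to_del)
result = df[keep]
```

Because the mask is positional, each entry of to_del removes exactly one row, regardless of how many rows share an index label.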

How to save the sort order in Dataframe?

I want to add a new column to a DataFrame that saves the sort order obtained by sorting on one of the columns. For example, I would like to sort by column 'B' (ascending) and add a new column 'C' to save the sort order; that means I want to get a column 'C' which is [4, 3, 1, 2, 4, 2].
df=pd.DataFrame({"A":[1,2,3,4,5,6],"B":[5,2,0,1,5,1]})
Try with rank, and method='dense' so that rank always increases by 1 between groups:
import pandas as pd
df=pd.DataFrame({"A":[1,2,3,4,5,6],"B":[5,2,0,1,5,1]})
df['C'] = df['B'].rank(method='dense').astype(int)  # rank returns floats; cast to int
df
Output:
A B C
0 1 5 4
1 2 2 3
2 3 0 1
3 4 1 2
4 5 5 4
5 6 1 2
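For comparison, a quick sketch (my addition) of how method='dense' differs from method='min' on the tied values in this column:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4, 5, 6], "B": [5, 2, 0, 1, 5, 1]})

# 'dense' increases the rank by exactly 1 between groups of tied values;
# 'min' assigns ties their lowest position and skips ranks afterwards
dense = df['B'].rank(method='dense').tolist()
low = df['B'].rank(method='min').tolist()
```

With 'min', the two 5s rank 5 instead of 4, because the tied 1s and 5s each consume two positions.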

Delete columns with extremely unequally distributed values from pandas dataframe

Given the following code:
import numpy as np
import pandas as pd
arr = np.array([
    [1, 2, 9, 1, 1, 1],
    [2, 3, 3, 1, 0, 1],
    [1, 4, 2, 1, 2, 1],
    [2, 3, 1, 1, 2, 1],
    [1, 2, 3, 1, 8, 1],
    [2, 2, 5, 1, 1, 1],
    [1, 3, 8, 7, 4, 1],
    [2, 4, 7, 8, 3, 3]
])
# 1,2,3,4,5,6 <- Number of the columns.
df = pd.DataFrame(arr)
for _ in df.columns.values:
    print {x: list(df[_]).count(x) for x in set(df[_])}
I want to delete from the dataframe all the columns in which one value occurs more often than all the other values of the column together. In this case I would like to drop the columns 4 and 6 (see comment) since the number 1 occurs more often than all the other numbers in these columns together (6 > 2 in column 4 and 7 > 1 in column 6). I don't want to drop the first column (4 = 4). How would I do that?
Another option is to do a value_counts on each column and keep the column only if the maximum count is less than or equal to half the number of rows of the dataframe:
df.loc[:, df.apply(lambda col: max(col.value_counts()) <= df.shape[0]/2)]
# 0 1 2 4
#0 1 2 9 1
#1 2 3 3 0
#2 1 4 2 2
#3 2 3 1 2
#4 1 2 3 8
#5 2 2 5 1
#6 1 3 8 4
#7 2 4 7 3
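A runnable version of that approach, spelled out with value_counts (same logic as the one-liner above):

```python
import numpy as np
import pandas as pd

arr = np.array([
    [1, 2, 9, 1, 1, 1],
    [2, 3, 3, 1, 0, 1],
    [1, 4, 2, 1, 2, 1],
    [2, 3, 1, 1, 2, 1],
    [1, 2, 3, 1, 8, 1],
    [2, 2, 5, 1, 1, 1],
    [1, 3, 8, 7, 4, 1],
    [2, 4, 7, 8, 3, 3],
])
df = pd.DataFrame(arr)

# Keep a column only if its most frequent value does not outnumber
# all the other values combined, i.e. its count is at most half the rows
mask = df.apply(lambda col: col.value_counts().max() <= len(df) / 2)
result = df.loc[:, mask]
```

Columns 4 and 6 (1-based) are dropped: their modal value 1 occurs 6 and 7 times out of 8. Column 1, where the counts split 4-4, survives the <= comparison.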
