Delete columns with extremely unequally distributed values from pandas dataframe - python

Given the following code:
import numpy as np
import pandas as pd
arr = np.array([
[1,2,9,1,1,1],
[2,3,3,1,0,1],
[1,4,2,1,2,1],
[2,3,1,1,2,1],
[1,2,3,1,8,1],
[2,2,5,1,1,1],
[1,3,8,7,4,1],
[2,4,7,8,3,3]
])
# 1,2,3,4,5,6 <- Number of the columns.
df = pd.DataFrame(arr)
for _ in df.columns.values:
print {x: list(df[_]).count(x) for x in set(df[_])}
I want to delete from the dataframe all the columns in which one value occurs more often than all the other values of the column together. In this case I would like to drop the columns 4 and 6 (see comment) since the number 1 occurs more often than all the other numbers in these columns together (6 > 2 in column 4 and 7 > 1 in column 6). I don't want to drop the first column (4 = 4). How would I do that?

Another option is to do a value counts on each column and if the maximum of the count is smaller or equal to half of the number of rows of the data frame, then select it:
df.loc[:, df.apply(lambda col: max(col.value_counts()) <= df.shape[0]/2)]
# 0 1 2 4
#0 1 2 9 1
#1 2 3 3 0
#2 1 4 2 2
#3 2 3 1 2
#4 1 2 3 8
#5 2 2 5 1
#6 1 3 8 4
#7 2 4 7 3

Related

Allocate lowest value over n rows to n rows in DataFrame

I need to take the lowest value over n rows and add it to these n rows in a new colomn of the dataframe. For example:
n=3
Column 1 Column 2
5 3
3 3
4 3
7 2
8 2
2 2
5 4
4 4
9 4
8 2
2 2
3 2
5 2
Please take note that if the number of rows is not dividable by n, the last values are incorporated in the last group. So in this example n=4 for the end of the dataframe.
Thanking you in advance!
I do not know any straight forward way to do this, but here is a working example (not elegant, but working...).
If you do not worry about the number of rows being dividable by n, you could use .groupby():
import pandas as pd
d = {'col1': [1, 2,1,5,3,2,5,6,4,1,2] }
df = pd.DataFrame(data=d)
n=3
df['new_col']=df.groupby(df.index // n).transform('min')
which yields:
col1 new_col
0 1 1
1 2 1
2 1 1
3 5 2
4 3 2
5 2 2
6 5 4
7 6 4
8 4 4
9 1 1
10 2 1
However, we can see that the last 2 rows are grouped together, instead of them being grouped with the 3 previous values in this case.
A way around would be to look at the .count() of elements in each group generated by grouby, and check the last one:
import pandas as pd
d = {'col1': [1, 2,1,5,3,2,5,6,4,1,2] }
df = pd.DataFrame(data=d)
n=3
# Temporary dataframe
A = df.groupby(df.index // n).transform('min')
# The min value of each group in a second dataframe
min_df = df.groupby(df.index // n).min()
# The size of the last group
last_batch = df.groupby(df.index // n).count()[-1:]
# if the last size is not equal to n
if last_batch.values[0][0] !=n:
last_group = last_batch+n
A[-last_group.values[0][0]:]=min_df[-2:].min()
# Assign the temporary modified dataframe to df
df['new_col'] = A
which yields the expected result:
col1 new_col
0 1 1
1 2 1
2 1 1
3 5 2
4 3 2
5 2 2
6 5 1
7 6 1
8 4 1
9 1 1
10 2 1

I have a dataframe where some index number are missing how do I cut dataframe previous to that missing index number

[enter image description here][1]
Index number 72 is missing from original dataframe which is shown in image. I want to cut dataframe like [0:71,:] with condition like when index sequence breaks then dataframe automatically cuts from previous index value.
Compare shifted values of index subtracted by original values if greater like 1 with invert ordering by [::-1] and Series.cummax, last filter in boolean indexing:
df = pd.DataFrame({'a': range(3,13)}).drop(3)
print (df)
a
0 3
1 4
2 5
4 7
5 8
6 9
7 10
8 11
9 12
df = df[df.index.to_series().shift(-1, fill_value=0).sub(df.index).gt(1)[::-1].cummax()]
print (df)
a
0 3
1 4
2 5
i came to this:
df = pd.DataFrame({'col':[1,2,3,4,5,6,7,8,9]}, index=[-1,0,1,2,3,4,5,7,8])
ind = next((i for i in range(len(df)-1) if df.index[i]+1!=df.index[i+1]),len(df))+1
>>> df.iloc[:ind]
'''
col
-1 1
0 2
1 3
2 4
3 5
4 6
5 7
With numpy, get the values that are equal to a normal range starting from the first index, up to the first mismatch (excluded):
df[np.minimum.accumulate(df.index==np.arange(df.index[0], df.index[0]+len(df)))]
Example:
col
-1 1
0 2
1 3
3 4
4 5
output:
col
-1 1
0 2
1 3

Making a Multiindexed Pandas Dataframe Non-Symmetric

I have a multi-indexed dataframe which looks roughly like this:
import pandas as pd
test = pd.DataFrame({('A', 'a'):[1,2,3,4,5], ('A', 'b'):[5,4,3,2,1], ('B', 'a'):[5,2,3,4,1], ('B','b'):[1,4,3,2,5]})
>>> Output
A B
a b a b
0 1 5 5 1
1 2 4 2 4
2 3 3 3 3
3 4 2 4 2
4 5 1 1 5
In this dataframe, the zero-th row and fifth row are symmetric in the sense that if the entire A and B columns of the zero-th row are flipped, it becomes identical to the fifth one. Similarly, the second row is symmetric with itself.
I am planning to remove these rows from my original dataframe, thus making it 'non-symmetric'. The specific plans are as follow:
If a row with higher index is symmetric with a row with lower index, keep the lower one and remove the higher one. For example, from the above dataframe, keep the zero-th row and remove the fifth row.
If a row is symmetric with itself, remove that row. For example, from the above dataframe, remove the second row.
My attempt was to first zip the four lists into a tuple list, remove the symmetric tuples by a simple if-statement, unzip them, and merge them back into a dataframe. However, this turned out to be inefficient, making it unscalable for large dataframes.
How can I achieve this in an efficient manner? I guess utilizing several built-in pandas methods is necessary, but it seems quite complicated.
Namudon'tdie,
Try this solution:
import pandas as pd
test = pd.DataFrame({('A', 'a'):[1,2,3,4,5], ('A', 'b'):[5,4,3,2,1], ('B', 'a'):[5,2,3,4,1], ('B','b'):[1,4,3,2,5]})
test['idx'] = test.index * 2 # adding auxiliary column 'idx' (all even)
test2 = test.iloc[:, [2,3,0,1,4]] # creating flipped DF
test2.columns = test.columns # fixing column names
test2['idx'] = test2.index * 2 + 1 # for flipped DF column 'idx' is all odd
df = pd.concat([test, test2])
df = df.sort_values (by='idx')
df = df.set_index('idx')
print(df)
A B
a b a b
idx
0 1 5 5 1
1 5 1 1 5
2 2 4 2 4
3 2 4 2 4
4 3 3 3 3
5 3 3 3 3
6 4 2 4 2
7 4 2 4 2
8 5 1 1 5
9 1 5 5 1
df = df.drop_duplicates() # remove rows with duplicates
df = df[df.index%2 == 0] # remove rows with odd idx (flipped)
df = df.reset_index()[['A', 'B']]
print(df)
A B
a b a b
0 1 5 5 1
1 2 4 2 4
2 3 3 3 3
3 4 2 4 2
The idea is to create flipped rows with odd indexes, so that they will be placed under their original rows after reindexing. Then delete duplicates, keeping rows with lower indices. For cleanup simply delete remaining rows with odd indices.
Note that row [3,3,3,3] stayed. There should be a separate filter to take care of self-symmetric rows. Since your definition of self-symmetric is unclear (other rows have certain degree of symmetry too), I leave this part to you. Should be straightforward.

In pandas Dataframe with multiindex how can I filter by order?

Assume the following dataframe
>>> import pandas as pd
>>> L = [(1,'A',9,9), (1,'C',8,8), (1,'D',4,5),(2,'H',7,7),(2,'L',5,5)]
>>> df = pd.DataFrame.from_records(L).set_index([0,1])
>>> df
2 3
0 1
1 A 9 9
C 8 8
D 4 5
2 H 7 7
L 5 5
I want to filter the rows in the nth position of level 1 of the multiindex, i.e. filtering the first
2 3
0 1
1 A 9 9
2 H 7 7
or filtering the third
2 3
0 1
1 D 4 5
How can I achieve this ?
You can filter rows with the help of GroupBy.nth after performing grouping on the first level of the multi-index DF. Since n follows the 0-based indexing approach, you need to provide the values appropriately to it as shown:
1) To select the first row grouped per level=0:
df.groupby(level=0, as_index=False).nth(0)
2) To select the third row grouped per level=0:
df.groupby(level=0, as_index=False).nth(2)

Reduction the Pandas dataframe to other dataframe

I have two data frames and their shapes are (707,140) and (34,98).
I want to minimize the bigger data frame to the small one based on the same index name and column names.
So after the removing additional rows and columns from bigger data frame, in the final its shape should be (34,98) with the same index and columns with the small data frame.
How can I do this in python ?
I think you can select by loc index and columns of small DataFrame:
dfbig.loc[dfsmall.index, dfsmall.columns]
Sample:
dfbig = pd.DataFrame({'a':[1,2,3,4,5], 'b':[4,7,8,9,4], 'c':[5,0,1,2,4]})
print (dfbig)
a b c
0 1 4 5
1 2 7 0
2 3 8 1
3 4 9 2
4 5 4 4
dfsmall = pd.DataFrame({'a':[4,8], 'c':[0,1]})
print (dfsmall)
a c
0 4 0
1 8 1
print (dfbig.loc[dfsmall.index, dfsmall.columns])
a c
0 1 5
1 2 0

Categories