Drop columns according to some criteria - python

For example, I have a dataframe called dat, and I want to apply a function to each column of the dataframe: if the return value is True, keep the column and move on to the next one; if the return value is False, drop the column and move on to the next one.
I know I can write a for loop to do this, but is there an efficient way to do it?

You could do it like this, using boolean indexing on df.columns.
For simplicity, say we want to drop all columns whose sum is greater than 50:
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [2, 4, 6, 8], 'B': [101, 102, 102, 102]})
r = df.apply(np.sum)  # applies the sum function to each column
c = r <= 50           # boolean test for columns
df[c[c].index]        # use boolean indexing to get the passing column names, then filter the dataframe
Output:
A
0 2
1 4
2 6
3 8
Updating an old answer:
df.loc[:, df.sum() <= 50]
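More generally, if the test is an arbitrary per-column function rather than a sum, apply returns a boolean Series that .loc accepts directly as a column mask. A minimal sketch reusing the df above; keep_column is a hypothetical predicate name:

# hypothetical predicate: keep a column if its sum is at most 50
def keep_column(col):
    return col.sum() <= 50

result = df.loc[:, df.apply(keep_column)]  # only column A survives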

Related

How to compare two columns' values in pandas

I have a dataframe which has some unique IDs in two of the columns, e.g.:
S.no.  Column1  Column2
1      00001x   00002x
2      00003j   00005k
3      00002x   00001x
4      00004d   00008e
The values can be any strings.
I want to compare the two columns in such a way that only one of rows 1 and 3 remains, since those IDs contain the same information, just in a different order.
Basically, if one row has value X in Column1 and Y in Column2, and another row has Y in Column1 and X in Column2, then only one of those rows should remain.
Is that possible in Python?
You can convert your columns to a frozenset per row.
This gives every pair a common, order-insensitive representation to which duplicated can be applied.
Finally, slice the rows using the previous output as a mask:
mask = df.filter(like='Column').apply(frozenset, axis=1).duplicated()
df[~mask]
Previous answer, using set:
mask = df.filter(like='Column').apply(lambda x: tuple(set(x)), axis=1).duplicated()
df[~mask]
NB: using set or sorted requires converting to a tuple (lambda x: tuple(sorted(x))), since the duplicated function hashes the values, which is not possible with mutable objects.
Output:
   S.no. Column1 Column2
0      1  00001x  00002x
1      2  00003j  00005k
3      4  00004d  00008e
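For completeness, here is a self-contained sketch of the sorted-tuple variant from the note above, with the dataframe reconstructed from the question's table:

import pandas as pd

df = pd.DataFrame({'S.no.': [1, 2, 3, 4],
                   'Column1': ['00001x', '00003j', '00002x', '00004d'],
                   'Column2': ['00002x', '00005k', '00001x', '00008e']})

# sort each pair so (X, Y) and (Y, X) hash identically, then flag later occurrences
mask = df.filter(like='Column').apply(lambda x: tuple(sorted(x)), axis=1).duplicated()
print(df[~mask])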

Lambda function with groupby and condition issue

I wanted to see how to express counts of unique values of column B for each unique value in column A, where the corresponding value in column C is greater than 0.
df:
A B C
1 10 0
1 12 3
2 3 1
I tried this, but it's missing the where clause to filter for C > 0. How do I add it?
df.groupby(['A'])['B'].apply(lambda b : b.astype(int).nunique())
Let's start by creating the dataframe that the OP mentions in the question:
import pandas as pd
df = pd.DataFrame({'A': [1,1,2], 'B': [10,12,3], 'C': [0,3,1]})
Now, in order to achieve what the OP wants, there are various options. One way is to select the rows of df where column C is greater than 0, then use pandas.DataFrame.groupby to group by column A, and finally use nunique to count the unique values of column B. In one line it would look like the following:
count = df[df['C'] > 0].groupby('A')['B'].nunique()
[Out]:
A
1 1
2 1
If one wants to sum the per-group counts of unique items that satisfy the condition, then from the series count above one can simply do
count = count.sum()
[Out]:
2
Assuming one wants to do everything in one line, one can use pandas.DataFrame.sum as
count = df[df['C'] > 0].groupby('A')['B'].nunique().sum()
[Out]:
2
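As a side note, the same filter can be written with pandas.DataFrame.query, which some find more readable; the logic is identical:

count = df.query('C > 0').groupby('A')['B'].nunique().sum()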

Pandas reindexing task based on a column value

I have a dataframe with millions of rows, unique indexes, and a column ('b') that has several repeated values.
I would like to generate a dataframe without the duplicated data, but I do not want to lose the index information. I want the new dataframe to have an index that is a concatenation of the old indexes ("old_index1,old_index2") where 'b' had duplicated values, and that remains unchanged for rows where 'b' had unique values. The values of the 'b' column should remain unchanged, as in a keep='first' strategy. Example below.
Input dataframe:
import pandas as pd

df = pd.DataFrame(data=[[1, "non_duplicated_1"],
                        [2, "duplicated"],
                        [2, "duplicated"],
                        [3, "non_duplicated_2"],
                        [4, "non_duplicated_3"]],
                  index=['one', 'two', 'three', 'four', 'five'],
                  columns=['a', 'b'])
desired output:
a b
one 1 non_duplicated_1
two,three 2 duplicated
four 3 non_duplicated_2
five 4 non_duplicated_3
The actual dataframe is quite large so I would like to avoid non-vectorized operations.
I am finding this surprisingly difficult...Any ideas?
You can use transform on the index column (after you use reset_index). Then, drop duplicates in column b:
df.index = df.reset_index().groupby('b')['index'].transform(','.join)
df.drop_duplicates('b',inplace=True)
>>> df
a b
index
one 1 non_duplicated_1
two,three 2 duplicated
four 3 non_duplicated_2
five 4 non_duplicated_3
Setup
dct = {'index': ','.join, 'a': 'first'}
You can reset_index before using groupby, although it's unclear to me why you want this:
df.reset_index().groupby('b', as_index=False, sort=False).agg(dct).set_index('index')
b a
index
one non_duplicated_1 1
two,three duplicated 2
four non_duplicated_2 3
five non_duplicated_3 4
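Put together as a runnable sketch, reconstructing the df from the question and using the dct mapping from the Setup above:

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 2, 3, 4],
                   'b': ['non_duplicated_1', 'duplicated', 'duplicated',
                         'non_duplicated_2', 'non_duplicated_3']},
                  index=['one', 'two', 'three', 'four', 'five'])

# join the old index labels per group of 'b', keep the first 'a'
dct = {'index': ','.join, 'a': 'first'}
out = (df.reset_index()
         .groupby('b', as_index=False, sort=False)
         .agg(dct)
         .set_index('index'))
print(out)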

Pandas - add value at specific iloc into new dataframe column

I have a large dataframe containing lots of columns.
For each row/index in the dataframe I do some operations, read in some ancillary data, etc., and get a new value. Is there a way to add that new value into a new column at the correct row/index?
I can use .assign to add a new column, but I'm looping over the rows and generating the value to add one row at a time (generating it is quite involved), so when a value is generated I'd like to add it to the dataframe immediately rather than waiting until I've generated the entire series.
This doesn't work and gives a key error:
df['new_column_name'].iloc[this_row]=value
Do I need to initialise the column first or something?
There are two steps to create & populate a new column using only a row number...
(in this approach iloc is not used)
First, get the row index value by using the row number
rowIndex = df.index[someRowNumber]
Then, use row index with the loc function to reference the specific row and add the new column / value
df.loc[rowIndex, 'New Column Title'] = "some value"
These two steps can be combined into one line as follows:
df.loc[df.index[someRowNumber], 'New Column Title'] = "some value"
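For the looping scenario in the question, the same pattern slots into the row loop. A minimal sketch, where compute_value is a hypothetical stand-in for the involved per-row generation step:

import pandas as pd

df = pd.DataFrame({'A': [10, 20, 30]})

def compute_value(row):
    # hypothetical placeholder for the expensive per-row computation
    return row['A'] * 2

for row_number in range(len(df)):
    value = compute_value(df.iloc[row_number])
    # the column is created on the first assignment; other rows hold NaN until filled
    df.loc[df.index[row_number], 'new_column_name'] = value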
If you have a dataframe like
import pandas as pd
import numpy as np

df = pd.DataFrame(data={'X': [1.5, 6.777, 2.444, np.nan],
                        'Y': [1.111, np.nan, 8.77, np.nan],
                        'Z': [5.0, 2.333, 10, 6.6666]})
Instead of iloc, you can use .loc with a row index and column name, like df.loc[row_indexer, column_indexer] = value:
df.loc[[0,3],'Z'] = 3
Output:
X Y Z
0 1.500 1.111 3.000
1 6.777 NaN 2.333
2 2.444 8.770 10.000
3 NaN NaN 3.000
If you want to add values to certain rows in a new column, depending on values in other cells of the dataframe, you can do it like this:
import pandas as pd
df = pd.DataFrame(data={"A":[1,1,2,2], "B":[1,2,3,4]})
Add a value in a new column based on the values in column "A":
df.loc[df.A == 2, "C"] = 100
This creates the column "C" and adds the value 100 to it wherever column "A" is 2.
Output:
   A  B      C
0  1  1    NaN
1  1  2    NaN
2  2  3  100.0
3  2  4  100.0
(Because column "C" contains NaN, pandas stores it as float, so 100 displays as 100.0.)
It is not necessary to initialise the column first.
You can just use the pandas built-in function DataFrame.at.
Note that .at addresses a single cell at a time, via one row label and one column label (unlike .loc, it does not accept lists):
df.at[4, 'B'] = 10
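A quick self-contained sketch of .at:

import pandas as pd

df = pd.DataFrame({'B': [0, 0, 0, 0, 0]})

df.at[4, 'B'] = 10    # set a single cell by row label 4 and column label 'B'
print(df.at[4, 'B'])  # 10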

How to multiply one column by several other columns in a Python DataFrame

I have a dataframe of 100 columns, and I want to multiply the values of one column ('Count') into the columns at positions 6 to 74. Please tell me how to do that.
I have been trying
df = df.ix[0, 6:74].multiply(df["Count"], axis="index")
df = df[df.columns[6:74]]*df["Count"]
Neither of them works.
The resulting dataframe should still have all 100 original columns, with columns 6 to 74 holding the multiplied values in every row.
You can multiply the columns in place; the key is to align the 'Count' Series with the rows (axis=0), not with the column labels (a plain df[columns] *= df['Count'] aligns the Series index against the column labels and produces NaN):
columns = df.columns[6:75]
df[columns] = df[columns].multiply(df['Count'], axis=0)
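A small self-contained check of this behaviour, with toy column names standing in for the real dataframe's positions 6 to 74:

import pandas as pd

df = pd.DataFrame({'Count': [2, 3],
                   'x': [10, 20],
                   'y': [100, 200]})

cols = ['x', 'y']  # stand-in for df.columns[6:75]
# axis=0 aligns the 'Count' Series with the rows rather than the column labels
df[cols] = df[cols].multiply(df['Count'], axis=0)
print(df)
#    Count   x    y
# 0      2  20  200
# 1      3  60  600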
