How to search and select column names based on values? - python

Suppose I have a pandas DataFrame like this:
  name  col1  col2  col3
0  AAA     1     0     2
1  BBB     2     1     2
2  CCC     0     0     2
I want (a) the names of any columns that contain a value of 2 anywhere in the column (i.e., col1, col3), and (b) the names of any columns that contain only values of 2 (i.e., col3).
I understand how to use DataFrame.any() and DataFrame.all() to select rows in a DataFrame where a value appears in any or all columns, but I'm trying to find COLUMNS where a value appears in (a) any or (b) all rows.

You can do what you described with columns:
df.columns[df.eq(2).any()]
# Index(['col1', 'col3'], dtype='object')
df.columns[df.eq(2).all()]
# Index(['col3'], dtype='object')
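For a self-contained illustration (a sketch reconstructing the example DataFrame above), the whole thing fits in a few lines; .any() and .all() reduce over rows (axis=0) by default, which is exactly what is needed here:
import pandas as pd

df = pd.DataFrame({'name': ['AAA', 'BBB', 'CCC'],
                   'col1': [1, 2, 0],
                   'col2': [0, 1, 0],
                   'col3': [2, 2, 2]})

mask = df.eq(2)                  # element-wise comparison, same shape as df
print(df.columns[mask.any()])    # (a) Index(['col1', 'col3'], dtype='object')
print(df.columns[mask.all()])    # (b) Index(['col3'], dtype='object')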

Alternatively, you can loop over the columns. For (a):
for column in df.columns:
    if (df[column] == 2).any():
        print(column + " contains 2")
This prints the name of every column containing at least one 2.
And for (b):
for column in df.columns:
    if (df[column] == 2).all():
        print(column + " contains only 2")

Related

Splitting a dataframe column into multiple columns with specific names

I am trying to split a dataframe column into multiple columns, as follows:
There are three columns overall. Two should remain unchanged in the new dataframe, while the third is to be split into new columns.
The split is to be done on a specific character (say ":").
The column to be split can contain a varying number of ":" separators, so the new columns can differ between rows, leaving some column values NULL for some rows. That is okay.
Each newly formed column has a specific name, and the maximum number of columns that can be formed is known.
There are four dataframes, and each one has this same column that has to be split.
I came across the following solutions, but they don't work for the reasons mentioned:
Link
pd.concat([df[[0]], df[1].str.split(', ', expand=True)], axis=1)
This creates columns named 0, 1, 2, ..., but I need the new columns to have specific names.
Link
df = df.apply(lambda x:pd.Series(x))
This makes no change to the dataframe; I couldn't understand why.
Link
df['command'], df['value'] = df[0].str.split().str
Here the column names are assigned properly, but it requires knowing beforehand how many columns will be formed, which in my case varies per dataframe. Within one dataframe the split successfully puts NULL values in the extra columns, but running the same code on another dataframe raises an error saying the number of keys should be the same.
I couldn't post comments on these answers as I am new to this community. I would appreciate help in achieving my objective, which is: dynamically use the same code to split one column into many across different dataframes, while renaming the newly generated columns to predefined names.
For example:
Dataframe 1:
  Col1       Col2 Col3
0    A      A:B:C    A
1    A  A:B:C:D:E    A
2    A        A:B    A
Dataframe 2:
  Col1     Col2 Col3
0    A    A:B:C    A
1    A  A:B:C:D    A
2    A      A:B    A
Output should be:
New dataframe 1:
  Col1 ColA ColB ColC ColD ColE Col3
0    A    A    B    C  NaN  NaN    A
1    A    A    B    C    D    E    A
2    A    A    B  NaN  NaN  NaN    A
New dataframe 2:
  Col1 ColA ColB ColC ColD ColE Col3
0    A    A    B    C  NaN  NaN    A
1    A    A    B    C    D  NaN    A
2    A    A    B  NaN  NaN  NaN    A
(If ColE is absent from the second output, that is also fine.)
After this, I will be concatenating these dataframes into one, where I will need counts of ColA through ColE for each dataframe against the Col1 and Col3 combinations, so we need to keep this in mind.
You can do it this way:
columns = df.Col2.max().split(':')
# ['A', 'B', 'C', 'D', 'E']
new = df.Col2.str.split(':', expand=True)
new.columns = columns
new = new.add_prefix('Col')
df.join(new).drop(columns='Col2')  # drop('Col2', 1) no longer works; the positional axis argument was removed in pandas 2.0
#   Col1 Col3 ColA ColB  ColC  ColD  ColE
# 0    A    A    A    B     C  None  None
# 1    A    A    A    B     C     D     E
# 2    A    A    A    B  None  None  None
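One caveat: df.Col2.max() compares strings lexicographically, so it only happens to pick the row with the most separators here. A more robust, reusable sketch (the function name split_col and the predefined names tuple are hypothetical) counts the produced columns directly and maps them onto the predefined names, which also covers the requirement of reusing the code across four dataframes:
import pandas as pd

def split_col(df, col='Col2', sep=':',
              names=('ColA', 'ColB', 'ColC', 'ColD', 'ColE')):
    # expand=True creates one column per part, padding short rows with None
    parts = df[col].str.split(sep, expand=True)
    # take only as many predefined names as columns were actually produced
    parts.columns = list(names)[:parts.shape[1]]
    return df.drop(columns=col).join(parts)

df1 = pd.DataFrame({'Col1': ['A', 'A', 'A'],
                    'Col2': ['A:B:C', 'A:B:C:D:E', 'A:B'],
                    'Col3': ['A', 'A', 'A']})
print(split_col(df1))
#   Col1 Col3 ColA ColB  ColC  ColD  ColE
# 0    A    A    A    B     C  None  None
# 1    A    A    A    B     C     D     E
# 2    A    A    A    B  None  None  None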

Pandas: slice Dataframe according to values of a column

I have to slice my DataFrame according to values (imported from a txt file) that occur in one of my DataFrame's columns. This is what I have:
>df
  col1  col2
0    a     1
1    b     2
2    c     3
3    d     4
>'mytxt.txt'
2
3
This is what I need: drop rows whenever the value in col2 is not among the values in mytxt.txt.
Expected result must be:
>df
  col1  col2
1    b     2
2    c     3
I tried:
values = pd.read_csv('mytxt.txt', header=None)
df = df.col2.isin(values)
But it doesn't work. Help would be much appreciated, thanks!
When you read values, I would read it in as a Series and then convert it to a set, which is more efficient for lookups:
values = pd.read_csv('mytxt.txt', header=None).squeeze('columns')  # squeeze=True worked in pandas < 2.0
values = set(values.tolist())
Then slicing will work:
>>> df[df.col2.isin(values)]
  col1  col2
1    b     2
2    c     3
What was happening is that you were reading values in as a DataFrame rather than a Series, so the .isin method was not behaving as you expected.
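Putting it together as a runnable sketch (StringIO stands in for the mytxt.txt file here):
import pandas as pd
from io import StringIO

df = pd.DataFrame({'col1': ['a', 'b', 'c', 'd'], 'col2': [1, 2, 3, 4]})
txt = StringIO('2\n3\n')  # stands in for open('mytxt.txt')

values = set(pd.read_csv(txt, header=None)[0])  # {2, 3}
print(df[df.col2.isin(values)])
#   col1  col2
# 1    b     2
# 2    c     3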

How to remove a row in a dataframe if the entry of a specific column is numeric

It is not difficult to remove rows from a dataframe where a specific column is non-numeric.
But in my case I have the opposite problem: I must remove the rows whose entry in a specific column is numeric.
Convert the column to numeric with errors='coerce' and test for missing values to remove the numeric entries:
df = pd.DataFrame(data={'tested col': [1, 'b', 2, '3'],
                        'A': [1, 2, 3, 4]})
print(df)
  tested col  A
0          1  1
1          b  2
2          2  3
3          3  4
df1 = df[pd.to_numeric(df['tested col'], errors='coerce').isna()]
print(df1)
  tested col  A
1          b  2
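For the opposite selection (keeping only the numeric entries), the same coerced mask inverted with notna() works; a minimal sketch:
import pandas as pd

df = pd.DataFrame({'tested col': [1, 'b', 2, '3'], 'A': [1, 2, 3, 4]})

# notna() keeps rows where coercion succeeded, i.e. the numeric entries
# (note that the string '3' also coerces successfully)
df2 = df[pd.to_numeric(df['tested col'], errors='coerce').notna()]
print(df2)
#   tested col  A
# 0          1  1
# 2          2  3
# 3          3  4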

Pandas: Loop through DataFrame columns and remove rows with variables that have less than i observations

Suppose I have the following data frame.
X = pd.DataFrame([["A","Z"],["A","Z"],["A","Z"],["B","Y"],["B","Y"]],columns=["COL1","COL2"])
COL1 contains three A's and two B's; COL2 contains three Z's and two Y's.
What I'm trying to do is search each column and find the rows where a value occurs fewer than i times (e.g., in this case, fewer than 3 times).
In this case I have a bunch of duplicate entries, but that's just for simplicity.
Link to my previous question:
Pandas: How do I loop through and remove rows where a column has a single entry
Please let me know if clarification is needed.
You can use the subset and keep=False parameters of duplicated:
X = X[X.duplicated(subset=list(X.columns), keep=False)]
output:
  COL1 COL2
0    A    Z
1    A    Z
2    A    Z
3    B    Y
4    B    Y
Note that keep=False marks every row that occurs more than once, so with this sample data all five rows are kept; it applies a fixed threshold of 2, not an arbitrary i.
Or you can do:
i = 3
X[X.groupby(X.columns.tolist()).COL1.transform('count') >= i]
  COL1 COL2
0    A    Z
1    A    Z
2    A    Z
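If the threshold should apply per column rather than to whole-row combinations, one sketch is to map each value to its frequency within its own column and filter on that (value_counts per column, then an element-wise comparison):
import pandas as pd

X = pd.DataFrame([['A', 'Z'], ['A', 'Z'], ['A', 'Z'], ['B', 'Y'], ['B', 'Y']],
                 columns=['COL1', 'COL2'])
i = 3

# for each column, replace every value with how often it occurs in that column
counts = X.apply(lambda s: s.map(s.value_counts()))
print(X[(counts >= i).all(axis=1)])  # rows whose values all occur >= i times
#   COL1 COL2
# 0    A    Z
# 1    A    Z
# 2    A    Z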

python pandas groupby then count rows satisfying condition

I am trying to do a groupby on the id column so that I can show the number of rows in col1 that are equal to 1.
df:
id  col1  col2  col3
 a     1     1     1
 a     0     1     1
 a     1     1     1
 b     1     0     1
my code:
df.groupby(['id'])['col1'].count()[1]
The output I got was 2. It didn't show me the values for the other ids like b.
I want:
id  col1
 a     2
 b     1
If possible, can the total number of rows per id also be displayed as a new column?
Example:
id  col1  total
 a     2      3
 b     1      1
Assuming you have only 1 and 0 in col1, you can use named aggregation with agg:
df.groupby('id')['col1'].agg(col1='sum', total='count').reset_index()
# the dict form agg({'col1': 'sum', 'total': 'count'}) was removed in pandas 1.0
#   id  col1  total
# 0  a     2      3
# 1  b     1      1
It's because the rows whose id is 'a' add up to 3: two of them are identical, so they were grouped and counted as one, and then the unique row with 0 in col1 was added. You can't group rows with different values.
Yes, you can add the total to your output. Just include a method that counts all rows in the aggregation part of your code.
If you want to generalize the solution to allow values in col1 other than 0 and 1, you can do the following. This also orders the columns correctly.
df.set_index('id')['col1'].eq(1).groupby(level=0).agg([('col1', 'sum'), ('total', 'count')]).reset_index()
  id  col1  total
0  a   2.0      3
1  b   1.0      1
Using a tuple in the agg method, where the first value is the column name and the second the aggregating function, is new to me. I was just experimenting and it seemed to work; I don't remember seeing it in the documentation, so use it with caution.
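Current pandas documents named aggregation with keyword=(column, function) pairs, which is the documented equivalent of that tuple trick. A sketch of the same computation in that style:
import pandas as pd

df = pd.DataFrame({'id': ['a', 'a', 'a', 'b'],
                   'col1': [1, 0, 1, 1]})

# named aggregation (pandas >= 0.25): keyword = (source column, function)
out = (df.assign(is_one=df['col1'].eq(1))
         .groupby('id')
         .agg(col1=('is_one', 'sum'), total=('is_one', 'count'))
         .reset_index())
print(out)
#   id  col1  total
# 0  a     2      3
# 1  b     1      1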
