Pandas dataframe: get column names and value_counts - python

How do I get all the column names where the values in the column are 'f' or 't' into an array?
df['FTI'].value_counts()
Instead of the single column 'FTI', I need an array of the matching columns. Is that possible?

Reproducible example:
import pandas as pd
df = pd.DataFrame({'col1': [1, 2, 3], 'col2': ['f', 'f', 'f'], 'col3': ['t', 't', 't'], 'col4': ['d', 'd', 'd']})
col1 col2 col3 col4
0 1 f t d
1 2 f t d
2 3 f t d
You can check each column with eq and all:
>>> s = (df.eq('t') | df.eq('f')).all()
col1 False
col2 True
col3 True
col4 False
dtype: bool
To get the names:
>>> s[s].index.values
array(['col2', 'col3'], dtype=object)
To get the positions, using numpy (the +1 makes them 1-based):
>>> np.flatnonzero(s) + 1
array([2, 3])
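Since the original goal was value_counts, a minimal follow-up sketch applying it over the matching columns (assumes the df and s defined above):
>>> df[s[s].index].apply(pd.Series.value_counts)
   col2  col3
f   3.0   NaN
t   NaN   3.0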

Yes, it is possible. Here is one way.
You can collect the columns like this (using isin so that every value must be 'f' or 't'; the dtype check avoids calling string methods on numeric columns):
cols = []
for col in df.columns:
    if df[col].dtype == object and df[col].isin(['f', 't']).all():
        cols.append(col)
Then you can concatenate the frequencies:
f = pd.Series(dtype='int64')
for col in cols:
    f = pd.concat([f, df[col].value_counts()])
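On the example df this yields one combined Series of counts (exact formatting may vary by pandas version):
>>> f
f    3
t    3
dtype: int64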

Related

How to get row-wise absolute minimum value in pandas dataframe

I have a pandas dataframe, for example:
   Col1  Col2  Col3
0    -1     2    -3
I want to pick the minimum absolute value (table 2):
   Col1  Col2  Col3  Result
0    -1     2    -3      -1
For now I am using df.abs().idxmin(axis="columns") and I get (table 3):
   Col1  Col2  Col3  Result
0    -1     2    -3    Col1
How can I convert table 3 to table 2?
Use np.argmin (numpy counterpart of DataFrame.idxmin). Since you want to extract the original values, it's more convenient to access those values at the numpy level.
I added an extra row to your MRE for demonstration:
import numpy as np
import pandas as pd

# the OP's row plus one extra row for demonstration
df = pd.DataFrame({'Col1': [-1, 10], 'Col2': [2, -5], 'Col3': [-3, 4]})

cols = np.argmin(df.abs().to_numpy(), axis=1)  # [0, 2]
rows = range(len(cols))                        # [0, 1]
df['Result'] = df.to_numpy()[rows, cols]
#    Col1  Col2  Col3  Result
# 0    -1     2    -3      -1
# 1    10    -5     4       4
You can use df.abs().min(axis=1), though note it returns the absolute value: for the original row (-1, 2, -3) it would give 1, not the signed -1.
>>> import pandas as pd
>>>
>>> data = {'col1': [1], 'col2': [2], 'col3': [-3]}
>>> df = pd.DataFrame(data)
>>> df
col1 col2 col3
0 1 2 -3
>>> df['Result'] = df.abs().min(axis=1)
>>> df
col1 col2 col3 Result
0 1 2 -3 1
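If you need the signed original value, a minimal row-wise sketch (slower than the argmin approach above, but readable):
>>> df = pd.DataFrame({'col1': [-1], 'col2': [2], 'col3': [-3]})
>>> df['Result'] = df.apply(lambda row: row[row.abs().idxmin()], axis=1)
>>> df
   col1  col2  col3  Result
0    -1     2    -3      -1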

Count the records based on value of columns

I have the below df:
ID Col1 Col2 Col3
1 A NB C
2 A NB C
3 NS B NC
4 NS NB NC
5 NS B NC
6 NS B C
And I'm trying to count specific values in each column:
How many "A" are in Col1
How many "B" are in Col2
How many "C" are in Col3
In the original df I have a lot of columns and conditions.
The expected output:
Col1 Col2 Col3
TotalCount"A" TotalCount"B" TotalCount"C"
So I'm trying to get the list of columns and iterate over it, but I am not getting the expected results. I'm working with pandas in a Jupyter notebook.
You can use df.eq here and pass a list of values to compare against.
values = ['A', 'B', 'C']
out = df.loc[:, 'Col1':].eq(values).sum()
Col1 2
Col2 3
Col3 3
dtype: int64
Extending @Ch3ster's answer to match the expected output:
In [1382]: values = ['A', 'B', 'C']
In [1391]: res = df.filter(like='Col', axis=1).eq(values).sum().to_frame().T
In [1392]: res
Out[1392]:
Col1 Col2 Col3
0 2 3 3
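Since you mention having many columns and conditions, here is a hedged sketch using a hypothetical targets mapping from each column to the value to count:

import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3, 4, 5, 6],
                   'Col1': ['A', 'A', 'NS', 'NS', 'NS', 'NS'],
                   'Col2': ['NB', 'NB', 'B', 'NB', 'B', 'B'],
                   'Col3': ['C', 'C', 'NC', 'NC', 'NC', 'C']})

# 'targets' is a hypothetical mapping; extend it with your own columns and values.
targets = {'Col1': 'A', 'Col2': 'B', 'Col3': 'C'}
counts = {col: int(df[col].eq(val).sum()) for col, val in targets.items()}
print(counts)  # {'Col1': 2, 'Col2': 3, 'Col3': 3}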

Apply list over Pandas dataframe

I need to apply a list to a pandas dataframe by column. The operation to be performed is string concatenation. Being more specific:
Inputs I have:
df = pd.DataFrame([['a', 'b', 'c'], ['d', 'e', 'f']], columns=['Col1', 'Col2', 'Col3'])
lt = ['Prod1', 'Prod2', 'Prod3']
which results in:
>>> df
Col1 Col2 Col3
0 a b c
1 d e f
>>> lt
['Prod1', 'Prod2', 'Prod3']
moreover, the length of lt will always be equal to number of columns of df.
What I would like to have is a dataframe of this sort:
res = pd.DataFrame([['Prod1a', 'Prod2b', 'Prod3c'], ['Prod1d', 'Prod2e', 'Prod3f']],
columns=['Col1', 'Col2', 'Col3'])
which gives:
>>> res
Col1 Col2 Col3
0 Prod1a Prod2b Prod3c
1 Prod1d Prod2e Prod3f
Until now, I've been able to solve the problem by looping through rows and columns, but I won't give up the idea that there's a more elegant way to solve it (maybe something like apply).
Does anyone have suggestions? Thanks!
You can perform broadcasted string concatenation:
lt + df
Col1 Col2 Col3
0 Prod1a Prod2b Prod3c
1 Prod1d Prod2e Prod3f
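The list is aligned with the columns and concatenated element-wise; an equivalent, more explicit form as a sketch:

import pandas as pd

df = pd.DataFrame([['a', 'b', 'c'], ['d', 'e', 'f']], columns=['Col1', 'Col2', 'Col3'])
lt = ['Prod1', 'Prod2', 'Prod3']

# A Series indexed by the column labels is broadcast down every row.
prefix = pd.Series(lt, index=df.columns)
res = prefix + df
print(res)
#      Col1    Col2    Col3
# 0  Prod1a  Prod2b  Prod3c
# 1  Prod1d  Prod2e  Prod3f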
You can also use numpy's np.char.add function.
df[:] = np.char.add(lt, df.values.astype(str))
df
Col1 Col2 Col3
0 Prod1a Prod2b Prod3c
1 Prod1d Prod2e Prod3f
Thirdly, there is the list comprehension option.
df[:] = [[i + v for i, v in zip(lt, V)] for V in df.values.tolist()]
df
Col1 Col2 Col3
0 Prod1a Prod2b Prod3c
1 Prod1d Prod2e Prod3f

Pandas Combining two rows into one [duplicate]

Given the following dataframe:
import numpy as np
import pandas as pd
df = pd.DataFrame({'COL1': ['A', np.nan, 'A'],
                   'COL2': [np.nan, 'A', 'A']})
df
COL1 COL2
0 A NaN
1 NaN A
2 A A
I would like to create a column ('COL3') that uses the value from COL1 per row unless that value is null (or NaN). If the value is null (or NaN), I'd like for it to use the value from COL2.
The desired result is:
COL1 COL2 COL3
0 A NaN A
1 NaN A A
2 A A A
Thanks in advance!
In [8]: df
Out[8]:
COL1 COL2
0 A NaN
1 NaN B
2 A B
In [9]: df["COL3"] = df["COL1"].fillna(df["COL2"])
In [10]: df
Out[10]:
COL1 COL2 COL3
0 A NaN A
1 NaN B B
2 A B A
You can use np.where to conditionally set column values.
df = df.assign(COL3=np.where(df.COL1.isnull(), df.COL2, df.COL1))
>>> df
COL1 COL2 COL3
0 A NaN A
1 NaN A A
2 A A A
If you don't mind mutating the values in COL2, you can update them directly to get your desired result.
df = pd.DataFrame({'COL1': ['A', np.nan,'A'],
'COL2' : [np.nan,'B','B']})
>>> df
COL1 COL2
0 A NaN
1 NaN B
2 A B
df.COL2.update(df.COL1)
>>> df
COL1 COL2
0 A A
1 NaN B
2 A A
Using .combine_first, which gives precedence to non-null values in the Series or DataFrame calling it:
import pandas as pd
import numpy as np
df = pd.DataFrame({'COL1': ['A', np.nan,'A'],
'COL2' : [np.nan,'B','B']})
df['COL3'] = df.COL1.combine_first(df.COL2)
Output:
COL1 COL2 COL3
0 A NaN A
1 NaN B B
2 A B A
If we modify your df slightly, you will see that this works, and in fact it will work for any number of columns so long as each row has at least one valid value:
In [5]:
df = pd.DataFrame({'COL1': ['B', np.nan,'B'],
'COL2' : [np.nan,'A','A']})
df
Out[5]:
COL1 COL2
0 B NaN
1 NaN A
2 B A
In [6]:
df.apply(lambda x: x[x.first_valid_index()], axis=1)
Out[6]:
0 B
1 A
2 B
dtype: object
first_valid_index will return the index value (in this case column) that contains the first non-NaN value:
In [7]:
df.apply(lambda x: x.first_valid_index(), axis=1)
Out[7]:
0 COL1
1 COL2
2 COL1
dtype: object
So we can use this to index into the Series.
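Putting it together, a self-contained sketch of the same approach:
import numpy as np
import pandas as pd

df = pd.DataFrame({'COL1': ['B', np.nan, 'B'],
                   'COL2': [np.nan, 'A', 'A']})
# Row-wise: take the value at the first non-NaN column.
df['COL3'] = df.apply(lambda x: x[x.first_valid_index()], axis=1)
print(df)
#   COL1 COL2 COL3
# 0    B  NaN    B
# 1  NaN    A    A
# 2    B    A    B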
You can also use mask which replaces the values where COL1 is NaN by column COL2:
In [8]: df.assign(COL3=df['COL1'].mask(df['COL1'].isna(), df['COL2']))
Out[8]:
COL1 COL2 COL3
0 A NaN A
1 NaN A A
2 A A A

Return groupby columns as new dataframe in Python Pandas

Input: CSV with 5 columns.
Expected Output: Unique combinations of 'col1', 'col2', 'col3'.
Sample Input:
col1 col2 col3 col4 col5
0 A B C 11 30
1 A B C 52 10
2 B C A 15 14
3 B C A 1 91
Sample Expected Output:
col1 col2 col3
A B C
B C A
I'm just expecting this as output. I don't need col4 and col5 in the output, and I also don't need any sum, count, mean, etc. I tried using pandas to achieve this but had no luck.
My code:
input_df = pd.read_csv("input.csv")
output_df = input_df.groupby(['col1', 'col2', 'col3'])
This code returns a 'pandas.core.groupby.DataFrameGroupBy object at 0x0000000009134278'.
But I need a dataframe like the one above. Any help much appreciated.
df[['col1', 'col2', 'col3']].drop_duplicates()
First you can use .drop() to delete col4 and col5 as you said you don't need them.
df = df.drop(['col4', 'col5'], axis=1)
Then, you can use .drop_duplicates() to delete the duplicate rows in col1, col2 and col3.
df = df.drop_duplicates(['col1', 'col2', 'col3'])
df
The output:
col1 col2 col3
0 A B C
2 B C A
Notice that in the output the index is 0, 2 instead of 0, 1. To fix that you can do this (or, equivalently, df = df.reset_index(drop=True)):
df.index = range(len(df))
df
The output:
col1 col2 col3
0 A B C
1 B C A
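If you do want to go through groupby as in your original attempt, the group keys are exactly the unique combinations; a sketch built from the sample input:

import pandas as pd

df = pd.DataFrame({'col1': ['A', 'A', 'B', 'B'],
                   'col2': ['B', 'B', 'C', 'C'],
                   'col3': ['C', 'C', 'A', 'A'],
                   'col4': [11, 52, 15, 1],
                   'col5': [30, 10, 14, 91]})

# Each group key is a unique (col1, col2, col3) tuple.
keys = list(df.groupby(['col1', 'col2', 'col3']).groups)
out = pd.DataFrame(keys, columns=['col1', 'col2', 'col3'])
print(out)
#   col1 col2 col3
# 0    A    B    C
# 1    B    C    A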
