print the unique values in every column in a pandas dataframe - python

I have a dataframe (df) and want to print the unique values from each column in the dataframe.
I need to substitute the variable (column_name) into the print statement.
column_list = df.columns.values.tolist()
for column_name in column_list:
    print(df."[column_name]".unique()
Update
When I use this, I get "Unexpected EOF Parsing" with no extra details.
column_list = sorted_data.columns.values.tolist()
for column_name in column_list:
    print(sorted_data[column_name].unique()
What is the difference between your syntax, YS-L (above), and the one below:
for column_name in sorted_data:
    print(column_name)
    s = sorted_data[column_name].unique()
    for i in s:
        print(str(i))

It can be written more concisely like this:
for col in df:
    print(df[col].unique())
Generally, you can access a column of the DataFrame through indexing using the [] operator (e.g. df['col']), or through attribute (e.g. df.col).
Attribute access makes the code a bit more concise when the target column name is known beforehand, but it has several caveats -- for example, it does not work when the column name is not a valid Python identifier (e.g. df.123) or clashes with a built-in DataFrame attribute (e.g. df.index). The [] notation, on the other hand, always works.
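To illustrate, here is a minimal sketch (the frame and its column names are made up; the second column is deliberately named 'index' to show the clash):
import pandas as pd

df = pd.DataFrame({'col': [1, 2, 2], 'index': [3, 4, 4]})

print(df['col'].unique())      # [] notation: always works -> [1 2]
print(df.col.unique())         # attribute notation: fine, 'col' is a valid identifier
print(df['index'].unique())    # reaches the 'index' column -> [3 4]
# df.index would NOT reach that column -- it returns the DataFrame's built-in
# row index (RangeIndex), which is exactly the clash described above.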

The most upvoted answer is a loop solution, so here is a one-line solution using the pandas apply() method and a lambda function.
print(df.apply(lambda col: col.unique()))

This will get the unique values in proper format:
pd.Series({col:df[col].unique() for col in df})

If you're trying to create multiple separate dataframes as mentioned in your comments, create a dictionary of dataframes:
df_dict = dict(zip(df.columns, [pd.DataFrame(df[i].unique(), columns=[i]) for i in df.columns]))
Then you can access any dataframe easily using the name of the column:
df_dict['column_name']

We can make this even more concise:
df.describe(include='all').loc['unique', :]
Pandas describe gives a few key statistics about each column, but we can just grab the 'unique' statistic and leave it at that.
Note that this will report the unique count as NaN for numeric columns - if you want to include those columns as well, you can do something like this:
df.astype('object').describe(include='all').loc['unique', :]
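For illustration, a small sketch with made-up columns similar to the sample data shown further down:
import pandas as pd

df = pd.DataFrame({'Fuel_Type': ['Diesel', 'Petrol', 'CNG'], 'HP': [90, 192, 69]})

print(df.describe(include='all').loc['unique', :])
# Fuel_Type      3
# HP           NaN   <- numeric column has no 'unique' stat until cast to object
print(df.astype('object').describe(include='all').loc['unique', :])
# Fuel_Type    3
# HP           3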

I was looking for a solution to this problem as well, and the code below proved more helpful in my situation:
for col in df:
    print(col)
    print(df[col].unique())
    print('\n')
It gives output like this:
Fuel_Type
['Diesel' 'Petrol' 'CNG']
HP
[ 90 192 69 110 97 71 116 98 86 72 107 73]
Met_Color
[1 0]

The code below provides a list of unique values for each field; I find it very useful when you want to take a deeper look at the data frame:
for col in list(df):
    print(col)
    print(df[col].unique())
You can also sort the unique values if you want them to be sorted:
import numpy as np

for col in list(df):
    print(col)
    print(np.sort(df[col].unique()))

# Collect the unique values of the first seven columns into a transposed dataframe
cu = []
i = []
for cn in card.columns[:7]:
    cu.append(card[cn].unique())
    i.append(cn)
pd.DataFrame(cu, index=i).T

Simply do this:
for i in df.columns:
    print(df[i].unique())

Or, to print the unique values of a single column:
for val in df['column_name'].unique():
    print(val)

Even better, here's code to view all the unique values as a dataframe, transposed column-wise:
columns = [*df.columns]
unique_values = {}
for i in columns:
    unique_values[i] = df[i].unique()
unique = pd.DataFrame(dict([(k, pd.Series(v)) for k, v in unique_values.items()]))
unique.fillna('').T

This solution constructs a dataframe of unique values with some stats and gracefully handles any unhashable column types.
Resulting dataframe columns are: col, unique_len, df_len, perc_unique, unique_values
df_len = len(df)
unique_cols_list = []
for col in df:
    try:
        unique_values = df[col].unique()
        unique_len = len(unique_values)
    except TypeError:  # not all cols are hashable
        unique_values = ""
        unique_len = -1
    perc_unique = unique_len * 100 / df_len
    unique_cols_list.append((col, unique_len, df_len, perc_unique, unique_values))
df_unique_cols = pd.DataFrame(unique_cols_list, columns=["col", "unique_len", "df_len", "perc_unique", "unique_values"])
df_unique_cols = df_unique_cols[df_unique_cols["unique_len"] > 0].sort_values("unique_len", ascending=False)
print(df_unique_cols)

The best way to do that:
Series.unique()
For example, students.age.unique() will output the different values that occur in the age column of the students dataframe.
To get only the number of distinct values:
Series.nunique()
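A minimal sketch (the students frame is made up here purely for illustration):
import pandas as pd

students = pd.DataFrame({'age': [18, 19, 18, 21]})
print(students.age.unique())     # [18 19 21]
print(students.age.nunique())    # 3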

Related

How to modify loop so as to take NaN values from values in columns in DataFrame in Pandas Python?

I have a sample of my code in Python like below:
...
for col in df.columns.tolist():
    if val in df[f"{col}"].values:
        if val.isna():
            my_list.append(col)
So, if some column from my DataFrame contains a NaN value, add the name of this column to my_list.
I know that there are columns with NaN values in my DF, but my code generates an empty my_list. The error is probably in the line if val.isna(): - how can I modify that? How can I "tell" Python to pick up NaN values from the columns?
Just use an if col statement like this:
for col in df.columns.tolist():
    if val in df[f"{col}"].values:
        if col == False:
            my_list.append(col)
I am not giving you the best way of doing it, just fixing your little list loop
By iterating over the values in each column, adding the column name to my_list and then breaking, you get this:
my_list = ['col1','col3']
My code:
import pandas as pd
from numpy import NaN

df = pd.DataFrame(data={
    "col1": [10, 2.5, NaN],
    "col2": [10, 2.5, 3.5],
    "col3": [5, NaN, 1]})
my_list = []
for col in df.columns:
    for val in df[col].values:
        if pd.isna(val):
            my_list.append(col)
            break
print(f"{my_list=}")
You can fix your code with the changes that @Orange mentioned; I'm just adding this as an alternative. When working with data, you want to let the database/data-analysis software do the heavy lifting. Looping over a cursor is something you should avoid as much as you can.
The code you have can be changed to:
for col in df.columns:
    if df[col].hasnans:
        my_list.append(col)
The code below functionally does the same thing:
df.columns[[df[col].hasnans for col in df.columns]].to_list()
The code below performs the same has-NaN check using isna() and sum():
df.columns[df.isna().sum() > 0].to_list()
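As a quick sanity check (a sketch reusing the sample frame from the answer above), all three variants return the same column names:
import numpy as np
import pandas as pd

df = pd.DataFrame({"col1": [10, 2.5, np.nan],
                   "col2": [10, 2.5, 3.5],
                   "col3": [5, np.nan, 1]})

print([col for col in df.columns if df[col].hasnans])                 # ['col1', 'col3']
print(df.columns[[df[col].hasnans for col in df.columns]].to_list())  # ['col1', 'col3']
print(df.columns[df.isna().sum() > 0].to_list())                      # ['col1', 'col3']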

How to find exact match of a column based on another column values on python?

I have the following sample dataframe with columns A and B:
df:
  A    B
123  555
456  123
789  666
I want to know which method can be used to print out 123 (i.e. a method to print out the values of A which also exist in column B). I tried the following:
for i, row in df.iterrows():
    if row.A in row.B:
        print(row.A, row.B)
but got the error: argument of type 'float' is not iterable.
If you are trying to print any row where row.A exists in column B, then your code should be:
for i, row in df.iterrows():
    if row.A in df.B:
        print(row.A, row.B)
col_B = df['B'].unique()
val_B_in_A = [ i for i in df['A'].unique() if i in col_B ]
print(val_B_in_A)
Be careful with "dot" notation in dataframes, since columns can contain spaces and it starts to be a pain dealing with those. With that said,
Depending on how many rows you are iterating over, and the proportion of rows that contain unique values, it may be computationally less expensive to iterate over the unique values in 'A', and check if each one is in 'B':
import pandas as pd

tmp = []
for value in df['A'].unique():
    tmp.append(df.loc[df['B'] == value])
df_results = pd.concat(tmp)
print(df_results)
You could also use the built-in method .isin(), in fact, much of the power of pandas is in its array-wise operators, which are significantly quicker than most approaches involving loops:
df.loc[df['B'].isin(df['A'].unique())]
And to only show one column with the ".loc" accessor, just add
df.loc[df['B'].isin(df['A'].unique()), 'A']
And to just return the values in an optimized array
df.loc[df['B'].isin(df['A'].unique()), 'A'].values
If you are concerned with an exact match, try:
# for each value in col1, count how many rows of col1 equal it exactly
df['match'] = pd.Series([(df['col1'] == item).sum() for item in df['col1']])

How to query a Pandas Dataframe based on column values

I have a dataframe:
ID  Name
1   A
2   B
3   C
I defined a list:
mylist =[A,C]
If I want to extract only the rows where Name is equal to A or C (namely, mylist), I am trying to use the following code:
df_new = df[(df['Name'].isin(mylist))]
>>> df_new
As result, I get an empty table.
Any suggestion regarding why I get this error?
Just remove the additional opening parenthesis before df['Name']:
df_new = df[df['Name'].isin(mylist)]
Found the solution. It was a problem with the list that caused the empty table.
The format of the list should be:
mylist =['A','C']
instead of
mylist =[A,C]
You could use .loc and a lambda, as it's more readable:
import pandas as pd
dataf = pd.DataFrame({'ID':[1,2,3],'Name':['A','B','C']})
names = ['A','C']
# select rows where column Name is in names
df = dataf.loc[lambda d: d['Name'].isin(names)]
print(df)

Get only first row per subject in dataframe

I was wondering if there is an easy way to get only the first row of each grouped object (subject id for example) in a dataframe. Doing this:
for index, row in df.iterrows():
    # do stuff
gives us each one of the rows, but I am interested in doing something like this:
groups = df.groupby('Subject id')
for index, row in groups.iterrows():
    # give me the first row of each group
    continue
Is there a pythonic way to do the above?
Direct solution - without .groupby() - by .drop_duplicates()
What you want is to keep only the rows with the first occurrence of each value in a specific column:
df.drop_duplicates(subset='Subject id', keep='first')
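For example, on a small made-up frame:
import pandas as pd

df = pd.DataFrame({'Subject id': [1, 1, 2, 2], 'val': [20, 32, 12, 34]})
print(df.drop_duplicates(subset='Subject id', keep='first'))
#    Subject id  val
# 0           1   20
# 2           2   12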
General solution
Using the .apply(func) in Pandas:
df.groupby('Subject id').apply(lambda df: df.iloc[0, :])
It applies a function (usually generated on the fly with lambda) to each group's sub-dataframe produced by df.groupby() and combines the results into a single final data frame.
However, the solution by @AkshayNevrekar using .first() is really nice. And like he did there, you could also attach a .reset_index() at the end here.
Let's say this is the more general solution - where you could also take any nth row ... - however, this works only if all sub-dataframes have at least n rows.
Otherwise, use:
n = 3
col = 'Subject id'
res_df = pd.DataFrame()
for name, df in df.groupby(col):
    if n < df.shape[0]:
        res_df = res_df.append(df.reset_index().iloc[n, :])
Or as a function:
def group_by_select_nth_row(df, col, n):
    res_df = pd.DataFrame()
    for name, df in df.groupby(col):
        if n < df.shape[0]:
            res_df = res_df.append(df.reset_index().iloc[n, :])
    return res_df
Quite confusing is that df.append(), in contrast to list.append(), returns a new dataframe with the row appended and leaves the original df unchanged.
Therefore you should always reassign the result if you want 'in place' appending like you are used to from list.append().
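A minimal sketch of that reassignment (new_row here is a made-up stand-in for the row Series built in the loop above; note that DataFrame.append was deprecated and removed in pandas 2.0, so pd.concat is shown as the equivalent that runs on current versions):
import pandas as pd

res_df = pd.DataFrame()
new_row = pd.Series({'Subject id': 1, 'val': 20}, name=0)

# calling res_df.append(new_row) alone would leave res_df empty -- the result
# must be reassigned; with pd.concat the same reassignment pattern applies:
res_df = pd.concat([res_df, new_row.to_frame().T])
print(res_df)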
Use first() to get first row of each group.
df = pd.DataFrame({'subject_id': [1,1,2,2,2,3,4,4], 'val':[20,32,12,34,45,43,23,10]})
# print(df.groupby('subject_id').first().reset_index())
print(df.groupby('subject_id', as_index=False).first())
Output:
   subject_id  val
0           1   20
1           2   12
2           3   43
3           4   23

Take all unique values from certain columns in pandas dataframe

I have a simple question about style and how to do something correctly.
I want to take all the unique values of certain columns in a pandas dataframe and create a map ['columnName'] -> [valueA,valueB,...]. Here is my code that does that:
listUnVals = {}
for col in df:
    if (col != 'colA') and (col != 'colB'):
        listUnVals[col] = df[col].unique()
I want to exclude some columns, like colA and colB. Is there a better way to filter out the columns I don't want, other than writing an if (( != ) and ( != ...)? I hoped to create a lambda expression that filters these values, but I can't create it correctly.
Any answer would be appreciated.
A couple of ways to remove unneeded columns:
df.columns[~df.columns.isin(['colA', 'colB'])]
Or,
df.columns.difference(['colA', 'colB'])
And you can skip the loop entirely with:
{c: df[c].unique() for c in df.columns[~df.columns.isin(['colA', 'colB'])]}
You can create a list of unwanted columns and then check membership with in:
>>> unwanted = ['columnA', 'columnB']
>>> for col in df:
...     if col not in unwanted:
...         listUnVals[col] = df[col].unique()
Or using dict comprehension:
{col : df[col].unique() for col in df if col not in unwanted}
