I have a pandas dataframe and I want to filter/select columns based on the elements of an input list. So, for example, I have something like:
import pandas as pd

filters = ['category', 'name']
# I am just trying to select the columns which would match as follows:
data = {'category_name_courses': ["Spark", "PySpark", "Python", "pandas"],
        'category_name_area': ["cloud", "cloud", "prog", "ds"],
        'some_other_column': [0, 0, 0, 0]}
x = pd.DataFrame(data)

selections = list()
for col in x.columns:
    if ('name' in col) and ('category' in col):
        selections.append(col)
In my case, this if condition (or some other way of selecting) should be built by 'AND'-ing together everything from this input list.
IIUC, is this what you want?
reg_str = ''.join(f'(?=.*{f})' for f in filters)  # lookaheads: the column name must contain every filter
x.filter(regex=reg_str)
Output:
category_name_courses category_name_area
0 Spark cloud
1 PySpark cloud
2 Python prog
3 pandas ds
Your edit shows that you want to filter columns based on their name.
Simply use:
filters = ['category', 'name']
for col in x.columns:
    if all(f in col for f in filters):
        print(col)
Output:
category_name_courses
category_name_area
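If you then want the matching columns as a DataFrame rather than just their names, a minimal sketch (reusing the sample x and filters above) is:
selected_cols = [col for col in x.columns if all(f in col for f in filters)]
subset = x[selected_cols]  # DataFrame containing only the matching columns
print(subset)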
older answer: filtering values
You can do almost what you suggested:
x = pd.DataFrame([['flow', 'x', 'category'], ['x', 'x', 'flow']])
for col in x.columns:
    if ('flow' in x[col].values) and ('category' in x[col].values):
        # Do something with this column...
        print(f'column "{col}" matches')
Using a list of matches:
filters = ['category', 'flow']
for col in x.columns:
    if all(f in x[col].values for f in filters):  # use f, not x, to avoid shadowing the DataFrame
        # Do something with this column...
        print(f'column "{col}" matches')
Or, more efficiently, using a set:
filters = {'category', 'flow'}
for col in x.columns:
    if set(x[col]) >= filters:
        # Do something with this column...
        print(f'column "{col}" matches')
Example:
column "2" matches
Try this (if I understand your problem correctly):
You have a list of conditions, and you want to make sure that the column name includes all of them.
filters = ['category', 'name']
# x is some pandas dataframe
# I am trying to create something which would be equivalent of:
for col in x.columns:
    boo = True
    for condition in filters:
        if condition not in col:
            boo = False
            break
    if boo:
        pass  # do something with this column
I did not really understand the question. I assume you want to find all column names which contain any value from the given list.
The pandas DataFrame has loads of methods and functionality. You may be familiar with the dt accessor, which lets you apply datetime operations to a given column.
In the same way, one can use the str accessor for string operations.
So you can check whether a string occurs in any value (row) of a column (this works on a Series, i.e. a single column, or on an Index, not on the whole DataFrame):
a_series.str.contains("string")
It returns a boolean Series indicating the rows where the condition is met.
One can use | (or) to check for multiple strings.
a_series.str.contains("string|something")
So the line should look like:
x.columns.str.contains("|".join(filters))
Now that you have the mask, you can apply it:
x.columns[x.columns.str.contains("|".join(filters))]
Since you have the column names, you can access the column data itself:
x[x.columns[x.columns.str.contains("|".join(filters))]]
Output:
category_name_courses category_name_area
0 Spark cloud
1 PySpark cloud
2 Python prog
3 pandas ds
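Note that joining with | gives OR semantics (a column matches if it contains any of the filters); it happens to produce the desired output here because every matching column contains both words. For a strict AND you can combine regex lookaheads instead, e.g.:
reg_str = ''.join(f'(?=.*{f})' for f in filters)  # every filter must appear somewhere in the name
x.loc[:, x.columns.str.contains(reg_str, regex=True)]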
Hi, I am trying to make some contingency tables. I want it in a function so I can use it for various columns/dataframes/combinations, etc.
Currently I have a dataframe that looks like this:
df = pd.DataFrame(data={'group' : ['A','A','B','B','C','D'],
'class': ['g1','g2','g2','g3','g1','g2'],
'total' : ['0-10','20-30','0-10','30-40','50-60','20-30'],
'sub' : ['1-4', '5-9','10-14', '15-19','1-4','15-19'],
'n': [3,14,12,11,21,9]})
and a function that looks like this:
def cts(tabs, df):
    out = []
    for col in df.loc[:, df.columns != tabs]:
        a = pd.crosstab([df[tabs]], df[col])
        out.append(a)
    return out

cts('group', df)
which works for cross-tabulating one column against the rest. But I want to add two (or more!) levels to the grouping, e.g.
pd.crosstab([df['group'], df['class']], df['total'])
where total is cross tabulated against both group and class.
I think the 'tabs' var in the function should be a list of column names, but when I try to make it a list I get errors about invalid syntax. I hope this makes sense. Thank you!
Try:
def cts(tabs, df):
    out = []
    cols = [col for col in df.columns if col not in tabs]
    for col in df.loc[:, cols]:
        a = pd.crosstab([df[tab] for tab in tabs], df[col])
        out.append(a)
    return out
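For example, with the sample df above you can now pass a list of grouping columns (a sketch; the exact tables depend on your data):
tables = cts(['group', 'class'], df)  # cross-tabulates 'total', 'sub' and 'n' against group + class
for t in tables:
    print(t)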
I want to find pandas columns using a list of strings, but I want to find a column even if it only contains part of the string. For example, if the column name is 'TVD' and I have 'tv' in my list, I want it to be found. The reason is that I want to pull these columns out and move them to the front of the dataframe. This is my current code, but I'm only able to find exact column names. Say the column name is 'TVD (feet)': then I'll have a problem.
df = sts.read_df(dataset)
depth_names_lower = ['tvd', 'tvdss', 'md']
depth_names_upper = [depth.upper() for depth in depth_names_lower]
depth_names = depth_names_lower + depth_names_upper
tvd_cols = [col for col in df.columns if depth_names in col]
cols = list(df.columns)
for depth in tvd_cols:
    cols.pop(cols.index(depth))
df = df[tvd_cols+cols]
You can use a regex to find the target columns, with flags=re.IGNORECASE to ignore case.
import re

pattern = '|'.join(depth_names_lower)
cond = df.columns.str.contains(pattern, regex=True, flags=re.IGNORECASE)
cols = df.columns[cond]
df[cols]
You are attempting to check whether depth_names, which is a list, is contained in col, which is a string. This will always return False. You want to check each string in depth_names individually to see whether it is a substring of col. One way to do that is with another list comprehension:
tvd_cols = [col for col in df.columns if any([d in col for d in depth_names])]
The inner comprehension returns a list of booleans; any() reduces it to a single boolean, True if and only if at least one element of the list is True.
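Putting it together with the reordering from the question (a sketch, assuming df and depth_names are defined as above):
tvd_cols = [col for col in df.columns if any(d in col for d in depth_names)]
other_cols = [col for col in df.columns if col not in tvd_cols]
df = df[tvd_cols + other_cols]  # depth columns first, everything else after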
I have a simple question for style and how to do something correctly.
I want to take all the unique values of certain columns in a pandas dataframe and create a map ['columnName'] -> [valueA,valueB,...]. Here is my code that does that:
listUnVals = {}
for col in df:
    if ((col != 'colA') and (col != 'colB')):
        listUnVals[col] = (df[col].unique())
I want to exclude some columns like colA and colB. Is there a better way to filter out the columns I don't want, other than writing an if (( != ) and ( != ...)? I hoped to create a lambda expression that filters these values, but I can't get it right.
Any answer would be appreciated.
A couple of ways to remove the unneeded columns:
df.columns[~df.columns.isin(['colA', 'colB'])]
Or,
df.columns.difference(['colA', 'colB'])
And you can avoid the loop entirely with:
{c: df[c].unique() for c in df.columns[~df.columns.isin(['colA', 'colB'])]}
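For example, with a small hypothetical frame containing colA, colB and colC, this yields a dict keyed only by colC:
demo = pd.DataFrame({'colA': [1, 2], 'colB': ['x', 'y'], 'colC': [3, 3]})
print({c: demo[c].unique() for c in demo.columns[~demo.columns.isin(['colA', 'colB'])]})
# {'colC': array([3])}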
You can create a list of unwanted columns and then check membership with not in:
>>> unwanted = ['columnA' , 'columnB']
>>> for col in df:
...     if col not in unwanted:
...         listUnVals[col] = (df[col].unique())
Or using dict comprehension:
{col : df[col].unique() for col in df if col not in unwanted}
I have a dataframe (df) and want to print the unique values from each column in the dataframe.
I need to substitute the variable (i) [column name] into the print statement
column_list = df.columns.values.tolist()
for column_name in column_list:
    print(df."[column_name]".unique()
Update
When I use this: I get "Unexpected EOF Parsing" with no extra details.
column_list = sorted_data.columns.values.tolist()
for column_name in column_list:
    print(sorted_data[column_name].unique()
What is the difference between your syntax YS-L (above) and the below:
for column_name in sorted_data:
    print(column_name)
    s = sorted_data[column_name].unique()
    for i in s:
        print(str(i))
It can be written more concisely like this:
for col in df:
    print(df[col].unique())
Generally, you can access a column of the DataFrame through indexing using the [] operator (e.g. df['col']), or through attribute (e.g. df.col).
Attribute accessing makes the code a bit more concise when the target column name is known beforehand, but has several caveats -- for example, it does not work when the column name is not a valid Python identifier (e.g. df.123), or clashes with the built-in DataFrame attribute (e.g. df.index). On the other hand, the [] notation should always work.
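A small sketch of that caveat, using hypothetical column names:
demo = pd.DataFrame({'index': [1, 2], 'col': [3, 4]})
print(demo['index'].unique())  # the column named 'index' -> [1 2]
print(demo.index)              # the row index of the frame, not the column
print(demo.col.unique())       # attribute access works because 'col' is a valid identifier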
The most upvoted answer is a loop solution, so here is a one-line solution using the pandas apply() method and a lambda function.
print(df.apply(lambda col: col.unique()))
This will get the unique values in proper format:
pd.Series({col:df[col].unique() for col in df})
If you're trying to create multiple separate dataframes as mentioned in your comments, create a dictionary of dataframes:
df_dict = dict(zip([i for i in df.columns] , [pd.DataFrame(df[i].unique(), columns=[i]) for i in df.columns]))
Then you can access any dataframe easily using the name of the column:
df_dict['column_name']
We can make this even more concise:
df.describe(include='all').loc['unique', :]
Pandas describe gives a few key statistics about each column, but we can just grab the 'unique' statistic and leave it at that.
Note that this will give a unique count of NaN for numeric columns - if you want to include those columns as well, you can do something like this:
df.astype('object').describe(include='all').loc['unique', :]
I was looking for a solution to this problem as well, and the code below proved more helpful in my situation:
for col in df:
    print(col)
    print(df[col].unique())
    print('\n')
It gives something like below:
Fuel_Type
['Diesel' 'Petrol' 'CNG']
HP
[ 90 192 69 110 97 71 116 98 86 72 107 73]
Met_Color
[1 0]
The code below provides a list of unique values for each field; I find it very useful when you want to take a deeper look at the data frame:
for col in list(df):
    print(col)
    print(df[col].unique())
You can also sort the unique values if you want them to be sorted:
import numpy as np

for col in list(df):
    print(col)
    print(np.sort(df[col].unique()))
# Collect the unique values of the first seven columns of `card`
# into a transposed DataFrame (one column of unique values per field).
cu = []  # unique values per column
i = []   # corresponding column names
for cn in card.columns[:7]:
    cu.append(card[cn].unique())
    i.append(cn)
pd.DataFrame(cu, index=i).T
Simply do this:
for i in df.columns:
    print(df[i].unique())
Or, for a single column, it can be written as:
for val in df['column_name'].unique():
    print(val)
Even better, here's code to view all the unique values as a DataFrame, transposed column-wise:
columns = [*df.columns]
unique_values = {}
for i in columns:
    unique_values[i] = df[i].unique()
unique = pd.DataFrame(dict([(k, pd.Series(v)) for k, v in unique_values.items()]))
unique.fillna('').T
This solution constructs a dataframe of unique values with some stats and gracefully handles any unhashable column types.
Resulting dataframe columns are: col, unique_len, df_len, perc_unique, unique_values
df_len = len(df)
unique_cols_list = []
for col in df:
    try:
        unique_values = df[col].unique()
        unique_len = len(unique_values)
    except TypeError:  # not all cols are hashable
        unique_values = ""
        unique_len = -1
    perc_unique = unique_len * 100 / df_len
    unique_cols_list.append((col, unique_len, df_len, perc_unique, unique_values))
df_unique_cols = pd.DataFrame(unique_cols_list, columns=["col", "unique_len", "df_len", "perc_unique", "unique_values"])
df_unique_cols = df_unique_cols[df_unique_cols["unique_len"] > 0].sort_values("unique_len", ascending=False)
print(df_unique_cols)
The best way to do that:
Series.unique()
For example, students.age.unique() outputs the different values that occur in the age column of the students data frame.
To get only the number of how many different values:
Series.nunique()
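A minimal sketch with a hypothetical students frame:
import pandas as pd
students = pd.DataFrame({'age': [18, 19, 18, 20]})
print(students.age.unique())   # [18 19 20]
print(students.age.nunique())  # 3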
I know we can access columns using table.cols.somecolumn, but I need to apply the same operation on 10-15 columns of my table. So I'd like an iterative solution. I have the names of the columns as strings in a list : ['col1','col2','col3'].
So I'm looking for something along the lines of:
for col in columnlist:
    thiscol = table.cols[col]
    # apply whatever operation
Try this:
columnlist = ['col1', 'col2', 'col3']
for col in columnlist:
    thiscol = getattr(table.cols, col)
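For example, a small sketch that collects each column object by name (assuming table is your existing table object from the question):
columns_by_name = {col: getattr(table.cols, col) for col in columnlist}
for name, thiscol in columns_by_name.items():
    print(name, thiscol)  # apply whatever operation here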