I want to find pandas columns using a list of strings, and I want a column to be found even if its name only partially matches a string in the list. For example, if the column name is 'TVD' and I have 'tv' in my list, I want it to be found. The reason is that I want to pull these columns out and move them to the front of the DataFrame. This is my current code, but I'm only able to find exact column names. Say the column name is 'TVD (feet)'; then I'll have a problem.
df = sts.read_df(dataset)
depth_names_lower = ['tvd', 'tvdss', 'md']
depth_names_upper = [depth.upper() for depth in depth_names_lower]
depth_names = depth_names_lower + depth_names_upper
tvd_cols = [col for col in df.columns if depth_names in col]
cols = list(df.columns)
for depth in tvd_cols:
    cols.pop(cols.index(depth))
df = df[tvd_cols + cols]
You can use a regular expression to find the target columns, with flags=re.IGNORECASE to make the match case-insensitive:
import re

pattern = '|'.join(depth_names_lower)
cond = df.columns.str.contains(pattern, regex=True, flags=re.IGNORECASE)
cols = df.columns[cond]
df[cols]
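If you also want to move the matched columns to the front, as the question asks, you can reuse the same mask; a minimal sketch:
rest = df.columns[~cond]
df = df[list(cols) + list(rest)]  # matched depth columns first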
You are attempting to check whether depth_names, which is a list, is contained in col, which is a string. This actually raises a TypeError, since the in operator on a string requires a string on the left. You want to check each string in depth_names individually to see whether it is a substring of col. One way to do that is with another list comprehension:
tvd_cols = [col for col in df.columns if any([d in col for d in depth_names])]
The inner comprehension returns a list of booleans; any() collapses it to a single boolean, True if and only if at least one element of the list is True.
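Putting it together with the reordering from the question, a minimal end-to-end sketch (the sample DataFrame here is made up, since sts.read_df isn't shown; lowercasing both sides avoids building separate upper/lower lists):
import pandas as pd

# hypothetical stand-in for sts.read_df(dataset)
df = pd.DataFrame({'Well': [1, 2], 'TVD (feet)': [100.0, 200.0], 'MD': [110.0, 210.0]})

depth_names = ['tvd', 'tvdss', 'md']
tvd_cols = [col for col in df.columns if any(d in col.lower() for d in depth_names)]
rest = [col for col in df.columns if col not in tvd_cols]
df = df[tvd_cols + rest]  # column order is now ['TVD (feet)', 'MD', 'Well']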
I have a pandas dataframe and I want to filter/select conditions based on elements of an input list. So, for example, I have something like:
import pandas as pd

filters = ['category', 'name']
# I am just trying to select the columns which would match as follows:
data = {'category_name_courses': ["Spark", "PySpark", "Python", "pandas"],
        'category_name_area': ["cloud", "cloud", "prog", "ds"],
        'some_other_column': [0, 0, 0, 0]}
x = pd.DataFrame(data)
selections = list()
for col in x.columns:
    if ('name' in col) and ('category' in col):
        selections.append(col)
In my case, this if condition, or some other way of selecting, should be built by 'AND'-ing together everything in the input list.
IIUC, do you want this?
reg_str = ''.join(f'(?=.*{f})' for f in filters)  # '&' has no "AND" meaning in regex; use lookaheads
x.filter(regex=reg_str)
Output:
category_name_courses category_name_area
0 Spark cloud
1 PySpark cloud
2 Python prog
3 pandas ds
Your edit shows that you want to filter columns based on their name.
Simply use:
filters = ['category', 'name']
for col in x.columns:
    if all(f in col for f in filters):
        print(col)
Output:
category_name_courses
category_name_area
Older answer: filtering values
You can do almost what you suggested:
import pandas as pd

x = pd.DataFrame([['flow', 'x', 'category'], ['x', 'x', 'flow']])
for col in x.columns:
    if ('flow' in x[col].values) and ('category' in x[col].values):
        # Do something with this column...
        print(f'column "{col}" matches')
Using a list of matches:
filters = ['category', 'flow']
for col in x.columns:
    if all(f in x[col].values for f in filters):
        # Do something with this column...
        print(f'column "{col}" matches')
Or, more efficiently, using a set:
filters = {'category', 'flow'}
for col in x.columns:
    if set(x[col]) >= filters:
        # Do something with this column...
        print(f'column "{col}" matches')
Example:
column "2" matches
Try this (if I understand your problem correctly): you have a list of conditions, and you want to make sure that the column name includes all of them.
filters = ['category', 'name']
# x is some pandas dataframe
# I am trying to create something which would be equivalent of:
for col in x.columns:
    boo = True
    for condition in filters:
        if condition not in col:
            boo = False
            break
    if boo:
        pass  # do something
I really did not understand the question. I assume you want to find all column names which contain any value in the given list.
The pandas DataFrame has loads of methods and functionality. You may be familiar with the dt accessor, which lets you apply datetime operations to a column.
In the same way, you can use str for string operations.
So you can check whether a string occurs in any value (row) of a column:
a_series.str.contains("string")
This returns a boolean Series that is True where the condition is met.
You can use | (regex "or") to check multiple strings:
a_series.str.contains("string|something")
So the line should look like:
x.columns.str.contains("|".join(filters))
Now you have the mask you can apply it:
x.columns[x.columns.str.contains("|".join(filters))]
Since you have the column names you can access the column data itself:
x[x.columns[x.columns.str.contains("|".join(filters))]]
Output:
category_name_courses category_name_area
0 Spark cloud
1 PySpark cloud
2 Python prog
3 pandas ds
I have the following function:
def match_function(column):
    df_1 = df[column].str.split(',', expand=True)
    df_11 = df_1.apply(lambda s: s.value_counts(), axis=1).fillna(0)
    match = df_11.iloc[:, 0][0] / df_11.sum(axis=1) * 100
    df[column] = match
    return match
This function only works if I enter a specific column name.
How can I change the function so that, if I pass it a dataframe, it loops through all of its columns automatically, so I won't have to enter each column separately?
P.S. I know the function itself is written very poorly, but I'm kind of new to coding, sorry.
You need to wrap the function so that it does this iteratively over all columns.
If you add this to your code, it'll iterate over the columns and return the match results in a list (you will have multiple results, since you're running over multiple columns).
def match_over_dataframe_columns(dataframe):
    return [match_function(column) for column in dataframe.columns]
results = match_over_dataframe_columns(df)
Instead of inputting column to your function, input the entire dataframe. Then, cast the columns of the df to a list and loop over the columns, performing your analysis on each column. For example:
def match_function(df):
    columns = df.columns.tolist()
    matches = {}
    for column in columns:
        # do your analysis
        # instead of returning match, store it:
        matches[column] = match
    return matches
This will return a dictionary with keys of your columns and values of the corresponding match value.
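For example, assuming match_function has been rewritten this way, usage might look like:
matches = match_function(df)
for column, match in matches.items():
    print(column, match)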
Just loop through the columns:
def match_function(df):
    l_match = []
    for column in df.columns:
        df_1 = df[column].str.split(',', expand=True)
        df_11 = df_1.apply(lambda s: s.value_counts(), axis=1).fillna(0)
        match = df_11.iloc[:, 0][0] / df_11.sum(axis=1) * 100
        df[column] = match
        l_match.append(match)
    return l_match
What I have is a list of DataFrames.
What is important to note is that the shapes of the DataFrames differ between 2 and 7 columns, and the columns are named from 0 up to the number of columns (e.g. df1 has 5 columns named 0, 1, 2, 3, 4; df2 has 4 columns named 0, 1, 2, 3).
What I would like is to check whether any row in a column contains a certain string, and if so delete that column.
list_dfs1=[df1,df2,df3...df100]
What I have done so far is below, and I get an error that column 5 is not in axis (it is there for some of the DataFrames):
for i, df in enumerate(list_dfs1):
    for index, row in df.iterrows():
        if np.where(row.str.contains("DEC")):
            df.drop(index, axis=1)
Any suggestions?
You could try:
for df in list_dfs1:
    for col in df.columns:
        # If you are unsure about column types, cast column as string:
        df[col] = df[col].astype(str)
        # Check if the column contains the string of interest
        if df[col].str.contains("DEC").any():
            df.drop(columns=[col], inplace=True)
If you know that all columns are of type string, you don't have to actually do df[col] = df[col].astype(str).
You can write a custom function that checks whether a column contains the pattern or not, using pd.Series.str.contains with pd.Series.any:
def func(s):
    return s.str.contains('DEC').any()
list_df = [df.loc[:, ~df.apply(func)] for df in list_dfs1]
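A quick demonstration on made-up data (this assumes every cell is a string, since str.contains needs string values):
import pandas as pd

df1 = pd.DataFrame({0: ['a', 'b'], 1: ['DEC1', 'c']})   # column 1 contains "DEC"
df2 = pd.DataFrame({0: ['x', 'DEC2'], 1: ['y', 'z']})   # column 0 contains "DEC"
list_dfs1 = [df1, df2]

list_df = [df.loc[:, ~df.apply(func)] for df in list_dfs1]
# list_df[0] keeps only column 0, list_df[1] keeps only column 1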
I would take another approach: I would concatenate the list into a single DataFrame and then eliminate the columns where the string is found.
import pandas as pd
df = pd.concat(list_dfs1)
Let us say your condition is to eliminate any column with a cell equal to "DEC":
df.mask(df == "DEC").dropna(axis=1, how="any")
(Note that concatenating DataFrames with different numbers of columns introduces NaN in the missing positions, so this will also drop any column that is not present in every DataFrame.)
I have a dataframe in which I want to clean up a specific row, in this case the first row. I have written a function which I believe will return the cleaned string if the regex is matched.
import re

def clean_cells(string):
    if '201' in string:
        return re.findall('201[0-9]', string)[0]
    else:
        return string
I want to apply this function to the first row of the dataframe, replace the first row with the cleaned-up version, and then concatenate the rest of the original dataframe.
I have tried:
df = df.iloc[[0]].apply(clean_cells)
Select first row by position and all columns with : and assign back:
df.iloc[0, :] = df.iloc[0, :].apply(clean_cells)
Another solution:
df.iloc[0] = df.iloc[0].apply(clean_cells)
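A quick demonstration on made-up data:
import pandas as pd

df = pd.DataFrame({'a': ['year 2015 (est)', 'x'], 'b': ['2018-ish', 'y']})
df.iloc[0] = df.iloc[0].apply(clean_cells)
# the first row becomes ['2015', '2018']; all other rows are untouched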
I need to move the 3rd column to the end of the column list. This can be done as follows:
cols = df.columns.tolist()
cols = cols[:2] + cols[3:] + [cols[2]]  # move the 3rd column (index 2) to the end
df = df[cols]
However, what if I don't know the index of the column I'm interested in? Let's say the only thing I know is that the column called MyMagicCol should be moved to the end of the column list.
How can I do this?
If you want to do it within a list, and you know the name of the column you're interested in, you could just do:
cols = [c for c in df.columns if c != 'MyMagicCol'] + ['MyMagicCol']
df = df[cols]
You could also do this with df.loc and use the column names to slice your dataframe if you wanted to work with the columns directly.
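For example, a sketch of the df.loc variant (using the made-up name MyMagicCol):
# drop the target label from the column Index, then append it at the end
other_cols = df.columns.drop('MyMagicCol')
df = df.loc[:, list(other_cols) + ['MyMagicCol']]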