I have a series of columns:
COLUMNS = contract_number, award_date_x, publication_date_x, award_date_y, publication_date_y, award_date, publication_date
I would like to drop all of the 'publication_date' columns that end with '_[a-z]', so that my final result would look like this:
COLUMNS = contract_number, award_date, award_date_x, award_date_y, publication_date
I have tried the following with no luck:
df_merge=df_merge.drop(c for c in df_merge.columns if c.str.contains('publication_date_[a+z]$'))
Thanks
Try this,
import re

columns = df_merge.columns.tolist()  # get all column names
for col in columns:
    if re.match(r"publication_date_[a-z]$", col):  # regex for your match case
        df_merge.drop([col], axis=1, inplace=True)  # if the regex matches, drop the column
df_merge.head()  # filtered dataframe
lis = ["publication_date_x", "publication_date", "publication_date_x_y_y", "hello"]
new_list = [x for x in lis if not x.startswith('publication_date_')]
the output will be
new_list: ["publication_date", "hello"]
(note that startswith also drops names with longer suffixes such as "publication_date_x_y_y", which the single-letter regex 'publication_date_[a-z]$' would keep)
If you want to use str.contains you'll need to make the list of columns a Series.
series_cols = pd.Series(df_merge.columns)
bool_series_cols = series_cols.str.contains('publication_date_[a-z]$')
df_merge.drop([c for c, bool_c in zip(series_cols, bool_series_cols) if bool_c], axis=1, inplace=True)
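As an aside, in recent pandas the columns Index itself supports the .str accessor, so the same drop can be a one-line boolean mask. A minimal sketch, assuming the df_merge above:

df_merge = df_merge.loc[:, ~df_merge.columns.str.contains(r'publication_date_[a-z]$')]

This keeps every column whose name does not match the pattern, which is equivalent to dropping the matching ones.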
I'm trying to filter a column in pandas based on a string, but the issue I'm facing is that the rows are lists, not plain strings.
A small example of the column
tags
['get_mail_mail']
['app', 'oflline_hub', 'smart_home']
['get_mail_mail', 'smart_home']
['web']
[]
[]
['get_mail_mail']
and I'm using this
df[df["tags"].str.contains("smart_home", case=False, na=False)]
but it's returning an empty dataframe.
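(With list elements, .str.contains produces NaN for every row because the values are not strings, and na=False turns those NaNs into False, which is why the result is empty.) For reference, a minimal frame reproducing the sample above, with the values assumed from the printout:

import pandas as pd

df = pd.DataFrame({'tags': [['get_mail_mail'],
                            ['app', 'oflline_hub', 'smart_home'],
                            ['get_mail_mail', 'smart_home'],
                            ['web'],
                            [],
                            [],
                            ['get_mail_mail']]})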
You can explode the lists (explode repeats the original index for each list element), then compare and aggregate with groupby.any on that index:
m = (df['tags'].explode()
               .str.contains('smart_home', case=False, na=False)
               .groupby(level=0).any()
     )
out = df[m]
Or concatenate the strings with a delimiter and use str.contains (this assumes every list element is a string):
out = df[df['tags'].agg('|'.join).str.contains('smart_home')]
Or use a list comprehension (note that this tests exact equality, not a substring match):
out = df[[any(s=='smart_home' for s in l) for l in df['tags']]]
output:
tags
1 [app, oflline_hub, smart_home]
2 [get_mail_mail, smart_home]
You could try the following (it stringifies each cell and scans every column for the patterns):
# define list of searching patterns
pattern = ["smart_home"]
df.loc[df.apply(lambda x: any(m in str(v)
                              for v in x.values
                              for m in pattern),
                axis=1)]
Output
tags
-- ------------------------------------
1 ['app', 'oflline_hub', 'smart_home']
2 ['get_mail_mail', 'smart_home']
my input:
import pandas as pd

task1 = pd.DataFrame({'task1': ['|ReviewNG-Cecum Landmark', '|ReviewNG-Cecum Landmark', '|Cecum Landmark',
                                '|Cecum Landmark', '|Cecum Landmark', '|Cecum Landmark', '|ReviewNG-Cecum Landmark',
                                '|ReviewNG-Cecum Landmark', '|Other', '|Other', '|Other']})
task2 = pd.DataFrame({'task2': ['|Cecum Landmark|Other', '|Cecum Landmark|Other', '|Cecum Landmark|Other',
                                '|Cecum Landmark|Other', '|Other', '|Other', '|Other', '|Other']})
df = pd.concat([task1, task2], join='outer', axis=1)
I'm trying to get, across the whole df, all the indices where the pattern matches.
My code:
pat = r"\|Cecum Landmark\||\|Cecum Landmark"
idx = df.apply(lambda x: x.str.contains(pat, regex=True), axis=1)
idx.index[idx['task1'] == True].tolist()
What I get:
[2, 3, 4, 5]
So that's correct, but how do I get the matching indices for every column in the df? In other words, I don't want to type each column name manually to get its matching indices; I want the output per column.
So what I expect:
[2,3,4,5]
[0,1,2,3]
so, for example, two lists (or more), one per column. Is there maybe a better way to find the matches and get the indices?
Try the following,
result = df.apply(
    lambda x: x.index[x.str.contains(pat, regex=True, na=False)].tolist(),
    axis=0,
    result_type="reduce",
).tolist()
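If a mapping from column name to indices is more convenient, an equivalent sketch without apply (assuming the same df and pat as above):

result = {col: df.index[df[col].str.contains(pat, regex=True, na=False)].tolist()
          for col in df.columns}

On the sample data this should give one list per column, i.e. {'task1': [2, 3, 4, 5], 'task2': [0, 1, 2, 3]}; na=False covers the NaN rows that the outer concat adds to pad the shorter task2 column.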
I have a large dataframe of urls and a smaller 2nd dataframe that contains columns of strings which I want to use to merge the two dataframes together. Data from the 2nd df will be used to populate the larger 1st df.
The matching strings can contain * wildcards (and more than one), but the order of the pieces still matters; so "path/*path2" would match "exsample.com/eg_path/extrapath2.html" but not "exsample.com/eg_path2/path/test.html". How can I use the strings in the 2nd dataframe to merge the two dataframes together? There can be more than one matching string in the 2nd dataframe.
import pandas as pd
urls = {'url': ['https://stackoverflow.com/questions/56318782/',
                'https://www.google.com/',
                'https://en.wikipedia.org/wiki/Python_(programming_language)',
                'https://stackoverflow.com/questions/'],
        'hits': [1000, 500, 300, 7]}
metadata = {'group': ['group1', 'group2'],
            'matching_string_1': ['google', 'wikipedia*Python_'],
            'matching_string_2': ['stackoverflow*questions*56318782', '']}
result = {'url': ['https://stackoverflow.com/questions/56318782/',
                  'https://www.google.com/',
                  'https://en.wikipedia.org/wiki/Python_(programming_language)',
                  'https://stackoverflow.com/questions/'],
          'hits': [1000, 500, 300, 7],
          'group': ['group2', 'group1', 'group1', '']}
df1 = pd.DataFrame(urls)
df2 = pd.DataFrame(metadata)
what_I_am_after = pd.DataFrame(result)
Not very robust but gives the correct answer for my example.
import pandas as pd
urls = {'url': ['https://stackoverflow.com/questions/56318782/',
                'https://www.google.com/',
                'https://en.wikipedia.org/wiki/Python_(programming_language)',
                'https://stackoverflow.com/questions/'],
        'hits': [1000, 500, 300, 7]}
metadata = {'group': ['group1', 'group2'],
            'matching_string_1': ['google', 'wikipedia*Python_'],
            'matching_string_2': ['stackoverflow*questions*56318782', '']}
result = {'url': ['https://stackoverflow.com/questions/56318782/',
                  'https://www.google.com/',
                  'https://en.wikipedia.org/wiki/Python_(programming_language)',
                  'https://stackoverflow.com/questions/'],
          'hits': [1000, 500, 300, 7],
          'group': ['group2', 'group1', 'group1', '']}
df1 = pd.DataFrame(urls)
df2 = pd.DataFrame(metadata)
results = pd.DataFrame(columns=['url', 'hits', 'group'])
for index, row in df2.iterrows():
    for x in row.iloc[1:]:  # the matching_string_* columns
        group = x.split('*')
        # build a regex: each non-empty piece followed by '.*'
        rx = "".join([str(p) + ".*" if len(p) > 0 else '' for p in group])
        if rx == "":
            continue
        filter = df1['url'].str.contains(rx, na=False, regex=True)
        if filter.any():
            temp = df1[filter].copy()  # copy to avoid SettingWithCopyWarning
            temp['group'] = row.iloc[0]
            results = pd.concat([results, temp])  # DataFrame.append was removed in pandas 2.0
d3 = df1.merge(results, how='outer', on=['url', 'hits'])
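One caveat: the pieces of each matching string go into the regex unescaped, so a pattern containing regex metacharacters (a literal dot, parenthesis, etc.) would misbehave. A hedged sketch of a safer conversion, where wildcard_to_regex is a hypothetical helper:

import re

def wildcard_to_regex(pattern):
    # escape each literal piece, then rejoin the pieces with '.*' for each '*'
    return ".*".join(re.escape(piece) for piece in pattern.split("*") if piece)

wildcard_to_regex('wikipedia*Python_')  # -> 'wikipedia.*Python_'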
I am working with the code below to extract the last number from a pandas dataframe column name.
names = df.columns.values
new_df = pd.DataFrame()
for name in names:
    if ('.value.' in name) and df[name][0]:
        last_number = int(name[-1])
        print(last_number)
        key, value = my_dict[last_number]
        try:
            new_df[value][0] = list(new_df[value][0]) + [key]
        except:
            new_df[value] = [key]
name is a string that looks like this:
'data.answers.1234567890.value.0987654321'
I want to take the entire number after .value. inside the IF statement. How would I do this in the IF statement above?
Use str.split, and extract the last slice with -1 (also gracefully handles false cases):
df = pd.DataFrame(columns=[
    'data.answers.1234567890.value.0987654321', 'blahblah.value.12345', 'foo'])
df.columns = df.columns.str.split('value.').str[-1]
df.columns
# Index(['0987654321', '12345', 'foo'], dtype='object')
Another alternative is splitting inside a listcomp:
df.columns = [x.split('value.')[-1] for x in df.columns]
df.columns
# Index(['0987654321', '12345', 'foo'], dtype='object')
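If you specifically want the digits after '.value.' inside the loop, a re.search sketch also works (the pattern here is an assumption based on the example name):

import re

name = 'data.answers.1234567890.value.0987654321'
m = re.search(r'\.value\.(\d+)$', name)
if m:
    last_number = int(m.group(1))  # 987654321; note int() drops a leading zero

If the leading zero matters (e.g. the number is a dict key stored as a string), keep m.group(1) as a string instead of converting it.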
I have a pandas dataframe with two columns: the first holds a single date ('action_date') and the second holds a list of dates ('verification_date'). I am trying to calculate the time difference between the date in 'action_date' and each of the dates in the corresponding 'verification_date' list, and then fill two new df columns with the number of dates in verification_date whose difference is over or under 360 days.
Here is my code:
df = pd.DataFrame()
df['action_date'] = ['2017-01-01', '2017-01-01', '2017-01-03']
df['action_date'] = pd.to_datetime(df['action_date'], format="%Y-%m-%d")
df['verification_date'] = ['2016-01-01', '2015-01-08', '2017-01-01']
df['verification_date'] = pd.to_datetime(df['verification_date'], format="%Y-%m-%d")
df['user_name'] = ['abc', 'wdt', 'sdf']
df.index = df.action_date
df = df.groupby(pd.Grouper(freq='2D'))['verification_date'].apply(list).reset_index()  # pd.TimeGrouper was removed from pandas; pd.Grouper replaces it
def make_columns(df):
    for i in range(len(df)):
        over_360 = []
        under_360 = []
        for w in [(df['action_date'][i] - x).days for x in df['verification_date'][i]]:
            if w > 360:
                over_360.append(w)
            else:
                under_360.append(w)
        df['over_360'] = len(over_360)
        df['under_360'] = len(under_360)
    return df
make_columns(df)
This kinda works EXCEPT the df has the same values for each row, which is not true as the dates are different. For example, in the first row of the dataframe, there IS a difference of over 360 days between the action_date and both of the items in the list in the verification_date column, so the over_360 column should be populated with 2. However, it is empty and instead the under_360 column is populated with 1, which is accurate only for the second row in 'action_date'.
I have a feeling I'm just messing up the looping but am really stuck. Thanks for all help!
Your problem is that these lines always overwrite the whole column with the value of the last iteration's calculation:
df['over_360'] = len(over_360)
df['under_360'] = len(under_360)
What you want instead is to set the value for row i only on each pass. You can do this by replacing the lines above with single-cell assignments via df.at (the older df.set_value and df.ix accessors once suggested for this have since been removed from pandas):
df.at[i, 'over_360'] = len(over_360)
df.at[i, 'under_360'] = len(under_360)
This sets a value only at row i in column over_360 or under_360. The label-based equivalent with .loc also works:
df.loc[i, 'over_360'] = len(over_360)
df.loc[i, 'under_360'] = len(under_360)
you might want to try this:
df['over_360'] = df.apply(lambda x: sum([((x['action_date'] - i).days >360) for i in x['verification_date']]) , axis=1)
df['under_360'] = df.apply(lambda x: sum([((x['action_date'] - i).days <360) for i in x['verification_date']]) , axis=1)
I believe it should be a bit faster.
You didn't specify what to do when the difference is exactly 360 days, so just change > or < into >= or <= as needed.
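On the grouped sample above, either approach should yield something like this (a sketch; row 0 pairs 2017-01-01 with two dates that are each more than 360 days earlier):

  action_date         verification_date  over_360  under_360
0  2017-01-01  [2016-01-01, 2015-01-08]         2          0
1  2017-01-03              [2017-01-01]         0          1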