How to mask when you are using a list? - python

I have a list of unique names (4,300 to be exact). unique_names = ['James', 'Erika', 'Akshay', 'Neil', etc..].
I have a column in a dataframe where every row has its own list of names.
I have to find out which rows in this column contain a name from my unique_names list.
I have tried masking, but every time it only gives back 2 rows rather than all the rows that contain a name from my list unique_names.
for name in unique_names:
    if name in unique_names:
        mask = df['names'].apply(lambda x: name in x)
        df1 = df[mask]
My expected result is every row that contains at least one name from my list unique_names. Instead I only get back two rows, the ones whose name lists contain 'Akshay'; I can see other rows contain names like 'Neil' and 'Erika', but those are not returned.

I would expect that the following would suffice.
mask = df['names'].apply(lambda x: any(name in x for name in unique_names))
If unique_names is a set and the number of names per row is small:
mask = df['names'].apply(lambda x: any(name in unique_names for name in x))
Or:
mask = df['names'].apply(lambda x: not unique_names.isdisjoint(x))
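For illustration, a minimal self-contained sketch of the set-based version (the frame and names below are made up; the real unique_names has ~4,300 entries):
import pandas as pd

unique_names = {'James', 'Erika', 'Akshay', 'Neil'}
df = pd.DataFrame({'names': [['Akshay', 'Bob'], ['Carol'], ['Neil', 'Dave']]})

# keep every row whose list shares at least one name with unique_names
mask = df['names'].apply(lambda x: any(name in unique_names for name in x))
df1 = df[mask]  # rows 0 and 2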

I would rethink how you are approaching this problem. First things first: your original code iterates over names from a container called unique_names, then checks whether each one is in unique_names. Every single iteration will pass that test, because you pull the names from the same container you test for membership in. On top of that, mask and df1 are rebuilt on every pass of the loop, so after the loop you are left with only the result for whichever name happened to be processed last.
My best advice would be to iterate over the rows rather than the names. Pseudocode would be as follows:
rows_with_unique = list()
for row in dataframe:
    for name in unique_names:
        if name in row:
            rows_with_unique.append(row)  # or whatever you are trying to extract
            break  # stop after the first match so a row isn't added twice
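A rough, concrete version of that pseudocode in pandas terms, assuming (as in the question) that df['names'] holds a list of names per row:
matching_idx = []
for idx, names_in_row in df['names'].items():
    if any(name in unique_names for name in names_in_row):
        matching_idx.append(idx)

df1 = df.loc[matching_idx]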

Related

Nested loop over list of dataframes

I've got a list of dataframes that I want filtered depending on the values in one column that all three of them have. I want to split all three dataframes into three each; one sub-dataframe for each value in that one column. So I want to make 9 dataframes out of 3.
I've tried:
df_list = [df_a, df_b, df_c]
for df_tmp in df_list:
    for i, g in df_tmp.groupby('COLUMN'):
        globals()[str(df_tmp) + str(i)] = g
But I get super weird results. Can someone help me fix that code?
Thanks!
This should give you a list of dictionaries: one dictionary for each of the original dataframes, each containing one sub-dataframe keyed by the unique values from 'COLUMN'.
tables = [{'df_' + name: df[df['COLUMN'] == name].copy()
           for name in df['COLUMN'].unique()}
          for df in df_list]
So, for example, you can call tables[0] to get the three dataframes derived from df_a, or tables[0]['df_foo'] to get the table from df_a with all the rows that have the value 'foo' in the column 'COLUMN'.
Or, if you want a dictionary so that all the dataframes are associated with keys instead of list indexes:
tables = {'df_' + str(i): {'df_' + name: df_list[i][df_list[i]['COLUMN'] == name].copy()
                           for name in df_list[i]['COLUMN'].unique()}
          for i in range(len(df_list))}
and then you can call them as tables['df_0']['df_foo'].
You can of course create a list of names and use it to assign the keys:
df_names = ['df_a', 'df_b', 'df_c']
tables = {df_names[i]: {'df_' + name: df_list[i][df_list[i]['COLUMN'] == name].copy()
                        for name in df_list[i]['COLUMN'].unique()}
          for i in range(len(df_list))}
And now you do tables['df_a']['df_foo'].
Let's say you choose to use one of the dictionaries and want to apply a single operation to all the dataframes. For example, say each dataframe has a column called 'prices' and you want to apply a function called get_discount() to it:
for key1 in tables:            # top level, one entry per original dataframe
    for key2 in tables[key1]:  # bottom level, one entry per filtered sub-dataframe
        tables[key1][key2]['prices'] = tables[key1][key2]['prices'].apply(get_discount)
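A minimal end-to-end sketch of the dictionary-of-dictionaries approach; the frames, values and get_discount below are invented purely for illustration:
import pandas as pd

def get_discount(price):  # hypothetical example function
    return price * 0.9

df_a = pd.DataFrame({'COLUMN': ['foo', 'bar', 'foo'], 'prices': [10, 20, 30]})
df_b = pd.DataFrame({'COLUMN': ['foo', 'bar'], 'prices': [5, 15]})
df_list = [df_a, df_b]
df_names = ['df_a', 'df_b']

tables = {df_names[i]: {'df_' + name: df_list[i][df_list[i]['COLUMN'] == name].copy()
                        for name in df_list[i]['COLUMN'].unique()}
          for i in range(len(df_list))}

for key1 in tables:
    for key2 in tables[key1]:
        tables[key1][key2]['prices'] = tables[key1][key2]['prices'].apply(get_discount)

print(tables['df_a']['df_foo'])  # rows of df_a where COLUMN == 'foo', prices discounted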

loop through pandas columns inside function

I have following function:
def match_function(column):
    df_1 = df[column].str.split(',', expand=True)
    df_11 = df_1.apply(lambda s: s.value_counts(), axis=1).fillna(0)
    match = df_11.iloc[:, 0][0] / df_11.sum(axis=1) * 100
    df[column] = match
    return match
This function only works if I enter a specific column name.
How can I change this function so that, if I pass it a dataframe, it loops through all of its columns automatically and I don't have to enter each column separately?
P.S. I know the function itself is written very poorly, but I'm kinda new to coding, sorry.
You need to wrap the function so that it iterates over all the columns.
If you add this to your code, it will iterate over the columns and return the match results in a list (you will have multiple results, since you're running over multiple columns).
def match_over_dataframe_columns(dataframe):
    return [match_function(column) for column in dataframe.columns]

results = match_over_dataframe_columns(df)
Instead of inputting a column to your function, input the entire dataframe. Then cast the columns of the df to a list and loop over them, performing your analysis on each column. For example:
def match_function(df):
    columns = df.columns.tolist()
    matches = {}
    for column in columns:
        # do your analysis
        # instead of returning match,
        matches[column] = match
    return matches
This will return a dictionary with keys of your columns and values of the corresponding match value.
Or just loop through the columns:
def match_function(df):
    l_match = []
    for column in df.columns:
        df_1 = df[column].str.split(',', expand=True)
        df_11 = df_1.apply(lambda s: s.value_counts(), axis=1).fillna(0)
        match = df_11.iloc[:, 0][0] / df_11.sum(axis=1) * 100
        df[column] = match
        l_match.append(match)
    return l_match
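A quick way to exercise this version: build a toy frame of comma-separated strings and pass it in (the data below is made up purely to run the code):
import pandas as pd

df = pd.DataFrame({
    'col_a': ['x,x,y', 'x,y,y'],
    'col_b': ['a,a,a', 'a,b,b'],
})
print(match_function(df))  # one match Series per column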

How to create a list of columns using a conditions dynamically?

My input Data Frame is
Below is my code for creating multiple columns from my single column of data: if the 'Name' column contains 'reporting', there should be a column named 'reporting' with a 1 placed in the rows where it appears (and likewise for the other keywords).
I am getting the correct output, but I want to write this dynamically. Is there another way?
df['reporting']=pd.np.where((df['Name'].str.contains('reporting',regex=False)),1,0)
df['update']=pd.np.where((df['Name'].str.contains('update',regex=False)),1,0)
df['offer']=pd.np.where((df['Name'].str.contains('offer',regex=False)),1,0)
df['line']=pd.np.where((df['Name'].str.contains('line',regex=False)),1,0)
Output:
Use Series.str.findall to get all matching values as lists, using \b...\b for word boundaries, join them by | and pass to Series.str.get_dummies:
L = ["reporting","update","offer","line"]
pat = '|'.join(r"\b{}\b".format(x) for x in L)
df = df.join(df['Name'].str.findall(pat).str.join('|').str.get_dummies())
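As a quick illustration of that pipeline (the sample 'Name' strings below are invented, since the original input frame isn't shown):
import pandas as pd

df = pd.DataFrame({'Name': ['weekly reporting task', 'update the offer', 'new line added']})
L = ["reporting", "update", "offer", "line"]
pat = '|'.join(r"\b{}\b".format(x) for x in L)
out = df.join(df['Name'].str.findall(pat).str.join('|').str.get_dummies())
# out gains one 0/1 column per keyword:
# row 0 -> reporting=1, row 1 -> update=1 and offer=1, row 2 -> line=1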
Or process each keyword separately; np.where is not necessary here, since True/False can be converted to 1/0 with Series.astype or Series.view:
for c in L:
    df[c] = df['Name'].str.contains(c, regex=False).astype(int)

for c in L:
    df[c] = df['Name'].str.contains(c, regex=False).view('i1')
Make a list of keywords, iterate the list and create new columns?
keywords = ["reporting", "update", "offer", "line"]
for word in keywords:
    df[word] = pd.np.where((df['Name'].str.contains(word, regex=False)), 1, 0)

How to replace a list of longer column names with small names in a dataset?

I have a list of column names which are so lengthy that they mess up the code. I want to replace that list of column names with a list of shorter names.
Here's an example of columns:
non_multi_choice = ["what_is_the_highest_level_of_formal_education_that_you_have_attained_or_plan_to_attain_within_the_next_2_years",
"select_the_title_most_similar_to_your_current_role_or_most_recent_title_if_retired_selected_choice",
"in_what_industry_is_your_current_employer_contract_or_your_most_recent_employer_if_retired_selected_choice",
"how_many_years_of_experience_do_you_have_in_your_current_role",
"what_is_your_current_yearly_compensation_approximate_usd",
"does_your_current_employer_incorporate_machine_learning_methods_into_their_business",
"of_the_choices_that_you_selected_in_the_previous_question_which_ml_library_have_you_used_the_most_selected_choice",
"approximately_what_percent_of_your_time_at_work_or_school_is_spent_actively_coding"]
shorter_names = ["highest_level_of_formal_education",
"job_title",
"current_industry",
"years_of_experience",
"yearly_compensation",
"does_your_current_employer_incorporate_machine_learning_methods_into_their_business",
"which_ml_library_have_you_used_the_most_selected_choice",
"what_percent_of_your_time_at_work_is_spent_actively_coding"]
I want to replace each name in the first list with the corresponding name in the second list.
How about something like this:
for index, item in enumerate(non_multi_choice):
    non_multi_choice[index] = shorter_names[index]
If those encompass all the column names, then:
df.columns = shorter_names
If not:
df = df.rename(columns={old: new for old, new in zip(non_multi_choice, shorter_names)})
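As a small follow-up, rename returns a new frame by default, so assign the result back (as above); building the mapping explicitly can also help if you want to inspect or reuse it:
rename_map = dict(zip(non_multi_choice, shorter_names))
df = df.rename(columns=rename_map)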

Pandas efficient way to add values to a column if the element exists in another pandas series

I have two pandas dataframes (df1 and df2). df1 contains a column of strings which are to be matched with substrings from a column in df2 and saved to a separate column. I can do this with apply, but the computation is really slow. Is it possible to vectorize it? Or is there any other way to improve efficiency? Both columns are large.
Thanks
def get_city(location, city_list):
    if isinstance(location, str):
        suspects_list = []
        location = location.lower()
        for city in city_list:
            if city in location:
                suspects_list.append(city)
        if suspects_list:
            return max(suspects_list, key=len)
        else:
            return np.nan
    else:
        return np.nan

df['city'] = df['location'].apply(
    lambda element: get_city(element, world_cities_list))
The location column contains strings that are not cleaned and may or may not contain the name of a city as a substring. We need to extract the cities and store them in a column named city. Cities are distributed worldwide, so the whole dataset of cities is 40,000+. The length of the location column is 150,000+.
I wish there was some hint of the initial data, but here is a small speedup.
First, take care of the outer check for str type outside this function in a separate step, e.g. slice your original dataframe to valid rows only.
Then you can do the following:
import numpy as np

def get_city(location, cities):
    filter1 = lambda x: location.lower() in x
    suspects = list(filter(filter1, cities))
    if suspects:
        return max(suspects, key=len)
    else:
        return np.nan

cities = ['moscow', 'moscow, russia', 'kiev']
assert get_city('MOSCOW', cities) == 'moscow, russia'
Hope it is expected behaviour.
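A minimal sketch of the "valid rows only" suggestion, wiring the rewritten get_city up with apply; df, its 'location' column and world_cities_list are assumed from the question:
import numpy as np

str_mask = df['location'].apply(lambda x: isinstance(x, str))
df['city'] = np.nan
df.loc[str_mask, 'city'] = df.loc[str_mask, 'location'].apply(
    lambda loc: get_city(loc, world_cities_list))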
