I'm trying to preprocess data, especially dealing with missing values.
I have a list of words and two columns containing text data. If a word from the list appears in at least one of the two text columns, I fill the missing value with that word.
import pandas as pd

a = ['coffee', 'milk', 'sugar']
test = pd.DataFrame({'col': ['missing', 'missing', 'missing'],
                     'text1': ['i drink tea', 'i drink coffee', 'i drink whiskey'],
                     'text2': ['i drink juice', 'i drink nothing', 'i drink milk']
                     })
So the dataframe looks like this (the column "col" has "missing" as a result of applying fillna("missing")):
Out[19]:
col text1 text2
0 missing i drink tea i drink juice
1 missing i drink coffee i drink nothing
2 missing i drink whiskey i drink milk
I came up with the following code using a loop:
for word in a:
    test.loc[(test["col"] == 'missing') & ((test["text1"].str.count(word) > 0)
             | (test['text2'].str.count(word) > 0)), "col"] = word
With 100,000 rows and 2,000 elements in the list "a", it takes around 870 seconds to finish the job.
Is there any solution to make it faster for a huge dataframe?
Thanks in advance
Some suggestions:
Why use .str.count instead of .str.contains?
Why do the fillna('missing')? pd.isnull(test["col"]) will work faster than test["col"] == 'missing'.
You could also use a test to see whether all the missing fields are filled.
So this can boil down to something like this:
def fill_missing(original_df, column_name, replacements, inplace=True):
    df = original_df if inplace else original_df.copy()
    for word in replacements:
        # Only consider rows that are still missing.
        empty = pd.isnull(df[column_name])
        if not empty.any():
            # Everything is filled, so we can stop early.
            return df
        contained = (df.loc[empty, "text1"].str.contains(word)) | (df.loc[empty, 'text2'].str.contains(word))
        df.loc[contained[contained].index, column_name] = word
    return df
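If the replacement list is very long, a further option (my own sketch, not part of the answer above) is to collapse all the words into one regex and fill everything in a single vectorized pass. This assumes the missing values are real NaN/None rather than the string 'missing', and note that, unlike the loop, the leftmost match in the text wins rather than the first word in the list:
import re
import pandas as pd

a = ['coffee', 'milk', 'sugar']
test = pd.DataFrame({'col': [None, None, None],
                     'text1': ['i drink tea', 'i drink coffee', 'i drink whiskey'],
                     'text2': ['i drink juice', 'i drink nothing', 'i drink milk']})

# One alternation over all words, longest first so that longer words
# are preferred when one word is a prefix of another.
pattern = '(%s)' % '|'.join(re.escape(w) for w in sorted(a, key=len, reverse=True))

# Extract the first matching word from the concatenated text columns
# and use it to fill the still-missing values in one pass.
combined = test['text1'] + ' ' + test['text2']
test['col'] = test['col'].fillna(combined.str.extract(pattern, expand=False))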
Related
I am trying to create a new DataFrame column that contains words that match between a list of keywords and strings in a df column...
data = {
'Sandwich Opinions':['Roast beef is overrated','Toasted bread is always best','Hot sandwiches are better than cold']
}
df = pd.DataFrame(data)
keywords = ['bread', 'bologna', 'toast', 'sandwich']
df['Matches'] = df.apply(lambda x: ' '.join([i for i in df['Sandwich Opinions'].str.split() if i in keywords]), axis=1)
This seems like it should do the job but it's getting stuck in endless processing.
import numpy as np  # needed for np.where below

for kw in keywords:
    # 1 if the opinion mentions the keyword, else 0 (temporary column)
    df[kw] = np.where(df['Sandwich Opinions'].str.contains(kw), 1, 0)

def add_contain_row(row):
    contains = []
    for kw in keywords:
        if row[kw] == 1:
            contains.append(kw)
    return contains

df['contains'] = df.apply(add_contain_row, axis=1)
# if you want to drop the temp columns
df.drop(columns=keywords, inplace=True)
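For the sample data this produces:
                     Sandwich Opinions    contains
0              Roast beef is overrated          []
1         Toasted bread is always best     [bread]
2  Hot sandwiches are better than cold  [sandwich]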
Create a regex pattern from your list of words:
import re
pattern = fr"\b({'|'.join(re.escape(k) for k in keywords)})\b"
df['contains'] = df['Sandwich Opinions'].str.extract(pattern, re.IGNORECASE)
Output:
>>> df
Sandwich Opinions contains
0 Roast beef is overrated NaN
1 Toasted bread is always best bread
2 Hot sandwiches are better than cold NaN
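Note that str.extract keeps only the first match per row. If a row may contain several keywords, str.findall with the same pattern collects all of them; a small sketch (the all_matches column name is my own choice):
df['all_matches'] = df['Sandwich Opinions'].str.findall(pattern, re.IGNORECASE).str.join(' ')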
I want to move around description text column to newly created columns based on keywords in python.
For example, if keywords are 'Table', 'Fan', 'Chair'
Description (given)   Keyword Table       Keyword Fan        Keyword Chair
The table is long     The table is long
The fan is nice                           The fan is nice
The fan is cheap                          The fan is cheap
The chair is brown                                           The chair is brown
I tried to use both str.contains() and str.findall(), but they give either a True/False boolean or just the keyword (e.g. 'chair'):
df['Keyword Table'] = df['Description'].str.contains('Table')
AND
keywords=['Table']
df['Keyword Table'] = df['Description'].str.findall((keywords)).apply(set)
Here is a simple way using a regex with named capturing groups:
df = pd.DataFrame({'Desc': ['The table is long', 'The fan is nice', 'The fan is cheap', 'The chair is brown']})
words = ['table', 'fan', 'chair']
regex = '|'.join(f'(?P<{w}>.*{w}.*)' for w in words)
df.join(df['Desc'].str.extract(regex, expand=False).add_prefix('keyword_'))
NB: the named capturing groups cannot contain special characters or spaces. If that is the case, let me know and the group names can be changed (see the sketch after the output below).
Output:
Desc keyword_table keyword_fan keyword_chair
0 The table is long The table is long NaN NaN
1 The fan is nice NaN The fan is nice NaN
2 The fan is cheap NaN The fan is cheap NaN
3 The chair is brown NaN NaN The chair is brown
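As noted above, group names must be valid identifiers. If a keyword contains a space or hyphen, one possible workaround (safe_group is my own helper, not part of the answer) is to sanitize the name before using it as a group:
import re

def safe_group(name):
    # Named groups must be valid identifiers: replace non-word
    # characters and guard against a leading digit.
    return re.sub(r'\W|^(?=\d)', '_', name)

regex = '|'.join(f'(?P<{safe_group(w)}>.*{re.escape(w)}.*)' for w in words)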
Other option: get_dummies
df = pd.DataFrame({'Desc': ['The table is long', 'The fan is nice', 'The fan is cheap', 'The chair is brown']})
words = ['table', 'fan', 'chair']
regex = '(%s)' % '|'.join(words)
df.join(pd.get_dummies(df['Desc'].str.extract(regex, expand=False))
          .mul(df['Desc'], axis=0)
          .add_prefix('keyword_')
       )
Your boolean series can be used as index to slice your dataframe, like this:
df['Keyword Table'] = df[df['Description'].str.contains('Table', na = False)]['Description']
For a list of keywords, you can use apply:
keywords = ['Table', 'Fan', 'Chair']
df['Keywords'] = df[df['Description'].apply(lambda x: any(k in x for k in keywords))]['Description']
Does this piece of code help?
df = pd.DataFrame({'Desc': ['cat is black', 'dog is white']})
kw = ['cat', 'dog']
for k in kw:
    df[k + ' col'] = df.Desc.map(lambda s: s if k in s else '')
Output is:
           Desc       cat col       dog col
0  cat is black  cat is black
1  dog is white                dog is white
Example: the parent pandas df is as follows:
     Description
0   I have a dog
1   I have a cat
2  I want coffee
3   I need sleep
Now I have another df, df_keywords:
  Keywords to Search Keywords to Update
0                cat        cat related
1              sleep      sleep related
2                dog        dog related
3             coffee     coffee related
What I want to do is check whether any of the keywords from df_keywords['Keywords to Search'] is present in df['Description'], and create a new column df['Keywords'] with the corresponding 'Keywords to Update' value.
One way to do it is to create a helper column on df with the keyword matching Keywords to Search in df_keywords. Then, merge df with df_keywords on the keywords, as follows:
keywords = df_keywords['Keywords to Search'].dropna().unique()
df['keyword'] = df['Description'].map(lambda x: np.nan if (pd.isna(x) or len(w:=[y for y in keywords if y in str(x)]) == 0) else w[0])
Result:
print(df)
Description keyword
0 I have a dog dog
1 I have a cat cat
2 I want coffee coffee
3 I need sleep sleep
Then, merge the 2 dataframes with .merge() matching the keywords:
df_out = df.merge(df_keywords, left_on='keyword', right_on='Keywords to Search')
Result:
print(df_out)
Description keyword Keywords to Search Keywords to Update
0 I have a dog dog dog dog related
1 I have a cat cat cat cat related
2 I want coffee coffee coffee coffee related
3 I need sleep sleep sleep sleep related
Finally, remove unwanted columns and rename column,
df_out = df_out.drop(['keyword', 'Keywords to Search'], axis=1).rename({'Keywords to Update': 'Keywords'}, axis=1)
Result:
print(df_out)
Description Keywords
0 I have a dog dog related
1 I have a cat cat related
2 I want coffee coffee related
3 I need sleep sleep related
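As an aside, the walrus-based map above is fairly dense; an equivalent explicit helper (my own restatement of the same logic) may be easier to read:
import numpy as np
import pandas as pd

def first_keyword(x):
    # Return the first keyword contained in x, or NaN when x is
    # null or nothing matches.
    if pd.isna(x):
        return np.nan
    matches = [k for k in keywords if k in str(x)]
    return matches[0] if matches else np.nan

df['keyword'] = df['Description'].map(first_keyword)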
Here's how I've done it. This won't work if the description contains more than one keyword, and may not be the fastest if you have a lot of words to check.
What I've done is create a blank 'Keywords to Search' column in df_parent, then gone through each keyword and checked whether it exists in the 'Description' column using df_parent['Description'].str.contains(keyword). Then I've replaced any True values with the keyword and False values with NaN, and used this new series to fill in the NaNs in the new 'Keywords to Search' column in the parent dataframe. This returns a filled-in 'Keywords to Search' column in df_parent with the keyword corresponding to the description of that row. Finally, I've merged df_parent and df_keywords on the 'Keywords to Search' column and then dropped it to get the desired output.
import pandas as pd
import numpy as np

df_parent = pd.DataFrame({'Description': ['I have a dog', 'I have a cat', 'I want coffee', 'I need sleep']})
df_keywords = pd.DataFrame({'Keywords to Search': ['cat', 'sleep', 'dog', 'coffee'],
                            'Keywords to Update': ['cat related', 'sleep related', 'dog related', 'coffee related']})
df_parent['Keywords to Search'] = np.nan
for keyword in df_keywords['Keywords to Search']:
    df_parent['Keywords to Search'].fillna(df_parent['Description'].str.contains(keyword)
                                           .replace({True: keyword, False: np.nan}), inplace=True)
output = df_parent.merge(df_keywords, on='Keywords to Search').drop('Keywords to Search', axis=1)
output:
Description Keywords to Update
0 I have a dog dog related
1 I have a cat cat related
2 I want coffee coffee related
3 I need sleep sleep related
I have been trying to replace parts of the text in a pandas DataFrame column with keys from a dictionary based on multiple values. Though I have achieved the desired result, the loop is very slow on a large dataset. I would appreciate it if someone could advise me of a more 'Pythonic' or more efficient way of achieving the result. Please see the example below:
df = pd.DataFrame({'Dish': ['A', 'B', 'C'],
                   'Price': [15, 8, 20],
                   'Ingredient': ['apple banana apricot lamb ', 'wheat pork venison', 'orange lamb guinea']
                   })
Dish  Price  Ingredient
A     15     apple banana apricot lamb
B     8      wheat pork venison
C     20     orange lamb guinea
The dictionary is below:
CountryList = {'FRUIT': [['apple'], ['orange'], ['banana']],
               'CEREAL': [['oat'], ['wheat'], ['corn']],
               'MEAT': [['chicken'], ['lamb'], ['pork'], ['turkey'], ['duck']]}
I am trying to replace text in the 'Ingredient' column with the key, based on the dictionary values. For example, 'apple' in the first row would be replaced by the dictionary key 'FRUIT'. The desired table is shown below:
Dish  Price  Ingredient
A     15     FRUIT FRUIT apricot MEAT
B     8      CEREAL MEAT venison
C     20     FRUIT MEAT guinea
I have seen some related queries here where each key has one value; but in this case, there are multiple values for any given key in the dictionary. So far, I have been able to achieve the desired result but it is painfully slow when working with a large dataset.
The code I have used so far to achieve the result is shown below:
countries = list(CountryList.keys())
for country in countries:
    for i in range(len(CountryList[country])):
        lender = CountryList[country][i]
        country = str(country)
        lender = str(lender).replace("['", '').replace("']", '')
        df['Ingredient'] = df['Ingredient'].str.replace(lender, country)
Perhaps this could be done with multiprocessing? Needless to say, my knowledge of Python leaves a lot to be desired.
Any suggestion to speed up the process would be highly appreciated.
Thanks in advance,
Edit: just to add, some keys have more than 60,000 values, and there are about 200 keys in the dictionary, which makes the code very inefficient time-wise.
Change the format of CountryList:
import itertools

CountryList2 = {}
for k, v in CountryList.items():
    for i in itertools.chain.from_iterable(v):
        CountryList2[i] = k
>>> CountryList2
{'apple': 'FRUIT',
'orange': 'FRUIT',
'banana': 'FRUIT',
'oat': 'CEREAL',
'wheat': 'CEREAL',
'corn': 'CEREAL',
'chicken': 'MEAT',
'lamb': 'MEAT',
'pork': 'MEAT',
'turkey': 'MEAT',
'duck': 'MEAT'}
Now you can use replace:
df['Ingredient'] = df['Ingredient'].replace(CountryList2, regex=True)
>>> df
Dish Price Ingredient
0 A 15 FRUIT FRUIT apricot MEAT
1 B 8 CEREAL MEAT venison
2 C 20 FRUIT MEAT guinea
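One caveat (my own note, not from the answer): replace with regex=True treats the dictionary keys as regex patterns, so if any keyword could contain metacharacters it is safer to escape them first:
import re

escaped = {re.escape(k): v for k, v in CountryList2.items()}
df['Ingredient'] = df['Ingredient'].replace(escaped, regex=True)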
You can build a reverse index of product to type, by creating a dictionary where the keys are the values of the sublists
product_to_type = {}
for typ, product_lists in CountryList.items():
for product_list in product_lists:
for product in product_list:
product_to_type[product] = typ
A little Python magic lets you compress this step into a dict comprehension:
product_to_type = {product:typ for typ, product_lists in CountryList.items()
for product_list in product_lists for product in product_list}
Then you can create a function that splits the ingredients and maps them to type and apply that to the dataframe.
import pandas as pd

CountryList = {'FRUIT': [['apple'], ['orange'], ['banana']],
               'CEREAL': [['oat'], ['wheat'], ['corn']],
               'MEAT': [['chicken'], ['lamb'], ['pork'], ['turkey'], ['duck']]}

product_to_type = {product: typ for typ, product_lists in CountryList.items()
                   for product_list in product_lists for product in product_list}

def convert_product_to_type(products):
    # Map each whitespace-separated product to its type, leaving
    # unknown words (e.g. 'apricot') unchanged.
    return " ".join(product_to_type.get(product, product)
                    for product in products.split(" "))

df = pd.DataFrame({'Dish': ['A', 'B', 'C'],
                   'Price': [15, 8, 20],
                   'Ingredient': ['apple banana apricot lamb ', 'wheat pork venison', 'orange lamb guinea']
                   })

df["Ingredient"] = df["Ingredient"].apply(convert_product_to_type)
print(df)
Note: This solution splits the ingredient list on word boundaries which assumes that ingredients themselves don't have spaces in them.
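If ingredients can be surrounded by punctuation rather than plain spaces, a possible alternative (a sketch reusing product_to_type from above, not part of the answer) is a single compiled regex over all known products with a dict lookup in the replacement callback:
import re

pattern = re.compile(r'\b(%s)\b' % '|'.join(map(re.escape, product_to_type)))
df['Ingredient'] = df['Ingredient'].str.replace(pattern,
                                                lambda m: product_to_type[m.group(0)],
                                                regex=True)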
If you want to use regex, just join all of the values in CountryList with a pipe | for each of the keys, and then call Series.str.replace for each key; it will be way faster than the way you are trying.
joined = {key: '|'.join(item[0] for item in value) for key, value in CountryList.items()}
for key in joined:
    df['Ingredient'] = df['Ingredient'].str.replace(joined[key], key, regex=True)
OUTPUT:
Dish Price Ingredient
0 A 15 FRUIT FRUIT apricot MEAT
1 B 8 CEREAL MEAT venison
2 C 20 FRUIT MEAT guinea
Another approach would be to reverse the key/value pairs in the dictionary, and then use dict.get for each word (with the word itself as the default) after splitting the words in the Ingredient column:
reversedContries={item[0]:key for key,value in CountryList.items() for item in value}
df['Ingredient'].apply(lambda x: ' '.join(reversedContries.get(y,y) for y in x.split()))
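Output:
0    FRUIT FRUIT apricot MEAT
1         CEREAL MEAT venison
2          FRUIT MEAT guinea
Name: Ingredient, dtype: object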
Having a dataframe df with the following columns:
Index(['category', 'synonyms_text', 'enabled', 'stems_text'], dtype='object')
I am interested in getting just the rows whose synonyms_text contains the exact word food, and not, for instance, seafood:
df_text= df_syn.loc[df_syn['synonyms_text'].str.contains('food')]
Having the following result (which contains seafood, foodlocker and others that are not wanted):
category synonyms_text \
130 Fishing seafarm, seafood, shellfish, sportfish
141 Refrigeration coldstorage, foodlocker, freeze, fridge, ice, refrigeration
183 Food Service cook, fastfood, foodserve, foodservice, foodtruck, mealprep
200 Restaurant expresso, food, galley, gastropub, grill, java, kitchen
377 fastfood carryout, fastfood, takeout
379 Animal Supplies feed, fodder, grain, hay, petfood
613 store convenience, food, grocer, grocery, market
Then I sent the result to a list to get just 'food' as a word:
food_l=df_text['synonyms_text'].str.split().tolist()
However, I am getting in the list values as the following:
['carryout,', 'fastfood,', 'takeout']
So I get rid of the commas:
food_l = [[x.replace(",", "") for x in l] for l in food_l]
Then I keep only the lists that contain the exact word food:
food_l = [[l for x in l if "food" == x] for l in food_l]
After that, I get rid of the empty lists:
food_l = [x for x in food_l if x != []]
Finally, I flatten the list of lists to get the final result:
food_l = [item for sublist in food_l for item in sublist]
And the final result is as follows:
[['bar', 'bistro', 'breakfast', 'buffet', 'cabaret', 'cafe', 'cantina', 'cappuccino', 'chai', 'coffee', 'commissary', 'cuisine', 'deli', 'dhaba', 'dine', 'diner', 'dining', 'eat', 'eater', 'eats', 'edible', 'espresso', 'expresso', 'food', 'galley', 'gastropub', 'grill', 'java', 'kitchen', 'latte', 'lounge', 'pizza', 'pizzeria', 'pub', 'publichouse', 'restaurant', 'roast', 'sandwich', 'snack', 'snax', 'socialhouse', 'steak', 'sub', 'sushi', 'takeout', 'taphouse', 'taverna', 'tea', 'tiffin', 'trattoria', 'treat', 'treatery'], ['convenience', 'food', 'grocer', 'grocery', 'market', 'mart', 'shop', 'store', 'variety']]
@Erfan This dataframe can be used as a test:
df= pd.DataFrame({'category':['Fishing','Refrigeration','store'],'synonyms_text':['seafood','foodlocker','food']})
Both give empty:
df_tmp= df.loc[df['synonyms_text'].str.match('\bfood\b')]
df_tmp= df.loc[df['synonyms_text'].str.contains(pat='\bfood\b', regex= True)]
Do you know a better way to get just the rows with the single word food without going through all this painful process? Is there a function other than contains to look for an exact match in the values of the dataframe?
Thanks
Example dataframe:
df = pd.DataFrame({'category':['Fishing','Refrigeration','store'],
'synonyms_text':['seafood','foodlocker','food']})
print(df)
category synonyms_text
0 Fishing seafood
1 Refrigeration foodlocker
2 store food # <-- we want only the rows with exact "food"
Three ways we can do this:
str.match
str.contains
str.extract (not very useful here)
Note that the patterns below are raw strings; in the question, '\bfood\b' without the r prefix turns \b into the backspace character, which is why both of those attempts came back empty.
# 1
df['synonyms_text'].str.match(r'\bfood\b')
# 2
df['synonyms_text'].str.contains(r'\bfood\b')
# 3
df['synonyms_text'].str.extract(r'(\bfood\b)').eq('food')
output
0 False
1 False
2 True
Name: synonyms_text, dtype: bool
Finally we use the boolean series to filter the dataframe with .loc:
m = df['synonyms_text'].str.match(r'\bfood\b')
df.loc[m]
output
category synonyms_text
2 store food
Bonus:
To match case-insensitively, use the inline flag (?i) at the start of the pattern:
For example:
df['synonyms_text'].str.match(r'(?i)\bfood\b')
Which will match: food, Food, FOOD, fOoD
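For the original data, where food sits in the middle of a comma-separated synonym list, str.contains is the variant that generalizes, since str.match only tests from the start of the string. A quick sketch against rows shaped like the question's (df2 is my own example frame):
import pandas as pd

df2 = pd.DataFrame({'synonyms_text': ['seafarm, seafood, shellfish',
                                      'expresso, food, galley, grill']})
# Only the second row survives: 'seafood' fails the word boundary,
# while ', food,' satisfies it.
print(df2[df2['synonyms_text'].str.contains(r'\bfood\b')])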