Pandas mapping to multiple dictionary items for categorising data - python

I have a large dataframe containing a 'Description' column.
I've compiled a sizeable dictionary of lists, where the key is basically the Category, and the items are lists of possible (sub)strings contained in the description column.
I want to use the dictionary to classify each entry in the dataframe based on this description... Unfortunately I can't figure out how to apply a dictionary of lists to map onto a dataframe (it feels like it should be some concoction of map, isin and str.contains, but I've had no joy). I've included code to generate a model dataset below:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(10, 1), columns=list('A'))
df['Description'] = ['White Ford Escort', 'Irish Draft Horse',
                     'Springer spaniel (dog)', 'Green Vauxhall Corsa',
                     'White Van', 'Labrador dog', 'Black horse',
                     'Blue Van', 'Red Vauxhall Corsa', 'Bear']
This model dataset would then ideally be mapped against the following dictionary (named d here to avoid shadowing the builtin dict):
d = {'Car': ['Ford Escort', 'Vauxhall Corsa', 'Van'],
     'Animal': ['Dog', 'Horse']}
to generate a new column in the dataframe, with the result as such:
| | A | Description | Type |
|---|----------------------|------------------------|--------|
| 0 | -1.4120290137842615 | White Ford Escort | Car |
| 1 | -0.3141036399049358 | Irish Draft Horse | Animal |
| 2 | 0.49374344901643896 | Springer spaniel (dog) | Animal |
| 3 | 0.013654965767323723 | Green Vauxhall Corsa | Car |
| 4 | -0.18271952280002862 | White Van | Car |
| 5 | 0.9519081000007026 | Labrador dog | Animal |
| 6 | 0.403258571154998 | Black horse | Animal |
| 7 | -0.8647792960494813 | Blue Van | Car |
| 8 | -0.12429427259820519 | Red Vauxhall Corsa | Car |
| 9 | 0.7695980616520571 | Bear | - |
The numbers are obviously irrelevant here, but there are other columns in my dataframes and I wanted that reflected.
I'm happy to use regex, or perhaps change my dictionary to a dataframe and do a join (I've considered multiple routes).
This feels similar to a recent question, but it's not the same and certainly its answer hasn't helped me.
Sorry if I've been stupid somewhere and this is really simple - it does feel like it should be, but I'm missing something.
Thanks
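For reference, a minimal sketch of the str.contains idea mentioned above, reusing the df and d defined in the question (loop order decides ties, and rows matching no category keep NaN in the new column):

import re

# One case-insensitive pattern per category; matching rows get that
# category in a new 'Type' column, everything else stays NaN.
for cat, items in d.items():
    pattern = '|'.join(re.escape(item) for item in items)
    df.loc[df['Description'].str.contains(pattern, case=False), 'Type'] = cat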

You can use the fuzzywuzzy library to solve this. Make sure to install it via pip install fuzzywuzzy.
import numpy as np
import pandas as pd
from fuzzywuzzy import process
df = pd.DataFrame(np.random.randn(10, 1), columns=list('A'))
df['Description'] = ['White Ford Escort', 'Irish Draft Horse',
                     'Springer spaniel (dog)', 'Green Vauxhall Corsa',
                     'White Van', 'Labrador dog', 'Black horse',
                     'Blue Van', 'Red Vauxhall Corsa', 'Bear']
d = {'Car': ['Ford Escort', 'Vauxhall Corsa', 'Van'],
     'Animal': ['Dog', 'Horse']}
# Construct a dataframe from the dictionary
df1 = pd.DataFrame([*d.values()], index=d.keys()).T.melt().dropna()
# Get relevant matches using the library.
m = df.Description.apply(lambda x: process.extract(x, df1.value)[0])
# concat the matches with original df
df2 = pd.concat([df, m[m.apply(lambda x: x[1]>80)].apply(lambda x: x[0])], axis=1)
df2.columns = [*df.columns, 'matches']
# Then merge it with df1
df2 = df2.merge(df1, left_on='matches', right_on='value', how='left')
# Drop columns that are not required and rename.
df2 = df2.drop(['matches', 'value'], axis=1).rename(columns={'variable': 'Type'})
print (df2)
A Description Type
0 -0.423555 White Ford Escort Car
1 0.294092 Irish Draft Horse Animal
2 1.949626 Springer spaniel (dog) Animal
3 -1.315937 Green Vauxhall Corsa Car
4 -0.250184 White Van Car
5 0.186645 Labrador dog Animal
6 -0.052433 Black horse Animal
7 -0.003261 Blue Van Car
8 0.418292 Red Vauxhall Corsa Car
9 0.241607 Bear NaN

Consider the following approach; a sketch follows the list.
1. Invert your dictionary first, making everything lowercase.
2. Then, per row, split Description into words and make them lowercase, e.g. 'Springer spaniel (dog)' -> ['springer', 'spaniel', '(', 'dog', ')'].
3. For each lowercase word from (2), look it up in the inverted dictionary from (1), using apply.
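A rough sketch of those steps, assuming the question's df; multi-word dictionary entries are split word-by-word so that e.g. 'Ford Escort' still resolves (the helper names are mine):

import numpy as np

d = {'Car': ['Ford Escort', 'Vauxhall Corsa', 'Van'],
     'Animal': ['Dog', 'Horse']}

# (1) Inverted, lowercase lookup: every word of every entry -> category.
inverted = {word.lower(): cat
            for cat, items in d.items()
            for item in items
            for word in item.split()}

def classify(description):
    # (2) Lowercase the description and split it into words; parentheses
    # are stripped so '(dog)' becomes 'dog'.
    words = description.lower().replace('(', ' ').replace(')', ' ').split()
    # (3) Return the first category hit, or NaN when nothing matches.
    for word in words:
        if word in inverted:
            return inverted[word]
    return np.nan

df['Type'] = df['Description'].apply(classify)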

Related

How to Check if Firstnames and Lastnames are English?

I have a csv file which has two columns and about 9,000 rows. Column 1 contains the firstname of a respondent in a survey, column 2 contains the lastname of a respondent in a survey, so each row is an observation.
These surveys were conducted in a very diverse place. I am trying to find a way to tell whether a respondent's firstname is of English (British or American) origin or not, and the same for the lastname.
This task is very far away from my area of expertise. After reading interesting discussions online (here and here), I have thought about three ways:
1- Take a dataset of the most common triplets (groups of 3 letters often found together in English) or quadruplets (groups of 4 letters often found together in English) and check, for each firstname and lastname, whether it contains these groups of letters.
2- Use a dataset of British names (say the most X common names in the UK in the early XX century) and match these names based on proximity to my dataset. These datasets could be good I think: data1, data2, data3.
3- Use python and an interface to detect what is (most likely) English from what is not.
If anyone has advice on this, or can share experience etc., that would be great!
I am attaching an example of the data (I made up the names) and of the expected output.
NB: Please note that I am perfectly aware that classifying names according to an English/Non English dichotomy is not without drawbacks and semantic issues.
I built something a while back that is quite similar. Summary below; a rough sketch follows.
Created 2 source lists: a firstname list and a lastname list.
Created 4+ comparison lists (English firstname list, English lastname list, et al.).
Then used an in_array-style membership check to compare a source first name to the comparison first names.
Then I used a big if statement to check the lists against each other: Eng.First vs Src.First, American.First vs Src.First, Irish.First vs Src.First, and so on. If you are thinking of using your first bullet as an option (e.g. parts and pieces of a name), I wrote a paper which includes some source code that may be able to help:
Ordered Match Ratio as a Method for Detecting Program Abuse / Fraud
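A minimal sketch of that list-comparison idea in pandas (the comparison sets below are placeholders I made up, not real reference datasets):

import pandas as pd

# Placeholder comparison sets; in practice these would be loaded from
# reference name datasets like the ones linked in the question.
english_first = {'james', 'evelyne', 'max'}
english_last = {'brown', 'black', 'smith'}

df = pd.DataFrame([['James', 'Brown'], ['Musa', 'Bemba']],
                  columns=['firstname', 'name'])

# isin plays the role of the in_array check: 1 if the lowercased name
# is in the comparison set, else 0.
df['f_english'] = df['firstname'].str.lower().isin(english_first).astype(int)
df['n_english'] = df['name'].str.lower().isin(english_last).astype(int)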
Although the best solution would probably be to train a classification model on top of BERT or a similar language model, a crude solution would be to use zero-shot classification. The example below uses transformers. It does a fairly decent job, although you see some semantic issues pop up: the classification of the name Black, for example, is likely distorted due to it also being a color.
import pandas as pd
from transformers import pipeline
data = [['James', 'Brown'], ['Gerhard', 'Schreuder'], ['Musa', 'Bemba'], ['Morris D.', 'Kemba'], ['Evelyne', 'Fontaine'], ['Max D.', 'Kpali Jr.'], ['Musa', 'Black']]
df = pd.DataFrame(data, columns=['firstname', 'name'])
classifier = pipeline("zero-shot-classification")
firstnames = df['firstname'].tolist()
lastnames = df['name'].tolist()
candidate_labels = ["English or American", "not English or American"]
hypothesis_template = "This name is {}."
results_firstnames = classifier(firstnames, candidate_labels, hypothesis_template=hypothesis_template)
results_lastnames = classifier(lastnames, candidate_labels, hypothesis_template=hypothesis_template)
df['f_english'] = [1 if i['labels'][0] == 'English or American' else 0 for i in results_firstnames]
df['n_english'] = [1 if i['labels'][0] == 'English or American' else 0 for i in results_lastnames]
df
Output:
| | firstname | name | f_english | n_english |
|---:|:------------|:----------|------------:|------------:|
| 0 | James | Brown | 1 | 1 |
| 1 | Gerhard | Schreuder | 0 | 0 |
| 2 | Musa | Bemba | 0 | 0 |
| 3 | Morris D. | Kemba | 1 | 0 |
| 4 | Evelyne | Fontaine | 1 | 0 |
| 5 | Max D. | Kpali Jr. | 1 | 0 |
| 6 | Musa | Black | 0 | 0 |

Identify the matched string from list of strings using any()?

There are a bunch of similar questions that have the same solution: how do I check my list of strings against a larger string and see if there's a match? For example: "How to check if a string contains an element from a list in Python" and "How to check if a line has one of the strings in a list?"
I have a different problem: how do I check my list of strings against a larger string, see if there's a match, and isolate the string so I can perform another string operation relative to the matched string?
Here's some sample data:
| id | data |
|--------|---------------------|
| 123131 | Bear Cat Apple Dog |
| 123131 | Cat Ap.ple Mouse |
| 231321 | Ap ple Bear |
| 231321 | Mouse Ap ple Dog |
Ultimately, I'm trying to find all instances of "apple" (['Apple', 'Ap.ple', 'Ap ple']) and, while it doesn't really matter which one is matched, I need to be able to find out whether Cat or Bear exists before or after it. The position of the matched string does not matter, only the ability to determine what is before or after it.
In "Bear Cat Apple Dog", Bear is before Apple, even though Cat is in the way.
Here's where I am at with my sample code:
data = [[123131, "Bear Cat Apple Dog"], ['123131', "Cat Ap.ple Mouse"], ['231321', "Ap ple Bear"], ['231321', "Mouse Ap ple Dog"]]
df = pd.DataFrame(data, columns = ['id', 'data'])
def matching_function(m):
    matching_strings = ['Apple', 'Ap.ple', 'Ap ple']
    if any(x in m for x in matching_strings):
        # do something to print the matched string
        return True

df["matched"] = df['data'].apply(matching_function)
Would it be simply better to just do this in regex?
Right now, the function simply returns True. But if there's a match, I imagine it could also return matched_bear_before or matched_bear_after (or the same for Cat) and fill that into the df['matched'] column.
Here's some sample output:
| id | data | matched |
|--------|---------------------|---------|
| 123131 | Bear Cat Apple Dog | TRUE |
| 123131 | Cat Ap.ple Mouse | TRUE |
| 231321 | Ap ple Bear | TRUE |
| 231321 | Mouse Ap ple Dog | FALSE |
You can use the following pattern to check whether Cat or Bear appears before or after the word of interest, in this case Apple, Ap.ple, or Ap ple:
^(?:Cat|Bear).*Ap[. ]*ple|Ap[. ]*ple.*(?:Cat|Bear)
To create the new dataframe column which satisfies the condition, you can combine map and Series.str.match:
>>> df['matched'] = list(map(lambda m: "True" if m else "False", df['data'].str.match('^(?:Cat|Bear).*Ap[. ]*ple|Ap[. ]*ple.*(?:Cat|Bear)')))
or using numpy.where:
>>> df['matched'] = numpy.where(df['data'].str.match('^(?:Cat|Bear).*Ap[. ]*ple|Ap[. ]*ple.*(?:Cat|Bear)'),'True','False')
will result in:
>>> df
id data matched
0 123131 Bear Cat Apple Dog True
1 123131 Cat Ap.ple Mouse True
2 231321 Ap ple Bear True
3 231321 Mouse Ap ple Dog False
Use Series.str.extract to extract three new columns (before, key, after) from the df['data'] column, then use Series.str.findall on each of the before and after columns to find all the matching before and after words:
import re
keys = ['Apple', 'Ap.ple', 'Ap ple']
markers = ['Cat', 'Bear']
p = r'(?P<before>.*?)' + r'(?P<key>' +'|'.join(rf'\b{re.escape(k)}\b' for k in keys) + r')' + r'(?P<after>.*)'
m = '|'.join(markers)
df[['before', 'key', 'after']] = df['data'].str.extract(p)
df['before'] = df['before'].str.findall(m)
df['after'] = df['after'].str.findall(m)
df['matched'] = df['before'].str.len().gt(0) | df['after'].str.len().gt(0)
# print(df)
id data before key after matched
0 123131 Bear Cat Apple Dog [Bear, Cat] Apple [] True
1 123131 Cat Ap.ple Mouse [Cat] Ap.ple [] True
2 231321 Ap ple Bear [] Ap ple [Bear] True
3 231321 Mouse Ap ple Dog [] Ap ple [] False
You can modify the function using regex and the walrus operator to simplify it:
import re

def matching_function(m):
    matching_strings = ['Apple', 'Ap.ple', 'Ap ple']
    # The walrus operator (:=) requires Python 3.8+; note that the '.' in
    # 'Ap.ple' acts as a regex wildcard once the list is joined into a pattern.
    if bool(results := re.search('|'.join(matching_strings), m)):
        print(results[0])
        return True

Change all characters in a string to unicode applied to whole column pandas

I have a string:
fruit1 = 'apple'
and to change it to its unicode code points:
fruit1 = int(''.join(str(ord(char)) for char in fruit1))
print(fruit1)
97112112108101
Is it possible to apply the same concept over a whole column without running a for loop on every value?
Sample table:
| Fruit |
|-------|
| apple |
| berry |
| kiwi  |
Desired output:
| Number         |
|----------------|
| 97112112108101 |
| 98101114114121 |
| 107105119105   |
Unfortunately map and apply are loops under the hood, but they work here:
df['new'] = df['Fruit'].map(lambda x: int(''.join(str(ord(char)) for char in x)))
#alternative
#df['new'] = df['Fruit'].apply(lambda x: int(''.join(str(ord(char)) for char in x)))
print (df)
Fruit new
0 apple 97112112108101
1 berry 98101114114121
2 kiwi 107105119105
Aside from the reason you need that (?!), yes, it is possible:
df['Fruit'] = df['Fruit'].apply(lambda fruit1: int(''.join(str(ord(char)) for char in fruit1)))

Gather data, count, and return a list of dictionaries even when the data does not exist

Let's say I have a MySQL table like this:
| id | type   | sub_type | customer |
|----|--------|----------|----------|
| 1  | animal | cat      | John     |
| 2  | animal | dog      | Marry    |
| 3  | animal | fish     | Marry    |
| 4  | animal | bird     | John     |
What I have to do is gather data by customer and count rows by sub_type. The animal type has 4 sub_types (cat, dog, fish, bird); John has two sub_types (cat, bird) and Marry also has two sub_types (dog, fish). Let's say I want to get a result for John. It should look like this:
[
{name='cat', count=1},
{name='dog', count=0},
{name='fish', count=0},
{name='bird', count=1}
]
When I want to get a result for Marry, it should look like this:
[
{name='cat', count=0},
{name='dog', count=1},
{name='fish', count=1},
{name='bird', count=0}
]
So, sub_types that are not in the database should be returned with a count of 0. Let's say I want to get a result for Matthew. Since there is no data for Matthew, the result should look like this:
[
{name='cat', count=0},
{name='dog', count=0},
{name='fish', count=0},
{name='bird', count=0}
]
I usually use setdefault() to build the result. My code would look something like this:
tmp = dict()
for row in queryset:
    tmp.setdefault(row.customer, dict(cat=0, dog=0, fish=0, bird=0))
    if row.sub_type == 'cat':
        tmp[row.customer][row.sub_type] += 1
However, I want to know if there is another or more elegant way to do this.
Assuming that you have a table named 'people' that contains the field 'name' with the entries
name
--------
John
Marry
Matthew
and that the table mentioned above is a table called 'pets'.
You can use the following query to build your result-set for each person
select
A.name as customer,
(select count(*) from pets where customer=A.name and sub_type='cat') as cat,
(select count(*) from pets where customer=A.name and sub_type='dog') as dog,
(select count(*) from pets where customer=A.name and sub_type='fish') as fish,
(select count(*) from pets where customer=A.name and sub_type='bird') as bird
from people A
with the result listed below
customer  cat  dog  fish  bird
John        1    0     0     1
Marry       0    1     1     0
Matthew     0    0     0     0
Add an additional WHERE clause to filter by name, or run it as-is to get the summary result for everyone at once.
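If you would rather build the zero-filled result on the Python side, here is a minimal sketch with collections.Counter (assuming rows expose .customer and .sub_type attributes, as in the question's code):

from collections import Counter

SUB_TYPES = ['cat', 'dog', 'fish', 'bird']

def counts_for(customer, queryset):
    # Tally this customer's rows by sub_type; sub_types with no rows
    # fall back to a count of 0.
    tally = Counter(row.sub_type for row in queryset if row.customer == customer)
    return [{'name': s, 'count': tally.get(s, 0)} for s in SUB_TYPES]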

Iterate two pandas dataframe columns at the same time and return values from each column into separate places

I'm looking for a solution that makes it possible to iterate over two dataframe columns at the same time, take the values from each column, and put them in two separate places in a text.
My code so far:
import random
import numpy as np

def fetchingMetaTitle(x):
    keywords = df['Keyword']
    title1 = f'{x.title()} - We have a great selection of {x} | Example.com'
    title2 = f'{x.title()} - Choose among several {x} here | Example.com'
    title3 = f'{x.title()} - Buy cheap {x} easy and fast | Example.com'
    for i in keywords:
        if i.lower() in x.lower():
            return random.choice([title1, title2, title3])
        else:
            return np.nan

df['Category Meta Title'] = df['Keyword'].apply(fetchingMetaTitle)
Which will give me the following result:
+---------+----------------+-----------------------------------------------------------+
| Keyword | Category Title | Category Meta Title |
+---------+----------------+-----------------------------------------------------------+
| jeans | blue jeans | Jeans - We have a great selection of jeans | Example.com |
| jackets | red jackets | Jackets - Choose among several jackets here | Example.com |
| shoes | black shoes | Shoes - Buy cheap shoes easy and fast | Example.com |
+---------+----------------+-----------------------------------------------------------+
At the moment I'm only fetching from df['Keyword'] and returning its values into df['Category Meta Title'] in two places. Instead of adding it twice, I would like to add the values from df['Category Title'] as the secondary value.
So the result would be the following:
+---------+----------------+---------------------------------------------------------------+
| Keyword | Category Title | Category Meta Title |
+---------+----------------+---------------------------------------------------------------+
| jeans | blue jeans | Jeans - We have a great selection of blue jeans | Example.com |
| jackets | red jackets | Jackets - Choose among several red jackets here | Example.com |
| shoes | black shoes | Shoes - Buy cheap black shoes easy and fast | Example.com |
+---------+----------------+---------------------------------------------------------------+
Thanks in advance!
IIUC, this function will do what you need, using the str.format syntax rather than the f'{string}' format:
def fetchingMetaTitle(row):
    title1 = '{} - We have a great selection of {} | Example.com'.format(
        row['Keyword'].title(), row['Category Title'])
    title2 = '{} - Choose among several {} here | Example.com'.format(
        row['Keyword'].title(), row['Category Title'])
    title3 = '{} - Buy cheap {} easy and fast | Example.com'.format(
        row['Keyword'].title(), row['Category Title'])
    return random.choice([title1, title2, title3])

df['Category Meta Title'] = df.apply(fetchingMetaTitle, axis=1)
>>> df
Keyword Category Title Category Meta Title
0 jeans blue jeans Jeans - Choose among several blue jeans here |...
1 jackets red jackets Jackets - We have a great selection of red jac...
2 shoes black shoes Shoes - Buy cheap black shoes easy and fast | ...
Alternatively, with the f'{string}' method:
def fetchingMetaTitle(row):
    keyword = row['Keyword'].title()
    cat = row['Category Title']
    title1 = f'{keyword} - We have a great selection of {cat} | Example.com'
    title2 = f'{keyword} - Choose among several {cat} here | Example.com'
    title3 = f'{keyword} - Buy cheap {cat} easy and fast | Example.com'
    return random.choice([title1, title2, title3])

df['Category Meta Title'] = df.apply(fetchingMetaTitle, axis=1)
Will do the same thing.
Note: I'm not exactly sure what the goal of your if statement was, so if you clarify that, I could try to insert its functionality into the functions above...
You could create a new column containing a template for the sentence plus both of the parameters. This fulfils your requirement of having access to the row values from both of your original columns. In the next step you can apply a custom function which creates the sentences for you and puts them in a res column.
import pandas as pd
df = pd.DataFrame({'A':['aa','bb','cc'], 'B':['a','b','c'], 'C':['1.{}, {}', '2.{}, {}', '3.{}, {}']})
df['combined'] = df[['A','B','C']].values.tolist()
df['res'] = df['combined'].apply(lambda x: x[2].format(x[0], x[1]))
print(df['res'])
Using this approach, based on the following DataFrame df:
A B C
0 aa a 1.{}, {}
1 bb b 2.{}, {}
2 cc c 3.{}, {}
The output is:
0 1.aa, a
1 2.bb, b
2 3.cc, c
