Identify the matched string from a list of strings using any()? - python

There are a bunch of similar questions that have the same solution: how do I check my list of strings against a larger string and see if there's a match? E.g. "How to check if a string contains an element from a list in Python" and "How to check if a line has one of the strings in a list?"
I have a different problem: how do I check my list of strings against a larger string, see if there's a match, and isolate the string so I can perform another string operation relative to the matched string?
Here's some sample data:
| id | data |
|--------|---------------------|
| 123131 | Bear Cat Apple Dog |
| 123131 | Cat Ap.ple Mouse |
| 231321 | Ap ple Bear |
| 231321 | Mouse Ap ple Dog |
Ultimately, I'm trying to find all instances of "apple" ['Apple', 'Ap.ple', 'Ap ple'] and, while it doesn't really matter which one is matched, I need to be able to find out if Cat or Bear exist before it or after it. Position of the matched string does not matter, only an ability to determine what is before or after it.
In Bear Cat Apple Dog, Bear is before Apple, even though Cat is in between.
Here's where I am at with my sample code:
import pandas as pd

data = [[123131, "Bear Cat Apple Dog"], ['123131', "Cat Ap.ple Mouse"], ['231321', "Ap ple Bear"], ['231321', "Mouse Ap ple Dog"]]
df = pd.DataFrame(data, columns=['id', 'data'])

def matching_function(m):
    matching_strings = ['Apple', 'Ap.ple', 'Ap ple']
    if any(x in m for x in matching_strings):
        # do something to print the matched string
        return True

df["matched"] = df['data'].apply(matching_function)
Would it be simply better to just do this in regex?
Right now, the function simply returns True. But if there's a match, I imagine it could also return matched_bear_before or matched_bear_after (or the same for Cat) and fill that into the df['matched'] column.
Here's some sample output:
| id | data | matched |
|--------|---------------------|---------|
| 123131 | Bear Cat Apple Dog | TRUE |
| 123131 | Cat Ap.ple Mouse | TRUE |
| 231321 | Ap ple Bear | TRUE |
| 231321 | Mouse Ap ple Dog | FALSE |

You can use the following pattern to check whether either Cat or Bear appears before or after the word of interest, in this case Apple or Ap.ple or Ap ple.
^(?:Cat|Bear).*Ap[. ]*ple|Ap[. ]*ple.*(?:Cat|Bear)
To create the new dataframe column which satisfies the condition, you can combine map with Series.str.match:
>>> df['matched'] = list(map(lambda m: "True" if m else "False", df['data'].str.match('^(?:Cat|Bear).*Ap[. ]*ple|Ap[. ]*ple.*(?:Cat|Bear)')))
or using numpy.where:
>>> df['matched'] = numpy.where(df['data'].str.match('^(?:Cat|Bear).*Ap[. ]*ple|Ap[. ]*ple.*(?:Cat|Bear)'),'True','False')
will result in:
>>> df
id data matched
0 123131 Bear Cat Apple Dog True
1 123131 Cat Ap.ple Mouse True
2 231321 Ap ple Bear True
3 231321 Mouse Ap ple Dog False
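Note that Series.str.match only succeeds when the pattern matches at the start of the string, so the second alternative effectively requires the Ap[. ]*ple variant to open the row; a hypothetical row like "Dog Ap ple Bear" would come back False even though Bear follows the key. Series.str.contains uses search semantics and sidesteps this; a sketch (the ^ anchor is dropped, since contains searches anywhere):
pattern = r'(?:Cat|Bear).*Ap[. ]*ple|Ap[. ]*ple.*(?:Cat|Bear)'
df['matched'] = df['data'].str.contains(pattern)  # yields a boolean column directly, no map/where needed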

Use Series.str.extract to extract three new columns from the df['data'] column, i.e. key, before & after; then use Series.str.findall on each of the before & after columns to find all the matching before and after words:
import re
keys = ['Apple', 'Ap.ple', 'Ap ple']
markers = ['Cat', 'Bear']
p = r'(?P<before>.*?)' + r'(?P<key>' +'|'.join(rf'\b{re.escape(k)}\b' for k in keys) + r')' + r'(?P<after>.*)'
m = '|'.join(markers)
df[['before', 'key', 'after']] = df['data'].str.extract(p)
df['before'] = df['before'].str.findall(m)
df['after'] = df['after'].str.findall(m)
df['matched'] = df['before'].str.len().gt(0) | df['after'].str.len().gt(0)
# print(df)
id data before key after matched
0 123131 Bear Cat Apple Dog [Bear, Cat] Apple [] True
1 123131 Cat Ap.ple Mouse [Cat] Ap.ple [] True
2 231321 Ap ple Bear [] Ap ple [Bear] True
3 231321 Mouse Ap ple Dog [] Ap ple [] False

Modified your function using regex and the walrus operator to simplify:
import re

def matching_function(m):
    matching_strings = ['Apple', 'Ap.ple', 'Ap ple']
    if bool(results := re.search('|'.join(matching_strings), m)):
        print(results[0])
        return True
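Going a step further, a hedged sketch (Python 3.8+ for the walrus operator) that also returns the matched_bear_before / matched_bear_after style labels the asker mentioned; re.escape keeps the '.' in 'Ap.ple' literal rather than a regex wildcard:
import re

def matching_function(m):
    matching_strings = ['Apple', 'Ap.ple', 'Ap ple']
    # escape each key so '.' is matched literally
    pattern = '|'.join(re.escape(s) for s in matching_strings)
    if match := re.search(pattern, m):
        before, after = m[:match.start()], m[match.end():]
        for marker in ('bear', 'cat'):
            if marker in before.lower():
                return f'matched_{marker}_before'
            if marker in after.lower():
                return f'matched_{marker}_after'
    return False

df['matched'] = df['data'].apply(matching_function)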

Related

Change all characters in a string to Unicode code points, applied to a whole column in pandas

I have a string:
fruit1 = 'apple'
and to change it to Unicode code points:
fruit1 = int(''.join(str(ord(char)) for char in fruit1))
print(fruit1)
97112112108101
Is it possible to apply the same concept over a whole column without running a for loop on every value?
Sample Table:
| Fruit |
|-------|
| apple |
| berry |
| kiwi |
Desired output:
| Number |
|----------------|
| 97112112108101 |
| 98101114114121 |
| 107105119105 |
Unfortunately map and apply are loops under the hood, but they work here:
df['new'] = df['Fruit'].map(lambda x: int(''.join(str(ord(char)) for char in x)))
#alternative
#df['new'] = df['Fruit'].apply(lambda x: int(''.join(str(ord(char)) for char in x)))
print (df)
Fruit new
0 apple 97112112108101
1 berry 98101114114121
2 kiwi 107105119105
Aside from the reason you need that (?!), yes, it is possible:
df['Fruit'] = df['Fruit'].apply(lambda fruit1: int(''.join(str(ord(char)) for char in fruit1)))
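One caveat worth noting: the concatenation isn't uniquely decodable, since code points vary in digit count ('97112' could be 97,112 or 971,12). If reversibility ever matters, a sketch that zero-pads each code point to three digits (assuming ASCII input; keep the result as a string so the leading zero survives):
df['Number'] = df['Fruit'].map(lambda x: ''.join(str(ord(c)).zfill(3) for c in x))
# 'apple' -> '097112112108101'; decode by slicing into 3-digit chunks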

How to get rid of NaN values in csv file? Python

First of all, I know there are answers about this matter, but none of them has worked for me so far. Anyway, I would like to know your answers, although I have already tried those solutions.
I have a csv file called mbti_datasets.csv. The label of the first column is type and the second column is description. Each row represents a new personality type (with its respective type and description).
TYPE | DESCRIPTION
a | This personality likes to eat apples...\nThey look like monkeys...\nIn fact, are strong people...
b | b.description
c | c.description
d | d.description
...16 types | ...
In the following code, I'm trying to duplicate each personality type when the description contains \n.
Code:
import pandas as pd
# Reading the file
path_root = 'gdrive/My Drive/Colab Notebooks/MBTI/'
root_fn = path_root + 'mbti_datasets.csv'
df = pd.read_csv(root_fn, sep = ',', quotechar = '"', usecols = [0, 1])
# split the column where there are new lines and turn it into a series
serie = df['description'].str.split('\n').apply(pd.Series, 1).stack()
# remove the second index for the DataFrame and the series to share indexes
serie.index = serie.index.droplevel(1)
# give it a name to join it to the DataFrame
serie.name = 'description'
# remove original column
del df['description']
# join the series with the DataFrame, based on the shared index
df = df.join(serie)
# New file name and writing the new csv file
root_new_fn = path_root + 'mbti_new.csv'
df.to_csv(root_new_fn, sep = ',', quotechar = '"', encoding = 'utf-8', index = False)
new_df = pd.read_csv(root_new_fn)
print(new_df)
EXPECTED OUTPUT:
TYPE | DESCRIPTION
a | This personality likes to eat apples...
a | They look like monkeys...
a | In fact, are strong people...
b | b.description
b | b.description
c | c.description
... | ...
CURRENT OUTPUT:
TYPE | DESCRIPTION
a | This personality likes to eat apples...
a | They look like monkeys...NaN
a | NaN
a | In fact, are strong people...NaN
b | b.description...NaN
b | NaN
b | b.description
c | c.description
... | ...
I'm not 100% sure, but I think the NaN value is \r.
Files uploaded to github as requested:
CSV FILES
Using @YOLO's solution:
CSV YOLO FILE
E.g. where it is failing:
2 INTJ Existe soledad en la cima y-- siendo # adds -- in random blank spaces
3 INTJ -- y las mujeres # adds -- in the beginning
3 INTJ (...) el 0--8-- de la poblaci # doesn't end the word 'población'
10 INTJ icos-- un conflicto que parecer--a imposible. # starts letters randomly
12 INTJ c #adds just 1 letter
Translation for full understanding:
2 INTJ There is loneliness at the top and-- being # adds -- in blank spaces
3 INTJ -- and women # adds -- in the beginning
3 INTJ (...) on 0--8-- of the popula-- # doesn't end the word 'population'
10 INTJ icos-- a conflict that seems--to impossible. # starts letters randomly
12 INTJ c #adds just 1 letter
When I display whether there's any NaN value and its type:
print(type(new_df.iloc[7, 1]))  # where there was a NaN value
print(new_df['descripcion'].isnull())
<class 'float'>
0 False
1 False
2 False
3 False
4 False
5 False
6 False
7 True
8 False
9 True
10 False
11 True
continue...
Here's a way to do it; I had to find a workaround to replace the \n character, as somehow it wasn't working in the straightforward manner:
df['DESCRIPTION'] = df['DESCRIPTION'].str.replace(r'[^a-zA-Z0-9\s.]', '--').str.split('--n')
df = df.explode('DESCRIPTION')
print(df)
TYPE DESCRIPTION
0 a This personality likes to eat apples...
0 a They look like monkeys...
0 a In fact-- are strong people...
1 b b.description
2 c c.description
3 d d.description
The problem can be attributed to the description cells, as some contain two consecutive new lines with nothing between them.
I just used .dropna() when reading the new csv, then rewrote it without the NaN values. Repeating this read/write cycle isn't the best way, but it's a straightforward solution.
df.to_csv(root_new_fn, sep = ',', quotechar = '"', encoding = 'utf-8', index = False)
new_df = pd.read_csv(root_new_fn).dropna()
new_df.to_csv(root_new_fn, sep = ',', quotechar = '"', encoding = 'utf-8', index = False)
new_df = pd.read_csv(root_new_fn)
print(type(new_df.iloc[7, 1]))  # where there was a NaN value
print(new_df['descripcion'].isnull())
<class 'str'>
0 False
1 False
2 False
3 False
4 False
5 False
6 False
7 False
8 False
and continues...
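As an aside, on pandas 0.25+ the whole split/stack/join dance can be collapsed into split plus explode, dropping the empty fragments that consecutive newlines leave behind. A sketch, reusing the column names from above and also stripping the \r suspected earlier:
df['description'] = df['description'].str.replace('\r', '').str.split('\n')
df = df.explode('description')
# consecutive \n produce empty strings; drop them instead of writing NaN
df = df[df['description'].str.strip().astype(bool)]
df.to_csv(root_new_fn, sep = ',', quotechar = '"', encoding = 'utf-8', index = False)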

Pandas mapping to multiple dictionary items for categorising data

I have a large dataframe containing a 'Description' column.
I've compiled a sizeable dictionary of lists, where the key is basically the Category, and the items are lists of possible (sub)strings contained in the description column.
I want to use the dictionary to classify each entry in the dataframe based on this description... Unfortunately I can't figure out how to apply a dictionary of lists to map onto a dataframe (it feels like it would be some concoction of map, isin and str.contains, but I've had no joy). I've included code to generate a model dataset below:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(10, 1), columns=list('A'))
df['Description'] = ['White Ford Escort', 'Irish Draft Horse',
                     'Springer spaniel (dog)', 'Green Vauxhall Corsa',
                     'White Van', 'Labrador dog', 'Black horse',
                     'Blue Van', 'Red Vauxhall Corsa', 'Bear']
This model dataset would then ideally be somehow mapped against the following dictionary:
d = {'Car': ['Ford Escort', 'Vauxhall Corsa', 'Van'],
     'Animal': ['Dog', 'Horse']}
to generate a new column in the dataframe, with the result as such:
| | A | Description | Type |
|---|----------------------|------------------------|--------|
| 0 | -1.4120290137842615 | White Ford Escort | Car |
| 1 | -0.3141036399049358 | Irish Draft Horse | Animal |
| 2 | 0.49374344901643896 | Springer spaniel (dog) | Animal |
| 3 | 0.013654965767323723 | Green Vauxhall Corsa | Car |
| 4 | -0.18271952280002862 | White Van | Car |
| 5 | 0.9519081000007026 | Labrador dog | Animal |
| 6 | 0.403258571154998 | Black horse | Animal |
| 7 | -0.8647792960494813 | Blue Van | Car |
| 8 | -0.12429427259820519 | Red Vauxhall Corsa | Car |
| 9 | 0.7695980616520571 | Bear | - |
The numbers are obviously irrelevant here, but there are other columns in the dataframes and I wanted this reflected.
I'm happy to use regex, or perhaps change my dictionary to a dataframe and do a join (I've considered multiple routes).
This feels similar to a recent question, but it's not the same, and certainly that answer hasn't helped me.
Sorry if I've been stupid somewhere and this is really simple - it does feel like it should be, but I'm missing something.
Thanks
You can use the fuzzywuzzy library to solve this. Make sure to install it via pip install fuzzywuzzy
from fuzzywuzzy import process
df = pd.DataFrame(np.random.randn(10, 1), columns=list('A'))
df['Description'] = ['White Ford Escort', 'Irish Draft Horse',
                     'Springer spaniel (dog)', 'Green Vauxhall Corsa',
                     'White Van', 'Labrador dog', 'Black horse',
                     'Blue Van', 'Red Vauxhall Corsa', 'Bear']
d = {'Car':['Ford Escort','Vauxhall Corsa','Van'],
'Animal':['Dog','Horse']}
# Construct a dataframe from the dictionary
df1 = pd.DataFrame([*d.values()], index=d.keys()).T.melt().dropna()
# Get relevant matches using the library.
m = df.Description.apply(lambda x: process.extract(x, df1.value)[0])
# concat the matches with original df
df2 = pd.concat([df, m[m.apply(lambda x: x[1]>80)].apply(lambda x: x[0])], axis=1)
df2.columns = [*df.columns, 'matches']
# After merge it with df1
df2 = df2.merge(df1, left_on='matches', right_on='value', how='left')
# Drop columns that are not required and rename.
df2 = df2.drop(['matches', 'value'], axis=1).rename(columns={'variable': 'Type'})
print (df2)
A Description Type
0 -0.423555 White Ford Escort Car
1 0.294092 Irish Draft Horse Animal
2 1.949626 Springer spaniel (dog) Animal
3 -1.315937 Green Vauxhall Corsa Car
4 -0.250184 White Van Car
5 0.186645 Labrador dog Animal
6 -0.052433 Black horse Animal
7 -0.003261 Blue Van Car
8 0.418292 Red Vauxhall Corsa Car
9 0.241607 Bear NaN
Consider this approach (a sketch follows below):
1. Invert your dictionary first, while making everything lowercase.
2. Per row, split Description into words and make them lowercase, e.g. 'Springer spaniel (dog)' -> ['springer', 'spaniel', '(', 'dog', ')'].
3. For each lowercase word from (2), look it up in the inverted dictionary from (1), using apply.
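A minimal sketch of those three steps, using the model df and dictionary d from above; the helper name categorise is illustrative, and multi-word phrases such as 'Ford Escort' are split so each component word maps to its category:
# 1. invert {category: [phrases]}, lowercasing everything and
#    splitting multi-word phrases into their component words
inverted = {word.lower(): cat
            for cat, phrases in d.items()
            for phrase in phrases
            for word in phrase.split()}

# 2./3. lowercase and split each description, strip the parentheses,
#       and look each word up in the inverted dictionary
def categorise(description):
    for word in description.lower().replace('(', ' ').replace(')', ' ').split():
        if word in inverted:
            return inverted[word]
    return '-'

df['Type'] = df['Description'].apply(categorise)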

Gather data, count, and return list of dictionary even when data does not exists

Let's say I have a MySQL table like this.
| id | type | sub_type | customer |
|----|--------|----------|----------|
| 1 | animal | cat | John |
| 2 | animal | dog | Marry |
| 3 | animal | fish | Marry |
| 3 | animal | bird | John |
What I have to do is gather data by customer and count rows by sub_type. The animal type has 4 sub_types (cat, dog, fish, bird); John has two sub_types (cat, bird) and Marry also has two sub_types (dog, fish). Let's say I want to get a result for John. It should look like this.
[
{name='cat', count=1},
{name='dog', count=0},
{name='fish', count=0},
{name='bird', count=1}
]
When I want to get a result for Marry, it should look like this.
[
{name='cat', count=0},
{name='dog', count=1},
{name='fish', count=1},
{name='bird', count=0}
]
So, sub_types that are not in the database should be returned with a count of 0. Let's say I want to get the result for Matthew. Since there is no data for Matthew, the result should look like this.
[
{name='cat', count=0},
{name='dog', count=0},
{name='fish', count=0},
{name='bird', count=0}
]
I usually use setdefault() to build the result. My code looks roughly like this.
tmp = dict()
for row in queryset:
    tmp.setdefault(row.customer, dict(cat=0, dog=0, fish=0, bird=0))
    if row.sub_type in tmp[row.customer]:
        tmp[row.customer][row.sub_type] += 1
However, I want to know if there is another, more elegant way to do this.
Assuming that you have a table named 'people' that contains a field 'name' with the entries
name
--------
John
Marry
Mathew
and the table mentioned above is a table called 'pets'
You can use the following query to build your result-set for each person
select
A.name as customer,
(select count(*) from pets where customer=A.name and sub_type='cat') as cat,
(select count(*) from pets where customer=A.name and sub_type='dog') as dog,
(select count(*) from pets where customer=A.name and sub_type='fish') as fish,
(select count(*) from pets where customer=A.name and sub_type='bird') as bird
from people A
with the result listed below
customer cat dog fish bird
John 1 0 0 1
Marry 0 1 1 0
Mathew 0 0 0 0
Add an additional where clause to filter by name, or get the summary result for all customers at once.
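Back in Python, the same shape can be built without per-type subqueries; a sketch with collections.Counter, assuming the rows come back as (customer, sub_type) tuples:
from collections import Counter

SUB_TYPES = ['cat', 'dog', 'fish', 'bird']

def counts_for(customer, rows):
    # rows: iterable of (customer, sub_type) tuples fetched from the table
    c = Counter(sub for cust, sub in rows if cust == customer)
    # Counter returns 0 for missing keys, so absent sub_types come out as 0
    return [{'name': s, 'count': c[s]} for s in SUB_TYPES]

rows = [('John', 'cat'), ('Marry', 'dog'), ('Marry', 'fish'), ('John', 'bird')]
print(counts_for('Matthew', rows))  # all counts are 0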

Iterate two pandas dataframe columns at the same time and return values from each column into separate places

I'm looking for a solution that would make it possible to iterate over two dataframe columns at the same time, then take the values from each column and put them in two separate places in a text.
My code so far:
def fetchingMetaTitle(x):
    keywords = df['Keyword']
    title1 = f'{x.title()} - We have a great selection of {x} | Example.com'
    title2 = f'{x.title()} - Choose among several {x} here | Example.com'
    title3 = f'{x.title()} - Buy cheap {x} easy and fast | Example.com'
    for i in keywords:
        if i.lower() in x.lower():
            return random.choice([title1, title2, title3])
        else:
            return np.nan

df['Category Meta Title'] = df['Keyword'].apply(fetchingMetaTitle)
Which will give me the following result:
+---------+----------------+-----------------------------------------------------------+
| Keyword | Category Title | Category Meta Title |
+---------+----------------+-----------------------------------------------------------+
| jeans | blue jeans | Jeans - We have a great selection of jeans | Example.com |
| jackets | red jackets | Jackets - Choose among several jackets here | Example.com |
| shoes | black shoes | Shoes - Buy cheap shoes easy and fast | Example.com |
+---------+----------------+-----------------------------------------------------------+
At the moment I'm only fetching from df['Keyword'] and returning its value into df['Category Meta Title'] in two places. Instead of adding the keyword twice, I would like to add the value from df['Category Title'] as the secondary value.
So the result would be the following:
+---------+----------------+---------------------------------------------------------------+
| Keyword | Category Title | Category Meta Title |
+---------+----------------+---------------------------------------------------------------+
| jeans | blue jeans | Jeans - We have a great selection of blue jeans | Example.com |
| jackets | red jackets | Jackets - Choose among several red jackets here | Example.com |
| shoes | black shoes | Shoes - Buy cheap black shoes easy and fast | Example.com |
+---------+----------------+---------------------------------------------------------------+
Thanks in advance!
IIUC, this function will do what you need, using the str.format syntax rather than the f'{string}' format:
def fetchingMetaTitle(row):
    title1 = '{} - We have a great selection of {} | Example.com'.format(
        row['Keyword'].title(), row['Category Title'])
    title2 = '{} - Choose among several {} here | Example.com'.format(
        row['Keyword'].title(), row['Category Title'])
    title3 = '{} - Buy cheap {} easy and fast | Example.com'.format(
        row['Keyword'].title(), row['Category Title'])
    return random.choice([title1, title2, title3])

df['Category Meta Title'] = df.apply(fetchingMetaTitle, axis=1)
>>> df
Keyword Category Title Category Meta Title
0 jeans blue jeans Jeans - Choose among several blue jeans here |...
1 jackets red jackets Jackets - We have a great selection of red jac...
2 shoes black shoes Shoes - Buy cheap black shoes easy and fast | ...
Alternatively, with f-strings:
def fetchingMetaTitle(row):
    keyword = row['Keyword'].title()
    cat = row['Category Title']
    title1 = f'{keyword} - We have a great selection of {cat} | Example.com'
    title2 = f'{keyword} - Choose among several {cat} here | Example.com'
    title3 = f'{keyword} - Buy cheap {cat} easy and fast | Example.com'
    return random.choice([title1, title2, title3])

df['Category Meta Title'] = df.apply(fetchingMetaTitle, axis=1)
Will do the same thing.
Note: I'm not exactly sure what the goal of your if statement was, so if you clarify that, I could try to insert its functionality into the functions above...
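If row-wise apply turns out to be slow on a large frame, the same pairing can be sketched with zip over the two columns (make_title is an illustrative name, not from the original post):
import random

def make_title(keyword, cat):
    templates = [
        '{} - We have a great selection of {} | Example.com',
        '{} - Choose among several {} here | Example.com',
        '{} - Buy cheap {} easy and fast | Example.com',
    ]
    return random.choice(templates).format(keyword.title(), cat)

df['Category Meta Title'] = [make_title(k, c)
                             for k, c in zip(df['Keyword'], df['Category Title'])]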
You could create a new column containing the template for a sentence plus both of the parameters. This fulfills your requirement of having access to the row values from both of your original columns. In the next step you can apply a custom function which creates the sentences for you and puts them in a res column.
import pandas as pd
df = pd.DataFrame({'A':['aa','bb','cc'], 'B':['a','b','c'], 'C':['1.{}, {}', '2.{}, {}', '3.{}, {}']})
df['combined'] = df[['A','B','C']].values.tolist()
df['res'] = df['combined'].apply(lambda x: x[2].format(x[0], x[1]))
print(df['res'])
Using this approach, based on the following DataFrame df:
A B C
0 aa a 1.{}, {}
1 bb b 2.{}, {}
2 cc c 3.{}, {}
The output is:
0 1.aa, a
1 2.bb, b
2 3.cc, c
