numpy where with multiple conditions linked to dataframe - python

I'm using numpy where with multiple conditions to assign a category based on a text string in a transaction description.
Part of the code is below:
import numpy as np

conditions = [
    df2['description'].str.contains('AEGON', na=False),
    df2['description'].str.contains('IB/PVV', na=False),
    df2['description'].str.contains('Picnic', na=False),
    df2['description'].str.contains('Jumbo', na=False),
]
values = [
    'Hypotheek',
    'Hypotheek',
    'Boodschappen',
    'Boodschappen',
]
df2['Classificatie'] = np.select(conditions, values, default='unknown')
I have many more conditions, only partly shown here.
I want to create a table / dataframe instead of including every separate condition and value in the code. So, for instance, the following dataframe:
import pandas as pd

Conditions = {
    'Condition': ['AEGON', 'IB/PVV', 'Picnic', 'Jumbo'],
    'Value': ['Hypotheek', 'Hypotheek', 'Boodschappen', 'Boodschappen'],
}
df_conditions = pd.DataFrame(Conditions, columns=['Condition', 'Value'])
How can I adjust the condition so that str.contains looks for the text strings listed in df_conditions['Condition'] and applies the corresponding Value to df2['Classificatie']?
The values are already available as a list in the variable explorer, but I can't find a way to make str.contains look up values from a list / dataframe.
Desired outcome:
In [3]: iwantthis
Out[3]:
                Description Classificatie
0   groceries Jumbo on date  Boodschappen
1   mortgage payment Aegon.     Hypotheek
2          transfer picnic.  Boodschappen
The first column is the input dataframe; the second column is what I'm looking for.
Please note that my current code already allows me to create this column, but I want a more automated way that uses the df_conditions table.
I'm not yet really familiar with Python and I can't find anything about this online.

Try:
import re

# Lowercase the search terms and use them as the index, so the extracted
# matches can be mapped straight to their Value
df_conditions["Condition"] = df_conditions["Condition"].str.lower()
df_conditions = df_conditions.set_index("Condition")

# Build one alternation regex out of all terms and extract the first match
# from each description, case-insensitively
tmp = df["Description"].str.extract(
    "(" + "|".join(re.escape(c) for c in df_conditions.index) + ")",
    flags=re.I,
)

# Map the lowercased match back to its category
df["Classificatie"] = tmp[0].str.lower().map(df_conditions["Value"])
print(df)
Prints:
                Description Classificatie
0   groceries Jumbo on date  Boodschappen
1   mortgage payment Aegon.     Hypotheek
2          transfer picnic.  Boodschappen
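
If you'd rather keep the np.select pattern from the question, you can also build the conditions and values lists directly from df_conditions. A minimal sketch, assuming df2 and df_conditions as defined in the question (i.e. before the set_index above); case=False and regex=False are my additions so the terms are matched literally and case-insensitively:

import numpy as np

# One str.contains mask per search term in the lookup table
conditions = [
    df2['description'].str.contains(cond, case=False, na=False, regex=False)
    for cond in df_conditions['Condition']
]
values = list(df_conditions['Value'])

df2['Classificatie'] = np.select(conditions, values, default='unknown')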

Related

How do you remove similar (not duplicated) rows in pandas dataframe using text similarity?

I have thousands of rows that may or may not be similar to each other. Pandas' default drop_duplicates() doesn't really help, since it only detects exact duplicates. For example, my data could contain entries like these:
Hey, good morning!
Hey, good morning.
Python wouldn't detect them as duplicates. There are many variations like this, so simply cleaning the text wouldn't suffice, which is why I opted for text similarity.
I have tried the following code,
import textdistance
from tqdm import tqdm

tqdm.pandas()

all_sims = []
for id1, text1 in tqdm(enumerate(df1['cleaned'])):
    for id2, text2 in enumerate(df1['cleaned'].iloc[id1:]):
        if id1 == id2:
            continue
        sim = textdistance.jaro_winkler(text1, text2)
        if sim >= 0.9:
            # print("similarity value: ", sim)
            # print("text 1 >> ", text1)
            # print("text 2 >> ", text2)
            # print("====><====")
            all_sims.append(id1)
Basically I iterate over all the rows in the column and check each one against the rest. If the Jaro-Winkler value turns out to be >= 0.9, the index is saved to a list.
I then remove all these similar indices with the following code:
df1[~df1.index.isin(all_sims)]
But my code is really slow and inefficient, and I am not sure it's the right approach. Do you have any ideas to improve it?
You could try this:
import pandas as pd
import textdistance

# Toy dataframe
df = pd.DataFrame(
    {
        "name": [
            "Mulligan Nick",
            "Hitt S C",
            "Breda Joy Mulligen",
            "David James Tsan",
            "Mulligan Nick",
            "Htti S C ",
            "Brenda Joy Mulligan",
            "Dave James Tsan",
        ],
    }
)

# Calculate similarities between rows
# and save corresponding indexes in a new column "match"
df["match"] = df["name"].map(
    lambda x: [
        i
        for i, text in enumerate(df["name"])
        if textdistance.jaro_winkler(x, text) >= 0.9
    ]
)

# Iterate to remove similar rows (keeping only the first one)
indices = []
for i, row in df.iterrows():
    indices.append(i)
    df = df.drop(
        index=[item for item in row["match"] if item not in indices],
        errors="ignore",
    )

# Clean up
df = df.drop(columns="match")
print(df)
# Outputs
                 name
0       Mulligan Nick
1            Hitt S C
2  Breda Joy Mulligen
3    David James Tsan
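
Note that the map call above still compares every name against every other name, so it is O(n²) jaro_winkler calls. A rough sketch of a cheaper variant (same toy frame assumed, with its default 0..n-1 index): only compare each row against later rows, and skip rows already marked for removal:

# Keep-first dedup: row i is compared only against rows i+1..n-1
names = list(df["name"])
to_drop = set()
for i, name in enumerate(names):
    if i in to_drop:
        continue
    for j in range(i + 1, len(names)):
        if j not in to_drop and textdistance.jaro_winkler(name, names[j]) >= 0.9:
            to_drop.add(j)
df = df.drop(index=list(to_drop))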

Mining for Term that is "Included In" Entry Rather than "Equal To"

I am doing some data mining. I have a database that looks like this (pulling out three lines):
100324822$10032482$1$PS$BENICAR$OLMESARTAN MEDOXOMIL$1$Oral$UNK$$$Y$$$$021286$$$TABLET$
1014687010$10146870$2$SS$BENICAR HCT$HYDROCHLOROTHIAZIDE\OLMESARTAN MEDOXOMIL$1$Oral$1/2 OF 40/25MG TABLET$$$Y$$$$$.5$DF$FILM-COATED TABLET$QD
115700162$11570016$5$C$Olmesartan$OLMESARTAN$1$Unknown$UNK$$$U$U$$$$$$$
My code looks like this:
import pandas as pd

with open('DRUG20Q4.txt') as fileDrug20Q4:
    drugTupleList20Q4 = [tuple(i.split('$')) for i in fileDrug20Q4]
# The with block closes the file, so no explicit close() is needed

drug20Q4 = []
for entryDrugPrimaryID20Q4 in drugTupleList20Q4:
    drug20Q4.append((entryDrugPrimaryID20Q4[0], entryDrugPrimaryID20Q4[3], entryDrugPrimaryID20Q4[5]))

drugNameDataFrame20Q4 = pd.DataFrame(drug20Q4, columns=['PrimaryID', 'Role', 'Drug Name'])
drugNameDataFrame20Q4 = drugNameDataFrame20Q4.loc[drugNameDataFrame20Q4['Drug Name'] == 'OLMESARTAN']
Currently the code pulls out only entries with the exact name "OLMESARTAN". How do I capture all the variations, for instance "OLMESARTAN MEDOXOMIL" etc.? I can't simply list all the varieties, as there's an effectively unlimited number of variations, so I need something that captures anything with the term "OLMESARTAN" within it.
Thanks!
You can use str.contains to get what you are looking for.
Here's an example (using some string I found in the documentation):
import pandas as pd
df = pd.DataFrame()
item = 'Return boolean Series or Index based on whether a given pattern or regex is contained within a string of a Series or Index.'
df['test'] = item.split(' ')
df[df['test'].str.contains('de')]
This outputs:
test
4 Index
22 Index.
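
Applied to the dataframe from the question, that would be something along these lines (a sketch; olmesartanRows20Q4 is just an illustrative name, and case=False is my addition to cover mixed-case entries):

# Keep every row whose drug name merely contains the term, any case
mask = drugNameDataFrame20Q4['Drug Name'].str.contains('OLMESARTAN', case=False, na=False)
olmesartanRows20Q4 = drugNameDataFrame20Q4[mask]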

Using pandas to categorize text data in one column and have corresponding categories stated in the next column

My Excel spreadsheet currently looks like this after inserting the new column "Expense" using the code:
import pandas as pd
df = pd.read_csv(r"C:\Users\Mihir Patel\Project\Excel & CSV Stuff\June '20 CSVData.csv")
df.insert(2, "Expense", " ")  # insert() modifies df in place and returns None
df.to_excel(r"C:\Users\Mihir Patel\Project\Excel & CSV Stuff\June '20 CSVData.xlsx", index=None, header=True)
So because the Description column contains the word "DRAKES", I can categorize that expense as "Personal", which should appear in the Expense column next to it.
Similarly, the next row down contains "Optus", a mobile-related expense, so the word "Phone" should appear in its Expense column.
I have tried searching on Google and YouTube but I just can't seem to find an example for something like this.
Thanks for your help.
You can define a function which holds all these rules and simply apply it row by row. For example:
def rules(x):
    # The column is "Description" in the question's data, so use item
    # access rather than attribute access
    if "DRAKES" in x["Description"]:
        return "Personal"
    if "OPTUS" in x["Description"]:
        return "Phone"
    return ""

df["Expense"] = df.apply(rules, axis=1)
I have solved my problem by using a while loop. I tried to use the method in quest's answer but most likely didn't use it properly and kept getting an error, so I used a while loop instead to search through each individual cell in the "Description" column and categorize it in the same row of the "Expenses" column. A vectorised alternative is sketched after the code.
My solution using a while loop:
import pandas as pd

df = pd.read_csv("C:\\Users\\Mihir Patel\\PycharmProjects\\pythonProject\\June '20 CSVData.csv")
df.insert(2, "Expenses", "")

description = "Description"
expense = "Expenses"
transfer = "Transfer"

i = -1  # Because I wanted Python to start searching from index 0
while i < 296:  # 296 is the row where my data ends
    i = i + 1
    if "Drakes".upper() in df.loc[i, description]:
        df.loc[i, expense] = "Personal"
    if "Optus".upper() in df.loc[i, description]:
        df.loc[i, expense] = "Phone"

df.sort_values(by=["Expenses"], inplace=True)
df.to_excel("C:\\Users\\Mihir Patel\\PycharmProjects\\pythonProject\\June '20 CSVData.xlsx", index=False)

Concatenate values into Panda Series

I have the following response from an API request:
<movies>
<movie>
<rating>5</rating>
<name>star wars</name>
</movie>
<movie>
<rating>8</rating>
<name>jurassic park</name>
</movie>
</movies>
Is there a way to take this information, obtain the rating and name values, and store them in a pandas Series?
The end result would look like this:
Movie Rating
5 - star Wars
8 - Jurassic park
You'll notice I've taken each of the values found in my response and added them to a single column. I was looking to concatenate the 5, a '-', and 'star wars' together, for example.
Is this what you are looking for? I have explained it step by step in the code. There was one part I did not know how to do, but I researched it and figured it out.
import pandas as pd

df = pd.DataFrame({'Data': ['<movies>', '<movie>', '<rating>5</rating>',
                            '<name>star wars</name>', '</movie>',
                            '<rating>8</rating>', '<name>jurassic park</name>',
                            '</movie>', '</movies>']})

# Filter for the relevant rows of data based upon the logic of the pattern.
# I have also done an optional reset of the index.
df = df.loc[df['Data'].str.contains('>.*<', regex=True)].reset_index(drop=True)

# For the rows we just filtered for, get rid of the irrelevant data with some
# regex string manipulation
df['Data'] = df['Data'].str.findall('>.*<').str[0].replace(['>', '<'], '', regex=True)

# Use join with shift and add_suffix, credit to @joelostblom:
# https://stackoverflow.com/questions/47450259/merge-row-with-next-row-in-dataframe-pandas
df = df.add_suffix('1').join(df.shift(-1).add_suffix('2'))

# Filter for numeric rows only
df = df.loc[df['Data1'].str.isnumeric() == True]

# Combine columns with the desired format
df['Movie Rating'] = df['Data1'] + ' - ' + df['Data2']

# Keep only the relevant column and print the dataframe
df = df[['Movie Rating']]
print(df)
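
If the API response arrives as actual XML text rather than as rows of a dataframe, it may be simpler to parse it directly. A sketch using the standard library's xml.etree.ElementTree (the response string is the one from the question):

import xml.etree.ElementTree as ET
import pandas as pd

response = """<movies>
<movie><rating>5</rating><name>star wars</name></movie>
<movie><rating>8</rating><name>jurassic park</name></movie>
</movies>"""

root = ET.fromstring(response)
# Build one "rating - name" string per <movie> element
df = pd.DataFrame({
    "Movie Rating": [
        f"{m.findtext('rating')} - {m.findtext('name')}"
        for m in root.iter("movie")
    ]
})
print(df)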

Boxplot needs to use multiple groupby in Pandas

I am using pandas, Jupyter notebooks and python.
I have a following dataset as a dataframe
Cars,Country,Type
1564,Australia,Stolen
200,Australia,Stolen
579,Australia,Stolen
156,Japan,Lost
900,Africa,Burnt
2000,USA,Stolen
1000,Indonesia,Stolen
900,Australia,Lost
798,Australia,Lost
128,Australia,Lost
200,Australia,Burnt
56,Australia,Burnt
348,Australia,Burnt
1246,USA,Burnt
I would like to know how I can use a box plot to answer the following question: "number of cars in Australia that were affected by each type". So basically, I should have 3 box plots (one for each type) showing the number of cars affected in Australia.
Please keep in mind that this is a subset of the real dataset.
You can select only the rows corresponding to "Australia" from the column "Country" and group it by the column "Type" as shown:
from io import StringIO  # Python 3 location; `from StringIO import StringIO` is Python 2
import pandas as pd

text_string = StringIO(
"""
Cars,Country,Type,Score
1564,Australia,Stolen,1
200,Australia,Stolen,2
579,Australia,Stolen,3
156,Japan,Lost,4
900,Africa,Burnt,5
2000,USA,Stolen,6
1000,Indonesia,Stolen,7
900,Australia,Lost,8
798,Australia,Lost,9
128,Australia,Lost,10
200,Australia,Burnt,11
56,Australia,Burnt,12
348,Australia,Burnt,13
1246,USA,Burnt,14
""")
df = pd.read_csv(text_string, sep=",")

# Keep only the Australian rows, then draw one "Cars" box plot per "Type"
group = df.loc[df['Country'] == 'Australia'].boxplot(column='Cars', by='Type')
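
In a Jupyter notebook the figure should appear inline; in a plain script it won't render on its own, and a plt.show() call is needed:

import matplotlib.pyplot as plt
plt.show()  # draws one "Cars" box per "Type" for the Australian rows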
