Duplicate & identify certain rows in a Pandas Dataframe - regex - python

I didn't find any solution to my issue.
I want to identify and duplicate certain rows of my DataFrame using regex.
For example, my df:
   var1
0  House A and B
1  2 garage + garden
2  fridges
The result that I want in var2 (keeping var1 too):
   var1               var2
0  House A and B      House A
1  House A and B      House B
2  2 garage + garden  Garage 1
3  2 garage + garden  Garage 2
4  2 garage + garden  Garden
5  fridges            fridge 1
6  fridges            fridge 2
I don't know exactly how to do that. I think regex is a good idea, but I'm not sure.
I tried str.contains, but the results were not good.
Thanks for your help.

If those three cases are exhaustive, then you may use my solution, which combines regex matching and splitting.
import re
import pandas as pd

#the hard part
def my_scan(t):
    #Split
    #only '+' and 'and' are considered
    cond = re.findall(r'(.+)(and|\+)(.+)', t)
    if len(cond):
        t = [_.strip() for _ in cond[0]]
    else:
        t = [t]
    #Process
    #Case 1 'House': and
    if 'and' in t:
        t.remove('and')
        #add 'House' to the second element
        t[1] = re.split(' ', t[0])[0]+' '+t[1]
    #Case 2 'Garage + Garden': + with numeral
    elif '+' in t:
        t.remove('+')
        x = []
        ##check for numerals in front
        for _ in t:
            if (re.match(r'^\d+', _)):
                temp = _[(re.match(r'^\d+', _)).end()+1:] #'garage'
                #append by the number of numeral times
                for i in range(int(re.match(r'^\d+', _)[0])):
                    x.append(temp+' '+str(i+1))
            else:
                x.append(_)
        t = x
    #Case 3 'Fridges': single word that ends with an s
    else:
        if (re.match(r'^[A-Za-z]+s$', t[0])):
            t = t[0][:-1]
            t = [t+' 1', t+' 2']
        else:
            t[0] = t[0]+' 1'
    return t

#the easier part
def get_df(t):
    output1 = []
    output2 = []
    for _ in t:
        dummy = my_scan(_)
        for i in range(len(dummy)):
            output1.append(_)
            output2.append(dummy[i])
    df = pd.DataFrame({'var1':output1,'var2':output2})
    return df

#test it
data = {'var1':['House A and B','2 Garage + Garden', 'Fridges']}
df = get_df(data['var1'])
print(df)

#bonus test
data1 = {'var1':['House AZ and PL','5 Garage + 3 Garden', 'Fridge']}
df = get_df(data1['var1'])
print(df)
Printed df output from your original data, data = {'var1':['House A and B','2 Garage + Garden', 'Fridges']}.
   var1               var2
0  House A and B      House A
1  House A and B      House B
2  2 Garage + Garden  Garage 1
3  2 Garage + Garden  Garage 2
4  2 Garage + Garden  Garden
5  Fridges            Fridge 1
6  Fridges            Fridge 2
Printed df output from an additional test data, data1 = {'var1':['House AZ and PL','5 Garage + 3 Garden', 'Fridge']}.
    var1                 var2
0   House AZ and PL      House AZ
1   House AZ and PL      House PL
2   5 Garage + 3 Garden  Garage 1
3   5 Garage + 3 Garden  Garage 2
4   5 Garage + 3 Garden  Garage 3
5   5 Garage + 3 Garden  Garage 4
6   5 Garage + 3 Garden  Garage 5
7   5 Garage + 3 Garden  Garden 1
8   5 Garage + 3 Garden  Garden 2
9   5 Garage + 3 Garden  Garden 3
10  Fridge               Fridge 1
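The row-duplication step (what get_df does with two lists) can probably also be done with apply plus DataFrame.explode. This is only a minimal sketch: expand is a hypothetical, simplified splitter that handles just the "N item" and "+" cases, not the "and" or plural cases, and the sample rows are made up for illustration.

```python
import re
import pandas as pd

def expand(text):
    """Hypothetical helper: split on '+' and repeat 'N item' N times."""
    out = []
    for part in re.split(r"\s*\+\s*", text):
        m = re.match(r"(\d+)\s+(.+)", part)
        if m:
            count, name = int(m.group(1)), m.group(2)
            out.extend(f"{name} {i}" for i in range(1, count + 1))
        else:
            out.append(part)
    return out

df = pd.DataFrame({"var1": ["2 garage + garden", "3 shed"]})
# each cell of var2 holds a list; explode turns every list item into its own row
df["var2"] = df["var1"].apply(expand)
df = df.explode("var2").reset_index(drop=True)
print(df)
```

explode repeats var1 for every item in the list, so the original text stays next to each generated value.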

A regular expression may not be the best tool for this task, but you can write an expression to split the entries. How to wire it into code, and how to detect plural words (for which you would probably want an NLP library), are separate questions:
([A-Za-z]+?)\s([A-Z])(?=\s+and|$)|([0-9]+)?\s+([A-Za-z]*?)(?=\s+\+|$)
If you wish to simplify, modify, or explore the expression, it is explained in the top-right panel of regex101.com, where you can also see how it matches against some sample inputs.

Related

create two columns based on a function with apply()

I have a dataset containing football data of the premier league as such:
     HomeTeam     AwayTeam          FTHG  FTAG
0    Liverpool    Norwich           4     1
1    West Ham     Man City          0     5
2    Bournemouth  Sheffield United  1     1
3    Burnley      Southampton       3     0
...  ...          ...               ...   ...
where "FTHG" and "FTAG" are full-time home team goals and away team goals.
I need to write a function that calculates the final Premier League table given the results (in the form of a data frame). What I wrote is this function:
def calcScore(row):
    if PL_df.iloc[row]['FTHG'] > PL_df.iloc[row]['FTAG']:
        x = 3
        y = 0
    elif PL_df.iloc[row]['FTHG'] < PL_df.iloc[row]['FTAG']:
        x = 0
        y = 3
    elif PL_df.iloc[row]['FTHG'] == PL_df.iloc[row]['FTAG']:
        x = 1
        y = 1
    return x,y
This works; for example, for the first row it gives this output:
in[1]: calcScore(0)
out[1]: (3,0)
Now I need to create two columns, HP and AP, that contain the number of points awarded to the home and away teams respectively, using apply(). But I can't think of a way to do that.
I hope I was clear enough. Thank you in advance.
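No need for a function (and also faster than apply):
import numpy as np
# conditions are checked in order: home win first, then draw; anything else is an away win
win_or_draws = df['FTHG'] > df['FTAG'], df['FTHG'] == df['FTAG']
df['HP'] = np.select(win_or_draws, (3, 1), 0)
df['AP'] = np.select(win_or_draws, (0, 1), 3)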
Output:
   HomeTeam     AwayTeam          FTHG  FTAG  HP  AP
0  Liverpool    Norwich           4     1     3   0
1  West Ham     Man City          0     5     0   3
2  Bournemouth  Sheffield United  1     1     1   1
3  Burnley      Southampton       3     0     3   0
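If apply() is a hard requirement, as the question says, one possible approach is to let the function take the row itself and expand the returned tuple into two columns. This is a sketch only: calc_score is a renamed variant of the question's calcScore that receives a row rather than a row number, and the four rows are copied from the sample above.

```python
import pandas as pd

PL_df = pd.DataFrame({
    "HomeTeam": ["Liverpool", "West Ham", "Bournemouth", "Burnley"],
    "AwayTeam": ["Norwich", "Man City", "Sheffield United", "Southampton"],
    "FTHG": [4, 0, 1, 3],
    "FTAG": [1, 5, 1, 0],
})

def calc_score(row):
    """Return (home points, away points) for one match row."""
    if row["FTHG"] > row["FTAG"]:
        return 3, 0
    if row["FTHG"] < row["FTAG"]:
        return 0, 3
    return 1, 1

# axis=1 passes each row to the function; result_type="expand"
# spreads the returned tuple over two columns
points = PL_df.apply(calc_score, axis=1, result_type="expand")
points.columns = ["HP", "AP"]
PL_df = PL_df.join(points)
print(PL_df)
```

On large frames the np.select version should still be noticeably faster, since apply with axis=1 calls the Python function once per row.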

Splitting a dataframe column on a pattern of characters and numerals

I have a dataframe that is:
   A
1  king, crab, 2008
2  green, 2010
3  blue
4  green no. 4
5  green, house
I want to split the dates out into:
   A             B
1  king, crab    2008
2  green         2010
3  blue
4  green no. 4
5  green, house
I can't split on the first instance of ", " because that would make:
   A            B
1  king         crab, 2008
2  green        2010
3  blue
4  green no. 4
5  green        house
I can't split after the last instance of ", " because that would make:
   A            B
1  king crab    2008
2  green        2010
3  blue
4  green no. 4
5  green        house
I also can't separate it by numbers because that would make:
   A             B
1  king, crab    2008
2  green         2010
3  blue
4  green no.     4
5  green, house
Is there some way to split on ", " followed by a four-digit number that lies between two values? The two-value condition would be extra safety to filter out accidental four-digit numbers that are clearly not years. For example:
Split by:
", " + (four digit number between 1000 - 2021)
Also appreciated are answers that split by:
", " + four digit number
Even better would be an answer that took into account that the number is ALWAYS at the end of the string.
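Or you can just use Series.str.extract and Series.str.replace:
import pandas as pd
df = pd.DataFrame({"A":["king, crab, 2008","green, 2010","blue","green no. 4","green, house"]})
# raw strings keep the regex escapes intact; regex=True is needed on pandas 2.0+,
# where str.replace treats the pattern as a literal string by default
df["year"] = df["A"].str.extract(r"(\d{4})")
df["A"] = df["A"].str.replace(r",\s\d{4}", "", regex=True)
print(df)
   A             year
0  king, crab    2008
1  green         2010
2  blue          NaN
3  green no. 4   NaN
4  green, house  NaN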
import pandas as pd
list_dict_Input = [{'A': 'king, crab, 2008'},
                   {'A': 'green, 2010'},
                   {'A': 'green no. 4'},
                   {'A': 'green no. 4'},]
df = pd.DataFrame(list_dict_Input)
for row_Index in range(len(df)):
    text = (df.iloc[row_Index]['A']).strip()
    last_4_Char = (text[-4:])
    if last_4_Char.isdigit() and int(last_4_Char) >= 1000 and int(last_4_Char) <= 2021:
        df.at[row_Index, 'B'] = last_4_Char
print(df)
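The three wishes in the question (", " plus four digits, only at the end of the string, and within 1000 to 2021) might be combined in one vectorized pass. The following is a sketch under those assumptions; the group names A and B and the helper variables parts, year and ok are chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({"A": ["king, crab, 2008", "green, 2010", "blue",
                         "green no. 4", "green, house"]})

# Lazy prefix, then an optional ", dddd" that must sit at the very end.
parts = df["A"].str.extract(r"^(?P<A>.*?)(?:,\s(?P<B>\d{4}))?$")

# Accept the split only when the number falls in the allowed range;
# NaN (no year found) compares as False in between().
year = pd.to_numeric(parts["B"], errors="coerce")
ok = year.between(1000, 2021)

df["B"] = parts["B"].where(ok)
df["A"] = parts["A"].where(ok, df["A"])  # keep the original text otherwise
print(df)
```

Rows whose trailing number is out of range keep their full original text in A, so nothing is lost when the number is not a plausible year.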

pandas: extract specific text before or after hyphen, that ends in given substrings

I am very new to pandas and have a data frame similar to the below
import pandas as pd
df = pd.DataFrame({'id': ["1", "2", "3","4","5"],
                   'mill': ["Company A Palm Oil Mill – Special Company A of CC Ltd",
                            "Company X POM – Company X Ltd","DDDD Mill – Company New and Old Ltd",
                            "Company Not Special – R Mill","Greatest Company – Great World POM"]})
   id  mill
0  1   Company A Palm Oil Mill – Special Company A of...
1  2   Company X POM – Company X Ltd
2  3   DDDD Mill – Company New and Old Ltd
3  4   Company Not Special – R Mill
4  5   Greatest Company – Great World POM
What I would like to get from the above data frame is something like the below:
Is there an easy way to extract those substrings into the same column? The mill name is sometimes before and sometimes after the '-', but it will almost always end with Palm Oil Mill, POM or Mill.
IIUC, you can use str.contains with the keywords Palm Oil Mill, POM and Mill:
s = df.mill.str.split(' – ', expand=True)
df['Name'] = s[s.apply(lambda x: x.str.contains('Palm Oil Mill|POM|Mill'))].fillna('').sum(1)
df
Out[230]:
   id  mill                                               \
0  1   Company A Palm Oil Mill – Special Company A of...
1  2   Company X POM – Company X Ltd
2  3   DDDD Mill – Company New and Old Ltd
3  4   Company Not Special – R Mill
4  5   Greatest Company – Great World POM

   Name
0  Company A Palm Oil Mill
1  Company X POM
2  DDDD Mill
3  R Mill
4  Great World POM
Previous solution: you could use .str.split() like this:
df.mill = df.mill.str.split(' –').str[0]
Update: Seeing that you have a few constraints, you could build your own function (called func below) and put any logic you want inside it. It loops through the strings obtained by splitting on ' – ' and returns the first one that ends with 'mill' or 'pom' (case-insensitive).
Otherwise I recommend Wen's solution.
import pandas as pd
df = pd.DataFrame({'id': ["1", "2", "3","4","5"],
                   'mill': ["Company A Palm Oil Mill – Special Company A of CC Ltd",
                            "Company X POM – Company X Ltd","DDDD Mill – Company New and Old Ltd",
                            "Company Not Special – R Mill","Greatest Company – Great World POM"]})
def func(x):
    # Split on the dash separator
    ar = x.split(' – ')
    # If there is only one part, return the value unchanged
    if len(ar) < 2:
        return x
    # Else loop through and apply logic here
    # (the loop variable is named part so that it does not overwrite x)
    for part in ar:
        if part.lower().endswith(('mill', 'pom')):
            return part
    # Nothing found, return the original string
    return x
df.mill = df.mill.apply(func)
print(df)
Returns:
   id  mill
0  1   Company A Palm Oil Mill
1  2   Company X POM
2  3   DDDD Mill
3  4   R Mill
4  5   Great World POM
You want to split on the hyphen (if any), and return the substring ending in 'Mill' or 'POM':
def extract_mill_name(s):
    """Extract the substring which ends in 'Mill' or 'POM'"""
    for subs in s.split('–'):
        subs = subs.strip(' ')
        if subs.endswith('Mill') or subs.endswith('POM'):
            return subs
    return None # parsing error. Could raise Exception instead
df.mill.apply(extract_mill_name)
0    Company A Palm Oil Mill
1    Company X POM
2    DDDD Mill
3    R Mill
4    Great World POM
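The same "split, then keep the part that ends in Mill or POM" idea can probably be written without a Python-level function, using str.split with explode. This is a sketch rather than a tested replacement for the answers above; the column name name and the variables parts and names are made up here.

```python
import pandas as pd

df = pd.DataFrame({
    "id": ["1", "2", "3", "4", "5"],
    "mill": ["Company A Palm Oil Mill – Special Company A of CC Ltd",
             "Company X POM – Company X Ltd",
             "DDDD Mill – Company New and Old Ltd",
             "Company Not Special – R Mill",
             "Greatest Company – Great World POM"],
})

# One row per piece; explode keeps the original row label on every piece.
parts = df["mill"].str.split("–").explode().str.strip()

# Keep the pieces that end in 'Mill' or 'POM' ('Palm Oil Mill' ends in 'Mill' too).
names = parts[parts.str.contains(r"(?:Mill|POM)$")]

# First matching piece per original row; rows without a match become NaN.
df["name"] = names.groupby(level=0).first()
print(df[["id", "name"]])
```

Assigning the grouped result back relies on index alignment, so a row with no matching piece simply ends up with a missing value instead of raising.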

TypeError when using iloc to create dummy variables

Source data is from the book Python_for_Data_Analysis, chp 2.
The data for movies is as follows and can also be found here:
movies.head(n=10)
Out[3]:
   movie_id  title                               genres
0  1         Toy Story (1995)                    Animation|Children's|Comedy
1  2         Jumanji (1995)                      Adventure|Children's|Fantasy
2  3         Grumpier Old Men (1995)             Comedy|Romance
3  4         Waiting to Exhale (1995)            Comedy|Drama
4  5         Father of the Bride Part II (1995)  Comedy
5  6         Heat (1995)                         Action|Crime|Thriller
6  7         Sabrina (1995)                      Comedy|Romance
7  8         Tom and Huck (1995)                 Adventure|Children's
8  9         Sudden Death (1995)                 Action
9  10        GoldenEye (1995)                    Action|Adventure|Thriller
The following code has trouble when I use iloc:
import pandas as pd
import numpy as np
from pandas import Series,DataFrame
mnames = ['movie_id', 'title', 'genres']
movies = pd.read_table('movies.dat', sep='::',
                       engine='python', header=None, names=mnames)
movies.head(n=10)
genre_iter = (set(x.split('|')) for x in movies['genres'])
genres = sorted(set.union(*genre_iter))
dummies = DataFrame(np.zeros((len(movies), len(genres))), columns=genres)
for i, gen in enumerate(movies['genres']):
    # the following code report error
    # TypeError: '['Animation', "Children's", 'Comedy']' is an invalid key
    dummies.iloc[i,dummies.columns.get_loc(gen.split('|'))] = 1
    # while loc can run successfully
    dummies.loc[dummies.index[[i]],gen.split('|')] = 1
movies_windic = movies.join(dummies.add_prefix('Genre_'))
movies_windic.iloc[0]
I have some understanding of why Children's might cause an error, but why do Animation and Comedy fail as well? I have tried:
dummies.columns.get_loc('Animation')
and the result is 2.
This is a pretty simple (and fast) answer using string matching that should work fine here and in any case where the genre names don't overlap. E.g. if you had categories "crime" and "crime thriller", then a crime thriller would be categorized under both crime AND crime thriller rather than just crime thriller. (But see the note below for how you could generalize this.)
for g in genres:
    movies[g] = movies.genres.str.contains(g).astype(np.int8)
(Note that using np.int8 rather than int will save a lot of memory, as int defaults to 64 bits rather than 8.)
Results for movies.head(2):
   movie_id             title                        genres  Action  \
0         1  Toy Story (1995)   Animation|Children's|Comedy       0
1         2    Jumanji (1995)  Adventure|Children's|Fantasy       0

   Adventure  Animation  Children's  Comedy  Crime  Documentary  ...  \
0          0          1           1       1      0            0  ...
1          1          0           1       0      0            0  ...

   Fantasy  Film-Noir  Horror  Musical  Mystery  Romance  Sci-Fi  Thriller  \
0        0          0       0        0        0        0       0         0
1        1          0       0        0        0        0       0         0

   War  Western
0    0        0
1    0        0
The following generalization of the above code may be overkill, but it gives you a more general approach that should avoid potential double counting of genre categories (e.g. equating Crime and Crime Thriller):
# add '|' delimiter to beginning and end of the genres column
movies['genres2'] = '|' + movies['genres'] + '|'
# search for '|Crime|' rather than 'Crime' which is much safer b/c
# we don't match a category which merely contains 'Crime', we
# only match 'Crime' exactly
for g in genres:
    movies[g+'2'] = movies.genres2.str.contains(r'\|' + g + r'\|').astype(np.int8)
(If you're better with regular expressions than me, you wouldn't need to add the '|' at the beginning and end ;-)
Try
dummies = movies.genres.str.get_dummies()
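On the TypeError itself: Index.get_loc looks up a single label, so passing it the list returned by gen.split('|') is rejected as an invalid key, whichever genres are in the list. Index.get_indexer accepts a list of labels and returns their integer positions, which is what iloc needs. A minimal sketch, using three rows copied from the sample above instead of movies.dat (the variable names cols and same are made up here):

```python
import numpy as np
import pandas as pd

movies = pd.DataFrame({
    "movie_id": [1, 2, 3],
    "title": ["Toy Story (1995)", "Jumanji (1995)", "Grumpier Old Men (1995)"],
    "genres": ["Animation|Children's|Comedy",
               "Adventure|Children's|Fantasy",
               "Comedy|Romance"],
})

genres = sorted(set.union(*(set(x.split("|")) for x in movies["genres"])))
dummies = pd.DataFrame(np.zeros((len(movies), len(genres))), columns=genres)

for i, gen in enumerate(movies["genres"]):
    # get_indexer maps a list of labels to integer positions for iloc
    cols = dummies.columns.get_indexer(gen.split("|"))
    dummies.iloc[i, cols] = 1

# str.get_dummies builds the same indicator table in one call
same = movies["genres"].str.get_dummies(sep="|")
print(dummies)
```

For this data the loop and str.get_dummies give the same table, so the one-liner is usually the simpler choice unless the loop needs extra per-row logic.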

Generate columns of top ranked values in Pandas

I have a dataframe topic_data that contains the output of an LDA topic model:
topic_data.head(15)
    topic  word                      score
0   0      Automobile                0.063986
1   0      Vehicle                   0.017457
2   0      Horsepower                0.015675
3   0      Engine                    0.014857
4   0      Bicycle                   0.013919
5   1      Sport                     0.032938
6   1      Association_football      0.025324
7   1      Basketball                0.020949
8   1      Baseball                  0.016935
9   1      National_Football_League  0.016597
10  2      Japan                     0.051454
11  2      Beer                      0.032839
12  2      Alcohol                   0.027909
13  2      Drink                     0.019494
14  2      Vodka                     0.017908
This shows the top 5 terms for each topic, and the score (weight) for each. What I'm trying to do is reformat so that the index is the rank of the term, the columns are the topic IDs, and the values are formatted strings generated from the word and score columns (something along the lines of "%s (%.02f)" % (word,score)). That means the new dataframe should look something like this:
Topic  0                  1                            ...
Rank
0      Automobile (0.06)  Sport (0.03)                 ...
1      Vehicle (0.017)    Association_football (0.03)  ...
...    ...                ...                          ...
What's the right way of going about this? I assume it involves a combination of index-setting, unstacking, and ranking, but I'm not sure of the right approach.
It would be something like this; note that Rank has to be generated first:
In [140]:
df['Rank'] = (-df['score']).groupby(df['topic']).transform(np.argsort)
df['New_str'] = df.word + df.score.apply(' ({0:.2f})'.format)
df2 = df.sort_values(['Rank', 'score'])[['New_str', 'topic', 'Rank']]
print(df2.pivot(index='Rank', values='New_str', columns='topic'))
topic  0                  1                                2
Rank
0      Automobile (0.06)  Sport (0.03)                     Japan (0.05)
1      Vehicle (0.02)     Association_football (0.03)      Beer (0.03)
2      Horsepower (0.02)  Basketball (0.02)                Alcohol (0.03)
3      Engine (0.01)      Baseball (0.02)                  Drink (0.02)
4      Bicycle (0.01)     National_Football_League (0.02)  Vodka (0.02)
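One caveat: np.argsort returns the positions that would sort the values, which equals the rank only when each topic's rows are already in descending score order, as they are here. If that is not guaranteed, groupby(...).rank may be the safer choice. A sketch on a shortened copy of the data (the names ranked, label and table are chosen for this example, and rank is swapped in for argsort):

```python
import pandas as pd

topic_data = pd.DataFrame({
    "topic": [0, 0, 0, 1, 1, 1],
    "word": ["Automobile", "Vehicle", "Horsepower",
             "Sport", "Association_football", "Basketball"],
    "score": [0.063986, 0.017457, 0.015675, 0.032938, 0.025324, 0.020949],
})

ranked = topic_data.assign(
    # rank 0 is the highest score within each topic; ties keep input order
    Rank=topic_data.groupby("topic")["score"]
                   .rank(ascending=False, method="first").astype(int) - 1,
    label=topic_data["word"] + topic_data["score"].map(" ({:.2f})".format),
)

table = ranked.pivot(index="Rank", columns="topic", values="label")
print(table)
```

Because the rank is computed from the scores themselves, the result does not depend on the order in which the rows arrive.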
