I have a large CSV file with a column containing strings. At the beginning of each string there is an ID number which also appears in another column, as below.
0 Home /buy /York /Warehouse /P000166770Ou... P000166770
1 Home /buy /York /Plot /P000165923A plot of la... P000165923
2 Home /buy /London /Commercial /P000165504A str... P000165504
...
804 Brand new apartment on the first floor, situat... P000185616
I want to remove all the text which appears before the ID number, so here we would get:
0 Ou...
1 A plot of la...
2 A str...
...
804 Brand new apartment on the first floor, situat...
I tried something like
df['column_one'].str.split(df['column_two'])
and
df['column_one'].str.replace(df['column_two'], '')
but neither works, since these string methods expect a string pattern, not a Series.
You could replace the pattern using regex as follows:
>>> my_pattern = r"^(Alpha|Beta|QA|Prod)\s[A-Z0-9]{7}"
>>> my_series = pd.Series(['Alpha P17089OText starts here'])
>>> my_series.str.replace(my_pattern, '', regex=True)
0 Text starts here
There is a bit of work to be done to determine the nature of your pattern. I would suggest experimenting a bit with https://regex101.com/
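For the data shown in the question, the IDs appear to be a capital "P" followed by nine digits. Assuming that pattern holds for every row, a single vectorized replace is enough (a sketch; the `description` column name is made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "column_one": ["Home /buy /York /Warehouse /P000166770Ou...",
                   "Home /buy /York /Plot /P000165923A plot of la..."],
})
# Assumption: every ID is "P" followed by nine digits.
# The non-greedy ^.*? drops everything up to and including the first ID.
df["description"] = df["column_one"].str.replace(r"^.*?P\d{9}", "", regex=True)
```

Note the non-greedy `.*?`: a greedy `.*` would skip past the ID if another `P` plus nine digits occurred later in the string.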
To extend your split() idea:
df.apply(lambda x: x['column_one'].split(x['column_two'])[1], axis=1)
0 Text starts here
I managed to get it to work using:
df.apply(lambda x: x['column1'].split(x['column2'])[1] if x['column2'] in x['column1'] else x['column1'], axis=1)
This also works when the ID is not in the description. Thanks for the help!
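If the row-wise apply() turns out to be slow on the full file, one vectorized alternative (a sketch, not from the answers above) is to build a single alternation out of all the IDs in column2. The caveat is that this strips up to *any* listed ID, not necessarily the row's own, which is fine as long as each ID only ever appears in its own row:

```python
import re
import pandas as pd

df = pd.DataFrame({
    "column1": ["Home /buy /York /Warehouse /P000166770Ou...",
                "Brand new apartment on the first floor, situat..."],
    "column2": ["P000166770", "P000185616"],
})
# Build one alternation of all (escaped) IDs; strip everything up to and
# including whichever ID occurs. Rows containing no ID are left untouched.
ids = "|".join(map(re.escape, df["column2"].unique()))
df["clean"] = df["column1"].str.replace(rf"^.*?(?:{ids})", "", regex=True)
```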
Here is one way to do it, applying a regex to each row based on the code column:
import re

def ext(row):
    # capture everything after the code; escape it in case it
    # contains regex metacharacters
    mch = re.findall(r"{0}(.*)".format(re.escape(row['code'])), row['txt'])
    if len(mch) > 0:
        rtn = mch.pop()
    else:
        rtn = row['txt']
    return rtn

df['ext'] = df.apply(ext, axis=1)
df
x txt code ext
0 0 Home /buy /York /Warehouse / P000166770 Ou... P000166770 Ou...
1 1 Home /buy /York /Plot /P000165923A plot of la... P000165923 A plot of la...
2 2 Home /buy /London /Commercial /P000165504A str... P000165504 A str...
3 804 Brand new apartment on the first floor situat... P000185616 Brand new apartment on the first floor situat...
I have a dataframe which includes the names of movie titles and TV series.
Using specific keywords, I want to classify each row as Movie or Series. However, because the brackets leave no space before the keywords, the keywords are not being picked up by the str.contains() function and I need a workaround.
This is my dataframe:
import pandas as pd
import numpy as np
watched_df = pd.DataFrame([['Love Death Robots (Episode 1)'],
['James Bond'],
['How I met your Mother (Avnsitt 3)'],
['random name'],
['Random movie 3 Episode 8383893']],
columns=['Title'])
watched_df.head()
To add the column that classifies the titles as TV series or Movies I have the following code.
watched_df["temporary_brackets_removed_title"] = watched_df['Title'].str.replace('(', '', regex=False)
watched_df["Film_Type"] = np.where(watched_df.temporary_brackets_removed_title.astype(str).str.contains(pat='Episode | Avnsitt', case=False), 'Series', 'Movie')
watched_df = watched_df.drop(columns='temporary_brackets_removed_title')
watched_df.head()
Is there a simpler way to solve this without having to add and drop a column?
Maybe a str.contains-like function that checks whether a string merely contains the given word rather than matching it exactly, similar to the "LIKE" functionality in SQL?
You can use str.contains and then map the results:
watched_df['Film_Type'] = watched_df['Title'].str.contains(r'(?:Episode|Avnsitt)').map({True: 'Series', False: 'Movie'})
Output:
>>> watched_df
Title Film_Type
0 Love Death Robots (Episode 1) Series
1 James Bond Movie
2 How I met your Mother (Avnsitt 3) Series
3 random name Movie
4 Random movie 3 Episode 8383893 Series
Here are the categories, each with the list of words I'll be checking the rows against for a match:
fashion = ['bag','purse','pen']
general = ['knob','hanger','bottle','printing','tissues','book','tissue','holder','heart']
decor =['holder','decoration','candels','frame','paisley','bunting','decorations','party','candles','design','clock','sign','vintage','hanging','mirror','drawer','home','clusters','placements','willow','stickers','box']
kitchen = ['pantry','jam','cake','glass','bowl','napkins','kitchen','baking','jar','mug','cookie','bowl','placements','molds','coaster','placemats']
holiday = ['rabbit','ornament','christmas','trinket','party']
garden = ['lantern','hammok','garden','tree']
kids = ['children','doll','birdie','asstd','bank','soldiers','spaceboy','childs']
Here is my code. (I am checking sentences for keywords and assigning each row a category accordingly. I want to allow overlap, so one row can have more than one category.)
#check if description row contains words from one of our category lists
df['description'] = np.select(
[
(df['description'].str.contains('|'.join(fashion))),
(df['description'].str.contains('|'.join(general))),
(df['description'].str.contains('|'.join(decor))),
(df['description'].str.contains('|'.join(kitchen))),
(df['description'].str.contains('|'.join(holiday))),
(df['description'].str.contains('|'.join(garden))),
(df['description'].str.contains('|'.join(kids)))
],
['fashion','general','decor','kitchen','holiday','garden','kids'],
'Other'
)
Current Output:
index description category
0 children wine glass kids
1 candles decor
2 christmas tree holiday
3 bottle general
4 soldiers kids
5 bag fashion
Expected Output:
index description category
0 children wine glass kids, kitchen
1 candles decor
2 christmas tree holiday, garden
3 bottle general
4 soldiers kids
5 bag fashion
Here's an option using apply():
df = pd.DataFrame({'description': ['children wine glass',
'candles',
'christmas tree',
'bottle',
'soldiers',
'bag']})
def categorize(desc):
    lst = []
    for w in desc.split(' '):
        if w in fashion:
            lst.append('fashion')
        if w in general:
            lst.append('general')
        if w in decor:
            lst.append('decor')
        if w in kitchen:
            lst.append('kitchen')
        if w in holiday:
            lst.append('holiday')
        if w in garden:
            lst.append('garden')
        if w in kids:
            lst.append('kids')
    return ', '.join(lst)
df.apply(lambda x: categorize(x.description), axis=1)
Output:
0 kids, kitchen
1 decor
2 holiday, garden
3 general
4 kids
5 fashion
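The chain of if-statements can also be collapsed by keeping the category lists in a dict, so adding a category is just adding one entry (a sketch, with abbreviated vocabularies; the full lists from the question slot in unchanged). Note the categories come out in dict insertion order:

```python
# Map each category name to its vocabulary (abbreviated here).
categories = {
    "fashion": ["bag", "purse", "pen"],
    "kitchen": ["glass", "bowl", "mug"],
    "kids": ["children", "doll", "soldiers"],
}

def categorize(desc):
    words = set(desc.split())
    # keep every category whose vocabulary intersects the description
    return ", ".join(cat for cat, vocab in categories.items()
                     if words & set(vocab))
```

For example, categorize("children wine glass") gives "kitchen, kids" with the dict above.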
Here's how I would do it.
Comments above each line give you the details of what I am doing.
Steps:
1. Convert all the categories into key:value pairs: use the words in each category as keys and the category name as the value, so you can look up a word and map it back to its category.
2. Split the description field into multiple columns using split(expand=True).
3. Match each column against the dictionary; the result will be category names and NaNs.
4. Join these back into a single ', '-separated column, excluding the NaNs, to get the final result, then apply pd.unique() to remove duplicate categories.
The seven lines of code you need are:
dict_keys = ['fashion','general','decor','kitchen','holiday','garden','kids']
dict_cats = [fashion,general,decor,kitchen,holiday,garden,kids]
s_dict = {val:dict_keys[i] for i,cats in enumerate(dict_cats) for val in cats}
temp = df['description'].str.split(expand=True)
temp = temp.applymap(s_dict.get)
df['new_category'] = temp.apply(lambda x: ','.join(x[x.notnull()]), axis = 1)
df['new_category'] = df['new_category'].apply(lambda x: ', '.join(pd.unique(x.split(','))))
If you have more categories, just add it to dict_keys and dict_cats. Everything else stays the same.
The full code with comments begins here:
import pandas as pd
c = ['description','category']
d = [['children wine glass','kids'],
['candles','decor'],
['christmas tree','holiday'],
['bottle','general'],
['soldiers','kids'],
['bag','fashion']]
df = pd.DataFrame(d,columns = c)
fashion = ['bag','purse','pen']
general = ['knob','hanger','bottle','printing','tissues','book','tissue','holder','heart']
decor =['holder','decoration','candels','frame','paisley','bunting','decorations','party','candles','design','clock','sign','vintage','hanging','mirror','drawer','home','clusters','placements','willow','stickers','box']
kitchen = ['pantry','jam','cake','glass','bowl','napkins','kitchen','baking','jar','mug','cookie','bowl','placements','molds','coaster','placemats']
holiday = ['rabbit','ornament','christmas','trinket','party']
garden = ['lantern','hammok','garden','tree']
kids = ['children','doll','birdie','asstd','bank','soldiers','spaceboy','childs']
#create a list of all the lists
dict_keys = ['fashion','general','decor','kitchen','holiday','garden','kids']
dict_cats = [fashion,general,decor,kitchen,holiday,garden,kids]
#create a dictionary with words from the list as key and category as value
s_dict = {val:dict_keys[i] for i,cats in enumerate(dict_cats) for val in cats}
#create a temp dataframe with one word for each column using split
temp = df['description'].str.split(expand=True)
#match the words in each column against the dictionary
temp = temp.applymap(s_dict.get)
#Now put them back together and you have the final list
df['new_category'] = temp.apply(lambda x: ','.join(x[x.notnull()]), axis = 1)
#Remove duplicates using pd.unique()
#Note: prev line join modified to ',' from ', '
df['new_category'] = df['new_category'].apply(lambda x: ', '.join(pd.unique(x.split(','))))
print (df)
The output of this will be (I kept your category column and created a new one called new_category):
description category new_category
0 children wine glass kids kids, kitchen
1 candles decor decor
2 christmas tree holiday holiday, garden
3 bottle general general
4 soldiers kids kids
5 bag fashion fashion
The output when the input also includes 'party candles holder' is:
description category new_category
0 children wine glass kids kids, kitchen
1 candles decor decor
2 christmas tree holiday holiday, garden
3 bottle general general
4 party candles holder None holiday, decor
5 soldiers kids kids
6 bag fashion fashion
I have a column, let's say 'Match Place', with entries like 'MANU # POR', 'MANU vs. UTA', 'MANU # IND', 'MANU vs. GRE', etc. So each entry has three parts: the first code (MANU), then '#' or 'vs.', then the second code. What I want is: if '#' appears in an entry, change the whole entry to 'away', and if 'vs.' appears, change it to 'home'. So 'MANU # POR' should become 'away' and 'MANU vs. GRE' should become 'home'.
I wrote some code to do this using for/if/else, but it takes far too long to run over my 30,697 rows.
Is there any other way to reduce the time?
My code is below. Please help.
for i in range(len(df)):
    if pd.notna(df['home/away'][i]):
        temp = df['home/away'][i].split()
        if temp[1] == '#':
            df['home/away'][i] = 'away'
        else:
            df['home/away'][i] = 'home'
You can use np.select to assign multiple conditions:
s = df['Match Place'].str.split().str[1]   # select the middle element
c1, c2 = s.eq('#'), s.eq('vs.')            # build the two conditions
np.select([c1, c2], ['away', 'home'])      # assign this to the desired column
# array(['away', 'home', 'away', 'home'], dtype='<U11')
Use np.where with str.contains to check whether the substring exists:
import numpy as np
df = pd.DataFrame(data={"col1":["manu vs. abc","manu # pro"]})
df['type'] = np.where(df['col1'].str.contains("#"),"away","home")
col1 type
0 manu vs. abc home
1 manu # pro away
You can use .str.contains(..) to check whether the string contains a '#', and then use .map(..) to fill in the values accordingly. For example:
>>> df
match
0 MANU # POR
1 MANU vs. UTA
2 MANU # IND
3 MANU vs. GRE
>>> df['match'].str.contains('#').map({False: 'home', True: 'away'})
0 away
1 home
2 away
3 home
Name: match, dtype: object
A fun usage of replace: because the replacement value 0 is not a string, any entry matching the regex '#' is replaced wholly by 0, which becomes False under astype(bool).
df['match'].replace({'#':0},regex=True).astype(bool).map({False: 'away', True: 'home'})
0 away
1 home
2 away
3 home
Name: match, dtype: object
I have 3 different columns in different dataframes that look like this.
Column 1 has sentence templates, e.g. "He would like to [action] this week".
Column 2 has pairs of words, e.g. "exercise, swim".
The third column has the type for the word pair, e.g. [action].
I assume there should be something similar to "melt" in R, but I'm not sure how to do the replacement.
I would like to create a new column/dataframe which will have all the possible options for each sentence template (one sentence per row):
He would like to exercise this week.
He would like to swim this week.
The number of templates is significantly lower than the number of words I have. There are several types of word pairs (action, description, object, etc).
#a simple example of what I would like to achieve
import pandas as pd
#input1
templates = pd.DataFrame(columns=list('AB'))
templates.loc[0] = [1,'He wants to [action] this week']
templates.loc[1] = [2,'She noticed a(n) [object] in the distance']
templates
#input 2
words = pd.DataFrame(columns=list('AB'))
words.loc[0] = ['exercise, swim', 'action']
words.loc[1] = ['bus, shop', 'object']
words
#output
result = pd.DataFrame(columns=list('AB'))
result.loc[0] = [1, 'He wants to exercise this week']
result.loc[1] = [2, 'He wants to swim this week']
result.loc[2] = [3, 'She noticed a(n) bus in the distance']
result.loc[3] = [4, 'She noticed a(n) shop in the distance']
result
First create new columns with Series.str.extract, using the words from words['B'], and then use Series.map to get the values for replacement:
import re

pat = '|'.join(r"\[{}\]".format(re.escape(x)) for x in words['B'])
templates['matched'] = templates['B'].str.extract('(' + pat + ')', expand=False).fillna('')
templates['repl'] = (templates['matched']
                     .map(words.set_index('B')['A'].rename(lambda x: '[' + x + ']'))
                     .fillna(''))
print (templates)
A B matched repl
0 1 He wants to [action] this week [action] exercise, swim
1 2 She noticed a(n) [object] in the distance [object] bus, shop
And then replace in list comprehension:
z = zip(templates['B'],templates['repl'], templates['matched'])
result = pd.DataFrame({'B':[a.replace(c, y) for a,b,c in z for y in b.split(', ')]})
result.insert(0, 'A', result.index + 1)
print (result)
A B
0 1 He wants to exercise this week
1 2 He wants to swim this week
2 3 She noticed a(n) bus in the distance
3 4 She noticed a(n) shop in the distance
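The same result can also be reached with a merge-and-explode approach (a sketch, not the answer above; the column names follow the example in the question):

```python
import pandas as pd

templates = pd.DataFrame({"A": [1, 2],
                          "B": ["He wants to [action] this week",
                                "She noticed a(n) [object] in the distance"]})
words = pd.DataFrame({"A": ["exercise, swim", "bus, shop"],
                      "B": ["action", "object"]})

# Bracket the type, split the comma-separated word pairs into lists,
# and explode so there is one row per word.
w = words.assign(B="[" + words["B"] + "]",
                 A=words["A"].str.split(", ")).explode("A")
# Extract the placeholder from each template and join on it, giving one
# row per (template, word) combination.
merged = (templates.assign(key=templates["B"].str.extract(r"(\[\w+\])",
                                                          expand=False))
          .merge(w, left_on="key", right_on="B", suffixes=("", "_w")))
# Substitute each word into its template and renumber.
result = pd.DataFrame({
    "A": range(1, len(merged) + 1),
    "B": [t.replace(k, word)
          for t, k, word in zip(merged["B"], merged["key"], merged["A_w"])],
})
```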