Apply function is not working on a data-frame column - python

I am trying to remove special characters like "." and "-" (everything except commas) from the "Actors" column of my pandas data-frame. For this I use the apply method on the "Actors" column:
df['Actors'] = df['Actors'].apply(lambda x: x.lower().replace("[^a-zA-Z,]", ""))
df['Actors'].head()
The output of the above snippet is shown below and we can see no special characters have been replaced:
1 tim robbins, morgan freeman, bob gunton, willi...
2 marlon brando, al pacino, james caan, richard ...
3 al pacino, robert duvall, diane keaton, robert...
4 christian bale, heath ledger, aaron eckhart, m...
5 martin balsam, john fiedler, lee j. cobb, e.g....
Name: Actors, dtype: object
But when I try resolving the above issue using the snippet below, the code works:
df['Actors'] = df['Actors'].str.lower().str.replace("[^a-zA-Z,]","")
df['Actors'].head()
1 timrobbins,morganfreeman,bobgunton,williamsadler
2 marlonbrando,alpacino,jamescaan,richardscastel...
3 alpacino,robertduvall,dianekeaton,robertdeniro
4 christianbale,heathledger,aaroneckhart,michael...
5 martinbalsam,johnfiedler,leejcobb,egmarshall
Name: Actors, dtype: object
I want to know what it is about the apply function that makes it fail to replace the characters?

You call apply on a Series, so x in the lambda is the plain string from each row. That means x.lower().replace is Python's built-in str.replace, which does not support regex: it treats "[^a-zA-Z,]" as a literal substring and looks for exactly that text in each x. It never finds it, so nothing gets replaced.
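A quick illustration of the difference, using a throwaway sample string:
import re

s = "tim robbins, morgan freeman."
# built-in str.replace looks for the literal text "[^a-zA-Z,]", which isn't in s
print(s.replace("[^a-zA-Z,]", ""))   # tim robbins, morgan freeman. (unchanged)
# re.sub interprets the same pattern as a regex and strips spaces and dots
print(re.sub("[^a-zA-Z,]", "", s))   # timrobbins,morganfreeman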
On the other hand, pandas str.replace treats the pattern as a regex (regex=True was the default in older pandas versions), so it interprets "[^a-zA-Z,]" as a regex pattern and removes the unwanted characters properly. Note that from pandas 2.0 the default is regex=False, so you should pass regex=True explicitly.
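With a recent pandas, the vectorized version from the question is clearest with the flag spelled out:
df['Actors'] = df['Actors'].str.lower().str.replace(r"[^a-zA-Z,]", "", regex=True)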

It does not work because you call replace on a plain string, i.e. str.replace("[^a-zA-Z,]", ""). Your string does not contain the literal text [^a-zA-Z,], so nothing is removed. In other words, Python does not interpret those characters as a regex, but simply as ordinary string characters.
To make it work with apply you could do it like this (just to answer your question; the preferred way is your second example):
import re

remove = re.compile(r"[^a-zA-Z,]")
df['Actors'] = df['Actors'].apply(lambda x: remove.sub("", x.lower()))
Here is some documentation:
python str replace
pandas str replace

Related

How to match city names split by space?

Given two different kinds of address strings, I am trying to determine whether the city name is actually made up of more than one word. Working in Python, I split the string and save s[0] for the street number, s[-1] for the zip code, and so on, but how do I figure out whether the city name is a multi-word name such as New York or San Jose?
E.g. : 123 Main Street St. Louisville OH 43071 [City name is single word]
E. g : 45 Holy Grail Al. Niagara Town ZP 32908 [City name 'Niagara Town' is two words]
Forgive the noob question.
Thank you,
I am making two assumptions here:
1) That the code before the town name is always numeric
2) That no town name contains a number
tokens = address.split()  # assumes the trailing zip code has already been stripped, per the question's workflow
index = list(filter(lambda t: t[1].isnumeric(), enumerate(tokens)))[-1][0]
city = " ".join(tokens[index + 1:])
So what is happening: we identify the last token of the split that is purely numeric and take its index, then we join all the elements after that numeric element.

Python Pandas - Removing trailing numbers and the remaining words in string

How do I remove a number and everything after it using pandas? Basically, whenever a number appears as a separate word, remove that word and everything that follows it.
For example:
ABC,2 QUEEN = ABC
ABC 3 QUEEN = ABC
ABC PTE LTD YES123 = ABC PTE LTD YES123
ABC PTE LTD YES 123 = ABC PTE LTD
Try this:
df['MyCol'].replace(r'[,\s]+\d+.*', '', regex=True)
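A minimal check of that pattern against the example values from the question (column name MyCol assumed):
import pandas as pd

df = pd.DataFrame({'MyCol': ['ABC,2 QUEEN', 'ABC 3 QUEEN',
                             'ABC PTE LTD YES123', 'ABC PTE LTD YES 123']})
print(df['MyCol'].replace(r'[,\s]+\d+.*', '', regex=True))
# 0 ABC
# 1 ABC
# 2 ABC PTE LTD YES123
# 3 ABC PTE LTD
# Name: MyCol, dtype: object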
I don't think pandas is the best way to accomplish that task. You could use NLTK tokenization to split each row into words, then iterate through the tokens, collecting them in a separate list until a number is encountered, at which point you break and move on to the next row.
This is quite crude but please try
df['MyCol'].str.split('[ |,][0-9]+')
The drawback is that you will have to extract index 0 of the returned list to overwrite the original column (see the sketch after the output below). Alternatively, set the parameter expand=True and drop all the extra columns that are generated.
df['MyCol'].str.split('[ |,][0-9]+', expand=True)
Output:
0 [ABC, QUEEN]
1 [ABC, QUEEN]
2 [ABC PTE LTD YES123]
3 [ABC PTE LTD YES, ]
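For completeness, a rough sketch of taking element 0 of each split list to overwrite the column, using the same example values as above:
df['MyCol'] = df['MyCol'].str.split('[ |,][0-9]+').str[0]
# 0 ABC
# 1 ABC
# 2 ABC PTE LTD YES123
# 3 ABC PTE LTD YES
# Name: MyCol, dtype: object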

How do I extract characters from a string in Python?

I need to make some name formats match for merging later on in my script. My column 'Name' is imported from a csv and contains names like the following:
Antonio Brown
LeSean McCoy
Le'Veon Bell
For my script, I would like to get the first letter of the first name and combine it with the last name as such....
A.Brown
L.McCoy
L.Bell
Here's what I have right now, which returns NaN every time:
ff['AbbrName'] = ff['Name'].str.extract('([A-Z]\s[a-zA-Z]+)', expand=True)
Thanks!
Another option is the str.replace method with ^([A-Z]).*?([a-zA-Z]+)$: ^([A-Z]) captures the first letter at the beginning of the string and ([a-zA-Z]+)$ captures the last word; the name is then reconstructed by putting a . between the first and second captured groups:
df['Name'].str.replace(r'^([A-Z]).*?([a-zA-Z]+)$', r'\1.\2')
#0 A.Brown
#1 L.McCoy
#2 L.Bell
#Name: Name, dtype: object
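Note that from pandas 2.0 onward str.replace no longer treats the pattern as a regex by default, so on a recent pandas the same call needs an explicit flag:
df['Name'].str.replace(r'^([A-Z]).*?([a-zA-Z]+)$', r'\1.\2', regex=True)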
What if you just apply() a function that splits on the first space and combines the first character of the first word with the rest:
import pandas as pd
def abbreviate(row):
    first_word, rest = row['Name'].split(" ", 1)
    return first_word[0] + ". " + rest
df = pd.DataFrame({'Name': ['Antonio Brown', 'LeSean McCoy', "Le'Veon Bell"]})
df['AbbrName'] = df.apply(abbreviate, axis=1)
print(df)
Prints:
Name AbbrName
0 Antonio Brown A. Brown
1 LeSean McCoy L. McCoy
2 Le'Veon Bell L. Bell
This should be simple enough to do, even without regex. Use a combination of string splitting and concatenation.
df.Name.str[0] + '.' + df.Name.str.split().str[-1]
0 A.Brown
1 L.McCoy
2 L.Bell
Name: Name, dtype: object
If there is a possibility of the Name column having leading spaces, replace df.Name.str[0] with df.Name.str.strip().str[0].
Caveat: every name must consist of at least two words.
You get NaN because your regular expression cannot match the names.
Instead I'd try the following:
parts = ff['Name'].str.split(' ')
ff['AbbrName'] = parts.str[0].str[0] + '.' + parts.str[1]

Select string which contains punctuation

So I'm trying to remove titles from a set of professors' names,
like Dr.Eng, Dr.rer.nat, M.S., Dr., S.Si and so forth. Basically any string that contains more than one dot.
This is an example list after I have split the name and the title based on ","
2 [CHOTIMAH, Dr., M.S., RINTO ANUGRAHA NQZ, S...
3 [HARSOJO, S.U., M.Sc., Dr., SUDARMAJI, S.S...
4 [IKHSAN SETIAWAN, S.Si., M.Si., ARI SETIAWAN...
5 [EKO SULISTYA, Dr., M.Si., YOSEF ROBERTUS UT...
6 [SUNARTA, Drs., M.S., WAGINI R., Drs., M.S.]
7 [BAMBANG MURDAKA EKA JATI, Drs., M.S., KAMSU...
8 [AHMAD KUSUMA ATMAJA, S.Si., M.Sc., Dr.Eng....
9 [MOH. ALI JOKO WASONO, M.S., Dr.]
I have tried r'\S*[^\w\s]\S' but it returned
CHOTIMAH, INTO ANUGRAHA NQZ, .
HARSOJO, UDARMAJI, i.
IKHSAN SETIAWAN, RI SETIAWAN, ng.
EKO SULISTYA, OSEF ROBERTUS UTOMO, Dr.
SUNARTA, AGINI .
BAMBANG MURDAKA EKA JATI, AMSUL ABRAHA, Prof.
AHMAD KUSUMA ATMAJA, ITRAYANA, Dr.
MOH. ALI JOKO WASONO, Dr.
Some professors' names are shortened, e.g. MOHAMMAD to MOH., and I don't want those to get removed.
Any help is appreciated!
\w{0,}\.(\w{0,}\.)? — this regex will grab a word of any length followed by a period, and will optionally look for another word of any length followed by a period. This captures Dr., M.S., etc. I'm pretty sure that's what you're asking for; if not, let me know.
In the future you can use regexr.com to easily test regex matches. Also, you've tagged this post with Python and Pandas, but those aren't really relevant tags. Please either include more code to make the tags relevant or avoid using irrelevant tags.
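As a rough sketch of how that idea could be applied to the comma-separated pieces (the sample list and the exact title pattern here are just assumptions for illustration):
import re

# treat a piece as a title if it is nothing but dot-terminated chunks, e.g. "Dr.", "M.S.", "S.Si."
title_re = re.compile(r'\w*\.(\w*\.)*')
parts = ['CHOTIMAH', ' Dr.', ' M.S.', ' RINTO ANUGRAHA NQZ', ' S.Si.']
names = [p.strip() for p in parts if not title_re.fullmatch(p.strip())]
print(names)   # ['CHOTIMAH', 'RINTO ANUGRAHA NQZ']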

using wildcards inside mapping functions python

Is there a way to use a wildcard in a mapping function? I have something like this:
dictionary = {'James':'James Mcree'}
basically stating that anywhere it finds James in my data frame it changes the name to James Mcree, but I would like to throw a wildcard into my mapping function, like this:
dictionary = {'Jam*':'James Mcree'}
So it would look in my data frame and anywhere a value starts with the letters Jam it would change it to James Mcree. It wouldn't matter what comes after Jam; it could be Jammed, Jamis, etc. I just want to use the wildcard to say: if it has these letters, change it to the name specified.
If I am understanding the asterisk correctly, it represents "anything from this point on", and the match is changed to the designated name, which is James Mcree.
Furthermore, if this is possible, is there also a way to specify something like this:
dictionary = {'J*s':'James Mcree'}
so it would look for anything that starts with J and ends with s.
I haven't found a way to do this; any help would be great. Thanks ahead of time.
You can do it using regexes:
Demo:
In [29]: df
Out[29]:
a b c
0 Ivan Jayesh James
1 Jan Jaaaaas Bob
In [30]: df = df.replace(['^J.*s$','Bo.*'],['James Mcree','Bobby'], regex=True)
In [31]: df
Out[31]:
a b c
0 Ivan Jayesh James Mcree
1 Jan James Mcree Bobby
'^J.*s$' is a regex which means: find a string beginning with J, then any number of any characters, followed by s.
Special RegEx symbols:
^ - beginning of the string
$ - end of the string
Here is an online service which explains RegEx's: https://regex101.com/r/Uu8bUV/1
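Closer to the dictionary form in the question, DataFrame.replace also accepts a dict of regex patterns via the regex keyword (the pattern below is just the OP's 'Jam*' wildcard rewritten as a regex):
df = df.replace(regex={'^Jam.*': 'James Mcree'})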
