Python: Trim strings in a column

I have a dataframe column whose leading and trailing parts I would like to trim. The column has contents such as ['Tim [Boy]', 'Gina [Girl]', ...] and I would like to make a new column that just has ['Boy', 'Girl', ...]. I tried using rstrip and lstrip but have had no luck. Please advise. Thank you.

I assume that the cells of the column are 'Tim [Boy]', etc.
Such as in:
name_gender
0 AAa [Boy]
1 BBc [Girl]
You want to use a replace method call passing a regular expression to pandas.
Assuming that your dataframe is called df, the original column name is 'name_gender' and the destination (new column) name is 'gender', you can use the following code:
df['gender'] = df['name_gender'].replace(r'.*\[(.*)\]', r'\1', regex=True)
or, as suggested by @mozway below, this can also be written as:
df['gender'] = df['name_gender'].str.extract(r'.*\[(.*)\]')
You end up with:
name_gender gender
0 AAa [Boy] Boy
1 BBc [Girl] Girl
The regexp r'.*\[(.*)\]' can be interpreted as matching anything, then a '[', then anything which is stored into a capture group (that's what the parentheses are there for), then a ']'. The whole match is then replaced with the contents of group 1 (the only group used in the pattern).
You might want to read up on regexps if you don't know them.
Anything which does not match the pattern will not be replaced. You might want to add a check to detect whether some rows don't match that pattern ("name [gender]").
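Such a check could look like the following minimal sketch (the sample dataframe, including the deliberately malformed third row, is made up for illustration):

```python
import pandas as pd

# Sample data; the last row deliberately lacks the "name [gender]" pattern
df = pd.DataFrame({'name_gender': ['AAa [Boy]', 'BBc [Girl]', 'no brackets here']})

# Rows that don't match the pattern come back as NaN
df['gender'] = df['name_gender'].str.extract(r'.*\[(.*)\]', expand=False)

# Collect the offending rows so they can be inspected or fixed
bad = df[df['gender'].isna()]
print(bad['name_gender'].tolist())  # ['no brackets here']
```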

Related

I have a list and I want to print a specific string from it. How can I do that?

So far I have done this, but it returns the movie name; I want the year (e.g. 1995) in a separate list.
moviename = []
for i in names:
    moviename.append(i.split(' (', 1)[0])
One issue with the code you have is that you're getting the first element of the list returned by split, which is the movie title. You want the second element: split(' (', 1)[1].
That being said, this solution won't work very well for a couple of reasons.
You will still have the closing parenthesis in the year: "1995)"
It won't work if the title itself contains parentheses (e.g. for Shanghai Triad)
If the year is always at the end of each string, you could do something like this.
movie_years = []
for movie in names:
    movie_years.append(movie[-5:-1])
You could use a regular expression.
\(\d+\) will match an opening parenthesis, followed by one or more digit characters (0-9), followed by a closing parenthesis.
Put only the \d+ part inside a capturing group to get only that part.
import re

year_regex = r'\((\d+)\)'
moviename = []
for i in names:
    match = re.search(year_regex, i)
    if match:
        moviename.append(match.group(1))
By the way, you can make this all more concise using a list comprehension:
year_regex = r'\((\d+)\)'
moviename = [re.search(year_regex, name_and_year).group(1)
             for name_and_year in names
             if re.search(year_regex, name_and_year)]
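On Python 3.8+, an assignment expression lets you call re.search only once per string instead of twice; a small sketch with made-up sample data:

```python
import re

# Hypothetical sample data
names = ['Toy Story (1995)', 'Jumanji (1995)', 'No Year Here']
year_regex = r'\((\d+)\)'

# := binds the match object in the filter so re.search runs once per string
years = [m.group(1) for name in names if (m := re.search(year_regex, name))]
print(years)  # ['1995', '1995']
```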

Regex separator for splitting a CSV with double brackets / nd lists

I have a .csv following this logic
name, number, 2dlist, bool
"entry1", 1, [[0,1],[2,3]], true
"entry2", 2, [[4,5],[6,7]], true
What kind of regex do I need to separate each row into four columns, so that everything inside the double square brackets is treated as one column, i.e. [[ ... ]]?
I'm new to regex but managed to edit the following code sample
df = pd.read_csv("../file.csv", sep=r",(?![^\[]*[\]])",header=0, engine="python")
which does work with single brackets but not with double. That is, the comma between the inner lists (1],[2) still gets recognized as a separator even though it shouldn't.
This is a part of a hobby project and I might change the initial approach for better. However, at this point I'm only curious about the regex that would work in this specific case.
With your sample, you can probably just split on ', ' (comma followed by a space), but maybe it's not so simple:
df = pd.read_csv('data.csv', sep=', ', engine='python')
print(df)
# Output
name number 2dlist bool
0 "entry1" 1 [[0,1],[2,3]] True
1 "entry2" 2 [[4,5],[6,7]] True
If your csv looks like this:
name,number,2dlist,bool
0,"entry1",1,"[[0,1],[2,3]]",True
1,"entry2",2,"[[4,5],[6,7]]",True
this would work fine:
df = pd.read_csv('data.csv', sep=',')
because the list is now stored between quotes, the commas inside it are ignored by the CSV parser. If the data is not stored that way, a good regex is required to separate the fields in a generic way. Try adding the regex tag to your question; you might get better solutions then.
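If the brackets are not quoted, one regex-free alternative (a sketch, not from the original answers) is to split each line manually, only on commas that sit outside any brackets:

```python
import pandas as pd

def split_outside_brackets(line, sep=','):
    """Split `line` on `sep`, but only where the bracket depth is zero."""
    parts, current, depth = [], [], 0
    for ch in line:
        if ch == '[':
            depth += 1
        elif ch == ']':
            depth -= 1
        if ch == sep and depth == 0:
            parts.append(''.join(current).strip())
            current = []
        else:
            current.append(ch)
    parts.append(''.join(current).strip())
    return parts

# Made-up sample rows mirroring the question's format
rows = [
    '"entry1", 1, [[0,1],[2,3]], true',
    '"entry2", 2, [[4,5],[6,7]], true',
]
df = pd.DataFrame([split_outside_brackets(r) for r in rows],
                  columns=['name', 'number', '2dlist', 'bool'])
print(df['2dlist'].tolist())  # ['[[0,1],[2,3]]', '[[4,5],[6,7]]']
```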

Pandas how to filter for multiple substrings in series

I would like to check if pandas dataframe column id contains the following substrings '.F1', '.N1', '.FW', '.SP'.
I am currently using the following codes:
searchfor = ['.F1', '.N1', '.FW', '.SP']
mask = (df["id"].str.contains('|'.join(searchfor)))
The id column looks like such:
ID
0 F611B4E369F1D293B5
1 10302389527F190F1A
I actually want to check whether the id column contains the four substrings starting with a literal '.'. As it stands, plain F1 gets matched as well; in the current example, the id does not contain .F1. I would really appreciate it if someone would let me know how to solve this particular issue. Thank you so much.
You can use re.escape() to escape the regex meta-characters, so that you don't need to escape every string in the word list searchfor by hand (no need to change the definition of searchfor):
import re
searchfor = ['.F1', '.N1', '.FW', '.SP'] # no need to escape each string
pattern = '|'.join(map(re.escape, searchfor)) # use re.escape() with map()
mask = (df["id"].str.contains(pattern))
re.escape() will escape each string for you:
print(pattern)
# \.F1|\.N1|\.FW|\.SP
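As a quick sanity check, here is the mask applied to a few made-up ids; only the ones containing a literal dot before the code match:

```python
import re
import pandas as pd

# Hypothetical ids; only the last two contain a literal '.' before the code
df = pd.DataFrame({'id': ['F611B4E369F1D293B5', 'ABC.F1DEF', 'XYZ.SP123']})

searchfor = ['.F1', '.N1', '.FW', '.SP']
pattern = '|'.join(map(re.escape, searchfor))
mask = df['id'].str.contains(pattern)
print(mask.tolist())  # [False, True, True]
```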

How to perform str.strip in dataframe and save it with inplace=true?

I have a dataframe with n columns, and I want to strip the strings in one of those columns. I was able to do it, but I want this change to be reflected in the original dataframe.
Dataframe: data
Name
0 210123278414410005
1 101232784144610006
2 210123278414410007
3 21012-27841-410008
4 210123278414410009
After stripping:
Name
0 10005
1 10006
2 10007
3 10008
4 10009
I tried the code below and the strip was successful:
data['Name'].str.strip().str[13:]
However, if I check the dataframe, the strip is not reflected.
I am looking for something like an inplace parameter.
String methods (the attributes of the .str attribute on a series) will only ever return a new Series, you can't use these for in-place changes. Your only option is to assign it back to the same column:
data['Name'] = data['Name'].str.strip().str[13:]
You could instead use the Series.replace() method with a regular expression, and inplace=True:
data['Name'].replace(r'(?s)\A\s*.{,13}(.*?)\s*\Z', r'\1', regex=True, inplace=True)
The regular expression above skips leading whitespace and the first 13 characters, then captures everything that follows up to the trailing whitespace. It produces the same output as .str.strip().str[13:], but makes the changes in place.
Put differently, the \A\s* and \s*\Z patterns act like str.strip() does, matching all whitespace at the start and end, respectively, and no more. The .{,13} pattern then skips up to 13 characters of what remains (fewer, if there are not enough characters after stripping), and the lazy (.*?) group captures everything in between; because the group is lazy, it stops before the trailing whitespace, which the final \s* consumes. That one group is then used as the replacement value.
And because . doesn't normally match \n characters, the (?s) flag at the start tells the regex engine to match newline characters anyway. We want all characters to be included after stripping, not just those up to the first newline; this way an input value like ' foo\nbar ' is handled correctly.
data['Name'].str.strip().str[13:] returns the new transformed column, but it does not change the data in place (inside the dataframe). You should write:
data['Name'] = data['Name'].str.strip().str[13:]
to write the transformed data to the Name column.
I agree with the other answers that there's no inplace parameter for the strip function, as seen in the documentation for str.strip.
To add to that: I've found the str functions for pandas Series are usually used when selecting specific rows, like df[df['Name'].str.contains('69')]. I'd say this is a possible reason it doesn't have an inplace parameter -- it's not meant to be completely "stand-alone" like rename or drop.
Also to add! I think a more pythonic solution is to use negative indices instead:
data['Name'] = data['Name'].str.strip().str[-5:]
This way, we don't have to assume that there are 18 characters; we'll consistently get the last 5 characters instead!
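A minimal sketch of the negative-index approach, using made-up values of different lengths (one with stray whitespace) to show why it is more robust:

```python
import pandas as pd

# Hypothetical values: different lengths, one with surrounding whitespace
data = pd.DataFrame({'Name': ['210123278414410005', ' 21012-27841-410008 ']})

# Strip whitespace, then keep the last 5 characters of each string
data['Name'] = data['Name'].str.strip().str[-5:]
print(data['Name'].tolist())  # ['10005', '10008']
```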
As per yatu's comment: you should reassign the Series with stripped values to the original column.
data['Name'] = data['Name'].str.strip().str[13:]
It is interesting to note that pandas DataFrames are built on numpy underneath.
There is also the option of doing element-wise operations directly in numpy.
Here is the example I had in mind:
import numpy as np
import pandas as pd

df = pd.DataFrame([['210123278414410005', '101232784144610006']])
dfn = df.to_numpy(copy=False)  # the underlying numpy array
df = pd.DataFrame(np.frompyfunc(lambda s: s[13:], 1, 1)(dfn))
print(df)  # 10005  10006
This doesn't answer your question directly, but it is just another option (it creates a new dataframe from the numpy array, though).

Pandas extracting text multiple times with same criteria

I have a DataFrame and in one cell I have a long text, e.g.:
-student- Kathrin A -/student- received abc and -student- Mike B -/student-
received def.
My question is: how can I extract the text between the -student- and -/student- tags and create two new columns, with "Kathrin A" in the first one and "Mike B" in the second? This pattern can occur twice or more times in the text.
What I have tried so far: str.extract(r'-student-\s*([^.]*)\s*-/student-', expand=False), but this only extracts the first match, i.e. Kathrin A.
Many thanks!
You could use str.split with a regex and define your delimiters as follows:
splittxt = ['-student-','-/student-']
df.text.str.split('|'.join(splittxt), expand=True)
Output:
0 1 2 3 4
0 Kathrin A received abc and Mike B received def.
Another approach would be to try extractall. The only caveat is that the result is put into multiple rows instead of multiple columns. With some rearranging this should not be an issue, and please update this response if you end up working it out.
That being said, I also have a slight modification to your regular expression which will help you with capturing both:
'(?<=-student-)(?:\s*)([\w\s]+)(?= -/student-)'
The only capturing group is [\w\s]+ so you'll be sure to not end up capturing the whole string.
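The rearranging mentioned above can be sketched with unstack, which pivots extractall's per-match rows into columns (sample dataframe reconstructed from the question's text):

```python
import pandas as pd

df = pd.DataFrame({'text': [
    '-student- Kathrin A -/student- received abc and '
    '-student- Mike B -/student- received def.'
]})

# extractall returns one row per match, indexed by (row, match number)
matches = df['text'].str.extractall(r'-student-\s*(.*?)\s*-/student-')

# Pivot the 'match' index level into columns: one column per occurrence
wide = matches[0].unstack('match')
print(wide.values.tolist())  # [['Kathrin A', 'Mike B']]
```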
