Pandas extracting text multiple times with same criteria - python

I have a DataFrame and in one cell I have a long text, e.g.:
-student- Kathrin A -/student- received abc and -student- Mike B -/student-
received def.
My question is: how can I extract the text between the -student- and -/student- markers and create two new columns, with "Kathrin A" in the first one and "Mike B" in the second? The pattern can occur twice or more in the text.
What I have tried so far: str.extract('-student-\s*([^.]*)\s*-/student-', expand=False), but this only extracts the first match, i.e. Kathrin A.
Many thanks!

You could use str.split with a regex and define your delimiters as follows:
splittxt = ['-student-','-/student-']
df.text.str.split('|'.join(splittxt), expand=True)
Output:
  0           1                   2       3               4
0      Kathrin A   received abc and   Mike B   received def.
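For reference, a runnable version of the split approach (the column name text is assumed from the question):

```python
import pandas as pd

df = pd.DataFrame({'text': ['-student- Kathrin A -/student- received abc and '
                            '-student- Mike B -/student- received def.']})
splittxt = ['-student-', '-/student-']
# a multi-character pattern is treated as a regex by str.split
out = df['text'].str.split('|'.join(splittxt), expand=True)
print(out)
```

Note the first column is an empty string because the text starts with a delimiter, and the name columns keep their surrounding whitespace, so a follow-up strip may be needed.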

Another approach would be to try extractall. The only caveat is that the result is put into multiple rows instead of multiple columns. With some rearranging this should not be an issue; please update this response if you end up working it out.
That being said, I also have a slight modification to your regular expression which will help you with capturing both.
'(?<=-student-)(?:\s*)([\w\s]+)(?= -/student-)'
The only capturing group is [\w\s]+ so you'll be sure to not end up capturing the whole string.
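A sketch of that extractall-plus-rearranging route, pivoting the per-match rows back into columns with unstack (the column name text is assumed from the question):

```python
import pandas as pd

df = pd.DataFrame({'text': ['-student- Kathrin A -/student- received abc and '
                            '-student- Mike B -/student- received def.']})
# extractall returns one row per match, indexed by (row, match number)
matches = df['text'].str.extractall(r'(?<=-student-)(?:\s*)([\w\s]+)(?= -/student-)')
# pivot the match level into columns: one column per occurrence
wide = matches[0].unstack('match')
print(wide)
```

Rows with fewer matches than the maximum would simply get NaN in the extra columns.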

Related

Python split text without spaces but keep dates as they are

To split text without spaces, one can use wordninja, please see How to split text without spaces into list of words. Here is the code to do the job.
sent = "Test12 to separate mergedwords butkeeprest asitis, say 1/2/2021 or 1.2.2021."
import wordninja
print(' '.join(wordninja.split(sent)))
output: Test 12 to separate merged words but keep rest as it is say 1 2 2021 or 1 2 2021
The wordninja package looks great and works well for splitting merged text. My question here is how I can split text without spaces but keep the dates (and punctuation) as they are. An ideal output would be:
Test 12 to separate merged words but keep rest as it is, say 1/2/2021 or 1.2.2021
Your help is much appreciated!
The idea here is to split our string into a list at every instance of a date then iterate over that list preserving items that matched the initial split pattern and calling wordninja.split() on everything else. Then recombine the list with join.
import re
def foo(s):
    return 'ninja'
string = 'Test12 to separate mergedwords butkeeprest asitis, say 1/2/2021 or 1.2.2021.'
pattern = re.compile(r'([0-9]{1,2}[/.][0-9]{1,2}[/.][0-9]{1,4})')
# Split the string up by things matching our pattern, preserve rest of string.
string_isolated_dates = re.split(pattern, string)
# Apply wordninja to everything that doesn't match our date pattern, join it all together. OP should replace foo in the next line with wordninja.split()
wordninja_applied = ' '.join([el if pattern.match(el) else foo(el) for el in string_isolated_dates])
print(wordninja_applied)
Output:
ninja 1/2/2021 ninja 1.2.2021 ninja
Note: I replaced your function wordninja.split() with foo() just because I don't feel like downloading yet another nlp library. But my code demonstrates modifying the original string while preserving the dates.
Finally, I got the following code, based on the comments under my post (thanks for the comments):
import re
import wordninja
sent = "Test12 to separate mergedwords butkeeprest asitis, say 1/2/2021 or 1.2.2021."
sent = re.sub(","," ",sent)
corrected = ' '.join([' '.join(wordninja.split(w)) if w.isalnum() else w for w in sent.split(" ")])
print(corrected)
output: Test 12 to separate merged words but keep rest as it is say 1/2/2021 or 1.2.2021.
It is not the most straightforward solution, but it works.

Python: Trim strings in a column

I have a dataframe column whose leading and trailing parts I would like to trim. The column has contents such as: ['Tim [Boy]', 'Gina [Girl]', ...] and I would like to make a new column that just has ['Boy', 'Girl', ... etc.]. I tried using rstrip and lstrip but have had no luck. Please advise. Thank you
I assume that the cells of the column are 'Tim [Boy]', etc.
Such as in:
name_gender
0 AAa [Boy]
1 BBc [Girl]
You want to use a replace method call passing a regular expression to pandas.
Assuming that your dataframe is called df, the original column name is 'name_gender' and the destination (new column) name is 'gender', you can use the following code:
df['gender'] = df['name_gender'].replace('.*\\[(.*)\\]', '\\1', regex=True)
or, as suggested by @mozway below, this can also be written as:
df['gender'] = df['name_gender'].str.extract('.*\\[(.*)\\]')
You end up with:
name_gender gender
0 AAa [Boy] Boy
1 BBc [Girl] Girl
The regexp '.*\\[(.*)\\]' can be interpreted as matching anything, then a '[', then anything, which is stored in a capture group (that's what the parentheses are there for), and finally a ']'. The whole match is then replaced (second argument) with the contents of capture group 1 (the only group used in the matching regexp).
You might want to document yourself on regexps if you don't know them.
Anything which does not match the entry will not be replaced. You might want to add a test to detect whether some rows don't match that pattern ("name [gender]").
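A small sketch of that check, using the str.extract variant and flagging rows where the pattern did not match (the sample rows here are assumptions):

```python
import pandas as pd

df = pd.DataFrame({'name_gender': ['AAa [Boy]', 'BBc [Girl]', 'no brackets']})
# expand=False returns a Series; rows that don't match come back as NaN
df['gender'] = df['name_gender'].str.extract(r'\[(.*)\]', expand=False)
# rows that did not follow the "name [gender]" pattern
unmatched = df[df['gender'].isna()]
print(unmatched)
```

Checking unmatched before trusting the new column catches malformed rows early.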

Hiding values of column with ****xy using python

I am stuck on a coding problem in Python. I have a CSV file with two columns, Flag | Customer_name, and I am using data frames. If Flag is 0 I want to print the complete name, and if Flag is 1 I want to hide the first n-2 characters of the customer name with "*". For example,
if flag=1,
display *********th (for john smith)
Thanks in advance
You can create the number of '*' needed and then add the last two letters:
name = 'john smith'
name_update = '*' * (len(name)-2) + name[-2:]
print(name_update)
output:
********th
As you used dataframe as tag, I assume that you are working with pandas.DataFrame - in such case you might harness regular expression for that task.
import pandas as pd
df = pd.DataFrame({'name':['john smith']})
df['redacted'] = df['name'].str.replace(r'.(?=..)', '*', regex=True)
print(df)
Output:
name redacted
0 john smith ********th
Explanation: I used here a positive lookahead (a kind of zero-length assertion), and I replace any character with * if and only if at least two characters follow, which is true for all but the last 2 characters.
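To tie this back to the Flag column from the question, masking can be applied conditionally with Series.where; the column names and sample rows below are assumptions:

```python
import pandas as pd

df = pd.DataFrame({'Flag': [0, 1],
                   'Customer_name': ['jane doe', 'john smith']})
# mask all but the last two characters
masked = df['Customer_name'].str.replace(r'.(?=..)', '*', regex=True)
# keep the original name where Flag == 0, otherwise use the masked version
df['display'] = df['Customer_name'].where(df['Flag'].eq(0), masked)
print(df)
```

Series.where keeps values where the condition is True and substitutes from the second argument elsewhere, so no explicit loop over rows is needed.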

How to perform str.strip in dataframe and save it with inplace=true?

I have a dataframe with n columns, and I want to strip the strings in one of those columns. I was able to do it, but I want the change to be reflected in the original dataframe.
Dataframe: data
Name
0 210123278414410005
1 101232784144610006
2 210123278414410007
3 21012-27841-410008
4 210123278414410009
After stripping:
Name
0 10005
1 10006
2 10007
3 10008
4 10009
I tried the below code and strip was successful
data['Name'].str.strip().str[13:]
However if I check dataframe, the strip is not reflected.
I am looking for something like inplace parameter.
String methods (the attributes of the .str attribute on a series) will only ever return a new Series, you can't use these for in-place changes. Your only option is to assign it back to the same column:
data['Name'] = data['Name'].str.strip().str[13:]
You could instead use the Series.replace() method with a regular expression, and inplace=True:
data['Name'].replace(r'(?s)\A\s*(.{,13}).*(?<!\s)\s*\Z', r'\1', regex=True, inplace=True)
The regular expression above matches up to 13 characters after leading whitespace, and ignores trailing whitespace and any other characters beyond the first 13 after whitespace is removed. It produces the same output as .str.strip().str[:13], but makes the changes in place.
The pattern is using a negative look-behind to make sure that the final \s* pattern matches all whitespace elements at the end before selecting between 0 and 13 characters of what remains. The \A and \Z anchors make it so the whole string is matched, and the (?s) at the start switches the . pattern (dot, any character except newlines) to include newlines when matching; this way an input value like ' foo\nbar ' is handled correctly.
Put differently, the \A\s* and (?<!\s)\s*\Z patterns act like str.strip() does, matching all whitespace at the start and end, respectively, and no more. The (.{,13}).* pattern matches everything in between, with the first 13 characters of those (or fewer, if there are not enough characters to match after stripping) captured as a group. That one group is then used as the replacement value.
And because . doesn't normally match \n characters, the (?s) flag at the start tells the regex engine to match newline characters anyway. We want all characters to be included after stripping, not just all except one.
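As a sketch with a couple of assumed sample values (one with an embedded newline, to show why (?s) matters); this variant calls replace on the DataFrame itself, where inplace=True reliably modifies the frame:

```python
import pandas as pd

data = pd.DataFrame({'Name': ['  210123278414410005  ', ' foo\nbar 410006xx ']})
# strip whitespace and keep the first 13 remaining characters, in place
data.replace(r'(?s)\A\s*(.{,13}).*(?<!\s)\s*\Z', r'\1', regex=True, inplace=True)
print(data)
```

The first value becomes '2101232784144' (the first 13 characters after stripping), and thanks to (?s) the newline in the second value is handled rather than breaking the match.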
data['Name'].str.strip().str[13:] returns you the new transformed column, but it is not changing in-place data (inside the dataframe). You should write:
data['Name'] = data['Name'].str.strip().str[13:]
to write the transformed data to the Name column.
I agree with the other answers that there's no inplace parameter for the strip function, as seen in the documentation for str.strip.
To add to that: I've found the str functions for pandas Series are usually used when selecting specific rows, like df[df['Name'].str.contains('69')]. I'd say this is a possible reason that it doesn't have an inplace parameter: it's not meant to be completely "stand-alone" like rename or drop.
Also to add! I think a more pythonic solution is to use negative indices instead:
data['Name'] = data['Name'].str.strip().str[-5:]
This way, we don't have to assume that there are 18 characters, and/or we'll consistently get "last 5 characters" instead!
As per yatu's comment: you should reassign the Series with stripped values to the original column.
data['Name'] = data['Name'].str.strip().str[13:]
It is interesting to note that pandas DataFrames are backed by numpy arrays beneath, so there is also the idea of applying elementwise operations in numpy.
Here is the example I had in mind:
import numpy as np
import pandas as pd
df = pd.DataFrame([['210123278414410005', '101232784144610006']])
dfn = df.to_numpy(copy=False)  # numpy array
df = pd.DataFrame(np.frompyfunc(lambda s: s[13:], 1, 1)(dfn))
print(df)  # 10005  10006
This doesn't answer your question directly, but it is just another option (it creates a new dataframe from the numpy array, though).

python regex boolean statement not working

My problem is that this simple regex statement with a boolean operator only gives me the result I want when the first item on the left side of the bitwise operator | is present in the sentence. Could someone tell me why it isn't working on the alternative as well?
import re
b = 'this is a good day to die hard'
jeff = re.search('good night (.+)hard|good day (.+)hard', b)
print(jeff.group(1))
You have two sets of capturing parentheses - therefore you have two numbered capturing groups. If the second branch matches, the group(1) will be set to None, and group(2) will contain that which was matched by the second group.
There are several ways to fix this. One would be to write so that there is just one group, for example
jeff = re.search('good (?:day|night) (.+)hard', b)
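Another fix keeps both branches as written and simply picks whichever group matched:

```python
import re

b = 'this is a good day to die hard'
jeff = re.search('good night (.+)hard|good day (.+)hard', b)
# exactly one of the two groups is set; the other is None
result = jeff.group(1) or jeff.group(2)
print(result)  # 'to die '
```

This relies on the fact that the unmatched branch's group is None, which is falsy, so the or expression falls through to the group that did match.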
You may also write a regex that captures day or night in a first group, with a second (...) creating a second capturing group that fetches everything up to the end; that second group is then accessed with .group(2).
import re
b = 'this is a good day to die hard'
jeff = re.search('good (day|night) (.+)', b)
if jeff:
    print(jeff.group(1))
    print(jeff.group(2))
Output of the demo:
day
to die hard
