Hiding values of column with ****xy using python - python

I am stuck on a coding problem in Python. I have a CSV file with two columns, Flag | Customer_name, which I load into a data frame. If Flag is 0 I want to print the complete name, and if Flag is 1 I want to hide the first n-2 characters of the customer name with "*". For example,
if flag=1 then,
display ********th (for john smith)
Thanks in advance

You can create the number of '*' needed and then add the last two letters:
name = 'john smith'
name_update = '*' * (len(name)-2) + name[-2:]
print(name_update)
output:
********th
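To tie this back to the Flag column in the question, the same slicing can be wrapped in a small function and applied row by row. The following is only a sketch using the standard csv module rather than pandas; the helper name mask_name and the inline sample data are made up for illustration, and it assumes the flag is read from the CSV as the string "0" or "1".

```python
import csv
import io

# Hypothetical sample data standing in for the real CSV file.
sample = io.StringIO("Flag,Customer_name\n0,jane doe\n1,john smith\n")

def mask_name(name, flag):
    """Return name unchanged for flag '0'; otherwise hide all but the last two characters."""
    if flag == "0":
        return name
    visible = name[-2:]
    return "*" * (len(name) - len(visible)) + visible

rows = [mask_name(row["Customer_name"], row["Flag"]) for row in csv.DictReader(sample)]
print(rows)
```

With pandas, a function like this could be passed to DataFrame.apply with axis=1, though that is not tested here.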

Since you tagged the question dataframe, I assume that you are working with pandas.DataFrame. In that case you could use a regular expression for the task.
import pandas as pd
df = pd.DataFrame({'name':['john smith']})
df['redacted'] = df['name'].str.replace(r'.(?=..)', '*', regex=True)  # regex=True is required in pandas >= 2.0
print(df)
Output:
         name    redacted
0  john smith  ********th
Explanation: this uses a positive lookahead (a kind of zero-length assertion). Any character is replaced with * if and only if two more characters follow it, which is true for all but the last 2 characters.
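The lookahead can also be tried outside pandas with the standard re module. This is a small sketch, not part of the original answer; the function name redact is hypothetical.

```python
import re

def redact(name):
    # Replace each character that still has at least two characters after it.
    return re.sub(r'.(?=..)', '*', name)

print(redact('john smith'))
print(redact('ab'))
```

Names of two characters or fewer come back unchanged, because no character in them has two characters following it.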

Related

Python: Trim strings in a column

I have a dataframe column from which I would like to trim the leading and trailing parts. The column has contents such as ['Tim [Boy]', 'Gina [Girl]', ...] and I would like to make a new column that just has ['Boy', 'Girl', ...]. I tried using rstrip and lstrip but have had no luck. Please advise. Thank you
I assume that the cells of the column are 'Tim [Boy]', etc.
Such as in:
   name_gender
0    AAa [Boy]
1   BBc [Girl]
You want to use a replace method call passing a regular expression to pandas.
Assuming that your dataframe is called df, the original column name is 'name_gender' and the destination (new column) name is 'gender', you can use the following code:
df['gender'] = df['name_gender'].replace('.*\\[(.*)\\]', '\\1', regex=True)
or, as suggested by @mozway below, this can also be written as:
df['gender'] = df['name_gender'].str.extract('.*\\[(.*)\\]')
You end up with:
   name_gender  gender
0    AAa [Boy]     Boy
1   BBc [Girl]    Girl
The regexp '.*\\[(.*)\\]' matches anything, then a '[', then anything (stored in a capture group, which is what the parentheses are for), then a ']'. The match is then replaced by the second argument, '\\1', which refers to capture group 1 (the only group in the pattern).
You might want to read up on regular expressions if you are not familiar with them.
Any value that does not match the pattern is left unchanged. You might want to add a test to detect whether some rows don't match that pattern ("name [gender]").
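One way such a test could look, sketched with the standard re module on a plain list (the sample values are invented for illustration). Note that fullmatch is stricter than the replace call above, since it requires the whole value to fit the pattern.

```python
import re

pattern = re.compile(r'.*\[(.*)\]')

# Hypothetical sample values; the last one lacks the "[gender]" part.
values = ['AAa [Boy]', 'BBc [Girl]', 'CCd']

# Collect every value that does not fit "name [gender]" from start to end.
unmatched = [v for v in values if pattern.fullmatch(v) is None]
print(unmatched)
```

On a pandas column, Series.str.fullmatch should give a comparable boolean mask, assuming a pandas version that provides it.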

Remove all words in a string that contain any given substrings using python

I have a .csv file that has a column containing text. Each item in this column holds a gene name and a date (for example 'CYP2C19, CYP2D6 07/17/2020'). I want to remove the dates from all values in this column so that only the gene names remain (output: 'CYP2C19, CYP2D6'). Secondly, some cells contain both a gene name and a note that there is no recommendation ('CYP2C9 08/19/2020 (no recommendation'). In these cases, I would like to remove both the date and the no-recommendation note (output: 'CYP2C9').
I have tried using the code below to remove any text that contains slashes for a single string (I have not yet tried anything on the entire .csv file). However it left the 07 from the date unfortunately.
import re
pattern = re.compile('/(?:[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+')
s = 'CYP2C19, CYP2D6 07/17/2020'
pattern.sub('', s)
Output: 'CYP2C19, CYP2D6 07'
One method is to just take the date out of the string, and then split it as you please. Note this works for any number of dates:
import re
x = 'CYP2C19, CYP2D6 07/17/2020'
x = re.sub(r'\s*\d{2}/\d{2}/\d{4}', "", x)
You could replace \s* with \s if you always know there will be only a single space separating the term you want and the date, but I don't see a reason to do that.
Note that you could now split this by the delimiter, which in the case of your question, is actually a comma followed by a space
result = x.split(", ")
# ['CYP2C19', 'CYP2D6']
Although in your csv you may find that it is just a comma (as CSVs normally are).
Combining the steps above:
import re
x = 'CYP2C19 08/15/1972, CYP2D6 07/17/2020'
x = re.sub(r'\s*\d{2}/\d{2}/\d{4}', "", x).split(", ")
# ['CYP2C19', 'CYP2D6']
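The second part of the question (the "no recommendation" note) is not covered above. One possible extension, offered as a sketch only: let the pattern optionally swallow the note after the date. The exact wording of the note and the helper name strip_dates are assumptions based on the single example in the question.

```python
import re

# A date, optionally followed by a "(no recommendation" note, with or without ")".
pattern = re.compile(r'\s*\d{2}/\d{2}/\d{4}(?:\s*\(no recommendation\)?)?')

def strip_dates(text):
    return pattern.sub('', text)

print(strip_dates('CYP2C19, CYP2D6 07/17/2020'))
print(strip_dates('CYP2C9 08/19/2020 (no recommendation'))
```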
I think you could take each value in the column and split it.
For example, take the following string: column = ' CYP2D6 07/17/2020'
You could do m = column.split(), which gives the list m = ['CYP2D6', '07/17/2020']
After that you can simply take gene = m[0]

Python extract substring from between parenthesis

I have a string, that is formatted like that:
"Name Surname (ID), Name2 Surname2 (ID2)"
Each ID starts with a letter followed by a few digits. The string can contain a varying number of people (there may be only one person, two as in the example, or more). People can also have several names or surnames, so the format is not consistent.
I want to extract a substring made of the IDs separated by commas, so for this example it would look like this:
"ID, ID2"
So far I have tried this approach:
import re
string = "Bob Rob Smith (L1234567), John Doe (k12345678)"
result = re.findall(r'[a-zA-Z][0-9]+', string)
','.join(result)
And it works perfectly fine, but I wonder if there's simpler approach that doesn't require any additional modules. Do you guys have any ideas?
I also think using re is a good approach. If you must avoid re at any price, you might do:
s = "Bob Rob Smith (L1234567), John Doe (k12345678)"
result = s.replace(')','(').split('(')[1::2]
print(result)
Output:
['L1234567', 'k12345678']
Explanation: I want to split at ( and ), but the .split method of str accepts only one delimiter, so I first replace ) with (, then split and take the odd-indexed elements. This method works if ( and ) are used solely around IDs, s does not start with ( or ), and there is at least one character between any two brackets.
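You could split on ), and take the last 8 characters of each element in the resulting list (this assumes every ID is exactly 8 characters long, which is not true of k12345678 in the example), but regex is the correct approach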
[s[-8:] for s in mystring[:-1].split('),')]
To me, the regex approach seems best.
Assuming that you do not know exactly how many digits your IDs have (quote: followed by a few digits), you could loop through the whole string and collect what is inside the parentheses:
s = "Bob Rob Smith (L1234567), John Doe (k12345678)"
res = []
word = ''
inside = False  # True while we are between '(' and ')'
for x in s:
    if x == '(':
        inside = True
        continue
    if x == ')':
        inside = False
        res.append(word)
        word = ''
    if inside:
        word += x
print(res)
OUTPUT:
['L1234567', 'k12345678']
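For comparison, the regex route can also key on the parentheses rather than on the letter-plus-digits shape of the ID. A minimal sketch, assuming parentheses appear only around IDs:

```python
import re

s = "Bob Rob Smith (L1234567), John Doe (k12345678)"

# Capture every run of non-')' characters enclosed in parentheses.
ids = re.findall(r'\(([^)]*)\)', s)
joined = ', '.join(ids)
print(joined)
```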

Pandas extracting text multiple times with same criteria

I have a DataFrame and in one cell I have a long text, e.g.:
-student- Kathrin A -/student- received abc and -student- Mike B -/student-
received def.
My question is: how can I extract the text between -student- and -/student- and create two new columns, with "Kathrin A" in the first one and "Mike B" in the second one? That is, the pattern occurs twice or more in the text.
What I have tried so far: str.extract('-student-\s*([^.]*)\s*-/student-', expand=False), but this only extracts the first match, i.e. Kathrin A.
Many thanks!
You could use str.split with a regex and define your delimiters as follows:
splittxt = ['-student-','-/student-']
df.text.str.split('|'.join(splittxt), expand=True)
Output:
   0          1                 2       3              4
0     Kathrin A  received abc and  Mike B  received def.
Another approach would be to try extractall. The only caveat is the result is put into multiple rows instead of multiple columns. With some rearranging this should not be an issue, and please update this response if you end up working it out.
That being said I also have a slight modification to your regular expression which will help you with capturing both.
'(?<=-student-)(?:\s*)([\w\s]+)(?= -/student-)'
The only capturing group is [\w\s]+ so you'll be sure to not end up capturing the whole string.
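The per-cell logic of collecting every match can be sketched with re.findall from the standard library. This uses a simpler lazy pattern swapped in for the lookbehind version, and the column names student_1, student_2 are hypothetical.

```python
import re

text = ("-student- Kathrin A -/student- received abc and "
        "-student- Mike B -/student- received def.")

# Lazy group between the opening and closing markers; surrounding spaces are dropped.
names = re.findall(r'-student-\s*(.*?)\s*-/student-', text)

# Map each match to a made-up column name: student_1, student_2, ...
columns = {f'student_{i}': name for i, name in enumerate(names, start=1)}
print(columns)
```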

Python Regex: how to not select whitespace before last string?

I am (a newbie,) struggling with separating a database in columns with regex.findall().
I want to separate these Dutch street names into name and number.
Roemer Visscherstraat 15
Vondelstraat 102-huis
For the number I use
\S*$
Which works just fine. For the street name I use
^\S.+[^\S$]
Or: use everything but the last element, which may be a number or a combination of a number and something else.
Problem is: Python then also keeps the last whitespace after the last name, so I get:
'Roemer Visscherstraat '
Any way I can stop this from happening?
Also, findall returns a list consisting of the bit of the database I wanted and an empty string. How does this happen, and can I prevent it somehow?
Thanks so much in advance for your help.
You can rstrip() the name to remove any spaces at the end of it:
>>> 'Roemer Visscherstraat '.rstrip()
'Roemer Visscherstraat'
But if the input is similar to the one you posted, you can simply use split() instead of regex, for example:
st = 'Roemer Visscherstraat 15'
data = st.split()
num = data[-1]               # last token is the house number
name = ' '.join(data[:-1])   # everything before it is the street name
print('Name: {}, Number: {}'.format(name, num))
output:
Name: Roemer Visscherstraat, Number: 15
For the number you should use the following:
\S+$
Using a + instead of a * will ensure that you have at least one character in the match.
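This is also where the extra empty string in the findall result comes from: after the real match, \S*$ can match once more with zero characters at the very end of the string. A short sketch to compare the two patterns (variable names are illustrative):

```python
import re

st = 'Roemer Visscherstraat 15'

with_star = re.findall(r'\S*$', st)   # also allows an empty match at the very end
with_plus = re.findall(r'\S+$', st)   # requires at least one character
print(with_star)
print(with_plus)
```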
For the street name you can use the following:
^.+(?=\s\S+$)
This selects the text up to, but not including, the whitespace before the number.
However, what you may consider doing is using one regex match with capture groups instead. The following would work:
^(.+(?=\s\S+$))\s(\S+$)
In this case, the first capture group gives you the street name, and the second gives you the number.
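In Python, the combined pattern could be used roughly as follows. This is a sketch on the two street names from the question; it assumes every input contains at least one whitespace-separated trailing token, since otherwise match returns None.

```python
import re

pattern = re.compile(r'^(.+(?=\s\S+$))\s(\S+$)')

streets = ['Roemer Visscherstraat 15', 'Vondelstraat 102-huis']

# groups() returns (street name, number) for each matching string.
results = [pattern.match(st).groups() for st in streets]
for name, number in results:
    print('Name: {}, Number: {}'.format(name, number))
```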
([^\d]*)\s+(\d.*)
In this regex the first group captures everything before a space and a number, and the second group gives the desired number.
My assumption is that the number begins with a digit and the name does not contain a digit.
Take a look at https://regex101.com/r/eW0UP2/1
Roemer Visscherstraat 15
Full match 0-24 `Roemer Visscherstraat 15`
Group 1. 0-21 `Roemer Visscherstraat`
Group 2. 22-24 `15`
Vondelstraat 102-huis
Full match 24-46 `Vondelstraat 102-huis`
Group 1. 24-37 `Vondelstraat`
Group 2. 38-46 `102-huis`
