Searching for a matching string pattern in a DataFrame column in Python pandas

I have a DataFrame like the one below:
name genre
satya |ACTION|DRAMA|IC|
satya |COMEDY|BIOPIC|SOCIAL|
abc |CLASSICAL|
xyz |ROMANCE|ACTION|DARMA|
def |DISCOVERY|SPORT|COMEDY|IC|
ghj |IC|
Now I want to query the DataFrame so that I can get rows 1, 5 and 6, i.e. I want to find |IC| either alone or in combination with any other genres.
Up to now I am able to do either an exact search using
df[df['genre'] == '|ACTION|DRAMA|IC|']  # exact value, yields row 1
or a substring search by
df[df['genre'].str.contains('IC')]  # yields rows 1, 2, 3, 5, 6
# because BIOPIC has IC in it, and the same goes for CLASSICAL
But I don't want either of these.
#df[df['genre'].str.contains('|IC|')]  # row 6
# This is not satisfying my need either, as I am missing rows 1 and 5
So my requirement is to find genres having |IC| in them. (My string search fails because the regex engine treats '|' as an OR operator.)
Could somebody suggest a regex or any other method to do that? Thanks in advance.

I think you can add \ to the regex for escaping, because | without \ is interpreted as OR. From the Python re documentation for '|':
A|B, where A and B can be arbitrary REs, creates a regular expression that will match either A or B. An arbitrary number of REs can be separated by the '|' in this way. This can be used inside groups (see below) as well. As the target string is scanned, REs separated by '|' are tried from left to right. When one pattern completely matches, that branch is accepted. This means that once A matches, B will not be tested further, even if it would produce a longer overall match. In other words, the '|' operator is never greedy. To match a literal '|', use \|, or enclose it inside a character class, as in [|].
print(df['genre'].str.contains(r'\|IC\|'))
0 True
1 False
2 False
3 False
4 True
5 True
Name: genre, dtype: bool
print(df[df['genre'].str.contains(r'\|IC\|')])
name genre
0 satya |ACTION|DRAMA|IC|
4 def |DISCOVERY|SPORT|COMEDY|IC|
5 ghj |IC|

Maybe this construction:
df[df['columnName'].str.contains(re.compile('regex_pattern'))]
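If you'd rather avoid regex escaping altogether, str.contains also accepts regex=False, which treats the pattern as a plain substring. A minimal runnable sketch using the sample data from the question:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["satya", "satya", "abc", "xyz", "def", "ghj"],
    "genre": ["|ACTION|DRAMA|IC|", "|COMEDY|BIOPIC|SOCIAL|", "|CLASSICAL|",
              "|ROMANCE|ACTION|DARMA|", "|DISCOVERY|SPORT|COMEDY|IC|", "|IC|"],
})

# regex=False treats '|IC|' as a literal substring, so no escaping is needed
mask = df["genre"].str.contains("|IC|", regex=False)
print(df[mask])  # rows 0, 4 and 5
```

This sidesteps the escaping question entirely; the escaped-regex form is only needed when you want pattern features beyond a literal match.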

Related

Multiple character replacement in a column

I am trying to replace several different characters with a single character. I can do it with multiple lines of code, but I was wondering if there is something like this to do it in a single line:
df['Column'].str.replace(['_','-','/'], ' ')  # does not work: str.replace does not accept a list
I could write 3 lines of plain str.replace() and change those strings one by one, but I don't think that would be efficient.
Pandas' str.replace takes a regex pattern or a string as its first argument, so you can provide a regex to change multiple patterns at once.
code:
import pandas as pd
check_df = pd.DataFrame({"Column": ["abc", "A_bC", "A_b-C/d"]})
check_df['Column'].str.replace("_|-|/", " ", regex=True)  # regex=True is required in pandas >= 2.0
Output:
0 abc
1 A bC
2 A b C d
Name: Column, dtype: object
you can use a regular expression with an alternating group:
df['Column'].str.replace(r"_|-|/", " ", regex=True)
| means "either of these".
or you can use str.maketrans to make a translation table and use .str.translate:
df['Column'].str.translate(str.maketrans(dict.fromkeys("_-/", " ")))
Note that this only works for single-character translations.
If the characters are produced dynamically, e.g. collected in a list, build the pattern with "|".join(re.escape(c) for c in chars) for the regex approach and "".join(chars) for the translate approach. Escaping each character individually (rather than escaping the joined string) keeps the '|' separators functional as alternation, and re.escape takes care of regex metacharacters: e.g. if "$" is among the characters to replace, the pattern needs "\$", since a bare "$" is the end-of-string anchor.
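A runnable sketch of the dynamic case (the chars list and the sample Series here are made up for illustration):

```python
import re
import pandas as pd

chars = ["_", "-", "/", "$"]  # hypothetical characters to replace; '$' is a regex metacharacter
s = pd.Series(["a_b-c/d$e"])

# Regex approach: escape each character individually, then join with '|'
pattern = "|".join(re.escape(c) for c in chars)
print(s.str.replace(pattern, " ", regex=True))

# Translate approach: plain concatenation is enough, no escaping involved
table = str.maketrans(dict.fromkeys("".join(chars), " "))
print(s.str.translate(table))
```

Both print a Series containing "a b c d e".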
You could use a character class [/_-] listing the characters that you want to replace.
Note that if you have multiple consecutive characters and you replace them with a space, you will get space gaps. If you don't want that, you can repeat the character class with a + to match 1 or more characters and replace that match with a single space.
If you don't want the leading and trailing spaces, you can use .str.strip()
Example
import pandas as pd
df = pd.DataFrame({"Column": [" a//b_c__-d", "a//////b "]})
df['Column'] = df['Column'].str.replace(r"[/_-]", ' ', regex=True)
print(df)
print("\n---------v2---------\n")
df_v2 = pd.DataFrame({"Column": [" a//b_c__-d", "a//////b "]})
df_v2['Column'] = df_v2['Column'].str.replace(r"[/_-]+", ' ', regex=True).str.strip()
print(df_v2)
Output
        Column
0   a  b c   d
1     a      b 

---------v2---------

    Column
0  a b c d
1      a b

Python: string not splitting correctly at "|||" substring

I have a column in Pandas DataFrame that stores long strings, in which different chunks of information are separated by a "|||".
This is an example:
"intermediation|"mechanical turk"|precarious "public policy" ||| intermediation|"mechanical turk"|precarious high-level
I need to split this column into multiple columns, each column containing the string between the separators "|||".
However, while running the following code:
df['query_ids'].str.split('|||', n=5, expand=True)
What I get, however, is a split at every single character, like this:
   0  1  2  3  4  5
0  "  r  e  g  ulatory capture"|"political lobbying" policy-m...
I suspect it's because "|" is a regex operator, but I cannot think of a suitable workaround.
You need to escape the | characters:
df['query_ids'].str.split(r'\|\|\|', n=5, expand=True)
or pass regex=False (pandas >= 1.4):
df['query_ids'].str.split('|||', n=5, expand=True, regex=False)
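A self-contained sketch of both options (the single-row frame here is made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"query_ids": ["a|||b|||c"]})

# Escaped form: each '|' is matched literally instead of acting as alternation
print(df["query_ids"].str.split(r"\|\|\|", n=5, expand=True))

# Literal form (pandas >= 1.4): skip regex interpretation entirely
print(df["query_ids"].str.split("|||", n=5, expand=True, regex=False))
```

Both print a single-row frame with the columns a, b, c.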

Python: Trim strings in a column

I have a DataFrame column whose leading and trailing parts I would like to trim. The column has contents such as ['Tim [Boy]', 'Gina [Girl]', ...] and I would like to make a new column that has just ['Boy', 'Girl', ...]. I tried using rstrip and lstrip but have had no luck. Please advise. Thank you.
I assume that the cells of the column are 'Tim [Boy]', etc.
Such as in:
name_gender
0 AAa [Boy]
1 BBc [Girl]
You want to use a replace method call, passing a regular expression, on the pandas column.
Assuming that your dataframe is called df, the original column name is 'name_gender' and the destination (new column) name is 'gender', you can use the following code:
df['gender'] = df['name_gender'].replace(r'.*\[(.*)\]', r'\1', regex=True)
or, as suggested by @mozway below, this can also be written as:
df['gender'] = df['name_gender'].str.extract(r'.*\[(.*)\]')
You end up with:
name_gender gender
0 AAa [Boy] Boy
1 BBc [Girl] Girl
The regexp r'.*\[(.*)\]' matches anything, then a literal '[', then anything, which is captured in a group (that is what the parentheses are for), then a literal ']'. The match is then replaced by the contents of group 1 (the only group used in the pattern).
You might want to read up on regular expressions if you don't know them.
Any entry that does not match the pattern is left unreplaced. You might want to add a test to detect whether some rows don't match the "name [gender]" pattern.
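Putting the extract variant together as a runnable sketch (expand=False makes str.extract return a Series, which assigns cleanly to a new column):

```python
import pandas as pd

df = pd.DataFrame({"name_gender": ["AAa [Boy]", "BBc [Girl]"]})

# Capture whatever sits between the square brackets into a new column
df["gender"] = df["name_gender"].str.extract(r"\[(.*)\]", expand=False)
print(df)
```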

Hiding values of a column with ****xy using Python

I am stuck on a coding problem in Python. I have a CSV file with two columns, Flag | Customer_name, and I am using dataframes. If Flag is 0 I want to print the complete name, and if Flag is 1 I want to hide the first n-2 characters of the customer name with "*". For example,
if Flag = 1,
display ********th (for john smith)
Thanks in advance
You can create the number of '*' needed and then add the last two letters:
name = 'john smith'
name_update = '*' * (len(name)-2) + name[-2:]
print(name_update)
output:
********th
As you used dataframe as a tag, I assume that you are working with a pandas.DataFrame; in that case you can harness a regular expression for the task.
import pandas as pd
df = pd.DataFrame({'name': ['john smith']})
df['redacted'] = df['name'].str.replace(r'.(?=..)', '*', regex=True)
print(df)
Output:
name redacted
0 john smith ********th
Explanation: I used a positive lookahead (a kind of zero-length assertion) and replace any character with * if and only if at least two characters follow, which is true for all but the last 2 characters.
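To tie this back to the Flag column from the question, the masked and unmasked variants can be combined with Series.where; a sketch under the assumption that the columns are named Flag and Customer_name:

```python
import pandas as pd

df = pd.DataFrame({
    "Flag": [0, 1],
    "Customer_name": ["john smith", "john smith"],
})

# Mask every character that has at least two characters after it
masked = df["Customer_name"].str.replace(r".(?=..)", "*", regex=True)

# Keep the masked version where Flag == 1, the original name elsewhere
df["display"] = masked.where(df["Flag"].eq(1), df["Customer_name"])
print(df)
```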

Pandas extracting text multiple times with same criteria

I have a DataFrame and in one cell I have a long text, e.g.:
-student- Kathrin A -/student- received abc and -student- Mike B -/student-
received def.
My question is: how can I extract the text between -student- and -/student- and create two new columns, with "Kathrin A" in the first one and "Mike B" in the second one? The pattern can occur twice or more in the text.
What I have tried so far: str.extract(r'-student-\s*([^.]*)\s*-/student-', expand=False), but this only extracts the first match, i.e. Kathrin A.
Many thanks!
You could use str.split, joining your delimiters with '|' into a regex as follows:
splittxt = ['-student-', '-/student-']
df.text.str.split('|'.join(splittxt), expand=True)
Output:
  0           1                   2        3               4
0    Kathrin A   received abc and   Mike B   received def.
Another approach would be extractall. The only caveat is that the result is put into multiple rows instead of multiple columns; with some rearranging this should not be an issue.
That being said, here is a slight modification to your regular expression which will help you capture both:
r'(?<=-student-)\s*([\w\s]+)(?= -/student-)'
The only capturing group is [\w\s]+, so you can be sure not to capture the whole string.
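The rearranging mentioned above can be done with unstack, which pivots the per-match rows produced by extractall into columns; a sketch using a non-greedy capture between the markers:

```python
import pandas as pd

df = pd.DataFrame({"text": [
    "-student- Kathrin A -/student- received abc and -student- Mike B -/student- received def."
]})

# extractall yields one row per match, indexed by (row, match number);
# unstack pivots the match number into columns
matches = df["text"].str.extractall(r"-student-\s*(.*?)\s*-/student-")[0].unstack()
print(matches)
```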
