I have a column in a Pandas DataFrame that stores long strings in which different chunks of information are separated by "|||".
This is an example:
"intermediation|"mechanical turk"|precarious "public policy" ||| intermediation|"mechanical turk"|precarious high-level
I need to split this column into multiple columns, each column containing the string between the separators "|||".
However, when I run the following code:
df['query_ids'].str.split('|||', n=5, expand = True)
what I get are splits at every single character, like this:
0 1 2 3 4 5
0 " r e g ulatory capture"|"political lobbying" policy-m...
I suspect it's because "|" is a special character in regular expressions, but I cannot think of a suitable workaround.
You need to escape |:
df['query_ids'].str.split(r'\|\|\|', n=5, expand=True)
or pass regex=False:
df['query_ids'].str.split('|||', n=5, expand=True, regex=False)
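Both approaches in one runnable sketch, on an assumed one-row sample (the regex=False route needs pandas 1.4 or later):

```python
import pandas as pd

# Assumed sample data with the "|||" separator
df = pd.DataFrame({"query_ids": ["intermediation|||mechanical turk|||precarious"]})

# Option 1: escape each pipe so the regex engine treats it literally
out_regex = df['query_ids'].str.split(r'\|\|\|', n=5, expand=True)

# Option 2: skip regex matching entirely (pandas >= 1.4)
out_plain = df['query_ids'].str.split('|||', n=5, expand=True, regex=False)

print(out_regex)
```

Either way the result is one column per "|||"-delimited chunk.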
Related
I am trying to replace several string characters with a single character. I can do it with multiple lines of code, but I was wondering if there is something like this to do it in a single line:
df['Column'].str.replace(['_','-','/'], ' ')
I can write 3 lines of code using normal str.replace() and change those strings one by one, but I don't think that would be efficient.
Pandas' str.replace takes a regex pattern or a string as its first argument, so you can provide a regex to change multiple patterns at once.
code:
import pandas as pd
check_df = pd.DataFrame({"Column":["abc", "A_bC", "A_b-C/d"]})
check_df['Column'].str.replace("_|-|/", " ", regex=True)
Output:
0 abc
1 A bC
2 A b C d
Name: Column, dtype: object
You can use a regular expression with an alternation group:
df['Column'].str.replace(r"_|-|/", " ", regex=True)
| means "either of these".
or you can use str.maketrans to make a translation table and use .str.translate:
df['Column'].str.translate(str.maketrans(dict.fromkeys("_-/", " ")))
Note that translation tables only work for single-character replacements.
If the characters are produced dynamically, e.g. collected in a list, then "|".join(re.escape(c) for c in chars) can be used for the first approach, and "".join(chars) for the second. The re.escape call handles special characters: if "$" is to be replaced, since it is the end-of-string anchor in regexes, it must be written as "\$", which re.escape takes care of.
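A small sketch with an assumed chars list, escaping each character individually so regex metacharacters like "$" are handled:

```python
import re
import pandas as pd

s = pd.Series(["a_b-c", "x$y"])
chars = ["_", "-", "$"]   # "$" is a regex metacharacter

# Regex route: escape each character, then join with "|"
pattern = "|".join(re.escape(c) for c in chars)
out_regex = s.str.replace(pattern, " ", regex=True)

# Translation-table route: no escaping needed
out_trans = s.str.translate(str.maketrans(dict.fromkeys("".join(chars), " ")))

print(out_regex.tolist())
```

Both routes produce the same result on this sample.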
You could use a character class [/_-] listing the characters that you want to replace.
Note that if you have multiple consecutive separator characters and replace each with a space, you will get runs of spaces. If you don't want that, repeat the character class with a + to match one or more characters and replace the whole match with a single space.
If you don't want the leading and trailing spaces, you can use .str.strip()
Example
import pandas as pd
df = pd.DataFrame({"Column":[" a//b_c__-d", "a//////b "]})
df['Column'] = df['Column'].str.replace(r"[/_-]", ' ', regex=True)
print(df)
print("\n---------v2---------\n")
df_v2 = pd.DataFrame({"Column":[" a//b_c__-d", "a//////b "]})
df_v2['Column'] = df_v2['Column'].str.replace(r"[/_-]+", ' ', regex=True).str.strip()
print(df_v2)
Output
Column
0 a b c d
1 a b
---------v2---------
Column
0 a b c d
1 a b
My source file, which contains 5000 rows, looks like this:
1.4.1 This is my text and is not limited to three words 3 ALL ALL
1.4.2 This is second sentence 1 ALL ALL
1.4.3 An another sentence that I just made up 2 ALL ALL
I want to use search and replace (or any other method) to produce the output below:
"1.4.1", "This is my text and is not limited to three words", "3", "ALL", "ALL"
"1.4.2","This is second sentence","1","ALL","ALL"
"1.4.3", "An another sentence that I just made up","2","ALL", "ALL"
The sentence that I would like in the 2nd column is of varying length but is always between two numbers — 1.4.1 and 3, for example. This is the complicated part that I am trying to figure out.
Edit:
The last two columns are optional, may or may not appear on all lines
"1.4.1", "This is my text and is not limited to three words", "3"
This can actually be fairly simple, assuming every column except the 2nd never contains a space. You can split the entire string on whitespace and pull out the necessary parts as below:
[ 1.4.1, This, is, my, text, and, is, not, limited, to, three, words, 3, ALL, ALL ]
  +---+  +--------------------------------------------------------+  +  +-+  +-+
   1st                           2nd                                 3rd 4th 5th
Once the full line is split like above, you can access the proper elements for the one-word/number columns and join the elements belonging to the 2nd column. Below is a simple Python function that should accomplish what you are looking to do:
def parse_record(line):
    parts = line.split()
    col_1 = parts[0]
    col_2 = " ".join(parts[1:-3])
    col_3 = parts[-3]
    col_4 = parts[-2]
    col_5 = parts[-1]
    return col_1, col_2, col_3, col_4, col_5
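A minimal sketch applying the function to the sample lines (the function is repeated here so the snippet runs standalone; the 2nd column joins parts[1:-3], skipping the leading id and the three trailing fields):

```python
import pandas as pd

def parse_record(line):
    parts = line.split()
    # parts[1:-3] skips the leading id and the three trailing fields
    return parts[0], " ".join(parts[1:-3]), parts[-3], parts[-2], parts[-1]

lines = [
    "1.4.1 This is my text and is not limited to three words 3 ALL ALL",
    "1.4.2 This is second sentence 1 ALL ALL",
]
df = pd.DataFrame([parse_record(l) for l in lines])
print(df)
```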
If you have control over the file format, this becomes much simpler. If you can quote the 2nd column, the file is technically already a CSV: most CSV parsers let you specify the delimiter between values, in this case a space. For example, the same file with the 2nd column quoted:
1.4.1 "This is my text and is not limited to three words" 3 ALL ALL
1.4.2 "This is second sentence" 1 ALL ALL
1.4.3 "An another sentence that I just made up" 2 ALL ALL
Since the 2nd column's values are wrapped in quotes, they are parsed as a single value rather than many separate ones, so you can simply use a space as the CSV delimiter rather than the default comma.
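A short sketch with Python's standard csv module, reading the quoted variant of the sample data with a space delimiter:

```python
import csv
import io

data = '''1.4.1 "This is my text and is not limited to three words" 3 ALL ALL
1.4.2 "This is second sentence" 1 ALL ALL'''

# With the 2nd column quoted, a plain CSV reader with a space delimiter
# parses the quoted sentence as a single field
rows = list(csv.reader(io.StringIO(data), delimiter=' '))
print(rows[0])
```

The same idea works with pd.read_csv(..., sep=' ') if you want the result as a DataFrame.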
I have a dataset that contains a column of strings. It looks like
df.a=[['samsung/windows','mobile unknown','chrome/android']].
I am trying to obtain the first word of each row to replace the current string, e.g. [['samsung','mobile','chrome']].
I applied:
df.a=df.a.str.split().str.get(0)
this gives me the first whitespace-separated token, but entries like 'samsung/windows' keep the "/" part
df.a=[words.split("/")[0] for words in df.a]
this only splits the strings that contain "/"
Can I get the expected result using one line?
Use re.findall() and keep only the first alphanumeric run:
import re
df['a'] = df['a'].apply(lambda x : re.findall(r"[\w']+",x)[0])
You can pass regex syntax directly to the split function to split on / or ' ' using the pipe character |, but this solution only works if those are the only delimiters in your data:
dfa=pd.Series(['samsung/windows','mobile unknown','chrome/android'])
dfa.str.split(r'/| ')
0 [samsung, windows]
1 [mobile, unknown]
2 [chrome, android]
The pandas function extract does exactly what you want:
Extract capture groups in the regex pat as columns in a DataFrame
df['a'].str.extract(r"(\w+)", expand=True)
# 0
# 0 samsung
# 1 mobile
# 2 chrome
I am having trouble removing a character at the beginning of strings in pandas dataframe cells. Looking at the cells, it seems like there is a space at the start of the string; however, printing one of those cells outputs "\x95 12345", so it is not a normal space character but rather "\x95".
I already tried strip(), but it didn't help.
The dataframe was produced using a str.split(pat=',').tolist() expression, which split the strings into separate cells on ','; that is when my strings picked up this character.
Assuming col1 is your first column name:
import re
df.col1 = df.col1.apply(lambda x: re.sub(r'\x95',"",x))
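The same cleanup can be done without apply, using the vectorized .str.replace with regex=False plus .str.strip(); a sketch on assumed sample values:

```python
import pandas as pd

# Assumed sample values starting with the \x95 character
df = pd.DataFrame({"col1": ["\x9512345", "\x95 678"]})

# Literal (non-regex) replacement, then strip any leftover whitespace
df["col1"] = df["col1"].str.replace("\x95", "", regex=False).str.strip()
print(df["col1"].tolist())
```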
I have a dataframe like below:
name genre
satya |ACTION|DRAMA|IC|
satya |COMEDY|BIOPIC|SOCIAL|
abc |CLASSICAL|
xyz |ROMANCE|ACTION|DARMA|
def |DISCOVERY|SPORT|COMEDY|IC|
ghj |IC|
Now I want to query the dataframe so that I get rows 1, 5 and 6, i.e. I want to find |IC| alone or in combination with any other genres.
Up to now I am able to do either an exact search using
df[df['genre'] == '|ACTION|DRAMA|IC|'] ######exact value yields row 1
or a string contains search by
df[df['genre'].str.contains('IC')] ####yields row 1,2,3,5,6
# as BIOPIC has IC in that same for CLASSICAL also
But I don't want either of these.
#df[df['genre'].str.contains('|IC|')] #### row 6
# This also not satisfying my need as i am missing rows 1 and 5
So my requirement is to find genres having |IC| in them. (My string search fails because the regex engine treats '|' as an OR operator.)
Can somebody suggest a regex or any other method to do that? Thanks in advance.
I think you can add \ to the regex for escaping, because | without \ is interpreted as OR:
'|'
A|B, where A and B can be arbitrary REs, creates a regular expression that will match either A or B. An arbitrary number of REs can be separated by the '|' in this way. This can be used inside groups (see below) as well. As the target string is scanned, REs separated by '|' are tried from left to right. When one pattern completely matches, that branch is accepted. This means that once A matches, B will not be tested further, even if it would produce a longer overall match. In other words, the '|' operator is never greedy. To match a literal '|', use \|, or enclose it inside a character class, as in [|].
print(df['genre'].str.contains(r'\|IC\|'))
0 True
1 False
2 False
3 False
4 True
5 True
Name: genre, dtype: bool
print(df[df['genre'].str.contains(r'\|IC\|')])
name genre
0 satya |ACTION|DRAMA|IC|
4 def |DISCOVERY|SPORT|COMEDY|IC|
5 ghj |IC|
Maybe this construction:
df[df['columnName'].str.contains(re.compile('regex_pattern'))]
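For example, on the sample data above (a sketch; str.contains also accepts a pre-compiled pattern object):

```python
import re
import pandas as pd

df = pd.DataFrame({
    "name": ["satya", "satya", "abc", "xyz", "def", "ghj"],
    "genre": ["|ACTION|DRAMA|IC|", "|COMEDY|BIOPIC|SOCIAL|", "|CLASSICAL|",
              "|ROMANCE|ACTION|DARMA|", "|DISCOVERY|SPORT|COMEDY|IC|", "|IC|"],
})

# str.contains accepts a pre-compiled pattern as well as a pattern string
result = df[df["genre"].str.contains(re.compile(r"\|IC\|"))]
print(result["name"].tolist())
```

Only the rows whose genre list contains the standalone |IC| entry are kept.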