Swap the last 2 strings of column names separated by a delimiter - Python

I have a dataframe. I want to swap the last 2 strings of a column name, separated by "_", if the 2nd last string starts with "pi".
Dataframe has columns such as:
abc_rte abc_rte_log abc_rte_log_pi1 abc_rte_pi1_log xyz_pnct_pi2_log
Desired column names:
abc_rte abc_rte_log abc_rte_log_pi1 abc_rte_log_pi1 xyz_pnct_log_pi2
What I tried so far:
for i in range(0, len(df.columns)):
    if str(df.columns[i].split('_')[-2]) == 'pi':
        df.columns[i].split('_')[-2] = str(df.columns[i].split('_')[-1])

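You can use Index.str.replace (pass regex=True, since newer pandas versions treat the pattern as a literal string by default):
df.columns = df.columns.str.replace(r'(pi\d*)_([^_]+)$', r'\2_\1', regex=True)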
>>> df.columns
Index(['abc_rte', 'abc_rte_log', 'abc_rte_log_pi1', 'abc_rte_log_pi1',
       'xyz_pnct_log_pi2'],
      dtype='object')
Regex details:
(pi\d*) : First capturing group
pi : Matches the characters pi literally
\d* : Matches a digit zero or more times
_ : Matches the character _
([^_]+) : Second capturing group
[^_]+ : Matches any character not present in the list [_] one or more times
$ : Asserts position at the end of line
See the online regex demo
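To see what the pattern does without a dataframe, here is a minimal sketch that applies the same regex to plain strings with the standard re module. The helper name swap_pi_suffix and the stricter STRICT pattern are illustrative assumptions, not part of the answer above.

```python
import re

# Same pattern as in the answer above, applied to plain strings.
PATTERN = re.compile(r'(pi\d*)_([^_]+)$')
# Assumed stricter variant: "pi" must begin a "_"-separated segment.
STRICT = re.compile(r'(?<![^_])(pi\d*)_([^_]+)$')

def swap_pi_suffix(name, pattern=PATTERN):
    """Swap a trailing 'pi<digits>_<word>' so that the pi part comes last."""
    return pattern.sub(r'\2_\1', name)

columns = ['abc_rte', 'abc_rte_log', 'abc_rte_log_pi1',
           'abc_rte_pi1_log', 'xyz_pnct_pi2_log']
renamed = [swap_pi_suffix(c) for c in columns]
print(renamed)
```

Note that the original pattern does not require "pi" to start a segment, so a name such as api_log would also be rewritten. The STRICT variant leaves such names alone.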

mapping = {col: col for col in df.columns}
for colname in df.columns:
    splits = colname.rsplit("_", 2)
    # swap only when there are three parts and the 2nd last one starts with "pi"
    if len(splits) == 3 and splits[-2].startswith('pi'):
        newname = "_".join((splits[0], splits[-1], splits[-2]))
        mapping[colname] = newname
df.rename(columns=mapping, inplace=True)

Related

Extracting a string from between two strings in a dataframe

I'm trying to extract a value from my data frame.
I have a column ['DESC'] that contains strings in the following format:
_000it_ZZZ$$$-
_0780it_ZBZT$$$-
_011it_BB$$$-
_000it_CCCC$$$-
I want to extract the string between 'it_' and '$$$'
I have tried this code, but it does not seem to work:
# initializing substrings
sub1 = "it_"
sub2 = "$$$"
# getting index of substrings
idx1 = df['DESC'].find(sub1)
idx2 = df['DESC'].find(sub2)
# length of substring 1 is added to
# get string from next character
df['results'] = df['DESC'][idx1 + len(sub1) + 1: idx2]
I would appreciate your help
You can use str.extract to get the desired output in your new column.
import pandas as pd
import re
df = pd.DataFrame({
    'DESC': ["_000it_ZZZ$$$-", "_0780it_ZBZT$$$-", "_011it_BB$$$-", "_000it_CCCC$$$-", "_000it_123$$$-"]
})
pat = r"(?<=it_)(.+)(?=[\$]{3}-)"
df['results'] = df['DESC'].str.extract(pat)
print(df)
               DESC  results
0    _000it_ZZZ$$$-      ZZZ
1  _0780it_ZBZT$$$-     ZBZT
2     _011it_BB$$$-       BB
3   _000it_CCCC$$$-     CCCC
4    _000it_123$$$-      123
You can see the regex pattern on Regex101 for more details.
You could try using a regex pattern. It matches your cases you listed here, but I can't guarantee that it will generalize to all possible patterns.
import re
string = "_000it_ZZZ$$$-"
p = re.compile(r"(?<=it_)(.*)(?<!\W)")
m = p.findall(string)
print(m) # ['ZZZ']
The pattern looks for it_ in the string and then captures everything up to the last word character that follows, so the trailing non-word characters ($$$-) are left out.
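The idea of extracting the text between two fixed markers can also be sketched on plain strings. This is a minimal illustration using only the re module; the helper name between is hypothetical, and re.escape is used so that characters such as $ are treated literally.

```python
import re

def between(text, start, end):
    """Return the text between the first `start` and the next `end`, or None."""
    # re.escape makes markers like "$$$" literal instead of regex anchors.
    pattern = re.escape(start) + r'(.*?)' + re.escape(end)
    match = re.search(pattern, text)
    return match.group(1) if match else None

samples = ["_000it_ZZZ$$$-", "_0780it_ZBZT$$$-", "_011it_BB$$$-", "_000it_CCCC$$$-"]
extracted = [between(s, "it_", "$$$") for s in samples]
print(extracted)
```

Because the pattern has a single capturing group, a pattern built the same way should also be usable with Series.str.extract on the DESC column.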

Python Trimming a few column names but not all in a dataframe

I have a dataframe of many columns. Now I am trimming a few columns to reduce the text length.
Code:
xdf = pd.DataFrame({'Column1':[10,25],'Column2':[10,25],'Fix_col':[10,25]})
## Rename `Column1` to `C1` and for `C2` as well
req_cols = ['Column1','Column2']
xdf[req_cols].columns = [x[0]+y for name in xdf[req_cols].str.findall(r'([A-Za-z]+)(\d+)' for x,y in name]
Present solution:
print([x[0]+y for name in xdf[req_cols].str.findall(r'([A-Za-z]+)(\d+)' for x,y in name])
['C1','C2']
print(xdf[req_cols].columns)
['Column1','Column2']
Column names did not change. Don't know why?
Expected Answer:
xdf.columns = ['C1','C2','Fix_col']
You can use
import pandas as pd
import re
xdf = pd.DataFrame({'Column1':[10,25],'Column2':[10,25],'Fix_col':[10,25]})
req_cols = ['Column1','Column2']
xdf.rename(columns=lambda x : x if x not in req_cols else re.sub(r'^(\D?)\D*(\d*)', r'\1\2', x), inplace=True)
Output of xdf.columns:
Index(['C1', 'C2', 'Fix_col'], dtype='object')
See the regex demo. Details:
^ - start of string
(\D?) - Group 1 (\1): an optional non-digit char
\D* - zero or more non-digit chars
(\d*) - Group 2 (\2): zero or more digits.
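The regex can be tried on plain strings first. The sketch below uses only the re module; the helper name shorten and the mapping dictionary are illustrative assumptions. It also shows why the rename has to be restricted to req_cols.

```python
import re

def shorten(name):
    """Keep the first non-digit character and the digits that follow the letters."""
    return re.sub(r'^(\D?)\D*(\d*)', r'\1\2', name)

columns = ['Column1', 'Column2', 'Fix_col']
req_cols = ['Column1', 'Column2']

# Rename only the requested columns; everything else keeps its name.
mapping = {c: shorten(c) for c in req_cols}
new_columns = [mapping.get(c, c) for c in columns]
print(new_columns)
print(shorten('Fix_col'))  # applied to every column, this one would collapse to a single letter
```

A dictionary like mapping could also be passed to xdf.rename(columns=mapping) instead of the lambda.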

Regex replace first two letters within column in python

I have a dataframe such as
COL1
A_element_1_+_none
C_BLOCA_element
D_element_3
element_'
BasaA_bloc
B_basA_bloc
BbasA_bloc
and I would like to remove the first 2 letters within each row of COL1 only if they are within that list :
the_list =['A_','B_','C_','D_']
Then I should get the following output:
COL1
element_1_+_none
BLOCA_element
element_3
element_'
BasaA_bloc
basA_bloc
BbasA_bloc
So far I tried the following :
df['COL1']=df['COL1'].str.replace("A_","")
df['COL1']=df['COL1'].str.replace("B_","")
df['COL1']=df['COL1'].str.replace("C_","")
df['COL1']=df['COL1'].str.replace("D_","")
But it also removes the pattern elsewhere in the string, such as the A_ in row 2, instead of removing only the first 2 letters...
If the values to replace in the_list always have that format, you could also consider using str.replace with a simple pattern that matches an uppercase char A-D followed by an underscore at the start of the string: ^[A-D]_
import pandas as pd
strings = [
    "A_element_1_+_none",
    "C_BLOCA_element",
    "D_element_3",
    "element_'",
    "BasaA_bloc",
    "B_basA_bloc",
    "BbasA_bloc"
]
df = pd.DataFrame(strings, columns=["COL1"])
# regex=True is required in newer pandas versions, where the pattern is literal by default
df['COL1'] = df['COL1'].str.replace(r"^[A-D]_", "", regex=True)
print(df)
Output
               COL1
0  element_1_+_none
1     BLOCA_element
2         element_3
3         element_'
4        BasaA_bloc
5         basA_bloc
6        BbasA_bloc
You can also use the apply() function from pandas: if the string starts with one of the listed prefixes, we omit the first two characters; otherwise we return the whole string.
df["COL1"] = df["COL1"].apply(lambda x: x[2:] if x.startswith(("A_","B_","C_","D_")) else x)
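If the prefixes do not always fit a character class such as [A-D], the anchored pattern could be built from the_list itself. This is a sketch on plain strings using only the re module; the name prefix_re is an assumption for illustration.

```python
import re

the_list = ['A_', 'B_', 'C_', 'D_']
# Build '^(?:A_|B_|C_|D_)' from the list, escaping each prefix.
prefix_re = re.compile('^(?:' + '|'.join(map(re.escape, the_list)) + ')')

values = ["A_element_1_+_none", "C_BLOCA_element", "D_element_3", "element_'",
          "BasaA_bloc", "B_basA_bloc", "BbasA_bloc"]
# The ^ anchor means only a prefix at the very start is removed.
stripped = [prefix_re.sub('', v) for v in values]
print(stripped)
```

The ^ anchor is what keeps the A_ inside C_BLOCA_element untouched.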

How to remove unique character based on the same index via regex

While working through one of the questions on SO about using regex to extract values,
I started wondering how to implement a regex that removes all the characters that are the same in every row at the same index position.
Below is the DataFrame:
print(df)
column1
0 [b,e,c]
1 [e,a,c]
2 [a,b,c]
regex :
df.column1.str.extract(r'(\w,\w)')
print(df)
column1
0 b,e
1 e,a
2 a,b
The regex above extracts the characters I need, but I also want to preserve the [] brackets.
You can use either of the following:
df['column2'] = df['column1'].str.replace(r'(?s).*?\[(\w,\w).*', r'[\1]', regex=True)
# or
df['column2'] = '[' + df['column1'].str.extract(r'(\w,\w)') + ']'
In the .str.replace approach, the (?s).*?\[(\w,\w).* matches any zero or more chars as few as possible, then a [, then captures a word char + comma + a word char into Group 1 (\1) and then the rest of the string and replaces the match with [ + Group 1 value + ].
In the second approach, [ and ] are added to the result of the extraction, this solution is best for your toy examples here.
Here is a Pandas test:
>>> import pandas as pd
>>> df = pd.DataFrame({'column1':['[b,e,c]']})
>>> df['column1'].str.replace(r'(?s).*?\[(\w,\w).*', r'[\1]', regex=True)
0    [b,e]
Name: column1, dtype: object
>>> '[' + df['column1'].str.extract(r'(\w,\w)') + ']'
       0
0  [b,e]
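Both approaches can be compared on plain strings as well. This is a minimal sketch with the re module only; the helper name first_two is hypothetical.

```python
import re

rows = ['[b,e,c]', '[e,a,c]', '[a,b,c]']

# Approach 1: replace the whole string with "[" + Group 1 + "]".
replaced = [re.sub(r'(?s).*?\[(\w,\w).*', r'[\1]', r) for r in rows]

# Approach 2: extract the pair and add the brackets back.
def first_two(cell):
    m = re.search(r'(\w,\w)', cell)
    return '[' + m.group(1) + ']' if m else None

extracted = [first_two(r) for r in rows]
print(replaced)
print(extracted)
```

One difference worth knowing: when a value does not match, the replace approach returns it unchanged, while the extract approach yields a missing value (None here).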

Check if string is in pandas Dataframe column, and create new Dataframe

I am trying to check if a string is in a Pandas column. I tried doing it two ways but they both seem to check for a substring.
itemName = "eco drum ecommerce"
words = self.itemName.split(" ")
df.columns = ['key','word','umbrella', 'freq']
df = df.dropna()
df = df.loc[df['word'].isin(words)]
I also tried this way, but this also checks for substring
words = self.itemName.split(" ")
words = '|'.join(words)
df.columns = ['key','word','umbrella', 'freq']
df = df.dropna()
df = df.loc[df['word'].str.contains(words, case=False)]
The word was this: "eco drum".
Then I did this:
words = self.itemName.split(" ")
words = '|'.join(words)
To end up with this:
eco|drum
This is the "word" column:
Thank you, is it possible this way to not match substrings?
You have the right idea. str.contains treats its pattern as a regex by default (regex=True). Therefore, all you need to do is add anchors to your regex pattern, e.g. "ball" becomes "^ball$".
df = pd.DataFrame(columns=['key'])
df["key"] = ["largeball", "ball", "john", "smallball", "Ball"]
print(df.loc[df['key'].str.contains("^ball$", case=False)])
Referring more specifically to your question, since you want to search for multiple words, you will have to create the regex pattern to give to contains.
# Create dataframe
df = pd.DataFrame(columns=['word'])
df["word"] = ["ecommerce", "ecommerce", "ecommerce", "ecommerce", "eco", "drum"]
# Create regex pattern
word = "eco drum"
words = word.split(" ")
words = "|".join("^{}$".format(word) for word in words)
# Find matches in dataframe
print(df.loc[df['word'].str.contains(words, case=False)])
The argument passed to "|".join(...) is a generator expression that wraps each word in ^ and $ anchors. Given ['eco', 'drum'], the join returns this pattern: ^eco$|^drum$.
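The anchored pattern can be checked on plain strings before using it with str.contains. The sketch below uses only the re module; the candidates list is made-up sample data, and re.escape is an added precaution (an assumption, not part of the answer) in case a word contains regex metacharacters.

```python
import re

item_name = "eco drum"
words = item_name.split(" ")
# Anchor every word so that only whole-string matches count.
pattern = "|".join("^{}$".format(re.escape(w)) for w in words)

candidates = ["ecommerce", "eco", "drum", "Drum", "drums"]
matches = [c for c in candidates if re.search(pattern, c, flags=re.IGNORECASE)]
print(pattern)
print(matches)
```

Here re.IGNORECASE plays the role of case=False in str.contains.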
