Trimming a few column names but not all in a dataframe

I have a dataframe with many columns. I want to shorten a few of the column names to reduce the text length.
Code:
xdf = pd.DataFrame({'Column1':[10,25],'Column2':[10,25],'Fix_col':[10,25]})
## Rename `Column1` to `C1` and `Column2` to `C2`
req_cols = ['Column1','Column2']
xdf[req_cols].columns = [x[0]+y for name in xdf[req_cols].columns.str.findall(r'([A-Za-z]+)(\d+)') for x,y in name]
Present solution:
print([x[0]+y for name in xdf[req_cols].columns.str.findall(r'([A-Za-z]+)(\d+)') for x,y in name])
['C1','C2']
print(xdf[req_cols].columns)
['Column1','Column2']
The column names did not change, and I don't know why.
Expected Answer:
xdf.columns = ['C1','C2','Fix_col']

You can use
import pandas as pd
import re
xdf = pd.DataFrame({'Column1':[10,25],'Column2':[10,25],'Fix_col':[10,25]})
req_cols = ['Column1','Column2']
xdf.rename(columns=lambda x : x if x not in req_cols else re.sub(r'^(\D?)\D*(\d*)', r'\1\2', x), inplace=True)
Output of xdf.columns:
Index(['C1', 'C2', 'Fix_col'], dtype='object')
Regex details:
^ - start of string
(\D?) - Group 1 (\1): an optional non-digit char
\D* - zero or more non-digit chars
(\d*) - Group 2 (\2): zero or more digits.
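An equivalent way is to build the old-name to new-name mapping explicitly, a minimal sketch using a dict comprehension over only the columns to rename (same regex as above; columns absent from the mapping are left untouched by rename):

```python
import re

import pandas as pd

xdf = pd.DataFrame({'Column1': [10, 25], 'Column2': [10, 25], 'Fix_col': [10, 25]})
req_cols = ['Column1', 'Column2']

# Keep the first non-digit char and the trailing digits: 'Column1' -> 'C1'
mapping = {c: re.sub(r'^(\D?)\D*(\d*)', r'\1\2', c) for c in req_cols}
xdf = xdf.rename(columns=mapping)
print(xdf.columns.tolist())  # ['C1', 'C2', 'Fix_col']
```

This also makes it easy to inspect the mapping before applying it.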


Extracting a string from between two strings in a dataframe

I'm trying to extract a value from my dataframe.
I have a column 'DESC' that contains strings in the following format:
_000it_ZZZ$$$-
_0780it_ZBZT$$$-
_011it_BB$$$-
_000it_CCCC$$$-
I want to extract the string between 'it_' and '$$$'
I have tried this code, but it does not seem to work:
# initializing substrings
sub1 = "it_"
sub2 = "$$$"
# getting index of substrings
idx1 = df['DESC'].find(sub1)
idx2 = df['DESC'].find(sub2)
# length of substring 1 is added to
# get string from next character
df['results'] = df['DESC'][idx1 + len(sub1) + 1: idx2]
I would appreciate your help
You can use str.extract to get the desired output in your new column.
import pandas as pd
import re
df = pd.DataFrame({
'DESC' : ["_000it_ZZZ$$$-", "_0780it_ZBZT$$$-", "_011it_BB$$$-", "_000it_CCCC$$$-", "_000it_123$$$-"]
})
pat = r"(?<=it_)(.+)(?=[\$]{3}-)"
df['results'] = df['DESC'].str.extract(pat)
print(df)
DESC results
0 _000it_ZZZ$$$- ZZZ
1 _0780it_ZBZT$$$- ZBZT
2 _011it_BB$$$- BB
3 _000it_CCCC$$$- CCCC
4 _000it_123$$$- 123
You could also try a plain re pattern. It matches the cases you listed here, but I can't guarantee that it will generalize to all possible inputs.
import re
string = "_000it_ZZZ$$$-"
p = re.compile(r"(?<=it_)(.*)(?<!\W)")
m = p.findall(string)
print(m) # ['ZZZ']
The pattern looks for it_ in the string and then matches everything up to, but not including, any trailing non-word characters.
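As a side note, str.extract returns NaN for rows where the pattern does not match, which you can replace with a sentinel. A minimal sketch (the non-matching row is an invented example for illustration):

```python
import pandas as pd

df = pd.DataFrame({'DESC': ["_000it_ZZZ$$$-", "no_delimiters_here"]})

# expand=False returns a Series instead of a one-column DataFrame;
# rows without the it_ ... $$$- delimiters yield NaN, handled by fillna().
df['results'] = (
    df['DESC']
    .str.extract(r"(?<=it_)(.+)(?=\${3}-)", expand=False)
    .fillna("no match")
)
print(df['results'].tolist())  # ['ZZZ', 'no match']
```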

Regex replace first two letters within column in python

I have a dataframe such as
COL1
A_element_1_+_none
C_BLOCA_element
D_element_3
element_'
BasaA_bloc
B_basA_bloc
BbasA_bloc
and I would like to remove the first 2 letters within each row of COL1 only if they are within that list :
the_list =['A_','B_','C_','D_']
Then I should get the following output:
COL1
element_1_+_none
BLOCA_element
element_3
element_'
BasaA_bloc
basA_bloc
BbasA_bloc
So far I tried the following :
df['COL1']=df['COL1'].str.replace("A_","")
df['COL1']=df['COL1'].str.replace("B_","")
df['COL1']=df['COL1'].str.replace("C_","")
df['COL1']=df['COL1'].str.replace("D_","")
But that also removes the pattern when it appears later in the string (e.g. the A_ inside row 2), rather than removing only the first 2 characters.
If the values to replace in the_list always have that format, you could also consider using str.replace with a simple pattern matching an uppercase char A-D followed by an underscore, anchored at the start of the string: ^[A-D]_
import pandas as pd
strings = [
"A_element_1_+_none ",
"C_BLOCA_element ",
"D_element_3",
"element_'",
"BasaA_bloc",
"B_basA_bloc",
"BbasA_bloc"
]
df = pd.DataFrame(strings, columns=["COL1"])
df['COL1'] = df['COL1'].str.replace(r"^[A-D]_", "", regex=True)
print(df)
Output
COL1
0 element_1_+_none
1 BLOCA_element
2 element_3
3 element_'
4 BasaA_bloc
5 basA_bloc
6 BbasA_bloc
You can also use pandas' apply() function: if the string starts with one of the prefixes, we omit the first two characters; otherwise we return the whole string.
df["COL1"] = df["COL1"].apply(lambda x: x[2:] if x.startswith(("A_","B_","C_","D_")) else x)
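For larger frames, the same prefix-stripping idea can be vectorized. A sketch using numpy.where with str.match (this variant is my own, not from the answers above):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"COL1": ["A_element_1_+_none", "C_BLOCA_element", "BbasA_bloc"]})

# Where the prefix matches, keep the string from index 2 onward; else keep it whole.
has_prefix = df["COL1"].str.match(r"^[A-D]_")
df["COL1"] = np.where(has_prefix, df["COL1"].str[2:], df["COL1"])
print(df["COL1"].tolist())  # ['element_1_+_none', 'BLOCA_element', 'BbasA_bloc']
```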

How to remove unique character based on the same index via regex

While working through one of the regex-extraction questions on SO,
I started wondering how to use a regex to remove characters that are the same in every row and sit at the same index position.
Below is the DataFrame:
print(df)
column1
0 [b,e,c]
1 [e,a,c]
2 [a,b,c]
regex :
df.column1.str.extract(r'(\w,\w)')
print(df)
column1
0 b,e
1 e,a
2 a,b
The regex above extracts the characters I need, but I also want to preserve the surrounding [].
You can use
df['column2'] = df['column1'].str.replace(r'(?s).*?\[(\w,\w).*', r'[\1]', regex=True)
df['column2'] = '[' + df['column1'].str.extract(r'(\w,\w)') + ']'
In the .str.replace approach, the (?s).*?\[(\w,\w).* matches any zero or more chars as few as possible, then a [, then captures a word char + comma + a word char into Group 1 (\1) and then the rest of the string and replaces the match with [ + Group 1 value + ].
In the second approach, [ and ] are added to the result of the extraction, this solution is best for your toy examples here.
Here is a Pandas test:
>>> import pandas as pd
>>> df = pd.DataFrame({'column1':['[b,e,c]']})
>>> df['column1'].str.replace(r'(?s).*?\[(\w,\w).*', r'[\1]', regex=True)
0 [b,e]
Name: column1, dtype: object
>>> '[' + df['column1'].str.extract(r'(\w,\w)') + ']'
0
0 [b,e]
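Applying the second (extract-and-wrap) approach to all three rows from the question, a minimal runnable sketch:

```python
import pandas as pd

df = pd.DataFrame({'column1': ['[b,e,c]', '[e,a,c]', '[a,b,c]']})

# expand=False yields a Series, so string concatenation re-wraps each pair in brackets.
df['column2'] = '[' + df['column1'].str.extract(r'(\w,\w)', expand=False) + ']'
print(df['column2'].tolist())  # ['[b,e]', '[e,a]', '[a,b]']
```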

Swap the last 2 strings of column names separated by a delimiter

I have a dataframe. I want to swap the last two parts of each column name (separated by "_") if the 2nd-to-last part is "pi" plus a number.
Dataframe has columns such as:
abc_rte abc_rte_log abc_rte_log_pi1 abc_rte_pi1_log xyz_pnct_pi2_log
Desired column names:
abc_rte abc_rte_log abc_rte_log_pi1 abc_rte_log_pi1 xyz_pnct_log_pi2
What I tried so far:
for i in range(0, len(df.columns)):
    if df.columns[i].split('_')[-2] == 'pi':
        df.columns[i].split('_')[-2] = df.columns[i].split('_')[-1]
You can use Index.str.replace:
df.columns = df.columns.str.replace(r'(pi\d*)_([^_]+)$', r'\2_\1', regex=True)
>>> df.columns
Index(['abc_rte', 'abc_rte_log', 'abc_rte_log_pi1', 'abc_rte_log_pi1',
'xyz_pnct_log_pi2'],
dtype='object')
Regex details:
(pi\d*) : First capturing group
pi : Matches the characters pi literally
\d* : Matches a digit zero or more times
_ : Matches the character _
([^_]+) : Second capturing group
[^_]+ : Matches any character other than _, one or more times
$ : Asserts position at the end of line
An alternative is to build a rename mapping in a loop:
mapping = {col: col for col in df.columns}
for colname in df.columns:
    splits = colname.rsplit("_", 2)
    # startswith('pi') also matches the pi1, pi2, ... tokens
    if len(splits) == 3 and splits[-2].startswith('pi'):
        newname = "_".join((splits[0], splits[-1], splits[-2]))
        mapping[colname] = newname
df.rename(columns=mapping, inplace=True)
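Putting the Index.str.replace answer together with the question's column names gives a self-contained check (the empty frame is just scaffolding to hold the columns):

```python
import pandas as pd

cols = ['abc_rte', 'abc_rte_log', 'abc_rte_log_pi1', 'abc_rte_pi1_log', 'xyz_pnct_pi2_log']
df = pd.DataFrame(columns=cols)

# Swap a trailing "pi<digits>_suffix" pair so the pi token comes last;
# names already ending in the pi token are left unchanged.
df.columns = df.columns.str.replace(r'(pi\d*)_([^_]+)$', r'\2_\1', regex=True)
print(df.columns.tolist())
# ['abc_rte', 'abc_rte_log', 'abc_rte_log_pi1', 'abc_rte_log_pi1', 'xyz_pnct_log_pi2']
```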

Extract particular string which appears in multiple lines in cell Pandas

I have to extract each substring that starts with "Year" and finishes with "\n", for every line that appears in a cell of a pandas dataframe.
Additionally, I want to remove the \n at the end of each cell.
This is data frame:
df
Column1
not_important1\nnot_important2\nE012-855 Year-1972\nE012-856 Year-1983\nnot_important3\nE012-857 Year-1977\nnot_important4\nnot_important5\nE012-858 Year-2012\n
not_important6\nnot_important7\nE013-200 Year-1982\nE013-201 Year-1984\nnot_important8\nE013-202 Year-1987\n
not_important9\nnot_important10\nE014-652 Year-1988\nE014-653 Year-1980\nnot_important11\nE014-654 Year-1989\n
This is what I want to get:
df
Column1
Year-1972\nYear-1983\nYear-1977\nYear-2012
Year-1982\nYear-1984\nYear-1987
Year-1988\nYear-1980\nYear-1989
How to do this?
You can use findall with the regex r'Year.*?\\n' to catch the substrings, then build a single string from the found elements with ''.join, and finally remove the trailing literal \n with [:-2]:
import re
df['Column1'] = df['Column1'].apply(lambda x: ''.join(re.findall(r'Year.*?\\n', x))[:-2])
Or, if there are always exactly 4 digits after "Year-", you can do it this way:
df['Column1'] = df['Column1'].apply(lambda x: '\n'.join(re.findall(r'Year-\d{4}', x)))
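The same result is possible without apply, using pandas' vectorized string methods. A sketch assuming, as the [:-2] in the first answer does, that \n in the cells is the literal two-character sequence, and that the year is always 4 digits (the single-row frame here is a shortened version of the question's data):

```python
import pandas as pd

df = pd.DataFrame({'Column1': [
    "not_important1\\nE012-855 Year-1972\\nE012-856 Year-1983\\n",
]})

# str.findall returns a list of matches per row; str.join glues each list
# back into one string with a literal \n separator.
df['Column1'] = df['Column1'].str.findall(r'Year-\d{4}').str.join('\\n')
print(df['Column1'].tolist())  # ['Year-1972\\nYear-1983']
```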