Remove alpha and special characters from a column using Python

I am trying to remove alpha characters and special characters (commas) from the column values. When I try to remove the alpha characters, I get NaN as output.
Input Data:
col2
2565.0
23899
876.44
1765.7
3,253.0CA
9876.9B
Output Data:
col2
2565.0
23899
876.44
1765.7
3253.0
9876.9
Code I have been using:
df['col2'] = df['col2'].str.replace(r"[a-zA-Z]", '', regex=True)
df['col2'] = df['col2'].fillna('').str.replace(',', '').astype(float)
Please suggest how to resolve this.

Use Series.replace with a regex that matches "not a digit or a dot":
df['col2'] = df.col2.replace(r'[^\d.]', '', regex=True).astype(float)
Output
col2
0 2565.00
1 23899.00
2 876.44
3 1765.70
4 3253.00
5 9876.90

Use Series.str.replace:
df['col2'] = df['col2'].str.replace(r'[a-zA-Z,]','', regex=True).astype(float)
print (df)
col2
0 2565.00
1 23899.00
2 876.44
3 1765.70
4 3253.00
5 9876.90
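As an aside, a plausible cause of the NaN output in the question (an assumption, since the column's dtype isn't shown) is that some entries are already numeric rather than strings; the .str accessor returns NaN for non-string values. A minimal sketch:

```python
import pandas as pd

# one float entry and one string entry, as might happen after a messy import
df = pd.DataFrame({'col2': [2565.0, '3,253.0CA']})

# the .str accessor yields NaN for the non-string (float) row
print(df['col2'].str.replace(r'[a-zA-Z]', '', regex=True))

# casting everything to str first keeps every row
df['col2'] = df['col2'].astype(str).str.replace(r'[^\d.]', '', regex=True).astype(float)
print(df['col2'].tolist())
```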

Related

Change number negative sign position

Hello, I have the following example df:
col1 col2
12.4 12.32
11.4- 2.3
2.0- 1.1
I need the negative sign to be at the beginning of the number and not at the end
col1 col2
12.4 12.32
-11.4 2.3
-2.0 1.1
I am trying with the following function; so far I can get the data with the sign and print it correctly, but I don't know how to replace the values in my column.
updated_data = ''  # iterate over the content
for line in df["col1"]:
    # removing last word
    updated_line = ' '.join(str(line).split('-')[:-1])
    print(updated_line)
Could you help me, please? Or, if there is an easier way to do it, I would appreciate it.
Here is one way to do it, using np.where:
import numpy as np

# check if the string ends with '-', and if so convert it to float after
# removing the '-' and multiplying by -1; otherwise keep the value as-is
df['col1'] = np.where(df['col1'].str.strip().str.endswith('-'),
                      df['col1'].str.replace(r'-', '', regex=True).astype('float') * -1,
                      df['col1'])
df
col1 col2
0 12.4 12.32
1 -11.4 2.30
2 -2.0 1.10
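Another way to sketch this (my alternative, not from the answer above): move a trailing '-' to the front with a regex capture group, then convert the whole column in one pass:

```python
import pandas as pd

df = pd.DataFrame({'col1': ['12.4', '11.4-', '2.0-'],
                   'col2': [12.32, 2.3, 1.1]})

# '^(.*)-$' captures everything before a trailing '-'; r'-\1' puts the
# sign in front; values without a trailing '-' are left untouched
df['col1'] = df['col1'].str.replace(r'^(.*)-$', r'-\1', regex=True).astype(float)
```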

Formatting a string containing currency and commas

Does anyone know how I'd format this string (which is a column in a dataframe) to be a float so I can sort by the column please?
£880,000
£88,500
£850,000
£845,000
i.e. I want this to become
88,500
845,000
850,000
880,000
Thanks in advance!
Assuming 'col' is the column name.
If you just want to sort, and keep as string, you can use natsorted:
from natsort import natsort_key
df.sort_values(by='col', key=natsort_key)
# OR
from natsort import natsort_keygen
df.sort_values(by='col', key=natsort_keygen())
output:
col
1 £88,500
3 £845,000
2 £850,000
0 £880,000
If you want to convert to floats:
df['col'] = pd.to_numeric(df['col'].str.replace(r'[^\d.]', '', regex=True))
df.sort_values(by='col')
output:
col
1 88500
3 845000
2 850000
0 880000
If you want strings, you can use str.lstrip:
df['col'] = df['col'].str.lstrip('£')
output:
col
0 880,000
1 88,500
2 850,000
3 845,000
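The two steps can also be combined: strip the pound sign and the thousands separators with one regex, convert to float, and sort. A minimal sketch (the column name 'col' is assumed, as above):

```python
import pandas as pd

df = pd.DataFrame({'col': ['£880,000', '£88,500', '£850,000', '£845,000']})

# drop '£' and ',' in one pass, then sort numerically
df['col'] = df['col'].str.replace(r'[£,]', '', regex=True).astype(float)
df = df.sort_values('col')
```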

Parse only specific characters from the string using python

Trying to split and parse characters from a column and put the parsed data into a different column.
I was parsing on the '_' in the given column data; it worked well as long as the number of '_' characters in the string was fixed at 2.
Input Data:
Col1
U_a65839_Jan87Apr88
U_b98652_Feb88Apr88_(2).jpg.pdf
V_C56478_mar89Apr89
Q_d15634_Apr90Apr91
Q_d15634_Apr90Apr91_(3).jpeg.pdf
S_e15336_may91Apr93
NaN
Expected Output:
col2
Jan87Apr88
Feb88Apr88
mar89Apr89
Apr90Apr91
Apr90Apr91
may91Apr93
Code I have been trying:
df = pd.read_excel(open(r'Dats.xlsx', 'rb'), sheet_name='Sheet1')
df['Col2'] = df.Col1.str.replace(r'.*_', '', regex=True)
print(df['Col2'])
I think you want this:
col2 = df.Col1.str.split("_", expand=True)[2]
output:
0 Jan87Apr88
1 Feb88Apr88
2 mar89Apr89
3 Apr90Apr91
4 Apr90Apr91
5 may91Apr93
6 NaN
(you can dropna if you don't want the last row)
Use str.extract here (import re is needed for the flags argument):
import re
df["col2"] = df["Col1"].str.extract(r'((?:[a-z]{3}\d{2}){2})', flags=re.IGNORECASE)
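A runnable version of that extract, as a sketch using a few of the question's rows:

```python
import re
import pandas as pd

df = pd.DataFrame({'Col1': ['U_a65839_Jan87Apr88',
                            'U_b98652_Feb88Apr88_(2).jpg.pdf',
                            'S_e15336_may91Apr93']})

# two back-to-back "MonYY" tokens (3 letters + 2 digits), case-insensitive
df['col2'] = df['Col1'].str.extract(r'((?:[a-z]{3}\d{2}){2})',
                                    flags=re.IGNORECASE)
```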
Based on your question, pandas' DataFrame.apply can be a good solution.
First, clean the DataFrame by replacing NaNs with an empty string '':
df = pd.DataFrame(data=['U_a65839_Jan87Apr88', 'U_b98652_Feb88Apr88_(2).jpg.pdf', 'V_C56478_mar89Apr89', 'Q_d15634_Apr90Apr91', 'Q_d15634_Apr90Apr91_(3).jpeg.pdf', 'S_e15336_may91Apr93', None], columns=['Col1'])
df = df.fillna('')
Col1
0 U_a65839_Jan87Apr88
1 U_b98652_Feb88Apr88_(2).jpg.pdf
2 V_C56478_mar89Apr89
3 Q_d15634_Apr90Apr91
4 Q_d15634_Apr90Apr91_(3).jpeg.pdf
5 S_e15336_may91Apr93
6
Next, define a function to extract the required string with a regex:
import re

def fun(s):
    m = re.search(r'\w{3}\d{2}\w{3}\d{2}', s)
    if m:
        return m.group(0)
    else:
        return ''
Then apply the function to the DataFrame:
df['Col2'] = df['Col1'].apply(fun)
Col1 Col2
0 U_a65839_Jan87Apr88 Jan87Apr88
1 U_b98652_Feb88Apr88_(2).jpg.pdf Feb88Apr88
2 V_C56478_mar89Apr89 mar89Apr89
3 Q_d15634_Apr90Apr91 Apr90Apr91
4 Q_d15634_Apr90Apr91_(3).jpeg.pdf Apr90Apr91
5 S_e15336_may91Apr93 may91Apr93
6
Hope the above helps.

How to replace an entire cell with NaN on pandas DataFrame

I want to replace the entire cell that contains the word circled in the picture with blanks or NaN. However, when I try to replace, for example, '1.25 Dividend', it turns out as '1.25 NaN'. I want the whole cell to become NaN. Any idea how to make this work?
Option 1
Use a regular expression in your replace
df.replace('^.*Dividend.*$', np.nan, regex=True)
From comments
(Using regex=True) means the problem is interpreted as a regular expression, but you still need an appropriate pattern. '^' anchors the match at the beginning of the string, so '^.*' matches all characters from the start; '$' anchors at the end, so '.*$' matches all characters up to the end. Altogether, '^.*Dividend.*$' matches any string with 'Dividend' somewhere in it, and the whole match is replaced with np.nan.
Consider the dataframe df
df = pd.DataFrame([[1, '2 Dividend'], [3, 4], [5, '6 Dividend']])
df
0 1
0 1 2 Dividend
1 3 4
2 5 6 Dividend
then the proposed solution yields
0 1
0 1 NaN
1 3 4.0
2 5 NaN
Option 2
Another alternative is to use pd.DataFrame.mask in conjunction with applymap.
Pass a lambda to applymap that identifies whether a cell contains 'Dividend':
df.mask(df.applymap(lambda s: 'Dividend' in s if isinstance(s, str) else False))
0 1
0 1 NaN
1 3 4
2 5 NaN
Option 3
Similar in concept but using stack/unstack + pd.Series.str.contains
df.mask(df.stack().astype(str).str.contains('Dividend').unstack())
0 1
0 1 NaN
1 3 4
2 5 NaN
To coerce all strings (not only those containing 'Dividend') to NaN:
df.apply(lambda x: pd.to_numeric(x, errors='coerce'))
I would use applymap like this (np.nan gives a real missing value, unlike the string 'NaN'):
df.applymap(lambda x: np.nan if (isinstance(x, str) and 'Dividend' in x) else x)

how to replace non-numeric or decimal in string in pandas

I have a column with values in degrees with the degree sign.
42.9377º
42.9368º
42.9359º
42.9259º
42.9341º
The digit 0 should replace the degree symbol
I tried using regex or str.replace, but I can't figure out the exact Unicode character:
the source xls has it as º
the error shows it as an obelus ÷
printing the dataframe shows it as ?
The exact position of the degree sign may vary depending on the rounding of the decimals, so I can't replace by exact string position.
Use str.replace:
df['a'] = df['a'].str.replace('º', '0')
print (df)
a
0 42.93770
1 42.93680
2 42.93590
3 42.92590
4 42.93410
#check hex format of char
print ("{:02x}".format(ord('º')))
ba
df['a'] = df['a'].str.replace(u'\xba', '0')
print (df)
a
0 42.93770
1 42.93680
2 42.93590
3 42.92590
4 42.93410
Solution with extracting the floats:
df['a'] = df['a'].str.extract(r'(\d+\.\d+)', expand=False) + '0'
print (df)
a
0 42.93770
1 42.93680
2 42.93590
3 42.92590
4 42.93410
Or, if every value ends with º, use indexing with str:
df['a'] = df['a'].str[:-1] + '0'
print (df)
a
0 42.93770
1 42.93680
2 42.93590
3 42.92590
4 42.93410
If you know that it's always the last character you could remove that character and append a "0".
s = "42.9259º"
s = s[:-1]+"0"
print(s) # 42.92590
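For completeness, a more encoding-agnostic sketch (my suggestion, not from the answers above): replace every character that is not a digit or a dot with '0', so the exact decoding of the degree sign (º, ÷, ?) doesn't matter. Note that a multi-byte mojibake such as 'Â°' would produce two zeros, so inspect the data first:

```python
import pandas as pd

df = pd.DataFrame({'a': ['42.9377º', '42.9368º', '42.9359º']})

# any single character outside [0-9.] becomes '0'
df['a'] = df['a'].str.replace(r'[^\d.]', '0', regex=True)
```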
