how to replace non-numeric or decimal in string in pandas - python

I have a column with values in degrees with the degree sign.
42.9377º
42.9368º
42.9359º
42.9259º
42.9341º
The digit 0 should replace the degree symbol
I tried using regex or str.replace but I can't figure out the exact unicode character.
The source xls has it as º
the error shows it as an obelus ÷
printing the dataframe shows it as ?
the exact position of the degree sign may vary, depending on rounding of the decimals, so I can't replace using exact string position.

Use str.replace:
df['a'] = df['a'].str.replace('º', '0')
print (df)
a
0 42.93770
1 42.93680
2 42.93590
3 42.92590
4 42.93410
#check hex format of char
print ("{:02x}".format(ord('º')))
ba
df['a'] = df['a'].str.replace(u'\xba', '0')
print (df)
a
0 42.93770
1 42.93680
2 42.93590
3 42.92590
4 42.93410
A solution that extracts the floats:
df['a'] = df['a'].str.extract(r'(\d+\.\d+)', expand=False) + '0'
print (df)
a
0 42.93770
1 42.93680
2 42.93590
3 42.92590
4 42.93410
Or, if the last character is always º, you can use indexing with str:
df['a'] = df['a'].str[:-1] + '0'
print (df)
a
0 42.93770
1 42.93680
2 42.93590
3 42.92590
4 42.93410
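If you are not sure which character is really in the data, one option is to look at the code points directly. This is only a minimal sketch: the sample values and the helper name describe_chars are made up for illustration, and the second row deliberately uses the real degree sign ° (U+00B0) instead of º (U+00BA). The replace at the end swaps anything that is not an ASCII digit or a dot for "0", so it does not depend on knowing the exact symbol.

```python
import unicodedata

import pandas as pd

# Sample data (assumed): one value uses the real degree sign instead of º.
df = pd.DataFrame({"a": ["42.9377º", "42.9368°", "42.9359º"]})


def describe_chars(text):
    """Return (char, code point, Unicode name) for every char that is not 0-9 or '.'."""
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for ch in text
        if ch not in "0123456789."
    ]


# Show what the trailing symbol really is.
print(describe_chars(df["a"].iloc[0]))

# Replace anything that is not an ASCII digit or a dot with "0".
df["a"] = df["a"].str.replace(r"[^0-9.]", "0", regex=True)
print(df)
```

Passing regex=True explicitly matters here, because recent pandas versions treat the pattern as a literal string by default.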

If you know that it's always the last character you could remove that character and append a "0".
s = "42.9259º"
s = s[:-1]+"0"
print(s) # 42.92590

Related

Python - count successive leading digits on a pandas row string without counting non successive digits

I need to create a new column that counts the number of leading 0s, however I am getting errors trying to do so.
I extracted data from mongo based on the following regex [\^0[0]*[1-9][0-9]*\] and saved it to a csv file. These are all the "Sequences" that start with a 0.
df['Sequence'].str.count('0')
and
df['Sequence'].str.count('0[0]*[1-9][0-9]')
These give the results below. As you can see, both count calls also count non-leading 0s, or simply the total number of 0s.
Sequence 0s
0 012312312 1
1 024624624 1
2 036901357 2
3 002486248 2
4 045074305 3
5 080666140 3
I also tried writing a loop, which worked when testing, but when I use it on the data frame I get the following error: **IndexError: string index out of range**
results = []
count = 0
index = 0
for item in df['Sequence']:
    count = 0
    index = 0
    while (item[index] == "0"):
        count = count + 1
        index = index + 1
    results.append(count)
df['0s'] = results
df
In short: I want to get 2 for the string 001230 instead of 3. Then I could save the results in a column to do my stats on.
You can use extract with the ^(0*) regex to match only the leading zeros. Then use str.len to get the length.
df['0s'] = df['sequence'].str.extract('^(0*)', expand = False).str.len()
Example input:
df = pd.DataFrame({'sequence': ['12040', '01230', '00010', '00120']})
Output:
sequence 0s
0 12040 0
1 01230 1
2 00010 3
3 00120 2
You can use this regex:
'^0+'
The ^ means: match only if the pattern starts at the beginning of the string.
The + means: match if the preceding character occurs one or more times.
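To see what ^0+ does outside of pandas, here is a small sketch with the standard re module. The function name count_leading_zeros and the sample strings are only for illustration. Because + needs at least one 0, there is no match at all for a string without a leading zero, so that case has to be handled separately.

```python
import re


def count_leading_zeros(text):
    """Count the zeros at the very start of text using the ^0+ pattern."""
    match = re.search(r"^0+", text)
    # No match means the string does not start with a zero.
    return len(match.group(0)) if match else 0


for sample in ["001230", "12040", "000", ""]:
    print(repr(sample), count_leading_zeros(sample))
```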
IIUC, you want to count the number of leading 0s, right? Take advantage of the fact that leading 0s disappear when an integer of type str is converted to that of type int. Here's one solution:
df['leading 0s'] = df['Sequence'].str.len() - df['Sequence'].astype(int).astype(str).str.len()
Output:
Sequence leading 0s
0 012312312 1
1 024624624 1
2 036901357 1
3 002486248 2
4 045074305 1
5 080666140 1
Try str.findall:
df['0s'] = df['Sequence'].str.findall('^0*').str[0].str.len()
print(df)
# Output:
Sequence 0s
0 012312312 1
1 024624624 1
2 036901357 1
3 002486248 2
4 045074305 1
5 080666140 1
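A regex-free variant, as a rough sketch: strip the leading zeros with str.lstrip and compare lengths. The sample frame below is made up and includes an all-zero string and an empty string on purpose. Those are the cases where the while loop from the question walks past the end of the string (hence the IndexError), and where the astype(int) approach either undercounts ("000" becomes "0") or fails on the empty string.

```python
import pandas as pd

# Assumed sample data, including edge cases: all zeros, empty, no leading zero.
df = pd.DataFrame({"Sequence": ["012312312", "002486248", "000", "", "12040"]})

# Remove only the leading zeros, then compare the lengths before and after.
without_leading = df["Sequence"].str.lstrip("0")
df["0s"] = df["Sequence"].str.len() - without_leading.str.len()

print(df)
```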

Python: pandas.DataFrame.insert ValueError: Buffer has wrong number of dimensions

In a DataFrame, I want to extract an integer (0-9) from a string which always comes after a specific word, and add it as a new column at a specific position (not the end). In the simplified example below I want to extract the integer which comes after the word 'number'.
DataFrame:
testDf = ['Number1', 'number2', 'aNumber8', 'Number6b']
df = pd.DataFrame(testDf, columns=['Tagname'])
Tagname
Number1
number2
aNumber8
Number6b
The code below works, but since it adds the column at the end of the dataframe, I have to move the column.
df['Number'] = df['Tagname'].str.extract(r'number*(\d)', re.IGNORECASE)
Tagname Number
Number1 1
number2 2
aNumber8 8
Number6b 6
insertNum = df['Number']
df.drop(labels=['Number'], axis=1, inplace = True)
df.insert(0, 'Number', insertNum)
Number Tagname
1 Number1
2 number2
8 aNumber8
6 Number6b
What I hoped I could do is to use .insert(), but this raises the ValueError shown below.
df.insert(0, 'Number', df['Tagname'].str.extract(r'number*(\d)', re.IGNORECASE))
ValueError: Buffer has wrong number of dimensions (expected 1, got 2)
Is it possible to use .insert() this way?
Use expand=False to get a Series from Series.str.extract. If you omit it, you get a DataFrame with one or more columns, because the default is expand=True:
Details:
print (df['Tagname'].str.extract(r'number*(\d)', re.IGNORECASE))
0
0 1
1 2
2 8
3 6
print (df['Tagname'].str.extract(r'number*(\d)', re.IGNORECASE, expand=False))
0 1
1 2
2 8
3 6
Name: Tagname, dtype: object
df.insert(0,'Number',df['Tagname'].str.extract(r'number*(\d)', re.IGNORECASE, expand=False))
print (df)
Number Tagname
0 1 Number1
1 2 number2
2 8 aNumber8
3 6 Number6b
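The difference between the two return types can be checked directly. The sketch below reuses the frame and pattern from the question; the variable names as_frame and as_series are just for illustration.

```python
import re

import pandas as pd

df = pd.DataFrame(["Number1", "number2", "aNumber8", "Number6b"], columns=["Tagname"])

# Default expand=True: a DataFrame with one column per capture group.
as_frame = df["Tagname"].str.extract(r"number*(\d)", flags=re.IGNORECASE)
# expand=False with a single group: a one-dimensional Series.
as_series = df["Tagname"].str.extract(r"number*(\d)", flags=re.IGNORECASE, expand=False)

print(type(as_frame).__name__, as_frame.shape)
print(type(as_series).__name__, as_series.shape)

# The Series can be passed straight to insert at position 0.
df.insert(0, "Number", as_series)
print(df)
```

Selecting the single column of the DataFrame result (as_frame[0]) would give an equivalent Series if you prefer to keep the default.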

Add a character in a string inside a column dataframe

I have a dataframe with some numbers (or strings, it doesn't actually matter). The thing is that I need to add a character in the middle of them. The dataframe looks like this (I got it from Google Takeout)
id A B
1 512343 -1234
1 213 1231345
1 18379 187623
And I want to add a comma in the second position
id A B
1 51,2343 -12,34
1 21,3 12,31345
1 18,379 18,7623
A and B are actually longitude and latitude, so I don't think the comma can always be put in the right place, since there is no way to know whether a coordinate is supposed to have one or two digits before it. Putting the comma after the second digit would do the trick, though.
This should do the trick:
df[["A", "B"]]=df[["A", "B"]].astype(str).replace(r"(\d{2})(\d+)", r"\1,\2", regex=True)
Outputs:
id A B
0 1 51,2343 -12,34
1 1 21,3 12,31345
2 1 18,379 18,7623
Here's another approach with str.extract:
for c in ['A','B']:
    df[c] = df[c].astype(str).str.extract(r'(-?\d{2})(\d*)').agg(','.join,axis=1)
Output:
id A B
0 1 51,2343 -12,34
1 1 21,3 12,31345
2 1 18,379 18,7623
You could do something like this -
import numpy as np
df['A'] = np.where(df['A']>=0,'', '-') + ( df['A'].abs().astype(str).str[:2] + ',' + df['A'].abs().astype(str).str[2:] )
df['B'] = np.where(df['B']>=0,'', '-') + ( df['B'].abs().astype(str).str[:2] + ',' + df['B'].abs().astype(str).str[2:] )
df
id A B
0 1 51,2343 -12,34
1 1 21,3 12,31345
2 1 18,379 18,7623
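The same idea can also be written as a small per-value helper, which may be easier to test. This is only a sketch: insert_comma is a made-up name, and it assumes values with fewer than three digits should be left unchanged.

```python
import re

import pandas as pd


def insert_comma(value):
    """Put a comma after the first two digits, keeping an optional minus sign."""
    # Anchored pattern: optional sign, two digits, then at least one more digit.
    return re.sub(r"^(-?\d{2})(\d+)$", r"\1,\2", str(value))


df = pd.DataFrame(
    {"id": [1, 1, 1], "A": [512343, 213, 18379], "B": [-1234, 1231345, 187623]}
)

for col in ["A", "B"]:
    df[col] = df[col].map(insert_comma)

print(df)
```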

Get 10 Digit Number

I'm trying to define a function that will create a column and clean the numbers to just their ten-digit area code and number. The data frame:
PNum1
0 18888888888
1 1999999999
2 +++(112)31243134
I have all the individual functions and even stored them into a DataFrame and Dictionary.
def GetGoodNumbers(col):
    column = col.copy()
    Cleaned = column.replace(r'\D+', '', regex=True)
    NumberCount = Cleaned.astype(str).str.len()
    FirstNumber = Cleaned.astype(str).str[0]
    SummaryNum = {'Number':Cleaned,'First':FirstNumber,'Count':NumberCount}
    df = pd.DataFrame(data=SummaryNum)
    DecentNumbers = []
    return df
returns
Count First Number
0 11 1 18888888888
1 10 1 1999999999
2 11 1 11231243134
How can I loop through the dataframe column and return a new column that will:
-remove all non-digits.
-get the length (which will be usually 10 or 11)
-If length is 11, return the right 10 digits.
The desired output:
number
1231243134
1999999999
8888888888
You can remove every non-digit and slice the last 10 digits.
df.PNum1.str.replace(r'\D+', '', regex=True).str[-10:]
0 8888888888
1 1999999999
2 1231243134
Name: PNum1, dtype: object
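If some entries might have fewer than ten digits, slicing alone would silently keep them. A possible sketch that stores the result in a new column and leaves too-short entries as NaN; the fourth sample row and the column name number are assumptions for illustration.

```python
import pandas as pd

# Data from the question plus one assumed too-short entry.
df = pd.DataFrame(
    {"PNum1": ["18888888888", "1999999999", "+++(112)31243134", "12345"]}
)

# Strip every non-digit character.
digits = df["PNum1"].astype(str).str.replace(r"\D+", "", regex=True)

# Keep the right-most 10 digits, but only where at least 10 digits exist.
df["number"] = digits.str[-10:].where(digits.str.len() >= 10)

print(df)
```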

How to replace an entire cell with NaN on pandas DataFrame

I want to replace the entire cell that contains the word as circled in the picture with blanks or NaN. However when I try to replace for example '1.25 Dividend' it turned out as '1.25 NaN'. I want to return the whole cell as 'NaN'. Any idea how to work on this?
Option 1
Use a regular expression in your replace
df.replace('^.*Dividend.*$', np.nan, regex=True)
From comments
(Using regex=True) means that it will interpret the problem as a regular expression one. You still need an appropriate pattern. The '^' says to start at the beginning of the string. '^.*' matches all characters from the beginning of the string. '$' says to end the match with the end of the string. '.*$' matches all characters up to the end of the string. Finally, '^.*Dividend.*$' matches all characters from the beginning, has 'Dividend' somewhere in the middle, then any characters after it. Then replace this whole thing with np.nan
Consider the dataframe df
df = pd.DataFrame([[1, '2 Dividend'], [3, 4], [5, '6 Dividend']])
df
0 1
0 1 2 Dividend
1 3 4
2 5 6 Dividend
then the proposed solution yields
0 1
0 1 NaN
1 3 4.0
2 5 NaN
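The effect of the anchors can be seen with the plain re module on a single cell value. A short sketch, using the '1.25 Dividend' value from the question and the string 'NaN' as a stand-in replacement:

```python
import re

cell = "1.25 Dividend"

# Matching only the word replaces just that part of the cell.
partial = re.sub(r"Dividend", "NaN", cell)

# Matching from ^ to $ around the word replaces the whole cell.
whole = re.sub(r"^.*Dividend.*$", "NaN", cell)

print(repr(partial))
print(repr(whole))
```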
Option 2
Another alternative is to use pd.DataFrame.mask in conjunction with applymap, passing applymap a lambda that identifies whether a cell has 'Dividend' in it.
df.mask(df.applymap(lambda s: 'Dividend' in s if isinstance(s, str) else False))
0 1
0 1 NaN
1 3 4
2 5 NaN
Option 3
Similar in concept but using stack/unstack + pd.Series.str.contains
df.mask(df.stack().astype(str).str.contains('Dividend').unstack())
0 1
0 1 NaN
1 3 4
2 5 NaN
Replace all strings:
df.apply(lambda x: pd.to_numeric(x, errors='coerce'))
I would use applymap like this (note that it inserts the string 'NaN'; use np.nan instead if you need a real missing value):
df.applymap(lambda x: 'NaN' if (type(x) is str and 'Dividend' in x) else x)
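For reference, here is a sketch of the mask-based approach that produces real missing values. pandas 2.1 renamed DataFrame.applymap to DataFrame.map, so the code picks whichever one exists; the helper name has_dividend and the variable elementwise are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame([[1, "2 Dividend"], [3, 4], [5, "6 Dividend"]])


def has_dividend(cell):
    """True only for string cells that contain the word 'Dividend'."""
    return isinstance(cell, str) and "Dividend" in cell


# DataFrame.map exists from pandas 2.1 on; older versions only have applymap.
elementwise = df.map if hasattr(df, "map") else df.applymap

# Build a boolean frame and blank out the flagged cells with NaN.
result = df.mask(elementwise(has_dividend))

print(result)
```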
