I have the following dataframe:
df = pd.DataFrame({'A': ['2.5cm','2.5cm','2.56”','1.38”','2.2”','0.8 in','$18.00','4','2"']})
which looks like:
A
2.5cm
2.5cm
2.56”
1.38”
2.2”
0.8 in
$18.00
4
2"
I want to remove all characters except the digits and decimal points.
The output should be:
A
2.5
2.5
2.56
1.38
2.2
0.8
18.00
4
2
Here is what I've tried:
df['A'] = df.A.str.replace(r"[a-zA-Z]", '')
df['A'] = df.A.str.replace('\W', '')
but this is stripping out everything including the decimal point.
Any suggestions would be greatly appreciated.
Thank you in advance
You can use str.extract to extract only the floating-point values:
df['A'] = df['A'].astype(str).str.extract(r'(\d+.\d+|\d)').astype('float')
However, the unescaped '.' here matches any character, not just a period, so a value like 18,00 would be captured whole instead of just 18. The pattern also fails to extract multi-digit whole numbers, because the \d alternative matches only a single digit. Use the corrected pattern below (thanks @DYZ):
df['A'] = df['A'].astype(str).str.extract(r'(\d+\.\d+|\d+)').astype('float')
Output:
A
0 2.50
1 2.50
2 2.56
3 1.38
4 2.20
5 0.80
6 18.00
7 4.00
8 2.00
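If you would rather strip the unwanted characters than extract the number, a minimal sketch using a negated character class (it keeps only digits and periods, which is sufficient for this sample data but would also keep stray periods embedded in other text):
df['A'] = df['A'].str.replace(r'[^\d.]', '', regex=True).astype(float)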
Try with str.extract
df['new'] = df.A.str.extract(r'(\d*\.\d+|\d+)').astype(float).iloc[:,0]
Out[31]:
0
0 2.50
1 2.50
2 2.56
3 1.38
4 2.20
5 0.80
6 18.00
7 4.00
8 2.00
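As a side note, passing expand=False to str.extract returns a Series when there is a single capture group, so the trailing .iloc[:,0] is unnecessary; a sketch:
df['new'] = df.A.str.extract(r'(\d*\.\d+|\d+)', expand=False).astype(float)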
I am trying to extract a chain of numbers that may follow a group of characters within a data frame. If a cell contains no extra characters, nothing needs to be done to it. If it does, I want those characters taken out, so the end result is the same column with only the numbers. See the example.
Before:
ID  Price  Item Code
1   3.60   a/b 80986
2   4.30   45772
3   0.60   fF/6 9778
4   9.78   48989
5   3.44   \ 545
6   3.44   r. 509
Result:
ID  Price  Item Code
1   3.60   80986
2   4.30   45772
3   0.60   9778
4   9.78   48989
5   3.44   545
6   3.44   509
Use Series.str.extract with the regex pattern r'(?:^|\s)(\d+)':
(?:^|\s) matches the beginning of the string ('^') or ('|') any whitespace character ('\s') without capturing it ((?:...))
(\d+) captures one or more digits (greedy)
df['Item Code'] = df['Item Code'].str.extract(r'(?:^|\s)(\d+)', expand=False)
Note that the values of 'Item Code' are still strings after the extraction. If you want to convert them to integers, use Series.astype.
df['Item Code'] = df['Item Code'].str.extract(r'(?:\s|^)(\d+)', expand=False).astype(int)
Output
>>> df
ID Price Item Code
0 1 3.60 80986
1 2 4.30 45772
2 3 0.60 9778
3 4 9.78 48989
4 5 3.44 545
5 6 3.44 509
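If the numeric code is always the last whitespace-separated token, as it is in this sample, a non-regex sketch (an assumption about the data, not something the question guarantees):
df['Item Code'] = df['Item Code'].str.split().str[-1].astype(int)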
I think using a regex is the solution:
import re
# Take the last run of digits, so stray digits in a prefix like 'fF/6' are skipped
df["Item Code"] = list(map(lambda x: int(re.findall(r"\d+", x)[-1]), df["Item Code"]))
I want to convert the column types of my dataset, but I can't because some values contain two dots. I'm using df.apply(pd.to_numeric). The error I get is as follows:
ValueError: Unable to parse string "1.232.2" at position 1
My dataset looks like this:
Price Value
1.232.2 1.235.3
2.345.2 1.234.2
3.343.5 5.433.3
I need to remove the first dot from each value. For example:
Price Value
1232.2 1235.3
2345.2 1234.2
3343.5 5433.3
Any help would be appreciated. Thank you.
Here's a way to do this.
Convert string to float format (multiple dots to single dot)
You can solve this with a regex.
regex expression: r'\.(?=.*\.)'
Explanation:
\. --> matches a literal '.'
(?=.*\.) --> a lookahead that requires another '.' later in the string, so every dot except the last one is matched
Each match is replaced with ''
The code for this is:
df['Price'] = df['Price'].str.replace(r'\.(?=.*\.)', '', regex=True)
df['Value'] = df['Value'].str.replace(r'\.(?=.*\.)', '', regex=True)
If you also want to convert to numeric, you can wrap the same expression in pd.to_numeric directly:
df['Price'] = pd.to_numeric(df['Price'].str.replace(r'\.(?=.*\.)', '', regex=True))
df['Value'] = pd.to_numeric(df['Value'].str.replace(r'\.(?=.*\.)', '', regex=True))
The output of this will be:
Before Cleansing DataFrame:
Price Value
0 1.232.2 1.235.3
1 2.345.2 1.234.2
2 3.343.5 5.433.3
3 123.45 456.25.5
4 0.825 0.0.0
5 0.0.0.2 5.5.5
6 1234 4567
7 NaN NaN
After Cleansing DataFrame:
Price Value
0 1232.2 1235.3
1 2345.2 1234.2
2 3343.5 5433.3
3 123.45 45625.5
4 0.825 00.0
5 000.2 55.5
6 1234 4567
7 NaN NaN
The pd.to_numeric() version of the solution will look like this:
After Cleansing DataFrame:
Note: the Price column is displayed with 3 decimal places because one of its values has 3 decimal places.
Price Value
0 1232.200 1235.3
1 2345.200 1234.2
2 3343.500 5433.3
3 123.450 45625.5
4 0.825 0.0
5 0.200 55.5
6 1234.000 4567.0
7 NaN NaN
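As a non-regex alternative: Python's built-in str.replace accepts a count of occurrences to replace, so removing every dot except the last can be sketched like this (assuming the values are strings, and skipping NaNs; the same line applies to 'Value'):
df['Price'] = df['Price'].apply(lambda s: s if pd.isnull(s) else s.replace('.', '', s.count('.') - 1))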
Discard data if more than one period (.) in data
If you want to process all the columns in the dataframe, you can use applymap(); if you want to process a specific column, use apply. Also use pd.isnull() to check whether a value is NaN so you can skip processing it.
The code below handles NaNs, numbers without decimal places, numbers with one period, and numbers with multiple periods. It assumes the data in the columns are either NaNs or strings made of digits and periods, with no alphabetic or other non-digit characters. If you need the code to check for digits only, let me know.
The code also assumes that you want to discard the leading numbers rather than concatenate them. For example, 1.2345.67 will be replaced with 2345.67 and the 1 will be discarded; likewise 1.2.3.4.5 will be replaced with 4.5, discarding 1.2.3. If this is NOT what you want, the code needs to change.
You can do the following:
import pandas as pd
import numpy as np

df = pd.DataFrame({'Price': ['1.232.2', '2.345.2', '3.343.5', '123.45', '0.825', '0.0.0.2', '1234', np.nan],
                   'Value': ['1.235.3', '1.234.2', '5.433.3', '456.25.5', '0.0.0', '5.5.5', '4567', np.nan]})
print(df)

def remove_dots(x):
    # Keep only the last two dot-separated pieces, e.g. '0.0.0.2' -> '0.2'
    return x if pd.isnull(x) else '.'.join(x.rsplit('.', 2)[-2:])

df = df.applymap(remove_dots)
print(df)
The output of this will be:
Before Cleansing DataFrame:
Price Value
0 1.232.2 1.235.3
1 2.345.2 1.234.2
2 3.343.5 5.433.3
3 123.45 456.25.5
4 0.825 0.0.0
5 0.0.0.2 5.5.5
6 1234 4567
7 NaN NaN
After Cleansing DataFrame:
Price Value
0 232.2 235.3
1 345.2 234.2
2 343.5 433.3
3 123.45 25.5
4 0.825 0.0
5 0.2 5.5
6 1234 4567
7 NaN NaN
If you want to change specific columns only, then you can use apply.
df['Price'] = df['Price'].apply(lambda x: x if pd.isnull(x) else '.'.join(x.rsplit('.',2)[-2:]))
df['Value'] = df['Value'].apply(lambda x: x if pd.isnull(x) else '.'.join(x.rsplit('.',2)[-2:]))
print(df)
The before and after output will be the same as with applymap above.
I need to reduce a column of floats to 2 decimal places, but by truncating rather than rounding to the nearest value.
My data:
df = pd.DataFrame({'numbers': [1.233,1.238,5.059,5.068, 8.556]})
df.head()
numbers
0 1.233
1 1.238
2 5.059
3 5.068
4 8.556
Expected output:
numbers
0 1.23
1 1.23
2 5.05
3 5.06
4 8.55
The problem
Everything I've tried rounds the numbers to the nearest value (a trailing digit of 0-4 rounds down, while 5-9 adds 1 to the truncated decimal place).
Examples of what didn't work
df[['numbers']].round(2)
#or df['numbers'].apply(lambda x: "%.2f" % x)
#output
numbers
0 1.23
1 1.24
2 5.06
3 5.07
4 8.56
This is more of a round-down (truncation):
df.numbers*100//1/100
Out[186]:
0 1.23
1 1.23
2 5.05
3 5.06
4 8.55
Name: numbers, dtype: float64
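A sketch of the same truncation using NumPy's trunc, which arguably reads more clearly and, unlike floor division, also truncates toward zero for negative values:
import numpy as np
df['numbers'] = np.trunc(df['numbers'] * 100) / 100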
Try this, it also works well:
import pandas as pd
do = lambda x: float(str(x).split('.')[0] +'.' + str(x).split('.')[1][0:2])
df = pd.DataFrame({'numbers': list(map(do, [1.233,1.238,5.059,5.068, 8.556]))})
print(df.head())
output
numbers
0 1.23
1 1.23
2 5.05
3 5.06
4 8.55
I have a pandas dataframe with several columns (words, start time, stop time, speaker). I want to combine all values in the 'word' column while the values in the 'speaker' column do not change. In addition, I want to keep the 'start' value for the first word and the 'stop' value for the last word in the combination. Every time the speaker changes back and forth, I want to return this combination as a new row.
The first 9 rows of what I currently have are (the entire dataframe continues for a while with the speaker changing back and forth):
word start stop speaker
0 but 2.72 2.85 2
1 that's 2.85 3.09 2
2 alright 3.09 3.47 2
3 we'll 8.43 8.69 1
4 have 8.69 8.97 1
5 to 8.97 9.07 1
6 okay 9.19 10.01 2
7 sure 10.02 11.01 2
8 what? 11.02 12.00 1
However, I would like to turn this into (continuing across the entire dataframe beyond this example):
word start stop speaker
0 but that's alright 2.72 3.47 2
1 we'll have to 8.43 9.07 1
2 okay sure 9.19 11.01 2
3 what? 11.02 12.00 1
You need to groupby on the consecutive values of speaker.
df.groupby((df['speaker'] != df['speaker'].shift()).cumsum()).agg({
    'speaker': 'first',
    'word': ' '.join,
    'start': 'min',
    'stop': 'max'
}).reset_index(drop=True)
Output:
speaker word start stop
0 2 but that's alright 2.72 3.47
1 1 we'll have to 8.43 9.07
2 2 okay sure 9.19 11.01
3 1 what? 11.02 12.00
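To see why this works: comparing each row's speaker with the previous row's marks the rows where the speaker changes, and the cumulative sum turns those marks into a group id. A sketch of the intermediate values for the sample data:
(df['speaker'] != df['speaker'].shift()).cumsum()
# 0    1
# 1    1
# 2    1
# 3    2
# 4    2
# 5    2
# 6    3
# 7    3
# 8    4
Grouping on this id keeps each consecutive run of the same speaker together even when the same speaker value recurs later.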
I'm trying to parse through all of the cells in a csv file that represent heights and round what's after the decimal to match a number in a list (to round down to the nearest inch). After a few days of banging my head against the wall, this is the code I've been able to get working:
import math
import pandas as pd

inch = [.0, .08, .16, .25, .33, .41, .50, .58, .66, .75, .83, .91, 1]
df = pd.read_csv("sample_csv.csv")

def to_number(s):
    for index, row in df.iterrows():
        try:
            num = float(s)
            num = math.modf(num)
            num = list(num)
            for i, j in enumerate(inch):
                if num[0] < j:
                    num[0] = inch[i-1]
                    break
                elif num[0] == j:
                    num[0] = inch[i]
                    break
            newnum = num[0] + num[1]
            return newnum
        except ValueError:
            return s

df = df.apply(lambda f: to_number(f[0]), axis=1).fillna('')

with open('new.csv', 'a') as f:
    df.to_csv(f, index=False)
Ideally I'd like to have it parse over an entire CSV with n headers, ignoring all strings and round the floats to match the list. Is there a simple(r) way to achieve this with Pandas? And would it be possible (or a good idea?) to have it edit the existing excel workbook instead of creating a new csv i'd have to copy/paste over?
Any help or suggestions would be greatly appreciated as I'm very new to Pandas and it's pretty god damn intimidating!
Helping would be a lot easier if you included a sample mock of the data you're trying to parse. To clarify the points you don't specify, as I understand it:
By "an entire CSV with n headers, ignoring all strings and round the floats to match the list" you mean some n-column dataframe with k numeric columns each of which describe someone's height in inches.
The entries in the numeric columns are measured in units of feet.
You want to ignore the non-numeric columns and transform the data as 6.14 -> 6 feet, 1 inch (I'm implicitly assuming that by "round down" you want an integer floor; i.e. 6.14 feet is 6 feet, 0.14*12 = 1.68 inches; it's up to you whether this is floored or rounded to the nearest integer).
Now for a subset of random heights measured in feet sampled uniformly over 5.1 feet and 6.9 feet, we could do the following:
In [1]: import numpy as np
In [2]: import pandas as pd
In [3]: df = pd.DataFrame(np.random.uniform(5.1, 6.9, size=(10,3)))
In [4]: df
Out[4]:
0 1 2
0 6.020613 6.315707 5.413499
1 5.942232 6.834540 6.761765
2 5.715405 6.162719 6.363224
3 6.416955 6.511843 5.512515
4 6.472462 5.789654 5.270047
5 6.370964 5.509568 6.113121
6 6.353790 6.466489 5.460961
7 6.526039 5.999284 6.617608
8 6.897215 6.016648 5.681619
9 6.886359 5.988068 5.575993
In [5]: np.fix(df) + np.floor(12*(df - np.fix(df)))/12
Out[5]:
0 1 2
0 6.000000 6.250000 5.333333
1 5.916667 6.833333 6.750000
2 5.666667 6.083333 6.333333
3 6.416667 6.500000 5.500000
4 6.416667 5.750000 5.250000
5 6.333333 5.500000 6.083333
6 6.333333 6.416667 5.416667
7 6.500000 5.916667 6.583333
8 6.833333 6.000000 5.666667
9 6.833333 5.916667 5.500000
We're using np.fix to extract the integral part of the height value. Likewise, df - np.fix(df) represents the fractional remainder in feet, or in inches when multiplied by 12. np.floor just truncates this to the nearest inch below, and the final division by 12 converts the unit of measurement back from inches to feet.
You can change np.floor to np.round to get an answer rounded to the nearest inch rather than truncated to the previous whole inch. Finally, you can specify the precision of the output to insist that the decimal portion is selected from your list.
In [6]: (np.fix(df) + np.round(12*(df - np.fix(df)))/12).round(2)
Out[6]:
0 1 2
0 6.58 5.25 6.33
1 5.17 6.42 5.67
2 6.42 5.83 6.33
3 5.92 5.67 6.33
4 6.83 5.25 6.58
5 5.83 5.50 6.92
6 6.83 6.58 6.25
7 5.83 5.33 6.50
8 5.25 6.00 6.83
9 6.42 5.33 5.08
Adding onto the other answer to address your problem with strings:
# Break the dataframe with a string
df = pd.DataFrame(np.random.uniform(5.1, 6.9, size=(10,3)))
df.iloc[0, 0] = 'str'
# Find out which things can be cast to numerics and put NaNs everywhere else
df_safe = df.apply(pd.to_numeric, axis=0, errors="coerce")
df_safe = (np.fix(df_safe) + np.round(12*(df_safe - np.fix(df_safe)))/12).round(2)
# Replace all the NaNs with the original data
df_safe[df_safe.isnull()] = df[df_safe.isnull()]
df_safe should be what you want. Despite the name, this isn't particularly safe and there are probably edge conditions that will be a problem.
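Since df_safe is NaN exactly where the numeric coercion failed, the last replacement step can equivalently be written with fillna, a small sketch:
df_safe = df_safe.fillna(df)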