I am trying to extract a chain of numbers that may be preceded by some characters within a data frame column. If there are no extra characters, nothing needs to be done to the cell. If there are characters, I want them taken out, so the end result is the same column with only the numbers left. See the example below.
Before:

ID  Price  Item Code
1   3.60   a/b 80986
2   4.30   45772
3   0.60   fF/6 9778
4   9.78   48989
5   3.44   \ 545
6   3.44   r. 509
Result:

ID  Price  Item Code
1   3.60   80986
2   4.30   45772
3   0.60   9778
4   9.78   48989
5   3.44   545
6   3.44   509
Use Series.str.extract with the regex pattern r'(?:^|\s)(\d+)':
(?:^|\s) matches the beginning of the string ('^') or ('|') any whitespace character ('\s') without capturing it ((?:...))
(\d+) captures one or more digits (greedy)
df['Item Code'] = df['Item Code'].str.extract(r'(?:^|\s)(\d+)', expand=False)
Note that the values of 'Item Code' are still strings after the extraction. If you want to convert them to integers, use Series.astype.
df['Item Code'] = df['Item Code'].str.extract(r'(?:\s|^)(\d+)', expand=False).astype(int)
Output
>>> df
ID Price Item Code
0 1 3.60 80986
1 2 4.30 45772
2 3 0.60 9778
3 4 9.78 48989
4 5 3.44 545
5 6 3.44 509
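For a self-contained check, here is a sketch that rebuilds the sample column from the question and runs the extraction end to end (the DataFrame constructor below is just a reconstruction of the table above):

import pandas as pd

df = pd.DataFrame({
    'ID': [1, 2, 3, 4, 5, 6],
    'Price': [3.60, 4.30, 0.60, 9.78, 3.44, 3.44],
    'Item Code': ['a/b 80986', '45772', 'fF/6 9778', '48989', r'\ 545', 'r. 509'],
})

# Keep only the digit run that starts the string or follows whitespace.
df['Item Code'] = df['Item Code'].str.extract(r'(?:^|\s)(\d+)', expand=False).astype(int)
print(df)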
I think using a regex is the solution:
import re
dt["Item code"] = list(map(lambda x:int(re.findall("\d+", x)[0]), dt["Item code"]))
I have the following dataframe:
df = pd.DataFrame({'A': ['2.5cm','2.5cm','2.56”','1.38”','2.2”','0.8 in','$18.00','4','2"']})
which looks like:
A
2.5cm
2.5cm
2.56”
1.38”
2.2”
0.8 in
$18.00
4
2"
I want to remove all characters except for the decimal points.
The output should be:
A
2.5
2.5
2.56
1.38
2.2
0.8
18.00
4
2
Here is what I've tried:
df['A'] = df.A.str.replace(r"[a-zA-Z]", '')
df['A'] = df.A.str.replace('\W', '')
but this is stripping out everything including the decimal point.
Any suggestions would be greatly appreciated.
Thank you in advance
You can use str.extract to pull out only the floating-point numbers:
df['A'] = df['A'].astype(str).str.extract(r'(\d+.\d+|\d)').astype('float')
However, '.' here matches any character, not just a period, so something like '18,00' would also be matched as one number; and the lone '\d' alternative fails to extract multi-digit whole numbers. Use the corrected pattern below (thanks #DYZ):
df['A'] = df['A'].astype(str).str.extract(r'(\d+\.\d+|\d+)').astype('float')
Output:
A
0 2.50
1 2.50
2 2.56
3 1.38
4 2.20
5 0.80
6 18.00
7 4.00
8 2.00
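As an aside (just a sketch, not part of the answer above), the replace approach from the question can also work if you remove everything that is not a digit or a dot:

df['A'] = df['A'].str.replace(r'[^\d.]', '', regex=True).astype(float)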
Try with str.extract
df['new'] = df.A.str.extract(r'(\d*\.\d+|\d+)').astype(float).iloc[:,0]
Out[31]:
0
0 2.50
1 2.50
2 2.56
3 1.38
4 2.20
5 0.80
6 18.00
I want to convert the column types, but I can't because some of my values contain two dots. I am using df.apply(pd.to_numeric), and the error I get is as follows:
ValueError: Unable to parse string "1.232.2" at position 1
my dataset is like this;
Price Value
1.232.2 1.235.3
2.345.2 1.234.2
3.343.5 5.433.3
I need to remove the first dot. For example:
Price Value
1232.2 1235.3
2345.2 1234.2
3343.5 5433.3
I would appreciate any help. Thank you.
Here's a way to do this.
Convert string to float format (multiple dots to single dot)
You can solve this with a regex.
Regex pattern: '\.(?=.*\.)'
Explanation:
\. matches a literal '.'
(?=.*\.) is a lookahead requiring another '.' later in the string, so every dot except the last one is matched
Each matched dot is replaced with ''
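For a single string, the effect of that pattern looks like this (a quick illustration using re.sub):

import re

re.sub(r'\.(?=.*\.)', '', '1.232.2')   # '1232.2'  (only the last dot survives)
re.sub(r'\.(?=.*\.)', '', '0.0.0.2')   # '000.2'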
The code for this is:
df['Price'] = df['Price'].str.replace(r'\.(?=.*\.)', '', regex=True)
df['Value'] = df['Value'].str.replace(r'\.(?=.*\.)', '', regex=True)
If you also want to convert it to numeric, you can directly give:
df['Price'] = pd.to_numeric(df['Price'].str.replace(r'\.(?=.*\.)', '', regex=True))
df['Value'] = pd.to_numeric(df['Value'].str.replace(r'\.(?=.*\.)', '', regex=True))
The output of this will be:
Before Cleansing DataFrame:
Price Value
0 1.232.2 1.235.3
1 2.345.2 1.234.2
2 3.343.5 5.433.3
3 123.45 456.25.5
4 0.825 0.0.0
5 0.0.0.2 5.5.5
6 1234 4567
7 NaN NaN
After Cleansing DataFrame:
Price Value
0 1232.2 1235.3
1 2345.2 1234.2
2 3343.5 5433.3
3 123.45 45625.5
4 0.825 00.0
5 000.2 55.5
6 1234 4567
7 NaN NaN
The pd.to_numeric() version of the solution will look like this:
After Cleansing DataFrame:
Note: the Price column is displayed with 3 decimal places because one of its values has 3 decimal places; the underlying numbers are unchanged.
Price Value
0 1232.200 1235.3
1 2345.200 1234.2
2 3343.500 5433.3
3 123.450 45625.5
4 0.825 0.0
5 0.200 55.5
6 1234.000 4567.0
7 NaN NaN
Discard data if more than one period (.) in data
If you want to process all the columns in the dataframe, you can use applymap(); if you want to process a specific column, use apply(). Also use pd.isnull() to check whether a value is NaN so you can skip processing it.
The code below handles NaNs, numbers without decimal places, numbers with one period, and numbers with multiple periods. It assumes the columns contain either NaNs or strings made up of digits and periods, with no alphabetic or other non-digit characters (apart from the dots). If you need the code to also validate that only digits are present, let me know.
The code also assumes that you want to discard the leading numbers. If you instead want to concatenate the numbers, a different solution is needed (for example, 1.2345.67 becomes 2345.67 and the leading 1 is discarded; likewise 1.2.3.4.5 becomes 4.5, discarding 1.2.3). If this is not what you want, the code needs to change.
You can do the following:
import pandas as pd
import numpy as np
df = pd.DataFrame({'Price': ['1.232.2', '2.345.2', '3.343.5', '123.45', '0.825','0.0.0.2', '1234',np.NaN],
'Value': ['1.235.3', '1.234.2', '5.433.3', '456.25.5','0.0.0','5.5.5', '4567',np.NaN]})
print (df)
def remove_dots(x):
    return x if pd.isnull(x) else '.'.join(x.rsplit('.', 2)[-2:])
df = df.applymap(remove_dots)
print (df)
The output of this will be:
Before Cleansing DataFrame:
Price Value
0 1.232.2 1.235.3
1 2.345.2 1.234.2
2 3.343.5 5.433.3
3 123.45 456.25.5
4 0.825 0.0.0
5 0.0.0.2 5.5.5
6 1234 4567
7 NaN NaN
After Cleansing DataFrame:
Price Value
0 232.2 235.3
1 345.2 234.2
2 343.5 433.3
3 123.45 25.5
4 0.825 0.0
5 0.2 5.5
6 1234 4567
7 NaN NaN
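To see why the leading pieces get discarded, here is the rsplit step from remove_dots applied to a single string (a quick illustration):

'1.2.3.4.5'.rsplit('.', 2)                     # ['1.2.3', '4', '5']
'.'.join('1.2.3.4.5'.rsplit('.', 2)[-2:])      # '4.5'  ('1.2.3' is dropped)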
If you want to change specific columns only, then you can use apply.
df['Price'] = df['Price'].apply(lambda x: x if pd.isnull(x) else '.'.join(x.rsplit('.',2)[-2:]))
df['Value'] = df['Value'].apply(lambda x: x if pd.isnull(x) else '.'.join(x.rsplit('.',2)[-2:]))
print(df)
The before and after output will be the same as shown above.
I have a CSV file in which records span multiple lines, like this:
id1,id2,id3,id4,id5,id6,id7
1,2,3,4,5,6,7
1,2,3,4
,5,6,
7
1,2
3,4
,5,6,
7
I want to change the file to look like this:
id1,id2,id3,id4,id5,id6,id7
1,2,3,4,5,6,7
1,2,3,4,5,6,7
1,2,3,4,5,6,7
I know PySpark can read such a file with the multiline=True option, but I want to convert the file into single-line rows, which is the business use case. How can I do it? The technologies to be used are either PySpark or Python (pandas). Thanks in advance.
Did you have something like this in mind?
import re
import pandas as pd
items = re.findall("[^ ,\n]+", """id1,id2,id3,id4,id5,id6,id7
1,2,3,4,5,6,7
1,2,3,4
,5,6,
7
1,2
3,4
,5,6,
7""")
rows = [items[i:i+7] for i in range(0,len(items),7)]
pd.DataFrame(rows[1:], columns=rows[0])
Output:
id1 id2 id3 id4 id5 id6 id7
0 1 2 3 4 5 6 7
1 1 2 3 4 5 6 7
2 1 2 3 4 5 6 7
Since it has been requested, here is a no-loop version of the second part:
import numpy as np

rows = np.array(items).reshape(len(items) // 7, 7)
pd.DataFrame(rows[1:], columns=rows[0])
I tested whether it actually saves time using Jupyter's %%timeit. It turns out:
the regular expression part takes 6.66 µs ± 43.8 ns,
the old loop-based part of turning it into a dataframe takes 759 µs ± 2.81 µs,
and the new numpy version of the same takes 149 µs ± 4.82 µs
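If the data lives in an actual file rather than an inline string, the same approach applies after reading the file contents first (a sketch; 'input.csv' is just a placeholder name):

import re
import pandas as pd

with open('input.csv') as fh:
    items = re.findall(r"[^ ,\n]+", fh.read())

rows = [items[i:i + 7] for i in range(0, len(items), 7)]
df = pd.DataFrame(rows[1:], columns=rows[0])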
I have a pandas dataframe with several columns (words, start time, stop time, speaker). I want to combine all values in the 'word' column while the values in the 'speaker' column do not change. In addition, I want to keep the 'start' value for the first word and the 'stop' value for the last word in the combination. Every time the speaker changes back and forth, I want to return this combination as a new row.
The first 9 rows of what I currently have are (the entire dataframe continues for a while with the speaker changing back and forth):
word start stop speaker
0 but 2.72 2.85 2
1 that's 2.85 3.09 2
2 alright 3.09 3.47 2
3 we'll 8.43 8.69 1
4 have 8.69 8.97 1
5 to 8.97 9.07 1
6 okay 9.19 10.01 2
7 sure 10.02 11.01 2
8 what? 11.02 12.00 1
However, I would like to turn this into (continuing across the entire dataframe beyond this example):
word start stop speaker
0 but that's alright 2.72 3.47 2
1 we'll have to 8.43 9.07 1
2 okay sure 9.19 11.01 2
3 what? 11.02 12.00 1
You need to groupby on the consecutive values of speaker.
df.groupby([(df['speaker'] != df['speaker'].shift()).cumsum().rename('block'), df['speaker']], as_index=False).agg({
    'word': ' '.join,
    'start': 'min',
    'stop': 'max'
}).drop(columns='block')
Output:
speaker word start stop
0 2 but that's alright 2.72 3.47
1 1 we'll have to 8.43 9.07
2 2 okay sure 9.19 11.01
3 1 what? 11.02 12.00
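For reference, the sample frame from the question can be rebuilt like this (a sketch of the data shown above); passing it through the groupby above reproduces the table just shown:

import pandas as pd

df = pd.DataFrame({
    'word': ['but', "that's", 'alright', "we'll", 'have', 'to', 'okay', 'sure', 'what?'],
    'start': [2.72, 2.85, 3.09, 8.43, 8.69, 8.97, 9.19, 10.02, 11.02],
    'stop': [2.85, 3.09, 3.47, 8.69, 8.97, 9.07, 10.01, 11.01, 12.00],
    'speaker': [2, 2, 2, 1, 1, 1, 2, 2, 1],
})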
Is there a way to read in a file like this and skip the column index prefixes (1: to 5:), as in this example? I'm using read_csv.
24.0 1:0.00632 2:18.00 3:2.310 4:0 5:0.5380
21.6 1:0.02731 2:0.00 3:7.070 4:0 5:0.4690
Expected table read:
24.0 0.00632 18.00 2.310 0 0.5380
read_csv won't handle this the way you want because the file isn't really a CSV.
You can do e.g.
with open('data.txt') as f:
    df = pd.DataFrame([[chunk.split(':')[-1] for chunk in line.split()] for line in f])
Your data is oddly structured. Given the colon-separated index prefixes, you can read the file mostly as text via the usual read_csv. Then go through every column except the first, split each string on ':', take the part after the colon (your desired value), and convert it to a float, which is done below with applymap.
df = pd.read_csv('data.txt', sep=' ', header=None)
>>> df
0 1 2 3 4 5
0 24.0 1:0.00632 2:18.00 3:2.310 4:0 5:0.5380
1 21.6 1:0.02731 2:0.00 3:7.070 4:0 5:0.4690
df.iloc[:, 1:] = df.iloc[:, 1:].applymap(lambda s: float(s.split(':')[1]))
>>> df
0 1 2 3 4 5
0 24.0 0.00632 18 2.31 0 0.538
1 21.6 0.02731 0 7.07 0 0.469
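Another option, not mentioned above but sketched here, is to strip the prefixes while reading by passing per-column converters to read_csv (assuming the same 'data.txt' layout with six space-separated fields):

import pandas as pd

def strip_prefix(s):
    # Drop the "n:" prefix and keep the numeric part.
    return float(s.split(':')[-1])

df = pd.read_csv('data.txt', sep=' ', header=None,
                 converters={i: strip_prefix for i in range(1, 6)})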