How to Split Pandas data for 2 decimals in object - python

Consider a dataframe in Pandas, where one of the many columns have data that has TWO decimals in the column.
Like
13.343.00
12.345.00
98.765.00
How can one get a new column (float) where values are stored in only 1 decimal format stripping that last part of 14.234(.00).
Desired output should be a new column like
13.343
12.345
98.765

If the digits after the second period are not always 0s (and not always two), the following code is more robust:
df["col"] = df["col"].str.extract("(.+)\.[0-9]+").astype(float)

Use:
#remove last 3 values
df['col'] = df['col'].str[:-3].astype(float)
Or:
#get values before last .
df['col'] = df['col'].str.rsplit('.', 1).str[0].astype(float)
Or:
#one or zero integer \d* \. and integer \d+ pattern
df["col"] = df["col"].str.extract("(\d*\.\d+)").astype(float)

You can use:
print(df)
col
0 13.343.00
1 12.345.00
2 98.765.00
df.col=df.col.str.rstrip('.00')
print(df)
col
0 13.343
1 12.345
2 98.765
You can convert it back to float if you like by astype(float)
Note : You should not use this if you have all 0s example: 00.000.00 instead use the second solution.
If the second decimal is not always 0 use:
df.col.str.rsplit(".",1).str[0]

Related

Substring a column in pandas

I have a dataframe like this
Index
Identifier
0
10769289.0
1
1082471174.0
The "Identifier column is a string column" and I need to remove the ".0"
I'm using the following code:
Dataframe["Identifier"] = Dataframe["Identifier"].replace(regex=['.0'],value='')
But I got this:
IndexIdentifier0769289182471174
As you can see it removed more than just the ".0". I also tried to use
Dataframe["Identifier"] = Dataframe["Identifier"].str.replace(".0", "")
but I got the same result.
The dot (.) in regex or in replace can indicate any character. Therefore you have to escape the decimal point. Otherwise it will replace any character followed by a zero. Which in your case would mean that it would replace the 10 at the beginning of 10769289.0 and 1082471174.0, as well as the .0 at the end of each number. By escaping the decimal point, it will only look for the following: .0 - which is what you intended.
import pandas as pd
# Create the dataframe as per the example
Dataframe = pd.DataFrame({"Index": [0,1], "Identifier": ['10769289.0', '1082471174.0']})
# Replace the decimal and the zero at the end of each Identifier.
Dataframe["Identifier"] = Dataframe["Identifier"].str.replace("\.0", "")
# Print the dataframe
print(Dataframe)
OUTPUT:
Index Identifier
0 0 10769289
1 1 1082471174

Add character to column based on text condition using pandas

I'm trying to do some data cleaning using pandas. Imagine I have a data frame which has a column call "Number" and contains data like: "1203.10", "4221","3452.11", etc. I want to add an "M" before the numbers, which have a point and a zero at the end. For this example, it would be turning the "1203.10" into "M1203.10".
I know how to obtain a data frame containing the numbers with a point and ending with zero.
Suppose the data frame is call "df".
pointzero = '[0-9]+[.][0-9]+[0]$'
pz = df[df.Number.str.match(pointzero)]
But I'm not sure on how to add the "M" at the beginning after having "pz". The only way I know is using a for loop, but I think there is a better way. Any suggestions would be great!
You can use boolean indexing:
pointzero = '[0-9]+[.][0-9]+[0]$'
m = df.Number.str.match(pointzero)
df.loc[m, 'Number'] = 'M' + df.loc[m, 'Number']
Alternatively, using str.replace and a slightly different regex:
pointzero = '([0-9]+[.][0-9]+[0]$)'
df['Number'] = df['Number'].str.replace(pointzero, r'M\1', regex=True))
Example:
Number
0 M1203.10
1 4221
2 3452.11
you should make dataframe or seires example for answer
example:
s1 = pd.Series(["1203.10", "4221","3452.11"])
s1
0 M1203.10
1 4221
2 3452.11
dtype: object
str.contains + boolean masking
cond1 = s1.str.contains('[0-9]+[.][0-9]+[0]$')
s1.mask(cond1, 'M'+s1)
output:
0 M1203.10
1 4221
2 3452.11
dtype: object

Detect special values in one column

I am trying to detect values with some specific characters e.g(?,/ etc). Below you can see a small sample with some data.
import pandas as pd
import numpy as np
data = {
'artificial_number':['000100000','000010000','00001000/1','00001000?','0?00/10000'],
}
df1 = pd.DataFrame(data, columns = [
'artificial_number',])
Now I want to detect values with specific characters that are not numbers ('00001000/1','00001000?','0?00/10000') I tried with this lines below
import re
clean = re.sub(r'[^a-zA-Z0-9\._-]', '', df1['artificial_number'])
But this code is not working as I expected. So can anybody help me how to solve this problem ?
#replace the non-digit with an empty value
df1['artificial_number'].str.replace(r'([^\d])','', regex=True)
0 000100000
1 000010000
2 000010001
3 00001000
4 00010000
Name: artificial_number, dtype: object
if you like to list the column with non-digit values
df1.loc[df1['artificial_number'].str.extract(r'([^\d])')[0].notna()]
artificial_number
2 00001000/1
3 00001000?
4 0?00/10000
Assuming a number in your case is an integer, to find the values that have non-numbers, just count the number of numbers, and compare with length of string:
rows = [len(re.findall('[0-9]', s)) != len(s) for s in df1.artificial_number]
df1.loc[rows]
# artificial_number
#2 00001000/1
#3 00001000?
#4 0?00/10000
To detect which of the values aren't interpretable as numeric, you can also use str.isnumeric:
df1.loc[~df1.artificial_number.str.isnumeric()]
artificial_number
2 00001000/1
3 00001000?
4 0?00/10000
If all characters need to be digits (e.g. 10.0 should also be excluded), use str.isdigit:
df1.loc[~df1.artificial_number.str.isdigit()]
df1.iloc[0,0] = '000100000.0'
artificial_number
0 000100000.0
2 00001000/1
3 00001000?
4 0?00/10000

invalid literal for int() with base 10: 'EC180'

All that i tried was:
df['buyer_zip']=df['buyer_zip'].replace('-', 0)
df['buyer_zip']=df['buyer_zip'].replace('', 0)
df['buyer_zip']=df['buyer_zip'].str[:5]
df["buyer_zip"].fillna( method ='ffill', inplace = True)
df["buyer_zip"].apply(int)
I have two columns in a pandas dataframe called Buyer_zip and Item_zip which are the zip codes of Buyer and items respectively. These zip codes have 4 formats. One is 5 digit zip code( ex: 12345), one is 5+4 digit zip code( 12345-1234), one is 9 digit zipcode (123456789) and the last one is 'EC180'. So, the last format is alphanumeric. There are 15 Million records in total. I am struck at a point where i have to convert all those alphanumeric values to numeric. When trying to do the same, i encountered the error: invalid literal for int() with base 10: 'EC180'. Could someone help me how to find all the words in my data column and replace it with 00000. Appreciate any help.But none of it gave an answer to how to find the words in that column and replace it with numbers
Sample data:
buyer_zip
97219
11415-3528
EC180
907031234
Expected output
buyer_zip
0 97219
1 114153528
2 0
3 907031234
Pandas has several different "replace" methods. On a DataFrame or a Series, replace is meant to match and replace entire values. For instance, df['buyer_zip'].replace('-', 0) looks for a column value that is literally the single character "-" and replaces it with the integer 0. That's not what you want. The series also has a .str attribute which holds functions for strings, and its replace is closer to what you want.
But, that is what you want when you have a string that starts with a non-digit letter. You want that one to be completely replaced with "00000".
Finally, astype is a faster way to convert the column to an int.
import pandas as pd
df = pd.DataFrame({"buyer_zip":["12345", "123451234", "123456789", "EC180"]})
df["buyer_zip"] = df["buyer_zip"].str.replace("-", "")
df["buyer_zip"] = df["buyer_zip"].replace(r"[^\d].*$", "00000", regex=True)
df["buyer_zip"] = df["buyer_zip"].astype(int)
The operations can be chained. Apply the second operation to the result of the first, etc, and you can condense the conversion
df["buyer_zip"] = df["buyer_zip"].str.replace("-", "").replace(r"[^\d].*$", "00000", regex=True).astype(int)

Issues with extracting substrings of a string in Python Pandas Dataframe

I have an expression like ( one row of a column, say 'old_col' in pandas data frame) ( Shown the top two rows from a column of the dataframe )
abcd_6.9_uuu ghaha_12.8 _sksks
abcd_5.2_uuu ghaha_13.9 _sksks
I was trying to use the str.extract on the dataframe to get the two floating numbers. However I find two issues, only the first one is picked up( 6.9 from first row and 5.2 from second row )
1. So how can I do that?
2. Also how can I make the extract method general to pick numbers upto any digits ( 5.7or 12.9 irrespective)
I am using:
df['newcol'] = df['old_col'].str.extract('(_\d.\d)')
To get more than one digit,
df['col'].str.extract('(\_\d+\.\d+)')
col
0 _6.9
1 _15.9
To get all occurrences, use str.extractall
df['col'].str.extractall('(\_\d+\.\d+)')
col
match
0 0 _6.9
1 _12.8
1 0 _15.9
1 _13.9
To assign back to df:
s = df['col'].str.extractall('(\_\d+\.\d+)')['col']
df['new_col'] = s.groupby(s.index.get_level_values(0)).agg(list)
You can use Series.str.findall:
import pandas as pd
df=pd.DataFrame({'old_col':['abcd_6.9_uuu ghaha_12.8 _sksks','abcd_5.2_uuu ghaha_13.9 _sksks']})
df['newcol'] = df['old_col'].str.findall(r'\d+(?:\.\d+)?')
df['newcol_str'] = df['old_col'].str.findall(r'\d+(?:\.\d+)?').str.join(', ')
# >>> df
# old_col newcol newcol_str
# 0 abcd_6.9_uuu ghaha_12.8 _sksks [6.9, 12.8] 6.9, 12.8
# 1 abcd_5.2_uuu ghaha_13.9 _sksks [5.2, 13.9] 5.2, 13.9
Regex details:
\d+(?:\.\d+)? - one or more digits followed with an optional occurrence of a . and one or more digits
\d+\.\d+ would match only float values where the . is obligatory between at least two digits.
Since .str.findall(r'\d+(?:\.\d+)?') returns a list, the newcol column contains lists, with .str.join(', '), the newcol_str column contains strings with found matches merged.
If you must check if the numbers occur between underscores add them on both sides of the pattern and wrap the number matching pattern with parentheses:
.str.findall(r'_(\d+(?:\.\d+)?)_')

Categories