I have a dataframe like this
The "Identifier" column is a string column, and I need to remove the ".0".
I'm using the following code:
Dataframe["Identifier"] = Dataframe["Identifier"].replace(regex=['.0'],value='')
But I got this:
As you can see it removed more than just the ".0". I also tried to use
Dataframe["Identifier"] = Dataframe["Identifier"].str.replace(".0", "")
but I got the same result.
In a regular expression, the dot (.) matches any character, so it has to be escaped to match a literal decimal point. Unescaped, the pattern .0 replaces any character followed by a zero. In your case that means the 10 at the beginning of 10769289.0 and 1082471174.0 is removed, as well as the .0 at the end of each number. With the decimal point escaped, the pattern \.0 only matches a literal .0, which is what you intended.
import pandas as pd
# Create the dataframe as per the example
Dataframe = pd.DataFrame({"Index": [0,1], "Identifier": ['10769289.0', '1082471174.0']})
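# Replace the decimal point and the zero at the end of each Identifier.
# regex=True is explicit: since pandas 2.0 the pattern is treated as a literal string by default.
Dataframe["Identifier"] = Dataframe["Identifier"].str.replace(r"\.0", "", regex=True)
# Print the dataframe
print(Dataframe)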
Index Identifier
0 0 10769289
1 1 1082471174
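The difference between the unescaped and the escaped pattern can be checked with Python's built-in re module, which uses the same regex syntax pandas applies when a pattern is treated as a regular expression. This is only a small sketch on the two sample values; the variable names are illustrative, and the anchored variant with $ is an extra assumption for the case where ".0" should only be removed at the very end.

```python
import re

identifiers = ["10769289.0", "1082471174.0"]

# Unescaped: "." matches any character, so the leading "10" is removed as well.
unescaped = [re.sub(".0", "", value) for value in identifiers]

# Escaped: only a literal "." followed by "0" is removed.
escaped = [re.sub(r"\.0", "", value) for value in identifiers]

# Escaped and anchored: only a ".0" at the very end of the string is removed.
anchored = [re.sub(r"\.0$", "", value) for value in identifiers]

print(unescaped)
print(escaped)
print(anchored)
```

The anchored form is probably the safer choice if an identifier could contain ".0" somewhere in the middle.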
I have two columns in a pandas dataframe called Buyer_zip and Item_zip, which are the zip codes of the buyer and the item respectively. These zip codes come in 4 formats: a 5-digit zip code (e.g. 12345), a 5+4 digit zip code (12345-1234), a 9-digit zip code (123456789), and 'EC180', which is alphanumeric. There are 15 million records in total. I am stuck at the point where I have to convert all the alphanumeric values to numeric. When I try, I get the error: invalid literal for int() with base 10: 'EC180'. Could someone help me find all the words in my data column and replace them with 00000? I appreciate any help.
All that I tried was:
df['buyer_zip']=df['buyer_zip'].replace('-', 0)
df['buyer_zip']=df['buyer_zip'].replace('', 0)
df["buyer_zip"].fillna( method ='ffill', inplace = True)
But none of it showed me how to find the words in that column and replace them with numbers.
Sample data:
Expected output
0 97219
1 114153528
2 0
3 907031234
Pandas has several different "replace" methods. On a DataFrame or a Series, replace matches and replaces entire values by default. For instance, df['buyer_zip'].replace('-', 0) looks for a column value that is literally the single character "-" and replaces it with the integer 0. That's not what you want for the hyphen. The series also has a .str accessor which holds functions for strings, and its replace works on substrings, which is what you need to strip the hyphen.
Replacing the whole value is what you want, though, when a string starts with a non-digit character. You want that one to be completely replaced with "00000".
Finally, astype is a fast way to convert the whole column to int.
import pandas as pd
df = pd.DataFrame({"buyer_zip":["12345", "123451234", "123456789", "EC180"]})
df["buyer_zip"] = df["buyer_zip"].str.replace("-", "")
df["buyer_zip"] = df["buyer_zip"].replace(r"[^\d].*$", "00000", regex=True)
df["buyer_zip"] = df["buyer_zip"].astype(int)
The operations can be chained: apply each operation to the result of the previous one, and the conversion condenses to a single statement.
df["buyer_zip"] = df["buyer_zip"].str.replace("-", "").replace(r"[^\d].*$", "00000", regex=True).astype(int)
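To see the difference between the two replace methods side by side, here is a minimal sketch. The sample Series is made up for illustration (it is not the asker's data), and the replacement value is the string "0" rather than the integer 0 so that the column keeps a single type.

```python
import pandas as pd

# Made-up sample: a 5+4 zip code, a lone hyphen, and an alphanumeric code.
zips = pd.Series(["12345-1234", "-", "EC180"])

# Series.replace compares whole values: only the element that is exactly "-" changes.
whole_value = zips.replace("-", "0")

# Series.str.replace works inside each string: every "-" substring is removed.
substring = zips.str.replace("-", "", regex=False)

print(whole_value.tolist())
print(substring.tolist())
```

Note that "12345-1234" is untouched by Series.replace, while Series.str.replace strips its hyphen.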
Consider a dataframe in pandas where one of the many columns holds values with TWO decimal points.
How can one get a new float column where the values keep only the first decimal point, stripping the last part of 14.234(.00)?
Desired output should be a new column like:
If the digits after the second period are not always 0s (and not always two), the following code is more robust:
df["col"] = df["col"].str.extract(r"(.+)\.[0-9]+", expand=False).astype(float)
# remove the last 3 characters
df['col'] = df['col'].str[:-3].astype(float)
# get the value before the last "."
df['col'] = df['col'].str.rsplit('.', n=1).str[0].astype(float)
# pattern: zero or more digits \d*, a literal dot \., then one or more digits \d+
df["col"] = df["col"].str.extract(r"(\d*\.\d+)", expand=False).astype(float)
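As a quick check of the "value before the last dot" approach, here is a minimal sketch. The column name col_float is only illustrative, and n=1 is passed as a keyword because recent pandas versions accept only the separator positionally.

```python
import pandas as pd

df = pd.DataFrame({"col": ["13.343.00", "12.345.00", "98.765.00"]})

# Split once from the right on ".", keep the left part, then convert to float.
df["col_float"] = df["col"].str.rsplit(".", n=1).str[0].astype(float)

print(df)
```

Because the split happens only once and from the right, this should also work when the trailing part is not "00" or has a different number of digits.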
You can use:
0 13.343.00
1 12.345.00
2 98.765.00
0 13.343
1 12.345
2 98.765
You can convert it back to float if you like with astype(float).
Note: do not use this if a value consists only of zeros, for example 00.000.00; use the second solution instead.
If the second decimal is not always 0, use:
What I have:
df = pd.DataFrame(data = ["version11.11","version2.2","version3"], columns=["software_version"])
Index software_version
0 version11.11
1 version2.2
2 version3
What I am trying to do:
I want to detect the type of the second-to-last character in the dataframe column software_version and create a new column in the dataframe based on that condition.
If the second-to-last character is a digit or a letter, extract the whole name without the last character, so version11.11 becomes version11.1 and version3 becomes version. If it is a decimal point, extract everything before the decimal point, so version2.2 becomes version2.
Output Should be:
Index software_version main_software
0 version11.11 version11.1
1 version2.2 version2
2 version3 version
What I did so far:
import pandas as pd
df = pd.DataFrame(data = ["version11.11","version2.2","version3"], columns=["software_version"])
for name in df.software_version:
    if name[-2].isalnum():
        pass  # keep everything except the last character
    elif name[-2] == ".":
        pass  # keep everything before the decimal point
    else:
        pass
How can I cleanly add the main_software column shown above?
You can first define a function that makes the necessary changes on the string.
def GetMainSoftware(string):
    new_string = string[:-1]  # first remove the last character
    if new_string.endswith("."):  # if a "." is now last, remove that too
        return new_string[:-1]
    return new_string
And then use apply on the dataframe to create a new column with these specifics.
df["main_software"]=df.apply(lambda row: GetMainSoftware(row["software_version"]),axis=1)
df will now be :
software_version main_software
0 version11.11 version11.1
1 version2.2 version2
2 version3 version
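The same rule can probably also be written without apply, as a single vectorized regex replacement. This is a sketch of an alternative, not the method from the answer above: the pattern \.?.$ removes the last character together with a "." directly before it, if there is one.

```python
import pandas as pd

df = pd.DataFrame({"software_version": ["version11.11", "version2.2", "version3"]})

# Remove the last character and, if present, a "." immediately before it.
df["main_software"] = df["software_version"].str.replace(r"\.?.$", "", regex=True)

print(df)
```

On large frames this tends to be faster than a row-wise apply, since the work stays inside the string accessor.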
I have expressions like the following in one column, say 'old_col', of a pandas dataframe (the top two rows of the column are shown):
abcd_6.9_uuu ghaha_12.8 _sksks
abcd_5.2_uuu ghaha_13.9 _sksks
I was trying to use str.extract on the dataframe to get the two floating-point numbers, but I found two issues. Only the first number is picked up (6.9 from the first row and 5.2 from the second row).
1. How can I get both numbers?
2. How can I make the extract method general enough to pick up numbers with any number of digits (5.7 or 12.9 alike)?
I am using:
df['newcol'] = df['old_col'].str.extract('(_\d.\d)')
To get more than one digit, use \d+ in place of \d (and escape the dot as \.):
0 _6.9
1 _15.9
To get all occurrences, use str.extractall
0 0 _6.9
1 _12.8
1 0 _15.9
1 _13.9
To assign back to df:
# extractall returns a DataFrame; an unnamed capture group ends up in column 0
s = df['col'].str.extractall(r'(_\d+\.\d+)')[0]
df['new_col'] = s.groupby(s.index.get_level_values(0)).agg(list)
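Put together, the extract and extractall steps could look like the sketch below. The sample rows are an assumption reconstructed from the outputs shown above (second row starting with 15.9), and first_match and all_matches are illustrative names.

```python
import pandas as pd

# Assumed sample data, chosen to match the outputs shown in this answer.
df = pd.DataFrame({"col": ["abcd_6.9_uuu ghaha_12.8 _sksks",
                           "abcd_15.9_uuu ghaha_13.9 _sksks"]})

pattern = r"(_\d+\.\d+)"

# extract: only the first match per row (expand=False returns a Series).
first_match = df["col"].str.extract(pattern, expand=False)

# extractall: every match, indexed by (row, match number); group 0 holds the text.
all_matches = df["col"].str.extractall(pattern)[0]

# Collect the matches of each row into a list and align back on the row index.
df["new_col"] = all_matches.groupby(level=0).agg(list)

print(first_match)
print(df["new_col"])
```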
You can use Series.str.findall:
import pandas as pd
df=pd.DataFrame({'old_col':['abcd_6.9_uuu ghaha_12.8 _sksks','abcd_5.2_uuu ghaha_13.9 _sksks']})
df['newcol'] = df['old_col'].str.findall(r'\d+(?:\.\d+)?')
df['newcol_str'] = df['old_col'].str.findall(r'\d+(?:\.\d+)?').str.join(', ')
# >>> df
# old_col newcol newcol_str
# 0 abcd_6.9_uuu ghaha_12.8 _sksks [6.9, 12.8] 6.9, 12.8
# 1 abcd_5.2_uuu ghaha_13.9 _sksks [5.2, 13.9] 5.2, 13.9
Regex details:
\d+(?:\.\d+)? - one or more digits, optionally followed by a . and one or more digits
\d+\.\d+ would match only float values, where the . is obligatory between at least two digits.
Since .str.findall(r'\d+(?:\.\d+)?') returns a list, the newcol column contains lists; with .str.join(', '), the newcol_str column contains strings with the found matches joined.
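The difference between the two patterns shows up as soon as a value has no fractional part. A minimal sketch with the built-in re module, on a made-up variant of the first row where 12.8 is replaced by the integer 12:

```python
import re

# Made-up variant of the sample row: the second number has no fractional part.
text = "abcd_6.9_uuu ghaha_12 _sksks"

# Optional fraction: integers and floats are both found.
optional_fraction = re.findall(r"\d+(?:\.\d+)?", text)

# Obligatory fraction: only the float is found.
required_fraction = re.findall(r"\d+\.\d+", text)

print(optional_fraction)
print(required_fraction)
```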
If you must check that the numbers occur between underscores, add the underscores on both sides of the pattern and wrap the number-matching pattern in parentheses:
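A minimal sketch of what that could look like, assuming the same sample data as above; the column name between_underscores is only illustrative. With one capturing group, findall returns just the captured number, without the underscores.

```python
import pandas as pd

df = pd.DataFrame({"old_col": ["abcd_6.9_uuu ghaha_12.8 _sksks",
                               "abcd_5.2_uuu ghaha_13.9 _sksks"]})

# Underscore, captured number, underscore: only the group is returned.
df["between_underscores"] = df["old_col"].str.findall(r"_(\d+(?:\.\d+)?)_")

print(df["between_underscores"])
```

In this sample the second number of each row is followed by a space rather than an underscore, so only the first number per row qualifies.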
I have a pandas dataframe column value as
I need to trim it from the back, i.e. my resulting value should be AS7878788.
I am doing the below:
newdf=pd.DataFrame(df.COLUMNNAME.str.split('(',1).tolist(),columns = ['col1','col2'])
df['newcol'] = newdf['col2'].str[:10]
On the dataframe column above, this gives the output "12tytyttyt", but my intended output is "AS7878788".
Can someone help please?
Let's try first with a regular string in pure Python:
x = "assdffjhjhjh(12tytyt)bhhh(AS7878788)"
res = x.rsplit('(', 1)[-1][:-1] # 'AS7878788'
Here we split from the right by open bracket (limiting the split count to one for efficiency), extract the last split, and extract every character except the last.
You can then apply this in Pandas via pd.Series.str methods:
df['col'] = df['col'].str.rsplit('(', n=1).str[-1].str[:-1]
Here's a demo:
df = pd.DataFrame({'col': ["assdffjhjhjh(12tytyt)bhhh(AS7878788)"]})
df['col'] = df['col'].str.rsplit('(', n=1).str[-1].str[:-1]
0 AS7878788
Note the solution above is very specific to the string you have presented as an example. For a more flexible alternative, consider using regex.
You can use a regex to find all instances of "values between two brackets" and then pull out the final one. For example, if we have the following data:
df = pd.DataFrame({'col': ['assdffjhjhjh(12tytyt)bhhh(AS7878788)',
and we do:
df['col'] = df['col'].str.findall(r'\(([^()]+)\)').str[-1]
this gets us:
0 AS7878788
1 abjsdfvhg
To explain what the regex is doing, it finds all instances where we have:
\( # an open bracket
([^()]+) # one or more characters that are neither an open bracket nor a close bracket
\) # a close bracket
We can see how this works if we drop the .str[-1] from the end of our previous statement, as df['col'] = df['col'].str.findall(r'\(([^()]+)\)') gives us:
0 [12tytyt, AS7878788]
1 [abjhsgf, abjsdfvhg]
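To compare the split-based and the regex-based approach, here is a small sketch. The second value is a made-up example with text after the last closing bracket; by_split and by_regex are illustrative names.

```python
import pandas as pd

s = pd.Series(["assdffjhjhjh(12tytyt)bhhh(AS7878788)",
               "abc(X1)def(Y2)tail"])  # second value is a made-up example

# Split approach: everything after the last "(", minus the final character.
by_split = s.str.rsplit("(", n=1).str[-1].str[:-1]

# Regex approach: contents of the last pair of brackets.
by_regex = s.str.findall(r"\(([^()]+)\)").str[-1]

print(by_split.tolist())
print(by_regex.tolist())
```

The split approach assumes the string ends with the closing bracket, so the regex version is likely the better fit when trailing text can occur.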