I have the following dataframe and am trying to remove the spaces between the digits in the value column, then use pd.to_numeric to change the dtype. The current dtype of value is object.
  periodFrom       value
1 17.11.2020  28 621 240
2 18.11.2020  30 211 234
3 19.11.2020  33 065 243
4 20.11.2020  34 811 330
I have tried multiple variations of this but can't work it out:
df['value'] = df['value'].str.strip()
df['value'] = df['value'].str.replace(',', '').astype(int)
df['value'] = df['value'].astype(str).astype(int)
One option is to apply .str.split() first to split on whitespace (even when a run of whitespace is more than one character long), then concatenate the pieces with ''.join() while converting to integers with int(), such as:
for j, parts in enumerate(df['value'].str.split()):
    df.loc[j, 'value'] = int(''.join(parts))  # .loc avoids chained-assignment warnings
You can do:
df['value'].replace({' ':''}, regex=True)
Or, with the re module:
import re
df['value'].apply(lambda x: re.sub(' ', '', str(x)))
Then append .astype(int) to either.
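Putting the pieces together, here is a minimal end-to-end sketch (the sample frame mirrors the question's data; pd.to_numeric handles the final dtype change the question asks about):

```python
import pandas as pd

# Sample frame mirroring the question's data: values are strings with spaces
df = pd.DataFrame({
    'periodFrom': ['17.11.2020', '18.11.2020'],
    'value': ['28 621 240', '30 211 234'],
})

# Drop every whitespace character, then let pandas parse the numbers
df['value'] = pd.to_numeric(df['value'].str.replace(r'\s+', '', regex=True))
print(df['value'].tolist())  # [28621240, 30211234]
```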
Basically data is coming into my program in this format
0xxxx000xxxx where the x is unique to the data that I have in another system. I'm trying to remove those 0's as they're always in the same place.
I tried
df['item'] = df['item'].str.replace('0','')
but sometimes an x can itself be a 0, and that would remove it as well. I'm not sure how to remove just the 0's in those specific positions.
EX:
Input: 099890000890
Output (Desired): 99890890
Use the str accessor for indexing:
df['item'] = df['item'].str[1:5] + df['item'].str[8:]
Or str.replace:
df['item'] = df['item'].str.replace(r'0(.{4})000(.{4})', r'\1\2', regex=True)
Output (as a new column, item2):
item item2
0 099890000890 99890890
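The slicing approach can be tried end to end on the question's example (a small sketch; the column names follow the question):

```python
import pandas as pd

# Hypothetical single-column frame with the fixed-position padding zeros
df = pd.DataFrame({'item': ['099890000890']})

# Keep positions 1-4 and 8 onward, dropping the zeros at fixed positions
df['item2'] = df['item'].str[1:5] + df['item'].str[8:]
print(df['item2'].tolist())  # ['99890890']
```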
I have a huge dataframe composed of 7 columns.
Extract:
45589 664865.0 100000.0 7.62275 -.494 1.60149 100010
...
57205 718888.0 100000.0 8.218463 -1.405-3 1.75137 100010
...
55143 711827.0 100000.0 8.156107 9.8336-3 1.758051 100010
As these values come from an input file, they are currently all strings, and I would like to convert the whole dataframe to float through:
df= df.astype('float')
However, as you might have noticed in the extract, there are '-' characters hiding. Some represent the sign of a negative number, such as -.494, and others represent a negative power, such as 9.8-3.
I need to replace the latter with 'E-' so Python understands it's a power and can convert the cell to float. Usually, I would use:
df = df.replace('-', 'E-', regex=True)
However, this would also add an E to my negative values. To avoid that, I tried the solution offered here: Replace all a in the middle of string by * using regex
str = 'JAYANTA POKED AGASTYA WITH BAAAAMBOO '
str = re.sub(r'\BA+\B', r'*', str)
However, this is for one specific string. As my dataframe is quite large, I would like to avoid having to go through each cell.
Is there a combination of the replace and re.sub functions I could use to replace only the '-' surrounded by other characters with 'E-'?
Thank you for your help!
You can use a regex lookbehind and lookahead to assert that the hyphen is in the middle of the value, as follows:
df = df.replace(r'\s', '', regex=True) # remove any unwanted spaces
df = df.replace(r'(?<=.)-(?=.)', 'E-', regex=True)
Result:
print(df)
0 1 2 3 4 5 6
0 45589 664865.0 100000.0 7.62275 -.494 1.60149 100010
1 57205 718888.0 100000.0 8.218463 -1.405E-3 1.75137 100010
2 55143 711827.0 100000.0 8.156107 9.8336E-3 1.758051 100010
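Putting the two replaces together on a small sample (the single column 'a' here is my own; the question's frame has seven), the whole frame can then be cast to float in one go:

```python
import pandas as pd

df = pd.DataFrame({'a': ['-.494', '-1.405-3', '9.8336-3']})

df = df.replace(r'\s', '', regex=True)           # remove any unwanted spaces
df = df.replace(r'(?<=.)-(?=.)', 'E-', regex=True)  # only mid-string hyphens become E-
df = df.astype(float)
print(df['a'].tolist())  # [-0.494, -0.001405, 0.0098336]
```

The leading '-' of a negative number is untouched because the lookbehind requires a character before the hyphen.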
Regular expressions can be expensive; perhaps slice the string into the first character and the remaining characters, use replace on the remainder, then recombine with the first character. Haven't benchmarked this though! Something like this (applied with df.str_col.apply(f)):
def f(x):
    first_part = x[0]  # preserve a possible leading sign
    remaining_part = x[1:].replace('-', 'E-')
    return first_part + remaining_part

f('-1.23-4')  # '-1.23E-4'
Or element-wise across the frame (assuming the seven columns are the only columns in your df, otherwise specify the columns):
df = df.applymap(lambda x: x[0] + x[1:].replace('-', 'E-'))
I tried this example and it worked:
import pandas as pd
df = pd.DataFrame({'A': ['-.494', '-1.405-3', '9.8336-3']})
pat = r"(\d)-"
repl = lambda m: f"{m.group(1)}e-"
df['A'] = df['A'].str.replace(pat, repl, regex=True)
df['A'] = pd.to_numeric(df['A'], errors='coerce')
You could use groups, as specified in this thread, to capture the number before your exponent so that:
first: the match only occurs when the minus is preceded by digits;
and second: the match is replaced by E- preceded by the digits captured by the group (for example, 158-3 will be replaced "dynamically" by the value 158 matched in group 1, via the expression \1 (group 1 content), and "statically" by E-).
This gives :
df.replace({r'(\d+)-' : r'\1E-'}, inplace=True, regex=True)
(You can verify it on a regex tester.)
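A quick check of this group-based replace on sample values (a sketch; the column name 'x' is my own):

```python
import pandas as pd

df = pd.DataFrame({'x': ['158-3', '-.494', '-1.405-3']})

# Only a '-' preceded by digits is rewritten; a leading sign is left alone
df.replace({r'(\d+)-': r'\1E-'}, inplace=True, regex=True)
print(df['x'].tolist())  # ['158E-3', '-.494', '-1.405E-3']
```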
Consider a dataframe in Pandas, where one of the many columns has data containing two decimal points.
Like
13.343.00
12.345.00
98.765.00
How can one get a new column (float) where values keep only the first decimal portion, stripping the trailing part of 14.234(.00)?
Desired output should be a new column like
13.343
12.345
98.765
If the digits after the second period are not always 0s (and not always two), the following code is more robust:
df["col"] = df["col"].str.extract(r"(.+)\.[0-9]+").astype(float)
Use:
#remove last 3 values
df['col'] = df['col'].str[:-3].astype(float)
Or:
#get values before last .
df['col'] = df['col'].str.rsplit('.', 1).str[0].astype(float)
Or:
#one or zero integer \d* \. and integer \d+ pattern
df["col"] = df["col"].str.extract(r"(\d*\.\d+)").astype(float)
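The extract pattern can be sanity-checked on the sample values (str.extract keeps the first match of the capture group per cell):

```python
import pandas as pd

df = pd.DataFrame({'col': ['13.343.00', '12.345.00', '98.765.00']})

# First match of "digits . digits" is the leading number, so the trailing .00 is dropped
out = df['col'].str.extract(r'(\d*\.\d+)').astype(float)
print(out[0].tolist())  # [13.343, 12.345, 98.765]
```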
You can use:
print(df)
col
0 13.343.00
1 12.345.00
2 98.765.00
df.col=df.col.str.rstrip('.00')
print(df)
col
0 13.343
1 12.345
2 98.765
You can convert it back to float if you like by astype(float)
Note: rstrip strips a set of characters, not a literal suffix, so you should not use this if the number itself ends in 0s (for example 00.000.00); use the second solution instead.
If the second decimal is not always 0 use:
df.col.str.rsplit(".", n=1).str[0]
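Combining the robust right-split with the float conversion, a short sketch (the middle value is deliberately not .00 to show it still works):

```python
import pandas as pd

df = pd.DataFrame({'col': ['13.343.00', '12.345.27', '98.765.00']})

# Split once from the right on '.', keep everything before the last dot
df['col'] = df['col'].str.rsplit('.', n=1).str[0].astype(float)
print(df['col'].tolist())  # [13.343, 12.345, 98.765]
```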
I have a pandas dataframe column value as
"assdffjhjhjh(12tytyttyt)bhhh(AS7878788)"
I need to trim it from the back, i.e. my resultant value should be AS7878788.
I am doing the below:
newdf=pd.DataFrame(df.COLUMNNAME.str.split('(',1).tolist(),columns = ['col1','col2'])
df['newcol'] = newdf['col2'].str[:10]
In the above dataframe column this gives the output "12tytyttyt"; however, my intended output is "AS7878788".
Can someone help please?
Let's try first with a regular string in pure Python:
x = "assdffjhjhjh(12tytyt)bhhh(AS7878788)"
res = x.rsplit('(', 1)[-1][:-1] # 'AS7878788'
Here we split from the right by open bracket (limiting the split count to one for efficiency), extract the last split, and extract every character except the last.
You can then apply this in Pandas via pd.Series.str methods:
df['col'] = df['col'].str.rsplit('(', n=1).str[-1].str[:-1]
Here's a demo:
df = pd.DataFrame({'col': ["assdffjhjhjh(12tytyt)bhhh(AS7878788)"]})
df['col'] = df['col'].str.rsplit('(', n=1).str[-1].str[:-1]
print(df)
col
0 AS7878788
Note the solution above is very specific to the string you have presented as an example. For a more flexible alternative, consider using regex.
You can use a regex to find all instances of "values between two brackets" and then pull out the final one. For example, if we have the following data:
df = pd.DataFrame({'col': ['assdffjhjhjh(12tytyt)bhhh(AS7878788)',
'asjhgdv(abjhsgf)(abjsdfvhg)afdsgf']})
and we do:
df['col'] = df['col'].str.findall(r'\(([^\(^\)]+)\)').str[-1]
this gets us:
col
0 AS7878788
1 abjsdfvhg
To explain what the regex is doing, it is trying to find all instances where we have:
\( # an open bracket
([^\(^\)]+) # anything that isn't an open bracket or a close bracket for one or more characters
\) # a close bracket
We can see how this is working if we take the .str[-1] from the end of our previous statement, as df['col'] = df['col'].str.findall(r'\(([^\(^\)]+)\)') gives us:
col
0 [12tytyt, AS7878788]
1 [abjhsgf, abjsdfvhg]
I have a data frame with a string column. I need to create a new column with the 3rd element after col1.split(' '). I tried
df['col1'].str.split(' ')[0]
but all I get is an error.
Actually I need to turn col1 into multiple columns after splitting by " ".
What is the correct way to do this?
Consider this df
df = pd.DataFrame({'col': ['Lets say 2000 is greater than 5']})
col
0 Lets say 2000 is greater than 5
You can split and use str accessor to get elements at different positions
df['third'] = df.col.str.split(' ').str[2]
df['fifth'] = df.col.str.split(' ').str[4]
df['last'] = df.col.str.split(' ').str[-1]
col third fifth last
0 Lets say 2000 is greater than 5 2000 greater 5
Another way is:
df["third"] = df['col1'].apply(lambda x: x.split()[2])
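For the second part of the question, turning col1 into multiple columns in one step, str.split supports expand=True; a sketch (the resulting column labels 0, 1, 2, ... are pandas defaults, and the 'third' name is my own):

```python
import pandas as pd

df = pd.DataFrame({'col1': ['Lets say 2000 is greater than 5']})

# expand=True returns one column per token instead of a column of lists
parts = df['col1'].str.split(' ', expand=True)
df['third'] = parts[2]
print(df['third'].tolist())  # ['2000']
```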