How to convert negative value objects in a pandas DataFrame to float - python

I'd like to create a function which converts these objects to float. I tried to find a solution, but I always get errors:
import pandas as pd

# sample dataframe
d = {'price':['−$13.79', '−$ 13234.79', '$ 132834.79', 'R$ 75.900,00', 'R$ 69.375,12', '- $ 2344.92']}
df = pd.DataFrame(data=d)
I tried this code; at first I just wanted to find any working solution.
df['price'] = (df.price.str.replace("−$", "-").str.replace(r'\w+\$\s+', '')
               .str.replace('.', '').str.replace(',', '').astype(float)) / 100
The idea was to convert −$ to - (for negative values), then $ to ''.
But as a result I get:
ValueError: could not convert string to float: '−$1379'

You can extract the digits on one side, and detect whether there is a minus sign on the other, then combine the two (dividing by 100 restores the two decimal places that stripping \D removes):
import numpy as np

factor = np.where(df['price'].str.match(r'[−-]'), -1, 1) / 100
out = pd.to_numeric(df['price'].str.replace(r'\D', '', regex=True),
                    errors='coerce').mul(factor)
Output:
0 -13.79
1 -13234.79
2 132834.79
3 75900.00
4 69375.12
5 -2344.92
Name: price, dtype: float64
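As a quick sanity check (my addition, not part of the original answer), the intermediate factor is just the detected sign divided by 100:
print(factor)
# [-0.01 -0.01  0.01  0.01  0.01 -0.01]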

Can you use re?
Like this:
import re

# re.sub needs a plain string, not the .str accessor, so apply it per row;
# note the Unicode '−' is stripped too, so the sign of those rows is lost
df['price'] = df['price'].apply(lambda s: float(re.sub(r'[^\-.0-9]', '', s)) / 100)
I'm just removing by regex all the characters that are not 0-9, "." or "-".
BTW, no clue why you divide it by 100...

df["price2"] = pd.to_numeric(df["price"].str.replace("[R$\s\.,]", "")) / 100
df["price3"] = df["price"].str.replace("[R$\s\.,]", "").astype(float) / 100
df
A few notes:
The dot is the regex symbel for everything.
The - symbel you are using is not a minus. Its something else.

df["price2"] = pd.to_numeric(df["price"].str.replace("[R$\s\.,]", "")) / 100
df["price3"] = df["price"].str.replace("[R$\s\.,]", "").astype(float) / 100
df
A few notes:
The dot is the regex symbol for any character, so it has to be escaped (or put in a character class).
The - symbol you are using is not an ASCII minus. It's a different character (U+2212).
I would use something like https://regex101.com for debugging.
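To see the difference for yourself, a quick check (my addition) comparing the code points:
# the sign in the sample data is U+2212 (MINUS SIGN), not the ASCII hyphen-minus
print(hex(ord('−')))  # 0x2212
print(hex(ord('-')))  # 0x2d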

Related

pandas split string with $ special text style

I have an Excel file where the data has two $ signs. When I read it using pandas, they are converted to a very odd text style.
import pandas as pd
df = pd.DataFrame({ 'Bid-Ask':['$185.25 - $186.10','$10.85 - $11.10','$14.70 - $15.10']})
after pd.read_excel
df['Bid'] = df['Bid-Ask'].str.split('−').str[0]
The above code doesn't work: the $ turns my string into a specially styled text, so the split doesn't work.
My expected result:
Do not split. Using str.extract is likely the most robust:
df[['Bid', 'Ask']] = df['Bid-Ask'].str.extract(r'(\d+(?:\.\d+)?)\D*(\d+(?:\.\d+)?)')
Output:
Bid-Ask Bid Ask
0 $185.25 - $186.10 185.25 186.10
1 $10.85 - $11.10 10.85 11.10
2 $14.70 - $15.10 14.70 15.10
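Note that str.extract returns strings; if you need numeric columns, one extra cast (my addition, not part of the original answer) finishes the job:
df[['Bid', 'Ask']] = df[['Bid', 'Ask']].astype(float)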
There is a non-breaking space (\xa0) in your string. That's why the split doesn't work.
I copied the strings (of your df) one by one into an Excel file and then imported it with pd.read_excel.
The column looks like this then:
repr(df['Bid-Ask'])
'0 $185.25\xa0- $186.10\n1 $10.85\xa0- $11.10\n2 $14.70\xa0- $15.10\nName: Bid-Ask, dtype: object'
Before splitting you can replace that and it'll work.
df['Bid-Ask'] = df['Bid-Ask'].astype('str').str.replace('\xa0',' ', regex=False)
df[['Bid', 'Ask']] = df['Bid-Ask'].str.replace('$','', regex=False).str.split('-',expand = True)
print(df)
Bid-Ask Bid Ask
0 $185.25 - $186.10 185.25 186.10
1 $10.85 - $11.10 10.85 11.10
2 $14.70 - $15.10 14.70 15.10
You can use apply with a lambda to split the column values in two and slice off the leading $:
df['Bid'] = df['Bid-Ask'].apply(lambda x: x.split('-')[0].strip()[1:])
df['Ask'] = df['Bid-Ask'].apply(lambda x: x.split('-')[1].strip()[1:])
output:
Bid-Ask Bid Ask
0 $185.25 - $186.10 185.25 186.10
1 $10.85 - $11.10 10.85 11.10
2 $14.70 - $15.10 14.70 15.10

Multi-part manipulation post str.split() Pandas

I have a subset of data (single column) we'll call ID:
ID
0 07-1401469
1 07-89556629
2 07-12187595
3 07-381962
4 07-99999085
The current format is (usually) YY-[up to 8-character ID].
The desired output format is a more uniform YYYY-xxxxxxxx:
ID
0 2007-01401469
1 2007-89556629
2 2007-12187595
3 2007-00381962
4 2007-99999085
Knowing that I've done padding in the past, the thought process was to combine
df['id'].str.split('-').str[0].apply(lambda x: '{0:20>4}'.format(x))
df['id'].str.split('-').str[1].apply(lambda x: '{0:0>8}'.format(x))
However I ran into a few problems:
The fill character ('20' in '{0:20>4}') must be a single character, not a string
Trying something like the below just results in df['id'] taking on the result of the last lambda, and every other way of combining multiple apply/lambda calls just didn't work. I started going down the pad-left/right route, but that seemed to be taking me backwards.
df['id'] = (df['id'].str.split('-').str[0].apply(lambda x: '{0:X>4}'.format(x)).str[1].apply(lambda x: '{0:0>8}'.format(x)))
The current solution I have (which I HATE because it's long, messy, and just not clean IMO) is:
df['idyear'] = df['id'].str.split('-').str[0].apply(lambda x: '{:X>4}'.format(x)) # Split on '-' and pad with X
df['idyear'] = df['idyear'].str.replace('XX', '20') # Replace XX with 20 to conform to YYYY
df['idnum'] = df['id'].str.split('-').str[1].apply(lambda x: '{0:0>8}'.format(x)) # Pad 0s up to 8 digits
df['id'] = df['idyear'].map(str) + "-" + df['idnum'] # Merge idyear and idnum to remake id
del df['idnum'] # delete extra
del df['idyear'] # delete extra
Which does work:
ID
0 2007-01401469
1 2007-89556629
2 2007-12187595
3 2007-00381962
4 2007-99999085
But my questions are:
Is there a way to run multiple apply() functions in a single line so I'm not making temp variables?
Is there a better way than replacing 'XX' with '20'?
I feel like this entire code block can be compressed to one or two lines, I just don't know how. Everything I've seen on SO and in the pandas documentation relates to a single manipulation at a time.
One option is to split, then use str.zfill to pad with '0's. Also prepend '20' (radd concatenates '20' on the left of each value) before splitting, since you seem to need it anyway:
tmp = df['ID'].radd('20').str.split('-')
df['ID'] = tmp.str[0] + '-' + tmp.str[1].str.zfill(8)
Output:
ID
0 2007-01401469
1 2007-89556629
2 2007-12187595
3 2007-00381962
4 2007-99999085
I'd do it in two steps, using .str.replace:
df["ID"] = df["ID"].str.replace(r"^(\d{2})-", r"20\1-", regex=True)
df["ID"] = df["ID"].str.replace(r"-(\d+)", lambda g: f"-{g[1]:0>8}", regex=True)
print(df)
Prints:
ID
0 2007-01401469
1 2007-89556629
2 2007-12187595
3 2007-00381962
4 2007-99999085
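For comparison, the same reformatting can be written as a single apply; a minimal sketch (my addition), assuming every value matches the YY-digits shape:
# prepend '20' to the two-digit year and zero-pad the ID part to 8 digits
df['ID'] = df['ID'].apply(lambda s: f"20{s[:2]}-{s.split('-')[1]:0>8}")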

How to add a new column with multiple string-contains conditions in python pandas, other than using np.where?

I was trying to add a new column driven by multiple string-contains conditions, using str.contains() and np.where(). This way I get the final result I want.
But the code is very lengthy. Is there a better way to implement this using pandas functions?
df5['new_column'] = np.where(df5['sr_description'].str.contains('gross to net', case=False).fillna(False),1,
np.where(df5['sr_description'].str.contains('gross up', case=False).fillna(False),1,
np.where(df5['sr_description'].str.contains('net to gross',case=False).fillna(False),1,
np.where(df5['sr_description'].str.contains('gross-to-net',case=False).fillna(False),1,
np.where(df5['sr_description'].str.contains('gross-up',case=False).fillna(False),1,
np.where(df5['sr_description'].str.contains('net-to-gross',case=False).fillna(False),1,
np.where(df5['sr_description'].str.contains('gross 2 net',case=False).fillna(False),1,
np.where(df5['sr_description'].str.contains('net 2 gross',case=False).fillna(False),1,
np.where(df5['sr_description'].str.contains('gross net',case=False).fillna(False),1,
np.where(df5['sr_description'].str.contains('net gross',case=False).fillna(False),1,
np.where(df5['sr_description'].str.contains('memo code',case=False).fillna(False),1,0)))))))))))
The output should be:
if 'sr_description' contains any of those strings, give new_column a 1, else 0
Maybe store the multiple string conditions in a list, then read and apply them in one function call.
Edit:
Sample Data:
sr_description new_column
something with gross up. 1
without those words. 0
or with Net to gross 1
if not then we give a '0' 0
Here is what I came up with.
Code:
import re
import pandas as pd
import numpy as np
# list of the strings we want to check for
check_strs = ['gross to net', 'gross up', 'net to gross', 'gross-to-net', 'gross-up', 'net-to-gross', 'gross 2 net',
'net 2 gross', 'gross net', 'net gross', 'memo code']
# From the re.escape() docs: Escape special characters in pattern.
# This is useful if you want to match an arbitrary literal string that may have regular expression metacharacters in it.
check_strs_esc = [re.escape(curr_val) for curr_val in check_strs]
# join all the escaped strings as a single regex
check_strs_re = '|'.join(check_strs_esc)
test_col_1 = ['something with gross up.', 'without those words.', np.nan, 'or with Net to gross', 'if not then we give a "0"']
df_1 = pd.DataFrame(data=test_col_1, columns=['sr_description'])
df_1['contains_str'] = df_1['sr_description'].str.contains(check_strs_re, case=False, na=False)
print(df_1)
Result:
sr_description contains_str
0 something with gross up. True
1 without those words. False
2 NaN False
3 or with Net to gross True
4 if not then we give a "0" False
Note that numpy isn't required for the solution to function, I'm just using it to test a NaN value.
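If you need the 1/0 integers from the original question rather than booleans, one extra cast (my addition) does it:
df_1['new_column'] = df_1['contains_str'].astype(int)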
Let me know if anything is unclear or your have any questions! :)

Printing AND writing the correctly formatted number

While I can print numbers with the right number of decimals in the console, I am significantly more challenged when trying to write the correctly formatted number to a CSV. In the code below I somehow managed to divide the numbers into '000s and have the result written to a CSV, but I cannot get rid of the extra ".,". The for loop is the really hard part. Maybe someone could tell me how to crack the puzzle.
The strings below show what it should have looked like, both when printed in the console AND in the CSV file I am writing to:
23,400,344.567, 54,363,744.678, 56,789,117.456, 4,132,454.987
INPUT:
import pandas as pd

def insert_comma(s):
    # walk the string right to left, appending a comma after every third
    # character -- including past the decimal point, which is what produces
    # the stray ".," in the output below
    out = ''
    count = 0
    for n in reversed(s):
        out += n
        count += 1
        if count == 3:
            out += ','
            count = 0
    return ''.join([i for i in reversed(out)][1:])

d = {'Quarters': ['Quarter1', 'Quarter2', 'Quarter3', 'Quarter4'],
     'Revenue': [23400344.567, 54363744.678, 56789117.456, 4132454.987]}
df = pd.DataFrame(d)
df['Revenue'] = df.apply(lambda x: insert_comma(str(x['Revenue'] / 1000)), axis=1)
# pd.options.display.float_format = '{:.0f}'.format
df.to_csv("C:\\Users\\jcst\\Desktop\\Private\\Python data\\new8.csv", sep=";")
# round to two decimal places in python pandas
# pd.options.display.float_format = '{:.0f}'.format
print(df)
OUTPUT
Quarters Revenue
0 Quarter1 234,00.,344
1 Quarter2 543,63.,744
2 Quarter3 567,89.,117
3 Quarter4 1,32.,454
You can use a format string to apply thousands separators and 3 decimal places to all rows in the Revenue column:
df['Revenue']=df['Revenue'].apply('{0:,.3f}'.format)
Result:
Quarters Revenue
0 Quarter1 23,400,344.567
1 Quarter2 54,363,744.678
2 Quarter3 56,789,117.456
3 Quarter4 4,132,454.987
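One design note (my addition): after this apply the column holds strings, so arithmetic on it no longer works. If you still need the numbers, keep the formatted version in a separate column; 'Revenue_display' below is an illustrative name:
df['Revenue_display'] = df['Revenue'].apply('{0:,.3f}'.format)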
Suggestion:
insertCommas = lambda x: format(x, ',')
Works like this:
>>> insertCommas(23400344.567)
'23,400,344.567'
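Applied to the whole column (a usage sketch, my addition):
df['Revenue'] = df['Revenue'].apply(insertCommas)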
This works for me:
df['Revenue'] = df['Revenue'].apply(lambda x:f"{x:,.3f}")
This solution uses Python 3.6+ f-strings to insert commas as thousand separator and show 3 decimals.
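A follow-up on the writing half of the question (my addition): once the values are formatted strings containing commas, write the CSV with a non-comma separator, as the original to_csv call with sep=';' already does, so the separators inside the values don't clash:
# sep=';' keeps the thousands separators inside the values unambiguous
df.to_csv("new8.csv", sep=";")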

python - Replace first five characters in a column with asterisks

I have a column called SSN in a CSV file with values like this
289-31-9165
I need to loop through the values in this column and replace the first five characters so it looks like this
***-**-9165
Here's the code I have so far:
emp_file = "Resources/employee_data1.csv"
emp_pd = pd.read_csv(emp_file)
new_ssn = emp_pd["SSN"].str.replace([:5], "*")
emp_pd["SSN"] = new_ssn
How do I loop through the values and replace just the first five numbers (only) with asterisks, keeping the hyphens as is?
Similar to Mr. Me, this will instead remove everything before the first 6 characters and replace them with your new format.
emp_pd["SSN"] = emp_pd["SSN"].apply(lambda x: "***-**" + x[6:])
You can simply achieve this with the replace() method:
Example dataframe (borrowed from @AkshayNevrekar):
>>> df
ssn
0 111-22-3333
1 121-22-1123
2 345-87-3425
Result:
>>> df.replace(r'^\d{3}-\d{2}', "***-**", regex=True)
ssn
0 ***-**-3333
1 ***-**-1123
2 ***-**-3425
OR
>>> df.ssn.replace(r'^\d{3}-\d{2}', "***-**", regex=True)
0 ***-**-3333
1 ***-**-1123
2 ***-**-3425
Name: ssn, dtype: object
OR:
df['ssn'] = df['ssn'].str.replace(r'^\d{3}-\d{2}', "***-**", regex=True)
Put your asterisks in front, then grab the last 4 digits (note the .str accessor, so the slice applies to each string rather than to the Series):
new_ssn = '***-**-' + emp_pd["SSN"].str[-4:]
You can use regex:
import re

import pandas as pd

df = pd.DataFrame({'ssn': ['111-22-3333', '121-22-1123', '345-87-3425']})

def func(x):
    return re.sub(r'\d{3}-\d{2}', '***-**', x)

df['ssn'] = df['ssn'].apply(func)
print(df)
Output:
ssn
0 ***-**-3333
1 ***-**-1123
2 ***-**-3425
