I can control the number of decimals when printing to the console, but I am struggling to write a correctly formatted number to a CSV. In the code below I managed to divide the figures into '000s and write the result to a CSV, but I cannot get rid of the extra ".," in the output. The for loop is where I get stuck; could someone tell me what I am doing wrong?
The code is below. The output should have looked like this, both when printing to the console AND in the CSV file I am writing to:
23,400,344.567, 54,363,744.678, 56,789,117.456, 4,132,454.987
INPUT:
import pandas as pd

def insert_comma(s):
    str = ''
    count = 0
    for n in reversed(s):
        str += n
        count += 1
        if count == 3:
            str += ','
            count = 0
    return ''.join([i for i in reversed(str)][1:])

d = {'Quarters': ['Quarter1', 'Quarter2', 'Quarter3', 'Quarter4'],
     'Revenue': [23400344.567, 54363744.678, 56789117.456, 4132454.987]}
df = pd.DataFrame(d)
df['Revenue'] = df.apply(lambda x: insert_comma(str(x['Revenue'] / 1000)), axis=1)
# pd.options.display.float_format = '{:.0f}'.format
df.to_csv("C:\\Users\\jcst\\Desktop\\Private\\Python data\\new8.csv", sep=";")
# round to two decimal places in python pandas
# .options.display.float_format = '{:.0f}'.format
print(df)
OUTPUT
Quarters Revenue
0 Quarter1 234,00.,344
1 Quarter2 543,63.,744
2 Quarter3 567,89.,117
3 Quarter4 1,32.,454
You can use a format string to apply a thousands separator and 3 decimal places to all rows in the Revenue column:
df['Revenue']=df['Revenue'].apply('{0:,.3f}'.format)
Result:
Quarters Revenue
0 Quarter1 23,400,344.567
1 Quarter2 54,363,744.678
2 Quarter3 56,789,117.456
3 Quarter4 4,132,454.987
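To get the same strings into the CSV file as well, a minimal sketch of the full pipeline could look like the following; the output filename is a placeholder and sep=";" is kept from the question:
import pandas as pd

d = {'Quarters': ['Quarter1', 'Quarter2', 'Quarter3', 'Quarter4'],
     'Revenue': [23400344.567, 54363744.678, 56789117.456, 4132454.987]}
df = pd.DataFrame(d)

# format as strings with thousands separators and 3 decimals
df['Revenue'] = df['Revenue'].apply('{0:,.3f}'.format)

# the formatted strings are written out verbatim; with sep=";" the commas
# inside the values are not mistaken for column separators
df.to_csv("new8.csv", sep=";", index=False)  # placeholder path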
Suggestion:
insertCommas = lambda x: format(x, ',')
Works like this:
>>> insertCommas(23400344.567)
'23,400,344.567'
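If needed, the same lambda can be applied across the whole column (a small sketch; note that format(x, ',') keeps however many decimals the float already has):
insertCommas = lambda x: format(x, ',')
df['Revenue'] = df['Revenue'].apply(insertCommas)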
This works for me:
df['Revenue'] = df['Revenue'].apply(lambda x:f"{x:,.3f}")
This solution uses Python 3.6+ f-strings to insert commas as the thousands separator and show 3 decimals.
Related
I'd like to create a function which converts these objects to float. I tried to find a solution, but I keep getting errors:
# sample dataframe
d = {'price':['−$13.79', '−$ 13234.79', '$ 132834.79', 'R$ 75.900,00', 'R$ 69.375,12', '- $ 2344.92']}
df = pd.DataFrame(data=d)
I tried this code; at first I just wanted to find something that works.
df['price'] = (df.price.str.replace("−$", "-").str.replace(r'\w+\$\s+', '').str.replace('.', '')\
.str.replace(',', '').astype(float)) / 100
The idea was to convert −$ to - (for negative values), then $ to ''.
But as a result I get:
ValueError: could not convert string to float: '−$1379'
You can extract the digits on one hand, and identify whether there is a minus sign on the other, then combine the two:
import numpy as np  # needed for np.where

# -1 where the string starts with a minus sign (Unicode minus or hyphen), else 1
factor = np.where(df['price'].str.match(r'[−-]'), -1, 1) / 100
out = (pd.to_numeric(df['price'].str.replace(r'\D', '', regex=True), errors='coerce')
         .mul(factor)
       )
output:
0 -13.79
1 -13234.79
2 132834.79
3 75900.00
4 69375.12
5 -2344.92
Name: price, dtype: float64
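To see why the division by 100 is needed: stripping every non-digit character also removes the decimal point, so each value comes back 100 times too large. A quick walk-through with the first value from the question:
import pandas as pd

s = pd.Series(['−$13.79'])
digits = s.str.replace(r'\D', '', regex=True)   # '1379' (the decimal point is stripped too)
print(pd.to_numeric(digits) * (-1 / 100))       # -13.79, with the sign restored by the factor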
Can you use re?
Like this:
import re
df['price'] = df['price'].apply(lambda s: float(re.sub(r'[^\-.0-9]', '', s)) / 100)
I'm just removing by regex all the characters that are not 0-9, "." or "-".
BTW, no clue why you divide it by 100...
df["price2"] = pd.to_numeric(df["price"].str.replace("[R$\s\.,]", "")) / 100
df["price3"] = df["price"].str.replace("[R$\s\.,]", "").astype(float) / 100
df
A few notes:
The dot is the regex symbel for everything.
The - symbel you are using is not a minus. Its something else.
df["price2"] = pd.to_numeric(df["price"].str.replace("[R$\s\.,]", "")) / 100
df["price3"] = df["price"].str.replace("[R$\s\.,]", "").astype(float) / 100
df
A few notes:
The dot is the regex symbel for everything.
The - symbel you are using is not a minus. Its something else.
I would use something like https://regex101.com for debugging.
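Building on the note about the minus sign, a hedged sketch of how the non-ASCII minus (U+2212) could be normalised before converting; the intermediate column name price_num is illustrative:
import pandas as pd

d = {'price': ['−$13.79', '−$ 13234.79', '$ 132834.79', 'R$ 75.900,00', 'R$ 69.375,12', '- $ 2344.92']}
df = pd.DataFrame(data=d)

cleaned = (df['price']
           .str.replace('\u2212', '-', regex=False)    # Unicode minus -> ASCII hyphen
           .str.replace(r'[R$\s.,]', '', regex=True))  # drop currency symbols, spaces, separators
df['price_num'] = pd.to_numeric(cleaned) / 100         # restore the two decimals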
I have the following data, showing race finish times and pace.
As you can see, the data doesn't include the hour for people who finished before the hour mark. In order to do some analysis I need to convert it to a time format, but pandas doesn't recognize the plain MM:SS format. How can I pad '0:' in front of the rows where the hour is missing?
I'm sorry, this is my first time posting.
Assuming your data is in CSV format:
import pandas as pd

# reading in the data file
df = pd.read_csv('data_file.csv')

# replacing spaces with '_' in column names
df.columns = [c.replace(' ', '_') for c in df.columns]

# pad "0:" in front of rows where the hour part is missing
for i, row in df.iterrows():
    val_initial = str(row.Gun_time)
    val_final = val_initial.replace(':', '')
    if len(val_final) < 5:              # only MM:SS present
        val_final = "0:" + val_initial
        df.at[i, 'Gun_time'] = val_final

# saving newly edited csv file
df.to_csv('new_data_file.csv')
Before:
Gun time
0 28:48
1 29:11
2 1:01:51
3 55:01
4 2:08:11
After:
Gun_time
0 0:28:48
1 0:29:11
2 1:01:51
3 0:55:01
4 2:08:11
You can apply the following function to the columns you want to change, and then optionally convert them to timedelta:
df['Gun time'] = df['Gun time'].apply(
    lambda x: '0:' + x if len(x) == 5 else ('0:0' + x if len(x) == 4 else x))
df['Gun time'] = pd.to_timedelta(df['Gun time'])
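A brief check of what this produces, as a sketch using the 'Gun time' values shown in the question:
import pandas as pd

df = pd.DataFrame({'Gun time': ['28:48', '29:11', '1:01:51', '55:01', '2:08:11']})
df['Gun time'] = df['Gun time'].apply(
    lambda x: '0:' + x if len(x) == 5 else ('0:0' + x if len(x) == 4 else x))
df['Gun time'] = pd.to_timedelta(df['Gun time'])
print(df['Gun time'].dtype)  # timedelta64[ns]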
How do I separate two numbers when the length of the combined number is 18? What I exactly want to do is separate the mobile number (10 digits) and the landline number (8 digits) when they are joined together (18 digits).
I have tried to extract the first 8 digits, but I don't know how to apply the condition, and I need to remove the first 8 digits when the condition is satisfied:
df['Landline'] = df['Number'].str[:8]
I have also tried this, but I know it's wrong:
df['Landline'] = df['Number'].apply(lambda x : x.str[:8] if len(x)==18 )
For extracting the first 8 digits, use findall:
df['Number'].str.findall(r'^\d{8}')
Solution using an example
Here we use the Dummy Data made in the following section.
# separate landline and mobile numbers
phone_numbers = df.Numbers.str.findall(r'(^\d{8})*(\d{10})').tolist()
# store in a dict
d = dict((i, {'Landline': e[0][0], 'Mobile': e[0][1]}) for i, e in enumerate(phone_numbers))
# make a temporary dataframe
phone_df = pd.DataFrame(d).T
# update original dataframe
df['Landline'] = phone_df['Landline']
df['Mobile'] = phone_df['Mobile']
print(df)
Output:
Numbers Landline Mobile
0 123456780123456789 12345678 0123456789
1 0123456789 0123456789
Dummy Data
df = pd.DataFrame({'Numbers': ['123456780123456789', '0123456789', ]})
print(df)
Output:
Numbers
0 123456780123456789
1 0123456789
Looks like you need
df['Landline'] = df['Number'].apply(lambda x : x[:8] if len(x)==18 else x)
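One caveat (an assumption about how the data was read in): if the Number column was loaded as integers rather than strings, len(x) will raise a TypeError, so converting to string first may be needed:
df['Number'] = df['Number'].astype(str)
df['Landline'] = df['Number'].apply(lambda x: x[:8] if len(x) == 18 else x)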
I have a column called SSN in a CSV file with values like this
289-31-9165
I need to loop through the values in this column and mask the first five digits so each value looks like this
***-**-9165
Here's the code I have so far:
emp_file = "Resources/employee_data1.csv"
emp_pd = pd.read_csv(emp_file)
new_ssn = emp_pd["SSN"].str.replace([:5], "*")
emp_pd["SSN"] = new_ssn
How do I loop through the values and replace just the first five digits with asterisks, keeping the hyphens as they are?
Similar to Mr. Me's answer, this instead replaces the first 6 characters with the masked prefix.
emp_pd["SSN"] = emp_pd["SSN"].apply(lambda x: "***-**" + x[6:])
You can simply achieve this with replace() method:
Example dataframe :
borrowed from @AkshayNevrekar:
>>> df
ssn
0 111-22-3333
1 121-22-1123
2 345-87-3425
Result:
>>> df.replace(r'^\d{3}-\d{2}', "***-**", regex=True)
ssn
0 ***-**-3333
1 ***-**-1123
2 ***-**-3425
OR
>>> df.ssn.replace(r'^\d{3}-\d{2}', "***-**", regex=True)
0 ***-**-3333
1 ***-**-1123
2 ***-**-3425
Name: ssn, dtype: object
OR:
df['ssn'] = df['ssn'].str.replace(r'^\d{3}-\d{2}', "***-**", regex=True)
Put your asterisks in front, then grab the last 4 digits with the .str accessor:
new_ssn = '***-**-' + emp_pd["SSN"].str[-4:]
You can use a regex:
import re

df = pd.DataFrame({'ssn': ['111-22-3333', '121-22-1123', '345-87-3425']})

def func(x):
    return re.sub(r'\d{3}-\d{2}', '***-**', x)

df['ssn'] = df['ssn'].apply(func)
print(df)
Output:
ssn
0 ***-**-3333
1 ***-**-1123
2 ***-**-3425
My excel sheet has a column of percentages stored with the percent symbol (eg "50%"). How can I coerce pandas.read_excel to read the string "50%" instead of casting it to a float?
Currently the read_excel implementation parses the percentage into the float 0.5. Additionally, if I add a converters={col_with_percentage: str} argument, it parses it into the string '0.5'. Is there a way to read the raw percentage value ("50%")?
You can pass your own function with converters. Something that builds a string (e.g. 50%) could look like this:
Code:
def convert_to_percent_string(value):
    return '{}%'.format(value * 100)
Test Code:
import pandas as pd
df = pd.read_excel('example.xlsx', converters={
'percents': convert_to_percent_string})
print(df)
Or as a lambda:
df = pd.read_excel('example.xlsx', converters={
'percents': lambda value: '{}%'.format(value * 100)})
Results:
percents
0 40.0%
1 50.0%
2 60.0%
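If the trailing .0 in the results above is unwanted, a small variation of the converter (an assumption about the desired output, not something the question asked for) can use the general float format spec:
df = pd.read_excel('example.xlsx', converters={
    'percents': lambda value: '{:g}%'.format(value * 100)})  # 0.5 -> '50%', 0.405 -> '40.5%'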
You can generate the string after reading:
import numpy as np

df = pd.DataFrame(np.random.ranf(size=(4, 1)), columns=['col_with_percentage'])
df['col_with_percentage_s'] = (df.col_with_percentage * 100).astype(int).astype(str) + '%'
df
Output:
col_with_percentage col_with_percentage_s
0 0.5339712650806299 53%
1 0.9220323933894158 92%
2 0.11156261877930995 11%
3 0.18864363985224808 18%
But a better way is to format on display; you can do that with style in pandas:
df.style.format({'col_with_percentage': "{:.0%}"})
Output:
col_with_percentage col_with_percentage_s
0 53% 53%
1 92% 92%
2 11% 11%
3 19% 18%
I wrote a special conversion because in Excel these percentages are sometimes mixed with true strings or numbers in the same columns, sometimes with and sometimes without decimals.
Examples:
"12%", "12 %", "Near 20%", "15.5", "15,5%", "11", "14.05%", "14.05", "0%", "100%", "no result", "100"
I want to keep the percentage symbol for true Excel percentage values, keeping the decimals, without changing the other values:
import re
df[field] = df[field].apply(
    lambda x: str(round(float(x) * 100, 2)).rstrip('0').rstrip('.') + ' %'
    if re.search(r'^0\.\d+$|^0$|^1$', x) else x)
It works, but one problem remains: if a cell contains a true number between 0 and 1, it becomes a percentage:
"0.3" becomes "30 %"
But this is a special case that only occurs when the Excel file is badly built, revealing a real error, so I just add special alerts to handle these cases.
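The answer mentions adding alerts for these ambiguous cells but does not show them; a minimal hedged sketch of such a check could look like the one below (the helper name and reporting format are illustrative assumptions, and df[field] is the same column as above, checked before conversion):
import re

def flag_possible_true_numbers(series):
    # values matching the same pattern the conversion treats as percentages;
    # a true number between 0 and 1 is indistinguishable, so flag it for manual review
    pattern = re.compile(r'^0\.\d+$')
    return [v for v in series.astype(str) if pattern.match(v)]

suspects = flag_possible_true_numbers(df[field])
if suspects:
    print(f"Warning: {len(suspects)} cell(s) may be true numbers rather than percentages: {suspects}")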