Pandas read_excel percentages as strings - python

My Excel sheet has a column of percentages stored with the percent symbol (e.g. "50%"). How can I coerce pandas.read_excel to read the string "50%" instead of casting it to a float?
Currently the read_excel implementation parses the percentage into the float 0.5. Additionally, if I add a converters={col_with_percentage: str} argument, it parses it into the string '0.5'. Is there a way to read the raw percentage value ("50%")?

You can pass your own function with the converters. Something that rebuilds the string (e.g. 50%) could look like:
Code:
def convert_to_percent_string(value):
    return '{}%'.format(value * 100)
Test Code:
import pandas as pd

df = pd.read_excel('example.xlsx', converters={
    'percents': convert_to_percent_string})
print(df)
Or as a lambda:
df = pd.read_excel('example.xlsx', converters={
    'percents': lambda value: '{}%'.format(value * 100)})
Results:
  percents
0    40.0%
1    50.0%
2    60.0%
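If the trailing .0 in outputs like 40.0% is unwanted, a variant worth sketching uses Python's % format specifier, which multiplies by 100 and appends the sign in one step (this is an alternative to the answer's converter, not the answer's own code):

```python
def convert_to_percent_string(value):
    # The % format spec multiplies by 100 and appends '%';
    # .0 controls the number of decimal places shown.
    return f'{value:.0%}'

print(convert_to_percent_string(0.5))  # → 50%
print(convert_to_percent_string(0.4))  # → 40%
```

The function can be passed to read_excel the same way, e.g. converters={'percents': convert_to_percent_string}.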

You can generate the string after reading:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.random(size=(4, 1)), columns=['col_with_percentage'])
df['col_with_percentage_s'] = (df.col_with_percentage * 100).astype(int).astype(str) + '%'
df
Output:
   col_with_percentage col_with_percentage_s
0   0.5339712650806299                   53%
1   0.9220323933894158                   92%
2  0.11156261877930995                   11%
3  0.18864363985224808                   18%
But a better way is to format on display; you can do it with style in pandas:
df.style.format({'col_with_percentage': "{:.0%}"})
Output:
  col_with_percentage col_with_percentage_s
0                 53%                   53%
1                 92%                   92%
2                 11%                   11%
3                 19%                   18%

I wrote a special conversion because sometimes in Excel these percentages are mixed with true strings or numbers in the same column, sometimes with decimals and sometimes without.
Examples:
"12%", "12 %", "Near 20%", "15.5", "15,5%", "11", "14.05%", "14.05", "0%", "100%", "no result", "100"
I want to keep the percent symbol from true Excel percentage values, keeping decimals, without changing the other values:
import re

df[field] = df[field].apply(
    lambda x: str(round(float(x) * 100, 2)).rstrip('0').rstrip('.') + ' %'
    if re.search(r'^0\.\d+$|^0$|^1$', x) else x)
It works, but one problem remains: if a cell contains a true number between 0 and 1, it becomes a percentage:
"0.3" becomes "30 %"
But this special case only occurs when the Excel file is badly built, revealing a true error. So I just add alerts to manage these special cases.
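A self-contained sketch of that conversion applied to a plain list (df and field are assumed to exist in the original; here the lambda is unrolled into a named function so it can be tested on its own):

```python
import re

def normalize_percent(x):
    # True Excel percentages arrive as fractions between 0 and 1 (or exactly 0/1);
    # rebuild the '%' form, trimming trailing zeros, and leave other values alone.
    if re.search(r'^0\.\d+$|^0$|^1$', x):
        return str(round(float(x) * 100, 2)).rstrip('0').rstrip('.') + ' %'
    return x

values = ['0.12', 'Near 20%', '15.5', '1', 'no result']
print([normalize_percent(v) for v in values])
# → ['12 %', 'Near 20%', '15.5', '100 %', 'no result']
```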

Related

Is it possible to replace percentages with a phrase in a column?

I have a table that I am turning into a graph, so I am making it so that the percentages are just whole numbers. For example: 63.9 would be 64%. I have some percentages that are less than 1%, and I would prefer they not read 0%, but rather say "less than 1%" or something like that. Is this possible? Below is a line of code I tried, which runs, but I received a SettingWithCopyWarning, so the changes are not reflected.
national_df.loc[national_df['percent']==0.152416, 'percent']='less than 1%'
You can use pandas.apply and round. If round(x) == 0, insert 'less than 1%'.
import pandas as pd
# example df
df = pd.DataFrame({'percent' : [0.152416, 1.152416, 10.152416, 63.9]})
# python < 3.8
df['percent'] = df['percent'].apply(lambda x: 'less than 1%' if round(x) == 0 else f'{round(x)}%')
# python >= 3.8
# df['percent'] = df['percent'].apply(lambda x: 'less than 1%' if (res:=round(x)) == 0 else f'{res}%')
print(df)
        percent
0  less than 1%
1            1%
2           10%
3           64%
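Since the question started from np.where, a vectorized variant without apply is also possible. This is a sketch on the same example frame, not the answer's original code:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'percent': [0.152416, 1.152416, 10.152416, 63.9]})
# Round once, then build all labels in a single vectorized pass.
rounded = df['percent'].round().astype(int)
df['percent'] = np.where(rounded == 0, 'less than 1%', rounded.astype(str) + '%')
print(df)
```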

Human-readable or engineering style floats in Pandas?

How do I get human-readable floating point numbers output by Pandas? If I have numbers across several magnitudes of values, I would like to see the output printed in a concise format.
For example, with the code below, I have a table that has small fractional numbers, as well as large numbers with many zeroes. Setting precision to two decimals, the resulting output shows exponential numbers for the small and large magnitude numbers:
import numpy as np
import pandas as pd

np.random.seed(0)
pd.set_option('display.precision', 1)
columns = ['Small', 'Medium', 'Large']
df = pd.DataFrame(np.random.randn(4, 3), columns=columns)
df.Small = df.Small / 1000
df.Medium = df.Medium * 1000
df.Large = df.Large * 1000 * 1000 * 1000
print(df)
Output:
     Small  Medium     Large
0  1.8e-03   400.2   9.8e+08
1  2.2e-03  1867.6  -9.8e+08
2  9.5e-04  -151.4  -1.0e+08
3  4.1e-04   144.0   1.5e+09
Is there a way in Pandas to get this output more human-readable, like engineering format?
I would expect output like the partial table below.
    Small  Medium   Large
0    1.8m   400.2  978.7M
...
Pandas has an engineering style floating point formatter that is not well-documented (only some documentation can be found about pd.set_eng_float_format()), based on matplotlib.ticker.EngFormatter.
This formatter can be used in two ways. The first way is to set the engineering format for all floats, the second way is to use the engineering formatter in the style object.
Set the engineering format for all floats:
np.random.seed(0)
pd.set_option('display.precision', 1)
columns = ['Small', 'Medium', 'Large']
df = pd.DataFrame(np.random.randn(4, 3), columns=columns)
df.Small = df.Small / 1000
df.Medium = df.Medium * 1000
df.Large = df.Large * 1000 * 1000 * 1000
pd.set_eng_float_format(accuracy=1, use_eng_prefix=True)
print(df)
Output:
    Small  Medium    Large
0    1.8m   400.2   978.7M
1    2.2m    1.9k  -977.3M
2  950.1u  -151.4  -103.2M
3  410.6u   144.0     1.5G
The underlying formatter can also be used in style objects, either for all columns or with a formatter dictionary. Note that in the latter example, the Small column gets reduced to 0.0:
eng_fmt = pd.io.formats.format.EngFormatter(accuracy=1, use_eng_prefix=True)
style_all = df.style.format(formatter=eng_fmt)
style_large = df.style.format(formatter={'Large': eng_fmt})
# style_large.to_html()
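For reference, the formatter can also be called directly on single values. Note that pd.io.formats.format is an internal module (its path may change between pandas versions), and the formatter pads positive numbers with a leading space:

```python
import pandas as pd

# Internal API: the pd.io.formats.format path is not publicly documented.
eng_fmt = pd.io.formats.format.EngFormatter(accuracy=1, use_eng_prefix=True)
for value in (0.0018, 400.2, 1.5e9):
    print(eng_fmt(value).strip())  # strip the leading sign-alignment space
```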

Removing comma from values in column (csv file) using Python Pandas

I want to remove commas from a column named size.
CSV looks like below:
number  name   size
1       Car    9,32,123
2       Bike   1,00,000
3       Truck  10,32,111
I want the output as below:
number  name   size
1       Car    932123
2       Bike   100000
3       Truck  1032111
I am using python3 and Pandas module for handling this csv.
I am trying replace method but I don't get the desired output.
Snapshot from my code:
import pandas as pd

df = pd.read_csv("file.csv")
# df.replace(",", "")
# df['size'] = df['size'].replace(to_replace=",", value="")
# df['size'] = df['size'].replace(",", "")
df['size'] = df['size'].replace({",", ""})
print(df['size'])  # expecting to see 'size' column without commas
I don't see any error/exception. The last line print(df['size']) simply displays the values as they are, i.e. with commas.
With replace, we need regex=True because otherwise it looks for an exact full-cell match, i.e. it would only replace cells whose entire value is ",":
>>> df["size"] = df["size"].replace(",", "", regex=True)
>>> df
   number   name     size
0       1    Car   932123
1       2   Bike   100000
2       3  Truck  1032111
I am using python3 and Pandas module for handling this csv
Note that the pandas.read_csv function has an optional argument thousands; if "," is used for denoting thousands you might set thousands=",". Consider the following example:
import io
import pandas as pd
some_csv = io.StringIO('value\n"1"\n"1,000"\n"1,000,000"\n')
df = pd.read_csv(some_csv, thousands=",")
print(df)
output
     value
0        1
1     1000
2  1000000
For brevity I used io.StringIO; the same effect can be achieved by providing the name of a file with the same content as the first argument.
Try with str.replace instead:
df['size'] = df['size'].str.replace(',', '')
Optionally convert to int with astype:
df['size'] = df['size'].str.replace(',', '').astype(int)
   number   name     size
0       1    Car   932123
1       2   Bike   100000
2       3  Truck  1032111
Sample Frame Used:
df = pd.DataFrame({'number': [1, 2, 3], 'name': ['Car', 'Bike', 'Truck'],
                   'size': ['9,32,123', '1,00,000', '10,32,111']})
   number   name       size
0       1    Car   9,32,123
1       2   Bike   1,00,000
2       3  Truck  10,32,111

How to add a new column with multiple string contain conditions in python pandas other than using np.where?

I was trying to add a new column by giving multiple string-contains conditions using str.contains() and np.where(). This way, I can get the final result I want.
But the code is very lengthy. Are there any good ways to reimplement this using pandas functions?
df5['new_column'] = np.where(df5['sr_description'].str.contains('gross to net', case=False).fillna(False),1,
np.where(df5['sr_description'].str.contains('gross up', case=False).fillna(False),1,
np.where(df5['sr_description'].str.contains('net to gross',case=False).fillna(False),1,
np.where(df5['sr_description'].str.contains('gross-to-net',case=False).fillna(False),1,
np.where(df5['sr_description'].str.contains('gross-up',case=False).fillna(False),1,
np.where(df5['sr_description'].str.contains('net-to-gross',case=False).fillna(False),1,
np.where(df5['sr_description'].str.contains('gross 2 net',case=False).fillna(False),1,
np.where(df5['sr_description'].str.contains('net 2 gross',case=False).fillna(False),1,
np.where(df5['sr_description'].str.contains('gross net',case=False).fillna(False),1,
np.where(df5['sr_description'].str.contains('net gross',case=False).fillna(False),1,
np.where(df5['sr_description'].str.contains('memo code',case=False).fillna(False),1,0)))))))))))
The output will be:
if 'sr_description' contains any of those strings, then give new_column a 1, else 0.
Maybe store the multiple string conditions in a list, then read and apply them in a function?
Edit:
Sample Data:
sr_description                 new_column
something with gross up.       1
without those words.           0
or with Net to gross           1
if not then we give a '0'      0
Here is what I came up with.
Code:
import re
import numpy as np
import pandas as pd

# list of the strings we want to check for
check_strs = ['gross to net', 'gross up', 'net to gross', 'gross-to-net', 'gross-up',
              'net-to-gross', 'gross 2 net', 'net 2 gross', 'gross net', 'net gross',
              'memo code']
# From the re.escape() docs: Escape special characters in pattern.
# This is useful if you want to match an arbitrary literal string that may have
# regular expression metacharacters in it.
check_strs_esc = [re.escape(curr_val) for curr_val in check_strs]
# join all the escaped strings as a single regex
check_strs_re = '|'.join(check_strs_esc)
test_col_1 = ['something with gross up.', 'without those words.', np.nan,
              'or with Net to gross', 'if not then we give a "0"']
df_1 = pd.DataFrame(data=test_col_1, columns=['sr_description'])
df_1['contains_str'] = df_1['sr_description'].str.contains(check_strs_re, case=False, na=False)
print(df_1)
Result:
              sr_description  contains_str
0   something with gross up.          True
1       without those words.         False
2                        NaN         False
3       or with Net to gross          True
4  if not then we give a "0"         False
Note that numpy isn't required for the solution to function, I'm just using it to test a NaN value.
Let me know if anything is unclear or your have any questions! :)
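Since the question asked for a 1/0 column rather than True/False, the boolean result of str.contains can be cast with astype(int). A sketch with an abbreviated pattern (the full check_strs_re built above works the same way):

```python
import pandas as pd

df_1 = pd.DataFrame({'sr_description': ['something with gross up.',
                                        'without those words.',
                                        'or with Net to gross']})
pattern = 'gross up|net to gross'  # abbreviated for illustration
# na=False treats missing cells as non-matches; astype(int) maps True/False to 1/0.
df_1['new_column'] = df_1['sr_description'].str.contains(pattern, case=False, na=False).astype(int)
print(df_1)
```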

Printing AND writing the RIGHTLY formatted number

While I am able to print a set number of decimals in the console, I am significantly more challenged when attempting to write rightly formatted numbers to a csv. In the code below I somehow managed to divide the numbers into '000s and have the result thrown into a csv, but I cannot get rid of the extra ".,". The for loop is really a hard challenge. Maybe someone could tell me how to crack the puzzle.
View the code below. The output should have looked like this, both in the console AND in the csv file I am writing to:
23,400,344.567, 54,363,744.678, 56,789,117.456, 4,132,454.987
INPUT:
import pandas as pd

def insert_comma(s):
    str = ''
    count = 0
    for n in reversed(s):
        str += n
        count += 1
        if count == 3:
            str += ','
            count = 0
    return ''.join([i for i in reversed(str)][1:])

d = {'Quarters': ['Quarter1', 'Quarter2', 'Quarter3', 'Quarter4'],
     'Revenue': [23400344.567, 54363744.678, 56789117.456, 4132454.987]}
df = pd.DataFrame(d)
df['Revenue'] = df.apply(lambda x: insert_comma(str(x['Revenue'] / 1000)), axis=1)
# pd.options.display.float_format = '{:.0f}'.format
df.to_csv("C:\\Users\\jcst\\Desktop\\Private\\Python data\\new8.csv", sep=";")
# round to two decimal places in python pandas
# .options.display.float_format = '{:.0f}'.format
print(df)
OUTPUT
   Quarters      Revenue
0  Quarter1  234,00.,344
1  Quarter2  543,63.,744
2  Quarter3  567,89.,117
3  Quarter4    1,32.,454
You can use this. Use a format string to apply the thousands separator and 3 decimal places to all rows in the Revenue column:
df['Revenue']=df['Revenue'].apply('{0:,.3f}'.format)
Result:
   Quarters         Revenue
0  Quarter1  23,400,344.567
1  Quarter2  54,363,744.678
2  Quarter3  56,789,117.456
3  Quarter4   4,132,454.987
Suggestion:
insertCommas = lambda x: format(x, ',')
Works like this:
>>> insertCommas(23400344.567)
'23,400,344.567'
This works for me:
df['Revenue'] = df['Revenue'].apply(lambda x:f"{x:,.3f}")
This solution uses Python 3.6+ f-strings to insert commas as thousand separator and show 3 decimals.
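Once the column holds pre-formatted strings, to_csv writes them verbatim, and with sep=';' (as in the question) the embedded commas need no quoting. A sketch using an in-memory buffer instead of the question's file path:

```python
import io
import pandas as pd

df = pd.DataFrame({'Quarters': ['Quarter1', 'Quarter2'],
                   'Revenue': [23400344.567, 54363744.678]})
# Format first, so the csv receives the comma-separated strings as-is.
df['Revenue'] = df['Revenue'].apply(lambda x: f'{x:,.3f}')
buf = io.StringIO()
df.to_csv(buf, sep=';', index=False)  # ';' separator leaves the commas unambiguous
print(buf.getvalue())
```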
