Accounting formatting in Pandas df - python

import pandas as pd

x = pd.DataFrame([[5.75, 7.32], [1000000, -2]])

def money(val):
    """
    Takes a value and returns properly formatted money
    """
    if val < 0:
        return "$({:>,.0f})".format(abs(val))
    else:
        return "${:>,.0f}".format(abs(val))

x.style.format({0: lambda x: money(x),
                1: lambda x: money(x)
                })
I am trying to format currency in the pandas Jupyter display using Excel's accounting format.
I was most successful with the above code. I also tried a myriad of CSS and HTML approaches, but I am not well versed in those languages, so they didn't really work at all.

Your output looks like you are using the HTML display in the Jupyter notebook, so you will need to set pre for the white-space style, because HTML collapses multiple whitespace, and use a monospace font, e.g.:
styles = {
'font-family': 'monospace',
'white-space': 'pre'
}
x_style = x.style.set_properties(**styles)
Now to format the float, a simple right justified with $ could look like:
x_style.format('${:>10,.0f}')
This isn't quite right because you want to convert the negative number to (2), and you can do this with nested formats, separating out the number formatting from justification so you can add () if negative, e.g.:
x_style.format(lambda f: '${:>10}'.format(('({:,.0f})' if f < 0 else '{:,.0f}').format(f)))
Note: this is fragile in the sense it assumes 10 is sufficient width, vs. excel which dynamically left justifies $ to the maximum width of all the values in that column.
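One way around the fixed width is to format the whole column first and then pad every entry to the longest result. A minimal sketch in plain Python, assuming the values of one column are available as a list; accounting_column is a hypothetical helper, not part of pandas:

```python
def accounting_column(values):
    """Format one column of numbers with '$' on the left and the amount right-justified."""
    bodies = []
    for v in values:
        text = "{:,.0f}".format(abs(v))
        # Negative amounts are shown in parentheses instead of with a minus sign
        bodies.append("({})".format(text) if v < 0 else text)
    # Pad every entry to the widest one in this column
    width = max((len(b) for b in bodies), default=0)
    return ["$" + b.rjust(width) for b in bodies]


for line in accounting_column([7.32, -2, 1000000]):
    print(line)
```

The resulting strings still need the monospace font and the white-space: pre style to line up in the HTML display.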
An alternative way to do this would be to extend string.Formatter to implement the accounting format logic.
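A rough sketch of what such an extension might look like, assuming a made-up format code "A" (preceded by a width) that is not part of Python's format mini-language; AccountingFormatter is a hypothetical class name:

```python
import string


class AccountingFormatter(string.Formatter):
    """Formatter that understands a hypothetical '<width>A' accounting spec."""

    def format_field(self, value, format_spec):
        if format_spec.endswith("A"):
            # Everything before the 'A' is treated as the field width
            width = int(format_spec[:-1] or 0)
            body = "{:,.0f}".format(abs(value))
            if value < 0:
                body = "({})".format(body)
            return "$" + body.rjust(width)
        # Any other spec is handled by the standard implementation
        return super().format_field(value, format_spec)


formatter = AccountingFormatter()
print(formatter.format("{:10A}", 1000000))
print(formatter.format("{:10A}", -2))
```

A bound call such as lambda f: formatter.format("{:10A}", f) could then be passed wherever a callable formatter is accepted.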

Related

Pandas Styler.to_latex() - how to pass commands and do simple editing

How do I pass the following commands into the latex environment?
\centering (I need landscape tables to be centered)
and
\caption* (I need to skip for a panel the table numbering)
In addition, I would need to add parentheses and asterisks to the t-statistics, meaning row-specific formatting on the dataframes.
For example:
Current

variable             value
const                2.439628
t stat               13.921319
FamFirm              0.114914
t stat               0.351283
founder              0.154914
t stat               2.351283
Adjusted R Square    0.291328

I want this

variable             value
const                2.439628
t stat               (13.921319)***
FamFirm              0.114914
t stat               (0.351283)
founder              0.154914
t stat               (1.651283)**
Adjusted R Square    0.291328
I'm doing my research papers in DataSpell. All empirical work is in Python, and then I use Latex (TexiFy) to create the pdf within DataSpell. Due to this workflow, I can't edit tables in latex code while they get overwritten every time I run the jupyter notebook.
In case it helps, here's an example of how I pass a table to the latex environment:
# drop index to column
panel_a.reset_index(inplace=True)
# write Latex index and cut names to appropriate length
ind_list = [
    "ageFirm",
    "meanAgeF",
    "lnAssets",
    "bsVol",
    "roa",
    "fndrCeo",
    "lnQ",
    "sic",
    "hightech",
    "nonFndrFam"
]
# assign the list of values to the column
panel_a["index"] = ind_list
# format column names
header = ["", "count", "mean", "std", "min", "25%", "50%", "75%", "max"]
panel_a.columns = header
with open(
    os.path.join(r"/.../tables/panel_a.tex"), "w"
) as tf:
    tf.write(
        panel_a
        .style
        .format(precision=3)
        .format_index(escape="latex", axis=1)
        .hide(level=0, axis=0)
        .to_latex(
            caption="Panel A: Summary Statistics for the Full Sample",
            label="tab:table_label",
            hrules=True,
        ))
You're asking three questions in one. I think I can do you two out of three (I hear that "ain't bad").
How to pass \centering to the LaTeX env using Styler.to_latex?
Use the position_float parameter. Simplified:
df.style.to_latex(position_float='centering')
How to pass \caption*?
This one I don't know. Perhaps useful: Why is caption not working.
How to apply row-specific formatting?
This one's a little tricky. Let me give an example of how I would normally do this:
df = pd.DataFrame({'a': ['some_var', 't stat'], 'b': [1.01235, 2.01235]})
df.style.format({'a': str,
                 'b': lambda x: "{:.3f}".format(x) if x < 2 else '({:.3f})***'.format(x)})
Result:
You can see from this example that style.format accepts a callable (here nested inside a dict, but you could also do: .format(func, subset='value')). So, this is great if each value itself is evaluated (x < 2).
The problem in your case is that the evaluation is over some other value, namely a (not supplied) P value combined with panel_a['variable'] == 't stat'. Now, assuming you have those P values in a different column, I suggest you create a for loop to populate a list that becomes like this:
fmt_list = ['{:.3f}','({:.3f})***','{:.3f}','({:.3f})','{:.3f}','({:.3f})***','{:.3f}']
Now, we can apply a function to df.style.format, and pop/select from the list like so:
fmt_list = ['{:.3f}','({:.3f})***','{:.3f}','({:.3f})','{:.3f}','({:.3f})***','{:.3f}']
def func(v):
    fmt = fmt_list.pop(0)
    return fmt.format(v)

panel_a.style.format({'variable': str, 'value': func})
Result:
This solution is admittedly a bit "hacky", since modifying a globally declared list inside a function is far from good practice; e.g. if you modify the list again before calling func, its functionality is unlikely to result in the expected behaviour or worse, it may throw an error that is difficult to track down. I'm not sure how to remedy this other than simply turning all the floats into strings in panel_a.value inplace. In that case, of course, you don't need .format anymore, but it will alter your df and that's also not ideal. I guess you could make a copy first (df2 = df.copy()), but that will affect memory.
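One possible way to avoid the shared list is to build the display strings in a single pass, pairing each value with its row label and its significance stars. A minimal sketch in plain Python; format_panel_values and the stars list are hypothetical, and the stars would have to be derived from the P values, which the question does not supply:

```python
def format_panel_values(labels, values, stars):
    """Return display strings; 't stat' rows get parentheses plus their stars."""
    formatted = []
    for label, value, star in zip(labels, values, stars):
        if label == "t stat":
            formatted.append("({:.3f}){}".format(value, star))
        else:
            formatted.append("{:.3f}".format(value))
    return formatted


labels = ["const", "t stat", "FamFirm", "t stat"]
values = [2.439628, 13.921319, 0.114914, 0.351283]
stars = ["", "***", "", ""]  # assumed significance markers, one per row
print(format_panel_values(labels, values, stars))
```

The returned list could then be placed on a copy of the frame, for example with panel_a.assign(value=...), so the original floats stay untouched and no state is shared between calls.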
Anyway, hope this helps. So, in full you add this as follows to your code:
fmt_list = ['{:.3f}','({:.3f})***','{:.3f}','({:.3f})','{:.3f}','({:.3f})***','{:.3f}']
def func(v):
    fmt = fmt_list.pop(0)
    return fmt.format(v)

with open(fname, "w") as tf:
    tf.write(
        panel_a
        .style
        .format({'variable': str, 'value': func})
        ...
        .to_latex(
            ...
            position_float='centering'
        ))

Problem in eliminating the brackets () and post-processing the dataframe using Pandas in Python

I am just a beginner in Python, so kindly excuse this question. I tried a lot to get it done but failed, so I am posting this. I have a data set which looks like:
5.96303e-07 (11.6667 3.21427 -2.20471e-07) (11.8746 -1.75419 -2.37923e-07) (8.66991 -2.84873 5.29442e-07) (2.19427 13.547 1.16203e-05)
9.67139e-07 (11.6171 3.16081 -8.83286e-08) (11.8851 -1.763 -4.38136e-07) (8.68988 -2.85339 1.81039e-07) (1.61058 13.629 4.42662e-07)
1.34613e-06 (11.5562 3.11037 -7.74061e-08) (11.8897 -1.77006 -3.81523e-07) (8.70652 -2.8608 8.00436e-08) (1.47268 13.5569 -2.03173e-06)
1.73261e-06 (11.4961 3.06921 -1.49294e-07) (11.8919 -1.77567 -3.48887e-07) (8.71974 -2.86802 5.2652e-08) (1.59798 13.4556 -2.52073e-06)
2.12563e-06 (11.4423 3.03706 -1.53771e-07) (11.8932 -1.78022 -3.33928e-07) (8.73 -2.87398 4.65075e-08) (1.77817 13.3679 -2.42045e-06)
Now when I access the data frame, for instance with df.iloc[:,1], it gives me (11.6171. When I tried to plot it, I got an error. Thinking that the "(" was causing the problem, I removed it using df.replace('\(','',regex=True).replace('\)','',regex=True). The plot function then seems to work but gives a very weird figure (I am not allowed to post the figure). In addition, when I tried to do some calculations like (df.iloc[:,1])^2, I got an error which says:
TypeError: can't multiply sequence by non-int of type 'str'
I guess the data is not in the correct form. Any comment or suggestion will be a great help. Thanks in advance.
There are two relatively minor issues. Something like the following might be what you're looking for. Maybe.
First, the column you are trying to plot is a string. Essentially it contains letters/symbols. Even when you remove the "(" ")" the "numbers" are still considered a string.
# To convert a "3.14" (string) to a 3.14 (float)
# floats are basically decimals
my_string = "3.14"
my_number = float(my_string)
Additionally, there are multiple "numbers" in the string. So to plot the numbers in that column, I think you would first need to split the string and then convert to numbers.
# Use your code to replace the special characters (replace returns a copy, so assign it back)
df = df.replace(r'\(', '', regex=True).replace(r'\)', '', regex=True)
# new data frame with split value columns
new = df["colname_with_three_numbers"].str.split(" ", n=2, expand=True)
# Make a separate column for each number from the new data frame
df["first_number"] = new[0]
df["second_number"] = new[1]
df["third_number"] = new[2]
# change the type so the column can be plotted; float() does not accept a whole column
df["first_number"] = df["first_number"].astype(float)
df
This is a pretty bad method to solve this, but if the dataset is not too large you can get each element using a for loop and remove the brackets using str.replace(")","").
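Another option is to parse each line before it ever reaches pandas. A minimal sketch, assuming every line holds whitespace-separated numbers with optional parentheses, as in the sample data; parse_line is a hypothetical helper:

```python
def parse_line(line):
    """Drop the parentheses and convert every token on the line to float."""
    cleaned = line.replace("(", " ").replace(")", " ")
    return [float(token) for token in cleaned.split()]


sample = "5.96303e-07 (11.6667 3.21427 -2.20471e-07) (11.8746 -1.75419 -2.37923e-07)"
row = parse_line(sample)
# Squaring works once the values are floats
print(row[1] ** 2)
```

Note that ** is the power operator in Python; ^ is bitwise XOR and does not square a value.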

Biopython gives ValueError: Sequences must all be the same length even though sequences are of the same length

I'm trying to create a phylogenetic tree by making a .phy file from my data.
I have a dataframe
ndf=
ESV trunc
1 esv1 TACGTAGGTG...
2 esv2 TACGGAGGGT...
3 esv3 TACGGGGGG...
7 esv7 TACGTAGGGT...
I checked the length of the elements of the column "trunc":
length_checker = np.vectorize(len)
arr_len = length_checker(ndf['trunc'])
The resulting arr_len gives the same length (=253) for all the elements.
I saved this dataframe as .phy file, which looks like this:
23 253
esv1 TACGTAGGTGGCGAGCGTTATCCGGAATTATTGGGCGTAAAGCGCGCGCAGGCGGTTTCTTAAGTCTGATGTGAAAGCCCACGGCTCAACCGTGGAGGGTCATTGGAAACTGGGAAACTTGAGTGCAGAAGAGGAAAGCGGAATTCCACGTGTAGCGGTGAAATGCGTAGAGATGTGGAGGAACACCAGTGGCGAAGGCGGCTTTCTGGTCTGTAACTGACGCTGAGGCGCGAAAGCGTGGGGAGCAAACAGG
esv2 TACGGAGGGTGCAAGCGTTATCCGGATTCACTGGGTTTAAAGGGTGCGTAGGTGGGTTGGTAAGTCAGTGGTGAAATCTCCGAGCTTAACTTGGAAACTGCCATTGATACTATTAATCTTGAATATTGTGGAGGTTAGCGGAATATGTCATGTAGCGGTGAAATGCTTAGAGATGACATAGAACACCAATTGCGAAGGCAGCTGGCTACACATATATTGACACTGAGGCACGAAAGCGTGGGGATCAAACAGG
esv3 TACGGGGGGGGCAAGCGTTGTTCGGAATTACTGGGCGTAAAGGGCGCGTAGGCGGCCAGACCAAGTCGAGTGTGAAATTGCAGGGCTTAACTTTGCAGGGTCGCTCGATACTGGTCGGCTAGAGTGTGGAAGAGGGTACTGGAATTCCCGGTGTAGCGGTGAAATGCGTAGATATCGGGAGGAACACCAGCGGCGAAGGCGGGTACCTGGGCCAACACTGACGCTGAGGCGCGAAAGCTAGGGGAGCAAACAG
This is similar to the file used in this tutorial.
However, when I run the command
aln = AlignIO.read('msa.phy', 'phylip')
I get "ValueError: Sequences must all be the same length"
I don't know why I'm getting this or how to fix it. Any help is greatly appreciated!
Thanks
Generally, phylip is the fiddliest format in phylogenetics to move between different programs. There is strict phylip format, relaxed phylip format, etc., and it is not easy to know which separator is being used: a space character and/or a carriage return.
I think that you appear to have left a space between the name of the taxon (i.e. the sequence label) and sequence name, viz.
2. esv2
Phylip format is watching for the space between the label and the sequence data. In this example the sequence would be 3bp long. The use of a "." is generally not a great idea as well. The integer doesn't appear to denote a line number.
The other issue is you could/should try keeping the sequence on the same line as the label and remove the carriage return, viz.
esv2 TACGGAGGGTGCAAGCGTTATCCGGATTCACTGGGTTTAAAGGGTGCGTAGGTGGGTTGGTAAGTCAGTGGTGAAATCTCCGAGCTTAACTTGGAAACTGCCATTGATACTATTAATCTTGAATATTGTGGAGGTTAGCGGAATATGTCATGTAGCGGTGAAATGCTTAGAGATGACATAGAACACCAATTGCGAAGGCAGCTGGCTACACATATATTGACACTGAGGCACGAAAGCGTGGGGATCAAACAGG
Sometimes a carriage return does work (this could be relaxed phylip format), but the traditional format uses a space character " ". I always maintained a uniform number of spaces to preserve the alignment, though I'm not sure whether that is needed.
Note that if your taxon name exceeds 10 characters you will need relaxed phylip format, and this format is generally a good idea in any case.
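To see which line a parser is likely to trip over, the file can be checked by hand first. A minimal sketch, assuming a sequential file with one "label sequence" pair per line after the header; check_phylip is a hypothetical helper and is not part of Biopython:

```python
def check_phylip(text):
    """Return a list of human-readable problems found in sequential phylip text."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    n_taxa, n_sites = (int(token) for token in lines[0].split())
    problems = []
    if len(lines) - 1 != n_taxa:
        problems.append("header says {} sequences, found {}".format(n_taxa, len(lines) - 1))
    for ln in lines[1:]:
        parts = ln.split(None, 1)
        name = parts[0]
        # Remove any whitespace inside the sequence before measuring it
        seq = "".join(parts[1].split()) if len(parts) > 1 else ""
        if len(name) > 10:
            problems.append("{}: label longer than 10 characters".format(name))
        if len(seq) != n_sites:
            problems.append("{}: sequence length {} != {}".format(name, len(seq), n_sites))
    return problems


print(check_phylip("2 4\nesv1 ACGT\nesv2 ACG\n"))
```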
The final solution, if all else fails, is to convert to fasta, import as fasta and then convert to phylip. If all this fails, post back; there's more troubleshooting that can be done.
Fasta format removes the "23 253" header, and then each sequence looks like this:
>esv2
TACGGAGGGTGCAAGCGTTATCCGGATTCACTGGGTTTAAAGGGTGCGTAGGTGGGTTGGTAAGTCAGTGGTGAAATCTCCGAGCTTAACTTGGAAACTGCCATTGATACTATTAATCTTGAATATTGTGGAGGTTAGCGGAATATGTCATGTAGCGGTGAAATGCTTAGAGATGACATAGAACACCAATTGCGAAGGCAGCTGGCTACACATATATTGACACTGAGGCACGAAAGCGTGGGGATCAAACAGG
There is always a carriage return between ">esv2" and the sequence. In addition, ">" is always present to prefix the label (taxon name) without any space. You can simply convert via regex ("re" in Python). Using a Perl one-liner it will be s/^([a-z]+[0-9]+)/>$1/g type code. I'm pretty sure there'll be an online website that will do this.
You then simply replace the "phylip" with "fasta" in your import command. Once imported you ask BioPython to convert to whatever format you want and it should not have any problem.
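The conversion can also be done without regular expressions. A minimal sketch, assuming the labels and sequences are already available as pairs (for example from the ESV and trunc columns); to_fasta is a hypothetical helper:

```python
def to_fasta(records):
    """Build FASTA text: a '>label' line followed by the sequence on its own line."""
    return "".join(">{}\n{}\n".format(name, seq) for name, seq in records)


records = [("esv1", "TACGTAGGTG"), ("esv2", "TACGGAGGGT")]
print(to_fasta(records), end="")
```

Writing this text to a file and reading it with the "fasta" format name in the import command should sidestep the phylip spacing rules entirely.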
First, please read the answer to How to make good reproducible pandas examples. In the future, please provide a minimal reproducible example.
Secondly, Michael G is absolutely correct that phylip is a format that is very particular about its syntax.
The code below will allow you to generate a phylogenetic tree from your Pandas dataframe.
First some imports and let's recreate your dataframe.
import pandas as pd
from Bio import Phylo
from Bio.Phylo.TreeConstruction import DistanceCalculator
from Bio.Phylo.TreeConstruction import DistanceTreeConstructor
from Bio import AlignIO
data = {'ESV' : ['esv1', 'esv2', 'esv3'],
'trunc': ['TACGTAGGTGGCGAGCGTTATCCGGAATTATTGGGCGTAAAGCGCGCGCAGGCGGTTTCTTAAGTCTGATGTGAAAGCCCACGGCTCAACCGTGGAGGGTCATTGGAAACTGGGAAACTTGAGTGCAGAAGAGGAAAGCGGAATTCCACGTGTAGCGGTGAAATGCGTAGAGATGTGGAGGAACACCAGTGGCGAAGGCGGCTTTCTGGTCTGTAACTGACGCTGAGGCGCGAAAGCGTGGGGAGCAAACAGG',
'TACGGAGGGTGCAAGCGTTATCCGGATTCACTGGGTTTAAAGGGTGCGTAGGTGGGTTGGTAAGTCAGTGGTGAAATCTCCGAGCTTAACTTGGAAACTGCCATTGATACTATTAATCTTGAATATTGTGGAGGTTAGCGGAATATGTCATGTAGCGGTGAAATGCTTAGAGATGACATAGAACACCAATTGCGAAGGCAGCTGGCTACACATATATTGACACTGAGGCACGAAAGCGTGGGGATCAAACAGG',
'TACGGGGGGGGCAAGCGTTGTTCGGAATTACTGGGCGTAAAGGGCGCGTAGGCGGCCAGACCAAGTCGAGTGTGAAATTGCAGGGCTTAACTTTGCAGGGTCGCTCGATACTGGTCGGCTAGAGTGTGGAAGAGGGTACTGGAATTCCCGGTGTAGCGGTGAAATGCGTAGATATCGGGAGGAACACCAGCGGCGAAGGCGGGTACCTGGGCCAACACTGACGCTGAGGCGCGAAAGCTAGGGGAGCAAACAG']
}
ndf = pd.DataFrame.from_dict(data)
print(ndf)
Output:
ESV trunc
0 esv1 TACGTAGGTGGCGAGCGTTATCCGGAATTATTGGGCGTAAAGCGCG...
1 esv2 TACGGAGGGTGCAAGCGTTATCCGGATTCACTGGGTTTAAAGGGTG...
2 esv3 TACGGGGGGGGCAAGCGTTGTTCGGAATTACTGGGCGTAAAGGGCG...
Next, write the phylip file in the correct format.
with open("test.phy", 'w') as f:
    f.write("{:10} {}\n".format(ndf.shape[0], ndf.trunc.str.len()[0]))
    for row in ndf.iterrows():
        f.write("{:10} {}\n".format(*row[1].to_list()))
Output of test.phy:
3 253
esv1 TACGTAGGTGGCGAGCGTTATCCGGAATTATTGGGCGTAAAGCGCGCGCAGGCGGTTTCTTAAGTCTGATGTGAAAGCCCACGGCTCAACCGTGGAGGGTCATTGGAAACTGGGAAACTTGAGTGCAGAAGAGGAAAGCGGAATTCCACGTGTAGCGGTGAAATGCGTAGAGATGTGGAGGAACACCAGTGGCGAAGGCGGCTTTCTGGTCTGTAACTGACGCTGAGGCGCGAAAGCGTGGGGAGCAAACAGG
esv2 TACGGAGGGTGCAAGCGTTATCCGGATTCACTGGGTTTAAAGGGTGCGTAGGTGGGTTGGTAAGTCAGTGGTGAAATCTCCGAGCTTAACTTGGAAACTGCCATTGATACTATTAATCTTGAATATTGTGGAGGTTAGCGGAATATGTCATGTAGCGGTGAAATGCTTAGAGATGACATAGAACACCAATTGCGAAGGCAGCTGGCTACACATATATTGACACTGAGGCACGAAAGCGTGGGGATCAAACAGG
esv3 TACGGGGGGGGCAAGCGTTGTTCGGAATTACTGGGCGTAAAGGGCGCGTAGGCGGCCAGACCAAGTCGAGTGTGAAATTGCAGGGCTTAACTTTGCAGGGTCGCTCGATACTGGTCGGCTAGAGTGTGGAAGAGGGTACTGGAATTCCCGGTGTAGCGGTGAAATGCGTAGATATCGGGAGGAACACCAGCGGCGAAGGCGGGTACCTGGGCCAACACTGACGCTGAGGCGCGAAAGCTAGGGGAGCAAACAG
Now we can start with the creation of our phylogenetic tree.
# Read the sequences and align
aln = AlignIO.read('test.phy', 'phylip')
print(aln)
Output:
SingleLetterAlphabet() alignment with 3 rows and 253 columns
TACGTAGGTGGCGAGCGTTATCCGGAATTATTGGGCGTAAAGCG...AGG esv1
TACGGAGGGTGCAAGCGTTATCCGGATTCACTGGGTTTAAAGGG...AGG esv2
TACGGGGGGGGCAAGCGTTGTTCGGAATTACTGGGCGTAAAGGG...CAG esv3
Calculate the distance matrix:
calculator = DistanceCalculator('identity')
dm = calculator.get_distance(aln)
print(dm)
Output:
esv1 0
esv2 0.3003952569169961 0
esv3 0.6086956521739131 0.6245059288537549 0
Construct the phylogenetic tree using UPGMA algorithm and draw the tree in ascii
constructor = DistanceTreeConstructor()
tree = constructor.upgma(dm)
Phylo.draw_ascii(tree)
Output:
________________________________________________________________________ esv3
_|
| ___________________________________ esv2
|____________________________________|
|___________________________________ esv1
Or make a nice plot of the tree:
Phylo.draw(tree)
Output:

When using pygal.maps.world is there a way to format the numbers that display a country's population?

I am using pygal to make an interactive map showing world country populations from 2010. I am trying to find a way to display the populations with commas inserted, i.e. as 10,000 rather than simply 10000.
I have already tried using "{:,}".format(x) when reading the numbers into my lists for the different population levels, but it causes an error. I believe this to be because this changes the value to a string.
I also tried inserting a piece of code I found online
wm.value_formatter = lambda x: "{:,}".format(x).
This doesn't cause any errors but doesn't fix how the numbers are formatted either. I am hoping someone might know of a built in function such as:
wm_style = RotateStyle('#336699')
Which is letting me set a color scheme.
Below is the part of my code which plots the map.
wm = World()
wm.force_uri_protocol = "http"
wm_style = RotateStyle('#996699')
wm.value_formatter = lambda x: "{:,}".format(x)
wm.value_formatter = lambda y: "{:,}".format(y)
wm = World(style=wm_style)
wm.title = "Country populations year 2010"
wm.add('0-10 million', cc_pop_low)
wm.add("10m to 1 billion", cc_pop_mid)
wm.add('Over 1 billion', cc_pop_high)
wm.render_to_file('world_population.svg')
Setting the value_formatter property will change the label format, but in your code you recreate the World object after setting the property. This newly created object will have the default value formatter. You can also remove one of the lines setting the value_formatter property as they both achieve the same thing.
Re-ordering the code will fix your problem:
wm_style = RotateStyle('#996699')
wm = World(style=wm_style)
wm.value_formatter = lambda x: "{:,}".format(x)
wm.force_uri_protocol = "http"
wm.title = "Country populations year 2010"
wm.add('0-10 million', cc_pop_low)
wm.add("10m to 1 billion", cc_pop_mid)
wm.add('Over 1 billion', cc_pop_high)
wm.render_to_file('world_population.svg')
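The formatter itself can be checked on its own before attaching it to the chart. A minimal sketch; format_population is a hypothetical name, and it assumes the values are plain integers:

```python
def format_population(value):
    """Insert thousands separators; intended for use as wm.value_formatter."""
    return "{:,}".format(value)


print(format_population(10000))
```

Keeping the numbers as integers in the lists and formatting only at display time should avoid the error seen when the values were converted to strings before being added to the map.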

Apply operation and a division operation in the same step using Python

I am trying to get proportion of nouns in my text using the code below and it is giving me an error. I am using a function that calculates the number of nouns in my text and I have the overall word count in a different column.
pos_family = {
    'noun' : ['NN','NNS','NNP','NNPS']
}
def check_pos_tag(x, flag):
    cnt = 0
    try:
        for tag,value in x.items():
            if tag in pos_family[flag]:
                cnt +=value
    except:
        pass
    return cnt
df2['noun_count'] = df2['PoS_Count'].apply(lambda x: check_pos_tag(x, 'noun')/df2['word_count'])
Note: I have used nltk package to get the counts by PoS tags and I have the counts in a dictionary in PoS_Count column in my dataframe.
If I remove "/df2['word_count']" in the first run and get the noun count and include it again and run, it works fine but if I run it for the first time I get the below error.
ValueError: Wrong number of items passed 100, placement implies 1
Any help is greatly appreciated
Thanks in Advance!
As you have guessed, the problem is in the /df2['word_count'] bit.
df2['word_count'] is a pandas series, but you need to use a float or int here, because you are dividing check_pos_tag(x, 'noun') (which is an int) by it.
A possible solution is to extract the corresponding field from the series and use it in your lambda.
However, it would be easier (and arguably faster) to do each operation alone.
Try this:
df2['noun_count'] = df2['PoS_Count'].apply(lambda x: check_pos_tag(x, 'noun')) / df2['word_count']
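The same "count first, divide second" idea, written as a plain-Python analogue so the two steps are visible. This is only a sketch: the sample dictionaries and word counts are made up, and the lists stand in for the PoS_Count and word_count columns:

```python
pos_family = {"noun": ["NN", "NNS", "NNP", "NNPS"]}


def check_pos_tag(tag_counts, flag):
    """Sum the counts of all tags that belong to the given family."""
    return sum(count for tag, count in tag_counts.items() if tag in pos_family[flag])


# Made-up stand-ins for the PoS_Count and word_count columns
pos_counts = [{"NN": 2, "VB": 1, "NNS": 1}, {"JJ": 3, "NNP": 1}]
word_counts = [4, 4]

# Step 1: one noun count per row
noun_counts = [check_pos_tag(counts, "noun") for counts in pos_counts]
# Step 2: divide each count by the word count of the same row
proportions = [n / w for n, w in zip(noun_counts, word_counts)]
print(proportions)
```

Dividing one pandas Series by another pairs the rows up in the same way, which is why the division belongs outside the lambda.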
