Human-readable or engineering-style floats in Pandas?

How do I get human-readable floating point numbers output by Pandas? If I have numbers spanning several orders of magnitude, I would like to see the output printed in a concise format.
For example, with the code below, I have a table that has small fractional numbers as well as large numbers with many zeroes. Setting the display precision to one decimal, the resulting output shows the small- and large-magnitude numbers in exponential notation:
import numpy as np
import pandas as pd
np.random.seed(0)
pd.set_option('display.precision', 1)
columns = ['Small', 'Medium', 'Large']
df = pd.DataFrame(np.random.randn(4, 3), columns=columns)
df.Small = df.Small / 1000
df.Medium = df.Medium * 1000
df.Large = df.Large * 1000 * 1000 * 1000
print(df)
Output:
     Small  Medium     Large
0  1.8e-03   400.2   9.8e+08
1  2.2e-03  1867.6  -9.8e+08
2  9.5e-04  -151.4  -1.0e+08
3  4.1e-04   144.0   1.5e+09
Is there a way in Pandas to get this output more human-readable, like engineering format?
I would expect output like the partial table below.
Small Medium Large
0 1.8m 400.2 978.7M
...

Pandas has an engineering-style floating point formatter that is not well documented (the only documentation that can be found is for pd.set_eng_float_format()); it is based on matplotlib.ticker.EngFormatter.
This formatter can be used in two ways: either set the engineering format for all floats, or use the engineering formatter in a style object.
Set the engineering format for all floats:
np.random.seed(0)
pd.set_option('display.precision', 1)
columns = ['Small', 'Medium', 'Large']
df = pd.DataFrame(np.random.randn(4, 3), columns=columns)
df.Small = df.Small / 1000
df.Medium = df.Medium * 1000
df.Large = df.Large * 1000 * 1000 * 1000
pd.set_eng_float_format(accuracy=1, use_eng_prefix=True)
print(df)
Output:
    Small  Medium   Large
0    1.8m   400.2  978.7M
1    2.2m    1.9k -977.3M
2  950.1u  -151.4 -103.2M
3  410.6u   144.0    1.5G
The underlying formatter can also be used in style objects, either for all columns or with a formatter dictionary. Note that in the latter example, the Small column gets reduced to 0.0:
eng_fmt = pd.io.formats.format.EngFormatter(accuracy=1, use_eng_prefix=True)
style_all = df.style.format(formatter=eng_fmt)
style_large = df.style.format(formatter={'Large': eng_fmt})
# style_large.to_html()
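If the engineering format should only apply temporarily, the default display can be restored afterwards by resetting the underlying display option; a minimal sketch, assuming (as the pandas source suggests) that pd.set_eng_float_format works by setting display.float_format:
# Revert to the default float display after using the engineering format
pd.reset_option('display.float_format')
print(df)  # back to the exponential/plain representation from the question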

Related

How to slice on DateTime objects more efficiently and compute a given statistic at each iteration?

I am dealing with a pandas dataframe where the index is a DateTime object and the columns represent minute-by-minute returns on several stocks from the SP500 index, together with a column of returns from the index. It's fairly long (100 stocks, 1510 trading days, minute-by-minute data each day) and looks like this (only three stocks for the sake of example):
DateTime          SPY    AAPL   AMZN      T
2014-01-02 9:30   0.032  -0.01  0.164  0.007
2014-01-02 9:31  -0.012   0.02  0.001 -0.004
2014-01-02 9:32  -0.015  0.031  0.004 -0.001
I am trying to compute the betas of each stock for each different day and for each 30-minute window. The beta of a stock in this case is defined as the covariance between its returns and the SPY returns divided by the variance of SPY in the same period. My desired output is a 3-dimensional numpy array beta_HF where beta_HF[s, i, j], for instance, means the beta of stock s at day i at window j. At this moment, I am computing the betas in the following way (let returns be full dataframe):
trading_days = pd.unique(returns.index.date)
window = "30min"
moments = pd.date_range(start = "9:30", end = "16:00", freq = window).time
def dispersion(trading_days, moments, df, verbose=True):
    index = 'SPY'
    beta_HF = np.zeros((df.shape[1] - 1, len(trading_days), len(moments) - 1))
    for i, day in enumerate(trading_days):
        daily_data = df[df.index.date == day]
        start_time = dt.time(9, 30)
        for j, end_time in enumerate(moments[1:]):
            moment_data = daily_data.between_time(start_time, end_time)
            covariances = np.array([moment_data[index].cov(moment_data[symbol]) for symbol in df])
            beta_HF[:, i, j] = covariances[1:] / covariances[0]
        if verbose == True:
            if np.remainder(i, 100) == 0:
                print("Current Trading Day: {}".format(day))
    return(beta_HF)
The dispersion() function generates the correct output. However, I understand that I am looping over long iterables and this is not very efficient. I seek a more efficient way to "slice" the dataframe at each 30-minute window for each day in the sample and compute the covariances. Effectively, for each slice, I need to compute 101 numbers (100 covariances + 1 variance). On my local machine (a 2013 Retina i5 Macbook Pro) it's taking around 8 minutes to compute everything. I tested it on a research server of my university and the computing time was basically the same, which probably implies that computing power is not the bottleneck but my code has low quality in this part. I would appreciate any ideas on how to make this faster.
One might point out that parallelization is the way to go here since the elements in beta_HF never interact with each other. So this seems to be easy to parallelize. However, I have never implemented anything with parallelization so I am very new to these concepts. Any ideas on how to make the code run faster? Thanks a lot!
You can use pandas Grouper in order to group your data by frequency. The only drawbacks are that you cannot have overlapping windows and that it will iterate over times at which no data exists.
The first issue basically means that the window will slide from 9:30-9:59 to 10:00-10:29 instead of 9:30-10:00 to 10:00-10:30.
The second issue comes into play during holidays and overnight, when no trading takes place. Hence, if you have a large period without trading, you might want to split the DataFrame and combine the pieces afterwards.
Create example data
import pandas as pd
import numpy as np
time = pd.date_range(start="2014-01-02 09:30",
end="2014-01-02 16:00", freq="min")
time = time.append( pd.date_range(start="2014-01-03 09:30",
end="2014-01-03 16:00", freq="min") )
df = pd.DataFrame(data=np.random.rand(time.shape[0], 4)-0.5,
index=time, columns=['SPY','AAPL','AMZN','T'])
Define the range you want to use
freq = '30min'
obs_per_day = len(pd.date_range(start = "9:30", end = "16:00", freq = "30min"))
trading_days = len(pd.unique(df.index.date))
Make a function to calculate the beta values
def beta(df):
    if df.empty:  # return NaN when no trading takes place
        return np.nan
    mat = df.to_numpy()  # numpy is faster than pandas
    m = mat.mean(axis=0)
    mat = mat - m[np.newaxis, :]  # demean
    dof = mat.shape[0] - 1  # degrees of freedom
    if dof != 0:  # check whether the data has more than one observation
        mat = mat.T.dot(mat[:, 0]) / dof  # covariance with the first column (SPY)
        return mat[1:] / mat[0]  # beta
    else:
        return np.zeros(mat.shape[1] - 1)  # return zeros for too-short groups, e.g. the 16:00 sample
And in the end use df.groupby().apply()
res = df.groupby(pd.Grouper(freq=freq)).apply(beta)
res = np.array( [k for k in res.values if ~np.isnan(k).any()] ) # remove NaN
res = res.reshape([trading_days, obs_per_day, df.shape[1]-1])
Note that the result is in a slightly different shape than yours.
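If the beta_HF[s, i, j] axis order from the question (stock, day, window) is preferred, the reshaped array can simply be transposed; a small sketch based on the shapes above (the number of windows per day may still differ, as noted):
# res has shape (n_days, windows_per_day, n_stocks - 1), i.e. (day, window, stock);
# move the stock axis to the front to mimic beta_HF[s, i, j] from the question
beta_HF_like = res.transpose(2, 0, 1)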
The results also differ a bit because of the different window sliding. To check whether the results are the same, simply try something like this:
trading_days = pd.unique(df.index.date)
# Your result
moments1 = pd.date_range(start = "9:30", end = "10:00", freq = "30min").time
beta(df[df.index.date == trading_days[0]].between_time(moments1[0], moments1[1]))
# mine
moments2 = pd.date_range(start = "9:30", end = "10:00", freq = "29min").time
beta(df[df.index.date == trading_days[0]].between_time(moments2[0], moments2[1]))

pandas df - calculate percentage difference, not change

I want to calculate the percentage difference between two values, not the percentage change or just the plain difference.
my df:
Radisson Collection
6
Total awareness 0.440553
Very/Somewhat familiar 0.462577
Consideration 0.494652
Ever used 0.484620
Expected output:
Radisson Collection
6
Total awareness none
Very/Somewhat familiar 4.87726%
Consideration 6.70163%
Ever used 2.04886%
The calculation would be:
Percentage difference of 0.440553 and 0.462577 = |0.440553 - 0.462577| / ((0.440553 + 0.462577) / 2) = 0.022024 / 0.451565 = 0.048772601950993 = 4.8772601950993%
Divide the absolute difference (diff followed by abs) by the rolling mean of each consecutive pair:
s = df['Radisson Collection'].rolling(2).mean()
df['new'] = df['Radisson Collection'].diff().abs().div(s) * 100
print (df)
Radisson Collection new
Total awareness 0.440553 NaN
Very/Somewhat familiar 0.462577 4.877260
Consideration 0.494652 6.701636
Ever used 0.484620 2.048869
If you need the values as percentage strings:
df['new'] = ((df['Radisson Collection'].diff().abs().div(s) * 100)
             .iloc[1:].round(5).astype(str) + '%')
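To reproduce the expected output exactly, where the first row shows none instead of NaN, the missing first value can be filled afterwards (a small sketch, assuming the literal string 'none' is wanted):
# the first row has no previous value, so its diff is NaN; show 'none' instead
df['new'] = df['new'].fillna('none')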

Pandas read_excel percentages as strings

My Excel sheet has a column of percentages stored with the percent symbol (e.g. "50%"). How can I coerce pandas.read_excel to read the string "50%" instead of casting it to a float?
Currently the read_excel implementation parses the percentage into the float 0.5. Additionally, if I add a converters={col_with_percentage: str} argument, it parses it into the string '0.5'. Is there a way to read the raw percentage value ("50%")?
You can pass your own function with the converters. Something to make a string (eg: 50%) could look like:
Code:
def convert_to_percent_string(value):
    return '{}%'.format(value * 100)
Test Code:
import pandas as pd
df = pd.read_excel('example.xlsx', converters={
'percents': convert_to_percent_string})
print(df)
Or as a lambda:
df = pd.read_excel('example.xlsx', converters={
'percents': lambda value: '{}%'.format(value * 100)})
Results:
percents
0 40.0%
1 50.0%
2 60.0%
You can generate the string after reading
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.ranf(size=(4, 1)), columns=['col_with_percentage'])
df['col_with_percentage_s'] = (df.col_with_percentage * 100).astype(int).astype(str) + '%'
df
Output:
col_with_percentage col_with_percentage_s
0 0.5339712650806299 53%
1 0.9220323933894158 92%
2 0.11156261877930995 11%
3 0.18864363985224808 18%
But a better way is to format on display; you can do it with style in pandas
df.style.format({'col_with_percentage': "{:.0%}"})
Output:
col_with_percentage col_with_percentage_s
0 53% 53%
1 92% 92%
2 11% 11%
3 19% 18%
I wrote a special conversion because in Excel these percentages are sometimes mixed with true strings or numbers in the same column, sometimes with and sometimes without decimals.
Examples:
"12%", "12 %", "Near 20%", "15.5", "15,5%", "11", "14.05%", "14.05", "0%", "100%", "no result", "100"
And I want to keep the percentage symbol of the true Excel percentage values, keeping their decimals, without changing the other values:
import re
df[field] = df[field].apply(
    lambda x: str(round(float(x) * 100, 2)).rstrip('0').rstrip('.') + ' %'
    if re.search(r'^0\.\d+$|^0$|^1$', x) else x)
It works, but one problem remains: if a cell contains a true number between 0 and 1, it becomes a percentage:
"0.3" becomes "30%"
But this is a special case that only occurs when the Excel file is wrongly built, revealing a real error. So I just add special alerts to manage these cases.
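For illustration, here is a small, self-contained run of that conversion on a hypothetical column that was already read as strings (the values are made up to cover the cases listed above):
import re
import pandas as pd

# Hypothetical mixed column: Excel percentages arrive as 0-1 floats cast to str,
# alongside free text and plain numbers
s = pd.Series(['0.12', 'Near 20%', '15.5', '0.1405', '0', '1', 'no result', '100'])
converted = s.apply(
    lambda x: str(round(float(x) * 100, 2)).rstrip('0').rstrip('.') + ' %'
    if re.search(r'^0\.\d+$|^0$|^1$', x) else x)
print(converted.tolist())
# ['12 %', 'Near 20%', '15.5', '14.05 %', '0 %', '100 %', 'no result', '100']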

How to Import multiple CSV files then make a Master Table?

I am a research chemist and have carried out a measurement where I record 'signal intensity' vs 'mass-to-charge (m/z)' . I have repeated this experiment 15x, by changing a specific parameter (Collision Energy). As a result, I have 15 CSV files and would like to align/join them within the same range of m/z values and same interval values. Due to the instrument thresholding rules, certain m/z values were not recorded, thus I have files that cannot simply be exported into excel and copy/pasted. The data looks a bit like the tables posted below
Dataset 1:  x  | y      Dataset 2:  x  | y
           --------                --------
           0.0 | 5                 0.0 | 2
           0.5 | 3                 0.5 | 6
           2.0 | 7                 1.0 | 9
           3.0 | 1                 2.5 | 1
                                   3.0 | 4
Using matlab I started with this code:
%% Create a table for the set m/z range with an interval of 0.1 Da
mzrange = 50:0.1:620;
mzrange = mzrange';
mzrange = array2table(mzrange,'VariableNames',{'XThompsons'});
Then I manually imported 1 X/Y CSV (Xtitle=XThompson, Ytitle=YCounts) to align with the specified m/z range.
%% Join/merge the two tables using a common Key variable 'XThompson' (m/z value)
mzspectrum = outerjoin(mzrange,ReserpineCE00,'MergeKeys',true);
% Replace all NaN values with zero
mzspectrum.YCounts(isnan(mzspectrum.YCounts)) = 0;
At this point I am stuck because repeating this process with a separate file will overwrite my YCounts column. The title of the YCounts column doesn't matter to me as I can change it later; however, I would like the table to continue as such:
XThompson | YCounts_1 | YCounts_2 | YCounts_3 | etc...
--------------------------------------------------------
How can I carry this out in MATLAB so that it is at least semi-automated? I posted earlier describing a similar scenario, but it turned out that the suggested approach could not do what I need. I must admit that my mind is not that of a programmer, so I have been struggling with this problem quite a bit.
PS- Is this problem best executed in Matlab or Python?
I don't know or use MATLAB, so my answer is purely Python-based. I think Python and MATLAB should be equally well suited to reading CSV files and generating a master table.
Please consider this answer more as a pointer to how to address the problem in Python.
In Python one would typically address this problem using the pandas package. This package provides "high-performance, easy-to-use data structures and data analysis tools" and can natively read a large set of file formats, including CSV files. A master table from two CSV files "foo.csv" and "bar.csv" could be generated e.g. as follows:
import pandas as pd
df = pd.read_csv('foo.csv')
df2 = pd.read_csv('bar.csv')
master_table = pd.concat([df, df2])
Pandas further allows you to group and structure the data in many ways. The pandas documentation has very good descriptions of its various features.
One can install pandas with the python package installer pip:
sudo pip install pandas
if on Linux or OSX.
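If the goal is a master table keyed on the m/z column rather than a row-wise stack, an outer merge in pandas is the closer analogue of MATLAB's outerjoin. A minimal sketch, assuming each CSV has columns named XThompson and YCounts as in the MATLAB snippet above (the filenames are hypothetical):
from functools import reduce
import pandas as pd

files = ['ReserpineCE00.csv', 'ReserpineCE05.csv', 'ReserpineCE10.csv']  # hypothetical names
frames = []
for i, fname in enumerate(files, start=1):
    frame = pd.read_csv(fname)
    # keep the key column name identical, give each intensity column its own suffix
    frames.append(frame.rename(columns={'YCounts': 'YCounts_{}'.format(i)}))

# outer-merge all frames on the shared m/z key, like outerjoin with MergeKeys
master_table = reduce(lambda left, right: pd.merge(left, right, on='XThompson', how='outer'), frames)
master_table = master_table.sort_values('XThompson').fillna(0)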
The counts from the different analyses should be named differently, i.e., YCounts_1, YCounts_2, and YCounts_3 from analyses 1, 2, and 3, respectively, in the different datasets before joining them. However, the M/Z name (i.e., XThompson) should be the same since this is the key that will be used to join the datasets. The code below is for MATLAB.
This step is not needed (it just recreates your tables); I copied dataset2 to create dataset3 for illustration. You could use readtable to import your data, i.e., imported_data = readtable('filename');
dataset1 = table([0.0; 0.5; 2.0; 3.0], [5; 3; 7; 1], 'VariableNames', {'XThompson', 'YCounts_1'});
dataset2 = table([0.0; 0.5; 1.0; 2.5; 3.0], [2; 6; 9; 1; 4], 'VariableNames', {'XThompson', 'YCounts_2'});
dataset3 = table([0.0; 0.5; 1.0; 2.5; 3.0], [2; 6; 9; 1; 4], 'VariableNames', {'XThompson', 'YCounts_3'});
Merge the tables using outerjoin. You could use a loop if you have many datasets.
combined_dataset = outerjoin(dataset1,dataset2, 'MergeKeys', true);
Add dataset3 to the combined_dataset
combined_dataset = outerjoin(combined_dataset,dataset3, 'MergeKeys', true);
You could export the combined data as an Excel sheet by using writetable:
writetable(combined_dataset, 'joined_icp_ms_data.xlsx');
I managed to create a solution to my problem based on everyone's input and on taking an online MATLAB course. I am not a natural coder, so my script is not as elegant as those of the geniuses here, but hopefully it is clear enough for other non-programming scientists to use.
Here's the result that works for me:
% Reads a directory containing *.csv files and corrects the x-axis to an evenly spaced (0.1 unit) interval.
% Create a matrix with the input x range then convert it to a table
prompt = 'Input recorded min/max data range separated by space \n(ex. 1 to 100 = 1 100): ';
inputrange = input(prompt,'s');
min_max = str2num(inputrange)
datarange = (min_max(1):0.1:min_max(2))';
datarange = array2table(datarange,'VariableNames',{'XAxis'});
files = dir('*.csv');
for q = 1:length(files)
    % Extract each XY pair from the csvread cell and convert it to an array, then back to a table.
    data{q} = csvread(files(q).name, 2, 1);
    data1 = data(q);
    data2 = cell2mat(data1);
    data3 = array2table(data2, 'VariableNames', {'XAxis', 'YAxis'});
    % Join the datarange table and the intensity table to obtain an evenly spaced m/z range
    data3 = outerjoin(datarange, data3, 'MergeKeys', true);
    data3.YAxis(isnan(data3.YAxis)) = 0;
    data3.XAxis = round(data3.XAxis, 1);
    % Remove duplicate values
    data4 = sortrows(data3, [1 -2]);
    [~, idx] = unique(data4.XAxis);
    data4 = data4(idx, :);
    % Save the file as the same name in CSV without underscores or dashes
    filename = files(q).name;
    filename = strrep(filename, '_', '');
    filename = strrep(filename, '-', '');
    filename = strrep(filename, '.csv', '');
    writetable(data4, filename, 'FileType', 'text');
    clear data data1 data2 data3 data4 filename
end
clear

Big data visualization for multiple sampled data points from a large log

I have a log file which I need to plot in Python as a multi-line plot, with a line for each unique data point. The problem is that in some samples certain points are missing while new points appear in others, as shown in the example below, where each line denotes a sample of n points and n is variable:
2015-06-20 16:42:48,135 current stats=[ ('keypassed', 13), ('toy', 2), ('ball', 2),('mouse', 1) ...]
2015-06-21 16:42:48,135 current stats=[ ('keypassed', 20, ('toy', 5), ('ball', 7), ('cod', 1), ('fish', 1) ... ]
In the first sample above, 'mouse' is present but absent in the second line, and new data points such as 'cod' and 'fish' are added in each sample.
So how can this be done in Python in the quickest and cleanest way? Are there any existing Python utilities which can help to plot this timed log file? Also, being a log file, there are thousands of samples, so the visualization should be able to display them properly.
I am interested in applying multivariate hexagonal binning to this, with a differently coloured hexagon for each unique column ('ball', 'mouse', etc.). scikit offers hexagonal binning, but I can't figure out how to render different colours for each hexagon based on the unique data point. Any other visualization technique would also help.
Getting the data into pandas:
import pandas as pd
df = pd.DataFrame(columns = ['timestamp','name','value'])
with open(logfilepath) as f:
    for line in f.readlines():
        timestamp = line.split(',')[0]
        # the data part of each line can be evaluated directly as a Python list
        data = eval(line.split('=')[1])
        # convert the input data from wide format to long format
        for name, value in data:
            df = df.append({'timestamp': timestamp, 'name': name, 'value': value},
                           ignore_index=True)

# convert from long format back to wide format, and fill null values with 0
df2 = df.pivot_table(index='timestamp', columns='name')
df2 = df2.fillna(0)
df2
Out[142]:
                     value
name                  ball  cod  fish  keypassed  mouse  toy
timestamp
2015-06-20 16:42:48      2    0     0         13      1    2
2015-06-21 16:42:48      7    1     1         20      0    5
Plot the data:
import matplotlib.pylab as plt
df2.value.plot()
plt.show()
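As a side note, a slightly more defensive variant of the parsing loop above: ast.literal_eval only parses Python literals, so it is safer than eval on log text, and building the frame once from a list of rows avoids DataFrame.append, which has been removed in recent pandas versions. A minimal sketch under the same assumptions (logfilepath and the log format shown above):
import ast
import pandas as pd

rows = []
with open(logfilepath) as f:
    for line in f:
        timestamp = line.split(',')[0]
        # literal_eval accepts only Python literals, such as the list of tuples here
        stats = ast.literal_eval(line.split('=')[1].strip())
        rows.extend({'timestamp': timestamp, 'name': name, 'value': value}
                    for name, value in stats)

df = pd.DataFrame(rows)
df2 = df.pivot_table(index='timestamp', columns='name', values='value').fillna(0)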
