I have some problems with formatting the describe() table from pandas.
I would love to have 2 decimals of precision in every column, but in the last one I need the 1.11e11 format. I have tried applying
data.style.format({"last_column": "{:.2E}"})
but it does not seem to work for me; the result is still the same, as can be seen below.
Things like
pd.set_option('display.float_format', '{:.2E}'.format)
are applied pandas-wide, which is not what I want.
print(data.describe(percentiles=[],).fillna("-.--").round(2))
count 1 1 1 1 1 1
mean 1.43 0.4 34.58 0.07 0.71 1.12877e+08
std -.-- -.-- -.-- -.-- -.-- -.--
min 1.43 0.4 34.58 0.07 0.71 1.12877e+08
50% 1.43 0.4 34.58 0.07 0.71 1.12877e+08
max 1.43 0.4 34.58 0.07 0.71 1.12877e+08
I would like to avoid tabulate or any other tabular tool if possible, and solve this at the pandas level.
Does anyone have a solution, please?
Thank you :)
Just use this:
pd.set_option('precision', 2)
and if you want to reset it back to its original form, i.e. its default value, then use this:
pd.reset_option('precision')
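Note that this is still a pandas-wide setting (its full name is 'display.precision'). If you want two decimals everywhere but scientific notation in one column, and only for a single printout, one option is to pass per-column formatters to to_string(). A minimal sketch, with a made-up frame and a made-up column name "last_column" standing in for yours:
import pandas as pd

# hypothetical stand-in for the asker's data
data = pd.DataFrame({"a": [1.4321], "b": [0.4], "last_column": [1.12877e+08]})
desc = data.describe(percentiles=[])

# two decimals for every column, scientific notation only for the last one;
# this affects just this one rendering, not global display options
formatters = {col: "{:.2f}".format for col in desc.columns}
formatters["last_column"] = "{:.2E}".format
print(desc.to_string(formatters=formatters))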
Related
I want to round the numbers shown in my table.
It currently looks like this:
I want it to look like this:
How can I get that? Use pandas or numpy, as simple as possible. Thanks!
In pandas, we can use pandas.DataFrame.round. See the example below (from the pandas documentation).
Data frame df:
dogs cats
0 0.21 0.32
1 0.01 0.67
2 0.66 0.03
3 0.21 0.18
Round df as below:
df.round(1)
Output:
dogs cats
0 0.2 0.3
1 0.0 0.7
2 0.7 0.0
3 0.2 0.2
We can even specify which columns to round; see the link for more details: pandas.DataFrame.round
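For instance, if different columns need different precision, df.round() also accepts a dict mapping column names to the number of decimals (a small sketch, reusing the dogs/cats toy frame from above):
# round 'dogs' to 1 decimal and 'cats' to 2 decimals; columns not listed are left untouched
df.round({'dogs': 1, 'cats': 2})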
Or we can use Python's built-in round in a loop, as below:
>>> round(5.76543, 2)
5.77
I have a script where I do munging with dataframes and extract data like the following:
times = pd.Series(df.loc[df['sy_x'].str.contains('AA'), ('t_diff')].quantile([.1, .25, .5, .75, .9]))
I want to add the resulting data from quantile() to a dataframe with separate columns for each of those quantiles; let's say the columns are:
ID pt_1 pt_2 pt_5 pt_7 pt_9
AA
BB
CC
How might I add the quantiles to each row of ID?
new_df = None
for index, value in times.items():
for col in df[['pt_1', 'pt_2','pt_5','pt_7','pt_9',]]:
..but that feels wrong and not idiomatic. Should I be using loc or iloc? I have a couple more Series that I'll need to add to other columns not shown, but I think I can figure that out once I know the right approach here.
EDIT:
Some of the output of times looks like:
0.1 -0.5
0.25 -0.3
0.5 0.0
0.75 2.0
0.90 4.0
Thanks in advance for any insight
IIUC, you want a groupby():
import numpy as np
import pandas as pd

# toy data
np.random.seed(1)
df = pd.DataFrame({'sy_x': np.random.choice(['AA', 'BB', 'CC'], 100),
                   't_diff': np.random.randint(0, 100, 100)})

df.groupby('sy_x').t_diff.quantile((0.1, .25, .5, .75, .9)).unstack(1)
Output:
0.10 0.25 0.50 0.75 0.90
sy_x
AA 16.5 22.25 57.0 77.00 94.5
BB 9.1 21.00 58.5 80.25 91.3
CC 9.7 23.25 40.5 65.75 84.1
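If you also want the columns to carry the pt_* names and the group as an ID column, as in your target layout, a possible follow-up step (a sketch, assuming the quantile-to-name mapping implied by your headers) is:
out = df.groupby('sy_x').t_diff.quantile((0.1, .25, .5, .75, .9)).unstack(1)
# rename the quantile columns to the pt_* names and expose the group as an 'ID' column
out = out.rename(columns={0.1: 'pt_1', 0.25: 'pt_2', 0.5: 'pt_5',
                          0.75: 'pt_7', 0.9: 'pt_9'})
out = out.rename_axis('ID').reset_index()
print(out)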
Try something like:
pd.DataFrame(times.values.T, index=times.keys())
I am trying to create a row in my existing pandas dataframe, and the value of the new row should be a computation.
I have a dataframe that looks like the below:
Rating LE_St % Total
1.00 7.58 74.55
2.00 0.56 5.55
3.00 0.21 2.04
5.00 0.05 0.44
6.00 1.77 17.42
All 10.17 100.00
I want to add a row called "Metric", which is the sum of "LE_St" for "Rating" >= 4 and < 6, divided by "LE_St" for "All", i.e. Metric = (0.05 + 1.77) / 10.17.
My output dataframe should look like below:
Rating LE_St % Total
1.00 7.58 74.55
2.00 0.56 5.55
3.00 0.21 2.04
5.00 0.05 0.44
6.00 1.77 17.42
All 10.17 100.00
Metric 0.44
I believe your approach to the dataframe is wrong.
Usually rows hold values that correlate with the columns in a way that makes sense, rather than holding arbitrary summary information. The power of pandas and Python lies in holding and manipulating data: you can easily compute a value from one column, or even all columns, and store it in a separate "summary"-like dataframe or in standalone variables. That might help you here as well.
For computation on a column (i.e. a Series object) you can use the .sum() method (or any of the other computational tools) and slice your dataframe by values in the "Rating" column.
For one-off computation of small statistics you may be better off with Excel :)
An example of a solution might look like this:
total = 10.17  # the "All" value; I don't know where this comes from in your pipeline
sliced_df = df[df['Rating'].between(4, 6, inclusive=True)]
metric = sliced_df['LE_St'].sum() / total
print(metric)  # or store it somewhere, however you like
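If you do still want that value appended as an extra row of the same dataframe, as in the desired output above, a minimal sketch (assuming df has a default RangeIndex and exactly the three columns shown) could be:
# append a 'Metric' row; the '% Total' cell is left empty (NaN)
df.loc[len(df)] = ['Metric', round(metric, 2), None]
print(df)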
For this data that is already pivoted in a dataframe:
1 2 3 4 5 6 7
2013-05-28 -0.44 0.03 0.06 -0.31 0.13 0.56 0.81
2013-07-05 0.84 1.03 0.96 0.90 1.09 0.59 1.15
2013-08-21 0.09 0.25 0.06 0.09 -0.09 -0.16 0.56
2014-10-15 0.35 1.16 1.91 3.44 2.75 1.97 2.16
2015-02-09 0.09 -0.10 -0.38 -0.69 -0.25 -0.85 -0.47
.. I'm trying to make a line chart. This is from Excel:
.. and if I click the flip x & y button in Excel, also this pic:
I'm getting lost with the to-chart and to-PNG steps, and most of the examples want unpivoted raw data, which is a stage I'm already past.
Seaborn or Matplotlib or anything that can make the chart would be great. Something that runs on a box without X11 would be better still :)
I thought about posting this as a comment on this SO answer, but I could not do newlines, insert pics and all of that.
Edit: Sorry, I've not pasted in any of my attempts because they have not even come close to producing a PNG. The only other examples I can see on SO start with transactional rows and do the pivot, but don't go as far as PNG output.
You need to transpose your data before plotting it.
df.T.plot()
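To cover the no-X11 and PNG parts of the question, a minimal sketch (assuming the pivoted frame above is called df and matplotlib is installed) could be:
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; no X11 needed
import matplotlib.pyplot as plt

# df is the already-pivoted frame (dates as the index, 1..7 as the columns)
ax = df.T.plot()  # transpose so 1..7 runs along the x-axis, one line per date
ax.figure.savefig('chart.png', bbox_inches='tight')
plt.close(ax.figure)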
I want to compute the duration (in weeks) between changes. For example, p is the same for weeks 1, 2 and 3 and changes to 1.11 in period 4, so the duration is 3. Right now the duration is computed in a loop ported from R. It works, but it is slow. Any suggestion on how to improve this would be greatly appreciated.
raw['duration'] = np.nan
id = raw['unique_id'].unique()
for i in range(0, len(id)):
    pos1 = abs(raw['dp']) > 0
    pos2 = raw['unique_id'] == id[i]
    pos = np.where(pos1 & pos2)[0]
    raw['duration'][pos[0]] = raw['week'][pos[0]] - 1
    for j in range(1, len(pos)):
        raw['duration'][pos[j]] = raw['week'][pos[j]] - raw['week'][pos[j-1]]
The dataframe is raw, and the values for a particular unique_id look like this:
date week p change duration
2006-07-08 27 1.05 -0.07 1
2006-07-15 28 1.05 0.00 NaN
2006-07-22 29 1.05 0.00 NaN
2006-07-29 30 1.11 0.06 3
... ... ... ... ...
2010-06-05 231 1.61 0.09 1
2010-06-12 232 1.63 0.02 1
2010-06-19 233 1.57 -0.06 1
2010-06-26 234 1.41 -0.16 1
2010-07-03 235 1.35 -0.06 1
2010-07-10 236 1.43 0.08 1
2010-07-17 237 1.59 0.16 1
2010-07-24 238 1.59 0.00 NaN
2010-07-31 239 1.59 0.00 NaN
2010-08-07 240 1.59 0.00 NaN
2010-08-14 241 1.59 0.00 NaN
2010-08-21 242 1.61 0.02 5
Computing durations once you have your list in date order is trivial: iterate over the list, keeping track of how long it has been since the last change to p. If the slowness comes from how you get that list, you haven't provided nearly enough info for help with that.
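A rough sketch of that iterate-and-track idea, mirroring the loop's conventions (week - 1 for the first change, NaN where there is no change) and assuming a single unique_id already sorted by week:
import numpy as np

last_change_week = None
durations = []
for week, change in zip(raw['week'], raw['change']):
    if change != 0.0:
        # weeks since the previous change (or since week 1 for the first change)
        durations.append(week - last_change_week if last_change_week is not None else week - 1)
        last_change_week = week
    else:
        durations.append(np.nan)
raw['duration'] = durations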
You can simply get the list of weeks where there is a change, then compute their differences, and finally join those differences back onto your original DataFrame.
weeks = raw.query('change != 0.0')[['week']]
weeks['duration'] = weeks.week.diff()
pd.merge(raw, weeks, on='week', how='left')
raw2 = raw.loc[raw['change'] != 0, ['week', 'unique_id']]
data2 = raw2.groupby('unique_id')
raw2['duration'] = data2['week'].transform(lambda x: x.diff())
raw2.drop('unique_id', axis=1)  # no-op: the result is not assigned, and unique_id is still needed for the merge
raw = pd.merge(raw, raw2, on=['unique_id', 'week'], how='left')
Thank you all. I modified the suggestion and got this to give the same answer as the complicated loop. For 10,000 observations it is not a whole lot faster, but the code seems more compact.
I set "no change" to NaN because the duration seems undefined when no change is made, but zero would work too; with the code above, the NaN is put in automatically by the merge.
In any case, I want to compute statistics for the non-change group separately.
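For that last point, a small sketch (assuming the merged frame from above, where duration is NaN exactly on the no-change rows):
# split on whether a change happened and summarise each group separately
changed = raw[raw['duration'].notna()]
no_change = raw[raw['duration'].isna()]
print(changed['duration'].describe())
print(len(no_change), "rows with no change")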