I want to compute the duration (in weeks) between changes in p. For example, p is the same for weeks 1, 2, and 3 and changes to 1.11 in week 4, so the duration is 3. Right now the duration is computed in a loop ported from R. It works, but it is slow. Any suggestion on how to improve this would be greatly appreciated.
raw['duration'] = np.nan
dur_col = raw.columns.get_loc('duration')
ids = raw['unique_id'].unique()          # renamed from 'id', which shadows a builtin
for i in range(len(ids)):
    pos1 = abs(raw['dp']) > 0            # weeks where p changed
    pos2 = raw['unique_id'] == ids[i]    # rows for this id
    pos = np.where(pos1 & pos2)[0]       # positional row indices
    # first change for this id: count from week 1
    raw.iloc[pos[0], dur_col] = raw['week'].iloc[pos[0]] - 1
    for j in range(1, len(pos)):
        raw.iloc[pos[j], dur_col] = raw['week'].iloc[pos[j]] - raw['week'].iloc[pos[j - 1]]
The dataframe is raw, and the values for a particular unique_id look like this:
date week p change duration
2006-07-08 27 1.05 -0.07 1
2006-07-15 28 1.05 0.00 NaN
2006-07-22 29 1.05 0.00 NaN
2006-07-29 30 1.11 0.06 3
... ... ... ... ...
2010-06-05 231 1.61 0.09 1
2010-06-12 232 1.63 0.02 1
2010-06-19 233 1.57 -0.06 1
2010-06-26 234 1.41 -0.16 1
2010-07-03 235 1.35 -0.06 1
2010-07-10 236 1.43 0.08 1
2010-07-17 237 1.59 0.16 1
2010-07-24 238 1.59 0.00 NaN
2010-07-31 239 1.59 0.00 NaN
2010-08-07 240 1.59 0.00 NaN
2010-08-14 241 1.59 0.00 NaN
2010-08-21 242 1.61 0.02 5
Computing durations once you have your list in date order is trivial: iterate over the list, keeping track of how long it has been since the last change to p. If the slowness comes from how you get that list, you haven't provided nearly enough information for help with that.
You can simply get the list of weeks where there is a change, then compute their differences, and finally join those differences back onto your original DataFrame.
weeks = raw.query('change != 0.0')[['week']]   # weeks where p actually changed
weeks['duration'] = weeks.week.diff()          # gap since the previous change
raw = pd.merge(raw, weeks, on='week', how='left')
raw2 = raw.loc[raw['change'] != 0, ['week', 'unique_id']]  # .ix is deprecated; .loc does the same here
raw2['duration'] = raw2.groupby('unique_id')['week'].transform(lambda x: x.diff())
# no need to drop 'unique_id' -- the merge below joins on it
raw = pd.merge(raw, raw2, on=['unique_id', 'week'], how='left')
Thank you all. I modified the suggestion and got this to give the same answer as the complicated loop. For 10,000 observations it is not a whole lot faster, but the code is more compact.
I set the duration to NaN where there is no change, because the duration seems undefined when no change is made; zero would work too. With the code above, the NaN is put in automatically by the merge. In any case, I want to compute statistics for the non-change group separately.
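For that last step, a minimal sketch of splitting the two groups (assumes the merged raw frame from above):
changed = raw[raw['duration'].notna()]        # weeks where p moved
no_change = raw[raw['duration'].isna()]       # weeks where p held steady
print(changed['duration'].describe())
print(no_change.groupby('unique_id').size())  # unchanged weeks per id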
I am trying to use Seaborn to plot a simple bar plot using data that was transformed. The data started out looking like this (text follows):
element 1 2 3 4 5 6 7 8 9 10 11 12
C 95.6 95.81 96.1 95.89 97.92 96.71 96.1 96.38 96.09 97.12 95.12 95.97
N 1.9 1.55 1.59 1.66 0.53 1.22 1.57 1.63 1.82 0.83 2.37 2.13
O 2.31 2.4 2.14 2.25 1.36 1.89 2.23 1.8 1.93 1.89 2.3 1.71
Co 0.18 0.21 0.16 0.17 0.01 0.03 0.13 0.01 0.02 0.01 0.14 0.01
Zn 0.01 0.03 0.02 0.03 0.18 0.14 0.07 0.17 0.14 0.16 0.07 0.18
and after importing using:
df1 = pd.read_csv(r"C:\path.txt", sep='\t', header=0, usecols=range(13), index_col='element').transpose()
display(df1)
When I plot the values of an element versus the first column (which represents an observation), the first column of data corresponding to 'C' is used instead. What am I doing wrong and how can I fix it?
I also tried importing, then pivoting the dataframe, which resulted in an undesired shape that repeated the element set as columns 12 times.
ax = sns.barplot(x=df1.iloc[:,0], y='Zn', data=df1)
Edited to add that I am not married to using any particular package or technique. I just want to be able to use my data to build a bar plot with 1-12 on the x axis and elemental compositions on the y.
You have different possibilities here. The problem is that after the transpose the observation labels (1-12) are the index of df1, so x=df1.iloc[:,0] is the first data column, which is 'C'.
1)
ax = sns.barplot(x=df1.index, y='Zn', data=df1)
2)
df1 = df1.reset_index()  # the observation labels become a column named 'index'
ax = sns.barplot(x=df1.iloc[:,0], y='Zn', data=df1)
# equal to
ax = sns.barplot(x='index', y='Zn', data=df1)
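If you want all elements on one chart rather than just 'Zn', a sketch using melt (assumes df1 straight out of the read_csv/transpose step above; the names 'observation' and 'composition' are mine):
import pandas as pd
import seaborn as sns

# long_df: one row per (observation, element) pair
long_df = (df1.rename_axis('observation')
              .reset_index()
              .melt(id_vars='observation', var_name='element', value_name='composition'))
ax = sns.barplot(x='observation', y='composition', hue='element', data=long_df)
ax.set_yscale('log')  # optional: C dwarfs the trace elements otherwise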
I am trying to add a row to my existing pandas dataframe, and the value of the new row should be a computation on the existing rows.
I have a dataframe that looks like the below:
Rating LE_St % Total
1.00 7.58 74.55
2.00 0.56 5.55
3.00 0.21 2.04
5.00 0.05 0.44
6.00 1.77 17.42
All 10.17 100.00
I want to add a row called "Metric" which is the sum of the "LE_St" column for "Rating" >= 4 and < 6, divided by the "LE_St" value for "All", i.e. Metric = (0.05 + 1.77) / 10.17.
My output dataframe should look like below:
Rating LE_St % Total
1.00 7.58 74.55
2.00 0.56 5.55
3.00 0.21 2.04
5.00 0.05 0.44
6.00 1.77 17.42
All 10.17 100.00
Metric 0.44
I believe your approach to the dataframe is wrong.
Usually rows hold values correlating with columns in a manner that makes sense, rather than assorted summary information. The power of pandas and Python is in holding and manipulating data in that shape: you can easily compute a value from one column or all columns and store it in a separate "summary" dataframe or in standalone variables, which might help you here as well.
For a computation on a column (i.e. a Series object) you can use the .sum() method (or any of the other computational tools) and slice your dataframe by values in the "Rating" column.
For one-off computation of small statistics you may be better off with Excel :)
An example of a solution might look like this:
total = 10.17  # the "LE_St" value in the "All" row; you could also look it up in the df
sliced_df = df[df['Rating'].between(4, 6)]  # between() is inclusive on both ends by default
metric = sliced_df['LE_St'].sum() / total
print(metric)  # or store it somewhere, however you like
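If you do still want the metric as a literal row in the frame, a minimal sketch (assumes the three columns shown above and a default RangeIndex):
# append the computed row; there is no meaningful '% Total' for it
df.loc[len(df)] = ['Metric', round(metric, 2), None]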
I have a df like the one below and am looking to compress duplicate index values into a single row:
ask bid
date
2011-01-03 0.32 0.30
2011-01-03 1.03 1.01
2011-01-03 4.16 4.11
and the expected output is (column names are not important for now; I will set them manually):
ask bid ask1 bid1 ask2 bid2
date
2011-01-03 0.32 0.30 1.03 1.01 4.16 4.11
Something like the below can be done to get the output you are looking for:
import pandas as pd
from functools import reduce

df_1 = pd.DataFrame({'date': ['2011-01-03', '2011-01-03', '2011-01-03'],
                     'ask': [0.31, 1.05, 4.17],
                     'bid': [0.40, 1.41, 5.11]})
dfs = list()
while df_1['date'].duplicated().any():
    # peel off the first occurrence of each date ...
    b = df_1.drop_duplicates(subset='date', keep='first')
    dfs.append(b)
    # ... and keep only the rows that were not peeled off
    df_1 = df_1.merge(b, how='outer', on=['date', 'ask', 'bid'], indicator=True)
    df_1 = df_1[df_1['_merge'] == 'left_only']
    del df_1['_merge']
dfs.append(df_1)
df_final = reduce(lambda left, right: pd.merge(left, right, on='date', suffixes=('_1', '_2')), dfs)
input:
ask bid date
0 0.31 0.40 2011-01-03
1 1.05 1.41 2011-01-03
2 4.17 5.11 2011-01-03
Output:
ask_1 bid_1 date ask_2 bid_2 ask bid
0 0.31 0.4 2011-01-03 1.05 1.41 4.17 5.11
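For what it's worth, the same reshape can be done without the loop by numbering each date's repeats and pivoting. A sketch, assuming a fresh df_1 as constructed above (the while loop consumes the original):
# number each repeat of a date, then spread ask/bid out by that number
df_1['n'] = df_1.groupby('date').cumcount()
wide = df_1.pivot(index='date', columns='n', values=['ask', 'bid'])
wide.columns = [f'{col}{n}' if n else col for col, n in wide.columns]  # ask, ask1, ask2, bid, bid1, bid2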
For this data that is already pivoted in a dataframe:
1 2 3 4 5 6 7
2013-05-28 -0.44 0.03 0.06 -0.31 0.13 0.56 0.81
2013-07-05 0.84 1.03 0.96 0.90 1.09 0.59 1.15
2013-08-21 0.09 0.25 0.06 0.09 -0.09 -0.16 0.56
2014-10-15 0.35 1.16 1.91 3.44 2.75 1.97 2.16
2015-02-09 0.09 -0.10 -0.38 -0.69 -0.25 -0.85 -0.47
... I'm trying to make a line chart like the one Excel produces (and, via Excel's flip-x-and-y button, its transposed variant).
I'm getting lost with the to-chart and to-PNG steps, and most of the examples want unpivoted raw data, which is a stage I'm already past.
Seaborn or Matplotlib or anything that can make the chart would be great. On a box without X11 would be better still :)
I thought about posting this as a comment on this SO answer, but I could not do newlines, insert pics, and all of that.
Edit: Sorry, I've not pasted in any of my attempts because they have not even come close to putting a PNG out. The only other examples I can see on SO start with transactional rows and pivot, but don't go as far as PNG output.
You need to transpose your data before plotting it.
df.T.plot()
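Fleshing that out for the PNG-without-X11 part, a minimal sketch (assumes df holds the pivoted data above):
import matplotlib
matplotlib.use('Agg')  # render off-screen, no X11 needed
import matplotlib.pyplot as plt

ax = df.T.plot()                 # transpose: 1-7 on the x axis, one line per date
ax.figure.savefig('lines.png', dpi=150)
plt.close(ax.figure)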
I will try to explain the problem I am currently having with cumulative sums on DataFrames in Python, and hopefully you'll grasp it!
Given a pandas DataFrame df with a column returns as such:
returns
Date
2014-12-10 0.0000
2014-12-11 0.0200
2014-12-12 0.0500
2014-12-15 -0.0200
2014-12-16 0.0000
Applying a cumulative sum to this DataFrame is easy, e.g. df.cumsum(). But is it possible to apply a cumulative sum every X days (or data points), yielding only the cumulative sum of the last Y days (or data points)?
Clarification: given daily data as above, how do I get the accumulated sum of the last Y days, re-evaluated (from zero) every X days?
Hope it's clear enough.
Thanks,
N
"Every X days" and "every X data points" are very different; the following assumes you really mean the first, since you mention it more frequently.
If the index is a DatetimeIndex, you can resample to a daily frequency, take a rolling_sum, and then select only the original dates:
>>> pd.rolling_sum(df.resample("1d"), 2, min_periods=1).loc[df.index]
returns
Date
2014-12-10 0.00
2014-12-11 0.02
2014-12-12 0.07
2014-12-15 -0.02
2014-12-16 -0.02
or, step by step:
>>> df.resample("1d")
returns
Date
2014-12-10 0.00
2014-12-11 0.02
2014-12-12 0.05
2014-12-13 NaN
2014-12-14 NaN
2014-12-15 -0.02
2014-12-16 0.00
>>> pd.rolling_sum(df.resample("1d"), 2, min_periods=1)
returns
Date
2014-12-10 0.00
2014-12-11 0.02
2014-12-12 0.07
2014-12-13 0.05
2014-12-14 NaN
2014-12-15 -0.02
2014-12-16 -0.02
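(pd.rolling_sum and the bare df.resample("1d") above are from older pandas; on current versions the equivalent would be something like this sketch:)
out = (df.resample("1D").asfreq()           # fill missing calendar days with NaN
         .rolling(2, min_periods=1).sum()   # 2-day rolling sum, NaNs skipped
         .loc[df.index])                    # keep only the original dates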
The way I would do it is with helper columns. It's a little kludgy but it should work:
numgroups = int(len(df)/(x-1))  # intentionally over-allocates group labels; the slices below trim to len(df)
df['groupby'] = sorted(list(range(numgroups))*x)[:len(df)]   # block id: x consecutive rows share a label
df['mask'] = (([0]*(x-y)+[1]*y)*numgroups)[:len(df)]         # 1 for the last y rows of each block
df['masked'] = df.returns*df['mask']                         # zero out everything but those y rows
df.groupby('groupby').masked.cumsum()                        # cumulative sum restarts at each block
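The same idea wrapped in a small self-contained function, as a sketch (the name blockwise_cumsum and the toy series are mine, not from the thread):
import pandas as pd

def blockwise_cumsum(s, x, y):
    """Cumulative sum restarted every x rows, summing only the last y rows of each block."""
    n = len(s)
    block = pd.Series(range(n), index=s.index) // x         # block id per row
    keep = pd.Series(range(n), index=s.index) % x >= x - y  # True for the last y rows of a block
    return (s * keep).groupby(block).cumsum()

s = pd.Series(range(1, 9), dtype=float)
print(blockwise_cumsum(s, x=4, y=2))  # 0, 0, 3, 7, 0, 0, 7, 15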
I am not sure if there is a built-in method, but it does not seem very difficult to write one. For example, here is one for a pandas Series:
def cum(df, interval):
    totals = []  # renamed from 'all', which shadows a builtin
    quotient = len(df) // interval
    for i in range(quotient):
        totals.append(df[0:(i + 1) * interval].sum())
    return pd.Series(totals)
>>> s1 = pd.Series(range(20))
>>> print(cum(s1, 4))
0 6
1 28
2 66
3 120
4 190
dtype: int64
Thanks to @DSM I managed to come up with a variation of his solution that does pretty much what I was looking for:
import numpy as np
import pandas as pd
df.resample("1w", how={'A': np.sum})
Yields what I want for the example below:
rng = range(1,29)
dates = pd.date_range('1/1/2000', periods=len(rng))
r = pd.DataFrame(rng, index=dates, columns=['A'])
r2 = r.resample("1w", how={'A': np.sum})
Outputs:
>> print r
A
2000-01-01 1
2000-01-02 2
2000-01-03 3
2000-01-04 4
2000-01-05 5
2000-01-06 6
2000-01-07 7
2000-01-08 8
2000-01-09 9
2000-01-10 10
2000-01-11 11
...
2000-01-25 25
2000-01-26 26
2000-01-27 27
2000-01-28 28
>> print r2
A
2000-01-02 3
2000-01-09 42
2000-01-16 91
2000-01-23 140
2000-01-30 130
Even though it doesn't start "one week in" in this case (hence the sum of 3 for the very first bin), it always produces the correct per-week sum, restarting from zero at the start of each new week.