I have data in the following format.
Input
From To Zone 1 Zone 2 Zone 3 Zone 4 Zone 5
10.1 20 0.45 0.45 0.35 0.45 0.45
20.1 40 0.45 0.45 0.45 0.45 0.70
40.1 50 0.50 0.50 0.55 0.55 0.55
50.1 250 0.75 0.79 0.79 0.80 0.79
Desired Output
From To Kg Attribute Value
10.1 20 0.5 Zone 1 0.45
10.1 20 0.5 Zone 2 0.45
10.1 20 0.5 Zone 3 0.35
10.1 20 0.5 Zone 4 0.45
10.1 20 0.5 Zone 5 0.45
20.1 40 0.5 Zone 1 0.45
20.1 40 0.5 Zone 2 0.45
20.1 40 0.5 Zone 3 0.45
20.1 40 0.5 Zone 4 0.45
20.1 40 0.5 Zone 5 0.70
How can this be done with pandas in Python?
You can set From and To as the index and use stack(); reorder the columns at the end if you need the exact From, To, Kg, Attribute, Value order.
(
df.set_index(['From', 'To']).stack().to_frame('Value')
.rename_axis(['From', 'To', 'Attribute'])
.assign(Kg=0.5)
.reset_index()
)
From To Attribute Value Kg
0 10.1 20 Zone1 0.45 0.5
1 10.1 20 Zone2 0.45 0.5
2 10.1 20 Zone3 0.35 0.5
3 10.1 20 Zone4 0.45 0.5
4 10.1 20 Zone5 0.45 0.5
5 20.1 40 Zone1 0.45 0.5
6 20.1 40 Zone2 0.45 0.5
7 20.1 40 Zone3 0.45 0.5
8 20.1 40 Zone4 0.45 0.5
9 20.1 40 Zone5 0.70 0.5
10 40.1 50 Zone1 0.50 0.5
11 40.1 50 Zone2 0.50 0.5
12 40.1 50 Zone3 0.55 0.5
13 40.1 50 Zone4 0.55 0.5
14 40.1 50 Zone5 0.55 0.5
15 50.1 250 Zone1 0.75 0.5
16 50.1 250 Zone2 0.79 0.5
17 50.1 250 Zone3 0.79 0.5
18 50.1 250 Zone4 0.80 0.5
19 50.1 250 Zone5 0.79 0.5
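An alternative is pd.melt, which produces the same long format and makes it easy to match the desired column order exactly. A sketch using the question's data:

```python
import pandas as pd

df = pd.DataFrame({
    'From': [10.1, 20.1, 40.1, 50.1],
    'To': [20, 40, 50, 250],
    'Zone 1': [0.45, 0.45, 0.50, 0.75],
    'Zone 2': [0.45, 0.45, 0.50, 0.79],
    'Zone 3': [0.35, 0.45, 0.55, 0.79],
    'Zone 4': [0.45, 0.45, 0.55, 0.80],
    'Zone 5': [0.45, 0.70, 0.55, 0.79],
})

out = (
    df.melt(id_vars=['From', 'To'], var_name='Attribute', value_name='Value')
      .assign(Kg=0.5)
      # melt emits all of Zone 1 first, then Zone 2, etc., so re-group by row
      .sort_values(['From', 'To', 'Attribute'])
      .reset_index(drop=True)
      # match the desired column order
      [['From', 'To', 'Kg', 'Attribute', 'Value']]
)
print(out)
```

Both approaches give the same 20 rows; melt just skips the intermediate MultiIndex.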
I have the following data frame. I want to check the values of each row for the columns of "mental_illness", "feeling", and "flavor". If all the values for those three columns per row are less than 0.5, I want to change the corresponding value of the "unclassified" column to 1.0.
sent_no pos unclassified mental_illness feeling flavor
0 0 word_1 0.0 0.75 0.30 0.28
1 1 word_2 0.0 0.17 0.72 0.16
2 2 word_3 0.0 0.19 0.38 0.16
3 3 word_4 0.0 0.39 0.20 0.14
4 4 word_5 0.0 0.72 0.30 0.14
Expected result:
sent_no pos unclassified mental_illness feeling flavor
0 0 word_1 0.0 0.75 0.30 0.28
1 1 word_2 0.0 0.17 0.72 0.16
2 2 word_3 1.0 0.19 0.38 0.16
3 3 word_4 1.0 0.39 0.20 0.14
4 4 word_5 0.0 0.72 0.30 0.14
How do I go about doing so?
Use .le and .all over axis=1:
m = df[['mental_illness', 'feeling', 'flavor']].le(0.5).all(axis=1)
df['unclassified'] = m.astype(int)
sent_no pos unclassified mental_illness feeling flavor
0 0 word_1 0 0.75 0.30 0.28
1 1 word_2 0 0.17 0.72 0.16
2 2 word_3 1 0.19 0.38 0.16
3 3 word_4 1 0.39 0.20 0.14
4 4 word_5 0 0.72 0.30 0.14
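If you want unclassified to stay a float (1.0/0.0, as in the expected result) and to use strict less-than, matching "less than 0.5" in the question, a small variation of the same idea is to use .lt and cast the mask to float:

```python
import pandas as pd

df = pd.DataFrame({
    'pos': ['word_1', 'word_2', 'word_3', 'word_4', 'word_5'],
    'unclassified': [0.0] * 5,
    'mental_illness': [0.75, 0.17, 0.19, 0.39, 0.72],
    'feeling': [0.30, 0.72, 0.38, 0.20, 0.30],
    'flavor': [0.28, 0.16, 0.16, 0.14, 0.14],
})

# True only where all three scores are strictly below 0.5
m = df[['mental_illness', 'feeling', 'flavor']].lt(0.5).all(axis=1)
df['unclassified'] = m.astype(float)
print(df)
```

With this data .le and .lt give the same rows, since no value is exactly 0.5.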
Would this work?
mask1 = df["mental_illness"] < 0.5
mask2 = df["feeling"] < 0.5
mask3 = df["flavor"] < 0.5
df.loc[mask1 & mask2 & mask3, 'unclassified'] = 1
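Yes, that works: the three masks combined with & select exactly the rows where all three scores are below 0.5, and the .loc assignment writes 1 into the float column (so it stays 1.0). A quick check with the question's data:

```python
import pandas as pd

df = pd.DataFrame({
    'pos': ['word_1', 'word_2', 'word_3', 'word_4', 'word_5'],
    'unclassified': [0.0] * 5,
    'mental_illness': [0.75, 0.17, 0.19, 0.39, 0.72],
    'feeling': [0.30, 0.72, 0.38, 0.20, 0.30],
    'flavor': [0.28, 0.16, 0.16, 0.14, 0.14],
})

# each mask is a boolean Series; & combines them element-wise
mask1 = df["mental_illness"] < 0.5
mask2 = df["feeling"] < 0.5
mask3 = df["flavor"] < 0.5
df.loc[mask1 & mask2 & mask3, 'unclassified'] = 1
print(df)
```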
Here is my solution:
data.unclassified = (
    data[['mental_illness', 'feeling', 'flavor']]
    .le(0.5)
    .apply(lambda x: 1 if sum(x) == 3 else 0, axis=1)
)
Output:
sent_no pos unclassified mental_illness feeling flavor
0 0 Word_1 0 0.75 0.30 0.28
1 1 Word_2 0 0.17 0.72 0.16
2 2 Word_3 1 0.19 0.38 0.16
3 3 Word_4 1 0.39 0.20 0.14
4 4 Word_5 0 0.72 0.30 0.14
I want to calculate a simple momentum signal. The method I am following is: 1-month lagged cumret divided by 12-month lagged cumret, minus 1.
The date column starts at 1/5/14 and ends at 1/5/16. As a 12-month lag is required, the first mom signal has to start 12 months after the start date; hence the first mom signal starts at 1/5/15.
Here is the data utilized:
import pandas as pd
data = {'date': ['1/5/14', '1/6/14', '1/7/14', '1/8/14', '1/9/14', '1/10/14', '1/11/14', '1/12/14', '1/1/15', '1/2/15', '1/3/15', '1/4/15', '1/5/15', '1/6/15', '1/7/15', '1/8/15', '1/9/15', '1/10/15', '1/11/15', '1/12/15', '1/1/16', '1/2/16', '1/3/16', '1/4/16', '1/5/16'],
        'id': ['a'] * 25,
        'ret': [0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.10, 0.11, 0.12, 0.13, 0.14, 0.15, 0.16, 0.17, 0.18, 0.19, 0.20, 0.21, 0.22, 0.23, 0.24, 0.25],
        'cumret': [1.01, 1.03, 1.06, 1.1, 1.15, 1.21, 1.28, 1.36, 1.45, 1.55, 1.66, 1.78, 1.91, 2.05, 2.2, 2.36, 2.53, 2.71, 2.9, 3.1, 3.31, 3.53, 3.76, 4, 4.25]}
df = pd.DataFrame(data).set_index(['date', 'id'])
Desired output
ret cumret mom
date id
1/5/14 a .01 1.01
1/6/14 a .02 1.03
1/7/14 a .03 1.06
1/8/14 a .04 1.1
1/9/14 a .05 1.15
1/10/14 a .06 1.21
1/11/14 a .07 1.28
1/12/14 a .08 1.36
1/1/15 a .09 1.45
1/2/15 a .1 1.55
1/3/15 a .11 1.66
1/4/15 a .12 1.78
1/5/15 a .13 1.91 .8
1/6/15 a .14 2.05 .9
1/7/15 a .15 2.2 .9
1/8/15 a .16 2.36 1
1/9/15 a .17 2.53 1.1
1/10/15 a .18 2.71 1.1
1/11/15 a .19 2.9 1.1
1/12/15 a .2 3.1 1.1
1/1/16 a .21 3.31 1.1
1/2/16 a .22 3.53 1.1
1/3/16 a .23 3.76 1.1
1/4/16 a .24 4 1.1
1/5/16 a .25 4.25 1.1
This is the code I tried to calculate mom:
df['mom'] = ((df['cumret'].shift(-1) / (df['cumret'].shift(-12))) - 1).groupby(level = ['id'])
The entire dataset has more ids, e.g. a, b, c; I just included one id for this example.
Any help would be awesome! :)
As far as I know, momentum is simply rate of change. Pandas has a built-in method for this:
df['mom'] = df['ret'].pct_change(12) # 12 month change
Also, I am not sure why you are using cumret instead of ret to calculate momentum.
Update: If you have multiple IDs that you need to go through, I'd recommend:
for i in df.index.levels[1]:
    temp = df.loc[(slice(None), i), "ret"].pct_change(11)
    df.loc[(slice(None), i), "mom"] = temp
    # or, for short:
    # df.loc[(slice(None), i), "mom"] = df.loc[(slice(None), i), "ret"].pct_change(11)
Output:
ret cumret mom
date id
1/5/14 a 0.01 1.01 NaN
1/6/14 a 0.02 1.03 NaN
1/7/14 a 0.03 1.06 NaN
1/8/14 a 0.04 1.10 NaN
1/9/14 a 0.05 1.15 NaN
1/10/14 a 0.06 1.21 NaN
1/11/14 a 0.07 1.28 NaN
1/12/14 a 0.08 1.36 NaN
1/1/15 a 0.09 1.45 NaN
1/2/15 a 0.10 1.55 NaN
1/3/15 a 0.11 1.66 NaN
1/4/15 a 0.12 1.78 11.000000
1/5/15 a 0.13 1.91 5.500000
1/6/15 a 0.14 2.05 3.666667
1/7/15 a 0.15 2.20 2.750000
1/8/15 a 0.16 2.36 2.200000
1/9/15 a 0.17 2.53 1.833333
1/10/15 a 0.18 2.71 1.571429
1/11/15 a 0.19 2.90 1.375000
1/12/15 a 0.20 3.10 1.222222
1/1/16 a 0.21 3.31 1.100000
1/2/16 a 0.22 3.53 1.000000
1/3/16 a 0.23 3.76 0.916667
1/4/16 a 0.24 4.00 0.846154
1/5/16 a 0.25 4.25 0.785714
1/5/14 b 0.01 1.01 NaN
1/6/14 b 0.02 1.03 NaN
1/7/14 b 0.03 1.06 NaN
1/8/14 b 0.04 1.10 NaN
1/9/14 b 0.05 1.15 NaN
1/10/14 b 0.06 1.21 NaN
1/11/14 b 0.07 1.28 NaN
1/12/14 b 0.08 1.36 NaN
1/1/15 b 0.09 1.45 NaN
1/2/15 b 0.10 1.55 NaN
1/3/15 b 0.11 1.66 NaN
1/4/15 b 0.12 1.78 11.000000
1/5/15 b 0.13 1.91 5.500000
1/6/15 b 0.14 2.05 3.666667
1/7/15 b 0.15 2.20 2.750000
1/8/15 b 0.16 2.36 2.200000
1/9/15 b 0.17 2.53 1.833333
1/10/15 b 0.18 2.71 1.571429
1/11/15 b 0.19 2.90 1.375000
1/12/15 b 0.20 3.10 1.222222
1/1/16 b 0.21 3.31 1.100000
1/2/16 b 0.22 3.53 1.000000
1/3/16 b 0.23 3.76 0.916667
1/4/16 b 0.24 4.00 0.846154
1/5/16 b 0.25 4.25 0.785714
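If you want the exact formula from the question (1-month lagged cumret divided by 12-month lagged cumret, minus 1), a groupby on id with shift computes it per id without an explicit loop. A sketch using the question's data:

```python
import pandas as pd

dates = ['1/5/14', '1/6/14', '1/7/14', '1/8/14', '1/9/14', '1/10/14', '1/11/14',
         '1/12/14', '1/1/15', '1/2/15', '1/3/15', '1/4/15', '1/5/15', '1/6/15',
         '1/7/15', '1/8/15', '1/9/15', '1/10/15', '1/11/15', '1/12/15', '1/1/16',
         '1/2/16', '1/3/16', '1/4/16', '1/5/16']
cumret = [1.01, 1.03, 1.06, 1.1, 1.15, 1.21, 1.28, 1.36, 1.45, 1.55, 1.66, 1.78,
          1.91, 2.05, 2.2, 2.36, 2.53, 2.71, 2.9, 3.1, 3.31, 3.53, 3.76, 4, 4.25]
df = pd.DataFrame({'date': dates, 'id': 'a', 'cumret': cumret}).set_index(['date', 'id'])

# mom = (cumret lagged 1 month) / (cumret lagged 12 months) - 1, computed per id
df['mom'] = df.groupby(level='id')['cumret'].transform(
    lambda s: s.shift(1) / s.shift(12) - 1
)
print(df)
```

The first twelve rows of each id are NaN, and 1/5/15 gives 1.78 / 1.01 - 1 ≈ 0.76, which matches the .8 (rounded) shown in the desired output.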
Given that my data is a pandas dataframe and looks like this:
Ref +1 +2 +3 +4 +5 +6 +7
2013-05-28 1 -0.44 0.03 0.06 -0.31 0.13 0.56 0.81
2013-07-05 2 0.84 1.03 0.96 0.90 1.09 0.59 1.15
2013-08-21 3 0.09 0.25 0.06 0.09 -0.09 -0.16 0.56
2014-10-15 4 0.35 1.16 1.91 3.44 2.75 1.97 2.16
2015-02-09 5 0.09 -0.10 -0.38 -0.69 -0.25 -0.85 -0.47
How can I plot a chart of the five lines (one for each Ref), where the x axis shows the columns (+1, +2, ...) and each line starts from 0? If it's in seaborn, even better. But matplotlib solutions are also welcome.
Plotting a dataframe in pandas is generally all about reshaping the table so that the individual lines you want are in separate columns, and the x-values are in the index. Some of these reshape operations are a bit ugly, but you can do:
df = pd.read_clipboard()
plot_table = pd.melt(df.reset_index(), id_vars=['index', 'Ref'])
plot_table = plot_table.pivot(index='variable', columns='Ref', values='value')
# Add extra row to have all lines start from 0:
plot_table.loc['+0', :] = 0
plot_table = plot_table.sort_index()
plot_table
Ref 1 2 3 4 5
variable
+0 0.00 0.00 0.00 0.00 0.00
+1 -0.44 0.84 0.09 0.35 0.09
+2 0.03 1.03 0.25 1.16 -0.10
+3 0.06 0.96 0.06 1.91 -0.38
+4 -0.31 0.90 0.09 3.44 -0.69
+5 0.13 1.09 -0.09 2.75 -0.25
+6 0.56 0.59 -0.16 1.97 -0.85
+7 0.81 1.15 0.56 2.16 -0.47
Now that you have a table with the right shape, plotting is pretty automatic:
plot_table.plot()
I would like to fill an empty dataframe by appending several files of the same type and structure. However, I can't see what's wrong here:
def files2df(colnames, ext):
    df = DataFrame(columns=colnames)
    for inf in sorted(glob.glob(ext)):
        dfin = read_csv(inf, sep='\t', skiprows=1)
        print(dfin.head(), '\n')
        df.append(dfin, ignore_index=True)
    return df
The resulting dataframe is empty. Could someone give me a hand?
1.0 16.59 0.597 0.87 1.0.1 3282 100.08
0 0.953 14.52 0.561 0.80 0.99 4355 -
1 1.000 31.59 1.000 0.94 1.00 6322 -
2 1.000 6.09 0.237 0.71 1.00 10568 -
3 1.000 31.29 1.000 0.94 1.00 14363 -
4 1.000 31.59 1.000 0.94 1.00 19797 -
1.0 6.69 0.199 0.74 1.0.1 186 13.16
0 1 0.88 0.020 0.13 0.99 394 -
1 1 0.75 0.017 0.11 0.99 1052 -
2 1 3.34 0.097 0.57 1.00 1178 -
3 1 1.50 0.035 0.26 1.00 1211 -
4 1 20.59 0.940 0.88 1.00 1583 -
1.0 0.12 0.0030 0.04 0.97 2285 2.62
0 1 1.25 0.135 0.18 0.99 2480 -
1 1 0.03 0.001 0.04 0.97 7440 -
2 1 0.12 0.003 0.04 0.97 8199 -
3 1 1.10 0.092 0.16 0.99 11174 -
4 1 0.27 0.007 0.06 0.98 11310 -
0.244 0.07 0.0030 0.02 0.76 41314 1.32
0 0.181 0.64 0.028 0.03 0.36 41755 -
1 0.161 0.18 0.008 0.01 0.45 42420 -
2 0.161 0.18 0.008 0.01 0.45 42461 -
3 0.237 0.25 0.011 0.02 0.56 43060 -
4 0.267 1.03 0.047 0.07 0.46 43321 -
0.163 0.12 0.0060 0.01 0.5 103384 1.27
0 0.243 0.27 0.014 0.02 0.56 104693 -
1 0.215 0.66 0.029 0.04 0.41 105192 -
2 0.190 0.10 0.005 0.01 0.59 105758 -
3 0.161 0.12 0.006 0.01 0.50 109783 -
4 0.144 0.16 0.007 0.01 0.42 110067 -
Empty DataFrame
Columns: array([D, LOD, r2, CIlow, CIhi, Dist, T-int], dtype=object)
Index: array([], dtype=object)
df.append(dfin, ignore_index=True) returns a new DataFrame; it does not change df in place.
Use df = df.append(dfin, ignore_index=True) instead. But even with that fix, appending one file at a time in a loop copies the whole frame on every call, and DataFrame.append was deprecated in pandas 1.4 and removed in 2.0.
In this scenario (reading multiple files and using all the data to build a single DataFrame), I would use pandas.concat(). The code below gives you a frame with columns named by colnames, whose rows come from the data in the csv files.
import glob
from pandas import read_csv, concat

def files2df(colnames, ext):
    files = sorted(glob.glob(ext))
    frames = [read_csv(inf, sep='\t', skiprows=1, names=colnames) for inf in files]
    return concat(frames, ignore_index=True)
I did not try this code, just wrote it here; you may need to tweak it to get it running, but the idea is clear (I hope).
Also, I found another solution (the same concat idea, with an explicit loop instead of a list comprehension), but I don't know which one is faster.
def files2df(colnames, ext):
    dflist = []
    for inf in sorted(glob.glob(ext)):
        dflist.append(read_csv(inf, names=colnames, sep='\t', skiprows=1))
    df = concat(dflist, axis=0, ignore_index=True)
    return df