Suppose I have a dataframe df
pd demand mon1 mon2 mon3
abc1 137 46 37 31
abc2 138 33 37 50
abc3 120 38 47 46
abc4 149 39 30 30
abc5 129 33 42 42
abc6 112 30 45 43
abc7 129 43 33 45
I want to satisfy the demand of each pd month-wise. I am generating some random numbers which indicate satisfied demand. For example, for pd abc1, demand is 137, say I have produced 42 units for mon1, but mon1 demand is 46. Hence revised dataframe would be
pd demand mon2 mon3
abc1 137 - 42= 95 37 + 4 (Unsatisfied demand for previous month) 31
Then it will run for mon2 and so on. In this way, I would like to capture, how much demand would be satisfied for each pd (excess or unsatisfied).
My try:
import pandas as pd
import random
mon = ['mon1', 'mon2', 'mon3']
for i in df['pd'].values.tolist():
t = df.loc[df['pd'] == i, :]
for m in t.columns[2:]:
y = t[m].iloc[0]
n = random.randint(20, 70)
t['demand'] = t['demand'].iloc[0] - n
Not finding the logic exactly.
Related
I am trying to linearly interpolate a series of CDS rates. Below is the data I have available; it is represented as maturities and in years.
Maturity Company 1 Company 2
0 0.5 186.73 186.73
1 1.0 210.65 210.65
2 2.0 249.09 249.09
3 3.0 285.4 285.4
4 4.0 317.59 317.59
5 5.0 344.06 344.06
6 6.0 363.01 363.01
7 7.0 375.69 375.69
8 8.0 384.31 384.31
9 9.0 391.0 391.0
10 10.0 396.12 396.12
I am now trying to use this set of maturities and their CDS rate in a similar format to interpolate the rate; below is an example of the maturities I will need to interpolate.
Maturity Years
28 0.10410958904109589
29 0.1863013698630137
30 0.27671232876712326
31 0.3561643835616438
32 0.4328767123287671
33 0.5260273972602739
34 0.5945205479452055
35 0.684931506849315
36 0.7753424657534247
37 0.852054794520548
38 0.9397260273972603
39 1.0164383561643835
40 1.104109589041096
41 1.1863013698630136
42 1.2728102189781023
43 1.35492700729927
44 1.4343065693430657
45 1.5136861313868615
46 1.6012773722627738
47 1.686131386861314
48 1.770985401459854
49 1.853102189781022
50 1.9434306569343067
51 2.02007299270073
52 2.10492700729927
53 2.184306569343066
54 2.2751540041067764
55 2.3572895277207393
56 2.433949349760438
57 2.518822724161533
58 2.6009582477754964
59 2.6830937713894594
60 2.7707049965776864
61 2.8528405201916494
62 2.940451745379877
63 3.0198494182067077
64 3.1047227926078027
65 3.1813826146475015
66 3.2749178532311065
67 3.3543263964950714
68 3.4309967141292446
69 3.521358159912377
70 3.6007667031763417
71 3.6801752464403066
72 3.7677984665936473
73 3.8526834611171963
74 3.940306681270537
75 4.019715224534502
76 4.101861993428258
77 4.186746987951808
78 4.274760383386581
79 4.351437699680511
80 4.4281150159744405
81 4.518484710178001
82 4.600638977635782
83 4.68553172067549
84 4.767685988133272
85 4.8498402555910545
86 4.940209949794614
87 5.0196257416704695
88 5.099041533546326
89 5.18667275216796
90 5.269847477512711
91 5.354712553773954
92 5.434102463824795
93 5.518967540086038
94 5.595619867031678
95 5.685960109503324
96 5.776300351974971
97 5.8529526789206106
98 5.940555338287055
99 6.017207665232695
100 6.10481032459914
101 6.186937817755182
102 6.275154004106776
103 6.357289527720739
104 6.433949349760438
105 6.5160848733744015
106 6.6009582477754964
107 6.685831622176591
108 6.7734428473648185
109 6.852840520191649
110 6.945927446954141
111 7.014373716632443
112 7.104722792607803
113 7.186858316221766
114 7.275022817158503
115 7.3571645877700025
116 7.433830240340736
117 7.513233951931853
118 7.600851840584119
119 7.685731670216002
120 7.7706114998478855
121 7.852753270459385
122 7.943109218132035
123 8.019774870702769
124 8.10465470033465
125 8.184058411925768
126 8.274917853231106
127 8.357064622124863
128 8.433734939759036
129 8.518619934282585
130 8.600766703176342
131 8.6829134720701
132 8.77053669222344
133 8.852683461117197
134 8.940306681270537
135 9.019715224534503
136 9.104600219058051
137 9.181270536692224
138 9.272523643603783
139 9.35191637630662
In the past, I have created log-linear interpolation functions that will allow me to interpolate discount rates based on maturities represented as dates, not as yearly numerical values; below is an example of that function.
def loglinearinterpolation(df, list_of_dates, dt_rng_name, rate_rng_name):
asofDate = pd.to_datetime(list_of_dates)
low_lim = df[df[dt_rng_name] <= asofDate].tail(1)
upper_lim = df[df[dt_rng_name] >= asofDate].head(1)
if low_lim.index == upper_lim.index:
return low_lim[rate_rng_name].iloc[0]
mat_dt_min = low_lim[dt_rng_name].iloc[0]
mat_dt_max = upper_lim[dt_rng_name].iloc[0]
y_min = low_lim[rate_rng_name].iloc[0]
y_max = upper_lim[rate_rng_name].iloc[0]
return np.exp(((np.log(y_max) - np.log(y_min))/((mat_dt_max - mat_dt_min).days))*(asofDate - mat_dt_min).days + np.log(y_min))
df_Libor_interpolated = pd.DataFrame()
df_Libor_interpolated = [(pd.to_datetime(x), loglinearinterpolation(df_libor_curve, pd.to_datetime(
x), 'Dates', 'value')) for x in df_client_curve['Date'].unique()]
I now need to do a similar task using the same formula in the return function, except linearly, not log linearly; however, my code is breaking as the values I am feeding for dates are being converted to DateTime and providing numpy and timestamp comparison errors.
I have tried using the code below as a workaround; however, it is not providing me with the values my team expects.
[np.interp(x,df_cds['Maturity'],df_cds['Company 1']) for x in df_cds_interpolated['Maturity Years']]
Any guidance or insight on how I can modify the function and formula above to work with the input data I provided would be greatly appreciated!
How to get the max value from the second column and min value from the third column in CSV file with no row headers as per the screenshot of DataFrame through defining a function?
My code is:
import pandas as pd
def minmaxvalue(filename):
# some code
minmaxvalue("my_data.cvs")
How to get the max&min value between the defining function?
i a b
1 33 99
2 35 100
3 37 101
4 39 102
5 41 103
6 43 104
7 45 105
8 47 106
9 49 107
10 51 108
11 53 109
12 55 110
13 57 111
14 59 112
15 61 113
import pandas as pd
def minmaxvalue(filename):
# reading from file
df = pd.read_csv(filename, names=['a', 'b'])
# returning max and min
return df['a'].max(), df['b'].min()
minmaxvalue("my_data.csv")
One way is this:
def minmaxvalue(filename):
minim = filename['a'][0]
maxim = filename['b'][0]
for i in range(0, len(filename)):
if minim > filename['a'][i]:
minim = filename['a'][i]
if maxim < filename['b'][i]:
maxim = filename['b'][i]
return minim, maxim
I am looking for a solution for the following problem and it just won't work the way I want to.
So my goal is to calculate a regression analysis and get the slope, intercept, rvalue, pvalue and stderr for multiple rows (this could go up to 10000). In this example, I have a file with 15 rows. Here are the first two rows:
array([
[ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,
12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22,
23, 24],
[ 100, 10, 61, 55, 29, 77, 61, 42, 70, 73, 98,
62, 25, 86, 49, 68, 68, 26, 35, 62, 100, 56,
10, 97]]
)
Full trial data set:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
100 10 61 55 29 77 61 42 70 73 98 62 25 86 49 68 68 26 35 62 100 56 10 97
57 89 25 89 48 56 67 17 98 10 25 90 17 52 85 56 18 20 74 97 82 63 45 87
192 371 47 173 202 144 17 147 174 483 170 422 285 13 77 116 500 136 276 392 220 121 441 268
The first row is the x-variable and this is the independent variable. This has to be kept fixed while iterating over every following row.
For the following row, the y-variable and thus the dependent variable, I want to calculate the slope, intercept, rvalue, pvalue and stderr and have them in a dataframe (if possible added to the same dataframe, but this is not necessary).
I tried the following code:
import pandas as pd
import scipy.stats
import numpy as np
df = pd.read_excel("Directory\\file.xlsx")
def regr(row):
r = scipy.stats.linregress(df.iloc[1:, :], row)
return r
full_dataframe = None
for index,row in df.iterrows():
x = regr(index)
if full_dataframe is None:
full_dataframe = x.T
else:
full_dataframe = full_dataframe.append([x.T])
full_dataframe.to_excel('Directory\\file.xlsx')
But this fails and gives the following error:
ValueError: all the input array dimensions except for the concatenation axis
must match exactly
I'm really lost in here.
So, I want to achieve that I have the slope, intercept, pvalue, rvalue and stderr per row, starting from the second one, because the first row is the x-variable.
Anyone has an idea HOW to do this and tell me WHY mine isn't working and WHAT the code should look like?
Thanks!!
Guessing the issue
Most likely, your problem is the format of your numbers, there are Unicode String dtype('<U21') instead of being Integer or Float.
Always check types:
df.dtypes
Cast your dataframe using:
df = df.astype(np.float64)
Below a small example showing the issue:
import numpy as np
import pandas as pd
# DataFrame without numbers (will not work for Math):
df = pd.DataFrame(['1', '2', '3'])
df.dtypes # object: placeholder for everything that is not number or timestamps (string, etc...)
# Casting DataFrame to make it suitable for Math Operations:
df = df.astype(np.float64)
df.dtypes # float64
But it is difficult to be sure of this without having the original file or data you are working with.
Carefully read the Exception
This is coherent with the Exception you get:
TypeError: ufunc 'add' did not contain a loop with signature matching types
dtype('<U21') dtype('<U21') dtype('<U21')
The method scipy.stats.linregress raises a TypeError (so it is about type) and is telling you than it cannot perform add operation because adding String dtype('<U21') does not make any sense in the context of a Linear Regression.
Understand the Design
Loading the data:
import io
fh = io.StringIO("""1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
100 10 61 55 29 77 61 42 70 73 98 62 25 86 49 68 68 26 35 62 100 56 10 97
57 89 25 89 48 56 67 17 98 10 25 90 17 52 85 56 18 20 74 97 82 63 45 87
192 371 47 173 202 144 17 147 174 483 170 422 285 13 77 116 500 136 276 392 220 121 441 268""")
df = pd.read_fwf(fh).astype(np.float)
Then we can regress the second row vs the first:
scipy.stats.linregress(df.iloc[0,:].values, df.iloc[1,:].values)
It returns:
LinregressResult(slope=0.12419744768547877, intercept=49.60998434527584, rvalue=0.11461693561751324, pvalue=0.5938303095361301, stderr=0.22949908667668056)
Assembling all together:
result = pd.DataFrame(columns=["slope", "intercept", "rvalue"])
for i, row in df.iterrows():
fit = scipy.stats.linregress(df.iloc[0,:], row)
result.loc[i] = (fit.slope, fit.intercept, fit.rvalue)
Returns:
slope intercept rvalue
0 1.000000 0.000000 1.000000
1 0.124197 49.609984 0.114617
2 -1.095801 289.293224 -0.205150
Which is, as far as I understand your question, what you expected.
The second exception you get comes because of this line:
x = regr(index)
You sent the index of the row instead of the row itself to the regression method.
I'm trying to parse some data from this website:
http://www.csfbl.com/freeagents.asp?leagueid=2237
I've written some code:
import urllib
import re
name = re.compile('<td>(.+?)')
player_id = re.compile('<td><a href="(.+?)" onclick=')
#player_id_num = re.compile('<td><a href=player.asp?playerid="(.+?)" onclick=')
stat_c = re.compile('<td class="[^"]+" align="[^"]+"><span class="[^"]?">(.+?)</span><br><span class="[^"]?">')
stat_p = re.compile('<td class="[^"]+" align="[^"]+"><span class="[^"]?">"[^"]+"</span><br><span class="[^"]?">(.+?)</span></td>')
url = 'http://www.csfbl.com/freeagents.asp?leagueid=2237'
sock = urllib.request.urlopen(url).read().decode("utf-8")
#li = name.findall(sock)
name = name.findall(sock)
player_id = player_id.findall(sock)
#player_id_num = player_id_num.findall(sock)
#age = age.findall(sock)
stat_c = stat_c.findall(sock)
stat_p = stat_p.findall(sock)
First question : player_id returns the whole url "player.asp?playerid=4209661". I was unable to get just the number part. How can I do that?
(my attempt is described in #player_id_num)
Second question: I am not able to get stat_c when span_class is empty as in "".
Is there a way I can get these resolved? I am not very familiar with RE (regular expressions), I looked up tutorials online but it's still unclear what I am doing wrong.
Very simple using the pandas library.
Code:
import pandas as pd
url = "http://www.csfbl.com/freeagents.asp?leagueid=2237"
dfs = pd.read_html(url)
# print dfs[3]
# dfs[3].to_csv("stats.csv") # Send to a CSV file.
print dfs[3].head()
Result:
0 1 2 3 4 5 6 7 8 9 10 \
0 Pos Name Age T PO FI CO SY HR RA GL
1 P George Pacheco 38 R 4858 7484 8090 7888 6777 4353 6979
2 P David Montoya 34 R 3944 5976 6673 8699 6267 6685 5459
3 P Robert Cole 34 R 5769 7189 7285 5863 6267 5868 5462
4 P Juanold McDonald 32 R 69100 5772 4953 4866 5976 67100 5362
11 12 13 14 15 16
0 AR EN RL Fatigue Salary NaN
1 3747 6171 -3 100% --- $3,672,000
2 5257 5975 -4 96% 2% $2,736,000
3 4953 5061 -4 96% 3% $2,401,000
4 5982 5263 -4 100% --- $1,890,000
You can apply whatever cleaning methods you want from here onwards. Code is rudimentary so it's up to you to improve it.
More Code:
import pandas as pd
import itertools
url = "http://www.csfbl.com/freeagents.asp?leagueid=2237"
dfs = pd.read_html(url)
df = dfs[3] # "First" stats table.
# The first row is the actual header.
# Also, notice the NaN at the end.
header = df.iloc[0][:-1].tolist()
# Fix that atrocity of a last column.
df.drop([15], axis=1, inplace=True)
# Last row is all NaNs. This particular
# table should end with Jeremy Dix.
df = df.iloc[1:-1,:]
df.columns = header
df.reset_index(drop=True, inplace=True)
# Pandas cannot create two rows without the
# dataframe turning into a nightmare. Let's
# try an aesthetic change.
sub_header = header[4:13]
orig = ["{}{}".format(h, "r") for h in sub_header]
clone = ["{}{}".format(h, "p") for h in sub_header]
# http://stackoverflow.com/a/3678930/2548721
comb = [iter(orig), iter(clone)]
comb = list(it.next() for it in itertools.cycle(comb))
# Construct the new header.
new_header = header[0:4]
new_header += comb
new_header += header[13:]
# Slow but does it cleanly.
for s, o, c in zip(sub_header, orig, clone):
df.loc[:, o] = df[s].apply(lambda x: x[:2])
df.loc[:, c] = df[s].apply(lambda x: x[2:])
df = df[new_header] # Drop the other columns.
print df.head()
More result:
Pos Name Age T POr POp FIr FIp COr COp ... RAp GLr \
0 P George Pacheco 38 R 48 58 74 84 80 90 ... 53 69
1 P David Montoya 34 R 39 44 59 76 66 73 ... 85 54
2 P Robert Cole 34 R 57 69 71 89 72 85 ... 68 54
3 P Juanold McDonald 32 R 69 100 57 72 49 53 ... 100 53
4 P Trevor White 37 R 61 66 62 64 67 67 ... 38 48
GLp ARr ARp ENr ENp RL Fatigue Salary
0 79 37 47 61 71 -3 100% $3,672,000
1 59 52 57 59 75 -4 96% $2,736,000
2 62 49 53 50 61 -4 96% $2,401,000
3 62 59 82 52 63 -4 100% $1,890,000
4 50 70 100 62 69 -4 100% $1,887,000
Obviously, what I did instead was separate the Real values from Potential values. Some tricks were used but it gets the job done at least for the first table of players. The next few ones require a degree of manipulation.
I have a dataframe (df) in pandas/python with ['Product','OrderDate','Sales'].
I noticed that some rows, values have better Distribution (like in a Histogram) than others. By "Best" meaning, the shape is more spread, or the spread of values make the shape looks wider than for other rows.
If I want to pick from say +700 Product's, those with more spread values, is there a way to do that easily in pandas/python?
txs in advance.
Caveat here is that I'm not a stats expert but basically scipy has a number of tests you can conduct on your data to test whether it could be considered to be a normalised Gaussian distribution.
Here I create 2 series one is simple a linear range and the other is a random normalised sampling with mean set to 50 and variance set to 25.
In [48]:
import pandas as pd
import scipy.stats as stats
df = pd.DataFrame({'linear':arange(100), 'normal':np.random.normal(50, 25, 100)})
df
Out[48]:
linear normal
0 0 66.565374
1 1 63.453899
2 2 65.736406
3 3 65.848908
4 4 56.916032
5 5 93.870682
6 6 89.513998
7 7 9.949555
8 8 9.727099
9 9 47.072785
10 10 62.849321
11 11 33.263309
12 12 42.168484
13 13 38.488933
14 14 51.833459
15 15 54.911915
16 16 62.372709
17 17 96.928452
18 18 65.333546
19 19 26.341462
20 20 41.692790
21 21 22.852561
22 22 15.799415
23 23 50.600141
24 24 14.234088
25 25 72.428607
26 26 45.872601
27 27 80.783253
28 28 29.561586
29 29 51.261099
.. ... ...
70 70 32.826052
71 71 35.413106
72 72 49.415386
73 73 28.998378
74 74 32.237667
75 75 86.622402
76 76 105.098296
77 77 53.176413
78 78 -7.954881
79 79 60.313761
80 80 42.739641
81 81 56.667834
82 82 68.046688
83 83 72.189683
84 84 67.125708
85 85 24.798553
86 86 58.845761
87 87 54.559792
88 88 93.116777
89 89 30.209895
90 90 80.952444
91 91 57.895433
92 92 47.392336
93 93 13.136111
94 94 26.624532
95 95 53.461421
96 96 28.782809
97 97 16.342756
98 98 64.768579
99 99 68.410021
[100 rows x 2 columns]
From this page there are a number of tests we can use which are combined to for the normaltest, namely the skewtest and kurtosistest, I cannot explain these but you can see that the p-value is poor for the linear series and is relatively closer to 1 for the normalised data:
In [49]:
print('linear skewtest teststat = %6.3f pvalue = %6.4f' % sc.stats.skewtest(df['linear']))
print('normal skewtest teststat = %6.3f pvalue = %6.4f' % sc.stats.skewtest(df['normal']))
print('linear kurtoisis teststat = %6.3f pvalue = %6.4f' % sc.stats.kurtosistest(df['linear']))
print('normal kurtoisis teststat = %6.3f pvalue = %6.4f' % sc.stats.kurtosistest(df['normal']))
print('linear normaltest teststat = %6.3f pvalue = %6.4f' % sc.stats.normaltest(df['linear']))
print('normal normaltest teststat = %6.3f pvalue = %6.4f' % sc.stats.normaltest(df['normal']))
linear skewtest teststat = 1.022 pvalue = 0.3070
normal skewtest teststat = -0.170 pvalue = 0.8652
linear kurtoisis teststat = -5.799 pvalue = 0.0000
normal kurtoisis teststat = -1.113 pvalue = 0.2656
linear normaltest teststat = 34.674 pvalue = 0.0000
normal normaltest teststat = 1.268 pvalue = 0.5304
From the scipy site:
When testing for normality of a small sample of t-distributed
observations and a large sample of normal distributed observation,
then in neither case can we reject the null hypothesis that the sample
comes from a normal distribution. In the first case this is because
the test is not powerful enough to distinguish a t and a normally
distributed random variable in a small sample.
So you'll have to try the above and see if it fits with what you want, hope this helps.
Sure. What you'd like to do here is find the 700 entries with the largest standard deviation.
pandas.DataFrame.std() will return the standard deviation for an axis, and then you just need to keep track of the entries with the highest corresponding values.
Large Standard Deviation vs. Small Standard Deviation