Comparing results in dataframe and grouping results - python

I have a dataset of email pairs, each with a score indicating how similar the two emails are.
emlgroup1 emlgroup2 scores
79 1739.eml 1742.eml 100
130 1742.eml 1739.eml 100
153 1743.eml 1744.eml 99
157 1743.eml 1748.eml 82
170 1744.eml 1743.eml 99
175 1744.eml 1748.eml 82
231 1747.eml 1750.eml 85
242 1748.eml 1743.eml 82
243 1748.eml 1744.eml 82
282 1750.eml 1747.eml 85
What I want to do now is group them automatically like so and put that in a new dataframe with one column.
group 1: 1739.eml, 1742.eml
group 2: 1743.eml, 1744.eml, 1748.eml
group 3: 1747.eml, 1750.eml
Desired Output:
Col 1
1 1739.eml 1742.eml
2 1743.eml 1744.eml 1748.eml
3 1747.eml 1750.eml
I am stuck on the logic that splits the data into separate groups/clusters. I'm really new to posting on StackOverflow, so I hope I am not committing any sins. Thanks in advance!

This is a network problem (finding connected components), which you can solve using networkx:
import networkx as nx
import pandas as pd
G=nx.from_pandas_edgelist(df, 'emlgroup1', 'emlgroup2')
l=list(nx.connected_components(G))
l
[{'1739.eml', '1742.eml'}, {'1744.eml', '1743.eml', '1748.eml'}, {'1747.eml', '1750.eml'}]
pd.Series(l).to_frame('col 1')
col 1
0 {1739.eml, 1742.eml}
1 {1744.eml, 1743.eml, 1748.eml}
2 {1747.eml, 1750.eml}
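If networkx is not available, the same grouping can be sketched with a small union-find in plain Python. This is only an illustration of the connected-components idea above: the helper name group_connected and the pairs list (copied from the question's table) are assumptions, not part of the original answer.

```python
def group_connected(pairs):
    """Group items linked by any pair, using a simple union-find."""
    parent = {}

    def find(x):
        # Register unseen items as their own root, then walk up to the root.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two groups

    groups = {}
    for item in parent:
        groups.setdefault(find(item), []).append(item)
    return sorted(sorted(members) for members in groups.values())


# (emlgroup1, emlgroup2) pairs from the question's table
pairs = [
    ("1739.eml", "1742.eml"), ("1742.eml", "1739.eml"),
    ("1743.eml", "1744.eml"), ("1743.eml", "1748.eml"),
    ("1744.eml", "1743.eml"), ("1744.eml", "1748.eml"),
    ("1747.eml", "1750.eml"), ("1748.eml", "1743.eml"),
    ("1748.eml", "1744.eml"), ("1750.eml", "1747.eml"),
]
groups = group_connected(pairs)
col_1 = [" ".join(members) for members in groups]  # one string per group
```

With the dataframe from the question, the pairs could be taken from zip(df['emlgroup1'], df['emlgroup2']), and col_1 could be wrapped in pd.DataFrame({'Col 1': col_1}) to get the one-column layout asked for.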

Related

Python: Linear interpolation for CDS rates based on maturity represented in years

I am trying to linearly interpolate a series of CDS rates. Below is the data I have available; it is represented as maturities and in years.
Maturity Company 1 Company 2
0 0.5 186.73 186.73
1 1.0 210.65 210.65
2 2.0 249.09 249.09
3 3.0 285.4 285.4
4 4.0 317.59 317.59
5 5.0 344.06 344.06
6 6.0 363.01 363.01
7 7.0 375.69 375.69
8 8.0 384.31 384.31
9 9.0 391.0 391.0
10 10.0 396.12 396.12
I am now trying to use this set of maturities and their CDS rate in a similar format to interpolate the rate; below is an example of the maturities I will need to interpolate.
Maturity Years
28 0.10410958904109589
29 0.1863013698630137
30 0.27671232876712326
31 0.3561643835616438
32 0.4328767123287671
33 0.5260273972602739
34 0.5945205479452055
35 0.684931506849315
36 0.7753424657534247
37 0.852054794520548
38 0.9397260273972603
39 1.0164383561643835
40 1.104109589041096
41 1.1863013698630136
42 1.2728102189781023
43 1.35492700729927
44 1.4343065693430657
45 1.5136861313868615
46 1.6012773722627738
47 1.686131386861314
48 1.770985401459854
49 1.853102189781022
50 1.9434306569343067
51 2.02007299270073
52 2.10492700729927
53 2.184306569343066
54 2.2751540041067764
55 2.3572895277207393
56 2.433949349760438
57 2.518822724161533
58 2.6009582477754964
59 2.6830937713894594
60 2.7707049965776864
61 2.8528405201916494
62 2.940451745379877
63 3.0198494182067077
64 3.1047227926078027
65 3.1813826146475015
66 3.2749178532311065
67 3.3543263964950714
68 3.4309967141292446
69 3.521358159912377
70 3.6007667031763417
71 3.6801752464403066
72 3.7677984665936473
73 3.8526834611171963
74 3.940306681270537
75 4.019715224534502
76 4.101861993428258
77 4.186746987951808
78 4.274760383386581
79 4.351437699680511
80 4.4281150159744405
81 4.518484710178001
82 4.600638977635782
83 4.68553172067549
84 4.767685988133272
85 4.8498402555910545
86 4.940209949794614
87 5.0196257416704695
88 5.099041533546326
89 5.18667275216796
90 5.269847477512711
91 5.354712553773954
92 5.434102463824795
93 5.518967540086038
94 5.595619867031678
95 5.685960109503324
96 5.776300351974971
97 5.8529526789206106
98 5.940555338287055
99 6.017207665232695
100 6.10481032459914
101 6.186937817755182
102 6.275154004106776
103 6.357289527720739
104 6.433949349760438
105 6.5160848733744015
106 6.6009582477754964
107 6.685831622176591
108 6.7734428473648185
109 6.852840520191649
110 6.945927446954141
111 7.014373716632443
112 7.104722792607803
113 7.186858316221766
114 7.275022817158503
115 7.3571645877700025
116 7.433830240340736
117 7.513233951931853
118 7.600851840584119
119 7.685731670216002
120 7.7706114998478855
121 7.852753270459385
122 7.943109218132035
123 8.019774870702769
124 8.10465470033465
125 8.184058411925768
126 8.274917853231106
127 8.357064622124863
128 8.433734939759036
129 8.518619934282585
130 8.600766703176342
131 8.6829134720701
132 8.77053669222344
133 8.852683461117197
134 8.940306681270537
135 9.019715224534503
136 9.104600219058051
137 9.181270536692224
138 9.272523643603783
139 9.35191637630662
In the past, I have created log-linear interpolation functions that will allow me to interpolate discount rates based on maturities represented as dates, not as yearly numerical values; below is an example of that function.
def loglinearinterpolation(df, list_of_dates, dt_rng_name, rate_rng_name):
    asofDate = pd.to_datetime(list_of_dates)
    low_lim = df[df[dt_rng_name] <= asofDate].tail(1)
    upper_lim = df[df[dt_rng_name] >= asofDate].head(1)
    if low_lim.index == upper_lim.index:
        return low_lim[rate_rng_name].iloc[0]
    mat_dt_min = low_lim[dt_rng_name].iloc[0]
    mat_dt_max = upper_lim[dt_rng_name].iloc[0]
    y_min = low_lim[rate_rng_name].iloc[0]
    y_max = upper_lim[rate_rng_name].iloc[0]
    return np.exp(((np.log(y_max) - np.log(y_min))/((mat_dt_max - mat_dt_min).days))*(asofDate - mat_dt_min).days + np.log(y_min))
df_Libor_interpolated = pd.DataFrame()
df_Libor_interpolated = [(pd.to_datetime(x),
                          loglinearinterpolation(df_libor_curve, pd.to_datetime(x), 'Dates', 'value'))
                         for x in df_client_curve['Date'].unique()]
I now need to do a similar task using the same formula in the return statement, except linearly rather than log-linearly. However, my code breaks because the numeric maturities I pass in are converted to datetimes, which raises errors when NumPy values are compared with Timestamps.
I have tried using the code below as a workaround; however, it is not providing me with the values my team expects.
[np.interp(x,df_cds['Maturity'],df_cds['Company 1']) for x in df_cds_interpolated['Maturity Years']]
Any guidance or insight on how I can modify the function and formula above to work with the input data I provided would be greatly appreciated!
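One way to adapt the formula to maturities given in years is sketched below. It assumes the maturities are sorted in ascending order; the function name linear_interpolation and the extrapolate flag are illustrative, not from the question. Note that np.interp holds the first and last rate flat outside the known range, so requested maturities below 0.5 all receive the 0.5-year rate. That may or may not be the source of the mismatch with the expected values, so the sketch makes the choice explicit.

```python
from bisect import bisect_right


def linear_interpolation(maturities, rates, x, extrapolate=False):
    """Linearly interpolate a rate at maturity x (in years).

    maturities must be sorted ascending. Outside the known range the end
    rate is held flat unless extrapolate=True, in which case the nearest
    segment is extended in a straight line.
    """
    if not extrapolate:
        if x <= maturities[0]:
            return rates[0]
        if x >= maturities[-1]:
            return rates[-1]
    # Index of the segment's right end, kept inside [1, len - 1].
    hi = min(max(bisect_right(maturities, x), 1), len(maturities) - 1)
    lo = hi - 1
    slope = (rates[hi] - rates[lo]) / (maturities[hi] - maturities[lo])
    return rates[lo] + slope * (x - maturities[lo])


# Company 1 curve from the question
maturities = [0.5, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
company_1 = [186.73, 210.65, 249.09, 285.4, 317.59, 344.06,
             363.01, 375.69, 384.31, 391.0, 396.12]

mid = linear_interpolation(maturities, company_1, 0.75)   # between 0.5 and 1.0
flat = linear_interpolation(maturities, company_1, 0.1)   # below range, held flat
extended = linear_interpolation(maturities, company_1, 0.1, extrapolate=True)
```

Which treatment of the short end is right (flat, extrapolated, or something else) is a modelling decision for the team; the code only shows where that decision enters.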

Pandas interpolation creating a horizontal line

This is a continuation of the following question:
Plot a line on a curve that is undersampled
I tried the solution provided there with real data, but I am getting a straight horizontal line. The code and the full data are pasted below:
mdcol, tvdcol = 'md_m', 'tvd_m'
df = df[[mdcol, tvdcol]].copy().set_index(mdcol)
df = df[~df.index.duplicated()]
data_intp = (df.reindex(index = range(int(df.index.min()), int(df.index.max())))
               .reset_index() # optional, you could write 'index' in the second line plot, too.
               .interpolate()
            )
data_intp
Dataframe is shown below:
md_m tvd_m
0 0.00 0.00
1 281.00 281.00
2 300.00 300.00
3 330.00 330.00
4 360.00 360.00
5 390.00 390.00
6 420.00 420.00
7 450.00 450.00
8 480.00 480.00
9 510.00 510.00
10 540.00 539.99
11 570.00 569.98
12 600.00 599.97
13 630.00 629.94
14 660.00 659.91
15 690.00 689.88
16 720.00 719.84
17 750.00 749.80
18 780.00 779.75
19 810.00 809.69
20 840.00 839.58
21 870.00 869.34
22 900.00 898.90
23 930.00 928.19
24 950.00 947.55
25 960.00 957.18
26 970.00 966.76
27 980.00 976.32
28 990.00 985.83
29 1000.00 995.32
30 1010.00 1004.77
31 1020.00 1014.20
32 1030.00 1023.60
33 1040.00 1032.96
34 1050.00 1042.29
35 1060.00 1051.56
36 1070.00 1060.78
37 1080.00 1069.94
38 1090.00 1079.05
39 1100.00 1088.11
40 1110.00 1097.12
41 1120.00 1106.10
42 1130.00 1115.03
43 1140.00 1123.91
44 1150.00 1132.73
45 1160.00 1141.48
46 1170.00 1150.17
47 1180.00 1158.80
48 1190.00 1167.37
49 1200.00 1175.86
50 1210.00 1184.28
51 1220.00 1192.61
52 1230.00 1200.88
53 1240.00 1209.09
54 1250.00 1217.24
55 1260.00 1225.36
56 1270.00 1233.50
57 1280.00 1241.70
58 1290.00 1249.95
59 1300.00 1258.22
60 1310.00 1266.50
61 1320.00 1274.79
62 1330.00 1283.11
63 1340.00 1291.46
64 1350.00 1299.84
65 1360.00 1308.23
66 1370.00 1316.64
67 1380.00 1325.08
68 1390.00 1333.55
69 1400.00 1342.05
70 1410.00 1350.59
71 1420.00 1359.16
72 1430.00 1367.75
73 1440.00 1376.37
74 1450.00 1385.00
75 1460.00 1393.65
76 1470.00 1402.31
77 1480.00 1411.01
78 1490.00 1419.75
79 1500.00 1428.51
80 1510.00 1437.30
81 1520.00 1446.11
82 1530.00 1454.92
83 1540.00 1463.71
84 1550.00 1472.46
85 1560.00 1481.20
86 1570.00 1489.93
87 1580.00 1498.65
88 1590.00 1507.37
89 1600.00 1516.09
90 1610.00 1524.84
91 1620.00 1533.62
92 1630.00 1542.40
93 1640.00 1551.18
94 1650.00 1559.96
95 1660.00 1568.74
96 1670.00 1577.53
97 1680.00 1586.29
98 1690.00 1595.01
99 1700.00 1603.69
100 1710.00 1612.36
101 1720.00 1621.02
102 1730.00 1629.66
103 1740.00 1638.27
104 1750.00 1646.84
105 1760.00 1655.35
106 1770.00 1663.83
107 1780.00 1672.27
108 1790.00 1680.65
109 1800.00 1688.97
110 1810.00 1697.23
111 1820.00 1705.42
112 1830.00 1713.54
113 1840.00 1721.60
114 1850.00 1729.61
115 1860.00 1737.63
116 1870.00 1745.66
117 1880.00 1753.69
118 1890.00 1761.72
119 1900.00 1769.70
120 1910.00 1777.61
121 1920.00 1785.44
122 1930.00 1793.20
123 1940.00 1800.86
124 1950.00 1808.43
125 1960.00 1815.92
126 1970.00 1823.31
127 1980.00 1830.62
128 1990.00 1837.83
129 2000.00 1844.95
130 2010.00 1851.96
131 2020.00 1858.89
132 2030.00 1865.76
133 2040.00 1872.58
134 2050.00 1879.35
135 2060.00 1886.05
136 2070.00 1892.70
137 2080.00 1899.28
138 2090.00 1905.78
139 2100.00 1912.20
140 2110.00 1918.50
141 2120.00 1924.66
142 2130.00 1930.68
143 2140.00 1936.57
144 2150.00 1942.34
145 2160.00 1947.97
146 2170.00 1953.47
147 2180.00 1958.83
148 2190.00 1964.06
149 2200.00 1969.16
150 2210.00 1974.12
151 2220.00 1978.93
152 2230.00 1983.63
153 2240.00 1988.25
154 2250.00 1992.78
155 2260.00 1997.23
156 2270.00 2001.60
157 2280.00 2005.87
158 2290.00 2010.06
159 2300.00 2014.15
160 2310.00 2018.12
161 2320.00 2021.97
162 2330.00 2025.68
163 2340.00 2029.25
164 2373.20 2039.67
165 2401.60 2047.31
166 2430.80 2054.90
167 2459.70 2062.45
168 2488.30 2069.84
169 2488.30 2069.88
170 2489.97 2070.30
171 2493.30 2071.11
172 2503.50 2073.51
173 2519.97 2077.32
174 2549.97 2083.99
175 2563.51 2086.88
176 2579.97 2090.18
177 2609.97 2095.34
178 2639.97 2099.36
179 2662.86 2101.68
180 2752.86 2109.47
181 2759.97 2110.08
182 2789.97 2112.33
183 2819.97 2114.10
184 2849.97 2115.39
185 2879.97 2116.19
186 2902.87 2116.48
187 2909.96 2116.53
188 2939.96 2116.72
189 2969.96 2116.92
190 2999.96 2117.11
191 3029.96 2117.31
192 3059.96 2117.51
193 3089.96 2117.70
194 3119.96 2117.90
195 3149.96 2118.09
196 3179.96 2118.29
197 3209.96 2118.49
198 3239.96 2118.68
199 3252.87 2118.76
200 3352.87 2119.41
201 3359.96 2119.45
202 3389.96 2119.65
203 3419.96 2119.84
204 3449.96 2120.04
205 3479.96 2120.23
206 3509.96 2120.43
207 3539.96 2120.62
208 3569.96 2120.82
209 3599.96 2121.01
210 3629.96 2121.21
211 3652.87 2121.35
212 3779.95 2122.17
213 3852.87 2122.64
Plotting shows the horizontal line:
TOOLS = ["box_zoom", "reset", "save", "crosshair", "pan", "wheel_zoom" ,"lasso_select"]
p = figure(plot_width = 1600,
           plot_height = 800, tools = TOOLS,
           title = 'Well Survey', toolbar_location='above')
p.line(data_intp[mdcol], data_intp[tvdcol], line_width = 2, color='red')
show(p)
I am not sure where the interpolation is going wrong. I was hoping it would interpolate between the points. Does anyone have an idea what I am doing wrong here?
The issue with your data (and this proposed solution) is that you have a single duplicate value in md_m:
chk = df.groupby('md_m').agg({'tvd_m':['count',lambda x: list(x)]})
print(chk[chk['tvd_m']['count']>1])
returns the single duplicated depth: md_m 2488.30 occurs twice, with tvd_m values 2069.84 and 2069.88.
Pandas can't "reindex from a duplicate axis", which is what this approach relies on, and linear interpolation won't work well either when two identical x values have two distinct y values.
An extra layer of QA could be done on the input data: inspect it beforehand, as my snippet does, and take a groupby average or something similar if appropriate.
The only other thing I'd point out is that using a range (i.e. integers) for the reindexing is unnecessary: you should be able to reindex with floats at any step size you want.
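The QA step suggested above could look roughly like the following sketch. The small survey frame holds a few rows copied from the question's data, and averaging the duplicated depth is just one possible resolution, assumed here for illustration.

```python
import pandas as pd

# A few rows from the survey, including the repeated depth.
survey = pd.DataFrame({
    "md_m": [2459.70, 2488.30, 2488.30, 2489.97],
    "tvd_m": [2062.45, 2069.84, 2069.88, 2070.30],
})

# Show every row whose md_m occurs more than once.
dupes = survey[survey["md_m"].duplicated(keep=False)]

# One possible resolution: average tvd_m over each repeated md_m.
deduped = survey.groupby("md_m", as_index=False)["tvd_m"].mean()
```

Keeping the first or last reading instead of the mean would be equally easy; which is appropriate depends on how the survey was recorded.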
Thanks for the answer gmerrit123, but I believe I already avoid that error with this line:
df = df[~df.index.duplicated()]
What solved it for me was converting the md column to int: the reindexing ran with a step of 1 meter, but the raw data had two decimal places, so depths such as 2373.20 never matched a label in the new integer index.
mdcol, tvdcol = 'md_m', 'tvd_m'
df = data[[mdcol, tvdcol]].copy()
df[mdcol] = df[mdcol].astype('int')
df = df.set_index(mdcol)
df = df[~df.index.duplicated()]
data_intp = (df.reindex(index = range(int(df.index.min()), int(df.index.max()+ 1)))
               .reset_index()
               .interpolate()
            )
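An alternative that avoids truncating md_m to integers is sketched below, assuming the index is unique. In the original code, the integer range only matches depths that happen to be whole numbers, so every row from 2373.20 onward is dropped and the tail of the reindexed frame is NaN. The default interpolate() then appears to carry the last valid value forward, which would explain the horizontal line. Keeping the original points in the index and interpolating against the index values sidesteps this. The four-row frame and the 1 m grid are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# A few rows in the shape of the survey data (md_m has decimals).
df = pd.DataFrame({
    "md_m": [2340.00, 2373.20, 2401.60, 2430.80],
    "tvd_m": [2029.25, 2039.67, 2047.31, 2054.90],
}).set_index("md_m")

# Target grid: every whole metre inside the surveyed range.
grid = np.arange(np.ceil(df.index.min()), np.floor(df.index.max()) + 1)

# Keep the original points, add the grid points, then interpolate using
# the index values as x-coordinates (method="index").
combined = df.reindex(df.index.union(grid)).interpolate(method="index")

# Keep only the grid rows for plotting.
data_intp = combined.loc[grid].rename_axis("md_m").reset_index()
```

Because the measured depths stay in the index during interpolation, no survey point is lost before the values on the grid are computed.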

Using groupby() on a dataframe in pandas results in an IndexError

I have this dataframe:
x y z parameter
0 26 24 25 Age
1 35 37 36 Age
2 57 52 54.5 Age
3 160 164 162 Hgt
4 182 163 172.5 Hgt
5 175 167 171 Hgt
6 95 71 83 Wgt
7 110 68 89 Wgt
8 89 65 77 Wgt
I'm using pandas to get this final result:
x y parameter
0 160 164 Hgt
1 182 163 Hgt
2 175 167 Hgt
I'm using groupby() to extract and isolate the rows that share the same parameter, Hgt, from the original dataframe.
First, I added a column to set it as an index:
df = df.insert(0,'index', [count for count in range(df.shape[0])], True)
And the dataframe came out like this:
index x y z parameter
0 0 26 24 25 Age
1 1 35 37 36 Age
2 2 57 52 54.5 Age
3 3 160 164 162 Hgt
4 4 182 163 172.5 Hgt
5 5 175 167 171 Hgt
6 6 95 71 83 Wgt
7 7 110 68 89 Wgt
8 8 89 65 77 Wgt
Then, I used the following code to group based on index and extract the columns I need:
df1 = df.groupby('index')[['x', 'y','parameter']]
And the output was:
x y parameter
0 26 24 Age
1 35 37 Age
2 57 52 Age
3 160 164 Hgt
4 182 163 Hgt
5 175 167 Hgt
6 95 71 Wgt
7 110 68 Wgt
8 89 65 Wgt
After that, I used the following code to isolate only Hgt values:
df2 = df1[df1['parameter'] == 'Hgt']
When I ran df2, I got an error saying:
IndexError: Column(s) ['x', 'y', 'parameter'] already selected
Am I missing something here? What should I do to get the final result?
Because you asked what you did wrong, let me point out the code that is unnecessary or incorrect.
Without any judgement (this is just to help you improve future code), almost everything is incorrect. It feels like a succession of complicated ways to do useless things. Let me give some details:
df = df.insert(0,'index', [count for count in range(df.shape[0])], True)
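This seems a very convoluted way to do df.reset_index(). Even [count for count in range(df.shape[0])] could have been simplified by using range(df.shape[0]) directly. Note also that insert works in place and returns None, so assigning its result back to df replaces the dataframe with None.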
But this step is not even needed for a groupby as you can group by index level:
df.groupby(level=0)
But... the groupby is useless anyways as you only have single membered groups.
Also, when you do:
df1 = df.groupby('index')[['x', 'y','parameter']]
df1 is not a DataFrame but a DataFrameGroupBy object. Storing one in a variable is very useful when you know what you're doing, but here it causes the error, because you assumed it was a DataFrame. You need to apply an aggregation or transformation method of the DataFrameGroupBy object to get back a DataFrame, which you didn't (likely because, as seen above, there isn't much of interest to do on single-membered groups).
So when you run:
df1[df1['parameter'] == 'Hgt']
again, everything goes wrong: df1['parameter'] is equivalent to df.groupby('index')[['x', 'y','parameter']]['parameter'], which is the cause of the error, because you select 'parameter' twice. Even without this error, the equality comparison would give a single True/False, as you still have a DataFrameGroupBy and not a DataFrame, and indexing with it would incorrectly try to select a nonexistent column of the DataFrameGroupBy.
I hope it helped!
Do you really need groupby?
>>> df.loc[df['parameter'] == 'Hgt', ['x', 'y', 'parameter']].reset_index(drop=True)
x y parameter
0 160 164 Hgt
1 182 163 Hgt
2 175 167 Hgt
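The point about DataFrameGroupBy objects can be illustrated with a short sketch. The frame is rebuilt from the question's table; the variable names grouped, kind, means and hgt are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    "x": [26, 35, 57, 160, 182, 175, 95, 110, 89],
    "y": [24, 37, 52, 164, 163, 167, 71, 68, 65],
    "z": [25, 36, 54.5, 162, 172.5, 171, 83, 89, 77],
    "parameter": ["Age"] * 3 + ["Hgt"] * 3 + ["Wgt"] * 3,
})

# Selecting columns on a groupby gives a GroupBy object, not a DataFrame.
grouped = df.groupby("parameter")[["x", "y"]]
kind = type(grouped).__name__

# An aggregation turns the GroupBy object back into a DataFrame.
means = grouped.mean()

# If a groupby is wanted anyway, get_group pulls out one group's rows.
hgt = (df.groupby("parameter")
         .get_group("Hgt")[["x", "y", "parameter"]]
         .reset_index(drop=True))
```

Grouping by parameter rather than by a row counter is what makes the groups meaningful here; for simply isolating the Hgt rows, the boolean mask shown above remains the shorter route.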

Pandas: How to (cleanly) unpivot two columns with same category?

I'm trying to unpivot two columns inside a pandas dataframe. The transformation I seek would be the inverse of this question.
We start with a dataset that looks like this:
import pandas as pd
import numpy as np
df_orig = pd.DataFrame(data=np.random.randint(255, size=(4,5)),
                       columns=['accuracy','time_a','time_b','memory_a', 'memory_b'])
df_orig
accuracy time_a time_b memory_a memory_b
0 6 118 170 102 239
1 241 9 166 159 162
2 164 70 76 228 121
3 228 121 135 128 92
I wish to unpivot both the memory and time columns, obtaining this dataset as a result:
df
accuracy memory category time
0 6 102 a 118
1 241 159 a 9
2 164 228 a 70
3 228 128 a 121
12 6 239 b 170
13 241 162 b 166
14 164 121 b 76
15 228 92 b 135
So far I have managed to get my desired output using df.melt() twice plus some extra commands:
df = df_orig.copy()
# Unpivot memory columns
df = df.melt(id_vars=['accuracy','time_a', 'time_b'],
             value_vars=['memory_a', 'memory_b'],
             value_name='memory',
             var_name='mem_cat')
# Unpivot time columns
df = df.melt(id_vars=['accuracy','memory', 'mem_cat'],
             value_vars=['time_a', 'time_b'],
             value_name='time',
             var_name='time_cat')
# Keep only the 'a'/'b' as categories
df.mem_cat = df.mem_cat.str[-1]
df.time_cat = df.time_cat.str[-1]
# Keeping only the columns whose categories match (DIRTY!)
df = df[df.mem_cat==df.time_cat]
# Removing the duplicated category column.
df = df.drop(columns='time_cat').rename(columns={"mem_cat":'category'})
Given how easy it was to solve the inverse question, I believe my code is way too complex. Can anyone do it better?
Use wide_to_long:
np.random.seed(123)
df_orig = pd.DataFrame(data=np.random.randint(255, size=(4,5)),
                       columns=['accuracy','time_a','time_b','memory_a', 'memory_b'])
df = (pd.wide_to_long(df_orig.reset_index(),
                      stubnames=['time','memory'],
                      i='index',
                      j='category',
                      sep='_',
                      suffix=r'\w+')
        .reset_index(level=1)
        .reset_index(drop=True)
        .rename_axis(None))
print(df)
category accuracy time memory
0 a 254 109 66
1 a 98 230 83
2 a 123 57 225
3 a 113 126 73
4 b 254 126 220
5 b 98 17 106
6 b 123 214 96
7 b 113 47 32
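To make explicit what wide_to_long does with the time and memory stubs, the same reshaping can be sketched by hand: take one slice per category suffix, rename its columns to the stub names, and stack the slices. This is only an illustration using the values from the question's example table, not a replacement for the answer above.

```python
import pandas as pd

df_orig = pd.DataFrame({
    "accuracy": [6, 241, 164, 228],
    "time_a": [118, 9, 70, 121],
    "time_b": [170, 166, 76, 135],
    "memory_a": [102, 159, 228, 128],
    "memory_b": [239, 162, 121, 92],
})

parts = []
for cat in ["a", "b"]:
    # Columns belonging to this category, renamed to the bare stub names.
    part = df_orig[["accuracy", f"time_{cat}", f"memory_{cat}"]].copy()
    part.columns = ["accuracy", "time", "memory"]
    part["category"] = cat
    parts.append(part)

# Stack the per-category slices into one long frame.
df = pd.concat(parts, ignore_index=True)
```

The loop needs the list of suffixes spelled out, which is the bookkeeping wide_to_long handles through its sep and suffix arguments.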

Series of Strings to Arrays

I have some image arrays that I'm trying to run a regression on, but somehow the csv file is being imported as a series of strings instead of a series of arrays.
In: image_train = pd.read_csv('image_train_data.csv')
In: image_train['image_array'].head()
Out: 0 [73 77 58 71 68 50 77 69 44 120 116 83 125 120...
1 [7 5 8 7 5 8 5 4 6 7 4 7 11 5 9 11 5 9 17 11 1...
2 [169 122 65 131 108 75 193 196 192 218 221 222...
3 [154 179 152 159 183 157 165 189 162 174 199 1...
4 [216 195 180 201 178 160 210 184 164 212 188 1...
Name: image_array, dtype: object
When I try to run the regression using image_train['image_array'] I get
ValueError: could not convert string to float: '[255 255 255 255 255 255 255 255...
The array is a string.
Is there a way to transform the strings to arrays for the entire series?
You can use converters to describe how you want to read that field in. The easiest way would be to define your own converter to treat that column as a list, e.g.:
import ast
def conv(x):
    return ast.literal_eval(','.join(x.split(' ')))
image_train = pd.read_csv('image_train_data.csv', converters={'image_array':conv})
While AChampion's solution looks good, I went ahead and found another solution:
image_train['image_array'].str.findall(r'\d+').apply(lambda x: list(map(int, x)))
Which would be useful if you already had it loaded and didn't want to/couldn't load it again.
Here's another solution that works well for just evaluating a literal string representation of a list:
pd.eval(image_train['image_array'])
However, if it's separated by spaces you could do:
pd.eval(image_train['image_array'].str.replace(' ', ','))
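A plain-Python parser is another option when the values are space-separated integers inside square brackets, as in the output shown in the question. The helper name parse_array and the sample rows are assumptions for illustration.

```python
def parse_array(text):
    """Turn a string like '[73 77 58]' into a list of ints."""
    # split() with no argument tolerates repeated spaces and line breaks.
    return [int(token) for token in text.strip().strip("[]").split()]


rows = ["[73 77 58 71]", "[7 5 8]", "[169  122 65]"]
arrays = [parse_array(row) for row in rows]
```

On the loaded Series this could be applied per row with image_train['image_array'].apply(parse_array), wrapping each result in np.array if the regression needs arrays rather than lists.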
