+------+------+------+------+-------+------+------+-------+
|      |      |      |      | USD   | EUR  | JPY  | RUP   |
|      |      |      |      | Case  | Cons | Case | Case  |
|      |      |      |      | High  | Low  | CWM  | AEP   |
| Col1 | Col2 | Col3 | Col4 | Owner | OPS  | VH   | Delta |
+------+------+------+------+-------+------+------+-------+
| V1   | V2   | V3   | V4   | V5    | V6   | V7   | V8    |
| V1a  | V2a  | V3a  | V4a  | V5a   | V6a  | V7a  | V8a   |
+------+------+------+------+-------+------+------+-------+
As requested, here is the sample data as output by df.to_dict():
{('Unnamed: 0_level_0', 'Unnamed: 0_level_1', 'Unnamed: 0_level_2', 'Year'): {0: 2020, 1: 2020, 2: 2020, 3: 2020, 4: 2020, 5: 2020, 6: 2020, 7: 2020, 8: 2020, 9: 2020, 10: 2020, 11: 2020, 12: 2020, 13: 2020, 14: 2020, 15: 2020, 16: 2020, 17: 2020, 18: 2020, 19: 2020, 20: 2020, 21: 2020, 22: 2020, 23: 2020, 24: 2020, 25: 2020, 26: 2020, 27: 2020, 28: 2020, 29: 2020, 30: 2020, 31: 2020, 32: 2020, 33: 2020, 34: 2020, 35: 2020, 36: 2020, 37: 2020, 38: 2020, 39: 2020, 40: 2020, 41: 2020, 42: 2020, 43: 2020, 44: 2020, 45: 2020, 46: 2020, 47: 2020}, ('Unnamed: 1_level_0', 'Unnamed: 1_level_1', 'Unnamed: 1_level_2', 'Month'): {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1, 9: 1, 10: 1, 11: 1, 12: 1, 13: 1, 14: 1, 15: 1, 16: 1, 17: 1, 18: 1, 19: 1, 20: 1, 21: 1, 22: 1, 23: 1, 24: 1, 25: 1, 26: 1, 27: 1, 28: 1, 29: 1, 30: 1, 31: 1, 32: 1, 33: 1, 34: 1, 35: 1, 36: 1, 37: 1, 38: 1, 39: 1, 40: 1, 41: 1, 42: 1, 43: 1, 44: 1, 45: 1, 46: 1, 47: 1}, ('Unnamed: 2_level_0', 'Unnamed: 2_level_1', 'Unnamed: 2_level_2', 'Day'): {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1, 9: 1, 10: 1, 11: 1, 12: 1, 13: 1, 14: 1, 15: 1, 16: 1, 17: 1, 18: 1, 19: 1, 20: 1, 21: 1, 22: 1, 23: 1, 24: 2, 25: 2, 26: 2, 27: 2, 28: 2, 29: 2, 30: 2, 31: 2, 32: 2, 33: 2, 34: 2, 35: 2, 36: 2, 37: 2, 38: 2, 39: 2, 40: 2, 41: 2, 42: 2, 43: 2, 44: 2, 45: 2, 46: 2, 47: 2}, ('Unnamed: 3_level_0', 'Unnamed: 3_level_1', 'Unnamed: 3_level_2', 'Hour'): {0: 0, 1: 1, 2: 2, 3: 3, 4: 4, 5: 5, 6: 6, 7: 7, 8: 8, 9: 9, 10: 10, 11: 11, 12: 12, 13: 13, 14: 14, 15: 15, 16: 16, 17: 17, 18: 18, 19: 19, 20: 20, 21: 21, 22: 22, 23: 23, 24: 0, 25: 1, 26: 2, 27: 3, 28: 4, 29: 5, 30: 6, 31: 7, 32: 8, 33: 9, 34: 10, 35: 11, 36: 12, 37: 13, 38: 14, 39: 15, 40: 16, 41: 17, 42: 18, 43: 19, 44: 20, 45: 21, 46: 22, 47: 23}, ('USD', 'Cons', 'very high', 'Hub1'): {0: 23.06, 1: 21.49, 2: 21.73, 3: 21.58, 4: 21.67, 5: 22.78, 6: 27.15, 7: 26.09, 8: 26.23, 9: 28.21, 10: 29.21, 11: 31.97, 12: 30.45, 13: 30.45, 14: 30.45, 15: 29.14, 16: 28.28, 17: 26.35, 18: 26.32, 19: 27.01, 20: 26.34, 21: 28.22, 22: 27.77, 23: 26.94, 24: 24.16, 25: 22.74, 26: 22.67, 27: 22.67, 28: 22.74, 29: 23.14, 30: 27.81, 31: 27.87, 32: 28.05, 33: 27.91, 34: 32.66, 35: 35.14, 36: 33.32, 37: 36.17, 38: 38.33, 39: 31.75, 40: 30.9, 41: 26.36, 42: 27.17, 43: 28.17, 44: 26.17, 45: 26.5, 46: 28.95, 47: 26.94}, ('EUR', 'Case', 'CWM', 'Hub2'): {0: 18.59, 1: 18.32, 2: 18.32, 3: 18.32, 4: 18.32, 5: 19.19, 6: 22.57, 7: 25.38, 8: 25.53, 9: 25.9, 10: 26.47, 11: 26.47, 12: 26.09, 13: 25.59, 14: 25.35, 15: 24.97, 16: 24.22, 17: 25.22, 18: 25.49, 19: 26.19, 20: 25.63, 21: 25.1, 22: 21.93, 23: 19.61, 24: 19.4, 25: 18.75, 26: 18.85, 27: 18.75, 28: 18.88, 29: 19.41, 30: 23.97, 31: 27.07, 32: 27.23, 33: 29.21, 34: 30.49, 35: 28.52, 36: 27.49, 37: 26.93, 38: 26.71, 39: 25.76, 40: 25.24, 41: 25.67, 42: 26.72, 43: 27.98, 44: 26.73, 45: 25.97, 46: 22.34, 47: 19.47}, ('USD', 'Cons', 'Ventyx', 'Hub3'): {0: 19.78, 1: 20.96, 2: 21.58, 3: 21.5, 4: 21.27, 5: 22.59, 6: 26.22, 7: 26.78, 8: 26.78, 9: 26.97, 10: 26.97, 11: 26.97, 12: 26.53, 13: 26.34, 14: 26.5, 15: 26.22, 16: 25.6, 17: 26.5, 18: 26.74, 19: 27.44, 20: 26.87, 21: 26.5, 22: 23.2, 23: 23.58, 24: 22.74, 25: 22.31, 26: 22.27, 27: 22.27, 28: 22.74, 29: 22.84, 30: 27.79, 31: 31.63, 32: 29.6, 33: 29.25, 34: 30.53, 35: 28.51, 36: 27.48, 37: 26.97, 38: 26.74, 39: 26.53, 40: 26.5, 41: 26.92, 42: 28.89, 43: 30.24, 44: 28.38, 45: 27.38, 46: 24.39, 47: 23.2}}
That is about as good a representation as I can make for this file.
Columns 1-4 have a single header; columns 5-N (yes, N, because we don't know how many) have 4 headers.
The dataframe needs to look like this:
+------+------+------+------+------+------+------+------+------+
| Col1 | Col2 | Col3 | Col4 | NCol1| NCol2|NCol3 | NCol4| Col9 |
+------+------+------+------+------+------+------+------+------+
| V1 | V2 | V3 | V4 | USD | Case | High | Owner| V5 |
| V1a | V2a | V3a | V4a | USD | Case | High | Owner| V5a |
| V1a | V2a | V3a | V4a | EUR | Cons | Low | Ops | V6 |
| V1a | V2a | V3a | V4a | EUR | Cons | Low | Ops | V6a |
| V1a | V2a | V3a | V4a | JPY | Case | CWM | VH | V7 |
| V1a | V2a | V3a | V4a | JPY | Case | CWM | VH | V7a |
| V1a | V2a | V3a | V4a | RUP | Case | AEP | Delta| V8 |
| V1a | V2a | V3a | V4a | RUP | Case | AEP | Delta| V8a |
+------+------+------+------+------+------+------+------+------+
So, essentially: pivot the headers of columns 5 through N into new columns, so that each row of data stays aligned with the first 4 columns and with the headers its values originally sat under.
I tried:
df = pd.read_csv(file,header=[0,1,2,3])
df.melt(var_name=['a','b','c','d'], value_name='e')
Also:
df2 = df.melt(id_vars=['Year','Month','Day','Hour'], col_level=3)
And:
df2 = df.stack().stack().stack().stack()
That last one is very close, but it also stacks the first 4 columns.
However, that doesn't work, as it gives me just col1 and col2.
I feel like I am shooting in the dark, but this is what I could pull out. Let me know if it is not what you want; if it isn't, kindly post a sample output based on the dict you posted so others can chime in, and I'll gladly delete this hack:
df = pd.DataFrame(sample)
# flatten the 4-level column MultiIndex into single underscore-joined names
df.columns = df.columns.to_flat_index()
df.columns = ['_'.join(i) for i in df.columns]
# melt everything except the four id columns
df = df.melt(id_vars=['Unnamed: 0_level_0_Unnamed: 0_level_1_Unnamed: 0_level_2_Year',
                      'Unnamed: 1_level_0_Unnamed: 1_level_1_Unnamed: 1_level_2_Month',
                      'Unnamed: 2_level_0_Unnamed: 2_level_1_Unnamed: 2_level_2_Day',
                      'Unnamed: 3_level_0_Unnamed: 3_level_1_Unnamed: 3_level_2_Hour'])
# keep only the trailing (real) name of each id column
df.columns = [i.split('_')[-1] for i in df.columns]
# split the melted 'variable' back into its four original header levels
pd.concat([df, df.variable.str.split('_', expand=True)], axis=1)
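For completeness, here is a more direct sketch of the same idea (assuming the file is read with header=[0,1,2,3] as in the question; the final column names are placeholders I made up). Setting the first four columns as the index first means the repeated stack() no longer touches them:
import pandas as pd

df = pd.read_csv(file, header=[0, 1, 2, 3])
# select the four id columns by position, so the 'Unnamed: ...' fillers
# never need to be spelled out
id_cols = list(df.columns[:4])
long_df = (
    df.set_index(id_cols)
      .stack([0, 1, 2, 3])   # pivot all four header levels into rows
      .reset_index()
)
# placeholder names: four id columns, four former header levels, the value
long_df.columns = ['Year', 'Month', 'Day', 'Hour',
                   'Currency', 'Case', 'Scenario', 'Hub', 'Value']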
I need to calculate a loss/profit carried forward for various years for various mappings. The test data looks like the following:
import pandas as pd
data = {'combined_line': {0: 'COMB', 1: 'COMB', 2: 'COMB', 3: 'COMB', 4: 'COMB', 5: 'COMB', 6: 'COMB', 7: 'COMB', 8: 'COMB', 9: 'COMB', 10: 'COMB', 11: 'COMB', 12: 'COMB', 13: 'COMB', 14: 'COMB', 15: 'COMB', 16: 'COMB', 17: 'COMB', 18: 'COMB', 19: 'COMB', 20: 'COMB', 21: 'COMB', 22: 'COMB', 23: 'COMB', 24: 'COMB', 25: 'COMB', 26: 'COMB', 27: 'COMB', 28: 'COMB', 29: 'COMB', 30: 'COMB', 31: 'COMB', 32: 'COMB', 33: 'COMB', 34: 'COMB', 35: 'COMB', 36: 'COMB', 37: 'COMB', 38: 'COMB', 39: 'COMB', 40: 'COMB', 41: 'COMB', 42: 'COMB', 43: 'COMB', 44: 'COMB', 45: 'COMB', 46: 'COMB', 47: 'COMB', 48: 'COMB', 49: 'COMB', 50: 'COMB', 51: 'COMB', 52: 'COMB', 53: 'COMB', 54: 'COMB', 55: 'COMB', 56: 'COMB', 57: 'COMB', 58: 'COMB', 59: 'COMB', 60: 'COMB', 61: 'COMB', 62: 'COMB', 63: 'COMB'}, 'line': {0: 'HWNK', 1: 'HWNK', 2: 'HWNK', 3: 'HWNK', 4: 'HWNK', 5: 'HWNK', 6: 'HWNK', 7: 'HWNK', 8: 'PGIB',
9: 'PGIB', 10: 'PGIB', 11: 'PGIB', 12: 'PGIB', 13: 'PGIB', 14: 'PGIB', 15: 'PGIB', 16: 'UIGZ', 17: 'UIGZ', 18: 'UIGZ', 19: 'UIGZ', 20: 'UIGZ', 21: 'UIGZ', 22: 'UIGZ', 23: 'UIGZ', 24: 'JVSM', 25: 'JVSM', 26: 'JVSM', 27: 'JVSM', 28: 'JVSM', 29: 'JVSM', 30: 'JVSM', 31: 'JVSM', 32: 'IALH', 33: 'IALH', 34: 'IALH', 35: 'IALH', 36: 'IALH', 37: 'IALH', 38: 'IALH', 39: 'IALH', 40: 'GUER', 41: 'GUER', 42: 'GUER', 43: 'GUER', 44: 'GUER', 45: 'GUER', 46: 'GUER', 47: 'GUER', 48: 'UGQC', 49: 'UGQC', 50: 'UGQC', 51: 'UGQC', 52: 'UGQC', 53: 'UGQC', 54: 'UGQC', 55: 'UGQC', 56: 'ZBZA', 57: 'ZBZA', 58: 'ZBZA', 59: 'ZBZA', 60: 'ZBZA', 61: 'ZBZA', 62: 'ZBZA', 63: 'ZBZA'},
'Underwriting Year': {0: 2006, 1: 2007, 2: 2008, 3: 2009, 4: 2010, 5: 2011, 6: 2012, 7: 2013, 8: 2006, 9: 2007, 10: 2008, 11: 2009, 12: 2010, 13: 2011, 14: 2012, 15: 2013, 16: 2006, 17: 2007, 18: 2008, 19: 2009, 20: 2010, 21: 2011, 22: 2012, 23: 2013, 24: 2006, 25: 2007, 26: 2008, 27: 2009, 28: 2010, 29: 2011, 30: 2012, 31: 2013, 32: 2006, 33: 2007, 34: 2008, 35: 2009, 36: 2010, 37: 2011, 38: 2012, 39: 2013, 40: 2006, 41: 2007, 42: 2008, 43: 2009, 44: 2010, 45: 2011, 46: 2012, 47: 2013, 48: 2006, 49: 2007, 50: 2008, 51: 2009, 52: 2010, 53: 2011, 54: 2012, 55: 2013, 56: 2006, 57: 2007, 58: 2008, 59: 2009, 60: 2010, 61: 2011, 62: 2012, 63: 2013}, 'Loss Carried Forward Years': {0: 4, 1: 4, 2: 4, 3: 4, 4: 4, 5: 4, 6: 4, 7: 4, 8: 4, 9: 4, 10: 4, 11: 4, 12: 4, 13: 4, 14: 4, 15: 4, 16: 4, 17: 4, 18: 4, 19: 4, 20: 4, 21: 4, 22: 4, 23: 4, 24: 4, 25: 4, 26: 4, 27: 4, 28: 4, 29: 4, 30: 4, 31: 4, 32: 4, 33: 4, 34: 4, 35: 4, 36: 4, 37: 4, 38: 4, 39: 4, 40: 4, 41: 4, 42: 4, 43: 4, 44: 4, 45: 4, 46: 4, 47: 4, 48: 4, 49: 4, 50: 4, 51: 4, 52: 4, 53: 4, 54: 4, 55: 4, 56: 4, 57: 4, 58: 4, 59: 4, 60: 4, 61: 4, 62: 4, 63: 4}, 'Result': {0: 1.7782623338664507, 1: 573.5652911310642, 2: -757.5452321102866, 3: 109.5149916578, 4: -255.67441806846205, 5: -687.5363404984247, 6: -237.72375990073272, 7: 377.0590732628068, 8: 195.06552059019327, 9: 253.9139354887218, 10: -199.3089719508628, 11: -613.0298155777073, 12: 579.0530926295057, 13: 29.428579932476623, 14: 138.8491336480481, 15: 169.5509712778246, 16: -678.0475161337745, 17: 143.8572792017776, 18: 582.0521770196842, 19: 999.6608185859805, 20: 617.653356833144, 21: 324.507583333668, 22: -659.8006551374211, 23: 504.40968855532833, 24: -233.0400805626533, 25: -216.2984964245977, 26: -867.441337711643, 27:
837.8986975605346, 28: 701.1722485951575, 29: 430.6209772769762, 30: 949.027900642678, 31: 153.92299033433596, 32: 839.6369570865697, 33: -453.5140989578259, 34: -58.89747070779697, 35: -530.522608203202, 36: -463.6972938418005, 37: -468.78369264516937, 38: -541.2808912223624, 39: 330.6903172253092, 40: -638.0156450384441, 41: -304.1122851963345, 42: 437.2797841418076, 43: 561.7387061220729, 44: -503.2740733067485, 45: 433.5804400240565, 46: 475.2435623884169, 47: -405.59364491545136, 48: -415.5501796978929, 49: -935.0663192223606, 50: 171.69580433209808, 51: -554.0056030900487, 52: 45.388394682329135, 53: -440.7714651883558, 54: 59.27169133875464, 55: 40.29995988400401, 56: -812.8599999277563, 57: 86.19303814647606, 58: 655.1887822922679, 59: 62.82680301860228, 60: 22.36985316764265, 61: -964.6910496383512, 62: -830.95126121312, 63: -808.1019400083396}}
df = pd.DataFrame(data)
I need to calculate a profit/loss carried forward on the combined and individual level.
On a combined level, only a loss can be carried forward, and it can only be carried for the number of years given in the Loss Carried Forward Years column (so after 4 years a loss expires). On a combined level, the loss carried forward looks like the following:
╒════╤═════════════════════╤══════════╤════════════════════════╤═════════════════════════════════════╤═════════════════╕
│ │ Underwriting Year │ Result │ Loss Carried Forward │ Result After Loss Carried Forward │ combined_line │
╞════╪═════════════════════╪══════════╪════════════════════════╪═════════════════════════════════════╪═════════════════╡
│ 0 │ 2006 │ -1741.03 │ 0.00 │ -1741.03 │ COMB │
├────┼─────────────────────┼──────────┼────────────────────────┼─────────────────────────────────────┼─────────────────┤
│ 1 │ 2007 │ -851.46 │ -1741.03 │ -2592.49 │ COMB │
├────┼─────────────────────┼──────────┼────────────────────────┼─────────────────────────────────────┼─────────────────┤
│ 2 │ 2008 │ -36.98 │ -2592.49 │ -2629.47 │ COMB │
├────┼─────────────────────┼──────────┼────────────────────────┼─────────────────────────────────────┼─────────────────┤
│ 3 │ 2009 │ 874.08 │ -2629.47 │ -1755.39 │ COMB │
├────┼─────────────────────┼──────────┼────────────────────────┼─────────────────────────────────────┼─────────────────┤
│ 4 │ 2010 │ 742.99 │ -1755.39 │ -1012.40 │ COMB │
├────┼─────────────────────┼──────────┼────────────────────────┼─────────────────────────────────────┼─────────────────┤
│ 5 │ 2011 │ -1343.64 │ -888.44 │ -2232.08 │ COMB │
├────┼─────────────────────┼──────────┼────────────────────────┼─────────────────────────────────────┼─────────────────┤
│ 6 │ 2012 │ -647.36 │ -1380.62 │ -2027.99 │ COMB │
├────┼─────────────────────┼──────────┼────────────────────────┼─────────────────────────────────────┼─────────────────┤
│ 7 │ 2013 │ 362.24 │ -1991.01 │ -1628.77 │ COMB │
╘════╧═════════════════════╧══════════╧════════════════════════╧═════════════════════════════════════╧═════════════════╛
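For reference, a sketch of the combined-level rule as I read it (not necessarily the original implementation): losses join a FIFO pool, profits consume the oldest losses first, and any loss still unused after Loss Carried Forward Years years drops out of the pool. Applied to the yearly sums of Result for COMB, this reproduces the table above.
import pandas as pd

def loss_carried_forward(results, n_years=4):
    # results: one row per Underwriting Year with a summed "Result" column
    results = results.sort_values("Underwriting Year").reset_index(drop=True)
    pool = {}                              # vintage year -> remaining loss (positive)
    lcf, after = [], []
    for _, row in results.iterrows():
        year, result = row["Underwriting Year"], row["Result"]
        # drop losses that have been carried for more than n_years
        pool = {y: v for y, v in pool.items() if year - y <= n_years}
        lcf.append(-sum(pool.values()))
        if result < 0:
            pool[year] = -result           # this year's loss joins the pool
        else:
            remaining = result             # profits absorb the oldest losses first
            for y in sorted(pool):
                used = min(pool[y], remaining)
                pool[y] -= used
                remaining -= used
            pool = {y: v for y, v in pool.items() if v > 0}
        after.append(result + lcf[-1])
    results["Loss Carried Forward"] = lcf
    results["Result After Loss Carried Forward"] = after
    return results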
The problem I am having is calculating the individual lines' profit/loss carried forward. To get those values, you need to carry profits and losses forward in a way that balances with the combined level.
I have written a test data creator:
import itertools
import random
import string
from typing import Any, Dict, List

def generate_data(combined_line: str) -> List[Dict[str, Any]]:
    data: List[Dict[str, Any]] = []
    # Underwriting Years
    end_uwy: int = random.randint(2001, 2022)
    start_uwy: int = random.randint(2000, end_uwy - 1)
    uwy_list = [i for i in range(start_uwy, end_uwy)]
    # pick 2-10 random four-letter line codes
    lines = random.sample(range(1, 456976), random.randint(2, 10))
    alphabets_list: List[str] = list(string.ascii_uppercase)
    keywords = [''.join(i) for i in itertools.product(alphabets_list, repeat=4)]
    lines_list: List[str] = [keywords[i] for i in lines]
    loss_carried_forward_years: int = random.randint(3, 10)
    for line in lines_list:
        for uw_year in uwy_list:
            data_dict: Dict[str, Any] = {
                "combined_line": combined_line,
                "line": line,
                "Underwriting Year": uw_year,
                "Loss Carried Forward Years": loss_carried_forward_years,
                "Result": random.uniform(-1000, 1000),
            }
            data.append(data_dict)
    return data
To check that the results balance, I do the following:
from pandas.testing import assert_frame_equal

grouped_df = individ_df.groupby(by=["Underwriting Year", "combined_line"]).sum().reset_index()
assert_frame_equal(combined_df, grouped_df)
With the code I have written, I can't get it right going back from the combined level to the individual level, so that if you group and sum the individual level, it equals the combined level.
The problem with this is that the grouped data is, in aggregate, a loss or a profit (after considering the results of all the individuals together). Calculating the result after loss carried forward separately for both levels and then trying to equate them will, in most cases, not work.
This is because for some companies there would have been a negative result, which would be carried forward, whilst for others there would be a positive one, and so nothing carried forward. If for a given year the positive values outweigh the negative ones, the grouped data does not carry forward the negative individual results, which causes the difference.
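A tiny numeric sketch of that asymmetry (hypothetical numbers):
import pandas as pd

# two lines in one combined group
toy = pd.DataFrame({
    "line": ["A", "B"],
    "Underwriting Year": [2006, 2006],
    "Result": [-100.0, 150.0],
})
# individually, line A has a -100 loss to carry into 2007;
# combined, 2006 nets to +50, so nothing is carried forward
print(toy.groupby("Underwriting Year")["Result"].sum())   # 2006: 50.0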
Below is the code I wrote that calculates the two different versions; although the code is nearly identical, a difference is inevitable because of when the aggregation occurs.
import numpy as np
import pandas as pd

# data from your function
df = pd.DataFrame(generate_data("COMB"))

""" Creating the individual data """
individ_df = pd.DataFrame()
# for each individual "line" in each "combined_line"
for grp, dat in df.groupby(["combined_line", "line"]):
    # sort values by underwriting year
    dat = dat.sort_values(by="Underwriting Year")
    # loss carried forward is a shifted 4-year rolling sum of the negative results;
    # index=dat.index keeps the new Series aligned with dat's rows
    dat["Loss Carried Forward"] = (
        pd.Series(np.where(dat["Result"] < 0, dat["Result"], 0), index=dat.index)
        .rolling(4, min_periods=1).sum().shift(1).fillna(0)
    )
    # result after loss carried forward is result plus loss carried forward
    dat["Result After Loss Carried Forward"] = dat["Result"] + dat["Loss Carried Forward"]
    # concatenate this result to the dataframe
    individ_df = pd.concat([individ_df, dat], axis=0)

""" Grouped calculations """
# this is exactly the same, but grouped by combined_line, not by individual line
grouped_df = df.groupby(by=["Underwriting Year", "combined_line"]).sum().reset_index()
grouped_df["Loss Carried Forward"] = (
    pd.Series(np.where(grouped_df["Result"] < 0, grouped_df["Result"], 0))
    .rolling(4, min_periods=1).sum().shift(1).fillna(0)
)
grouped_df["Result After Loss Carried Forward"] = grouped_df["Result"] + grouped_df["Loss Carried Forward"]
""" Checking the results of the "Result After Loss Carried Forward" """
# individuals grouped
individ_df.groupby(["combined_line", "Underwriting Year"])["Result After Loss Carried Forward"].sum()
# grouped_df
grouped_df["Result After Loss Carried Forward"]
Here is the dataframe I'm working with in Python. I'm including the dataframe here with this line of code:
print(mtcars.to_dict())
{'Unnamed: 0': {0: 'Mazda RX4', 1: 'Mazda RX4 Wag', 2: 'Datsun 710', 3: 'Hornet 4 Drive', 4: 'Hornet Sportabout', 5: 'Valiant', 6: 'Duster 360', 7: 'Merc 240D', 8: 'Merc 230', 9: 'Merc 280', 10: 'Merc 280C', 11: 'Merc 450SE', 12: 'Merc 450SL', 13: 'Merc 450SLC', 14: 'Cadillac Fleetwood', 15: 'Lincoln Continental', 16: 'Chrysler Imperial', 17: 'Fiat 128', 18: 'Honda Civic', 19: 'Toyota Corolla', 20: 'Toyota Corona', 21: 'Dodge Challenger', 22: 'AMC Javelin', 23: 'Camaro Z28', 24: 'Pontiac Firebird', 25: 'Fiat X1-9', 26: 'Porsche 914-2', 27: 'Lotus Europa', 28: 'Ford Pantera L', 29: 'Ferrari Dino', 30: 'Maserati Bora', 31: 'Volvo 142E'}, 'mpg': {0: 21.0, 1: 21.0, 2: 22.8, 3: 21.4, 4: 18.7, 5: 18.1, 6: 14.3, 7: 24.4, 8: 22.8, 9: 19.2, 10: 17.8, 11: 16.4, 12: 17.3, 13: 15.2, 14: 10.4, 15: 10.4, 16: 14.7, 17: 32.4, 18: 30.4, 19: 33.9, 20: 21.5, 21: 15.5, 22: 15.2, 23: 13.3, 24: 19.2, 25: 27.3, 26: 26.0, 27: 30.4, 28: 15.8, 29: 19.7, 30: 15.0, 31: 21.4}, 'cyl': {0: 6, 1: 6, 2: 4, 3: 6, 4: 8, 5: 6, 6: 8, 7: 4, 8: 4, 9: 6, 10: 6, 11: 8, 12: 8, 13: 8, 14: 8, 15: 8, 16: 8, 17: 4, 18: 4, 19: 4, 20: 4, 21: 8, 22: 8, 23: 8, 24: 8, 25: 4, 26: 4, 27: 4, 28: 8, 29: 6, 30: 8, 31: 4}, 'disp': {0: 160.0, 1: 160.0, 2: 108.0, 3: 258.0, 4: 360.0, 5: 225.0, 6: 360.0, 7: 146.7, 8: 140.8, 9: 167.6, 10: 167.6, 11: 275.8, 12: 275.8, 13: 275.8, 14: 472.0, 15: 460.0, 16: 440.0, 17: 78.7, 18: 75.7, 19: 71.1, 20: 120.1, 21: 318.0, 22: 304.0, 23: 350.0, 24: 400.0, 25: 79.0, 26: 120.3, 27: 95.1, 28: 351.0, 29: 145.0, 30: 301.0, 31: 121.0}, 'hp': {0: 110, 1: 110, 2: 93, 3: 110, 4: 175, 5: 105, 6: 245, 7: 62, 8: 95, 9: 123, 10: 123, 11: 180, 12: 180, 13: 180, 14: 205, 15: 215, 16: 230, 17: 66, 18: 52, 19: 65, 20: 97, 21: 150, 22: 150, 23: 245, 24: 175, 25: 66, 26: 91, 27: 113, 28: 264, 29: 175, 30: 335, 31: 109}, 'drat': {0: 3.9, 1: 3.9, 2: 3.85, 3: 3.08, 4: 3.15, 5: 2.76, 6: 3.21, 7: 3.69, 8: 3.92, 9: 3.92, 10: 3.92, 11: 3.07, 12: 3.07, 13: 3.07, 14: 2.93, 15: 3.0, 16: 3.23, 17: 4.08, 18: 4.93, 19: 4.22, 20: 3.7, 21: 2.76, 22: 3.15, 23: 3.73, 24: 3.08, 25: 4.08, 26: 4.43, 27: 3.77, 28: 4.22, 29: 3.62, 30: 3.54, 31: 4.11}, 'wt': {0: 2.62, 1: 2.875, 2: 2.32, 3: 3.215, 4: 3.44, 5: 3.46, 6: 3.57, 7: 3.19, 8: 3.15, 9: 3.44, 10: 3.44, 11: 4.07, 12: 3.73, 13: 3.78, 14: 5.25, 15: 5.424, 16: 5.345, 17: 2.2, 18: 1.615, 19: 1.835, 20: 2.465, 21: 3.52, 22: 3.435, 23: 3.84, 24: 3.845, 25: 1.935, 26: 2.14, 27: 1.513, 28: 3.17, 29: 2.77, 30: 3.57, 31: 2.78}, 'qsec': {0: 16.46, 1: 17.02, 2: 18.61, 3: 19.44, 4: 17.02, 5: 20.22, 6: 15.84, 7: 20.0, 8: 22.9, 9: 18.3, 10: 18.9, 11: 17.4, 12: 17.6, 13: 18.0, 14: 17.98, 15: 17.82, 16: 17.42, 17: 19.47, 18: 18.52, 19: 19.9, 20: 20.01, 21: 16.87, 22: 17.3, 23: 15.41, 24: 17.05, 25: 18.9, 26: 16.7, 27: 16.9, 28: 14.5, 29: 15.5, 30: 14.6, 31: 18.6}, 'vs': {0: 0, 1: 0, 2: 1, 3: 1, 4: 0, 5: 1, 6: 0, 7: 1, 8: 1, 9: 1, 10: 1, 11: 0, 12: 0, 13: 0, 14: 0, 15: 0, 16: 0, 17: 1, 18: 1, 19: 1, 20: 1, 21: 0, 22: 0, 23: 0, 24: 0, 25: 1, 26: 0, 27: 1, 28: 0, 29: 0, 30: 0, 31: 1}, 'am': {0: 1, 1: 1, 2: 1, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0, 9: 0, 10: 0, 11: 0, 12: 0, 13: 0, 14: 0, 15: 0, 16: 0, 17: 1, 18: 1, 19: 1, 20: 0, 21: 0, 22: 0, 23: 0, 24: 0, 25: 1, 26: 1, 27: 1, 28: 1, 29: 1, 30: 1, 31: 1}, 'gear': {0: 4, 1: 4, 2: 4, 3: 3, 4: 3, 5: 3, 6: 3, 7: 4, 8: 4, 9: 4, 10: 4, 11: 3, 12: 3, 13: 3, 14: 3, 15: 3, 16: 3, 17: 4, 18: 4, 19: 4, 20: 3, 21: 3, 22: 3, 23: 3, 24: 3, 25: 4, 26: 5, 27: 5, 28: 5, 29: 5, 30: 5, 31: 4}, 'carb': {0: 4, 1: 4, 2: 1, 3: 1, 4: 2, 5: 1, 6: 4, 7: 2, 8: 2, 9: 4, 10: 4, 11: 3, 12: 3, 13: 3, 
14: 4, 15: 4, 16: 4, 17: 1, 18: 2, 19: 1, 20: 1, 21: 2, 22: 2, 23: 4, 24: 2, 25: 1, 26: 2, 27: 2, 28: 4, 29: 6, 30: 8, 31: 2}}
This SO post was helpful in learning how to print the Python dataframe the way R does with the dput() function.
Now I import seaborn and create a histogram.
import seaborn
import matplotlib.pyplot as plt

seaborn.histplot(data=mtcars, x="mpg", bins=30)
plt.suptitle("Mtcars", loc='left')
plt.title("histogram", loc='left')
plt.show()
This doesn't work as the title disappears.
So I clear out whatever is happening with the graphs and try again.
plt.figure().clear()
plt.close()
plt.cla()
plt.clf()
seaborn.histplot(data=mtcars, x="mpg", bins = 30)
plt.suptitle("Mtcars", horizontalalignment = 'left')
plt.title("histogram", loc = 'left')
plt.show()
But this doesn't work either. This time, the title is there but the alignment is wrong.
I'd like to put both the title and the subtitle on the left side.
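A minimal sketch of one way to get both on the left (assuming the goal is to line the figure title up with the axes' left edge; note that suptitle does not take a loc keyword the way title does):
import seaborn
import matplotlib.pyplot as plt

ax = seaborn.histplot(data=mtcars, x="mpg", bins=30)
ax.set_title("histogram", loc='left')
# anchor the figure title's left edge at the axes' left edge
plt.suptitle("Mtcars", x=ax.get_position().x0, ha='left')
plt.show()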
Here is the dataframe that I'm working with in Python.
{'Unnamed: 0': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7: 8, 8: 9, 9: 10, 10: 11, 11: 12, 12: 13, 13: 14, 14: 15, 15: 16, 16: 17, 17: 18, 18: 19, 19: 20, 20: 21, 21: 22, 22: 23, 23: 24, 24: 25, 25: 26, 26: 27, 27: 28, 28: 29, 29: 30, 30: 31, 31: 32}, 'car': {0: 'Mazda RX4', 1: 'Mazda RX4 Wag', 2: 'Datsun 710', 3: 'Hornet 4 Drive', 4: 'Hornet Sportabout', 5: 'Valiant', 6: 'Duster 360', 7: 'Merc 240D', 8: 'Merc 230', 9: 'Merc 280', 10: 'Merc 280C', 11: 'Merc 450SE', 12: 'Merc 450SL', 13: 'Merc 450SLC', 14: 'Cadillac Fleetwood', 15: 'Lincoln Continental', 16: 'Chrysler Imperial', 17: 'Fiat 128', 18: 'Honda Civic', 19: 'Toyota Corolla', 20: 'Toyota Corona', 21: 'Dodge Challenger', 22: 'AMC Javelin', 23: 'Camaro Z28', 24: 'Pontiac Firebird', 25: 'Fiat X1-9', 26: 'Porsche 914-2', 27: 'Lotus Europa', 28: 'Ford Pantera L', 29: 'Ferrari Dino', 30: 'Maserati Bora', 31: 'Volvo 142E'}, 'mpg': {0: 21.0, 1: 21.0, 2: 22.8, 3: 21.4, 4: 18.7, 5: 18.1, 6: 14.3, 7: 24.4, 8: 22.8, 9: 19.2, 10: 17.8, 11: 16.4, 12: 17.3, 13: 15.2, 14: 10.4, 15: 10.4, 16: 14.7, 17: 32.4, 18: 30.4, 19: 33.9, 20: 21.5, 21: 15.5, 22: 15.2, 23: 13.3, 24: 19.2, 25: 27.3, 26: 26.0, 27: 30.4, 28: 15.8, 29: 19.7, 30: 15.0, 31: 21.4}, 'cyl': {0: 6, 1: 6, 2: 4, 3: 6, 4: 8, 5: 6, 6: 8, 7: 4, 8: 4, 9: 6, 10: 6, 11: 8, 12: 8, 13: 8, 14: 8, 15: 8, 16: 8, 17: 4, 18: 4, 19: 4, 20: 4, 21: 8, 22: 8, 23: 8, 24: 8, 25: 4, 26: 4, 27: 4, 28: 8, 29: 6, 30: 8, 31: 4}, 'disp': {0: 160.0, 1: 160.0, 2: 108.0, 3: 258.0, 4: 360.0, 5: 225.0, 6: 360.0, 7: 146.7, 8: 140.8, 9: 167.6, 10: 167.6, 11: 275.8, 12: 275.8, 13: 275.8, 14: 472.0, 15: 460.0, 16: 440.0, 17: 78.7, 18: 75.7, 19: 71.1, 20: 120.1, 21: 318.0, 22: 304.0, 23: 350.0, 24: 400.0, 25: 79.0, 26: 120.3, 27: 95.1, 28: 351.0, 29: 145.0, 30: 301.0, 31: 121.0}, 'hp': {0: 110, 1: 110, 2: 93, 3: 110, 4: 175, 5: 105, 6: 245, 7: 62, 8: 95, 9: 123, 10: 123, 11: 180, 12: 180, 13: 180, 14: 205, 15: 215, 16: 230, 17: 66, 18: 52, 19: 65, 20: 97, 21: 150, 22: 150, 23: 245, 24: 175, 25: 66, 26: 91, 27: 113, 28: 264, 29: 175, 30: 335, 31: 109}, 'drat': {0: 3.9, 1: 3.9, 2: 3.85, 3: 3.08, 4: 3.15, 5: 2.76, 6: 3.21, 7: 3.69, 8: 3.92, 9: 3.92, 10: 3.92, 11: 3.07, 12: 3.07, 13: 3.07, 14: 2.93, 15: 3.0, 16: 3.23, 17: 4.08, 18: 4.93, 19: 4.22, 20: 3.7, 21: 2.76, 22: 3.15, 23: 3.73, 24: 3.08, 25: 4.08, 26: 4.43, 27: 3.77, 28: 4.22, 29: 3.62, 30: 3.54, 31: 4.11}, 'wt': {0: 2.62, 1: 2.875, 2: 2.32, 3: 3.215, 4: 3.44, 5: 3.46, 6: 3.57, 7: 3.19, 8: 3.15, 9: 3.44, 10: 3.44, 11: 4.07, 12: 3.73, 13: 3.78, 14: 5.25, 15: 5.424, 16: 5.345, 17: 2.2, 18: 1.615, 19: 1.835, 20: 2.465, 21: 3.52, 22: 3.435, 23: 3.84, 24: 3.845, 25: 1.935, 26: 2.14, 27: 1.513, 28: 3.17, 29: 2.77, 30: 3.57, 31: 2.78}, 'qsec': {0: 16.46, 1: 17.02, 2: 18.61, 3: 19.44, 4: 17.02, 5: 20.22, 6: 15.84, 7: 20.0, 8: 22.9, 9: 18.3, 10: 18.9, 11: 17.4, 12: 17.6, 13: 18.0, 14: 17.98, 15: 17.82, 16: 17.42, 17: 19.47, 18: 18.52, 19: 19.9, 20: 20.01, 21: 16.87, 22: 17.3, 23: 15.41, 24: 17.05, 25: 18.9, 26: 16.7, 27: 16.9, 28: 14.5, 29: 15.5, 30: 14.6, 31: 18.6}, 'vs': {0: 0, 1: 0, 2: 1, 3: 1, 4: 0, 5: 1, 6: 0, 7: 1, 8: 1, 9: 1, 10: 1, 11: 0, 12: 0, 13: 0, 14: 0, 15: 0, 16: 0, 17: 1, 18: 1, 19: 1, 20: 1, 21: 0, 22: 0, 23: 0, 24: 0, 25: 1, 26: 0, 27: 1, 28: 0, 29: 0, 30: 0, 31: 1}, 'am': {0: 1, 1: 1, 2: 1, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0, 9: 0, 10: 0, 11: 0, 12: 0, 13: 0, 14: 0, 15: 0, 16: 0, 17: 1, 18: 1, 19: 1, 20: 0, 21: 0, 22: 0, 23: 0, 24: 0, 25: 1, 26: 1, 27: 1, 28: 1, 29: 1, 30: 1, 31: 1}, 'gear': {0: 4, 1: 4, 2: 4, 3: 3, 4: 3, 5: 3, 6: 3, 7: 4, 8: 4, 9: 4, 10: 
4, 11: 3, 12: 3, 13: 3, 14: 3, 15: 3, 16: 3, 17: 4, 18: 4, 19: 4, 20: 3, 21: 3, 22: 3, 23: 3, 24: 3, 25: 4, 26: 5, 27: 5, 28: 5, 29: 5, 30: 5, 31: 4}, 'carb': {0: 4, 1: 4, 2: 1, 3: 1, 4: 2, 5: 1, 6: 4, 7: 2, 8: 2, 9: 4, 10: 4, 11: 3, 12: 3, 13: 3, 14: 4, 15: 4, 16: 4, 17: 1, 18: 2, 19: 1, 20: 1, 21: 2, 22: 2, 23: 4, 24: 2, 25: 1, 26: 2, 27: 2, 28: 4, 29: 6, 30: 8, 31: 2}}
Here is the code that I'm using. The subplot part I got off a DataCamp module.
fig, ax = plt.subplot()
plt.show()
But when I go to plot the mtcars dataset, one variable against the other, I get a blank canvas. Why is that? I don't see how the code is different from what I am looking at on DataCamp.
ax.plot(mtcars['cyl'], mtcars['mpg'])
plt.show()
The answer below is helpful and gets me closer to a solution, but it is giving me lines instead of a scatterplot?
import matplotlib.pyplot as plt
plt.plot(df['cyl'], df['mpg'])
plt.show()
or:
ax = plt.subplot(2, 1, 1)
ax.plot(df['cyl'], df['mpg'])
plt.show()
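For what it's worth, a minimal sketch of the scatter version; note plt.subplots() with an s, since plt.subplot() creates a single empty Axes (hence the blank canvas) and cannot be unpacked into fig, ax:
import matplotlib.pyplot as plt

fig, ax = plt.subplots()   # subplots(), not subplot()
# plt.plot draws connected lines by default; scatter draws points
ax.scatter(mtcars['cyl'], mtcars['mpg'])
# equivalently: ax.plot(mtcars['cyl'], mtcars['mpg'], 'o')
plt.show()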
I have the below model, for which I seek estimation of the parameters using pymc3:
import pandas as pd
import pymc3 as pm
import arviz as arviz
myData = pd.DataFrame.from_dict({
'Unnamed: 0': {
0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7: 8, 8: 9, 9: 10,
10: 11, 11: 12, 12: 13, 13: 14, 14: 15, 15: 16, 16: 17, 17: 18, 18: 19, 19: 20,
20: 21, 21: 22, 22: 23, 23: 24, 24: 25, 25: 26, 26: 27, 27: 28, 28: 29, 29: 30,
30: 31, 31: 32, 32: 33, 33: 34, 34: 35, 35: 36, 36: 37, 37: 38},
'y': {
0: 0.0079235409492941, 1: 0.0086530073429249, 2: 0.0297400780486734, 3: 0.0196358416326437, 4: 0.0023902064076204, 5: 0.0258055591736283, 6: 0.17394835142698, 7: 0.156463554455613, 8: 0.329388185725557, 9: 0.0076443508881763,
10: 0.0162081480398152, 11: 0.0, 12: 0.0015759139941696, 13: 0.420025972703085, 14: 0.0001226236519444, 15: 0.133061480234834, 16: 0.565454216154227, 17: 0.0002819734812997, 18: 0.000559715156383, 19: 0.0270686389659072,
20: 0.918300537689865, 21: 7.8262468302e-06, 22: 0.0073241434191945, 23: 0.0, 24: 0.0, 25: 0.0, 26: 0.0, 27: 0.0, 28: 0.0, 29: 0.0,
30: 0.174071274611405, 31: 0.0432109713717948, 32: 0.0544400838264943, 33: 0.0, 34: 0.0907049925221286, 35: 0.616680102647887, 36: 0.0, 37: 0.0},
'x': {
0: 23.8187587698947, 1: 15.9991138359515, 2: 33.6495930512881, 3: 28.555818797764, 4: -52.2967967248258, 5: -91.3835208788233, 6: -73.9830692708321, 7: -5.16901145289629, 8: 29.8363012310241, 9: 10.6820057903939,
10: 19.4868517164395, 11: 15.4499668436458, 12: -17.0441644773509, 13: 10.7025053739577, 14: -8.6382953428539, 15: -32.8892974839165, 16: -15.8671863161348, 17: -11.237248036145, 18: -7.37978020066205, 19: -3.33500586334862,
20: -4.02629933182873, 21: -20.2413384726948, 22: -54.9094885578775, 23: -48.041459120976, 24: -52.3125732905322, 25: -35.6269065970458, 26: -62.0296155423529, 27: -49.0825017152659, 28: -73.0574478287598, 29: -50.9409090127938,
30: -63.4650928035253, 31: -55.1263264283842, 32: -52.2841103768755, 33: -61.2275334149805, 34: -74.2175990067417, 35: -68.2961107804698, 36: -76.6834643609286, 37: -70.16769103228}
})
with pm.Model() as myModel:
    beta0 = pm.Normal('intercept', 0, 1)
    beta1 = pm.Normal('x', 0, 1)
    mu = beta0 + beta1 * myData['x'].values
    pm.Bernoulli('obs', p=pm.invlogit(mu), observed=myData['y'].values)

with myModel:
    calc = pm.sample(50000, tune=10000, step=pm.Metropolis(), random_seed=1000)

arviz.summary(calc, round_to=10)
mean sd hdi_3% hdi_97% mcse_mean mcse_sd ess_bulk ess_tail r_hat
intercept -2.537501 0.599667 -3.707061 -1.450243 0.004375 0.003118 18893.344191 22631.772985 1.000070
x 0.033750 0.024314 -0.007871 0.081619 0.000181 0.000133 18550.620475 20113.739639 1.000194
Now I changed the above model to this:
mu = beta0 + beta1 * myData['x'].values * 0
With this change I get the below result:
mean sd hdi_3% hdi_97% mcse_mean mcse_sd ess_bulk ess_tail r_hat
intercept -2.690874 0.546570 -3.698465 -1.643091 0.003611 0.002565 22980.471424 24806.935727 1.000036
x -0.013861 1.003612 -1.916176 1.826709 0.006874 0.005175 21336.662537 23299.680306 1.000084
I wonder if the above estimate is correct. Should I not expect a very small estimate for the coefficient beta1? I see hardly any change in this estimate, apart from the change in sign.
Any pointers are highly appreciated.
"hardly any change for this estimate"
Seems like you are ignoring the sd, which has a strong change and is behaving as expected. That is, the first version yields 0.034 ± 0.024 (weakly positive); whereas the second correctly reverts to the prior with -0.014 ± 1.00.
Looking at the input data, none of this seems surprising:
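A quick sketch of why the second result matches the prior: once the predictor is multiplied by zero, the likelihood no longer depends on beta1, so its posterior is just the Normal(0, 1) prior, and -0.014 ± 1.00 is what draws from that prior look like:
import numpy as np

# draws from the Normal(0, 1) prior: mean ~ 0, sd ~ 1
rng = np.random.default_rng(1000)
draws = rng.normal(0, 1, size=100_000)
print(draws.mean(), draws.std())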
I have printed an output in the Python shell like:
>>>{1: 117.33282674772036, 2: 119.55324074074075, 3: 116.45497076023392, 4: 113.77561475409836, 5: 112.93896713615024, 6: 114.23583333333333, 7: 124.92402972749794, 8: 121.40603448275863, 9: 116.4946452476573, 10: 112.89107142857142, 11: 122.33312577833125, 12: 116.57083333333334, 13: 122.2856334841629, 14: 125.26688815060908, 15: 129.13817204301074, 16: 128.78991596638656, 17: 127.54600301659126, 18: 133.65972222222223, 19: 127.28315789473685, 20: 125.07205882352942, 21: 124.79464285714286, 22: 131.36170212765958, 23: 130.17974002689377, 24: 138.37055555555557, 25: 132.72380952380954, 26: 138.44230769230768, 27: 134.82251082251082, 28: 147.12448979591838, 29: 149.86879730866275, 30: 145.04521072796936, 31: 143.72442396313363, 32: 148.12940140845072, 33: 140.06355218855219, 34: 145.44537815126051, 35: 146.50366300366301, 36: 146.2173611111111, 37: 152.36319881525361, 38: 156.42249459264599, 39: 154.6977564102564, 40: 155.47647058823529, 41: 158.72357723577235, 42: 162.23746031746032, 43: 149.30991931656382, 44: ........
It represents adjacent neighbors. How can I save this output to a text file in Python, line by line? Like:
1:117.3328268788
2:119.5532822788
Something like this:
with open('some_file.txt', 'w') as f:
    for k in sorted(your_dic):
        f.write("{}:{}\n".format(k, your_dic[k]))