I want to store a dictionary in a pandas DataFrame:
dictionary_example = {1234:{'choice':0,'choice_set':{0:{'A':100,'B':200,'C':300},1:{'A':200,'B':300,'C':300},2:{'A':500,'B':300,'C':300}}},
                      234:{'choice':1,'choice_set':{0:{'A':100,'B':400},1:{'A':100,'B':300,'C':1000}}},
                      1876:{'choice':2,'choice_set':{0:{'A':100,'B':400,'C':300},1:{'A':100,'B':300,'C':1000},2:{'A':600,'B':200,'C':100}}}
                      }
so that it ends up like this:
id choice 0_A 0_B 0_C 1_A 1_B 1_C 2_A 2_B 2_C
1234 0 100 200 300 200 300 300 500 300 300
234 1 100 400 - 100 300 1000 - - -
1876 2 100 400 300 100 300 1000 600 200 100
I think the following is pretty close; the core idea is to convert those dictionaries to JSON and rely on pandas.read_json to parse them.
import json

import pandas as pd

dictionary_example = {
"1234":{'choice':0,'choice_set':{0:{'A':100,'B':200,'C':300},1:{'A':200,'B':300,'C':300},2:{'A':500,'B':300,'C':300}}},
"234":{'choice':1,'choice_set':{0:{'A':100,'B':400},1:{'A':100,'B':300,'C':1000}}},
"1876":{'choice':2,'choice_set':{0:{'A': 100,'B':400,'C':300},1:{'A':100,'B':300,'C':1000},2:{'A':600,'B':200,'C':100}}}
}
df = pd.read_json(json.dumps(dictionary_example)).T

def to_s(r):
    return pd.read_json(json.dumps(r)).unstack()

flattened_choice_set = df["choice_set"].apply(to_s)
flattened_choice_set.columns = ['_'.join((str(col[0]), col[1])) for col in flattened_choice_set.columns]
result = pd.merge(df, flattened_choice_set,
                  left_index=True, right_index=True).drop("choice_set", axis=1)
result
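An alternative sketch, not from the original answer: pd.json_normalize (pandas 1.0+) can flatten the nested choice_set directly, which skips the round trip through JSON strings. The record-building comprehension below is illustrative:

import pandas as pd

# flatten each record; nested keys become e.g. "choice_set_0_A"
records = [{'id': k, **v} for k, v in dictionary_example.items()]
flat = pd.json_normalize(records, sep='_').set_index('id')

# drop the "choice_set_" prefix to get the "0_A", "0_B", ... column names
flat.columns = [c.replace('choice_set_', '') for c in flat.columns]
flat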
I just started learning pandas a week ago or so and I've been struggling with a pandas dataframe for a bit now. My data looks like this:
State NY CA Other Total
Year
2003 450 50 25 525
2004 300 75 5 380
2005 500 100 100 700
2006 250 50 100 400
I made this table from a dataset that included 30 or so values for the variable I'm representing as State here. Values that weren't NY or CA were summed into an 'Other' category. The years were made from a normalized list of dates (originally a mix of mm/dd/yyyy and yyyy-mm-dd), in case that is contributing to my issue:
dict = {'Date': pd.to_datetime(my_df.Date).dt.year}
and later:
my_df = my_df.rename_axis('Year')
I'm trying now to append a row at the bottom that shows the totals in each category:
final_df = my_df.append({'Year': 'Total',
                         'NY': my_df.NY.sum(),
                         'CA': my_df.CA.sum(),
                         'Other': my_df.Other.sum(),
                         'Total': my_df.Total.sum()},
                        ignore_index=True)
This does technically work, but it makes my table look like this:
NY CA Other Total State
0 450 50 25 525 NaN
1 300 75 5 380 NaN
2 500 100 100 700 NaN
3 250 50 100 400 NaN
4 a b c d Total
('a' and so forth are the actual totals of the columns.) It adds a column at the beginning and puts my 'Year' column at the end. In fact, it removes the 'Date' label as well, and turns all the years in the last column into NaNs.
Is there any way I can get this formatted properly? Thank you for your time.
I believe you need to create a Series with sum and rename it:
final_df = my_df.append(my_df.sum().rename('Total'))
print (final_df)
NY CA Other Total
State
2003 450 50 25 525
2004 300 75 5 380
2005 500 100 100 700
2006 250 50 100 400
Total 1500 275 230 2005
Another solution is to use loc for setting with enlargement:
my_df.loc['Total'] = my_df.sum()
print (my_df)
NY CA Other Total
State
2003 450 50 25 525
2004 300 75 5 380
2005 500 100 100 700
2006 250 50 100 400
Total 1500 275 230 2005
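A side note, not from the original answer: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on current versions the first solution can be written with pd.concat instead (a minimal sketch of the same idea):

import pandas as pd

# build the totals row as a one-row DataFrame and concatenate it
total_row = my_df.sum().rename('Total').to_frame().T
final_df = pd.concat([my_df, total_row])

The loc-based setting-with-enlargement solution works unchanged on newer versions.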
Another idea, from a previous answer: add the parameters margins=True and margins_name='Total' when building the table with crosstab:
df1 = df.assign(**dct)
out = (pd.crosstab(df1['Firing'], df1['State'], margins=True, margins_name='Total'))
I have a column in my dataframe comprised of numbers. I'd like to have another column in the dataframe that takes a running average of the values greater than 0, which I can ideally do in numpy without iteration (the data is huge).
Vals Output
-350
1000 1000
1300 1150
1600 1300
1100 1250
1000 1200
450 1075
1900 1192.857143
-2000 1192.857143
-3150 1192.857143
1000 1168.75
-900 1168.75
800 1127.777778
8550 1870
Code:
import pandas as pd

vals = [-350, 1000, 1300, 1600, 1100, 1000, 450,
        1900, -2000, -3150, 1000, -900, 800, 8550]
df = pd.DataFrame({'Vals': vals})
Option 1
expanding and mean
df.assign(out=df.loc[df.Vals.gt(0)].Vals.expanding().mean()).ffill()
If you have other columns in your DataFrame that have NaN values, this method will ffill those too, so if that is a concern, you may want to consider using something like this:
df['Out'] = df.loc[df.Vals.gt(0)].Vals.expanding().mean()
df['Out'] = df.Out.ffill()
Which will only fill in the Out column.
Option 2
mask:
df.assign(Out=df.mask(df.Vals.lt(0)).Vals.expanding().mean())
Both of these result in:
Vals Out
0 -350 NaN
1 1000 1000.000000
2 1300 1150.000000
3 1600 1300.000000
4 1100 1250.000000
5 1000 1200.000000
6 450 1075.000000
7 1900 1192.857143
8 -2000 1192.857143
9 -3150 1192.857143
10 1000 1168.750000
11 -900 1168.750000
12 800 1127.777778
13 8550 1870.000000
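Since the question asks for something that can be done in numpy without iteration, here is a sketch (not from the original answers) that builds the same running average from cumulative sums over the Vals column defined above:

import numpy as np

vals = df['Vals'].to_numpy()
positive = vals > 0

# cumulative sum and count of the positive values only
cum_sum = np.cumsum(np.where(positive, vals, 0))
cum_cnt = np.cumsum(positive)

# running mean of the positives; NaN until the first positive value appears
df['Out'] = np.where(cum_cnt > 0, cum_sum / np.maximum(cum_cnt, 1), np.nan)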
I have a large pandas dataframe (df_orig) and several lookup tables (also dataframes) that correspond to each of the segments in df_orig.
Here's a small subset of df_orig:
segment score1 score2
B3 0 700
B1 0 120
B1 400 950
B1 100 220
B1 200 320
B1 650 340
B5 300 400
B5 0 320
B1 0 240
B1 100 360
B1 940 700
B3 100 340
And here's a lookup table in its entirety for segment B5 called thresholds_b5 (there is a lookup table for each segment in the large dataset):
score1 score2
990 220
980 280
970 200
960 260
950 260
940 200
930 240
920 220
910 220
900 220
850 120
800 220
750 220
700 120
650 200
600 220
550 220
500 240
400 240
300 260
200 300
100 320
0 400
I want to create a new column in my large dataset that is analogous to this SQL logic:
case when segment = 'B5' then
case when score1 = 990 and score2 >= 220 then 1
case when score1 = 980 and score2 >= 280 then 1
.
.
.
else 0
case when segment = 'B1' then
.
.
.
else 0 end as indicator
I was able to get the correct output using a loop based on the solution to this question:
df_b5 = df_orig[df_orig.loc[:, 'segment'] == 'B5']

for i in range(len(thresholds_b5)):
    value1 = thresholds_b5.iloc[i, 0]
    value2 = thresholds_b5.iloc[i, 1]
    df_b5.loc[(df_b5['score1'] == value1) & (df_b5['score2'] >= value2), 'indicator'] = 1
However, I'd need another loop to run this for each segment and then append all of the resultant dataframes back together, which is a bit messy. Furthermore, while I only have three segments (B1,B3,B5) for now, I'm going to have 20+ segments in the future.
Is there a way to do this more succinctly and preferably without loops? I've been warned that loops over dataframes tend to be slow and given the size of my dataset I think speed will matter.
If you are ok with sorting the DataFrames ahead of time, then you can replace your loop example with the new asof join in pandas 0.19:
import numpy as np
import pandas as pd

# query
df_b5 = df_orig.query('segment == "B5"')

# sort ahead of time (merge_asof needs both frames sorted on the "on" key)
df_b5 = df_b5.sort_values('score2')
thresholds_b5 = thresholds_b5.sort_values('score2')

# set the default indicator as 1
thresholds_b5['indicator'] = 1

# join the tables
df = pd.merge_asof(df_b5, thresholds_b5, on='score2', by='score1')

# fill missing indicators as 0
df.indicator = np.int64(df.indicator.fillna(0.0))
This is what I got:
segment score1 score2 indicator
0 B5 0 320 0
1 B5 300 400 1
If you need the original order, then save the index in a new column of df_orig and then resort the final DataFrame by that.
pandas 0.19.2 added support for multiple by columns, so you could concat all of your thresholds with the segment column set for each one, then invoke:
pd.merge_asof(df_orig, thresholds, on='score2', by=['segment', 'score1'])
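A sketch of that generalization, not spelled out in the original answer. It assumes the per-segment lookup tables are collected in a dict such as threshold_tables = {'B1': thresholds_b1, 'B3': thresholds_b3, 'B5': thresholds_b5} (the dict name and keys here are illustrative):

import numpy as np
import pandas as pd

# tag each lookup table with its segment and mark its rows with indicator = 1
frames = []
for seg, thr in threshold_tables.items():
    t = thr.copy()
    t['segment'] = seg
    t['indicator'] = 1
    frames.append(t)
thresholds = pd.concat(frames).sort_values('score2')

# merge_asof requires both sides to be sorted on the "on" key
out = pd.merge_asof(df_orig.sort_values('score2'), thresholds,
                    on='score2', by=['segment', 'score1'])
out['indicator'] = out['indicator'].fillna(0).astype(np.int64)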
I have a DataFrame df1:
df1.head() =
id type position
dates
2000-01-03 17378 600 400
2000-01-03 4203 600 150
2000-01-03 18321 600 5000
2000-01-03 6158 600 1000
2000-01-03 886 600 10000
2000-01-03 17127 600 800
2000-01-03 18317 1300 110
2000-01-03 5536 600 207
2000-01-03 5132 600 20000
2000-01-03 18191 600 2000
And a second DataFrame df2:
df2.head() =
dt_f dt_l
id_y id_x
670 715 2000-02-14 2003-09-30
704 2963 2000-02-11 2004-01-13
886 18350 2000-02-09 2001-09-24
1451 18159 2005-11-14 2007-03-06
2175 8648 2007-02-28 2007-09-19
2236 18321 2001-04-05 2002-07-02
2283 2352 2007-03-07 2007-09-19
6694 2007-03-07 2007-09-17
13865 2007-04-19 2007-09-19
14348 2007-08-10 2007-09-19
15415 2007-03-07 2007-09-19
2300 2963 2001-05-30 2007-09-26
I need to slice df1 for each value of id_x and count the number of rows within the interval dt_f:dt_l. The same has to be done for the values of id_y. Finally the result should be merged onto df2, giving the following DataFrame as output:
df_result.head() =
dt_f dt_l n_x n_y
id_y id_x
670 715 2000-02-14 2003-09-30 8 10
704 2963 2000-02-11 2004-01-13 13 25
886 18350 2000-02-09 2001-09-24 32 75
1451 18159 2005-11-14 2007-03-06 48 6
where n_x(n_y) corresponds to the number of rows contained in the interval dt_f:dt_l for each value of id_x(id_y).
Here is the for-loop I have used:
idx_list = df2.index.tolist()

for k, j in enumerate(idx_list):
    n_y = df1[df1.id == j[0]][df2['dt_f'].iloc[k]:df2['dt_l'].iloc[k]]['id'].count()
    n_x = df1[df1.id == j[1]][df2['dt_f'].iloc[k]:df2['dt_l'].iloc[k]]['id'].count()
Would it be possible to do it without using a for-loop? DataFrame df1 contains around 30000 rows and I am afraid a loop will slow down the process too much, since this is a small part of the whole script.
You want something like this:
# Merge the tables together, keeping the dates index and the id_x/id_y columns
mg = df1.reset_index().merge(df2.reset_index(), left_on='id', right_on='id_x')
# Select only the rows whose date falls between the start and end
mg = mg[(mg['dates'] > mg['dt_f']) & (mg['dates'] < mg['dt_l'])]
# Finally count by id_x
mg.groupby('id_x').count()
You'll need to tidy up the columns afterwards and repeat for id_y.
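Here is a sketch of the full result with both counts joined back onto df2, building on the merge approach above; the helper name interval_count is illustrative, and it assumes dates, dt_f and dt_l are datetime-typed:

import pandas as pd

def interval_count(df1, df2, id_col):
    # count df1 rows whose date falls inside each (dt_f, dt_l) interval of df2
    mg = df1.reset_index().merge(df2.reset_index(), left_on='id', right_on=id_col)
    mg = mg[(mg['dates'] > mg['dt_f']) & (mg['dates'] < mg['dt_l'])]
    return mg.groupby(['id_y', 'id_x'])['id'].count()

df_result = df2.copy()
df_result['n_x'] = interval_count(df1, df2, 'id_x')
df_result['n_y'] = interval_count(df1, df2, 'id_y')
df_result[['n_x', 'n_y']] = df_result[['n_x', 'n_y']].fillna(0).astype(int)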
I have a list of persons with their respective earnings by company, like this:
Company_code Person Date Earning1 Earning2
1 Jonh 2014-01 100 200
2 Jonh 2014-01 300 400
1 Jonh 2014-02 500 600
1 Peter 2014-01 300 400
1 Peter 2014-02 500 600
And I would like to summarize into this:
Company_code Person 2014-01_E1 2014-01_E2 2014-02_E1 2014-02_E2
1 Jonh 100 200 500 600
2 Jonh 300 400
1 Peter 300 400 500 600
I had the same problem in SQL, which I solved with this query:
with t(Company_code, Person, Dt, Earning1, Earning2) as (
select 1, 'Jonh', to_date('2014-01-01', 'YYYY-MM-DD'), 100, 200 from dual union all
select 2, 'Jonh', to_date('2014-01-01', 'YYYY-MM-DD'), 300, 400 from dual union all
select 1, 'Jonh', to_date('2014-02-01', 'YYYY-MM-DD'), 500, 600 from dual union all
select 1, 'Peter', to_date('2014-01-01', 'YYYY-MM-DD'), 300, 400 from dual union all
select 1, 'Peter', to_date('2014-02-01', 'YYYY-MM-DD'), 500, 600 from dual
)
select *
from t
pivot (
sum(Earning1) e1
, sum(Earning2) e2
for dt in (
to_date('2014-01-01', 'YYYY-MM-DD') "2014-01"
, to_date('2014-02-01', 'YYYY-MM-DD') "2014-02"
)
)
COMPANY_CODE PERSON 2014-01_E1 2014-01_E2 2014-02_E1 2014-02_E2
----------------------------------------------------------------------
2 Jonh 300 400 - -
1 Peter 300 400 500 600
1 Jonh 100 200 500 600
How can this be achieved in Python? I'm trying with pandas pivot_table:
pd.pivot_table(df, columns=['COMPANY_CODE', 'PERSON', 'DATE'], aggfunc=np.sum)
but this just transposes the table ... any clues?
Using user1827356's suggestion (with pivot_table's current index=/columns= keyword names; older versions called them rows=/cols=):
df2 = pd.pivot_table(df, index=['Company_code', 'Person'], columns=['Date'], aggfunc='sum')
print(df2)
# Earning1 Earning2
# Date 2014-01 2014-02 2014-01 2014-02
# Company_code Person
# 1 Jonh 100 500 200 600
# Peter 300 500 400 600
# 2 Jonh 300 NaN 400 NaN
You can flatten the hierarchical columns like this:
columns = ['{}_E{}'.format(date, earning.replace('Earning', ''))
           for earning, date in df2.columns.tolist()]
df2.columns = columns
print(df2)
# 2014-01_E1 2014-02_E1 2014-01_E2 2014-02_E2
# Company_code Person
# 1 Jonh 100 500 200 600
# Peter 300 500 400 600
# 2 Jonh 300 NaN 400 NaN
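If you also want the date-major column order shown in the question, a small follow-up (not part of the original answer) is to sort the flattened names:

# lexicographic sort gives 2014-01_E1, 2014-01_E2, 2014-02_E1, 2014-02_E2
df2 = df2[sorted(df2.columns)]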
Here's the nicest way to do it, using unstack.
df = pd.DataFrame({
    'company_code': [1, 2, 1, 1, 1],
    'person': ['Jonh', 'Jonh', 'Jonh', 'Peter', 'Peter'],
    'earning2': [200, 400, 600, 400, 600],
    'earning1': [100, 300, 500, 300, 500],
    'date': ['2014-01', '2014-01', '2014-02', '2014-01', '2014-02']
})
df = df.set_index(['date', 'company_code', 'person'])
df.unstack('date')
Resulting in:
earning1 earning2
date 2014-01 2014-02 2014-01 2014-02
company_code person
1 Jonh 100.0 500.0 200.0 600.0
1 Peter 300.0 500.0 400.0 600.0
2 Jonh 300.0 NaN 400.0 NaN
Setting the index to ['date', 'company_code', 'person'] is a good idea anyway, since that's really what your DataFrame contains: two different earnings categories (1 and 2) each described by a date, a company code and a person.
It's good practice to always work out what the 'real' data in your DataFrame is, and which columns are meta-data, and index accordingly.
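If you also need the flat 2014-01_E1-style column names from the question, a short follow-up sketch on top of the unstack result (not part of the original answer):

flat = df.unstack('date')
# level 0 is 'earning1'/'earning2', level 1 is the date; build names like '2014-01_E1'
flat.columns = ['{}_E{}'.format(date, earning[-1]) for earning, date in flat.columns]
flat = flat[sorted(flat.columns)]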