I am converting SPSS code to pandas and am trying to find a Pythonic way to express this:
COUNT WBbf = M1 M26 M38 M50 M62 M74 M85 M97 M109
M121 M133 M144 (1).
COUNT SPbf = M2 M15 M39 M51 M75 M87 M110 (1)
M63 M98 M122 M134 M145 (0).
COUNT ACbf = M3 M16 M27 M52 M76 M88 M111 M123 M135 M146 (1)
M64 M99 (0).
COUNT SCbf = M5 M17 M40 M77 M112 (1)
M28 M65 M89 M100 M124 M136 M148 (0).
My dataframe has this form:
In [90]: data[b]
Out[90]:
M1 M2 M3 M4 M5 M6 M7 M8 M9 \
case_id
ERAB_S1_LR_Q1_261016 1.0 1.0 0.0 1.0 1.0 1.0 0.0 0.0 1.0
ERAB_AS_011116 1.0 1.0 0.0 1.0 1.0 1.0 1.0 1.0 0.0
ERAB_S2_LR_Q1_021116AFTERNOO 1.0 1.0 1.0 1.0 0.0 1.0 0.0 0.0 1.0
ERAB_S2_AS031116MORNING 1.0 1.0 0.0 1.0 0.0 1.0 0.0 0.0 1.0
ERAB_S3_AS031116AFTERNOON 1.0 0.0 0.0 1.0 1.0 1.0 0.0 0.0 1.0
ERAB_S1_AS041116 1.0 0.0 0.0 1.0 1.0 1.0 0.0 0.0 1.0
ERAB_LOH__S3_021116 1.0 1.0 1.0 1.0 1.0 1.0 0.0 0.0 1.0
ERAB_LR_081116 1.0 1.0 0.0 1.0 1.0 1.0 0.0 0.0 1.0
ERAB_S1_AS_111116 1.0 1.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0
ERAB_S1_141116AFTERNOON 1.0 1.0 0.0 1.0 1.0 1.0 0.0 0.0 1.0
ERAB_S1_LOH_151116 1.0 0.0 1.0 1.0 1.0 0.0 1.0 0.0 1.0
ERAB_S1_161116 1.0 1.0 1.0 1.0 1.0 1.0 0.0 0.0 1.0
and so on...
I want to count the values and create a new column with the result for each case id.
I believe you can first select the columns, compare them with eq, and then sum the True values per row:
# replace these strings with your own column names
SPbf1 = 'M2 M5 M8'.split()
SPbf0 = 'M6 M9'.split()
print (SPbf1)
['M2', 'M5', 'M8']
print (SPbf0)
['M6', 'M9']
df['SPbf'] = df[SPbf1].eq(1).sum(axis=1) + df[SPbf0].eq(0).sum(axis=1)
print (df)
M1 M2 M3 M4 M5 M6 M7 M8 M9 \
case_id
ERAB_S1_LR_Q1_261016 1.0 1.0 0.0 1.0 1.0 1.0 0.0 0.0 1.0
ERAB_AS_011116 1.0 1.0 0.0 1.0 1.0 1.0 1.0 1.0 0.0
ERAB_S2_LR_Q1_021116AFTERNOO 1.0 1.0 1.0 1.0 0.0 1.0 0.0 0.0 1.0
ERAB_S2_AS031116MORNING 1.0 1.0 0.0 1.0 0.0 1.0 0.0 0.0 1.0
ERAB_S3_AS031116AFTERNOON 1.0 0.0 0.0 1.0 1.0 1.0 0.0 0.0 1.0
ERAB_S1_AS041116 1.0 0.0 0.0 1.0 1.0 1.0 0.0 0.0 1.0
ERAB_LOH__S3_021116 1.0 1.0 1.0 1.0 1.0 1.0 0.0 0.0 1.0
ERAB_LR_081116 1.0 1.0 0.0 1.0 1.0 1.0 0.0 0.0 1.0
ERAB_S1_AS_111116 1.0 1.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0
ERAB_S1_141116AFTERNOON 1.0 1.0 0.0 1.0 1.0 1.0 0.0 0.0 1.0
ERAB_S1_LOH_151116 1.0 0.0 1.0 1.0 1.0 0.0 1.0 0.0 1.0
ERAB_S1_161116 1.0 1.0 1.0 1.0 1.0 1.0 0.0 0.0 1.0
SPbf
case_id
ERAB_S1_LR_Q1_261016 2
ERAB_AS_011116 4
ERAB_S2_LR_Q1_021116AFTERNOO 1
ERAB_S2_AS031116MORNING 1
ERAB_S3_AS031116AFTERNOON 1
ERAB_S1_AS041116 1
ERAB_LOH__S3_021116 2
ERAB_LR_081116 2
ERAB_S1_AS_111116 2
ERAB_S1_141116AFTERNOON 2
ERAB_S1_LOH_151116 2
ERAB_S1_161116 2
If some column names may be missing, use reindex instead of plain column selection (older pandas versions used reindex_axis, which has since been deprecated and removed):
SPbf1 = 'M2 M15 M39 M51 M75 M87 M110'.split()
SPbf0 = 'M63 M98 M122 M134 M145'.split()
print (SPbf1)
['M2', 'M15', 'M39', 'M51', 'M75', 'M87', 'M110']
print (SPbf0)
['M63', 'M98', 'M122', 'M134', 'M145']
df['SPbf'] = df.reindex(columns=SPbf1).eq(1).sum(axis=1) + \
             df.reindex(columns=SPbf0).eq(0).sum(axis=1)
print (df)
M1 M2 M3 M4 M5 M6 M7 M8 M9 \
case_id
ERAB_S1_LR_Q1_261016 1.0 1.0 0.0 1.0 1.0 1.0 0.0 0.0 1.0
ERAB_AS_011116 1.0 1.0 0.0 1.0 1.0 1.0 1.0 1.0 0.0
ERAB_S2_LR_Q1_021116AFTERNOO 1.0 1.0 1.0 1.0 0.0 1.0 0.0 0.0 1.0
ERAB_S2_AS031116MORNING 1.0 1.0 0.0 1.0 0.0 1.0 0.0 0.0 1.0
ERAB_S3_AS031116AFTERNOON 1.0 0.0 0.0 1.0 1.0 1.0 0.0 0.0 1.0
ERAB_S1_AS041116 1.0 0.0 0.0 1.0 1.0 1.0 0.0 0.0 1.0
ERAB_LOH__S3_021116 1.0 1.0 1.0 1.0 1.0 1.0 0.0 0.0 1.0
ERAB_LR_081116 1.0 1.0 0.0 1.0 1.0 1.0 0.0 0.0 1.0
ERAB_S1_AS_111116 1.0 1.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0
ERAB_S1_141116AFTERNOON 1.0 1.0 0.0 1.0 1.0 1.0 0.0 0.0 1.0
ERAB_S1_LOH_151116 1.0 0.0 1.0 1.0 1.0 0.0 1.0 0.0 1.0
ERAB_S1_161116 1.0 1.0 1.0 1.0 1.0 1.0 0.0 0.0 1.0
SPbf
case_id
ERAB_S1_LR_Q1_261016 1
ERAB_AS_011116 1
ERAB_S2_LR_Q1_021116AFTERNOO 1
ERAB_S2_AS031116MORNING 1
ERAB_S3_AS031116AFTERNOON 0
ERAB_S1_AS041116 0
ERAB_LOH__S3_021116 1
ERAB_LR_081116 1
ERAB_S1_AS_111116 1
ERAB_S1_141116AFTERNOON 1
ERAB_S1_LOH_151116 0
ERAB_S1_161116 1
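To cover all four COUNT statements from the question in one pass, the same idea can be wrapped in a small helper. This is only a sketch: the function name `spss_count`, the `specs` dictionary and the two-row toy frame are made up for illustration, and it assumes that a column missing from the frame should simply contribute nothing to the count.

```python
import pandas as pd

def spss_count(df, ones=(), zeros=()):
    """Per row: number of `ones` columns equal to 1 plus number of `zeros` columns equal to 0."""
    # reindex tolerates absent column names: they become all-NaN and never match
    hits_one = df.reindex(columns=list(ones)).eq(1).sum(axis=1)
    hits_zero = df.reindex(columns=list(zeros)).eq(0).sum(axis=1)
    return (hits_one + hits_zero).astype(int)

# Hypothetical toy frame standing in for the real data
df = pd.DataFrame(
    {"M1": [1.0, 0.0], "M2": [1.0, 1.0], "M3": [0.0, 1.0], "M5": [1.0, 0.0]},
    index=pd.Index(["case_a", "case_b"], name="case_id"),
)

# (columns counted when equal to 1, columns counted when equal to 0)
specs = {
    "WBbf": ("M1 M26 M38 M50 M62 M74 M85 M97 M109 M121 M133 M144", ""),
    "SPbf": ("M2 M15 M39 M51 M75 M87 M110", "M63 M98 M122 M134 M145"),
    "ACbf": ("M3 M16 M27 M52 M76 M88 M111 M123 M135 M146", "M64 M99"),
    "SCbf": ("M5 M17 M40 M77 M112", "M28 M65 M89 M100 M124 M136 M148"),
}

# Compute every score from the original columns, then attach them together
scores = {name: spss_count(df, ones.split(), zeros.split())
          for name, (ones, zeros) in specs.items()}
df = df.assign(**scores)
print(df)
```

Keeping the column lists in one dictionary makes it easier to compare them line by line against the SPSS syntax.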
My dataset looks like this, and my code is:
import pandas as pd
df = pd.read_csv("Admission_data.csv")
correlations = df.corr()
correlations
admit gre gpa rank
0.0 380.0 3.609999895095825 3.0
1.0 660.0 3.6700000762939453 3.0
1.0 800.0 4.0 1.0
1.0 640.0 3.190000057220459 4.0
0.0 520.0 2.930000066757202 4.0
1.0 760.0 3.0 2.0
1.0 560.0 2.9800000190734863 1.0
0.0 400.0 3.0799999237060547 2.0
1.0 540.0 3.390000104904175 3.0
0.0 700.0 3.9200000762939453 2.0
0.0 800.0 4.0 4.0
0.0 440.0 3.2200000286102295 1.0
1.0 760.0 4.0 1.0
0.0 700.0 3.0799999237060547 2.0
1.0 700.0 4.0 1.0
0.0 480.0 3.440000057220459 3.0
0.0 780.0 3.869999885559082 4.0
0.0 360.0 2.559999942779541 3.0
0.0 800.0 3.75 2.0
1.0 540.0 3.809999942779541 1.0
0.0 500.0 3.1700000762939453 3.0
1.0 660.0 3.630000114440918 2.0
0.0 600.0 2.819999933242798 4.0
0.0 680.0 3.190000057220459 4.0
1.0 760.0 3.3499999046325684 2.0
1.0 800.0 3.6600000858306885 1.0
1.0 620.0 3.609999895095825 1.0
1.0 520.0 3.740000009536743 4.0
1.0 780.0 3.2200000286102295 2.0
0.0 520.0 3.2899999618530273 1.0
0.0 540.0 3.7799999713897705 4.0
0.0 760.0 3.3499999046325684 3.0
0.0 600.0 3.4000000953674316 3.0
1.0 800.0 4.0 3.0
0.0 360.0 3.140000104904175 1.0
0.0 400.0 3.049999952316284 2.0
0.0 580.0 3.25 1.0
0.0 520.0 2.9000000953674316 3.0
1.0 500.0 3.130000114440918 2.0
1.0 520.0 2.680000066757202 3.0
0.0 560.0 2.4200000762939453 2.0
1.0 580.0 3.319999933242798 2.0
1.0 600.0 3.1500000953674316 2.0
0.0 500.0 3.309999942779541 3.0
0.0 700.0 2.940000057220459 2.0
1.0 460.0 3.450000047683716 3.0
1.0 580.0 3.4600000381469727 2.0
0.0 500.0 2.9700000286102295 4.0
0.0 440.0 2.4800000190734863 4.0
0.0 400.0 3.3499999046325684 3.0
0.0 640.0 3.859999895095825 3.0
0.0 440.0 3.130000114440918 4.0
0.0 740.0 3.369999885559082 4.0
1.0 680.0 3.2699999809265137 2.0
0.0 660.0 3.3399999141693115 3.0
1.0 740.0 4.0 3.0
0.0 560.0 3.190000057220459 3.0
0.0 380.0 2.940000057220459 3.0
0.0 400.0 3.6500000953674316 2.0
0.0 600.0 2.819999933242798 4.0
1.0 620.0 3.180000066757202 2.0
0.0 560.0 3.319999933242798 4.0
0.0 640.0 3.6700000762939453 3.0
1.0 680.0 3.8499999046325684 3.0
0.0 580.0 4.0 3.0
0.0 600.0 3.5899999141693115 2.0
0.0 740.0 3.619999885559082 4.0
0.0 620.0 3.299999952316284 1.0
0.0 580.0 3.690000057220459 1.0
0.0 800.0 3.7300000190734863 1.0
0.0 640.0 4.0 3.0
0.0 300.0 2.9200000762939453 4.0
0.0 480.0 3.390000104904175 4.0
0.0 580.0 4.0 2.0
0.0 720.0 3.450000047683716 4.0
0.0 720.0 4.0 3.0
0.0 560.0 3.359999895095825 3.0
1.0 800.0 4.0 3.0
0.0 540.0 3.119999885559082 1.0
1.0 620.0 4.0 1.0
0.0 700.0 2.9000000953674316 4.0
0.0 620.0 3.069999933242798 2.0
0.0 500.0 2.7100000381469727 2.0
0.0 380.0 2.9100000858306885 4.0
1.0 500.0 3.5999999046325684 3.0
0.0 520.0 2.9800000190734863 2.0
0.0 600.0 3.319999933242798 2.0
0.0 600.0 3.4800000190734863 2.0
0.0 700.0 3.2799999713897705 1.0
1.0 660.0 4.0 2.0
0.0 700.0 3.8299999237060547 2.0
1.0 720.0 3.640000104904175 1.0
0.0 800.0 3.9000000953674316 2.0
0.0 580.0 2.930000066757202 2.0
1.0 660.0 3.440000057220459 2.0
0.0 660.0 3.3299999237060547 2.0
0.0 640.0 3.5199999809265137 4.0
0.0 480.0 3.569999933242798 2.0
0.0 700.0 2.880000114440918 2.0
0.0 400.0 3.309999942779541 3.0
I tried different options for the correlation, such as method='pearson' or numeric_value=True, but nothing works.
I've spent a little while on this and got an answer, but it seems a little convoluted, so I'm curious whether people have a better solution.
Given a list I want a table indicating all the possible combinations between the elements.
from itertools import combinations
import pandas as pd
sample_list = ['a', 'b', 'c', 'd']
(pd.concat(
    [
        pd.DataFrame(
            [dict.fromkeys(i, 1) for i in combinations(sample_list, j)]
        ) for j in range(len(sample_list)+1)
    ])
    .fillna(0)
    .reset_index(drop=True)
)
With the result, as desired:
a b c d
0 0.0 0.0 0.0 0.0
1 1.0 0.0 0.0 0.0
2 0.0 1.0 0.0 0.0
3 0.0 0.0 1.0 0.0
4 0.0 0.0 0.0 1.0
5 1.0 1.0 0.0 0.0
6 1.0 0.0 1.0 0.0
7 1.0 0.0 0.0 1.0
8 0.0 1.0 1.0 0.0
9 0.0 1.0 0.0 1.0
10 0.0 0.0 1.0 1.0
11 1.0 1.0 1.0 0.0
12 1.0 1.0 0.0 1.0
13 1.0 0.0 1.0 1.0
14 0.0 1.0 1.0 1.0
15 1.0 1.0 1.0 1.0
For learning purposes, I would like to know of better solutions.
Thanks
Check the code below:
import itertools
import pandas as pd
sample_list = ['a', 'b', 'c', 'd']
pd.DataFrame(list(itertools.product([0, 1], repeat=len(sample_list))), columns=sample_list)
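Note that itertools.product yields the rows in binary counting order (0000, 0001, 0010, ...), whereas the table in the question is ordered by the number of selected elements. If that ordering matters, one possible sketch is to sort afterwards; the helper column `n_selected` is a name invented here, and the result holds integers rather than the floats shown in the question.

```python
import itertools
import pandas as pd

sample_list = ['a', 'b', 'c', 'd']
table = pd.DataFrame(
    list(itertools.product([0, 1], repeat=len(sample_list))),
    columns=sample_list,
)

# Sort by how many elements are selected, then by the columns in descending
# order so that 'a' is switched on before 'b', 'b' before 'c', and so on.
ordered = (
    table.assign(n_selected=table.sum(axis=1))
    .sort_values(
        ["n_selected"] + sample_list,
        ascending=[True] + [False] * len(sample_list),
    )
    .drop(columns="n_selected")
    .reset_index(drop=True)
)
print(ordered)
```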
I have a dataframe that looks like this:
iso3 prod_level alloc_key cell5m x y rec_type tech_type unit whea_a ... acof_pct_prod rcof_pct_prod coco_pct_prod teas_pct_prod toba_pct_prod bana_pct_prod trof_pct_prod temf_pct_prod vege_pct_prod rest_pct_prod
35110 IND IN16011 9243059 3990418 74.875000 13.041667 P A mt 0.0 ... 1.0 1.0 1.0 1.0 1.0 0.958586 0.449218 1.0 1.0 0.004520
35109 IND IN16011 9243058 3990417 74.791667 13.041667 P A mt 0.0 ... 1.0 1.0 1.0 1.0 1.0 0.970957 0.459725 1.0 1.0 0.009037
35406 IND IN16003 9283093 4007732 77.708333 12.708333 P A mt 0.0 ... 1.0 1.0 1.0 1.0 1.0 0.883868 1.000000 1.0 1.0 0.012084
35311 IND IN16011 9273062 4003381 75.125000 12.791667 P A mt 0.0 ... 1.0 1.0 1.0 1.0 1.0 0.942550 0.381430 1.0 1.0 0.015024
35308 IND IN16011 9273059 4003378 74.875000 12.791667 P A mt 0.0 ... 1.0 1.0 1.0 1.0 1.0 0.991871 0.887494 1.0 1.0 0.017878
I want to set all values that are greater than 0.9 in columns that end in 'prod' to zero. I can select only those columns like this:
cols2=[col for col in df.columns if col.endswith('_prod')]
df[cols2]
whea_pct_prod rice_pct_prod maiz_pct_prod barl_pct_prod pmil_pct_prod smil_pct_prod sorg_pct_prod pota_pct_prod swpo_pct_prod cass_pct_prod ... acof_pct_prod rcof_pct_prod coco_pct_prod teas_pct_prod toba_pct_prod bana_pct_prod trof_pct_prod temf_pct_prod vege_pct_prod rest_pct_prod
35110 1.0 0.958721 0.359063 1.0 1.0 1.000000 1.0 1.0 1.00000 0.992816 ... 1.0 1.0 1.0 1.0 1.0 0.958586 0.449218 1.0 1.0 0.004520
35109 1.0 0.878148 0.200283 1.0 1.0 1.000000 1.0 1.0 1.00000 0.993140 ... 1.0 1.0 1.0 1.0 1.0 0.970957 0.459725 1.0 1.0 0.009037
35406 1.0 0.996354 0.980844 1.0 1.0 0.274348 1.0 1.0 0.99945 1.000000 ... 1.0 1.0 1.0 1.0 1.0 0.883318 1.000000 1.0 1.0 0.012084
35311 1.0 0.570999 0.341217 1.0 1.0 1.000000 1.0 1.0 1.00000 0.997081 ... 1.0 1.0 1.0 1.0 1.0 0.942550 0.381430 1.0 1.0 0.015024
35308 1.0 0.657520 0.161771 1.0 1.0 1.000000 1.0 1.0 1.00000 0.991491 ... 1.0 1.0 1.0 1.0 1.0 0.991871 0.887494 1.0 1.0 0.017878
Now, when I try and set the values greater than 0.9 to be zero, it does not work.
df[cols2][df[cols2]>0.9]=0
What should I be doing instead?
You can use df.where(cond, other) to replace the values with other wherever cond is False.
df[cols2] = df[cols2].where(df[cols2]<=0.9, other=0)
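The attempt in the question fails because df[cols2] returns a new object, so the chained assignment modifies that temporary copy rather than df itself. As a small self-contained sketch, here is the same fix on a made-up three-row frame (the values are borrowed from the question, the frame itself is only illustrative), using mask, which is the mirror image of where:

```python
import pandas as pd

# Toy frame with one non-matching column and two '_prod' columns
df = pd.DataFrame({
    "iso3": ["IND", "IND", "IND"],
    "bana_pct_prod": [0.958586, 0.883868, 0.942550],
    "rest_pct_prod": [0.004520, 0.012084, 0.015024],
})
cols2 = [col for col in df.columns if col.endswith("_prod")]

# mask() replaces values where the condition is True (where() does the opposite)
df[cols2] = df[cols2].mask(df[cols2] > 0.9, 0)
print(df)
```

Assigning the result back to df[cols2] in a single step is what makes the change stick.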
I'm trying to drop a feature if it is a float and its share of missing values is higher than a certain threshold.
I've tried:
# Define threshold to 1/6
threshold = 0.1667
# Drop float > threshold
for f in data:
    if data[f].dtype==float & data[f].isnull().sum() / data.shape[0] > threshold: del data[f]
...which raises an error:
TypeError: unsupported operand type(s) for &: 'type' and 'numpy.float64'
Help would be appreciated.
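Use DataFrame.select_dtypes to keep only the float columns, flag the missing values and take the mean (sum / count) to get the share of missing values per column. Then add the non-float columns back with Series.reindex; fill_value=False compares as 0, so those columns always pass the filter. Finally, invert the condition from > to <= and filter the columns by boolean indexing:
import numpy as np
import pandas as pd
np.random.seed(2019)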
df = pd.DataFrame(np.random.choice([np.nan,1], p=(0.2,0.8),size=(10,10))).assign(A='a')
print (df)
0 1 2 3 4 5 6 7 8 9 A
0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 a
1 1.0 1.0 NaN 1.0 NaN 1.0 NaN 1.0 1.0 1.0 a
2 1.0 1.0 1.0 1.0 1.0 NaN 1.0 NaN 1.0 1.0 a
3 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 NaN 1.0 a
4 1.0 NaN 1.0 1.0 1.0 1.0 1.0 NaN 1.0 1.0 a
5 1.0 1.0 1.0 1.0 1.0 1.0 NaN 1.0 1.0 1.0 a
6 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 a
7 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 a
8 1.0 NaN 1.0 1.0 1.0 1.0 NaN 1.0 1.0 1.0 a
9 NaN 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 NaN a
threshold = 0.1667
df1 = df.select_dtypes(float).isnull().mean().reindex(df.columns, fill_value=False)
df = df.loc[:, df1 <= threshold]
print (df)
0 2 3 4 5 8 9 A
0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 a
1 1.0 NaN 1.0 NaN 1.0 1.0 1.0 a
2 1.0 1.0 1.0 1.0 NaN 1.0 1.0 a
3 1.0 1.0 1.0 1.0 1.0 NaN 1.0 a
4 1.0 1.0 1.0 1.0 1.0 1.0 1.0 a
5 1.0 1.0 1.0 1.0 1.0 1.0 1.0 a
6 1.0 1.0 1.0 1.0 1.0 1.0 1.0 a
7 1.0 1.0 1.0 1.0 1.0 1.0 1.0 a
8 1.0 1.0 1.0 1.0 1.0 1.0 1.0 a
9 NaN 1.0 1.0 1.0 1.0 1.0 NaN a
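The TypeError in the question comes from operator precedence: & binds more tightly than == and >, so Python tries to evaluate float & (share of missing values) first. If you prefer to stay close to the original loop, a minimal sketch is to use the logical `and` and collect the column names before dropping them. The frame and its column names below are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: two float columns and one non-float column
data = pd.DataFrame({
    "mostly_missing": [np.nan, np.nan, 1.0, 2.0, 3.0, 4.0],  # 2 of 6 missing
    "few_missing": [np.nan, 1.0, 2.0, 3.0, 4.0, 5.0],        # 1 of 6 missing
    "label": ["a", None, None, None, "b", "c"],              # not a float column
})
threshold = 0.1667

# Use `and` (logical) rather than `&` (bitwise), and avoid deleting
# columns while iterating over the frame.
to_drop = [
    col for col in data.columns
    if pd.api.types.is_float_dtype(data[col])
    and data[col].isnull().mean() > threshold
]
data = data.drop(columns=to_drop)
print(list(data.columns))
```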
I am working with a large array of 1's and need to systematically set sections of the array to 0. The large array is made up of many smaller arrays, and for each smaller array I need to replace its upper and lower triangles with 0's. For example, here is an array with 5 sub-arrays, identified by the index value (all sub-arrays have the same number of columns):
0 1 2
0 1.0 1.0 1.0
1 1.0 1.0 1.0
1 1.0 1.0 1.0
2 1.0 1.0 1.0
2 1.0 1.0 1.0
2 1.0 1.0 1.0
3 1.0 1.0 1.0
3 1.0 1.0 1.0
3 1.0 1.0 1.0
3 1.0 1.0 1.0
4 1.0 1.0 1.0
4 1.0 1.0 1.0
4 1.0 1.0 1.0
4 1.0 1.0 1.0
4 1.0 1.0 1.0
I want each group of rows to be modified in its upper and lower triangle such that the resulting matrix is:
0 1 2
0 1.0 1.0 1.0
1 1.0 1.0 0.0
1 0.0 1.0 1.0
2 1.0 0.0 0.0
2 0.0 1.0 0.0
2 0.0 0.0 1.0
3 1.0 0.0 0.0
3 1.0 1.0 0.0
3 0.0 1.0 1.0
3 0.0 0.0 1.0
4 1.0 0.0 0.0
4 1.0 1.0 0.0
4 1.0 1.0 1.0
4 0.0 1.0 1.0
4 0.0 0.0 1.0
At the moment I am using only numpy to build this array, but I think I can speed it up using pandas grouping. In reality my dataset is very large, almost 500,000 rows long. The numpy code is below:
import numpy as np
candidateLengths = np.array([1,2,3,4,5])
centroidLength =3
smallPaths = [min(l,centroidLength) for l in candidateLengths]
# This is the k_values of zeros to delete. To be used in np.tri
k_vals = list(map(lambda smallPath: centroidLength - (smallPath), smallPaths))
maskArray = np.ones((np.sum(candidateLengths), centroidLength))
startPos = 0
endPos = 0
for canNo, canLen in enumerate(candidateLengths):
    a = np.ones((canLen, centroidLength))
    a *= np.tri(*a.shape, dtype=bool, k=k_vals[canNo])
    b = np.fliplr(np.flipud(a))
    c = a*b
    endPos = startPos + canLen
    maskArray[startPos:endPos, :] = c
    startPos = endPos
print(maskArray)
When I run this on my real dataset it takes 5-7 seconds to execute. I think this is down to the large for loop. How can I use pandas grouping to achieve a higher speed? Thanks
New Answer
def tris(n, m):
    if n < m:
        a = np.tri(m, n, dtype=int).T
    else:
        a = np.tri(n, m, dtype=int)
    return a * a[::-1, ::-1]
idx = np.append(df.index.values, -1)
w = np.append(-1, np.flatnonzero(idx[:-1] != idx[1:]))
c = np.diff(w)
df * np.vstack([tris(n, 3) for n in c])
0 1 2
0 1.0 1.0 1.0
1 1.0 1.0 0.0
1 0.0 1.0 1.0
2 1.0 0.0 0.0
2 0.0 1.0 0.0
2 0.0 0.0 1.0
3 1.0 0.0 0.0
3 1.0 1.0 0.0
3 0.0 1.0 1.0
3 0.0 0.0 1.0
4 1.0 0.0 0.0
4 1.0 1.0 0.0
4 1.0 1.0 1.0
4 0.0 1.0 1.0
4 0.0 0.0 1.0
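For anyone who wants to run the new answer end to end, here is a self-contained sketch that also builds the example frame from the question. The way the frame is constructed (via `lengths` and np.repeat) and the name `masked` are assumptions made for this example; the run-length trick relies on equal index labels being adjacent.

```python
import numpy as np
import pandas as pd

def tris(n, m):
    # Band mask of shape (n, m): a lower triangle times its 180-degree rotation
    if n < m:
        a = np.tri(m, n, dtype=int).T
    else:
        a = np.tri(n, m, dtype=int)
    return a * a[::-1, ::-1]

# Rebuild the example: groups of 1..5 rows, 3 columns, filled with ones
lengths = [1, 2, 3, 4, 5]
index = np.repeat(np.arange(len(lengths)), lengths)
df = pd.DataFrame(np.ones((len(index), 3)), index=index)

# Positions where the index label changes give the length of each group
idx = np.append(df.index.values, -1)
w = np.append(-1, np.flatnonzero(idx[:-1] != idx[1:]))
c = np.diff(w)

# One mask per group, stacked to the shape of the frame
masked = df * np.vstack([tris(n, df.shape[1]) for n in c])
print(masked)
```

The Python-level work is then one small np.tri call per group, with no pandas groupby overhead.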
Old Answer
I define some helper triangle functions
def tris(n, m):
    if n < m:
        a = np.tri(m, n, dtype=int).T
    else:
        a = np.tri(n, m, dtype=int)
    return a * a[::-1, ::-1]
def tris_df(df):
    n, m = df.shape
    return pd.DataFrame(tris(n, m), df.index, df.columns)
Then
df * df.groupby(level=0, group_keys=False).apply(tris_df)
0 1 2
0 1.0 1.0 1.0
1 1.0 1.0 0.0
1 0.0 1.0 1.0
2 1.0 0.0 0.0
2 0.0 1.0 0.0
2 0.0 0.0 1.0
3 1.0 0.0 0.0
3 1.0 1.0 0.0
3 0.0 1.0 1.0
3 0.0 0.0 1.0
4 1.0 0.0 0.0
4 1.0 1.0 0.0
4 1.0 1.0 1.0
4 0.0 1.0 1.0
4 0.0 0.0 1.0