Categorizing DataFrame columns based on part of the column name - python

Reproducible data:
import pandas as pd
import random
data = {'test_1_a':random.sample(range(1, 50), 7),
'test_1_b':random.sample(range(1, 50), 7),
'test_1_c':random.sample(range(1, 50), 7),
'test_2_a':random.sample(range(1, 50), 7),
'test_2_b':random.sample(range(1, 50), 7),
'test_2_c':random.sample(range(1, 50), 7),
'test_3_a':random.sample(range(1, 50), 7),
'test_4_b':random.sample(range(1, 50), 7),
'test_4_c':random.sample(range(1, 50), 7)}
df = pd.DataFrame(data)
Description:
I have a data frame similar to the example above, with roughly 1000 columns. The column names follow the format test_number_family, so test_1_c has number 1 and family "c".
I want to group the df columns by "family". My final output needs to be a list of lists of same-family values:
Output example:
[ [a_family values], [b_family values], ...]
Equivalently, in terms of the column values:
[ [test_1_a, test_2_a, test_3_a], [test_1_b, test_2_b, test_3_b], ...]
What I have:
# Transfer the data frame into a dict with column names as keys
import numpy as np

col_names = [i for (i, j) in df.items()]
col_vals = [j for (i, j) in df.items()]
df_dict = dict(zip(col_names, col_vals))
families = np.unique([i.split("_")[2] for i in df_dict.keys()])
I have paired each column name with its values and extracted the unique family labels as families. I am now looking for help splitting the data frame into len(families) lists, matching the output example above.
I hope my explanation has been clear, thank you for your time!

Let's keep track of the different families in a dictionary, the keys being the letters (the families) and the values being lists holding the columns from a certain family.
Since each column name ends with the letter of its family, we can use that letter as the key in the dictionary.
from collections import defaultdict

families = defaultdict(list)
for col in df.columns:
    families[col[-1]].append(df[col])
Now for example, in families["a"], we have:
[0 26
1 13
2 11
3 35
4 43
5 45
6 46
Name: test_1_a, dtype: int64,
0 10
1 15
2 20
3 43
4 40
5 35
6 22
Name: test_2_a, dtype: int64,
0 35
1 48
2 38
3 13
4 3
5 10
6 25
Name: test_3_a, dtype: int64]
We can easily get a per-family dataframe with concat.
df_a = pd.concat(families["a"], axis=1)
Gets us:
test_1_a test_2_a test_3_a
0 26 10 35
1 13 15 48
2 11 20 38
3 35 43 13
4 43 40 3
5 45 35 10
6 46 22 25
If we want a dictionary of dataframes, one per family:
dfs = {f"df_{fam}" : pd.concat(families[fam], axis=1) for fam in families.keys()}
Now, the dictionary dfs contains:
{'df_a': test_1_a test_2_a test_3_a
0 26 10 35
1 13 15 48
2 11 20 38
3 35 43 13
4 43 40 3
5 45 35 10
6 46 22 25,
'df_b': test_1_b test_2_b test_4_b
0 18 4 44
1 48 43 2
2 30 21 4
3 46 12 16
4 42 14 25
5 22 24 13
6 43 40 43,
'df_c': test_1_c test_2_c test_4_c
0 25 15 5
1 36 39 28
2 6 3 37
3 22 48 16
4 2 34 25
5 39 16 30
6 32 36 2}
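If you specifically want the list-of-lists output from the question, a minimal sketch building it from the same families dictionary (sorting the keys so the family order is deterministic):
# One inner list of values per column, one outer list per family
list_of_lists = [[s.tolist() for s in families[fam]] for fam in sorted(families)]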

What do you think of an approach like this? Use pd.wide_to_long to get a long dataframe with split columns: one with the whole classification like 1_a, one with only the number, one with the family, and one with the values.
df = (pd.wide_to_long(
          df.reset_index(), stubnames='test_', i='index',
          j='classification', suffix=r'\d_\w')
      .reset_index()
      .drop('index', axis=1)
      .rename(columns={'test_': 'values'}))
df[['number', 'family']] = df['classification'].str.split('_', expand=True)
df = df.reindex(columns=['classification', 'number', 'family', 'values'])
print(df)
classification number family values
0 1_a 1 a 29
1 1_a 1 a 46
2 1_a 1 a 2
3 1_a 1 a 6
4 1_a 1 a 16
.. ... ... ... ...
58 4_c 4 c 30
59 4_c 4 c 23
60 4_c 4 c 26
61 4_c 4 c 40
62 4_c 4 c 39
Easy to groupby or filter for more analysis.
If you want to get dicts or lists of specific data, here are some examples:
filter1 = df.loc[df['classification']=='1_a',:]
filter2 = df.loc[df['number']=='2','values']
filter1.to_dict(orient='list')
Output:
{'classification': ['1_a', '1_a', '1_a', '1_a', '1_a', '1_a', '1_a'],
'number': ['1', '1', '1', '1', '1', '1', '1'],
'family': ['a', 'a', 'a', 'a', 'a', 'a', 'a'],
'values': [29, 46, 2, 6, 16, 12, 38]}
filter2.tolist()
Output:
[8, 2, 43, 9, 5, 30, 28, 26, 25, 49, 3, 1, 47, 44, 16, 9, 8, 15, 24, 36, 1]
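And if you want the list-of-lists output from the original question (one flat list of values per family), a sketch based on the long format above:
# Group the long-format values by family and collect each group into a list
df.groupby('family')['values'].apply(list).tolist()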

Not sure I understand the question completely; is this what you have in mind:
dict(list(df.groupby(df.columns.str[-1], axis = 1)))
{'a': test_1_a test_2_a test_3_a
0 20 36 14
1 4 7 16
2 28 13 28
3 3 40 9
4 38 41 5
5 34 47 18
6 49 25 46,
'b': test_1_b test_2_b test_4_b
0 35 10 44
1 46 14 23
2 26 11 36
3 17 27 4
4 13 16 42
5 20 38 9
6 41 22 18,
'c': test_1_c test_2_c test_4_c
0 22 2 26
1 42 24 3
2 15 16 41
3 7 11 16
4 40 37 47
5 38 7 33
6 39 22 24}
This groups the columns on the last letter in the column name.
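If the list-of-lists output from the question is the goal, it can be built from the same grouping, for example:
# One inner list per column, one outer list per family (last letter of the name)
[[grp[c].tolist() for c in grp.columns] for _, grp in df.groupby(df.columns.str[-1], axis = 1)]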
If this is not what you have in mind, kindly comment and maybe explain a bit better where I misunderstood your intent.

Related

Find closest element in list for each row in Pandas DataFrame column

I have a Pandas DataFrame and a comparison list like this:
In [21]: df
Out[21]:
Results
0 90
1 80
2 70
3 60
4 50
5 40
6 30
7 20
8 10
In [23]: comparation_list
Out[23]: [83, 72, 65, 40, 36, 22, 15, 12]
Now, I want to create a new column on this df where the value of each row is the element of the comparison list closest to the corresponding row of the Results column.
The output should be something like this:
Results assigned_value
0 90 83
1 80 83
2 70 72
3 60 65
4 50 40
5 40 40
6 30 36
7 20 22
8 10 12
Doing this through loops or using apply comes straight to mind, but I would like to know how to do it in a vectorized way.
Use a merge_asof:
out = pd.merge_asof(
    df.reset_index().sort_values(by='Results'),
    pd.Series(sorted(comparation_list), name='assigned_value'),
    left_on='Results', right_on='assigned_value',
    direction='nearest'
).set_index('index').sort_index()
Output:
Results assigned_value
index
0 90 83
1 80 83
2 70 72
3 60 65
4 50 40
5 40 40
6 30 36
7 20 22
8 10 12
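An alternative vectorized sketch using plain NumPy broadcasting; it builds a full len(df) x len(comparation_list) distance matrix, which is fine at this size:
import numpy as np

arr = np.asarray(comparation_list)
# For each Results value, find the index of the nearest comparison value
idx = np.abs(df['Results'].to_numpy()[:, None] - arr).argmin(axis=1)
df['assigned_value'] = arr[idx]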

Get all combinations of several columns in a pandas dataframe and calculate sum for each combination

I have a dataframe as below:
df = pd.DataFrame({'id': ['a', 'b', 'c', 'd'],
'colA': [1, 2, 3, 4],
'colB': [5, 6, 7, 8],
'colC': [9, 10, 11, 12],
'colD': [13, 14, 15, 16]})
I want to get all combinations of 'colA', 'colB', 'colC' and 'colD' and calculate the sum for each combination. I can get all combinations using itertools:
from itertools import combinations

cols = ['colA', 'colB', 'colC', 'colD']
all_combinations = [c for i in range(2, len(cols) + 1) for c in combinations(cols, i)]
But how can I get the sum for each combination and create a new column in the dataframe? Expected output:
id colA colB colC colD colA+colB colB+colC ... colA+colB+colC+colD
a 1 5 9 13 6 14 ... 28
b 2 6 10 14 8 16 ... 32
c 3 7 11 15 10 18 ... 36
d 4 8 12 16 12 20 ... 40
First, select from the frame the list of all columns starting with col. Then create a dictionary using combinations, where the keys are the names of the new sum columns and the values are the sums of the corresponding columns of the original dataframe. Finally, unpack it with ** as keyword arguments to the assign method, which adds the new columns to the frame.
cols = [c for c in df.columns if c.startswith('col')]
df = df.assign(**{'+'.join(c): df.loc[:, list(c)].sum(axis=1)
                  for i in range(2, len(cols) + 1)
                  for c in combinations(cols, i)})
print(df)
id colA colB colC colD colA+colB colA+colC colA+colD colB+colC colB+colD colC+colD colA+colB+colC colA+colB+colD colA+colC+colD colB+colC+colD colA+colB+colC+colD
0 a 1 5 9 13 6 10 14 14 18 22 15 19 23 27 28
1 b 2 6 10 14 8 12 16 16 20 24 18 22 26 30 32
2 c 3 7 11 15 10 14 18 18 22 26 21 25 29 33 36
3 d 4 8 12 16 12 16 20 20 24 28 24 28 32 36 40
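As a quick sanity check, any of the new columns can be verified against a manual sum:
# colA+colB should equal the elementwise sum of colA and colB
assert df['colA+colB'].equals(df['colA'] + df['colB'])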

Only sum consecutive pandas rows when a column has consecutive numbers

I have a dataframe like
pd.DataFrame({'i': [ 3, 4, 12, 25, 44, 45, 52, 53, 65, 66]
, 't': range(1,11)
, 'v': range(0,100)[::10]}
)
i.e.
i t v
0 3 1 0
1 4 2 10
2 12 3 20
3 25 4 30
4 44 5 40
5 45 6 50
6 52 7 60
7 53 8 70
8 65 9 80
9 66 10 90
I would like to sum the values in column v with those of the next row whenever i increases by 1, and otherwise do nothing.
One can assume that there are at most two consecutive rows to sum; thus the last row of a pair may be ambiguous, depending on whether it is summed or not.
The resulting dataframe should look like:
i t v
0 3 1 10
2 12 3 20
3 25 4 30
4 44 5 90
6 52 7 130
8 65 9 170
Obviously I could loop over the dataframe using .iterrows(), but there must be a smarter solution.
I tried various combinations of shift, diff and groupby, though I cannot see a way to do it...
It's a common technique to identify the blocks with cumsum on diff:
blocks = df['i'].diff().ne(1).cumsum()
df.groupby(blocks, as_index=False).agg({'i':'first','t':'first', 'v':'sum'})
Output:
i t v
0 3 1 10
1 12 3 20
2 25 4 30
3 44 5 90
4 52 7 130
5 65 9 170
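For clarity, here is the intermediate blocks series for the sample data; a new block starts wherever i does not increase by exactly 1:
print(blocks.tolist())
# [1, 1, 2, 3, 4, 4, 5, 5, 6, 6]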
Let us try
out = df.groupby(df['i'].diff().ne(1).cumsum()).agg({'i':'first','t':'first','v':'sum'})
Out[11]:
i t v
i
1 3 1 10
2 12 3 20
3 25 4 30
4 44 5 90
5 52 7 130
6 65 9 170

How to group data of one dataframe by matching a condition on another dataframe, when both dataframes have exactly the same columns and index?

I have a dataframe, say df1 like this:
1 2 3 4
1 5 2 3 14
2 20 18 12 16
3 13 21 32 4
4 17 25 32 13
and another dataframe, say df2 like this:
1 2 3 4
1 13 19 45 56
2 45 54 28 31
3 33 45 32 9
4 23 65 14 15
I want to fill this third dataframe, say df3:
Lower Upper Val
0 5 73
5 10 13
10 15 132
15 20 ...
20 25 ...
25 30 ...
Now, to fill the Val column in df3, the code should first find the cells in df1 whose values lie between the given limits in df3, and then add up the corresponding values of df2, not df1.
I was able to do it using nested loops, but I need something without loops.
Thank you in advance.
The idea is to create an IntervalIndex.from_arrays from the Lower and Upper columns. Then reshape the first and second DataFrames with DataFrame.stack, bin the first one with cut, and use the bins as groups for aggregation by sum:
df3.index = pd.IntervalIndex.from_arrays(df3['Lower'], df3['Upper'], closed='left')
print (df3.index)
IntervalIndex([[0, 5), [5, 10), [10, 15), [15, 20), [20, 25), [25, 30)],
closed='left',
dtype='interval[int64]')
df3['Val'] = df2.stack().groupby(pd.cut(df1.stack(), df3.index)).sum()
print (df3)
Lower Upper Val
[0, 5) 0 5 73
[5, 10) 5 10 13
[10, 15) 10 15 132
[15, 20) 15 20 108
[20, 25) 20 25 90
[25, 30) 25 30 65
Finally, create a default index:
df3 = df3.reset_index(drop=True)
print (df3)
Lower Upper Val
0 0 5 73
1 5 10 13
2 10 15 132
3 15 20 108
4 20 25 90
5 25 30 65
Details:
print (df1.stack())
1 1 5
2 2
3 3
4 14
2 1 20
2 18
3 12
4 16
3 1 13
2 21
3 32
4 4
4 1 17
2 25
3 32
4 13
dtype: int64
print (pd.cut(df1.stack(), df3.index))
1 1 [5.0, 10.0)
2 [0.0, 5.0)
3 [0.0, 5.0)
4 [10.0, 15.0)
2 1 [20.0, 25.0)
2 [15.0, 20.0)
3 [10.0, 15.0)
4 [15.0, 20.0)
3 1 [10.0, 15.0)
2 [20.0, 25.0)
3 NaN
4 [0.0, 5.0)
4 1 [15.0, 20.0)
2 [25.0, 30.0)
3 NaN
4 [10.0, 15.0)
dtype: category
Categories (6, interval[int64]): [[0, 5) < [5, 10) < [10, 15) < [15, 20) < [20, 25) < [25, 30)]
print (df2.stack())
1 1 13
2 19
3 45
4 56
2 1 45
2 54
3 28
4 31
3 1 33
2 45
3 32
4 9
4 1 23
2 65
3 14
4 15
dtype: int64

Find maximum value in python dataframe combining several rows

I have a dataframe that looks like the following (already sorted by the item column). Items 1-10, 11-20, ... (every 10 items) are in the same category, and I want to find the item in each category that has the highest score and return it.
What is the most efficient way to do that?
item score
1 1 10
3 4 1
4 6 6
39 11 2
8 12 1
9 13 1
10 15 24
11 17 9
12 18 12
13 20 7
14 22 1
59 25 3
18 28 3
19 29 2
22 34 2
23 37 1
24 38 3
25 39 2
26 40 2
27 42 3
29 45 1
31 48 1
32 53 4
33 58 4
Assuming your dataframe is stored in df:
import numpy as np

# Extend the bins one step past the max so the last items (53, 58) are binned too
g = df.groupby(pd.cut(df.item, np.arange(1, df.item.max() + 10, 10), right=False))
Get the max score from each category:
max_score_ids = g.score.agg('idxmax')
This gives you the ids of the rows that contain the max score in each category:
item
[1, 11)      1
[11, 21)    10
[21, 31)    59
[31, 41)    24
[41, 51)    27
[51, 61)    32
Then get the items associated with these ids:
df.loc[max_score_ids].item
1      1
10    15
59    25
24    38
27    42
32    53
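Since the categories are fixed width, an alternative sketch is to bin arithmetically with integer division instead of pd.cut, which sidesteps the bin-edge bookkeeping:
# (item - 1) // 10 maps items 1-10 to bucket 0, 11-20 to bucket 1, etc.
best_items = df.loc[df.groupby((df['item'] - 1) // 10)['score'].idxmax(), 'item']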
