map values in a dataframe according to ranges - python

I have a dataframe df
import pandas
df = pandas.DataFrame(data=[1,2,3,2,2,2,3,3,4,5,10,11,12,1,2,1,1], columns=['codes'])
codes
0 1
1 2
2 3
3 2
4 2
5 2
6 3
7 3
8 4
9 5
10 10
11 11
12 12
13 1
14 2
15 1
16 1
and I would like to group the values in the column codes
according to a specific logic:
values == 0 become A
values from 1 to 4 become B
values == 5 become C
values from 6 to 16 become D
is there a way to keep the logic and the dataframe separate so that it is easy to change the grouping rules in the future?
I would like to avoid writing
df.loc[df['codes']==0, 'codes'] = 'A'
df.loc[(df['codes']>=1) & (df['codes']<=4), 'codes'] = 'B'

The first idea is to use Series.map with merged dictionaries; the second is to use cut with right=False:
import pandas as pd

df = pd.DataFrame(data=[0,1,2,3,2,2,2,3,3,4,5,10,11,12,16,2,17,1], columns=['codes'])
d1 = {0: 'A', 5:'C'}
d2 = dict.fromkeys(range(1,5), 'B')
d3 = dict.fromkeys(range(6,17), 'D')
d = {**d1, **d2, **d3}
df['codes1'] = df['codes'].map(d)
df['codes2'] = pd.cut(df['codes'], bins=(0,1,5,6,17), labels=list('ABCD'), right=False)
print (df)
codes codes1 codes2
0 0 A A
1 1 B B
2 2 B B
3 3 B B
4 2 B B
5 2 B B
6 2 B B
7 3 B B
8 3 B B
9 4 B B
10 5 C C
11 10 D D
12 11 D D
13 12 D D
14 16 D D
15 2 B B
16 17 NaN NaN
17 1 B B
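If you want the rules themselves to live separately from the transformation (the original requirement), the merged dictionary from the answer can be kept as a single module-level mapping and applied through a small helper; a minimal sketch along those lines, where the names RULES and apply_codes are illustrative:
import pandas as pd

# All grouping logic lives in this one mapping; edit it to change the rules.
RULES = {0: 'A', 5: 'C'}
RULES.update(dict.fromkeys(range(1, 5), 'B'))
RULES.update(dict.fromkeys(range(6, 17), 'D'))

def apply_codes(frame, rules=RULES):
    # Return a copy with the mapped column added, leaving the input unchanged.
    out = frame.copy()
    out['group'] = out['codes'].map(rules)
    return out

df = pd.DataFrame(data=[1, 2, 3, 5, 10], columns=['codes'])
print(apply_codes(df))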

Related

Pandas: fill NaNs based on group text values [duplicate]

The following is the pandas dataframe I have:
cluster Value
1 A
1 NaN
1 NaN
1 NaN
1 NaN
2 NaN
2 NaN
2 B
2 NaN
3 NaN
3 NaN
3 C
3 NaN
4 NaN
4 S
4 NaN
5 NaN
5 A
5 NaN
5 NaN
If we look into the data, cluster 1 has Value 'A' for one row and all the remaining rows are NA. I want to fill 'A' for all the rows of cluster 1, and similarly for all the clusters: based on the one known value in a cluster, I want to fill the remaining rows of that cluster. The output should look like:
cluster Value
1 A
1 A
1 A
1 A
1 A
2 B
2 B
2 B
2 B
3 C
3 C
3 C
3 C
4 S
4 S
4 S
5 A
5 A
5 A
5 A
I am new to python and not sure how to proceed with this. Can anybody help with this?
groupby + bfill, and ffill
df = df.groupby('cluster').bfill().ffill()
df
cluster Value
0 1 A
1 1 A
2 1 A
3 1 A
4 1 A
5 2 B
6 2 B
7 2 B
8 2 B
9 3 C
10 3 C
11 3 C
12 3 C
13 4 S
14 4 S
15 4 S
16 5 A
17 5 A
18 5 A
19 5 A
Or,
groupby + transform with first
df['Value'] = df.groupby('cluster').Value.transform('first')
df
cluster Value
0 1 A
1 1 A
2 1 A
3 1 A
4 1 A
5 2 B
6 2 B
7 2 B
8 2 B
9 3 C
10 3 C
11 3 C
12 3 C
13 4 S
14 4 S
15 4 S
16 5 A
17 5 A
18 5 A
19 5 A
Edit
The following seems better:
nan_map = df.dropna().set_index('cluster').to_dict()['Value']
df['Value'] = df['cluster'].map(nan_map)
print(df)
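Note that this map-based version assumes each cluster carries at most one non-null Value, because to_dict on a frame indexed by cluster keeps a single entry per label. If a cluster could hold several non-null values, one option (a sketch, not part of the original answer) is to build the mapping from the first non-null value per group instead:
nan_map = df.dropna().groupby('cluster')['Value'].first().to_dict()
df['Value'] = df['cluster'].map(nan_map)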
Original
I can't think of a better way to do this than iterate over all the rows, but one might exist. First I built your DataFrame:
import pandas as pd
import math
# Build your DataFrame
df = pd.DataFrame({
    'cluster': [1,1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,5,5,5,5],
    'Value': [float('nan') for _ in range(20)],
})
df['Value'] = df['Value'].astype(object)
df.at[ 0,'Value'] = 'A'
df.at[ 7,'Value'] = 'B'
df.at[11,'Value'] = 'C'
df.at[14,'Value'] = 'S'
df.at[17,'Value'] = 'A'
Now here's an approach that first creates a nan_map dict, then sets the values in Value as specified in the dict.
# Create a dict to map clusters to unique values
nan_map = df.dropna().set_index('cluster').to_dict()['Value']
# nan_map: {1: 'A', 2: 'B', 3: 'C', 4: 'S', 5: 'A'}
# Apply
for i, row in df.iterrows():
    df.at[i, 'Value'] = nan_map[row['cluster']]
print(df)
Output:
cluster Value
0 1 A
1 1 A
2 1 A
3 1 A
4 1 A
5 2 B
6 2 B
7 2 B
8 2 B
9 3 C
10 3 C
11 3 C
12 3 C
13 4 S
14 4 S
15 4 S
16 5 A
17 5 A
18 5 A
19 5 A
Note: This sets all values based on the cluster and doesn't check for NaN-ness. You may want to experiment with something like:
# Apply
for i, row in df.iterrows():
    if isinstance(df.at[i, 'Value'], float) and math.isnan(df.at[i, 'Value']):
        df.at[i, 'Value'] = nan_map[row['cluster']]
to see which is more efficient (my guess is the former, without the checks).
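If you would rather measure than guess, both loops can be compared with the standard timeit module; a rough sketch, assuming df, nan_map and math from the snippets above are in scope and using the illustrative helper names fill_all and fill_only_nan:
import timeit

def fill_all(frame):
    # Overwrite every row's Value from the cluster mapping.
    for i, row in frame.iterrows():
        frame.at[i, 'Value'] = nan_map[row['cluster']]

def fill_only_nan(frame):
    # Only overwrite rows whose Value is currently NaN.
    for i, row in frame.iterrows():
        if isinstance(frame.at[i, 'Value'], float) and math.isnan(frame.at[i, 'Value']):
            frame.at[i, 'Value'] = nan_map[row['cluster']]

# For a fair comparison, run this on a freshly built frame that still contains NaNs.
print(timeit.timeit(lambda: fill_all(df.copy()), number=100))
print(timeit.timeit(lambda: fill_only_nan(df.copy()), number=100))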

How do I filter out a random sample from a dataframe with a different sample size for each value, in python?

I have to filter out a random sample from Data in which:
'a' should have 6 values,
'b' should have 4 values and
'c' should have 7 values, chosen randomly.
Data Value
a 1
a 2
a 3
a 4
a 5
a 6
a 7
a 8
a 9
a 10
b 1
b 2
b 3
b 4
b 5
b 6
b 7
b 8
b 9
c 1
c 2
c 3
c 4
c 5
c 6
c 7
c 8
I want output as:
Data Value
a 3
a 5
a 7
a 2
a 4
a 9
b 3
b 5
b 7
b 8
c 1
c 3
c 4
c 5
c 6
c 7
c 8
First define the sample counts for each group in a dict, then use groupby with sample:
d = {'a':6, 'b':4, 'c':7}
df = df.groupby('Data', group_keys=False).apply(lambda x: x.sample(d[x.name]))
print (df)
Data Value
7 a 8
5 a 6
0 a 1
2 a 3
9 a 10
8 a 9
17 b 8
18 b 9
15 b 6
14 b 5
22 c 4
23 c 5
25 c 7
21 c 3
20 c 2
24 c 6
19 c 1
Another approach is to filter the rows matching each key of the dict and sample them separately:
d = {'a':6, 'b':4, 'c':7}
df = pd.concat([df[df['Data'].eq(k)].sample(v) for k, v in d.items()], ignore_index=True)
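Both versions draw a different random sample on every run; if the result needs to be reproducible, DataFrame.sample accepts a random_state argument, e.g.:
df = df.groupby('Data', group_keys=False).apply(lambda x: x.sample(d[x.name], random_state=42))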

python - sum list of columns, even if not all there

I have a dataframe that looks like this
A B C D G
0 9 5 7 6 1
1 1 4 7 3 1
2 8 4 1 3 1
generated by this:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 10, size=(10, 4)), columns=list('ABCD'))
x = np.array([[1, 2]])
df['G'] = np.repeat(x, 5)
Suppose a certain column 'E' sometimes exists and sometimes doesn't, depending on the time frame of the data.
So sometimes we have
A B C D E G
0 9 5 7 6 2 1
1 1 4 7 3 3 1
2 8 4 1 3 4 1
So either way, I'd like to sum the rows from columns A, C, and E, grouped by column G. When column E exists, I just use
df.groupby('G')[['A', 'C', 'E']].sum()
but when E doesn't exist, like in the first dataframe, it doesn't work.
What do I need to do in order to sum even if a column is missing?
You could store the columns you wish to sum in a list, sum_cols = list('ACE'), and then take the intersection of that list with the columns of whatever DataFrame you're working with.
df.groupby('G')[df.columns.intersection(sum_cols)].sum()
Demo
>>> df = pd.DataFrame(np.random.randint(0, 10, (2, 5)),
...                   columns=list('ABCDG'))
>>> df
A B C D G
0 9 5 9 2 6
1 3 1 1 1 3
>>> sum_cols = list('ACE')
>>> df.groupby('G')[df.columns.intersection(sum_cols)].sum()
A C
G
3 3 1
6 9 9
>>> df['E'] = [100, 200]
>>> df.groupby('G')[df.columns.intersection(sum_cols)].sum()
A C E
G
3 3 1 200
6 9 9 100
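If this pattern is needed in more than one place, the same intersection trick can be wrapped in a small helper so the wanted columns are declared once; a sketch using the illustrative name sum_present:
def sum_present(frame, cols=('A', 'C', 'E'), by='G'):
    # Sum only those of the requested columns that actually exist in the frame.
    present = frame.columns.intersection(cols)
    return frame.groupby(by)[present].sum()

sum_present(df)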

Reshape MultiIndex dataframe to tabular format

Given a sample MultiIndex:
idx = pd.MultiIndex.from_product([[0, 1, 2], ['a', 'b', 'c', 'd']])
df = pd.DataFrame({'value' : np.arange(12)}, index=idx)
df
value
0 a 0
b 1
c 2
d 3
1 a 4
b 5
c 6
d 7
2 a 8
b 9
c 10
d 11
How can I efficiently convert this to a tabular format like so?
a b c d
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
Furthermore, given the dataframe above, how can I bring it back to its original multi-indexed state?
What I've tried:
pd.DataFrame(df.values.reshape(-1, df.index.levels[1].size),
             index=df.index.levels[0], columns=df.index.levels[1])
Which works for the first problem, but I'm not sure how to bring it back to its original from there.
Using unstack and stack
In [5359]: dff = df['value'].unstack()
In [5360]: dff
Out[5360]:
a b c d
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
In [5361]: dff.stack().to_frame('value')
Out[5361]:
value
0 a 0
b 1
c 2
d 3
1 a 4
b 5
c 6
d 7
2 a 8
b 9
c 10
d 11
By using get_level_values
pd.crosstab(df.index.get_level_values(0),df.index.get_level_values(1),values=df.value,aggfunc=np.sum)
Out[477]:
col_0 a b c d
row_0
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
Another alternative, which you should think of when using stack/unstack (though unstack is clearly better in this case!) is pivot_table:
In [11]: df.pivot_table(values="value", index=df.index.get_level_values(0), columns=df.index.get_level_values(1))
Out[11]:
a b c d
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
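Whichever wide form you produce, the stack call from the first answer is what reverses it; a short round-trip sketch (df2 stands for any of the wide results above):
df2 = df['value'].unstack()                # MultiIndex -> tabular
restored = df2.stack().to_frame('value')   # tabular -> original MultiIndex frame
restored.equals(df)                        # should evaluate to True for this example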
