I have a dataframe in pandas containing information that I would like to display as a heatmap of sorts. The dataframe holds the x and y coordinates of several objects at varying points in time and includes other information in extra columns (e.g. mass).
time object x y mass
3 1.0 216 12 12
4 1.0 218 13 12
5 1.0 217 12 12
6 1.0 234 13 13
1 2.0 361 289 23
2 2.0 362 287 22
3 2.0 362 286 22
5 3.0 124 56 18
6 3.0 126 52 17
I would like to create a heatmap with the x and y values corresponding to the x and y axes of the heatmap. The greater the number of objects at a particular x/y location, the more intense I would like the color to be. Any ideas on how you would accomplish this?
One idea is to use a seaborn heatmap. First I would pivot your dataframe on your desired output, in this case x, y and, say, mass, with:
In [4]: df
Out[4]:
time object x y mass
0 3 1.0 216 12 12
1 4 1.0 218 13 12
2 5 1.0 217 12 12
3 6 1.0 234 13 13
4 1 2.0 361 289 23
5 2 2.0 362 287 22
6 3 2.0 362 286 22
7 5 3.0 124 56 18
8 6 3.0 126 52 17
In [5]: d = df.pivot(index='x', columns='y', values='mass')
In [6]: d
Out[6]:
y 12 13 52 56 286 287 289
x
124 NaN NaN NaN 18.0 NaN NaN NaN
126 NaN NaN 17.0 NaN NaN NaN NaN
216 12.0 NaN NaN NaN NaN NaN NaN
217 12.0 NaN NaN NaN NaN NaN NaN
218 NaN 12.0 NaN NaN NaN NaN NaN
234 NaN 13.0 NaN NaN NaN NaN NaN
361 NaN NaN NaN NaN NaN NaN 23.0
362 NaN NaN NaN NaN 22.0 22.0 NaN
Then you can apply a simple heatmap with:
ax = sns.heatmap(d)
As a result you get the following image. If you need a more complex attribute instead of the single mass, you can add a new column to the original dataframe. The seaborn documentation has samples on how to define colormaps, styles, etc.
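Since the question asks for intensity driven by the number of objects at each location rather than by mass, a count-based variant is a small change: pd.crosstab counts rows per (y, x) cell and the same sns.heatmap call applies. A minimal sketch, using the coordinates from the question's sample:

```python
import pandas as pd

# x/y positions taken from the question's sample rows
df = pd.DataFrame({
    "x": [216, 218, 217, 234, 361, 362, 362, 124, 126],
    "y": [12, 13, 12, 13, 289, 287, 286, 56, 52],
})

# One cell per (y, x) pair; the value is how many objects were seen there
counts = pd.crosstab(df["y"], df["x"])

# ax = sns.heatmap(counts)  # uncomment with seaborn imported as sns
```

Note the grid is sparse: crosstab only creates rows and columns for coordinates that actually occur, so for a dense image you may want to reindex over the full coordinate range.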
Let's say I have the following df with two columns:
col1 col2
NaN NaN
11 100
12 110
15 115
NaN NaN
NaN NaN
NaN NaN
9 142
12 144
NaN NaN
NaN NaN
NaN NaN
6 155
9 156
7 161
NaN NaN
NaN NaN
I'd like to forward fill and replace the NaN values with the median of the preceding non-NaN run. For example, the median of 11, 12, 15 in 'col1' is 12, so the NaN values should be filled with 12 until the next non-NaN values appear in the column, and so on down the column. See below the expected df:
col1 col2
NaN NaN
11 100
12 110
15 115
12 110
12 110
12 110
9 142
12 144
10.5 143
10.5 143
10.5 143
6 155
9 156
7 161
7 156
7 156
Try:
# label each consecutive run of NaN / non-NaN values with its own id
m1 = (df.col1.isna() != df.col1.isna().shift(1)).cumsum()
m2 = (df.col2.isna() != df.col2.isna().shift(1)).cumsum()
# a NaN run's median is NaN, so ffill pulls in the preceding run's median
df["col1"] = df["col1"].fillna(
    df.groupby(m1)["col1"].transform("median").ffill()
)
df["col2"] = df["col2"].fillna(
    df.groupby(m2)["col2"].transform("median").ffill()
)
print(df)
Prints:
col1 col2
0 NaN NaN
1 11.0 100.0
2 12.0 110.0
3 15.0 115.0
4 12.0 110.0
5 12.0 110.0
6 12.0 110.0
7 9.0 142.0
8 12.0 144.0
9 10.5 143.0
10 10.5 143.0
11 10.5 143.0
12 6.0 155.0
13 9.0 156.0
14 7.0 161.0
15 7.0 156.0
16 7.0 156.0
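The key step above is the (s.isna() != s.isna().shift(1)).cumsum() expression: it assigns every consecutive run of NaN / non-NaN values its own integer label, so the groupby computes one median per run. A small sketch of just that labelling step:

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 11, 12, 15, np.nan, np.nan, 9, 12])

# True wherever NaN-ness flips relative to the previous row;
# the cumulative sum then gives each run its own block id
block_id = (s.isna() != s.isna().shift(1)).cumsum()
print(block_id.tolist())  # [1, 2, 2, 2, 3, 3, 4, 4]
```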
IIUC, if we fill null values like so:
1. Fill with the median of the last 3 non-null items.
2. Fill with the median of the last 2 non-null items.
3. Forward fill values.
We'll get what you're looking for.
out = (df.combine_first(df.rolling(4,3).median())
.combine_first(df.rolling(3,2).median())
.ffill())
print(out)
Output:
col1 col2
0 NaN NaN
1 11.0 100.0
2 12.0 110.0
3 15.0 115.0
4 12.0 110.0
5 12.0 110.0
6 12.0 110.0
7 9.0 142.0
8 12.0 144.0
9 10.5 143.0
10 10.5 143.0
11 10.5 143.0
12 6.0 155.0
13 9.0 156.0
14 7.0 161.0
15 7.0 156.0
16 7.0 156.0
I have a dataset similar to this:
Serial A B
1 12
1 31
1
1 12
1 31 203
1 10
1 2
2 32 100
2 32 242
2 3
3 2
3 23 100
3
3 23
I group the dataframe by Serial and find the maximum of column A with df['A_MAX'] = df.groupby('Serial')['A'].transform('max').values, retaining only the first value per group with df['A_MAX'] = df['A_MAX'].mask(df['Serial'].duplicated(), ''):
Serial A B A_MAX B_corresponding
1 12 31 203
1 31
1
1 12
1 31 203
1 10
1 2
2 32 100 32 100
2 32 242
2 3
3 2 23 100
3 23 100
3
3 23
Now for the B_corresponding column, I would like to get the B value that corresponds to A_MAX. I thought of locating the A_MAX values in A, but several rows in a group can share the same maximum A. As an additional condition, in Serial 2 for example, I would prefer to get the smallest B value among the rows where A equals 32.
The idea is to use DataFrame.sort_values to bring the maximal values per group first, then remove missing values with DataFrame.dropna and keep the first row per Serial with DataFrame.drop_duplicates. Create a Series with DataFrame.set_index and finally use Series.map:
df['A_MAX'] = df.groupby('Serial')['A'].transform('max')
df['A_MAX'] = df['A_MAX'].mask(df['Serial'].duplicated())
s = (df.sort_values(['Serial','A'], ascending=[True, False])
.dropna(subset=['B'])
.drop_duplicates('Serial')
.set_index('Serial')['B'])
df['B_corresponding'] = df['Serial'].map(s).mask(df['Serial'].duplicated())
print (df)
Serial A B A_MAX B_corresponding
0 1 12.0 NaN 31.0 203.0
1 1 31.0 NaN NaN NaN
2 1 NaN NaN NaN NaN
3 1 12.0 NaN NaN NaN
4 1 31.0 203.0 NaN NaN
5 1 10.0 NaN NaN NaN
6 1 2.0 NaN NaN NaN
7 2 32.0 100.0 32.0 100.0
8 2 32.0 242.0 NaN NaN
9 2 3.0 NaN NaN NaN
10 3 2.0 NaN 23.0 100.0
11 3 23.0 100.0 NaN NaN
12 3 NaN NaN NaN NaN
13 3 23.0 NaN NaN NaN
Converting missing values to empty strings is possible, but you get mixed values, numeric and strings, so further processing can be problematic:
df['A_MAX'] = df.groupby('Serial')['A'].transform('max')
df['A_MAX'] = df['A_MAX'].mask(df['Serial'].duplicated(), '')
s = (df.sort_values(['Serial','A'], ascending=[True, False])
.dropna(subset=['B'])
.drop_duplicates('Serial')
.set_index('Serial')['B'])
df['B_corresponding'] = df['Serial'].map(s).mask(df['Serial'].duplicated(), '')
print (df)
Serial A B A_MAX B_corresponding
0 1 12.0 NaN 31 203
1 1 31.0 NaN
2 1 NaN NaN
3 1 12.0 NaN
4 1 31.0 203.0
5 1 10.0 NaN
6 1 2.0 NaN
7 2 32.0 100.0 32 100
8 2 32.0 242.0
9 2 3.0 NaN
10 3 2.0 NaN 23 100
11 3 23.0 100.0
12 3 NaN NaN
13 3 23.0 NaN
You could also use dictionaries to achieve the same if you are not so inclined to only use pandas.
a_to_b_mapping = df.groupby('A')['B'].min().to_dict()
serial_to_a_mapping = df.groupby('Serial')['A'].max().to_dict()
agg_df = []  # a list of row tuples (a dict has no .append)
for serial, a in serial_to_a_mapping.items():
    agg_df.append((serial, a, a_to_b_mapping.get(a, None)))
agg_df = pd.DataFrame(agg_df, columns=['Serial', 'A_max', 'B_corresponding'])
agg_df.head()
Serial A_max B_corresponding
0 1 31.0 203.0
1 2 32.0 100.0
2 3 23.0 100.0
If you want, you could join this to the original dataframe and mask duplicates.
dft = df.join(agg_df.set_index('Serial'), on='Serial', how='left')
dft['A_max'] = dft['A_max'].mask(dft['Serial'].duplicated(), '')
dft['B_corresponding'] = dft['B_corresponding'].mask(dft['Serial'].duplicated(), '')
dft
I have the following pandas series:
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 NaN
9 NaN
10 NaN
11 2.291958
12 NaN
13 NaN
14 NaN
15 NaN
16 NaN
17 NaN
18 NaN
19 NaN
20 NaN
21 NaN
22 NaN
23 NaN
24 NaN
25 NaN
26 0.378826
27 NaN
28 NaN
29 NaN
...
123 NaN
124 NaN
125 NaN
126 NaN
127 1.170094
128 NaN
129 NaN
130 NaN
131 0.008531
132 NaN
133 NaN
134 NaN
135 NaN
136 NaN
137 NaN
138 NaN
139 NaN
140 NaN
141 NaN
142 NaN
143 NaN
144 NaN
145 NaN
146 NaN
147 NaN
148 NaN
149 NaN
150 NaN
151 NaN
152 NaN
Length: 153, dtype: float64
I interpolate it as follows:
ts.interpolate(method='cubic', limit_direction='both', limit=75)
I would have expected all NaNs to be filled by this, but in the output, NaNs still remain. Why is that, and how can I fix it in the interpolate command? The output is as follows:
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 NaN
9 NaN
10 NaN
11 2.291958
12 1.733142
13 1.255447
14 0.854370
15 0.525409
16 0.264062
17 0.065826
18 -0.073801
19 -0.159321
20 -0.195237
21 -0.186051
22 -0.136265
23 -0.050382
24 0.067095
25 0.211666
26 0.378826
27 0.564074
28 0.762908
29 0.970824
...
123 1.649933
124 1.579817
125 1.479152
126 1.343917
127 1.170094
128 0.953663
129 0.690605
130 0.376900
131 0.008531
132 NaN
133 NaN
134 NaN
135 NaN
136 NaN
137 NaN
138 NaN
139 NaN
140 NaN
141 NaN
142 NaN
143 NaN
144 NaN
145 NaN
146 NaN
147 NaN
148 NaN
149 NaN
150 NaN
151 NaN
152 NaN
Length: 153, dtype: float64
Cubic interpolation cannot fill NaNs outside the known data points; it only fills between them, so the leading and trailing NaNs remain. If you change the method to linear, it will fill the edges too:
s.interpolate('linear', limit_direction='both', limit=75)
Out[62]:
0 2.291958
1 2.291958
2 2.291958
3 2.291958
4 2.291958
5 2.291958
6 2.291958
7 2.291958
8 2.291958
9 2.291958
10 2.291958
11 2.291958
12 2.164416
13 2.036874
14 1.909332
15 1.781789
16 1.654247
17 1.526705
18 1.399163
19 1.271621
20 1.144079
21 1.016537
22 0.888995
23 0.761452
24 0.633910
25 0.506368
26 0.378826
27 0.378826
28 0.378826
29 0.378826
Name: s, dtype: float64
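If you would rather keep cubic interpolation for the interior and still fill the leading and trailing NaNs, one option (my suggestion, not part of the original answer) is to pad the edges separately after the cubic pass, e.g. with ffill/bfill:

```python
import numpy as np
import pandas as pd

ts = pd.Series([np.nan, np.nan, 1.0, 2.0, np.nan, 4.0, 5.0, np.nan, np.nan])

# 'cubic' (SciPy-based) fills only between the outermost known points;
# ffill/bfill then extend the nearest known value out to the edges
out = ts.interpolate(method="cubic").ffill().bfill()
```

Note that method='cubic' requires SciPy and at least four non-NaN points.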
I had the following data frame (the real data frame is much larger than this one):
sale_user_id sale_product_id count
1 1 1
1 8 1
1 52 1
1 312 5
1 315 1
Then I reshaped it to move the values in sale_product_id to the column headers, using the following code:
reshaped_df = id_product_count.pivot(index='sale_user_id', columns='sale_product_id', values='count')
and the resulting data frame is:
sale_product_id -1057 1 2 3 4 5 6 8 9 10 ... 98 980 981 982 983 984 985 986 987 99
sale_user_id
1 NaN 1.0 NaN NaN NaN NaN NaN 1.0 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 NaN 1.0 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 NaN NaN 1.0 NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
As you can see, we have a multilevel index; what I need is to have sale_user_id in the first column without multilevel indexing.
I take the following approach:
reshaped_df.reset_index()
The result would be like this; I still have the sale_product_id columns name, but I do not need it anymore:
sale_product_id sale_user_id -1057 1 2 3 4 5 6 8 9 ... 98 980 981 982 983 984 985 986 987 99
0 1 NaN 1.0 NaN NaN NaN NaN NaN 1.0 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 3 NaN 1.0 NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 4 NaN NaN 1.0 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN
I can subset this data frame to get rid of sale_product_id, but I don't think that would be efficient. I am looking for an efficient way to get rid of the multilevel indexing while reshaping the original data frame.
You need to remove only the index name; use rename_axis (new in pandas 0.18.0):
print (reshaped_df)
sale_product_id 1 8 52 312 315
sale_user_id
1 1 1 1 5 1
print (reshaped_df.index.name)
sale_user_id
print (reshaped_df.rename_axis(None))
sale_product_id 1 8 52 312 315
1 1 1 1 5 1
Another solution working in pandas below 0.18.0:
reshaped_df.index.name = None
print (reshaped_df)
sale_product_id 1 8 52 312 315
1 1 1 1 5 1
If you also need to remove the columns name:
print (reshaped_df.columns.name)
sale_product_id
print (reshaped_df.rename_axis(None).rename_axis(None, axis=1))
1 8 52 312 315
1 1 1 1 5 1
Another solution:
reshaped_df.columns.name = None
reshaped_df.index.name = None
print (reshaped_df)
1 8 52 312 315
1 1 1 1 5 1
EDIT by comment:
You need reset_index with parameter drop=True:
reshaped_df = reshaped_df.reset_index(drop=True)
print (reshaped_df)
sale_product_id 1 8 52 312 315
0 1 1 1 5 1
# if you need to reset the index and remove the column name
reshaped_df = reshaped_df.reset_index(drop=True).rename_axis(None, axis=1)
print (reshaped_df)
1 8 52 312 315
0 1 1 1 5 1
Or if you need to remove only the column name:
reshaped_df = reshaped_df.rename_axis(None, axis=1)
print (reshaped_df)
1 8 52 312 315
sale_user_id
1 1 1 1 5 1
Edit 1:
So if you need to create a new column from the index and remove the columns name:
reshaped_df = reshaped_df.rename_axis(None, axis=1).reset_index()
print (reshaped_df)
sale_user_id 1 8 52 312 315
0 1 1 1 1 5 1
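Putting the pieces together on the question's own data (a small assumed subset of the values), a minimal end-to-end sketch would be:

```python
import pandas as pd

# assumed subset of the question's long-format data
df = pd.DataFrame({"sale_user_id": [1, 1, 1],
                   "sale_product_id": [1, 8, 52],
                   "count": [1, 1, 1]})

# pivot, bring sale_user_id back as a regular column,
# then drop the leftover 'sale_product_id' columns name
reshaped = (df.pivot(index="sale_user_id", columns="sale_product_id", values="count")
              .reset_index()
              .rename_axis(None, axis=1))
```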
Make a DataFrame
import random
import pandas as pd

d = {'Country': ['Afghanistan','Albania','Algeria','Andorra','Angola']*2,
     'Year': [2005]*5 + [2006]*5, 'Value': random.sample(range(1,20),10)}
df = pd.DataFrame(data=d)
df:
Country Year Value
1 Afghanistan 2005 6
2 Albania 2005 13
3 Algeria 2005 10
4 Andorra 2005 11
5 Angola 2005 5
6 Afghanistan 2006 3
7 Albania 2006 2
8 Algeria 2006 7
9 Andorra 2006 3
10 Angola 2006 6
Pivot
table = df.pivot(index='Country',columns='Year',values='Value')
Table:
Year         2005  2006
Country
Afghanistan    16     9
Albania        17    19
Algeria        11     7
Andorra         5    12
Angola          6    18
I want the 'Year' columns name gone and 'Country' back as a regular column:
clean_tbl = table.rename_axis(None, axis=1).reset_index()
clean_tbl:
Country 2005 2006
0 Afghanistan 16 9
1 Albania 17 19
2 Algeria 11 7
3 Andorra 5 12
4 Angola 6 18
Done!
You can also use the to_flat_index method of the MultiIndex object to convert it into a list of tuples, which you can then concatenate with a list comprehension and use to overwrite the .columns attribute of your dataframe.
# create a dataframe
df = pd.DataFrame({"a": [1, 2, 3, 1], "b": ["x", "x", "y", "y"], "c": [0.1, 0.2, 0.1, 0.2]})
a b c
0 1 x 0.1
1 2 x 0.2
2 3 y 0.1
3 1 y 0.2
# pivot the dataframe
df_pivoted = df.pivot(index="a", columns="b")
c
b x y
a
1 0.1 0.2
2 0.2 NaN
3 NaN 0.1
Now let's overwrite the .columns attribute and .reset_index():
df_pivoted.columns = ["_".join(tup) for tup in df_pivoted.columns.to_flat_index()]
df_pivoted.reset_index()
a c_x c_y
0 1 0.1 0.2
1 2 0.2 NaN
2 3 NaN 0.1
We need reset_index() to move the index back into the dataframe as a column, then rename_axis(None, axis=1) to clear the leftover columns name (axis=1 refers to the column headers).
reshaped_df = reshaped_df.reset_index().rename_axis(None, axis=1)
Pivot from long to wide format using pivot:
import pandas
df = pandas.DataFrame({
"lev1": [1, 1, 1, 2, 2, 2],
"lev2": [1, 1, 2, 1, 1, 2],
"lev3": [1, 2, 1, 2, 1, 2],
"lev4": [1, 2, 3, 4, 5, 6],
"values": [0, 1, 2, 3, 4, 5]})
df_wide = df.pivot(index="lev1", columns=["lev2", "lev3"], values="values")
df_wide
# lev2 1 2
# lev3 1 2 1 2
# lev1
# 1 0.0 1.0 2.0 NaN
# 2 4.0 3.0 NaN 5.0
Rename the (sometimes confusing) axis names
df_wide.rename_axis(columns=[None, None])
# 1 2
# 1 2 1 2
# lev1
# 1 0.0 1.0 2.0 NaN
# 2 4.0 3.0 NaN 5.0
The way it works for me is
df_cross=pd.DataFrame(pd.crosstab(df[c1], df[c2]).to_dict()).reset_index()
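As a side note, the to_dict() round trip can be skipped: pd.crosstab already returns a regular DataFrame, so reset_index() plus rename_axis() is enough. A sketch with assumed column names c1/c2:

```python
import pandas as pd

df = pd.DataFrame({"c1": ["a", "a", "b"], "c2": ["x", "y", "x"]})

# crosstab builds the frequency table directly; reset_index restores c1
# as a column and rename_axis drops the leftover 'c2' columns name
df_cross = pd.crosstab(df["c1"], df["c2"]).reset_index().rename_axis(None, axis=1)
```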
I have a list of students in a CSV file. I want (using Python) to display four columns, showing the male students who have the highest marks in Maths, Computer, and Physics.
I tried to use pandas library.
marks = pd.concat([data['name'],
data.loc[data['students']==1, 'maths'].nlargest(n=10)], 'computer'].nlargest(n=10)], 'physics'].nlargest(n=10)])
I used 1 for male students and 0 for female students.
It gives me an error saying: Invalid syntax.
Here's a way to show the top 10 students in each of the disciplines. You could of course just sum the three scores and select the students with the highest total if you want the combined as opposed to the individual performance (see illustration below).
import random
import numpy as np
import pandas as pd

df1 = pd.DataFrame(data={'name': [''.join(random.choice('abcdefgh') for _ in range(8)) for i in range(100)],
                         'students': np.random.randint(0, 2, size=100)})
df2 = pd.DataFrame(data=np.random.randint(0, 10, size=(100, 3)), columns=['math', 'physics', 'computers'])
data = pd.concat([df1, df2], axis=1)
data.info()
RangeIndex: 100 entries, 0 to 99
Data columns (total 5 columns):
name 100 non-null object
students 100 non-null int64
math 100 non-null int64
physics 100 non-null int64
computers 100 non-null int64
dtypes: int64(4), object(1)
memory usage: 4.0+ KB
res = pd.concat([data.loc[:, ['name']],
                 data.loc[data['students'] == 1, 'math'].nlargest(n=10),
                 data.loc[data['students'] == 1, 'physics'].nlargest(n=10),
                 data.loc[data['students'] == 1, 'computers'].nlargest(n=10)],
                axis=1)
res.dropna(how='all', subset=['math', 'physics', 'computers'])
name math physics computers
0 geghhbce NaN 9.0 NaN
1 hbbdhcef NaN 7.0 NaN
4 ghgffgga NaN NaN 8.0
6 hfcaccgg 8.0 NaN NaN
14 feechdec NaN NaN 8.0
15 dfaabcgh 9.0 NaN NaN
16 ghbchgdg 9.0 NaN NaN
23 fbeggcha NaN NaN 9.0
27 agechbcf 8.0 NaN NaN
28 bcddedeg NaN NaN 9.0
30 hcdgbgdg NaN 8.0 NaN
38 fgdfeefd NaN NaN 9.0
39 fbcgbeda 9.0 NaN NaN
41 agbdaegg 8.0 NaN 9.0
49 adgbefgg NaN 8.0 NaN
50 dehdhhhh NaN NaN 9.0
55 ccbaaagc NaN 8.0 NaN
68 hhggfffe 8.0 9.0 NaN
71 bhggbheg NaN 9.0 NaN
84 aabcefhf NaN NaN 9.0
85 feeeefbd 9.0 NaN NaN
86 hgeecacc NaN 8.0 NaN
88 ggedgfeg 9.0 8.0 NaN
89 faafgbfe 9.0 NaN 9.0
94 degegegd NaN 8.0 NaN
99 beadccdb NaN NaN 9.0
data['total'] = data.loc[:, ['math', 'physics', 'computers']].sum(axis=1)
data[data.students==1].nlargest(10, 'total').sort_values('total', ascending=False)
name students math physics computers total
29 fahddafg 1 8 8 8 24
79 acchhcdb 1 8 9 7 24
9 ecacceff 1 7 9 7 23
16 dccefaeb 1 9 9 4 22
92 dhaechfb 1 4 9 9 22
47 eefbfeef 1 8 8 5 21
60 bbfaaada 1 4 7 9 20
82 fbbbehbf 1 9 3 8 20
18 dhhfgcbb 1 8 8 3 19
1 ehfdhegg 1 5 7 6 18