I have some data.
I want to keep only the rows where an ID has at least 4 consecutive numbers in X. For example, if ID 1 has rows 100, 101, 102, 103, 105, the "105" should be excluded.
Data:
ID X
0 1 100
1 1 101
2 1 102
3 1 103
4 1 105
5 2 100
6 2 102
7 2 103
8 2 104
9 3 100
10 3 101
11 3 102
12 3 103
13 3 106
14 3 107
15 3 108
16 3 109
17 3 110
18 3 112
19 4 100
20 4 102
21 4 103
22 4 104
23 4 105
24 4 107
Expected results:
ID X
0 1 100
1 1 101
2 1 102
3 1 103
4 3 100
5 3 101
6 3 102
7 3 103
8 3 106
9 3 107
10 3 108
11 3 109
12 3 110
13 4 102
14 4 103
15 4 104
16 4 105
You can identify the consecutive values, then filter the groups by size with groupby.filter:
# group consecutive X: a new run starts whenever the gap to the previous X exceeds 1
g = df['X'].diff().gt(1).cumsum()  # no need to group by ID here, ID is part of the groupby key below
# filter groups: keep only runs of at least 4 rows
out = df.groupby(['ID', g]).filter(lambda grp: len(grp) >= 4)  # .reset_index(drop=True) for a clean index
Output:
ID X
0 1 100
1 1 101
2 1 102
3 1 103
9 3 100
10 3 101
11 3 102
12 3 103
13 3 106
14 3 107
15 3 108
16 3 109
17 3 110
20 4 102
21 4 103
22 4 104
23 4 105
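For reference, a minimal self-contained sketch that reproduces this result; the sample frame is rebuilt from the data shown in the question:
import pandas as pd

df = pd.DataFrame({
    'ID': [1]*5 + [2]*4 + [3]*10 + [4]*6,
    'X': [100, 101, 102, 103, 105,
          100, 102, 103, 104,
          100, 101, 102, 103, 106, 107, 108, 109, 110, 112,
          100, 102, 103, 104, 105, 107],
})

# a new run starts whenever the gap to the previous X exceeds 1
g = df['X'].diff().gt(1).cumsum()
# keep only runs of at least 4 rows; grouping on ID as well keeps runs
# from straddling two IDs
out = df.groupby(['ID', g]).filter(lambda grp: len(grp) >= 4)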
Another method: taking the diff within each ID means the first row of every ID starts a new run (its diff is NaN, which is not equal to 1), so a single grouping key is enough:
out = df.groupby(df.groupby('ID')['X'].diff().ne(1).cumsum()).filter(lambda x: len(x) >= 4)
print(out)
# Output
ID X
0 1 100
1 1 101
2 1 102
3 1 103
9 3 100
10 3 101
11 3 102
12 3 103
13 3 106
14 3 107
15 3 108
16 3 109
17 3 110
20 4 102
21 4 103
22 4 104
23 4 105
def function1(dd: pd.DataFrame):
    # rk = size of the run of consecutive X values that each row belongs to
    return dd.assign(rk=dd.assign(col1=(dd.X.diff() > 1).cumsum())
                       .groupby('col1')['X'].transform('size'))

# keep rows whose run is longer than 3; the column slice :'X' drops the helper rk
df.groupby('ID').apply(function1).loc[lambda x: x.rk > 3, :'X']
ID X
0 1 100
1 1 101
2 1 102
3 1 103
9 3 100
10 3 101
11 3 102
12 3 103
13 3 106
14 3 107
15 3 108
16 3 109
17 3 110
20 4 102
21 4 103
22 4 104
23 4 105
I am trying to append every column from one row onto another row, and I want to do this for every row, but some rows will not have any values. Take a look at my code; it will make this clearer:
Here is my data
date day_of_week day_of_month day_of_year month_of_year
5/1/2017 0 1 121 5
5/2/2017 1 2 122 5
5/3/2017 2 3 123 5
5/4/2017 3 4 124 5
5/8/2017 0 8 128 5
5/9/2017 1 9 129 5
5/10/2017 2 10 130 5
5/11/2017 3 11 131 5
5/12/2017 4 12 132 5
5/15/2017 0 15 135 5
5/16/2017 1 16 136 5
5/17/2017 2 17 137 5
5/18/2017 3 18 138 5
5/19/2017 4 19 139 5
5/23/2017 1 23 143 5
5/24/2017 2 24 144 5
5/25/2017 3 25 145 5
5/26/2017 4 26 146 5
Here is my current code:
def GetNextDayMarketData(row, dataframe):
    if row['next_calendarday'] is pd.NaT:
        return
    key = row['next_calendarday'].strftime("%Y-%m-%d")
    nextrow = dataframe.loc[key]
    for index, val in nextrow.items():
        if index != "next_calendarday":
            dataframe.loc[row.name, index + '_nextday'] = val

s = df_md['date'].shift(-1)
df_md['next_calendarday'] = s.mask(s.dt.dayofweek.diff().lt(0))
df_md.set_index('date', inplace=True)
df_md.apply(lambda row: GetNextDayMarketData(row, df_md), axis=1)
This works but it's so slow it might as well not work. Here is what the result should look like, you can see that the value from the next row has been added to the previous row. The kicker is that it's the next calendar date and not just the next row in the sequence. If a row does not have an entry for next calendar date, it will simply be blank.
Here is the expected result in csv
date day_of_week day_of_month day_of_year month_of_year next_calendarday day_of_week_nextday day_of_month_nextday day_of_year_nextday month_of_year_nextday
5/1/2017 0 1 121 5 5/2/2017 1 2 122 5
5/2/2017 1 2 122 5 5/3/2017 2 3 123 5
5/3/2017 2 3 123 5 5/4/2017 3 4 124 5
5/4/2017 3 4 124 5
5/8/2017 0 8 128 5 5/9/2017 1 9 129 5
5/9/2017 1 9 129 5 5/10/2017 2 10 130 5
5/10/2017 2 10 130 5 5/11/2017 3 11 131 5
5/11/2017 3 11 131 5 5/12/2017 4 12 132 5
5/12/2017 4 12 132 5
5/15/2017 0 15 135 5 5/16/2017 1 16 136 5
5/16/2017 1 16 136 5 5/17/2017 2 17 137 5
5/17/2017 2 17 137 5 5/18/2017 3 18 138 5
5/18/2017 3 18 138 5 5/19/2017 4 19 139 5
5/19/2017 4 19 139 5
5/23/2017 1 23 143 5 5/24/2017 2 24 144 5
5/24/2017 2 24 144 5 5/25/2017 3 25 145 5
5/25/2017 3 25 145 5 5/26/2017 4 26 146 5
5/26/2017 4 26 146 5
5/30/2017 1 30 150 5
Use DataFrame.join, then remove the column next_calendarday_nextday:
df = df.set_index('date')
df = (df.join(df, on='next_calendarday', rsuffix='_nextday')
.drop('next_calendarday_nextday', axis=1))
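Putting it together with the question's setup, a sketch (names follow the question's df_md, and 'date' is assumed to be datetime64 already, e.g. via pd.to_datetime):
# compute next_calendarday exactly as in the question, then replace the
# slow row-wise apply with a single self-join on that column
s = df_md['date'].shift(-1)
df_md['next_calendarday'] = s.mask(s.dt.dayofweek.diff().lt(0))

df_md = df_md.set_index('date')
df_md = (df_md.join(df_md, on='next_calendarday', rsuffix='_nextday')
              .drop('next_calendarday_nextday', axis=1))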
I have the following dataframe.
hour sensor_id hourly_count
0 1 101 651
1 1 102 19
2 2 101 423
3 2 102 12
4 3 101 356
5 4 101 79
6 4 102 21
7 5 101 129
8 6 101 561
Notice that for sensor_id 102 there are no values for hour = 3. The sensors do not generate a row of data when the hourly_count is zero, so sensor 102 should have hourly_count = 0 at hour = 3; this is just the way the original data was collected.
Ideally, I would like code that fills in this gap: if there are 2 sensors, each sensor should have a record for every hour, and if one is missing, a row should be inserted for that sensor and hour with hourly_count set to 0.
hour sensor_id hourly_count
0 1 101 651
1 1 102 19
2 2 101 423
3 2 102 12
4 3 101 356
5 3 102 0
6 4 101 79
7 4 102 21
8 5 101 129
9 5 102 0
10 6 101 561
11 6 102 0
Any help is really appreciated.
Using DataFrame.reindex, you can explicitly define your index. This is useful if you are missing data from both sensors for a particular hour. You can also extend the hour beyond what you have. In the following example, it extends out to hour 8.
new_ix = pd.MultiIndex.from_product([range(1,9), [101, 102]], names=['hour', 'sensor_id'])
df_new = df.set_index(['hour', 'sensor_id'])
df_new.reindex(new_ix, fill_value=0).reset_index()
Output:
hour sensor_id hourly_count
0 1 101 651
1 1 102 19
2 2 101 423
3 2 102 12
4 3 101 356
5 3 102 0
6 4 101 79
7 4 102 21
8 5 101 129
9 5 102 0
10 6 101 561
11 6 102 0
12 7 101 0
13 7 102 0
14 8 101 0
15 8 102 0
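If hard-coding the ranges is undesirable, the same index can be derived from the data itself; a sketch, assuming the observed hours and sensor ids cover everything that should be filled:
new_ix = pd.MultiIndex.from_product(
    [range(df['hour'].min(), df['hour'].max() + 1), df['sensor_id'].unique()],
    names=['hour', 'sensor_id'])
df.set_index(['hour', 'sensor_id']).reindex(new_ix, fill_value=0).reset_index()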
Use pandas.DataFrame.pivot and then unstack with reset_index:
new_df = df.pivot(index='sensor_id', columns='hour', values='hourly_count').fillna(0).unstack().reset_index()
print(new_df)
Output:
hour sensor_id 0
0 1 101 651.0
1 1 102 19.0
2 2 101 423.0
3 2 102 12.0
4 3 101 356.0
5 3 102 0.0
6 4 101 79.0
7 4 102 21.0
8 5 101 129.0
9 5 102 0.0
10 6 101 561.0
11 6 102 0.0
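The value column comes back labelled 0 (unstacking into a Series loses the original name) and fillna(0) upcasts the counts to float. A short follow-up sketch to tidy both, reusing new_df from above:
# restore a meaningful column name and integer counts
new_df = new_df.rename(columns={0: 'hourly_count'})
new_df['hourly_count'] = new_df['hourly_count'].astype(int)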
Assume the missing rows are for sensor_id 102 only. One way is to create a new df with every combination of hour and sensor_id, left-merge this new df with the original df to get hourly_count, and fillna:
a = df.hour.unique()
df1 = pd.MultiIndex.from_product([a, [101, 102]]).to_frame(index=False, name=['hour', 'sensor_id'])
Out[157]:
hour sensor_id
0 1 101
1 1 102
2 2 101
3 2 102
4 3 101
5 3 102
6 4 101
7 4 102
8 5 101
9 5 102
10 6 101
11 6 102
df1.merge(df, on=['hour','sensor_id'], how='left').fillna(0)
Out[161]:
hour sensor_id hourly_count
0 1 101 651.0
1 1 102 19.0
2 2 101 423.0
3 2 102 12.0
4 3 101 356.0
5 3 102 0.0
6 4 101 79.0
7 4 102 21.0
8 5 101 129.0
9 5 102 0.0
10 6 101 561.0
11 6 102 0.0
Another way: unstack with fill_value, then stack back:
df.set_index(['hour', 'sensor_id']).unstack(fill_value=0).stack().reset_index()
Out[171]:
hour sensor_id hourly_count
0 1 101 651
1 1 102 19
2 2 101 423
3 2 102 12
4 3 101 356
5 3 102 0
6 4 101 79
7 4 102 21
8 5 101 129
9 5 102 0
10 6 101 561
11 6 102 0
I have the following table in pandas
user_id idaggregate_info num_events num_lark_convo_events num_meals_logged num_breakfasts num_lunches num_dinners num_snacks total_activity sleep_duration num_activity_events num_weights num_notifs idusermission completed mission_delta
0 0 406 94 20 7 2 2 2 1 4456 47738 72 0 18 1426 0 NaT
1 1 1247 121 48 26 8 7 2 9 48695 37560 53 14 48 1379 1 7 days 10:04:28
2 1 1247 121 48 26 8 7 2 9 48695 37560 53 14 48 1379 1 NaT
3 2 2088 356 32 15 6 6 1 2 41598 184113 314 1 21 967 1 8 days 00:03:05
4 2 2088 356 32 15 6 6 1 2 41598 184113 314 1 21 967 1 NaT
Some user_ids have multiple lines that are identical except for their mission_delta values. How do I transform this into one line per id, with columns named "mission_delta_1", "mission_delta_2", and so on? The number of deltas varies (from 1 up to maybe 5 per user_id), so the naming has to be generated iteratively. The output would be:
user_id idaggregate_info num_events num_lark_convo_events num_meals_logged num_breakfasts num_lunches num_dinners num_snacks total_activity sleep_duration num_activity_events num_weights num_notifs idusermission completed mission_delta_1 mission_delta_2
0 0 406 94 20 7 2 2 2 1 4456 47738 72 0 18 1426 0 NaT
1 1 1247 121 48 26 8 7 2 9 48695 37560 53 14 48 1379 1 7 days 10:04:28 NaT
2 2 2088 356 32 15 6 6 1 2 41598 184113 314 1 21 967 1 8 days 00:03:05 NaT
This is not a duplicate: those questions address exploding all columns, whereas here just one column needs to be unstacked. The solutions offered in the duplicate link fail:
df.groupby(level=0).apply(lambda x: pd.Series(x.values.flatten()))
produces the same df as the original with different labels
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
0 0 406 94 20 7 2 2 2 1 4456 47738 72 0 18 1426 0 NaT
1 1 1247 121 48 26 8 7 2 9 48695 37560 53 14 48 1379 1 7 days 10:04:28
2 1 1247 121 48 26 8 7 2 9 48695 37560 53 14 48 1379 1 NaT
3 2 2088 356 32 15 6 6 1 2 41598 184113 314 1 21 967 1 8 days 00:03:05
The next options:
df.groupby(level=0).apply(lambda x: pd.Series(x.stack().values))
produces:
0 0 0
1 406
2 94
3 20
4 7
and
df.groupby(level=0).apply(lambda x: x.values.ravel()).apply(pd.Series)
produces the original dataframe:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
0 0 406 94 20 7 2 2 2 1 4456 47738 72 0 18 1426 0 NaT
1 1 1247 121 48 26 8 7 2 9 48695 37560 53 14 48 1379 1 7 days 10:04:28
2 1 1247 121 48 26 8 7 2 9 48695 37560 53 14 48 1379 1 NaT
3 2 2088 356 32 15 6 6 1 2 41598 184113 314 1 21 967 1 8 days 00:03:05
In essence, I want to turn a df:
id mission_delta
0 NaT
1 1 day
1 2 days
1 1 day
2 5 days
2 NaT
into
id mission_delta1 mission_delta_2 mission_delta_3
0 NaT NaT NaT
1 1 day 2 days 1 day
2 5 days NaT NaT
You might try this:
grp = df.groupby('id')
df_res = grp['mission_delta'].apply(lambda x: pd.Series(x.values)).unstack().fillna('NaT')
# rename the positional columns 0, 1, 2, ... to mission_delta_1, mission_delta_2, ...
# (note: fillna('NaT') fills with the string 'NaT'; use pd.NaT to keep real timedeltas)
df_res = df_res.rename(columns={i: 'mission_delta_{}'.format(i + 1) for i in range(len(df_res.columns))})
print(df_res)
mission_delta_1 mission_delta_2 mission_delta_3
id
0 NaT NaT NaT
1 1 day 2 days 1 day
2 5 days NaT NaT
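An alternative sketch using cumcount to number each occurrence per id and then pivoting; this assumes the toy frame above with id as a regular column, and keeps real NaT values instead of strings:
out = (df.assign(n=df.groupby('id').cumcount() + 1)
         .pivot(index='id', columns='n', values='mission_delta')
         .add_prefix('mission_delta_'))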
I'm a beginner and I would like to know whether it is possible to extract adjective frequencies from the categories of the Brown corpus and create a list of these adjectives with Python.
from collections import Counter
from nltk.corpus import brown

# split the words and POS tags
words, poss = zip(*brown.tagged_words())

# put the tags into a Counter object
pos_freq = Counter(poss)
for pos in pos_freq:
    print(pos, pos_freq[pos])
[out]:
' 317
'' 8789
( 2264
(-HL 162
) 2273
)-HL 184
* 4603
*-HL 8
*-NC 1
*-TL 1
, 58156
,-HL 171
,-NC 5
,-TL 4
-- 3405
---HL 26
. 60638
.-HL 598
.-NC 16
.-TL 2
: 1558
:-HL 138
:-TL 22
ABL 357
ABN 3010
ABN-HL 4
ABN-NC 1
ABN-TL 7
ABX 730
AP 9522
AP$ 9
AP+AP-NC 1
AP-HL 40
AP-NC 2
AP-TL 18
AT 97959
AT-HL 332
AT-NC 35
AT-TL 746
AT-TL-HL 5
BE 6360
BE-HL 13
BE-TL 1
BED 3282
BED* 22
BED-NC 3
BEDZ 9806
BEDZ* 154
BEDZ-HL 1
BEDZ-NC 8
BEG 686
BEM 226
BEM* 9
BEM-NC 2
BEN 2470
BEN-TL 2
BER 4379
BER* 47
BER*-NC 1
BER-HL 11
BER-NC 5
BER-TL 6
BEZ 10066
BEZ* 117
BEZ-HL 30
BEZ-NC 5
BEZ-TL 8
CC 37718
CC-HL 119
CC-NC 5
CC-TL 307
CC-TL-HL 2
CD 13510
CD$ 5
CD-HL 444
CD-NC 5
CD-TL 898
CD-TL-HL 17
CS 22143
CS-HL 25
CS-NC 5
CS-TL 2
DO 1353
DO* 485
DO*-HL 3
DO+PPSS 1
DO-HL 4
DO-NC 2
DO-TL 5
DOD 1047
DOD* 402
DOD*-TL 1
DOD-NC 1
DOZ 467
DOZ* 89
DOZ*-TL 1
DOZ-HL 16
DOZ-TL 2
DT 8957
DT$ 5
DT+BEZ 179
DT+BEZ-NC 1
DT+MD 3
DT-HL 6
DT-NC 7
DT-TL 9
DTI 2921
DTI-HL 6
DTI-TL 2
DTS 2435
DTS+BEZ 2
DTS-HL 2
DTX 104
EX 2164
EX+BEZ 105
EX+HVD 3
EX+HVZ 2
EX+MD 4
EX-HL 1
EX-NC 1
FW-* 6
FW-*-TL 2
FW-AT 24
FW-AT+NN-TL 13
FW-AT+NP-TL 2
FW-AT-HL 1
FW-AT-TL 44
FW-BE 1
FW-BER 3
FW-BEZ 4
FW-CC 27
FW-CC-TL 14
FW-CD 7
FW-CD-TL 2
FW-CS 3
FW-DT 2
FW-DT+BEZ 2
FW-DTS 1
FW-HV 1
FW-IN 84
FW-IN+AT 4
FW-IN+AT-T 3
FW-IN+AT-TL 18
FW-IN+NN 5
FW-IN+NN-TL 2
FW-IN+NP-TL 2
FW-IN-TL 40
FW-JJ 53
FW-JJ-NC 2
FW-JJ-TL 74
FW-JJR 1
FW-JJT 1
FW-NN 288
FW-NN$ 9
FW-NN$-TL 4
FW-NN-NC 6
FW-NN-TL 170
FW-NN-TL-NC 1
FW-NNS 83
FW-NNS-NC 2
FW-NNS-TL 36
FW-NP 7
FW-NP-TL 4
FW-NPS 2
FW-NPS-TL 1
FW-NR 1
FW-NR-TL 3
FW-OD-NC 1
FW-OD-TL 4
FW-PN 1
FW-PP$ 3
FW-PP$-NC 1
FW-PP$-TL 2
FW-PPL 9
FW-PPL+VBZ 2
FW-PPO 4
FW-PPO+IN 3
FW-PPS 1
FW-PPSS 6
FW-PPSS+HV 1
FW-QL 1
FW-RB 32
FW-RB+CC 1
FW-RB-TL 3
FW-TO+VB 1
FW-UH 8
FW-UH-NC 1
FW-UH-TL 1
FW-VB 26
FW-VB-NC 3
FW-VB-TL 1
FW-VBD 2
FW-VBD-TL 1
FW-VBG 7
FW-VBG-TL 1
FW-VBN 12
FW-VBZ 4
FW-WDT 16
FW-WPO 1
FW-WPS 1
HV 3928
HV* 42
HV+TO 3
HV-HL 3
HV-NC 11
HV-TL 3
HVD 4895
HVD* 99
HVD-HL 1
HVG 281
HVG-HL 1
HVN 237
HVZ 2433
HVZ* 22
HVZ-NC 2
HVZ-TL 4
IN 120557
IN+IN 1
IN+PPO 1
IN-HL 508
IN-NC 41
IN-TL 1477
IN-TL-HL 6
JJ 64028
JJ$-TL 1
JJ+JJ-NC 2
JJ-HL 396
JJ-NC 41
JJ-TL 4107
JJ-TL-HL 26
JJ-TL-NC 1
JJR 1958
JJR+CS 1
JJR-HL 17
JJR-NC 5
JJR-TL 15
JJS 359
JJS-HL 1
JJS-TL 20
JJT 1005
JJT-HL 6
JJT-NC 1
JJT-TL 4
MD 12431
MD* 866
MD*-HL 1
MD+HV 7
MD+PPSS 1
MD+TO 2
MD-HL 27
MD-NC 2
MD-TL 8
NIL 157
NN 152470
NN$ 1480
NN$-HL 20
NN$-TL 361
NN+BEZ 34
NN+BEZ-TL 2
NN+HVD-TL 1
NN+HVZ 5
NN+HVZ-TL 1
NN+IN 1
NN+MD 2
NN+NN-NC 1
NN-HL 1471
NN-NC 118
NN-TL 13372
NN-TL-HL 129
NN-TL-NC 3
NNS 55110
NNS$ 257
NNS$-HL 4
NNS$-NC 2
NNS$-TL 74
NNS$-TL-HL 1
NNS+MD 2
NNS-HL 609
NNS-NC 26
NNS-TL 2226
NNS-TL-HL 14
NNS-TL-NC 3
NP 34476
NP$ 2565
NP$-HL 8
NP$-TL 141
NP+BEZ 25
NP+BEZ-NC 3
NP+HVZ 6
NP+HVZ-NC 1
NP+MD 2
NP-HL 517
NP-NC 15
NP-TL 4019
NP-TL-HL 7
NPS 1275
NPS$ 38
NPS$-HL 1
NPS$-TL 3
NPS-HL 8
NPS-NC 2
NPS-TL 67
NR 1566
NR$ 66
NR$-TL 11
NR+MD 1
NR-HL 10
NR-NC 4
NR-TL 309
NR-TL-HL 5
NRS 16
NRS-TL 1
OD 1935
OD-HL 8
OD-NC 1
OD-TL 201
PN 2573
PN$ 89
PN+BEZ 7
PN+HVD 1
PN+HVZ 3
PN+MD 3
PN-HL 2
PN-NC 2
PN-TL 5
PP$ 16872
PP$$ 164
PP$-HL 10
PP$-NC 13
PP$-TL 35
PPL 1233
PPL-HL 1
PPL-NC 2
PPL-TL 1
PPLS 345
PPO 11181
PPO-HL 5
PPO-NC 9
PPO-TL 13
PPS 18253
PPS+BEZ 430
PPS+BEZ-HL 1
PPS+BEZ-NC 3
PPS+HVD 83
PPS+HVZ 43
PPS+MD 144
PPS-HL 19
PPS-NC 9
PPS-TL 6
PPSS 13802
PPSS+BEM 270
PPSS+BER 278
PPSS+BER-N 1
PPSS+BER-NC 1
PPSS+BER-TL 1
PPSS+BEZ 1
PPSS+BEZ* 1
PPSS+HV 241
PPSS+HV-TL 1
PPSS+HVD 83
PPSS+MD 484
PPSS+MD-NC 2
PPSS+VB 2
PPSS-HL 25
PPSS-NC 31
PPSS-TL 9
QL 8735
QL-HL 4
QL-NC 2
QL-TL 6
QLP 261
RB 36464
RB$ 9
RB+BEZ 11
RB+BEZ-HL 1
RB+BEZ-NC 1
RB+CS 3
RB-HL 49
RB-NC 26
RB-TL 40
RBR 1182
RBR+CS 1
RBR-NC 1
RBT 101
RN 9
RP 6009
RP+IN 4
RP-HL 14
RP-NC 5
RP-TL 4
TO 14918
TO+VB 2
TO-HL 55
TO-NC 13
TO-TL 10
UH 608
UH-HL 1
UH-NC 5
UH-TL 15
VB 33693
VB+AT 2
VB+IN 3
VB+JJ-NC 1
VB+PPO 71
VB+RP 2
VB+TO 4
VB+VB-NC 1
VB-HL 125
VB-NC 41
VB-TL 96
VBD 26167
VBD-HL 8
VBD-NC 11
VBD-TL 6
VBG 17893
VBG+TO 17
VBG-HL 146
VBG-NC 16
VBG-TL 133
VBN 29186
VBN+TO 5
VBN-HL 137
VBN-NC 9
VBN-TL 591
VBN-TL-HL 6
VBN-TL-NC 3
VBZ 7373
VBZ-HL 72
VBZ-NC 7
VBZ-TL 17
WDT 5539
WDT+BER 1
WDT+BER+PP 1
WDT+BEZ 47
WDT+BEZ-HL 1
WDT+BEZ-NC 2
WDT+BEZ-TL 1
WDT+DO+PPS 1
WDT+DOD 1
WDT+HVZ 2
WDT-HL 30
WDT-NC 7
WP$ 252
WPO 280
WPO-NC 1
WPO-TL 4
WPS 3924
WPS+BEZ 21
WPS+BEZ-NC 2
WPS+BEZ-TL 1
WPS+HVD 6
WPS+HVZ 2
WPS+MD 8
WPS-HL 2
WPS-NC 3
WPS-TL 12
WQL 176
WQL-TL 5
WRB 4509
WRB+BER 1
WRB+BEZ 11
WRB+BEZ-TL 3
WRB+DO 1
WRB+DOD 6
WRB+DOD* 1
WRB+DOZ 1
WRB+IN 1
WRB+MD 1
WRB-HL 36
WRB-NC 7
WRB-TL 9
`` 8837
And then:
# POS tags that start with JJ are adjectives; sum their counts up
print(sum(pos_freq[i] for i in pos_freq if i.startswith('JJ')))
[out]:
71994
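To get the adjectives themselves rather than just the tag totals, one can count only the words whose tag starts with JJ; a sketch (lowercasing the word forms is an assumption, and a category such as 'news' can be passed to tagged_words to restrict the corpus):
from collections import Counter
from nltk.corpus import brown

adj_freq = Counter(word.lower()
                   for word, tag in brown.tagged_words(categories='news')
                   if tag.startswith('JJ'))

# the most frequent adjectives as a list of (word, count) pairs
print(adj_freq.most_common(10))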
I converted a nested dictionary to a Pandas DataFrame which I want to use as to create a heatmap.
The nested dictionary is simple to create:
>>>df = pandas.DataFrame.from_dict(my_nested_dict)
>>>df
93 94 95 96 97 98 99 100 100A 100B ... 100M 100N 100O 100P 100Q 100R 100S 101 102 103
A 465 5 36 36 28 24 25 30 28 32 ... 28 19 16 15 4 4 185 2 7 3
C 0 1 2 0 6 10 8 16 23 17 ... 9 5 6 3 4 2 3 3 0 1
D 1 0 132 6 17 22 17 25 21 25 ... 12 16 21 7 5 18 2 1 296 0
E 4 0 45 10 16 12 10 15 17 18 ... 4 9 7 10 5 6 4 3 129 0
F 1 0 4 17 14 11 8 11 24 9 ... 17 8 8 12 7 3 1 98 0 1
G 2 10 77 55 71 52 65 39 37 45 ... 46 65 23 9 18 171 141 2 31 0
H 0 5 25 12 18 8 12 7 10 6 ... 8 11 6 4 4 5 2 2 1 8
I 1 8 7 23 26 35 36 34 31 38 ... 19 7 2 37 7 3 0 3 2 26
K 0 42 3 24 5 15 17 11 6 8 ... 9 10 9 8 9 2 1 28 0 0
L 3 0 19 50 32 33 21 26 26 18 ... 19 44 122 11 10 7 5 17 2 5
M 0 1 1 3 1 13 9 12 12 8 ... 20 3 1 1 0 1 0 191 0 0
N 0 5 3 12 8 15 12 13 21 9 ... 18 10 10 11 12 26 3 0 5 1
P 1 1 19 50 39 47 42 43 39 33 ... 48 35 15 16 59 2 13 6 0 160
Q 0 2 16 15 12 13 10 13 16 5 ... 11 6 3 11 4 1 0 1 6 28
R 0 380 17 66 54 41 51 32 24 29 ... 43 44 16 17 14 6 2 126 4 5
S 14 18 27 42 55 37 41 42 45 70 ... 47 31 64 14 42 18 8 3 1 5
T 4 13 17 32 29 37 33 32 30 38 ... 87 79 19 125 96 11 11 7 7 3
V 4 9 36 24 39 40 35 45 42 52 ... 20 12 12 9 8 5 0 6 7 209
W 0 0 1 6 6 8 4 7 7 9 ... 6 6 1 1 1 1 27 1 0 0
X 0 0 0 0 0 0 0 0 0 0 ... 0 4 0 0 0 0 0 0 0 0
Y 0 0 13 17 24 27 44 47 41 31 ... 29 76 139 179 191 208 92 0 2 45
I like to use ggplot to make heat maps, which here would just be this data frame. However, the dataframes ggplot needs are shaped a little differently. I can use pandas.melt to get close, but I'm missing the row labels.
>>>mdf = pandas.melt(df)
>>>mdf
variable value
0 93 465
1 93 0
2 93 1
3 93 4
4 93 1
5 93 2
6 93 0
7 93 1
8 93 0
...
624 103 5
625 103 3
626 103 209
627 103 0
628 103 0
629 103 45
The easiest way to build this dataframe would be to add the amino acid of each row, so the DataFrame looks like:
variable value rowvalue
0 93 465 A
1 93 0 C
2 93 1 D
3 93 4 E
4 93 1 F
5 93 2 G
6 93 0 H
7 93 1 I
8 93 0 K
That way I can take that dataframe and put it right into ggplot:
>>> from ggplot import *
>>> ggplot(new_df,aes("variable","rowvalue")) + geom_tile(fill="value")
would produce a beautiful heatmap. How do I manipulate the nested-dictionary dataframe to get the dataframe at the end? If there is a more efficient way to do this, I'm open to suggestions, but I still want to use ggplot2.
Edit -
I found a solution but it seems to be way too convoluted. Basically I make the index into a column, then melt the data frame.
>>>df.reset_index(level=0,inplace=True)
>>>pandas.melt(df, id_vars=['index'])
index variable value
0 A 93 465
1 C 93 0
2 D 93 1
3 E 93 4
4 F 93 1
5 G 93 2
6 H 93 0
7 I 93 1
8 K 93 0
9 L 93 3
10 M 93 0
11 N 93 0
12 P 93 1
13 Q 93 0
14 R 93 0
15 S 93 14
16 T 93 4
If I understand your question properly, I think you can simply do the following:
mdf = pandas.melt(df)
# melt stacks the columns one after another, so the original row labels repeat once per column
mdf['rowvalue'] = list(df.index) * len(df.columns)
mdf
variable value rowvalue
0 93 465 A
1 93 0 C
2 93 1 D
3 93 4 E
4 93 1 F
5 93 2 G
6 93 0 H
7 93 1 I
8 93 0 K
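Equivalently, the reset_index route from the question's edit can be collapsed into one chained call; a sketch, assuming df still has the amino acids as its index:
mdf = pandas.melt(df.reset_index(), id_vars='index').rename(columns={'index': 'rowvalue'})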