id numbers
1 {'105': 1, '65': 11, '75': 0, '85': 51, '95': 0}
2 {'105': 1, '65': 11, '75': 0, '85': 50, '95': 0}
3 {'105': 1, '65': 11, '75': 0, '85': 51, '95': 0}
4 {}
5 {}
6 {}
7 {'75 cm': 7, '85 cm': 52, '95 cm': 10}
8 {'75 cm': 51, '85 cm': 114, '95 cm': 10}
9 {'75 cm': 9, '85 cm': 60, '95 cm': 10}
This is the current table.
I know how to turn the dicts into columns and rows (keys as columns, values as rows), but what I am looking for is for each key and value to become rows under their own column headers:
test = pd.concat([df.drop(['numbers'], axis=1).sort_values(['id']),
df['numbers'].apply(pd.Series)], axis=1)
test2 = test.melt(id_vars=['id'],
var_name="name",
value_name="nameN").fillna(0)
I'm trying to get each key and value in the dictionary to be a row, like this:
id name nameN
1 105 1
1 65 11
1 75 0
1 85 51
1 95 0
You should use comprehensions to build the data for a new DataFrame. If you can simply drop the ids where numbers is an empty dictionary, you can do:
test = pd.DataFrame([[x['id'], k, v] for _, x in df.iterrows()
for k,v in x['numbers'].items()], columns=['id', 'name', 'nameN'])
to get:
id name nameN
0 1 105 1
1 1 65 11
2 1 75 0
3 1 85 51
4 1 95 0
5 2 105 1
6 2 65 11
7 2 75 0
8 2 85 50
9 2 95 0
10 3 105 1
11 3 65 11
12 3 75 0
13 3 85 51
14 3 95 0
15 7 75 cm 7
16 7 85 cm 52
17 7 95 cm 10
18 8 75 cm 51
19 8 85 cm 114
20 8 95 cm 10
21 9 75 cm 9
22 9 85 cm 60
23 9 95 cm 10
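For anyone who wants to try the comprehension approach without the original data, here is a minimal self-contained sketch. The three-row frame and the name long_df are made up for illustration; it iterates over the two columns with zip instead of iterrows, which is only a stylistic choice.

```python
import pandas as pd

# Small stand-in for the question's table (hypothetical values).
df = pd.DataFrame({
    "id": [1, 4, 7],
    "numbers": [{"105": 1, "65": 11}, {}, {"75 cm": 7, "85 cm": 52}],
})

# One output row per (key, value) pair; ids with an empty dict yield no rows.
rows = [
    (row_id, name, count)
    for row_id, numbers in zip(df["id"], df["numbers"])
    for name, count in numbers.items()
]
long_df = pd.DataFrame(rows, columns=["id", "name", "nameN"])
print(long_df)
```

Id 4 disappears from the result because its dictionary is empty.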
If you want a line with a specific value when numbers is empty:
test2 = pd.DataFrame([i for lst in [[[x['id'], '', '']] if x['numbers'] == {}
else [[x['id'], k, v] for k,v in x['numbers'].items()]
for _, x in df.iterrows()] for i in lst],
columns=['id', 'name', 'nameN']).sort_values('id').reset_index(drop=True)
giving:
id name nameN
0 1 105 1
1 1 65 11
2 1 75 0
3 1 85 51
4 1 95 0
5 2 105 1
6 2 65 11
7 2 75 0
8 2 85 50
9 2 95 0
10 3 95 0
11 3 75 0
12 3 85 51
13 3 105 1
14 3 65 11
15 4
16 5
17 6
18 7 75 cm 7
19 7 85 cm 52
20 7 95 cm 10
21 8 75 cm 51
22 8 85 cm 114
23 8 95 cm 10
24 9 85 cm 60
25 9 75 cm 9
26 9 95 cm 10
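The keep-the-empty-ids variant can also be written as a single flat comprehension. This is a sketch on made-up data, not a replacement for the code above: it relies on an empty dict's items() view being falsy, so the `or` falls back to one placeholder pair. Since it never sorts, rows stay in input order, which should also avoid the reshuffling within an id visible in the output above.

```python
import pandas as pd

# Small stand-in for the question's table (hypothetical values).
df = pd.DataFrame({
    "id": [1, 4, 7],
    "numbers": [{"105": 1, "65": 11}, {}, {"75 cm": 7, "85 cm": 52}],
})

# An empty items() view is falsy, so empty dicts fall back to a single
# placeholder pair and still produce one row for that id.
rows = [
    (row_id, name, count)
    for row_id, numbers in zip(df["id"], df["numbers"])
    for name, count in (numbers.items() or [("", "")])
]
long_df = pd.DataFrame(rows, columns=["id", "name", "nameN"])
print(long_df)
```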
I have week numbers in a dataframe, from 1 to 52, e.g. [1,2,3,4,5,6,7,8,..52].
I'm trying to create a new column for the month, which means an incremental assignment like [1,2,3,4] = 1, [5,6,7,8] = 2, .. [49,50,51,52] = 12.
I tried selecting the records at multiples of 4 with df[df["week"]%4==0] and then forward-filling, but it seems I can only assign the same number to all of them, which is not what I want. Instead I want to assign [1..12] accordingly. Is there another way to do this?
Subtract 1 first and then use integer division by 4:
df = pd.DataFrame({'week':range(1,53)})
df['new'] = (df["week"] - 1)//4
print (df.head(10))
week new
0 1 0
1 2 0
2 3 0
3 4 0
4 5 1
5 6 1
6 7 1
7 8 1
8 9 2
9 10 2
print (df.tail(10))
week new
42 43 10
43 44 10
44 45 11
45 46 11
46 47 11
47 48 11
48 49 12
49 50 12
50 51 12
51 52 12
If you want the numbering to start at 1, add 1, but then the last value is 13:
df['new'] = ((df["week"] - 1)//4) + 1
print (df.head(10))
week new
0 1 1
1 2 1
2 3 1
3 4 1
4 5 2
5 6 2
6 7 2
7 8 2
8 9 3
9 10 3
print (df.tail(10))
week new
42 43 11
43 44 11
44 45 12
45 46 12
46 47 12
47 48 12
48 49 13
49 50 13
50 51 13
51 52 13
If you want values between 1 and 12 (so some groups contain more than 4 weeks), use this solution by Aryerez, thank you:
df['new'] = ((df["week"] - 1) // (52 / 12)).astype(int) + 1
print (df.head(10))
week new
0 1 1
1 2 1
2 3 1
3 4 1
4 5 1
5 6 2
6 7 2
7 8 2
8 9 2
9 10 3
print (df.tail(10))
week new
42 43 10
43 44 10
44 45 11
45 46 11
46 47 11
47 48 11
48 49 12
49 50 12
50 51 12
51 52 12
EDIT: To put 5 values in every third group, use:
df['new'] = ((df["week"] + 4) // (52 / 12)).astype(int)
print (df.head(15))
week new
0 1 1
1 2 1
2 3 1
3 4 1
4 5 2
5 6 2
6 7 2
7 8 2
8 9 3
9 10 3
10 11 3
11 12 3
12 13 3
13 14 4
14 15 4
print (df.tail(15))
week new
37 38 9
38 39 9
39 40 10
40 41 10
41 42 10
42 43 10
43 44 11
44 45 11
45 46 11
46 47 11
47 48 12
48 49 12
49 50 12
50 51 12
51 52 12
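As a cross-check on the 1 to 12 mapping, here is a sketch that stays in integer arithmetic, so it does not depend on how 52 / 12 rounds as a float. It is an alternative formulation rather than the answer's own code, and the column name month is an assumption. Which groups receive the fifth week may differ from the outputs shown above.

```python
import pandas as pd

df = pd.DataFrame({"week": range(1, 53)})

# Scale the zero-based week onto 12 buckets using integers only.
df["month"] = (df["week"] - 1) * 12 // 52 + 1

# Number of weeks assigned to each month label.
sizes = df["month"].value_counts().sort_index()
print(sizes)
```

Every label ends up with either 4 or 5 weeks, and the labels never decrease as the week number grows.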
I have a dataframe df with two columns, X and y.
In df['y'] I have integers from 1 to 10 inclusive. However, they have different frequencies:
df['y'].value_counts()
10 6645
9 6213
8 5789
7 4643
6 2532
5 1839
4 1596
3 878
2 815
1 642
I want to cut down my dataframe so that there are equal number of occurrences for each label. As I want an equal number of each label, the minimum frequency is 642. So I only want to keep 642 randomly sampled rows of each class label in my dataframe so that my new dataframe has 642 for each class label.
I thought this might help; however, stratifying only keeps the same percentage of each label, and I want all my labels to have the same frequency.
As an example of a dataframe:
import random
import pandas as pd
df = pd.DataFrame()
df['y'] = sum([[10]*6645, [9]* 6213,[8]* 5789, [7]*4643,[6]* 2532, [5]*1839,[4]* 1596,[3]* 878, [2]*815, [1]* 642],[])
df['X'] = [random.choice(list('abcdef')) for i in range(len(df))]
Use DataFrame.sample together with groupby:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(1, 11, 100), columns=['y'])
val_cnt = df['y'].value_counts()
min_sample = val_cnt.min()
print(min_sample) # size of the rarest class; 7 in the run shown below
print(df.groupby('y').apply(lambda s: s.sample(min_sample)))
Output
y
y
1 68 1
8 1
82 1
17 1
99 1
31 1
6 1
2 55 2
15 2
81 2
22 2
46 2
13 2
58 2
3 2 3
30 3
84 3
61 3
78 3
24 3
98 3
4 51 4
86 4
52 4
10 4
42 4
80 4
53 4
5 16 5
87 5
... ..
6 26 6
18 6
7 56 7
4 7
60 7
65 7
85 7
37 7
70 7
8 93 8
41 8
28 8
20 8
33 8
64 8
62 8
9 73 9
79 9
9 9
40 9
29 9
57 9
7 9
10 96 10
67 10
47 10
54 10
97 10
71 10
94 10
[70 rows x 1 columns]
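The same idea applied to the class counts from the question, as a minimal sketch. It assumes pandas 1.1 or newer, where groupby objects have their own sample method; the X column here is only a placeholder, and random_state is fixed purely to make the run repeatable.

```python
import pandas as pd

# Class frequencies taken from the question.
counts = {10: 6645, 9: 6213, 8: 5789, 7: 4643, 6: 2532,
          5: 1839, 4: 1596, 3: 878, 2: 815, 1: 642}
df = pd.DataFrame({"y": [label for label, n in counts.items() for _ in range(n)]})
df["X"] = range(len(df))  # placeholder feature column

# Size of the rarest class, then draw that many rows from every class.
min_count = df["y"].value_counts().min()
balanced = df.groupby("y").sample(n=min_count, random_state=0)

print(balanced["y"].value_counts())
```

Unlike the apply version, this keeps the original flat index instead of adding y as an extra index level.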
My professor gave us some code to implement heapsort into our sorting class, and I can't seem to get it to work right. Every time I print it out, some of the numbers are converting into 0 (or 1s with a random fill) and not getting sorted. I know this because I have a fill function that just creates an array of numbers with increasing value that it is supposed to sort.
def heapsort(self):
    n = self.size # Doing this for simplicity
    for k in range((n-2) // 2, -1, -1):
        self.downheap(n, k)
    for m in range(n - 1, 0, -1):
        self.data[m], self.data[0] = self.data[0], self.data[m]
        self.downheap(m, 0)
def downheap(self, n, k):
    if n > 1:
        key = self.data[k]
        isHeap = False
        while (k <= (n-2) // 2) and not isHeap:
            j = 2 * k + 1
            if j + 1 < n:
                if self.data[j] < self.data[j + 1]:
                    j += 1
            if key >= self.data[j]:
                isHeap = True
            else:
                k = j
        self.data[k] = key
Unsorted list looks like-
17 19 8 8 9 3 17 13 9 1
14 19 15 12 19 4 12 6 1 8
13 8 10 5 6 6 9 17 6 5
12 5 7 16 9 10 11 3 10 14
5 3 12 1 3 10 18 10 4 19
5 10 14 9 16 8 3 14 4 13
12 8 13 10 16 17 16 10 11 3
16 9 3 16 15 3 2 11 15 3
3 3 18 7 9 6 10 4 1 4
15 10 9 1 2 18 14 11 4 3
"sorted" list looks like-
1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 8 13
1 3 5 6 6 9 12 6 5 12
1 1 3 3 1 1 3 10 8 5
3 12 1 1 1 1 10 4 3 5
10 6 9 9 8 3 6 4 5 12
8 8 1 2 1 2 3 3 2 1
9 3 11 6 3 2 1 10 3 3
3 5 7 3 6 1 1 1 4 3
1 1 1 2 10 5 4 4 3 17
And here's what it does to the inc. numbers-
0 1 2 3 4 5 6 7 8 9
10 11 12 13 14 15 16 17 18 19
20 21 22 23 24 25 26 27 28 29
30 31 32 33 34 35 36 37 38 39
40 41 42 43 44 45 46 47 48 49
50 51 52 53 54 55 56 57 58 59
60 61 62 63 64 65 66 67 68 69
70 71 72 73 74 75 76 77 78 79
80 81 82 83 84 85 86 87 88 89
90 91 92 93 94 95 96 97 98 99
"Sorted"
0 0 3 4 0 0 7 8 9 10
11 12 0 0 15 16 17 18 19 20
21 22 23 24 25 26 27 28 0 0
31 32 33 34 35 36 37 38 39 40
41 42 43 44 45 46 47 48 49 24
51 25 12 5 55 27 13 28 59 29
0 0 63 31 15 32 67 33 16 7
71 35 17 36 75 37 3 8 79 39
19 40 83 41 20 9 87 43 21 44
91 45 4 1 95 47 23 48 11 0
I've been poring over this for the last couple of days, I know I wrote everything down correctly, and I can't for the life of me figure out what is going on. Any help would be appreciated. Thanks.
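Figured it out!
I had to add the following line directly above k = j in the else branch of downheap, so that the larger child is moved up into the parent slot before descending:
self.data[k] = self.data[j]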
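For reference, a complete sketch with that line in place. The Sorter class and its constructor are hypothetical stand-ins for the asker's sorting class, which was not shown; only heapsort and downheap follow the code in the question.

```python
class Sorter:
    """Hypothetical wrapper; the real class from the question was not shown."""

    def __init__(self, data):
        self.data = list(data)
        self.size = len(self.data)

    def heapsort(self):
        n = self.size
        # Build a max-heap, starting from the last parent node.
        for k in range((n - 2) // 2, -1, -1):
            self.downheap(n, k)
        # Move the current maximum to the end, then restore the heap.
        for m in range(n - 1, 0, -1):
            self.data[m], self.data[0] = self.data[0], self.data[m]
            self.downheap(m, 0)

    def downheap(self, n, k):
        if n > 1:
            key = self.data[k]
            is_heap = False
            while k <= (n - 2) // 2 and not is_heap:
                j = 2 * k + 1
                # Pick the larger of the two children.
                if j + 1 < n and self.data[j] < self.data[j + 1]:
                    j += 1
                if key >= self.data[j]:
                    is_heap = True
                else:
                    self.data[k] = self.data[j]  # the missing line
                    k = j
            self.data[k] = key


s = Sorter([17, 19, 8, 8, 9, 3, 17, 13, 9, 1])
s.heapsort()
print(s.data)
```

Without the marked line, the child value is never copied upward, so writing key into the final slot overwrites an element and values get lost.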
I have created a Pandas DataFrame. I need to create a RangeIndex for the DataFrame that corresponds to the frame -
RangeIndex(start=0, stop=x, step=y) - where x and y relate to my DataFrame.
I've not seen an example of how to do this - is there a method or syntax specific to this?
thanks
It seems you need RangeIndex constructor:
df = pd.DataFrame({'A' : range(1, 21)})
print (df)
A
0 1
1 2
2 3
3 4
4 5
5 6
6 7
7 8
8 9
9 10
10 11
11 12
12 13
13 14
14 15
15 16
16 17
17 18
18 19
19 20
print (df.index)
RangeIndex(start=0, stop=20, step=1)
df.index = pd.RangeIndex(start=0, stop=99, step=5)
print (df)
A
0 1
5 2
10 3
15 4
20 5
25 6
30 7
35 8
40 9
45 10
50 11
55 12
60 13
65 14
70 15
75 16
80 17
85 18
90 19
95 20
print (df.index)
RangeIndex(start=0, stop=99, step=5)
More dynamic solution:
step = 10
df.index = pd.RangeIndex(start=0, stop=len(df.index) * step - 1, step=step)
print (df)
A
0 1
10 2
20 3
30 4
40 5
50 6
60 7
70 8
80 9
90 10
100 11
110 12
120 13
130 14
140 15
150 16
160 17
170 18
180 19
190 20
print (df.index)
RangeIndex(start=0, stop=199, step=10)
EDIT:
As ZakS pointed out in the comments, it is better to pass the index directly to the DataFrame constructor:
df = pd.DataFrame({'A' : range(1, 21)}, index=pd.RangeIndex(start=0, stop=99, step=5))
print (df)
A
0 1
5 2
10 3
15 4
20 5
25 6
30 7
35 8
40 9
45 10
50 11
55 12
60 13
65 14
70 15
75 16
80 17
85 18
90 19
95 20
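A compact sketch of the dynamic variant, assuming you want the stop value derived from the frame's length. Using len(df) * step as the stop is an equivalent alternative to the len(df.index) * step - 1 form used earlier, because stop is exclusive.

```python
import pandas as pd

df = pd.DataFrame({"A": range(1, 21)})

step = 5
# stop is exclusive, so len(df) * step yields exactly len(df) labels.
df.index = pd.RangeIndex(start=0, stop=len(df) * step, step=step)

print(df.index)
```

The index length has to match the number of rows, otherwise the assignment raises a ValueError.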
I converted a nested dictionary to a Pandas DataFrame, which I want to use to create a heatmap.
The DataFrame is simple to create from the nested dictionary:
>>>df = pandas.DataFrame.from_dict(my_nested_dict)
>>>df
93 94 95 96 97 98 99 100 100A 100B ... 100M 100N 100O 100P 100Q 100R 100S 101 102 103
A 465 5 36 36 28 24 25 30 28 32 ... 28 19 16 15 4 4 185 2 7 3
C 0 1 2 0 6 10 8 16 23 17 ... 9 5 6 3 4 2 3 3 0 1
D 1 0 132 6 17 22 17 25 21 25 ... 12 16 21 7 5 18 2 1 296 0
E 4 0 45 10 16 12 10 15 17 18 ... 4 9 7 10 5 6 4 3 129 0
F 1 0 4 17 14 11 8 11 24 9 ... 17 8 8 12 7 3 1 98 0 1
G 2 10 77 55 71 52 65 39 37 45 ... 46 65 23 9 18 171 141 2 31 0
H 0 5 25 12 18 8 12 7 10 6 ... 8 11 6 4 4 5 2 2 1 8
I 1 8 7 23 26 35 36 34 31 38 ... 19 7 2 37 7 3 0 3 2 26
K 0 42 3 24 5 15 17 11 6 8 ... 9 10 9 8 9 2 1 28 0 0
L 3 0 19 50 32 33 21 26 26 18 ... 19 44 122 11 10 7 5 17 2 5
M 0 1 1 3 1 13 9 12 12 8 ... 20 3 1 1 0 1 0 191 0 0
N 0 5 3 12 8 15 12 13 21 9 ... 18 10 10 11 12 26 3 0 5 1
P 1 1 19 50 39 47 42 43 39 33 ... 48 35 15 16 59 2 13 6 0 160
Q 0 2 16 15 12 13 10 13 16 5 ... 11 6 3 11 4 1 0 1 6 28
R 0 380 17 66 54 41 51 32 24 29 ... 43 44 16 17 14 6 2 126 4 5
S 14 18 27 42 55 37 41 42 45 70 ... 47 31 64 14 42 18 8 3 1 5
T 4 13 17 32 29 37 33 32 30 38 ... 87 79 19 125 96 11 11 7 7 3
V 4 9 36 24 39 40 35 45 42 52 ... 20 12 12 9 8 5 0 6 7 209
W 0 0 1 6 6 8 4 7 7 9 ... 6 6 1 1 1 1 27 1 0 0
X 0 0 0 0 0 0 0 0 0 0 ... 0 4 0 0 0 0 0 0 0 0
Y 0 0 13 17 24 27 44 47 41 31 ... 29 76 139 179 191 208 92 0 2 45
I like to use ggplot to make heat maps, and the heat map here would simply be this data frame. However, the dataframes that ggplot needs are shaped a little differently. I can use the pandas.melt function to get close, but I'm missing the row titles.
>>>mdf = pandas.melt(df)
>>>mdf
variable value
0 93 465
1 93 0
2 93 1
3 93 4
4 93 1
5 93 2
6 93 0
7 93 1
8 93 0
...
624 103 5
625 103 3
626 103 209
627 103 0
628 103 0
629 103 45
The easiest way to get the dataframe I need would be to add the amino acid (the row label) as a column, so that the DataFrame looks like:
variable value rowvalue
0 93 465 A
1 93 0 C
2 93 1 D
3 93 4 E
4 93 1 F
5 93 2 G
6 93 0 H
7 93 1 I
8 93 0 K
That way I can take that dataframe and put it right into ggplot:
>>> from ggplot import *
>>> ggplot(new_df,aes("variable","rowvalue")) + geom_tile(fill="value")
would produce a beautiful heatmap. How do I manipulate the nested dictionary dataframe in order to get the dataframe at the end? If there is a more efficient way to do this, I'm open to suggestions, but I still want to use ggplot2.
Edit -
I found a solution but it seems to be way too convoluted. Basically I make the index into a column, then melt the data frame.
>>>df.reset_index(level=0,inplace=True)
>>>pandas.melt(df,id_vars=['index'])
index variable value
0 A 93 465
1 C 93 0
2 D 93 1
3 E 93 4
4 F 93 1
5 G 93 2
6 H 93 0
7 I 93 1
8 K 93 0
9 L 93 3
10 M 93 0
11 N 93 0
12 P 93 1
13 Q 93 0
14 R 93 0
15 S 93 14
16 T 93 4
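If I understand your question properly, I think you can simply melt the frame and then add the row labels. Because melt stacks the columns one after another, the index has to be repeated once per column so that its length matches mdf:
mdf = pandas.melt(df)
mdf['rowvalue'] = list(df.index) * len(df.columns)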
mdf
variable value rowvalue
0 93 465 A
1 93 0 C
2 93 1 D
3 93 4 E
4 93 1 F
5 93 2 G
6 93 0 H
7 93 1 I
8 93 0 K
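Another way to carry the row labels through the melt is to name the index and turn it into a column first. This is a sketch on a tiny made-up nested dictionary; the names nested and long_df are assumptions, and the column names are chosen to match the target layout from the question.

```python
import pandas as pd

# Tiny stand-in for my_nested_dict: outer keys become columns, inner keys rows.
nested = {
    "93": {"A": 465, "C": 0, "D": 1},
    "94": {"A": 5, "C": 1, "D": 0},
}
df = pd.DataFrame.from_dict(nested)

# Name the index, move it into a column, then melt around that column.
long_df = (
    df.rename_axis("rowvalue")
      .reset_index()
      .melt(id_vars="rowvalue", var_name="variable", value_name="value")
)
print(long_df)
```

This leaves df itself untouched, since reset_index is not called with inplace=True.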