Reorganizing tabular data in python

I have this data shown below:
ID     Sample    X   Y
SP001  Sample01  78  22
SP002  Sample01  65  35
SP003  Sample01  93  07
SP001  Sample02  79  21
SP002  Sample02  65  35
SP003  Sample02  94  06
and I would like to change it into a table with one entry for each sample:
Sample    SP001_X  SP002_X  SP003_X
Sample01  78       65       93
Sample02  79       65       94
How do I proceed with this?
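One possible approach, sketched here under the assumption that pandas is acceptable (the question does not name a library) and that the data has been loaded into a DataFrame. The names `df` and `wide` are chosen for illustration only. The idea is to pivot the long table so that each `ID` becomes a column of `X` values, then rename the columns.

```python
import pandas as pd

# Hypothetical DataFrame holding the data from the question.
df = pd.DataFrame({
    "ID": ["SP001", "SP002", "SP003", "SP001", "SP002", "SP003"],
    "Sample": ["Sample01"] * 3 + ["Sample02"] * 3,
    "X": [78, 65, 93, 79, 65, 94],
    "Y": [22, 35, 7, 21, 35, 6],
})

# One row per Sample, one column per ID, filled with the X values.
wide = df.pivot(index="Sample", columns="ID", values="X")

# Rename SP001 -> SP001_X and so on, then turn Sample back into a column.
wide.columns = [f"{col}_X" for col in wide.columns]
wide = wide.reset_index()

print(wide)
```

`pivot` raises an error if a (Sample, ID) pair occurs more than once; in that case `pivot_table` with an aggregation function may be a better fit.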

Related

Greatest product of four adjacent numbers in a grid (#11 Euler)

There are already many questions on Stack Overflow regarding problem 11 of Project Euler. However, I would like to figure out what the mistake in my code is.
Here is the Python code:
a=[['08 02 22 97 38 15 00 40 00 75 04 05 07 78 52 12 50 77 91 08'],
['49 49 99 40 17 81 18 57 60 87 17 40 98 43 69 48 04 56 62 00'],
['81 49 31 73 55 79 14 29 93 71 40 67 53 88 30 03 49 13 36 65'],
['52 70 95 23 04 60 11 42 69 24 68 56 01 32 56 71 37 02 36 91'],
['22 31 16 71 51 67 63 89 41 92 36 54 22 40 40 28 66 33 13 80'],
['24 47 32 60 99 03 45 02 44 75 33 53 78 36 84 20 35 17 12 50'],
['32 98 81 28 64 23 67 10 26 38 40 67 59 54 70 66 18 38 64 70'],
['67 26 20 68 02 62 12 20 95 63 94 39 63 08 40 91 66 49 94 21'],
['24 55 58 05 66 73 99 26 97 17 78 78 96 83 14 88 34 89 63 72'],
['21 36 23 09 75 00 76 44 20 45 35 14 00 61 33 97 34 31 33 95'],
['78 17 53 28 22 75 31 67 15 94 03 80 04 62 16 14 09 53 56 92'],
['16 39 05 42 96 35 31 47 55 58 88 24 00 17 54 24 36 29 85 57'],
['86 56 00 48 35 71 89 07 05 44 44 37 44 60 21 58 51 54 17 58'],
['19 80 81 68 05 94 47 69 28 73 92 13 86 52 17 77 04 89 55 40'],
['04 52 08 83 97 35 99 16 07 97 57 32 16 26 26 79 33 27 98 66'],
['88 36 68 87 57 62 20 72 03 46 33 67 46 55 12 32 63 93 53 69'],
['04 42 16 73 38 25 39 11 24 94 72 18 08 46 29 32 40 62 76 36'],
['20 69 36 41 72 30 23 88 34 62 99 69 82 67 59 85 74 04 36 16'],
['20 73 35 29 78 31 90 01 74 31 49 71 48 86 81 16 23 57 05 54'],
['01 70 54 71 83 51 54 69 16 92 33 48 61 43 52 01 89 19 67 48']]
b=[]
for i in range(len(a)):
    b.append(a[i][0].split(' '))
Sum=1
currentSum=1
#Loop for checking horizontally adjacent sum
for x in range(20):
    for y in range(16):
        for z in range(4):
            currentSum*=int(b[x][y+z])
        if currentSum>Sum:
            Sum=currentSum
        currentSum=1
#Loop for checking vertically adjacent sum
for x in range(0, 16):
    for y in range(0, 20):
        for z in range(0, 4):
            currentSum*=int(b[x+z][y])
        if currentSum>Sum:
            Sum=currentSum
        currentSum=1
#Loop for checking diagonally adjacent sum (\)
for x in range(0, 16):
    for y in range(0, 16):
        for z in range(4):
            currentSum*=int(b[x+z][y+z])
        if currentSum>Sum:
            Sum=currentSum
        currentSum=1
#Loop for checking diagonally adjacent sum(/)
for x in range(3, 20):
    for y in range(3, 20):
        for z in range(4):
            currentSum*=int(b[x-z][y-z])
        if currentSum>Sum:
            Sum=currentSum
        currentSum=1
print(Sum)
My approach to the question:
I had manually made the grid provided into a 20 x 1 2D list. However since I need a 20 x 20 2D list, I created a new one wherein the inner list items are the results of split function upon the first list items.
I have four nested loops, each for checking the products in one of the directions mentioned
The final answer yielded is 51267216, which according to Project Euler is wrong.
I would like to know where I am going wrong, and some guidance in this direction without revealing the answer itself.
Another thing to note is I am trying to get this solved without any additional libraries like numpy.

I've finally figured out the mistake: the problem was caused by an incorrect indexing formula in the last for loop.
The correct code is:
#Loop for checking diagonally adjacent sum (/)
for x in range(0, 16):
    for y in range(3, 20):
        for z in range(4):
            currentSum*=int(b[x+z][y-z])  # originally: b[x-z][y-z]
        if currentSum>Sum:
            Sum=currentSum
        currentSum=1
The index is b[x+z][y-z] instead of b[x-z][y-z].
The original post also included pictorial representations of all four loops (images not available).
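The same scan can also be written with direction vectors instead of four separate loops. The following is a minimal sketch, not the poster's code: `DIRECTIONS` and `largest_product` are names introduced here, and the starting value of 0 assumes the grid holds non-negative integers. The explicit bounds check covers every window, including those that touch the last row or column, which hard-coded bounds such as `range(16)` on a 20 x 20 grid stop one start position short of.

```python
from math import prod  # Python 3.8+

# The four directions to scan, as (row step, column step):
# right, down, diagonal (\), anti-diagonal (/).
DIRECTIONS = [(0, 1), (1, 0), (1, 1), (1, -1)]


def largest_product(grid, length=4):
    """Return the largest product of `length` adjacent cells in any direction."""
    rows, cols = len(grid), len(grid[0])
    best = 0  # assumes non-negative entries
    for x in range(rows):
        for y in range(cols):
            for dx, dy in DIRECTIONS:
                # Skip windows whose last cell would fall outside the grid.
                end_x = x + (length - 1) * dx
                end_y = y + (length - 1) * dy
                if not (0 <= end_x < rows and 0 <= end_y < cols):
                    continue
                window = prod(grid[x + z * dx][y + z * dy] for z in range(length))
                best = max(best, window)
    return best


# Small hypothetical grid whose largest product lies on the (/) diagonal.
demo = [
    [1, 1, 1, 2],
    [1, 1, 3, 1],
    [1, 4, 1, 1],
    [5, 1, 1, 1],
]
print(largest_product(demo))  # 2 * 3 * 4 * 5 = 120
```

For the anti-diagonal, the row index grows while the column index shrinks, which is the `b[x+z][y-z]` pattern from the fix above.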

Is there a better way to keep few columns by dropping several columns from a DataFrame than to use .drop()?

I'm curious whether there is a better way to keep the columns I need in a DataFrame when the columns to keep are few and the columns to remove are many.
import numpy as np
import pandas as pd
df1 = pd.DataFrame(np.random.randint(10,99, size=(13, 26)), columns =list('abcdefghijklmnopqrstuvwxyz'))
df1
Output:
a b c d e f g h i j ... q r s t u v w x y z
0 78 60 27 38 21 93 74 47 16 53 ... 79 56 40 41 87 80 14 82 12 50
1 84 73 59 46 91 43 22 28 57 52 ... 27 65 81 72 68 90 68 61 22 44
2 56 37 29 52 57 14 87 82 46 90 ... 67 57 29 14 55 30 46 72 56 91
3 86 44 46 79 41 74 32 49 42 32 ... 33 34 40 17 30 78 29 75 80 52
4 14 89 90 79 67 17 34 39 57 37 ... 93 49 78 91 26 73 40 48 91 36
5 16 62 32 87 56 81 82 17 59 57 ... 84 24 97 39 46 40 68 53 73 40
6 69 72 16 47 37 20 27 56 13 37 ... 10 28 17 35 39 14 51 85 69 53
7 81 34 35 20 66 44 86 23 94 57 ... 38 45 76 53 82 72 64 34 81 43
8 95 90 97 31 18 85 74 18 43 22 ... 20 20 96 25 53 76 55 96 58 98
9 73 53 72 94 55 33 22 40 11 64 ... 84 66 85 34 94 32 78 72 10 62
10 73 24 57 17 63 24 94 25 59 84 ... 34 45 27 28 47 23 38 80 45 41
11 69 18 22 42 95 38 16 47 68 36 ... 59 69 35 39 78 75 85 86 53 55
12 46 27 53 77 48 15 57 90 32 57 ... 32 79 18 67 71 86 54 11 36 51
13 rows × 26 columns
Say I only have to keep a few arbitrary columns, e.g. e, u, r, q, j. Is there a better way to keep them without having to run df1.drop() with 21 column names passed in? I could not find a better way in any of the existing questions.
Edit:
Different from the solution in
Selecting multiple columns in a pandas dataframe
since the columns to choose to drop are random and not sequential

You can copy all the columns you want to keep into a new dataframe and then overwrite your first dataframe like so:
import numpy as np
import pandas as pd
df1 = pd.DataFrame(np.random.randint(10,99, size=(13, 26)), columns =list('abcdefghijklmnopqrstuvwxyz'))
df2 = pd.DataFrame()
columns_to_keep = ["e", "r", "u"]
for column in columns_to_keep:
    df2[column] = df1[column]
df1 = df2
df1
Alternatively, use a for loop to drop every column that is not in the list:
columns_to_keep = ["e", "r", "u"]
# DataFrame.iteritems() was removed in pandas 2.0; items() is the replacement.
for column_name, column_data in df1.items():
    if column_name not in columns_to_keep:
        df1 = df1.drop(column_name, axis=1)
df1

Let's just use column filtering and reassign back to df1:
df1 = pd.DataFrame(np.random.randint(10,99, size=(13, 26)), columns =list('abcdefghijklmnopqrstuvwxyz'))
columns_to_keep = ["e", "r", "u"]
df1 = df1[columns_to_keep]
df1.head()
Output:
e r u
0 65 95 13
1 58 42 75
2 95 34 12
3 43 20 79
4 83 27 47
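For comparison, here is a short sketch of a few equivalent ways to keep only a handful of columns, using the five columns named in the question. The variable names `by_index`, `by_filter`, and `by_drop` are introduced here for illustration. The first two keep the columns in the order of the list; the `drop` variant keeps the original column order.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(
    np.random.randint(10, 99, size=(13, 26)),
    columns=list("abcdefghijklmnopqrstuvwxyz"),
)
columns_to_keep = ["e", "u", "r", "q", "j"]

# 1. Plain column indexing with a list of names.
by_index = df1[columns_to_keep]

# 2. DataFrame.filter, which silently skips names that do not exist.
by_filter = df1.filter(items=columns_to_keep)

# 3. Drop everything that is NOT in the list, without typing 21 names.
by_drop = df1.drop(columns=df1.columns.difference(columns_to_keep))

print(list(by_index.columns))
print(list(by_filter.columns))
print(list(by_drop.columns))
```

Indexing with a list raises a `KeyError` if a name is missing, which is often the safer behavior when the column list is typed by hand.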

Object Similarity Pandas and Scikit Learn

Is there a way to find and rank rows in a Pandas DataFrame by their similarity to a row from another DataFrame?

My understanding of your question: you have two data frames, hopefully with the same number of columns. You want to rate the members of the first data frame (the subject data frame) by how close, i.e. similar, they are to any member of the target data frame.
I am not aware of a built-in method.
It is probably not the most efficient way, but here is how I'd approach it:
#! /usr/bin/python3
import pandas as pd
import numpy as np
import pprint
pp = pprint.PrettyPrinter(indent=4)
# Simulate data
df_subject = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD')) # This is the one we're iterating to check similarity to target.
df_target = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD')) # This is the one we're checking distance to
# This will hold the min distances.
distances=[]
# Loop to iterate over subject DF
for ix1,subject in df_subject.iterrows():
    distances_cur=[]
    # Loop to iterate over target DF
    for ix2,target in df_target.iterrows():
        distances_cur.append(np.linalg.norm(target-subject))
    # Get the minimum distance for the subject set member.
    distances.append(min(distances_cur))
# Distances to df
distances=pd.DataFrame(distances)
# Normalize.
distances=0.5-(distances-distances.mean(axis=0))/distances.max(axis=0)
# Column index joining, ordering and beautification.
Proximity_Ratings_name='Proximity Ratings'
distances=distances.rename(columns={0: Proximity_Ratings_name})
df_subject=df_subject.join(distances)
pp.pprint(df_subject.sort_values(Proximity_Ratings_name,ascending=False))
It should yield something like the table below. A higher rating means there's a similar member in the target data frame:
A B C D Proximity Ratings
55 86 21 91 78 0.941537
38 91 31 35 95 0.901638
43 49 89 49 6 0.878030
98 28 98 98 36 0.813685
77 67 23 78 84 0.809324
35 52 16 36 58 0.802223
54 2 25 61 44 0.788591
95 76 3 60 46 0.766896
5 55 39 88 37 0.756049
52 79 71 90 70 0.752520
66 52 27 82 82 0.751353
41 45 67 55 33 0.739919
76 12 93 50 62 0.720323
94 99 84 39 63 0.716123
26 62 6 97 60 0.715081
40 64 50 37 27 0.714042
68 70 21 8 82 0.698824
47 90 54 60 65 0.676680
7 85 95 45 71 0.672036
2 14 68 50 6 0.661113
34 62 63 83 29 0.659322
8 87 90 28 74 0.647873
75 14 61 27 68 0.633370
60 9 91 42 40 0.630030
4 46 46 52 35 0.621792
81 94 19 82 44 0.614510
73 67 27 34 92 0.608137
30 92 64 93 51 0.608137
11 52 25 93 50 0.605770
51 17 48 57 52 0.604984
.. .. .. .. .. ...
64 28 56 0 9 0.397054
18 52 84 36 79 0.396518
99 41 5 32 34 0.388519
27 19 54 43 94 0.382714
92 69 56 73 93 0.382714
59 1 29 46 16 0.374878
58 2 36 8 96 0.362525
69 58 92 16 48 0.361505
31 27 57 80 35 0.349887
10 59 23 47 24 0.345891
96 41 77 76 33 0.345891
78 42 71 87 65 0.344398
93 12 31 6 27 0.329152
23 6 5 10 42 0.320445
14 44 6 43 29 0.319964
6 81 51 44 15 0.311840
3 17 60 13 22 0.293066
70 28 40 22 82 0.251549
36 95 72 35 5 0.249354
49 78 10 30 18 0.242370
17 79 69 57 96 0.225168
46 42 95 86 81 0.224742
84 58 81 59 86 0.221346
9 9 62 8 30 0.211659
72 11 51 74 8 0.159265
90 74 26 80 1 0.138993
20 90 4 6 5 0.117652
50 3 12 5 53 0.077088
42 90 76 42 1 0.075284
45 94 46 88 14 0.054244
Hope I understood correctly. Don't use this if performance matters; I'm sure there's an algebraic way to approach this (multiplying matrices) that would run much faster.
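As a rough illustration of the faster route hinted at above, the nested `iterrows` loops can be replaced by NumPy broadcasting. This is a sketch of a swapped-in technique, not the answer author's code. It assumes both frames have the same numeric columns, and the names `pairwise`, `min_dist`, and `ranked` are introduced here. It computes only the minimum Euclidean distance per subject row and leaves out the normalization step.

```python
import numpy as np
import pandas as pd

# Simulated data, seeded so the run is repeatable.
rng = np.random.default_rng(0)
df_subject = pd.DataFrame(rng.integers(0, 100, size=(100, 4)), columns=list("ABCD"))
df_target = pd.DataFrame(rng.integers(0, 100, size=(100, 4)), columns=list("ABCD"))

subject = df_subject.to_numpy(dtype=float)
target = df_target.to_numpy(dtype=float)

# Broadcasting builds a (n_subject, n_target, n_columns) array of differences.
diff = subject[:, None, :] - target[None, :, :]

# Euclidean distance between every subject row and every target row.
pairwise = np.linalg.norm(diff, axis=2)

# Distance from each subject row to its nearest target row.
min_dist = pairwise.min(axis=1)

# Smallest distance first, i.e. most similar subject rows at the top.
ranked = df_subject.assign(min_distance=min_dist).sort_values("min_distance")
print(ranked.head())
```

The intermediate array grows with the product of the two row counts, so for very large frames a chunked or tree-based nearest-neighbour search may be more appropriate.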

Project Euler 11 [closed]

I'm having some trouble with my Project Euler code in Python -- When I run through the code in my head everything seems to check out, but I'm still getting the wrong answer. I'm really new to Python, so it could be any number of things. Any suggestions? Thanks in advance!
nums = '\
08 02 22 97 38 15 00 40 00 75 04 05 07 78 52 12 50 77 91 08\n\
49 49 99 40 17 81 18 57 60 87 17 40 98 43 69 48 04 56 62 00\n\
81 49 31 73 55 79 14 29 93 71 40 67 53 88 30 03 49 13 36 65\n\
52 70 95 23 04 60 11 42 69 24 68 56 01 32 56 71 37 02 36 91\n\
22 31 16 71 51 67 63 89 41 92 36 54 22 40 40 28 66 33 13 80\n\
24 47 32 60 99 03 45 02 44 75 33 53 78 36 84 20 35 17 12 50\n\
32 98 81 28 64 23 67 10 26 38 40 67 59 54 70 66 18 38 64 70\n\
67 26 20 68 02 62 12 20 95 63 94 39 63 08 40 91 66 49 94 21\n\
24 55 58 05 66 73 99 26 97 17 78 78 96 83 14 88 34 89 63 72\n\
21 36 23 09 75 00 76 44 20 45 35 14 00 61 33 97 34 31 33 95\n\
78 17 53 28 22 75 31 67 15 94 03 80 04 62 16 14 09 53 56 92\n\
16 39 05 42 96 35 31 47 55 58 88 24 00 17 54 24 36 29 85 57\n\
86 56 00 48 35 71 89 07 05 44 44 37 44 60 21 58 51 54 17 58\n\
19 80 81 68 05 94 47 69 28 73 92 13 86 52 17 77 04 89 55 40\n\
04 52 08 83 97 35 99 16 07 97 57 32 16 26 26 79 33 27 98 66\n\
88 36 68 87 57 62 20 72 03 46 33 67 46 55 12 32 63 93 53 69\n\
04 42 16 73 38 25 39 11 24 94 72 18 08 46 29 32 40 62 76 36\n\
20 69 36 41 72 30 23 88 34 62 99 69 82 67 59 85 74 04 36 16\n\
20 73 35 29 78 31 90 01 74 31 49 71 48 86 81 16 23 57 05 54\n\
01 70 54 71 83 51 54 69 16 92 33 48 61 43 52 01 89 19 67 48'
grid = []
diag = []
for line in nums.split('\n'):
    grid.append(map(int, line.split(' ')))
i=0
j=0
while i<17:
    l = grid[i][j]*grid[i+1][j+1]*grid[i+2][j+2]*grid[i+3][j+3]
    diag.append(l)
    i+=1
    if i==17:
        j+=1
        i=0
        l = grid[i][j]*grid[i+1][j+1]*grid[i+2][j+2]*grid[i+3][j+3]
        diag.append(l)
    if j==16:
        break
print max(diag)

My comments are more along the lines of code review, but culminating in a full solution:
You can get rid of those ugly endline escapes with the textwrap module (implicit string concatenation would work too, but it would mean more repetitious typing and clutter):
import textwrap
nums = textwrap.dedent('''\
08 02 22 97 38 15 00 40 00 75 04 05 07 78 52 12 50 77 91 08
49 49 99 40 17 81 18 57 60 87 17 40 98 43 69 48 04 56 62 00
81 49 31 73 55 79 14 29 93 71 40 67 53 88 30 03 49 13 36 65
52 70 95 23 04 60 11 42 69 24 68 56 01 32 56 71 37 02 36 91
22 31 16 71 51 67 63 89 41 92 36 54 22 40 40 28 66 33 13 80
24 47 32 60 99 03 45 02 44 75 33 53 78 36 84 20 35 17 12 50
32 98 81 28 64 23 67 10 26 38 40 67 59 54 70 66 18 38 64 70
67 26 20 68 02 62 12 20 95 63 94 39 63 08 40 91 66 49 94 21
24 55 58 05 66 73 99 26 97 17 78 78 96 83 14 88 34 89 63 72
21 36 23 09 75 00 76 44 20 45 35 14 00 61 33 97 34 31 33 95
78 17 53 28 22 75 31 67 15 94 03 80 04 62 16 14 09 53 56 92
16 39 05 42 96 35 31 47 55 58 88 24 00 17 54 24 36 29 85 57
86 56 00 48 35 71 89 07 05 44 44 37 44 60 21 58 51 54 17 58
19 80 81 68 05 94 47 69 28 73 92 13 86 52 17 77 04 89 55 40
04 52 08 83 97 35 99 16 07 97 57 32 16 26 26 79 33 27 98 66
88 36 68 87 57 62 20 72 03 46 33 67 46 55 12 32 63 93 53 69
04 42 16 73 38 25 39 11 24 94 72 18 08 46 29 32 40 62 76 36
20 69 36 41 72 30 23 88 34 62 99 69 82 67 59 85 74 04 36 16
20 73 35 29 78 31 90 01 74 31 49 71 48 86 81 16 23 57 05 54
01 70 54 71 83 51 54 69 16 92 33 48 61 43 52 01 89 19 67 48''')
grid = []
Use the built-in string method splitlines. A list comprehension is also somewhat more readable than map (and does not need to be wrapped in a list call to be forward compatible with Python 3):
for line in nums.splitlines():
    grid.append([int(i) for i in line.split(' ')])
Now that we have our data, we can begin our search algorithm. Since the horizontals are already together in rows, we can easily search by row. Because zip stops at the shortest iterable, we can safely zip up slices of the same row that start at increasing offsets from the beginning without getting index errors:
def max_horizontal(grid):
    return max(w * x * y * z
               for r in grid
               for w, x, y, z in zip(r, r[1:], r[2:], r[3:]))
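To see what the zip-of-slices idiom produces, here is a tiny sketch on a made-up six-element row (`row` and `windows` are illustrative names):

```python
from math import prod  # Python 3.8+

row = [1, 2, 3, 4, 5, 6]

# Each slice starts one position later; zip stops at the shortest slice,
# so only complete windows of four are produced.
windows = list(zip(row, row[1:], row[2:], row[3:]))
print(windows)  # [(1, 2, 3, 4), (2, 3, 4, 5), (3, 4, 5, 6)]

print(max(prod(w) for w in windows))  # 3 * 4 * 5 * 6 = 360
```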
Vertical is not much more tricky, but we want to transpose it like it was a matrix, and then we can use the same code around it. Expanding an iterable of iterables into zip is the same as transposing such a matrix that we can iterate over:
def max_vertical(grid):
    return max(w * x * y * z
               for c in zip(*grid)
               for w, x, y, z in zip(c, c[1:], c[2:], c[3:]))
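A quick sketch of the transposition trick on a made-up 2 x 3 grid (the names `small` and `columns` are illustrative):

```python
small = [
    [1, 2, 3],
    [4, 5, 6],
]

# Unpacking the rows into zip pairs up the first elements, then the
# second elements, and so on: each tuple is one column of the grid.
columns = list(zip(*small))
print(columns)  # [(1, 4), (2, 5), (3, 6)]
```

The columns come back as tuples rather than lists, which still support the slicing used in the function above.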
Diagonals are a bit more difficult, but if we get one definition right, we just reverse it. Here, we need to go row by row over a window across the matrix, so we treat the matrix like we treated the row with zip. So we step over the matrix one row at a time, looking at 4 rows each time. Next, using the slash semantic to indicate the diagonal running from bottom left to top right, in our first row, we start at the fourth element (keeping in mind, Python starts indexing at 0, so that corresponds to the [3:] slice notation below), second row, 3rd element, third row, second element, 4th row, the first element. Again, since zip stops at the end of the shortest iterable, our window doesn't run out of range of the matrix:
def max_slashdiag(g=grid):
    return max(w * x * y * z
               for r1, r2, r3, r4 in zip(g, g[1:], g[2:], g[3:])
               for w, x, y, z in zip(r1[3:], r2[2:], r3[1:], r4))
To get the other diagonal, just reverse the corresponding row starting points:
def max_backdiag(g=grid):
    return max(w * x * y * z
               for r1, r2, r3, r4 in zip(g, g[1:], g[2:], g[3:])
               for w, x, y, z in zip(r1, r2[1:], r3[2:], r4[3:]))
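The staggered slices are easier to follow on a small example. Here is a sketch with a made-up 4 x 5 grid of consecutive numbers, showing which cells each diagonal window picks up (`slash` and `back` are illustrative names):

```python
rows = [
    [1, 2, 3, 4, 5],
    [6, 7, 8, 9, 10],
    [11, 12, 13, 14, 15],
    [16, 17, 18, 19, 20],
]
r1, r2, r3, r4 = rows

# "/" diagonals: start at the 4th element of the top row and move one
# column to the left on each following row.
slash = list(zip(r1[3:], r2[2:], r3[1:], r4))
print(slash)  # [(4, 8, 12, 16), (5, 9, 13, 17)]

# "\" diagonals: start at the 1st element of the top row and move one
# column to the right on each following row.
back = list(zip(r1, r2[1:], r3[2:], r4[3:]))
print(back)  # [(1, 7, 13, 19), (2, 8, 14, 20)]
```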
And we take the maximum of all of these functions:
max(max_horizontal(grid),
    max_vertical(grid),
    max_slashdiag(g=grid),
    max_backdiag(g=grid))
which returns 70600674

Create Grids in Python

Question : I want to create a grid like this.
08 06 78 56 96
45 63 68 23 51
63 78 45 08 37
56 73 92 73 83
43 22 67 98 55
Once I have created this grid, I want to find the product of four adjacent numbers in the same direction (up, down, left, right, or diagonally).
How can I do this?
I searched a lot and found an answer that suggested using an N-dimensional array, but I don't know how to do that.

When you put a backslash at the end of a line inside a string literal, the string continues on the next line. So let's first think about the simplest way we can stuff that grid of numbers into our code: a simple string with lines separated by newline characters!
Try doing something like this:
nums = '\
08 06 78 56 96\n\
45 63 68 23 51\n\
63 78 45 08 37\n\
56 73 92 73 83\n\
43 22 67 98 55'
grid = []
for line in nums.split('\n'):
    # list() is needed on Python 3, where map returns an iterator
    grid.append(list(map(int, line.split(' '))))
And then you can print your grid with:
>>> print(grid)
[[8, 6, 78, 56, 96], [45, 63, 68, 23, 51], [63, 78, 45, 8, 37], [56, 73, 92, 73,
83], [43, 22, 67, 98, 55]]
The grid will be a two-dimensional list which can be queried by first specifying the row, then the column. For example if you wish to get an element from 4th row and the 2nd column, you could do:
>>> print(grid[3][1]) # it's 3, 1 because lists are 0-indexed
73
And that would correctly give you 73 because if you look at your original grid:
08 06 78 56 96
45 63 68 23 51
63 78 45 08 37
56 73 92 73 83
43 22 67 98 55
73 is on row 4 and column 2. Good luck with Project Euler.
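Once the grid is a list of lists, a product of four adjacent numbers is just four index lookups along a fixed step. Here is a minimal sketch for Python 3. The helper `line_product` and its parameters are names invented here, not part of the answer. It walks from a start cell in a given direction and returns `None` when the line would leave the grid.

```python
from math import prod  # Python 3.8+

nums = """\
08 06 78 56 96
45 63 68 23 51
63 78 45 08 37
56 73 92 73 83
43 22 67 98 55"""

# Parse into a list of lists of ints (int("08") is 8).
grid = [[int(n) for n in line.split()] for line in nums.splitlines()]


def line_product(grid, row, col, d_row, d_col, length=4):
    """Product of `length` cells starting at (row, col), stepping by (d_row, d_col)."""
    cells = [(row + k * d_row, col + k * d_col) for k in range(length)]
    # Reject lines that leave the grid (this also avoids negative-index wraparound).
    if any(not (0 <= r < len(grid) and 0 <= c < len(grid[0])) for r, c in cells):
        return None
    return prod(grid[r][c] for r, c in cells)


print(line_product(grid, 3, 1, 0, 1))  # right from row 4, column 2: 73 * 92 * 73 * 83 = 40692244
print(line_product(grid, 1, 0, 1, 1))  # down-right diagonal: 45 * 78 * 92 * 98 = 31646160
print(line_product(grid, 3, 3, 0, 1))  # runs off the right edge: None
```

Calling this helper for every start cell and each of the four step directions, and keeping the largest result, gives the answer to the adjacency question.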
Edit:
You might lack a nice source code editor so here is the Problem 11 grid from Project Euler nicely laid out in string format for you to parse and do what you please with it. :)
nums = '\
08 02 22 97 38 15 00 40 00 75 04 05 07 78 52 12 50 77 91 08\n\
49 49 99 40 17 81 18 57 60 87 17 40 98 43 69 48 04 56 62 00\n\
81 49 31 73 55 79 14 29 93 71 40 67 53 88 30 03 49 13 36 65\n\
52 70 95 23 04 60 11 42 69 24 68 56 01 32 56 71 37 02 36 91\n\
22 31 16 71 51 67 63 89 41 92 36 54 22 40 40 28 66 33 13 80\n\
24 47 32 60 99 03 45 02 44 75 33 53 78 36 84 20 35 17 12 50\n\
32 98 81 28 64 23 67 10 26 38 40 67 59 54 70 66 18 38 64 70\n\
67 26 20 68 02 62 12 20 95 63 94 39 63 08 40 91 66 49 94 21\n\
24 55 58 05 66 73 99 26 97 17 78 78 96 83 14 88 34 89 63 72\n\
21 36 23 09 75 00 76 44 20 45 35 14 00 61 33 97 34 31 33 95\n\
78 17 53 28 22 75 31 67 15 94 03 80 04 62 16 14 09 53 56 92\n\
16 39 05 42 96 35 31 47 55 58 88 24 00 17 54 24 36 29 85 57\n\
86 56 00 48 35 71 89 07 05 44 44 37 44 60 21 58 51 54 17 58\n\
19 80 81 68 05 94 47 69 28 73 92 13 86 52 17 77 04 89 55 40\n\
04 52 08 83 97 35 99 16 07 97 57 32 16 26 26 79 33 27 98 66\n\
88 36 68 87 57 62 20 72 03 46 33 67 46 55 12 32 63 93 53 69\n\
04 42 16 73 38 25 39 11 24 94 72 18 08 46 29 32 40 62 76 36\n\
20 69 36 41 72 30 23 88 34 62 99 69 82 67 59 85 74 04 36 16\n\
20 73 35 29 78 31 90 01 74 31 49 71 48 86 81 16 23 57 05 54\n\
01 70 54 71 83 51 54 69 16 92 33 48 61 43 52 01 89 19 67 48'
