How to remove a specific repeated line in pandas dataframe? - python

In this pandas dataframe:
df =
pos index data
21 36 a,b,c
21 36 a,b,c
23 36 c,d,e
25 36 f,g,h
27 36 g,h,k
29 39 a,b,c
29 39 a,b,c
31 39 .
35 39 c,k
36 41 g,h
38 41 k,l
39 41 j,k
39 41 j,k
I want to remove repeated lines only when they are within the same index group and occur in the head region of that group's subframe.
So, I did:
df_grouped = df.groupby(['index'], as_index=True)
and now:
for i, sub_frame in df_grouped:
    sub_frame.apply(lambda g: ... remove one duplicate line in the head region if pos value is a repeat)
I want to apply this method because some pos values are repeated in the tail region, and those should not be removed.
Any suggestions?
Expected output:
pos index data
removed
21 36 a,b,c
23 36 c,d,e
25 36 f,g,h
27 36 g,h,k
removed
29 39 a,b,c
31 39 .
35 39 c,k
36 41 g,h
38 41 k,l
39 41 j,k
39 41 j,k

If it doesn't have to be done in a single apply statement, then this code will only remove duplicates in the head region:
import pandas as pd
data = {'pos': [21, 21, 23, 25, 27, 29, 29, 31, 35, 36, 38, 39, 39],
        'idx': [36, 36, 36, 36, 36, 39, 39, 39, 39, 41, 41, 41, 41],
        'data': ['a,b,c', 'a,b,c', 'c,d,e', 'f,g,h', 'g,h,k', 'a,b,c', 'a,b,c', '.', 'c,k', 'g,h', 'k,l', 'j,k', 'j,k']
        }
df = pd.DataFrame(data)
accum = []
for i, sub_frame in df.groupby('idx'):
    # de-duplicate only the first two rows (the head region) of each group
    accum.append(pd.concat([sub_frame.iloc[:2].drop_duplicates(), sub_frame.iloc[2:]]))
df2 = pd.concat(accum)
print(df2)
EDIT2: The first version of the chained command that I posted was wrong and only worked for the sample data. This version provides a more general solution to remove duplicate rows per the OP's request:
df.drop(df.groupby('idx')            # group by the index column
          .head(2)                   # select the first two rows
          .duplicated()              # create a Series with True for duplicate rows
          .to_frame(name='duped')    # make the Series a dataframe
          .query('duped')            # select only the duplicate rows
          .index)                    # provide index of duplicated rows to drop
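The same idea can also be written as a boolean mask. This is only a sketch, assuming (as in the answers above) that the "head region" means the first two rows of each group; HEAD_SIZE is a name introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'pos': [21, 21, 23, 25, 27, 29, 29, 31, 35, 36, 38, 39, 39],
    'idx': [36, 36, 36, 36, 36, 39, 39, 39, 39, 41, 41, 41, 41],
    'data': ['a,b,c', 'a,b,c', 'c,d,e', 'f,g,h', 'g,h,k', 'a,b,c', 'a,b,c',
             '.', 'c,k', 'g,h', 'k,l', 'j,k', 'j,k'],
})

HEAD_SIZE = 2  # assumption: the head region is the first two rows per group

# Position of each row inside its idx group: 0, 1, 2, ...
position_in_group = df.groupby('idx').cumcount()
# True for a row that repeats an earlier row (all columns equal)
is_repeat = df.duplicated()

# Drop a row only if it is a repeat AND sits in the head region of its group
result = df[~(is_repeat & (position_in_group < HEAD_SIZE))]
print(result)
```

The repeated row at the end of group 41 survives because its position in the group is 3, which is outside the head region.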

Related

How do I extract columns names from rows and promote them to headers?

I'm reading a csv and the data is a little bit messy. Here's the code:
import pandas as pd
ocorrencias = pd.read_csv('data.csv', encoding="1252", header=None)
ocorrencias = ocorrencias.drop([0, 1, 2, 4, 10, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36], axis=1)
Output:
And I want to remove the column names from the rows and promote them to headers, so the dataframe will look like this:
Can anyone help me?
You can use split(': ') to keep only the part after the colon in each cell:
df = df.apply(lambda x: x.str.split(': ', n=1).str[1])
You can also use split(': ') to get the column names from any row (e.g. from the first row, .iloc[0]):
df.columns = df.iloc[0].str.split(': ', n=1).str[0]
Minimal working code
The headers have to be extracted first, before the names are removed from the cells.
I used random to generate the values; with random.seed(0) you should get the same values as in my result.
I pass n=1 to split(': ', n=1) so that it splits only on the first ': ', because text values may contain more colons. Recent pandas versions accept n only as a keyword argument.
import pandas as pd
import random
random.seed(0) # to get the same random values in every test
df = pd.DataFrame([f'{col}: {random.randint(0,100)}'
                   for col in ['hello', 'world', 'of', 'python']]
                  for row in range(3))
print(df)
df.columns = df.iloc[0].str.split(': ', n=1).str[0]
print(df)
df = df.apply(lambda x: x.str.split(': ', n=1).str[1])
print(df)
Result:
0 1 2 3
0 hello: 49 world: 97 of: 53 python: 5
1 hello: 33 world: 65 of: 62 python: 51
2 hello: 100 world: 38 of: 61 python: 45
0 hello world of python
0 hello: 49 world: 97 of: 53 python: 5
1 hello: 33 world: 65 of: 62 python: 51
2 hello: 100 world: 38 of: 61 python: 45
0 hello world of python
0 49 97 53 5
1 33 65 62 51
2 100 38 61 45
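To illustrate why the split is limited to one, here is a small sketch with made-up cell values (the 'name' and 'note' labels are invented for this example) where one value itself contains ': ':

```python
import pandas as pd

# Hypothetical cells; the second value contains an extra ': '
cells = pd.Series(['name: Ana', 'note: see: appendix'])

# n=1 splits only on the first ': ', so the value stays intact
parts = cells.str.split(': ', n=1)
print(parts.str[0].tolist())  # ['name', 'note']
print(parts.str[1].tolist())  # ['Ana', 'see: appendix']

# Without the limit, the value is cut at the second ': ' as well
unlimited = cells.str.split(': ')
print(unlimited.str[1].tolist())  # ['Ana', 'see']
```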

simple usecase if numpy.delete() is not working

here is some code:
c = np.delete(a,b)
print(len(a))
print(a)
print(len(b))
print(b)
print(len(c))
print(c)
it gives back:
24
[32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55]
20
[46, 35, 37, 54, 40, 49, 34, 48, 50, 38, 42, 47, 33, 52, 41, 36, 39, 44, 55,
51]
24
[32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55]
As you can see, all elements of b appear in a, but they are not being deleted. I cannot figure out why. Any ideas? Thank you.
numpy.delete does not remove the elements contained in b; it deletes a[b]. In other words, b needs to contain the indices to remove. Since your b contains only values larger than the length of a, no values are removed. Currently, out-of-bounds indices are ignored, but this will not be true in the future:
/usr/local/bin/ipython3:1: DeprecationWarning: in the future out of bounds indices will raise an error instead of being ignored by `numpy.delete`.
#!/usr/bin/python3
A pure Python solution would be to use set:
set_b = set(b)
c = np.array([x for x in a if x not in set_b])
# array([32, 43, 45, 53])
And using numpy broadcasting to create a mask to determine which values to delete (b is a plain list in the question, so it is converted to an array first):
b_arr = np.asarray(b)
c = a[~(a[None, :] == b_arr[:, None]).any(axis=0)]
# array([32, 43, 45, 53])
They are about the same speed with the given example, but the numpy approach takes more memory (because it generates a 2D matrix that contains all combinations of a and b).
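Another option, not part of the answer above, is np.isin, which builds the membership mask directly without the 2D intermediate. A minimal sketch using the arrays from the question; it also shows numpy.delete working once it is given indices rather than values:

```python
import numpy as np

a = np.arange(32, 56)
b = [46, 35, 37, 54, 40, 49, 34, 48, 50, 38, 42, 47, 33, 52, 41, 36, 39, 44, 55, 51]

# True where an element of a also occurs in b
mask = np.isin(a, b)

# Keep everything that is NOT in b
c = a[~mask]
print(c)  # [32 43 45 53]

# numpy.delete expects indices, so convert the mask to positions first
c_via_delete = np.delete(a, np.flatnonzero(mask))
print(c_via_delete)  # [32 43 45 53]
```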

Euler Project #18 - Pythonic approach

I'm trying to solve Problem 18 from Project Euler, but I'm stuck. Working it out on paper gives me the same result as my code, but I know the correct answer differs from mine by 10.
By starting at the top of the triangle below and moving to adjacent numbers on the row below, the maximum total from top to bottom is 23.
3
7 4
2 4 6
8 5 9 3
That is, 3 + 7 + 4 + 9 = 23.
Find the maximum total from top to bottom of the triangle below:
75
95 64
17 47 82
18 35 87 10
20 04 82 47 65
19 01 23 75 03 34
88 02 77 73 07 63 67
99 65 04 28 06 16 70 92
41 41 26 56 83 40 80 70 33
41 48 72 33 47 32 37 16 94 29
53 71 44 65 25 43 91 52 97 51 14
70 11 33 28 77 73 17 78 39 68 17 57
91 71 52 38 17 14 91 43 58 50 27 29 48
63 66 04 68 89 53 67 30 73 16 69 87 40 31
04 62 98 27 23 09 70 98 73 93 38 53 60 04 23
NOTE: As there are only 16384 routes, it is possible to solve this problem by trying every route. However, Problem 67, is the same challenge with a triangle containing one-hundred rows; it cannot be solved by brute force, and requires a clever method! ;o)
Here is my code
filename = "triangle.txt"
f = open(filename,"r+")
total = 0
#will store the position of the maximum value in the line
index = 0
#get the first pyramid value
total = [int(x) for x in f.readline().split()][0]
#since it's only one value, the position will start with 0
current_index = 0
# loop through the lines
for line in f:
    # transform the line into a list of integers
    cleaned_list = [int(x) for x in line.split()]
    # get the maximum value between index and index + 1 (adjacent positions)
    maximum_value_now = max(cleaned_list[current_index],cleaned_list[current_index + 1])
    #print maximum_value_now
    # stores the index to the next iteration
    future_indexes = [ind for (ind,value) in enumerate(cleaned_list) if value == maximum_value_now]
    # we have more than one value in our list with this maximum value
    # must return only that which is greater than our previous index
    if (len(future_indexes) > 1):
        current_index = [i for i in future_indexes if (i >= current_index and i <= current_index + 1)][0]
    else:
        #only one occurrence of the maximum value
        current_index = future_indexes[0]
    # add the value found to the total sum
    total = total + maximum_value_now
print(total)
Thanks!
First of all, read the entire triangle into a 2d structure. It is handy to note that we can do an affine transformation to the triangle and therefore use an easier coordinate system:
   3                \   3
  7 4           ====\   7 4
 2 4 6          ====/   2 4 6
8 5 9 3             /   8 5 9 3
It is easy to read this into a jagged array in Python:
with open(filename, 'r') as file:
    rows = [[int(i) for i in line.split()] for line in file]
Now, given x as the horizontal coordinate and y as the vertical coordinate, increasing to the right and downwards respectively, there are 2 valid moves from (x, y): (x + 1, y + 1) and (x, y + 1). It is as simple as that.
The trick here is to calculate the maximum sum for each cell in each row. This is called dynamic programming. The answer is then the largest sum in the last row.
Actually there is no need to remember anything beyond the sums on the just preceding row and the sums on the current row. To calculate the maximal row sums current_sums, we notice that to arrive at position x in the latest row, the previous position must have been x - 1 or x. We choose the maximal value of these, then add the current cell_value. We can consider any of the numbers outside the triangle as 0 for simplicity, as they don't affect the maximal solution here. Therefore we get:
with open('triangle.txt', 'r') as file:
    triangle = [[int(i) for i in line.split()] for line in file]
previous_sums = []
for row in triangle:
    current_sums = []
    for position, cell_value in enumerate(row):
        sum_from_right = 0 if position >= len(previous_sums) else previous_sums[position]
        sum_from_left = (previous_sums[position - 1]
                         if 0 < position <= len(previous_sums)
                         else 0)
        current_sums.append(max(sum_from_right, sum_from_left) + cell_value)
    previous_sums = current_sums
print('The maximum sum is', max(previous_sums))
If you like list comprehensions, the inner loop can be written into one:
current_sums = []
for row in triangle:
    len_previous = len(current_sums)
    current_sums = [
        max(0 if pos >= len_previous else current_sums[pos],
            current_sums[pos - 1] if 0 < pos <= len_previous else 0)
        + cell_value
        for pos, cell_value in enumerate(row)
    ]
print('The maximum sum is', max(current_sums))
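To make the recurrence concrete, here is a small sketch that applies the same row-by-row rule to the 4-row example triangle and prints every intermediate row of sums. The helper name row_sums is introduced here for illustration only.

```python
def row_sums(triangle):
    """Yield, row by row, the best path sum ending at each cell."""
    previous = []
    for row in triangle:
        current = []
        for pos, value in enumerate(row):
            # Best sum arriving from the same x on the previous row
            from_same = previous[pos] if pos < len(previous) else 0
            # Best sum arriving from x - 1 on the previous row
            from_left = previous[pos - 1] if 0 < pos <= len(previous) else 0
            current.append(max(from_same, from_left) + value)
        yield current
        previous = current


example = [[3], [7, 4], [2, 4, 6], [8, 5, 9, 3]]
sums = list(row_sums(example))
for row in sums:
    print(row)
# [3]
# [10, 7]
# [12, 14, 13]
# [20, 19, 23, 16]
print(max(sums[-1]))  # 23
```

The largest entry of the last row, 23, matches the path 3 + 7 + 4 + 9 given in the problem statement.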
Here is a simple recursive solution which uses memoization
L1 = [
" 3 ",
" 7 4 ",
" 2 4 6 ",
"8 5 9 3",
]
L2 = [
" 75 ",
" 95 64 ",
" 17 47 82 ",
" 18 35 87 10 ",
" 20 04 82 47 65 ",
" 19 01 23 75 03 34 ",
" 88 02 77 73 07 63 67 ",
" 99 65 04 28 06 16 70 92 ",
" 41 41 26 56 83 40 80 70 33 ",
" 41 48 72 33 47 32 37 16 94 29 ",
" 53 71 44 65 25 43 91 52 97 51 14 ",
" 70 11 33 28 77 73 17 78 39 68 17 57 ",
" 91 71 52 38 17 14 91 43 58 50 27 29 48 ",
" 63 66 04 68 89 53 67 30 73 16 69 87 40 31 ",
"04 62 98 27 23 09 70 98 73 93 38 53 60 04 23 ",
]
class Max(object):
    def __init__(self, l):
        "parse triangle, initialize cache"
        self.l = l
        # split() without arguments skips the padding spaces and the newline
        self.t = [
            [int(v) for v in x.split()]
            for x in l
        ]
        self.cache = {}
    def maxsub(self, r=0, c=0):
        "compute max path starting at (r,c)"
        saved = self.cache.get((r,c))
        if saved:
            return saved
        if r >= len(self.t):
            answer = (0, [], [])
        else:
            v = self.t[r][c]
            s1, l1, c1 = self.maxsub(r+1, c)
            s2, l2, c2 = self.maxsub(r+1, c+1)
            if s1 > s2:
                answer = (v+s1, [v]+l1, [c]+c1)
            else:
                answer = (v+s2, [v]+l2, [c]+c2)
        self.cache[(r,c)] = answer
        return answer
    def report(self):
        "find and report max path"
        m = self.maxsub()
        print()
        print("\n".join(self.l))
        print("maxsum:%s\nvalues:%s\ncolumns:%s" % m)
if __name__ == '__main__':
    Max(L1).report()
    Max(L2).report()
Sample output
3
7 4
2 4 6
8 5 9 3
maxsum:23
values:[3, 7, 4, 9]
columns:[0, 0, 1, 2]
75
95 64
17 47 82
18 35 87 10
20 04 82 47 65
19 01 23 75 03 34
88 02 77 73 07 63 67
99 65 04 28 06 16 70 92
41 41 26 56 83 40 80 70 33
41 48 72 33 47 32 37 16 94 29
53 71 44 65 25 43 91 52 97 51 14
70 11 33 28 77 73 17 78 39 68 17 57
91 71 52 38 17 14 91 43 58 50 27 29 48
63 66 04 68 89 53 67 30 73 16 69 87 40 31
04 62 98 27 23 09 70 98 73 93 38 53 60 04 23
maxsum:1074
values:[75, 64, 82, 87, 82, 75, 73, 28, 83, 32, 91, 78, 58, 73, 93]
columns:[0, 1, 2, 2, 2, 3, 3, 3, 4, 5, 6, 7, 8, 8, 9]
To solve the 100-row Project Euler problem 67 we make a small change to __main__
def main():
    with open('triangle.txt') as f:
        L = f.readlines()
    Max(L).report()
if __name__ == '__main__':
    main()
Last lines of output:
maxsum:7273
values:[59, 73, 52, 53, 87, 57, 92, 81, 81, 79, 81, 32, 86, 82, 97, 55, 97, 36, 62, 65, 90, 93, 95, 54, 71, 77, 68, 71, 94, 8, 89, 54, 42, 90, 84, 91, 31, 71, 93, 94, 53, 69, 73, 99, 89, 47, 80, 96, 81, 52, 98, 38, 91, 78, 90, 70, 61, 17, 11, 75, 74, 55, 81, 87, 89, 99, 73, 88, 95, 68, 37, 87, 73, 77, 60, 82, 87, 64, 96, 65, 47, 94, 85, 51, 87, 65, 65, 66, 91, 83, 72, 24, 98, 89, 53, 82, 57, 99, 98, 95]
columns:[0, 0, 0, 1, 2, 3, 4, 4, 5, 5, 6, 6, 7, 8, 9, 10, 11, 12, 12, 12, 13, 13, 13, 14, 14, 15, 15, 16, 17, 17, 17, 18, 19, 20, 21, 22, 23, 24, 25, 25, 25, 26, 27, 27, 28, 29, 30, 31, 32, 32, 32, 32, 33, 33, 34, 35, 36, 36, 36, 36, 36, 36, 36, 37, 38, 39, 40, 41, 41, 42, 42, 42, 42, 42, 42, 42, 43, 43, 43, 44, 45, 45, 45, 45, 45, 45, 46, 46, 46, 46, 47, 47, 48, 49, 49, 50, 51, 52, 52, 53]
On my Mac it returns the answer immediately. Here is a timeit measurement:
$ python -m timeit -s 'from p067 import main' main
100000000 loops, best of 3: 0.0181 usec per loop
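One caveat about this measurement: the statement being timed is the bare name main, so as written it times a name lookup rather than a call, which would explain a figure in the nanosecond range. A minimal sketch of the difference using the standard timeit module; solve() is a hypothetical stand-in, not the p067 module:

```python
import timeit


def solve():
    # Hypothetical stand-in for the real work done by main()
    return sum(i * i for i in range(1000))


# "solve" alone only looks the name up; "solve()" actually runs the function
lookup_time = timeit.timeit("solve", globals=globals(), number=1000)
call_time = timeit.timeit("solve()", globals=globals(), number=1000)

print(f"name lookup: {lookup_time:.6f} s, real call: {call_time:.6f} s")
```

On the command line the equivalent would be python -m timeit -s 'from p067 import main' 'main()'.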

Select every n values in pandas column

I would like to select every 3 consecutive values from a column, i.e. a sliding window of size 3.
For example:
Input
12
73
56
33
16
output
12
73
56
------
73
56
33
-----
56
33
16
I have tried to add a key column and group by it, but my data frame is too large to perform the grouping. Here is my attempt:
df.groupby('key').agg(lambda x: x.tolist())
If you use a plain list, you can do it like this:
lst = [12,73,56,33,16]
slide_size = 3
result = []
for i in range(0, len(lst) - slide_size + 1):
    result.append(lst[i:i + slide_size])
result
# output : [[12, 73, 56], [73, 56, 33], [56, 33, 16]]
After this, you can convert the list to a DataFrame.
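For a large column, a vectorized alternative may be worth trying. This is a sketch, assuming NumPy 1.20 or later (which provides sliding_window_view); the column name "value" and the output column names v0, v1, v2 are made up for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"value": [12, 73, 56, 33, 16]})
window = 3

# Each row of `windows` is one run of `window` consecutive values;
# the result is a view on the original data, not a copy
windows = np.lib.stride_tricks.sliding_window_view(df["value"].to_numpy(), window)

# One DataFrame row per window
windows_df = pd.DataFrame(windows, columns=[f"v{i}" for i in range(window)])
print(windows_df)
```

Because the windows are views, no Python-level loop or groupby is needed to build them.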

Indexing data by string

I have a large CSV file whose column headers encode the name and index of an array, e.g.:
time, dataset1[0], dataset1[1], dataset1[2], dataset2[0], dataset2[1], dataset2[2]\n
0, 43, 35, 29, 21, 59, 39\n
1, 21, 59, 39, 43, 35, 29\n
You get the idea (obviously there is far more data in the arrays).
Any ideas on how I can easily parse this into an efficient dataframe?
[EDIT]
Ideally I'm after a structure like this:
time dataset1 dataset2
0 0 [43,35,29] [21,59,39]
1 1 [21,59,39] [43,35,29]
where the indices have been stripped from the labels and turned into NumPy array indices.
from pandas import read_csv
df = read_csv('data.csv')
print(df)
Gives as output:
>>>
time dataset1[0] dataset1[1] dataset1[2] dataset2[0] dataset2[1] \
0 0 43 35 29 21 59
1 1 21 59 39 43 35
dataset2[2]
0 39
1 29
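The snippet above only loads the file. One possible way to get from there to the structure asked for in the question is to group the columns by the name in front of the brackets. This is a sketch, assuming every array column is named exactly name[i]; the sample data is inlined with io.StringIO so the example runs on its own.

```python
import io
import re

import pandas as pd

csv_text = """time, dataset1[0], dataset1[1], dataset1[2], dataset2[0], dataset2[1], dataset2[2]
0, 43, 35, 29, 21, 59, 39
1, 21, 59, 39, 43, 35, 29
"""

raw = pd.read_csv(io.StringIO(csv_text), skipinitialspace=True)
raw.columns = raw.columns.str.strip()  # guard against stray spaces in headers

# Collect the "name[i]" columns per name, remembering the index i
groups = {}
for col in raw.columns:
    match = re.fullmatch(r"(\w+)\[(\d+)\]", col)
    if match:
        groups.setdefault(match.group(1), []).append((int(match.group(2)), col))

result = raw[["time"]].copy()
for name, indexed_cols in groups.items():
    ordered = [col for _, col in sorted(indexed_cols)]
    # One NumPy array per row, stored in an object column
    result[name] = list(raw[ordered].to_numpy())

print(result)
```

Object columns holding arrays are convenient to look at but slow to compute on; for heavy numeric work it may be better to keep raw[ordered].to_numpy() as one 2D array per dataset.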
