Python to extract unique CSV rows


I'm trying to get the first occurrence of each row of a CSV in Python, but I'm running into an issue. My CSV file looks like this:
1,2,3,a,7,5,y,0
1,2,3,a,3,5,y,8
1,2,3,a,5,3,y,7
1,2,3,d,7,5,n,0
1,2,3,d,3,5,n,8
1,2,3,d,5,3,n,7
2,3,4,f,4,6,y,9
2,3,4,f,5,6,y,9
2,3,4,f,7,3,y,9
2,3,4,e,3,5,n,9
2,3,4,e,0,7,n,9
2,3,4,e,5,8,n,9
I tried this way to get the first occurrences of unique values based on one of the columns.
def unique():
    rows = list(csv.reader(open('try.csv', 'r'), delimiter=','))
    columns = zip(*rows)
    uniq = set(columns[1])
    indexed = defaultdict(list)
    for x in uniq:
        i = columns[1].index(x)
        indexed[i] = rows[i]
    return indexed
It works fine for one unique column value set. However, I'd like to treat columns[1] and columns[6] together as the unique values.
The tricky part is that columns[6] is always y or n. If I set that, it returns only the first y row and the first n row. I'd like to get all the rows in which the combination of columns[1] and columns[6] is unique. For every columns[2] value, I need the first occurrence of the y and n rows. Sorry for my poor description. So basically, I'd like my output to be like:
1,2,3,d,7,5,n,0,a
2,3,4,e,3,5,n,9,f

There is some room for improvement in your code, but I didn't want to rewrite it in depth, as you had it almost right. The key point is that you need a compound key: the pair (r[1], r[6]) is what has to be unique. In addition, I took the liberty of using an OrderedDict, which gives fast lookup while preserving the row order.
import csv
import collections
from pprint import pprint

def unique():
    with open('try.csv', 'r') as f:
        rows = list(csv.reader(f, delimiter=','))
    result = collections.OrderedDict()
    for r in rows:
        key = (r[1], r[6])  # the pair (r[1], r[6]) must be unique
        if key not in result:
            result[key] = r  # keep only the first row seen for this key
    return list(result.values())

pprint(unique())
Producing:
[['1', '2', '3', 'a', '7', '5', 'y', '0'],
 ['1', '2', '3', 'd', '7', '5', 'n', '0'],
 ['2', '3', '4', 'f', '4', '6', 'y', '9'],
 ['2', '3', '4', 'e', '3', '5', 'n', '9']]
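As a rough generalization of the compound-key idea, the key columns can be passed in as a parameter. This is only a sketch: the function name first_rows_by_key and the inline sample (a subset of the question's data standing in for try.csv) are assumptions made for illustration, and it relies on plain dicts preserving insertion order (Python 3.7+).

```python
import csv
import io

def first_rows_by_key(rows, key_columns):
    """Return the first row seen for each distinct combination of key columns."""
    seen = {}  # plain dicts preserve insertion order in Python 3.7+
    for row in rows:
        key = tuple(row[i] for i in key_columns)  # compound key, e.g. (row[1], row[6])
        if key not in seen:
            seen[key] = row
    return list(seen.values())

# Inline sample standing in for try.csv (a subset of the question's rows).
sample = io.StringIO(
    "1,2,3,a,7,5,y,0\n"
    "1,2,3,a,3,5,y,8\n"
    "1,2,3,d,7,5,n,0\n"
    "2,3,4,f,4,6,y,9\n"
    "2,3,4,e,3,5,n,9\n"
    "2,3,4,e,0,7,n,9\n"
)
result = first_rows_by_key(csv.reader(sample), key_columns=(1, 6))
for row in result:
    print(",".join(row))
```

Passing key_columns=(3, 6) instead would deduplicate on the letter column together with the y/n column.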

Here's an alternate implementation.
Each row is read in from the data set. We use a defaultdict(list) to store all rows, keyed on each row's two-column index. As a row is read in, it's appended to the list stored under that row's two-column key.
At the end, we scan through the defaultdict. We want the first row from the dataset that matched each key, so we yield the first element of each list.
source
import csv
from collections import defaultdict
def unique():
    uniq = defaultdict(list)
    for row in csv.reader(open('try.csv', 'r'), delimiter=','):
        uniq[ (row[0],row[6]) ].append(row)
    for idx,row in uniq.iteritems():
        yield row[0]
print list( unique() )
output
[['2', '3', '4', 'f', '4', '6', 'y', '9'], ['2', '3', '4', 'e', '3', '5', 'n', '9'], ['1', '2', '3', 'a', '7', '5', 'y', '0'], ['1', '2', '3', 'd', '7', '5', 'n', '0']]

Old topic, but this could be useful for others: why not call the external uniq command if you are in a Unix environment? That way you would not have to reinvent this code, and you might benefit from better performance.

Related

compare two list of unequal length and fill third list by unmatched index

I have two lists of unequal size.
large_list=['A','B','C','D','E','F','G','H','I']
small_list=['A','D','E']
I have another list.
tag_list=['1','3','5']
I want to compare large_list against small_list. Where an element of large_list is also in small_list, take the element at the matching index of tag_list; otherwise put '0' at that position.
I tried this code
new_tags=[]
for lrg in large_list:
    for sm,tag in zip(small_list,tag_list):
        if sm==lrg:
            new_tags.append(tag)
        else:
            new_tags.append('0')
new_tags
But the output it produces is longer than I want because of the nested for loop; the length I want is that of large_list.
This is expected output.
output=['1', '0', '0', '3', '5', '0', '0', '0', '0']

One approach using a lookup dictionary:
large_list = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I']
small_list = ['A', 'D', 'E']
tag_list = ['1', '3', '5']
# mapping each value of small_list to the correspondent tag_list value
lookup = dict(zip(small_list, tag_list)) # {'A': '1', 'D': '3', 'E': '5'}
res = [lookup.get(e, '0') for e in large_list]
print(res)
Output
['1', '0', '0', '3', '5', '0', '0', '0', '0']

You can loop over large_list and use index to find each element's position in small_list. If the element is not present, catch the resulting ValueError and insert '0' at that position instead.
new_tags = []
for value in large_list:
    try:
        new_tags.append(tag_list[small_list.index(value)])
    except ValueError:
        # value is not in small_list
        new_tags.append('0')
print(new_tags)

You could use pop and a comprehension to do this
large_list=['A','B','C','D','E','F','G','H','I']
small_list=['A','D','E']
tag_list=['1','3','5']
[tag_list.pop(0) if x in small_list else '0' for x in large_list]
Output
['1', '0', '0', '3', '5', '0', '0', '0', '0']

You can use iter and next to fetch elements from tag_list:
it = iter(tag_list)
result = [next(it) if x in small_list else '0' for x in large_list]
However, this is fragile. What if there are more than three matching elements?
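To make that fragility concrete, here is a small sketch comparing the positional approach with a value-based lookup. The reordered large_list with a repeated element is a hypothetical input, and the '0' default passed to next is an addition so that the example does not stop with StopIteration when tag_list runs out.

```python
large_list = ['E', 'A', 'X', 'A']  # hypothetical input: reordered, with a repeated match
small_list = ['A', 'D', 'E']
tag_list = ['1', '3', '5']

# Positional approach: tags are handed out in tag_list order,
# regardless of which element of small_list actually matched.
it = iter(tag_list)
positional = [next(it, '0') if x in small_list else '0' for x in large_list]

# Lookup approach: each element gets the tag paired with it in small_list.
lookup = dict(zip(small_list, tag_list))
by_value = [lookup.get(x, '0') for x in large_list]

print(positional)  # ['1', '3', '0', '5']
print(by_value)    # ['5', '1', '0', '1']
```

The positional version only gives the intended result when the matches appear in large_list exactly once each and in the same order as in small_list.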

You can try this:
unmatch = []
large_list = ['A','B','C','D','E','F','G','H','I']
small_list = ['A','D','E']
tag_list = ['1','3','5']
j = 0
for num in large_list:
    if num in small_list:
        unmatch.append(tag_list[j])
        j += 1
    else:
        unmatch.append('0')
print(unmatch)
# ['1', '0', '0', '3', '5', '0', '0', '0', '0']
It iterates over large_list. If a value is not in small_list, it appends '0' to unmatch; if it is, it appends the element at index j of tag_list and then increments j by 1.

How to convert csv file to dictionary so that top row is keys, and columns (in list form) are values (without numpy.loadtxt and panda)?

(To preface, I know/believe you could use numpy.loadtxt and pandas but I want to know how to achieve the same result without them.)
I would have a csv file like this, after opening it and assigning it to file while also using splitlines():
file =
'Top1','Top2','Top3','Top4','Top5'
'1','a','3','b','5'
'a','2','b','3','c'
'1','a','3','b','5'
How could I go about converting the file into a dictionary, with the top row being the keys,
and the column underneath each key being its value, like so:
a_dict = {
'top1': ['1', 'a', '1'],
'top2': ['a', '2', 'a'],
'top3': ['3', 'b', '3'],
'top4': ['b', '3', 'b'],
'top5': ['5', 'c', '5']
}
I tried by doing this first:
a_dict = {}
for i in file[0].split(','):
    a_dict[i] = ''
print(a_dict)
which gave me:
a_dict = {"'top1'": '', "'top2'": '', "'top3'": '', "'top4'": '', "'top5'": ''}
then, to get the columns into lists I tried:
prac_list = []
for i in file[1:]:
    i = i.split(',')
    x = 0
    prac_list.append(i[x])
    x+=1
print (prac_list)
and that gave me:
prac_list = ["'1'", "'a'", "'1'"]
but I got stuck at that point. The idea was to have the for loop go through the rows below the first one, so that each iteration would take the x'th element of every row and append them together as one element in the list, like ['1', 'a', '1']. E.g.:
x = 0
for i in file[1:]:
    get x'th index of each row
    append together as one element in prac_list
    x += 1
print (prac_list)
then loop through the dict and change the values to the items in the list, but I believe I am messing up the slicing and the appending to the list part.

You can do this with zip:
text = """'Top1','Top2','Top3','Top4','Top5'
'1','a','3','b','5'
'a','2','b','3','c'
'1','a','3','b','5'"""
lines = text.split('\n')  # equivalent to file_pointer.read().splitlines()
keys = [x.replace("'", '').lower() for x in lines[0].split(',')]
rows = [[x.replace("'", '') for x in line.split(',')] for line in lines[1:]]
# zip(*rows) transposes the rows into columns, one column per key
result = dict(zip(keys, map(list, zip(*rows))))
Output:
{'top1': ['1', 'a', '1'],
 'top2': ['a', '2', 'a'],
 'top3': ['3', 'b', '3'],
 'top4': ['b', '3', 'b'],
 'top5': ['5', 'c', '5']}

I think this code will help you:
with open('Book1.csv', 'r') as f:
    file = f.read().splitlines()
# one empty list per header name, with the surrounding quotes stripped
a_dict = {i.replace("'", ""): [] for i in file[0].split(',')}
for i in file[1:]:
    i = i.split(',')
    x = 0
    for j in a_dict:
        a_dict[j].append(i[x].replace("'", ""))
        x += 1
print(a_dict)
Output
{'Top1': ['1', 'a', '1'], 'Top2': ['a', '2', 'a'], 'Top3': ['3', 'b', '3'], 'Top4': ['b', '3', 'b'], 'Top5': ['5', 'c', '5']}

You can use the csv library to iterate through rows:
import csv
with open('file.csv') as f:
    reader_obj = csv.DictReader(f)
    data = next(reader_obj)
    for row in reader_obj:
        data = {k: (data[k]+[v] if type(data[k]) is list else [data[k]]+[v]) for k,v in row.items()}
print(data)
Output
{"'Top1'": ["'1'", "'a'", "'1'"], "'Top2'": ["'a'", "'2'", "'a'"], "'Top3'": ["'3'", "'b'", "'3'"], "'Top4'": ["'b'", "'3'", "'b'"], "'Top5'": ["'5'", "'c'", "'5'"]}
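Another possible sketch combines csv.reader with a zip(*rows) transpose. It assumes the fields really are wrapped in single quotes as shown in the question, so quotechar="'" is passed to let the csv module strip them; the inline text stands in for the file, and the keys are lowercased only to match the desired output in the question.

```python
import csv
import io

# Inline text standing in for the CSV file from the question.
text = """'Top1','Top2','Top3','Top4','Top5'
'1','a','3','b','5'
'a','2','b','3','c'
'1','a','3','b','5'"""

# quotechar="'" makes the reader treat the single quotes as field quoting.
reader = csv.reader(io.StringIO(text), quotechar="'")
header, *rows = reader

# zip(*rows) turns the list of rows into a sequence of columns.
a_dict = {key.lower(): list(column) for key, column in zip(header, zip(*rows))}
print(a_dict)
```

With a real file, io.StringIO(text) would be replaced by a file object opened with newline=''.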

How to introduce constraints using Python itertools.product()?

The following script generates 4-character permutations of set s and outputs to file:
import itertools
s = ['1', '2', '3', '4', '!']
l = list(itertools.product(s, repeat=4))
with open('output1.txt', 'w') as f:
    for i in l:
        f.write(''.join([str(v) for v in i]) + '\n')
Output:
...
11!1
11!2
11!3
11!4
11!!
...
How are constraints introduced such as:
No permutation should start with '!'
The 3rd character should be '3'
etc.

The repeat parameter is meant to be used when you do want the same set of options for each position in the sequence. Since you don't, pass one iterable per position as positional arguments instead, each giving the options for that position.
For your example, the first letter can be any of ['1', '2', '3', '4'], and the third letter can only be '3':
import itertools as it
s = ['1', '2', '3', '4', '!']
no_exclamation_mark = ['1', '2', '3', '4']
only_3 = ['3']
l = it.product(no_exclamation_mark, s, only_3, s)
@Kelly Bundy wrote the same solution in a comment, but simplified using the fact that strings are sequences of characters, so if your options for each position are just one character each then you don't need to put them in lists:
l = it.product('1234', '1234!', '3', '1234!')
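If the constraints are easier to state as rules than as hand-written option lists, the per-position options can be derived first and then passed to product. This is only a sketch: the rules table and the allowed helper are hypothetical names introduced for illustration.

```python
import itertools

s = ['1', '2', '3', '4', '!']

# Hypothetical rule table: position -> predicate the character must satisfy.
rules = {
    0: lambda ch: ch != '!',  # must not start with '!'
    2: lambda ch: ch == '3',  # third character must be '3'
}

def allowed(position):
    """Characters of s permitted at the given position (all of s if no rule)."""
    rule = rules.get(position, lambda ch: True)
    return [ch for ch in s if rule(ch)]

options = [allowed(position) for position in range(4)]
words = [''.join(chars) for chars in itertools.product(*options)]

print(len(words))  # 100 (4 * 5 * 1 * 5)
print(words[:3])   # ['1131', '1132', '1133']
```

This only covers constraints that look at one position at a time; a rule relating two positions would still need filtering.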

Don't convert the result to a list. Instead, filter it using a generator comprehension:
result = itertools.product(s, repeat=4)
result = (''.join(word) for word in result)
result = (word for word in result if not word.startswith('!'))
result = (word for word in result if word[2] == '3')
The filtering will not be executed until you actually read the elements from result, such as converting it to a list or using a for-loop:
def f1(x):
    print("Filter 1")
    return x.startswith('A')
def f2(x):
    print("Filter 2")
    return x.endswith('B')
words = ['ABC', 'ABB', 'BAA', 'BBB']
result = (word for word in words if f1(word))
result = (word for word in result if f2(word))
print('No output here')
print(list(result))
print('Filtering output here')
This will output
No output here
Filter 1
Filter 2
Filter 1
Filter 2
Filter 1
Filter 1
['ABB']
Filtering output here

The itertools.product function can't handle the kinds of constraints you describe itself. You can probably implement them yourself, though, with extra iteration and changes to how you build your output. For instance, to generate a 4-character string where the third character is always 3, generate a 3-product and use it to fill in the first, second and fourth characters, leaving the third fixed.
Here's a solution for your two suggested constraints. There's not really a generalization to be made here, I'm just interpreting each one and combining them:
import itertools
s = ['1', '2', '3', '4', '!']
for i in s[:-1]: # skip '!'
    for j, k in itertools.product(s, repeat=2): # generate two more values from s
        print(f'{i}{j}3{k}')
This approach avoids generating values that will need to be filtered out. This is a lot more efficient than generating all possible four-tuples and filtering the ones that violate the constraints. The filtering approach will often do many times more work, and it gets proportionally much worse the more constraints you have (since more and more of the generated values will be filtered).
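A quick way to see the difference in work is to count what each approach generates. The sketch below assumes only the two constraints from the question; the variable names are illustrative.

```python
import itertools

s = ['1', '2', '3', '4', '!']

# Approach 1: generate every 4-character string, then filter.
candidates = [''.join(t) for t in itertools.product(s, repeat=4)]
filtered = [w for w in candidates if not w.startswith('!') and w[2] == '3']

# Approach 2: build only the strings that satisfy the constraints.
direct = [f'{i}{j}3{k}' for i in s[:-1] for j, k in itertools.product(s, repeat=2)]

print(len(candidates), len(filtered), len(direct))  # 625 100 100
```

Here filtering examines 625 candidates to keep 100, and both approaches produce the same strings in the same order.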

Itertools' product does not have an integrated filter mechanism. It will generate every combination by brute force, and you will have to filter its output (which is not very efficient).
To be more efficient you would need to implement your own (recursive) generator function so that you can short-circuit the generation as soon as one of the constraints is not met (i.e. before getting to a full permutation):
def perm(a,p=[]):
    # constraints applied progressively
    if p and p[0] == "!": return
    if len(p)>= 3 and p[2]!= '3': return
    # yield permutation of 4
    if len(p)==4: yield p; return
    # recursion (product)
    for x in a:
        yield from perm(a,p+[x])
Output:
s = ['1', '2', '3', '4', '!']
for p in perm(s): print(p)
['1', '1', '3', '1']
['1', '1', '3', '2']
['1', '1', '3', '3']
['1', '1', '3', '4']
['1', '1', '3', '!']
['1', '2', '3', '1']
['1', '2', '3', '2']
['1', '2', '3', '3']
...
['4', '4', '3', '3']
['4', '4', '3', '4']
['4', '4', '3', '!']
['4', '!', '3', '1']
['4', '!', '3', '2']
['4', '!', '3', '3']
['4', '!', '3', '4']
['4', '!', '3', '!']

How to write complex sort in python?

Is there a concise way to sort a list by first sorting numbers in ascending order and then sort the characters in descending order?
How would you sort the following:
['2', '4', '1', '6', '7', '4', '2', 'K', 'A', 'Z', 'B', 'W']
To:
['1', '2', '2', '4', '4', '6', '7', 'Z', 'W', 'K', 'B', 'A']

One way (there might be better ones) is to separate digits and letters beforehand, sort them appropriately and glue them again together in the end:
lst = ['2', '4', '1', '6', '7', '4', '2', 'K', 'A', 'Z', 'B', 'W']
numbers = sorted([number for number in lst if number.isdigit()])
letters = sorted([letter for letter in lst if not letter.isdigit()], reverse=True)
combined = numbers + letters
print(combined)
Another way makes use of ord(...) and the ability to sort by tuples. Here we use zero for numbers and one for letters:
def sorter(item):
    if item.isdigit():
        return 0, int(item)
    else:
        return 1, -ord(item)
print(sorted(lst, key=sorter))
Both will yield
['1', '2', '2', '4', '4', '6', '7', 'Z', 'W', 'K', 'B', 'A']
As for timing:
from timeit import timeit

# the list being sorted in both timed functions
my_list = ['2', '4', '1', '6', '7', '4', '2', 'K', 'A', 'Z', 'B', 'W']

def different_lists():
    global my_list
    numbers = sorted([number for number in my_list if number.isdigit()])
    letters = sorted([letter for letter in my_list if not letter.isdigit()], reverse=True)
    return numbers + letters

def key_function():
    global my_list
    def sorter(item):
        if item.isdigit():
            return 0, int(item)
        else:
            return 1, -ord(item)
    return sorted(my_list, key=sorter)

print(timeit(different_lists, number=10**6))
print(timeit(key_function, number=10**6))
This yields (running it a million times on my MacBook):
2.9208732349999997
4.54283629
So the approach with list comprehensions is faster here.

To elaborate on the custom-comparison approach: in Python the built-in sort does key comparison.
How to think about the problem: to group values and then sort each group by a different quality, we can think of "which group is a given value in?" as a quality - so now we are sorting by multiple qualities, which we do with a key that gives us a tuple of the value for each quality.
Since we want to sort the letters in descending order, and we can't "negate" them (in the arithmetic sense), it will be easiest to apply reverse=True to the entire sort, so we keep that in mind.
We encode: True for digits and False for non-digits (since numerically, these are equivalent to 1 and 0 respectively, and we are sorting in descending order overall). Then for the second value, we'll use the symbol directly for non-digits; for digits, we need the negation of the numeric value, to re-reverse the sort.
This gives us:
def custom_key(value):
    numeric = value.isdigit()
    return (numeric, -int(value) if numeric else value)
And now we can do:
my_list.sort(key=custom_key, reverse=True)
which works for me (and also handles multi-digit numbers):
>>> my_list
['1', '2', '2', '4', '4', '6', '7', 'Z', 'W', 'K', 'B', 'A']
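To illustrate the multi-digit point, here is a short sketch with a hypothetical input list. It compares the tuple key against plain string sorting of the digit strings, which orders them lexicographically rather than numerically.

```python
def custom_key(value):
    numeric = value.isdigit()
    # digits: (True, -number); letters: (False, letter)
    return (numeric, -int(value) if numeric else value)

my_list = ['10', '2', 'B', '1', 'Z', '33']  # hypothetical input with multi-digit numbers

by_key = sorted(my_list, key=custom_key, reverse=True)

# Plain string sorting of the digit strings, for comparison.
digits_as_strings = sorted(x for x in my_list if x.isdigit())

print(by_key)             # ['1', '2', '10', '33', 'Z', 'B']
print(digits_as_strings)  # ['1', '10', '2', '33']
```

Because the first tuple element always differs between a digit string and a letter, Python never has to compare an int with a str.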

You will have to implement your own key function and pass it as the key argument to the sorted function. What you are seeking is not a trivial comparison, as you "assign" custom values to fields, so you will have to let Python know how you value each one of them.

Extend/append python join list

I have an example:
li = [['b', 'b', 'c', '3.2', 'text', '3', '5', '5'], ['a', 'w', '3', '4'], ['a', 'x', '3', '4'],['a','b'],['312','4']]
a = 0
b = []
c = []
count = []
for x in range(len(li)):
    for a in range(len(li[x])):
        if li[x][a].isalpha():
            a += 1
        elif not li[x][a].isalpha() and li[x][a + 1].isalpha():
            a += 1
        else:
            break
    i = (len(li[x]) - a)
    b.extend([' '.join(li[x][0:a])])
    b.extend(li[x][a::])
    count.append(i)
for x in range(len(count)):
    a = count[x] + 1
    z = (sum(count[:x]))
    if x == 0:
        c.append(b[:a])
    else:
        c.append(b[a+1::z])
print(c)
I have various items in the li list and the length of the list itself is not constant.
If any element in the array is a string or if there is some other symbol between the two strings, it combines everything into one element - this join works as I wanted.
I would like to preserve the existing structure. For example, output now looks like this:
[['b b c 3.2 text', '3', '5', '5'], ['a w', 'a x', 'a b', '4'], ['a w', '4'], ['5', '4'], ['a w', '']]
but it should look like this:
[['b b c 3.2 text', '3', '5', '5'],['aw','3','4'],['ax','3','4'],['ab'],['312','4']]
Of course, the code I sent did not work properly - I think of a solution but I still have some problems with it - I do not know how to add ranges to this list c - I try to pull the length of the elements of the list as count but it also doesn't work for me - maybe this is a bad solution? Maybe this extend b is not the best solution? Maybe there is no point in using so many 'transformations' and creating new lists?
Please give me some tips.

The definition is a bit unclear to me, but I think this will do it. Code is not very verbose, though. If it does what you intended, I can try to explain / make it simpler.
li = [['b', 'b', 'c', '3.2', 'text', '3', '5', '5'], ['a', 'w', '3', '4'], ['a', 'x', '3', '4'],['a','b'],['312','4']]
def join_to_last_text(lst: list, min_join: int = 1) -> list:
    last_text = max((i for i,s in enumerate(lst) if s.isalpha()), default=min_join - 1)
    return [' '.join(lst[:last_text + 1])] + lst[last_text + 1:]
output = [join_to_last_text(lst) for lst in li]
print(output)
# You can join a minimum of first items by setting a higher max default.
# If max does not find isalpha, it will use this value.
output_min_2 = [join_to_last_text(lst, min_join=2) for lst in li]
print(output_min_2)

@Johan Schiff's code works as expected but leaves a corner case: when the first element of the list is not text. I have made a small change to his code to take care of that situation:
li = [['b', 'b', 'c', '3.2', 'text', '3', '5', '5'], ['a', 'w', '3', '4'], ['a', 'x', '3', '4'],['a','b'],['312','4']]
def join_to_last_text(lst: list) -> list:
    first_text = min((i for i,s in enumerate(lst) if s.isalpha()), default=0)
    last_text = max((i for i,s in enumerate(lst) if s.isalpha()), default=0)
    return lst[:first_text] + [''.join(lst[first_text:last_text + 1])] + lst[last_text + 1:]
output = [join_to_last_text(lst) for lst in li]
print(output)
Where would this give a different (and correct) output? Check out the following test case:
li = [['4','b', 'b', 'c', '3.2', 'text', '3', '5', '5'], ['a', 'w', '3', '4']]
@Johan's code would output:
[['4 b b c 3.2 text', '3', '5', '5'], ['a w', '3', '4']]
whereas, based on the following phrase in the question,
If any element in the array is a string or if there is some other symbol between the two strings, it combines everything into one element
the output should be:
[['4', 'bbc3.2text', '3', '5', '5'], ['aw', '3', '4']]
