I have created two CSV lists. One is the original CSV file; the other is a deduped version of that file. I have read each into a list and, for all intents and purposes, they are in the same format. Each list item is a string.
I am trying to use a list comprehension to find out which items were deleted by the dedupe. The length of the original is 16939 and the length of the deduped list is 15368. That's a difference of 1571, but my list comprehension's length is 368. Ideas?
deduped = open('account_de_ex.csv', 'r')
deduped_data = deduped.read()
deduped.close()
deduped = deduped_data.split("\r")
#read in file with just the account names from the full account list
account_names = open('account_names.csv', 'r')
account_data = account_names.read()
account_names.close()
account_names = account_data.split("\r")
# Get all the accounts that were deleted in the dedupe - i.e. get the duplicate accounts
dupes = [ele for ele in account_names if ele not in deduped]
Edit: In response to some notes in the comments, here is a test on my list comp and the lists themselves. Pretty much the same difference, 20 or so off, not the ~1500 I need! Thanks!
print len(deduped)
deduped = set(deduped)
print len(deduped)
print len(account_names)
account_names = set(account_names)
print len(account_names)
Output:
15368
15368
16939
15387
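For diagnosis, a multiset difference accounts for repeated rows in a way the set-length test above cannot. A minimal sketch, using account_names and deduped as originally read in (before the set() conversions above):
from collections import Counter

# Count every row with its multiplicity, then subtract.
removed = Counter(account_names) - Counter(deduped)

# Total rows dropped by the dedupe, counting repeats.
print(sum(removed.values()))  # 1571 if the files differ only by deletions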
Try running this code and see what it reports. It requires Python 2.7 or newer for collections.Counter, but you could easily write your own counter code, or copy my example code from another answer: Python : List of dict, if exists increment a dict value, if not append a new dict
from collections import Counter

# read in original records
with open("account_names.csv", "rt") as f:
    rows = sorted(line.strip() for line in f)

# count how many times each row appears
counts = Counter(rows)

# get a list of tuples of (count, row) that only includes count > 1
dups = [(count, row) for row, count in counts.items() if count > 1]
dup_count = sum(count - 1 for count in counts.values() if count > 1)

# sort the list from largest number of dups to least
dups.sort(reverse=True)

# print a report showing how many dups
for count, row in dups:
    print("{}\t{}".format(count, row))

# get de-duped list
unique_rows = sorted(counts)

# read in de-duped list
with open("account_de_ex.csv", "rt") as f:
    de_duped = sorted(line.strip() for line in f)

print("List lengths: rows {}, uniques {}/de_duped {}, result {}".format(
    len(rows), len(unique_rows), len(de_duped), len(de_duped) + dup_count))

# lists should match since we sorted both lists
if unique_rows == de_duped:
    print("perfect match!")
else:
    # if lists don't match, find out what is going on
    uniques_set = set(unique_rows)
    deduped_set = set(de_duped)
    # find intersection of the two sets
    x = uniques_set.intersection(deduped_set)
    # print differences
    if x != uniques_set:
        print("Rows in original that are not in deduped:\n{}".format(sorted(uniques_set - x)))
    if x != deduped_set:
        print("Rows in deduped that are not in original:\n{}".format(sorted(deduped_set - x)))
To see what you really have in each list, you can proceed by construction.
If you only had unique elements:
deduped = range(15368)
account_names2 = range(15387)
dupes2 = [ele for ele in account_names2 if ele not in deduped] #len is 19
However, because you have repetitions of both removed and non-removed elements, you actually end up with something like:
account_names = account_names2 + dupes2*18 + dupes2[:7] + account_names2[:1571 - 368]
dupes = [ele for ele in account_names if ele not in deduped] # dupes will have 368 elements
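A quick sanity check of this construction (illustrative values only; a set is used for the membership tests so the comprehensions stay fast at these sizes):
deduped = list(range(15368))
account_names2 = list(range(15387))
deduped_set = set(deduped)  # set membership keeps the comprehensions fast

dupes2 = [ele for ele in account_names2 if ele not in deduped_set]  # 19 elements
account_names = account_names2 + dupes2*18 + dupes2[:7] + account_names2[:1571 - 368]

print(len(account_names))                                             # 16939
print(len([ele for ele in account_names if ele not in deduped_set]))  # 368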
I want to count the unique HH:MM:xx values (e.g. 11:11:00, 11:12:00, 11:12:11) using regex. So far I am only able to count the total number of HH:MM:SS occurrences in the text, and I am not sure how to continue from here. This is my code:
import re

pattern = re.compile("(\d{2}):(\d{2}):(\d{2})")  # capture every HH:MM:SS pattern
path = r'C:\Users\CL\Desktop\abc.txt'
list1 = []  # to store values in a list
for line in open(path, 'r'):
    for match in re.finditer(pattern, line):  # matches 11:11:00, 11:12:00, 11:12:11
        list1.append(line)  # append to the list
total = len(list1)  # length of the list
print(total)  # 3
Sample text:
11:11:00
abc
11:12:00
abc
11:12:11
abc
The desired output should be 2 (unique values: 11:11:xx and 11:12:xx).
See below (data1.txt is your data):
from collections import defaultdict

data = defaultdict(int)
with open('data1.txt') as f:
    lines = [l.strip() for l in f.readlines()]
for line in lines:
    if line.count(':') == 2:
        data[line[:5]] += 1
print(data)
Output:
defaultdict(<class 'int'>, {'11:11': 1, '11:12': 2})
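The count of unique HH:MM values the question asks for is then len(data), which gives 2 here.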
You could use re.findall here, followed by an order-preserving pass to remove duplicates:
import re

with open(path, 'r') as file:
    data = file.read()

ts = re.findall(r'(\d{2}:\d{2}):\d{2}', data)
res = []
for x in ts:
    if x not in res:  # keep only the first occurrence, preserving order
        res.append(x)
print(len(res))  # 2
If you only want to count the number of occurrences you can simply:
import re

txtfile = open(r"C:\Users\CL\Desktop\abc.txt", "r")
filetext = txtfile.read()
txtfile.close()
list1 = set(re.findall("(\d{2}:\d{2}):\d{2}", filetext))
total = len(list1)  # number of unique HH:MM values
print(total)  # 2
You can use parentheses to specify what you want to capture (the HH:MM). Then you can use set to remove duplicates.
Have you tried using a set instead of a list?
import re

pattern = re.compile("(\d{2}):(\d{2}):(\d{2})")
path = r'C:\Users\CL\Desktop\abc.txt'
s = set()  # use a set instead of a list, to avoid duplicates
for line in open(path, 'r'):
    for match in re.finditer(pattern, line):
        s.add(line[:-3])  # drop the seconds (and newline) before inserting into the set
total = len(s)  # number of elements in s
print(total)  # 2
This way, if you try to insert an element you've already seen, there won't be multiple copies of it stored, since sets don't allow duplicates.
EDIT: As commented, we are not supposed to include seconds here, which I mistakenly did originally. Fixed now.
I have multiple lists; the first index of each list is related to the others, the second as well, and so on and so forth. I need a way of linking the order of these two lists together. I have a list of team ids (some are duplicates), and I need an if statement that says: if there is a duplicate of this team, compare the related values in the other list and keep the better one.
import sys
import itertools
from itertools import islice

fileLocation = input("Input the file location of ScoreBoard: ")
T = []
N = []
L = []
timestamps = []
teamids = []
problemids = []
inputids = []
scores = []
dictionary = {}
amountOfLines = len(open('input1.txt').readlines())
with open('input1.txt') as input1:
    for line in islice(input1, 2, amountOfLines):
        parsed = line.strip().split()
        timestamps.append(parsed[0])
        teamids.append(parsed[1])
        problemids.append(parsed[2])
        inputids.append(parsed[3])
        scores.append(parsed[4])

def checkIfDuplicates(teamids):
    '''Check if given list contains any duplicates.'''
    if len(teamids) == len(set(teamids)):
        return False
    else:
        return True

for i in teamids:
    if checkIfDuplicates(i):
        dictionary['team%s' % i] = {}
    if dictionary < amountOfTeams:
        dictionary['team%s' %]  # incomplete, this is where I am stuck
for i in score:
    dictionary[teamid][]  # incomplete
print(dictionary)
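For reference, a minimal sketch of the grouping the question seems to describe, assuming "better" means the higher score (the variable names follow the question's code):
# Keep the best (highest) score seen for each team id.
best_scores = {}
for team, score in zip(teamids, scores):
    score = float(score)  # scores were parsed as strings above
    if team not in best_scores or score > best_scores[team]:
        best_scores[team] = score
print(best_scores)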
Loop through each list item and delete the item if it is a duplicate:
for i in list1[:]:  # iterate over a copy so removing from list1 is safe
    if i in list2:
        list1.remove(i)
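For large lists, membership tests against a set are much faster than the nested loops; a sketch of the same removal:
# Build the set once, then filter list1 in a single pass.
to_remove = set(list2)
list1 = [i for i in list1 if i not in to_remove]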
This is what I do to find all duplicated lines in a text file:
import regex  # third-party regex module, used like re

# f is an open file object, endline the total number of lines (defined elsewhere)
# capture all lines in a buffer
r = f.readlines()
# create a list of all line numbers
lines = list(range(1, endline + 1))
# merge both lists
z = [list(a) for a in zip(r, lines)]
# sort the list
newsorting = sorted(z)
# collect the duplicates
listdoubles = []
for i in range(0, len(newsorting) - 1):
    if (i + 1) <= len(newsorting):
        if (newsorting[i][0] == newsorting[i+1][0]) and (not regex.search('^\s*$', newsorting[i][0])):
            listdoubles.append(newsorting[i][1])
            listdoubles.append(newsorting[i+1][1])
# remove eventual duplicate line numbers
listdoubles = list(set(listdoubles))
# sort the line numbers numerically
listdoubles = sorted(listdoubles, key=int)
print(listdoubles)
But it is very slow: with over 10,000 lines it takes 10 seconds to create this list.
Is there a way to do it faster?
You can use a simpler approach:
for each line
if it has been seen before then display it
else add it to the set of known lines
In code:
seen = set()
for L in f:
    if L in seen:
        print(L)
    else:
        seen.add(L)
If you want to display the line numbers where duplicates appear, the code can simply be changed to use a dictionary mapping each line's content to the line number where its text was first seen:
seen = {}
for n, L in enumerate(f):
    if L in seen:
        print("Line %i is a duplicate of line %i" % (n, seen[L]))
    else:
        seen[L] = n
Both dict and set in Python are based on hashing and provide constant-time lookup operations.
EDIT
If you need only the line numbers of the last duplicate of each line, then the output clearly cannot be produced during processing; you will first have to process the whole input before emitting any output...
# lastdup will be a map from line content to the line number where the
# last duplicate was found. On first insertion the value is None
# to mark that the line is not (yet) a duplicate.
lastdup = {}
for n, L in enumerate(f):
    if L in lastdup:
        lastdup[L] = n
    else:
        lastdup[L] = None

# Now all values that are not None are the last duplicate of a line.
result = sorted(x for x in lastdup.values() if x is not None)
I have a list of strings ending with numbers. I want to sort them in Python and then compress them when a range is formed.
Example input string:
ABC1/3, ABC1/1, ABC1/2, ABC2/3, ABC2/2, ABC2/1
Example output string:
ABC1/1-3, ABC2/1-3
How should I approach this problem with python?
There's no need to use a dict for this problem. You can simply parse the tokens into a list and sort it. By default Python sorts a list of lists by the individual elements of each list. After sorting the list of token pairs, you only need to iterate once and record the important indices. Try this:
# Data is a comma separated list of name/number pairs.
data = 'ABC1/3, ABC1/1, ABC1/2, ABC2/3, ABC2/2, ABC2/1'

# Split data on ', ' and split each token on '/'.
tokens = [token.split('/') for token in data.split(', ')]

# Convert each token's number to an integer.
for index in range(len(tokens)):
    tokens[index][1] = int(tokens[index][1])

# Sort pairs; Python orders lists item by item.
tokens.sort()

prev = 0      # Record index of previous pair's name.
indices = []  # List to record indices for output.
for index in range(1, len(tokens)):
    # If name matches with previous position.
    if tokens[index][0] == tokens[prev][0]:
        # Check whether number is increasing sequentially.
        if tokens[index][1] != (tokens[index - 1][1] + 1):
            # If non-sequential increase then record the indices.
            indices.append((prev, index - 1))
            prev = index
    else:
        # If name changes then record the indices.
        indices.append((prev, index - 1))
        prev = index
# After iterating the list, record the final indices.
indices.append((prev, index))

# Print the ranges.
for start, end in indices:
    if start == end:
        args = (tokens[start][0], tokens[start][1])
        print '{0}/{1},'.format(*args),
    else:
        args = (tokens[start][0], tokens[start][1], tokens[end][1])
        print '{0}/{1}-{2},'.format(*args),

# Output:
# ABC1/1-3, ABC2/1-3,
I wanted to speedhack this problem, so here is an almost complete solution for you, based on my own make_range_string and a stolen natsort.
import re
from collections import defaultdict

def sortkey_natural(s):
    return tuple(int(part) if re.match(r'[0-9]+$', part) else part
                 for part in re.split(r'([0-9]+)', s))

def natsort(collection):
    return sorted(collection, key=sortkey_natural)

def make_range_string(collection):
    collection = sorted(collection)
    parts = []
    range_start = None
    previous = None
    def push_range(range_start, previous):
        if range_start is not None:
            if previous == range_start:
                parts.append(str(previous))
            else:
                parts.append("{}-{}".format(range_start, previous))
    for i in collection:
        if previous != i - 1:
            push_range(range_start, previous)
            range_start = i
        previous = i
    push_range(range_start, previous)
    return ', '.join(parts)

def make_ranges(strings):
    components = defaultdict(list)
    for i in strings:
        main, _, number = i.partition('/')
        components[main].append(int(number))
    rvlist = []
    for key in natsort(components):
        rvlist.append((key, make_range_string(components[key])))
    return rvlist

print(make_ranges(['ABC1/3', 'ABC1/1', 'ABC1/2', 'ABC2/5', 'ABC2/2', 'ABC2/1']))
The code prints a list of tuples:
[('ABC1', '1-3'), ('ABC2', '1-2, 5')]
I would start by splitting the strings, and using the part that you want to match on as a dictionary key.
strings = ['ABC1/3', 'ABC1/1', 'ABC1/2', 'ABC2/3', 'ABC2/2', 'ABC2/1']
d = {}
for s in strings:
    a, b = s.split('/')
    d.setdefault(a, []).append(b)
That collects the number parts into a list for each prefix. Then you can sort the lists and look for adjacent numbers to join.
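A sketch of that sorting-and-joining step, assuming the dictionary d built above; this is one possible implementation, not the only one:
# Walk each prefix's sorted numbers and collapse consecutive runs.
parts = []
for prefix in sorted(d):
    nums = sorted(int(n) for n in d[prefix])
    runs = []
    start = prev = nums[0]
    for n in nums[1:]:
        if n != prev + 1:  # run broken, flush it
            runs.append((start, prev))
            start = n
        prev = n
    runs.append((start, prev))
    for s, e in runs:
        parts.append('%s/%d' % (prefix, s) if s == e else '%s/%d-%d' % (prefix, s, e))
print(', '.join(parts))  # ABC1/1-3, ABC2/1-3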
I have a nested list comprehension which has created a list of six lists of ~29,000 items. I'm trying to parse this list of final data and create six separate dictionaries from it. Right now the code is very unpythonic; I need the right statement to properly accomplish the following:
1.) Create six dictionaries from a single statement.
2.) Scale to a list of any length, i.e., not hardcoding a counter as shown.
I've run into multiple issues, and have tried the following:
1.) Using while loops.
2.) Using break statements; these break out of the innermost loop but then do not properly create the other dictionaries. Also tried break statements controlled by a binary switch.
3.) Using if/else conditions for n indices, where the indices iterate from 1 to 29,000 and then repeat.
Note: the ellipses designate code omitted for brevity.
# Parse csv files for samples, creating a dictionary of key, value pairs and multiple lists.
with open('genes_1') as f:
    cread_1 = list(csv.reader(f, delimiter = '\t'))

sample_1_values = [j for i, j in (sorted([x for x in {i: float(j)
                   for i, j in cread_1}.items()], key = lambda v: v[1]))]
sample_1_genes = [i for i, j in (sorted([x for x in {i: float(j)
                  for i, j in cread_1}.items()], key = lambda v: v[1]))]
...
# Compute row means.
mean_values = []
for i, (a, b, c, d, e, f) in enumerate(zip(sample_1_values, sample_2_values, sample_3_values, sample_4_values, sample_5_values, sample_6_values)):
    mean_values.append((a + b + c + d + e + f)/6)

# Provide proper gene names for mean values and replace original data values by corresponding means.
sample_genes_list = [i for i in sample_1_genes, sample_2_genes, sample_3_genes, sample_4_genes, sample_5_genes, sample_6_genes]
sample_final_list = [sorted(zip(sg, mean_values)) for sg in sample_genes_list]

# Create multiple dictionaries from normalized values for each dataset.
class BreakIt(Exception): pass

try:
    count = 1
    for index, items in enumerate(sample_final_list):
        sample_1_dict_normalized = {}
        for index, (genes, values) in enumerate(items):
            sample_1_dict_normalized[genes] = values
            count = count + 1
            if count == 29595:
                raise BreakIt
except BreakIt:
    pass
...
try:
    count = 1
    for index, items in enumerate(sample_final_list):
        sample_6_dict_normalized = {}
        for index, (genes, values) in enumerate(items):
            if count > 147975:
                sample_6_dict_normalized[genes] = values
            count = count + 1
            if count == 177570:
                raise BreakIt
except BreakIt:
    pass

# Pull expression values to qualify overexpressed proteins.
print 'ERG values:'
print 'Sample 1:', round(sample_1_dict_normalized.get('ERG'), 3)
print 'Sample 6:', round(sample_6_dict_normalized.get('ERG'), 3)
Your code is too long for me to give an exact answer, so I will answer very generally.
First, you are using enumerate for no reason. If you don't need both the index and the value, you probably don't need enumerate.
This part:
with open('genes.csv') as f:
    cread_1 = list(csv.reader(f, delimiter = '\t'))
sample_1_dict = {i: float(j) for i, j in cread_1}
sample_1_list = [x for x in sample_1_dict.items()]
sample_1_values_sorted = sorted(sample_1_list, key=lambda expvalues: expvalues[1])
sample_1_genes = [i for i, j in sample_1_values_sorted]
sample_1_values = [j for i, j in sample_1_values_sorted]
sample_1_graph_raw = [float(j) for i, j in cread_1]
should be (a) using a list named samples and (b) much shorter, since you don't really need to extract all this information from sample_1_dict and move it around right now. It can be something like:
samples = [None] * 6
for k in range(6):
    with open('genes.csv') as f:  # but with a filename specific to k
        cread = list(csv.reader(f, delimiter = '\t'))
        samples[k] = {i: float(j) for i, j in cread}
After that, calculating the sum and mean will be much more natural.
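For example, a sketch assuming every sample dict ends up with the same gene keys:
# Mean expression per gene across the sample dicts.
mean_values = {gene: sum(s[gene] for s in samples) / float(len(samples))
               for gene in samples[0]}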
In this part:
class BreakIt(Exception): pass

try:
    count = 1
    for index, items in enumerate(sample_final_list):
        sample_1_dict_normalized = {}
        for index, (genes, values) in enumerate(items):
            sample_1_dict_normalized[genes] = values
            count = count + 1
            if count == 29595:
                raise BreakIt
except BreakIt:
    pass
you should be (a) iterating over the samples list mentioned earlier, and (b) not using count at all, since you can iterate naturally over samples or samples[i] or something like that.
Your code has several problems. You should put your code in functions that preferably do one thing each. Then you can call a function for each sample without repeating the same code six times (I assume that is what the ellipses are hiding). Give each function a self-describing name and a docstring that explains what it does. There is quite a bit of unnecessary code; some of this might become obvious once you have it in functions. Since functions take arguments, you can pass in your 29595, for example.
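A sketch of that refactoring (the file names and helper name are illustrative, not from the original code):
import csv

def read_sample(filename):
    """Read one tab-separated gene/value file into a dict."""
    with open(filename) as f:
        return {gene: float(value) for gene, value in csv.reader(f, delimiter='\t')}

# One call per sample instead of six copies of the same block.
samples = [read_sample('genes_%d' % k) for k in range(1, 7)]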