I have a csv with 2 columns:
a,x
a,y
a,z
b,1
b,2
b,3
b,4
c,5
c,6
c,7
c,8
I'd like to loop through the file, looking only at the first column, and keep just two entries for each distinct value in that column. I don't care which values are kept or dropped for the second column; I just want two rows for each different value in the first column.
Output would look something like this:
a,x
a,y
b,1
b,2
c,5
c,6
I'm familiar with the csv module (how to read/write/replace), but I'm having a hard time finding resources that explain how to compare one row with another. I think that's where I'm stuck on this problem.
I would use a dictionary to tackle this problem, maybe something along the lines of the following:
d = {}
rows = [['a', 'x'], ['a', 'y'], ['a', 'z'], ['b', 1], ['b', 2], ['b', 3], ['b', 4], ['c', 5], ['c', 6], ['c', 7], ['c', 8]]
for row in rows:
    if row[0] not in d:         # first time we see this key
        d[row[0]] = []
    if len(d[row[0]]) == 2:     # already have two entries for this key
        continue
    d[row[0]].append(row[1])
print(d)
Output:
>> {'a': ['x', 'y'], 'b': [1, 2], 'c': [5, 6]}
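If you then need the result back in CSV form (two rows per key) rather than as a dict, a minimal sketch with the csv module, using the d built above (the output file name is a placeholder):

import csv

with open('out.csv', 'w', newline='') as f:   # 'out.csv' is a placeholder name
    writer = csv.writer(f)
    for key, values in d.items():
        for value in values:
            writer.writerow([key, value])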
So, here's an idea, based on Jacob's (a sketch follows the steps):
Create two dicts, first and second
For each line in the CSV:
if the key is in second, skip; otherwise
if the key is not in first, put it there
if the key is in first and the value is not the line you are looking at, add the key to second
At the end you'll have two dictionaries with one value each, as you wanted.
You could generalize it to keeping N values by creating a list of dictionaries and using as many as you need.
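A minimal sketch of that idea, reusing the rows list from the answer above (first and second are the two dicts from the steps):

first, second = {}, {}
kept = []
for row in rows:
    key = row[0]
    if key in second:           # already kept two rows for this key: skip
        continue
    if key not in first:        # first row seen for this key
        first[key] = row
    elif first[key] != row:     # a second, different row: remember it
        second[key] = row
    kept.append(row)
print(kept)   # two rows per key, e.g. [['a', 'x'], ['a', 'y'], ['b', 1], ['b', 2], ...]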
Here's an example with itertools.groupby (note that groupby only groups consecutive lines, so the file must already be sorted by the first column):
import itertools

with open("test.csv", "r") as stuff:
    data = stuff.readlines()

out = []
for k, dat in itertools.groupby(data, key=lambda x: x[0]):
    # x[0] is the line's first character, which works here because
    # the keys are single characters
    twoVals = list(dat)[:2]
    out.append(twoVals)
print(out)
For cases where there are fewer than two values, nothing extra is needed: slicing never raises an IndexError, so list(dat)[:2] simply returns however many lines the group contains. The snippet above therefore already handles groups with a single entry.
I tested this out on:
a,x
a,y
a,z
b,1
b,2
b,3
b,4
c,5
c,6
c,7
c,8
z,1
I have a csv as follows:
I need the "Term"s and the "DocID"s for which the "DocFreq" is greater than 5. And I need to store it as a dictionary where the Term is the key and the "DocID"s separated by the comma make individual values for that key in a list.
For example, I need
{"Want to be with":[doc100.txt,doc8311.txt,...doc123.txt], "and has her own": [doc100.txt,doc9286.txt...doc23330.txt]....}
So far, I've got this:
df1 = df[(df['DocFreq'] > 5)][['Term','DocFreq','Ngram','DocID']]
But I can't get the format I need. Doing df.to_dict() gives me a dictionary of dictionaries that include column names and I don't want that.
Please help!!
Thank you!!
I would do something like this:
import pandas as pd

data = pd.DataFrame({'Term': ['a', 'b', 'c', 'd', 'e'],
                     'DocFreq': [1, 2, 2, 6, 7],
                     'Ngram': [4, 4, 4, 4, 4],
                     'DocId': ['11, 123, 222', '22', '33', '44, 303, doa0', '55, 9393, idid']})

filt = data[data['DocFreq'] > 5][['Term', 'DocId']]
# split the comma-separated string so each Term maps to a list of IDs
result = {row['Term']: row['DocId'].split(', ') for _, row in filt.iterrows()}
You are almost there. Just select the DocID column before calling to_dict.
You may use
dict1 = df.loc[(df['DocFreq'] > 5), ['Term','DocID']].set_index('Term')['DocID'].to_dict()
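If you also want each value to be a list of document IDs, as in the example output, rather than one comma-separated string, Series.str.split can be chained in (assuming DocID holds comma-separated strings):

dict1 = (df.loc[df['DocFreq'] > 5]
           .set_index('Term')['DocID']
           .str.split(',')    # 'doc100.txt,doc8311.txt' -> ['doc100.txt', 'doc8311.txt']
           .to_dict())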
I have a nested list that looks something like:
lst = [['ID1', 'A'],['ID1','B'],['ID2','AAA'], ['ID2','DDD']...]
Is it possible for me to split lst into small lists by their ID, so that each small list contains only elements with the same ID? The result should look something like this:
lst1 = [['ID1', 'A'], ['ID1', 'B']...]
lst2 = [['ID2', 'AAA'], ['ID2', 'DDD']...]
You can use groupby (it groups consecutive items, so sort lst by ID first if it isn't already sorted):
from itertools import groupby

grp_lists = []
for i, grp in groupby(lst, key=lambda x: x[0]):
    grp_lists.append(list(grp))
print(grp_lists[0])
[['ID1', 'A'], ['ID1', 'B']]
print(grp_lists[1])
[['ID2', 'AAA'], ['ID2', 'DDD']]
using collections.defaultdict:
from collections import defaultdict

lst = [['ID1', 'A'], ['ID1', 'B'], ['ID2', 'AAA'], ['ID2', 'DDD']]
result = defaultdict(list)
for item in lst:
    result[item[0]].append(item)
print(list(result.values()))
output:
[[['ID1', 'A'], ['ID1', 'B']], [['ID2', 'AAA'], ['ID2', 'DDD']]]
Without external functions: build a set of the unique IDs, then loop over that set, building a new list for each ID and filling it with the items whose first element matches it:
lst = [['ID1', 'A'],['ID1','B'],['ID2','AAA'], ['ID2','DDD']]
unique_set = set(elem[0] for elem in lst)
lst2 = [[elem for elem in lst if elem[0] == every_unique] for every_unique in unique_set]
print (lst2)
Result:
[[['ID2', 'AAA'], ['ID2', 'DDD']], [['ID1', 'A'], ['ID1', 'B']]]
(It is possible to move unique_set into the final line, making it a one-liner, but that would make it less clear what happens. Note also that a set has no defined order, which is why ID2 may come out before ID1.)
If you want to get separate variables like your example of a result:
lst1 = [sub_lst for sub_lst in lst if sub_lst[0] == 'ID1']
and
lst2 = [sub_lst for sub_lst in lst if sub_lst[0] == 'ID2']
from that, you can make a function:
def create_sub_list(id_str, original_lst):
    return [x for x in original_lst if x[0] == id_str]
And call it like this:
lst1 = create_sub_list('ID1', lst)
If you want a dictionary of the sub-lists, for easier access, you can use:
from functools import reduce

def reduce_dict(ret_dict, sub_lst):
    if sub_lst[0] not in ret_dict:
        ret_dict[sub_lst[0]] = sub_lst[1:]
    else:
        ret_dict[sub_lst[0]] += sub_lst[1:]
    return ret_dict

grouped_dict = reduce(reduce_dict, lst, dict())
(If you know there will only ever be one string after each ID, you could initialise with [sub_lst[1]] and append sub_lst[1] instead, but the slices shown already handle that case.)
And then to access the elements of the dictionary you use the ID strings:
print(grouped_dict['ID1'])
This will print:
['A', 'B']
I have the following dataset:
d = {
    'Company': ['A','A','A','A','B','B','B','B','C','C','C','C','D','D','D','D'],
    'Individual': [1,2,3,4,1,5,6,7,1,8,9,10,10,11,12,13]
}
Now, I need to create a list in Python of all pairs of 'Company' values that share a value in 'Individual'.
E.g. the output for the dataset above should be ((A,B),(A,C),(B,C),(C,D)). The first three tuples arise because Individual 1 is affiliated with A, B and C, and the last one because Individual 10 is affiliated with C and D.
Further Explanation -
If Individual = 1, the dataset above has the values 'A', 'B' and 'C'. I want to create all unique combinations of these three values, so the tuples (A,B), (A,C) and (B,C) should be added to the list. The next is Individual = 2, which only has the value 'A', so there is no tuple to append. The following individuals each correspond to only one company, hence no further pairs. The only other tuple to add is for Individual = 10, which has the values 'C' and 'D' and should therefore add the tuple (C,D) to the list.
One solution is to use pandas (illustrated here on a smaller sample dataset):
import pandas as pd
d = {'Company':['A','A','A','B','B','B','C','C','C'],'Individual': [1,2,3,1,4,5,3,6,7]}
df = pd.DataFrame(d).groupby('Individual')['Company'].apply(list).reset_index()
companies = df.loc[df['Company'].map(len)>1, 'Company'].tolist()
# [['A', 'B'], ['A', 'C']]
This isn't the most efficient way, but it may be intuitive.
Here is a solution to your refined question:
from collections import defaultdict
from itertools import combinations

data = {'Company': ['A','A','A','A','B','B','B','B','C','C','C','C','D','D','D','D'],
        'Individual': [1,2,3,4,1,5,6,7,1,8,9,10,10,11,12,13]}

d = defaultdict(set)
for i, j in zip(data['Individual'], data['Company']):
    d[i].add(j)     # companies each individual is affiliated with

res = {k: sorted(map(sorted, combinations(v, 2))) for k, v in d.items()}
# {1: [['A', 'B'], ['A', 'C'], ['B', 'C']],
# 2: [],
# 3: [],
# 4: [],
# 5: [],
# 6: [],
# 7: [],
# 8: [],
# 9: [],
# 10: [['C', 'D']],
# 11: [],
# 12: [],
# 13: []}
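To get the flat, de-duplicated collection of pairs the question asked for, one more line over res (a sketch) does it:

pairs = sorted({tuple(p) for pair_list in res.values() for p in pair_list})
# [('A', 'B'), ('A', 'C'), ('B', 'C'), ('C', 'D')]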
Try this:
temp = df[df.duplicated(subset=['Individual'], keep=False)]
print(temp.groupby(['Individual'])['Company'].unique())
1    [A, B]
3    [A, C]
I have a list of sublists that are made up of three items. Only the first and last item matter in the sublists, because I want to change the last item across all sublists based on the frequency of the last item across the list.
This is the list I have:
lst = [['A','abc','id1'],['A','def','id2'],['A','ghi','id1'],['A','ijk','id1'],['A','lmn','id2'],['B','abc','id3'],['B','def','id3'],['B','ghi','id3'],['B','ijk','id3'],['B','lmn','id'],['C','xyz','id6'],['C','lmn','id6'],['C','aaa','id5']]
For example, A appears most often with id1 rather than id2, so I'd like to replace every id2 that appears with A by id1. For B, id3 is the most common, so I'd like to replace anything else with id3, which means replacing 'id' with 'id3' only for B. For C, I'd like to replace the instance of 'id5' with 'id6', because 'id6' appears the most with C.
Desired_List = [['A','abc','id1'],['A','def','id1'],['A','ghi','id1'],['A','ijk','id1'],['A','lmn','id1'],['B','abc','id3'],['B','def','id3'],['B','ghi','id3'],['B','ijk','id3'],['B','lmn','id3'],['C','xyz','id6'],['C','lmn','id6'],['C','aaa','id6']]
I should also mention that this is going to be done on a very large list, so speed and efficiency are needed.
Processing the data directly according to your requirement above, I can come up with the following algorithm.
First sweep: collect frequency information for every key (i.e. 'A', 'B', 'C'):
def generate_frequency_table(lst):
    assoc = {}  # e.g. 'A': {'id1': 3, 'id2': 2}
    for key, unused, val in lst:
        freqs = assoc.get(key, None)
        if freqs is None:
            freqs = {}
            assoc[key] = freqs
        valfreq = freqs.get(val, None)
        if valfreq is None:
            freqs[val] = 1
        else:
            freqs[val] = valfreq + 1
    return assoc
>>> generate_frequency_table(lst)
{'A': {'id2': 2, 'id1': 3}, 'C': {'id6': 2, 'id5': 1}, 'B': {'id3': 4, 'id': 1}}
Then, see what 'value' is associated with each key (i.e. {'A': 'id1'}):
def generate_max_assoc(assoc):
    best = {}  # e.g. {'A': 'id1'}
    for key, freqs in assoc.items():
        curmax = ('', 0)
        for val, freq in freqs.items():
            if freq > curmax[1]:
                curmax = (val, freq)
        best[key] = curmax[0]
    return best
>>> maxtable = generate_max_assoc(generate_frequency_table(lst))
>>> print(maxtable)
{'A': 'id1', 'C': 'id6', 'B': 'id3'}
Finally, iterate through the original list and replace values using the table above:
>>> newlst = [[key, unused, maxtable[key]] for key, unused, val in lst]
>>> print(newlst)
[['A', 'abc', 'id1'], ['A', 'def', 'id1'], ['A', 'ghi', 'id1'], ['A', 'ijk', 'id1'], ['A', 'lmn', 'id1'], ['B', 'abc', 'id3'], ['B', 'def', 'id3'], ['B', 'ghi', 'id3'], ['B', 'ijk', 'id3'], ['B', 'lmn', 'id3'], ['C', 'xyz', 'id6'], ['C', 'lmn', 'id6'], ['C', 'aaa', 'id6']]
This is pretty much the same solution as supplied by Santa, but I've combined a few steps into one, as we can scan for the maximum value while we are collecting the frequencies:
def fix_by_frequency(triple_list):
    freq = {}
    for key, _, value in triple_list:
        # Get existing data
        data = freq[key] = \
            freq.get(key, {'max_value': value, 'max_count': 1, 'counts': {}})
        # Increment the count
        count = data['counts'][value] = data['counts'].get(value, 0) + 1
        # Update the most frequently seen
        if count > data['max_count']:
            data['max_value'], data['max_count'] = value, count
    # Use the maximums to map the list
    return [[key, mid, freq[key]['max_value']] for key, mid, _ in triple_list]
This has been optimised a bit for readability (I think, be nice!) rather than raw speed. For example, you might avoid writing back to the dict when you don't need to, or maintain a separate max dict to avoid two key lookups in the list comprehension at the end.
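For comparison, a shorter sketch of the same two-pass idea built on collections.Counter (an alternative, not taken from either answer above):

from collections import Counter

def fix_by_frequency_counter(triple_list):
    # Count every (key, value) pairing in one pass
    counts = Counter((key, value) for key, _, value in triple_list)
    # Keep the most frequent value per key
    best = {}
    for (key, value), n in counts.items():
        if n > best.get(key, (None, 0))[1]:
            best[key] = (value, n)
    # Rewrite the list using the winners
    return [[key, mid, best[key][0]] for key, mid, _ in triple_list]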
For example:
list1 = ['c', 'b', 'a']
list2 = [3, 2, 1]
list3 = ['11', '10', '01']
table = [list1, list2, list3]
I'd like to sort with respect to the first column (list1), but I'd like the final ordering to still preserve the lines (so after sorting I'd still have a line 'b', 2, '10'). In this example I could just sort each list individually but with my data I can't just do that. What's the pythonic approach?
One quick way would be to use zip (in Python 3 zip returns an iterator, so build lists explicitly):
>>> from operator import itemgetter
>>> transpose = sorted(zip(*table), key=itemgetter(0))
>>> table = list(zip(*transpose))
>>> table
[('a', 'b', 'c'), (1, 2, 3), ('01', '10', '11')]
# Get a list of indexes (js), sorted by the values in list1.
js = [t[1] for t in sorted((v,i) for i,v in enumerate(list1))]
# Use those indexes to build your new table.
sorted_table = [[row[j] for j in js] for row in table]
This works because Python sorts tuples lexicographically: the (value, index) pairs are ordered by value first, so js ends up holding the column order that sorts list1.
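A quick check with the example table from the question (js comes out as [2, 1, 0], since 'a' sorts first but sits last in list1):

list1 = ['c', 'b', 'a']
list2 = [3, 2, 1]
list3 = ['11', '10', '01']
table = [list1, list2, list3]

js = [t[1] for t in sorted((v, i) for i, v in enumerate(list1))]
sorted_table = [[row[j] for j in js] for row in table]
print(sorted_table)   # [['a', 'b', 'c'], [1, 2, 3], ['01', '10', '11']]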