I have a text file named file.txt with some numbers like the following:
1 79 8.106E-08 2.052E-08 3.837E-08
1 80 -4.766E-09 9.003E-08 4.812E-07
1 90 4.914E-08 1.563E-07 5.193E-07
2 2 9.254E-07 5.166E-06 9.723E-06
2 3 1.366E-06 -5.184E-06 7.580E-06
2 4 2.966E-06 5.979E-07 9.702E-08
2 5 5.254E-07 0.166E-02 9.723E-06
3 23 1.366E-06 -5.184E-03 7.580E-06
3 24 3.244E-03 5.239E-04 9.002E-08
I want to build a Python dictionary, where the first number in each row is the key, the second number is always ignored, and the last three numbers are the values. But a dictionary key cannot be repeated, so when I run my code (attached at the end of the question), what I get is
'1' : ['90', '4.914E-08', '1.563E-07', '5.193E-07']
'2' : ['5', '5.254E-07', '0.166E-02', '9.723E-06']
'3' : ['24', '3.244E-03', '5.239E-04', '9.002E-08']
All the other numbers are removed, and only the last row is kept as the values. What I need is for all the numbers belonging to a key, say 1, to be appended in the dictionary. For example, what I need is:
'1' : ['8.106E-08', '2.052E-08', '3.837E-08', '-4.766E-09', '9.003E-08', '4.812E-07', '4.914E-08', '1.563E-07', '5.193E-07']
Is it possible to do this elegantly in Python? The code I have right now is the following:
diction = {}
with open("file.txt") as f:
    for line in f:
        pa = line.split()
        diction[pa[0]] = pa[1:]
with open('file.txt') as f:
    diction = {pa[0]: pa[1:] for pa in map(str.split, f)}
You can use a defaultdict.
from collections import defaultdict

data = defaultdict(list)
with open("file.txt", "r") as f:
    for line in f:
        line = line.split()
        data[line[0]].extend(line[2:])
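Run against the sample rows above (sketched here with an in-memory io.StringIO standing in for file.txt), this grouping produces one flat list per key:

```python
import io
from collections import defaultdict

# In-memory stand-in for file.txt; a real run would use open("file.txt").
sample = io.StringIO(
    "1 79 8.106E-08 2.052E-08 3.837E-08\n"
    "1 80 -4.766E-09 9.003E-08 4.812E-07\n"
    "2 2 9.254E-07 5.166E-06 9.723E-06\n"
)

data = defaultdict(list)
for line in sample:
    fields = line.split()
    data[fields[0]].extend(fields[2:])  # skip the ignored second column

print(dict(data))
```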
Try this:
from collections import defaultdict

diction = defaultdict(list)
with open("file.txt") as f:
    for line in f:
        key, _, *values = line.strip().split()
        diction[key].extend(values)
print(diction)
This is a solution for Python 3, because the statement a, *b = tuple1 is invalid in Python 2. Look at the solution by @cha0site if you are using Python 2.
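For Python 2 the same idea can be written with plain indexing and slicing instead of starred unpacking; this sketch (over in-memory lines rather than file.txt) runs on both versions:

```python
from collections import defaultdict

diction = defaultdict(list)
rows = [
    "1 79 8.106E-08 2.052E-08 3.837E-08",  # stand-in for lines read from file.txt
    "1 80 -4.766E-09 9.003E-08 4.812E-07",
]
for line in rows:
    fields = line.strip().split()
    key, values = fields[0], fields[2:]  # slicing works in Python 2 and 3
    diction[key].extend(values)

print(dict(diction))
```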
Make the value of each key in diction a list and extend that list on each iteration. As your code is written now, when you say diction[pa[0]] = pa[1:] you overwrite the value in diction[pa[0]] each time the key reappears, which explains the behavior you're seeing.
diction = {}
with open("file.txt") as f:
    for line in f:
        pa = line.split()
        try:
            diction[pa[0]].extend(pa[2:])  # pa[2:] skips the ignored second column
        except KeyError:
            diction[pa[0]] = pa[2:]
In this code each value of diction will be a list. On each iteration, if the key already exists, that list is extended with the new values from pa, giving you a list of all the values for each key.
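The same effect can be had without the try/except via dict.setdefault, which inserts an empty list the first time a key appears (a sketch over in-memory lines standing in for the file):

```python
diction = {}
rows = [
    "1 79 8.106E-08 2.052E-08 3.837E-08",  # stand-in for lines of file.txt
    "1 80 -4.766E-09 9.003E-08 4.812E-07",
]
for line in rows:
    pa = line.split()
    diction.setdefault(pa[0], []).extend(pa[2:])  # the [] is only used on first sight

print(diction)
```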
To do this in a very simple for loop:
return_dict = {}
with open('file.txt') as f:
    for item_list in map(str.split, f):
        if item_list[0] not in return_dict:
            return_dict[item_list[0]] = []
        return_dict[item_list[0]].extend(item_list[2:])  # skip the second column
print(return_dict)
Or, if you want to use defaultdict in a one-liner-ish way (a list comprehension run purely for its side effects):
from collections import defaultdict

with open('file.txt') as f:
    return_dict = defaultdict(list)
    [return_dict[item_list[0]].extend(item_list[2:]) for item_list in map(str.split, f)]
print(return_dict)
Related
How can I remove duplicate lines from a file, and also the original line that they duplicate?
Example:
Input file:
line 1 : Messi , 1
line 2 : Messi , 2
line 3 : CR7 , 2
I want the output file to be:
line 1 : CR7 , 2
I want to keep only "CR7 , 2": delete the duplicated lines and also the original line they duplicate.
The deletion depends on the first field: if two lines match on the first field, both should be deleted.
How can I do this in Python? Starting from this code, what do I need to edit?
lines_seen = set()  # holds lines already seen
outfile = open(outfilename, "w")
for line in open(infilename, "r"):
    if line not in lines_seen:  # not a duplicate
        outfile.write(line)
        lines_seen.add(line)
outfile.close()
What is the best way to do this job?
Have you tried Counter?
This works for example:
import collections
a = [1, 1, 2]
out = [k for k, v in collections.Counter(a).items() if v == 1]
print(out)
Output: [2]
Or with a longer example:
import collections
a = [1, 1, 1, 2, 4, 4, 4, 5, 3]
out = [k for k, v in collections.Counter(a).items() if v == 1]
print(out)
Output: [2, 5, 3]
EDIT:
Since you don't have a list at the beginning, there are two ways, depending on the file size: use the first for small enough files (otherwise you might run into memory problems), or the second one for large files.
Read file as list and use previous answer:
import collections

lines = [line for line in open(infilename)]
out = [k for k, v in collections.Counter(lines).items() if v == 1]
with open(outfilename, 'w') as outfile:
    for o in out:
        outfile.write(o)
The first line reads your file completely into a list. This means that really large files would be loaded into memory. If your files are too large for that, you can go ahead and use a sort of "blacklist":
Using blacklist:
lines_seen = set()  # holds lines already seen
blacklist = set()
outfile = open(outfilename, "w")
for line in open(infilename, "r"):
    if line not in lines_seen and line not in blacklist:  # not a duplicate
        lines_seen.add(line)
    else:
        lines_seen.discard(line)
        blacklist.add(line)
for l in lines_seen:
    outfile.write(l)
outfile.close()
Here you add all lines to the set and only write the set to the file at the end. The blacklist remembers every line that occurred more than once, so such lines are never written, not even once. You can't read and write in one go, because you don't know whether the same line will appear again later. If you have further information (for example, that duplicate lines always come consecutively) you could do it differently.
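An alternative that avoids the blacklist bookkeeping is two passes with Counter: count every full line first, then keep only the lines whose count is exactly one. A sketch (io.StringIO stands in for the input file; a real run would open the file and seek back, or open it twice):

```python
import io
from collections import Counter

def unique_lines(infile):
    """Keep only lines that occur exactly once, preserving their order."""
    counts = Counter(infile)   # first pass: count each full line
    infile.seek(0)             # rewind for the second pass
    return [line for line in infile if counts[line] == 1]

sample = io.StringIO("Messi , 1\nMessi , 1\nCR7 , 2\n")
print(unique_lines(sample))
```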
EDIT 2
If you want to do it depending on the first part:
firsts_seen = {}  # maps first field -> the full line it came from
blacklist = set()
outfile = open(outfilename, "w")
for line in open(infilename, "r"):
    first = line.split(',')[0]
    if first not in firsts_seen and first not in blacklist:  # not a duplicate
        firsts_seen[first] = line
    else:
        # a repeat: drop the earlier line with this first field and ban it
        firsts_seen.pop(first, None)
        blacklist.add(first)
print(len(firsts_seen))
for l in firsts_seen.values():
    outfile.write(l)
outfile.close()
P.S.: By now I have just been adding code; there might be a better way.
For example with a dict:
lines_dict = {}
for line in open(infilename, 'r'):
    key = line.split(',')[0]
    if key not in lines_dict:
        lines_dict[key] = [line]
    else:
        lines_dict[key].append(line)
with open(outfilename, 'w') as outfile:
    for key, value in lines_dict.items():
        if len(value) == 1:
            outfile.write(value[0])
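Applied to the sample input from the question (without the "line N :" prefixes, which I'm assuming are just annotations), the dict approach keeps only the CR7 line:

```python
lines = ["Messi , 1\n", "Messi , 2\n", "CR7 , 2\n"]  # stand-in for the input file

lines_dict = {}
for line in lines:
    key = line.split(',')[0]                    # group by the first field
    lines_dict.setdefault(key, []).append(line)

survivors = [v[0] for v in lines_dict.values() if len(v) == 1]
print(survivors)
```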
Given your input you can do something like this:
seen = {}  # key maps to the line it first appeared on
double_seen = set()
with open('input.txt') as f:
    for line in f:
        _, key = line.split(':')
        key = key.strip()
        if key not in seen:          # have not seen this yet?
            seen[key] = line         # then add it to the dictionary
        else:
            double_seen.add(key)     # else we have seen this more than once

# Now we can just write back to a different file
with open('output.txt', 'w') as f2:
    for key in seen:                 # iterate in insertion order
        if key not in double_seen:
            f2.write(seen[key])
Input I used:
line 1 : Messi
line 2 : Messi
line 3 : CR7
Output:
line 3 : CR7
Note this solution assumes Python 3.7+, since it relies on dictionaries preserving insertion order.
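On interpreters older than 3.7 the same behaviour can be recovered by swapping the plain dict for collections.OrderedDict (sketched here over in-memory lines instead of input.txt):

```python
from collections import OrderedDict

seen = OrderedDict()   # explicit insertion order, even before Python 3.7
double_seen = set()
lines = ["line 1 : Messi\n", "line 2 : Messi\n", "line 3 : CR7\n"]
for line in lines:
    _, key = line.split(':')
    key = key.strip()
    if key not in seen:
        seen[key] = line
    else:
        double_seen.add(key)

result = [seen[k] for k in seen if k not in double_seen]
print(result)
```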
E;Z;X;Y
I tried
dl = defaultdict(list)
for line in file:
    line = line.strip().split(';')
    for x in line:
        dl[line[0]].append(line[1:4])
dl = dict(dl)
print(votep)
It prints out too many results. I have an init that reads the file.
What do I need to edit to make it work?
The csv module could be really handy here; just use a semicolon as your delimiter, and a simple dict comprehension will suffice:
import csv

with open('filename.txt') as file:
    reader = csv.reader(file, delimiter=';')
    votep = {k: vals for k, *vals in reader}
    print(votep)
Without using csv you can just use str.split:
with open('filename.txt') as file:
    votep = {k: vals for k, *vals in (s.strip().split(';') for s in file)}
    print(votep)
Further simplified without the comprehension this would look as follows:
votep = {}
for line in file:
    key, *vals = line.strip().split(';')
    votep[key] = vals
And FYI, key, *vals = line.strip().split(';') is just multiple variable assignment coupled with iterable unpacking. The star just means: put whatever's left in the iterable into vals after assigning the first value to key.
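A quick illustration of that unpacking on one sample line:

```python
# Extended iterable unpacking: first token goes to key, the rest to vals.
key, *vals = "E;Z;X;Y".split(';')
print(key)
print(vals)
```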
If you read the file into a list object, there is a simple function to iterate over it and convert it to the dictionary you expect:
a = [
    'A;X;Y;Z',
    'B;Y;Z;X',
    'C;Y;Z;X',
    'D;Z;X;Y',
    'E;Z;X;Y',
]

def vp(a):
    dl = {}
    for i in a:
        split_keys = i.split(';')
        dl[split_keys[0]] = split_keys[1:]
    print(dl)
I'm trying to convert a text file containing DNA sequences to a dictionary in Python. The file is set up in columns.
TTT F
TCT S
TAT Y
TGT C
TTC F
import os.path

if os.path.isfile("GeneticCode_2.txt"):
    f = open('GeneticCode_2.txt', 'r')
    my_dict = eval(f.read())
Trying to get it to:
my_dict = {'TTT': 'F', 'TCT': 'S', 'TAT': 'Y'}
You can use the dict constructor using an iterable of pairs (2-tuples) and pass it the split lines of your file:
with open('GeneticCode_2.txt', 'r') as f:
    my_dict = dict(line.split() for line in f)
    # works only if every line splits into exactly 2 tokens
d = {}
with open("GeneticCode_2.txt") as infile:
    for line in infile:
        k, v = line.strip().split()
        d[k] = v
This isn't the most compact way of doing it, but it is very readable.
my_dict = dict()
for line in f.readlines():
    parts = line.strip().split()
    if len(parts) >= 2:
        my_dict[parts[0]] = parts[1]
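A dict comprehension over the split lines works as well, and the guard below also skips blank lines if the file has any (sketched over in-memory lines standing in for GeneticCode_2.txt):

```python
lines = ["TTT F\n", "TCT S\n", "TAT Y\n", "\n"]  # stand-in for the file's contents

# line.split() yields exactly [codon, amino_acid] for well-formed rows
my_dict = {k: v for k, v in (line.split() for line in lines if line.strip())}
print(my_dict)
```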
I have a file named report_data.csv that contains the following:
user,score
a,10
b,15
c,10
a,10
a,5
b,10
I am creating a dictionary from this file using this code:
import csv

with open('report_data.csv') as f:
    f.readline()  # skip over the column titles
    mydict = dict(csv.reader(f, delimiter=','))
After running this code mydict is:
mydict = {'a':5,'b':10,'c':10}
But I want it to be:
mydict = {'a':25,'b':25,'c':10}
In other words, whenever a key that already exists in mydict is encountered while reading a line of the file, the new value in mydict associated with that key should be the sum of the old value and the integer that appears on that line of the file. How can I do this?
The most straightforward way is to use defaultdict or Counter from the collections module.
from collections import Counter

summary = Counter()
with open('report_data.csv') as f:
    f.readline()  # skip the header
    for line in f:
        lbl, n = line.split(",")
        summary[lbl] += int(n)
One of the most useful features of the Counter class is the most_common() method, which plain dictionaries and defaultdict lack.
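For example, once the totals are in a Counter, most_common() lists them highest first (sketched over already-parsed rows rather than the CSV file):

```python
from collections import Counter

summary = Counter()
rows = [("a", 10), ("b", 15), ("c", 10), ("a", 10), ("a", 5), ("b", 10)]
for lbl, n in rows:       # stand-in for the parsed lines of report_data.csv
    summary[lbl] += n

print(summary.most_common())  # highest totals first
```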
This should work for you:
import csv

with open('report_data.csv') as f:
    f.readline()  # skip the header
    mydict = {}
    for line in csv.reader(f, delimiter=','):
        mydict[line[0]] = mydict.get(line[0], 0) + int(line[1])
Try this:
import csv

mydict = {}
with open('report_data.csv') as f:
    f.readline()  # skip the header
    for x1 in csv.reader(f, delimiter=','):
        if x1[0] in mydict:  # "in" avoids treating a 0 total as a missing key
            mydict[x1[0]] += int(x1[1])
        else:
            mydict[x1[0]] = int(x1[1])
print(mydict)
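A defaultdict(int) shrinks the branch to a single line, since missing keys start at 0 (sketched over already-split rows rather than csv.reader output):

```python
from collections import defaultdict

mydict = defaultdict(int)  # missing keys default to 0
rows = [["a", "10"], ["b", "15"], ["c", "10"], ["a", "10"], ["a", "5"], ["b", "10"]]
for user, score in rows:   # stand-in for csv.reader(f, delimiter=',')
    mydict[user] += int(score)

print(dict(mydict))
```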
I'm trying to find out how to get certain data from a file in the easiest way possible. I have searched all over the internet but can't find anything. I want to be able to do this:
File.txt:
data1 = 1
data2 = 2
but I want to get only data1, like so:
p = open('file.txt')
f = p.get(data1)
print(f)
Any ideas? Thanks in advance.
You can do:
with open("file.txt", "r") as f:
    for line in f:
        key, val = line.split('=')
        key = key.strip()
        val = val.strip()
        if key == 'data1':  # works even if data1 is not on the first line
            print(val)  # do something with the value
Using map:
from operator import methodcaller

with open("file.txt", "r") as f:
    for line in f:
        # strip() with no argument also removes the trailing newline
        key, val = map(methodcaller("strip"), line.split('='))
        if key == "data1":
            print(val)  # do something with the value
with open("file.txt", "r") as f:
    key, val = f.readline().split('=')
    if key.strip() == 'data1':  # assumes data1 is on the first line
        print(val.strip())  # do something with the value
If you know you only want data1, which is on the first line, you can do:
with open('file.txt', 'r') as f:
    key, val = tuple(x.strip() for x in f.readline().split('='))
The generator expression is used to remove the whitespace from each string.
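If you want something close to the p.get(...) interface from the question, you can parse the whole file into a dict once and look keys up afterwards (a sketch; in-memory lines stand in for file.txt):

```python
lines = ["data1 = 1\n", "data2 = 2\n"]   # stand-in for the file's contents

settings = {}
for line in lines:
    key, _, val = line.partition('=')    # split on the first '=' only
    settings[key.strip()] = val.strip()

print(settings.get('data1'))
```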