I am very very new to python and I have been playing around to write a script using two files. File 1 contains a number of ID numbers such as:
1000012
1000015
1000046
1000047
1000050
1000072
1000076
100008
1000102
100013
The other file has a few lines of single ID numbers, followed by lines made of one ID number followed by other ID numbers which have a + or - at the end:
951450
8951670
8951800
8951863
8951889
9040311
9255087 147+ 206041- 8852164- 4458078- 1424812- 3631438- 8603144+ 4908786- 4780663+ 4643406+ 3061176- 7523696- 5876052- 163881- 6234800- 395660-
9255088 149+ 7735585+ 6359867+ 620034- 4522360- 2810885- 3705265+ 5966368- 7021344+ 9165926- 2477382+ 4015358- 2497281+ 9166415+ 6837601-
9255089 217+ 6544241+ 5181434+ 4625589+ 7433598+ 7295233+ 3938917+ 4109401+ 2135539+ 4960823+ 1838531+ 1959852+ 5698864+ 1925066+ 8212560+ 3056544+ 82N 1751642+ 4772695+ 2396528+ 2673866+ 2963754+ 5087444+ 977167+ 2892617- 7412278- 6920479- 2539680- 4315259- 8899799- 733101- 5281901- 7055760+ 8508290+ 8559218+ 7985985+ 6391093+ 2483783+ 8939632+ 3373919- 924346+ 1618865- 8670617+ 515619+ 5371996+ 2152211+ 6337329+ 284813+ 8512064+ 3469059+ 3405322+ 1415471- 1536881- 8034033+ 4592921+ 4226887- 6578783-
I want to build a dictionary using these two files. My script has to search inside File 2 for the ID numbers in File 1 and append those lines as values in which the key is the number in File 1. Therefore there may be more than one value for each key. I only want to search the lines in File 2 that have more than one number (if len(x) > 1).
the output would be something like: 1000047: 9292540 1000047+ 9126889+ 3490727- 8991434+ 4296324+ 9193432- 3766395+ 9193431+ 8949379- (I need to print each ID number in File1 as the key and as its value, the chunk of lines that contain that ID number as a whole)
Here is my -very wrong- script:
#!/usr/bin/python
# NOTE(review): indentation was lost in the paste; the comments below flag the logic bugs.
f = open('file1')
z = open('file2')
d = dict() # d is an empty dictionary
for l in f:
p = l.rstrip()
d[p] = list() # sets the keys in the dictionary as p (IDs with newline characters stripped)
y = z.readlines() # NOTE(review): readlines() returns a list of lines, not a string -- and inside the file1 loop it is empty after the first pass
s = "".join(y) # makes a single string from the list of lines
x = str.split(s) # NOTE(review): splits the WHOLE file at whitespace, so len(x) > 1 below is not a per-line test
if len(x) > 1: # only the lines that include contigs IDs that were used to make another contig
for lines in y:
k = lines.rstrip()
w = tuple(x) # convert list x into a tuple called w
for i in w:
if i[:-1] in d:
d[p].append(k) # NOTE(review): p is still the LAST file1 ID here -- this should append under the matched key, d[i[:-1]]
print d
Try:
#!/usr/bin/python
f = open('file1')
z = open('file2')
d = dict() # d is an empty dictionary
for l in f:
p = l.rstrip()
d[p] = list() # Change #1
f.close()
# Now we have a dictinary with the keys from file1 and empty lists as values
for line in z:
items = item.split() # items will be a list from 1 line
if len(items) > 1: # more than initial item in the list
k = items[0] # First is the key line
for i in items[1:]: # rest of items
if d.haskey(i[:-1]): # is it in the dict
d[i].append(k) # Add the k value
z.close()
print d
N.B. This is untested code but shouldn't be too far off.
Is this what you are looking for ?? (I have not tested it ...)
#!/usr/bin/python
# For every ID line in file1, scan all of file2 for lines containing it.
f = open('file1')
z = open('file2')
d = dict()  # d is an empty dictionary
for l in f.readlines():
    for l2 in z.readlines():
        if l.rstrip() in l2.rstrip():
            d[l] = l2
    # Rewind file2 inside the loop -- otherwise readlines() returns [] on
    # every iteration after the first, and only the first ID is searched.
    z.seek(0, 0)
f.close()
z.close()
Here is a simpler version the same code, if you don't want to deal with the file pointer
f = open("file1")
z = open("file2")
d = dict() # d is an empty dictionary
file1_lines = f.readlines()
file2_lines = z.readlines()
for l in file1_lines:
for l2 in file2_lines:
if l.rstrip() in l2.rstrip():
d[l] = l2
print d
f.close()
z.close()
Related
I started out with a 4d list, something like
tokens = [[[["a"], ["b"], ["c"]], [["d"]]], [[["e"], ["f"], ["g"]],[["h"], ["i"], ["j"], ["k"], ["l"]]]]
So I converted this to a csv file using the code
import csv
def export_to_csv(tokens):
    """Flatten a 4-d token list into TEST.csv rows of (A, B, C, word).

    Each data row holds the three outer indices followed by the innermost
    single-word list (written as its repr, e.g. "['a']").
    """
    csv_list = [["A", "B", "C", "word"]]  # fixed: bare `word` was an undefined name
    for h_index, h in enumerate(tokens):
        for i_index, i in enumerate(h):
            for j_index, j in enumerate(i):
                csv_list.append([h_index, i_index, j_index, j])
    # newline='' is required by the csv module to avoid blank rows on Windows.
    with open('TEST.csv', 'w', newline='') as f:
        # using csv.writer method from CSV package
        write = csv.writer(f)
        write.writerows(csv_list)
But now I want to do the reverse process, want to convert a csv file obtained in this format, back to the list format mentioned above.
Assuming you wanted your csv file to look something like this (there were a couple typos in the posted code):
A,B,C,word
0,0,0,a
0,0,1,b
0,0,2,c
...
here's one solution:
import csv
def import_from_csv(filename):
retval = []
with open(filename) as fh:
reader = csv.reader(fh)
# discard header row
next(reader)
# process data rows
for (x,y,z,word) in reader:
x = int(x)
y = int(y)
z = int(z)
retval.extend([[[]]] * (x + 1 - len(retval)))
retval[x].extend([[]] * (y + 1 - len(retval[x])))
retval[x][y].extend([0] * (z + 1 - len(retval[x][y])))
retval[x][y][z] = [word]
return retval
def import_from_csv(file):
    """Rebuild the 4-d token list from a CSV written by export_to_csv."""
    import ast
    import csv
    data = []
    # Read the CSV file
    with open(file) as fp:
        reader = csv.reader(fp)
        # Skip the first line, which contains the headers
        next(reader)
        for line in reader:
            # Read the first 3 elements of the line
            a, b, c = [int(i) for i in line[:3]]
            # When we read it back, everything comes in as strings.  Use
            # `literal_eval` to convert the stored repr (e.g. "['a']") back
            # to a Python list.
            value = ast.literal_eval(line[3])
            # Extend each level until the target index exists.  `while` loops
            # (instead of a single conditional append) also accommodate
            # indices that jump by more than one, which previously raised
            # IndexError.
            while len(data) < a + 1:
                data.append([[[]]])
            while len(data[a]) < b + 1:
                data[a].append([[]])
            while len(data[a][b]) < c + 1:
                data[a][b].append([])
            data[a][b][c] = value
    return data
# Test
assert import_from_csv("TEST.csv") == tokens  # round-trip: reading back what export_to_csv wrote must reproduce the original nested list
First, I'd make writing this construction in a CSV format independent from dimensions:
import csv
def deep_iter(seq):
    """Yield one tuple per leaf: every index on the path, then the value."""
    for position, element in enumerate(seq):
        if type(element) is list:
            # Recurse, prefixing this level's index onto each deeper path.
            for tail in deep_iter(element):
                yield position, *tail
        else:
            yield position, element
# Write one CSV row per leaf of `tokens`: the index path, then the value.
with open('TEST.csv', 'w') as f:
    writer = csv.writer(f)
    writer.writerows(deep_iter(tokens))
Next, we can use the lexicographic order of the indices to recreate the structure. All we have to do is sequentially move deeper into the output list according to the indices of a word. We stop at the penultimate index to get the last list, because the last index is pointing only at the place of the word in this list and doesn't matter due to the natural ordering:
with open('TEST.csv', 'r') as f:
    rows = [*csv.reader(f)]
res = []
for row in rows:
    path = row[:-2]  # skip the last index and the word itself
    node = res
    while path:
        step = int(path.pop(0))  # next part of the current index path
        if step < len(node):
            node = node[step]  # descend into an existing branch
        else:
            node.append([])  # first visit at this level: open a new branch
            node = node[-1]
    node.append(row[-1])  # the word lands in the deepest list
I have a file from which I want to create a list ("timestep") from the numbers which appear after each line "ITEM: TIMESTEP" so:
timestep = [253400, 253500, .. etc]
Here is the sample of the file I have:
ITEM: TIMESTEP
253400
ITEM: NUMBER OF ATOMS
378
ITEM: BOX BOUNDS pp pp pp
-2.6943709180241954e-01 5.6240920636804063e+01
-2.8194230631882372e-01 5.8851195163321044e+01
-2.7398090193568775e-01 5.7189372326936599e+01
ITEM: ATOMS id type q x y z
16865 3 0 28.8028 1.81293 26.876
16866 2 0 27.6753 2.22199 27.8362
16867 2 0 26.8715 1.04115 28.4178
16868 2 0 25.7503 1.42602 29.4002
16869 2 0 24.8716 0.25569 29.8897
16870 3 0 23.7129 0.593415 30.8357
16871 3 0 11.9253 -0.270359 31.7252
ITEM: TIMESTEP
253500
ITEM: NUMBER OF ATOMS
378
ITEM: BOX BOUNDS pp pp pp
-2.6943709180241954e-01 5.6240920636804063e+01
-2.8194230631882372e-01 5.8851195163321044e+01
-2.7398090193568775e-01 5.7189372326936599e+01
ITEM: ATOMS id type q x y z
16865 3 0 28.8028 1.81293 26.876
16866 2 0 27.6753 2.22199 27.8362
16867 2 0 26.8715 1.04115 28.4178
16868 2 0 25.7503 1.42602 29.4002
16869 2 0 24.8716 0.25569 29.8897
16870 3 0 23.7129 0.593415 30.8357
16871 3 0 11.9253 -0.270359 31.7252
To do this I tried to use "startswith" and "next" commands at once and it didn't work. Is there other way to do it? I send also the code I'm trying to use for that:
timestep = []
with open(file, 'r') as f:
lines = f.readlines()
for line in lines:
line = line.split() # NOTE(review): split() turns the line into ['ITEM:', 'TIMESTEP', ...], so the startswith test below can never match the full header string
if line[0].startswith("ITEM: TIMESTEP"):
timestep.append(next(line)) # NOTE(review): next() needs an iterator, not a list -- calling it on a list raises TypeError
print(timestep)
The logic is to decide whether to append the current line to timestep or not. So, what you need is a variable which tells you append the current line when that variable is TRUE.
timestep = []
collecting = False  # decision variable: true while inside a TIMESTEP section
with open(file, 'r') as f:
    lines = f.readlines()
for raw in lines:
    entry = raw.strip()  # remove "\n" from the line
    if entry.startswith("ITEM"):
        # A header switches collection on only for the TIMESTEP section.
        collecting = (entry == 'ITEM: TIMESTEP')
    elif collecting:
        # Non-header line inside a TIMESTEP section: keep it.
        timestep.append(entry)
print(timestep)
output:
['253400', '253500']
First - I don't like this, because it doesn't scale. You can only get the first immediately following line nicely, anything else will be just ugh...
But you asked, so ... for x in lines will create an iterator over lines and use that to keep the position. You don't have access to that iterator, so next will not be the next element you're expecting. But you can make your own iterator and use that:
# Own the iterator explicitly so next() advances the same position the loop uses.
lines_iter = iter(lines)
for line in lines_iter:
    # whatever was here
    timestep.append(next(lines_iter))  # fixed: was `line_iter`, an undefined name
However, if you ever want to scale it... for is not a good way to iterate over a file like this. You want to know what is in the next/previous line. I would suggest using while:
timestep = []
with open('example.txt', 'r') as f:
    lines = f.readlines()
i = 0
while i < len(lines):
    if lines[i].startswith("ITEM: TIMESTEP"):  # fixed: was `line[i]` (NameError)
        i += 1
        # Collect every data line until the next "ITEM: " header; the extra
        # bound check avoids an IndexError when the section ends the file.
        while i < len(lines) and not lines[i].startswith("ITEM: "):
            timestep.append(lines[i].strip())  # fixed: was `next(line)` (TypeError)
            i += 1
    else:
        i += 1
This way you can extend it for different types of ITEMS of variable length.
So the problem with your code is subtle. You have a list lines which you iterate over, but you can't call next on a list.
Instead, turn it into an explicit iterator and you should be fine
timestep = []
with open(file, 'r') as f:
    lines = f.readlines()
it = iter(lines)  # an explicit iterator, so next() below shares the loop's position
for raw in it:
    header = raw.strip()  # removes the newline
    if header.startswith("ITEM: TIMESTEP"):
        # The None default prevents StopIteration when the header happens
        # to be the very last line of the file.
        timestep.append(next(it, None))
print(timestep)
I'm also not sure why you included line.split, which seems to be incorrect (in any case line.split()[0].startswith('ITEM: TIMESTEP') can never be true, since the split will separate ITEM: and TIMESTEP into separate elements of the resulting list.)
For a more robust answer, consider grouping your data based on when the line begins with ITEM.
def process_file(f):
    """Lazily group lines into (title, values) pairs, one per "ITEM: " header.

    `f` is any iterable of lines.  Values accumulated before the first header
    would be reported under the placeholder title '(none)'.
    """
    marker = 'ITEM: '
    title = '(none)'
    collected = []
    for line in f:
        if not line.startswith(marker):
            collected.append(line.strip())
            continue
        # New header: flush the previous group first, if it had any values.
        if collected:
            yield (title, collected)
        title = line[len(marker):].strip()  # strip off the marker
        collected = []
    if collected:
        yield (title, collected)
This will let you pass in the whole file and will lazily produce a set of values for each ITEM: <whatever> group. Then you can aggregate in some reasonable way.
with open(file, 'r') as f:
    aggregations = {}
    # Consume the lazy generator while the file is still open.
    for name, values in process_file(f):
        aggregations.setdefault(name, []).extend(values)
print(aggregations['TIMESTEP'])  # this is what you want
You can use enumerate to help with index referencing. We can check to see if the string ITEM: TIMESTEP is in the previous line then add the integer to our timestep list.
timestep = []
with open('example.txt', 'r') as f:
    lines = f.readlines()
for i, line in enumerate(lines):
    # Guard i > 0: for the first line, lines[i - 1] is lines[-1], which
    # wraps around to the LAST line and could produce a bogus match.
    if i > 0 and "ITEM: TIMESTEP" in lines[i - 1]:
        timestep.append(int(line.strip()))
print(timestep)
import csv
f = open("savewl_ssj500k22_Minfreq1-lowercaseWords_1.csv", "r")
csvF = csv.reader(f, delimiter="\t") # NOTE(review): csvF is never used -- the loop below reads raw lines from f instead
s = 0
sez = []
sezB = []
for q in f:
s = s + 1
if s > 3:
l = q.split(",")
x = l[1]
y = l[0]
sezB.append(y)
sezB.append(int(x))
sez.append(sezB) # NOTE(review): the SAME ever-growing sezB list is appended on every iteration -- this quadratic growth is the source of the MemoryError
print(sez)
f.close()
How can I get this to work so that all rows from the .csv file are saved in the list sez?
from this code I get: MemoryError
in file is 77214 lines of something like this : je,17031
Every loop you are appending sezB which is growing by itself.
so memory usage apparently grows as O(number of lines ^ 2).
This is something like this pattern (just for the explanation):
[[1,2], [1,2,3,4], [1,2,3,4,5,6], .....]
I guess you wanted to reset sezB to [] every loop.
Your code can be simplified to
import csv
sez = []
sezB = []
with open("savewl_ssj500k22_Minfreq1-lowercaseWords_1.csv", "r") as f:
    csvF = csv.reader(f, delimiter="\t")  # set up but unused -- the loop reads f directly, as in the question
    for s, q in enumerate(f, start=1):
        if s > 3:  # skip the first three lines
            x, y = q.split(",")[:2]
            # sezB keeps growing and is re-appended each time -- the
            # question's bug, deliberately preserved for the explanation.
            sezB.extend([x, int(y)])
            sez.append(sezB)
print(sez)
As you can see, you constantly add 2 more element to the sezB list, which is not that much, but you also keep adding the resulting sezB list to the sez list.
So since the file has 77214 lines, sez will need to hold about 6 billion (5,962,079,010) strings, which is way too many to be stored into memory...
I have a txt file which has "strings" like this
5.0125,511.2,5.12.3,4.51212,45.412,54111.5142 \n
4.23,1.2,2.6,2.3,1.2,1.554 \n
How can I assign each column to a separate list of floats, please? I have been spending a few hours on that, but I am lost.
Expected results
list 1 = [5.0125, 4.23]
list 2 = [511.2, 1.2 ]
Update: adding my trial :
for line in f:
lis = [float(line.split()[0]) for line in f] # NOTE(review): reuses the name `line` and consumes the REST of f inside the outer loop's first iteration
print("lis is ", list) # NOTE(review): prints the builtin `list` type, not the variable `lis`
tmp = line.strip().split(",")
values = [float(v) for v in tmp]
points4d = np.array(values).reshape(-1,11) #11 is number of elements in the line
print("points4d", points4d)
for i in points4d:
points3d_first_cluster = points4d[:, :3] # HARD CODED PART -- NOTE(review): the loop variable i is never used; each iteration recomputes the same slices of the whole array
points3d_second_cluster = points4d[:, 3:6]
points3d_third_cluster = points4d[:, 6:9]
#print("x_values_first_cluster",x_values_first_cluster)
print("points3d first cluster ",points3d_first_cluster)
print("points3d second cluster", points3d_second_cluster)
print("points for third cluster", points3d_third_cluster)
lists = []
with open ("text.txt", "r") as file:
for lines in file.readlines():
lista = [s for s in lines.split(',')]
lista.pop(-1) # drops the final comma-separated field of each line
lists.append(lista)
final_list = []
for x in range (len(lists[0])):
i = x+1
print("list {}".format(i))
globals()['list%s' % i] = [lists[0][x],lists[1][x]] # NOTE(review): injecting variables via globals() is fragile -- assumes the file has exactly two lines; a list of lists (as below) is the safer pattern
print(globals()['list%s' % i])
print(list1)
Output :
list 1
['5.0125', '4.23']
list 2
['511.2', '1.2']
list 3
['5.12.3', '2.6']
list 4
['4.51212', '2.3']
list 5
['45.412', '1.2']
['5.0125', '4.23'] # Output of print(list1)
this should work :
rows = []
with open("text.txt", "r") as handle:
    for raw in handle.readlines():
        fields = raw.split(',')
        fields.pop(-1)  # discard the trailing field on each line
        rows.append(fields)
final_list = []
# One output list per column, pairing the value from each of the two rows.
for col in range(len(rows[0])):
    column_pair = [rows[0][col], rows[1][col]]
    final_list.append(column_pair)
    print(column_pair)
This program is to take the grammar rules found in Binary.text and store them into a dictionary, where the rules are:
N = N D
N = D
D = 0
D = 1
but the current code returns D: D = 1, N:N = D, whereas I want N: N D, N: D, D:0, D:1
import sys
import string
#default length of 3
stringLength = 3
#get last argument of command line(file)
filename1 = sys.argv[-1] # NOTE(review): filename1 is never used again; the interactive prompt below overrides it
#get a length from user
try:
stringLength = int(input('Length? '))
filename = input('Filename: ')
except ValueError:
print("Not a number")
#checks
print(stringLength)
print(filename) # NOTE(review): if the Length prompt raised ValueError, filename was never assigned and this raises NameError
def str2dict(filename="Binary.txt"):
result = {}
with open(filename, "r") as grammar:
#read file
lines = grammar.readlines()
count = 0 # NOTE(review): count is never used
#loop through
for line in lines:
print(line)
result[line[0]] = line # NOTE(review): a plain dict keeps only ONE value per key, so "D = 1" overwrites "D = 0" -- the answers below fix this with a defaultdict(list)
print (result)
return result
print (str2dict("Binary.txt"))
Firstly, your data structure of choice is wrong. Dictionary in python is a simple key-to-value mapping. What you'd like is a map from a key to multiple values. For that you'll need:
from collections import defaultdict
result = defaultdict(list) # looking up a missing key auto-creates an empty list, so append works immediately
Next, where are you splitting on '='? You'll need to do that in order to get the proper key/value pair you are looking for. You'll need
key, value = line.split('=', 1) # split returns a list; maxsplit=1 keeps any later '=' inside the value, and the list unpacks into 2 variables
Putting the above two together, you'd go about in the following way:
result = defaultdict(list)
with open(filename, "r") as grammar:
#read file
lines = grammar.readlines()
count = 0 # NOTE(review): count is unused
#loop through
for line in lines:
print(line)
key, value = line.split('=', 1)
result[key.strip()].append(value.strip()) # one key can now collect several right-hand sides
return result # NOTE(review): `return` is only valid inside a function -- this fragment is meant to replace the body of str2dict
Dictionaries, by definition, cannot have duplicate keys. Therefore there can only ever be a single 'D' key. You could, however, store a list of values at that key if you'd like. Ex:
from collections import defaultdict
# rest of your code...
result = defaultdict(list) # Use defaultdict so that an insert to an empty key creates a new list automatically
with open(filename, "r") as grammar:
#read file
lines = grammar.readlines()
count = 0 # NOTE(review): count is unused
#loop through
for line in lines:
print(line)
result[line[0]].append(line) # keyed on the first character of the rule; whole lines accumulate per symbol
print (result)
return result # NOTE(review): `return` is only valid inside a function -- this fragment belongs in the body of str2dict
This will result in something like:
{"D" : ["D = N D", "D = 0", "D = 1"], "N" : ["N = D"]}