I have a large text file (https://int-emb-word2vec-de-wiki.s3.eu-central-1.amazonaws.com/vectors.txt) and load it into a dictionary:
import csv
import numpy as np

word2vec = "./vectors.txt"
with open(word2vec, 'r') as f:
    file = csv.reader(f, delimiter=' ')
    model = {k: np.array(list(map(float, v))) for k, *v in file}
So I get a dictionary of the form {Word: embedding vector}.
Now I want to convert my keys from b'Word' to Word (so that I get, for example, UNK instead of b'UNK').
Does anyone know how I can remove the b'...' for every instance?
Or is it easier to first remove all the b'...' in the text file before loading it into the dictionary?
Why not just str.decode() it? The line would be
model = {k.decode(): np.array(list(map(float, v))) for k, *v in file}
It's not possible to change keys in place. You would need to add a new key with the modified value and then remove the old one, or create a new dict with a dict comprehension or the like.
Now I want to convert my keys from b'Word' to Word (so that I get, for example, UNK instead of b'UNK').
The keys you get are strings like "b'Word'" and "b'UNK'", not b'Word' and b'UNK'. Try executing print(b"Word", type(b"Word"), "b'Word'", type("b'Word'")), it might make things clearer.
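For example, a quick illustration of the difference between a bytes literal and a str that merely looks like one:

b = b"Word"        # an actual bytes object
s = "b'Word'"      # a str that happens to contain b'...'
print(b, type(b))  # b'Word' <class 'bytes'>
print(s, type(s))  # b'Word' <class 'str'>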
This should work:
import ast
import csv
import numpy as np
with open("../out/out_file.txt") as file_in:
reader = csv.reader(file_in, delimiter=" ")
words = {ast.literal_eval(word).decode(): np.array(vect, dtype=np.float64) for word, *vect in reader}
This solution also appears to be much faster.
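A quick sanity check, assuming the file contains a UNK entry (swap in any token you know is in your data):

print(words["UNK"][:5])  # first five components of the UNK embedding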
I am trying to convert a file, where every word is on a separate line, into a dictionary where the keys are the word lengths and the values are lists of the words of that length.
The first part of my code removes the newline characters from the text file, and now I am trying to group the words by length.
with open(dictionary_file, 'r') as file:
    wordlist = file.readlines()
    print([k.rstrip('\n') for k in wordlist])

    dictionary = {}
    for line in file:
        (key, val) = line.split()
        dictionary[int(key)] = val
    print(dictionary)
However, I keep getting an error that there aren't enough values to unpack, even though I'm sure I have already removed the newline characters from the original text file. At other times it only prints the words without the newlines, but they aren't organized by length. Any help would be appreciated, thanks! :)
(key, val) = line.split()
^^^^^^^^^^
ValueError: not enough values to unpack (expected 2, got 1)
I'm not sure why you're trying to use line.split(). Each line contains a single word, so split() yields one token, not two, which is why the unpacking fails. All you need is the length of the word, so you can use the len() function. Also, you can use collections.defaultdict to make this code shorter. Like this:
import collections

words = collections.defaultdict(list)
with open('test.txt') as file:
    for line in file:
        word = line.strip()
        words[len(word)].append(word)
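For example, given a test.txt containing the words cat, tree and dog on separate lines, words ends up as:

defaultdict(<class 'list'>, {3: ['cat', 'dog'], 4: ['tree']})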
Try this (note that len() has to be taken of the stripped word itself, not of the list that split() returns, and that words of the same length have to be appended to one list rather than overwriting the key):

with open(dictionary_file, 'r') as file:
    dictionary = {}
    for line in file:
        word = line.strip()
        # collect all words of the same length under one key
        dictionary.setdefault(len(word), []).append(word)
    print(dictionary)
How can I iterate a defaultdict(list) in Python so that I get the counts of each string, sorted by highest count? In my code below, I am reading a csv file into a defaultdict of columns.
I read that I can use collections.Counter here, but my poc column also has a lot of empty/null strings and those get counted as well. Is there any way to avoid that? Also, is there a way to generate JSON from the result?
import sys
import csv
import collections
from collections import defaultdict

filename = sys.argv[1]
columns = defaultdict(list)
with open(filename) as f:
    reader = csv.DictReader(f)
    for row in reader:
        for (k, v) in row.items():
            columns[k].append(v)
print(collections.Counter(columns['poc']))
This is the output I get as of now:
Counter({'': 100, '\health': 2, 'Checking records': 2, ...})
You can use the filter builtin function to remove empty strings or other "false-y" values such as None, 0 or False.
collections.Counter(filter(None, columns['poc']))
If you want to exclude empty strings but keep other false-y values, use a lambda to define the filtering criteria.
collections.Counter(filter(lambda x: x != '', columns['poc']))
Counter is a subclass of dict, so an instance can be serialised to JSON like any dictionary: json.dumps(counter)
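Putting it all together, a minimal sketch (poc_values stands in for columns['poc']):

import json
from collections import Counter

poc_values = ['', 'a', '', 'b', 'a']  # stand-in for columns['poc']
counts = Counter(filter(None, poc_values))
print(counts)              # Counter({'a': 2, 'b': 1})
print(json.dumps(counts))  # {"a": 2, "b": 1}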
I guess this can be done simply using a plain dictionary:

key = dict(collections.Counter(columns['poc']))
# remove the empty-string element, if present
key.pop('', None)
print(key)
I have a list of dictionaries like this:
[{"a0":0,"a1":1,"a2":2,"a3":3},{"a4":4,"a5":5,"a6":6},{"a7":7,"a8":8}]
I want to save it to a csv file and read it back.
import csv

A = [{"a0":0,"a1":1,"a2":2,"a3":3},{"a4":4,"a5":5,"a6":6},{"a7":7,"a8":8}]

with open("file_temp.csv","w+",newline="") as file_temp:
    file_temp_writer = csv.writer(file_temp)
    for a in A:
        temp_list = []
        for key,value in a.items():
            temp_list.append([[key],[value]])
        file_temp_writer.writerow(temp_list)
now the csv file is:
"[['a0'], [0]]","[['a1'], [1]]","[['a2'], [2]]","[['a3'], [3]]"
"[['a4'], [4]]","[['a5'], [5]]","[['a6'], [6]]"
"[['a7'], [7]]","[['a8'], [8]]"
And then to read it back:
import csv

B = []
with open("file_temp.csv","r+",newline="") as file_temp:
    file_temp_reader = csv.reader(file_temp)
    for row in file_temp_reader:
        row_dict = {}
        for i in range(len(row)):
            row[i] = row[i].strip('"')
            row_dict[row[i][0]] = row[i][1]
        B.append(row_dict)
Now if I print(B) the result is:
[{'[': '['}, {'[': '['}, {'[': '['}]
I know the problem is that when I write to the csv file, each element is saved as a string, for example "[['a0'], [0]]" instead of [['a0'], [0]]. I used strip('"') to try to solve this, but I can't get it to work.
If you really need this as a CSV file, I think your issue is where you create temp_list: you're appending a nested list each time.
Try this instead:
import csv

# use meaningful names
dictionary_list = [{"a0":0,"a1":1,"a2":2,"a3":3},{"a4":4,"a5":5,"a6":6},{"a7":7,"a8":8}]

with open("file_temp.csv","w+",newline="") as file_temp:
    file_temp_writer = csv.writer(file_temp)
    for d in dictionary_list:
        temp_list = []
        for key,value in d.items():
            # notice the difference here: instead of appending a nested list,
            # we just append the key and then the value,
            # making temp_list something like [a0, 0, a1, 1, ...]
            temp_list.append(key)
            temp_list.append(value)
        file_temp_writer.writerow(temp_list)
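Reading it back is then a matter of pairing up alternate cells. A minimal sketch, assuming the values should come back as ints (csv always gives you strings):

import csv

B = []
with open("file_temp.csv", newline="") as file_temp:
    for row in csv.reader(file_temp):
        # cells alternate key, value, key, value, ...
        B.append({row[i]: int(row[i + 1]) for i in range(0, len(row), 2)})
print(B)  # [{'a0': 0, 'a1': 1, 'a2': 2, 'a3': 3}, {'a4': 4, 'a5': 5, 'a6': 6}, {'a7': 7, 'a8': 8}]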
Saving this kind of structure is much easier with json:

import json

A = [{"a0":0,"a1":1,"a2":2,"a3":3},{"a4":4,"a5":5,"a6":6},{"a7":7,"a8":8}]
with open("file_temp.json", "w") as f:
    json.dump(A, f)

To retrieve the data again:

with open("file_temp.json", "r") as f:
    B = json.load(f)
I'm still quite new to Python and I was wondering how I would convert something that is already in key:value form in a text file into a Python dictionary.
Eg.
2:red
3:orange
5:yellow
6:green
(each key:value on a separate line)
I've looked at other posts but none of them seem to work and I know I'm doing something wrong. So far, I have:
def create_colours_dictionary(filename):
    colours_dict = {}
    file = open(filename,'r')
    contents = file.read()
    for key in contents:
        #???
    return colours_dict
The straight-forward way to do this is to use a traditional for loop, and the str.split method.
Rather than reading from a file, I'll embed the input data into the script as a multi-line string, and use str.splitlines to convert it to a list of strings, so we can loop over it, just like looping over the lines of a file.
# Use a list of strings to simulate the file
contents = '''\
2:red
3:orange
5:yellow
6:green
'''.splitlines()

colours_dict = {}
for s in contents:
    k, v = s.split(':')
    colours_dict[k] = v

print(colours_dict)
output
{'2': 'red', '3': 'orange', '5': 'yellow', '6': 'green'}
Be aware that this code will only work correctly if there are no spaces surrounding the colon. If there could be spaces (or spaces at the start or end of the line), then you can use the str.strip method to remove them.
There are a couple of ways to make this more compact.
We could use a list comprehension nested inside a dictionary comprehension:
colours_dict = {k: v for k, v in [s.split(':') for s in contents]}
But it's even more compact to use the dict constructor on a generator expression:
colours_dict = dict(s.split(':') for s in contents)
If you aren't familiar with comprehensions, please see
List Comprehensions and Dictionaries in the official tutorial.
Iterate over your file and build a dictionary.
def create_colours_dictionary(filename):
    colours_dict = {}
    with open(filename) as file:
        for line in file:
            k, v = line.rstrip().split(':')
            colours_dict[k] = v
    return colours_dict

dct = create_colours_dictionary('file.txt')
Or, if you're looking for something compact, you can use a dict comprehension over a generator expression that splits on colons.
colours_dict = {k: v for k, v in (
    line.rstrip().split(':') for line in open(filename))
}
This approach will need some modification if the colon can be surrounded by spaces; a regex split handles that, as sketched below.
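A minimal sketch of the regex variant, assuming keys and values themselves contain no colons:

import re

colours_dict = {k: v for k, v in (
    re.split(r'\s*:\s*', line.strip(), maxsplit=1) for line in open(filename))
}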
Assuming the text file has the stated key:value format and the name of the file is contained in the variable fname, you could write a function that reads the file and returns a dict, or just use a simple with statement.
A function is probably the better choice if this operation is performed in several places in your code. If it's only done once, a 2-liner will do fine.
# Example with fname being the path to the text file
def dict_from(fname):
    return dict(line.strip().split(':') for line in open(fname))

fname = '...'
# ...
d1 = dict_from(fname)

# Alternative solution
with open(fname) as fd:
    d2 = dict(line.strip().split(':') for line in fd)
Both suggested solutions use the built-in dict constructor with a generator expression that parses each line. strip removes whitespace at the start and end of each line, and split creates a (key, value) pair from it.
I have an input file that contains lines of:
key \t value1 \t value2 .....
I'd like read this file into a dictionary where key is the first token of the line and the value is the list of the values.
I think something like this would do it, but Python gives me an error that name l is not defined. How do I write a comprehension that has two levels of "for" statements like this?
f = open("input.txt")
datamap = {tokens[0]:tokens[1:] for tokens in l.split("\t") for l in enumerate(f)}
Use the csv module and insert each row into a dictionary:
import csv

with open('input.txt') as tsvfile:
    reader = csv.reader(tsvfile, delimiter='\t')
    datamap = {row[0]: row[1:] for row in reader}
This sidesteps the issue altogether.
You can put a str.split() result into a tuple to create a 'loop variable':
datamap = {row[0]: row[1:] for l in f for row in (l.strip().split("\t"),)}
Here row is bound to the one str.split() result from the tuple, effectively creating a row = l.strip().split('\t') 'assignment'.
Martijn's got you covered for improving the process, but just to directly address the issues you were seeing with your code:
First, enumerate is not doing what you think it's doing: it yields (index, line) pairs rather than lines, so splitting its output wouldn't give you your tokens anyway. You can just get rid of it.
Second, Python is trying to resolve this:
tokens[0]:tokens[1:] for tokens in l.split("\t")
before it sees what you're defining l as. You can put parentheses around the second comprehension to make it evaluate as you intended:
datamap = {tokens[0]:tokens[1:] for tokens in (l.split("\t") for l in f)}
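For example, simulating the file with io.StringIO (note that in either version you may want to strip the trailing newline before splitting, or the last value on each line will end in \n):

import io

f = io.StringIO("k1\ta\tb\nk2\tc\td\n")
datamap = {tokens[0]: tokens[1:] for tokens in (l.rstrip('\n').split("\t") for l in f)}
print(datamap)  # {'k1': ['a', 'b'], 'k2': ['c', 'd']}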