Convert from a csv only a column to a dict - python

There are already questions along these lines, but in my situation I have the following problem:
The alias column contains dictionaries. If I use the csv reader I get strings.
I have solved this with ast.literal_eval, but it is very slow and consumes a lot of resources.
The alternative, json.loads, does not work because of the encoding.
Any ideas how to solve this?
CSV File:
id;name;partei;term;wikidata;alias
2a24b32c-8f68-4a5c-bfb4-392262e15a78;Adolf Freiherr Spies von Büllesheim;CDU;10;Q361600;{}
9aaa1167-a566-4911-ac60-ab987b6dbd6a;Adolf Herkenrath;CDU;10;Q362100;{}
c371060d-ced3-4dc6-bf0e-48acd83f8d1d;Adolf Müller;CDU;10;Q363453;{'nl': ['Adolf Muller']}
41cf84b8-a02e-42f1-a70a-c0a613e6c8ad;Adolf Müller-Emmert;SPD;10;Q363451;{'de': ['Müller-Emmert'], 'nl': ['Adolf Muller-Emmert']}
15a7fe06-8007-4ff0-9250-dc7917711b54;Adolf Roth;CDU;10;Q363697;{}
Code:
with open(PATH_CSV + 'mdb_file_2123.csv', "r", encoding="utf8") as csv8:
    csv_reader = csv.DictReader(csv8, delimiter=';')
    for row in csv_reader:
        if not (ast.literal_eval(row['alias'])):
            pass
        elif (ast.literal_eval(row['alias'])):
            known_as_list = list()
            for values in ast.literal_eval(row['alias']).values():
                for aliases in values:
                    known_as_list.append(aliases)
It works, but it is very slow.

The ast module consumes a lot of memory (see the linked discussion), so I would suggest avoiding it when all you need is to convert a simple dictionary-formatted string into a Python dictionary. Instead, we can try Python's built-in eval function to avoid the overhead of the imported module. As some discussions point out, eval is extremely dangerous when it is fed untrusted strings, for example eval('os.system("rm -rf /")'). But if we are sure that the CSV content will not carry such malicious commands, we can use eval without worrying.
with open('input.csv', encoding='utf-8') as fd:
    csv_reader = csv.DictReader(fd, delimiter=';')
    for row in csv_reader:
        # Convert dictionary in string format to python format
        row['alias'] = eval(row['alias'])
        # Filter empty dictionaries
        if not bool(row['alias']):
            continue
        known_as_list = [aliases for values in row['alias'].values() for aliases in values]
        print(known_as_list)
Output
C:\Python34\python.exe c:\so\51712444\eval_demo.py
['Adolf Muller']
['Müller-Emmert', 'Adolf Muller-Emmert']
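If you want to verify the claimed speed difference on your own machine, you can time both approaches with timeit. This is only a rough sketch; the sample string is a hypothetical value shaped like the alias column in the question:

import timeit

# Hypothetical sample value, shaped like the alias column in the question
sample = "{'de': ['Müller-Emmert'], 'nl': ['Adolf Muller-Emmert']}"

t_ast = timeit.timeit("literal_eval(s)",
                      setup="from ast import literal_eval; s = %r" % sample,
                      number=100000)
t_eval = timeit.timeit("eval(s)",
                       setup="s = %r" % sample,
                       number=100000)
print("ast.literal_eval: %.2fs  eval: %.2fs" % (t_ast, t_eval))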

You can avoid calling literal_eval three times (once is sufficient). While I was at it, I've cleaned up your code, or so I think, using a classic SO contribution (3013 upvotes!):
from ast import literal_eval
# https://stackoverflow.com/a/952952/2749397 by Alex Martelli
flatten = lambda l: [item for sublist in l for item in sublist]
...
for row in csv_reader:
    known_as_list = flatten(literal_eval(row['alias']).values())
From the excerpt of data shown by the OP, it seems possible to avoid calling literal_eval on a significant share of the rows:
...
for row in csv_reader:
    if row['alias'] != '{}':
        known_as_list = flatten(literal_eval(row['alias']).values())
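Putting the fragments together, a complete version might look like the sketch below. It assumes the same file name, delimiter and PATH_CSV variable as in the question:

import csv
from ast import literal_eval

# https://stackoverflow.com/a/952952/2749397 by Alex Martelli
flatten = lambda l: [item for sublist in l for item in sublist]

with open(PATH_CSV + 'mdb_file_2123.csv', encoding='utf8') as csv8:   # PATH_CSV as in the question
    csv_reader = csv.DictReader(csv8, delimiter=';')
    for row in csv_reader:
        if row['alias'] != '{}':    # skip rows whose alias dict is empty
            known_as_list = flatten(literal_eval(row['alias']).values())
            print(known_as_list)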

Related

How to call a single function multiple times with different parameters in python?

list = ["a","b","c"]
def type(count):
res2=list[count]
for row1 in csv_reader1: # Values imported from CSV
res2=list[count]
print(res2)
res1 = str(row1)[1:-1]
res3 = str(res1)[1:-1]
print(res3)
type(0)
type(1)
type(2)
I want to call this type function. type(0) is called, but then it exits and type(1) and type(2) are not called. I've even tried a for loop:
for i in range(0,2):
    type(i)
    i = i+1
Even this for loop doesn't work; it just calls type(0) and exits.
I've defined a list and I'm trying to iterate over that list for each value imported from the CSV.
Something like ForEach in PowerShell: for each list item, print res2 (the list element) and print each value in the CSV. That is what I'm trying to achieve. I'm new to Python. Any help would be greatly appreciated. Thanks in advance.
I assume you are creating a CSV reader something like this:
import csv
with open("myfile.csv") as f:
csvreader1 = csv.reader(f)
This reader object can only be read once and is then used up. That's why your function doesn't do anything the 2nd and 3rd times. To be able to reuse the content, use list to read the whole file into memory.
with open("myfile.csv") as f:
csv_content = list(csv.reader(f))
Alternatively, rewrite your function so that it reads the CSV each time.
letters = ["a","b","c"]

def print_data(i, filename):
    print(letters[i])
    with open(filename) as f:
        for row in csv.reader(f): # Values imported from CSV
            print(str(row)[2:-2])

print_data(0, "myfile.csv")
list = ["a","b","c"]
file = open("C:/Users/data.csv")
csv_reader = csv. reader(file)
def type(count):
res2=list[count]
for row1 in csv_reader: # Values imported from CSV
res2=list[count]
print(res2)
res1 = str(row1)[1:-1]
res3 = str(res1)[1:-1]
print(res3)
for i in range(0,2):
type(i)
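Combining the two suggestions above, a sketch of the whole script could look like this. It assumes a file named myfile.csv and that you want a full pass over the CSV for each letter:

import csv

letters = ["a", "b", "c"]

# Read the file once into a list so it can be iterated as often as needed
with open("myfile.csv") as f:
    csv_content = list(csv.reader(f))

def print_data(i):
    print(letters[i])             # the letter for this pass
    for row in csv_content:       # values imported from the CSV
        print(str(row)[2:-2])     # strip the surrounding ['...'] like the original code

for i in range(len(letters)):     # calls print_data(0), (1) and (2)
    print_data(i)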

Printing list of DictReader twice in a row produces different results

I'm using the csv module to use csv.DictReader to read in a csv file. I am a newbie to Python and the following behavior has me stumped.
EDIT: See original question afterwards.
csv = csv.DictReader(csvFile)
print(list(csv)) # prints what I would expect, a sequence of OrderedDict's
print(list(csv)) # prints an empty list...
Is list somehow mutating csv?
Original question:
def removeFooColumn(csv):
    for row in csv:
        del csv['Foo']
csv = csv.DictReader(csvFile)
print(list(csv)) # prints what I would expect, a sequence of OrderedDict's
removeFooColumn(csv)
print(list(csv)) # prints an empty list...
What is happening to the sequence in the removeFooColumn function?
csv.DictReader is a lazy iterator; it can only be consumed once. Here is a fix:
def removeFooColumn(csv):
    for row in csv:
        del row['Foo']

csv = list(csv.DictReader(csvFile))
print(csv) # prints what I would expect, a sequence of OrderedDict's
removeFooColumn(csv)
print(csv) # prints the same rows, now without the 'Foo' column
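Another option, if you want to keep the lazy reader, is to rewind the underlying file object and build a fresh DictReader for the second pass. A small sketch, assuming csvFile is a regular seekable file and a hypothetical data.csv:

import csv

with open("data.csv", newline="") as csvFile:
    reader = csv.DictReader(csvFile)
    print(list(reader))                 # first pass consumes the reader

    csvFile.seek(0)                     # rewind the file...
    reader = csv.DictReader(csvFile)    # ...and create a new reader
    print(list(reader))                 # second pass works again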

Multiple dictionary list of values assignment with a single for loop for multiple keys

I want to create a dictionary with a list of values for multiple keys with a single for loop in Python3. For me, execution time and memory footprint are of utmost importance, since the file my Python3 script is reading is rather long.
I have already tried the following simple script:
p_avg = []
p_y = []
m_avg = []
m_y = []
res_dict = {}

with open('/home/user/test', 'r') as f:
    for line in f:
        p_avg.append(float(line.split(" ")[5].split(":")[1]))
        p_y.append(float(line.split(" ")[6].split(":")[1]))
        m_avg.append(float(line.split(" ")[1].split(":")[1]))
        m_y.append(float(line.split(" ")[2].split(":")[1]))

res_dict['p_avg'] = p_avg
res_dict['p_y'] = p_y
res_dict['m_avg'] = m_avg
res_dict['m_y'] = m_y
print(res_dict)
The format of my /home/user/test file is:
n:1 m_avg:7588.39 m_y:11289.73 m_u:147.92 m_v:223.53 p_avg:9.33 p_y:7.60 p_u:26.43 p_v:24.64
n:2 m_avg:7587.60 m_y:11288.54 m_u:147.92 m_v:223.53 p_avg:9.33 p_y:7.60 p_u:26.43 p_v:24.64
n:3 m_avg:7598.56 m_y:11304.50 m_u:148.01 m_v:225.33 p_avg:9.32 p_y:7.60 p_u:26.43 p_v:24.60
.
.
.
The Python script shown above works, but first, it is too long and repetitive, and second, I am not sure how efficient it is. I was thinking of creating the same thing with list comprehensions, something like this:
(res_dict['p_avg'], res_dict['p_y']) = [(float(line.split(" ")[5].split(":")[1]), float(line.split(" ")[6].split(":")[1])) for line in f]
But for all four dictionary keys. Do you think that using list comprehensions could reduce the script's memory footprint and improve the speed of execution? What would be the right syntax for the list comprehension?
[EDIT] I have changed dict -> res_dict since it was mentioned that shadowing the built-in name is not good practice. I have also fixed a typo where p_y wasn't pointing to the right value, and added a print statement to print the resulting dictionary, as suggested by the other users.
You can make use of defaultdict. There is no need to split the line each time, and to make it more readable you can use a lambda to extract the fields for each item.
from collections import defaultdict

res = defaultdict(list)
with open('/home/user/test', 'r') as f:
    for line in f:
        items = line.split()
        extract = lambda x: x.split(':')[1]
        res['p_avg'].append(extract(items[5]))
        res['p_y'].append(extract(items[6]))
        res['m_avg'].append(extract(items[1]))
        res['m_y'].append(extract(items[2]))
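Note that, unlike the original script, this stores the values as strings. If floats are needed (as in the question), the extraction lambda could do the conversion as well, for example:

extract = lambda x: float(x.split(':')[1])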
You can initialize your dict to contain the string/list pairs and then append directly as you iterate through each line. Also, you don't want to keep calling split() on the line on every iteration; rather, call it once, save the result to a local variable, and index from that variable.
# Initialize dict to contain string key and list value pairs
dictionary = {'p_avg': [],
              'p_y': [],
              'm_avg': [],
              'm_y': []
              }

with open('/home/user/test', 'r') as f:
    for line in f:
        items = line.split()  # store line.split() so you don't split multiple times per line
        dictionary['p_avg'].append(float(items[5].split(':')[1]))
        dictionary['p_y'].append(float(items[6].split(':')[1]))  # I think you meant index 6 here
        dictionary['m_avg'].append(float(items[1].split(':')[1]))
        dictionary['m_y'].append(float(items[2].split(':')[1]))
You can just pre-define the dict keys:
d = {
    'p_avg': [],
    'p_y': [],
    'm_avg': [],
    'm_y': []
}
and then append directly to them:
with open('/home/user/test', 'r') as f:
    for line in f:
        splitted_line = line.split(" ")
        d['p_avg'].append(float(splitted_line[5].split(":")[1]))
        d['p_y'].append(float(splitted_line[6].split(":")[1]))
        d['m_avg'].append(float(splitted_line[1].split(":")[1]))
        d['m_y'].append(float(splitted_line[2].split(":")[1]))
P.S. Never use variable names that shadow built-in names like dict, list, etc. It can cause many confusing errors!
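If you would rather not hard-code the field positions at all, one more option is to parse each line into a key/value mapping first and then pick the fields you care about. A sketch, assuming every token in a line has the key:value shape shown in the sample data:

keys = ('p_avg', 'p_y', 'm_avg', 'm_y')
d = {k: [] for k in keys}

with open('/home/user/test', 'r') as f:
    for line in f:
        # Turn "n:1 m_avg:7588.39 ..." into {'n': '1', 'm_avg': '7588.39', ...}
        fields = dict(token.split(':', 1) for token in line.split())
        for k in keys:
            d[k].append(float(fields[k]))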

How to duplicate a python DictReader object?

I'm currently trying to modify a DictReader object to strip all the spaces for every cell in the csv. I have this function:
def read_the_csv(input_file):
    csv_reader = csv.DictReader(input_file)
    for row in csv_reader:
        for key, value in row.items():
            value.strip()
    return csv_reader
However, the issue with this function is that the reader it returns has already been iterated through, so I can't re-iterate over it (as I would be able to if I had just called csv.DictReader(input_file)). I want to be able to create a new object that is exactly like the DictReader (i.e., it has the fieldnames attribute too), but with all fields stripped of whitespace. Any tips on how I can accomplish this?
Two things: firstly, the reader is a lazy iterator object which is exhausted after one full run (meaning it will be empty once you return it at the end of your function!), so you have to either collect the modified rows in a list and return that list in the end or make the function a generator producing the modified rows. Secondly, str.strip() does not modify strings in-place (strings are immutable), but returns a new stripped string, so you have to rebind that new value to the old key:
def read_the_csv(input_file):
    csv_reader = csv.DictReader(input_file)
    for row in csv_reader:
        for key, value in row.items():
            row[key] = value.strip()  # reassign
        yield row
Now you can use that generator function like you did the DictReader:
reader = read_the_csv(input_file)
for row in reader:
    # process data which is already stripped
I prefer using inheritance; make a subclass of DictReader as follows:
from csv import DictReader
from collections import OrderedDict
class MyDictReader(DictReader):
    def __next__(self):
        return OrderedDict({k: v.strip()
                            for k, v in super().__next__().items()})
Usage, just as DictReader:
with open('../data/risk_level_model_5.csv') as input_file:
    for row in MyDictReader(input_file):
        print(row)

Read from file to dictionary as floats instead of strings

I'm loading and extracting data with Python, which I want to be stored in a dictionary.
I'm using csv to write and read the data, and externally it is just stored as two comma-separated columns. This works great, but when the data is initially read, it is (obviously) read as strings.
I can convert it to a dictionary with both keys and values as floats using two lines of code, but my question is whether I can load the data directly as floats into a dictionary.
My original code was:
reader = csv.reader(open('testdict.csv','rb'))
dict_read = dict((x,y) for (x,y) in reader)
Which I have changed to:
reader = csv.reader(open('testdict.csv','rb'))
read = [(float(x),float(y)) for (x,y) in reader]
dict_read = dict(read)
which loads the data in the desired way.
So, is it possible to modify the original dict_read = dict((x,y) for (x,y) in reader) line to do what the second version does?
SOLUTION:
The solution is to use the built-in map function, which applies a function to every element of an iterable:
dict_read = dict(map(float,x) for x in reader)
Try this:
dict_read = dict(map(float, x) for x in reader)
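For completeness, a Python 3 sketch might look like the following, assuming testdict.csv holds two numeric columns; note that in Python 3 the file should be opened in text mode rather than 'rb':

import csv

with open('testdict.csv', newline='') as f:
    reader = csv.reader(f)
    dict_read = dict(map(float, row) for row in reader)

print(dict_read)  # e.g. {1.0: 2.5, 3.0: 4.5} for a file containing "1,2.5" and "3,4.5"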
