I'm using csv.DictReader from the csv module to read in a CSV file. I am a newbie to Python and the following behavior has me stumped.
EDIT: see the original question below.
csv = csv.DictReader(csvFile)
print(list(csv)) # prints what I would expect, a sequence of OrderedDict's
print(list(csv)) # prints an empty list...
Is list somehow mutating csv?
Original question:
def removeFooColumn(csv):
    for row in csv:
        del csv['Foo']
csv = csv.DictReader(csvFile)
print(list(csv)) # prints what I would expect, a sequence of OrderedDict's
removeFooColumn(csv)
print(list(csv)) # prints an empty list...
What is happening to the sequence in the removeFooColumn function?
A csv.DictReader is an iterator; it can only be consumed once. Here is a fix:
def removeFooColumn(csv):
    for row in csv:
        del row['Foo']

csv = list(csv.DictReader(csvFile))
print(csv) # prints what I would expect, a sequence of OrderedDict's
removeFooColumn(csv)
print(csv) # prints the same rows, now without the 'Foo' column
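To see the one-shot behavior in isolation, here is a minimal sketch (it uses io.StringIO and made-up data instead of your real file, so the names are illustrative):

import csv
import io

csvFile = io.StringIO("Foo,Bar\n1,2\n3,4\n")
reader = csv.DictReader(csvFile)
print(list(reader))  # two rows, as dicts (OrderedDicts before Python 3.8)
print(list(reader))  # [] -- the reader is already exhausted

Materializing the reader with list() once, as in the fix above, lets you iterate over the rows as many times as you like.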
JSON object (output):
[424783, [198184], [605], [644], [296], [2048], 424694, [369192], [10139],
[152532], [397538], [1420]]
<<< CODE REMOVED >>>
Desired output:
424783,198184
424783,605
424783,644
424783,296
424783,2048
424694,369192
424694,10139
424694,152532
424694,397538
424694,1420
From your data it looks like non-bracketed items should be considered as values of the first column (i.e. a key) and bracketed ones should be considered as values for the second column, using the key that precedes them. You can do this purely in a procedural fashion:
import csv
import json

src = '''[424783, [198184], [605], [644], [296], [2048],
          424694, [369192], [10139], [152532], [397538], [1420]]'''

with open('output.csv', 'w', newline='') as f:  # Python 2.x: open('output.csv', 'wb')
    writer = csv.writer(f)  # create a simple CSV writer
    current_key = None  # a container for the last seen / cached 'key'
    for element in json.loads(src):  # parse the structure and iterate over it
        if isinstance(element, list):  # if the element is a 'list'
            writer.writerow((current_key, element[0]))  # write to csv w/ cached key
        else:
            current_key = element  # cache the element as the key for following entries
This should produce an output.csv containing:
424783,198184
424783,605
424783,644
424783,296
424783,2048
424694,369192
424694,10139
424694,152532
424694,397538
424694,1420
itertools.groupby is a little challenging for Python beginners, but is very handy for walking through a collection of items and working with them in groups. In this case, we group by items that are or are not Python lists.
From each group of nested ints, we'll create one or more entries in our accumulator list.
Once the accumulator list is loaded, the code below just prints out the results, easily converted to writing to a file.
import ast
from itertools import groupby
from collections import namedtuple

# this may be JSON, but it's also an ordinary Python nested list of ints, so safely parseable using
# ast.literal_eval()
text = "[424783, [198184], [605], [644], [296], [2048], 424694, [369192], [10139], [152532], [397538], [1420]]"
items = ast.literal_eval(text)

# a namedtuple to hold each record, and a list to accumulate them
DataRow = namedtuple("DataRow", "old_id new_id")
accumulator = []

# use groupby to process the entries in groups, depending on whether the items are lists or not
key = None
for is_data, values in groupby(items, key=lambda x: isinstance(x, list)):
    if not is_data:
        # the sole value is the next record's key
        key = list(values)[0]
    else:
        # the values are the collection of lists until the next key
        accumulator.extend(DataRow(key, v[0]) for v in values)

# dump out as csv
for item in accumulator:
    print("{old_id},{new_id}".format_map(item._asdict()))
Prints:
424783,198184
424783,605
424783,644
424783,296
424783,2048
424694,369192
424694,10139
424694,152532
424694,397538
424694,1420
I think using itertools.groupby() would be a good approach, since grouping items is the main thing that needs to be done to accomplish what you want.
Here's a fairly simple way of using it:
import csv
from itertools import groupby
import json

json_src = '''[424783, [198184], [605], [644], [296], [2048],
               424694, [369192], [10139], [152532], [397538], [1420]]'''

def xyz():
    return json.loads(json_src)

def abc():
    json_processed = xyz()
    output_filename = 'y.csv'
    with open(output_filename, 'w', newline='') as csvfile:
        writer = csv.writer(csvfile)
        for is_list, items in groupby(json_processed, key=lambda v: isinstance(v, list)):
            if is_list:
                new_ids = [item[0] for item in items]
            else:
                old_id = next(items)
                continue
            for new_id in new_ids:
                writer.writerow([old_id, new_id])

abc()
Contents of csv file produced:
424783,198184
424783,605
424783,644
424783,296
424783,2048
424694,369192
424694,10139
424694,152532
424694,397538
424694,1420
There are already questions along these lines, but in my situation I have the following problem:
The alias column contains dictionaries. If I use the csv reader I get strings.
I have solved this problem with ast.literal_eval, but it is very slow and consumes a lot of resources.
The alternative json.loads does not work because of the encoding.
Any ideas how to solve this?
CSV File:
id;name;partei;term;wikidata;alias
2a24b32c-8f68-4a5c-bfb4-392262e15a78;Adolf Freiherr Spies von Büllesheim;CDU;10;Q361600;{}
9aaa1167-a566-4911-ac60-ab987b6dbd6a;Adolf Herkenrath;CDU;10;Q362100;{}
c371060d-ced3-4dc6-bf0e-48acd83f8d1d;Adolf Müller;CDU;10;Q363453;{'nl': ['Adolf Muller']}
41cf84b8-a02e-42f1-a70a-c0a613e6c8ad;Adolf Müller-Emmert;SPD;10;Q363451;{'de': ['Müller-Emmert'], 'nl': ['Adolf Muller-Emmert']}
15a7fe06-8007-4ff0-9250-dc7917711b54;Adolf Roth;CDU;10;Q363697;{}
Code:
with open(PATH_CSV+'mdb_file_2123.csv', "r", encoding="utf8") as csv8:
    csv_reader = csv.DictReader(csv8, delimiter=';')
    for row in csv_reader:
        if not (ast.literal_eval(row['alias'])):
            pass
        elif (ast.literal_eval(row['alias'])):
            known_as_list = list()
            for values in ast.literal_eval(row['alias']).values():
                for aliases in values:
                    known_as_list.append(aliases)
It's working fine, but very slowly.
The ast library consumes a lot of memory (refer to this link) and I would suggest avoiding it when converting a simple dictionary-formatted string into a Python dictionary. Instead we can try Python's built-in eval function to overcome the latency caused by the imported module. As some discussions point out, eval is extremely dangerous when dealing with untrusted strings, for example eval('os.system("rm -rf /")'). But if we are very sure that the CSV content will not carry such commands, we can make use of eval without worrying.
import csv

with open('input.csv', encoding='utf-8') as fd:
    csv_reader = csv.DictReader(fd, delimiter=';')
    for row in csv_reader:
        # Convert dictionary in string format to python format
        row['alias'] = eval(row['alias'])
        # Filter empty dictionaries
        if not bool(row['alias']):
            continue
        known_as_list = [aliases for values in row['alias'].values() for aliases in values]
        print(known_as_list)
Output
C:\Python34\python.exe c:\so\51712444\eval_demo.py
['Adolf Muller']
['Müller-Emmert', 'Adolf Muller-Emmert']
You can avoid calling literal_eval three times (once is sufficient); while I was at it I've cleaned up, or so I think, your code using an SO classic (3013 upvotes!) contribution:
from ast import literal_eval

# https://stackoverflow.com/a/952952/2749397 by Alex Martelli
flatten = lambda l: [item for sublist in l for item in sublist]
...
for row in csv_reader:
    known_as_list = flatten(literal_eval(row['alias']).values())
From the excerpt of data shown by the OP, it seems possible to avoid calling literal_eval on a significant part of the rows:
...
for row in csv_reader:
    if row['alias'] != '{}':
        known_as_list = flatten(literal_eval(row['alias']).values())
I am trying to loop through a Python dictionary to see if values that I am getting from a CSV file already exist in the dictionary. If the values do not exist I want to add them to the dictionary, then append this to a list.
I am getting the error "list indices must be integers, not str".
example input
first name last name
john smith
john smith
example output
first_name john last name smith
user_list = []
with open(input_path, 'rU') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        if row['first_name'] not in user_dictionary['first_name'] and row['last_name'] not in user_dictionary['last_name']:
            user_dictionary = {
                'first_name': row['first_name'],
                'last_name': row['last_name']
            }
            user_list.append(user_dictionary)
Currently, your code is creating a new dictionary on every iteration of the for-loop. If each value of the dictionary is a list, then you can append to that list via the key:
with open(input_path, 'rU') as csvfile:
    reader = csv.DictReader(csvfile)
    user_dictionary = {"first_name": ["name1", "name2", ...], "last_name": ["name3", "name4", ...]}
    for row in reader:
        if row['first_name'] not in user_dictionary['first_name'] and row['last_name'] not in user_dictionary['last_name']:
            user_dictionary["first_name"].append(row['first_name'])
            user_dictionary['last_name'].append(row['last_name'])
Generally, you can use a membership test (x in y) on dict.values() view to check if the value already exists in your dictionary.
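For example, a tiny illustration of that membership test (hypothetical data, not the OP's):

user_dictionary = {'first_name': 'john', 'last_name': 'smith'}
print('john' in user_dictionary.values())  # True
print('jane' in user_dictionary.values())  # False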
However, if you are trying to add all unique users from your CSV file to a list of users, that has nothing to do with dictionary values testing, but a list membership testing.
Instead of iterating over the complete list each time for a slow membership check, you can use a set that will contain "ids" of all users added to the list and enable a fast O(1) (amortized) time check:
with open(input_path, 'rU') as csvfile:
    reader = csv.DictReader(csvfile)
    user_list = []
    user_set = set()
    for row in reader:
        user_id = (row['first_name'], row['last_name'])
        if user_id not in user_set:
            user = {
                'first_name': row['first_name'],
                'last_name': row['last_name'],
                # something else ...
            }
            user_list.append(user)
            user_set.add(user_id)
The error "list indices must be integers, not str" makes the problem clear: On the line that throws the error, you have a list that you think is a dict. You try to use a string as a key for it, and boom!
You don't give enough information to guess which dict it is: It could be user_dictionary, it could be that you're using csv.reader and not csv.DictReader as you say you do. It could even be something else-- there's no telling what else you left out of your code. But it's a list that you're using as if it's a dict.
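A minimal illustration of that error (hypothetical code, not the OP's):

user_dictionary = ['john', 'smith']  # a list, not a dict
user_dictionary['first_name']        # TypeError: list indices must be integers, not str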
I am reading a text file with Python, formatted so that the values in each column may be numeric or strings.
When those values are strings, I need to assign each string a unique ID (unique across all the strings under the same column; the same ID must be assigned if the same string appears elsewhere under the same column).
What would be an efficient way to do this?
Use a defaultdict with a default value factory that generates new ids:
ids = collections.defaultdict(itertools.count().next)
ids['a'] # 0
ids['b'] # 1
ids['a'] # 0
When you look up a key in a defaultdict, if it's not already present, the defaultdict calls a user-provided default value factory to get the value and stores it before returning it.
itertools.count() creates an iterator that counts up from 0, so itertools.count().next is a bound method that produces a new integer whenever you call it.
Combined, these tools produce a dict that returns a new integer whenever you look up something you've never looked up before.
The defaultdict answer updated for Python 3, where .next is now .__next__, and for pylint compliance, where using "magic" __*__ methods is discouraged:
ids = collections.defaultdict(functools.partial(next, itertools.count()))
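Applied to the question (one id space per column), here is a minimal Python 3 sketch; the file name, the delimiter, and the "numeric means int-parseable" rule are assumptions:

import collections
import csv
import functools
import itertools

# one independent id-assigning defaultdict per column index
ids = collections.defaultdict(
    lambda: collections.defaultdict(functools.partial(next, itertools.count())))

with open('somefile.txt') as f:
    for row in csv.reader(f, delimiter=','):
        for column, value in enumerate(row):
            try:
                int(value)  # numeric values are left alone
            except ValueError:
                print(column, value, ids[column][value])  # same string, same id per column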
Create a set, and then add strings to the set. This will ensure that strings are not duplicated; then you can use enumerate to get a unique id of each string. Use this ID when you are writing the file out again.
Here I am assuming the second column is the one you want to scan for text or integers.
import csv

seen = set()
with open('somefile.txt') as f:
    reader = csv.reader(f, delimiter=',')
    for row in reader:
        try:
            int(row[1])
        except ValueError:
            seen.add(row[1])  # adds string to set

# print the unique ids for each string
for id, text in enumerate(seen):
    print("{}: {}".format(id, text))
Now you can take the same logic and replicate it across each column of your file. If you know the column count in advance, you can have a list of sets. Suppose the file has three columns:
unique_strings = [set(), set(), set()]
with open('file.txt') as f:
    reader = csv.reader(f, delimiter=',')
    for row in reader:
        for column, value in enumerate(row):
            try:
                int(value)
            except ValueError:
                # It is not an integer, so it must be
                # a string
                unique_strings[column].add(value)
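If you then need the actual string-to-id mapping per column (which is what the question asks for), one possible follow-up on top of the code above is:

# build one {string: id} mapping per column from the collected sets;
# the ids are arbitrary but consistent within this run
column_ids = [{text: id for id, text in enumerate(strings)} for strings in unique_strings]
print(column_ids)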