I'm using csv.DictReader from the csv module to read in a CSV file. I am a newbie to Python and the following behavior has me stumped.
EDIT: see the original question below.
csv = csv.DictReader(csvFile)
print(list(csv)) # prints what I would expect, a sequence of OrderedDict's
print(list(csv)) # prints an empty list...
Is list somehow mutating csv?
Original question:
def removeFooColumn(csv):
    for row in csv:
        del csv['Foo']
csv = csv.DictReader(csvFile)
print(list(csv)) # prints what I would expect, a sequence of OrderedDict's
removeFooColumn(csv)
print(list(csv)) # prints an empty list...
What is happening to the sequence in the removeFooColumn function?
A csv.DictReader is an iterator; it can only be consumed once. Here is a fix:
def removeFooColumn(csv):
    for row in csv:
        del row['Foo']

csv = list(csv.DictReader(csvFile))
print(csv) # prints what I would expect, a sequence of OrderedDict's
removeFooColumn(csv)
print(csv) # prints the same rows, now without the 'Foo' column
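To see the one-shot behavior in isolation, here is a minimal sketch (it uses io.StringIO and made-up data instead of your real file, so the names are illustrative):

import csv
import io

csvFile = io.StringIO("Foo,Bar\n1,2\n3,4\n")
reader = csv.DictReader(csvFile)
print(list(reader))  # two rows, as dicts (OrderedDicts before Python 3.8)
print(list(reader))  # [] -- the reader is already exhausted

Materializing the reader with list() once, as in the fix above, lets you iterate over the rows as many times as you like.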
JSON object (output):
[424783, [198184], [605], [644], [296], [2048], 424694, [369192], [10139],
[152532], [397538], [1420]]
<<< CODE REMOVED >>>
Desired output:
424783,198184
424783,605
424783,644
424783,296
424783,2048
424694,369192
424694,10139
424694,152532
424694,397538
424694,1420
From your data it looks like non-bracketed items should be considered as values of the first column (i.e. a key) and bracketed ones should be considered as values for the second column, using the key that precedes them. You can do this purely in a procedural fashion:
import csv
import json

src = '''[424783, [198184], [605], [644], [296], [2048],
          424694, [369192], [10139], [152532], [397538], [1420]]'''

with open('output.csv', 'w', newline='') as f:  # Python 2.x: open('output.csv', 'wb')
    writer = csv.writer(f)  # create a simple CSV writer
    current_key = None  # a container for the last seen / cached 'key'
    for element in json.loads(src):  # parse the structure and iterate over it
        if isinstance(element, list):  # if the element is a 'list'
            writer.writerow((current_key, element[0]))  # write to csv w/ cached key
        else:
            current_key = element  # cache the element as the key for following entries
This should produce an output.csv containing:
424783,198184
424783,605
424783,644
424783,296
424783,2048
424694,369192
424694,10139
424694,152532
424694,397538
424694,1420
itertools.groupby is a little challenging for Python beginners, but is very handy for walking through a collection of items and working with them in groups. In this case, we group by items that are or are not Python lists.
From each group of nested ints, we'll create one or more entries in our accumulator list.
Once the accumulator list is loaded, the code below just prints out the results, easily converted to writing to a file.
import ast
from itertools import groupby
from collections import namedtuple

# this may be JSON, but it's also an ordinary Python nested list of ints, so safely parseable using
# ast.literal_eval()
text = "[424783, [198184], [605], [644], [296], [2048], 424694, [369192], [10139], [152532], [397538], [1420]]"
items = ast.literal_eval(text)

# a namedtuple to hold each record, and a list to accumulate them
DataRow = namedtuple("DataRow", "old_id new_id")
accumulator = []

# use groupby to process the entries in groups, depending on whether the items are lists or not
key = None
for is_data, values in groupby(items, key=lambda x: isinstance(x, list)):
    if not is_data:
        # the sole value is the next record's key
        key = list(values)[0]
    else:
        # the values are the collection of lists until the next key
        accumulator.extend(DataRow(key, v[0]) for v in values)

# dump out as csv
for item in accumulator:
    print("{old_id},{new_id}".format_map(item._asdict()))
Prints:
424783,198184
424783,605
424783,644
424783,296
424783,2048
424694,369192
424694,10139
424694,152532
424694,397538
424694,1420
I think using itertools.groupby() would be a good approach, since grouping items is the main thing that needs to be done to accomplish what you want.
Here's a fairly simple way of using it:
import csv
from itertools import groupby
import json

json_src = '''[424783, [198184], [605], [644], [296], [2048],
               424694, [369192], [10139], [152532], [397538], [1420]]'''

def xyz():
    return json.loads(json_src)

def abc():
    json_processed = xyz()
    output_filename = 'y.csv'
    with open(output_filename, 'w', newline='') as csvfile:
        writer = csv.writer(csvfile)
        for is_list, items in groupby(json_processed, key=lambda v: isinstance(v, list)):
            if is_list:
                new_ids = [item[0] for item in items]
            else:
                old_id = next(items)
                continue
            for new_id in new_ids:
                writer.writerow([old_id, new_id])

abc()
Contents of csv file produced:
424783,198184
424783,605
424783,644
424783,296
424783,2048
424694,369192
424694,10139
424694,152532
424694,397538
424694,1420
There are already questions along these lines, but in my situation I have the following problem:
The alias column contains dictionaries. If I use the csv reader I get strings.
I have solved this problem with ast.literal_eval, but it is very slow and consumes a lot of resources.
The alternative json.loads does not work because of the encoding.
Any ideas how to solve this?
CSV File:
id;name;partei;term;wikidata;alias
2a24b32c-8f68-4a5c-bfb4-392262e15a78;Adolf Freiherr Spies von Büllesheim;CDU;10;Q361600;{}
9aaa1167-a566-4911-ac60-ab987b6dbd6a;Adolf Herkenrath;CDU;10;Q362100;{}
c371060d-ced3-4dc6-bf0e-48acd83f8d1d;Adolf Müller;CDU;10;Q363453;{'nl': ['Adolf Muller']}
41cf84b8-a02e-42f1-a70a-c0a613e6c8ad;Adolf Müller-Emmert;SPD;10;Q363451;{'de': ['Müller-Emmert'], 'nl': ['Adolf Muller-Emmert']}
15a7fe06-8007-4ff0-9250-dc7917711b54;Adolf Roth;CDU;10;Q363697;{}
Code:
with open(PATH_CSV+'mdb_file_2123.csv', "r", encoding="utf8") as csv8:
    csv_reader = csv.DictReader(csv8, delimiter=';')
    for row in csv_reader:
        if not (ast.literal_eval(row['alias'])):
            pass
        elif (ast.literal_eval(row['alias'])):
            known_as_list = list()
            for values in ast.literal_eval(row['alias']).values():
                for aliases in values:
                    known_as_list.append(aliases)
It's working fine, but very slowly.
The ast library consumes a lot of memory (refer to this link) and I would suggest avoiding it when converting a simple dictionary-formatted string into a Python dictionary. Instead we can try Python's built-in eval function to overcome the latency caused by the imported module. As some discussions point out, eval is extremely dangerous when dealing with untrusted strings, for example eval('os.system("rm -rf /")'). But if we are very sure that the CSV content will not carry such commands, we can make use of eval without worrying.
import csv

with open('input.csv', encoding='utf-8') as fd:
    csv_reader = csv.DictReader(fd, delimiter=';')
    for row in csv_reader:
        # Convert dictionary in string format to python format
        row['alias'] = eval(row['alias'])
        # Filter empty dictionaries
        if not bool(row['alias']):
            continue
        known_as_list = [aliases for values in row['alias'].values() for aliases in values]
        print(known_as_list)
Output
C:\Python34\python.exe c:\so\51712444\eval_demo.py
['Adolf Muller']
['Müller-Emmert', 'Adolf Muller-Emmert']
You can avoid calling literal_eval three times (once is sufficient); while I was at it I've cleaned up, or so I think, your code using an SO classic (3013 upvotes!) contribution:
from ast import literal_eval

# https://stackoverflow.com/a/952952/2749397 by Alex Martelli
flatten = lambda l: [item for sublist in l for item in sublist]
...
for row in csv_reader:
    known_as_list = flatten(literal_eval(row['alias']).values())
From the excerpt of data shown by the OP, it seems possible to avoid calling literal_eval on a significant part of the rows:
...
for row in csv_reader:
    if row['alias'] != '{}':
        known_as_list = flatten(literal_eval(row['alias']).values())
I am trying to loop through a Python dictionary to see if values that I am getting from a CSV file already exist in the dictionary. If the values do not exist I want to add them to the dictionary, then append this to a list.
I am getting the error "list indices must be integers, not str".
example input
first name last name
john smith
john smith
example output
first_name john last name smith
user_list = []
with open(input_path, 'rU') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        if row['first_name'] not in user_dictionary['first_name'] and row['last_name'] not in user_dictionary['last_name']:
            user_dictionary = {
                'first_name': row['first_name'],
                'last_name': row['last_name']
            }
            user_list.append(user_dictionary)
Currently, your code is creating a new dictionary on every iteration of the for-loop. If each value of the dictionary is a list, then you can append to that list via the key:
with open(input_path, 'rU') as csvfile:
    reader = csv.DictReader(csvfile)
    user_dictionary = {"first_name": ["name1", "name2", ...], "last_name": ["name3", "name4", ...]}
    for row in reader:
        if row['first_name'] not in user_dictionary['first_name'] and row['last_name'] not in user_dictionary['last_name']:
            user_dictionary["first_name"].append(row['first_name'])
            user_dictionary['last_name'].append(row['last_name'])
Generally, you can use a membership test (x in y) on dict.values() view to check if the value already exists in your dictionary.
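For example, a tiny illustration of that membership test (hypothetical data, not the OP's):

user_dictionary = {'first_name': 'john', 'last_name': 'smith'}
print('john' in user_dictionary.values())  # True
print('jane' in user_dictionary.values())  # False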
However, if you are trying to add all unique users from your CSV file to a list of users, that has nothing to do with dictionary values testing, but a list membership testing.
Instead of iterating over the complete list each time for a slow membership check, you can use a set that will contain "ids" of all users added to the list and enable a fast O(1) (amortized) time check:
with open(input_path, 'rU') as csvfile:
    reader = csv.DictReader(csvfile)
    user_list = []
    user_set = set()
    for row in reader:
        user_id = (row['first_name'], row['last_name'])
        if user_id not in user_set:
            user = {
                'first_name': row['first_name'],
                'last_name': row['last_name'],
                # something else ...
            }
            user_list.append(user)
            user_set.add(user_id)
The error "list indices must be integers, not str" makes the problem clear: On the line that throws the error, you have a list that you think is a dict. You try to use a string as a key for it, and boom!
You don't give enough information to guess which dict it is: It could be user_dictionary, it could be that you're using csv.reader and not csv.DictReader as you say you do. It could even be something else-- there's no telling what else you left out of your code. But it's a list that you're using as if it's a dict.
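A minimal illustration of that error (hypothetical code, not the OP's):

user_dictionary = ['john', 'smith']  # a list, not a dict
user_dictionary['first_name']        # TypeError: list indices must be integers, not str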
I am reading a text file with Python, formatted so that the values in each column may be numeric or strings.
When those values are strings, I need to assign each string a unique ID (unique across all the strings under the same column; the same ID must be assigned if the same string appears elsewhere under the same column).
What would be an efficient way to do this?
Use a defaultdict with a default value factory that generates new ids:
ids = collections.defaultdict(itertools.count().next)
ids['a'] # 0
ids['b'] # 1
ids['a'] # 0
When you look up a key in a defaultdict, if it's not already present, the defaultdict calls a user-provided default value factory to get the value and stores it before returning it.
itertools.count() creates an iterator that counts up from 0, so itertools.count().next is a bound method that produces a new integer whenever you call it.
Combined, these tools produce a dict that returns a new integer whenever you look up something you've never looked up before.
The defaultdict answer updated for Python 3, where .next is now .__next__, and for pylint compliance, where using "magic" __*__ methods is discouraged:
ids = collections.defaultdict(functools.partial(next, itertools.count()))
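Applied to the question (one id space per column), here is a minimal Python 3 sketch; the file name, the delimiter, and the "numeric means int-parseable" rule are assumptions:

import collections
import csv
import functools
import itertools

# one independent id-assigning defaultdict per column index
ids = collections.defaultdict(
    lambda: collections.defaultdict(functools.partial(next, itertools.count())))

with open('somefile.txt') as f:
    for row in csv.reader(f, delimiter=','):
        for column, value in enumerate(row):
            try:
                int(value)  # numeric values are left alone
            except ValueError:
                print(column, value, ids[column][value])  # same string, same id per column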
Create a set, and then add strings to the set. This will ensure that strings are not duplicated; then you can use enumerate to get a unique id of each string. Use this ID when you are writing the file out again.
Here I am assuming the second column is the one you want to scan for text or integers.
import csv

seen = set()
with open('somefile.txt') as f:
    reader = csv.reader(f, delimiter=',')
    for row in reader:
        try:
            int(row[1])
        except ValueError:
            seen.add(row[1])  # adds string to set

# print the unique ids for each string
for id, text in enumerate(seen):
    print("{}: {}".format(id, text))
Now you can take the same logic and replicate it across each column of your file. If you know the column count in advance, you can have a list of sets. Suppose the file has three columns:
unique_strings = [set(), set(), set()]
with open('file.txt') as f:
    reader = csv.reader(f, delimiter=',')
    for row in reader:
        for column, value in enumerate(row):
            try:
                int(value)
            except ValueError:
                # It is not an integer, so it must be
                # a string
                unique_strings[column].add(value)
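If you then need the actual string-to-id mapping per column (which is what the question asks for), one possible follow-up on top of the code above is:

# build one {string: id} mapping per column from the collected sets;
# the ids are arbitrary but consistent within this run
column_ids = [{text: id for id, text in enumerate(strings)} for strings in unique_strings]
print(column_ids)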