Python: How to restart a for loop that iterates over a CSV

I am using Python 3.5 and want to load data from a CSV into several lists, but it only works once with a for loop. The second loop loads nothing.
Here is the code:
import csv
f1 = open("csvfile.csv", encoding="latin-1")
csv_f1 = csv.reader(f1, delimiter=';')
list_f1_vorname = []
for row_f1 in csv_f1:
    list_f1_vorname.append(row_f1[2])
list_f1_name = []
for row_f1 in csv_f1:  # <-- HERE IS THE ERROR, IT DOESNT WORK A SECOND TIME!
    list_f1_name.append(row_f1[3])
Does anybody know how to restart this thing?
Many thanks and best regards,
Saitam

csv_f1 is not a list, it is an iterator.
Either cache the contents of csv_f1 in a list by using list(), or recreate the reader object.
I would recommend recreating the object in case your CSV data gets very big. This way, the data is not loaded into RAM all at once.
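Recreating the reader could look roughly like the sketch below. It assumes the file object is seekable, and it uses io.StringIO with made-up rows as a stand-in for the real csvfile.csv so that it runs on its own:

```python
import csv
import io

# Stand-in for open("csvfile.csv", encoding="latin-1"); the rows are invented.
f1 = io.StringIO("1;x;Anna;Schmidt\n2;y;Ben;Meyer\n")

# First pass: collect column 2.
csv_f1 = csv.reader(f1, delimiter=";")
list_f1_vorname = [row[2] for row in csv_f1]

# Rewind the underlying file and build a fresh reader for the second pass.
f1.seek(0)
csv_f1 = csv.reader(f1, delimiter=";")
list_f1_name = [row[3] for row in csv_f1]

print(list_f1_vorname, list_f1_name)
```

Only one row is held in memory at a time, at the cost of reading the file twice.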

The simple answer is to iterate over the csv once and store it into a list.
something like
my_list = []
for row in csv_f1:
    my_list.append(row)
or what abukaj wrote with
csv_f1 = list(csv.reader(f1, delimiter=';'))
and then move on and iterate over that list as many times as you want.
However if you are only trying to get certain columns then you can simply do that in the same for loop.
list_f1_vorname = []
list_f1_name = []
for row in csv_f1:
    list_f1_vorname.append(row[2])
    list_f1_name.append(row[3])
The reason it doesn't work multiple times is because it is an iterator...so it will iterate over the values once but not restart at beginning again after it has already iterated over the data.
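A small self-contained demonstration of that exhaustion, using io.StringIO with invented data in place of a real file:

```python
import csv
import io

# In-memory stand-in for a CSV file; the contents are invented.
f1 = io.StringIO("a;b\nc;d\n")
reader = csv.reader(f1, delimiter=";")

first_pass = list(reader)   # consumes every row
second_pass = list(reader)  # nothing left, so this is empty

print(first_pass)
print(second_pass)
```

The second pass does not raise an error, it simply yields no rows, which is why the second list in the question stays empty.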

Try:
csv_f1 = list(csv.reader(f1, delimiter=';'))
It is not exactly restarting the reader, but rather caching the file contents in a list, which may be iterated many times.

One thing nobody has mentioned so far is that you're trying to store first names and last names in two separate lists. This is not going to be very convenient to use later on. So although the other answers show correct ways to read first and last names from the CSV into two separate lists, I suggest using a single list of dicts instead:
f1 = open("csvfile.csv", encoding="latin-1")
csv_f1 = csv.reader(f1, delimiter=";")
list_of_names = []
for row_f1 in csv_f1:
    list_of_names.append({
        "vorname": row_f1[2],
        "name": row_f1[3]
    })
Then you can iterate over this list and take the value you want. For example to simply print the values:
for row in list_of_names:
    print(row["vorname"])
    print(row["name"])
Last but not least, you could also build this list with a list comprehension (arguably more Pythonic):
list_of_names = [
    {
        "vorname": row_f1[2],
        "name": row_f1[3]
    }
    for row_f1 in csv_f1
]
As I said, I appreciate the other answers. They solve the issue of the csv reader being an iterator rather than a list-like object.
Nevertheless, I see a bit of an XY problem in your question. I've often seen attempts to store entity properties in multiple lists (first name and last name are obviously related properties and together form a simple entity). It always ends up with code that is hard to read and maintain.

Related

Django, Python: Best way to parse a CSV and convert to Django model instances

I have a page where a user uploads a CSV file. That bit works.
I can read the CSV and turn it into lists. It takes a fair bit of time for something I thought should be faster (about 7 seconds to parse through and convert it to lists, for a 17mb CSV file).
I'm wondering now though, what would be the best approach to doing this? The code I have so far is very convoluted (written a long time ago by a CS graduate colleague that has since left) and I think I want to rewrite it as it's too slow.
I haven't worked with CSVs before. Right now this is what I have:
import codecs
import csv
import os
import sys
def read_csv_file(self, file_path):
    is_file = False
    while not is_file:
        if os.path.exists(file_path):
            is_file = True
    result_data = []
    csv.field_size_limit(sys.maxsize)
    csv_reader = csv.reader(codecs.open(file_path, 'rU', 'utf-8'), delimiter=',')
    for row in csv_reader:
        result_data.append(row)
    return result_data
Is turning a CSV into lists (that I can then perhaps zip?) the best way to go about it?
Ultimately, the goal is to create DB objects (in a loop perhaps?) that would be something like looping through each list, using an index to create objects, appending those objects to an object list and then doing a bulk_create:
object_instance_list.append(My_Object.objects.get_or_create(property=csv_property[some_index], etc etc)[0])
My_Object.bulk_create(object_instance_list)
Would that be efficient?
Should I be working with dicts instead?
Is there a built in method that would allow this for either Python's CSV or a bit of Django functionality that already does it?
Basically, since I'm not that experienced, and this is my first time working with CSVs, I would like to get it right(ish) from the beginning.
I would appreciate any help in this regard, so I can learn the proper way to handle this. Thanks!
So, this is untested, but conceptually you should be able to get the idea. The trick is to take advantage of using **kwargs.
import csv
def read_csv():
    """Read the csv into dictionaries, transform the keys necessary
    and return a list of cleaned-up dictionaries.
    """
    with open('data.csv', newline='') as csvfile:
        reader = csv.DictReader(csvfile)
        return [map_rows_to_fields(row) for row in reader]
def map_rows_to_fields(row):
    """Here for each dictionary you want to transform the dictionary
    in order to map the keys of the dict to match the names of the
    fields on the model you want to create so we can pass it in as
    `**kwargs`. This would be an opportunity to use a nice dictionary
    comprehension.
    """
    csv_fields_to_model_fields = {
        'csv_field_1': 'model_field_1',
        'csv_field_2': 'model_field_2',
        'csv_field_n': 'model_field_n',
    }
    return {
        csv_fields_to_model_fields[key]: value
        for key, value in row.items()
    }
def instantiate_models():
    """Finally, we have our data from the csv in dictionaries
    that map values to expected fields on our model constructor,
    then we can just instantiate each of those models from the
    dictionary data using a list comprehension, the result of which
    we pass as the argument to `bulk_create` saving the rows to
    the database.
    """
    model_data = read_csv()
    MyModel.objects.bulk_create([
        MyModel(**data) for data in model_data
    ])
The method bulk_create does have a few caveats though so make sure it's ok to use it in your case.
https://docs.djangoproject.com/en/2.0/ref/models/querysets/#bulk-create
If you can't use bulk_create then just make the models in a loop.
for data in model_data:
    MyModel.objects.create(**data)
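To try the **kwargs idea without a Django project, here is a rough sketch that runs on its own. FakeModel, the column names and the CSV_TO_MODEL mapping are all hypothetical stand-ins, not part of Django or of the question's data. Unlike a strict mapping, this version skips CSV columns that have no model field instead of raising KeyError:

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class FakeModel:
    """Hypothetical stand-in for a Django model, for illustration only."""
    first_name: str
    last_name: str


# Hypothetical mapping from CSV headers to model field names.
CSV_TO_MODEL = {"First Name": "first_name", "Last Name": "last_name"}


def rows_to_kwargs(csvfile):
    """Yield one kwargs dict per CSV row, dropping unmapped columns."""
    for row in csv.DictReader(csvfile):
        yield {CSV_TO_MODEL[key]: value
               for key, value in row.items() if key in CSV_TO_MODEL}


sample = io.StringIO("First Name,Last Name,Ignored\nAda,Lovelace,x\nAlan,Turing,y\n")
instances = [FakeModel(**kwargs) for kwargs in rows_to_kwargs(sample)]
print(instances)
```

With a real model, the resulting list of unsaved instances is what you would hand to bulk_create.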

How to Remove duplicates from a Lookup File using Python?

I've seen multiple responses around this type of question, but I don't believe I've seen any for the type of list I am concerned with. Hopefully I am not duplicating anything here. Your help is much appreciated!
I have a comma separated file that I use for data enrichment. It starts with the headers - TPCode,corporation_name - then the list of values follows. There are around 35k rows (if that matters).
I notice that when the data from that lookup file (CSV) is displayed, there are multiple entries for the same customer. Rather than going in and manually removing them, I would like to run a Python script to remove the duplicates.
In the format of:
value,value
value,value
value,value
etc., what is the optimal way to remove the duplicates using Python? As a side note, each TPCode should be different, but a corp name can have multiple TPCodes.
Please let me know if you need any additional information.
Thanks in advance!
Hard to tell from your question if each line should be unique. If so you could do:
for l in sorted(set(line for line in open('ors_1202.log'))):
    print(l.rstrip())
otherwise we need more info.
csv.reader yields each row as a list, which is not hashable, so convert each row to a tuple (tuples are immutable) and use a set to hold the rows you have already seen:
import csv
seen = set()
# Python 3 file modes; on Python 2 open the files with 'rb' and 'wb' instead.
with open('in_file.csv', newline='') as csvfile, open('out_file.csv', 'w', newline='') as csvout:
    spamreader = csv.reader(csvfile, delimiter=',')
    spamwriter = csv.writer(csvout, delimiter=',')
    for row in spamreader:
        key = tuple(row)
        if key not in seen:  # check first, then record the row as seen
            seen.add(key)
            spamwriter.writerow(row)
Note that membership checking in a set has O(1) average complexity.
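If only the TPCode has to be unique (the question says a corporation name can have several TPCodes), a variation is to remember just that column. This is a sketch under that assumption; dedupe_by_key and the sample rows are invented for illustration, and the first occurrence of each TPCode wins:

```python
import csv
import io


def dedupe_by_key(rows, key_index=0):
    """Yield rows whose value at key_index has not been seen before."""
    seen = set()
    for row in rows:
        key = row[key_index]
        if key not in seen:
            seen.add(key)
            yield row


# In-memory stand-in for the lookup file; the values are invented.
sample = io.StringIO(
    "TPCode,corporation_name\nA1,Acme\nB2,Acme\nA1,Acme\nC3,Globex\n"
)
reader = csv.reader(sample)
header = next(reader)  # keep the header row out of the duplicate check
unique_rows = list(dedupe_by_key(reader))
print(header, unique_rows)
```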

Reading/Writing/Appending using json module

I was recommended some code from someone here but I still need some help. Here's the code that was recommended:
import json
def write_data(data,filename):
    with open(filename,'w') as outfh:
        json.dump(data,outfh)
def read_data(filename):
    with open(filename, 'r') as infh:
        json.load(infh)
The writing code works fine, but the reading code doesn't return any strings after inputting the data:
read_data('player.txt')
Another thing that I'd like to be able to do is specify a line to be printed. Something that is also pretty important for this project / exercise is being able to append strings to my file. Thanks to anyone who can help me.
Edit: I need to be storing strings in the file that I can convert to values. IE;
name="Karatepig"
Is something I would store so I can recall the data if I ever need to load a previous set of data or something of that sort.
I'm very much a noob at python so I don't know what would be best, whether a list or dictionary; I haven't really used a dictionary before, and also I have no idea yet as to how I'm going to convert strings into values.
Suppose you have a dictionary like this:
data = {'name': 'Karatepig', 'score': 10}
Your write_data function is fine, but your read_data function needs to actually return the data after reading it from the file:
def read_data(filename):
    with open(filename, 'r') as infh:
        data = json.load(infh)
    return data
data = read_data('player.json')
Now suppose you want to print the name and update the score:
print(data['name'])
data['score'] += 5 # add 5 points to the score
You can then write the data back to disk using the write_data function. You can not use the json module to append data to a file. You need to actually read the file, modify the data, and write the file again. This is not so bad for small amounts of data, but consider using a database for larger projects.
Whether you use a list or dictionary depends on what you're trying to do. You need to explain what you're trying to accomplish before anyone can recommend an approach. Dictionaries store values assigned to keys (see above). Lists are collections of items without keys and are accessed using an index.
It isn't clear what you mean by "convert strings into values." Notice that the above code shows the string "Karatepig" using the dictionary key "name".
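The read, modify, write cycle described above could be wrapped up roughly like this. update_data is a hypothetical helper name, and the sketch writes to a temporary directory so that it runs anywhere:

```python
import json
import os
import tempfile


def write_data(data, filename):
    with open(filename, "w") as outfh:
        json.dump(data, outfh)


def read_data(filename):
    with open(filename, "r") as infh:
        return json.load(infh)


def update_data(filename, **changes):
    """Hypothetical helper: load the file, apply changes, write it back."""
    data = read_data(filename)
    data.update(changes)
    write_data(data, filename)
    return data


# Temporary location so the example does not touch real files.
path = os.path.join(tempfile.mkdtemp(), "player.json")
write_data({"name": "Karatepig", "score": 10}, path)
updated = update_data(path, score=15, level=2)
print(read_data(path))
```

Each call rewrites the whole file, which is fine for a small save file but is the reason a database is suggested for larger projects.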

Combining JSON values based on key

I'm working on Python 2.6.6 and I'm struggling with one issue.
I have a large JSON file with the following structure:
{"id":"12345","ua":[{"n":"GROUP_A","v":["true"]}]}
{"id":"12345","ua":[{"n":"GROUP_B","v":["true"]}]}
{"id":"54321","ua":[{"n":"GROUP_C","v":["true"]}]}
{"id":"54321","ua":[{"n":"GROUP_D","v":["true"]}]}
{"id":"54321","ua":[{"n":"GROUP_E","v":["true"]}]}
{"id":"98765","ua":[{"n":"GROUP_F","v":["true"]}]}
And I need to merge the id's so they'll contain all the GROUPS as so:
{"id":"12345","ua":[{"n":"GROUP_A","v":["true"]},{"n":"GROUP_B","v":["true"]}]}
{"id":"54321","ua":[{"n":"GROUP_C","v":["true"]},{"n":"GROUP_D","v":["true"]},{"n":"GROUP_E","v":["true"]}]}
{"id":"98765","ua":[{"n":"GROUP_F","v":["true"]}]}
I tried using the 'json' library, but I couldn't append the values correctly. Also I tried to convert it all to a dictionary and append the values (GROUPS) to the key (id) as lists, but I got stuck on printing it all as I need to the output file.
I can do it using bash but it takes too long to parse all the information and rearrange it in the needed format.
Any help is appreciated!
Thanks.
First, let's get the JSON stuff out of the way.
Your file is not a JSON structure, it's a bunch of separate JSON objects. From your sample, it looks like it's one object per line. So, let's read this in to a list:
with open('spam.json') as f:
    things = [json.loads(line) for line in f]
Then we'll process this, and write it out:
with open('eggs.json', 'w') as f:
    for thing in new_things:
        f.write(json.dumps(thing) + '\n')
Now, you don't have a JSON structure that you want to append things to; you have a list of dicts, and you want to create a new list of dicts, merging together the ones with the same key.
Here's one way to do it:
new_things = {}
for thing in things:
    thing_id = thing['id']
    try:
        old_thing = new_things[thing_id]
    except KeyError:
        new_things[thing_id] = thing
    else:
        old_thing['ua'].extend(thing['ua'])
new_things = new_things.values()
There are a few different ways you could simplify this; I just wrote it this way because it uses no tricks that should be beyond a novice. For example, you could do it by sorting and grouping:
import itertools
import operator
def merge(things):
    return {'id': things[0]['id'],
            'ua': list(itertools.chain.from_iterable(t['ua'] for t in things))}
sorted_things = sorted(things, key=operator.itemgetter('id'))
grouped_things = itertools.groupby(sorted_things, key=operator.itemgetter('id'))
new_things = [merge(list(group)) for key, group in grouped_things]
I didn't realize from your original question that you had tens of millions of rows. All of the above steps require loading the entire original data set into memory, processing with some temporary storage, then writing it back out. But if your data set is too large, you need to find a way to process one row at a time, and keep as little in memory simultaneously as possible.
First, to process one row at a time, you just need to change the initial list comprehension to a generator expression, and move the rest of the code inside the with statement, like this:
with open('spam.json') as f:
    things = (json.loads(line) for line in f)
    for thing in things:
        # blah blah
… at which point it might be just as easy to rewrite it like this:
with open('spam.json') as f:
    for line in f:
        thing = json.loads(line)
        # blah blah
Next, sorting obviously builds the whole sorted list in memory, so that's not acceptable here. But if you don't sort and group, the entire new_things result object has to be alive at the same time (because the very last input row could have to be merged into the very first output row).
Your sample data seems to already have the rows sorted by id. If you can count on that in real life—or just count on the rows always being grouped by id—just skip the sorting step, which isn't doing anything but wasting time and memory, and use a grouping solution.
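A streaming version of that grouping approach might look like the sketch below. It assumes the rows for each id are adjacent in the input; merge_grouped is a hypothetical name, and io.StringIO stands in for the real input file:

```python
import io
import itertools
import json
import operator


def merge_grouped(lines):
    """Merge consecutive rows sharing an id, one group at a time."""
    things = (json.loads(line) for line in lines)
    for thing_id, group in itertools.groupby(things, key=operator.itemgetter("id")):
        yield {"id": thing_id, "ua": [ua for thing in group for ua in thing["ua"]]}


# A few rows in the shape shown in the question.
sample = io.StringIO(
    '{"id":"12345","ua":[{"n":"GROUP_A","v":["true"]}]}\n'
    '{"id":"12345","ua":[{"n":"GROUP_B","v":["true"]}]}\n'
    '{"id":"54321","ua":[{"n":"GROUP_C","v":["true"]}]}\n'
)
merged = list(merge_grouped(sample))
for thing in merged:
    print(json.dumps(thing))
```

Only one group is in memory at a time, so in real use each merged row can be written out as soon as it is produced instead of being collected into a list.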
On the other hand, if you can't count on the rows being grouped by id, there are only really two ways to reduce memory further: compress the data in some way, or back the storage to disk.
For the first, Foo Bar User's solution built a simpler and smaller data structure (a dict mapping each id to its list of uas, instead of a list of dicts, each with an id and a ua), which should take less memory, and which we could convert to the final format one row at a time. Like this:
import json
from collections import defaultdict
with open('spam.json') as f:
    new_dict = defaultdict(list)
    for row in f:
        thing = json.loads(row)
        new_dict[thing["id"]].extend(thing["ua"])
with open('eggs.json', 'w') as f:
    for id, ua in new_dict.items():  # use iteritems in Python 2.x
        thing = {'id': id, 'ua': ua}
        f.write(json.dumps(thing) + '\n')
For the second, Python comes with a nice way to use a dbm database as if it were a dictionary. If your values are just strings, you can use the anydbm/dbm module (or one of the specific implementations). Since your values are lists, you'll need to use shelve instead.
Anyway, while this will reduce your memory usage, it could slow things down. On a machine with 4GB of RAM, the savings in pagefile swapping will probably blow away the extra cost of going through the database… but on a machine with 16GB of RAM, you may just be adding overhead for very little gain. You may want to experiment with smaller files first, to see how much slower your code is with shelve vs. dict when memory isn't an issue.
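For reference, a minimal shelve-backed sketch of the same merge, written for Python 3 (where shelve.open works as a context manager). The shelf path and sample rows are made up, and the shelf lives in a temporary directory so the example cleans up after nothing important:

```python
import json
import os
import shelve
import tempfile

# A few rows in the shape shown in the question.
lines = [
    '{"id":"12345","ua":[{"n":"GROUP_A","v":["true"]}]}',
    '{"id":"54321","ua":[{"n":"GROUP_C","v":["true"]}]}',
    '{"id":"12345","ua":[{"n":"GROUP_B","v":["true"]}]}',
]

db_path = os.path.join(tempfile.mkdtemp(), "things_shelf")
with shelve.open(db_path) as shelf:
    for line in lines:
        thing = json.loads(line)
        uas = shelf.get(thing["id"], [])
        uas.extend(thing["ua"])
        # Reassign the key: mutating the fetched list alone is not persisted.
        shelf[thing["id"]] = uas
    merged = {key: shelf[key] for key in shelf}

print(merged)
```

Every update unpickles and re-pickles the whole list for that id, which is where the slowdown mentioned above comes from.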
Alternatively, if things get way beyond the limits of your memory, you can always use a more powerful database that actually can sort things on disk. For example (untested):
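db = sqlite3.connect('temp.sqlite')
c = db.cursor()
c.execute('CREATE TABLE Things (tid, ua)')
for thing in things:
    for ua in thing['ua']:
        # sqlite cannot store a dict, so serialize each ua entry as JSON text;
        # the parameters must be passed as a single tuple
        c.execute('INSERT INTO Things (tid, ua) VALUES (?, ?)',
                  (thing['id'], json.dumps(ua)))
db.commit()  # commit is a method of the connection, not the cursor
c.execute('SELECT tid, ua FROM Things ORDER BY tid')
rows = iter(c.fetchone, None)
grouped_things = itertools.groupby(rows, key=operator.itemgetter(0))
# rows are (tid, ua_json) tuples, so rebuild each merged dict here
new_things = ({'id': tid, 'ua': [json.loads(ua) for _, ua in group]}
              for tid, group in grouped_things)
with open('eggs.json', 'w') as f:
    for thing in new_things:
        f.write(json.dumps(thing) + '\n')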

Loading large file (25k entries) into dict is slow in Python?

I have a file which has about 25000 lines, and it's a s19 format file.
each line is like: S214 780010 00802000000010000000000A508CC78C 7A
There are no spaces in the actual file, the first part 780010 is the address of this line, and I want it to be a dict's key value, and I want the data part 00802000000010000000000A508CC78C be the value of this key. I wrote my code like this:
def __init__(self,filename):
    infile = file(filename,'r')
    self.all_lines = infile.readlines()
    self.dict_by_address = {}
    for i in range(0, self.get_line_number()):
        self.dict_by_address[self.get_address_of_line(i)] = self.get_data_of_line(i)
    infile.close()
get_address_of_line() and get_data_of_line() are all simply string slicing functions. get_line_number() iterates over self.all_lines and returns an int
The problem is that the init process takes over a minute. Is the way I construct the dict wrong, or does Python just need that long to do this?
And by the way, I'm new to python:) maybe the code looks more C/C++ like, any advice of how to program like python is appreciated:)
How about something like this? (I made a test file with just a line S21478001000802000000010000000000A508CC78C7A so you might have to adjust the slicing.)
>>> with open('test.test') as f:
...     dict_by_address = {line[4:10]:line[10:-3] for line in f}
...
>>> dict_by_address
{'780010': '00802000000010000000000A508CC78C'}
This code should be tremendously faster than what you have now. EDIT: As @sth pointed out, this doesn't work because there are no spaces in the actual file. I'll add a corrected version at the end.
def __init__(self,filename):
    self.dict_by_address = {}
    with open(filename, 'r') as infile:
        for line in infile:
            _, key, value, _ = line.split()
            self.dict_by_address[key] = value
Some comments:
Best practice in Python is to use a with statement, unless you are using an old Python that doesn't have it.
Best practice is to use open() rather than file(); Python 3.x removed file() entirely.
You can use the open file object as an iterator, and when you iterate it you get one line from the input. This is better than calling the .readlines() method, which slurps all the data into a list; then you use the data one time and delete the list. Since the input file is large, that means you are probably causing swapping to virtual memory, which is always slow. This version avoids building and deleting the giant list.
Then, having created a giant list of input lines, you use range() to make a big list of integers. Again it wastes time and memory to build a list, use it once, then delete the list. You can avoid this overhead by using xrange() but even better is just to build the dictionary as you go, as part of the same loop that is reading lines from the file.
It might be better to use your special slicing functions to pull out the "address" and "data" fields, but if the input is regular (always follows the pattern of your example) you can just do what I showed here. line.split() splits the line on white space, giving a list of four strings. Then we unpack it into four variables using "destructuring assignment". Since we only want to save two of the values, I used the variable name _ (a single underscore) for the other two. That's not really a language feature, but it is an idiom in the Python community: when you have data you don't care about you can assign it to _. This line will raise an exception if there are ever any number of values other than 4, so if it is possible to have blank lines or comment lines or whatever, you should add checks and handle the error (at least wrap that line in a try:/except).
EDIT: corrected version:
def __init__(self,filename):
    self.dict_by_address = {}
    with open(filename, 'r') as infile:
        for line in infile:
            key = extract_address(line)
            value = extract_data(line)
            self.dict_by_address[key] = value
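The corrected version leaves extract_address and extract_data undefined. One possible way to write them with plain slicing is sketched below. It assumes every data line is an S2 record laid out like the sample (2-character type, 2-character byte count, 6-character address, data, 2-character checksum); other record types would need different offsets, so they are skipped here. load_s19 is a hypothetical helper name:

```python
import io


def extract_address(line):
    """Characters 4-9 hold the 6-digit address (assumes an S2 record)."""
    return line[4:10]


def extract_data(line):
    """Everything after the address, minus the 2-character checksum."""
    return line.rstrip("\r\n")[10:-2]


def load_s19(lines):
    """Build the address -> data dict, ignoring non-S2 records."""
    return {
        extract_address(line): extract_data(line)
        for line in lines
        if line.startswith("S2")
    }


# The sample line from the question, with the spaces removed.
sample = io.StringIO("S21478001000802000000010000000000A508CC78C7A\n")
dict_by_address = load_s19(sample)
print(dict_by_address)
```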
