Create a list containing files of each transaction - python

Good evening. I want to build a list while reading a text file (historique.txt) that contains the files associated with each task id. Consider the following example: my text file contains these lines:
4,file1
4,file2
5,file1
5,file3
5,file4
6,file3
6,file4
(To explain the content of the text file: 4 is an idtask and file1 is a file used by idtask=4; so basically, task 4 used (file1, file2).)
I want to obtain the list Transactions=[[file1,file2],[file1,file3,file4],[file3,file4]].
Any help is appreciated, thank you.

Note: this will not work if the input file is not ordered by task id.
This is exactly the same idea as @mad_'s answer, just showing the benefit of turning file_data_list into a list of lists instead of a list of strings. We only need to .split each line once, which is more readable and probably a bit faster as well.
Note that this can also be done while reading the file instead of after-the-fact like I show below.
from itertools import groupby
file_data_list = ['4,file1',
                  '4,file2',
                  '5,file1',
                  '5,file3',
                  '5,file4',
                  '6,file3',
                  '6,file4']
file_data_list = [line.split(',') for line in file_data_list]
for k, v in groupby(file_data_list, key=lambda x: x[0]):
    print([x[1] for x in v])  # also no need to convert v to a list
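For completeness, here is a sketch of that while-reading version (assuming the file is named historique.txt as in the question, and that it is ordered by id):

from itertools import groupby

with open('historique.txt') as f:
    rows = (line.strip().split(',') for line in f)  # lazily split each line into [id, filename]
    Transactions = [[filename for _, filename in group]
                    for _, group in groupby(rows, key=lambda row: row[0])]

print(Transactions)  # [['file1', 'file2'], ['file1', 'file3', 'file4'], ['file3', 'file4']]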

After reading from the file, e.g. with f.readlines(), you will get a list similar to the one below (newlines stripped here for clarity):
file_data_list = ['4,file1',
                  '4,file2',
                  '5,file1',
                  '5,file3',
                  '5,file4',
                  '6,file3',
                  '6,file4']
Apply groupby:
from itertools import groupby
for k, v in groupby(file_data_list, key=lambda x: x.split(",")[0]):
    print([i.split(",")[1] for i in list(v)])
Output
['file1', 'file2']
['file1', 'file3', 'file4']
['file3', 'file4']
You can also create a mapping dict:
for k, v in groupby(file_data_list, key=lambda x: x.split(",")[0]):
    print({k: [i.split(",")[1] for i in list(v)]})
Output
{'4': ['file1', 'file2']}
{'5': ['file1', 'file3', 'file4']}
{'6': ['file3', 'file4']}
As pointed out by @DeepSpace, the above solution will work only if the ids are ordered. Modifying it for the case where the input is not ordered:
from collections import defaultdict
d = defaultdict(list)
file_data_list = ['4,file1',
                  '4,file2',
                  '5,file1',
                  '5,file3',
                  '5,file4',
                  '6,file3',
                  '6,file4',
                  '4,file3']
for k, v in groupby(file_data_list, key=lambda x: x.split(",")[0]):
    for i in list(v):
        d[k].append(i.split(",")[1])
print(d)
Output
defaultdict(list,
            {'4': ['file1', 'file2', 'file3'],
             '5': ['file1', 'file3', 'file4'],
             '6': ['file3', 'file4']})
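If you then want the exact Transactions list from the question rather than a dict, the values are already grouped (on Python 3.7+, where dicts preserve insertion order, they come out in order of each id's first appearance):

Transactions = list(d.values())
# [['file1', 'file2', 'file3'], ['file1', 'file3', 'file4'], ['file3', 'file4']]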

We can use the csv module to process the lines into lists of values.
csv reads from a file-like object, which we can fake using StringIO for an example:
>>> from io import StringIO
>>> contents = StringIO('''4,file1
... 4,file2
... 5,file1
... 5,file3
... 5,file4
... 6,file3
... 6,file4''')
Just to note: depending upon the version of Python you are using you might need to import StringIO differently. The above code works for Python 3. For Python 2, replace the import with from StringIO import StringIO.
csv.reader returns an iterable object. We can consume the whole thing into a list, just to see how it works. Later we will instead iterate over the reader object one line at a time.
We can use pprint to see the results nicely formatted:
>>> import csv
>>> lines = list(csv.reader(contents))
>>> from pprint import pprint
>>> pprint(lines)
[['4', 'file1'],
 ['4', 'file2'],
 ['5', 'file1'],
 ['5', 'file3'],
 ['5', 'file4'],
 ['6', 'file3'],
 ['6', 'file4']]
These lists can then be unpacked into a task and filename:
>>> task, filename = ['4', 'file1']
>>> task
'4'
>>> filename
'file1'
We want to build lists of filenames having the same task as key.
To organise this efficiently we can use a dictionary: we can ask it for the list of values stored under a given key. The efficiency comes from the fact that Python dictionaries are hash tables, and a hash lookup is much quicker than a linear search.
The first time we look to add a value to the dictionary for a particular key, we would need to check to see whether it already exists.
If not we would add an empty list and append the new value to it. Otherwise we would just add the value to the existing list for the given key.
This pattern is so common that Python's builtin dictionary has a method dict.setdefault to help us achieve this.
However, I don't like the name, or the non-uniform syntax. You can read the linked documentation if you like, but I'd rather use
Python's defaultdict instead. This automatically creates a default value for a key if it doesn't already exist when you query it.
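For comparison, a minimal sketch of the dict.setdefault version of the same pattern:

d = {}
d.setdefault('5', []).append('file1')  # inserts an empty list for '5' only if the key is missing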
We create a defaultdict with a list as default:
>>> from collections import defaultdict
>>> d = defaultdict(list)
Then for any new key it will create an empty list for us:
>>> d['5']
[]
We can append to the list (note that merely querying d['5'] above already inserted the key):
>>> d['5'].append('file1')
>>> d['7'].append('file2')
>>> d['7'].append('file3')
I'll convert the defaultdict to a dict just to make it pprint more nicely:
>>> pprint(dict(d), width=30)
{'5': ['file1'],
 '7': ['file2', 'file3']}
So, putting all this together:
import csv
from collections import defaultdict
from io import StringIO
from pprint import pprint
contents = StringIO('''4,file1
4,file2
5,file1
5,file3
5,file4
6,file3
6,file4''')
task_transactions = defaultdict(list)
for row in csv.reader(contents):
    task, filename = row
    task_transactions[task].append(filename)
pprint(dict(task_transactions))
Output:
{'4': ['file1', 'file2'],
 '5': ['file1', 'file3', 'file4'],
 '6': ['file3', 'file4']}
Some final notes: In the example we've used StringIO to fake the file contents. You'll probably want to replace that in your actual code with something like:
with open('historique.txt') as contents:
    for row in csv.reader(contents):
        ...  # etc
Also, where we take each row from the csv reader, and then unpack it into a task and filename, we could do that all in one go:
for task, filename in csv.reader(contents):
So your whole code (without printing) would be quite simple:
import csv
from collections import defaultdict
task_transactions = defaultdict(list)
with open('historique.txt') as contents:
    for task, filename in csv.reader(contents):
        task_transactions[task].append(filename)
If you want a list of transactions (as you asked in the question!):
transactions = list(task_transactions.values())
However, this may not be in the same order of tasks as in the original file (on Python 3.7+, where dicts preserve insertion order, it will follow the order in which each task first appears). If the ordering matters to you, clarify the question and comment so I can help.

An alternate solution that does not use itertools.groupby
(this solution does exactly what @mad_'s does, but it may be more readable, especially for a beginner):
As @mad_ said, the list read from the file will be as follows:
data = ['4,file1',
        '4,file2',
        '5,file1',
        '5,file3',
        '5,file4',
        '6,file3',
        '6,file4']
You can then loop over the data and build a dict:
from collections import defaultdict

transactions = defaultdict(list)
for element in data:  # each element is a string like '4,file1'
    idtask, file = element.split(',')
    transactions[idtask].append(file)
transactions will now contain the dictionary:
{'4': ['file1', 'file2'],
 '5': ['file1', 'file3', 'file4'],
 '6': ['file3', 'file4']}
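And if you want the plain Transactions list from the question, take the grouped values (on Python 3.7+ the order follows each task's first appearance):

Transactions = list(transactions.values())
# [['file1', 'file2'], ['file1', 'file3', 'file4'], ['file3', 'file4']]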

Related

Add string to dictionary in list comprehension

Say I have a list of filenames, files, each containing data in JSON format. To read the data into a list with one entry per file, I use a list comprehension:
>>> import json
>>> data = [json.load(open(file)) for file in files]
Now I'm wondering if there is a way to add the file name file to the JSON data, so that it looks like this:
{
    'Some': ['data', 'that', 'has', 'already', 'been', 'there'],
    'Filename': 'filename'
}
For my case, json.load() returns a dict, so I've tried something similar to this question. This didn't work out for me, because files contains strings and not dictionaries.
Edit
For clarification, if dict.update() didn't return None, this would probably work:
>>> data = [dict([('filename',file)]).update(json.load(open(file))) for file in files]
Yes, you can. Here's one way (requires Python 3.5+):
import json
data = [{**json.load(open(file)), **{'Filename': file}} for file in files]
The syntax {**d1, **d2} combines two dictionaries, with d2 taking precedence on duplicate keys. If you wish to add items explicitly, you can simply add an extra item like so:
data = [{**json.load(open(file)), 'Filename': file} for file in files]
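If you're on a Python older than 3.5, an equivalent sketch without the {**...} unpacking syntax:

import json

data = []
for file in files:
    d = json.load(open(file))  # load the JSON dict from this file
    d['Filename'] = file       # then attach the file name under an extra key
    data.append(d)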
You can merge a custom dictionary into the one being loaded as in this answer.
data = [{**json.loads("{\"el...\": 5}"), **{'filename': file}} for file in files]

Python: Group a list of file names according to common name identifier

In a directory I have some files:
temperature_Resu05_les_spec_r0.0300.0
temperature_Resu05_les_spec_r0.0350.0
temperature_Resu05_les_spec_r0.0400.0
temperature_Resu05_les_spec_r0.0450.0
temperature_Resu06_les_spec_r0.0300.0
temperature_Resu06_les_spec_r0.0350.0
temperature_Resu06_les_spec_r0.0400.0
temperature_Resu06_les_spec_r0.0450.0
temperature_Resu07_les_spec_r0.0300.0
temperature_Resu07_les_spec_r0.0350.0
temperature_Resu07_les_spec_r0.0400.0
temperature_Resu07_les_spec_r0.0450.0
temperature_Resu08_les_spec_r0.0300.0
temperature_Resu08_les_spec_r0.0350.0
temperature_Resu08_les_spec_r0.0400.0
temperature_Resu08_les_spec_r0.0450.0
temperature_Resu09_les_spec_r0.0300.0
temperature_Resu09_les_spec_r0.0350.0
temperature_Resu09_les_spec_r0.0400.0
temperature_Resu09_les_spec_r0.0450.0
I need a list of all the files that have the same identifier XXXX as in _rXXXX. For example one such list would be composed of
temperature_Resu05_les_spec_r0.0300.0
temperature_Resu06_les_spec_r0.0300.0
temperature_Resu07_les_spec_r0.0300.0
temperature_Resu08_les_spec_r0.0300.0
temperature_Resu09_les_spec_r0.0300.0
I don't know a priori what the XXXX values are going to be, so I can't iterate through them and match like that. I'm thinking this might best be handled with a regular expression. Any ideas?
Yes, regular expressions are a fun way to do it! It could look something like this:
import re

results = {}
for fname in fnames:
    # grab whatever follows the final "_r" as an identifier
    identifier = re.search(r'.*_r(.*)', fname).group(1)
    if identifier in results:
        results[identifier].append(fname)
    else:
        results[identifier] = [fname]
The results will be stored in the dictionary results, indexed by the identifier.
I should add that this will work as long as all file names reliably have the _rXXXX structure. If there's any chance that a file name will not match that pattern, you will have to check for it and act accordingly.
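A minimal sketch of that check, skipping any name that doesn't match the pattern:

import re

results = {}
for fname in fnames:
    match = re.search(r'.*_r(.*)', fname)
    if match is None:
        continue  # the name doesn't follow the _rXXXX structure; skip it (or log it)
    results.setdefault(match.group(1), []).append(fname)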
No, a regex is not the best way; your pattern is very straightforward. Just str.rsplit on the _r and use the right-hand element of the split as the key to group the data with. A defaultdict will do the grouping efficiently:
from collections import defaultdict

with open("yourfile") as f:
    groups = defaultdict(list)
    for line in f:
        name = line.rstrip()  # drop the trailing newline before keying on it
        groups[name.rsplit("_r", 1)[1]].append(name)

from pprint import pprint as pp
pp(list(groups.values()))
Which for your sample will give you:
[['temperature_Resu09_les_spec_r0.0450.0'],
 ['temperature_Resu05_les_spec_r0.0300.0',
  'temperature_Resu06_les_spec_r0.0300.0',
  'temperature_Resu07_les_spec_r0.0300.0',
  'temperature_Resu08_les_spec_r0.0300.0',
  'temperature_Resu09_les_spec_r0.0300.0'],
 ['temperature_Resu05_les_spec_r0.0400.0',
  'temperature_Resu06_les_spec_r0.0400.0',
  'temperature_Resu07_les_spec_r0.0400.0',
  'temperature_Resu08_les_spec_r0.0400.0',
  'temperature_Resu09_les_spec_r0.0400.0'],
 ['temperature_Resu05_les_spec_r0.0450.0',
  'temperature_Resu06_les_spec_r0.0450.0',
  'temperature_Resu07_les_spec_r0.0450.0',
  'temperature_Resu08_les_spec_r0.0450.0'],
 ['temperature_Resu05_les_spec_r0.0350.0',
  'temperature_Resu06_les_spec_r0.0350.0',
  'temperature_Resu07_les_spec_r0.0350.0',
  'temperature_Resu08_les_spec_r0.0350.0',
  'temperature_Resu09_les_spec_r0.0350.0']]
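To pull out a single group afterwards, index the mapping by its identifier:

pp(groups["0.0300.0"])  # the five *_r0.0300.0 file names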

Split string by comma, ignoring comma inside string. Am trying CSV

I have a string like this:
s = '1,2,"hello, there"'
And I want to turn it into a list:
[1,2,"hello, there"]
Normally I'd use split:
my_list = s.split(",")
However, that doesn't work if there's a comma in a string.
So, I've read that I need to use csv, but I don't really see how. I've tried:
from csv import reader
s = '1,2,"hello, there"'
ll = reader(s)
print ll
for row in ll:
    print row
Which writes:
<_csv.reader object at 0x020EBC70>
['1']
['', '']
['2']
['', '']
['hello, there']
I've also tried with
ll = reader(s, delimiter=',')
It is that way because you gave the csv reader a string as input. csv.reader iterates over whatever you pass it, and iterating over a string yields one character at a time, which is why each character came back as its own row. If you do not want to use a file or a StringIO object, just wrap your string in a list as shown below.
>>> import csv
>>> s = ['1,2,"hello, there"']
>>> ll = csv.reader(s, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
>>> list(ll)
[['1', '2', 'hello, there']]
It sounds like you probably want to use the csv module. To use the reader on a string, you want a StringIO object.
As an example:
>>> import csv, StringIO
>>> print list(csv.reader(StringIO.StringIO(s)))
[['1', '2', 'hello, there']]
To clarify, csv.reader expects a buffer object, not a string. So StringIO does the trick. However, if you're reading this csv from a file object, (a typical use case) you can just as easily give the file object to the reader and it'll work the same way.
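For instance, a minimal sketch with a real file (the name data.csv here is hypothetical):

with open("data.csv") as f:
    rows = list(csv.reader(f))  # each row comes back as a list of fields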
It's usually easier to reuse than to reinvent the bicycle... You just need to use the csv library properly. If you can't for some reason, you can always check out the source code and learn how the parsing is done there.
Example of parsing a single string into a list. Notice that the string is wrapped in a list.
>>> import csv
>>> s = '1,2,"hello, there"'
>>> list(csv.reader([s]))[0]
['1', '2', 'hello, there']
You can also split on the quote characters first, then split on commas only at the even indices (the pieces that sit outside the quoted strings):
import itertools

new_data = s.split('"')
for i in range(len(new_data)):
    if i % 2 == 1:
        # odd indices were inside quotes; keep them whole
        new_data[i] = [new_data[i]]
    else:
        # even indices sit outside quotes; split on commas and drop
        # the empty pieces left by commas adjacent to the quotes
        new_data[i] = [p for p in new_data[i].split(",") if p]
data = list(itertools.chain(*new_data))
Which goes something like:
'1,2,"hello, there"'
['1,2,', 'hello, there', '']
[['1', '2'], ['hello, there'], []]
['1', '2', 'hello, there']
But it's probably better to use the csv library if that's what you're working with.
You could also use ast.literal_eval if you want to preserve the integers:
>>> from ast import literal_eval
>>> literal_eval('[{}]'.format('1,2,"hello, there"'))
[1, 2, 'hello, there']

Constructing peculiar dictionary out of file (python)

I'd like to automatically form a dictionary from files that have the following structure.
str11 str12 str13
str21 str22
str31 str32 str33 str34
...
that is, two, three, or four strings on each line, with spaces in between. The dictionary I'd like to construct out of this list must have the following structure:
{str11:(str12,str13),str21:(str22),str31:(str32,str33,str34), ... }
(that is, all entries str*1 are the keys -- all of them different -- and the remaining ones are the values). What can I use?
>>> with open('abc') as f:
...     dic = {}
...     for line in f:
...         key, val = line.split(None, 1)
...         dic[key] = tuple(val.split())
...
>>> dic
{'str31': ('str32', 'str33', 'str34'),
 'str21': ('str22',),
 'str11': ('str12', 'str13')}
If you want the order of items to be preserved then consider using OrderedDict:
>>> from collections import OrderedDict
>>> with open('abc') as f:
...     dic = OrderedDict()
...     for line in f:
...         key, val = line.split(None, 1)
...         dic[key] = tuple(val.split())
...
>>> dic
OrderedDict([('str11', ('str12', 'str13')),
             ('str21', ('str22',)),
             ('str31', ('str32', 'str33', 'str34'))])
Using a StringIO instance for simplicity:
import io
fobj = io.StringIO("""str11 str12 str13
str21 str22
str31 str32 str33 str34""")
One line does the trick:
>>> {line.split(None, 1)[0]: tuple(line.split()[1:]) for line in fobj}
{'str11': ('str12', 'str13'),
 'str21': ('str22',),
 'str31': ('str32', 'str33', 'str34')}
Note the line.split(None, 1). This limits the splitting to one item because we have to use .split() twice in a dict comprehension. We cannot store intermediate results for reuse as in a loop. The None means split at any whitespace.
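On Python 3.8+ you could instead keep the intermediate split in an assignment expression, avoiding the double .split(); a sketch, assuming a fresh fobj (the one above has already been consumed):

>>> {(parts := line.split())[0]: tuple(parts[1:]) for line in fobj}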
For an OrderedDict you can also get away with one line using a generator expression:
from collections import OrderedDict
>>> OrderedDict((line.split(None, 1)[0], tuple(line.split()[1:]))
...             for line in fobj)
OrderedDict([('str11', ('str12', 'str13')), ('str21', ('str22',)),
             ('str31', ('str32', 'str33', 'str34'))])

Python strings / match case

I have a CSV file which has the following format:
id,case1,case2,case3
Here is a sample:
123,null,X,Y
342,X,X,Y
456,null,null,null
789,null,null,X
For each line I need to know which of the cases is not null. Is there an easy way to find out which case(s) are not null without splitting the string and going through each element?
This is what the result should look like:
123,case2:case3
342,case1:case2:case3
456:None
789:case3
You probably want to take a look at the csv module, which has readers and writers that will enable you to create transforms.
>>> from StringIO import StringIO
>>> from csv import DictReader
>>> fh = StringIO("""
... id,case1,case2,case3
... 123,null,X,Y
... 342,X,X,Y
... 456,null,null,null
... 789,null,null,X
... """.strip())
>>> dr = DictReader(fh)
>>> dr.next()
{'case1': 'null', 'case3': 'Y', 'case2': 'X', 'id': '123'}
At which point you can do something like:
>>> from csv import DictWriter
>>> out_fh = StringIO()
>>> writer = DictWriter(out_fh, fieldnames=dr.fieldnames)
>>> for mapping in dr:
...     writer.writerow(dict((k, v) for k, v in mapping.items() if v != 'null'))
...
Replace out_fh with the filehandle that you'd like to output to.
Any way you slice it, you are still going to have to go through the list. There are more and less elegant ways to do it. Depending on the Python version you are using, you can use a list comprehension:
ids = line.split(",")
print "%s:%s" % (ids[0], ":".join(["case%d" % x for x in range(1, len(ids)) if ids[x] != "null"]))
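Putting that together for the whole file, a Python 3 sketch (the file name cases.csv is hypothetical, and the next(f) assumes a header row like id,case1,case2,case3):

with open("cases.csv") as f:
    next(f)  # skip the header row; drop this line if there is none
    for line in f:
        ids = line.strip().split(",")
        cases = ":".join("case%d" % x for x in range(1, len(ids)) if ids[x] != "null")
        print("%s:%s" % (ids[0], cases or "None"))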
Why do you treat splitting as a problem? For performance reasons?
Literally, you could avoid splitting with smart regexps, like:
\d+,null,\w+,\w+
\d+,\w+,null,\w+
...
but I find that a worse solution than re-parsing the data into lists.
You could use the Python csv module, which comes with the standard installation of Python... It will not be much easier, though...
