I have a CSV file which has the following format:
id,case1,case2,case3
Here is a sample:
123,null,X,Y
342,X,X,Y
456,null,null,null
789,null,null,X
For each line I need to know which of the cases is not null. Is there an easy way to find out which case(s) are not null without splitting the string and going through each element?
This is what the result should look like:
123,case2:case3
342,case1:case2:case3
456,None
789,case3
You probably want to take a look at the csv module, which has readers and writers that will enable you to create this kind of transform.
>>> from StringIO import StringIO
>>> from csv import DictReader
>>> fh = StringIO("""
... id,case1,case2,case3
... 123,null,X,Y
... 342,X,X,Y
... 456,null,null,null
... 789,null,null,X
... """.strip())
>>> dr = DictReader(fh)
>>> dr.next()
{'case1': 'null', 'case3': 'Y', 'case2': 'X', 'id': '123'}
At which point you can do something like:
>>> from csv import DictWriter
>>> out_fh = StringIO()
>>> writer = DictWriter(out_fh, fieldnames=dr.fieldnames)
>>> for mapping in dr:
...     writer.writerow(dict((k, v) for k, v in mapping.items() if v != 'null'))
...
The last bit is only lightly tested, but dr.fieldnames is a real attribute of DictReader. Replace out_fh with the file handle that you'd like to output to.
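Putting it together, here is a minimal runnable sketch of the whole transform (assuming Python 3, and assuming None should be printed when every case is null):

import csv
from io import StringIO

# Sample data standing in for the real file.
data = '''id,case1,case2,case3
123,null,X,Y
342,X,X,Y
456,null,null,null
789,null,null,X'''

for row in csv.DictReader(StringIO(data)):
    # Keep the names of the columns whose value is not 'null'.
    cases = [name for name in ('case1', 'case2', 'case3') if row[name] != 'null']
    print('%s,%s' % (row['id'], ':'.join(cases) or 'None'))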
Any way you slice it, you are still going to have to go through the list. There are more and less elegant ways to do it. Depending on the Python version you are using, you can use a list comprehension:
# for each data line of the file:
ids = line.split(",")
print "%s,%s" % (ids[0], ":".join(["case%d" % x for x in range(1, len(ids)) if ids[x] != "null"]) or "None")
Why do you treat splitting as a problem? For performance reasons?
You could avoid splitting with smart regexps, like:
\d+,null,\w+,\w+
\d+,\w+,null,\w+
...
but I find it a worse solution than reparsing the data into lists.
You could use the Python csv module, which comes with the standard installation of Python... it will not be much easier, though.
Related
Good evening. I want to create a list while reading a text file (historique.txt) which contains the list of files associated with each task id. Consider the following example: my text file contains these lines:
4,file1
4,file2
5,file1
5,file3
5,file4
6,file3
6,file4
(To explain the content of the text file: 4 is a task id and file1 is a file used by idtask=4, so basically, task 4 used file1 and file2.)
I want to obtain list Transactions=[[file1,file2],[file1,file3,file4],[file3,file4]]
Any help is appreciated, thank you.
This will not work if the input file is not ordered
Exactly the same idea as @mad_'s answer, just showing the benefit of turning file_data_list into a list of lists instead of a list of strings. We only need to .split each line once, which is more readable and probably a bit faster as well.
Note that this can also be done while reading the file instead of after-the-fact like I show below.
from itertools import groupby

file_data_list = ['4,file1',
                  '4,file2',
                  '5,file1',
                  '5,file3',
                  '5,file4',
                  '6,file3',
                  '6,file4']

file_data_list = [line.split(',') for line in file_data_list]

for k, v in groupby(file_data_list, key=lambda x: x[0]):
    print([x[1] for x in v])  # also no need to convert v to list
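If the input might not already be grouped by id (see the caveat above), a simple fix, sketched here with deliberately unordered input, is to sort by id before grouping:

from itertools import groupby

file_data_list = [line.split(',') for line in ['5,file3', '4,file1', '4,file2', '5,file4']]
# Sorting by id first makes groupby see each id as one contiguous group.
file_data_list.sort(key=lambda x: x[0])
for k, v in groupby(file_data_list, key=lambda x: x[0]):
    print([x[1] for x in v])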
After reading from the file, e.g. f.readlines(), you will get a list similar to the one below:
file_data_list = ['4,file1',
                  '4,file2',
                  '5,file1',
                  '5,file3',
                  '5,file4',
                  '6,file3',
                  '6,file4']
Apply groupby
from itertools import groupby
for k, v in groupby(file_data_list, key=lambda x: x.split(",")[0]):
    print([i.split(",")[1] for i in list(v)])
Output
['file1', 'file2']
['file1', 'file3', 'file4']
['file3', 'file4']
You can also create a mapping dict:
for k, v in groupby(file_data_list, key=lambda x: x.split(",")[0]):
    print({k: [i.split(",")[1] for i in list(v)]})
Output
{'4': ['file1', 'file2']}
{'5': ['file1', 'file3', 'file4']}
{'6': ['file3', 'file4']}
As pointed out by @DeepSpace, the above solution will work only if the ids are ordered. Modifying it to handle unordered input:
from collections import defaultdict

d = defaultdict(list)

file_data_list = ['4,file1',
                  '4,file2',
                  '5,file1',
                  '5,file3',
                  '5,file4',
                  '6,file3',
                  '6,file4',
                  '4,file3']

for k, v in groupby(file_data_list, key=lambda x: x.split(",")[0]):
    for i in list(v):
        d[k].append(i.split(",")[1])

print(d)
Output
defaultdict(list,
{'4': ['file1', 'file2', 'file3'],
'5': ['file1', 'file3', 'file4'],
'6': ['file3', 'file4']})
We can use the csv module to process the lines into lists of values.
csv reads from a file-like object, which we can fake using StringIO for an example:
>>> from io import StringIO
>>> contents = StringIO('''4,file1
... 4,file2
... 5,file1
... 5,file3
... 5,file4
... 6,file3
... 6,file4''')
Just to note: depending upon the version of Python you are using you might need to import StringIO differently. The above code works for Python 3. For Python 2, replace the import with from StringIO import StringIO.
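If you need a single code path that works on both versions, here is one hedged sketch:

import sys

if sys.version_info[0] >= 3:
    from io import StringIO        # Python 3: text-mode StringIO
else:
    from StringIO import StringIO  # Python 2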
csv.reader returns an iterable object. We can consume the whole thing into a list, just to see how it works. Later we will instead iterate over the reader object one line at a time.
We can use pprint to see the results nicely formatted:
>>> import csv
>>> lines = list(csv.reader(contents))
>>> from pprint import pprint
>>> pprint(lines)
[['4', 'file1'],
['4', 'file2'],
['5', 'file1'],
['5', 'file3'],
['5', 'file4'],
['6', 'file3'],
['6', 'file4']]
These lists can then be unpacked into a task and filename:
>>> task, filename = ['4', 'file1']
>>> task
'4'
>>> filename
'file1'
We want to build lists of filenames, keyed by task.
To organise this efficiently we can use a dictionary. The efficiency comes from the fact that we can ask the dictionary for the list of values belonging to a given key: Python dictionaries are hash tables, so a key lookup takes roughly constant time, which is much quicker than a linear search.
The first time we look to add a value to the dictionary for a particular key, we would need to check to see whether it already exists.
If not we would add an empty list and append the new value to it. Otherwise we would just add the value to the existing list for the given key.
This pattern is so common that Python's builtin dictionary has a method dict.setdefault to help us achieve this.
However, I don't like the name, or the non-uniform syntax. You can read the documentation if you like, but I'd rather use
Python's defaultdict instead. This automatically creates a default value for a key if it doesn't already exist when you query it.
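For comparison, here is a quick sketch of the setdefault version, so you can see what I mean:

d = {}
# setdefault returns the existing list for the key, or inserts
# and returns a fresh empty list if the key is missing.
d.setdefault('5', []).append('file1')
d.setdefault('5', []).append('file3')
print(d)  # {'5': ['file1', 'file3']}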
We create a defaultdict with a list as default:
>>> from collections import defaultdict
>>> d = defaultdict(list)
Then for any new key it will create an empty list for us:
>>> d['5']
[]
We can append to the list:
>>> d['5'].append('file1')
>>> d['7'].append('file2')
>>> d['7'].append('file3')
I'll convert the defaultdict to a dict just to make it pprint more nicely:
>>> pprint(dict(d), width=30)
{'5': ['file1'],
'7': ['file2', 'file3']}
So, putting all this together:
import csv
from collections import defaultdict
from io import StringIO
from pprint import pprint
contents = StringIO('''4,file1
4,file2
5,file1
5,file3
5,file4
6,file3
6,file4''')
task_transactions = defaultdict(list)
for row in csv.reader(contents):
    task, filename = row
    task_transactions[task].append(filename)
pprint(dict(task_transactions))
Output:
{'4': ['file1', 'file2'],
'5': ['file1', 'file3', 'file4'],
'6': ['file3', 'file4']}
Some final notes: In the example we've used StringIO to fake the file contents. You'll probably want to replace that in your actual code with something like:
with open('historique.txt') as contents:
    for row in csv.reader(contents):
        ...  # etc
Also, where we take each row from the csv reader, and then unpack it into a task and filename, we could do that all in one go:
for task, filename in csv.reader(contents):
So your whole code (without printing) would be quite simple:
import csv
from collections import defaultdict
task_transactions = defaultdict(list)
with open('historique.txt') as contents:
    for task, filename in csv.reader(contents):
        task_transactions[task].append(filename)
If you want a list of transactions (as you asked in the question!):
transactions = list(task_transactions.values())
However, this may not be in the same order as the tasks in the original file. If that's important to you, clarify the question and comment so I can help.
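If order does matter, one option (a sketch, assuming first-seen task order is what you want) is collections.OrderedDict, which preserves insertion order even on Python versions where plain dicts do not:

from collections import OrderedDict

rows = [('4', 'file1'), ('4', 'file2'), ('5', 'file1'),
        ('5', 'file3'), ('5', 'file4'), ('6', 'file3'), ('6', 'file4')]
task_transactions = OrderedDict()
for task, filename in rows:
    # setdefault inserts an empty list the first time a task appears.
    task_transactions.setdefault(task, []).append(filename)
transactions = list(task_transactions.values())
print(transactions)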
An alternate solution without using itertools.groupby
(This solution does exactly what @mad_'s does; however, it is arguably more readable, especially for a beginner):
As @mad_ said, the read list will be as follows:
data=[
'4,file1',
'4,file2',
'5,file1',
'5,file3',
'5,file4',
'6,file3',
'6,file4']
You could loop over the data and create a dict:
from collections import defaultdict

transactions = defaultdict(list)
for element in data:  # each element is an "idtask,file" string
    idtask, filename = element.split(',')
    transactions[idtask].append(filename)
transactions will now contain the dictionary:
{'4': ['file1', 'file2'],
 '5': ['file1', 'file3', 'file4'],
 '6': ['file3', 'file4']}
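To get the Transactions list from the question, take the dictionary's values (note that the group order depends on your Python version; dicts preserve insertion order from Python 3.7 on):

Transactions = list(transactions.values())
# [['file1', 'file2'], ['file1', 'file3', 'file4'], ['file3', 'file4']]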
I have a string like this:
s = '1,2,"hello, there"'
And I want to turn it into a list:
[1,2,"hello, there"]
Normally I'd use split:
my_list = s.split(",")
However, that doesn't work if there's a comma in a string.
So, I've read that I need to use csv, but I don't really see how. I've tried:
from csv import reader
s = '1,2,"hello, there"'
ll = reader(s)
print ll
for row in ll:
    print row
Which writes:
<_csv.reader object at 0x020EBC70>
['1']
['', '']
['2']
['', '']
['hello, there']
I've also tried with
ll = reader(s, delimiter=',')
It behaves that way because you gave the csv reader a string: the reader iterates over its input, and iterating over a string yields one character at a time, so each character is treated as a line. If you do not want to use a file or a StringIO object, just wrap your string in a list as shown below.
>>> import csv
>>> s = ['1,2,"hello, there"']
>>> ll = csv.reader(s, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
>>> list(ll)
[['1', '2', 'hello, there']]
It sounds like you probably want to use the csv module. To use the reader on a string, you want a StringIO object.
As an example:
>>> import csv, StringIO
>>> print list(csv.reader(StringIO.StringIO(s)))
[['1', '2', 'hello, there']]
To clarify, csv.reader expects an iterable of lines (such as a file object), not a single string. So StringIO does the trick. However, if you're reading this csv from a file object (a typical use case), you can just as easily give the file object to the reader and it'll work the same way.
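For example, a sketch reading straight from a file (assuming a file named data.csv exists):

import csv

# An open file object is itself an iterable of lines, so the reader
# accepts it directly; no StringIO needed.
with open('data.csv') as f:
    for row in csv.reader(f):
        print(row)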
It's usually easier to reuse existing code than to reinvent the wheel... You just need to use the csv library properly. If you can't for some reason, you can always check out the source code and learn how the parsing is done there.
Example of parsing a single string into a list. Notice that the string is wrapped in a list.
>>> import csv
>>> s = '1,2,"hello, there"'
>>> list(csv.reader([s]))[0]
['1', '2', 'hello, there']
You can split first on the quote characters, then split on commas for every even index (the chunks that are outside quotes):
import itertools

s = '1,2,"hello, there"'
new_data = s.split('"')
for i in range(len(new_data)):
    if i % 2 == 1:
        # Odd indices are the quoted chunks; wrap each in a list.
        new_data[i] = [new_data[i]]
    else:
        # Even indices are outside the quotes; split on commas and drop
        # the empty pieces left next to the quotes.
        new_data[i] = [x for x in new_data[i].split(",") if x]
data = list(itertools.chain(*new_data))
Which goes something like:
'1,2,"hello, there"'
['1,2,', 'hello, there']
[['1', '2'], ['hello, there']]
['1', '2', 'hello, there']
But it's probably better to use the csv library if that's what you're working with.
You could also use ast.literal_eval if you want to preserve the integers:
>>> from ast import literal_eval
>>> literal_eval('[{}]'.format('1,2,"hello, there"'))
[1, 2, 'hello, there']
Here is how I dump to a file:
with open('es_hosts.json', 'w') as fp:
    json.dump(','.join(host_list.keys()), fp)
The result is:
"a,b,c"
I would like:
a,b,c
Thanks
Before doing a string replace, you might want to strip the quotation marks:
print '"a,b,c"'.strip('"')
Output:
a,b,c
That's closer to what you want to achieve. Even just removing the first and the last character works: '"a,b,c"'[1:-1].
But have you looked into this question?
To remove the quotation marks in the keys only, which may be important if you are parsing it later (presumably with some tolerant parser or maybe you just pipe it directly into node for bizarre reasons), you could try the following regex.
re.sub(r'(?<!: )"(\S*?)"', '\\1', json_string)
One issue is that this regex expects key/value pairs to be separated as key: value, so it will fail for key:value. You could make it work for the latter with a minor change, but it still won't handle variable amounts of whitespace after the colon.
There may be other edge cases, but it will work with the output of json.dumps; however, the results will not be parseable as JSON. More tolerant parsers like YAML might be able to read them.
import re
import json

regex = r'(?<!: )"(\S*?)"'

o = {"noquotes": 127, "put quotes here": "and here", "but_not": "there"}
s = json.dumps(o)
s2 = json.dumps(o, indent=3)

strip_s = re.sub(regex, '\\1', s)
strip_s2 = re.sub(regex, '\\1', s2)

print(strip_s)
print(strip_s2)

assert(json.loads(strip_s) == json.loads(s) == json.loads(strip_s2) == json.loads(s2) == o)
The assert will raise a ValueError, because the stripped strings are no longer valid JSON, but the prints show what you want.
Well, that's not valid json, so the json module won't help you to write that data. But you can do this:
import json
with open('es_hosts.json', 'w') as fp:
    data = ['a', 'b', 'c']
    fp.write(json.dumps(','.join(data)).replace('"', ''))
That's because you asked for json, but since that's not json, this should suffice:
with open('es_hosts.json', 'w') as fp:
    data = ['a', 'b', 'c']
    fp.write(','.join(data))
Use Python's built-in string replace function:
with open('es_hosts.json', 'w') as fp:
    json.dump(','.join(host_list.keys()).replace('\"', ''), fp)
Just use a for loop to print the list values one by one, without brackets.
import json

with open('json_file') as f:
    data = json.loads(f.read())

for value_wo_bracket in data['key_name']:
    print(value_wo_bracket)
Note that there is a difference between json.load and json.loads.
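A quick sketch of that difference, assuming Python 3 (json.loads parses a string, while json.load reads from a file-like object):

import json
from io import StringIO

s = '{"key_name": ["a", "b"]}'
print(json.loads(s))           # from a string
print(json.load(StringIO(s)))  # from a file-like object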
I am extracting some emails from a CSV file and then saving it to another CSV file.
The email variable should be in this format:
email = ['email#email.com'], ['email2#company.com'], ['email3#company2.com']
but in certain cases it will be returned as:
email = ['email#email.com', 'email2#email.com'], ['email3#email.com']
In certain rows it finds 2 emails, so that is when it is presented like this.
What would be an efficient way to change it?
The following should be quite efficient:
>>> import itertools
>>> data = [ ['email#email.com', 'email2#email.com'], ['email3#email.com'] ]
>>> [[i] for i in itertools.chain(*data)]
[['email#email.com'], ['email2#email.com'], ['email3#email.com']]
data = [ ['email#email.com', 'email2#email.com'], ['email3#email.com'] ]
def flatten(data):
    for item in data:
        if isinstance(item, basestring):
            yield [item]
        else:
            for i in item:
                yield [i]
or, if you want to support arbitrary levels of nesting:
def flatten(data):
    for item in data:
        if isinstance(item, basestring):
            yield [item]
        else:
            for i in flatten(item):
                yield i
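Either way, a quick usage sketch:

data = [['email#email.com', 'email2#email.com'], ['email3#email.com']]
print(list(flatten(data)))
# [['email#email.com'], ['email2#email.com'], ['email3#email.com']]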
If you only need a list of emails, without each element wrapped in a list (which seems more reasonable to me), the solution is much simpler:
import itertools
print list(itertools.chain.from_iterable(data))
If you are working with CSV files you may want to try the CSV module from the standard library.
http://docs.python.org/library/csv.html
Example:
$ cat > test.csv
['email#email.com', 'email2#email.com'], ['email3#email.com']
$ python
>>> import csv
>>> reader = csv.reader(open('test.csv', 'r'))
>>> for row in reader:
...     print row
...
["['email#email.com'", " 'email2#email.com']", " ['email3#email.com']"]
What I did there may not be what you want but if you look at the library you might find what you need.
I have
>>> import yaml
>>> yaml.dump(u'abc')
"!!python/unicode 'abc'\n"
But I want
>>> import yaml
>>> yaml.dump(u'abc', magic='something')
'abc\n'
What magic param forces no tagging?
You can use safe_dump instead of dump. Just keep in mind that it won't be able to represent arbitrary Python objects then. Also, when you load the YAML, you will get a str object instead of unicode.
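For example (output from a Python 2 session; the trailing '...' is YAML's end-of-document marker for a top-level scalar):

>>> import yaml
>>> yaml.safe_dump(u'abc')
'abc\n...\n'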
How about this:
def unicode_representer(dumper, uni):
    node = yaml.ScalarNode(tag=u'tag:yaml.org,2002:str', value=uni)
    return node

yaml.add_representer(unicode, unicode_representer)
This seems to make dumping unicode objects work the same as dumping str objects for me (Python 2.6).
In [72]: yaml.dump(u'abc')
Out[72]: 'abc\n...\n'
In [73]: yaml.dump('abc')
Out[73]: 'abc\n...\n'
In [75]: yaml.dump(['abc'])
Out[75]: '[abc]\n'
In [76]: yaml.dump([u'abc'])
Out[76]: '[abc]\n'
You need a new dumper class that does everything the standard Dumper class does but overrides the representers for str and unicode.
from yaml.dumper import Dumper
from yaml.representer import SafeRepresenter
class KludgeDumper(Dumper):
    pass

KludgeDumper.add_representer(str,
    SafeRepresenter.represent_str)
KludgeDumper.add_representer(unicode,
    SafeRepresenter.represent_unicode)
Which leads to
>>> print yaml.dump([u'abc',u'abc\xe7'],Dumper=KludgeDumper)
[abc, "abc\xE7"]
>>> print yaml.dump([u'abc',u'abc\xe7'],Dumper=KludgeDumper,encoding=None)
[abc, "abc\xE7"]
Granted, I'm still stumped on how to keep this pretty.
>>> print u'abc\xe7'
abcç
And it breaks a later yaml.load()
>>> yy=yaml.load(yaml.dump(['abc','abc\xe7'],Dumper=KludgeDumper,encoding=None))
>>> yy
['abc', 'abc\xe7']
>>> print yy[1]
abc�
>>> print u'abc\xe7'
abcç
A little addition to interjay's excellent answer: you can keep your unicode on a reload if you take care of your file encodings.
# -*- coding: utf-8 -*-
import yaml
import codecs

data = dict(key=u"abcç\U0001F511")
fn = "test2.yaml"

with codecs.open(fn, "w", encoding="utf-8") as fo:
    yaml.safe_dump(data, fo)
with codecs.open(fn, encoding="utf-8") as fi:
    data2 = yaml.safe_load(fi)

print("data2:", data2, "type(data.key):", type(data2.get("key")))
print data2.get("key")
test2.yaml contents in my editor:
{key: "abc\xE7\uD83D\uDD11"}
print outputs:
('data2:', {'key': u'abc\xe7\U0001f511'}, 'type(data.key):', <type 'unicode'>)
abcç🔑
Plus, after reading http://nedbatchelder.com/blog/201302/war_is_peace.html I am pretty sure that safe_load/safe_dump is where I want to be anyway.
I've just started with Python and YAML, but this may also help. Just compare the outputs:
def test_dump(self):
    print yaml.dump([{'name': 'value'}, {'name2': 1}], explicit_start=True)
    print yaml.dump_all([{'name': 'value'}, {'name2': 1}])