I have a CSV file with the following structure:
Id,Country,Cities
1,Canada,"['Toronto','Ottawa','Montreal']"
2,Italy,"['Rome','Milan','Naples', 'Palermo']"
3,France,"['Paris','Cannes','Lyon']"
4,Spain,"['Seville','Alicante','Barcelona']"
The last column contains a list, but it is represented as a string so that it is treated as a single element. When parsing the file, I need to have this element as a list, not a string. So far I've found the way to convert it:
import ast
L = "['Toronto','Ottawa','Montreal']"
seq = ast.literal_eval(L)
Since I'm a newbie in Python, my question is: is this the normal way of doing this, is there a proper way to represent lists in CSV so that I don't have to do conversions, or is there a simpler way to convert?
Thanks!
Using ast.literal_eval(...) will work, but it requires special syntax that other CSV-reading software won't recognize, and it relies on an eval-style function, which is a red flag.
Using eval can be dangerous, even though in this case you're using the safer literal_eval option which is more restrained than the raw eval function.
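To illustrate what "more restrained" means here, the sketch below passes literal_eval a list literal and then an expression that contains a function call. The helper name parse_literal is made up for this example.

```python
import ast

def parse_literal(text):
    """Return the parsed literal, or None if text is not a plain literal."""
    try:
        return ast.literal_eval(text)
    except (ValueError, SyntaxError):
        # literal_eval refuses anything beyond literals (calls, names, attribute access)
        return None

cities = parse_literal("['Toronto','Ottawa','Montreal']")
rejected = parse_literal("__import__('os').getcwd()")
print(cities)    # ['Toronto', 'Ottawa', 'Montreal']
print(rejected)  # None
```

A plain eval would have executed the second string instead of rejecting it.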
Usually what you'll see in CSV files that have many values in a single column is that they'll use a simple delimiter and quote the field.
For instance:
ID,Country,Cities
1,Canada,"Toronto;Ottawa;Montreal"
Then in python, or any other language, it becomes trivial to read without having to resort to eval:
import csv
with open("data.csv") as fobj:
    reader = csv.reader(fobj)
    field_names = next(reader)
    rows = []
    for row in reader:
        row[-1] = row[-1].split(";")
        rows.append(row)
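For completeness, here is a minimal sketch of the writing side, followed by reading the data back. It assumes that no city name contains a semicolon, and it uses an in-memory io.StringIO buffer instead of a real file so that it runs on its own.

```python
import csv
import io

rows = [
    [1, "Canada", ["Toronto", "Ottawa", "Montreal"]],
    [2, "Italy", ["Rome", "Milan", "Naples", "Palermo"]],
]

# Write: join each city list into a single ';'-delimited field
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["ID", "Country", "Cities"])
for row_id, country, cities in rows:
    writer.writerow([row_id, country, ";".join(cities)])

# Read: split the last field back into a list
buffer.seek(0)
reader = csv.reader(buffer)
field_names = next(reader)
parsed = [row[:-1] + [row[-1].split(";")] for row in reader]
print(parsed)
```

Note that the ID comes back as the string '1', because the csv module does no type conversion.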
Issues with ast.literal_eval
Even though the ast.literal_eval function is much safer than using a regular eval on user input, it still might be exploitable. The documentation for literal_eval has this warning:
Warning: It is possible to crash the Python interpreter with a sufficiently large/complex string due to stack depth limitations in Python’s AST compiler.
A demonstration of this can be found here:
>>> import ast
>>> ast.literal_eval("()" * 10 ** 6)
[1] 48513 segmentation fault python
I'm definitely not an expert, but giving a user the ability to crash a program and potentially exploit some obscure memory vulnerability is bad, and in this use-case can be avoided.
If the reason you want to use literal_eval is to get proper typing, and you're positive that the input data is 100% trusted, then I suppose it's fine to use. But, you could always wrap the function to perform some sanity checks:
import ast

def sanely_eval(value: str, max_size: int = 100_000) -> object:
    # Refuse oversized input before handing it to the parser
    if len(value) > max_size:
        raise ValueError(f"len(value) is greater than the max_size={max_size!r}")
    return ast.literal_eval(value)
But, depending on how you're creating and using the CSV files, this may make the data less portable, since it's a python-specific format.
If you can control the CSV, you could separate the items with some other known character that isn't going to appear in a city name and isn't a comma, say a colon (:).
Then row one, for example, would look like this:
1,Canada,Toronto:Ottawa:Montreal
When it comes to processing the data, you'll have that whole element, and you can just do
cities.split(':')
If you want to go the other way (you have the cities in a Python list, and you want to create this string) you can use join()
':'.join(['Toronto', 'Ottawa', 'Montreal'])
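A small round-trip sketch of this idea follows. The helper names join_items and split_items are made up for illustration, and the check for the delimiter inside an item is an added safeguard rather than something the format requires.

```python
DELIMITER = ":"

def join_items(items, delimiter=DELIMITER):
    """Join items into one field, refusing items that contain the delimiter."""
    for item in items:
        if delimiter in item:
            raise ValueError(f"{item!r} contains the delimiter {delimiter!r}")
    return delimiter.join(items)

def split_items(field, delimiter=DELIMITER):
    """Split a field back into a list; an empty field means no items."""
    return field.split(delimiter) if field else []

field = join_items(["Toronto", "Ottawa", "Montreal"])
print(field)               # Toronto:Ottawa:Montreal
print(split_items(field))  # ['Toronto', 'Ottawa', 'Montreal']
```

The empty-field case is handled separately because "".split(":") returns [''] rather than an empty list.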
For the specific structure of the csv, you could convert cities to list like this:
cities = '''"['Rome','Milan','Naples', 'Palermo']"'''
cities = cities[2:-2] # remove "[ and ]"
print(cities) # 'Rome','Milan','Naples', 'Palermo'
cities = cities.split(',') # convert to list
print(cities) # ["'Rome'", "'Milan'", "'Naples'", " 'Palermo'"]
cities = [x.strip() for x in cities] # remove leading or following spaces (if exists)
print(cities) # ["'Rome'", "'Milan'", "'Naples'", "'Palermo'"]
cities = [x[1:-1] for x in cities] # remove quotes '' from each city
print(cities) # ['Rome', 'Milan', 'Naples', 'Palermo']
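The same steps can be folded into one helper. This is only a sketch under the same assumption as the steps above: city names contain neither commas nor quote characters. The name parse_cities is made up for this example.

```python
def parse_cities(field):
    """Parse a stringified list of single-quoted names into a real list."""
    inner = field.strip().strip('"')  # drop surrounding double quotes, if any
    inner = inner.strip("[]")         # drop the square brackets
    # split on commas, then trim spaces and the single quotes around each name
    return [part.strip().strip("'") for part in inner.split(",") if part.strip()]

print(parse_cities('''"['Rome','Milan','Naples', 'Palermo']"'''))
# ['Rome', 'Milan', 'Naples', 'Palermo']
```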
I am writing a minor OP5 plugin in Python 2.7 (version is out of my hands) that iterates over a multidimensional list that verifies fallback zip downloads have gone as they should.
Up until now I have put each host with their IP address in a multidimensional list looking like (cut short for brevity):
fallback = [
["host1", "192.168.1.3"],
["host2", "192.168.15.59"]
]
...and so on.
This lets me iterate through fallback[i] and use that along with fallback[i][1] for the IP address. The rest of the script uses both of these pieces of information for various tasks and string manipulations. The script as it is now is mechanically sound but relies on the availability of these indexes.
There is, however, a hidden file (.fallbackinfo) containing the same information for another script, but it is written for Perl, as is the script that uses that file as a source.
The file looks like this:
#hosts = (
["host1", "192.168.1.3", "type of firmware", "subfolder"],
["host2", "192.168.15.59", "type of firmware", "subfolder"],
);
I wish to import this into an iterable multidimensional list in my Python script, but am getting incredibly stuck.
My current attempt is the closest I have gotten:
with open("/home/runninguser/.fallbackinfo") as f:
    lines = []
    for line in f:
        lines.append(line.rstrip().strip())
    fallback = lines[1:len(lines)-1]
This has successfully made the list look as I want it, but all lines get imported as str objects. I have attempted to use list() to force each object to become a list, but most of the time that turns each character in the line into a separate list element instead. The network in question is cut off from internet access, so I have to rely on built-in modules. My interpretation is that since it is formatted as a list, it should somehow be possible to interpret it as a list.
Can this be done at all, and if so, how?
You can use the json package (built-in) to achieve this:
import json
with open("/home/runninguser/.fallbackinfo") as f:
    # For each line
    for line in f:
        # If the line starts with a bracket (startswith also copes with blank lines)
        if line.strip().startswith("["):
            # Print the line after removing spaces in front and the comma in the back
            # and converting it into a list
            print(json.loads(line.strip().rstrip(",")))
If you now use the type() function, you will see that the list-formatted strings have become list objects (shown as <type 'list'> on Python 2.7 and <class 'list'> on Python 3).
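To end up with the iterable list the question asks for, the parsed lines can be collected instead of printed. The sketch below reads from an in-memory stand-in for .fallbackinfo so that it is self-contained. It is written for Python 3, but the loop body only uses json and string methods that also exist in 2.7.

```python
import io
import json

# Stand-in for the contents of .fallbackinfo (same layout as in the question)
sample = io.StringIO(
    '#hosts = (\n'
    '["host1", "192.168.1.3", "type of firmware", "subfolder"],\n'
    '["host2", "192.168.15.59", "type of firmware", "subfolder"],\n'
    ');\n'
)

fallback = []
for line in sample:
    entry = line.strip().rstrip(",")
    if entry.startswith("["):
        fallback.append(json.loads(entry))

# Each element is now a real list, so indexing works as before
for row in fallback:
    print(row[0], row[1])
```

This only works because each Perl array row happens to be valid JSON (double-quoted strings inside square brackets); single-quoted Perl strings would need a different approach.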
I have to read a file that has always the same format.
Since I know it has the same format, I can use readline() and tokenize. But I guess there is a way to read it that is, how to say it, more "pretty to the eyes".
The file I have to read has this format :
Nom NMS-01
MAC AAAAAAAAAAA
UDPport 2019
TCPport 9129
I just want a different way to read it without having to tokenize, if that is possible.
Your question seems to imply that "tokenizing" is some kind of mysterious and complicated process. But in fact, the thing you are trying to do is exactly tokenizing.
Here is a perfectly valid way to read the file you show, break it up into tokens, and store it in a data structure:
def read_file_data(data_file_path):
    result = {}
    with open(data_file_path) as data_file:
        for line in data_file:
            # Drop the trailing newline so the stored value is clean
            key, value = line.rstrip('\n').split(' ', maxsplit=1)
            result[key] = value
    return result
That wasn't complicated, it wasn't a lot of code, it doesn't need a third-party library, and it's easy to work with:
data = read_file_data('path/to/file')
print(data['Nom']) # prints "NMS-01"
Now, this implementation makes many assumptions about the structure of the file. Among other things, it assumes:
The entire file is structured as key/value pairs
Each key/value pair fits on a single line
Every line in the file is a key/value pair (no comments or blank lines)
The key cannot contain space characters
The value cannot contain newline characters
The same key does not appear multiple times in the file (or, if it does, it is acceptable for the last value given to be the only one returned)
Some of these assumptions may be false, but they are all true for the data sample you provided.
More generally: if you want to parse some kind of structured data, you need to understand the structure of the data and how values are delimited from each other. That's why common structured data formats like XML, JSON, and YAML (among many others!) were invented. Once you know the language you are parsing, tokenization is simply the code you write to match up the language with the text of your input.
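As a sketch of how some of the assumptions listed earlier could be relaxed or enforced explicitly, the variant below skips blank lines and raises an error on malformed lines and duplicate keys. The name parse_key_values and the choice to reject duplicates are illustrative decisions, not requirements of the file format.

```python
def parse_key_values(lines):
    """Parse 'key value' lines, skipping blanks and rejecting duplicate keys."""
    result = {}
    for number, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line:
            continue  # tolerate blank lines
        key, sep, value = line.partition(' ')
        if not sep:
            raise ValueError(f"line {number}: expected 'key value', got {line!r}")
        if key in result:
            raise ValueError(f"line {number}: duplicate key {key!r}")
        result[key] = value.strip()
    return result

sample = ["Nom NMS-01\n", "\n", "MAC AAAAAAAAAAA\n", "UDPport 2019\n", "TCPport 9129\n"]
config = parse_key_values(sample)
print(config["UDPport"])  # 2019
```

The function takes any iterable of lines, so an open file object can be passed in directly.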
Pandas does many magical things, so maybe that is prettier for you?
import pandas as pd
pd.read_csv('input.txt',sep = ' ',header=None,index_col=0)
This gives you a dataframe that you can manipulate further:
                   1
0
Nom           NMS-01
MAC      AAAAAAAAAAA
UDPport         2019
TCPport         9129
Issue
The code does not correctly identify the input (item). It simply dumps to my failure message even if such a value exists in the CSV file. Can anyone help me determine what I am doing wrong?
Background
I am working on a small program that asks for user input (function not given here), searches a specific column in a CSV file (Item) and returns the entire row. The CSV data format is shown below. I have shortened the data from the actual amount (49 field names, 18000+ rows).
Code
import csv
from collections import namedtuple
from contextlib import closing
def search():
    item = 1000001
    raw_data = 'active_sanitized.csv'
    failure = 'No matching item could be found with that item code. Please try again.'
    check = False
    with closing(open(raw_data, newline='')) as open_data:
        read_data = csv.DictReader(open_data, delimiter=';')
        item_data = namedtuple('item_data', read_data.fieldnames)
        while check == False:
            for row in map(item_data._make, read_data):
                if row.Item == item:
                    return row
                else:
                    return failure
CSV structure
active_sanitized.csv
Item;Name;Cost;Qty;Price;Description
1000001;Name here:1;1001;1;11;Item description here:1
1000002;Name here:2;1002;2;22;Item description here:2
1000003;Name here:3;1003;3;33;Item description here:3
1000004;Name here:4;1004;4;44;Item description here:4
1000005;Name here:5;1005;5;55;Item description here:5
1000006;Name here:6;1006;6;66;Item description here:6
1000007;Name here:7;1007;7;77;Item description here:7
1000008;Name here:8;1008;8;88;Item description here:8
1000009;Name here:9;1009;9;99;Item description here:9
Notes
My experience with Python is relatively little, but I thought this would be a good problem to start with in order to learn more.
I determined the methods to open (and wrap in a close function) the CSV file, read the data via DictReader (to get the field names), and then create a named tuple to be able to quickly select the desired columns for the output (Item, Cost, Price, Name). Column order is important, hence the use of DictReader and namedtuple.
While there is the possibility of hard-coding each of the field names, I felt that if the program can read them on file open, it would be much more helpful when working on similar files that have the same column names but different column organization.
Research
CSV Header and named tuple:
What is the pythonic way to read CSV file data as rows of namedtuples?
Converting CSV data to tuple: How to split a CSV row so row[0] is the name and any remaining items are a tuple?
There were additional links of research, but I cannot post more than two.
You have three problems with this:
1. You return on the first failure, so it will never get past the first line.
2. You are reading strings from the file, and comparing to an int.
3. _make iterates over the dictionary keys, not the values, producing the wrong result (item_data(Item='Name', Name='Price', Cost='Qty', Qty='Item', Price='Cost', Description='Description')).
for row in (item_data(**data) for data in read_data):
    if row.Item == str(item):
        return row
return failure
This fixes the issues at hand - we check against a string, and we only return if none of the items matched (although you might want to begin converting the strings to ints in the data rather than this hackish fix for the string/int issue).
I have also changed the way you are looping - using a generator expression makes for a more natural syntax, using the normal construction syntax for named attributes from a dict. This is cleaner and more readable than using _make and map(). It also fixes problem 3.
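Putting the pieces together, a self-contained sketch of the corrected search might look like this. It takes the item code and an open file object as parameters and returns None when nothing matches. Both are choices made for this example so that it can run against an in-memory sample instead of active_sanitized.csv.

```python
import csv
import io
from collections import namedtuple

SAMPLE = """Item;Name;Cost;Qty;Price;Description
1000001;Name here:1;1001;1;11;Item description here:1
1000002;Name here:2;1002;2;22;Item description here:2
"""

def search(item, open_data):
    """Return the first row whose Item column equals item, or None."""
    read_data = csv.DictReader(open_data, delimiter=';')
    item_data = namedtuple('item_data', read_data.fieldnames)
    for data in read_data:
        row = item_data(**data)
        if row.Item == str(item):  # the CSV yields strings, so compare as strings
            return row
    return None  # only reached once every row has been checked

match = search(1000002, io.StringIO(SAMPLE))
print(match.Name if match else "no match")  # Name here:2
```

Building the named tuple with ** only works while every column header is a valid Python identifier, which holds for the headers shown in the question.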
I need to send an array of namedtuples over a socket.
To create the array of namedtuples I use the following:
listaPeers=[]
for i in range(200):
    ipPuerto=collections.namedtuple('ipPuerto', 'ip, puerto')
    ipPuerto.ip="121.231.334.22"
    ipPuerto.puerto="8988"
    listaPeers.append(ipPuerto)
Now that it is filled, I need to pack "listaPeers[200]".
How can I do it?
Something like this?
packedData = struct.pack('XXXX',listaPeers)
First of all you are using namedtuple incorrectly. It should look something like this:
# ipPuerto is a type
ipPuerto=collections.namedtuple('ipPuerto', 'ip, puerto')
# theTuple is a tuple object
theTuple = ipPuerto("121.231.334.22", "8988")
As for packing, it depends on what you want to use on the other end. If the data will be read by Python, you can just use the pickle module.
import cPickle as Pickle
pickledTuple = Pickle.dumps(theTuple)
You can pickle the whole array of them at once.
It is not that simple. Yes, for integers and simple numbers, it is possible to pack straight from named tuples to data using the struct package.
However, you are holding your data as strings, not as numbers. Converting the port is simple, since it is a plain integer, but the IP requires some juggling.
import struct

def ipv4_from_str(ip_str):
    # Fold the four dotted parts into one 32-bit integer
    parts = ip_str.split(".")
    result = 0
    for part in parts:
        result <<= 8
        result += int(part)
    return result

def ip_puerto_gen(list_of_ips):
    # Yield IP and port alternately, both as integers
    for ip_puerto in list_of_ips:
        yield(ipv4_from_str(ip_puerto.ip))
        yield(int(ip_puerto.puerto))

def pack(list_of_ips):
    # Big-endian, two unsigned 32-bit integers per entry
    return struct.pack(">" + "II" * len(list_of_ips),
                       *ip_puerto_gen(list_of_ips)
                       )
And you then use the "pack" function from here to pack your structure as you seem to want.
But first, attend to the fact that you are creating your "listaPiers" incorrectly (your example code will simply fail with an IndexError): use an empty list, and the append method on it to insert new named tuples with ip/port pairs as each element:
import collections

listaPiers = []
ipPuerto=collections.namedtuple('ipPuerto', 'ip, puerto')
for x in range(200):
    new_element = ipPuerto("123.123.123.123", "8192")
    listaPiers.append(new_element)
data = pack(listaPiers)
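On the receiving end, the same ">II" layout can be reversed. The sketch below is one possible counterpart. The names ipv4_to_str and unpack are made up for this example, and it assumes the payload holds nothing but 8-byte entries.

```python
import collections
import struct

ipPuerto = collections.namedtuple('ipPuerto', 'ip, puerto')

def ipv4_to_str(value):
    """Turn a 32-bit integer back into dotted-quad notation."""
    return ".".join(str((value >> shift) & 0xFF) for shift in (24, 16, 8, 0))

def unpack(data):
    """Decode 8 bytes per entry: big-endian IP followed by big-endian port."""
    count = len(data) // 8
    numbers = struct.unpack(">" + "II" * count, data)
    return [ipPuerto(ipv4_to_str(numbers[i]), str(numbers[i + 1]))
            for i in range(0, len(numbers), 2)]

# Two entries packed by hand in the same layout, for demonstration
data = struct.pack(">IIII", 0x7B7B7B7B, 8192, 0xC0A80103, 8988)
peers = unpack(data)
print(peers[0])  # ipPuerto(ip='123.123.123.123', puerto='8192')
```

Because every entry has a fixed size, the receiver can work out the number of entries from the payload length alone.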
I seem to recall that pickle is considered insecure in server processes if the server process is receiving pickled data from untrusted clients.
You might want to come up with some sort of separator character(s) for the records and fields (perhaps \0 and \001 or \376 and \377). Then putting together a message is kind of like a text file broken up into records and fields separated by spaces and newlines. Or for that matter, you could use spaces and newlines, if your normal data doesn't include these.
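A minimal sketch of that separator idea, using \0 to terminate each record and \001 between fields. The function names are made up for this example, and it assumes the field values are ASCII and never contain either separator byte.

```python
FIELD_SEP = b"\x01"   # separates fields within a record
RECORD_SEP = b"\x00"  # terminates each record

def encode_records(records):
    """Serialize a list of (ip, port) string pairs into one bytes message."""
    return b"".join(
        FIELD_SEP.join(field.encode("ascii") for field in record) + RECORD_SEP
        for record in records
    )

def decode_records(message):
    """Split a message produced by encode_records back into tuples of strings."""
    chunks = message.split(RECORD_SEP)[:-1]  # drop the empty piece after the last terminator
    return [tuple(f.decode("ascii") for f in chunk.split(FIELD_SEP)) for chunk in chunks]

message = encode_records([("123.123.123.123", "8192"), ("192.168.1.3", "8988")])
print(decode_records(message))
```

Unlike pickle, decoding this format only ever produces strings, so malformed input cannot trigger code execution on the receiver.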
I find this module very valuable for framing data in socket-based protocols:
http://stromberg.dnsalias.org/~strombrg/bufsock.html
It lets you do things like "read up until the next null byte" or "read the next 10 characters" - without needing to worry about the complexities of IP aggregating or splitting packets.
This is my python file:-
TestCases-2
Input-5
Output-1,1,2,3,5
Input-7
Ouput-1,1,2,3,5,8,13
What I want is this:-
A variable test_no = 2 (No. of testcases)
A list testCaseInput = [5,7]
A list testCaseOutput = [[1,1,2,3,5],[1,1,2,3,5,8,13]]
I've tried doing it in this way:
testInput = testCase.readline(-10)
for i in range(0,int(testInput)):
    testCaseInput = testCase.readline(-6)
    testCaseOutput = testCase.readline(-7)
The next step would be to strip the numbers on the basis of (','), and then put them in a list.
Weirdly, readline(-6) is not giving the desired results.
Is there a better way to do this that I'm obviously missing?
I don't mind using serialization here but I want to make it very simple for someone to write a text file as the one I have shown and then take the data out of it. How to do that?
The argument to the readline method specifies the maximum number of bytes to read, and a negative value simply means "no limit", so readline(-6) returns the whole line. I don't think this is what you want to be doing.
Instead, it is simpler to pull everything into a list all at once with readlines():
with open('data.txt') as f:
    full_lines = f.readlines()

# parse full lines to get the text to right of "-"
lines = [line.partition('-')[2].rstrip() for line in full_lines]
numcases = int(lines[0])
for i in range(1, len(lines), 2):
    caseinput = lines[i]
    caseoutput = lines[i+1]
    ...
The idea here is to separate concerns (the source of the data, the parsing of '-', and the business logic of what to do with the cases). That is better than having a readline() and redundant parsing logic at every step.
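Carried through to the three variables the question asks for, the approach could be sketched as follows. The lines are given inline here rather than read from data.txt so that the example runs on its own.

```python
sample_lines = [
    "TestCases-2\n",
    "Input-5\n",
    "Output-1,1,2,3,5\n",
    "Input-7\n",
    "Output-1,1,2,3,5,8,13\n",
]

# Keep only the text to the right of the first "-"
lines = [line.partition('-')[2].rstrip() for line in sample_lines]

test_no = int(lines[0])
testCaseInput = []
testCaseOutput = []
for i in range(1, len(lines), 2):
    testCaseInput.append(int(lines[i]))
    testCaseOutput.append([int(n) for n in lines[i + 1].split(',')])

print(test_no, testCaseInput, testCaseOutput)
# 2 [5, 7] [[1, 1, 2, 3, 5], [1, 1, 2, 3, 5, 8, 13]]
```

Since partition('-') ignores the label before the dash, a misspelled label in the file does not affect the result.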
I'm not sure if I follow exactly what you're trying to do, but I guess I'd try something like this:
testCaseIn = []
testCaseOut = []
for line in testInput:
    if line.startswith("Input"):
        testCaseIn.append(giveMeAList(line.split("-")[1]))
    elif line.startswith("Output"):
        testCaseOut.append(giveMeAList(line.split("-")[1]))
where giveMeAList() is a function that takes a comma-separated list of numbers and generates a list from it.
I didn't test this code, but I've written stuff that uses this kind of structure when I've wanted to make configuration files in the past.
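One possible implementation of giveMeAList(), sketched under the assumption that every item is an integer:

```python
def giveMeAList(text):
    """Convert a comma-separated string such as '1,1,2,3,5' into a list of ints."""
    # int() tolerates surrounding whitespace, including a trailing newline
    return [int(piece) for piece in text.split(",") if piece.strip()]

print(giveMeAList("1,1,2,3,5\n"))  # [1, 1, 2, 3, 5]
print(giveMeAList("5"))            # [5]
```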
You can use regex for this and it makes it much easier. See question: python: multiline regular expression
For your case, try this:
import re
s = open("input.txt","r").read()
(inputs,outputs) = zip(*re.findall(r"Input-(?P<input>.*)\nOutput-(?P<output>.*)\n",s))
and then split(",") each output element as required
If you do it this way you get the benefit that you don't need the first line in your input file so you don't need to specify how many entries you have in advance.
You can also take away the unzip (that's the zip(*...) ) from the code above, and then you can deal with each input and output a pair at a time. My guess is that is in fact exactly what you are trying to do.
EDIT Wanted to give you the full example of what I meant just then. I'm assuming this is for a testing script so I would say use the power of the pattern matching iterator to help keep your code shorter and simpler:
for (input, output) in re.findall(r"Input-(?P<input>.*)\nOutput-(?P<output>.*)\n", s):
    expectedResults = output.split(",")
    testResults = runTest(input)
    # compare testResults and expectedResults ...
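Here is a runnable sketch of that loop. The run_test function is a hypothetical stand-in for whatever is being tested (here it returns the first n Fibonacci numbers, which matches the sample data), and the input text is given inline with "Output" spelled correctly on every line.

```python
import re

s = "Input-5\nOutput-1,1,2,3,5\nInput-7\nOutput-1,1,2,3,5,8,13\n"

def run_test(n):
    """Hypothetical function under test: the first n Fibonacci numbers."""
    a, b, result = 1, 1, []
    for _ in range(n):
        result.append(a)
        a, b = b, a + b
    return result

results = []
for test_input, output in re.findall(r"Input-(.*)\nOutput-(.*)\n", s):
    expected = [int(x) for x in output.split(",")]
    results.append(run_test(int(test_input)) == expected)

print(results)  # [True, True]
```

The pattern needs a newline after each Output line, so the last line of the file must end with one for the final pair to be matched.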
This line has an error:
Ouput-1,1,2,3,5,8,13 // it should be 'Output', not 'Ouput'
This should work:
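with open('in.txt', 'r') as testCase:
    # The first line holds the number of test cases
    testInput = int(testCase.readline().strip().replace("TestCases-", ""))
    for i in range(testInput):
        # strip() drops the trailing newline that readline() keeps
        testCaseInput = testCase.readline().strip().replace("Input-", "")
        testCaseOutput = testCase.readline().strip().replace("Output-", "").split(",")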