I'm trying to make a namedtuple from a DictReader object. The problem I'm struggling with is that the CSV file I'm working with has some really long and ugly column headers. For the sake of this example, one of the column headers is:
"What is typically the main dish at your Thanksgiving dinner?"
What is throwing me off is that there are a bunch of spaces in this title, so if I understand correctly, namedtuple treats each word as a separate field name. How would you recommend solving this? I have referenced several threads and feel like I almost got there through this one: What is the pythonic way to read CSV file data as rows of namedtuples?
I am just using one column header as an example. Here is some code I have so far:
import csv
import collections

filename = 'thanksgiving2015.csv'
with open(filename, 'r', encoding='utf-8') as f:
    reader = csv.DictReader(f)
    columns = collections.namedtuple(
        'columns',
        'What is typically the main dish at your Thanksgiving dinner?')
Should I strip the spaces from all these column headers before making the namedtuple? I could do this in Excel before even importing the CSV, but I assume there is a nicer solution in Python.
namedtuple treats a single string as a white-space-delimited list of field names. You need to pass an explicit list of column names instead.
namedtuple('columns', ['What is...', 'some other absurd column name'])
I would rethink using the header values directly as field names, though. Ignore the header, and pass a list of shorter names that you can use as attributes later.
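A minimal sketch of that idea (the short field name is hypothetical, and this assumes the Thanksgiving question is the file's only column):

import collections
import csv

Row = collections.namedtuple('Row', ['main_dish'])  # short name instead of the ugly header

with open('thanksgiving2015.csv', encoding='utf-8') as f:
    reader = csv.reader(f)
    next(reader)  # throw away the header row entirely
    rows = [Row(*line) for line in reader]

print(rows[0].main_dish)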
As chepner pointed out, the second argument of namedtuple() can be either a space-separated string or a list of strings, like:
columns = collections.namedtuple(
    'columns',
    ['What is typically the main dish at your Thanksgiving dinner?', 'other column'])
However, doing so will fail with:
ValueError: Type names and field names must be valid identifiers
This is because columns (which you should capitalize as Columns) will be a class whose field names are used as attribute identifiers, and identifiers can't contain spaces. To be clear, you would use it as:
Columns = namedtuple('columns', ['what is', 'this'])  # raises ValueError, but suppose it didn't
columns = Columns('foo', 'bar')
print(columns.this)     # works fine
print(columns.what is)  # SyntaxError: not going to work
If you were using a simple dict(), you would write:
print(columns['what is'])
You can, however, ask namedtuple to rename invalid identifiers:

Columns = namedtuple('columns', ['what is', 'this'], rename=True)
columns = Columns('foo', 'bar')
print(columns._0)   # 'what is' was renamed to _0: ugly but valid
print(columns.this)
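Applied back to the original question, a sketch that builds the namedtuple straight from the CSV header (assuming the file and header from the question):

import collections
import csv

with open('thanksgiving2015.csv', encoding='utf-8') as f:
    reader = csv.reader(f)
    Columns = collections.namedtuple('Columns', next(reader), rename=True)
    for row in map(Columns._make, reader):
        print(row._0)  # the long question header isn't a valid identifier, so it became _0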
Related
I have a csv file of such structure:
Id,Country,Cities
1,Canada,"['Toronto','Ottawa','Montreal']"
2,Italy,"['Rome','Milan','Naples', 'Palermo']"
3,France,"['Paris','Cannes','Lyon']"
4,Spain,"['Seville','Alicante','Barcelona']"
The last column contains a list, but it is represented as a string so that it is treated as a single element. When parsing the file, I need to have this element as a list, not a string. So far I've found the way to convert it:
import ast

L = "['Toronto','Ottawa','Montreal']"
seq = ast.literal_eval(L)
Since I'm a newbie in Python, my question is: is this the normal way of doing it, is there a better way to represent lists in CSV so that I don't have to do conversions, or is there a simpler way to convert?
Thanks!
Using ast.literal_eval(...) will work, but it relies on Python-specific syntax that other CSV-reading software won't recognize, and anything in the eval family is a red flag.
Even though literal_eval is much more restrained than the raw eval function, evaluating input data can still be dangerous (more on that below).
Usually what you'll see in CSV files that have many values in a single column is that they'll use a simple delimiter and quote the field.
For instance:
ID,Country,Cities
1,Canada,"Toronto;Ottawa;Montreal"
Then in Python, or any other language, it becomes trivial to read without having to resort to eval:

import csv

with open("data.csv") as fobj:
    reader = csv.reader(fobj)
    field_names = next(reader)
    rows = []
    for row in reader:
        row[-1] = row[-1].split(";")
        rows.append(row)
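The same idea works with DictReader if you prefer named access (a sketch, assuming the header shown above):

import csv

with open("data.csv") as fobj:
    for row in csv.DictReader(fobj):
        row["Cities"] = row["Cities"].split(";")  # turn the quoted field into a list
        print(row["ID"], row["Country"], row["Cities"])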
Issues with ast.literal_eval
Even though the ast.literal_eval function is much safer than using a regular eval on user input, it still might be exploitable. The documentation for literal_eval has this warning:
Warning: It is possible to crash the Python interpreter with a sufficiently large/complex string due to stack depth limitations in Python’s AST compiler.
A demonstration of this can be found here:
>>> import ast
>>> ast.literal_eval("()" * 10 ** 6)
[1] 48513 segmentation fault python
I'm definitely not an expert, but giving a user the ability to crash a program and potentially exploit some obscure memory vulnerability is bad, and in this use-case can be avoided.
If the reason you want to use literal_eval is to get proper typing, and you're positive that the input data is 100% trusted, then I suppose it's fine to use. But, you could always wrap the function to perform some sanity checks:
import ast

def sanely_eval(value: str, max_size: int = 100_000) -> object:
    if len(value) > max_size:
        raise ValueError(f"len(value) is greater than the max_size={max_size!r}")
    return ast.literal_eval(value)
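Applied to the Cities column from the question, a usage sketch (data.csv is the file shown in the question):

import csv

rows = []
with open("data.csv") as fobj:
    reader = csv.reader(fobj)
    next(reader)  # skip the header row
    for row in reader:
        row[-1] = sanely_eval(row[-1])  # parse the stringified list, with the size check
        rows.append(row)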
But, depending on how you're creating and using the CSV files, this may make the data less portable, since it's a python-specific format.
If you can control the CSV, you could separate the items with some other known character that isn't going to be in a city and isn't a comma. Say colon (:).
Then row one, for example, would look like this:
1,Canada,Toronto:Ottawa:Montreal
When it comes to processing the data, you'll have that whole element, and you can just do
cities.split(':')
If you want to go the other way (you have the cities in a Python list, and you want to create this string) you can use join()
':'.join(['Toronto', 'Ottawa', 'Montreal'])
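Going the whole way back to a file, a short sketch with csv.writer (the file name and data here are hypothetical):

import csv

data = [(1, 'Canada', ['Toronto', 'Ottawa', 'Montreal']),
        (2, 'Italy', ['Rome', 'Milan', 'Naples', 'Palermo'])]

with open('data.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['Id', 'Country', 'Cities'])
    for id_, country, cities in data:
        writer.writerow([id_, country, ':'.join(cities)])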
For the specific structure of the CSV, you could convert the cities to a list like this:
cities = '''"['Rome','Milan','Naples', 'Palermo']"'''
cities = cities[2:-2] # remove "[ and ]"
print(cities) # 'Rome','Milan','Naples', 'Palermo'
cities = cities.split(',') # convert to list
print(cities) # ["'Rome'", "'Milan'", "'Naples'", " 'Palermo'"]
cities = [x.strip() for x in cities] # remove leading or following spaces (if exists)
print(cities) # ["'Rome'", "'Milan'", "'Naples'", "'Palermo'"]
cities = [x[1:-1] for x in cities] # remove quotes '' from each city
print(cities) # ['Rome', 'Milan', 'Naples', 'Palermo']
train, test = data.TabularDataset.splits(
    path="./data/", train="train.csv", test="test.csv", format="csv",
    fields=[("Tweet", TEXT), ("Affect Dimension", LABEL)])
I have this code and want to check whether the loaded data is correct, or whether it's using the wrong columns for the actual text fields etc.
If my file has the column "Tweet" for the texts and "Affect Dimension" for the class name, is it correct to put them like this in the fields section?
Edit: TabularDataset holds Example objects, from which the data can be read. When reading CSV files, only a "," is accepted as a delimiter. Everything else will result in corrupted data.
You can put any field name irrespective of what your file has. Also, I recommend NOT using whitespace in field names.
So, rename Affect Dimension to Affect_Dimension or anything convenient for you.
Then you can iterate over different fields like below to check the read data.
for i in train.Tweet:
    print(i)
for i in train.Affect_Dimension:
    print(i)
for i in test.Tweet:
    print(i)
for i in test.Affect_Dimension:
    print(i)
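To spot-check a single parsed row rather than whole columns, you can also inspect one Example directly (a sketch; this assumes the legacy torchtext Dataset API the question is using):

print(len(train))               # number of examples read from train.csv
print(vars(train.examples[0]))  # dict of field name -> parsed value for the first row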
Basically, I have a troubleshooting program in which I want the user to enter their input. I take this input and split the words into separate strings. After that, I create a dictionary from the contents of a .CSV file, with recognisable keywords as the keys and the second column's solutions as the values. Finally, if any of the strings from the split user input match a dictionary key, I want to print the corresponding solution.
However, the problem I am facing is that the code loops through every keyword, so if my input was 'My phone is wet' and 'wet' was a recognisable keyword, it would print 'Not recognised', 'Not recognised', 'Not recognised', and only then the solution, because the strings 'My', 'phone' and 'is' are not recognised.
So how do I test whether a user's split input is in my dictionary without it outputting 'Not recognised' for every miss?
Sorry if this was unclear, I'm quite confused by the whole matter.
Code:
import csv, easygui as eg
KeywordsCSV = dict(csv.reader(open('Keywords and Solutions.csv')))
Problem = eg.enterbox('Please enter your problem: ', 'Troubleshooting').lower().split()
for Problems, Solutions in (KeywordsCSV.items()):
    pass
Note, I have the pass there, because this is the part I need help on.
My CSV file consists of:
problemKeyword | solution
For example;
wet Put the phone in a bowl of rice.
Your code reads like some ugly code golf. Let's clean it up before we look at how to solve the problem.
import easygui as eg
import csv

# KeywordsCSV = dict(csv.reader(open('Keywords and Solutions.csv')))
# Why are you nesting THREE function calls? That's awful. Don't do that.
# KeywordsCSV should be named something different, too. `problems` is probably fine.
with open("Keywords and Solutions.csv") as f:
    reader = csv.reader(f)
    problems = dict(reader)

problem = eg.enterbox('Please enter your problem: ', 'Troubleshooting').lower().split()
# This one's not bad, but I lowercased your `Problem` because capital-case
# words are idiomatically class names. Chaining this many functions together isn't
# ideal, but for this one-shot case it's not awful.
Let's break a second here and notice that I changed something on literally every line of your code. Take time to familiarize yourself with PEP8 when you can! It will drastically improve any code you write in Python.
Anyway, once you've got a problems dict, and a problem that is a list of words (any of which might be a key in that dict), you can do:

for word in problem:
    if word in problems:
        solution = problems[word]

or, inside the same loop, use the default return of dict.get:

solution = problems.get(word)
# solution is None if the word isn't a known keyword
If you wanted to loop this, you could do something like:

while True:
    problem = eg.enterbox(...)  # as above
    solutions = [problems[word] for word in problem if word in problems]
    if not solutions:
        ...  # invalid problem, warn the user
    else:
        ...  # display the solution(s)? Do whatever it is you're doing with them and...
        break
Just have a boolean and an if after the loop that only runs if none of the words in the sentence were recognized.
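A minimal sketch of that, reusing the names from the question (the output handling is hypothetical):

found = False
for word in Problem:
    if word in KeywordsCSV:
        print(KeywordsCSV[word])  # print the matching solution
        found = True
if not found:
    print('Not recognised')  # printed once, only after checking every word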
I think you might be able to use something like:

for word in Problem:
    if word in KeywordsCSV:
        print(KeywordsCSV[word])

or the list comprehension:

[KeywordsCSV.get(word) for word in Problem if word in KeywordsCSV]

(Note that dict.has_key() was removed in Python 3; use the in operator instead.)
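If you go with the comprehension, you can use the emptiness of its result to print the failure message just once (a sketch):

solutions = [KeywordsCSV[word] for word in Problem if word in KeywordsCSV]
if solutions:
    for solution in solutions:
        print(solution)
else:
    print('Not recognised')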
I've got an interesting problem.
I get a report per email and parse the CSV with csv.DictReader like so:
with open(extracted_report_uri) as f:
reader = csv.DictReader(f)
for row in reader:
report.append(row)
Unfortunately the CSV contains one column called "eCPM (€)" which leaves me with a list like so:
{'eCPM (€)': '1.42'}
Python really does not like a print(report[0]['eCPM (€)']) as it refuses to accept the Euro-sign as a key.
I tried creating a unicode string with the € inside and using that as the key, but this also doesn't work.
I'd either like to access the value (obviously) as is, or simply get rid of the €.
The suggested duplicate's answer covers removing the BOM rather than accessing my key. I also tried it via report[0][u'eCPM (€)'] as suggested in the comments there. Does not work: KeyError: 'eCPM (�)'
The suggestion from the comment also doesn't work for me. Using report[0][u'eCPM (%s)' % '€'.encode('unicode-escape')] results in KeyError: "eCPM (b'\\\\u20ac')"
After some more research I found out how to properly do this it seems. As I've seen all sorts of issues on Google/Stackoverflow with BOM/UTF-8 and DictReader here's the complete code:
Situation:
You've got a CSV file that has a Byte Order Mark (BOM, 0xEF 0xBB 0xBF) and special characters like €äöµ# in a fieldname, and you want to read it properly so you can access the key:value pairs later on.
In my example the CSV has a fieldname eCPM (€), and this is how it works:
import csv

report = []
with open('test.csv', encoding='utf-8-sig') as f:
    reader = csv.DictReader(f)
    for row in reader:
        report.append(row)

print(report[0]['eCPM (€)'])
Before this solution I removed the BOM with a function, but there's really no need for this. If you use open() with encoding='utf-8-sig', it'll automagically handle the BOM and decode the whole file properly.
And with the € directly in the key, you can easily access the values of the generated list.
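For comparison, here is the failure mode with plain utf-8 (a sketch against the same test.csv): the BOM survives decoding and gets glued onto the first fieldname.

import csv

with open('test.csv', encoding='utf-8') as f:  # plain utf-8, not utf-8-sig
    reader = csv.DictReader(f)
    print(reader.fieldnames[0])  # '\ufeffeCPM (€)': the BOM is stuck to the first key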
Thanks for the comments that brought me on the right track!
Issue
The code does not correctly identify the input (item). It simply falls through to my failure message even if a matching value exists in the CSV file. Can anyone help me determine what I am doing wrong?
Background
I am working on a small program that asks for user input (function not given here), searches a specific column in a CSV file (Item) and returns the entire row. The CSV data format is shown below. I have shortened the data from the actual amount (49 field names, 18000+ rows).
Code
import csv
from collections import namedtuple
from contextlib import closing

def search():
    item = 1000001
    raw_data = 'active_sanitized.csv'
    failure = 'No matching item could be found with that item code. Please try again.'
    check = False
    with closing(open(raw_data, newline='')) as open_data:
        read_data = csv.DictReader(open_data, delimiter=';')
        item_data = namedtuple('item_data', read_data.fieldnames)
        while check == False:
            for row in map(item_data._make, read_data):
                if row.Item == item:
                    return row
                else:
                    return failure
CSV structure
active_sanitized.csv
Item;Name;Cost;Qty;Price;Description
1000001;Name here:1;1001;1;11;Item description here:1
1000002;Name here:2;1002;2;22;Item description here:2
1000003;Name here:3;1003;3;33;Item description here:3
1000004;Name here:4;1004;4;44;Item description here:4
1000005;Name here:5;1005;5;55;Item description here:5
1000006;Name here:6;1006;6;66;Item description here:6
1000007;Name here:7;1007;7;77;Item description here:7
1000008;Name here:8;1008;8;88;Item description here:8
1000009;Name here:9;1009;9;99;Item description here:9
Notes
My experience with Python is relatively little, but I thought this would be a good problem to start with in order to learn more.
I determined the methods to open (and wrap in a close function) the CSV file, read the data via DictReader (to get the field names), and then create a named tuple to be able to quickly select the desired columns for the output (Item, Cost, Price, Name). Column order is important, hence the use of DictReader and namedtuple.
While there is the possibility of hard-coding each of the field names, I felt that if the program can read them on file open, it would be much more helpful when working on similar files that have the same column names but different column organization.
Research
CSV Header and named tuple:
What is the pythonic way to read CSV file data as rows of namedtuples?
Converting CSV data to tuple: How to split a CSV row so row[0] is the name and any remaining items are a tuple?
There were additional links of research, but I cannot post more than two.
You have three problems with this:
You return on the first failure, so it will never get past the first line.
You are reading strings from the file, and comparing to an int.
_make iterates over the dictionary keys, not the values, producing the wrong result (item_data(Item='Name', Name='Price', Cost='Qty', Qty='Item', Price='Cost', Description='Description')).
for row in (item_data(**data) for data in read_data):
    if row.Item == str(item):
        return row
return failure
This fixes the issues at hand - we check against a string, and we only return if none of the items matched (although you might want to begin converting the strings to ints in the data rather than this hackish fix for the string/int issue).
I have also changed the way you are looping - using a generator expression makes for a more natural syntax, using the normal construction syntax for named attributes from a dict. This is cleaner and more readable than using _make and map(). It also fixes problem 3.
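Putting both fixes together, a sketch of the corrected search() (the int conversion is one way to handle the string/int issue mentioned above; the defaults are taken from the question):

import csv
from collections import namedtuple

def search(item=1000001, raw_data='active_sanitized.csv'):
    failure = 'No matching item could be found with that item code. Please try again.'
    with open(raw_data, newline='') as open_data:
        read_data = csv.DictReader(open_data, delimiter=';')
        item_data = namedtuple('item_data', read_data.fieldnames)
        for row in (item_data(**data) for data in read_data):
            if int(row.Item) == item:  # compare as ints instead of comparing to str(item)
                return row
    return failure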