This is essentially the same question asked here: How can you parse excel CSV data that contains linebreaks in the data?
But I'm using Python 3 to write my CSV file. Does anyone know if there's a way to add line breaks to cell values from Python?
Here's an example of what the CSV should look like:
"order_number1", "item1\nitem2"
"order_number2", "item1"
"order_number3", "item1\nitem2\nitem3"
I've tried appending HTML line breaks between each item, but the system I upload the data to doesn't seem to recognize HTML.
Any and all help is appreciated.
Thanks!
Figured it out after playing around and I feel so stupid.
import csv

with open('orders.csv', 'w', newline='') as f:
    outfile = csv.DictWriter(f, fieldnames=["Order ID", "Item"], quoting=csv.QUOTE_ALL)
    for key in dictionary:
        outfile.writerow({
            "Order ID": key,
            "Item": "\n".join(dictionary[key])
        })
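Note the newline='' in open(): per the csv module docs, without it newlines embedded inside quoted fields are not handled correctly, and on platforms that use \r\n line endings an extra \r can sneak into every line on write.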
Your example shows the CSV as:
"order_number1", "item1\nitem2"
"order_number2", "item1"
"order_number3", "item1\nitem2\nitem3"
The proper way to use newlines in fields is like this:
"order_number1","item1
item2"
"order_number2","item1"
"order_number3","item1
item2
item3"
The \n you show is not a line break; it is two literal characters that are simply part of the string. Some software may convert it to a newline, other software may not.
Also try to avoid spaces around the separators: they end up as part of the field value.
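Python's csv module handles this form on both ends; for example, reading the properly quoted file back (the file name is just a placeholder):

import csv

with open('orders.csv', newline='') as f:
    for row in csv.reader(f):
        print(row)
# ['order_number1', 'item1\nitem2']
# ['order_number2', 'item1']
# ['order_number3', 'item1\nitem2\nitem3']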
My goal is to create a CSV file from an API call.
The problem: the API returns the content as a bytestring, and I don't know how to convert it to a CSV file properly.
The content part of the response looks like this:
b'column_title_1, column_title_2, column_title_3, value1_1, value1_2, value1_3\nvalue2_1, value2_2(="Perfect!\nThank you"), value2_3\n, value3_1, value3_2, value3_3\n....'
How can I manage to get a clean CSV file from this? I tried Pandas, the CSV module and Numpy. Unfortunately, I was not able to handle the newline escapes which are sometimes within a string value (it is a column for comments) - see value2_2.
The result should look like this:
column_title_1, column_title_2, column_title_3
value1_1, value1_2, value1_3
value2_1, value2_2(="Perfect!\nThank you"), value2_3
value3_1, value3_2, value3_3
The closest of my results was this:
column_title_1, column_title_2, column_title_3
value1_1, value1_2, value1_3
value2_1, value2_2(="Perfect!
Thank you"), value2_3
value3_1, value3_2, value3_3
Even though I got close, I was not able to get rid of the \n within the values of some columns.
I did not figure out how to exclude the \n that appear inside the quoted values.
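For what it's worth, a minimal sketch of one way to do this with the csv module, assuming the payload is UTF-8 with comma-separated, double-quoted fields (raw stands in for the bytes your API call returns): csv.reader keeps a quoted newline inside its field instead of starting a new row there, and you can then re-escape it to the literal \n your target output shows.

import csv
import io

text = raw.decode('utf-8')  # raw: the bytestring from the API (hypothetical name)

rows = []
for row in csv.reader(io.StringIO(text), skipinitialspace=True):
    # a real newline inside a quoted field becomes the two characters \n again
    rows.append([field.replace('\n', '\\n') for field in row])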
I have loaded a text file using the CSV loader, but when I try to print the schema it shows just one field under the root, with every column name crammed into that one, like this:
root
|-- Prscrbr_Geo_Lvl Prscrbr_Geo_Cd Prscrbr_Geo_Desc Brnd_Name
Any idea how to fix this?
Adding my comment as an answer since it seems to have solved the problem.
From the output, it looks like the CSV file actually uses tab characters as the separator between columns instead of commas. To make Spark split on tabs, set the sep option:

spark.read.format("csv").option("sep", "\t").load("/path/to/file")
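Since the garbled schema suggests the first row holds the column names, adding the header option (and optionally inferSchema) should give a proper schema as well; a sketch, with the path as a placeholder:

df = (spark.read
      .option("sep", "\t")
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("/path/to/file"))
df.printSchema()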
I have to read a file that always has the same format.
Since I know the format, I could readline() and tokenize. But I guess there is a way to read it that is, how to say it, "prettier to the eyes".
The file I have to read has this format :
Nom NMS-01
MAC AAAAAAAAAAA
UDPport 2019
TCPport 9129
I just want a different way to read it without having to tokenize, if that is possible.
Your question seems to imply that "tokenizing" is some kind of mysterious and complicated process. But in fact, the thing you are trying to do is exactly tokenizing.
Here is a perfectly valid way to read the file you show, break it up into tokens, and store it in a data structure:
def read_file_data(data_file_path):
    result = {}
    with open(data_file_path) as data_file:
        for line in data_file:
            # strip the trailing newline, then split the key off the rest of the line
            key, value = line.rstrip('\n').split(' ', maxsplit=1)
            result[key] = value
    return result
That wasn't complicated, it wasn't a lot of code, it doesn't need a third-party library, and it's easy to work with:
data = read_file_data('path/to/file')
print(data['Nom']) # prints "NMS-01"
Now, this implementation makes many assumptions about the structure of the file. Among other things, it assumes:
The entire file is structured as key/value pairs
Each key/value pair fits on a single line
Every line in the file is a key/value pair (no comments or blank lines)
The key cannot contain space characters
The value cannot contain newline characters
The same key does not appear multiple times in the file (or, if it does, it is acceptable for the last value given to be the only one returned)
Some of these assumptions may be false, but they are all true for the data sample you provided.
More generally: if you want to parse some kind of structured data, you need to understand the structure of the data and how values are delimited from each other. That's why common structured data formats like XML, JSON, and YAML (among many others!) were invented. Once you know the language you are parsing, tokenization is simply the code you write to match up the language with the text of your input.
Pandas does many magical things, so maybe that is prettier for you?
import pandas as pd
df = pd.read_csv('input.txt', sep=' ', header=None, index_col=0)
This gives you a dataframe that you can manipulate further:
0        1
Nom      NMS-01
MAC      AAAAAAAAAAA
UDPport  2019
TCPport  9129
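You can then pull values straight out of the dataframe:

df.loc['Nom', 1]   # 'NMS-01'
df[1].to_dict()    # {'Nom': 'NMS-01', 'MAC': 'AAAAAAAAAAA', ...}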
I have made a GUI program where I can enter some values in various fields. All info from these fields is then combined into a dictionary, which is then stored using the shelve module. With the push of a button, I can then export all dictionary entries into an RTF file, as I want parts of the file formatted in italics.
The GUI and shelve part of the program works just fine. The problem I'm having is exporting multiple lines to the RTF file. When I print the strings I want to write to the RTF file into the python shell, I get multiple lines. But when I export it to RTF, it's all printed on one line. I know this should usually be fixed by adding an \n to the string, but this hasn't worked for me in any way. Can anyone tell me what I'm doing wrong, or maybe a workaround where I can still use italics to save the text?
As far as a working example goes:
data = dict()
data['first'] = {'author': 'Kendall MA',
                 'year': '1987',
                 'title': 'This is a test title'}
data['second'] = {'author': 'Mark',
                  'year': '2014',
                  'title': 'It is not working correctly'}

rtf = open('../test.rtf', 'w')
rtf.write(r'{\rtf1\ansi\ansicpg1252\deff0\deflang1033{\fonttbl{\f0\fswiss\fcharset0 Cambria;}}')
for key in data.keys():
    entry = data[key]
    rtf.write(r'{0} ({1}): \i {2} \i0'.format(entry['author'], entry['year'], entry['title']) + '\n')
rtf.write(r'}\n\x00')
rtf.close()
The output this code gives is:
Mark (2014): It is not working correctly Kendall (1987): This is a test title
While it should be:
Mark (2014): It is not working correctly
Kendall (1987): This is a test title
EDIT:
I found out that the combination of \line and \par works. Using them separately does not, for some reason that is unclear to me (maybe somebody can explain?).
But a new error occurred. When there are in fact multiple authors, which I enter as a list (['Kendall MA', 'Powsen RB']) and then make into a single string using ', '.join(entry['author']), the first word gets cut off. So I get
'MA, Powsen RB' instead of 'Kendall MA, Powsen RB'. Does anyone know why, and how to counter it?
\n has no special meaning in RTF. If you want to output a line break, you will need to use r'\line' or r'\par' for a paragraph break instead of (or in addition to, for readability) \n.
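Applied to the loop from your example, that would look roughly like this (the trailing '\n' only keeps the generated RTF source readable; RTF itself ignores it):

for key in data:
    entry = data[key]
    # \i ... \i0 italicizes the title; \line breaks the line inside the paragraph
    rtf.write(r'{0} ({1}): \i {2}\i0\line'.format(
        entry['author'], entry['year'], entry['title']) + '\n')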
I've got an interesting problem.
I get a report per email and parse the CSV with csv.DictReader like so:
with open(extracted_report_uri) as f:
    reader = csv.DictReader(f)
    for row in reader:
        report.append(row)
Unfortunately the CSV contains one column called "eCPM (€)" which leaves me with a list like so:
{'eCPM (€)': '1.42'}
Python really does not like a print(report[0]['eCPM (€)']) as it refuses to accept the Euro-sign as a key.
I tried creating a unicode string with the € inside and using that as the key, but this also doesn't work.
I'd either like to access the value (obviously) as is, or simply get rid of the €.
The suggested duplicate's answer covers removing the BOM rather than accessing my key. I also tried it via report[0][u'eCPM (€)'] as suggested in the comments there. Does not work. KeyError: 'eCPM (�)'
The suggestion from the comment also doesn't work for me. Using report[0][u'eCPM (%s)' % '€'.encode('unicode-escape')] results in KeyError: "eCPM (b'\\\\u20ac')"
After some more research, it seems I found out how to do this properly. As I've seen all sorts of issues on Google/Stack Overflow with BOM/UTF-8 and DictReader, here's the complete code:
Situation:
You have a CSV file that starts with a byte order mark (BOM, 0xEF 0xBB 0xBF) and has special characters like €äöµ# in a fieldname, and you want to read it properly so you can access the key:value pairs later on.
In my example the CSV has a fieldname eCPM (€), and this is how it works:
import csv

report = []
# utf-8-sig strips the BOM, so the first fieldname really is 'eCPM (€)'
with open('test.csv', encoding='utf-8-sig', newline='') as f:
    reader = csv.DictReader(f)
    for row in reader:
        report.append(row)

print(report[0]['eCPM (€)'])
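For comparison, reading the same file with plain utf-8 leaves the BOM glued to the first fieldname, which is exactly why the key lookups failed:

import csv

with open('test.csv', encoding='utf-8') as f:
    reader = csv.DictReader(f)
    print(reader.fieldnames[0])  # '\ufeffeCPM (€)' -- the BOM is part of the key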
Before this solution I removed the BOM with a function, but there's really no need for this. If you use open() with encoding='utf-8-sig' it'll automagically handle the BOM and properly decode the whole file.
And with a plain '€' in the key you can easily access the values of the generated list; in Python 3 every str is unicode anyway.
Thanks for the comments that brought me on the right track!