My data looks like this:
['[\'Patient, A\', \'G\', \'P\', \'RNA\']']
Ignoring the brackets, quotes and backslashes, I'd like to split the data on ',' and write it to a CSV file like this:
Patient,A,G,P,RNA
Specifying delimiter=',' hasn't helped. The output file then looks like
['Patient, A','G','P','RNA']
all in a single cell. I want to split them into multiple columns. How can I do that?
Edit: Specifying quotechar='|' splits them into different cells, but it now looks like
|['Patient, A','G','P','RNA']|
Edit:
out_file_handle = csv.writer(out_file, quotechar='|', lineterminator='\n', delimiter = ",")
data = ''.join(mydict.get(word.lower(), word) for word in re.split('(\W+)', transposed))
data = [data,]
out_file_handle.writerow(data)
transposed:
['Patient, A','G','P','RNA']
data:
['[\'Patient, A\', \'G\', \'P\', \'RNA\']']
And it has multiple rows; the above is one of the rows from the entire data set.
You first need to read this data into a Python list, by processing the string as a CSV file in memory:
from StringIO import StringIO
import csv
data = ['[\'Patient, A\', \'G\', \'P\', \'RNA\']']
clean_data = list(csv.reader( StringIO(data[0]) ))
However, the output is still a single string, because the input isn't even well-formed CSV! In that case, the best option is to filter out all those junk characters:
import re
clean_data = re.sub(r"[\[\]']", "", data[0])
Now clean_data is 'Patient, A, G, P, RNA', which is a clean CSV row you can write straight to a file.
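For example, a minimal Python 3 sketch of the whole round trip (assuming every row carries the same junk wrapping):
import csv
import re

raw_rows = ["['Patient, A', 'G', 'P', 'RNA']"]  # one junk-wrapped string per row

with open('out.csv', 'w', newline='') as out_file:
    writer = csv.writer(out_file)
    for raw in raw_rows:
        cleaned = re.sub(r"[\[\]']", "", raw)  # strip brackets and quotes
        writer.writerow(field.strip() for field in cleaned.split(','))  # split on commas, trim spaces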
If what you're trying to do is write data of the form ['[\'Patient, A\', \'G\', \'P\', \'RNA\']'] (a list of these strings) to a file, then it's really a question in two parts: first, how do you get the data into the correct format, and second, how do you write it to a file.
If that is the form of your data, for every row, then something like this should work (to get it into the correct format):
data = ['[\'Patient, A\', \'G\', \'P\', \'RNA\']', ...]
newData = [[field.strip() for field in entry.replace("'", "")[1:-1].split(",")] for entry in data]
That will give you data in the following form:
[["Patient", "A", "G", "P", "RNA"], ...]
and then you can write it to a file as suggested in the other answers:
with open('new.csv', 'wb') as write_file:
    file_writer = csv.writer(write_file)
    for dataEntry in newData:
        file_writer.writerow(dataEntry)
If you don't actually care about using the data in this round, and just want to clean it up, then you can do entry.replace("'", "")[1:-1] for each entry and then write those strings to file.
The [1:-1] bits are just to remove the leading and trailing square brackets.
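A sketch of that clean-up-only route:
cleaned = [entry.replace("'", "")[1:-1] for entry in data]
# cleaned[0] is now 'Patient, A, G, P, RNA' - ready to write straight to file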
Python has a CSV writer. Start off with
import csv
Then try something like this
with open('new.csv', 'wb') as write_file:
    file_writer = csv.writer(write_file)
    for row in data:
        file_writer.writerow(row)
Edit:
You might have to wrangle the data a bit first before writing it, since it looks like it's a string and not actually a list. Try playing around with the split() function:
fields = data.split(',')
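For example, a hedged sketch on the sample string from the question, reusing file_writer from the block above (the extra strip of stray brackets and quotes is my guess at the wrangling needed):
data = "['Patient, A', 'G', 'P', 'RNA']"
fields = [f.strip(" []'") for f in data.split(',')]
file_writer.writerow(fields)   # writes: Patient,A,G,P,RNA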
"""
SAVING DATA IN CSV FORMAT

* This format is used for many purposes, e.g. for deep learning datasets.
* This type of file can be used to view data in MS Excel or any similar
  application.
"""
# == Imports ===================================================================
import csv
import sys
# == Initialisation Function ===================================================
def initialise_csvlog(filename, fields):
    """
    Call this initialisation function once before using the insertion function.

    * It checks for existing data before adding new rows, in order to
      maintain a consistent insertion mechanism
    * If the file does not exist, it creates a new one
    * If it does exist, it reads the existing header fields and aborts

    Parameters
    ----------
    filename : str
        Filename (along with directory) of the file that needs to be created
    fields : list
        Columns that need to be initialised
    """
    try:
        with open(filename, 'r') as csvfile:
            csvreader = csv.reader(csvfile)
            fields = next(csvreader)
            print("Data Already Exists")
            sys.exit("Please Create a new empty file")
            # print(fields)
    except FileNotFoundError:
        # a bare except would also swallow the SystemExit raised above
        with open(filename, 'w', newline='') as csvfile:
            csvwriter = csv.writer(csvfile)
            csvwriter.writerow(fields)
# == Data Insertion Function ===================================================
def write_data_csv(filename, row_data):
    """
    This function saves the row data into the CSV created above.

    * It adds row data that is double-listed (a list of row lists)

    Parameters
    ----------
    filename : str
        Filename (along with directory) of the file to append to
    row_data : list
        List of lists, each inner list holding the column elements of one row
    """
    with open(filename, 'a', newline='') as csvfile:
        csvwriter = csv.writer(csvfile)
        csvwriter.writerows(row_data)
if __name__ == '__main__':
    """
    This block is used to test the feature; run it independently.
    NOTE: DATA IN row_data MUST BE DOUBLE-LISTED, AS SHOWN BELOW
    """
    filename = "TestCSV.csv"
    fields = ["sno", "Name", "Work", "Department"]
    # Init
    initialise_csvlog(filename, fields)
    # Add Data
    row_data = [["1", "John", "Coder", "Pythonic"]]
    write_data_csv(filename, row_data)
# == END =======================================================================
Read through the module and you can start using CSV, viewing the data in Excel or any similar application (Calc in LibreOffice).
NOTE: Remember that the data passed as row_data must be double-listed, as shown in the __main__ block.
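Running the __main__ block once produces a TestCSV.csv like this (a second run aborts, since the file then already exists):
sno,Name,Work,Department
1,John,Coder,Pythonic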
Related
I have a text file (.txt) which could be in tab-separated or pipe-separated format, and I need to convert it into CSV format. I am using Python 2.6. Can anyone suggest how to identify the delimiter in a text file, read the data, and then convert it into a comma-separated file?
Thanks in advance
I fear that you can't identify the delimiter without knowing what it is. The problem with CSV is that, quoting ESR:
the Microsoft version of CSV is a textbook example of how not to design a textual file format.
The delimiter needs to be escaped in some way if it can appear in fields. Without knowing how the escaping is done, automatically identifying the delimiter is difficult. Escaping could be done the UNIX way, using a backslash '\', or the Microsoft way, using quotes which then must be escaped too. This is not a trivial task.
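For illustration, the same record with an embedded comma, escaped both ways:
first,second\,field      <- the UNIX way: backslash-escaped delimiter
first,"second,field"     <- the Microsoft way: quoted field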
So my suggestion is to get full documentation from whoever generates the file you want to convert. Then you can use one of the approaches suggested in the other answers or some variant.
Edit:
Python provides csv.Sniffer that can help you deduce the format of your DSV. If your input looks like this (note the quoted delimiter in the first field of the second row):
a|b|c
"a|b"|c|d
foo|"bar|baz"|qux
You can do this:
import csv
csvfile = open("csvfile.csv")
dialect = csv.Sniffer().sniff(csvfile.read(1024))
csvfile.seek(0)
reader = csv.DictReader(csvfile, dialect=dialect)
for row in reader:
    print row,
# => {'a': 'a|b', 'c': 'd', 'b': 'c'} {'a': 'foo', 'c': 'qux', 'b': 'bar|baz'}
# write records using other dialect
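Following that comment, a sketch that rewrites the sniffed records comma-separated (using a plain reader so the header line is copied through as-is):
csvfile.seek(0)
reader = csv.reader(csvfile, dialect=dialect)
with open("out.csv", "wb") as out:
    writer = csv.writer(out)   # the default dialect is comma-separated
    for row in reader:
        writer.writerow(row)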
Your strategy could be the following:
parse the file with BOTH a tab-separated csv reader and a pipe-separated csv reader
calculate some statistics on the resulting rows to decide which result set is the one you want to write. One idea could be counting the total number of fields in the two record sets (expecting that tabs and pipes are not so common inside the data). Another (if your data is strongly structured and you expect the same number of fields in each line) could be measuring the standard deviation of the number of fields per line and taking the record set with the smallest standard deviation; a sketch of this variant follows the example below.
The following example implements the simpler statistic (the total number of fields):
import csv

piperows = []
tabrows = []

# parse assuming | delimiter
f = open("file", "rb")
readerpipe = csv.reader(f, delimiter="|")
for row in readerpipe:
    piperows.append(row)
f.close()

# parse assuming TAB delimiter
f = open("file", "rb")
readertab = csv.reader(f, delimiter="\t")
for row in readertab:
    tabrows.append(row)
f.close()
# in this example we use the total number of fields as the indicator
# (it's not guaranteed to work! it depends on the nature of your data)
# count total fields
totfieldspipe = sum(len(row) for row in piperows)
totfieldstab = sum(len(row) for row in tabrows)
if totfieldspipe > totfieldstab:
    yourrows = piperows
else:
    yourrows = tabrows
# the variable yourrows now contains the rows; just write them in any format you like
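And here is a sketch of the standard-deviation variant mentioned earlier (assuming well-structured data, where the correct delimiter should give a nearly constant number of fields per line):
import math

def stddev_of_field_counts(rows):
    # standard deviation of the number of fields per row
    counts = [len(row) for row in rows]
    mean = float(sum(counts)) / len(counts)
    return math.sqrt(sum((c - mean) ** 2 for c in counts) / len(counts))

# the delimiter that yields the most regular rows wins
if stddev_of_field_counts(piperows) < stddev_of_field_counts(tabrows):
    yourrows = piperows
else:
    yourrows = tabrows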
Like this:
from __future__ import with_statement
import csv
import re
with open(input, "r") as source:
    with open(output, "wb") as destination:
        writer = csv.writer(destination)
        for line in source:
            writer.writerow(re.split('[\t|]', line.rstrip('\n')))
I would suggest taking some of the example code from the existing answers, or perhaps better using the csv module from Python, and changing it to first assume tab-separated, then pipe-separated, producing two comma-separated output files. Then you visually examine both files to determine which one you want and pick that.
If you actually have lots of files, then you need to try to find a way to detect which file is which.
One of the examples has this:
if "|" in line:
This may be enough: if the first line of a file contains a pipe, then maybe the whole file is pipe-separated; otherwise assume a tab-separated file.
Alternatively, fix the file to contain a key field in the first line which is easily identified - or maybe the first line contains column headers which can be detected.
for line in open("file"):
    line = line.strip()
    if "|" in line:
        print ','.join(line.split("|"))
    else:
        print ','.join(line.split("\t"))
I want to create a word dictionary. The dictionary looks like
words_meanings = {
    "rekindle": "relight",
    "pesky": "annoying",
    "verge": "border",
    "maneuver": "activity",
    "accountability": "responsibility",
}
keys_letter = []
for x in words_meanings:
    keys_letter.append(x)
print(keys_letter)
Output: rekindle, pesky, verge, maneuver, accountability
Here rekindle, pesky, verge, maneuver, and accountability are the keys, and relight, annoying, border, activity, and responsibility are the values.
Now I want to create a csv file and my code will take input from the file.
The file looks like
rekindle | pesky | verge | maneuver | accountability
relight | annoying| border| activity | responsibility
So far I use this code to load the file and read data from it.
from google.colab import files
uploaded = files.upload()
import pandas as pd
data = pd.read_csv("words.csv")
data.head()
import csv
reader = csv.DictReader(open("words.csv", 'r'))
words_meanings = []
for line in reader:
    words_meanings.append(line)
print(words_meanings)
This is the output of print(words_meanings)
[OrderedDict([('\ufeffrekindle', 'relight'), ('pesky', 'annoying')])]
It looks very odd to me.
keys_letter = []
for x in words_meanings:
    keys_letter.append(x)
print(keys_letter)
Now I create an empty list and want to append only the keys. But the output is [OrderedDict([('\ufeffrekindle', 'relight'), ('pesky', 'annoying')])] again.
I am confused: as per the first code block it only included keys, but now it includes both keys and their values. How can I overcome this situation?
I would suggest that you format your csv with your key and value on the same row, like this:
rekindle,relight
pesky,annoying
verge,border
This way, the following code will work:
words_meanings = {}
with open(file_name, 'r') as file:
    for line in file.readlines():
        key, value = line.split(",")
        words_meanings[key] = value.rstrip("\n")
If you want a list of the keys:
list_of_keys = list(words_meanings.keys())
To add keys and values to the file:
def add_values(key: str, value: str, file_name: str):
    with open(file_name, 'a') as file:
        file.write(f"\n{key},{value}")

key = input("Input the key you want to save: ")
value = input(f"Input the value you want to save to {key}: ")
add_values(key, value, file_name)
You run the same block of code, but you use it with different objects, and this gives different results.
First you use a normal dictionary (check type(words_meanings)):
words_meanings = {
    "rekindle": "relight",
    "pesky": "annoying",
    "verge": "border",
    "maneuver": "activity",
    "accountability": "responsibility",
}
and the for-loop gives you the keys from this dictionary.
You could get the same with
keys_letter = list(words_meanings.keys())
or even
keys_letter = list(words_meanings)
Later you use a list with a single dictionary inside it (check type(words_meanings)):
words_meanings = [OrderedDict([('\ufeffrekindle', 'relight'), ('pesky', 'annoying')])]
and the for-loop gives you the elements of this list, not the keys of the dictionary inside it. So you copy the full dictionary from one list to another.
You could get the same with
keys_letter = words_meanings.copy()
or even
keys_letter = list(words_meanings)
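If what you actually want are the keys of the dictionary inside the list, index into the list first (a minimal sketch):
inner_dict = words_meanings[0]          # the OrderedDict stored in the list
keys_letter = list(inner_dict.keys())   # ['\ufeffrekindle', 'pesky']
The full demonstration: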
from collections import OrderedDict
words_meanings = {
    "rekindle": "relight",
    "pesky": "annoying",
    "verge": "border",
    "maneuver": "activity",
    "accountability": "responsibility",
}
print(type(words_meanings))
keys_letter = []
for x in words_meanings:
    keys_letter.append(x)
print(keys_letter)
#keys_letter = list(words_meanings.keys())
keys_letter = list(words_meanings)
print(keys_letter)
words_meanings = [OrderedDict([('\ufeffrekindle', 'relight'), ('pesky', 'annoying')])]
print(type(words_meanings))
keys_letter = []
for x in words_meanings:
    keys_letter.append(x)
print(keys_letter)
#keys_letter = words_meanings.copy()
keys_letter = list(words_meanings)
print(keys_letter)
The default field separator for the csv module is a comma. Your CSV file uses the pipe or bar symbol |, and the fields also seem to be fixed width. So, you need to specify | as the delimiter to use when creating the CSV reader.
Also, your CSV file is encoded as Big-endian UTF-16 Unicode text (UTF-16-BE). The file contains a byte-order-mark (BOM), but Python is not stripping it off, so you will notice that the string '\ufeffrekindle' contains the FEFF UTF-16-BE BOM. That can be dealt with by specifying encoding='utf-16' when you open the file.
import csv

with open('words.csv', newline='', encoding='utf-16') as f:
    reader = csv.DictReader(f, delimiter='|', skipinitialspace=True)
    for row in reader:
        print(row)
Running this on your CSV file produces this:
{'rekindle ': 'relight ', 'pesky ': 'annoying', 'verge ': 'border', 'maneuver ': 'activity ', 'accountability': 'responsibility'}
Notice that there is trailing whitespace in the key and values. skipinitialspace=True removed the leading whitespace, but there is no option to remove the trailing whitespace. That can be fixed by exporting the CSV file from Excel without specifying a field width. If that can't be done, then it can be fixed by preprocessing the file using a generator:
import csv

def preprocess_csv(f, delimiter=','):
    # assumes that fields cannot contain embedded newlines
    for line in f:
        yield delimiter.join(field.strip() for field in line.split(delimiter))

with open('words.csv', newline='', encoding='utf-16') as f:
    reader = csv.DictReader(preprocess_csv(f, '|'), delimiter='|', skipinitialspace=True)
    for row in reader:
        print(row)
which now outputs the stripped keys and values:
{'rekindle': 'relight', 'pesky': 'annoying', 'verge': 'border', 'maneuver': 'activity', 'accountability': 'responsibility'}
Since I found that no one was able to help me with the answer, I am finally posting it here. Hope this will help others.
file_name = "words.csv"
words_meanings = {}
with open(file_name, newline='', encoding='utf-8-sig') as file:
    for line in file.readlines():
        key, value = line.split(",")
        words_meanings[key] = value.rstrip("\n")
print(words_meanings)
This is the code to convert a CSV into a dictionary. Enjoy!
I have the below CSV file which I need to convert to YAML (or to the below example output).
CSV file format
CASSANDRA a a
DSE_OPSCENTER
IGNITE a
KAFKA_LEAD b
KAFKA_SMART
OAM
RBM a
I used the below code to convert the file into the expected output:
datareader = csv.reader(csvfile, delimiter=',', quotechar='"')
data_headings = []
for i in datareader:
    new_yaml = open('hosts', 'a')
    yaml_text = ""
    # heading = "[" + i[0] + "]"
    # new_yaml.write(heading)
    new_yaml.write('\n')
    for cell in i:
        print cell
        new_yaml.write(cell)
        new_yaml.write('\n')
    new_yaml.close()
csvfile.close()
And I get the below output, which is fine for me.
CASSANDRA
a
a
DSE_OPSCENTER
IGNITE
a
KAFKA_LEAD
KAFKA_SMART
...
I'd like some help with putting CASSANDRA, DSE_OPSCENTER and so on within square brackets, something like below:
[CASSANDRA]
a
a
[DSE_OPSCENTER]
...
Edit
I added a template format, but I don't know how to put the values in their respective groups.
HOST_VAR_TEMPLATE = """
[CASSANDRA]
{cell}
[DSE_OPSCENTER]
[SMART]
[SPARK]
[SPARK_MASTERS]
[ZK]
"""
csvfile = open('hosts.csv', 'r')
datareader = csv.reader(csvfile, delimiter=',', quotechar='"')
data_headings = []
for i in datareader:
    print i[1:]
    with open('hosts', "w") as f:
        for cell in i:
            print cell
            f.write(
                HOST_VAR_TEMPLATE.format(
                    cell=cell,
                )
            )
You are creating a file with a single document that consists of a multi-line scalar string, followed by an explicit end-of-document marker.
You try to write the file yourself, instead of using a YAML library. In principle that is possible, but there are some corner cases where you have to take special care.
You should first load the end result using a YAML library:
import ruamel.yaml
yaml = ruamel.yaml.YAML()
data = yaml.load("""\
[CASSANDRA]
a
a
[DSE_OPSCENTER]
...
""")
This will give you a ComposerError telling you that a (the first one) starts a new document. That is because, starting with [, the parser assumes the document consists of a single flow-style sequence, and once it encounters the corresponding ], the document is done.
If you want to have a single, multi-line string in your YAML file, you're best off using a block-style literal scalar. This is correct YAML:
|
[CASSANDRA]
a
a
[DSE_OPSCENTER]
...
If you don't want to run the risk of creating an invalid YAML file, then create a Python string variable data, append each line to it (including newlines), and write it to the file using the YAML library:
import sys
import ruamel.yaml
from ruamel.yaml.scalarstring import PreservedScalarString
yaml = ruamel.yaml.YAML()
yaml.explicit_end = True
data = ''
data += '[CASSANDRA]' + '\n'
data += 'a' + '\n'
data += 'a' + '\n'
data += '[DSE_OPSCENTER]' + '\n'
yaml.dump(PreservedScalarString(data), sys.stdout)
(The PreservedScalarString type wraps your multi-line Python string in something that ruamel.yaml dumps as a block-style literal scalar.)
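To build that multi-line string from hosts.csv instead of hard-coding it, a sketch along these lines should work (assuming, as in your reader, a comma-delimited file whose first column is the group name and whose remaining non-empty cells are its values):
import csv
import ruamel.yaml
from ruamel.yaml.scalarstring import PreservedScalarString

yaml = ruamel.yaml.YAML()
yaml.explicit_end = True

data = ''
with open('hosts.csv') as csvfile:
    for row in csv.reader(csvfile, delimiter=',', quotechar='"'):
        data += '[' + row[0] + ']\n'   # group name in square brackets
        for cell in row[1:]:
            if cell:                   # skip empty trailing cells
                data += cell + '\n'

with open('hosts', 'w') as f:
    yaml.dump(PreservedScalarString(data), f)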
UPDATED (I'm sorry, it's my first question)
I'm an intern and really new to coding.
In my job, I have to read a file from Azure storage and then insert this data into a database.
To do this, I'm using get_file_to_text().content and storing its value in a variable file as follows:
file = file_service.get_file_to_text('teste','','Retorno.csv').content
and then, I'm using .splitlines() like this:
formFile.append(file.splitlines())
I expected a result like this (each line of my file being a sublist):
[['2017-08-01', 'Zabbix server Sura', 'system.cpu.load[percpu,avg5]', '0.2900', '0.05217361111111111111', '0.1'], ['2017-08-01', 'Zabbix server Sura', 'system.cpu.util[,iowait]' ... ]
But I got this (one big sublist with all the lines inside):
[['2017-08-01;Zabbix server Sura;system.cpu.load[percpu,avg5];0.2900;0.05217361111111111111;0.1', '2017-08-01;Zabbix server Sura;system.cpu.util[,iowait]; ... ']]
I also tried a .split(';'):
file2 = file.split(';')
But it returns a flat list with the values only:
['2017-08-01', 'Zabbix server Sura', 'system.cpu.load[percpu,avg5]', '0.2900', '0.05217361111111111111', '0.1\n2017-08-01', 'Zabbix server Sura', 'system.cpu.util[,iowait]', ...]
What can I do to get the result I expect?
Thanks!
UPDATE (RESOLVED):
I did this and it worked fine:
data = []
azurestorage_text = file_service.get_file_to_text('teste', '',
                                                  'Retorno.csv').content
with StringIO(azurestorage_text) as file_obj:
    reader = csv.reader(file_obj, delimiter=';')
    header = next(reader)
    for line in reader:
        data.append(line)
.splitlines() will split the lines in the text input, returning a list of whole lines. In order to parse that into fields (bits between semicolons) you would need to then .split(';') each line, e.g.
lines = text.splitlines()
rows = []
for line in lines:
    rows.append(line.split(';'))
However if you want to split semicolon-separated text like this you should be using csv.reader to parse the data. It is more robust at handling CSV formats, including for example "quoted text". Splitting on semicolons will break if any of the fields in the data have semicolons within them, e.g. "semicolons in quoted; text".
csv.reader requires a file-like object to operate on, rather than a string. To pass in a string, you can use StringIO to create a file-like interface to it:
For Python2:
from cStringIO import StringIO
import csv
text = file_service.get_file_to_text('teste','','Retorno.csv').content
file_obj = StringIO(text)
reader = csv.reader(file_obj, delimiter=';')
for row in reader:
    print(row)
For Python3:
from io import StringIO
import csv
text = file_service.get_file_to_text('teste', '', 'Retorno.csv').content
file_obj = StringIO(text)
reader = csv.reader(file_obj, delimiter=';')
for row in reader:
    print(row)
Each row will contain a single line from your file, split into fields on the semicolons (specified by delimiter).
I want to create a csv from an existing csv, by splitting its rows.
Input csv:
A,R,T,11,12,13,14,15,21,22,23,24,25
Output csv:
A,R,T,11,12,13,14,15
A,R,T,21,22,23,24,25
So far my code looks like:
def update_csv(name):
    # load csv file
    file_ = open(name, 'rb')
    # init first values
    current_a = ""
    current_r = ""
    current_first_time = ""
    file_content = csv.reader(file_)
    # LOOP
    for row in file_content:
        current_a = row[0]
        current_r = row[1]
        current_first_time = row[2]
        i = 2
        # Write row to new csv
        with open("updated_" + name, 'wb') as f:
            writer = csv.writer(f)
            writer.writerow((current_a,
                             current_r,
                             current_first_time,
                             ",".join(row[x] for x in range(i + 1, i + 5))
                             ))
        # do only one row, for debug purposes
        return
But the row contains double quotes that I can't get rid of:
A002,R051,02-00-00,"05-21-11,00:00:00,REGULAR,003169391"
I've tried to use writer = csv.writer(f,quoting=csv.QUOTE_NONE) and got a _csv.Error: need to escape, but no escapechar set.
What is the correct approach to delete those quotes?
I think you could simplify the logic to split each row into two using something along these lines:
def update_csv(name):
    with open(name, 'rb') as file_:
        with open("updated_" + name, 'wb') as f:
            writer = csv.writer(f)
            # read one row from the input csv
            for row in csv.reader(file_):
                # write 2 rows to the new csv
                writer.writerow(row[:8])
                writer.writerow(row[:3] + row[8:])
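For the sample row in the question this produces exactly the desired output: row[:8] is A,R,T plus the first five values, and row[:3] + row[8:] re-attaches the three key fields to the remaining five values.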
writer.writerow expects an iterable, and it writes each item within that iterable as one field, separated by the appropriate delimiter, into the file. So:
writer.writerow([1, 2, 3])
would write "1,2,3\n" to the file.
Your call provides it with an iterable, one of whose items is a string that already contains the delimiter. It therefore needs some way to either escape the delimiter or quote out that item. For example,
writer.writerow([1, '2,3'])
doesn't give just "1,2,3\n", but rather '1,"2,3"\n' - the string counts as one item in the output.
Therefore, if you don't want quotes in the output, you need to provide an escape character (e.g. '/') to mark the delimiters that shouldn't be counted as such (giving something like "1,2/,3\n").
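For instance, a minimal sketch (the '/' escape character is an arbitrary choice):
import csv

with open('escaped.csv', 'wb') as f:
    writer = csv.writer(f, quoting=csv.QUOTE_NONE, escapechar='/')
    writer.writerow([1, '2,3'])   # written as: 1,2/,3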
However, I think what you actually want is to include all of those elements as separate items. Don't ",".join(...) them yourself; try:
writer.writerow((current_a, current_r, current_first_time) +
                tuple(row[i + 1:i + 5]))
to provide the relevant items from row as separate items in the tuple (tuple concatenation is used here because unpacking with * inside a tuple literal requires Python 3.5+, and your 'rb' mode suggests Python 2).