I have a text file (.txt) which could be in tab-separated or pipe-separated format, and I need to convert it into CSV format. I am using Python 2.6. Can anyone suggest how to identify the delimiter in a text file, read the data, and then convert it into a comma-separated file?
Thanks in advance
I fear that you can't identify the delimiter without knowing what it is. The problem with CSV is that, quoting ESR:
the Microsoft version of CSV is a textbook example of how not to design a textual file format.
The delimiter needs to be escaped in some way if it can appear in fields. Without knowing how the escaping is done, automatically identifying the delimiter is difficult. Escaping could be done the UNIX way, using a backslash '\', or the Microsoft way, using quotes which then must be escaped, too. This is not a trivial task.
So my suggestion is to get full documentation from whoever generates the file you want to convert. Then you can use one of the approaches suggested in the other answers or some variant.
Edit:
Python provides csv.Sniffer that can help you deduce the format of your DSV. If your input looks like this (note the quoted delimiter in the first field of the second row):
a|b|c
"a|b"|c|d
foo|"bar|baz"|qux
You can do this:
import csv
csvfile = open("csvfile.csv")
dialect = csv.Sniffer().sniff(csvfile.read(1024))
csvfile.seek(0)
reader = csv.DictReader(csvfile, dialect=dialect)
for row in reader:
    print row,
# => {'a': 'a|b', 'c': 'd', 'b': 'c'} {'a': 'foo', 'c': 'qux', 'b': 'bar|baz'}
# write records using other dialect
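To produce the comma-separated output the question actually asks for, the sniffed rows can be re-written with the csv module's default comma dialect. A minimal sketch (Python 2, assuming an output file named output.csv, which is not part of the original answer):
import csv

csvfile = open("csvfile.csv")
dialect = csv.Sniffer().sniff(csvfile.read(1024))
csvfile.seek(0)
reader = csv.reader(csvfile, dialect=dialect)

# the default writer dialect is comma-separated with quoting as needed
outfile = open("output.csv", "wb")
writer = csv.writer(outfile)
for row in reader:
    writer.writerow(row)
outfile.close()
csvfile.close()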
Your strategy could be the following:
parse the file with BOTH a tab-separated csv reader and a pipe-separated csv reader
calculate some statistics on the resulting rows to decide which result set is the one you want to write. One idea is counting the total number of fields in the two record sets (expecting that tabs and pipes are not so common in the data). Another (if your data is strongly structured and you expect the same number of fields in each line) is measuring the standard deviation of the number of fields per line and taking the record set with the smallest standard deviation; a sketch of that variant follows the example below.
In the following example you find the simpler statistic (total number of fields):
import csv

piperows = []
tabrows = []

# parsing with | delimiter
f = open("file", "rb")
readerpipe = csv.reader(f, delimiter="|")
for row in readerpipe:
    piperows.append(row)
f.close()

# parsing with TAB delimiter
f = open("file", "rb")
readertab = csv.reader(f, delimiter="\t")
for row in readertab:
    tabrows.append(row)
f.close()

# in this example, we use the total number of fields as indicator
# (but it's not guaranteed to work! it depends on the nature of your data)

# count total fields
totfieldspipe = reduce(lambda x, y: x + y, [len(f) for f in piperows])
totfieldstab = reduce(lambda x, y: x + y, [len(f) for f in tabrows])

if totfieldspipe > totfieldstab:
    yourrows = piperows
else:
    yourrows = tabrows

# the var yourrows contains the rows, now just write them in any format you like
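For completeness, here is a hedged sketch of the standard-deviation heuristic mentioned above (not in the original answer; it assumes piperows and tabrows have already been filled, and are non-empty, as in the example):
import math

def fields_stddev(rows):
    # standard deviation of the number of fields per line
    counts = [len(row) for row in rows]
    mean = float(sum(counts)) / len(counts)
    return math.sqrt(sum((c - mean) ** 2 for c in counts) / len(counts))

# the more regular parse (lower deviation) is probably the right one
if fields_stddev(piperows) < fields_stddev(tabrows):
    yourrows = piperows
else:
    yourrows = tabrows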
Like this:
from __future__ import with_statement
import csv
import re
with open(input, "r") as source:
    with open(output, "wb") as destination:
        writer = csv.writer(destination)
        for line in source:
            writer.writerow(re.split('[\t|]', line))
I would suggest taking some of the example code from the existing answers (or, perhaps better, using the csv module from Python) and changing it to first assume tab-separated, then pipe-separated, producing two comma-separated output files. Then you visually examine both files to determine which one you want and pick that.
If you actually have lots of files, then you need to find a way to detect which file is which.
One of the examples has this:
if "|" in line:
This may be enough: if the first line of a file contains a pipe, then maybe the whole file is pipe-separated; otherwise assume a tab-separated file.
Alternatively fix the file to contain a key field in the first line which is easily identified - or maybe the first line contains column headers which can be detected.
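A minimal sketch of that first-line heuristic (my illustration, not from the original answer; it assumes the delimiter cannot appear inside a quoted field on the first line):
def guess_delimiter(path):
    # peek at the first line only
    with open(path) as f:
        first_line = f.readline()
    return "|" if "|" in first_line else "\t"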
for line in open("file"):
    line = line.strip()
    if "|" in line:
        print ','.join(line.split("|"))
    else:
        print ','.join(line.split("\t"))
I want to create a word dictionary. The dictionary looks like this:
words_meanings = {
    "rekindle": "relight",
    "pesky": "annoying",
    "verge": "border",
    "maneuver": "activity",
    "accountability": "responsibility",
}
keys_letter = []
for x in words_meanings:
    keys_letter.append(x)
print(keys_letter)
Output: rekindle, pesky, verge, maneuver, accountability
Here rekindle, pesky, verge, maneuver, and accountability are the keys, and relight, annoying, border, activity, and responsibility are the values.
Now I want to create a CSV file, and my code will take input from that file.
The file looks like this:
rekindle | pesky | verge | maneuver | accountability
relight | annoying| border| activity | responsibility
So far I have used this code to load the file and read data from it:
from google.colab import files
uploaded = files.upload()
import pandas as pd
data = pd.read_csv("words.csv")
data.head()
import csv
reader = csv.DictReader(open("words.csv", 'r'))
words_meanings = []
for line in reader:
    words_meanings.append(line)
print(words_meanings)
This is the output of print(words_meanings)
[OrderedDict([('\ufeffrekindle', 'relight'), ('pesky', 'annoying')])]
It looks very odd to me.
keys_letter = []
for x in words_meanings:
    keys_letter.append(x)
print(keys_letter)
Now I create an empty list and want to append only the keys. But the output is [OrderedDict([('\ufeffrekindle', 'relight'), ('pesky', 'annoying')])] again.
I am confused: as per the first code block it only included keys, but now it includes both keys and their values. How can I overcome this situation?
I would suggest that you format your csv with your key and value on the same row, like this:
rekindle,relight
pesky,annoying
verge,border
This way the following code will work:
words_meanings = {}
with open(file_name, 'r') as file:
    for line in file.readlines():
        key, value = line.split(",")
        words_meanings[key] = value.rstrip("\n")
If you want a list of the keys:
list_of_keys = list(words_meanings.keys())
To add keys and values to the file:
def add_values(key: str, value: str, file_name: str):
    with open(file_name, 'a') as file:
        file.writelines(f"\n{key},{value}")

key = input("Input the key you want to save: ")
value = input(f"Input the value you want to save to {key}: ")
add_values(key, value, file_name)
You run the same block of code, but you use it with different objects, and this gives different results.
First you use a normal dictionary (check type(words_meanings)):
words_meanings = {
    "rekindle": "relight",
    "pesky": "annoying",
    "verge": "border",
    "maneuver": "activity",
    "accountability": "responsibility",
}
and the for loop gives you the keys from this dictionary.
You could get the same with
keys_letter = list(words_meanings.keys())
or even
keys_letter = list(words_meanings)
Later you use a list with a single dictionary inside it (check type(words_meanings)):
words_meanings = [OrderedDict([('\ufeffrekindle', 'relight'), ('pesky', 'annoying')])]
and the for loop gives you the elements of this list, not the keys of the dictionary inside it. So you move the full dictionary from one list to another.
You could get the same with
keys_letter = words_meanings.copy()
or, equivalently,
keys_letter = list(words_meanings)
from collections import OrderedDict

words_meanings = {
    "rekindle": "relight",
    "pesky": "annoying",
    "verge": "border",
    "maneuver": "activity",
    "accountability": "responsibility",
}
print(type(words_meanings))

keys_letter = []
for x in words_meanings:
    keys_letter.append(x)
print(keys_letter)

#keys_letter = list(words_meanings.keys())
keys_letter = list(words_meanings)
print(keys_letter)

words_meanings = [OrderedDict([('\ufeffrekindle', 'relight'), ('pesky', 'annoying')])]
print(type(words_meanings))

keys_letter = []
for x in words_meanings:
    keys_letter.append(x)
print(keys_letter)

#keys_letter = words_meanings.copy()
keys_letter = list(words_meanings)
print(keys_letter)
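If the goal is the keys of the dictionary that sits inside the list, index into the list first. A minimal sketch (my addition, not from the original answer):
# words_meanings is a list holding one dictionary, so take element 0
inner_dict = words_meanings[0]
keys_letter = list(inner_dict.keys())
print(keys_letter)  # ['\ufeffrekindle', 'pesky']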
The default field separator for the csv module is a comma. Your CSV file uses the pipe or bar symbol |, and the fields also seem to be fixed-width. So you need to specify | as the delimiter to use when creating the CSV reader.
Also, your CSV file is encoded as big-endian UTF-16 Unicode text (UTF-16-BE). The file contains a byte order mark (BOM), but Python is not stripping it off, so you will notice that the string '\ufeffrekindle' contains the FEFF UTF-16-BE BOM. That can be dealt with by specifying encoding='utf-16' when you open the file.
import csv

with open('words.csv', newline='', encoding='utf-16') as f:
    reader = csv.DictReader(f, delimiter='|', skipinitialspace=True)
    for row in reader:
        print(row)
Running this on your CSV file produces this:
{'rekindle ': 'relight ', 'pesky ': 'annoying', 'verge ': 'border', 'maneuver ': 'activity ', 'accountability': 'responsibility'}
Notice that there is trailing whitespace in the keys and values. skipinitialspace=True removed the leading whitespace, but there is no option to remove trailing whitespace. That can be fixed by exporting the CSV file from Excel without specifying a field width. If that can't be done, it can be fixed by preprocessing the file using a generator:
import csv

def preprocess_csv(f, delimiter=','):
    # assumes that fields cannot contain embedded newlines
    for line in f:
        yield delimiter.join(field.strip() for field in line.split(delimiter))

with open('words.csv', newline='', encoding='utf-16') as f:
    reader = csv.DictReader(preprocess_csv(f, '|'), delimiter='|', skipinitialspace=True)
    for row in reader:
        print(row)
which now outputs the stripped keys and values:
{'rekindle': 'relight', 'pesky': 'annoying', 'verge': 'border', 'maneuver': 'activity', 'accountability': 'responsibility'}
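Since the question ultimately wants a words_meanings dictionary, note that each row produced by DictReader already maps each word to its meaning. A minimal sketch building the dictionary from the first data row (my addition; it reuses the preprocess_csv generator above and assumes the two-line file layout shown in the question):
import csv

with open('words.csv', newline='', encoding='utf-16') as f:
    reader = csv.DictReader(preprocess_csv(f, '|'), delimiter='|', skipinitialspace=True)
    words_meanings = dict(next(reader))  # the first (and only) data row

keys_letter = list(words_meanings)  # just the words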
Since no one was able to help me with the answer, I am finally posting it here. I hope this will help others.
import csv

file_name = "words.csv"
words_meanings = {}
with open(file_name, newline='', encoding='utf-8-sig') as file:
    for line in file.readlines():
        key, value = line.split(",")
        words_meanings[key] = value.rstrip("\n")
print(words_meanings)
This is the code to convert a CSV file into a dictionary. Enjoy!!!
I have the CSV file format below which I need to convert to YAML (or the example output below).
CSV file format
CASSANDRA a a
DSE_OPSCENTER
IGNITE a
KAFKA_LEAD b
KAFKA_SMART
OAM
RBM a
I used the code below to convert the file into the expected output:
import csv

csvfile = open('hosts.csv', 'r')
datareader = csv.reader(csvfile, delimiter=',', quotechar='"')
data_headings = []
for i in datareader:
    new_yaml = open('hosts', 'a')
    yaml_text = ""
    #heading = "["+i[0]+"]"
    #new_yaml.write(heading)
    new_yaml.write('\n')
    for cell in i:
        print cell
        new_yaml.write(cell)
        new_yaml.write('\n')
    new_yaml.close()
csvfile.close()
And I get the output below, which is fine for me:
CASSANDRA
a
a
DSE_OPSCENTER
IGNITE
a
KAFKA_LEAD
KAFKA_SMART
...
I would like a little help here putting CASSANDRA, DSE_OPSCENTER, and so on within square brackets, something like below:
[CASSANDRA]
a
a
[DSE_OPSCENTER]
...
Edit
I added a template format, but I don't know how to put the values in their respective groups.
HOST_VAR_TEMPLATE = """
[CASSANDRA]
{cell}
[DSE_OPSCENTER]
[SMART]
[SPARK]
[SPARK_MASTERS]
[ZK]
"""
csvfile = open('hosts.csv', 'r')
datareader = csv.reader(csvfile, delimiter=',', quotechar='"')
data_headings = []
for i in datareader:
    print i[1:]
    with open('hosts', "w") as f:
        for cell in i:
            print cell
            f.write(
                HOST_VAR_TEMPLATE.format(
                    cell=cell,
                )
            )
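For the bracketed groups asked about in the edit, one approach is to write the first column of each row as the heading and the remaining cells beneath it. A hedged sketch (my illustration, not from the answers in this thread; it assumes the first cell of each row is the group name):
import csv

with open('hosts.csv', 'r') as csvfile, open('hosts', 'w') as out:
    for row in csv.reader(csvfile, delimiter=',', quotechar='"'):
        if not row:
            continue
        out.write('[%s]\n' % row[0])  # heading in square brackets
        for cell in row[1:]:
            if cell:  # skip empty cells
                out.write(cell + '\n')
        out.write('\n')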
You are creating a file with a single document that consists of a multi-line scalar string, followed by an explicit end-of-document marker.
You are trying to write the file yourself instead of using a YAML library. In principle that is possible, but there are some corner cases where you have to take special care.
You should first load the end result using a YAML library:
import ruamel.yaml
yaml = ruamel.yaml.YAML()
data = yaml.load("""\
[CASSANDRA]
a
a
[DSE_OPSCENTER]
...
""")
This will give you a ComposerError telling you that a (the first one) starts a new document. That is because, starting with [, the parser assumes that the document consists of a single flow-style sequence, and once it encounters the corresponding ], the document is done.
If you want to have a single, multi-line string in your YAML file, you're best off using a block-style literal scalar. This is correct YAML:
|
  [CASSANDRA]
  a
  a
  [DSE_OPSCENTER]
...
If you don't want to run the risk of creating an invalid YAML file, then create a Python string variable data and append each line to it including newlines, then write it to file using the YAML library:
import sys
import ruamel.yaml
from ruamel.yaml.scalarstring import PreservedScalarString
yaml = ruamel.yaml.YAML()
yaml.explicit_end = True
data = ''
data += '[CASSANDRA]' + '\n'
data += 'a' + '\n'
data += 'a' + '\n'
data += '[DSE_OPSCENTER]' + '\n'
yaml.dump(PreservedScalarString(data), sys.stdout)
(The PreservedScalarString type wraps your multi-line Python string into something that ruamel.yaml dumps as a block-style literal scalar.)
I am currently fetching data from an API and I would like to store that data as CSV.
However, some lines are always invalid, which means I cannot split them via Excel's Text to Columns functionality.
I create the csv file as follows:
with open(directory_path + '/' + file_name + '-data.csv', 'a', newline='') as file:
    # Setup a writer
    csvwriter = csv.writer(file, delimiter='|')
    # Write headline row
    if not headline_exists:
        csvwriter.writerow(['Title', 'Text', 'Tip'])
    # Build the data row
    record = data['title'] + '|' + data['text'] + '|' + data['tip']
    csvwriter.writerow([record])
If you open the CSV file in Excel you also immediately see that a row is invalid: while a valid one takes the default height and the whole width, the invalid one takes more height but less width.
Does anyone know a reason for that problem?
The rows are not invalid, but what you do is.
So first of all: you use pipes as delimiters. That is fine in some scenarios, but given that you want to load the file into Excel immediately, it seems wiser to me to export the data in the "excel" dialect:
csvwriter = csv.writer(file, dialect='excel')
Second, look at the following lines:
record = data['title'] + '|' + data['text'] + '|' + data['tip']
csvwriter.writerow([record])
This way you basically tell the CSV writer that you want a single column with pipes in it. If you use a CSV writer you must not concatenate the delimiters on your own; that defeats the point of using a writer. This is how it should be done instead:
record = [data['title'], data['text'], data['tip']]
csvwriter.writerow(record)
Hope it helps.
I finally found out that I had to strip the text and the tip, because they sometimes contain whitespace which would break the format.
Additionally, I also followed the recommendation to use the excel dialect since I think this will make it easier to process the data later on.
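Putting both fixes together, a minimal sketch of the corrected writing code (my reconstruction from the thread, reusing the variable names from the question):
import csv

with open(directory_path + '/' + file_name + '-data.csv', 'a', newline='') as file:
    csvwriter = csv.writer(file, dialect='excel')
    if not headline_exists:
        csvwriter.writerow(['Title', 'Text', 'Tip'])
    # one field per column; the writer handles delimiters and quoting
    csvwriter.writerow([data['title'].strip(), data['text'].strip(), data['tip'].strip()])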
My data looks like this:
['[\'Patient, A\', \'G\', \'P\', \'RNA\']']
Irrespective of the brackets, quotes, and backslashes, I'd like to separate the data by ',' and write it to a CSV file like below:
Patient,A,G,P,RNA
Specifying delimiter=',' has not helped. The output file then looks like
['Patient, A','G','P','RNA']
all in a single cell. I want to split them into multiple columns. How can I do that?
Edit - specifying quotechar='|' split them into different cells, but it now looks like
|['Patient, A','G','P','RNA']|
Edit-
out_file_handle = csv.writer(out_file, quotechar='|', lineterminator='\n', delimiter = ",")
data = ''.join(mydict.get(word.lower(), word) for word in re.split('(\W+)', transposed))
data = [data,]
out_file_handle.writerow(data)
transposed:
['Patient, A','G','P','RNA']
data:
['[\'Patient, A\', \'G\', \'P\', \'RNA\']']
And it has multiple rows; the above is one of the rows from the entire data.
You first need to read this data into a Python list, by processing the string as a CSV file in memory:
from StringIO import StringIO
import csv
data = ['[\'Patient, A\', \'G\', \'P\', \'RNA\']']
clean_data = list(csv.reader( StringIO(data[0]) ))
However, the output is still a single string, because it's not even well-formed CSV! In that case, the best thing might be to filter out all those junk characters:
import re
clean_data = re.sub("[\[\]']","",data[0])
Now clean_data is 'Patient, A, G, P, RNA', which is clean CSV you can write straight to a file.
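For completeness, a minimal sketch of writing that cleaned string out (my addition):
with open('new.csv', 'w') as f:
    f.write(clean_data + '\n')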
If what you're trying to do is write data of the form ['[\'Patient, A\', \'G\', \'P\', \'RNA\']'], where you have an array of these strings, to file, then it's really a question in two parts.
The first is how to separate the data into the correct format, and the second is how to write it to file.
If that is the form of your data, for every row, then something like this should work (to get it into the correct format):
data = ['[\'Patient, A\', \'G\', \'P\', \'RNA\']', ...]
newData = [entry.replace("\'", "")[1:-1].split(",") for entry in data]
that will give you data in the following form:
[["Patient", "A", "G", "P", "RNA"], ...]
and then you can write it to file as suggested in the other answers:
with open('new.csv', 'wb') as write_file:
    file_writer = csv.writer(write_file)
    for dataEntry in newData:
        file_writer.writerow(dataEntry)
If you don't actually care about using the data in this round, and just want to clean it up, then you can just do entry.replace("\'", "")[1:-1] for each entry and then write those strings to file.
The [1:-1] bits are just to remove the leading and trailing square brackets.
Python has a CSV writer. Start off with
import csv
Then try something like this
with open('new.csv', 'wb') as write_file:
    file_writer = csv.writer(write_file)
    for i in range(len(data)):
        file_writer.writerow([x for x in data[i]])
Edit:
You might have to wrangle the data a bit first before writing it, since it looks like it's a string and not actually a list. Try playing around with the split() function:
items = data.split()
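Since each string here is actually the repr of a Python list, a more robust option than manual splitting is ast.literal_eval. A hedged sketch (my suggestion, not from the original answers):
import ast
import csv

data = ['[\'Patient, A\', \'G\', \'P\', \'RNA\']']

with open('new.csv', 'w') as write_file:
    file_writer = csv.writer(write_file)
    for entry in data:
        # safely evaluate the stringified list back into a real list
        row = ast.literal_eval(entry)  # ['Patient, A', 'G', 'P', 'RNA']
        # split 'Patient, A' further so the output matches Patient,A,G,P,RNA
        fields = [part.strip() for item in row for part in item.split(',')]
        file_writer.writerow(fields)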
"""
SAVING DATA INTO CSV FORMAT
* This format is used for many purposes, mainly for deep learning.
* This type of file can be used to view data in MS Excel or any similar
Application
"""
# == Imports ===================================================================
import csv
import sys
# == Initialisation Function ===================================================
def initialise_csvlog(filename, fields):
"""
Initilisation this function before using the Inserction function
* This Function checks the data before adding new one in order to maintain
perfect mechanisum of insertion
* It check the file if not exists then it creates a new one
* if it exists then it proceeds with getting fields
Parameters
----------
filename : String
Filename along with directory which need to be created
Fields : List
Colomns That need to be initialised
"""
try :
with open(filename,'r') as csvfile:
csvreader = csv.reader(csvfile)
fields = csvreader.next()
print("Data Already Exists")
sys.exit("Please Create a new empty file")
# print fields
except :
with open(filename,'w') as csvfile:
csvwriter = csv.writer(csvfile)
csvwriter.writerow(fields)
# == Data Insertion Function ===================================================
def write_data_csv(filename, row_data):
"""
This Function save the Row Data into the CSV Created
* This adds the row data that is Double Listed
Parameters
----------
filename : String
Filename along with directory which need to be created
row_data : List
Double Listed consisting of row data and column elements in a list
"""
with open(filename,'a') as csvfile:
csvwriter = csv.writer(csvfile)
csvwriter.writerows(row_data)
if __name__ == '__main__':
"""
This function is used to test the Feature Run it independently
NOTE: DATA IN row_data MUST BE IN THE FOLLOWING DOUBLE LISTED AS SHOWN
"""
filename = "TestCSV.csv"
fields = ["sno","Name","Work","Department"]
#Init
initialise_csvlog(filename,fields)
#Add Data
row_data = [["1","Jhon","Coder","Pythonic"]]
write_data_csv(filename,row_data)
# == END =======================================================================
Read the module and you can start using CSV and viewing the data in Excel or any similar application (Calc in LibreOffice).
NOTE: Remember that the row data must be double-listed, as shown in the __main__ block (row_data).