I have CSV and Excel files that were not saved as UTF-8, so I cannot simply load them into pandas. If I open a file manually and re-save it as Excel or CSV with UTF-8 selected, it loads fine in pandas, but I have too many files to do this by hand, and I don't want to replace the raw files (so overwriting them is out of the question). How can I accomplish this programmatically?
One solution I thought of is something like this:
import pandas as pd
with open('path/to/bad_file.csv', 'rb') as f:
    text = f.read()
with open('fixed-temp.csv', 'w', encoding='utf8') as f:
    f.write(text.decode(encoding="latin-1"))
df = pd.read_csv('fixed-temp.csv')
But this leaves behind a temporary file that I don't want. I guess I could write more code to delete the temporary file afterwards, but that seems unclean, and I'd rather encapsulate all of this in one convenience function.
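One way to avoid the temporary file is to decode the bytes in memory and hand pandas a file-like object instead of a path. The sketch below is only an illustration: the helper name `decode_to_buffer` is made up, and `latin-1` as the source encoding is an assumption carried over from the snippet above.

```python
import io


def decode_to_buffer(path, source_encoding="latin-1"):
    """Return the file's contents as an in-memory text buffer.

    source_encoding is an assumption: latin-1 never fails to decode,
    but the characters are only right if the file really is latin-1.
    """
    with open(path, "rb") as f:
        raw = f.read()
    # StringIO behaves like an open text file, so nothing is written to disk
    # and the original file is left untouched.
    return io.StringIO(raw.decode(source_encoding))


# Hypothetical usage with pandas (not imported here):
#     df = pd.read_csv(decode_to_buffer("path/to/bad_file.csv"))
```

For CSV files, `pd.read_csv` also accepts an `encoding` argument, so `pd.read_csv(path, encoding="latin-1")` may be all that is needed.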
My code:
import csv
# Create a list of Gujarati strings
strings = [['હેલો, વર્લ્ડ!', 'સુપ્રભાત', 'મારા નામ હેઠળ છે']]
# Open the CSV file in 'w' mode
with open('Gujarati.csv', 'w', encoding='utf-16', newline='') as f:
    # Create a CSV writer
    writer = csv.writer(f)
    # Write the strings to the CSV file
    writer.writerows(strings)
I am trying to write each heading to a different column, but they all end up in the same column and I don't know why. I want them in separate columns. Feel free to ask if you need more details.
I appreciate any help you can provide.
https://support.microsoft.com/en-us/office/import-or-export-text-txt-or-csv-files-5250ac4c-663c-47ce-937b-339e391393ba#:~:text=You%20can%20import%20data%20from,to%20import%2C%20and%20click%20Import.
Import or export text (.txt or .csv) files: You can change the separator character used in both delimited and .csv text files. This may be necessary to ensure that the import or export operation works the way you want it to.
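A possible cause, offered as an assumption rather than a confirmed diagnosis: Excel tends to treat UTF-16 text files as tab-delimited, so comma-separated UTF-16 data can land in a single column. One thing to try is writing comma-separated UTF-8 with a byte order mark instead. The helper name `write_rows_for_excel` is made up for this sketch.

```python
import csv


def write_rows_for_excel(path, rows):
    """Write rows as comma-separated UTF-8 with a byte order mark (BOM)."""
    # "utf-8-sig" prepends the BOM that Excel uses to recognise UTF-8.
    with open(path, "w", encoding="utf-8-sig", newline="") as f:
        csv.writer(f).writerows(rows)


strings = [['હેલો, વર્લ્ડ!', 'સુપ્રભાત', 'મારા નામ હેઠળ છે']]
write_rows_for_excel("Gujarati.csv", strings)
```

If Excel still shows one column, the list separator in the regional settings may not be a comma; in that case passing a different `delimiter` to `csv.writer` is worth trying.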
I am doing a data migration. The old application's data is exported as one CSV file, which we cannot import directly into the new application. We need to create a new CSV template that matches the new application and copy some of the data into it. I would like to request code that facilitates this.
I'm not exactly sure what template you want to go to. I'm going to assume that you either want to change the number/order of columns or the delimiter.
The simplest thing is to read it line by line and write it:
import csv
# newline='' lets the csv module handle line endings itself
with open("Old.csv", 'r', newline='') as readfp, open("new.csv", 'w', newline='') as writefp:
    csvReader = csv.reader(readfp)
    csvWriter = csv.writer(writefp, delimiter=',')
    for line in csvReader:
        # line is a list of strings, so you can reorder it as you wish.
        # I'll skip the third column as an example.
        csvWriter.writerow(line[:2] + line[3:])
If you have pandas installed, this is even simpler:
import pandas as pd
df = pd.read_csv("Old.csv")
# drop returns a new DataFrame, so assign the result
df = df.drop(columns=["name_of_bad_col1", "name_of_bad_col2"])
df.to_csv("new.csv", sep=',', index=False)
If you are going the pandas route, make sure to check out the documentation (read_csv, to_csv).
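If the new template also renames columns or adds empty ones, a mapping from new column names to old ones keeps the intent readable. This is a minimal sketch using only the standard library; `migrate_csv` and the column names in the usage comment are hypothetical, and UTF-8 is assumed for both files.

```python
import csv


def migrate_csv(old_path, new_path, column_map):
    """Copy selected columns from old_path into new_path under new names.

    column_map maps new-template column names to old column names;
    a value of None leaves that new column empty.
    """
    with open(old_path, newline="", encoding="utf-8") as src, \
         open(new_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.DictReader(src)
        # The order of column_map decides the column order of the template.
        writer = csv.DictWriter(dst, fieldnames=list(column_map))
        writer.writeheader()
        for row in reader:
            writer.writerow({
                new: (row[old] if old is not None else "")
                for new, old in column_map.items()
            })


# Hypothetical usage:
#     migrate_csv("Old.csv", "new.csv",
#                 {"full_name": "name", "email": "mail", "notes": None})
```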
I have been trying to save data in Excel's "CSV UTF-8 (Comma delimited) (*.csv)" format, which is different from the normal "CSV (Comma delimited) (*.csv)" format: it displays Unicode text correctly when opened in Excel. I can save in that format easily from Excel, but from Python I am only able to save a normal CSV. That does not lose any data, but when the file is opened in Excel it shows text like "à¤à¤‰à¤Ÿà¤¾" instead of "एउटा".
If I open the file in Notepad, copy the text into Excel and then manually save it as CSV UTF-8, the text displays correctly. But this is time consuming, because all values appear on the same line in Notepad and I have to separate them in Excel.
So I just want to know how I can save data in Excel's CSV UTF-8 format using Python.
I have tried the following code, but it results in a normal CSV file.
import codecs
import unicodecsv as csv
input_text = codecs.open('input.txt', encoding='utf-8')
all_text = input_text.read()
text_list = all_text.split()
output_list = [['Words','Tags']]
for input_word in text_list:
    word_tag_list = [input_word,'O']
    output_list.append(word_tag_list)
with codecs.open("output.csv", "wb") as f:
    writer = csv.writer(f)
    writer.writerows(output_list)
You need to indicate to Excel that this is a UTF-8 file. Unfortunately the only way to do this is by prepending a special byte sequence to the front of the file. Python will do this automatically if you use a special encoding.
with codecs.open("output.csv", "w", encoding="utf_8_sig") as f:
I have found the answer. Passing encoding='utf_8_sig' to csv.writer (here unicodecsv's writer, imported as csv) writes the file as CSV UTF-8 for Excel. The previous code can be written as:
with open("output.csv", "wb") as f:
    writer = csv.writer(f, dialect='excel', encoding='utf_8_sig')
    writer.writerows(output_list)
However, there was a problem when a word has a comma at the end, e.g. "भने,". I didn't need the comma in this case, so I removed it with the following code inside the for loop.
import re
if re.search(r'.,$', input_word):
    input_word = re.sub(',$', '', input_word)
Finally, I was able to obtain the desired output, with the Unicode characters displayed correctly and the extra trailing comma removed. If anyone knows how to ignore a comma at the end of the data in an Excel file, you can comment here. Thanks.
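The code in this thread uses `unicodecsv`, which suggests Python 2. On Python 3 the standard `csv` module is enough, because the encoding is chosen when the file is opened. The sketch below is a rough equivalent, not the poster's exact code: `build_rows` and `write_excel_csv` are made-up names, and it strips trailing commas with `str.rstrip` instead of the regular expression.

```python
import csv


def build_rows(text, tag="O"):
    """Split text on whitespace and pair every word with a tag."""
    rows = [["Words", "Tags"]]
    for word in text.split():
        # Unlike the regex above, rstrip removes every trailing comma,
        # and a token consisting only of commas is skipped entirely.
        word = word.rstrip(",")
        if word:
            rows.append([word, tag])
    return rows


def write_excel_csv(path, rows):
    """Write rows as UTF-8 with a BOM so Excel detects the encoding."""
    with open(path, "w", encoding="utf-8-sig", newline="") as f:
        csv.writer(f).writerows(rows)


# Hypothetical usage:
#     with open("input.txt", encoding="utf-8") as f:
#         write_excel_csv("output.csv", build_rows(f.read()))
```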
I know the question sounds generic, but here is my problem.
I have a CSV file that always causes UnicodeErrors and errors like csv.empty, although I am opening the file with utf-8 like this:
with open(csv_filename, 'r', encoding='utf-8') as csvfile:
A workaround I found is to open the file, copy the lines and save them to a new file (with Visual Studio Code); then everything works fine.
Someone told me that I have to use pandas. Is that true?
Is there a difference between opening a file with csv and with pandas?
Pandas will load the contents of the csv file into a dataframe.
The csv module has classes like reader and DictReader that return iterators, which let you move through the file row by row.
With Pandas:
import pandas as pd
df=pd.read_csv('file.csv')
df.to_csv('new_file.csv',index=False)
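If the underlying problem is that the file is not really UTF-8, neither library fixes that by itself; the right encoding still has to be supplied. A minimal sketch with the csv module is to try a few candidate encodings in order. The function name `read_rows` and the list of candidates are assumptions, not something taken from the question.

```python
import csv


def read_rows(path, encodings=("utf-8-sig", "cp1252", "latin-1")):
    """Return (rows, encoding) for the first encoding that decodes the file."""
    for encoding in encodings:
        try:
            with open(path, newline="", encoding=encoding) as f:
                # list() forces the whole file to be decoded inside the try.
                return list(csv.reader(f)), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode " + path)
```

Note that `latin-1` accepts any byte sequence, so it works as a last resort but can silently produce wrong characters.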
I am writing a program in Python which should import *.dat files, subtract a specific value from certain columns and subsequently save the file in *.dat format in a different directory.
My current tactic is to load the datafiles in a numpy array, perform the calculation and then save it. I am stuck with the saving part. I do not know how to save a file in python in the *.dat format. Can anyone help me? Or is there an alternative way without needing to import the *.dat file as a numpy array? Many thanks!
You can use struct to pack the integers into bytes and write them to a .dat file.
import struct
data = [1, 2, 3]  # replace with your own integers
Open:
with open('your_data.dat', 'rb') as your_data_file:
    values = struct.unpack('i'*len(data), your_data_file.read())
Save data:
with open('your_data.dat', 'wb') as your_dat_file:
    your_dat_file.write(struct.pack('i'*len(data), *data))
Reference.
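The snippet above needs `len(data)` to be known when reading the file back. One way around that, sketched here under the assumption that the file holds nothing but 32-bit signed integers, is to store the count at the start of the file. The names `save_ints` and `load_ints` and the little-endian layout are choices made for this example.

```python
import struct


def save_ints(path, values):
    """Write a 4-byte count followed by that many little-endian int32 values."""
    with open(path, "wb") as f:
        f.write(struct.pack("<I", len(values)))
        f.write(struct.pack(f"<{len(values)}i", *values))


def load_ints(path):
    """Read back a file written by save_ints."""
    with open(path, "rb") as f:
        (count,) = struct.unpack("<I", f.read(4))
        return list(struct.unpack(f"<{count}i", f.read(4 * count)))
```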
You can read and export a .dat file using pandas:
import pandas as pd
input_df = pd.read_table('input_file_name.dat')
...
output_df = pd.DataFrame({'column_name': column_values})
output_df.to_csv('output_file_name.dat')
Assuming you open your file like this:
file = open(filename, "r")
all you need to do is open another file with "w" as the second parameter:
file = open(new_file_path, "w")
file.write(data)
file.close()
If your data is not a string, either convert it to a string, or use
file = open(filename, "rb")
file = open(filename, "wb")
when reading and writing, since these modes read and write raw bytes.
The .dat file can be read using the pandas library:
df = pd.read_csv('xxxx.dat', sep=r'\s+', header=None, skiprows=1)
skiprows=1 will ignore the first row, which is the header.
\s+ matches any run of whitespace, which is a common separator in .dat files.
Correct me if I'm wrong, but opening, writing to, and subsequently closing a file should count as "saving" it. You can test this yourself by running your import script and comparing the last modified dates.
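For a plain-text .dat file, the whole task (read, subtract a value from certain columns, save into a different directory) can also be done without numpy or pandas. The sketch below assumes whitespace-separated numeric columns and no header row; `subtract_from_columns` is a made-up name, and the output is re-joined with single spaces, so the original column alignment is not preserved.

```python
import os


def subtract_from_columns(src_path, dst_dir, columns, value):
    """Write a copy of src_path into dst_dir with value subtracted
    from the given zero-based column indices."""
    os.makedirs(dst_dir, exist_ok=True)
    dst_path = os.path.join(dst_dir, os.path.basename(src_path))
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in src:
            fields = line.split()
            if not fields:
                # Keep blank lines as they are.
                dst.write("\n")
                continue
            for i in columns:
                fields[i] = str(float(fields[i]) - value)
            dst.write(" ".join(fields) + "\n")
    return dst_path
```

The output keeps the original file name, so `dst_dir` should differ from the source directory to avoid overwriting the input.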