Save data as a *.dat file? - python

I am writing a program in Python which should import *.dat files, subtract a specific value from certain columns and subsequently save the file in *.dat format in a different directory.
My current tactic is to load the data files into a numpy array, perform the calculation and then save it. I am stuck with the saving part: I do not know how to save a file in Python in the *.dat format. Can anyone help me? Or is there an alternative way that does not require importing the *.dat file as a numpy array? Many thanks!
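One way to approach this, assuming the *.dat file is plain whitespace-delimited text (the file names, the column index and the subtracted value below are made up for illustration), is to let numpy handle both the reading and the writing:
import numpy as np

data = np.loadtxt('input/measurements.dat')   # hypothetical input path, whitespace-delimited text
data[:, 2] -= 5.0                             # subtract a value from the third column
np.savetxt('output/measurements.dat', data, fmt='%.6f', delimiter=' ')
A .dat extension does not imply any particular format; np.savetxt simply writes plain text, so whatever extension you pass is what you get.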

You can use struct to pack the integers in a bytes format and write them to a .dat file.
import struct

data = []  # your list of integers

Open:
with open('your_data.dat', 'rb') as your_data_file:
    values = struct.unpack('i' * len(data), your_data_file.read())
Save data:
with open('your_data.dat', 'wb') as your_data_file:
    your_data_file.write(struct.pack('i' * len(data), *data))

You can read and export a .dat file using pandas:
import pandas as pd
input_df = pd.read_table('input_file_name.dat')
...
output_df = pd.DataFrame({'column_name': column_values})
output_df.to_csv('output_file_name.dat')
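One caveat: by default to_csv writes a comma-separated file and includes the DataFrame index, which may not match the layout of the original .dat file. If, for instance, the input was tab-separated (an assumption about your data), something like this keeps the output closer to the input:
output_df.to_csv('output_file_name.dat', sep='\t', index=False)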

Assuming you are opening your file like
file = open(filename, "r")
all you need to do is open another file with "w" as the second parameter
file = open(new_file_path, "w")
file.write(data)
file.close()
If your data is not a string, either convert it to a string first, or open the files with
file = open(filename, "rb")
file = open(filename, "wb")
when reading and writing, since these modes read and write raw bytes.
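Tying that back to the original question, a minimal pure-Python sketch might look like the following; the file paths, the column position and the subtracted value are all assumptions for illustration:
input_path = 'input/data.dat'      # hypothetical paths
output_path = 'output/data.dat'

with open(input_path, "r") as src, open(output_path, "w") as dst:
    for line in src:
        fields = line.split()                      # assumes whitespace-delimited columns
        fields[1] = str(float(fields[1]) - 5.0)    # subtract 5.0 from the second column
        dst.write(" ".join(fields) + "\n")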

The .dat file can be read using the pandas library:
import pandas as pd
df = pd.read_csv('xxxx.dat', sep=r'\s+', header=None, skiprows=1)
skiprows=1 ignores the first row, which is the header.
r'\s+' matches any run of whitespace, which is the typical separator in .dat files.

Correct me if I'm wrong, but opening, writing to, and subsequently closing a file should count as "saving" it. You can test this yourself by running your import script and comparing the last modified dates.

Related

Python - how to re-export a file as utf-8

I have csv & excel files that were not correctly saved as UTF-8, so I cannot simply load them into pandas. Manually, I can open each file and save it as Excel or CSV with UTF-8 selected, and then it works fine in pandas, but I have too many files to do this manually and I don't want to replace the raw files (so overwriting them is out of the question). How can I accomplish this programmatically?
I thought one solution could be to do something like this:
import pandas as pd
with open('path/to/bad_file.csv', 'rb') as f:
    text = f.read()
with open('fixed-temp.csv', 'w', encoding='utf8') as f:
    f.write(text.decode(encoding="latin-1"))
df = pd.read_csv('fixed-temp.csv')
But this leaves behind a temporary file (or a new file) that I don't want. I guess I could write more code to delete the temporary file afterwards, but that seems unclean, and I'd rather encapsulate all of this into one convenience function.
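For what it's worth, if the files really are Latin-1 encoded (that is an assumption; the true source encoding has to be known or guessed), pandas can decode them directly, with no intermediate file; the -utf8 output name below is made up:
import pandas as pd

# Read the mis-encoded CSV directly, assuming it is actually Latin-1
df = pd.read_csv('path/to/bad_file.csv', encoding='latin-1')

# Write a UTF-8 copy alongside the original, leaving the raw file untouched
df.to_csv('path/to/bad_file-utf8.csv', index=False, encoding='utf-8')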

How to recover a gzipped .bed file in its original format

import gzip  # compresses the .bed example file

input_file = open("example.bed", "rb")
data = input_file.read()
# convert_data = bytearray(data)
with gzip.open("example.bed.gz", "wb") as filez:
    filez.write(data)
filez.close()

# failed attempts
with gzip.open("example.bed.gz", "r+") as fileopen:
    output = fileopen.read()
output
print(output)

# this works but not in the desired manner
import pandas as pd
df = pd.read_csv("example.bed.gz", delimiter='\t', header=1)
df.to_csv('exampleziptotxt.bed', index=False)
Format before gzipping: 'chr8\t59420123
Format from opening and reading the gzipped file: b'chr8\t59420123\
I have tried decoding to utf-8, only to get a bytes conflict.
The above script gzips a tab-delimited .bed file. I would like to unzip it and get the original .bed file back in exactly the same format it had prior to gzipping (i.e. just reversing the gzipping). Any advice on how to accomplish this would be appreciated.
import pandas as pd
df = pd.read_csv("example.bed.gz", delimiter=',', header=0)
df.to_csv('exampleziptotxt.bed', index=False)
I only needed to adjust the delimiter from "\t" to ",", and it restores the original format.
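If the goal is literally just to reverse the gzipping, byte for byte, a gzip + shutil sketch avoids pandas (and any delimiter questions) entirely; the restored file name here is an assumption:
import gzip
import shutil

# Decompress example.bed.gz back into an exact copy of the original bytes
with gzip.open("example.bed.gz", "rb") as compressed, open("example_restored.bed", "wb") as restored:
    shutil.copyfileobj(compressed, restored)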

Creating Multiple .txt files from an Excel file with Python Loop

My work is in the process of switching from SAS to Python and I'm trying to get a head start on it. I'm trying to write separate .txt files, each one representing all of the values of an Excel file column.
I've been able to upload my Excel sheet and create a .txt file fine, but for practical use I need to find a way to create a loop that goes through and makes each column into its own .txt file, named "ColumnName.txt".
Uploading Excel Sheet:
import pandas as pd
wb = pd.read_excel('placements.xls')
Creating single .txt file: (Named each column A-Z for easy reference)
with open("A.txt", "w") as f:
for item in wb['A']:
f.write("%s\n" % item)
Trying my hand at a for loop (to no avail):
import glob
for file in glob.glob("*.txt"):
    f = open((file.rsplit(".", 1)[0]) + ".txt", "w")
    f.write("%s\n" % item)
    f.close()
The first portion worked like a charm and gave me a .txt file with all of the relevant data.
When I used the glob command to attempt some iterations, it didn't error out, but it only gave me one output file (A.txt), and the only data point in A.txt is the letter A. I'm sure my inputs are way off... after scrounging around forever, this is what I found that made sense and ran, but I don't think I understand the inputs going into the command, or whether what I'm running is just totally inaccurate.
Any help anyone would give would be much appreciated! I'm sure it's a simple loop, just hard to wrap your head around when you're so new to python programming.
Thanks again!
I suggest using pandas to write the files with to_csv, just changing the extension to .txt:
# Uploading Excel Sheet:
import pandas as pd
df = pd.read_excel('placements.xls')

# Creating one .txt file per column (named A-Z for easy reference):
for col in df.columns:
    print(col)
    # python 3.6+
    df[col].to_csv(f"{col}.txt", index=False, header=False)
    # python below 3.6
    # df[col].to_csv("{}.txt".format(col), index=False, header=False)
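If you would rather keep the open/write pattern from the question, the same loop over the columns works there too; this is just a sketch of that variant, not a claim that it is better than to_csv:
import pandas as pd

wb = pd.read_excel('placements.xls')

for col in wb.columns:
    # one text file per column, named after the column itself
    with open(f"{col}.txt", "w") as f:
        for item in wb[col]:
            f.write("%s\n" % item)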

optimize adding column to CSV file (~300GB) with Python

I want to add a column to a CSV file that is the difference of two other columns of the same file. I use Python (pandas) to do that, and this is what I do:
import pandas as pd

row = ['times1', 'times2']
for df1 in pd.read_csv('C:/SET/parti_no_diff.CSV', skipinitialspace=True, usecols=row, chunksize=10**7):
    df1['time_difference'] = (df1['times2'].astype('datetime64[s]') - df1['times1'].astype('datetime64[s]')).abs()
    df1.to_csv('E:/SET/parti_with_diff_seconds.csv', mode='a')
I use a machine with 12 GB of RAM and an external 2 TB hard disk (5200 RPM); the input is not on the same hard disk as the output. The program takes more than 24 hours; how can I optimize it?
Honestly, Python's built-in functionality for reading and writing text files is well suited to this: read in a single line at a time, add your extra column, then append the line to the output file. It will happen faster than you think, and you can use something like tqdm to monitor progress.
Something like:
import csv
from tqdm import tqdm

with open('myfile.txt', newline='') as f, open('output.csv', 'w', newline='') as outfile:
    reader = csv.reader(f)
    writer = csv.writer(outfile)
    for row in tqdm(reader):
        row.append('new_column')
        writer.writerow(row)
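To actually compute the time difference from the question rather than appending a constant, each row's two timestamp fields have to be parsed; the sketch below assumes the file has a header row containing times1 and times2 and that the timestamps look like 2020-01-01 12:00:00, which may not match the real data:
import csv
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"  # assumed timestamp format

with open('parti_no_diff.CSV', newline='') as f, open('parti_with_diff_seconds.csv', 'w', newline='') as out:
    reader = csv.reader(f)
    writer = csv.writer(out)
    header = next(reader)
    writer.writerow(header + ['time_difference'])
    i1, i2 = header.index('times1'), header.index('times2')
    for row in reader:
        t1 = datetime.strptime(row[i1].strip(), FMT)
        t2 = datetime.strptime(row[i2].strip(), FMT)
        writer.writerow(row + [abs((t2 - t1).total_seconds())])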

Read and write large CSV file in python

I use the following code to read a LARGE CSV file (6-10 GB), insert a header text, and then export it to CSV again.
import csv
import pandas as pd

df = pd.read_csv('read file')
df.columns = ['list of headers']
df.to_csv('outfile', index=False, quoting=csv.QUOTE_NONNUMERIC)
But this methodology is extremely slow and I run out of memory. Any suggestions?
Rather than reading in the whole 6GB file, could you not just add the headers to a new file, and then cat in the rest? Something like this:
import csv
from fileinput import FileInput

columns = ['list of headers']

with open('outfile.csv', 'w', newline='') as outfile:
    writer = csv.writer(outfile, quoting=csv.QUOTE_NONNUMERIC)
    writer.writerow(columns)
    with FileInput(files=('infile.csv',)) as f:
        for line in f:
            outfile.write(line)
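If you would rather stay in pandas, reading in chunks bounds memory use; this is a sketch under the assumption that the input has no header row of its own (which is why new names are assigned):
import csv
import pandas as pd

columns = ['list of headers']

first = True
for chunk in pd.read_csv('infile.csv', header=None, names=columns, chunksize=10**6):
    chunk.to_csv('outfile.csv', mode='w' if first else 'a', header=first,
                 index=False, quoting=csv.QUOTE_NONNUMERIC)
    first = False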
