I've got a huge CSV file (around 10 GB of data) and I want to delete its header.
Searching the web I found this solution:
with open("test.csv",'r') as f, open("updated_test.csv",'w') as f1:
next(f) # skip header line
for line in f:
f1.write(line)
But this implies creating a new CSV file. Is there a way to delete just the header without looping over all the CSV lines?
The point is this: you want to delete a line at the beginning of a file. Done straightforwardly, this means you need to shift the complete contents after the header to the front, which in turn means copying the whole file.
But this is of course way too costly when we are talking about 10 GB files.
In your case I propose to read the first two lines, store their sizes, open the file for reading/writing without creating it (so no truncation takes place), write the second(!) line at the beginning of the file, and pad it with as many spaces as are necessary to overwrite the original first and second lines.
This way you overwrite the first two lines with one very long line which semantically contains only the data of the second line (the first data line) and syntactically contains just some additional trailing spaces (which normally do not hurt in CSV files).
with open('a', 'r+') as f:   # 'r+' opens for reading/writing without truncating
    headers = f.readline()   # first line: the header
    firstData = f.readline() # second line: the first data line
    f.seek(0)                # back to the beginning of the file
    # pad the data line with spaces so it exactly covers the header plus itself
    firstData = firstData[:-1] + ' ' * len(headers) + '\n'
    f.write(firstData)
My input, spaces displayed as dots here:
one.two.three.four.five
1.2.3.4.5
6.7.8.9.10
My output, spaces displayed as dots here:
1.2.3.4.5........................
6.7.8.9.10
Using pandas with header=0 (pandas treats row 0 as the header and does not include it in the data):
import pandas as pd
df = pd.read_csv('yourfile.csv', sep='yoursep', header=0)  # 'yoursep' stands for your separator
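If memory is a concern with a 10 GB file, here is a hedged sketch (filenames assumed, not from the original) that streams the file in chunks; like the loop in the question it writes a new file, but it never holds more than one chunk in memory:

import pandas as pd

# header=0 consumes the header line on the way in;
# header=False omits it on the way out, index=False avoids adding an index column
for chunk in pd.read_csv('yourfile.csv', header=0, chunksize=1_000_000):
    chunk.to_csv('updated_yourfile.csv', mode='a', header=False, index=False)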
I have a super dirty, text-heavy dataset. The column values are tab-separated, but there are many line breaks within a single desired row of data.
All data entries are separated by a hard '\n' written out in the text.
I tried setting the lineterminator argument to '\n', but it still reads the line breaks as new rows.
Any regex or related operation would most likely lose the tab separations, which I need in order to load the data into a dataframe. A word-wise or line-wise pass is also not really feasible given the size of the dataset.
Is there a way to get pandas not to read the line breaks as a new row, and to go to a new line only when it sees the '\n'?
Snapshot of my data: [screenshot: the unprocessed dataset]
Below is a quick look at the current state: [screenshot: current output]
The highlighted red box should be one entry.
You could preprocess to a proper TSV and then read it from there. Use itertools.groupby to find the "\N" endings. If there are other problems with this file, such as internal tabs not being escaped, all bets are off.
import itertools
import re

# matches a record terminator: optional whitespace, a literal \N, end of line
separator_re = re.compile(r"\s*\\N\s*$", re.MULTILINE)

with open('other.csv') as infp:
    with open('other-conv.csv', 'w') as outfp:
        # group consecutive lines by whether they carry the \N terminator
        for hassep, subiter in itertools.groupby(infp, separator_re.search):
            if hassep:
                # complete records: strip the terminator, one record per output line
                outfp.writelines("{}\n".format(separator_re.sub("", line))
                                 for line in subiter)
            else:
                # continuation lines: glue them together without newlines
                for line in subiter:
                    if line.endswith("\\\n"):
                        line = line[:-2] + " "  # trailing backslash: join with a space
                    else:
                        line = line.strip()
                    outfp.write(line)
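Once converted, the file should load as a regular TSV. A minimal sketch, assuming pandas and that the converted file has no header row:

import pandas as pd

# one record per line now, values separated by tabs
df = pd.read_csv('other-conv.csv', sep='\t', header=None)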
I have a text file which contains text in the first 20 or so lines, followed by CSV data. Some of the text in the text section contains commas, so trying csv.reader or csv.DictReader doesn't work well.
I want to skip past the text section and only then start to parse the CSV data.
Searches don't yield much other than instructions to either use csv.reader/csv.DictReader and iterate through the rows that are returned (which doesn't work because of the commas in the text), or to read the file line by line and split the lines using ',' as the delimiter.
The latter works up to a point, but it produces strings, not numbers. I could convert the strings to numbers, but I'm hoping there's a simple way to do this with either the csv or numpy libraries.
As requested - Sample data:
This is the first line. This is all just text to be skipped.
The first line doesn't always have a comma - maybe it's in the third line
Still no commas, or was there?
Yes, there was. And there it is again.
and so on
There are more lines but they finally stop when you get to
EndOfHeader
1,2,3,4,5
8,9,10,11,12
3, 6, 9, 12, 15
Thanks for the help.
Edit #2
A suggested answer gave a link entitled Read file from line 2...
That's kind of what I'm looking for, but I want to read through the lines until I find "EndOfHeader" and then call on the csv library to handle the remainder of the file.
The reply by saimadhu.polamuri is part of what I've tried, specifically:
with open(filename, 'r') as f:
    first_line = f.readline()
    for line in f:
        # test if line equals EndOfHeader; if true, then parse as CSV
But that's where it falls apart: I can't see how to get the csv module to work with the data from this point forward.
With thanks to @Mike for the suggestion, the code is actually reasonably straightforward.
import csv

with open('data.csv') as f:            # open the file
    for i in range(7):                 # loop over the first 7 lines
        _ = f.readline()               # just read them; next(f) would also work
    r = csv.reader(f, delimiter=',')   # now pass the file handle to a csv reader
    for row in r:                      # and loop over the resulting rows
        print(row)                     # print the row, or do something else
In my actual code, it will search for the EndOfHeader line and use that to decide where to start parsing the CSV; a sketch of that is below.
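A minimal sketch of that search, assuming the marker line is exactly "EndOfHeader" and that every field after it is numeric:

import csv

with open('data.csv') as f:
    for line in f:                        # consume lines up to and including the marker
        if line.strip() == 'EndOfHeader':
            break
    reader = csv.reader(f)                # the csv reader picks up right after the marker
    rows = [[float(x) for x in row] for row in reader]  # convert the strings to numbers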
I'm posting this as an answer, as the question that this one supposedly duplicates doesn't explicitly consider this issue of the file handle and how it can be passed to a CSV reader, and so it may help someone else.
Thanks to all who took time to help.
I have a dataframe, which can be downloaded here. The first column contains a question while the second column contains an answer to that question.
My aim: to create two .txt files, one that contains the questions and one that contains the answers.
Each question and answer should be written on an individual row, so that row 50 in each .txt file contains the 50th question and the 50th answer (i.e. if the files are recombined, the question/answer pairs match up).
The code snippet below opens a text file, writes each row of the column to that file, and removes any \n. It seems to work for about 96% of the rows, but very rarely it writes a single DataFrame row across multiple text lines.
These rare events don't seem to have any defining characteristics; they are not extremely long, etc. For the file attached above, the first one occurs at text file line 395 in the answers column.
f = open("Answers.txt","a", newline="\n",encoding='utf-8')
for i in tqdm(data['answers_body']):
line = i.replace('\n','')
f.write(line)
f.write("\n")
Interestingly, if I remove the f.write and just print to the console, it seems to work as expected; the issue only occurs during the write process.
Update: the full version, which results in 1001 lines:
import csv

data = []
with open('SO_dataset.csv', newline='') as csvfile:  # text mode with newline='' for the csv module
    spamreader = csv.reader(csvfile)
    for row in spamreader:
        print(', '.join(row))
        data.append(row[2] if len(row) > 2 else '')

f = open("Answers.txt", "w")
i = 0
for line in data:
    i += 1
    line = line.replace('\n', ' ')
    f.write(str(i) + '. ' + line)
    f.write("\n")
f.close()  # close() needs the parentheses, otherwise the file is never closed
Actually, your original code seems fine. If what you are seeing is the txt file breaking your lines and wrapping them to the next line, that's a display property of Notepad; if you open the file in Word or Excel, the lines should appear unbroken.
It is because it reads the line, and that's why it prints the line; but when you write it to the file, it is written on the same line.
You have to add a newline to the line so the next one starts on a new line.
For simplicity you can use file.write(line + '\n').
I would suggest using print(line, file=f) instead; optionally the line ending can be set with end="some sign" if you want.
EDIT
Sorry for writing it in such a complicated way: print also has the ability to "write" into files, and it offers an option for an alternative ending (see above). For your case it would be:
f = open("Answers.txt","a", newline="\n",encoding='utf-8')
for i in tqdm(data['answers_body']):
line = i.replace('\n','')
print(line, file=f)
f.close()
If wanted or needed for other cases: with print(line, file=f, end='\t'), a tab instead of a newline is the last character, and the next print() continues right after the tab.
I have the following text in a csv file:
b'DataMart\n\nDate/Time Generated,11/7/16 8:54 PM\nReport Time Zone,America/New_York\nAccount ID,8967\nDate Range,10/8/16 - 11/6/16\n\nReport Fields\nSite (DCM),Creative\nGlobest.com,2016-08_CB_018_1040x320_Globe St_16_PropertyFilter\nGlobest.com,2016-08_CB_018_1040x320_Globe St_16_PropertyFilter'
Essentially there are multiple newline characters in this file instead of it being a single big string, so you can picture the same text as follows:
DataMart
Date/Time Generated,11/7/16 8:54 PM
Report Time Zone,America/New_York
Account ID,8967
Date Range,10/8/16 - 11/6/16
Report Fields
Site (DCM),Creative
Globest.com,2016-08_CB_018_1040x320_Globe St_16_PropertyFilter
Globest.com,2016-08_CB_018_1040x320_Globe St_16_PropertyFilter
I need to grab the last two lines, which are basically the data. I tried a for loop:
with open('file.csv', 'r') as f:
    for line in f:
        print(line)
It instead prints the entire content back again, with the \n sequences still in it.
Just read the file and get the last two lines:
my_file = open("/path/to/file").read()
print(my_file.splitlines()[-2:])
The [-2:] is known as slicing: it creates a slice starting at the second-to-last element and running to the end.
OK, after struggling for a bit, I found out that I needed to change the decoding of the file from binary to 'utf-8'; then I could apply the split functions. The problem was that the split functions are not applicable to the binary file.
This is the actual code that seems to be working for me now:
with open('BinaryFile.csv', 'rb') as f1:
    data = f1.read()
    text = data.decode('utf-8')

with open('TextFile.csv', 'w') as f2:
    f2.write(text)

with open('TextFile.csv', 'r') as f3:
    for line in f3:
        print(line.split('\\n')[9:])  # split on the literal "\n" sequences; the data starts at item 9
Thanks for your help, guys.
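For what it's worth, the intermediate TextFile.csv is not strictly necessary; a minimal sketch of the same idea done in memory, assuming (as in the split above) that the file contains literal \n sequences:

# decode once, split on the literal "\n" sequences, keep the last two data lines
with open('BinaryFile.csv', 'rb') as f:
    text = f.read().decode('utf-8')

lines = text.split('\\n')
print(lines[-2:])  # the last two entries are the Globest.com data rows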
I have a text file and would like to replace certain elements, which are "NaN".
I have usually used the file.replace function to change NaNs to a certain number throughout the entire text file.
Now I would like to replace the NaNs with a certain number in only the first line of the text file, not the whole text.
Would you give me a hint for this problem?
You can just read the whole file, call .replace() on the first line, and write the lines to a new file:
with open('in.txt') as fin:
    lines = fin.readlines()

lines[0] = lines[0].replace('old_value', 'new_value')

with open('out.txt', 'w') as fout:
    for line in lines:
        fout.write(line)
If your file isn't really big, you can just use .join():
with open('out.txt', 'w') as fout:
    fout.write(''.join(lines))
And if it is really big, you would probably be better off reading and writing lines simultaneously, as sketched below.
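A minimal sketch of that streaming variant, with the same placeholder values:

# replace only in the first line, then copy the rest through unchanged
with open('in.txt') as fin, open('out.txt', 'w') as fout:
    first = fin.readline()
    fout.write(first.replace('old_value', 'new_value'))
    for line in fin:
        fout.write(line)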
You can hack this provided you accept a few constraints. The replacement string needs to be of equal length to the original string. If the replacement string is shorter than the original, pad it with spaces to make it of equal length (this only works if extra spaces in your data are acceptable). If the replacement string is longer than the original, you cannot do the replacement in place and need to follow Harold's answer.
with open('your_file.txt', 'r+') as f:
    line = next(f)  # grab first line
    old = 'NaN'
    new = '0  '     # padded with two spaces to make it the same length as 'NaN'
    f.seek(0)       # move file pointer to beginning of file
    f.write(line.replace(old, new))
This will be fast on any length file.
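If you would rather have the padding computed for you, here is a hedged variant of the same trick (replacement value '0' assumed, and the first line is assumed to end with a newline) that pads automatically and checks the length constraint:

with open('your_file.txt', 'r+') as f:
    line = next(f)                                    # grab the first line
    new_line = line.rstrip('\n').replace('NaN', '0')
    new_line = new_line.ljust(len(line) - 1) + '\n'   # pad so the byte length is unchanged
    assert len(new_line) == len(line), "replacement made the line longer"
    f.seek(0)
    f.write(new_line)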