How to extract one column to another file from a 300GB file - python

The problem is the huge amount of data, and I have to do it on my personal laptop with 12GB of RAM. I tried a loop that reads about 1M lines every round and used csv.writer, but csv.writer wrote roughly 1M lines every two hours. So, are there any other ways worth trying?
lines = 10000000
for i in range(0, 330):
    list_str = []
    with open(file, 'r') as f:
        line_flag = 0
        for _ in range(i * lines):
            next(f)
        for line in f:
            line_flag = line_flag + 1
            data = json.loads(line)['name']
            if data != former_str:
                list_str.append(data)
                former_str = data
            if line_flag == lines:
                break
    with open(self.path + 'data_range\\names.csv', 'a', newline='') as writeFile:
        writer = csv.writer(writeFile, delimiter='\n')
        writer.writerow(list_str)
        writeFile.close()
another version
def read_large_file(f):
    block_size = 200000000
    block = []
    for line in f:
        block.append(line[:-1])
        if len(block) == block_size:
            yield block
            block = []
    if block:
        yield block

def split_files():
    with open(write_file, 'r') as f:
        i = 0
        for block in read_large_file(f):
            print(i)
            file_name = write_name + str(i) + '.csv'
            with open(file_name, 'w', newline='') as f_:
                writer = csv.writer(f_, delimiter='\n')
                writer.writerow(block)
            i += 1
This was after it had read a block and was writing ... I wonder why the rate of data transmission stayed at about 0.

It should be as simple as this:
import json
import csv

with open(read_file, 'rt') as r, open(write_file, 'wt', newline='') as w:
    writer = csv.writer(w)
    for line in r:
        writer.writerow([json.loads(line)['name']])
I tried putting the loop inside the with block, but I always get an error. I guessed we cannot write the data into another file while the first file is still open?
You totally can write data in one file while reading another. I can't tell you more about your error until you post what it said, though.
There was a bit in your code about former_str which is not covered under "extract one column", so I did not write anything about it.
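To illustrate the point, here is a self-contained sketch of that pattern (the file names and sample data are hypothetical): it builds a tiny .jsonl file, then reads it line by line while writing the extracted column to a second, separate file.

```python
import csv
import json

# Build a tiny sample .jsonl file (hypothetical data, purely for illustration).
with open("sample.jsonl", "w") as f:
    for name in ["alice", "bob", "carol"]:
        f.write(json.dumps({"name": name}) + "\n")

# Reading one file while writing another is fine: the two handles are
# completely independent, so no error should come from this alone.
with open("sample.jsonl", "r") as r, open("names.csv", "w", newline="") as w:
    writer = csv.writer(w)
    for line in r:
        writer.writerow([json.loads(line)["name"]])
```

If this pattern raises an error in your setup, the cause is likely elsewhere, for example both handles pointing at the same path.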

Would something like this work?
Essentially using a generator to avoid reading the entire file in memory, and writing the data one line at a time.
import jsonlines  # pip install jsonlines
from typing import Generator

def gen_lines(file_path: str, col_name: str) -> Generator[str, None, None]:
    with jsonlines.open(file_path) as f:
        for obj in f:
            yield obj[col_name]

# Here you could also switch back to writing jsonlines
with open(output_file, "w") as out:
    for item in gen_lines(your_file_path, col_name_to_extract):
        out.write(f"{item}\n")

Related

Update Txt file in python

I have a text file with names and results. If the name already exists, only the result should be updated. I tried with this code and many others, but without success.
The content of the text file looks like this:
Ann, 200
Buddy, 10
Mark, 180
Luis, 100
PS: I started 2 weeks ago, so don't judge my bad code.
from os import rename

def updatescore(username, score):
    file = open("mynewscores.txt", "r")
    new_file = open("mynewscores2.txt", "w")
    for line in file:
        if username in line:
            splitted = line.split(",")
            splitted[1] = score
            joined = "".join(splitted)
            new_file.write(joined)
        new_file.write(line)
    file.close()
    new_file.close()

maks = updatescore("Buddy", "200")
print(maks)
I would suggest reading the csv in as a dictionary and just update the one value.
import csv

d = {}
with open('test.txt', newline='') as f:
    reader = csv.reader(f)
    for row in reader:
        key, value = row
        d[key] = value

d['Buddy'] = 200

with open('test2.txt', 'w', newline='') as f:
    writer = csv.writer(f)
    for key, value in d.items():
        writer.writerow([key, value])
What mostly needed to change is that your for loop writes line to the new text file every time; it is never told not to do that when it should be replacing a score. All that was needed was an else statement below the if statement:
from os import rename

def updatescore(username, score):
    file = open("mynewscores.txt", "r")
    new_file = open("mynewscores2.txt", "w")
    for line in file:
        if username in line:
            splitted = line.split(",")
            splitted[1] = score
            print(splitted)
            joined = ", ".join(splitted)
            print(joined)
            new_file.write(joined + '\n')
        else:
            new_file.write(line)
    file.close()
    new_file.close()

maks = updatescore("Buddy", "200")
print(maks)
You can try this, add the username if it doesn't exist, else update it.
def updatescore(username, score):
    with open("mynewscores.txt", "r+") as file:
        line = file.readline()
        while line:
            if username in line:
                file.seek(file.tell() - len(line))
                file.write(f"{username}, {score}")
                return
            line = file.readline()
        file.write(f"\n{username}, {score}")

maks = updatescore("Buddy", "300")
maks = updatescore("Mario", "50")
You have new_file.write(joined) inside the if block, which is good, but you also have new_file.write(line) outside the if block.
Outside the if block, it's putting both the original and fixed lines into the file, and since you're using write() instead of writelines() both versions get put on the same line: there's no \n newline character.
You also want to add the comma: joined = ','.join(splitted) since you took the commas out when you used line.split(',')
I got the result you seem to be expecting when I put in both these fixes.
Next time you should include what you are expecting for output and what you're giving as input. It might be helpful if you also include what Error or result you actually got.
Welcome to Python BTW
Removed issues from your code:
def updatescore(username, score):
    file = open("mynewscores.txt", "r")
    new_file = open("mynewscores2.txt", "w")
    for line in file.readlines():
        splitted = line.split(",")
        if username == splitted[0].strip():
            splitted[1] = " " + str(score) + "\n"  # keep the newline so following lines stay separate
            joined = ",".join(splitted)
            new_file.write(joined)
        else:
            new_file.write(line)
    file.close()
    new_file.close()
I believe this is the simplest/most straightforward way of doing things.
Code:
import csv

def update_score(name: str, score: int) -> None:
    with open('../resources/name_data.csv', newline='') as file_obj:
        reader = csv.reader(file_obj)
        data_dict = dict(curr_row for curr_row in reader)
    data_dict[name] = score
    with open('../out/name_data_out.csv', 'w', newline='') as file_obj:
        writer = csv.writer(file_obj)
        writer.writerows(data_dict.items())

update_score('Buddy', 200)
Input file:
Ann,200
Buddy,10
Mark,180
Luis,100
Output file:
Ann,200
Buddy,200
Mark,180
Luis,100

How to read a CSV and adapt + write every row to another CSV?

I tried this but it just writes "lagerungskissen kleinkind,44" several times instead of transferring every row.
keyword = []
rank = []
rank = list(map(int, rank))
data = []

with open("keywords.csv", "r") as file:
    for line in file:
        data = line.strip().replace('"', '').split(",")
        keyword = data[0]
        rank = data[3]

import csv

with open("mynew.csv", "w", newline="") as f:
    thewriter = csv.writer(f)
    thewriter.writerow(["Keyword", "Rank"])
    for row in keyword:
        thewriter.writerow([keyword, rank])
It should look like this
This is writing the same line in your output CSV because the final block is
for row in keyword:
    thewriter.writerow([keyword, rank])
Note that the keyword variable doesn't change in the loop, but the row does. You're writing that same [keyword, rank] line len(keyword) times.
I would use the csv package to do the reading and the writing for this. Something like
import csv

input_file = '../keywords.csv'
output_file = '../mynew.csv'

# open the files
fIn = open(input_file, 'r', newline='')
fOut = open(output_file, 'w', newline='')
csvIn = csv.reader(fIn, quotechar='"')  # check the keyword args in the docs!
csvOut = csv.writer(fOut)

# write a header, then write each row one at a time
csvOut.writerow(['Keyword', 'Rank'])
for row in csvIn:
    keyword = row[0]
    rank = row[3]
    csvOut.writerow([keyword, rank])

# and close the files
fOut.close()
fIn.close()
As a side note, you could write the above using the with context manager (e.g. with open(...) as file:). The answer here shows how to do it with multiple files (in this case fIn and fOut).
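A sketch of that with-based variant (assuming the same layout as the question: keyword in column 0, rank in column 3; the sample input is fabricated here so the snippet runs on its own):

```python
import csv

# Fabricated sample input with 4 columns (keyword first, rank last).
with open('keywords.csv', 'w', newline='') as f:
    csv.writer(f).writerows([
        ['lagerungskissen kleinkind', 'x', 'y', '44'],
        ['some other keyword', 'x', 'y', '7'],
    ])

# with closes both files automatically, even if an exception is raised mid-loop.
with open('keywords.csv', 'r', newline='') as f_in, \
        open('mynew.csv', 'w', newline='') as f_out:
    csv_in = csv.reader(f_in, quotechar='"')
    csv_out = csv.writer(f_out)
    csv_out.writerow(['Keyword', 'Rank'])
    for row in csv_in:
        csv_out.writerow([row[0], row[3]])
```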

Trying to read files named file1,file2,file3 using for loop in Python

I am pretty new to Python and trying to run a script to edit CSV files. The problem I am facing is that I need to split the CSV files into smaller pieces (as they are large files and I am getting memory errors) and then run another script to edit them. But when I append these two scripts together and run the test, the script reads only the first small file and not the rest.
For example: when I split the main CSV file, the pieces are named big-1.csv, big-2.csv, and so on. Then when the script picks up the files to edit, only big-1.csv gets edited and the rest do not.
The script is:
import csv
from csv import DictWriter

divisor = 990
outfileno = 1
outfile = None

with open('MOCK_DATA.csv', 'r', newline='') as infile:
    infile_iter = csv.reader(infile, delimiter='\t')
    header = next(infile_iter)
    for index, row in enumerate(infile_iter):
        if index % divisor == 0:
            if outfile:
                outfile.close()
            outfilename = 'big-{}.csv'.format(outfileno)
            outfile = open(outfilename, 'w', newline='')
            outfileno += 1
            writer = csv.writer(outfile, delimiter='\t', quoting=csv.QUOTE_NONE)
            writer.writerow(header)
        writer.writerow(row)

# Don't forget to close the last file
if outfile:
    outfile.close()

# export the data with correct quoting
for i in range(1, 2):
    with open("big-" + str(i) + ".csv") as people_file:
        next(people_file)
        corrected_people = []
        for person_line in people_file:
            chomped_person_line = person_line.rstrip()
            person_tokens = chomped_person_line.split(",")
            # check that each field has the expected type
            try:
                corrected_person = {
                    "id": person_tokens[0],
                    "first_name": person_tokens[1],
                    "last_name": "".join(person_tokens[2:-3]),
                    "email": person_tokens[-3],
                    "gender": person_tokens[-2],
                    "ip_address": person_tokens[-1]
                }
                if not corrected_person["ip_address"].startswith(
                        "") and corrected_person["ip_address"] != "n/a":
                    raise ValueError
                corrected_people.append(corrected_person)
            except (IndexError, ValueError):
                # print the ignored lines, so manual correction can be performed later.
                print("Could not parse line: " + chomped_person_line)
    with open("fix-" + str(i) + ".csv", "w") as corrected_people_file:
        writer = DictWriter(
            corrected_people_file,
            fieldnames=[
                "id", "first_name", "last_name", "email", "gender", "ip_address"
            ], delimiter=',')
        writer.writeheader()
        writer.writerows(corrected_people)
I think this may be an issue with reading the smaller files in the for loop. The script runs without any error. Please help.
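One thing worth checking: the symptom (only big-1.csv being edited) matches the loop header for i in range(1, 2), which yields only i = 1. A small sketch of the difference, reusing the outfileno counter that the splitting half already maintains:

```python
# After the splitter finishes, outfileno is one past the last file number,
# so the files that exist are big-1.csv .. big-(outfileno - 1).csv.
outfileno = 4  # hypothetical: the splitter wrote big-1.csv, big-2.csv, big-3.csv

only_first = ["big-{}.csv".format(i) for i in range(1, 2)]         # what the script visits now
all_parts = ["big-{}.csv".format(i) for i in range(1, outfileno)]  # what it should visit

print(only_first)
print(all_parts)
```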

Saving text file in a for loop

I'm trying to loop through a file, strip the sentences into individual lines, and then export that data.
filename = '00000BF8_ar.txt'
with open(filename, mode="r") as outfile:
    str_output = outfile.readlines()

str_output = ''.join(str_output)
sentenceSplit = filter(None, str_output.split("."))

for s in sentenceSplit:
    print(s.strip() + ".")
    #output += s
    myfile = open(filename, 'w')
    myfile.writelines(s)
    myfile.close()
Unfortunately, it looks like the loop only goes through a few lines and saves them. So the whole file isn't looped through and saved. Any help on how I can fix that?
Here is the code; I hope this is what you want to achieve:
filename = '00000BF8_ar.txt'
with open(filename, mode="r") as outfile:
    str_output = outfile.readlines()

str_output = ''.join(str_output)
sentenceSplit = filter(None, str_output.split("."))

l = []
for s in sentenceSplit:
    l.append(s.strip() + ".")

myfile = open(filename, 'w')
myfile.write('\n'.join(l))
myfile.close()
Each time you re-open the file with the 'w' option, you basically erase its content.
Try modifying your code like this:
filename = '00000BF8_ar.txt'
with open(filename, "r") as infile:
    str_output = infile.readlines()

str_output = ''.join(str_output)
sentenceSplit = filter(None, str_output.split("."))

with open(filename, "w") as outfile:
    for s in sentenceSplit:
        print(s.strip() + ".")
        #output += s
        outfile.writelines(s)
Another way to achieve the same thing would have been to open a new file using open(filename_new, 'a'), which opens the file for appending. But as a rule of thumb, try not to open/close files inside a loop.
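A minimal sketch of that append variant, with a hypothetical filename_new and fabricated sentences so it runs standalone:

```python
filename_new = 'sentences_out.txt'  # hypothetical output name

sentences = ["First sentence", " second sentence", " third sentence"]

# Truncate once up front, then append inside the loop. 'a' keeps earlier
# writes intact, but re-opening per sentence still pays the open/close cost
# every iteration, which is why a single 'w' handle around the loop is nicer.
open(filename_new, 'w').close()
for s in sentences:
    with open(filename_new, 'a') as out:
        out.write(s.strip() + ".\n")
```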
open(filename, 'w') will overwrite the file every time it starts. My guess is that what's currently happening is that only the last element in sentenceSplit is showing up in myfile.
The simple "solution" is to use append instead of write:
open(filename, 'a')
which will simply start writing at the end of the file, without deleting the rest of it.
However, as @chepner's comment states, why are you reopening the file at all? I would recommend changing your code to this:
with open(filename, mode="r") as outfile:
    str_output = outfile.readlines()

str_output = ''.join(str_output)
sentenceSplit = filter(None, str_output.split("."))

with open(filename, mode='w') as myfile:
    for s in sentenceSplit:
        print(s.strip() + ".")
        myfile.writelines(s)
This way, instead of opening it many times, and overwriting it every time, you're only opening it once and just writing to it continuously.

How to change weird CSV delimiter?

I have a CSV file that I can't open in Excel.
The CSV delimiter is |~|, and at the end of a row it is |~~|.
I have some sample data:
Education|~|Name_Dutch|~|Name_English|~|Faculty|~~|International Business|~|MB|~|MB|~|ED|~~|
Where the Header part is: Education|~|Name_Dutch|~|Name_English|~|Faculty|~~|
And the Data/Row part is: International Business|~|MB|~|MB|~|ED|~~|
I need to find out how to change this CSV file into a normal comma-separated file using a Python script.
You can use the built-in csv module plus str.split():
import csv

content = """Education|~|Name_Dutch|~|Name_English|~|Faculty|~~|International Business|~|MB|~|MB|~|ED|~~|"""
# Or read its content from a file

with open('output.csv', 'w+') as f:
    writer = csv.writer(f)
    lines = content.split('|~~|')
    for line in lines:
        csv_row = line.split('|~|')
        writer.writerow(csv_row)
It will output a file named output.csv:
Education,Name_Dutch,Name_English,Faculty
International Business,MB,MB,ED
""
When dealing with a CSV file, I would prefer using the csv module instead of doing .replace('|~|', ',') because the csv module has built-in support for special characters such as a literal , inside a field.
The custom delimiters you mention seem unique enough that you can just do a string replace on them, then write out the file.
The read and write section has all the details you need.
https://docs.python.org/2/tutorial/inputoutput.html
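A sketch of that replace-based approach, using the sample row from the question (this assumes |~| and |~~| never occur inside field values, and that no field needs CSV quoting):

```python
content = ("Education|~|Name_Dutch|~|Name_English|~|Faculty|~~|"
           "International Business|~|MB|~|MB|~|ED|~~|")

# Replace the row terminator first, then the field separator.
fixed = content.replace('|~~|', '\n').replace('|~|', ',')
print(fixed)
```

To run it over a real file, read the file's contents into content and write fixed back out.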
import csv

in_name = 'your_input_name.csv'
outname = 'your_output_name.csv'

with open(in_name, newline='') as csvfile:
    csvreader = csv.reader(csvfile, delimiter='~', quotechar='|')
    with open(outname, "w", newline='') as outfile:
        csvwriter = csv.writer(outfile, quotechar='"', quoting=csv.QUOTE_NONNUMERIC)
        for row in csvreader:
            line = []
            for item in row:
                if item != "":
                    line.append(item)
                else:
                    csvwriter.writerow(line)
                    line = []
As csv.reader doesn't recognize "~~" as an end of line, it converts it to "", so for csv.writer we keep accumulating the items of the list (obtained from csv.reader) until that "" is reached.
If the file is small, you can simply read its entire contents into memory and replace all weird delimiters found and then write a new version of it back out.
However, if the file is large or you just want to conserve memory, it's also possible to read the file incrementally, a single character at a time, and still accomplish what needs to be done.
The csvfile argument to the csv.reader constructor "can be any object which supports the iterator protocol and returns a string each time its next() method is called."
This means the "object" can be a generator function or a generator expression. In the code below I've implemented a simple FSM (Finite State Machine) to parse the oddly formatted file and yield each line of output it detects. It may seem like a lot of code, but it operates very simply, so it should be relatively easy to understand how it works:
import csv
def weird_file_reader(filename):
    """Generator that opens and produces "lines" read from the file while
    translating the sequences of '|~|' to ',' and '|~~|' to '\n' (newlines).
    """
    state = 0
    line = []
    with open(filename, 'r') as weird_file:
        while True:
            ch = weird_file.read(1)  # read one character
            if not ch:  # end-of-file?
                if line:  # partial line read?
                    yield ''.join(line)
                break
            if state == 0:
                if ch == '|':
                    state = 1
                else:
                    line.append(ch)
                    #state = 0  # unnecessary
            elif state == 1:
                if ch == '~':
                    state = 2
                else:
                    line.append('|' + ch)
                    state = 0
            elif state == 2:
                if ch == '|':
                    line.append(',')
                    state = 0
                elif ch == '~':
                    state = 3
                else:
                    line.append('|~' + ch)
                    state = 0
            elif state == 3:
                if ch == '|':
                    line.append('\n')
                    yield ''.join(line)
                    line = []
                    state = 0
                else:
                    line.append('|~~' + ch)
                    state = 0
            else:
                raise RuntimeError("Can't happen")

with open('fixed.csv', 'w', newline='') as outfile:
    reader = csv.reader((line for line in weird_file_reader('weird.csv')))
    writer = csv.writer(outfile)
    writer.writerows(reader)

print('done')
