I have a CSV file that I can't open in Excel.
The CSV delimiter is |~|, and at the end of a row it is |~~|.
I have some sample data:
Education|~|Name_Dutch|~|Name_English|~|Faculty|~~|International Business|~|MB|~|MB|~|ED|~~|
Where the Header part is: Education|~|Name_Dutch|~|Name_English|~|Faculty|~~|
And the Data/Row part is: International Business|~|MB|~|MB|~|ED|~~|
I need to find out how to convert this CSV file into a normal, comma-separated file using a Python script.
You can use the built-in csv module together with str.split():
import csv

content = """Education|~|Name_Dutch|~|Name_English|~|Faculty|~~|International Business|~|MB|~|MB|~|ED|~~|"""
# Or read its content from a file

with open('output.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    lines = content.split('|~~|')
    for line in lines:
        if not line:  # the trailing '|~~|' leaves an empty chunk; skip it
            continue
        csv_row = line.split('|~|')
        writer.writerow(csv_row)
It will output a file named output.csv containing:
Education,Name_Dutch,Name_English,Faculty
International Business,MB,MB,ED
When dealing with a csv file, I would prefer using the csv module instead of doing .replace('|~|', ',') because the csv module has built-in support for special characters such as a comma embedded in a field.
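For example (a minimal sketch; the field value is made up), a field that itself contains a comma is corrupted by a plain replace but quoted correctly by csv.writer:

import csv
import io

# Hypothetical row whose first field contains a comma
row_text = 'Maastricht, NL|~|MB|~|MB|~|ED'

# Naive replace: the embedded comma now looks like a delimiter (5 fields)
print(row_text.replace('|~|', ','))   # Maastricht, NL,MB,MB,ED

# csv.writer quotes the field so it survives a round trip
buf = io.StringIO()
csv.writer(buf).writerow(row_text.split('|~|'))
print(buf.getvalue())                 # "Maastricht, NL",MB,MB,ED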
The custom delimiters you mention seem unique enough that you can just do a string replace on them and then write out the file.
The reading-and-writing-files section of the tutorial has all the details you need:
https://docs.python.org/2/tutorial/inputoutput.html
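A minimal sketch of that approach (the file names are placeholders):

# Read the whole file, swap the custom delimiters, then write it back out.
with open('input.csv') as infile:
    content = infile.read()

# Replace the row terminator first, then the field delimiter
content = content.replace('|~~|', '\n').replace('|~|', ',')

with open('output.csv', 'w') as outfile:
    outfile.write(content)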
import csv

in_name = 'your_input_name.csv'
outname = 'your_output_name.csv'

with open(in_name, newline='') as csvfile:
    csvreader = csv.reader(csvfile, delimiter='~', quotechar='|')
    with open(outname, "w", newline='') as outfile:
        csvwriter = csv.writer(outfile, quotechar='"', quoting=csv.QUOTE_NONNUMERIC)
        for row in csvreader:
            line = []
            for item in row:
                if item != "":
                    line.append(item)
                else:
                    csvwriter.writerow(line)
                    line = []
As csv.reader doesn't recognize "~~" as an end-of-line marker, it yields an empty string "" at each row boundary; so we keep accumulating the items obtained from csv.reader into a list and flush that list to csv.writer each time "" is reached.
If the file is small, you can simply read its entire contents into memory, replace all the weird delimiters found, and then write a new version of it back out (a sketch of that follows).
However, if the file is large or you just want to conserve memory, it's also possible to read the file incrementally, a single character at a time, and accomplish what needs to be done.
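For the small-file case, that might look like this sketch (file names assumed from the code further down):

import csv

# Whole-file variant: read everything, translate the delimiters, then let
# the csv module re-quote the output properly.
with open('weird.csv') as infile:
    text = infile.read().replace('|~~|', '\n').replace('|~|', ',')

with open('fixed.csv', 'w', newline='') as outfile:
    csv.writer(outfile).writerows(csv.reader(text.splitlines()))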
The csvfile argument to the csv.reader constructor "can be any object which supports the iterator protocol and returns a string each time its next() method is called."
This means the "object" can be a generator function or a generator expression. In the code below I've implemented a simple FSM (Finite State Machine) to parse the oddly formatted file and yield each line of output it detects. It may seem like a lot of code, but it operates very simply, so it should be relatively easy to understand how it works:
import csv

def weird_file_reader(filename):
    """Generator that opens and produces "lines" read from the file while
    translating the sequences of '|~|' to ',' and '|~~|' to '\n' (newlines).
    """
    state = 0
    line = []
    with open(filename, 'r') as weird_file:  # text mode so ch is a str, not bytes
        while True:
            ch = weird_file.read(1)  # read one character
            if not ch:  # end-of-file?
                if line:  # partial line read?
                    yield ''.join(line)
                break
            if state == 0:
                if ch == '|':
                    state = 1
                else:
                    line.append(ch)
                    #state = 0  # unnecessary
            elif state == 1:
                if ch == '~':
                    state = 2
                else:
                    line.append('|' + ch)
                    state = 0
            elif state == 2:
                if ch == '|':
                    line.append(',')
                    state = 0
                elif ch == '~':
                    state = 3
                else:
                    line.append('|~' + ch)
                    state = 0
            elif state == 3:
                if ch == '|':
                    line.append('\n')
                    yield ''.join(line)
                    line = []
                    state = 0
                else:
                    line.append('|~~' + ch)
                    state = 0
            else:
                raise RuntimeError("Can't happen")

with open('fixed.csv', 'w', newline='') as outfile:
    reader = csv.reader(weird_file_reader('weird.csv'))
    writer = csv.writer(outfile)
    writer.writerows(reader)

print('done')
I have a text file with names and results. If a name already exists, only its result should be updated. I tried this code and many others, but without success.
The content of the text file looks like this:
Ann, 200
Buddy, 10
Mark, 180
Luis, 100
PS: I started 2 weeks ago, so don't judge my bad code.
from os import rename

def updatescore(username, score):
    file = open("mynewscores.txt", "r")
    new_file = open("mynewscores2.txt", "w")
    for line in file:
        if username in line:
            splitted = line.split(",")
            splitted[1] = score
            joined = "".join(splitted)
            new_file.write(joined)
        new_file.write(line)
    file.close()
    new_file.close()

maks = updatescore("Buddy", "200")
print(maks)
I would suggest reading the csv in as a dictionary and just updating the one value.
import csv

d = {}
with open('test.txt', newline='') as f:
    reader = csv.reader(f)
    for row in reader:
        key, value = row
        d[key] = value

d['Buddy'] = 200

with open('test2.txt', 'w', newline='') as f:
    writer = csv.writer(f)
    for key, value in d.items():
        writer.writerow([key, value])
What mostly needed to change is this: your for loop wrote line to the new text file unconditionally, and nothing told it not to do that when a score was being replaced. All that was needed was an else statement below the if statement:
from os import rename

def updatescore(username, score):
    file = open("mynewscores.txt", "r")
    new_file = open("mynewscores2.txt", "w")
    for line in file:
        if username in line:
            splitted = line.split(",")
            splitted[1] = score
            print(splitted)
            joined = ", ".join(splitted)
            print(joined)
            new_file.write(joined + '\n')
        else:
            new_file.write(line)
    file.close()
    new_file.close()

maks = updatescore("Buddy", "200")
print(maks)
You can try this: add the username if it doesn't exist, otherwise update it.
def updatescore(username, score):
    with open("mynewscores.txt", "r+") as file:
        line = file.readline()
        while line:
            if username in line:
                file.seek(file.tell() - len(line))
                file.write(f"{username}, {score}")
                return
            line = file.readline()
        file.write(f"\n{username}, {score}")

maks = updatescore("Buddy", "300")
maks = updatescore("Mario", "50")
You have new_file.write(joined) inside the if block, which is good, but you also have new_file.write(line) outside the if block.
Outside the if block, it's putting both the original and the fixed line into the file, and since the fixed line lost its \n when you replaced splitted[1], both versions end up on the same line.
You also want to add the comma: joined = ','.join(splitted) since you took the commas out when you used line.split(',')
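Applied to your loop, a sketch of the two fixes together:

for line in file:
    if username in line:
        splitted = line.split(",")
        splitted[1] = score
        joined = ",".join(splitted)    # put the commas back
        new_file.write(joined + "\n")  # replacing splitted[1] dropped the newline
    else:
        new_file.write(line)           # copy untouched lines only once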
I got the result you seem to be expecting when I put in both these fixes.
Next time you should include what you are expecting for output and what you're giving as input. It might be helpful if you also include what error or result you actually got.
Welcome to Python BTW
Here is your code with the issues removed:
def updatescore(username, score):
    file = open("mynewscores.txt", "r")
    new_file = open("mynewscores2.txt", "w")
    for line in file.readlines():
        splitted = line.split(",")
        if username == splitted[0].strip():
            splitted[1] = str(score)
            joined = ",".join(splitted)
            new_file.write(joined + "\n")  # the newline was lost when splitted[1] was replaced
        else:
            new_file.write(line)
    file.close()
    new_file.close()
I believe this is the simplest/most straightforward way of doing things.
Code:
import csv

def update_score(name: str, score: int) -> None:
    with open('../resources/name_data.csv', newline='') as file_obj:
        reader = csv.reader(file_obj)
        data_dict = dict(curr_row for curr_row in reader)
    data_dict[name] = score
    with open('../out/name_data_out.csv', 'w', newline='') as file_obj:
        writer = csv.writer(file_obj)
        writer.writerows(data_dict.items())

update_score('Buddy', 200)
Input file:
Ann,200
Buddy,10
Mark,180
Luis,100
Output file:
Ann,200
Buddy,200
Mark,180
Luis,100
The problem is the huge amount of data, and I have to process it on my personal laptop with 12 GB of RAM. I tried a loop that handles 1M lines every round and used csv.writer. But csv.writer wrote only about 1M lines every two hours. So, are there any other ways worth trying?
lines = 10000000
for i in range(0, 330):
    list_str = []
    with open(file, 'r') as f:
        line_flag = 0
        for _ in range(i * lines):
            next(f)
        for line in f:
            line_flag = line_flag + 1
            data = json.loads(line)['name']
            if data != former_str:
                list_str.append(data)
                former_str = data
            if line_flag == lines:
                break
    with open(self.path + 'data_range\\names.csv', 'a', newline='') as writeFile:
        writer = csv.writer(writeFile, delimiter='\n')
        writer.writerow(list_str)
    writeFile.close()
Another version:
def read_large_file(f):
    block_size = 200000000
    block = []
    for line in f:
        block.append(line[:-1])
        if len(block) == block_size:
            yield block
            block = []
    if block:
        yield block

def split_files():
    with open(write_file, 'r') as f:
        i = 0
        for block in read_large_file(f):
            print(i)
            file_name = write_name + str(i) + '.csv'
            with open(file_name, 'w', newline='') as f_:
                writer = csv.writer(f_, delimiter='\n')
                writer.writerow(block)
            i += 1
This was after it had read a block and while it was writing ... I wonder why the data transmission rate stayed at about 0.
It should be as simple as this:
import json
import csv

with open(read_file, 'rt') as r, open(write_file, 'wt', newline='') as w:
    writer = csv.writer(w)
    for line in r:
        writer.writerow([json.loads(line)['name']])
I tried putting the loop inside the with block, but I always get an error. I guessed we cannot write data into another file while the first file is open?
You totally can write data in one file while reading another. I can't tell you more about your error until you post what it said, though.
There was a bit in your code about former_str which is not covered under "extract one column", so I did not write anything about it.
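If you also want the former_str behavior (skipping consecutive duplicate names), a sketch along the same lines:

import json
import csv

# Same single-pass read/write, but skip a name that repeats the previous
# line's name (the former_str logic from the question).
with open(read_file, 'rt') as r, open(write_file, 'wt', newline='') as w:
    writer = csv.writer(w)
    former_str = None
    for line in r:
        name = json.loads(line)['name']
        if name != former_str:
            writer.writerow([name])
            former_str = name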
Would something like this work?
Essentially using a generator to avoid reading the entire file in memory, and writing the data one line at a time.
import jsonlines  # pip install jsonlines
from typing import Generator

def gen_lines(file_path: str, col_name: str) -> Generator[str, None, None]:
    with jsonlines.open(file_path) as f:
        for obj in f:
            yield obj[col_name]

# Here you can also change to writing a jsonline again
with open(output_file, "w") as out:
    for item in gen_lines(your_file_path, col_name_to_extract):
        out.write(f"{item}\n")
I tried this but it just writes "lagerungskissen kleinkind,44" several times instead of transferring every row.
keyword = []
rank = []
rank = list(map(int, rank))
data = []

with open("keywords.csv", "r") as file:
    for line in file:
        data = line.strip().replace('"', '').split(",")
        keyword = data[0]
        rank = data[3]

import csv

with open("mynew.csv", "w", newline="") as f:
    thewriter = csv.writer(f)
    thewriter.writerow(["Keyword", "Rank"])
    for row in keyword:
        thewriter.writerow([keyword, rank])
It should look like this
This is writing the same line in your output CSV because the final block is
for row in keyword:
    thewriter.writerow([keyword, rank])
Note that the keyword variable doesn't change in the loop, but the row does. You're writing that same [keyword, rank] line len(keyword) times.
I would use the csv package to do the reading and the writing for this. Something like
import csv

input_file = '../keywords.csv'
output_file = '../mynew.csv'

# open the files
fIn = open(input_file, 'r', newline='')
fOut = open(output_file, 'w', newline='')
csvIn = csv.reader(fIn, quotechar='"')  # check the keyword args in the docs!
csvOut = csv.writer(fOut)

# write a header, then write each row one at a time
csvOut.writerow(['Keyword', 'Rank'])
for row in csvIn:
    keyword = row[0]
    rank = row[3]
    csvOut.writerow([keyword, rank])

# and close the files
fOut.close()
fIn.close()
As a side note, you could write the above using the with context manager (e.g. with open(...) as file:). The answer here shows how to do it with multiple files (in this case fIn and fOut).
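A sketch of the same code rewritten that way (same hypothetical file names):

import csv

with open('../keywords.csv', 'r', newline='') as fIn, \
     open('../mynew.csv', 'w', newline='') as fOut:
    csvIn = csv.reader(fIn, quotechar='"')
    csvOut = csv.writer(fOut)
    csvOut.writerow(['Keyword', 'Rank'])
    for row in csvIn:
        csvOut.writerow([row[0], row[3]])  # keyword, rank
# both files are closed automatically when the with block exits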
I have a big csv file. After some items there is a newline character which is not supposed to be there. It is always after a specific item, let's say it's called 'foo'. I need to remove every newline character after foo. I figured out this is kind of what should happen:
for line in sys.stdin:
    if line.split(",")[-1] == "foo":
        line = line.rstrip()
How do I make sure I output the result back to the file?
You can't write line back to your original file, but assuming you will use your script like python script.py < input_file.csv > output_file.csv, you can simply print the lines you need:
import sys

for line in sys.stdin:
    # the line still ends with '\n', so strip it before comparing the last field
    if line.rstrip("\n").split(",")[-1] == "foo":
        line = line.rstrip()
    # print() will append '\n' by default - we prevent it
    print(line, end='')
I haven't tested this, but it should do what you need it to. This assumes there are no items (other than foo) that have trailing whitespace you don't want stripped. Otherwise, a simple conditional will fix that.
import csv

with open("/path/to/file", newline='') as f:
    reader = csv.reader(f)
    rows = []  # collect the cleaned rows; the reader can't be reused after the file closes
    for row in reader:
        for i, item in enumerate(row):
            row[i] = item.rstrip()
        rows.append(row)

with open("/path/to/file", 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerows(rows)
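If other fields might have trailing whitespace you want to keep, a sketch of the conditional mentioned above (replacing the inner loop):

# strip only the items that are exactly 'foo' plus trailing whitespace
for i, item in enumerate(row):
    if item.rstrip() == "foo":
        row[i] = item.rstrip()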
This answer just saves to a new csv file.
with open("test.csv", "r", newline="") as csvfile:
my_reader = csv.reader(csvfile, delimiter=',', quotechar='"')
with open("new.csv", "w", newline="") as csvfile2:
last_line = []
writer = csv.writer(csvfile2, delimiter=',', quotechar='"')
for line in my_reader:
if last_line != []:
writer.writerow(last_line + line)
last_line = []
elif line[-1] == "foo":
last_line = line
else:
writer.writerow(line)
if last_line != []: # when the last line also contain "foo"
writer.writerow(last_line)
Tested on a test.csv file:
this,"is,a ",book
this,is,foo
oh,my
this,foo
And got this new.csv file:
this,"is,a ",book
this,is,foo,oh,my
this,foo
I'm trying to iterate over a CSV file that has a 'master list' of names, and compare it to another CSV file that contains only the names of people who were present and made phone calls.
I'm trying to iterate over the master list, compare it to the names in the other CSV file, take the number of calls made by each person, and write a new CSV file containing the number of calls; if a name isn't found or its count is 0, that column needs to have a 0.
I'm not sure if its something incredibly simple I'm overlooking, or if I am truly going about this incorrectly.
Edited for formatting.
import csv
import sys

masterlst = open('masterlist.csv')
comparelst = open(sys.argv[1])

masterrdr = csv.DictReader(masterlst, dialect='excel')
comparerdr = csv.DictReader(comparelst, dialect='excel')
headers = comparerdr.fieldnames

with open('callcounts.csv', 'w') as outfile:
    wrtr = csv.DictWriter(outfile, fieldnames=headers, dialect='excel', quoting=csv.QUOTE_MINIMAL, delimiter=',', escapechar='\n')
    wrtr.writerow(dict((fn, fn) for fn in headers))
    for lines in masterrdr:
        for row in comparerdr:
            if lines['Names'] == row['Names']:
                print(lines['Names'] + ' has ' + row['Calls'] + ' calls')
                wrtr.writerow(row)
            elif lines['Names'] != row['Names']:
                row['Calls'] = ('%s' % 0)
                wrtr.writerow(row)
                print(row['Names'] + ' had 0 calls')

masterlst.close()
comparelst.close()
Here's how I'd do it, assuming the file sizes do not prove to be problematic:
import csv
import sys

with open(sys.argv[1]) as comparelst:
    comparerdr = csv.DictReader(comparelst, dialect='excel')
    headers = comparerdr.fieldnames
    names_and_counts = {}
    for line in comparerdr:
        names_and_counts[line['Names']] = line['Calls']
    # or, if you're sure you only want the ones with 0 calls, just use a set
    # and only add the line['Names'] values where line['Calls'] == '0'

with open('masterlist.csv') as masterlst:
    masterrdr = csv.DictReader(masterlst, dialect='excel')
    with open('callcounts.csv', 'w') as outfile:
        wrtr = csv.DictWriter(outfile, fieldnames=headers, dialect='excel', quoting=csv.QUOTE_MINIMAL, delimiter=',', escapechar='\n')
        wrtr.writerow(dict((fn, fn) for fn in headers))
        # or if you're on 2.7, wrtr.writeheader()
        for line in masterrdr:
            if names_and_counts.get(line['Names']) == '0':
                row = {'Names': line['Names'], 'Calls': '0'}
                wrtr.writerow(row)
That writes just the rows with 0 calls, which is what your text description said - you could tweak it if you wanted to write something else for non-0 calls.
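For instance, a sketch of that tweak, writing every master-list name with its actual count (defaulting to '0' when the name is absent):

for line in masterrdr:
    calls = names_and_counts.get(line['Names'], '0')
    wrtr.writerow({'Names': line['Names'], 'Calls': calls})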
Thanks everyone for the help. I was able to nest another with statement inside of my outer loop, and add a variable to test whether or not the name from the master list was found in the compare list. This is my final working code.
import csv
import sys

masterlst = open('masterlist.csv')
comparelst = open(sys.argv[1])

masterrdr = csv.DictReader(masterlst, dialect='excel')
comparerdr = csv.DictReader(comparelst, dialect='excel')
headers = comparerdr.fieldnames

with open('callcounts.csv', 'w') as outfile:
    wrtr = csv.DictWriter(outfile, fieldnames=headers, dialect='excel', quoting=csv.QUOTE_MINIMAL, delimiter=',', escapechar='\n')
    wrtr.writerow(dict((fn, fn) for fn in headers))
    for line in masterrdr:
        found = False
        with open(sys.argv[1]) as loopfile:
            looprdr = csv.DictReader(loopfile, dialect='excel')
            for row in looprdr:
                if row['Names'] == line['Names']:
                    line['Calls'] = row['Calls']
                    wrtr.writerow(line)
                    found = True
                    break
        if found == False:
            line['Calls'] = '0'
            wrtr.writerow(line)

masterlst.close()
comparelst.close()