Python data handling with mismatched fields in each line

I am new to Python and coding. I have large data like the sample below and want to save it to a CSV file with the fields as the header. All fields are ',' separated and each parameter has its value on the right side of the '='.
For example, in LAIGCINAME="LocalLA", LAIGCINAME is the field and "LocalLA" is the value. My problem is that every line has some missing fields. Can anyone help me handle this in Python, since the lines are not in sync?
ZXWN:GCI="12345",LAIGCINAME="LocalLA",PROXYLAI=NO,MSCN="11223344",VLRN="11223344",MSAREANAME="0"
ZWGA:GCI="13DADC12",PROXYLAI=NO,MSCVLRTYPE=MSCVLRNUM,MSCN="33223344",VLRN="22334455",MSAREANAME="0",NONBCLAI=NO;

As your data has many possible column names, you will need to parse the whole file first to determine a suitable list of names. Once this is done, the header for the output file can be written, followed by all of the data.
By making use of a csv.DictWriter() object, missing entries will be written as empty cells. A restval parameter can be passed if a different value is needed for missing entries, e.g. "N/A" (see the sketch after the sample output below).
import csv

header = set()
input_filename = 'input.csv'
output_filename = 'output.csv'

with open(input_filename, newline='') as f_input:
    csv_input = csv.reader(f_input)
    # First determine all possible column names
    for row in csv_input:
        header.update({entry.split('=')[0] for entry in row})

with open(input_filename, newline='') as f_input, open(output_filename, 'w', newline='') as f_output:
    csv_input = csv.reader(f_input)
    csv_output = csv.DictWriter(f_output, fieldnames=sorted(header))
    csv_output.writeheader()
    for row in csv_input:
        output_row = {}
        for entry in row:
            key, value = entry.split('=')
            output_row[key] = value.strip('"')
        csv_output.writerow(output_row)
For the two lines you have given, this would give you an output file as:
LAIGCINAME,MSAREANAME,MSCN,MSCVLRTYPE,NONBCLAI,PROXYLAI,VLRN,ZWGA:GCI,ZXWN:GCI
LocalLA,0,11223344,,,NO,11223344,,12345
,0,33223344,MSCVLRNUM,NO;,NO,22334455,13DADC12,
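If a placeholder such as "N/A" is preferred over empty cells, restval can be passed when constructing the writer (a minimal sketch reusing the header set and output filename from the code above):
import csv

with open(output_filename, 'w', newline='') as f_output:
    # restval supplies the value written for every field a row is missing
    csv_output = csv.DictWriter(f_output, fieldnames=sorted(header), restval='N/A')
    csv_output.writeheader()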
A csv.DictWriter writes each row from a dictionary, whereas a csv.writer takes a list of items.
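For example, the difference looks like this (a minimal sketch with hypothetical field names, not part of the code above):
import csv

with open('demo.csv', 'w', newline='') as f:
    list_writer = csv.writer(f)
    list_writer.writerow(['12345', 'LocalLA'])  # csv.writer: values supplied by position

    dict_writer = csv.DictWriter(f, fieldnames=['GCI', 'LAIGCINAME'])
    dict_writer.writerow({'GCI': '12345'})  # csv.DictWriter: values supplied by name; missing keys become empty cells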
The code creates a single dictionary for each row called output_row and then writes it to the output file. By working one row at a time, the script will be able to handle files of any size without running into memory problems.
An alternative approach would be to read the whole file into memory and create a list of dictionaries, one for each row. The header values could be calculated at the same time. This list of dictionaries could then be written in one go.
For example:
import csv

input_filename = 'input.csv'
output_filename = 'output.csv'

header = set()    # Use a set to create unique header values from all rows
output_rows = []  # List of dictionary rows

with open(input_filename, newline='') as f_input:
    csv_input = csv.reader(f_input)
    for row in csv_input:
        output_row = {}
        for entry in row:
            key, value = entry.split('=')
            output_row[key] = value.strip('"')
            header.add(key)
        output_rows.append(output_row)

with open(output_filename, 'w', newline='') as f_output:
    csv_output = csv.DictWriter(f_output, fieldnames=sorted(header))
    csv_output.writeheader()
    csv_output.writerows(output_rows)
Note, this approach would fail if the file is too big (your question mentions that you have large data).

Related

Create multiple files from unique values of a column using inbuilt libraries of python

I started learning Python and was wondering if there is a way to create multiple files from the unique values of a column. I know there are hundreds of ways of getting it done through pandas, but I am looking to have it done using only built-in libraries. I couldn't find a single example where it's done with built-in libraries.
Here is the sample csv file data:
uniquevalue|count
a|123
b|345
c|567
d|789
a|123
b|345
c|567
Sample output file:
a.csv
uniquevalue|count
a|123
a|123
b.csv
uniquevalue|count
b|345
b|345
I am struggling with looping over the unique values in a column and then writing them out. Can someone explain the logic for how to do it? That would be much appreciated. Thanks.
import csv
from collections import defaultdict

header = []
data = defaultdict(list)
DELIMITER = "|"

with open("inputfile.csv", newline="") as csvfile:
    reader = csv.reader(csvfile, delimiter=DELIMITER)
    for i, row in enumerate(reader):
        if i == 0:
            header = row
        else:
            key = row[0]
            data[key].append(row)

for key, value in data.items():
    filename = f"{key}.csv"
    with open(filename, "w", newline="") as f:
        writer = csv.writer(f, delimiter=DELIMITER)
        rows = [header] + value
        writer.writerows(rows)
import csv

with open('sample.csv', newline='') as csvfile:
    reader = csv.reader(csvfile, delimiter='|')
    next(reader)  # skip the header row so it does not get a file of its own
    for row in reader:
        # open in append mode so rows with the same key accumulate in one file
        with open(f"{row[0]}.csv", 'a', newline='') as inner:
            writer = csv.writer(inner, delimiter='|')
            writer.writerow(row)
The task can also be done without using the csv module. The lines of the file are read, and read_file.read().splitlines()[1:] strips off the newline characters while also skipping the header line of the csv file. With a set, a unique collection of the input data is created, which is used to count the number of duplicates and to create the output files.
with open("unique_sample.csv", "r") as read_file:
items = read_file.read().splitlines()[1:]
for line in set(items):
with open(line[:line.index('|')] + '.csv', 'w') as output:
output.write((line + '\n') * items.count(line))

csv skipping appending data skips rows

I have Python code for appending data to the same CSV, but when I append the data, it skips rows and starts from row 15 instead of from row 4.
import csv

with open('csvtask.csv', 'r') as csv_file:
    csv_reader = csv.DictReader(csv_file)
    ls = []
    for line in csv_reader:
        if len(line['Values']) != 0:
            ls.append(int(line['Values']))

new_ls = ['', '', '']
for i in range(len(ls) - 1):
    new_ls.append(ls[i + 1] - ls[i])
print(new_ls)

with open('csvtask.csv', 'a', newline='') as new_file:
    csv_writer = csv.writer(new_file)
    for i in new_ls:
        csv_writer.writerow(('', '', '', '', i))
It's not really feasible to update a file at the same time you're reading it, so a common workaround is to create a new file. The following does that while preserving the fieldnames of the original file. The new column will be named Diff.
Since there's no previous value to use when calculating a difference for the first row, the rows are processed with the built-in enumerate() function, which yields both the index of each item in the sequence and the item itself as the object is iterated. The index tells you whether the current row is the first one, so it can be handled specially.
import csv

# Read the csv file and calculate the values of the new column.
with open('csvtask.csv', 'r', newline='') as file:
    reader = csv.DictReader(file)
    fieldnames = reader.fieldnames  # Save for later.
    diffs = []
    prev_value = 0
    for i, row in enumerate(reader):
        row['Values'] = int(row['Values']) if row['Values'] else 0
        diff = row['Values'] - prev_value if i > 0 else ''
        prev_value = row['Values']
        diffs.append(diff)

# Read the file again and write an updated file with the column added to it.
fieldnames.append('Diff')  # Name of new field.
with open('csvtask.csv', 'r', newline='') as inp:
    reader = csv.DictReader(inp)
    with open('csvtask_updated.csv', 'w', newline='') as outp:
        writer = csv.DictWriter(outp, fieldnames)
        writer.writeheader()
        for i, row in enumerate(reader):
            row.update({'Diff': diffs[i]})  # Add new column.
            writer.writerow(row)

print('Done')
You can use the DictWriter class like this:
header = ["data", "values"]
writer = csv.DictWriter(file, fieldnames=header)  # file is an already-open output file
writer.writeheader()
data = [{"data": 1, "values": 2}, {"data": 4, "values": 6}]
writer.writerows(data)
Note that DictWriter.writerows() expects one dictionary per row, mapping field names to values.

write specific row only once?

I want to write some data to a CSV file. I don't have a problem doing this. The only issue I have is that I want to write the "title" row just once, but it's being written every two lines.
Here is my code:
rows = [['IVE_PATH', 'FPS moyen', 'FPS max', 'FPS min', 'MEDIAN'],
        [str(listFps[k]), statistics.mean(numberList), max(numberList), min(numberList), statistics.median(numberList)]]

with open(r"C:\ProgramData\OutilTestObjets3D\MaquetteCB-2019\DataSet\doc.csv", 'a', newline='') as csvfile:
    csv_writer = csv.writer(csvfile, delimiter=';')
    for row in rows:
        csv_writer.writerow(row)
k += 1
I want to have this:
['IVE_PATH','FPS moyen','FPS max','FPS min','MEDIAN']
written only once at the top of the file, and not every two lines.
The likely intent was to skip the header row inside the loop, so only the data rows are written on each append:
with open(r"C:\ProgramData\OutilTestObjets3D\MaquetteCB-2019\DataSet\doc.csv", 'a', newline='') as csvfile:
    csv_writer = csv.writer(csvfile, delimiter=';')
    for row in rows[1:]:
        csv_writer.writerow(row)
k += 1
Note that this alone never writes the header; the answers below handle that part as well.
It's because you opened the file in append mode ('a') and you are iterating over all the rows each time you write to the file. This means every time you write, you will add both the header and the data to the existing file.
The solution is to separate the writing of the header and the data rows.
One way is to first check whether you are writing to an empty file using tell(); only then should the header be written. Then proceed with iterating over all of the rows except the header.
import csv

rows = [
    ['IVE_PATH', 'FPS moyen', 'FPS max', 'FPS min', 'MEDIAN'],  # header
    [1, 2, 3, 4, 5],  # sample data
    [6, 7, 8, 9, 0],  # sample data
]

with open("doc.csv", 'a', newline='') as csvfile:
    csv_writer = csv.writer(csvfile, delimiter=';')
    # Check if we are at the top of an empty file.
    # If yes, then write the header.
    # If no, then assume that the header was already written earlier.
    if csvfile.tell() == 0:
        csv_writer.writerow(rows[0])
    # Iterate over only the data, skip rows[0]
    for row in rows[1:]:
        csv_writer.writerow(row)
Another way is to check first if the output CSV file exists. If it does not exist yet, create it and write the header row. Then succeeding runs of your code should only append the data rows.
import csv
import os

rows = [
    ['IVE_PATH', 'FPS moyen', 'FPS max', 'FPS min', 'MEDIAN'],  # header
    [1, 2, 3, 4, 5],  # sample data
    [6, 7, 8, 9, 0],  # sample data
]

csvpath = "doc.csv"

# If the output file does not exist yet, create it
# and write the header row.
if not os.path.exists(csvpath):
    with open(csvpath, "w", newline='') as csvfile:
        csv_writer = csv.writer(csvfile, delimiter=';')
        csv_writer.writerow(rows[0])

with open(csvpath, 'a', newline='') as csvfile:
    csv_writer = csv.writer(csvfile, delimiter=';')
    # Iterate over only the data, skip rows[0]
    for row in rows[1:]:
        csv_writer.writerow(row)

How to skip the headers when processing a csv file using Python?

I am using the code below to edit a CSV using Python. The functions called in the code form the upper part of the code.
Problem: I want the code below to start editing the CSV from the 2nd row, excluding the 1st row, which contains the headers. Right now it is applying the functions starting from the 1st row, and my header row is getting changed.
in_file = open("tmob_notcleaned.csv", "rb")
reader = csv.reader(in_file)
out_file = open("tmob_cleaned.csv", "wb")
writer = csv.writer(out_file)

row = 1
for row in reader:
    row[13] = handle_color(row[10])[1].replace(" - ", "").strip()
    row[10] = handle_color(row[10])[0].replace("-", "").replace("(", "").replace(")", "").strip()
    row[14] = handle_gb(row[10])[1].replace("-", "").replace(" ", "").replace("GB", "").strip()
    row[10] = handle_gb(row[10])[0].strip()
    row[9] = handle_oem(row[10])[1].replace("Blackberry", "RIM").replace("TMobile", "T-Mobile").strip()
    row[15] = handle_addon(row[10])[1].strip()
    row[10] = handle_addon(row[10])[0].replace(" by", "").replace("FREE", "").strip()
    writer.writerow(row)

in_file.close()
out_file.close()
I tried to solve this problem by initializing the row variable to 1, but it didn't work.
Please help me solve this issue.
Your reader variable is an iterable; looping over it retrieves the rows.
To make it skip one item before your loop, simply call next(reader, None) and ignore the return value.
You can also simplify your code a little; use the opened files as context managers to have them closed automatically:
with open("tmob_notcleaned.csv", "rb") as infile, open("tmob_cleaned.csv", "wb") as outfile:
reader = csv.reader(infile)
next(reader, None) # skip the headers
writer = csv.writer(outfile)
for row in reader:
# process each row
writer.writerow(row)
# no need to close, the files are closed automatically when you get to this point.
If you wanted to write the header to the output file unprocessed, that's easy too: pass the output of next() to writer.writerow():
headers = next(reader, None)  # returns the headers or `None` if the input is empty
if headers:
    writer.writerow(headers)
Another way of solving this is to use the DictReader class, which "skips" the header row and uses it to allow named indexing.
Given "foo.csv" as follows:
FirstColumn,SecondColumn
asdf,1234
qwer,5678
Use DictReader like this:
import csv
with open('foo.csv') as f:
    reader = csv.DictReader(f, delimiter=',')
    for row in reader:
        print(row['FirstColumn'])  # Access by column header instead of column number
        print(row['SecondColumn'])
Doing row = 1 won't change anything, because you'll just overwrite it with the results of the loop.
You want to call next(reader) to skip one row.
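A minimal sketch of that fix, assuming the same input file as the question:
import csv

with open("tmob_notcleaned.csv") as in_file:
    reader = csv.reader(in_file)
    next(reader)  # consume and discard the header row
    for row in reader:
        print(row)  # only data rows reach this point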
Simply iterate one time with next():
with open(filename) as file:
    csvreaded = csv.reader(file)
    header = next(csvreaded)
    for row in csvreaded:
        empty_list.append(row)  # your csv list without header
Or materialize the reader into a list and slice off the header with [1:]:
with open(filename) as file:
    csvreaded = list(csv.reader(file))
    header = csvreaded[0]
    for row in csvreaded[1:]:
        empty_list.append(row)  # your csv list without header
Inspired by Martijn Pieters' response.
If you only need to delete the header from the csv file, you can work more efficiently by writing with the standard Python file I/O instead of the csv module:
with open("tmob_notcleaned.csv", "rb") as infile, open("tmob_cleaned.csv", "wb") as outfile:
    next(infile)  # skip the header line
    outfile.write(infile.read())

Python: add column to CSV file based on existing column

I have already written what I need for identifying and parsing the value I am seeking; I need help writing a column to the CSV file (or a new CSV file) with the parsed value. Here's some pseudocode / somewhat realistic Python code for what I am trying to do:
# Given a CSV file, this function creates a new CSV file with all values parsed
def handleCSVfile(csvfile):
    with open(csvfile, 'rb') as file:
        reader = csv.reader(file, delimiter=',', lineterminator='\n')
        for row in reader:
            for field in row:
                if isWhatIWant(field):
                    parsedValue = parse(field)
                    # write new column to row containing parsed value
I've already written the isWhatIWant and parse functions. If I need to write a completely new csv file, then I am not sure how to have both open simultaneously and read and write from one into the other.
I'd do it like this. I'm guessing that isWhatIWant() is something that is supposed to replace a field in-place.
import csv

def handleCSVfile(infilename, outfilename):
    with open(infilename, 'rb') as infile:
        with open(outfilename, 'wb') as outfile:
            reader = csv.reader(infile, lineterminator='\n')
            writer = csv.writer(outfile, lineterminator='\n')
            for row in reader:
                for field_index, field in enumerate(row):
                    if isWhatIWant(field):
                        row[field_index] = parse(field)
                writer.writerow(row)
This sort of pattern occurs a lot and results in really long lines. It can sometimes be helpful to break the logic out from the opening of the files into a separate function, like this:
import csv

def load_save_csvfile(infilename, outfilename):
    with open(infilename, 'rb') as infile:
        with open(outfilename, 'wb') as outfile:
            reader = csv.reader(infile, lineterminator='\n')
            writer = csv.writer(outfile, lineterminator='\n')
            read_write_csvfile(reader, writer)

def read_write_csvfile(reader, writer):
    for row in reader:
        for field_index, field in enumerate(row):
            if isWhatIWant(field):
                row[field_index] = parse(field)
        writer.writerow(row)
This modularizes the code, making it easier to change how the files and formats are handled independently of the processing logic.
Additional hints:
Don't name variables file, as that shadows a built-in name. Shadowing those names will bite you when you least expect it.
delimiter=',' is the default, so you don't need to specify it explicitly. Both hints are applied in the sketch below.
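Putting the hints together, the file-opening function above would look something like this (a sketch only, keeping the Python 2 style 'rb'/'wb' modes used in this answer):
import csv

def load_save_csvfile(infilename, outfilename):
    # 'infile' and 'outfile' avoid shadowing the built-in name 'file'
    with open(infilename, 'rb') as infile, open(outfilename, 'wb') as outfile:
        reader = csv.reader(infile, lineterminator='\n')   # delimiter=',' omitted because it is the default
        writer = csv.writer(outfile, lineterminator='\n')
        read_write_csvfile(reader, writer)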
