I am using this script to take a large CSV file, separate it by unique values in the first column, and save the pieces as new files. I would also like to add 3 columns at the end of each file that contain calculations based on the previous columns. The new columns need headers as well. My current code is as follows:
import csv, itertools as it, operator as op

csv_contents = []
with open('Nov15.csv', newline='') as fin:
    file_reader = csv.DictReader(fin)  # default delimiter is comma
    fieldnames = file_reader.fieldnames  # save for writing
    for line in file_reader:  # read in all of your data
        csv_contents.append(line)  # gather data into a list (of dicts)

# input to itertools.groupby must be sorted by the grouping value
sorted_csv_contents = sorted(csv_contents, key=op.itemgetter('Object'))

for groupkey, groupdata in it.groupby(sorted_csv_contents, key=op.itemgetter('Object')):
    with open('slice_{:s}.csv'.format(groupkey), 'w', newline='') as gips:
        file_writer = csv.DictWriter(gips, fieldnames=fieldnames)
        file_writer.writeheader()
        file_writer.writerows(groupdata)
If your comments are accurate, you could probably do it like this (for imaginary columns col1, col2, and the calculation col1 * col2):

for line in file_reader:  # read in all of your data
    # DictReader yields strings, so convert before doing arithmetic
    line['calculated_col0'] = float(line['col1']) * float(line['col2'])
    csv_contents.append(line)  # gather data into a list (of dicts)
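Putting the two ideas together, here is a fuller sketch that also extends the header list so the new columns get headers. The column names col1/col2 and the three calculations (product, sum, ratio) are made up; substitute your real ones. It runs on an in-memory sample so it is self-contained:

```python
import csv, io

# hypothetical input standing in for Nov15.csv
sample = io.StringIO("Object,col1,col2\nA,2,3\nA,4,5\nB,1,6\n")

reader = csv.DictReader(sample)
# extend the header list with the three new calculated columns
fieldnames = reader.fieldnames + ['product', 'sum', 'ratio']

rows = []
for line in reader:
    c1, c2 = float(line['col1']), float(line['col2'])  # DictReader yields strings
    line['product'] = c1 * c2
    line['sum'] = c1 + c2
    line['ratio'] = c1 / c2
    rows.append(line)

out = io.StringIO()  # swap in a real file handle per group when splitting
writer = csv.DictWriter(out, fieldnames=fieldnames, lineterminator='\n')
writer.writeheader()
writer.writerows(rows)
print(out.getvalue())
```

The same extended fieldnames list then goes to each per-group DictWriter, so every slice file carries the three extra headers.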
So I have a CSV file where all the values end up in the first column, and I want to separate them into different columns, using Python without the pandas lib.
Implementation that should work in Python 3.6+:

import csv

with open("input.csv", newline="") as inputfile:
    with open("output.csv", "w", newline="") as outputfile:
        reader = csv.DictReader(inputfile)  # reader
        fieldnames = reader.fieldnames
        writer = csv.DictWriter(outputfile, fieldnames=fieldnames)  # writer

        # write the header
        writer.writeheader()

        # loop over each row in the input CSV
        for row in reader:
            # get the first column
            column: str = str(row[fieldnames[0]])
            numbers: list = column.split(",")
            if len(numbers) != len(fieldnames):
                print("Error: Lengths not equal")
            # write the row to the output CSV
            writer.writerow({field: num for field, num in zip(fieldnames, numbers)})
Explanation of the code:
The above code takes two file names, input.csv and output.csv; the names are self-explanatory.
It reads each row from input.csv and writes the corresponding row to output.csv.
The last line is a "dictionary comprehension" combined with zip (similar to "list comprehensions" for lists). It's a nice way to do a lot of work in a single line, but the same code in expanded form looks like:
row = {}
for field, num in zip(fieldnames, numbers):
    row[field] = num
writer.writerow(row)
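One caveat worth knowing: zip silently stops at the shorter sequence, which is why the length check above matters; a short row simply drops its trailing fields rather than raising an error.

```python
fieldnames = ["x", "y", "z"]
numbers = ["1", "2"]  # one value short

# zip pairs only as far as the shorter input reaches
row = {field: num for field, num in zip(fieldnames, numbers)}
print(row)  # 'z' is silently dropped
```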
It is already separated into different columns, with , as the separator, but European versions of Excel usually use ; as the separator. You can specify the separator when you import the CSV:
https://support.microsoft.com/en-us/office/import-or-export-text-txt-or-csv-files-5250ac4c-663c-47ce-937b-339e391393ba
If you really want to change the file content with Python, use the replace function to replace , with ;: How to search and replace text in a file?
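If you do rewrite the file in Python, the csv module can re-emit the rows with a ; delimiter directly, which is safer than a blind string replace (a replace would also mangle commas inside quoted fields). A minimal in-memory sketch; swap the StringIO objects for real file handles:

```python
import csv, io

# stand-ins for the input and output files
src = io.StringIO("a,b,c\n1,2,3\n")
dst = io.StringIO()

rows = csv.reader(src)  # reads with the default ',' delimiter
writer = csv.writer(dst, delimiter=';', lineterminator='\n')  # writes with ';'
writer.writerows(rows)
print(dst.getvalue())
```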
I'm trying to parse data from a JSON file and create a CSV file from that output. I've written a Python script that creates the output as per my needs, but I need to sort the resulting CSV file by date and time.
current output
My code:
## Shift Start | End time | Primary | Secondary
def write_CSV():
    # field names
    fields = ['ShiftStart', 'EndTime', 'Primary', 'Secondary']

    # name of csv file
    filename = "CallingLog.csv"

    # writing to csv file
    with open(filename, 'w') as csvfile:
        # creating a csv dict writer object
        writer = csv.DictWriter(csvfile, delimiter=',', lineterminator='\n', fieldnames=fields)

        # writing headers (field names)
        writer.writeheader()

        # writing data rows
        writer.writerows(totalData)
I want my csv file to be sorted by date and time like below; at least sorting by date would be fine.
ShiftStart
2020-11-30T17:00:00-08:00
2020-12-01T01:00:00-08:00
2020-12-02T05:00:00-08:00
2020-12-03T05:00:00-08:00
2020-12-04T09:00:00-08:00
2020-12-05T13:00:00-08:00
2020-12-06T13:00:00-08:00
2020-12-07T09:00:00-08:00
2020-12-08T17:00:00-08:00
2020-12-09T09:00:00-08:00
2020-12-10T09:00:00-08:00
2020-12-11T17:00:00-08:00
YourDataframe.sort_values(['Col1','Col2']).to_csv('Path')

Try this: it not only sorts and writes to CSV, but also leaves the original DataFrame unsorted in the program for further operations if needed.
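If you'd rather not pull in pandas at all: ISO-8601 timestamps that all share the same UTC offset (as in your data) sort chronologically as plain strings, so sorting totalData before writerows is enough. A sketch with made-up rows standing in for totalData:

```python
import csv, io

fields = ['ShiftStart', 'EndTime', 'Primary', 'Secondary']
# made-up stand-in for totalData
total_data = [
    {'ShiftStart': '2020-12-02T05:00:00-08:00', 'EndTime': '2020-12-02T13:00:00-08:00',
     'Primary': 'A', 'Secondary': 'B'},
    {'ShiftStart': '2020-11-30T17:00:00-08:00', 'EndTime': '2020-12-01T01:00:00-08:00',
     'Primary': 'C', 'Secondary': 'D'},
]

# ISO-8601 strings with a common UTC offset compare in chronological order
total_data.sort(key=lambda row: row['ShiftStart'])

out = io.StringIO()  # swap in open("CallingLog.csv", "w", newline="")
writer = csv.DictWriter(out, fieldnames=fields, lineterminator='\n')
writer.writeheader()
writer.writerows(total_data)
print(out.getvalue())
```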
You can adapt this example to your data (which I don't have at hand):
from csv import DictReader, DictWriter
from sys import stdout
# simple, self-contained data
data = '''\
a,b,c
3,2,1
2,2,3
1,3,2
'''.splitlines()
# read the data
dr = DictReader(data)
rows = [row for row in dr]
# print the data
print('# unsorted')
dw = DictWriter(stdout, dr.fieldnames)
dw.writeheader()
dw.writerows(rows)
print('# sorted')
dw = DictWriter(stdout, dr.fieldnames)
dw.writeheader()
dw.writerows(sorted(rows, key=lambda d:d['a']))
# unsorted
a,b,c
3,2,1
2,2,3
1,3,2
# sorted
a,b,c
1,3,2
2,2,3
3,2,1
When you read the data using a DictReader, each element of the list rows is a dictionary, keyed on the field names from the first line of the CSV data.
When you want to sort this list according to the values corresponding to a key, you have to provide sorted with a key argument: a function that returns the value on which you want to sort.
This function is called with the whole element to be sorted, in your case a dictionary. We want to sort on the first field of the CSV, the one indexed by 'a', so our function, using the lambda syntax to inline the definition in the function call, is just lambda d: d['a'], which returns the value on which we want to sort.
NOTE: the sort in this case is alphabetical. It works here because I'm dealing with single digits; in general you may need to convert the value (by default a string) to something that makes sense in your context, e.g., lambda d: int(d['a']).
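Multi-digit values show why that conversion matters:

```python
rows = [{'a': '10'}, {'a': '2'}]

alpha = sorted(rows, key=lambda d: d['a'])         # lexicographic: '10' < '2'
numeric = sorted(rows, key=lambda d: int(d['a']))  # numeric: 2 < 10

print(alpha[0]['a'], numeric[0]['a'])
```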
I need to read a CSV file from stdin and output only the rows whose value in a given column equals a specified value. My input is like this:
2
Kashiwa
Name,Campus,LabName
Shinichi MORISHITA,Kashiwa,Laboratory of Omics
Kenta Naai,Shirogane,Laboratory of Functional Analysis in Silico
Kiyoshi ASAI,Kashiwa,Laboratory of Genome Informatics
Yukihide Tomari,Yayoi,Laboratory of RNA Function
My output should be like this:
Name,Campus,LabName
Shinichi MORISHITA,Kashiwa,Laboratory of Omics
Kiyoshi ASAI,Kashiwa,Laboratory of Genome Informatics
I need to select the people whose value in column #2 == Kashiwa, and not output the first 2 lines of stdin to stdout.
So far I have just tried to read from stdin into csv, but I am getting each row as a list of strings (as expected from the csv documentation). Can I change this?
#!/usr/bin/env python3
import sys
import csv

data = sys.stdin.readlines()
for line in csv.reader(data):
    print(line)
Output:
['2']
['Kashiwa']
['Name', 'Campus', 'LabName']
['Shinichi MORISHITA', 'Kashiwa', 'Laboratory of Omics']
['Kenta Naai', 'Shirogane', 'Laboratory of Functional Analysis in Silico']
['Kiyoshi ASAI', 'Kashiwa', 'Laboratory of Genome Informatics']
['Yukihide Tomari', 'Yayoi', 'Laboratory of RNA Function']
Can someone give me some advice on reading stdin into CSV and manipulating the data later (outputting only the needed columns, swapping columns, etc.)?
#!/usr/bin/env python3
import sys
import csv

data = sys.stdin.readlines()              # read the whole input
column_to_be_matched = int(data.pop(0))   # the column number to match
word_to_be_matched = data.pop(0).strip()  # the word to match in that column
col_headers = data.pop(0).strip()         # the header line (a string, so don't join it)
print(col_headers)                        # print the column names
for line in csv.reader(data):
    if line[column_to_be_matched - 1] == word_to_be_matched:  # when it matches
        print(",".join(line))             # print the row
Use Pandas to read and manage your data in a DataFrame:

import pandas as pd

# File location
infile = r'path/file'

# Load file and skip first two rows
df = pd.read_csv(infile, skiprows=2)

# Keep only the rows that contain Kashiwa in the Campus column
df = df[df['Campus'] == 'Kashiwa']

You can perform all kinds of edits; for example, sort your DataFrame simply by:

df.sort_values(by='your column')

Check the Pandas documentation for all the possibilities.
This is one approach.
Ex:
import csv

with open(filename) as csv_file:
    reader = csv.reader(csv_file)
    next(reader)           # skip first line
    next(reader)           # skip second line
    print(next(reader))    # print header
    for row in reader:
        if row[1] == 'Kashiwa':  # filter by 'Kashiwa'
            print(row)
Output:
['Name', 'Campus', 'LabName']
['Shinichi MORISHITA', 'Kashiwa', 'Laboratory of Omics']
['Kiyoshi ASAI', 'Kashiwa', 'Laboratory of Genome Informatics']
import csv, sys

data_lines = list(csv.reader(sys.stdin.readlines()))
# data_lines[0] holds the column number, data_lines[1] the match word,
# data_lines[2] the header row
sys.stdout.write(",".join(data_lines[2]) + '\n')  # print the header
out = []
for line in data_lines[3:]:
    if line[1] == 'Kashiwa':  # the match is case-sensitive
        string = f"{line[0]},{line[1]},{line[2]}"
        sys.stdout.write(string + '\n')  # same effect as print(string)
        out.append(line)
I am new to Python. I have used just letters to simplify my code below. My code writes a CSV file with columns of a, b, c, d values, each with 10 rows (length). I would like to add the averages of c and d to the same CSV file, as two additional columns that each have a single row for the average values. I have tried to append field names and write the new values, but it didn't work.
with open('out.csv', 'w') as csvfile:
    fieldnames = ['a', 'b', 'c', 'd']
    csv_writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    csv_writer.writeheader()
    total_c = 0
    total_d = 0
    for i in range(1, length):
        # do something to get a, b, c, d values
        total_c += c
        total_d += d
        csv_writer.writerow({'a': a, 'b': b, 'c': c, 'd': d})
    mean_c = total_c / length
    mean_d = total_d / length
I expect to see something like the picture below:
Try using the pandas library to deal with the csv file. I've provided sample code below; I assume that the csv file has no header present on the first line.

import pandas as pd

data = pd.read_csv('out.csv', header=None, names=['a', 'b', 'c', 'd'])

# making sure I am using a copy of the dataframe
avg_data = data.copy()

# creating new average columns in the same dataframe
# (a column mean is a scalar, so assigning it broadcasts to every row)
avg_data['mean_c'] = avg_data['c'].mean()
avg_data['mean_d'] = avg_data['d'].mean()

# writing updated data to csv file
avg_data.to_csv('out.csv', sep=',', encoding='utf-8', index=False)
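If you'd rather stay with the csv module, you can widen the header with the two new columns and write the averages as a single extra row after the loop. A self-contained sketch with made-up values standing in for the real a, b, c, d:

```python
import csv, io

length = 4
# made-up rows standing in for the real computed values
rows = [{'a': i, 'b': i + 1, 'c': i * 2, 'd': i * 3} for i in range(1, length + 1)]

out = io.StringIO()  # swap in open('out.csv', 'w', newline='')
fieldnames = ['a', 'b', 'c', 'd', 'mean_c', 'mean_d']  # two extra header columns
writer = csv.DictWriter(out, fieldnames=fieldnames, lineterminator='\n')
writer.writeheader()
writer.writerows(rows)

# one final row carrying only the averages (other cells stay empty)
writer.writerow({'mean_c': sum(r['c'] for r in rows) / length,
                 'mean_d': sum(r['d'] for r in rows) / length})
print(out.getvalue())
```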
I have two csv files with a single column of data. How can I remove data in the second csv file in-place by comparing it with the data in the first csv file? For example:
import csv

reader1 = csv.reader(open("file1.csv", "rb"))
reader = csv.reader(open("file2.csv", "rb"))
for line in reader:
    if line in reader1:
        print line
If both files are just single columns, then you could use set to remove the differences. However, this presumes that the entries in each file are not duplicated and that their order doesn't really matter.
# since each file is a column, unroll each file into a single list:
dat1 = [x[0] for x in reader1]
dat2 = [y[0] for y in reader]

# take the set difference
dat1_without_dat2 = set(dat1).difference(dat2)
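If the order and duplicates in file2 do matter, a membership test against a set of file1's values preserves both while still removing every value that appears in file1. An in-memory sketch (swap the StringIO objects for real file handles, then rewrite file2 from the result):

```python
import csv, io

# stand-ins for file1.csv and file2.csv (single-column files)
file1 = io.StringIO("apple\nbanana\ncherry\n")
file2 = io.StringIO("banana\ndate\nbanana\n")

drop = {row[0] for row in csv.reader(file1) if row}  # values to remove
kept = [row[0] for row in csv.reader(file2) if row and row[0] not in drop]

out = io.StringIO()  # write this back over file2.csv
csv.writer(out, lineterminator='\n').writerows([v] for v in kept)
print(out.getvalue())
```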