Python code to process a CSV file

I receive an updated CSV file on a daily basis and need to process it and create a new file based on this criterion: if a row is new data, it should be tagged as New; if it is an update to existing data, it should be tagged as Update. How do I write Python code that processes the input and produces CSV output like the following, based on the date?
Day 1 input data
empid,enmname,sal,datekey
1,cholan,100,8/14/2018
2,ram,200,8/14/2018
Day 2 input data
empid,enmname,sal,datekey
1,cholan,100,8/14/2018
2,ram,200,8/14/2018
3,sundar,300,8/15/2018
2,raman,200,8/15/2018
Output Data
status,empid,enmname,sal,datekey
new,3,sundar,300,8/15/2018
update,2,raman,200,8/15/2018

I'm feeling nice, so I'll give you some code. Try to learn from it.
To work with CSV files, we'll need the csv module:
import csv
First off, let's teach the computer how to open and parse a CSV file:
def parse(path):
    with open(path) as f:
        return list(csv.DictReader(f))
csv.DictReader reads the first line of the csv file and uses it as the "names" of the columns. It then creates a dictionary for each subsequent row, where the keys are the column names.
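To see what that yields, here is a quick demo against an in-memory copy of the Day 1 sample (io.StringIO stands in for a real file here):

```python
import csv
import io

# In-memory stand-in for the Day 1 input file.
sample = io.StringIO(
    "empid,enmname,sal,datekey\n"
    "1,cholan,100,8/14/2018\n"
    "2,ram,200,8/14/2018\n"
)

rows = list(csv.DictReader(sample))

# Each row is a dict keyed by the header names; all values are strings.
print(rows[0]["enmname"])  # cholan
print(rows[1]["sal"])      # 200
```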
That's all well and good, but we just want the last version with each key:
def parse(path):
    data = {}
    with open(path) as f:
        for row in csv.DictReader(f):
            data[row["empid"]] = row
    return data
Instead of just creating a list containing everything, this creates a dictionary where the keys are the row's id. This way, rows found later in the file will overwrite rows found earlier in the file.
Now that we've taught the computer how to extract the data from the files, let's get it:
old_data = parse("file1.csv")
new_data = parse("file2.csv")
Iterating through a dictionary gives you its keys, which are the ids defined in the data set. Conveniently, key in dictionary tells you whether key is one of the dictionary's keys. So we can do this:
new = {
    id_: row
    for id_, row in new_data.items()
    if id_ not in old_data
}
updated = {
    id_: row
    for id_, row in new_data.items()
    if id_ in old_data and old_data[id_] != row
}
I'll put csv.DictWriter here and let you sort out the rest on your own.

Searching CSV file and return rows as JSON [duplicate]

I'm learning Python. I have a CSV file with these rows, and I am trying to search for and return the rows whose year_ceremony matches the year parameter the function accepts.
year_film,year_ceremony,ceremony,category,name,film,winner
1927,1928,1,ACTOR,Richard Barthelmess,The Noose,False
1927,1928,1,ACTOR,Emil Jannings,The Last Command,True
1927,1928,1,ACTRESS,Louise Dresser,A Ship Comes In,False
1928,1929,2,CINEMATOGRAPHY,Ernest Palmer,Four Devils;,False
1928,1929,2,CINEMATOGRAPHY,John Seitz,The Divine Lady,False
1928,1929,2,DIRECTING,Lionel Barrymore,Madame X,False
1928,1929,2,DIRECTING,Harry Beaumont,The Broadway Melody,False
def get_academy_awards_nominees(year):
    response = []
    csv_file = csv.reader(open("csvs/the_oscar_award.csv", "r"), delimiter=",")
    for row in csv_file:
        if row[1] == year:
            response.append(row)
    return response
I'm looking for a way to format each matching row with the header (year_film,year_ceremony,ceremony,category,name,film,winner) as keys and values and return them as JSON.
You can use DictReader from the csv module for reading; it maps each row to a dict. Then use json.dumps to convert the result (a list of dictionaries) to a JSON-formatted string:
import csv
import json
def get_academy_awards_nominees(year: int):
    result = []
    with open("csvs/the_oscar_award.csv") as f:
        dict_reader = csv.DictReader(f)
        for row in dict_reader:
            if int(row["year_ceremony"]) == year:
                result.append(row)
    return json.dumps(result)
print(get_academy_awards_nominees(1928))
Another solution is to iterate as normal with a csv.reader and, whenever you find a desired row, build a dictionary from it and append it:
import csv
import json
def get_academy_awards_nominees(year: int):
    result = []
    with open("csvs/the_oscar_award.csv") as f:
        reader = csv.reader(f)
        header = next(reader)
        for row in reader:
            if int(row[1]) == year:
                result.append(dict(zip(header, row)))
    return json.dumps(result)
print(get_academy_awards_nominees(1928))
Which one is better?
It depends. With the first approach, if you delete the year_film column, for example, nothing breaks, because it works with keys rather than indexes. With the second approach you would have to adjust the indexes.
If your file is big enough and you only want a few rows (generally, consider a database instead), I'd guess the second approach is cheaper: building a dictionary for every row takes more work than building a list. Prove it yourself, since I didn't time it.
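If you do want to time it, here is a sketch using timeit over synthetic in-memory data (the row count and repetition number are arbitrary, and the rows are made up rather than taken from the real Oscars file):

```python
import csv
import io
import timeit

# Synthetic stand-in for the CSV: a header plus 1,000 identical data rows.
text = "year_film,year_ceremony,ceremony,category,name,film,winner\n" + \
       "1927,1928,1,ACTOR,Name,Film,False\n" * 1000

def with_dictreader():
    # First approach: DictReader keys each row by column name.
    return [row for row in csv.DictReader(io.StringIO(text))
            if int(row["year_ceremony"]) == 1928]

def with_reader():
    # Second approach: plain reader, building dicts only for matches.
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    return [dict(zip(header, row)) for row in reader if int(row[1]) == 1928]

print(timeit.timeit(with_dictreader, number=50))
print(timeit.timeit(with_reader, number=50))
```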

Converting a dictionary with nested lists into a CSV file

I've been having a hard time trying to convert this dictionary with nested lists into a CSV file. I have a CSV file I am filtering - https://easyupload.io/8zobej. I turned it into a dictionary and cleaned it up, and I am now stuck trying to output it to a CSV. I've tried many different combinations of DictWriter and writerows but keep coming up short. I'm now trying to write a for loop that goes through the dictionary and outputs each value it finds to the CSV.
Here is my code - please excuse the comments - I was trying many things.
import csv

def dataSorter(filename: str):
    """
    Scans the inputted CSV file with 2 columns (category, value) and sorts
    the values into categories, giving us a list of values for each category.
    """
    # Open the input csv file and parse by comma delimiter
    with open(filename) as inputcsv:
        readcsv = csv.reader(inputcsv, delimiter=',')
        sortedData = {}
        # skip the header row
        next(readcsv)
        # loop through the file and assign values to keys in "sortedData"
        for i in readcsv:
            category = i[0]
            if category not in sortedData:
                sortedData[category] = [i[1]]
            elif i[1] not in sortedData[category]:
                sortedData[category].append(i[1])
        for category in sortedData:
            sortedData[category].sort()
        return sortedData

output file to CSV

I'm trying to parse data from a JSON file and create a CSV file from the output. I've written a Python script that creates the output as per my needs; now I need to sort the resulting CSV file by date and time.
My code:
import csv

# Shift Start | End Time | Primary | Secondary
def write_CSV():
    # field names
    fields = ['ShiftStart', 'EndTime', 'Primary', 'Secondary']
    # name of csv file
    filename = "CallingLog.csv"
    # writing to csv file
    with open(filename, 'w') as csvfile:
        # create a csv dict writer object
        writer = csv.DictWriter(csvfile, delimiter=',', lineterminator='\n', fieldnames=fields)
        # write the header (field names)
        writer.writeheader()
        # write the data rows
        writer.writerows(totalData)
I want my csv file to be sorted by date and time as below; at least sorting by date would be fine.
ShiftStart
2020-11-30T17:00:00-08:00
2020-12-01T01:00:00-08:00
2020-12-02T05:00:00-08:00
2020-12-03T05:00:00-08:00
2020-12-04T09:00:00-08:00
2020-12-05T13:00:00-08:00
2020-12-06T13:00:00-08:00
2020-12-07T09:00:00-08:00
2020-12-08T17:00:00-08:00
2020-12-09T09:00:00-08:00
2020-12-10T09:00:00-08:00
2020-12-11T17:00:00-08:00
YourDataframe.sort_values(['Col1','Col2']).to_csv('Path')
Try this: it not only sorts and writes to CSV, but also keeps the original DataFrame unsorted in the program for further operations if needed!
You can adapt this example to your data (which I don't have in my possession :-)
from csv import DictReader, DictWriter
from sys import stdout
# simple, self-contained data
data = '''\
a,b,c
3,2,1
2,2,3
1,3,2
'''.splitlines()
# read the data
dr = DictReader(data)
rows = [row for row in dr]
# print the data
print('# unsorted')
dw = DictWriter(stdout, dr.fieldnames)
dw.writeheader()
dw.writerows(rows)
print('# sorted')
dw = DictWriter(stdout, dr.fieldnames)
dw.writeheader()
dw.writerows(sorted(rows, key=lambda d:d['a']))
# unsorted
a,b,c
3,2,1
2,2,3
1,3,2
# sorted
a,b,c
1,3,2
2,2,3
3,2,1
When you read the data using a DictReader, each element of the list rows is a dictionary, keyed on the field names of the first line of the CSV data file.
When you want to sort this list according to the values corresponding to a key, you have to provide sorted with a key argument, that is a function that returns the value on which you want to sort.
This function is called with the whole element to be sorted, in our case a dictionary. We want to sort on the first field of the CSV, the one indexed by 'a', so our function, using the lambda syntax to inline the definition in the function call, is just lambda d: d['a'], which returns the value we want to sort on.
NOTE: the sort in this case is alphabetical, and it works here because I'm dealing with single digits. In general you may need to convert the value (by default a string) to something that makes sense in your context, e.g., lambda d: int(d['a']).
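For the asker's data specifically, the ShiftStart values are ISO-8601 timestamps with a uniform UTC offset, and such strings compare correctly as plain text, so sorting the row list by that key before writerows should be enough. A small sketch (the two rows here are made up from the question's sample values, and totalData in the original code is assumed to be a list of dicts like these):

```python
# Sort the list of row dicts by the ShiftStart string before writing.
rows = [
    {"ShiftStart": "2020-12-01T01:00:00-08:00", "EndTime": "", "Primary": "", "Secondary": ""},
    {"ShiftStart": "2020-11-30T17:00:00-08:00", "EndTime": "", "Primary": "", "Secondary": ""},
]
rows.sort(key=lambda d: d["ShiftStart"])
print(rows[0]["ShiftStart"])  # 2020-11-30T17:00:00-08:00
```

Lexicographic order matches chronological order here only because every timestamp shares the same -08:00 offset; with mixed offsets you would parse with datetime.fromisoformat first.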

Need a way to take three csv files and put into one as well as remove duplicates and replace values in Python

I'm new to Python, but I need help creating a script that takes in three different CSV files, combines them, removes duplicates from the first column as well as any blank rows, then changes each revenue area to a number.
The three CSV files are setup the same.
The first column is a phone number and the second column is a revenue area (city).
The first column will need all duplicates & blank values removed.
The second column will have values like "Macon", "Marceline", "Brookfield", which will need to be changed to a specific value like:
Macon = 1
Marceline = 8
Brookfield = 4
And then if it doesn't match one of those values put a default value of 9.
Welcome to Stack Overflow!
Firstly, you'll want to be using the csv library for the "reader" and "writer" functions, so import the csv module.
Then, you'll want to open the new file to be written to, and use the csv.writer function on it.
After that, you'll want to define a set (I name it seen). This will be used to prevent duplicates from being written.
Write your headers (if you need them) to the new file using the writer.
Open your first old file, using csv module's "reader". Iterate through the rows using a for loop, and add the rows to the "seen" set. If a row has been seen, simply "continue" instead of writing to the file. Repeat this for the next two files.
To assign the values to the cities, you'll want to define a dictionary that holds the old names as the keys, and new values for the names as the values.
So, your code should look something like this:
import csv

myDict = {'Macon': 1, 'Marceline': 8, 'Brookfield': 4}
seen = set()

# newline='' prevents the writer from adding extra newlines (empty rows).
# Note: in Python 3, csv files are opened in text mode ('w'/'r'), not binary.
newFile = open('newFile.csv', 'w', newline='')
writer = csv.writer(newFile)
writer.writerow(['Phone Number', 'City'])  # write a header row

# Open each file, read each row, skip empty rows, skip duplicate phone
# numbers, change the value of "City", and write to the new file.
for name in ('firstFile.csv', 'secondFile.csv', 'thirdFile.csv'):
    with open(name, newline='') as inFile:
        for row in csv.reader(inFile):
            if any(row):
                # .get falls back to the default value 9 for unknown cities
                row[1] = myDict.get(row[1], 9)
                # lists aren't hashable, so track the phone number itself,
                # which also matches the "dedupe on the first column" requirement
                if row[0] in seen:
                    continue
                seen.add(row[0])
                writer.writerow(row)

# Close the output file
newFile.close()
I have not tested this myself, but it is very similar to two different programs that I wrote, and I have attempted to combine them into one. Let me know if this helps, or if there is something wrong with it!
-JCoder96

CSV to Python Dictionary with multiple lists for one key

So I have a csv file formatted like this
data_a,dataA,data1,data11
data_b,dataB,data1,data12
data_c,dataC,data1,data13
, , ,
data_d,dataD,data2,data21
data_e,dataE,data2,data22
data_f,dataF,data2,data23
HEADER1,HEADER2,HEADER3,HEADER4
The column headers are at the bottom, and I want the third column to be the keys. You can see that the third column has the same value for each of the two blocks of data, and the blocks are separated by rows of empty values. I want to store the three rows of values under that one key, and also disregard some columns, such as column 4. This is my code right now:
#!/usr/bin/env python
import csv

with open("example.csv") as f:
    readCSV = csv.reader(f)
    for row in readCSV:
        # disregard separating rows
        if row[2] != '':
            myDict = {row[2]: [row[0], row[1]]}
            print(myDict)
What I basically want is that when I call
print(myDict['data2'])
I get
{[data_d,dataD][data_e,dataE][data_f,dataF]}
I tried editing my if block to
if row[2] == 'data2':
    myDict = {'data2': [row[0], row[1]]}
and just making an if for every individual key, but I don't think this will work either way.
With your current method, you probably want a defaultdict. This is a dictionary-like object that provides a default value if the key doesn't already exist. So in your case, we set this up to be a list, and then for each row we loop through, we append the values in columns 0 and 1 to this list as a tuple, like so:
import csv
from collections import defaultdict

data = defaultdict(list)
with open("example.csv") as f:
    readCSV = csv.reader(f)
    for row in readCSV:
        # disregard separating rows
        if row[2] != '':
            data[row[2]].append((row[0], row[1]))
print(data)
With the example provided, this prints a defaultdict with the following entries:
{'data1': [('data_a', 'dataA'), ('data_b', 'dataB'), ('data_c', 'dataC')], 'data2': [('data_d', 'dataD'), ('data_e', 'dataE'), ('data_f', 'dataF')]}
I'm not a super Python geek, but I would suggest using pandas (import pandas as pd). You load the data with pd.read_csv(file, header=...). With header you can specify which row you want to be the header, and then it's much easier to manipulate the dataset (e.g. dropping columns (del df['column_name']), creating dictionaries, etc.).
Here is the documentation for pd.read_csv: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
