Empty cells in a date/time column should be replaced with the current date in Python

I have a date/time column in which some of the cells are empty, and I want to replace each empty cell with today's date. How can I make that work with the code I have written below? Please help with this.
Please note that I'm not using a pandas DataFrame, so the answer should not contain any DataFrame code. Thanks.
import csv

# tempFile holds the path to the input CSV (defined earlier in the script)
with open(tempFile, 'r', encoding="utf8") as csvfile:
    # creating a csv reader object
    reader = csv.DictReader(csvfile, delimiter=',')
    # next(reader, None)

    # We then restructure the data to be a set of keys with a list of values
    # {key_1: [], key_2: []}:
    data = {}
    for row in reader:
        for header, value in row.items():
            try:
                data[header].append(value)
            except KeyError:
                data[header] = [value]

# Next we want to give each value in each list a unique identifier.
# Loop through all keys
for key in data.keys():
    values = data[key]
    things = list(sorted(set(values), key=values.index))
    for i, x in enumerate(data[key]):
        if key == "Date/Time":
            data[key][i] = data[key][i][0:10]
        else:
            data[key][i] = things.index(x) + 1

# Since csv.writerows() takes a list but treats it as a row, we need to
# restructure our data so that each row is one value from each list.
# This can be accomplished using zip():
with open('ram5.csv', "w", newline='') as outfile:
    writer = csv.writer(outfile)
    # Write headers
    writer.writerow(data.keys())
    # Make one row equal to one value from each list
    rows = zip(*data.values())
    # Write rows
    writer.writerows(rows)
This code does other operations as well; please focus only on the Date/Time column.
The input data is:
job_Id  Name          Address     Email             Date/Time
1       snehil singh  marathalli  ss#gmail.com      12/10/2011:02:03:20
2       salman        marathalli  ss#gmail.com      12/11/2011:03:10:20
3       Amir          HSR         ar#gmail.com
4       Rakhesh       HSR         rakesh#gmail.com  09/12/2010:02:03:55
5       Ram           marathalli  r#gmail.com
6       Shyam         BTM         ss#gmail.com      12/11/2012:01:03:20
7       salman        HSR         ss#gmail.com
8       Amir          BTM         ar#gmail.com      07/10/2013:04:02:30
9       snehil singh  Majestic    sne#gmail.com     03/03/2018:02:03:20
Each empty Date/Time cell should be replaced with the current date.
I have tried inserting this into the code, but it doesn't work. Please help, and thanks:
if ["Date/Time"] == None:
    data[key][i] = "11/12/2018"
else:
    data[key][i] = data[key][i][0:10]
    continue
My code worked this way:
if data[key][i] == "":
    data[key][i] = datetime.datetime.now().isoformat()
Thanks everyone for the help.
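For reference, here is a minimal, self-contained sketch of that fix (the file names are placeholders, not from the original post, and it assumes empty cells read back as empty strings):

import csv
import datetime

# Sketch only: fill empty Date/Time cells with today's date while copying the file.
today = datetime.date.today().strftime("%m/%d/%Y")

with open("input.csv", newline="", encoding="utf8") as infile, \
        open("output.csv", "w", newline="") as outfile:
    reader = csv.DictReader(infile)
    writer = csv.DictWriter(outfile, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        if not row["Date/Time"]:
            row["Date/Time"] = today  # empty cell -> today's date
        else:
            row["Date/Time"] = row["Date/Time"][0:10]  # keep the date part only
        writer.writerow(row)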

You can select the empty cells of a column using .loc and .isnull(), and you can get the current time with the datetime library.
It should work altogether like this:
import datetime
data.loc[data['Date/Time'].isnull(), 'Date/Time'] = datetime.datetime.now()

Try this:
import datetime

for each in data:
    if not each['Date/Time']:
        each['Date/Time'] = datetime.datetime.now()

Related

How to group columns and sum them, in a large CSV?

I have a large CSV (hundreds of millions of rows) and I need to sum the Value column based on the grouping of the ID, Location, and Date columns.
My CSV is similar to:
     ID  Location  Date        Value
1    1   Loc1      2022-01-27  5
2    1   Loc1      2022-01-27  4
3    1   Loc1      2022-01-28  7
4    1   Loc2      2022-01-29  8
5    2   Loc1      2022-01-27  11
6    2   Loc2      2022-01-28  4
7    2   Loc2      2022-01-29  6
8    3   Loc1      2022-01-28  9
9    3   Loc1      2022-01-28  9
10   3   Loc2      2022-01-29  1
{ID: 1, Location: Loc1, Date: 2022-01-27} is one such group, and its sub values 5 and 4 should be summed to 9
{ID: 3, Location: Loc1, Date: 2022-01-28} is another group and its sum should be 18
Here's what that sample input should look like, processed/summed, and written to a new CSV:
ID  Location  Date        Value
1   Loc1      2022-01-27  9
1   Loc1      2022-01-28  7
1   Loc2      2022-01-29  8
2   Loc1      2022-01-27  11
2   Loc2      2022-01-28  4
2   Loc2      2022-01-29  6
3   Loc1      2022-01-28  18
3   Loc2      2022-01-29  1
I know using df.groupby([columns]).sum() would give the desired result, but the CSV is so big I keep getting memory errors. I've tried looking at other ways to read/manipulate CSV data but have still not been successful, so if anyone knows a way I can do this in python without maxing out my memory that would be great!
NB: I know there is an unnamed first column in my initial CSV; it's irrelevant and doesn't need to be in the output, but it doesn't matter if it is :)
The appropriate answer is probably to use Dask, but you can do it with Pandas and chunks. The last_row variable holds the last row of the previous chunk, in case the first row of the current chunk has the same ID, Location and Date.
import pandas as pd

chunksize = 4  # Number of rows per chunk
last_row = pd.DataFrame()  # Last row of the previous chunk

with open('data.csv') as reader, open('output.csv', 'w') as writer:
    # Write headers
    writer.write(reader.readline())
    reader.seek(0)
    for chunk in pd.read_csv(reader, chunksize=chunksize):
        df = pd.concat([last_row, chunk])
        df = df.groupby(['ID', 'Location', 'Date'], as_index=False)['Value'].sum()
        df, last_row = df.iloc[:-1], df.iloc[-1:]
        df.to_csv(writer, header=False, index=False)
    # Don't forget the last row!
    last_row.to_csv(writer, header=False, index=False)
Content of output.csv:
ID,Location,Date,Value
1,Loc1,2022-01-27,9
1,Loc1,2022-01-28,7
1,Loc2,2022-01-29,8
2,Loc1,2022-01-27,11
2,Loc2,2022-01-28,4
2,Loc2,2022-01-29,6
3,Loc1,2022-01-28,18
3,Loc2,2022-01-29,1
If the lines to be combined are consecutive, the good old csv module allows you to process huge files one line at a time, hence with a minimal memory footprint.
Here you could use:
import csv

with open('input.csv') as fd, open('output.csv', 'w', newline='') as fdout:
    rd, wr = csv.reader(fd), csv.writer(fdout)
    _ = wr.writerow(next(rd))  # header line
    old = [None] * 4
    for row in rd:
        row[3] = int(row[3])  # convert value field to integer
        if row[:3] == old[:3]:
            old[3] += row[3]  # add up values of similar rows
        else:
            if old[0]:  # and write the combined row
                _ = wr.writerow(old)
            old = row
    if old[0]:  # do not forget the last row...
        _ = wr.writerow(old)
With the shown input data, it gives as expected:
ID,Location,Date,Value
1,Loc1,2022-01-27,9
1,Loc1,2022-01-28,7
1,Loc2,2022-01-29,8
2,Loc1,2022-01-27,11
2,Loc2,2022-01-28,4
2,Loc2,2022-01-29,6
3,Loc1,2022-01-28,18
3,Loc2,2022-01-29,1
Not as clean and neat as the Pandas code, but it should process files larger than the available memory without any problem.
You could use the built-in csv library and build up the output line by line. A Counter can be used to combine and sum rows with the same entries:
from collections import Counter
import csv

data = Counter()

with open('input.csv') as f_input:
    csv_input = csv.reader(f_input)
    header = next(csv_input)
    for row in csv_input:
        data[tuple(row[:3])] += int(row[3])

with open('output.csv', 'w', newline='') as f_output:
    csv_output = csv.writer(f_output)
    csv_output.writerow(header)
    for key, value in data.items():
        csv_output.writerow([*key, value])
Giving the output:
ID,Location,Date,Value
1,Loc1,2022-01-27,9
1,Loc1,2022-01-28,7
1,Loc2,2022-01-29,8
2,Loc1,2022-01-27,11
2,Loc2,2022-01-28,4
2,Loc2,2022-01-29,6
3,Loc1,2022-01-28,18
3,Loc2,2022-01-29,1
This avoids storing the input CSV in memory, only the output CSV data.
If this is also too large, a slight variation would be to output the data whenever the ID column changes. This does, though, assume the input is in ID order:
from collections import Counter
import csv

def write_id(csv_output, data):
    for key, value in data.items():
        csv_output.writerow([*key, value])
    data.clear()

data = Counter()
current_id = None

with open('input.csv') as f_input, open('output.csv', 'w', newline='') as f_output:
    csv_input = csv.reader(f_input)
    csv_output = csv.writer(f_output)
    header = next(csv_input)
    csv_output.writerow(header)
    for row in csv_input:
        if current_id and row[0] != current_id:
            write_id(csv_output, data)
        data[tuple(row[:3])] += int(row[3])
        current_id = row[0]
    write_id(csv_output, data)
For the given example, this would give the same output.
Have you tried:
output = []
for key, group in df.groupby([columns]):
    output.append((key, group['a'].sum()))
pd.DataFrame(output).to_csv("....csv")
source:
https://stackoverflow.com/a/54244289/7132906
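For this specific question, a minimal sketch of that idea applied directly might look like this (the file names are assumptions; note it still reads the whole CSV into memory, so it only helps if the raw rows fit):

import pandas as pd

# Sketch: direct groupby-sum written straight back out to CSV.
# Only viable when the full input fits in memory.
df = pd.read_csv('input.csv')
df.groupby(['ID', 'Location', 'Date'], as_index=False)['Value'].sum() \
  .to_csv('output.csv', index=False)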
There are a number of answers already that may suffice: @MartinEvans and @Corralien both recommend breaking up/chunking the input and output.
I'm especially curious whether @MartinEvans's answer works within your memory constraints: it's the simplest and still-correct solution so far (as I see it).
If neither of those works, I think you'll be faced with the question:
What makes a chunk with all the ID/Loc/Date groups I need to count contained in that chunk, so that no group crosses over a chunk boundary and gets counted multiple times (ending up with smaller sub-sums instead of a single, true sum)?
In a comment on the OP you said the input was sorted by "week number". I think this is the single deciding factor for when you have all the counts you'll get for a group of ID/Loc/Date. As the reader crosses week-group boundaries, it knows it's "safe" to stop counting any of the groups encountered so far and can flush those counts to disk (to avoid holding too many counts in memory).
This solution relies on the pre-sortedness of your input CSV. Though if your input is a bit out of sorts, you could run this, test for duplicate groups, re-sort, and re-run it (I see this problem as making a big, memory-constrained reducer):
import csv
from collections import Counter
from datetime import datetime

# Get the data out...
out_csv = open('output.csv', 'w', newline='')
writer = csv.writer(out_csv)

def write_row(row):
    global writer
    writer.writerow(row)

# Don't let the counter get too big (for memory)
def flush_counter(counter):
    for key, sum_ in counter.items():
        id_, loc, date = key
        write_row([id_, loc, date, sum_])

# You said "already grouped by week-number", so:
# - read and sum your input CSV in chunks of "week (number) groups"
# - once the reader reads past a week-group, it concludes the week-group is
#   finished and flushes the counts for that week-group
last_wk_group = None
counter = Counter()

# Open input
with open('input.csv', newline='') as f:
    reader = csv.reader(f)

    # Copy header
    header = next(reader)
    write_row(header)

    for row in reader:
        # Get "base" values
        id_, loc, date = row[0:3]
        value = int(row[3])

        # 2022-01-27 -> 2022-04
        wk_group = datetime.strptime(date, r'%Y-%m-%d').strftime(r'%Y-%U')

        # Decide if the last week-group has passed
        if wk_group != last_wk_group:
            flush_counter(counter)
            counter = Counter()
            last_wk_group = wk_group

        # Count/sum this week-group's rows
        key = (id_, loc, date)
        counter[key] += value

# Flush the remaining week-group counts
flush_counter(counter)
As a basic test, I moved the first row of your sample input to the last row, like @Corralien was asking:
ID,Location,Date,Value
1,Loc1,2022-01-27,5
1,Loc1,2022-01-28,7
1,Loc2,2022-01-29,8
2,Loc1,2022-01-27,11
2,Loc2,2022-01-28,4
2,Loc2,2022-01-29,6
3,Loc1,2022-01-28,9
3,Loc1,2022-01-28,9
3,Loc2,2022-01-29,1
1,Loc1,2022-01-27,4
and I still get the correct output (even in the correct order, because 1,Loc1,2022-01-27 appeared first in the input):
ID,Location,Date,Value
1,Loc1,2022-01-27,9
1,Loc1,2022-01-28,7
1,Loc2,2022-01-29,8
2,Loc1,2022-01-27,11
2,Loc2,2022-01-28,4
2,Loc2,2022-01-29,6
3,Loc1,2022-01-28,18
3,Loc2,2022-01-29,1

Can't Get CSV Columns to Print Properly

I'm new to programming and trying to supplement my learning by doing some online tutorials. Today, I started looking at working with CSV files using a tutorial that seemed easy enough to follow, but I've run into what amounts to an immaterial problem, and it's frustrating me that I can't figure it out. I've spent around two hours Googling and testing things, but I'm just not savvy enough to know what to try next. Help, please!
Here's the code in question:
# importing the csv module
import csv

# csv filename
filename = r'C:\Users\XXX\Documents\AAPL.csv'

# initialize the titles and row list
fields = []
rows = []

# read the csv file
with open(filename, 'r') as csvfile:
    # create the csv reader object
    csvreader = csv.reader(csvfile)
    # extract field names through the first row
    fields = next(csvreader)
    # extract each data row one by one
    for row in csvreader:
        rows.append(row)
    # get total number of rows
    print("total no. of rows: %d" % (csvreader.line_num))

# print the field names
print("Field names are: " + ", ".join(field for field in fields))

# print the first 5 rows of data
print("\nFirst 5 rows are:\n")
for row in rows[:5]:
    # parse each column of a row
    for col in row:
        print("%10s" % col),
    print("\n")
The tutorial was actually written for Python 2.X, so I found the updated formatting for 3.6 and changed that last statement to be:
for col in row:
    print('{:>10}'.format(col))
print("\n")
Either way it's written, the results come out in this format:
First 5 rows are:
2013-09-18
66.168571
66.621429
65.808571
66.382858
60.492519
114215500
...
instead of the expected columnar format shown on the tutorial.
I thought I had finally found the solution when I read somewhere that you needed the formatting for each item, so I tried:
for col in row:
    print('{:>10} {:>10} {:>10} {:>10} {:>10} {:>10} {:>10}'.format(*col))
print("\n")
so that the formatting was there for each column; however, that seems to create a column for each letter in the field, e.g.:
2 0 1 3 - 0 9
The CSV is just a file of AAPL's stock prices--here are the first 9 rows of data if you want to create a CSV for testing:
Date,Open,High,Low,Close,Adj Close,Volume
2013-09-18,66.168571,66.621429,65.808571,66.382858,60.492519,114215500
2013-09-19,67.242859,67.975716,67.035713,67.471428,61.484497,101135300
2013-09-20,68.285713,68.364288,66.571426,66.772858,60.847912,174825700
2013-09-23,70.871429,70.987144,68.942856,70.091431,63.872025,190526700
2013-09-24,70.697144,70.781425,69.688568,69.871429,63.671543,91086100
2013-09-25,69.885712,69.948570,68.775711,68.790001,62.686062,79239300
2013-09-26,69.428574,69.794289,69.128571,69.459999,63.296616,59305400
2013-09-27,69.111427,69.238571,68.674286,68.964287,62.844891,57010100
2013-09-30,68.178574,68.808571,67.772858,68.107140,62.063782,65039100
# importing csv module
import csv

# csv file name
filename = r'C:\Users\XXX\Documents\AAPL.csv'

# initialize the titles and row list
fields = []
rows = []

# read the csv file
with open(filename, 'r') as csvfile:
    # create the csv reader object
    csvreader = csv.reader(csvfile)
    # extract field names through the first row
    fields = next(csvreader)
    # extract each data row one by one
    for row in csvreader:
        rows.append(row)
    # get total number of rows
    print("total no. of rows: %d" % (csvreader.line_num))

# print the field names
print("Field names are: " + ", ".join(field for field in fields))

# print the first 5 rows of data
print("\nFirst 5 rows are:\n")
for row in rows[:5]:
    # parse each column of a row
    for col in row:
        print("%10s" % col, end=',')
    print("\n")
You need to replace
print("%10s"%col), with print("%10s"%col,end=',')
Krishnaa208's answer didn't quite give me the right format. print("%10s"%col,end=',') gave a table that included the comma and each field was surrounded by quotes. But it did point me in the right direction, which was:
# print the first 5 rows of data
print("\nFirst 5 rows are:\n")
for row in rows[:5]:
#parse each column of a row
for col in row:
print('{:>12}'.format(col), end = '')
print("\n")
and my results were:
First 5 rows are:
2013-09-18 66.168571 66.621429 65.808571 66.382858 60.492519 114215500
2013-09-19 67.242859 67.975716 67.035713 67.471428 61.484497 101135300
2013-09-20 68.285713 68.364288 66.571426 66.772858 60.847912 174825700
2013-09-23 70.871429 70.987144 68.942856 70.091431 63.872025 190526700
2013-09-24 70.697144 70.781425 69.688568 69.871429 63.671543 91086100
({:>10} was a little too close together, since my CSV had the prices down to six decimal places.)
Thanks for the answer, though. It really did help!
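As an aside, a minimal sketch of the same columnar printing with one f-string per row (not from the original answers; it reuses the rows list built earlier):

# Sketch: format each row in a single print call, 12 characters per column.
for row in rows[:5]:
    print(''.join(f'{col:>12}' for col in row))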

Adding a Python list into the middle of an already populated CSV

I am generating a list of phone numbers from a registry into a Python list. I would then like to iterate through this list and add it as a new column in the middle of an already existing CSV file. The column should be at index 9 in the CSV. Below is the code I have attempted to write, but I get a "list index out of range" error.
import csv

with open('input.csv', 'r') as csvinput:
    with open('output.csv', 'w') as csvoutput:
        writer = csv.writer(csvoutput)
        i = 0
        for row in csv.reader(csvinput):
            row[9] = writer.writerow(phone_list[i])
            i += 1
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-10-74ca13edafc8> in <module>()
      6 i=0
      7 for row in csv.reader(csvinput):
----> 8     row[9]=writer.writerow(phone_list[i])
      9     i+=1

IndexError: list index out of range
Any and all help is appreciated.
Your code should almost work. You can use zip() to iterate over the input file and the list of phones at the same time, instead of using the i variable as a numerical index:
for row, phone in zip(csv.reader(csvinput), phone_list):
    row[9] = phone
    writer.writerow(row)
Now, if you're still getting that IndexError, it means your csv file has some line that doesn't contain 10 columns in the first place (index 9 is the 10th column, since indexes start at 0). Double-check your csv file. Try this testing code to check:
for row, phone in zip(csv.reader(csvinput), phone_list):
    if len(row) < 10:
        print('FOUND A ROW WITH {} COLUMNS: '.format(len(row)), row)
    else:
        row[9] = phone
        writer.writerow(row)
Another solution is to pad every row that has fewer than 10 columns with empty columns up to 10:
for row, phone in zip(csv.reader(csvinput), phone_list):
    row.extend([''] * (10 - len(row)))
    row[9] = phone
    writer.writerow(row)
In your for loop, add the new value to the row list, then write that row to the csv.
for row in reader:
    row.insert(9, phone_list[i])
    writer.writerow(row)
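For reference, a self-contained sketch of that approach (the file names and phone_list contents are placeholders, and zip() stands in for the manual i counter):

import csv

phone_list = ['555-0100', '555-0101', '555-0102']  # placeholder data

with open('input.csv', newline='') as csvinput, \
        open('output.csv', 'w', newline='') as csvoutput:
    writer = csv.writer(csvoutput)
    for row, phone in zip(csv.reader(csvinput), phone_list):
        row.insert(9, phone)  # insert the phone number as the 10th column
        writer.writerow(row)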
If you are using Anaconda, or have some knowledge of pandas:
import pandas as pd  # import the pandas library

input_df = pd.read_csv("input.csv")    # the csv the column comes from
output_df = pd.read_csv("output.csv")  # the csv the column goes into

# use this if you know the name of the column
input_array = list(input_df['Name of column'])
# or use this if you know the index of the column (indexes start from 0);
# here assuming the input csv has only 1 column
input_array = list(input_df.iloc[:, 0])

# insert the array into output_df as the 10th column (loc=9 in pandas);
# DataFrame.insert modifies the frame in place
output_df.insert(loc=9, column='New_column_name', value=input_array)

# index=False means no extra index column is written
output_df.to_csv("path and output.csv", index=False)

Pulling the Next Value Under the Same Column Header

I am using Python's csv module to read ".csv" files and parse them out to MySQL insert statements. In order to maintain syntax for the statements I need to determine the type of the values listed under each column header. However, I have run into a problem as some of the rows start with a null value.
How can I use the csv module to return the next value under the same column until the value returned is not null? This does not have to be accomplished with the csv module; I am open to all solutions. After looking through the documentation, I am not sure the csv module is capable of doing what I need. I was thinking something along these lines:
if rowValue == '':
    rowValue = nextRowValue(row)
Obviously the next() method simply returns the next row of the csv rather than the next value under the same column like I want, and the nextRowValue() function does not exist. I am just demonstrating the idea.
Edit: Just to add some context, here is an example of what I am doing and the problems I am running into.
If the table is as follows:
ID  Date  Time   Voltage  Current  Watts
0   7/2   11:15  0        0
0   7/2   11:15  0        0
0   7/2   11:15  380      1        380
And here is a very slimmed-down version of the code that I am using to read the table, get the column headers, and determine the type of the values from the first row. It then puts them into separate lists and uses deque to add them to insert statements in a separate function. Not all of the code is shown and I might have left out some crucial parts, but here is an example:
import csv, os
from collections import deque

def findType(rowValue):
    if rowValue == '':
        rowValue =
    if '.' in rowValue:
        try:
            rowValue = type(float(rowValue))
        except ValueError:
            pass
    else:
        try:
            rowValue = type(int(rowValue))
        except:
            rowValue = type(str(rowValue))
    return rowValue

def createTable():
    inputPath = 'C:/Users/user/Desktop/test_input/'
    outputPath = 'C:/Users/user/Desktop/test_output/'
    for file in os.listdir(inputPath):
        if file.endswith('.csv'):
            with open(inputPath + file) as inFile:
                with open(outputPath + file[:-4] + '.sql', 'w') as outFile:
                    csvFile = csv.reader(inFile)
                    columnHeader = next(csvFile)
                    firstRow = next(csvFile)
                    cList = deque(columnHeader)
                    rList = deque(firstRow)
                    hList = []
                    for value in firstRow:
                        valueType = findType(value)
                        if valueType == str:
                            try:
                                val = '`' + cList.popleft() + 'varchar(255)'
                                hList.append(val)
                            except IndexError:
                                pass
etc.
And so forth for the rest of the value types returned from the findType function. The problem is that when adding the values to rList using deque, it skips over null values, so the column-header list might contain 6 items, for example, while the row list contains only 5, and they would not line up.
A somewhat drawn-out solution would be to scan each row for null values until one was found, using something like this:
for value in firstRow:
    if value == '':
        firstRow = next(csvFile)
and continuing this loop until a row is found with no null values. However, this seems like a drawn-out solution that would slow down the program, hence why I am looking for a different one.
Rather than pull the next value from the column as the title suggests, I found it easier to just skip rows that contain any null values. There are two different ways to do this. One is to use a loop that scans each row and sees if it contains a null value, jumping to the next row until one is found that contains no null values. For example:
tempRow = next(csvFile)
for value in tempRow:
    if value == '':
        tempRow = next(csvFile)
    else:
        row = tempRow
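For completeness, here is a sketch of what the title literally asks for: scanning down a single column until a non-empty value appears (the file name and column index are assumptions):

import csv

def first_non_empty_in_column(filename, col):
    """Return the first non-empty value found in column `col`, or None."""
    with open(filename, newline='') as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row
        for row in reader:
            if col < len(row) and row[col] != '':
                return row[col]
    return None

# e.g. the first usable Voltage value (column index 3 in the table above)
print(first_non_empty_in_column('data.csv', 3))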

Python Code Not Listing First Line in CSV File

I was working on an IMDB movie list just to list movie names, links, and my ratings. Here is the code:
import csv

r_list = open('ratings.csv')
rd = csv.reader(r_list, delimiter=',', quotechar='"')
movies = {}
for row in rd:
    movies[row[5]] = [row[0], row[8]]
print(len(movies))
The output is 500, but the actual number is 501; it is not counting the first line. Yet when I do the same thing for a list that contains 6 lines in total, it counts the first line and returns 6.
Why?
Because you are using a dictionary, and some of your row[5] values are duplicates: each duplicate replaces the previous entry, shortening your result by one for each duplicate key. You cannot have two identical keys in a dictionary. Python handles that silently by overwriting (replacing) the value of the key you had with the new value.
e.g.
data = [('rambo', 1995), ('lethal weapon', 1992), ('rambo', 1980)]
movies = {}
for row in data:
    movies[row[0]] = row[1]

print(len(data))        # -> 3
print(len(movies))      # -> 2
print(movies['rambo'])  # -> 1980
The solution is not to use a dictionary if you don't want duplicate keys to replace each other.
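A sketch of one such fix, assuming the goal is simply to count every line: collect the rows in a list, which keeps duplicates, instead of a dict:

import csv

movies = []  # a list keeps duplicate titles, unlike a dict
with open('ratings.csv', newline='') as f:
    for row in csv.reader(f):
        movies.append((row[5], row[0], row[8]))
print(len(movies))  # counts every line, duplicates included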
