Adding in-between column in csv Python - python

I work with csv files and it seems python provides a lot of flexibility for handling csv files.
I found several questions linked to my issue, but I cannot figure out how to combine the solutions effectively...
My starting point CSV file looks like this (note there is only 1 column in the 'header' row):
FILE1
Z1 20 44 3
Z1 21 44 5
Z1 21 44 8
Z1 22 45 10
What I want to do is add a column in between cols #1 and #2, and keep the rest unchanged. This new column has the same # rows as the other columns, but contains the same integer for all entries (10 in my example below). Another important point is I don't really know the number of rows, so I might have to count the # rows somehow first (?) My output should then look like:
FILE1
Z1 10 20 44 3
Z1 10 21 44 5
Z1 10 21 44 8
Z1 10 22 45 10
Is there a simple solution to this?

I think the easiest solution would be to just read each row and write a corresponding new row (with the inserted value) in a new file:
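import csv
with open('input.csv', 'r', newline='') as infile:
    with open('output.csv', 'w', newline='') as outfile:
        reader = csv.reader(infile, delimiter=' ')
        writer = csv.writer(outfile, delimiter=' ')
        # Copy the one-column header row through unchanged
        writer.writerow(next(reader))
        for row in reader:
            # Keep the first column, insert 10, then append the remaining columns
            new_row = [row[0], 10]
            new_row += row[1:]
            writer.writerow(new_row)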
This might not make sense if you're not doing anything else with the data besides this bulk processing, though. You'd want to look into csv libraries if that were the case.

Use pandas to import the csv file as a DataFrame named df and then use df.insert(idx, col_name, value); where idx is the index of the newly created column, col_name is the name you assign to this column and value is the list of values you wish to assign to the column. See below for illustration:
import pandas as pd
prices = pd.read_csv('C:\\Users\\abdou.seck\\Documents\\prices.csv')
prices
## Output
  Shares  Number  Prices
0    AAP     100  100.67
1   MSFT      50   56.50
2    SAN     200   19.18
3   GOOG     300  500.34
prices.insert(3, 'Total', prices['Number']*prices['Prices'])
prices
## Output:
  Shares  Number  Prices   Total
0    AAP     100  100.67   10067
1   MSFT      50   56.50    2825
2    SAN     200   19.18    3836
3   GOOG     300  500.34  150102
Hope this helps.

Read the header first, then initialize the reader, write the header first, then initialize the writer:
import csv
# Python 3: open CSV files in text mode with newline="" (not "rb"/"wb")
with open("in.csv", newline="") as in_file:
    header = in_file.readline()
    csv_file_in = csv.reader(in_file, delimiter=" ")
    with open("out.csv", "w", newline="") as out_file:
        out_file.write(header)
        csv_file_out = csv.writer(out_file, delimiter=" ")
        for row in csv_file_in:
            csv_file_out.writerow([row[0], 10] + row[1:])
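The header-preserving approach above can also be wrapped in a small function. This is only a sketch: the name insert_constant_column and its parameters are made up for illustration, and it assumes the first line is a header to copy through unchanged and that fields are separated by single spaces. Because rows are streamed one at a time, the number of rows never has to be known in advance.

```python
import csv
import io


def insert_constant_column(src, dst, value, index=1, delimiter=" "):
    """Copy src to dst, inserting `value` at position `index` in every data row.

    The first line is treated as a header and copied through unchanged.
    """
    dst.write(src.readline())  # header line, written as-is
    reader = csv.reader(src, delimiter=delimiter)
    writer = csv.writer(dst, delimiter=delimiter, lineterminator="\n")
    for row in reader:
        if not row:  # skip blank lines
            continue
        writer.writerow(row[:index] + [value] + row[index:])


# In-memory files stand in for real ones here; open file objects work the same way.
sample = io.StringIO("FILE1\nZ1 20 44 3\nZ1 21 44 5\nZ1 21 44 8\nZ1 22 45 10\n")
result = io.StringIO()
insert_constant_column(sample, result, 10)
print(result.getvalue())
```

With real files, the two arguments would be the objects returned by open(..., newline="") for the input and output paths.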

Pull the data into a list, insert data for each row into the desired spot, and re-write the data.
import csv
data_to_add = 10
new_column_index = 1  # 0 based index
with open('FILE1.csv', 'r', newline='') as f:
    csv_r = csv.reader(f, delimiter=' ')
    data = [line for line in csv_r]
for row in data[1:]:  # skip the one-column header row
    row.insert(new_column_index, data_to_add)
with open('FILE1.csv', 'w', newline='') as f:
    csv_w = csv.writer(f, delimiter=' ')
    for row in data:
        csv_w.writerow(row)

Here's how I might do it with pandas:
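import pandas as pd
with open("in.csv") as input_file:
    header = input_file.readline()
    # The header line is already consumed, so every remaining line is data
    data = pd.read_csv(input_file, sep=" ", header=None)
data.insert(1, "New Data", 10)
with open("out.csv", "w") as output_file:
    output_file.write(header)
    # Write with the same space separator as the input
    data.to_csv(output_file, sep=" ", index=False, header=False)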

Related

Data loss while extracting the rows from large csv file

This is in continuation from my previous question. I have 2 files, file1.csv and a large csv called master_file.csv. They have several columns and have a common column name called EMP_Code.
File 1 example:
EMP_name EMP_Code EMP_dept
b f367 abc
a c264 xyz
c d264 abc
master_file example:
EMP_name EMP_age EMP_Service EMP_Code EMP_dept
a 30 6 c264 xyz
b 29 3 f367 abc
r 27 1 g364 lmn
d 45 10 c264 abc
t 50 25 t453 lmn
I want to extract the matching rows from master_file using all the EMP_Code values in file1. I tried the following code, but I am losing a lot of data. I cannot read the complete master CSV file because it is around 20 GB with millions of rows, and I run out of memory. I want to read master_file in chunks, extract the complete rows for each EMP_Code present in file1, and save them to a new file, Employee_full_data.
import csv
import pandas as pd
df = pd.read_csv(r"master_file.csv")
li = ['c264', 'f367']
full_data = df[df.EMP_Code.isin(li)]
full_data.to_csv(r"Employee_full_data.csv", index=False)
I also tried the following code. I get an empty file whenever I filter on the EMP_Code column, but it works fine when I use columns like EMP_name or EMP_dept. I want to extract the data using EMP_Code.
import csv
import pandas as pd
df = pd.read_csv(r"file1.csv")
list_codes = list(df.EMP_Code)
selected_rows = []
with open(r"master_file.csv") as csv_file:
    reader = csv.DictReader(csv_file)
    for row in reader:
        if row['EMP_Code'] in list_codes:
            selected_rows.append(row)
article_usage = pd.DataFrame.from_records(selected_rows)
article_usage.to_csv(r"Employee_full_data.csv", index=False)
Is there any other way that I can extract the data without loss? I have heard about join and reading data in chunks but not sure how to use it here. Any help is appreciated
I ran the code from your 2nd example (using csv.DictReader) on your small example and it worked. I'm guessing your problem might have to do with the real-life scale of master_file as you've alluded to.
The problem might be that despite using csv.DictReader to stream information in, you're still using a Pandas dataframe to aggregate everything before writing it out, and maybe the output is breaking your memory budget.
If that's true, then use csv.DictWriter to stream out. The only tricky bit is getting the writer set up because it needs to know the fieldnames, which can't be known till we've read the first row, so we'll set up the writer in the first iteration of the read loop.
(I've removed the with open(... contexts because I think they add too much indentation)
import csv
import pandas as pd
df = pd.read_csv(r"file1.csv")
list_codes = list(df.EMP_Code)
f_in = open(r"master_file.csv", newline="")
reader = csv.DictReader(f_in)
f_out = open(r"output.csv", "w", newline="")
init_writer = True
for row in reader:
    if init_writer:
        # The keys of the first row give the field names for the writer
        writer = csv.DictWriter(f_out, fieldnames=row)
        writer.writeheader()
        init_writer = False
    if row["EMP_Code"] in list_codes:
        writer.writerow(row)
f_out.close()
f_in.close()
EMP_name EMP_age EMP_Service EMP_Code EMP_dept
a 30 6 c264 xyz
b 29 3 f367 abc
d 45 10 c264 abc
And if you'd like to get rid of Pandas altogether:
list_codes = set()
with open(r"file1.csv", newline="") as f:
    reader = csv.DictReader(f)
    for row in reader:
        list_codes.add(row["EMP_Code"])
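Putting the two pieces together, a pandas-free version could look roughly like the sketch below. The function name filter_rows and its parameters are hypothetical, and it assumes both files are comma-separated with a header row. Only the set of codes is held in memory; the master file is streamed row by row.

```python
import csv
import io


def filter_rows(codes_file, master_file, out_file, key="EMP_Code"):
    """Stream master_file to out_file, keeping rows whose key is in codes_file."""
    wanted = {row[key] for row in csv.DictReader(codes_file)}
    reader = csv.DictReader(master_file)
    # reader.fieldnames reads the header line, so the writer can be set up at once
    writer = csv.DictWriter(out_file, fieldnames=reader.fieldnames,
                            lineterminator="\n")
    writer.writeheader()
    kept = 0
    for row in reader:
        if row[key] in wanted:
            writer.writerow(row)
            kept += 1
    return kept


# In-memory stand-ins for file1.csv and master_file.csv
codes = io.StringIO("EMP_name,EMP_Code,EMP_dept\nb,f367,abc\na,c264,xyz\nc,d264,abc\n")
master = io.StringIO(
    "EMP_name,EMP_age,EMP_Service,EMP_Code,EMP_dept\n"
    "a,30,6,c264,xyz\n"
    "b,29,3,f367,abc\n"
    "r,27,1,g364,lmn\n"
    "d,45,10,c264,abc\n"
    "t,50,25,t453,lmn\n"
)
out = io.StringIO()
kept = filter_rows(codes, master, out)
print(kept)
print(out.getvalue())
```

Using a set rather than a list for the codes keeps each membership test cheap, which matters when the master file has millions of rows.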
You just have to pass chunksize=<SOME INTEGER> to pandas' .read_csv function (see the pandas documentation for read_csv).
If you pass a chunksize=2, you will read the file into dataframes of 2 rows. Or... more accurately, it will read 2 rows of the csv into a dataframe. You can then apply your filter to that 2-row dataframe and "accumulate" that into another dataframe. The next iteration will read the next two rows, which you can subsequently filter... Lather, rinse and repeat:
import pandas as pd
li = ['c264', 'f367']
result_df = pd.DataFrame()
with pd.read_csv("master_file.csv", chunksize=2) as reader:
    for chunk_df in reader:
        filtered_df = chunk_df[chunk_df.EMP_Code.isin(li)]
        result_df = pd.concat([result_df, filtered_df])
print(result_df)
# Outputs:
#   EMP_name  EMP_age  EMP_Service EMP_Code EMP_dept
# 0        a       30            6     c264      xyz
# 1        b       29            3     f367      abc
# 3        d       45           10     c264      abc
One way to handle this type of file read/write task is to use a generator and read the data in chunks or portions that you can handle (within memory or other constraints).
def read_line():
    with open('master_file.csv', 'r') as fid:
        # Stops at end of file (and also at the first blank line)
        while (line := fid.readline().split()):
            yield line
On each call, this simple generator gives one new line. You can then simply iterate over it to do whatever filtering you are interested in and build your new dataframe.
r_line = read_line()
for l in r_line:
    print(l)
You could modify the generator to, for example, parse and return a list, or return multiple lines, etc.
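As one possible modification along those lines, the generator could parse each line with the csv module and hand back several rows at a time. This is a sketch only: read_batches and batch_size are hypothetical names, and comma-separated input is assumed.

```python
import csv
import io
from itertools import islice


def read_batches(file_obj, batch_size):
    """Yield lists of up to batch_size parsed rows from an open CSV file."""
    reader = csv.reader(file_obj)
    while True:
        # islice pulls at most batch_size rows without reading the whole file
        batch = list(islice(reader, batch_size))
        if not batch:
            return
        yield batch


sample = io.StringIO("a,1\nb,2\nc,3\nd,4\ne,5\n")
batches = list(read_batches(sample, 2))
print(batches)
```

Each batch can then be filtered on its own, so memory use is bounded by batch_size rather than by the size of the file.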

The total amount of precipitation for the year

How do I write a program in Python that reads the PRCP column and sums all the values in it? We are using import csv and from pathlib import Path.
The answer should be 1.
Example of info:
STATION NAME, DATE, PRCP, TMAX, TMIN
USW00023183PHOENIX AIRPORT, 1/1/2020, 1, 60, 40
USW00023183PHOENIX AIRPORT, 1/2/2020, 0, 64, 41
Tried:
prcps = 0
for item in prcps:
    month=item['DATE']
    prcps =(item["PRCP"])
    if prcps>0[month]:
        sum (prcps)
perp = 0
for PRCP in reader:
    month = item['DATE']
    perp = (item["PRCP"])
    perp += perp
You could use Python's CSV library to read the CSV lines in. Each row is parsed as strings. So the PRCP value will first need to be converted into an integer using int().
Using a csv.reader() which converts each row into a list:
import csv
with open('input.csv') as f_input:
    csv_input = csv.reader(f_input)
    header = next(csv_input)
    print(sum(int(row[2]) for row in csv_input))
This first skips the header row and then extracts the third string value from each row, converts it into an integer and sums them.
Using a csv.DictReader() which assumes the first row is a header and then reads each row as a dictionary:
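import csv
with open('input.csv') as f_input:
    # The sample has a space after each comma; skipinitialspace=True strips it
    # so the column is named "PRCP" rather than " PRCP"
    csv_input = csv.DictReader(f_input, skipinitialspace=True)
    print(sum(int(row["PRCP"]) for row in csv_input))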
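For a version that can be tried without a file on disk, here is a minimal sketch. The function name total_precipitation is made up, and two things are assumptions that go beyond the question: PRCP values may be fractional (hence float), and rows with an empty PRCP field should be skipped.

```python
import csv


def total_precipitation(lines):
    """Sum the PRCP column; rows with an empty PRCP field are skipped."""
    # skipinitialspace=True handles the space that follows each comma
    reader = csv.DictReader(lines, skipinitialspace=True)
    return sum(float(row["PRCP"]) for row in reader if row["PRCP"])


sample = [
    "STATION NAME, DATE, PRCP, TMAX, TMIN",
    "USW00023183PHOENIX AIRPORT, 1/1/2020, 1, 60, 40",
    "USW00023183PHOENIX AIRPORT, 1/2/2020, 0, 64, 41",
]
total = total_precipitation(sample)
print(total)
```

Since csv.DictReader accepts any iterable of lines, the same function should also work on a file object opened through pathlib, for example with Path(...).open(newline="").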

How to group columns and sum them, in a large CSV?

I have a large CSV (hundreds of millions of rows) and I need to sum the Value column based on the grouping of the ID, Location, and Date columns.
My CSV is similar to:
    ID Location        Date  Value
1    1     Loc1  2022-01-27      5
2    1     Loc1  2022-01-27      4
3    1     Loc1  2022-01-28      7
4    1     Loc2  2022-01-29      8
5    2     Loc1  2022-01-27     11
6    2     Loc2  2022-01-28      4
7    2     Loc2  2022-01-29      6
8    3     Loc1  2022-01-28      9
9    3     Loc1  2022-01-28      9
10   3     Loc2  2022-01-29      1
{ID: 1, Location: Loc1, Date: 2022-01-27} is one such group, and its sub values 5 and 4 should be summed to 9
{ID: 3, Location: Loc1, Date: 2022-01-28} is another group and its sum should be 18
Here's what that sample input should look like, processed/summed, and written to a new CSV:
ID Location Date Value
1 Loc1 2022-01-27 9
1 Loc1 2022-01-28 7
1 Loc2 2022-01-29 8
2 Loc1 2022-01-27 11
2 Loc2 2022-01-28 4
2 Loc2 2022-01-29 6
3 Loc1 2022-01-28 18
3 Loc2 2022-01-29 1
I know using df.groupby([columns]).sum() would give the desired result, but the CSV is so big I keep getting memory errors. I've tried looking at other ways to read/manipulate CSV data but have still not been successful, so if anyone knows a way I can do this in python without maxing out my memory that would be great!
NB: I know there is a unnamed first column in my initial csv, this is irrelevant and doesn't need to be in the outputted, but doesn't matter if it is :)
The appropriate answer is probably to use Dask, but you can do it with Pandas and chunks. The last_row variable holds the last row of the previous chunk, in case the first row of the current chunk has the same ID, Location and Date.
import pandas as pd
chunksize = 4  # Number of rows
last_row = pd.DataFrame()  # Last row of the previous chunk
with open('data.csv') as reader, open('output.csv', 'w') as writer:
    # Write headers
    writer.write(reader.readline())
    reader.seek(0)
    for chunk in pd.read_csv(reader, chunksize=chunksize):
        df = pd.concat([last_row, chunk])
        df = df.groupby(['ID', 'Location', 'Date'], as_index=False)['Value'].sum()
        df, last_row = df.iloc[:-1], df.iloc[-1:]
        df.to_csv(writer, header=False, index=False)
    # Don't forget the last row!
    last_row.to_csv(writer, header=False, index=False)
Content of output.csv:
ID,Location,Date,Value
1,Loc1,2022-01-27,9
1,Loc1,2022-01-28,7
1,Loc2,2022-01-29,8
2,Loc1,2022-01-27,11
2,Loc2,2022-01-28,4
2,Loc2,2022-01-29,6
3,Loc1,2022-01-28,18
3,Loc2,2022-01-29,1
If the lines to be concatenated are consecutive, the good old csv module lets you process huge files one line at a time, hence with a minimal memory footprint.
Here you could use:
import csv
with open('input.csv') as fd, open('output.csv', 'w', newline='') as fdout:
    rd, wr = csv.reader(fd), csv.writer(fdout)
    _ = wr.writerow(next(rd))  # header line
    old = [None]*4
    for row in rd:
        row[3] = int(row[3])  # convert value field to integer
        if row[:3] == old[:3]:
            old[3] += row[3]  # concatenate values of similar rows
        else:
            if old[0]:  # and write the concatenated row
                _ = wr.writerow(old)
            old = row
    if old[0]:  # do not forget the last row...
        _ = wr.writerow(old)
With the shown input data, it gives as expected:
ID,Location,Date,Value
1,Loc1,2022-01-27,9
1,Loc1,2022-01-28,7
1,Loc2,2022-01-29,8
2,Loc1,2022-01-27,11
2,Loc2,2022-01-28,4
2,Loc2,2022-01-29,6
3,Loc1,2022-01-28,18
3,Loc2,2022-01-29,1
Not as clean and neat as the Pandas code, but it should process files larger than the available memory without any problem.
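The same consecutive-rows idea can also be written with itertools.groupby, which groups adjacent rows sharing a key. This is an alternative sketch, not the code from the answer above: sum_consecutive_groups and its parameters are hypothetical names, and like the original it only merges rows that sit next to each other in the input.

```python
import csv
import io
from itertools import groupby


def sum_consecutive_groups(src, dst, key_cols=3, value_col=3):
    """Sum value_col over runs of adjacent rows that share the first key_cols fields."""
    reader = csv.reader(src)
    writer = csv.writer(dst, lineterminator="\n")
    writer.writerow(next(reader))  # header line
    for key, rows in groupby(reader, key=lambda r: r[:key_cols]):
        writer.writerow(key + [sum(int(r[value_col]) for r in rows)])


sample = io.StringIO(
    "ID,Location,Date,Value\n"
    "1,Loc1,2022-01-27,5\n"
    "1,Loc1,2022-01-27,4\n"
    "1,Loc1,2022-01-28,7\n"
    "1,Loc2,2022-01-29,8\n"
    "2,Loc1,2022-01-27,11\n"
    "2,Loc2,2022-01-28,4\n"
    "2,Loc2,2022-01-29,6\n"
    "3,Loc1,2022-01-28,9\n"
    "3,Loc1,2022-01-28,9\n"
    "3,Loc2,2022-01-29,1\n"
)
result = io.StringIO()
sum_consecutive_groups(sample, result)
print(result.getvalue())
```

Only one group of rows is held at a time, so memory use stays small regardless of the file size.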
You could use the built in csv library and build up the output line by line. A Counter can be used to combine and count rows with the same entries:
from collections import Counter
import csv
data = Counter()
with open('input.csv') as f_input:
    csv_input = csv.reader(f_input)
    header = next(csv_input)
    for row in csv_input:
        data[tuple(row[:3])] += int(row[3])
with open('output.csv', 'w', newline='') as f_output:
    csv_output = csv.writer(f_output)
    csv_output.writerow(header)
    for key, value in data.items():
        csv_output.writerow([*key, value])
Giving the output:
ID,Location,Date,Value
1,Loc1,2022-01-27,9
1,Loc1,2022-01-28,7
1,Loc2,2022-01-29,8
2,Loc1,2022-01-27,11
2,Loc2,2022-01-28,4
2,Loc2,2022-01-29,6
3,Loc1,2022-01-28,18
3,Loc2,2022-01-29,1
This avoids storing the input CSV in memory, only the output CSV data.
If this is also too large, a slight variation would be to output the data whenever the ID column changes. This does assume, though, that the input is in ID order:
from collections import Counter
import csv
def write_id(csv_output, data):
    for key, value in data.items():
        csv_output.writerow([*key, value])
    data.clear()
data = Counter()
current_id = None
with open('input.csv') as f_input, open('output.csv', 'w', newline='') as f_output:
    csv_input = csv.reader(f_input)
    csv_output = csv.writer(f_output)
    header = next(csv_input)
    csv_output.writerow(header)
    for row in csv_input:
        if current_id and row[0] != current_id:
            write_id(csv_output, data)
        data[tuple(row[:3])] += int(row[3])
        current_id = row[0]
    write_id(csv_output, data)
For the given example, this would give the same output.
Have you tried:
output = []
for key, group in df.groupby([columns]):
    output.append((key, group['a'].sum()))
pd.DataFrame(output).to_csv("....csv")
source:
https://stackoverflow.com/a/54244289/7132906
There are a number of answers already that may suffice: @MartinEvans and @Corralien both recommend breaking up/chunking the input and output.
I'm especially curious whether @MartinEvans's answer works within your memory constraints: it's the simplest and still-correct solution so far (as I see it).
If neither of those works, I think you'll be faced with the question:
What makes a chunk with all the ID/Loc/Date groups I need to count contained in that chunk, so no group crosses over a chunk and gets counted multiple times (ending up with smaller sub-sums instead of a single, true sum)?
In a comment on the OP you said the input was sorted by "week number". I think this is the single deciding factor for when you have all the counts you'll get for a group of ID/Loc/Date. As the reader crosses week-group boundaries, it'll know it's "safe" to stop counting any of the groups encountered so far, and flush those counts to disk (to avoid holding on to too many counts in memory).
This solution relies on the pre-sorted-ness of your input CSV. Though, if your input was a bit out of sorts, you could run this, test for duplicate groups, re-sort, and re-run this (I see this problem as making a big, memory-constrained reducer):
import csv
from collections import Counter
from datetime import datetime
# Get the data out...
out_csv = open('output.csv', 'w', newline='')
writer = csv.writer(out_csv)
def write_row(row):
    global writer
    writer.writerow(row)
# Don't let counter get too big (for memory)
def flush_counter(counter):
    for key, sum_ in counter.items():
        id_, loc, date = key
        write_row([id_, loc, date, sum_])
# You said "already grouped by week-number", so:
# - read and sum your input CSV in chunks of "week (number) groups"
# - once the reader reads past a week-group, it concludes week-group is finished
#   and flushes the counts for that week-group
last_wk_group = None
counter = Counter()
# Open input
with open('input.csv', newline='') as f:
    reader = csv.reader(f)
    # Copy header
    header = next(reader)
    write_row(header)
    for row in reader:
        # Get "base" values
        id_, loc, date = row[0:3]
        value = int(row[3])
        # 2022-01-27 -> 2022-04
        wk_group = datetime.strptime(date, r'%Y-%m-%d').strftime(r'%Y-%U')
        # Decide if last week-group has passed
        if wk_group != last_wk_group:
            flush_counter(counter)
            counter = Counter()
            last_wk_group = wk_group
        # Count/sum this week-groups
        key = tuple([id_, loc, date])
        counter[key] += value
# Flush remaining week-group counts
flush_counter(counter)
out_csv.close()
As a basic test, I moved the first row of your sample input to the last row, like @Corralien was asking:
ID,Location,Date,Value
1,Loc1,2022-01-27,5
1,Loc1,2022-01-28,7
1,Loc2,2022-01-29,8
2,Loc1,2022-01-27,11
2,Loc2,2022-01-28,4
2,Loc2,2022-01-29,6
3,Loc1,2022-01-28,9
3,Loc1,2022-01-28,9
3,Loc2,2022-01-29,1
1,Loc1,2022-01-27,4
and I still get the correct output (even in the correct order, because 1,Loc1,2022-01-27 appeared first in the input):
ID,Location,Date,Value
1,Loc1,2022-01-27,9
1,Loc1,2022-01-28,7
1,Loc2,2022-01-29,8
2,Loc1,2022-01-27,11
2,Loc2,2022-01-28,4
2,Loc2,2022-01-29,6
3,Loc1,2022-01-28,18
3,Loc2,2022-01-29,1
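The "test for duplicate groups" step mentioned above could be sketched as follows. The helper name find_duplicate_groups is made up for illustration, and the sketch assumes the set of distinct group keys in the output fits in memory, which may not hold for very large outputs.

```python
import csv
import io
from collections import Counter


def find_duplicate_groups(csv_file, key_cols=3):
    """Return group keys that occur on more than one row of a summed CSV."""
    reader = csv.reader(csv_file)
    next(reader)  # skip header
    seen = Counter(tuple(row[:key_cols]) for row in reader)
    return [key for key, count in seen.items() if count > 1]


# A clean result: every ID/Location/Date group appears once
clean = io.StringIO(
    "ID,Location,Date,Value\n1,Loc1,2022-01-27,9\n1,Loc1,2022-01-28,7\n"
)
# A result where one group was split into two partial sums
split = io.StringIO(
    "ID,Location,Date,Value\n1,Loc1,2022-01-27,5\n1,Loc1,2022-01-28,7\n1,Loc1,2022-01-27,4\n"
)
clean_dupes = find_duplicate_groups(clean)
split_dupes = find_duplicate_groups(split)
print(clean_dupes)
print(split_dupes)
```

An empty list suggests no group crossed a flush boundary; any key that is returned marks a group whose partial sums still need to be merged.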

how to delete unwanted rows from csv in python?

I have csv file abc.csv with two columns Name & ID and list of ids l_id. Now I want to delete all those rows from csv where ID not in l_id. I have tried the following:
l_id=[18850080, 535553, 19292162, 1728035, 1179719, 19194894, 22817838, 19997487, 19728145, 1457232, 13560402, 18855476, 7151442, 18955830, 11294262, 18506072, 1360698]
Name ID
0 2069 19277993
1 625050 19277900
2 1939496 19277793
3 2606806 19275471
4 3438546 19273652
5 4211111 7151442
6 4353024 19200001
7 5175848 11294262
8 5300858 1360698
9 5636006 535553
10 5729989 19277800
11 6045513 19277320
12 6160486 19277458
13 6540851 19276852
14 6752008 19277363
15 7643395 19997487
16 7644736 19292162
17 7712083 19292100
18 7768516 19292169
19 7809273 1360698
with open('abc.csv', 'r') as inp, open('abc_edit.csv', 'w') as out:
    writer = csv.writer(out)
    for line in inp:
        if (set(line.split()).isdisjoint(set(l_id)))==False:
            writer.writerow(line)
Use the pandas isin function:
import pandas as pd
df = pd.read_csv('csv_name')
Then keep only the rows whose ID value appears in l_id:
df = df.loc[df['ID'].isin(l_id)]
You can convert l_id to a set first, and use a generator expression in the writerows method of the CSV writer to include only the rows whose ID is in the set:
import csv
ids = set(l_id)
with open('abc.csv', 'r', newline='') as inp, open('abc_edit.csv', 'w', newline='') as out:
    reader = csv.reader(inp)
    writer = csv.writer(out)
    writer.writerow(next(reader))  # write the header unconditionally
    # The (name, id_) tuple needs its own parentheses inside the generator expression
    writer.writerows((name, id_) for name, id_ in reader if int(id_) in ids)
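A variation that looks the ID column up by name and compares values as strings, so that a blank or malformed cell does not raise in int(), might look like this. It is a sketch: keep_rows_with_ids and its parameters are hypothetical, and a comma-separated file with a Name,ID header is assumed.

```python
import csv
import io


def keep_rows_with_ids(src, dst, wanted_ids, id_col="ID"):
    """Copy rows from src to dst whose id_col value is in wanted_ids."""
    wanted = {str(i) for i in wanted_ids}  # compare as strings
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        if row[id_col].strip() in wanted:
            writer.writerow(row)


sample = io.StringIO(
    "Name,ID\n2069,19277993\n4211111,7151442\n5175848,11294262\n5636006,535553\n"
)
result = io.StringIO()
keep_rows_with_ids(sample, result, [7151442, 11294262, 535553, 1360698])
print(result.getvalue())
```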

how to add random values to the column of a csv file?

I want to append a column to a prefilled csv file with 3 million rows using Python. Then I want to fill the column with random values in the range (1, 50). Something like this:
input csv file,
awareness trip amount
25 1 30
30 2 35
output csv file,
awareness trip amount size
25 1 30 49
30 2 35 20
How can I do this?
The code I have written is as follows:
with open('2019-01-1.csv', 'r') as CSVIN:
    with open('2019-01-2.csv', 'w') as CSVOUT:
        CSVWrite = csv.writer(CSVOUT, lineterminator='\n')
        CSVRead = csv.reader(CSVIN)
        NewDict = []
        row = next(CSVRead)
        row.append('Size')
        NewDict.append(row)
        print(NewDict.append(row))
        for row in CSVRead:
            randSize = np.random.randint(1, 50)
            row.append(row[0])
            NewDict.append(row)
        CSVWrite.writerows(NewDict)
Check out this answer: Python Add string to each line in a file
I've found it much easier to use with open(...) for files instead of importing csv or other special filetype libraries, unless my use case is very specific.
So in your case, it would be something like:
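import random
input_file_name = "2019-01-1.csv"
output_file_name = "2019-01-2.csv"
with open(input_file_name, 'r') as f:
    lines = f.read().splitlines()
# The first line is the header: it gets the new column name
file_lines = [lines[0] + ",Size\n"]
# Every other line gets a random integer from 1 to 50 (both inclusive)
file_lines += ["{},{}\n".format(x, random.randint(1, 50)) for x in lines[1:]]
with open(output_file_name, 'w') as f:
    f.writelines(file_lines)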
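With 3 million rows it may be preferable not to hold every line in memory. A streaming sketch with the csv module is shown below; append_random_column and its parameters are hypothetical, and the optional seed is only there to make runs repeatable. Note that random.randint includes both endpoints, unlike np.random.randint, which excludes the upper bound.

```python
import csv
import io
import random


def append_random_column(src, dst, name="Size", low=1, high=50, seed=None):
    """Copy src to dst, adding a column of random integers in [low, high]."""
    rng = random.Random(seed)
    reader = csv.reader(src)
    writer = csv.writer(dst, lineterminator="\n")
    writer.writerow(next(reader) + [name])  # header plus the new column name
    for row in reader:
        writer.writerow(row + [rng.randint(low, high)])


sample = io.StringIO("awareness,trip,amount\n25,1,30\n30,2,35\n")
result = io.StringIO()
append_random_column(sample, result, seed=0)
lines = result.getvalue().splitlines()
print(lines)
```

Each row is written as soon as it is read, so memory use does not grow with the number of rows.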
