This is in continuation from my previous question. I have 2 files, file1.csv and a large csv called master_file.csv. They have several columns and have a common column name called EMP_Code.
File 1 example:
EMP_name  EMP_Code  EMP_dept
b         f367      abc
a         c264      xyz
c         d264      abc
master_file example:
EMP_name EMP_age EMP_Service EMP_Code EMP_dept
a 30 6 c264 xyz
b 29 3 f367 abc
r 27 1 g364 lmn
d 45 10 c264 abc
t 50 25 t453 lmn
I want to extract the matching rows from master_file using all the EMP_Code values in file1. I tried the following code and I am losing a lot of data. I cannot read the complete master csv file as it is around 20 GB, has millions of rows, and I run out of memory. I want to read master_file in chunks, extract the complete rows for each EMP_Code present in file1, and save them into a new file, Employee_full_data.
import csv
import pandas as pd
df = pd.read_csv(r"master_file.csv")
li = ['c264', 'f367']
full_data = df[df.EMP_Code.isin(li)]
full_data.to_csv(r"Employee_full_data.csv", index=False)
I also tried the following code. I receive an empty file whenever I use the EMP_Code column, but it works fine when I use columns like EMP_name or EMP_dept. I want to extract the data using EMP_Code.
import csv
import pandas as pd

df = pd.read_csv(r"file1.csv")
list_codes = list(df.EMP_Code)
selected_rows = []
with open(r"master_file.csv") as csv_file:
    reader = csv.DictReader(csv_file)
    for row in reader:
        if row['EMP_Code'] in list_codes:
            selected_rows.append(row)
article_usage = pd.DataFrame.from_records(selected_rows)
article_usage.to_csv(r"Employee_full_data.csv", index=False)
Is there any other way I can extract the data without loss? I have heard about join and reading data in chunks, but I'm not sure how to use them here. Any help is appreciated.
I ran the code from your 2nd example (using csv.DictReader) on your small example and it worked. I'm guessing your problem has to do with the real-life scale of master_file, as you've alluded to.
The problem might be that despite using csv.DictReader to stream information in, you're still using a Pandas dataframe to aggregate everything before writing it out, and that accumulated output is breaking your memory budget.
If that's true, then use csv.DictWriter to stream out as well. The only tricky bit is getting the writer set up: it needs to know the fieldnames, which can't be known until we've read the first row, so we set up the writer in the first iteration of the read loop.
(I've removed the with open(...) contexts because I think they add too much indentation.)
import csv
import pandas as pd

df = pd.read_csv(r"file1.csv")
list_codes = list(df.EMP_Code)

f_in = open(r"master_file.csv", newline="")
reader = csv.DictReader(f_in)
f_out = open(r"output.csv", "w", newline="")

init_writer = True
for row in reader:
    if init_writer:
        writer = csv.DictWriter(f_out, fieldnames=row)
        writer.writeheader()
        init_writer = False
    if row["EMP_Code"] in list_codes:
        writer.writerow(row)
f_out.close()
f_in.close()
Running that on your sample data, output.csv contains:

EMP_name EMP_age EMP_Service EMP_Code EMP_dept
a        30      6           c264     xyz
b        29      3           f367     abc
d        45      10          c264     abc
And if you'd like to get rid of Pandas altogether:
import csv

list_codes = set()
with open(r"file1.csv", newline="") as f:
    reader = csv.DictReader(f)
    for row in reader:
        list_codes.add(row["EMP_Code"])
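Putting the two halves together, a fully pandas-free version might look like this (same logic as above; note that DictReader exposes reader.fieldnames, which also removes the need for the init_writer trick):

import csv

# Collect the EMP_Code values to keep from file1.csv
list_codes = set()
with open(r"file1.csv", newline="") as f:
    for row in csv.DictReader(f):
        list_codes.add(row["EMP_Code"])

# Stream master_file.csv one row at a time, writing only matching rows
with open(r"master_file.csv", newline="") as f_in, \
        open(r"Employee_full_data.csv", "w", newline="") as f_out:
    reader = csv.DictReader(f_in)
    writer = csv.DictWriter(f_out, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        if row["EMP_Code"] in list_codes:
            writer.writerow(row)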
You just have to pass chunksize=<SOME INTEGER> to pandas' read_csv function (see the pandas documentation for details).
If you pass chunksize=2, you will read the file into dataframes of 2 rows. Or, more accurately, it will read 2 rows of the csv into a dataframe. You can then apply your filter to that 2-row dataframe and "accumulate" the result into another dataframe. The next iteration will read the next two rows, which you subsequently filter... Lather, rinse, repeat:
import pandas as pd

li = ['c264', 'f367']
result_df = pd.DataFrame()
with pd.read_csv("master_file.csv", chunksize=2) as reader:
    for chunk_df in reader:
        filtered_df = chunk_df[chunk_df.EMP_Code.isin(li)]
        result_df = pd.concat([result_df, filtered_df])
print(result_df)
# Outputs:
# EMP_name EMP_age EMP_Service EMP_Code EMP_dept
# 0 a 30 6 c264 xyz
# 1 b 29 3 f367 abc
# 3 d 45 10 c264 abc
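One caveat: result_df still accumulates everything in memory. If even the filtered result is large, you can append each filtered chunk straight to the output file instead. A sketch (the chunk size here is an assumption; tune it to your RAM):

import pandas as pd

li = ['c264', 'f367']
first_chunk = True
with pd.read_csv("master_file.csv", chunksize=100_000) as reader:
    for chunk_df in reader:
        filtered = chunk_df[chunk_df.EMP_Code.isin(li)]
        # write the header only for the first chunk, then append
        filtered.to_csv("Employee_full_data.csv",
                        mode="w" if first_chunk else "a",
                        header=first_chunk, index=False)
        first_chunk = False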
One way to handle this type of file read/write task is to use a generator and read the data in chunks or portions you can handle (given memory or other constraints).
def read_line():
    with open('master_file.csv', 'r') as fid:
        # note: the walrus operator (:=) requires Python 3.8+
        while (line := fid.readline().split()):
            yield line
This simple generator yields one new line per call. Now you can simply iterate over it to do whatever filtering you are interested in and build your new dataframe.
r_line = read_line()
for l in r_line:
    print(l)
You could modify the generator to, for example, parse and return a list, return multiple lines at a time, etc.
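For instance, a variant that yields fixed-size batches of parsed lines might look like this (the batch size is arbitrary; note this splits on whitespace like the original, so adapt the split for comma-separated data):

def read_chunks(path, n=1000):
    """Yield lists of up to n whitespace-split lines at a time."""
    with open(path, 'r') as fid:
        chunk = []
        for line in fid:
            chunk.append(line.split())
            if len(chunk) == n:
                yield chunk
                chunk = []
        if chunk:  # any leftover lines
            yield chunk

for batch in read_chunks('master_file.csv', n=2):
    print(batch)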
Related
I am new to Python. I have used just letters to simplify my code below. My code writes a CSV file with columns of a, b, c, d values, each with 10 rows (length). I would like to add the average values of c and d to the same CSV file, as two additional columns that each have one row for the average value. I have tried to append field names and write the new values, but it didn't work.
with open('out.csv', 'w') as csvfile:
    fieldnames = ['a', 'b', 'c', 'd']
    csv_writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    csv_writer.writeheader()
    total_c = 0
    total_d = 0
    for i in range(1, length):
        # do something to get a, b, c, d values
        total_c += c
        total_d += d
        csv_writer.writerow({'a': a, 'b': b, 'c': c, 'd': d})
mean_c = total_c / length
mean_d = total_d / length
I expect the output to look like my attached picture: the a, b, c, d columns plus two extra columns holding the single average values.
Try using the pandas library to deal with the csv file. I've provided sample code below; I assume the csv file has no header present on the first line.
import pandas as pd

# no header in the file, so supply the column names ourselves
data = pd.read_csv('out.csv', header=None, names=['a', 'b', 'c', 'd'])
# making sure I am using a copy of the dataframe
avg_data = data.copy()
# creating new average columns in the same dataframe
avg_data['mean_c'] = avg_data['c'].mean()
avg_data['mean_d'] = avg_data['d'].mean()
# writing updated data to the csv file
avg_data.to_csv('out.csv', sep=',', encoding='utf-8', index=False)
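If you'd rather stay with the csv module from your original attempt, here's a minimal sketch of one way to do it: accumulate the totals as you go, hold the rows in a list, and write the two mean columns on the first data row only (the placeholder value computation is an assumption):

import csv

length = 10
rows = []
total_c = 0
total_d = 0
for i in range(length):
    a, b, c, d = i, 2 * i, 3.0 * i, 4.0 * i  # placeholder values; compute yours here
    total_c += c
    total_d += d
    rows.append({'a': a, 'b': b, 'c': c, 'd': d})

with open('out.csv', 'w', newline='') as csvfile:
    fieldnames = ['a', 'b', 'c', 'd', 'mean_c', 'mean_d']
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    for i, row in enumerate(rows):
        if i == 0:
            # the mean columns get a value on the first row only;
            # missing keys are filled with DictWriter's restval ('' by default)
            row = {**row, 'mean_c': total_c / length, 'mean_d': total_d / length}
        writer.writerow(row)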
Probably a repost but I can't find a solution that works in my case.
I have a .csv file with an id number associated to a string like this:
0 | ABC
1 | DEF
...
100 | XYZ
I would like to extract the row with an ID number x and append it to a list, so ideally something like:
with open('myfile.csv', 'rb') as f:
    reader = csv.reader(f)
    results.append([row for idx, row in enumerate(reader) if idx == x])
This solution does not appear to work, as it tells me that the "iterator should return strings, not bytes", despite the fact that I thought I opened it in byte mode.
Use pandas to read the csv file into a dataframe:
import pandas as pd
sourceinput = pd.read_csv('myfile.csv')
# select the matching rows, not just the id column itself
outputlist = sourceinput.loc[sourceinput['id'] == <value_you_need>].values.tolist()
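For completeness, the direct fix for the error in your snippet is to open the file in text mode ('r', not 'rb'): in Python 3 the csv module expects an iterator of strings, not bytes. A minimal sketch (the value of x is an assumption for illustration):

import csv

x = 42  # the ID you want
results = []
with open('myfile.csv', 'r', newline='') as f:
    reader = csv.reader(f)
    for row in reader:
        if row[0] == str(x):  # first column holds the ID
            results.append(row)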
Got a CSV from which I am selecting a random sample of 500 rows using the following code:
import csv
import random
with open('Original.csv', 'rb') as source:
    lines = [line for line in source]
random_choice = random.sample(lines, 500)
What I'd like to do is update a column called [winner] for the rows that exist within the sample, and then save it back to a csv file, but I have no idea how to achieve this...
There is a unique identifier in a column called [ID].
How would I go about doing this?
Starting with a CSV that looks like this:
ID something winner
1 a
2 b
3 c
4 a
5 d
6 a
7 b
8 e
9 f
10 g
You could use the following approach: the whole file is read in, rows are chosen by randomly selected indices, and the data is written back out to the file.
import csv
import random

# Read in the data
with open('example.csv', 'r') as infile:
    reader = csv.reader(infile)
    header = next(reader)  # We want the headers, but not as part of the sample
    data = []
    for row in reader:
        data.append(row)

# Find the column called winner
winner_column_index = header.index('winner')

# Pick some random indices which will be used to generate the sample
all_indices = list(range(len(data)))
sampled_indices = random.sample(all_indices, 5)

# Add the winner column to those rows selected
for index in sampled_indices:
    data[index][winner_column_index] = 'Winner'

# Write the data back
with open('example_out.csv', 'w', newline='') as outfile:
    writer = csv.writer(outfile)
    writer.writerow(header)   # Make sure we get the headers back in
    writer.writerows(data)    # Write the rest of the data
This will give the following output:
ID something winner
1 a
2 b Winner
3 c
4 a Winner
5 d
6 a Winner
7 b
8 e
9 f Winner
10 g Winner
EDIT: It turns out that having the first column of the CSV named ID is not a good idea if you want to open the file with Excel; Excel then incorrectly thinks the file is in SYLK format.
First, why are you using csv and not a database? Even SQLite would be much easier (it's built in: import sqlite3).
Second, you'll need to write the whole file again. I suggest you parse your lines into lists and just update those in place (lists are mutable references, so changing the inner values updates the original):
lines = [list(line) for line in source]
and then
for choice in random_choice:
    choice[WINNER_INDEX] += 1
and then write the file back out.
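Putting those pieces together, a sketch of the full read/update/write cycle might look like this (the 'Winner' marker and column name are taken from the other answer; treat the details as assumptions):

import csv
import random

# Read everything in, keeping the header separate
with open('Original.csv', newline='') as source:
    reader = csv.reader(source)
    header = next(reader)
    lines = [list(line) for line in reader]

WINNER_INDEX = header.index('winner')

# Mark 500 randomly chosen rows; mutating the inner lists updates `lines` too
for choice in random.sample(lines, 500):
    choice[WINNER_INDEX] = 'Winner'

# Write the whole file back out
with open('Original.csv', 'w', newline='') as out:
    writer = csv.writer(out)
    writer.writerow(header)
    writer.writerows(lines)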
EDIT: Added complexity.
I have a large csv file, and I want to filter out rows based on the column values. For example consider the following CSV file format:
Col1,Col2,Nation,State,Col4...
a1,b1,Germany,state1,d1...
a2,b2,Germany,state2,d2...
a3,b3,USA,AL,d3...
a3,b3,USA,AL,d4...
a3,b3,USA,AK,d5...
a3,b3,USA,AK,d6...
I want to filter all rows with Nation == 'USA', and then split those rows by each of the 50 states. What's the most efficient way of doing this? I'm using Python. Thanks.
Also, is R better than Python for such tasks?
Use boolean indexing or DataFrame.query:
df1 = df[df['Nation'] == "USA"]
Or:
df1 = df.query('Nation == "USA"')
The second should be faster; see the pandas documentation on the performance of query.
If that's still not possible (not a lot of RAM), try using dask, as Jon Clements suggested in the comments (thank you).
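For the "then split by each of the 50 states" part, here's a chunk-friendly sketch that combines the boolean filter with a groupby, appending to one file per state (the file names and chunk size are assumptions):

import pandas as pd

seen = set()  # states whose output file already has a header
with pd.read_csv('yourfile.csv', chunksize=100_000) as reader:
    for chunk in reader:
        usa = chunk[chunk['Nation'] == 'USA']
        for state, grp in usa.groupby('State'):
            # append each group to its state's file, writing the header only once
            grp.to_csv(f'usa_{state}.csv', mode='a',
                       header=state not in seen, index=False)
            seen.add(state)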
One way would be to filter the csv first and then load, given the size of the data
import csv

with open('yourfile.csv', 'r') as f_in:
    with open('yourfile_edit.csv', 'w') as f_outfile:
        f_out = csv.writer(f_outfile, escapechar=' ', quoting=csv.QUOTE_NONE)
        for line in f_in:
            line = line.strip()
            row = []
            if 'USA' in line:
                row.append(line)
                f_out.writerow(row)
Now load the csv:
import pandas as pd
df = pd.read_csv('yourfile_edit.csv', sep=',', header=None)
You get:
    0   1    2   3   4
0  a3  b3  USA  AL  d3
1  a3  b3  USA  AL  d4
2  a3  b3  USA  AK  d5
3  a3  b3  USA  AK  d6
You could open the file, find the index of the Nation header, then iterate over a reader().
import csv

temp = r'C:\path\to\file'
with open(temp, 'r', newline='') as f:
    cr = csv.reader(f, delimiter=',')
    # next(cr) consumes the header row
    header = next(cr)
    i = header.index('Nation')
    # list comprehension over the remaining rows
    filtered = [row for row in cr if row[i] == 'USA']
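If you then want those rows in a new file, you could follow up with something like this (the output name is an assumption):

import csv

with open('filtered.csv', 'w', newline='') as f_out:
    writer = csv.writer(f_out)
    writer.writerow(header)      # header row saved in the step above
    writer.writerows(filtered)   # the rows that matched 'USA'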
I work with csv files, and it seems Python provides a lot of flexibility for handling them.
I found several questions related to my issue, but I cannot figure out how to combine the solutions effectively...
My starting point CSV file looks like this (note there is only 1 column in the 'header' row):
FILE1
Z1 20 44 3
Z1 21 44 5
Z1 21 44 8
Z1 22 45 10
What I want to do is add a column in between cols #1 and #2, and keep the rest unchanged. The new column has the same number of rows as the other columns, but contains the same integer for all entries (10 in my example below). Another important point: I don't really know the number of rows, so I might have to count them somehow first(?). My output should then look like:
FILE1
Z1 10 20 44 3
Z1 10 21 44 5
Z1 10 21 44 8
Z1 10 22 45 10
Is there a simple solution to this?
I think the easiest solution would be to just read each row and write a corresponding new row (with the inserted value) in a new file:
import csv

with open('input.csv', 'r') as infile:
    with open('output.csv', 'w') as outfile:
        reader = csv.reader(infile, delimiter=' ')
        writer = csv.writer(outfile, delimiter=' ')
        for row in reader:
            new_row = [row[0], 10]
            new_row += row[1:]
            writer.writerow(new_row)
This simple approach makes sense for pure bulk processing; if you're doing more with the data than that, though, you'd want to look into dataframe libraries like pandas instead.
Use pandas to import the csv file as a DataFrame named df, and then use df.insert(idx, col_name, value), where idx is the index of the newly created column, col_name is the name you assign to it, and value is the list of values (or a scalar) you wish to assign to the column. See below for an illustration:
import pandas as pd
prices = pd.read_csv('C:\\Users\\abdou.seck\\Documents\\prices.csv')
prices
## Output
Shares Number Prices
0 AAP 100 100.67
1 MSFT 50 56.50
2 SAN 200 19.18
3 GOOG 300 500.34
prices.insert(3, 'Total', prices['Number']*prices['Prices'])
prices
## Output:
Shares Number Prices Total
0 AAP 100 100.67 10067
1 MSFT 50 56.50 2825
2 SAN 200 19.18 3836
3 GOOG 300 500.34 150102
Hope this helps.
Read the header first, then initialize the reader, write the header first, then initialize the writer:
import csv

# text mode with newline="" is what the csv module expects in Python 3
with open("in.csv", "r", newline="") as in_file:
    header = in_file.readline()
    csv_file_in = csv.reader(in_file, delimiter=" ")
    with open("out.csv", "w", newline="") as out_file:
        out_file.write(header)
        csv_file_out = csv.writer(out_file, delimiter=" ")
        for row in csv_file_in:
            csv_file_out.writerow([row[0], 10] + row[1:])
Pull the data into a list, insert data for each row into the desired spot, and re-write the data.
import csv

data_to_add = 10
new_column_index = 1  # 0-based index

with open('FILE1.csv', 'r') as f:
    csv_r = csv.reader(f, delimiter=' ')
    data = [line for line in csv_r]

# skip the single-column header row when inserting
for row in data[1:]:
    row.insert(new_column_index, data_to_add)

with open('FILE1.csv', 'w', newline='') as f:
    csv_w = csv.writer(f, delimiter=' ')
    csv_w.writerows(data)
Here's how I might do it with pandas:
import pandas as pd

with open("in.csv") as input_file:
    header = input_file.readline()
    data = pd.read_csv(input_file, sep=" ")

data.insert(1, "New Data", 10)

with open("out.csv", "w") as output_file:
    output_file.write(header)
    data.to_csv(output_file, index=False, header=False)