This is a continuation of my previous question. I have 2 files, file1.csv and a large CSV called master_file.csv. They have several columns and share a common column called EMP_Code.
File 1 example:
EMP_name  EMP_Code  EMP_dept
b         f367      abc
a         c264      xyz
c         d264      abc
master_file example:
EMP_name EMP_age EMP_Service EMP_Code EMP_dept
a 30 6 c264 xyz
b 29 3 f367 abc
r 27 1 g364 lmn
d 45 10 c264 abc
t 50 25 t453 lmn
I want to extract the matching rows from master_file using all the EMP_Code values in file1. I tried the following code and I am losing a lot of data. I cannot read the complete master CSV file into memory because it is around 20 GB and has millions of rows, so I run out of memory. I want to read master_file in chunks, extract the complete rows for each EMP_Code present in file1, and save them into a new file, Employee_full_data.
import csv
import pandas as pd
df = pd.read_csv(r"master_file.csv")
li = ['c264', 'f367']
full_data = df[df.EMP_Code.isin(li)]
full_data.to_csv(r"Employee_full_data.csv", index=False)
I also tried the following code. I get an empty file whenever I filter on the EMP_Code column, but it works fine when I use columns like EMP_name or EMP_dept. I want to extract the data using EMP_Code.
import csv
import pandas as pd
df = pd.read_csv(r"file1.csv")
list_codes = list(df.EMP_Code)
selected_rows = []
with open(r"master_file.csv") as csv_file:
reader = csv.DictReader(csv_file)
for row in reader:
if row['EMP_Code'] in list_codes:
selected_rows.append(row)`
article_usage = pd.DataFrame.from_records(selected_rows)
article_usage.to_csv(r"Employee_full_data.csv", index=False)
Is there any other way that I can extract the data without loss? I have heard about join and reading data in chunks, but I am not sure how to use them here. Any help is appreciated.
I ran the code from your 2nd example (using csv.DictReader) on your small example and it worked. I'm guessing your problem might have to do with the real-life scale of master_file as you've alluded to.
The problem might be that despite using csv.DictReader to stream information in, you're still using a Pandas dataframe to aggregate everything before writing it out, and maybe the output is breaking your memory budget.
If that's true, then use csv.DictWriter to stream out. The only tricky bit is getting the writer set up because it needs to know the fieldnames, which can't be known till we've read the first row, so we'll set up the writer in the first iteration of the read loop.
(I've removed the with open(... contexts because I think they add too much indentation)
df = pd.read_csv(r"file1.csv")
list_codes = list(df.EMP_Code)
f_in = open(r"master_file.csv", newline="")
reader = csv.DictReader(f_in)
f_out = open(r"output.csv", "w", newline="")
init_writer = True
for row in reader:
if init_writer:
writer = csv.DictWriter(f_out, fieldnames=row)
writer.writeheader()
init_writer = False
if row["EMP_Code"] in list_codes:
writer.writerow(row)
f_out.close()
f_in.close()
EMP_name  EMP_age  EMP_Service  EMP_Code  EMP_dept
a         30       6            c264      xyz
b         29       3            f367      abc
d         45       10           c264      abc
And if you'd like to get rid of Pandas altogether:
list_codes = set()
with open(r"file1.csv", newline="") as f:
    reader = csv.DictReader(f)
    for row in reader:
        list_codes.add(row["EMP_Code"])
You just have to pass chunksize=<SOME INTEGER> to pandas' .read_csv function (see the pandas documentation).
If you pass a chunksize=2, you will read the file into dataframes of 2 rows. Or... more accurately, it will read 2 rows of the csv into a dataframe. You can then apply your filter to that 2-row dataframe and "accumulate" that into another dataframe. The next iteration will read the next two rows, which you can subsequently filter... Lather, rinse and repeat:
import pandas as pd

li = ['c264', 'f367']

result_df = pd.DataFrame()
with pd.read_csv("master_file.csv", chunksize=2) as reader:
    for chunk_df in reader:
        filtered_df = chunk_df[chunk_df.EMP_Code.isin(li)]
        result_df = pd.concat([result_df, filtered_df])

print(result_df)
# Outputs:
# EMP_name EMP_age EMP_Service EMP_Code EMP_dept
# 0 a 30 6 c264 xyz
# 1 b 29 3 f367 abc
# 3 d 45 10 c264 abc
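If even the filtered result would be uncomfortably large to build up in memory, a small variation (just a sketch, reusing the same file names and a more realistic chunk size) appends each filtered chunk straight to the output file instead of concatenating:

import pandas as pd

li = ['c264', 'f367']

first_chunk = True
with pd.read_csv("master_file.csv", chunksize=100_000) as reader:
    for chunk_df in reader:
        filtered_df = chunk_df[chunk_df.EMP_Code.isin(li)]
        # write the header only for the first chunk, then keep appending
        filtered_df.to_csv("Employee_full_data.csv",
                           mode="w" if first_chunk else "a",
                           header=first_chunk, index=False)
        first_chunk = False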
One way to handle this kind of file read/write task is to use a generator and read the data in chunks or portions that you can handle within your memory (or other) constraints.
def read_line():
    with open('master_file.csv', 'r') as fid:
        # split each line on commas, since master_file.csv is comma-separated
        while (line := fid.readline()):
            yield line.rstrip('\n').split(',')
This simple generator yields one new line per call. Now you can simply iterate over it, apply whatever filtering you are interested in, and build your new dataframe.
r_line = read_line()
for l in r_line:
    print(l)
You could modify the generator to, for example, parse the line and return a list, or yield multiple lines at a time, etc.; see the sketch below.
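For example, a batched variant could look like this (only a sketch; it assumes the file really is comma-separated and uses the csv module so quoted fields are handled correctly):

import csv

def read_rows(path, batch_size=1000):
    """Yield lists of up to batch_size parsed CSV rows at a time."""
    with open(path, newline='') as fid:
        reader = csv.reader(fid)
        batch = []
        for row in reader:
            batch.append(row)
            if len(batch) == batch_size:
                yield batch
                batch = []
        if batch:  # don't drop the final, shorter batch
            yield batch

for rows in read_rows('master_file.csv'):
    for row in rows:
        print(row)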
So I'm trying to find out how to open CSV files and sort all the details in them...
so an example of data contained in a CSV file is...
2,8dac2b,ewmzr,jewelry,phone0,9759243157894736,us,69.166.231.58,vasstdc27m7nks3
1,668d39,aeqok,furniture,phone1,9759243157894736,in,50.201.125.84,jmqlhflrzwuay9c
3,622r49,arqek,doctor,phone2,9759544365415694736,in,53.001.135.54,weqlhrerreuert6f
and so I'm trying to write a function sortCSV(File) that opens the CSV file and sorts it based on the very first number, which is 0, 1, ...
so the output should be
1,668d39,aeqok,furniture,phone1,9759243157894736,in,50.201.125.84,jmqlhflrzwuay9c
2,8dac2b,ewmzr,jewelry,phone0,9759243157894736,us,69.166.231.58,vasstdc27m7nks3
3,622r49,arqek,doctor,phone2,9759544365415694736,in,53.001.135.54,weqlhrerreuert6f
Here is my code so far, which clearly doesn't work....
import csv
def CSV2List(csvFilename: str):
    f = open(csvFilename)
    q = list(f)
    return q.sort()
What changes should I make to my code to make sure my code works??
Using pandas, set the first column as the index and use sort_index to sort on that index column:
import pandas as pd
file_path = '/data.csv'
df = pd.read_csv(file_path, header=None, index_col=0)
df = df.sort_index()
print(df)
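If you then want the sorted rows back on disk (a small follow-up sketch; the output file name is just an example), to_csv writes the index back out as the first column:

# header=False keeps the original "no header" layout
df.to_csv('/data_sorted.csv', header=False)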
There's a number of ways you could handle this but one of the easiest would be to install Pandas (https://pandas.pydata.org/).
First off, you will most likely need titles for each column, which should be on the first row of your CSV file. When you've added the column titles and installed pandas:
With pandas:
import pandas as pd
dataframe = pd.read_csv(filepath, index_col=0)
dataframe = dataframe.sort_index()
This sets the first column as the index column, and sort_index then sorts the rows on that index.
Another way I've had to handle CSVs with difficult formatting (e.g. exported from Excel) is to read the file as a regular file and then iterate over the rows and handle them on my own.
def sortCSV(filepath):
    final_data = []
    with open(filepath, "r") as f:
        for row in f:
            # Split the row
            row_data = row.split(",")
            # Add to final data array
            final_data.append(row_data)
    # This sorts the final data based on the first field
    final_data.sort(key=lambda row: row[0])
    # This returns a sorted list of rows of your CSV
    return final_data
Try csv.reader and sort the rows it yields:
import csv

def CSV2List(csvFilename: str):
    f = open(csvFilename)
    q = csv.reader(f)
    return sorted(q, key=lambda x: x[0])
Using the csv module:
import csv

def csv_to_list(filename: str):
    # use a context manager here
    with open(filename) as fh:
        reader = csv.reader(fh)
        # convert the first item to an int for sorting
        rows = [[int(num), *row] for num, *row in reader]
    # sort the rows based on that value
    return sorted(rows, key=lambda row: row[0])
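For example, a usage sketch ('data.csv' and the output name are placeholders) that feeds the sorted rows straight into csv.writer:

sorted_rows = csv_to_list('data.csv')

with open('data_sorted.csv', 'w', newline='') as fh:
    writer = csv.writer(fh)
    writer.writerows(sorted_rows)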
This is not the best way to deal with CSV files but:
def CSV2List(csvFilename: str):
    f = open(csvFilename, 'r')
    l = []
    for line in f:
        l.append(line.split(','))
    for item in l:
        item[0] = int(item[0])
    return sorted(l)

print(CSV2List('data.csv'))
However, I would probably use pandas instead; it is a great module (a quick sketch follows).
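A rough pandas equivalent (just a sketch, assuming the file has no header row and that the output name is a placeholder):

import pandas as pd

df = pd.read_csv('data.csv', header=None)
df = df.sort_values(0)  # sort by the first column, which read_csv already parsed as an int
df.to_csv('data_sorted.csv', header=False, index=False)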
EDITED: Added Complexity
I have a large csv file, and I want to filter out rows based on the column values. For example consider the following CSV file format:
Col1,Col2,Nation,State,Col4...
a1,b1,Germany,state1,d1...
a2,b2,Germany,state2,d2...
a3,b3,USA,AL,d3...
a3,b3,USA,AL,d4...
a3,b3,USA,AK,d5...
a3,b3,USA,AK,d6...
I want to filter all rows with Nation == 'USA', and then split those rows up by each of the 50 states. What's the most efficient way of doing this? I'm using Python. Thanks.
Also, is R better than Python for such tasks?
Use boolean indexing or DataFrame.query:
df1 = df[df['Nation'] == "Japan"]
Or:
df1 = df.query('Nation == "Japan"')
The second should be faster; see the pandas notes on the performance of query.
If that is still not possible (not a lot of RAM), try using dask, as Jon Clements suggested in the comments (thank you).
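For the second part of the question (splitting the filtered rows up by state), groupby works well; here is a sketch that assumes the column names from the example and a hypothetical per-state output naming scheme:

import pandas as pd

df = pd.read_csv('large_file.csv')  # hypothetical input name
usa = df[df['Nation'] == 'USA']

# one output file per state, e.g. usa_AL.csv, usa_AK.csv, ...
for state, group in usa.groupby('State'):
    group.to_csv(f'usa_{state}.csv', index=False)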
One way would be to filter the CSV first and then load it, given the size of the data:
import csv
with open('yourfile.csv', 'r') as f_in:
    with open('yourfile_edit.csv', 'w') as f_outfile:
        f_out = csv.writer(f_outfile, escapechar=' ', quoting=csv.QUOTE_NONE)
        for line in f_in:
            line = line.strip()
            row = []
            if 'Japan' in line:
                row.append(line)
                f_out.writerow(row)
Now load the csv
df = pd.read_csv('yourfile_edit.csv', sep = ',',header = None)
You get
0 1 2 3 4
0 2 a3 b3 Japan d3
You could open the file, index the position of the Nation header, then iterate over a reader().
import csv
temp = r'C:\path\to\file'
with open(temp, 'r', newline='') as f:
    cr = csv.reader(f, delimiter=',')
    # next(cr) gets the header row (row[0])
    i = next(cr).index('Nation')
    # list comprehension through remaining cr iterables
    filtered = [row for row in cr if row[i] == 'Japan']
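If you then want the filtered rows in a new CSV rather than a list (a sketch; the output file name is just an example), reuse the header row you already read:

import csv

temp = r'C:\path\to\file'
with open(temp, 'r', newline='') as f_in, open('japan_only.csv', 'w', newline='') as f_out:
    cr = csv.reader(f_in, delimiter=',')
    cw = csv.writer(f_out)
    header = next(cr)
    cw.writerow(header)  # keep the header in the output
    i = header.index('Nation')
    cw.writerows(row for row in cr if row[i] == 'Japan')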
I'm currently keeping track of the large scale digitization of video tapes and need help pulling data from multiple CSVs. Most tapes have multiple copies, but we only digitize one tape from the set. I would like to create a new CSV containing only tapes of shows that have yet to be digitized. Here's a mockup of my original CSV:
Date Digitized | Series | Episode Number | Title | Format
---------------|----------|----------------|-------|--------
01-01-2016 | Series A | 101 | | VHS
| Series A | 101 | | Beta
| Series A | 101 | | U-Matic
| Series B | 101 | | VHS
From here, I'd like to ignore all rows containing "Series A" AND "101", as this show has a value in the "Date Digitized" cell. I attempted isolating these conditions but can't seem to get a complete list of undigitized content. Here's my code:
import csv, glob, os

names = glob.glob("*.csv")
names = [os.path.splitext(each)[0] for each in names]
for name in names:
    with open("%s_.csv" % name, "rb") as source:
        reader = csv.reader(source)
        with open("%s_edit.csv" % name, "wb") as result:
            writer = csv.writer(result)
            for row in reader:
                if row[0]:
                    series = row[1]
                    epnum = row[2]
                if row[1] != series and row[2] != epnum:
                    writer.writerow(row)
I'll add that this is my first question and I'm very new to Python, so any advice would be much appreciated!
I am not a hundred percent sure I've understood your needs. However, this might put you on the right track. I am using the pandas module:
data = """
Date Digitized | Series | Episode Number | Title | Format
---------------|----------|----------------|-------|--------
01-01-2016 | Series A | 101 | | VHS
| Series A | 101 | | Beta
| Series A | 101 | | U-Matic
| Series B | 101 | | VHS"""
# useful module for treating csv files (and many other)
import pandas as pd
# module to handle data as it was a csv file
import io
# read the csv into pandas DataFrame
# use the 0 row as a header
# fields are separated by |
df = pd.read_csv(
    io.StringIO(data),
    header=0,
    sep="|"
)
# there is a bit problem with white spaces
# remove white space from the column names
df.columns = [x.strip() for x in df.columns]
# remove white space from all string fields
df = df.applymap(lambda x: x.strip() if type(x) == str else x)
# finally choose the subset we want
# for some reason pandas guessed the type of Episode Number wrong
# it should be integer, this probably won't be a problem when loading
# directly from file
df = df[~((df["Series"] == "Series A") & (df["Episode Number"] == "101"))]
# print the result
print(df)
# Date Digitized Series Episode Number Title Format
# 0 --------------- ---------- ---------------- ------- --------
# 4 Series B 101 VHS
Feel free to ask, hopefully I'll be able to change the code according to your actual needs or help in any other way.
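And if you want to save the remaining (non-digitized) rows to a file instead of printing them, a one-liner will do (the output file name is just an example; sep="|" mirrors the mockup's layout):

df.to_csv("digitize.csv", sep="|", index=False)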
The simplest approach is to make two reads of the set of CSV files: one to build a list of all digitized tapes, the second to build a unique list of all tapes not on the digitized list:
# build list of digitized tapes
digitized = []
for name in names:
    with open("%s_.csv" % name, "rb") as source:
        reader = csv.reader(source)
        next(reader)  # skip header
        for row in reader:
            if row[0] and ((row[1], row[2]) not in digitized):
                digitized.append((row[1], row[2]))

# build list of non-digitized tapes
digitize_me = []
for name in names:
    with open("%s_.csv" % name, "rb") as source:
        reader = csv.reader(source)
        header = next(reader)[1:3]  # skip / save header
        for row in reader:
            if not row[0] and ((row[1], row[2]) not in digitized + digitize_me):
                digitize_me.append((row[1], row[2]))

# write non-digitized tapes to 'digitize.csv'
with open("digitize.csv", "wb") as result:
    writer = csv.writer(result)
    writer.writerow(header)
    for tape in digitize_me:
        writer.writerow(tape)
input file 1:
Date Digitized,Series,Episode Number,Title,Format
01-01-2016,Series A,101,,VHS
,Series A,101,,Beta
,Series C,101,,Beta
,Series D,102,,VHS
,Series B,101,,U-Matic
input file 2:
Date Digitized,Series,Episode Number,Title,Format
,Series B,101,,VHS
,Series D,101,,Beta
01-01-2016,Series C,101,,VHS
Output:
Series,Episode Number
Series D,102
Series B,101
Series D,101
As per OP comment, the line
header = next(reader)[1:3] # skip / save header
serves two purposes:
1. Assuming each csv file starts with a header, we do not want to read that header row as if it contained data about our tapes, so we need to "skip" the header row in that sense.
2. But we also want to save the relevant parts of the header for when we write the output csv file. We want that file to have a header as well. Since we are only writing the series and episode number, which are row fields 1 and 2, we assign just that slice, i.e. [1:3], of the header row to the header variable.
It's not really standard to have a line of code serve two pretty unrelated purposes like that, which is why I commented it. It also assigns to header multiple times (assuming multiple input files) when header only needs to be assigned once. Perhaps a cleaner way to write that section would be:
# build list of non-digitized tapes
digitize_me = []
header = None
for name in names:
    with open("%s_.csv" % name, "rb") as source:
        reader = csv.reader(source)
        if header:
            next(reader)                # skip header
        else:
            header = next(reader)[1:3]  # read header
        for row in reader:
            ...
It's a question of which form is more readable. Either way is close but I thought combining 5 lines into one keeps the focus on the more salient parts of the code. I would probably do it the other way next time.
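One further tweak, if the tape lists get long: membership tests against a plain list (and especially against digitized + digitize_me, which builds a new list on every row) are linear, so keeping the already-seen keys in sets keeps each test cheap. A sketch of the second loop with that change (same file-name pattern as above):

# build list of non-digitized tapes, using sets for the membership tests
digitize_me = []          # list, to preserve output order
seen = set(digitized)     # 'digitized' as built in the first pass above
queued = set()
for name in names:
    with open("%s_.csv" % name, "rb") as source:
        reader = csv.reader(source)
        next(reader)  # skip header
        for row in reader:
            key = (row[1], row[2])
            if not row[0] and key not in seen and key not in queued:
                digitize_me.append(key)
                queued.add(key)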
I'm trying to write some data into the excel spreadsheet using CSV.
I'm writing a motif finder, reading the input from fasta and outputs to excel.
But I'm having a hard time writing the data in a correct format.
My desired result in the excel is like below..
SeqName M1 Hits M2 Hits
Seq1 MN[A-Z] 3 V[A-Z]R[ML] 2
Seq2 MN[A-Z] 0 V[A-Z]R[ML] 5
Seq3 MN[A-Z] 1 V[A-Z]R[ML] 0
I have generated correct results but I just don't know how to put them in a correct format like above.
This is the code that I have so far.
import re
from Bio import SeqIO
import csv
import collections
def SearchMotif(f1, motif, f2="motifs.xls"):
    with open(f1, 'r') as fin, open(f2, 'wb') as fout:
        # This makes SeqName static and everything else mutable thus, when more than 1 motifs are searched,
        # they can be correctly placed into excel.
        writer = csv.writer(fout, delimiter='\t')
        motif_fieldnames = ['SeqName']
        writer_dict = csv.DictWriter(fout, delimiter='\t', fieldnames=motif_fieldnames)
        for i in range(0, len(motif), 1):
            motif_fieldnames.append('M%d' % (i+1))
            motif_fieldnames.append('Hits')
        writer_dict.writeheader()

        # Reading input fasta file for processing.
        fasta_name = []
        for seq_record in SeqIO.parse(f1, 'fasta'):
            sequence = repr(seq_record.seq)  # re-module only takes string
            fasta_name.append(seq_record.name)
            print sequence  # **********
            for j in motif:
                motif_name = j
                print motif_name  # **********
                number_count = len(re.findall(j, sequence))
                print number_count  # **********
                writer.writerow([motif_name])

        for i in fasta_name:
            writer.writerow([i])  # [] makes it fit into one column instead of characters taking each columns
The print statements that have the asterisks ********** generate this... where the number is the number of Hits and the different sequences are Seq1, Seq2, and so on.
Seq('QIKDLLVSSSTDLDTTLVLVNAIYFKGMWKTAFNAEDTREMPFHVTKQESKPVQ...LTS', SingleLetterAlphabet())
PA[A-Z]
0
Y[A-Z]L[A-Z]
0
Seq('SFNVATLPAESSSTDLDTTVLLPDEPAEVSDLERIETEWTNMKILELPFAPQMK...VSS', SingleLetterAlphabet())
PA[A-Z]
2
Y[A-Z]L[A-Z]
0
Seq('PAESIYFKIEKTYNLT', SingleLetterAlphabet())
PA[A-Z]
1
Y[A-Z]L[A-Z]
1
You can write your data to a Pandas DataFrame, and then use the DataFrame's to_csv method to export it to a CSV. There is also a to_excel method. Pandas won't let you have multiple columns with the same name, like your "Hits" column. However, you can work around that by putting the column names you want in the first row and using the header=False option when you export.
"import pandas as pd", then replace your code starting at "fasta_name = []" with this:
column_names = ['SeqName']
for i, m in enumerate(motif):
    column_names += ['M'+str(i), 'Hits'+str(i)]

df = pd.DataFrame(columns=column_names)
for row, seq_record in enumerate(SeqIO.parse(f1, 'fasta')):
    name = seq_record.name
    sequence = str(seq_record.seq)  # the re module needs a plain string
    df.loc[row, 'SeqName'] = name
    for i, j in enumerate(motif):
        df.loc[row, 'M'+str(i)] = j
        df.loc[row, 'Hits'+str(i)] = len(re.findall(j, sequence))

df.to_csv(f2, sep='\t', index=False)
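If you do want the literal repeated "Hits" header from your desired output, here is a small sketch of the header=False workaround mentioned above; it reuses column_names, df, motif and f2 from the snippet just shown:

# Desired display header: SeqName, M1, Hits, M2, Hits, ...
display_names = ['SeqName']
for i in range(len(motif)):
    display_names += ['M%d' % (i + 1), 'Hits']

# Put the display names in as the first data row, then export without pandas' own header
header_row = pd.DataFrame([display_names], columns=column_names)
out = pd.concat([header_row, df], ignore_index=True)
out.to_csv(f2, sep='\t', index=False, header=False)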