I have a CSV from which I am selecting a random sample of 500 rows using the following code:
import csv
import random
with open('Original.csv', 'rb') as source:
    lines = [line for line in source]
random_choice = random.sample(lines, 500)
What I'd like to do is update a column called [winner] for the rows that fall within the sample, and then save it back to a CSV file, but I have no idea how to achieve this...
There is a unique identifier in a column called [ID].
How would I go about doing this?
Starting with a CSV that looks like this:
ID something winner
1 a
2 b
3 c
4 a
5 d
6 a
7 b
8 e
9 f
10 g
You could use the following approach: the whole file is read in, the winning rows are chosen by randomly selected indices, and the data is written back out to a new file.
import csv
import random
# Read in the data
with open('example.csv', 'r') as infile:
    reader = csv.reader(infile)
    header = next(reader)  # We want the headers, but not as part of the sample
    data = []
    for row in reader:
        data.append(row)
# Find the column called winner
winner_column_index = header.index('winner')
# Pick some random indices which will be used to generate the sample
all_indices = list(range(len(data)))
sampled_indices = random.sample(all_indices, 5)
# Add the winner column to those rows selected
for index in sampled_indices:
    data[index][winner_column_index] = 'Winner'
# Write the data back
with open('example_out.csv', 'w', newline='') as outfile:
    writer = csv.writer(outfile)
    writer.writerow(header)   # Make sure we get the headers back in
    writer.writerows(data)    # Write the rest of the data
This will give the following output:
ID something winner
1 a
2 b Winner
3 c
4 a Winner
5 d
6 a Winner
7 b
8 e
9 f Winner
10 g Winner
EDIT: It turns out that having the first column of the CSV called ID is not a good idea if you want to open the file with Excel: Excel then incorrectly thinks the file is in SYLK format.
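If Excel compatibility matters, one commonly cited workaround (a sketch, reusing the output file from above; the new header name is an arbitrary choice) is to rename the first header field so the file no longer begins with the literal characters ID:
import csv

with open('example_out.csv', 'r', newline='') as f:
    rows = list(csv.reader(f))

rows[0][0] = 'Id'  # renaming the field (e.g. to 'Id') is the usual workaround for the SYLK misdetection

with open('example_out_excel.csv', 'w', newline='') as f:
    csv.writer(f).writerows(rows)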
First, why are you using a CSV and not a database? Even SQLite would be much easier (it's built in: import sqlite3).
Second, you'll need to write the whole file out again either way. I suggest you parse each line into a list of fields and update those lists in place (lists are like pointers, so the sampled rows are the same objects as the rows in your full list, and changing the inner values updates both):
lines = [line for line in csv.reader(source)]
and then
for choice in random_choice:
    choice[WINNER_INDEX] = 'Winner'
and write the file.
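For the SQLite route, a minimal sketch (assuming the ID/something/winner layout from the question; the table and database file names are made up):
import csv
import random
import sqlite3

conn = sqlite3.connect('data.db')
cur = conn.cursor()
cur.execute("CREATE TABLE IF NOT EXISTS entries (id TEXT PRIMARY KEY, something TEXT, winner TEXT)")

# Load the CSV into the table
with open('Original.csv', newline='') as source:
    for row in csv.DictReader(source):
        cur.execute("INSERT OR REPLACE INTO entries VALUES (?, ?, ?)",
                    (row['ID'], row['something'], row.get('winner', '')))

# Sample 500 IDs and mark them as winners
ids = [r[0] for r in cur.execute("SELECT id FROM entries")]
for chosen in random.sample(ids, 500):
    cur.execute("UPDATE entries SET winner = 'Winner' WHERE id = ?", (chosen,))
conn.commit()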
This is a continuation of my previous question. I have 2 files, file1.csv and a large CSV called master_file.csv. They have several columns and share a common column called EMP_Code.
File 1 example:
EMP_name EMP_Code EMP_dept
b f367 abc
a c264 xyz
c d264 abc
master_file example:
EMP_name EMP_age EMP_Service EMP_Code EMP_dept
a 30 6 c264 xyz
b 29 3 f367 abc
r 27 1 g364 lmn
d 45 10 c264 abc
t 50 25 t453 lmn
I want to extract the matching rows from master_file using all the EMP_Code values in file1. I tried the following code and I am losing a lot of data. I cannot read the complete master CSV file into memory, as it is around 20 GB and has millions of rows, and I run out of memory. I want to read master_file in chunks, extract the complete rows for each EMP_Code present in file1, and save them into a new file, Employee_full_data.
import csv
import pandas as pd
df = pd.read_csv(r"master_file.csv")
li = ['c264', 'f367']
full_data = df[df.EMP_Code.isin(li)]
full_data.to_csv(r"Employee_full_data.csv", index=False)
I also tried the following code. I receive an empty file whenever I filter on the EMP_Code column, while it works fine with columns like EMP_name or EMP_dept. I want to extract the data using EMP_Code.
import csv
import pandas as pd
df = pd.read_csv(r"file1.csv")
list_codes = list(df.EMP_Code)
selected_rows = []
with open(r"master_file.csv") as csv_file:
reader = csv.DictReader(csv_file)
for row in reader:
if row['EMP_Code'] in list_codes:
selected_rows.append(row)`
article_usage = pd.DataFrame.from_records(selected_rows)
article_usage.to_csv(r"Employee_full_data.csv", index=False)
Is there any other way that I can extract the data without loss? I have heard about joins and reading data in chunks, but I am not sure how to use them here. Any help is appreciated.
I ran the code from your 2nd example (using csv.DictReader) on your small example and it worked. I'm guessing your problem might have to do with the real-life scale of master_file as you've alluded to.
The problem might be that despite using csv.DictReader to stream information in, you're still using a Pandas dataframe to aggregate everything before writing it out, and maybe the output is breaking your memory budget.
If that's true, then use csv.DictWriter to stream out. The only tricky bit is getting the writer set up because it needs to know the fieldnames, which can't be known till we've read the first row, so we'll set up the writer in the first iteration of the read loop.
(I've removed the with open(... contexts because I think they add too much indentation)
df = pd.read_csv(r"file1.csv")
list_codes = list(df.EMP_Code)
f_in = open(r"master_file.csv", newline="")
reader = csv.DictReader(f_in)
f_out = open(r"output.csv", "w", newline="")
init_writer = True
for row in reader:
if init_writer:
writer = csv.DictWriter(f_out, fieldnames=row)
writer.writeheader()
init_writer = False
if row["EMP_Code"] in list_codes:
writer.writerow(row)
f_out.close()
f_in.close()
Output:
EMP_name EMP_age EMP_Service EMP_Code EMP_dept
a 30 6 c264 xyz
b 29 3 f367 abc
d 45 10 c264 abc
And if you'd like to get rid of Pandas altogether:
list_codes = set()
with open(r"file1.csv", newline="") as f:
reader = csv.DictReader(f)
for row in reader:
list_codes.add(row["EMP_Code"])
You just have to pass chunksize=<SOME INTEGER> to pandas' .read_csv function (see the pandas read_csv documentation).
If you pass a chunksize=2, you will read the file into dataframes of 2 rows. Or... more accurately, it will read 2 rows of the csv into a dataframe. You can then apply your filter to that 2-row dataframe and "accumulate" that into another dataframe. The next iteration will read the next two rows, which you can subsequently filter... Lather, rinse and repeat:
import pandas as pd
li = ['c264', 'f367']
result_df = pd.DataFrame()
with pd.read_csv("master_file.csv", chunksize=2) as reader:
    for chunk_df in reader:
        filtered_df = chunk_df[chunk_df.EMP_Code.isin(li)]
        result_df = pd.concat([result_df, filtered_df])
print(result_df)
# Outputs:
# EMP_name EMP_age EMP_Service EMP_Code EMP_dept
# 0 a 30 6 c264 xyz
# 1 b 29 3 f367 abc
# 3 d 45 10 c264 abc
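Since master_file.csv is around 20 GB, the accumulated result_df itself could get large. A variation of the same idea (a sketch, using the column names from the question) writes each filtered chunk straight to the output file instead of concatenating in memory:
import pandas as pd

li = ['c264', 'f367']
with pd.read_csv("master_file.csv", chunksize=100000) as reader:
    for i, chunk_df in enumerate(reader):
        filtered_df = chunk_df[chunk_df.EMP_Code.isin(li)]
        # Write the header only for the first chunk, then append
        filtered_df.to_csv("Employee_full_data.csv",
                           mode="w" if i == 0 else "a",
                           header=(i == 0), index=False)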
One way to handle this type of file read/write task is to use a generator and read the data in chunks or portions that you can handle (given memory or other constraints).
def read_line():
    with open('master_file.csv', 'r') as fid:
        while (line := fid.readline().split()):
            yield line
This simple generator gives one new line per call. Now you can simply iterate over it, do whatever filtering you are interested in, and build your new dataframe.
r_line = read_line()
for l in r_line:
    print(l)
You could modify the generator to, for example, parse the lines and return a list, or return multiple lines at a time, etc.
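For instance, a chunked variant (a sketch) that yields lists of up to n parsed lines at a time:
def read_chunk(n=1000):
    with open('master_file.csv', 'r') as fid:
        chunk = []
        while (line := fid.readline().split()):
            chunk.append(line)
            if len(chunk) == n:
                yield chunk
                chunk = []
        if chunk:  # don't drop a final partial chunk
            yield chunk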
I am new to Python. I am trying to write numbers in a CSV file. The first number should become the first element of a row, the second number the second element, and then a new row should start. However, the way my code works, instead of adding the second element to the same row, it starts a new row.
For instance what I want is:
a1,b1
a2,b2
But what I get is:
a1
b1
a2
b2
I use a loop to continuously write values into a CSV file:
n = Ratio  # calculated in each loop
with open('ex1.csv', 'ab') as f:
    writer = csv.writer(f)
    writer.writerow([n])
...
m = Ratio2  # calculated in each loop
with open('ex1.csv', 'ab') as f:
    writer = csv.writer(f)
    writer.writerow([m])
I would like the results to be in format of
n1,m1
n2,m2
Example for writing to a file and then reading it back and printing it:
import csv
with open('ex1.csv', 'w') as f:        # open file BEFORE you loop
    writer = csv.writer(f)             # declare your writer on the file
    for rows in range(0, 4):           # do one loop per row
        myRow = []                     # remember all column values, clear list here
        for colVal in range(0, 10):    # compute 10 columns
            m = colVal * rows          # heavy computing (your m or n)
            myRow.append(m)            # store column in row-list
        writer.writerow(myRow)         # write list containing all columns
with open('ex1.csv', 'r') as r:        # read it back in
    print(r.readlines())               # and print it
Output:
['0,0,0,0,0,0,0,0,0,0\r\n', '0,1,2,3,4,5,6,7,8,9\r\n', '0,2,4,6,8,10,12,14,16,18\r\n', '0,3,6,9,12,15,18,21,24,27\r\n']
which translates to a file of
0,0,0,0,0,0,0,0,0,0
0,1,2,3,4,5,6,7,8,9
0,2,4,6,8,10,12,14,16,18
0,3,6,9,12,15,18,21,24,27
You can also stuff a copy of each row list (copy it with myRow[:]) into another list and use writer.writerows([[1,2,3,4], [4,5,6,7]]) to write all your rows in one go; see the sketch after the links below.
See: https://docs.python.org/2/library/csv.html#writer-objects or https://docs.python.org/3/library/csv.html#writer-objects
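For example, a sketch of that writerows variant of the loop above:
import csv

allRows = []
for rows in range(0, 4):
    myRow = []
    for colVal in range(0, 10):
        myRow.append(colVal * rows)
    allRows.append(myRow[:])  # store a copy of the row list

with open('ex1.csv', 'w') as f:
    csv.writer(f).writerows(allRows)  # write all rows in one go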
import csv
s = open('models.csv')
checkIt = csv.reader(s)
o = open('data.csv')
csv_o = csv.reader(o)
for c in checkIt:
    abc = c[0].split(".")
    abcd = abc[2]
    commodity_type = abcd[6:]
    print(commodity_type)
    for csv in csv_o:
        print(csv)
        print(commodity_type)
The inner print calls execute only one time, but they should execute 4 times because I have 4 rows in models.csv.
Please suggest a solution so that the nested for loop runs once for each row in models.csv.
Try resetting the file pointer that csv_o points to.
for csv in csv_o:
    print(csv)
    print(commodity_type)
o.seek(0)
That should automatically make the CSV reader begin reading from the start of the file from the next iteration onwards.
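Alternatively, if data.csv is small enough, you can read it into a list once and loop over the list instead; unlike a reader, a list can be iterated any number of times. A sketch, keeping the parsing from the question:
import csv

s = open('models.csv')
checkIt = csv.reader(s)
with open('data.csv') as o:
    data_rows = list(csv.reader(o))  # materialize the inner file once

for c in checkIt:
    commodity_type = c[0].split(".")[2][6:]
    print(commodity_type)
    for row in data_rows:
        print(row)
        print(commodity_type)
s.close()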
I am currently trying to count repeated values in a column of a CSV file and write the count to another CSV column in Python.
For example, my CSV file :
KeyID GeneralID
145258 KL456
145259 BG486
145260 HJ789
145261 KL456
What I want to achieve is to count how many data have the same GeneralID and insert it into a new CSV column. For example,
KeyID Total_GeneralID
145258 2
145259 1
145260 1
145261 2
I have tried to split each column using the split method but it didn't work so well.
My code :
case_id_list_data = []
with open(file_path_1, "rU") as g:
    for line in g:
        case_id_list_data.append(line.split('\t'))
# print case_id_list_data[0][0]  # the result is dissatisfying
# I'm stuck here..
And if you are averse to pandas and want to stay with the standard library:
Code:
import csv
from collections import Counter
with open('file1', 'rU') as f:
    reader = csv.reader(f, delimiter='\t')
    header = next(reader)
    lines = [line for line in reader]

counts = Counter([l[1] for l in lines])
new_lines = [l + [str(counts[l[1]])] for l in lines]

with open('file2', 'wb') as f:
    writer = csv.writer(f, delimiter='\t')
    writer.writerow(header + ['Total_GeneralID'])
    writer.writerows(new_lines)
Results:
KeyID GeneralID Total_GeneralID
145258 KL456 2
145259 BG486 1
145260 HJ789 1
145261 KL456 2
You have to divide the task into three steps:
1. Read the CSV file
2. Generate the new column's value
3. Write the value back to the file
import csv
import fileinput
import sys
# 1. Read the CSV file
# This opens the CSV and reads the values from it.
with open("dev.csv") as filein:
    reader = csv.reader(filein, skipinitialspace=True)
    xs, ys = zip(*reader)

result = ["Total_GeneralID"]

# 2. Generate the new column's value
# This loop counts the "GeneralID" elements.
for i in range(1, len(ys), 1):
    result.append(ys.count(ys[i]))

# 3. Write the value back to the file
# This loop writes the new column.
for ind, line in enumerate(fileinput.input("dev.csv", inplace=True)):
    sys.stdout.write("{} {}, {}\n".format("", line.rstrip(), result[ind]))
I haven't used a temp file or any high-level module like pandas.
import pandas as pd
# read your CSV into a dataframe
df = pd.read_csv('file_path_1')
# generate Total_GeneralID by counting the values in the GeneralID column and extracting the occurrence count for the current row
df['Total_GeneralID'] = df.GeneralID.apply(lambda x: df.GeneralID.value_counts()[x])
df = df[['KeyID','Total_GeneralID']]
Out[442]:
KeyID Total_GeneralID
0 145258 2
1 145259 1
2 145260 1
3 145261 2
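Note that the lambda above recomputes value_counts once per row. Computing the counts a single time and mapping them is equivalent and much faster on larger files (a sketch):
import pandas as pd

df = pd.read_csv('file_path_1')
counts = df.GeneralID.value_counts()  # computed once for the whole column
df['Total_GeneralID'] = df.GeneralID.map(counts)
df = df[['KeyID', 'Total_GeneralID']]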
You can use the pandas library:
first read_csv
get the counts of values in column GeneralID with value_counts and rename the result to the output column name
join it to the original DataFrame
import pandas as pd
df = pd.read_csv('file')
s = df['GeneralID'].value_counts().rename('Total_GeneralID')
df = df.join(s, on='GeneralID')
print (df)
KeyID GeneralID Total_GeneralID
0 145258 KL456 2
1 145259 BG486 1
2 145260 HJ789 1
3 145261 KL456 2
Use csv.reader instead of the split() method.
It's easier.
Thanks
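For example, a sketch of that change applied to the code from the question (assuming the file really is tab-separated):
import csv

case_id_list_data = []
with open(file_path_1, "rU") as g:
    reader = csv.reader(g, delimiter='\t')
    for row in reader:
        case_id_list_data.append(row)  # each row is already a list of fields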
OK, so I need to take an input file in the following format:
5 5 12
1 1
1 2
2 1
2 3
3 1
3 2
3 4
4 2
4 4
1 2
2 3
5 5
and then turn it into a matrix where the left-hand column represents the customer ID and the right-hand column is the item ID.
I can get Python to import the data; however, I'm struggling to manipulate it into the form I need.
An example of how I need the data would be:
Item 1 Item 2 Item 3
Customer1: 1 2 0
Customer2: 1 0 2
Here are the methods I've used to import the data:
def DataImport(filename):
    Data = []
    with open(filename) as f:
        for line in f:
            Data.append([int(v) for v in line.split()])
    return Data
(incomplete)
import numpy as np

def ImportData(filename):
    if filename == "history.txt":
        Option1 = np.loadtxt(filename, skiprows=1)
        CustomerID = np.loadtxt(filename, skiprows=1, usecols=(0,))
        ItemID = np.loadtxt(filename, skiprows=1, usecols=(1,))
        print(CustomerID)
        print(ItemID)
        print(Option1)
        Option11 = []
        for i in Option1:
            if i[i][0] == i[0]:
                Option11.append(i[0][1])
        print(Option11)
    else:
        Data = np.loadtxt(filename)
def ImportData(filename):
    rawdata = {}
    File = open(filename, "r")
    for line in File:
        rawdata[len(rawdata)+1] = line.rstrip("\n")
    File.close()
    print(rawdata)
import csv

def ImportData2(filename):
    with open(filename, newline='') as file:
        reader = csv.reader(file, delimiter=' ')
        next(reader)
        TRvalues = dict(reader)
        print(TRvalues)
I think the numpy method presents the data in the way that would be easiest to work with, but I'm not sure how to get Python to iterate using the customer ID as a key and then add up the values in the way previously mentioned.
I'm not sure if I've explained that well but I can answer questions if you need more clarification.
In principle I highly recommend using something like pandas (http://pandas.pydata.org/pandas-docs/stable/io.html); in particular,
pandas.read_table
will generate a pandas DataFrame that you can manipulate as you like.
However, if you don't like that, you can also just iterate by column rather than by row, simply by using
for col in matrix.T:
    instructions for columns...
rather than
for row in matrix:
    instructions for rows...
Hope it helps.
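As a concrete example of the pandas route, pd.crosstab builds exactly the customer-by-item count matrix described in the question. A sketch (the column names are assumptions, since the file has no header, and skiprows=1 skips the "5 5 12" line as in the numpy attempt above):
import pandas as pd

# Each remaining line is a (customer, item) pair separated by whitespace
df = pd.read_csv('history.txt', sep=r'\s+', skiprows=1,
                 names=['CustomerID', 'ItemID'])
matrix = pd.crosstab(df['CustomerID'], df['ItemID'])  # counts per customer/item pair
print(matrix)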