How to add random values to a column of a CSV file? - Python

I want to append a column to an existing CSV file with 3 million rows using Python. Then I want to fill the column with random values in the range (1, 50). Something like this:
Input CSV file:
awareness trip amount
25 1 30
30 2 35
Output CSV file:
awareness trip amount size
25 1 30 49
30 2 35 20
How can I do this?
The code I have written is as follows:
with open('2019-01-1.csv', 'r') as CSVIN:
    with open('2019-01-2.csv', 'w') as CSVOUT:
        CSVWrite = csv.writer(CSVOUT, lineterminator='\n')
        CSVRead = csv.reader(CSVIN)
        NewDict = []
        row = next(CSVRead)
        row.append('Size')
        NewDict.append(row)
        print(NewDict.append(row))
        for row in CSVRead:
            randSize = np.random.randint(1, 50)
            row.append(row[0])
            NewDict.append(row)
        CSVWrite.writerows(NewDict)

Check out this answer: Python Add string to each line in a file
I've found it much easier to use a plain with open(...) block for files instead of importing csv or other special file-type libraries, unless my use case is very specific.
So in your case, it would be something like:
import random

input_file_name = "2019-01-1.csv"
output_file_name = "2019-01-2.csv"

with open(input_file_name, 'r') as f_in, open(output_file_name, 'w') as f_out:
    # The first line is the header, so it gets the new column name
    header = next(f_in).rstrip('\n')
    f_out.write(header + ',Size\n')
    # Every other line gets a random value; randint(1, 50) includes both 1 and 50
    for line in f_in:
        f_out.write('{},{}\n'.format(line.rstrip('\n'), random.randint(1, 50)))
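If you would rather stay with the csv module, the same idea can be streamed row by row, which matters for a 3-million-row file. This is only a minimal sketch, assuming a comma-delimited file whose first row is a header; the function name add_random_column and its parameters are hypothetical, not something from the question or the linked answer.

```python
import csv
import random


def add_random_column(src_path, dst_path, column_name="size", low=1, high=50):
    """Copy src_path to dst_path, appending one column of random integers.

    Assumes the first row is a header. Rows are processed one at a time,
    so the whole file is never held in memory.
    """
    with open(src_path, newline="") as src, open(dst_path, "w", newline="") as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst)
        header = next(reader, None)
        if header is None:
            return  # empty input: nothing to copy
        writer.writerow(header + [column_name])
        for row in reader:
            # random.randint is inclusive on both ends
            writer.writerow(row + [random.randint(low, high)])
```

Calling add_random_column('2019-01-1.csv', '2019-01-2.csv') would then write the new file next to the old one.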

Related

Why are the digits in my numbers printing separately rather than together?

This is an example of my code. It is not the whole code, just the part where I am having trouble. Does anyone understand why it prints the digits separately rather than the full numbers, like 104.0 and 96.0? They are strings, but I cannot convert them to floats because of the period in some of the values.
with open('file.csv','w') as file:
    with open('file2.csv', 'r') as file2:
        reader = csv.DictReader(file2)
        file.write(','.join(row))
        file.write('\n')
        for num,row in enumerate(reader):
            outrow = []
            for x in row['numbers']:
                print(x)
When I execute this, it prints out the values I am looking for but separately like this:
1
0
4
.
0
9
6
.
0
N
a
N
1
3
6
.
0
N
a
N
6
2
.
0
The 'NaN' values are ones I am changing, but I have to use the rest of the numbers. I cannot insert them into a list because they will end up separated, right?
Seems like you want something like:
with open('file.csv','w') as file:
    with open('file2.csv', 'r') as file2:
        reader = csv.DictReader(file2)
        file.write(','.join(row))
        file.write('\n')
        for num,row in enumerate(reader):
            number = row['numbers']
            print(number)
for x in row['numbers'] means "iterate over every individual character in the numbers cell/value", because the cell is a string.
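To make the difference concrete, here is a small sketch of iterating over a cell versus converting the whole cell. The helper parse_cell and its nan_default parameter are hypothetical names used only for illustration. It relies on float() accepting strings such as '104.0' and 'NaN', so the period is not an obstacle to the conversion.

```python
import math


def parse_cell(cell, nan_default=None):
    """Convert one CSV cell to a float, replacing NaN with nan_default."""
    value = float(cell)  # float() accepts '104.0' and also 'NaN'
    if math.isnan(value):
        return nan_default
    return value


cell = "104.0"
chars = [ch for ch in cell]  # iterating a string yields single characters
number = parse_cell(cell)    # the whole cell converted in one step
```

Numbers converted this way can be appended to a list as whole values; they only end up separated if the string itself is iterated.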
Also, what are you doing here?
file.write(','.join(row))
file.write('\n')
You don't have a row variable/object at that point (at least not visible in your example). Are you trying to write the header? Presumably it's working, so you defined row before, maybe like, row = ['col1', 'numbers']
If so, maybe take this general approach:
import csv

# Do your reading and processing in one step
rows = []
with open('input.csv', newline='') as f:
    reader = csv.DictReader(f)
    for row in reader:
        # do some work on row, like...
        number = row['numbers']
        if row['numbers'] == 'NaN':
            row['numbers'] = '-1'  # whatever you do with NaN
        rows.append(row)

# Do your writing in another step
my_field_names = rows[0].keys()
with open('output.csv', 'w', newline='') as f:
    # Use the provided writer, in addition to reader
    writer = csv.DictWriter(f, fieldnames=my_field_names)
    writer.writeheader()
    writer.writerows(rows)
At the very least, use the provided writer and DictWriter classes; they will make your life much easier.
I mocked up this sample CSV:
input.csv
id,numbers
id_1,100.4
id2,NaN
id3,23
and the program above produced this:
output.csv
id,numbers
id_1,100.4
id2,-1
id3,23

Deleting specific lines in csv files

The CSV file looks like this (with a thousand more lines):
step0:
141
step1:
140
step2:
4
step3:
139
step4:
137
step5:
136
15
step6:
134
13
139
step7:
133
19
I am trying to read each line and remove the lines that contain only a number greater than, say, 27.
My CSV file was originally a text file, so all of the lines are read as strings.
What I have done is the following:
loop through the lines that do not include "step"
convert them to float
remove all that are greater than 27
Now I want to save (overwrite) my file after deleting these numbers, but I am stuck.
Could someone assist?
import csv
f = open('list.csv', 'r')
reader = csv.reader(f, delimiter="\n")
for row in reader:
    for e in row:
        if 'step' not in e:
            d=float(e)
            if d>27:
                del(d)
import csv

with open('output.csv', 'w+') as output_file:
    with open('input.csv') as input_file:  # change your file name here
        reader = csv.reader(input_file, delimiter='\n')
        line_index = 0  # debugging
        for row in reader:
            line_index += 1
            line = row[0]
            if 'step' in line:
                output_file.write(line)
                output_file.write('\n')
            else:
                try:
                    number = int(line)  # you can use float, but then 30 becomes 30.0
                    if number <= 27:
                        output_file.write(line)
                        output_file.write('\n')
                except ValueError:
                    # the line is neither a step label nor an integer
                    print("Abnormal data at line %s" % line_index)
I assume that your input file is input.csv. This program writes to a new output file, output.csv, which will contain:
step0:
step1:
step2:
4
step3:
step4:
step5:
15
step6:
13
step7:
19
One solution with the re module:
import re

with open('file.txt', 'r') as f_in:
    data = f_in.read()

data = re.sub(r'\b(\d+)\n*', lambda g: '' if int(g.group()) > 27 else g.group(), data)

with open('file_out.txt', 'w') as f_out:
    f_out.write(data)
The content of file_out.txt will be:
step0:
step1:
step2:
4
step3:
step4:
step5:
15
step6:
13
step7:
19
26
import csv

with open('list.csv', 'r', newline='') as old_list:
    with open('new_list.csv', 'w', newline='') as new_list:
        reader = csv.reader(old_list)
        writer = csv.writer(new_list, lineterminator='\n')
        for row in reader:
            for e in row:  # each row holds a single field
                if 'step' not in e:
                    if float(e) <= 27:
                        writer.writerow([e])
                else:
                    writer.writerow([e])
Essentially you're just going to copy the rows you want over to your new file. If the line is a step label, we write it. If the line is a number that is 27 or less, we write it. If you'd prefer to just overwrite your file when you're done:
import csv

rows_to_keep = []
with open('list.csv', 'r', newline='') as old_list:
    reader = csv.reader(old_list)
    for row in reader:
        for e in row:  # each row holds a single field
            if 'step' not in e:
                if float(e) <= 27:
                    rows_to_keep.append([e])
            else:
                rows_to_keep.append([e])

# Reopen the same file for writing only after reading has finished
with open('list.csv', 'w', newline='') as new_list:
    writer = csv.writer(new_list, lineterminator='\n')
    writer.writerows(rows_to_keep)
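If the file is too large to keep the surviving rows in a list, one alternative is to write to a temporary file and then swap it into place. The following is a minimal sketch under that assumption; the function name drop_large_numbers and its limit parameter are hypothetical, and it reads the file as plain text rather than through the csv module.

```python
import os
import tempfile


def drop_large_numbers(path, limit=27):
    """Rewrite path in place, dropping numeric lines greater than limit.

    Lines that are not plain numbers (such as 'step0:') are kept as-is.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temporary file in the same directory, then swap it in.
    with open(path) as src, tempfile.NamedTemporaryFile(
        "w", dir=directory, delete=False
    ) as tmp:
        for line in src:
            try:
                keep = float(line.strip()) <= limit
            except ValueError:
                keep = True  # not a number: a step label or other text
            if keep:
                tmp.write(line)
    # Both files are closed here, so the replacement is safe.
    os.replace(tmp.name, path)
```

Because the original is only replaced after the new file has been fully written, an error halfway through leaves list.csv untouched.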

Merge TSV files into one CSV by extracting particular columns and naming each column after its file

I have multiple TSV files in a folder. From each file I have to extract the 1st column, which is the abundance, and the 5th column, which is the ID; the columns have no headers. I have to merge these columns from all files into one file and use each file's name as the header of its column. I also have to check whether every ID is present in each file; if not, the value should be zero.
One of the sample files File_Name1 looks like:
0.11 31 31 U 0 unclassified
99.89 29001 0 - 1 root
99.89 29001 0 - 131567 cellular organisms
99.89 29001 64 D 2 Bacteria
59.94 17401 270 - 1783272 Terrabacteria group
53.47 15522 8 P 1239 Firmicutes
52.10 15127 998 C 186801 Clostridia
37.83 10982 494 O 186802 Clostridiales
20.61 5983 89 F 186803 Lachnospiraceae
16.95 4922 8 G 1506553 Lachnoclostridium
14.53 4219 0 S 84030 [Clostridium] saccharolyticum
Similarly, I have multiple files. The file I want looks like this:
ID File_Name1 File_Name2
186802 16.95 37.88
1506553 20.61 0
84030 14.53 0.05
I have tried something like this:
import glob
import csv
directory = "C:\kraken\kraken_13266"
txt_files = glob.glob(directory+"\*.kraken")
for txt_file in txt_files:
    with open(txt_file, "rt") as input_file, open('output.csv', "wt") as out_file:
        in_txt = csv.reader(input_file, delimiter='\t')
        for line in in_txt:
            firstcolumns = line[:1]
            lastcolumns = line[-2].strip().split(",")
            allcolumns = firstcolumns + lastcolumns
I'm stuck at this point. How should I proceed?
The following should do what you are trying to do:
from collections import defaultdict
import glob
import csv
import os

ids = defaultdict(dict)  # e.g. {186802: {'File_Name1': 16.95, 'File_Name2': 37.88}}
kraken_files = glob.glob('*.kraken')

for kraken_filename in kraken_files:
    with open(kraken_filename, 'r', newline='') as f_input:
        csv_input = csv.reader(f_input, delimiter='\t')
        file_name = os.path.splitext(kraken_filename)[0]
        for row in csv_input:
            # column 5 is the ID, column 1 is the abundance
            ids[int(row[4])].update({file_name : float(row[0])})

with open('output.csv', 'w', newline='') as f_output:
    fieldnames = ['ID'] + [os.path.splitext(filename)[0] for filename in kraken_files]
    # restval=0 fills in 0 for any file that does not contain a given ID
    csv_output = csv.DictWriter(f_output, fieldnames=fieldnames, restval=0)
    csv_output.writeheader()
    for id in sorted(ids.keys()):
        id_values = ids[id]
        id_values['ID'] = id
        csv_output.writerow(id_values)
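You will need to read all of the files before you can write the output file. A dictionary is used to store all the IDs, and for each ID a nested dictionary holds the value from each file that contains it. Because the DictWriter is created with restval=0, an ID that is missing from a file is written as 0 in that file's column.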
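The zero-filling can be checked without touching any files. This is only an illustrative sketch: io.StringIO stands in for the output file, and the merged dictionary is a hand-written stand-in for the data that would be collected from the .kraken files, using values from the desired output in the question.

```python
import csv
import io

# Hand-written stand-in for the collected data: ID -> {file name: abundance}
merged = {
    186802: {"File_Name1": 16.95, "File_Name2": 37.88},
    1506553: {"File_Name1": 20.61},  # this ID is absent from File_Name2
}

buffer = io.StringIO()
writer = csv.DictWriter(
    buffer, fieldnames=["ID", "File_Name1", "File_Name2"], restval=0
)
writer.writeheader()
for key in sorted(merged):
    # Keys missing from the row dictionary are filled with restval
    writer.writerow({"ID": key, **merged[key]})

output = buffer.getvalue()
```

Here the row for 1506553 has no File_Name2 entry, so DictWriter writes 0 in that column.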

How to extract and copy lines from csv file to another csv file in python?

Let's suppose that I have a large amount of data in a CSV file. This is a set of lines from my file:
frame.number frame.len frame.cap_len frame.Type
1 100 100 ICMP_tt
2 64 64 UDP
3 100 100 ICMP_tt
4 87 64 ICMP_nn
I want to extract 30% of this file and put it in another CSV file.
I tried using this code, but it selects by column rather than by line:
import csv
data = [] #Buffer list
with open("E:\\Test.csv", "rb") as the_file:
    reader = csv.reader(the_file, delimiter=",")
    for line in reader:
        try:
            new_line = [line[0], line[1]]
            #Basically write the rows to a list
            data.append(new_line)
        except IndexError as e:
            print e
            pass
with open("E:\\NewTest.csv", "w+") as to_file:
    writer = csv.writer(to_file, delimiter=",")
    for new_line in data:
        writer.writerow(new_line)
Python's csv module uses a comma as the delimiter by default, so it is not necessary to specify the delimiter unless yours is different. I assume you have the following CSV file:
frame.number,frame.len,frame.cap_len,frame.Type
1,100,100,ICMP_tt
2,64,64,UDP
3,100,100,ICMP_tt
4,87,64,ICMP_nn
Each line in the file represents a row.
import csv

# read
data = []
with open('test.csv', 'r', newline='') as f:
    f_csv = csv.reader(f)
    # header = next(f_csv)
    for row in f_csv:
        data.append(row)

# write the first 30% of the rows
with open('newtest.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    for i in range(int(len(data) * 30 / 100)):
        writer.writerow(data[i])
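The code above keeps the first 30% of the lines. If a random 30% is wanted instead, random.sample can pick the rows. This is a sketch of that variant, not what the question necessarily asked for; the function name sample_rows and its percent and seed parameters are hypothetical, and it assumes the file has a header row that should always be copied.

```python
import csv
import random


def sample_rows(src_path, dst_path, percent=30, seed=None):
    """Write the header plus a random percent of the data rows to dst_path."""
    with open(src_path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)  # assumes the file has a header row
        rows = list(reader)
    rng = random.Random(seed)
    count = len(rows) * percent // 100
    # Sample row positions, then sort them so the original order is kept.
    chosen = sorted(rng.sample(range(len(rows)), count))
    with open(dst_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows[i] for i in chosen)
    return count
```

Passing a fixed seed makes the selection repeatable between runs.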

I'm reading a CSV file with 30 lines but only printing 29 in the shell

I have a CSV file that I'm reading from and returning a column of data. There are 30 rows of data, but when I print the data I only get 29.
def readMonth(fileName):
    infile = open(fileName,"rb")
    reader = csv.reader(infile)
    month = []
    for i in range(0,29):
        data = next(reader)
        month.append(int(float(data[25])))
    infile.close()
    return month
When I print I should have 30 lines not 29. What do I need to change in order to print all 30 lines?
for i in range(0,29):
This starts at 0 and ends at 28, so the loop runs only 29 times. Did you mean range(30)? Though I'm not sure why you're not just looping over the reader.
Maybe something like this is what you're looking for?
#!/usr/local/cpython-3.3/bin/python
import csv

def readMonth(fileName):
    infile = open(fileName,"r")
    reader = csv.reader(infile)
    month = []
    for row in reader:
        month.append(int(row[0]))
    infile.close()
    return month

month = readMonth('foo.csv')
print(month)
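A variant that keeps the column index from the question and closes the file automatically might look like the following. It is a sketch under the assumption that every non-blank row has a numeric value in the requested column; read_column and column_index are hypothetical names.

```python
import csv


def read_column(file_name, column_index):
    """Return one column of a CSV file as integers, reading every row."""
    values = []
    with open(file_name, newline="") as infile:
        for row in csv.reader(infile):
            if not row:
                continue  # skip blank lines
            # int(float(...)) accepts values such as '12.0' as well as '12'
            values.append(int(float(row[column_index])))
    return values
```

Because the loop runs over the reader itself, the number of rows never has to be hard-coded, so read_column(fileName, 25) returns all 30 values for a 30-row file.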
