I want to read a CSV file, save its contents to a list, and count the occurrences of every word, but I get a "list index out of range" error in Python.
The CSV file has 21291918 rows. The following is a screenshot of the CSV file.
The following is my code:
from datetime import date,datetime
import numpy as np
import xlrd
import codecs
import time
import re
import os
import jieba
from itertools import repeat
import sys
import csv
maxInt = sys.maxsize
while True:
    # decrease the maxInt value by factor 10
    # as long as the OverflowError occurs.
    try:
        csv.field_size_limit(maxInt)
        break
    except OverflowError:
        maxInt = int(maxInt/10)
sys.setrecursionlimit(100000000)
jieba.load_userdict('./data/dict.txt')
file_name = 'Real/B_Seg_output.csv'
with open (file_name, 'r', encoding="utf-8") as csvfile:
    reader = csv.reader(csvfile)
    column = [row[0] for row in reader]
    author_list = list(column)
#print(author_list)
print('-'*30)
with open('Real/Other_Content_Count_All.csv', 'a', newline='', encoding='utf-8') as csvfile:
    csvfile.write('回復內容\n')
j=0
cnt = set(author_list)
for i in cnt:
    j += 1
    print(j)
    if(j % 10000 == 0):
        print('*'*10+str(j)+" is sleeping"+'*'*10)
        time.sleep(10)
    if author_list.count(i)>0:
        #print(i+',',author_list.count(i))
        #print(i)
        #print(author_list.count(i))
        with open('Real/First_Author_Count_All.csv', 'a', newline='', encoding='utf-8') as csvfile:
            csvfile.write(i+','+str(author_list.count(i))+'\n')
When I run this code, I get the following error:
Traceback (most recent call last):
  File ".\count_All_Other_Content.py", line 38, in <module>
    column = [row[0] for row in reader]
  File ".\count_All_Other_Content.py", line 38, in <listcomp>
    column = [row[0] for row in reader]
IndexError: list index out of range
I searched for related problems and suspect the cause is that some lines are empty or contain only whitespace.
However, I could not find a solution. I then suspected that the number of rows in the CSV exceeds the list size limit.
I need to use this CSV file to count the number of occurrences of each word, and I don't know how to solve this.
Maybe you can just change your column = [row[0] for row in reader] line to one of these:
column = [row[0] if row else None for row in reader] - This will preserve the indexes if that's important
column = [row[0] for row in reader if row] - This will skip the empty rows
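A minimal sketch of the second variant, using an in-memory file so it can be run on its own. The sample text and the helper name first_column are made up for illustration and are not part of the original script:

```python
import csv
import io


def first_column(lines):
    """Return the first field of every non-empty CSV row."""
    reader = csv.reader(lines)
    # csv.reader yields an empty list for a blank line, so `if row` skips it
    return [row[0] for row in reader if row]


# Hypothetical sample data with a blank line in the middle
sample = io.StringIO("apple,1\n\nbanana,2\n")
print(first_column(sample))
```

With the blank line skipped, no row is ever indexed while empty, so the IndexError cannot occur.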
If a row (for example the header) is empty, then indexing it raises an IndexError.
column = []
for row in reader:
    if row:
        column.append(row[0])
Use this loop in place of the list comprehension, after the reader line and before the author_list line. It takes the first field only when the row contains something; otherwise it moves on to the next row.
I think the fastest way to do this is to use readlines, like this:
with open('myfile') as f:
    lines = f.readlines()
Now lines is a list of all the lines in the file. An empty line shows up in the list as a bare newline ('\n'), which you can easily check for. You can also strip the '\r' and '\n' characters.
If you want the number of distinct words, you can use len(set(lines)). If you want the count of each one instead, you can use the numpy.unique function with return_counts=True, which gives you the array of unique values and the count of each one.
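For per-value counts, the standard library's collections.Counter is an alternative to numpy.unique (a swapped-in technique, not the one this answer names). A minimal sketch with made-up sample values; it makes a single pass over the list instead of calling list.count once per unique value:

```python
from collections import Counter

# Hypothetical stand-in for the values read from the CSV file
values = ["apple", "banana", "apple", "cherry", "apple", "banana"]

counts = Counter(values)  # one pass over the data

# most_common() lists the values from most to least frequent
for word, n in counts.most_common():
    print(word, n)
```

On a file with millions of rows this matters, because list.count rescans the entire list for every distinct value.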
I'm trying to print lines from a CSV in random order.
Let's say the CSV has the 10 lines below:
1,One
2,Two
3,Three
4,Four
5,Five
6,Six
7,Seven
8,Eight
9,Nine
10,Ten
If I write code like the below, it prints each line as a list, in the same order as in the CSV:
import csv
with open("MyCSV.csv") as f:
    reader = csv.reader(f)
    for row_num, row in enumerate(reader):
        print(row)
Instead, I'd like the order to be random.
It's just a print for now. Later I'll pass each line as a list to a function.
This should work. You can reuse the lines list in your code as it is shuffled.
import random
with open("tmp.csv", "r") as f:
    lines = f.readlines()
random.shuffle(lines)
print(lines)
import csv
import random
csv_elems = []
with open("MyCSV.csv") as f:
    reader = csv.reader(f)
    for row_num, row in enumerate(reader):
        csv_elems.append(row)
random.shuffle(csv_elems)
print(csv_elems[0])
As you can see, I'm just printing the first element; you can iterate over the shuffled list and print every row.
You can define a list, append all the rows of the CSV file to it, then shuffle it and print the rows. Assume the list is named temp:
import csv
import random
temp = []
with open("your csv file.csv") as file:
    reader = csv.reader(file)
    for row_num, row in enumerate(reader):
        temp.append(row)
random.shuffle(temp)
for i in range(len(temp)):
    print(temp[i])
Why not use pandas to handle the CSV?
import pandas as pd
data = pd.read_csv("MyCSV.csv")
And to get the samples you are looking for, just write:
data.sample() # returns one random row
data.sample(5) # returns 5 random rows
Also, if you want to pass each row to a function:
data_after_function = data.apply(function_name, axis=1)
Inside the function you can cast the row into a list with list().
Hope this helps!
A couple of things to do:
1. Store the CSV in a sequence of some sort
2. Get the data randomly
For 1, it’s probably best to use some form of sequence comprehension (I’ve gone for tuples nested in a list, as it seems you want the row numbers and we can’t shuffle a dictionary).
We can use the random module for 2.
import random
import csv
with open("MyCSV.csv") as f:
    reader = csv.reader(f)
    my_csv = [(row_num, row) for row_num, row in enumerate(reader)]
# get only 1 item from the list at random
random_row = random.choice(my_csv)
# randomise the order of all the rows (shuffles in place and returns None)
random.shuffle(my_csv)
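Note that random.shuffle reorders a list in place and returns None, while random.sample(seq, len(seq)) returns a new shuffled list and leaves the original alone. A small sketch of the difference, using made-up rows in the same (row_num, row) shape:

```python
import random

# Hypothetical rows in the (row_num, row) shape used above
rows = [(0, ["1", "One"]), (1, ["2", "Two"]), (2, ["3", "Three"])]

shuffled_copy = random.sample(rows, len(rows))  # new list, original untouched
result = random.shuffle(rows)                   # shuffles in place, returns None

print(result)
print(sorted(shuffled_copy) == sorted(rows))
```

Which one to use depends on whether the original order is still needed afterwards.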
I am trying to get a specific cell value from a CSV file and count the number of rows, but if I count the rows before reading the cell, I get an error.
My code is:
import os
import sys
import csv
with open('C:\Users\Administrator\Desktop\python test\update_test\datalog.csv','rb') as csvfile:
    data= csv.reader(csvfile)
    row_count=sum(1 for row in data)
    data=list(data)
    text=data[0][0]
    print(text)
    print row_count
You can't just read from a file twice. sum(1 for row in data) already read all the data, so data = list(data) is an empty list, because the file pointer is at the end of the file and won't return more data unless you rewind the file to the start.
You don't even need to use the sum() call, remove it. You can get the same count with len(data) after you used list() on it:
with open('C:\Users\Administrator\Desktop\python test\update_test\datalog.csv','rb') as csvfile:
    data= csv.reader(csvfile)
    data = list(data)
    text = data[0][0]
    print(text)
    print len(data)
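The exhaustion behaviour described above can be reproduced with an in-memory file. This is a minimal Python 3 sketch; the io.StringIO contents are a made-up stand-in for datalog.csv:

```python
import csv
import io

# Hypothetical in-memory file standing in for datalog.csv
f = io.StringIO("a,b\nc,d\ne,f\n")

reader = csv.reader(f)
row_count = sum(1 for _ in reader)  # consumes the whole file
leftover = list(reader)             # nothing left to read, so this is empty

f.seek(0)                           # rewind to the start of the file
rows = list(csv.reader(f))          # a fresh reader sees every row again

print(row_count, leftover, rows[0][0])
```

Rewinding works, but reading once into a list and using len() on it avoids the second pass entirely.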
Hi, I am trying to iterate over each row in a CSV file with Python and create a new CSV file for each row. My thought process is to open the file, loop through each row, and for each row create a file named n_file.csv (where 'n' is the iteration number). Here is my code:
import csv
csvfile = open('sample.csv','rb')
csvFileArray = []
for row in csv.reader(csvfile, delimiter = '.'):
    csvFileArray.append(row)
    print(row)
    n = 0
    n += 1
    file = open(str(n) + "_file.csv", 'w+')
    file.write(str(row))
    print(n) # returns 1 every time
Unfortunately this is not iterating properly (it only creates a file named 1_file.csv and overwrites it each time). How can I fix this?
for row in csv.reader(csvfile, delimiter='.'):
    csvFileArray.append(row)
    print(row)
    n = 0 # << you do n=0 each loop!!
    n += 1
So it would be better to write:
for idx, row in enumerate(csv.reader(csvfile, delimiter='.'), 1):
    csvFileArray.append(row)
    print(row)
    file = open(str(idx) + "_file.csv", 'w+') # enumerate does the counting for you, starting at 1
    file.write(str(row))
You set n to 0 each time, because you declared it inside the loop. Declare it before the for statement.
Try this:
import csv
with open('sample.csv','rb') as csvfile:
    for i, row in enumerate(csv.reader(csvfile, delimiter = '.')):
        with open("{}_file.csv".format(i), "w") as file:
            file.write(str(row))
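A Python 3 sketch of the same idea, under a few assumptions that are not in the answers above: the rows are made up, the files go into a temporary directory, numbering starts at 1, and each row is written with csv.writer so the output file contains a,b rather than the string form of a list:

```python
import csv
import os
import tempfile

# Hypothetical rows standing in for the contents of sample.csv
rows = [["a", "b"], ["c", "d"], ["e", "f"]]

out_dir = tempfile.mkdtemp()  # scratch directory for the demo
for i, row in enumerate(rows, start=1):  # 1_file.csv, 2_file.csv, ...
    path = os.path.join(out_dir, "{}_file.csv".format(i))
    with open(path, "w", newline="") as out:
        csv.writer(out).writerow(row)  # proper CSV output for the row

created = sorted(os.listdir(out_dir))
print(created)
```

The with block also closes each output file, which the original loop never did.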
I have data in a CSV file, e.g.
1,2,3,4
4,5,6,7
What I want is to create an extra column that sums each row, so that the result looks like this:
1,2,3,4,10
4,5,6,7,22
And an extra row that sums the columns.
1,2,3,4,10
4,5,6,7,22
5,7,9,11,32
This is probably really basic but I could do with the help please?
#!/usr/bin/python
import sys
from itertools import imap, repeat
from operator import add
total = repeat(0) # See how to handle initialization without knowing the number of columns ?
for line in sys.stdin:
    l = map(int, line.split(','))
    l.append(sum(l))
    print ','.join(map(str,l))
    total = imap(add, total, l)
print ','.join(map(str, total))
I know, I'm treating Python like Haskell these days.
import csv
thefile = ["1,2,3,4","4,5,6,7"]
reader = csv.reader(thefile)
temp = []
final = []
# read your csv into a temporary array
for row in reader:
    temp.append([int(item) for item in row])
# start the column sums at zero, one per column
for item in temp[0]:
    final.append(0)
for row in temp:
    row_sum = 0
    for index, item in enumerate(row):
        row_sum += item  # total the items in each row
        final[index] = final[index] + item  # add each item to the column sum
    row.append(row_sum)  # add the row sum
# append the totals row only after the loop, so it is not summed into itself
final.append(sum(final))
temp.append(final)
print(temp)
import sys
import csv
def is_number(s):
    try:
        float(s)
        return True
    except ValueError:
        return False
with open(sys.argv[2], 'wb') as writefile:
    writer = csv.writer(writefile, delimiter=',',quotechar='"', quoting=csv.QUOTE_MINIMAL)
    with open(sys.argv[1], 'rb') as readfile:
        reader = csv.reader(readfile, delimiter=',', quotechar='"')
        for row in reader:
            writer.writerow(row+[sum([float(r) for r in row if is_number(r)])])
How about some pythonic list comprehension:
import csv
in_file = ["1,2,3,4","4,5,6,7"]
in_reader = list(csv.reader(in_file))
row_sum = [sum(map(int, row)) for row in in_reader]
# zip(*in_reader) transposes the parsed rows into columns (also works for multi-digit values)
col_sum = [sum(map(int, col)) for col in zip(*in_reader)]
for (index, row_run) in enumerate([map(int, row) for row in in_reader]):
    for data in row_run:
        print str(data)+",",
    print row_sum[index]
for data in col_sum:
    print str(data)+",",
print str(sum(col_sum))
Let me know if you need anything else.
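For readers on Python 3, the row and column sums can be sketched as follows. This is one possible approach rather than a port of any single answer above; it uses the two sample lines from the question and assumes every field is an integer:

```python
import csv

lines = ["1,2,3,4", "4,5,6,7"]  # sample data from the question

# Parse every field as an integer
rows = [[int(x) for x in row] for row in csv.reader(lines)]

# Append each row's total as an extra column
with_row_sums = [row + [sum(row)] for row in rows]

# zip(*...) transposes rows into columns, including the new sum column
col_sums = [sum(col) for col in zip(*with_row_sums)]

for row in with_row_sums + [col_sums]:
    print(",".join(map(str, row)))
```

Summing the columns after the row totals are appended gives the grand total in the bottom-right cell for free.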
I need a way to get a specific item (field) of a CSV. Say I have a CSV with 100 rows and 2 columns (comma separated). The first column holds emails, the second column passwords. For example, I want to get the password of the email in row 38, so I need only the item from the 2nd column of row 38.
Say I have a csv file:
aaaaa#aaa.com,bbbbb
ccccc#ccc.com,ddddd
How can I get only 'ddddd' for example?
I'm new to the language and tried some stuff with the csv module, but I don't get it...
import csv
mycsv = csv.reader(open(myfilepath))
for row in mycsv:
    text = row[1]
Following the comments on the SO question here, a better, more robust version would be:
import csv
with open(myfilepath, 'rb') as f:
    mycsv = csv.reader(f)
    for row in mycsv:
        text = row[1]
............
Update: If what the OP actually wants is the last string in the last row of the CSV file, there are several approaches that do not necessarily need csv. For example,
fulltxt = open(myfilepath, 'rb').read()
laststring = fulltxt.split(',')[-1]
This is not good for very big files because you load the complete text into memory, but it could be OK for small files. Note that laststring could include a newline character, so strip it before use.
And finally, if what the OP wants is the second string in line n (for n=2):
Update 2: This is now the same code as the one in the answer from J.F.Sebastian. (The credit goes to him):
import csv
line_number = 2
with open(myfilepath, 'rb') as f:
    mycsv = csv.reader(f)
    mycsv = list(mycsv)
    text = mycsv[line_number][1]
............
#!/usr/bin/env python
"""Print a field specified by row, column numbers from given csv file.
USAGE:
%prog csv_filename row_number column_number
"""
import csv
import sys
filename = sys.argv[1]
row_number, column_number = [int(arg, 10)-1 for arg in sys.argv[2:]]
with open(filename, 'rb') as f:
    rows = list(csv.reader(f))
    print rows[row_number][column_number]
Example
$ python print-csv-field.py input.csv 2 2
ddddd
Note: list(csv.reader(f)) loads the whole file in memory. To avoid that you could use itertools:
import itertools
# ...
with open(filename, 'rb') as f:
    row = next(itertools.islice(csv.reader(f), row_number, row_number+1))
    print row[column_number]
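A Python 3 sketch of the same islice idea, wrapped in a hypothetical helper named read_field (not part of the answer above). It uses zero-based indexes, reads from an in-memory file built from the question's sample rows, and returns None when the requested row or column does not exist:

```python
import csv
import io
from itertools import islice


def read_field(f, row_number, column_number):
    """Return one field (zero-based indexes) without loading every row."""
    reader = csv.reader(f)
    # islice skips the first row_number rows lazily; the default of None
    # is returned by next() when the file has too few rows
    row = next(islice(reader, row_number, row_number + 1), None)
    if row is None or column_number >= len(row):
        return None
    return row[column_number]


sample = io.StringIO("aaaaa#aaa.com,bbbbb\nccccc#ccc.com,ddddd\n")
print(read_field(sample, 1, 1))
```

Only the rows up to the requested one are parsed, so memory use stays flat even for very large files.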
import csv
def read_cell(x, y):
    with open('file.csv', 'r') as f:
        reader = csv.reader(f)
        y_count = 0
        for n in reader:
            if y_count == y:
                cell = n[x]
                return cell
            y_count += 1
print (read_cell(4, 8))
This example prints the cell in column 4 of row 8 (both zero-based) in Python 3.
There is one important point to note about the csv.reader object: it is not a list and is not subscriptable.
This works:
for r in csv.reader(file_obj): # file not closed
    print r
This does not:
r = csv.reader(file_obj)
print r[0]
So, you first have to convert to list type in order to make the above code work.
r = list( csv.reader(file_obj) )
print r[0]
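The same point in Python 3, as a small self-contained sketch. The two data lines are made up; csv.reader accepts any iterable of strings, so no file is needed for the demonstration:

```python
import csv

# Hypothetical data; csv.reader accepts any iterable of strings
lines = ["a,b", "c,d"]

reader = csv.reader(lines)
try:
    reader[0]  # a reader is an iterator, not a list
    error_name = None
except TypeError as exc:
    error_name = type(exc).__name__

rows = list(csv.reader(lines))  # materialize the rows, then index normally
print(error_name, rows[0])
```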
Finally I got it!
import csv
def select_index(index):
    csv_file = open('oscar_age_female.csv', 'r')
    csv_reader = csv.DictReader(csv_file)
    for line in csv_reader:
        l = line['Index']
        if l == index:
            print(line[' "Name"'])
select_index('11')
"Bette Davis"
The following may be what you are looking for:
import pandas as pd
df = pd.read_csv("table.csv")
print(df["Password"][row_number])
#where row_number is 38 maybe
import csv
inf = csv.reader(open('yourfile.csv','r'))
for row in inf:
    print row[1]