I want to read a CSV file save to list and count the numbers of every word, but I got some error is about list index out of range in Python.
I have 21291918 rows in the CSV file. The following is a screenshot of the CSV file.
The following is my code:
from datetime import date,datetime
import numpy as np
import xlrd
import codecs
import time
import re
import os
import jieba
from itertools import repeat
import sys
import csv
maxInt = sys.maxsize
while True:
# decrease the maxInt value by factor 10
# as long as the OverflowError occurs.
try:
csv.field_size_limit(maxInt)
break
except OverflowError:
maxInt = int(maxInt/10)
sys.setrecursionlimit(100000000)
jieba.load_userdict('./data/dict.txt')
file_name = 'Real/B_Seg_output.csv'
with open (file_name, 'r', encoding="utf-8") as csvfile:
reader = csv.reader(csvfile)
column = [row[0] for row in reader]
author_list = list(column)
#print(author_list)
print('-'*30)
with open('Real/Other_Content_Count_All.csv', 'a', newline='', encoding='utf-8') as csvfile:
csvfile.write('回復內容\n')
j=0
cnt = set(author_list)
for i in cnt:
j += 1
print(j)
if(j % 10000 == 0):
print('*'*10+str(j)+" is sleeping"+'*'*10)
time.sleep(10)
if author_list.count(i)>0:
#print(i+',',author_list.count(i))
#print(i)
#print(author_list.count(i))
with open('Real/First_Author_Count_All.csv', 'a', newline='', encoding='utf-8') as csvfile:
csvfile.write(i+','+str(author_list.count(i))+'\n')
When I run this code, I got the following problem:
Traceback (most recent call last):
File ".\count_All_Other_Content.py", line 38, in <module>
column = [row[0] for row in reader]
File ".\count_All_Other_Content.py", line 38, in <listcomp>
column = [row[0] for row in reader]
IndexError: list index out of range
I searched the related problems. I suspected the reason is some lines have space value.
However, I cannot find the solution. And then, I suspected the rows of CSV is over than list limit.
I need to use this CSV file to count the number of occurrences of each word. I don't know what to solve.
Maybe you can just change your column = [row[0] for row in reader] line to one of these:
column = [row[0] if row else None for row in reader] - This will preserve the indexes if that's important
column = [row[0] for row in reader if row] - This will skip the empty rows
If the header is empty then it raises an IndexError when you try to access any items.
for row in reader:
if len(row[0]) > 0:
column = row[0]
else:
pass
You can add this line before author_list and after reader lines. So that if checks if something is on there, it takes it. Otherwise it passes to other rows.
I think that the fastest way to do this is just to use readlines like this:
with f as open('myfile'):
lines = f.readlines()
Now lines is a list of all the lines in the file, if a line is empty you will have an empty string (' ') in the list and you can easily check that. You can also delete '\r' and '\n' characters.
If you want to count the number of different words you can just use len(set(lines)). If you want the count of each one instead you can use numpy.unique function that will give you the array of unique values and also the count of each one.
I'm trying print lines randomly from a csv.
Lets say the csv has the below 10 lines -
1,One
2,Two
3,Three
4,Four
5,Five
6,Six
7,Seven
8,Eight
9,Nine
10,Ten
If I write a code like below, it prints each line as a list in the same order as present in the CSV
import csv
with open("MyCSV.csv") as f:
reader = csv.reader(f)
for row_num, row in enumerate(reader):
print(row)
Instead, I'd like it to be random.
Its just a print for now. I'll later pass each line as a List to a Function.
This should work. You can reuse the lines list in your code as it is shuffled.
import random
with open("tmp.csv", "r") as f:
lines = f.readlines()
random.shuffle(lines)
print(lines)
import csv
import random
csv_elems = []
with open("MyCSV.csv") as f:
reader = csv.reader(f)
for row_num, row in enumerate(reader):
csv_elems.append(row)
random.shuffle(csv_elems)
print(csv_elems[0])
As you can see I'm just printing the first elem, you can iterate over the list, keep shuffling & print
Well you can define a list, append all elements of csv file into it, then shuffle it and print them, assume that the name of this list is temp
import csv
import random
temp = []
with open("your csv file.csv") as file:
reader = csv.reader(file)
for row_num, row in enumerate(reader):
temp.append(row)
random.shuffle(temp)
for i in range(len(temp)):
print(temp[i])
Why better don't you use pandas to handle csv?
import pandas as pd
data = pd.read_csv("MyCSV.csv")
And to get the samples you are looking for just write:
data.sample() # print one sample
data.sample(5) # to write 5 samples
Also if you want to pass each line to a function.
data_after_function = data.appy(function_name)
and inside the function you can cast the line into a list with list()
Hope this helps!
Couple of things to do:
Store CSV into a sequence of some sort
Get the data randomly
For 1, it’s probably best to use some form of sequence comprehension (I’ve gone for nested tuple in a list as it seems you want the row numbers and we can’t use dictionaries for shuffle).
We can use the random module for number 2.
import random
import csv
with open("MyCSV.csv") as f:
reader = csv.reader(f)
my_csv = [(row_num, row) for row_num, row in enumerate(reader)]
# get only 1 item from the list at random
random_row = random.choice(my_csv)
# randomise the order of all the rows
shuffled_csv = random.shuffle(my_csv)
I have a csv file with the following columns:
id,name,age,sex
Followed by a lot of values for the above columns.
I am trying to read the column names alone and put them inside a list.
I am using Dictreader and this gives out the correct details:
with open('details.csv') as csvfile:
i=["name","age","sex"]
re=csv.DictReader(csvfile)
for row in re:
for x in i:
print row[x]
But what I want to do is, I need the list of columns, ("i" in the above case)to be automatically parsed with the input csv than hardcoding them inside a list.
with open('details.csv') as csvfile:
rows=iter(csv.reader(csvfile)).next()
header=rows[1:]
re=csv.DictReader(csvfile)
for row in re:
print row
for x in header:
print row[x]
This gives out an error
Keyerrror:'name'
in the line print row[x]. Where am I going wrong? Is it possible to fetch the column names using Dictreader?
Though you already have an accepted answer, I figured I'd add this for anyone else interested in a different solution-
Python's DictReader object in the CSV module (as of Python 2.6 and above) has a public attribute called fieldnames.
https://docs.python.org/3.4/library/csv.html#csv.csvreader.fieldnames
An implementation could be as follows:
import csv
with open('C:/mypath/to/csvfile.csv', 'r') as f:
d_reader = csv.DictReader(f)
#get fieldnames from DictReader object and store in list
headers = d_reader.fieldnames
for line in d_reader:
#print value in MyCol1 for each row
print(line['MyCol1'])
In the above, d_reader.fieldnames returns a list of your headers (assuming the headers are in the top row).
Which allows...
>>> print(headers)
['MyCol1', 'MyCol2', 'MyCol3']
If your headers are in, say the 2nd row (with the very top row being row 1), you could do as follows:
import csv
with open('C:/mypath/to/csvfile.csv', 'r') as f:
#you can eat the first line before creating DictReader.
#if no "fieldnames" param is passed into
#DictReader object upon creation, DictReader
#will read the upper-most line as the headers
f.readline()
d_reader = csv.DictReader(f)
headers = d_reader.fieldnames
for line in d_reader:
#print value in MyCol1 for each row
print(line['MyCol1'])
You can read the header by using the next() function which return the next row of the reader’s iterable object as a list. then you can add the content of the file to a list.
import csv
with open("C:/path/to/.filecsv", "rb") as f:
reader = csv.reader(f)
i = reader.next()
rest = list(reader)
Now i has the column's names as a list.
print i
>>>['id', 'name', 'age', 'sex']
Also note that reader.next() does not work in python 3. Instead use the the inbuilt next() to get the first line of the csv immediately after reading like so:
import csv
with open("C:/path/to/.filecsv", "rb") as f:
reader = csv.reader(f)
i = next(reader)
print(i)
>>>['id', 'name', 'age', 'sex']
The csv.DictReader object exposes an attribute called fieldnames, and that is what you'd use. Here's example code, followed by input and corresponding output:
import csv
file = "/path/to/file.csv"
with open(file, mode='r', encoding='utf-8') as f:
reader = csv.DictReader(f, delimiter=',')
for row in reader:
print([col + '=' + row[col] for col in reader.fieldnames])
Input file contents:
col0,col1,col2,col3,col4,col5,col6,col7,col8,col9
00,01,02,03,04,05,06,07,08,09
10,11,12,13,14,15,16,17,18,19
20,21,22,23,24,25,26,27,28,29
30,31,32,33,34,35,36,37,38,39
40,41,42,43,44,45,46,47,48,49
50,51,52,53,54,55,56,57,58,59
60,61,62,63,64,65,66,67,68,69
70,71,72,73,74,75,76,77,78,79
80,81,82,83,84,85,86,87,88,89
90,91,92,93,94,95,96,97,98,99
Output of print statements:
['col0=00', 'col1=01', 'col2=02', 'col3=03', 'col4=04', 'col5=05', 'col6=06', 'col7=07', 'col8=08', 'col9=09']
['col0=10', 'col1=11', 'col2=12', 'col3=13', 'col4=14', 'col5=15', 'col6=16', 'col7=17', 'col8=18', 'col9=19']
['col0=20', 'col1=21', 'col2=22', 'col3=23', 'col4=24', 'col5=25', 'col6=26', 'col7=27', 'col8=28', 'col9=29']
['col0=30', 'col1=31', 'col2=32', 'col3=33', 'col4=34', 'col5=35', 'col6=36', 'col7=37', 'col8=38', 'col9=39']
['col0=40', 'col1=41', 'col2=42', 'col3=43', 'col4=44', 'col5=45', 'col6=46', 'col7=47', 'col8=48', 'col9=49']
['col0=50', 'col1=51', 'col2=52', 'col3=53', 'col4=54', 'col5=55', 'col6=56', 'col7=57', 'col8=58', 'col9=59']
['col0=60', 'col1=61', 'col2=62', 'col3=63', 'col4=64', 'col5=65', 'col6=66', 'col7=67', 'col8=68', 'col9=69']
['col0=70', 'col1=71', 'col2=72', 'col3=73', 'col4=74', 'col5=75', 'col6=76', 'col7=77', 'col8=78', 'col9=79']
['col0=80', 'col1=81', 'col2=82', 'col3=83', 'col4=84', 'col5=85', 'col6=86', 'col7=87', 'col8=88', 'col9=89']
['col0=90', 'col1=91', 'col2=92', 'col3=93', 'col4=94', 'col5=95', 'col6=96', 'col7=97', 'col8=98', 'col9=99']
How about
with open(csv_input_path + file, 'r') as ft:
header = ft.readline() # read only first line; returns string
header_list = header.split(',') # returns list
I am assuming your input file is CSV format.
If using pandas, it takes more time if the file is big size because it loads the entire data as the dataset.
I am just mentioning how to get all the column names from a csv file.
I am using pandas library.
First we read the file.
import pandas as pd
file = pd.read_csv('details.csv')
Then, in order to just get all the column names as a list from input file use:-
columns = list(file.head(0))
Thanking Daniel Jimenez for his perfect solution to fetch column names alone from my csv, I extend his solution to use DictReader so we can iterate over the rows using column names as indexes. Thanks Jimenez.
with open('myfile.csv') as csvfile:
rest = []
with open("myfile.csv", "rb") as f:
reader = csv.reader(f)
i = reader.next()
i=i[1:]
re=csv.DictReader(csvfile)
for row in re:
for x in i:
print row[x]
here is the code to print only the headers or columns of the csv file.
import csv
HEADERS = next(csv.reader(open('filepath.csv')))
print (HEADERS)
Another method with pandas
import pandas as pd
HEADERS = list(pd.read_csv('filepath.csv').head(0))
print (HEADERS)
import pandas as pd
data = pd.read_csv("data.csv")
cols = data.columns
I literally just wanted the first row of my data which are the headers I need and didn't want to iterate over all my data to get them, so I just did this:
with open(data, 'r', newline='') as csvfile:
t = 0
for i in csv.reader(csvfile, delimiter=',', quotechar='|'):
if t > 0:
break
else:
dbh = i
t += 1
Using pandas is also an option.
But instead of loading the full file in memory, you can retrieve only the first chunk of it to get the field names by using iterator.
import pandas as pd
file = pd.read_csv('details.csv'), iterator=True)
column_names_full=file.get_chunk(1)
column_names=[column for column in column_names_full]
print column_names
I am probably making a stupid mistake, but I can't find where it is. I want to count the number of lines in my csv file. I wrote this, and obviously isn't working: I have row_count = 0 while it should be 400. Cheers.
f = open(adresse,"r")
reader = csv.reader(f,delimiter = ",")
data = [l for l in reader]
row_count = sum(1 for row in reader)
print row_count
with open(adresse,"r") as f:
reader = csv.reader(f,delimiter = ",")
data = list(reader)
row_count = len(data)
You are trying to read the file twice, when the file pointer has already reached the end of file after saving the data list.
First you have to open the file with open
input_file = open("nameOfFile.csv","r+")
Then use the csv.reader for open the csv
reader_file = csv.reader(input_file)
At the last, you can take the number of row with the instruction 'len'
value = len(list(reader_file))
The total code is this:
input_file = open("nameOfFile.csv","r+")
reader_file = csv.reader(input_file)
value = len(list(reader_file))
Remember that if you want to reuse the csv file, you have to make a input_file.fseek(0), because when you use a list for the reader_file, it reads all file, and the pointer in the file change its position
If you are working with python3 and have pandas library installed you can go with
import pandas as pd
results = pd.read_csv('f.csv')
print(len(results))
I would consider using a generator. It would do the job and keeps you safe from MemoryError of any kind
def generator_count_file_rows(input_file):
for row in open(input_file,'r'):
yield row
And then
for row in generator_count_file_rows('very_large_set.csv'):
count+=1
The important stuff is hidden in comments section of solution which is marked correct.
Re-sharing Erdős-Bacon's solution here for better visibility.
Why ?
Because: It saves lot of memory without having to create list.
So I think it is better do this way
def read_raw_csv(file_name):
with open(file_name, 'r') as file:
csvreader = csv.reader(file)
# count number of rows
entry_count = sum(1 for row in csvreader)
print(entry_count-1) # -1 is for discarding header row.
Checkout this link for more info
# with built in libraries
opened_file = open('f.csv')
from csv import reader
read_file = reader(opened_file)
apps_data = list(read_file)
rowcount = len(apps_data) #which incudes header row
print("Total rows incuding header: " + str(rowcount))
Simply Open the csv file in Notepad++. It shows the total row count in a jiffy. :)
Or
in cmd prompt , Provide file path and key in the command
find \c \v "some meaningless string" Filename.csv