I have a csv file that has only one column. I want to extract the number of rows.
When I run the code below:
import pandas as pd
df = pd.read_csv('data.csv')
print(df)
I get the following output:
[65422771 rows x 1 columns]
But when I run the code below:
file = open("data.csv")
numline = len(file.readlines())
print (numline)
I get the following output:
130845543
What is the correct number of rows in my csv file? What is the difference between the two outputs?
Is it possible that you have an empty line after each entry? The readlines count is almost exactly double the number of pandas DataFrame rows (130845543 = 2 × 65422771 + 1).
pandas skips empty lines by default, while readlines counts them.
To check the number of empty lines, try:
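import sys
import csv
csv.field_size_limit(sys.maxsize)
empty_lines = 0  # the counter must be initialised before the loop
with open('data.csv', newline='') as data:
    for line in csv.reader(data):
        if not line:  # csv.reader yields an empty list for a blank line
            empty_lines += 1
            continue
        print(line)
print(empty_lines)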
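To see the effect on something small, here is a minimal sketch with made-up data (the `sample` string is hypothetical, not your file): a header plus three values, each followed by an empty line. It compares the raw line count with the number of rows pandas reports.

```python
import io

import pandas as pd

# Hypothetical sample: a header and three values, each followed by a blank line.
sample = "value\n\n1\n\n2\n\n3\n\n"

raw_lines = sample.splitlines()
blank_lines = sum(1 for line in raw_lines if not line.strip())

# read_csv skips blank lines by default (skip_blank_lines=True).
df = pd.read_csv(io.StringIO(sample))

print(len(raw_lines), blank_lines, len(df))
```

The raw count includes the header and every blank line, while `len(df)` only counts the data rows.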
Related
I have a csv file where the columns are separated by a tab delimiter, but the number of columns is not constant. I need to read the file only up to the 5th column. (I don't want to read the whole file and then extract the columns; I would like to read, for example, line by line and skip the remaining columns.)
You can use the usecols argument of pd.read_csv to limit the number of columns to be read.
import pandas as pd
# test data
s = '''a,b,c
1,2,3'''
with open('a.txt', 'w') as f:
    print(s, file=f)
df1 = pd.read_csv("a.txt", usecols=range(1))  # first column only
df2 = pd.read_csv("a.txt", usecols=range(2))  # first two columns
print(df1)
print()
print(df2)
# output
#    a
# 0  1
#
#    a  b
# 0  1  2
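You can use the pandas usecols argument to read only the first five columns (nrows, by contrast, limits the number of rows):
import pandas as pd
# sep='\t' because the file is tab-delimited; usecols keeps columns 0 to 4
df = pd.read_csv('out122.txt', sep='\t', usecols=[0,1,2,3,4])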
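If you really want to go line by line, a rough sketch with the standard csv module could look like the following. The helper name `first_n_fields` and the sample data are made up for illustration; rows shorter than five fields are simply returned as they are.

```python
import csv
import io


def first_n_fields(lines, n=5, delimiter="\t"):
    """Yield at most the first n fields of every delimited row."""
    for row in csv.reader(lines, delimiter=delimiter):
        yield row[:n]


# Hypothetical tab-delimited data with a varying number of columns.
sample = "a\tb\tc\td\te\tf\tg\n1\t2\t3\n1\t2\t3\t4\t5\t6\n"
rows = list(first_n_fields(io.StringIO(sample)))
print(rows)
```

With a real file you would pass the handle from `open(path, newline='')` instead of the `io.StringIO` object.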
Is there a better way to import a txt file in a single pandas row than the solution below?
import pandas as pd
with open(path_txt_file) as f:
    text = f.read().replace("\n", "")
df = pd.DataFrame([text], columns = ["text"])
Sample lines from the .txt file:
Today is a beautiful day.
I will go swimming.
I tried pd.read_csv but it is returning multiple rows due to new lines.
You can concatenate the lines with .str.cat():
text = pd.read_csv(path_txt_file, sep='\n', header=None)[0].str.cat()
df = pd.DataFrame([text], columns=['text'])
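As far as I know, recent pandas releases raise an error when sep='\n' is passed to read_csv, so the line above may not run on a current installation. A minimal sketch that avoids read_csv altogether is below; the helper name `text_to_single_row` is made up, and the sample text is taken from the question.

```python
import pandas as pd


def text_to_single_row(text):
    """Join all lines of a text into one string and wrap it in a one-row DataFrame."""
    joined = "".join(text.splitlines())
    return pd.DataFrame({"text": [joined]})


df = text_to_single_row("Today is a beautiful day.\nI will go swimming.\n")
print(df.shape)
```

For a file, pass the result of `f.read()` to the helper.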
I have a csv file that I want to work with using Python. For that I run this code:
import csv
import collections
col_values = collections.defaultdict(list)
# newline='' is the csv-module convention; the old 'rU' mode was removed in Python 3.11
with open('list.csv', newline='') as f:
    reader = csv.reader(f)
    data = list(reader)
row_count = len(data)
print(" Number of rows ", row_count)
As a result I get 4357, but the file has only 2432 rows. I've tried changing the delimiter in the reader() function, but it didn't change the result.
So my question is: does anybody have an explanation for why I get this value?
Thanks in advance.
UPDATE
Since the number of columns is also too big, here is the output of the last row and the start of the non-existent rows for one column.
Opening the file with Excel, the end looks like:
I hope it helps.
Try using pandas:
import pandas as pd
df = pd.read_csv('list.csv')
df.count()  # non-null values per column; len(df) gives the number of rows
Check whether you are getting the proper number of rows now.
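One possible explanation (an assumption, since the file itself is not shown) is that the file contains blank lines or rows made of nothing but delimiters, which spreadsheet exports sometimes leave at the end. csv.reader counts those as rows. A small sketch with made-up data that separates real rows from empty ones:

```python
import csv
import io

# Hypothetical data: a header, two records, one blank line and one row of bare commas.
sample = "id,name\n1,a\n\n2,b\n,\n"

rows = list(csv.reader(io.StringIO(sample)))
# A row is treated as empty if it has no fields or only blank fields.
empty_rows = [row for row in rows if not any(cell.strip() for cell in row)]
real_rows = [row for row in rows if any(cell.strip() for cell in row)]

print(len(rows), len(empty_rows), len(real_rows))
```

If the number of empty rows in your file accounts for the difference, that is where the extra count comes from.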
I want to export a dataframe to csv. But on top of it, I would like to print the date of the dataframe to produce the following result in the csv file. How can I join the string sentence to the dataframe so that I can export it together to csv?
import pandas as pd
import datetime as dt
today1=dt.datetime.today().strftime('%Y%m%d')
print('This dataframe is created on ',today1)
df=pd.DataFrame({'A':[1,2],'B':[3,4]})
print(df)
df.to_csv('temp.csv')
DataFrame.to_csv accepts a file handle as input. So write your first line, then call to_csv with the same handle:
import pandas as pd
import datetime as dt
today1=dt.datetime.today().strftime('%Y%m%d')
df=pd.DataFrame({'A':[1,2],'B':[3,4]})
with open("temp.csv","w") as f:
    f.write('This dataframe is created on {}\n'.format(today1))
    df.to_csv(f)
When you read the data back, just do the same with pd.read_csv():
with open("temp.csv","r") as f:
    date_line = next(f)
    df = pd.read_csv(f)
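To check the round trip without touching the disk, here is a small sketch that uses an in-memory buffer in place of temp.csv. Writing with index=False is my own choice here, so that the frame read back compares equal to the original.

```python
import datetime as dt
import io

import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
today1 = dt.datetime.today().strftime('%Y%m%d')

buf = io.StringIO()  # stands in for the open file handle
buf.write('This dataframe is created on {}\n'.format(today1))
df.to_csv(buf, index=False)

buf.seek(0)
date_line = buf.readline().rstrip('\n')  # consume the date line first
restored = pd.read_csv(buf)              # the rest of the buffer is plain csv

print(date_line)
print(restored)
```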
Just remove the to_csv line in your code, then run it in a terminal window as below:
python code.py >> temp.csv
Your print instructions will be printed in temp.csv. The output file is:
('This dataframe is created on ', '20161220')
A B
0 1 3
1 2 4
Not sure if it works in every OS though.
I am dealing with a csv file that contains three columns and three rows containing numeric data. The csv data file simply looks like the following:
Colum1,Colum2,Colum3
1,2,3
1,2,3
1,2,3
My question is how to write Python code that takes a single value from one of the columns and performs a specific operation. For example, let's say I want to take the first value in 'Colum1' and subtract it from the sum of all the values in the column.
Here is my attempt:
import csv
f = open('columns.csv')
rows = csv.DictReader(f)
value_of_single_row = 0.0
for i in rows:
    value_of_single_Row += float(i) # trying to isolate a single value here!
print value_of_single_row - sum(float(r['Colum1']) for r in rows)
f.close()
Based on the code you provided, I suggest you look at the documentation for the preferred way to read through a csv file. Take a look here:
How to use CsvReader
With that being said, you can modify the beginning of your code slightly to this:
import csv
with open('data.csv', newline='') as f:  # text mode; 'rb' only worked on Python 2
    rows = csv.DictReader(f)
    for row in rows:
        # perform operation per row
        print(row)
From there you now have access to each row.
This should give you what you need to do proper row-by-row operations.
What I suggest you do is play around with printing out your rows to see what your data looks like. You will see that each row being outputted is a dictionary.
So if you were going through each row, you can just simply do something like this:
s = 0.0  # running total, initialised before the loop
for row in rows:
    row['Colum1'] # or row.get('Colum1')
    # to do some math to add everything in Colum1
    s += float(row['Colum1'])
So all of that will look like this:
import csv
s = 0
with open('data.csv', newline='') as f:
    rows = csv.DictReader(f)
    for row in rows:
        s += float(row['Colum1'])
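To get all the way to what the question asks (the first value in 'Colum1' subtracted from the sum of the column), one option is to collect the column values in a list first. This is only a sketch; it uses the sample data from the question in an in-memory buffer instead of a file on disk.

```python
import csv
import io

# The sample data from the question, held in memory instead of in data.csv.
sample = "Colum1,Colum2,Colum3\n1,2,3\n1,2,3\n1,2,3\n"

values = [float(row['Colum1']) for row in csv.DictReader(io.StringIO(sample))]
result = sum(values) - values[0]  # sum of the column minus its first value

print(result)
```

Collecting the values first matters because a DictReader can only be iterated once; a second loop over the same reader would see no rows.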
You can do pretty much all of this with pandas
import pandas as pd
Location = r'path/test.csv'
df = pd.read_csv(Location, names=['Colum1','Colum2','Colum3'])
df = df[1:] #Remove the headers since they're unnecessary
print(df)
# assign through .loc; df.xs(1)['Colum1'] = ... may only modify a temporary copy
df.loc[1, 'Colum1'] = int(df.loc[1, 'Colum1']) + 5
print(df)
You can write back to your csv using df.to_csv('File path', index=False, header=True). Setting header=True will add the headers back in.
To do this more along the lines of what you have, you can do it like this:
Location = r'C:/Users/tnabrelsfo/Documents/Programs/Stack/test.csv'
data = []
with open(Location, 'r') as f:
    for line in f:
        data.append(line.replace('\n','').replace(' ','').split(','))
data = data[1:]  # drop the header row
print(data)
data[1][1] = 5
print(data)
It will read in each row, cut out the column names, and then you can modify the values by index.
So here is my simple solution using the pandas library. Suppose we have a sample.csv file:
import pandas as pd
df = pd.read_csv('sample.csv') # df is now a DataFrame
df['Colum1'] = df['Colum1'] - df['Colum1'].sum() # replace each value in the column with that value minus the column sum
print(df)
df.to_csv('sample.csv', index=False) # save dataframe back to csv file
You can also use the map function to apply an operation to one column, for example:
import pandas as pd
df = pd.read_csv('sample.csv')
col_sum = df['Colum1'].sum() # sum of the first column
df['Colum1'] = df['Colum1'].map(lambda x: x - col_sum)
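If you only need the single number from the question (the sum of 'Colum1' minus its first value) rather than a rewritten column, a minimal pandas sketch is below. It reads the question's sample data from an in-memory buffer, which is my stand-in for sample.csv.

```python
import io

import pandas as pd

# The sample data from the question, held in memory instead of in sample.csv.
sample = "Colum1,Colum2,Colum3\n1,2,3\n1,2,3\n1,2,3\n"
df = pd.read_csv(io.StringIO(sample))

first_value = df['Colum1'].iloc[0]  # first value by position
result = df['Colum1'].sum() - first_value

print(result)
```

Using .iloc[0] picks the first row by position, so it keeps working even if the index labels do not start at 0.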