Better way to parse CSV into list or array - python

Is there a better way to create a list or a numpy array from this csv file? What I'm asking is how to do it and parse more gracefully than I did in the code below.
fname = open("Computers discovered recently by discovery method.csv").readlines()
lst = [elt.strip().split(",")[8:] for elt in fname if elt != "\n"][4:]
lst2 = []
for row in lst:
print(row)
if row[0].startswith("SMZ-") or row[0].startswith("MTR-"):
lst2.append(row)
print(*lst2, sep = "\n")

You can always use Pandas. As an example,
import pandas as pd
import numpy as np
df = pd.read_csv('pandas_dataframe_importing_csv/example.csv')
To convert it, you will have to convert it to your favorite numeric type. I guess you can write the whole thing in one line:
result = numpy.array(list(df)).astype("float")
You can also do the following:
from numpy import genfromtxt
my_data = genfromtxt('my_file.csv', delimiter=',')

You can use pandas and specify header column to make it work correctly on you sample file
import pandas as pd
df = pd.read_csv('Computers discovered recently by discovery method.csv', header=2)
You can check your content using:
>>> df.head()
You can check headers using
>>> df.columns
And to convert it to numpy array you can use
>>> np_arr = df.values
It comes with a lot of options to parse and read csv files. For more information please check the docs

I am not sure what you want but try this
import csv
with open("Computers discovered recently by discovery method.csv", 'r') as f:
reader = csv.reader(f)
ll = list(reader)
print (ll)
this should read the csv line by line and store it as a list

You should never parse CSV structures manually unless you want to tackle all possible exceptions and CSV format oddities. Python has you covered in that regard with its csv module.
The main problem, in your case, stems from your data - there seems to be two different CSV structures in a single file so you first need to find where your second structure begins. Plus, from your code, it seems you want to filter out all columns before Details_Table0_Netbios_Name0 and include only rows whose Details_Table0_Netbios_Name0 starts with SMZ- or MTR-. So something like:
import csv
with open("Computers discovered recently by discovery method.csv") as f:
reader = csv.reader(f) # create a CSV reader
for row in reader: # skip the lines until we encounter the second CSV structure/header
if row and row[0] == "Header_Table0_Netbios_Name0":
break
index = row.index("Details_Table0_Netbios_Name0") # find where your columns begin
result = [] # storage for the rows we're interested in
for row in reader: # read the rest of the CSV row by row
if row and row[index][:4] in {"SMZ-", "MTR-"}: # only include these rows
result.append(row[index:]) # trim and append to the `result` list
print(result[10]) # etc.
# ['MTR-PC0BXQE6-LB', 'PR2', 'anisita', 'VALUEADDCO', 'VALUEADDCO', 'Heartbeat Discovery',
# '07.12.2017 17:47:51', '13']
should do the trick.

Sample Code
import csv
csv_file = 'sample.csv'
with open(csv_file) as fh:
reader = csv.reader(fh)
for row in reader:
print(row)
sample.csv
name,age,salary
clado,20,25000
student,30,34000
sam,34,32000

Related

How to import a csv-file into a data array?

I have a line of code in a script that imports data from a text file with lots of spaces between values into an array for use later.
textfile = open('file.txt')
data = []
for line in textfile:
row_data = line.strip("\n").split()
for i, item in enumerate(row_data):
try:
row_data[i] = float(item)
except ValueError:
pass
data.append(row_data)
I need to change this from a text file to a csv file. I don't want to just change this text to split on commas (since some values can have commas if they're in quotes). Luckily I saw there is a csv library I can import that can handle this.
import csv
with open('file.csv', 'rb') as csvfile:
???
How can I load the csv file into the data array?
If it makes a difference, this is how the data will be used:
row = 0
for row_data in (data):
worksheet.write_row(row, 0, row_data)
row += 1
Assuming the CSV file is delimited with commas, the simplest way using the csv module in Python 3 would probably be:
import csv
with open('testfile.csv', newline='') as csvfile:
data = list(csv.reader(csvfile))
print(data)
You can specify other delimiters, such as tab characters, by specifying them when creating the csv.reader:
data = list(csv.reader(csvfile, delimiter='\t'))
For Python 2, use open('testfile.csv', 'rb') to open the file.
You can use pandas library or numpy to read the CSV file. If your file is tab-separated then use '\t' in place of comma in both sep and delimiter arguments below.
import pandas as pd
myFile = pd.read_csv('filepath', sep=',')
Or
import numpy as np
myFile = np.genfromtxt('filepath', delimiter=',')
I think the simplest way to do this is via Pandas:
import pandas as pd
data = pd.read_csv(FILE).values
This returns a Numpy array of values from a DataFrame created from the CSV. See the documentation here.
This method also works for me.
Example: Having random data, and each data point starting on a newline like below:
'dog',5,2
'cat',5,7,1
'man',5,7,3,'banana'
'food',5,8,9,4,'girl'
import csv
with open('filePath.csv', 'r') as readData:
readCsv = csv.reader(readData)
data = list(readCsv)

How to overwrite a particular column of a csv file using pandas or normal python?

I am new to python. I have a .csv file which has 13 columns. I want to round off the floating values of the 2nd column which I was able to achieve successfully. I did this and stored it in a list. Now I am unable to figure out how to overwrite the rounded off values into the same csv file and into the same column i.e. column 2? I am using python3. Any help will be much appreciated.
My code is as follows:
Import statements for module import:
import csv
Creating an empty list:
list_string = []
Reading a csv file
with open('/home/user/Desktop/wine.csv', 'r') as csvDataFile:
csvReader = csv.reader(csvDataFile, delimiter = ',')
next(csvReader, None)
for row in csvReader:
floatParse = float(row[1])
closestInteger = int(round(floatParse))
stringConvert = str(closestInteger)
list_string.append(stringConvert)
print(list_string)
Writing into the same csv file for the second column (Overwrites the entire Excel file)
with open('/home/user/Desktop/wine.csv', 'w') as csvDataFile:
writer = csv.writer(csvDataFile)
next(csvDataFile)
row[1] = list_string
writer.writerows(row[1])
PS: The writing into the csv overwrites the entire csv and removes all the other columns which I don't want. I just want to overwrite the 2nd column with rounded off values and keep the rest of the data same.
this might be what you're looking for.
import pandas as pd
import numpy as np
#Some sample data
data = {"Document_ID": [102994,51861,51879,38242,60880,76139,76139],
"SecondColumnName": [7.256,1.222,3.16547,4.145658,4.154656,6.12,17.1568],
}
wine = pd.DataFrame(data)
#This is how you'd read in your data
#wine = pd.read_csv('/home/user/Desktop/wine.csv')
#Replace the SecondColumnName with the real name
wine["SecondColumnName"] = wine["SecondColumnName"].map('{:,.2f}'.format)
#This will overwrite the sheet, but it will have all the data as before
wine.to_csv(/home/user/Desktop/wine.csv')
Pandas is way easier than read csv...I'd recommended checking it out.
I think this better answers the specific question. The key to this is to define an input_file and an output_file during the with part.
The StringIO part is just there for sample data in this example. newline='' is for Python 3. Without it, blank lines between each row appears in the output. More info.
import csv
from io import StringIO
s = '''A,B,C,D,E,F,G,H,I,J,K,L
1,4.4343,3,4,5,6,7,8,9,10,11
1,8.6775433,3,4,5,6,7,8,9,10,11
1,16.83389832,3,4,5,6,7,8,9,10,11
1,32.2711122,3,4,5,6,7,8,9,10,11
1,128.949483,3,4,5,6,7,8,9,10,11'''
list_string = []
with StringIO(s) as input_file, open('output_file.csv', 'w', newline='') as output_file:
reader = csv.reader(input_file)
next(reader, None)
writer = csv.writer(output_file)
for row in reader:
floatParse = float(row[1]) + 1
closestInteger = int(round(floatParse))
stringConvert = str(closestInteger)
row[1] = stringConvert
writer.writerow(row)

Trouble returning the entire CSV data frame

When I run the code below, I am only getting the first row (my name row) of my CSV file. What can I do to make sure the code below returns my entire CSV?
import csv
import pandas as pd
import numpy as np
def open_elves():
with open('elves.csv') as csvjawn:
readCS = csv.reader(csvjawn, delimiter = ',')
for row in readCS:
return row
x = pd.DataFrame(open_elves())
print (x)
Use the read_csv function provided by the Pandas api
df = pd.read_csv('elves.csv')
Return always quits the loop immediately. Try e.g.
def f():
for i in range(100):
return i
f()
What your flow should look like instead is something like:
with open('elves.csv') as csvjawn:
readCS = csv.reader(csvjawn, delimiter = ',')
data = [row for row in readCS]
This uses a list comprehension which you may want to research if you haven't seen before.
I wouldn't use return to accomplish what you want to do.
The response of stackoverflow's colleague (abarnert):
with open('old.csv', 'rb') as oldf, open('new.csv', 'wb') as newf:
old_reader = csv.reader(oldf)
writer = csv.writer(newt)
for row in old_reader:
writer.writerow(transform(row))
with open('new.csv', 'rb') as newf:
new_reader = csv.reader(newf)
for row in new_reader:
print row
url: CSV reader object not reading entire file [Python]

Reading column names alone in a csv file

I have a csv file with the following columns:
id,name,age,sex
Followed by a lot of values for the above columns.
I am trying to read the column names alone and put them inside a list.
I am using Dictreader and this gives out the correct details:
with open('details.csv') as csvfile:
i=["name","age","sex"]
re=csv.DictReader(csvfile)
for row in re:
for x in i:
print row[x]
But what I want to do is, I need the list of columns, ("i" in the above case)to be automatically parsed with the input csv than hardcoding them inside a list.
with open('details.csv') as csvfile:
rows=iter(csv.reader(csvfile)).next()
header=rows[1:]
re=csv.DictReader(csvfile)
for row in re:
print row
for x in header:
print row[x]
This gives out an error
Keyerrror:'name'
in the line print row[x]. Where am I going wrong? Is it possible to fetch the column names using Dictreader?
Though you already have an accepted answer, I figured I'd add this for anyone else interested in a different solution-
Python's DictReader object in the CSV module (as of Python 2.6 and above) has a public attribute called fieldnames.
https://docs.python.org/3.4/library/csv.html#csv.csvreader.fieldnames
An implementation could be as follows:
import csv
with open('C:/mypath/to/csvfile.csv', 'r') as f:
d_reader = csv.DictReader(f)
#get fieldnames from DictReader object and store in list
headers = d_reader.fieldnames
for line in d_reader:
#print value in MyCol1 for each row
print(line['MyCol1'])
In the above, d_reader.fieldnames returns a list of your headers (assuming the headers are in the top row).
Which allows...
>>> print(headers)
['MyCol1', 'MyCol2', 'MyCol3']
If your headers are in, say the 2nd row (with the very top row being row 1), you could do as follows:
import csv
with open('C:/mypath/to/csvfile.csv', 'r') as f:
#you can eat the first line before creating DictReader.
#if no "fieldnames" param is passed into
#DictReader object upon creation, DictReader
#will read the upper-most line as the headers
f.readline()
d_reader = csv.DictReader(f)
headers = d_reader.fieldnames
for line in d_reader:
#print value in MyCol1 for each row
print(line['MyCol1'])
You can read the header by using the next() function which return the next row of the reader’s iterable object as a list. then you can add the content of the file to a list.
import csv
with open("C:/path/to/.filecsv", "rb") as f:
reader = csv.reader(f)
i = reader.next()
rest = list(reader)
Now i has the column's names as a list.
print i
>>>['id', 'name', 'age', 'sex']
Also note that reader.next() does not work in python 3. Instead use the the inbuilt next() to get the first line of the csv immediately after reading like so:
import csv
with open("C:/path/to/.filecsv", "rb") as f:
reader = csv.reader(f)
i = next(reader)
print(i)
>>>['id', 'name', 'age', 'sex']
The csv.DictReader object exposes an attribute called fieldnames, and that is what you'd use. Here's example code, followed by input and corresponding output:
import csv
file = "/path/to/file.csv"
with open(file, mode='r', encoding='utf-8') as f:
reader = csv.DictReader(f, delimiter=',')
for row in reader:
print([col + '=' + row[col] for col in reader.fieldnames])
Input file contents:
col0,col1,col2,col3,col4,col5,col6,col7,col8,col9
00,01,02,03,04,05,06,07,08,09
10,11,12,13,14,15,16,17,18,19
20,21,22,23,24,25,26,27,28,29
30,31,32,33,34,35,36,37,38,39
40,41,42,43,44,45,46,47,48,49
50,51,52,53,54,55,56,57,58,59
60,61,62,63,64,65,66,67,68,69
70,71,72,73,74,75,76,77,78,79
80,81,82,83,84,85,86,87,88,89
90,91,92,93,94,95,96,97,98,99
Output of print statements:
['col0=00', 'col1=01', 'col2=02', 'col3=03', 'col4=04', 'col5=05', 'col6=06', 'col7=07', 'col8=08', 'col9=09']
['col0=10', 'col1=11', 'col2=12', 'col3=13', 'col4=14', 'col5=15', 'col6=16', 'col7=17', 'col8=18', 'col9=19']
['col0=20', 'col1=21', 'col2=22', 'col3=23', 'col4=24', 'col5=25', 'col6=26', 'col7=27', 'col8=28', 'col9=29']
['col0=30', 'col1=31', 'col2=32', 'col3=33', 'col4=34', 'col5=35', 'col6=36', 'col7=37', 'col8=38', 'col9=39']
['col0=40', 'col1=41', 'col2=42', 'col3=43', 'col4=44', 'col5=45', 'col6=46', 'col7=47', 'col8=48', 'col9=49']
['col0=50', 'col1=51', 'col2=52', 'col3=53', 'col4=54', 'col5=55', 'col6=56', 'col7=57', 'col8=58', 'col9=59']
['col0=60', 'col1=61', 'col2=62', 'col3=63', 'col4=64', 'col5=65', 'col6=66', 'col7=67', 'col8=68', 'col9=69']
['col0=70', 'col1=71', 'col2=72', 'col3=73', 'col4=74', 'col5=75', 'col6=76', 'col7=77', 'col8=78', 'col9=79']
['col0=80', 'col1=81', 'col2=82', 'col3=83', 'col4=84', 'col5=85', 'col6=86', 'col7=87', 'col8=88', 'col9=89']
['col0=90', 'col1=91', 'col2=92', 'col3=93', 'col4=94', 'col5=95', 'col6=96', 'col7=97', 'col8=98', 'col9=99']
How about
with open(csv_input_path + file, 'r') as ft:
header = ft.readline() # read only first line; returns string
header_list = header.split(',') # returns list
I am assuming your input file is CSV format.
If using pandas, it takes more time if the file is big size because it loads the entire data as the dataset.
I am just mentioning how to get all the column names from a csv file.
I am using pandas library.
First we read the file.
import pandas as pd
file = pd.read_csv('details.csv')
Then, in order to just get all the column names as a list from input file use:-
columns = list(file.head(0))
Thanking Daniel Jimenez for his perfect solution to fetch column names alone from my csv, I extend his solution to use DictReader so we can iterate over the rows using column names as indexes. Thanks Jimenez.
with open('myfile.csv') as csvfile:
rest = []
with open("myfile.csv", "rb") as f:
reader = csv.reader(f)
i = reader.next()
i=i[1:]
re=csv.DictReader(csvfile)
for row in re:
for x in i:
print row[x]
here is the code to print only the headers or columns of the csv file.
import csv
HEADERS = next(csv.reader(open('filepath.csv')))
print (HEADERS)
Another method with pandas
import pandas as pd
HEADERS = list(pd.read_csv('filepath.csv').head(0))
print (HEADERS)
import pandas as pd
data = pd.read_csv("data.csv")
cols = data.columns
I literally just wanted the first row of my data which are the headers I need and didn't want to iterate over all my data to get them, so I just did this:
with open(data, 'r', newline='') as csvfile:
t = 0
for i in csv.reader(csvfile, delimiter=',', quotechar='|'):
if t > 0:
break
else:
dbh = i
t += 1
Using pandas is also an option.
But instead of loading the full file in memory, you can retrieve only the first chunk of it to get the field names by using iterator.
import pandas as pd
file = pd.read_csv('details.csv'), iterator=True)
column_names_full=file.get_chunk(1)
column_names=[column for column in column_names_full]
print column_names

What is the best way to read the ith column of a csv file with Python?

I am used to R which offers quick functions to read CSV files column by column, can anyone propose a quick and efficient way to read large data (CSV for example) files in python? the ith column of a CSV file for example.
I have the following but it takes time :
import os,csv, numpy, scipy
from numpy import *
f= open('some.csv', 'rb')
reader = csv.reader(f, delimiter=',')
header = reader.next()
zipped = zip(*reader)
print( zipped[0] ) # is the first column
Is there a better way to read data (from large files) in python (at least as quick as R in terms of memory) ?
You can also use pandas.read_csv and its use_cols argument. See here
import pandas as pd
data = pd.read_csv('some.csv', use_cols = ['col_1', 'col_2', 'col_4'])
...
import csv
with open('some.csv') as fin:
reader = csv.reader(fin)
first_col = [row[0] for row in reader]
What you're doing using zip is loading the entire file to memory, then transposing it to get the col. If you only want the column values, just include that in the list to start with.
If you wanted multiple columns, then you could do:
from operator import itemgetter
get_cols = itemgetter(1, 3, 5)
cols = map(get_cols, reader)

Categories