is there any better way for reading files?

Every time I read a CSV file into lists, I use this long method. Can it be simplified?
import csv
import sys
filename = 'mtms_excelExtraction_m_Model_Definition.csv'
# create the empty lists
Ana_Type = []
Ana_Length = []
Ana_Text = []
Ana_Space = []
# read the file row by row and append to the lists
with open(filename, 'rt') as f:
    reader = csv.reader(f)
    try:
        for row in reader:
            Ana_Type.append(row[0])
            Ana_Length.append(row[1])
            Ana_Text.append(row[2])
            Ana_Space.append(row[3])
    except csv.Error as e:
        sys.exit('file %s, line %d: %s' % (filename, reader.line_num, e))

This is a good opportunity for you to start using pandas and working with DataFrames.
import pandas as pd
df = pd.read_csv(path_to_csv)
1-2 (depending on if you count the import) lines of code and you're done!

This one is essentially the numpy way of processing the csv file, without using numpy.
Whether it is better than your original method is largely a matter of taste. Like the numpy or pandas approach, it loads the whole file into memory and then transposes it into lists:
with open(filename, 'rt') as f:
    reader = csv.reader(f)
    tmp = list(reader)
Ana_Type, Ana_Length, Ana_Text, Ana_Space = [[tmp[i][j] for i in range(len(tmp))]
                                             for j in range(len(tmp[0]))]
It uses less code and builds the lists with comprehensions instead of repeated appends, but it needs more memory (as would numpy or pandas).
Depending on how you later process the data, numpy or pandas could be a nice option, but IMHO using them only to load a csv file into lists is not worth it.

You can use a DictReader:
import csv
with open(filename, 'rt') as f:
    data = list(csv.DictReader(f, fieldnames=["Type", "Length", "Text", "Space"]))
print(data)
This will give you a single list of dict objects, one per row.
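If you still want one list per column rather than one dict per row, the rows can be regrouped afterwards. A minimal sketch, assuming a four-column file with no header row; the in-memory sample data and the `columns` name are made up for illustration:

```python
import csv
import io

# Hypothetical sample standing in for the real file (no header row).
sample = io.StringIO("A,10,foo,1\nB,20,bar,2\n")

fieldnames = ["Type", "Length", "Text", "Space"]
rows = list(csv.DictReader(sample, fieldnames=fieldnames))

# Regroup the row dicts into one list per column.
columns = {name: [row[name] for row in rows] for name in fieldnames}

print(columns["Type"])    # ['A', 'B']
print(columns["Length"])  # ['10', '20']
```

Because fieldnames is passed explicitly, the first line of the file is treated as data, not as a header.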

Try this:
import csv
from collections import defaultdict
d = defaultdict(list)
with open(filename, mode='r') as csv_file:
    csv_reader = csv.DictReader(csv_file)
    for row in csv_reader:
        for k, v in row.items():
            d[k].append(v)
The keys are the column names taken from the header row:
>>> d.keys()
dict_keys(['Ana_Type', 'Ana_Length', 'Ana_Text', 'Ana_Space'])
Each key maps to the list of values in that column:
>>> d.get('Ana_Type')
['bla', 'bla1', 'df', 'ccc']

The repetitive calls to list.append can be avoided by reading the csv and using the zip builtin function to transpose the rows.
import io, csv
# Create an example file
buf = io.StringIO('type1,length1,text1,space1\ntype2,length2,text2,space2\ntype3,length3,text3,space3')
reader = csv.reader(buf)
# Uncomment the next line if there is a header row
# next(reader)
Ana_Types, Ana_Length, Ana_Text, Ana_Space = zip(*reader)
print(Ana_Types)
('type1', 'type2', 'type3')
print(Ana_Length)
('length1', 'length2', 'length3')
...
If you need lists rather than tuples, you can use a list comprehension to convert them:
Ana_Types, Ana_Length, Ana_Text, Ana_Space = [list(x) for x in zip(*reader)]

This could be useful:
import numpy as np
# read the rows with Numpy
rows = np.genfromtxt('data.csv',dtype='str',delimiter=';')
# call numpy.transpose to convert the rows to columns
cols = np.transpose(rows)
# get the stuff as lists
Ana_Type = list(cols[0])
Ana_Length = list(cols[1])
Ana_Text = list(cols[2])
Ana_Space = list(cols[3])
Edit: note that if the file has a header row, the first element of each list will be the column name (example with test data):
['Date', '2020-03-03', '2020-03-04', '2020-03-05', '2020-03-06']

Related

Can I print lines randomly from a csv in Python?

I'm trying to print the lines of a csv in random order.
Let's say the csv has the 10 lines below:
1,One
2,Two
3,Three
4,Four
5,Five
6,Six
7,Seven
8,Eight
9,Nine
10,Ten
If I write code like the one below, it prints each line as a list in the same order as in the CSV:
import csv
with open("MyCSV.csv") as f:
    reader = csv.reader(f)
    for row_num, row in enumerate(reader):
        print(row)
Instead, I'd like the order to be random.
It's just a print for now. Later I'll pass each line as a list to a function.

This should work. You can reuse the lines list in your code, since it is shuffled in place.
import random
with open("tmp.csv", "r") as f:
    lines = f.readlines()
random.shuffle(lines)
print(lines)

import csv
import random
csv_elems = []
with open("MyCSV.csv") as f:
    reader = csv.reader(f)
    for row_num, row in enumerate(reader):
        csv_elems.append(row)
random.shuffle(csv_elems)
print(csv_elems[0])
As you can see, I'm only printing the first element. You can iterate over the list to print every row, or keep shuffling and printing.

You can define a list, append all rows of the csv file to it, then shuffle it and print the rows. Assume the list is named temp:
import csv
import random
temp = []
with open("your csv file.csv") as file:
    reader = csv.reader(file)
    for row_num, row in enumerate(reader):
        temp.append(row)
random.shuffle(temp)
for i in range(len(temp)):
    print(temp[i])

Why not use pandas to handle the csv?
import pandas as pd
data = pd.read_csv("MyCSV.csv", header=None)  # header=None because the sample file has no header row
To get the random samples you are looking for, just write:
data.sample()   # one random row
data.sample(5)  # five random rows
If you also want to pass each line to a function:
data_after_function = data.apply(function_name, axis=1)
With axis=1 the function receives one row at a time, and inside the function you can cast the row to a list with list().
Hope this helps!

A couple of things to do:
1. Store the CSV in a sequence of some sort
2. Get the data randomly
For 1, it's probably best to use a comprehension (I've gone for a list of (row number, row) tuples, as it seems you want the row numbers and a dictionary can't be shuffled).
For 2, we can use the random module.
import random
import csv
with open("MyCSV.csv") as f:
    reader = csv.reader(f)
    my_csv = [(row_num, row) for row_num, row in enumerate(reader)]
# get only 1 item from the list at random
random_row = random.choice(my_csv)
# randomise the order of all the rows (random.shuffle works in place and returns None)
random.shuffle(my_csv)
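One detail worth checking when shuffling: random.shuffle reorders the list in place and returns None, while random.sample returns a new list and leaves the original alone. A small sketch with made-up rows (the seed is arbitrary and only there to make the run repeatable):

```python
import random

random.seed(0)  # arbitrary seed, only for repeatability

rows = [["1", "One"], ["2", "Two"], ["3", "Three"], ["4", "Four"]]

# random.sample returns a new list in random order; rows is left untouched.
shuffled_copy = random.sample(rows, len(rows))

# random.shuffle reorders the list in place and returns None.
in_place = list(rows)
result = random.shuffle(in_place)

print(result)  # None
print(shuffled_copy)
print(in_place)
```

Use random.sample when the original order is still needed later, and random.shuffle when it is not.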

Better way to parse CSV into list or array

Is there a better way to create a list or a numpy array from this csv file? What I'm asking is how to do it and parse more gracefully than I did in the code below.
fname = open("Computers discovered recently by discovery method.csv").readlines()
lst = [elt.strip().split(",")[8:] for elt in fname if elt != "\n"][4:]
lst2 = []
for row in lst:
    print(row)
    if row[0].startswith("SMZ-") or row[0].startswith("MTR-"):
        lst2.append(row)
print(*lst2, sep = "\n")

You can always use pandas. As an example:
import pandas as pd
import numpy as np
df = pd.read_csv('pandas_dataframe_importing_csv/example.csv')
To get a NumPy array, convert the DataFrame and cast it to your preferred numeric type. I guess you can write the whole thing in one line:
result = np.array(df).astype("float")
You can also do the following:
from numpy import genfromtxt
my_data = genfromtxt('my_file.csv', delimiter=',')

You can use pandas and specify the header row to make it work correctly on your sample file:
import pandas as pd
df = pd.read_csv('Computers discovered recently by discovery method.csv', header=2)
You can check your content using:
>>> df.head()
You can check headers using
>>> df.columns
And to convert it to numpy array you can use
>>> np_arr = df.values
pandas comes with a lot of options for parsing and reading csv files. For more information, please check the docs.

I am not sure what you want, but try this:
import csv
with open("Computers discovered recently by discovery method.csv", 'r') as f:
    reader = csv.reader(f)
    ll = list(reader)
print(ll)
This reads the csv row by row and stores it as a list of lists.

You should never parse CSV structures manually unless you want to tackle all possible exceptions and CSV format oddities. Python has you covered in that regard with its csv module.
The main problem, in your case, stems from your data: there seem to be two different CSV structures in a single file, so you first need to find where the second structure begins. Plus, from your code, it seems you want to drop all columns before Details_Table0_Netbios_Name0 and include only rows whose Details_Table0_Netbios_Name0 starts with SMZ- or MTR-. So something like:
import csv
with open("Computers discovered recently by discovery method.csv") as f:
    reader = csv.reader(f)  # create a CSV reader
    for row in reader:  # skip the lines until we encounter the second CSV structure/header
        if row and row[0] == "Header_Table0_Netbios_Name0":
            break
    index = row.index("Details_Table0_Netbios_Name0")  # find where your columns begin
    result = []  # storage for the rows we're interested in
    for row in reader:  # read the rest of the CSV row by row
        if row and row[index][:4] in {"SMZ-", "MTR-"}:  # only include these rows
            result.append(row[index:])  # trim and append to the `result` list
print(result[10])  # etc.
# ['MTR-PC0BXQE6-LB', 'PR2', 'anisita', 'VALUEADDCO', 'VALUEADDCO', 'Heartbeat Discovery',
#  '07.12.2017 17:47:51', '13']
should do the trick.

Sample code:
import csv
csv_file = 'sample.csv'
with open(csv_file) as fh:
    reader = csv.reader(fh)
    for row in reader:
        print(row)
sample.csv:
name,age,salary
clado,20,25000
student,30,34000
sam,34,32000

Splitting Rows in csv on several header rows

I am very new to python, so please be gentle.
I have a .csv file, reported to me in this format, so I cannot do much about it:
ClientAccountID AccountAlias CurrencyPrimary FromDate
SomeID SomeAlias SomeCurr SomeDate
OtherID OtherAlias OtherCurr OtherDate
ClientAccountID AccountAlias CurrencyPrimary AssetClass
SomeID SomeAlias SomeCurr SomeClass
OtherID OtherAlias OtherCurr OtherDate
AnotherID AnotherAlias AnotherCurr AnotherDate
I am using the csv package in Python, so I have:
with open(theFile, 'rb') as csvfile:
    theReader = csv.DictReader(csvfile, delimiter = ',')
Which, as I understand it, creates the dictionary 'theReader'. How do I subset this dictionary, into several dictionaries, splitting them by the header rows in the original csv file? Is there a simple, elegant, non-loop way to create a list of dictionaries (or even a dictionary of dictionaries, with account IDs as keys)? Does that make sense?
Oh. Please note the header rows are not equivalent, but the header rows will always begin with 'ClientAccountID'.
Thanks to @codie, I am now using the following to split the csv into several dicts, using the '\t' delimiter.
with open(theFile, 'rb') as csvfile:
    theReader = csv.DictReader(csvfile, delimiter = '\t')
However, I now get the entire header row as a key, and each other row as a value. How do I split this up further?
Thanks to @Benjamin Hodgson below, I have the following:
from csv import DictReader
from io import BytesIO
stringios = []
with open('file.csv', 'r') as f:
    stringio = None
    for line in f:
        if line.startswith('ClientAccountID'):
            if stringio is not None:
                stringios.append(stringio)
            stringio = BytesIO()
        stringio.write(line)
        stringio.write("\n")
    stringios.append(stringio)
data = [list(DictReader(x.getvalue(), delimiter=',')) for x in stringios]
If I print the first item in stringios, I get what I would expect. It looks like a single csv. However, if I print the first item in data, using the code below, I get something odd:
for row in data[0]:
    print row
It returns:
{'C':'U'}
{'C':'S'}
{'C':'D'}
...
So it appears it is splitting every character, instead of using the comma delimiter.

If I've understood your question correctly, you have a single CSV file which contains multiple tables. Tables are delimited by header rows which always begin with the string "ClientAccountID".
So the job is to read the CSV file into a list of lists-of-dictionaries. Each entry in the list corresponds to one of the tables in your CSV file.
Here's how I'd do it:
1. Break up the single CSV file with multiple tables into multiple files, each with a single table. (These files could be in-memory.) Do this by looking for lines which start with "ClientAccountID".
2. Read each of these files into a list of dictionaries using a DictReader.
Here's some code to read the file into a list of StringIOs. (A StringIO is an in-memory file. It works by wrapping a string up into a file-like interface).
from csv import DictReader
from io import StringIO
stringios = []
with open('file.csv', 'r') as f:
    stringio = None
    for line in f:
        if line.startswith('ClientAccountID'):
            if stringio is not None:
                stringio.seek(0)
                stringios.append(stringio)
            stringio = StringIO()
        stringio.write(line)
        stringio.write("\n")
    stringio.seek(0)
    stringios.append(stringio)
If we encounter a line starting with 'ClientAccountID', we put the current StringIO into the list and start writing to a new one. When you've finished, remember to add the last one to the list too.
Don't forget (as I did, in an earlier version of this answer) to rewind the StringIO after you've written to it using stringio.seek(0).
Now it's straightforward to loop over the StringIOs to get a table of dictionaries.
data = [list(DictReader(x, delimiter='\t')) for x in stringios]
For each file-like object in the list stringios, create a DictReader and read it into a list.
It's not too hard to modify this approach if your data is too big to fit into memory. Use generators instead of lists and do the processing line-by-line.
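The generator idea in the last paragraph could look roughly like this. It is a sketch rather than the code from the answer: iter_tables and its parameters are hypothetical names, the sample data is made up, and it assumes tab-delimited input whose header rows start with 'ClientAccountID'. Only one table is held in memory at a time.

```python
import csv
import io


def iter_tables(lines, header_prefix="ClientAccountID", delimiter="\t"):
    """Yield one list of row dicts per table, reading the input lazily."""
    block = []  # lines of the table currently being collected
    for line in lines:
        # A new header row closes the table collected so far.
        if line.startswith(header_prefix) and block:
            yield list(csv.DictReader(block, delimiter=delimiter))
            block = []
        if line.strip():  # ignore blank lines
            block.append(line)
    if block:  # do not forget the last table
        yield list(csv.DictReader(block, delimiter=delimiter))


# Made-up sample standing in for the real file.
sample = io.StringIO(
    "ClientAccountID\tAccountAlias\tFromDate\n"
    "SomeID\tSomeAlias\tSomeDate\n"
    "ClientAccountID\tAccountAlias\tAssetClass\n"
    "OtherID\tOtherAlias\tOtherClass\n"
)

tables = list(iter_tables(sample))
for table in tables:
    print(table)
```

With a real file, pass the open file object instead of the StringIO and loop over iter_tables(f) directly rather than building the tables list.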

If your data is whitespace-delimited rather than comma- or tab-delimited, you can use str.split and combine it with itertools.groupby to separate the headers from the rows:
from itertools import groupby
with open("test.txt") as f:
    grps, data = groupby(map(str.split, f), lambda x: x[0] == "ClientAccountID"), []
    for k, v in grps:
        if k:
            names = next(v)
            vals = zip(*next(grps)[1])
            data.append(dict(zip(names, vals)))
from pprint import pprint as pp
pp(data)
Output:
[{'AccountAlias': ('SomeAlias', 'OtherAlias'),
'ClientAccountID': ('SomeID', 'OtherID'),
'CurrencyPrimary': ('SomeCurr', 'OtherCurr'),
'FromDate': ('SomeDate', 'OtherDate')},
{'AccountAlias': ('SomeAlias', 'OtherAlias', 'AnotherAlias'),
'AssetClass': ('SomeClass', 'OtherDate', 'AnotherDate'),
'ClientAccountID': ('SomeID', 'OtherID', 'AnotherID'),
'CurrencyPrimary': ('SomeCurr', 'OtherCurr', 'AnotherCurr')}]
If it is tab-delimited, just change one line (and import csv):
import csv
with open("test.txt") as f:
    grps, data = groupby(csv.reader(f, delimiter="\t"), lambda x: x[0] == "ClientAccountID"), []
    for k, v in grps:
        if k:
            names = next(v)
            vals = zip(*next(grps)[1])
            data.append(dict(zip(names, vals)))

Reading column names alone in a csv file

I have a csv file with the following columns:
id,name,age,sex
Followed by a lot of values for the above columns.
I am trying to read the column names alone and put them inside a list.
I am using DictReader, and this prints the correct details:
with open('details.csv') as csvfile:
    i=["name","age","sex"]
    re=csv.DictReader(csvfile)
    for row in re:
        for x in i:
            print row[x]
But what I want is for the list of columns ("i" in the above case) to be parsed automatically from the input csv rather than hardcoded in a list.
with open('details.csv') as csvfile:
    rows=iter(csv.reader(csvfile)).next()
    header=rows[1:]
    re=csv.DictReader(csvfile)
    for row in re:
        print row
        for x in header:
            print row[x]
This gives an error
KeyError: 'name'
in the line print row[x]. Where am I going wrong? Is it possible to fetch the column names using DictReader?

Though you already have an accepted answer, I figured I'd add this for anyone else interested in a different solution-
Python's DictReader object in the CSV module (as of Python 2.6 and above) has a public attribute called fieldnames.
https://docs.python.org/3.4/library/csv.html#csv.csvreader.fieldnames
An implementation could be as follows:
import csv
with open('C:/mypath/to/csvfile.csv', 'r') as f:
    d_reader = csv.DictReader(f)
    # get fieldnames from DictReader object and store in list
    headers = d_reader.fieldnames
    for line in d_reader:
        # print value in MyCol1 for each row
        print(line['MyCol1'])
In the above, d_reader.fieldnames returns a list of your headers (assuming the headers are in the top row).
Which allows...
>>> print(headers)
['MyCol1', 'MyCol2', 'MyCol3']
If your headers are in, say the 2nd row (with the very top row being row 1), you could do as follows:
import csv
with open('C:/mypath/to/csvfile.csv', 'r') as f:
    # you can eat the first line before creating DictReader.
    # if no "fieldnames" param is passed into
    # DictReader object upon creation, DictReader
    # will read the upper-most line as the headers
    f.readline()
    d_reader = csv.DictReader(f)
    headers = d_reader.fieldnames
    for line in d_reader:
        # print value in MyCol1 for each row
        print(line['MyCol1'])

You can read the header with next(), which returns the next row of the reader's iterable object as a list. Then you can add the rest of the file's content to a list.
import csv
with open("C:/path/to/file.csv", "rb") as f:
    reader = csv.reader(f)
    i = reader.next()
    rest = list(reader)
Now i has the column names as a list.
print i
>>>['id', 'name', 'age', 'sex']
Also note that reader.next() does not work in Python 3. There, open the file in text mode and use the built-in next() to get the first line of the csv right after creating the reader, like so:
import csv
with open("C:/path/to/file.csv", "r", newline="") as f:
    reader = csv.reader(f)
    i = next(reader)
print(i)
>>>['id', 'name', 'age', 'sex']

The csv.DictReader object exposes an attribute called fieldnames, and that is what you'd use. Here's example code, followed by input and corresponding output:
import csv
file = "/path/to/file.csv"
with open(file, mode='r', encoding='utf-8') as f:
    reader = csv.DictReader(f, delimiter=',')
    for row in reader:
        print([col + '=' + row[col] for col in reader.fieldnames])
Input file contents:
col0,col1,col2,col3,col4,col5,col6,col7,col8,col9
00,01,02,03,04,05,06,07,08,09
10,11,12,13,14,15,16,17,18,19
20,21,22,23,24,25,26,27,28,29
30,31,32,33,34,35,36,37,38,39
40,41,42,43,44,45,46,47,48,49
50,51,52,53,54,55,56,57,58,59
60,61,62,63,64,65,66,67,68,69
70,71,72,73,74,75,76,77,78,79
80,81,82,83,84,85,86,87,88,89
90,91,92,93,94,95,96,97,98,99
Output of print statements:
['col0=00', 'col1=01', 'col2=02', 'col3=03', 'col4=04', 'col5=05', 'col6=06', 'col7=07', 'col8=08', 'col9=09']
['col0=10', 'col1=11', 'col2=12', 'col3=13', 'col4=14', 'col5=15', 'col6=16', 'col7=17', 'col8=18', 'col9=19']
['col0=20', 'col1=21', 'col2=22', 'col3=23', 'col4=24', 'col5=25', 'col6=26', 'col7=27', 'col8=28', 'col9=29']
['col0=30', 'col1=31', 'col2=32', 'col3=33', 'col4=34', 'col5=35', 'col6=36', 'col7=37', 'col8=38', 'col9=39']
['col0=40', 'col1=41', 'col2=42', 'col3=43', 'col4=44', 'col5=45', 'col6=46', 'col7=47', 'col8=48', 'col9=49']
['col0=50', 'col1=51', 'col2=52', 'col3=53', 'col4=54', 'col5=55', 'col6=56', 'col7=57', 'col8=58', 'col9=59']
['col0=60', 'col1=61', 'col2=62', 'col3=63', 'col4=64', 'col5=65', 'col6=66', 'col7=67', 'col8=68', 'col9=69']
['col0=70', 'col1=71', 'col2=72', 'col3=73', 'col4=74', 'col5=75', 'col6=76', 'col7=77', 'col8=78', 'col9=79']
['col0=80', 'col1=81', 'col2=82', 'col3=83', 'col4=84', 'col5=85', 'col6=86', 'col7=87', 'col8=88', 'col9=89']
['col0=90', 'col1=91', 'col2=92', 'col3=93', 'col4=94', 'col5=95', 'col6=96', 'col7=97', 'col8=98', 'col9=99']

How about:
with open(csv_input_path + file, 'r') as ft:
    header = ft.readline()  # read only the first line; returns a string
    header_list = header.strip().split(',')  # strip the newline, then split into a list
I am assuming your input file is in CSV format.
With pandas, this takes longer when the file is large, because read_csv loads the entire file into a DataFrame.
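One caveat with splitting the header line by hand: it breaks when a column name is quoted and contains a comma. A small sketch comparing the two approaches on a made-up header line (the column names are hypothetical):

```python
import csv
import io

# Hypothetical header line with a quoted column name that contains a comma.
header_line = 'id,"name, full",age\n'

# Plain string splitting cuts the quoted name in two.
naive = header_line.strip().split(',')

# csv.reader understands the quoting and still reads only one line.
parsed = next(csv.reader(io.StringIO(header_line)))

print(naive)   # ['id', '"name', ' full"', 'age']
print(parsed)  # ['id', 'name, full', 'age']
```

If the column names are known to be plain, the readline-and-split version is fine; otherwise next(csv.reader(f)) is the safer way to read only the first row.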

I am just mentioning how to get all the column names from a csv file.
I am using pandas library.
First we read the file.
import pandas as pd
file = pd.read_csv('details.csv')
Then, in order to just get all the column names as a list from input file use:-
columns = list(file.head(0))

Thanks to Daniel Jimenez for his solution for fetching the column names alone from my csv. I extended it to use DictReader so we can iterate over the rows using the column names as keys.
import csv
with open('myfile.csv', newline='') as f:
    # read only the header row and drop the first column (id)
    header = next(csv.reader(f))[1:]
with open('myfile.csv', newline='') as csvfile:
    for row in csv.DictReader(csvfile):
        for name in header:
            print(row[name])

Here is code to print only the header row (the column names) of the csv file.
import csv
with open('filepath.csv', newline='') as f:
    HEADERS = next(csv.reader(f))
print(HEADERS)
Another method, with pandas:
import pandas as pd
HEADERS = list(pd.read_csv('filepath.csv').head(0))
print(HEADERS)

import pandas as pd
data = pd.read_csv("data.csv")
cols = data.columns

I literally just wanted the first row of my data, which holds the headers I need, and didn't want to iterate over all my data to get it, so I just did this:
with open(data, 'r', newline='') as csvfile:
    t = 0
    for i in csv.reader(csvfile, delimiter=',', quotechar='|'):
        if t > 0:
            break
        else:
            dbh = i
            t += 1

Using pandas is also an option.
But instead of loading the full file into memory, you can read only the first chunk of it to get the field names by passing iterator=True.
import pandas as pd
file = pd.read_csv('details.csv', iterator=True)
column_names_full = file.get_chunk(1)
column_names = [column for column in column_names_full]
print(column_names)

Creating a single dictionary from two tab delimited files

I'm somewhat new to Python and still trying to learn all its tricks and exploitations.
I'm looking to see if it's possible to collect column data from two separate files to create a single dictionary, rather than two distinct dictionaries. The code that I've used to import files before looks like this:
import csv
from collections import defaultdict
columns = defaultdict(list)
with open("myfile.txt") as f:
    reader = csv.DictReader(f,delimiter='\t')
    for row in reader:
        for (header,variable) in row.items():
            columns[header].append(variable)
f.close()
This code makes each element of the first line of the file into a header for the columns of data below it. What I'd like to do now is to import a file that only contains one line which I'll use as my header, and import another file that only contains data that I'll match the headers up to. What I've tried so far resembles this:
columns = defaultdict(list)
with open("headerData.txt") as g:
    reader1 = csv.DictReader(g,delimiter='\t')
    for row in reader1:
        for (h,v) in row.items():
            columns[h].append(v)
    with open("variableData.txt") as f:
        reader = csv.DictReader(f,delimiter='\t')
        for row in reader:
            for (h,v) in row.items():
                columns[h].append(v)
Is nesting the open statements the right way to attempt this? Honestly I am totally lost on what to do. Any help is greatly appreciated.

You can't use DictReader like that if the headers are not in the file. But you can create a fake file object that yields the headers and then the data, using itertools.chain:
from itertools import chain
with open('headerData.txt') as h, open('variableData.txt') as data:
    f = chain(h, data)
    reader = csv.DictReader(f,delimiter='\t')
    # proceed with your code from the first snippet
    # no close() calls needed when using open() with "with" statements
Another way, of course, would be to just read the headers into a list and use a regular csv.reader on variableData.txt:
with open('headerData.txt') as h:
    names = next(h).rstrip('\n').split('\t')
with open('variableData.txt') as f:
    reader = csv.reader(f, delimiter='\t')
    for row in reader:
        for name, value in zip(names, row):
            columns[name].append(value)
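To see the itertools.chain approach run end to end, here is a minimal sketch that uses in-memory stand-ins for the two files. The sample names and values are made up, and the StringIO objects are an assumption that only replaces the open() calls:

```python
import csv
import io
from collections import defaultdict
from itertools import chain

# Hypothetical stand-ins for headerData.txt and variableData.txt.
header_file = io.StringIO("name\tage\n")
data_file = io.StringIO("alice\t30\nbob\t25\n")

columns = defaultdict(list)

# chain() yields the header line first, then the data lines,
# so DictReader sees them as one continuous file.
reader = csv.DictReader(chain(header_file, data_file), delimiter="\t")
for row in reader:
    for header, value in row.items():
        columns[header].append(value)

print(dict(columns))  # {'name': ['alice', 'bob'], 'age': ['30', '25']}
```

This works because DictReader only needs an iterable of lines, not a real file object.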

By default, DictReader will take the first line in your csv file and use that as the keys for the dict. However, according to the docs, you can also pass it a fieldnames parameter, which is a sequence containing the names of the keys to use for the dict. So you could do this:
columns = defaultdict(list)
with open("headerData.txt") as f, open("variableData.txt") as data:
    reader = csv.DictReader(data,
                            fieldnames=f.read().rstrip().split('\t'),
                            delimiter='\t')
    for row in reader:
        for (h,v) in row.items():
            columns[h].append(v)
