Read in the first column of a CSV in Python - python

I have a CSV (mylist.csv) with 2 columns that look similar to this:
jfj840398jgg item-2f
hd883hb2kjsd item-9k
jie9hgtrbu43 item-12
fjoi439jgnso item-3i
I need to read the first column into a variable so I just get:
jfj840398jgg
hd883hb2kjsd
jie9hgtrbu43
fjoi439jgnso
I tried the following, but it is only giving me the first letter of each row:
import csv
list2 = []
with open("mylist.csv") as f:
    for row in f:
        list2.append(row[0])
So the results of the above code are giving me list2 as:
['j', 'h', 'j', 'f']

You should split each row and then append the first item:
list2 = []
with open("mylist.csv") as f:
    for row in f:
        list2.append(row.split()[0])
You could also use a list comprehension, which is a fairly standard way to build a list:
with open("mylist.csv") as f:
    list2 = [row.split()[0] for row in f]
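One thing to watch with row.split()[0] is that a blank line splits into an empty list and raises IndexError. The following is a minimal sketch, assuming the file is whitespace-separated as in the question; the in-memory sample and the helper name first_column are illustrative, not part of the original answer.

```python
import io

# In-memory stand-in for mylist.csv (assumed whitespace-separated, with one blank line).
sample = io.StringIO(
    "jfj840398jgg item-2f\n"
    "hd883hb2kjsd item-9k\n"
    "\n"
    "jie9hgtrbu43 item-12\n"
)


def first_column(lines):
    """Return the first whitespace-separated field of every non-blank line."""
    return [line.split()[0] for line in lines if line.strip()]


list2 = first_column(sample)
print(list2)
```

The same helper accepts an open file object, since iterating a file also yields one line at a time.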

You can also use pandas here. The sample file is separated by whitespace and has no header row, so tell read_csv about both:
import pandas as pd
df = pd.read_csv("mylist.csv", sep=r"\s+", header=None)
Then, getting the first column is as easy as:
matrix2 = df[df.columns[0]].to_numpy()  # as_matrix() was removed in pandas 1.0
list2 = matrix2.tolist()
This returns only the first column as a list. You might want to leave the data in NumPy if you're doing further operations on the result.

You can use the csv module:
import csv
with open("processes_infos.csv", "r", newline="") as file:
    reader = csv.reader(file, delimiter=",")
    for row in reader:
        print(row[0], row[1])
You can change the delimiter "," into " ".

You import csv, but then never use it to actually read the CSV. Instead you open mylist.csv as a normal file, so when you write:
for row in f:
    list2.append(row[0])
What you're actually telling Python to do is "iterate through the lines, and append the first element of each line (which is its first character) to list2". What you need to do, if you want to use the csv module, is:
import csv
list2 = []
with open('mylist.csv', 'r', newline='') as f:
    csv_reader = csv.reader(f, delimiter=' ')
    for row in csv_reader:
        list2.append(row[0])
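To make the difference concrete, here is a small sketch that compares indexing a raw line with indexing a parsed row. It assumes a single space between the two fields; the in-memory sample stands in for mylist.csv.

```python
import csv
import io

text = "jfj840398jgg item-2f\nhd883hb2kjsd item-9k\n"

# Iterating the file directly yields strings, so [0] is the first character.
first_chars = [line[0] for line in io.StringIO(text)]

# csv.reader yields lists of fields, so [0] is the whole first column value.
reader = csv.reader(io.StringIO(text), delimiter=" ")
list2 = [row[0] for row in reader if row]

print(first_chars)
print(list2)
```

If the fields may be separated by several spaces, str.split() with no arguments is the more forgiving choice, because csv.reader would treat each extra space as an empty field.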

The simplest answer:
import pandas as pd
df = pd.read_csv("mylist.csv", sep=r"\s+", header=None)
matrix2 = df[df.columns[0]].to_numpy()
list1 = matrix2.tolist()
print(list1)

Related

Can I print lines randomly from a csv in Python?

I'm trying to print lines randomly from a csv.
Let's say the csv has the 10 lines below:
1,One
2,Two
3,Three
4,Four
5,Five
6,Six
7,Seven
8,Eight
9,Nine
10,Ten
If I write code like the below, it prints each line as a list, in the same order as in the CSV:
import csv
with open("MyCSV.csv") as f:
    reader = csv.reader(f)
    for row_num, row in enumerate(reader):
        print(row)
Instead, I'd like it to be random.
It's just a print for now. I'll later pass each line as a list to a function.
This should work. You can reuse the lines list in your code, as it is shuffled in place.
import random
with open("tmp.csv", "r") as f:
    lines = f.readlines()
random.shuffle(lines)
print(lines)
import csv
import random
csv_elems = []
with open("MyCSV.csv") as f:
    reader = csv.reader(f)
    for row_num, row in enumerate(reader):
        csv_elems.append(row)
random.shuffle(csv_elems)
print(csv_elems[0])
As you can see, I'm just printing the first element; you can iterate over the list, or keep shuffling and printing.
Well, you can define a list, append all the rows of the csv file to it, then shuffle it and print them. Assume the list is named temp:
import csv
import random
temp = []
with open("your csv file.csv") as file:
    reader = csv.reader(file)
    for row_num, row in enumerate(reader):
        temp.append(row)
random.shuffle(temp)
for row in temp:
    print(row)
Why not use pandas to handle the csv? The sample file has no header row, so pass header=None:
import pandas as pd
data = pd.read_csv("MyCSV.csv", header=None)
And to get the samples you are looking for, just write:
data.sample()   # one random row
data.sample(5)  # five random rows
Also, if you want to pass each line to a function:
data_after_function = data.apply(function_name, axis=1)
Inside the function you can cast the row into a list with list().
Hope this helps!
A couple of things to do:
1. Store the CSV in a sequence of some sort
2. Get the data randomly
For 1, it’s probably best to use some form of sequence comprehension (I’ve gone for tuples nested in a list, as it seems you want the row numbers and we can’t shuffle a dictionary).
We can use the random module for 2.
import random
import csv
with open("MyCSV.csv") as f:
    reader = csv.reader(f)
    my_csv = [(row_num, row) for row_num, row in enumerate(reader)]
# get only 1 item from the list at random
random_row = random.choice(my_csv)
# randomise the order of all the rows; random.shuffle works in place and returns None
random.shuffle(my_csv)
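A point that is easy to trip over here: random.shuffle reorders the list in place and returns None, while random.sample returns a new list. The sketch below illustrates both on a hypothetical in-memory stand-in for MyCSV.csv; the seed is there only to make the run repeatable.

```python
import csv
import io
import random

# Hypothetical in-memory stand-in for MyCSV.csv.
sample = io.StringIO("1,One\n2,Two\n3,Three\n4,Four\n5,Five\n")
rows = list(csv.reader(sample))

rng = random.Random(42)  # seeded only so the sketch is repeatable

# sample() returns a new, shuffled list and leaves rows untouched.
shuffled_copy = rng.sample(rows, k=len(rows))

# shuffle() reorders its argument in place and returns None.
in_place = list(rows)
shuffle_result = rng.shuffle(in_place)

for row in shuffled_copy:
    print(row)
```

Use the copy when the original order is still needed later, and the in-place shuffle when it is not.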

convert items in csv column to list using python

I have been reading answers on StackOverflow and haven't been able to find one that covers this specific question.
I have a csv with a single column, with values as follows:
**Values**
abc
xyz
bcd,fgh
tew,skdh,fsh
As you can see above, some cells have more than one value, separated by commas.
I used the following code:
with open('dat.csv', 'rb') as inputfile:
    reader = csv.reader(inputfile)
colnames=['Keywords']
data = pandas.read_csv('dat.csv', names=colnames)
lkn=data.values.tolist()
print lkn
The output i got was: [['abc'],['xyz'],['bcd,fgh'],['tew,skdh,fsh']]
i would like to have the output as:
[['abc'],['xyz'],['bcd','fgh'],['tew','skdh','fsh']]
which I believe is a proper list-of-lists format (I'm fairly new to lists of lists). Please point me in the right direction.
Thanks!.
NB:csv file with how cells are arranged (image)
Looking at your attached image, I'd bet that the cells have been quoted (although, to be sure, open the CSV file in a text editor, not in Excel) so you have to do the manual splitting yourself:
import csv
with open("file.csv", "r") as f:
    reader = csv.reader(f)
    your_list = [e[0].strip().split(",") for e in reader if e]
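As a self-contained illustration of the manual split, here is a sketch that assumes the multi-value cells are quoted in the file, as guessed above. The in-memory sample mirrors the data in the question, and skipping the first row assumes it is the "Values" header.

```python
import csv
import io

# Assumed file contents: multi-value cells are quoted, so csv.reader keeps them as one field.
sample = io.StringIO('Values\nabc\nxyz\n"bcd,fgh"\n"tew,skdh,fsh"\n')

reader = csv.reader(sample)
next(reader)  # skip the header row (assumption: the first row is the column title)

# Split each single-field row on commas to get a list of lists.
your_list = [[part.strip() for part in row[0].split(",")] for row in reader if row]
print(your_list)
```

If the cells turn out not to be quoted, csv.reader already splits them and list(reader) alone gives the desired result.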
Try something like this:
import csv
with open('file.csv', 'r') as f:
    reader = csv.reader(f)
    # each row yielded by csv.reader is already a list
    your_list = list(reader)
print(your_list)
Credit : Python import csv to list

Python read csv file columns into lists, ignoring headers

I have a file 'data.csv' that looks something like
ColA, ColB, ColC
1,2,3
4,5,6
7,8,9
I want to open and read the file columns into lists, with the 1st entry of that list omitted, e.g.
dataA = [1,4,7]
dataB = [2,5,8]
dataC = [3,6,9]
In reality there are more than 3 columns and the lists are very long, this is just an example of the format. I've tried:
csv_file = open('data.csv','rb')
csv_array = []
for row in csv.reader(csv_file, delimiter=','):
    csv_array.append(row)
Where I would then allocate each index of csv_array to a list, e.g.
dataA = [int(i) for i in csv_array[0]]
But I'm getting errors:
_csv.Error: new-line character seen in unquoted field - do you need to open the file in universal-newline mode?
Also it feels like a very long winded way of just saving data to a few lists...
Thanks!
Edit: here is how I solved it:
import pandas as pd
df = pd.read_csv('data.csv', names=['ColA', 'ColB', 'ColC'])
# with names= given, the header line is read as row 0, so skip it with [1:]
dataA = list(map(int, df.ColA.tolist()[1:]))
and repeat for the rest of the columns.
Just to spell this out for people trying to solve a similar problem, perhaps without Pandas, here's a simple refactoring with comments.
import csv
# Open the file in 'r' mode, not 'rb'
csv_file = open('data.csv','r')
dataA = []
dataB = []
dataC = []
# Read off and discard first line, to skip headers
csv_file.readline()
# Split columns while reading
for a, b, c in csv.reader(csv_file, delimiter=','):
    # Append each variable to a separate list
    dataA.append(a)
    dataB.append(b)
    dataC.append(c)
csv_file.close()
This does nothing to convert the individual fields to numbers (use append(int(a)) etc if you want that) but should hopefully be explicit and flexible enough to show you how to adapt this to new requirements.
Use pandas:
import pandas as pd
df = pd.read_csv(path)  # DataFrame.from_csv was removed in pandas 1.0
rows = df.apply(lambda x: x.tolist(), axis=1)
To skip the header, create your reader on a separate line. Then, to convert from a list of rows to a list of columns, use zip():
import csv
with open('data.csv', 'r', newline='') as f_input:
    csv_input = csv.reader(f_input)
    header = next(csv_input)
    data = list(zip(*[map(int, row) for row in csv_input]))
print(data)
Giving you:
[(1, 4, 7), (2, 5, 8), (3, 6, 9)]
So if needed:
dataA = data[0]
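With many columns, it may be more convenient to key the transposed data by the header names than to index by position. The following is only a sketch of that idea: the in-memory sample mirrors data.csv from the question, and skipinitialspace=True is used because that header has a space after each comma.

```python
import csv
import io

# In-memory stand-in for data.csv, including the spaces after the commas in the header.
sample = io.StringIO("ColA, ColB, ColC\n1,2,3\n4,5,6\n7,8,9\n")

reader = csv.reader(sample, skipinitialspace=True)
header = next(reader)  # consume the header row so only data rows remain

# zip(*rows) transposes rows into columns; pair each column with its header name.
transposed = zip(*(map(int, row) for row in reader))
columns = {name: list(values) for name, values in zip(header, transposed)}

print(columns["ColA"])
```

Note that zip(*rows) holds every row in memory at once, which is fine for moderately sized files but worth keeping in mind for very long ones.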
Seems like you have OSX line endings in your csv file. Try saving the csv file in "Windows Comma Separated (.csv)" format.
There are also easier ways to do what you're doing with the csv reader:
import csv
csv_array = []
with open('data.csv', 'r', newline='') as csv_file:
    reader = csv.reader(csv_file)
    # remove headers
    next(reader)
    # loop over rows in the file, append them to your array. each row is already formatted as a list.
    for row in reader:
        csv_array.append(row)
Each entry of csv_array is one row, so csv_array[0] is the first data row. To get the first column, use dataA = [row[0] for row in csv_array].
First if you read the csv file with csv.reader(csv_file, delimiter=','), you will still read the header.
csv_array[0] will be the header row -> ['ColA', ' ColB', ' ColC']
Also, if you're using a Mac, this issue is already covered here: CSV new-line character seen in unquoted field error
And I would recommend using pandas and numpy instead if you will do more analysis on the data. pandas reads the csv file into a DataFrame.
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
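Use csv.DictReader() to select specific columns:
import csv
dataA = []
dataB = []
with open('data.csv', 'r', newline='') as csv_file:
    # skipinitialspace strips the space after each comma in the "ColA, ColB, ColC" header
    csv_reader = csv.DictReader(csv_file, delimiter=',', skipinitialspace=True)
    for row in csv_reader:
        dataA.append(row['ColA'])
        dataB.append(row['ColB'])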

How to convert the second column of a csv file to a list of floats?

I have a csv file like this:
string, 3.54545,4.3434,3.34435543
string, 4.54545,67.3434,5.34435543
...
stringN, 5.54545,1.3434,9.34435543
How can I extract the first column (strings) and the second column (floats) into two different lists with the csv module? For example, I would like to get something like this:
list1 = [string,string,...,string]
list2 = [3.54545,4.54545,..,5.54545]
Where list1 is a list of strings and list2 is a list of floats. I tried the following with pandas; the problem is that it took a long time to read the file:
df = pd.read_csv('test_dict.csv', header = None)
list1 = df[0].values.tolist()
list2 = df[1].values.tolist()
Thanks in advance, guys!
You could do it with the csv module like this, but as I remarked in my comments, don't expect it to be any faster than using pandas.
import csv
col1 = []
col2 = []
with open('test_dict.csv', newline='') as f:
    for row in csv.reader(f):
        col1.append(row[0])
        col2.append(float(row[1]))  # convert the second column to float
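For reference, here is a runnable sketch of the same loop on an in-memory sample that mirrors the rows in the question. The guard against short rows is an added assumption for files with blank or truncated lines, not something the question mentions.

```python
import csv
import io

# In-memory stand-in for test_dict.csv, using the rows shown in the question.
sample = io.StringIO(
    "string, 3.54545,4.3434,3.34435543\n"
    "string, 4.54545,67.3434,5.34435543\n"
    "stringN, 5.54545,1.3434,9.34435543\n"
)

list1, list2 = [], []
for row in csv.reader(sample):
    if len(row) < 2:
        continue  # skip blank or truncated lines (assumption, see above)
    list1.append(row[0])
    # float() ignores surrounding whitespace, so " 3.54545" converts cleanly.
    list2.append(float(row[1]))

print(list1)
print(list2)
```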
If you want the first two columns you can zip:
import csv
with open("in.csv") as f:
    reader = csv.reader(f)
    zipped = zip(*reader)
    s, f = list(next(zipped)), list(map(float, next(zipped)))
For python 2 use itertools.izip:
import csv
from itertools import izip
with open("in.csv") as f:
    reader = csv.reader(f)
    zipped = izip(*reader)
    s, f = list(next(zipped)), map(float, next(zipped))
print(s, f)
(['string', 'string', 'stringN'], [3.54545, 4.54545, 5.54545])

Reading column names alone in a csv file

I have a csv file with the following columns:
id,name,age,sex
Followed by a lot of values for the above columns.
I am trying to read the column names alone and put them inside a list.
I am using DictReader and this gives out the correct details:
with open('details.csv') as csvfile:
    i=["name","age","sex"]
    re=csv.DictReader(csvfile)
    for row in re:
        for x in i:
            print row[x]
But what I want is for the list of columns ("i" in the above case) to be parsed automatically from the input csv rather than hardcoded in a list.
with open('details.csv') as csvfile:
    rows=iter(csv.reader(csvfile)).next()
    header=rows[1:]
    re=csv.DictReader(csvfile)
    for row in re:
        print row
        for x in header:
            print row[x]
This gives out an error
KeyError: 'name'
in the line print row[x]. Where am I going wrong? Is it possible to fetch the column names using DictReader?
Though you already have an accepted answer, I figured I'd add this for anyone else interested in a different solution-
Python's DictReader object in the CSV module (as of Python 2.6 and above) has a public attribute called fieldnames.
https://docs.python.org/3.4/library/csv.html#csv.csvreader.fieldnames
An implementation could be as follows:
import csv
with open('C:/mypath/to/csvfile.csv', 'r') as f:
    d_reader = csv.DictReader(f)
    #get fieldnames from DictReader object and store in list
    headers = d_reader.fieldnames
    for line in d_reader:
        #print value in MyCol1 for each row
        print(line['MyCol1'])
In the above, d_reader.fieldnames returns a list of your headers (assuming the headers are in the top row).
Which allows...
>>> print(headers)
['MyCol1', 'MyCol2', 'MyCol3']
If your headers are in, say the 2nd row (with the very top row being row 1), you could do as follows:
import csv
with open('C:/mypath/to/csvfile.csv', 'r') as f:
    #you can eat the first line before creating DictReader.
    #if no "fieldnames" param is passed into
    #DictReader object upon creation, DictReader
    #will read the upper-most line as the headers
    f.readline()
    d_reader = csv.DictReader(f)
    headers = d_reader.fieldnames
    for line in d_reader:
        #print value in MyCol1 for each row
        print(line['MyCol1'])
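The "first unread line becomes the header" behaviour also seems to explain the KeyError in the question: a plain csv.reader had already consumed the real header, so DictReader took the first data row as its field names. The sketch below shows both cases on hypothetical in-memory data; the column names come from the question, the values are made up.

```python
import csv
import io

# Hypothetical contents for details.csv; only the header comes from the question.
text = "id,name,age,sex\n1,Ann,34,F\n2,Bob,29,M\n"

# Normal use: DictReader reads the first line as the field names.
reader = csv.DictReader(io.StringIO(text))
headers = reader.fieldnames
names = [row["name"] for row in reader]

# If something else reads the header first, DictReader uses the next line instead.
consumed = io.StringIO(text)
next(csv.reader(consumed))  # the real header is consumed here
wrong_headers = csv.DictReader(consumed).fieldnames

print(headers)
print(wrong_headers)
```

So when DictReader is used, it is usually simplest to let it read the header itself and take the names from fieldnames.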
You can read the header by using the next() function, which returns the next row of the reader’s iterable object as a list. Then you can add the rest of the file's content to a list.
import csv
with open("C:/path/to/.filecsv", "rb") as f:
    reader = csv.reader(f)
    i = reader.next()
    rest = list(reader)
Now i has the column names as a list.
print i
>>>['id', 'name', 'age', 'sex']
Also note that reader.next() does not work in Python 3. Instead use the built-in next() to get the first line of the csv immediately after creating the reader, and open the file in text mode ("r") rather than "rb":
import csv
with open("C:/path/to/.filecsv", "r", newline="") as f:
    reader = csv.reader(f)
    i = next(reader)
    print(i)
>>>['id', 'name', 'age', 'sex']
The csv.DictReader object exposes an attribute called fieldnames, and that is what you'd use. Here's example code, followed by input and corresponding output:
import csv
file = "/path/to/file.csv"
with open(file, mode='r', encoding='utf-8') as f:
    reader = csv.DictReader(f, delimiter=',')
    for row in reader:
        print([col + '=' + row[col] for col in reader.fieldnames])
Input file contents:
col0,col1,col2,col3,col4,col5,col6,col7,col8,col9
00,01,02,03,04,05,06,07,08,09
10,11,12,13,14,15,16,17,18,19
20,21,22,23,24,25,26,27,28,29
30,31,32,33,34,35,36,37,38,39
40,41,42,43,44,45,46,47,48,49
50,51,52,53,54,55,56,57,58,59
60,61,62,63,64,65,66,67,68,69
70,71,72,73,74,75,76,77,78,79
80,81,82,83,84,85,86,87,88,89
90,91,92,93,94,95,96,97,98,99
Output of print statements:
['col0=00', 'col1=01', 'col2=02', 'col3=03', 'col4=04', 'col5=05', 'col6=06', 'col7=07', 'col8=08', 'col9=09']
['col0=10', 'col1=11', 'col2=12', 'col3=13', 'col4=14', 'col5=15', 'col6=16', 'col7=17', 'col8=18', 'col9=19']
['col0=20', 'col1=21', 'col2=22', 'col3=23', 'col4=24', 'col5=25', 'col6=26', 'col7=27', 'col8=28', 'col9=29']
['col0=30', 'col1=31', 'col2=32', 'col3=33', 'col4=34', 'col5=35', 'col6=36', 'col7=37', 'col8=38', 'col9=39']
['col0=40', 'col1=41', 'col2=42', 'col3=43', 'col4=44', 'col5=45', 'col6=46', 'col7=47', 'col8=48', 'col9=49']
['col0=50', 'col1=51', 'col2=52', 'col3=53', 'col4=54', 'col5=55', 'col6=56', 'col7=57', 'col8=58', 'col9=59']
['col0=60', 'col1=61', 'col2=62', 'col3=63', 'col4=64', 'col5=65', 'col6=66', 'col7=67', 'col8=68', 'col9=69']
['col0=70', 'col1=71', 'col2=72', 'col3=73', 'col4=74', 'col5=75', 'col6=76', 'col7=77', 'col8=78', 'col9=79']
['col0=80', 'col1=81', 'col2=82', 'col3=83', 'col4=84', 'col5=85', 'col6=86', 'col7=87', 'col8=88', 'col9=89']
['col0=90', 'col1=91', 'col2=92', 'col3=93', 'col4=94', 'col5=95', 'col6=96', 'col7=97', 'col8=98', 'col9=99']
How about:
with open(csv_input_path + file, 'r') as ft:
    header = ft.readline().strip()  # read only the first line; returns a string without the trailing newline
    header_list = header.split(',')  # returns list
I am assuming your input file is in CSV format.
If you use pandas, it takes more time when the file is big, because it loads the entire data set.
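Splitting the first line on commas works for simple headers, but it breaks when a column name is quoted and contains a comma. A short sketch of the difference, using a made-up header for illustration, reads the first row through csv.reader instead:

```python
import csv
import io

# Made-up sample: the second column name is quoted and contains a comma.
sample = io.StringIO('id,"name, full",age\n1,"Doe, Jane",34\n')

# Naive approach: split the raw first line on commas.
naive = sample.readline().strip().split(",")

# csv.reader understands the quoting; next() with a default avoids
# StopIteration on an empty file.
sample.seek(0)
header = next(csv.reader(sample), None)

print(naive)
print(header)
```

Both approaches read only the first line, so neither loads the whole file.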
I am just mentioning how to get all the column names from a csv file.
I am using pandas library.
First we read the file.
import pandas as pd
file = pd.read_csv('details.csv')
Then, in order to just get all the column names as a list from input file use:-
columns = list(file.head(0))
Thanking Daniel Jimenez for his perfect solution to fetch the column names alone from my csv, I extended his solution to use DictReader so we can iterate over the rows using column names as keys. Thanks Jimenez.
import csv
# read the header row first, then drop the first column name
with open('myfile.csv', newline='') as f:
    i = next(csv.reader(f))
i = i[1:]
with open('myfile.csv', newline='') as csvfile:
    re = csv.DictReader(csvfile)
    for row in re:
        for x in i:
            print(row[x])
Here is the code to print only the headers (column names) of the csv file.
import csv
with open('filepath.csv', newline='') as f:
    HEADERS = next(csv.reader(f))
print(HEADERS)
Another method with pandas
import pandas as pd
HEADERS = list(pd.read_csv('filepath.csv').head(0))
print (HEADERS)
import pandas as pd
data = pd.read_csv("data.csv")
cols = data.columns
I literally just wanted the first row of my data which are the headers I need and didn't want to iterate over all my data to get them, so I just did this:
with open(data, 'r', newline='') as csvfile:
    t = 0
    for i in csv.reader(csvfile, delimiter=',', quotechar='|'):
        if t > 0:
            break
        else:
            dbh = i
            t += 1
Using pandas is also an option.
But instead of loading the full file into memory, you can retrieve only the first chunk of it to get the field names by using an iterator.
import pandas as pd
file = pd.read_csv('details.csv', iterator=True)
column_names_full = file.get_chunk(1)
column_names = [column for column in column_names_full]
print(column_names)
