Merge Two CSV files in Python

I have two csv files and I want to create a third csv from a merge of the two. Here's how my files look:
Num | status
1213 | closed
4223 | open
2311 | open
and another file has this:
Num | code
1002 | 9822
1213 | 1891
4223 | 0011
So, here is my little code. I was trying to loop through both files, but it does not print the output with the third column added and matched to the correct values.
def links():
    first = open('closed.csv')
    csv_file = csv.reader(first)
    second = open('links.csv')
    csv_file2 = csv.reader(second)
    for row in csv_file:
        for secrow in csv_file2:
            if row[0] == secrow[0]:
                print row[0] + "," + row[1] + "," + secrow[0]
    time.sleep(1)
so what I want is something like:
Num | status | code
1213 | closed | 1891
4223 | open | 0011
2311 | open | blank no match

If you decide to use pandas, you can do it in only five lines.
import pandas as pd
first = pd.read_csv('closed.csv')
second = pd.read_csv('links.csv')
merged = pd.merge(first, second, how='left', on='Num')
merged.to_csv('merged.csv', index=False)
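With a left join, rows of closed.csv that have no match in links.csv (2311 in the example) come out with NaN in the code column. If you want the literal placeholder text from the expected output instead, one extra line before writing the file should do it (a small sketch, assuming the column really is named code):
merged['code'] = merged['code'].fillna('blank no match')  # placeholder for unmatched rows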

This is definitely a job for pandas. You can easily read in both csv files as DataFrames and use either merge or concat. It'll be way faster and you can do it in just a few lines of code.

The problem is that you can iterate over a csv reader only once, so csv_file2 is exhausted after the first pass of the outer loop and the inner loop never runs again. To solve that, save the rows of csv_file2 in a list and iterate over the saved list instead.
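A tiny demonstration of the exhausted-reader behaviour (a sketch, reusing the file name from the question):
import csv
with open('links.csv') as f:
    reader = csv.reader(f)
    print len(list(reader))  # e.g. 4 rows on the first pass
    print len(list(reader))  # 0 - the reader is already exhausted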
The fixed version could look like this:
import time, csv

def links():
    first = open('closed.csv')
    csv_file = csv.reader(first, delimiter="|")
    second = open('links.csv')
    csv_file2 = csv.reader(second, delimiter="|")
    saved_rows = []  # renamed from 'list' to avoid shadowing the built-in
    for row in csv_file2:
        saved_rows.append(row)
    for row in csv_file:
        match = False
        for secrow in saved_rows:
            if row[0].replace(" ", "") == secrow[0].replace(" ", ""):
                print row[0] + "," + row[1] + "," + secrow[1]
                match = True
        if not match:
            print row[0] + "," + row[1] + ", blank no match"
    time.sleep(1)
Output:
Num , status, code
1213 , closed, 1891
4223 , open, 0011
2311 , open, blank no match

You could read the values of the second file into a dictionary and then add them to the first.
Code = {}
for row in csv_file2:
    Code[row[0]] = row[1]
for row in csv_file1:
    row.append(Code.get(row[0], "blank no match"))
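That snippet assumes csv_file2 and csv_file1 are already-open csv readers. A minimal self-contained sketch, assuming the pipe-delimited files from the question and writing the result to a new merged.csv (the output name is hypothetical):
import csv

Code = {}
with open('links.csv') as second:
    for row in csv.reader(second, delimiter='|'):
        Code[row[0].strip()] = row[1].strip()

rows = []
with open('closed.csv') as first:
    for row in csv.reader(first, delimiter='|'):
        # the header row maps naturally too: Code['Num'] is 'code'
        rows.append([row[0].strip(), row[1].strip(),
                     Code.get(row[0].strip(), 'blank no match')])

with open('merged.csv', 'w') as out:
    csv.writer(out).writerows(rows)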

This code will do it for you:
import csv

def links():
    # open both files
    with open('closed.csv') as closed, open('links.csv') as links:
        # using DictReader instead, to more easily access information by num
        csv_closed = csv.DictReader(closed)
        csv_links = csv.DictReader(links)
        # create dictionaries out of the two CSV files using dictionary comprehensions
        num_dict = {row['num']: row['status'] for row in csv_closed}
        link_dict = {row['num']: row['code'] for row in csv_links}
        # print header; each column has a width of 8 characters
        print("{0:8} | {1:8} | {2:8}".format("Num", "Status", "Code"))
        # print the information
        for num, status in num_dict.items():
            # note this call to link_dict.get() - we are getting values out of the link
            # dictionary, but specifying a default return value of an empty string if
            # num is not found in it, to avoid an exception
            print("{0:8} | {1:8} | {2:8}".format(num, status, link_dict.get(num, '')))

links()
In it, I'm taking advantage of dictionaries, which let you access information by keys. I'm also using implicit loops (the dictionary comprehensions) which tend to be faster and require less code.
There are two quirks of this code that you should be aware of, which your example suggests are fine:
Order is not guaranteed to be preserved (we're using dictionaries, which only preserve insertion order on Python 3.7+)
Num entries that are in links.csv but not closed.csv are not included in the printout
Last note: I made some assumptions about how your input files are formatted since you called them "CSV" files. This is what my input files looked like for this code:
closed.csv
num,status
1213,closed
4223,open
2311,open
links.csv
num,code
1002,9822
1213,1891
4223,0011
Given those input files, the result looks like this:
Num | Status | Code
1213 | closed | 1891
2311 | open |
4223 | open | 0011

Related

How to append a new value in a CSV file in Python?

I have a CSV sheet, having data like this:
| not used | Day 1 | Day 2 |
| Person 1 | Score | Score |
| Person 2 | Score | Score |
But with a lot more rows and columns. Every day I get the progress of each person, and I get that data as a dictionary where keys are names and values are score amounts.
The thing is, sometimes that dictionary will include new people and sometimes it won't include already existing ones. When a new person appears, they should get 0 for every previous day; when the dict doesn't include an already existing person, that person should get a score of 0 for that day.
My idea of solving this is doing lines = file.readlines() on that CSV file, making a new list of people's names with
for line in lines:
    names.append(line.split(",")[0])
then making a copy of lines (newLines = lines)
and going through dict's keys, seeing if that person is already in the csv, if so, append the value followed by a comma
But I'm stuck at the part of adding score of 0
Any help or contributions would be appreciated
EXAMPLE: Before, I will have this:
-,day1,day2,day3
Mark,1500,0,1660
John,1800,1640,0
Peter,1670,1680,1630
Hannah,1480,1520,1570
And I have this dictionary to add
{'Mark': 1750, 'Hannah':1640, 'Brian':1780}
The result should be
-,day1,day2,day3,day4
Mark,1500,0,1660,1750
John,1800,1640,0,0
Peter,1670,1680,1630,0
Hannah,1480,1520,1570,1640
Brian,0,0,0,1780
See how Brian is in the dict but not in the "before" csv, and he got added with a score of 0 for every other day. I figured out that line.split(',') gives a list of N elements, where N - 2 will be the amount of zero scores to add prior to the first day of that person.
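A minimal plain-Python sketch of that idea, under the assumption that the file looks exactly like the example (name in the first column, one column per day; the file name scores.csv is hypothetical):
import csv

def add_day(filename, scores):
    # read the existing table
    with open(filename, newline='') as f:
        rows = list(csv.reader(f))
    header, body = rows[0], rows[1:]
    n_days = len(header) - 1          # number of existing day columns
    header.append('day%d' % (n_days + 1))
    seen = set()
    for row in body:
        seen.add(row[0])
        # existing person: today's score, or 0 if absent from the dict
        row.append(str(scores.get(row[0], 0)))
    for name, score in scores.items():
        if name not in seen:
            # new person: 0 for every previous day, then today's score
            body.append([name] + ['0'] * n_days + [str(score)])
    with open(filename, 'w', newline='') as f:
        csv.writer(f).writerows([header] + body)

add_day('scores.csv', {'Mark': 1750, 'Hannah': 1640, 'Brian': 1780})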
This is easy to do in pandas as an outer join. Read the CSV into a dataframe and generate a new dataframe from the dictionary. The join is almost what you want, except that not-a-number values are inserted for empty cells, so you need to fill the NaNs with zero and reconvert everything to integer.
The one potential surprise is that the resulting CSV is sorted by name; the new rows are not simply appended to the bottom.
import pandas as pd
import errno
import os

INDEX_COL = "-"

def add_days_score(filename, colname, scores):
    try:
        df = pd.read_csv(filename, index_col=INDEX_COL)
    except OSError as e:
        if e.errno == errno.ENOENT:
            # file doesn't exist, create empty df
            df = pd.DataFrame([], columns=[INDEX_COL])
            df = df.set_index(INDEX_COL)
        else:
            raise
    new_df = pd.DataFrame.from_dict({colname: scores})
    merged = df.join(new_df, how="outer").fillna(0).astype(int)
    try:
        merged.to_csv(filename + ".tmp", index_label=[INDEX_COL])
    except:
        raise
    else:
        os.rename(filename + ".tmp", filename)
    return merged

#============================================================================
# TEST
#============================================================================

test_file = "this_is_a_test.csv"

before = """-,day1,day2,day3
Mark,1500,0,1660
John,1800,1640,0
Peter,1670,1680,1630
Hannah,1480,1520,1570
"""

after = """-,day1,day2,day3,day4
Brian,0,0,0,1780
Hannah,1480,1520,1570,1640
John,1800,1640,0,0
Mark,1500,0,1660,1750
Peter,1670,1680,1630,0
"""

test_dicts = [
    ["day4", {'Mark': 1750, 'Hannah': 1640, 'Brian': 1780}],
]

open(test_file, "w").write(before)
for name, scores in test_dicts:
    add_days_score(test_file, name, scores)

print("want\n", after, "\n")
got = open(test_file).read()
print("got\n", got, "\n")
if got != after:
    print("FAILED")

Finding the row number for the header row in a CSV file / Pandas Dataframe

I am trying to get an index or row number for the row that holds the headers in my CSV file.
The issue is, the header row can move up and down depending on the output of the report from our system (I have no control to change this)
code:
ht = pd.read_csv('file.csv')
test = ht.get_loc('Code')  # 'Code' being the header I'm using to locate the header row
csv1 = pd.read_csv('file.csv', header=test)
df1 = df1.append(csv1)  # appending as I have many files
If I were to print test, I would expect a number around 4 or 5, and that's what I am feeding into the second read_csv.
The error I'm getting is that it's expecting 1 header column, but I have 26 columns. I am just trying to use the first header string to get the row number.
Thanks :-)
Edit:
CSV format
This file contains the data around the volume of items blablalbla
the deadlines for delivery of items a - z is 5 days
the deadlines for delivery of items aa through zz are 3 days
the deadlines for delivery of items aaa through zzz are 1 days
code,type,arrived_date,est_del_date
a/wrwgwr12/001,kids,12-dec-18,17-dec-18
aa/gjghgj35/030,pet,15-dec-18,18-dec-18
As you will see, the "the deadlines" rows repeat; there can be 3 or 5 of them depending on the code ids, thus the header row can move up or down.
I also did not write out all 26 column headers, not sure that matters.
Wanted DF format
index | code | type | arrived_date | est_del_date
1 | a/wrwgwr12/001 | kids | 12-dec-18 | 17-dec-18
2 | aa/gjghgj35/030 | Pet | 15-dec-18 | 18-dec-18
Hope this makes sense..
Thanks,
You can use the csv module to find the first row which contains a delimiter, then feed the index of this row as the skiprows parameter to pd.read_csv:
from io import StringIO
import csv
import pandas as pd

x = """This file contains the data around the volume of items blablalbla
the deadlines for delivery of items a - z is 5 days
the deadlines for delivery of items aa through zz are 3 days
the deadlines for delivery of items aaa through zzz are 1 days
code,type,arrived_date,est_del_date
a/wrwgwr12/001,kids,12-dec-18,17-dec-18
aa/gjghgj35/030,pet,15-dec-18,18-dec-18"""

# replace StringIO(x) with open('file.csv', 'r')
with StringIO(x) as fin:
    reader = csv.reader(fin)
    idx = next(idx for idx, row in enumerate(reader) if len(row) > 1)  # 4

# replace StringIO(x) with 'file.csv'
df = pd.read_csv(StringIO(x), skiprows=idx)

print(df)
              code type arrived_date est_del_date
0   a/wrwgwr12/001 kids    12-dec-18    17-dec-18
1  aa/gjghgj35/030  pet    15-dec-18    18-dec-18
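A hedged side note: since the row found at idx is the header row itself, passing header=idx to pd.read_csv should give the same result as skiprows=idx here, because pandas then uses that row for the column names and skips everything above it:
df = pd.read_csv('file.csv', header=idx)  # assumes row idx is the header row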

Extract 2 lists out of csv file in python [duplicate]

I'm trying to parse through a csv file and extract the data from only specific columns.
Example csv:
ID | Name | Address | City | State | Zip | Phone | OPEID | IPEDS |
10 | C... | 130 W.. | Mo.. | AL... | 3.. | 334.. | 01023 | 10063 |
I'm trying to capture only specific columns, say ID, Name, Zip and Phone.
Code I've looked at has led me to believe I can call the specific column by its corresponding number, so e.g. Name would correspond to 2, and iterating through each row using row[2] would produce all the items in column 2. Only it doesn't.
Here's what I've done so far:
import sys, argparse, csv
from settings import *

# command arguments
parser = argparse.ArgumentParser(description='csv to postgres',
                                 fromfile_prefix_chars="#")
parser.add_argument('file', help='csv file to import', action='store')
args = parser.parse_args()
csv_file = args.file

# open csv file
with open(csv_file, 'rb') as csvfile:
    # get number of columns
    for line in csvfile.readlines():
        array = line.split(',')
        first_item = array[0]
    num_columns = len(array)
    csvfile.seek(0)

    reader = csv.reader(csvfile, delimiter=' ')
    included_cols = [1, 2, 6, 7]
    for row in reader:
        content = list(row[i] for i in included_cols)
        print content
and I'm expecting that this will print out only the specific columns I want for each row, except it doesn't; I get the last column only.
The only way you would be getting the last column from this code is if you don't include your print statement in your for loop.
This is most likely the end of your code:
for row in reader:
    content = list(row[i] for i in included_cols)
print content
You want it to be this:
for row in reader:
    content = list(row[i] for i in included_cols)
    print content
Now that we have covered your mistake, I would like to take this time to introduce you to the pandas module.
Pandas is spectacular for dealing with csv files, and the following code would be all you need to read a csv and save an entire column into a variable:
import pandas as pd
df = pd.read_csv(csv_file)
saved_column = df.column_name #you can also use df['column_name']
so if you wanted to save all of the info in your column Names into a variable, this is all you need to do:
names = df.Names
It's a great module and I suggest you look into it. If your print statement somehow was in the for loop and it was still only printing out the last column, that shouldn't happen; let me know if my assumption was wrong. Your posted code has a lot of indentation errors, so it was hard to know what was supposed to be where. Hope this was helpful!
import csv
from collections import defaultdict

columns = defaultdict(list)  # each value in each column is appended to a list

with open('file.txt') as f:
    reader = csv.DictReader(f)  # read rows into a dictionary format
    for row in reader:  # read a row as {column1: value1, column2: value2,...}
        for (k, v) in row.items():  # go over each column name and value
            columns[k].append(v)  # append the value into the appropriate list
                                  # based on column name k

print(columns['name'])
print(columns['phone'])
print(columns['street'])
With a file like
name,phone,street
Bob,0893,32 Silly
James,000,400 McHilly
Smithers,4442,23 Looped St.
Will output
>>>
['Bob', 'James', 'Smithers']
['0893', '000', '4442']
['32 Silly', '400 McHilly', '23 Looped St.']
Or alternatively if you want numerical indexing for the columns:
with open('file.txt') as f:
    reader = csv.reader(f)
    next(reader)
    for row in reader:
        for (i, v) in enumerate(row):
            columns[i].append(v)
print(columns[0])
>>>
['Bob', 'James', 'Smithers']
To change the delimiter, add delimiter=" " to the appropriate instantiation, i.e. reader = csv.reader(f, delimiter=" ")
Use pandas:
import pandas as pd
my_csv = pd.read_csv(filename)
column = my_csv.column_name
# you can also use my_csv['column_name']
Discard unneeded columns at parse time:
my_filtered_csv = pd.read_csv(filename, usecols=['col1', 'col3', 'col7'])
P.S. I'm just aggregating what others have said in a simple manner. The actual answers are taken from here and here.
You can use numpy.loadtxt(filename). For example, if this is your database .csv:
ID | Name | Address | City | State | Zip | Phone | OPEID | IPEDS |
10 | Adam | 130 W.. | Mo.. | AL... | 3.. | 334.. | 01023 | 10063 |
10 | Carl | 130 W.. | Mo.. | AL... | 3.. | 334.. | 01023 | 10063 |
10 | Adolf | 130 W.. | Mo.. | AL... | 3.. | 334.. | 01023 | 10063 |
10 | Den | 130 W.. | Mo.. | AL... | 3.. | 334.. | 01023 | 10063 |
And you want the Name column:
import numpy as np
b=np.loadtxt(r'filepath\name.csv',dtype=str,delimiter='|',skiprows=1,usecols=(1,))
>>> b
array([' Adam ', ' Carl ', ' Adolf ', ' Den '],
dtype='|S7')
More easily, you can use genfromtxt:
b = np.genfromtxt(r'filepath\name.csv', delimiter='|', names=True,dtype=None)
>>> b['Name']
array([' Adam ', ' Carl ', ' Adolf ', ' Den '],
dtype='|S7')
With pandas you can use read_csv with usecols parameter:
df = pd.read_csv(filename, usecols=['col1', 'col3', 'col7'])
Example:
import pandas as pd
import io
s = '''
total_bill,tip,sex,smoker,day,time,size
16.99,1.01,Female,No,Sun,Dinner,2
10.34,1.66,Male,No,Sun,Dinner,3
21.01,3.5,Male,No,Sun,Dinner,3
'''
df = pd.read_csv(io.StringIO(s), usecols=['total_bill', 'day', 'size'])
print(df)
   total_bill  day  size
0       16.99  Sun     2
1       10.34  Sun     3
2       21.01  Sun     3
Context: For this type of work you should use the amazing python petl library. It will save you a lot of work and potential frustration compared with doing things 'manually' with the standard csv module. AFAIK, the only people who still use the csv module are those who have not yet discovered better tools for working with tabular data (pandas, petl, etc.), which is fine, but if you plan to work with a lot of data from various strange sources in your career, learning something like petl is one of the best investments you can make. Getting started should take only 30 minutes after you've done pip install petl. The documentation is excellent.
Answer: Let's say you have the first table in a csv file (you can also load directly from the database using petl). Then you would simply load it and do the following.
from petl import fromcsv, look, cut, tocsv

# load the table
table1 = fromcsv('table1.csv')
# keep only the columns you want
table2 = cut(table1, 'Song_Name', 'Artist_ID')
# have a quick look to make sure things are ok
# (prints a nicely formatted table to your console)
print(look(table2))
# save to new file
tocsv(table2, 'new.csv')
I think there is an easier way:
import pandas as pd
dataset = pd.read_csv('table1.csv')
ftCol = dataset.iloc[:, 0].values
So here, in iloc[:, 0], the : means all rows and 0 is the position of the column.
In the example below, ID will be selected:
ID | Name | Address | City | State | Zip | Phone | OPEID | IPEDS |
10 | C... | 130 W.. | Mo.. | AL... | 3.. | 334.. | 01023 | 10063 |
import pandas as pd

csv_file = pd.read_csv("file.csv")
# note: _ndarray_values is a private pandas attribute; .to_numpy() is the public equivalent
column_val_list = csv_file.column_name._ndarray_values
Thanks to the way you can index and subset a pandas dataframe, a very easy way to extract a single column from a csv file into a variable is:
myVar = pd.read_csv('YourPath', sep = ",")['ColumnName']
A few things to consider:
The snippet above will produce a pandas Series and not a dataframe.
The suggestion from ayhan with usecols will also be faster if speed is an issue.
Testing the two different approaches using %timeit on a 2122 KB sized csv file yields 22.8 ms for the usecols approach and 53 ms for my suggested approach.
And don't forget import pandas as pd
If you need to process the columns separately, I like to destructure the columns with the zip(*iterable) pattern (effectively "unzip"). So for your example:
ids, names, zips, phones = zip(*(
    (row[1], row[2], row[6], row[7])
    for row in reader
))
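A self-contained usage sketch (the file name schools.csv is hypothetical; the indices 0, 1, 5 and 6 correspond to ID, Name, Zip and Phone in the example header):
import csv

with open('schools.csv') as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row
    ids, names, zips, phones = zip(*(
        (row[0], row[1], row[5], row[6])
        for row in reader
    ))

print(names)  # a tuple with one entry per data row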
import pandas as pd
dataset = pd.read_csv('Train.csv')
X = dataset.iloc[:, 1:-1].values
y = dataset.iloc[:, -1].values
X is a slice of several columns; use it if you want to read more than one column.
y is a single column; use it to read one column.
The slice [:, 1:-1] reads as [row_index : to_row_index, column_index : to_column_index].
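A tiny illustration of that slicing, with a hypothetical four-column frame:
import pandas as pd

df = pd.DataFrame({'id': [1, 2], 'f1': [10, 20], 'f2': [30, 40], 'label': [0, 1]})
X = df.iloc[:, 1:-1].values  # all rows, columns f1 and f2
y = df.iloc[:, -1].values    # all rows, the last column (label)
print(X.shape, y.shape)      # (2, 2) (2,)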
SAMPLE.CSV
a, 1, +
b, 2, -
c, 3, *
d, 4, /
import pandas as pd

column_names = ["Letter", "Number", "Symbol"]
df = pd.read_csv("sample.csv", names=column_names)
print(df)
OUTPUT
Letter Number Symbol
0 a 1 +
1 b 2 -
2 c 3 *
3 d 4 /
letters = df.Letter.to_list()
print(letters)
OUTPUT
['a', 'b', 'c', 'd']
import csv

with open('input.csv', encoding='utf-8-sig') as csv_file:
    # the below statement will skip the first row
    next(csv_file)
    reader = csv.DictReader(csv_file)
    Time_col = {'Time': []}
    # print(Time_col)
    for record in reader:
        Time_col['Time'].append(record['Time'])
    print(Time_col)
From CSV File Reading and Writing you can import csv and use this code:
with open('names.csv', newline='') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        print(row['first_name'], row['last_name'])
To fetch the column names, instead of using readlines(), better use readline() to avoid the loop and reading the complete file into an array.
with open(csv_file, 'rb') as csvfile:
    # get number of columns from the first line only
    line = csvfile.readline()
    first_item = line.split(',')
    num_columns = len(first_item)

How can I write part of dictionary value string as headers to csv file in python? Is there a better way?

I have a dictionary that uses log file names as keys. When reading through a text file, if it contains a value I was searching for, it saves the entire line as a new value in the list of values for that key (log file name).
I.e. Key: logFile0, Value: [Val1 5, Val2 3, Val3 72].
I want to output to a csv file the values as headers, along with which log file each was found in and its value.
Key  | Value
log1 | [Value_name Val_num], [Value1_name Val_num]
log2 | [Value1_name Val_num], [Value3_name Val_num]
log3 | [Value2_name Val_num], [Value3_name Val_num]
I want it displayed in the csv file so that, for each time a value was found in a log file, its value is shown as:
     | Value_name | Value1_name | Value2_name | Value3_name
log1 | val_num    | val_num     |             |
log2 |            | val_num     |             | val_num
log3 |            |             | val_num     | val_num
Does anybody know how to do this? Or is there a better way to store all of this information and then display it?
Here is my approach:
import csv
d = {"log1":[[0,"Val_num"],[1,"Val_num"]],
"log2":[[1,"Val_num"],[3,"Val_num"]],
"log3":[[2,"Val_num"],[3,"Val_num"]]}
I made a dictionary. Use yours, but the "Val_num" entries are meant to be the actual numbers. The first number in each pair identifies the value name, e.g. 1 -> Value1_name.
headers = ["Log File","Value_num","Value1_num","Value2_num","Value3_num"]
log1Row = ["log1","","","",""]
log2Row = ["log2","","","",""]
log3Row = ["log3","","","",""]
Here I simply initialised the rows to be written into the csv file.
for i in d["log1"]:
    log1Row[i[0] + 1] = i[1]
for i in d["log2"]:
    log2Row[i[0] + 1] = i[1]
for i in d["log3"]:
    log3Row[i[0] + 1] = i[1]
Now I iterate over each key in the dictionary and fill the values into the row arrays. The + 1 is needed because the positions for these numbers begin at index 1, not 0, since index 0 holds the log name.
with open("file.csv","w") as file:
writer = csv.writer(file)
writer.writerow(headers)
writer.writerow(log1Row)
writer.writerow(log2Row)
writer.writerow(log3Row)
Just open the file to write to and write these rows.
Hopefully this works, as long as your dictionary follows the same shape as mine.
Let me know of any errors.
Thank you.
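As a hedged alternative sketch (not part of the answer above): csv.DictWriter can generalize this so rows don't have to be hardcoded per log file. The column names and the dictionary shape are assumed to match the example:
import csv

d = {"log1": [[0, "Val_num"], [1, "Val_num"]],
     "log2": [[1, "Val_num"], [3, "Val_num"]],
     "log3": [[2, "Val_num"], [3, "Val_num"]]}

value_cols = ["Value_num", "Value1_num", "Value2_num", "Value3_num"]

with open("file.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["Log File"] + value_cols, restval="")
    writer.writeheader()
    for log_name, pairs in sorted(d.items()):
        # build one row per log file; columns without a value stay empty via restval
        row = {"Log File": log_name}
        for col_index, value in pairs:
            row[value_cols[col_index]] = value
        writer.writerow(row)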
