Reading and storing CSV data line by line in PostgreSQL - Python

I want to copy CSV data from different files and store it in a table. The problem is that the number of columns differs between the CSV files: some have 3 columns while others have 4. If a file has 4 columns, I want to simply ignore the fourth column and save only the first three.
Using the following code, I can copy data into the table if there are only 3 columns:
CREATE TABLE ImportCSVTable (
    name varchar(100),
    address varchar(100),
    phone varchar(100));

COPY ImportCSVTable (name, address, phone)
FROM 'path'
WITH DELIMITER ';' CSV QUOTE '"';
But I am looking to check each row individually and then store it in the table.
Thank you.

Since you want to read and store it one line at a time, the Python csv module should make it easy to read the first 3 columns from your CSV file regardless of any extra columns.
You can construct an INSERT statement and execute it with your preferred Python-PostgreSQL module. I have used pyPgSQL in the past; I don't know what's current now.
#!/usr/bin/env python
import csv

filesource = 'PeopleAndResources.csv'
with open(filesource, 'rb') as f:
    reader = csv.reader(f, delimiter=';', quotechar='"')
    for row in reader:
        statement = "INSERT INTO ImportCSVTable " + \
                    "(name, address, phone) " + \
                    "VALUES ('%s', '%s', '%s')" % (tuple(row[0:3]))
        # execute the statement with your database module
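If you use psycopg2, the same idea can be written with a parameterized query so the driver handles quoting for you. This is only a minimal sketch, not the original code, and the connection settings are placeholders:

import csv
import psycopg2

filesource = 'PeopleAndResources.csv'
conn = psycopg2.connect("dbname=mydb user=me password=pw host=localhost")  # placeholder settings
cur = conn.cursor()
with open(filesource, 'rb') as f:
    reader = csv.reader(f, delimiter=';', quotechar='"')
    for row in reader:
        # the driver fills in the %s placeholders, so quotes in the data are handled safely
        cur.execute("INSERT INTO ImportCSVTable (name, address, phone) "
                    "VALUES (%s, %s, %s)", tuple(row[0:3]))
conn.commit()
cur.close()
conn.close()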

Use a text utility to chop off the fourth column. That way, all your input files will have three columns. Some combination of awk, cut, and sed should take care of it for you, but it depends on what your columns look like.

You can also just make your input table have a fourth column that is nullable, then after the import drop the extra column.
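A rough sketch of that approach for a four-column file, assuming a psycopg2 connection conn; the extra column is only a temporary landing spot, and 'path' is a placeholder as in the question:

cur = conn.cursor()
# columns added this way are nullable by default
cur.execute("ALTER TABLE ImportCSVTable ADD COLUMN extra varchar(100)")
cur.execute("""COPY ImportCSVTable (name, address, phone, extra)
               FROM 'path'
               WITH DELIMITER ';' CSV QUOTE '"'
            """)
# throw the landing column away once the import is done
cur.execute("ALTER TABLE ImportCSVTable DROP COLUMN extra")
conn.commit()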

Other way than splitting with a comma to store in a database?

I started creating a database with PostgreSQL and I am currently facing a problem when I want to copy the data from my CSV file to my database.
Here is my code:
import psycopg2

connexion = psycopg2.connect(dbname="db_test", user="postgres", password="passepasse")
connexion.autocommit = True
cursor = connexion.cursor()
cursor.execute("""CREATE TABLE vocabulary(
    fname integer PRIMARY KEY,
    label text,
    mids text
)""")
with open(r'C:\mypathtocsvfile.csv', 'r') as f:
    next(f)  # skip the header row
    cursor.copy_from(f, 'vocabulary', sep=',')
connexion.commit()
I asked to allocate 4 columns to store my CSV data; the problem is that the data in my CSV is stored like this:
fname,labels,mids,split
64760,"Electric_guitar,Guitar,Plucked_string_instrument,Musical_instrument,Music","/m/02sgy,/m/0342h,/m/0fx80y,/m/04szw,/m/04rlf",train
16399,"Electric_guitar,Guitar,Plucked_string_instrument,Musical_instrument,Music","/m/02sgy,/m/0342h,/m/0fx80y,/m/04szw,/m/04rlf",train
...
There are commas inside my label and mids columns, and that's why I get the following error:
BadCopyFileFormat: ERROR: additional data after the last expected column
Which alternative should I use to copy data from this CSV file?
Thank you.
If the file is small, the easiest way is to open it in LibreOffice and save it with a new separator.
I usually use ^.
If the file is large, write a script to replace ," and "," with ^" and "^", respectively.
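If you would rather not do raw string replacement, the csv module can re-write the file with a new delimiter while leaving the commas inside quoted fields alone. A minimal sketch with placeholder file names:

import csv

with open('input.csv', 'r', newline='') as src, \
     open('output.csv', 'w', newline='') as dst:
    reader = csv.reader(src)                                    # parses the quoted fields correctly
    writer = csv.writer(dst, delimiter='^', quoting=csv.QUOTE_MINIMAL)
    for row in reader:
        writer.writerow(row)

The rewritten file can then be passed to copy_from with sep='^'.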
COPY supports csv as a format, which already does what you want. But to access it via psycopg2, I think you will need to use copy_expert rather than copy_from.
cursor.copy_expert('copy vocabulary from stdin with csv', f)
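For example, something along these lines (a minimal sketch; it assumes the target table's columns line up with the CSV's columns):

with open(r'C:\mypathtocsvfile.csv', 'r') as f:
    # FORMAT csv makes COPY honour the quoting, so the commas inside
    # the quoted label and mids values are not treated as separators
    cursor.copy_expert("COPY vocabulary FROM STDIN WITH (FORMAT csv, HEADER true)", f)
connexion.commit()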

Remove a specific number of characters from a CSV

I have some CSV files being exported from an SQL database and transferred to me daily for me to import into my SQL server. The files all have a "title" line in them with 27 characters, the business name and date. I.e. "busname: 08-31-2020". I need a script that can remove those first 27 characters so they aren't imported into the database.
Is this possible? I can't find anything that will let me select a specific number of characters at the beginning of the file.
If your value is in column 1, you can use str[27:] to get everything in the string after the first 27 characters.
import csv

with open('file.csv', 'r') as file:
    reader = csv.reader(file)
    for row in reader:
        process_str = row[1][27:]
You can then write this processed string out to a new file.
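A minimal sketch along those lines (the file names are placeholders, and it assumes the 27-character prefix appears in that same column on every row):

import csv

with open('file.csv', 'r', newline='') as src, \
     open('clean.csv', 'w', newline='') as dst:
    reader = csv.reader(src)
    writer = csv.writer(dst)
    for row in reader:
        row[1] = row[1][27:]        # drop the 27-character prefix
        writer.writerow(row)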

Python PostgreSQL: Ignoring the last column from a CSV file

I have a problem importing a CSV file. I am using PostgreSQL's COPY FROM command to copy a CSV file into a 2-column table.
I have a CSV file in the following format:
"1";"A"
"2";"B"
"3";"C";"CAD450"
"4";"D";"ABX123"
I want to import all these lines of the CSV file into the table but I want to skip any extra added columns.
Currently I am skipping any lines that contain extra columns; for example, here the rows "3";"C";"CAD450" and "4";"D";"ABX123" are skipped and I am importing only the first two columns. But I want to copy all four lines into my table. So is there any way I can ignore the last column and copy all four lines into my table, like this:
"1";"A"
"2";"B"
"3";"C"
"4";"D"
Preprocess the file with awk to strip the extra columns:
awk -F';' '{print $1 ";" $2 }' > new_file.csv
Piping it through cut or awk (as suggested above) is easier than using python/psycopg.
cat csv_file.csv | cut -d';' -f1,2 | psql -U USER DATABASE -c "COPY table FROM STDIN WITH DELIMITER ';';"
with open("file.csv","r") as f:
t=[line.strip().split(";")[:2] for line in f]
There are myriad ways to handle the problem.
I'd probably do something like this:
import csv
import psycopg2

dr = csv.DictReader(open('test.csv', 'rb'),
                    delimiter=';',
                    quotechar='"',
                    fieldnames=['col1', 'col2'])  # need not specify other cols

CONNSTR = """
host=127.0.0.1
dbname=mydb
user=me
password=pw
port=5432"""

cxn = psycopg2.connect(CONNSTR)
cur = cxn.cursor()

cur.execute("""CREATE TABLE from_csv (
    id serial NOT NULL,
    col1 character varying,
    col2 character varying,
    CONSTRAINT from_csv_pkey PRIMARY KEY (id));""")

cur.executemany("""INSERT INTO from_csv (col1,col2)
                   VALUES (%(col1)s,%(col2)s);""", dr)
cxn.commit()

How to separate comma-separated data from a CSV file?

I have opened a CSV file and I want to separate out each comma-separated string on the same line.
For example, the file:
name,sal,dept
tom,10000,it
Output: each string in its own string variable.
I have a file which is already open, so I cannot use the open API; I have to use csv.reader, which has to read one line at a time.
If the file open for reading is bound to a variable name, say fin, and assuming you're using Python 2.6 and you know the file's not empty (it has at least the header row):
import csv

rd = csv.reader(fin)
headers = next(rd)
for data in rd:
    # ...process data and headers...
In Python 2.5, use headers = rd.next() instead of headers = next(rd).
These versions use a list of fields, data, which is a completely general solution (i.e., you don't need to know in advance how many columns the file has: you access them as data[0], data[1], etc., and the current row has len(data) fields on each iteration of the loop).
If you know the file has exactly three columns and prefer to use separate names for a variable per column, change the loop header to:
for name, sales, department in rd:
The field values in data returned by the reader (just like the headers) are all strings. If you know, for example, that the second column is an int and want to treat it as such, start the loop with:
for data in rd:
    data[1] = int(data[1])
or, if you're using the named-variables variant:
for name, sales, department in rd:
    sales = int(sales)
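Putting the pieces together, a minimal sketch (it assumes the already-open file is bound to fin and the three columns from your example):

import csv

rd = csv.reader(fin)                  # fin is the file you already have open
headers = next(rd)                    # e.g. ['name', 'sal', 'dept']
for name, sales, department in rd:
    sales = int(sales)                # treat the second column as an int
    # ...process name, sales, department here...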
I don't know if I have properly understood your question. You may want to have a look at the split function described in the Python documentation anyway.

How to quickly search through a .csv file in Python

I'm reading a 6 million entry .csv file with Python, and I want to be able to search through this file for a particular entry.
Are there any tricks for searching the entire file? Should you read the whole thing into a dictionary, or should you perform a search every time? I tried loading it into a dictionary, but that took ages, so I'm currently searching through the whole file every time, which seems wasteful.
Could I possibly take advantage of the fact that the list is alphabetically ordered? (E.g., if the search word starts with "b", I only search from the line that includes the first word beginning with "b" to the line that includes the last word beginning with "b".)
I'm using import csv.
(A side question: is it possible to make csv go to a specific line in the file? I want to make the program start at a random line.)
Edit: I already have a copy of the list as an .sql file as well, how could I implement that into Python?
If the CSV file isn't changing, load it into a database, where searching is fast and easy. If you're not familiar with SQL, you'll need to brush up on that, though.
Here is a rough example of inserting from a CSV into an SQLite table. The example CSV is ';'-delimited and has 2 columns.
import csv
import sqlite3
con = sqlite3.Connection('newdb.sqlite')
cur = con.cursor()
cur.execute('CREATE TABLE "stuff" ("one" varchar(12), "two" varchar(12));')
f = open('stuff.csv')
csv_reader = csv.reader(f, delimiter=';')
cur.executemany('INSERT INTO stuff VALUES (?, ?)', csv_reader)
cur.close()
con.commit()
con.close()
f.close()
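Once it is loaded, searching is a single query. A rough sketch, reusing the table above (the index name and search value are made up for the example):

import sqlite3

con = sqlite3.Connection('newdb.sqlite')
cur = con.cursor()
# an index on the searched column keeps lookups fast even with 6 million rows
cur.execute('CREATE INDEX IF NOT EXISTS stuff_one_idx ON stuff (one);')
cur.execute('SELECT * FROM stuff WHERE one = ?', ('some value',))
matches = cur.fetchall()              # every row whose first column equals 'some value'
cur.close()
con.close()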
You can use memory mapping for really big files:
import mmap, os, re

reportFile = open("big_file")
length = os.fstat(reportFile.fileno()).st_size
try:
    # POSIX: read-only, private mapping
    mapping = mmap.mmap(reportFile.fileno(), length, mmap.MAP_PRIVATE, mmap.PROT_READ)
except AttributeError:
    # Windows: MAP_PRIVATE/PROT_READ don't exist, so use the access argument instead
    mapping = mmap.mmap(reportFile.fileno(), 0, None, mmap.ACCESS_READ)
data = mapping.read(length)
pat = re.compile("b.+", re.M | re.DOTALL)  # compile your pattern here
print pat.findall(data)
Well, if your list of words isn't too big (meaning it will fit in memory), then here is a simple way to do this (I'm assuming that they are all words).
from bisect import bisect_left

f = open('myfile.csv')
words = []
for line in f:
    words.extend(line.strip().split(','))

wordtofind = 'bacon'
ind = bisect_left(words, wordtofind)
if ind < len(words) and words[ind] == wordtofind:
    print '%s was found!' % wordtofind
It might take a minute to load in all of the values from the file. This uses binary search to find your words. In this case I was looking for bacon (who wouldn't look for bacon?). If there are repeated values, you also might want to use bisect_right to find the index one beyond the rightmost element that equals the value you are searching for. You can still use this if you have key:value pairs. You'll just have to make each object in your words list be a list of [key, value].
Side Note
I don't think that you can really go from line to line in a csv file very easily. You see, these files are basically just long strings with \n characters that indicate new lines.
You can't go directly to a specific line in the file because lines are variable-length, so the only way to know when line #n starts is to search for the first n newlines. And it's not enough to just look for '\n' characters because CSV allows newlines in table cells, so you really do have to parse the file anyway.
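If you really do need to jump to a given line, one workaround is a single pass that records where each line starts, then seek there later. A rough sketch (it ignores the quoted-newline caveat above, and the file name and line number are placeholders):

# build a byte-offset index: offsets[n] is where line n starts (0-based)
offsets = [0]
with open('myfile.csv', 'rb') as f:
    for line in f:
        offsets.append(offsets[-1] + len(line))

n = 12345                             # hypothetical line number to jump to
with open('myfile.csv', 'rb') as f:
    f.seek(offsets[n])
    line_n = f.readline()             # raw text of line n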
My idea is to use the Python ZODB module to store dictionary-type data and then create a new CSV file using that data structure, doing all your operations at that time.
There is a fairly simple way to do this. Depending on how many columns you want Python to print, you may need to add or remove some of the print lines.
import csv

search = input('Enter string to search: ')
with open('FileName.csv', 'r') as stock:
    reader = csv.reader(stock)
    for row in reader:
        for field in row:
            if field == search:
                print('Record found! \n')
                print(row[0])
                print(row[1])
                print(row[2])
I hope this managed to help.
