I have a large file from a proprietary archive format. Unzipping this archive gives a file that has no extension, but the data inside is comma-delimited. Adding a .csv extension or simply opening the file with Excel will work.
I have about 375-400 of these files, and I'm trying to extract a chunk of rows (about 13,500 out of 1.2M+ rows) between a keyword "Point A" and another keyword "Point B".
I found some code on this site that I think is extracting the data correctly, but I'm getting an error:
AttributeError: 'list' object has no attribute 'rows'
when trying to save out the file. Can somebody help me get this data to save into a csv?
import re
import csv
import time

print(time.ctime())
file = open('C:/Users/User/Desktop/File with No Extension That\'s Very Similar to CSV', 'r')
data = file.read()
x = re.findall(r'Point A(.*?)Point B', data, re.DOTALL)
name = "C:/Users/User/Desktop/testoutput.csv"
with open(name, 'w', newline='') as file2:
    savefile = csv.writer(file2)
    for i in x.rows:
        savefile.writerow([cell.value for cell in i])
print(time.ctime())
Thanks in advance, any help would be much appreciated.
The following should work nicely. As mentioned, your regex usage was almost correct. You can still use Python's csv library for the CSV processing by converting the found text into a StringIO object and passing that to the CSV reader:
import re
import csv
import time
import StringIO
print(time.ctime())
input_name = "C:/Users/User/Desktop/File with No Extension That's Very Similar to CSV"
output_name = "C:/Users/User/Desktop/testoutput.csv"
with open(input_name, 'r') as f_input, open(output_name, 'wb') as f_output:
    # Read whole file in
    all_input = f_input.read()
    # Extract interesting lines
    ab_input = re.findall(r'Point A(.*?)Point B', all_input, re.DOTALL)[0]
    # Convert into a file object and parse using the CSV reader
    fab_input = StringIO.StringIO(ab_input)
    csv_input = csv.reader(fab_input)
    csv_output = csv.writer(f_output)
    # Iterate a row at a time from the input
    for input_row in csv_input:
        # Skip any empty rows
        if input_row:
            # Write row at a time to the output
            csv_output.writerow(input_row)
print(time.ctime())
You have not given us an example from your CSV file, so if there are problems, you might need to configure the CSV 'dialect' to process it better.
Tested using Python 2.7
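As an aside, configuring that dialect is mostly a matter of passing extra arguments to the reader. A quick sketch under current Python 3 (the semicolon-delimited sample data is purely illustrative, not from the question):

```python
import csv
import io

# Hypothetical sample: semicolon-delimited, with a quoted field containing the delimiter
sample = io.StringIO('a;"b;c";d\n1;2;3\n')
reader = csv.reader(sample, delimiter=';', quotechar='"')
rows = list(reader)
print(rows)  # [['a', 'b;c', 'd'], ['1', '2', '3']]
```

The same keyword arguments work on csv.writer, so input and output dialects can differ.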
You have two problems here: the first is related to the regular expression, and the other to the list syntax.
Getting what you want
The way you are using the regular expression returns a list with a single value (all the lines joined into one unique string).
There is probably a better way of doing this, but for now I would go with something like this:
import re

with open('bla', 'r') as input:
    data = input.read()

x = re.findall(r'Point A(.*?)Point B', data, re.DOTALL)[0]
x = x.splitlines(False)[1:]
That's not pretty but will return a list with all values between those two points.
Working with lists
Lists have no rows attribute. You just have to iterate over the list itself:
for i in x:
    # do what you have to do
See, I'm not familiar with the csv library, but it looks like you will have to perform some manipulations on the i value before handing it to the writer.
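For instance, since each element of x here is still one raw comma-delimited line, one manipulation that might be enough (assuming the data contains no quoted fields with embedded commas; the sample rows below are made up) is splitting each line before writing it:

```python
import csv

# x as produced above: a list of raw comma-delimited lines (illustrative data)
x = ['1,foo,2.5', '2,bar,3.7']

with open('testoutput.csv', 'w', newline='') as f:
    savefile = csv.writer(f)
    for i in x:
        savefile.writerow(i.split(','))  # naive split; breaks on quoted commas
```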
IMHO, I would avoid the CSV format, since it is somewhat "locale dependent", so it may not work as expected depending on the settings your end users have in their OS.
Updating the code so that @Martin Evans' answer works on current Python 3 versions:
import re
import csv
import time
import io
print(time.ctime())
input_name = "C:/Users/User/Desktop/File with No Extension That's Very Similar to CSV"
output_name = "C:/Users/User/Desktop/testoutput.csv"
with open(input_name, 'r') as f_input, open(output_name, 'wt') as f_output:
    # Read whole file in
    all_input = f_input.read()
    # Extract interesting lines
    ab_input = re.findall(r'Point A(.*?)Point B', all_input, re.DOTALL)[0]
    # Convert into a file object and parse using the CSV reader
    fab_input = io.StringIO(ab_input)
    csv_input = csv.reader(fab_input)
    csv_output = csv.writer(f_output)
    # Iterate a row at a time from the input
    for input_row in csv_input:
        # Skip any empty rows
        if input_row:
            # Write row at a time to the output
            csv_output.writerow(input_row)
print(time.ctime())
Also, by using 'wt' instead of 'wb' one can avoid
"TypeError: a bytes-like object is required, not 'str'"
Related
I am trying to pickle a csv file and then turn its pickled representation back into a csv file.
This is the code I came up with:
from pathlib import Path
import pickle, csv
csvFilePath = Path('/path/to/file.csv')
pathToSaveTo = Path('/path/to/newFile.csv')
csvFile = open(csvFilePath, 'r')
f = csvFile.read()
csvFile.close()
f_pickled = pickle.dumps(f)
f_unpickled = pickle.loads(f_pickled)
#save unpickled csv file
new_csvFile = open(pathToSaveTo, 'w')
csvWriter = csv.writer(new_csvFile)
csvWriter.writerow(f_unpickled)
new_csvFile.close()
newFile.csv is created however there are two problems with its content:
There is now a comma between every character.
There is now a pair of quotation marks after every line.
What would I have to change about my code to get an exact copy of file.csv?
The problem is that you are reading the raw text of the file with f = csvFile.read(). Then, on writing, you feed the data, which is a single lump of text in one string, through a CSV writer object. The CSV writer sees the string as an iterable and writes each of its elements (each character) into its own CSV cell. Then there is no data for a second row, and the process ends.
The pickle dumps and loads you perform are just a no-operation: nothing happens there. If there were any issue, it would be due to some unpicklable object reference in the object you pass to dumps: you'd get an exception, not differing data when loads is called.
Now, without knowing why you want to do this, and what intermediate steps you have planned for the data, it is hard to advise. You are performing non-operations: reading a file, pickling and unpickling its contents, and writing those contents back to disk.
At which point do you need these data structured as rows, or as CSV cells? Just apply the proper transforms where you need it, and you are done.
If you want the whole "do nothing" cycle to actually go through having the CSV data separated into distinct elements in Python, you can do:
from pathlib import Path
import pickle, csv

csvFilePath = Path('file.csv')
pathToSaveTo = Path('newFile.csv')

data = list(csv.reader(open(csvFilePath)))
# ^ consumes all iterations of the reader: each iteration is a row,
#   a list in which each cell value is one element
pickled_data = pickle.dumps(data)
restored_data = pickle.loads(pickled_data)
csv.writer(open(pathToSaveTo, "wt")).writerows(restored_data)
Note that in this snippet the data is read through csv.reader, not directly. Wrapping it in a list call causes all rows to be read and turned into list items, because the reader is otherwise a lazy iterator (and it would not be picklable anyway, since one of the attributes its state depends on is an open file).
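That last point is easy to check: trying to pickle a live csv.reader raises a TypeError, while the materialized list round-trips fine (a small sketch with made-up data):

```python
import csv
import io
import pickle

# A live reader holds a reference to its input stream and cannot be pickled
reader = csv.reader(io.StringIO("a,b\n1,2\n"))
try:
    pickle.dumps(reader)
except TypeError as exc:
    print("not picklable:", exc)

# Materializing the rows first gives a plain list, which pickles fine
rows = list(csv.reader(io.StringIO("a,b\n1,2\n")))
assert pickle.loads(pickle.dumps(rows)) == [["a", "b"], ["1", "2"]]
```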
I believe the problem is in how you're attempting to write the CSV file; the pickling and unpickling are fine. If you compare f with f_unpickled:
if f == f_unpickled:
    print("Same")
this prints "Same" in my case. If you print the types, you'll see both are strings.
The better option is to follow the csv module's intended usage and write each row one at a time, rather than writing the entire string, newlines included. Something like this:
from pathlib import Path
import pickle, csv

csvFilePath = Path('file.csv')
pathToSaveTo = Path('newFile.csv')

# read the csv file into a list of rows
rows = []
with open(csvFilePath, 'r', newline='') as file:
    reader = csv.reader(file)
    for row in reader:
        rows.append(row)

# pickle and unpickle
rows_pickled = pickle.dumps(rows)
rows_unpickled = pickle.loads(rows_pickled)
if rows == rows_unpickled:
    print("Same")

# save the unpickled rows to a new csv file
with open(pathToSaveTo, 'w', newline='') as csvfile:
    csvWriter = csv.writer(csvfile)
    for row in rows_unpickled:
        csvWriter.writerow(row)
This worked when I tested it, although it would take more finagling with line separators to avoid an empty line at the end.
I am incredibly new to python, so I might not have the right terminology...
I've extracted text from a pdf using pdfplumber, and that's been saved as an object. The code I used for that is:
with pdfplumber.open('Bell_2014.pdf') as pdf:
    page = pdf.pages[0]
    bell = page.extract_text()
print(bell)
So "bell" is all of the text from the first page of the imported PDF.
I need to write all of that text as a string to a csv. I tried using:
with open('Bell_2014_ex.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerows(bell)
and
bell_ex = 'bell_2014_ex.csv'
with open(bell_ex, 'w', newline='') as csvfile:
    file_writer = csv.writer(csvfile, delimiter=',')
    file_writer.writerow(bell)
All I keep finding when I search is how to create a csv with specific characters or numbers, but nothing about writing the output of already-executed code. For instance, I can get this code:
bell_ex = 'bell_2014_ex.csv'
with open(bell_ex, 'w', newline='') as csvfile:
    file_writer = csv.writer(csvfile, delimiter=',')
    file_writer.writerow(['bell'])
to create a csv that has "bell" in one cell of the csv, but that's as close as I can get.
I feel like this should be super easy, but I just can't seem to get it to work.
Any thoughts?
Please and thank you for helping my inexperienced self.
page.extract_text() is defined as: "Collates all of the page's character objects into a single string." which would make bell just a very long string.
The CSV writerow() expects by default a list of strings, with each item in the list corresponding to a single column.
Your main issue is a type mismatch: you're trying to write a single string where a list of strings is expected. You will need to further operate on your bell object to convert it into a format acceptable to the CSV writer.
Without having any knowledge of what bell contains or what you intend to write, I can't get any more specific, but documentation on Python's CSV module is very comprehensive in terms of settings delimiters, dialects, column definitions, etc. Once you have converted bell into a proper iterable of lists of strings, you can then write it to a CSV.
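One common approach, sketched below with a stand-in string rather than a real extract_text() result, is to split the text into lines and wrap each line in a list, so each line of the page becomes a one-column row:

```python
import csv

# Stand-in for the long string pdfplumber's extract_text() returns
bell = "First line of the page\nSecond line\nThird line"

with open('Bell_2014_ex.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    for line in bell.splitlines():
        writer.writerow([line])  # one cell per line of text
```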
Some similar code I wrote recently converts a tab-separated file to CSV for insertion into an sqlite3 database. Maybe this is helpful:
import csv
import os

out_file = os.path.join('input', 'listfile.csv')

# Convert tab-delimited listfile.txt to a comma separated values (.csv) file
in_text = open('listfile.txt', 'r')
in_reader = csv.reader(in_text, delimiter='\t')
out_csv = open(out_file, 'w', newline='\n')
out_writer = csv.writer(out_csv, dialect=csv.excel)
for _line in in_reader:
    out_writer.writerow(_line)
in_text.close()
out_csv.close()
... and that's it, not too tough
So my problem was that I was missing encoding='utf-8' for special characters, and my delimiter needed to be a space instead of a comma. What ended up working was:
import csv
from pdfminer.high_level import extract_text

text = extract_text('filepath.pdf')
print(text)

new_csv = 'filename.csv'
with open(new_csv, 'w', newline='', encoding='utf-8') as csvfile:
    file_writer = csv.writer(csvfile, delimiter=' ')
    file_writer.writerow(text)
However, since a lot of my pdfs weren't true pdfs but scans, the csv ended up having a lot of weird symbols. This worked for about half of the pdfs I have. If you have true pdfs, this will be great. If not, I'm currently trying to figure out how to extract all the text into a pandas dataframe separated by headers within the pdfs since pdfminer extracted all text perfectly.
Thank you for everyone that helped!
I am extremely new at python and need some help with this one. I've tried various codes and none seem to work, so suggestions would be awesome.
I have a folder with about 1500 csv files that each contain multiple columns of data. I need to take the average of the first column called "agr" and save this value in a different excel or csv file. It would be great if I could also somehow save the name of the file with its averaged value so that I can keep track of which file it came from. The name of the files are crop_city (e.g. corn_omaha).
import glob
import csv
import numpy as np
import pandas as pd

path = 'C:/test/*.csv'
for fname in glob.glob(path):
    with open(fname) as csvfile:
        agr = []
        reader = csv.DictReader(fname)
        print row['agr']
I know the code above is extremely rudimentary, so any help would be great thanks everyone!
Assuming the first column in these CSV files is a decimal or float, you don't really need to parse the entire line. Just split at the first separator and parse the first token. There is no real advantage to numpy or pandas either. Just use the builtin sum function.
import glob
import os

path = 'test/*.csv'  # using local dir for test

with open('output.csv', 'w', newline='') as outfile:
    outfile.write("Filename,Sum\r\n")  # header for output
    for fname in glob.glob(path):
        with open(fname) as csvfile:
            next(csvfile)  # skip header
            outfile.write("{},{}\r\n".format(
                os.path.basename(fname),
                sum(float(line.split(',', 1)[0].strip())
                    for line in csvfile)))
Contrary to the answer by @tdelaney, I would not advise you to limit your code by relying on the fact that you are adding up the first column; what if you need to work with the third column next week? It's easy to do this properly by building on the code you provided. Parsing a couple of thousand text files is not going to slow you down.
The csv.DictReader constructor will automatically treat the first row of its input as a header (unless you explicitly specify a list of column names with the fieldnames parameter). So your code can look like this:
import csv
import glob

path = 'C:/test/*.csv'  # as in your code

averages = []
for fname in glob.glob(path):
    with open(fname, "rb") as csvfile:
        reader = csv.DictReader(csvfile)
        values = [float(row["agr"]) for row in reader]
        avg = sum(values) / len(values)
        averages.append((fname, avg))
The list averages now contains the numbers you want. This is how you write it out to another CSV file:
with open("averages.csv", "wb") as outfile:
    writer = csv.writer(outfile)
    writer.writerow(["File", "Average agr"])
    for row in averages:
        writer.writerow(row)
PS. Since you included pandas in your imports, here's one way to do the same thing with pandas. However, I recommend sticking with csv for now. The pandas object model is complex, and hard to wrap your head around.
averages = []
for fname in glob.glob(path):
    data = pd.DataFrame.from_csv(fname)
    averages.append((fname, data["agr"].mean()))

df_out = pd.DataFrame.from_records(averages, columns=["File", "Average agr"])
df_out.to_csv("averages.csv", index=False)
As you can see the code is a lot shorter, since file i/o and calculations can be done with one statement.
Ok, so I'm learning Python. But for my studies I have to do rather complicated stuff already. I'm trying to run a script to analyse data in excel files. This is how it looks:
#!/usr/bin/python
import sys

# lots of functions, not relevant

resultsdir = "/home/blah"
filename1 = sys.argv[1]
filename2 = sys.argv[2]
out = open(sys.argv[3], "w")
# filename1, filename2 = "CNVB_reads.403476", "CNVB_reads.403447"
file1 = open(resultsdir + "/" + filename1 + ".csv")
file2 = open(resultsdir + "/" + filename2 + ".csv")

for line in file1:
    start.p,end.p,type,nexons,start,end,cnvlength,chromosome,id,BF,rest = line.split("\t", 10)
    CNVs1[chr].append([int(start), int(end), float(BF)])
for line in file2:
    start.p,end.p,type,nexons,start,end,cnvlength,chromosome,id,BF,rest = line.split("\t", 10)
    CNVs2[chr].append([int(start), int(end), float(BF)])
These are the titles of the columns of the data in the excel files and I want to split them, I'm not even sure if that is necessary when using data from excel files.
#more irrelevant stuff
out.write(filename1+","+filename2+","+str(chromosome)+","+str(type)+","+str(shared)+"\n")
This is what it should write in my output, 'shared' is what I have calculated, the rest is already in the files.
Ok, now my question, finally. When I call the script like this in my shell:
python script.py CNVB_reads.403476 CNVB_reads.403447 script.csv
I get the following error message:
start.p,end.p,type,nexons,start,end,cnvlength,chromosome,id,BF,rest=line.split("\t",10)
ValueError: need more than 1 value to unpack
I have no idea what is meant by that in relation to the data... Any ideas?
The line.split('\t', 10) call did not return eleven elements. Perhaps it is empty?
You probably want to use the csv module instead to parse these files.
import csv
import os

for filename, target in ((filename1, CNVs1), (filename2, CNVs2)):
    with open(os.path.join(resultsdir, filename + ".csv"), 'rb') as csvfile:
        reader = csv.reader(csvfile, delimiter='\t')
        for row in reader:
            # columns: start.p, end.p, type, nexons, start, end,
            # cnvlength, chromosome, id, BF, rest
            start, end, BF = row[4], row[5], row[9]
            target[chr].append([int(start), int(end), float(BF)])
I'm using Python's csv module to do some reading and writing of csv files.
I've got the reading fine and appending to the csv fine, but I want to be able to overwrite a specific row in the csv.
For reference, here's my reading and then writing code to append:
#reading
b = open("bottles.csv", "rb")
bottles = csv.reader(b)
bottle_list = []
bottle_list.extend(bottles)
b.close()
#appending
b=open('bottles.csv','a')
writer = csv.writer(b)
writer.writerow([bottle,emptyButtonCount,100, img])
b.close()
And I'm using basically the same for the overwrite mode(which isn't correct, it just overwrites the whole csv file):
b=open('bottles.csv','wb')
writer = csv.writer(b)
writer.writerow([bottle,btlnum,100,img])
b.close()
In the second case, how do I tell Python I need a specific row overwritten? I've scoured Google and other Stack Overflow posts to no avail. I assume my limited programming knowledge is to blame rather than Google.
I will add to Steven's answer:
import csv

bottle_list = []
# Read all data from the csv file.
with open('a.csv', 'rb') as b:
    bottles = csv.reader(b)
    bottle_list.extend(bottles)

# Data to override, in the format {line_num_to_override: data_to_write}.
line_to_override = {1: ['e', 'c', 'd']}

# Write data to the csv file, replacing the lines in the line_to_override dict.
with open('a.csv', 'wb') as b:
    writer = csv.writer(b)
    for line, row in enumerate(bottle_list):
        data = line_to_override.get(line, row)
        writer.writerow(data)
You cannot overwrite a single row in the CSV file. You'll have to write all the rows you want to a new file and then rename it back to the original file name.
Your pattern of usage may fit a database better than a CSV file. Look into the sqlite3 module for a lightweight database.
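The rewrite-and-rename pattern looks roughly like this (a sketch; overwrite_row and the file names are my own placeholders, not from the question):

```python
import csv
import os
import tempfile

def overwrite_row(path, row_index, new_row):
    """Rewrite the CSV at `path`, replacing the row at `row_index` with `new_row`."""
    # Write to a temporary file in the same directory, then atomically replace the original
    dirname = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=dirname, suffix='.csv')
    with open(path, newline='') as src, os.fdopen(fd, 'w', newline='') as dst:
        writer = csv.writer(dst)
        for i, row in enumerate(csv.reader(src)):
            writer.writerow(new_row if i == row_index else row)
    os.replace(tmp_path, path)
```

Because os.replace is atomic on the same filesystem, a crash mid-write leaves the original file intact.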