I am using Python 3.5 and have several csv files. They are named according to a fixed structure: a fixed prefix (always the same) plus a varying filename part:
099_2019_01_01_filename1.csv
099_2019_01_01_filename2.csv
My original csv files look like this:
filename1-Streetname filename1-ZIPCODE
TEXT TEXT
TEXT TEXT
TEXT TEXT
TEXT TEXT
TEXT TEXT
TEXT TEXT
Street1 2012932
Street2 3023923
filename2-Name filename2-Phone
TEXT TEXT
TEXT TEXT
TEXT TEXT
TEXT TEXT
TEXT TEXT
TEXT TEXT
Name1 2012932
Name2 3023923
I am manipulating these files using the following code: I read the csv files from a source folder, cut off the TEXT rows by skipping them (as I do not need that information), and write the result to a destination folder:
import csv
import os

skiprows = (1, 2, 3, 4, 5, 6)
for file in os.listdir(sourcefolder):
    with open(os.path.join(sourcefolder, file)) as fp_in:
        reader = csv.reader(fp_in, delimiter=';')
        rows = [row for i, row in enumerate(reader) if i not in skiprows]
    with open(os.path.join(destinationfolder, file), 'w', newline='') as fp_out:
        writer = csv.writer(fp_out)
        writer.writerows(rows)
This code works and gives:
filename1-Streetname filename1-ZIPCODE
Street1 2012932
Street2 3023923
filename2-Name filename2-Phone
Name1 2012932
Name2 3023923
The first row contains the header. Each header name consists of the filename (without the 099_2019_01_01_ prefix and without the .csv ending) plus a "-". I want to remove this "filename-" part for each csv file.
The core part now is to get the first row and perform the replace only on that row. I need to cut off the prefix and the .csv from the filename and then do a general replace. The first replace could be something like this:
Either I could start with a function that cuts off the first n characters, as the length is fixed, or, according to this solution, just use string.removeprefix('099_2019_01_01_'). However, removeprefix requires Python 3.9+, so on Python 3.5 I simply replace the prefix instead:
string.replace("099_2019_01_01_","")
Then I need to remove the .csv which is easy:
string.replace(".csv","")
I put this together and get (string.replace("099_2019_01_01_","")).replace(".csv",""). (The trailing "-" needs to be removed too; see the code below.) I am not sure if this works.
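A quick sanity check of that chained replace (note that str.replace returns a new string, so the result must be assigned or chained; the filename here is just an illustration):

name = "099_2019_01_01_filename1.csv"
stripped = name.replace("099_2019_01_01_", "").replace(".csv", "")
print(stripped)  # filename1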
My main problem with this csv code is that I do not know how to manipulate only the first row when reading/writing the csv. So I want to perform this replace only on the first row. I tried something like this:
import csv
import os

skiprows = (1, 2, 3, 4, 5, 6)
for file in os.listdir(sourcefolder):
    with open(os.path.join(sourcefolder, file)) as fp_in:
        reader = csv.reader(fp_in, delimiter=';')
        rows = [row for i, row in enumerate(reader) if i not in skiprows]
    with open(os.path.join(destinationfolder, file), 'w', newline='') as fp_out:
        writer = csv.writer(fp_out)
        rows[0].replace((file.replace("099_2019_01_01_","")).replace(".csv","")+"-","")
        writer.writerows(rows)
This gives an error, since rows[0] is a list and lists have no replace method. How can I do this?
(I am not sure whether to include this replacing in this code or to put it into a second script that runs after this one. In that case, however, I would have to read and write every csv file again, so I think it would be most efficient to do it here. If that is not possible, I would also be fine with a stand-alone script that just does the replacing, assuming the csv files have row 0 as the header and the data after it.)
Please note that I do want to go this way with csv and not use pandas.
EDIT:
At the end the csv files should look like this:
Streetname ZIPCode
Street1 9999
Street2 9848
Name Phone
Name1 23421
Name2 23232
Try replacing this:
rows[0].replace((file.replace("099_2019_01_01_","")).replace(".csv","")+"-","")
with this in your code:
x=file.replace('099_2019_01_01_','').replace('.csv', '')
rows[0]=[i.replace(x+'-', '') for i in rows[0]]
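Put together, the whole loop could then look like this minimal sketch (sourcefolder and destinationfolder are assumed to be defined as in the question):

import csv
import os

skiprows = (1, 2, 3, 4, 5, 6)
for file in os.listdir(sourcefolder):
    with open(os.path.join(sourcefolder, file)) as fp_in:
        reader = csv.reader(fp_in, delimiter=';')
        rows = [row for i, row in enumerate(reader) if i not in skiprows]
    # strip the fixed prefix and the .csv ending from the filename,
    # then remove "filename-" from every header cell
    x = file.replace('099_2019_01_01_', '').replace('.csv', '')
    rows[0] = [i.replace(x + '-', '') for i in rows[0]]
    with open(os.path.join(destinationfolder, file), 'w', newline='') as fp_out:
        writer = csv.writer(fp_out)
        writer.writerows(rows)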
Related
I want to delete rows from a csv file as they are processed.
My file:
Sr,Name1,Name2,Name3
1,Zname1,Zname2,Zname3
2,Yname1,Yname2,Yname3
3,Xname1,Xname2,Xname3
I want to read row by row and delete the row which has been processed.
So the file will be now:
2,Yname1,Yname2,Yname3
3,Xname1,Xname2,Xname3
The solutions which are provided on other questions are:
read the file
use next() or any other way to skip the row and write the remaining rows in an updated file
Instead, I want to delete the row that was read by the .reader() method from the original file.
My code:
with open("file.txt", "r") as file
reader = csv.reader(file)
for row in reader:
#process the row
#delete the row
I have not been able to figure out how to delete/remove the row.
I want the change to be in the original file.txt because I will be running the program many times and so each time it runs, file.txt will already be present and the program will start from where it ended the last time.
Just read the csv file in memory as a list, then edit that list, and then write it back to the csv file.
import csv

lines = []
members = input("Please enter a member's name to be deleted.")
with open('mycsv.csv', 'r', newline='') as readFile:
    reader = csv.reader(readFile)
    for row in reader:
        # keep only the rows that do not contain the member's name
        if members not in row:
            lines.append(row)

with open('mycsv.csv', 'w', newline='') as writeFile:
    writer = csv.writer(writeFile)
    writer.writerows(lines)
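For the resume-between-runs behavior asked about above, the same read-edit-rewrite idea applies per row; a minimal sketch, assuming each run should process exactly one row and persist the rest:

import csv

with open("file.txt", "r", newline="") as f:
    rows = list(csv.reader(f))

if rows:
    row = rows.pop(0)  # take the next unprocessed row
    # ... process `row` here ...
    with open("file.txt", "w", newline="") as f:
        csv.writer(f).writerows(rows)  # rewrite the file without the processed row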
You can delete a column like this, using the pandas pop() method with the column name as the argument (note that pop() removes columns, not rows):
Import pandas.
Read the CSV file.
Use pop() to remove the column.
Print the data.
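A minimal sketch of those steps (file.csv and column_name are placeholder names; the final to_csv call is an assumption about wanting to persist the change):

import pandas as pd

df = pd.read_csv("file.csv")
df.pop("column_name")  # removes the named column in place and returns it
print(df)
df.to_csv("file.csv", index=False)  # optional: write the change back to disk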
You can probably find inspiration here: How to delete a specific line in a file?. And don't forget to open the file with write permission.
Although this can also be done in basic Python, the pandas package is designed for tabular data and makes it straightforward. You will have to import pandas.
import pandas

df = pandas.read_csv("file_name.txt")
# set_value is removed in modern pandas; df.at is the current equivalent
df.at[0, "Name3"] = new_value  # new_value is a placeholder
df.to_csv("file_name.txt", index=False)
This code edits the cell in the 0th row and Name3 column. The 0th row is the first row below the header. Thus, Zname3 will be changed to something else. You can similarly delete a row or a cell.
I have not tried this code but it is supposed to work in the required manner.
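For the deletion case mentioned, a minimal sketch continuing from the snippet above (drop returns a new DataFrame, so the result must be reassigned):

# delete the 0th data row instead of editing it
df = df.drop(index=0)
df.to_csv("file_name.txt", index=False)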
I am trying to write Python code that counts the number of rows in a csv file, but ignores rows containing a certain text (zzz for example). I have been able to successfully count the rows, but I do not know how to write the code so it ignores rows that contain zzz when counting. Any help with this, or at least a pointer to something to read, would be great.
import csv

filename = r"name"
with open(filename, 'r') as csvf:
    reader = csv.reader(csvf)
    lines = len(list(reader))
    print(lines)
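A minimal sketch of the filtered count, assuming "zzz" may appear in any field of a row:

import csv

filename = r"name"  # path to your csv file
with open(filename, 'r', newline='') as csvf:
    reader = csv.reader(csvf)
    # count only the rows in which no field contains "zzz"
    lines = sum(1 for row in reader if not any("zzz" in field for field in row))
print(lines)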
I want to loop through all the PDFs in a directory, extract the text from each one using PDFminer, and then write the output to a single CSV file. I am able to extract the text from each PDF individually by passing it to the function defined here. I am also able to get a list of all the PDF filenames in a given directory. But when I try to put the two together and write the results to a single CSV, I get a CSV with headers but no data.
Here is my code:
import os

pdf_files = [name for name in os.listdir("C:\\My\\Directory\\Path") if name.endswith(".pdf")]  # get all files in directory
pdf_files_path = ["C:\\My\\Directory\\Path\\" + pdf_files[i] for i in range(len(pdf_files))]  # add directory path

import pandas as pd

df = pd.DataFrame(columns=['FileName', 'Text'])

for i in range(len(pdf_files)):
    scraped_text = convert_pdf_to_txt(pdf_files_path[i])
    df.append({'FileName': pdf_files[i], 'Text': scraped_text[i]}, ignore_index=True)

df.to_csv('output.csv')
The variables have the following values:
pdf_files: ['12280_2007_Article_9000.pdf', '12280_2007_Article_9001.pdf', '12280_2007_Article_9002.pdf', '12280_2007_Article_9003.pdf', '12280_2007_Article_9004.pdf', '12280_2007_Article_9005.pdf', '12280_2007_Article_9006.pdf', '12280_2007_Article_9007.pdf', '12280_2007_Article_9008.pdf', '12280_2007_Article_9009.pdf']
pdf_files_path: ['C:\\My\\Directory Path\\12280_2007_Article_9000.pdf', etc...]
df:
Empty DataFrame
Columns: [FileName, Text]
Index: []
Update: based on a suggestion from @AMC I checked the contents of scraped_text in the loop. For the Text column, it seems that I'm looping through the characters of the first PDF's text, rather than through each file in the directory. Also, the contents of the loop are not getting written to the dataframe or the CSV.
12280_2007_Article_9000.pdf E
12280_2007_Article_9001.pdf a
12280_2007_Article_9002.pdf s
12280_2007_Article_9003.pdf t
12280_2007_Article_9004.pdf
12280_2007_Article_9005.pdf A
12280_2007_Article_9006.pdf s
12280_2007_Article_9007.pdf i
12280_2007_Article_9008.pdf a
12280_2007_Article_9009.pdf n
I guess you don't need pandas for that. You can make it simpler by using the standard library's csv module.
Another thing that can be improved, if you are using Python 3.4+, is to replace os with pathlib.
Here is an almost complete example:
import csv
from pathlib import Path

folder = Path('c:/My/Directory/Path')
csv_file = Path('c:/path/to/output.csv')

with csv_file.open('w', encoding='utf-8', newline='') as f:
    writer = csv.writer(f, quoting=csv.QUOTE_ALL)
    writer.writerow(['FileName', 'Text'])
    for pdf_file in folder.glob('*.pdf'):
        # flatten the extracted text to a single line (see the note below)
        pdf_text = convert_pdf_to_txt(pdf_file).replace('\n', '|')
        writer.writerow([pdf_file.name, pdf_text])
Another thing to bear in mind is that pdf_text must be a single line, or else your csv file will be kind of broken. One way to work around that is to pick an arbitrary character to use in place of the newline marks; the example above does exactly that with the pipe character, via .replace('\n', '|').
It is not meant to be a complete example but a starting point. I hope it helps.
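For reference, the two bugs in the original pandas loop are that df.append returns a new DataFrame rather than modifying df in place, and that scraped_text[i] indexes individual characters of the first extracted string. A minimal corrected version keeping the question's structure (note that DataFrame.append is deprecated in recent pandas in favor of pd.concat):

import pandas as pd

df = pd.DataFrame(columns=['FileName', 'Text'])
for i in range(len(pdf_files)):
    scraped_text = convert_pdf_to_txt(pdf_files_path[i])
    # reassign: append does not modify df in place, and scraped_text is already the full text
    df = df.append({'FileName': pdf_files[i], 'Text': scraped_text}, ignore_index=True)
df.to_csv('output.csv', index=False)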
I have a CSV file of interview transcripts exported from an h5 file. When I read the rows into python, the output looks something like this:
line[0]=['title,date,responses']
line[1]=['[\'Transcript 1 title\'],"[\' July 7, 1997\']","[ '\nms. vogel: i look at all sectors of insurance, although to date i\nhaven\'t really focused on the reinsurers and the brokers.\n']']
line[2]=['[\'Transcript 2 title\'],"[\' July 8, 1997\']","[ '\nmr. tozzi: i formed cambridge in 1981. we are top-down sector managers,\nconstantly searching for non-consensus companies and industries.\n']']
etc...
I'd like to extract the text from the "responses" column ONLY into separate .txt files for every row in the CSV file, saving the .txt files into a specified directory and naming them as "t1.txt", "t2.txt", etc. according to the row number. The CSV file has roughly 30K rows.
Drawing from what I've already been able to find online, this is the code I have so far:
import csv

with open("twst.csv", "r") as f:
    reader = csv.reader(f)
    rownumber = 0
    for row in reader:
        g = open("t" + str(rownumber) + ".txt", "w")
        g.write(",".join(row))  # writes every column of the row, not just "responses"
        rownumber = rownumber + 1
        g.close()
My biggest problem is that this pulls all columns from the row into the .txt file, but I only want the text from the "responses" column. Once I have that, I know I can loop through the various rows in the file (right now, what I have set up is just to test the first row), but I haven't found any guidance on pulling specific columns in the python documentation. I'm also not familiar enough with python to figure out the code on my own.
Thanks in advance for the help!
There may be something that can be done with the built-in csv module. However, if the format of the csv does not change, the following code should work by just using for loops and built-in read/write.
with open('test.csv', 'r') as file:
    data = file.read().split('\n')

for x in range(1, len(data)):  # start at 1 to skip the header row
    third_col = data[x].split(',')
    with open('t' + str(x) + '.txt', 'w') as output:
        output.write(third_col[2])
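Note that the responses in this data contain commas inside quotes, which a plain split(',') mishandles; here is a sketch using the csv module instead, assuming "responses" is the third column (index 2):

import csv

with open("twst.csv", "r", newline="") as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row
    for rownumber, row in enumerate(reader, start=1):
        with open("t" + str(rownumber) + ".txt", "w") as g:
            g.write(row[2])  # the "responses" column only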
I have a set of csv files and another csv file, GroundTruth2010_edited_copy.csv, which contains information I'd like to append to the ends of the rows of the set of files. The files contain information describing geologic samples. In every file, including GroundTruth2010_edited_copy.csv, each row has an identifying 'rockid' followed by various parameters of the sample. I want to append the corresponding information from GroundTruth2010_edited_copy.csv to each file in the set: if two rows have the same 'rockid', I want to combine them into a new row in a new csv file. Hence, there is a new csv file for each original csv file in the set. Here is my code.
import os
import csv

# read in ground truth data
csvfilename = 'GroundTruth/GroundTruth2010_edited_copy.csv'
with open(csvfilename) as csvfile:
    rocreader = csv.reader(csvfile)
    path = os.getcwd()
    filenames = os.listdir(path)
    for filename in filenames:
        if filename.endswith('.csv'):
            # read csv files
            r = csv.reader(open(filename))
            new_data = []
            for row in r:
                rockid = row[-1]
                for krow in rocreader:
                    entry = krow[0]
                    newentry = entry[:5] + entry[6:]  # remove extra '0' from middle of entry
                    if newentry == rockid:
                        print('Ok!')
                        # append ground truth data
                        new_data.append([row, krow[1], krow[2], krow[3], krow[4]])
            # write csv files
            newfilename = "".join(filename.split(".csv")) + "_GT.csv"
            with open(newfilename, "w") as f:
                writer = csv.writer(f)
                writer.writerows(new_data)
The code runs and makes my new csv files, but they are all empty. The problem seems to be that my second 'if' statement is never true: the console never prints 'Ok!'. I've been troubleshooting for a while and gotten rather frustrated. Perhaps the most frustrating thing is that after the program finishes, if I enter
rockid==newentry
the console returns True, so it seems I should get at least one 'Ok!' for the final iteration. Can anyone help me find what's wrong?
Also, since my if statement is never true, there may also be a problem with the way I append to new_data.
You only open rocreader once, so when you try to use it later in the loop, you'll only get rows from it the first time through; in the rest of the loop's runs you're reading 0 rows (and of course getting no matches). To read it over and over, open and close it once for each time you need to use it.
But instead of re-scanning the Ground Truth file from disk (slow!) for every row of each of the other CSVs, you should read it once into a dictionary, so you can look up IDs in one step.
with open(csvfilename) as csvfile:
    rocreader = csv.reader(csvfile)
    rocindex = dict((row[-1], row) for row in rocreader)
Then for any key newentry, you can just check like this:
if newentry in rocindex:
    truth = rocindex[newentry]
    # merge it with the row that has key `newentry`
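A sketch of how the lookup slots into the per-file loop, keeping the question's variable names (the entry[:5] + entry[6:] fix-up can be applied when building the dict if the keys need it):

import csv
import os

# build the ground-truth lookup once, before the loop
with open(csvfilename) as csvfile:
    rocindex = dict((row[-1], row) for row in csv.reader(csvfile))

for filename in os.listdir(os.getcwd()):
    if filename.endswith('.csv'):
        new_data = []
        with open(filename) as f:
            for row in csv.reader(f):
                rockid = row[-1]
                if rockid in rocindex:
                    krow = rocindex[rockid]
                    new_data.append(row + krow[1:5])  # flat row instead of nesting
        newfilename = "".join(filename.split(".csv")) + "_GT.csv"
        with open(newfilename, "w", newline="") as f:
            csv.writer(f).writerows(new_data)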