Remove specific number of characters in csv - python

I have some CSV files being exported from an SQL database and transferred to me daily to import into my SQL server. The files all have a "title" line of 27 characters containing the business name and date, e.g. "busname: 08-31-2020". I need a script that can remove those first 27 characters so they aren't imported into the database.
Is this possible? I can't find anything that will let me select a specific number of characters at the beginning of the file.

If your value is in column 1, you can use str[27:] to get everything after the first 27 characters:
import csv

with open('file.csv', 'r') as file:
    reader = csv.reader(file)
    for row in reader:
        process_str = row[0][27:]  # column 1 is index 0
You can then write these processed strings out to a new file.
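For the original question (stripping a fixed-length title from the front of the file itself before import), a minimal sketch could treat the file as raw text; input.csv and output.csv are placeholder names, and it assumes the title really occupies exactly the first 27 characters of the file:

import os

# Read the raw file, drop the first 27 characters (the title),
# and write the remainder to a new file ready for import.
with open('input.csv', 'r') as src:
    data = src.read()
with open('output.csv', 'w') as dst:
    dst.write(data[27:])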

Related

Import csv: remove filename from column names in first row

I am using Python 3.5 and have several csv files. They are named according to a fixed structure: a fixed prefix (always the same) plus a varying filename part:
099_2019_01_01_filename1.csv
099_2019_01_01_filename2.csv
My original csv files look like this:
filename1-Streetname filename1-ZIPCODE
TEXT TEXT
TEXT TEXT
TEXT TEXT
TEXT TEXT
TEXT TEXT
TEXT TEXT
Street1 2012932
Street2 3023923
filename2-Name filename2-Phone
TEXT TEXT
TEXT TEXT
TEXT TEXT
TEXT TEXT
TEXT TEXT
TEXT TEXT
Name1 2012932
Name2 3023923
I am manipulating these files using the following code: I read the csv files from a source folder, cut off the TEXT rows I do not need, and write the results to a destination folder:
import csv
import os

skiprows = (1, 2, 3, 4, 5, 6)
for file in os.listdir(sourcefolder):
    with open(os.path.join(sourcefolder, file)) as fp_in:
        reader = csv.reader(fp_in, delimiter=';')
        rows = [row for i, row in enumerate(reader) if i not in skiprows]
    with open(os.path.join(destinationfolder, file), 'w', newline='') as fp_out:
        writer = csv.writer(fp_out)
        writer.writerows(rows)
This code works and gives:
filename1-Streetname filename1-ZIPCODE
Street1 2012932
Street2 3023923
filename2-Name filename2-Phone
Name1 2012932
Name2 3023923
The first row contains the header. In the header names there is always the filename (however without the 099_2019_01_01_ prefix) plus a "-". The filename ending .csv is missing. I want to remove this "filename-" for each csv file.
The core part now is to get the first row and only for this row to perform a replace. I need to cut off the prefix and the .csv and then perform a general replace. The first replace could be something like this:
Either I could start with a function to cut off the first n characters, as the length is fixed, or, according to this solution, just use string.removeprefix('099_2019_01_01_').
As I have Python 3.5 I cannot use removeprefix, so I try to simply replace it:
string.replace("099_2019_01_01_","")
Then I need to remove the .csv which is easy:
string.replace(".csv","")
Putting this together I get (string.replace("099_2019_01_01_","")).replace(".csv",""). (The trailing "-" needs to be removed too; see the code below.) I am not sure if this works.
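A quick interactive check of that chained replace (using one of the filenames from above):

>>> '099_2019_01_01_filename1.csv'.replace('099_2019_01_01_', '').replace('.csv', '')
'filename1'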
My main problem is that I do not know how to manipulate only the first row when reading/writing the csv. I tried something like this:
import csv
import os

skiprows = (1, 2, 3, 4, 5, 6)
for file in os.listdir(sourcefolder):
    with open(os.path.join(sourcefolder, file)) as fp_in:
        reader = csv.reader(fp_in, delimiter=';')
        rows = [row for i, row in enumerate(reader) if i not in skiprows]
    with open(os.path.join(destinationfolder, file), 'w', newline='') as fp_out:
        writer = csv.writer(fp_out)
        rows[0].replace((file.replace("099_2019_01_01_","")).replace(".csv","")+"-","")
        writer.writerows(rows)
This gives an error as the idea with rows[0] is not working. How can I do this?
(I am not sure whether to include this replacing in this code or to put it into a second script that runs afterwards. That would mean reading and writing every csv file again, so I think it would be most efficient to do it here. If that is not possible, I would also be fine with a stand-alone script that just does the replacing, assuming each csv file has row 0 as the header followed by the data.)
Please note that I do want to go this way with csv and not use pandas.
EDIT:
At the end the csv files should look like this:
Streetname ZIPCode
Street1 9999
Street2 9848
Name Phone
Name1 23421
Name2 23232
Try replacing this:
rows[0].replace((file.replace("099_2019_01_01_","")).replace(".csv","")+"-","")
with this in your code:
x = file.replace('099_2019_01_01_', '').replace('.csv', '')
rows[0] = [i.replace(x + '-', '') for i in rows[0]]
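Folded back into the loop from the question, the whole fix might look like this (a sketch; sourcefolder and destinationfolder are the question's own variables):

import csv
import os

skiprows = (1, 2, 3, 4, 5, 6)
for file in os.listdir(sourcefolder):
    with open(os.path.join(sourcefolder, file)) as fp_in:
        reader = csv.reader(fp_in, delimiter=';')
        rows = [row for i, row in enumerate(reader) if i not in skiprows]
    # Derive the "filename-" prefix from the file name itself.
    prefix = file.replace('099_2019_01_01_', '').replace('.csv', '') + '-'
    rows[0] = [col.replace(prefix, '') for col in rows[0]]
    with open(os.path.join(destinationfolder, file), 'w', newline='') as fp_out:
        writer = csv.writer(fp_out)
        writer.writerows(rows)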

Producing separate text files from a large JSON file

I'm trying to use the code below to produce a set of .txt files from a large .json file that has one JSON object per line, each with a date and a string of text. I want the date to be the filename.
When I open the .json file (in the Sublime text editor), it shows 2272 lines, so I assume the code should produce that number of text files. However, it is only producing half as many. Can anybody tell me why, and what I should do to correct this?
import json

data = [json.loads(line) for line in open('results.json', 'r')]

for p in data:
    date = p["date"]
    filename = date.replace(" ", "_").replace(":", "_")
    print(filename)
    text = p["text"]
    with open('Articles2/' + filename + '.txt', 'w') as f:
        f.write(text + '\n')
Thanks for any help!
You have duplicate dates in your sample data, so each iteration of your for loop creates a file and then overwrites it whenever the date is exactly the same.
For example, 2018-11-17 17:11:48 appears in 3 entries about 13 lines down in your data. Those will only create 1 file, because the filename your script builds is based solely on the date.
You need to add some other unique value to the date so open() doesn't overwrite a file that already exists, e.g. add milliseconds to the date, concatenate values from the "text" property of your JSON object, add a counter, etc.
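For instance, a per-date counter keeps repeated dates from colliding (a minimal sketch; the "date" and "text" field names match the question's JSON):

import json
from collections import defaultdict

counts = defaultdict(int)  # how many files each date has produced so far

with open('results.json', 'r') as json_file:
    for line in json_file:
        p = json.loads(line)
        base = p["date"].replace(" ", "_").replace(":", "_")
        counts[base] += 1
        # Suffix a counter so identical dates get distinct filenames.
        filename = base + "_" + str(counts[base])
        with open('Articles2/' + filename + '.txt', 'w') as f:
            f.write(p["text"] + '\n')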

How to delimit a CSV file with semicolons?

I am trying to open a CSV file after I create it with Python. My goal is to read the file back without editing it, and my problem is that I cannot get the delimiter to work. The file is created with the Python csv writer, and then I attempt to use the reader to read the data back. This is where I am stuck. The CSV file is saved in the same location as my Python program, so I know it is not an access issue. I chose semicolons (;) as the delimiter because the raw data already contains commas (,), colons (:), plus signs (+), ampersands (&), periods (.), and possibly underscores (_) and/or dashes (-). This is the code I am using to read the CSV file:
import csv

with open('Cool.csv') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=';', dialect=csv.excel_tab)
    for row in csv_reader:
        print row[0]
csv_file.close()
Now this is my csv file (Cool.csv):
"Sat, 20 Apr 2019 00:17:05 +0000;Need to go to store;Eggs & Milk are needed ;Store: Grocery;Full Name: Safeway;Email: safewayiscool#gmail.com;Safeway <safewayiscool#gmail.com>, ;"
"Tue, 5 Mar 2019 05:54:24 +0000;Need to buy ham;Green eggs and Ham are needed for dinner ;Username: Dr.Seuss;Full Name: Theodor Seuss Geisel;Email: greeneggs+ham#seuss.com;"
So I would expect my output to be the following when I run the code:
Sat, 20 Apr 2019 00:17:05 +0000
Tue, 5 Mar 2019 05:54:24 +0000
Instead, I either get a null error of some kind or it prints out the entire line. How can I get it to separate the data into columns delimited by the ;?
I am not sure if the issue is the semicolon itself or something else. If it is just the semicolon, I could change it, but many other characters are already taken in the incoming data.
Also, please do not suggest I simply read from the original file. It is a massive file with a lot of other data, and I want to trim it down before running this second program.
UPDATE:
This is the code that builds the file:
with open('Cool.csv', 'w') as csvFile:
    writer = csv.writer(csvFile, delimiter=';')
    for m in file:
        message = m['msg']
        message2 = message.replace('\r\n\r\n', ';')
        message3 = message2.replace('\r\n', ';')
        entry = m['date'] + ";" + m['subject'] + ";" + message3
        list = []
        list.append(entry)
        writer.writerow(list)
csvFile.close()
It looks like the file was created incorrectly. The sample data provided shows the whole line double-quoted, which treats it as one long single column. Here's correct code to write and read a semicolon-delimited file:
import csv

with open('Cool.csv', 'w', newline='', encoding='utf-8-sig') as csv_file:
    csv_writer = csv.writer(csv_file, delimiter=';')
    csv_writer.writerow(['data,data', 'data;data', 'data+-":_'])

with open('Cool.csv', 'r', newline='', encoding='utf-8-sig') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=';')
    for row in csv_reader:
        print(row)
Output (matches data written):
['data,data', 'data;data', 'data+-":_']
Cool.csv:
data,data;"data;data";"data+-"":_"
Notes:
utf-8-sig is the most compatible encoding with Excel. Any Unicode character you put in the file will work and look correct when the CSV is opened in Excel.
newline='' is required per the csv documentation. The csv module handles its own newlines per the dialect used (default 'excel').
A ; delimiter is not needed; the default , would work. Note how the second entry contains a semicolon, so that field was quoted. With a comma delimiter, the first field (which contains a comma) would have been quoted instead, and it would still work.
csv_writer.writerow takes a sequence containing the column data.
csv_reader returns each row as a list of the column data.
A column in the .CSV is double-quoted if it contains the delimiter, and embedded quotes are doubled to escape them. Note the third field has a double quote.
No explicit close() calls are needed when using with.
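Applied to the writer loop from the update, the fix is to hand the writer a list of separate fields instead of one pre-joined string (a sketch; the messages list is sample data standing in for the question's parsed mail objects):

import csv

# Sample data shaped like the question's mail objects.
messages = [
    {'date': 'Sat, 20 Apr 2019 00:17:05 +0000',
     'subject': 'Need to go to store',
     'msg': 'Eggs & Milk are needed\r\n\r\nStore: Grocery'},
]

with open('Cool.csv', 'w', newline='') as csvFile:
    writer = csv.writer(csvFile, delimiter=';')
    for m in messages:
        body = m['msg'].replace('\r\n\r\n', ';').replace('\r\n', ';')
        # Each list element becomes one column; the writer only quotes a
        # field when that field itself contains the delimiter.
        writer.writerow([m['date'], m['subject']] + body.split(';'))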
RTFM. From help(csv):
DIALECT REGISTRATION:

    Readers and writers support a dialect argument, which is a convenient
    handle on a group of settings. When the dialect argument is a string,
    it identifies one of the dialects previously registered with the module.
    If it is a class or instance, the attributes of the argument are used as
    the settings for the reader or writer:

    class excel:
        delimiter = ','
        quotechar = '"'
        escapechar = None
        doublequote = True
        skipinitialspace = False
        lineterminator = '\r\n'
        quoting = QUOTE_MINIMAL
And you pass dialect=csv.excel_tab, which mixes the tab-separated dialect's settings with your explicit delimiter, and is at best confusing. Just don't use the dialect option.
Sidenote: with handles closing of the file handle for you.
Second sidenote: The whole line of your CSV file is in double quotes. Either get rid of them, or disable quoting, e.g.:
import csv

with open('b.txt') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=';', quoting=csv.QUOTE_NONE)
    for row in csv_reader:
        print(row[0])

Python 3 reading CSV file with line breaks in rows

I have a large CSV file with one column and line breaks in some of its rows. I want to read the content of each cell and write it to a text file but the CSV reader is splitting the cells with line breaks into multiple ones (multiple rows) and writing each one to a separate text file.
Using Python 3.6.2 on macOS Sierra.
Here is an example:
"content of row 1"
"content of row 2
continues here"
"content of row 3"
And here is how I am reading it:
import csv

with open(csvFileName, 'r') as csvfile:
    lines = csv.reader(csvfile)
    i = 0
    for row in lines:
        i += 1
        content = row
        outFile = open("output" + str(i) + ".txt", 'w')
        outFile.write(content)
        outFile.close()
This is creating 4 files instead of 3 for each row. Any suggestions on how to ignore the line break in the second row?
You could define a regular expression pattern to help you iterate over the rows.
Read the entire file contents - if possible.
import re

s = '''"content of row 1"
"content of row 2
continues here"
"content of row 3"'''
Pattern - double-quote, followed by anything that isn't a double-quote, followed by a double-quote.:
row_pattern = '''"[^"]*"'''
row = re.compile(row_pattern, flags=re.DOTALL | re.MULTILINE)
Iterate the rows:
for r in row.finditer(s):
    print(r.group())
    print('******')
>>>
"content of row 1"
******
"content of row 2
continues here"
******
"content of row 3"
******
>>>
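For comparison, the csv module itself will keep a quoted multi-line cell together when the file is opened with newline='' (a sketch, assuming the question's sample is saved as sample.csv):

import csv

with open('sample.csv', newline='') as csvfile:
    # csv.reader yields one row per quoted record, even when the
    # record spans several physical lines.
    for i, row in enumerate(csv.reader(csvfile), start=1):
        with open('output' + str(i) + '.txt', 'w') as out:
            out.write(row[0] + '\n')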
The file you describe is NOT a CSV (comma separated values) file. A CSV file is a list of records one per line where each record is separated from the others by commas. There are various "flavors" of CSV which support various features for quoting fields (in case fields have embedded commas in them, for example).
I think your best bet would be to create an adapter class/instance which would pre-process the raw file, find and merge the continuation lines into records, and then pass those to your instance of csv.reader; a sketch of such a line-merging generator follows the StringIO example below. You could model your class after StringIO from the Python standard libraries.
The point is that you create something which processes data but behaves enough like a file object that it can be used, transparently, as the input source for something like csv.reader().
(Done properly you can even implement the Python context manager protocol. io.StringIO does support this protocol and could be used as a reference. This would allow you to use instances of this hypothetical "line merging" adapter class in a Python with statement just as you're doing with your open file() object in your example code).
from io import StringIO
import csv

data = u'1,"a,b",2\n2,ab,2.1\n'
with StringIO(data) as infile:
    reader = csv.reader(infile, quotechar='"')
    for rec in reader:
        print(rec[0], rec[2], rec[1])
That's just a simple example of using io.StringIO in a with statement. Note that io.StringIO requires Unicode data, while io.BytesIO requires bytes or string data (at least in 2.7.x). Your adapter class can do whatever you like.
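A generator-based sketch of that line-merging adapter idea (the merged_lines helper is hypothetical; it treats a record as complete once its double quotes balance, and sample.csv stands in for the question's file):

import csv

def merged_lines(fp):
    # Merge physical lines into logical records: a record is complete
    # when it contains an even number of double-quote characters.
    buffer = ''
    for line in fp:
        buffer += line
        if buffer.count('"') % 2 == 0:
            yield buffer
            buffer = ''
    if buffer:
        yield buffer  # trailing partial record, if any

with open('sample.csv') as fp:
    for row in csv.reader(merged_lines(fp)):
        print(row)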

Trying to import a list of words using csv (Python 2.7)

import csv, Tkinter

with open('most_common_words.csv') as csv_file: # Opens the file in a 'closure' so that when it's finished it's automatically closed
    csv_reader = csv.reader(csv_file) # Create a csv reader instance
    for row in csv_reader: # Read each line in the csv file into 'row' as a list
        print row[0] # Print the first item in the list
I'm trying to import this list of most common words using csv. It continues to give me the same error:
for row in csv_reader: # Read each line in the csv file into 'row' as a list
Error: new-line character seen in unquoted field - do you need to open the file in universal-newline mode?
I've tried a couple different ways to do it as well, but they didn't work either. Any suggestions?
Also, where does this file need to be saved? Is it okay just being in the same folder as the program?
You should always open a CSV file in binary mode in Python 2 (in Python 3, open it with newline=''). Also, make sure that the delimiter and quote characters are , and ", or you'll need to specify otherwise:
with open('most_common_words.csv', 'rb') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=';', quotechar='"') # for EU CSV
You can save the file in the same folder as your program. If you don't, you can provide the correct path to open() as well. Be sure to use raw strings if you're on Windows, otherwise the backslashes may trick you: open(r"C:\Python27\data\table.csv")
It seems you have a file with one column, as you say here:
It is a simple list of words. When I open it up, it opens into Excel with one column and 500 rows of 500 different words.
If so, you don't need the csv module at all:
with open('most_common_words.csv') as f:
    rows = list(f)
Note in this case, each item of the list will have the newline appended to it, so if your file is:
apple
dog
cat
rows will be ['apple\n', 'dog\n', 'cat\n']
If you want to strip the end of line, then you can do this:
with open('most_common_words.csv') as f:
    rows = [line.rstrip() for line in f]
