Cleaning up a csv by delimiter - python

I have a CSV file where all the columns end up in a single column: each line is encased in quotation marks and the fields inside it are separated by commas.
Fields are split by a comma; two consecutive commas mean there is a missing value. I would like to separate the columns on that basis. Where a field is itself in quotation marks, a comma inside the quotation marks should not be a separator, because it is part of an address.
This is a sample of the data (it's a CSV; I converted it into a dictionary to show you a sample):
{'Store code,"Biz","Add","Labels","TotalSe","DirectSe","DSe","TotalVe","SeVe","MaVe","Totalac","Webact","Dions","Ps"': {0: ',,,,"Numsearching","Numsearchingbusiness","Numcatprod","Numview","Numviewed","Numviewed2","Numaction","Numwebsite","Numreques","Numcall"',
1: 'Nora,"Ora","Sgo, Mp, 2000",,111,44,33,121,1232,53411,4,5,,3',
2: 'mc11,"21 old","tjis that place, somewher, Netherlands, 2434",,3245,325,52454,3432,243,4353,343,23,23,18'}}
I have tried this so far and am a bit stuck:
disc = pd.read_csv('/content/gdrive/My Drive/blank/blank.csv',delimiter='",')

I use plain string operations to remove the " at both ends of every line and to convert each doubled "" into a single ".
This gives a CSV that can be loaded with read_csv():
import pandas as pd
f1 = open('Sample - Sheet1.csv')
f2 = open('temp.csv', 'w')
for row in f1:
    row = row.strip()             # remove "\n"
    row = row[1:-1]               # remove " on both ends
    row = row.replace('""', '"')  # convert "" into "
    f2.write(row + '\n')
f2.close()
f1.close()
df = pd.read_csv('temp.csv')
print(len(df.columns))
print(df)
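Another method: read the file with csv.reader, which parses each line as one quoted field, and write that field back out as a plain line:
import csv
import pandas as pd
f1 = open('Sample - Sheet1.csv')
f2 = open('temp.csv', 'w')
reader = csv.reader(f1)
for row in reader:
    f2.write(row[0] + '\n')  # row[0] is the whole line with the outer quotes removed
f2.close()
f1.close()
df = pd.read_csv('temp.csv')
print(len(df.columns))
print(df)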
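As a rough sketch of the second idea without a temporary file: assuming each line of the file is one fully quoted field with the inner quotes doubled (which is what both approaches above rely on), every line can simply be parsed twice with the csv module. The sample lines and the helper name unwrap_rows are made up for illustration.

```python
import csv

def unwrap_rows(lines):
    """Parse lines that are each wrapped in one extra layer of CSV quoting."""
    rows = []
    for outer in csv.reader(lines):
        if not outer:
            continue  # skip blank lines
        # outer[0] is the original line as a plain string; parse it again
        rows.append(next(csv.reader([outer[0]])))
    return rows

# Hypothetical sample in the assumed format (outer quotes, inner quotes doubled)
sample = [
    '"Store code,""Biz"",""Add"",""Labels"",""TotalSe"""',
    '"Nora,""Ora"",""Sgo, Mp, 2000"",,111"',
]
for row in unwrap_rows(sample):
    print(row)
```

The comma inside the address stays within one field, and the two consecutive commas produce an empty string for the missing value.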

Related

Reading a txt file and saving individual columns as lists

I am trying to read a .txt file and save the data in each column as a list. Each column in the file contains a variable which I will later use to plot a graph. I have looked up the best method to do this, and most answers recommend opening the file, reading it, and then either splitting the lines or saving the columns as lists. The data in the .txt is as follows:
0 1.644231726
0.00025 1.651333945
0.0005 1.669593478
0.00075 1.695214575
0.001 1.725409504
The delimiter is a space ' ' or a tab '\t'. I have used the following code to try to append the columns to my variables:
import csv
with open('./rvt.txt') as file:
    readfile = csv.reader(file, delimiter='\t')
    time = []
    rim = []
    for line in readfile:
        t = line[0]
        r = line[1]
        time.append(t)
        rim.append(r)
print(time, rim)
However, when I try to print the lists, time and rim, using print(time, rim), I get the following error message -
r = line[1]
IndexError: list index out of range
I am, however, able to print only the 'time' if I comment out the r=line[1] and rim.append(r) parts. How do I approach this problem? Thank you in advance!
I would suggest the following:
import pandas as pd
df = pd.read_csv('./rvt.txt', sep='\t', names=[a list with your column names])
Then you can use list(df[your_column]) to work with your columns as lists.
The problem is with the delimiter. The columns in the dataset are separated by spaces ' ', not tabs.
When you use '\t' as the delimiter and print each line, you can see that the line is not being split on the delimiter,
e.g.:
['0 1.644231726']
['0.00025 1.651333945']
['0.0005 1.669593478']
['0.00075 1.695214575']
['0.001 1.725409504']
To get the desired result, you can use a space as the delimiter and filter out the empty values:
readfile = csv.reader(file, delimiter=" ")
time, rim = [], []
for line in readfile:
    line = list(filter(lambda x: len(x), line))
    t = line[0]
    r = line[1]
Here is the code to do this:
import csv
with open('./rvt.txt') as file:
    readfile = csv.reader(file, delimiter=" ")
    time = []
    rim = []
    for line in readfile:
        t = line[0]
        r = line[1]
        time.append(t)
        rim.append(r)
print(time, rim)
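If the file may mix single spaces, repeated spaces and tabs, a minimal alternative sketch is to skip the csv module and call str.split() with no argument, which splits on any run of whitespace. The function name read_two_columns and the sample lines are only for illustration, and converting to float is an assumption based on the values being plotted later.

```python
def read_two_columns(lines):
    """Split each line on any whitespace and collect the first two columns."""
    time, rim = [], []
    for line in lines:
        parts = line.split()  # no argument: splits on runs of spaces and tabs
        if len(parts) < 2:
            continue  # skip blank or incomplete lines
        time.append(float(parts[0]))
        rim.append(float(parts[1]))
    return time, rim

# Illustrative lines using a space, a tab and several spaces as separators
sample = ["0 1.644231726\n", "0.00025\t1.651333945\n", "0.0005   1.669593478\n", "\n"]
time, rim = read_two_columns(sample)
print(time, rim)
```

With a real file, the open file object can be passed in place of the sample list, for example read_two_columns(open('./rvt.txt')).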

Python: replace a value in specific column in csv

Please help with the challenge below: I want to replace a value in a specific column (comma-separated data) when a match is found.
file.csv contains a number of rows of comma-separated values. Using the code below, for each row I look at the first column field and the second column field.
If column1 field == column2 field --> delete the first 2 fields, and write the row to the file named after column1.
If column1 field != column2 field --> delete the first 2 fields, and write the row to two separate files (the file named after column1 and the file named after column2).
If column1 field is empty but column2 field exists --> delete the first 2 fields, and write the row to the file named after column2, and vice versa.
My challenge is that, before writing the file, I need to change the column5 value to the value of column 0 or column 1 (csvdata[0] or csvdata[1]), based on the conditions below.
import datetime
import os, csv
Y = open('file.csv', "r").readlines()
timestamp = '_' + '{:%Y%m%d%H%M%S}'.format(datetime.datetime.now())
for x in Y:
    csvdata = x.split(",")
    up = ','.join(csvdata[2:])  ######THIS DELETE FIRST 2 FIELDS
    if csvdata[0] == csvdata[1]:
        with open(csvdata[0] + timestamp + '.csv', 'a') as f:
            f.write(up)
    elif csvdata[0] != csvdata[1] and csvdata[1] != '' and csvdata[0] != '':
        with open(csvdata[0] + timestamp + '.csv', 'a') as f:
            f.write(up)
        with open(csvdata[1] + timestamp + '.csv', 'a') as f:
            f.write(up)
    elif csvdata[1] != '' and csvdata[0] == '':
        with open(csvdata[1] + timestamp + '.csv', 'a') as f:
            f.write(up)
    elif csvdata[0] != '' and csvdata[1] == '':
        with open(csvdata[0] + timestamp + '.csv', 'a') as f:
            f.write(up)
file.csv
apple,orange,0,1,orange,30 --> goes to BOTH apple, orange files(with replacement of 5th field)
apple,'',0,2,orange,30 ---> goes to apple file (with replacement of 5th field orange to apple)
'',orange,0,3,apple,30 ---> goes to orange file (with replacement of 5th field apple to orange)
apple,apple,0,4,orange,30 ---> goes to apple file (with replacement of 5th field orange to apple)
orange,orange,0,5,apple,30 ---> goes to orange file (with replacement of 5th field apple to orange)
expected output:
apple_20200402134567.csv
0,1,apple,30
0,2,apple,30
0,4,apple,30
orange_20200402134567.csv
0,1,orange,30
0,3,orange,30
0,5,orange,30
Please help me add a piece of code to the above that replaces the 5th field based on the match/condition.
Thanks in advance.
The following code uses the csv module to read and write the files. This ensures that the empty column fields '' are read as empty strings. Instead of the if/elif sequence, it uses a Python set to determine the relevant file names to write. If there are many lines, the "open file/append a line/close file" cycle is inefficient. It would be better either to use a dictionary to cache the csv.writer objects or to accumulate all the rows in memory and then write all the files at the end.
import datetime
import csv
timestamp = '{:%Y%m%d%H%M%S}'.format(datetime.datetime.now())
with open('file.csv', newline='') as csvfile:
    reader = csv.reader(csvfile, quotechar="'")
    for row in reader:
        # unique, non-empty names from the first two columns
        names = set(i for i in row[0:2] if i)
        for name in names:
            with open('{}_{}.csv'.format(name, timestamp), 'a', newline='') as output:
                writer = csv.writer(output, quotechar="'")
                row[4] = name  # replace the 5th field with the target file's name
                writer.writerow(row[2:])
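The "accumulate all the rows in memory and then write all the files at the end" variant mentioned above could look roughly like the sketch below. The helper names group_rows and write_groups are hypothetical, the input is passed as a string only to keep the example self-contained, and the underscore in the file name is an assumption taken from the expected output in the question.

```python
import csv
import io
from collections import defaultdict

def group_rows(csv_text):
    """Return {name: [rows]} with the first two fields dropped and field 5 replaced."""
    grouped = defaultdict(list)
    reader = csv.reader(io.StringIO(csv_text), quotechar="'")
    for row in reader:
        if len(row) < 5:
            continue  # skip blank or short lines
        # unique, non-empty names from the first two columns
        for name in sorted({field for field in row[:2] if field}):
            out = row[2:]    # copy of the row without the first two fields
            out[2] = name    # original 5th field (index 4) becomes the file name
            grouped[name].append(out)
    return dict(grouped)

def write_groups(grouped, timestamp):
    """Open each output file once and write all of its rows."""
    for name, rows in grouped.items():
        with open('{}_{}.csv'.format(name, timestamp), 'w', newline='') as output:
            csv.writer(output).writerows(rows)

sample = """apple,orange,0,1,orange,30
apple,'',0,2,orange,30
'',orange,0,3,apple,30
apple,apple,0,4,orange,30
orange,orange,0,5,apple,30
"""
grouped = group_rows(sample)
print(grouped["apple"])
print(grouped["orange"])
```

Calling write_groups(grouped, timestamp) afterwards would create one file per name; it is left out of the demonstration so that running the sketch does not touch the disk.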

Is there a function to concatenate two header rows into one?

Consider the following textfile excerpt
Distance,Velocity,Time
(m),(m/s),(s)
1,1,1
2,1,2
3,1,3
I want it to be transformed into this:
Distance(m),Velocity(m/s),Time(s)
1,1,1
2,1,2
3,1,3
In other words, I want to concatenate the rows that contain text, and I want them to be concatenated column-wise.
I am initially manipulating a text file that is generated by a piece of software. I have successfully reduced it to only the numeric columns and their headers, in CSV format. But I have multiple header rows for each column, and I need all the information in each header row, because the column attributes will differ from file to file. How can I do this in a smart way in Python?
Edit: Thank you for your suggestions, they helped me a lot. I used Daweo's solution and added a dynamic row count, because the number of header rows may vary from 2 to 7 depending on the generated output. Here's the code snippet I ended up with.
# Get column headers
import re
a = 0
numlines = 0
header_rows = 0
with open(full, "r") as input:
    Lines = ""
    for line in input:
        l = line
        g = re.sub(' +', ' ', l)
        y = re.sub('\t', ',', g)
        numlines += 1
        if len(l.encode('ANSI')) > 250:
            # finds header start row
            a += 1
        if a > 0:
            # finds header end row
            if "---" in line:
                header_rows = numlines - (numlines - a + 1)
                break
            else:
                # Lines is my headers string
                Lines = Lines + "%s" % (y) + ' '
# Create concatenated column headers
rows = [i.split(',') for i in Lines.rstrip().split('\n')]
cols = [list(c) for c in zip(*rows)]
for i in (cols):
    for j in (rows):
        newcolz = [list(c) for c in zip(*rows)]
print(newcolz)
I would do it the following way:
txt = "Distance,Velocity,Time\n(m),(m/s),(s)\n1,1,1\n2,1,2\n3,1,3\n"
rows = [i.split(',') for i in txt.rstrip().split('\n')]
cols = [list(c) for c in zip(*rows)]
newcols = [[i[0]+i[1],*i[2:]] for i in cols]
newrows = [','.join(i) for i in zip(*newcols)]
newtxt = '\n'.join(newrows)
print(newtxt)
Output:
Distance(m),Velocity(m/s),Time(s)
1,1,1
2,1,2
3,1,3
Crucial here is usage of zip to transpose your data, so I can deal with columns rather than rows. [[i[0]+i[1],*i[2:]] for i in cols] is responsible for actual concat, so if you would have headers spanning 3 lines you can do [[i[0]+i[1]+i[2],*i[3:]] for i in cols] and so on.
I am not aware of anything that exists to do this, so instead you can just write a custom function. In the example below the function takes two strings and also a separator, which defaults to a comma.
It splits each string into a list, then uses a list comprehension with zip to pair up the lists and join each pair.
Lastly it joins the consolidated headers again with the separator.
def concat_headers(header1, header2, seperator=","):
    headers1 = header1.split(seperator)
    headers2 = header2.split(seperator)
    consolidated_headers = ["".join(values) for values in zip(headers1, headers2)]
    return seperator.join(consolidated_headers)
data = """Distance,Velocity,Time\n(m),(m/s),(s)\n1,1,1\n2,1,2\n3,1,3\n"""
header1, header2, *lines = data.splitlines()
consolidated_headers = concat_headers(header1, header2)
print(consolidated_headers)
print("\n".join(lines))
OUTPUT
Distance(m),Velocity(m/s),Time(s)
1,1,1
2,1,2
3,1,3
You don't really need a function to do it because it can be done like this using the csv module:
import csv
data_filename = 'position_data.csv'
new_filename = 'new_position_data.csv'
with open(data_filename, 'r', newline='') as inp, \
     open(new_filename, 'w', newline='') as outp:
    reader, writer = csv.reader(inp), csv.writer(outp)
    row1, row2 = next(reader), next(reader)
    new_header = [a+b for a,b in zip(row1, row2)]
    writer.writerow(new_header)
    # Copy the rest of the input file.
    for row in reader:
        writer.writerow(row)
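Since the question's edit says the number of header rows can vary from 2 to 7, the same zip idea can be generalised to any count. This is only a sketch: the function name merge_header_rows and its n_header parameter are made up, and it assumes the rows have already been parsed into lists and that the caller knows how many header rows there are.

```python
def merge_header_rows(rows, n_header):
    """Join the first n_header rows column-wise and keep the remaining rows."""
    header_rows, data_rows = rows[:n_header], rows[n_header:]
    # zip(*header_rows) groups the header cells belonging to the same column
    merged = ["".join(parts) for parts in zip(*header_rows)]
    return [merged] + data_rows

rows = [
    ["Distance", "Velocity", "Time"],
    ["(m)", "(m/s)", "(s)"],
    ["1", "1", "1"],
    ["2", "1", "2"],
]
for row in merge_header_rows(rows, 2):
    print(",".join(row))
```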

Python: Pandas, dealing with spaced column names

If I have multiple text files that I need to parse that look like so, but can vary in terms of column names, and the length of the hashtags above:
How would I go about turning this into a pandas dataframe? I've tried using pd.read_table('file.txt', delim_whitespace = True, skiprows = 14), but it has all sorts of problems. My issues are...
All the text, asterisks, and pounds at the top needs to be ignored, but I can't just use skip rows because the size of all the junk up top can vary in length in another file.
The columns "stat (+/-)" and "syst (+/-)" are seen as 4 columns because of the whitespace.
The one pound sign is included in the column names, and I don't want that. I can't just assign the column names manually because they vary from text file to text file.
Any help is much obliged, I'm just not really sure where to go from after I read the file using pandas.
Consider reading in the raw file and cleaning it line by line while writing to a new file with the csv module. A regex identifies the column header row, using the i as the match criterion. The code below assumes that more than one space separates columns:
import os
import csv, re
import pandas as pd
rawfile = "path/To/RawText.txt"
tempfile = "path/To/TempText.txt"
with open(tempfile, 'w', newline='') as output_file:
    writer = csv.writer(output_file)
    with open(rawfile, 'r') as data_file:
        for line in data_file:
            if re.match('^.*i', line):  # KEEP COLUMN HEADER ROW
                line = line.replace('\n', '')
                row = re.split(' {2,}', line)  # SPLIT ON TWO OR MORE SPACES
                writer.writerow(row)
            elif not line.startswith('#'):  # REMOVE HASHTAG LINES
                line = line.replace('\n', '')
                row = re.split(' {2,}', line)
                writer.writerow(row)
df = pd.read_csv(tempfile)  # IMPORT TEMP FILE
df.columns = [c.replace('# ', '') for c in df.columns]  # REMOVE '#' IN COL NAMES
os.remove(tempfile)  # DELETE TEMP FILE
This is the approach I mentioned in the comment: it uses a file object to skip the custom dirty data at the beginning of the file. You position the file offset at the appropriate location in the file, from where read_fwf simply does the job:
with open(rawfile, 'r') as data_file:
    while(data_file.read(1)=='#'):
        last_pound_pos = data_file.tell()
        data_file.readline()
    data_file.seek(last_pound_pos)
    df = pd.read_fwf(data_file)
df
Out[88]:
   i      mult  stat (+/-)  syst (+/-)        Q2         x       x.1       Php
0  0  0.322541    0.018731    0.026681  1.250269  0.037525  0.148981  0.104192
1  1  0.667686    0.023593    0.033163  1.250269  0.037525  0.150414  0.211203
2  2  0.766044    0.022712    0.037836  1.250269  0.037525  0.149641  0.316589
3  3  0.668402    0.024219    0.031938  1.250269  0.037525  0.148027  0.415451
4  4  0.423496    0.020548    0.018001  1.250269  0.037525  0.154227  0.557743
5  5  0.237175    0.023561    0.007481  1.250269  0.037525  0.159904  0.750544
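For readers without pandas at hand, the same cleaning idea can be sketched with the standard library only. The file layout is not shown in the question, so everything here is an assumption: junk lines start with '#', the last '#' line before the data holds the column names, and columns are separated by two or more spaces. The function name parse_spaced_table and the sample text are invented for illustration.

```python
import re

def parse_spaced_table(text):
    """Return (columns, rows) from text whose junk lines start with '#'."""
    header = None
    rows = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        if stripped.startswith('#'):
            if not rows:
                header = stripped  # remember the latest '#' line seen before any data
            continue
        # two or more spaces separate columns, so "stat (+/-)" stays in one piece
        rows.append(re.split(r'\s{2,}', stripped))
    columns = re.split(r'\s{2,}', header.lstrip('#').strip()) if header else []
    return columns, rows

# Invented sample following the assumed layout
sample = """##########
# some notes ***
##########
#  i  mult  stat (+/-)  syst (+/-)
0  0.32  0.018  0.026
1  0.66  0.023  0.033
"""
columns, rows = parse_spaced_table(sample)
print(columns)
print(rows)
```

The resulting columns and rows could then be handed to pd.DataFrame(rows, columns=columns) if a dataframe is needed.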

How to merge multiple rows into single cell in csv separated by comma using Python

Here is my problem:
In my CSV file, I have only one column and multiple rows containing telephone numbers.
a
1 222
2 333
3 444
4 555
What I want is to merge them into one string, separated by commas, e.g.:
a
1 222,333,444,555
The code I am using right now is:
import csv
b = open('test.csv', 'wb')
a = csv.writer(b)
s = ''
with open ("book2.csv", "rb") as annotate:
    for col in annotate:
        ann = col.lower().split(",")
        s += ann[0] + ','
s = s[:-1] # Remove last comma
a.writerow([s])
b.close()
what I get from this is
a
1 222,
333,
444,
555
All the numbers are in one cell now (good), but they are not on one line (there is a \r\n after each telephone number, so I think that's why they are not on one line). Thank you in advance!
Strip the line ending from each line before splitting it:
import csv
b = open('test.csv', 'wb')
a = csv.writer(b)
s = ''
with open ("book2.csv", "rb") as annotate:
    for col in annotate:
        ann = col.lower().strip('\r\n').split(",")  # drop the trailing \r\n
        s += ann[0] + ','
s = s[:-1] # Remove last comma
a.writerow([s])
b.close()
You're using the csv module, but ignoring csv.reader. It handles all the parsing for you:
#!python2
import csv
with open('book2.csv','rb') as inf, open('test.csv','wb') as outf:
    r = csv.reader(inf)
    w = csv.writer(outf)
    L = [row for row in r] # Read all lines as list of lists.
    L = zip(*L) # transpose all the lines.
    w.writerows(L) # Write them all back out.
Input:
222
333
444
555
Output in the .csv file:
222,333,444,555
EDIT: I see now that you want the data in Excel to be in a single cell. The above will put it in a row of four cells.
The following will write a single cell:
#!python2
import csv
with open('book2.csv') as inf, open('test.csv','wb') as outf:
    w = csv.writer(outf)
    data = inf.read().splitlines()
    w.writerow([','.join(data)])
Output in the .csv file:
"222,333,444,555"
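The answers above are written for Python 2 (binary file modes with the csv module). A minimal Python 3 sketch of the single-cell version is shown here, using an in-memory buffer so it stays self-contained; the function name single_cell_csv is made up, and with real files you would open the output with newline='' instead of using io.StringIO.

```python
import csv
import io

def single_cell_csv(lines):
    """Join one-value-per-line input into a single quoted CSV cell."""
    # strip() removes the trailing \r\n that kept the numbers on separate lines
    values = [line.strip() for line in lines if line.strip()]
    buf = io.StringIO()
    # the joined string contains commas, so csv.writer quotes it as one field
    csv.writer(buf, lineterminator="\n").writerow([",".join(values)])
    return buf.getvalue()

print(single_cell_csv(["222\r\n", "333\r\n", "444\r\n", "555"]))
```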