Please help with the challenge below: I want to replace a value in a specific column (comma-separated data) when a match is found.
file.csv contains a number of rows of comma-separated values. Using the code below, for each row I look at the first column field and the second column field.
If the column1 field == column2 field, delete the first 2 fields and write that row to a file named after column1.
If the column1 field != column2 field, delete the first 2 fields and write that row to two separate files (one named after column1, one after column2).
If the column1 field is empty but the column2 field exists, delete the first 2 fields and write that row to the file named after column2, and vice versa.
My challenge is that, before writing the file, I need to change the column5 value to the column0/1 value based on the conditions below.
import datetime
import csv

Y = open('file.csv', "r").readlines()
timestamp = '_' + '{:%Y%m%d%H%M%S}'.format(datetime.datetime.now())

for x in Y:
    csvdata = x.split(",")
    up = ','.join(csvdata[2:])  # this drops the first 2 fields
    if csvdata[0] == csvdata[1]:
        with open(csvdata[0] + timestamp + '.csv', 'a') as f:
            f.write(up)
    elif csvdata[0] != csvdata[1] and csvdata[1] != '' and csvdata[0] != '':
        with open(csvdata[0] + timestamp + '.csv', 'a') as f:
            f.write(up)
        with open(csvdata[1] + timestamp + '.csv', 'a') as f:
            f.write(up)
    elif csvdata[1] != '' and csvdata[0] == '':
        with open(csvdata[1] + timestamp + '.csv', 'a') as f:
            f.write(up)
    elif csvdata[0] != '' and csvdata[1] == '':
        with open(csvdata[0] + timestamp + '.csv', 'a') as f:
            f.write(up)
file.csv:
apple,orange,0,1,orange,30 --> goes to BOTH the apple and orange files (with the 5th field replaced)
apple,'',0,2,orange,30 --> goes to the apple file (5th field orange replaced with apple)
'',orange,0,3,apple,30 --> goes to the orange file (5th field apple replaced with orange)
apple,apple,0,4,orange,30 --> goes to the apple file (5th field orange replaced with apple)
orange,orange,0,5,apple,30 --> goes to the orange file (5th field apple replaced with orange)
expected output:
apple_20200402134567.csv
0,1,apple,30
0,2,apple,30
0,4,apple,30
orange_20200402134567.csv
0,1,orange,30
0,3,orange,30
0,5,orange,30
Please help me add code to the above to replace the 5th field based on these match conditions.
Thanks in advance.
The following code uses the csv module to read and write the files. This ensures that the quoted empty fields '' are read as empty strings. Instead of the if/elif sequence, it uses a Python set to determine the relevant file names to write to. If there are many lines, the open-file/append-a-line/close-file pattern is inefficient; it would be better either to cache the csv.writer objects in a dictionary or to accumulate all the rows in memory and write all the files at the end.
import datetime
import csv

timestamp = '{:%Y%m%d%H%M%S}'.format(datetime.datetime.now())

with open('file.csv') as csvfile:
    reader = csv.reader(csvfile, quotechar="'")
    for row in reader:
        names = set(i for i in row[0:2] if i)
        for name in names:
            with open('{}_{}.csv'.format(name, timestamp), 'a') as output:
                writer = csv.writer(output, quotechar="'")
                row[4] = name
                writer.writerow(row[2:])
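As mentioned above, one way to avoid the repeated open/append/close is to cache the csv.writer objects in a dictionary. A minimal sketch of that idea (the sample file contents written at the top and the get_writer helper are illustrative, not from the original post):

```python
import csv
import datetime

# Build a small sample file.csv like the one in the question.
with open('file.csv', 'w', newline='') as f:
    f.write("apple,orange,0,1,orange,30\n")
    f.write("apple,'',0,2,orange,30\n")
    f.write("'',orange,0,3,apple,30\n")

timestamp = '{:%Y%m%d%H%M%S}'.format(datetime.datetime.now())
files, writers = {}, {}

def get_writer(name):
    # Open each output file once and cache its csv.writer.
    if name not in writers:
        files[name] = open('{}_{}.csv'.format(name, timestamp), 'w', newline='')
        writers[name] = csv.writer(files[name], quotechar="'")
    return writers[name]

with open('file.csv', newline='') as csvfile:
    for row in csv.reader(csvfile, quotechar="'"):
        for name in set(i for i in row[0:2] if i):
            out = row[2:]
            out[2] = name  # replace the original 5th field
            get_writer(name).writerow(out)

for f in files.values():
    f.close()
```

Note that each output file is opened once in 'w' mode, so rerunning within the same second overwrites it rather than appending.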
I have a text file containing data that I want to reorganize into a different shape. The file contains lines of values separated by semicolons, with no header. Some lines contain values that are children of other lines. I can distinguish them by a code (1 or 2) and by their order: a child value is always on the line after its parent value. The number of child elements differs from one line to another, and parent values can have no child values.
To be more explicit, here is a data sample:
030;001;1;AD0192;
030;001;2;AF5612;AF5613;AF5614
030;001;1;CD0124;
030;001;2;CD0846;CD0847;CD0848
030;002;1;EG0376;
030;002;2;EG0666;EG0667;EG0668;EG0669;EG0670;
030;003;1;ZB0001;
030;003;1;ZB0002;
030;003;1;ZB0003;
030;003;2;ZB0004;ZB0005
The structure is:
The first three characters are an id.
The next three characters are also an id.
The next one is a code: when 1, the value (named "key" in my example) is a parent; when 2, the values are children of the line before.
The values after are keys, either parents or children.
I want to store the child values (code 2) in a list, on the same line as their parent value.
Here is an example with my data sample above and a header:
id1;id2;key;children;
030;001;AD0192;[AF5612,AF5613,AF5614]
030;001;CD0124;[CD0846,CD0847,CD0848]
030;002;EG0376;[EG0666,EG0667,EG0668,EG0669,EG0670]
030;003;ZB0001;
030;003;ZB0002;
030;003;ZB0003;[ZB0004,ZB0005]
I'm able to build a CSV-delimited file from this text source file, add a header, use a DictReader to manipulate my columns easily, and write conditions to identify my parent and child values.
But how do I store the hierarchical elements (code 2) in a list on the same line as their parent key?
Here is my actual script in Python:
import csv

inputTextFile = 'myOriginalFile.txt'
csvFile = 'myNewFile.csv'
countKey = 0
countKeyParent = 0
countKeyChildren = 0

with open(inputTextFile, 'r', encoding='utf-8') as inputFile:
    stripped = (line.strip() for line in inputFile)
    lines = (line.split(";") for line in stripped if line)
    # Write a CSV file
    with open(csvFile, 'w', newline='', encoding='utf-8') as outputCsvFile:
        writer = csv.writer(outputCsvFile, delimiter=';')
        writer.writerow(('id1', 'id2', 'code', 'key', 'children'))
        writer.writerows(lines)

# Read the CSV
with open(csvFile, 'r', newline='', encoding='utf-8') as myCsvFile:
    csvReader = csv.DictReader(myCsvFile, delimiter=';', quotechar='"')
    for row in csvReader:
        countKey += 1
        if '1' in row['code']:
            countKeyParent += 1
            print("Parent: " + row['key'])
        elif '2' in row['code']:
            countKeyChildren += 1
            print("Children: " + row['key'])

print(f"----\nSum of elements: {countKey}\nParents keys: {countKeyParent}\nChildren keys: {countKeyChildren}")
A simple solution might be the following. It first loads your data as a list of rows, each a list of strings, then builds the hierarchy you've explained, and finally writes the output to a CSV file.
from typing import List

ID_FIRST = 0
ID_SECOND = 1
PARENT_FIELD = 2
KEY_FIELD = 3
IS_PARENT = "1"
IS_CHILDREN = "2"

def read(where: str) -> List[List[str]]:
    with open(where) as fp:
        data = fp.readlines()
    rows = []
    for line in data:
        fields = line.strip().split(';')
        rows.append([fields[ID_FIRST],
                     fields[ID_SECOND],
                     fields[PARENT_FIELD],
                     *[item for item in fields[KEY_FIELD:]
                       if item != ""]])
    return rows

def assign_parents(rows: List[List[str]]):
    parent_idx = 0
    for idx, fields in enumerate(rows):
        if fields[PARENT_FIELD] == IS_PARENT:
            parent_idx = idx
        if fields[PARENT_FIELD] == IS_CHILDREN:
            rows[parent_idx] += fields[KEY_FIELD:]

def write(where: str, rows: List[List[str]]):
    with open(where, 'w') as file:
        file.write("id1;id2;key;children;\n")
        for fields in rows:
            if fields[PARENT_FIELD] == IS_CHILDREN:
                # These have been grouped into their parents.
                continue
            string = ";".join(fields[:PARENT_FIELD])
            string += ";" + fields[KEY_FIELD] + ";"
            if len(fields[KEY_FIELD + 1:]) != 0:  # has children?
                children = ",".join(fields[KEY_FIELD + 1:])
                string += "[" + children + "]"
            file.write(string + '\n')

def main():
    rows = read('myOriginalFile.txt')
    assign_parents(rows)
    write('myNewFile.csv', rows)

if __name__ == "__main__":
    main()
For your example I get
id1;id2;key;children;
030;001;AD0192;[AF5612,AF5613,AF5614]
030;001;CD0124;[CD0846,CD0847,CD0848]
030;002;EG0376;[EG0666,EG0667,EG0668,EG0669,EG0670]
030;003;ZB0001;
030;003;ZB0002;
030;003;ZB0003;[ZB0004,ZB0005]
which appears to be correct.
I have a csv file where all the values are encased in quotation marks and separated by commas, with each record on one line.
The fields in each row are split by commas; two consecutive commas mean there is a missing value. I would like to separate the columns by these rules. Where a field is enclosed in quotation marks, a comma inside the quotation marks should not act as a separator, because that field is an address.
This is a sample of the data (it's a csv; I converted it into a dictionary to show you a sample):
{'Store code,"Biz","Add","Labels","TotalSe","DirectSe","DSe","TotalVe","SeVe","MaVe","Totalac","Webact","Dions","Ps"': {0: ',,,,"Numsearching","Numsearchingbusiness","Numcatprod","Numview","Numviewed","Numviewed2","Numaction","Numwebsite","Numreques","Numcall"',
1: 'Nora,"Ora","Sgo, Mp, 2000",,111,44,33,121,1232,53411,4,5,,3',
2: 'mc11,"21 old","tjis that place, somewher, Netherlands, 2434",,3245,325,52454,3432,243,4353,343,23,23,18'}}
I have tried this so far and am a bit stuck:
disc = pd.read_csv('/content/gdrive/My Drive/blank/blank.csv',delimiter='",')
I use normal string functions to remove the " at both ends of every line, and I convert each "" into a single ".
This way I get a CSV which I can load with read_csv():
import pandas as pd

f1 = open('Sample - Sheet1.csv')
f2 = open('temp.csv', 'w')

for row in f1:
    row = row.strip()             # remove "\n"
    row = row[1:-1]               # remove " on both ends
    row = row.replace('""', '"')  # convert "" into "
    f2.write(row + '\n')

f2.close()
f1.close()

df = pd.read_csv('temp.csv')
print(len(df.columns))
print(df)
Another method: read it with the csv module and save each row as a normal string:
import csv
import pandas as pd

f1 = open('Sample - Sheet1.csv')
f2 = open('temp.csv', 'w')

reader = csv.reader(f1)
for row in reader:
    f2.write(row[0] + '\n')

f2.close()
f1.close()

df = pd.read_csv('temp.csv')
print(len(df.columns))
print(df)
I have an excel document that I have exported as CSV. It looks like this:
"First Name","Last Name","First Name","Last Name","Address","City","State"
"Bob","Robertson","Roberta","Robertson","123 South Street","Salt Lake City","UT"
"Leo","Smart","Carter","Smart","827 Cherry Street","Macon","GA"
"Mats","Lindgren","Lucas","Lindgren","237 strawberry xing","houston","tx"
I have a class called "Category" that has a name variable. My code makes a category for each of the first-line strings, but now I need to add each item to the column it is supposed to go in.
import xlutils
from difflib import SequenceMatcher
from address import AddressParser, Address
from nameparser import HumanName
import xlrd
import csv

class Category:
    name = ""
    contents = []
    index = 0

columns = []
alltext = ""

with open('test.csv', 'rb') as csvfile:
    document = csv.reader(csvfile, delimiter=',', quotechar='"')
    for row in document:
        alltext = alltext + ', '.join(row) + "\n"

splitText = alltext.split('\n')
categoryNames = splitText[0].split(', ')

ixt = 0
for name in categoryNames:
    thisCategory = Category()
    thisCategory.name = name
    thisCategory.index = ixt
    columns.append(thisCategory)
    ixt = ixt + 1

for line in splitText:
    if line != splitText[0] and len(line) != 0:
        individualItems = line.split(', ')
        for index, item in enumerate(individualItems):
            if columns[index].index == index:
                print(item + " (" + str(index) + ") is being sent to " + columns[index].name)
                columns[index].contents.append(item)

for col in columns:
    print("-----" + col.name + " (" + str(col.index) + ")-----")
    for stuff in col.contents:
        print(stuff)
As the code runs, it gives an output for each item that says:
Bob (0) is being sent to First Name
Robertson (1) is being sent to Last Name
That is what it should be doing: every item says it is being sent to the correct category. At the end, however, instead of each item being in the category it claims, every category has every item. Instead of this:
-----First Name-----
Bob
Roberta
Leo
Carter
Mats
Lucas
and so on and so forth for each of the categories, I get this:
-----First Name-----
Bob
Robertson
Roberta
Robertson
123 South Street
Salt Lake City
UT
Leo
Smart
Carter
Smart
827 Cherry Street
Macon
GA
Mats
Lindgren
Lucas
Lindgren
237 strawberry xing
houston
tx
I don't know what is going wrong. There is nothing in between those two lines of code that could possibly be messing it up.
The problem is that you defined class-level variables for Category, not instance variables. That was mostly harmless for
thisCategory.name = name
thisCategory.index = ixt
because that created instance variables for each object that mask the class variable. But
columns[index].contents.append(item)
is different. It gets the single class-level contents list and appends the data regardless of which instance happens to be active at the time.
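The sharing can be seen in a few lines (a hypothetical minimal class, not the full one from the question):

```python
class Category:
    contents = []          # class-level: one list shared by every instance

a = Category()
b = Category()
a.contents.append('x')     # mutates the single shared list

print(a.contents is b.contents)  # True: both names refer to the same list
print(b.contents)                # ['x']
```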
The solution is to use instance variables created in __init__. Also, you were doing too much work reassembling things into strings then breaking them out again. Just process the columns as the rows are read.
# import xlutils
# from difflib import SequenceMatcher
# from address import AddressParser, Address
# from nameparser import HumanName
# import xlrd
import csv

class Category:
    def __init__(self, index, name):
        self.name = name
        self.index = index
        self.contents = []

with open('test.csv', 'r', newline='') as csvfile:
    document = csv.reader(csvfile, delimiter=',', quotechar='"')
    # create categories from the first row
    columns = [Category(index, name)
               for index, name in enumerate(next(document))]
    # append the cells from the rest of the file
    for row in document:
        if row:
            for index, cell in enumerate(row):
                columns[index].contents.append(cell)

for col in columns:
    print("-----" + col.name + " (" + str(col.index) + ")-----")
    for stuff in col.contents:
        print(stuff)
3 comments:
You aren't taking the first field into account: you take an empty string alltext = "" and the first thing you do is add a comma, which pushes everything one field over. You would need to test whether you are on the first row.
You are opening a csv ... then twisting it back into a text file. This looks like it is because the csv reader will field-separate the values and you want to do that manually later on. If you open the file as a text file in the first place and read it using read, you don't need the first part of the code (unless you have done something very strange to your csv; since we don't have a sample to examine, I can't comment on that).
with open('test.csv', 'r') as f:
    document = f.read()
will give you the correctly formatted alltext string.
This is a good use-case for csv.DictReader, which will give you the fields in a structured format. See this StackOverflow question as an example and the documentation.
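A minimal DictReader sketch (using an in-memory sample; note that DictReader keys each row by header name, so this only works cleanly when the header names are unique, which the sample in the question is not, since "First Name" and "Last Name" each appear twice):

```python
import csv
import io

# Hypothetical sample with unique header names.
sample = io.StringIO(
    '"First Name","Last Name","City"\n'
    '"Bob","Robertson","Salt Lake City"\n'
    '"Leo","Smart","Macon"\n'
)

reader = csv.DictReader(sample)
# One list per column, keyed by header name.
columns = {name: [] for name in reader.fieldnames}
for row in reader:
    for name, value in row.items():
        columns[name].append(value)

print(columns['First Name'])  # ['Bob', 'Leo']
```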
Try using the statement below for reading the csv:
import csv

data = []
with open("test.csv") as f:
    document = csv.reader(f)
    for line in document:
        data.append(line)

data[0] will then have all the category names.
If I have multiple text files that I need to parse that look like so, but can vary in terms of column names and the length of the commented header at the top:
How would I go about turning this into a pandas dataframe? I've tried using pd.read_table('file.txt', delim_whitespace=True, skiprows=14), but it has all sorts of problems. My issues are:
All the text, asterisks, and pound signs at the top need to be ignored, but I can't just use skiprows because the junk at the top can vary in length from file to file.
The columns "stat (+/-)" and "syst (+/-)" are seen as 4 columns because of the whitespace.
The pound sign is included in the column names, and I don't want that. I can't just assign the column names manually because they vary from text file to text file.
Any help is much obliged; I'm just not really sure where to go after I read the file with pandas.
Consider reading in the raw file and cleaning it line by line while writing to a new file using the csv module. A regex is used to identify the column header row, with the i column as the match criterion. The code below assumes columns are separated by more than one space:
import os
import csv
import re
import pandas as pd

rawfile = "path/To/RawText.txt"
tempfile = "path/To/TempText.txt"

with open(tempfile, 'w', newline='') as output_file:
    writer = csv.writer(output_file)
    with open(rawfile, 'r') as data_file:
        for line in data_file:
            if re.match('^.*i', line):           # KEEP COLUMN HEADER ROW
                line = line.replace('\n', '')
                row = line.split("  ")
                writer.writerow(row)
            elif line.startswith('#') == False:  # REMOVE HASHTAG LINES
                line = line.replace('\n', '')
                row = line.split("  ")
                writer.writerow(row)

df = pd.read_csv(tempfile)                               # IMPORT TEMP FILE
df.columns = [c.replace('# ', '') for c in df.columns]   # REMOVE '#' IN COL NAMES
os.remove(tempfile)                                      # DELETE TEMP FILE
This is the approach I mentioned in the comment: it uses a file object to skip the custom dirty data at the beginning. You land the file offset at the appropriate location in the file, where read_fwf simply does the job:
with open(rawfile, 'r') as data_file:
    while data_file.read(1) == '#':
        last_pound_pos = data_file.tell()
        data_file.readline()
    data_file.seek(last_pound_pos)
    df = pd.read_fwf(data_file)
df
Out[88]:
i mult stat (+/-) syst (+/-) Q2 x x.1 Php
0 0 0.322541 0.018731 0.026681 1.250269 0.037525 0.148981 0.104192
1 1 0.667686 0.023593 0.033163 1.250269 0.037525 0.150414 0.211203
2 2 0.766044 0.022712 0.037836 1.250269 0.037525 0.149641 0.316589
3 3 0.668402 0.024219 0.031938 1.250269 0.037525 0.148027 0.415451
4 4 0.423496 0.020548 0.018001 1.250269 0.037525 0.154227 0.557743
5 5 0.237175 0.023561 0.007481 1.250269 0.037525 0.159904 0.750544
I am trying to create a clean csv file by merging some variables from an old file and appending them to a new csv file.
I have no problem running the data the first time; I get the output I want. But whenever I try to append a new variable (i.e. a new column), it appends the variable to the bottom and the output is wonky.
I have basically been running the same code for each variable, just changing the
groupvariables variable to my desired variables and then using f2 = open('outputfile.csv', "ab"), with "ab" to append. Any help would be appreciated.
import csv

groupvariables = ['x', 'y']
f2 = open('outputfile.csv', "wb")
writer = csv.writer(f2, delimiter=",")
writer.writerow(("ID", "Diagnosis"))

for line in csv_f:
    line = line.rstrip('\n')
    columns = line.split(",")
    tempname = columns[0]
    tempindvar = columns[1:]
    templist = []
    for j in groupvariables:
        tempvar = tempindvar[headers.index(j)]
        if tempvar != ".":
            templist.append(tempvar)
    newList = list(set(templist))
    if len(newList) > 1:
        output = 'nomatch'
    elif len(newList) == 0:
        output = "."
    else:
        output = newList[0]
    tempoutrow = (tempname, output)
    writer.writerow(tempoutrow)

f2.close()
CSV is a line-based file format, so the only way to add a column to an existing CSV file is to read it into memory and overwrite it entirely, adding the new column to each line.
If all you want to do is add lines, though, appending will work fine.
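A sketch of that read-into-memory approach, assuming the new column values can be looked up by ID (the file contents, the newcol mapping, and the 'NewVar' name here are hypothetical):

```python
import csv

# Create a small existing file for the demo.
with open('outputfile.csv', 'w', newline='') as f:
    csv.writer(f).writerows([['ID', 'Diagnosis'], ['p1', 'nomatch'], ['p2', '.']])

# New column values keyed by ID.
newcol = {'p1': 'x1', 'p2': 'x2'}

# Read the whole file into memory, extend each row, then rewrite the file.
with open('outputfile.csv', newline='') as f:
    rows = list(csv.reader(f))
rows[0].append('NewVar')
for row in rows[1:]:
    row.append(newcol.get(row[0], '.'))
with open('outputfile.csv', 'w', newline='') as f:
    csv.writer(f).writerows(rows)
```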
Here is something that might help. I assumed the first field of each row in each csv file is a primary key for the record and can be used to match rows between the two files. The code below reads the records from one file and stores them in a dictionary, then reads the records from the other file and appends those values to the dictionary, and finally writes out a new file. You can adapt this example to better fit your actual problem.
import csv

# using python3
db = {}

reader = csv.reader(open('t1.csv', 'r'))
for row in reader:
    key, *values = row
    db[key] = ','.join(values)

reader = csv.reader(open('t2.csv', 'r'))
for row in reader:
    key, *values = row
    if key in db:
        db[key] = db[key] + ',' + ','.join(values)
    else:
        db[key] = ','.join(values)

writer = open('combo.csv', 'w')
for key in sorted(db.keys()):
    writer.write(key + ',' + db[key] + '\n')
writer.close()