I'm working with a text file of data that I want to reorganize into a different shape. The file contains lines of values separated by semicolons, with no header. Some lines contain values that are children of other lines. I can distinguish them by a code (1 or 2) and by their order: a child line always comes directly after its parent line. The number of child elements differs from one line to another, and a parent may have no children at all.
To be more explicit, here is a data sample:
030;001;1;AD0192;
030;001;2;AF5612;AF5613;AF5614
030;001;1;CD0124;
030;001;2;CD0846;CD0847;CD0848
030;002;1;EG0376;
030;002;2;EG0666;EG0667;EG0668;EG0669;EG0670;
030;003;1;ZB0001;
030;003;1;ZB0002;
030;003;1;ZB0003;
030;003;2;ZB0004;ZB0005
The structure is:
The first three characters are an id.
The next three characters are also an id.
The next character is a code: when it is 1, the value (named key in my example) is a parent; when it is 2, the values are children of the line before.
The values after the code are keys, either a parent or children.
I want to store the child values (code 2) in a list on the same line as their parent value.
Here is an example with my data sample above and a header:
id1;id2;key;children;
030;001;AD0192;[AF5612,AF5613,AF5614]
030;001;CD0124;[CD0846,CD0847,CD0848]
030;002;EG0376;[EG0666,EG0667,EG0668,EG0669,EG0670]
030;003;ZB0001;
030;003;ZB0002;
030;003;ZB0003;[ZB0004,ZB0005]
I'm able to build a delimited CSV file from this source text file, add a header, and use a DictReader to handle my columns easily, with conditions that identify parent and child values.
But how can I store the hierarchical elements (code 2) in a list on the same line as their parent key?
Here is my current Python script:
import csv

inputTextFile = 'myOriginalFile.txt'
csvFile = 'myNewFile.csv'
countKey = 0
countKeyParent = 0
countKeyChildren = 0

with open(inputTextFile, 'r', encoding='utf-8') as inputFile:
    stripped = (line.strip() for line in inputFile)
    lines = (line.split(";") for line in stripped if line)
    # Write a CSV file
    with open(csvFile, 'w', newline='', encoding='utf-8') as outputCsvFile:
        writer = csv.writer(outputCsvFile, delimiter=';')
        writer.writerow(('id1','id2', 'code', 'key', 'children'))
        writer.writerows(lines)

# Read the CSV
with open(csvFile, 'r', newline='', encoding='utf-8') as myCsvFile:
    csvReader = csv.DictReader(myCsvFile, delimiter=';', quotechar='"')
    for row in csvReader:
        countKey +=1
        if '1' in row['code'] :
            countKeyParent += 1
            print("Parent: " + row['key'])
        elif '2' in row['code'] :
            countKeyChildren += 1
            print("Children: " + row['key'])

print(f"----\nSum of elements: {countKey}\nParents keys: {countKeyParent}\nChildren keys: {countKeyChildren}")
A simple solution might be the following. I first load your data as a list of rows, each a list of strings. Then I build the hierarchy you've described and write the output to a CSV file.
from typing import List

ID_FIRST = 0
ID_SECOND = 1
PARENT_FIELD = 2
KEY_FIELD = 3
IS_PARENT = "1"
IS_CHILDREN = "2"

def read(where: str) -> List[List[str]]:
    with open(where) as fp:
        data = fp.readlines()
    rows = []
    for line in data:
        fields = line.strip().split(';')
        rows.append([fields[ID_FIRST],
                     fields[ID_SECOND],
                     fields[PARENT_FIELD],
                     *[item for item in fields[KEY_FIELD:]
                       if item != ""]])
    return rows

def assign_parents(rows: List[List[str]]):
    parent_idx = 0
    for idx, fields in enumerate(rows):
        if fields[PARENT_FIELD] == IS_PARENT:
            parent_idx = idx
        if fields[PARENT_FIELD] == IS_CHILDREN:
            rows[parent_idx] += fields[KEY_FIELD:]

def write(where: str, rows: List[List[str]]):
    with open(where, 'w') as file:
        file.write("id1;id2;key;children;\n")
        for fields in rows:
            if fields[PARENT_FIELD] == IS_CHILDREN:
                # These have been grouped into their parents.
                continue
            string = ";".join(fields[:PARENT_FIELD])
            string += ";" + fields[KEY_FIELD] + ";"
            if len(fields[KEY_FIELD + 1:]) != 0:  # has children?
                children = ",".join(fields[KEY_FIELD + 1:])
                string += "[" + children + "]"
            file.write(string + '\n')

def main():
    rows = read('myOriginalFile.txt')
    assign_parents(rows)
    write('myNewFile.csv', rows)

if __name__ == "__main__":
    main()
For your example I get
id1;id2;key;children;
030;001;AD0192;[AF5612,AF5613,AF5614]
030;001;CD0124;[CD0846,CD0847,CD0848]
030;002;EG0376;[EG0666,EG0667,EG0668,EG0669,EG0670]
030;003;ZB0001;
030;003;ZB0002;
030;003;ZB0003;[ZB0004,ZB0005]
which appears to be correct.
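The same grouping can also be done in a single pass, without keeping the child rows around. This is only a minimal sketch, not a drop-in replacement: the function name group_children is made up for illustration, and it assumes that every code-2 line directly follows its parent, as described in the question.

```python
from typing import Iterable, List


def group_children(lines: Iterable[str]) -> List[str]:
    """Fold each code-2 line into the code-1 line that precedes it.

    Hypothetical helper: assumes ';'-separated lines laid out as
    id1;id2;code;key[;key...] and that a child line follows its parent.
    """
    parents: List[List[str]] = []  # each entry: [id1, id2, key, child, ...]
    for line in lines:
        # Drop the empty fields produced by trailing semicolons.
        fields = [f for f in line.strip().split(";") if f]
        if len(fields) < 4:
            continue  # skip blank or incomplete lines
        id1, id2, code, *keys = fields
        if code == "1":
            parents.append([id1, id2, keys[0]])
        elif code == "2" and parents:
            parents[-1].extend(keys)

    out = []
    for id1, id2, key, *children in parents:
        suffix = "[" + ",".join(children) + "]" if children else ""
        out.append(f"{id1};{id2};{key};{suffix}")
    return out
```

Because it takes any iterable of strings, it can be fed an open file object directly, and the returned lines can be written out after the header line.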
I have multiple CSV files whose fields are separated by a tab (\t).
Each file contains rows of data with a timeline in the leftmost column.
Note: multiple data sets can share the same timeline, and I cannot detect any pattern.
I want to loop through all the files and, in each file, loop through each header (word).
Then I want to copy the data whose header name (for example I51_RH_T) contains a keyword (for example I51) into a new file (called I51), together with the corresponding timeline.
I also want to cycle through multiple keywords for the header.
This is what I have managed so far:
import re, csv, os

keywords = ["I01", "I02", "I03"]
Input_Dir = "E:/MA/05_Sensordaten/Sensordaten txt/"
files = os.listdir(Input_Dir)
i = 0
zz = 0
for each in keywords:
    keyword = str(keywords[i])
    #print(files)
    for file in files:
        if zz <= len(files) -1:
            #print(files[zz])
            file_and_path = Input_Dir + files[zz]
            with open(file_and_path, "r") as csv_file:
                csv_reader = csv.reader(csv_file, delimiter = "\t")
                #print(csv_file)
                with open("G:/MA/05_Sensordaten/Sensordaten sortiert/Aufbau" + keyword + ".txt", "w") as new_file:
                    csv_writer = csv.writer(new_file, delimiter=" ") # \t = TAB
                    j = 0
                    for words in csv_reader:
                        if j <= 400:
                            j += 1
                            while re.match(keyword, words[j]):
                                print(words[j])
                                j += 1
                        else:
                            break
            zz += 1
    i += 1
Right now I get the error that the list index is out of range.
What is missing in the code is the part where I copy the header with the corresponding timeline into the new file.
The CSV files I want to extract the data from look like this:
timeline0 I51_Rh_T I54_Rh_Rh I57_T ........
01.10.20100:00 8,47 54,67 20,54 ......
..................
Any help would be welcomed!
Sincerely Daniel
for word in csv_reader:
    x = re.search(keyword, word)  # search csv_reader for the Aufbau (setup) keyword
    print(x)
Here word is a dictionary object.
You have to add one more loop over the keys or values of that dictionary, e.g.
for word in csv_reader:
    for key in word.keys():
        x = re.search(keyword, key)  # search csv_reader for the Aufbau (setup) keyword
        print(x)
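For the stated goal (the timeline plus every column whose header matches a keyword), it may be simpler to work out the wanted column indexes once from the header row. The sketch below is hedged: the name columns_for_keyword is hypothetical, and it assumes tab-separated input, a header in the first row, and the timeline in the first column.

```python
import csv
import io


def columns_for_keyword(tsv_text, keyword):
    """Return the timeline plus every column whose header starts with keyword.

    Assumptions (not confirmed by the question): tab-separated input,
    header in the first row, timeline in the first column.
    """
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    header = next(reader)
    # Column 0 is the timeline; keep it plus the matching data columns.
    wanted = [0] + [i for i, name in enumerate(header)
                    if i > 0 and name.startswith(keyword)]
    result = [[header[i] for i in wanted]]
    for row in reader:
        # Guard against short rows, one possible source of
        # "list index out of range".
        result.append([row[i] if i < len(row) else "" for i in wanted])
    return result
```

The returned rows can be passed to csv.writer(...).writerows(...) on a file opened once per keyword, so that files processed later do not overwrite earlier output.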
For a current project, I am planning to run several iterations of the script below and to save the results in different CSV files with a new file for each iteration (the CSV part is at the end of the script).
The given code currently shows the relevant results in the terminal while it only creates empty CSV files. I have spent days figuring out how to solve the situation but cannot get to a solution. Is there anyone who can help?
Note: I have updated the code in accordance with user recommendations while the original issue/challenge still persists.
import string
import json
import csv
import pandas as pd
import datetime
from dateutil.relativedelta import *
import numpy as np
import matplotlib.pyplot as plt

# Loading and reading dataset
file = open("Glassdoor_A.json", "r")
data = json.load(file)
df = pd.json_normalize(data)
df['Date'] = pd.to_datetime(df['Date'])

# Allocate periods for individual CSV file names
periods = pd.period_range('2009Q1','2018Q4',freq='Q')
ts = pd.Series(np.random.randn(40), periods)
type(ts.index)
intervals = ts.index

# Create individual empty files with headers
for i in intervals:
    name = 'Glassdoor_A_' + 'Text Main_' + str(i)
    with open(name+'.csv', 'w', newline='') as file:
        writer = csv.writer(file)

# Create an empty dictionary
d = dict()

# Filtering by date
start_date = pd.to_datetime('2009-01-01')
end_date = pd.to_datetime('2009-03-31')
last_end_date = pd.to_datetime('2017-12-31')
mnthBeg = pd.offsets.MonthBegin(3)
mnthEnd = pd.offsets.MonthEnd(3)
while end_date <= last_end_date:
    filtered_dates = df[df.Date.between(start_date, end_date)]
    n = len(filtered_dates.index)
    print(f'Date range: {start_date.strftime("%Y-%m-%d")} - {end_date.strftime("%Y-%m-%d")}, {n} rows.')
    if n > 0:
        print(filtered_dates)
    start_date += mnthBeg
    end_date += mnthEnd

    # Processing Text Main section
    for index, row in filtered_dates.iterrows():
        line = row['Text Main']
        # Remove the leading spaces and newline character
        line = line.split(' ')
        line = [val.strip() for val in line]
        # Convert the characters in line to
        # lowercase to avoid case mismatch
        line = [val.lower() for val in line]
        # Remove the punctuation marks from the line
        line = [val.translate(val.maketrans("", "", string.punctuation)) for val in line]
        print(line)
        # Split the line into words
        # words = [val.split(" ") for val in line]
        # print(words)
        # Iterate over each word in line
        for word in line:
            # Check if the word is already in dictionary
            if word in d.keys():
                # Increment count of word by 1
                d[word] = d[word] + 1
            else:
                # Add the word to dictionary with count 1
                d[word] = 1
        print(d)
        # Print the contents of dictionary
        for key in list(d.keys()):
            print(key, ":", d[key])
            # Count the total number of words
            total = sum(d.values())
            percent = d[key] / total
            print(d[key], total, percent)

# Save as CSV file
while end_date <= last_end_date:
    for index, row in filtered_dates.iterrows():
        for i in data:
            name = 'Glassdoor_A_' + str(i)
            with open(name+'.csv', 'a', newline='') as file:
                writer.writerow(["Word", "Occurrences", "Percentage"])
                writer.writerows([key, d[key], percent] for key in list(d.keys()))
Regarding your inner loop that writes the CSV files:
# Create individual file names
for i in data:
    name = 'Glassdoor_A_' + str(i)
    # Save output in CSV file
    with open(name+'.csv', 'w', newline='') as file:
        ...
This block is executed on every iteration of the outer loop for index, row in filtered_dates.iterrows():, so each iteration will overwrite the previously created files. Try using mode 'a' (append) and writing the headers, without any data, outside of these two loops.
Without getting into the details of what you're calculating and writing out, the way to make it append data to the outfiles would be:
Create the files with just the headers at the start of the script.
The last inner loop should write to the files in append mode.
So, at the start of your script, add:
data = json.load(file)
# Create individual empty files with headers
for i in data:
    name = 'Glassdoor_A_' + str(i)
    with open(name+'.csv', 'w', newline='') as file:
        writer = csv.writer(file)  # you probably don't need to use the csv module for the first part
        writer.writerow(["Text Main Words", "Text Main Occurrences"])
        # nothing else here for now
Then at the end of your script, for the inner most loop where you're writing out the data, do:
while end_date <= last_end_date:
    ...
    for index, row in filtered_dates.iterrows():
        ...
        for i in data:
            name = 'Glassdoor_A_' + str(i)
            with open(name+'.csv', 'a', newline='') as file:  # note the 'append' mode
                writer = csv.writer(file)
                writer.writerows([occurrence])
Btw, that last line writer.writerows([occurrence]) should probably be writer.writerows(list(occurrence)) if occurrence is not already a list of tuples or a list of lists with two elements in each inner list.
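The two steps (header once in write mode, data later in append mode) can be sketched in isolation as below. The helper names create_with_header and append_counts are invented for this example, and the column names are taken from the question. Note that the csv.writer is created inside each with block, so it never refers to a file that has already been closed.

```python
import csv


def create_with_header(path):
    """Create (or truncate) the file and write only the header row."""
    with open(path, "w", newline="") as fh:
        csv.writer(fh).writerow(["Word", "Occurrences", "Percentage"])


def append_counts(path, counts):
    """Append one row per word: the word, its count, and its share of the total."""
    total = sum(counts.values())
    with open(path, "a", newline="") as fh:
        # Build the writer on the file that is open right now.
        writer = csv.writer(fh)
        for word, n in sorted(counts.items()):
            writer.writerow([word, n, n / total])
```

Calling create_with_header once per output file at the start, and append_counts once per date range, keeps earlier rows intact.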
I am trying to write a python program to clean survey data coming from a CSV file.
I would like to dump rows which contain a sequence of blank fields, like the first and the third line in the following example.
"1","a","b","c",,,,,
"2","a","b","c","d","e","f",,"h"
"3","a","b","c",,,,,
"4","a","z","u","d","i","f","x","h"
"5","d","c","c",,"c","f","g","z"
Here is my unsuccessful code:
import csv

fname = raw_input("Enter input file name: ")
if len(fname) < 1 : fname = "survey.csv"
foutput = raw_input("Enter output file name: ")
if len(foutput) < 1 : foutput = "output_"+fname
input = open(fname, 'rb')
output = open(foutput, 'wb')
searchFor = 5*['']
writer = csv.writer(output)
for row in csv.reader(input):
    if searchFor not in row :
        writer.writerow(row)
input.close()
output.close()
Use a Counter to check whether one list is a subset of another, as below. If you want to remove empty elements, just use None, bool or len with filter to discard the blanks:
import csv
from itertools import repeat
from collections import Counter

input = open(fname, 'rb')
output = open(foutput, 'wb')
writer = csv.writer(output)

# Helper function
def counterSubset(list1, list2):
    c1, c2 = Counter(list1), Counter(list2)
    for k, n in c1.items():
        if n > c2[k]:
            return False
    return True

for row in csv.reader(input):
    if not counterSubset(list(repeat('',5)),row):  # I used 5 for five '' values; you can change it
        writer.writerow(row)  # use filter(None,row) or filter(bool,row) or filter(len,row) to remove empty elements
input.close()
output.close()
Output-
1,a,b,c,,
2,a,b,c,d,e,f,g,h
4,a,,z,u,d,i,f,x,h
5,d,c,c,d,c,f,g,z
How about
# csv.reader returns empty fields as the empty string;
# change this if your blank items look different
blank_item = ""
for row in csv.reader(input):
    # collect all blank elements
    blanks = [x for x in row if x == blank_item]
    if len(blanks) < 5:
        writer.writerow(row)
This will count the number of blanks in a row and let you drop them as desired.
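If the blanks must be consecutive (a "sequence" of blank fields, as the question puts it) rather than merely five in total, itertools.groupby can measure the longest run. This is a sketch in Python 3 with made-up helper names (longest_blank_run, drop_blank_runs), assuming that blank fields arrive as empty strings.

```python
import csv
import io
from itertools import groupby


def longest_blank_run(row):
    """Length of the longest run of consecutive empty fields in a row."""
    return max((sum(1 for _ in group)
                for is_blank, group in groupby(row, key=lambda f: f == "")
                if is_blank), default=0)


def drop_blank_runs(csv_text, min_run=5):
    """Keep only rows without min_run or more consecutive empty fields."""
    reader = csv.reader(io.StringIO(csv_text))
    return [row for row in reader if longest_blank_run(row) < min_run]
```

With the sample from the question and min_run=5, rows 1 and 3 are dropped, while row 2 (one blank) and row 5 (one blank) are kept.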
I have a CSV file. Then I have some rules that have to be applied and then create a new CSV based on the rules.
So it could go two ways:
Add a new column, with its own header and data
Take an existing column and alter the data of that column.
This is what I have so far
def applyRules(directory):
    FILES = []
    for f in listdir(OUTPUT_DIR):
        writer = csv.writer(open("%s%s" % (DZINE_DIR, f), "wb"))
        for rule in Substring.objects.filter(source_file=f):
            from_column = rule.from_column
            to_column = rule.to_column
            reader = csv.DictReader(open("%s%s" % (OUTPUT_DIR, f)))
            headers = reader.fieldnames
            for row in reader:
                if rule.get_rule_type_display() == "substring":
                    string = rule.string.split(",")
                    # alter value
                    row[to_column] = string[0] + row[from_column] + string[1]
                    if rule.from_column == rule.to_column:
                        print rule.from_column
                    else:
                        print rule.from_column
The rule has a FROM_COLUMN and a TO_COLUMN. If both are the same, the column stays in place but its data must be updated by the rule, in this case just adding a string before and/or after the current value.
When the TO_COLUMN is different, the altered data goes into a new column with that name.
So currently I'm just changing the values of the dict, but I'm not sure how to get them into the new CSV.
If you open the output file as a DictWriter() object, then you can write out your altered dictionaries quite easily. You do need to determine your extra fieldnames ahead of time:
with open(os.path.join(OUTPUT_DIR, f), 'rb') as rfile:
    reader = csv.DictReader(rfile)
    # copy the list so the reader's own fieldnames are not modified
    headers = list(reader.fieldnames)
    rules = Substring.objects.filter(source_file=f).all()

    # pre-process the rules to determine the headers
    for rule in rules:
        from_column = rule.from_column
        to_column = rule.to_column
        if from_column not in headers:
            # problem; here we raise an error
            raise ValueError("unknown source column: %s" % from_column)
        if to_column not in headers:
            headers.append(to_column)

    # the input file must stay open here, because the reader is lazy
    with open(os.path.join(DZINE_DIR, f), "wb") as wfile:
        writer = csv.DictWriter(wfile, fieldnames=headers)
        writer.writeheader()  # write the (possibly extended) header row
        for row in reader:
            for rule in rules:
                from_column = rule.from_column
                to_column = rule.to_column
                if rule.get_rule_type_display() == "substring":
                    string = rule.string.split(",")
                    row[to_column] = string[0] + row[from_column] + string[1]
            # write the altered row, not the reader object
            writer.writerow(row)
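To try the DictReader/DictWriter round trip without the Django models, here is a self-contained Python 3 sketch that works on strings. The function apply_wrap and its parameters are hypothetical stand-ins for a single "substring" rule; they are not part of the original code.

```python
import csv
import io


def apply_wrap(csv_text, from_column, to_column, prefix, suffix):
    """Wrap the values of from_column in prefix/suffix and store them in to_column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    headers = list(reader.fieldnames)
    if from_column not in headers:
        raise ValueError(f"unknown column: {from_column}")
    if to_column not in headers:
        headers.append(to_column)  # a new column goes at the end
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=headers, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        row[to_column] = prefix + row[from_column] + suffix
        writer.writerow(row)
    return out.getvalue()
```

When from_column and to_column are equal, the column is rewritten in place; otherwise the original column is kept and the wrapped value lands in the new one.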
I am trying to create a clean CSV file by merging some of the variables from an old file and appending them to a new CSV file.
I have no problem running the data the first time: I get the output I want. But whenever I try to append a new variable (i.e. a new column), it is appended at the bottom of the file and the output is wonky.
I have basically been running the same code for each variable, only changing the groupvariables variable to the variables I want and then opening the file with f2 = open('outputfile.csv', "ab"), i.e. with ab for append. Any help would be appreciated.
groupvariables=['x','y']
f2 = open('outputfile.csv', "wb")
writer = csv.writer(f2, delimiter=",")
writer.writerow(("ID","Diagnosis"))
for line in csv_f:
    line = line.rstrip('\n')
    columns = line.split(",")
    tempname = columns[0]
    tempindvar = columns[1:]
    templist = []
    for j in groupvariables:
        tempvar=tempindvar[headers.index(j)]
        if tempvar != ".":
            templist.append(tempvar)
    newList = list(set(templist))
    if len(newList) > 1:
        output = 'nomatch'
    elif len(newList) == 0:
        output = "."
    else:
        output = newList[0]
    tempoutrow = (tempname,output)
    writer.writerow(tempoutrow)
f2.close()
CSV is a line-based file format, so the only way to add a column to an existing CSV file is to read it into memory and overwrite it entirely, adding the new column to each line.
If all you want to do is add lines, though, appending will work fine.
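A minimal sketch of that read-everything-then-rewrite approach, in Python 3. The function name add_column is invented here, and it assumes the file has a header row and fits in memory.

```python
import csv


def add_column(path, header, values):
    """Rewrite the CSV at path with one extra column appended to every row.

    values must hold one entry per data row, in file order.
    """
    with open(path, newline="") as fh:
        rows = list(csv.reader(fh))
    if len(values) != len(rows) - 1:
        raise ValueError("need exactly one value per data row")
    rows[0].append(header)
    for row, value in zip(rows[1:], values):
        row.append(value)
    # Overwrite the file with the widened rows.
    with open(path, "w", newline="") as fh:
        csv.writer(fh).writerows(rows)
```

For each new variable you would compute its values first (one per ID, in file order) and then call add_column once, instead of reopening the file in append mode.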
Here is something that might help. I assumed the first field of each row in each CSV file is a primary key for the record and can be used to match rows between the two files. The code below reads the records from one file, stores them in a dictionary, then reads the records from another file, appends their values to the dictionary, and writes out a new file. You can adapt this example to better fit your actual problem.
import csv
# using python3
db = {}
with open('t1.csv', 'r', newline='') as f1:
    for row in csv.reader(f1):
        key, *values = row
        db[key] = ','.join(values)
with open('t2.csv', 'r', newline='') as f2:
    for row in csv.reader(f2):
        key, *values = row
        if key in db:
            db[key] = db[key] + ',' + ','.join(values)
        else:
            db[key] = ','.join(values)
# the with blocks close each file once it has been processed
with open('combo.csv', 'w') as writer:
    for key in sorted(db.keys()):
        writer.write(key + ',' + db[key] + '\n')