Saving Iterations in Different CSV Files [closed] - python

For a current project, I am planning to run several iterations of the script below and to save the results in different CSV files with a new file for each iteration (the CSV part is at the end of the script).
The given code currently shows the relevant results in the terminal while it only creates empty CSV files. I have spent days figuring out how to solve the situation but cannot get to a solution. Is there anyone who can help?
Note: I have updated the code in accordance with user recommendations while the original issue/challenge still persists.
import string
import json
import csv
import pandas as pd
import datetime
from dateutil.relativedelta import *
import numpy as np
import matplotlib.pyplot as plt

# Loading and reading dataset
file = open("Glassdoor_A.json", "r")
data = json.load(file)
df = pd.json_normalize(data)
df['Date'] = pd.to_datetime(df['Date'])

# Allocate periods for individual CSV file names
periods = pd.period_range('2009Q1', '2018Q4', freq='Q')
ts = pd.Series(np.random.randn(40), periods)
type(ts.index)
intervals = ts.index

# Create individual empty files with headers
for i in intervals:
    name = 'Glassdoor_A_' + 'Text Main_' + str(i)
    with open(name + '.csv', 'w', newline='') as file:
        writer = csv.writer(file)

# Create an empty dictionary
d = dict()

# Filtering by date
start_date = pd.to_datetime('2009-01-01')
end_date = pd.to_datetime('2009-03-31')
last_end_date = pd.to_datetime('2017-12-31')
mnthBeg = pd.offsets.MonthBegin(3)
mnthEnd = pd.offsets.MonthEnd(3)

while end_date <= last_end_date:
    filtered_dates = df[df.Date.between(start_date, end_date)]
    n = len(filtered_dates.index)
    print(f'Date range: {start_date.strftime("%Y-%m-%d")} - {end_date.strftime("%Y-%m-%d")}, {n} rows.')
    if n > 0:
        print(filtered_dates)
    start_date += mnthBeg
    end_date += mnthEnd
    # Processing Text Main section
    for index, row in filtered_dates.iterrows():
        line = row['Text Main']
        # Remove the leading spaces and newline character
        line = line.split(' ')
        line = [val.strip() for val in line]
        # Convert the characters in line to lowercase to avoid case mismatch
        line = [val.lower() for val in line]
        # Remove the punctuation marks from the line
        line = [val.translate(val.maketrans("", "", string.punctuation)) for val in line]
        print(line)
        # Split the line into words
        # words = [val.split(" ") for val in line]
        # print(words)
        # Iterate over each word in line
        for word in line:
            # Check if the word is already in dictionary
            if word in d.keys():
                # Increment count of word by 1
                d[word] = d[word] + 1
            else:
                # Add the word to dictionary with count 1
                d[word] = 1
        print(d)

# Print the contents of dictionary
for key in list(d.keys()):
    print(key, ":", d[key])
    # Count the total number of words
    total = sum(d.values())
    percent = d[key] / total
    print(d[key], total, percent)

# Save as CSV file
while end_date <= last_end_date:
    for index, row in filtered_dates.iterrows():
        for i in data:
            name = 'Glassdoor_A_' + str(i)
            with open(name + '.csv', 'a', newline='') as file:
                writer.writerow(["Word", "Occurrences", "Percentage"])
                writer.writerows([key, d[key], percent] for key in list(d.keys()))

Regarding your inner loop which writes the CSV files:
# Create individual file names
for i in data:
    name = 'Glassdoor_A_' + str(i)
    # Save output in CSV file
    with open(name + '.csv', 'w', newline='') as file:
        ...
⬆ is executed for each iteration of the outer loop for index, row in filtered_dates.iterrows():. So each iteration will overwrite the previously created files. Try using mode 'a' (append) instead, and write the headers (with no data yet) outside of these two loops.
Without getting into the details of what you're calculating and writing out, the way to make it append data to the outfiles would be:
Create the files with just the headers at the start of the script.
The last inner loop should write to the files in append mode.
So, at the start of your script, add:
data = json.load(file)

# Create individual empty files with headers
for i in data:
    name = 'Glassdoor_A_' + str(i)
    with open(name + '.csv', 'w', newline='') as file:
        writer = csv.writer(file)  # you probably don't need to use the csv module for the first part
        writer.writerow(["Text Main Words", "Text Main Occurrences"])
# nothing else here for now
Then at the end of your script, for the inner most loop where you're writing out the data, do:
while end_date <= last_end_date:
    ...
    for index, row in filtered_dates.iterrows():
        ...
        for i in data:
            name = 'Glassdoor_A_' + str(i)
            with open(name + '.csv', 'a', newline='') as file:  # note the 'append' mode
                writer = csv.writer(file)
                writer.writerows([occurrence])
Btw, that last line writer.writerows([occurrence]) should probably be writer.writerows(list(occurrence)) if occurrence is not already a list of tuples or a list of lists with two elements in each inner list.
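The two-pass pattern described above can be sketched end to end. This is a minimal, self-contained illustration with made-up file names and data, not the asker's actual dataset:

```python
import csv

# Hypothetical stand-ins for the real data: two period labels and word counts.
intervals = ["2009Q1", "2009Q2"]
counts = {"2009Q1": {"good": 3, "bad": 1}, "2009Q2": {"fine": 2}}

# Pass 1: create each file once, in 'w' mode, with only the header row.
for i in intervals:
    with open(f"demo_{i}.csv", "w", newline="") as f:
        csv.writer(f).writerow(["Word", "Occurrences", "Percentage"])

# Pass 2 (this part would sit inside your processing loops): reopen in 'a'
# mode so earlier contents are kept, and bind a fresh writer to the open file.
for i in intervals:
    d = counts[i]
    total = sum(d.values())
    with open(f"demo_{i}.csv", "a", newline="") as f:
        writer = csv.writer(f)  # the writer must wrap the *currently open* handle
        writer.writerows([word, n, n / total] for word, n in d.items())
```

The key point is that the `csv.writer` object is created after each `open()`; a writer bound to a file that has since been closed (as in the original script) silently writes nothing useful.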

Related

Reading a txt file and saving individual columns as lists

I am trying to read a .txt file and save the data in each column as a list. Each column in the file contains a variable which I will later use to plot a graph. I have looked up the best method to do this, and most answers recommend opening the file, reading it, and then either splitting or saving the columns as a list. The data in the .txt is as follows -
0 1.644231726
0.00025 1.651333945
0.0005 1.669593478
0.00075 1.695214575
0.001 1.725409504
The delimiter is a space ' ' or a tab '\t'. I have used the following code to try and append the columns to my variables -
import csv
with open('./rvt.txt') as file:
    readfile = csv.reader(file, delimiter='\t')
    time = []
    rim = []
    for line in readfile:
        t = line[0]
        r = line[1]
        time.append(t)
        rim.append(r)
print(time, rim)
However, when I try to print the lists, time and rim, using print(time, rim), I get the following error message -
r = line[1]
IndexError: list index out of range
I am, however, able to print only the 'time' if I comment out the r=line[1] and rim.append(r) parts. How do I approach this problem? Thank you in advance!
I would suggest the following:
import pandas as pd
df = pd.read_csv('./rvt.txt', sep='\t', names=[...])  # names: a list with your column names
Then you can use list(your_column) to work with your columns as lists
The problem is with the delimiter: the dataset contains runs of multiple spaces ' '. When you use '\t' and print each line, you can see it is not being split on the delimiter, e.g.:
['0 1.644231726']
['0.00025 1.651333945']
['0.0005 1.669593478']
['0.00075 1.695214575']
['0.001 1.725409504']
To get the desired result you can use ' ' (space) as the delimiter and filter out the empty values:
readfile = csv.reader(file, delimiter=" ")
time, rim = [], []
for line in readfile:
    line = list(filter(lambda x: len(x), line))
    t = line[0]
    r = line[1]
Here is the full code to do this:
import csv
with open('./rvt.txt') as file:
    readfile = csv.reader(file, delimiter=" ")
    time = []
    rim = []
    for line in readfile:
        line = list(filter(lambda x: len(x), line))  # drop the empty fields left by repeated spaces
        t = line[0]
        r = line[1]
        time.append(t)
        rim.append(r)
print(time, rim)
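Since the columns may be separated by a variable number of spaces or tabs, another option worth noting is plain str.split() with no argument, which collapses any run of whitespace and sidesteps the csv module entirely. A sketch, assuming the same two-column layout as rvt.txt (the sample lines here stand in for reading the file):

```python
time, rim = [], []
# Stand-in for: for line in open('./rvt.txt')
sample_lines = ["0 1.644231726", "0.00025    1.651333945", "0.0005\t1.669593478"]
for line in sample_lines:
    parts = line.split()   # no argument: splits on any run of spaces/tabs
    if len(parts) < 2:     # skip blank or malformed lines
        continue
    time.append(float(parts[0]))
    rim.append(float(parts[1]))
```

Converting with float() at this point also saves a pass later, since the values are needed as numbers for plotting.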

Python - Hierarchical relationship between lines in a CSV

I'm using a text file containing data that I want to reorganize into a different shape. The file contains lines of values separated by semicolons, with no header. Some lines contain values that are children of other lines. I can distinguish them by a code (1 or 2) and by their order: a child value is always on the line after its parent value. The number of child elements differs from one line to the next. Parent values can have no children.
To be more explicit, here is a data sample:
030;001;1;AD0192;
030;001;2;AF5612;AF5613;AF5614
030;001;1;CD0124;
030;001;2;CD0846;CD0847;CD0848
030;002;1;EG0376;
030;002;2;EG0666;EG0667;EG0668;EG0669;EG0670;
030;003;1;ZB0001;
030;003;1;ZB0002;
030;003;1;ZB0003;
030;003;2;ZB0004;ZB0005
The structure is:
The first three characters are an id
The next three characters are also an id
The next one is a code: when it is 1, the value (named key in my example) is a parent; when it is 2, the values are children of the line before.
The values after are keys, parent or children.
I want to store the child values (those with code 2) in a list, on the same line as their parent value.
Here is an example with my data sample above and a header:
id1;id2;key;children;
030;001;AD0192;[AF5612,AF5613,AF5614]
030;001;CD0124;[CD0846,CD0847,CD0848]
030;002;EG0376;[EG0666,EG0667,EG0668,EG0669,EG0670]
030;003;ZB0001;
030;003;ZB0002;
030;003;ZB0003;[ZB0004,ZB0005]
I'm able to build a CSV delimited file from this text source file, add a header, use a DictReader to manipulate my columns easily, and write conditions to identify my parent and child values.
But how do I store the hierarchical elements (those with code 2) in a list on the same line as their parent key?
Here is my actual script in Python
import csv

inputTextFile = 'myOriginalFile.txt'
csvFile = 'myNewFile.csv'
countKey = 0
countKeyParent = 0
countKeyChildren = 0

with open(inputTextFile, 'r', encoding='utf-8') as inputFile:
    stripped = (line.strip() for line in inputFile)
    lines = (line.split(";") for line in stripped if line)
    # Write a CSV file
    with open(csvFile, 'w', newline='', encoding='utf-8') as outputCsvFile:
        writer = csv.writer(outputCsvFile, delimiter=';')
        writer.writerow(('id1', 'id2', 'code', 'key', 'children'))
        writer.writerows(lines)

# Read the CSV
with open(csvFile, 'r', newline='', encoding='utf-8') as myCsvFile:
    csvReader = csv.DictReader(myCsvFile, delimiter=';', quotechar='"')
    for row in csvReader:
        countKey += 1
        if '1' in row['code']:
            countKeyParent += 1
            print("Parent: " + row['key'])
        elif '2' in row['code']:
            countKeyChildren += 1
            print("Children: " + row['key'])

print(f"----\nSum of elements: {countKey}\nParents keys: {countKeyParent}\nChildren keys: {countKeyChildren}")
A simple solution might be the following. I first load your data in as a list of rows, each a list of strings. We then build the hierarchy you've explained and write the output to a CSV file.
from typing import List

ID_FIRST = 0
ID_SECOND = 1
PARENT_FIELD = 2
KEY_FIELD = 3
IS_PARENT = "1"
IS_CHILDREN = "2"

def read(where: str) -> List[List[str]]:
    with open(where) as fp:
        data = fp.readlines()
    rows = []
    for line in data:
        fields = line.strip().split(';')
        rows.append([fields[ID_FIRST],
                     fields[ID_SECOND],
                     fields[PARENT_FIELD],
                     *[item for item in fields[KEY_FIELD:]
                       if item != ""]])
    return rows

def assign_parents(rows: List[List[str]]):
    parent_idx = 0
    for idx, fields in enumerate(rows):
        if fields[PARENT_FIELD] == IS_PARENT:
            parent_idx = idx
        if fields[PARENT_FIELD] == IS_CHILDREN:
            rows[parent_idx] += fields[KEY_FIELD:]

def write(where: str, rows: List[List[str]]):
    with open(where, 'w') as file:
        file.write("id1;id2;key;children;\n")
        for fields in rows:
            if fields[PARENT_FIELD] == IS_CHILDREN:
                # These have been grouped into their parents.
                continue
            string = ";".join(fields[:PARENT_FIELD])
            string += ";" + fields[KEY_FIELD] + ";"
            if len(fields[KEY_FIELD + 1:]) != 0:  # has children?
                children = ",".join(fields[KEY_FIELD + 1:])
                string += "[" + children + "]"
            file.write(string + '\n')

def main():
    rows = read('myOriginalFile.txt')
    assign_parents(rows)
    write('myNewFile.csv', rows)

if __name__ == "__main__":
    main()
For your example I get
id1;id2;key;children;
030;001;AD0192;[AF5612,AF5613,AF5614]
030;001;CD0124;[CD0846,CD0847,CD0848]
030;002;EG0376;[EG0666,EG0667,EG0668,EG0669,EG0670]
030;003;ZB0001;
030;003;ZB0002;
030;003;ZB0003;[ZB0004,ZB0005]
which appears to be correct.
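Since a child row always directly follows its parent, the same grouping can also be done in a single pass while reading, by keeping a reference to the most recent parent row. A condensed sketch of that idea, using a few of the sample lines inline rather than reading a file:

```python
sample = [
    "030;001;1;AD0192;",
    "030;001;2;AF5612;AF5613;AF5614",
    "030;003;1;ZB0001;",
]
merged = []  # rows of [id1, id2, key, [children...]]
for line in sample:
    fields = [f for f in line.strip().split(";") if f]  # drop trailing empties
    id1, id2, code, keys = fields[0], fields[1], fields[2], fields[3:]
    if code == "1":
        # A parent row starts a new output record with an empty child list.
        merged.append([id1, id2, keys[0], []])
    else:
        # A code-2 row attaches its keys to the parent seen just before it.
        merged[-1][3].extend(keys)
```

This trades the three named passes of the accepted answer for one loop, at the cost of assuming the input is well formed (every code-2 line is preceded by a parent).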

Extract two columns sorted from CSV

I have a large csv file, containing multiple values, in the form
Date,Dslam_Name,Card,Port,Ani,DownStream,UpStream,Status
2020-01-03 07:10:01,aart-m1-m1,204,57,302xxxxxxxxx,0,0,down
I want to extract the Dslam_Name and Ani values, sort them by Dslam_name and write them to a new csv in two different columns.
So far my code is as follows:
import csv
import operator

with open('bad_voice_ports.csv') as csvfile:
    readCSV = csv.reader(csvfile, delimiter=',')
    sortedlist = sorted(readCSV, key=operator.itemgetter(1))
    for row in sortedlist:
        bad_port = row[1][:4], row[4][2::]
        print(bad_port)
        f = open("bad_voice_portsnew20200103SORTED.csv", "a+")
        f.write(row[1][:4] + " " + row[4][2::] + '\n')
        f.close()
But my Dslam_Name and Ani values are kept in the same column.
As a next step I would like to count how many times the same value appears in the 1st column.
You are forcing them to be a single column. Joining the two into a single string means Python no longer regards them as separate.
But try this instead:
import csv
import operator

with open('bad_voice_ports.csv') as readfile, open('bad_voice_portsnew20200103SORTED.csv', 'w') as writefile:
    readCSV = csv.reader(readfile)
    writeCSV = csv.writer(writefile)
    for row in sorted(readCSV, key=operator.itemgetter(1)):
        bad_port = row[1][:4], row[4][2::]
        print(bad_port)
        writeCSV.writerow(bad_port)
If you want to include the number of times each key occurred, you can easily include that in the program, too. I would refactor slightly to separate the reading and the writing.
import csv
import operator
from collections import Counter

with open('bad_voice_ports.csv') as readfile:
    readCSV = csv.reader(readfile)
    rows = []
    counts = Counter()
    for row in readCSV:
        rows.append([row[1][:4], row[4][2::]])
        counts[row[1][:4]] += 1

with open('bad_voice_portsnew20200103SORTED.csv', 'w') as writefile:
    writeCSV = csv.writer(writefile)
    for row in sorted(rows):
        print(row)
        writeCSV.writerow([counts[row[0]]] + row)
I would recommend removing the header line from the CSV file entirely; throwing away (or separating out and prepending back) the first line should be an easy change if you want to keep it.
(Also, hard-coding input and output file names is problematic; maybe have the program read them from sys.argv[1:] instead.)
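The sys.argv suggestion could look like this. A sketch only; the helper name and default file names are illustrative, with the hard-coded names kept as fallbacks:

```python
import sys

def pick_files(argv):
    """Return (input, output) file names from the command line, with defaults."""
    infile = argv[1] if len(argv) > 1 else "bad_voice_ports.csv"
    outfile = argv[2] if len(argv) > 2 else "bad_voice_portsnew20200103SORTED.csv"
    return infile, outfile

# e.g. run as: python script.py ports.csv sorted_ports.csv
infile, outfile = pick_files(sys.argv)
```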
So my suggestion is fairly simple. As I stated in a previous comment, there is good documentation on CSV reading and writing in Python here: https://realpython.com/python-csv/
As per an example, to read from a csv the columns you need you can simply do this:
>>> file = open('some.csv', mode='r')
>>> csv_reader = csv.DictReader(file)
>>> for line in csv_reader:
...     print(line["Dslam_Name"] + " " + line["Ani"])
...
This would return:
aart-m1-m1 302xxxxxxxxx
Now you can just as easily create a variable and store the column values there, and later write them to a file, or just open a new file while reading lines and write the column values into it. I hope this helps you.
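Building on that, a full round trip with DictReader and DictWriter might look like the following. This is a sketch with made-up file names; the input is written first so the example is self-contained, and extrasaction="ignore" tells DictWriter to drop the columns not listed in fieldnames:

```python
import csv

# Write a small input file so the example runs standalone.
with open("demo_ports.csv", "w", newline="") as f:
    f.write("Date,Dslam_Name,Card,Port,Ani,DownStream,UpStream,Status\n")
    f.write("2020-01-03 07:10:01,aart-m1-m1,204,57,302xxxxxxxxx,0,0,down\n")

with open("demo_ports.csv", newline="") as src, \
     open("demo_ports_out.csv", "w", newline="") as dst:
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=["Dslam_Name", "Ani"],
                            extrasaction="ignore")  # silently drop other columns
    writer.writeheader()
    # Sort by Dslam_Name while copying the two wanted columns across.
    for row in sorted(reader, key=lambda r: r["Dslam_Name"]):
        writer.writerow(row)
```

The sort happens on the dict rows before writing, so the output keeps the two columns separate and ordered by Dslam_Name.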
After the help from #tripleee and #marxmacher my final code is
import csv
import operator
from collections import Counter

with open('bad_voice_ports.csv') as csv_file:
    readCSV = csv.reader(csv_file, delimiter=',')
    sortedlist = sorted(readCSV, key=operator.itemgetter(1))
    line_count = 0
    rows = []
    counts = Counter()
    for row in sortedlist:
        Dslam = row[1][:4]
        Ani = row[4][2:]
        if line_count == 0:
            print(row[1], row[4])
            line_count += 1
        else:
            rows.append([row[1][:4], row[4][2::]])
            counts[row[1][:4]] += 1
            print(Dslam, Ani)
            line_count += 1

for row in sorted(rows):
    f = open("bad_voice_portsnew202001061917.xls", "a+")
    f.write(row[0] + '\t' + row[1] + '\t' + str(counts[row[0]]) + '\n')
    f.close()
print('Total of Bad ports =', str(line_count-1))
This way the desired values/columns are extracted from the initial CSV file, and a new XLS file is generated with the desired values stored in different columns; the total per key is counted, along with the total number of entries.
Thanks for all the help, please feel free for any improvement suggestions!
You can use sorted:
import csv

_h, *data = csv.reader(open('filename.csv'))
with open('new_csv.csv', 'w', newline='') as f:
    write = csv.writer(f)
    write.writerows([(_h[1], _h[4]), *sorted([(i[1], i[4]) for i in data], key=lambda x: x[0])])

Python perform logic for all CSV files

I've got the following Python script that opens a single CSV file, performs a few operations to recreate columns and slice lists, and finally outputs to CSV.
I would like the script to open all .CSV files in a directory, perform the actions on each, and output a final CSV file.
FYI: I'm trying to avoid pandas.
import csv

projects = []
count = 0
with open('timesheets.csv') as csvfile:
    timesheets = csv.reader(csvfile, delimiter=',')
    for rows in timesheets:
        # Skip rows until count reaches 3, then process the remaining rows.
        if count < 3:
            count += 1
            continue
        # Columns 1,8,9,13,15,18
        columns1 = rows[1:2]
        columns8_9 = rows[8:10]
        columns13 = rows[13:14]
        columns15 = rows[15:15]
        columns18 = rows[18:19]
        project = columns1 + columns8_9 + columns13 + columns15 + columns18
        # Append each line as a separate list to the projects list. You end up with multiple lists within a list.
        projects.append(project)

# Remove the last list in the projects list, since it is empty and causes errors.
projects = projects[:-1]
newlist = []
# Remove the first 8 characters from each line[1] & line[3]
for lists in projects:
    # Remove the first 8 characters from line[1]
    engineer = lists[1]
    engineer = engineer[8:]
    lists[1] = engineer
    # Remove the first 8 characters from line[3]
    employee = lists[3]
    employee = employee[8:]
    lists[3] = employee
    newlist.append(lists)
# Change the first list to the following, which effectively changes the column names.
newlist[0] = ['Project Name', 'Line Manager', 'Element', 'Employee', 'Hours']
writer = csv.writer(open('output.csv', 'w'))
writer.writerows(newlist)
Put the CSVs in a list and iterate over it. This will help you:
import glob

files_list = glob.glob("path/*.csv")
for file in files_list:
    # your remaining code goes here...
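Wrapping the existing per-file logic in a function and extending one combined list per glob hit keeps the final output in a single CSV. A sketch under the assumption that all inputs share the same layout; the directory, file names, and placeholder skip-3-rows logic are made up for illustration (the demo inputs are created first so the snippet runs standalone):

```python
import csv
import glob
import os

# Create two tiny demo inputs so the sketch runs standalone.
os.makedirs("sheets_demo", exist_ok=True)
for name, data_row in [("a.csv", ["p1", "8h"]), ("b.csv", ["p2", "3h"])]:
    with open(os.path.join("sheets_demo", name), "w", newline="") as f:
        # Three header-ish rows, then one data row, mimicking the real files.
        csv.writer(f).writerows([["skip1"], ["skip2"], ["skip3"], data_row])

def process_file(path):
    """Stand-in for the existing per-file logic: skip the first 3 rows."""
    with open(path, newline="") as f:
        return [row for i, row in enumerate(csv.reader(f)) if i >= 3]

all_rows = []
for path in sorted(glob.glob("sheets_demo/*.csv")):  # one pass per input file
    all_rows.extend(process_file(path))

# One combined output file for all inputs.
with open("combined_output.csv", "w", newline="") as f:
    csv.writer(f).writerows(all_rows)
```

In the real script, process_file would contain the column-slicing and string-trimming steps, and the header row would be prepended once before the final writerows.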

Trouble with sorting a list and "for" statement syntax

I need help sorting a list from a text file. I'm reading a .txt and then adding some data, then sorting it by population change %, then lastly, writing that to a new text file.
The only thing that's giving me trouble now is the sort function. I think the for statement syntax is what's giving me issues -- I'm unsure where in the code I would add the sort statement and how I would apply it to the output of the for loop statement.
The population change data I am trying to sort by is the [1] item in the list.
# Read file into script
NCFile = open("C:\filelocation\NC2010.txt")
# Save a write file
PopulationChange = open("C:\filelocation\Sorted_Population_Change_Output.txt", "w")
# Read everything into lines, except for first (header) row
lines = NCFile.readlines()[1:]
# Pull relevant data and create population change variable
for aLine in lines:
    dataRow = aLine.split(",")
    countyName = dataRow[1]
    population2000 = float(dataRow[6])
    population2010 = float(dataRow[8])
    popChange = ((population2010 - population2000) / population2000) * 100
    outputRow = countyName + ", %.2f" % popChange + "%\n"
    PopulationChange.write(outputRow)
NCFile.close()
PopulationChange.close()
You can fix your issue with a couple of minor changes. Split the line as you read it in and loop over the sorted lines:
lines = [aLine.split(',') for aLine in NCFile][1:]
# Pull relevant data and create population change variable
for dataRow in sorted(lines, key=lambda row: row[1]):
    population2000 = float(dataRow[6])
    population2010 = float(dataRow[8])
    ...
However, if this is a csv you might want to look into the csv module. In particular, DictReader will read in the data as a list of dictionaries based on the header row. I'm making up the field names below but you should get the idea. You'll notice I sort the data based on 'countyName' as it is read in:
from csv import DictReader, DictWriter

with open("C:\filelocation\NC2010.txt") as NCFile:
    reader = DictReader(NCFile)
    data = sorted(reader, key=lambda row: row['countyName'])
    for row in data:
        population2000 = float(row['population2000'])
        population2010 = float(row['population2010'])
        popChange = ((population2010 - population2000) / population2000) * 100
        row['popChange'] = "{0:.2f}".format(popChange)

with open("C:\filelocation\Sorted_Population_Change_Output.txt", "w") as PopulationChange:
    writer = DictWriter(PopulationChange, fieldnames=['countryName', 'popChange'], extrasaction='ignore')
    writer.writeheader()
    writer.writerows(data)
This will give you a 2 column csv of ['countryName', 'popChange']. You would need to correct this with the correct fieldnames.
You need to read all of the lines in the file before you can sort it. I've created a list called change to hold the tuple pair of the population change and the country name. This list is sorted and then saved.
with open("NC2010.txt") as NCFile:
    lines = NCFile.readlines()[1:]

change = []
for line in lines:
    row = line.split(",")
    country_name = row[1]
    population_2000 = float(row[6])
    population_2010 = float(row[8])
    pop_change = ((population_2010 / population_2000) - 1) * 100
    change.append((pop_change, country_name))
change.sort()

output_rows = []
[output_rows.append("{0}, {1:.2f}\n".format(pair[1], pair[0]))
 for pair in change]

with open("Sorted_Population_Change_Output.txt", "w") as PopulationChange:
    PopulationChange.writelines(output_rows)
I used a list comprehension to generate the output rows which swaps the pair back in the desired order, i.e. country name first.
