How to export csv row into separate .txt file - python

I have a .csv (export.csv) which contains almost 9k rows structured as follows:
| Oggetto | valueID | note         |
|---------|---------|--------------|
| 1       | work1   | DescrizioneA |
| 2       | work2   | DescrizioneB |
| 3       | work3   | DescrizioneC |
I would like to export the value in the "note" column of each row to a separate .txt file, named after the value in the "valueID" column, i.e. work1.txt (content of the work1.txt file: "DescrizioneA").
Starting from a similar question, I tried, without success:
import csv
with open('export.csv', 'r') as file:
    data = file.read().split('\n')
for row in range(1, len(data)):
    third_col = data
    with open('t' + '.txt', 'w') as output:
        output.write(third_col[2])
I then tried with pandas:
import pandas as pd
data = pd.read_csv("export.csv", engine='python')
d = data
file = 'file{}.txt'
n = 0  # to number the files
for row in d.iterrows():
    with open(file.format(n), 'w') as f:
        f.write(str(row))
    n += 1
I'm getting something, but:
The filename of each file is just a progressive number from 1 to 9000, e.g. file0001.txt, instead of the valueID.
The content of each .txt comprises all 3 columns, and the text from the "note" column is truncated, e.g.: "La cornice architettonica superiore del capite..."
Any idea?
Thanks

You can try pandas:
import pandas as pd

df = pd.read_csv("export.csv", sep=",")
for index in range(len(df)):
    with open(df["valueID"][index] + '.txt', 'w') as output:
        output.write(df["note"][index])

If you do not want to use pandas, you can do it like this:
with open('export.csv', 'r') as file:
    data = file.read().split('\n')
OK, it was a good start to store your data line by line in a variable.
Now you need to find your data in each row. If your data is stored as you said (single words or numbers separated by spaces), you can split each row again:
text_for_file = ""
for row in data[1:]:
    splitted_text = row.split(' ')
    text_for_file = '\n'.join([text_for_file, splitted_text[2]])
# So now all your notes are stored in text_for_file, line by line
# Write all the data to a file
with open("your_file.txt", 'w') as f:
    f.write(text_for_file)
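For completeness, a sketch using the standard csv module's DictReader, which handles commas inside quoted fields for free. It assumes the header row is exactly Oggetto,valueID,note (a tiny stand-in file is created here for illustration):

```python
import csv

# Tiny stand-in for the real export.csv (assumed header: Oggetto,valueID,note)
with open('export.csv', 'w', newline='') as f:
    f.write('Oggetto,valueID,note\n'
            '1,work1,DescrizioneA\n'
            '2,work2,DescrizioneB\n')

# Read each row as a dict keyed by the header names and write
# one .txt file per row, named after the "valueID" value.
with open('export.csv', newline='') as source:
    for row in csv.DictReader(source):
        with open(row['valueID'] + '.txt', 'w') as out:
            out.write(row['note'])
```

This sidesteps both problems in the question: the filename comes from the valueID column, and only the note column is written.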

Related

python read excel file and save N txt files with title and content from excel

I have an excel file with 3 columns:
index | name | surname
------|------|--------
0     | John | White
2     | Bill | Black
3     | Jack | Red
I need to create N txt files (one per row), with the title taken from the name column and the content taken from the surname column.
For example, based on the data above I would like to have 3 files: John.txt (with content "White"), Bill.txt (content "Black") and Jack.txt (content "Red").
You can do this using pandas by extracting the values as lists:
# import and read
import pandas as pd
df = pd.read_excel("your_file.xlsx")
# create lists
names = df["name"].values
file_contents = df["surname"].values
# iterate through the lists
for name, content in zip(names, file_contents):
    with open(f"{name}.txt", "w") as f:
        f.write(content)
You can do this pretty easily with pylightxl; see https://pylightxl.readthedocs.io/en/latest/quickstart.html
import pylightxl as xl

workbook = xl.readxl('yourexcelfile.xlsx')
for row in workbook.ws('Sheet1').rows:
    filename = row[1]  # the "name" column (rows are zero-indexed lists)
    text = row[2]      # the "surname" column
    with open(filename + '.txt', 'w') as f:
        f.write(text)

Python: merge csv data with differing headers

I have a bunch of software output files that I have manipulated into csv-like text files. I have probably done this the hard way, because I am not too familiar with the Python libraries.
The next step is to gather all this data in one single csv file. The files have different headers, or are sorted differently.
Let's say this is file A:
A | B | C | D | id
0 | 2 | 3 | 2 | "A"
...
and this is file B:
B | A | Z | D | id
4 | 6 | 1 | 0 | "B"
...
I want the append.csv file to look like:
A | B | C | D | Z | id
0 | 2 | 3 | 2 |   | "A"
6 | 4 |   | 0 | 1 | "B"
...
How can I do this, elegantly? Thank you for all answers.
You can use pandas to read CSV files into DataFrames and use the concat method, then write the result to CSV:
import pandas as pd
df1 = pd.read_csv("file1.csv")
df2 = pd.read_csv("file2.csv")
df = pd.concat([df1, df2], axis=0, ignore_index=True)
df.to_csv("file.csv", index=False)
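concat takes the union of the column names and fills the gaps with NaN, which is exactly the alignment the question asks for. A quick sanity check with inline stand-ins for the two files (rows taken from the example above):

```python
import io
import pandas as pd

# Inline stand-ins for file1.csv and file2.csv from the question
df1 = pd.read_csv(io.StringIO('A,B,C,D,id\n0,2,3,2,A\n'))
df2 = pd.read_csv(io.StringIO('B,A,Z,D,id\n4,6,1,0,B\n'))

# Rows are stacked; columns are matched by name, missing ones become NaN
df = pd.concat([df1, df2], axis=0, ignore_index=True)
print(df)
```

Note that the values land under the right headers even though file B lists its columns in a different order.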
The csv module in the standard library provides tools you can use to do this. The DictReader class produces a mapping of column name to value for each row in a csv file; the DictWriter class will write such mappings to a csv file.
DictWriter must be provided with a list of column names, but does not require all column names to be present in each row mapping.
import csv

list_of_files = ['1.csv', '2.csv']

# Collect the column names.
all_headers = set()
for file_ in list_of_files:
    with open(file_, newline='') as f:
        reader = csv.reader(f)
        headers = next(reader)
        all_headers.update(headers)
all_headers = sorted(all_headers)

# Generate the output file.
with open('append.csv', 'w', newline='') as outfile:
    writer = csv.DictWriter(outfile, fieldnames=all_headers)
    writer.writeheader()
    for file_ in list_of_files:
        with open(file_, newline='') as f:
            reader = csv.DictReader(f)
            writer.writerows(reader)
$ cat append.csv
A,B,C,D,Z,id
0,2,3,2,,A
6,4,,0,1,B

Add paddings in csv file to make data frame readable for pandas

I have several data files in csv format (similar data structure, but not the same), with different numbers of rows and columns on certain lines.
For example, the first three lines of each csv file have varying numbers of columns, i.e.:
----------------
Table | Format |
----------------
Code | Label | Index |
-----------------------------------------
a | b | c | d | e |
-----------------------------------------
which looks kind of ugly and makes it difficult to read into pandas to work with.
I want the table to recognize the maximum number of columns in a file and pad the empty spaces so that the dimensions are equal.
ie.
-----------------------------------------
Table | Format | pad | pad | pad |
-----------------------------------------
Code | Label | Index | pad | pad |
-----------------------------------------
a | b | c | d | e |
-----------------------------------------
So far, I have looked into reading with pandas and adding headers to the csv file, but because the maximum number of columns varies between csv files, I've been kind of stuck.
Any help or pointers would be appreciated!
If your column separator is a comma, you can pad by simply appending an appropriate number of commas to the end of each row. Using read_csv, pandas will then read the padded values in as NaN.
with open('/path/to/data.csv', 'r') as f:
    data = f.read().splitlines()
# Count the number of columns in each line
cols = [row.count(',') + 1 for row in data]
# Find the widest row
max_cols = max(cols)
# Loop over the lines of text
for id, row in enumerate(data):
    # Pad extra columns when necessary
    if cols[id] < max_cols:
        data[id] += (max_cols - cols[id]) * ','
# Write the data out
with open('/path/to/pad_data.csv', 'w') as f:
    f.write('\n'.join(data))
Setting up some test data:
data = '1,2,3\n4,\n5,6,7,8,9\n'
print(data)
#1,2,3
#4,
#5,6,7,8,9
Applying the method above gives:
print('\n'.join(data))
#1,2,3,,
#4,,,,
#5,6,7,8,9
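As the answer notes, pandas then reads those padded cells as NaN. A small check using the padded test data inline (no header row assumed):

```python
import io
import pandas as pd

# The padded output from above, inline instead of a file
pad_data = '1,2,3,,\n4,,,,\n5,6,7,8,9'

# header=None because this toy data has no header row
df = pd.read_csv(io.StringIO(pad_data), header=None)
print(df.shape)  # every row now has the same width
```

The trailing commas become NaN cells, so the frame is rectangular and ready to work with.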
Here's a little script I wrote to pad out columns derived from a pandas dataframe. My intermediate file was pipe-delimited:
INPUT_FILE = r'blah.txt'
OUTPUT_FILE = r'blah.TAB'

col_widths = []
with open(INPUT_FILE, "r") as fi:
    line = fi.readline()
    headers = line.split(sep='|')
    for h in headers:
        col_widths.append(len(h))

with open(INPUT_FILE) as fi:
    line = fi.readline()
    while line:
        cols = line.split(sep='|')
        line = fi.readline()
        index = 0
        for c in cols:
            if len(c) > col_widths[index]:
                col_widths[index] = len(c)
            index += 1

with open(INPUT_FILE) as fi:
    fo = open(OUTPUT_FILE, 'w')
    line = fi.readline()
    while line:
        tokens = line.split(sep='|')
        index = 0
        for t in tokens:
            if index == len(col_widths) - 1:
                t = t.rstrip('\r\n')
            ft = '{:<' + str(col_widths[index]) + '}'
            v = ft.format(t)
            fo.write(v + '|')
            index += 1
        fo.write('\r')
        line = fi.readline()
    fo.close()

How do I parse "invoice" level data into columnar data for analysis?

Data looks like this
Invoice 1
ID
Lat
Long
Year
Month
Observations
1
.
.
.
n
#-----
Invoice 2-n (pattern repeats)
My goal is to end up with a table in the form
ID | Lat | Long | Year | Month | Obs 1 | Obs 2 | Obs 3 | Obs n
#----- acts as the delimiter between invoices
It's easy to go from wide to long at that point, but what's the best way to write the mapping rule and iterate through the data? All my data is in a single .csv, but it's over 1 million lines.
I'm looking for a place to start, and a general process for handling data in this format.
import csv

with open('path/to/input') as infile, open('path/to/output', 'w') as fout:
    outfile = csv.writer(fout)
    invoice = []
    for line in infile:
        if line.startswith("Invoice"):
            if invoice:  # skip the empty list before the first invoice
                outfile.writerow(invoice)
            invoice = []
            continue
        line = line.strip()
        if not line:
            continue
        invoice.append(line)
    outfile.writerow(invoice)
A simple loop should work:
with open('...') as infile:
    data = []
    line = []
    item = infile.readline().strip()
    while item != '':
        if item.startswith('#-----'):
            data.append(line)
            line = []
        else:
            line.append(item)
        item = infile.readline().strip()
At the end, data is a list of lists (not necessarily rectangular).
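To get from that list of lists to the columnar table in the question, one option is to pad each record to the widest one and write it out with csv.writer. A sketch under the assumption that the first five fields are always ID, Lat, Long, Year, Month and everything after is an observation (the sample records here are made up):

```python
import csv

# Hypothetical parsed records: five fixed fields, then a
# variable number of observations per invoice.
data = [
    ['1', '40.7', '-74.0', '2020', '01', '3', '5'],
    ['2', '34.0', '-118.2', '2020', '02', '7'],
]

# Size the header to the widest record so the CSV is rectangular.
n_obs = max(len(rec) for rec in data) - 5
header = (['ID', 'Lat', 'Long', 'Year', 'Month']
          + [f'Obs {i + 1}' for i in range(n_obs)])

with open('invoices.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(header)
    for rec in data:
        # Pad short records with empty cells
        writer.writerow(rec + [''] * (len(header) - len(rec)))
```

Because the rows are padded, the resulting file loads cleanly into pandas for the wide-to-long step the question mentions.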

Python CSV: How to ignore writing similar rows given one row meets a condition?

I'm currently keeping track of the large scale digitization of video tapes and need help pulling data from multiple CSVs. Most tapes have multiple copies, but we only digitize one tape from the set. I would like to create a new CSV containing only tapes of shows that have yet to be digitized. Here's a mockup of my original CSV:
Date Digitized | Series | Episode Number | Title | Format
---------------|----------|----------------|-------|--------
01-01-2016 | Series A | 101 | | VHS
| Series A | 101 | | Beta
| Series A | 101 | | U-Matic
| Series B | 101 | | VHS
From here, I'd like to ignore all fields containing "Series A" AND "101", as this show has a value in the "Date Digitized" cell. I attempted isolating these conditions but can't seem to get a complete list of undigitized content. Here's my code:
import csv, glob, os

names = glob.glob("*.csv")
names = [os.path.splitext(each)[0] for each in names]
for name in names:
    with open("%s_.csv" % name, "rb") as source:
        reader = csv.reader(source)
        with open("%s_edit.csv" % name, "wb") as result:
            writer = csv.writer(result)
            for row in reader:
                if row[0]:
                    series = row[1]
                    epnum = row[2]
                    if row[1] != series and row[2] != epnum:
                        writer.writerow(row)
I'll add that this is my first question and I'm very new to Python, so any advice would be much appreciated!
I am not a hundred percent sure I've understood your needs. However, this might put you on the right track. I am using the pandas module:
data = """
Date Digitized | Series | Episode Number | Title | Format
---------------|----------|----------------|-------|--------
01-01-2016 | Series A | 101 | | VHS
| Series A | 101 | | Beta
| Series A | 101 | | U-Matic
| Series B | 101 | | VHS"""

# useful module for treating csv files (and many others)
import pandas as pd
# module that lets us treat a string as if it were a csv file
import io

# read the csv into a pandas DataFrame;
# use row 0 as the header; fields are separated by |
df = pd.read_csv(
    io.StringIO(data),
    header=0,
    sep="|"
)

# there is a bit of a problem with white space:
# remove white space from the column names
df.columns = [x.strip() for x in df.columns]
# remove white space from all string fields
df = df.applymap(lambda x: x.strip() if type(x) == str else x)

# finally choose the subset we want;
# for some reason pandas guessed the type of Episode Number wrong
# (it should be an integer); this probably won't be a problem when
# loading directly from a file
df = df[~((df["Series"] == "Series A") & (df["Episode Number"] == "101"))]

# print the result
print(df)
#   Date Digitized      Series   Episode Number    Title    Format
# 0 ---------------  ----------  ----------------  -------  --------
# 4                    Series B               101             VHS
Feel free to ask; hopefully I'll be able to change the code according to your actual needs or help in some other way.
The simplest approach is to make two reads of the set of CSV files: one to build a list of all digitized tapes, the second to build a unique list of all tapes not on the digitized list:
# build list of digitized tapes
digitized = []
for name in names:
    with open("%s_.csv" % name, "rb") as source:
        reader = csv.reader(source)
        next(reader)  # skip header
        for row in reader:
            if row[0] and ((row[1], row[2]) not in digitized):
                digitized.append((row[1], row[2]))

# build list of non-digitized tapes
digitize_me = []
for name in names:
    with open("%s_.csv" % name, "rb") as source:
        reader = csv.reader(source)
        header = next(reader)[1:3]  # skip / save header
        for row in reader:
            if not row[0] and ((row[1], row[2]) not in digitized + digitize_me):
                digitize_me.append((row[1], row[2]))

# write non-digitized tapes to 'digitize.csv'
with open("digitize.csv", "wb") as result:
    writer = csv.writer(result)
    writer.writerow(header)
    for tape in digitize_me:
        writer.writerow(tape)
input file 1:
Date Digitized,Series,Episode Number,Title,Format
01-01-2016,Series A,101,,VHS
,Series A,101,,Beta
,Series C,101,,Beta
,Series D,102,,VHS
,Series B,101,,U-Matic
input file 2:
Date Digitized,Series,Episode Number,Title,Format
,Series B,101,,VHS
,Series D,101,,Beta
01-01-2016,Series C,101,,VHS
Output:
Series,Episode Number
Series D,102
Series B,101
Series D,101
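The `not in` tests above scan Python lists, which gets slow as the data grows; the same two-pass idea works with sets for constant-time lookups. A sketch using a made-up single input file, in text mode (Python 3):

```python
import csv

# Toy stand-in for one of the question's CSV exports
with open('tapes1_.csv', 'w', newline='') as f:
    f.write('Date Digitized,Series,Episode Number,Title,Format\n'
            '01-01-2016,Series A,101,,VHS\n'
            ',Series A,101,,Beta\n'
            ',Series B,101,,VHS\n')

with open('tapes1_.csv', newline='') as source:
    reader = csv.reader(source)
    next(reader)  # skip header
    rows = list(reader)

# First pass: record every (Series, Episode) that has a digitized date.
digitized = {(row[1], row[2]) for row in rows if row[0]}

# Second pass: keep undigitized tapes not already digitized or queued.
seen = set()
digitize_me = []
for row in rows:
    key = (row[1], row[2])
    if not row[0] and key not in digitized and key not in seen:
        seen.add(key)
        digitize_me.append(key)
```

Only Series B episode 101 survives, since every Series A 101 copy is covered by the digitized VHS.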
As per OP comment, the line
header = next(reader)[1:3] # skip / save header
serves two purposes:
1. Assuming each csv file starts with a header, we do not want to read that header row as if it contained data about our tapes, so we need to "skip" the header row in that sense.
2. But we also want to save the relevant parts of the header for when we write the output csv file. We want that file to have a header as well. Since we are only writing the series and episode number, which are row fields 1 and 2, we assign just that slice, i.e. [1:3], of the header row to the header variable.
It's not really standard to have a line of code serve two pretty unrelated purposes like that, which is why I commented it. It also assigns to header multiple times (assuming multiple input files) when header only needs to be assigned once. Perhaps a cleaner way to write that section would be:
# build list of non-digitized tapes
digitize_me = []
header = None
for name in names:
    with open("%s_.csv" % name, "rb") as source:
        reader = csv.reader(source)
        if header:
            next(reader)  # skip header
        else:
            header = next(reader)[1:3]  # read header
        for row in reader:
            ...
It's a question of which form is more readable. Either way is close but I thought combining 5 lines into one keeps the focus on the more salient parts of the code. I would probably do it the other way next time.