Python: merge csv data with differing headers

I have a bunch of software output files that I have manipulated into csv-like text files. I have probably done this the hard way, because I am not too familiar with Python's libraries.
The next step is to gather all this data into one single csv file. The files have different headers, or the columns are sorted differently.
Let's say this is file A:
A | B | C | D | id
0 | 2 | 3 | 2 | "A"
...
and this is file B:
B | A | Z | D | id
4 | 6 | 1 | 0 | "B"
...
I want the append.csv file to look like:
A | B | C | D | Z | id
0 | 2 | 3 | 2 |   | "A"
6 | 4 |   | 0 | 1 | "B"
...
How can I do this elegantly? Thanks for any answers.

You can use pandas to read the CSV files into DataFrames and concatenate them with concat, which aligns columns by name and leaves missing values as NaN, then write the result back to CSV:
import pandas as pd
df1 = pd.read_csv("file1.csv")
df2 = pd.read_csv("file2.csv")
df = pd.concat([df1, df2], axis=0, ignore_index=True)
df.to_csv("file.csv", index=False)

The csv module in the standard library provides tools you can use to do this. The DictReader class produces a mapping of column name to value for each row in a csv file; the DictWriter class will write such mappings to a csv file.
DictWriter must be provided with a list of column names, but does not require all column names to be present in each row mapping.
import csv

list_of_files = ['1.csv', '2.csv']

# Collect the column names.
all_headers = set()
for file_ in list_of_files:
    with open(file_, newline='') as f:
        reader = csv.reader(f)
        headers = next(reader)
        all_headers.update(headers)
all_headers = sorted(all_headers)

# Generate the output file.
with open('append.csv', 'w', newline='') as outfile:
    writer = csv.DictWriter(outfile, fieldnames=all_headers)
    writer.writeheader()
    for file_ in list_of_files:
        with open(file_, newline='') as f:
            reader = csv.DictReader(f)
            writer.writerows(reader)
$ cat append.csv
A,B,C,D,Z,id
0,2,3,2,,A
6,4,,0,1,B
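By default, DictWriter writes missing fields as empty strings; if you would rather see an explicit marker, its restval parameter sets the fill value. A small sketch, using an in-memory buffer instead of a file:

```python
import csv
import io

out = io.StringIO()
# restval fills any field missing from a row dict ('' is the default)
writer = csv.DictWriter(out, fieldnames=['A', 'B', 'Z'], restval='NA')
writer.writeheader()
writer.writerow({'A': '0', 'B': '2'})  # no 'Z' key -> filled with 'NA'
print(out.getvalue())
```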

Related

Add a new column in csv python

How do I add a new column at the very beginning of a csv file? I know it can be done with pandas, but I am having an issue with pandas, so is there another way to do it? I have something that looks like this:
a b c d
0 1 2 3
I want to do this instead:
letters a b c d
numbers 0 1 2 3
What do you mean by "I am having issues" with pandas?
Have you tried running df.insert(0, "letters", "numbers")?
Anyway, alternatively, you can use the csv.reader function from the csv module with a list comprehension to insert the new column:
import csv

with open("input.csv", "r", newline="") as file:
    rows = [["letters" if idx == 0 else "numbers"] + row
            for idx, row in enumerate(csv.reader(file, delimiter=","))]

with open("output.csv", "w", newline="") as file:
    csv.writer(file, delimiter=",").writerows(rows)
Output:
from tabulate import tabulate  # pip install tabulate

with open("output.csv", "r", newline="") as file:
    reader = csv.reader(file, delimiter=",")
    print(tabulate([row for row in reader]))

-------  -  -  -  -
letters  a  b  c  d
numbers  0  1  2  3
-------  -  -  -  -
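For the record, the df.insert route from the comment would look roughly like this, assuming the letters row is read as the header (the in-memory string stands in for input.csv):

```python
import io
import pandas as pd

csv_text = "a,b,c,d\n0,1,2,3\n"           # stand-in for input.csv
df = pd.read_csv(io.StringIO(csv_text))    # 'a,b,c,d' becomes the header
df.insert(0, "letters", "numbers")         # a scalar value is broadcast to every row
print(df.to_csv(index=False))
```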

Python | How to add a header to the csv file while creating it?

I am parsing hex data obtained from a pipeline. The data is parsed line by line and written to a csv file. I need to add a header.
So data obtained:
a b c d e....iy
f g h i j....iy
Required format:
1 2 3 4 5....259
a b c d e....iy
f g h i j....iy
I tried the writerow function. Since the parsing is line by line, the data obtained is as follows:
1 2 3 4 5....259
a b c d e....iy
1 2 3 4 5....259
e f g h i....iy
It prints the header after every line.
The code I am currently using to write the data to file is below:
if '[' in line:
    # processdata functions (converting from hex)
    line = processdata
    f = open("output.csv", "a+")
    f.write(line)
    f.close()
I'd appreciate any suggestions for this line-by-line parsing of the file.
I am looking for something like open("file.csv", "a+", header=['1', '2', '3', 'n']). Thank you.
Using pandas
# importing python package
import pandas as pd
# read contents of csv file
file = pd.read_csv("gfg.csv")
print("\nOriginal file:")
print(file)
# adding header
headerList = ['id', 'name', 'profession']
# converting data frame to csv
file.to_csv("gfg2.csv", header=headerList, index=False)
# display modified csv file
file2 = pd.read_csv("gfg2.csv")
print('\nModified file:')
print(file2)
see https://www.geeksforgeeks.org/how-to-add-a-header-to-a-csv-file-in-python/
and https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_csv.html
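For the append-mode, line-by-line workflow in the question, a common pattern is to write the header only when the file is first created. A sketch (the file name and field values here are placeholders for the parsed hex data):

```python
import csv
import os

def append_row(path, fields, header):
    # Write the header row only when the file does not exist yet.
    new_file = not os.path.exists(path)
    with open(path, "a", newline="") as f:
        writer = csv.writer(f)
        if new_file:
            writer.writerow(header)
        writer.writerow(fields)

if os.path.exists("output.csv"):  # start fresh for the demo
    os.remove("output.csv")

header = [str(i) for i in range(1, 6)]
append_row("output.csv", ["a", "b", "c", "d", "e"], header)
append_row("output.csv", ["f", "g", "h", "i", "j"], header)
```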

How to export csv row into separate .txt file

I have a .csv (export.csv) which contains almost 9k rows structured as follows:
| Oggetto | valueID | note         |
|---------|---------|--------------|
| 1       | work1   | DescrizioneA |
| 2       | work2   | DescrizioneB |
| 3       | work3   | DescrizioneC |
I would like to export each value from the column "note" into a separate .txt file, named after the value in the column "valueID", i.e. work1.txt (the content of work1.txt being "DescrizioneA").
Starting from this similar issue I tried, without success, like so:
import csv
with open('export.csv', 'r') as file:
    data = file.read().split('\n')
    for row in range(1, len(data)):
        third_col = data
        with open('t' + '.txt', 'w') as output:
            output.write(third_col[2])
I then tried with pandas:
import pandas as pd

data = pd.read_csv("export.csv", engine='python')
d = data
file = 'file{}.txt'
n = 0  # to number the files
for row in d.iterrows():
    with open(file.format(n), 'w') as f:
        f.write(str(row))
    n += 1
I'm getting something, but:
- the filename of each file is just progressive, from 1 to 9000, e.g. file0001.txt;
- the content of each .txt comprises all 3 columns, and the content of the column "note" is truncated, e.g. "La cornice architettonica superiore del capite..."
Any idea?
Thanks
You can try pandas:
import pandas as pd

df = pd.read_csv("export.csv", sep=",")
for index in range(len(df)):
    with open(df["valueID"][index] + '.txt', 'w') as output:
        output.write(df["note"][index])
If you do not want to use pandas, you can do it like this:
with open('export.csv', 'r') as file:
    data = file.read().split('\n')
OK, it was a good start to store your data line by line in a variable.
Now you need to find your data in each row. If your data is stored the way you said (single words or numbers separated by spaces), you can split each row again:
text_for_file = ""
for row in range(1, len(data)):
    splitted_text = data[row].split(' ')  # index into data; the loop variable is just a counter
    text_for_file = '\n'.join([text_for_file, splitted_text[2]])
# Now all your notes are stored in text_for_file, line by line.
# Write all your data to a file:
with open("your_file.txt", 'w') as f:
    f.write(text_for_file)
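To match the question's actual goal, one .txt file per row, named after valueID, a DictReader sketch works too (the in-memory string stands in for export.csv):

```python
import csv
import io

# Stand-in for export.csv, with the headers from the question.
csv_text = "Oggetto,valueID,note\n1,work1,DescrizioneA\n2,work2,DescrizioneB\n"
for row in csv.DictReader(io.StringIO(csv_text)):
    # One file per row, named after the valueID column.
    with open(row["valueID"] + ".txt", "w") as out:
        out.write(row["note"])
```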

sort csv column based on value of specific column with defaultdict in python

How could I sort a column in a csv file the way Excel would sort it? Below are my csv file and the snippet of code I have so far. I want to sort by ArrivalTime so that the corresponding Process and ServiceTime move along with it. Thanks for any help or advice.
csv:
Process,ArrivalTime,ServiceTime
A,0,3
B,2,6
C,4,4
D,6,5
E,8,2
and my code:
import csv
from collections import defaultdict

columns = defaultdict(list)
with open('file.csv') as f:
    reader = csv.DictReader(f)
    for row in reader:
        for (k, v) in row.items():
            columns[k].append(v)
st = columns['ServiceTime']
at = columns['ArrivalTime']
pr = columns['Process']
Have you considered using pandas? It has built-in methods for handling exactly this type of situation.
import pandas as pd
# create a dataframe from the file, like an Excel spreadsheet
df = pd.read_csv('file.csv')
df.sort_values('ArrivalTime')
# returns:
Process ArrivalTime ServiceTime
0 A 0 3
1 B 2 6
2 C 4 4
3 D 6 5
4 E 8 2
I agree that you should use pandas...
Apart from that, you don't need a defaultdict here.
Read the file and sort:
import csv

list_of_dicts = []
with open('in.csv', 'r') as f:
    reader = csv.DictReader(f)
    for line in reader:
        list_of_dicts.append(line)
# Sort numerically; sorting the raw strings would put '10' before '2'.
list_of_dicts.sort(key=lambda d: int(d['ArrivalTime']))
Write it back out:
with open('out.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=list_of_dicts[0].keys())
    writer.writeheader()
    writer.writerows(list_of_dicts)

Writing columns from separate files into a single file

I am relatively new to working with csv files in Python and would appreciate some guidance. I have 6 separate csv files. I would like to copy the data from columns 1-3 of each csv file into the corresponding first 3 columns of a new file.
How do I word that into my code?
Here is my incomplete code:
import csv
file1 = open('fileA.csv', 'rb')
reader1 = csv.reader(file1)
file2 = open('fileB.csv', 'rb')
reader2 = csv.reader(file2)
file3 = open('fileC.csv', 'rb')
reader3 = csv.reader(file3)
file4 = open('fileD.csv', 'rb')
reader4 = csv.reader(file4)
file5 = open('fileE.csv', 'rb')
reader5 = csv.reader(file5)
file6 = open('fileF.csv', 'rb')
reader6 = csv.reader(file6)
WriteFile = open('NewFile.csv', 'wb')
writer = csv.writer(WriteFile)
next(reader1, None)
Data1 = (col[0:3] for col in reader1)
next(reader2, None)
Data2 = (col[0:3] for col in reader2)
next(reader3, None)
Data3 = (col[0:3] for col in reader3)
next(reader4, None)
Data4 = (col[0:3] for col in reader4)
next(reader5, None)
Data5 = (col[0:3] for col in reader5)
next(reader6, None)
Data6 = (col[0:3] for col in reader6)
.......????????
file1.close()
file2.close()
file3.close()
file4.close()
file5.close()
file6.close()
WriteFile.close()
Thanks!
If you just want these all concatenated, that's easy. You can either call writerows on each of your iterators, or chain them together (remember to import itertools):
import itertools
writer.writerows(itertools.chain(Data1, Data2, Data3, Data4, Data5, Data6))
Or, if you want them interleaved, where you get row 1 from Data1, then row 1 from Data 2, and so on, and then row 2 from Data 1, etc., use zip to transpose the data, and then chain again to flatten it:
writer.writerows(itertools.chain.from_iterable(zip(Data1, Data2, Data3,
                                                   Data4, Data5, Data6)))
If the files are of different lengths, that zip will stop as soon as you reach the end of any of the files. Is that what you want? I have no idea. You might want that. You might want to fill in the gaps with blank rows (in which case look at zip_longest). You might want to skip over the gaps (which you can do with zip_longest plus filter). Or a million other possibilities.
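A sketch of the zip_longest variant, padding the shorter input with blank rows (toy lists stand in for the csv readers):

```python
import itertools

data1 = [['a1', 'b1'], ['a2', 'b2'], ['a3', 'b3']]  # toy stand-ins for two
data2 = [['x1', 'y1']]                               # readers of unequal length

blank = ['', '']
# zip_longest pads the shorter iterable with blank rows instead of stopping early;
# chain.from_iterable then flattens the pairs back into a single row stream.
interleaved = list(itertools.chain.from_iterable(
    itertools.zip_longest(data1, data2, fillvalue=blank)))
```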
As a side note, once you get to this many similar variables, it's usually a good sign that you really wanted a single iterable instead of separate variables. For example:
import csv
import itertools

filenames = ('fileA.csv', 'fileB.csv', 'fileC.csv',
             'fileD.csv', 'fileE.csv', 'fileF.csv')
files = [open(filename, 'rb') for filename in filenames]
readers = [csv.reader(file) for file in files]
WriteFile = open('NewFile.csv', 'wb')
writer = csv.writer(WriteFile)
for reader in readers:
    next(reader, None)
Data = [(col[0:3] for col in reader) for reader in readers]
writer.writerows(itertools.chain.from_iterable(Data))
for file in files:
    file.close()
WriteFile.close()
(Notice that I used list comprehensions, not generator expressions, for the collections of files, readers, data, etc. That's because we need to iterate over them repeatedly—e.g., create a reader for every file, and later call close on every file. Also because there are a fixed, small number of elements—6—so "wasting" a whole list isn't really any issue.)
The way I understand your question, you have six separate csv's that have 3 columns each, and the data in each column is of the same type in all six files. If so, you could use pandas. Say you had 3 files that looked like...
file1:
col1 col2 col3
1    1    1
1    1    1
and then a second and third file with 2's in the second and 3's in the third, you could write...
#!/usr/bin/env python
import pandas as pd

cols = ['col1', 'col2', 'col3']
files = ['~/one.txt', '~/two.txt', '~/three.txt']
# header=0 skips each file's own header row; names=cols relabels the columns
data_1 = pd.read_csv(files[0], sep=',', header=0, names=cols)
data_2 = pd.read_csv(files[1], sep=',', header=0, names=cols)
data_3 = pd.read_csv(files[2], sep=',', header=0, names=cols)
data_final = data_1.append(data_2).append(data_3)
Then data_final should have the contents of all three data sets stacked on each other. You can modify for 6 (or n) datasets. Hope this is what you wanted.
Out[1]:
  col1  col2  col3
     1     1     1
     1     1     1
     2     2     2
     2     2     2
     3     3     3
     3     3     3
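One caveat: DataFrame.append was deprecated and then removed in pandas 2.0; pd.concat performs the same stacking:

```python
import pandas as pd

data_1 = pd.DataFrame({'col1': [1, 1], 'col2': [1, 1], 'col3': [1, 1]})
data_2 = pd.DataFrame({'col1': [2, 2], 'col2': [2, 2], 'col3': [2, 2]})
# concat stacks the frames; ignore_index renumbers the rows 0..n-1
data_final = pd.concat([data_1, data_2], ignore_index=True)
```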
