Extract and merge columns in a CSV - python

Here's my situation: I have two CSV files (file1 and file2). File1 has about 15 columns and file2 has about 10 columns. I need to grab all 15 columns from file1, extract just column 13 from file2, and merge all 16 columns into a new CSV file called "final.csv". Please suggest some ideas as to how I can make this code work. Here is what I have so far:
import csv

File1 = 'F:\somedata\somefolder\file1.csv'
File2 = 'F:\somedata\somefolder\file2.csv'
File3 = 'F:\\somedata\somefolder\final.csv'

with open('r', 'File1' and 'File2', 'rt') as f, open('r', 'File3', 'wt', newline='') as f_out:
    headings = next(iter(csv.reader(f)))
    csv.writer(f_out).writerow(headings)
    csvout = csv.DictWriter(f_out, fieldnames=headings)
    for d in csv.DictReader(f, fieldnames=headings):
        csvout.writerow(d)

I would start by using pandas to load your files as tables. Then use indexing to select the columns you want, merge the files, and write out a new file. Obviously you can't select the thirteenth column from file2 if it only has 10 columns, so here I am assuming you DO have 13 columns in that file.
import pandas as pd

# Raw strings so backslashes in the Windows paths aren't treated as escapes
file1 = pd.read_csv(r'F:\somedata\somefolder\file1.csv', header=None)
file2 = pd.read_csv(r'F:\somedata\somefolder\file2.csv', header=None)

# The thirteenth column is position 12, since positions are zero-based
file2_short = file2.iloc[:, 12:13]

# pd.concat takes a list of frames; axis=1 puts them side by side
new = pd.concat([file1, file2_short], axis=1)
new.to_csv(r'F:\somedata\somefolder\newfile.csv')
This assumes that you want column 13 from file2. If that column has a header (in which case you would remove the header=None part) you can select it by name instead...
file2_short = file2['col_13']
Hope this helps
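
If you want to stay with the csv module from your attempt, a minimal sketch along these lines should also work (assuming the same paths as above, and that both files have the same number of rows):

import csv

file1_path = r'F:\somedata\somefolder\file1.csv'
file2_path = r'F:\somedata\somefolder\file2.csv'
out_path = r'F:\somedata\somefolder\final.csv'

with open(file1_path, newline='') as f1, \
     open(file2_path, newline='') as f2, \
     open(out_path, 'w', newline='') as f_out:
    writer = csv.writer(f_out)
    # Walk both files in step: all of file1's columns plus
    # column 13 (index 12) of file2
    for row1, row2 in zip(csv.reader(f1), csv.reader(f2)):
        writer.writerow(row1 + [row2[12]])

Note that zip stops at the shorter file; use itertools.zip_longest instead if the row counts can differ.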

Related

Writing contents from a text file to a CSV file in only one column - how to?

I have converted a PDF file to a text file, and this text file was then converted to a CSV file. My problem is that the contents of the CSV file are written across multiple columns (A, B, C, D, E), whereas I want them written in only one column, i.e. column A. How could I write the contents from these columns into only one column?
I've tried using the merge, concatenate, and join functions, but they were of no help.
Here's my code:
import os.path
import csv
import pdftotext

# Load your PDF
with open("crimestory.pdf", "rb") as f:
    pdf = pdftotext.PDF(f)

# Save all text to a txt file.
with open('crimestory.txt', 'w') as f:
    f.write("\n\n".join(pdf))

save_path = "/home/mayureshk/PycharmProjects/NLP/"
completeName_in = os.path.join(save_path, 'crimestory' + '.txt')
completeName_out = os.path.join(save_path, 'crimestoryycsv' + '.csv')

file1 = open(completeName_in)
In_text = csv.reader(file1, delimiter=',')
file2 = open(completeName_out, 'w')
out_csv = csv.writer(file2)
file3 = out_csv.writerows(In_text)
file1.close()
file2.close()
The expected output in the CSV file: column A should hold all the information, and the rest of the columns should be empty.
You can merge all the columns into one like this.
import pandas as pd

# Dummy df
df = pd.DataFrame({'ColA': ['value_A1', 'value_A2', 'value_A3', 'value_A4'],
                   'ColB': ['value_B1', 'value_B2', 'value_B3', 'value_B4'],
                   'ColC': ['value_C1', 'value_C2', 'value_C3', 'value_C4']})
I'll use pandas to load your csv:
df = pd.read_csv(save_path + 'crimestoryycsv.csv', sep=',')
df = df.astype(str)
col = df.columns
# Concatenate every column into a single pipe-separated column
df['All'] = df[col[0]].str.cat(df[col[1:]], sep='|')
df.drop(col, axis=1, inplace=True)
Results:
All
0 value_A1|value_B1|value_C1
1 value_A2|value_B2|value_C2
2 value_A3|value_B3|value_C3
3 value_A4|value_B4|value_C4
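
To get the result back on disk with everything in column A, one way (the output filename here is just an example) is:

# Write the single merged column out; suppress header and index so
# column A holds only the data (example output name)
df.to_csv(save_path + 'crimestory_onecol.csv', header=False, index=False)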

Missing out rows with blank spaces when writing to a new CSV file

I'm attempting to write a program that enters a directory full of CSV files (all with the same layout but different data), reads the files, and writes the data from specific columns to a new CSV file. I would also like it to miss out an entire row of data if there is a blank space in one of the columns (in this case, if there is a gap in the Name column).
The program works fine for writing specific columns (in this case Name and Location) from the old CSV files to the new one; however, I am unsure how to miss out a line if there is a blank space.
import nltk
import csv
from nltk.corpus import PlaintextCorpusReader

root = '/Users/bennaylor/local documents/humanotics'
incorpus = root + '/chats/input/'
outcorpus = root + '/chats/output.csv'

doccorpus = PlaintextCorpusReader(incorpus, r'.*\.csv')
filelist = doccorpus.fileids()

with open(outcorpus, 'w', newline='') as fw:
    fieldnames = ['Name', 'Location']
    writer = csv.DictWriter(fw, fieldnames=fieldnames)
    writer.writeheader()
    print('Writing Headers!!')
    for rawdoc in filelist:
        infile = incorpus + rawdoc
        with open(infile, encoding='utf-8') as fr:
            reader = csv.DictReader(fr)
            for row in reader:
                rowName = row['Name']
                rowLocation = row['Location']
                writer.writerow({'Name': rowName, 'Location': rowLocation})
An example CSV input file would look like this:
Name,Age,Location,Birth Month
Steve,14,London,November
,18,Sydney,April
Matt,12,New York,June
Jeff,20,,December
Jonty,19,Greenland,July
This has gaps in the Name column on the third row and the Location column on the fifth. In this case, I would like the program to miss out the third row when writing the data to the new CSV, as there is a gap in the Name column.
Thanks in advance for any help.
This is easy to do using pandas:
import pandas as pd
import os

# Collect the data from all the files
frames = []
for filename in filelist:
    frames.append(pd.read_csv(os.path.join(incorpus, filename)))
df = pd.concat(frames)

# Drop rows with an empty Name (dropna() with no arguments would also
# drop rows that are only missing Location)
df = df.dropna(subset=['Name'])

# Keep only the needed columns
df = df.reindex(columns=['Name', 'Location'])

# Write the dataframe to a .csv file
df.to_csv(outcorpus, index=False)
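
Alternatively, if you'd rather keep your csv.DictWriter version, the same effect is a guard in the inner row loop. A sketch of just that loop, reusing the reader and writer from your code:

for row in reader:
    # Skip any row whose Name field is missing or blank
    if not (row['Name'] or '').strip():
        continue
    writer.writerow({'Name': row['Name'], 'Location': row['Location']})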

Combine columns from several CSV files into a single file and make multiple CSV files with a for loop

I have 14 CSV files, each with 100 columns. What I want to do is extract the first column from each file and copy it into a single CSV file, then do the same for each of the 100 columns (for example, the next step is to put the second column from each file into a CSV file).
What I've tried before is the code below, which is perfect for extracting one column, but I want to put it in a loop so I get the 100 files at once. How can I do it?
import csv
import itertools as IT

filenames = ['Sul-v1.csv', 'Sul-v2.csv', 'Sul-v3.csv', 'Sul-v4.csv', 'Sul-v5.csv',
             'Sul-v6.csv', 'Sul-v7.csv', 'Sul-v8.csv', 'Sul-v9.csv', 'Sul-v10.csv',
             'Sul-v11.csv', 'Sul-v12.csv', 'Sul-v13.csv', 'Sul-v14.csv']
handles = [open(filename, 'rb') for filename in filenames]
readers = [csv.reader(f, delimiter=',') for f in handles]

with open('combined.csv', 'wb') as h:
    writer = csv.writer(h, delimiter=',', lineterminator='\n')
    for rows in IT.izip_longest(*readers, fillvalue=[''] * 2):
        combined_row = []
        for row in rows:
            row = row[:1]  # select the columns you want
            if len(row) == 1:
                combined_row.extend(row)
            else:
                combined_row.extend([''] * 2)  # pad when a file has run out of rows
        writer.writerow(combined_row)

for f in handles:
    f.close()
Thanks in advance!
Use pandas.
Start by loading all the csv files into one dataframe.
Next, save each column into a new csv by looping over the columns and using to_csv.
Make sure you pass the column to to_csv using the columns argument, as sketched below.
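
A minimal sketch of that idea, assuming the Sul-v*.csv files from the question are headerless and all have 100 columns (here the column is selected up front rather than being passed through to_csv's columns argument):

import pandas as pd

# The 14 input files from the question, assumed headerless
filenames = ['Sul-v%d.csv' % i for i in range(1, 15)]
frames = [pd.read_csv(name, header=None) for name in filenames]

# For each of the 100 column positions, pull that column from every
# file and write the 14 columns side by side to one output file
for col in range(100):
    out = pd.concat([frame[col] for frame in frames], axis=1, keys=filenames)
    out.to_csv('column_%d.csv' % (col + 1), index=False)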

Filter rows in csv file based on another csv file and save the filtered data in a new file

Good day all
So I was trying to filter file2 based on file1, where file1 is a subset of file2, but file2 has a description column that I need in order to analyse the data in file1. What I'm trying to do is filter file2 and get only the titles that are in file1, together with their descriptions. I tried the code below, but I'm not quite sure it is right; it runs, but I don't get any file saved on my computer.
import re
import mmap
from pandas import DataFrame

output = []
with open('file2.csv', 'r') as f2:
    mm = mmap.mmap(f2.fileno(), 0, access=mmap.ACCESS_READ)
    for line in open('file1.csv', 'r'):
        Title = bytes("")
        nameMatch = re.search(Title, mm)
        if nameMatch:
            # output.append(str(""))
            fulltypes = ['O*NET-SOC Code', 'Title', 'Discription']
            final = DataFrame(columns=fulltypes)
            final.to_csv(output.append(str("")))
    mm.close()
Any idea?
Assuming your csv files aren't too huge, you can do this by reading both into pandas and using the merge function. Take the following example:
import pandas as pd

file1 = pd.DataFrame({'Title': ['file1.csv', 'file2.csv', 'file3.csv']})
file2 = pd.DataFrame({'Title': ['file1.csv', 'file2.csv', 'file4.csv'],
                      'Description': ['List of files', 'List of descriptions', 'Something unrelated']})
joined = pd.merge(file1, file2, on='Title')
print(joined)
This prints:
Title Description
0 file1.csv List of files
1 file2.csv List of descriptions
i.e. just the files that exist in both.
As pandas can natively read a csv into a dataframe, in your case you could do:
import pandas as pd

file1 = pd.read_csv('file1.csv')
file2 = pd.read_csv('file2.csv')
joined = pd.merge(file1, file2, on='Title')
joined.to_csv('Output.csv', index=False)
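Note that pd.merge performs an inner join by default, which is exactly the filtering wanted here: only titles present in both files survive. Passing how='left' would instead keep every row of file1 and leave Description empty where there is no match.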

Writing columns from separate files into a single file

I am relatively new to working with CSV files in Python and would appreciate some guidance. I have 6 separate CSV files, and I would like to copy the data from column 1, column 2, and column 3 of each of the CSV files into the corresponding first 3 columns of a new file.
How do I word that in my code?
Here is my incomplete code:
import csv

file1 = open('fileA.csv', 'rb')
reader1 = csv.reader(file1)
file2 = open('fileB.csv', 'rb')
reader2 = csv.reader(file2)
file3 = open('fileC.csv', 'rb')
reader3 = csv.reader(file3)
file4 = open('fileD.csv', 'rb')
reader4 = csv.reader(file4)
file5 = open('fileE.csv', 'rb')
reader5 = csv.reader(file5)
file6 = open('fileF.csv', 'rb')
reader6 = csv.reader(file6)

WriteFile = open('NewFile.csv', 'wb')
writer = csv.writer(WriteFile)

next(reader1, None)
Data1 = (col[0:3] for col in reader1)
next(reader2, None)
Data2 = (col[0:3] for col in reader2)
next(reader3, None)
Data3 = (col[0:3] for col in reader3)
next(reader4, None)
Data4 = (col[0:3] for col in reader4)
next(reader5, None)
Data5 = (col[0:3] for col in reader5)
next(reader6, None)
Data6 = (col[0:3] for col in reader6)

.......????????

file1.close()
file2.close()
file3.close()
file4.close()
file5.close()
file6.close()
WriteFile.close()
Thanks!
If you just want these all concatenated, that's easy. You can either call writerows on each of your iterators, or chain them together:
writer.writerows(itertools.chain(Data1, Data2, Data3, Data4, Data5, Data6))
Or, if you want them interleaved, where you get row 1 from Data1, then row 1 from Data 2, and so on, and then row 2 from Data 1, etc., use zip to transpose the data, and then chain again to flatten it:
writer.writerows(itertools.chain.from_iterable(zip(Data1, Data2, Data3,
Data4, Data5, Data6)))
If the files are of different lengths, that zip will stop as soon as you reach the end of any of the files. Is that what you want? I have no idea. You might want that. You might want to fill in the gaps with blank rows (in which case look at zip_longest). You might want to skip over the gaps (which you can do with zip_longest plus filter). Or a million other possibilities.
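For instance, a sketch of the padding variant, using the Python 3 name (on Python 2 this is itertools.izip_longest) and the writer and Data1 through Data6 from the code above:

import itertools

# Pad exhausted files with a blank 3-column row so the interleaving
# continues until the longest file is done
blank = [''] * 3
writer.writerows(itertools.chain.from_iterable(
    itertools.zip_longest(Data1, Data2, Data3, Data4, Data5, Data6,
                          fillvalue=blank)))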
As a side note, once you get to this many similar variables, it's usually a good sign that you really wanted a single iterable instead of separate variables. For example:
import csv
import itertools

filenames = ('fileA.csv', 'fileB.csv', 'fileC.csv',
             'fileD.csv', 'fileE.csv', 'fileF.csv')
files = [open(filename, 'rb') for filename in filenames]
readers = [csv.reader(file) for file in files]

WriteFile = open('NewFile.csv', 'wb')
writer = csv.writer(WriteFile)

# Skip each file's header row
for reader in readers:
    next(reader, None)
Data = [(col[0:3] for col in reader) for reader in readers]
writer.writerows(itertools.chain.from_iterable(Data))

for file in files:
    file.close()
WriteFile.close()
(Notice that I used list comprehensions, not generator expressions, for the collections of files, readers, data, etc. That's because we need to iterate over them repeatedly—e.g., create a reader for every file, and later call close on every file. Also because there are a fixed, small number of elements—6—so "wasting" a whole list isn't really any issue.)
The way I understand your question is that you have six separate CSVs with 3 columns each, and the data in each column is of the same type in all six files. If so, you could use pandas. Say you had 3 files that looked like...
file1:
col1 col2 col3
1 1 1
1 1 1
and then a second and third file filled with 2s and 3s respectively, you could write...
#!/usr/bin/env python
import pandas as pd

cols = ['col1', 'col2', 'col3']
files = ['~/one.txt', '~/two.txt', '~/three.txt']

data_1 = pd.read_csv(files[0], sep=',', header=None, names=cols)
data_2 = pd.read_csv(files[1], sep=',', header=None, names=cols)
data_3 = pd.read_csv(files[2], sep=',', header=None, names=cols)

# Stack the three frames on top of each other
data_final = pd.concat([data_1, data_2, data_3], ignore_index=True)
Then data_final should have the contents of all three data sets stacked on top of each other. You can modify this for 6 (or n) datasets. Hope this is what you wanted.
Out[1]:
   col1  col2  col3
      1     1     1
      1     1     1
      2     2     2
      2     2     2
      3     3     3
      3     3     3
