I have a text file containing assorted data, among which are "names" that also exist in a separate Excel file as rows in a column. I need to compare the strings from the text file with those in the Excel file and output the ones that match, along with some extra data from other columns of the matching row. I'd be thankful for an example of how to go about it, maybe using pandas?
Note that open() reads plain text, so this assumes the Excel sheet has been exported to CSV first (for a real .xlsx file you'd need something like pandas.read_excel). You can open the two files like so:
textdata = open(path_to_text_file, "r")
exceldata = open(path_to_csv_file, "r")
Then put the data in lists. Since lists aren't hashable, flatten each file into a list of stripped strings rather than a list of lists:
textdatalist = [item.strip() for line in textdata for item in line.split(',')]
exceldatalist = [item.strip() for line in exceldata for item in line.split(',')]
And then compare the two lists with:
print(set(exceldatalist).intersection(textdatalist))
All together:
textdata = open(path_to_text_file, "r")
exceldata = open(path_to_csv_file, "r")
textdatalist = [item.strip() for line in textdata for item in line.split(',')]
exceldatalist = [item.strip() for line in exceldata for item in line.split(',')]
print(set(exceldatalist).intersection(textdatalist))
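Since the question mentions pandas, here is a minimal sketch of the matching step. The column names (name, extra) and the sample data are assumptions for illustration; in practice you would build names_df with pandas.read_excel and text_names by parsing the text file.

```python
import pandas as pd

# Stand-in for pd.read_excel("names.xlsx"): the names plus the extra
# data you want to carry along for each matching row.
names_df = pd.DataFrame({
    "name": ["alice", "bob", "carol"],
    "extra": [10, 20, 30],
})

# Stand-in for the strings pulled out of the text file.
text_names = ["bob", "dave", "alice"]

# Keep only the rows whose name appears in the text file,
# together with their extra columns.
matches = names_df[names_df["name"].isin(text_names)]
print(matches)
```

The isin call does the set-membership test in one vectorised step, and selecting rows of the DataFrame keeps all the extra columns attached.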
I'm quite an amateur when it comes to Python, and I'm stuck.
I managed to write a script that extracts the second column of floats from a list of text files, from line 5025 to the end, and collects all of these columns into a list so I can export them to a CSV.
My problem is that in the CSV, the columns from all the files end up pasted into one single column. What I want is for each file's column to land in a separate column of the CSV (if I have 4 files to process, I'd like 4 columns in the CSV, one per file).
So here is what I have now:
#!/usr/bin/python3
import numpy as np

def read_csv(file_path):
    with open(file_path, "r") as f:
        lines = f.readlines()
    return lines[5025:len(lines)]  # ignoring 5025 first lines

def extract_second_column(file_path):  # returns second column
    lines = read_csv(file_path)
    second_col = []
    for elem in lines:
        elem = elem.split()
        second_col.append(float(elem[1]))
    return second_col

def combined_array(file_path_list):  # combines all columns to one array
    all_values = []
    for file_path in file_path_list:
        col_data = extract_second_column(file_path)
        all_values.append(col_data)
    return all_values

def write_csv(data, csv_name):  # writes the array to a csv file
    np.savetxt(csv_name, data, delimiter="\n")

# Now the magic happens:
file_name_list = ["GLN-46_Coul.xvg", "GLN-46_LJ.xvg", "GLU-102_Coul.xvg", "GLU-102_LJ.xvg"]
data = combined_array(file_name_list)  # array containing all columns
write_csv(data, "ENERGIES.csv")  # writing to csv file
I will appreciate any suggestion! I'm aware the code looks ugly, but what I need right now is something that works.
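A likely fix, sketched below: combined_array returns one inner list per file, which np.savetxt treats as one row per file. Transposing before saving turns each file's data into a column, and the delimiter should be "," rather than "\n" for a CSV. The toy data here stands in for combined_array's real output.

```python
import numpy as np
import os
import tempfile

def write_csv(data, csv_name):
    # data is a list of per-file value lists; transpose so each inner
    # list becomes a column of the output instead of a row.
    np.savetxt(csv_name, np.transpose(data), delimiter=",")

# Toy stand-in for combined_array's output: two files, three values each.
data = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
path = os.path.join(tempfile.gettempdir(), "ENERGIES.csv")
write_csv(data, path)

with open(path) as f:
    rows = [line.strip().split(",") for line in f]
print(rows)  # three rows of two columns: one column per input file
```

Note that np.transpose requires all the per-file lists to be the same length; if the files can differ in row count, pad them first or write the columns with the csv module instead.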
I would just like to create a csv file and at the same time add my data row by row with a for loop.
for x in y:
    newRow = "\n%s,%s\n" % (sentence1, sentence2)
    with open('Mydata.csv', "a") as f:
        f.write(newRow)
After the above process, I tried to read the csv file but I can't separate the columns. It seems that there is only one column, maybe I did something wrong in the csv creation process?
colnames = ['A_sentence', 'B_sentence']
Mydata = pd.read_csv(Mydata, names=colnames, delimiter=";")
print(Mydata['A_sntence']) #output Nan
When you are writing the file you use commas as separators, but when reading it you use semicolons (probably just a typo). Change delimiter=";" to delimiter="," and pandas will see two columns. Two smaller issues: the leading and trailing "\n" in newRow will insert blank lines between rows, and 'A_sntence' in your print is a typo for 'A_sentence'.
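A minimal sketch of the write/read round trip with a consistent comma delimiter (the sentence pairs are made-up placeholders). Using csv.writer instead of hand-built strings also quotes any commas inside a sentence; an in-memory buffer stands in for Mydata.csv.

```python
import csv
import io

import pandas as pd

pairs = [("hello there", "hi"), ("how are you", "fine, thanks")]  # placeholder data

# Write row by row; csv.writer quotes the comma inside "fine, thanks".
buf = io.StringIO()
writer = csv.writer(buf)
for s1, s2 in pairs:
    writer.writerow([s1, s2])

# Read it back with the *same* delimiter (comma is read_csv's default).
buf.seek(0)
mydata = pd.read_csv(buf, names=["A_sentence", "B_sentence"])
print(mydata["A_sentence"])
```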
I am using this script to take a large CSV file and split it by the unique values in its first column, then save each slice as a new file. I would also like to add 3 columns at the end of each file that contain calculations based on the previous columns. The columns will have headers as well. My current code is as follows:
import csv, itertools as it, operator as op

csv_contents = []
with open('Nov15.csv', 'rb') as fin:
    file_reader = csv.DictReader(fin)  # default delimiter is comma
    print file_reader
    fieldnames = file_reader.fieldnames  # save for writing
    for line in file_reader:  # read in all of your data
        csv_contents.append(line)  # gather data into a list (of dicts)

# input to itertools.groupby must be sorted by the grouping value
sorted_csv_contents = sorted(csv_contents, key=op.itemgetter('Object'))

for groupkey, groupdata in it.groupby(sorted_csv_contents, key=op.itemgetter('Object')):
    with open('slice_{:s}.csv'.format(groupkey), 'wb') as gips:
        file_writer = csv.DictWriter(gips, fieldnames=fieldnames)
        file_writer.writeheader()
        file_writer.writerows(groupdata)
If your comments are accurate, you could probably do it like this (for imaginary columns col1 and col2, and the calculation col1 * col2). Note that DictReader gives you strings, so convert to numbers before multiplying, and remember to append the new column name to fieldnames before writing:
for line in file_reader:  # read in all of your data
    line['calculated_col0'] = float(line['col1']) * float(line['col2'])
    csv_contents.append(line)  # gather data into a list (of dicts)
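A self-contained sketch of the read-compute-write cycle with one added column. The header names (Object, col1, col2, calculated_col0) are placeholders for the real ones, and in-memory buffers stand in for the files:

```python
import csv
import io

# Stand-in for the input file.
source = io.StringIO("Object,col1,col2\nA,2,3\nA,4,5\nB,6,7\n")

reader = csv.DictReader(source)
# Extend the header list so DictWriter accepts the new key.
fieldnames = reader.fieldnames + ["calculated_col0"]

rows = []
for line in reader:
    # DictReader values are strings, so convert before multiplying.
    line["calculated_col0"] = float(line["col1"]) * float(line["col2"])
    rows.append(line)

# Stand-in for one of the output slice files.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=fieldnames)
writer.writeheader()
writer.writerows(rows)
print(out.getvalue())
```

The same pattern extends to three calculated columns: add each key to every row dict and append each new name to fieldnames.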
I have two files, the first one is called book1.csv, and looks like this:
header1,header2,header3,header4,header5
1,2,3,4,5
1,2,3,4,5
1,2,3,4,5
The second file is called book2.csv, and looks like this:
header1,header2,header3,header4,header5
1,2,3,4
1,2,3,4
1,2,3,4
My goal is to copy the column that contains the 5's in book1.csv to the corresponding column in book2.csv.
The problem with my code seems to be that it is neither appending correctly nor selecting just the index that I want to copy. It also raises an error saying I have selected an incorrect index position. The output is as follows:
header1,header2,header3,header4,header5
1,2,3,4
1,2,3,4
1,2,3,41,2,3,4,5
Here is my code:
import csv

with open('C:/Users/SAM/Desktop/book2.csv', 'a') as csvout:
    write = csv.writer(csvout, delimiter=',')
    with open('C:/Users/SAM/Desktop/book1.csv', 'rb') as csvfile1:
        read = csv.reader(csvfile1, delimiter=',')
        header = next(read)
        for row in read:
            row[5] = write.writerow(row)
What should I do to get this to append properly?
Thanks for any help!
What about something like this? I read in both books, append the last element of each book1 row to the corresponding book2 row, and store the results in a list. Then I write the contents of that list to a new .csv file.
import csv

with open('book1.csv', 'r') as book1:
    with open('book2.csv', 'r') as book2:
        reader1 = csv.reader(book1, delimiter=',')
        reader2 = csv.reader(book2, delimiter=',')
        both = []
        fields = next(reader1)  # read header row
        next(reader2)  # read and ignore header row
        for row1, row2 in zip(reader1, reader2):
            row2.append(row1[-1])
            both.append(row2)

with open('output.csv', 'w', newline='') as output:
    writer = csv.writer(output, delimiter=',')
    writer.writerow(fields)  # write a header row
    writer.writerows(both)
Although some of the code above will work, it is not really scalable, and a vectorised approach is needed. Getting to grips with numpy or pandas will make tasks like these easier, so it is worth learning a bit of them.
You can download pandas from the Pandas Website
# Load pandas
import pandas as pd

# Load each file into a pandas DataFrame, which is built on a numpy array
data1 = pd.read_csv('csv1.csv', sep=',')
data2 = pd.read_csv('csv2.csv', sep=',')

# Now add 'header5' from data1 to data2
data2['header5'] = data1['header5']

# Save it back to csv
data2.to_csv('output.csv', index=False)
Regarding the "error that I have selected an incorrect index position," I suspect this is because you're using row[5] in your code. Indexing in Python starts from 0, so if you have A = [1, 2, 3, 4, 5] then to get the 5 you would do print(A[4]).
Assuming the two files have the same number of rows and the rows are in the same order, I think you want to do something like this:
import csv

# Open the two input files, which I've renamed to be more descriptive,
# and also an output file that we'll be creating
with open("four_col.csv", mode='r') as four_col, \
        open("five_col.csv", mode='r') as five_col, \
        open("five_output.csv", mode='w', newline='') as outfile:
    four_reader = csv.reader(four_col)
    five_reader = csv.reader(five_col)
    five_writer = csv.writer(outfile)

    _ = next(four_reader)  # Ignore headers for the 4-column file
    headers = next(five_reader)
    five_writer.writerow(headers)

    for four_row, five_row in zip(four_reader, five_reader):
        last_col = five_row[-1]  # Or use five_row[4]
        four_row.append(last_col)
        five_writer.writerow(four_row)
Why not read the files line by line and use the -1 index to find the last item? Strip the trailing newline before splitting, or it ends up attached to the last field:
endings = []
with open('book1.csv') as book1:
    for line in book1:
        # if not header line:
        endings.append(line.rstrip('\n').split(',')[-1])

linecounter = 0
with open('book2.csv') as book2:
    for line in book2:
        # if not header line:
        print(line.rstrip('\n') + ',' + endings[linecounter])  # or write to file
        linecounter += 1
You should also catch errors if row numbers don't match.
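One way to guard against mismatched row counts is itertools.zip_longest, whose None fill value makes a short file detectable instead of raising an IndexError. A sketch with in-memory lines standing in for the two files (book2 is deliberately one row short):

```python
from itertools import zip_longest

# Stand-ins for the two files' data lines (headers already skipped).
book1_lines = ["1,2,3,4,5", "1,2,3,4,5", "1,2,3,4,5"]
book2_lines = ["1,2,3,4", "1,2,3,4"]  # one row short on purpose

merged = []
for line1, line2 in zip_longest(book1_lines, book2_lines):
    if line1 is None or line2 is None:
        # Row counts don't match; stop here (or raise, or pad) instead of crashing.
        break
    merged.append(line2 + ',' + line1.split(',')[-1])

print(merged)
```

Plain zip would silently stop at the shorter file; zip_longest lets you notice the mismatch and decide what to do about it.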