Comparing two CSV files using lists and dictionaries - python

I have two CSV files: the first with 3 columns and numerous rows, and the second with 4 columns and numerous rows. I am trying to retrieve data from the first file based on the RemovedDes list (in the code below). RemovedDes is a filtered version of File 2, with rows removed where the Destination column starts with 'E'. Not all data from File 1 will be used, only the rows that correspond to RemovedDes, hence the need to compare the two files.
How can I print out only the relevant data from file 1?
I know it's probably very easy to do, but I am new to this; any assistance is much appreciated, cheers.
(For further clarification: I'm after the Eastings and Northings in File 1, but need to use RemovedDes (which filtered out unnecessary information in File 2) to match the data between the two files.)
File 1 Sample Data (many more rows):
Destination Easting Northing
D4 . 102019 . 1018347
D2 . 102385 . 2048908
File 2 Sample Data (many more rows):
Legend Destination Distance Width
10 D4 . 67 . 87
18 E2 . 32 . 44
Note that E2 is filtered out as it starts with 'E'. See the code below for clarification.
file1 = open('file1.csv', 'r')
FILE1 = file1.readlines()
print(FILE1)

list_dictionary = []

file2 = open('file2.csv', 'r')
FILE2 = file2.readlines()
print(FILE2)

for line in FILE2:
    values = line.split(',')
    the_dictionary = {
        'LEG': values[0],   # Legend
        'DEST': values[1],  # Destination
        'DIST': values[2],  # Distance
        'WID': values[3],   # Width
    }
    list_dictionary.append(the_dictionary)

RemovedDes = []
for line_dict in list_dictionary:
    if not line_dict['DEST'].startswith('E'):  # filters out File 2 rows whose Destination starts with 'E'
        RemovedDes.append(line_dict)
print(RemovedDes)

Based on the clarification in the comments, I suggest the following approach:
use a pandas.DataFrame as your data structure of choice
perform a join of your lists
The following code will create a pandas data frame, data, with all entries of file2, extended by their corresponding Easting and Northing entries from file1:
import pandas as pd
file1 = pd.read_csv('file1.csv')
file2 = pd.read_csv('file2.csv')
data = pd.merge(file2, file1, how = 'left', on = 'Destination')
Note: this assumes that Destination has unique values across the board and that both .csv-Files come with a header line.
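If you also want to reproduce the RemovedDes filtering from the question before the join, here is a sketch, assuming the column headers shown in the samples and files named file1.csv/file2.csv:
import pandas as pd

file1 = pd.read_csv('file1.csv')
file2 = pd.read_csv('file2.csv')

# Reproduce the RemovedDes filter: drop rows whose Destination starts with 'E'
removed_des = file2[~file2['Destination'].str.startswith('E')]

# Join, then keep only the columns the question asks for
data = pd.merge(removed_des, file1, how='left', on='Destination')
print(data[['Destination', 'Easting', 'Northing']])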

Related

Finding a logic to map the values of two .tsv files

I am working on a problem where I have 2 .tsv files and one has been arranged wrongly with respect to the other one.
When I scan the files, I noticed a pattern that I am unable to express in code. The pattern I observed was:
For every increase of one row in the metadata.tsv file, you must advance 8 rows in the flipped_metadata.tsv file to reach the same values.
For every increase of one row in the flipped_metadata.tsv file, you must advance 12 rows in the metadata.tsv file to reach the same values.
For more clarity I have attached the 2 .tsv files along with this:
Metadata.tsv file and Flipped_metadata.tsv file
The openpyxl library has good functions for dealing with Excel cell locations. These can be used to convert A1 to a proper row and column.
Read each row in and convert the cell reference to a simple numeric row and column value. Use a dictionary to store each cell found with the two values for that cell. e.g. cells[(1,1)] = "123 456"
Whilst reading in, keep track of the largest row and column seen.
Create an empty array (list of lists) to allow each cell to be assigned into.
Iterate over all of the dictionary items and assign each value into the array.
Finally save the array to a new CSV file.
For example:
from openpyxl.utils.cell import coordinate_from_string, column_index_from_string
import csv

def flip(input_filename, output_filename):
    cells = {}
    max_row = 0
    max_col = 0

    with open(input_filename) as f_input:
        for cell, v1, v2 in csv.reader(f_input, delimiter='\t'):
            col_letter, row_number = coordinate_from_string(cell)
            col_number = column_index_from_string(col_letter)
            cells[(row_number, col_number)] = f"{v1} {v2}"
            if row_number > max_row:
                max_row = row_number
            if col_number > max_col:
                max_col = col_number

    output = [[''] * max_col for _ in range(max_row)]
    for (row_number, col_number), values in cells.items():
        output[row_number-1][col_number-1] = values

    with open(output_filename, 'w', newline='') as f_output:
        csv.writer(f_output).writerows(output)

flip('metadata.tsv', 'output_metadata.csv')
flip('flipped_metadata.tsv', 'output_flipped_metadata.csv')
This would give you the values written back out in their correct grid positions.
Note: this approach correctly handles all cell references, e.g. FK42. It would also handle holes in the data: if A2 was deleted, everything would still align correctly, as it is not 100% clear whether data in cells can be missing.
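As a quick sanity check of that claim, openpyxl's helpers split a multi-letter reference like FK42 cleanly:
from openpyxl.utils.cell import coordinate_from_string, column_index_from_string

col_letter, row_number = coordinate_from_string("FK42")  # ('FK', 42)
col_number = column_index_from_string(col_letter)        # 167
print(row_number, col_number)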

How to combine multiple columns into one long column using python and pandas

Hi everyone, I am currently working on data like the following:
Example of original data file (image)
There are a total of 51 files, each with more than 800 oscillating columns, e.g. (Time, ID, x1, x2, ID, x1, x2, ...); the columns are all unlabelled. Within each file, each row has a different number of columns, something like this: Shape of one data file (image)
I need to merge all 51 files into one file, and then stack the columns vertically like this:
Example of output file (image)
So for each timestamp, each student will have a specific row with their location x,y.
Can anyone please help me with this, thanks
I used the following code to merge CSV files with different columns, but the output file is twice the size of the originals (e.g. 100 MB vs 50 MB). My approach was to combine the files using the maximum number of columns and pad each row to that width. However, this created a lot of missing values in the data, which inflated the output file.
import os
import glob
import pandas as pd

def concatenate(indir=r"C:\Test Files",
                outfile=r"F:\Research Assitant\PROJECT_Position Data\Test File\Concatenate.csv"):
    os.chdir(indir)
    fileList = glob.glob("*.csv")
    dfList = []
    for filename in fileList:
        ### Loop over each line
        with open(filename, 'r') as f:
            ### Skip the first four lines
            for _ in range(4):
                next(f)
            ### Get the number of columns in each line
            col_count = [len(l.split(",")) for l in f.readlines()]
        ### Read the current csv file
        df = pd.read_csv(filename, header=None, delimiter=",", names=range(max(col_count)),
                         skiprows=4, keep_default_na=False, na_values=[""])
        ### Append to the list
        dfList.append(df)
    concatDf = pd.concat(dfList, axis=0)
    concatDf.to_csv(outfile, index=None)
Is there any way to reduce the size of the output files? Or a more efficient way to deal with heterogeneous CSV files in python?
And how do I stack the columns vertically after merged all the CSV files?
Read each file line by line and emit one row per group of four values after the timestamp:
import os
import pandas as pd

# working_folder and file_name as defined elsewhere in your script
with open(os.path.join(working_folder, file_name)) as f:
    student_data = []
    for line in f:
        row = line.strip().split(",")
        # excluding the time column, the data repeats in groups of four
        number_of_results = round(len(row[1:]) / 4)
        time_column = row[0]
        results = row[1:]
        for i in range(number_of_results):
            data = [time_column] + results[i*4: (i+1)*4]
            student_data.append(data)

df = pd.DataFrame(student_data, columns=["Time", "ID", "Name", "x1", "x2"])
df
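To apply the same idea across all 51 files and write one combined output, here is a hedged sketch; the folder variable and the *.csv pattern are assumptions, and the group-of-four layout is taken from the code above:
import glob
import os
import pandas as pd

working_folder = "."  # adjust to your data folder (assumption)

all_rows = []
for file_name in glob.glob(os.path.join(working_folder, "*.csv")):
    with open(file_name) as f:
        for line in f:
            row = line.strip().split(",")
            time_column, results = row[0], row[1:]
            for i in range(round(len(results) / 4)):
                all_rows.append([time_column] + results[i*4:(i+1)*4])

df = pd.DataFrame(all_rows, columns=["Time", "ID", "Name", "x1", "x2"])
df.to_csv("stacked.csv", index=False)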

how to join multiple tab files by using python

I have multiple tab files with the same name in different folders, like this:
F:/RNASEQ2019/ballgown/abundance_est/RBRN02.sorted.bam\t_data.ctab
F:/RNASEQ2019/ballgown/abundance_est/RBRN151.sorted.bam\t_data.ctab
Each file has 5-6 common columns, and I want to pick up two of them: Gene and FPKM. The Gene column is the same in every file; only the FPKM values differ. I want to pick up the Gene and FPKM columns from each file and make a master file like this:
Gene RBRN02 RBRN03 RBRN151
gene1 67 699 88
gene2 66 77 89
I did this
import os
import pandas as pd

path = "F:/RNASEQ2019/ballgown/abundance_est/"
files = []
## r=root, d=directory, f=file
for r, d, f in os.walk(path):
    for file in f:
        if 't_data.ctab' in file:
            files.append(os.path.join(r, file))

df = []
for f in files:
    df.append(pd.read_csv(f, sep="\t"))
But this is not doing a side-by-side merge. How do I get the format above? Please help.
Using datatable, you can read multiple files at once by specifying the pattern:
import datatable as dt

dfs = dt.fread("F:/RNASEQ2019/ballgown/abundance_est/**/t_data.ctab",
               columns={"Gene", "FPKM"})
If there are multiple files, this will produce a dictionary where each key is the name of the file, and the corresponding value is the content of that file, parsed into a frame. The optional columns parameter limits which columns you want to read.
In your case it seems like you want to rename the columns based on the name of the file where it came from, so you may do something like this:
import re

frames = []
for filename, frame in dfs.items():
    mm = re.search(r"(\w+)\.sorted\.bam", filename)
    frame.names = {"FPKM": mm.group(1)}
    frames.append(frame)
In the end, you can cbind the list of frames:
df = dt.cbind(frames)
If you need to work with a pandas dataframe, you can convert easily: df.to_pandas().
IIUC, you can get your desired result with a simple list comprehension:
dfs = [pd.read_csv(f,sep='\t') for f in files]
df = pd.concat(dfs)
print(df)
or as a one liner
df = pd.concat([pd.read_csv(f,sep='\t') for f in files])
How about reading each file in a separate data frame and then merging them?
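A sketch of that merging idea, reusing the files list collected by the os.walk loop in the question; it assumes each t_data.ctab really has Gene and FPKM columns, and that the sample name can be parsed from the path:
import re
import pandas as pd

merged = None
for f in files:
    # e.g. extracts 'RBRN02' from '.../RBRN02.sorted.bam/t_data.ctab'
    sample = re.search(r"(\w+)\.sorted\.bam", f).group(1)
    df = pd.read_csv(f, sep="\t", usecols=["Gene", "FPKM"])
    df = df.rename(columns={"FPKM": sample})
    # side-by-side merge on the shared Gene column
    merged = df if merged is None else merged.merge(df, on="Gene")

merged.to_csv("master.csv", index=False)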

Selecting DataFrame column names from a csv file

I have a .csv to read into a DataFrame, and the names of the columns are in the same .csv file in the previous rows. Usually I drop all the 'unnecessary' rows to create the DataFrame and then hardcode the names of each column:
Trigger time,2017-07-31,10:45:38
CH,Signal name,Input,Range,Filter,Span
CH1, "Tin_MIX_Air",TEMP,PT,Off,2000.000000,-200.000000,degC
CH2, "Tout_Fan2b",TEMP,PT,Off,2000.000000,-200.000000,degC
CH3, "Tout_Fan2a",TEMP,PT,Off,2000.000000,-200.000000,degC
CH4, "Tout_Fan1a",TEMP,PT,Off,2000.000000,-200.000000,degC
Here you can see the rows where the column names appear in double quotes ("Tin_MIX_Air", "Tout_Fan2b", etc.); there are exactly 16 rows with names.
Logic/Pulse,Off
Data
Number,Date&Time,ms,CH1,CH2,CH3,CH4,CH5,CH7,CH8,CH9,CH10,CH11,CH12,CH13,CH14,CH15,CH16,CH20,Alarm1-10,Alarm11-20,AlarmOut
NO.,Time,ms,degC,degC,degC,degC,degC,degC,%RH,%RH,degC,degC,degC,degC,degC,Pa,Pa,A,A1234567890,A1234567890,A1234
1,2017-07-31 10:45:38,000,+25.6,+26.2,+26.1,+26.0,+26.3,+25.7,+43.70,+37.22,+25.6,+25.3,+25.1,+25.3,+25.3,+0.25,+0.15,+0.00,LLLLLLLLLL,LLLLLLLLLL,LLLL
And here the values of each variable start.
What I need to do is create a DataFrame from this .csv and place those names in the column names. I'm new to Python and not very sure how to do it.
import pandas as pd

path = r'path-to-file.csv'
data = pd.DataFrame()
with open(path, 'r') as f:
    for line in f:
        data = pd.concat([data, pd.DataFrame([tuple(line.strip().split(','))])], ignore_index=True)

data.drop(data.index[range(0, 29)], inplace=True)
x = len(data.iloc[0])
data.drop(data.columns[[0, 1, 2, x-1, x-2, x-3]], axis=1, inplace=True)
data.reset_index(drop=True, inplace=True)
data = data.T.reset_index(drop=True).T
data = data.apply(pd.to_numeric)
This is what I've done so far to get my DataFrame with the useful data: I'm dropping all the columns that aren't useful to me and keeping only the values. The last three lines reset the row/column indexes and convert the whole df to floats. What I would like is to name the columns with each of the names shown above; as I said, I'm currently doing this manually:
data.columns = ['a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p']
But I would like to get them from the .csv file, since the CH# - "Name" combination could change.
Thank you very much for the help!
Comment: Is it possible for it to work within the other "open" loop that I have?
Assume Column Names from Row 2 up to 6, Data from Row 7 up to EOF.
For instance (untested code)
import pandas as pd

columns = []
row_data = []
with open(path) as fh:
    for row, line in enumerate(fh, 1):
        if row > 2 and row <= 6:
            ch, name = line.split(',')[:2]
            columns.append(name)
        elif row > 6:
            row_data.append(tuple(line.strip().split(',')))

# build the frame once, after reading: the DataFrame constructor has no
# ignore_index argument, and appending row by row returns a new object each time
data = pd.DataFrame(row_data, columns=columns)
Question: ... I would like to get them from the .csv file
Start with:
with open(path) as fh:
    for row, line in enumerate(fh, 1):
        if row > 2:
            ch, name = line.split(',')[:2]
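The extracted names still carry surrounding quotes and whitespace (e.g. "Tin_MIX_Air"), so a small extension of that start could clean them up, restricting to the assumed name rows:
columns = []
with open(path) as fh:
    for row, line in enumerate(fh, 1):
        if row > 2 and row <= 6:
            ch, name = line.split(',')[:2]
            # strip whitespace and the surrounding double quotes
            columns.append(name.strip().strip('"'))
print(columns)  # e.g. ['Tin_MIX_Air', 'Tout_Fan2b', ...]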

Merge csv's with some common columns and fill in NaNs

I have several csv files (all in one folder) which have columns in common but also have distinct columns. They all contain the IP column. The data looks like
File_1.csv
a,IP,b,c,d,e
info,192.168.0.1,info1,info2,info3,info4
File_2.csv
a,b,IP,d,f,g
info,,192.168.0.1,info2,info5,info6
As you can see, File 1 and File 2 disagree on what belongs in column d, but I do not mind which file the value is kept from. I have tried pandas.merge, but it returns two separate entries for 192.168.0.1, with NaN in the columns present in File 1 but not in File 2, and vice versa. Does anyone know of a way to do this?
Edit 1:
The desired output should look like:
output
a,IP,b,c,d,e,f,g
info,192.168.0.1,info1,info2,info3,info4,info5,info6
and I would like the output to be like this for all rows; not every item in file 1 is in file 2, and vice versa.
Edit 2:
Any IP address present in file 1 but not present in file 2 should have a blank or Not Available Value in any unique columns in the output file. For example in the output file, columns f and g would be blank for IP addresses that were present in file 1 but not in file 2. Similarly, for an IP in file 2 and not in file 1, columns c and e would be blank in the output file.
This case:
Set the IP address as the index column and then use combine_first() to fill in the holes in a data frame that is the union of all IP addresses and columns.
import pandas as pd

# read in the files using the IP address as the index column
df_1 = pd.read_csv('file1.csv', header=0, index_col='IP')
df_2 = pd.read_csv('file2.csv', header=0, index_col='IP')

# fill in the NaNs
combined_df = df_1.combine_first(df_2)
combined_df.to_csv('combined.csv')  # note: to_csv, not write_csv
EDIT: The union of the indices will be taken, so we should put the IP address in the index column to ensure IP addresses in both files are read in.
combine_first() for other cases:
As the documentation states, you'll only have to be careful if the same IP address in both files has conflicting non-empty information for a column (such as column d in your example above). In df_1.combine_first(df_2), df_1 is prioritized, so column d will take its value from df_1. Since you said it doesn't matter which file the information comes from, this isn't a concern for this problem.
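A tiny demonstration of that priority, using hypothetical values from the example above:
import pandas as pd

df_1 = pd.DataFrame({'c': ['info2'], 'd': ['info3']},
                    index=pd.Index(['192.168.0.1'], name='IP'))
df_2 = pd.DataFrame({'d': ['info2'], 'f': ['info5']},
                    index=pd.Index(['192.168.0.1'], name='IP'))

print(df_1.combine_first(df_2))
# column 'd' keeps df_1's value ('info3'); column 'f' is filled in from df_2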
I think a simple dictionary should do the job. Assume you've read the contents of each file into lists file1 and file2, so that:
file1[0] = [a,IP,b,c,d,e]
file1[1] = [info,192.168.0.1,info1,info2,info3,info4]
file2[0] = [a,b,IP,d,f,g]
file2[1] = [info,,192.168.0.1,info2,info5,info6]
(with quotes around each entry). The following should do what you want:
new_dict = {}
for i in range(0, len(file2[0])):
    new_dict[file2[0][i]] = file2[1][i]
for i in range(0, len(file1[0])):
    new_dict[file1[0][i]] = file1[1][i]
output = [[],[]]
output[0] = [key for key in new_dict]
output[1] = [new_dict[key] for key in output[0]]
Then you should get (the column order may differ, since it follows dictionary insertion order):
output[0] = [a,IP,b,c,d,e,f,g]
output[1] = [info,192.168.0.1,info1,info2,info3,info4,info5,info6]
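For completeness, one way to read the two files into those file1/file2 lists is with the csv module (file names taken from the question):
import csv

with open('File_1.csv', newline='') as f:
    file1 = list(csv.reader(f))  # file1[0] = header row, file1[1] = data row
with open('File_2.csv', newline='') as f:
    file2 = list(csv.reader(f))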
