I would like to merge two tab-delimited text files that share one common column. I have an 'identifier file' that looks like this (2 columns by 1050 rows):
module 1 gene 1
module 1 gene 2
..
module x gene y
I also have a tab-delimited 'target' text file that looks like this (36 columns by 12000 rows):
gene 1 sample 1 sample 2 etc
gene 2 sample 1 sample 2 etc
..
gene z sample 1 sample 2 etc
I would like to merge the two files based on the gene identifier and keep both the matching expression values and the module affiliations from the identifier and target files. Essentially, I want to take the genes from the identifier file, find them in the target file, and create a new file with module #, gene #, and expression values all in one place. Your suggestions would be welcome.
Thanks!
Open the gene description file, and load a dictionary where the key is the gene number and the value is the sample data.
Then open the module file and loop over its lines. For each line, look up the corresponding gene entry in the dictionary and print the module, gene, and sample data.
That's it! If you need more information, check how to read a file and use a dictionary in the Python documentation.
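For instance, a minimal sketch (assuming both files are strictly tab-delimited; the file names identifiers.txt, target.txt and merged.txt are hypothetical):
expression = {}
with open("target.txt") as target_file:
    for line in target_file:
        fields = line.rstrip("\n").split("\t")
        expression[fields[0]] = fields[1:]  # gene -> its sample values

with open("identifiers.txt") as id_file, open("merged.txt", "w") as out:
    for line in id_file:
        module, gene = line.rstrip("\n").split("\t")
        if gene in expression:  # keep only genes present in the target file
            out.write("\t".join([module, gene] + expression[gene]) + "\n")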
I have a txt file that looks like this:
('GTCC', 'ACTB'): 1
('GTCC', 'GAPDH'): 2
('CGAG', 'ACTB'): 1
('CGAG', 'GAPDH'): 4
where the first string is a gRNA name, the second string is a gene name, and the number is a count of those two strings occurring together.
I want to read this into a pandas dataframe and re-shape it so that it looks like this:
      ACTB  GAPDH
GTCC     1      2
CGAG     1      4
How might I do this?
The file will not always be this size; it will often be much larger (200 gRNA names by 20 gene names), and the size will be variable. There will always be only one gRNA name and one gene name per count. The titles of the columns/rows are accurate as to what the real file will look like (some string of letters for the rows and some gene name for the columns).
This is certainly not the cleanest way to do it, but I figured out a way to get what I wanted:
import pandas as pd

df = pd.read_csv('test.txt', sep=",|:", engine='python', names=['gRNA', 'gene', 'count'])
# strip the parentheses, quotes and whitespace left over from the split
df["gRNA"] = df["gRNA"].str.strip(" '()")
df["gene"] = df["gene"].str.strip(" '()")
df = df.pivot(index='gRNA', columns='gene', values='count')
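A somewhat cleaner alternative (a sketch, assuming every line keeps the ('gRNA', 'gene'): count shape shown above) is to parse the tuple literal with ast.literal_eval instead of stripping characters by hand:
import ast
import pandas as pd

rows = []
with open('test.txt') as fh:
    for line in fh:
        if not line.strip():  # skip blank lines
            continue
        key, _, count = line.rpartition(':')  # split on the last colon
        grna, gene = ast.literal_eval(key)  # safely parse "('GTCC', 'ACTB')"
        rows.append((grna, gene, int(count)))

df = pd.DataFrame(rows, columns=['gRNA', 'gene', 'count'])
df = df.pivot(index='gRNA', columns='gene', values='count')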
I've been trying to create a script that loops through and merges the CSVs in a folder, calculates the averages of specific columns, and exports the results to a single file. So far I've managed to create the logic for it, but I'm struggling with the identification of each column in the resulting CSV; these columns should be named after the 3 files that were averaged.
I've listed the files in the current directory using glob, all the files are named with the pattern:
AA_XXXX-b.
Where AA is a sample number, b is the repetition (1-2 for duplicates, 1-3 for triplicates, etc.), and XXXX is a short sample description. I thought of using the list generated when listing the files and somehow merging all repetitions of a sample into a single item with a format like:
AA_XXXX_1-N,
where N is the number of repetitions, and storing the merged names in a list in order to use them to name the columns with the averages in the final file, but I couldn't think of or find anything similar. I apologize if this question was already asked.
Edit:
Here's an example of what I'm trying to do:
This is what the data in the individual csvs looks like:
Filename: 01_NoCons-1

Wavelength (nm)    abs
901.5391           0.523718
902.8409           0.516127
905.4431           0.521074
908.0434           0.516442
909.3429           0.510993

Filename: 01_NoCons-2

Wavelength (nm)    abs
901.5391           0.523718
902.8409           0.516127
905.4431           0.521074
908.0434           0.516442
909.3429           0.510993

Filename: 01_NoCons-3

Wavelength (nm)    abs
901.5391           0.523718
902.8409           0.516127
905.4431           0.521074
908.0434           0.516442
909.3429           0.510993
And after concatenating the 3 abs columns and calculating their average, the result is transferred to a new table already containing the Wavelength column, like so:
Filename: Final table

Wavelength (nm)    01_NoCons_1-3
901.5391           0.523718
902.8409           0.516127
905.4431           0.521074
908.0434           0.516442
909.3429           0.510993
This process is repeated for every set of sample repetitions, and I'd like the resulting column name to identify the set it was generated from, such as 01_NoCons_1-3, which indicates that the column is the average of repetitions 1 to 3 of sample 01_NoCons.
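A possible approach (a sketch; the file layout, the Wavelength (nm) index column, and the output name final_table.csv are assumptions based on the examples above) is to group the file names by their sample prefix, average each group's abs columns, and label the result with the prefix plus the repetition range:
import glob
import os
import re
from collections import defaultdict

import pandas as pd

groups = defaultdict(list)
for path in sorted(glob.glob('*.csv')):
    stem = os.path.splitext(os.path.basename(path))[0]
    match = re.match(r'(.+)-\s*(\d+)$', stem)  # "01_NoCons-1" -> ("01_NoCons", "1")
    if match:
        groups[match.group(1)].append(path)

final = None
for sample, paths in groups.items():
    reps = [pd.read_csv(p, index_col='Wavelength (nm)') for p in paths]
    avg = pd.concat(reps, axis=1).mean(axis=1)  # row-wise average of the abs columns
    avg.name = f'{sample}_1-{len(paths)}'  # e.g. "01_NoCons_1-3"
    final = avg.to_frame() if final is None else final.join(avg)

final.to_csv('final_table.csv')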
I have a folder with several csv files. Each file has several columns, but the ones I'm interested in are id, sequence (3 different sequences) and time. See an example of a csv file below. I have over 40 csv files; each file belongs to a different participant.
id     sequence    time
300    1           2
300    2           3
300    1           3
etc.
I need to calculate the average times for the 3 different sequences for each participant. The code I currently have combines the csv files into one dataframe, selects the columns I'm interested in (id, sequence and time), calculates the average for each person for the 3 conditions, pivots the table to wide format (the format I need), and exports it as a csv file. I am not sure if this is the best way to do it, but it works. However, for some conditions 'time' is 0, and I want to exclude these sequences from the averages. How do I do this? Many thanks in advance.
import glob
import pandas as pd

filenames = sorted(glob.glob('times*.csv'))
df = pd.concat((pd.read_csv(filename) for filename in filenames))
df_new = df[["id","sequence","time"]]
df_new_ave = df_new.groupby(['id','sequence'])['time'].mean().reset_index(name='Avg_Score')
df_new_ave_wide = df_new_ave.pivot(index='id', columns='sequence', values='Avg_Score')
df_new_ave_wide.to_csv('final_ave.csv', encoding='utf-8', index=True)
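One way to exclude the zero times (a sketch; filter the rows right after selecting the columns, so they never enter the mean):
df_new = df_new[df_new["time"] != 0]  # drop rows whose time is 0 before averaging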
Simplified case.
Given that I have 5 input files in directory data_directory:
data_2020-01-01.txt,
data_2020-01-02.txt,
data_2020-01-03.txt,
data_2020-01-04.txt,
data_2020-01-05.txt
I read them all into a PySpark RDD and perform some operation on them that doesn't do any shuffling.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Clean Data").getOrCreate()
sparkContext = spark.sparkContext
input_rdd = sparkContext.textFile("data_directory/*")
result = input_rdd.mapPartitions(lambda x: remove_corrupted_rows(x))
Now I want to save data:
result.saveAsTextFile(
"results",
compressionCodecClass="org.apache.hadoop.io.compress.GzipCodec",
)
And I get 5 files, each named with a "part" prefix, so I've lost the information about which input file each output file came from:
._SUCCESS.crc
.part-00000.gz.crc
.part-00001.gz.crc
.part-00002.gz.crc
.part-00003.gz.crc
.part-00004.gz.crc
_SUCCESS
part-00000.gz
part-00001.gz
part-00002.gz
part-00003.gz
part-00004.gz
Is there any way to keep the input file names, or to introduce my own naming pattern, in this case?
Expected desired result:
._SUCCESS.crc
.data_2020-01-01.gz.crc
.data_2020-01-02.gz.crc
.data_2020-01-03.gz.crc
.data_2020-01-04.gz.crc
.data_2020-01-05.gz.crc
_SUCCESS
data_2020-01-01.gz
data_2020-01-02.gz
data_2020-01-03.gz
data_2020-01-04.gz
data_2020-01-05.gz
You could use pyspark.sql.functions.input_file_name() (documented at https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=functions#pyspark.sql.functions.input_file_name) and then partition your dataframe by the column it creates.
This way, 5 input files should give you a categorical column with 5 different values and partitioning on it should split your output into 5 parts.
Alternatively, if you wish to have a full naming pattern, you can split the dataframe on the input_file_name() column (here into 5 dataframes), repartition each (e.g. to 1 partition using coalesce(1)), and then save with custom logic (e.g. a dict mapping, or by extracting the filename from the column and passing it to DataFrameWriter.csv() as the name).
N.B.: When coalescing to 1 partition, be sure that all of the data fits into memory!
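As a rough sketch of the first suggestion (using the DataFrame API rather than the RDD API from the question; paths and names are illustrative):
from pyspark.sql import SparkSession
from pyspark.sql.functions import input_file_name

spark = SparkSession.builder.appName("Clean Data").getOrCreate()

df = spark.read.text("data_directory/*")
df = df.withColumn("source_file", input_file_name())  # full URI of the file each row came from

# One output subdirectory per input file, e.g. results/source_file=.../part-*.gz
df.write.partitionBy("source_file").option("compression", "gzip").text("results")
Note that the partition values are full file URIs, so the subdirectory names will be URL-encoded paths rather than bare names like data_2020-01-01.gz; producing that exact naming would still require renaming the outputs afterwards (e.g. via the Hadoop FileSystem API).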
I'm new to pandas and having a little trouble solving the following problem.
I have two files I need to use to create output. The first file contains a list of functions and associated genes.
An example of the file (with obviously completely made-up data):
File 1:
Function    Genes
Emotions    HAPPY,SAD,GOOFY,SILLY
Walking     LEG,MUSCLE,TENDON,BLOOD
Singing     VOCAL,NECK,BLOOD,HAPPY
I'm reading it into a dictionary using:
from collections import defaultdict

FunctionsWithGenes = defaultdict(list)

def read_functions_file(File):
    Header = File.readline()  # skip the header row
    Lines = File.readlines()
    for Line in Lines:
        Function, Genes = Line.split()  # whitespace-separated columns; the gene list itself contains no spaces
        FunctionsWithGenes[Function] = Genes.split(",")  # the genes for each function are in the same row and separated by commas
The second file is a .txt table that contains all the information I need, including a column of genes
for example:
chr     start    end    Gene     Value    MoreData
chr1    123      123    HAPPY    41.1     3.4
chr1    342      355    SAD      34.2     9.0
chr1    462      470    LEG      20.0     2.7
that I read in using:
import pandas as pd
df = pd.read_table(File)
The dataframe contains multiple columns, one of which is "Gene". This column can contain a variable number of entries. I would like to split the dataframe by the "Function" key in the FunctionsWithGenes dictionary. So far I have:
df = df[df["Gene"].isin(FunctionsWithGenes.keys())] # to remove all rows with no matching entries
Now I need to somehow split the dataframe based on gene functions. I was thinking perhaps of adding a new column with the gene function, but I'm not sure that would work since some genes can have more than one function.
I'm a little confused by your last line of code:
df = df[df["Gene"].isin(FunctionsWithGenes.keys())]
since the keys of FunctionsWithGenes are the actual functions (Emotions etc.), while the Gene column holds gene names, so the resulting DataFrame would always be empty.
If I understand you correctly, you would like to split the table up so that all the genes belonging to a function are in one table. If that's the case, you could use a simple dictionary comprehension. I set up some variables similar to yours:
>>> for function, genes in FunctionsWithGenes.items():
...     print(function, genes)
...
Walking ['LEG', 'MUSCLE', 'TENDON', 'BLOOD']
Singing ['VOCAL', 'NECK', 'BLOOD', 'HAPPY']
Emotions ['HAPPY', 'SAD', 'GOOFY', 'SILLY']
>>> df
Gene Value
0 HAPPY 3.40
1 SAD 4.30
2 LEG 5.55
Then I split up the DataFrame like this:
>>> FunctionsWithDf = {function: df[df['Gene'].isin(genes)]
...                    for function, genes in FunctionsWithGenes.items()}
Now FunctionsWithDf is a dictionary which maps each function to a DataFrame containing all the rows whose Gene column is in FunctionsWithGenes[function].
For example:
>>> FunctionsWithDf['Emotions']
Gene Value
0 HAPPY 3.4
1 SAD 4.3
>>> FunctionsWithDf['Singing']
Gene Value
0 HAPPY 3.4