Merge sequential filename repetitions in list - python

I've been trying to create a script that loops through the CSVs in a folder, merges them, calculates the averages of specific columns and exports the results to a single file. So far I've managed to create the logic for it, but I'm struggling with identifying each column in the resulting CSV: these columns should be named after the 3 files that were averaged.
I've listed the files in the current directory using glob; all the files are named with the pattern:
AA_XXXX-b
where AA is a sample number, XXXX is a short sample description, and b is the repetition (1-2 for duplicates, 1-3 for triplicates, etc.). I thought of taking the list generated when listing the files and somehow merging all repetitions of a sample into a single item with a format like:
AA_XXXX_1-N
where N is the number of repetitions, then storing the merged names in a list in order to use them to name the averaged columns in the final file, but I couldn't think of or find anything similar. I apologize if this question has already been asked.
Edit:
Here's an example of what I'm trying to do:
This is what the data in the individual csvs looks like:
Filename: 01_NoCons-1
                      abs
Wavelength (nm)         0
901.5391         0.523718
902.8409         0.516127
905.4431         0.521074
908.0434         0.516442
909.3429         0.510993
Filename: 01_NoCons-2
                      abs
Wavelength (nm)         0
901.5391         0.523718
902.8409         0.516127
905.4431         0.521074
908.0434         0.516442
909.3429         0.510993
Filename: 01_NoCons-3
                      abs
Wavelength (nm)         0
901.5391         0.523718
902.8409         0.516127
905.4431         0.521074
908.0434         0.516442
909.3429         0.510993
And after concatenating and calculating the average of the 3 abs columns, the result is transferred to a new table that already contains the Wavelength column, like so:
Filename: Final table
                01_NoCons_1-3
Wavelength (nm)             0
901.5391             0.523718
902.8409             0.516127
905.4431             0.521074
908.0434             0.516442
909.3429             0.510993
This process is repeated for every set of sample repetitions, and I'd like the resulting column name to identify which set it was generated from: for example, 01_NoCons_1-3 indicates that the column is the average of repetitions 1 to 3 of the sample 01_NoCons.
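For reference, here's a minimal sketch of one way to build those merged names, assuming the files sit in the working directory and follow the AA_XXXX-b pattern above (the .csv extension and the regex are illustrative assumptions):

import glob
import os
import re
from collections import defaultdict

# Collect the repetition numbers seen for each sample name,
# e.g. "01_NoCons-2.csv" -> groups["01_NoCons"] gets 2 appended
groups = defaultdict(list)
for path in glob.glob("*.csv"):
    stem = os.path.splitext(os.path.basename(path))[0]
    match = re.match(r"(.+)-(\d+)$", stem)
    if match:
        groups[match.group(1)].append(int(match.group(2)))

# Build one merged label per sample, e.g. "01_NoCons_1-3"
merged_names = [f"{name}_{min(reps)}-{max(reps)}"
                for name, reps in sorted(groups.items())]
print(merged_names)

The merged_names list could then supply the column headers when the averaged columns are written to the final table.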

Related

Reading txt file (similar to dictionary format) into pandas dataframe

I have a txt file that looks like this:
('GTCC', 'ACTB'): 1
('GTCC', 'GAPDH'): 2
('CGAG', 'ACTB'): 1
('CGAG', 'GAPDH'): 4
where the first string is a gRNA name, the second string is a gene name, and the number is a count of those two strings occurring together.
I want to read this into a pandas dataframe and re-shape it so that it looks like this:
      ACTB  GAPDH
GTCC     1      2
CGAG     1      4
How might I do this?
The file will not always be this size; it will often be much larger (200 gRNA names x 20 gene names), but the size will be variable. There will always only be one gRNA name and one gene name per count. The titles of the columns/rows are accurate as to what the real file will look like (some string of letters for the rows and some gene name for the columns).
This is certainly not the cleanest way to do it, but I figured out a way to get what I wanted:
import pandas as pd

# Split each line on "," or ":" (a regex separator needs engine='python')
df = pd.read_csv('test.txt', sep=",|:", engine='python', names=['gRNA', 'gene', 'count'])
# Strip the tuple punctuation left over from the raw text
df["gRNA"] = df["gRNA"].str.replace("(", "", regex=False)
df["gRNA"] = df["gRNA"].str.replace("'", "", regex=False)
df["gene"] = df["gene"].str.replace(")", "", regex=False)
df["gene"] = df["gene"].str.replace("'", "", regex=False)
# Reshape to one row per gRNA and one column per gene
df = df.pivot(index='gRNA', columns='gene', values='count')
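As an aside, a slightly cleaner sketch (assuming the same test.txt format) would let ast.literal_eval parse the tuple keys instead of stripping characters by hand:

import ast
import pandas as pd

rows = []
with open("test.txt") as f:
    for line in f:
        key, count = line.rsplit(":", 1)
        grna, gene = ast.literal_eval(key)  # parses "('GTCC', 'ACTB')" into a tuple
        rows.append((grna, gene, int(count)))

df = (pd.DataFrame(rows, columns=["gRNA", "gene", "count"])
        .pivot(index="gRNA", columns="gene", values="count"))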

Combining several csv files and calculating averages of a column

I have a folder with several csv files. Each file has several columns, but the ones I'm interested in are id, sequence (3 different sequences) and time. See an example of a csv file below. I have over 40 csv files; each file belongs to a different participant.
id    sequence    time
300   1           2
      2           3
      1           3
etc
I need to calculate the average times for the 3 different sequences for each participant. The code I currently have combines the csv files into one dataframe, selects the columns I'm interested in (id, sequence and time), calculates the average for each person for the 3 conditions, pivots the table to wide format (the format I need) and exports it as a csv file. I'm not sure if this is the best way to do it, but it works. However, for some sequences 'time' is 0, and I want to exclude those sequences from the averages. How do I do this? Many thanks in advance.
import glob
import pandas as pd

filenames = sorted(glob.glob('times*.csv'))
df = pd.concat((pd.read_csv(filename) for filename in filenames))
df_new = df[["id", "sequence", "time"]]
df_new_ave = df_new.groupby(['id', 'sequence'])['time'].mean().reset_index(name='Avg_Score')
df_new_ave_wide = df_new_ave.pivot(index='id', columns='sequence', values='Avg_Score')
df_new_ave_wide.to_csv('final_ave.csv', encoding='utf-8', index=True)
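Since the open question is how to exclude the zero times, one possible fix (a sketch, not the only way) is to filter them out right after selecting the columns and before the groupby:

# Drop rows where time is 0 so they don't enter the averages
df_new = df_new[df_new["time"] != 0]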

Is it possible to modify output data file names in pySpark?

Simplified case.
Given that I have 5 input files in directory data_directory:
data_2020-01-01.txt,
data_2020-01-02.txt,
data_2020-01-03.txt,
data_2020-01-04.txt,
data_2020-01-05.txt
I read them all to pySpark RDD and perform some operation on them that doesn't do any shuffling.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Clean Data").getOrCreate()
sparkContext = spark.sparkContext
input_rdd = sparkContext.textFile("data_directory/*")
result = input_rdd.mapPartitions(lambda x: remove_corrupted_rows(x))
Now I want to save data:
result.saveAsTextFile(
    "results",
    compressionCodecClass="org.apache.hadoop.io.compress.GzipCodec",
)
And I get 5 files whose names all contain "part", so I've lost the information about which input file each output file came from:
._SUCCESS.crc
.part-00000.gz.crc
.part-00001.gz.crc
.part-00002.gz.crc
.part-00003.gz.crc
.part-00004.gz.crc
_SUCCESS
part-00000.gz
part-00001.gz
part-00002.gz
part-00003.gz
part-00004.gz
Is there any way to keep the input file names, or to introduce my own naming pattern, in this case?
Expected desired result:
._SUCCESS.crc
.data_2020-01-01.gz.crc
.data_2020-01-02.gz.crc
.data_2020-01-03.gz.crc
.data_2020-01-04.gz.crc
.data_2020-01-05.gz.crc
_SUCCESS
data_2020-01-01.gz
data_2020-01-02.gz
data_2020-01-03.gz
data_2020-01-04.gz
data_2020-01-05.gz
You could use pyspark.sql.functions.input_file_name() (docs here: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=functions#pyspark.sql.functions.input_file_name) and then partition your dataframe by the column it creates.
This way, 5 input files should give you a categorical column with 5 different values, and partitioning on it should split your output into 5 parts.
Alternatively, if you want a fully custom naming pattern, you can split the dataframe on the input_file_name() column (here into 5 dataframes), repartition each (e.g. to 1 partition using coalesce(1)) and then save each with custom logic (e.g. a dict mapping, or by extracting the filename from the column and passing it to DataFrameWriter.csv() as the name).
N.B.: when coalescing to 1 partition, be sure that all the data fits into memory!
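As a rough illustration of the first suggestion, here's a minimal sketch using the DataFrame API rather than the RDD from the question; the source_file column name and the regex that strips the path and .txt extension are assumptions, not part of the original answer:

from pyspark.sql import SparkSession
from pyspark.sql.functions import input_file_name, regexp_extract

spark = SparkSession.builder.appName("Clean Data").getOrCreate()

# Read the files as a DataFrame and tag every row with the basename of its source file
df = spark.read.text("data_directory/*")
df = df.withColumn("source_file",
                   regexp_extract(input_file_name(), r"([^/]+)\.txt$", 1))

# Partitioning on the tag writes one sub-directory per input file,
# e.g. results/source_file=data_2020-01-01/part-*.gz
(df.write
   .partitionBy("source_file")
   .option("compression", "gzip")
   .text("results"))

This gives per-input-file directories rather than the exact file names from the question, but it preserves the lineage information.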

Create Multiple New Columns Based on Pipe-Delimited Column in Pandas

I have a pandas dataframe with a pipe-delimited column, called Parts, holding an arbitrary number of elements. The number of elements in these pipe-strings varies from 0 to over 10. The number of unique elements contained in all pipe-strings is not much smaller than the number of rows (which makes it impossible for me to manually specify all of them while creating new columns).
For each row, I want to create a new column that acts as an indicator variable for each element of the pipe-delimited list. For instance, the row
...'Parts'...
...'12|34|56'
should be transformed to
...'Part_12' 'Part_34' 'Part_56'...
...1 1 1...
Because there are a lot of unique parts, these columns are obviously going to be sparse: mostly zeros, since each row only contains a small fraction of the unique parts.
I haven't found any approach that doesn't require manually specifying the columns (for instance, Pandas Dataframe: split column into multiple columns, right-align inconsistent cell entries).
I've also looked at pandas' melt, but I don't think that's the appropriate tool.
The way I know how to solve it would be to pipe the raw CSV to another python script and deal with it on a char-by-char basis, but I need to work within my existing script since I will be processing hundreds of CSVs in this manner.
Here's a better illustration of the data
ID    YEAR  AMT      PARTZ
1202  2007  99.34
9321  1988  1012.99  2031|8942
2342  2012  381.22   1939|8321|Amx3
You can use get_dummies and add_prefix:
df.Parts.str.get_dummies().add_prefix('Part_')
Output:
   Part_12  Part_34  Part_56
0        1        1        1
Edit, for the comment about counting duplicates:
df = pd.DataFrame({'Parts': ['12|34|56|12']}, index=[0])
# Split on '|', stack to one part per row, dummy-encode, then sum per original row
pd.get_dummies(df.Parts.str.split('|', expand=True).stack()).groupby(level=0).sum().add_prefix('Part_')
Output:
   Part_12  Part_34  Part_56
0        2        1        1
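To attach the indicator columns back onto the original dataframe (using the PARTZ column from the illustration above; this join is a sketch, not part of the original answer), one option is:

# Rows with an empty PARTZ simply get all-zero indicators
df = df.join(df['PARTZ'].str.get_dummies().add_prefix('Part_'))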

Merge two tab delimited text files by one common column in python

I would like to merge two tab-delimited text files that share one common column. I have an 'identifier' file that looks like this (2 columns by 1050 rows):
module 1 gene 1
module 1 gene 2
..
module x gene y
I also have a tab-delimited 'target' text file that looks like this (36 columns by 12000 rows):
gene 1 sample 1 sample 2 etc
gene 2 sample 1 sample 2 etc
..
gene z sample 1 sample 2 etc
I would like to merge the two files based on the gene identifier, keeping both the matching expression values from the target file and the module affiliations from the identifier file. Essentially, I want to take the genes from the identifier file, find them in the target file, and create a new file with module #, gene # and expression values all in one file. Your suggestions would be welcome.
Thanks!
Open the gene description file, and load a dictionary where the key is the gene number and the value is the sample description.
Then open the module file and loop over its lines. For each line, look up the corresponding gene entry in the dictionary, and print the module, gene, and sample description.
That's it! If you need more information, check how to read a file and use a dictionary in the Python documentation.
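A minimal sketch of that approach, with hypothetical file names (identifier.txt, target.txt, merged.txt) and assuming both files are tab-delimited with the gene ID where shown:

# Load the target file into a dictionary: gene -> tab-joined expression values
samples = {}
with open("target.txt") as f:
    for line in f:
        fields = line.rstrip("\n").split("\t")
        samples[fields[0]] = "\t".join(fields[1:])

# Loop over the identifier file and write module, gene and expression values
with open("identifier.txt") as f, open("merged.txt", "w") as out:
    for line in f:
        module, gene = line.rstrip("\n").split("\t")
        if gene in samples:  # skip genes absent from the target file
            out.write("\t".join([module, gene, samples[gene]]) + "\n")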
