Pandas: Splitting and editing file based on dictionary - python

I'm new to pandas and having a little trouble solving the following problem.
I have two files I need to use to create output. The first file contains a list of functions and their associated genes.
Here is an example of the file (with obviously completely made-up data):
File 1:
Function Genes
Emotions HAPPY,SAD,GOOFY,SILLY
Walking LEG,MUSCLE,TENDON,BLOOD
Singing VOCAL,NECK,BLOOD,HAPPY
I'm reading it into a dictionary using:
from collections import defaultdict

FunctionsWithGenes = defaultdict(list)

def read_functions_file(File):
    Header = File.readline()  # skip the header row
    Lines = File.readlines()
    for Line in Lines:
        Function, Genes = Line.split()  # each row has two whitespace-separated fields
        FunctionsWithGenes[Function] = Genes.split(",")  # the genes for each function are in the same row, separated by commas
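For completeness, a minimal way to call this (assuming the first file is saved under a made-up name like functions.txt):
with open('functions.txt') as f:
    read_functions_file(f)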
The second file is a .txt table that contains all the information I need, including a column of genes.
For example:
chr   start  end  Gene   Value  MoreData
chr1  123    123  HAPPY  41.1   3.4
chr1  342    355  SAD    34.2   9.0
chr1  462    470  LEG    20.0   2.7
that I read in using:
import pandas as pd
df = pd.read_table(File)
The dataframe contains multiple columns, one of which is "Gene". This column can contain a variable number of entries. I would like to split the dataframe by the "Function" keys in the FunctionsWithGenes dictionary. So far I have:
df = df[df["Gene"].isin(FunctionsWithGenes.keys())] # to remove all rows with no matching entries
Now I need to somehow split the dataframe based on gene functions. I was thinking perhaps of adding a new column with the gene function, but I'm not sure that would work, since some genes can have more than one function.

I'm a little confused by your last line of code:
df = df[df["Gene"].isin(FunctionsWithGenes.keys())]
since the keys of FunctionsWithGenes are the actual functions (Emotions etc.) while the Gene column holds the gene names. The resulting DataFrame would therefore always be empty.
If I understand you correctly, you would like to split the table up so that all the genes belonging to a function are in one table. If that's the case, you can use a simple dictionary comprehension. I set up some variables similar to yours:
>>> for function, genes in FunctionsWithGenes.items():
...     print(function, genes)
...
Walking ['LEG', 'MUSCLE', 'TENDON', 'BLOOD']
Singing ['VOCAL', 'NECK', 'BLOOD', 'HAPPY']
Emotions ['HAPPY', 'SAD', 'GOOFY', 'SILLY']
>>> df
    Gene  Value
0  HAPPY   3.40
1    SAD   4.30
2    LEG   5.55
Then I split up the DataFrame like this:
>>> FunctionsWithDf = {function: df[df['Gene'].isin(genes)]
...                    for function, genes in FunctionsWithGenes.items()}
Now FunctionsWithDf is a dictionary that maps each Function to a DataFrame containing all rows whose Gene column is in the value of FunctionsWithGenes[Function].
For example:
>>> FunctionsWithDf['Emotions']
    Gene  Value
0  HAPPY    3.4
1    SAD    4.3
>>> FunctionsWithDf['Singing']
    Gene  Value
0  HAPPY    3.4
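If you instead want the single long DataFrame with an extra Function column that the question mentions, here is a minimal sketch reusing the df and FunctionsWithGenes above; genes that belong to several functions simply get one row per function.
import pandas as pd

# one (Function, Gene) row per pairing; multi-function genes appear once per function
rows = [(function, gene)
        for function, genes in FunctionsWithGenes.items()
        for gene in genes]
mapping = pd.DataFrame(rows, columns=['Function', 'Gene'])

# an inner merge keeps each df row once per function its gene belongs to
long_df = df.merge(mapping, on='Gene', how='inner')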

Related

Python - how to store multiple variables in one column of a .csv file and then read those variables into a list

I think this issue is pretty basic, but I haven't seen it answered online. I'm using Python and I have pandas installed to make things easier. If there's a way to do it without pandas, that would be awesome too! I'm coding a node-connection map: I want to take in a .csv file with a 'previous' and 'next' node list, have the program read that data, and store it in lists. For example:
.csv file:
Name     Previous  Next
Alpha    one two   Three
Beta     four      five
Charlie  six       seven eight
what I want in my program:
alpha, [one, two], [three]
beta, [four], [five]
charlie, [six], [seven, eight]
I have heard about two ways to write multiple variables in one .csv column.
One way was placing a space in between the two values/variables:
alpha,one two,three
and another way I've heard to solve this is use " marks and separate with a comma:
alpha,"one,two",three
Although I have heard about these answers before, I haven't been able to implement them. When reading the data in my program, it will assume that the space is part of the string or that the comma is part of the string.
file = pd.read_csv("connections.csv")
previous_alpha = []
previous_alpha.append(file.Previous[0])
So, instead of having a list of two strings, [one, two], my program will have a list containing one string that looks like ["one,two"] or ["one two"].
I can change the way the variables are structured in the .csv file or the code reading in the data. Thanks for all the help in advance!
There are multiple ways of doing this, depending on the shape your CSV data starts in.
The first format has the data as a single row per name, with lists inside the cells:
Name,Previous,Next
Alpha,"One,Two",Three
Beta,Four,Five
Charlie,Six,"Seven,Eight"
Note the quotation marks around the lists. We can use apply to change the values; the convert function just splits the string using , as the delimiter.
import pandas as pd

def convert(x):
    return x.split(',')

df = pd.read_csv('file.csv')
df['Previous'] = df['Previous'].apply(convert)
df['Next'] = df['Next'].apply(convert)
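The frame should then look roughly like this:
      Name    Previous            Next
0    Alpha  [One, Two]         [Three]
1     Beta      [Four]          [Five]
2  Charlie       [Six]  [Seven, Eight]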
In the second format, the Name is repeated across rows, with one value per row:
Name,Previous,Next
Alpha,One,Three
Alpha,Two,Three
Beta,Four,Five
Charlie,Six,Seven
Charlie,Six,Eight
We can use the agg function to aggregate. The convert function drops the duplicates and returns the values as a list.
import pandas as pd

def convert(x):
    return x.drop_duplicates().to_list()

df = pd.read_csv('file.csv')
df = df.groupby('Name').agg({'Previous': convert, 'Next': convert})
The results should look like this:
            Previous            Next
Name
Alpha     [One, Two]         [Three]
Beta          [Four]          [Five]
Charlie        [Six]  [Seven, Eight]
If you have this DataFrame:
      Name  Previous         Next
0    Alpha   one two        Three
1     Beta      four         five
2  Charlie       six  seven eight
Then you can split the strings in the required columns and save the CSV normally:
df["Previous"] = df["Previous"].str.split()
df["Next"] = df["Next"].str.split()
print(df)
df.to_csv("data.csv", index=False)
      Name    Previous            Next
0    Alpha  [one, two]         [Three]
1     Beta      [four]          [five]
2  Charlie       [six]  [seven, eight]
To load the data back, you can use pd.read_csv with the converters= parameter:
from ast import literal_eval

df = pd.read_csv(
    "data.csv", converters={"Previous": literal_eval, "Next": literal_eval}
)
print(df)
Prints:
      Name    Previous            Next
0    Alpha  [one, two]         [Three]
1     Beta      [four]          [five]
2  Charlie       [six]  [seven, eight]

Reading txt file (similar to dictionary format) into pandas dataframe

I have a txt file that looks like this:
('GTCC', 'ACTB'): 1
('GTCC', 'GAPDH'): 2
('CGAG', 'ACTB'): 1
('CGAG', 'GAPDH'): 4
where the first string is a gRNA name, the second string is a gene name, and the number is a count of those two strings occurring together.
I want to read this into a pandas dataframe and re-shape it so that it looks like this:
ACTB GAPDH
GTCC 1 2
CGAG 1 4
How might I do this?
The file will not always be this size-- it will often be much larger (200 gRNA names x 20 gene names) but the size will be variable. There will always only be one gRNA name and one gene name per count. The titles of the columns/rows are accurate as to what the real file will look like (some string of letters for the rows and some gene name for the columns).
This is certainly not the cleanest way to do it, but I figured out a way to get what I wanted:
import pandas as pd

df = pd.read_csv('test.txt', sep=',|:', engine='python', names=['gRNA', 'gene', 'count'])
# strip the tuple punctuation (regex=False treats the characters literally)
df['gRNA'] = df['gRNA'].str.replace('(', '', regex=False)
df['gRNA'] = df['gRNA'].str.replace("'", '', regex=False)
df['gene'] = df['gene'].str.replace(')', '', regex=False)
df['gene'] = df['gene'].str.replace("'", '', regex=False)
df['gene'] = df['gene'].str.strip()  # drop the leading space left by the split
df = df.pivot(index='gRNA', columns='gene', values='count')
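A somewhat tidier alternative is to parse each tuple with ast.literal_eval instead of stripping characters; this is a sketch assuming every line keeps the ('gRNA', 'gene'): count layout shown above:
import pandas as pd
from ast import literal_eval

rows = []
with open('test.txt') as f:
    for line in f:
        key, _, count = line.rpartition(':')  # split on the last colon
        grna, gene = literal_eval(key)        # safely parse the tuple literal
        rows.append((grna, gene, int(count)))

df = pd.DataFrame(rows, columns=['gRNA', 'gene', 'count'])
df = df.pivot(index='gRNA', columns='gene', values='count')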

Processing pandas dataframe with variable delimiter and row length

I have a tedious csv file with the following format
HELLO 1000 db1 3.88
HELLO 10 db123456 3.8899949
HELLO repository 10.0000
HELLO rep 001 0.001
Basically, the first field is always the same constant, while the names have different lengths and different separators
(for example, "1000 db1"), and the final values are all floats, but again in different formats/lengths.
What I would like is to be able to read columns as
constant name value
HELLO ..... ....
I have looked for a solution but can't figure it out. Initially, I was trying
df.map(lambda x: x[...])
to cut off the trailing values, but it does not work because the last values do not always have the same length.
Thanks in advance
I suppose you want to split the CSV into three columns. You can use the re module for the task (if file.csv is in the format you describe in your question):
import re
import pandas as pd

with open('file.csv', 'r') as f_in:
    # ^...$ with re.M anchors each match to a single line
    df = pd.DataFrame(re.findall(r'^(\S+)\s+(.+)\s+(\S+)$', f_in.read(), flags=re.M),
                      columns=['constant', 'name', 'value'])
print(df)
Prints:
  constant         name      value
0    HELLO     1000 db1       3.88
1    HELLO  10 db123456  3.8899949
2    HELLO   repository    10.0000
3    HELLO      rep 001      0.001
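Note that re.findall captures everything as strings; if the value column should be numeric, it can be converted afterwards:
df['value'] = pd.to_numeric(df['value'])  # strings -> floats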

Loop over different Dataframes Pandas

Hello World,
I have 52 dataframes (df and data variants), each containing only names that start with one letter of the alphabet.
Below is an example for the letters A & B (both the df and data variants):
dfA                  dfB
Code  Name           Code  Name
15    Amiks          68    Bernard
157   Alis           14    Barpti

dataA                dataB
Code  Name           Code  Name
      Amiks                Berti
      Alis                 Bernard
      Anatole              Barpti
Question:
Not being an expert in Python: how can I loop over the dataframes pairwise by letter, checking each data frame only against the df frame with the same letter rather than against all dataframes?
For example, check whether:
dataA.Name is in dfA.Name ?
dataB.Name is in dfB.Name ?
dataZ.Name is in dfZ.Name ?
Edit
The original DFs are below:
df                   data
Code  Name           Code  Name
15    Amiks                Amiks
157   Alis                 Alis
14    Barpti               Bernard
68    Bernard              Barpti
I just created one df per first letter.
The aim is to speed up the computation and avoid checking against the whole DF when we can check only the rows that share the same first letter.
Thanks to anyone helping.
You should consider variable names as meaningful only to you, not to the Python interpreter. Of course, there are hacky ways using locals() or globals(), but they should never be used in normal code.
That means that if your dataframes are related, you should use a data structure to record the relation. You could use two lists or a list of pairs, for example:
dataframes = [ (dfA, dataA), (dfB, dataB), ...]
If the letter itself matters, you could use a dictionary:
dataframes = { 'A': (dfA, dataA), 'B': (dfB, dataB), ...}
Then you can easily iterate:
for letter in dataframes:
    commons = dataframes[letter][0].merge(dataframes[letter][1], on='Name',
                                          suffixes=('_x', ''))[['Code', 'Name']]
    # process commons, which contains the Code and Name rows where
    # dataX.Name appears in dfX.Name
    ...
But unless the original dataframe is really huge, I would suggest you benchmark the single-dataframe, first-letter approach against the 52-dataframes one. Pandas is rather efficient provided everything fits in memory...
For those who might be interested, below is what I have done:
Step 1: re-merge all df's and data's.
The output is :
Df (containing all letters)
Data (containing all letters)
Step 2: Retrieve First letter of Name
import numpy as np

# Function to retrieve the unique first letters of a column
def first_letter(data, col):
    a = data[col].map(lambda x: x[0])  # first character of each name
    b = np.unique(a)
    c = b.tolist()
    return c
Step 3: Loop over DF and Data
Compare only Names whose first letter is the same:
# Retrieve the unique first letters of the Name column
first = first_letter(df, 'Name')

# Loop over each first letter, matching only rows of both DFs that start with it
for letter in first:
    df_first = df[df['Name'].str.startswith(letter)]
    data_first = data[data['Name'].str.startswith(letter)]
    # Process
    ....
With this code, I can match Names only among rows that share the same first letter instead of searching the whole DF.
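A slightly more compact variant of the same idea (a sketch assuming, as above, that both df and data have a Name column) lets groupby produce the per-letter chunks:
# one iteration per distinct first letter present in df
for letter, df_first in df.groupby(df['Name'].str[0]):
    data_first = data[data['Name'].str[0] == letter]
    # Process df_first against data_first here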

Reading fixed-width column text file with multiple different types in pandas

I would like to read a text file with fixed-width columns (read_fwf) using pandas. However, each line in the file can be of a different type (each representing a different structure with a different number of columns), determined by the first character of the line. How can I parse this into (I would say multiple) DataFrames depending on the type? Is there a parameter of read_fwf that can handle this?
File Example. Record types A,B,C:
Header
A1234 Another Field 567 fourthfield
A32 Second Field 456 fourthfield2
BFirstColumn SecondColumn ThirdColumn
BFirstColumn2 SecondColumn2 ThirdColumn2
CA B C 123 456 789
CEF TTTCCC001 001 001
A1 Next Field 999 fourthfield3
I tried to split based on the first column by reading:
data = pd.read_fwf(path, widths=[1,...], header=0, skipfooter=0, names = ['Type','Data'])
As = data.loc[data['Type'] == 'A']
But then I did not find a way to easily break up the fixed-width data inside the pandas column As['Data']. You cannot feed a pandas DataFrame back into read_fwf, so you can only do:
As['col001'] = As['Data'].str.slice(0,6)
As['col002'] = As['Data'].str.slice(6,10)
...
Is there any simpler way?
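One small improvement over repeating a line per column is to drive the slicing from a table of boundaries. A sketch (the column names and offsets below are made up; substitute the real widths of each record type):
# hypothetical column boundaries for record type A
a_slices = {'col001': (0, 6), 'col002': (6, 10)}

for name, (start, stop) in a_slices.items():
    As[name] = As['Data'].str.slice(start, stop)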
