Python: splitting a file based on a key word

I have this file:
GSENumber Species Platform Sample Age Tissue Sex Count
GSE11097 Rat GPL1355 GSM280267 4 Liver Male Count
GSE11097 Rat GPL1355 GSM280268 4 Liver Female Count
GSE11097 Rat GPL1355 GSM280269 6 Liver Male Count
GSE11097 Rat GPL1355 GSM280409 6 Liver Female Count
GSE11291 Mouse GPL1261 GSM284967 5 Heart Male Count
GSE11291 Mouse GPL1261 GSM284968 5 Heart Male Count
GSE11291 Mouse GPL1261 GSM284969 5 Heart Male Count
GSE11291 Mouse GPL1261 GSM284970 5 Heart Male Count
GSE11291 Mouse GPL1261 GSM284975 10 Heart Male Count
GSE11291 Mouse GPL1261 GSM284976 10 Heart Male Count
GSE11291 Mouse GPL1261 GSM284987 5 Muscle Male Count
GSE11291 Mouse GPL1261 GSM284988 5 Muscle Female Count
GSE11291 Mouse GPL1261 GSM284989 30 Muscle Male Count
GSE11291 Mouse GPL1261 GSM284990 30 Muscle Male Count
GSE11291 Mouse GPL1261 GSM284991 30 Muscle Male Count
You can see here there are two series (GSE11097 and GSE11291), and I want a summary for each series. The output should be a dictionary like this, for each "GSE" number:
Series Species Platform AgeRange Tissue Sex Count
GSE11097 Rat GPL1355 4-6 Liver Mixed Count
GSE11291 Mouse GPL1261 5-10 Heart Male Count
GSE11291 Mouse GPL1261 5-30 Muscle Mixed Count
So I know one way to do this would be:
Read in the file and make a list of all the GSE numbers.
Then read in the file again and parse based on GSE number.
e.g.
import sys
list_of_series = list(set([line.strip().split()[0] for line in open(sys.argv[1])]))
list_of_dicts = []
for each_list in list_of_series:
    # keys are lower case throughout so the assignments below hit the same entries
    temp_dict = {"species": "", "platform": "", "age": [], "tissue": "", "sex": [], "count": ""}
    for line in open(sys.argv[1]).readlines()[1:]:
        line = line.strip().split()
        if line[0] == each_list:
            temp_dict["species"] = line[1]
            temp_dict["platform"] = line[2]
            temp_dict["age"].append(line[4])
            temp_dict["tissue"] = line[5]
            temp_dict["sex"].append(line[6])
            temp_dict["count"] = line[7]
I think this is messy in two ways:
I have to read in the whole file twice (in reality, the file is much bigger than the example here).
This method keeps overwriting the same dictionary entry with the same word.
Also, there's a problem with the sex: I want to say "if both male and female, put 'Mixed' in the dict; otherwise, put 'Male' or 'Female'".
I can make this code work, but I'm wondering about quick tips to make the code cleaner/more Pythonic?

I agree with Max Paymar that this should be done in a query language. If you really want to do it in Python, the pandas module will help a lot.
import pandas as pd
## column widths of the fixed width format
fwidths = [12, 8, 8, 12, 4, 8, 8, 5]
## read fixed width format into pandas data frame
## (header=0: the first line holds the column names, which `names` replaces)
df = pd.read_fwf("your_file.txt", widths=fwidths, header=0,
                 names=["GSENumber", "Species", "Platform", "Sample",
                        "Age", "Tissue", "Sex", "Count"])
## drop "Sample" column as it is not needed in the output
df = df.drop("Sample", axis=1)
## group by GSENumber
grouped = df.groupby(df.GSENumber)
## aggregate columns for each group
aggregated = grouped.agg({'Species': lambda x: list(x.unique()),
                          'Platform': lambda x: list(x.unique()),
                          'Age': lambda x: "%d-%d" % (min(x), max(x)),
                          'Tissue': lambda x: list(x.unique()),
                          'Sex': lambda x: "Mixed" if x.nunique() > 1 else list(x.unique()),
                          'Count': lambda x: list(x.unique())})
print(aggregated)
This produces pretty much the result you asked for and is much cleaner than parsing the file in pure Python.

import sys
def main():
    with open(sys.argv[1]) as infile:
        data = read_data(infile)
    result = process_rows(data)
    format_and_print(result, sys.argv[2])
def read_data(file):
    # use the file object that was passed in instead of reopening sys.argv[1]
    data = [line.strip().split() for line in file]
    data.pop(0)  # remove header
    return data
def process_rows(data):
    data_dict = {}
    for row in data:
        process_row(row, data_dict)
    return data_dict
def process_row(row, data_dict):
    composite_key = row[0] + row[1] + row[5]  # assuming this is how you are grouping the entries
    if composite_key in data_dict:
        data_dict[composite_key]['age_range'].add(row[4])
        # row[6] is the sex column; compare it with the sex stored for this group
        if row[6] != data_dict[composite_key]['sex']:
            data_dict[composite_key]['sex'] = 'Mixed'
        # do you need to accumulate the counts? data_dict[composite_key]['count'] += row[7]
    else:
        data_dict[composite_key] = {
            'series': row[0],
            'species': row[1],
            'platform': row[2],
            'age_range': set([row[4]]),
            'tissue': row[5],
            'sex': row[6],
            'count': row[7]
        }
def format_and_print(data_dict, outfile):
    pass
    # you can implement this one :)
if __name__ == "__main__":
    main()
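For reference, here is a minimal single-pass sketch that also covers the formatting step left open above. It assumes whitespace-separated columns in the order shown in the question and groups by (series, tissue), which is what the expected output suggests; the function name summarize and the shortened sample data are only for illustration.

```python
from collections import defaultdict

def summarize(lines):
    """Group rows by (series, tissue) in one pass over whitespace-separated lines."""
    groups = defaultdict(lambda: {"ages": [], "sexes": set()})
    for line in lines:
        fields = line.split()
        if not fields or fields[0] == "GSENumber":  # skip blank lines and the header
            continue
        series, species, platform, _sample, age, tissue, sex, count = fields
        group = groups[(series, tissue)]
        group.update(species=species, platform=platform, count=count)
        group["ages"].append(int(age))  # keep ages numeric so min/max compare correctly
        group["sexes"].add(sex)

    summary = []
    for (series, tissue), g in groups.items():
        sex = "Mixed" if len(g["sexes"]) > 1 else next(iter(g["sexes"]))
        age_range = "%d-%d" % (min(g["ages"]), max(g["ages"]))
        summary.append([series, g["species"], g["platform"], age_range, tissue, sex, g["count"]])
    return summary

# A shortened copy of the question's data, used here instead of a real file.
sample = """\
GSENumber Species Platform Sample Age Tissue Sex Count
GSE11097 Rat GPL1355 GSM280267 4 Liver Male Count
GSE11097 Rat GPL1355 GSM280268 4 Liver Female Count
GSE11097 Rat GPL1355 GSM280269 6 Liver Male Count
GSE11291 Mouse GPL1261 GSM284967 5 Heart Male Count
GSE11291 Mouse GPL1261 GSM284975 10 Heart Male Count
GSE11291 Mouse GPL1261 GSM284987 5 Muscle Male Count
GSE11291 Mouse GPL1261 GSM284988 5 Muscle Female Count
GSE11291 Mouse GPL1261 GSM284989 30 Muscle Male Count
"""
rows = summarize(sample.splitlines())
for row in rows:
    print(" ".join(row))
```

For a real file you could pass the open file object straight to summarize, since it only iterates over lines, so the file is read once and never held in memory as a whole.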

Related

Python: fuzzywuzzy, the output of the first value is correct, the others are NaN

I'm stuck on a very strange problem:
I have two dfs and I have to match strings of one df with the strings of the other df, by similarity.
The target column is the name of the television program (program_name_1 & program_name_2).
To restrict the matching to a limited set of data, I also used the column 'Channel' as a filter.
The function applies the fuzzy algorithm and returns the match between the elements of the column program_name_1 and those of program_name_2, together with the similarity score between them.
The really strange thing is that the output works fine just for the first channel, but not for any of the following channels. The first column (scorer_test_1), which just prints program_name_1, is always correct, but scorer_test_2 (which should print program_name_2) and the similarity column are NaN.
I did a lot of checks on the dfs: I am sure that the names of the columns are the same as the names in the lists and that the other channels contain all the data I'm asking for.
The strangest thing is that the first channel and all the other channels are in the same df, so there are no differences between the data of the channels.
I will show you toy dfs, to let you understand the problem better:
df1 = {'Channel': ['1','1','1','2','2','2','3','4'], 'program_name_1': ['party','animals','gucci','the simpson', 'cars', 'mathematics', 'bikes', 'chef']}
df2 = {'Channel': ['1','1','1','2','2','2','3','4'], 'program_name_2': ['parties','gucci_gucci','animal','simpsons', 'math', 'the car', 'bike', 'cooking']}
df1 = pd.DataFrame(df1, columns = ['Channel','program_name_1'])
df2 = pd.DataFrame(df2, columns = ['Channel','program_name_2'])
that will print for the df1:
Channel program_name_1
1 party
1 animals
1 gucci
2 the simpson
2 cars
2 mathematics
3 bikes
4 chef
and for the df2:
Channel program_name_2
1 parties
1 gucci_gucci
1 animal
2 simpsons
2 math
2 the car
3 bike
4 cooking
and here the code:
scorer_test_1 = df1.loc[(df1['Channel'] == '1')]['program_name_1']
scorer_test_2 = df2.loc[(df2['Channel'] == '1')]['program_name_2']
# creation of a function for the score
def scorer_tester_function(x):
    matching_list = []
    similarity = []
    # iterate on the rows
    for i in scorer_test_1:
        if pd.isnull(i):
            matching_list.append(np.nan)
            similarity.append(np.nan)
        else:
            ratio = process.extract(i, scorer_test_2, limit=5, scorer=scorer_dict[x])
            matching_list.append(ratio[0][0])
            similarity.append(ratio[0][1])
    my_df = pd.DataFrame()
    my_df['program_name_1'] = scorer_test_1
    my_df['program_name_2'] = pd.Series(matching_list)
    my_df['similarity'] = pd.Series(similarity)
    return my_df
print(scorer_tester_function('R').head())
This is the output I would like to get for all the channels, but which I only get if I pass the first channel in the code:
for the channel[1]:
program_name_1 program_name_2 similarity
party parties 95
animals animal 95
gucci gucci_gucci 75
for the channel[2]:
program_name_1 program_name_2 similarity
the simpson simpsons 85
cars the car 75
mathematics math 70
This is the output I get if I ask for channel 2 or any later channel:
code:
scorer_test_1 = df1.loc[(df1['Channel'] == '2')]['program_name_1']
scorer_test_2 = df2.loc[(df2['Channel'] == '2')]['program_name_2']
output:
Channel program_name_1 program_name_2 similarity
2 the simpson NaN NaN
2 cars NaN NaN
2 mathematics NaN NaN
I hope someone can help me :)
Thanks!
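This is caused by an index mismatch. scorer_test_1 keeps the row labels of the original dataframe (3, 4, 5 for channel 2 in the toy data), while pd.Series(matching_list) is labelled 0, 1, 2, so pandas aligns the two on their labels and fills the gaps with NaN. Channel 1 works only because its labels happen to start at 0. Resetting the index after adding the first series fixes it: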
def scorer_tester_function(x):
    matching_list = []
    similarity = []
    # iterate on the rows
    for i in scorer_test_1:
        if pd.isnull(i):
            matching_list.append(np.nan)
            similarity.append(np.nan)
        else:
            ratio = process.extract(i, scorer_test_2, limit=5)#, scorer=scorer_dict[x])
            matching_list.append(ratio[0][0])
            similarity.append(ratio[0][1])
    my_df = pd.DataFrame()
    my_df['program_name_1'] = scorer_test_1
    print(my_df.index)
    # relabel the rows 0..n-1 (the old labels are kept in a new 'index' column)
    my_df.reset_index(inplace=True)
    print(my_df.index)
    my_df['program_name_2'] = pd.Series(matching_list)
    my_df['similarity'] = pd.Series(similarity)
    return my_df
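To see the alignment effect in isolation, here is a small sketch with made-up variable names (left, matches) and the channel 2 values from the toy data. It shows the failing assignment next to one possible alternative, assigning the plain list, which pandas places by position rather than by label.

```python
import pandas as pd

# Labels 3, 4, 5 mimic the rows of channel 2 in the toy data.
left = pd.Series(["the simpson", "cars", "mathematics"], index=[3, 4, 5])
matches = ["simpsons", "the car", "math"]

misaligned = pd.DataFrame()
misaligned["program_name_1"] = left
# pd.Series(matches) is labelled 0, 1, 2: no label overlaps 3, 4, 5, so every cell is NaN
misaligned["program_name_2"] = pd.Series(matches)

aligned = pd.DataFrame()
aligned["program_name_1"] = left
# a plain list has no labels, so it is assigned by position
aligned["program_name_2"] = matches

print(misaligned)
print(aligned)
```

Building the series with pd.Series(matches, index=left.index) should work as well; which variant reads best is a matter of taste.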

Dynamically count occurrences of multiple words within lists

I'm trying to count the occurrences of multiple keywords within each phrase of a dataframe. This seems similar to other questions but is not quite the same.
Here we have a df and a list of lists containing keywords/topics:
df=pd.DataFrame({'phrases':['very expensive meal near city center','very good meal and waiters','nice restaurant near center and public transport']})
topics=[['expensive','city'],['good','waiters'],['center','transport']]
For each phrase, we want to count how many words match in each separate topic. So the first phrase should score 2 for the 1st topic, 0 for the 2nd topic and 1 for the 3rd topic, etc.
I've tried this but it does not work:
from collections import Counter
topnum=0
for t in topics:
    counts=[]
    topnum+=1
    results = Counter()
    for line in df['phrases']:
        for c in line.split(' '):
            results[c] = t.count(c)
        counts.append(sum(results.values()))
    df['topic_'+str(topnum)] = counts
I'm not sure what I'm doing wrong. Ideally I would end up with a count of matching words for each topic/phrase combination, but instead the counts seem to repeat themselves:
phrases topic_1 topic_2 topic_3
very expensive meal near city centre 2 0 0
very good meal and waiters 2 2 0
nice restaurant near center and public transport 2 2 2
Many thanks to whoever can help me.
Best Wishes
Here is a solution that defines a helper function called find_count and applies it as a lambda to the dataframe.
import pandas as pd
df=pd.DataFrame({'phrases':['very expensive meal near city center','very good meal and waiters','nice restaurant near center and public transport']})
topics=[['expensive','city'],['good','waiters'],['center','transport']]
def find_count(row, topics_index):
    count = 0
    word_list = row['phrases'].split()
    for word in word_list:
        if word in topics[topics_index]:
            count+=1
    return count
df['Topic 1'] = df.apply(lambda row:find_count(row,0), axis=1)
df['Topic 2'] = df.apply(lambda row:find_count(row,1), axis=1)
df['Topic 3'] = df.apply(lambda row:find_count(row,2), axis=1)
print(df)
#Output
phrases Topic 1 Topic 2 Topic 3
0 very expensive meal near city center 2 0 1
1 very good meal and waiters 0 2 0
2 nice restaurant near center and public transport 0 0 2
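Since the question asks for something dynamic, here is a sketch that loops over the topics instead of writing one line per topic. It is plain Python so the counting logic is easy to check; the helper name count_topic_matches and the topic_N column names are my own choice. The counts in the question repeat because the Counter is created once per topic and never reset between phrases, so each sum still includes matches from earlier phrases; counting each phrase on its own avoids that.

```python
def count_topic_matches(phrase, topic):
    """Count how many words of `phrase` appear in the keyword list `topic`."""
    keywords = set(topic)
    return sum(1 for word in phrase.split() if word in keywords)

phrases = [
    "very expensive meal near city center",
    "very good meal and waiters",
    "nice restaurant near center and public transport",
]
topics = [["expensive", "city"], ["good", "waiters"], ["center", "transport"]]

# One list of per-phrase counts for every topic, however many topics there are.
counts = {
    "topic_%d" % number: [count_topic_matches(phrase, topic) for phrase in phrases]
    for number, topic in enumerate(topics, start=1)
}
print(counts)
```

With a dataframe you would build phrases from df['phrases'] and assign each list back as a column, for example df[name] = values for every name, values pair in counts.items().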

How to assign multiple categories based on a condition

Here are the categories, each with a list of words I'll be checking the rows against for a match:
fashion = ['bag','purse','pen']
general = ['knob','hanger','bottle','printing','tissues','book','tissue','holder','heart']
decor =['holder','decoration','candels','frame','paisley','bunting','decorations','party','candles','design','clock','sign','vintage','hanging','mirror','drawer','home','clusters','placements','willow','stickers','box']
kitchen = ['pantry','jam','cake','glass','bowl','napkins','kitchen','baking','jar','mug','cookie','bowl','placements','molds','coaster','placemats']
holiday = ['rabbit','ornament','christmas','trinket','party']
garden = ['lantern','hammok','garden','tree']
kids = ['children','doll','birdie','asstd','bank','soldiers','spaceboy','childs']
Here is my code: (I am checking sentences for keywords and assign the row a category accordingly. I want to allow overlapping, so one row could have more than one category)
#check if description row contains words from one of our category lists
#(the result goes into a 'category' column, as in the output shown below)
df['category'] = np.select(
    [
        (df['description'].str.contains('|'.join(fashion))),
        (df['description'].str.contains('|'.join(general))),
        (df['description'].str.contains('|'.join(decor))),
        (df['description'].str.contains('|'.join(kitchen))),
        (df['description'].str.contains('|'.join(holiday))),
        (df['description'].str.contains('|'.join(garden))),
        (df['description'].str.contains('|'.join(kids)))
    ],
    ['fashion','general','decor','kitchen','holiday','garden','kids'],
    'Other'
)
Current Output:
index description category
0 children wine glass kids
1 candles decor
2 christmas tree holiday
3 bottle general
4 soldiers kids
5 bag fashion
Expected Output:
index description category
0 children wine glass kids, kitchen
1 candles decor
2 christmas tree holiday, garden
3 bottle general
4 soldiers kids
5 bag fashion
Here's an option using apply():
df = pd.DataFrame({'description': ['children wine glass',
                                   'candles',
                                   'christmas tree',
                                   'bottle',
                                   'soldiers',
                                   'bag']})
def categorize(desc):
    lst = []
    for w in desc.split(' '):
        if w in fashion:
            lst.append('fashion')
        if w in general:
            lst.append('general')
        if w in decor:
            lst.append('decor')
        if w in kitchen:
            lst.append('kitchen')
        if w in holiday:
            lst.append('holiday')
        if w in garden:
            lst.append('garden')
        if w in kids:
            lst.append('kids')
    return ', '.join(lst)
df.apply(lambda x: categorize(x.description), axis=1)
Output:
0 kids, kitchen
1 decor
2 holiday, garden
3 general
4 kids
5 fashion
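The same idea can be written in a data-driven way, so that adding a category means adding one dictionary entry rather than another if block. This is only a sketch: the word lists are trimmed copies of the ones in the question, and the 'Other' fallback is borrowed from the np.select call there.

```python
# Trimmed copies of the word lists from the question, keyed by category name.
categories = {
    "fashion": ["bag", "purse", "pen"],
    "general": ["knob", "hanger", "bottle", "holder"],
    "decor": ["holder", "candles", "party", "frame"],
    "kitchen": ["glass", "bowl", "jar", "mug"],
    "holiday": ["christmas", "ornament", "party"],
    "garden": ["lantern", "garden", "tree"],
    "kids": ["children", "doll", "soldiers"],
}

def categorize(description):
    """Return every matching category once, in order of first appearance."""
    found = []
    for word in description.split():
        for name, words in categories.items():
            if word in words and name not in found:
                found.append(name)
    return ", ".join(found) if found else "Other"

descriptions = ["children wine glass", "candles", "christmas tree",
                "bottle", "soldiers", "bag"]
labels = [categorize(d) for d in descriptions]
print(labels)
```

On a dataframe this would be applied as df['category'] = df['description'].apply(categorize). Unlike the version above, a category is listed only once even if several words of the description match it.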
Here's how I would do it.
Comments above each line provide details on what I am trying to do.
Steps:
Convert all the categories into key:value pairs, using each word in a category as the key and the category name as the value. This lets you look up a word and map it back to its category.
Split the description field into multiple columns using split(expand=True).
Look up the words of each column in the dictionary. The result will be categories and NaNs.
Join all of these back into one column, separated by ', ', to get the final result while excluding NaNs. Apply pd.unique() to remove duplicate categories.
The seven lines of code you need are:
dict_keys = ['fashion','general','decor','kitchen','holiday','garden','kids']
dict_cats = [fashion,general,decor,kitchen,holiday,garden,kids]
s_dict = {val:dict_keys[i] for i,cats in enumerate(dict_cats) for val in cats}
temp = df['description'].str.split(expand=True)
temp = temp.applymap(s_dict.get)
df['new_category'] = temp.apply(lambda x: ','.join(x[x.notnull()]), axis = 1)
df['new_category'] = df['new_category'].apply(lambda x: ', '.join(pd.unique(x.split(','))))
If you have more categories, just add it to dict_keys and dict_cats. Everything else stays the same.
The full code with comments begins here:
import pandas as pd
c = ['description','category']
d = [['children wine glass','kids'],
['candles','decor'],
['christmas tree','holiday'],
['bottle','general'],
['soldiers','kids'],
['bag','fashion']]
df = pd.DataFrame(d,columns = c)
fashion = ['bag','purse','pen']
general = ['knob','hanger','bottle','printing','tissues','book','tissue','holder','heart']
decor =['holder','decoration','candels','frame','paisley','bunting','decorations','party','candles','design','clock','sign','vintage','hanging','mirror','drawer','home','clusters','placements','willow','stickers','box']
kitchen = ['pantry','jam','cake','glass','bowl','napkins','kitchen','baking','jar','mug','cookie','bowl','placements','molds','coaster','placemats']
holiday = ['rabbit','ornament','christmas','trinket','party']
garden = ['lantern','hammok','garden','tree']
kids = ['children','doll','birdie','asstd','bank','soldiers','spaceboy','childs']
#create a list of all the lists
dict_keys = ['fashion','general','decor','kitchen','holiday','garden','kids']
dict_cats = [fashion,general,decor,kitchen,holiday,garden,kids]
#create a dictionary with words from the list as key and category as value
s_dict = {val:dict_keys[i] for i,cats in enumerate(dict_cats) for val in cats}
#create a temp dataframe with one word for each column using split
temp = df['description'].str.split(expand=True)
#match the words in each column against the dictionary
temp = temp.applymap(s_dict.get)
#Now put them back together and you have the final list
df['new_category'] = temp.apply(lambda x: ','.join(x[x.notnull()]), axis = 1)
#Remove duplicates using pd.unique()
#Note: prev line join modified to ',' from ', '
df['new_category'] = df['new_category'].apply(lambda x: ', '.join(pd.unique(x.split(','))))
print (df)
The output of this will be (I kept your category column and created a new one called new_category):
description category new_category
0 children wine glass kids kids, kitchen
1 candles decor decor
2 christmas tree holiday holiday, garden
3 bottle general general
4 soldiers kids kids
5 bag fashion fashion
The output including 'party candles holder' is :
description category new_category
0 children wine glass kids kids, kitchen
1 candles decor decor
2 christmas tree holiday holiday, garden
3 bottle general general
4 party candles holder None holiday, decor
5 soldiers kids kids
6 bag fashion fashion

Create Names column in Pandas DataFrame

I am using the Python Package names to generate some first names for QA testing.
The names package contains the function names.get_first_name(gender) which allows either the string male or female as the parameter. Currently I have the following DataFrame:
Marital Gender
0 Single Female
1 Married Male
2 Married Male
3 Single Male
4 Married Female
I have tried the following:
df.loc[df.Gender == 'Male', 'FirstName'] = names.get_first_name(gender = 'male')
df.loc[df.Gender == 'Female', 'FirstName'] = names.get_first_name(gender = 'female')
But all I get in return is just two names:
Marital Gender FirstName
0 Single Female Kathleen
1 Married Male David
2 Married Male David
3 Single Male David
4 Married Female Kathleen
Is there a way to call this function separately for each row so not all males/females have the same exact name?
You need apply:
df['Firstname']=df['Gender'].str.lower().apply(names.get_first_name)
You can use a list comprehension:
df['Firstname']= [names.get_first_name(gender) for gender in df['Gender'].str.lower()]
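The reason the original attempt repeats names is that names.get_first_name(...) is evaluated once, and that single string is then written to every selected row. The sketch below illustrates the difference without needing the names package: fake_first_name and POOL are hypothetical stand-ins, not part of any library.

```python
import random

# Hypothetical stand-in for names.get_first_name, so the sketch runs without the package.
POOL = {
    "male": ["David", "Alfred", "Stanley"],
    "female": ["Kathleen", "Debbie", "Harriett"],
}

def fake_first_name(gender):
    return random.choice(POOL[gender])

genders = ["Female", "Male", "Male", "Male", "Female"]

# What the .loc assignment does: one call per gender, the result reused for every row.
once = {gender: fake_first_name(gender) for gender in ("male", "female")}
broadcast_names = [once[g.lower()] for g in genders]

# What apply or a list comprehension does: one call per row.
per_row_names = [fake_first_name(g.lower()) for g in genders]

print(broadcast_names)
print(per_row_names)
```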
And here is a hack that reads all of the names by gender (together with their probabilities), and then randomly samples.
import names
import pandas as pd
def get_names(gender):
    if not isinstance(gender, str) or gender.lower() not in ('male', 'female'):
        raise ValueError('Invalid gender')
    # open in text mode so the split fields are str rather than bytes
    with open(names.FILES['first:{}'.format(gender.lower())], 'r') as fin:
        first_names = []
        probs = []
        for line in fin:
            first_name, prob, dummy, dummy = line.strip().split()
            first_names.append(first_name)
            probs.append(float(prob) / 100)
    return pd.DataFrame({'first_name': first_names, 'probability': probs})
def get_random_first_names(n, first_names_by_gender):
    first_names = (
        first_names_by_gender
        .sample(n, replace=True, weights='probability')
        .loc[:, 'first_name']
        .tolist()
    )
    return first_names
first_names = {gender: get_names(gender) for gender in ('Male', 'Female')}
>>> get_random_first_names(3, first_names['Male'])
['RICHARD', 'EDWARD', 'HOMER']
>>> get_random_first_names(4, first_names['Female'])
['JANICE', 'CAROLINE', 'DOROTHY', 'DIANE']
If speed matters, use map:
list(map(names.get_first_name,df.Gender))
Out[51]: ['Harriett', 'Parker', 'Alfred', 'Debbie', 'Stanley']
#df['FN']=list(map(names.get_first_name,df.Gender))

Count based on other csv file

I have a dataframe df with two columns called 'MovieName' and 'Actors'. It looks like:
MovieName Actors
lights out Maria Bello
legend Tom Hardy*Emily Browning*Christopher Eccleston*David Thewlis
Please note that different actor names are separated by '*'. I have another csv file called gender.csv which has the gender of all actors based on their first names. gender.csv looks like -
ActorName Gender
Tom male
Emily female
Christopher male
I want to add two columns in my dataframe 'female_actors' and 'male_actors' which contains the count of female and male actors in that particular movie respectively.
How do I achieve this task using both df and gender.csv in pandas?
Please note that -
If a particular name isn't present in gender.csv, don't count it in the total.
If there is just one actor in a movie, and that actor isn't present in gender.csv, then the count should be zero.
Result of above example should be -
MovieName Actors male_actors female_actors
lights out Maria Bello 0 0
legend Tom Hardy*Emily Browning*Christopher Eccleston*David Thewlis 2 1
import pandas as pd
df1 = pd.DataFrame({'MovieName': ['lights out', 'legend'], 'Actors':['Maria Bello', 'Tom Hardy*Emily Browning*Christopher Eccleston*David Thewlis']})
df2 = pd.DataFrame({'ActorName': ['Tom', 'Emily', 'Christopher'], 'Gender':['male', 'female', 'male']})
def func(actors, gender):
    # keep only the first name of each '*'-separated actor
    actors = [act.split()[0] for act in actors.split('*')]
    # count the gender.csv rows that have this gender and one of those first names
    n_gender = df2.Gender[(df2.Gender == gender) & df2.ActorName.isin(actors)].count()
    return n_gender
df1['male_actors'] = df1.Actors.apply(lambda x: func(x, 'male'))
df1['female_actors'] = df1.Actors.apply(lambda x: func(x, 'female'))
df1.to_csv('res.csv', index=False)
print(df1)
Output
Actors,MovieName,male_actors,female_actors
Maria Bello,lights out,0,0
Tom Hardy*Emily Browning*Christopher Eccleston*David Thewlis,legend,2,1
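One thing to be aware of: the approach above counts matching rows of gender.csv, so two actors who share a first name are counted once. If each actor should be counted, a per-actor dictionary lookup is one option. The sketch below hard-codes the lookup for illustration; in practice it could be built from gender.csv, for example with dict(zip(df2.ActorName, df2.Gender)). The name count_genders is my own.

```python
from collections import Counter

# First name -> gender, hard-coded here instead of being read from gender.csv.
gender_by_first_name = {"Tom": "male", "Emily": "female", "Christopher": "male"}

def count_genders(actors):
    """Count actors per gender; first names missing from the lookup are ignored."""
    first_names = [name.split()[0] for name in actors.split("*") if name.strip()]
    return Counter(
        gender_by_first_name[first]
        for first in first_names
        if first in gender_by_first_name
    )

legend = count_genders("Tom Hardy*Emily Browning*Christopher Eccleston*David Thewlis")
lights_out = count_genders("Maria Bello")
print(legend["male"], legend["female"])
print(lights_out["male"], lights_out["female"])
```

A Counter returns 0 for keys it has never seen, which gives the zero counts the question asks for when no actor of a movie appears in gender.csv.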
