Pandas DataFrame columns getting passed to the new dataframe - python

I am trying to create a new dataframe from the values of an existing one. The code below accepts a dataframe called dfhiddencols, which has 3 columns:
parent, childlist, formula
It then creates a new dataframe called newdf with 2 columns:
child, parent
It loops through each row of dfhiddencols looking for a particular pattern. When it finds the pattern, it adds a new row to newdf, built from the parent column value of dfhiddencols and the matched pattern string.
However, when the new record is added, 2 additional columns show up in newdf:
childlist, formula
These 2 columns are not defined in the createrow dictionary. Why are these columns being carried over into the new dataframe, and how can I avoid it?
def extracthiddencolumns(dfhiddencols):
    newdf = pd.DataFrame(columns=['child', 'parent'])
    createrow = {}
    for idx, row in dfhiddencols.iterrows():
        #if len(str(row['formula'])) > 3:
        for formula in row['formula'].split('|||'):
            if formula != '' and '??' in formula:
                formula = formula.strip('\n')
                formula = formula.strip('\t')
                for i in re.findall(r"\[\?\?([A-Za-z0-9_]+)\.([A-Za-z0-9_]+)\?\?\]", formula):
                    strconcat = i[0] + "." + i[1]
                    parent = row['parent']
                    createrow = {'child': parent, 'parent': strconcat}
                    newdf = dfhiddencols.append(createrow, ignore_index=True)
                    createrow = {}
    newdf.drop(columns=['childlist', 'formula'])
    return newdf

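I don't think a for loop is a good idea here, but within your code you can append each row to newdf (rather than to dfhiddencols) like this:
# all your code up to the line below
createrow = {'child': parent, 'parent': strconcat}
newdf = newdf.append(pd.DataFrame([createrow]), ignore_index=True)
Note that DataFrame.append was removed in pandas 2.0; on newer versions use pd.concat([newdf, pd.DataFrame([createrow])], ignore_index=True) instead.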

It's embarrassing to say, but I was appending the new record to the dataframe that was passed in (dfhiddencols) instead of to newdf, which explains the extra columns showing up in the result.
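A minimal sketch of one way to avoid the problem altogether: collect the matches as plain dictionaries and build the new dataframe once at the end, so the input dataframe is never appended to. The function name extract_hidden_columns and the sample data are made up for illustration; the regex and the child/parent mapping mirror the question's code.

```python
import re
import pandas as pd

# Same pattern as in the question: [??table.column??]
PATTERN = re.compile(r"\[\?\?([A-Za-z0-9_]+)\.([A-Za-z0-9_]+)\?\?\]")

def extract_hidden_columns(dfhiddencols):
    """Return a two-column dataframe with one row per pattern match."""
    rows = []
    for _, row in dfhiddencols.iterrows():
        for formula in str(row["formula"]).split("|||"):
            for table, column in PATTERN.findall(formula):
                # Same mapping as the question's createrow dictionary.
                rows.append({"child": row["parent"], "parent": f"{table}.{column}"})
    # Build the result once; only the two requested columns can exist.
    return pd.DataFrame(rows, columns=["child", "parent"])

# Hypothetical sample input, for illustration only.
sample = pd.DataFrame({
    "parent": ["calc_a", "calc_b"],
    "childlist": ["x", "y"],
    "formula": ["[??orders.amount??] * 2|||[??orders.tax??]", "plain text"],
})
result = extract_hidden_columns(sample)
print(result)
```

Building the frame once is also usually faster than growing it row by row inside the loop.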

Related

Counting combinations in Dataframe create new Dataframe

I have a dataframe called reactions_drugs
and I want to create a table called new_r_d that keeps track of how often I see a symptom for a given medication.
Here is the code I have, but I am running into errors such as "Unable to coerce to Series, length must be 3 given 0":
new_r_d = pd.DataFrame(columns = ['drugname', 'reaction', 'count'])
for i in range(len(reactions_drugs)):
    name = reactions_drugs.drugname[i]
    drug_rec_act = reactions_drugs.drug_rec_act[i]
    for rec in drug_rec_act:
        row = new_r_d.loc[(new_r_d['drugname'] == name) & (new_r_d['reaction'] == rec)]
        if row == []:
            # create new row
            new_r_d.append({'drugname': name, 'reaction': rec, 'count': 1})
        else:
            new_r_d.at[row,'count'] += 1
Assuming each row of your reactions column (drug_rec_act) holds a single comma-separated string wrapped in a list, you can split that string into a list of strings and then use explode() and value_counts() to get the desired result (here df is your reactions_drugs dataframe):
df['drug_rec_act'] = df['drug_rec_act'].apply(lambda x: x[0].split(','))
df_long = df.explode('drug_rec_act')
result = df_long.groupby('drugname')['drug_rec_act'].value_counts().reset_index(name='count')
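To make the explode-and-count approach above concrete, here is a small self-contained sketch. The sample rows are invented, and the data layout (one comma-separated string inside a list per row) is the same assumption the answer makes.

```python
import pandas as pd

# Hypothetical sample data in the assumed layout.
reactions_drugs = pd.DataFrame({
    "drugname": ["aspirin", "aspirin", "ibuprofen"],
    "drug_rec_act": [["nausea,headache"], ["nausea"], ["rash,nausea"]],
})

df = reactions_drugs.copy()
# Turn ["nausea,headache"] into ["nausea", "headache"].
df["drug_rec_act"] = df["drug_rec_act"].apply(lambda x: x[0].split(","))
# One row per (drug, reaction) occurrence.
df_long = df.explode("drug_rec_act")
# Count occurrences of each reaction within each drug.
new_r_d = (
    df_long.groupby("drugname")["drug_rec_act"]
    .value_counts()
    .reset_index(name="count")
    .rename(columns={"drug_rec_act": "reaction"})
)
print(new_r_d)
```

The rename at the end only matches the column names the question asked for (drugname, reaction, count).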

Using pandas, update column values based on a list of IDs and new values

I have a df with ID and Sell columns. I want to update the Sell column using a list of new Sell values (not all rows need to be updated, just some of them). In all the examples I have seen, the new value is always the same or comes from another column. In my case, the value is different for each ID.
This is what I would like:
file = 'something.csv'  # Has 300 rows
IDList = ['453164259','453106168','453163869','453164463']  # [ID]
SellList = [120,270,350,410]  # Sell values
csv = path_pattern = os.path.join(os.getcwd(), file)
df = pd.read_csv(file)
df.loc[df['Id'].isin(IDList[x]), 'Sell'] = SellList[x]  # Update the rows with the corresponding Sell value of the ID.
df.to_csv(file)
Any ideas?
Thanks in advance
Assuming 'id' is a string (as in IDList) and is not the index of your df:
IDList = ['453164259','453106168','453163869','453164463']  # [ID]
SellList = [120,270,350,410]
id_dict = {x: y for x, y in zip(IDList, SellList)}
for index, row in df.iterrows():
    if row['id'] in id_dict:
        df.loc[index, 'Sell'] = id_dict[row['id']]
If id is the index:
IDList = ['453164259','453106168','453163869','453164463']  # [ID]
SellList = [120,270,350,410]
id_dict = {x: y for x, y in zip(IDList, SellList)}
for index, row in df.iterrows():
    if index in id_dict:
        df.loc[index, 'Sell'] = id_dict[index]
I created a dictionary from IDList and SellList, then looped over the df using iterrows().
df = pd.read_csv('something.csv', dtype={'id': str})  # read ids as strings so they match IDList
IDList = ['453164259','453106168','453163869','453164463']
SellList = [120,270,350,410]
This works efficiently, especially for large files (it assumes every ID in IDList is present in the index):
df.set_index('id', inplace=True)
df.loc[IDList, 'Sell'] = SellList
df = df.reset_index()  # not mandatory, just in case you need 'id' back as a column
df.to_csv('something.csv', index=False)
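If some IDs in the list may be missing from the file, a mapping-based update is a possible alternative to the answers above. This is only a sketch: the small dataframe stands in for something.csv, and the column names id and Sell are assumed to match the answers above.

```python
import pandas as pd

# Hypothetical stand-in for something.csv; ids are kept as strings.
df = pd.DataFrame({
    "id": ["453164259", "999999999", "453106168"],
    "Sell": [100, 200, 300],
})
IDList = ["453164259", "453106168", "453163869", "453164463"]
SellList = [120, 270, 350, 410]

# Pair each id with its new Sell value.
id_to_sell = dict(zip(IDList, SellList))

# map() yields NaN for ids that are not in the dictionary;
# fillna() keeps the existing Sell value for those rows.
new_values = df["id"].map(id_to_sell)
df["Sell"] = new_values.fillna(df["Sell"]).astype(df["Sell"].dtype)
print(df)
```

Rows whose id is not in IDList keep their old value, and IDs in IDList that do not occur in the dataframe are simply ignored.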

How do I fix the For Loop to return a certain character from a DataFrame?

I have imported an Excel file into a DataFrame and iterated over a column called "Title" to pick out titles containing certain keywords. I have the list of matching titles as "match_titles". What I want to do now is write a for loop that returns the value of the column before "Title" (i.e. "Asin") for each title in "match_titles". I'm not sure why the code is not working. Any help would be appreciated.
import pandas as pd
data = pd.read_excel(r'C:\Users\bryanmccormack\Downloads\asin_list.xlsx')
df = pd.DataFrame(data, columns=['Track','Asin','Title'])
excludes = ["Chainsaw", "Diaper pail", "Leaf Blower"]
my_excludes = [set(key_word.lower().split()) for key_word in excludes]
match_titles = [e for e in df.Title if
                any(keywords.issubset(e.lower().split()) for keywords in my_excludes)]
a = []
for i in match_titles:
    a.append(df['Asin'])
print(a)
In your for loop you are appending the whole, unfiltered column df['Asin'] to your list a once for every value in match_titles. df itself is never filtered.
One solution is to add a boolean match_titles column, then filter on that column and return the Asin column:
# Make a function to perform your match analysis.
def is_match(title, excludes=("Chainsaw", "Diaper pail", "Leaf Blower")):
    my_excludes = [set(key_word.lower().split()) for key_word in excludes]
    return any(keywords.issubset(title.lower().split()) for keywords in my_excludes)
# Make a new boolean column for the matches. This applies your
# function to each value in df['Title'] and puts the output in
# the new column.
df['match_titles'] = df['Title'].apply(is_match)
# Filter the df to only matches and return the column you want.
# Because the match_titles column is boolean it can be used as
# a row mask.
result = df.loc[df['match_titles'], 'Asin']
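The filtering idea above can be tried on a few made-up rows without the Excel file. In this sketch the sample titles and ASIN values are invented; the keyword logic is the same whole-word subset test used in the question.

```python
import pandas as pd

# Hypothetical sample rows standing in for asin_list.xlsx.
df = pd.DataFrame({
    "Track": [1, 2, 3],
    "Asin": ["B001", "B002", "B003"],
    "Title": [
        "Electric Chainsaw 16 inch",
        "Baby Stroller",
        "Cordless leaf blower kit",
    ],
})

excludes = ["Chainsaw", "Diaper pail", "Leaf Blower"]
keyword_sets = [set(key_word.lower().split()) for key_word in excludes]

def is_match(title):
    """True if all words of any keyword phrase appear in the title."""
    words = set(title.lower().split())
    return any(keywords <= words for keywords in keyword_sets)

# Boolean mask, one entry per row, aligned with df's index.
mask = df["Title"].apply(is_match)

# Select Asin only for the rows where the mask is True.
matching_asins = df.loc[mask, "Asin"].tolist()
print(matching_asins)
```

Because the mask is aligned with the rows of df, each selected Asin belongs to the same row as its matching title, which is what the original loop was missing.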

Save data frame from inside for loop

I have a function that takes in a dataframe and returns a (reduced) dataframe, e.g. like this:
def transforming_data(dataframe, col_1, col_2, normalized = True):
    ''' takes in dataframe, groups col_1 according to col_2 and returns dataframe
    '''
    df = dataframe[col_1].groupby(dataframe[col_2]).value_counts(normalize = normalized).unstack(fill_value = 0)
    return df
For the following code, this gives me:
import pandas as pd
import numpy as np
np.random.seed(12)
def transforming_data(df, col_1, col_2, normalized = True):
    ''' takes in df, groups col_1 according to col_2 and returns df '''
    df = df[col_1].groupby(df[col_2]).value_counts(normalize = normalized).unstack(fill_value = 0)
    return df
numrows = 1000
dataframe = pd.DataFrame({'Numerical': np.random.randn(numrows),
                          'Category': np.random.choice(['Panda', 'Elephant', 'Anaconda'], numrows),
                          'Response 1': np.random.choice(['Yes', 'Maybe', 'No', 'Don\'t know'], numrows),
                          'Response 2': np.random.choice(['Very Much', 'Much', 'A bit', 'Not at all'], numrows)})
test = transforming_data(dataframe, 'Response 1', 'Category')
print(test)
# Output
# Response 1 Don't know Maybe No Yes
# Category
# Anaconda 0.275229 0.232416 0.217125 0.275229
# Elephant 0.220588 0.270588 0.255882 0.252941
# Panda 0.258258 0.222222 0.273273 0.246246
So far, so good.
Now I want to use the function transforming_data inside a for loop for every column in dataframe (as I have lots of columns, not just two) and save each resulting dataframe to a new dataframe, e.g. test_response_1 and test_response_2 for this example.
Can someone point me in the right direction, i.e. how to implement the loop correctly?
So far, I am using something like this, but I cannot figure out how to save the data frame:
for column in dataframe.columns.tolist():
    temp_df = transforming_data(dataframe, column, 'Category')
    # here, I need to save temp_df outside of the loop but don't know how to
Thanks a lot for pointers and help. (Note: the most similar question I found does not talk about actually saving the data frame, so it doesn't help me with this.)
If you want to save (in memory) all of the temp_df's from your loop, you can append them to a list that you can then index afterwards:
temp_dfs = []
for column in dataframe.columns.tolist():  # you don't actually need the tolist() method here
    temp_df = transforming_data(dataframe, column, 'Category')
    temp_dfs.append(temp_df)
If you would rather access these temp_df's by the column name that was used to transform them, you could store each one in a dictionary, using the column as the key:
temp_dfs = {}
for column in dataframe.columns.tolist():
    temp_df = transforming_data(dataframe, column, 'Category')
    temp_dfs[column] = temp_df
If by "save" you meant "write to disk", then you can use one of the many to_<file_format>() methods that pandas provides:
for column in dataframe.columns.tolist():
    temp_df = transforming_data(dataframe, column, 'Category')
    temp_df.to_csv('temp_df{}.csv'.format(column))
See the to_csv() docs for the available options.
The simplest solution would be to save the resulting dataframes in a list. Assuming that all columns you want to loop over have the text Response in their column name:
result_dframes = []
for col_name in dataframe.filter(like='Response').columns:
    result_dframe = transforming_data(dataframe, col_name, 'Category')
    result_dframes.append(result_dframe)
Alternatively, you can obtain exactly the same result with a list comprehension instead of a for loop:
result_dframes = [
    transforming_data(dataframe, col_name, 'Category')
    for col_name in dataframe.filter(like='Response')
]
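The dictionary and filter ideas from the two answers can be combined in a dictionary comprehension, so each result is retrievable by its column name. The sketch below is self-contained; it uses a smaller, differently seeded random sample than the question, so the numbers will not match the output shown earlier.

```python
import numpy as np
import pandas as pd

def transforming_data(df, col_1, col_2, normalized=True):
    """Share (or count) of each col_1 value within each col_2 group."""
    counts = df[col_1].groupby(df[col_2]).value_counts(normalize=normalized)
    return counts.unstack(fill_value=0)

# Small random sample in the same shape as the question's data.
rng = np.random.default_rng(12)
numrows = 200
dataframe = pd.DataFrame({
    "Numerical": rng.standard_normal(numrows),
    "Category": rng.choice(["Panda", "Elephant", "Anaconda"], numrows),
    "Response 1": rng.choice(["Yes", "Maybe", "No", "Don't know"], numrows),
    "Response 2": rng.choice(["Very Much", "Much", "A bit", "Not at all"], numrows),
})

# One result per Response column, keyed by the column name.
results = {
    col: transforming_data(dataframe, col, "Category")
    for col in dataframe.filter(like="Response").columns
}

test_response_1 = results["Response 1"]
test_response_2 = results["Response 2"]
print(test_response_1)
```

Keeping the results in one dictionary avoids creating a separate variable per column when there are many columns.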

concat the strings of one column based on condition on other column

I have a data frame in which I want to remove duplicates in the column named "sample" and add the string information from the gene and status columns to new columns, as shown in the attached pics.
Thank you so much in advance.
Below is the modified version of the data frame, where the gene entries in the rows are replaced by actual gene names.
Here, df is your Pandas DataFrame.
def new_1(g):
    return ','.join(g.gene)
def new_2(g):
    return ','.join(g.gene + '-' + g.status)
new_1_data = df.groupby("sample").apply(new_1).to_frame(name="new_1")
new_2_data = df.groupby("sample").apply(new_2).to_frame(name="new_2")
new_data = pd.merge(new_1_data, new_2_data, on="sample")
new_df = pd.merge(df, new_data, on="sample").drop_duplicates("sample")
If you want a clean 0, 1, 2, ... index instead of the leftover row labels of the original dataframe, then add
new_df = new_df.reset_index(drop=True)
Lastly, as you did not specify which of the duplicate rows to retain, I simply use the default behavior of Pandas and drop all but the first occurrence.
Edit
I converted your example to the following CSV file (delimited by ',') which I will call "data.csv".
sample,gene,status
ppar,p53,gain
ppar,gata,gain
ppar,nb,loss
srty,nf1,gain
srty,cat,gain
srty,cd23,gain
tygd,brac1,loss
tygd,brac2,gain
tygd,ras,loss
I load this data as
# Default delimiter is ','. Pass `sep` argument to specify delimiter.
df = pd.read_csv("data.csv")
Running the code above and printing the dataframe produces the output
sample gene status new_1 new_2
0 ppar p53 gain p53,gata,nb p53-gain,gata-gain,nb-loss
3 srty nf1 gain nf1,cat,cd23 nf1-gain,cat-gain,cd23-gain
6 tygd brac1 loss brac1,brac2,ras brac1-loss,brac2-gain,ras-loss
This is exactly the expected output given in your example.
Note that the left-most column of numbers (0, 3, 6) is a remnant of the original dataframe's index, carried through the merges. When you write this dataframe to a file, you can exclude it by passing index=False to df.to_csv(...).
Edit 2
I checked the CSV file you emailed me. You have a space after the word "gene" in the header of your CSV file.
Change the first line of your CSV file from
sample,gene ,status
to
sample,gene,status
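Also, there are spaces in your entries. If you wish to remove them, you can
# Strip spaces from string entries and leave other values untouched.
# (On pandas 2.1+, DataFrame.map replaces the deprecated applymap.)
df = df.applymap(lambda x: x.strip() if isinstance(x, str) else x)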
This might not be the most efficient solution, but it should get you there:
samples = []
genes = []
statuses = []
for s in df["sample"].unique():
    # grab unique samples (in order of first appearance)
    samples.append(s)
    # get the genes for each sample and concatenate them
    g = df["gene"][df["sample"] == s].str.cat(sep=",")
    genes.append(g)
    # loop through the genes for the sample and get the statuses
    pairs = []
    for gene in g.split(","):
        gene_status = df["status"][(df["sample"] == s) & (df["gene"] == gene)].iloc[0]
        pairs.append(gene + "-" + gene_status)
    # join with commas so there is no trailing separator
    statuses.append(",".join(pairs))
# create new df
new_df = pd.DataFrame({'sample': samples,
                       'new': genes,
                       'new1': statuses})
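Both answers above can also be expressed as a single groupby with named aggregation. This is a sketch under the same assumptions as the first answer: the data is the "data.csv" sample shown earlier, and the output column names new_1 and new_2 follow that answer. The helper column gene_status is introduced here only for illustration.

```python
import io
import pandas as pd

# The same sample data as data.csv above, inlined so the sketch is self-contained.
csv_text = """sample,gene,status
ppar,p53,gain
ppar,gata,gain
ppar,nb,loss
srty,nf1,gain
srty,cat,gain
srty,cd23,gain
tygd,brac1,loss
tygd,brac2,gain
tygd,ras,loss
"""
df = pd.read_csv(io.StringIO(csv_text))

# Helper column (hypothetical name) holding "gene-status" per row.
df["gene_status"] = df["gene"] + "-" + df["status"]

# One row per sample; join the strings of each group with commas.
summary = (
    df.groupby("sample", sort=False)
    .agg(new_1=("gene", ",".join), new_2=("gene_status", ",".join))
    .reset_index()
)
print(summary)
```

Unlike the merge-based version, this returns only sample, new_1 and new_2; merge the result back onto the deduplicated frame if the first gene and status of each sample are also needed.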
