I'm in the initial stages of doing some 'machine learning'.
I'm trying to create a new data frame and one of the columns doesn't appear to be recognised..?
I've loaded an Excel file with 2 columns (removed the index). All fine.
Code:
df = pd.read_excel('scores.xlsx',index=False)
df=df.rename(columns=dict(zip(df.columns,['Date','Amount'])))
df.index=df['Date']
df=df[['Amount']]
#creating dataframe
data = df.sort_index(ascending=True, axis=0)
new_data = pd.DataFrame(index=range(0,len(df)),columns=['Date','Amount'])
for i in range(0,len(data)):
new_data['Date'][i] = data['Date'][i]
new_data['Amount'][i] = data['Amount'][i]
The error:
KeyError: 'Date'
Not really sure what's the problem here.
Any help greatly appreciated
I think in line 4 you reduce your dataframe to just one column "Amount"
To add to #Grzegorz Skibinski's answer, the problem is after line 4, there is no longer a 'Date' column. The Date column was assigned to the index and removed, and while the index has a name "Date", you can't use 'Date' as a key to get the index - you have to use data.index[i] instead of data['Date'][i].
It seems that you have an error in the formatting of your Date column.
To check that you don't have an error on the name of the columns you can print the columns names:
import pandas as pd
# create data
data_dict = {}
data_dict['Fruit '] = ['Apple', 'Orange']
data_dict['Price'] = [1.5, 3.24]
# create dataframe from dict
df = pd.DataFrame.from_dict(data_dict)
# Print columns names
print(df.columns.values)
# Print "Fruit " column
print(df['Fruit '])
This code outputs:
['Fruit ' 'Price']
0 Apple
1 Orange
Name: Fruit , dtype: object
We clearly see that the "Fruit " column as a trailing space. This is an easy mistake to do, especially when using excel.
If you try to call "Fruit" instead of "Fruit " you obtain the error you have:
KeyError: 'Fruit'
Related
My excel spread sheet currently looks like this after inserting the new column "Expense" by using the code:
import pandas as pd
df = pd.read_csv(r"C:\Users\Mihir Patel\Project\Excel & CSV Stuff\June '20 CSVData.csv")
df_Expense = df.insert(2, "Expense", " ")
df.to_excel(r"C:\Users\Mihir Patel\Project\Excel & CSV Stuff\June '20 CSVData.xlsx", index=None, header=True)
So because the Description column contains the word "DRAKES" I can categories that expense as "Personal" which should appear in the Expense column next to it.
Similarly the next one down contains "Optus" is categorized as a mobile related expense so the word "Phone" should appear in the Expense column.
I have tried searching on Google and YouTube but I just can't seem to find an example for something like this.
Thanks for your help.
You can define a function which has all these rules and simply apply it. For ex.
def rules(x):
if "DRAKES" in x.description:
return "Personal"
if "OPUS" in x.description:
return "Mobile"
df["Expense"] = df.apply(lambda x: rules(x), axis=1)
I have solved my problem by using a while loop. I tried to use the method in quest's answer but I most likely didn't use it properly and kept getting an error. So I used a while loop to search through each individual cell in the "Description" column and categories it in the same row on the "Expense" column.
My solution using a while loop:
import pandas as pd
df = pd.read_csv("C:\\Users\\Mihir Patel\\PycharmProjects\\pythonProject\\June '20 CSVData.csv")
df.insert(2, "Expenses", "")
description = "Description"
expense = "Expenses"
transfer = "Transfer"
i = -1 #Because I wanted python to start searching from index 0
while i < 296: #296 is the row where my data ends
i = i + 1
if "Drakes".upper() in df.loc[i, description]:
df.loc[i, expense] = "Personal"
if "Optus".upper() in df.loc[i, description]:
df.loc[i, expense] = "Phone"
df.sort_values(by=["Expenses"], inplace=True)
df.to_excel("C:\\Users\\Mihir Patel\\PycharmProjects\\pythonProject\\June '20 CSVData.xlsx", index=False)
I have a function that takes in a dataframe and returns a (reduced) dataframe, e.g. like this:
def transforming_data(dataframe, col_1, col_2, normalized = True):
''' takes in dataframe, groups col_1 according to col_2 and returns dataframe
'''
df = dataframe[col_1].groupby(dataframe[col_2]).value_counts(normalize = normalized).unstack(fill_value = 0)
return dataframe
For the following code, this gives me:
import pandas as pd
import numpy as np
np.random.seed(12)
def transforming_data(df, col_1, col_2, normalized = True):
''' takes in df, groups col_1 according to col_2 and returns df '''
df = dataframe[col_1].groupby(dataframe[col_2]).value_counts(normalize = normalized).unstack(fill_value = 0)
return df
numrows = 1000
dataframe = pd.DataFrame({'Numerical': np.random.randn(numrows),
'Category': np.random.choice(['Panda', 'Elephant', 'Anaconda'], numrows),
'Response 1': np.random.choice(['Yes', 'Maybe', 'No', 'Don\'t know'], numrows),
'Response 2': np.random.choice(['Very Much', 'Much', 'A bit', 'Not at all'], numrows)})
test = transforming_data(dataframe, 'Response 1', 'Category')
print(test)
# Output
# Response 1 Don't know Maybe No Yes
# Category
# Anaconda 0.275229 0.232416 0.217125 0.275229
# Elephant 0.220588 0.270588 0.255882 0.252941
# Panda 0.258258 0.222222 0.273273 0.246246
So far, so good.
Now I want to use the function transforming_data inside a for loop for every column in dataframe (as I have lots of columns, not just two) and save the resulting dataframe to a new dataframe, e.g. test_response_1 and test_response_2 for this example.
Can someone point me in the right direction - i.e. how to implement the loop correctly?
So far, I am using something like this - but cannot figure out how to save the data frame
for column in dataframe.columns.tolist():
temp_df = transforming_data(dataframe, column, 'Category')
# here, I need to save tmp_df outside of the loop but don't know how to
Thanks a lot for pointers and help. (Note: the most similar question I found does not talk about actually saving the data frame, so it doesn't help me with this.
If you want to save (in memory) all of the temp_df's from your loop, you can append them to a list that you can then index afterwards:
temp_dfs = []
for column in dataframe.columns.tolist(): #you don't actually need the tolist() method here
temp_df = transforming_data(dataframe, column, 'Category')
temp_dfs.append(temp_df)
If you rather be able to access these temp_df's by the column name that was used to transform them, then you could assign each to a dictionary, using the column as the key:
temp_dfs = {}
for column in dataframe.columns.tolist():
temp_df = transforming_data(dataframe, column, 'Category')
temp_dfs[column] = temp_df
If by "save" you meant "write to disk", then you can use one of the many to_<file_format>() methods that pandas provides:
temp_dfs = {}
for column in dataframe.columns.tolist():
temp_df = transforming_data(dataframe, column, 'Category')
temp_df.to_csv('temp_df{}.csv'.format(column))
Here's the to_csv() docs.
The most simple solution would be to save the result dataframes into a list. Assuming that all columns that you want to loop over have the text Response in their column name:
result_dframes = []
for col_name in dataframe.filter(like='Response').columns:
result_dframe = transforming_data(dataframe, col_name, 'Category')
result_dframes.append(result_dframe)
Alternatively you can also obtain the exact same result with a list comprehension instead of a for-loop:
result_dframes = [
transforming_data(dataframe, col_name, 'Category')
for col_name in dataframe.filter(like='Response')
]
I have a dataset, where the second column looks like this.
FileName
892e7c8382943342a29a6ae5a55f2272532d8e04.exe.asm
2d42c1b2c33a440d165683eeeec341ebf61218a1.exe.asm
1fbab6b4566a2465a8668bbfed21c0bfaa2c2eed.exe.asm
Now, I want to extract the name before ".exe.asm" from the column and append it to a new list for all the rows of my dataset. I tried the following code:
import pandas as pd
df = pd.read_csv("dataset1.csv")
exekey = []
for row in df.iterrows():
exekey.append(row[1].split('.'))
exekey
This execution gave me the following error:
AttributeError: 'Series' object has no attribute 'split'
I am not able to do it. Please help
On changing, the output was of the form Output image
Split the filename using . and access 1st element using indexing.
import pandas as pd
df = pd.DataFrame({'FileName':['892e7c8382943342a29a6ae5a55f2272532d8e04.exe.asm',
'2d42c1b2c33a440d165683eeeec341ebf61218a1.exe.asm',
'1fbab6b4566a2465a8668bbfed21c0bfaa2c2eed.exe.asm']})
exekey = [i.split(".")[0] for i in df['FileName']]
print(exekey)
Alternate way:
exekey2 = df['FileName'].apply(lambda x: x.split(".")[0]).tolist()
Output:
['892e7c8382943342a29a6ae5a55f2272532d8e04', '2d42c1b2c33a440d165683eeeec341ebf61218a1', '1fbab6b4566a2465a8668bbfed21c0bfaa2c2eed']
You can use map like this to split on . and take index 0,
df['FileName'].map(lambda f : f.split('.')[0])
# Output
0 892e7c8382943342a29a6ae5a55f2272532d8e04
1 2d42c1b2c33a440d165683eeeec341ebf61218a1
2 1fbab6b4566a2465a8668bbfed21c0bfaa2c2eed
Name: FileName, dtype: object
If you want to get a list of names you can do,
df['FileName'].map(lambda f : f.split('.')[0]).values.tolist()
# Output : ['892e7c8382943342a29a6ae5a55f2272532d8e04',
'2d42c1b2c33a440d165683eeeec341ebf61218a1',
'1fbab6b4566a2465a8668bbfed21c0bfaa2c2eed']
I have a data frame that I want to remove duplicates on column named "sample" and the add string information in gene and status columns to new column as shown in the attached pics.
Thank you so much in advance
below is the modified version of data frame.where gene in rows are replaced by actual gene names
Here, df is your Pandas DataFrame.
def new_1(g):
return ','.join(g.gene)
def new_2(g):
return ','.join(g.gene + '-' + g.status)
new_1_data = df.groupby("sample").apply(new_1).to_frame(name="new_1")
new_2_data = df.groupby("sample").apply(new_2).to_frame(name="new_2")
new_data = pd.merge(new_1_data, new_2_data, on="sample")
new_df = pd.merge(df, new_data, on="sample").drop_duplicates("sample")
If you wish to have "sample" as a column instead of an index, then add
new_df = new_df.reset_index(drop=True)
Lastly, as you did not specify which of the original rows of duplicates to retain, I simply use the default behavior of Pandas and drop all but the first occurrence.
Edit
I converted your example to the following CSV file (delimited by ',') which I will call "data.csv".
sample,gene,status
ppar,p53,gain
ppar,gata,gain
ppar,nb,loss
srty,nf1,gain
srty,cat,gain
srty,cd23,gain
tygd,brac1,loss
tygd,brac2,gain
tygd,ras,loss
I load this data as
# Default delimiter is ','. Pass `sep` argument to specify delimiter.
df = pd.read_csv("data.csv")
Running the code above and printing the dataframe produces the output
sample gene status new_1 new_2
0 ppar p53 gain p53,gata,nb p53-gain,gata-gain,nb-loss
3 srty nf1 gain nf1,cat,cd23 nf1-gain,cat-gain,cd23-gain
6 tygd brac1 loss brac1,brac2,ras brac1-loss,brac2-gain,ras-loss
This is exactly the expected output given in your example.
Note that the left-most column of numbers (0, 3, 6) are the remnants of the index of the original dataframes produced after the merges. When you write this dataframe to file you can exclude it by setting index=False for df.to_csv(...).
Edit 2
I checked the CSV file you emailed me. You have a space after the word "gene" in the header of your CSV file.
Change the first line of your CSV file from
sample,gene ,status
to
sample,gene,status
Also, there are spaces in your entries. If you wish to remove them, you can
# Strip spaces from entries. Only works for string entries
df = df.applymap(lambda x: x.strip())
Might not be the most efficient solution but this should get you there:
samples = []
genes= []
statuses = []
for s in set(df["sample"]):
#grab unique samples
samples.append(s)
#get the genes for each sample and concatenate them
g = df["gene"][df["sample"]==s].str.cat(sep=",")
genes.append(g)
#loop through the genes for the sample and get the statuses
status = ''
for gene in g.split(","):
gene_status = df["status"][(df["sample"] == s) & (df["gene"] == gene)].to_string(index=False)
status += gene
status += "-"
status += gene_status
status += ','
statuses.append(status)
#create new df
new_df = pd.DataFrame({'sample': samples,
'new': genes,
'new1': statuses})
df.columns = df.columns.str.strip() ##found a fix for leading whitespaces
arrest_only_Y= df.loc[df['ARREST'] == 'Y']
arrest_only_Y_two_col=arrest_only_Y[["ARREST",'LOCATION DESCRIPTION','CASE#']]##running fine here
arrest_only_Y_two_col.reset_index()
arrest_only_Y_two_col_groupby = arrest_only_Y_two_col.groupby('LOCATION DESCRIPTION').count() ##and here as well ## arrest_only_Y_two_col_groupby_desc=arrest_only_Y_two_col_groupby.sort_values(['ARREST'],ascending = False).head()
arrest_only_Y_two_col_groupby_desc.reset_index(drop = True)
arrest_only_Y_two_col_groupby_desc
In output LOCATION DESCRIPTION becomes as index and i cant use it as a column to run this code
locdesc_list = arrest_only_Y_two_col_groupby_desc['LOCATION
DESCRIPTION'].tolist()
I get: Key Error : 'LOCATION DESCRIPTION'
Replace your line:
arrest_only_Y_two_col_groupby_desc.reset_index(drop=True)
With:
arrest_only_Y_two_col_groupby_desc.reset_index(inplace=True)
You can just try this
df =pd.DataFrame(df,index=index,column=['A','B'])