Pandas KeyError while dropping values from a column - Python

I am a beginner in Python and I am getting an error while trying to drop values from a column in a pandas dataframe. I keep getting a KeyError after some time. Here is the code snippet:
for i in data['FilePath'].keys():
    if '.' not in data['FilePath'][i]:
        value = data['FilePath'][i]
        data = data[data['FilePath'] != value]
I keep getting a KeyError at the line "if '.' not in data['FilePath'][i]:". Please help me fix this error.

If I understand your logic correctly, then you should be able to do this without a loop. From what I can see, it looks like you want to drop rows where the FilePath column does not contain a '.' anywhere. If this is correct, then below is one way to do this:
Create sample data using a nested list:
import pandas as pd

d = [
    ['BytesAccessed', 'FilePath', 'DateTime'],
    [0, '/lib/x86_64-linux-gnu/libtinfo.so.5 832.0', '[28/Jun/2018:11:53:09]'],
    [1, './lib/x86-linux-gnu/yourtext.so.6 932.0', '[28/Jun/2018:11:53:09]'],
    [2, '/lib/x86_64-linux-gnu/mytes0', '[28/Jun/2018:11:53:09]'],
]

data = pd.DataFrame(d[1:], columns=d[0])
print(data)
   BytesAccessed                                   FilePath                DateTime
0              0  /lib/x86_64-linux-gnu/libtinfo.so.5 832.0  [28/Jun/2018:11:53:09]
1              1    ./lib/x86-linux-gnu/yourtext.so.6 932.0  [28/Jun/2018:11:53:09]
2              2               /lib/x86_64-linux-gnu/mytes0  [28/Jun/2018:11:53:09]
Filter the data to drop rows that do not contain a '.' anywhere in the FilePath column:
data_filtered = (data.set_index('FilePath')
                     .filter(like='.', axis=0)
                     .reset_index())[data.columns]
print(data_filtered)
   BytesAccessed                                   FilePath                DateTime
0              0  /lib/x86_64-linux-gnu/libtinfo.so.5 832.0  [28/Jun/2018:11:53:09]
1              1    ./lib/x86-linux-gnu/yourtext.so.6 932.0  [28/Jun/2018:11:53:09]
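As an aside, the KeyError in the original loop most likely comes from reassigning data inside the loop: the keys are taken from the original dataframe, but after the first filtering pass some of those index labels no longer exist in the filtered data. A boolean mask avoids the problem entirely; here is a minimal sketch of an equivalent approach, assuming FilePath has no missing values (otherwise pass na=False to str.contains):
# boolean-mask equivalent of the loop: keep only rows whose FilePath contains a '.'
data_filtered = data[data['FilePath'].str.contains('.', regex=False)]
print(data_filtered)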


Name/save data frames that are currently in a dictionary in a for loop (pandas)

I have a dictionary of dataframes (the key is the name of the data frame and the value is the rows/columns). Each dataframe within the dictionary has just 2 columns and varying numbers of rows. I also have a list that has all of the keys in it.
I need to use a for-loop to iteratively name each dataframe with the key and have it saved outside of the dictionary. I know I can access each data frame using the dictionary, but I don't want to do it that way. I am using Spyder, so I like to look at my tables in the Variable Explorer and I do not like printing them to the console. Additionally, I would like to modify some of the completed data frames, and I need them to be their own objects for that.
Here is my code to make the dictionary (I did this because I wanted to look at all of the categories in each column with the frequency of those values):
import pandas as pd

mydict = {
    "dummy": [1, 1, 1],
    "type": ["new", "old", "new"],
    "location": ["AB", "BC", "ON"]
}
mydf = pd.DataFrame(mydict)

colnames = mydf.columns.tolist()
mydict2 = {}
for i in colnames:
    mydict2[i] = pd.DataFrame(mydf.groupby([i, 'dummy']).size())
print(mydict2)
mydf looks like this:
dummy  type  location
1      new   AB
1      old   BC
1      new   ON
the output of print(mydict2) looks like this:
{'dummy': 0
dummy dummy
1 1 3, 'type': 0
type dummy
new 1 2
old 1 1, 'location': 0
location dummy
AB 1 1
BC 1 1
ON 1 1}
I want the final output to look like this:
Type:
Type      Dummy
new       2
old       1

Location:
Location  Dummy
AB        1
BC        1
ON        1
I am basically just trying to generate a frequency table for each column in the original table, using a loop. Any help would be much appreciated!
I believe this yields the correct output:
type_count = mydf[["type", "dummy"]].groupby(by=['type'])['dummy'].sum().reset_index()
loca_count = mydf[["location", "dummy"]].groupby(by=['location'])['dummy'].sum().reset_index()
Edit:
Dynamically, you could build all the dataframes in a loop like below (assuming that you want to do it based on the dummy column):
df_list = []
for name in colnames:
    if name != "dummy":
        df_list.append(mydf[[name, "dummy"]].groupby(by=[name])['dummy'].sum().reset_index())
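Since you specifically want each table to exist as its own variable (visible in Spyder's Variable Explorer), one possible sketch is below. Using globals() is generally discouraged, and the "_count" variable names are just an illustrative assumption, but it does create one standalone frequency table per column:
# a minimal sketch, assuming mydf and colnames from the code above
for name in colnames:
    if name != "dummy":
        freq = mydf.groupby(name)['dummy'].sum().reset_index()
        globals()[f"{name}_count"] = freq  # e.g. creates type_count and location_count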

Unable to parse using pd.json_normalize; it returns nulls with the index values

Sample of my data:
ID  target
1   {"abc":"xyz"}
2   {"abc":"adf"}
This data was a CSV output that I imported in Python as below:
data = pd.read_csv('location', converters={'target': json.loads}, header=None, doublequote=True, encoding='unicode_escape')
data = data.drop(labels=0, axis=0)
data = data.rename(columns={0: 'ID', 1: 'target'})
When I try to parse this data using
df = pd.json_normalize(data['target'])
I get an empty dataframe with only the index values:
0
1
You need to change the cells from strings to actual dicts and then your code works.
Try this:
df['target'] = df['target'].apply(json.loads)
df = pd.json_normalize(df['target'])
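For reference, here is a minimal, self-contained sketch of that idea; the dataframe is built inline from the sample values above, so the names are just placeholders for illustration:
import json
import pandas as pd

data = pd.DataFrame({'ID': [1, 2], 'target': ['{"abc":"xyz"}', '{"abc":"adf"}']})

# parse the JSON strings into dicts, then flatten them into columns
data['target'] = data['target'].apply(json.loads)
flattened = pd.json_normalize(data['target'])
print(flattened)  # a single 'abc' column with the values 'xyz' and 'adf'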

How to compare columns of two dataframes and have consequences when they match in Python Pandas

I am trying to have Python Pandas compare two dataframes with each other. In dataframe 1, I have two columns (AC-Cat and Origin). I am trying to compare the AC-Cat column with the contents of Dataframe 2. If a match is found between one of the columns of Dataframe 2 and the value of Dataframe 1 being studied, I want Pandas to copy the header of the column of Dataframe 2 in which the match is found to a new column in Dataframe 1.
DF1:
f = {'AC-Cat': pd.Series(['B737', 'A320', 'MD11']),
     'Origin': pd.Series(['AJD', 'JFK', 'LRO'])}
Flight_df = pd.DataFrame(f)
DF2:
w = {'CAT-C': pd.Series(['DC85', 'IL76', 'MD11', 'TU22', 'TU95']),
     'CAT-D': pd.Series(['A320', 'A321', 'AN12', 'B736', 'B737'])}
WCat_df = pd.DataFrame(w)
I imported pandas as pd and numpy as np and tried to define a function to compare these columns.
def get_wake_cat(AC_cat):
    try:
        Wcat = [WCat_df.columns.values[0]][WCat_df.iloc[:,1]==AC_cat].values[0]
    except:
        Wcat = np.NAN
    return Wcat
Flight_df.loc[:,'CAT'] = Flight_df.loc[:,'AC-Cat'].apply(lambda CT: get_wake_cat(CT))
However, the function does not result in the desired outputs. For example: Take the B737 AC-Cat value. I want Python Pandas to then find this value in DF2 in the column CAT-D and copy this header to the new column of DF 1. This does not happen. Can someone help me find out why my code is not giving the desired results?
Not pretty but I think I got it working. Part of the error was that the function did not have WCat_df. I also changed the indexing into two steps:
def get_wake_cat(AC_cat, WCat_df):
    try:
        d = WCat_df[WCat_df.columns.values][WCat_df.iloc[:]==AC_cat]
        Wcat = d.columns[(d==AC_cat).any()][0]
    except:
        Wcat = np.NAN
    return Wcat
Then you need to change your next line to:
Flight_df.loc[:,'CAT'] = Flight_df.loc[:,'AC-Cat'].apply(lambda CT: get_wake_cat(CT, WCat_df))
  AC-Cat Origin    CAT
0   B737    AJD  CAT-D
1   A320    JFK  CAT-D
2   MD11    LRO  CAT-C
Hope that solves the problem.
This will give you 2 new columns with the name(s) of the match(es) found:
Flight_df['CAT1'] = Flight_df['AC-Cat'].map(lambda x: 'CAT-C' if x in list(WCat_df['CAT-C']) else '')
Flight_df['CAT2'] = Flight_df['AC-Cat'].map(lambda x: 'CAT-D' if x in list(WCat_df['CAT-D']) else '')
Flight_df.loc[Flight_df['CAT1'] == '', 'CAT1'] = Flight_df['CAT2']
Flight_df.loc[Flight_df['CAT1'] == Flight_df['CAT2'], 'CAT2'] = ''
IIUC, you can do a stack and merge:
final = (Flight_df.merge(WCat_df.stack().reset_index(1, name='AC-Cat'), on='AC-Cat', how='left')
                  .rename(columns={'level_1': 'New'}))
print(final)
Or with melt:
final = Flight_df.merge(WCat_df.melt(var_name='New', value_name='AC-Cat'),
                        on='AC-Cat', how='left')
  AC-Cat Origin    New
0   B737    AJD  CAT-D
1   A320    JFK  CAT-D
2   MD11    LRO  CAT-C
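To see why the melt-based merge works, it can help to print the reshaped lookup table on its own; a small illustrative sketch using the WCat_df defined above:
# melt turns the two category columns into long format:
#   a 'New' column holding the original column name (CAT-C / CAT-D)
#   an 'AC-Cat' column holding the aircraft codes
lookup = WCat_df.melt(var_name='New', value_name='AC-Cat')
print(lookup)
# merging Flight_df against this long table then attaches the matching
# category name (or NaN if the code appears in neither column)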

Filter data through multiple columns and print rows?

Kind of a follow-up on my last question. So I have this data in a .csv file that looks like:
id,first_name,last_name,email,gender,ip_address,birthday
1,Ced,Begwell,cbegwell0#google.ca,Male,134.107.135.233,17/10/1978
2,Nataline,Cheatle,ncheatle1#msn.com,Female,189.106.181.194,26/06/1989
3,Laverna,Hamlen,lhamlen2#dot.gov,Female,52.165.62.174,24/04/1990
4,Gawen,Gillfillan,ggillfillan3#hp.com,Male,83.249.190.232,31/10/1984
5,Syd,Gilfether,sgilfether4#china.com.cn,Male,180.153.199.106,11/07/1995
What I want is that when the Python program runs, it asks the user what keywords to search for. It then takes all keywords entered (maybe they are stored in a list?), then prints out all rows that contain all keywords, no matter what column each keyword is in.
I've been playing around with csv and pandas, and have been googling for hours, but just can't seem to get it to work like I want it to. I'm still kinda new to Python 3. Please help.
Edit to show what I've got so far:
import csv

# Asks for search criteria from user
search_parts = input("Enter search criteria:\n").split(",")

# Opens csv data file
file = csv.reader(open("MOCK_DATA.csv"))

# Go over each row and print it if it contains user input.
for row in file:
    if all([x in row for x in search_parts]):
        print(row)
Works great if only searching by one keyword. But I want the choice of filtering by one or multiple keywords.
Here you go, using try and except, because if the datatype does not match your keyword it would raise an error:
import pandas as pd

def fun(data, keyword):
    ans = pd.DataFrame()
    for i in data.columns:
        try:
            ans = pd.concat((data[data[i]==keyword], ans))
        except:
            pass
    ans.drop_duplicates(inplace=True)
    return ans
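A possible way to call it, both for one keyword and for the multi-keyword AND case the question asks about (just a sketch, assuming MOCK_DATA.csv from the question has been loaded and that keywords must match cell values exactly):
data = pd.read_csv("MOCK_DATA.csv")

# single keyword:
print(fun(data, "Male"))

# several keywords with AND semantics (rows matched by every keyword):
search_parts = input("Enter search criteria:\n").split(",")
results = [fun(data, kw.strip()) for kw in search_parts]
common = set.intersection(*(set(r.index) for r in results))
print(data.loc[sorted(common)])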
Try the following code for an AND search with the keywords:
import numpy as np

def AND_search(df, list_of_keywords):
    # init a numpy array to store the index
    index_arr = np.array([])
    for keyword in list_of_keywords:
        # drop rows where the entire row is nan and get the remaining rows' indexes
        index = df[df==keyword].dropna(how='all').index.values
        # if index_arr is empty then assign to it; otherwise update it to the intersection of the two arrays
        index_arr = index if index_arr.size == 0 else np.intersect1d(index_arr, index)
    # get back the df by filtering on the index
    return df.loc[index_arr.astype(int)]
Try the following code for an OR search with the keywords:
def OR_search(df, list_of_keywords):
    index_arr = np.array([])
    for keyword in list_of_keywords:
        index = df[df==keyword].dropna(how='all').index.values
        # get all the unique indexes
        index_arr = np.unique(np.concatenate((index_arr, index), 0))
    return df.loc[index_arr.astype(int)]
OUTPUT
d = {'A': [1, 2, 3], 'B': [10, 1, 5]}
df = pd.DataFrame(data=d)
print(df)
   A   B
0  1  10
1  2   1
2  3   5
keywords = [1, 5]
AND_search(df, keywords)  # returns nothing
Out[]:
   A   B
OR_search(df, keywords)
Out[]:
   A   B
0  1  10
1  2   1
2  3   5
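As a side note, a more concise pandas-idiomatic variant of the same two searches (an alternative sketch, not part of the original answer) can be built with boolean masks on the df and keywords defined above:
# OR: keep rows where any cell matches any keyword
or_mask = df.isin(keywords).any(axis=1)
print(df[or_mask])

# AND: keep rows where every keyword appears somewhere in the row
and_mask = pd.concat([df.eq(k).any(axis=1) for k in keywords], axis=1).all(axis=1)
print(df[and_mask])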

Pandas - Getting a Key Error when the Key Exists

I'm trying to join two dataframes in Pandas.
The first frame is called Trades and has these columns:
TRADE DATE
ACCOUNT
COMPANY
COST CENTER
CURRENCY
The second frame is called Company_Mapping and has these columns:
ACTUAL_COMPANY_ID
MAPPED_COMPANY_ID
I'm trying to join them with this code:
trade_df = pd.merge(left=Trades, right=Company_Mapping, how='left',
                    left_on='COMPANY', right_on='ACTUAL_COMPANY_ID')
This returns:
KeyError: 'COMPANY'
I've double checked the spelling and COMPANY is clearly in Trades, and I have no clue what would cause this.
Any ideas?
Thanks!
Your Trades dataframe has a single column with all the intended column names mashed together into a single string. Check the code that parses your file.
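A quick way to confirm this diagnosis (a hedged suggestion, not from the original answer) is to print the parsed column names:
# if the separator was wrong you will see one long mashed-together name
# instead of ['TRADE DATE', 'ACCOUNT', 'COMPANY', 'COST CENTER', 'CURRENCY']
print(Trades.columns.tolist())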
Make sure you read your file with the right separator.
df = pd.read_csv("file.csv", sep=';')
or
df = pd.read_csv("file.csv", sep=',')
Essentially, a KeyError is raised in pandas when there is no such column name: for example, you write df.loc[df['one']==10] but the column 'one' does not exist. However, if the column does exist and you are still getting the same error, try placing a try and except statement around it; my problem was solved using try and except.
For example:
try:
    df_new = df.loc[df['one']==10]
except KeyError:
    print('No KeyError')
Just in case someone has the same problem: sometimes you need to transpose your dataframe.
import pandas as pd
df = pd.read_csv('file.csv')
# A B C
# -------
# 1 2 3
# 4 5 6
new_df = pd.DataFrame([df['A'], df['B']])
# A | 1 4
# B | 2 5
new_df['A'] # KeyError
new_df = new_df.T
# A B
# ---
# 1 2
# 4 5
new_df['A'] # works now
# A
# -
# 1
# 4
