I have seen this error here. But my problem is not that.
I am trying to extract some columns of a large dataframe:
dfx = df1[["THRSP", "SERHL2", "TARP", "ADH1C", "KRT4",
"SORD", "SERHL", 'C18orf17','UHRF1', "CEBPD",
'OLR1', 'TBC1D2', 'AXUD1',"TSC22D3",
"ADH1A", "VIPR1", "LRFN2", "ANKRD22"]]
It throws an error as follows:
KeyError: "['C18orf17', 'UHRF1', 'OLR1', 'TBC1D2', 'AXUD1'] not in index"
After removing the above columns it works fine:
dfx = df1[["THRSP", "SERHL2", "TARP", "ADH1C", "KRT4",
"SORD", "SERHL", "TSC22D3",
"ADH1A", "VIPR1", "LRFN2", "ANKRD22"]]
But I want to avoid this error by skipping any column names that are not present and selecting only the ones that overlap. Any help appreciated.
Use Index.intersection to select only the columns from the list that exist:
L = ["THRSP", "SERHL2", "TARP", "ADH1C", "KRT4",
"SORD", "SERHL", 'C18orf17','UHRF1', "CEBPD",
'OLR1', 'TBC1D2', 'AXUD1',"TSC22D3",
"ADH1A", "VIPR1", "LRFN2", "ANKRD22"]
dfx = df1[df1.columns.intersection(L, sort=False)]
Or build a boolean mask with Index.isin and pass it to DataFrame.loc, where the first : selects all rows and the mask selects the columns:
dfx = df1.loc[:, df1.columns.isin(L)]
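Both approaches can be compared on a toy frame. This is a minimal sketch: the frame, the `wanted` list and the `EXTRA` column are made up for illustration, and the list comprehension is an extra alternative (not from the answer above) for when the columns should come back in the order of the list rather than the order of the dataframe.

```python
import pandas as pd

# Toy frame standing in for df1; "UHRF1" is deliberately missing
df1 = pd.DataFrame({"THRSP": [1, 2], "KRT4": [3, 4], "EXTRA": [5, 6]})
wanted = ["KRT4", "UHRF1", "THRSP"]

# Keep only the requested columns that actually exist
by_intersection = df1[df1.columns.intersection(wanted, sort=False)]

# Boolean mask over the columns; result follows the dataframe's column order
by_mask = df1.loc[:, df1.columns.isin(wanted)]

# Plain list comprehension; result follows the order of `wanted`
by_list_order = df1[[c for c in wanted if c in df1.columns]]

print(by_mask.columns.tolist())
print(by_list_order.columns.tolist())
```

None of the three raises a KeyError for the missing name.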
I have a JSON where some columns are sometimes not present in the structure. I'm trying to put a condition but it's giving an error.
My code is:
v_id_row = df.schema.simpleString().find('id:')
df1 = df.select('name','age','city','email',when(v_id_row > 0,'id').otherwise(lit("")))
I am getting the following error:
TypeError: condition should be a Column
How can I do this validation?
Could you try:
df1 = df.select(col('name'), col('age'), col('city'), col('email'), when(lit(v_id_row) > 0, col('id')).otherwise(lit("")))
That works for me.
You can do it like this:
v_id_row = df.schema.simpleString().find('id:')
col_list = ['name','age','city','email']
# find() returns -1 when 'id:' does not occur in the schema string
if v_id_row > 0:
    col_list.append("id")
df1 = df.select(col_list)
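A variation on the same idea checks the column names directly instead of searching the schema string, which avoids accidental matches such as a column called 'valid'. This is only a sketch: `existing_columns` is a hypothetical helper, the `available` list stands in for `df.columns` so the snippet runs without Spark, and it only looks at top-level columns.

```python
def existing_columns(wanted, available):
    """Return the names in `wanted` that also appear in `available`, keeping `wanted` order."""
    available_set = set(available)
    return [name for name in wanted if name in available_set]

# Stand-in for df.columns on a DataFrame whose JSON had no 'id' field
available = ["name", "age", "city", "email"]
wanted = ["name", "age", "city", "email", "id"]

selected = existing_columns(wanted, available)
print(selected)

# With a real Spark DataFrame this would become:
# df1 = df.select(*existing_columns(wanted, df.columns))
```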
I have a csv that looks like this:
screen_name,tweet,following,followers,is_retweet,bot
narutouz16,Grad school is lonely.,59,20,0,0
narutouz16,RT #GetMadz: Sound design in this game is 10/10 game freak lied. ,59,20,1,0
narutouz16,#hbthen3rd I know I don't.,59,20,0,0
narutouz16,"#TonyKelly95 I'm still not satisfied in the ending, even though its longer.",59,20,0,0
narutouz16,I'm currently in second place in my leaderboards in duolongo.,59,20,0,0
I am able to read this into a dataframe using the following:
df = pd.read_csv("file.csv")
That works great. I get the following dimensions when I print(df.shape)
(1223726, 6)
I have a list of usernames, like below:
bad_names = ['BELOZEROVNIKIT', 'ALTMANBELINDA', '666STEVEROGERS', 'ALVA_MC_GHEE', 'CALIFRONIAREP', 'BECCYWILL', 'BOGDANOVAO2', 'ADELE_BROCK', 'ANN1EMCCONNELL', 'ARONHOLDEN8', 'BISHOLORINE', 'BLACKTIVISTSUS', 'ANGELITHSS', 'ANWARJAMIL22', 'BREMENBOTE', 'BEN_SAR_GENT', 'ASSUNCAOWALLAS', 'AHMADRADJAB', 'AN_N_GASTON', 'BLACK_ELEVATION', 'BERT_HENLEY', 'BLACKERTHEBERR5', 'ARTHCLAUDIA', 'ALBERTA_HAYNESS', 'ADRIANAMFTTT']
What I want to do is loop through the dataframe, and if the username is in this list at all, to remove those rows from df and add them to a new df called bad_names_df.
Pseudocode would look like:
for each row in df:
    if row.username in bad_names:
        bad_names_df.append(row)
        df.remove(row)
    else:
        continue
My attempt:
for row, col in df.iterrows():
    if row['username'] in bad_user_names:
        new_df.append(row)
    else:
        continue
How is it possible to (efficiently) loop through df, with over 1.2M rows, and if the username is in the bad_names list, remove that row and add that row to a bad_names_df? I have not found any other SO posts that address this issue.
You can also create a mask using isin:
mask = df["screen_name"].isin(bad_names)
print (df[mask]) #df of bad names
print (df[~mask]) #df of good names
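To actually split the data into the two frames the question asks for, the mask can be reused for both halves. A minimal sketch on made-up rows: the upper-casing step is an assumption, added because the sample screen names are lower case while the list of bad names is upper case. Drop it if the match should be exact.

```python
import pandas as pd

# Small stand-in for the 1.2M-row frame
df = pd.DataFrame({
    "screen_name": ["narutouz16", "BECCYWILL", "adele_brock"],
    "bot": [0, 1, 1],
})
bad_names = ["BECCYWILL", "ADELE_BROCK"]

# Assumption: compare case-insensitively by upper-casing the screen names
mask = df["screen_name"].str.upper().isin(bad_names)

bad_names_df = df[mask].copy()  # rows whose user is in the list
df = df[~mask].copy()           # everything else

print(bad_names_df)
print(df)
```

This is vectorised, so no explicit loop over the rows is needed.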
You can apply a lambda then filter as follows:
df['keep'] = df['screen_name'].apply(lambda x: x not in bad_names)
df = df[df['keep']]
I'm in the initial stages of doing some 'machine learning'.
I'm trying to create a new data frame and one of the columns doesn't appear to be recognised..?
I've loaded an Excel file with 2 columns (removed the index). All fine.
Code:
df = pd.read_excel('scores.xlsx',index=False)
df=df.rename(columns=dict(zip(df.columns,['Date','Amount'])))
df.index=df['Date']
df=df[['Amount']]
#creating dataframe
data = df.sort_index(ascending=True, axis=0)
new_data = pd.DataFrame(index=range(0,len(df)),columns=['Date','Amount'])
for i in range(0,len(data)):
    new_data['Date'][i] = data['Date'][i]
    new_data['Amount'][i] = data['Amount'][i]
The error:
KeyError: 'Date'
Not really sure what's the problem here.
Any help greatly appreciated
I think in line 4 you reduce your dataframe to just one column, "Amount".
To add to @Grzegorz Skibinski's answer: the problem is that after line 4 there is no longer a 'Date' column. The Date column was assigned to the index and then dropped, and while the index is named "Date", you can't use 'Date' as a key to get the index. Use data.index[i] instead of data['Date'][i].
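The effect can be reproduced on a small frame. This is a sketch with made-up dates and amounts, not the contents of scores.xlsx. It also shows reset_index as one way to get 'Date' back as a regular column without the loop.

```python
import pandas as pd

df = pd.DataFrame({"Date": ["2020-01-02", "2020-01-01"], "Amount": [10, 20]})

# Same steps as in the question: Date becomes the index, then only Amount is kept
df.index = df["Date"]
df = df[["Amount"]]

print("Date" in df.columns)  # 'Date' is no longer a column
print(df.index.name)         # but the index still carries the name

# Sort by the index and turn it back into an ordinary column
new_data = df.sort_index().reset_index()
print(new_data)
```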
It seems that you have an error in the formatting of your Date column.
To check that there is no mistake in the column names, you can print them:
import pandas as pd
# create data
data_dict = {}
data_dict['Fruit '] = ['Apple', 'Orange']
data_dict['Price'] = [1.5, 3.24]
# create dataframe from dict
df = pd.DataFrame.from_dict(data_dict)
# Print columns names
print(df.columns.values)
# Print "Fruit " column
print(df['Fruit '])
This code outputs:
['Fruit ' 'Price']
0 Apple
1 Orange
Name: Fruit , dtype: object
We can clearly see that the "Fruit " column has a trailing space. This is an easy mistake to make, especially when using Excel.
If you try to access "Fruit" instead of "Fruit ", you get the same kind of error you have:
KeyError: 'Fruit'
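If stray whitespace turns out to be the cause, one possible fix is to strip it from all column names right after loading. A minimal sketch reusing the fruit example:

```python
import pandas as pd

df = pd.DataFrame({"Fruit ": ["Apple", "Orange"], "Price": [1.5, 3.24]})

# Remove leading and trailing whitespace from every column name
df.columns = df.columns.str.strip()

print(df.columns.tolist())
print(df["Fruit"])  # no KeyError now
```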
I have a pandas dataframe, sectors, in which every value is a string, and every field except sector_id contains null values.
sector_id    sector_code     sector_percent    sector
------------------------------------------------------------
UB1274       230;455;621     20;30;50          some_sector1
AB12312      234;786;3049    45;45;10          some_sector2
WEU234I      2344;9813       70;30             some_sector3
U2J3         3498            10                some_sector4
ALK345       ;;1289;         25;50;10;5        some_sector5
YAB45        2498;456        80                some_sector6
I'm basically trying to explode each row into multiple rows. With some help from the Stack Overflow community (split-cell-into-multiple-rows-in-pandas-dataframe), this is how I have been trying to do it:
from itertools import chain
def chainer(s):
    return list(chain.from_iterable(s.str.split(';')))
sectors['sector_code'].fillna(value='0', inplace=True)
sectors['sector'].fillna(value='unknown', inplace=True)
sectors['sector_percent'].fillna(value='100', inplace=True)
len_of_split = sectors['sector_code'].str.split(';').map(len) if isinstance(sectors['sector_code'], str) else 0
pd.DataFrame({
'sector_id': np.repeat(sectors['sector_id'], len_of_split),
'sector_code': chainer(sectors['sector_code']),
'sector': np.repeat(sectors['sector'], len_of_split),
'sector_percent': chainer(sectors['sector_percent'])
})
but as there are also NULL values in all the columns except sector_id, I'm getting this error:
ValueError: arrays must all be same length
Here's sample code for creating the dummy dataframe sectors shown above:
sectors = pandas.DataFrame({'sector_id':['UB1274','AB12312','WEU234I','U2J3','ALK345','YAB45'], 'sector_code':['230;455;621','234;786;3049','2344;9813','3498',';;1289;','2498;456'], 'sector_percent':['20;30;50','45;45;10','70;30','10','25;50;10;5','80'], 'sector':['some_sector1','some_sector2','some_sector3','some_sector4','some_sector5','some_sector6']})
How do I handle this? Any help is appreciated. Thanks.
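One possible way to approach this, sketched under assumptions: nulls are filled with the same defaults as in the question ('0', '100', 'unknown'), and when sector_code and sector_percent split into different numbers of parts, the shorter one is padded with an empty string. The padding rule is an assumption, not something the question specifies, and the three-row frame is a cut-down version of the sample data with nulls added.

```python
from itertools import zip_longest

import pandas as pd

sectors = pd.DataFrame({
    "sector_id": ["UB1274", "YAB45", "U2J3"],
    "sector_code": ["230;455;621", "2498;456", None],
    "sector_percent": ["20;30;50", "80", "10"],
    "sector": ["some_sector1", "some_sector6", None],
})

# Fill nulls per column so every cell is a string that can be split
filled = sectors.fillna(
    {"sector_code": "0", "sector_percent": "100", "sector": "unknown"}
)

rows = []
for rec in filled.itertuples(index=False):
    codes = rec.sector_code.split(";")
    percents = rec.sector_percent.split(";")
    # Assumption: pad the shorter list with "" so both line up row by row
    for code, percent in zip_longest(codes, percents, fillvalue=""):
        rows.append({
            "sector_id": rec.sector_id,
            "sector_code": code,
            "sector_percent": percent,
            "sector": rec.sector,
        })

exploded = pd.DataFrame(rows)
print(exploded)
```

Because each output row is built as one unit, the columns cannot end up with different lengths, which is what triggers the ValueError in the original attempt.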
df.columns = df.columns.str.strip() ##found a fix for leading whitespaces
arrest_only_Y= df.loc[df['ARREST'] == 'Y']
arrest_only_Y_two_col=arrest_only_Y[["ARREST",'LOCATION DESCRIPTION','CASE#']]##running fine here
arrest_only_Y_two_col.reset_index()
arrest_only_Y_two_col_groupby = arrest_only_Y_two_col.groupby('LOCATION DESCRIPTION').count() ##and here as well
arrest_only_Y_two_col_groupby_desc=arrest_only_Y_two_col_groupby.sort_values(['ARREST'],ascending = False).head()
arrest_only_Y_two_col_groupby_desc.reset_index(drop = True)
arrest_only_Y_two_col_groupby_desc
In the output, LOCATION DESCRIPTION becomes the index, so I can't use it as a column to run this code:
locdesc_list = arrest_only_Y_two_col_groupby_desc['LOCATION DESCRIPTION'].tolist()
I get: KeyError: 'LOCATION DESCRIPTION'
Replace your line:
arrest_only_Y_two_col_groupby_desc.reset_index(drop=True)
With:
arrest_only_Y_two_col_groupby_desc.reset_index(inplace=True)
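The difference is that drop=True throws the index away (and without inplace=True or an assignment the result is discarded anyway), while a plain reset_index moves the index back into a column. A minimal sketch on made-up rows, using assignment instead of inplace=True; the shortened variable names are placeholders:

```python
import pandas as pd

df = pd.DataFrame({
    "ARREST": ["Y", "Y", "Y", "N"],
    "LOCATION DESCRIPTION": ["STREET", "STREET", "ALLEY", "STREET"],
    "CASE#": ["A1", "A2", "A3", "A4"],
})

arrests = df.loc[df["ARREST"] == "Y", ["ARREST", "LOCATION DESCRIPTION", "CASE#"]]

counts = (
    arrests.groupby("LOCATION DESCRIPTION")  # group key becomes the index
    .count()
    .sort_values("ARREST", ascending=False)
    .head()
    .reset_index()  # move 'LOCATION DESCRIPTION' back into a column
)

locdesc_list = counts["LOCATION DESCRIPTION"].tolist()
print(locdesc_list)
```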
You can just try this
df = pd.DataFrame(df, index=index, columns=['A','B'])