KeyError when running df through function - python

I'm trying to apply the function below to a dataframe and return only the rows that qualify, but I get a KeyError. What am I doing wrong?
import numpy as np
import pandas as pd

N = 100
np.random.seed(0)
df = pd.DataFrame({
    'X': np.random.uniform(-3, 10, N),
    'Y': np.random.uniform(-3, 10, N),
    'Z': np.random.uniform(-3, 10, N),
})
def func_sec(df):
    for i in range(len(df)):
        for k in range(i + 1, len(df) + 1):
            df_sum = df[i:k].sum()
            m = (df_sum > 2).all() & (df_sum.sum() > 10)
    return df[m]

func_sec(df)

Like others have noted, the KeyError is thrown because of df[m]. Your column names aren't booleans, they are 'X', 'Y', 'Z'. The pandas documentation has a section on boolean indexing, so I suggest you check it out.
Long story short, you can't do df[True], but you can do, for example, df[df['X'] > 10].

For a dataframe df you can select a column, e.g. 'X' in your case:
df['X']
or slice some rows
df[0:10]
If you try something invalid like df[0] or df[True], you will get a KeyError.
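To make the distinction concrete, here is a minimal sketch (re-using the df construction from the question; the conditions are illustrative, not necessarily what func_sec intends). A scalar boolean triggers the KeyError, while a boolean Series aligned to the index selects rows:
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame({
    'X': np.random.uniform(-3, 10, 100),
    'Y': np.random.uniform(-3, 10, 100),
    'Z': np.random.uniform(-3, 10, 100),
})

m_scalar = (df.sum() > 2).all()   # a single bool, so df[m_scalar] raises the KeyError
m_series = (df > 2).all(axis=1)   # a boolean Series, one value per row
subset = df[m_series]             # keeps only the rows where X, Y and Z are all > 2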

Related

Pandas Add column and fill it with complex Concatenate

I have this Excel formula in A2 : =IF(B1=1;CONCAT(D1:Z1);"Null")
All cells are strings or integers but some are empty. I filled the empty ones with "null".
I've tried to translate it with pandas, and so far I wrote this :
'''
import pandas as pd
df = pd.read_table('C:/*path*/001.txt', sep=';', header=0, dtype=str)
rowcount = 0
for row in df:
    rowcount += 1
n = rowcount
m = len(df)
df['A'] = ""
for i in range(1, n):
    if df[i-1]["B"] == 1:
        for k in range(2, m):
            if df[i][k] != "Null":
                df[i]['A'] += df[i][k]
'''
I can't find something close enough to my problem in questions, anyone can help?
I'm not sure this is really what you're expecting. If you need to fill empty cells in a dataframe with the 'null' string, you can use this:
df.fillna('null', inplace=True)
If you provide the expected output along with your input file, it may be helpful for the contributors.
Test dataframe:
df = pd.DataFrame({
    "b": [1, 0, 1],
    "c": ["dummy", "dummy", "dummy"],
    "d": ["red", "green", "blue"],
    "e": ["-a", "-b", "-c"]
})
First step: add a new column and fill with NULL.
df["concatenated"] = "NULL"
Second step: filter to where column B is 1, and then set the value of the new column to the concatenation of columns D to Z (collected here in sub_columns).
sub_columns = ["d", "e"]  # the columns to concatenate; D..Z in the real data
df["concatenated"][df["b"]==1] = df[sub_columns].sum(axis=1)
df
Output:
EDIT: I notice there is an offset in your Excel formula. Not sure if this is deliberate, but experiment with df.shift(-1) if so.
There's a lot to unpack here.
Firstly, len(df) already gives us the row count, so the counting loop is unnecessary (and note that iterating over a DataFrame with for row in df yields column labels, not rows).
Secondly, please never do chained indexing in pandas unless you absolutely have to. There are a number of reasons not to, one of them being that it's easy to make a mistake; also, assignment can fail. Here, we have a default range index, so we can use df.loc[i-1, 'B'] in place of df[i-1]["B"].
Thirdly, the dtype is str, so please use =='1' rather than ==1.
If I understand your problem correctly, the following code should help you:
In [1]: import pandas as pd
In [2]: df = pd.DataFrame({
    'B': ['1', '2', '0', '1'],
    'C': ['first', 'second', 'third', 'fourth'],
    'D': ['-1', '-2', '-3', '-4']
})
In [3]: RELEVANT_COLUMNS = ['C', 'D'] # You can also extract them in any automatic way you please
In [4]: df["A"] = ""
In [5]: df.loc[df['B'] == '1', 'A'] = df.loc[df['B'] == '1', RELEVANT_COLUMNS].sum(axis=1)
In [6]: df
Out[6]:
   B       C   D         A
0  1   first  -1   first-1
1  2  second  -2
2  0   third  -3
3  1  fourth  -4  fourth-4
We note which columns to concat (In [3]); we do not want to make the mistake of adding a column later on and accidentally including it. Here, including 'A' wouldn't hurt, because it's full of empty strings, but it's more manageable to save the columns we concat.
We then add the column with empty strings (In [4]) (if we skip this step, we'll get NaNs instead of empty strings for the records where B does not equal 1).
In [5] uses pandas' boolean indexing (through the Series-to-scalar equality operator) to limit our scope to rows where column B equals '1'; we then pull up the columns to concat and do just that, using an axis-reducing sum operation.

Generate Pivoted Pandas Dataframes by Changing 'Values' Argument

I have an empty list I want to populate with pivoted dataframes with the intention of looping over the list to generate heatmaps using seaborn.
The original dataframes look something like:
x  y  ds_ic      ele1     ele2     ele3      ele4
0  0  0.394888   18.8099  25.468   7.03E-15  0.417225
0  1  0.3990888  20.5525  23.54    0         0.331358
0  2  0.3901616  22.6762  19.5485  3.63E-11  0.448073
0  3  0.3838604  24.4072  27.781   0         0.406801
0  4  0.387536   21.6036  23.8371  0         0.263638
0  5  0.387536   23.4229  22.542   4.30E-14  0.395689
I'm using the following code to reshape the data and make it suitable for plotting:
def mapShape(dataframe_list):
    plotList = []
    for df in dataframe_list:
        df = df.pivot(index='y', columns='x', values='ds_ic')
        plotList.append(df)
    return plotList

shaped_dataframes = mapShape(simplified_dataframes)
Where simplified_dataframes is a list of dataframes that have the same shape as the original dataframe. This works fine for pivoting a single column of my choosing (i.e. whenever I manually set values).
The goal is to make a reshaped/pivoted dataframe for all columns after x-y. I thought of passing a column-header string to values of df.pivot(), resembling something like the following:
columns = ['ds_ic', 'ele1', 'ele2', 'ele3', 'ele4']

def mapShape(dataframe_list):
    plotList = []
    for df in dataframe_list:
        for c in columns:
            df = df.pivot(index='y', columns='x', values=c)
            plotList.append(df)
    return plotList

shaped_dataframes = mapShape(simplified_dataframes)
When I try this, df.pivot() throws a KeyError for 'y'. I tried substituting df.pivot() with df.pivot_table(), but that throws a KeyError for 'ele2'. I have a feeling there is an easier way to do this and look forward to your suggestions. Thanks in advance for the help!
The problem is that you're assigning the pivot back to df in this line: df = df.pivot(index='y', columns='x', values=c). In the next iteration, it will then try to pivot this pivot table, which doesn't have a 'y' column. If you assign the pivot to df2 and then append that to your plot list, it works like a charm :)
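A minimal sketch of that change (same loop as in the question, only storing each pivot in a new variable, here called df2):
def mapShape(dataframe_list):
    plotList = []
    for df in dataframe_list:
        for c in columns:
            # pivot into a separate variable so df itself is never overwritten
            df2 = df.pivot(index='y', columns='x', values=c)
            plotList.append(df2)
    return plotList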
On another note: I don't know what, how, or why you're plotting, but I feel there may be a more straightforward way. If you show your intended output, I could have a look.

print the unique values in every column in a pandas dataframe

I have a dataframe (df) and want to print the unique values from each column in the dataframe.
I need to substitute the variable (i) [column name] into the print statement
column_list = df.columns.values.tolist()
for column_name in column_list:
    print(df."[column_name]".unique()
Update
When I use this, I get "Unexpected EOF Parsing" with no extra details.
column_list = sorted_data.columns.values.tolist()
for column_name in column_list:
    print(sorted_data[column_name].unique()
What is the difference between your syntax, YS-L (above), and the below:
for column_name in sorted_data:
    print(column_name)
    s = sorted_data[column_name].unique()
    for i in s:
        print(str(i))
It can be written more concisely like this:
for col in df:
    print(df[col].unique())
Generally, you can access a column of the DataFrame through indexing using the [] operator (e.g. df['col']), or through attribute (e.g. df.col).
Attribute access makes the code a bit more concise when the target column name is known beforehand, but it has several caveats -- for example, it does not work when the column name is not a valid Python identifier (e.g. df.123), or when it clashes with a built-in DataFrame attribute (e.g. df.index). On the other hand, the [] notation should always work.
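A small illustrative sketch of those caveats (the column names here are chosen purely for demonstration):
import pandas as pd

df = pd.DataFrame({"col": [1, 2], "index": [3, 4]})

print(df["col"])    # works
print(df.col)       # also works: 'col' is a valid identifier and doesn't clash with anything
print(df["index"])  # works: [] always addresses the column
print(df.index)     # RangeIndex(start=0, stop=2, step=1) -- the DataFrame index, not the column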
The most upvoted answer is a loop solution, so here is a one-line solution using the pandas apply() method and a lambda function.
print(df.apply(lambda col: col.unique()))
This will get the unique values in proper format:
pd.Series({col:df[col].unique() for col in df})
If you're trying to create multiple separate dataframes as mentioned in your comments, create a dictionary of dataframes:
df_dict = dict(zip([i for i in df.columns] , [pd.DataFrame(df[i].unique(), columns=[i]) for i in df.columns]))
Then you can access any dataframe easily using the name of the column:
df_dict['column name']
We can make this even more concise:
df.describe(include='all').loc['unique', :]
Pandas describe gives a few key statistics about each column, but we can just grab the 'unique' statistic and leave it at that.
Note that this will give a unique count of NaN for numeric columns - if you want to include those columns as well, you can do something like this:
df.astype('object').describe(include='all').loc['unique', :]
I was seeking a solution to this problem as well, and the code below proved to be most helpful in my situation:
for col in df:
    print(col)
    print(df[col].unique())
    print('\n')
It gives something like below:
Fuel_Type
['Diesel' 'Petrol' 'CNG']
HP
[ 90 192 69 110 97 71 116 98 86 72 107 73]
Met_Color
[1 0]
The code below will give you a list of unique values for each field; I find it very useful when you want to take a deeper look at the data frame:
for col in list(df):
    print(col)
    print(df[col].unique())
You can also sort the unique values if you want them to be sorted:
import numpy as np

for col in list(df):
    print(col)
    print(np.sort(df[col].unique()))
# collect the unique values of the first seven columns of `card`
cu = []
i = []
for cn in card.columns[:7]:
    cu.append(card[cn].unique())
    i.append(cn)

# one column of unique values per original column
pd.DataFrame(cu, index=i).T
Simply do this:
for i in df.columns:
    print(df[i].unique())
Or, in short, it can be written as:
for val in df['column_name'].unique():
    print(val)
Even better, here's code to view all the unique values as a dataframe, column-wise and transposed:
columns = [*df.columns]
unique_values = {}
for i in columns:
    unique_values[i] = df[i].unique()
unique = pd.DataFrame(dict([(k, pd.Series(v)) for k, v in unique_values.items()]))
unique.fillna('').T
This solution constructs a dataframe of unique values with some stats and gracefully handles any unhashable column types.
Resulting dataframe columns are: col, unique_len, df_len, perc_unique, unique_values
df_len = len(df)
unique_cols_list = []

for col in df:
    try:
        unique_values = df[col].unique()
        unique_len = len(unique_values)
    except TypeError:  # not all cols are hashable
        unique_values = ""
        unique_len = -1
    perc_unique = unique_len * 100 / df_len
    unique_cols_list.append((col, unique_len, df_len, perc_unique, unique_values))

df_unique_cols = pd.DataFrame(
    unique_cols_list,
    columns=["col", "unique_len", "df_len", "perc_unique", "unique_values"]
)
df_unique_cols = df_unique_cols[df_unique_cols["unique_len"] > 0].sort_values("unique_len", ascending=False)
print(df_unique_cols)
The best way to do that:
Series.unique()
For example, students.age.unique() will output the different values that occur in the age column of the students data frame.
To get only the number of distinct values:
Series.nunique()
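A tiny sketch of both calls, using a made-up students frame:
import pandas as pd

students = pd.DataFrame({"age": [18, 19, 18, 20]})

print(students.age.unique())   # [18 19 20]
print(students.age.nunique())  # 3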

Pandas concat yields ValueError: Plan shapes are not aligned

In pandas, I am attempting to concatenate a set of dataframes and I am getting this error:
ValueError: Plan shapes are not aligned
My understanding of .concat() is that it will join where columns are the same, but for those it can't find it will fill with NA. This doesn't seem to be the case here.
Here's the concat statement:
dfs = [npo_jun_df, npo_jul_df,npo_may_df,npo_apr_df,npo_feb_df]
alpha = pd.concat(dfs)
In case it helps, I have also hit this error when I tried to concatenate two data frames (and as of the time of writing this is the only related hit I can find on google other than the source code).
I don't know whether this answer would have solved the OP's problem (since he/she didn't post enough information), but for me, this was caused when I tried to concat dataframe df1 with columns ['A', 'B', 'B', 'C'] (see the duplicate column headings?) with dataframe df2 with columns ['A', 'B']. Understandably the duplication caused pandas to throw a wobbly. Change df1 to ['A', 'B', 'C'] (i.e. drop one of the duplicate columns) and everything works fine.
I recently got this message too, and I found, like users #jason and #user3805082 above, that I had duplicate columns in several of the hundreds of dataframes I was trying to concat, each with dozens of enigmatic varnames. Manually searching for duplicates was not practical.
In case anyone else has the same problem, I wrote the following function which might help out.
def duplicated_varnames(df):
    """Return a dict of all variable names that
    are duplicated in a given dataframe."""
    repeat_dict = {}
    var_list = list(df)  # list of varnames as strings
    for varname in var_list:
        # make a list of all instances of that varname
        test_list = [v for v in var_list if v == varname]
        # if more than one instance, report duplications in repeat_dict
        if len(test_list) > 1:
            repeat_dict[varname] = len(test_list)
    return repeat_dict
Then you can iterate over that dict to report how many duplicates there are, delete the duplicated variables, or rename them in some systematic way.
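For example, a rough sketch of that workflow (assuming frames is your list of dataframes; the drop step uses columns.duplicated(), which keeps the first occurrence of each name):
import pandas as pd

cleaned = []
for frame in frames:
    dupes = duplicated_varnames(frame)
    if dupes:
        print(dupes)  # e.g. {'B': 2}
        frame = frame.loc[:, ~frame.columns.duplicated()]  # keep only the first copy of each column
    cleaned.append(frame)

alpha = pd.concat(cleaned)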
Wrote a small function to concatenate duplicated column names.
The function also cares about sorting: if the original dataframe's columns are unsorted, the output will be sorted.
import re

def concat_duplicate_columns(df):
    dupli = {}
    # populate dictionary with column names and count for duplicates
    for column in df.columns:
        dupli[column] = dupli[column] + 1 if column in dupli.keys() else 1
    # rename duplicated keys with °°° number suffix
    for key, val in dict(dupli).items():
        del dupli[key]
        if val > 1:
            for i in range(val):
                dupli[key + '°°°' + str(i)] = val
        else:
            dupli[key] = 1
    # rename columns so that we can now access ambiguous column names
    # ordering in the dict is the same as in the original table
    df.columns = dupli.keys()
    # for each duplicated column name
    for i in set(re.sub('°°°(.*)', '', j) for j in dupli.keys() if '°°°' in j):
        i = str(i)
        # for each duplicate of a column name
        for k in range(dupli[i + '°°°0'] - 1):
            # concatenate values in duplicated columns
            df[i + '°°°0'] = df[i + '°°°0'].astype(str) + df[i + '°°°' + str(k + 1)].astype(str)
            # drop duplicated columns from which we have acquired data
            df = df.drop(i + '°°°' + str(k + 1), axis=1)
    # re-sort column names for proper mapping
    df = df.reindex(sorted(df.columns), axis=1)
    # rename columns
    df.columns = sorted(set(re.sub('°°°(.*)', '', i) for i in dupli.keys()))
    return df
You need to have the same header names for all the dataframes you want to concat.
Do it, for example, with:
headername = list(df)
Data = Data.filter(headername)
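A rough sketch of that idea, assuming dfs is the list of monthly dataframes from the question and taking the first frame's columns as the reference set:
headername = list(dfs[0])                   # reference column names
dfs = [d.filter(headername) for d in dfs]   # keep only those columns, in that order, in every frame
alpha = pd.concat(dfs)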
How to reproduce above error from pandas.concat(...):
ValueError: Plan shapes are not aligned
The Python (3.6.8) code:
import pandas as pd
df = pd.DataFrame({"foo": [3] })
print(df)
df2 = pd.concat([df, df], axis="columns")
print(df2)
df3 = pd.concat([df2, df], sort=False) #ValueError: Plan shapes are not aligned
which prints:
   foo
0    3
   foo  foo
0    3    3
ValueError: Plan shapes are not aligned
Explanation of error
If the first pandas dataframe (here df2) has a duplicate column name and is sent to pd.concat, and the second dataframe isn't of the same dimension as the first, then you get this error.
Solution
Make sure there are no duplicate named columns:
df_onefoo = pd.DataFrame({"foo": [3] })
print(df_onefoo)
df_onebar = pd.DataFrame({"bar": [3] })
print(df_onebar)
df2 = pd.concat([df_onefoo, df_onebar], axis="columns")
print(df2)
df3 = pd.concat([df2, df_onefoo], sort=False)
print(df2)
prints:
   foo
0    3
   bar
0    3
   foo  bar
0    3    3
   foo  bar
0    3    3
Pandas concat could have been more helpful with that error message. It's a straight up bubbleup-implementation-itis, which is textbook python.
I was receiving the ValueError: Plan shapes are not aligned when adding dataframes together. I was trying to loop over Excel sheets and, after cleaning, concatenate them together.
The error was being raised because there were multiple None columns, which I dropped with the code below:
df = df.loc[:, df.columns.notnull()] # found on stackoverflow
The error is the result of having duplicate columns. Use the following function to remove the duplicate columns without impacting your data.
def duplicated_varnames(df):
    repeat_dict = {}
    var_list = list(df)  # list of varnames as strings
    for varname in var_list:
        test_list = [v for v in var_list if v == varname]
        if len(test_list) > 1:
            repeat_dict[varname] = len(test_list)
    if len(repeat_dict) > 0:
        df = df.loc[:, ~df.columns.duplicated()]
    return df

Check if a value exists in pandas dataframe index

I am sure there is an obvious way to do this but can't think of anything slick right now.
Basically instead of raising exception I would like to get True or False to see if a value exists in pandas df index.
import pandas as pd
df = pd.DataFrame({'test':[1,2,3,4]}, index=['a','b','c','d'])
df.loc['g'] # (should give False)
What I have working now is the following
sum(df.index == 'g')
This should do the trick
'g' in df.index
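For instance, with the df from the question:
import pandas as pd

df = pd.DataFrame({'test': [1, 2, 3, 4]}, index=['a', 'b', 'c', 'd'])

print('g' in df.index)  # False
print('a' in df.index)  # True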
Multi index works a little different from single index. Here are some methods for multi-indexed dataframe.
df = pd.DataFrame({'col1': ['a', 'b','c', 'd'], 'col2': ['X','X','Y', 'Y'], 'col3': [1, 2, 3, 4]}, columns=['col1', 'col2', 'col3'])
df = df.set_index(['col1', 'col2'])
in df.index works only on the first level when checking a single index value.
'a' in df.index # True
'X' in df.index # False
Check df.index.levels for other levels.
'a' in df.index.levels[0] # True
'X' in df.index.levels[1] # True
Check in df.index for an index combination tuple.
('a', 'X') in df.index # True
('a', 'Y') in df.index # False
Just for reference, as it was something I was looking for: you can test for presence within the values or the index by appending .values, e.g.
'g' in df.<your selected field>.values
'g' in df.index.values
I find that appending .values to get a simple list or ndarray out makes existence or "in" checks run more smoothly with the other Python tools. Just thought I'd toss that out there for people.
The code below does not print a boolean, but allows for dataframe subsetting by index... I understand this is likely not the most efficient way to solve the problem, but (1) I like the way this reads and (2) you can easily subset where the df1 index exists in df2:
df3 = df1[df1.index.isin(df2.index)]
or where df1 index does not exist in df2...
df3 = df1[~df1.index.isin(df2.index)]
with DataFrame: df_data
>>> df_data
  id   name  value
0  a  ampha      1
1  b   beta      2
2  c     ce      3
I tried:
>>> getattr(df_data, 'value').isin([1]).any()
True
>>> getattr(df_data, 'value').isin(['1']).any()
True
but:
>>> 1 in getattr(df_data, 'value')
True
>>> '1' in getattr(df_data, 'value')
False
So fun :D
import pandas

df = pandas.DataFrame({'g': [1]}, index=['isStop'])

#df.loc['g']

if 'g' in df.index:
    print("find g")
if 'isStop' in df.index:
    print("find a")
I like to use:
if 'value' in df.index.get_level_values(0):
    print(True)
The get_level_values method is good because it allows you to check values in the index whether your index is simple or composite (a MultiIndex).
Use 0 (zero) if you have a single index in your dataframe, or if you want to check the first level of a multi-level index. Use 1 for the second level, and so on...
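A short sketch using a multi-indexed frame like the one in the earlier answer (built here with MultiIndex.from_tuples just for illustration):
import pandas as pd

df = pd.DataFrame(
    {'col3': [1, 2, 3, 4]},
    index=pd.MultiIndex.from_tuples(
        [('a', 'X'), ('b', 'X'), ('c', 'Y'), ('d', 'Y')], names=['col1', 'col2']
    ),
)

print('a' in df.index.get_level_values(0))  # True
print('X' in df.index.get_level_values(1))  # True
print('X' in df.index.get_level_values(0))  # False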
