create a "dynamic" pandas column, calculated on the fly from other columns - python

I'm interested in creating a "dynamic" column in a pandas dataframe. The dynamic column is defined in terms of other columns, such that when one of those is modified, the dynamic column's values update as well.
df = pd.DataFrame({'source':[1,2,3]})
df['dyn_col'] = ## something magic here that means source*2
df.dyn_col
>>> 2,4,6
df.source = [4,5,6]
df.dyn_col
>>> 8,10,12
This could potentially include references to multiple columns, eg A*B
I tried extending __getitem__, but it got messy quickly, in part (I think) because the key wasn't present in the columns.
How can this be accomplished?
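One possible sketch (not from the original thread; the helper name get_col is made up): rather than overriding __getitem__, store the recipe as a function and evaluate it on access, so the value is always derived from the current column contents.

import pandas as pd

df = pd.DataFrame({'source': [1, 2, 3]})

# Recipes for "virtual" columns; each maps the frame to a computed Series.
dynamic = {'dyn_col': lambda d: d['source'] * 2}

def get_col(d, name):
    # Real columns win; otherwise fall back to the dynamic recipe.
    return d[name] if name in d.columns else dynamic[name](d)

print(get_col(df, 'dyn_col').tolist())  # [2, 4, 6]
df['source'] = [4, 5, 6]
print(get_col(df, 'dyn_col').tolist())  # [8, 10, 12]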

Related

How to add a column with values depending on existing rows with lower index in pandas?

Is there a fast way of adding a column to a data frame df with values depending on all the rows of df with smaller index? A very simple example where the new column only depends on the value of one other column would be df["new_col"] = df["old_col"].cumsum() (if df is ordered), but I have something more complicated in mind. Ideally, I'd like to write something like
df["new_col"] = df.[some function here](f),
where [some function] sets the i-th value of df["new_col"] to f(df[df.index <= df.index[i]]). (Ideally [some function] can also be applied to groupby() objects.)
At the moment I loop through rows, add a temporary column containing a dict of relevant values and then apply a function, but this is very slow, memory-inefficient, etc.
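One sketch of this prefix-dependent pattern (not from the original thread, and assuming f only needs the values of a single column) uses expanding(), which hands the function every prefix of the column:

import pandas as pd

df = pd.DataFrame({'old_col': [1, 2, 3, 4]})

# With raw=False each prefix arrives as a Series (index included);
# f must reduce it to a scalar. This reproduces cumsum() as a check.
df['new_col'] = df['old_col'].expanding().apply(lambda s: s.sum(), raw=False)

# The same pattern also works per group:
# df.groupby('key')['old_col'].expanding().apply(f, raw=False)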

MultiIndex (multilevel) column names from Dataframe rows

I have a rather messy dataframe in which I need to assign the first 3 rows as multilevel column names.
This is my dataframe and I need index 3, 4 and 5 to be my multiindex column names.
For example, 'MINERAL TOTAL' should be the level 0 until next item; 'TRATAMIENTO (ts)' should be level 1 until 'LEY Cu(%)' comes up.
What I actually need is to emulate what pandas.read_excel does when 'header' is specified with multiple rows.
Please help!
I am trying this, but no luck at all:
pd.DataFrame(data=df.iloc[3:, :].to_numpy(), columns=tuple(df.iloc[:3, :].to_numpy(dtype='str')))
You can pass a list of row indexes to the header argument and pandas will combine them into a MultiIndex.
import pandas as pd
df = pd.read_excel('ExcelFile.xlsx', header=[0,1,2])
By default, pandas reads the top row as the sole header row. You can pass a header argument to pandas.read_excel() indicating which rows are to be used as headers; this can be either an int or a list of ints. See the pandas.read_excel() documentation for more information.
As you mentioned, you are unable to use pandas.read_excel(). However, if you already have a DataFrame of the data you need, you can use pandas.MultiIndex.from_arrays(). First you would need to specify an array of the header rows, which in your case would look something like:
array = [df.iloc[0].values, df.iloc[1].values, df.iloc[2].values]
df.columns = pd.MultiIndex.from_arrays(array)
The only issue here is this includes the "NaN" values in the new MultiIndex header. To get around this, you could create some function to clean and forward fill the lists that make up the array.
Although not the prettiest, nor the most efficient, this could look something like the following (off the top of my head):
def forward_fill(iterable):
    # Wrap in a Series so pandas' ffill can fill the gaps, then go back to a list.
    return pd.Series(iterable).ffill().to_list()

zero = forward_fill(df.iloc[0].to_list())
one = forward_fill(df.iloc[1].to_list())
two = forward_fill(df.iloc[2].to_list())

array = [zero, one, two]
df.columns = pd.MultiIndex.from_arrays(array)
You may also wish to drop the header rows (in this case rows 0, 1 and 2) and reindex the DataFrame.
df.drop(index=[0,1,2], inplace=True)
df.reset_index(drop=True, inplace=True)
Since columns are also indices, you can just transpose, forward-fill, set the index levels, and transpose back. (Note that the ffill also fills NaNs in the data rows, so only use this one-liner if that is acceptable for your data.)
df.T.ffill().set_index([3, 4, 5]).T
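A self-contained toy (data invented here, with the two header levels in rows 0 and 1 rather than 3-5) showing the same round-trip:

import pandas as pd
import numpy as np

# Rows 0-1 hold the header levels (with gaps); rows 2-3 hold the data.
df = pd.DataFrame([
    ['MINERAL TOTAL', np.nan, 'LEY Cu(%)'],
    ['TRATAMIENTO (ts)', np.nan, np.nan],
    [1, 2, 3],
    [4, 5, 6],
])

# Transpose, fill header gaps to the right, promote rows 0-1 to a
# MultiIndex, then transpose back.
out = df.T.ffill().set_index([0, 1]).T
print(out.columns.tolist())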

Look up each item from one list in a second list; if there's a match return that value, otherwise delete the entire row

I have two lists that were created from columns from two different dataframes. The two dataframes have the following structure:
In [73][dev]: cw.shape
Out[73]: (4666, 13)
In [74][dev]: ml.shape
Out[74]: (815, 5)
and the two lists are identifier objects intended to match data from one dataframe with the other. My intention is conceptually equivalent to a VLOOKUP in Excel: look up whether an item from list ID is in list ID2, and if so, return the appropriate 'class1' value from the second list into the new "Class" column that I've created. If the "vlookup" (pardon my Excel reference, but hopefully you catch my drift) doesn't find the relevant value, then drop the entire row.
import pandas as pd
cw = pd.read_excel("abc.xlsx")
ml = pd.read_excel("xyz.xlsx")
ID = cw['Identifier']
cw["Class"] = ""
asc = cw["Class"]
ID2 = ml['num']
bac = ml['class1']
for item in ID:
    if item in ID2:
        asc[item] = bac[item]
    else:
        cw.drop(cw.index, inplace=True)
Unfortunately the pasted script drops all rows in cw, rendering it a blank dataframe. Not what I intended. Again, what I'm aiming for here is to remove rows that don't have a match between the two identifiers, and to return class1 values for the rows with matching IDs into the new Class column I've just created.
In [76][dev]: cw.shape
Out[76]: (0, 13)
I hope I've made this clear. I suspect I didn't setup the if statement correctly but not sure. Thank you very much for helping a beginner here.
I found a simpler and more straightforward solution using pandas merge.
# Merge with master list
cw_ac = pd.merge(cw, ml, on='cusip', how='inner')
This acts like an inner join in SQL based on the identifier and removes non-matching IDs.
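If the key columns are named differently in the two frames ('Identifier' in cw and 'num' in ml, as in the question), a small variation of the same idea (a sketch, assuming those column names) works:

import pandas as pd

cw = pd.read_excel("abc.xlsx")
ml = pd.read_excel("xyz.xlsx")

# Inner join on differently named key columns; unmatched rows drop out,
# and ml's 'class1' column comes along for the matched rows.
cw_ac = pd.merge(cw, ml[['num', 'class1']],
                 left_on='Identifier', right_on='num', how='inner')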

Cleaning Dataframe in Python 3

I've got a dataframe (haveleft) full of people who have left a service and their reason for leaving. The 'text' column is their reason, but some of the values aren't strings. Not many, so I just want to remove those rows, either in place or to a new dataframe. The code below just gives me a dataframe populated with only NaN. Why doesn't it work?
cleanedleft = pd.DataFrame()
cleanedleft = haveleft[haveleft[haveleft['text'] == str]]
print(holder[0:10])
Or, if I remove one of the haveleft[ ] wrappers, I get an empty dataframe:
cleanedleft = pd.DataFrame()
cleanedleft = haveleft[haveleft['text'] == str]
print(holder[0:10])
I've tried to add a type() check but can't seem to figure out the right way to do it.
It doesn't work because haveleft['text'] == str compares each value with the type object str itself, which is never equal to a value, so the mask is all False; it doesn't test each value's type. A column with mixed values has object dtype, so you'll want to characterize the unwanted data and drop those rows accordingly.
For instance, to drop rows where 'text' consists only of digits, as in the single-line example you give (na=True also drops values that aren't strings at all):
cleaned = df[~df['text'].str.match(r'^\d+$', na=True)]
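Alternatively, a minimal sketch (the frame contents here are invented) that keeps only rows whose 'text' value is an actual Python string:

import pandas as pd

haveleft = pd.DataFrame({'text': ['too slow', 42, 'pricing', None]})

# Keep a row only when its 'text' value is a str instance.
cleaned = haveleft[haveleft['text'].map(lambda v: isinstance(v, str))]
print(cleaned)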

Adding individual items and sequences of items to dataframes and series

Say I have a dataframe df
import pandas as pd
df = pd.DataFrame()
and I have the following tuple and value:
column_and_row = ('bar', 'foo')
value = 56
How can I most easily add this tuple to my dataframe so that:
df['bar']['foo']
returns 56?
What if I have a list of such tuples and a list of values? e.g.
columns_and_rows = [A, B, C, ...]
values = [5, 10, 15]
where A, B and C are tuples of columns and rows (similar to column_and_row).
Along the same lines, how would this be done with a Series? E.g.:
import pandas as pd
srs = pd.Series()
and I want to add one item to it with index 'foo' and value 2 so that:
srs['foo']
returns 2?
Note:
I know that none of these are efficient ways of creating dataframes or series, but I need a solution that allows me to grow my structures organically in this way when I have no other choice.
For a series, you can do it with append (removed in pandas 2.0; pd.concat is the modern replacement), but you have to create a series from your value first:
>>> print(x)
A    1
B    2
C    3
>>> print(pd.concat([x, pd.Series([8, 9], index=["foo", "bar"])]))
A      1
B      2
C      3
foo    8
bar    9
For a DataFrame, you can also use concat (DataFrame.append was likewise removed in pandas 2.0), but it doesn't make sense to do this for a single cell only. DataFrames are tabular, so you can only add a whole row or a whole column, as in the sketch below. The documentation has plenty of examples and there are other questions about this.
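A minimal sketch (frame and labels invented) of appending one whole row with pd.concat:

import pandas as pd

df = pd.DataFrame({'bar': [1, 2]}, index=['a', 'b'])
new_row = pd.DataFrame({'bar': [56]}, index=['foo'])

# concat builds a new frame containing the old rows plus the new one.
df = pd.concat([df, new_row])
print(df.loc['foo', 'bar'])  # 56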
Edit: Apparently you actually can set a single value with df.set_value('newRow', 'newCol', newVal) (since removed from pandas; the modern equivalent is df.loc['newRow', 'newCol'] = newVal, which enlarges in place). If that row/column doesn't already exist, this creates an entire new row and/or column, with the rest of the values in the created row/column filled with NaN. Note that when set_value enlarges the frame, a new object is returned, so you'd have to do df = df.set_value('newRow', 'newCol', newVal) to keep the result.
However, no matter how you do it, this is going to be inefficient. Pandas data structures are based on NumPy and fundamentally rely on knowing the size of the array ahead of time. You can add rows and columns, but every time you do so, an entirely new data structure is created, so if you do this a lot, it will be slower than using ordinary Python lists/dicts.
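A sketch of that pattern for the examples above: grow plain Python containers, then build the pandas objects once at the end.

import pandas as pd

# Accumulate (column, row) -> value pairs in a plain dict as they arrive.
cells = {('bar', 'foo'): 56}
cells[('bar', 'baz')] = 57

# Build the DataFrame once: level 0 of the key becomes the columns.
df = pd.Series(cells).unstack(0)
print(df['bar']['foo'])  # 56

# Same idea for the Series case: collect index -> value in a dict first.
srs = pd.Series({'foo': 2})
print(srs['foo'])  # 2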
