I have a dataframe (df) and my goal is to add a new column ("grad") that contains the gradient between the points that share the same index.
First, I didn't find an easy way to do it using only pandas, so for now I use numpy + pandas. I have written a function that gets the gradient for each row by group, and it works, but it is not pretty and a bit wonky.
Second, I want to add the resulting pandas Series of numpy arrays to the df, but I don't know how to do so. I tried to stack them so I get a Series of length 9 (grouped_2), but when I use concat I get the following message: "ValueError: Shape of passed values is (48, 3), indices imply (16, 3)". According to a previous question, I think having duplicate index values is the problem, but I can't modify the index of my first df.
import numpy as np
import pandas as pd

df = pd.DataFrame(index=[1, 1, 1, 1, 1, 2, 2, 2, 2],
                  data={'value': [1, 5, 8, 10, 12, 1, 2, 8, 2],
                        'diff_day': [-1, 0, 2, 3, 4, -2, -1, 0, 10]})
def grad(gr):
    # gradient of 'value' with respect to 'diff_day' for one group
    val = gr['value']
    dif = gr['diff_day']
    return np.gradient(val, dif)
grouped_1 = df.groupby(level=0).apply(grad)
grouped_2 = pd.DataFrame(grouped_1.values.tolist(), index=grouped_1.index).stack().reset_index(drop=True)
result = pd.concat([df, grouped_2], axis=1)
My expectation was the following dataframe:
pd.DataFrame(index=[1, 1, 1, 1, 1, 2, 2, 2, 2],
             data={'value': [1, 5, 8, 10, 12, 1, 2, 8, 2],
                   'diff_day': [-1, 0, 2, 3, 4, -2, -1, 0, 10],
                   'grad': [4, 3.16, 1.83, 2, 2, 1, 3.5, 5.4, -0.6]})
Here's a simple way that keeps the computation within each group (reusing the grad function defined above), so the gradient is not computed across group boundaries:
df['grad'] = np.concatenate(df.groupby(level=0).apply(grad).values)
To make your solution work, you can do:
result = pd.concat([df.reset_index(drop=True), grouped_2.reset_index(drop=True)], axis=1)
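If you'd rather keep the original (duplicate) index instead of dropping it, a minimal sketch (reusing grouped_2 from above): reset the index without dropping it, concatenate, rename the unnamed gradient column, then restore the index.
result = pd.concat([df.reset_index(), grouped_2.reset_index(drop=True)], axis=1)
result = result.rename(columns={0: 'grad', 'index': 'group'}).set_index('group')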
I have a function that searches for a term within a DataFrame, and I would like to return the integer-based index of the found term. I checked the docs and think that Index.get_loc() should do the trick; however, I am getting the following error with my code:
df = pd.read_excel(os.path.join(MEDIA_ROOT, "files", file), sheet_name=sheet)
for col in df.columns:
    rowfilter = df[col].map(lambda i: str(i)).str.contains("quantity", case=False, regex=False)
    search_row = df[rowfilter].index.get_loc()  # this is a 'slice' of the rows containing the search term
    print(search_row)
    # The output should be 6
However, I am getting the following error,
Index.get_loc() missing 1 required positional argument: 'key'
I have tried the following:
search_row = df[rowfilter].index()
print(pd.Index.get_loc(search_row))
But I get the same error. So the question is, what is the correct key?
As an aside, you can simplify your search:
mask = df[col].astype(str).str.contains('quantity', case=False, regex=False)
In any case, once you obtain your mask (which is a bool Series), you can get the numerical indices of all the matches:
ix = df.reset_index().index[mask]
This will be an Int64Index. You can also get it as a list:
ixs = ix.tolist()
You can then use ix or ixs with .iloc (df.iloc[ix]).
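For instance, a small self-contained sketch (the frame and column name are made up for illustration):
import pandas as pd

df = pd.DataFrame({'item': ['apples', 'Quantity', 'total'], 'other': [1, 2, 3]})

mask = df['item'].astype(str).str.contains('quantity', case=False, regex=False)
ix = df.reset_index().index[mask]   # positional indices of the matching rows
print(ix.tolist())                  # [1]
print(df.iloc[ix])                  # the matching rows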
Since you haven't shared sample data, let's create a test dataframe:
df = pd.DataFrame({
    "col": [1, 2, 3, 4]
})
You can get the index value with:
df[df.col == 1].index.values
It will return an array:
array([0])
If there are multiple matching rows:
df[df.col.isin([1,2,])].index.values
it returns
array([0, 1])
Hope it helps
I have a csv dataset (with > 8m rows) that I load into a dataframe. The csv has columns like:
...,started_at,ended_at,...
2022-04-01 18:23:32,2022-04-01 22:18:15
2022-04-02 01:16:34,2022-04-02 02:18:32
...
I am able to load the dataset into my dataframe, but then I need to add multiple calculated columns to the dataframe for each row. In other words, unlike this SO question, I do not want the rows of the new columns to all have the same initial value (col 1 all NaN, col 2 all "dogs", etc.).
Right now, I can add my columns by doing something like:
df['start_time'] = df.apply(lambda row: add_start_time(row['started_at']), axis = 1)
df['start_cat'] = df.apply(lambda row: add_start_cat(row['start_time']), axis = 1)
df['is_dark'] = df.apply(lambda row: add_is_dark(row['started_at']), axis = 1)
df['duration'] = df.apply(lambda row: calc_dur(row['started_at'], row['ended_at']), axis = 1)
But it seems inefficient since the entire dataset is processed N times (once for each call).
It seems that I should be able to calculate all of the new columns in a single go, but I am missing some conceptual approach.
Examples:
def calc_dur(started_at, ended_at):
    # started_at, ended_at are datetime64[ns]; converted at csv load
    diff = ended_at - started_at
    return diff.total_seconds() / 60

def add_start_time(started_at):
    # started_at is datetime64[ns]; converted at csv load
    return started_at.time()

def add_is_dark(started_at):
    # TZ is pytz.timezone('US/Central')
    # chi_town is the astral lookup for Chicago
    st = started_at.replace(tzinfo=TZ)
    chk = sun(chi_town.observer, date=st, tzinfo=chi_town.timezone)
    return st >= chk['dusk'] or st <= chk['dawn']
Update 1
Following the information from MoRe, I was able to get the essentials working. I needed to augment it by adding the column names, and then use merge to join on the index.
data = pd.Series(df.apply(lambda x: [
    add_start_time(x['started_at']),
    add_is_dark(x['started_at']),
    yrmo(x['year'], x['month']),
    calc_duration_in_minutes(x['started_at'], x['ended_at']),
    add_start_cat(x['started_at'])
], axis=1))

new_df = pd.DataFrame(data.tolist(),
                      data.index,
                      columns=['start_time', 'is_dark', 'yrmo',
                               'duration', 'start_cat'])
df = df.merge(new_df, left_index=True, right_index=True)
import pandas as pd
data = pd.Series(dataframe.apply(lambda x: [function1(x[column_name]), function2(x[column_name]), function3(x[column_name])], axis=1))
pd.DataFrame(data.tolist(),data.index)
If I understood you correctly, this is your answer. But before anything else, please consider the swifter package :)
First create a Series of lists and convert it to columns...
swifter is a simple library (at least I think it is simple) that has only one useful method: apply
import swifter
data.swifter.apply(lambda x: x+1)
It uses parallelism to improve speed on large datasets; on small ones it doesn't help and can even be slower.
https://pypi.org/project/swifter/
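A minimal, self-contained sketch of this pattern, using the started_at/ended_at sample from the question and two inline calculations standing in for the custom functions:
import pandas as pd

df = pd.DataFrame({'started_at': pd.to_datetime(['2022-04-01 18:23:32', '2022-04-02 01:16:34']),
                   'ended_at': pd.to_datetime(['2022-04-01 22:18:15', '2022-04-02 02:18:32'])})

# each row produces a list, so 'data' is a Series of lists
data = pd.Series(df.apply(lambda x: [x['started_at'].time(),
                                     (x['ended_at'] - x['started_at']).total_seconds() / 60],
                          axis=1))
new_df = pd.DataFrame(data.tolist(), data.index, columns=['start_time', 'duration'])
df = df.merge(new_df, left_index=True, right_index=True)
print(df)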
I have some numpy array, whose number of rows (axis=0) is the same as a pandas dataframe's number of rows.
I want to create a new column in the dataframe, for which each entry would be a numpy array of a lesser dimension.
Code:
some_df = pd.DataFrame(columns=['A'])
for i in range(10):
    some_df.loc[i] = [np.random.rand(4, 6, 8)]

data = np.stack(some_df['A'].values)  # shape (10, 4, 6, 8)
processed = np.max(data, axis=1)      # shape (10, 6, 8)
some_df['B'] = processed              # This fails
I want the new column 'B' to contain numpy arrays of shape (6, 8)
How can this be done?
This is not recommended; it is painful, slow, and later processing is not easy.
One possible solution is to use a list comprehension:
some_df['B'] = [x for x in processed]
Or convert to list and assign:
some_df['B'] = processed.tolist()
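Note that .tolist() turns each cell into a nested Python list, while the list comprehension keeps each cell as a NumPy array. A quick check against the example above (using the list-comprehension version):
some_df['B'] = [x for x in processed]
print(some_df['B'].iloc[0].shape)   # (6, 8)
print(some_df.dtypes)               # column 'B' has object dtype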
Coming back to this after 2 years, here is a much better practice:
from itertools import product, chain
import pandas as pd
import numpy as np
from typing import Dict
def calc_col_names(named_shape):
    *prefix, shape = named_shape
    names = [map(str, range(i)) for i in shape]
    return map('_'.join, product(prefix, *names))

def create_flat_columns_df_from_dict_of_numpy(
    named_np: Dict[str, np.ndarray],
    n_samples_per_np: int,
):
    named_np_correct_length = {k: v for k, v in named_np.items() if len(v) == n_samples_per_np}
    flat_nps = [a.reshape(n_samples_per_np, -1) for a in named_np_correct_length.values()]
    stacked_nps = np.column_stack(flat_nps)
    named_shapes = [(name, arr.shape[1:]) for name, arr in named_np_correct_length.items()]
    col_names = [*chain.from_iterable(calc_col_names(named_shape) for named_shape in named_shapes)]
    df = pd.DataFrame(stacked_nps, columns=col_names)
    df = df.convert_dtypes()
    return df

def parse_series_into_np(df, col_name, shp):
    # the shape could also be parsed from the column names
    n_samples = len(df)
    col_names = sorted(c for c in df.columns if col_name in c)
    col_names = list(filter(lambda c: c.startswith(col_name + "_") or len(col_names) == 1, col_names))
    col_as_np = df[col_names].astype(float).values.reshape((n_samples, *shp))
    return col_as_np
Usage, to put an ndarray into a DataFrame:
full_rate_df = create_flat_columns_df_from_dict_of_numpy(
    named_np={name: np.array(d[name]) for name in ["name1", "name2"]},
    n_samples_per_np=d["name1"].shape[0]
)
where d is a dict of ndarrays with the same shape[0], keyed by ["name1", "name2"].
The reverse operation can be obtained by parse_series_into_np.
The accepted answer remains, as it answers the original question, but this one is a much better practice.
I know this question already has an answer, but I would like to add a much more scalable way of doing this. As mentioned in the comments above, it is in general not recommended to store arrays as "field" values in a pandas DataFrame column (I actually do not know why). Nevertheless, in my day-to-day work this is an extremely important functionality when working with time-series data and a bunch of related meta-data.
In general I organize my experimental time series as pandas dataframes with one column holding same-length numpy arrays and the other columns containing meta-data about measurement conditions etc.
The proposed solution by jezrael works very well, and I have used it regularly for the last 4 years. But this method can run into huge memory problems. In my case I came across these problems when working with dataframes of more than 5 million rows and time series with approx. 100 data points.
The solution to these problems is extremely simple; since I did not find it anywhere, I just wanted to share it here: simply transform your 2D array into a pandas Series object and assign it to a column of your dataframe:
df["new_list_column"] = pd.Series(list(numpy_array_2D))
I want to apply the .nunique() function to a full dataFrame.
On the following screenshot, we can see that it contains 130 features. Screenshot of shape and columns of the dataframe.
The goal is to get the number of different values per feature.
I use the following code (that worked on another dataFrame).
def nbDifferentValues(data):
    total = data.nunique()
    total = total.sort_values(ascending=False)
    percent = (total / data.shape[0] * 100)
    return pd.concat([total, percent], axis=1, keys=['Total', 'Pourcentage'])
diffValues = nbDifferentValues(dataFrame)
The code fails at the first line with the following error, which I don't know how to solve: ("unhashable type : 'list'", 'occured at index columns').
Trace of the error
You probably have a column whose contents are lists.
Since lists in Python are mutable, they are unhashable.
import pandas as pd
df = pd.DataFrame([
    (0, [1, 2]),
    (1, [2, 3])
])
# raises "unhashable type : 'list'" error
df.nunique()
SOLUTION: Don't use mutable structures (like lists) in your dataframe:
df = pd.DataFrame([
    (0, (1, 2)),
    (1, (2, 3))
])
df.nunique()
# 0 2
# 1 2
# dtype: int64
To get nunique or unique values in a pandas.Series, my preferred approaches are:
Quick Approach
NOTE: It doesn't hurt if the column values are a mix of lists and strings. Also, nested lists might need to be flattened first.
_unique_items = df.COL_LIST.explode().unique()
or
_unique_count = df.COL_LIST.explode().nunique()
Alternate Approach
Alternatively, if I wish not to explode the items,
# If col values are strings
_unique_items = df.COL_STR_LIST.apply("|".join).unique()
# Lambda will save if col values are non-strings
_unique_items = df.COL_LIST.apply(lambda _l: "|".join(str(_y) for _y in _l)).unique()
Bonus
df.COL.apply(json.dumps) might handle all the cases.
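For example, a minimal sketch of the json.dumps approach (column name made up for illustration):
import json
import pandas as pd

df = pd.DataFrame({'col': [[1, 2], [2, 3], [1, 2]]})
# serializing each list to a string makes it hashable, so nunique works
print(df['col'].apply(json.dumps).nunique())   # 2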
OP's solution
df['uniqueness'] = df.apply(lambda _x: json.dumps(_x.to_list()), axis=1)
...
# Plug more code
...
I have come across this problem with .nunique() when converting results from a REST API from dict (or list) to a pandas dataframe. The problem is that one of the columns is stored as a list or dict (a common situation with nested JSON results). Here is sample code to separate out the columns causing the error.
# this is the dataframe that is causing your issues
df = data.copy()
print(f"Rows and columns: {df.shape} \n")
print(f"Null values per column: \n{df.isna().sum()} \n")
# check which columns error when counting number of uniques
ls_cols_nunique = []
ls_cols_error_nunique = []
for each_col in df.columns:
    try:
        df[each_col].nunique()
        ls_cols_nunique.append(each_col)
    except TypeError:
        # unhashable values (lists/dicts) raise TypeError
        ls_cols_error_nunique.append(each_col)
print(f"Unique values per column: \n{df[ls_cols_nunique].nunique()} \n")
print(f"Columns error nunique: \n{ls_cols_error_nunique} \n")
This code should split your dataframe columns into 2 lists:
Columns that can calculate .nunique()
Columns that error when running .nunique()
Then just calculate the .nunique() on the columns without errors.
As far as converting the columns with errors, there are other resources that address that with .apply(pd.Series).
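For instance, a list column can be expanded into one scalar column per position with .apply(pd.Series), after which .nunique() works column by column; a minimal sketch with made-up data:
import pandas as pd

df_err = pd.DataFrame({'tags': [[1, 2], [2, 3]]})
expanded = df_err['tags'].apply(pd.Series)   # columns 0 and 1, one per list position
print(expanded.nunique())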
I have a dict containing 3 dataframes of identical shape. I would like to create:
a 4th dataframe which identifies the largest value from the original 3 at each coordinate - so dic['four'].ix[0,'A'] = MAX( dic['one'].ix[0,'A'], dic['two'].ix[0,'A'], dic['three'].ix[0,'A'] )
a 5th with the second largest value
dic = {}
for i in ['one', 'two', 'three']:
    dic[i] = pd.DataFrame(np.random.randint(0, 100, size=(10, 3)), columns=list('ABC'))
I cannot figure out how to use .where() to compare the original 3 dfs. Looping through would be inefficient for the ultimate data set.
Consider the dict dfs, which is a dictionary of pd.DataFrames:
import pandas as pd
import numpy as np
np.random.seed([3,1415])
dfs = dict(
    one=pd.DataFrame(np.random.randint(1, 10, (5, 5))),
    two=pd.DataFrame(np.random.randint(1, 10, (5, 5))),
    three=pd.DataFrame(np.random.randint(1, 10, (5, 5))),
)
The best way to handle this is with a pd.Panel object, which is the higher-dimensional analogue of pd.DataFrame.
p = pd.Panel(dfs)
Then the answers you need are very straightforward.
max
p.max(axis='items') or p.max(0)
penultimate
p.apply(lambda x: np.sort(x)[-2], axis=0)
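Note that pd.Panel has since been removed from pandas; on current versions, a sketch of the same two results using a stacked NumPy array (assuming the dfs dict above):
stacked = np.stack([dfs[k].values for k in ['one', 'two', 'three']])   # shape (3, 5, 5)
base = dfs['one']
largest = pd.DataFrame(stacked.max(axis=0), index=base.index, columns=base.columns)
second_largest = pd.DataFrame(np.sort(stacked, axis=0)[-2], index=base.index, columns=base.columns)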
The first question is easy to answer: you can use numpy.maximum.reduce() to find the element-wise maximum value in each cell across multiple dataframes (np.maximum itself only compares two arrays at a time):
dic['four'] = pd.DataFrame(np.maximum.reduce([dic['one'].values, dic['two'].values, dic['three'].values]), columns=list('ABC'))
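The second-largest values (the 5th dataframe) can be obtained the same way; a sketch using np.sort over the stacked arrays:
stacked = np.stack([dic['one'].values, dic['two'].values, dic['three'].values])
dic['five'] = pd.DataFrame(np.sort(stacked, axis=0)[-2], columns=list('ABC'))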