I've got a pandas DataFrame that contains NumPy arrays in some columns:
import numpy as np, pandas as pd
data = {'col1': [np.array([1, 2]), np.array([3, 4])],
        'col2': [np.array([5, 6]), np.array([7, 8])],
        'col3': [9, 10]}
df = pd.DataFrame(data)
I need to store a large frame like this one in a CSV file, but the arrays have to be strings that look like this:
col1,col2,col3
"(1, 2)","(5, 6)",9
"(3, 4)","(7, 8)",10
What I'm currently doing to achieve this result is iterating over each array column and each row of the DataFrame, but this doesn't seem efficient.
This is my current solution:
pd.options.mode.chained_assignment = None
array_columns = [column for column in df.columns if isinstance(df[column].iloc[0], np.ndarray)]
for index, row in df.iterrows():
    for column in array_columns:
        # Here 'tuple' is only used to replace brackets with parentheses
        df[column][index] = str(tuple(row[column]))
I tried using apply, although I've heard it's usually not an efficient alternative:
def array_to_str(array):
    return str(tuple(array))

df[array_columns] = df[array_columns].apply(array_to_str)
But my arrays become NaN:
col1 col2 col3
0 NaN NaN 9
1 NaN NaN 10
I tried other similar solutions, but the error:
ValueError: Must have equal len keys and value when setting with an iterable
appeared quite often.
Is there a more efficient way of performing the same operation? My real dataframes can contain many columns and thousands of rows.
Try this:
tupcols = ['col1', 'col2']
df[tupcols] = df[tupcols].apply(lambda col: col.apply(tuple)).astype('str')
df.to_csv()
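To get exactly the layout from the question, write without the index (a sketch; the commented lines are the output I'd expect, matching the question, not captured output):
print(df.to_csv(index=False))
# col1,col2,col3
# "(1, 2)","(5, 6)",9
# "(3, 4)","(7, 8)",10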
You would need to convert the arrays into tuples to get the desired representation. To do so, you can apply tuple to the columns with object dtype.
to_save = df.apply(lambda x: x.map(lambda y: tuple(y)) if x.dtype=='object' else x)
to_save.to_csv(index=False)
Output:
col1,col2,col3
"(1, 2)","(5, 6)",9
"(3, 4)","(7, 8)",10
Note: this would be dangerous if you have other object columns, e.g. string columns, since they would be converted as well.
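If that is a concern, a minimal sketch that limits the conversion to the columns that actually hold arrays (reusing the detection idea from the question) could look like this:
# only convert columns whose first element actually is an ndarray
array_columns = [c for c in df.columns if isinstance(df[c].iloc[0], np.ndarray)]
to_save = df.copy()
to_save[array_columns] = to_save[array_columns].apply(
    lambda col: col.map(lambda arr: str(tuple(arr))))
print(to_save.to_csv(index=False))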
data = {'col1': [np.array([1, 2]), np.array([3, 4])],
        'col2': [np.array([5, 6]), np.array([7, 8])],
        'col3': [9, 10]}
df = pd.DataFrame(data)
df[['col1', 'col2']] = df[['col1', 'col2']].applymap(tuple)
df[['col1', 'col2']] = df[['col1', 'col2']].applymap(lambda x: '"{}"'.format(x))
col1 col2 col3
0 "(1, 2)" "(5, 6)" 9
1 "(3, 4)" "(7, 8)" 10
Related
So I have a dataframe as follows:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.array([[1, 2, 3, 3, 2, 1], [4, 3, 6, 6, 3, 4], [7, 2, 9, 9, 2, 7]]),
                  columns=['a', 'b', 'c', 'a_select', 'b_select', 'c_select'])
df
Now, I may need to reorganize the dataframe (or use two) to accomplish this, but...
I'd like to select, for each row, the 2 largest values among the '_select' columns, and then take the mean of the corresponding non-'_select' columns.
For example, row 1 would average the values from a & b, and row 2 from a & c (NOT the values from the _select columns we're looking at).
Currently I'm just iterating over each row, which is simple enough but slow with a large dataset, and I can't figure out how to do the equivalent with apply or a lambda function (or whether it's even possible).
Simple one-liner using nlargest:
>>> df.filter(like='select').apply(lambda s: s.nlargest(2), 1).mean(1)
For performance, maybe numpy is useful:
>>> np.sort(df.filter(like='select').to_numpy(), 1)[:, -2:].mean(1)
To get the values from the corresponding non-'_select' columns, use argsort:
>>> arr = df.filter(like='select').to_numpy()
>>> df[['a', 'b', 'c']].to_numpy()[[[x] for x in np.arange(len(arr))],
...                                np.argsort(arr, 1)][:, -2:].mean(1)
array([1.5, 5. , 8. ])
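If readability matters, np.take_along_axis does the same fancy-indexing step more explicitly (a sketch of the same idea, reusing arr from above; not the answer's original code):
>>> vals = df[['a', 'b', 'c']].to_numpy()
>>> top2 = np.argsort(arr, 1)[:, -2:]   # column positions of the two largest '_select' values per row
>>> np.take_along_axis(vals, top2, axis=1).mean(1)   # should give array([1.5, 5. , 8. ]) as above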
I'm trying to create a dask dataframe from a numpy array. For that, I need to specify the column types. As suggested in the dask documentation, I use an empty pandas dataframe for that. This doesn't throw an error; however, all the data types come out as object. I need to use the empty pandas dataframe as meta, so how can I make this work?
import numpy as np
import pandas as pd
import dask.dataframe as dd
from datetime import datetime

array = np.array([(1.5, 2, 3, datetime(2000, 1, 1)), (4, 5, 6, datetime(2001, 2, 2))])
meta = pd.DataFrame({'col1': pd.Series(dtype='float64'),
                     'col2': pd.Series(dtype='float64'),
                     'col3': pd.Series(dtype='float64'),
                     'date1': pd.Series(dtype='datetime64[ns]')})
print(meta.dtypes)
>>> col1 float64
>>> col2 float64
>>> col3 float64
>>> date1 datetime64[ns]
>>> dtype: object
columns = ['col1', 'col2', 'col3', 'date1']
ddf = dd.from_array(array, columns=columns, meta=meta)
ddf.compute()
print(ddf.dtypes)
>>> col1 object
>>> col2 object
>>> col3 object
>>> date1 object
>>> dtype: object
Could dtypes be set after dataframe creation?
import pandas as pd
import numpy as np
from datetime import datetime
import dask.dataframe as dd
array = np.array([(1.5, 2, 3, datetime(2000,1,1)), (4, 5, 6, datetime(2001, 2, 2))])
columns = ['col1', 'col2', 'col3', 'date1']
ddf = dd.from_array(array, columns = columns)
ddf.compute()
ddf = ddf.astype({'col1': 'float64','col2':'float64','col3':'float64','date1':'datetime64[ns]'})
print(ddf.dtypes)
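One thing worth noting (a sketch, following on from the snippet above): astype on a Dask dataframe is lazy, so the dtypes reported on the collection are metadata, and the actual conversion only happens when the graph is executed:
result = ddf.compute()   # the conversion is applied here
print(result.dtypes)     # the materialised pandas frame should match the requested dtypes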
Does this work?
df = (pd.DataFrame(array, columns=["col1", "col2", "col3", "col4"])
      .astype({"col1": "float64",
               "col2": "float64",
               "col3": "float64",
               "col4": "datetime64[ns]"}))
ddf = dd.from_pandas(df, npartitions=10)
The output of ddf.dtypes gives me the correct data types.
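For reference, the check looks like this (the listing below is what the astype call requests, not captured output):
print(ddf.dtypes)
# col1           float64
# col2           float64
# col3           float64
# col4    datetime64[ns]
# dtype: object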
I noticed the following behaviour when working with Int64. Is there a way to avoid the type conversion and preserve the Int64 type post merge?
df1 = pd.DataFrame(data={'col1': [1, 2, 3, 4, 5], 'col2': [10, 11, 12, 13, 14]}, dtype=pd.Int64Dtype())
df2 = pd.DataFrame(data={'col1': [1, 2, 3], 'col2': [10, 11, 12]}, dtype=pd.Int64Dtype())
df = df2.merge(df1, how='outer', indicator=True, suffixes=('_x', ''))
df1.dtypes
Out[8]:
col1 Int64
col2 Int64
dtype: object
df2.dtypes
Out[9]:
col1 Int64
col2 Int64
dtype: object
df.dtypes
Out[10]:
col1 object
col2 object
_merge category
dtype: object
I should clarify that I am looking for an answer that doesn't involve explicitly doing something like:
for k, v in df1.dtypes.to_dict().items():
    df[k] = df[k].astype(v)
It comes from df2 (the base dataframe) needing to be reindexed to match df1 (the merging dataframe). It probably should behave as you expect, but it is an edge case of using the pandas Int64Dtype instead of the Python int type.
When performing the merge, this reindexing is called:
> /home/tron/.local/lib/python3.7/site-packages/pandas/core/reshape/merge.py(840)_maybe_add_join_keys()
838 key_col = rvals
839 else:
--> 840 key_col = Index(lvals).where(~mask, rvals)
841
842 if result._is_label_reference(name):
Which then calls this array dtype.
> /home/tron/.local/lib/python3.7/site-packages/pandas/core/indexes/base.py(359)__new__()
357 data = ea_cls._from_sequence(data, dtype=dtype, copy=False)
358 else:
--> 359 data = np.asarray(data, dtype=object)
360
361 # coerce to the object dtype
You can explore this yourself by using the pdb debugger and stepping through the result.
df1 = pd.DataFrame(data={'col1': [1, 2, 3, 4, 5], 'col2': [10, 11, 12, 13, 14]}, dtype=pd.Int64Dtype())
df2 = pd.DataFrame(data={'col1': [1, 2, 3], 'col2': [10, 11, 12]}, dtype=pd.Int64Dtype())
def test():
    import pdb
    pdb.set_trace()
    df = df2.merge(df1, how='outer', indicator=True, suffixes=('_x', ''))
    return df

df = test()
Some interesting notes:
If you use dtype=int instead of dtype=pd.Int64Dtype(), the types are actually as expected. It probably should work similarly with both, but the int type takes a different logic path in pandas/core/indexes/base.py(359)__new__(), which interprets int as "index-like". That said, you should likely default to the standard int, float and bool types from Python instead of the pandas dtypes unless you have a specific use case.
df2.merge(df1, how='inner') preserves the types because no reindexing is needed.
df1.merge(df2, how='outer') preserves the types because df1 (the base dataframe) does not need to be reindexed to merge df2 (a short demonstration is sketched below).
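A minimal sketch of that last point (assuming df1 and df2 from the question, and that the behaviour described above holds on your pandas version):
# df1 as the base frame: it does not need to be reindexed for the outer merge,
# so col1/col2 should keep their Int64 dtype
merged = df1.merge(df2, how='outer', indicator=True, suffixes=('', '_x'))
print(merged.dtypes)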
Looking to get the row of a group that has the maximum value across multiple columns:
pd.DataFrame([{'grouper': 'a', 'col1': 1, 'col2': 3, 'uniq_id': 1}, {'grouper': 'a', 'col1': 2, 'col2': 4, 'uniq_id': 2}, {'grouper': 'a', 'col1': 3, 'col2': 2, 'uniq_id': 3}])
col1 col2 grouper uniq_id
0 1 3 a 1
1 2 4 a 2
2 3 2 a 3
In the above, I'm grouping by the "grouper" column. Within the "a" group, I want the row that has the maximum value across col1 and col2. In this case that is the row with uniq_id of 2, because it contains the highest value of col1/col2 (4), so the outcome would be:
col1 col2 grouper uniq_id
1 2 4 a 2
In my actual example I'm using timestamps, so I don't actually expect ties. But in the case of a tie I am indifferent to which row I select within the group, so taking the first row of the group would be fine.
One more way you can try:
# find row wise max value
df['row_max'] = df[['col1','col2']].max(axis=1)
# filter rows from groups
df.loc[df.groupby('grouper')['row_max'].idxmax()]
col1 col2 grouper uniq_id row_max
1 2 4 a 2 4
Later you can drop row_max using df.drop('row_max', axis=1)
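If you'd rather not add and drop a helper column, the same idea fits in one chain (a sketch of the same approach, not separately benchmarked):
# row-wise max of the two columns, grouped by 'grouper', then the index label of each group's max
idx = df[['col1', 'col2']].max(axis=1).groupby(df['grouper']).idxmax()
df.loc[idx]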
IIUC, using transform, then comparing with the original dataframe:
g=df.groupby('grouper')
s1=g.col1.transform('max')
s2=g.col2.transform('max')
s=pd.concat([s1,s2],axis=1).max(1)
df.loc[df[['col1', 'col2']].eq(s, axis=0).any(axis=1)]
Out[89]:
col1 col2 grouper uniq_id
1 2 4 a 2
Interesting approaches all around. Adding another one just to show the power of apply (which I'm a big fan of) and using some of the other mentioned methods.
import pandas as pd
df = pd.DataFrame(
    [
        {"grouper": "a", "col1": 1, "col2": 3, "uniq_id": 1},
        {"grouper": "a", "col1": 2, "col2": 4, "uniq_id": 2},
        {"grouper": "a", "col1": 3, "col2": 2, "uniq_id": 3},
    ]
)
def find_max(grp):
    # find the max value per row, then the index of the row with that max
    max_row_idx = grp[["col1", "col2"]].max(axis=1).idxmax()
    return grp.loc[max_row_idx]
df.groupby("grouper").apply(find_max)
value = pd.concat([df['col1'], df['col2']], axis = 0).max()
df.loc[(df['col1'] == value) | (df['col2'] == value), :]
col1 col2 grouper uniq_id
1 2 4 a 2
This probably isn't the fastest way, but it will work in your case. Concatenate both columns and find the max, then search the df for rows where either column equals that value.
You can use numpy and pandas as follows:
import numpy as np
import pandas as pd
df = pd.DataFrame({'col1': [1, 2, 3],
                   'col2': [3, 4, 2],
                   'grouper': ['a', 'a', 'a'],
                   'uniq_id': [1, 2, 3]})
df['temp'] = np.max([df.col1.values, df.col2.values], axis=0)
idx = df.groupby('grouper')['temp'].idxmax()
df.loc[idx].drop('temp', axis=1)
col1 col2 grouper uniq_id
1 2 4 a 2
I'm trying to subset a pandas DataFrame in Python based on two logical conditions, i.e.
d = {'col1': [1, 2], 'col2': [3, 4]}
df = pd.DataFrame(data=d)
df[df.col1 = 1 and df.col2 = 3]
but I'm getting invalid syntax on line 3.
Is there a way to do this in one line?
You need comparison and logical operators: == tests equality and returns a boolean, while = assigns a value. To combine the two conditions element-wise, use & and wrap each condition in parentheses.
Try:
df[(df.col1 == 1) & (df.col2 == 3)]
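As an aside, if you prefer spelling the condition out as a single expression, DataFrame.query accepts the and keyword directly (a sketch, equivalent to the mask above):
df.query("col1 == 1 and col2 == 3")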
Disclaimer: as mentioned by @jp_data_analysis and the pandas docs, the following solution is not the best one, since it uses chained indexing. Please refer to Matt W.'s and AChampion's solutions.
An alternative one-line solution:
>>> d = {'col1': [1, 2, 1], 'col2': [3, 4, 2]}
>>> df = pd.DataFrame(data=d)
>>> df[df.col1==1][df.col2==3]
col1 col2
0 1 3
I have added a third row, with 'col1'=1 and 'col2'=2, so we can have an extra negative test case.
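For completeness, the recommended boolean-mask form on this 3-row frame should return only the first row, with the extra row correctly excluded (expected output, not captured):
>>> df[(df.col1 == 1) & (df.col2 == 3)]
   col1  col2
0     1     3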