Given DataFrame df:
Id Sex Group Time Time!
0 21 M 2 2.31 NaN
1 2 F 2 2.29 NaN
and update:
Id Sex Group Time
0 21 M 2 2.36
1 2 F 2 2.09
2 3 F 1 1.79
I want to match on Id, Sex and Group, and either update Time! with the Time value from the update df if there is a match, or insert a new record if there isn't.
Here is how I do it:
import numpy as np
import pandas as pd

df = df.set_index(['Id', 'Sex', 'Group'])
update = update.set_index(['Id', 'Sex', 'Group'])
for i, row in update.iterrows():
    if i in df.index:  # update existing record
        df.loc[i, 'Time!'] = row['Time']
    else:  # insert new record
        cols = update.columns.values
        row = np.array(row).reshape(1, len(row))
        _ = pd.DataFrame(row, index=[i], columns=cols)
        df = pd.concat([df, _])
print(df)
Time Time!
Id Sex Group
21 M 2 2.31 2.36
2 F 2 2.29 2.09
3 F 1 1.79 NaN
The code seems to work and the result matches my expectations above. However, I have noticed it behaving faultily on a big data set: the conditional
if i in df.index:
    ...
else:
    ...
obviously misfires (it takes the else branch where it shouldn't, and vice versa; I guess the MultiIndex may somehow be the cause).
So my question is: do you know another way, or a more robust version of mine, to update one df based on another df?
I think I would do this with a merge, and then update the columns with a where. First remove the Time column from up:
In [11]: times = up.pop('Time') # up = the update DataFrame
In [12]: df1 = df.merge(up, how='outer')
In [13]: df1
Out[13]:
Id Sex Group Time Time!
0 21 M 2 2.31 NaN
1 2 F 2 2.29 NaN
2 3 F 1 NaN NaN
Set Time! from times where Time is not NaN (the matched rows), and fill Time from times where it is NaN (the new rows):
In [14]: df1['Time!'] = df1['Time'].where(df1['Time'].isnull(), times)
In [15]: df1['Time'] = df1['Time'].where(df1['Time'].notnull(), times)
In [16]: df1
Out[16]:
Id Sex Group Time Time!
0 21 M 2 2.31 2.36
1 2 F 2 2.29 2.09
2 3 F 1 1.79 NaN
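For completeness, here is an index-based sketch of the same update/insert that aligns on the MultiIndex instead of relying on the row order of times; the frame construction below is a hypothetical reconstruction of the question's data:
import numpy as np
import pandas as pd

# Hypothetical reconstruction of the question's frames.
df = pd.DataFrame({'Id': [21, 2], 'Sex': ['M', 'F'], 'Group': [2, 2],
                   'Time': [2.31, 2.29], 'Time!': [np.nan, np.nan]})
update = pd.DataFrame({'Id': [21, 2, 3], 'Sex': ['M', 'F', 'F'],
                       'Group': [2, 2, 1], 'Time': [2.36, 2.09, 1.79]})

key = ['Id', 'Sex', 'Group']
df = df.set_index(key)
upd = update.set_index(key)

# Keys present in both frames: copy the update's Time into Time! (aligned on the index).
common = df.index.intersection(upd.index)
df.loc[common, 'Time!'] = upd.loc[common, 'Time']

# Keys only in the update: append them as new rows; their Time! stays NaN.
new = upd.loc[upd.index.difference(df.index)]
df = pd.concat([df, new])
print(df)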
I have two dataframes df1 and df2.
df1:
id val
1 25
2 40
3 78
df2:
id val
2 8
1 5
Now I want to do something like df1['val'] = df1['val'] / df2['val'] for matching id. Since df2 is a subset of df1, some ids may be missing from df2; I want those rows of df1 to stay unchanged. I can do that by iterating over all df2 rows. This is what I have right now:
for row in df2.iterrows():
    df1.loc[df1['id'] == row[1]['id'], 'val'] /= row[1]['val']
df1:
id val
1 5
2 5
3 78
How can I achieve the same without using a for loop, to improve speed?
Use Series.map with Series.div:
df1['val'] = df1['val'].div(df1['id'].map(df2.set_index('id')['val']), fill_value=1)
print (df1)
id val
0 1 5.0
1 2 5.0
2 3 78.0
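To make the mechanics explicit, here is a small sketch using the question's frames: map builds a Series of divisors aligned to df1['id'], ids missing from df2 map to NaN, and fill_value=1 turns those divisions into no-ops.
import pandas as pd

df1 = pd.DataFrame({'id': [1, 2, 3], 'val': [25, 40, 78]})
df2 = pd.DataFrame({'id': [2, 1], 'val': [8, 5]})

# Look up each id's divisor in df2; id 3 has no match, so it maps to NaN.
divisors = df1['id'].map(df2.set_index('id')['val'])
print(divisors.tolist())   # [5.0, 8.0, nan]

# fill_value=1 replaces the missing divisor with 1, so row 3 keeps its 78.0.
df1['val'] = df1['val'].div(divisors, fill_value=1)
print(df1)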
Solution with merge with left join:
df1['val'] = df1['val'].div(df1.merge(df2, on='id', how='left')['val_y'], fill_value=1)
print (df1)
id val
0 1 5.0
1 2 5.0
2 3 78.0
I have a dataset of about 1.4M rows x 16 columns. There are no missing values, but some cells contain strings like "+AC0-5.3" instead of being empty or NaN.
The numerical part differs from entry to entry, but these values all follow the pattern '+***-***'. How should I deal with this? I guess dropping all rows containing such entries would be a good idea.
I tried a solution from Stack Overflow:
dataset = dataset[~dataset['total+AF8-amount'].astype(str).str.startswith('+')]
This removes all the rows with such an entry, but it only checks the target column, 'total+AF8-amount'.
I want to remove every row where this weird missing value appears in any column; how can I accomplish this?
This is the training set .head() (sorry for the bad formatting; each space separates the next column):
ID vendor+AF8-id pickup+AF8-loc drop+AF8-loc driver+AF8-tip mta+AF8-tax distance pickup+AF8-time drop+AF8-time num+AF8-passengers toll+AF8-amount payment+AF8-method rate+AF8-code stored+AF8-flag extra+AF8-charges improvement+AF8-charge total+AF8-amount
0 1 170 233 1.83 0.5 0.7 04-04-2017 17.59 04-04-2017 18.05 1 0 1 1 N 1 0.3 9.13
1 2 151 243 3.56 0.5 4.64 04-03-2017 19.03 04-03-2017 19.20 1 0 1 1 N 1 0.3 21.36
2 2 68 90 1.5 0.5 1.29 04-03-2017 15.06 04-03-2017 15.12 2 0 1 1 N 0 0.3 8.8
3 2 142 234 1.5 0.5 2.74 04-04-2017 8.10 04-04-2017 8.27 1 0 1 1 N 0 0.3 14.8
1656 2 114 255 3.96 0.5 3.92 04-05-2017 22.57 04-05-2017 23.22 2 0 1 1 N 0.5 0.3 23.76
1657 2 230 100 0 **+AC0-0.5** 0.51 04-06-2017 8.14 04-06-2017 8.18 1 0 3 1 N 0 **+AC0-0.3 +AC0-5.3**
1658 2 163 226 0 0.5 3.93 04-07-2017 4.06 04-07-2017 4.20 1 0 2 1 N 0.5 0.3 15.8
1659 2 229 90 2.56 0.5 2.61 04-07-2017 13.49 04-07-2017 14.06 2 0 1 1 N 0 0.3 15.36
For example, the row with ID 1657 has those malformed entries, and there are other such rows. This is what I have done:
dataset = pd.read_csv('chh-OLA-Kaggle.csv', index_col = 'ID')
testset = pd.read_csv('test.csv', index_col = 'ID')
dataset.dropna(axis = 0, subset = ['total+AF8-amount'], inplace = True)
dataset = dataset[~dataset['total+AF8-amount'].astype(str).str.startswith('+')]
X = dataset.iloc[:, :15].values
y = dataset['total+AF8-amount'].values
One more problem: after this, all those values are of type str. How can I convert all the numerical columns to float64 so I can fit the data to a model?
Are all datasets like this?
This has already been answered, but here is a different approach I used about three years back (found on SO itself), which is intuitive and good to have.
In a regex, + is a quantifier, so str.contains('+') does not match a literal plus sign (it actually raises a 'nothing to repeat' error). Use str.contains(r'\+'), or pass regex=False, to match the literal + character.
So we can loop over every column with for col in df, call str.contains on each, stack the results into a boolean mask with np.column_stack(), and then filter rows with DataFrame.loc and mask.any(axis=1).
Example DataFrame:
>>> df
col1 col2 col3
0 32.1 33.2 +232
1 34.2 3.4 3.4
2 32.44 +232 32.44
3 +232 1.32 +234
4 1.312 131.23 131.23
Solution:
>>> mask = np.column_stack([df[col].str.contains(r"\+", na=False) for col in df])
>>> df.loc[ ~ mask.any(axis=1)]
col1 col2 col3
1 34.2 3.4 3.4
4 1.312 131.23 131.23
Solution 2:
Without np.column_stack, purely with pandas. It returns a DataFrame mask rather than the numpy.ndarray of the earlier approach (which is a natural fit for boolean masking), but it filters the same way.
>>> mask = df.apply(lambda x: x.str.contains(r'\+', na=False))
>>> df.loc[ ~ mask.any(axis=1)]
col1 col2 col3
1 34.2 3.4 3.4
4 1.312 131.23 131.23
If you then need to convert the result to float, try the following:
df.loc[ ~ mask.any(axis=1)].astype(float)
Note:
Since you asked about ~: it negates a boolean vector used to filter the data. The other operators are | for or, & for and, and ~ for not; compound conditions must be grouped with parentheses.
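A tiny self-contained illustration of combining those operators (the frame d and its columns are made up for this example):
import pandas as pd

d = pd.DataFrame({'a': [1, -2, 3], 'b': ['x', '+bad', 'y']})
# Keep rows where a is positive AND b does not contain a literal '+':
print(d[(d['a'] > 0) & ~d['b'].str.contains(r'\+', na=False)])
#    a  b
# 0  1  x
# 2  3  y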
First of all, count how many rows contain this type of entry (values starting with '+***-***'). If there are only a few, you can simply drop those rows.
If not, you should perform a data-cleaning operation on each row. For that you can use the apply() method with axis=1, as sketched below.
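A minimal sketch of both steps, using the file and index column from the question; the clean-up rule (and the guess that '+AC0-' is a UTF-7-encoded minus sign) is an assumption:
import pandas as pd

dataset = pd.read_csv('chh-OLA-Kaggle.csv', index_col='ID')

# 1) Count how many rows contain a '+...' marker in any column.
bad = dataset.astype(str).apply(lambda col: col.str.startswith('+')).any(axis=1)
print(bad.sum(), "rows contain malformed entries")

# 2a) If they are few, simply drop them:
cleaned = dataset[~bad]

# 2b) Otherwise clean row by row with apply(axis=1), e.g. turning '+AC0-5.3' into '-5.3';
#     note this leaves everything as strings, so convert numeric columns afterwards.
cleaned = dataset.apply(
    lambda row: row.astype(str).str.replace('+AC0-', '-', regex=False),
    axis=1,
)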
You can do it this way.
Example dataframe:
Col1 Col2 Col3
0 32.1 33.2 +232
1 34.2 3.4 3.4
2 32.44 +232 32.44
3 +232 1.32 +232
4 1.3123 131.23 131.23
You want to remove rows where any column value starts with +, and also convert the columns back to float64 to pass them to your model. Then do:
for x in df.columns:
    df = df[~df[x].astype(str).str.startswith('+')]
    df[x] = df[x].astype(float)
If you do not want to loop over all columns, you can just pass the column names in a list instead of df.columns.
Final output after getting rid of rows having entries starting with +:
Col1 Col2 Col3
1 34.2000 3.40 3.40
4 1.3123 131.23 131.23
Output of df.info() showing they are now in float64 type:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2 entries, 1 to 4
Data columns (total 3 columns):
Col1 2 non-null float64
Col2 2 non-null float64
Col3 2 non-null float64
dtypes: float64(3)
memory usage: 64.0 bytes
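As noted above, to restrict the clean-up to particular columns, pass a list instead of df.columns. A minimal sketch using this answer's example frame (only Col1 and Col3 are checked and converted):
import pandas as pd

df = pd.DataFrame({'Col1': ['32.1', '34.2', '32.44', '+232', '1.3123'],
                   'Col2': ['33.2', '3.4', '+232', '1.32', '131.23'],
                   'Col3': ['+232', '3.4', '32.44', '+232', '131.23']})

for x in ['Col1', 'Col3']:
    df = df[~df[x].astype(str).str.startswith('+')].copy()
    df[x] = df[x].astype(float)

print(df)
# Rows 0 and 3 are dropped; row 2 keeps the '+232' in Col2 untouched.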
I have a problem where data is sorted by date, for example something like this:
date, value, min
2015-08-17, 3, nan
2015-08-18, 2, nan
2015-08-19, 4, nan
2015-08-28, 1, nan
2015-08-29, 5, nan
Now I want to store in the min column the minimum value seen up to and including each row, so the result would look something like this:
date, value, min
2015-08-17, 3, 3
2015-08-18, 2, 2
2015-08-19, 4, 2
2015-08-28, 1, 1
2015-08-29, 5, 1
I've tried some options but still don't see what I'm doing wrong; here is one example I tried:
data['min'] = min(data['value'], data['min'].shift())
I don't want to iterate through all rows because the data I have is big. What is the best strategy you can write using pandas for this kind of problem?
Since you mentioned that you are working with a big dataset, with a focus on performance here's one using NumPy's np.minimum.accumulate -
df['min'] = np.minimum.accumulate(df.value)
Sample run -
In [70]: df
Out[70]:
date value min
0 2015-08-17 3 NaN
1 2015-08-18 2 NaN
2 2015-08-19 4 NaN
3 2015-08-28 1 NaN
4 2015-08-29 5 NaN
In [71]: df['min'] = np.minimum.accumulate(df.value)
In [72]: df
Out[72]:
date value min
0 2015-08-17 3 3
1 2015-08-18 2 2
2 2015-08-19 4 2
3 2015-08-28 1 1
4 2015-08-29 5 1
Runtime test -
In [65]: df = pd.DataFrame(np.random.randint(0,100,(1000000)), columns=list(['value']))
# #MaxU's soln using pandas cummin
In [66]: %timeit df['min'] = df.value.cummin()
100 loops, best of 3: 6.84 ms per loop
In [67]: df = pd.DataFrame(np.random.randint(0,100,(1000000)), columns=list(['value']))
# Using NumPy
In [68]: %timeit df['min'] = np.minimum.accumulate(df.value)
100 loops, best of 3: 3.97 ms per loop
Use cummin() method:
In [53]: df['min'] = df.value.cummin()
In [54]: df
Out[54]:
date value min
0 2015-08-17 3 3
1 2015-08-18 2 2
2 2015-08-19 4 2
3 2015-08-28 1 1
4 2015-08-29 5 1
Assuming I have the following DataFrame:
A | B
1 | Ms
1 | PhD
2 | Ms
2 | Bs
I want to remove the duplicate rows with respect to column A, retaining the row with 'PhD' in column B as the original; if there is no 'PhD', I want to retain the row with 'Bs' in column B.
I am trying to use
df.drop_duplicates('A')
with a condition
Consider using Categoricals. They're a nice way to group / order text non-alphabetically (among other things).
import pandas as pd
#create a pandas dataframe for testing with two columns A integer and B string
df = pd.DataFrame([(1, 'Ms'), (1, 'PhD'),
                   (2, 'Ms'), (2, 'Bs'),
                   (3, 'PhD'), (3, 'Bs'),
                   (4, 'Ms'), (4, 'PhD'), (4, 'Bs')],
                  columns=['A', 'B'])
print("Original data")
print(df)
# force the string column B to dtype 'category'
df['B'] = df['B'].astype('category')
# define the valid categories:
df['B'] = df['B'].cat.set_categories(['PhD', 'Bs', 'Ms'], ordered=True)
# DataFrame.sort_values now uses the category order you defined
df.sort_values(['A', 'B'], inplace=True, ascending=True)
print("Now sorted by custom categories (PhD > Bs > Ms)")
print(df)
# dropping duplicates keeps first
df_unique = df.drop_duplicates('A')
print("Keep the highest value category given duplicate integer group")
print(df_unique)
Prints:
Original data
A B
0 1 Ms
1 1 PhD
2 2 Ms
3 2 Bs
4 3 PhD
5 3 Bs
6 4 Ms
7 4 PhD
8 4 Bs
Now sorted by custom categories (PhD > Bs > Ms)
A B
1 1 PhD
0 1 Ms
3 2 Bs
2 2 Ms
4 3 PhD
5 3 Bs
7 4 PhD
8 4 Bs
6 4 Ms
Keep the highest value category given duplicate integer group
A B
1 1 PhD
3 2 Bs
4 3 PhD
7 4 PhD
>>> df
A B
0 1 Ms
1 1 Ms
2 1 Ms
3 1 Ms
4 1 PhD
5 2 Ms
6 2 Ms
7 2 Bs
8 2 PhD
Sorting a dataframe with a custom function:
def sort_df(df, column_idx, key):
    '''Takes a dataframe, a column label and a custom function for sorting;
    returns a dataframe sorted by that column using that function.'''
    col = df.loc[:, column_idx]
    df = df.iloc[[i[1] for i in sorted(zip(col, range(len(col))), key=key)]]
    return df
Our function for sorting:
cmp = lambda x:2 if 'PhD' in x else 1 if 'Bs' in x else 0
In action:
sort_df(df, 'B', cmp).drop_duplicates('A', keep='last')
P.S. Older pandas versions used take_last=True instead of keep='last'; see the docs.
A B
4 1 PhD
8 2 PhD
Assuming uniqueness of B value given A value, and that each A value has a row with Bs in the B column:
df2 = df[df['B']=="PhD"]
will give you a dataframe with the PhD rows you want.
Then remove all the PhD and Ms from df:
df = df[df['B']=="Bs"]
Then concatenate df and df2:
df3 = pd.concat([df2, df])
Then you can use drop_duplicates like you wanted:
df3.drop_duplicates('A', inplace=True)
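Putting those steps together into one runnable sketch (the small frame below is a hypothetical reconstruction of the question's data):
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 2], 'B': ['Ms', 'PhD', 'Ms', 'Bs']})

df2 = df[df['B'] == "PhD"]        # rows to keep outright
rest = df[df['B'] == "Bs"]        # fall back to Bs where there is no PhD
df3 = pd.concat([df2, rest])
df3 = df3.drop_duplicates('A')    # PhD rows come first, so they win on ties
print(df3)
#    A    B
# 1  1  PhD
# 3  2   Bs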
Remove duplicates, retain the original:
Sort your rows so that the one you want to keep comes first within each group; then drop_duplicates does the right thing.
import pandas as pd
df = pd.DataFrame([(1, '2022-01-25'),
                   (1, '2022-05-25'),
                   (2, '2021-12-20'),
                   (2, '2021-11-20'),
                   (3, '2020-03-03'),
                   (3, '2020-03-04'),
                   (4, '2019-07-06'),
                   (4, '2019-07-07'),
                   (4, '2019-07-05')], columns=['A', 'B'])
print("Original data")
print(df.to_string(index=False) )
#Sort your dataframe so that the one you want is on the top:
df.sort_values(['A', 'B'], inplace=True, ascending=True)
print("custom sort")
print(df.to_string(index=False) )
# dropping duplicates this way keeps first
df_unique = df.drop_duplicates('A')
print("Keep first")
print(df_unique.to_string(index=False) )
Prints:
Original data
A B
1 2022-01-25
1 2022-05-25
2 2021-12-20
2 2021-11-20
3 2020-03-03
3 2020-03-04
4 2019-07-06
4 2019-07-07
4 2019-07-05
custom sort
A B
1 2022-01-25
1 2022-05-25
2 2021-11-20
2 2021-12-20
3 2020-03-03
3 2020-03-04
4 2019-07-05
4 2019-07-06
4 2019-07-07
Keep first
A B
1 2022-01-25
2 2021-11-20
3 2020-03-03
4 2019-07-05
In a pandas DataFrame I want to keep only the lowest value per row; all other values should be deleted.
I need the original dataset's shape intact: just replace every value that is not the row minimum with NaN.
What is the best way to do this, performance-wise?
I can also transpose the dataset if the operation is easier per column.
Thanks
Robert
Since the operation you are contemplating does not rely on the columns or index, it might be easier (and faster) to do this using NumPy rather than Pandas.
You can find the location (i.e. column index) of the minimums for each row using
idx = np.argmin(arr, axis=1)
You could then make a new array filled with NaNs and copy the minimum values
to the new array.
import numpy as np
import pandas as pd
def nan_all_but_min(df):
    arr = df.values
    idx = np.argmin(arr, axis=1)
    newarr = np.full_like(arr, np.nan, dtype='float')
    newarr[np.arange(arr.shape[0]), idx] = arr[np.arange(arr.shape[0]), idx]
    df = pd.DataFrame(newarr, columns=df.columns, index=df.index)
    return df
df = pd.DataFrame(np.random.random((4,3)))
print(df)
# 0 1 2
# 0 0.542924 0.499702 0.058555
# 1 0.682663 0.162582 0.885756
# 2 0.389789 0.648591 0.513351
# 3 0.629413 0.843302 0.862828
df = nan_all_but_min(df)
print(df)
yields
0 1 2
0 NaN NaN 0.058555
1 NaN 0.162582 NaN
2 0.389789 NaN NaN
3 0.629413 NaN NaN
Here is a benchmark comparing nan_all_but_min vs using_where:
def using_where(df):
    return df.where(df.values == df.min(axis=1)[:,None])
In [73]: df = pd.DataFrame(np.random.random(100*100).reshape(100,100))
In [74]: %timeit using_where(df)
1000 loops, best of 3: 701 µs per loop
In [75]: %timeit nan_all_but_min(df)
10000 loops, best of 3: 105 µs per loop
Note that using_where and nan_all_but_min behave differently if a row contains the same min value more than once. using_where will preserve all the mins, nan_all_but_min will preserve only one min. For example:
In [76]: using_where(pd.DataFrame([(0,0,1), (1,2,1)]))
Out[76]:
0 1 2
0 0 0 NaN
1 1 NaN 1
In [77]: nan_all_but_min(pd.DataFrame([(0,0,1), (1,2,1)]))
Out[77]:
0 1 2
0 0 NaN NaN
1 1 NaN NaN
Piggybacking off @unutbu's excellent answer, the following minor change should accommodate your modified question.
The where method
In [26]: df2 = df.copy()
In [27]: df2
Out[27]:
0 1 2
0 0 1 2
1 3 4 5
2 6 7 8
3 9 10 11
In [28]: df2.where(df2.values == df2.min(axis=1)[:,None])
Out[28]:
0 1 2
0 0 NaN NaN
1 3 NaN NaN
2 6 NaN NaN
3 9 NaN NaN
Mandatory speed test.
In [29]: df3 = pd.DataFrame(np.random.random(100*100).reshape(100,100))
In [30]: %timeit df3.where(df3.values == df3.min(axis=1)[:,None])
1000 loops, best of 3: 723 µs per loop
If your data frame already contains NaN values, use NumPy's nanmin instead, keeping the per-row axis:
df2.where(df2.values == np.nanmin(df2.values, axis=1)[:, None])
I just found and tried out the answer by unutbu.
I tried the .where approach, but it raises a FutureWarning in newer pandas (the warning is about the [:, None] indexing used on the row minimums, not about .where itself):
FutureWarning: Support for multi-dimensional indexing (e.g. `obj[:, None]`) is deprecated and will be removed in a future version. Convert to a numpy array before indexing instead.
I got the following working instead; it uses apply with a lambda, so it is most likely slower...
df = pd.DataFrame(np.random.random((4,3)))
print(df)
# 0 1 2
# 0 0.542924 0.499702 0.058555
# 1 0.682663 0.162582 0.885756
# 2 0.389789 0.648591 0.513351
# 3 0.629413 0.843302 0.862828
mask = df.apply(lambda d:(d == df.min(axis=1)))
print (df[mask])
Should yield:
0 1 2
0 NaN NaN 0.058555
1 NaN 0.162582 NaN
2 0.389789 NaN NaN
3 0.629413 NaN NaN
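For what it's worth, the FutureWarning above can be avoided exactly as the message suggests, by converting the row minimums to a NumPy array before the [:, None] indexing. A minimal sketch of the same .where idea:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.random((4, 3)))

# Convert the Series of row minimums to a NumPy array before reshaping, so no warning is raised.
row_min = df.min(axis=1).to_numpy()[:, None]
print(df.where(df.values == row_min))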