Removing matching pairs in dataframe in Python

For df:
id Date ITEM_ID TYPE GROUP
0 13710750 2019-07-01 SLM607 O X
1 13710760 2019-07-01 SLM607 O M
2 13710770 2019-07-03 SLM607 O I
3 13710780 2019-09-03 SLM607 O N
4 13667449 2019-08-02 887643 O I
5 13667450 2019-08-02 792184 O I
6 13728171 2019-09-17 SLM607 I I
7 13667452 2019-08-02 794580 O I
Reproducible example:
import pandas as pd

data = {'id': [13710750, 13710760, 13710770, 13710780, 13667449, 13667450, 13728171, 13667452],
        'Date': ['2019-07-01', '2019-07-01', '2019-07-03', '2019-09-03', '2019-08-02', '2019-08-02', '2019-09-17', '2019-08-02'],
        'ITEM_ID': ['SLM607', 'SLM607', 'SLM607', 'SLM607', '887643', '792184', 'SLM607', '794580'],
        'TYPE': ['O', 'O', 'O', 'O', 'O', 'O', 'I', 'O'],
        'GROUP': ['X', 'M', 'I', 'N', 'I', 'I', 'I', 'I']}
df = pd.DataFrame(data)
df
How can I delete pairs of rows that have the same values for ITEM_ID and GROUP, where one row has TYPE 'O' and comes first and the other has TYPE 'I' and happens later?
Expected outcome:
id Date ITEM_ID TYPE GROUP
0 13710750 2019-07-01 SLM607 O X
1 13710760 2019-07-01 SLM607 O M
3 13710780 2019-09-03 SLM607 O N
4 13667449 2019-08-02 887643 O I
5 13667450 2019-08-02 792184 O I
7 13667452 2019-08-02 794580 O I

Use shift together with groupby.filter to drop every (ITEM_ID, GROUP) group that contains an 'O' row followed later by an 'I' row:
out = df.groupby(['ITEM_ID', 'GROUP']).filter(
    lambda x: ~(x['TYPE'].eq('I') & x['TYPE'].shift().eq('O')).any())
Out[7]:
id Date ITEM_ID TYPE GROUP
0 13710750 2019-07-01 SLM607 O X
1 13710760 2019-07-01 SLM607 O M
3 13710780 2019-09-03 SLM607 O N
4 13667449 2019-08-02 887643 O I
5 13667450 2019-08-02 792184 O I
7 13667452 2019-08-02 794580 O I
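The same check can also be expressed as a row-level boolean mask via transform rather than filter; a sketch under the assumption that rows within a group may not already be in date order (hence the extra sort_values('Date'), which the sample data above does not strictly need):
# Flag every row whose (ITEM_ID, GROUP) group contains an 'O' row
# followed later by an 'I' row, then keep only the unflagged rows.
paired = (df.sort_values('Date')
            .groupby(['ITEM_ID', 'GROUP'])['TYPE']
            .transform(lambda s: (s.eq('I') & s.shift().eq('O')).any())
            .astype(bool))
out = df[~paired]
On the sample data this keeps the same six rows as the filter version.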

Related

Group Multiple columns while performing multiple aggregations in pandas

I would like to group by multiple columns and perform several different aggregations: group by type and date and take the average of en, en2, stat1 and stat2.
Data
type en en2 date stat1 stat2
aa 40 80 1/1/2021 1 1
aa 20 20 1/1/2021 2 1
aa 10 10 1/1/2021 3 5
bb 10 10 1/1/2021 3 9
bb 50 5 1/1/2021 5 1
aa 90 5 1/7/2021 5 2
aa 100 10 1/7/2021 1 5
bb 80 10 1/7/2021 5 2
Desired
type en en2 date stat1 stat2
aa 23 36 1/1/2021 2 3
bb 30 7.5 1/1/2021 4 5
aa 95 7.5 1/7/2021 3 3.5
bb 80 10 1/7/2021 5 2
Doing
grouped = final.groupby(['date'],['type']) \
               .agg({'en': 'mean', 'en2': 'mean', 'stat1': 'mean', 'stat2': 'mean'})
I am getting a TypeError: unhashable type: 'list'.
I am still researching; any suggestion is appreciated.
Try:
grouped = final.groupby(['date', 'type'], as_index=False) \
               .agg({'type': 'first', 'en': 'mean', 'en2': 'mean',
                     'date': 'first', 'stat1': 'mean', 'stat2': 'mean'})
print(grouped)
# Output
type en en2 date stat1 stat2
0 aa 23.333333 36.666667 1/1/2021 2.0 2.333333
1 bb 30.000000 7.500000 1/1/2021 4.0 5.000000
2 aa 95.000000 7.500000 1/7/2021 3.0 3.500000
3 bb 80.000000 10.000000 1/7/2021 5.0 2.000000
An alternative, selecting only the needed columns first (use mean rather than sum here, since the question asks for averages):
grouped = final[['date', 'type', 'en', 'en2', 'stat1', 'stat2']] \
    .groupby(['date', 'type'], as_index=False, dropna=False).mean()
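A tidier variant (a sketch, assuming pandas >= 0.25) is named aggregation, which avoids listing the grouping keys in the agg dict because as_index=False already returns them as columns:
grouped = (final.groupby(['date', 'type'], as_index=False)
                .agg(en=('en', 'mean'), en2=('en2', 'mean'),
                     stat1=('stat1', 'mean'), stat2=('stat2', 'mean')))
print(grouped)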

How can I fill missing data in my dataframe faster for a big dataset, and without a SettingWithCopyWarning?

I have a dataframe with the count of people per day and per location. Without any missing data, I expect to have 4 lines per day: 2 locations and 2 genders. Some data is missing and should be replaced by the mean count, but only if that location has data for that gender on the day before.
If data is missing for more than 1 day, I assume that there is supposed to be no data. So for example in my example dataframe: Day 2, Location X, Gender F should be filled, because Day 1, Location X, Gender F exists. But Day 4, Location Y, Gender F must stay empty, because Day 3, Location Y, Gender F does not exist.
The code below works for this small dataframe, but it's really slow for my large dataset. Is there a way to do this faster?
Can I avoid the SettingWithCopyWarnings in this case?
import pandas as pd
import numpy as np
import random
data = pd.DataFrame({'day': [1, 1, 2, 3, 3, 4, 5, 1, 2],
                     'location': ['X', 'X', 'X', 'X', 'X', 'X', 'X', 'Y', 'Y'],
                     'gender': ['F', 'M', 'M', 'F', 'M', 'F', 'F', 'F', 'F'],
                     'count': random.sample(range(10, 30), 9)})
print(data.sort_values('day').reset_index(drop=True))
day location gender count
0 1 X F 17
1 1 X M 10
2 1 Y F 12
3 2 X M 20
4 2 Y F 15
5 3 X F 24
6 3 X M 29
7 4 X F 11
8 5 X F 14
data2 = pd.DataFrame()
for e, today in enumerate(list(set(data['day'].sort_values()))[1:]):
    yesterday = (list(set(data['day'].sort_values()))[e])
    today_df = data[(data['day'] == today)].set_index(['location', 'gender'])
    yesterday_df = data[(data['day'] == yesterday)].set_index(['location', 'gender'])
    today_missing = [[i[0], i[1]] for i in yesterday_df.index if i not in today_df.index]
    for i in today_missing:
        new_row = data[(data['day'] == yesterday) & (data['location'] == i[0]) & (data['gender'] == i[1])]
        new_row['day'] = today
        new_row['count'] = int(np.mean(data['count'][(data['location'] == i[0]) & (data['gender'] == i[1])]))
        data2 = data2.append(new_row, ignore_index=True)
data = data.append(data2).sort_values('day').reset_index(drop=True)
print(data)
day location gender count
0 1 X F 17
1 1 X M 10
2 1 Y F 12
3 2 X M 20
4 2 Y F 15
5 2 X F 16
6 3 X F 24
7 3 X M 29
8 3 Y F 13
9 4 X F 11
10 4 X M 19
11 5 X F 14
One solution can be to regenerate the possible combinations of location, gender and day:
df = (data.set_index(['location', 'gender', 'day'])
          .reindex(pd.MultiIndex.from_product(
              [['X', 'Y'], ['F', 'M'], range(1, 8)],
              names=['location', 'gender', 'day'])))
count
location gender day
X F 1 17.0
2 NaN
3 24.0
4 11.0
5 14.0
6 NaN
7 NaN
M 1 10.0
2 20.0
3 29.0
4 NaN
5 NaN
6 NaN
7 NaN
Y F 1 12.0
2 15.0
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
M 1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
1: Solution filling with mean per location, gender group
df.groupby(['location', 'gender']).transform(lambda x: x.fillna(x.mean(), limit=1)).dropna()
count
location gender day
X F 1 17.000000
2 16.500000
3 24.000000
4 11.000000
5 14.000000
M 1 10.000000
2 20.000000
3 29.000000
4 19.666667
Y F 1 12.000000
2 15.000000
3 13.500000
2: Solution interpolating linearly between days
Another solution can be to interpolate linearly between days, with a limit of 1 day of filling (note that interpolate runs straight down the count column here, so values next to a group boundary can borrow from the neighbouring group):
df.interpolate(level=['location', 'gender'], limit=1).dropna()
count
location gender day
X F 1 17.000000
2 20.500000
3 24.000000
4 11.000000
5 14.000000
6 12.666667
M 1 10.000000
2 20.000000
3 29.000000
4 25.600000
Y F 1 12.000000
2 15.000000
3 15.000000
You can remove the MultiIndex afterwards with df.reset_index(). Hope it helps.
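As a side note, the reindex above hardcodes the locations, genders and the day range 1-7. A small sketch of the same step that derives the levels from the data instead (it only covers the observed days 1-5 rather than extending to day 7, which is a deliberate difference on my part):
full_grid = pd.MultiIndex.from_product(
    [data['location'].unique(),
     data['gender'].unique(),
     range(int(data['day'].min()), int(data['day'].max()) + 1)],
    names=['location', 'gender', 'day'])
df = data.set_index(['location', 'gender', 'day']).reindex(full_grid)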

Transforming dates in chronological order using pandas dataframe

I need help comparing dates across rows and columns and making sure that they follow chronological order.
First, I group the data by the id and group columns. Within each group, every date should be later than the one before it.
The first group [1111 + A] contains an error because the dates don't follow chronological order:
1/1/2016 > 2/20/2016 > **2/19/2016** > 4/25/2016 > **4/1/2016** > 5/1/2016
Current result
id start end group
0 1111 01/01/2016 02/20/2016 A
1 1111 02/19/2016 04/25/2016 A
2 1111 04/01/2016 05/01/2016 A
3 2345 05/01/2016 05/28/2016 B
4 2345 05/29/2016 06/28/2016 B
5 1234 08/01/2016 09/16/2016 F
6 9882 01/01/2016 08/29/2016 D
7 9992 03/01/2016 03/15/2016 C
8 9992 03/16/2016 08/03/2016 C
9 9992 05/16/2016 09/16/2016 C
10 9992 09/17/2016 10/16/2016 C
11 9992 10/17/2016 12/13/2016 C
The answer should be:
1/1/2016 > 2/20/2016 > **2/21/2016** > 4/25/2016 > **4/26/2016** > 5/1/2016
Desired output
id start end group
0 1111 01/01/2016 02/20/2016 A
1 1111 02/21/2016 04/25/2016 A
2 1111 04/26/2016 05/01/2016 A
3 2345 05/01/2016 05/28/2016 B
4 2345 05/29/2016 06/28/2016 B
5 1234 08/01/2016 09/16/2016 F
6 9882 01/01/2016 08/29/2016 D
7 9992 03/01/2016 03/15/2016 C
8 9992 03/16/2016 08/03/2016 C
9 9992 08/04/2016 09/16/2016 C
10 9992 09/17/2016 10/16/2016 C
11 9992 10/17/2016 12/13/2016 C
Any help will be greatly appreciated.
One way is to apply your logic to each group, then concatenate your groups.
# convert series to datetime
df['start'] = pd.to_datetime(df['start'])
df['end'] = pd.to_datetime(df['end'])
# iterate groups and add results to grps list
grps = []
for _, group in df.groupby(['id', 'group'], sort=False):
    end_shift = group['end'].shift()
    group.loc[group['start'] <= end_shift, 'start'] = end_shift + pd.DateOffset(1)
    grps.append(group)
# concatenate dataframes in grps to build a single dataframe
res = pd.concat(grps, ignore_index=True)
print(res)
id start end group
0 1111 2016-01-01 2016-02-20 A
1 1111 2016-02-21 2016-04-25 A
2 1111 2016-04-26 2016-05-01 A
3 2345 2016-05-01 2016-05-28 B
4 2345 2016-05-29 2016-06-28 B
5 1234 2016-08-01 2016-09-16 F
6 9882 2016-01-01 2016-08-29 D
7 9992 2016-03-01 2016-03-15 C
8 9992 2016-03-16 2016-08-03 C
9 9992 2016-08-04 2016-09-16 C
10 9992 2016-09-17 2016-10-16 C
11 9992 2016-10-17 2016-12-13 C
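The same rule can also be applied without a Python loop over the groups; a vectorized sketch (it assumes start and end have already been converted to datetimes and that rows within each group are in order, as in the answer above):
# Previous row's end within each (id, group); NaT for the first row of a group.
prev_end = df.groupby(['id', 'group'], sort=False)['end'].shift()
# Bump any start that does not come after the previous end to one day past it.
overlap = df['start'] <= prev_end
df.loc[overlap, 'start'] = prev_end[overlap] + pd.Timedelta(days=1)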
I believe this should work:
# First make sure your columns are datetimes:
df['start'] = pd.to_datetime(df['start'])
df['end'] = pd.to_datetime(df['end'])
# Get your new start times:
new_times = (df.groupby(['id', 'group'])
               .apply(lambda x: (x.end + pd.Timedelta(days=1)).shift())
               .reset_index(['id', 'group'], drop=True))
# put back into original dataframe
df.loc[new_times.notnull(), 'start'] = new_times[new_times.notnull()]
>>> df
id start end group
0 1111 2016-01-01 2016-02-20 A
1 1111 2016-02-21 2016-04-25 A
2 1111 2016-04-26 2016-05-01 A
3 2345 2016-05-01 2016-05-28 B
4 2345 2016-05-29 2016-06-28 B
5 1234 2016-08-01 2016-09-16 F
6 9882 2016-01-01 2016-08-29 D
7 9992 2016-03-01 2016-03-15 C
8 9992 2016-03-16 2016-08-03 C
9 9992 2016-08-04 2016-09-16 C
10 9992 2016-09-17 2016-10-16 C
11 9992 2016-10-17 2016-12-13 C
Explanation:
new_times looks like this:
>>> new_times
0 NaT
1 2016-02-21
2 2016-04-26
5 NaT
3 NaT
4 2016-05-29
6 NaT
7 NaT
8 2016-03-16
9 2016-08-04
10 2016-09-17
11 2016-10-17
You can then use df.loc[new_times.notnull(), 'start'] = new_times[new_times.notnull()] to find where new_times is not null (i.e. where it is not the first row in a given group), and insert those new_times into your original start column.

Resample rows for missing dates and forward fill values in all columns except one

I currently have the following sample dataframe:
No FlNo DATE Loc Type
20 1826 6/1/2017 AAA O
20 1112 6/4/2017 BBB O
20 1234 6/6/2017 CCC O
20 43 6/7/2017 DDD O
20 1840 6/8/2017 EEE O
I want to fill in the missing dates between consecutive rows. I also want to fill in the values of the non-date columns with the values from the preceding row, BUT leave the 'Type' column blank for the filled-in rows.
Please see desired output:
No FlNo DATE Loc Type
20 1826 6/1/2017 AAA O
20 1826 6/2/2017 AAA
20 1826 6/3/2017 AAA
20 1112 6/4/2017 BBB O
20 1112 6/5/2017 BBB
20 1234 6/6/2017 CCC O
20 43 6/7/2017 DDD O
20 1840 6/8/2017 EEE O
I have searched Google and Stack Overflow but did not find an answer about filling in dates in a pandas dataframe.
First, convert DATE to a datetime column using pd.to_datetime,
df.DATE = pd.to_datetime(df.DATE)
Option 1
Use resample + ffill, and then reset the Type column later. First, store the unique dates in some list:
dates = df.DATE.unique()
Now,
df = df.set_index('DATE').resample('1D').ffill().reset_index()
df.Type = df.Type.where(df.DATE.isin(dates), '')
df
DATE No FlNo Loc Type
0 2017-06-01 20 1826 AAA O
1 2017-06-02 20 1826 AAA
2 2017-06-03 20 1826 AAA
3 2017-06-04 20 1112 BBB O
4 2017-06-05 20 1112 BBB
5 2017-06-06 20 1234 CCC O
6 2017-06-07 20 43 DDD O
7 2017-06-08 20 1840 EEE O
If needed, you may bring DATE back to its original format:
df.DATE = df.DATE.dt.strftime('%m/%d/%Y')
Option 2
Another option would be asfreq + ffill + fillna:
df = df.set_index('DATE').asfreq('1D').reset_index()
c = df.columns.difference(['Type'])
df[c] = df[c].ffill()
df['Type'] = df['Type'].fillna('')
df
DATE No FlNo Loc Type
0 2017-06-01 20.0 1826.0 AAA O
1 2017-06-02 20.0 1826.0 AAA
2 2017-06-03 20.0 1826.0 AAA
3 2017-06-04 20.0 1112.0 BBB O
4 2017-06-05 20.0 1112.0 BBB
5 2017-06-06 20.0 1234.0 CCC O
6 2017-06-07 20.0 43.0 DDD O
7 2017-06-08 20.0 1840.0 EEE O
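Note that Option 2 leaves No and FlNo as floats because asfreq introduces NaNs before the ffill. If the integer dtypes matter, they can be cast back once the columns are fully filled; a small follow-up sketch, not part of the original answer:
# Restore integer dtypes after the forward fill (assumes no NaNs remain in these columns).
df[['No', 'FlNo']] = df[['No', 'FlNo']].astype(int)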

dropping columns by condition where dtypes are string and numeric

I have the following data (# of columns can vary):
NAME ID POTENTIAL_VOTERS VOTES SPOILT_VOTES LEGAL_VOTES אמת ג ודעם ז ... נץ ע פה ף ףץ קנ קץ רק שס voter_turnout
0 תל אביב - יפו 5000 403338 263205 1860 261345 89567 2628 8488 9 ... 34 132 30241 105 124 2667 2906 209 10189 0.647955
1 ירושלים 3000 385888 258879 3593 255286 24696 53948 3148 10 ... 54 215 10752 37 148 1619 18330 121 30579 0.661555
2 חיפה 4000 243274 151318 1758 149560 37805 4894 12363 24 ... 16 103 16826 40 87 1596 1648 142 3342 0.614780
3 ראשון לציון 8300 195958 138998 1188 137810 31492 924 86 8 ... 16 5 19953 26 68 1821 2258 121 4095 0.703263
4 פתח תקווה 7900 177367 125633 1223 124410 22103 4810 85 8 ... 14 9 14661 15 65 1224 3227 74 6946 0.701427
5 אשדוד 70 170193 115145 1942 113203 9694 11132 33 7 ... 14 10 8841 26 74 1322 4180 80 11923 0.665145
6 נתניה 7400 168914 106738 1270 105468 14575 2921 65 5 ... 14 9 11035 40 63 1089 3177 103 8319 0.624389
When I try to remove columns by a condition on their sum (if a column's total is less than 40000 I don't need it), using this code:
df.drop([col for col, val in df.sum().iteritems() if val < 40000], axis=1, inplace=True)
I am getting the following error:
TypeError: '<' not supported between instances of 'str' and 'int'
I assume this is because some of the columns are not integers (as they have text). Any idea how to solve this?
The problem here is that sum will concatenate all the strings; you need to select just the numeric dtypes from the df and then filter those:
In[27]:
df = pd.DataFrame({'a': list('abcd'), 'b':np.random.randn(4), 'c':np.arange(4)})
df
Out[27]:
a b c
0 a -0.053771 0
1 b 0.124416 1
2 c -2.024073 2
3 d -2.541324 3
We can select just the numeric types using select_dtypes and pass np.number
In[28]:
df1 = df.select_dtypes([np.number])
df1
Out[28]:
b c
0 -0.053771 0
1 0.124416 1
2 -2.024073 2
3 -2.541324 3
Now we can filter the columns:
In[29]:
df1.loc[:,df1.sum() > 1]
Out[29]:
c
0 0
1 1
2 2
3 3
You can see that sum returns the strings concatenated:
In[30]:
df.sum()
Out[30]:
a abcd
b -4.49475
c 6
dtype: object
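To apply this to the original question's frame while keeping the string columns, the two steps can be combined; a minimal sketch, assuming df here is the question's full dataframe and using its 40000 threshold (np is numpy, as above):
num = df.select_dtypes([np.number])                    # numeric columns only
df = df.drop(columns=num.columns[num.sum() < 40000])   # drop low-sum numeric columns, keep the rest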
If you need to remove only the numeric columns that fail the condition, keeping all string columns:
df = pd.DataFrame({'A': list('abcdef'),
                   'B': [4, 5, 4, 100005, 5, 4],
                   'C': [7, 8, 9, 4, 2, 3],
                   'D': [10111, 30000, 5, 7, 1, 0],
                   'E': [5, 3, 6, 9, 2, 4],
                   'F': list('aaabbb')})
print (df)
A B C D E F
0 a 4 7 10111 5 a
1 b 5 8 30000 3 a
2 c 4 9 5 6 a
3 d 100005 4 7 9 b
4 e 5 2 1 2 b
5 f 4 3 0 4 b
k = 40000
a = df.loc[:, pd.to_numeric(df.sum(), errors='coerce').fillna(k + 1) > k]
print (a)
A B D F
0 a 4 10111 a
1 b 5 30000 a
2 c 4 5 a
3 d 100005 7 b
4 e 5 1 b
5 f 4 0 b
Detail:
First, convert the summed Series with to_numeric and errors='coerce' to replace the unparseable string columns with NaN:
print (pd.to_numeric(df.sum(), errors='coerce'))
A NaN
B 100027.0
C 33.0
D 40124.0
E 29.0
F NaN
dtype: float64
Then replace the NaNs with k + 1 so that the non-numeric columns pass the filter:
print (pd.to_numeric(df.sum(), errors='coerce').fillna(k + 1))
A 40001.0
B 100027.0
C 33.0
D 40124.0
E 29.0
F 40001.0
dtype: float64
Finally, compare with k:
print (pd.to_numeric(df.sum(), errors='coerce').fillna(k + 1) > k)
A True
B True
C False
D True
E False
F True
dtype: bool
And filter by boolean indexing:
print (df.loc[:, pd.to_numeric(df.sum(), errors='coerce').fillna(k + 1) > k])
A B D F
0 a 4 10111 a
1 b 5 30000 a
2 c 4 5 a
3 d 100005 7 b
4 e 5 1 b
5 f 4 0 b
---
Alternative solution: omit the string columns from the sum, then add True entries for them back into the mask with reindex:
df = df.loc[:, (df.sum(numeric_only=True) > 40000).reindex(df.columns, fill_value=True)]
print (df)
A B D F
0 a 4 10111 a
1 b 5 30000 a
2 c 4 5 a
3 d 100005 7 b
4 e 5 1 b
5 f 4 0 b
Detail:
First, sum only the numeric columns using the parameter numeric_only=True:
print (df.sum(numeric_only=True))
B 100027
C 33
D 40124
E 29
dtype: int64
Compare with 40000:
print (df.sum(numeric_only=True) > 40000)
B True
C False
D True
E False
dtype: bool
Add the string columns back with reindex:
print ((df.sum(numeric_only=True) > 40000).reindex(df.columns, fill_value=True))
A True
B True
C False
D True
E False
F True
dtype: bool
Finally, filter:
print (df.loc[:, (df.sum(numeric_only=True) > 40000).reindex(df.columns, fill_value=True)])
A B D F
0 a 4 10111 a
1 b 5 30000 a
2 c 4 5 a
3 d 100005 7 b
4 e 5 1 b
5 f 4 0 b
sum has a parameter numeric_only that you can make use of.
df.drop(
    [col for col, greater in (df.sum(numeric_only=True) > 40000).to_dict().items()
     if not greater],
    axis=1, inplace=True
)
