Appreciate any help on this.
Let's say I have the following df with two columns:
col1 col2
NaN NaN
11 100
12 110
15 115
NaN NaN
NaN NaN
NaN NaN
9 142
12 144
NaN NaN
NaN NaN
NaN NaN
6 155
9 156
7 161
NaN NaN
NaN NaN
I'd like to forward fill and replace the NaN values with the median of the preceding non-NaN values. For example, the median of 11, 12, 15 in 'col1' is 12, so the NaN values should be filled with 12 until the next non-NaN values appear in the column, and so on down the frame. See below the expected df:
col1 col2
NaN NaN
11 100
12 110
15 115
12 110
12 110
12 110
9 142
12 144
10.5 143
10.5 143
10.5 143
6 155
9 156
7 161
7 156
7 156
Try:
# Label each run of consecutive NaN / non-NaN values with its own group id
m1 = (df.col1.isna() != df.col1.isna().shift(1)).cumsum()
m2 = (df.col2.isna() != df.col2.isna().shift(1)).cumsum()

# Each NaN run's group median is NaN, so ffill propagates the
# previous non-NaN run's median into it before fillna uses it
df["col1"] = df["col1"].fillna(
    df.groupby(m1)["col1"].transform("median").ffill()
)
df["col2"] = df["col2"].fillna(
    df.groupby(m2)["col2"].transform("median").ffill()
)
print(df)
Prints:
col1 col2
0 NaN NaN
1 11.0 100.0
2 12.0 110.0
3 15.0 115.0
4 12.0 110.0
5 12.0 110.0
6 12.0 110.0
7 9.0 142.0
8 12.0 144.0
9 10.5 143.0
10 10.5 143.0
11 10.5 143.0
12 6.0 155.0
13 9.0 156.0
14 7.0 161.0
15 7.0 156.0
16 7.0 156.0
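The trick here is the cumsum mask: comparing each row's NaN-ness with the previous row's marks the start of every run, and the cumulative sum turns those change points into group ids. A minimal sketch of just that step, on a shortened version of col1:

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 11, 12, 15, np.nan, np.nan, 9])

# True where the NaN-ness changes relative to the previous row;
# cumsum turns each change point into a new group id
runs = (s.isna() != s.isna().shift(1)).cumsum()
print(runs.tolist())  # [1, 2, 2, 2, 3, 3, 4]
```

Group 2 (the values 11, 12, 15) and group 3 (the NaN run) get distinct ids, which is what lets the groupby median of group 2 be forward-filled into group 3.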
IIUC, if we fill null values like so:
Fill with the median of the last 3 non-null items.
Fill with the median of the last 2 non-null items.
Forward fill the remaining values.
We'll get what you're looking for.
out = (df.combine_first(df.rolling(4,3).median())
.combine_first(df.rolling(3,2).median())
.ffill())
print(out)
Output:
col1 col2
0 NaN NaN
1 11.0 100.0
2 12.0 110.0
3 15.0 115.0
4 12.0 110.0
5 12.0 110.0
6 12.0 110.0
7 9.0 142.0
8 12.0 144.0
9 10.5 143.0
10 10.5 143.0
11 10.5 143.0
12 6.0 155.0
13 9.0 156.0
14 7.0 161.0
15 7.0 156.0
16 7.0 156.0
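Note the window sizes are tuned to this data: rolling(4, 3) covers a run of three values plus the first NaN, and rolling(3, 2) covers the two-value runs; for arbitrary run lengths the groupby approach is more general. A condensed, runnable sketch of the same chain on a single column (values taken from the example):

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 11, 12, 15, np.nan, np.nan, np.nan, 9])

# Keep existing values, fall back to rolling medians, then forward fill
out = (s.combine_first(s.rolling(4, 3).median())
        .combine_first(s.rolling(3, 2).median())
        .ffill())
print(out.tolist())  # [nan, 11.0, 12.0, 15.0, 12.0, 12.0, 12.0, 9.0]
```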
I am trying to merge two dataframes of different sizes based on a partial match between the columns 'name' and 'sub_name' (and a full match of the values in column A):
sub_name val_1 A
0 AAA 2 208
1 AAB 4 208
2 AAC 8 208
3 BAA 7 210
4 CAA 4 213
5 CAC 6 213
6 CAD 2 213
7 CAE 3 213
8 EAA 8 222
9 FAA 3 223
name val_2 A
0 XAAA 1 208
1 AABYY 5 208
2 BAAZ 9 210
3 CAAY 5 213
4 YCABX 8 213
5 XXCAC 6 213
6 YCADZ 3 213
7 XDAA 6 215
8 EAAX 4 222
The code:
df1 = pd.DataFrame({'sub_name': ['AAA','AAB','AAC','BAA','CAA','CAC','CAD','CAE','EAA', 'FAA'],
'val_1': [2,4,8,7,4,6,2,3,8,3],
'A':[208,208,208,210,213,213,213,213,222,223]})
df2 = pd.DataFrame({'name': ['XAAA','AABYY','BAAZ','CAAY','YCABX','XXCAC','YCADZ','XDAA','EAAX'],
'val_2': [1,5,9,5,8,6,3,6,4],
'A': [208,208,210,213,213,213,213,215,222]})
Edit: I want to do an outer merge of these two dataframes: if there is no match, keep the rows; if there is a partial match (between sub_name and name) and the values in column A also match, merge them together. If there is a partial match between name and sub_name but the column A values don't match, keep both rows.
I am trying to obtain:
name val_1 val_2 A
0 AAA 2.0 1.0 208
1 AAB 4.0 5.0 208
2 AAC 8.0 NaN 208
3 BAA 7.0 9.0 210
4 CAA 4.0 5.0 213
5 YCABX NaN 8.0 213
6 CAC 6.0 6.0 213
7 CAD 2.0 3.0 213
8 CAE 3.0 NaN 213
9 XDAA NaN 6.0 215
10 EAA 8.0 4.0 222
11 FAA 3.0 NaN 223
or this (it doesn't matter if I keep the full name or just the sub_name where the rows match):
name val_1 val_2 A
0 XAAA 2.0 1.0 208
1 AABYY 4.0 5.0 208
2 AAC 8.0 NaN 208
3 BAAZ 7.0 9.0 210
4 CAAY 4.0 5.0 213
5 YCABX NaN 8.0 213
6 XXCAC 6.0 6.0 213
7 YCADZ 2.0 3.0 213
8 CAE 3.0 NaN 213
9 XDAA NaN 6.0 215
10 EAA 8.0 4.0 222
11 FAA 3.0 NaN 223
If I needed a full match I would use pd.merge(df1, df2, how='outer'), but since I am working with substrings I don't know how to approach this. Maybe str.contains() could be useful?
Note: The sub_name can be made of more than just three letters. This is just an example.
# Create a new column: the df1 sub_name values found inside df2's name
df2['sub_name'] = df2['name'].str.findall('|'.join(df1['sub_name'].values.tolist())).str.join(',')

# Do an outer merge on the extracted sub_name and on A
df_new = df2.merge(df1, how='outer', on=['sub_name', 'A'])

# Restore the name column's values and drop the helper sub_name column
df_new = df_new.assign(name=df_new['name'].combine_first(df_new['sub_name'])).drop(columns=['sub_name'])
Outcome:
name val_2 A val_1
0 XAAA 1.0 208 2.0
1 AABYY 5.0 208 4.0
2 BAAZ 9.0 210 7.0
3 CAAY 5.0 213 4.0
4 YCABX 8.0 213 NaN
5 XXCAC 6.0 213 6.0
6 YCADZ 3.0 213 2.0
7 XDAA 6.0 215 NaN
8 EAAX 4.0 222 8.0
9 AAC NaN 208 8.0
10 CAE NaN 213 3.0
11 FAA NaN 223 3.0
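The extraction step builds a single regex alternation from all sub_name values; names containing none of them come out empty, and a name containing several would get them joined with a comma. A minimal sketch of just that step (with a reduced set of the example names):

```python
import pandas as pd

sub_names = ['AAA', 'AAB', 'BAA']
names = pd.Series(['XAAA', 'AABYY', 'ZZZZ'])

# Find every sub_name occurring in each name, joined with ','
extracted = names.str.findall('|'.join(sub_names)).str.join(',')
print(extracted.tolist())  # ['AAA', 'AAB', '']
```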
You can use a fuzzy match with a threshold, then merge:
from thefuzz import process

def best(x, thresh=80):
    # Return the best fuzzy match for x in df2['name'], or None below thresh
    match, score = process.extractOne(x, df2['name'])
    if score >= thresh:
        return match

df1.merge(df2, left_on=['A', df1['sub_name'].apply(best)],
          right_on=['A', 'name'],
          how='outer')
Output:
sub_name val_1 A name val_2
0 AAA 2.0 208 XAAA 1.0
1 AAB 4.0 208 AABYY 5.0
2 AAC 8.0 208 None NaN
3 BAA 7.0 210 BAAZ 9.0
4 CAA 4.0 213 CAAY 5.0
5 CAC 6.0 213 XXCAC 6.0
6 CAD 2.0 213 YCADZ 3.0
7 CAE 3.0 213 None NaN
8 EAA 8.0 222 EAAX 4.0
9 FAA 3.0 223 None NaN
10 NaN NaN 213 YCABX 8.0
11 NaN NaN 215 XDAA 6.0
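If installing the third-party thefuzz package is not an option, the standard library's difflib can serve as a rough substitute; note its ratios are scaled 0–1 rather than 0–100, so the threshold differs. A sketch (best_difflib is a hypothetical helper name, not part of any library):

```python
import difflib

def best_difflib(x, candidates, thresh=0.5):
    """Return the closest candidate to x, or None if nothing clears the cutoff."""
    matches = difflib.get_close_matches(x, candidates, n=1, cutoff=thresh)
    return matches[0] if matches else None

print(best_difflib('AAB', ['XAAA', 'AABYY', 'BAAZ']))  # AABYY
```

It could be plugged into the same merge via df1['sub_name'].apply(lambda x: best_difflib(x, df2['name'])), though difflib's similarity measure differs from thefuzz's, so the threshold would need tuning.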
Given a dataframe as follows:
id value1 value2
0 3918703 62.0 64.705882
1 3919144 60.0 60.000000
2 3919534 62.5 30.000000
3 3919559 55.0 55.000000
4 3920438 82.0 82.031250
5 3920463 71.0 71.428571
6 3920502 70.0 69.230769
7 3920535 80.0 40.000000
8 3920674 62.0 62.222222
9 3920856 80.0 79.987176
I want to check whether value2 is within plus or minus 10% of value1, and return a new column results_review.
If it's not in the required range, indicate No as the results_review value.
id value1 value2 results_review
0 3918703 62.0 64.705882 NaN
1 3919144 60.0 60.000000 NaN
2 3919534 62.5 30.000000 no
3 3919559 55.0 55.000000 NaN
4 3920438 82.0 82.031250 NaN
5 3920463 71.0 71.428571 NaN
6 3920502 70.0 69.230769 NaN
7 3920535 80.0 40.000000 no
8 3920674 62.0 62.222222 NaN
9 3920856 80.0 79.987176 NaN
How can I do that in Pandas? Thanks for your help in advance.
Use Series.between with DataFrame.loc:
m = df['value2'].between(df['value1'].mul(0.9), df['value1'].mul(1.1))
df.loc[~m, 'results_review'] = 'no'
print(df)
id value1 value2 results_review
0 3918703 62.0 64.705882 NaN
1 3919144 60.0 60.000000 NaN
2 3919534 62.5 30.000000 no
3 3919559 55.0 55.000000 NaN
4 3920438 82.0 82.031250 NaN
5 3920463 71.0 71.428571 NaN
6 3920502 70.0 69.230769 NaN
7 3920535 80.0 40.000000 no
8 3920674 62.0 62.222222 NaN
9 3920856 80.0 79.987176 NaN
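Self-contained, the mask step looks like this (data reduced to three rows from the example; the 0.9/1.1 multipliers are the plus/minus 10% bounds):

```python
import pandas as pd

df = pd.DataFrame({'value1': [62.0, 62.5, 80.0],
                   'value2': [64.705882, 30.0, 40.0]})

# True where value2 lies within +/-10% of value1
m = df['value2'].between(df['value1'].mul(0.9), df['value1'].mul(1.1))
df.loc[~m, 'results_review'] = 'no'
print(m.tolist())  # [True, False, False]
```

Rows where the mask is True are never assigned, so they come out NaN in results_review, matching the expected output.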
I have this data frame:
ID date X1 X2 Y
A 16-07-19 58 50 0
A 21-07-19 28 74 0
B 25-07-19 54 65 1
B 27-07-19 50 30 0
B 29-07-19 81 61 0
C 30-07-19 55 29 0
C 31-07-19 97 69 1
C 03-08-19 13 48 1
D 19-07-18 77 27 1
D 20-07-18 68 50 1
D 22-07-18 89 57 1
D 23-07-18 46 70 0
D 26-07-18 56 13 0
E 06-08-19 47 35 1
I want to "stretch" the data by date, from the first row to the last row of each ID (groupby),
and fill the missing values with NaN.
For example: ID A has two rows, on 16-07-19 and 21-07-19.
After the implementation, it should have 6 rows covering 16-21 July 2019.
Expected result:
ID date X1 X2 Y
A 16-07-19 58.0 50.0 0.0
A 17-07-19 NaN NaN NaN
A 18-07-19 NaN NaN NaN
A 19-07-19 NaN NaN NaN
A 20-07-19 NaN NaN NaN
A 21-07-19 28.0 74.0 0.0
B 25-07-19 54.0 65.0 1.0
B 26-07-19 NaN NaN NaN
B 27-07-19 50.0 30.0 0.0
B 28-07-19 NaN NaN NaN
B 29-07-19 81.0 61.0 0.0
C 30-07-19 55.0 29.0 0.0
C 31-07-19 97.0 69.0 1.0
C 01-08-19 NaN NaN NaN
C 02-08-19 NaN NaN NaN
C 03-08-19 13.0 48.0 1.0
D 19-07-18 77.0 27.0 1.0
D 20-07-18 68.0 50.0 1.0
D 21-07-18 NaN NaN NaN
D 22-07-18 89.0 57.0 1.0
D 23-07-18 46.0 70.0 0.0
D 24-07-18 NaN NaN NaN
D 25-07-18 NaN NaN NaN
D 26-07-18 56.0 13.0 0.0
E 06-08-19 47.0 35.0 1.0
Use DataFrame.asfreq per group, working with a DatetimeIndex:
df['date'] = pd.to_datetime(df['date'], dayfirst=True)
cols = df.columns.difference(['date','ID'], sort=False)
df = df.set_index('date').groupby('ID')[cols].apply(lambda x: x.asfreq('d')).reset_index()
print (df)
ID date X1 X2 Y
0 A 2019-07-16 58.0 50.0 0.0
1 A 2019-07-17 NaN NaN NaN
2 A 2019-07-18 NaN NaN NaN
3 A 2019-07-19 NaN NaN NaN
4 A 2019-07-20 NaN NaN NaN
5 A 2019-07-21 28.0 74.0 0.0
6 B 2019-07-25 54.0 65.0 1.0
7 B 2019-07-26 NaN NaN NaN
8 B 2019-07-27 50.0 30.0 0.0
9 B 2019-07-28 NaN NaN NaN
10 B 2019-07-29 81.0 61.0 0.0
11 C 2019-07-30 55.0 29.0 0.0
12 C 2019-07-31 97.0 69.0 1.0
13 C 2019-08-01 NaN NaN NaN
14 C 2019-08-02 NaN NaN NaN
15 C 2019-08-03 13.0 48.0 1.0
16 D 2018-07-19 77.0 27.0 1.0
17 D 2018-07-20 68.0 50.0 1.0
18 D 2018-07-21 NaN NaN NaN
19 D 2018-07-22 89.0 57.0 1.0
20 D 2018-07-23 46.0 70.0 0.0
21 D 2018-07-24 NaN NaN NaN
22 D 2018-07-25 NaN NaN NaN
23 D 2018-07-26 56.0 13.0 0.0
24 E 2019-08-06 47.0 35.0 1.0
Another idea with DataFrame.reindex per group:
df['date'] = pd.to_datetime(df['date'], dayfirst=True)
cols = df.columns.difference(['date','ID'], sort=False)
f = lambda x: x.reindex(pd.date_range(x.index.min(), x.index.max()))
df = df.set_index('date').groupby('ID')[cols].apply(f).reset_index()
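The reindex variant can be verified on a reduced version of the example (two IDs, one value column): each group's index is rebuilt as the full daily range between its first and last date, so the gap rows appear as NaN.

```python
import pandas as pd

df = pd.DataFrame({'ID': ['A', 'A', 'B', 'B'],
                   'date': ['16-07-19', '21-07-19', '25-07-19', '27-07-19'],
                   'X1': [58, 28, 54, 50]})
df['date'] = pd.to_datetime(df['date'], dayfirst=True)

# Reindex each ID's rows onto the full daily range it spans
f = lambda x: x.reindex(pd.date_range(x.index.min(), x.index.max()))
out = df.set_index('date').groupby('ID')[['X1']].apply(f).reset_index()
print(len(out))  # 9 -> A spans 6 days, B spans 3
```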
Here is my sort jitsu:
def sort_by_date(dataf):
    # rule 1: sort ascending by date
    dataf['Current'] = pd.to_datetime(dataf.Current)
    dataf = dataf.sort_values(by=['Current'], ascending=True)
    # rule 2: keep only dates between 1/1/2020 and 12/31/2022
    mask = (dataf['Current'] > '1/1/2020') & (dataf['Current'] <= '12/31/2022')
    dataf = dataf.loc[mask]
    return dataf
You can modify this code to sort by date for your solution.
Next, let's select the columns per group:
Week1 = WeeklyDF.groupby('ID')
Week1_Report = Week1[['ID', 'date', 'X1', 'X2', 'Y']]
Week1_Report
Lastly, let's replace the NaN values:
Week1_Report['X1'].fillna("X1 is 0", inplace=True)
Week1_Report['X2'].fillna("X2 is 0", inplace=True)
Week1_Report['Y'].fillna("Y is 0", inplace=True)
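If the intent is numeric zeros rather than marker strings, several columns can also be filled in one call by passing a dict to DataFrame.fillna (a sketch using the question's column names):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'X1': [58.0, np.nan], 'X2': [50.0, np.nan], 'Y': [0.0, np.nan]})

# One fillna call, with a per-column fill value
filled = df.fillna({'X1': 0, 'X2': 0, 'Y': 0})
print(filled.loc[1].tolist())  # [0.0, 0.0, 0.0]
```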
I'm learning Pandas and trying things out in it. However, I got stuck when adding a new column, because the new columns have a larger index number. I would like to add more than 3 columns.
Here is my code:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
example=pd.read_excel("C:/Users/ömer sarı/AppData/Local/Programs/Python/Python35-32/data_analysis/pydata-book-master/example.xlsx",names=["a","b","c","d","e","f","g"])
dropNan=example.dropna()
#print(dropNan)
Fillup=example.fillna(-99)
#print(Fillup)
Countt=Fillup.get_dtype_counts()
#print(Countt)
date=pd.date_range("2014-01-01","2014-01-15",freq="D")
#print(date)
mm=date.month
yy=date.year
dd=date.day
df=pd.DataFrame(example)
print(df)
x=[i for i in yy]
print(x)
df["year"]=df[x]
Here is the example dataset:
a b c d e f g
0 1 1.0 2.0 5 3 11.0 57.0
1 2 4.0 6.0 10 6 22.0 59.0
2 3 9.0 12.0 15 9 33.0 61.0
3 4 16.0 20.0 20 12 44.0 63.0
4 5 25.0 NaN 25 15 NaN 65.0
5 6 NaN 42.0 30 18 66.0 NaN
6 7 49.0 56.0 35 21 77.0 69.0
7 8 64.0 72.0 40 24 88.0 71.0
8 9 81.0 NaN 45 27 99.0 73.0
9 10 NaN 110.0 50 30 NaN 75.0
10 11 121.0 NaN 55 33 121.0 77.0
11 12 144.0 156.0 60 36 132.0 NaN
12 13 169.0 182.0 65 39 143.0 81.0
13 14 196.0 NaN 70 42 154.0 83.0
14 15 225.0 240.0 75 45 165.0 85.0
Here is the error message:
IndexError: indices are out-of-bounds
After that, I tried this one and got a new error:
df=pd.DataFrame(range(len(x)),index=x, columns=["a","b","c","d","e","f","g"])
pandas.core.common.PandasError: DataFrame constructor not properly called!
It is just a learning exercise: how can I add the date, split into parts, as new columns like ["date", "year", "month", "day", ...]?
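For what the question seems to be after, one possible sketch (assuming the goal is one row per date, with the date parts as separate columns via the .dt accessor):

```python
import pandas as pd

dates = pd.date_range('2014-01-01', '2014-01-15', freq='D')
df = pd.DataFrame({'date': dates})

# Split the date into separate year/month/day columns
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day'] = df['date'].dt.day
print(df.head(3))
```

These columns align by position with any existing frame of the same length, so they could be assigned onto the example df directly.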
I have a list of students in a CSV file. I want (using Python) to display four columns showing the male students who have the highest marks in Maths, Computer, and Physics.
I tried to use the pandas library.
marks = pd.concat([data['name'],
data.loc[data['students']==1, 'maths'].nlargest(n=10)], 'computer'].nlargest(n=10)], 'physics'].nlargest(n=10)])
I used 1 for male students and 0 for female students.
It gives me an error saying: Invalid syntax.
Here's a way to show the top 10 students in each of the disciplines. You could of course just sum the three scores and select the students with the highest total if you want the combined as opposed to the individual performance (see illustration below).
import random

import numpy as np
import pandas as pd

df1 = pd.DataFrame(data={'name': [''.join(random.choice('abcdefgh') for _ in range(8)) for i in range(100)],
                         'students': np.random.randint(0, 2, size=100)})
df2 = pd.DataFrame(data=np.random.randint(0, 10, size=(100, 3)), columns=['math', 'physics', 'computers'])
data = pd.concat([df1, df2], axis=1)
data.info()
RangeIndex: 100 entries, 0 to 99
Data columns (total 5 columns):
name 100 non-null object
students 100 non-null int64
math 100 non-null int64
physics 100 non-null int64
computers 100 non-null int64
dtypes: int64(4), object(1)
memory usage: 4.0+ KB
res = pd.concat([data.loc[:, ['name']],
                 data.loc[data['students'] == 1, 'math'].nlargest(n=10),
                 data.loc[data['students'] == 1, 'physics'].nlargest(n=10),
                 data.loc[data['students'] == 1, 'computers'].nlargest(n=10)],
                axis=1)
res.dropna(how='all', subset=['math', 'physics', 'computers'])
name math physics computers
0 geghhbce NaN 9.0 NaN
1 hbbdhcef NaN 7.0 NaN
4 ghgffgga NaN NaN 8.0
6 hfcaccgg 8.0 NaN NaN
14 feechdec NaN NaN 8.0
15 dfaabcgh 9.0 NaN NaN
16 ghbchgdg 9.0 NaN NaN
23 fbeggcha NaN NaN 9.0
27 agechbcf 8.0 NaN NaN
28 bcddedeg NaN NaN 9.0
30 hcdgbgdg NaN 8.0 NaN
38 fgdfeefd NaN NaN 9.0
39 fbcgbeda 9.0 NaN NaN
41 agbdaegg 8.0 NaN 9.0
49 adgbefgg NaN 8.0 NaN
50 dehdhhhh NaN NaN 9.0
55 ccbaaagc NaN 8.0 NaN
68 hhggfffe 8.0 9.0 NaN
71 bhggbheg NaN 9.0 NaN
84 aabcefhf NaN NaN 9.0
85 feeeefbd 9.0 NaN NaN
86 hgeecacc NaN 8.0 NaN
88 ggedgfeg 9.0 8.0 NaN
89 faafgbfe 9.0 NaN 9.0
94 degegegd NaN 8.0 NaN
99 beadccdb NaN NaN 9.0
data['total'] = data.loc[:, ['math', 'physics', 'computers']].sum(axis=1)
data[data.students==1].nlargest(10, 'total').sort_values('total', ascending=False)
name students math physics computers total
29 fahddafg 1 8 8 8 24
79 acchhcdb 1 8 9 7 24
9 ecacceff 1 7 9 7 23
16 dccefaeb 1 9 9 4 22
92 dhaechfb 1 4 9 9 22
47 eefbfeef 1 8 8 5 21
60 bbfaaada 1 4 7 9 20
82 fbbbehbf 1 9 3 8 20
18 dhhfgcbb 1 8 8 3 19
1 ehfdhegg 1 5 7 6 18