Even though I was googling a lot, I couldn't find the solution for my problem.
I have dataframe
filter10 REF
0 NaN 0.00
1 NaN 0.75
2 NaN 1.50
3 NaN 2.25
4 NaN 3.00
5 NaN 3.75
6 NaN 4.50
...
15 2.804688 11.25
16 3.021875 12.00
17 3.578125 12.75
18 3.779688 13.50
...
27 NaN 20.25
28 NaN 21.00
29 NaN 21.75
30 NaN 22.50
31 6.746875 NaN
32 NaN NaN
...
I would like now to add the column df['DIFF'] where function goes through whole column filter10 and when it is the number it finds closest number in REF column.
And afterwards calculate the difference between them and put it the same row as number in filter10 is.
I would like this output:
filter10 REF DIFF
0 NaN 0.00 NaN
1 NaN 0.75 NaN
2 NaN 1.50 NaN
3 NaN 2.25 NaN
4 NaN 3.00 NaN
5 NaN 3.75 NaN
6 NaN 4.50 NaN
...
15 2.804688 11.25 -0.195312 # 2.804688 - 3 (find closest value in REF) = -0.195312
16 3.021875 12.00 0.021875
17 3.578125 12.75 -0.171875
18 3.779688 13.50 0.029688
...
27 NaN 20.25 NaN
28 NaN 21.00 NaN
29 NaN 21.75 NaN
30 NaN 22.50 NaN
31 6.746875 NaN -0.003125
32 NaN NaN NaN
...
Use pandas.merge_asof to find the nearest value:
df['DIFF'] = (pd.merge_asof(df['filter10'].dropna().sort_values().reset_index(),
df[['REF']].dropna().sort_values('REF'),
left_on='filter10', right_on='REF', direction='nearest')
.set_index('index')['REF'].rsub(df['filter10'])
)
Output:
filter10 REF DIFF
0 NaN 0.00 NaN
1 NaN 0.75 NaN
2 NaN 1.50 NaN
3 NaN 2.25 NaN
4 NaN 3.00 NaN
5 NaN 3.75 NaN
6 NaN 4.50 NaN
15 2.804688 11.25 -0.195312
16 3.021875 12.00 0.021875
17 3.578125 12.75 -0.171875
18 3.779688 13.50 0.029688
27 NaN 20.25 NaN
28 NaN 21.00 NaN
29 NaN 21.75 NaN
30 NaN 22.50 NaN
31 6.746875 NaN 2.246875 # likely different due to missing data
32 NaN NaN NaN
As an alternative, one can use cKDTree for this:
from io import StringIO
import pandas as pd
s=""" filter10 REF
0 NaN 0.00
1 NaN 0.75
2 NaN 1.50
3 NaN 2.25
4 NaN 3.00
5 NaN 3.75
6 NaN 4.50
15 2.804688 11.25
16 3.021875 12.00
17 3.578125 12.75
18 3.779688 13.50
27 NaN 20.25
28 NaN 21.00
29 NaN 21.75
30 NaN 22.50
31 6.746875 NaN
32 NaN NaN"""
df = pd.read_csv(StringIO(s), delimiter=r"\s+")
from scipy.spatial import cKDTree # or simply KDTree
tree = cKDTree(df.REF.values[:,None]) # needs to be an nxm array, hence the [:,None] which is called a broadcast in numpy world
df['DIFF'] = df.filter10 - np.array([df.REF[i] if not np.isinf(dist) else np.nan for dist,i in [tree.query(x,1) for x in df.filter10]])
# filter10 REF DIFF
#0 NaN 0.00 NaN
#1 NaN 0.75 NaN
#2 NaN 1.50 NaN
#3 NaN 2.25 NaN
#4 NaN 3.00 NaN
#5 NaN 3.75 NaN
#6 NaN 4.50 NaN
#15 2.804688 11.25 -0.195312
#16 3.021875 12.00 0.021875
#17 3.578125 12.75 -0.171875
#18 3.779688 13.50 0.029688
#27 NaN 20.25 NaN
#28 NaN 21.00 NaN
#29 NaN 21.75 NaN
#30 NaN 22.50 NaN
#31 6.746875 NaN 2.246875
#32 NaN NaN NaN
The query method returns infinity when the point in question is nan.
I'm working on this raw data frame that needs some cleaning. So far, I have transformed this xlsx file
into this pandas dataframe:
print(df.head(16))
date technician alkalinity colour uv ph turbidity \
0 2020-02-01 00:00:00 Catherine 24.5 33 0.15 7.24 1.53
1 Unnamed: 2 NaN NaN NaN NaN NaN 2.31
2 Unnamed: 3 NaN NaN NaN NaN NaN 2.08
3 Unnamed: 4 NaN NaN NaN NaN NaN 2.2
4 Unnamed: 5 Michel 24 35 0.152 7.22 1.59
5 Unnamed: 6 NaN NaN NaN NaN NaN 1.66
6 Unnamed: 7 NaN NaN NaN NaN NaN 1.71
7 Unnamed: 8 NaN NaN NaN NaN NaN 1.53
8 2020-02-02 00:00:00 Catherine 24 NaN 0.145 7.21 1.44
9 Unnamed: 10 NaN NaN NaN NaN NaN 1.97
10 Unnamed: 11 NaN NaN NaN NaN NaN 1.91
11 Unnamed: 12 NaN NaN 33.0 NaN NaN 2.07
12 Unnamed: 13 Michel 24 34 0.15 7.24 1.76
13 Unnamed: 14 NaN NaN NaN NaN NaN 1.84
14 Unnamed: 15 NaN NaN NaN NaN NaN 1.72
15 Unnamed: 16 NaN NaN NaN NaN NaN 1.85
temperature
0 3
1 NaN
2 NaN
3 NaN
4 3
5 NaN
6 NaN
7 NaN
8 3
9 NaN
10 NaN
11 NaN
12 3
13 NaN
14 NaN
15 NaN
From here, I want to combine the rows so that I only have one row for each date. The values for each row will be the mean in the respective columns. ie.
print(new_df.head(2))
date time alkalinity colour uv ph turbidity temperature
0 2020-02-01 00:00:00 24.25 34 0.151 7.23 1.83 3
1 2020-02-02 00:00:00 24 33.5 0.148 7.23 1.82 3
How can I accomplish this when I have Unnamed values in my date column? Thanks!
Try setting the values to NaN and then use ffill:
df.loc[df.date.str.contains('Unnamed', na=False), 'date'] = np.nan
df.date = df.date.ffill()
If I understand, correctly you want to drop rows that contain 'Unnamed' in the date column, right?
Please look here:
https://stackoverflow.com/a/27360130/12790501
The solution would be something like this:
df = df.drop(df['Unnamed' in df.date].index)
Edit:
No, I would like to replace those Unnamed values with the date so I
could then use the groupby('date') function to return the mean values
for the columns
so in the case you should just iterate over the whole table
last_date = ''
for i in df.index:
if 'Unnamed' not in df.at[i, 'date']:
last_date = df.at[i, 'date']
else:
df.at[i, 'date'] = last_date
If the 'date' column is of type object i.e. string
then just write a logic to loop over the number as seen in image provided it follows a certain pattern-
for _ in range(2,9):
df.loc[(df['date'] == 'Unnamed: '+str(_), 'date'] = your_value
I am trying to figure out how to do merge. I have a labels.csv which contains the names that I have to use to replace the numbers for the same field in my dat.csv
My dat.csv is as follows:
Id,Help in household,Maths,Reading,Science,Social
11011001001,4,20.37,,27.78,
11011001002,3,12.96,,38.18,
11011001003,4,27.78,70,,
11011001004,4,,56.67,,36
11011001005,1,,,14.55,8.33
11011001006,4,,23.33,,30
11011001007,4,40.74,70,,
11011001008,3,,26.67,,22.92
11011001009,2,24.07,,25.45,
11011001010,4,18.52,26.67,,
11011001012,2,37.04,16.67,,
11011001013,4,20.37,,20,
11011001014,2,,,29.63,35.42
11011001015,4,27.78,66.67,,
11011001016,0,18.52,,,
11011001017,4,,,42.59,32
11011001018,2,16.67,,,
11011001019,3,,,21.82,
11011001020,4,,20,,16
11011001021,1,,,18.52,16.67
My labels.csv is as follows:
Column,Name,Level,Rename
Help in household,Every day,4,Every day
Help in household,Never,1,Never
Help in household,Once a month,2,Once a month
Help in household,Once a week,3,Once a week
my programme is as follows:
import pandas as pd
df = pd.read_csv('dat.csv')
labels = pd.read_csv('labels.csv')
df=df.merge(labels,left_on='Help in household',right_on='Name',how='left')
print df
However, the names do not appear as I want them to.
STUID Help in household Maths % Reading % Science % Social % \
0 11011001001 4 20.37 NaN 27.78 NaN
1 11011001002 3 12.96 NaN 38.18 NaN
2 11011001003 4 27.78 70.00 NaN NaN
3 11011001004 4 NaN 56.67 NaN 36.00
4 11011001005 1 NaN NaN 14.55 8.33
5 11011001006 4 NaN 23.33 NaN 30.00
6 11011001007 4 40.74 70.00 NaN NaN
7 11011001008 3 NaN 26.67 NaN 22.92
8 11011001009 2 24.07 NaN 25.45 NaN
9 11011001010 4 18.52 26.67 NaN NaN
10 11011001012 2 37.04 16.67 NaN NaN
11 11011001013 4 20.37 NaN 20.00 NaN
12 11011001014 2 NaN NaN 29.63 35.42
13 11011001015 4 27.78 66.67 NaN NaN
14 11011001016 0 18.52 NaN NaN NaN
15 11011001017 4 NaN NaN 42.59 32.00
16 11011001018 2 16.67 NaN NaN NaN
17 11011001019 3 NaN NaN 21.82 NaN
18 11011001020 4 NaN 20.00 NaN 16.00
19 11011001021 1 NaN NaN 18.52 16.67
Column Name Level Rename
0 NaN NaN NaN NaN
1 NaN NaN NaN NaN
2 NaN NaN NaN NaN
3 NaN NaN NaN NaN
4 NaN NaN NaN NaN
5 NaN NaN NaN NaN
6 NaN NaN NaN NaN
7 NaN NaN NaN NaN
8 NaN NaN NaN NaN
9 NaN NaN NaN NaN
10 NaN NaN NaN NaN
11 NaN NaN NaN NaN
12 NaN NaN NaN NaN
13 NaN NaN NaN NaN
14 NaN NaN NaN NaN
15 NaN NaN NaN NaN
16 NaN NaN NaN NaN
17 NaN NaN NaN NaN
18 NaN NaN NaN NaN
19 NaN NaN NaN NaN
What am I doing wrong?
Okay, is this what you want?
df['Name'] = df['Help in household'].map(labels.set_index('Level')['Name'])
Output:
Id Help in household Maths Reading Science Social \
0 11011001001 4 20.37 NaN 27.78 NaN
1 11011001002 3 12.96 NaN 38.18 NaN
2 11011001003 4 27.78 70.00 NaN NaN
3 11011001004 4 NaN 56.67 NaN 36.00
4 11011001005 1 NaN NaN 14.55 8.33
5 11011001006 4 NaN 23.33 NaN 30.00
6 11011001007 4 40.74 70.00 NaN NaN
7 11011001008 3 NaN 26.67 NaN 22.92
8 11011001009 2 24.07 NaN 25.45 NaN
9 11011001010 4 18.52 26.67 NaN NaN
10 11011001012 2 37.04 16.67 NaN NaN
11 11011001013 4 20.37 NaN 20.00 NaN
12 11011001014 2 NaN NaN 29.63 35.42
13 11011001015 4 27.78 66.67 NaN NaN
14 11011001016 0 18.52 NaN NaN NaN
15 11011001017 4 NaN NaN 42.59 32.00
16 11011001018 2 16.67 NaN NaN NaN
17 11011001019 3 NaN NaN 21.82 NaN
18 11011001020 4 NaN 20.00 NaN 16.00
19 11011001021 1 NaN NaN 18.52 16.67
Name
0 Every day
1 Once a week
2 Every day
3 Every day
4 Never
5 Every day
6 Every day
7 Once a week
8 Once a month
9 Every day
10 Once a month
11 Every day
12 Once a month
13 Every day
14 NaN
15 Every day
16 Once a month
17 Once a week
18 Every day
19 Never
I need to merge similar columns and remove duplicates (entries with the same date). The data frame:
Albumin C-reactive protein CRP Ferritin Haemoglobin Hb Iron Nancy Index Plasma Platelets Transferrin saturation % Transferrin saturations UCEIS (0 to 8) WCC White Cell Count test_date
0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 12.35 2016-04-17 23:00:00
1 NaN NaN NaN NaN 133.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 2016-04-17 23:00:00
2 NaN NaN NaN NaN NaN NaN NaN NaN NaN 406.0 NaN NaN NaN NaN NaN 2016-04-17 23:00:00
3 NaN NaN NaN NaN NaN NaN NaN NaN NaN 406.0 NaN NaN NaN NaN NaN 2016-04-17 23:00:00
4 NaN 32.2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 2016-04-17 23:00:00
5 36.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 2016-04-17 23:00:00
6 NaN NaN NaN 99.7 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 2016-04-17 23:00:00
7 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 25.0 NaN NaN NaN NaN 2016-04-17 23:00:00
12 36.0 NaN 32.2 99.7 NaN 133.0 NaN NaN NaN 406.0 NaN 25.0 NaN 12.35 NaN 2016-04-17 23:00:00
14 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 7.0 NaN NaN 2016-04-25 23:00:00
79 34.0 NaN 5.4 55.9 NaN 133.0 NaN NaN NaN 372.0 NaN 28.0 NaN 7.99 NaN 2016-06-12 23:00:00
I need to get:
Albumin CRP Ferritin Hb Nancy Index Plasma Platelets Transferrin saturations UCEIS (0 to 8) WCC test_date
12 36.0 32.2 99.7 133.0 NaN NaN 406.0 25.0 NaN 12.35 2016-04-17 23:00:00
14 NaN NaN NaN NaN NaN NaN NaN NaN 7.0 NaN 2016-04-25 23:00:00
79 34.0 5.4 55.9 133.0 NaN NaN 372.0 28.0 NaN 7.99 2016-06-12 23:00:00
So, columns 'C-reactive protein' should be merged with 'CRP', 'Hemoglobin' with 'Hb', 'Transferrin saturation %' with 'Transferrin saturation'.
I can easily remove duplicates with .drop_duplicates(), but the trick is remove not only row with the same date, but also to make sure, that the values in the same column are duplicated. For example, 'C-reactive protein' at row '4' has the same values as 'CRP' in row '12', in addition, they both have the same entry date. Given all that, I need to have only 'CRP' column with values 32.2 and the date '2016-04-17' (plus other unique columns).
EDIT
Some entries are really duplicates (absolutely identical, due to system glitches), for example (last three rows, on 2016-06-20, indices '803' and '122'). Is the solution below capable of removing such identical rows?
P.S. Thanks for the amazing and general solution for duplicate, but not identical entries.
Albumin C-reactive protein CRP Ferritin Haemoglobin Hb Iron Nancy Index Plasma Platelets Transferrin saturation % Transferrin saturations UCEIS (0 to 8) WCC White Cell Count setName test_date
735 39.0 NaN 0.4 52.0 NaN 144.0 NaN NaN NaN 197.0 NaN 25.0 NaN 4.88 NaN Bloods 2016-05-31 23:00:00
803 40.0 NaN 0.2 81.0 NaN 147.0 NaN NaN NaN 234.0 NaN 35.0 NaN 8.47 NaN Bloods 2016-06-20 23:00:00
347 NaN NaN NaN NaN NaN NaN NaN NaN 1.0 NaN NaN NaN NaN NaN NaN Research Bloods 2016-06-20 23:00:00
122 40.0 NaN 0.2 81.9 NaN 147.0 NaN NaN NaN 234.0 NaN 35.0 NaN 8.47 NaN Bloods 2016-06-20 23:00:00
I think you need groupby with rename columns by dict:
d = {'C-reactive protein':'CRP', 'Hemoglobin':'Hb',
'Transferrin saturation %':'Transferrin saturations'}
df = df.groupby('test_date').max().rename(columns=d).groupby(axis=1, level=0).max()
print (df)
Albumin CRP Ferritin Haemoglobin Hb Iron \
test_date
2016-04-17 23:00:00 36.0 32.2 99.7 133.0 133.0 NaN
2016-04-25 23:00:00 NaN NaN NaN NaN NaN NaN
2016-06-12 23:00:00 34.0 5.4 55.9 NaN 133.0 NaN
Nancy Index Plasma Platelets Transferrin saturations \
test_date
2016-04-17 23:00:00 NaN NaN 406.0 25.0
2016-04-25 23:00:00 NaN NaN NaN NaN
2016-06-12 23:00:00 NaN NaN 372.0 28.0
UCEIS (0 to 8) WCC White Cell Count
test_date
2016-04-17 23:00:00 NaN 12.35 12.35
2016-04-25 23:00:00 7.0 NaN NaN
2016-06-12 23:00:00 NaN 7.99 NaN
More general solution is reshape by melt, remove duplicates and then create DataFrame back:
d = {'C-reactive protein':'CRP', 'Hemoglobin':'Hb',
'Transferrin saturation %':'Transferrin saturations'}
df = df.rename(columns=d).groupby(axis=1, level=0).max()
df = pd.melt(df, id_vars='test_date').dropna(subset=['value']).drop_duplicates()
df = df.groupby(['test_date','variable'])['value'] \
.apply(lambda x: pd.Series(x.values)) \
.unstack(1) \
.reset_index(level=1, drop=True) \
.reset_index() \
.rename_axis(None,axis=1)
print (df)
test_date Albumin CRP Ferritin Hb Platelets \
0 2016-04-17 23:00:00 1000.0 32.2 99.7 1000.0 406.0
1 2016-04-17 23:00:00 36.0 NaN NaN 133.0 NaN
2 2016-04-25 23:00:00 NaN NaN NaN NaN NaN
3 2016-06-12 23:00:00 34.0 5.4 55.9 133.0 372.0
Transferrin saturations UCEIS (0 to 8) WCC White Cell Count
0 25.0 NaN 12.35 12.35
1 NaN NaN NaN NaN
2 NaN 7.0 NaN NaN
3 28.0 NaN 7.99 NaN
What #jezrael was saying is that if you had a situation where:
Albumin C-reactive protein CRP test_date
0 NaN NaN 32 2016-04-17 23:00:00
1 NaN 8.0 NaN 2016-04-17 23:00:00
then his method would erase the 8.0 reading and keep only the 32 (this is because he does it in two steps (or 3?), in this line: df = df.groupby('test_date').max().rename(columns=d).groupby(axis=1, level=0).max()
df = df.groupby('test_date').max() # selects max of each column
# while collapsing 'test_date'
which for my truncated example would give:
Albumin C-reactive protein CRP test_date
0 NaN 8.0 32 2016-04-17 23:00:00
then rename .rename(columns=d) giving:
Albumin CRP CRP test_date
0 NaN 8.0 32 2016-04-17 23:00:00
then .groupby(axis=1, level=0).max() to group along rows (instead of down columns) which gives:
Albumin CRP test_date
0 NaN 32 2016-04-17 23:00:00
which is where you run the highest risk of losing data.
Alternative
I would split the original data into two frames first
df1 = df[["C-reactive protein","Haemoglobin", ...]]
df2 = df[["CRP", "Hb"]]
# then rename
df2 = df2.rename(columns={"CRP":"C-reactive protein", "Hb":"Haemoglobin", ...})
# use concat to stack them on one another
df3 = pd.concat([df1, df2]) # i've run out of names
df3 = df3.drop_duplicates() # perhaps also drop NAs?
but this is only necessary if you have multiple non-duplicate entries for the same test on the same day.
I have two data frames that contain time-series data that are on different ranges. One starts earlier, and ends earlier. Also, one is monthly and one is quarterly. However, the index of both is in the form of YYYY-MM-DD. Is there a cute way of merging these dataframes using "Python" and "Pandas"?
Thanks!
/edit
One set:
DATE GDP GPDI NFLS
0 1947-01-01 243.1 35.9 112.815
1 1947-04-01 246.3 34.5 111.253
2 1947-07-01 250.1 34.9 113.023
3 1947-10-01 260.3 43.2 111.440
The other one:
DATE INDPRO M08354USM310NNBR GDP
(...)
334 1946-11-01 13.3916 NaN NaN
335 1946-12-01 13.4721 NaN NaN
336 1947-01-01 13.6332 42.8 NaN
337 1947-02-01 13.7137 42.5 NaN
Together I would like to join them, such that
DATE INDPRO M08354USM310NNBR GDP GPDI NFLS
1946-11-01 13.3916 NaN NaN NaN NaN
1946-12-01 13.4712 NaN NaN NaN NaN
1947-01-01 13.6332 42.8 243.1 35.9 112.815
1947-02-01 13.7137 42.5 NaN NaN NaN
(...)
Just perform a merge the fact the periods are different and don't overlap suits you in fact:
merged = df1.merge(df2, on='DATE', how='outer')
merged
Out[54]:
DATE GDP_x GPDI NFLS INDPRO M08354USM310NNBR GDP_y
0 1947-01-01 243.1 35.9 112.815 13.6332 42.8 NaN
1 1947-04-01 246.3 34.5 111.253 NaN NaN NaN
2 1947-07-01 250.1 34.9 113.023 NaN NaN NaN
3 1947-10-01 260.3 43.2 111.440 NaN NaN NaN
4 1946-11-01 NaN NaN NaN 13.3916 NaN NaN
5 1946-12-01 NaN NaN NaN 13.4721 NaN NaN
6 1947-02-01 NaN NaN NaN 13.7137 42.5 NaN
[7 rows x 7 columns]
You can rename, fill, drop the erroneous 'GDP_y' column
To sort the merged 'DATE' column just call sort:
In [57]:
merged.sort(['DATE'])
Out[57]:
DATE GDP_x GPDI NFLS INDPRO M08354USM310NNBR GDP_y
4 1946-11-01 NaN NaN NaN 13.3916 NaN NaN
5 1946-12-01 NaN NaN NaN 13.4721 NaN NaN
0 1947-01-01 243.1 35.9 112.815 13.6332 42.8 NaN
6 1947-02-01 NaN NaN NaN 13.7137 42.5 NaN
1 1947-04-01 246.3 34.5 111.253 NaN NaN NaN
2 1947-07-01 250.1 34.9 113.023 NaN NaN NaN
3 1947-10-01 260.3 43.2 111.440 NaN NaN NaN
[7 rows x 7 columns]