I am trying to merge two DataFrames.
df1
Date A B C
01.01.2021 1 8 14
02.01.2021 2 9 15
03.01.2021 3 10 16
04.01.2021 4 11 17
05.01.2021 5 12 18
06.01.2021 6 13 19
07.01.2021 7 14 20
df2
Date B
07.01.2021 14
08.01.2021 27
09.01.2021 28
10.01.2021 29
11.01.2021 30
12.01.2021 31
13.01.2021 32
The two DataFrames share one row (in general there could be several overlapping rows).
So I want to get a df3 that looks as follows:
df3
Date A B C
01.01.2021 1 8 14
02.01.2021 2 9 15
03.01.2021 3 10 16
04.01.2021 4 11 17
05.01.2021 5 12 18
06.01.2021 6 13 19
07.01.2021 7 14 20
08.01.2021 NaN 27 NaN
09.01.2021 NaN 28 NaN
10.01.2021 NaN 29 NaN
11.01.2021 NaN 30 NaN
12.01.2021 NaN 31 NaN
13.01.2021 NaN 32 NaN
I've tried
df3 = df1.merge(df2, on='Date', how='outer')
but it gives extra A, B, C columns. Could you suggest how to get df3?
Thanks a lot.
Use an outer merge without specifying on. By default, on is the intersection of the columns of the two DataFrames, in this case ['Date', 'B']:
df3 = df1.merge(df2, how='outer')
df3:
Date A B C
0 01.01.2021 1.0 8 14.0
1 02.01.2021 2.0 9 15.0
2 03.01.2021 3.0 10 16.0
3 04.01.2021 4.0 11 17.0
4 05.01.2021 5.0 12 18.0
5 06.01.2021 6.0 13 19.0
6 07.01.2021 7.0 14 20.0
7 08.01.2021 NaN 27 NaN
8 09.01.2021 NaN 28 NaN
9 10.01.2021 NaN 29 NaN
10 11.01.2021 NaN 30 NaN
11 12.01.2021 NaN 31 NaN
12 13.01.2021 NaN 32 NaN
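To make the difference between the two calls concrete, here is a minimal sketch that rebuilds df1 and df2 from the question (the list comprehensions are just a compact way to type the sample data) and compares joining on 'Date' alone with joining on both shared columns:

```python
import pandas as pd

# Sample data rebuilt from the question.
dates1 = [f"{d:02d}.01.2021" for d in range(1, 8)]
dates2 = [f"{d:02d}.01.2021" for d in range(7, 14)]
df1 = pd.DataFrame({"Date": dates1, "A": range(1, 8), "B": range(8, 15), "C": range(14, 21)})
df2 = pd.DataFrame({"Date": dates2, "B": [14, 27, 28, 29, 30, 31, 32]})

# Joining on 'Date' alone keeps both B columns, suffixed _x and _y.
on_date = df1.merge(df2, on="Date", how="outer")

# Spelling out the shared columns is equivalent to omitting `on` here.
df3 = df1.merge(df2, on=["Date", "B"], how="outer")

print(list(on_date.columns))
print(list(df3.columns))
```

Note that a row only collapses into one when both Date and B agree; if the same date carried different B values in the two frames, both rows would be kept.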
Assuming you always want to keep the first, complete version of a row, you can concatenate df2 onto the end of df1 and drop duplicates on the Date column.
pd.concat([df1,df2]).drop_duplicates(subset='Date')
Output
Date A B C
0 01.01.2021 1.0 8 14.0
1 02.01.2021 2.0 9 15.0
2 03.01.2021 3.0 10 16.0
3 04.01.2021 4.0 11 17.0
4 05.01.2021 5.0 12 18.0
5 06.01.2021 6.0 13 19.0
6 07.01.2021 7.0 14 20.0
1 08.01.2021 NaN 27 NaN
2 09.01.2021 NaN 28 NaN
3 10.01.2021 NaN 29 NaN
4 11.01.2021 NaN 30 NaN
5 12.01.2021 NaN 31 NaN
6 13.01.2021 NaN 32 NaN
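The index labels in the output above repeat because each frame keeps its own labels. If a clean 0..n-1 index is wanted, one option is to reset it after dropping duplicates. A small sketch on a shortened, two-row version of the sample data:

```python
import pandas as pd

# Shortened stand-ins for df1 and df2 from the question.
df1 = pd.DataFrame({"Date": ["06.01.2021", "07.01.2021"], "A": [6, 7], "B": [13, 14], "C": [19, 20]})
df2 = pd.DataFrame({"Date": ["07.01.2021", "08.01.2021"], "B": [14, 27]})

stacked = (
    pd.concat([df1, df2])
    .drop_duplicates(subset="Date")  # keep="first" is the default, so df1's row wins
    .reset_index(drop=True)          # replace the repeated labels with 0..n-1
)
print(stacked)
```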
I have these two DataFrames that I would like to merge, keeping all the data of a but dropping the rows of b that have no match in a:
a = pd.DataFrame({'coname': ['Apple','Microsoft','JPMorgan','Facebook','Intel','McKinsey'],
'eva': [20, 18, 73, 62, 56, 92],
'ratio': [4, 7, 1, 6, 9, 8]
})
b = pd.DataFrame({'coname': ['Apple','Microsoft','JPMorgan','Netflix','Total','Ford'],
'city': ['Cupertino','Seattle','NYC','Palo Alto','Paris', 'Detroit'],
'state': ['CA','WA','NY','CA','Ile de France', 'MI']
})
I want the following output:
coname eva ratio city state
0 Apple 20.0 4.0 Cupertino CA
1 Microsoft 18.0 7.0 Seattle WA
2 JPMorgan 73.0 1.0 NYC NY
3 Facebook 62.0 6.0 NaN NaN
4 Intel 56.0 9.0 NaN NaN
5 McKinsey 92.0 8.0 NaN NaN
I have tried
a = pd.merge(a, b, on='coname', how='outer')
for i in a['coname']:
    if i in b['coname']:
        a.drop(i)
but I only get this:
coname eva ratio city state
0 Apple 20.0 4.0 Cupertino CA
1 Microsoft 18.0 7.0 Seattle WA
2 JPMorgan 73.0 1.0 NYC NY
3 Facebook 62.0 6.0 NaN NaN
4 Intel 56.0 9.0 NaN NaN
5 McKinsey 92.0 8.0 NaN NaN
6 Netflix NaN NaN Palo Alto CA
7 Total NaN NaN Paris Ile de France
8 Ford NaN NaN Detroit MI
how='left' will do the trick:
pd.merge(a,b, on = 'coname', how='left')
Result:
      coname  eva  ratio       city state
0      Apple   20      4  Cupertino    CA
1  Microsoft   18      7    Seattle    WA
2   JPMorgan   73      1        NYC    NY
3   Facebook   62      6        NaN   NaN
4      Intel   56      9        NaN   NaN
5   McKinsey   92      8        NaN   NaN
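If it helps to see which rows of a found no partner in b, merge accepts indicator=True, which adds a _merge column. A minimal sketch using the DataFrames from the question (the variable names merged and unmatched are only for illustration):

```python
import pandas as pd

a = pd.DataFrame({"coname": ["Apple", "Microsoft", "JPMorgan", "Facebook", "Intel", "McKinsey"],
                  "eva": [20, 18, 73, 62, 56, 92],
                  "ratio": [4, 7, 1, 6, 9, 8]})
b = pd.DataFrame({"coname": ["Apple", "Microsoft", "JPMorgan", "Netflix", "Total", "Ford"],
                  "city": ["Cupertino", "Seattle", "NYC", "Palo Alto", "Paris", "Detroit"],
                  "state": ["CA", "WA", "NY", "CA", "Ile de France", "MI"]})

# how='left' keeps every row of a; rows of b without a match in a are dropped.
merged = pd.merge(a, b, on="coname", how="left", indicator=True)

# '_merge' is 'both' for matched rows and 'left_only' for rows found only in a.
unmatched = merged.loc[merged["_merge"] == "left_only", "coname"].tolist()
print(unmatched)
```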
I have a table as follows. The first column is the year, the second column is the type of pavement treatment, and the third column is the pavement score. I need to create a fourth column called 'year diff' by subtracting the year of the last treatment from the year of the current score. For example, for year 2014 I need to subtract 2013, since treatment 9 was done in 2013, and the result, 1, needs to be recorded in col['year diff'] in the corresponding cell. Likewise, for year 2022 I need to subtract 2020, since treatment 10 was done in 2020.
Thanks a lot everyone for your help.
Sincerely
Wilson
Use:
import numpy as np

# True where a treatment was recorded
m = df['treatment'].notnull()
# start a new group at every recorded treatment
s = m.cumsum()
# rows before the first treatment, and the treatment rows themselves, get NaN
mask = (s == 0) | m
# subtract the first year of each group (the treatment year) from each row's year
out = df['year'] - df['year'].groupby(s).transform('first')
# blank out the masked rows
df['year diff'] = np.where(mask, np.nan, out)
print (df)
year treatment score year diff
0 2010 NaN 1 NaN
1 2011 NaN 2 NaN
2 2012 NaN 3 NaN
3 2013 9.0 4 NaN
4 2014 NaN 5 1.0
5 2015 NaN 6 2.0
6 2016 NaN 7 3.0
7 2017 NaN 8 4.0
8 2018 NaN 9 5.0
9 2019 NaN 10 6.0
10 2020 10.0 11 NaN
11 2021 NaN 12 1.0
12 2022 NaN 13 2.0
13 2023 NaN 14 3.0
14 2024 NaN 15 4.0
15 2025 12.0 16 NaN
16 2026 NaN 17 1.0
17 2027 NaN 18 2.0
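An alternative that avoids the groupby is to forward-fill the year of the most recent treatment. This is only a sketch: the sample frame is rebuilt from the printed output above, and the names treated, is_treated and last_treated_year are made up for the example.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt from the printed output above.
years = list(range(2010, 2028))
treated = {2013: 9, 2020: 10, 2025: 12}
df = pd.DataFrame({
    "year": years,
    "treatment": [treated.get(y, np.nan) for y in years],
    "score": range(1, 19),
})

is_treated = df["treatment"].notna()
# Year of the most recent treatment, carried forward; NaN before the first one.
last_treated_year = df["year"].where(is_treated).ffill()
# Leave the treatment rows themselves empty, as in the output above.
df["year diff"] = (df["year"] - last_treated_year).where(~is_treated)
print(df)
```

Because it works on the year column directly, this version does not rely on the score increasing by one each year.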
IIUC, you could use:
df['identifier'] = (df['year'].diff().eq(1) & df['treatment'].notnull()).cumsum()
df['year diff'] = df.groupby('identifier')['identifier'].apply(
    lambda x: pd.Series(np.where(x != 0, pd.Series(pd.factorize(x)[0] + 1).cumsum().shift(), np.nan))
).values
print(df)
Or if you need to consider the difference of scores based on the value in treatment:
df['identifier'] = (df['year'].diff().eq(1) & df['treatment'].notnull()).cumsum()
df['year diff'] = df.groupby('identifier')['score'].apply(
    lambda x: pd.Series(np.where(x != 0, x.diff().expanding().sum(), np.nan))
).reset_index(drop=True)
df.loc[df['identifier']==0,'year diff']=np.nan
print(df)
year treatment score identifier year diff
0 2010 NaN 1 0 NaN
1 2011 NaN 2 0 NaN
2 2012 NaN 3 0 NaN
3 2013 9.0 4 1 NaN
4 2014 NaN 5 1 1.0
5 2015 NaN 6 1 2.0
6 2016 NaN 7 1 3.0
7 2017 NaN 8 1 4.0
8 2018 NaN 9 1 5.0
9 2019 NaN 10 1 6.0
10 2020 10.0 11 2 NaN
11 2021 NaN 12 2 1.0
12 2022 NaN 13 2 2.0
13 2023 NaN 14 2 3.0
14 2024 NaN 15 2 4.0
15 2025 12.0 16 3 NaN
16 2026 NaN 17 3 1.0
17 2027 NaN 18 3 2.0
If you want to do it with a for loop:
df = pd.DataFrame(mydata)
# index labels of the rows where a treatment was recorded (assumes untreated rows hold '')
mylist = df.index[df['treatment'] != ''].tolist()
Now subtract the year values:
re_list = []
for index, row in df.iterrows():
    if index > min(mylist):
        # treatments at or before this row; the latest one is max(m)
        m = [i for i in mylist if i <= index]
        re_list.append(df.iloc[index]['year'] - df.iloc[max(m)]['year'])
    else:
        re_list.append(0)
df['Result'] = re_list
I have dozens of csv files with similar (but not always exactly the same) headers. For instance, one has:
Year Month Day Hour Minute Direct Diffuse D_Global D_IR Zenith Test_Site
One has:
Year Month Day Hour Minute Direct Diffuse2 D_Global D_IR U_Global U_IR Zenith Test_Site
(Notice one lacks "U_Global" and "U_IR", the other has "Diffuse2" instead of "Diffuse")
I know how to pass multiple CSVs into my script, but how do I have each CSV pass values only to the columns it actually has, and perhaps pass "Nan" to all other columns in that row?
Ideally I'd have something like:
'Year','Month','Day','Hour','Minute','Direct','Diffuse','Diffuse2','D_Global','D_IR','U_Global','U_IR','Zenith','Test_Site'
1992,1,1,0,3,-999.00,-999.00,"Nan",-999.00,-999.00,"Nan","Nan",122.517,"BER"
2013,5,30,15,55,812.84,270.62,"Nan",1078.06,-999.00,"Nan","Nan",11.542,"BER"
2004,9,1,0,1,1.04,79.40,"Nan",78.67,303.58,61.06,310.95,85.142,"ALT"
2014,12,1,0,1,0.00,0.00,"Nan",-999.00,226.95,0.00,230.16,115.410,"ALT"
The other caveat is that this DataFrame needs to be appended to: it needs to persist as multiple CSV files are passed into it. I think I'll probably have it write out to its own CSV at the end (it's eventually going to NETCDF4).
Assuming you have the following CSV files:
test1.csv:
year,month,day,Direct
1992,1,1,11
2013,5,30,11
2004,9,1,11
test2.csv:
year,month,day,Direct,Direct2
1992,1,1,21,201
2013,5,30,21,202
2004,9,1,21,203
test3.csv:
year,month,day,File3
1992,1,1,text1
2013,5,30,text2
2004,9,1,text3
2016,1,1,unmatching_date
Solution:
import glob
import pandas as pd

files = glob.glob(r'd:/temp/test*.csv')

def get_merged(files, **kwargs):
    df = pd.read_csv(files[0], **kwargs)
    for f in files[1:]:
        df = df.merge(pd.read_csv(f, **kwargs), how='outer')
    return df

print(get_merged(files))
Output:
year month day Direct Direct Direct2 File3
0 1992 1 1 11.0 21.0 201.0 text1
1 2013 5 30 11.0 21.0 202.0 text2
2 2004 9 1 11.0 21.0 203.0 text3
3 2016 1 1 NaN NaN NaN unmatching_date
UPDATE: the usual idiomatic pd.concat(list_of_dfs) solution wouldn't work here, because it joins on indexes:
In [192]: pd.concat([pd.read_csv(f) for f in glob.glob(file_mask)], axis=0, ignore_index=True)
Out[192]:
Direct Direct Direct2 File3 day month year
0 NaN 11.0 NaN NaN 1 1 1992
1 NaN 11.0 NaN NaN 30 5 2013
2 NaN 11.0 NaN NaN 1 9 2004
3 21.0 NaN 201.0 NaN 1 1 1992
4 21.0 NaN 202.0 NaN 30 5 2013
5 21.0 NaN 203.0 NaN 1 9 2004
6 NaN NaN NaN text1 1 1 1992
7 NaN NaN NaN text2 30 5 2013
8 NaN NaN NaN text3 1 9 2004
9 NaN NaN NaN unmatching_date 1 1 2016
In [193]: pd.concat([pd.read_csv(f) for f in glob.glob(file_mask)], axis=1, ignore_index=True)
Out[193]:
0 1 2 3 4 5 6 7 8 9 10 11 12
0 1992.0 1.0 1.0 11.0 1992.0 1.0 1.0 21.0 201.0 1992 1 1 text1
1 2013.0 5.0 30.0 11.0 2013.0 5.0 30.0 21.0 202.0 2013 5 30 text2
2 2004.0 9.0 1.0 11.0 2004.0 9.0 1.0 21.0 203.0 2004 9 1 text3
3 NaN NaN NaN NaN NaN NaN NaN NaN NaN 2016 1 1 unmatching_date
or using index_col=None explicitly:
In [194]: pd.concat([pd.read_csv(f, index_col=None) for f in glob.glob(file_mask)], axis=0, ignore_index=True)
Out[194]:
Direct Direct Direct2 File3 day month year
0 NaN 11.0 NaN NaN 1 1 1992
1 NaN 11.0 NaN NaN 30 5 2013
2 NaN 11.0 NaN NaN 1 9 2004
3 21.0 NaN 201.0 NaN 1 1 1992
4 21.0 NaN 202.0 NaN 30 5 2013
5 21.0 NaN 203.0 NaN 1 9 2004
6 NaN NaN NaN text1 1 1 1992
7 NaN NaN NaN text2 30 5 2013
8 NaN NaN NaN text3 1 9 2004
9 NaN NaN NaN unmatching_date 1 1 2016
In [195]: pd.concat([pd.read_csv(f, index_col=None) for f in glob.glob(file_mask)], axis=1, ignore_index=True)
Out[195]:
0 1 2 3 4 5 6 7 8 9 10 11 12
0 1992.0 1.0 1.0 11.0 1992.0 1.0 1.0 21.0 201.0 1992 1 1 text1
1 2013.0 5.0 30.0 11.0 2013.0 5.0 30.0 21.0 202.0 2013 5 30 text2
2 2004.0 9.0 1.0 11.0 2004.0 9.0 1.0 21.0 203.0 2004 9 1 text3
3 NaN NaN NaN NaN NaN NaN NaN NaN NaN 2016 1 1 unmatching_date
The following more idiomatic solution works, but it changes original order of columns and rows / data:
In [224]: dfs = [pd.read_csv(f, index_col=None) for f in glob.glob(r'd:/temp/test*.csv')]
...:
...: common_cols = list(set.intersection(*[set(x.columns.tolist()) for x in dfs]))
...:
...: pd.concat((df.set_index(common_cols) for df in dfs), axis=1).reset_index()
...:
Out[224]:
month day year Direct Direct Direct2 File3
0 1 1 1992 11.0 21.0 201.0 text1
1 1 1 2016 NaN NaN NaN unmatching_date
2 5 30 2013 11.0 21.0 202.0 text2
3 9 1 2004 11.0 21.0 203.0 text3
Can't pandas take care of this automagically?
http://pandas.pydata.org/pandas-docs/stable/merging.html#concatenating-using-append
If your indices overlap, don't forget to add 'ignore_index=True'
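As a rough sketch of that suggestion: DataFrame.append was removed in pandas 2.0, but pd.concat does the same row-wise stacking and fills columns a file lacks with NaN. The two in-memory CSVs below are hypothetical, cut-down stand-ins for the files in the question.

```python
import io
import pandas as pd

# Hypothetical in-memory stand-ins for two of the CSV files.
csv_a = io.StringIO("Year,Direct,Diffuse,Test_Site\n1992,-999.0,-999.0,BER\n")
csv_b = io.StringIO("Year,Direct,Diffuse2,U_Global,Test_Site\n2004,1.04,79.4,61.06,ALT\n")

frames = [pd.read_csv(f) for f in (csv_a, csv_b)]

# Rows are stacked; the columns are the union of all headers.
combined = pd.concat(frames, ignore_index=True, sort=False)

# na_rep controls how the missing cells are written back out.
csv_text = combined.to_csv(index=False, na_rep="NaN")
print(csv_text)
```

This stacks rows rather than matching them on date, so it fits the case where every file contributes its own rows.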
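First, run through all the files to collect the full set of headers:
import os

csv_path = './csv_files'
csv_separator = ','
full_headers = []
for fn in os.listdir(csv_path):
    # os.listdir returns bare file names, so join them with the directory
    with open(os.path.join(csv_path, fn), 'r') as f:
        headers = f.readline().strip().split(csv_separator)
    # append only the headers not seen yet, preserving first-seen order
    full_headers += [h for h in headers if h not in full_headers]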
Then write your header line into your output file, and run through all the files again to fill it.
You can use csv.DictReader(open('myfile.csv')) to match each value to its designated column by header name.
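The two passes can be sketched with the standard csv module. The input strings here are hypothetical in-memory stand-ins for the files, and csv.DictWriter's restval argument supplies the filler for columns a given file does not have:

```python
import csv
import io

# Hypothetical in-memory stand-ins for the input files.
sources = [
    "Year,Direct,Diffuse\n1992,-999.00,-999.00\n",
    "Year,Direct,Diffuse2,U_IR\n2004,1.04,79.40,310.95\n",
]

# Pass 1: union of the headers, in first-seen order.
full_headers = []
for text in sources:
    header = next(csv.reader(io.StringIO(text)))
    full_headers += [h for h in header if h not in full_headers]

# Pass 2: restval fills every column that a given row does not provide.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=full_headers, restval="NaN", lineterminator="\n")
writer.writeheader()
for text in sources:
    for row in csv.DictReader(io.StringIO(text)):
        writer.writerow(row)

lines = out.getvalue().splitlines()
print("\n".join(lines))
```

With real files, the io.StringIO objects would be replaced by open(...) handles, ideally opened with newline='' as the csv module documentation recommends.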