Consider that I have a data frame:
>>> data
c0 c1 c2 _c1 _c2
0 0 1 2 18.0 19.0
1 3 4 5 NaN NaN
2 6 7 8 20.0 21.0
3 9 10 11 NaN NaN
4 12 13 14 NaN NaN
5 15 16 17 NaN NaN
I want to update the values in the c1 and c2 columns with the values in the _c1 and _c2 columns whenever those latter values are not NaN. Why won't the following work, and what is the correct way to do this?
>>> data.loc[~(data._c1.isna()),['c1','c2']]=data.loc[~(data._c1.isna()),['_c1','_c2']]
>>> data
c0 c1 c2 _c1 _c2
0 0 NaN NaN 18.0 19.0
1 3 4.0 5.0 NaN NaN
2 6 NaN NaN 20.0 21.0
3 9 10.0 11.0 NaN NaN
4 12 13.0 14.0 NaN NaN
5 15 16.0 17.0 NaN NaN
For completeness's sake, I want the result to look like this:
>>> data.loc[~(data._c1.isna()),['c1','c2']]=data.loc[~(data._c1.isna()),['_c1','_c2']]
>>> data
c0 c1 c2 _c1 _c2
0 0 18.0 19.0 18.0 19.0
1 3 4.0 5.0 NaN NaN
2 6 20.0 21.0 20.0 21.0
3 9 10.0 11.0 NaN NaN
4 12 13.0 14.0 NaN NaN
5 15 16.0 17.0 NaN NaN
I recommend update after rename (note that update modifies the frame in place and, by default, only takes non-NaN values from the other frame):
df.update(df[['_c1','_c2']].rename(columns={'_c1':'c1','_c2':'c2'}))
df
Out[266]:
c0 c1 c2 _c1 _c2
0 0 18.0 19.0 18.0 19.0
1 3 4.0 5.0 NaN NaN
2 6 20.0 21.0 20.0 21.0
3 9 10.0 11.0 NaN NaN
4 12 13.0 14.0 NaN NaN
5 15 16.0 17.0 NaN NaN
You can use np.where:
import numpy as np

df[['c1', 'c2']] = np.where(df[['_c1', '_c2']].notna(),
                            df[['_c1', '_c2']],
                            df[['c1', 'c2']])
print(df)
# Output:
c0 c1 c2 _c1 _c2
0 0 18.0 19.0 18.0 19.0
1 3 4.0 5.0 NaN NaN
2 6 20.0 21.0 20.0 21.0
3 9 10.0 11.0 NaN NaN
4 12 13.0 14.0 NaN NaN
5 15 16.0 17.0 NaN NaN
Update
Do you know by any chance WHY the above doesn't work the way I thought it would?
The column names on the left and right side of your expression are different, so Pandas can't align the values even though the shape is the same.
# Left side of your expression
>>> data.loc[~(data._c1.isna()),['c1','c2']]
c1 c2 # <- note the column names
0 18.0 19.0
2 20.0 21.0
# Right side of your expression
>>> data.loc[~(data._c1.isna()),['_c1','_c2']]
_c1 _c2 # <- the column names differ from the left side
0 18.0 19.0
2 20.0 21.0
How to solve it? Simply use .values on the right side. Since the right side is then a plain array with no row/column labels, Pandas uses the shape to set the values.
data.loc[~(data._c1.isna()),['c1','c2']] = \
    data.loc[~(data._c1.isna()),['_c1','_c2']].values
print(data)
# Output:
c0 c1 c2 _c1 _c2
0 0 18.0 19.0 18.0 19.0
1 3 4.0 5.0 NaN NaN
2 6 20.0 21.0 20.0 21.0
3 9 10.0 11.0 NaN NaN
4 12 13.0 14.0 NaN NaN
5 15 16.0 17.0 NaN NaN
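For completeness, a label-based alternative to .values (a small sketch of the same idea as the update/rename answer above; rhs is just a throwaway name) is to rename the right-hand columns so they align with the left side:
rhs = data.loc[data._c1.notna(), ['_c1', '_c2']].rename(columns={'_c1': 'c1', '_c2': 'c2'})
data.loc[data._c1.notna(), ['c1', 'c2']] = rhs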
Related
I have a data-frame like this:
dtf:
id f1 f2 f3 f4 f5
t1 34 12 5 nan 6
t1 nan 4 2 9 7
t1 34 nan 5 nan 6
t2 nan nan nan nan nan
t2 nan nan nan nan nan
t2 nan nan nan nan nan
t3 23 7 8 1 32
t3 12 3 nan 45 56
t3 nan nan nan nan nan
I want to remove the rows for any id whose feature values are all 'nan' in every row (like t2). Thus my desired data frame should look like this:
dtf_new:
id f1 f2 f3 f4 f5
t1 34 12 5 nan 6
t1 nan 4 2 9 7
t1 34 nan 5 nan 6
t3 23 7 8 1 32
t3 12 3 nan 45 56
t3 nan nan nan nan nan
I have tried converting it to a dictionary using the code below and then searching for nan values, but I still could not find the right solution.
dict=dict(enumerate(dtf.id.unique()))
You could do groupby and isna:
>>> dtf
id f1 f2 f3 f4 f5
0 t1 34.0 12.0 5.0 NaN 6.0
1 t1 NaN 4.0 2.0 9.0 7.0
2 t1 34.0 NaN 5.0 NaN 6.0
3 t2 NaN NaN NaN NaN NaN
4 t2 NaN NaN NaN NaN NaN
5 t2 NaN NaN NaN NaN NaN
6 t3 23.0 7.0 8.0 1.0 32.0
7 t3 12.0 3.0 NaN 45.0 56.0
8 t3 NaN NaN NaN NaN NaN
>>> dtf_new = dtf[~dtf['id'].map(dtf.groupby('id').apply(lambda x: x.drop(columns='id').isna().all(axis=None)))]
>>> dtf_new
id f1 f2 f3 f4 f5
0 t1 34.0 12.0 5.0 NaN 6.0
1 t1 NaN 4.0 2.0 9.0 7.0
2 t1 34.0 NaN 5.0 NaN 6.0
6 t3 23.0 7.0 8.0 1.0 32.0
7 t3 12.0 3.0 NaN 45.0 56.0
8 t3 NaN NaN NaN NaN NaN
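If the one-liner feels dense, an equivalent sketch (same logic, just spelled out; the intermediate names are only illustrative) first computes a per-id "all features are NaN" flag and then filters on it:
feature_cols = [c for c in dtf.columns if c != 'id']                             # every column except the id
all_nan_by_id = dtf[feature_cols].isna().all(axis=1).groupby(dtf['id']).all()    # True for ids whose rows are all empty
dtf_new = dtf[~dtf['id'].map(all_nan_by_id)]                                     # keep the other ids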
#sacse is right, dropna does the job:
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html
Just change the default "how" parameter...
These kinds of processing needs are basic and common to most pandas users, so you can assume there is a feature for it. A few minutes in the documentation will give you your answer, along with other interesting features :)
Worth some time reading!
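For reference, a minimal sketch of that suggestion (assuming the feature columns are everything except id). Note that plain dropna drops every all-NaN row, including the last t3 row in the example above, so check whether that matches what you need:
feature_cols = [c for c in dtf.columns if c != 'id']
dtf_new = dtf.dropna(subset=feature_cols, how='all')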
I have df1:
ColA ColB ID1 ColC ID2
0 a 1.0 45.0 xyz 23.0
1 b 2.0 56.0 abc 24.0
2 c 3.0 34.0 qwerty 28.0
3 d 4.0 34.0 wer 33.0
4 e NaN NaN NaN NaN
df2:
ColA ColB ID1 ColC ID2
0 i 0 45.0 NaN 23.0
1 j 0 56.0 NaN 24.0
2 NaN 0 NaN fd 25.0
3 NaN 0 NaN NaN 26.0
4 NaN 0 23.0 e 45.0
5 NaN 0 45.0 r NaN
6 NaN 0 56.0 NaN 29.0
I am trying to update df2 only on the columns in a choice list, choice = ['ColA','ColB'], and only where ID1 and ID2 both match between the two dfs.
Expected output:
ColA ColB ID1 ColC ID2
0 a 1.0 45.0 NaN 23.0
1 b 2.0 56.0 NaN 24.0
2 NaN 0 NaN fd 25.0
3 NaN 0 NaN NaN 26.0
4 NaN 0 23.0 e 45.0
5 NaN 0 45.0 r NaN
6 NaN 0 56.0 NaN 29.0
So far I have tried:
u = df1.set_index(['ID1','ID2'])
u = u.loc[u.index.dropna()]
v = df2.set_index(['ID1','ID2'])
v= v.loc[v.index.dropna()]
v.update(u)
v.reset_index()
This gives me the correct update (but I lose the IDs which are NaN), and the update also takes place on ColC, which I don't want:
ID1 ID2 ColA ColB ColC
0 45.0 23.0 a 1.0 xyz
1 56.0 24.0 b 2.0 abc
2 23.0 45.0 NaN 0.0 e
3 56.0 29.0 NaN 0.0 NaN
I have also tried merge and combine_first, but I can't figure out the best approach to do this based on the choice list.
Use merge with right join and then combine_first:
choice= ['ColA','ColB']
joined = ['ID1','ID2']
c = choice + joined
df3 = df1[c].merge(df2[c], on=joined, suffixes=('','_'), how='right')[c]
print (df3)
ColA ColB ID1 ID2
0 a 1.0 45.0 23.0
1 b 2.0 56.0 24.0
2 NaN NaN NaN 25.0
3 NaN NaN NaN 26.0
4 NaN NaN 23.0 45.0
5 NaN NaN 45.0 NaN
6 NaN NaN 56.0 29.0
df2[c] = df3.combine_first(df2[c])
print (df2)
ColA ColB ID1 ColC ID2
0 a 1.0 45.0 NaN 23.0
1 b 2.0 56.0 NaN 24.0
2 NaN 0.0 NaN fd 25.0
3 NaN 0.0 NaN NaN 26.0
4 NaN 0.0 23.0 e 45.0
5 NaN 0.0 45.0 r NaN
6 NaN 0.0 56.0 NaN 29.0
Here's a way:
df1
ColA ColB ID1 ColC ID2
0 a 1.0 45.0 xyz 23.0
1 b 2.0 56.0 abc 24.0
2 c 3.0 34.0 qwerty 28.0
3 d 4.0 34.0 wer 33.0
4 e NaN NaN NaN NaN
df2
ColA ColB ID1 ColC ID2
0 i 0 45.0 NaN 23.0
1 j 0 56.0 NaN 24.0
2 NaN 0 NaN fd 25.0
3 NaN 0 NaN NaN 26.0
4 NaN 0 23.0 e 45.0
5 NaN 0 45.0 r NaN
6 NaN 0 56.0 NaN 29.0
df3 = df1.merge(df2, on=['ID1','ID2'], left_index=True)[['ColA_x','ColB_x']]
df2.loc[df3.index, 'ColA'] = df3['ColA_x']
df2.loc[df3.index, 'ColB'] = df3['ColB_x']
Output:
ColA ColB ID1 ColC ID2
0 a 1.0 45.0 NaN 23.0
1 b 2.0 56.0 NaN 24.0
2 NaN 0.0 NaN fd 25.0
3 NaN 0.0 NaN NaN 26.0
4 NaN 0.0 23.0 e 45.0
5 NaN 0.0 45.0 r NaN
6 NaN 0.0 56.0 NaN 29.0
There still seems to be the issue in 0.24 where NaN merges with NaN when they are keys. Prevent this by dropping those records before merging. I'm assuming ['ID1', 'ID2'] is a unique key for df1 (for rows where both are not null):
keys = ['ID1', 'ID2']
updates = ['ColA', 'ColB']
df3 = df2.merge(df1[updates+keys].dropna(subset=keys), on=keys, how='left')
Then resolve the information: take the value from df1 if it's not null, else take the value from df2. In recent versions of pandas the merge output columns should be ordered so that for duplicated columns the _x column appears to the left of the _y column. If not, sort the columns:
#df3 = df3.sort_index(axis=1) # If not sorted _x left of _y
df3.groupby([x[0] for x in df3.columns.str.split('_')], axis=1).apply(lambda x: x.ffill(1).iloc[:, -1])
ColA ColB ColC ID1 ID2
0 a 1.0 NaN 45.0 23.0
1 b 2.0 NaN 56.0 24.0
2 NaN 0.0 fd NaN 25.0
3 NaN 0.0 NaN NaN 26.0
4 NaN 0.0 e 23.0 45.0
5 NaN 0.0 r 45.0 NaN
6 NaN 0.0 NaN 56.0 29.0
My sample code is as follow:
import pandas as pd
dictx = {'col1':[1,'nan','nan','nan',5,'nan',7,'nan',9,'nan','nan','nan',13],\
'col2':[20,'nan','nan','nan',22,'nan',25,'nan',30,'nan','nan','nan',25],\
'col3':[15,'nan','nan','nan',10,'nan',14,'nan',13,'nan','nan','nan',9]}
df = pd.DataFrame(dictx).astype(float)
I'm trying to interpolate various segments which contain the value 'nan'.
For context, I'm trying to track bus speeds using GPS data provided by the city (São Paulo, Brazil). The data is scarce, with stretches that provide no information (as in the example), and there are segments where I know for a fact the buses are stopped, such as at dawn, but that information also comes through as 'nan'.
What I need:
I've been experimenting with the dataframe.interpolate() parameters (limit and limit_direction) but came up short. If I set df.interpolate(limit=2) I will interpolate not only the data that I need but also data where I shouldn't. So I need to interpolate only within sections defined by a limit.
Desired output:
Out[7]:
col1 col2 col3
0 1.0 20.00 15.00
1 nan nan nan
2 nan nan nan
3 nan nan nan
4 5.0 22.00 10.00
5 6.0 23.50 12.00
6 7.0 25.00 14.00
7 8.0 27.50 13.50
8 9.0 30.00 13.00
9 nan nan nan
10 nan nan nan
11 nan nan nan
12 13.0 25.00 9.00
The logic I've been trying to apply is basically to find the nan's, calculate the difference between their indexes, create a temporary dataframe to interpolate, and only then add it back into a final dataframe. But this has become hard to achieve due to the fact that 'nan'=='nan' returns False.
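As a side note on that last point: NaN never compares equal to anything, including itself, so the reliable test is pd.isna / .isna() rather than ==. A minimal illustration:
import numpy as np
import pandas as pd

np.nan == np.nan      # False
pd.isna(np.nan)       # True
df['col1'].isna()     # boolean mask of the NaN positions in a column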
This is a hack but may still be useful. Likely Pandas 0.23 will have a better solution.
https://pandas-docs.github.io/pandas-docs-travis/whatsnew.html#dataframe-interpolate-has-gained-the-limit-area-kwarg
df_fw = df.interpolate(limit=1)
df_bk = df.interpolate(limit=1, limit_direction='backward')
df_fw.where(df_bk.notna())
col1 col2 col3
0 1.0 20.0 15.0
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
4 5.0 22.0 10.0
5 6.0 23.5 12.0
6 7.0 25.0 14.0
7 8.0 27.5 13.5
8 9.0 30.0 13.0
9 NaN NaN NaN
10 NaN NaN NaN
11 NaN NaN NaN
12 13.0 25.0 9.0
Not a Hack
More legitimate way of handling it.
Generalized to handle any limit.
def interp(df, limit):
    # flag, for each row, whether the window of `limit + 1` rows ending there contains a non-null value
    d = df.notna().rolling(limit + 1).agg(any).fillna(1)
    # a row is kept only if every such window covering it contains a non-null value,
    # i.e. it does not sit inside a NaN run longer than `limit`
    d = pd.concat({
        i: d.shift(-i).fillna(1)
        for i in range(limit + 1)
    }).prod(level=1)
    # interpolate with the same limit, then keep only the rows where the mask is True
    return df.interpolate(limit=limit).where(d.astype(bool))
df.pipe(interp, 1)
col1 col2 col3
0 1.0 20.0 15.0
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
4 5.0 22.0 10.0
5 6.0 23.5 12.0
6 7.0 25.0 14.0
7 8.0 27.5 13.5
8 9.0 30.0 13.0
9 NaN NaN NaN
10 NaN NaN NaN
11 NaN NaN NaN
12 13.0 25.0 9.0
This can also handle variation in NaN placement from column to column. Consider a different df:
dictx = {'col1':[1,'nan','nan','nan',5,'nan','nan',7,'nan',9,'nan','nan','nan',13],\
'col2':[20,'nan','nan','nan',22,'nan',25,'nan','nan',30,'nan','nan','nan',25],\
'col3':[15,'nan','nan','nan',10,'nan',14,'nan',13,'nan','nan','nan',9,'nan']}
df = pd.DataFrame(dictx).astype(float)
df
col1 col2 col3
0 1.0 20.0 15.0
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
4 5.0 22.0 10.0
5 NaN NaN NaN
6 NaN 25.0 14.0
7 7.0 NaN NaN
8 NaN NaN 13.0
9 9.0 30.0 NaN
10 NaN NaN NaN
11 NaN NaN NaN
12 NaN NaN 9.0
13 13.0 25.0 NaN
Then with limit=1
df.pipe(interp, 1)
col1 col2 col3
0 1.0 20.0 15.0
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
4 5.0 22.0 10.0
5 NaN 23.5 12.0
6 NaN 25.0 14.0
7 7.0 NaN 13.5
8 8.0 NaN 13.0
9 9.0 30.0 NaN
10 NaN NaN NaN
11 NaN NaN NaN
12 NaN NaN 9.0
13 13.0 25.0 9.0
And with limit=2
df.pipe(interp, 2).round(2)
col1 col2 col3
0 1.00 20.00 15.0
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
4 5.00 22.00 10.0
5 5.67 23.50 12.0
6 6.33 25.00 14.0
7 7.00 26.67 13.5
8 8.00 28.33 13.0
9 9.00 30.00 NaN
10 NaN NaN NaN
11 NaN NaN NaN
12 NaN NaN 9.0
13 13.00 25.00 9.0
Here is a way to selectively ignore rows which are consecutive runs of NaNs whose length is greater than a certain size (given by limit):
import numpy as np
import pandas as pd
dictx = {'col1':[1,'nan','nan','nan',5,'nan',7,'nan',9,'nan','nan','nan',13],\
'col2':[20,'nan','nan','nan',22,'nan',25,'nan',30,'nan','nan','nan',25],\
'col3':[15,'nan','nan','nan',10,'nan',14,'nan',13,'nan','nan','nan',9]}
df = pd.DataFrame(dictx).astype(float)
limit = 2
notnull = pd.notnull(df).all(axis=1)
# assign group numbers to the rows of df. Each group starts with a non-null row,
# followed by null rows
group = notnull.cumsum()
# find the index of groups having length > limit
ignore = (df.groupby(group).filter(lambda grp: len(grp)>limit)).index
# only ignore rows which are null
ignore = df.loc[~notnull].index.intersection(ignore)
keep = df.index.difference(ignore)
# interpolate only the kept rows
df.loc[keep] = df.loc[keep].interpolate()
print(df)
prints
col1 col2 col3
0 1.0 20.0 15.0
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
4 5.0 22.0 10.0
5 6.0 23.5 12.0
6 7.0 25.0 14.0
7 8.0 27.5 13.5
8 9.0 30.0 13.0
9 NaN NaN NaN
10 NaN NaN NaN
11 NaN NaN NaN
12 13.0 25.0 9.0
By changing the value of limit you can control how big the group has to be before it should be ignored.
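For reuse, the same steps can be wrapped in a small helper (a sketch using the variable names above; interpolate_with_gap_limit is just an illustrative name) so that limit becomes a parameter and the original frame is not modified in place:
def interpolate_with_gap_limit(df, limit):
    # a "group" is one non-null row plus the run of NaN rows that follows it;
    # NaN rows belonging to groups longer than `limit` are left untouched
    notnull = pd.notnull(df).all(axis=1)
    group = notnull.cumsum()
    ignore = df.groupby(group).filter(lambda grp: len(grp) > limit).index
    ignore = df.loc[~notnull].index.intersection(ignore)
    keep = df.index.difference(ignore)
    out = df.copy()
    out.loc[keep] = out.loc[keep].interpolate()
    return out

print(interpolate_with_gap_limit(df, limit=2))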
This is a partial answer.
for i in list(df):                      # iterate over the column names
    for x in range(len(df[i])):
        if not df[i][x] > -100:         # comparisons with NaN are False, so this catches the NaN cells
            df[i][x] = 0
df
col1 col2 col3
0 1.0 20.0 15.0
1 0.0 0.0 0.0
2 0.0 0.0 0.0
3 0.0 0.0 0.0
4 5.0 22.0 10.0
5 0.0 0.0 0.0
6 7.0 25.0 14.0
7 0.0 0.0 0.0
8 9.0 30.0 13.0
9 0.0 0.0 0.0
10 0.0 0.0 0.0
11 0.0 0.0 0.0
12 13.0 25.0 9.0
Now,
df["col1"][1] == df["col2"][1]
True
The code below will generate the desired output in ONE dataframe; however, I would like to dynamically create data frames in a FOR loop and then assign the shifted values to each of those data frames. For example, data frame df_lags_12 would only contain column1_t12 and column2_t12. Any ideas would be greatly appreciated. I attempted to dynamically create 12 dataframes using the EXEC statement, but Google searching seems to state this is poor practice.
import pandas as pd
list1=list(range(0,20))
list2=list(range(19,-1,-1))
d={'column1':list(range(0,20)),
   'column2':list(range(19,-1,-1))}
df=pd.DataFrame(d)

df_lags=pd.DataFrame()
for col in df.columns:
    for i in range(12,0,-1):
        df_lags[col+'_t'+str(i)]=df[col].shift(i)
    df_lags[col]=df[col].values
print(df_lags)

for df in (range(12,0,-1)):
    exec('model_data_lag_'+str(df)+'=pd.DataFrame()')
Desired output for the dynamically created dataframe df_lags_12:
var_list=['column1_t12','column2_t12']
df_lags_12=df_lags[var_list]
print(df_lags_12)
I think the best approach is to create a dictionary of DataFrames:
d = {}
for i in range(12,0,-1):
    d['t' + str(i)] = df.shift(i).add_suffix('_t' + str(i))
If you need to specify the columns first:
d = {}
cols = ['column1','column2']
for i in range(12,0,-1):
    d['t' + str(i)] = df[cols].shift(i).add_suffix('_t' + str(i))
A dict comprehension solution:
d = {'t' + str(i): df.shift(i).add_suffix('_t' + str(i)) for i in range(12,0,-1)}
print (d['t10'])
column1_t10 column2_t10
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 NaN NaN
5 NaN NaN
6 NaN NaN
7 NaN NaN
8 NaN NaN
9 NaN NaN
10 0.0 19.0
11 1.0 18.0
12 2.0 17.0
13 3.0 16.0
14 4.0 15.0
15 5.0 14.0
16 6.0 13.0
17 7.0 12.0
18 8.0 11.0
19 9.0 10.0
EDIT: It is possible with globals, but a dictionary is much better:
d = {}
cols = ['column1','column2']
for i in range(12,0,-1):
    globals()['df' + str(i)] = df[cols].shift(i).add_suffix('_t' + str(i))
print (df10)
column1_t10 column2_t10
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 NaN NaN
5 NaN NaN
6 NaN NaN
7 NaN NaN
8 NaN NaN
9 NaN NaN
10 0.0 19.0
11 1.0 18.0
12 2.0 17.0
13 3.0 16.0
14 4.0 15.0
15 5.0 14.0
16 6.0 13.0
17 7.0 12.0
18 8.0 11.0
19 9.0 10.0
for i in range(1, 16):
    text=f"Version{i}=pd.DataFrame()"
    exec(text)
A combination of exec and f-strings will help you do that.
If you need to iterate or create several versions of the same variable, the statement above will help.
I have dozens of csv files with similar (but not always exactly the same) headers. For instance, one has:
Year Month Day Hour Minute Direct Diffuse D_Global D_IR Zenith Test_Site
One has:
Year Month Day Hour Minute Direct Diffuse2 D_Global D_IR U_Global U_IR Zenith Test_Site
(Notice one lacks "U_Global" and "U_IR", the other has "Diffuse2" instead of "Diffuse")
I know how to pass multiple CSVs into my script, but how do I have each CSV only pass values to the columns for which it actually has values? And perhaps pass "Nan" to all other columns in that row.
Ideally I'd have something like:
'Year','Month','Day','Hour','Minute','Direct','Diffuse','Diffuse2','D_Global','D_IR','U_Global','U_IR','Zenith','Test_Site'
1992,1,1,0,3,-999.00,-999.00,"Nan",-999.00,-999.00,"Nan","Nan",122.517,"BER"
2013,5,30,15,55,812.84,270.62,"Nan",1078.06,-999.00,"Nan","Nan",11.542,"BER"
2004,9,1,0,1,1.04,79.40,"Nan",78.67,303.58,61.06,310.95,85.142,"ALT"
2014,12,1,0,1,0.00,0.00,"Nan",-999.00,226.95,0.00,230.16,115.410,"ALT"
The other caveat is that this dataframe needs to be appended to: it needs to persist as multiple csv files are passed into it. I think I'll probably have it write out to its own csv at the end (it's eventually going to NETCDF4).
Assuming you have the following CSV files:
test1.csv:
year,month,day,Direct
1992,1,1,11
2013,5,30,11
2004,9,1,11
test2.csv:
year,month,day,Direct,Direct2
1992,1,1,21,201
2013,5,30,21,202
2004,9,1,21,203
test3.csv:
year,month,day,File3
1992,1,1,text1
2013,5,30,text2
2004,9,1,text3
2016,1,1,unmatching_date
Solution:
import glob
import pandas as pd
files = glob.glob(r'd:/temp/test*.csv')
def get_merged(files, **kwargs):
    df = pd.read_csv(files[0], **kwargs)
    for f in files[1:]:
        df = df.merge(pd.read_csv(f, **kwargs), how='outer')
    return df
print(get_merged(files))
Output:
year month day Direct Direct Direct2 File3
0 1992 1 1 11.0 21.0 201.0 text1
1 2013 5 30 11.0 21.0 202.0 text2
2 2004 9 1 11.0 21.0 203.0 text3
3 2016 1 1 NaN NaN NaN unmatching_date
UPDATE: the usual idiomatic pd.concat(list_of_dfs) solution wouldn't work here, because it joins by index:
In [192]: pd.concat([pd.read_csv(f) for f in glob.glob(file_mask)], axis=0, ignore_index=True)
Out[192]:
Direct Direct Direct2 File3 day month year
0 NaN 11.0 NaN NaN 1 1 1992
1 NaN 11.0 NaN NaN 30 5 2013
2 NaN 11.0 NaN NaN 1 9 2004
3 21.0 NaN 201.0 NaN 1 1 1992
4 21.0 NaN 202.0 NaN 30 5 2013
5 21.0 NaN 203.0 NaN 1 9 2004
6 NaN NaN NaN text1 1 1 1992
7 NaN NaN NaN text2 30 5 2013
8 NaN NaN NaN text3 1 9 2004
9 NaN NaN NaN unmatching_date 1 1 2016
In [193]: pd.concat([pd.read_csv(f) for f in glob.glob(file_mask)], axis=1, ignore_index=True)
Out[193]:
0 1 2 3 4 5 6 7 8 9 10 11 12
0 1992.0 1.0 1.0 11.0 1992.0 1.0 1.0 21.0 201.0 1992 1 1 text1
1 2013.0 5.0 30.0 11.0 2013.0 5.0 30.0 21.0 202.0 2013 5 30 text2
2 2004.0 9.0 1.0 11.0 2004.0 9.0 1.0 21.0 203.0 2004 9 1 text3
3 NaN NaN NaN NaN NaN NaN NaN NaN NaN 2016 1 1 unmatching_date
or using index_col=None explicitly:
In [194]: pd.concat([pd.read_csv(f, index_col=None) for f in glob.glob(file_mask)], axis=0, ignore_index=True)
Out[194]:
Direct Direct Direct2 File3 day month year
0 NaN 11.0 NaN NaN 1 1 1992
1 NaN 11.0 NaN NaN 30 5 2013
2 NaN 11.0 NaN NaN 1 9 2004
3 21.0 NaN 201.0 NaN 1 1 1992
4 21.0 NaN 202.0 NaN 30 5 2013
5 21.0 NaN 203.0 NaN 1 9 2004
6 NaN NaN NaN text1 1 1 1992
7 NaN NaN NaN text2 30 5 2013
8 NaN NaN NaN text3 1 9 2004
9 NaN NaN NaN unmatching_date 1 1 2016
In [195]: pd.concat([pd.read_csv(f, index_col=None) for f in glob.glob(file_mask)], axis=1, ignore_index=True)
Out[195]:
0 1 2 3 4 5 6 7 8 9 10 11 12
0 1992.0 1.0 1.0 11.0 1992.0 1.0 1.0 21.0 201.0 1992 1 1 text1
1 2013.0 5.0 30.0 11.0 2013.0 5.0 30.0 21.0 202.0 2013 5 30 text2
2 2004.0 9.0 1.0 11.0 2004.0 9.0 1.0 21.0 203.0 2004 9 1 text3
3 NaN NaN NaN NaN NaN NaN NaN NaN NaN 2016 1 1 unmatching_date
The following more idiomatic solution works, but it changes the original order of columns and rows/data:
In [224]: dfs = [pd.read_csv(f, index_col=None) for f in glob.glob(r'd:/temp/test*.csv')]
...:
...: common_cols = list(set.intersection(*[set(x.columns.tolist()) for x in dfs]))
...:
...: pd.concat((df.set_index(common_cols) for df in dfs), axis=1).reset_index()
...:
Out[224]:
month day year Direct Direct Direct2 File3
0 1 1 1992 11.0 21.0 201.0 text1
1 1 1 2016 NaN NaN NaN unmatching_date
2 5 30 2013 11.0 21.0 202.0 text2
3 9 1 2004 11.0 21.0 203.0 text3
Can't pandas take care of this automagically?
http://pandas.pydata.org/pandas-docs/stable/merging.html#concatenating-using-append
If your indices overlap, don't forget to add 'ignore_index=True'
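A minimal sketch of that approach (the file paths and output name here are placeholders): concat stacks the files vertically and aligns on column names, so columns a given file lacks simply come out as NaN (note this stacks rows rather than merging them, which may or may not be what you want):
import glob
import pandas as pd

files = glob.glob('./csv_files/*.csv')          # wherever the csv files live
combined = pd.concat((pd.read_csv(f) for f in files), ignore_index=True, sort=False)
combined.to_csv('combined.csv', index=False)    # or hand it off to NETCDF4 later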
First, run through all the files to build the complete list of headers:
import os

csv_path = './csv_files'
csv_separator = ','

full_headers = []
for fn in os.listdir(csv_path):
    with open(os.path.join(csv_path, fn), 'r') as f:
        headers = f.readline().strip().split(csv_separator)
    full_headers += [h for h in headers if h not in full_headers]   # add headers not seen yet, keeping order
Then write your header line into the output file and run through all the files again to fill it.
You can use csv.DictReader(open('myfile.csv')) to match each value to its designated column easily.
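A sketch of that second pass (reusing csv_path, csv_separator and full_headers from the loop above; 'combined.csv' and the "Nan" placeholder are assumptions taken from the question). csv.DictWriter's restval fills in any column a given file lacks:
import csv
import os

with open('combined.csv', 'w', newline='') as out:
    writer = csv.DictWriter(out, fieldnames=full_headers,
                            restval='Nan', delimiter=csv_separator)
    writer.writeheader()
    for fn in os.listdir(csv_path):
        with open(os.path.join(csv_path, fn), 'r', newline='') as f:
            writer.writerows(csv.DictReader(f, delimiter=csv_separator))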