I have dozens of CSV files with similar (but not always identical) headers. For instance, one has:
Year Month Day Hour Minute Direct Diffuse D_Global D_IR Zenith Test_Site
Another has:
Year Month Day Hour Minute Direct Diffuse2 D_Global D_IR U_Global U_IR Zenith Test_Site
(Notice one lacks "U_Global" and "U_IR", the other has "Diffuse2" instead of "Diffuse")
I know how to pass multiple CSVs into my script, but how do I have each CSV only pass values to the columns for which it actually has values, and pass NaN to all the other columns in that row?
Ideally I'd have something like:
'Year','Month','Day','Hour','Minute','Direct','Diffuse','Diffuse2','D_Global','D_IR','U_Global','U_IR','Zenith','Test_Site'
1992,1,1,0,3,-999.00,-999.00,"Nan",-999.00,-999.00,"Nan","Nan",122.517,"BER"
2013,5,30,15,55,812.84,270.62,"Nan",1078.06,-999.00,"Nan","Nan",11.542,"BER"
2004,9,1,0,1,1.04,79.40,"Nan",78.67,303.58,61.06,310.95,85.142,"ALT"
2014,12,1,0,1,0.00,0.00,"Nan",-999.00,226.95,0.00,230.16,115.410,"ALT"
The other caveat is that this dataframe needs to be appended to: it has to persist as multiple CSV files are passed in. I think I'll probably have it write out to its own CSV at the end (it's eventually going to NetCDF4).
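For reference, here is a minimal sketch of the accumulate-and-write workflow described above, assuming each input row simply becomes one output row with NaN in whichever columns that file lacks (the file mask, master column list and output name are placeholders):
import glob
import pandas as pd

# placeholder master column order for the combined output
master_cols = ['Year', 'Month', 'Day', 'Hour', 'Minute', 'Direct', 'Diffuse',
               'Diffuse2', 'D_Global', 'D_IR', 'U_Global', 'U_IR', 'Zenith', 'Test_Site']

frames = [pd.read_csv(path) for path in glob.glob('*.csv')]  # placeholder file mask
# concat aligns columns by name and fills the missing ones with NaN
combined = pd.concat(frames, ignore_index=True, sort=False)
combined = combined.reindex(columns=master_cols)  # enforce the full column set and order
combined.to_csv('combined.csv', index=False)      # eventually headed to NetCDF4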
Assuming you have the following CSV files:
test1.csv:
year,month,day,Direct
1992,1,1,11
2013,5,30,11
2004,9,1,11
test2.csv:
year,month,day,Direct,Direct2
1992,1,1,21,201
2013,5,30,21,202
2004,9,1,21,203
test3.csv:
year,month,day,File3
1992,1,1,text1
2013,5,30,text2
2004,9,1,text3
2016,1,1,unmatching_date
Solution:
import glob
import pandas as pd

files = glob.glob(r'd:/temp/test*.csv')

def get_merged(files, **kwargs):
    df = pd.read_csv(files[0], **kwargs)
    for f in files[1:]:
        df = df.merge(pd.read_csv(f, **kwargs), how='outer')
    return df

print(get_merged(files))
Output:
year month day Direct Direct Direct2 File3
0 1992 1 1 11.0 21.0 201.0 text1
1 2013 5 30 11.0 21.0 202.0 text2
2 2004 9 1 11.0 21.0 203.0 text3
3 2016 1 1 NaN NaN NaN unmatching_date
UPDATE: the usual idiomatic pd.concat(list_of_dfs) solution wouldn't work here, because it joins by index rather than by the shared key columns:
In [192]: pd.concat([pd.read_csv(f) for f in glob.glob(file_mask)], axis=0, ignore_index=True)
Out[192]:
Direct Direct Direct2 File3 day month year
0 NaN 11.0 NaN NaN 1 1 1992
1 NaN 11.0 NaN NaN 30 5 2013
2 NaN 11.0 NaN NaN 1 9 2004
3 21.0 NaN 201.0 NaN 1 1 1992
4 21.0 NaN 202.0 NaN 30 5 2013
5 21.0 NaN 203.0 NaN 1 9 2004
6 NaN NaN NaN text1 1 1 1992
7 NaN NaN NaN text2 30 5 2013
8 NaN NaN NaN text3 1 9 2004
9 NaN NaN NaN unmatching_date 1 1 2016
In [193]: pd.concat([pd.read_csv(f) for f in glob.glob(file_mask)], axis=1, ignore_index=True)
Out[193]:
0 1 2 3 4 5 6 7 8 9 10 11 12
0 1992.0 1.0 1.0 11.0 1992.0 1.0 1.0 21.0 201.0 1992 1 1 text1
1 2013.0 5.0 30.0 11.0 2013.0 5.0 30.0 21.0 202.0 2013 5 30 text2
2 2004.0 9.0 1.0 11.0 2004.0 9.0 1.0 21.0 203.0 2004 9 1 text3
3 NaN NaN NaN NaN NaN NaN NaN NaN NaN 2016 1 1 unmatching_date
or using index_col=None explicitly:
In [194]: pd.concat([pd.read_csv(f, index_col=None) for f in glob.glob(file_mask)], axis=0, ignore_index=True)
Out[194]:
Direct Direct Direct2 File3 day month year
0 NaN 11.0 NaN NaN 1 1 1992
1 NaN 11.0 NaN NaN 30 5 2013
2 NaN 11.0 NaN NaN 1 9 2004
3 21.0 NaN 201.0 NaN 1 1 1992
4 21.0 NaN 202.0 NaN 30 5 2013
5 21.0 NaN 203.0 NaN 1 9 2004
6 NaN NaN NaN text1 1 1 1992
7 NaN NaN NaN text2 30 5 2013
8 NaN NaN NaN text3 1 9 2004
9 NaN NaN NaN unmatching_date 1 1 2016
In [195]: pd.concat([pd.read_csv(f, index_col=None) for f in glob.glob(file_mask)], axis=1, ignore_index=True)
Out[195]:
0 1 2 3 4 5 6 7 8 9 10 11 12
0 1992.0 1.0 1.0 11.0 1992.0 1.0 1.0 21.0 201.0 1992 1 1 text1
1 2013.0 5.0 30.0 11.0 2013.0 5.0 30.0 21.0 202.0 2013 5 30 text2
2 2004.0 9.0 1.0 11.0 2004.0 9.0 1.0 21.0 203.0 2004 9 1 text3
3 NaN NaN NaN NaN NaN NaN NaN NaN NaN 2016 1 1 unmatching_date
The following, more idiomatic solution works, but it changes the original order of columns and rows:
In [224]: dfs = [pd.read_csv(f, index_col=None) for f in glob.glob(r'd:/temp/test*.csv')]
...:
...: common_cols = list(set.intersection(*[set(x.columns.tolist()) for x in dfs]))
...:
...: pd.concat((df.set_index(common_cols) for df in dfs), axis=1).reset_index()
...:
Out[224]:
month day year Direct Direct Direct2 File3
0 1 1 1992 11.0 21.0 201.0 text1
1 1 1 2016 NaN NaN NaN unmatching_date
2 5 30 2013 11.0 21.0 202.0 text2
3 9 1 2004 11.0 21.0 203.0 text3
Can't pandas take care of this automagically?
http://pandas.pydata.org/pandas-docs/stable/merging.html#concatenating-using-append
If your indices overlap, don't forget to add 'ignore_index=True'
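A tiny illustration of the overlapping-index point (two hypothetical frames that both carry the default 0..n-1 index):
import pandas as pd

df_a = pd.DataFrame({'Direct': [11, 12]})
df_b = pd.DataFrame({'Direct': [21, 22]})
print(pd.concat([df_a, df_b]))                     # index labels 0, 1, 0, 1 overlap
print(pd.concat([df_a, df_b], ignore_index=True))  # clean 0..3 index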
First, run through all the files to collect the full set of headers:
import os

csv_path = './csv_files'
csv_separator = ','

full_headers = []
for fn in os.listdir(csv_path):
    with open(os.path.join(csv_path, fn), 'r') as f:
        headers = f.readline().strip().split(csv_separator)
    full_headers += list(set(headers) - set(full_headers))
Then write your header line into your output file, and run again through all the files to fill it.
You can use csv.DictReader(open('myfile.csv')) to match the values in each row to their designated columns easily, as sketched below.
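A minimal sketch of that second pass, reusing the full_headers list and csv_path from above (the output file name is a placeholder); csv.DictWriter fills any field a file lacks with the chosen restval:
import csv
import os

with open('combined.csv', 'w', newline='') as out_f:
    writer = csv.DictWriter(out_f, fieldnames=full_headers, restval='NaN')
    writer.writeheader()
    for fn in os.listdir(csv_path):
        with open(os.path.join(csv_path, fn), newline='') as in_f:
            for row in csv.DictReader(in_f):
                writer.writerow(row)  # columns missing from this file are written as 'NaN'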
Related
The source df has sdate (datetime64) and svalue (float64) columns:
sdate svalue
1980-01-01 5
1980-01-02 7
1980-01-05 2
1981-01-01 6
1981-01-02 3
1982-01-01 4
1982-01-02 2
1982-01-06 9
1983-01-06 8
How do I create multiple year columns in a new dataset, like this:
dayofyear 1980 1981 1982 1983
1 5 6 4 nan
2 7 3 2 nan
3 nan nan nan nan
4 nan nan nan nan
5 2 nan nan nan
6 nan nan 9 8
I tried something like
df_new = df.pivot(index=df.sdate.dt.dayofyear, columns=df.sdate.dt.year, values='svalue')
Use DataFrame.assign to add the helper columns and then pivot:
df_new = df.assign(d=df.sdate.dt.dayofyear, y=df.sdate.dt.year).pivot(index='d', columns='y', values='svalue')
print (df_new)
y 1980 1981 1982 1983
d
1 5.0 6.0 4.0 NaN
2 7.0 3.0 2.0 NaN
5 2.0 NaN NaN NaN
6 NaN NaN 9.0 8.0
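If you also want rows for the days of year that never occur in the data (3 and 4 in the desired output above), one option is to reindex the pivoted result; a small sketch, assuming df is the source frame from the question:
df_new = (df.assign(d=df.sdate.dt.dayofyear, y=df.sdate.dt.year)
            .pivot(index='d', columns='y', values='svalue'))
# add all-NaN rows for the days of year missing from the data
df_new = df_new.reindex(range(1, int(df.sdate.dt.dayofyear.max()) + 1))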
I have a dataset similar to this
Serial A B
1 12
1 31
1
1 12
1 31 203
1 10
1 2
2 32 100
2 32 242
2 3
3 2
3 23 100
3
3 23
I group the dataframe by Serial and find the maximum value of column A per group with df['A_MAX'] = df.groupby('Serial')['A'].transform('max').values, then retain only the first value per group with df['A_MAX'] = df['A_MAX'].mask(df['Serial'].duplicated(), '')
Serial A B A_MAX B_corresponding
1 12 31 203
1 31
1
1 12
1 31 203
1 10
1 2
2 32 100 32 100
2 32 242
2 3
3 2 23 100
3 23 100
3
3 23
Now, for the B_corresponding column, I would like to get the B value that corresponds to A_MAX. I thought of locating the A_MAX values in A, but the maximum A value can appear more than once per group. As an additional condition, in Serial 2 for example, I would prefer to get the smallest B value among the rows where A equals 32.
The idea is to use DataFrame.sort_values so the maximal A values come first within each group, then remove missing values with DataFrame.dropna, keep the first row per Serial with DataFrame.drop_duplicates, build a Series with DataFrame.set_index, and finally use Series.map:
df['A_MAX'] = df.groupby('Serial')['A'].transform('max')
df['A_MAX'] = df['A_MAX'].mask(df['Serial'].duplicated())
s = (df.sort_values(['Serial','A'], ascending=[True, False])
.dropna(subset=['B'])
.drop_duplicates('Serial')
.set_index('Serial')['B'])
df['B_corresponding'] = df['Serial'].map(s).mask(df['Serial'].duplicated())
print (df)
Serial A B A_MAX B_corresponding
0 1 12.0 NaN 31.0 203.0
1 1 31.0 NaN NaN NaN
2 1 NaN NaN NaN NaN
3 1 12.0 NaN NaN NaN
4 1 31.0 203.0 NaN NaN
5 1 10.0 NaN NaN NaN
6 1 2.0 NaN NaN NaN
7 2 32.0 100.0 32.0 100.0
8 2 32.0 242.0 NaN NaN
9 2 3.0 NaN NaN NaN
10 3 2.0 NaN 23.0 100.0
11 3 23.0 100.0 NaN NaN
12 3 NaN NaN NaN NaN
13 3 23.0 NaN NaN NaN
Converting missing values to empty strings is possible, but it produces mixed values (numeric and strings), so further processing may be problematic:
df['A_MAX'] = df.groupby('Serial')['A'].transform('max')
df['A_MAX'] = df['A_MAX'].mask(df['Serial'].duplicated(), '')
s = (df.sort_values(['Serial','A'], ascending=[True, False])
.dropna(subset=['B'])
.drop_duplicates('Serial')
.set_index('Serial')['B'])
df['B_corresponding'] = df['Serial'].map(s).mask(df['Serial'].duplicated(), '')
print (df)
Serial A B A_MAX B_corresponding
0 1 12.0 NaN 31 203
1 1 31.0 NaN
2 1 NaN NaN
3 1 12.0 NaN
4 1 31.0 203.0
5 1 10.0 NaN
6 1 2.0 NaN
7 2 32.0 100.0 32 100
8 2 32.0 242.0
9 2 3.0 NaN
10 3 2.0 NaN 23 100
11 3 23.0 100.0
12 3 NaN NaN
13 3 23.0 NaN
You could also use dictionaries to achieve the same result, if you are not inclined to use only pandas.
a_to_b_mapping = df.groupby('A')['B'].min().to_dict()
serial_to_a_mapping = df.groupby('Serial')['A'].max().to_dict()

agg_df = []
for serial, a in serial_to_a_mapping.items():
    agg_df.append((serial, a, a_to_b_mapping.get(a, None)))

agg_df = pd.DataFrame(agg_df, columns=['Serial', 'A_max', 'B_corresponding'])
agg_df.head()
Serial A_max B_corresponding
0 1 31.0 203.0
1 2 32.0 100.0
2 3 23.0 100.0
If you want, you could join this back to the original dataframe and mask the duplicates.
dft = df.join(agg_df.set_index('Serial'), on='Serial', how='left')
dft['A_max'] = dft['A_max'].mask(dft['Serial'].duplicated(), '')
dft['B_corresponding'] = dft['B_corresponding'].mask(dft['Serial'].duplicated(), '')
dft
If I have a pandas data frame of ones like this:
NaN 1 1 1 1 NaN 1 1 1 NaN 1
NaN NaN 1 1 1 1 NaN NaN 1 NaN 1
NaN NaN 1 1 1 1 1 1 1 1 1
How do I do a cumulative sum in each row, but then set each group of consecutive ones to the maximum value of its cumulative sum, so that I get a data frame like this:
NaN 4 4 4 4 NaN 3 3 3 NaN 1
NaN NaN 4 4 4 4 NaN NaN 1 NaN 1
NaN NaN 9 9 9 9 9 9 9 9 9
First we stack the isnull mask, then create the sub-groups with cumsum and count the consecutive 1s with transform; in the last step we just need unstack to convert the data back:
s = df.isnull().stack()  # flag every NaN cell; index becomes (row, column)
s = s.groupby(level=0).cumsum()[~s]  # per-row run id that increments at each NaN, kept only at non-NaN cells
s = s.groupby([s.index.get_level_values(0), s]).transform('count').unstack().reindex_like(df)  # run lengths, back to wide
1 2 3 4 5 6 7 8 9 10 11
0 NaN 4.0 4.0 4.0 4.0 NaN 3.0 3.0 3.0 NaN 1.0
1 NaN NaN 4.0 4.0 4.0 4.0 NaN NaN 1.0 NaN 1.0
2 NaN NaN 9.0 9.0 9.0 9.0 9.0 9.0 9.0 9.0 9.0
Many more steps than @YOBEN_S's answer, but we can make use of melt and groupby.
We use cumcount to create a conditional helper column to group with.
from io import StringIO

import numpy as np
import pandas as pd
d = """ NaN 1 1 1 1 NaN 1 1 1 NaN 1
NaN NaN 1 1 1 1 NaN NaN 1 NaN 1
NaN NaN 1 1 1 1 1 1 1 1 1"""
df = pd.read_csv(StringIO(d), header=None, sep=r"\s+")
s = df.reset_index().melt(id_vars="index")
s.loc[s["value"].isnull(), "counter"] = s.groupby(
[s["index"], s["value"].isnull()]
).cumcount()
s["counter"] = s.groupby(["index"])["counter"].ffill()
s["val"] = s.groupby(["index", "counter"])["value"].cumsum()
s["val"] = s.groupby(["counter", "index"])["val"].transform("max")
s.loc[s["value"].isnull(), "val"] = np.nan
df2 = (
s.groupby(["index", "variable"])["val"]
.first()
.unstack()
.rename_axis(None, axis=1)
.rename_axis(None)
)
print(df2)
0 1 2 3 4 5 6 7 8 9 10
0 NaN 4.0 4.0 4.0 4.0 NaN 3.0 3.0 3.0 NaN 1.0
1 NaN NaN 4.0 4.0 4.0 4.0 NaN NaN 1.0 NaN 1.0
2 NaN NaN 9.0 9.0 9.0 9.0 9.0 9.0 9.0 9.0 9.0
I am raising this question to learn a new method myself.
I have a dataframe like below,
ID Value
0 1 10
1 1 12
2 1 14
3 1 16
4 1 18
5 2 32
6 2 12
7 2 -8
8 2 -28
9 2 -48
10 2 -68
11 3 12
12 3 1
13 3 43
I want to convert this into:
ID Value ID Value ID Value
0 1.0 10.0 2 32 3.0 12.0
1 1.0 12.0 2 12 3.0 1.0
2 1.0 14.0 2 -8 3.0 43.0
3 1.0 16.0 2 -28 NaN NaN
4 1.0 18.0 2 -48 NaN NaN
5 NaN NaN 2 -68 NaN NaN
One way to solve this:
print(pd.concat([df[df['ID']==1].reset_index(drop=True),
                 df[df['ID']==2].reset_index(drop=True),
                 df[df['ID']==3].reset_index(drop=True)], axis=1))
But I'm wondering: can I do the same concat operation for each groupby result instead of filtering by value?
Any better/new approaches are more appreciated.
Thanks in advance.
Yup, very possible and quite simple with pd.concat, in fact.
df = pd.concat({k : g.reset_index(drop=True) for k, g in df.groupby('ID')}, axis=1)
df.columns = df.columns.droplevel(0)
Or, a minor variation on Dark's (now deleted) answer (which does not give you the opportunity to specify column suffixes automatically):
pd.concat([g.reset_index(drop=True) for _, g in df.groupby('ID')], axis=1)
df
ID Value ID Value ID Value
0 1.0 10.0 2 32 3.0 12.0
1 1.0 12.0 2 12 3.0 1.0
2 1.0 14.0 2 -8 3.0 43.0
3 1.0 16.0 2 -28 NaN NaN
4 1.0 18.0 2 -48 NaN NaN
5 NaN NaN 2 -68 NaN NaN
Those column names are terrible, though. Rather than dropping the first level, you should consider concatenating the two levels to form a pre/suf-fix for the second level, as in the sketch below. That should be a good exercise for you with df.columns.map.
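A minimal sketch of that exercise, assuming df is the original ID/Value frame from the question (the separator and naming scheme are just one possible choice):
import pandas as pd

# df is assumed to be the original ID/Value frame
grouped = pd.concat({k: g.reset_index(drop=True) for k, g in df.groupby('ID')}, axis=1)
# flatten the (group, column) MultiIndex into e.g. 'ID_1', 'Value_1', 'ID_2', ...
grouped.columns = grouped.columns.map(lambda pair: f'{pair[1]}_{pair[0]}')
print(grouped)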
I have a dataframe of race results. I'd like to create a series that takes the last stage position and subtracts that by the average of all the stages before that. Here is a small slice of the df (it could have more stages, countries and rows):
race_location stage1_position stage2_position stage3_position number_of_stages
AUS 2.0 2.0 NaN 2
AUS 1.0 5.0 NaN 2
AUS 3.0 4.0 NaN 2
AUS 4.0 8.0 NaN 2
AUS 10.0 6.0 NaN 2
AUS 9.0 7.0 NaN 2
FRA 23.0 1.0 10.0 3
FRA 6.0 12.0 24.0 3
FRA 14.0 11.0 14.0 3
FRA 18.0 10.0 1.0 3
FRA 15.0 14.0 4.0 3
USA 24.0 NaN NaN 1
USA 7.0 NaN NaN 1
USA 22.0 NaN NaN 1
USA 11.0 NaN NaN 1
USA 8.0 NaN NaN 1
USA 16.0 NaN NaN 1
USA 13.0 NaN NaN 1
USA 19.0 NaN NaN 1
USA 5.0 NaN NaN 1
USA 25.0 NaN NaN 1
The output would be
last_stage_minus_average
0
4
1
4
-4
-2
-2
15
1.5
-13
-10.5
0
0
0
0
0
0
0
0
0
0
0
This won't work, but I was thinking of something like this:
new_series = []
for country in country_list:
    num_stages = df.loc[df['race_location'] == country, 'number_of_stages']
    difference = (df.ix[df['race_location'] == country, num_stages] -
                  df.iloc[:, 0:num_stages-1].mean(axis=1))
    new_series.append(difference)
I'm not sure how to go about doing this. Any help or direction would be amazing!
# Use pandas apply to take the mean of the first n-1 stages and subtract it from the last stage
import numpy as np

df.apply(lambda x: x.iloc[x.number_of_stages] - np.mean(x.iloc[1:x.number_of_stages]), axis=1).fillna(0)
Out[264]:
0 0.0
1 4.0
2 1.0
3 4.0
4 -4.0
5 -2.0
6 -2.0
7 15.0
8 1.5
9 -13.0
10 -10.5
11 0.0
12 0.0
13 0.0
14 0.0
15 0.0
16 0.0
17 0.0
18 0.0
19 0.0
20 0.0
dtype: float64
I'd use filter to get just the stage columns, then stack and groupby:
stages = df.filter(regex=r'^stage\d+.*')
stages.stack().groupby(level=0).apply(
lambda x: x.iloc[-1] - x.iloc[:-1].mean()
).fillna(0)
0 0.0
1 4.0
2 1.0
3 4.0
4 -4.0
5 -2.0
6 -2.0
7 15.0
8 1.5
9 -13.0
10 -10.5
11 0.0
12 0.0
13 0.0
14 0.0
15 0.0
16 0.0
17 0.0
18 0.0
19 0.0
20 0.0
dtype: float64
How it works:
stack will automatically drop the NaN values when converting to a Series.
Now, position -1 is the last value within each group once we group by the first level of the new MultiIndex.
So, we use a lambda to calculate the mean of everything up to the last value with x.iloc[:-1].mean(),
and subtract that from the last value, x.iloc[-1].
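A tiny illustration of the stack behaviour mentioned above (a made-up two-row frame, not the question's data):
import numpy as np
import pandas as pd

demo = pd.DataFrame({'stage1_position': [2.0, 23.0],
                     'stage2_position': [np.nan, 1.0]})
# the NaN cell is silently dropped; the MultiIndex keeps (row, column) pairs
print(demo.stack())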
subtracts that by the average of all the stages before that
It's not a big deal, but I'm just curious! Unlike your desired output, but going by your description, if a racer finished only one stage, shouldn't their result be inf or NaN instead of 0? (To distinguish them from a racer who has already done 2-3 stages but whose last result is exactly the same as the average of the earlier ones, like racer #1 vs racers #11-20.)
df_sp = df.filter(regex=r'^stage\d+.*')
# last non-NaN stage position in each row
df['last'] = df_sp.T.ffill().T.iloc[:, -1]
# mean of all stages before the last one
df['mean'] = (df_sp.sum(axis=1) - df['last']) / (df['number_of_stages'] - 1)
print(df['last'] - df['mean'])
0 0.0
1 4.0
2 1.0
3 4.0
4 -4.0
5 -2.0
6 -2.0
7 15.0
8 1.5
9 -13.0
10 -10.5
11 NaN
12 NaN
13 NaN
14 NaN
15 NaN
16 NaN
17 NaN
18 NaN
19 NaN
20 NaN