I have many CSV files with the same number of columns (but different numbers of rows), in the following pattern:
File 1:
A1,B1,C1
A2,B2,C2
A3,B3,C3
A4,B4,C4
File 2:
*A1*,*B1*,*C1*
*A2*,*B2*,*C2*
*A3*,*B3*,*C3*
File ...
Output:
A1+*A1*+...,B1+*B1*+...,C1+*C1*+...
A2+*A2*+...,B2+*B2*+...,C2+*C2*+...
A3+*A3*+...,B3+*B3*+...,C3+*C3*+...
A4+... ,B4+... ,C4+...
For example:
File 1:
1,0,0
1,0,1
1,0,0
0,1,0
File 2:
1,1,0
1,1,1
0,1,0
Output:
2,1,0
2,1,2
1,1,0
0,1,0
I am trying to use Python/pandas and was thinking of something like this to read the files into variables:
import pandas

dic = {}
for i in range(14253, 14352):
    try:
        dic['df_{0}'.format(i)] = pandas.read_csv('output_' + str(i) + '.csv')
    except:
        pass
and then to sum the columns:
for residue in residues:
    for number in range(14254, 14255):
        df = dic['df_14253'][residue]
        df += dic['df_' + str(number)][residue]
residues is a list of strings which are the column names.
The problem is that my files have different numbers of rows, so the columns are only summed up to the last row of df1. How can I add them up to the last row of the longest file, so that no data is lost? I think pandas' groupby.sum could be an option, but I don't understand how to use it.
To add an example of what I would like to get:
File 1:
1,0,0
1,0,1
1,0,0
0,1,0
File 2:
1,1,0
1,1,1
0,1,0
File 3:
1,0,0
0,0,1
1,0,0
1,0,0
1,0,0
1,0,1
File ...:
Output:
3,1,0
2,1,3
2,1,0
1,1,0
1,0,0
1,0,1
You can use Panel in pandas, a 3D object that holds a collection of DataFrames:
import pandas as pd

dfs = {i: pd.DataFrame.from_csv('file' + str(i) + '.csv', sep=',',
                                header=None, index_col=None)
       for i in range(n)}  # n files
panel = pd.Panel(dfs)
dfs_sum = panel.sum(axis=0)
dfs is a dictionary of DataFrames. Panel automatically pads the shorter frames with NaN and computes the sum correctly. For example:
In [500]: panel[1]
Out[500]:
0 1 2
0 1 0 0
1 1 0 1
2 1 0 0
3 0 1 0
4 NaN NaN NaN
5 NaN NaN NaN
6 NaN NaN NaN
7 NaN NaN NaN
8 NaN NaN NaN
9 NaN NaN NaN
10 NaN NaN NaN
11 NaN NaN NaN
In [501]: panel[2]
Out[501]:
0 1 2
0 1 0 0
1 1 0 1
2 1 0 0
3 0 1 0
4 1 0 0
5 1 0 1
6 1 0 0
7 0 1 0
8 NaN NaN NaN
9 NaN NaN NaN
10 NaN NaN NaN
11 NaN NaN NaN
In [502]: panel[3]
Out[502]:
0 1 2
0 1 0 0
1 1 0 1
2 1 0 0
3 0 1 0
4 1 0 0
5 1 0 1
6 1 0 0
7 0 1 0
8 1 0 0
9 1 0 1
10 1 0 0
11 0 1 0
In [503]: panel.sum(0)
Out[503]:
0 1 2
0 3 0 0
1 3 0 3
2 3 0 0
3 0 3 0
4 2 0 0
5 2 0 2
6 2 0 0
7 0 2 0
8 1 0 0
9 1 0 1
10 1 0 0
11 0 1 0
Looking for this exact same thing, I found out that Panel is now deprecated, so I am posting the news here:
class pandas.Panel(data=None, items=None, major_axis=None, minor_axis=None, copy=False, dtype=None)
Deprecated since version 0.20.0: The recommended way to represent 3-D data are with a MultiIndex on a DataFrame via the to_frame() method or with the xarray package. Pandas provides a to_xarray() method to automate this conversion.
to_frame(filter_observations=True)
Transform wide format into long (stacked) format as DataFrame whose columns are the Panel's items and whose index is a MultiIndex formed of the Panel's major and minor axes.
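For the use case above, one way to reproduce panel.sum(axis=0) on recent pandas versions, as a sketch of my own (assuming the same dfs dictionary of DataFrames built in B.M.'s answer), is to stack the frames into a MultiIndexed DataFrame with concat and sum per original row position:

import pandas as pd

# dfs: {file_number: DataFrame}, as built above
stacked = pd.concat(dfs)                  # outer index level: file key, inner: row number
dfs_sum = stacked.groupby(level=1).sum()  # sum across files for each row position

Rows that exist in only some of the files simply contribute to fewer groups, so nothing is cut off at the length of the shortest file.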
I would recommend using
pandas.DataFrame.sum
DataFrame.sum(axis=None, skipna=None, level=None, numeric_only=None, min_count=0, **kwargs)
Parameters:
axis : {index (0), columns (1)}
Axis for the function to be applied on.
One can use it the same way as in B.M.'s answer.
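Note that DataFrame.sum by itself reduces a single frame, so frames of different lengths still need to be aligned first. As a sketch of my own (not from the original answers, assuming the same dfs dictionary), DataFrame.add with fill_value=0 handles the alignment while adding the frames pairwise:

from functools import reduce

# dfs: {file_number: DataFrame}; rows missing from shorter frames count as 0
dfs_sum = reduce(lambda a, b: a.add(b, fill_value=0), dfs.values())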
EDIT: Upon request I provide an example that is closer to the real data I am working with.
So I have a table data that looks something like
value0 value1 value2
run step
0 0 0.12573 -0.132105 0.640423
1 0.1049 -0.535669 0.361595
2 1.304 0.947081 -0.703735
3 -1.265421 -0.623274 0.041326
4 -2.325031 -0.218792 -1.245911
5 -0.732267 -0.544259 -0.3163
1 0 0.411631 1.042513 -0.128535
1 1.366463 -0.665195 0.35151
2 0.90347 0.094012 -0.743499
3 -0.921725 -0.457726 0.220195
4 -1.009618 -0.209176 -0.159225
5 0.540846 0.214659 0.355373
(think: collection of time series) and a second table valid_range
start stop
run
0 1 3
1 2 5
For each run I want to drop all rows that do not satisfy start≤step≤stop.
I tried the following (table generating code at the end)
for idx in valid_range.index:
    slc = data.loc[idx]
    start, stop = valid_range.loc[idx]
    cond = (start <= slc.index) & (slc.index <= stop)
    data.loc[idx] = data.loc[idx][cond]
However, this results in:
value0 value1 value2
run step
0 0 NaN NaN NaN
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
4 NaN NaN NaN
5 NaN NaN NaN
1 0 NaN NaN NaN
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
4 NaN NaN NaN
5 NaN NaN NaN
I also tried data.loc[idx].drop(slc[cond].index, inplace=True) but it didn't have any effect...
Generating code for table
import numpy as np
from pandas import DataFrame, MultiIndex, Index

rng = np.random.default_rng(0)
valid_range = DataFrame({"start": [1, 2], "stop": [3, 5]}, index=Index(range(2), name="run"))
midx = MultiIndex(levels=[[], []], codes=[[], []], names=["run", "step"])
data = DataFrame(columns=[f"value{k}" for k in range(3)], index=midx)
for run in range(2):
    for step in range(6):
        data.loc[(run, step), :] = rng.normal(size=3)
First, merge data and valid_range on 'run', using the merge method:
>>> data
value0 value1 value2
run step
0 0 0.12573 -0.132105 0.640423
1 0.1049 -0.535669 0.361595
2 1.304 0.947081 -0.703735
3 -1.26542 -0.623274 0.041326
4 -2.32503 -0.218792 -1.24591
5 -0.732267 -0.544259 -0.3163
1 0 0.411631 1.04251 -0.128535
1 1.36646 -0.665195 0.35151
2 0.90347 0.0940123 -0.743499
3 -0.921725 -0.457726 0.220195
4 -1.00962 -0.209176 -0.159225
5 0.540846 0.214659 0.355373
>>> valid_range
start stop
run
0 1 3
1 2 5
>>> merged = data.reset_index().merge(valid_range, how='left', on='run')
>>> merged
run step value0 value1 value2 start stop
0 0 0 0.12573 -0.132105 0.640423 1 3
1 0 1 0.1049 -0.535669 0.361595 1 3
2 0 2 1.304 0.947081 -0.703735 1 3
3 0 3 -1.26542 -0.623274 0.041326 1 3
4 0 4 -2.32503 -0.218792 -1.24591 1 3
5 0 5 -0.732267 -0.544259 -0.3163 1 3
6 1 0 0.411631 1.04251 -0.128535 2 5
7 1 1 1.36646 -0.665195 0.35151 2 5
8 1 2 0.90347 0.0940123 -0.743499 2 5
9 1 3 -0.921725 -0.457726 0.220195 2 5
10 1 4 -1.00962 -0.209176 -0.159225 2 5
11 1 5 0.540846 0.214659 0.355373 2 5
Then select the rows which satisfy the condition using eval, and use the resulting boolean array to mask data:
>>> cond = merged.eval('start < step < stop').to_numpy()
>>> data[cond]
value0 value1 value2
run step
0 2 1.304 0.947081 -0.703735
1 3 -0.921725 -0.457726 0.220195
4 -1.00962 -0.209176 -0.159225
Or, if you prefer, here is a similar approach using query:
res = (
    data.reset_index()
        .merge(valid_range, on='run', how='left')
        .query('start < step < stop')
        .drop(columns=['start', 'stop'])
        .set_index(['run', 'step'])
)
I would go with groupby, like this:
(df.groupby(level=0)
   .apply(lambda x: x[x['small'] > 1])
   .reset_index(level=0, drop=True)  # remove duplicate index
)
which gives:
big small
animal animal attribute
cow cow speed 30.0 20.0
weight 250.0 150.0
falcon falcon speed 320.0 250.0
lama lama speed 45.0 30.0
weight 200.0 100.0
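That example uses a different frame; applied to the run/step data from this question, the same groupby pattern might look like the following sketch (assuming the data and valid_range frames defined above, with the bounds treated as inclusive):

def keep_valid(group):
    # group.name is the run label of the current group
    start, stop = valid_range.loc[group.name, ['start', 'stop']]
    steps = group.index.get_level_values('step')
    return group[(steps >= start) & (steps <= stop)]

res = data.groupby(level='run', group_keys=False).apply(keep_valid)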
I have this dataset, which contains some NaN values:
df = pd.DataFrame({'Id':[1,2,3,4,5,6], 'Name':['Eve','Diana',np.NaN,'Mia','Mae',np.NaN], "Count":[10,3,np.NaN,8,5,2]})
df
Id Name Count
0 1 Eve 10.0
1 2 Diana 3.0
2 3 NaN NaN
3 4 Mia 8.0
4 5 Mae 5.0
5 6 NaN 2.0
I want to test whether each column has a NaN value (0) or not (1), creating two new flag columns. I have tried this:
df_clean = df
df_clean[['Name_flag','Count_flag']] = df_clean[['Name','Count']].apply(lambda x: 0 if x == np.NaN else 1, axis = 1)
But it raises "The truth value of a Series is ambiguous". I want to avoid redundancy, but I see there is a mistake in my logic. Could you please help me with this?
The expected table is:
Id Name Count Name_flag Count_flag
0 1 Eve 10.0 1 1
1 2 Diana 3.0 1 1
2 3 NaN NaN 0 0
3 4 Mia 8.0 1 1
4 5 Mae 5.0 1 1
5 6 NaN 2.0 0 1
Multiply the boolean mask by 1:
df[['Name_flag','Count_flag']] = df[['Name', 'Count']].notna() * 1
>>> df
Id Name Count Name_flag Count_flag
0 1 Eve 10.0 1 1
1 2 Diana 3.0 1 1
2 3 NaN NaN 0 0
3 4 Mia 8.0 1 1
4 5 Mae 5.0 1 1
5 6 NaN 2.0 0 1
Regarding the error "The truth value of a Series is ambiguous":
With apply(axis=1) the lambda receives a whole row as a Series, so x == np.NaN yields a Series rather than a single True/False, and using it in a conditional is ambiguous. You would have to use applymap instead to apply a function elementwise. But comparing to NaN is not straightforward:
Try:
df[['Name','Count']].applymap(lambda x: str(x) != 'nan') * 1
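The underlying reason the == comparison cannot work is that NaN never compares equal to anything, not even itself, which is why pandas provides isna/notna:

import numpy as np
import pandas as pd

print(np.nan == np.nan)  # False: NaN is never equal to itself
print(pd.isna(np.nan))   # True: isna/notna detect missing values reliably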
We can use notna and convert the boolean to int:
df[["Name_flag", "Count_flag"]] = df[["Name", "Count"]].notna().astype(int)
Id Name Count Name_flag Count_flag
0 1 Eve 10.00 1 1
1 2 Diana 3.00 1 1
2 3 NaN NaN 0 0
3 4 Mia 8.00 1 1
4 5 Mae 5.00 1 1
5 6 NaN 2.00 0 1
daychange SS
0.017065 0
-0.009259 100
0.031542 0
-0.004530 0
0.000709 0
0.004970 100
-0.021900 0
0.003611 0
I have two columns and I want to calculate the sum of the next 5 'daychange' values whenever SS == 100.
I am using the following right now but it does not work quite the way I want it to:
df['total'] = df.loc[df['SS'] == 100,['daychange']].sum(axis=1)
Since pandas 1.1 you can create a forward rolling window and select the rows you want to include in your dataframe. With different arguments my notebook kernel got terminated: use with caution.
indexer = pd.api.indexers.FixedForwardWindowIndexer(window_size=5)
df['total'] = df.daychange.rolling(indexer, min_periods=1).sum()[df.SS == 100]
df
Out:
daychange SS total
0 0.017065 0 NaN
1 -0.009259 100 0.023432
2 0.031542 0 NaN
3 -0.004530 0 NaN
4 0.000709 0 NaN
5 0.004970 100 -0.013319
6 -0.021900 0 NaN
7 0.003611 0 NaN
Exclude the starting row with SS == 100 from the sum
That means summing from the next row after the rows with SS == 100. Since all rolling sums are computed anyway, you can shift the result:
df['total'] = df.daychange.rolling(indexer, min_periods=1).sum().shift(-1)[df.SS == 100]
df
Out:
daychange SS total
0 0.017065 0 NaN
1 -0.009259 100 0.010791
2 0.031542 0 NaN
3 -0.004530 0 NaN
4 0.000709 0 NaN
5 0.004970 100 -0.018289
6 -0.021900 0 NaN
7 0.003611 0 NaN
Slow hacky solution using indices of selected rows
This feels like a hack, but works and avoids computing unnecessary rolling values
df['next5sum'] = df[df.SS == 100].index.to_series().apply(lambda x: df.daychange.iloc[x: x + 5].sum())
df
Out:
daychange SS next5sum
0 0.017065 0 NaN
1 -0.009259 100 0.023432
2 0.031542 0 NaN
3 -0.004530 0 NaN
4 0.000709 0 NaN
5 0.004970 100 -0.013319
6 -0.021900 0 NaN
7 0.003611 0 NaN
For the sum of the next five rows excluding the rows with SS == 100 you can adjust the slices or shift the series
df['next5sum'] = df[df.SS == 100].index.to_series().apply(lambda x: df.daychange.iloc[x + 1: x + 6].sum())
# df['next5sum'] = df[df.SS == 100].index.to_series().apply(lambda x: df.daychange.shift(-1).iloc[x: x + 5].sum())
df
Out:
daychange SS next5sum
0 0.017065 0 NaN
1 -0.009259 100 0.010791
2 0.031542 0 NaN
3 -0.004530 0 NaN
4 0.000709 0 NaN
5 0.004970 100 -0.018289
6 -0.021900 0 NaN
7 0.003611 0 NaN
I have a dataframe like this; how can I delete all the digits in a column?
For example, the value in row 0 of the 'material' column should be transformed from lm792 to lm.
material item
index
0 lm792 1
1 sotl085-pu01. 1
2 lm792 1
3 sotl085-pu01. 1
4 ym11-3527 1
... ... ...
135526 0 0
135527 0 0
135528 0 0
135529 0 0
135530 0 0
You could use a simple regex.
\d is a digit (a character in the range 0-9), and + means 1 or more times, so \d+ matches 1 or more digits.
df['material'] = df['material'].str.replace(r'\d+', '', regex=True)
print(df)
material item
0 lm 1.0
1 sotl-pu. 1.0
2 lm 1.0
3 sotl-pu. 1.0
4 ym- 1.0
5 NaN
6 NaN
7 NaN
8 NaN
9 0.0
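If the column also contains non-string values (like the 0 entries at the bottom of the frame in the question), the .str accessor returns NaN for them, which may explain the NaN rows in the output above. A possible workaround (a sketch, not part of the original answer) is to cast the column to string first:

df['material'] = df['material'].astype(str).str.replace(r'\d+', '', regex=True)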
I have a list of uncertainties that correspond to particular values of n, which I'll call table1. I would like to add those uncertainties into a comprehensive large table of data, table2, that is ordered numerically by n in ascending order. How can I attach each uncertainty to the correct corresponding value of n?
My first issue is that my table of uncertainties is a Table, not a DataFrame. I have the separate arrays but am not sure how to combine them into a DataFrame.
table1 = Table([xrow,yrow])
xrow denotes the array of the 'n' values shown below in table1 and yrow denotes the corresponding errors.
excerpt of table1:
n error
1 0.0
2 0.00496
3 0.0096
4 0.00913
6 0.00555
8 0.00718
10 0.00707
excerpt of table2:
n Energy g J error
0 1 0.000000 1 0 NaN
1 2 1827.486200 1 0 NaN
2 3 3626.681500 1 0 NaN
3 4 5396.686500 1 0 NaN
4 5 6250.149500 1 0 NaN
so the end result should look like this:
n Energy g J error
0 1 0.000000 1 0 0
1 2 1827.486200 1 0 0.00496
2 3 3626.681500 1 0 0.0096
3 4 5396.686500 1 0 0.00913
4 5 6250.149500 1 0 NaN
i.e. the rows where there is no data remain blank (e.g. n=5 in the above case).
I should note there is a lot of data: roughly 30k rows in table2 and 2.5k rows in table1.
You can use .merge like this:
import pandas as pd
from io import StringIO
table1 = pd.read_csv(StringIO("""
n error
1 0.0
2 0.00496
3 0.0096
4 0.00913
6 0.00555
8 0.00718
10 0.00707"""), sep=r"\s+")
table2 = pd.read_csv(StringIO("""
n Energy g J error
0 1 0.000000 1 0 NaN
1 2 1827.486200 1 0 NaN
2 3 3626.681500 1 0 NaN
3 4 5396.686500 1 0 NaN
4 5 6250.149500 1 0 NaN"""), sep=r"\s+")
table2["error"] = table1.merge(table2, on="n", how="right")["error_x"]
print(table2)
Output:
n Energy g J error
0 1 0.0000 1 0 0.00000
1 2 1827.4862 1 0 0.00496
2 3 3626.6815 1 0 0.00960
3 4 5396.6865 1 0 0.00913
4 5 6250.1495 1 0 NaN
EDIT: using .map should perform better (see comments):
table2["error"] = table2["n"].map(table1.set_index('n')['error'])