Given the following pandas DataFrame:
ID  Holidays  visit_1  visit_2  visit_3  other
 0      True        1        2        0    red
 0     False        3        2        0    red
 0      True        4        4        1   blue
 1     False        2        0        0    red
 1      True        1        2        1  green
 2     False        1        0        0    red
Currently I calculate a new DataFrame with the accumulated visit values as follows.
# Calculate the columns of the total visit count
visit_df = df.groupby('ID')[['visit_1', 'visit_2', 'visit_3']].sum()
I would like to create a new one that takes into account only the rows whose Holidays value is True. How could I do this?
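For reference, a minimal snippet to reproduce the DataFrame shown above:
import pandas as pd
df = pd.DataFrame({
    'ID':       [0, 0, 0, 1, 1, 2],
    'Holidays': [True, False, True, False, True, False],
    'visit_1':  [1, 3, 4, 2, 1, 1],
    'visit_2':  [2, 2, 4, 0, 2, 0],
    'visit_3':  [0, 0, 1, 0, 1, 0],
    'other':    ['red', 'red', 'blue', 'red', 'green', 'red'],
})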
Simply subset the rows first:
df[df['Holidays']].groupby('ID')[['visit_1', 'visit_2', 'visit_3']].sum()
output:
visit_1 visit_2 visit_3
ID
0 5 6 1
1 1 2 1
An alternative, if you also want to get the groups without any match:
df2 = df.set_index('ID')
(df2.where(df2['Holidays'])
.groupby('ID')[['visit_1', 'visit_2', 'visit_3']].sum()
)
output:
visit_1 visit_2 visit_3
ID
0 5.0 6.0 1.0
1 1.0 2.0 1.0
2 0.0 0.0 0.0
A variant that restores integer dtypes and adds a suffix to the column names:
df2 = df.set_index('ID')
(df2.where(df2['Holidays'])
.groupby('ID')[['visit_1', 'visit_2', 'visit_3']].sum()
.convert_dtypes()
.add_suffix('_Holidays')
)
output:
visit_1_Holidays visit_2_Holidays visit_3_Holidays
ID
0 5 6 1
1 1 2 1
2 0 0 0
I have this dataset, which contains some NaN values:
df = pd.DataFrame({'Id':[1,2,3,4,5,6], 'Name':['Eve','Diana',np.NaN,'Mia','Mae',np.NaN], "Count":[10,3,np.NaN,8,5,2]})
df
Id Name Count
0 1 Eve 10.0
1 2 Diana 3.0
2 3 NaN NaN
3 4 Mia 8.0
4 5 Mae 5.0
5 6 NaN 2.0
I want to test whether each column has a NaN value (0) or not (1), creating two new flag columns. I have tried this:
df_clean = df
df_clean[['Name_flag','Count_flag']] = df_clean[['Name','Count']].apply(lambda x: 0 if x == np.NaN else 1, axis = 1)
But it raises the error "The truth value of a Series is ambiguous". I want to do this without redundancy, but I see there is a mistake in my logic. Could you please help me with this?
The expected table is:
Id Name Count Name_flag Count_flag
0 1 Eve 10.0 1 1
1 2 Diana 3.0 1 1
2 3 NaN NaN 0 0
3 4 Mia 8.0 1 1
4 5 Mae 5.0 1 1
5 6 NaN 2.0 0 1
Multiply the notna boolean mask by 1:
df[['Name_flag','Count_flag']] = df[['Name', 'Count']].notna() * 1
>>> df
   Id   Name  Count  Name_flag  Count_flag
0   1    Eve   10.0          1           1
1   2  Diana    3.0          1           1
2   3    NaN    NaN          0           0
3   4    Mia    8.0          1           1
4   5    Mae    5.0          1           1
5   6    NaN    2.0          0           1
Regarding the error "The truth value of a Series is ambiguous":
With apply you cannot return a scalar 0 or 1, because the lambda receives a whole Series as input. You would have to use applymap instead, which applies a function elementwise. Comparing to NaN is not straightforward either, since np.nan == np.nan evaluates to False.
Try:
df[['Name','Count']].applymap(lambda x: str(x) != 'nan') * 1
We can use notna and convert the boolean to int:
df[["Name_flag", "Count_flag"]] = df[["Name", "Count"]].notna().astype(int)
   Id   Name  Count  Name_flag  Count_flag
0   1    Eve  10.00          1           1
1   2  Diana   3.00          1           1
2   3    NaN    NaN          0           0
3   4    Mia   8.00          1           1
4   5    Mae   5.00          1           1
5   6    NaN   2.00          0           1
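If you also want to avoid repeating the column names in the assignment, a small sketch that builds all the flag columns at once and joins them back (the '_flag' suffix is my own choice):
flag_cols = ['Name', 'Count']
flags = df[flag_cols].notna().astype(int).add_suffix('_flag')
df = df.join(flags)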
I have a pandas.DataFrame with a column that contains integers, strings, dates, and so on.
I want to create columns (containing 0 or 1) that tell whether the value in that column is a string or not, a date or not, and so on, in an efficient way.
A
0 Hello
1 Name
2 123
3 456
4 22/03/2019
And the output should be
A A_string A_number A_date
0 Hello 1 0 0
1 Name 1 0 0
2 123 0 1 0
3 456 0 1 0
4 22/03/2019 0 0 1
Using pandas str methods to check for the string type could help:
df = pd.read_clipboard()
df['A_string'] = df.A.str.isalpha().astype(int)
df['A_number'] = df.A.str.isdigit().astype(int)
# naive assumption: anything that is neither purely alphabetic nor purely numeric is treated as a date
df['A_Date'] = (~df.A.str.isalnum()).astype(int)
df.filter(['A','A_string','A_number','A_Date'])
A A_string A_number A_Date
0 Hello 1 0 0
1 Name 1 0 0
2 123 0 1 0
3 456 0 1 0
4 22/03/2019 0 0 1
We can use the native pandas functions pd.to_numeric and pd.to_datetime to test for numbers and dates. Then we can use .loc for assignment and fillna to match your target df.
df.loc[~pd.to_datetime(df['A'], errors='coerce').isna(), 'A_Date'] = 1
df.loc[~pd.to_numeric(df['A'], errors='coerce').isna(), 'A_Number'] = 1
df.loc[pd.to_numeric(df['A'], errors='coerce').isna()
       & pd.to_datetime(df['A'], errors='coerce').isna(),
       'A_String'] = 1
df = df.fillna(0)
print(df)
A A_Date A_Number A_String
0 Hello 0.0 0.0 1.0
1 Name 0.0 0.0 1.0
2 123 0.0 1.0 0.0
3 456 0.0 1.0 0.0
4 22/03/2019 1.0 0.0 0.0
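A more compact sketch of the same approach: compute the two masks once and assign integer flags directly, which also avoids the float 0.0/1.0 columns produced by fillna (column names as above):
is_num = pd.to_numeric(df['A'], errors='coerce').notna()
is_date = pd.to_datetime(df['A'], errors='coerce').notna()
df['A_Number'] = is_num.astype(int)
df['A_Date'] = is_date.astype(int)
df['A_String'] = (~is_num & ~is_date).astype(int)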
I have a list of uncertainties that correspond to particular values of n, which I'll call table1. I would like to add those uncertainties to a comprehensive, larger table of data, table2, which is ordered numerically by n in ascending order. How could I attach each uncertainty to the correct corresponding value of n?
My first issue is that my table of uncertainties is a Table, not a DataFrame. I have the separate arrays but I am not sure how to combine them into a DataFrame.
table1 = Table([xrow,yrow])
xrow denotes the array of the 'n' values shown below in table1, and yrow denotes the corresponding errors.
excerpt of table1:
n error
1 0.0
2 0.00496
3 0.0096
4 0.00913
6 0.00555
8 0.00718
10 0.00707
excerpt of table2:
n Energy g J error
0 1 0.000000 1 0 NaN
1 2 1827.486200 1 0 NaN
2 3 3626.681500 1 0 NaN
3 4 5396.686500 1 0 NaN
4 5 6250.149500 1 0 NaN
so the end result should look like this:
n Energy g J error
0 1 0.000000 1 0 0
1 2 1827.486200 1 0 0.00496
2 3 3626.681500 1 0 0.0096
3 4 5396.686500 1 0 0.00913
4 5 6250.149500 1 0 NaN
i.e. the rows where there is no data remain blank (e.g. n=5 in the above case).
I should note there is a lot of data: roughly 30k rows in table2 and 2.5k in table1.
You can use .merge like this:
import pandas as pd
from io import StringIO
table1 = pd.read_csv(StringIO("""
n error
1 0.0
2 0.00496
3 0.0096
4 0.00913
6 0.00555
8 0.00718
10 0.00707"""), sep=r"\s+")
table2 = pd.read_csv(StringIO("""
n Energy g J error
0 1 0.000000 1 0 NaN
1 2 1827.486200 1 0 NaN
2 3 3626.681500 1 0 NaN
3 4 5396.686500 1 0 NaN
4 5 6250.149500 1 0 NaN"""), sep=r"\s+")
table2["error"] = table1.merge(table2, on="n", how="right")["error_x"]
print(table2)
Output:
n Energy g J error
0 1 0.0000 1 0 0.00000
1 2 1827.4862 1 0 0.00496
2 3 3626.6815 1 0 0.00960
3 4 5396.6865 1 0 0.00913
4 5 6250.1495 1 0 NaN
EDIT: using .map should perform better (see comments):
table2["error"] = table2["n"].map(table1.set_index('n')['error'])
I have many CSV files with the same number of columns (but different numbers of rows) in the following pattern:
File 1:
A1,B1,C1
A2,B2,C2
A3,B3,C3
A4,B4,C4
File 2:
*A1*,*B1*,*C1*
*A2*,*B2*,*C2*
*A3*,*B3*,*C3*
File ...
Output:
A1+*A1*+...,B1+*B1*+...,C1+*C1*+...
A2+*A2*+...,B2+*B2*+...,C2+*C2*+...
A3+*A3*+...,B3+*B3*+...,C3+*C3*+...
A4+... ,B4+... ,C4+...
For example:
File 1:
1,0,0
1,0,1
1,0,0
0,1,0
File 2:
1,1,0
1,1,1
0,1,0
Output:
2,1,0
2,1,2
1,1,0
0,1,0
I am trying to use pandas and was thinking of something like this to read the files into a dictionary of DataFrames:
dic = {}
for i in range(14253, 14352):
    try:
        dic['df_{0}'.format(i)] = pandas.read_csv('output_' + str(i) + '.csv')
    except:
        pass
and then to sum the columns:
for residue in residues:
    for number in range(14254, 14255):
        df = dic['df_14253'][residue]
        df += dic['df_' + str(number)][residue]
residues is a list of strings which are the column names.
My problem is that my files have different numbers of rows, so the columns are only summed up to the last row of the first file. How could I add them up to the last row of the longest file, so that no data is lost? I think a groupby().sum() in pandas could be an option, but I don't understand how to use it.
To add an example, this is what I currently get:
File 1:
1,0,0
1,0,1
1,0,0
0,1,0
File 2:
1,1,0
1,1,1
0,1,0
File 3:
1,0,0
0,0,1
1,0,0
1,0,0
1,0,0
1,0,1
File ...:
Output:
3,1,0
2,1,3
2,1,0
1,1,0
1,0,0
1,0,1
You can use a Panel in pandas, a 3D object that is a collection of DataFrames:
dfs = {i: pd.read_csv('file' + str(i) + '.csv', sep=',',
                      header=None, index_col=None)
       for i in range(n)}  # n files
panel = pd.Panel(dfs)
dfs_sum = panel.sum(axis=0)
dfs is a dictionary of DataFrames. Panel automatically pads the missing values with NaN and computes the sum correctly. For example:
In [500]: panel[1]
Out[500]:
0 1 2
0 1 0 0
1 1 0 1
2 1 0 0
3 0 1 0
4 NaN NaN NaN
5 NaN NaN NaN
6 NaN NaN NaN
7 NaN NaN NaN
8 NaN NaN NaN
9 NaN NaN NaN
10 NaN NaN NaN
11 NaN NaN NaN
In [501]: panel[2]
Out[501]:
0 1 2
0 1 0 0
1 1 0 1
2 1 0 0
3 0 1 0
4 1 0 0
5 1 0 1
6 1 0 0
7 0 1 0
8 NaN NaN NaN
9 NaN NaN NaN
10 NaN NaN NaN
11 NaN NaN NaN
In [502]: panel[3]
Out[502]:
0 1 2
0 1 0 0
1 1 0 1
2 1 0 0
3 0 1 0
4 1 0 0
5 1 0 1
6 1 0 0
7 0 1 0
8 1 0 0
9 1 0 1
10 1 0 0
11 0 1 0
In [503]: panel.sum(0)
Out[503]:
0 1 2
0 3 0 0
1 3 0 3
2 3 0 0
3 0 3 0
4 2 0 0
5 2 0 2
6 2 0 0
7 0 2 0
8 1 0 0
9 1 0 1
10 1 0 0
11 0 1 0
Looking for this exact same thing, I found out that Panel is now deprecated, so I am posting the news here:
class pandas.Panel(data=None, items=None, major_axis=None, minor_axis=None, copy=False, dtype=None)
Deprecated since version 0.20.0: The recommended way to represent 3-D data are with a MultiIndex on a DataFrame via the to_frame() method or with the xarray package. Pandas provides a to_xarray() method to automate this conversion.
to_frame(filter_observations=True)
Transform wide format into long (stacked) format as DataFrame whose columns are the Panel's items and whose index is a MultiIndex formed of the Panel's major and minor axes.
I would recommend using pandas.DataFrame.sum:
DataFrame.sum(axis=None, skipna=None, level=None, numeric_only=None, min_count=0, **kwargs)
Parameters:
axis : {index (0), columns (1)}
    Axis for the function to be applied on.
One can use it the same way as in B.M.'s answer.
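Since Panel has since been removed from pandas, here is a minimal sketch of one way to get the same padded sum with plain DataFrames: read each file, then align the frames with DataFrame.add and fill_value=0 so that rows missing from the shorter files are treated as zeros (the file names follow the answer above and are assumptions):
from functools import reduce
import pandas as pd

# Read the CSV files; n is the number of files, as in the answer above.
# The frames may have different numbers of rows.
frames = [pd.read_csv('file' + str(i) + '.csv', header=None) for i in range(n)]

# DataFrame.add aligns on the row index; fill_value=0 keeps rows that
# exist only in the longer frames instead of turning them into NaN.
total = reduce(lambda a, b: a.add(b, fill_value=0), frames)
print(total)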