Non-breaking space column indexing in pandas dataframe - python

I have a pandas groupby object that contains a column whose name is a non-breaking space. Although the following snippet is able to print it:
In [25]: for key, item in grouped_df:
    ...:     print(key)
Output:
... other names, then a line that appears blank (the non-breaking space)
I'm not able to index it with grouped_df[key]:
In [29]: for key, item in grouped_df:
    ...:     print(key, grouped_df[key].count())
which results in:
KeyError: 'Column not found: '
[Update]
A partial solution was to use .agg(['count']) (a short sketch follows the reproduction code below). However, that only addresses this specific example, not the underlying problem.
Here is code which reproduces the problem:
import numpy as np
import pandas as pd

N = 100
# '\xa0' is the non-breaking space character
df = pd.DataFrame({'col': np.random.choice([1, 2, 3, 4, '\xa0'], size=N),
                   'col2': np.random.randint(10, size=N)})
grouped_df = df.groupby('col')

for key, item in grouped_df:
    print(key)

print(grouped_df['\xa0'])   # raises KeyError: 'Column not found: \xa0'
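For reference, a minimal sketch of the .agg(['count']) workaround mentioned in the update, run in place of the failing lookup above:
# Count every non-grouping column per group; the awkward key never has to be
# used as a column label, it only appears in the result's index.
counts = grouped_df.agg(['count'])
print(counts)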

grouped_df is a DataFrameGroupBy object, not a DataFrame.
To extract a DataFrame from grouped_df, use the get_group method:
In [231]: grouped_df.get_group('\xa0')
Out[231]:
   col  col2
3          9
9          2
14         5
29         0
30         4
33         6
38         7
41         0
53         7
57         6
73         8
75         7
83         0
92         1
98         8
(The col column appears blank because every value in this group is the non-breaking space.)
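Applied back to the question's loop, a minimal sketch using the variable names from the reproduction code: pass the group key to get_group instead of treating it as a column name.
# grouped_df[key] fails because key is a group key, not a column;
# get_group(key) returns that group's rows as a DataFrame.
for key, item in grouped_df:
    print(repr(key), grouped_df.get_group(key)['col2'].count())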

Related

Assigning a value to a column derived from another column in Pandas

I have a dataframe with three columns containing text. One column (column1) consists of 3 unique entries: "H", "D", "A".
I want to create a new column with the entries from the other two columns (column2 & column3) based on the entry from the column containing "H", "D" or "A".
I tried to write a function:
def func(x):
    if x == "H":
        return column2
    elif x == "A":
        return column3
    else:
        return "D"
I then tried to use the .apply() function:
df["new_col"] = df["column1"].apply(func)
But this doesn't work, as it doesn't recognise column2 & column3. How do I access the entries of column2 & column3 inside the function?
You can send the whole row to the function and access its columns:
def func(x):
    if x["column1"] == "H":
        return x["column2"]
    elif x["column1"] == "A":
        return x["column3"]
    else:
        return "D"

df["new_col"] = df.apply(func, axis=1)
No need to use .apply; you can use np.select to choose elements based upon the conditions:
Consider the example dataframe:
df = pd.DataFrame({
    'column1': ['H', 'D', 'A', 'H', 'A'],
    'column2': [1, 2, 3, 4, 5],
    'column3': [10, 20, 30, 40, 50]
})
Use:
import numpy as np
conditions = [
    df['column1'].eq('H'),
    df['column1'].eq('A')
]
choices = [
    df['column2'],
    df['column3']
]
df['new_col'] = np.select(conditions, choices, default='D')
Result:
# print(df)
  column1  column2  column3 new_col
0       H        1       10       1
1       D        2       20       D
2       A        3       30      30
3       H        4       40       4
4       A        5       50      50
Here I am retrieving the rows that match the required conditions and altering the corresponding rows in column4. We can achieve this using iloc on a pandas dataframe.
import pandas as pd

d = {"column1": ["H", "D", "A", "D", "H", "H", "A"],
     "column2": [1, 2, 3, 4, 5, 6, 7],
     "column3": [12, 23, 34, 45, 56, 67, 87]}
df = pd.DataFrame(d)
df["column4"] = None

df.iloc[list(df[df["column1"] == "H"].index), 3] = df[df["column1"] == "H"]["column2"]
df.iloc[list(df[df["column1"] == "A"].index), 3] = df[df["column1"] == "A"]["column3"]
df.iloc[list(df[df["column4"].isnull()].index), 3] = "D"
The output of the above processing is given below:
print(df)
  column1  column2  column3 column4
0       H        1       12       1
1       D        2       23       D
2       A        3       34      34
3       D        4       45       D
4       H        5       56       5
5       H        6       67       6
6       A        7       87      87
You can use the np.select() function:
import numpy as np

df['column4'] = np.select([df.column1 == 'H', df.column1 == 'A'],
                          [df.column2, df.column3], default='D')
It's a kind of CASE WHEN statement: the first argument is the list of conditions to evaluate, the second argument is the list of outputs corresponding to those conditions, and default is a keyword argument for the 'else' case.
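To illustrate the case-when analogy, here is a hedged sketch (the third branch and its values are invented for this example and are not part of the original answer): extra cases are simply extra entries in the two lists, evaluated in order, with default acting as the ELSE.
import numpy as np
import pandas as pd

df = pd.DataFrame({'column1': ['H', 'D', 'A', 'X'],
                   'column2': [1, 2, 3, 4],
                   'column3': [10, 20, 30, 40]})

# 'X' is a hypothetical extra case added only to show how branches are appended.
df['column4'] = np.select(
    [df.column1 == 'H', df.column1 == 'A', df.column1 == 'X'],
    [df.column2, df.column3, df.column2 + df.column3],
    default='D')
print(df)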
Based on my understanding of your query, I'll illustrate using your example.
Consider this data frame:
d = {
    "col1": ["H", "D", "A", "H", "D", "A"],
    "col2": [172, 180, 190, 156, 176, 182],
    "col3": [80, 75, 53, 80, 100, 92]
}
df = pd.DataFrame(d)
df
  col1  col2  col3
0    H   172    80
1    D   180    75
2    A   190    53
3    H   156    80
4    D   176   100
5    A   182    92
apply passes a Series to the function, and the columns are accessed with the appropriate indices of the dataframe you pass. When calling apply it is necessary to pass axis=1, since you need the column values for each row. Finally, assign the returned Series back to the original dataframe.
def func(row):
    # row is a Series holding one row; access its values by position
    if row.iloc[0] == 'H':
        return row.iloc[1]
    elif row.iloc[0] == 'A':
        return row.iloc[2]
    else:
        return "D"

df['col4'] = df.apply(func, axis=1)
df
  col1  col2  col3 col4
0    H   172    80  172
1    D   180    75    D
2    A   190    53   53
3    H   156    80  156
4    D   176   100    D
5    A   182    92   92

Looping over data frame to cap and sum another data frame

I am trying to use entries from df1 to limit the amounts in df2, then add them up based on their type and summarize the result in df3. I'm not sure how to get there; a for loop using iterrows would be my best guess, but it's not complete.
Code:
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'Caps': ['25', '50', '100']})
df2 = pd.DataFrame({'Amounts': ['45', '25', '65', '35', '85', '105', '80'],
                    'Type':    ['a',  'b',  'b',  'c',  'a',  'b',   'd']})
df3 = pd.DataFrame({'Type': ['a', 'b', 'c', 'd']})

df1['Caps'] = df1['Caps'].astype(float)
df2['Amounts'] = df2['Amounts'].astype(float)

for index1, row1 in df1.iterrows():
    for index2, row2 in df3.iterrows():
        df3[str(row1['Caps']) + 'limit'] = df2['Amounts'].where(
            df2['Type'] == row2['Type']).where(
            df2['Amounts'] <= row1['Caps'], row1['Caps']).sum()
# My ideal output would be this:
df3 = pd.DataFrame({'Type': ['a', 'b', 'c', 'd'],
                    'Total': ['130', '195', '35', '80'],
                    '25limit': ['50', '75', '25', '25'],
                    '50limit': ['95', '125', '35', '50'],
                    '100limit': ['130', '190', '35', '80'],
                    })
Output:
>>> df3
  Type Total 25limit 50limit 100limit
0    a   130      50      95      130
1    b   195      75     125      190
2    c    35      25      35       35
3    d    80      25      50       80
Use numpy to compare all Amounts values with Caps by broadcasting to a 2D array a, then create a DataFrame with the constructor, sum across columns per Type, transpose with DataFrame.T and add a suffix with DataFrame.add_suffix.
For the aggregated column, use DataFrame.insert to place the Total as the first column, computed with GroupBy.sum:
df1['Caps'] = df1['Caps'].astype(int)
df2['Amounts'] = df2['Amounts'].astype(int)

am = df2['Amounts'].to_numpy()
ca = df1['Caps'].to_numpy()
# pandas below 0.24
# am = df2['Amounts'].values
# ca = df1['Caps'].values

a = np.where(am <= ca[:, None], am[None, :], ca[:, None])

df1 = (pd.DataFrame(a, columns=df2['Type'], index=df1['Caps'])
         .sum(axis=1, level=0).T.add_suffix('limit'))
df1.insert(0, 'Total', df2.groupby('Type')['Amounts'].sum())
df1 = df1.reset_index().rename_axis(None, axis=1)
print(df1)

  Type  Total  25limit  50limit  100limit
0    a    130       50       95       130
1    b    195       75      125       190
2    c     35       25       35        35
3    d     80       25       50        80
Here is my solution without numpy; however, it is two times slower than @jezrael's solution, 10.5 ms vs. 5.07 ms.
limcols = df1.Caps.to_list()
df2 = df2.reindex(columns=["Amounts", "Type"] + limcols)
df2[limcols] = df2[limcols].transform(
    lambda sc: np.where(df2.Amounts.le(sc.name), df2.Amounts, sc.name))

# Summations:
g = df2.groupby("Type")
df3 = g[limcols].sum()
df3.insert(0, "Total", g.Amounts.sum())

# Renaming columns:
c_dic = {lim: f"{lim:.0f}limit" for lim in limcols}
df3 = df3.rename(columns=c_dic).reset_index()

# Cleanup:
# df2 = df2.drop(columns=limcols)

Pandas: for each row in a DataFrame, count the number of rows matching a condition

I have a DataFrame for which I want to calculate, for each row, how many other rows match a given condition (e.g. the number of rows whose value in column C is less than this row's value). Iterating through each row is too slow (I have ~1B rows), especially when the column's dtype is datetime, but this is the way it could be run on a DataFrame df with a column labeled C:
df['newcol'] = 0
for row in df.itertuples():
    df.loc[row.Index, 'newcol'] = len(df[df.C < row.C])
Is there a way to vectorize this?
Thanks!
Preparation:
import numpy as np
import pandas as pd

count = 5000
np.random.seed(100)
data = np.random.randint(100, size=count)
df = pd.DataFrame({'Col': list('ABCDE') * (count // 5),  # integer division for Python 3
                   'Val': data})
Suggestion:
u, c = np.unique(data, return_counts=True)
values = np.cumsum(c)
dictionary = dict(zip(u[1:], values[:-1]))
dictionary[u[0]] = 0
df['newcol'] = [dictionary[x] for x in data]
It does exactly what your example does. If it does not help, please write a more detailed question.
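An equivalent one-liner, as a sketch reusing the data array from the preparation above (newcol_alt is just an illustrative name): np.searchsorted with side='left' returns, for each value, how many elements of the sorted array are strictly smaller.
# Same strictly-less counts as the dictionary approach, without the bookkeeping.
df['newcol_alt'] = np.searchsorted(np.sort(data), data, side='left')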
Recommendations:
Pandas vectorization and JIT compilation are available with numba.
If you work with 1D arrays, use numpy. In many situations it works faster. Just compare:
Pandas
%timeit df['newcol2'] = df.apply(lambda x: sum(df['Val'] < x.Val), axis=1)
1 loop, best of 3: 51.1 s per loop
204.34800005
Numpy
%timeit df['newcol3'] = [np.sum(data<x) for x in data]
10 loops, best of 3: 61.3 ms per loop
2.5490000248
Use numpy.sum instead of sum!
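As a rough sketch of why that matters (exact timings vary by machine): the built-in sum walks the boolean mask element by element at the Python level, while numpy.sum reduces it in compiled code.
import numpy as np

data = np.random.randint(100, size=5000)
mask = data < 50          # boolean ndarray

slow = sum(mask)          # Python-level loop over 5000 numpy booleans
fast = np.sum(mask)       # single vectorized reduction
assert slow == fast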
Consider pandas.DataFrame.apply with a lambda expression to count the rows matching your condition. Admittedly, apply is a loop, and running it across ~1 billion rows may take time to process.
import numpy as np
import pandas as pd

np.random.seed(161)
df = pd.DataFrame({'Col': list('ABCDE') * 3,
                   'Val': np.random.randint(100, size=15)})

df['newcol'] = df.apply(lambda x: sum(df['Val'] < x.Val), axis=1)
#    Col  Val  newcol
# 0    A   78      13
# 1    B   11       2
# 2    C   51       8
# 3    D   31       5
# 4    E   29       4
# 5    A   99      14
# 6    B   65      10
# 7    C   16       3
# 8    D   43       7
# 9    E   10       1
# 10   A   67      11
# 11   B   36       6
# 12   C    1       0
# 13   D   73      12
# 14   E   64       9
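Not taken from the answers above, but as a hedged vectorized pandas sketch with the same column names: Series.rank with method='min' assigns each value a rank of one plus the number of strictly smaller values, so subtracting one reproduces the count without a Python-level loop (newcol_rank is just an illustrative name).
# Ties all receive the minimum rank, so this matches the strictly-less count.
df['newcol_rank'] = df['Val'].rank(method='min').astype(int) - 1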

Pandas groupby result into multiple columns

I have a dataframe in which I'm looking to group and then partition the values within a group into multiple columns.
For example: say I have the following dataframe:
>>> import pandas as pd
>>> import numpy as np
>>> df=pd.DataFrame()
>>> df['Group']=['A','C','B','A','C','C']
>>> df['ID']=[1,2,3,4,5,6]
>>> df['Value']=np.random.randint(1,100,6)
>>> df
  Group  ID  Value
0     A   1     66
1     C   2      2
2     B   3     98
3     A   4     90
4     C   5     85
5     C   6     38
>>>
I want to groupby the "Group" field, get the sum of the "Value" field, and get new fields, each of which holds the ID values of the group.
Currently I am able to do this as follows, but I am looking for a cleaner methodology:
First, I create a dataframe with a list of the IDs in each group.
>>> g=df.groupby('Group')
>>> result=g.agg({'Value':np.sum, 'ID':lambda x:x.tolist()})
>>> result
              ID  Value
Group
A         [1, 4]     98
B            [3]     76
C      [2, 5, 6]    204
>>>
And then I use pd.Series to split those up into columns, rename them, and then join it back.
>>> id_df=result.ID.apply(lambda x:pd.Series(x))
>>> id_cols=['ID'+str(x) for x in range(1,len(id_df.columns)+1)]
>>> id_df.columns=id_cols
>>>
>>> result.join(id_df)[id_cols+['Value']]
       ID1  ID2  ID3  Value
Group
A        1    4  NaN     98
B        3  NaN  NaN     76
C        2    5    6    204
>>>
Is there a way to do this without first having to create the list of values?
You could use
id_df = grouped['ID'].apply(lambda x: pd.Series(x.values)).unstack()
to create id_df without the intermediate result DataFrame.
import pandas as pd
import numpy as np

np.random.seed(2016)
df = pd.DataFrame({'Group': ['A', 'C', 'B', 'A', 'C', 'C'],
                   'ID': [1, 2, 3, 4, 5, 6],
                   'Value': np.random.randint(1, 100, 6)})

grouped = df.groupby('Group')
values = grouped['Value'].agg('sum')
id_df = grouped['ID'].apply(lambda x: pd.Series(x.values)).unstack()
id_df = id_df.rename(columns={i: 'ID{}'.format(i + 1) for i in range(id_df.shape[1])})
result = pd.concat([id_df, values], axis=1)
print(result)
yields
       ID1  ID2  ID3  Value
Group
A        1    4  NaN     77
B        3  NaN  NaN     84
C        2    5    6     86
Another way of doing this is to first add a "helper" column to your data, then pivot your dataframe using the "helper" column, in the case below "ID_Count":
Using @unutbu's setup:
import pandas as pd
import numpy as np

np.random.seed(2016)
df = pd.DataFrame({'Group': ['A', 'C', 'B', 'A', 'C', 'C'],
                   'ID': [1, 2, 3, 4, 5, 6],
                   'Value': np.random.randint(1, 100, 6)})

# Create group
grp = df.groupby('Group')

# Create helper column
df['ID_Count'] = grp['ID'].cumcount() + 1

# Pivot dataframe using the helper column and add 'Value' column to the pivoted output.
df_out = (df.pivot(index='Group', columns='ID_Count', values='ID')
            .add_prefix('ID')
            .assign(Value=grp['Value'].sum()))
Output:
ID_Count  ID1  ID2  ID3  Value
Group
A         1.0  4.0  NaN     77
B         3.0  NaN  NaN     84
C         2.0  5.0  6.0     86
Using get_dummies and MultiLabelBinarizer (scikit-learn):
import pandas as pd
import numpy as np
from sklearn import preprocessing

df = pd.DataFrame()
df['Group'] = ['A', 'C', 'B', 'A', 'C', 'C']
df['ID'] = [1, 2, 3, 4, 5, 6]
df['Value'] = np.random.randint(1, 100, 6)

classes = df['ID'].unique()  # the ID values used by the binarizer
mlb = preprocessing.MultiLabelBinarizer(classes=classes).fit([])
df2 = pd.get_dummies(df, '', '', columns=['ID']).groupby(by='Group').sum()
df3 = pd.DataFrame(mlb.inverse_transform(df2[df['ID'].unique()].values), index=df2.index)
df3.columns = ['ID' + str(x + 1) for x in range(df3.shape[1])]  # one name per column
pd.concat([df3, df2['Value']], axis=1)
       ID1  ID2  ID3  Value
Group
A        1    4  NaN     63
B        3  NaN  NaN     59
C        2    5    6    230

Read all lines of csv file using .read_csv

I am trying to read a simple csv file using pandas, but I can't figure out how to not "lose" the first row.
For example:
my_file.csv
Looks like this:
45
34
77
But when I try to to read it:
In [18]: import pandas as pd
In [19]: df = pd.read_csv('my_file.csv', header=False)
In [20]: df
Out[20]:
45
0 34
1 77
[2 rows x 1 columns]
This is not what I am after, I want to have 3 rows. I want my DataFrame to look exactly like this:
In [26]: my_list = [45,34,77]
In [27]: df = pd.DataFrame(my_list)
In [28]: df
Out[28]:
0
0 45
1 34
2 77
[3 rows x 1 columns]
How can I use .read_csv to get the result I am looking for?
Yeah, this is a bit of a UI problem. We should handle False; right now it thinks you want the header on row 0 (== False). Use None instead:
>>> df = pd.read_csv("my_file.csv", header=False)
>>> df
45
0 34
1 77
>>> df = pd.read_csv("my_file.csv", header=None)
>>> df
0
0 45
1 34
2 77
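As a small follow-up sketch, read_csv also accepts a names list when there is no header row, in case a more descriptive column name than 0 is wanted (the name 'value' here is just an example):
df = pd.read_csv('my_file.csv', header=None, names=['value'])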
