I am trying to create a dataframe in pandas and directly use one of the generated columns to assign a new column to the same df.
As a simplified example, I tried to multiply a column of a df using assign:
import pandas as pd
df = pd.DataFrame([['A', 1], ['B', 2], ['C', 3]] , columns = ['col1', 'col2'])\
.assign(col3 = 2 * col2)
but then I get an error NameError: name 'col2' is not defined.
Using R/dplyr, I would be able to do this in a pipe using
df <- data.frame(col1 = LETTERS[1:3], col2 = 1:3) %>% mutate(col3 = 2 * col2)
Also, in a general sense, pipe notation in R/dplyr allows the usage of the "." to refer to the data forwarded by the pipe.
Is there a way to refer to the columns just created (or to the data that goes into the assign statement), thus doing the same thing in Pandas?
Use a lambda function; more information in Assigning New Columns in Method Chains:
df = (pd.DataFrame([['A', 1], ['B', 2], ['C', 3]], columns=['col1', 'col2'])
        .assign(col3=lambda x: 2 * x.col2))
print(df)
  col1  col2  col3
0    A     1     2
1    B     2     4
2    C     3     6
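If you also need a column to refer to another column created earlier in the same assign call, newer pandas versions (0.23+) evaluate the keyword arguments in order, so a later lambda can see the columns created before it. A small sketch (the extra col4 is only for illustration):
import pandas as pd

df = (pd.DataFrame([['A', 1], ['B', 2], ['C', 3]], columns=['col1', 'col2'])
        .assign(col3=lambda x: 2 * x.col2,   # uses an existing column
                col4=lambda x: x.col3 + 1))  # uses the column created just above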
I wrote a package, datar, to port dplyr and related packages to Python. Now you can do it with (almost) the same syntax as in R:
>>> from datar.all import f, tibble, LETTERS, mutate
>>> tibble(col1=LETTERS[:3], col2=f[1:3]) >> mutate(col3=2*f.col2)
      col1     col2     col3
  <object>  <int64>  <int64>
0        A        1        2
1        B        2        4
2        C        3        6
I need to convert DataFrames similar to this one to JSON format:
               col1 col2  col3
col1 col2 col3
1    a    10      1    a    10
2    b    11      2    b    11
3    c    12      3    c    12
However, when I run df.to_json(orient='table') I get the exception ValueError: Overlapping names between the index and columns. I understand why this happens, but I would really like to know if there is an easy way to circumvent the error. All I need is to convert the DataFrame to JSON while keeping the same indexes, and to get the original DataFrame back when restoring it.
Here I leave a code snippet so you can reproduce the scenario.
import pandas as pd
df = pd.DataFrame({'col1': [1, 2, 3], 'col2': ['a', 'b', 'c'], 'col3': [10, 11, 12]})
df = df.set_index(keys=['col1', 'col2', 'col3'], drop=False)
df.to_json(orient='table')
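One possible workaround (just a sketch, not a definitive fix): since drop=False keeps the same values in the columns, you could drop the overlapping index before serializing and rebuild it after reading the JSON back.
import io
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3], 'col2': ['a', 'b', 'c'], 'col3': [10, 11, 12]})
df = df.set_index(keys=['col1', 'col2', 'col3'], drop=False)

# The index only duplicates the columns, so serialize without it; no data is lost.
json_str = df.reset_index(drop=True).to_json(orient='table')

# Rebuild the index when restoring to get back the original DataFrame.
restored = (pd.read_json(io.StringIO(json_str), orient='table')
              .set_index(keys=['col1', 'col2', 'col3'], drop=False))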
According to this thread, we could use map or replace to remap values of a DataFrame using a defined dictionary. I have tried this and it did correctly remap the values, but the output only contains the column I performed the operation on (a Series) instead of the full DataFrame.
How can I perform the mapping but keep the other columns (along with 'last') in the new data3?
data3 = data['last'].map(my_dict)
I think what you are trying to do is this:
data['last'] = data['last'].map(my_dict)
Update based on a comment, in relation to the linked question:
In [1]: di = {1: "A", 2: "B"}
In [5]: from numpy import NaN
In [6]: df = DataFrame({'col1':['w', 1, 2], 'col2': ['a', 2, NaN]})
In [7]: df
Out[7]:
  col1 col2
0    w    a
1    1    2
2    2  NaN
In [8]: df['col1'].map(di)
Out[8]:
0    NaN
1      A
2      B
Name: col1, dtype: object
In [9]: df
Out[9]:
  col1 col2
0    w    a
1    1    2
2    2  NaN
In [10]: df['col1'] = df['col1'].map(di)
In [11]: df
Out[11]:
  col1 col2
0  NaN    a
1    A    2
2    B  NaN
If you want this to happen in data3 instead of data, you could assign the Series result of the map to a column in data3.
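A minimal sketch of that (assuming data3 should start as a full copy of data, with only 'last' remapped):
data3 = data.copy()                          # keep every other column unchanged
data3['last'] = data3['last'].map(my_dict)   # remap only the 'last' column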
I would like to count the number of cells within each row that contain a particular character string; cells which contain the string more than once should be counted only once.
I can count the number of cells across a row which equal a given value, but when I expand this logic to use str.contains, I run into issues, as shown below:
import pandas as pd

d = {'col1': ["a#", "b", "c#"], 'col2': ["a", "b", "c#"]}
df = pd.DataFrame(d)
# can correctly count across rows using equality
thisworks = (df == "a#").sum(axis=1)
# can count across a column using str.contains
thisworks1 = df['col1'].str.contains('#').sum()
# but cannot use str.contains with a DataFrame, so what is the alternative?
thisdoesnt = (df.str.contains('#')).sum(axis=1)
Output should be a series showing the number of cells in each row that contain the given character string.
str.contains is a Series method. To apply it to the whole DataFrame you need either agg or apply, such as:
df.agg(lambda x: x.str.contains('#')).sum(1)
Out[2358]:
0    1
1    0
2    2
dtype: int64
If you don't like agg or apply, you can use np.char.find to work directly on the underlying NumPy array of df. np.char.find returns the position of the first match or -1, so adding 1 and casting to bool marks the cells that contain the string:
(np.char.find(df.values.tolist(), '#') + 1).astype(bool).sum(1)
Out[2360]: array([1, 0, 2])
Passing it to a Series or a column of df:
pd.Series((np.char.find(df.values.tolist(), '#') + 1).astype(bool).sum(1), index=df.index)
Out[2361]:
0    1
1    0
2    2
dtype: int32
A solution using df.apply:
df = pd.DataFrame({'col1': ["a#", "b", "c#"],
                   'col2': ["a", "b", "c#"]})
df
  col1 col2
0   a#    a
1    b    b
2   c#   c#
df['sum'] = df.apply(lambda x: x.str.contains('#'), axis=1).sum(axis=1)
  col1 col2  sum
0   a#    a    1
1    b    b    0
2   c#   c#    2
Something like this should work:
df = pd.DataFrame({'col1': ['#', '0'], 'col2': ['#', '#']})
df['totals'] = df['col1'].str.contains('#', regex=False).astype(int) +\
df['col2'].str.contains('#', regex=False).astype(int)
df
#   col1 col2  totals
# 0    #    #       2
# 1    0    #       1
It should generalize to as many columns as you want.
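If there are many columns, one way to avoid writing a term per column is to sum the per-column checks in a comprehension. This loop-based variant is my own sketch, not part of the answer above:
import pandas as pd

df = pd.DataFrame({'col1': ['#', '0'], 'col2': ['#', '#']})
# Add up the boolean check for each column; the result is one count per row.
df['totals'] = sum(df[col].str.contains('#', regex=False).astype(int)
                   for col in ['col1', 'col2'])
print(df)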
I understand that the apply method is called even for an empty DataFrame. When there is an error inside the applied function, it doesn't get propagated. I was looking at this Stack Overflow question, which suggests using the reduce option so that the apply function is not called:
Pandas: why does DataFrame.apply(f, axis=1) call f when the DataFrame is empty?
Consider this example: in col1, everything is less than 10, so df[mask] is empty. When I use the reduce option, the dtype of col2 is changed; it converts the integers to floats.
d = {'col1': [1, 2], 'col2': [3, 4]}
df = pd.DataFrame(data=d)
mask = df["col1"] > 10
df.loc[mask, "col2"] = df[mask].apply(lambda x: x+2, axis=1, result_type='reduce')
print(df)
Expected output
   col1  col2
0     1     3
1     2     4
Actual output:
   col1  col2
0     1   3.0
1     2   4.0
I am not sure why it converts the integers to floats. Does anyone know how to avoid this?
You can use pd.to_numeric() to downcast back to integer.
I will update this answer if I find a better way to do it.
>>> import pandas as pd
>>> d = {'col1': [1, 2], 'col2': [3, 4]}
>>> df = pd.DataFrame(data=d)
>>> mask = df["col1"] > 10
>>> df.loc[mask, "col2"] = df[mask].apply(lambda x: x+2, axis=1, result_type='reduce')
>>>
>>> df
   col1  col2
0     1   3.0
1     2   4.0
>>>
>>> pd.to_numeric(df.col2, downcast='integer')
0    3
1    4
Name: col2, dtype: int8
>>>
>>> df.col2 = pd.to_numeric(df.col2, downcast='integer')
>>> df
   col1  col2
0     1     3
1     2     4
>>>
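Alternatively, a sketch of my own (not from the answer above): if you know col2 has no missing values, you can cast it back to a regular integer dtype with astype instead of downcasting:
import pandas as pd

d = {'col1': [1, 2], 'col2': [3, 4]}
df = pd.DataFrame(data=d)
mask = df["col1"] > 10
df.loc[mask, "col2"] = df[mask].apply(lambda x: x + 2, axis=1, result_type='reduce')

# Cast back to a plain integer dtype; this assumes col2 contains no NaN,
# because astype(int) fails on missing values.
df["col2"] = df["col2"].astype("int64")
print(df)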
I want to add a column of 1s at the beginning of a pandas DataFrame which is created from an external data file, 'ex1data1.txt'. I wrote the following code. The problem is that the print(data) command at the end returns None. What is wrong with this code? I want data to be a pandas DataFrame. raw_data and X0_ are fine; I have printed them.
import numpy as np
import pandas as pd
raw_data = pd.read_csv('ex1data1.txt', header=None, names=['x1', 'y'])
X0_ = np.ones(len(raw_data))
idx = 0
data = raw_data.insert(loc=idx, column='x0', value=X0_)
print(data)
Another solution might look like this:
import numpy as np
import pandas as pd
raw_data = pd.read_csv('ex1data1.txt', header=None, names=['x1', 'y'])
# insert modifies raw_data in place and returns None, so don't reassign the result
raw_data.insert(loc=0, column='x0', value=1.0)
print(raw_data)
pd.DataFrame.insert
You can use pd.DataFrame.insert, but note that this operation works in place and does not need reassignment. You may also need to explicitly set dtype to int, since np.ones defaults to float:
df = pd.DataFrame([[1, 2, 3], [4, 5, 6]],
columns=['col1', 'col2', 'col3'])
arr = np.ones(len(df.index), dtype=int)
idx = 0
df.insert(loc=idx, column='col0', value=arr)
print(df)
   col0  col1  col2  col3
0     1     1     2     3
1     1     4     5     6
Direct definition + reordering
One clean solution is to simply add a column and then move that last column to the beginning of your dataframe. Here's a complete example:
df = pd.DataFrame([[1, 2, 3], [4, 5, 6]],
columns=['col1', 'col2', 'col3'])
df['col0'] = 1 # adds column to end of dataframe
cols = [df.columns[-1]] + df.columns[:-1].tolist() # move last column to front
df = df[cols] # apply new column ordering
print(df)
   col0  col1  col2  col3
0     1     1     2     3
1     1     4     5     6