Hi, I am working with the Python datatable package and need to replace all the NA values after joining two DTs.
Sample data:
DT = data.table(x=rep(c("b","a","c"),each=3), y=c(1,3,6), v=1:9)
X = data.table(x=c("c","b"), v=8:7, foo=c(4,2))
X[DT, on="x"]
The code below replaces every 1 with 0:
DT.replace(1, 0)
How should I adapt it to replace NA? Or is there perhaps an option to change the fill value used when joining from NA to 0?
Thank you.
Here is the equivalent code using Python's datatable:
from datatable import dt, f, by, join

DT = dt.Frame(x=["b"] * 3 + ["a"] * 3 + ["c"] * 3,
              y=[1, 3, 6] * 3,
              v=range(1, 10))
X = dt.Frame({"x": ("c", "b"),
              "v": (8, 7),
              "foo": (4, 2)})
X.key="x" # key the ``x`` column
merger = DT[:, :, join(X)]
merger
x y v v.0 foo
0 b 1 1 7 2
1 b 3 2 7 2
2 b 6 3 7 2
3 a 1 4 NA NA
4 a 3 5 NA NA
5 a 6 6 NA NA
6 c 1 7 8 4
7 c 3 8 8 4
8 c 6 9 8 4
The NA values here are also None on the Python side, which makes them easy to replace with 0 (Frame.replace modifies the frame in place):
merger.replace(None, 0)
x y v v.0 foo
0 b 1 1 7 2
1 b 3 2 7 2
2 b 6 3 7 2
3 a 1 4 0 0
4 a 3 5 0 0
5 a 6 6 0 0
6 c 1 7 8 4
7 c 3 8 8 4
8 c 6 9 8 4
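If you only want to touch the columns the join padded with NA (say, because 0 is a meaningful value elsewhere in the frame), you can restrict the replacement with f-expressions. A minimal sketch, assuming a datatable version (1.0+) where comparing an f-expression with None matches NAs and DT[i, j] = value assignment is supported:

for col in ("v.0", "foo"):           # the columns introduced by the join
    merger[f[col] == None, col] = 0  # comparing with None matches the NAs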
Related
If you have two dataframes, represented as:
A F Y
0 1 2 3
1 4 5 6
And
B C T
0 7 8 9
1 10 11 12
When combined, it becomes:
A B C F T Y
0 1 7 8 2 9 3
1 4 10 11 5 12 6
I would like it to become:
A F Y B C T
0 1 2 3 7 8 9
1 4 5 6 10 11 12
How do I combine 1 data frame with another but keep the original column order?
In [1294]: new_df = df.join(df1)
In [1295]: new_df
Out[1295]:
A F Y B C T
0 1 2 3 7 8 9
1 4 5 6 10 11 12
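One caveat worth knowing: DataFrame.join aligns rows on the index and raises if the two frames share a column name unless you supply suffixes. A hedged sketch with a hypothetical overlap:

# If df and df1 shared a column name, suffixes would disambiguate it;
# with disjoint columns (as here) the suffixes are simply unused.
new_df = df.join(df1, lsuffix='_left', rsuffix='_right')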
Or you can also use pd.merge (not a very clean solution, though):
In [1297]: df['tmp'] = 1
In [1298]: df1['tmp'] = 1
In [1309]: pd.merge(df, df1, on=['tmp'], left_index=True, right_index=True).drop('tmp', axis=1)
Out[1309]:
A F Y B C T
0 1 2 3 7 8 9
1 4 5 6 10 11 12
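A cleaner alternative, for what it's worth, is pd.concat along the column axis: it aligns rows on the index and keeps each frame's column order, with no tmp key needed. A minimal self-contained sketch:

import pandas as pd

df = pd.DataFrame({'A': [1, 4], 'F': [2, 5], 'Y': [3, 6]})
df1 = pd.DataFrame({'B': [7, 10], 'C': [8, 11], 'T': [9, 12]})

# Side-by-side concatenation: df's columns first, then df1's
print(pd.concat([df, df1], axis=1))
#    A  F  Y   B   C   T
# 0  1  2  3   7   8   9
# 1  4  5  6  10  11  12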
I am trying to create a list of dataframes where each dataframe is 3 rows of a larger dataframe.
dframes = [df[0:3], df[3:6],...,df[2000:2003]]
I am still fairly new to programming. Why does the following code raise an error?
x = 3
dframes = []
for i in range(0, len(df)):
    dframes = dframes.append(df[i:x])
    i = x
    x = x + 3

    dframes = dframes.append(df[i:x])
AttributeError: 'NoneType' object has no attribute 'append'
Use np.split
Setup
Consider the dataframe df
df = pd.DataFrame(dict(A=range(15), B=list('abcdefghijklmno')))
Solution
dframes = np.split(df, range(3, len(df), 3))
Output
for d in dframes:
    print(d, '\n')
A B
0 0 a
1 1 b
2 2 c
A B
3 3 d
4 4 e
5 5 f
A B
6 6 g
7 7 h
8 8 i
A B
9 9 j
10 10 k
11 11 l
A B
12 12 m
13 13 n
14 14 o
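One nicety of the index-list form used here: if the frame's length is not an exact multiple of 3, you simply get a shorter final piece, whereas the integer form np.split(df, n) demands an exact equal division. A quick sketch with a hypothetical 16-row frame:

# 16 rows split at [3, 6, 9, 12, 15]: five pieces of 3 plus one of 1.
# The integer form np.split(df16, 3) would raise instead, since 16 rows
# cannot be divided into 3 equal sections.
df16 = pd.DataFrame(dict(A=range(16)))
pieces = np.split(df16, range(3, len(df16), 3))
print([len(p) for p in pieces])   # [3, 3, 3, 3, 3, 1]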
Python raises this error because list.append returns None, so on the next iteration of your loop the variable dframes is None.
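A minimal fix of the original loop, keeping its shape:

dframes = []
for i in range(0, len(df), 3):     # step by 3; no manual bookkeeping needed
    dframes.append(df[i:i+3])      # append mutates the list; don't reassign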
Or, more compactly, build the chunks with a list comprehension:
dframes = [df[i:i+3] for i in range(0, len(df), 3)]
You can use a list comprehension with groupby, grouping on a NumPy array built from the row positions floor-divided by 3:
np.random.seed(100)
df = pd.DataFrame(np.random.randint(10, size=(10,5)), columns=list('ABCDE'))
print (df)
A B C D E
0 8 8 3 7 7
1 0 4 2 5 2
2 2 2 1 0 8
3 4 0 9 6 2
4 4 1 5 3 4
5 4 3 7 1 1
6 7 7 0 2 9
7 9 3 2 5 8
8 1 0 7 6 2
9 0 8 2 5 1
dfs = [x for i, x in df.groupby(np.arange(len(df.index)) // 3)]
print (dfs)
[ A B C D E
0 8 8 3 7 7
1 0 4 2 5 2
2 2 2 1 0 8, A B C D E
3 4 0 9 6 2
4 4 1 5 3 4
5 4 3 7 1 1, A B C D E
6 7 7 0 2 9
7 9 3 2 5 8
8 1 0 7 6 2, A B C D E
9 0 8 2 5 1]
If the index is the default monotonic one (0, 1, 2, ...), the solution can be simplified:
dfs = [x for i, x in df.groupby(df.index // 3)]
I have the following sample to transform. After concatenating several CSV files, each row keeps the index it had in its source file, running from 0 up to that file's last row, as depicted below.
Column_1 column2
0 m 4
1 n 3
2 4 6
3 t 8
0 h 8
1 4 7
2 kl 8
3 m 4
4 bv 5
5 n 8
Now I want to add another index column at the front, numbering the rows continuously, as depicted below.
Column_1 column2
0 0 m 4
1 1 n 3
2 2 4 6
3 3 t 8
4 0 h 8
5 1 4 7
6 2 kl 8
7 3 m 4
8 4 bv 5
9 5 n 8
The simplest approach is MultiIndex.from_arrays with numpy.arange or range:
print (np.arange(len(df.index)))
[0 1 2 3 4 5 6 7 8 9]
n = ['a','b']
df.index = pd.MultiIndex.from_arrays([np.arange(len(df.index)), df.index], names=n)
print (df)
Column_1 column2
a b
0 0 m 4
1 1 n 3
2 2 4 6
3 3 t 8
4 0 h 8
5 1 4 7
6 2 kl 8
7 3 m 4
8 4 bv 5
9 5 n 8
n = ['a','b']
df.index = pd.MultiIndex.from_arrays([range(len(df.index)), df.index], names=n)
print (df)
Column_1 column2
a b
0 0 m 4
1 1 n 3
2 2 4 6
3 3 t 8
4 0 h 8
5 1 4 7
6 2 kl 8
7 3 m 4
8 4 bv 5
9 5 n 8
If index names are not necessary, simply assign:
df.index = [np.arange(len(df.index)), df.index]
print (df)
Column_1 column2
0 0 m 4
1 1 n 3
2 2 4 6
3 3 t 8
4 0 h 8
5 1 4 7
6 2 kl 8
7 3 m 4
8 4 bv 5
9 5 n 8
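If you later also need the file number itself (0 for the first file, 1 for the second, ...), it can be derived from the original index, since a new file starts wherever that index resets to 0. A sketch, assuming the named MultiIndex from the first variant above:

# Level 'b' holds the per-file index; count the resets to 0 and shift.
file_id = (df.index.get_level_values('b') == 0).cumsum() - 1
print(file_id)   # [0 0 0 0 1 1 1 1 1 1]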
Consider the dataframe df:
df = pd.DataFrame(dict(
    A=list('aaaaabbbbccc'),
    B=range(12)
))
print(df)
A B
0 a 0
1 a 1
2 a 2
3 a 3
4 a 4
5 b 5
6 b 6
7 b 7
8 b 8
9 c 9
10 c 10
11 c 11
I want to sort the dataframe such that, if I grouped by column 'A', I'd pull the first row from each group, then cycle back and take the second row from each group if any remain, and so on.
I'd expect the results to look like this:
A B
0 a 0
5 b 5
9 c 9
1 a 1
6 b 6
10 c 10
2 a 2
7 b 7
11 c 11
3 a 3
8 b 8
4 a 4
You can use cumcount to number the rows within each group first, then sort_values and reindex by the index of the resulting Series cum:
cum = df.groupby('A')['B'].cumcount().sort_values()
print (cum)
0 0
5 0
9 0
1 1
6 1
10 1
2 2
7 2
11 2
3 3
8 3
4 4
dtype: int64
print (df.reindex(cum.index))
A B
0 a 0
5 b 5
9 c 9
1 a 1
6 b 6
10 c 10
2 a 2
7 b 7
11 c 11
3 a 3
8 b 8
4 a 4
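The same idea compresses into a one-liner; passing kind='mergesort' makes the sort stable, which guarantees rows with equal cumcounts keep their original relative order:

# Stable-sort the intra-group rank, then reorder the frame by that index.
out = df.reindex(df.groupby('A').cumcount().sort_values(kind='mergesort').index)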
Here's a NumPy approach -
def approach1(g, v):
    # Inputs: 1D arrays of the groupby and value columns; g must have
    # equal labels stored contiguously (as after grouping/sorting).
    id_arr2 = np.ones(v.size, dtype=int)
    # Positions where the group label changes, i.e. each group's start
    sf = np.flatnonzero(g[1:] != g[:-1]) + 1
    # Offset each group start so cumsum() restarts at 1 there, making the
    # cumulative sum equal to each row's rank within its group
    id_arr2[sf[0]] = -sf[0] + 1
    id_arr2[sf[1:]] = sf[:-1] - sf[1:] + 1
    # A stable argsort of the ranks yields the round-robin row order
    return id_arr2.cumsum().argsort(kind='mergesort')
Sample run -
In [246]: df
Out[246]:
A B
0 a 0
1 a 1
2 a 2
3 a 3
4 a 4
5 b 5
6 b 6
7 b 7
8 b 8
9 c 9
10 c 10
11 c 11
In [247]: df.iloc[approach1(df.A.values, df.B.values)]
Out[247]:
A B
0 a 0
5 b 5
9 c 9
1 a 1
6 b 6
10 c 10
2 a 2
7 b 7
11 c 11
3 a 3
8 b 8
4 a 4
Or, using df.reindex from @jezrael's post (this works here because the index is the default RangeIndex, so labels coincide with positions):
df.reindex(approach1(df.A.values, df.B.values))
I have a dataframe that I want to calculate statistics on (value_counts, mode, mean, etc.) and then put the result in a new column. My current solution is O(n**2) or so, and I'm sure there is a faster, obvious method that I'm overlooking.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(10, size=(100, 10)),
                  columns=list('abcdefghij'))
df['result'] = 0
groups = df.groupby([df.i, df.j])
for g in groups:
    icol_eq = df.i == g[0][0]
    jcol_eq = df.j == g[0][1]
    i_and_j = icol_eq & jcol_eq
    df['result'][i_and_j] = len(g[1])
The above works, but is extremely slow for large dataframes.
I tried
df['result'] = df.groupby([df.i, df.j]).apply(len)
but it doesn't seem to work.
Nor does
def f(g):
    g['result'] = len(g)
    return g
df.groupby([df.i, df.j]).apply(f)
Nor can I merge back in the resulting Series from a df.groupby(...).apply(lambda x: len(x)).
You want to use transform:
In [98]:
df['result'] = df.groupby([df.i, df.j]).transform(len)
df
Out[98]:
a b c d e f g h i j result
0 6 1 3 0 1 1 4 2 8 6 6
1 1 3 9 7 5 5 3 5 4 4 1
2 1 5 0 1 8 1 4 7 3 9 1
3 6 8 6 4 6 0 8 0 6 5 6
4 7 9 7 2 8 9 9 6 0 6 7
5 3 5 5 7 2 7 7 3 2 8 3
6 5 0 4 7 5 7 5 7 9 1 5
7 3 2 5 4 3 6 8 4 2 0 3
8 2 3 0 4 8 5 7 9 7 2 2
9 1 1 3 2 3 5 6 6 5 6 1
10 3 0 2 7 1 8 1 3 5 4 3
....
transform returns a Series whose index is aligned to your original df, so you can then add it directly as a column.
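On recent pandas versions, transform(len) over the whole grouped frame returns a DataFrame rather than a Series, so it is safer to select a single column first; 'size' is the built-in equivalent of len here. A sketch:

# Selecting one column makes transform return a Series; 'size' counts
# the rows in each (i, j) group, matching len above.
df['result'] = df.groupby(['i', 'j'])['a'].transform('size')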