Creating a list of sliced dataframes

Creating a list of sliced dataframes - python

I am trying to create a list of dataframes where each dataframe is 3 rows of a larger dataframe.
dframes = [df[0:3], df[3:6],...,df[2000:2003]]
I am still fairly new to programming, why does:
x = 3
dframes = []
for i in range(0, len(df)):
dframes = dframes.append(df[i:x])
i = x
x = x + 3
dframes = dframes.append(df[i:x])
AttributeError: 'NoneType' object has no attribute 'append'

Use np.split
Setup
Consider the dataframe df
df = pd.DataFrame(dict(A=range(15), B=list('abcdefghijklmno')))
Solution
dframes = np.split(df, range(3, len(df), 3))
Output
for d in dframes:
print(d, '\n')
A B
0 0 a
1 1 b
2 2 c
A B
3 3 d
4 4 e
5 5 f
A B
6 6 g
7 7 h
8 8 i
A B
9 9 j
10 10 k
11 11 l
A B
12 12 m
13 13 n
14 14 o

Python raise this error because function append return None and next time in your loot variable dframes will be None
You can use this:
[list(dframes[i:i+3]) for i in range(0, len(dframes), 3)]

You can use list comprehension with groupby by numpy array created by length of index floor divided by 3:
np.random.seed(100)
df = pd.DataFrame(np.random.randint(10, size=(10,5)), columns=list('ABCDE'))
print (df)
A B C D E
0 8 8 3 7 7
1 0 4 2 5 2
2 2 2 1 0 8
3 4 0 9 6 2
4 4 1 5 3 4
5 4 3 7 1 1
6 7 7 0 2 9
7 9 3 2 5 8
8 1 0 7 6 2
9 0 8 2 5 1
dfs = [x for i, x in df.groupby(np.arange(len(df.index)) // 3)]
print (dfs)
[ A B C D E
0 8 8 3 7 7
1 0 4 2 5 2
2 2 2 1 0 8, A B C D E
3 4 0 9 6 2
4 4 1 5 3 4
5 4 3 7 1 1, A B C D E
6 7 7 0 2 9
7 9 3 2 5 8
8 1 0 7 6 2, A B C D E
9 0 8 2 5 1]
If default monotonic index (0,1,2...) solution can be simplify:
dfs = [x for i, x in df.groupby(df.index // 3)]

Related

Filter and to stay only rows with the same index

I have two data frame: X_oos_top_10 and y_oos_top_10. I need to filter them by X_oos_top_10["comm"] == 1.
I do it for one:
X_oos_top_10_comm1 = X_oos_top_10[X_oos_top_10["comm"] == 1]
But for another I have the problem: IndexingError: Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match).
y_oos_top_10_comm1 = y_oos_top_10[X_oos_top_10["comm"] == 1]
I haven't ideas how I can do it.

Assuming, X and y have the same length, you can use indexing.
Setup a minimal reproducible example:
X_oos_top_10 = pd.DataFrame({'comm': np.random.randint(1, 10, 10)})
y_oos_top_10 = pd.DataFrame(np.random.randint(1, 10, (10, 4)), columns=list('ABCD'))
print(X_oos_top_10)
# Output:
comm
0 5
1 6
2 2
3 6
4 1
5 6
6 1
7 4
8 5
9 8
print(y_oos_top_10)
# Output:
A B C D
0 2 9 1 6
1 9 8 5 4
2 1 6 7 6
3 6 3 6 5
4 2 6 8 3
5 2 6 6 5
6 4 4 3 5
7 6 3 7 5
8 2 8 8 7
9 4 9 1 4
1st method
idx = X_oos_top_10[X_oos_top_10["comm"] == 1].index
out = y_oos_top_10.loc[idx]
print(out)
# Output:
A B C D
4 2 6 8 3
6 4 4 3 5
2nd method
Xy_oos_top_10 = X_oos_top_10.join(y_oos_top_10)
out = Xy_oos_top_10[Xy_oos_top_10['comm'] == 1]
print(out)
# Output:
comm A B C D
4 1 2 6 8 3
6 1 4 4 3 5

Replace all 'NA' with 0 in complete DT (Python Datatable)

Hi I am working with the Python datatable package and need to replace all the 'NA' after joining two DT's.
Sample data:
DT = data.table(x=rep(c("b","a","c"),each=3), y=c(1,3,6), v=1:9)
X = data.table(x=c("c","b"), v=8:7, foo=c(4,2))
X[DT, on="x"]
The code below replaces all 1 with 0
DT.replace(1, 0)
How should I adapt it to replace 'NA'? Or is there maybe an option to change the padding while joining from 'NA' to '0'?
Thank you.

Here is the code using python's data structures :
from datatable import dt, f, by, join
DT = dt.Frame(x = ["b"]*3 + ["a"]*3 + ["c"]*3,
y = [1, 3, 6] * 3,
v = range(1, 10))
X = dt.Frame({"x":('c','b'),
"v":(8,7),
"foo":(4,2)})
X.key="x" # key the ``x`` column
merger = DT[:, :, join(X)]
merger
x y v v.0 foo
0 b 1 1 7 2
1 b 3 2 7 2
2 b 6 3 7 2
3 a 1 4 NA NA
4 a 3 5 NA NA
5 a 6 6 NA NA
6 c 1 7 8 4
7 c 3 8 8 4
8 c 6 9 8 4
The NA is also None; it makes it easy to replace with 0 :
merger.replace(None, 0)
x y v v.0 foo
0 b 1 1 7 2
1 b 3 2 7 2
2 b 6 3 7 2
3 a 1 4 0 0
4 a 3 5 0 0
5 a 6 6 0 0
6 c 1 7 8 4
7 c 3 8 8 4
8 c 6 9 8 4

Can You Preserve Column Order When Pandas Dataframe.Combine Or DataFrame.Combine_First?

If you have 2 dataframes, represented as:
A F Y
0 1 2 3
1 4 5 6
And
B C T
0 7 8 9
1 10 11 12
When combining it becomes:
A B C F T Y
0 1 7 8 2 9 3
1 4 10 11 5 12 6
I would like it to become:
A F Y B C T
0 1 2 3 7 8 9
1 4 5 6 10 11 12
How do I combine 1 data frame with another but keep the original column order?

In [1294]: new_df = df.join(df1)
In [1295]: new_df
Out[1295]:
A F Y B C T
0 1 2 3 7 8 9
1 4 5 6 10 11 12
OR you can also use pd.merge(not a very clean solution though)
In [1297]: df['tmp' ] =1
In [1298]: df1['tmp'] = 1
In [1309]: pd.merge(df, df1, on=['tmp'], left_index=True, right_index=True).drop('tmp', 1)
Out[1309]:
A F Y B C T
0 1 2 3 7 8 9
1 4 5 6 10 11 12

sort dataframe by position in group then by that group

consider the dataframe df
df = pd.DataFrame(dict(
A=list('aaaaabbbbccc'),
B=range(12)
))
print(df)
A B
0 a 0
1 a 1
2 a 2
3 a 3
4 a 4
5 b 5
6 b 6
7 b 7
8 b 8
9 c 9
10 c 10
11 c 11
I want to sort the dataframe such if I grouped by column 'A' I'd pull the first position from each group, then cycle back and get the second position from each group if any are remaining. So on and so forth.
I'd expect results tot look like this
A B
0 a 0
5 b 5
9 c 9
1 a 1
6 b 6
10 c 10
2 a 2
7 b 7
11 c 11
3 a 3
8 b 8
4 a 4

You can use cumcount for count values in groups first, then sort_values and reindex by Series cum:
cum = df.groupby('A')['B'].cumcount().sort_values()
print (cum)
0 0
5 0
9 0
1 1
6 1
10 1
2 2
7 2
11 2
3 3
8 3
4 4
dtype: int64
print (df.reindex(cum.index))
A B
0 a 0
5 b 5
9 c 9
1 a 1
6 b 6
10 c 10
2 a 2
7 b 7
11 c 11
3 a 3
8 b 8
4 a 4

Here's a NumPy approach -
def approach1(g, v):
# Inputs : 1D arrays of groupby and value columns
id_arr2 = np.ones(v.size,dtype=int)
sf = np.flatnonzero(g[1:] != g[:-1])+1
id_arr2[sf[0]] = -sf[0]+1
id_arr2[sf[1:]] = sf[:-1] - sf[1:]+1
return id_arr2.cumsum().argsort(kind='mergesort')
Sample run -
In [246]: df
Out[246]:
A B
0 a 0
1 a 1
2 a 2
3 a 3
4 a 4
5 b 5
6 b 6
7 b 7
8 b 8
9 c 9
10 c 10
11 c 11
In [247]: df.iloc[approach1(df.A.values, df.B.values)]
Out[247]:
A B
0 a 0
5 b 5
9 c 9
1 a 1
6 b 6
10 c 10
2 a 2
7 b 7
11 c 11
3 a 3
8 b 8
4 a 4
Or using df.reindex from #jezrael's post :
df.reindex(approach1(df.A.values, df.B.values))

collapse a pandas MultiIndex

Suppose I have a DataFrame with MultiIndex columns. How can I collapse the levels to a concatenation of the values so that I only have one level?
Setup
np.random.seed([3, 14])
col = pd.MultiIndex.from_product([list('ABC'), list('DE'), list('FG')])
df = pd.DataFrame(np.random.rand(4, 12) * 10, columns=col).astype(int)
print df
A B C
D E D E D E
F G F G F G F G F G F G
0 2 1 1 7 5 9 9 2 7 4 0 3
1 3 7 1 1 5 3 1 4 3 5 6 0
2 2 6 9 9 9 5 7 0 1 2 7 5
3 2 2 8 0 3 9 4 7 0 8 2 5
I want the result to look like this:
ADF ADG AEF AEG BDF BDG BEF BEG CDF CDG CEF CEG
0 2 1 1 7 5 9 9 2 7 4 0 3
1 3 7 1 1 5 3 1 4 3 5 6 0
2 2 6 9 9 9 5 7 0 1 2 7 5
3 2 2 8 0 3 9 4 7 0 8 2 5

Solution
I did this
def collapse_columns(df):
df = df.copy()
if isinstance(df.columns, pd.MultiIndex):
df.columns = df.columns.to_series().apply(lambda x: "".join(x))
return df
I had to check if its a MultiIndex because if it wasn't, I'd split a string and recombine it with what ever separator I chose in the join.

you may try this:
In [200]: cols = pd.Series(df.columns.tolist()).apply(pd.Series).sum(axis=1)
In [201]: cols
Out[201]:
0 ADF
1 ADG
2 AEF
3 AEG
4 BDF
5 BDG
6 BEF
7 BEG
8 CDF
9 CDG
10 CEF
11 CEG
dtype: object

df.columns = df.columns.to_series().apply(''.join)
This will give no separation, but you can sub in '_' for '' or any other separator you might want.

Solution 1)
df.columns = df.columns.to_series().str.join('_')
print(df.columns.shape) #(1,_X_) # a 2 D Array.
OR BETTER Solution 2
pivoteCols = df.columns.to_series().str.join('_')
pivoteCols = pivoteCols.values.reshape(len(pivoteCols))
df.columns = pivoteCols
print(df.columns.shape) # One Dimensional

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Creating a list of sliced dataframes - python

Python raise this error because function append return None and next time in your loot variable dframes will be None You can use this: [list(dframes[i:i+3]) for i in range(0, len(dframes), 3)]

Related

Filter and to stay only rows with the same index

Replace all 'NA' with 0 in complete DT (Python Datatable)

Can You Preserve Column Order When Pandas Dataframe.Combine Or DataFrame.Combine_First?

sort dataframe by position in group then by that group

collapse a pandas MultiIndex

Categories

Resources