rowbind elements of list into pandas data frame by grouping - python

I'm wondering what is the pythonic way of achieving the following:
Given a list of lists:
l = [[1, 2],[3, 4],[5, 6],[7, 8]]
I would like to create a list of pandas data frames where the first pandas data frame is a row bind of the first two elements in l and the second a row bind of the last two elements:
>>> df1 = pd.DataFrame(np.asarray(l[:2]))
>>> df1
   0  1
0  1  2
1  3  4
and
>>> df2 = pd.DataFrame(np.asarray(l[2:]))
>>> df2
   0  1
0  5  6
1  7  8
In my problem I have a very long list and I know the grouping, i.e. the first k elements of the list l should be row-bound to form the first df. How can this be achieved in a Python-friendly way?

You could store them in a dict, like:
In [586]: s = pd.Series(l)
In [587]: n = 2
In [588]: df = {k: pd.DataFrame(g.values.tolist()) for k, g in s.groupby(s.index // n)}
In [589]: df[0]
Out[589]:
   0  1
0  1  2
1  3  4
In [590]: df[1]
Out[590]:
   0  1
0  5  6
1  7  8
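If the dict keys aren't needed, a plain list of frames built by slicing works just as well; a minimal sketch, assuming the same l and a fixed group size k:
import pandas as pd

l = [[1, 2], [3, 4], [5, 6], [7, 8]]
k = 2  # group size

# slice l into consecutive chunks of k rows; each chunk becomes one frame
dfs = [pd.DataFrame(l[i:i + k]) for i in range(0, len(l), k)]
print(dfs[0])  # rows [1, 2] and [3, 4]
print(dfs[1])  # rows [5, 6] and [7, 8]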

Related

What is the most efficient way to swap the values of two columns of a 2D list in python when the number of rows is in the tens of thousands?

For example, if I have an original list:
A B
1 3
2 4
to be turned into
A B
3 1
4 2
My two cents' worth: there are three ways to do it.
You could add a third column C, copy A to C, then delete A. This would take more memory.
You could create a swap function for the values in a row, then wrap it in a loop (see the sketch below).
You could just swap the labels of the columns. This is probably the most efficient way.
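A minimal sketch of the second option for a plain 2D list (assuming exactly two columns per row):
rows = [[1, 3], [2, 4]]
for row in rows:
    # swap the two values in place via tuple unpacking
    row[0], row[1] = row[1], row[0]
print(rows)  # [[3, 1], [4, 2]]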
You could use rename:
df2 = df.rename(columns={'A': 'B', 'B': 'A'})
output:
   B  A
0  1  3
1  2  4
If order matters:
df2 = df.rename(columns={'A': 'B', 'B': 'A'})[df.columns]
output:
   A  B
0  3  1
1  4  2
Use DataFrame.rename with a dictionary for swapping column names, and finally restore the original order by selecting the columns:
df = df.rename(columns=dict(zip(df.columns, df.columns[::-1])))[df.columns]
print(df)
   A  B
0  3  1
1  4  2
You can also simply assign the swapped values directly:
import pandas as pd
df = pd.DataFrame({"A":[1,2],"B":[3,4]})
df[["A","B"]] = df[["B","A"]].values
df
   A  B
0  3  1
1  4  2
For more than 2 columns:
df = pd.DataFrame({'A':[1,2,3],'B':[4,5,6],'C':[7,8,9], 'D':[10,11,12]})
print(df)
'''
   A  B  C   D
0  1  4  7  10
1  2  5  8  11
2  3  6  9  12
'''
df = df.set_axis(df.columns[::-1],axis=1)[df.columns]
print(df)
'''
    A  B  C  D
0  10  7  4  1
1  11  8  5  2
2  12  9  6  3
'''
I assume that your list is like this:
my_list = [[1, 3], [2, 4]]
So you can use this code:
print([[each_element[1], each_element[0]] for each_element in my_list])
The output is:
[[3, 1], [4, 2]]
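Equivalently, tuple unpacking in the comprehension avoids the explicit indexing:
my_list = [[1, 3], [2, 4]]
print([[b, a] for a, b in my_list])  # [[3, 1], [4, 2]]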

Iterating over dataframe and get columns as new dataframes

I'm trying to create a set of dataframes from one big dataframe. These dataframes consist of the columns of the original dataframe in this manner:
1st dataframe is the 1st column of the original one,
2nd dataframe is the 1st and 2nd columns of the original one,
and so on.
I use this code to iterate over the dataframe:
for i, data in enumerate(x):
    data = x.iloc[:, :i]
    print(data)
This works, but I also get an empty dataframe at the beginning and an index vector I don't need.
Any suggestions on how to remove those 2?
Thanks
Instead of enumerating the dataframe (you are not using the enumerated value, only the index), you can iterate over the range from 1 through the number of columns plus one, taking the slice df.iloc[:, :i] for each value of i. A list comprehension achieves this:
>>> [df.iloc[:, :i] for i in range(1, df.shape[1] + 1)]
[   A
 0  1
 1  2
 2  3,
    A  B
 0  1  2
 1  2  4
 2  3  6]
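For reference, the outputs above assume a small frame along these lines (the exact values are reconstructed from the printed result):
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [2, 4, 6]})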
The equivalent traditional loop would look something like this:
for i in range(1, df.shape[1] + 1):
    print(df.iloc[:, :i])
   A
0  1
1  2
2  3
   A  B
0  1  2
1  2  4
2  3  6
You can also do something like this:
import numpy as np
import pandas as pd

data = {
    'col_1': np.random.randint(0, 10, 5),
    'col_2': np.random.randint(10, 20, 5),
    'col_3': np.random.randint(0, 10, 5),
    'col_4': np.random.randint(10, 20, 5),
}
df = pd.DataFrame(data)
all_df = {col: df.iloc[:, :i] for i, col in enumerate(df, start=1)}
# For example we can print the last one
print(all_df['col_4'])
   col_1  col_2  col_3  col_4
0      1     13      5     10
1      8     16      1     18
2      6     11      5     18
3      3     11      1     10
4      7     14      8     12
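Note that the dict is keyed by column name, so for example all_df['col_2'] holds just col_1 and col_2. The numbers shown are random and will differ between runs.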

How to get first n records of groups based on column value

I am wondering how I can use groupby and head to get the first n values of a group of records, where n is encoded in a column in the original dataframe.
import pandas as pd
df = pd.DataFrame({"A": [1] * 4 + [2] * 3, "B": list(range(1, 8))})
gp = df.groupby("A").head(2)
print(gp)
This will return the first 2 records of each group. How would I go ahead if I wanted the first 1 of group 1, and the first 2 of group 2, as encoded in column A?
Desired outcome:
   A  B
0  1  1
4  2  5
5  2  6
We can create a sequential counter with groupby + cumcount to uniquely identify the rows within each group of column A, then build a boolean mask marking the rows where the counter is less than or equal to the value encoded in column A, and finally filter the required rows with this mask:
df[df.groupby('A').cumcount().add(1).le(df['A'])]
   A  B
0  1  1
4  2  5
5  2  6
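To make the intermediate steps visible, here is the same logic unrolled, using the df from the question:
import pandas as pd

df = pd.DataFrame({"A": [1] * 4 + [2] * 3, "B": list(range(1, 8))})

counter = df.groupby('A').cumcount().add(1)  # 1-based position of each row within its group
mask = counter.le(df['A'])                   # keep rows whose position <= the value in A
print(df[mask])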
Here is a solution with DataFrame.head in a custom function, where each group's A value is passed via x.name and used to filter the data:
gp = df.groupby("A", group_keys=False).apply(lambda x: x.head(x.name))
print(gp)
   A  B
0  1  1
4  2  5
5  2  6
If you instead need to filter by the order in which the A values appear (first group gets 1 row, second gets 2, and so on), the solution is:
df = pd.DataFrame({"A": [8] * 4 + [6] * 3, "B": list(range(1, 8))})
d = {v: k for k, v in enumerate(df.A.unique(), 1)}
gp = df.groupby("A", group_keys=False, sort=False).apply(lambda x: x.head(d[x.name]))
print(gp)
   A  B
0  8  1
4  6  5
5  6  6
df_ = pd.concat([gp[1].head(i+1) for i, gp in enumerate(df.groupby("A"))])
# print(df_)
   A  B
0  1  1
4  2  5
5  2  6

Get rows based on my given list without revising the order or unique the list

I have a df that looks like the one below. I would like to get rows based on my list and the 'D' column, without changing the order of the list or dropping its duplicates.
   A  B  C  D
0  a  b  1  1
1  a  b  1  2
2  a  b  1  3
3  a  b  1  4
4  c  d  2  5
5  c  d  3  6  # df
My list
l = [4, 2, 6, 4] # my list
df.loc[df['D'].isin(l)].to_csv('output.csv', index = False)
When I use isin(), the result changes the order and drops the duplicates, while df.loc[df['D'] == value] only prints the last match.
   A  B  C  D
3  a  b  1  4
1  a  b  1  2
5  c  d  3  6
3  a  b  1  4  # desired output
Any good way to do this? Thanks,
A solution without a loop, using merge:
In [26]: pd.DataFrame({'D':l}).merge(df, how='left')
Out[26]:
   D  A  B  C
0  4  a  b  1
1  2  a  b  1
2  6  c  d  3
3  4  a  b  1
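Note that the merge puts D first; if the original column order matters, reselect the columns afterwards, for example:
result = pd.DataFrame({'D': l}).merge(df, how='left')[df.columns]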
You're going to have to iterate over your list, collect the filtered copies, and then concat them all together:
l = [4, 2, 6, 4]  # you shouldn't use list = as list is a builtin
cache = {}
masked_dfs = []
for v in l:
    try:
        filtered_df = cache[v]
    except KeyError:
        filtered_df = df[df['D'] == v]
        cache[v] = filtered_df
    masked_dfs.append(filtered_df)
new_df = pd.concat(masked_dfs)
UPDATE: modified my answer to cache the filtered frames, so repeated values don't trigger multiple searches.
Just collect the indices of the values you are looking for, put them in a list, and then use that list to slice the data:
import pandas as pd
df = pd.DataFrame({
    'C': [6, 5, 4, 3, 2, 1],
    'D': [1, 2, 3, 4, 5, 6]
})
l = [4, 2, 6, 4]
i_locs = [ind for elem in l for ind in df[df['D'] == elem].index]
df.loc[i_locs]
results in
   C  D
3  3  4
1  5  2
5  1  6
3  3  4
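If the values in D are unique (as in the question), the same order- and duplicate-preserving lookup also works without a loop by indexing on D; a sketch under that uniqueness assumption:
result = df.set_index('D').loc[l].reset_index()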

merge items into a list based on index

Suppose I have a dataframe as follows:
df = pd.DataFrame(range(4), index=range(4))
df = df.append(df)
the resultant df is:
   0
0  0
1  1
2  2
3  3
0  0
1  1
2  2
3  3
I want to combine the values of the same index into a list. The desired result is:
0    [0, 0]
1    [1, 1]
2    [2, 2]
3    [3, 3]
For a more realistic scenario, my index will be dates, and I want to aggregate multiple obs into a list based on the date. In this way, I can perform some functions on the obs for each date.
If that's your goal, then I don't think you want to actually materialize a list. What you want to do is use groupby and then act on the groups. For example:
>>> df.groupby(level=0)
<pandas.core.groupby.DataFrameGroupBy object at 0xa861f6c>
>>> df.groupby(level=0)[0]
<pandas.core.groupby.SeriesGroupBy object at 0xa86630c>
>>> df.groupby(level=0)[0].sum()
0    0
1    2
2    4
3    6
Name: 0, dtype: int64
You could extract a list too:
>>> df.groupby(level=0)[0].apply(list)
0    [0, 0]
1    [1, 1]
2    [2, 2]
3    [3, 3]
Name: 0, dtype: object
but it's usually better to act on the groups themselves. Series and DataFrames aren't really meant for storing lists of objects.
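In recent pandas versions the same list-collecting result can also be written with agg; a minimal sketch:
df.groupby(level=0)[0].agg(list)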
In [374]:
import pandas as pd
df = pd.DataFrame({'a':range(4)})
df = df.append(df)
df
Out[374]:
   a
0  0
1  1
2  2
3  3
0  0
1  1
2  2
3  3
[8 rows x 1 columns]
In [379]:
import numpy as np
# loop over the index values, flattening each selection with numpy.ravel and casting to a list
for index in df.index.values:
    # use loc to select the values at that index
    print(index, list(np.ravel(df.loc[index].values)))
    # stop once we reach the max index value, otherwise we would output everything twice
    if index == max(df.index.values):
        break
0 [0, 0]
1 [1, 1]
2 [2, 2]
3 [3, 3]
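A slightly cleaner variant iterates over the unique index values, which removes the need for the break; a sketch under the same setup:
for index in df.index.unique():
    print(index, list(np.ravel(df.loc[index].values)))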
