Slice one Pandas DataFrame based on another - python

I have created the following pandas DataFrame based on lists of ids.
In [8]: df = pd.DataFrame({'groups': [1, 2, 3, 4],
                           'id': ["[1,3]", "[2]", "[5]", "[4,6,7]"]})
Out[9]:
   groups       id
0       1    [1,3]
1       2      [2]
2       3      [5]
3       4  [4,6,7]
There is another DataFrame like the following.
In [12]: df2 = pd.DataFrame({'id': [1, 2, 3, 4, 5, 6, 7],
                             'path': ["p1,p2,p3,p4", "p1,p2,p1", "p1,p5,p5,p7",
                                      "p1,p2,p3,p3", "p1,p2", "p1", "p2,p3,p4"]})
I need to get path values for each group.
E.g.:
groups  path
1       p1,p2,p3,p4
        p1,p5,p5,p7
2       p1,p2,p1
3       p1,p2
4       p1,p2,p3,p3
        p1
        p2,p3,p4

I'm not sure this is quite the best way to do it, but it worked for me. Incidentally, this only works if you create the id variable in df without the quote marks, i.e. as lists, not strings...
import itertools
import pandas as pd

df = pd.DataFrame({'groups': [1, 2, 3, 4],
                   'id': [[1, 3], [2], [5], [4, 6, 7]]})
df2 = pd.DataFrame({'id': [1, 2, 3, 4, 5, 6, 7],
                    'path': ["p1,p2,p3,p4", "p1,p2,p1", "p1,p5,p5,p7",
                             "p1,p2,p3,p3", "p1,p2", "p1", "p2,p3,p4"]})
paths = [[] for group in df.groups.unique()]
for x in df.index:
    paths[x].extend(itertools.chain(*[list(df2[df2.id == int(y)]['path'])
                                      for y in df.id[x]]))
df['paths'] = pd.Series(paths)
df
There is probably a much neater way of doing this, but it's an odd data structure in a way. It gives the following output:
   groups         id                        paths
0       1     [1, 3]   [p1,p2,p3,p4, p1,p5,p5,p7]
1       2        [2]                   [p1,p2,p1]
2       3        [5]                      [p1,p2]
3       4  [4, 6, 7]  [p1,p2,p3,p3, p1, p2,p3,p4]
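On newer pandas (0.25+), DataFrame.explode plus a merge gets the same result with less machinery; a sketch, assuming the id column holds real lists:

```python
import pandas as pd

df = pd.DataFrame({'groups': [1, 2, 3, 4],
                   'id': [[1, 3], [2], [5], [4, 6, 7]]})
df2 = pd.DataFrame({'id': [1, 2, 3, 4, 5, 6, 7],
                    'path': ["p1,p2,p3,p4", "p1,p2,p1", "p1,p5,p5,p7",
                             "p1,p2,p3,p3", "p1,p2", "p1", "p2,p3,p4"]})

# one row per (group, id) pair; cast id back to int so the merge keys share a dtype
merged = df.explode('id').astype({'id': int}).merge(df2, on='id')
# fold the paths back into one list per group
paths = merged.groupby('groups')['path'].apply(list).reset_index()
```

This sidesteps the per-row filtering loop entirely.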

You shouldn't construct your DataFrame to have embedded list objects. Instead, repeat the groups according to the length of the ids and then use pandas.merge, like so:
In [143]: groups = list(range(1, 5))
In [144]: ids = [[1, 3], [2], [5], [4, 6, 7]]
In [145]: df = pd.DataFrame({'groups': np.repeat(groups, list(map(len, ids))),
                             'id': reduce(lambda x, y: x + y, ids)})
In [146]: df2 = pd.DataFrame({'id': [1, 2, 3, 4, 5, 6, 7],
                              'path': ["p1,p2,p3,p4", "p1,p2,p1", "p1,p5,p5,p7",
                                       "p1,p2,p3,p3", "p1,p2", "p1", "p2,p3,p4"]})
(On Python 3, reduce lives in functools: from functools import reduce.)
In [147]: df
Out[147]:
   groups  id
0       1   1
1       1   3
2       2   2
3       3   5
4       4   4
5       4   6
6       4   7
[7 rows x 2 columns]
In [148]: df2
Out[148]:
   id         path
0   1  p1,p2,p3,p4
1   2     p1,p2,p1
2   3  p1,p5,p5,p7
3   4  p1,p2,p3,p3
4   5        p1,p2
5   6           p1
6   7     p2,p3,p4
[7 rows x 2 columns]
In [149]: pd.merge(df, df2, on='id', how='outer')
Out[149]:
   groups  id         path
0       1   1  p1,p2,p3,p4
1       1   3  p1,p5,p5,p7
2       2   2     p1,p2,p1
3       3   5        p1,p2
4       4   4  p1,p2,p3,p3
5       4   6           p1
6       4   7     p2,p3,p4
[7 rows x 3 columns]
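One caveat with the outer join: if an id ever appears on only one side, the result quietly grows NaN rows. The indicator flag of pandas.merge makes such rows easy to audit; a small sketch with hypothetical mismatched data:

```python
import pandas as pd

df = pd.DataFrame({'groups': [1, 1, 2, 4], 'id': [1, 3, 2, 7]})
df2 = pd.DataFrame({'id': [1, 2, 3, 4],
                    'path': ['p1,p2,p3,p4', 'p1,p2,p1', 'p1,p5,p5,p7', 'p1,p2,p3,p3']})

merged = pd.merge(df, df2, on='id', how='outer', indicator=True)
# 'left_only' rows are ids with no path; 'right_only' rows are paths with no group
unmatched = merged[merged['_merge'] != 'both']
```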

Related

Replace specific values in a data frame with column mean

I have a dataframe and I want to replace each value 7 with the rounded mean of its column, computed without the other 7s in that column. Here is a simple example:
import pandas as pd
df = pd.DataFrame()
df['a'] = [1, 2, 3]
df['b'] =[3, 0, -1]
df['c'] = [4, 7, 6]
df['d'] = [7, 7, 6]
   a  b  c  d
0  1  3  4  7
1  2  0  7  7
2  3 -1  6  6
And here is the output I want:
   a  b  c  d
0  1  3  4  2
1  2  0  3  2
2  3 -1  6  6
For example, in row 1, the mean of column c is equal to 3.33, which rounds to 3; in column d it is equal to 2 (since we do not count the other 7s in that column).
Can you please help me with that?
Here is one way to do it:
import numpy as np

# replace 7 with np.nan
df.replace(7, np.nan, inplace=True)
# fill NaN values with the rounded column mean (treating the 7s as 0, as in the example)
(df.fillna(df.apply(lambda x: x.replace(np.nan, 0).mean(skipna=False)))
   .round(0)
   .astype(int))
   a  b  c  d
0  1  3  4  2
1  2  0  3  2
2  3 -1  6  6
temp = df.replace(to_replace=7, value=0, inplace=False).copy()
df.replace(to_replace=7, value=temp.mean().astype(int), inplace=True)
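An equivalent way to write this without mutating df in place is mask plus fillna; a sketch that reproduces the question's arithmetic (each 7 counts as 0 when averaging):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [3, 0, -1],
                   'c': [4, 7, 6], 'd': [7, 7, 6]})

# zero out the 7s, then take the rounded per-column mean: mean of c is (4+0+6)/3
col_means = df.mask(df == 7, 0).mean().round()
# swap each 7 for its column's rounded mean
out = df.mask(df == 7).fillna(col_means).astype(int)
```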

What is the most efficient way to swap the values of two columns of a 2D list in python when the number of rows is in the tens of thousands?

for example if I have an original list:
A B
1 3
2 4
to be turned into
A B
3 1
4 2
My two cents' worth: three ways to do it.
1. You could add a third column C, copy A to C, then delete A. This would take more memory.
2. You could create a swap function for the values in a row, then wrap it in a loop.
3. You could just swap the labels of the columns. This is probably the most efficient way.
You could use rename:
df2 = df.rename(columns={'A': 'B', 'B': 'A'})
output:
B A
0 1 3
1 2 4
If order matters:
df2 = df.rename(columns={'A': 'B', 'B': 'A'})[df.columns]
output:
A B
0 3 1
1 4 2
Use DataFrame.rename with a dictionary to swap the column names, then restore the column order by selecting the columns:
df = df.rename(columns=dict(zip(df.columns, df.columns[::-1])))[df.columns]
print (df)
A B
0 3 1
1 4 2
You can also simply assign the swapped values to change them:
import pandas as pd
df = pd.DataFrame({"A":[1,2],"B":[3,4]})
df[["A","B"]] = df[["B","A"]].values
df
A B
0 3 1
1 4 2
For more than 2 columns:
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9], 'D': [10, 11, 12]})
print(df)
'''
   A  B  C   D
0  1  4  7  10
1  2  5  8  11
2  3  6  9  12
'''
df = df.set_axis(df.columns[::-1], axis=1)[df.columns]
print(df)
'''
    A  B  C  D
0  10  7  4  1
1  11  8  5  2
2  12  9  6  3
'''
I assume that your list is like this:
my_list = [[1, 3], [2, 4]]
So you can use this code:
print([[each_element[1], each_element[0]] for each_element in my_list])
The output is:
[[3, 1], [4, 2]]
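If the data really is a plain 2D list with tens of thousands of rows, going through NumPy keeps the swap vectorised instead of looping in Python; a sketch:

```python
import numpy as np

rows = [[1, 3], [2, 4]]  # stand-in for a list with tens of thousands of rows
arr = np.asarray(rows)
# fancy indexing on the column axis swaps the two columns in place
arr[:, [0, 1]] = arr[:, [1, 0]]
swapped = arr.tolist()
```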

How to split a dataframe into some dataframes according to list of row numbers?

Given a dataframe, I want to obtain a list of distinct dataframes which together concatenate into the original.
The separation is by indices of rows like so
import pandas as pd
import numpy as np
data = {"a": np.arange(10)}
df = pd.DataFrame(data)
print(df)
a
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
separate_by = [1, 5, 6, ]
should give a list of
df1 =
a
0 0
df2 =
a
1 1
2 2
3 3
4 4
df3 =
a
5 5
df4 =
a
6 6
7 7
8 8
9 9
How can this be done in pandas?
Try:
groups = (pd.Series(1, index=separate_by)
            .reindex(df.index, fill_value=0)
            .cumsum())
out = {k: v for k, v in df.groupby(groups)}
then for example, out[2]:
a
5 5
Similar logic:
groups = np.zeros(len(df))
groups[separate_by] = 1
groups = np.cumsum(groups)
out = {k:v for k,v in df.groupby(groups)}
separate_by = [1, 5, 6]
bounds = [0] + separate_by + [len(df)]
dfs = [df.iloc[bounds[i]: bounds[i + 1]] for i in range(len(bounds) - 1)]
Let us try
d = dict(tuple(df.groupby(df.index.isin(separate_by).cumsum())))
d[0]
Out[364]:
a
0 0
d[2]
Out[365]:
a
5 5
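np.split also accepts a DataFrame and cuts it at the given row positions, which matches the requested output directly; a sketch (what np.split returns for DataFrames has shifted slightly across pandas versions, so check yours):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": np.arange(10)})
separate_by = [1, 5, 6]

# cuts at the given positions: rows [0:1], [1:5], [5:6], [6:10]
parts = np.split(df, separate_by)
```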

Pandas index column by boolean

I want to keep columns that have 'n' or more values.
For example:
> df = pd.DataFrame({'a': [1,2,3], 'b': [1,None,4]})
a b
0 1 1
1 2 NaN
2 3 4
3 rows × 2 columns
> df[df.count()==3]
IndexingError: Unalignable boolean Series key provided
> df[:,df.count()==3]
TypeError: unhashable type: 'slice'
> df[[k for (k,v) in (df.count()==3).items() if v]]
a
0 1
1 2
2 3
Is that the best way to do this? It seems ridiculous.
You can use conditional list comprehension to generate the columns that exceed your threshold (e.g. 3). Then just select those columns from the data frame:
# Create sample DataFrame
df = pd.DataFrame({'a': [1, 2, 3, 4, 5],
'b': [1, None, 4, None, 2],
'c': [5, 4, 3, 2, None]})
>>> df_new = df[[col for col in df if df[col].count() > 3]]
Out[82]:
a c
0 1 5
1 2 4
2 3 3
3 4 2
4 5 NaN
Use count to produce a boolean index and use this as a mask for the columns:
In [10]:
df[df.columns[df.count() > 2]]
Out[10]:
a
0 1
1 2
2 3
If you want to keep columns that have n or more non-null values — here keeping columns with more than 4:
df = pd.DataFrame({'a': [1, 2, 3, 4, 6], 'b': [1, None, 4, 5, 7], 'c': [1, 2, 3, 5, 8]})
print(df)
   a    b  c
0  1  1.0  1
1  2  NaN  2
2  3  4.0  3
3  4  5.0  5
4  6  7.0  8
print(df[[c for c in df.columns if len(df) - df[c].isnull().sum() > 4]])
   a  c
0  1  1
1  2  2
2  3  3
3  4  5
4  6  8
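For the record, pandas has this built in: dropna with axis=1 and the thresh parameter keeps exactly the columns with at least n non-null values; a sketch with n = 5:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 6],
                   'b': [1, None, 4, 5, 7],
                   'c': [1, 2, 3, 5, 8]})

# keep columns with at least 5 non-null values; 'b' has only 4, so it is dropped
kept = df.dropna(axis=1, thresh=5)
```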

How to sort pandas dataframe from list category?

So I have this data set below that I want to sort based on mylist from column 'name', as well as ascending by 'A' and descending by 'B'.
import pandas as pd
import numpy as np
df1 = pd.DataFrame.from_items([('A', [1, 2, 3]), ('B', [4, 5, 6]) , ('name', ['x','x','x'])])
df2 = pd.DataFrame.from_items([('B', [5, 6, 7]), ('A', [8, 9, 10]) , ('name', ['y','y','y'])])
df3 = pd.DataFrame.from_items([('C', [5, 6, 7]), ('D', [8, 9, 10]), ('A',[1,2,3]), ('B',[4,5,7] ), ('name', ['z','z','z'])])
df_list = [df1,df2,df3[['A','B','name']]]
df = pd.concat(df_list, ignore_index=True)
so my list is
mylist = ['z','x','y']
I want the dataset to be sorted by my list first, then ascending by column A and descending by column B.
Is there a way to do this in Python?
======== Edit ==========
I want my final result to be something like
OK, a way to sort by a custom order is to create a dict that defines how the 'name' column should be ordered, call map to add a new column encoding that order, then call sort and pass in the new column along with the others (using the ascending param to decide per column whether to sort ascending or not), and finally drop that column:
In [20]:
name_sort = {'z':0,'x':1,'y':2}
df['name_sort'] = df.name.map(name_sort)
df
Out[20]:
A B name name_sort
0 1 4 x 1
1 2 5 x 1
2 3 6 x 1
3 8 5 y 2
4 9 6 y 2
5 10 7 y 2
6 1 4 z 0
7 2 5 z 0
8 3 7 z 0
In [23]:
df = df.sort_values(['name_sort','A','B'], ascending=[1,1,0])
df
Out[23]:
A B name name_sort
6 1 4 z 0
7 2 5 z 0
8 3 7 z 0
0 1 4 x 1
1 2 5 x 1
2 3 6 x 1
3 8 5 y 2
4 9 6 y 2
5 10 7 y 2
In [25]:
df = df.drop('name_sort', axis=1)
df
Out[25]:
A B name
6 1 4 z
7 2 5 z
8 3 7 z
0 1 4 x
1 2 5 x
2 3 6 x
3 8 5 y
4 9 6 y
5 10 7 y
We can also solve this by using an ordered categorical:
t = pd.CategoricalDtype(categories=['z','x','y'], ordered=True)
df['sort'] = pd.Series(df.name, dtype=t)
df.sort_values(by=['sort','A','B'], inplace=True)
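Put together on current pandas (where from_items and the old DataFrame.sort no longer exist), the categorical route looks like this; a self-contained sketch of the same data:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 8, 9, 10, 1, 2, 3],
                   'B': [4, 5, 6, 5, 6, 7, 4, 5, 7],
                   'name': ['x', 'x', 'x', 'y', 'y', 'y', 'z', 'z', 'z']})

mylist = ['z', 'x', 'y']
# an ordered categorical makes 'name' sort in the custom order z, x, y
df['name'] = pd.Categorical(df['name'], categories=mylist, ordered=True)
out = df.sort_values(['name', 'A', 'B'], ascending=[True, True, False])
```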
