I have a dataframe:
>>> df = pd.DataFrame(np.random.random((3,3)))
>>> df
          0         1         2
0  0.732993  0.611314  0.485260
1  0.935140  0.153149  0.065653
2  0.392037  0.797568  0.662104
What is the easiest way for me to convert each entry to a 2-tuple, with the first element taken from the current DataFrame and the second element taken from the last column ('2')?
i.e. I want the final results to be:
                      0                     1                     2
0  (0.732993, 0.485260)  (0.611314, 0.485260)  (0.485260, 0.485260)
1  (0.935140, 0.065653)  (0.153149, 0.065653)  (0.065653, 0.065653)
2  (0.392037, 0.662104)  (0.797568, 0.662104)  (0.662104, 0.662104)
As of pandas version 0.20, you can use df.transform:
In [111]: df
Out[111]:
   0  1  2
0  1  3  4
1  2  4  5
2  3  5  6
In [112]: df.transform(lambda x: list(zip(x, df[2])))
Out[112]:
        0       1       2
0  (1, 4)  (3, 4)  (4, 4)
1  (2, 5)  (4, 5)  (5, 5)
2  (3, 6)  (5, 6)  (6, 6)
Or, another solution using df.apply:
In [113]: df.apply(lambda x: list(zip(x, df[2])))
Out[113]:
        0       1       2
0  (1, 4)  (3, 4)  (4, 4)
1  (2, 5)  (4, 5)  (5, 5)
2  (3, 6)  (5, 6)  (6, 6)
You can also use a dict comprehension:
In [126]: pd.DataFrame({i : df[[i, 2]].apply(tuple, axis=1) for i in df.columns})
Out[126]:
        0       1       2
0  (1, 4)  (3, 4)  (4, 4)
1  (2, 5)  (4, 5)  (5, 5)
2  (3, 6)  (5, 6)  (6, 6)
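For reference, here is a self-contained version of the transform approach (a minimal sketch, using the same small integer frame as above):

import pandas as pd

df = pd.DataFrame({0: [1, 2, 3], 1: [3, 4, 5], 2: [4, 5, 6]})

# Pair each cell with the value from the last column of its row.
result = df.transform(lambda col: list(zip(col, df[2])))
print(result)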
I agree with Corley's comment that you are better off leaving the data in the current format, and changing your algorithm to process data explicitly from the second column.
However, to answer your question, you can define a function that does what's desired and call it using apply.
I don't like this answer; it is ugly, and apply is syntactic sugar for a for loop, so you are definitely better off not using it:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.random((3,3)))
df
          0         1         2
0  0.847380  0.897275  0.462872
1  0.161202  0.852504  0.951304
2  0.093574  0.503927  0.986476
def make_tuple(row):
    # Pair each value in the row with the row's last value.
    n = len(row)
    return [(x, row[n - 1]) for x in row]

df.apply(make_tuple, axis=1)
                                   0                                  1  \
0   (0.847379908309, 0.462871875315)  (0.897274903359, 0.462871875315)
1   (0.161202442072, 0.951303842798)  (0.852504052133, 0.951303842798)
2  (0.0935742441563, 0.986475692614)   (0.503927404884, 0.986475692614)

                                   2
0  (0.462871875315, 0.462871875315)
1  (0.951303842798, 0.951303842798)
2  (0.986475692614, 0.986475692614)
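As a sketch of what "leaving the data in the current format" could look like: rather than materializing tuples, read the last column directly whenever the pairwise operation is needed. Here subtraction is a hypothetical stand-in for whatever the downstream algorithm does with each pair:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.random((3, 3)))

# Operate against the last column directly instead of building tuples;
# sub() aligns df[2] against every column, row by row.
result = df.sub(df[2], axis=0)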
If I have code like this:
import pandas as pd
import random

x = 5
table = []
row = []
for i in range(x):
    for j in range(x):
        if i == j:
            row.append(0)
        else:
            row.append(random.randint(0, 1))
    table.append(row)
    row = []
df = pd.DataFrame(table)
df
df
the output will be a random 5×5 adjacency matrix of 0s and 1s with zeros on the diagonal.
How do I make a graph (an edge list) from this table?
I want the output to be an edge list like this: [(0,1), (0,2), (1,0), (1,2), (1,4), (2,3), (2,4), (3,0), (3,1), (3,2), (3,4), (4,0)]
IIUC, replace the 0s with NA, stack (which drops NA by default), and convert the index to a list:
df.replace(0, pd.NA).stack().index.to_list()
output:
[(0, 3), (0, 4), (1, 0), (1, 2), (1, 3), (2, 0), (2, 1), (4, 0), (4, 3)]
matching input:
   0  1  2  3  4
0  0  0  0  1  1
1  1  0  1  1  0
2  1  1  0  0  0
3  0  0  0  0  0
4  1  0  0  1  0
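For completeness, a self-contained sketch using the matching input above. One caveat: in recent pandas versions stack can keep missing values, so an explicit dropna() makes the snippet version-proof:

import pandas as pd

# The "matching input" adjacency matrix shown above.
table = [[0, 0, 0, 1, 1],
         [1, 0, 1, 1, 0],
         [1, 1, 0, 0, 0],
         [0, 0, 0, 0, 0],
         [1, 0, 0, 1, 0]]
df = pd.DataFrame(table)

# Non-zero entries become (row, column) edge tuples.
edges = df.replace(0, pd.NA).stack().dropna().index.to_list()
print(edges)
# [(0, 3), (0, 4), (1, 0), (1, 2), (1, 3), (2, 0), (2, 1), (4, 0), (4, 3)]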
I have a list and dataframe (example below).
                       0  1
0  ((test1, AA), (1, 1))  1
1  ((test2, BB), (1, 1))  2
2  ((test1, CC), (1, 1))  3
3  ((test1, DD), (2, 1))  8
4  ((test3, EE), (3, 1))  9
I need to keep only the rows whose first elements are test1 and 1. Could you please help?
Expected output:
                       0  1
0  ((test1, AA), (1, 1))  1
2  ((test1, CC), (1, 1))  3
You can use boolean indexing:
v = df[0].apply(lambda i: i[0][0] == 'test1' and i[1][0] == 1)
df = df[v]
print(df)
Output
                       0  1
0  ((test1, AA), (1, 1))  1
2  ((test1, CC), (1, 1))  3
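For reference, a self-contained sketch of the above; the cell values are assumed to be plain Python tuples of strings and integers:

import pandas as pd

df = pd.DataFrame({
    0: [(('test1', 'AA'), (1, 1)),
        (('test2', 'BB'), (1, 1)),
        (('test1', 'CC'), (1, 1)),
        (('test1', 'DD'), (2, 1)),
        (('test3', 'EE'), (3, 1))],
    1: [1, 2, 3, 8, 9],
})

# Keep rows where the first inner tuple starts with 'test1'
# and the second inner tuple starts with 1.
v = df[0].apply(lambda i: i[0][0] == 'test1' and i[1][0] == 1)
print(df[v])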
I am having trouble finding the correct way to compute the frequency of a combination.
This is my code:
import pandas as pd
import itertools

items = [1, 20, 1, 50]  # renamed from `list` so the builtin is not shadowed
combinations = []
for i in itertools.combinations(items, 2):
    combinations.append(i)
data = pd.DataFrame({'products': combinations})
data['frequency'] = data.groupby('products')['products'].transform('count')
print(data)
The output is:
   products  frequency
0   (1, 20)          1
1    (1, 1)          1
2   (1, 50)          2
3   (20, 1)          1
4  (20, 50)          1
5   (1, 50)          2
The problem is with (1, 20) and (20, 1): the frequency shows 1 for each, but they are the same combination, so it should be 2. Is there a method that gives the correct result?
You can group by a modified version of the column, using apply and a lambda that sorts each tuple:
import pandas as pd
import itertools

items = [1, 20, 1, 50]
combinations = []
for i in itertools.combinations(items, 2):
    combinations.append(i)
data = pd.DataFrame({'products': combinations})
# Group on an order-insensitive key: each tuple in sorted order.
data['frequency'] = data.groupby(data['products'].apply(
    lambda i: tuple(sorted(i))))['products'].transform('count')
print(data)
The output will be:
   products  frequency
0   (1, 20)          2
1    (1, 1)          1
2   (1, 50)          2
3   (20, 1)          2
4  (20, 50)          1
5   (1, 50)          2
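An alternative sketch: normalize each pair up front, so a plain groupby on the column is enough. Note this also rewrites the stored tuples (e.g. (20, 1) becomes (1, 20)), unlike the answer above, which preserves the original order:

import itertools
import pandas as pd

items = [1, 20, 1, 50]

# Sort each pair so (1, 20) and (20, 1) collapse to the same key.
combos = [tuple(sorted(pair)) for pair in itertools.combinations(items, 2)]
data = pd.DataFrame({'products': combos})
data['frequency'] = data.groupby('products')['products'].transform('count')
print(data)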
My dataframe looks like this:
(1, 2) (1, 3) (1, 4) (1, 5) (1, 6) (1, 7) (1, 8) (1, 9) (1, 10) (1, 11) ... 2 3 4 5 6 7 8 9 10 11
0 0 1 0 1 1 1 1 0 1 0 ... 0.612544 0.727393 0.366578 0.631451 0.722980 0.772853 0.964982 0.549801 0.406692 0.798083
1 0 0 0 0 0 0 0 0 0 0 ... 0.583228 0.698729 0.343934 0.602037 0.694230 0.745422 0.954682 0.521298 0.382381 0.771640
2 1 0 0 1 0 1 1 0 0 0 ... 0.481291 0.593353 0.271028 0.498949 0.588807 0.641602 0.901779 0.424495 0.303309 0.669657
3 1 1 0 1 0 1 1 0 0 1 ... 0.583228 0.698729 0.343934 0.602037 0.694230 0.745422 0.954682 0.521298 0.382381 0.771640
4 0 0 0 1 1 1 1 1 1 1 ... 0.612544 0.727393 0.366578 0.631451 0.722980 0.772853 0.964982 0.549801 0.406692 0.798083
where I have column headers that are tuples, like (1, 2), and column headers that are single elements, like 1. I want to perform a calculation on the tuple columns based on the columns named by the elements of that tuple. For example, with the tuple (1, 2), I want to retrieve the columns 1 and 2, multiply them together, then subtract the result from the column (1, 2).
The solution that I thought of was to create 55 new columns that hold the products of the single-element columns (e.g. 1 and 2), and then do some sort of identity match using .where() and all(). However, this seems computationally inefficient, since I would be making a whole other set of data rather than performing the calculation directly on the tuple columns. How would I go about this?
Not sure if this is faster, but here's a solution that doesn't need where()/all():
import pandas as pd
# create some sample data
arr = [[1, 2, 3, 4, 5, 6, 7],
[7, 6, 5, 4, 3, 2, 1]]
df = pd.DataFrame(arr, columns=[('a', 'b'), ('c','d'), ('a', 'd'), 'a', 'b', 'c', 'd'])
# get all tuple headers
tuple_columns = [col for col in df.columns if isinstance(col, tuple)]
# put the results into a list of series and concat into a DataFrame
results = pd.concat([df[col] - df[col[0]] * df[col[1]] for col in tuple_columns], axis=1)
# rename the columns
results.columns = tuple_columns
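Running this on the sample data should give something like:

print(results)
   (a, b)  (c, d)  (a, d)
0     -19     -40     -25
1      -5       4       1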
I have a pandas series indexed by tuples, like this:
from pandas import Series
s = Series({(0, 0): 1, (0, 1): 2, (0, 3): 3, (1, 0): 1, (1, 2): 4, (3, 0): 5})
I want to slice such a series by using indexes that are also tuples (using lexicographic ordering), but not necessarily in the index. Slicing seems to work when I pass an index that is on the series:
s[:(1,0)]
(0, 0) 1
(0, 1) 2
(0, 3) 3
(1, 0) 1
dtype: int64
but if I try slicing by an index which is not on the series there is an error:
s[:(1,1)]
...
ValueError: Index(...) must be called with a collection of some kind, 0 was passed
Ideally I'd like to get the series elements indexed by (0, 0), (0, 1), (0, 3), (1, 0), similar to what happens when slicing using dates in TimeSeries. Is there a simple way to achieve this?
This works if you have a MultiIndex rather than an index of tuples:
In [11]: s.index = pd.MultiIndex.from_tuples(s.index)
In [12]: s
Out[12]:
0  0    1
   1    2
   3    3
1  0    1
   2    4
3  0    5
dtype: int64
In [13]: s[:(1,1)]
Out[13]:
0  0    1
   1    2
   3    3
1  0    1
dtype: int64
In a previous edit I had suggested this could be a bug, and had created an awful hack...
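One note on the MultiIndex approach: label slicing generally requires a lexsorted index, so if the tuples do not already arrive in sorted order, sort the series first. A minimal sketch:

import pandas as pd

s = pd.Series({(0, 0): 1, (0, 1): 2, (0, 3): 3,
               (1, 0): 1, (1, 2): 4, (3, 0): 5})
s.index = pd.MultiIndex.from_tuples(s.index)

# Slicing with a label that is not present needs a lexsorted index.
s = s.sort_index()
print(s[:(1, 1)])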