List of list of tuples to pandas dataframe - python

I have this array (the result of a similarity calculation). It is a list of lists of tuples, like this:
example = [[(a,b), (c,d)], [(a1,b1), (c1,d2)] …]
example contains 121044 lists of 30 tuples each.
I want a pandas DataFrame containing just the second value of each tuple (i.e. b, d, b1, d2), without spending too much time computing it.
Do you have any ideas?

Use nested list comprehension:
df = pd.DataFrame([[y[1] for y in x] for x in example])
print (df)
    0   1
0   b   d
1  b1  d2
df = pd.DataFrame([[y[1] for y in x] for x in example], columns=['col1','col2'])
print (df)
  col1 col2
0    b    d
1   b1   d2

For numeric data, you can use numpy indexing directly. This should be more efficient than a list comprehension, as pandas uses numpy internally to store data in contiguous memory blocks.
import pandas as pd, numpy as np
example = [[(1,2), (3,4)], [(5,6), (7,8)]]
df = pd.DataFrame(np.array(example)[..., 1],
                  columns=['col1', 'col2'])
print(df)
col1 col2
0 2 4
1 6 8
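To check that the two answers above agree on data shaped like the question's, here is a minimal sketch. The `example` data is a hypothetical stand-in (1000 rows of 30 numeric tuples generated at random), not the asker's real similarity output, and the NumPy route assumes every inner list has the same length and holds numbers.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the similarity output: 1000 rows of 30 (id, score) tuples.
rng = np.random.default_rng(0)
example = [[(j, float(rng.random())) for j in range(30)] for _ in range(1000)]

# Approach 1: nested list comprehension (works for any value type).
df_list = pd.DataFrame([[t[1] for t in row] for row in example])

# Approach 2: NumPy indexing (needs rectangular, numeric data).
df_np = pd.DataFrame(np.array(example)[..., 1])

print(df_list.shape)
print(np.allclose(df_list.to_numpy(), df_np.to_numpy()))
```

Which one is faster depends on the data, so it is worth timing both on the real input before committing to either.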

Related

Add column to pandas dataframe from a reversed dictionary

I have a pandas dataframe and a dictionary whose values are lists. The values in the lists are unique across all keys. I want to add a new column to my dataframe that holds, for each row, the dictionary key whose list contains that row's value. E.g. suppose I have a dataframe like this
import pandas as pd
df = {'a':1, 'b':2, 'c':2, 'd':4, 'e':7}
df = pd.DataFrame.from_dict(df, orient='index', columns = ['col2'])
df = df.reset_index().rename(columns={'index':'col1'})
df
col1 col2
0 a 1
1 b 2
2 c 2
3 d 4
4 e 7
Now I also have dictionary like this
my_dict = {'x':['a', 'c'], 'y':['b'], 'z':['d', 'e']}
I want the output like this
col1 col2 col3
0 a 1 x
1 b 2 y
2 c 2 x
3 d 4 z
4 e 7 z
Presently I am doing this by reversing the dictionary first, i.e. like this
my_dict_rev = {value:key for key in my_dict for value in my_dict[key]}
df['col3']= df['col1'].map(my_dict_rev)
df
But I am sure that there must be some direct method.
I know this is an old question but here are two other ways to do the same job. First convert my_dict to a Series object, then explode it. Then reverse the mapping and use map:
tmp = pd.Series(my_dict).explode()
df['col3'] = df['col1'].map(pd.Series(tmp.index, tmp))
Another option (starts similar to above) but instead of map, merge:
df = df.merge(pd.Series(my_dict, name='col1').explode().rename_axis('col3').reset_index())
Output:
col1 col2 col3
0 a 1 x
1 b 2 y
2 c 2 x
3 d 4 z
4 e 7 z
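For readers who want to see the intermediate objects in the explode-then-map answer, here is a step-by-step sketch using the question's data. The names `tmp` and `reverse` are only illustrative, and the approach assumes (as the question states) that each list value appears under exactly one key.

```python
import pandas as pd

df = pd.DataFrame({'col1': list('abcde'), 'col2': [1, 2, 2, 4, 7]})
my_dict = {'x': ['a', 'c'], 'y': ['b'], 'z': ['d', 'e']}

# One row per (key, item) pair: the index holds the keys, the values hold the list items.
tmp = pd.Series(my_dict).explode()

# Swap index and values so each list item maps back to its key.
reverse = pd.Series(tmp.index, index=tmp.values)

# map looks each col1 value up in the index of `reverse`.
df['col3'] = df['col1'].map(reverse)
print(df)
```

Any `col1` value that is missing from the dictionary ends up as NaN in `col3`.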

How to convert list of pair tuples in dataframe into columns

I have a dataframe column where each cell's data looks like:
[('a', '2000'),('b', '4000'),('d', '5000')].
Some would have 4 pairs with c. How can I convert all of them into new columns filling
df['a'] , df['b'] , df['c'] , df['d'] ?
This isn't vectorisable. One efficient solution would be to feed a list of dictionaries to the pd.DataFrame constructor via list + map with dict:
s = pd.Series([[('a', '2000'),('b', '4000'),('d', '5000')],
[('a', '1000'),('b', '3000'),('c', '6000'),('d', '7000')]])
# example dataframe
df = pd.DataFrame(s, columns=['data'])
# convert list of tuples to dict for each element in series
res = pd.DataFrame(list(map(dict, df['data'])))
print(res)
a b c d
0 2000 4000 NaN 5000
1 1000 3000 6000 7000
import pandas as pd
d=[('a', '2000'),('b', '4000'),('d', '5000')]
df = pd.DataFrame(data=d).T
df.columns = df.iloc[0]
df = df.iloc[1:]
Another way of doing it :
l = [('a', '2000'),('b', '4000'),('d', '5000')]
df = pd.DataFrame([dict(l)])
If your data is a nested list of tuples (one list of pairs per row), build one dict per inner list instead:
df = pd.DataFrame(list(map(dict, l)))
For the single list above, the output looks like this:
a b d
0 2000 4000 5000
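The answers above build a new frame from the pairs. A possible follow-up, sketched below under the assumption that the column is called `data` as in the first answer, is to keep the original index, convert the string values to numbers, and join the new columns back onto the original dataframe.

```python
import pandas as pd

df = pd.DataFrame({'data': [
    [('a', '2000'), ('b', '4000'), ('d', '5000')],
    [('a', '1000'), ('b', '3000'), ('c', '6000'), ('d', '7000')],
]})

# Build one dict per row, expand into columns, and keep the original index.
expanded = pd.DataFrame(list(map(dict, df['data'])), index=df.index)

# Values are still strings; convert them to numbers (missing pairs stay NaN).
expanded = expanded.apply(pd.to_numeric)

result = df.join(expanded)
# Select the columns explicitly, since their order depends on the pandas version.
print(result[['a', 'b', 'c', 'd']])
```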

How to initialize a two dimensional string DataFrame array in python

I want to initialize a 31756x2 data frame of strings.
I want it to look like this:
index column1 column2
0 A B
1 A B
.
.
31756 A B
I wrote:
content_split = [["A", "B"] for x in range(31756)]
This is the result:
I did get a two dimensional list, but I want the columns to be separated like in a data frame, and I can't seem to get it to work (like column1: A.. , column2: B...)
Would love some help.
Use DataFrame constructor only:
df = pd.DataFrame([["A", "B"] for x in range(31756)], columns=['col1','col2'])
print (df.head())
col1 col2
0 A B
1 A B
2 A B
3 A B
4 A B
Or:
N = 31756
df = pd.DataFrame({'col1':['A'] * N, 'col2':['B'] * N})
print (df.head())
col1 col2
0 A B
1 A B
2 A B
3 A B
4 A B
import pandas as pd
df = pd.DataFrame(index=range(31756))
df.loc[:,'column1'] = 'A'
df.loc[:,'column2'] = 'B'
Using numpy.tile:
import numpy as np
df = pd.DataFrame(np.tile(list('AB'), (31756, 1)), columns=['col1','col2'])
Or just passing a dictionary:
df = pd.DataFrame({'A':['A']*31756, 'B':['B']*31756})
If using this latter method on an older Python version (before 3.7), where dictionary order is not guaranteed, you may want to sort the columns explicitly:
df = pd.DataFrame({'A':['A']*31756, 'B':['B']*31756}).sort_index(axis=1)
For fun
pd.DataFrame(index=range(31756)).assign(**dict(col1='A', col2='B'))
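Another variant on the dictionary approach, offered as a sketch rather than a recommendation: when every column holds a single repeated value, the constructor can broadcast scalars as long as an index is supplied. The column names `col1` and `col2` follow the earlier answers.

```python
import pandas as pd

N = 31756
# Scalar values are repeated for every label in the supplied index.
df = pd.DataFrame({'col1': 'A', 'col2': 'B'}, index=range(N))
print(df.shape)
print(df.head(2))
```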

Python: Pandas DataFrame for tuples

Is this a correct way of creating DataFrame for tuples? (assume that the tuples are created inside code fragment)
import pandas as pd
import numpy as np
import random
row = ['a','b','c']
col = ['A','B','C','D']
# use numpy for creating a ZEROS matrix
st = np.zeros((len(row),len(col)))
df2 = pd.DataFrame(st, index=row, columns=col)
# CONVERT each cell to an OBJECT for inserting tuples
for c in col:
    df2[c] = df2[c].astype(object)
print(df2)
for i in row:
    for j in col:
        # DataFrame.set_value was removed in pandas 1.0; .at is the scalar setter
        df2.at[i, j] = (i+j, np.round(random.uniform(0, 1), 4))
print(df2)
As you can see, I first created a zeros(3,4) array in numpy and then made each cell an object type in pandas so I can insert tuples. Is this the correct way to do it, or is there a better solution to add/retrieve tuples to/from matrices?
Results are fine:
A B C D
a 0 0 0 0
b 0 0 0 0
c 0 0 0 0
A B C D
a (aA, 0.7134) (aB, 0.006) (aC, 0.1948) (aD, 0.2158)
b (bA, 0.2937) (bB, 0.8083) (bC, 0.3597) (bD, 0.324)
c (cA, 0.9534) (cB, 0.9666) (cC, 0.7489) (cD, 0.8599)
First, to answer your literal question: You can construct DataFrames from a list of lists. The values in the list of lists can themselves be tuples:
import numpy as np
import pandas as pd
np.random.seed(2016)
row = ['a','b','c']
col = ['A','B','C','D']
data = [[(i+j, round(np.random.uniform(0, 1), 4)) for j in col] for i in row]
df = pd.DataFrame(data, index=row, columns=col)
print(df)
yields
A B C D
a (aA, 0.8967) (aB, 0.7302) (aC, 0.7833) (aD, 0.7417)
b (bA, 0.4621) (bB, 0.6426) (bC, 0.2249) (bD, 0.7085)
c (cA, 0.7471) (cB, 0.6251) (cC, 0.58) (cD, 0.2426)
Having said that, beware that storing tuples in DataFrames dooms you to Python-speed loops. To take advantage of fast Pandas/NumPy routines, you need to use native NumPy dtypes such as np.float64 (whereas, in contrast, tuples require "object" dtype).
So perhaps a better solution for your purpose is to use two separate DataFrames, one for the strings and one for the numbers:
import numpy as np
import pandas as pd
np.random.seed(2016)
row=['a','b','c']
col=['A','B','C','D']
prevstate = pd.DataFrame([[i+j for j in col] for i in row], index=row, columns=col)
prob = pd.DataFrame(np.random.uniform(0, 1, size=(len(row), len(col))).round(4),
                    index=row, columns=col)
print(prevstate)
# A B C D
# a aA aB aC aD
# b bA bB bC bD
# c cA cB cC cD
print(prob)
# A B C D
# a 0.8967 0.7302 0.7833 0.7417
# b 0.4621 0.6426 0.2249 0.7085
# c 0.7471 0.6251 0.5800 0.2426
To loop through the columns, find the row with maximum probability and retrieve the corresponding prevstate, you could use .idxmax and .loc:
for col in prob.columns:
    idx = (prob[col].idxmax())
    print('{}: {}'.format(prevstate.loc[idx, col], prob.loc[idx, col]))
yields
aA: 0.8967
aB: 0.7302
aC: 0.7833
aD: 0.7417
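The same lookup can also be collected into a small summary table instead of being printed in a loop. The sketch below hard-codes the probabilities from the printed output above so that it is reproducible; `best_row`, `best_prob`, `best_state` and `summary` are illustrative names, not part of the original answer.

```python
import pandas as pd

row = ['a', 'b', 'c']
col = ['A', 'B', 'C', 'D']
prevstate = pd.DataFrame([[i + j for j in col] for i in row], index=row, columns=col)
# Probabilities copied from the printed output above.
prob = pd.DataFrame([[0.8967, 0.7302, 0.7833, 0.7417],
                     [0.4621, 0.6426, 0.2249, 0.7085],
                     [0.7471, 0.6251, 0.5800, 0.2426]], index=row, columns=col)

best_row = prob.idxmax()   # row label of the maximum, per column
best_prob = prob.max()     # maximum value, per column
# Look up the matching prevstate entry for each (row label, column) pair.
best_state = pd.Series({c: prevstate.at[r, c] for c, r in best_row.items()})

summary = pd.DataFrame({'state': best_state, 'prob': best_prob})
print(summary)
```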

Appending a list or series to a pandas DataFrame as a row?

So I have initialized an empty pandas DataFrame and I would like to iteratively append lists (or Series) as rows in this DataFrame. What is the best way of doing this?
df = pd.DataFrame(columns=list("ABC"))
df.loc[len(df)] = [1,2,3]
Sometimes it's easier to do all the appending outside of pandas, then, just create the DataFrame in one shot.
>>> import pandas as pd
>>> simple_list=[['a','b']]
>>> simple_list.append(['e','f'])
>>> df=pd.DataFrame(simple_list,columns=['col1','col2'])
col1 col2
0 a b
1 e f
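Applied to the iterative case in the question, the "append outside pandas" idea might look like the sketch below. The loop body is a hypothetical placeholder for whatever produces each row.

```python
import pandas as pd

rows = []
for i in range(3):
    # Hypothetical per-iteration result; replace with the real row values.
    rows.append([i, i * 2, i * 3])

# Build the DataFrame once, after all rows have been collected.
df = pd.DataFrame(rows, columns=list("ABC"))
print(df)
```

Appending to a Python list is cheap, whereas growing a DataFrame one row at a time typically copies data on every step.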
Here's a simple and dumb solution:
>>> import pandas as pd
>>> df = pd.DataFrame()
>>> df = df.append({'foo':1, 'bar':2}, ignore_index=True)
Could you do something like this?
>>> import pandas as pd
>>> df = pd.DataFrame(columns=['col1', 'col2'])
>>> df = df.append(pd.Series(['a', 'b'], index=['col1','col2']), ignore_index=True)
>>> df = df.append(pd.Series(['d', 'e'], index=['col1','col2']), ignore_index=True)
>>> df
col1 col2
0 a b
1 d e
Does anyone have a more elegant solution?
Following onto Mike Chirico's answer... if you want to append a list after the dataframe is already populated...
>>> new_rows = [['f','g']]  # avoid naming a variable `list`, which shadows the built-in
>>> df = df.append(pd.DataFrame(new_rows, columns=['col1','col2']),ignore_index=True)
>>> df
col1 col2
0 a b
1 d e
2 f g
There are several ways to append a list to a Pandas Dataframe in Python. Let's consider the following dataframe and list:
import pandas as pd
# Dataframe
df = pd.DataFrame([[1, 2], [3, 4]], columns = ["col1", "col2"])
# List to append
new_row = [5, 6]  # named new_row rather than list, to avoid shadowing the built-in
Option 1: append the list at the end of the dataframe with pandas.DataFrame.loc.
df.loc[len(df)] = new_row
Option 2: convert the list to a dataframe and append with pandas.DataFrame.append().
df = df.append(pd.DataFrame([new_row], columns=df.columns), ignore_index=True)
Option 3: convert the list to a series and append with pandas.DataFrame.append().
df = df.append(pd.Series(new_row, index = df.columns), ignore_index=True)
Each of the above options should output something like:
>>> print (df)
col1 col2
0 1 2
1 3 4
2 5 6
Reference : How to append a list as a row to a Pandas DataFrame in Python?
Converting the list to a data frame within the append call works, including when applied in a loop:
import pandas as pd
mylist = [1,2,3]
df = pd.DataFrame()
df = df.append(pd.DataFrame(data=[mylist]))
Here's a function that, given an already created dataframe, will append a list as a new row. This should probably have error catchers thrown in, but if you know exactly what you're adding then it shouldn't be an issue.
import pandas as pd
import numpy as np
def addRow(df,ls):
    """
    Given a dataframe and a list, append the list as a new row to the dataframe.
    :param df: (DataFrame) The original dataframe
    :param ls: (list) The new row to be added
    :return: (DataFrame) The dataframe with the newly appended row
    """
    numEl = len(ls)
    newRow = pd.DataFrame(np.array(ls).reshape(1,numEl), columns = list(df.columns))
    df = df.append(newRow, ignore_index=True)
    return df
If you want to add a Series and use the Series' index as columns of the DataFrame, you only need to append the Series between brackets:
In [1]: import pandas as pd
In [2]: df = pd.DataFrame()
In [3]: row=pd.Series([1,2,3],["A","B","C"])
In [4]: row
Out[4]:
A 1
B 2
C 3
dtype: int64
In [5]: df.append([row],ignore_index=True)
Out[5]:
A B C
0 1 2 3
[1 rows x 3 columns]
Without ignore_index=True you don't get a proper index.
simply use loc:
>>> df
A B C
one 1 2 3
>>> df.loc["two"] = [4,5,6]
>>> df
A B C
one 1 2 3
two 4 5 6
As mentioned here - https://kite.com/python/answers/how-to-append-a-list-as-a-row-to-a-pandas-dataframe-in-python, you'll need to first convert the list to a series then append the series to dataframe.
df = pd.DataFrame([[1, 2], [3, 4]], columns = ["a", "b"])
to_append = [5, 6]
a_series = pd.Series(to_append, index = df.columns)
df = df.append(a_series, ignore_index=True)
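Note that DataFrame.append, used by several answers in this thread, was deprecated in pandas 1.4 and removed in pandas 2.0. On current versions, pd.concat is one way to get the same result. A minimal sketch, reusing the names from the snippet above (`new_row` is an illustrative name):

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=["a", "b"])
to_append = [5, 6]

# Wrap the list in a one-row DataFrame with matching columns, then concatenate.
new_row = pd.DataFrame([to_append], columns=df.columns)
df = pd.concat([df, new_row], ignore_index=True)
print(df)
```

The df.loc[len(df)] = ... form shown elsewhere in this thread does not depend on append and still works.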
Consider a DataFrame A with N rows and 2 columns. To add one more row, use the following.
A.loc[A.shape[0]] = [3,4]
The simplest way:
my_list = [1,2,3,4,5]
df['new_column'] = pd.Series(my_list).values
Edit:
Don't forget that the length of the new list must be the same as the length of the DataFrame.
