DataFrame from dictionary with nested lists - python

I have a Python dictionary with nested lists that I would like to turn into a pandas DataFrame:
a = {'A': [1,2,3], 'B':['a','b','c'],'C':[[1,2],[3,4],[5,6]]}
I would like the final DataFrame to look like this:
> A B C
> 1 a 1
> 1 a 2
> 2 b 3
> 2 b 4
> 3 c 5
> 3 c 6
When I use the DataFrame constructor, it looks like this:
pd.DataFrame(a)
> A B C
> 0 1 a [1, 2]
> 1 2 b [3, 4]
> 2 3 c [5, 6]
Is there any way to make the data long by the elements of C?

This is what I came up with:
In [53]: df
Out[53]:
A B C
0 1 a [1, 2]
1 2 b [3, 4]
2 3 c [5, 6]
In [58]: s = df.C.apply(pd.Series).unstack().reset_index(level=0, drop=True)
In [59]: s.name = 'C2'
In [61]: df.drop('C', axis = 1).join(s)
Out[61]:
A B C2
0 1 a 1
0 1 a 2
1 2 b 3
1 2 b 4
2 3 c 5
2 3 c 6
apply(pd.Series) gives me a DataFrame with two columns. To stack them into one while keeping the original index, I use unstack. reset_index then drops the first level of the resulting index, which holds the position of each value within its original list. Finally, I join the result back into df.
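On pandas 0.25 or newer, DataFrame.explode does the same thing in one step. A minimal sketch, assuming the dictionary a from the question:
import pandas as pd
a = {'A': [1, 2, 3], 'B': ['a', 'b', 'c'], 'C': [[1, 2], [3, 4], [5, 6]]}
df = pd.DataFrame(a)
# explode repeats the A and B values once for each element of the list in C
print(df.explode('C').reset_index(drop=True))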

Yes, one way is to deal with your dictionary first (I assume each dictionary value is either a flat list of values or a list of nested lists, but not a mix of both).
Step by step:
from functools import reduce  # reduce is not a builtin in Python 3
def f(x, y): return x + y
res = {k: reduce(f, v) if any(isinstance(i, list) for i in v) else v for k, v in a.items()}
will give you:
{'A': [1, 2, 3], 'C': [1, 2, 3, 4, 5, 6], 'B': ['a', 'b', 'c']}
Now you need to extend the lists in your dictionary:
m = max(len(v) for v in res.values())
res1 = {k: reduce(f, [(m // len(v)) * [i] for i in v]) for k, v in res.items()}  # // keeps the repeat count an integer in Python 3
And finally:
pd.DataFrame(res1)
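For reference, the steps assembled into one runnable Python 3 script (note the functools import and the integer division for the repeat count):
from functools import reduce
import pandas as pd
a = {'A': [1, 2, 3], 'B': ['a', 'b', 'c'], 'C': [[1, 2], [3, 4], [5, 6]]}
def f(x, y): return x + y
# flatten any value that is a list of lists
res = {k: reduce(f, v) if any(isinstance(i, list) for i in v) else v for k, v in a.items()}
# repeat each element so every column reaches the length of the longest one
m = max(len(v) for v in res.values())
res1 = {k: reduce(f, [(m // len(v)) * [i] for i in v]) for k, v in res.items()}
print(pd.DataFrame(res1))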

Related

Pandas DataFrame: adding a list to each cell by iterating over df with a new column does not work

I have a DataFrame with columns a and b. I now want to add a new column c that should contain lists (of different lengths).
df1 = pd.DataFrame({'a':[1,2,3], 'b':[5,6,7]})
new_col_init = [list() for i in range(len(df1))]
df1['c'] = pd.Series(new_col_init,dtype='object')
print(df1)
gives:
a b c
0 1 5 []
1 2 6 []
2 3 7 []
Why am I unable to do the following?
for i in range(len(df1)):
    df1.loc[i, 'c'] = [2] * i
This results in ValueError: cannot set using a multi-index selection indexer with a different length than the value.
However this works:
df1['c'] = pd.Series([[2], [2,2], [2,2,2]])
print(df1)
Result:
a b c
0 1 5 [2]
1 2 6 [2, 2]
2 3 7 [2, 2, 2]
Is there a way to assign the lists by iterating with a for-loop? (I have a lot of other stuff that already gets assigned within that loop, and I now need to add the new lists as well.)
You can use .at:
for i, idx in enumerate(df1.index, 1):
    df1.at[idx, "c"] = [2] * i
print(df1)
Prints:
a b c
0 1 5 [2]
1 2 6 [2, 2]
2 3 7 [2, 2, 2]
Here is a solution using Index.map:
df1['c'] = df1.index.map(lambda x: (x + 1) * [2])
a b c
0 1 5 [2]
1 2 6 [2, 2]
2 3 7 [2, 2, 2]
Or build all the lists up front and assign them in one go:
df1.loc[:, 'c'] = [[2]*(i+1) for i in range(len(df1))]
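For completeness, a minimal runnable version of the .at loop. The key point is that .at addresses exactly one cell, so pandas stores the list as a single object instead of trying to align its elements against the indexer:
import pandas as pd
df1 = pd.DataFrame({'a': [1, 2, 3], 'b': [5, 6, 7]})
df1['c'] = pd.Series([list() for _ in range(len(df1))], dtype='object')
# .at sets one cell at a time, so each list is stored as one object
for i, idx in enumerate(df1.index, 1):
    df1.at[idx, 'c'] = [2] * i
print(df1)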

Pandas: how to identify rows with a specific value in column x, and use other values in the same row as variables?

I would like to identify a row containing 1 in column x and use other values in the same row as variables. Then I would like to identify the row in a different dataframe which contains those values, and delete that row.
df1:
z y x
--------
3 3 0
5 4 1
df2:
a b c d e
--------------
5 4 p p p <-- Delete this row
3 3 p p p
This is one way of doing it if you are looking for a scalable version with more than just two rows in df1 and df2:
import pandas as pd
df1 = pd.DataFrame({"z":[3,5], "y": [3,4], "x": [0,1]})
df2 = pd.DataFrame({"a":[5,4], "b":[4,3], "c": ["p","p"], "d": ["p","p"], "e": ["p","p"]})
df1_set = df1[df1.x == 1]
idx = []
for i in range(len(df2)):
    for j in range(len(df1_set)):
        if (df2.a.iloc[i], df2.b.iloc[i]) == (df1_set.z.iloc[j], df1_set.y.iloc[j]):
            idx.append(i)
df2_set = df2.drop(idx)
The dataframes look like this:
df1_set
Out[50]:
z y x
1 5 4 1
df2_set
Out[51]:
a b c d e
1 4 3 p p p
Here df1_set is the selection from df1 where x = 1, and df2_set is the final output that you are seeking.
Explanation:
Find the rows with x = 1 in df1.
Loop over df1_set and df2, and remove each row of df2 where a and b match z and y from df1_set (a loop-free merge variant is sketched below).
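As an aside, the same anti-join can be done without explicit loops by merging on the key columns and keeping only the unmatched rows. A minimal sketch, assuming the df1/df2 layout shown in the question:
import pandas as pd
df1 = pd.DataFrame({'z': [3, 5], 'y': [3, 4], 'x': [0, 1]})
df2 = pd.DataFrame({'a': [5, 3], 'b': [4, 3], 'c': ['p', 'p'], 'd': ['p', 'p'], 'e': ['p', 'p']})
keys = df1.loc[df1.x == 1, ['z', 'y']]
# indicator=True adds a _merge column telling which rows found a match
merged = df2.merge(keys, left_on=['a', 'b'], right_on=['z', 'y'], how='left', indicator=True)
result = merged[merged['_merge'] == 'left_only'].drop(columns=['z', 'y', '_merge'])
print(result)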
For both DataFrames you can create a column that will be used to match them together: a column that contains a tuple of the relevant values. Then remove from the main DataFrame the rows where (a, b) is contained in the match values that come from the second DataFrame:
df_idx = pd.DataFrame({'z': [3, 5], 'y': [3, 4], 'x': [0, 1]})
df = pd.DataFrame({'a': [5, 3], 'b': [4, 3], 'c': ['A', 'A'], 'd': ['A', 'A'], 'e': ['A', 'A']})
df_idx['match'] = df_idx[['z', 'y']].apply(tuple, axis=1)
df['match'] = df[['a', 'b']].apply(tuple, axis=1)
selectors = df_idx[df_idx.x == 1]
result = df[~df.match.isin(selectors['match'].tolist())]
>>> print(df_idx)
z y x match
0 3 3 0 (3, 3)
1 5 4 1 (5, 4)
>>> print(df)
a b c d e match
0 5 4 A A A (5, 4)
1 3 3 A A A (3, 3)
>>> print(selectors['match'].tolist())
[(5, 4)]
>>> print(result)
a b c d e match
1 3 3 A A A (3, 3)
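If you don't want the helper column in the final output, drop it afterwards:
result = result.drop(columns='match')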

Set Multi-Index DataFrame column by Series with Index

I'm struggling with a MultiIndex DataFrame a whose column x needs to be set from b, which is not a MultiIndex and has only one index level (the first level of a). I have an index of the values to change (ix), which is why I am using .loc[] for indexing. The problem is that the way missing index levels are aligned in a is not what I require (see the example).
>>> a = pd.DataFrame({'a': [1, 2, 3], 'b': ['b', 'b', 'b'], 'x': [4, 5, 6]}).set_index(['a', 'b'])
>>> a
x
a b
1 b 4
2 b 5
3 b 6
>>> b = pd.DataFrame({'a': [1, 4], 'x': [9, 10]}).set_index('a')
>>> b
x
a
1 9
4 10
>>> ix = a.index[[0, 1]]
>>> ix
MultiIndex(levels=[[1, 2, 3], [u'b']],
codes=[[0, 1], [0, 0]],
names=[u'a', u'b'])
>>> a.loc[ix]
x
a b
1 b 4
2 b 5
>>> a.loc[ix, 'x'] = b['x']
>>> # wrong result (at least not what I want)
>>> a
x
a b
1 b NaN
2 b NaN
3 b 6.0
>>> # expected result
>>> a
x
a b
1 b 9 # index: a=1 is part of DataFrame b
2 b 5 # other indices don't exist in b and...
3 b 6 # ... x-values remain unchanged
# if there were more [1, ...] indices...
# ...x would also be set to 9
I think you want to merge a and b; consider using the concat, merge, or join functions.
I can't think of any one-liner, so here's a multi-step approach:
tmp_df = a.loc[ix, ['x']].reset_index(level=1, drop=True)
tmp_df.update(b)  # overwrite x only where b has a value; other rows keep their x
tmp_df.index = ix
a.loc[ix, 'x'] = tmp_df['x']
Output:
x
a b
1 b 9.0
2 b 5.0
3 b 6.0
Edit: I assume that the b's in the index are symbolic placeholders. Otherwise the code will fail starting at a.loc[ix, 'x'], because duplicate labels get repeated: for
a = pd.DataFrame({'a': [1, 1, 2, 3],
                  'b': ['b', 'b', 'b', 'b'],
                  'x': [4, 5, 3, 6]}).set_index(['a', 'b'])
a.loc[ix,'x'] gives:
a b
1 b 4
b 5
b 4
b 5
Name: x, dtype: int64
You are trying to combine a 1-level-index frame with a 2-level-index frame; just use .values:
EDIT:
import pandas as pd
a = pd.DataFrame({'a': [1, 2, 3], 'b': ['b', 'b', 'b'], 'x': [4, 5, 6]}).set_index(['a', 'b'])
b = pd.DataFrame({'a': [1, 4], 'x': [9, 10]}).set_index('a')
a_level = a.index.get_level_values('a')
mask = a_level.isin(b.index)  # boolean mask over all rows of a
a.loc[mask, 'x'] = b.loc[a_level[mask], 'x'].values
a:
x
a b
1 b 9
2 b 5
3 b 6
I first reset the MultiIndex of a and then set it to the (single) column a:
a = a.reset_index()
a = a.set_index('a')
print(a)
b x
a
1 b 4
2 b 5
3 b 6
print(b)
x
a
1 9
4 10
Then make the assignment you require using loc, and afterwards re-set the MultiIndex.
Now, since we are using loc, your ix = a.index[[0, 1]] becomes roughly [1, 0]: label 1 refers to the index of a, and position 0 refers to the row of b.
a.loc[1, 'x'] = b.iloc[0,0]
a.reset_index(inplace=True)
a = a.set_index(['a','b'])
print(a)
x
a b
1 b 9
2 b 5
3 b 6
EDIT:
Alternatively, reset the MultiIndex of a and don't set it to a single-column index. Then your [0, 1] (referring to index labels with loc, not positions with iloc) can be used: 0 refers to the index of a and 1 refers to the index of b.
a = a.reset_index()
print(a)
a b x
0 1 b 4
1 2 b 5
2 3 b 6
a.loc[0, 'x'] = b.loc[1,'x']
a = a.set_index(['a','b'])
print(a)
x
a b
1 b 9
2 b 5
3 b 6
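Another way to express the alignment (a sketch, assuming the example frames a, b, and ix from the question): reindex b's x by the 'a' level of a's index so it lines up with a row for row, then fill the holes from a's current values before assigning.
import pandas as pd
a = pd.DataFrame({'a': [1, 2, 3], 'b': ['b', 'b', 'b'], 'x': [4, 5, 6]}).set_index(['a', 'b'])
b = pd.DataFrame({'a': [1, 4], 'x': [9, 10]}).set_index('a')
ix = a.index[[0, 1]]
# align b['x'] to a's rows via the first index level; unmatched rows become NaN
aligned = pd.Series(b['x'].reindex(a.index.get_level_values('a')).values, index=a.index)
a.loc[ix, 'x'] = aligned[ix].fillna(a.loc[ix, 'x'])
print(a)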

Pandas long format DataFrame from multiple lists of different length

Consider I have multiple lists
A = [1, 2, 3]
B = [1, 4]
and I want to generate a Pandas DataFrame in long format as follows:
type | value
------------
A | 1
A | 2
A | 3
B | 1
B | 4
What is the easiest way to achieve this? Going through the wide format and melt does not seem possible(?) because the lists may have different lengths.
Create a dictionary for the types and build a list of tuples with a list comprehension:
A = [1, 2, 3]
B = [1, 4]
d = {'A':A,'B':B}
print ([(k, y) for k, v in d.items() for y in v])
[('A', 1), ('A', 2), ('A', 3), ('B', 1), ('B', 4)]
df = pd.DataFrame([(k, y) for k, v in d.items() for y in v], columns=['type','value'])
print (df)
type value
0 A 1
1 A 2
2 A 3
3 B 1
4 B 4
Another solution, if input is list of lists and types should be integers:
L = [A,B]
df = pd.DataFrame([(k, y) for k, v in enumerate(L) for y in v], columns=['type','value'])
print (df)
type value
0 0 1
1 0 2
2 0 3
3 1 1
4 1 4
Here's a NumPy-based solution using a dictionary input:
import numpy as np
import pandas as pd

d = {'A': [1, 2, 3],
     'B': [1, 4]}
keys, values = zip(*d.items())
res = pd.DataFrame({'type': np.repeat(keys, list(map(len, values))),
                    'value': np.concatenate(values)})
print(res)
type value
0 A 1
1 A 2
2 A 3
3 B 1
4 B 4
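A pd.concat variant can also build this directly: concatenating the lists as Series with keys puts the type into the index (a sketch, assuming the lists A and B from the question):
import pandas as pd
A = [1, 2, 3]
B = [1, 4]
# keys become the outer level of a MultiIndex: ('A', 0), ('A', 1), ...
s = pd.concat([pd.Series(A), pd.Series(B)], keys=['A', 'B'])
df = s.reset_index(level=0).reset_index(drop=True)
df.columns = ['type', 'value']
print(df)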
This borrows the gather idea from the R libraries dplyr and tidyr. The following code is just for demo purposes, so I created two DataFrames, df1 and df2; you can create your DataFrames dynamically and concat them:
import pandas as pd
def gather(df, key, value, cols):
    id_vars = [col for col in df.columns if col not in cols]
    id_values = cols
    var_name = key
    value_name = value
    return pd.melt(df, id_vars, id_values, var_name, value_name)
df1 = pd.DataFrame({'A': [1, 2, 3]})
df2 = pd.DataFrame({'B': [1, 4]})
df_messy = pd.concat([df1, df2], axis=1)
print(df_messy)
df_tidy = gather(df_messy, 'type', 'value', df_messy.columns).dropna()
print(df_tidy)
Output for df_messy:
A B
0 1 1.0
1 2 4.0
2 3 NaN
Output for df_tidy:
type value
0 A 1.0
1 A 2.0
2 A 3.0
3 B 1.0
4 B 4.0
PS: remember to convert the values from float back to int; this is just a demo, so I didn't pay much attention to the details.

Get rows based on a given list without reordering or deduplicating the list

I have a df that looks like the one below. I would like to get rows by matching column 'D' against my list, without reordering or deduplicating the list.
A B C D
0 a b 1 1
1 a b 1 2
2 a b 1 3
3 a b 1 4
4 c d 2 5
5 c d 3 6 #df
My list
l = [4, 2, 6, 4] # my list
df.loc[df['D'].isin(l)].to_csv('output.csv', index = False)
When I use isin(), the result is reordered and the duplicates from my list are dropped, and df.loc[df['D'] == value] only prints the last line.
A B C D
3 a b 1 4
1 a b 1 2
5 c d 3 6
3 a b 1 4 # desired output
Any good way to do this? Thanks.
A solution without a loop, using merge:
In [26]: pd.DataFrame({'D':l}).merge(df, how='left')
Out[26]:
D A B C
0 4 a b 1
1 2 a b 1
2 6 c d 3
3 4 a b 1
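Note that the merge puts D first; if you want the original column order back, select it afterwards:
pd.DataFrame({'D': l}).merge(df, how='left')[df.columns.tolist()]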
You're going to have to iterate over your list, get filtered copies, and then concat them all together:
l = [4, 2, 6, 4] # you shouldn't use list = as list is a builtin
cache = {}
masked_dfs = []
for v in l:
    try:
        filtered_df = cache[v]
    except KeyError:
        filtered_df = df[df['D'] == v]
        cache[v] = filtered_df
    masked_dfs.append(filtered_df)
new_df = pd.concat(masked_dfs)
UPDATE: modified my answer to cache the filtered frames so that repeated values don't require multiple searches.
Just collect the indices of the values you are looking for, put them in a list, and then use that list to slice the data:
import pandas as pd
df = pd.DataFrame({
    'C': [6, 5, 4, 3, 2, 1],
    'D': [1, 2, 3, 4, 5, 6]
})
l = [4, 2, 6, 4]
i_locs = [ind for elem in l for ind in df[df['D'] == elem].index]
df.loc[i_locs]
results in
C D
3 3 4
1 5 2
5 1 6
3 3 4
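If the values in D happen to be unique, a set_index/loc lookup is a compact alternative that also preserves the order and repeats of the list (a sketch under that uniqueness assumption):
import pandas as pd
df = pd.DataFrame({'C': [6, 5, 4, 3, 2, 1], 'D': [1, 2, 3, 4, 5, 6]})
l = [4, 2, 6, 4]
# .loc with a list of labels returns rows in list order, duplicates included
out = df.set_index('D').loc[l].reset_index()
print(out)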
