Convert Python dict of Arrays into a dataframe - python

I have a dictionary of arrays like the following:
d = {'a': [1,2], 'b': [3,4], 'c': [5,6]}
I want to create a pandas dataframe like this:
0 1 2
0 a 1 2
1 b 3 4
2 c 5 6
I wrote the following code:
pd.DataFrame(list(d.items()))
which returns:
0 1
0 a [1,2]
1 b [3,4]
2 c [5,6]
Do you know how I can achieve this?
Thank you in advance.

Pandas allows you to do this in a straightforward fashion:
pd.DataFrame.from_dict(d,orient = 'index')
>> 0 1
a 1 2
b 3 4
c 5 6
pd.DataFrame.from_dict(d,orient = 'index').reset_index() gives you what you are looking for.
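A minimal end-to-end sketch of this answer, assuming the positional column labels 0, 1, 2 from the question are wanted. The relabelling step is an addition for illustration, not part of the original answer:

```python
import pandas as pd

d = {'a': [1, 2], 'b': [3, 4], 'c': [5, 6]}

# Keys become the index, list items become columns; reset_index turns the keys into a column
df = pd.DataFrame.from_dict(d, orient='index').reset_index()

# Relabel the columns positionally (assumption: the 'index' label is not wanted)
df.columns = range(df.shape[1])
print(df)
```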

Use the splat operator in a comprehension to produce your dataframe:
pd.DataFrame([k, *v] for k, v in d.items())
0 1 2
0 a 1 2
1 b 3 4
2 c 5 6
If you don't mind having index as one of your column names, simply transpose and reset_index:
pd.DataFrame(d).T.reset_index()
index 0 1
0 a 1 2
1 b 3 4
2 c 5 6
Finally, although it's rather ugly, the most performant option I could find on very large dictionaries is the following:
pd.DataFrame(list(d.values()), index=list(d.keys())).reset_index()
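As a quick sanity check, the sketch below builds the frame with each of the three approaches and confirms that they hold the same values; only the column labels differ. The variable names are illustrative.

```python
import pandas as pd

d = {'a': [1, 2], 'b': [3, 4], 'c': [5, 6]}

# 1. Unpack each value list next to its key
via_unpacking = pd.DataFrame([[k, *v] for k, v in d.items()])
# 2. Build column-wise, transpose, then move the keys out of the index
via_transpose = pd.DataFrame(d).T.reset_index()
# 3. Pass values and keys separately, then move the keys out of the index
via_values = pd.DataFrame(list(d.values()), index=list(d.keys())).reset_index()

rows = via_unpacking.values.tolist()
print(rows == via_transpose.values.tolist() == via_values.values.tolist())
```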


What is the most efficient way to swap the values of two columns of a 2D list in python when the number of rows is in the tens of thousands?

for example if I have an original list:
A B
1 3
2 4
to be turned into
A B
3 1
4 2
My two cents: there are three ways to do it:
You could add a third column C, copy A to C, then delete A. This would take more memory.
You could create a swap function for the values in a row, then wrap it in a loop.
You could just swap the labels of the columns. This is probably the most efficient way.
You could use rename:
df2 = df.rename(columns={'A': 'B', 'B': 'A'})
output:
B A
0 1 3
1 2 4
If order matters:
df2 = df.rename(columns={'A': 'B', 'B': 'A'})[df.columns]
output:
A B
0 3 1
1 4 2
Use DataFrame.rename with a dictionary to swap the column names, then restore the original column order by selecting the columns:
df = df.rename(columns=dict(zip(df.columns, df.columns[::-1])))[df.columns]
print (df)
A B
0 3 1
1 4 2
You can also simply assign the values of the swapped columns directly.
import pandas as pd
df = pd.DataFrame({"A":[1,2],"B":[3,4]})
df[["A","B"]] = df[["B","A"]].values
df
A B
0 3 1
1 4 2
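A self-contained version of the value swap, using to_numpy() (the documented replacement for .values). Converting the right-hand side to a plain array makes the assignment purely positional, so no label alignment can undo the swap:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": [3, 4]})

# The right-hand side is a bare 2-D array: its first column (old B) lands in A, its second (old A) in B
df[["A", "B"]] = df[["B", "A"]].to_numpy()
print(df)
```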
for more than 2 columns:
df = pd.DataFrame({'A':[1,2,3],'B':[4,5,6],'C':[7,8,9], 'D':[10,11,12]})
print(df)
'''
A B C D
0 1 4 7 10
1 2 5 8 11
2 3 6 9 12
'''
df = df.set_axis(df.columns[::-1],axis=1)[df.columns]
print(df)
'''
A B C D
0 10 7 4 1
1 11 8 5 2
2 12 9 6 3
'''
I assume that your list is like this:
my_list = [[1, 3], [2, 4]]
So you can use this code:
print([[each_element[1], each_element[0]] for each_element in my_list])
The output is:
[[3, 1], [4, 2]]
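The comprehension builds a new list. If the rows should be modified in place instead, a tuple swap per row is one possible alternative (a sketch, assuming every row has at least two elements):

```python
my_list = [[1, 3], [2, 4]]

# Swap the first two elements of each row without allocating a new outer list
for row in my_list:
    row[0], row[1] = row[1], row[0]

print(my_list)
```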

How to create a data frame from lists of random lengths using Python?

I want to create a pandas data frame from multiple lists with different lengths. Below is my Python code.
import random
import pandas as pd
A=[1,2]
B=[1,2,3]
C=[1,2,3,4,5,6]
lenA = len(A)
lenB = len(B)
lenC = len(C)
rows = []
for i, v1 in enumerate(A):
    for j, v2 in enumerate(B):
        for k, v3 in enumerate(C):
            if i < random.randint(0, lenA):
                if j < random.randint(0, lenB):
                    if k < random.randint(0, lenC):
                        # DataFrame.append was removed in pandas 2.0, so collect the rows first
                        rows.append({'A': v1, 'B': v2, 'C': v3})
df = pd.DataFrame(rows, columns=['A', 'B', 'C'])
print(df)
My lists are as below:
A=[1,2]
B=[1,2,3]
C=[1,2,3,4,5,6,7]
Each run gives a different output, which is expected, but no run covers all of the list items. In one run I got the output below:
A B C
0 1 1 3
1 1 2 1
2 1 2 2
3 2 2 5
In the above output, all items of list 'A' (1, 2) are present. But only items (1, 2) of list 'B' appear; item 3 is missing. Likewise, only items (1, 2, 3, 5) of list 'C' appear; items (4, 6, 7) are missing. My expectation is that every item of each list appears in the data frame at least once, and that the items of list 'C' appear exactly once. My expected sample output is as below:
A B C
0 1 1 3
1 1 2 1
2 1 2 2
3 2 2 5
4 2 3 4
5 1 1 7
6 2 3 6
Guide me to get my expected output. Thanks in advance.
You can pad each list with random values drawn from itself up to the length of the longest list, and then shuffle the rows with DataFrame.sample:
import numpy as np
import pandas as pd
A=[1,2]
B=[1,2,3]
C=[1,2,3,4,5,6]
L = [A,B,C]
m = max(len(x) for x in L)
print (m)
6
a = [np.hstack((np.random.choice(x, m - len(x)), x)) for x in L]
df = pd.DataFrame(a, index=['A', 'B', 'C']).T.sample(frac=1)
print (df)
A B C
2 2 2 3
0 2 1 1
3 1 1 4
4 1 2 5
5 2 3 6
1 2 2 2
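The padding idea can be sketched as a repeatable script that also checks the two requirements from the question (every item appears at least once, and the items of C appear exactly once). The fixed seeds and the dictionary layout are assumptions made for illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed (an assumption) so runs are repeatable
lists = {'A': [1, 2], 'B': [1, 2, 3], 'C': [1, 2, 3, 4, 5, 6]}
m = max(len(values) for values in lists.values())

# Pad each list up to length m with random draws from that same list
padded = {
    name: np.concatenate([rng.choice(values, size=m - len(values)), values])
    for name, values in lists.items()
}

# Shuffle whole rows so the padding does not always sit at the top
df = pd.DataFrame(padded).sample(frac=1, random_state=0).reset_index(drop=True)
print(df)
```

Because the longest list needs no padding, its items appear exactly once; the shorter lists repeat some of their own items.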
You can use transpose to achieve the same.
EDIT: Used random to randomize the output as requested.
import pandas as pd
from random import shuffle, choice
A=[1,2]
B=[1,2,3]
C=[1,2,3,4,5,6]
shuffle(A)
shuffle(B)
shuffle(C)
data = [A,B,C]
df = pd.DataFrame(data)
df = df.transpose()
df.columns = ['A', 'B', 'C']
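# Fill the gaps left by the shorter lists with one randomly chosen item from each list
df['A'] = df['A'].fillna(choice(A))
df['B'] = df['B'].fillna(choice(B))
Before the fillna calls (and without shuffling), the transposed frame looks like this:
A B C
0 1.0 1.0 1.0
1 2.0 2.0 2.0
2 NaN 3.0 3.0
3 NaN NaN 4.0
4 NaN NaN 5.0
5 NaN NaN 6.0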

Map column using two dictionaries

I have a df:
ColA ColB
1 1
2 3
2 2
1 2
1 3
2 1
I would like to use two different dictionaries to change the values in ColB. I would like to use d1 if the value in ColA is 1 and d2 if the value in ColA is 2.
d1 = {1:'a',2:'b',3:'c'}
d2 = {1:'d',2:'e',3:'f'}
Resulting in:
ColA ColB
1 a
2 f
2 e
1 b
1 c
2 d
How would be the best way of achieving this?
One way is using np.where to map the values in ColB using one dictionary or the other depending on the values of ColA:
import numpy as np
df['ColB'] = np.where(df.ColA.eq(1), df.ColB.map(d1), df.ColB.map(d2))
Which gives:
ColA ColB
0 1 a
1 2 f
2 2 e
3 1 b
4 1 c
5 2 d
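For reference, a self-contained version of the np.where approach, with the frame from the question built inline. Note that np.where has only two branches, so any ColA value other than 1 falls through to d2:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'ColA': [1, 2, 2, 1, 1, 2], 'ColB': [1, 3, 2, 2, 3, 1]})
d1 = {1: 'a', 2: 'b', 3: 'c'}
d2 = {1: 'd', 2: 'e', 3: 'f'}

# Map ColB through both dictionaries, then pick per row depending on ColA
use_d1 = df['ColA'].eq(1)
df['ColB'] = np.where(use_d1, df['ColB'].map(d1), df['ColB'].map(d2))
print(df['ColB'].tolist())  # ['a', 'f', 'e', 'b', 'c', 'd']
```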
For a more general solution, you could use np.select, which works for multiple conditions. Let's add another value in ColA and a dictionary, to see how this could be done with three different mappings:
print(df)
ColA ColB
0 1 1
1 2 3
2 2 2
3 1 2
4 3 3
5 3 1
values_to_map = [1,2,3]
d1 = {1:'a',2:'b',3:'c'}
d2 = {1:'d',2:'e',3:'f'}
d3 = {1:'g',2:'h',3:'i'}
#create a list of boolean Series as conditions
conds = [df.ColA.eq(i) for i in values_to_map]
# List of Series to choose from depending on conds
choices = [df.ColB.map(d) for d in [d1,d2,d3]]
# use np.select to select from the choice list based on conds
df['ColB'] = np.select(conds, choices)
Resulting in:
ColA ColB
0 1 a
1 2 f
2 2 e
3 1 b
4 3 i
5 3 g
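One detail worth knowing: rows that match none of the conditions receive np.select's default value, which is 0 unless specified. The sketch below drives the conditions from a dictionary of dictionaries and sets an explicit default; the name mappings, the 'unmapped' placeholder, and the sample frame are assumptions for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'ColA': [1, 2, 3, 4], 'ColB': [1, 3, 2, 1]})
mappings = {
    1: {1: 'a', 2: 'b', 3: 'c'},
    2: {1: 'd', 2: 'e', 3: 'f'},
    3: {1: 'g', 2: 'h', 3: 'i'},
}

# One boolean condition and one mapped Series per ColA value
conds = [df['ColA'].eq(key) for key in mappings]
# Cast to object dtype so the strings and the default share one type
choices = [df['ColB'].map(m).astype(object) for m in mappings.values()]

# ColA == 4 matches no condition, so that row gets the explicit default
df['ColB'] = np.select(conds, choices, default='unmapped')
print(df['ColB'].tolist())  # ['a', 'f', 'h', 'unmapped']
```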
You can use a new dictionary in which the keys are tuples and map it against the zipped columns.
d = {**{(1, k): v for k, v in d1.items()}, **{(2, k): v for k, v in d2.items()}}
df.assign(ColB=[*map(d.get, zip(df.ColA, df.ColB))])
ColA ColB
0 1 a
1 2 f
2 2 e
3 1 b
4 1 c
5 2 d
Or we can get cute with a lambda to map.
NOTE: I aligned the dictionaries to switch between based on their relative position in the list [0, d1, d2]. In this case it doesn't matter what is in the first position. I put 0 arbitrarily.
df.assign(ColB=[*map(lambda x, y: [0, d1, d2][x][y], df.ColA, df.ColB)])
ColA ColB
0 1 a
1 2 f
2 2 e
3 1 b
4 1 c
5 2 d
For robustness I'd stay away from cute and map a lambda that had some default value capability
df.assign(ColB=[*map(lambda x, y: {1: d1, 2: d2}.get(x, {}).get(y), df.ColA, df.ColB)])
ColA ColB
0 1 a
1 2 f
2 2 e
3 1 b
4 1 c
5 2 d
If it needs to be done for many groups use a dict of dicts to map each group separately. Ideally you can find some functional way to create d:
d = {1: d1, 2: d2}
df['ColB'] = pd.concat([gp.ColB.map(d[idx]) for idx, gp in df.groupby('ColA')])
Output:
ColA ColB
0 1 a
1 2 f
2 2 e
3 1 b
4 1 c
5 2 d
I am using concat with reindex
idx=pd.MultiIndex.from_arrays([df.ColA, df.ColB])
df.ColB=pd.concat([pd.Series(x) for x in [d1,d2]],keys=[1,2]).reindex(idx).values
df
Out[683]:
ColA ColB
0 1 a
1 2 f
2 2 e
3 1 b
4 1 c
5 2 d
You can create a function that does this for one element and then use an apply lambda to your dataframe.
def your_func(row):
    # Pick the dictionary based on ColA, then look up the ColB value in it
    if row["ColA"] == 1:
        return d1[row["ColB"]]
    elif row["ColA"] == 2:
        return d2[row["ColB"]]
    else:
        return None
df["ColB"] = df.apply(your_func, axis=1)
You can use two replace as such:
df.loc[df['ColA'] == 1,'ColB'] = df['ColB'].replace(d1, regex=True)
df.loc[df['ColA'] == 2,'ColB'] = df['ColB'].replace(d2, regex=True)
I hope it helps,
BR

Adding a column in dataframes based on similar columns in them

I want to add column d of d1 and d2 wherever columns a, b, and c are the same (like a groupby).
For example
d1 = pd.DataFrame([[1,2,3,4]],columns=['a','b','c','d'])
d2 = pd.DataFrame([[1,2,3,4],[2,3,4,5]],columns=['a','b','c','d'])
then I'd like to get an output as
a b c d
0 1 2 3 8
1 2 3 4 5
Merging the two data frames and adding the resultant column d where a b c are same.
d1.add(d2) or radd gives me an aggregate of all columns
The solution should be a DataFrame which can be added again to another similarly.
Any help is appreciated.
You can use set_index first:
print (d2.set_index(['a','b','c'])
.add(d1.set_index(['a','b','c']), fill_value=0)
.astype(int)
.reset_index())
a b c d
0 1 2 3 8
1 2 3 4 5
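An alternative sketch, not taken from the answer above: stacking the frames with pd.concat and summing d within each (a, b, c) group gives the same result and extends to any number of frames. The helper name add_on_keys is hypothetical.

```python
import pandas as pd

def add_on_keys(frames, keys=('a', 'b', 'c'), value='d'):
    """Stack the frames and sum `value` within each group of `keys`."""
    stacked = pd.concat(frames, ignore_index=True)
    return stacked.groupby(list(keys), as_index=False)[value].sum()

d1 = pd.DataFrame([[1, 2, 3, 4]], columns=['a', 'b', 'c', 'd'])
d2 = pd.DataFrame([[1, 2, 3, 4], [2, 3, 4, 5]], columns=['a', 'b', 'c', 'd'])

result = add_on_keys([d1, d2])
print(result)
```

The result is an ordinary DataFrame with the same columns, so it can be passed back into the same helper together with further frames.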
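Another option is to concatenate the frames and drop duplicates; note that this keeps the original d values rather than summing them:
df = pd.concat([d1, d2])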
df.drop_duplicates()
a b c d
0 1 2 3 4
1 2 3 4 5

Remove last two characters from column names of all the columns in Dataframe - Pandas

I am joining two dataframes (a, b) that have identical column names on the userId key. While joining, I had to supply a suffix for the join to succeed. This is the command I used:
a.join(b,how='inner', on='userId',lsuffix="_1")
If I don't use this suffix, I get an error. But I don't want the column names to change, because that causes problems in other analyses. So I want to remove the "_1" suffix from all the column names of the resulting dataframe. Can anybody suggest an efficient way to remove the last two characters from the names of all columns in a pandas dataframe?
Thanks
This snippet should get the job done:
df.columns = [str(x)[:-2] for x in df.columns]
Edit: This is a better way to do it:
df = df.rename(columns=lambda x: str(x)[:-2])
In both cases, all we're doing is iterating through the columns and applying a function. Here, the function converts each name to a string and drops its last two characters. Note that this shortens every column name, including any that do not end in "_1".
I'm sure there are a few other ways you could do this.
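If only some columns carry the suffix, a conditional rename avoids truncating the others. This is a sketch with made-up column names (userId_1, score_1, age), not the asker's real data:

```python
import pandas as pd

df = pd.DataFrame({'userId_1': [1], 'score_1': [2], 'age': [3]})

# Drop the last two characters only when the name really ends in "_1"
df = df.rename(columns=lambda name: name[:-2] if name.endswith('_1') else name)
print(list(df.columns))  # ['userId', 'score', 'age']
```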
You could use str.rstrip like so
In [214]: import functools as ft
In [215]: f = ft.partial(np.random.choice, *[5, 3])
In [225]: df = pd.DataFrame({'a': f(), 'b': f(), 'c': f(), 'a_1': f(), 'b_1': f(), 'c_1': f()})
In [226]: df
Out[226]:
a b c a_1 b_1 c_1
0 4 2 0 2 3 2
1 0 0 3 2 1 1
2 4 0 4 4 4 3
In [227]: df.columns = df.columns.str.rstrip('_1')
In [228]: df
Out[228]:
a b c a b c
0 4 2 0 2 3 2
1 0 0 3 2 1 1
2 4 0 4 4 4 3
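One caveat with str.rstrip: its argument is a set of characters to strip, not a literal suffix, so names that merely end in "1" or "_" are shortened too. The sketch below contrasts it with an anchored regular expression; the column names are invented for illustration:

```python
import pandas as pd

cols = pd.Index(['a_1', 'b1', 'total_1'])

# rstrip removes any trailing run of '_' and '1' characters
stripped = cols.str.rstrip('_1')
# An anchored regex removes only a literal "_1" at the end of the name
anchored = cols.str.replace(r'_1$', '', regex=True)

print(list(stripped))  # ['a', 'b', 'total']
print(list(anchored))  # ['a', 'b1', 'total']
```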
However if you need something more flexible (albeit probably a bit slower), you can use str.extract which, with the power of regexes, will allow you to select which part of the column name you would like to keep
In [216]: df = pd.DataFrame({f'{c}_{i}': f() for i in range(3) for c in 'abc'})
In [217]: df
Out[217]:
a_0 b_0 c_0 a_1 b_1 c_1 a_2 b_2 c_2
0 0 1 0 2 2 4 0 0 3
1 0 0 3 1 4 2 4 3 2
2 2 0 1 0 0 2 2 2 1
In [223]: df.columns = df.columns.str.extract(r'(.*)_\d+')[0]
In [224]: df
Out[224]:
0 a b c a b c a b c
0 1 1 0 0 0 2 1 1 2
1 1 0 1 0 1 2 0 4 1
2 1 3 1 3 4 2 0 1 1
Idea to use df.columns.str came from this answer
