Dict to df if value is a list of lists - python

I have following dictionary:
my_dict = dict([(779825550, [[346583, 2, 305.98, 9]]), (779825605, [[276184, 2, 169.5, 15], [331465, 2, 214.5, 15], [276184, 2, 169.5, 15], [331465, 2, 214.5, 15], [637210, 2, 368.5, 15], [249559, 2, 133.46, 15], [591652, 2, 132.0, 15], [216367, 2, 142.5, 14]]), (779825644, [[568025, 13, 494.5, 15]]), (779825657, [[75366, 18, 43.26, 9]])])
I need to convert this dict into a pandas DataFrame. In each row I need the my_dict key (that is 779825550, 779825605, etc.) followed by the values in the list of lists. So the first row will be: 779825550, 346583, 2, 305.98, 9. If there are more lists in the list (like for 779825605), I need more rows with the same key in the first column (that is 779825605, 276184, 2, 169.5, 15 and 779825605, 331465, 2, 214.5, 15, etc.). How can I do this, please?
I tried:
df = pd.DataFrame(dict([ (k,pd.Series(v)) for k,v in my_dict.items() ]))
but it gives me the wrong result. Thanks

You can flatten the nested lists, prepending the key k to each inner list by unpacking it with *, then pass the generator to the DataFrame constructor:
df = pd.DataFrame((k, *x) for k,v in my_dict.items() for x in v)
print (df)
0 1 2 3 4
0 779825550 346583 2 305.98 9
1 779825605 276184 2 169.50 15
2 779825605 331465 2 214.50 15
3 779825605 276184 2 169.50 15
4 779825605 331465 2 214.50 15
5 779825605 637210 2 368.50 15
6 779825605 249559 2 133.46 15
7 779825605 591652 2 132.00 15
8 779825605 216367 2 142.50 14
9 779825644 568025 13 494.50 15
10 779825657 75366 18 43.26 9
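Since the constructor above produces default integer column names (0 to 4), you can also pass labels explicitly; a sketch with purely illustrative names (not from the original question):
df = pd.DataFrame(((k, *x) for k, v in my_dict.items() for x in v),
                  columns=['key', 'item', 'qty', 'price', 'n'])  # illustrative names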
Your approach can be fixed by building a DataFrame from each value and combining them with concat:
df = pd.concat(dict((k,pd.DataFrame(v)) for k,v in my_dict.items()))
print (df)
0 1 2 3
779825550 0 346583 2 305.98 9
779825605 0 276184 2 169.50 15
1 331465 2 214.50 15
2 276184 2 169.50 15
3 331465 2 214.50 15
4 637210 2 368.50 15
5 249559 2 133.46 15
6 591652 2 132.00 15
7 216367 2 142.50 14
779825644 0 568025 13 494.50 15
779825657 0 75366 18 43.26 9
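If you then want the concat result in the same flat shape as the first solution, the keys can be moved out of the MultiIndex into a regular column; a sketch (the column name 'key' is illustrative):
df = (pd.concat({k: pd.DataFrame(v) for k, v in my_dict.items()})
        .reset_index(level=0)                 # move the dict key out of the index
        .rename(columns={'level_0': 'key'})   # illustrative column name
        .reset_index(drop=True))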

Moving last two dataframe rows

I'm trying to move the last two rows up:
import pandas as pd
df = pd.DataFrame({
    "A": [1, 2, 3, 4],
    "C": [5, 6, 7, 8],
    "D": [9, 10, 11, 12],
    "E": [13, 14, 15, 16],
})
print(df)
Output:
A C D E
0 1 5 9 13
1 2 6 10 14
2 3 7 11 15
3 4 8 12 16
Desired output:
A C D E
0 3 7 11 15
1 4 8 12 16
2 1 5 9 13
3 2 6 10 14
I was able to move the last row using
df = df.reindex(np.roll(df.index, shift=1))
But I can't get the second-to-last row to move as well. Any advice on the most efficient way to do this without creating a copy of the dataframe?
Using your code, you can just change the roll's shift value.
import pandas as pd
import numpy as np
df = pd.DataFrame({
    "A": [1, 2, 3, 4],
    "C": [5, 6, 7, 8],
    "D": [9, 10, 11, 12],
    "E": [13, 14, 15, 16],
})
df = df.reindex(np.roll(df.index, shift=2), copy=False)
df.reset_index(inplace=True, drop=True)
print(df)
A C D E
0 3 7 11 15
1 4 8 12 16
2 1 5 9 13
3 2 6 10 14
The shift value will change how many rows are affected by the roll, and afterwards we just reset the index of the dataframe so that it goes back to 0,1,2,3.
Based on the comment about wanting to swap indexes 0 and 1 around, we can use an answer in @CatalinaChou's link to do that. I am choosing to do it after the roll so as to only have to contend with indexes 0 and 1 once everything has been shifted.
# continuing from where the previous code block ends
swap_indexes = {1: 0, 0: 1}
df.rename(swap_indexes, inplace=True)
df.sort_index(inplace=True)
print(df)
A C D E
0 4 8 12 16
1 3 7 11 15
2 1 5 9 13
3 2 6 10 14
A notable difference is the use of inplace=True, which prevents method chaining, but it fulfils the goal of not copying the dataframe at all (or as nearly as possible; I'm not sure whether df.reindex makes an internal copy even with copy=False).
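As an aside, the same reordering can be done purely positionally with iloc, which sidesteps the index bookkeeping entirely (though it still produces a new object); a minimal sketch, not from the original answer:
df = df.iloc[np.roll(np.arange(len(df)), 2)].reset_index(drop=True)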

Function in pandas to stack rows into columns by number of rows?

Suppose I have a heterogeneous dataframe:
a b c d
1 1 2 3 4
2 5 6 7 8
3 9 10 11 12
4 13 14 15 16
And I want to stack the rows like so:
a b c d
1 1,5,9,13 2,6,10,14 3,7,11,15 4,8,12,16
Etc...
All the references for groupby etc. seem to require some grouping feature; I just want to put x rows into columns, regardless of their content. Each row has a timestamp, and I am looking to group values by sample count, so I want one row with all the values of x sample rows as columns.
I should end up with a dataframe that has x * the original number of columns, and the original number of rows / x.
I'm sure there must be some simple method I'm missing here without a series of loops etc.
If you need to join all values into strings use:
df1 = df.astype(str).agg(','.join).to_frame().T
print (df1)
a b c d
0 1,5,9,13 2,6,10,14 3,7,11,15 4,8,12,16
Or if you need to create lists use:
df2 = pd.DataFrame([[list(df[x]) for x in df]], columns=df.columns)
print (df2)
a b c d
0 [1, 5, 9, 13] [2, 6, 10, 14] [3, 7, 11, 15] [4, 8, 12, 16]
If you need scalars with a MultiIndex (generated from the index and column labels) use:
df3 = df.unstack().to_frame().T
print (df3)
a b c d
1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4
0 1 5 9 13 2 6 10 14 3 7 11 15 4 8 12 16
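To get the general x-row chunking the question asks for (x * the original number of columns, and the original number of rows / x), a NumPy reshape works; a sketch under the assumption that len(df) is divisible by the illustrative chunk size x (not from the original answer):
import numpy as np

x = 2  # illustrative chunk size
arr = df.to_numpy().reshape(-1, x, df.shape[1])         # (chunks, x, cols)
arr = arr.transpose(0, 2, 1).reshape(len(df) // x, -1)  # keep each column's x samples adjacent
df4 = pd.DataFrame(arr, columns=pd.MultiIndex.from_product([df.columns, range(x)]))
print(df4)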

Find nearest value from multiple columns and add to a new column in Python

I have the following dataframe:
import pandas as pd
import numpy as np
data = {
    "index": [1, 2, 3, 4, 5],
    "A": [11, 17, 5, 9, 10],
    "B": [8, 6, 16, 17, 9],
    "C": [10, 17, 12, 13, 15],
    "target": [12, 13, 8, 6, 12]
}
df = pd.DataFrame.from_dict(data)
print(df)
I would like to find the nearest value to column target among columns A, B and C, and put those values into column result. As far as I know, I need to use the abs() and argmin() functions.
Here is the output I expected:
index A B C target result
0 1 11 8 10 12 11
1 2 17 6 17 13 17
2 3 5 16 12 8 5
3 4 9 17 13 6 9
4 5 10 9 15 12 10
Here is my attempted solution, and links I have found on Stack Overflow which may help:
(df.assign(closest=df.apply(lambda x: x.abs().argmin(), axis='columns'))
.apply(lambda x: x[x['target']], axis='columns'))
Identifying closest value in a column for each filter using Pandas
https://codereview.stackexchange.com/questions/204549/lookup-closest-value-in-pandas-dataframe
Subtract "target" from the other columns, use idxmin to get the column of the minimum difference, followed by a lookup:
idx = df.drop(['index', 'target'], axis=1).sub(df.target, axis=0).abs().idxmin(axis=1)
df['result'] = df.lookup(df.index, idx)
df
index A B C target result
0 1 11 8 10 12 11
1 2 17 6 17 13 17
2 3 5 16 12 8 5
3 4 9 17 13 6 9
4 5 10 9 15 12 10
General solution handling string columns and NaNs (along with your requirement of replacing NaN values in target with value in "v1"):
df2 = df.select_dtypes(include=[np.number])
idx = df2.drop(['index', 'target'], axis=1).sub(df2.target, axis=0).abs().idxmin(axis=1)
df['result'] = df2.lookup(df2.index, idx.fillna('v1'))
You can also index into the underlying NumPy array by getting integer indices using df.columns.get_indexer.
# idx = df[['A', 'B', 'C']].sub(df.target, axis=0).abs().idxmin(1)
idx = df.drop(['index', 'target'], axis=1).sub(df.target, axis=0).abs().idxmin(axis=1)
# df['result'] = df.values[np.arange(len(df)), df.columns.get_indexer(idx)]
df['result'] = df.values[df.index, df.columns.get_indexer(idx)]
df
index A B C target result
0 1 11 8 10 12 11
1 2 17 6 17 13 17
2 3 5 16 12 8 5
3 4 9 17 13 6 9
4 5 10 9 15 12 10
You can use NumPy positional integer indexing with argmin:
col_lst = list('ABC')
col_indices = df[col_lst].sub(df['target'], axis=0).abs().values.argmin(1)
df['result'] = df[col_lst].values[np.arange(len(df.index)), col_indices]
Or you can lookup column labels with idxmin:
col_labels = df[list('ABC')].sub(df['target'], axis=0).abs().idxmin(axis=1)
df['result'] = df.lookup(df.index, col_labels)
print(df)
index A B C target result
0 1 11 8 10 12 11
1 2 17 6 17 13 17
2 3 5 16 12 8 5
3 4 9 17 13 6 9
4 5 10 9 15 12 10
The principle is the same, though for larger dataframes you may find NumPy more efficient:
# Python 3.7, NumPy 1.14.3, Pandas 0.23.0
def np_lookup(df):
    col_indices = df[list('ABC')].sub(df['target'], axis=0).abs().values.argmin(1)
    df['result'] = df[list('ABC')].values[np.arange(len(df.index)), col_indices]
    return df

def pd_lookup(df):
    col_labels = df[list('ABC')].sub(df['target'], axis=0).abs().idxmin(axis=1)
    df['result'] = df.lookup(df.index, col_labels)
    return df
df = pd.concat([df]*10**4, ignore_index=True)
assert df.pipe(pd_lookup).equals(df.pipe(np_lookup))
%timeit df.pipe(np_lookup) # 7.09 ms
%timeit df.pipe(pd_lookup) # 67.8 ms
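Note that DataFrame.lookup, used above, was deprecated in pandas 1.2 and removed in 2.0, so on current pandas the NumPy route is the one that still works. A sketch of an equivalent lookup, assuming the numeric columns A, B, C from the example:
sub = df[list('ABC')]
col_labels = sub.sub(df['target'], axis=0).abs().idxmin(axis=1)
df['result'] = sub.to_numpy()[np.arange(len(df)), sub.columns.get_indexer(col_labels)]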

How to replace elements inside the list in series

I have a DataFrame like below,
df1
col1
0 10
1 [5, 8, 11]
2 15
3 12
4 13
5 33
6 [12, 19]
Code to generate this df1:
df1 = pd.DataFrame({"col1":[10,[5,8,11],15,12,13,33,[12,19]]})
df2
col1 col2
0 12 1
1 10 2
2 5 3
3 11 10
4 7 5
5 13 4
6 8 7
Code to generate this df2:
df2 = pd.DataFrame({"col1":[12,10,5,11,7,13,8],"col2":[1,2,3,10,5,4,7]})
I want to replace the elements in df1 with the corresponding df2 values.
If the series contained only non-list elements, I could simply replace with map:
df1['res'] = df1['col1'].map(df2.set_index('col1')["col2"].to_dict())
But this series contains a mix of lists and scalars.
How can I replace both the list elements and the scalar values in the series in an effective way?
Expected Output
col1 res
0 10 2
1 [5, 8, 11] [3, 7, 10]
2 15 15
3 12 1
4 13 4
5 33 33
6 [12, 19] [1, 19]
Your series is of dtype object, as it contains int and list objects. This is inefficient for Pandas and means a vectorised solution won't be possible.
You can create a mapping dictionary and use pd.Series.apply. To account for list objects, you can catch TypeError. You meet this specific error for lists since they are not hashable, and therefore cannot be used as dictionary keys.
d = df2.set_index('col1')['col2'].to_dict()
def mapvals(x):
    try:
        return d.get(x, x)
    except TypeError:
        return [d.get(i, i) for i in x]
df1['res'] = df1['col1'].apply(mapvals)
print(df1)
col1 res
0 10 2
1 [5, 8, 11] [3, 7, 10]
2 15 15
3 12 1
4 13 4
5 33 33
6 [12, 19] [1, 19]
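An alternative that pushes more of the work into pandas is to explode the lists, map once, and regroup; a sketch assuming pandas >= 0.25 for Series.explode (the regrouping step is my addition, not from the original answer):
s = df1['col1'].explode()                     # one row per list element; scalars pass through
s = s.map(lambda v: d.get(v, v))              # map via the same dict, keep unmatched values
out = s.groupby(level=0).agg(list)            # rebuild one list per original row
is_list = df1['col1'].map(lambda v: isinstance(v, list))
df1['res'] = out.where(is_list, out.str[0])   # unwrap rows that were scalars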

How to select ranges of values in pandas?

Newbie question.
My dataframe looks like this:
class A B
0 1 3.767809 11.016
1 1 2.808231 4.500
2 1 4.822522 1.008
3 2 5.016933 -3.636
4 2 6.036203 -5.220
5 2 7.234567 -6.696
6 2 5.855065 -7.272
7 4 4.116770 -8.208
8 4 2.628000 -10.296
9 4 1.539184 -10.728
10 3 0.875918 -10.116
11 3 0.569210 -9.072
12 3 0.676379 -7.632
13 3 0.933921 -5.436
14 3 0.113842 -3.276
15 3 0.367129 -2.196
16 1 0.968661 -1.980
17 1 0.160997 -2.736
18 1 0.469383 -2.232
19 1 0.410463 -2.340
20 1 0.660872 -2.484
I would like to get groups where class is the same, like:
class 1: rows 0..2
class 2: rows 3..6
class 4: rows 7..9
class 3: rows 10..15
class 1: rows 16..20
The reason is that order matters. My requirements say that class 4 can only appear between classes 1 and 2, so if, after prediction, class 4 appears after a 2, it should be treated as class 2.
Build a new column to identify each consecutive group:
df['group'] = df['class'].diff().ne(0).cumsum()
df.groupby('group')['group'].apply(lambda x: x.index)
Out[106]:
group
1 Int64Index([0, 1, 2], dtype='int64')
2 Int64Index([3, 4, 5, 6], dtype='int64')
3 Int64Index([7, 8, 9], dtype='int64')
4 Int64Index([10, 11, 12, 13, 14, 15], dtype='in...
5 Int64Index([16, 17, 18, 19, 20], dtype='int64')
Name: group, dtype: object
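With the group column in place, the ordering rule from the question (a 4 appearing directly after a 2 should become a 2) can then be applied per group; a hypothetical sketch, not from the original answer:
first = df.groupby('group')['class'].first()  # class of each consecutive group, in order
prev = first.shift()                          # class of the preceding group
bad = first.eq(4) & prev.eq(2)                # a class-4 group directly after a 2
df['class'] = df['group'].map(first.mask(bad, 2))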
