Here is my pandas DataFrame, which I would like to flatten. How can I do that?
The input I have:
key column
1 {'health_1': 45, 'health_2': 60, 'health_3': 34, 'health_4': 60, 'name': 'Tom'}
2 {'health_1': 28, 'health_2': 10, 'health_3': 42, 'health_4': 7, 'name': 'John'}
3 {'health_1': 86, 'health_2': 65, 'health_3': 14, 'health_4': 52, 'name': 'Adam'}
The expected output:
Each health_* key and name should become a column of its own, holding the corresponding values. Column order doesn't matter.
health_1 health_2 health_3 health_4 name key
45 60 34 60 Tom 1
28 10 42 7 John 2
86 65 14 52 Adam 3
You can do it with a one-line solution:
df_expected = pd.concat([df, df['column'].apply(pd.Series)], axis = 1).drop('column', axis = 1)
Full version:
import pandas as pd
df = pd.DataFrame({"column":[
{'health_1': 45, 'health_2': 60, 'health_3': 34, 'health_4': 60, 'name': 'Tom'} ,
{'health_1': 28, 'health_2': 10, 'health_3': 42, 'health_4': 7, 'name': 'John'} ,
{'health_1': 86, 'health_2': 65, 'health_3': 14, 'health_4': 52, 'name': 'Adam'}
]})
df_expected = pd.concat([df, df['column'].apply(pd.Series)], axis = 1).drop('column', axis = 1)
print(df_expected)
DEMO: https://repl.it/repls/ButteryFrightenedFtpclient
This should work:
df['column'].apply(pd.Series)
Gives:
health_1 health_2 health_3 health_4 name
0 45 60 34 60 Tom
1 28 10 42 7 John
2 86 65 14 52 Adam
Try:
pd.concat([pd.DataFrame(i, index=[0]) for i in df.column], ignore_index=True)
Output:
health_1 health_2 health_3 health_4 name
0 45 60 34 60 Tom
1 28 10 42 7 John
2 86 65 14 52 Adam
The solutions using apply are going overboard. You can build your desired DataFrame directly from a list of dictionaries like the ones in your column Series, and you can get that list easily with the tolist method:
res = pd.concat([df.key, pd.DataFrame(df.column.tolist())], axis=1)
print(res)
key health_1 health_2 health_3 health_4 name
0 1 45 60 34 60 Tom
1 2 28 10 42 7 John
2 3 86 65 14 52 Adam
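For completeness, pandas 1.0+ also ships pd.json_normalize, which builds the same flat frame from the list of dicts and additionally flattens nested dicts (a sketch, assuming the question's frame with its key and column columns):
# json_normalize is a top-level function since pandas 1.0
flat = pd.json_normalize(df['column'].tolist())
res = pd.concat([df.drop(columns='column'), flat], axis=1)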
Not sure I understand - this is the default format for a DataFrame?
import pandas as pd
df = pd.DataFrame([
{'health_1': 45, 'health_2': 60, 'health_3': 34, 'health_4': 60, 'name': 'Tom'} ,
{'health_1': 28, 'health_2': 10, 'health_3': 42, 'health_4': 7, 'name': 'John'} ,
{'health_1': 86, 'health_2': 65, 'health_3': 14, 'health_4': 52, 'name': 'Adam'}
])
Suppose I have the following DataFrame:
df = pd.DataFrame({'id': [2, 4, 10, 12, 13, 14, 19, 20, 21, 22, 24, 25, 27, 29, 30, 31, 42, 50, 54],
'value': [37410.0, 18400.0, 200000.0, 392000.0, 108000.0, 423000.0, 80000.0, 307950.0,
50807.0, 201740.0, 182700.0, 131300.0, 282005.0, 428800.0, 56000.0, 412400.0, 1091595.0, 1237200.0,
927500.0]})
And I do the following:
df.sort_values(by='id').set_index('id').cumsum()
value
id
2 37410.0
4 55810.0
10 255810.0
12 647810.0
13 755810.0
14 1178810.0
19 1258810.0
20 1566760.0
21 1617567.0
22 1819307.0
24 2002007.0
25 2133307.0
27 2415312.0
29 2844112.0
30 2900112.0
31 3312512.0
42 4404107.0
50 5641307.0
54 6568807.0
I want to know the first element of id whose cumulative sum exceeds 25% of the total. In this example, 25% of the total cumsum is 1,642,201.75, and the first element to exceed it is id 22. I know it can be done with a for loop, but I think that would be pretty inefficient.
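For reference, this is roughly the loop I have in mind and would like to avoid (a sketch that reuses the cumulative sum from above):
# Naive linear scan over the running totals; O(N) per query.
cumsum = df.sort_values(by='id').set_index('id')['value'].cumsum()
threshold = cumsum.iloc[-1] * 0.25
for idx, total in cumsum.items():
    if total > threshold:
        first_id = idx  # first id whose running total exceeds 25% of the grand total
        break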
You could do:
percentile_25 = df['value'].sum() * 0.25
res = df[df['value'].cumsum() > percentile_25].head(1)
print(res)
Output
id value
9 22 201740.0
Or use searchsorted to do the search in O(log N):
percentile_25 = df['value'].sum() * 0.25
i = df['value'].cumsum().searchsorted(percentile_25)
res = df.iloc[i]
print(res)
Output
id 22.0
value 201740.0
Name: 9, dtype: float64
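Note that searchsorted returns the position at which percentile_25 would be inserted to keep the cumulative sums sorted, so with the default side='left' it points at the first running total greater than or equal to the threshold; pass side='right' if you need strictly greater when the threshold can exactly equal a cumulative sum. Computing the cumulative sum itself is still O(N); only the search is O(log N).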
Suppose we have a DataFrame like
names = ['bob', 'alice', 'bob', 'bob', 'ali', 'alice', 'bob', 'ali', 'moji', 'ali', 'moji', 'ali', 'bob', 'bob', 'bob']
times = [21, 34, 37, 40, 55, 65, 67, 84, 88, 90, 91, 97, 104, 105, 108]
df = pd.DataFrame({'name': names, 'time_of_action': times})
We would like to group it by name and, within each group, measure times from 0. That is, in each group we want to subtract the minimum time_of_action of that group from all times in that group. How could we do this systematically with pandas?
If I understand correctly, you want this:
df['new time'] = df['time_of_action'] - df.groupby('name')['time_of_action'].transform('min')
df:
name time_of_action new time
0 bob 21 0
1 alice 34 0
2 bob 37 16
3 bob 40 19
4 ali 55 0
5 alice 65 31
6 bob 67 46
7 ali 84 29
8 moji 88 0
9 ali 90 35
10 moji 91 3
11 ali 97 42
12 bob 104 83
13 bob 105 84
14 bob 108 87
Try this:
df['new_time'] = df.groupby('name')['time_of_action'].apply(lambda x: x - x.min())
df
Output:
name time_of_action new_time
0 bob 21 0
1 alice 34 0
2 bob 37 16
3 bob 40 19
4 ali 55 0
5 alice 65 31
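One caveat: on recent pandas versions, groupby(...).apply on a single column can come back indexed by the group keys, which won't align for assignment back into df. A minimal alternative using transform, which preserves the original row index:
# transform keeps the original index, so the result aligns for assignment
df['new_time'] = df.groupby('name')['time_of_action'].transform(lambda x: x - x.min())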
Others have already answered, but here's mine:
import pandas as pd
names = ['bob', 'alice', 'bob', 'bob', 'ali', 'alice', 'bob', 'ali', 'moji', 'ali', 'moji', 'ali', 'bob', 'bob', 'bob']
times = [21, 34, 37, 40, 55, 65, 67, 84, 88, 90, 91, 97, 104, 105, 108]
df = pd.DataFrame({'name': names, 'time_of_action': times})

def subtract_min(df):
    # shift each group's times so the earliest one becomes 0
    df['new_time'] = df['time_of_action'] - df['time_of_action'].min()
    return df

df.groupby('name').apply(subtract_min).sort_values('name')
As others have said, I am kind of guessing as well.
I have two DataFrames in Python, but the column to be used as the index (CodeNumber) is not in the same order in the two frames. They need to be put in the same order; the code follows:
#generating DataFrames:
d3 = {'CodeNumber': [1234, 1235, 111, 101], 'Date': [20150808, 20141201, 20180119, 20120720], 'Weight': [26, 32, 41, 24]}
d4 = {'CodeNumber': [1235, 1234, 101, 111], 'Date': [20160808, 20151201, 20180219, 20130720], 'Weight': [28, 25, 47, 3]}
data_SKU3 = pd.DataFrame(data=d3)
data_SKU4 = pd.DataFrame(data=d4)
Then I set CodeNumber as the index:
data_SKU3.set_index('CodeNumber', inplace=True)
data_SKU4.set_index('CodeNumber', inplace=True)
If we print the resulting DataFrames, note that data_SKU3 has the CodeNumber order 1234, 1235, 111, 101, while data_SKU4 has 1235, 1234, 101, 111.
Is there a way to order the CodeNumbers so both DataFrames end up in the same order?
You can also sort each DataFrame by CodeNumber by calling .sort_values(by='CodeNumber') before setting it as the index:
d3 = {'CodeNumber': [1234, 1235, 111, 101], 'Date': [20150808, 20141201, 20180119, 20120720], 'Weight': [26, 32, 41, 24]}
d4 = {'CodeNumber': [1235, 1234, 101, 111], 'Date': [20160808, 20151201, 20180219, 20130720], 'Weight': [28, 25, 47, 3]}
data_SKU3 = pd.DataFrame(data=d3).sort_values(by='CodeNumber')
data_SKU4 = pd.DataFrame(data=d4).sort_values(by='CodeNumber')
data_SKU3.set_index('CodeNumber', inplace=True)
data_SKU4.set_index('CodeNumber', inplace=True)
Use sort_index if both indices contain the same set of values:
data_SKU3 = data_SKU3.set_index('CodeNumber').sort_index()
data_SKU4 = data_SKU4.set_index('CodeNumber').sort_index()
print (data_SKU3)
Date Weight
CodeNumber
101 20120720 24
111 20180119 41
1234 20150808 26
1235 20141201 32
print (data_SKU4)
Date Weight
CodeNumber
101 20180219 47
111 20130720 3
1234 20151201 25
1235 20160808 28
Another approach is to reindex one DataFrame by the other's index values, but this requires unique index values and works only if the two indices differ purely in ordering:
data_SKU3 = data_SKU3.set_index('CodeNumber')
data_SKU4 = data_SKU4.set_index('CodeNumber').reindex(index=data_SKU3.index)
print (data_SKU3)
Date Weight
CodeNumber
1234 20150808 26
1235 20141201 32
111 20180119 41
101 20120720 24
print (data_SKU4)
Date Weight
CodeNumber
1234 20151201 25
1235 20160808 28
111 20130720 3
101 20180219 47
I have this pivot table:
[in]:unit_d
[out]:
units
store_nbr item_nbr
1 9 27396
28 4893
40 254
47 2409
51 925
89 157
93 1103
99 492
2 5 55104
11 655
44 117125
85 106
93 653
I want to have a dictionary with 'store_nbr' as the key and 'item_nbr' as the values.
So, {'1': [9, 28, 40,...,99], '2': [5, 11 ,44, 85, 93], ...}
I'd use groupby here, after resetting the index to make it into columns:
>>> d = unit_d.reset_index()
>>> {k: v.tolist() for k, v in d.groupby("store_nbr")["item_nbr"]}
{1: [9, 28, 40, 47, 51, 89, 93, 99], 2: [5, 11, 44, 85, 93]}
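Equivalently, if you prefer a single expression, the same dictionary can be built with apply(list) on the grouped column (a sketch reusing the reset-index frame d from above):
>>> d.groupby("store_nbr")["item_nbr"].apply(list).to_dict()
{1: [9, 28, 40, 47, 51, 89, 93, 99], 2: [5, 11, 44, 85, 93]}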
I have a large CSV file with columns that encode the name and index of the arrays below, e.g.:
time, dataset1[0], dataset1[1], dataset1[2], dataset2[0], dataset2[1], dataset2[2]\n
0, 43, 35, 29, 21, 59, 39\n
1, 21, 59, 39, 43, 35, 29\n
You get the idea (obviously there is far more data in the arrays).
Any ideas how I can easily parse/strip this into an efficient DataFrame?
[EDIT]
Ideally I'm after a structure like this:
time dataset1 dataset2
0 0 [43,35,29] [21,59,39]
1 1 [21,59,39] [43,35,29]
where the indices have been stripped from the labels and turned into NumPy array indices.
from pandas import read_csv
df = read_csv('data.csv')
print(df)
Gives as output:
time dataset1[0] dataset1[1] dataset1[2] dataset2[0] dataset2[1] \
0 0 43 35 29 21 59
1 1 21 59 39 43 35
dataset2[2]
0 39
1 29
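That only loads the flat columns, though. To get from there to the structure in the edit, one approach (a sketch, assuming every array-style label has exactly the form name[index]) is to group the columns by their base name and collect each group row-wise into lists:
import re
import pandas as pd

# skipinitialspace strips the blanks after the commas in the header
df = pd.read_csv('data.csv', skipinitialspace=True)

# Bucket the array-style columns by base name, e.g. 'dataset1[0]' -> 'dataset1'
groups = {}
for col in df.columns:
    m = re.match(r'(\w+)\[(\d+)\]$', col)
    if m:
        groups.setdefault(m.group(1), []).append(col)

# Build one list-valued column per base name, ordered by the bracketed index
result = df[['time']].copy()
for name, cols in groups.items():
    cols = sorted(cols, key=lambda c: int(re.search(r'\[(\d+)\]', c).group(1)))
    result[name] = df[cols].to_numpy().tolist()

print(result)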