I have two DataFrames in Python, but the columns to be used as indexes (CodeNumber) are not in the same order. I need to order them identically. Here is the code:
#generating DataFrames:
d3 = {'CodeNumber': [1234, 1235, 111, 101], 'Date': [20150808, 20141201, 20180119, 20120720], 'Weight': [26, 32, 41, 24]}
d4 = {'CodeNumber': [1235, 1234, 101, 111], 'Date': [20160808, 20151201, 20180219, 20130720], 'Weight': [28, 25, 47, 3]}
data_SKU3 = pd.DataFrame(data=d3)
data_SKU4 = pd.DataFrame(data=d4)
Then I set CodeNumber as the index:
data_SKU3.set_index('CodeNumber', inplace=True)
data_SKU4.set_index('CodeNumber', inplace=True)
If we print the resulting DataFrames, note that data_SKU3 has CodeNumber in the order 1234, 1235, 111, 101, while data_SKU4 has 1235, 1234, 101, 111.
Is there a way to order the Code Numbers so both DataFrames would be in the same order?
You can sort each DataFrame by CodeNumber by calling .sort_values(by='CodeNumber') before setting it as the index:
d3 = {'CodeNumber': [1234, 1235, 111, 101], 'Date': [20150808, 20141201, 20180119, 20120720], 'Weight': [26, 32, 41, 24]}
d4 = {'CodeNumber': [1235, 1234, 101, 111], 'Date': [20160808, 20151201, 20180219, 20130720], 'Weight': [28, 25, 47, 3]}
data_SKU3 = pd.DataFrame(data=d3).sort_values(by = 'CodeNumber')
data_SKU4 = pd.DataFrame(data=d4).sort_values(by = 'CodeNumber')
data_SKU3.set_index('CodeNumber', inplace = True)
data_SKU4.set_index('CodeNumber', inplace = True)
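To double-check that the two frames end up aligned, you can compare the resulting indices; a quick sketch using the data above:

```python
import pandas as pd

d3 = {'CodeNumber': [1234, 1235, 111, 101], 'Date': [20150808, 20141201, 20180119, 20120720], 'Weight': [26, 32, 41, 24]}
d4 = {'CodeNumber': [1235, 1234, 101, 111], 'Date': [20160808, 20151201, 20180219, 20130720], 'Weight': [28, 25, 47, 3]}

data_SKU3 = pd.DataFrame(d3).sort_values(by='CodeNumber').set_index('CodeNumber')
data_SKU4 = pd.DataFrame(d4).sort_values(by='CodeNumber').set_index('CodeNumber')

# Both indices now come out in the same ascending order
print(data_SKU3.index.equals(data_SKU4.index))  # True
```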
Use sort_index if both indices contain the same values:
data_SKU3 = data_SKU3.set_index('CodeNumber').sort_index()
data_SKU4 = data_SKU4.set_index('CodeNumber').sort_index()
print (data_SKU3)
Date Weight
CodeNumber
101 20120720 24
111 20180119 41
1234 20150808 26
1235 20141201 32
print (data_SKU4)
Date Weight
CodeNumber
101 20180219 47
111 20130720 3
1234 20151201 25
1235 20160808 28
Another approach is to use reindex with the other DataFrame's index values; this requires the index values to be unique, and the only difference between the two indices must be the ordering:
data_SKU3 = data_SKU3.set_index('CodeNumber')
data_SKU4 = data_SKU4.set_index('CodeNumber').reindex(index=data_SKU3.index)
print (data_SKU3)
Date Weight
CodeNumber
1234 20150808 26
1235 20141201 32
111 20180119 41
101 20120720 24
print (data_SKU4)
Date Weight
CodeNumber
1234 20151201 25
1235 20160808 28
111 20130720 3
101 20180219 47
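If the two indices might not contain exactly the same values, DataFrame.align is a safer variant: it reindexes both frames to a common index in one call. A minimal sketch with the same data:

```python
import pandas as pd

data_SKU3 = pd.DataFrame(
    {'Date': [20150808, 20141201, 20180119, 20120720], 'Weight': [26, 32, 41, 24]},
    index=pd.Index([1234, 1235, 111, 101], name='CodeNumber'))
data_SKU4 = pd.DataFrame(
    {'Date': [20160808, 20151201, 20180219, 20130720], 'Weight': [28, 25, 47, 3]},
    index=pd.Index([1235, 1234, 101, 111], name='CodeNumber'))

# align reindexes both frames to the (sorted) union of the two indices
aligned3, aligned4 = data_SKU3.align(data_SKU4, axis=0)
print(aligned3.index.tolist())  # [101, 111, 1234, 1235]
```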
I have a dataframe that is similar to:
I would like to calculate the median age for each city but given that it is a frequency table I'm finding it somewhat tricky. Is there any function in pandas or other that would help me achieve this?
Maybe this works for you:
import numpy as np
import pandas as pd
# create dataframe
df = pd.DataFrame(
    [
        ['Alabama', 34, 67, 89, 89, 67, 545, 4546, 3, 23],
        ['Georgia', 345, 65, 67, 32, 23, 567, 87, 647, 68]
    ],
    columns=['City', 0, 1, 2, 3, 4, 5, 6, 7, 8]
).set_index('City')
print(df)
# calculate median for freq table
m = list()  # median list
for index, row in df.iterrows():
    v = list()  # value list
    z = zip(row.index, row.values)
    for item in z:
        for f in range(item[1]):
            v.append(item[0])
    m.append(np.median(v))
df_m = pd.DataFrame({'City': df.index, 'Median': m})
print(df_m)
Input:
0 1 2 3 4 5 6 7 8
City
Alabama 34 67 89 89 67 545 4546 3 23
Georgia 345 65 67 32 23 567 87 647 68
Output:
City Median
0 Alabama 6.0
1 Georgia 5.0
For each row, find the total number of people. Take that number, divide it by 2, and walk through the age groups accumulating counts until you reach that position.
For example, for the row 'Alabama', you would add 34 + 67 + ... + 23 = 5463. That, divided by 2, would be 2731.5 ==> 2731. Then, checking each age group (the column labels run from 0 to 8), determine where the 2731st person would be.
At age 0, since 2731 > 34, check the next.
At age 1, since 2731 > 34 + 67, check the next.
At age 2, since 2731 > 34 + 67 + 89, check the next.
At age 3, since 2731 > 34 + 67 + 89 + 89, check the next.
At age 4, since 2731 > 34 + 67 + 89 + 89 + 67, check the next.
At age 5, since 2731 > 34 + 67 + 89 + 89 + 67 + 545, check the next.
At age 6, since 2731 < 34 + 67 + 89 + 89 + 67 + 545 + 4546, the median has to be in this age group, which matches the 6.0 computed above.
Do this repeatedly for each city/state, and you should get the median for each one.
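The appending loop above can also be collapsed by letting np.repeat expand each frequency row into raw ages; a sketch on the same data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [
        ['Alabama', 34, 67, 89, 89, 67, 545, 4546, 3, 23],
        ['Georgia', 345, 65, 67, 32, 23, 567, 87, 647, 68]
    ],
    columns=['City', 0, 1, 2, 3, 4, 5, 6, 7, 8]
).set_index('City')

# np.repeat turns a frequency row into the raw list of ages,
# e.g. [0]*34 + [1]*67 + ..., then np.median does the rest
medians = df.apply(lambda row: np.median(np.repeat(row.index, row.values)), axis=1)
print(medians)  # Alabama 6.0, Georgia 5.0
```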
Suppose we have a dataframe like:
names = ['bob', 'alice', 'bob', 'bob', 'ali', 'alice', 'bob', 'ali', 'moji', 'ali', 'moji', 'ali', 'bob', 'bob', 'bob']
times = [21 , 34, 37, 40, 55, 65, 67, 84, 88, 90 , 91, 97, 104,105, 108]
df = pd.DataFrame({'name' : names , 'time_of_action' : times})
We would like to group it by name and, within each group, measure times from 0. That is, in each group we want to subtract the minimum time_of_action of that group from all times in that group. How could we do this systematically with pandas?
If I am correct then you want this:
df['new time'] = df['time_of_action']-df.groupby('name')['time_of_action'].transform('min')
df:
name time_of_action new time
0 bob 21 0
1 alice 34 0
2 bob 37 16
3 bob 40 19
4 ali 55 0
5 alice 65 31
6 bob 67 46
7 ali 84 29
8 moji 88 0
9 ali 90 35
10 moji 91 3
11 ali 97 42
12 bob 104 83
13 bob 105 84
14 bob 108 87
Try this
df['new_time'] = df.groupby('name')['time_of_action'].apply(lambda x: x - x.min())
df
Output:
name time_of_action new_time
0 bob 21 0
1 alice 34 0
2 bob 37 16
3 bob 40 19
4 ali 55 0
5 alice 65 31
Others have already answered but here's mine
import pandas as pd
names = ['bob', 'alice', 'bob', 'bob', 'ali', 'alice', 'bob', 'ali', 'moji', 'ali', 'moji', 'ali', 'bob', 'bob', 'bob']
times = [21 , 34, 37, 40, 55, 65, 67, 84, 88, 90 , 91, 97, 104,105, 108]
df = pd.DataFrame({'name' : names , 'time_of_action' : times})
def subtract_min(df):
    df['new_time'] = df['time_of_action'] - df['time_of_action'].min()
    return df

df.groupby('name').apply(subtract_min).sort_values('name')
As others have said I am kind of guessing as well
Here is my pandas dataframe; I would like to flatten it. How can I do that?
The input I have
key column
1 {'health_1': 45, 'health_2': 60, 'health_3': 34, 'health_4': 60, 'name': 'Tom'}
2 {'health_1': 28, 'health_2': 10, 'health_3': 42, 'health_4': 07, 'name': 'John'}
3 {'health_1': 86, 'health_2': 65, 'health_3': 14, 'health_4': 52, 'name': 'Adam'}
The expected output
All the health and name will become a column name of their own with their corresponding values. In no particular order.
health_1 health_2 health_3 health_4 name key
45 60 34 60 Tom 1
28 10 42 07 John 2
86 65 14 52 Adam 3
You can do it with a one-line solution:
df_expected = pd.concat([df, df['column'].apply(pd.Series)], axis = 1).drop('column', axis = 1)
Full version:
import pandas as pd
df = pd.DataFrame({"column":[
{'health_1': 45, 'health_2': 60, 'health_3': 34, 'health_4': 60, 'name': 'Tom'} ,
{'health_1': 28, 'health_2': 10, 'health_3': 42, 'health_4': 7, 'name': 'John'} ,
{'health_1': 86, 'health_2': 65, 'health_3': 14, 'health_4': 52, 'name': 'Adam'}
]})
df_expected = pd.concat([df, df['column'].apply(pd.Series)], axis = 1).drop('column', axis = 1)
print(df_expected)
DEMO: https://repl.it/repls/ButteryFrightenedFtpclient
This should work:
df['column'].apply(pd.Series)
Gives:
health_1 health_2 health_3 health_4 name
0 45 60 34 60 Tom
1 28 10 42 7 John
2 86 65 14 52 Adam
Try:
pd.concat([pd.DataFrame(i, index=[0]) for i in df.column], ignore_index=True)
Output:
health_1 health_2 health_3 health_4 name
0 45 60 34 60 Tom
1 28 10 42 7 John
2 86 65 14 52 Adam
The solutions using apply are going overboard. You can create your desired DataFrame using a list of dictionaries like you have in your column Series. You can easily get this list of dictionaries by using the tolist method:
res = pd.concat([df.key, pd.DataFrame(df.column.tolist())], axis=1)
print(res)
key health_1 health_2 health_3 health_4 name
0 1 45 60 34 60 Tom
1 2 28 10 42 7 John
2 3 86 65 14 52 Adam
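If your pandas is recent enough, json_normalize is another alternative to apply(pd.Series) for flattening a dict column (it was promoted to the top-level pd.json_normalize in pandas 1.0; older versions have it under pd.io.json). A sketch with the same data:

```python
import pandas as pd

df = pd.DataFrame({
    'key': [1, 2, 3],
    'column': [
        {'health_1': 45, 'health_2': 60, 'health_3': 34, 'health_4': 60, 'name': 'Tom'},
        {'health_1': 28, 'health_2': 10, 'health_3': 42, 'health_4': 7, 'name': 'John'},
        {'health_1': 86, 'health_2': 65, 'health_3': 14, 'health_4': 52, 'name': 'Adam'},
    ],
})

# json_normalize expands the dicts into columns; concat keeps 'key' alongside
flat = pd.concat([df[['key']], pd.json_normalize(df['column'])], axis=1)
print(flat)
```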
Not sure I understand - This is the default format for a DataFrame?
import pandas as pd
df = pd.DataFrame([
{'health_1': 45, 'health_2': 60, 'health_3': 34, 'health_4': 60, 'name': 'Tom'} ,
{'health_1': 28, 'health_2': 10, 'health_3': 42, 'health_4': 7, 'name': 'John'} ,
{'health_1': 86, 'health_2': 65, 'health_3': 14, 'health_4': 52, 'name': 'Adam'}
])
I would like to take every 3 consecutive values from a column (a sliding window).
For example:
Input
12
73
56
33
16
output
12
73
56
------
73
56
33
-----
56
33
16
I have tried to add a key column and group by it, but my data frame is too large to perform the grouping. Here is my attempt:
df.groupby('key').agg(lambda x: x.tolist())
If you work with a plain list, you can do it like this:
lst = [12, 73, 56, 33, 16]
slide_size = 3
result = []
for i in range(0, len(lst) - slide_size + 1):
    result.append(lst[i:i + slide_size])
result
# output : [[12, 73, 56], [73, 56, 33], [56, 33, 16]]
After this, you can transform the list into a DataFrame.
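For large data, the Python loop can be avoided entirely with NumPy's sliding_window_view (available since NumPy 1.20); a sketch:

```python
import numpy as np
import pandas as pd

s = pd.Series([12, 73, 56, 33, 16])
window = 3

# Each row of the result is one window of 3 consecutive values,
# built as a view over the original array (no copying per window)
windows = np.lib.stride_tricks.sliding_window_view(s.to_numpy(), window)
out = pd.DataFrame(windows)
print(windows.tolist())  # [[12, 73, 56], [73, 56, 33], [56, 33, 16]]
```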
I have this pivot table:
[in]:unit_d
[out]:
units
store_nbr item_nbr
1 9 27396
28 4893
40 254
47 2409
51 925
89 157
93 1103
99 492
2 5 55104
11 655
44 117125
85 106
93 653
I want to have a dictionary with 'store_nbr' as the key and 'item_nbr' as the values.
So, {'1': [9, 28, 40,...,99], '2': [5, 11 ,44, 85, 93], ...}
I'd use groupby here, after resetting the index to make it into columns:
>>> d = unit_d.reset_index()
>>> {k: v.tolist() for k, v in d.groupby("store_nbr")["item_nbr"]}
{1: [9, 28, 40, 47, 51, 89, 93, 99], 2: [5, 11, 44, 85, 93]}
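The dict comprehension can also be written as a single chained expression with apply(list) and to_dict(); a sketch on a small stand-in for unit_d.reset_index() (column names taken from the question):

```python
import pandas as pd

# Stand-in for unit_d.reset_index()
d = pd.DataFrame({
    'store_nbr': [1, 1, 1, 2, 2],
    'item_nbr': [9, 28, 40, 5, 11],
})

# Collect each group's item numbers into a list, then dump to a dict
result = d.groupby('store_nbr')['item_nbr'].apply(list).to_dict()
print(result)  # {1: [9, 28, 40], 2: [5, 11]}
```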