I have two data frames grouped by four separate keys. I would like to assign the mean of a column in one group to all the row values of a column in the corresponding group of the other data frame. As I understand it, this is how it should be done:
g_test.get_group((1, 5, 13, 8)).monthly_sales = \
g_train.get_group((1, 5, 13, 8)).monthly_sales.mean()
Except this does nothing. The values in monthly_sales of the group identified in g_test are unchanged. Can someone please explain what I am doing wrong and suggest alternatives?
These are the first few rows of g_train.get_group((1, 5, 13, 8))
year month day store item units monthly_sales
1 5 5 13 8 4 466
1 5 6 13 8 12 475
1 5 0 13 8 22 469
1 5 5 13 8 26 469
1 5 6 13 8 39 480
and these are the first few rows of g_test.get_group((1, 5, 13, 8))
year month day store item monthly_sales
1 5 1 13 8 0
1 5 2 13 8 0
1 5 3 13 8 0
1 5 4 13 8 0
1 5 5 13 8 0
Only the first few rows are shown, but the mean of g_train.get_group((1, 5, 13, 8)).monthly_sales is 450, which I want copied over to the monthly_sales column in g_test.
Edit:
I now understand that the code snippet below will work:
df1.loc[(df1.year == 1)
        & (df1.month == 5)
        & (df1.store == 13)
        & (df1.item == 8), 'monthly_sales'] = \
    gb2.get_group((1, 5, 13, 8)).monthly_sales.mean()
This operation is great for copying the mean once; however, the whole reason I split the data frame into groups was to avoid these logic checks and to do this multiple times for different store and item numbers. Is there something else I can do?
You need to assign the result back to the DataFrame, not the groupby object. This should work:
df1.loc[(df1.year == 1)
        & (df1.month == 5)
        & (df1.store == 13)
        & (df1.item == 8), 'monthly_sales'] = \
    gb2.get_group((1, 5, 13, 8)).monthly_sales.mean()
>>> gb1.get_group((1, 5, 13, 8))
year month day store item units monthly_sales
0 1 5 5 13 8 4 471.8
1 1 5 6 13 8 12 471.8
2 1 5 0 13 8 22 471.8
3 1 5 5 13 8 26 471.8
4 1 5 6 13 8 39 471.8
Actually, I just discovered a better way. g_test is part of the dataframe 'test', so when I tried the line below it worked perfectly:
test.loc[g_test.get_group((1, 5, 13, 8)).index, 'monthly_sales'] = \
g_train.get_group((1, 5, 13, 8)).monthly_sales.mean()
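If the same assignment has to happen for every (year, month, store, item) combination, a merge on the group keys avoids calling get_group once per key. This is only a sketch: it assumes the underlying frames are named train and test (only test is named above, so the train name is my assumption):

keys = ['year', 'month', 'store', 'item']
# compute each group's mean once in the training frame...
group_means = train.groupby(keys, as_index=False)['monthly_sales'].mean()
# ...then broadcast it onto the matching rows of the test frame
test = test.drop(columns='monthly_sales').merge(group_means, on=keys, how='left')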
Related
I have the following dataframe:
import pandas as pd
data = {'id': [542588, 542594, 542594, 542605, 542605, 542605, 542630, 542630],
'label': [3, 3, 1, 1, 2, 0, 0, 2]}
df = pd.DataFrame(data)
df
id label
0 542588 3
1 542594 3
2 542594 1
3 542605 1
4 542605 2
5 542605 0
6 542630 0
7 542630 2
The id column contains large 6-digit integers. I want a way to simplify it, starting from 10, so that 542588 becomes 10, 542594 becomes 11, and so on.
Required output:
id label
0 10 3
1 11 3
2 11 1
3 12 1
4 12 2
5 12 0
6 13 0
7 13 2
You can use factorize:
df['id'] = df['id'].factorize()[0] + 10
Output:
id label
0 10 3
1 11 3
2 11 1
3 12 1
4 12 2
5 12 0
6 13 0
7 13 2
Note: factorize enumerates the keys in the order they occur in your data, while the groupby().ngroup() solution enumerates them in increasing order. You can mimic the increasing order with factorize by sorting the data first, or replicate the data order with groupby() by passing sort=False to it.
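To make the note above concrete, here is a small sketch of both orderings (factorize also accepts a sort flag, which saves sorting the data by hand; in this particular example the ids already appear in increasing order, so both variants happen to give the same result):

# increasing-key order, like the default groupby().ngroup()
df['id'] = df['id'].factorize(sort=True)[0] + 10

# or: order of appearance, like the default factorize()
# df['id'] = df.groupby('id', sort=False).ngroup() + 10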
You can try
df['id'] = df.groupby('id').ngroup().add(10)
print(df)
id label
0 10 3
1 11 3
2 11 1
3 12 1
4 12 2
5 12 0
6 13 0
7 13 2
This is a naive way of looping through the IDs: every time you encounter an ID you haven't seen before, associate it in a dictionary with a new ID (starting at 10 and incrementing by 1 each time).
You can then swap out the values of the ID column using the map method.
new_ids = dict()
new_id = 10
for old_id in df['id']:
    if old_id not in new_ids:
        new_ids[old_id] = new_id
        new_id += 1

df['id'] = df['id'].map(new_ids)
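For what it's worth, the same mapping can be built in one line with pd.unique, which also preserves order of appearance; this is just a compact sketch of the loop above:

new_ids = {old: new for new, old in enumerate(pd.unique(df['id']), start=10)}
df['id'] = df['id'].map(new_ids)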
I need to add a column with the change in worker coordinates between different stages. We have a DataFrame:
import pandas as pd
from geopy.distance import geodesic as GD
d = {'user_id': [26, 26, 26, 26, 26, 26, 9, 9, 9, 9],
'worker_latitude': [55.114410, 55.114459, 55.114379,
55.114462, 55.114372, 55.114389, 65.774064, 65.731034, 65.731034, 65.774057],
'worker_longitude': [38.927155, 38.927114, 38.927101, 38.927156,
38.927258, 38.927120, 37.532380, 37.611746, 37.611746, 37.532346],
'change':[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}
df = pd.DataFrame(data=d)
which looks like:
user_id worker_latitude worker_longitude change
0 26 55.114410 38.927155 0
1 26 55.114459 38.927114 0
2 26 55.114379 38.927101 0
3 26 55.114462 38.927156 0
4 26 55.114372 38.927258 0
5 26 55.114389 38.927120 0
6 9 65.774064 37.532380 0
7 9 65.731034 37.611746 0
8 9 65.731034 37.611746 0
9 9 65.774057 37.532346 0
Then I need to compute the difference between a person's previous and current stage, so I use this loop:
for group in df.groupby(by='user_id'):
    group[1].reset_index(inplace=True, drop=True)
    for i in range(1, len(group[1])):
        first_xy = (group[1]['worker_latitude'][i-1], group[1]['worker_longitude'][i-1])
        second_xy = (group[1]['worker_latitude'][i], group[1]['worker_longitude'][i])
        print(round(GD(first_xy, second_xy).km, 6))
        group[1]['change'][i] = round(GD(first_xy, second_xy).km, 6)
And then I get:
6.021576
0.0
6.021896
0.00605
0.008945
0.009884
0.011948
0.009007
display(df)
user_id worker_latitude worker_longitude change
0 26 55.114410 38.927155 0
1 26 55.114459 38.927114 0
2 26 55.114379 38.927101 0
3 26 55.114462 38.927156 0
4 26 55.114372 38.927258 0
5 26 55.114389 38.927120 0
6 9 65.774064 37.532380 0
7 9 65.731034 37.611746 0
8 9 65.731034 37.611746 0
9 9 65.774057 37.532346 0
which means the values are computed correctly, but for some reason they never end up in the 'change' column. What can be done?
It doesn't work because you're modifying a copy of your DataFrame and trying to assign values to it.
Instead of iterating over the DataFrame inside groupby, it seems more intuitive to use groupby + shift to get each first_xy first, and then apply a custom function that computes GD between first_xy and second_xy for each row:
def func(x):
    if x.notna().all():
        first_xy = (x['prev_lat'], x['prev_long'])
        second_xy = (x['worker_latitude'], x['worker_longitude'])
        return round(GD(first_xy, second_xy).km, 6)
    else:
        return float('nan')

g = df.groupby('user_id')
df['prev_lat'] = g['worker_latitude'].shift()
df['prev_long'] = g['worker_longitude'].shift()
df['change'] = df.apply(func, axis=1)
df = df.drop(columns=['prev_lat', 'prev_long'])
Output:
user_id worker_latitude worker_longitude change
0 26 55.114410 38.927155 NaN
1 26 55.114459 38.927114 0.006050
2 26 55.114379 38.927101 0.008945
3 26 55.114462 38.927156 0.009884
4 26 55.114372 38.927258 0.011948
5 26 55.114389 38.927120 0.009007
6 9 65.774064 37.532380 NaN
7 9 65.731034 37.611746 6.021576
8 9 65.731034 37.611746 0.000000
9 9 65.774057 37.532346 6.021896
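If you want the first row of each group to show 0 instead of NaN, matching the 'change' column you initialized with zeros, one extra line is enough:

df['change'] = df['change'].fillna(0)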
I think that the problem could be the following line:
group[1]['change'][i]=round((GD(first_xy, second_xy).km),6)
You are updating the group variable, but you should update the df variable instead. My suggestion for fixing this properly is:
df.loc[i, "change"] = round((GD(first_xy, second_xy).km),6)
Here i is the row label you want to update and "change" is the column name.
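Note that in the loop from the question, i is a position inside the reset group, not a row label of df, so you would need the group's original index for this .loc assignment to hit the right rows. A sketch of the loop fixed along those lines, reusing the df and GD already defined above:

df['change'] = df['change'].astype(float)  # so the column can hold the distances

for _, grp in df.groupby('user_id'):
    idx = grp.index  # original row labels in df
    for prev, curr in zip(idx[:-1], idx[1:]):
        first_xy = (df.at[prev, 'worker_latitude'], df.at[prev, 'worker_longitude'])
        second_xy = (df.at[curr, 'worker_latitude'], df.at[curr, 'worker_longitude'])
        df.loc[curr, 'change'] = round(GD(first_xy, second_xy).km, 6)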
It seems simple, but I can't find an efficient way to solve this in Python 3: is there a loop I can use on my dataframe that takes every column after the current column (starting with the 1st column) and subtracts it from the current column, so that I can add the resulting column to a new dataframe?
This is what my data looks like:
This is what I have so far, but when running run_analysis my "result" line raises an error, and I do not know how to store the results in a new dataframe. I'm a beginner at all of this, so any help would be much appreciated.
storage = []  # container that will store the results of the subtracted columns

def subtract(a, b):  # function to call to do the column-wise subtractions
    return a - b

def run_analysis(frame, store):
    for first_col_index in range(len(frame)):  # finding the first column to use
        temp = []  # temporary place to store the column-wise values from the analysis
        for sec_col_index in range(len(frame)):  # finding the second column to subtract from the first column
            if sec_col_index <= first_col_index:
                # if the column is below the current column or equal to it, skip to the next column
                continue
            else:
                # if the column is above our current column, subtract the values and keep the result in temp
                result = [r for r in map(subtract, frame[sec_col_index], frame[first_col_index])]
                temp.append(result)
        store.append(temp)  # save the complete analysis in the store
Something like this?
# dummy dataframe
df = pd.DataFrame({'a': list(range(10)), 'b': list(range(10, 20)), 'c': list(range(10))})
print(df)
output:
a b c
0 0 10 0
1 1 11 1
2 2 12 2
3 3 13 3
4 4 14 4
5 5 15 5
6 6 16 6
7 7 17 7
8 8 18 8
9 9 19 9
Now iterate over adjacent pairs of columns and subtract them, assigning the result as a new column of the dataframe:
for c1, c2 in zip(df.columns[:-1], df.columns[1:]):
    df[f'{c2}-{c1}'] = df[c2] - df[c1]
print(df)
output:
a b c b-a c-b
0 0 10 0 10 -10
1 1 11 1 10 -10
2 2 12 2 10 -10
3 3 13 3 10 -10
4 4 14 4 10 -10
5 5 15 5 10 -10
6 6 16 6 10 -10
7 7 17 7 10 -10
8 8 18 8 10 -10
9 9 19 9 10 -10
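The loop above only handles adjacent pairs of columns. Since the question asks for every column after the current one, itertools.combinations gives all such pairs; this is a sketch on the same dummy dataframe, restricted to the original three columns:

from itertools import combinations

for c1, c2 in combinations(['a', 'b', 'c'], 2):
    df[f'{c2}-{c1}'] = df[c2] - df[c1]
# adds the columns b-a, c-a and c-b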
I have a dataframe that looks like the following (actually, this is the abstracted result of a calculation):
import pandas as pd
data = {"A":[i for i in range(10)]}
index = [1, 3, 4, 5, 9, 10, 12, 13, 15, 20]
df = pd.DataFrame(index=index, data=data)
print(df)
yields:
A
1 0
3 1
4 2
5 3
9 4
10 5
12 6
13 7
15 8
20 9
Now I want to filter the index values to only keep the first value of each group of consecutive values, e.g. the following result:
A
1 0
3 1
9 4
12 6
15 8
20 9
Any hints on how to achieve this efficiently?
Use Series.diff, which is not implemented for Index, so convert the index to a Series and compare for not equal to 1:
df = df[df.index.to_series().diff().ne(1)]
print(df)
A
1 0
3 1
9 4
12 6
15 8
20 9
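For reference, these are the intermediate differences on the original index; the first value is NaN, and NaN != 1, so the first row is always kept:

print(pd.Series(index).diff().tolist())
# [nan, 2.0, 1.0, 1.0, 4.0, 1.0, 2.0, 1.0, 2.0, 5.0]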
Try this one:
import numpy as np
df.iloc[np.unique(np.array(index)-np.arange(len(index)), return_index=True)[1]]
Try this:
df.groupby('A').index.first().reset_index()
I have a Pandas dataframe with a column full of values that I want to replace with another value, unconditionally.
For the purpose of this question, let's assume I don't know how long this column is and I don't want to iterate over its values.
Using .replace() is not appropriate since I don't know which values are in that column: I want to replace all values, unconditionally.
Using df.loc[<row selection>, <column selection>] is not appropriate since there is no row selection logic: I want all the rows, and simply writing True (as in data.loc[True, 'ColumnName'] = new_value) returns KeyError(True,). I tried data.loc[1, 'ColumnName'] = new_value and it works, but it really looks like a shitty solution.
If I knew the len() of data['ColumnName'], I could create an array of that size filled with new_value and simply replace the column with that array. But ten lines of code to do something that should be simpler than the conditional case, which takes one line, is also not OK.
How can I tell Pandas in 1 line: all the values in ColumnName are now new_value? I refuse to believe there's no way to tell Pandas not to bother me with conditions.
As I explained in the comment, you don't need to create an array.
Let's say you have df:
InvoiceNO Month Year Size
0 1 1 2 7
1 2 1 2 8
2 3 2 2 11
3 4 3 2 9
4 5 7 2 8.5
...and you want to change all values in InvoiceNO to 1234:
df['InvoiceNO'] = 1234
Output:
InvoiceNO Month Year Size
0 1234 1 2 7
1 1234 1 2 8
2 1234 2 2 11
3 1234 3 2 9
4 1234 7 2 8.5
import pandas as pd
df = pd.DataFrame(
{'num1' : [3, 5, 9, 9, 14, 1],
'num2' : [3, 5, 9, 9, 14, 1]},
index=[0, 1, 2, 3, 4, 5])
print(df)
print('\n')
df['num1'] = 100
print(df)
df['num1'] = 'Hi'
print('\n')
print(df)
The output is
num1 num2
0 3 3
1 5 5
2 9 9
3 9 9
4 14 14
5 1 1
num1 num2
0 100 3
1 100 5
2 100 9
3 100 9
4 100 14
5 100 1
num1 num2
0 Hi 3
1 Hi 5
2 Hi 9
3 Hi 9
4 Hi 14
5 Hi 1