I need to add a column with the change of worker coordinates between different stages. We have a DataFrame:
import pandas as pd
from geopy.distance import geodesic as GD
d = {'user_id': [26, 26, 26, 26, 26, 26, 9, 9, 9, 9],
     'worker_latitude': [55.114410, 55.114459, 55.114379, 55.114462, 55.114372, 55.114389,
                         65.774064, 65.731034, 65.731034, 65.774057],
     'worker_longitude': [38.927155, 38.927114, 38.927101, 38.927156, 38.927258, 38.927120,
                          37.532380, 37.611746, 37.611746, 37.532346],
     'change': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}
df = pd.DataFrame(data=d)
which looks like:
user_id worker_latitude worker_longitude change
0 26 55.114410 38.927155 0
1 26 55.114459 38.927114 0
2 26 55.114379 38.927101 0
3 26 55.114462 38.927156 0
4 26 55.114372 38.927258 0
5 26 55.114389 38.927120 0
6 9 65.774064 37.532380 0
7 9 65.731034 37.611746 0
8 9 65.731034 37.611746 0
9 9 65.774057 37.532346 0
Then I need to compute the distance between a person's previous and current stage, so I use this loop:
for group in df.groupby(by='user_id'):
    group[1].reset_index(inplace=True, drop=True)
    for i in range(1, len(group[1])):
        first_xy = (group[1]['worker_latitude'][i-1], group[1]['worker_longitude'][i-1])
        second_xy = (group[1]['worker_latitude'][i], group[1]['worker_longitude'][i])
        print(round(GD(first_xy, second_xy).km, 6))
        group[1]['change'][i] = round(GD(first_xy, second_xy).km, 6)
And then I get:
6.021576
0.0
6.021896
0.00605
0.008945
0.009884
0.011948
0.009007
display(df)
user_id worker_latitude worker_longitude change
0 26 55.114410 38.927155 0
1 26 55.114459 38.927114 0
2 26 55.114379 38.927101 0
3 26 55.114462 38.927156 0
4 26 55.114372 38.927258 0
5 26 55.114389 38.927120 0
6 9 65.774064 37.532380 0
7 9 65.731034 37.611746 0
8 9 65.731034 37.611746 0
9 9 65.774057 37.532346 0
Which means that the values are computed correctly, but for some reason they don't end up in the 'change' column. What can be done?
It doesn't work because you're accessing a copy of your DataFrame and trying to assign values to it, so the writes never reach df.
However, instead of iterating over the DataFrame inside groupby, it seems more intuitive to use groupby + shift to get the first_xy values first, and then apply a custom function that computes GD between first_xy and second_xy for each row:
def func(x):
    if x.notna().all():
        first_xy = (x['prev_lat'], x['prev_long'])
        second_xy = (x['worker_latitude'], x['worker_longitude'])
        return round(GD(first_xy, second_xy).km, 6)
    else:
        return float('nan')
g = df.groupby('user_id')
df['prev_lat'] = g['worker_latitude'].shift()
df['prev_long'] = g['worker_longitude'].shift()
df['change'] = df.apply(func, axis=1)
df = df.drop(columns=['prev_lat','prev_long'])
Output:
user_id worker_latitude worker_longitude change
0 26 55.114410 38.927155 NaN
1 26 55.114459 38.927114 0.006050
2 26 55.114379 38.927101 0.008945
3 26 55.114462 38.927156 0.009884
4 26 55.114372 38.927258 0.011948
5 26 55.114389 38.927120 0.009007
6 9 65.774064 37.532380 NaN
7 9 65.731034 37.611746 6.021576
8 9 65.731034 37.611746 0.000000
9 9 65.774057 37.532346 6.021896
I think that the problem could be the following line:
group[1]['change'][i]=round((GD(first_xy, second_xy).km),6)
You are updating the group variable, whereas you should be updating the df variable instead. My suggestion for fixing this is:
df.loc[i, "change"] = round((GD(first_xy, second_xy).km),6)
Considering that i is the label of the row you want to update (it has to be the label in df itself, not the position within the group), and "change" is the column name. A full corrected loop is sketched below.
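Putting the two points together, here is a minimal sketch of the corrected loop (an illustration, not the only possible fix): it keeps each group's original index instead of resetting it, so the label passed to df.loc really is a row of df, and it widens the change column to float so it can hold fractional kilometres.
df['change'] = df['change'].astype(float)    # the column was created as int
for _, group in df.groupby('user_id'):
    idx = group.index                        # original row labels of this group
    for prev_label, curr_label in zip(idx[:-1], idx[1:]):
        first_xy = (df.at[prev_label, 'worker_latitude'], df.at[prev_label, 'worker_longitude'])
        second_xy = (df.at[curr_label, 'worker_latitude'], df.at[curr_label, 'worker_longitude'])
        df.loc[curr_label, 'change'] = round(GD(first_xy, second_xy).km, 6)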
I have a DataFrame which has a column containing these values with their % occurrence.
I want to convert the value with the highest occurrence to 1 and the rest to 0.
How can I do this using Pandas?
Try this:
import pandas as pd
import numpy as np
df = pd.DataFrame({'availability': np.random.randint(0, 100, 10), 'some_col': np.random.randn(10)})
print(df)
"""
availability some_col
0 9 -0.332662
1 35 0.193257
2 1 2.042402
3 50 -0.298372
4 52 -0.669655
5 3 -1.031884
6 44 -0.763867
7 28 1.093086
8 67 0.723319
9 87 -1.439568
"""
df['availability'] = np.where(df['availability'] == df['availability'].max(), 1, 0)
print(df)
"""
availability some_col
0 0 -0.332662
1 0 0.193257
2 0 2.042402
3 0 -0.298372
4 0 -0.669655
5 0 -1.031884
6 0 -0.763867
7 0 1.093086
8 0 0.723319
9 1 -1.439568
"""
Edit
If you are trying to mask the rows with the values that occur most often instead, try this:
df = pd.DataFrame(
    {
        'availability': [10, 10, 20, 30, 40, 40, 50, 50, 50, 50],
        'some_col': np.random.randn(10)
    }
)
print(df)
"""
availability some_col
0 10 0.954199
1 10 0.779256
2 20 -0.438860
3 30 -2.547989
4 40 0.587108
5 40 0.398858
6 50 0.776177 # <--- Most Frequent is 50
7 50 -0.391724 # <--- Most Frequent is 50
8 50 -0.886805 # <--- Most Frequent is 50
9 50 1.989000 # <--- Most Frequent is 50
"""
df['availability'] = np.where(df['availability'].isin(df['availability'].mode()), 1, 0)
print(df)
"""
availability some_col
0 0 0.954199
1 0 0.779256
2 0 -0.438860
3 0 -2.547989
4 0 0.587108
5 0 0.398858
6 1 0.776177
7 1 -0.391724
8 1 -0.886805
9 1 1.989000
"""
Try:
df.availability.apply(lambda x: 1 if x == df.availability.value_counts().idxmax() else 0)
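Note that the lambda above recomputes value_counts().idxmax() for every row; a minimal variation (just a suggestion) precomputes it once:
most_frequent = df.availability.value_counts().idxmax()
df['availability'] = df.availability.apply(lambda x: 1 if x == most_frequent else 0)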
You can use Series.mode() to get the most frequent value(s) and isin to check whether each value in the column is among them:
df['col'] = df['availability'].isin(df['availability'].mode()).astype(int)
You can compare to the mode with isin, then convert the boolean to integer (True -> 1, False -> 0):
df['col2'] = df['col'].isin(df['col'].mode()).astype(int)
Example (here, 2 and 4 are tied as the most frequent values), written to a new column "col2" for clarity:
col col2
0 0 0
1 2 1
2 2 1
3 2 1
4 4 1
5 4 1
6 4 1
7 1 0
I have the following dataframe:
import pandas as pd
data = {'id': [542588, 542594, 542594, 542605, 542605, 542605, 542630, 542630],
'label': [3, 3, 1, 1, 2, 0, 0, 2]}
df = pd.DataFrame(data)
df
id label
0 542588 3
1 542594 3
2 542594 1
3 542605 1
4 542605 2
5 542605 0
6 542630 0
7 542630 2
The id column contains large integers (6 digits). I want a way to simplify it, starting from 10, so that 542588 becomes 10, 542594 becomes 11, etc.
Required output:
id label
0 10 3
1 11 3
2 11 1
3 12 1
4 12 2
5 12 0
6 13 0
7 13 2
You can use factorize:
df['id'] = df['id'].factorize()[0] + 10
Output:
id label
0 10 3
1 11 3
2 11 1
3 12 1
4 12 2
5 12 0
6 13 0
7 13 2
Note: factorize will enumerate the keys in the order that they occur in your data, while the groupby().ngroup() solution will enumerate the keys in increasing order. You can mimic the increasing order with factorize by sorting the data first, or replicate the data order with groupby() by passing sort=False to it; both variants are sketched below.
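A small sketch of those two variants, assuming the original df from the question (before the assignment above); the column names id_sorted and id_appearance are only illustrative:
# increasing key order, like groupby().ngroup(): factorize also accepts sort=True
df['id_sorted'] = df['id'].factorize(sort=True)[0] + 10
# order of first appearance, like plain factorize(): pass sort=False to groupby
df['id_appearance'] = df.groupby('id', sort=False).ngroup() + 10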
You can try
df['id'] = df.groupby('id').ngroup().add(10)
print(df)
id label
0 10 3
1 11 3
2 11 1
3 12 1
4 12 2
5 12 0
6 13 0
7 13 2
This is a naive way of looping through the IDs: every time you encounter an ID you haven't seen before, associate it in a dictionary with a new ID (starting at 10 and incrementing by 1 each time).
You can then swap out the values of the ID column using the map method.
new_ids = dict()
new_id = 10
for old_id in df['id']:
    if old_id not in new_ids:
        new_ids[old_id] = new_id
        new_id += 1
df['id'] = df['id'].map(new_ids)
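This produces the same first-appearance numbering as the factorize answer above; map then swaps every original id for its new value.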
I have a dataframe that looks like the following (actually, this is the abstracted result of a calculation):
import pandas as pd
data = {"A":[i for i in range(10)]}
index = [1, 3, 4, 5, 9, 10, 12, 13, 15, 20]
df = pd.DataFrame(index=index, data=data)
print(df)
yields:
A
1 0
3 1
4 2
5 3
9 4
10 5
12 6
13 7
15 8
20 9
Now I want to filter the index values to show only the first value in each group of consecutive values, e.g. the following result:
A
1 0
3 1
9 4
12 6
15 8
20 9
Any hints on how to achieve this efficiently?
Use Series.diff, which is not implemented for Index, so convert the index to a Series and compare for not equal to 1 (the first row yields NaN, which also passes ne(1), so the start of each run is kept):
df = df[df.index.to_series().diff().ne(1)]
print (df)
A
1 0
3 1
9 4
12 6
15 8
20 9
Try this one (subtracting a running counter maps consecutive index values to the same number, so np.unique(..., return_index=True) returns the position of the first element of each run):
import numpy as np
df.iloc[np.unique(np.array(index) - np.arange(len(index)), return_index=True)[1]]
Try this:
df.groupby('A').index.first().reset_index()
I am trying to apply a function to each row of a data frame. The tricky part is that the function returns a new data frame for each processed row. Assume the columns of this data frame can easily be derived from the processed row.
In the end the result should be all these data frames (one for each processed row) concatenated. I intentionally do not provide sample code, because the simplest solution proposal will do, as long as the 'tricky' part is fulfilled.
I have spent hours digging through the docs and Stack Overflow to find a solution. As usual, the pandas docs are so devoid of practical examples beyond the simplest of operations that I just couldn't figure it out. I also made sure not to miss any duplicate questions. Thanks a lot.
It is unclear what you are trying to achieve, but I doubt you need to create separate dataframes.
The example below shows how you can take a dataframe, subset it to your columns of interest, apply a function foo to one of the columns and then apply a second function bar that returns multiple values.
df = pd.DataFrame({
    'first_name': ['john', 'nancy', 'jolly'],
    'last_name': ['smith', 'drew', 'rogers'],
    'A': [1, 4, 7],
    'B': [2, 5, 8],
    'C': [3, 6, 9]
})
>>> df
first_name last_name A B C
0 john smith 1 2 3
1 nancy drew 4 5 6
2 jolly rogers 7 8 9
def foo(first_name):
    return 2 if first_name.startswith('j') else 1

def bar(first_name):
    return (2, 0) if first_name.startswith('j') else (1, 3)

columns_of_interest = ['first_name', 'A']
df_new = pd.concat([
    df[columns_of_interest].assign(x=df.first_name.apply(foo)),
    df.first_name.apply(bar).apply(pd.Series)], axis=1)
>>> df_new
first_name A x 0 1
0 john 1 2 2 0
1 nancy 4 1 1 3
2 jolly 7 2 2 0
Assuming the function you are applying to each row is called f
pd.concat({i: f(row) for i, row in df.iterrows()})
Working example
import numpy as np

df = pd.DataFrame(np.arange(25).reshape(5, 5), columns=list('ABCDE'))

def f(row):
    return pd.concat([row] * 2, keys=['x', 'y']).unstack().drop('C', axis=1).assign(S=99)

pd.concat({i: f(row) for i, row in df.iterrows()})
A B D E S
0 x 0 1 3 4 99
y 0 1 3 4 99
1 x 5 6 8 9 99
y 5 6 8 9 99
2 x 10 11 13 14 99
y 10 11 13 14 99
3 x 15 16 18 19 99
y 15 16 18 19 99
4 x 20 21 23 24 99
y 20 21 23 24 99
Or
df.groupby(level=0).apply(lambda x: f(x.squeeze()))
A B D E S
0 x 0 1 3 4 99
y 0 1 3 4 99
1 x 5 6 8 9 99
y 5 6 8 9 99
2 x 10 11 13 14 99
y 10 11 13 14 99
3 x 15 16 18 19 99
y 15 16 18 19 99
4 x 20 21 23 24 99
y 20 21 23 24 99
I would do it this way - although I note that .apply is possibly what you are looking for.
import pandas as pd
import numpy as np
np.random.seed(7)
orig = pd.DataFrame(np.random.rand(6, 3))
orig.columns = ['F1', 'F2', 'F3']
res = []
for i, r in orig.iterrows():
    tot = 0
    for col in r:
        tot = tot + col
    rv = {'res': tot}
    a = pd.DataFrame.from_dict(rv, orient='index', dtype=np.float64)
    res.append(a)
res[0].head()
Should return something like this
{'res':10}
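If you then want the single concatenated frame that the question asks for, the collected pieces can be combined afterwards (a small follow-up to the snippet above):
combined = pd.concat(res)   # one DataFrame holding every per-row result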
I found it difficult to do this with an array, but whatever output method is fine with me.
I want to take a column from my DataFrame which has single digits numbers and double digits numbers.
The items are currently integers, but they can be converted to str or bool, whatever necessary to do the task.
I want to add a 1 to the end of all the single-digit numbers. For example, if the digit is 2, then I want it to return 21.
Lastly, once these operations are complete, I need to split the digits in half and create two columns.
For example
col['a'] = [3, 22, 23, 2, 1]
so my output should look like:
col['a'] = [31, 22, 23, 21, 11]
then, I will most likely do something like
col['b'] = col['a'][0:]
[3,2,2,2,1]
and
col['c'] = col['a'][:1]
[1,2,3,1,1].
Assuming your data is numeric, you can use np.mod(data, 10) to get the last digit.
import pandas as pd
import numpy as np
# data
# ===========================
df = pd.DataFrame({'a': [31, 22, 23, 21, 11]})
df.dtypes
a int64
dtype: object
# processing
# =====================================
df['c'] = np.mod(df.a, 10)
df
a c
0 31 1
1 22 2
2 23 3
3 21 1
4 11 1
Edit:
To add 1 to the end of each single digit number:
df = pd.DataFrame({'a': [31,22,23,21,11,1,2,3,4,5]})
df
a
0 31
1 22
2 23
3 21
4 11
5 1
6 2
7 3
8 4
9 5
single_digit_selector = df.a - np.mod(df.a, 10) == 0
df[single_digit_selector] = df[single_digit_selector] * 10 + 1
df
a
0 31
1 22
2 23
3 21
4 11
5 11
6 21
7 31
8 41
9 51
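To also finish the second half of the question (splitting the padded two-digit numbers into two columns), integer division by 10 gives the leading digit; this is a small follow-up sketch using the df from the edit above:
df['b'] = df.a // 10           # leading digit
df['c'] = np.mod(df.a, 10)     # trailing digit, as in the first example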
>>> df
a
0 3
1 22
2 23
3 2
4 1
df['aa'] = df.apply(lambda row: row['a']*10+1 if 0<=row['a']<=9 else row['a'], axis=1)
>>> df
a aa
0 3 31
1 22 22
2 23 23
3 2 21
4 1 11
df['b'] = df.apply(lambda row: divmod(row['aa'], 10)[0], axis=1)
df['c'] = df.apply(lambda row: divmod(row['aa'], 10)[1], axis=1)
>>> df
a aa b c
0 3 31 3 1
1 22 22 2 2
2 23 23 2 3
3 2 21 2 1
4 1 11 1 1
I would do it this way:
single_digit = col.a < 10
col['b'] = col.a.where(single_digit, col.a // 10)
col['c'] = np.where(single_digit, 1, np.mod(col.a, 10))
So if a < 10, b is simply a, and the result of integer division by 10 otherwise (// is floor division, so the values stay whole numbers). For column c we have 1 if a < 10 and a mod 10 otherwise.
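As a quick check, applied to the padded column from the question (a self-contained illustration; col is assumed to be a DataFrame whose a column already holds the padded values):
import numpy as np
import pandas as pd

col = pd.DataFrame({'a': [31, 22, 23, 21, 11]})
single_digit = col.a < 10
col['b'] = col.a.where(single_digit, col.a // 10)          # [3, 2, 2, 2, 1]
col['c'] = np.where(single_digit, 1, np.mod(col.a, 10))    # [1, 2, 3, 1, 1]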