Encode pandas column as categorical values - python

I have a dataframe as follow:
d = {'item': [1, 2,3,4,5,6], 'time': [1297468800, 1297468809, 12974688010, 1297468890, 1297468820,1297468805]}
df = pd.DataFrame(data=d)
the output of df is as follow:
item time
0 1 1297468800
1 2 1297468809
2 3 1297468801
3 4 1297468890
4 5 1297468820
5 6 1297468805
the time here is based on the unixsystem time. My goal is to replace the time column in the dataframe.
such as the
mintime = 1297468800
maxtime = 1297468890
And I want to split the time into 10 (can be changed by using parameter like 20 intervals) interval, and recode the time column in df. Such as
item time
0 1 1
1 2 1
2 3 1
3 4 9
4 5 3
5 6 1
what is the most efficient way to do this since I have billion of records? Thanks

You can use pd.cut with np.linspace to specify the bins. This encodes your column categorically, from which you can then extract the codes in order:
bins = np.linspace(df.time.min() - 1, df.time.max(), 10)
df['time'] = pd.cut(df.time, bins=bins, right=True).cat.codes + 1
df
item time
0 1 1
1 2 1
2 3 1
3 4 9
4 5 3
5 6 1
Alternatively, depending on how you treat the interval edges, you could also do
bins = np.linspace(df.time.min(), df.time.max() + 1, 10)
pd.cut(df.time, bins=bins, right=False).cat.codes + 1
0 1
1 1
2 1
3 9
4 2
5 1
dtype: int8

Related

Allocate lowest value over n rows to n rows in DataFrame

I need to take the lowest value over n rows and add it to these n rows in a new colomn of the dataframe. For example:
n=3
Column 1 Column 2
5 3
3 3
4 3
7 2
8 2
2 2
5 4
4 4
9 4
8 2
2 2
3 2
5 2
Please take note that if the number of rows is not dividable by n, the last values are incorporated in the last group. So in this example n=4 for the end of the dataframe.
Thanking you in advance!
I do not know any straight forward way to do this, but here is a working example (not elegant, but working...).
If you do not worry about the number of rows being dividable by n, you could use .groupby():
import pandas as pd
d = {'col1': [1, 2,1,5,3,2,5,6,4,1,2] }
df = pd.DataFrame(data=d)
n=3
df['new_col']=df.groupby(df.index // n).transform('min')
which yields:
col1 new_col
0 1 1
1 2 1
2 1 1
3 5 2
4 3 2
5 2 2
6 5 4
7 6 4
8 4 4
9 1 1
10 2 1
However, we can see that the last 2 rows are grouped together, instead of them being grouped with the 3 previous values in this case.
A way around would be to look at the .count() of elements in each group generated by grouby, and check the last one:
import pandas as pd
d = {'col1': [1, 2,1,5,3,2,5,6,4,1,2] }
df = pd.DataFrame(data=d)
n=3
# Temporary dataframe
A = df.groupby(df.index // n).transform('min')
# The min value of each group in a second dataframe
min_df = df.groupby(df.index // n).min()
# The size of the last group
last_batch = df.groupby(df.index // n).count()[-1:]
# if the last size is not equal to n
if last_batch.values[0][0] !=n:
last_group = last_batch+n
A[-last_group.values[0][0]:]=min_df[-2:].min()
# Assign the temporary modified dataframe to df
df['new_col'] = A
which yields the expected result:
col1 new_col
0 1 1
1 2 1
2 1 1
3 5 2
4 3 2
5 2 2
6 5 1
7 6 1
8 4 1
9 1 1
10 2 1

How to explode aggregated pandas column

I have a df that looks like this:
df
time score
83623 4
83624 3
83625 3
83629 2
83633 1
I want to explode df.time so that the single digit increments by 1, and then the df.score value is duplicated for each added row. See example below:
time score
83623 4
83624 3
83625 3
83626 3
83627 3
83628 3
83629 2
83630 2
83631 2
83632 2
83633 1
From your sample, I assume df.time is integer. You may try this way
df_final = df.set_index('time').reindex(range(df.time.min(), df.time.max()+1),
method='pad').reset_index()
Out[89]:
time score
0 83623 4
1 83624 3
2 83625 3
3 83626 3
4 83627 3
5 83628 3
6 83629 2
7 83630 2
8 83631 2
9 83632 2
10 83633 1

python - absolute to relative time

my pandas DataFrame has the following current structure:
{
'Temperature': [1,2,3,4,5,6,7,8,9],
'machining': [1,1,1,2,2,2,3,3,3],
'timestamp': [1560770645,1560770646,1560770647,1560770648,1560770649,1560770650,1560770651,1560770652,1560770653]
}
I'd like to add a column with the relative time of each machining process, so that it refreshes every time the column 'Machining' changes its value.
Thus, the desired structure is:
{
'Temperature': [1,2,3,4,5,6,7,8,9],
'machining': [1,1,1,2,2,2,3,3,3],
'timestamp': [1560770645,1560770646,1560770647,1560770648,1560770649,1560770650,1560770651,1560770652,1560770653]
'timestamp_machining': [1,2,3,1,2,3,1,2,3]
}
I'm struggling a bit to do this in a clean way: any help would be appreciated also without pandas if needed.
Subtract first values per groups created by GroupBy.transform:
#if values are not sorted
df = df.sort_values(['machining','timestamp'])
print (df.groupby('machining')['timestamp'].transform('first'))
0 1560770645
1 1560770645
2 1560770645
3 1560770648
4 1560770648
5 1560770648
6 1560770651
7 1560770651
8 1560770651
Name: timestamp, dtype: int64
df['new'] = df['timestamp'].sub(df.groupby('machining')['timestamp'].transform('first')) + 1
print (df)
Temperature machining timestamp timestamp_machining new
0 1 1 1560770645 1 1
1 2 1 1560770646 2 2
2 3 1 1560770647 3 3
3 4 2 1560770648 1 1
4 5 2 1560770649 2 2
5 6 2 1560770650 3 3
6 7 3 1560770651 1 1
7 8 3 1560770652 2 2
8 9 3 1560770653 3 3
If need counter only then GroupBy.cumcount is your friend:
df['new'] = df.groupby('machining').cumcount() + 1

Pandas groupby treat nonconsecutive as different variables?

I want to treat non consecutive ids as different variables during groupby, so that I can take return the first value of stamp, and the sum of increment as a new dataframe. Here is sample input and output.
import pandas as pd
import numpy as np
df = pd.DataFrame([np.array(['a','a','a','b','c','b','b','a','a','a']),
np.arange(1, 11), np.ones(10)]).T
df.columns = ['id', 'stamp', 'increment']
df_result = pd.DataFrame([ np.array(['a','b','c','b','a']),
np.array([1,4,5,6,8]), np.array([3,1,1,2,3])]).T
df_result.columns = ['id', 'stamp', 'increment_sum']
In [2]: df
Out[2]:
id stamp increment
0 a 1 1
1 a 2 1
2 a 3 1
3 b 4 1
4 c 5 1
5 b 6 1
6 b 7 1
7 a 8 1
8 a 9 1
9 a 10 1
In [3]: df_result
Out[3]:
id stamp increment_sum
0 a 1 3
1 b 4 1
2 c 5 1
3 b 6 2
4 a 8 3
I can accomplish this via
def get_result(d):
sum = d.increment.sum()
stamp = d.stamp.min()
name = d.id.max()
return name, stamp, sum
#idea from http://stackoverflow.com/questions/25147091/combine-consecutive-rows-with-the-same-column-values
df['key'] = (df['id'] != df['id'].shift(1)).astype(int).cumsum()
result = zip(*df.groupby([df.key]).apply(get_result))
df = pd.DataFrame(np.array(result).T)
df.columns = ['id', 'stamp', 'increment_sum']
But I'm sure there must be a more elegant solution
Not that good in terms of optimum code, but solves the problem
> df_group = df.groupby('id')
we cant use id alone for groupby, so adding another new column to groupby within id based whether it is continuous or not
> df['group_diff'] = df_group['stamp'].diff().apply(lambda v: float('nan') if v == 1 else v).ffill().fillna(0)
> df
id stamp increment group_diff
0 a 1 1 0
1 a 2 1 0
2 a 3 1 0
3 b 4 1 0
4 c 5 1 0
5 b 6 1 2
6 b 7 1 2
7 a 8 1 5
8 a 9 1 5
9 a 10 1 5
Now we can the new column group_diff for secondary grouping.. Added sort function in the end as suggested in the comments to get the exact function
> df.groupby(['id','group_diff']).agg({'increment':sum, 'stamp': 'first'}).reset_index()[['id', 'stamp','increment']].sort('stamp')
id stamp increment
0 a 1 3
2 b 4 1
4 c 5 1
3 b 6 2
1 a 8 3

Python pandas, multindex, slicing

I have got a pd.DataFrame
Time Value
a 1 1 1
2 2 5
3 5 7
b 1 1 5
2 2 9
3 10 11
I want to multiply the column Value with the column Time - Time(t-1) and write the result to a column Product, starting with row b, but separately for each top level index.
For example Product('1','b') should be (Time('1','b') - Time('1','a')) * Value('1','b'). To do this, i would need a "shifted" version of column Time "starting" at row b so that i could do df["Product"] = (df["Time"].shifted - df["Time"]) * df["Value"]. The result should look like this:
Time Value Product
a 1 1 1 0
2 2 5 5
3 5 7 21
b 1 1 5 0
2 2 9 9
3 10 11 88
This should do it:
>>> time_shifted = df['Time'].groupby(level=0).apply(lambda x: x.shift())
>>> df['Product'] = ((df.Time - time_shifted)*df.Value).fillna(0)
>>> df
Time Value Product
a 1 1 1 0
2 2 5 5
3 5 7 21
b 1 1 5 0
2 2 9 9
3 10 11 88
Hey this should do what you need it to. Comment if I missed anything.
import pandas as pd
import numpy as np
df = pd.DataFrame({'Time':[1,2,5,1,2,10],'Value':[1,5,7,5,9,11]},
index = [['a','a','a','b','b','b'],[1,2,3,1,2,3]])
def product(x):
x['Product'] = (x['Time']-x.shift()['Time'])*x['Value']
return x
df = df.groupby(level =0).apply(product)
df['Product'] = df['Product'].replace(np.nan, 0)
print df

Categories