Creating a DataFrame with repeating values - python

I'm trying to create a dataframe in Pandas that has two variables ("date" and "time_of_day"), where "date" is 120 observations long, covering 30 days (each day has four observations: 1,1,1,1; 2,2,2,2; etc.), and the second variable, "time_of_day", repeats the values 1,2,3,4 thirty times.
The closest I found to this question was here: How to create a series of numbers using Pandas in Python, which got me the code below, but I'm receiving an error that the data must be a 1-dimensional array.
df = pd.DataFrame({'date': np.tile([pd.Series(range(1,31))],4), 'time_of_day': pd.Series(np.tile([1, 2, 3, 4],30 ))})
So the final dataframe would look something like
date  time_of_day
   1            1
   1            2
   1            3
   1            4
   2            1
   2            2
   2            3
   2            4
Thanks much!

You need np.repeat for the date column and np.tile for time_of_day:
import numpy as np
import pandas as pd

df = pd.DataFrame({'date': np.repeat(range(1, 31), 4),
                   'time_of_day': np.tile([1, 2, 3, 4], 30)})
print(df.head(10))
   date  time_of_day
0     1            1
1     1            2
2     1            3
3     1            4
4     2            1
5     2            2
6     2            3
7     2            4
8     3            1
9     3            2
Or you could use pd.MultiIndex.from_product, with the same result:
df = (
    pd.MultiIndex.from_product([range(1, 31), range(1, 5)],
                               names=['date', 'time_of_day'])
      .to_frame(index=False)
)
Or product from itertools:
from itertools import product

df = pd.DataFrame(product(range(1, 31), range(1, 5)),
                  columns=['date', 'time_of_day'])

Or a cross merge (how='cross', added in pandas 1.2); naming the columns up front makes the result match the other approaches:
out = (pd.DataFrame({'date': range(1, 31)})
         .merge(pd.DataFrame({'time_of_day': [1, 2, 3, 4]}), how='cross'))
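All four approaches build the same 120-row frame; here is a quick sanity check comparing two of the variants (a sketch, not part of the original answers):

import numpy as np
import pandas as pd

expected = pd.DataFrame({'date': np.repeat(range(1, 31), 4),
                         'time_of_day': np.tile([1, 2, 3, 4], 30)})
cross = (pd.DataFrame({'date': range(1, 31)})
           .merge(pd.DataFrame({'time_of_day': [1, 2, 3, 4]}), how='cross'))

# Same shape, same column names, same values.
pd.testing.assert_frame_equal(expected, cross)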

Related

Allocate lowest value over n rows to n rows in DataFrame

I need to take the lowest value over every n rows and write it to those n rows in a new column of the dataframe. For example, with n=3:
Column 1  Column 2
       5         3
       3         3
       4         3
       7         2
       8         2
       2         2
       5         4
       4         4
       9         4
       8         2
       2         2
       3         2
       5         2
Please note that if the number of rows is not divisible by n, the leftover rows are incorporated into the last group, so in this example the final group effectively has n=4.
Thank you in advance!
I do not know a straightforward way to do this, but here is a working example (not elegant, but working...).
If you do not worry about the number of rows being divisible by n, you can use .groupby():
import pandas as pd

d = {'col1': [1, 2, 1, 5, 3, 2, 5, 6, 4, 1, 2]}
df = pd.DataFrame(data=d)
n = 3
df['new_col'] = df.groupby(df.index // n).transform('min')
which yields:
    col1  new_col
0      1        1
1      2        1
2      1        1
3      5        2
4      3        2
5      2        2
6      5        4
7      6        4
8      4        4
9      1        1
10     2        1
However, we can see that the last 2 rows are grouped together instead of being grouped with the 3 previous values.
A way around this is to look at the .count() of elements in each group generated by groupby, and fix the last group if needed:
import pandas as pd

d = {'col1': [1, 2, 1, 5, 3, 2, 5, 6, 4, 1, 2]}
df = pd.DataFrame(data=d)
n = 3

# Temporary dataframe with each group's min broadcast to its rows
A = df.groupby(df.index // n).transform('min')
# The min value of each group in a second dataframe
min_df = df.groupby(df.index // n).min()
# The size of the last group
last_batch = df.groupby(df.index // n).count()[-1:]
# If the last group's size is not equal to n, merge it into the previous group
if last_batch.values[0][0] != n:
    last_group = last_batch + n
    A[-last_group.values[0][0]:] = min_df[-2:].min()
# Assign the temporary modified dataframe to df
df['new_col'] = A
which yields the expected result:
    col1  new_col
0      1        1
1      2        1
2      1        1
3      5        2
4      3        2
5      2        2
6      5        1
7      6        1
8      4        1
9      1        1
10     2        1
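As a side note, a more compact variant (a sketch, not part of the original answer) gets the same result by capping the group label at the last complete group, so the trailing partial group is folded into it:

import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 1, 5, 3, 2, 5, 6, 4, 1, 2]})
n = 3

# Index of the last complete group; leftover rows are capped to it.
last_full = len(df) // n - 1
labels = np.minimum(df.index // n, last_full)
df['new_col'] = df.groupby(labels)['col1'].transform('min')

This assumes the frame has at least n rows, i.e. at least one complete group.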

Sliding minimum value in a pandas column

I am working with a pandas dataframe where I have the following two columns: "personID" and "points". I would like to create a third column ("localMin") which will store, for each personID, the minimum value of the "points" column seen so far, i.e. compared with all previous values of "points" for that personID (see the expected output below).
Does anyone have an idea how to achieve this most efficiently? I have approached this problem using shift() with different period sizes, but of course, shift is sensitive to variations in the sequence and doesn't always produce the output I would expect.
Thank you in advance!
Use groupby.cummin:
df['localMin'] = df.groupby('personID')['points'].cummin()
Example:
import pandas as pd

df = pd.DataFrame({'personID': list('AAAAAABBBBBB'),
                   'points': [3, 4, 2, 6, 1, 2, 4, 3, 1, 2, 6, 1]})
df['localMin'] = df.groupby('personID')['points'].cummin()
output:
   personID  points  localMin
0         A       3         3
1         A       4         3
2         A       2         2
3         A       6         2
4         A       1         1
5         A       2         1
6         B       4         4
7         B       3         3
8         B       1         1
9         B       2         1
10        B       6         1
11        B       1         1
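For comparison, an expanding window computes the same running minimum (a sketch, not from the original answer; note that groupby.expanding returns floats):

import pandas as pd

df = pd.DataFrame({'personID': list('AAAAAABBBBBB'),
                   'points': [3, 4, 2, 6, 1, 2, 4, 3, 1, 2, 6, 1]})

# Running minimum per person; drop the personID index level that
# groupby adds so the result aligns with the original rows.
df['localMin'] = (df.groupby('personID')['points']
                    .expanding().min()
                    .reset_index(level=0, drop=True))

cummin is simply the more direct (and faster) spelling of the same operation.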

Python Pandas: MultiIndex groupby second level of columns

I'm trying to group rows by multiple columns.
What I want to achieve can be illustrated by this small example:
import pandas as pd

col_index = pd.MultiIndex.from_arrays([['A', 'A', 'B', 'B'],
                                       ['a', 'b', 'c', 'd']])
df = pd.DataFrame([[1, 2, 3, 3],
                   [4, 2, 2, 2],
                   [6, 4, 2, 2],
                   [1, 2, 4, 4],
                   [3, 8, 4, 4],
                   [1, 2, 3, 3]], columns=col_index)
The DataFrame created by this looks like this:
   A     B
   a  b  c  d
0  1  2  3  3
1  4  2  2  2
2  6  4  2  2
3  1  2  4  4
4  3  8  4  4
5  1  2  3  3
I would like to group by 'c' and 'd', in fact by the whole 'B' block.
This gives me "KeyError: 'c'":
#something like this
df.groupby(['c','d'], axis = 1, level = 1)
#or like this
df.groupby('B', axis = 1, level = 0)
I tried searching for answer but I can't seem to find any.
Can somebody tell me what I'm doing wrong?
This is one way of doing it, by resetting the columns first:
df.set_axis(df.columns.droplevel(0), axis=1).groupby(['c', 'd']).sum()
Out[531]:
      a   b
c d
2 2  10   6
3 3   2   4
4 4   4  10
You can also specify the two-level column keys explicitly:
df.groupby([("B", "c"), ("B", "d")]).sum()

calculating differences within groups

I have a DataFrame whose rows each provide the value of one feature at one time. Times are identified by the time column (there are about 1,000,000 distinct times). Features are identified by the feature column (there are a few dozen features). There is at most one row for any combination of feature and time. At each time, only some of the features are available; the only exception is feature 0, which is available at all times. I'd like to add to that DataFrame a column that shows the value of feature 0 at that time. Is there a reasonably fast way to do it?
For example, let's say I have
import pandas as pd

df = pd.DataFrame({
    'time': [1, 1, 2, 2, 2, 3, 3],
    'feature': [1, 0, 0, 2, 4, 3, 0],
    'value': [1, 2, 3, 4, 5, 6, 7],
})
I want to add a column that contains [2,2,3,3,3,7,7].
I tried to use groupby and boolean indexing but no luck.
I'd like to add to that DataFrame a column that shows the value of the feature 0 at that time. Is there a reasonably fast way to do it?
I think that a groupby (which is quite an expensive operation) is overkill for this. Try a merge with only the feature-0 rows:
>>> pd.merge(
...     df,
...     df[df.feature == 0].drop('feature', axis=1).rename(columns={'value': 'value_0'}))
   feature  time  value  value_0
0        1     1      1        2
1        0     1      2        2
2        0     2      3        3
3        2     2      4        3
4        4     2      5        3
5        3     3      6        7
6        0     3      7        7
Edit
Per @jezrael's request, here is a timing test:
import pandas as pd

m = 10000
df = pd.DataFrame({
    'time': list(range(m // 2)) * 2,
    'feature': list(range(m // 2)) + [0] * (m // 2),
    'value': range(m),
})
On this input, @jezrael's solution takes 396 ms, whereas mine takes 4.03 ms.
If you'd like to drop the zero rows and add them as a separate column (slightly different than your original request), you could do the following:
# Create initial dataframe.
df = pd.DataFrame({
    'time': [1, 1, 2, 2, 2, 3, 3],
    'feature': [1, 0, 0, 2, 4, 3, 0],
    'value': [1, 2, 3, 4, 5, 6, 7],
})
# Set the index to 'time'
df = df.set_index('time')
# Join the zero feature value to the non-zero feature rows.
>>> df.loc[df.feature > 0, :].join(df.loc[df.feature == 0, 'value'], rsuffix='_feature_0')
      feature  value  value_feature_0
time
1           1      1                2
2           2      4                3
2           4      5                3
3           3      6                7
You can set_index on the column value and then use groupby with transform('idxmin').
This solution works only if 0 is the minimum value in the feature column.
df = df.set_index('value')
df['diff'] = df.groupby('time')['feature'].transform('idxmin')
print(df.reset_index())
   value  feature  time  diff
0      1        1     1     2
1      2        0     1     2
2      3        0     2     3
3      4        2     2     3
4      5        4     2     3
5      6        3     3     7
6      7        0     3     7
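Another common idiom for this broadcast (a sketch, not from the original answers) builds a time -> feature-0-value mapping and applies it with .map():

import pandas as pd

df = pd.DataFrame({
    'time': [1, 1, 2, 2, 2, 3, 3],
    'feature': [1, 0, 0, 2, 4, 3, 0],
    'value': [1, 2, 3, 4, 5, 6, 7],
})

# Series mapping each time to the value of feature 0 at that time.
feature0 = df.loc[df['feature'] == 0].set_index('time')['value']
# Broadcast it to every row.
df['value_0'] = df['time'].map(feature0)

This yields the requested [2, 2, 3, 3, 3, 7, 7] without a merge or a groupby.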

Python - Get group names from aggregated results in pandas

I have a dataframe like this:
   minute  values
0       1       3
1       2       4
2       1       1
3       4       6
4       3       7
5       2       2
When I apply
df.groupby('minute').sum().sort_values('values', ascending=False)
This gives:
        values
minute
3            7
2            6
4            6
1            4
I want to get the first two values of the minute column as an array, like [3, 2]. How can I access the values in the minute column?
If what you want is the values from the minute column in the grouped dataframe (which is also its index), you can use DataFrame.index to access them. Example -
grouped = df.groupby('minute').sum().sort_values('values', ascending=False)
grouped.index[:2]
If you really want it as a list, you can use .tolist() to convert it to a list. Example -
grouped.index[:2].tolist()
Demo -
In [3]: df
Out[3]:
   minute  values
0       1       3
1       2       4
2       1       1
3       4       6
4       3       7
5       2       2

In [4]: grouped = df.groupby('minute').sum().sort_values('values', ascending=False)

In [5]: grouped.index[:2]
Out[5]: Int64Index([3, 2], dtype='int64', name='minute')

In [6]: grouped.index[:2].tolist()
Out[6]: [3, 2]
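A more direct variant (a sketch) skips the full sort and uses nlargest on the grouped sums:

import pandas as pd

df = pd.DataFrame({'minute': [1, 2, 1, 4, 3, 2],
                   'values': [3, 4, 1, 6, 7, 2]})

# Sum per minute, keep the two largest sums, read off their index.
top2 = df.groupby('minute')['values'].sum().nlargest(2).index.tolist()
print(top2)  # [3, 2]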
