I am trying to merge two datasets using pandas. One holds locations (longitude and latitude) and the other a time frame (0 to 24 hrs in 15-minute steps = 96 data points).
Here is the sample code:
s1 = pd.Series([1, 2, 3])
s2 = pd.Series([4, 5, 6])
df = pd.DataFrame([list(s1), list(s2)], columns = ["A", "B", "C"])
timeframe = 15
timeframe_array = []
for i in range(0, 3600, timeframe):
    timeframe_array.append(i)
And I want to get the data like this:
A B C time
0 1 2 3 0
1 1 2 3 15
2 1 2 3 30
3 1 2 3 45
...
How can I get the data like this?
While not particularly elegant, this should work:
from __future__ import division # only needed if you're using Python 2
import pandas as pd
from math import ceil
# Constants
timeframe = 15
total_t = 3600
Create df1:
s1 = [1, 2, 3]
s2 = [4, 5, 6]
df1 = pd.DataFrame([s1, s2], columns=['A', 'B', 'C'])
Next, we want to build df2 with enough 15-step time values to cover total_t for every row in df1. We can extract the number of rows with df1.shape[0] (which is 2 in this case).
df2 = pd.DataFrame({'time': range(0, total_t * df1.shape[0], timeframe)})
Next, you need to replicate the rows in df1 to match df2.
factor = ceil(df2.shape[0] / df1.shape[0])
df1_f = pd.concat([df1] * factor).sort_index().reset_index(drop=True)
Lastly, join the two data frames together and trim off any excess rows.
df3 = df1_f.join(df2, how='left')[:df2.shape[0]]
Pandas may have a built-in way to do this, but to my knowledge both join and merge can only make up a difference in rows by filling with a constant (NaN by default).
Result:
>>> print(df3.head(4))
A B C time
0 1 2 3 0
1 1 2 3 15
2 1 2 3 30
3 1 2 3 45
>>> print(df3.tail(4))
A B C time
476 4 5 6 7140
477 4 5 6 7155
478 4 5 6 7170
479 4 5 6 7185
>>> df3.shape # (480, 4)
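For reference, newer pandas (1.2+) does have a built-in way: merge with how='cross' pairs every row of df1 with every time step. Note that, unlike the continuous sequence above, this restarts the time values at 0 for each location row, which may be closer to what the question's expected output shows. A minimal sketch:

```python
import pandas as pd

df1 = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=['A', 'B', 'C'])
times = pd.DataFrame({'time': range(0, 3600, 15)})  # 240 time steps

# Cross join: every row of df1 paired with every time step
df3 = df1.merge(times, how='cross')
```

The result has df1.shape[0] * len(times) = 480 rows, with the time column running 0, 15, 30, ... for each source row in turn.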
Related
I am wondering how I can use groupby and head to get the first n values of a group of records, where n is encoded in a column in the original dataframe.
import pandas as pd
df = pd.DataFrame({"A": [1] * 4 + [2] * 3, "B": list(range(1, 8))})
gp = df.groupby("A").head(2)
print(gp)
This will return the first 2 records of each group. How would I go ahead if I wanted the first 1 of group 1, and the first 2 of group 2, as encoded in column A?
Desired outcome:
A B
0 1 1
4 2 5
5 2 6
We can create a sequential counter using groupby + cumcount to uniquely identify the rows within each group of column A. Then we build a boolean mask marking the rows where the counter value is less than or equal to the value encoded in column A, and use that mask to filter the required rows:
df[df.groupby('A').cumcount().add(1).le(df['A'])]
A B
0 1 1
4 2 5
5 2 6
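To see what the mask is doing, it can help to materialize the counter separately; a small sketch with the sample data:

```python
import pandas as pd

df = pd.DataFrame({"A": [1] * 4 + [2] * 3, "B": list(range(1, 8))})

counter = df.groupby("A").cumcount().add(1)  # 1-based position within each A-group
mask = counter.le(df["A"])                   # keep a row while its position <= its A value
print(df[mask])
```

The counter is [1, 2, 3, 4, 1, 2, 3], so only the first row of group 1 and the first two rows of group 2 survive the mask.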
Here is a solution using DataFrame.head inside a custom function; each group's A value is available as x.name, so each group is trimmed to that many rows:
gp = df.groupby("A", group_keys=False).apply(lambda x: x.head(x.name))
print(gp)
A B
0 1 1
4 2 5
5 2 6
If you instead need to filter by the order in which the A values appear (rather than by their magnitude), map each unique value to its position first:
df = pd.DataFrame({"A": [8] * 4 + [6] * 3, "B": list(range(1, 8))})
d = {v: k for k, v in enumerate(df.A.unique(), 1)}
gp = df.groupby("A", group_keys=False, sort=False).apply(lambda x: x.head(d[x.name]))
print(gp)
A B
0 8 1
4 6 5
5 6 6
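The same position mapping can also be built with pd.factorize, which labels each unique A value by first appearance; a sketch of that variant:

```python
import pandas as pd

df = pd.DataFrame({"A": [8] * 4 + [6] * 3, "B": list(range(1, 8))})

# factorize gives 0-based codes in order of first appearance: [0,0,0,0,1,1,1]
codes, _ = pd.factorize(df["A"])

# keep a row while its 0-based position within its group <= its group's code
gp = df[df.groupby("A").cumcount().le(codes)]
print(gp)
```

This keeps 1 row of the first group and 2 rows of the second, matching the output above.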
Another option builds the result by concatenating the head of each group, taking i + 1 rows from the i-th group:
df_ = pd.concat([gp[1].head(i + 1) for i, gp in enumerate(df.groupby("A"))])
print(df_)
A B
0 1 1
4 2 5
5 2 6
Given a dataframe, I want to obtain a list of distinct dataframes which together concatenate into the original.
The separation is by row indices, like so:
import pandas as pd
import numpy as np
data = {"a": np.arange(10)}
df = pd.DataFrame(data)
print(df)
a
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
separate_by = [1, 5, 6, ]
should give a list of
df1 =
a
0 0
df2 =
a
1 1
2 2
3 3
4 4
df3 =
a
5 5
df4 =
a
6 6
7 7
8 8
9 9
How can this be done in pandas?
Try:
groups = (pd.Series(1, index=separate_by)
.reindex(df.index,fill_value=0)
.cumsum()
)
out = {k:v for k,v in df.groupby(groups)}
then for example, out[2]:
a
5 5
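Since the question asks for a list rather than a dict, the same group labels can be fed straight to groupby and unpacked in order; a sketch:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"a": np.arange(10)})
separate_by = [1, 5, 6]

# 1 marks the start of a new block; cumsum turns the marks into block labels
groups = pd.Series(1, index=separate_by).reindex(df.index, fill_value=0).cumsum()
dfs = [g for _, g in df.groupby(groups)]
```

groupby keeps row order within each block, so concatenating dfs reproduces the original frame.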
Similar logic:
groups = np.zeros(len(df))
groups[separate_by] = 1
groups = np.cumsum(groups)
out = {k:v for k,v in df.groupby(groups)}
separate_by = [1, 5, 6]
separate_by.append(len(df))
separate_by.insert(0, 0)
# iloc slicing is end-exclusive, so each block runs up to (not including) the next boundary
dfs = [df.iloc[separate_by[i]:separate_by[i + 1]] for i in range(len(separate_by) - 1)]
Let us try
d = dict(tuple(df.groupby(df.index.isin(separate_by).cumsum())))
d[0]
Out[364]:
a
0 0
d[2]
Out[365]:
a
5 5
I have two data frames with different variable names
df1 = pd.DataFrame({'A':[2,2,3],'B':[5,5,6]})
>>> df1
A B
0 2 5
1 2 5
2 3 6
df2 = pd.DataFrame({'C':[3,3,3],'D':[5,5,6]})
>>> df2
C D
0 3 5
1 3 5
2 3 6
I want to create a third data frame where the n-th column is the product of the n-th columns of the first two data frames. In the above example, df3 would have two columns X and Y, where df3.X = df1.A * df2.C and df3.Y = df1.B * df2.D.
df3 = pd.DataFrame({'X':[6,6,9],'Y':[25,25,36]})
>>> df3
X Y
0 6 25
1 6 25
2 9 36
Is there a simple pandas function that allows me to do this?
You can use mul to multiply df1 by the values of df2:
df3 = df1.mul(df2.values)
df3.columns = ['X','Y']
>>> df3
X Y
0 6 25
1 6 25
2 9 36
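An equivalent variant renames df2's columns to match df1's before multiplying, so pandas' label alignment works in your favor instead of against you; a sketch using set_axis:

```python
import pandas as pd

df1 = pd.DataFrame({'A': [2, 2, 3], 'B': [5, 5, 6]})
df2 = pd.DataFrame({'C': [3, 3, 3], 'D': [5, 5, 6]})

# Give df2 the same labels as df1, multiply element-wise, then relabel the result
df3 = df1.mul(df2.set_axis(df1.columns, axis=1)).set_axis(['X', 'Y'], axis=1)
```

This avoids dropping to raw arrays at the cost of two renames.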
You can also use numpy, passing the underlying arrays so that pandas does not try to align the two frames' different column labels (which would produce all-NaN columns):
df3 = pd.DataFrame(np.multiply(df1.values, df2.values), columns=['X', 'Y'])
Note: most numpy operations accept a pandas Series or DataFrame directly.
I have a "sample.txt" like this.
idx A B C D cat
J 1 2 3 1 x
K 4 5 6 2 x
L 7 8 9 3 y
M 1 2 3 4 y
N 4 5 6 5 z
O 7 8 9 6 z
With this dataset, I want to get sums across both rows and columns.
Summing the rows by group is not a big deal; I produced that result like this:
### MY CODE ###
import pandas as pd
df = pd.read_csv('sample.txt',sep="\t",index_col='idx')
df.info()
df2 = df.groupby('cat').sum()
print( df2 )
The result is like this.
A B C D
cat
x 5 7 9 3
y 8 10 12 7
z 11 13 15 11
But I don't know how to write code to get a result like this
(simply adding the values in columns A and B, and likewise columns C and D):
AB CD
J 3 4
K 9 8
L 15 12
M 3 7
N 9 11
O 15 15
Could anybody help me write this code?
By the way, I don't want to do it like this
(it looks too dull, but if it is the only way, I'll accept it):
df2 = df['A'] + df['B']
df3 = df['C'] + df['D']
df = pd.DataFrame([df2,df3],index=['AB','CD']).transpose()
print( df )
When you pass a dictionary or callable to groupby, it gets applied along an axis. Here I specified axis=1, i.e. the columns:
d = dict(A='AB', B='AB', C='CD', D='CD')
df.groupby(d, axis=1).sum()
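Note that groupby(..., axis=1) is deprecated in pandas 2.x; grouping the transpose is a forward-compatible sketch of the same idea (using a small slice of the sample data here):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 4], 'B': [2, 5], 'C': [3, 6], 'D': [1, 2]},
                  index=['J', 'K'])  # first two rows of sample.txt
d = dict(A='AB', B='AB', C='CD', D='CD')

# Transpose, group the former columns by the mapping, sum, transpose back
out = df.T.groupby(d).sum().T
```

The double transpose turns the column-wise grouping into an ordinary row-wise one.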
Use concat with sum:
df = df.set_index('idx')
df = pd.concat([df[['A', 'B']].sum(1), df[['C', 'D']].sum(1)], axis=1, keys=['AB','CD'])
print( df)
AB CD
idx
J 3 4
K 9 8
L 15 12
M 3 7
N 9 11
O 15 15
Does this do what you need? By using axis=1 with DataFrame.apply, you can use the data that you want in a row to construct a new column. Then you can drop the columns that you don't want anymore.
In [1]: import pandas as pd
In [5]: df = pd.DataFrame(columns=['A', 'B', 'C', 'D'], data=[[1, 2, 3, 4], [1, 2, 3, 4]])
In [6]: df
Out[6]:
A B C D
0 1 2 3 4
1 1 2 3 4
In [7]: df['CD'] = df.apply(lambda x: x['C'] + x['D'], axis=1)
In [8]: df
Out[8]:
A B C D CD
0 1 2 3 4 7
1 1 2 3 4 7
In [13]: df.drop(['C', 'D'], axis=1)
Out[13]:
A B CD
0 1 2 7
1 1 2 7
How do I add an order number column to an existing DataFrame?
This is my DataFrame:
import pandas as pd
import math
frame = pd.DataFrame([[1, 4, 2], [8, 9, 2], [10, 2, 1]], columns=['a', 'b', 'c'])
def add_stats(row):
row['sum'] = sum([row['a'], row['b'], row['c']])
row['sum_sq'] = sum(math.pow(v, 2) for v in [row['a'], row['b'], row['c']])
row['max'] = max(row['a'], row['b'], row['c'])
return row
frame = frame.apply(add_stats, axis=1)
print(frame.head())
The resulting data is:
a b c sum sum_sq max
0 1 4 2 7 21 4
1 8 9 2 19 149 9
2 10 2 1 13 105 10
First, I would like to add 3 extra columns with order numbers, sorting on sum, sum_sq and max, respectively. Next, these 3 columns should be combined into one column - the mean of the order numbers - but I do know how to do that part (with apply and axis=1).
I think you're looking for rank where you mention sorting. Given your example, add:
frame['sum_order'] = frame['sum'].rank()
frame['sum_sq_order'] = frame['sum_sq'].rank()
frame['max_order'] = frame['max'].rank()
frame['mean_order'] = frame[['sum_order', 'sum_sq_order', 'max_order']].mean(axis=1)
To get:
a b c sum sum_sq max sum_order sum_sq_order max_order mean_order
0 1 4 2 7 21 4 1 1 1 1.000000
1 8 9 2 19 149 9 3 3 2 2.666667
2 10 2 1 13 105 10 2 2 3 2.333333
The rank method also has options to specify the behavior in the case of identical or NA values, for example.
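For instance, the method parameter controls how ties are ranked; a quick sketch:

```python
import pandas as pd

s = pd.Series([7, 19, 13, 19])

# 'average' (the default) splits tied ranks; 'min' gives both ties the lower rank
avg = s.rank().tolist()
low = s.rank(method='min').tolist()
```

Here the two 19s share ranks 3 and 4: 'average' assigns both 3.5, while 'min' assigns both 3.0.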