Could you suggest a method to create a DataFrame from a Series as described below?
Input Series
s = pd.Series([1,2,3,4,5,6])
Wanted DataFrame:
x y z
0 1 2 3
1 2 3 4
2 3 4 5
3 4 5 6
Of course I could do it with a loop, but I hope there is a more elegant way.
I'm not certain that's what you're looking for, but here's a pretty trivial way to do that:
df = pd.DataFrame({"x": s[:-2].values, "y": s[1:-1].values, "z": s[2:].values})
Output:
x y z
0 1 2 3
1 2 3 4
2 3 4 5
3 4 5 6
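For longer windows, a more general sketch using NumPy's sliding_window_view (available since NumPy 1.20; the column names here are just illustrative):
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5, 6])
# Build every length-3 window in one shot, then label the columns.
windows = np.lib.stride_tricks.sliding_window_view(s.to_numpy(), 3)
df = pd.DataFrame(windows, columns=["x", "y", "z"])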
My initial DataFrame looks like the one below (just an example; the real one contains different values and many more rows):
close
1
2
3
4
5
6
Now I need to select a hyperparameter N and, based on it, create a new DataFrame each time.
Suppose N starts at N=1;
then the new DataFrame should look like
X1 Y
1 2
2 3
3 4
4 5
and it will continue up to the length of the original DataFrame. Once that is done, it moves on to the next hyperparameter, N=2, and then my new DataFrame should look like
X1 X2 Y
1 2 3
2 3 4
3 4 5
4 5 6
and so on; it will continue up to a certain hyperparameter N.
I hope I've understood you correctly. This example will create a new DataFrame from df['close'] based on N:
N = 1
df_new = pd.DataFrame(
{
**{f"X{i+1}": df["close"].shift(-i)[: len(df) - N] for i in range(N)},
"Y": df["close"].shift(-N)[: len(df) - N],
}
).astype(int)
print(df_new)
Prints:
X1 Y
0 1 2
1 2 3
2 3 4
3 4 5
4 5 6
for N = 2:
X1 X2 Y
0 1 2 3
1 2 3 4
2 3 4 5
3 4 5 6
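To sweep the hyperparameter, you can wrap the same construction in a loop (N_MAX is an assumed name for your upper bound):
N_MAX = 3  # assumed upper bound for the hyperparameter sweep
frames = {}
for N in range(1, N_MAX + 1):
    frames[N] = pd.DataFrame(
        {
            **{f"X{i+1}": df["close"].shift(-i)[: len(df) - N] for i in range(N)},
            "Y": df["close"].shift(-N)[: len(df) - N],
        }
    ).astype(int)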
On pandas 0.24.1, given:
toy = pd.DataFrame({'g': [1,1,1,2,2,3,3], 'val': [0,2,3,4,5,6,0]})
toy
g val
0 1 0
1 1 2
2 1 3
3 2 4
4 2 5
5 3 6
6 3 0
I want to drop entire groups where the val column of the group contains a zero. In other words, after the incantation, which should preferably involve inplace=True to preserve my data transformation pipeline, I want toy to be:
g val
3 2 4
4 2 5
Using filter:
toy.groupby('g').filter(lambda x : all(x['val']!=0))
Out[58]:
g val
3 2 4
4 2 5
Or you can do:
toy[~toy.g.isin(toy.loc[toy.val.eq(0),'g'])]
g val
3 2 4
4 2 5
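Note that groupby(...).filter() has no inplace option, so if you specifically want to mutate toy in place as the question asks, one sketch along the same lines is to drop the offending rows with DataFrame.drop:
# groups whose 'val' contains a zero
bad_groups = toy.loc[toy['val'].eq(0), 'g']
# drop every row belonging to one of those groups, in place
toy.drop(toy.index[toy['g'].isin(bad_groups)], inplace=True)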
This works (np is numpy): a group whose product over val is nonzero cannot contain a zero:
import numpy as np

toy.loc[toy.groupby('g').transform(np.prod).query('val != 0').index, :]
I have a dataframe (new_col shows the desired result):
x y z new_col
1 2 3 1
1 2 3 4
1 2 3 7
1 2 3 10
1 2 3 13
I want to create a new column and set its first row to 1.
Every other value of the new column is the previous value plus 3 (the 3 comes from z): 1+3, then 4+3, and so on.
You can perform a shifted cumulative sum:
df['new'] = 1 + df['z'].shift().fillna(0).astype(int).cumsum()
print(df)
x y z new
0 1 2 3 1
1 1 2 3 4
2 1 2 3 7
3 1 2 3 10
4 1 2 3 13
You can use cumsum (note: Series.cumsum, not pd.cumsum).
If your DataFrame is called df:
df['new_column'] = df['z'].cumsum() - df['z'].iloc[0] + 1
Subtracting df['z'].iloc[0] and adding 1 (a net -2 here) shifts the sum so it starts at 1 as you requested.
You can do it like this:
df.assign(new_col = lambda x: 1 + x['z'].shift().cumsum()).fillna(1).astype(int)
x y z new_col
0 1 2 3 1
1 1 2 3 4
2 1 2 3 7
3 1 2 3 10
4 1 2 3 13
If you want more specific control over the type cast and NA filling, you can also use the more verbose:
df.assign(new_col = lambda x: 1 + x['z'].shift().cumsum()
).fillna({'new_col':1}).astype({'new_col': int})
Actually, you can use the same logic as in jpp's answer but wrap it in an assign call:
df.assign(new_col = lambda x: 1 + x['z'].shift().fillna(0).astype(int).cumsum())
There are various ways to do it, but you have two very simple ones here:
df['new_col'] = (3*df.x).cumsum() - 2
df['new_col'] = 3*df.index + 1
The former assumes that your 'x' column only contains value 1 (if not, you can easily create a column like this df['temp'] = 1).
And the latter assumes that your index has no holes (which can arise from drops, for instance). These two methods are easy to implement and very fast (much faster than shifted cumsums, for instance).
Moreover, if the step you need depends on the values in column z, it can easily be adapted:
df['new_col'] = (df.z*df.x).cumsum() - 2
x y z new_col
0 1 2 3 1
1 1 2 3 4
2 1 2 3 7
3 1 2 3 10
4 1 2 3 13
I want to sort a subset of a dataframe (say, between indexes i and j) according to some value. I tried
df2=df.iloc[i:j].sort_values(by=...)
df.iloc[i:j]=df2
There is no problem with the first line, but nothing happens when I run the second one (not even an error). How should I do it? (I also tried the update function, but that didn't work either.)
You need to assign back to the filtered DataFrame, converting to a NumPy array with .values to avoid index alignment:
df = pd.DataFrame({'A': [1,2,3,4,3,2,1,4,1,2]})
print(df)
A
0 1
1 2
2 3
3 4
4 3
5 2
6 1
7 4
8 1
9 2
i = 2
j = 7
df.iloc[i:j] = df.iloc[i:j].sort_values(by='A').values
print(df)
A
0 1
1 2
2 1
3 2
4 3
5 3
6 4
7 4
8 1
9 2
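For reference, the reason the original attempt appeared to do nothing: assigning a DataFrame aligns on the index, so the sorted rows are written straight back to their original positions. A minimal sketch of the difference:
df2 = df.iloc[i:j].sort_values(by='A')
df.iloc[i:j] = df2         # aligned on df2's index -> the sort is undone
df.iloc[i:j] = df2.values  # raw array -> the positions are overwritten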
I have a DataFrame like
Sou Des
1 3
1 4
2 3
2 4
3 1
3 2
4 1
4 2
I need to assign a random value between 0 and 1 to each pair, but both orderings of a pair, like "1-3" and "3-1", must get the same value. I'm expecting a result DataFrame like
Sou Des Val
1 3 0.1
1 4 0.6
2 3 0.9
2 4 0.5
3 1 0.1
3 2 0.9
4 1 0.6
4 2 0.5
How can I assign the same random value to mirrored pairs like "A-B" and "B-A" in pandas?
Let's first create a helper DF sorted along axis=1:
In [304]: x = pd.DataFrame(np.sort(df, axis=1), df.index, df.columns)
In [305]: x
Out[305]:
Sou Des
0 1 3
1 1 4
2 2 3
3 2 4
4 1 3
5 2 3
6 1 4
7 2 4
Now we can group by its columns:
In [306]: df['Val'] = (x.assign(c=1)
.groupby(x.columns.tolist())
.transform(lambda x: np.random.rand(1)))
In [307]: df
Out[307]:
Sou Des Val
0 1 3 0.989035
1 1 4 0.918397
2 2 3 0.463653
3 2 4 0.313669
4 3 1 0.989035
5 3 2 0.463653
6 4 1 0.918397
7 4 2 0.313669
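A usage note (not in the original answer): transform draws a fresh random number per group, so to get reproducible values across runs, seed NumPy first:
np.random.seed(42)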
Here is another way:
s = pd.crosstab(df.Sou, df.Des)
# np.random.random_integers is deprecated; randint's upper bound is exclusive
b = np.random.randint(-2000, 2001, size=(len(s), len(s)))
sy = (b + b.T) / 2  # symmetrize so (a, b) and (b, a) share the same value
s.mul(sy).replace(0, np.nan).stack().reset_index()
Out[292]:
Sou Des 0
0 1 3 -60.0
1 1 4 -867.0
2 2 3 269.0
3 2 4 1152.0
4 3 1 -60.0
5 3 2 269.0
6 4 1 -867.0
7 4 2 1152.0
The trick here is to do a bit of work away from the dataframe. You can break this down into three steps:
assemble a list of all tuples (a,b)
assign a random value to each pair so that (a,b) and (b,a) have the same value
fill in the new column
Assuming your dataframe is called df, we can make a list of all the pairs ordered so that a <= b. I think this will be easier than trying to keep track of both (a,b) and (b,a).
pairs = {(a, b) if a <= b else (b, a)
         for a, b in df.itertuples(index=False, name=None)}
It's simple enough to assign a random number to each of these pairs and store it in a dictionary, so I'll leave that to you. Call it pair_dict.
Now, we just have to lookup the values. We'll ultimately want to write
df['Val'] = df.apply(<some function>, axis=1)
where our function looks up the appropriate value in pair_dict.
Rather than try to cram it into a lambda (though we could), let's write it separately.
def func(row):
    if row['Sou'] <= row['Des']:
        key = (row['Sou'], row['Des'])
    else:
        key = (row['Des'], row['Sou'])
    return pair_dict[key]
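Putting the three steps together (a minimal sketch; using random.random() for pair_dict is my assumption, any RNG producing values in [0, 1) works):
import random

# step 2: one random value per unordered pair
pair_dict = {pair: random.random() for pair in pairs}
# step 3: fill in the new column
df['Val'] = df.apply(func, axis=1)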
If you are OK with the "random" value coming from the hash() method, you can achieve this with frozenset():
df = pd.DataFrame([[1,1,2,2,3,3,4,4],[3,4,3,4,1,2,1,2]]).T
df.columns = ['Sou','Des']
df['Val']= df.apply(lambda x: hash(frozenset([x["Sou"],x["Des"]])),axis=1)
print(df)
which gives:
Sou Des Val
0 1 3 1580307032
1 1 4 -1736016661
2 2 3 741508915
3 2 4 -1930135584
4 3 1 1580307032
5 3 2 741508915
6 4 1 -1736016661
7 4 2 -1930135584
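Note the hashes are arbitrary integers rather than values between 0 and 1 as the question asks; one hedged way to scale them into [0, 1) (my adaptation, not part of the original answer):
df['Val'] = df.apply(lambda x: (hash(frozenset([x['Sou'], x['Des']])) % 10**9) / 10**9, axis=1)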
Reference:
Why aren't Python sets hashable?