I am new to Python and am trying to rearrange rows. I have a dataframe such as:
X  Y
1  a
2  d
3  c
4  a
5  b
6  e
7  a
8  b
I have two questions:
1- How can I swap the 2nd row with the 5th, like this:
X  Y
1  a
5  b
3  c
4  a
2  d
6  e
7  a
8  b
2- How can I move the 6th row above the 3rd row, like this:
X  Y
1  a
2  d
6  e
3  c
4  a
5  b
7  a
8  b
First use DataFrame.iloc; Python counts from 0, so select the second row with 1 and the fifth with 4:
df.iloc[[1, 4]] = df.iloc[[4, 1]]
print (df)
X Y
0 1 a
1 5 b
2 3 c
3 4 a
4 2 d
5 6 e
6 7 a
7 8 b
For the second question, start again from the original dataframe: rename the index of the 6th row (label 5) to the label of the row it should come after (here 1), then sort the index with the only stable sort, mergesort, so tied labels keep their original order:
df = df.rename({5:1}).sort_index(kind='mergesort', ignore_index=True)
print (df)
X Y
0 1 a
1 2 d
2 6 e
3 3 c
4 4 a
5 5 b
6 7 a
7 8 b
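If the positions are known up front, the same move can also be done in one step with a plain positional reorder through .iloc (a sketch, using the sample data from the question):

```python
import pandas as pd

df = pd.DataFrame({'X': [1, 2, 3, 4, 5, 6, 7, 8],
                   'Y': ['a', 'd', 'c', 'a', 'b', 'e', 'a', 'b']})

# Explicit positional order: the 6th row (position 5) is moved above the
# 3rd row (position 2); all other rows keep their relative order.
order = [0, 1, 5, 2, 3, 4, 6, 7]
out = df.iloc[order].reset_index(drop=True)
```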
I have a dataframe created in pandas that looks like this:
A B C
X Y Z X Y Z X Y Z
Y K 2 5 12 11 9 8 4 5 12
K 4 4 13 15 5 4 6 7 2
K 6 7 14 0 2 3 0 6 8
C M 4 5 12 5 2 2 1 14 0
M 6 7 2 3 1 6 7 12 5
M 0 6 8 7 3 9 6 8 4
D N 7 1 13 15 9 8 1 13 5
N 9 0 14 0 5 4 0 14 6
N 3 2 12 5 2 3 1 2 2
I want to make it looks like this:
A B C
X Y Z X Y Z X Y Z
Y K 2 5 12 11 9 8 4 5 12
K 4 4 13 15 5 4 6 7 2
K 6 7 14 0 2 3 0 6 8
A B C
X Y Z X Y Z X Y Z
C M 4 5 12 5 2 2 1 14 0
M 6 7 2 3 1 6 7 12 5
M 0 6 8 7 3 9 6 8 4
A B C
X Y Z X Y Z X Y Z
D N 7 1 13 15 9 8 1 13 5
N 9 0 14 0 5 4 0 14 6
N 3 2 12 5 2 3 1 2 2
Is there any way I can do that? I have tried several approaches with concat/merge/join, but I couldn't find a way to keep the column names repeated for the "Y, C, D" groups.
No, it is not possible using the standard DataFrame string output/display functions. If there were a way to do it, it would be a "display option", but those are all listed here and I don't see a relevant one for your case: https://pandas.pydata.org/pandas-docs/stable/user_guide/options.html#available-options
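That said, if repeating the header in plain printed output is all you need, a manual workaround is to print each top-level row group separately; every group prints with its own copy of the column header. A sketch with small stand-in data (the values here are hypothetical):

```python
import numpy as np
import pandas as pd

# Stand-in frame with MultiIndex rows and columns, like the question's.
cols = pd.MultiIndex.from_product([['A', 'B', 'C'], ['x', 'y', 'z']])
idx = pd.MultiIndex.from_tuples([('Y', 'K'), ('Y', 'K'),
                                 ('C', 'M'), ('D', 'N')])
df = pd.DataFrame(np.arange(36).reshape(4, 9), index=idx, columns=cols)

# Printing each first-level group separately repeats the column header.
for key, sub in df.groupby(level=0, sort=False):
    print(sub)
    print()
```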
I have a formula that I want to turn into a pandas calculation; the formula is very simple:
NEW = A(where v=1) + A(where v=3) + A(where v=5)
I have a data frame like this:
Type subType value A NEW
X a 1 3 =3+9+9=21
X a 3 9
X a 5 9
X b 1 4 =4+5+0=9
X b 3 5
X b 5 0
Y a 1 1 =1+2+3=6
Y a 3 2
Y a 5 3
Y b 1 4 =4+5+2=11
Y b 3 5
Y b 5 2
Two questions:
I know I could just write out the calculation with the specific cells, but I want the code to look nicer; is there another way to get the value?
Because there will be only two results for X & Y, how can I add them to my original dataframe for further calculation? (My thought is not to add them to the dataframe and just use the values whenever they are needed for future calculations.)
I'm quite new to coding, so any answer will be appreciated!
Try this:
>>> import pandas as pd
>>> df = pd.DataFrame({'Type':['X','X','X','Y','Y','Y'], 'value':[1,3,5,1,3,5], 'A':[3,9,4,0,2,2]})
>>> df
Type value A
0 X 1 3
1 X 3 9
2 X 5 4
3 Y 1 0
4 Y 3 2
5 Y 5 2
>>> df.groupby('Type')['A'].sum()
Type
X    16
Y     4
Name: A, dtype: int64
>>> ur_dict = df.groupby('Type')['A'].sum().to_dict()
>>> df['NEW'] = df['Type'].map(ur_dict)
>>> df
Type value A NEW
0 X 1 3 16
1 X 3 9 16
2 X 5 4 16
3 Y 1 0 4
4 Y 3 2 4
5 Y 5 2 4
Hope this helps.
Edit to answer additional inquiry:
You are mapping tuple keys onto a Series, which will give you an error. You should move the columns you need into the index before doing the mapping.
See below:
>>> import pandas as pd
>>> df = pd.DataFrame({'Type':['X','X','X','X','X','X','Y','Y','Y','Y','Y','Y'], 'subType':['a','a','a','b','b','b','a','a','a','b','b','b'],'value':[1,3,5,1,3,5,1,3,5,1,3,5],'A':[3,9,9,4,5,0,1,2,3,4,5,2]})
>>> df
Type subType value A
0 X a 1 3
1 X a 3 9
2 X a 5 9
3 X b 1 4
4 X b 3 5
5 X b 5 0
6 Y a 1 1
7 Y a 3 2
8 Y a 5 3
9 Y b 1 4
10 Y b 3 5
11 Y b 5 2
>>> df.groupby(['Type', 'subType'])['A'].sum()
Type subType
X a 21
b 9
Y a 6
b 11
Name: A, dtype: int64
>>> ur_dict = df.groupby(['Type', 'subType'])['A'].sum().to_dict()
>>> ur_dict
{('X', 'a'): 21, ('X', 'b'): 9, ('Y', 'a'): 6, ('Y', 'b'): 11}
>>> df['NEW'] = df.set_index(['Type', 'subType']).index.map(ur_dict)
>>> df
Type subType value A NEW
0 X a 1 3 21
1 X a 3 9 21
2 X a 5 9 21
3 X b 1 4 9
4 X b 3 5 9
5 X b 5 0 9
6 Y a 1 1 6
7 Y a 3 2 6
8 Y a 5 3 6
9 Y b 1 4 11
10 Y b 3 5 11
11 Y b 5 2 11
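A shorter alternative, for what it's worth: groupby().transform('sum') broadcasts the per-group sum back onto every row in one step, so the intermediate dict and index manipulation aren't needed (a sketch with the same data):

```python
import pandas as pd

df = pd.DataFrame({'Type': ['X','X','X','X','X','X','Y','Y','Y','Y','Y','Y'],
                   'subType': ['a','a','a','b','b','b','a','a','a','b','b','b'],
                   'value': [1, 3, 5, 1, 3, 5, 1, 3, 5, 1, 3, 5],
                   'A': [3, 9, 9, 4, 5, 0, 1, 2, 3, 4, 5, 2]})

# transform('sum') returns a Series aligned with df's index, with each
# group's total repeated on every row of that group.
df['NEW'] = df.groupby(['Type', 'subType'])['A'].transform('sum')
```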
I'm trying to merge two data frames based on a column present in both, keeping only the intersection of the two sets.
The desired result is:
foo bar foobar
x y z x j i x y z j i
a 1 2 a 9 0 a 1 2 9 0
b 3 4 b 9 0 b 3 4 9 0
c 5 6 c 9 0 c 5 6 9 0
d 7 8 e 9 0
f 9 0
My code that does not produce the desired result is:
pd.merge(foo, bar, how='inner', on='x')
Instead, the code seems to return:
foo bar foobar
x y z x j i x y z j i
a 1 2 a 9 0 a 1 2 9 0
b 3 4 b 9 0 b 3 4 9 0
c 5 6 c 9 0 c 5 6 9 0
d 7 8 e 9 0 e * * 9 0
f 9 0 f * * 9 0
(where * represents an NaN)
Where am I going wrong? I've already reached the third page of Google results trying to fix this and nothing works. Whatever I do I get an outer join, with all rows from both sets.
Usually it means that you have duplicates in the column(s) used for joining, resulting in a Cartesian product.
Demo:
In [35]: foo
Out[35]:
x y z
0 a 1 2
1 b 3 4
2 c 5 6
3 d 7 8
In [36]: bar
Out[36]:
x j i
0 a 9 0
1 b 9 0
2 a 9 0
3 a 9 0
4 b 9 0
In [37]: pd.merge(foo, bar)
Out[37]:
x y z j i
0 a 1 2 9 0
1 a 1 2 9 0
2 a 1 2 9 0
3 b 3 4 9 0
4 b 3 4 9 0
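To catch this early, merge's validate parameter can assert key uniqueness and raise instead of silently multiplying rows; dropping the duplicates first restores the expected inner join. A sketch with stand-in data like the demo above:

```python
import pandas as pd

foo = pd.DataFrame({'x': list('abcd'), 'y': [1, 3, 5, 7], 'z': [2, 4, 6, 8]})
bar = pd.DataFrame({'x': ['a', 'b', 'a', 'a', 'b'], 'j': 9, 'i': 0})

# validate='one_to_one' raises a MergeError as soon as the key column is
# not unique on either side, instead of producing a Cartesian product.
try:
    pd.merge(foo, bar, on='x', validate='one_to_one')
except pd.errors.MergeError as e:
    print('duplicate keys detected:', e)

# De-duplicating the key column first gives the expected inner join.
merged = pd.merge(foo, bar.drop_duplicates('x'), on='x')
```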
(I'm a pandas n00b) I have some oddly formatted CSV data that resembles this:
i A B C
x y z x y z x y z
-------------------------------------
1 1 2 3 4 5 6 7 8 9
2 1 2 3 3 2 1 2 1 3
3 9 8 7 6 5 4 3 2 1
where A, B, C are categorical and the properties x, y, z are present for each. What I think I want to do (part of a larger split-apply-combine step) is to read data with Pandas such that I have dimensionally homogenous observations like this:
i id GRP x y z
-----------------------
1 1 A 1 2 3
2 1 B 4 5 6
3 1 C 7 8 9
4 2 A 1 2 3
5 2 B 3 2 1
6 2 C 2 1 3
7 3 A 9 8 7
8 3 B 6 5 4
9 3 C 3 2 1
So how best to accomplish this?
#1: I thought about reading the file with basic read_csv() options, then iterating/slicing/transposing to build another dataframe with the structure I want. But in my case the number of categories (A, B, C) and properties (x, y, z) is large and not known ahead of time. I'm also worried about memory issues when scaling to large datasets.
#2: I like the idea of setting the iterator param in read_csv() and then yielding multiple observations per line (any reason why not to set chunksize=1?). At least I wouldn't be creating multiple dataframes this way.
What's the smarter way to do this?
First, construct a sample dataframe like yours (pd.MultiIndex.from_product builds the two-level columns; the old labels= argument of the MultiIndex constructor has been renamed to codes= in recent pandas):

import numpy as np
import pandas as pd

column = pd.MultiIndex.from_product([['A', 'B', 'C'], ['x', 'y', 'z']])
df = pd.DataFrame(np.random.randint(1, 10, size=(3, 9)),
                  columns=column, index=[1, 2, 3])
print(df)
# A B C
# x y z x y z x y z
# 1 5 7 4 7 7 8 9 1 9
# 2 8 5 1 8 5 9 4 4 2
# 3 4 9 6 2 1 4 6 1 6
To get your desired output, reshape the dataframe using df.stack() and then reset the index:
df = df.stack(0).reset_index()
df.index += 1  # make the index begin at 1
print(df)
# level_0 level_1 x y z
# 1 1 A 5 7 4
# 2 1 B 7 7 8
# 3 1 C 9 1 9
# 4 2 A 8 5 1
# 5 2 B 8 5 9
# 6 2 C 4 4 2
# 7 3 A 4 9 6
# 8 3 B 2 1 4
# 9 3 C 6 1 6
Then you can just rename the columns as you want. Hope it helps.
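The rename step can be done with DataFrame.rename on the auto-generated level_0/level_1 columns; the target names id and GRP below are just the ones from the question's desired output (a sketch, rebuilding a small frame like the one above):

```python
import numpy as np
import pandas as pd

# Rebuild a frame like the answer's sample, then stack and rename.
cols = pd.MultiIndex.from_product([['A', 'B', 'C'], ['x', 'y', 'z']])
df = pd.DataFrame(np.arange(1, 28).reshape(3, 9),
                  columns=cols, index=[1, 2, 3])

out = df.stack(0).reset_index()
out = out.rename(columns={'level_0': 'id', 'level_1': 'GRP'})
out.index += 1  # index starting at 1, as in the desired output
```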