Manipulate a new column after 'df.size()' function? - python

After executing a size() call on a grouped DataFrame as seen below (df = DataFrame) in the pandas module, I've obtained what looks like a new column beside the one labeled No.. However, I'm not sure how to manipulate this new column, because I don't know its label/key.
For example, I want to express the generated values as a fraction of their total sum, in a new column. How can I do so?
JuncNo = pd.read_csv(filename)
JuncNo_group = JuncNo.groupby('No.')
JuncSize = JuncNo_group.size()
JuncSize.head(n=6)
No.
1 122
2 2136
3 561
4 91
5 10
6 3
dtype: int64

The result of groupby(...).size() is a Series, which is why the new "column" has no label. You have to set the name of the new Series and reset the index:
JuncSize = JuncNo.groupby('No.').size()
JuncSize.name = 'size'
JuncSize = JuncSize.reset_index()
print(JuncSize)
But if you need to add a new column with the same number of rows as the original DataFrame, you can use transform:
JuncSize['size'] = JuncSize.groupby('No')['Code'].transform('size')
Example:
print(JuncSize)
No Code
0 D B2
1 B B2
2 B B3
3 B B3
4 G B3
5 B B3
JuncSize['size'] = JuncSize.groupby('No')['Code'].transform('size')
print(JuncSize)
No Code size
0 D B2 1
1 B B2 4
2 B B3 4
3 B B3 4
4 G B3 1
5 B B3 4
JuncSize = JuncSize.groupby('No').size()
print(JuncSize)
No
B 4
D 1
G 1
dtype: int64
JuncSize.name = 'size'
print(JuncSize)
No
B 4
D 1
G 1
Name: size, dtype: int64
JuncSize = JuncSize.reset_index()
print(JuncSize)
No size
0 B 4
1 D 1
2 G 1
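The original question also asked for the sizes as a fraction of their sum; once the Series has a name and a flat index, that is a plain column division. A minimal sketch, assuming the data layout from the question:
import pandas as pd

JuncNo = pd.read_csv(filename)
JuncSize = JuncNo.groupby('No.').size().reset_index(name='size')
# each group's size as a fraction of the total number of rows
JuncSize['fraction'] = JuncSize['size'] / JuncSize['size'].sum()
print(JuncSize)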

Related

Extract TLDs , SLDs from a dataframe column into new columns

I am trying to extract the top-level domain (TLD), second-level domain (SLD), etc. from a column in a DataFrame and add them to new columns. Currently I have a solution where I split the column into lists and then expand them with tolist, but since this appends sequentially, it does not work correctly. For example, if the URL has 3 levels, then the mapping gets messed up:
df = pd.DataFrame({"A": [1,2,3], "B": [2,3,4],"C":["xyz[.]com","abc123[.]pro","xyzabc[.]gouv[.]fr"]})
df['C'] = df.C.apply(lambda x: x.split('[.]'))
df.head()
A B C
0 1 2 [xyz, com]
1 2 3 [abc123, pro]
2 3 4 [xyzabc, gouv, fr]
d = [pd.DataFrame(df[col].tolist()).add_prefix(col) for col in df.columns]
df = pd.concat(d, axis=1)
df.head()
A0 B0 C0 C1 C2
0 1 2 xyz com None
1 2 3 abc123 pro None
2 3 4 xyzabc gouv fr
I want C2 to always contain the TLD (com,pro,fr) and C1 to always contain SLD
I am sure there is a better way to do this correctly and would appreciate any pointers.
You can shift the Cx columns to the right so the TLD always lands in the last column:
# shift each row right by its number of missing cells,
# so shorter splits end up right-aligned
df.loc[:, "C0":] = df.loc[:, "C0":].apply(
    lambda x: x.shift(periods=x.isna().sum()), axis=1
)
print(df)
print(df)
Prints:
A0 B0 C0 C1 C2
0 1 2 NaN xyz com
1 2 3 NaN abc123 pro
2 3 4 xyzabc gouv fr
You can also use a regex with a negative lookahead and pandas' built-in str.split with expand=True:
df[['C0', 'C2']] = df.C.str.split(r'\[\.\](?!.*\[\.\])', expand=True)
df[['C0', 'C1']] = df.C0.str.split(r'\[\.\]', expand=True)
that gives
A B C C0 C2 C1
0 1 2 xyz[.]com xyz com None
1 2 3 abc123[.]pro abc123 pro None
2 3 4 xyzabc[.]gouv[.]fr xyzabc fr gouv
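Not in the original answers, but the same right-alignment idea can be done in one pass by left-padding the split lists, so the TLD always ends up in the last column; a sketch, assuming df still has the raw C column:
parts = df.C.str.split(r'\[\.\]')
width = parts.str.len().max()
# pad shorter splits on the left so the last element is always the TLD
padded = parts.apply(lambda p: [None] * (width - len(p)) + p)
df[[f'C{i}' for i in range(width)]] = pd.DataFrame(padded.tolist(), index=df.index)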

Append data frame issue

The data frame:
ID  C1  C2  M1
1   A   B   X
2   A       Y
3   C       W
4   G   H   Z
Result wanted:
ID  C
1   A
1   B
2   A
3   C
4   G
4   H
The main problem is that today's dataset has C1 and C2; tomorrow we could have C1, C2, C3 ... Cn. The filename will be provided, and my task is to read it and produce the result regardless of how many C-related columns the file may have. Column M1 is not needed.
What I tried:
df = pd.read_csv(r"C:\Users\JIRAdata_TEST.csv")
df = df.filter(regex='ID|C')
print(df)
This returns the ID and all C-related columns and removes the M1 column as part of data cleanup; I don't know if that helps.
Then... I am stuck!
Use df.melt with df.dropna:
In [1295]: x = df.filter(regex='ID|C').melt('ID', value_name='C').sort_values('ID').dropna().drop(columns='variable')
In [1296]: x
Out[1296]:
ID C
0 1 A
4 1 B
1 2 A
2 3 C
3 4 G
7 4 H
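A self-contained version of the same approach; the stricter pattern ^(ID|C\d+)$ is my own assumption, to avoid accidentally matching other columns whose names merely contain a C:
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3, 4],
                   'C1': ['A', 'A', 'C', 'G'],
                   'C2': ['B', None, None, 'H'],
                   'M1': ['X', 'Y', 'W', 'Z']})
out = (df.filter(regex=r'^(ID|C\d+)$')   # keep ID and C1..Cn, drop M1
         .melt('ID', value_name='C')     # wide to long
         .dropna(subset=['C'])
         .sort_values('ID')
         .drop(columns='variable')
         .reset_index(drop=True))
print(out)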

how to merge and update 2 Dataframes

a1 = pd.DataFrame({'A': [1,2,3], 'B': [2,3,4]})
b2 = pd.DataFrame({'A': [1,4], 'B': [3,6]})
and I want to get
c = pd.DataFrame({'A': [1,2,3,4], 'B': [3,3,4,6]})
a1 and b2 should be merged on the key 'A', but when 'A' is equal and 'B' differs, the value from b2 should win. How can I get this to work? I have no idea.
First, concatenate both DataFrames one under the other to get one big DataFrame:
c = pd.concat([a1, b2], axis=0)
A B
0 1 2
1 2 3
2 3 4
0 1 3
1 4 6
Then group on column A to get the unique values of A; by using last() you make sure that when there is a duplicate, the value from b2 (which comes last) is used. This gives:
c = c.groupby('A').last()
B
A
1 3
2 3
3 4
4 6
Then reset the index to get a clean numerical index.
c = c.reset_index()
which returns:
A B
0 1 3
1 2 3
2 3 4
3 4 6
To do it all in one go, just run:
c = pd.concat([a1, b2], axis=0)
c = c.groupby('A').last().reset_index()
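Putting it together as a runnable check (the assert against the desired frame is my addition):
import pandas as pd

a1 = pd.DataFrame({'A': [1, 2, 3], 'B': [2, 3, 4]})
b2 = pd.DataFrame({'A': [1, 4], 'B': [3, 6]})
# rows from b2 come last, so groupby('A').last() prefers them on duplicate keys
c = pd.concat([a1, b2], axis=0).groupby('A').last().reset_index()
assert c.equals(pd.DataFrame({'A': [1, 2, 3, 4], 'B': [3, 3, 4, 6]}))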

Extract all the following rows in pandas

I have the following pandas DataFrame:
df
A B
1 b0
2 a0
3 c0
5 c1
6 a1
7 b1
8 b2
The first row where column B starts with "a" is
df[df.B.str.startswith("a")]
A B
2 a0
I would like to extract the first row in column B that starts with a and every row after. My desired result is below
A B
2 a0
3 c0
5 c1
6 a1
7 b1
8 b2
How can this be done?
One option is to create a mask and forward-fill it, so that everything from the first match onward is selected:
import numpy as np

mask = df.B.str.startswith("a")
mask[~mask] = np.nan   # keep the True entries, blank out the False ones
df[mask.ffill().fillna(False).astype(bool)]
Another option is to build an index range:
first = df[df.B.str.startswith("a")].index[0]
df.loc[first:]
The latter approach assumes that an "a" is always present.
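A related one-liner, not from the original answers: cummax() on the boolean mask turns it into "True from the first match onward", so the whole selection becomes:
df[df.B.str.startswith("a").cummax()]
Unlike the index-range approach, this simply returns an empty frame when no row matches.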
Using idxmax to find the first True:
df.loc[df.B.str[0].eq('a').idxmax():]
A B
1 2 a0
2 3 c0
3 5 c1
4 6 a1
5 7 b1
6 8 b2
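One caveat worth noting (my addition, not from the original answer): idxmax() on an all-False mask returns the first index label, so the slice would silently return the whole frame; a guarded sketch:
starts = df.B.str[0].eq('a')
result = df.loc[starts.idxmax():] if starts.any() else df.iloc[0:0]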
If I understand your question correctly, here is how you do it:
df = pd.DataFrame(data={'A': [1, 2, 3, 5, 6, 7, 8],
                        'B': ['b0', 'a0', 'c0', 'c1', 'a1', 'b1', 'b2']})
# index of the first item whose B begins with "a"
index = df[df.B.str.startswith("a")].index[0]
desired_df = pd.concat([df.A[index:], df.B[index:]], axis=1)
print(desired_df)
and you get:
A B
1 2 a0
2 3 c0
3 5 c1
4 6 a1
5 7 b1
6 8 b2

problems with MultiIndex

I'm having problems with MultiIndex and stack(). The following example is based on a solution from Calvin Cheung on Stack Overflow.
=== multi.csv ===
h1,main,h3,sub,h5
a,A,1,A1,1
b,B,2,B1,2
c,B,3,A1,3
d,A,4,B2,4
e,A,5,B3,5
f,B,6,A2,6
=== multi.py ===
#!/usr/bin/env python
import pandas as pd
df1 = pd.read_csv('multi.csv')
df2 = df1.pivot('main', 'sub').stack()
print(df2)
=== output ===
h1 h3 h5
main sub
A A1 a 1 1
B2 d 4 4
B3 e 5 5
B A1 c 3 3
A2 f 6 6
B1 b 2 2
This works as long as the entries in the sub column are unique with respect to the corresponding entry in the main column. But if we change the sub entry in row e to B2, then B2 is no longer unique within the group of A rows, and we get an error message: "pandas.core.reshape.ReshapeError: Index contains duplicate entries, cannot reshape".
I was expecting the shape of the sub index to behave like the primary index, where duplicates are indicated with blank entries under the first row entry.
=== expected output ===
h1 h3 h5
main sub
A A1 a 1 1
B2 d 4 4
e 5 5
B A1 c 3 3
A2 f 6 6
B1 b 2 2
So my question is, how can I structure a MultiIndex in a way that allows duplicates in sub-levels?
Rather than do a pivot*, just set_index directly (this works for both examples):
In [11]: df
Out[11]:
h1 main h3 sub h5
0 a A 1 A1 1
1 b B 2 B1 2
2 c B 3 A1 3
3 d A 4 B2 4
4 e A 5 B2 5
5 f B 6 A2 6
In [12]: df.set_index(['main', 'sub'])
Out[12]:
h1 h3 h5
main sub
A A1 a 1 1
B B1 b 2 2
A1 c 3 3
A B2 d 4 4
B2 e 5 5
B A2 f 6 6
*You're not really doing a pivot here anyway, it just happens to work in the above case.
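If you also want the grouped display from the expected output, where repeated labels are blanked under the first occurrence, sorting the index groups the duplicates together; a small sketch:
df2 = df.set_index(['main', 'sub']).sort_index()
print(df2)
pandas sparsifies the display of adjacent duplicate index labels, so after sort_index() the output matches the expected layout.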
