Pandas read_csv usecols same index - python

Consider the following code:
import pandas as pd
from io import StringIO
x='''
a,b,c,d
1,2,3,4
5,6,7,8
9,10,11,12
13,14,15,16
17,18,19,20
'''
df = pd.read_csv(StringIO(x), skipinitialspace=True, usecols=[2,3,2])
print(df)
Output:
c d
0 3 4
1 7 8
2 11 12
3 15 16
4 19 20
Is there any way I can get
c d c
0 3 4 3
1 7 8 7
2 11 12 11
3 15 16 15
4 19 20 19

You can use the iloc[] indexer:
In [67]: pd.read_csv(StringIO(x), skipinitialspace=True).iloc[:, [2,3,2]]
Out[67]:
c d c
0 3 4 3
1 7 8 7
2 11 12 11
3 15 16 15
4 19 20 19
But as @Boud has already mentioned in the comments, it is much more efficient to use the usecols parameter, because columns we don't need are then neither parsed nor kept in memory. After reading, repeat the column either by name, if you know the names of the columns in the CSV file:
In [6]: pd.read_csv(StringIO(x), skipinitialspace=True, usecols=[2,3,2]).loc[:, ['c','d','c']]
Out[6]:
c d c
0 3 4 3
1 7 8 7
2 11 12 11
3 15 16 15
4 19 20 19
or by position, if you know the columns' positions in the new DataFrame:
In [7]: pd.read_csv(StringIO(x), skipinitialspace=True, usecols=[2,3,2]).iloc[:, [0,1,0]]
Out[7]:
c d c
0 3 4 3
1 7 8 7
2 11 12 11
3 15 16 15
4 19 20 19
PS you may also want to read about Pandas boolean indexing
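The usecols-then-reindex idea above can be wrapped in a small helper. This is only a minimal Python 3 sketch: `read_repeated` and `csv_text` are hypothetical names, not part of pandas, and it assumes the columns are requested by integer position.

```python
from io import StringIO

import pandas as pd

csv_text = """a,b,c,d
1,2,3,4
5,6,7,8
9,10,11,12
"""


def read_repeated(buffer, positions):
    """Parse only the requested columns, then repeat/reorder them.

    Hypothetical helper for illustration, not a pandas API.
    """
    # read_csv returns each requested column once, in file order,
    # so pass it the de-duplicated, sorted positions.
    unique = sorted(set(positions))
    narrow = pd.read_csv(buffer, usecols=unique)
    # Translate original file positions into positions in the narrow frame.
    return narrow.iloc[:, [unique.index(p) for p in positions]]


result = read_repeated(StringIO(csv_text), [2, 3, 2])
print(result)
```

Only columns c and d are parsed here, and the duplicate is produced afterwards by iloc.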

Related

reshape Pandas dataframe by appending column to column

I have a pandas DataFrame like this (df1):
0 1 2 3 4 5
0 a b c d e f
1 1 4 7 10 13 16
2 2 5 8 11 14 17
3 3 6 9 12 15 18
and I want to generate a DataFrame like this (df2):
0 1 2
0 a b c
1 1 4 7
2 2 5 8
3 3 6 9
4 d e f
5 10 13 16
6 11 14 17
7 12 15 18
Additional information about the given df:
The shape of the given df is unknown: b = df1.shape -> b = (n, m)
The width of df1 is guaranteed to be divisible by 3.
I tried stack, melt and wide_to_long. With stack the order of the rows is lost; the rows should be ordered as shown in the example df2. I would really appreciate any help.
Kind regards Hans
Use np.vstack and np.hsplit:
>>> pd.DataFrame(np.vstack(np.hsplit(df.to_numpy(), df.shape[1] // 3)))
0 1 2
0 a b c
1 1 4 7
2 2 5 8
3 3 6 9
4 d e f
5 10 13 16
6 11 14 17
7 12 15 18
Another example:
>>> df
0 1 2 3 4 5 6 7 8
0 a b c d e f g h i
1 1 4 7 10 13 16 19 22 25
2 2 5 8 11 14 17 20 23 26
3 3 6 9 12 15 18 21 24 27
>>> pd.DataFrame(np.vstack(np.hsplit(df.to_numpy(), df.shape[1] // 3)))
0 1 2
0 a b c
1 1 4 7
2 2 5 8
3 3 6 9
4 d e f
5 10 13 16
6 11 14 17
7 12 15 18
8 g h i
9 19 22 25
10 20 23 26
11 21 24 27
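For readers who want to run the split-and-stack approach end to end, here is a self-contained sketch. The construction of `df` is an assumption that mirrors the six-column example from the question; `blocks` and `stacked` are illustrative names.

```python
import numpy as np
import pandas as pd

# Rebuild the six-column example frame from the question.
df = pd.DataFrame(
    [
        list("abcdef"),
        [1, 4, 7, 10, 13, 16],
        [2, 5, 8, 11, 14, 17],
        [3, 6, 9, 12, 15, 18],
    ]
)

# Split the underlying array into equal blocks of three columns each,
# then stack the blocks on top of one another.
blocks = np.hsplit(df.to_numpy(), df.shape[1] // 3)
stacked = pd.DataFrame(np.vstack(blocks))
print(stacked)
```

Integer division keeps the number of sections an int, and np.hsplit raises a ValueError if the width is not evenly divisible.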
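You can split the columns into two halves and stack them. DataFrame.append was removed in pandas 2.0, so pd.concat is used here. The slices below fit the six-column example:
a = df[df.columns[: len(df.columns) // 3 + 1]]
b = df[df.columns[len(df.columns) // 3 + 1 :]]
b.columns = a.columns
df_out = pd.concat([a, b]).reset_index(drop=True)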
print(df_out)
Prints:
0 1 2
0 a b c
1 1 4 7
2 2 5 8
3 3 6 9
4 d e f
5 10 13 16
6 11 14 17
7 12 15 18
EDIT: To handle unknown widths:
dfs = []
for i in range(0, len(df.columns), 3):
    dfs.append(df[df.columns[i : i + 3]])
    dfs[-1].columns = df.columns[:3]
df_out = pd.concat(dfs)
print(df_out)
Prints:
0 1 2
0 a b c
1 1 4 7
2 2 5 8
3 3 6 9
0 d e f
1 10 13 16
2 11 14 17
3 12 15 18
0 g h i
1 19 22 25
2 20 23 26
3 21 24 27
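The loop above keeps each chunk's original row labels, so the index repeats. If a continuous index is wanted, a variation like the following might help. It is a sketch under the assumption that the chunk width is 3; `width`, `chunks` and the sample `df` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame(
    [
        list("abcdefghi"),
        [1, 4, 7, 10, 13, 16, 19, 22, 25],
        [2, 5, 8, 11, 14, 17, 20, 23, 26],
    ]
)

width = 3  # assumed chunk width; the total width must be a multiple of it

# Slice the frame into chunks by position and give every chunk the same
# column labels so that concat aligns them.
chunks = [
    df.iloc[:, start:start + width].set_axis(range(width), axis=1)
    for start in range(0, df.shape[1], width)
]

# ignore_index=True renumbers the rows 0..n-1 instead of repeating labels.
df_out = pd.concat(chunks, ignore_index=True)
print(df_out)
```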

Insert new dataframe to existing dataframe into specific row position in Pandas

I have a df1 and df2 as follows:
df1:
a b c
0 1 2 4
1 6 12 24
2 7 14 28
3 4 8 16
4 3 6 12
df2:
a b c
0 7 8 9
1 10 11 12
How can I insert df2 into df1 after the second row? My desired output looks like this:
a b c
0 1 2 4
1 6 12 24
2 7 8 9
3 10 11 12
4 7 14 28
5 4 8 16
6 3 6 12
Thank you.
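Use concat, splitting the first DataFrame with DataFrame.iloc (pass ignore_index=True instead if you want the index renumbered from 0 to 6, as in the desired output):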
df = pd.concat([df1.iloc[:2], df2, df1.iloc[2:]], ignore_index=False)
print (df)
a b c
0 1 2 4
1 6 12 24
0 7 8 9
1 10 11 12
2 7 14 28
3 4 8 16
4 3 6 12
Here is another way using np.r_:
df2.index=range(len(df1),len(df1)+len(df2)) #change index where df1 ends
final=pd.concat((df1,df2)) #concat
final.iloc[np.r_[0,1,df2.index,2:len(df1)]] #select ordering with iloc
#final.iloc[np.r_[0:2,df2.index,2:len(df1)]]
a b c
0 1 2 4
1 6 12 24
5 7 8 9
6 10 11 12
2 7 14 28
3 4 8 16
4 3 6 12
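The concat-based insertion can be generalised to any row position. The following is a minimal sketch; `insert_rows` is a hypothetical helper name, not a pandas function, and it renumbers the index to match the desired output in the question.

```python
import pandas as pd

df1 = pd.DataFrame(
    {"a": [1, 6, 7, 4, 3], "b": [2, 12, 14, 8, 6], "c": [4, 24, 28, 16, 12]}
)
df2 = pd.DataFrame({"a": [7, 10], "b": [8, 11], "c": [9, 12]})


def insert_rows(target, new_rows, position):
    """Return a new frame with `new_rows` placed before row `position`.

    Hypothetical helper for illustration; the index is renumbered from 0.
    """
    return pd.concat(
        [target.iloc[:position], new_rows, target.iloc[position:]],
        ignore_index=True,
    )


result = insert_rows(df1, df2, 2)
print(result)
```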

merge two pandas data frame and skip common columns of right

I am using a pandas DataFrame as a lightweight dataset to maintain some status, and I need to dynamically and continuously merge new DataFrames into the existing table. Say I have two datasets as below:
df1:
a b
0 0 1
1 2 3
2 4 5
3 6 7
4 8 9
df2:
b c
0 10 11
1 12 13
2 14 15
3 16 17
4 18 19
I want to merge df2 to df1 (on index), and for columns in common (in this case, it is 'b'), simply discard the common column of df2.
a b c
0 0 1 11
1 2 3 13
2 4 5 15
3 6 7 17
4 8 9 19
My code finds the columns common to df1 and df2 using a set, so that I can manually drop them from df2. Is there a more efficient way to do this?
First identify the columns in df2 not in df1
cols = df2.columns.difference(df1.columns)
Then pd.DataFrame.join
df1.join(df2[cols])
a b c
0 0 1 11
1 2 3 13
2 4 5 15
3 6 7 17
4 8 9 19
Or pd.concat will also work
pd.concat([df1, df2[cols]], axis=1)
a b c
0 0 1 11
1 2 3 13
2 4 5 15
3 6 7 17
4 8 9 19
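Since the question mentions merging new DataFrames repeatedly, the join approach can be wrapped in a small function. This is a sketch only: `add_new_columns` and `df3` are hypothetical names introduced for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({"a": [0, 2, 4, 6, 8], "b": [1, 3, 5, 7, 9]})
df2 = pd.DataFrame({"b": [10, 12, 14, 16, 18], "c": [11, 13, 15, 17, 19]})
df3 = pd.DataFrame({"c": [0, 0, 0, 0, 0], "d": [20, 21, 22, 23, 24]})


def add_new_columns(table, update):
    """Join `update` onto `table` by index, keeping only the columns
    that `table` does not already have (hypothetical helper)."""
    new_cols = update.columns.difference(table.columns)
    return table.join(update[new_cols])


status = add_new_columns(df1, df2)     # adds 'c', ignores df2's 'b'
status = add_new_columns(status, df3)  # adds 'd', ignores df3's 'c'
print(status)
```

Note that Index.difference returns the new column labels in sorted order, which may differ from their order in the right-hand frame.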
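pd.merge can also be used, but it has to be told to join on the index. Without the 'on' argument, merge joins on the columns the two DataFrames have in common (here 'b'), and because the 'b' values in df1 and df2 do not overlap, an inner merge on 'b' would return an empty result. Joining on the index, with the shared columns dropped from df2, gives the desired output:
pd.merge(df1, df2[df2.columns.difference(df1.columns)], left_index=True, right_index=True)
a b c
0 0 1 11
1 2 3 13
2 4 5 15
3 6 7 17
4 8 9 19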

How could I use boolean index to retrieve columns of a pandas DataFrame

I have a dataframe like the following:
df1:DataFrame
0 1 2 3 4
a 0 1 2 3 4
b 5 6 7 8 9
c 10 11 12 13 14
d 15 16 17 18 19
I can retrieve rows in the following way:
In [58]: df1[(df1>10).any(axis=1)]
Out[58]:
0 1 2 3 4
c 10 11 12 13 14
d 15 16 17 18 19
But when I try to retrieve columns in the same way, with df1[(df1>10).any(axis=0)], it throws an exception:
result = result.astype(bool)._values
IndexingError: Unalignable boolean Series key provided
How can I do this?
Use loc and specify the relevant column constraints.
>>> df1.loc[:, df1.gt(10).any(axis=0)]
0 1 2 3 4
a 0 1 2 3 4
b 5 6 7 8 9
c 10 11 12 13 14
d 15 16 17 18 19
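In the example above every column contains a value greater than 10, so all columns are kept. The sketch below uses a stricter threshold of 15 (an assumption chosen for illustration) so that the effect of the column mask is visible; `row_mask` and `col_mask` are illustrative names.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(20).reshape(4, 5), index=list("abcd"))

# A mask indexed by row labels can be used directly with df1[...].
row_mask = (df1 > 10).any(axis=1)
rows = df1[row_mask]

# A mask indexed by column labels must go in the column slot of .loc,
# otherwise pandas tries to align it against the row index and fails.
col_mask = (df1 > 15).any(axis=0)
cols = df1.loc[:, col_mask]

print(rows)
print(cols)
```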

python pandas, certain columns to rows [duplicate]

I have a pandas dataframe with 4 rows and 4 columns. Here is a simple version:
import pandas as pd
import numpy as np
rows = np.arange(1, 5)  # four row labels, matching the 4x4 values array
values = np.arange(1, 17).reshape(4,4)
df = pd.DataFrame(values, index=rows, columns=['A', 'B', 'C', 'D'])
What I am trying to do is convert this to a two-column dataframe with 12 rows, with the B, C and D values aligned against each A value, so it would look like this:
1 2
1 3
1 4
5 6
5 7
5 8
9 10
9 11
9 12
13 14
13 15
13 16
Reading the pandas documentation, I tried this:
df1 = pd.pivot_table(df, rows = ['B', 'C', 'D'], cols = 'A')
but it gives me an error whose source I cannot identify. It ends with:
DataError: No numeric types to aggregate
After that I want to split the dataframe based on the A values, but I think .groupby will probably take care of that.
What you are looking for is the melt function
pd.melt(df,id_vars=['A'])
A variable value
0 1 B 2
1 5 B 6
2 9 B 10
3 13 B 14
4 1 C 3
5 5 C 7
6 9 C 11
7 13 C 15
8 1 D 4
9 5 D 8
10 9 D 12
11 13 D 16
    
A final sort by A is then necessary (a stable sort keeps the B, C, D order within each A value):
pd.melt(df,id_vars=['A']).sort_values('A', kind='stable')
A variable value
0 1 B 2
4 1 C 3
8 1 D 4
1 5 B 6
5 5 C 7
9 5 D 8
2 9 B 10
6 9 C 11
10 9 D 12
3 13 B 14
7 13 C 15
11 13 D 16
Note: pd.DataFrame.sort was deprecated in favour of pd.DataFrame.sort_values and has since been removed from pandas.
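Putting the melt and sort steps together on a current pandas version might look like the sketch below. The names `long`, `pairs` and `groups` are illustrative, and the groupby step is one possible way to do the per-A split mentioned in the question, not something taken from the answer.

```python
import numpy as np
import pandas as pd

values = np.arange(1, 17).reshape(4, 4)
df = pd.DataFrame(values, index=np.arange(1, 5), columns=["A", "B", "C", "D"])

# Melt B, C and D into rows keyed by A, then sort by A.
# kind="stable" preserves the B, C, D order inside each A group.
long = (
    pd.melt(df, id_vars=["A"])
    .sort_values("A", kind="stable")
    .reset_index(drop=True)
)

# The two-column layout from the question.
pairs = long[["A", "value"]]
print(pairs)

# One sub-frame per distinct A value.
groups = {key: group for key, group in long.groupby("A")}
```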
