Python : Pandas : Data Frame : Column Names
I have a large number of columns, and the column names are also very long. I would like to look at a few columns and rows, but the view is restricted by the width of the column names. How can I temporarily display a DataFrame in Python without column names (just the data)?
Convert the DataFrame to a NumPy array:
print (df.values)
You can also select the columns by position first with iloc:
print (df.iloc[:, 5:8].values)
Sample:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(10, size=(3,10)))
print (df)
   0  1  2  3  4  5  6  7  8  9
0  8  4  9  1  3  7  6  3  0  3
1  3  2  6  8  9  3  7  5  7  4
2  0  0  7  5  7  3  9  3  9  3
print (df.iloc[:, 5:8])
   5  6  7
0  7  6  3
1  3  7  5
2  3  9  3
print (df.iloc[:, 5:8].values)
[[7 6 3]
 [3 7 5]
 [3 9 3]]
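If the goal is only to hide the header while keeping the row index, a minimal sketch along these lines may also help. It assumes a made-up frame with long column names (the names and the seed are illustrative, not from the question) and uses DataFrame.to_string(header=False) next to the array approach above:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with deliberately long column names (illustration only).
rng = np.random.default_rng(0)
df = pd.DataFrame(
    rng.integers(10, size=(3, 10)),
    columns=[f"a_very_long_column_name_{i}" for i in range(10)],
)

# Pick a few rows and columns by position.
subset = df.iloc[:2, 5:8]

# Render the selection without the header row; the index is still shown.
text = subset.to_string(header=False)
print(text)

# Or drop both index and header by converting to a NumPy array.
print(subset.to_numpy())
```

Neither call modifies df, so the column names are still there for later use.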
Consider a DataFrame with only one column named values.
import pandas as pd
data_dict = {'values': [5,4,3,8,6,1,2,9,2,10]}
df = pd.DataFrame(data_dict)
display(df)
The output will look something like:
   values
0       5
1       4
2       3
3       8
4       6
5       1
6       2
7       9
8       2
9      10
I want to generate a new column that holds the trailing high (the running maximum) of the values column.
Expected Output:
   values  trailing_high
0       5              5
1       4              5
2       3              5
3       8              8
4       6              8
5       1              8
6       2              8
7       9              9
8       2              9
9      10             10
Right now I am using a for loop over df.iterrows() and calculating the value at each row, which makes the code very slow.
Can anyone share a vectorized approach to speed this up?
Use .cummax:
df["trailing_high"] = df["values"].cummax()
print(df)
Output
   values  trailing_high
0       5              5
1       4              5
2       3              5
3       8              8
4       6              8
5       1              8
6       2              8
7       9              9
8       2              9
9      10             10
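As a sanity check, here is a small sketch that compares cummax against an explicit loop on the same data. The fixed-window variant at the end is an extra illustration (the window size 3 is an arbitrary assumption) for the case where "trailing" should mean only the last few rows rather than everything seen so far:

```python
import pandas as pd

df = pd.DataFrame({"values": [5, 4, 3, 8, 6, 1, 2, 9, 2, 10]})

# Explicit loop, roughly what an iterrows-based version computes.
running = []
current = None
for value in df["values"]:
    current = value if current is None else max(current, value)
    running.append(current)

# Vectorized running maximum over everything seen so far.
df["trailing_high"] = df["values"].cummax()
print(df["trailing_high"].tolist() == running)

# Variant: maximum over only the last 3 rows (window size chosen arbitrarily).
df["high_last_3"] = df["values"].rolling(3, min_periods=1).max()
print(df)
```

Note that rolling(...).max() returns floats, while cummax keeps the integer dtype.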
I have a column in a CSV file that I would like to gather data on. It is full of integers, and I would like to draw a bar graph of the top 5 modes (the most frequently occurring numbers) in that column. Is there any way to do this?
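Assuming you have a big list of integers in the form of a pandas Series s,
s.value_counts().head(5).plot.bar() should do it. value_counts() sorts by frequency in descending order, so head(5) keeps the five most frequent values.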
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.value_counts.html
You can use .value_counts().head().plot(kind='bar'),
for example:
df = pd.DataFrame({'a':[1,1,2,3,5,8,1,5,6,9,8,7,5,6,7],'b':[1,1,2,3,3,3,4,5,6,7,7,7,7,8,2]})
df
    a  b
0   1  1
1   1  1
2   2  2
3   3  3
4   5  3
5   8  3
6   1  4
7   5  5
8   6  6
9   9  7
10  8  7
11  7  7
12  5  7
13  6  8
14  7  2
df.b.value_counts().head() # count values of column 'b' and show only top 5 values
7    4
3    3
2    2
1    2
8    1
Name: b, dtype: int64
df.b.value_counts().head().plot(kind='bar') #create bar plot for top values
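Putting the pieces together, a minimal sketch might look like the following. It reuses column 'b' from the example above rather than a real CSV; with a file, the Series would come from something like pd.read_csv(...) on your own path and column name, which are not known here. The plotting call is left as a comment because it needs matplotlib installed:

```python
import pandas as pd

# Stand-in for the CSV column; replace with your own Series.
df = pd.DataFrame({'b': [1, 1, 2, 3, 3, 3, 4, 5, 6, 7, 7, 7, 7, 8, 2]})

# value_counts() sorts by frequency (descending), so head(5) keeps the
# five most frequent values.
top5 = df['b'].value_counts().head(5)
print(top5)

# Bar chart of the top five (requires matplotlib):
# top5.plot(kind='bar')
```

When several values share the same count, which of them lands in the last slots of the top five may vary, so treat ties at the cut-off with care.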
I have a pandas dataframe containing rows with numbered columns:
   1  2  3  4  5
a  0  0  0  0  1
b  1  1  2  1  9
c  2  2  2  2  2
d  5  5  5  5  5
e  8  9  9  9  9
How can I filter out the rows where a subset of columns are all above or below a certain value?
So, for example: I want to keep only the rows where the values in columns 1 to 3 are all > 3. In the above, that would leave me with only rows d and e.
The columns I am filtering and the value I am checking against are both arguments.
I've tried a few things, this is the closest I've gotten:
df[df[range(1,3)]>3]
Any ideas?
I used loc and all in this function:
def filt(df, cols, thresh):
    return df.loc[(df[cols] > thresh).all(axis=1)]
filt(df, [1, 2, 3], 3)
   1  2  3  4  5
d  5  5  5  5  5
e  8  9  9  9  9
You can achieve this without using apply:
In [73]:
df[(df.iloc[:, 0:3] > 3).all(axis=1)]
Out[73]:
   1  2  3  4  5
d  5  5  5  5  5
e  8  9  9  9  9
This slices the df to just the first 3 columns using iloc (ix was removed in pandas 1.0), compares them against the scalar 3, and then calls all(axis=1) to create a boolean Series that masks the rows.
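Since the question asks for "above or below" with both the columns and the value passed as arguments, here is a small sketch that extends the same mask idea. The function name filter_rows and the above flag are made up for illustration; the data is the frame from the question:

```python
import pandas as pd

df = pd.DataFrame(
    {1: [0, 1, 2, 5, 8], 2: [0, 1, 2, 5, 9], 3: [0, 2, 2, 5, 9],
     4: [0, 1, 2, 5, 9], 5: [1, 9, 2, 5, 9]},
    index=list("abcde"),
)

def filter_rows(frame, cols, thresh, above=True):
    """Keep rows where every value in `cols` is above (or below) `thresh`."""
    subset = frame[cols]
    mask = (subset > thresh) if above else (subset < thresh)
    # all(axis=1) collapses the per-cell booleans to one boolean per row.
    return frame.loc[mask.all(axis=1)]

kept_above = filter_rows(df, [1, 2, 3], 3)               # rows d and e
kept_below = filter_rows(df, [1, 2, 3], 3, above=False)  # rows a, b and c
print(kept_above)
print(kept_below)
```

Both comparisons are strict; swap in >= or <= if rows equal to the threshold should be kept.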
I have a dataframe that looks like this:
test_data = pd.DataFrame(np.array([np.arange(10)]*3).T, columns =['issuer_id','winner_id','gov'])
   issuer_id  winner_id  gov
0          0          0    0
1          1          1    1
2          2          2    2
3          3          3    3
4          4          4    4
5          5          5    5
6          6          6    6
7          7          7    7
8          8          8    8
9          9          9    9
and a list of two-tuples consisting of a dataframe and a label encoding 'gov' (perhaps a label:dataframe dict would be better). In test_out below the two labels are 2 and 7.
test_out = [(pd.DataFrame(np.array([np.arange(10)]*2).T, columns =['id','partition']),2),(pd.DataFrame(np.array([np.arange(10)]*2).T, columns =['id','partition']),7)]
[(   id  partition
0   0          0
1   1          1
2   2          2
3   3          3
4   4          4
5   5          5
6   6          6
7   7          7
8   8          8
9   9          9, 2), (   id  partition
0   0          0
1   1          1
2   2          2
3   3          3
4   4          4
5   5          5
6   6          6
7   7          7
8   8          8
9   9          9, 7)]
I want to add two columns to the test_data dataframe: issuer_partition and winner_partition
test_data['issuer_partition']=''
test_data['winner_partition']=''
and I would like to fill in these values from the test_out list where the entry in the gov column determines the labeled dataframe in test_out to draw from. Then I look up the winner_id and issuer_id in the id-partition dataframe and write them into test_data.
Put another way: I have a list of labeled dataframes that I would like to loop through to conditionally fill in data in a primary dataframe.
Is there a clever way to use merge in this scenario?
*edit - added another sentence and fixed test_out code
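One possible way to use merge here, offered as a sketch rather than a definitive answer: stack the labeled dataframes into a single lookup table that carries the label as a gov column, then left-merge it once for the issuer and once for the winner. The names lookup and result are made up for illustration, and the sketch assumes each (id, gov) pair appears at most once in the lookup:

```python
import numpy as np
import pandas as pd

test_data = pd.DataFrame(np.array([np.arange(10)] * 3).T,
                         columns=['issuer_id', 'winner_id', 'gov'])
test_out = [
    (pd.DataFrame(np.array([np.arange(10)] * 2).T, columns=['id', 'partition']), 2),
    (pd.DataFrame(np.array([np.arange(10)] * 2).T, columns=['id', 'partition']), 7),
]

# Stack the labeled frames into one table with an explicit 'gov' column.
lookup = pd.concat([frame.assign(gov=label) for frame, label in test_out],
                   ignore_index=True)

# Merge once per role, matching on both the id and the gov label.
result = test_data.copy()
for role in ('issuer', 'winner'):
    renamed = lookup.rename(columns={'id': f'{role}_id',
                                     'partition': f'{role}_partition'})
    result = result.merge(renamed, on=[f'{role}_id', 'gov'], how='left')

print(result)
```

Rows whose gov value has no labeled dataframe (everything except 2 and 7 in this test data) end up with NaN in the new columns, which also turns those columns into floats.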
I have two dataframes.
df1
Out[162]:
     a   b   c
0    0   0   0
1    1   1   1
2    2   2   2
3    3   3   3
4    4   4   4
5    5   5   5
6    6   6   6
7    7   7   7
8    8   8   8
9    9   9   9
10  10  10  10
11  11  11  11
df2
Out[194]:
   A  B
0  a  3
1  b  4
2  c  5
I wish to create a third column in df2 that maps df2['A'] to the matching column of df1 and finds the smallest number in that column that is greater than the number in df2['B']. For example, for df2['C'].iloc[0], it should go to df1['a'] and search for the smallest number that's greater than df2['B'].iloc[0], which should be 4.
I had something like df2['C'] = df2['A'].map( df1[df1 > df2['B']].min() ), but this doesn't work because it doesn't look up the corresponding row of df2['B'] for each column. Thanks.
Use apply for row-wise methods:
In [54]:
# create our data
import pandas as pd
df1 = pd.DataFrame({'a':list(range(12)), 'b':list(range(12)), 'c':list(range(12))})
df1
Out[54]:
     a   b   c
0    0   0   0
1    1   1   1
2    2   2   2
3    3   3   3
4    4   4   4
5    5   5   5
6    6   6   6
7    7   7   7
8    8   8   8
9    9   9   9
10  10  10  10
11  11  11  11
[12 rows x 3 columns]
In [68]:
# create our 2nd dataframe, note I have deliberately used alternate values for column 'B'
df2 = pd.DataFrame({'A':list('abc'), 'B':[3,5,7]})
df2
Out[68]:
   A  B
0  a  3
1  b  5
2  c  7
[3 rows x 2 columns]
In [69]:
# apply row-wise function, must use axis=1 for row-wise
df2['C'] = df2.apply(lambda row: df1.loc[df1[row['A']] > row['B'], row['A']].min(), axis=1)
df2
Out[69]:
   A  B  C
0  a  3  4
1  b  5  6
2  c  7  8
[3 rows x 3 columns]
There is some example usage in the pandas docs
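One edge case worth seeing in code is a row with no larger value in df1. The sketch below wraps the same lookup in a named helper (smallest_above is a made-up name) and uses a B value of 11 for the last row, which is an assumption added here to trigger that case:

```python
import pandas as pd

df1 = pd.DataFrame({'a': range(12), 'b': range(12), 'c': range(12)})
# The last B value is deliberately too large to have a match in df1.
df2 = pd.DataFrame({'A': list('abc'), 'B': [3, 5, 11]})

def smallest_above(row):
    """Smallest value in df1[row['A']] that is strictly greater than row['B']."""
    col = df1[row['A']]
    # min() of an empty selection returns NaN rather than raising.
    return col[col > row['B']].min()

df2['C'] = df2.apply(smallest_above, axis=1)
print(df2)
```

Because of the NaN, column C comes back as floats; fill or drop the missing rows before converting it back to integers.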