How to plot/graph top modes with pandas in Python

So I have a column in a CSV file that I would like to gather data on. It is full of integers, and I would like to bar-graph the top 5 "modes"/most frequent numbers within that column. Is there any way to do this?

Assuming you have a big list of integers in the form of a pandas Series s, s.value_counts().head(5).plot.bar() should do it: value_counts() sorts by frequency in descending order, and head(5) keeps the top 5.
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.value_counts.html

You can use .value_counts().head().plot(kind='bar').
For example:
df = pd.DataFrame({'a':[1,1,2,3,5,8,1,5,6,9,8,7,5,6,7],'b':[1,1,2,3,3,3,4,5,6,7,7,7,7,8,2]})
df
    a  b
0   1  1
1   1  1
2   2  2
3   3  3
4   5  3
5   8  3
6   1  4
7   5  5
8   6  6
9   9  7
10  8  7
11  7  7
12  5  7
13  6  8
14  7  2
df.b.value_counts().head() # count values of column 'b' and show only top 5 values
7    4
3    3
2    2
1    2
8    1
Name: b, dtype: int64
df.b.value_counts().head().plot(kind='bar') #create bar plot for top values
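Putting the two answers together for the original CSV use case, a minimal sketch; the file name data.csv and the column name my_column are placeholders for your own:
import pandas as pd
import matplotlib.pyplot as plt

# 'data.csv' and 'my_column' are hypothetical -- substitute your own.
df = pd.read_csv('data.csv')

# value_counts() sorts by frequency in descending order, so head(5)
# keeps the 5 most frequent values (the top "modes").
df['my_column'].value_counts().head(5).plot(kind='bar')
plt.xlabel('value')
plt.ylabel('count')
plt.show()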

Related

How to create a column to store trailing high value in Pandas DataFrame?

Consider a DataFrame with only one column, named values.
data_dict = {'values': [5, 4, 3, 8, 6, 1, 2, 9, 2, 10]}
df = pd.DataFrame(data_dict)
display(df)
The output will look something like:
   values
0       5
1       4
2       3
3       8
4       6
5       1
6       2
7       9
8       2
9      10
I want to generate a new column that will have the trailing high value of the previous column.
Expected Output:
   values  trailing_high
0       5              5
1       4              5
2       3              5
3       8              8
4       6              8
5       1              8
6       2              8
7       9              9
8       2              9
9      10             10
Right now I am using a for loop to iterate over df.iterrows() and calculating the value at each row. Because of this, the code is very slow.
Can anyone share a vectorized approach to increase the speed?
Use .cummax(), the cumulative maximum:
df["trailing_high"] = df["values"].cummax()
print(df)
Output
   values  trailing_high
0       5              5
1       4              5
2       3              5
3       8              8
4       6              8
5       1              8
6       2              8
7       9              9
8       2              9
9      10             10
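If you prefer to stay at the NumPy level, np.maximum.accumulate computes the same running maximum; a quick sketch on the question's data:
import numpy as np
import pandas as pd

df = pd.DataFrame({'values': [5, 4, 3, 8, 6, 1, 2, 9, 2, 10]})

# np.maximum.accumulate returns the running maximum of the array,
# which is exactly what Series.cummax() computes.
df['trailing_high'] = np.maximum.accumulate(df['values'].to_numpy())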

Create a pandas.DataFrame from PostgreSQL and convert multiple rows in one column to one row with multiple columns

I have a PostgreSQL table that looks like this:
price
1
2
4
9
7
8
3
7
5
3
7
and I want it to look like this:
1 2 4 9 7 8 3 7 5 3 7
I'm reading it using pandas.read_sql().
Now I want to convert the DataFrame from 11 rows and one column into 1 row and 11 columns. From what I understand I need to use the pandas.melt() function, but I don't understand how.
You can do
df.T
Out[7]:
       0  1  2  3  4  5  6  7  8  9  10
price  1  2  4  9  7  8  3  7  5  3   7
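End to end it might look like the sketch below; the connection string, table, and query are hypothetical, so substitute your own:
import pandas as pd
from sqlalchemy import create_engine

# Hypothetical connection details and query -- replace with your own.
engine = create_engine('postgresql://user:password@localhost:5432/mydb')
df = pd.read_sql('SELECT price FROM my_table', engine)

# .T transposes the frame: 11 rows x 1 column becomes 1 row x 11 columns.
wide = df.T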

Updating values in a particular column of a database using Python

I have a column called id in a dataframe called _newdata, which looks like this. Note that this is only part of the values in the column, not the entire thing.
1
1
1
2
2
2
2
2
4
4
4
4
4
5
5
5
5
7
7
7
7
7
8
8
8
8
10
10
10
What I want to do is renumber the 'id' values so that they form a running sequence, which means I want it to look like this:
1
1
1
2
2
2
2
2
3
3
3
3
3
4
4
4
4
5
5
5
5
5
6
6
6
6
7
7
7
I tried using this but it didn't seem to do anything to the file. Could someone tell me where I went wrong or suggest a method to do what I want it to do?
count = 1  # values start at 1
for i, row in _newdata.iterrows():
    if row['id'] == count or row['id'] == count + 1:
        pass
    else:
        count += 1
        row['id'] = count
You can use a dense rank:
df['id'] = df['id'].rank(method='dense').astype(int)
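For what it's worth, the original loop changed nothing because iterrows() yields copies of each row, so assigning to row never writes back to _newdata. If the column is already sorted as shown, pd.factorize is another option; a sketch on the sample values:
import pandas as pd

_newdata = pd.DataFrame({'id': [1, 1, 1, 2, 2, 2, 2, 2, 4, 4, 4, 4, 4,
                                5, 5, 5, 5, 7, 7, 7, 7, 7, 8, 8, 8, 8,
                                10, 10, 10]})

# factorize() numbers distinct values 0, 1, 2, ... in order of first
# appearance; adding 1 gives the same running numbering as a dense rank
# when the column is sorted.
_newdata['id'] = pd.factorize(_newdata['id'])[0] + 1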

Bucket Multiple Columns in Pandas Based on Top N Values

I would like to iterate through multiple dataframe columns looking for the top n values in each column. If a value in the column is among the top n values for that column, keep it; otherwise bucket it as "other". Also, I would like to create new columns from this.
However, I'm not sure how to use .apply in this case as it seems like I need to reference both columns and rows.
np.random.seed(0)
example_df = pd.DataFrame(np.random.randint(low=0, high=10, size=(15, 5)),columns=['a', 'b', 'c', 'd', 'e'])
cols_to_group = ['a','b','c']
top = 2
So for the example below, here's my pseudo code that I'm not sure how to execute:
Pseudo Code:
#loop through each column
for column in example_df[cols_to_group]:
    #loop through each value in column and check if it's in top values for the column.
    for single_value in column:
        if single_value.isin(column.value_counts()[:top].values):
            #return value if it is in top values
            return single_value
        else:
            return "other"
    #create new column in your df that has bucketed values
    example_df[column.name + str("bucketed") + str(top)] = column
Expected output:
Crude example where top = 2.
   a  b  c  d  e a_bucketed b_bucketed
0  4  6  4  3  1          4          6
1  8  8  1  5  7          8          8
2  8  6  0  0  2          8          6
3  4  1  0  7  4          4      Other
4  7  8  7  7  7      Other          8
Here is one way; note that it does not handle ties.
df['a_bucketed'] = np.where(df['a'].isin(df['a'].value_counts().index[:2]), df['a'], 'Other')
df['b_bucketed'] = np.where(df['b'].isin(df['b'].value_counts().index[:2]), df['b'], 'Other')
# a b c d e a_bucketed b_bucketed
# 0 5 0 3 3 7 Other Other
# 1 9 3 5 2 4 9 3
# 2 7 6 8 8 1 Other Other
# 3 6 7 7 8 1 Other Other
# 4 5 9 8 9 4 Other 9
# 5 3 0 3 5 0 3 Other
# 6 2 3 8 1 3 Other 3
# 7 3 3 7 0 1 3 3
# 8 9 9 0 4 7 9 9
# 9 3 2 7 2 0 3 Other
# 10 0 4 5 5 6 Other Other
# 11 8 4 1 4 9 Other Other
# 12 8 1 1 7 9 Other Other
# 13 9 3 6 7 2 9 3
# 14 0 3 5 9 4 Other 3
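A sketch generalizing the same idea into a loop over cols_to_group, so each listed column gets its own bucketed version:
import numpy as np
import pandas as pd

np.random.seed(0)
example_df = pd.DataFrame(np.random.randint(low=0, high=10, size=(15, 5)),
                          columns=['a', 'b', 'c', 'd', 'e'])
cols_to_group = ['a', 'b', 'c']
top = 2

for col in cols_to_group:
    # The `top` most frequent values in this column (ties fall to
    # whichever value happens to come first in value_counts).
    top_values = example_df[col].value_counts().index[:top]
    example_df[col + '_bucketed' + str(top)] = np.where(
        example_df[col].isin(top_values), example_df[col], 'Other')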

Adding data from one df conditionally in pandas

I have a dataframe that looks like this:
test_data = pd.DataFrame(np.array([np.arange(10)]*3).T, columns =['issuer_id','winner_id','gov'])
   issuer_id  winner_id  gov
0          0          0    0
1          1          1    1
2          2          2    2
3          3          3    3
4          4          4    4
5          5          5    5
6          6          6    6
7          7          7    7
8          8          8    8
9          9          9    9
and a list of two-tuples consisting of a dataframe and a label encoding 'gov' (perhaps a label:dataframe dict would be better). In test_out below the two labels are 2 and 7.
test_out = [(pd.DataFrame(np.array([np.arange(10)]*2).T, columns =['id','partition']),2),(pd.DataFrame(np.array([np.arange(10)]*2).T, columns =['id','partition']),7)]
[(   id  partition
0   0          0
1   1          1
2   2          2
3   3          3
4   4          4
5   5          5
6   6          6
7   7          7
8   8          8
9   9          9, 2), (   id  partition
0   0          0
1   1          1
2   2          2
3   3          3
4   4          4
5   5          5
6   6          6
7   7          7
8   8          8
9   9          9, 7)]
I want to add two columns to the test_data dataframe: issuer_partition and winner_partition
test_data['issuer_partition']=''
test_data['winner_partition']=''
and I would like to fill in these values from the test_out list where the entry in the gov column determines the labeled dataframe in test_out to draw from. Then I look up the winner_id and issuer_id in the id-partition dataframe and write them into test_data.
Put another way: I have a list of labeled dataframes that I would like to loop through to conditionally fill in data in a primary dataframe.
Is there a clever way to use merge in this scenario?
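One possible merge-based sketch, assuming the setup above: stack the labeled frames into a single lookup keyed by (gov, id), then merge once per id column:
import numpy as np
import pandas as pd

test_data = pd.DataFrame(np.array([np.arange(10)] * 3).T,
                         columns=['issuer_id', 'winner_id', 'gov'])
test_out = [(pd.DataFrame(np.array([np.arange(10)] * 2).T,
                          columns=['id', 'partition']), 2),
            (pd.DataFrame(np.array([np.arange(10)] * 2).T,
                          columns=['id', 'partition']), 7)]

# Stack the labeled lookup frames into one frame keyed by (gov, id).
lookup = pd.concat([df.assign(gov=label) for df, label in test_out],
                   ignore_index=True)

# Merge once per id column; rows whose gov has no labeled frame get NaN.
test_data = (test_data
             .merge(lookup.rename(columns={'id': 'issuer_id',
                                           'partition': 'issuer_partition'}),
                    on=['gov', 'issuer_id'], how='left')
             .merge(lookup.rename(columns={'id': 'winner_id',
                                           'partition': 'winner_partition'}),
                    on=['gov', 'winner_id'], how='left'))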
