I have a dataframe that looks like this:
test_data = pd.DataFrame(np.array([np.arange(10)]*3).T, columns =['issuer_id','winner_id','gov'])
issuer_id winner_id gov
0 0 0 0
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 4
5 5 5 5
6 6 6 6
7 7 7 7
8 8 8 8
9 9 9 9
and a list of two-tuples, each consisting of a dataframe and a label corresponding to a value of 'gov' (perhaps a label:dataframe dict would be better). In test_out below the two labels are 2 and 7.
test_out = [(pd.DataFrame(np.array([np.arange(10)]*2).T, columns =['id','partition']),2),(pd.DataFrame(np.array([np.arange(10)]*2).T, columns =['id','partition']),7)]
[( id partition
0 0 0
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5
6 6 6
7 7 7
8 8 8
9 9 9, 2), ( id partition
0 0 0
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5
6 6 6
7 7 7
8 8 8
9 9 9, 7)]
I want to add two columns to the test_data dataframe: issuer_partition and winner_partition
test_data['issuer_partition']=''
test_data['winner_partition']=''
and I would like to fill these in from the test_out list: the value in the gov column determines which labeled dataframe in test_out to draw from. I then look up the issuer_id and winner_id in that id-partition dataframe and write the corresponding partition values into test_data.
Put another way: I have a list of labeled dataframes that I would like to loop through to conditionally fill in data in a primary dataframe.
Is there a clever way to use merge in this scenario?
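One merge-based approach is to stack the labeled frames into a single lookup table keyed by (gov, id) and then merge twice, once per id column. This is a sketch assuming the structures above; the lookup construction is my own, and with merge the empty-string column initialization is unnecessary since merge creates the new columns:

import numpy as np
import pandas as pd

test_data = pd.DataFrame(np.array([np.arange(10)] * 3).T,
                         columns=['issuer_id', 'winner_id', 'gov'])
test_out = [(pd.DataFrame(np.array([np.arange(10)] * 2).T,
                          columns=['id', 'partition']), 2),
            (pd.DataFrame(np.array([np.arange(10)] * 2).T,
                          columns=['id', 'partition']), 7)]

# stack the labeled frames into one lookup table keyed by (gov, id)
lookup = pd.concat([df.assign(gov=label) for df, label in test_out],
                   ignore_index=True)

# merge once per id column, renaming 'partition' to the target name;
# rows whose gov has no labeled frame (here, anything but 2 and 7) get NaN
result = (test_data
          .merge(lookup.rename(columns={'id': 'issuer_id',
                                        'partition': 'issuer_partition'}),
                 on=['gov', 'issuer_id'], how='left')
          .merge(lookup.rename(columns={'id': 'winner_id',
                                        'partition': 'winner_partition'}),
                 on=['gov', 'winner_id'], how='left'))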
Consider a DataFrame with only one column named values.
data_dict = {'values': [5, 4, 3, 8, 6, 1, 2, 9, 2, 10]}
df = pd.DataFrame(data_dict)
display(df)
The output will look something like:
values
0 5
1 4
2 3
3 8
4 6
5 1
6 2
7 9
8 2
9 10
I want to generate a new column holding the trailing high (the running maximum) of the values column.
Expected Output:
values trailing_high
0 5 5
1 4 5
2 3 5
3 8 8
4 6 8
5 1 8
6 2 8
7 9 9
8 2 9
9 10 10
Right now I am using a for loop over df.iterrows() and computing the value at each row, which makes the code very slow.
Can anyone share a vectorized approach to speed this up?
Use .cummax:
df["trailing_high"] = df["values"].cummax()
print(df)
Output
values trailing_high
0 5 5
1 4 5
2 3 5
3 8 8
4 6 8
5 1 8
6 2 8
7 9 9
8 2 9
9 10 10
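For contrast, the row-by-row version the question describes would look something like this (a reconstruction, since the original loop was not posted); cummax replaces all of it with one vectorized call:

# slow reconstruction: track the running maximum row by row
trailing = []
current_max = float('-inf')
for _, row in df.iterrows():
    current_max = max(current_max, row['values'])
    trailing.append(current_max)
df['trailing_high'] = trailing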
I have a PostgreSQL table that looks like this:
price
1
2
4
9
7
8
3
7
5
3
7
and I want it to look like this:
1 2 4 9 7 8 3 7 5 3 7
I'm reading it using pandas.read_sql(). Now I want to convert the DataFrame from 11 rows and one column into 1 row and 11 columns. From what I understand I need the pandas.melt() function, but I don't understand how to use it.
You can simply transpose; melt is not needed here:
df.T
Out[7]:
0 1 2 3 4 5 6 7 8 9 10
price 1 2 4 9 7 8 3 7 5 3 7
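A self-contained illustration (the sample values are taken from the question; df stands for the frame returned by read_sql):

import pandas as pd

df = pd.DataFrame({'price': [1, 2, 4, 9, 7, 8, 3, 7, 5, 3, 7]})

# transpose: the single 'price' column becomes one row labeled 'price'
wide = df.T

# or, if you only need the 11 values as a flat sequence:
flat = df['price'].to_numpy()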
So I have a column in a CSV file that I would like to gather data on. It is full of integers, and I would like to bar-graph the top 5 most frequently occurring numbers within that column. Is there any way to do this?
Assuming you have the column as a pandas Series s,
s.value_counts().head(5).plot.bar() should do it (value_counts sorts by frequency, so head(5) keeps the top 5).
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.value_counts.html
You can use .value_counts().head().plot(kind='bar').
For example:
df = pd.DataFrame({'a':[1,1,2,3,5,8,1,5,6,9,8,7,5,6,7],'b':[1,1,2,3,3,3,4,5,6,7,7,7,7,8,2]})
df
a b
0 1 1
1 1 1
2 2 2
3 3 3
4 5 3
5 8 3
6 1 4
7 5 5
8 6 6
9 9 7
10 8 7
11 7 7
12 5 7
13 6 8
14 7 2
df.b.value_counts().head() # count values of column 'b' and show only top 5 values
7 4
3 3
2 2
1 2
8 1
Name: b, dtype: int64
df.b.value_counts().head().plot(kind='bar') #create bar plot for top values
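If you run this as a script rather than in a notebook, you will also need matplotlib to render the figure (assuming matplotlib is the active plotting backend):

import matplotlib.pyplot as plt

df.b.value_counts().head().plot(kind='bar')
plt.show()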
I have a dataset in the form of:
A B C D label
6 2 6 8 0
2 5 3 6 0
4 3 4 9 1
5 7 5 5 1
6 4 5 8 0
in which each row carries a label, and the same label values repeat in consecutive blocks throughout the data (about 7 distinct labels across 7000 rows). If I do
df.loc[df['label'] == 0]
it grabs all rows labeled 0, but I want to extract only the first consecutive block of 0-labeled rows: if the first 10 rows are labeled 0, bring back just those, not the later 0-labeled rows elsewhere in the dataframe.
We need a helper column here that numbers consecutive label blocks:
# diff().ne(0) marks rows where the label changes; cumsum() numbers each block
df = df.assign(new=df.label.diff().ne(0).cumsum())
# keep rows whose block number is the first (minimum) block for their label
df[df.new == df.groupby('label').new.transform('min')]
Out[206]:
A B C D label new
0 6 2 6 8 0 1
1 2 5 3 6 0 1
2 4 3 4 9 1 2
3 5 7 5 5 1 2
To save each label's first block as a separate dataframe in a list:
s = df[df.new == df.groupby('label').new.transform('min')]
l = [df1 for _, df1 in s.groupby('label')]
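A self-contained run on the question's sample data, showing the intermediate block numbers (my own annotation of the logic above):

import pandas as pd

df = pd.DataFrame({'A': [6, 2, 4, 5, 6],
                   'B': [2, 5, 3, 7, 4],
                   'C': [6, 3, 4, 5, 5],
                   'D': [8, 6, 9, 5, 8],
                   'label': [0, 0, 1, 1, 0]})

df = df.assign(new=df.label.diff().ne(0).cumsum())
# df.new is now [1, 1, 2, 2, 3]: rows 0-1 form block 1 (label 0),
# rows 2-3 form block 2 (label 1), and row 4 forms block 3 (label 0 again)
first = df[df.new == df.groupby('label').new.transform('min')]
# keeps blocks 1 and 2 only, i.e. the first block seen for each label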
I would like to iterate through multiple dataframe columns looking for the top n values in each column. If a value is among the column's top n values, keep it; otherwise bucket it as "other". I would also like to create new columns from this.
However, I'm not sure how to use .apply in this case as it seems like I need to reference both columns and rows.
np.random.seed(0)
example_df = pd.DataFrame(np.random.randint(low=0, high=10, size=(15, 5)),columns=['a', 'b', 'c', 'd', 'e'])
cols_to_group = ['a','b','c']
top = 2
For the setup above, here's my pseudocode, which I'm not sure how to execute:
Pseudo Code:
# loop through each column
for column in example_df[cols_to_group]:
    # loop through each value in the column and check if it's in the top values
    for single_value in column:
        if single_value.isin(column.value_counts()[:top].values):
            # return the value if it is in the top values
            return single_value
        else:
            return "other"
    # create a new column in the df that has the bucketed values
    example_df[column.name + str("bucketed") + str(top)] = column
Expected output:
Crude example where top = 2.
a b c d e a_bucketed b_bucketed
0 4 6 4 3 1 4 6
1 8 8 1 5 7 8 8
2 8 6 0 0 2 8 6
3 4 1 0 7 4 4 Other
4 7 8 7 7 7 Other 8
Here is one way (note that no treatment for ties has been prescribed):
example_df['a_bucketed'] = np.where(example_df['a'].isin(example_df['a'].value_counts().index[:2]), example_df['a'], 'Other')
example_df['b_bucketed'] = np.where(example_df['b'].isin(example_df['b'].value_counts().index[:2]), example_df['b'], 'Other')
# a b c d e a_bucketed b_bucketed
# 0 5 0 3 3 7 Other Other
# 1 9 3 5 2 4 9 3
# 2 7 6 8 8 1 Other Other
# 3 6 7 7 8 1 Other Other
# 4 5 9 8 9 4 Other 9
# 5 3 0 3 5 0 3 Other
# 6 2 3 8 1 3 Other 3
# 7 3 3 7 0 1 3 3
# 8 9 9 0 4 7 9 9
# 9 3 2 7 2 0 3 Other
# 10 0 4 5 5 6 Other Other
# 11 8 4 1 4 9 Other Other
# 12 8 1 1 7 9 Other Other
# 13 9 3 6 7 2 9 3
# 14 0 3 5 9 4 Other 3
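The same pattern generalizes with a loop over cols_to_group (a sketch building on the answer above; the bucketed column names follow the question's pseudocode, and the setup is repeated so the snippet runs on its own):

import numpy as np
import pandas as pd

np.random.seed(0)
example_df = pd.DataFrame(np.random.randint(low=0, high=10, size=(15, 5)),
                          columns=['a', 'b', 'c', 'd', 'e'])
cols_to_group = ['a', 'b', 'c']
top = 2

for col in cols_to_group:
    # the `top` most frequent values in this column
    top_values = example_df[col].value_counts().index[:top]
    example_df[col + '_bucketed' + str(top)] = np.where(
        example_df[col].isin(top_values), example_df[col], 'Other')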