I have a portion of my dataframe here:
days = [1, 2, 3, 4, 5]
time = [2, 4, 2, 4, 2, 4, 2, 4, 2]
df1 = pd.DataFrame(days)
df2 = pd.Series(time)
df2 = df2.transpose()
df3 = df1*df2
df4 = df1.dot(df2.to_frame().T)
df4 =
0 1 2 3 4 5 6 7 8
0 2 4 2 4 2 4 2 4 2
1 4 8 4 8 4 8 4 8 4
2 6 12 6 12 6 12 6 12 6
3 8 16 8 16 8 16 8 16 8
4 10 20 10 20 10 20 10 20 10
I have a loop that creates a single-row dataframe which looks like:
df_new =
0 1 2 3 4 5 6 7 8
0 2 4 2 4 2 4 2 4 2
I need to be able to loop through and add this row to the end of the larger dataframe a handful of times so the end result looks like this:
df_final =
0 1 2 3 4 5 6 7 8
0 2 4 2 4 2 4 2 4 2
1 4 8 4 8 4 8 4 8 4
2 6 12 6 12 6 12 6 12 6
3 8 16 8 16 8 16 8 16 8
4 10 20 10 20 10 20 10 20 10
5 2 4 2 4 2 4 2 4 2
6 5 6 7 8 9 8 7 6 5
I have tried both append and concat to add the new dataframe to the existing one, but I receive errors either way: indexing errors or looping issues. I think I need either a better understanding of why the row cannot be added to the end of the dataframe, or an idea for a workaround. The real loop has 25 iterations (I only added two here, but the idea is the same): each iteration produces a new row as a single-row dataframe, and I need to add that row's data, without the column headers, to the end of the final dataframe. I am willing to update my question as soon as I get a better idea of how this can work; it does not seem like a difficult task, so I am sure I am asking the wrong thing.
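For reference, a minimal sketch of the concat route, using my own toy data rather than the real frames: pd.concat with ignore_index=True renumbers the rows, which avoids the index collisions that usually cause errors like the ones described.

```python
import pandas as pd

# small stand-in for the larger dataframe (values are illustrative)
df_final = pd.DataFrame([[2, 4, 2], [4, 8, 4]])

# two single-row dataframes, as produced by the loop
new_rows = [pd.DataFrame([[2, 4, 2]]), pd.DataFrame([[5, 6, 7]])]

# inside the loop, concatenate each single-row dataframe onto the end;
# ignore_index=True renumbers the rows so the indices do not collide
for df_new in new_rows:
    df_final = pd.concat([df_final, df_new], ignore_index=True)

print(df_final)
```

The column headers of the single-row frame are ignored here because both frames use the same default integer columns; if the new row had named columns, they would need to be dropped or renamed first.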
Consider a DataFrame with only one column named values.
data_dict = {'values': [5, 4, 3, 8, 6, 1, 2, 9, 2, 10]}
df = pd.DataFrame(data_dict)
display(df)
The output will look something like:
values
0 5
1 4
2 3
3 8
4 6
5 1
6 2
7 9
8 2
9 10
I want to generate a new column that holds the trailing high (running maximum) of the values column.
Expected Output:
values trailing_high
0 5 5
1 4 5
2 3 5
3 8 8
4 6 8
5 1 8
6 2 8
7 9 9
8 2 9
9 10 10
Right now I am using a for loop over df.iterrows() and calculating the value at each row. Because of this, the code is very slow.
Can anyone share the vectorization approach to increase the speed?
Use .cummax:
df["trailing_high"] = df["values"].cummax()
print(df)
Output
values trailing_high
0 5 5
1 4 5
2 3 5
3 8 8
4 6 8
5 1 8
6 2 8
7 9 9
8 2 9
9 10 10
So I have a column in a CSV file that I would like to gather data on. It is full of integers, and I would like to bar-graph the top 5 most-frequent ("mode") numbers within that column. Is there any way to do this?
Assuming you have a big list of integers in the form of a pandas Series s:
s.value_counts().head(5).plot.bar() should do it (value_counts sorts by frequency, so head(5) keeps the top five).
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.value_counts.html
You can use .value_counts().head().plot(kind='bar').
For example:
df = pd.DataFrame({'a':[1,1,2,3,5,8,1,5,6,9,8,7,5,6,7],'b':[1,1,2,3,3,3,4,5,6,7,7,7,7,8,2]})
df
a b
0 1 1
1 1 1
2 2 2
3 3 3
4 5 3
5 8 3
6 1 4
7 5 5
8 6 6
9 9 7
10 8 7
11 7 7
12 5 7
13 6 8
14 7 2
df.b.value_counts().head() # count values of column 'b' and show only top 5 values
7 4
3 3
2 2
1 2
8 1
Name: b, dtype: int64
df.b.value_counts().head().plot(kind='bar') #create bar plot for top values
I would like to append rows to a dataframe using a loop, but I can't figure out how not to overwrite the previously appended rows.
Example of starting dataframe
print df
quantity cost
0 1 30
1 1 5
2 2 10
3 4 8
4 5 2
My goal is
quantity cost
0 1 30
1 1 5
2 2 10
3 4 8
4 5 2
5 2 10
6 4 8
7 4 8
8 4 8
9 5 2
10 5 2
11 5 2
12 5 2
My current code is incorrect (df_new ends up holding only the rows appended for quantity == 5, because it is overwritten on every iteration), but I can't figure out how to fix it.
for x in xrange(2, 6):
    data = df['quantity'] == x
    data = df[data]
    df_new = df.append([data] * (x - 1), ignore_index=True)
Any advice would be awesome, thank you!
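One way to avoid the overwrite (a sketch in Python 3 syntax; note DataFrame.append has since been deprecated in favour of pd.concat) is to collect the pieces in a list and concatenate once at the end, so each iteration builds on the previous ones instead of restarting from df:

```python
import pandas as pd

df = pd.DataFrame({'quantity': [1, 1, 2, 4, 5],
                   'cost':     [30, 5, 10, 8, 2]})

pieces = [df]  # start from the original rows
for x in range(2, 6):
    rows = df[df['quantity'] == x]
    # append x - 1 extra copies of the matching rows
    pieces.extend([rows] * (x - 1))

# a single concat at the end; ignore_index renumbers rows 0..n-1
df_new = pd.concat(pieces, ignore_index=True)
print(df_new)
```

The bug in the original loop is that df_new = df.append(...) always starts from the unmodified df, so each pass discards the rows added by the previous pass; accumulating in a list sidesteps that.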
I am trying to build an algorithm for finding the number of clusters. I need to assign random points from a data set as initial means.
I first tried the following code:
mu = random.sample(df, 10)
but it gave an "index out of range" error.
I then converted the dataframe into a numpy array and did
mu = random.sample(np.array(df).tolist(), 10)
but instead of giving 10 values to use as means, it gives me 10 arrays of values.
How can I get 10 values from the dataframe to initialise the means for 10 clusters?
Use numpy.random.choice
df.iloc[np.random.choice(np.arange(len(df)), 10, False)]
Or numpy.random.permutation
df.loc[np.random.permutation(df.index)[:10]]
a b c
11 2 9 9
1 7 7 0
16 5 1 8
15 0 8 2
17 1 5 4
19 5 0 9
10 7 7 0
8 4 4 3
6 6 2 4
14 7 6 2
I think you need DataFrame.sample:
mu = df.sample(10)
Sample:
np.random.seed(100)
df = pd.DataFrame(np.random.randint(10, size=(20,3)), columns=list('abc'))
print (df)
a b c
0 8 8 3
1 7 7 0
2 4 2 5
3 2 2 2
4 1 0 8
5 4 0 9
6 6 2 4
7 1 5 3
8 4 4 3
9 7 1 1
10 7 7 0
11 2 9 9
12 3 2 5
13 8 1 0
14 7 6 2
15 0 8 2
16 5 1 8
17 1 5 4
18 2 8 3
19 5 0 9
mu = df.sample(10)
print (mu)
a b c
11 2 9 9
1 7 7 0
8 4 4 3
5 4 0 9
2 4 2 5
19 5 0 9
13 8 1 0
14 7 6 2
0 8 8 3
9 7 1 1
I have a dataframe that looks like this:
test_data = pd.DataFrame(np.array([np.arange(10)]*3).T, columns =['issuer_id','winner_id','gov'])
issuer_id winner_id gov
0 0 0 0
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 4
5 5 5 5
6 6 6 6
7 7 7 7
8 8 8 8
9 9 9 9
and a list of two-tuples consisting of a dataframe and a label encoding 'gov' (perhaps a label:dataframe dict would be better). In test_out below the two labels are 2 and 7.
test_out = [(pd.DataFrame(np.array([np.arange(10)]*2).T, columns =['id','partition']),2),(pd.DataFrame(np.array([np.arange(10)]*2).T, columns =['id','partition']),7)]
[( id partition
0 0 0
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5
6 6 6
7 7 7
8 8 8
9 9 9, 2), ( id partition
0 0 0
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5
6 6 6
7 7 7
8 8 8
9 9 9, 7)]
I want to add two columns to the test_data dataframe: issuer_partition and winner_partition
test_data['issuer_partition']=''
test_data['winner_partition']=''
and I would like to fill in these values from the test_out list, where the entry in the gov column determines which labeled dataframe in test_out to draw from. I then look up winner_id and issuer_id in that id/partition dataframe and write the corresponding partitions into test_data.
Put another way: I have a list of labeled dataframes that I would like to loop through to conditionally fill in data in a primary dataframe.
Is there a clever way to use merge in this scenario?
*edit - added another sentence and fixed test_out code
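One way merge can work here (a sketch, not the only approach): stack the labeled frames into a single lookup table with an explicit gov column, then left-merge once per id column. The lookup name and the rename calls are my own choices, not from the question.

```python
import numpy as np
import pandas as pd

test_data = pd.DataFrame(np.array([np.arange(10)] * 3).T,
                         columns=['issuer_id', 'winner_id', 'gov'])
test_out = [(pd.DataFrame(np.array([np.arange(10)] * 2).T,
                          columns=['id', 'partition']), 2),
            (pd.DataFrame(np.array([np.arange(10)] * 2).T,
                          columns=['id', 'partition']), 7)]

# flatten the list of (dataframe, label) pairs into one lookup table,
# tagging each block of rows with its gov label
lookup = pd.concat([d.assign(gov=label) for d, label in test_out],
                   ignore_index=True)

# left-merge twice: once keyed on issuer_id, once on winner_id;
# rows whose gov label has no frame in test_out come back as NaN
out = (test_data
       .merge(lookup.rename(columns={'id': 'issuer_id',
                                     'partition': 'issuer_partition'}),
              on=['gov', 'issuer_id'], how='left')
       .merge(lookup.rename(columns={'id': 'winner_id',
                                     'partition': 'winner_partition'}),
              on=['gov', 'winner_id'], how='left'))
```

With how='left', test_data keeps all ten rows; only the rows whose gov value is one of the labels (2 and 7 here) get partitions filled in, which matches the "conditionally fill" requirement without any explicit loop.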