I would like to append rows to a dataframe using a loop, but I can't figure out how not to overwrite the previously appended rows.
Example of starting dataframe
print df
quantity cost
0 1 30
1 1 5
2 2 10
3 4 8
4 5 2
My goal is
quantity cost
0 1 30
1 1 5
2 2 10
3 4 8
4 5 2
5 2 10
6 4 8
7 4 8
8 4 8
9 5 2
10 5 2
11 5 2
12 5 2
My current code is incorrect (only appending rows with quantity==5), but I can't figure out how to fix it.
for x in xrange(2,6):
    data = df['quantity'] == x
    data = df[data]
    df_new = df.append([data]*(x-1),ignore_index=True)
Any advice would be awesome, thank you!
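One possible explanation for the overwriting: each pass through the loop builds df_new from the original df, so only the last iteration (quantity == 5) survives. A minimal sketch that avoids the loop entirely is below. It assumes the goal is "add quantity - 1 extra copies of every row", which matches the desired output shown above, and it uses pd.concat because DataFrame.append was removed in pandas 2.0. The name extra_rows is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({"quantity": [1, 1, 2, 4, 5], "cost": [30, 5, 10, 8, 2]})

# Number of extra copies per row: quantity - 1 (so quantity 1 adds nothing).
extra_copies = (df["quantity"] - 1).to_numpy()

# Repeat each index label that many times and pull out the matching rows.
extra_rows = df.loc[df.index.repeat(extra_copies)]

# Append all extra rows in one concat and renumber the index 0..n-1.
df_new = pd.concat([df, extra_rows], ignore_index=True)
print(df_new)
```

If you prefer to keep the loop, the same idea applies: collect the pieces in a list and call pd.concat once at the end, rather than reassigning df_new on every iteration.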
Consider a DataFrame with only one column named values.
import pandas as pd
data_dict = {'values': [5,4,3,8,6,1,2,9,2,10]}  # the key must be a quoted string
df = pd.DataFrame(data_dict)
display(df)
The output will look something like:
values
0 5
1 4
2 3
3 8
4 6
5 1
6 2
7 9
8 2
9 10
I want to generate a new column that holds the trailing high, i.e. the highest value seen so far in the values column.
Expected Output:
values trailing_high
0 5 5
1 4 5
2 3 5
3 8 8
4 6 8
5 1 8
6 2 8
7 9 9
8 2 9
9 10 10
Right now I am using a for loop over df.iterrows() and calculating the value at each row. Because of this, the code is very slow.
Can anyone share a vectorized approach to speed this up?
Use .cummax:
df["trailing_high"] = df["values"].cummax()
print(df)
Output
values trailing_high
0 5 5
1 4 5
2 3 5
3 8 8
4 6 8
5 1 8
6 2 8
7 9 9
8 2 9
9 10 10
I have a portion of my dataframe here:
days = [1, 2, 3, 4, 5]
time = [2, 4, 2, 4, 2, 4, 2, 4, 2]
df1 = pd.DataFrame(days)
df2 = pd.Series(time)
df2 = df2.transpose()
df3 = df1*df2
df4 = df1.dot(df2.to_frame().T)
df4 =
0 1 2 3 4 5 6 7 8
0 2 4 2 4 2 4 2 4 2
1 4 8 4 8 4 8 4 8 4
2 6 12 6 12 6 12 6 12 6
3 8 16 8 16 8 16 8 16 8
4 10 20 10 20 10 20 10 20 10
I have a loop (with an if condition) that creates a single-row dataframe, which looks like:
df_new =
0 1 2 3 4 5 6 7 8
0 2 4 2 4 2 4 2 4 2
I need to be able to loop through and add this row to the end of the larger dataframe a handful of times so the end result looks like this:
df_final =
0 1 2 3 4 5 6 7 8
0 2 4 2 4 2 4 2 4 2
1 4 8 4 8 4 8 4 8 4
2 6 12 6 12 6 12 6 12 6
3 8 16 8 16 8 16 8 16 8
4 10 20 10 20 10 20 10 20 10
5 2 4 2 4 2 4 2 4 2
6 5 6 7 8 9 8 7 6 5
I have tried both append and concat to add the new dataframe to the existing one, but I get errors either way: indexing errors in some cases and looping issues in others. I think I need either a better understanding of why the row cannot be added to the end of the dataframe or an idea for a workaround. The real loop has 25 iterations; I only show two rows here, but the idea is the same: each iteration gives me a new single-row dataframe, and I need to add its data, without the column headers, to the final dataframe. I am willing to update my question as soon as I get a better idea of how this can work. It does not seem like a difficult task, but I suspect I am asking the wrong thing.
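A sketch of one common workaround, under the assumption that every single-row dataframe has the same column labels (0 to 8) as df4: gather the rows in a plain Python list inside the loop and call pd.concat once afterwards with ignore_index=True. The list new_rows below is a hypothetical stand-in for whatever the real 25-iteration loop produces.

```python
import pandas as pd

days = [1, 2, 3, 4, 5]
time = [2, 4, 2, 4, 2, 4, 2, 4, 2]
df4 = pd.DataFrame(days).dot(pd.Series(time).to_frame().T)

# Hypothetical stand-ins for the single-row dataframes the loop creates.
new_rows = [
    pd.DataFrame([[2, 4, 2, 4, 2, 4, 2, 4, 2]]),
    pd.DataFrame([[5, 6, 7, 8, 9, 8, 7, 6, 5]]),
]

pieces = [df4]
for df_new in new_rows:
    # Only collect here; do not concat inside the loop.
    pieces.append(df_new)

# One concat at the end; ignore_index renumbers the rows 0..n-1.
df_final = pd.concat(pieces, ignore_index=True)
print(df_final)
```

Concatenating once is usually faster than growing the dataframe row by row, and ignore_index=True sidesteps the duplicate index label 0 that every single-row dataframe would otherwise bring along.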
I have a column in a CSV file that I would like to gather data on. It is full of integers, and I would like to draw a bar graph of the five most frequent values (the top 5 "modes") in that column. Is there any way to do this?
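Assuming you have a big list of integers in the form of a pandas Series s, s.value_counts().plot.bar() should do it. Note that this plots every distinct value; put .head(5) before .plot.bar() to keep only the five most frequent.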
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.value_counts.html
You can use .value_counts().head().plot(kind='bar'); head() returns the first five entries by default.
For example:
df = pd.DataFrame({'a':[1,1,2,3,5,8,1,5,6,9,8,7,5,6,7],'b':[1,1,2,3,3,3,4,5,6,7,7,7,7,8,2]})
df
a b
0 1 1
1 1 1
2 2 2
3 3 3
4 5 3
5 8 3
6 1 4
7 5 5
8 6 6
9 9 7
10 8 7
11 7 7
12 5 7
13 6 8
14 7 2
df.b.value_counts().head() # count values of column 'b' and show only top 5 values
7 4
3 3
2 2
1 2
8 1
Name: b, dtype: int64
df.b.value_counts().head().plot(kind='bar') #create bar plot for top values
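Since the question starts from a CSV file, here is a minimal sketch of the whole path from file to top-five counts. The column name numbers and the in-memory CSV text are assumptions made up for the example; with a real file you would pass its path to pd.read_csv instead. The plotting line is left commented out because it needs matplotlib installed.

```python
import io
import pandas as pd

# Made-up sample data standing in for the real CSV column.
values = [3] * 6 + [7] * 5 + [1] * 4 + [9] * 3 + [4] * 2 + [8]
csv_text = "numbers\n" + "\n".join(str(v) for v in values)

# With a real file: df = pd.read_csv("your_file.csv")
df = pd.read_csv(io.StringIO(csv_text))

# value_counts sorts by frequency (highest first); head(5) keeps the top five.
top5 = df["numbers"].value_counts().head(5)
print(top5)

# top5.plot(kind="bar")  # uncomment to draw the bar chart (requires matplotlib)
```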
I have a column called id in a dataframe called _newdata, which looks like this. Note that this is only part of the column, not the entire thing.
1
1
1
2
2
2
2
2
4
4
4
4
4
5
5
5
5
7
7
7
7
7
8
8
8
8
10
10
10
What I want to do is renumber the 'id' values so that they form a running sequence with no gaps. That is, I want the column to look like this:
1
1
1
2
2
2
2
2
3
3
3
3
3
4
4
4
4
5
5
5
5
5
6
6
6
6
7
7
7
I tried using this but it didn't seem to do anything to the file. Could someone tell me where I went wrong or suggest a method to do what I want it to do?
count = 1 #values start at 1
for i, row in _newdata.iterrows():
    if row['id']==count or row['id']==count+1:
        pass
    else:
        count+=1
    row['id']=count
You can use rank() with method='dense' (df here stands for your _newdata dataframe):
df['id'] = df['id'].rank(method='dense').astype(int)
I have a dataframe that looks like this:
test_data = pd.DataFrame(np.array([np.arange(10)]*3).T, columns =['issuer_id','winner_id','gov'])
issuer_id winner_id gov
0 0 0 0
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 4
5 5 5 5
6 6 6 6
7 7 7 7
8 8 8 8
9 9 9 9
and a list of two-tuples consisting of a dataframe and a label encoding 'gov' (perhaps a label:dataframe dict would be better). In test_out below the two labels are 2 and 7.
test_out = [(pd.DataFrame(np.array([np.arange(10)]*2).T, columns =['id','partition']),2),(pd.DataFrame(np.array([np.arange(10)]*2).T, columns =['id','partition']),7)]
[( id partition
0 0 0
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5
6 6 6
7 7 7
8 8 8
9 9 9, 2), ( id partition
0 0 0
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5
6 6 6
7 7 7
8 8 8
9 9 9, 7)]
I want to add two columns to the test_data dataframe: issuer_partition and winner_partition
test_data['issuer_partition']=''
test_data['winner_partition']=''
and I would like to fill in these values from the test_out list, where the entry in the gov column determines which labeled dataframe in test_out to draw from. Then I look up the winner_id and issuer_id in that id-partition dataframe and write the matching partition values into test_data.
Put another way: I have a list of labeled dataframes that I would like to loop through to conditionally fill in data in a primary dataframe.
Is there a clever way to use merge in this scenario?
*edit - added another sentence and fixed test_out code
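One way this might be done with merge, as a sketch rather than a definitive answer: stack the labeled dataframes into a single lookup table that carries the label as a gov column, then left-merge it onto test_data twice, once keyed on issuer_id and once on winner_id. The names lookup, renamed and result are illustrative, and rows whose gov has no labeled dataframe are assumed to stay empty (NaN).

```python
import numpy as np
import pandas as pd

test_data = pd.DataFrame(np.array([np.arange(10)] * 3).T,
                         columns=['issuer_id', 'winner_id', 'gov'])
test_out = [
    (pd.DataFrame(np.array([np.arange(10)] * 2).T, columns=['id', 'partition']), 2),
    (pd.DataFrame(np.array([np.arange(10)] * 2).T, columns=['id', 'partition']), 7),
]

# Stack all labeled frames into one lookup table with the label as 'gov'.
lookup = pd.concat(
    [frame.assign(gov=label) for frame, label in test_out],
    ignore_index=True,
)

result = test_data.copy()
for role in ['issuer', 'winner']:
    # Rename so the merge keys and the output column match this role.
    renamed = lookup.rename(columns={'id': f'{role}_id',
                                     'partition': f'{role}_partition'})
    # Left merge keeps every row of test_data; unmatched rows get NaN.
    result = result.merge(renamed, on=['gov', f'{role}_id'], how='left')

print(result)
```

A label:dataframe dict, as suggested in the question, would work the same way: iterate over its items() in the list comprehension that builds lookup.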