Changing the dimension (axis) of the dataframe - python

I know the community hates people uploading an image, but it is hard to explain without showing the dataframe I have.
Is there any way I can group the data by the columns 'Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume', 'Symbol' like this:
I have been browsing through the pandas documentation for days and have tried plenty of methods, but it still doesn't work. Thank you, and sorry for uploading an image.
Update:
The code for the df is as follows:
import yfinance as yf
stock_df = yf.download(["AAPL","GOOG"], start="2020-05-19", end="2020-05-20", interval='1m',group_by='ticker')
stock_df
You need to pip install yfinance first, though. Hope this helps you test it, thanks. group_by= can be deleted, so then the stocks are grouped by column instead. However, they are still separated: you can see there are 12 columns, 6 of which are repeated. Is there any way to add a Symbol column like in my expected output? Thanks

You can try this:
df.rename_axis(('Symbol', None), axis=1).stack(level=0).reset_index(level=1).sort_values('Symbol')
Though I am unsure how your AACG rows have data when your original frame shows NaNs.
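As a sketch of what that chain does, here is a minimal, self-contained example with a made-up two-ticker frame (the tickers, timestamps, and values below are invented, not real yfinance output):

```python
import numpy as np
import pandas as pd

# Fabricated frame shaped like yf.download(..., group_by='ticker'):
# one column level for the ticker, one for the price field
cols = pd.MultiIndex.from_product([['AAPL', 'GOOG'], ['Open', 'Close']])
idx = pd.to_datetime(['2020-05-19 09:30', '2020-05-19 09:31'])
df = pd.DataFrame(np.arange(8.0).reshape(2, 4), index=idx, columns=cols)

# Name the ticker level 'Symbol', stack it into the index,
# then move it back out as an ordinary column
out = (df.rename_axis(('Symbol', None), axis=1)
         .stack(level=0)
         .reset_index(level=1)
         .sort_values('Symbol'))
print(out)
```

Each timestamp now appears once per ticker, with a plain Symbol column alongside the Open/Close fields.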

Difference between using pandas drop_duplicates or value_counts on a whole frame or one column

I am a new Python user, just trying to finish my homework, but I am willing to dig deeper when I run into questions.
OK, the problem is from my professor's sample code for data cleaning. He uses drop_duplicates() and value_counts() to check the unique values of a frame; here is his code:
spyq['sym_root'].value_counts() #method1
spyq['date'].drop_duplicates() #method2
Here is the output:
SPY 7762857 #method1
0 20100506 #method2
I use spyq.shape to help you understand the spyq dataframe:
spyq.shape  # output is (7762857, 9)
spyq is a dataframe containing the trading history for SPY on one day, 05/06/2010.
OK, after seeing this, I wondered why he specifies a column 'date' or 'sym_root'; why doesn't he just use the whole frame, spyq.drop_duplicates() or spyq.value_counts()? So I gave it a try:
spyq.value_counts()
spyq.drop_duplicates()
Both outputs have shape (6993487, 9).
The row count has decreased!
But from my professor's sample code, there should be no duplicated rows, because the row count from method 1's output is exactly the same as the row count from spyq.shape!
I am so confused why the output for the whole dataframe, spyq.drop_duplicates(), is not the same as spyq['column'].drop_duplicates() when there are no repeated values!
I tried to use
spyq.loc[spyq.drop_duplicates()]
to see what was dropped, but it raises an error.
Can anyone kindly help me? I know my question is kind of stupid, but I just want to figure it out. I want to learn Python from the most fundamental part, not just learn some code to finish homework.
Thanks!
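No answer is shown here, but the behaviour can be reproduced on a tiny invented frame: drop_duplicates() on the whole frame only drops rows that match in every column, while df['col'].drop_duplicates() looks at that one column alone, so the two row counts need not agree. And to inspect what was dropped, a boolean mask from duplicated() works where spyq.loc[spyq.drop_duplicates()] fails (loc expects labels or a boolean mask, not a dataframe):

```python
import pandas as pd

# Invented miniature of the situation: one symbol repeated, prices varying
df = pd.DataFrame({'sym_root': ['SPY', 'SPY', 'SPY'],
                   'price':    [100, 100, 101]})

# Whole frame: rows 0 and 1 are identical in every column, so one is dropped
whole = df.drop_duplicates()

# Single column: only that column's values matter, so just one 'SPY' survives
col = df['sym_root'].drop_duplicates()

# To see which rows were dropped, filter with a boolean mask
dropped = df[df.duplicated()]

print(whole.shape, col.shape, dropped.shape)
```

So a column with one unique value repeated 7,762,857 times is fully consistent with the whole frame still containing duplicated rows across all nine columns.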

Why and how to solve the data loss when multi-label encoding in python pandas

Download the Data Here
Hi, I have data something like below and would like to multi-label it.
Something like this: target
But the problem here is that data is lost when multi-labelling it, something like below:
issue
using the code:
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer(sparse_output=True)
df_enc = df.drop('movieId', axis=1).join(df.movieId.str.join('|').str.get_dummies())
Can someone help me? Feel free to download the dataset, thank you.
That column, when read in with pandas, will be stored as a string, so first we need to convert it to an actual list.
From there, use .explode() to expand that list into a Series (where the index will match the index it came from, and the values will be the values in the list).
Then crosstab that Series so each index/value pair becomes a row and column.
Then join that back up with the dataframe on the index values.
Keep in mind that when you do one-hot encoding with high cardinality, your table will blow up into a huge, wide table. I just did this on the first 20 rows and ended up with 233 columns; with the 225,000+ rows it'll take a while (maybe a minute or so) to process, and you end up with close to 1,300 columns. This may be too complex for machine learning to do anything useful with (although it might work with deep learning). You could still try it and see what you get. What I would suggest is to find a way to simplify it a bit to make it less complex, perhaps by combining movie ids into a set number of genres or something like that, and then test whether simplifying it improves your model/performance.
import pandas as pd
from ast import literal_eval

df = pd.read_csv('ratings_action.csv')

# The movieId column is read in as a string; turn it back into a real list
df.movieId = df.movieId.apply(literal_eval)

# One row per list element, keeping the original index
s = df['movieId'].explode()

# Index-vs-value crosstab gives the one-hot columns; join back on the index
df = df[['userId']].join(pd.crosstab(s.index, s))
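Since the real CSV isn't available here, the same steps can be sketched on a tiny invented frame where movieId holds a stringified list (the ids and user values are made up):

```python
import pandas as pd
from ast import literal_eval

# Invented stand-in for the ratings file
df = pd.DataFrame({'userId': [1, 2],
                   'movieId': ['[10, 20]', '[20]']})

# String -> real list
df['movieId'] = df['movieId'].apply(literal_eval)

# One row per list element, original index preserved
s = df['movieId'].explode()

# Cross-tabulate index vs. value, then join back on the index
out = df[['userId']].join(pd.crosstab(s.index, s))
print(out)
```

User 1 gets a 1 under both 10 and 20; user 2 gets a 1 only under 20, with no rows lost.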

How to crunch down Pandas rows into one, applying different conditions to columns

I have the following
I need to figure out how to get something like the following out of it.
By grouping by both NAME and DATE, getting the total FT but saving the first ORI and the last END.
I have been trying variations of groupby with aggregations, to no avail.
df.groupby(["NAME"]).agg(sum)
Any help would be appreciated, thank you.
You can do groupby.agg:
(df.groupby(['NAME', 'DATE'])
   .agg({'ORI': 'first', 'END': 'last',
         'TO': 'first', 'LAND': 'last', 'FT': 'sum'})
)
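Applied to an invented frame with the same column names (the values are made up, and the TO/LAND columns are omitted for brevity), the first/last/sum aggregation looks like this:

```python
import pandas as pd

# Invented data matching the question's NAME/DATE/ORI/END/FT columns
df = pd.DataFrame({
    'NAME': ['A', 'A', 'B'],
    'DATE': ['2020-01-01', '2020-01-01', '2020-01-01'],
    'ORI':  ['JFK', 'LAX', 'SFO'],
    'END':  ['LAX', 'ORD', 'SEA'],
    'FT':   [5, 4, 2],
})

# One row per (NAME, DATE): first origin, last destination, summed FT
out = (df.groupby(['NAME', 'DATE'], as_index=False)
         .agg({'ORI': 'first', 'END': 'last', 'FT': 'sum'}))
print(out)
```

The two 'A' rows collapse into one with ORI 'JFK', END 'ORD', and FT 9.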

(Python) manually copy/paste data from pandas table without copying the index

I've been looking around but could not find a similar post, so I thought I'd give it a go.
I wrote a pandas program that successfully displays the resulting dataframe in pandas table format in a tkinter textbox. The aim is that the user can select the data and copy/paste it into an (existing) Excel sheet. When doing this, the index is always copied as well. I was wondering if one could programmatically select the complete table except the index?
I know that one can save to Excel or other formats with index=False, but I could not find a kind of df.select....index=False. I hope my explanation is more or less clear ;-)
Thanks a lot
You could use the dataframe's to_string function, where you can pass index=False as one of the parameters. For example, say we have this df:
import pandas as pd
df = pd.DataFrame({'a': ['yes', 'no', 'yes' ], 'b': [10, 5, 20]})
print(df.to_string(index = False))
this would give you:
  a   b
yes  10
 no   5
yes  20
Hope this helps!
I finally found it.
Instead of using something like self.mytable.copy('columns') to select everything and then switching to Excel and pasting, I use this line of code, which does exactly what I need:
df.to_clipboard(sep="\t", index=False)
The sep="\t" makes it split up amongst columns in Excel.
Hopefully someone can use this at some stage.

While true, pandas dataframe won't show?

Using Jupyter Notebook, if I put in the following code:
import pandas as pd
df = pd.read_csv('path/to/csv')
while True:
    df
The dataframe won't show. Can anyone tell me why this is the case? I'm guessing the constant looping is preventing the dataframe from loading fully. Is that what's happening here?
I need code that lets me get a user's input. If they type in a name, for example, I'll extract that person's info from the dataframe and display it, then the program needs to ask for another name. This continues until they type in "quit". I figured a while loop would be best for that, but it looks like there's just something about while loops and pandas that doesn't mix. Does anyone have any suggestions on what I can do instead?
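One way this can be made to work, as a sketch with an invented dataframe and column names: a Jupyter cell only auto-displays the value of its last expression, so a bare df inside a loop shows nothing, and while True never lets the cell finish. Calling print() (or IPython's display()) explicitly on each match, and breaking on "quit", gives a loop like the one described:

```python
import pandas as pd

# Invented lookup table
df = pd.DataFrame({'name': ['Alice', 'Bob'], 'age': [30, 25]})

def lookup_loop(df, reader=input):
    """Ask for names until 'quit'; print each match explicitly."""
    results = []
    while True:
        name = reader("Enter a name (or 'quit'): ")
        if name == 'quit':
            break
        match = df[df['name'] == name]
        print(match)  # a bare `match` here would display nothing in Jupyter
        results.append(match)
    return results
```

Passing reader=input keeps it interactive; injecting a fake reader (e.g. a lambda over a list of canned answers) makes the loop easy to test without a keyboard.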
