Cryptocurrency correlation in Python (pandas), working with dictionaries

I'm working with a cryptocurrency data sample in which each cell contains a dictionary. The dictionary holds the open price, close price, highest price, lowest price, volume and market cap. The columns are the corresponding dates and the index is the name of each cryptocurrency.
I don't know how to prepare the data so that I can find the correlation between different currencies, or between highest price and volume, for example. How can this be done in Python (pandas), and how would I define a date range in such a situation?
Here's a link to the data sample, my coding and a printout of the data (Access is OPEN TO PUBLIC): https://drive.google.com/open?id=1mjgq0lEf46OmF4zK8sboXylleNs0zx7I

To begin with, I would suggest rearranging your data so that each currency's OHLCV values are their own columns (e.g. "btc_open | btc_high" etc.). This makes generating correlation matrices far easier. I'd also suggest beginning with only one metric (e.g. close price) and perhaps period movement (e.g. close-open) in your analysis. To answer your question:
Pandas can return a correlation matrix of all columns with:
df.corr()
If you want to use only specific columns, select those from the DataFrame:
df[["col1", "col2"]].corr()
You can return a single correlation value between two columns with the form:
df["col1"].corr(df["col2"])
If you'd like to restrict the analysis to a specific date range, I'd refer you to this question. I believe this will require your date column or index to be of type datetime. If you don't know how to work with or convert to this type, I would suggest consulting the pandas documentation (perhaps beginning with pandas.to_datetime).
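As a rough sketch of what that could look like, using the btc_/eth_-style columns suggested above (the exact column names are assumptions), once the index is a DatetimeIndex you can slice by date labels and then compute correlations on the slice:
import pandas as pd

# assumed layout: one column per coin/metric (e.g. btc_close, eth_close),
# with the dates stored in the index
df.index = pd.to_datetime(df.index)            # make the index a DatetimeIndex
subset = df.loc["2017-01-01":"2017-06-30"]     # select a date range by label

subset.corr()                                  # correlation matrix for that period
subset["btc_close"].corr(subset["eth_close"])  # single pairwise correlation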
In the future, I would suggest including a data snippet directly in your post. I don't believe Google Drive is an appropriate format for sharing data, and it is definitely not appropriate to set the data to "request access".
EDIT: I checked your data and created a smaller subset to test this method on. If there are imperfections in the data you may find problems, but I had none when I tested it on a sample of your first 100 days and 10 coins (after transposing, df.iloc[:100, :10]).
Firstly, transpose the DataFrame so columns are organised by coin and rows are dates.
df = df.T
Following this, we concatenate to a new DataFrame (result). Alternatively, concatenate to the original and drop the old columns afterwards. Unfortunately I can't think of a non-iterative method. This method goes column by column, creates a DataFrame for each coin, adds the coin name prefix to the column names, then concatenates each DataFrame to the end.
import pandas as pd

result = pd.DataFrame()
coins = df.columns.tolist()
for coin in coins:
    # expand each coin's dictionary cells into columns, prefixed with the coin name
    coin_data = df[coin]
    split_coin = coin_data.apply(pd.Series).add_prefix(coin + "_")
    result = pd.concat([result, split_coin], axis=1)
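From there, the correlation calls shown earlier apply directly to result. For example (the column names below are assumptions; the exact names depend on your coin names and the dictionary keys in your data):
# correlation matrix over every close-price column, whatever the exact names turn out to be
result.filter(like="close").corr()

# or a single pairwise value between two assumed columns
result["bitcoin_close"].corr(result["ethereum_close"])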

Related

How to expand groupby df Pandas python

I have filtered a pandas data frame by grouping and taking the sum. Now I want all the details back and no longer need the sum.
For example, what I have looks like the image below.
What I want is for each of the individual transactions to be shown: currently the Amount column is the sum of all transactions done by an individual on a specific date, and I want to see all the individual amounts. Is this possible?
I don't know how to filter the larger df by the groupby one. I have also tried using isin() with multiple &s, but it does not work because, for example, "David" could be in my groupby df on Sept 15 while in the larger df he has made transactions on other days as well, and those slip through when using isin().
Hello there and welcome,
first of all, as I've learned myself, always try:
to give some data (in text or code form) as your input
to share your expected output, to avoid follow-up questions
to have fun :-)
I'm new as well, and I did my best to cover as many possibilities as I could; at the very least, people can use my code below to reproduce your df.
import pandas as pd

# From the picture in the question
data={'Date': ['2014-06-30','2014-07-02','2014-07-02','2014-07-03','2014-07-09','2014-07-14','2014-07-17','2014-07-25','2014-07-29','2014-07-29','2014-08-06','2014-08-11','2014-08-22'],
'LastName':['Cow','Kind','Lion','Steel','Torn','White','Goth','Hin','Hin','Torn','Goth','Hin','Hin'],
'FirstName':['C','J','K','J','M','D','M','G','G','M','M','G','G'],
'Vendor':['Jail','Vet','TGI','Dept','Show','Still','Turf','Glass','Sup','Ref','Turf','Lock','Brenn'],
'Amount': [5015.70,6293.27,7043.00,7600,9887.08,5131.74,5037.55,5273.55,9455.48,5003.71,6675,7670.5,8698.18]
}
df=pd.DataFrame(data)
# what I believe you did to get the data grouped by Date
incoming = df.groupby(['Date','LastName','FirstName','Vendor','Amount']).count()
incoming
Now here is my answer:
Firstly I merged FirstName and LastName:
df['CompleteName'] = df[['FirstName','LastName']].agg('.'.join, axis=1)  # getting a combined name column for df
Then I computed some statistics on the amount for different groupings:
# create new columns with per-group sums of Amount (by CompleteName, Date and Vendor)
df['AmountSumName']=df['Amount'].groupby(df['CompleteName']).transform('sum')
df['AmountSumDate']=df['Amount'].groupby(df['Date']).transform('sum')
df['AmountSumVendor']=df['Amount'].groupby(df['Vendor']).transform('sum')
df
Now just groupby as you wish
Hope I could answer your question.
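For the filtering part of your question (getting the individual transactions behind a grouped row back), one common pattern is an inner merge of the original df with the key combinations you selected from the grouped summary, rather than isin(); a minimal sketch, assuming the column names from the example data above and that your summary is keyed by date and name:
# the (Date, LastName, FirstName) combinations present in the grouped summary
wanted = incoming.reset_index()[['Date', 'LastName', 'FirstName']].drop_duplicates()

# an inner merge keeps only the original transactions matching those exact combinations,
# so "David" on other dates no longer slips through as it would with isin()
details = df.merge(wanted, on=['Date', 'LastName', 'FirstName'], how='inner')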

Restructuring Pandas Dataframe for large number of columns

I have a pandas dataframe which is a large number of answers given by users in response to a survey and I need to re-structure it. There are up to 105 questions asked each year, but I only need maybe 20 of them.
The current structure is as below.
What I want to do is re-structure it so that the row values become column names and the answer given by the user is then the value in that column. In a picture (from Excel), what I want is the below (I know I'll need to re-name my columns, but that's fine once I can create the structure in the first place):
Is it possible to re-structure my dataframe this way? The outcome of this is to use some predictive analytics to predict a target variable, so I need to re-structure the data before I can use Random Forest, kNN, and so on.
You might want to try pivoting your table:
df = df.pivot(index=['SurveyID', 'UserID'], columns=['QuestionID'], values=['AnswerText'])
df.columns = [x[0] if x[1] == "" else "Answer_{}".format(x[1]) for x in df.columns.to_flat_index()]
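To illustrate on a tiny made-up survey (the column names are assumed from your description, and the list-valued index argument to pivot needs a reasonably recent pandas):
import pandas as pd

long_df = pd.DataFrame({
    "SurveyID":   [2020, 2020, 2020, 2020],
    "UserID":     [1, 1, 2, 2],
    "QuestionID": [101, 102, 101, 102],
    "AnswerText": ["Yes", "No", "No", "Maybe"],
})

wide = long_df.pivot(index=["SurveyID", "UserID"],
                     columns="QuestionID",
                     values="AnswerText")
wide.columns = ["Answer_{}".format(q) for q in wide.columns]  # rename the question columns
wide = wide.reset_index()
# wide now has one row per (SurveyID, UserID) and one Answer_<QuestionID> column per question,
# which is the shape the tree-based models expect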

Create Loop to dynamically select rows from dataframe, then append selected rows to another dataframe: df.query()

I am currently working with dataframes in pandas. In sum, I have a dataframe called "Claims" filled with customer claims data, and I want to parse all the rows in the dataframe based on the unique values found in the field 'Part ID.' I would then like to take each set of rows and append it one at a time to an empty dataframe called "emptydf." This dataframe has the same column headings as the "Claims" dataframe. Since the values in the 'Part ID' column change from week to week, I would like to find some way to do this dynamically, rather than comb through the dataframe each week manually. I was thinking of somehow incorporating the df.where() expression and a For Loop, but am at a loss as to how to put it all together. Any insight into how to go about this, or even some better methods, would be great! The code I have thus far is divided into two steps as follows:
#1. Create an empty dataframe with the same columns as Claims
emptydf = Claims[0:0]

#2. Parse the dataframe by one Part ID number and append it to the empty dataframe
Parse_Claims = Claims.query('Part_ID == 1009')
emptydf = emptydf.append(Parse_Claims)

As you can see, I can only hard code one Part ID number at a time so far. This would take hours to complete manually, so I would love to figure out a way to iterate through the Part ID column and append the data dynamically.
Needless to say, I am super new to Python, so I definitely appreciate your patience in advance!
empty_df = list(Claims.groupby(Claims['Part_ID']))
This will create a list of tuples, one for each Part ID. Each tuple has two elements: the first is the Part ID and the second is the sub-DataFrame for that Part ID.
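If you prefer keyed access, or want one combined DataFrame again, a small variation on the same idea (a sketch; it assumes the Claims DataFrame and Part_ID column from the question, with pandas imported as pd):
# dictionary of sub-DataFrames keyed by part id, built without hard-coding any id
part_frames = {part_id: sub_df for part_id, sub_df in Claims.groupby('Part_ID')}

# e.g. part_frames[1009] gives the same rows as Claims.query('Part_ID == 1009')

# or stack every group back into a single DataFrame, like the manual append loop would
all_parts = pd.concat(list(part_frames.values()), ignore_index=True)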

Aggregating a pandas dataframe using groupby, then using apply.... but how to then add the output back into original dataframe?

I have some data with 4 features of interest: account_id, location_id, date_from and date_to. Each entry corresponds to a period where a customer account was associated with a particular location.
There are some pairs of account_id and location_id which have multiple entries, with different dates. This means that the customer is associated with the location for a longer period, covered by multiple consecutive entries.
So I want to create an extra column with the total length of time that a customer was associated with a given location. I am able to use groupby and apply to calculate this for each pair (see the code below). This works fine, but I don't understand how to then add this back into the original dataframe as a new column.
lengths = non_zero_df.groupby(['account_id','location_id'], group_keys=False).apply(lambda x: x.date_to.max() - x.date_from.min())
Thanks
I think Mephy is right that this should probably go to StackOverflow.
You're going to have a shape incompatibility because there will be fewer entries in the grouped result than in the original table. You'll need to do the equivalent of an SQL left outer join with the original table and the results, and you'll have the total length show up multiple times in the new column -- every time you have an equal (account_id, location_id) pair, you'll have the same value in the new column. (There's nothing necessarily wrong with this, but it could cause an issue if people are trying to sum up the new column, for example)
Check out pandas.DataFrame.join (you can also use merge). You'll want to join the old table with the results, on (account_id, location_id), as a left (or outer) join.
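A rough sketch of that join, built from the groupby/apply line in the question (it assumes date_from and date_to are already datetime-like):
# per (account_id, location_id) total length, turned into a named column
lengths = (non_zero_df
           .groupby(['account_id', 'location_id'])
           .apply(lambda x: x.date_to.max() - x.date_from.min())
           .rename('total_length')
           .reset_index())

# left join back onto the original table; the total repeats on every row of each pair
non_zero_df = non_zero_df.merge(lengths, on=['account_id', 'location_id'], how='left')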

Comparing Pandas Dataframe Rows & Dropping rows with overlapping dates

I have a dataframe filled with trades taken from a trading strategy. The logic in the trading strategy needs to be updated to ensure that a trade isn't taken if the strategy is already in a trade - but that's a different problem. The trade data for many previous trades is read into a dataframe from a csv file.
Here's my problem for the data I have:
I need to do a row-by-row comparison of the dataframe to determine whether the EntryDate of row X is earlier than the ExitDate of row X-1.
A sample of my data:
Row 1:
EntryDate ExitDate
2012-07-25 2012-07-27
Row 2:
EntryDate ExitDate
2012-07-26 2012-07-29
Row 2 needs to be deleted because it is a trade that should not have occurred.
I'm having trouble identifying which rows are duplicates and then dropping them. I tried the approach in answer 3 of this question with some luck but it isn't ideal because I have to manually iterate through the dataframe and read each row's data. My current approach is below and is ugly as can be. I check the dates, and then add them to a new dataframe. Additionally, this approach gives me multiple duplicates in the final dataframe.
for i in range(0, len(df) + 1):
    if i + 1 == len(df): break  # to keep from going past the last row
    EntryDate = df['EntryDate'].irow(i)
    ExitDate = df['ExitDate'].irow(i)
    EntryNextTrade = df['EntryDate'].irow(i + 1)
    if EntryNextTrade > ExitDate:
        line = {'EntryDate': EntryDate, 'ExitDate': ExitDate}
        df_trades = df_trades.append(line, ignore_index=True)
Any thoughts or ideas on how to more efficiently accomplish this?
You can click here to see a sampling of my data if you want to try to reproduce my actual dataframe.
You should use some kind of boolean mask to do this kind of operation.
One way is to create a dummy column holding the next trade's entry date:
df['EntryNextTrade'] = df['EntryDate'].shift(-1)
Use this to create the mask:
msk = df['EntryNextTrade'] > df['ExitDate']
And use loc to look at the subDataFrame where msk is True, and only the specified columns:
df.loc[msk, ['EntryDate', 'ExitDate']]
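If the goal is specifically to drop a trade whose EntryDate falls before the previous trade's ExitDate (the Row 1 / Row 2 example above), a closely related mask can be built from the previous row's exit instead; a small sketch using the sample dates, assuming the date columns are (or are converted to) datetime:
import pandas as pd

df = pd.DataFrame({
    'EntryDate': pd.to_datetime(['2012-07-25', '2012-07-26']),
    'ExitDate':  pd.to_datetime(['2012-07-27', '2012-07-29']),
})

# keep a row only if its entry is on or after the previous trade's exit;
# the first row has no previous trade, so it is always kept
keep = (df['EntryDate'] >= df['ExitDate'].shift()) | df['ExitDate'].shift().isna()

df_trades = df.loc[keep].reset_index(drop=True)
# only the 2012-07-25 / 2012-07-27 trade remains; the overlapping second trade is dropped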
