Pandas dataframe left-merge with different dataframe sizes - python

I have a toy stock predictor and save results to dataframes from time to time. After the first result set I would like to append new results to my first dataframe. Here is what I do:
1. Create the first dataframe from the predicted results.
2. Sort it descending by predicted performance.
3. Save it to CSV, without the index.
4. With new data, read the result CSV back in and try a left merge; the goal is to append the new predicted performance to the correct stock ticker:
df=pd.merge(df, df_new[['ticker', 'avgrd_app']], on='ticker', how='left')
Those two dataframes have different numbers of columns. In the end the merge only appends the dataframes to one another:
avgrd,avgrd_app,prediction1,prediction2,ticker
-0.533520756811,,110.64654541,110.37853241,KIO
-0.533520756811,,110.64654541,110.37853241,MMM
-0.604610694122,,110.64654541,110.37853241,SRI
[...]
,-0.212600450514,,,G5DN
,0.96378750992,,,G5N
,2.92757501984,,,DAL3
,2.27297945023,,,WHF4
So - how can I merge correctly?

From the sample result, the merge works as expected: the new data doesn't have numbers for all the tickers, so some of the predictions are missing. So what exactly do you want to achieve? If you only need the stocks that have all the predictions, use an inner join.
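For example, the same merge from the question with an inner join keeps only the tickers that appear in both frames:
df = pd.merge(df, df_new[['ticker', 'avgrd_app']], on='ticker', how='inner')
Rows whose ticker has no new prediction are then dropped instead of being filled with NaN.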

Related

Combining/Grouping the dataset using one column while keeping the other data

I have merged many dataframes using the merge command and append, and I am facing a problem of data redundancy in the final dataset. I have tried using groupby() on a unique attribute, but the end result still contains redundant data.
Tried:
removedRedun = data.groupby("Name", group_keys=False).apply(lambda x: x)
Screenshots of the actual dataset and the expected result (omitted here).
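One common fix, sketched under the assumption that the redundancy is exact duplicate rows (or duplicate rows per the "Name" key) left over from the appends:
removedRedun = data.drop_duplicates()                             # drop fully identical rows
# or, if "Name" should be unique, keep only the first row per Name
removedRedun = data.drop_duplicates(subset="Name", keep="first")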

How to merge two data frames in Pandas without losing values

I have two data frames that I imported as spreadsheets into Pandas and cleaned up. They have a similar key value called 'PurchaseOrders' that I am using to match product numbers to a shipment number. When I attempt to merge them, I only end up with a df of 34 rows, but I have over 400 pairs of matching product to shipment numbers.
This is the closest I've gotten, but I have also tried using join()
ShipSheet = pd.merge(new_df, orders, how='inner')
ShipSheet.shape
Here is my orders df (screenshot omitted), and here is the new_df that I want to add to it using the 'PurchaseOrders' key (screenshot omitted). In the end, I want them to look like this (end goal screenshot omitted).
I am not sure if I'm using the merge function improperly, but my end product should have around 300+ rows. I will note that the new_df data frame's 'PurchaseOrders' values had to be delimited from a single column and split into rows, so I guess this could have something to do with it.
Use the merge method on the dataframe and specify the key
merged_inner = pd.merge(left=df_left, right=df_right, left_on='PurchaseOrders', right_on='PurchaseOrders')
Use the concat method in pandas and specify the axis.
final_df = pd.concat([new_df, order], axis = 1)
When you specify the axis, be careful: with axis=0 the second data frame is placed under the first one, and with axis=1 it is placed to the right of the first data frame.
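A toy illustration of the difference (hypothetical frames a and b with a shared key column):
import pandas as pd

a = pd.DataFrame({"PurchaseOrders": [1, 2], "product": ["x", "y"]})
b = pd.DataFrame({"PurchaseOrders": [1, 2], "shipment": [10, 20]})
stacked = pd.concat([a, b], axis=0)       # b's rows go under a's; unshared columns become NaN
side_by_side = pd.concat([a, b], axis=1)  # b's columns go to the right of a's, aligned on the index
Keep in mind that axis=1 aligns rows on the index, not on 'PurchaseOrders', so the rows must already be in matching order; otherwise the key-based merge above is safer.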

How to assign the same array of columns to multiple dataframes in Pandas?

I have 9 data sets. Between any 2 given data sets, they will share about 60-80% of the same columns. I want to concatenate these data sets into one data set. Due to some memory limitations, I can't load these datasets into data frames and use the concatenate function in pandas (but I can load each individual data set into a data frame). Instead, I am looking at an alternative solution.
I have created an ordered list of all columns that exist across these data sets, and I want to apply this column list to each of the 9 individual data sets so that they all have the same columns in the same order. Once that is done I will concatenate the flat files in the terminal, which will essentially append the data sets together, hopefully solving my issue and creating one single dataset from these 9.
The problem I am having is applying the ordered list to 9 data sets. I keep getting a KeyError "[[list of columns]] not in index" whenever I try to change the columns in the single data sets.
This is what I have been trying:
df = df[clist]
I have also tried
df = df.reindex(columns=clist)
but this doesn't create the extra columns in the data frame, it just orders them in the order that clist is in.
I expect the result to be 9 datasets that line up on the same axis for an append or concat operation outside pandas.
I just solved it.
The reindex function does work. I was applying the reindex function outside of the list of dataframes I had created.
I loaded the first few rows of each of these 9 datasets into a list:
li = []
for filename in all_files:
    df = pd.read_csv(filename, nrows=10)
    li.append(df)
And from that list I used reindex as follows:
for i in range(0, 9):
    li[i] = li[i].reindex(columns=clist)
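Putting it together, a minimal sketch (assuming the data sets are CSV files and clist is the ordered master column list) that aligns each file and writes it back out for concatenation outside pandas:
import pandas as pd

# clist: ordered master column list; all_files: paths to the 9 data sets
for filename in all_files:
    df = pd.read_csv(filename)
    df = df.reindex(columns=clist)                     # missing columns are added and filled with NaN
    df.to_csv(filename + ".aligned.csv", index=False)
Writing with index=False keeps the flat files aligned column-for-column for the terminal concatenation.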

Cryptocurrency correlation in python, working with dictionaries

I'm working with a cryptocurrency data sample where each cell contains a dictionary. The dictionary contains the open price, close price, highest price, lowest price, volume and market cap. The columns are the corresponding dates and the index is the name of each cryptocurrency.
I don't know how to prepare the data so that I can find the correlation between different currencies, or, for example, between highest price and volume. How can this be done in Python (pandas)? Also, how would I define a date range in such a situation?
Here's a link to the data sample, my coding and a printout of the data (Access is OPEN TO PUBLIC): https://drive.google.com/open?id=1mjgq0lEf46OmF4zK8sboXylleNs0zx7I
To begin with, I would suggest rearranging your data so that each currency's OHLCV values are their own columns (e.g. "btc_open | btc_high" etc.). This makes generating correlation matrices far easier. I'd also suggest beginning with only one metric (e.g. close price) and perhaps period movement (e.g. close-open) in your analysis. To answer your question:
Pandas can return a correlation matrix of all columns with:
df.corr()
If you want to use only specific columns, select those from the DataFrame:
df[["col1", "col2"]].corr()
You can return a single correlation value between two columns with the form:
df["col1"].corr(df["col2"])
If you'd like to specify a specific date range, I'd refer you to this question. I believe this will require your date column or index to be of the type datetime. If you don't know how to work with or convert to this type, I would suggest consulting the pandas documentation (perhaps begin with pandas.to_datetime).
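For example, a small sketch (assuming the dates end up in the index, e.g. after the transpose described below) of slicing a date range with .loc:
df.index = pd.to_datetime(df.index)
df = df.sort_index()
subset = df.loc["2017-01-01":"2017-06-30"]   # label-based, inclusive date slice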
In future, I would suggest including a data snippet in your post. I don't believe Google Drive is an appropriate form to share data, and it definitely is not appropriate to set the data to "request access".
EDIT: I checked your data and created a smaller subset to test this method on. If there are imperfections in the data you may find problems, but I had none when I tested it on a sample of your first 100 days and 10 coins (after transposing, df.iloc[:100, :10]).
Firstly, transpose the DataFrame so columns are organised by coin and rows are dates.
df = df.T
Following this, we concatenate into a new DataFrame (result). Alternatively, concatenate to the original and drop columns afterwards. Unfortunately I can't think of a non-iterative method. This method goes column by column, creates a DataFrame for each coin, adds the coin name as a prefix to the column names, then concatenates each DataFrame onto the end of result.
result = pd.DataFrame()
coins = df.columns.tolist()
for coin in coins:
    coin_data = df[coin]
    # expand each cell's dict into its own columns, prefixed with the coin name
    split_coin = coin_data.apply(pd.Series).add_prefix(coin + "_")
    result = pd.concat([result, split_coin], axis=1)
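With result in this shape, the correlations asked about in the question are straightforward; for example (column names here are hypothetical, use whatever add_prefix actually produced):
result["bitcoin_high"].corr(result["bitcoin_volume"])     # one coin: highest price vs volume
result[["bitcoin_close", "ethereum_close"]].corr()        # close-price correlation between coins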

Turning DataFrameGroupBy.resample hierarchical index into columns

I have a dataset that contains individual observations that I need to aggregate at coarse time intervals, as a function of several indicator variables at each time interval. I assumed the solution here was to do a groupby operation, followed by a resample:
adult_resampled = (adult_data.set_index('culture', drop=False)
                   .groupby(['over64', 'regioneast', 'pneumo7', 'pneumo13',
                             'pneumo23', 'pneumononPCV', 'PENR', 'LEVR',
                             'ERYTHR', 'PENS', 'LEVS', 'ERYTHS'])['culture']
                   .resample('AS', how='count'))
The result is an awkward-looking series with a massive hierarchical index, so perhaps this is not the right approach, but I need to then turn the hierarchical index into columns. The only way I can do that now is to hack the hierarchical index (by pulling out the index labels, which are essentially the contents of the columns I need).
Any tips on what I ought to have done instead would be much appreciated!
I've tried the new Grouper syntax, but it does not allow me to subsequently change the hierarchical indices to data columns. Applying unstack to this table produces the result shown in the screenshots (omitted here).
In order for this dataset to be useful, say in a regression model, I really need the index labels as indicators in columns.
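One way to get there, as a hedged sketch building on the snippet above, is to give the counted values a name and reset the index, which turns every level of the hierarchical index into an ordinary column:
adult_flat = adult_resampled.reset_index(name='culture_count')
The group keys (over64, regioneast, and so on) and the resampled date then appear as regular columns that can be used as indicators in a regression model.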
