I have a CSV file with stock values for 500 companies over 5 years (2013-2017). The columns are: date, open, high, low, close, volume and name. I would like to compare these companies and find the 20 best performers. I was thinking about just using the mean, but since the values at the first date collected (Jan 2013) differ (some start at 30 USD, others at 130 USD), it's hard to compare which companies have done best over these 5 years. I would therefore like to use each company's value on the first date as a zero-point: basically, subtract the close value of the first date from all of the later data points.
My problem is that, firstly, I have a hard time getting to the first date's close value. I want to write something like "data.loc(data['close']).iloc(0)", but since it's a DataFrame I can't get the value of a single row, nor iterate through the DataFrame.
Secondly, I'm not sure how to differentiate between the companies. I want to carry out the zero-point procedure for each of the 500 companies, so somehow I need to know when to start over.
The code I have now is
import pandas as pd

def main():
    data = pd.read_csv('./all_stocks_5yr.csv', usecols=['date', 'close', 'Name'])
    comp_name = sorted(set(data.Name))
    number_of = len(comp_name)   # number of companies (not used yet)
    comp_mean = []
    for name in comp_name:
        # all rows belonging to this company
        frames = data.loc[data['Name'] == name]
        comp_mean.append([name, frames['close'].mean()])
    print(comp_mean)
But this will only give me the mean, without using the zero-point.
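A minimal sketch of the zero-point idea, assuming the column names above and using groupby with transform('first') so each company's first close is subtracted from all of its rows; ranking by the mean of the adjusted series is only one possible criterion:

import pandas as pd

data = pd.read_csv('./all_stocks_5yr.csv', usecols=['date', 'close', 'Name'])
data['date'] = pd.to_datetime(data['date'])
data = data.sort_values(['Name', 'date'])

# Subtract each company's first close from all of its closes (the zero-point).
data['close_zeroed'] = data['close'] - data.groupby('Name')['close'].transform('first')

# Rank companies by the mean of the zero-adjusted close and keep the top 20.
top20 = (data.groupby('Name')['close_zeroed']
             .mean()
             .sort_values(ascending=False)
             .head(20))
print(top20)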
Another idea I had was to just compare the closing price of the first value (January 1, 2013) with the price of the last value (December 31, 2017) to see how much each stock has increased or decreased. What I'm not sure about here is how to reach the close values at those dates for every single one of the 500 companies.
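For this second idea, a hedged sketch that pulls the first and last close per company in one go and compares them as a relative change, so a 30 USD stock and a 130 USD stock stay comparable:

import pandas as pd

data = pd.read_csv('./all_stocks_5yr.csv', usecols=['date', 'close', 'Name'])
data['date'] = pd.to_datetime(data['date'])
data = data.sort_values('date')

# First and last close per company, then the relative change over the five years.
ends = data.groupby('Name')['close'].agg(['first', 'last'])
ends['rel_change'] = (ends['last'] - ends['first']) / ends['first']

print(ends['rel_change'].sort_values(ascending=False).head(20))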
Do you have any recommendations for any of the methods?
Thank you in advance
Related
I have a dataset of sales data over the last year.
For each unique item in the product category I'd like to calculate the number of days between the first solDate and the last solDate.
The data contains two columns relevant to this: 'solDate', which is a datetime object, and 'product', which is a string. Each row in the data set is an individual sale of the item.
My first instinct was to use a for loop to iterate through, checking each entry and updating it when the next one is found, to figure out the last time each item was sold, but I know there must be an easier way. The dataset has 100,000 entries, so I need something semi-efficient that completes in a reasonable time.
I'm using the pandas package for the analysis.
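A hedged sketch of a loop-free way to do this with groupby; the sample frame here is made up, but the 'product'/'solDate' column names match the description above:

import pandas as pd

# Hypothetical sample standing in for the real sales data.
df = pd.DataFrame({
    'product': ['widget', 'widget', 'gadget', 'gadget', 'gadget'],
    'solDate': ['2023-01-05', '2023-06-20', '2023-02-01', '2023-03-15', '2023-11-30'],
})
df['solDate'] = pd.to_datetime(df['solDate'])

# First and last sale date per product, then the span in days between them.
span = df.groupby('product')['solDate'].agg(['min', 'max'])
span['days_between'] = (span['max'] - span['min']).dt.days
print(span)

A single groupby like this is vectorized, so 100,000 rows should take well under a second.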
I'm writing my own code to analyse/visualise COVID-19 data from the European CDC.
https://opendata.ecdc.europa.eu/covid19/casedistribution/csv
I've got some simple code to extract the data and make plots of cumulative deaths against time, and I'm trying to add functionality.
My aim is something like the linked graph, with all countries time-shifted so they line up at a common point (in this case the 5th death). I want to make a general bit of code that shifts countries to match at the nth death.
https://ourworldindata.org/grapher/covid-confirmed-deaths-since-5th-death
The current way I'm trying to do this is to have a maze of "if group is 'country' shift by ..." terms.
Where ... is a lookup to find the date for the particular 'country' when there were 'n' deaths, and to interpolate fractional dates where appropriate.
i.e. currently deaths are assigned to 00:00 on each day, but the data may need to be shifted by, say, 2/3 of a day, as below.
datetime       cumulative deaths
00:00 15/02    80
00:00 16/02    110
my '...' should give 16:00 15/02 (e.g. for n = 100, which falls 2/3 of the way between 80 and 110).
I'm working on this right now but it doesn't feel very efficient and I'm sure there must be a much simpler way that I'm not seeing.
Essentially, despite copious googling, I can't seem to find a simple way of automatically shifting a bunch of time series to match at a particular y value, which feels like it should have some built-in functionality, i.e. a lookup with interpolation.
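A hedged sketch of that lookup-with-interpolation for one country, using np.interp on the dates converted to integer nanoseconds; the helper name and the n = 100 example are assumptions, chosen to reproduce the 16:00 15/02 case above:

import numpy as np
import pandas as pd

def nth_death_date(country_df, n):
    # Interpolated datetime at which cumulative deaths reached n (hypothetical helper).
    country_df = country_df.sort_values('dateRep')
    t = country_df['dateRep'].astype('int64').to_numpy()   # nanoseconds since epoch
    d = country_df['cumdeaths'].to_numpy()
    return pd.to_datetime(np.interp(n, d, t)).round('s')

# The example above: 80 deaths at 00:00 15/02, 110 at 00:00 16/02, n = 100.
demo = pd.DataFrame({'dateRep': pd.to_datetime(['2020-02-15', '2020-02-16']),
                     'cumdeaths': [80, 110]})
print(nth_death_date(demo, 100))   # roughly 2020-02-15 16:00:00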
#### Live url (I've downloaded my own csv and have been calling that during code development)
import pandas as pd

url = 'https://opendata.ecdc.europa.eu/covid19/casedistribution/csv'
dataraw = pd.read_csv(url)
#### extract relevant columns
data = dataraw.loc[:, ["dateRep", "countriesAndTerritories", "deaths"]]
#### convert date format
data['dateRep'] = pd.to_datetime(data['dateRep'], dayfirst=True)
#### sort by date and compute cumulative deaths per country
data = data.sort_values(["dateRep"], ascending=True)
data['cumdeaths'] = data.groupby('countriesAndTerritories')['deaths'].cumsum()
#### limit to countries with cumulative deaths > 500
data = data.groupby('countriesAndTerritories').filter(lambda x: x['cumdeaths'].max() > 500)
#### remove China from the data for now as it doesn't match the dates so well
data = data[data['countriesAndTerritories'] != "China"]
#### only recent dates
data = data[data['dateRep'] > '2020-03-01']
print(data)
You can use groupby('country') together with pandas' transform to add a column that gives every row the date on which its country hit the nth death.
Then you can do a vectorized subtraction of the date column and the new column to get the number of days.
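A hedged sketch of that suggestion, continuing from the data frame built in the question (with its 'dateRep', 'countriesAndTerritories' and 'cumdeaths' columns) and using an assumed threshold of n = 5:

n = 5  # assumed threshold: match countries at the nth death

# For each country, the first date with at least n cumulative deaths,
# broadcast back to every row of that country via transform.
data['date_nth'] = (data['dateRep']
                    .where(data['cumdeaths'] >= n)
                    .groupby(data['countriesAndTerritories'])
                    .transform('min'))

# Vectorized subtraction: days since the nth death (negative before it, NaN if never reached).
data['days_since_nth'] = (data['dateRep'] - data['date_nth']).dt.days

Plotting 'cumdeaths' against 'days_since_nth' (filtered to non-negative values) then lines every country's curve up at its nth death.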
I'm currently trying to put the following time series into one plot, with adjusted scalings, to examine whether or not they correlate.
import pandas as pd

raw = pd.read_csv('C:/Users/Jens/Documents/Studium/Master/Big Data Analytics/Lecture 5/tr_eikon_eod_data.csv', index_col=0, parse_dates=True)
data = raw[['.SPX', '.VIX']].dropna()
data.tail()
data.plot(subplots=True, figsize=(10,6));
Does anyone have an idea how to do that?
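A hedged sketch of one common approach, continuing from the data frame above and using pandas' secondary_y option so '.VIX' gets its own right-hand axis in the same figure:

import matplotlib.pyplot as plt

# Same figure, two y-scales: '.SPX' stays on the left axis, '.VIX' moves to the right.
data.plot(secondary_y='.VIX', figsize=(10, 6))
plt.show()

As a quick numeric check, data.pct_change().corr() gives the correlation of the two series' daily changes.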
Also, would it be possible to do the same with data from two different data sets? I'd like to compare a house price index with a stock index. I have daily closing prices for the stock index and just one value per quarter for the house prices (10 years), i.e. closing price against house price index.
I don't really know where to start.
Thank you!:)
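For the second part of the question (quarterly house prices against daily closes), a hedged sketch: resample the daily series to quarter-end so both sit on the same index before plotting or correlating. The file and column names here are made up:

import pandas as pd

# Hypothetical inputs: daily stock closes and a quarterly house-price index.
stock = pd.read_csv('stock_daily.csv', index_col=0, parse_dates=True)['close']
houses = pd.read_csv('house_prices_quarterly.csv', index_col=0, parse_dates=True)['hpi']

# Bring the daily closes down to quarterly frequency (last close of each quarter).
# This assumes the house-price index is also stamped at quarter-end dates.
stock_q = stock.resample('Q').last()

combined = pd.concat([stock_q, houses], axis=1, keys=['stock', 'hpi']).dropna()
print(combined.corr())
combined.plot(secondary_y='hpi', figsize=(10, 6))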
I am quite new to Python and I need your help, guys!
My data structure (screenshot not included):
This is intraday 5-minute data from 2001/01/02 until 31/12/2019. As you can see from the data, column 0 contains the date and column 2 contains the stock prices.
Each day, such as 2001/01/02, has 79 observations.
First of all, I need to create the daily return as a new column. Normally I was dealing with daily data, and the daily log return was computed as follows:
import numpy as np

def lr(x):
    # log return: difference of consecutive log prices
    return np.log(x[1:]) - np.log(x[:-1])
How can I create a new column for the daily return from the 5-minute data?
If you load your data into a pandas.DataFrame, you can use df.groupby() and then apply your lr-function with minimal changes:
import pandas as pd

df = pd.read_excel('path/to/your/file.xlsx', header=None,
                   names=['Index', 'Date', 'Some_var', 'Stock_price'])
The key thing to do, though, will be to decide how you want to generate your daily values from your 5-minute data. I'm no stock expert, but I'd guess you want to use the last value of each day to represent the stock value. If that's the case, you can use
daily_values = df.groupby('Date')['Stock_price'].agg('last')
and then apply your lr function to get the returns
lr(daily_values)
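If the goal is to have the return back on the original 5-minute rows as a new column, a hedged sketch continuing from daily_values above; .diff() on the log prices is the same calculation as lr, but it keeps the date index, which makes mapping back easy:

import numpy as np

# Daily log return, indexed by date (the first day has no previous close, so it is NaN).
daily_returns = np.log(daily_values).diff()

# Broadcast each day's return onto every 5-minute row of that day.
df['daily_return'] = df['Date'].map(daily_returns)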
I'm new to Python and I would appreciate it if you could give me an answer as soon as possible.
I'm processing a file containing reviews for products that can belong to more than one category. What I need is to group the review ratings by category and date at the same time. Since I don't know the exact number of categories or dates in advance, I need to add rows and columns as I process the review data (a 50 GB file).
I've seen how I can add columns; however, my trouble is adding a row without knowing how many columns are currently in the DataFrame.
Here is my code:
import pandas

list1 = ['Movies & TV', 'Books']  # categories so far
dfMain = pandas.DataFrame(index=list1, columns=['2002-09'])  # only one column at the beginning
print(dfMain)
This is what dfMain looks like:
            2002-09
Movies & TV     NaN
Books           NaN
If I want to add a column, I simply do this:
dfMain.insert(0, date, 0) #where date is in format like '2002-09'
But what if I want to add a new category (row) and fill all the dates (columns) with zeros? How do I do that? I've tried the append method, but it asks for all the columns as parameters. The insert method doesn't seem to work either.
Here's a possible solution:
dfMain.append(pd.Series(index=dfMain.columns, name='NewRow').fillna(0))
             2002-09
Movies & TV      NaN
Books            NaN
NewRow           0.0
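An alternative sketch that does the same with label-based assignment; this also sidesteps DataFrame.append, which has been removed in recent pandas versions:

import pandas as pd

dfMain = pd.DataFrame(index=['Movies & TV', 'Books'], columns=['2002-09'])

# Assigning to a new index label creates the row and fills every existing column with 0.
dfMain.loc['NewRow'] = 0
print(dfMain)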