For a university project I’m trying to see what relationship oil production/consumption and crude oil price have with certain oil stocks, and I’m a bit confused about how to organize this data.
I basically have 4 datasets-
-Oil production
-Oil consumption
-Crude oil price
-Historical price of certain oil company stock
If I am trying to find how these 4 tables relate, what is the recommended way of organizing the data? Should I manually combine all this data into a single Excel sheet (which seems like the most straightforward way), or is there a more efficient way to go about this?
I am brand new to PyTorch and data work, so I apologise if this is a very basic question. Also, the data can grow almost without limit by adding data from additional countries, other stock indexes, etc. So is there a way I can organize the data so it’s easy to add additional related data?
Finally, I have monthly values for some data (e.g. oil production) and daily values for other data (e.g. oil price). What is the best way to adjust the data to make up for this discrepancy?
Thanks in advance!
You can use pandas.DataFrame to create a dataframe for each of the 4 datasets, then combine them into one dataframe using merge.
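For example, a minimal sketch of that workflow, with made-up file and column names (date, price, close), which also downsamples the daily series to monthly averages so the frequencies line up:

```python
import pandas as pd

# Hypothetical file and column names -- adjust to your actual datasets.
production = pd.read_csv("oil_production.csv", parse_dates=["date"])  # monthly
price = pd.read_csv("crude_oil_price.csv", parse_dates=["date"])      # daily
stock = pd.read_csv("oil_stock.csv", parse_dates=["date"])            # daily

# Align frequencies: downsample the daily series to monthly averages ("MS" = month start)
price_monthly = (price.set_index("date")
                      .resample("MS")["price"]
                      .mean()
                      .reset_index())
stock_monthly = (stock.set_index("date")
                      .resample("MS")["close"]
                      .mean()
                      .reset_index())

# Merge everything on the shared date column (repeat for the consumption dataset)
df = (production
      .merge(price_monthly, on="date", how="inner")
      .merge(stock_monthly, on="date", how="inner"))
print(df.head())
```

Adding another country or stock index then just means another merge on the same date column.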
I am working on a rental price prediction project where I web scraped data from Facebook Marketplace. When extracting the areas of the properties, I am encountering many NaN values.
I am web scraping from a small city and it is unlikely that I will be able to find more data. How can I effectively handle the NaN values in my data? Are there any machine learning algorithms or external sources of information that can be used to impute missing values in this situation?
I have considered using the mean or median based on property type, number of bedrooms, and bathrooms, but I am not sure if this is the best approach.
Any suggestions or advice would be greatly appreciated. Thank you in advance!
There are many methods you can use when it comes to missing values in your data. As you mentioned, the general approach is to fill with the mean or median. I recommend grouping the rows first, then filling with the group mean or median.
# Fill missing 'a' values with the mean of 'a' within each 'b' group, and assign back
df['a'] = df['a'].fillna(df.groupby('b')['a'].transform('mean'))
I reckon you can use zipcode or something similar to group them. Another thing you can do, before filling the empty places, is to create another column that indicates whether the value was missing. This may help your model treat those values differently and not overfit on them (see the sketch below).
For further info: link
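A minimal sketch of that idea on toy data, with made-up column names (area, zipcode): flag the missing rows first, then fill with the group mean.

```python
import pandas as pd
import numpy as np

# Toy rental data -- column names are made up for illustration.
df = pd.DataFrame({
    "zipcode": ["A", "A", "B", "B", "B"],
    "area":    [50.0, np.nan, 80.0, np.nan, 90.0],
})

# 1) Flag which rows were originally missing, so the model can treat them differently
df["area_missing"] = df["area"].isna().astype(int)

# 2) Fill the gaps with the mean of the group (here grouped by zipcode)
df["area"] = df["area"].fillna(df.groupby("zipcode")["area"].transform("mean"))
print(df)
```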
I have a dataset with 25 columns and 1000+ rows. The dataset contains dummy information about interns. We want to make squads of these interns; suppose each squad should have 10 members.
Based on the similarities between the interns, we want to form squads and assign a squad number to each. The factors will be the columns we have in the dataset, such as Timezone, the language they speak, which team they want to work in, etc.
These are the columns:
["Name","Squad_Num","Prefered_Lang","Interested_Grp","Age","City","Country","Region","Timezone",
"Occupation","Degree","Prev_Took_Courses","Intern_Experience","Product_Management","Digital_Marketing",
"Market_Research","Digital_Illustration","Product_Design","Prodcut_Developement","Growth_Marketing",
"Leading_Groups","Internship_News","Cohort_Product_Marketing","Cohort_Product_Design",
"Cohort_Product_Development","Cohort_Product_Growth","Hours_Per_Week"]
Here are a bunch of clustering algos for you to play around with.
https://github.com/ASH-WICUS/Notebooks/blob/master/Clustering%20Algorithms%20Compared.ipynb
Since this is unsupervised learning, you kind of have to fiddle around with different algos and see which one performs to your liking, but there is no accuracy, precision, R^2, etc., to tell you how well the model is performing.
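For instance, a minimal KMeans sketch on a few of the columns above (the feature list and file name are assumptions). Note that plain KMeans will not guarantee exactly 10 interns per cluster; for equal-size squads you would need a constrained clustering variant or some post-processing to rebalance the groups.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import OneHotEncoder

# Hypothetical usage: df holds the intern table with the columns listed above.
df = pd.read_csv("interns.csv")

# One-hot encode the categorical factors to cluster on (feature choice is an assumption)
features = ["Timezone", "Prefered_Lang", "Interested_Grp", "Region"]
X = OneHotEncoder().fit_transform(df[features]).toarray()

# 1000 interns / 10 per squad -> roughly 100 clusters
kmeans = KMeans(n_clusters=len(df) // 10, n_init=10, random_state=0)
df["Squad_Num"] = kmeans.fit_predict(X)
print(df[["Name", "Squad_Num"]].head())
```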
I have a dataframe with a large number of rows (several hundred thousand) and several columns that show industry classification for a company, while the eighth column is the output and shows the company type, e.g. Corporate or Bank or Asset Manager or Government etc.
Unfortunately the industry classification is not consistent 100% of the time and is not finite, i.e. there are too many permutations of the industry classification columns to map them all manually once. If I mapped, say, 1k rows with the correct Output column, how can I use machine learning in Python to predict the Output column from that training sample? Please see the attached image, which will make it clearer.
[Image: part of the dataset]
You are trying to predict the company type based on a couple of columns? That is a hard problem; there are entire companies working on exactly that. The best you can do is collect a lot of data from different sources, match them, and then try scikit-learn, probably starting with a decision tree classifier.
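If you go that route, here is a minimal scikit-learn sketch; the file name is made up and it assumes your hand-labelled sample has the industry classification columns plus an Output column.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

# Hypothetical hand-labelled sample: classification columns + the known Output column
df = pd.read_csv("labelled_sample.csv")
X = pd.get_dummies(df.drop(columns=["Output"]))  # one-hot encode the text columns
y = df["Output"]

# Hold out part of the labelled rows to check how well the tree generalizes
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
clf = DecisionTreeClassifier().fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_test, y_test))

# The unlabelled rows would then be encoded with the same columns and passed to clf.predict
```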
I have daily stats churned out from a system which outputs total sales and units sold per region group. For my analysis, I want to break down the entries into regions instead of region groups. I'm looking for a way to split each row into one row per region, with the respective measures.
I have historical percentages on the market share per region which I'll use to come up with the estimated sales and units sold.
I can do this manually in Excel, but given that I'll be doing this on a weekly basis, I'm looking for a way to automate it via Python.
My data: https://imgur.com/a/pBr3y4D
Goal: https://imgur.com/a/Uc56PVR
Well, first of all, when you're doing DS research, try to find the most appropriate approach for your particular case. There's nothing wrong with using Excel's full functionality, scripting, etc. to solve your issue.
However, if you really want to use pandas, then what I would do in your case is just .append() the data, split it by region and group by sales, or write a function with a for loop; roughly like the sketch below.
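A rough sketch of the pandas route, with made-up column names based on the screenshots; it uses merge to attach the historical shares to each region-group row rather than .append(), then scales the measures by the share.

```python
import pandas as pd

# Hypothetical layout: daily totals per region group,
# plus a lookup table of historical market share per region.
daily = pd.DataFrame({
    "date":         ["2023-01-01", "2023-01-01"],
    "region_group": ["North", "South"],
    "sales":        [1000.0, 500.0],
    "units":        [100, 50],
})
shares = pd.DataFrame({
    "region_group": ["North", "North", "South", "South"],
    "region":       ["N1", "N2", "S1", "S2"],
    "share":        [0.6, 0.4, 0.7, 0.3],
})

# One row per region: merge the totals onto the share table, then scale the measures
split = daily.merge(shares, on="region_group")
split["est_sales"] = split["sales"] * split["share"]
split["est_units"] = split["units"] * split["share"]
print(split[["date", "region", "est_sales", "est_units"]])
```

Wrapped in a small function, this runs unchanged on each new weekly file.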
I have a very complex data-set that I need to easily aggregate and work with values at multiple levels.
For example, assume I have data on population and crime rate for each city in the US. Each city should roll up to a state, so the state population is the SUM of each city within it, and the crime rate is the AVERAGE of the crime rates of each city below it. Then I need each state to roll up to the US overall, maintaining the same calculation logic.
What is the best data structure to accomplish complex aggregations of hierarchically organized data in python?
Ideally I would be able to select a node, and then using some method feed the node an argument on what data to aggregate, and the logic to aggregate it with.
Two words: use pandas.
Link to tutorial:
http://pandas.pydata.org/pandas-docs/stable/cookbook.html
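For example, a minimal sketch of the city → state → US roll-up with groupby, using SUM for population and AVERAGE for crime rate (toy data, made-up values):

```python
import pandas as pd

# Toy city-level data; the country and state columns define the hierarchy.
df = pd.DataFrame({
    "country":    ["US"] * 4,
    "state":      ["CA", "CA", "NY", "NY"],
    "city":       ["LA", "SF", "NYC", "Buffalo"],
    "population": [4_000_000, 800_000, 8_400_000, 280_000],
    "crime_rate": [3.2, 4.1, 2.8, 3.9],
})

# Roll cities up to states: SUM the population, AVERAGE the crime rate
by_state = df.groupby(["country", "state"]).agg(
    population=("population", "sum"),
    crime_rate=("crime_rate", "mean"),
)

# Roll states up to the country with the same logic
by_country = by_state.groupby("country").agg(
    population=("population", "sum"),
    crime_rate=("crime_rate", "mean"),
)
print(by_state, by_country, sep="\n\n")
```

Selecting a node is then just a `.loc` lookup on the grouped index (e.g. `by_state.loc[("US", "CA")]`), and the aggregation logic is whatever dict of column-to-function pairs you feed to `.agg()`.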