I am probably using poor search terms when trying to find the answer to this problem, but I hope I can explain by posting an image.
I have a weekly dataframe (left table) and I am trying to get the total average across all cities within each week, as well as the average of certain observations based on two lists (right table).
Excel representation of the dataframe
Can anyone please help figure out how to do this?
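Without the actual frame it's hard to be exact, but a minimal sketch of the groupby/isin pattern might look like this (the column names week, city, value and the two lists are all made up, so adjust them to the real data):

import pandas as pd

# Hypothetical weekly frame: one row per (week, city) observation
df = pd.DataFrame({
    "week": [1, 1, 1, 2, 2, 2],
    "city": ["A", "B", "C", "A", "B", "C"],
    "value": [10, 20, 30, 15, 25, 35],
})

# The two membership lists (placeholders)
list1 = ["A", "B"]
list2 = ["C"]

# Total average across all cities, per week
total_avg = df.groupby("week")["value"].mean()

# Per-week average restricted to the cities in each list
avg1 = df[df["city"].isin(list1)].groupby("week")["value"].mean()
avg2 = df[df["city"].isin(list2)].groupby("week")["value"].mean()

# One row per week, one column per average (like the right table)
result = pd.concat({"all_cities": total_avg, "list1": avg1, "list2": avg2}, axis=1)
print(result)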
Overflow Data/CSV/Pandas peeps hivemind....I need your help!
I've only recently started using Python/Pandas and I found a really good project to possibly work on, that would save me a lot of time.
I do weekly reports and report on the differences in data week by week.
I don't know Pandas 100%, but I don't think this would be that hard to do with code, and I feel like this project would be a great way for me to learn.
Here is an example of the report I have:
Report Example
Now, I have a list of item names (which get concatenated into the item info column) that I'm supposed to report on:
I'm essentially trying to have code that can compute:
- IF the name (from my list) is found in the item info column AND the week number is a particular number AND the year is 2022, THEN aggregate the total POS/sales for that week as data A
&
- IF there is viable data for Week 16 as well, compute the same aggregate for that week as data B, then take the difference between these weeks (A minus B) and output that as information point C (aka the difference)
- THEN, if that difference is positive, divide C by B (aka, give me the percentage of that move)
TL;DR: I want to aggregate the total sales of an item for the week, subtract the corresponding amount for the previous week for the same item, and report the difference as well as the percentage change.
I only know so much Pandas right now; would anyone be able to point me in a helpful direction? I feel like this shouldn't be that hard to do. I'd love to make it a weekend project, save myself a good bit of time at work, and learn how to automate some work tasks too. :)
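Not a full solution, but a minimal sketch of the filter-and-sum logic described above; the column names item info, week, year, and pos are guesses from the description, so rename them to match the real report:

import pandas as pd

df = pd.read_excel("report.xlsx")  # placeholder path for the weekly report

item = "Widget"          # one name from the items list (made up)
week_a, week_b = 17, 16  # the week being reported and the previous week

# Rows where the item name appears in the item info column, in 2022
mask = df["item info"].str.contains(item, na=False) & (df["year"] == 2022)

a = df.loc[mask & (df["week"] == week_a), "pos"].sum()  # data A
b = df.loc[mask & (df["week"] == week_b), "pos"].sum()  # data B
c = a - b  # information point C, the difference

if c > 0 and b != 0:
    print(f"{item}: {a} vs {b}, up {c} ({c / b:.1%})")
else:
    print(f"{item}: {a} vs {b}, difference {c}")

Looping this over the items list and the week pairs you report on would give the full table.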
I am working on a portfolio implementation thesis but am struggling a bit on the implementation.
Long story short, I want to create portfolio weights that are rebalanced every Nth day (based on my dataset of daily observations). At each rebalance the weights are re-estimated using the daily observations since the previous update, so the weights change every Nth day, each time incorporating N more observations. All updates are backward-looking.
So far I am able to:
Generate portfolio weights based on all data up to the current time
Push these weights into the correct dataframe row corresponding to the correct day
I am NOT able to write code that fills the rows in between with the weights from the previous Nth update. This is important because I will test several rebalancing frequencies.
I would greatly appreciate ANY help on this, either similar (answered) questions or a plain code example.
Hope someone is able to help! Thanks a lot
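One common way to fill the rows between rebalances is a forward fill: leave non-rebalance days as NaN and carry the last computed weights forward with ffill(). A minimal sketch, with random placeholder weights standing in for the real backward-looking estimator:

import numpy as np
import pandas as pd

dates = pd.date_range("2020-01-01", periods=20, freq="B")  # dummy daily index
assets = ["A", "B", "C"]
weights = pd.DataFrame(np.nan, index=dates, columns=assets)

N = 5  # rebalancing frequency in observations
for i in range(0, len(dates), N):
    # placeholder: the real estimator would use all data up to dates[i]
    weights.iloc[i] = np.random.dirichlet(np.ones(len(assets)))

# Every non-rebalance day inherits the weights from the previous Nth update
weights = weights.ffill()
print(weights.head(8))

Because only the rebalance rows are ever written, changing N is enough to test a different rebalancing frequency.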
I have a Pandas dataframe containing tweets from the period July 24, 2019 to October 19, 2019. I have applied the VADER sentiment analysis method to each tweet and added the sentiment scores as new columns.
Now, my hope was to visualize this in some kind of line chart in order to analyse how the averaged sentiment scores per day have changed over this three-month period. I therefore need the dates on the x-axis, and the averaged negative, positive, and compound scores (three different lines) on the y-axis.
I have an idea that I need to somehow group or resample the data in order to show the aggregated sentiment value per day, but since my Python skills are still limited, I have not succeeded in finding a solution that works yet.
If anyone has an idea as to how I can proceed, that would be much appreciated! I have attached a picture of my dataframe as well as an example of the type of plot I had in mind :)
Cheers,
Nicolai
You should have a look at the groupby() method:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html
Simply create a day column containing something (a date, a string, a tuple, ...) that represents the day of the tweet rather than its exact time, then use the groupby() method on this column.
If you don't know how to create this column, an easy way of doing it is using https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html
Keep in mind that groupby() doesn't return a DataFrame but a DataFrameGroupBy object, so you'll have to choose a way of aggregating the data in your groups (you should probably do groupby().mean() in your case; see the groupby() documentation for more information).
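For this particular case, a sketch using resample() instead of a manual day column might look like the following (it assumes a date column with the tweet timestamps and VADER columns named neg, pos, and compound; adjust the names to the actual frame):

import pandas as pd
import matplotlib.pyplot as plt

df["date"] = pd.to_datetime(df["date"])  # make sure the timestamps are datetimes

# Average the three scores per calendar day
daily = df.set_index("date")[["neg", "pos", "compound"]].resample("D").mean()

daily.plot()  # one line per score, dates on the x-axis
plt.ylabel("average sentiment per day")
plt.show()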
I have two dataframes that both have an ID column and, for each ID, a date column with timestamps and a Value column. Now, I would like to find the correlation between the values from the two datasets: dataset 1 has the values of people that got a specific disease, and dataset 2 has the values of people that DIDN'T get the disease. Using the corr function:
corr = df1['val'].corr(df2['val'])
my result is 0.1472, which is very low, meaning they have essentially no correlation.
Am I doing something wrong? How should I calculate the correlation? Is there a way to find a value (maybe a threshold line) above which people will get the disease? I would like to try this with a Machine Learning technique (SVMs), but first it would be good to have something like the part I explained above. How can I do that?
Thanks
Maybe your low correlation is due to the index or order of your observations.
Have you tried doing a left join by ID?
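A minimal sketch of that join, using the column names from the question (ID, val):

# Align the two frames on ID first, then correlate the aligned values
merged = df1.merge(df2, on="ID", how="left", suffixes=("_1", "_2"))
corr = merged["val_1"].corr(merged["val_2"])
print(corr)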
I am a newbie to Python and learning my way and your help is appreciated.
I am trying to do a simple calculation where I compute the rolling average of the last 4 data points of fin_ratio, i.e. (x1 + x2 + x3 + x4) / 4.
Now my dataset is as follows:
I have put a small sample of file (xl format) here for your reference:
https://www.mediafire.com/file/4gjsprvfc31n79g/test_data_python.xlsx/file
1) It has a unique firm_id per firm, so the rolling mean should not use data from two different firm_ids (i.e. I can't just calculate the mean straight down the rows, as it would then use fin_ratio values from two different firms where one firm ends and the next begins).
2) How should I use the firm_year and firm_qtr in the for loop in this case?
Thanks for your time and I appreciate any pointers you may have.
Regards
John M.
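For what it's worth, a per-group rolling mean avoids the explicit for loop entirely. A minimal sketch using the column names from the question (firm_id, firm_year, firm_qtr, fin_ratio):

import pandas as pd

df = pd.read_excel("test_data_python.xlsx")  # the linked sample file

# Sort so the window runs in chronological order within each firm
df = df.sort_values(["firm_id", "firm_year", "firm_qtr"])

# groupby keeps the 4-observation window from crossing firm boundaries
df["fin_ratio_ma4"] = (
    df.groupby("firm_id")["fin_ratio"]
      .rolling(4)
      .mean()
      .reset_index(level=0, drop=True)
)

The first three rows of each firm come out as NaN because the window isn't full yet; pass min_periods=1 to rolling() if a shorter window is acceptable there.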