I have two dataframes, each with an ID column, a date column with timestamps for each ID, and a Value column. I would like to find a correlation between the values from the two datasets in this way: dataset 1 has the values of people that got a specific disease, and dataset 2 has the values of people that DIDN'T get the disease. Now, using the corr function:
corr = df1['val'].corr(df2['val'])
my result is 0.1472, which is very low, meaning the two sets of values have almost no correlation.
Am I doing something wrong? How should I calculate the correlation? Is there a way to find a value (maybe a threshold line) such that above it people will get the disease? I would like to try this with a machine learning technique (SVMs), but first it would be good to have something like what I described above. How can I do that?
Thanks
Maybe your low correlation is due to the index or order of your observations.
Have you tried doing a left join by ID?
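A minimal sketch of that idea, using made-up data and assumed column names ("ID", "val"): align the two frames on ID first, then correlate the aligned value columns. Series.corr() pairs values by index/position, so correlating two unaligned frames can produce a misleading number.

import pandas as pd

# Hypothetical frames; the real ones would come from your data.
df1 = pd.DataFrame({"ID": [1, 2, 3], "val": [10.0, 12.5, 9.0]})
df2 = pd.DataFrame({"ID": [3, 1, 2], "val": [8.5, 11.0, 12.0]})

# Align rows by ID before correlating, instead of relying on row order.
merged = df1.merge(df2, on="ID", how="left", suffixes=("_1", "_2"))
print(merged["val_1"].corr(merged["val_2"]))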
I have been trying hard to divide a column that I have into bins. I looked at pandas.qcut and pandas.cut and understand them fairly well (qcut divides my population into bins with equal numbers of observations, and cut divides my range into equal-width bins).
I understand that I need the functionality of cut in PySpark.
I experimented with approxQuantile, which takes a really long time to process. I also experimented with QuantileDiscretizer (which is similar to qcut and gave me the same number of items in each output bin).
I am trying out Bucketizer; however, there doesn't seem to be a straightforward way to apply it to a groupby/window output.
Is there any other approach I should look at in order to compute this?
Applying Bucketizer to Spark dataframe after partitioning based on a column value
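One possible workaround, shown here only as a rough sketch (not from the original thread) with hypothetical column names "grp" and "val": compute the equal-width edges per group with a groupBy, join them back, and derive the bin index arithmetically instead of going through Bucketizer, which only applies one global set of splits.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Toy data: a grouping column and a numeric column to bin within each group.
df = spark.createDataFrame(
    [("a", 1.0), ("a", 5.0), ("a", 9.0), ("b", 10.0), ("b", 20.0), ("b", 30.0)],
    ["grp", "val"],
)

n_bins = 3

# Per-group min/max give the equal-width edges, as pandas.cut would within a group.
edges = df.groupBy("grp").agg(F.min("val").alias("lo"), F.max("val").alias("hi"))

binned = (
    df.join(edges, on="grp")
    # Scale the value into [0, n_bins) and take the floor; clamp the maximum
    # value into the last bin so it doesn't land in bin n_bins.
    .withColumn(
        "bin",
        F.least(
            F.floor((F.col("val") - F.col("lo")) / (F.col("hi") - F.col("lo")) * n_bins),
            F.lit(n_bins - 1),
        ),
    )
    .drop("lo", "hi")
)

binned.show()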
I am probably using poor search terms when trying to find the answer to this problem, but I hope that I can explain by posting an image.
I have a weekly df (left table), and I am trying to get the total average across all cities within one week as well as the average of certain observations based on 2 lists (right table).
[Image: Excel representation of the dataframe]
Can anyone please help me figure out how to do this?
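Without seeing the actual columns, here is only a guess at the layout (one row per city per week, with hypothetical column names "week", "city", "value" and two assumed city lists): groupby gives the overall weekly average, and isin restricts the same groupby to each list.

import pandas as pd

# Hypothetical data shaped like the description: one row per city per week.
df = pd.DataFrame({
    "week": [1, 1, 1, 2, 2, 2],
    "city": ["A", "B", "C", "A", "B", "C"],
    "value": [10, 20, 30, 15, 25, 35],
})

list1 = ["A", "B"]   # the two lists of cities are assumptions
list2 = ["C"]

# Overall average per week across all cities.
summary = df.groupby("week")["value"].mean().to_frame("total_avg")

# Averages per week restricted to each list of cities; results align on "week".
summary["list1_avg"] = df[df["city"].isin(list1)].groupby("week")["value"].mean()
summary["list2_avg"] = df[df["city"].isin(list2)].groupby("week")["value"].mean()

print(summary)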
I have a Pandas dataframe containing tweets from the period July 24, 2019 to October 19, 2019. I have applied the VADER sentiment analysis method to each tweet and added the sentiment scores in new columns.
Now, my hope was to visualize this in some kind of line chart in order to analyse how the averaged sentiment scores per day have changed over this three-month period. I therefore need the dates to be on the x-axis, and the averaged negative, positive and compound scores (three different lines) on the y-axis.
I have an idea that I need to somehow group or resample the data in order to show the aggregated sentiment value per day, but since my Python skills are still limited, I have not succeeded in finding a solution that works yet.
If anyone has an idea as to how I can proceed, that would be much appreciated! I have attached a picture of my dataframe as well as an example of the type of plot I had in mind :)
Cheers,
Nicolai
You should have a look at the groupby() method:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html
Simply create a day column which contains a timestamp/datetime object/dict/tuple/str... that represents the day of the tweet and not its exact time. Then use the groupby() method on this column.
If you don't know how to create this column, an easy way of doing it is using https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html
Keep in mind that the groupby() method doesn't return a DataFrame but a DataFrameGroupBy object, so you'll have to choose a way of aggregating the data in your groups (you should probably do groupby().mean() in your case; see the groupby method documentation for more information).
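Putting that together, a minimal sketch; the column names ("created_at", "neg", "pos", "compound") are assumptions and would need to match the actual dataframe.

import pandas as pd
import matplotlib.pyplot as plt

# Tiny stand-in for the tweets dataframe with VADER scores already attached.
df = pd.DataFrame({
    "created_at": pd.to_datetime(
        ["2019-07-24 09:15", "2019-07-24 18:40", "2019-07-25 11:05"]
    ),
    "neg": [0.10, 0.30, 0.05],
    "pos": [0.55, 0.20, 0.60],
    "compound": [0.62, -0.25, 0.71],
})

# Day of the tweet with the exact time dropped.
df["day"] = df["created_at"].dt.date

# groupby() returns a DataFrameGroupBy; .mean() aggregates it back to a DataFrame.
daily = df.groupby("day")[["neg", "pos", "compound"]].mean()

daily.plot()                      # three lines: neg, pos, compound
plt.xlabel("date")
plt.ylabel("average daily VADER score")
plt.show()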
I have 2 dataframes.
Each dataframe contains 64 columns with each column containing 256 values.
I need to compare these 2 dataframes for statistical significance.
I know only basics of statistics.
What I have done is calculate a p-value for each column of each dataframe.
Then I compare the p-value of each column of the 1st dataframe to the p-value of the corresponding column of the 2nd dataframe.
Example: the p-value of the 1st column of the 1st dataframe against the p-value of the 1st column of the 2nd dataframe.
Then I say which columns are significantly different between the 2 dataframes.
Is there any better way to do this?
I use Python.
To be honest, the way you are doing it is not the way it's meant to be done. Let me highlight some points that you should always keep in mind when conducting such analyses:
1.) Hypothesis first
I strongly suggest avoiding testing everything against everything. This kind of exploratory data analysis will likely produce some significant results, but it is also likely that you will run into a multiple comparisons problem.
In simple terms: you run so many tests that the chance of seeing something significant which in fact is not is greatly increased (see also Type I and Type II errors).
2.) The p-value isn't all the magic
Saying that you calculated the p-value for all columns doesn't tell us which test you used. The p-value is just a "tool" from mathematical statistics that is used by a lot of tests (e.g. correlation, t-test, ANOVA, regression, etc.). Having a significant p-value indicates that the difference/relationship you observed is statistically relevant (i.e. a systematic and not a random effect).
3.) Consider sample and effect size
Depending on which test you are using, the p-value is sensitive to the sample size you have. The greater your sample size, the more likely it is to find a significant effect. For instance, if you compare two groups with 1 million observations each, the smallest differences (which might also be random artifacts) can be significant. It is therefore important to also take a look at the effect size, which tells you how large the observed effect really is (e.g. r for correlations, Cohen's d for t-tests, partial eta squared for ANOVAs, etc.).
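To make points 1-3 concrete, here is an illustrative sketch on random data (not your dataframes): an independent t-test per column, a Bonferroni correction for the 64 comparisons, and Cohen's d as the effect size. Whether a t-test is the right choice depends on your data and hypothesis.

import numpy as np
import pandas as pd
from scipy import stats

# Random stand-ins for the two 256x64 dataframes.
rng = np.random.default_rng(0)
df1 = pd.DataFrame(rng.normal(0.0, 1.0, size=(256, 64)))
df2 = pd.DataFrame(rng.normal(0.2, 1.0, size=(256, 64)))

alpha = 0.05 / df1.shape[1]          # Bonferroni-adjusted threshold for 64 tests

rows = []
for col in df1.columns:
    t, p = stats.ttest_ind(df1[col], df2[col])
    pooled_sd = np.sqrt((df1[col].var(ddof=1) + df2[col].var(ddof=1)) / 2)
    d = (df1[col].mean() - df2[col].mean()) / pooled_sd   # Cohen's d (effect size)
    rows.append({"column": col, "p": p, "cohens_d": d, "significant": p < alpha})

results = pd.DataFrame(rows)
print(results[results["significant"]])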
SUMMARY
So, if you want to get some real help here, I suggest posting some code and specifying more concretely (1) what your research question is, (2) which tests you used, and (3) what your code and your output look like.
I am a beginner in Data Science and Python, and I am learning Statistics at the same time for leisure. I am not a Computer Scientist, so sorry if this question sounds basic.
I have a DataFrame, which I process with Pandas in Python 3. From a preliminary exploration, I suspect that one of the columns is correlated with the behavior of two others instead of just one.
I can calculate the correlation coefficient between two single columns. However, is there a way to calculate some sort of "weighted" correlation between one column and two others?
You can compute and plot a correlation matrix. Please refer to http://www.marketcalls.in/python/quick-start-guide-compute-correlation-matrix-using-nsepy-pandas-python.html and the pandas documentation.
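As a rough sketch with made-up data and column names ("a", "b", "target"): DataFrame.corr() gives the pairwise Pearson correlations, which you can inspect or plot as a heat map to see how the suspect column relates to each of the other two.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Made-up data in which "target" depends on both "a" and "b".
rng = np.random.default_rng(1)
df = pd.DataFrame({"a": rng.normal(size=100), "b": rng.normal(size=100)})
df["target"] = 0.6 * df["a"] + 0.3 * df["b"] + rng.normal(scale=0.5, size=100)

corr = df.corr()                     # pairwise Pearson correlation matrix
print(corr)

# Simple heat map of the matrix.
plt.imshow(corr, cmap="coolwarm", vmin=-1, vmax=1)
plt.xticks(range(len(corr.columns)), corr.columns)
plt.yticks(range(len(corr.columns)), corr.columns)
plt.colorbar(label="Pearson r")
plt.show()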