Two datasets: finding a relation based on a date range - python

I have two datasets: one with the times of volcanic eruptions, the other with earthquakes.
Both have a "Date" column.
I would like to run some kind of loop to find out, based on the dates, whether an earthquake is linked to a volcanic eruption.
The idea is to check whether the dates of the two events are close enough, let's say within a 4-day range; if so, create a new column in the earthquake dataset and state yes or no (volcano related or not).
I have no idea how to start, or even whether this is possible.
Here are the datasets:
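However the files are structured, one possible starting point is pandas.merge_asof, which can attach to each earthquake the nearest eruption date within a given tolerance. This is a minimal sketch assuming each dataset loads into a pandas dataframe with a parseable "Date" column; the frames below are made-up stand-ins:

```python
import pandas as pd

# Made-up stand-ins for the two datasets; in practice load your files instead.
eruptions = pd.DataFrame({"Date": pd.to_datetime(["2000-01-10", "2000-03-05"])})
quakes = pd.DataFrame({"Date": pd.to_datetime(["2000-01-12", "2000-02-01"])})

# merge_asof requires both frames to be sorted on their join keys.
eruptions = eruptions.rename(columns={"Date": "EruptionDate"}).sort_values("EruptionDate")
quakes = quakes.sort_values("Date")

# For each earthquake, attach the nearest eruption within 4 days (NaT if none).
matched = pd.merge_asof(
    quakes,
    eruptions,
    left_on="Date",
    right_on="EruptionDate",
    direction="nearest",
    tolerance=pd.Timedelta(days=4),
)
matched["volcano_related"] = matched["EruptionDate"].notna().map({True: "yes", False: "no"})
print(matched)
```

For small datasets a plain nested loop comparing every earthquake date against every eruption date works just as well; merge_asof simply avoids the quadratic scan.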

Related

How to iterate through a dataframe with conditions?

I'm having trouble working with a dataframe and I would appreciate some help.
I have a pandas dataframe that has information on the points where two trajectories meet (it comes from pygmt.x2sys_cross). Two of the columns of my dataframe refer to which trajectories I'm working with, in a style such as
trajectory 1    trajectory 2
[abcd1, 123]    [efgh2, 456]
where some trajectories cross more than one time. I want to find the rows for each unique pair of trajectories that cross, so I can operate with them. In particular, I'd like to find:
How many times each unique pair of trajectories cross
The longest and shortest time interval between crosses (the time at which each trajectory crosses that point is also a column, time1 and time2)
The highest and lowest difference in value measured at the intersections.
I was able to do this by working with nested dictionaries, using the names of the trajectories as the keys, and running a nested for loop. However, I could not store the data from the dictionaries efficiently, whereas I can with the dataframe, so I'd like to get the same result that way.
Thanks a lot.
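One way to get these numbers without nested dictionaries is a single groupby over the two trajectory columns. A sketch under some assumptions: the pair identifiers have been made hashable (tuples or strings rather than lists), and value_1/value_2 are hypothetical names for the quantity measured on each trajectory at the crossover:

```python
import pandas as pd

# Toy data; trajectory_1/trajectory_2 and value_1/value_2 are assumed names.
df = pd.DataFrame({
    "trajectory_1": ["abcd1", "abcd1", "abcd1", "ijkl3"],
    "trajectory_2": ["efgh2", "efgh2", "efgh2", "efgh2"],
    "time1": pd.to_datetime(["2020-01-01", "2020-01-05", "2020-01-20", "2020-02-01"]),
    "value_1": [1.0, 2.5, 3.0, 4.0],
    "value_2": [1.2, 2.0, 3.3, 4.4],
})

# Absolute difference in the measured value at each crossover.
df["value_diff"] = (df["value_1"] - df["value_2"]).abs()

def pair_stats(g: pd.DataFrame) -> pd.Series:
    # Intervals between successive crossings of the same pair (using time1;
    # swap in time2 or a combination if that suits your data better).
    gaps = g["time1"].sort_values().diff().dropna()
    return pd.Series({
        "n_crossings": len(g),
        "min_interval": gaps.min(),
        "max_interval": gaps.max(),
        "min_value_diff": g["value_diff"].min(),
        "max_value_diff": g["value_diff"].max(),
    })

# One row of statistics per unique trajectory pair.
stats = df.groupby(["trajectory_1", "trajectory_2"]).apply(pair_stats)
print(stats)
```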

How to visualize aggregate VADER sentiment score values over time in Python?

I have a Pandas dataframe containing tweets from the period July 24, 2019 to October 19, 2019. I have applied the VADER sentiment analysis method to each tweet and added the sentiment scores in new columns.
Now, my hope was to visualize this in some kind of line chart in order to analyse how the averaged sentiment scores per day have changed over this three-month period. I therefore need the dates on the x-axis, and the averaged negative, positive and compound scores (three different lines) on the y-axis.
I have an idea that I need to somehow group or resample the data in order to show the aggregated sentiment value per day, but since my Python skills are still limited, I have not succeeded in finding a solution that works yet.
If anyone has an idea as to how I can proceed, that would be much appreciated! I have attached a picture of my dataframe as well as an example of the type of plot I had in mind :)
Cheers,
Nicolai
You should have a look at the groupby() method:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html
Simply create a day column which contains a timestamp/datetime object/tuple/str ... representing the day of the tweet rather than its exact time. Then use the groupby() method on this column.
If you don't know how to create this column, an easy way of doing it is using https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html
Keep in mind that the groupby() method doesn't return a DataFrame but a DataFrameGroupBy object, so you'll have to choose a way of aggregating the data in your groups (you should probably do groupby().mean() in your case; see the groupby documentation for more information).
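Here is what that looks like in code, assuming a timestamp column named created_at and VADER score columns named neg, pos and compound (adapt the names to your dataframe; .dt.date plays the role of the day column described above):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Toy rows standing in for the tweet dataframe.
df = pd.DataFrame({
    "created_at": pd.to_datetime(["2019-07-24 08:00", "2019-07-24 17:30", "2019-07-25 12:00"]),
    "neg": [0.1, 0.3, 0.0],
    "pos": [0.5, 0.2, 0.7],
    "compound": [0.4, -0.1, 0.7],
})

# Reduce each timestamp to its calendar day, then average the scores per day.
daily = df.groupby(df["created_at"].dt.date)[["neg", "pos", "compound"]].mean()

daily.plot()  # one line per score column, dates on the x-axis
plt.xlabel("Date")
plt.ylabel("Average VADER score")
plt.show()
```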

How to find out how many items in a dataframe is within a certain range?

I'm currently doing some analysis on the stats of my podcast, and I have merged my Spotify listening numbers with the ones from my RSS feed in pandas. So far so good: I now have a dataframe with a "Total" column which tells me how many listeners I had on each episode and what the average number of listeners is.
Now, what I want to do is see how many of my episodes fit into (at least) three categories: Good, Normal and Bad. So I need to divide my totals into three ranges and then see how many of my episodes land within each of those ranges. I have some limited experience of messing around with Python and pandas, but it's been a while since I last sat down with it and I don't really know how to approach this problem.
Any help is highly appreciated!
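One common approach is pandas.cut, which assigns each value to a labelled bin so you can count episodes per category. A sketch with made-up listener numbers and arbitrary bin edges:

```python
import pandas as pd

# Made-up listener counts; "Total" matches the column described above.
df = pd.DataFrame({"Total": [120, 450, 80, 300, 990, 150]})

# Three labelled bins; adjust the edges to whatever counts as Bad/Normal/Good.
bins = [0, 100, 500, float("inf")]
df["category"] = pd.cut(df["Total"], bins=bins, labels=["Bad", "Normal", "Good"])

# Number of episodes in each category.
print(df["category"].value_counts())
```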

How to treat missing data involving multiple datasets

I'm developing a model used to predict the probability of a client changing telephone companies based on their daily usage. My dataset has information from two weeks (14 days).
My dataset includes in each row:
User ID, day (number from 1 to 14), a list of 15 more values.
The problem comes from the fact that some clients don't use their telephones every day, so for each client we have a varying number of rows (from 1 to 14) depending on the days they used their telephones. Therefore we have some missing client-day data combinations.
Removing the missing values is not an option, since the dataset is small and it would affect the predictive methods.
What kind of treatment could I apply to these missing day values for each client?
I've tried making a new dataset with only one entry per client, where a new value quantifies the number of days of telephone usage and the rest of the values are a mean over all the days present in the original dataset. But this decreases the size of the dataset, and we would have the same problem as just removing the missing values.
I've thought about adding values for the missing days for each client (using interpolation methods), but that would distort the results, since it would make the dataset look as if every client used their phone every day, and that would affect the predictive model.
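One option that avoids both dropping rows and interpolating, assuming a missing client-day genuinely means no usage, is to build the full client-by-day grid, fill the usage values with zeros, and keep an explicit indicator column so the model can still distinguish "inactive day" from observed data. A sketch where minutes is a hypothetical stand-in for one of the 15 values:

```python
import pandas as pd

# Toy data: client 1 used the phone only on days 1 and 3.
df = pd.DataFrame({
    "user_id": [1, 1, 2],
    "day": [1, 3, 2],
    "minutes": [12.0, 30.0, 5.0],
})

# Build the full client-by-day grid (days 1..14).
full_index = pd.MultiIndex.from_product(
    [df["user_id"].unique(), range(1, 15)], names=["user_id", "day"]
)
full = df.set_index(["user_id", "day"]).reindex(full_index)

# Keep an explicit activity flag, then treat missing days as zero usage.
full["was_active"] = full["minutes"].notna()
full["minutes"] = full["minutes"].fillna(0)
print(full.reset_index())
```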

Converting a single row into two

I have these daily stats churned out from a system that outputs total sales and units sold per region group. For my analysis, I want to break the entries down into regions instead of region groups, so I'm trying to find a way to split each row into one row per region with the respective measures.
I have historical percentages of the market share per region, which I'll use to come up with the estimated sales and units sold.
I can do this manually in Excel, but given that I'll be doing this on a weekly basis, I'm looking for a way to automate it via Python.
My data: https://imgur.com/a/pBr3y4D
Goal: https://imgur.com/a/Uc56PVR
Well, first of all, when you're doing data-science work, try to find the approach most appropriate for your particular case. There's nothing wrong with using Excel's functionality, scripting, etc., to solve your issue.
However, if you really, really want to use pandas, then what I would do in your case is split each row out into its regions and rebuild the measures, either by .append()-ing the new rows and grouping or by writing a function with a for loop.
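One possible sketch of that idea (using a merge against the share table rather than an explicit loop), with made-up column names for both tables; each region-group row fans out into one row per region, scaled by its historical share:

```python
import pandas as pd

# Made-up daily stats per region group (stands in for the system output).
stats = pd.DataFrame({
    "region_group": ["North"],
    "total_sales": [1000.0],
    "units_sold": [50],
})

# Made-up historical market shares: one row per region within a group.
shares = pd.DataFrame({
    "region_group": ["North", "North"],
    "region": ["North-A", "North-B"],
    "share": [0.6, 0.4],
})

# The merge duplicates each group row once per region; scaling by the
# share gives the estimated per-region figures.
out = stats.merge(shares, on="region_group")
out["est_sales"] = out["total_sales"] * out["share"]
out["est_units"] = out["units_sold"] * out["share"]
print(out[["region", "est_sales", "est_units"]])
```

Wrap the routine in a function and it can run on each weekly export unchanged.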
