How to iterate through dataframe with conditions? - python

I'm having trouble working with a dataframe and I would appreciate some help.
I have a pandas dataframe that has information on the points where two trajectories meet (it comes from pygmt.x2sys_cross). Two of the columns of my dataframe refer to which trajectories I'm working with, in a style such as
trajectory 1 trajectory 2
[abcd1, 123] [efgh2, 456]
where some trajectories cross more than once. I want to find the rows for each unique pair of trajectories that cross, so I can operate on them. In particular, I'd like to find:
How many times each unique pair of trajectories cross
The longest and shortest time interval between crosses (the time at which each trajectory crosses that point is also a column, time1 and time2)
The highest and lowest difference in value measured at the intersections.
I was able to do this by working with nested dictionaries, using the names of the trajectories as the keys, and running a nested for loop. However, I could not store the data from the dictionaries efficiently, whereas I can with the dataframe, so I'd like to get the same result that way.
Thanks a lot.
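For reference, one way to get these numbers directly from the dataframe is to group by the pair of trajectory columns and aggregate. This is only a rough sketch with hypothetical column names traj1, traj2, time1, time2, value1 and value2 (adjust them to the actual pygmt.x2sys_cross output), and it assumes the trajectory columns hold hashable values; if they hold lists such as [abcd1, 123], convert them to tuples or plain name strings first.

import pandas as pd

# time interval and value difference at each crossover (hypothetical column names)
df["dt"] = (df["time2"] - df["time1"]).abs()
df["dvalue"] = (df["value2"] - df["value1"]).abs()

stats = (
    df.groupby(["traj1", "traj2"])
      .agg(
          n_crossings=("dt", "size"),
          shortest_dt=("dt", "min"),
          longest_dt=("dt", "max"),
          lowest_diff=("dvalue", "min"),
          highest_diff=("dvalue", "max"),
      )
      .reset_index()
)

If the same pair can show up in either order, sort the two trajectory names into a canonical order before grouping, so that (a, b) and (b, a) count as one pair.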

Related

two datasets, to find relation based on date range

I have two datasets, one with the times of volcano eruptions, the second with earthquakes.
Both have a "Date" column.
I would like to somehow run a loop to find out, based on the dates, if an earthquake is linked to a volcanic eruption.
The idea is to check if the dates of both events are close enough, let's say within a 4-day range, and if so create a new column in the Earthquake dataset and state yes or no (volcano related or not)...
I have no idea how to even start, or if this is even possible.
Here are the datasets:
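One way to avoid an explicit loop is pandas.merge_asof, which finds the nearest eruption date for every earthquake. A rough sketch, assuming two DataFrames named earthquakes and volcanoes (hypothetical names), each with a "Date" column:

import pandas as pd

earthquakes["Date"] = pd.to_datetime(earthquakes["Date"])
volcanoes["Date"] = pd.to_datetime(volcanoes["Date"])

# for each earthquake, find the nearest eruption date
merged = pd.merge_asof(
    earthquakes.sort_values("Date"),
    volcanoes.sort_values("Date")[["Date"]].rename(columns={"Date": "eruption_date"}),
    left_on="Date",
    right_on="eruption_date",
    direction="nearest",
)

# flag earthquakes that fall within 4 days of the nearest eruption
merged["volcano_related"] = (
    (merged["Date"] - merged["eruption_date"]).abs() <= pd.Timedelta(days=4)
)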

How do I find the 100 most different points within a pool of 10,000 points?

I have a set of 10,000 points, each made up of 70 boolean dimensions. From this set of 10,000, I would like to select 100 points which are representative of the whole set of 10,000. In other words, I would like to pick the 100 points which are most different from one another.
Is there some established way of doing this? The first thing that comes to my mind is a greedy algorithm, which begins by selecting one point at random, then the next point is selected as the most distant one from the first point, and then the second point is selected as having the longest average distance from the first two, etc. This solution doesn't need to be perfect, just roughly correct. Preferably, this solution of 100 points can also be found within ~10 minutes but finishing within 24 hours is also fine.
I don't care about distance in particular; that's just something that comes to mind as a way to capture "differentness."
If it matters, every point has 10 values of TRUE and 60 values of FALSE.
Some already-built Python package to do this would be ideal, but I am also happy to just write the code myself if somebody could point me to a Wikipedia article.
Thanks
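For what it's worth, the greedy idea described in the question is close to farthest-first (max-min) sampling and is short to write with NumPy. A sketch, assuming the data is a (10000, 70) boolean array called points (hypothetical name), and using the minimum distance to the already-selected points rather than the average:

import numpy as np

def greedy_diverse_subset(points, k=100, seed=0):
    # points: (n, 70) boolean array; returns indices of k spread-out points
    rng = np.random.default_rng(seed)
    X = points.astype(np.int8)
    n = X.shape[0]
    selected = [int(rng.integers(n))]              # start from a random point
    # Hamming distance of every point to the nearest already-selected point
    min_dist = np.abs(X - X[selected[0]]).sum(axis=1)
    for _ in range(k - 1):
        nxt = int(np.argmax(min_dist))             # point farthest from the current selection
        selected.append(nxt)
        min_dist = np.minimum(min_dist, np.abs(X - X[nxt]).sum(axis=1))
    return selected

With 100 iterations over a 10,000 x 70 array this runs in well under a minute, comfortably inside the stated time budget.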
Your use of "representative" is not standard terminology, but I read your question as you wish to find 100 items that cover a wide gamut of different examples from your dataset. So if 5000 of your 10000 items were near identical, you would prefer to see only one or two items from that large sub-group. Under the usual definition, a representative sample of 100 would have ~50 items from that group.
One approach that might match your stated goal is to identify diverse subsets or groups within your data, and then pick an example from each group.
You can establish group identities for a fixed number of groups - with different membership size allowed for each group - within a dataset using a clustering algorithm. A good option for you might be k-means clustering with k=100. This will find 100 groups within your data and assign all 10,000 items to one of those 100 groups, based on a simple distance metric. You can then either take the central point from each group or a random sample from each group to find your set of 100.
The k-means algorithm is based around minimising a cost function which is the average distance of each group member from the centre of its group. Both the group centres and the membership are allowed to change, updated in an alternating fashion, until the cost cannot be reduced any further.
Typically you start by assigning each item randomly to a group. Then calculate the centre of each group. Then re-assign items to groups based on the closest centre. Then recalculate the centres, etc. Eventually this should converge. Multiple runs might be required to find a good optimum set of centres (it can get stuck in a local optimum).
There are several implementations of this algorithm in Python. You could start with the scikit-learn library implementation.
According to an IBM support page (from a comment by sascha), k-means may not work well with binary data. Other clustering algorithms may work better. You could also try to convert your records to a space where Euclidean distance is more useful and continue to use k-means clustering. An algorithm that may do that for you is principal component analysis (PCA), which is also implemented in scikit-learn.
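A minimal sketch of the k-means suggestion with scikit-learn, assuming the data is in a (10000, 70) boolean NumPy array called points (hypothetical name), and taking the member closest to each cluster centre as its representative:

import numpy as np
from sklearn.cluster import KMeans

X = points.astype(float)                      # hypothetical (10000, 70) boolean array
km = KMeans(n_clusters=100, n_init=10, random_state=0).fit(X)

# one representative per cluster: the member closest to the cluster centre
representatives = []
for label in range(100):
    members = np.where(km.labels_ == label)[0]
    dists = np.linalg.norm(X[members] - km.cluster_centers_[label], axis=1)
    representatives.append(int(members[np.argmin(dists)]))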
The graph partitioning tool METIS claims to be able to partition graphs with millions of vertices in 256 parts within seconds.
You could treat your 10,000 points as vertices of an undirected graph. A fully connected graph with 50 million edges would probably be too big. Therefore, you could restrict the edges to "similarity links" between points whose Hamming distance is below a certain threshold.
In general, Hamming distances for 70-bit words have values between 0 and 70. In your case, the upper limit is 20, as there are 10 true coordinates and 60 false coordinates per point. The maximum distance occurs if none of the true coordinates of the two points coincide.
Creating the graph is a costly O(n^2) operation, but it might be possible to get it done within your envisaged time frame.
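A rough sketch of building such a thresholded similarity graph with NumPy, assuming the data is in a (10000, 70) boolean array called points and an arbitrary Hamming cut-off of 6 (both hypothetical); the resulting edge list could then be handed to a partitioner such as METIS:

import numpy as np

n = points.shape[0]                           # hypothetical (10000, 70) boolean array
threshold = 6                                 # hypothetical cut-off, tune as needed

edges = []                                    # the O(n^2) pairwise pass mentioned above
for i in range(n - 1):
    # Hamming distances from point i to every later point
    d = (points[i + 1:] != points[i]).sum(axis=1)
    for j in np.flatnonzero(d <= threshold):
        edges.append((i, i + 1 + int(j)))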

what is the best algorithm to cluster this data

Can someone help me find a good clustering algorithm that will cluster this into 3 clusters, without defining the number of clusters?
I have tried many algorithms in their basic form; nothing seems to work properly.
clustering = AgglomerativeClustering().fit(temp)
In the same way I tried DBSCAN and k-means too, just using the guidelines from sklearn, but I couldn't get the expected results.
My original data set is a 1D list of numbers, but the order of the numbers matters, so I generated a 2D list as below.
from sklearn.cluster import AgglomerativeClustering

# pair each average with its position so that the order is part of the features
temp = []
for i in range(len(avgs)):
    temp.append([avgs[i], i + 1])
clustering = AgglomerativeClustering().fit(temp)
For plotting I used a similar range on the x axis:
ax2.scatter(range(len(plots[i])), plots[i], c=np.random.rand(3,))
The order of the data matters, so this needs to be clustered into 3. And there might be some other data sets where the data is very good, so the result for those needs to be just one cluster.
Link to the list if someone wants to try it.
So I tried using the step detection and got the following image according to your answer. But how can I find the values of the peaks? If I take the max value I can get one of them, but how do I get the rest? The second-highest value is not an answer, because the point right next to the max is the second-highest.
Your data is not 2D coordinates, so don't choose an algorithm designed for that!
Instead, your data appears to be sequential, or a time series.
What you want to use is a change point detection algorithm, capable of detecting a change in the mean value of a series.
A simple approach would be to compute the sum of the next 10 points minus the sum of the previous 10 points, then look for extreme values of this curve.
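A sketch of that windowed-difference idea, assuming avgs is the 1D list from the question; scipy.signal.find_peaks then returns all the extreme values, which also addresses the follow-up about finding more than one peak:

import numpy as np
from scipy.signal import find_peaks

x = np.asarray(avgs, dtype=float)
w = 10
# sum of the next w points minus the sum of the previous w points, at every position
step = np.convolve(x, np.r_[np.ones(w), -np.ones(w)], mode="valid")

# peaks in |step| mark candidate change points; distance keeps nearby maxima apart
peaks, _ = find_peaks(np.abs(step), distance=w)
change_points = peaks + w                     # shift back to indices in the original series

Depending on the noise level, a height or prominence argument to find_peaks may also be needed to keep only the genuine jumps.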

Find correlation between two columns in two different dataframes

I have two dataframes which both have an ID column, and for each ID a date columns with timestamps and a Value column. Now, I would like to find a correlation between the values from each dataset in this way: dataset 1 has all the values of people that got a specific disease, and in dataset 2 there are values for people that DIDN'T get the disease. Now, using the corr function:
corr = df1['val'].corr(df2['val'])
my result is 0.1472, which is very low (too low), meaning they have almost nothing in correlation.
Am I doing something wrong? How do I calculate the correlation? Is there a way to find a value (maybe a line) above which people will get the disease? I would like to try this with a machine learning technique (SVMs), but first it would be good to have something like the part I explained before. How can I do that?
Thanks
Maybe your low correlation is due to the index or order of your observations.
Have you tried to do a left join by ID?
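A minimal sketch of that join, assuming both frames have columns named "ID" and "val" (hypothetical names; adjust to the real data), so that the two value columns are aligned before computing the correlation:

import pandas as pd

# align the two value columns by ID before correlating (left join, as suggested)
merged = df1.merge(df2, on="ID", how="left", suffixes=("_1", "_2"))
corr = merged["val_1"].corr(merged["val_2"])

If each ID has several timestamps, merging on the date column as well (or resampling first) keeps the rows aligned in time.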

Organizing records to classes

I'm planning to develop a genetic algorithm for a series of acceleration records, in a search to find the optimum match with a target.
At this point my data is array-like, with a unique ID in the first column, X, Y, Z component info in the second, time in the third, etc.
That being said, each record has several "attributes". Do you think it would be beneficial to create a (records) class, considering the fact that I will want to do a semi-complicated process with it as a next step?
Thanks
I would say yes. Basically I want to:
Take the unique set of data
Filter it so that just a subset is considered (filter parameters can be time of recording for example)
Use a genetic algorithm on the filtered data to match, on average, a target.
Step 3 is irrelevant to the post, I just wanted to give the big picture in order to make my question more clear.
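For what it's worth, a lightweight dataclass can hold those attributes and make the filtering in point 2 explicit. A rough sketch with hypothetical field names, to be adapted to the real record attributes:

from dataclasses import dataclass

@dataclass
class AccelerationRecord:
    record_id: str
    component: str          # e.g. "X", "Y" or "Z"
    time_of_recording: float
    values: list            # the acceleration samples themselves

def filter_records(records, t_min, t_max):
    # keep only records whose recording time falls inside the window
    return [r for r in records if t_min <= r.time_of_recording <= t_max]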
