
Create optimal_quantization.py #11


Merged
19 commits merged into geomstats:master on Sep 25, 2018

Conversation

@alebrigant (Contributor)

No description provided.

@@ -0,0 +1,72 @@
"""
Optimal quantization
Contributor:

Can you describe a little bit more what this does, please?

S2metric = HypersphereMetric(dimension=2)

TOLERANCE = 1e-5
IMPLEMENTED = ['S2']
Contributor:

use a tuple, not a list


def sample_from(points, size=1):
"""
Sample from the empirical distribution associated to points
Contributor:

Please be consistent with periods at the end of sentences in the comments.

n_points = points.shape[0]
dimension = points.shape[-1]

ind = np.random.randint(low=0, high=n_points, size=size)
Contributor:

Use full words instead of abbreviations like ind.

# random initialization of the centers
centers = sample_from(points, n_centers)

gap = 1
Contributor:

Use gap = 1.0 to show that it is a floating-point value.


while gap > tolerance:
step += 1
k = np.floor(step / n_repetition) + 1
Contributor:

what's the rationale for starting at step = 1 instead of step = 0?

Contributor (Author):

Starting at step = 0 would replace the center to be updated with the new sample, instead of just moving it in the direction of the sample (as at any other step > 0). It is equivalent to a slightly different initialization of the centers (only one of them differs). Since there is no reason for that initialization to be better than the original one, I start at step = 1.
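
A minimal sketch of the distinction (the gain variable and the helper below are illustrations, not the PR's literal code): moving a center along the geodesic toward the sample with gain 1.0 lands exactly on the sample, i.e. the center is replaced.

from geomstats.hypersphere import HypersphereMetric

metric = HypersphereMetric(dimension=2)

def move_center(center, sample, gain):
    # log gives the tangent vector at center pointing toward sample;
    # exp with gain 1.0 lands exactly on sample (the center is replaced),
    # while gain < 1.0 only moves the center part of the way.
    tangent_vec = metric.log(point=sample, base_point=center)
    return metric.exp(gain * tangent_vec, base_point=center)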

Contributor:

OK, but you should find a better name for k; it is hard to understand what this variable means. Usually k is used for indexing an array, and variables like i, j, k should have trivial behavior, like starting at 0 and incrementing. Anything with more complex behavior should have a clear name.

Contributor (Author):

Yes Nina had the same comment, so I changed it to step_size.


centers[ind, :] = new_center

return centers, step
Contributor:

you need unit tests for this file

Corrections of the script including new descriptions, implementation of the circle (will be useful in unit test) and addition of new outputs.
Add karcher flow algorithm for the purpose of the unit test.
@alebrigant (Contributor, Author):

Thanks for the reviews! Here are the latest changes:

  1. Modifications of optimal_quantization.py: I have added a karcher_mean function that is needed for the test of optimal_quantization. Maybe it would be more coherent to define a class 'DataPoints' of which all the functions defined in optimal_quantization.py could be methods?

  2. Creation of test_optimal_quantization.py: the unit test of optimal quantization consists of verifying that when one looks for only one center, one should find the Karcher mean. However, I couldn't guarantee a precision better than 0.5% of the 'diameter' of the set of data points.
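
A hypothetical sketch of that test (names follow the files in this PR; the exact signatures and return values may differ):

import optimal_quantization as oq
from geomstats.hypersphere import HypersphereMetric

metric = HypersphereMetric(dimension=2)
# points: data set on the sphere, shape (n_points, 3).
centers, weights, clustering, n_iterations = oq.optimal_quantization(
    points, n_centers=1, space='S2')
mean = oq.karcher_flow(points, space='S2')
# Tolerance: 0.5% of the 'diameter' of the data set.
precision = 0.005 * oq.diameter_of_data(points, space='S2')
assert metric.dist(centers[0], mean) < precision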

@ninamiolane (Contributor) left a comment:

Awesome! Some comments, mostly syntax.

@@ -0,0 +1,171 @@
"""
Optimal quantization of the empirical distribution of a dataset -
Contributor:

Nit: why not a period at the end of the line?

Contributor (Author):

I replaced it with a dot.


def diameter_of_data(points, space=None):
"""
Compute the two points that are farthest away from each other in points.
Contributor:

NIT: Compute the distance between the two points

Contributor (Author):

Yes, thanks!

return index_closest_neighbor


def diameter_of_data(points, space=None):
@ninamiolane (Contributor), Aug 28, 2018:

NIT: maybe diameter would sound better, and it could be added to riemannian_metric.py?

Contributor (Author):

Corrected.

n_points = points.shape[0]

for i in range(n_points-1):
dist_to_neighbors = metric.dist(points[i, :], points[i+1:, :])
Contributor:

Doesn't this give only one real number: the distance between point i and point i+1? If so, I don't understand the next line taking the max of a single real number.

Contributor (Author):

No, because the second argument contains all the points from index i+1 to the end (the ':' in the slice is easy to miss).
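
In other words, each point is compared to all later points at once; a sketch with the slicing spelled out (assuming metric.dist broadcasts over its second argument, as stated above):

diameter = 0.0
for i in range(n_points - 1):
    # points[i+1:, :] is 2D: all the points with index greater than i.
    dist_to_neighbors = metric.dist(points[i, :], points[i+1:, :])
    diameter = max(diameter, np.max(dist_to_neighbors))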

"""
Compute the Karcher mean of points using a Karcher flow algorithm.
Return :
- the karcher mean
Contributor:

NIT: Uppercase letter for Karcher everywhere.

Contributor (Author):

Corrected.

sample = oq.sample_from(self.points)
result = False
for i in range(self.n_points):
if (self.points[i, :] == sample).all():
Contributor:

Could you use np.allclose(self.points[i, :], sample)?

Contributor (Author):

Corrected.

result = False
for i in range(self.n_points):
if (self.points[i, :] == sample).all():
result = True
Contributor:

You can add a break when the sample is found among the points, in order to stop the for loop early.

Contributor (Author):

Thanks for the tip!
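
The two suggestions combined would read, e.g.:

result = False
for i in range(self.n_points):
    if np.allclose(self.points[i, :], sample):
        result = True
        break  # stop as soon as the sample is found among the points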

closest_neighbor = self.points[index, :]
result = False
for i in range(self.n_points):
if (self.points[i, :] == closest_neighbor).all():
Contributor:

Same as above.


import unittest

import numpy as np
Contributor:

Use import geomstats.backend as gs.

for i in range(self.n_points):
tangent_vectors[i, :] = self.metric.log(
self.points[i, :], karcher_mean)
sum_tan_vec = np.sum(tangent_vectors, axis=0)
Contributor:

NIT: sum_tangent_vecs?

return diameter


def karcher_flow(points, space=None, tolerance=TOLERANCE):
Contributor:

IMPLEMENTED = ('S1', 'S2')


def sample_from(points, size=1):
Contributor:

There might be a function doing this already, something like numpy.random.choice: https://docs.scipy.org/doc/numpy/reference/generated/numpy.random.choice.html

Contributor (Author):

It seems that numpy.random.choice takes only 1D arrays as input, so it cannot be used directly to sample from a set of points in 2 or more dimensions. I can use it to choose an index, but I don't think that would be very different from using numpy.random.randint?
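
That is, rows can only be sampled indirectly through their indices, e.g.:

index = np.random.randint(low=0, high=n_points, size=n_samples)
sample = points[index, :]  # sample rows of the 2D array by index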

Contributor:

Ah, true. Too bad, it seems that they wanted to add the option but haven't done it so far: numpy/numpy#7810

return sample


def closest_neighbor(point, neighbors, space=None):
Contributor:

Add to riemannian_metric.py

Contributor (Author):

Done!

Remove closest_neighbor, diameter and karcher_flow
@alebrigant (Contributor, Author):

I had missed the mean method of RiemannianMetric, thanks for pointing it out! I have removed karcher_flow, which was indeed doing the same thing, and I have moved diameter and closest_neighbor to riemannian_metric.py.

@ninamiolane (Contributor) left a comment:

Thank you! Almost there! I added a decent amount of new comments, because I've understood more of the code with this second review :)

IMPLEMENTED = ('S1', 'S2')


def sample_from(points, size=1):
Contributor:

How about n_samples=1 instead of size=1 to be consistent with the other sampling functions?

Contributor (Author):

Corrected.

dimension = points.shape[-1]

index = gs.random.randint(low=0, high=n_points, size=size)
sample = points[gs.ix_(index, gs.arange(dimension))]
Contributor:

I think I don't understand the size parameter. Why not the following?

index = gs.random.randint(low=0, high=n_points, size=(n_samples,))
sample = points[index, :]

Contributor (Author):

It seems that size=n_samples (previously size=size) gives the same result as size=(n_samples,).
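
Indeed, for an integer n_samples both forms are equivalent in numpy; each returns an array of shape (n_samples,):

gs.random.randint(low=0, high=n_points, size=n_samples)
gs.random.randint(low=0, high=n_points, size=(n_samples,))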

if gs.isclose(gap, 0, atol=tolerance):
break

if iteration is n_max_iterations:
Contributor:

This should be n_max_iterations-1 or the while loop condition above should be <=.
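
A sketch of the corrected check, also replacing is with ==, since is tests object identity rather than equality and is unreliable for integers:

if iteration == n_max_iterations - 1:
    # Hypothetical message; the PR's actual wording may differ.
    print('Maximum number of iterations reached. '
          'The result may be inaccurate.')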

Contributor (Author):

Yes, thanks!

else:
metric = HypersphereMetric(dimension=2)

# random initialization of the centers
Contributor:

No need for this comment: the code is clear enough and this will save us from meaningless leftover comments if we later change the code but forget to adapt the comments.

Contributor (Author):

Corrected.

centers, weights, clustering, n_iterations = oq.optimal_quantization(
points, n_centers, space='S1', n_repetitions=20, tolerance=1e-6
)
theta = gs.linspace(0, 2*gs.pi, 100)
Contributor:

Could you add this as a class Circle in visualization.py, similar to the class Sphere? Thank you!

https://github.com/geomstats/geomstats/blob/master/geomstats/visualization.py#L64

from geomstats.hypersphere import Hypersphere

CIRCLE = Hypersphere(dimension=1)

Contributor:

Add all the constants at the beginning:

N_POINTS = 1000
N_CENTERS = 5
N_REPETITIONS = 20
TOLERANCE = 1e-6


def optimal_quantization(points, n_centers=10, space=None, n_repetitions=20,
tolerance=TOLERANCE, n_max_iterations=50000):
"""
Contributor:

Could you add a short explanation about how you use n_repetitions? Thanks.
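
One reading of the update rule quoted earlier (k = np.floor(step / n_repetition) + 1), assuming the gain used to move a center is 1 / step_size: the gain decreases in plateaus, staying constant for n_repetitions iterations at a time (1, ..., 1, 1/2, ..., 1/2, 1/3, ...). A hedged sketch, not the PR's literal code:

step_size = gs.floor(step / n_repetitions) + 1  # gs.floor assumed to mirror np.floor
tangent_vec = metric.log(point=sample, base_point=center) / step_size
center = metric.exp(tangent_vec, base_point=center)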

plt.figure(1)
ax = plt.subplot(111, projection="3d", aspect="equal")
color = np.random.rand(n_centers, 3)
ax.plot_wireframe(sphere.sphere_x,
Contributor:

Use methods of the sphere: sphere.draw, sphere.add_points, sphere.draw_points, etc
https://github.com/geomstats/geomstats/blob/master/geomstats/visualization.py#L94

Contributor (Author):

I have tried to use them in the best way possible; however, it would be more satisfactory if the sphere were not drawn as many times as there are clusters (I need each cluster to be drawn in a different color, so I call sphere.draw n_clusters times). Maybe this could be fixed by creating a new list of points each time add_points is called, which could then be plotted in a different color (or some other way)?

Also, I would recommend changing the alpha=0.5 in draw to a lower value such as alpha=0.2 so that the points are more visible, or making it an adjustable parameter.

Contributor (Author):

Never mind, the sphere is not drawn when using sphere.draw_points instead of sphere.draw. Sorry about that!
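
Based on the snippet under review, the resolved pattern looks roughly like this (the draw_points color handling is an assumption; see visualization.py and PR #141 for the actual API):

sphere = visualization.Sphere()  # one Sphere for all clusters
colors = gs.random.rand(n_centers, 3)
for i in range(n_centers):
    cluster_i = gs.vstack([point for point in clusters[i]])
    sphere.add_points(cluster_i)
    # draw_points plots the added points without redrawing the sphere,
    # so each cluster can get its own color (hypothetical kwarg);
    # ax is the 3D axis created earlier in the example.
    sphere.draw_points(ax, color=colors[i])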

@ninamiolane (Contributor) left a comment:

Nice! Just a few additional comments.

n_repetitions=N_REPETITIONS, tolerance=TOLERANCE
)
visualization.plot(centers, space='S1', color='red')

Contributor:

Does it make sense to also draw the points, each in the color corresponding to its center, to mimic the plot_quantization_s2 example?

Contributor (Author):

Yes! I added that.

return sample


def optimal_quantization(points, metric, n_centers=10, n_repetitions=20,
Contributor:

Put

N_CENTERS = 10
N_REPETITIONS = 20
N_MAX_ITERATIONS = 5000

at the beginning of the file.

color = gs.random.rand(N_CENTERS, 3)
for i in range(N_CENTERS):
cluster_i = gs.vstack([point for point in clusters[i]])
sphere = visualization.Sphere()
Contributor:

Do we need a new Sphere for each cluster?

@alebrigant (Contributor, Author), Sep 20, 2018:

I changed visualization.py in that direction. See geomstats/geomstats#141.

cluster_i = gs.vstack([point for point in clusters[i]])
sphere = visualization.Sphere()
sphere.add_points(cluster_i)
if i == 0:
Contributor:

Do we need this edge case? If so, could we tackle it directly in visualization.py?

Contributor (Author):

Same as above, see PR #141.

Return :
- n_centers centers
- n_centers weights between 0 and 1
- a dictionary containing the clusters
Contributor:

Add something like: "where each key is the cluster index, and its value is the list of points belonging to the cluster."?
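
For example (hypothetical point names, for illustration only):

clustering = {
    0: [point_a, point_b],  # points whose closest center is centers[0]
    1: [point_c],           # points whose closest center is centers[1]
}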

@ninamiolane (Contributor):

Thanks! Let's first agree on the visualization.py version, and I'll have another look at this one afterwards.


@ninamiolane merged commit 4283bce into geomstats:master on Sep 25, 2018.