Running kmeans model in Python recipe is very slow

Alex_zw
Level 1

I am using Python to build a clustering model in a Dataiku Python recipe:

from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Build the model with the best K found earlier
print('The best K suggested: ', K_best)
model = KMeans(n_clusters=K_best, init='k-means++', n_init=10, max_iter=300, tol=1e-04, random_state=101)

# Fit the model on the scaled features
model = model.fit(X_scaled)

# Cluster label assigned to each row
labels = model.labels_

# Plot features 1 and 2 against feature 0, colored by cluster
#plt.scatter(X_scaled[:,0], X_scaled[:,1], c=model.labels_.astype(float))
fig = plt.figure(figsize=(20,5))
ax = fig.add_subplot(121)
ax.scatter(x=X_scaled[:,1], y=X_scaled[:,0], c=labels.astype(float))
ax.set_xlabel(feature_vector[1])
ax.set_ylabel(feature_vector[0])
ax = fig.add_subplot(122)
ax.scatter(x=X_scaled[:,2], y=X_scaled[:,0], c=labels.astype(float))
ax.set_xlabel(feature_vector[2])
ax.set_ylabel(feature_vector[0])

plt.show()

However, this part takes more than 5 hours to run. Is there a more efficient way to make it run faster?

Many thanks

1 Reply
cperdigou
Dataiker Alumni

I'm sorry, I can't help with your exact problem without knowing a bit more about your input data, its size in particular.
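
One way to gather that information is to print the shape of the array and time the same KMeans settings on a random subsample. The following is only a rough sketch: it assumes X_scaled is a NumPy array as in the question, uses a synthetic stand-in so it runs on its own, and the helper name time_kmeans_on_sample and the sample size are hypothetical.

```python
import time

import numpy as np
from sklearn.cluster import KMeans


def time_kmeans_on_sample(X, n_clusters, sample_size=10_000, seed=101):
    """Fit KMeans on a random subsample of X and return (model, seconds)."""
    rng = np.random.default_rng(seed)
    n = min(sample_size, X.shape[0])
    # Draw row indices without replacement so each sampled row is unique.
    idx = rng.choice(X.shape[0], size=n, replace=False)
    start = time.perf_counter()
    model = KMeans(n_clusters=n_clusters, init="k-means++", n_init=10,
                   max_iter=300, tol=1e-4, random_state=seed).fit(X[idx])
    return model, time.perf_counter() - start


# Synthetic stand-in for X_scaled (assumption: replace with the real array).
X_scaled = np.random.default_rng(0).normal(size=(50_000, 3))
print("rows, columns:", X_scaled.shape)

# n_clusters=4 is a placeholder for K_best from the question.
model, seconds = time_kmeans_on_sample(X_scaled, n_clusters=4)
print(f"KMeans on {len(model.labels_)} sampled rows took {seconds:.2f}s")
```

If the subsample fits in seconds, the row count (or the scatter plot of every point) is the likely cause of the long runtime, and the numbers printed here are the ones worth posting back.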

Could you try using Dataiku's visual machine learning for this task?

From the flow, select your input dataset > Lab > Quick Model > Clustering > Quick Models > K-means.

In DESIGN > Algorithms you can select other algorithms, and by clicking K-means you can try several numbers of clusters. When you're done, click Train and let us know how it goes.
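
If you prefer to stay in the Python recipe, trying another algorithm can also be done in code. As a hedged sketch, scikit-learn's MiniBatchKMeans updates centroids from small batches and is often much faster than KMeans on large inputs, at the cost of slightly less precise clusters. The data below is a synthetic stand-in, and the K_best value and batch_size are assumptions, not values from the question.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

# Synthetic stand-in for X_scaled (assumption: replace with the real array).
X_scaled = np.random.default_rng(0).normal(size=(100_000, 3))
K_best = 4  # hypothetical value; use the K selected earlier in the recipe

# Mini-batch variant of k-means: each iteration uses batch_size rows
# instead of the full dataset.
mbk = MiniBatchKMeans(n_clusters=K_best, init="k-means++", n_init=3,
                      batch_size=1024, max_iter=100, random_state=101)
labels = mbk.fit_predict(X_scaled)

print("labels:", labels.shape, "centers:", mbk.cluster_centers_.shape)
```

Whether the speed-up is enough depends on the size of the real dataset, so it is worth comparing the resulting clusters against a regular KMeans fit on a subsample.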
