Running kmeans model in Python recipe is very slow

Alex_zw Registered Posts: 1 ✭✭✭✭

I am using Python to build a clustering model in Dataiku:

# Best k to build model
print('The best K sugest: ', K_best)
model = KMeans(n_clusters=K_best, init='k-means++', n_init=10, max_iter=300, tol=1e-04, random_state=101)
#model = = model.labels_
# plt
#plt.scatter(X_scaled[:,0], X_scaled[:,1], c=model.labels_.astype(float))
fig = plt.figure(figsize=(20, 5))
ax = fig.add_subplot(121)
plt.scatter(x=X_scaled[:, 1], y=X_scaled[:, 0], c=model.labels_.astype(float))
ax.set_xlabel(feature_vector[1])
ax.set_ylabel(feature_vector[0])
ax = fig.add_subplot(122)
plt.scatter(x=X_scaled[:, 2], y=X_scaled[:, 0], c=model.labels_.astype(float))
ax.set_xlabel(feature_vector[2])
ax.set_ylabel(feature_vector[0])

However, this part takes more than 5 hours to run. Is there a more efficient way to make it run faster?

Many thanks


  • cperdigou Alpha Tester, Dataiker Alumni Posts: 115 ✭✭✭✭✭✭✭

    I'm sorry, I can't help with your exact problem, as that would require knowing more about your input data, its size in particular.
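
    One way to gather that information is to print the shape of the input array and time a fit on a random sample of rows. The snippet below is only a rough sketch: the helper name time_kmeans_on_sample, the sample size, and the synthetic X_demo array (a stand-in for X_scaled from the question) are assumptions made for illustration, not something taken from the original recipe.

```python
import time

import numpy as np
from sklearn.cluster import KMeans


def time_kmeans_on_sample(X, n_clusters, sample_size=10_000, seed=101):
    """Fit KMeans on a random sample of rows.

    Returns (rows_used, seconds_elapsed, fitted_model).
    Hypothetical helper, written only to illustrate the idea.
    """
    rng = np.random.default_rng(seed)
    n_rows = X.shape[0]
    size = min(sample_size, n_rows)
    # Sample row indices without replacement.
    idx = rng.choice(n_rows, size=size, replace=False)
    # Same KMeans settings as in the question.
    model = KMeans(n_clusters=n_clusters, init="k-means++", n_init=10,
                   max_iter=300, tol=1e-4, random_state=seed)
    start = time.perf_counter()
    model.fit(X[idx])
    elapsed = time.perf_counter() - start
    return size, elapsed, model


if __name__ == "__main__":
    # Synthetic stand-in for X_scaled (assumption: a 2-D float array).
    X_demo = np.random.default_rng(0).normal(size=(20_000, 3))
    print("rows, columns:", X_demo.shape)
    used, seconds, _ = time_kmeans_on_sample(X_demo, n_clusters=4)
    print(f"KMeans on {used} rows took {seconds:.2f} s")
```

    If the sampled fit is quick, comparing timings at a few sample sizes gives a rough idea of how the run time grows with the number of rows, and whether the fit or another step (for example the scatter plots) accounts for most of the 5 hours.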

    Could you try using Dataiku's visual machine learning for this task?

    From the flow, select your input dataset > Lab > Quick Model > Clustering > Quick Models > K-means.

    In DESIGN > Algorithms you can select other algorithms, and by clicking k-means you can try multiple cluster numbers. When you're done, click train, and tell us how this goes.
