Clustering Customer Comments¶
The comment data can be found here. Usually when I use data from Kaggle I write the report in Kaggle, however this project uses the sentence_transformers library which the Kaggle environment does not provide.
Introduction¶
Customer feedback is everywhere—whether it’s in reviews, surveys, or social media posts, businesses are constantly receiving comments from their users. But digging through all that text to find useful insights is tough, especially because people express themselves in so many different ways. Traditional methods of analyzing text often miss the deeper meaning or context behind what’s being said.
In this project, we’re tackling that challenge using the all-MiniLM-L6-v2 transformer model. This model is part of the sentence-transformer family, designed to create compact, high-quality embeddings for text. Basically, it turns customer comments into numerical vectors that represent their meaning and context. These embeddings make it much easier to group similar comments together, even when they’re phrased differently.
We will start by cleaning up the text data, removing things like URLs, extra spaces, and irrelevant symbols. Then we will use the transformer model to generate embeddings for each comment. Once we have these embeddings, we will run a k-means clustering algorithm to group the comments into meaningful clusters. To make it easy to see how the comments are grouped, we will reduce the data to two dimensions using principal component analysis (PCA). Finally, we will analyze the clusters to figure out what themes or insights they reveal.
The ultimate goal is to make sense of unstructured feedback and help businesses understand their customers better. Whether it’s spotting recurring issues, identifying areas of improvement, or just recognizing what’s working well, this project aims to show how modern NLP tools can make a big difference in understanding customer sentiment.
# Utility Libraries
import re
import numpy as np
import pandas as pd
from collections import Counter
# Plotting Libraries
import matplotlib.pyplot as plt
# nltk Libraries
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.sentiment.vader import SentimentIntensityAnalyzer
# sklearn Libraries
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.metrics import silhouette_score, davies_bouldin_score, calinski_harabasz_score
# Transformer Libraries
from sentence_transformers import SentenceTransformer
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline
Load in the Comments¶
We begin by loading the dataset, which includes customer comments along with an accompanying column indicating whether the customer had a positive or negative experience. For the purposes of this project, our focus is solely on the textual feedback, so we will extract and utilize only the Review column.
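# Load the reviews and keep only the free-text column.
comments = list(pd.read_csv("Restaurant_Reviews.tsv", sep = "\t")["Review"])
# Drop exact duplicates. Note that set() does not preserve the original row order.
comments = list(set(comments))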
comments[:10]
['I hate to disagree with my fellow Yelpers, but my husband and I were so disappointed with this place.', "Just don't know why they were so slow.", "Best fish I've ever had in my life!", "I've lived here since 1979 and this was the first (and last) time I've stepped foot into this place.", 'Sauce was tasteless.', 'first time there and might just be the last.', 'The food, amazing.', 'We waited for forty five minutes in vain.', 'Now this dish was quite flavourful.', 'Very Very Disappointed ordered the $35 Big Bay Plater.']
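A side note on the de-duplication step: `list(set(comments))` removes repeats, but a set has no defined order, and Python randomizes string hashing per session by default, so the order of the comments can differ between runs. The sketch below shows one order-preserving alternative using only the standard library. The helper name `dedupe_preserving_order` and the toy reviews are made up for illustration; this is not what the notebook runs.

```python
def dedupe_preserving_order(items):
    """Drop repeated items while keeping the first occurrence of each."""
    # dict keys are unique and, since Python 3.7, keep insertion order.
    return list(dict.fromkeys(items))


# Toy reviews (invented for illustration, not taken from the dataset).
reviews = ["Great food.", "Slow service.", "Great food.", "Loved it!"]
unique_reviews = dedupe_preserving_order(reviews)
print(unique_reviews)
```

With a stable ordering, the rows fed to the embedding model and to K-Means stay in the same order from one session to the next, which makes it easier to compare runs.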
Preprocess the Comments¶
To ensure the data is suitable for clustering, we preprocess the comments to clean and standardize the text. This involves several key steps. First, we remove URLs and non-alphabetic characters to eliminate unnecessary noise. Next, the comments are converted to lowercase to maintain consistency. We then tokenize the text into individual words and filter out common stop words that do not contribute meaningful information, such as "and" or "the." Finally, each word is lemmatized to reduce it to its base form, so that variations of the same word are treated as a single entity. Note that WordNetLemmatizer treats every word as a noun unless a part-of-speech tag is supplied, so as called here it mainly collapses plurals (for example, "minutes" becomes "minute") and leaves verb forms such as "waited" or "stepped" unchanged; reducing "running," "ran," and "runs" to "run" would require passing pos="v". This preprocessing step enhances the quality of the input data, enabling the clustering algorithm to focus on meaningful patterns and relationships within the comments.
def preprocess_comments(comments):
    stop_words = set(stopwords.words("english"))
    lemmatizer = WordNetLemmatizer()

    def clean_comment(comment):
        # Strip URLs, then everything that is not a letter or whitespace.
        comment = re.sub(r"http\S+|www\S+|https\S+", '', comment, flags=re.MULTILINE)
        comment = re.sub(r"[^a-zA-Z\s]", '', comment)
        # Lowercase, tokenize, drop stop words, and lemmatize (noun POS by default).
        words = word_tokenize(comment.lower())
        words = [lemmatizer.lemmatize(word) for word in words if word not in stop_words]
        return ' '.join(words)

    return [clean_comment(comment) for comment in comments]

processed_comments = preprocess_comments(comments)
processed_comments[:10]
['hate disagree fellow yelpers husband disappointed place', 'dont know slow', 'best fish ive ever life', 'ive lived since first last time ive stepped foot place', 'sauce tasteless', 'first time might last', 'food amazing', 'waited forty five minute vain', 'dish quite flavourful', 'disappointed ordered big bay plater']
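To see what the regex steps do in isolation, here is a minimal sketch that mirrors the cleaning order above using only the standard library. It is a simplification under stated assumptions: `TOY_STOP_WORDS` is a tiny invented list standing in for NLTK's English stop words, `str.split` stands in for `word_tokenize`, and there is no lemmatization.

```python
import re

# Tiny illustrative stop-word list; the notebook uses NLTK's full English list.
TOY_STOP_WORDS = {"the", "and", "was", "a", "at", "it"}


def clean_comment_sketch(comment):
    """Mirror the regex, lowercase, and stop-word steps (no lemmatizer)."""
    # 1. Strip URLs.
    comment = re.sub(r"http\S+|www\S+", "", comment)
    # 2. Drop everything that is not a letter or whitespace.
    comment = re.sub(r"[^a-zA-Z\s]", "", comment)
    # 3. Lowercase and split on whitespace (a stand-in for word_tokenize).
    words = comment.lower().split()
    # 4. Remove stop words.
    return " ".join(w for w in words if w not in TOY_STOP_WORDS)


first = clean_comment_sketch("The food was AMAZING!!! See www.example.com")
second = clean_comment_sketch("Don't go, it's a $35 rip-off.")
print(first)   # food amazing see
print(second)  # dont go its ripoff
```

The second example shows a side effect that is also visible in the processed output above ('dont know slow'): because apostrophes are removed before stop-word filtering, "Don't" turns into "dont", which no longer matches the stop-word entry it would otherwise have matched.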
Load in the Transformer Model¶
To transform customer comments into a format suitable for clustering, we use the SentenceTransformer model, specifically the pre-trained 'all-MiniLM-L6-v2' variant. This model converts each comment into a 384-dimensional vector representation (embedding) that captures its semantic meaning. These embeddings serve as the foundation for clustering, enabling us to group comments with similar underlying meanings, regardless of wording differences.
For visualization purposes, we apply Principal Component Analysis (PCA) to reduce the embeddings to two dimensions, making it easier to interpret the data. It’s important to note that the PCA-reduced embeddings are not used for the actual clustering process; they are solely for visualizing the relationships between comments in a more interpretable 2D space.
model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(processed_comments)
pca = PCA(n_components=2)
embeddings_2d = pca.fit_transform(embeddings)
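For intuition about what the PCA projection does, the sketch below finds the single direction of greatest variance in a toy dataset and projects the points onto it. This is a hand-rolled illustration under stated assumptions: it uses power iteration on the covariance matrix, which is not how scikit-learn computes PCA (it relies on an SVD), it recovers only the first component rather than two, and the 3-D points are invented stand-ins for real embeddings.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return (mean, unit vector) for the direction of greatest variance."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - mean[j] for j in range(d)] for r in rows]
    # Sample covariance matrix, d x d.
    cov = [
        [sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
        for a in range(d)
    ]
    # Power iteration: multiplying by the covariance matrix and renormalising
    # converges towards its dominant eigenvector.
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return mean, v


def project(rows, mean, direction):
    """Coordinate of each row along `direction`, after centering."""
    return [
        sum((r[j] - mean[j]) * direction[j] for j in range(len(mean)))
        for r in rows
    ]


# Toy 3-D "embeddings" (made up): most of the spread lies along x = y.
toy = [[0.0, 0.0, 0.0], [1.0, 1.0, 0.1], [2.0, 2.0, -0.1],
       [3.0, 3.0, 0.0], [4.0, 4.0, 0.1]]
mean, pc1 = first_principal_component(toy)
scores = project(toy, mean, pc1)
print([round(x, 3) for x in pc1])
print([round(s, 3) for s in scores])
```

The recovered direction weights x and y equally and nearly ignores z, which is why a low-dimensional plot can preserve the broad layout of the data while discarding detail. The same caveat applies to the 2D plots in this notebook: clusters that overlap in the projection may still be well separated in the full embedding space.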
Sentiment Extraction¶
The get_sentiments function analyzes the sentiment of a list of comments by calculating their compound sentiment score. It uses the SentimentIntensityAnalyzer from the nltk library, an implementation of VADER, which scores text with a sentiment lexicon and a set of heuristic rules rather than a trained neural model. The function iterates through the provided comments, converts them to strings (if necessary), and applies the polarity_scores method to each one, which returns a dictionary containing negative, neutral, positive, and compound scores. The compound score, a single value normalized to the range -1 (most negative) to +1 (most positive), is extracted and added to a list. This list is then returned, giving a quantitative measure of the emotional tone of each comment. Note that the scores are computed on the original comments rather than the preprocessed ones.
Once our data is clustered, we can use the sentiment scores to determine the average sentiment of each cluster.
def get_sentiments(comments):
    sentiments = []
    sia = SentimentIntensityAnalyzer()
    for text in comments:
        sentiments.append(sia.polarity_scores(str(text))["compound"])
    return sentiments

sentiments = get_sentiments(comments)
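The compound scores are kept as raw numbers in this notebook, but it can help to see how they are commonly bucketed into labels. The sketch below uses cut-offs of +0.05 and -0.05, which are a widely used convention for VADER and an assumption here, not something this notebook applies. The function name `label_from_compound` and the example scores are hypothetical.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a VADER-style compound score in [-1, 1] to a coarse label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# Hypothetical compound scores, not produced by running the analyzer.
example_scores = [0.62, -0.48, 0.0, 0.03]
example_labels = [label_from_compound(s) for s in example_scores]
print(example_labels)
```

Averaging the raw compound scores per cluster, as done later, keeps more information than averaging labels, since a mildly positive and a strongly positive comment contribute differently.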
Optimal Number of Clusters - Qualitative Analysis¶
Now that we have the embeddings for the customer comments, the next step is clustering. To facilitate experimenting with different cluster counts and improve reusability, we define some functions to streamline the process.
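As a reminder of what K-Means does under the hood, here is a minimal pure-Python sketch of Lloyd's algorithm, the assign-then-update loop at its core. It is an illustration under stated assumptions, not the scikit-learn implementation used in this notebook: the initial centers are simply the first k points (scikit-learn defaults to k-means++ initialization), and the 2-D points are invented.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_sketch(points, k, max_iter=100):
    """Plain Lloyd's algorithm; initial centres are the first k points."""
    centres = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centre.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centres[c]))
            for p in points
        ]
        # Update step: each centre moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:  # keep the old centre if a cluster empties out
                centres[c] = [sum(col) / len(members) for col in zip(*members)]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
    return labels, centres


# Two obvious groups of made-up 2-D points.
toy_points = [[0.0, 0.0], [5.0, 5.0], [0.2, 0.1], [5.1, 4.9], [0.1, 0.3], [4.8, 5.2]]
toy_labels, toy_centres = kmeans_sketch(toy_points, k=2)
print(toy_labels)
print(toy_centres)
```

Because the result depends on the starting centers, the notebook fixes `random_state=42` so that repeated runs on the same data in the same order give the same clusters.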
The perform_and_plot_kmeans function applies K-Means clustering to a dataset, dividing the data into k clusters. It visualizes the results using a 2D PCA projection, where the points are colored based on their cluster labels. The plot includes a legend identifying the clusters and highlights the spatial separation of the clusters in the reduced-dimensional space. This function returns the fitted K-Means model and the cluster labels for further analysis.
The get_top_n_closest_comments_to_centers function retrieves the most representative comments for each cluster by finding the points closest to the cluster centers in embedding space. Using the euclidean_distances function, it calculates the distance of each point in a cluster from its center, sorts the points by proximity, and extracts the top n_comments. The closest comments are printed for each cluster, and a list of sorted comments is returned for later use.
The summarize_comments function generates concise summaries of the comments for each cluster using a pre-trained T5 text summarization model. For each cluster, it takes the comments closest to the cluster center (the first 2% of the distance-sorted list), joins them into a single text block, truncates that text to 512 tokens, and feeds it to the model to produce a summary of the specified length. The summarized comments for all clusters are returned, providing a high-level overview of the dominant themes or characteristics in each cluster. It should be noted that this summarization method is rather weak at the moment and serves only as a placeholder until we find a better way to auto-summarize comments. The weakness of the summaries will be very noticeable when they are used as the cluster names in the bar charts.
The plot_and_print_comments function integrates the previous functions into a cohesive process. It performs clustering, visualizes the clusters, retrieves the closest comments for each cluster, and generates summaries of the comments. By combining these steps, it provides both visual and textual insights into the structure and content of the clusters. This function outputs the K-Means model, cluster labels, sorted comments, and summarized comments.
Finally, the plot_cluster_bar_chart function visualizes the distribution of clusters and sentiments as a bar chart, showing the percentage of data points and average sentiments in each cluster. It maps the cluster labels to their corresponding summarized comments, calculates the frequency of each cluster, and normalizes these counts into percentages. It also maps the cluster labels to their corresponding sentiments, and calculates the average sentiment of each cluster. Both results are then displayed as a bar chart for each cluster.
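The idea of picking the most representative comments, described above for get_top_n_closest_comments_to_centers, can be sketched without NumPy or scikit-learn. The helper name `closest_to_centre`, the 2-D vectors, and the texts below are all made up for illustration; the real function works on the full sentence embeddings and the fitted cluster centers.

```python
import math


def closest_to_centre(vectors, texts, centre, n):
    """Return the n texts whose vectors lie nearest to centre (Euclidean)."""
    ranked = sorted(zip(vectors, texts), key=lambda pair: math.dist(pair[0], centre))
    return [text for _, text in ranked[:n]]


# Hypothetical 2-D stand-ins for embeddings of four comments in one cluster.
toy_vectors = [[0.0, 0.0], [1.0, 1.0], [0.1, 0.0], [3.0, 3.0]]
toy_texts = ["Great food.", "Food was okay.", "Fantastic food!", "Parking was hard."]
top_two = closest_to_centre(toy_vectors, toy_texts, centre=[0.0, 0.0], n=2)
print(top_two)
```

Comments near the center tend to be short and generic, which is one reason the printed top-10 lists below read like distilled versions of each theme, while more specific comments sit further out.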
def perform_and_plot_kmeans(original_data, pca_data, k):
    # Cluster in the full embedding space; the PCA data is used only for plotting.
    kmeans = KMeans(n_clusters=k, random_state=42)
    cluster_labels = kmeans.fit_predict(original_data)

    plt.figure(figsize=(8, 6))
    scatter = plt.scatter(pca_data[:, 0], pca_data[:, 1], c=cluster_labels, cmap='viridis')
    plt.title(f'2D PCA Projection of Sentence Embeddings with K-Means Clustering (k={k})')
    plt.xlabel('Principal Component 1')
    plt.ylabel('Principal Component 2')
    handles, labels = scatter.legend_elements()
    plt.legend(handles, labels, title="Clusters", loc='upper right')
    plt.grid(True)
    plt.show()
    return kmeans, cluster_labels

def get_top_n_closest_comments_to_centers(embeddings, cluster_labels, comments, kmeans, n_comments):
    cluster_centers = kmeans.cluster_centers_
    all_sorted_comments = [None] * len(cluster_centers)
    for cluster in range(len(cluster_centers)):
        cluster_indices = np.where(cluster_labels == cluster)[0]
        cluster_embeddings = embeddings[cluster_indices]
        distances = euclidean_distances(cluster_embeddings, cluster_centers[cluster].reshape(1, -1)).flatten()
        sorted_indices = np.argsort(distances)
        sorted_comments = [comments[cluster_indices[idx]] for idx in sorted_indices]
        all_sorted_comments[cluster] = sorted_comments
        print(f"Top {n_comments} closest comments to Cluster {cluster + 1} center:")
        for idx in range(min(n_comments, len(sorted_comments))):
            print(f"- {sorted_comments[idx]}")
        print("")
    return all_sorted_comments

def summarize_comments(comments, cluster_labels, min_length = 1, max_length = 10):
    # Load the tokenizer, model, and pipeline once rather than once per cluster.
    model_name = "t5-base"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
    summarizer = pipeline("summarization", model=model, tokenizer=tokenizer)

    # Sort the labels so that summarized_comments[i] belongs to cluster i.
    clusters = sorted(set(cluster_labels))
    summarized_comments = []
    for cluster in clusters:
        # Keep only the 2% of comments closest to the cluster center.
        selected_comments = comments[cluster]
        selected_comments = selected_comments[:len(selected_comments) // 50]
        input_text = " ".join(selected_comments)
        # Truncate to 512 tokens, then decode back to text for the pipeline.
        encoded_input = tokenizer(input_text, return_tensors="pt", truncation=True, padding=True, max_length=512)
        input_text = tokenizer.decode(encoded_input["input_ids"][0], skip_special_tokens=True)
        summary = summarizer(input_text, max_length=max_length, min_length=min_length, do_sample=False)
        summarized_comments.append(summary[0]['summary_text'])
    return summarized_comments

def plot_and_print_comments(embeddings, embeddings_2d, comments, k_clusters, n_comments = 10):
    kmeans, cluster_labels = perform_and_plot_kmeans(embeddings, embeddings_2d, k_clusters)
    sorted_comments = get_top_n_closest_comments_to_centers(embeddings, cluster_labels, comments, kmeans, n_comments)
    summarized_comments = summarize_comments(sorted_comments, cluster_labels)
    return kmeans, cluster_labels, sorted_comments, summarized_comments

def plot_cluster_bar_chart(summarized_comments, cluster_labels, sentiments):
    cluster_names = [summarized_comments[cluster] for cluster in range(len(summarized_comments))]
    cluster_counts = Counter(cluster_labels)
    clusters = list(range(len(cluster_counts)))
    counts = [cluster_counts[cluster] for cluster in clusters]
    total_count = sum(counts)
    percentages = [(count / total_count) * 100 for count in counts]

    cluster_sentiments = {cluster: [] for cluster in clusters}
    for label, sentiment in zip(cluster_labels, sentiments):
        cluster_sentiments[label].append(sentiment)
    avg_sentiments = [sum(cluster_sentiments[cluster]) / len(cluster_sentiments[cluster]) for cluster in clusters]
    avg_sentiments = [sentiment * 100 for sentiment in avg_sentiments]

    x = np.arange(len(clusters))
    width = 0.4
    plt.figure(figsize=(12, 6))
    plt.bar(x - width/2, percentages, width, label='Percentage of Comments', color='skyblue')
    plt.bar(x + width/2, avg_sentiments, width, label='Average Sentiment', color='orange')
    plt.xlabel('Cluster')
    plt.ylabel('Values')
    plt.title('Cluster Analysis: Percentage and Average Sentiment')
    plt.xticks(x, cluster_names, rotation=30)
    plt.legend()
    for i, (percentage, avg_sentiment) in enumerate(zip(percentages, avg_sentiments)):
        plt.text(x[i] - width/2, percentage + 0.5, f'{percentage:.2f}%', ha='center', va='bottom', fontsize=10)
        plt.text(x[i] + width/2, avg_sentiment + 0.05, f'{avg_sentiment:.2f}%', ha='center', va='bottom', fontsize=10)
    plt.tight_layout()
    plt.show()
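The two quantities that the bar chart plots can be computed on their own, which is handy for checking the numbers without drawing anything. The sketch below is a simplified stand-in for the aggregation inside plot_cluster_bar_chart: the helper name `cluster_summary` and the toy labels and scores are invented, and the mean sentiment is left in its native -1 to +1 range rather than multiplied by 100 as in the chart.

```python
from collections import Counter, defaultdict


def cluster_summary(labels, sentiments):
    """Return {cluster: (share_of_comments_percent, mean_sentiment)}."""
    counts = Counter(labels)
    totals = defaultdict(float)
    for label, score in zip(labels, sentiments):
        totals[label] += score
    n = len(labels)
    return {
        cluster: (100.0 * counts[cluster] / n, totals[cluster] / counts[cluster])
        for cluster in sorted(counts)
    }


# Made-up cluster labels and compound scores for five comments.
toy_cluster_labels = [0, 0, 1, 1, 1]
toy_sentiments = [0.5, -0.5, 0.8, 0.6, 0.4]
summary = cluster_summary(toy_cluster_labels, toy_sentiments)
for cluster, (share, mean_score) in summary.items():
    print(f"cluster {cluster}: {share:.1f}% of comments, mean sentiment {mean_score:.2f}")
```

Cluster 0 here averages to zero because one positive and one negative comment cancel out, which is the same effect that gives mixed clusters middling sentiment scores in the analysis that follows.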
Analysis on k = 2 Clusters¶
kmeans_2, cluster_labels_2, sorted_comments_2, summarized_comments_2 = plot_and_print_comments(embeddings, embeddings_2d, comments, k_clusters=2)
plot_cluster_bar_chart(summarized_comments_2, cluster_labels_2, sentiments)
Top 10 closest comments to Cluster 1 center:
- As for the service, I thought it was good.
- Terrible service!
- Service was fantastic.
- And service was super friendly.
- this place is good.
- This place is great!!!!!!!!!!!!!!
- The service was poor and thats being nice.
- This place is amazing!
- Cant say enough good things about this place.
- Awful service.

Top 10 closest comments to Cluster 2 center:
- The food, amazing.
- This is an Outstanding little restaurant with some of the Best Food I have ever tasted.
- Great service and food.
- Food was great and so was the serivce!
- Now this dish was quite flavourful.
- Service was exceptional and food was a good as all the reviews.
- Fantastic food!
- Great food.
- Everything was good and tasty!
- The food was excellent and service was very good.
Cluster 1: Service Quality Focused¶
Comments in this cluster revolve around the quality of service, with a mix of both positive and negative sentiments. Many individuals express high satisfaction with the service, describing it as "fantastic," "outstanding," and appreciating the warm, personal atmosphere. These positive experiences are coupled with a sense of being treated as special guests. However, some customers express disappointment, even after returning, suggesting that their expectations were not met despite the good service. There are also remarks indicating that while the service was good, other aspects of the experience, such as the overall ambiance or food, did not live up to expectations.
Cluster 2: Food Quality Focused¶
In this cluster, the comments are overwhelmingly positive and focused primarily on the quality of the food. Words like "amazing," "delicious," and "outstanding" are frequently used to describe the meals, with several customers mentioning that the food surpassed their expectations. A few comments also mention that the service was good, but the food takes center stage in these reviews. One comment highlights how the restaurant accommodated a vegetarian guest, further emphasizing the thoughtfulness in catering to diverse dietary preferences. The general consensus in this cluster is that the food is the standout feature of the dining experience.
Analysis on k = 3 Clusters¶
kmeans_3, cluster_labels_3, sorted_comments_3, summarized_comments_3 = plot_and_print_comments(embeddings, embeddings_2d, comments, k_clusters=3)
plot_cluster_bar_chart(summarized_comments_3, cluster_labels_3, sentiments)
Top 10 closest comments to Cluster 1 center:
- As for the service, I thought it was good.
- Terrible service!
- I can't wait to go back.
- Won't go back.
- We won't be going back.
- Service sucks.
- Very poor service.
- I wouldn't return.
- Service stinks here!
- We won't be going back anytime soon!

Top 10 closest comments to Cluster 2 center:
- The food, amazing.
- Now this dish was quite flavourful.
- Food was great and so was the serivce!
- Great service and food.
- This is an Outstanding little restaurant with some of the Best Food I have ever tasted.
- Fantastic food!
- Great food.
- Service was exceptional and food was a good as all the reviews.
- Everything was good and tasty!
- Food was delicious!

Top 10 closest comments to Cluster 3 center:
- This place is amazing!
- This place is pretty good, nice little vibe in the restaurant.
- Great place fo take out or eat in.
- This place is great!!!!!!!!!!!!!!
- This wonderful experience made this place a must-stop whenever we are in town again.
- this place is good.
- I would not recommend this place.
- This is a GREAT place to eat!
- Wow... Loved this place.
- Pretty awesome place.
Cluster 1: Service Quality Focused¶
The comments in this cluster highlight a mixed view of service, with some customers expressing deep dissatisfaction. Several remarks focus on poor service, such as "terrible service" and "slow," while others describe their experiences as "meh" or "sub-par." A few individuals were disappointed enough by the service to decide not to return, citing a lack of appreciation or effort to make them feel valued. However, a few comments note that while the service may not have been exceptional, the company or other aspects of the experience were enjoyable.
Cluster 2: Overall Positive Experience¶
This cluster features mostly positive comments regarding both the food and service. Customers frequently mention having a pleasant experience, with particular emphasis on the great food and friendly, attentive service. Comments such as "outstanding little restaurant" and "great food and great service" showcase how the restaurant's ambiance, food, and service create an enjoyable overall experience. Some reviewers also mention a relaxing atmosphere and recommend the place for both dining in and takeout.
Cluster 3: Food Focused¶
Cluster 3 primarily focuses on the food, with a clear split between highly positive and critical comments. Many reviewers describe the food as "delicious," "amazing," or "great," highlighting a consistently satisfying culinary experience. However, there are a few negative remarks, including one that describes the food as the "worst version" of certain dishes they have ever had. Despite these occasional criticisms, the majority of feedback is centered on the quality and flavor of the food.
Analysis on k = 4 Clusters¶
kmeans_4, cluster_labels_4, sorted_comments_4, summarized_comments_4 = plot_and_print_comments(embeddings, embeddings_2d, comments, k_clusters=4)
plot_cluster_bar_chart(summarized_comments_4, cluster_labels_4, sentiments)
Top 10 closest comments to Cluster 1 center:
- Service is quick and friendly.
- And service was super friendly.
- Service was fine and the waitress was friendly.
- As for the service, I thought it was good.
- Waitress was a little slow in service.
- The service was poor and thats being nice.
- But the service was beyond bad.
- Terrible service!
- Very poor service.
- Service was good and the company was better!

Top 10 closest comments to Cluster 2 center:
- Now this dish was quite flavourful.
- Everything was good and tasty!
- Everything was fresh and delicious!
- From what my dinner companions told me...everything was very fresh with nice texture and taste.
- Extremely Tasty!
- It lacked flavor, seemed undercooked, and dry.
- Not much flavor to them, and very poorly constructed.
- To my disbelief, each dish qualified as the worst version of these foods I have ever tasted.
- It was extremely "crumby" and pretty tasteless.
- It's too bad the food is so damn generic.

Top 10 closest comments to Cluster 3 center:
- I can't wait to go back.
- Won't ever go here again.
- this place is good.
- We loved the place.
- This place is amazing!
- This place is great!!!!!!!!!!!!!!
- Wow... Loved this place.
- This place has it!
- Cant say enough good things about this place.
- I will never go back to this place and will never ever recommended this place to anyone!

Top 10 closest comments to Cluster 4 center:
- The food was excellent and service was very good.
- Great service and food.
- Service was exceptional and food was a good as all the reviews.
- Great food and great service in a clean and friendly setting.
- This is an Outstanding little restaurant with some of the Best Food I have ever tasted.
- Good food , good service .
- Food was great and so was the serivce!
- Great food and awesome service!
- The food, amazing.
- Phenomenal food, service and ambiance.
Cluster 1: Food Quality and Taste: Positive and Negative Feedback¶
This cluster is focused on the quality and taste of the food. Comments here show a stark contrast in opinions about the food, ranging from strong positive feedback (e.g., "The food was very good," "Food was delicious!") to negative reviews (e.g., "The food wasn't good," "It lacked flavor, seemed undercooked, and dry").
Cluster 2: Service Experience: Praise and Criticism¶
The comments in this cluster primarily focus on the service experience at the restaurant. Many of the comments praise the service, describing it as fantastic, friendly, and attentive (e.g., "Service was fantastic," "The service was great, even the manager came and helped"). However, there is also a dissenting opinion, where the service is criticized as "terrible" (e.g., "Terrible service!").
Cluster 3: Overall Dining Experience: High Satisfaction with Food and Service¶
This cluster contains comments expressing overall satisfaction with both the food and the service, creating a positive impression of the restaurant. Many reviewers praise the combination of high-quality food and great service (e.g., "Great food and great service," "Everything on the menu is terrific"). There is a noticeable appreciation for the overall dining experience.
Cluster 4: Likelihood of Returning: Mixed Feelings About Future Visits¶
This cluster contains comments that reflect reviewers' intentions or thoughts on whether they would return to the restaurant. There is a clear split between those expressing strong intentions not to return (e.g., "I probably won't be coming back here," "Won't ever go here again") and others expressing a desire to return in the future (e.g., "I'd love to go back," "Definitely will come back here again").
Analysis on k = 5 Clusters¶
kmeans_5, cluster_labels_5, sorted_comments_5, summarized_comments_5 = plot_and_print_comments(embeddings, embeddings_2d, comments, k_clusters=5)
plot_cluster_bar_chart(summarized_comments_5, cluster_labels_5, sentiments)
Top 10 closest comments to Cluster 1 center:
- We got sitting fairly fast, but, ended up waiting 40 minutes just to place our order, another 30 minutes before the food arrived.
- At least 40min passed in between us ordering and the food arriving, and it wasn't that busy.
- This is was due to the fact that it took 20 minutes to be acknowledged, then another 35 minutes to get our food...and they kept forgetting things.
- -Drinks took close to 30 minutes to come out at one point.
- We sat another ten minutes and finally gave up and left.
- We literally sat there for 20 minutes with no one asking to take our order.
- The real disappointment was our waiter.
- I kept looking at the time and it had soon become 35 minutes, yet still no food.
- Similarly, the delivery man did not say a word of apology when our food was 45 minutes late.
- I also decided not to send it back because our waitress looked like she was on the verge of having a heart attack.

Top 10 closest comments to Cluster 2 center:
- Now this dish was quite flavourful.
- Everything was good and tasty!
- Everything was fresh and delicious!
- Extremely Tasty!
- From what my dinner companions told me...everything was very fresh with nice texture and taste.
- Not much flavor to them, and very poorly constructed.
- It lacked flavor, seemed undercooked, and dry.
- To my disbelief, each dish qualified as the worst version of these foods I have ever tasted.
- It was extremely "crumby" and pretty tasteless.
- It's too bad the food is so damn generic.

Top 10 closest comments to Cluster 3 center:
- This place is amazing!
- this place is good.
- This place is great!!!!!!!!!!!!!!
- I will never go back to this place and will never ever recommended this place to anyone!
- Wow... Loved this place.
- We loved the place.
- Cant say enough good things about this place.
- I can't wait to go back.
- Won't ever go here again.
- This place has it!

Top 10 closest comments to Cluster 4 center:
- The food was excellent and service was very good.
- Great service and food.
- Service was exceptional and food was a good as all the reviews.
- This is an Outstanding little restaurant with some of the Best Food I have ever tasted.
- Great food and great service in a clean and friendly setting.
- Good food , good service .
- Food was great and so was the serivce!
- The food, amazing.
- Great food and awesome service!
- Phenomenal food, service and ambiance.

Top 10 closest comments to Cluster 5 center:
- Service is quick and friendly.
- And service was super friendly.
- As for the service, I thought it was good.
- The service was poor and thats being nice.
- But the service was beyond bad.
- Service was fantastic.
- Service was fine and the waitress was friendly.
- Terrible service!
- Service was good and the company was better!
- Very poor service.
Cluster 1: Mixed Service Experience¶
This cluster reflects a wide range of opinions about the service, with some customers praising it as "fantastic" and "good," while others report poor or even "awful" service. The comments reveal inconsistencies, with some describing the service as "meh" or "terrible," with particular complaints about waitstaff making customers feel uncomfortable. There is a noticeable contrast between friendly service and unsatisfactory interactions, indicating that service quality varied significantly for different customers.
Cluster 2: Positive Experience with Food, Service, and Atmosphere¶
Cluster 2 is characterized by positive feedback across the board, with many customers complimenting both the food and service. Customers appreciate the "great food and great service" in a "clean and friendly setting." The restaurant's atmosphere is described as a good place to relax and enjoy food, particularly mentioning unique offerings like burgers and beers. Several comments mention the overall "great place" to eat and recommend it highly, with a focus on friendly and fast service.
Cluster 3: Exceptional Food Quality and Consistency¶
Cluster 3 centers around the high quality of the food, with comments praising its taste, generosity of portions, and consistency. Customers consistently describe the food as "delicious," "amazing," and "terrific." There is also a focus on accommodating special dietary needs, like vegetarian options. Many customers express satisfaction with both the quality and value of the food, mentioning that the food is high quality, house-made, and offered at a great price.
Cluster 4: Mixed Food Experience¶
Cluster 4 presents a mix of positive and negative remarks about the food. Some customers rave about the "delicious" and "amazing" food, while others report poor experiences, including "terrible" or "stale" food. Negative comments focus on the food lacking flavor or being undercooked, while positive comments highlight fresh ingredients and good taste. There is also mention of the restaurant's cleanliness impacting the overall experience.
Cluster 5: Mixed Intentions About Returning¶
This cluster captures a mix of customers who have differing feelings about returning. While some express strong reluctance to return, using phrases like "I won't be back" or "definitely will not come back," others express enthusiasm about coming back, with statements such as "I can't wait to go back" or "I'd love to come back." This reflects an overall mixed sentiment, where experiences are polarized but some customers still have a positive desire to return.
Qualitative Analysis Conclusion¶
Throughout our exploration of different clustering approaches, we observed how adjusting the number of clusters significantly impacted the distribution of content and the focus of the comments within each cluster. With fewer clusters, the content tended to be more generalized, with overlapping themes between food quality, service, and overall experience, making it challenging to pinpoint specific aspects that customers were reacting to.
As we increased the number of clusters, we saw a clearer separation of themes, with distinct clusters forming around specific aspects of the dining experience, such as food quality, service quality, and overall satisfaction. This allowed us to capture more granular insights into customers' feedback. For example, a larger number of clusters might separate food-related comments from service-related ones, offering clearer insights into what specifically drives satisfaction or dissatisfaction.
However, as the number of clusters increased further, some of the clusters became overly specific. This over-segmentation sometimes limited the interpretability of the results, as certain clusters contained only a few comments with very similar phrasing or sentiment. These highly specific clusters, while potentially capturing nuanced opinions, often lacked sufficient diversity to be fully actionable.
In terms of sentiment, we observed a consistent pattern across different clusterings. Generally, mixed clusters exhibited middling sentiment scores, reflecting the presence of both positive and negative comments within the same group. Poor clusters, where dissatisfaction was the dominant theme, displayed lower sentiment scores. On the other hand, positive clusters, containing mostly favorable feedback, had high sentiment scores. These findings suggest that sentiment scores correlate with the overall tone of the comments in each cluster, with positive feedback skewing toward higher scores, while negative feedback drives lower sentiment scores.
This exploration suggests that while increasing the number of clusters can provide more detailed insights, there is a balance to be struck. Too many clusters can lead to overly fragmented categories that might not offer significant value in terms of actionable insights. The optimal number of clusters is crucial, as it needs to capture meaningful patterns without compromising the interpretability of the results.
Optimal Number of Clusters - Quantitative Analysis¶
In our initial exploration, we analyzed clusters of comments empirically, examining each cluster's characteristics and relationships to intuitively determine a suitable range for the optimal number of clusters. This qualitative approach allowed us to identify patterns and gain insights into the natural groupings present in the data. However, as we move forward, we are shifting toward more mathematical methods for determining the optimal number of clusters. These methods, such as the elbow method, silhouette scores, gap statistics, and cluster evaluation indices, provide a more systematic and objective way to assess the quality of clustering results. By combining both empirical analysis and mathematical rigor, we can refine our clustering approach and ensure that the chosen number of clusters is not only interpretable but also supported by quantitative evidence.
Below we write one plotting function per method. We use the following methods to determine the best number of clusters:
1. Elbow Method¶
- How it Works: The elbow method involves plotting the inertia (sum of squared distances between points and their centroids) for different values of k (number of clusters). Inertia typically decreases as k increases, but after a certain point, the rate of decrease slows. The "elbow" point is where the curve flattens, indicating the optimal number of clusters.
- Metric: Inertia (lower is better).
- What We're Looking For: The optimal k is identified where the decrease in inertia slows down significantly, forming an "elbow."
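To make the inertia metric concrete, here is a minimal sketch that computes it by hand with NumPy. The helper name `inertia` and the toy points are assumptions for illustration only; they are not part of the notebook. For a converged k-means fit, where each centroid is the mean of its cluster, this should agree with the `inertia_` attribute up to numerical tolerance.

```python
import numpy as np

def inertia(points, labels):
    """Sum of squared distances from each point to the mean of its cluster."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    total = 0.0
    for cluster_id in np.unique(labels):
        members = points[labels == cluster_id]
        centroid = members.mean(axis=0)  # cluster mean acts as the centroid
        total += float(((members - centroid) ** 2).sum())
    return total

# Toy data (hypothetical): two tight pairs of points far apart from each other
toy_points = [[0, 0], [0, 2], [10, 0], [10, 2]]
one_cluster = inertia(toy_points, [0, 0, 0, 0])   # everything in one group
two_clusters = inertia(toy_points, [0, 0, 1, 1])  # one group per pair
```

Splitting the toy points into their two natural groups drops the inertia sharply; splitting further would lower it only a little more, which is the flattening the elbow plot is meant to reveal.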
2. Silhouette Score¶
- How it Works: The silhouette score measures how similar each point is to its own cluster compared to other clusters. It ranges from -1 to 1, where a higher score indicates better-defined clusters. The optimal k is typically the one that maximizes the average silhouette score.
- Metric: Silhouette score (higher is better).
- What We're Looking For: The optimal k corresponds to the maximum silhouette score, indicating that points are well-clustered and separated from other clusters.
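For reference, the standard per-point definition can be written as follows. The symbols $a(i)$ and $b(i)$ are notation introduced here, not taken from the notebook: $a(i)$ is the mean distance from point $i$ to the other points in its own cluster, and $b(i)$ is the smallest mean distance from point $i$ to the points of any other cluster.

$$
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}, \qquad \text{silhouette score} = \frac{1}{n}\sum_{i=1}^{n} s(i)
$$

A value of $s(i)$ near 1 means the point sits well inside its cluster, a value near 0 means it lies between two clusters, and a negative value suggests it may have been assigned to the wrong cluster.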
3. Gap Statistic¶
- How it Works: The gap statistic compares the sum of squared errors (inertia) of the observed data with a random reference distribution. A larger gap indicates a better clustering structure. The optimal k is the one that maximizes the gap statistic.
- Metric: Gap statistic (higher is better).
- What We're Looking For: The optimal k corresponds to the largest gap between the observed data's inertia and the reference inertia.
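In the formulation by Tibshirani et al., the comparison is made on a log scale. Writing $W_k$ for the within-cluster dispersion of the observed data (the inertia, for k-means) and $W^{*}_{kb}$ for the same quantity on the $b$-th of $B$ reference datasets, the estimate is:

$$
\mathrm{Gap}(k) = \frac{1}{B}\sum_{b=1}^{B} \log W^{*}_{kb} \;-\; \log W_k
$$

The symbols here are notation introduced for this note. Taking the k with the largest gap is a common simplification; the original rule is slightly more conservative and picks the smallest $k$ such that $\mathrm{Gap}(k) \ge \mathrm{Gap}(k+1) - s_{k+1}$, where $s_{k+1}$ is a standard-error term estimated from the reference datasets.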
4. Davies-Bouldin Index¶
- How it Works: The Davies-Bouldin Index measures the average similarity between each cluster and the cluster most similar to it. It is the ratio of within-cluster scatter to between-cluster separation. A lower Davies-Bouldin score indicates better clustering.
- Metric: Davies-Bouldin score (lower is better).
- What We're Looking For: The optimal k minimizes the Davies-Bouldin index, meaning clusters are well-separated and compact.
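Written out, with notation introduced here for illustration: let $s_i$ be the average distance from the points of cluster $i$ to its centroid, and $d_{ij}$ the distance between the centroids of clusters $i$ and $j$. The index over $k$ clusters is then:

$$
\mathrm{DB} = \frac{1}{k}\sum_{i=1}^{k} \max_{j \neq i} \frac{s_i + s_j}{d_{ij}}
$$

Each cluster contributes its worst-case ratio against its most similar neighbor, so a single pair of overlapping clusters is enough to raise the score.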
5. Calinski-Harabasz Index (Variance Ratio Criterion)¶
- How it Works: The Calinski-Harabasz Index evaluates the ratio of between-cluster variance to within-cluster variance. A higher score indicates better clustering. It’s useful when clusters are well-separated.
- Metric: Calinski-Harabasz index (higher is better).
- What We're Looking For: The optimal k maximizes the Calinski-Harabasz index, which indicates well-separated clusters with low intra-cluster variance.
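As a formula, using notation introduced here: for $n$ points in $k$ clusters, let $c_q$ and $n_q$ be the centroid and size of cluster $C_q$, and $c$ the centroid of the whole dataset.

$$
\mathrm{CH} = \frac{\mathrm{BSS}/(k-1)}{\mathrm{WSS}/(n-k)}, \qquad
\mathrm{BSS} = \sum_{q=1}^{k} n_q \,\lVert c_q - c \rVert^2, \qquad
\mathrm{WSS} = \sum_{q=1}^{k} \sum_{x \in C_q} \lVert x - c_q \rVert^2
$$

The within-cluster term $\mathrm{WSS}$ is the same quantity as the k-means inertia, so this index rewards a low inertia only when the centroids are also spread far from the overall mean.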
These methods can be used in combination to validate the choice of k and ensure robust clustering.
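One simple way to combine the metrics is to record each metric's preferred k and see where they agree. The sketch below is only an illustration: `best_k_per_metric` is a hypothetical helper, and the scores are placeholder numbers, not results from this dataset. The elbow method is left out because it has no single best value to pick automatically.

```python
def best_k_per_metric(scores_by_metric, higher_is_better):
    """Return the preferred k for each metric, given {metric: {k: score}}."""
    best = {}
    for metric, scores in scores_by_metric.items():
        # Maximize or minimize depending on the metric's direction
        pick = max if higher_is_better[metric] else min
        best[metric] = pick(scores, key=scores.get)
    return best

# Placeholder scores for illustration only (NOT computed from the comments)
placeholder_scores = {
    "silhouette": {2: 0.10, 3: 0.14, 4: 0.12},
    "davies_bouldin": {2: 2.9, 3: 2.5, 4: 2.7},
    "calinski_harabasz": {2: 40.0, 3: 38.0, 4: 35.0},
}
direction = {
    "silhouette": True,          # higher is better
    "davies_bouldin": False,     # lower is better
    "calinski_harabasz": True,   # higher is better
}
picks = best_k_per_metric(placeholder_scores, direction)
```

When the metrics disagree, as they do in these placeholder numbers, the disagreement is itself informative and is a cue to fall back on the qualitative inspection of the clusters.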
def plot_elbow_curve(data, max_k=10):
    inertia = []
    for k in range(1, max_k+1):
        kmeans = KMeans(n_clusters=k, random_state=42)
        kmeans.fit(data)
        inertia.append(kmeans.inertia_)
    plt.figure(figsize=(8, 6))
    plt.plot(range(1, max_k+1), inertia, marker='o', color='b', linestyle='--')
    plt.title('Elbow Curve for K-Means Clustering')
    plt.xlabel('Number of Clusters (k)')
    plt.ylabel('Inertia')
    plt.xticks(range(1, max_k+1))
    plt.grid(True)
    plt.show()

def plot_silhouette_scores(data, max_k=10):
    silhouette_scores = []
    for k in range(2, max_k+1):
        kmeans = KMeans(n_clusters=k, random_state=42)
        cluster_labels = kmeans.fit_predict(data)
        score = silhouette_score(data, cluster_labels)
        silhouette_scores.append(score)
    plt.figure(figsize=(8, 6))
    plt.plot(range(2, max_k+1), silhouette_scores, marker='o', color='b', linestyle='--')
    plt.title('Silhouette Scores for K-Means Clustering')
    plt.xlabel('Number of Clusters (k)')
    plt.ylabel('Silhouette Score')
    plt.xticks(range(2, max_k+1))
    plt.grid(True)
    plt.show()

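def gap_statistic(X, max_k=10, n_refs=5, random_state=42):
    # Gap(k) = mean(log(reference inertia)) - log(observed inertia).
    # References are drawn uniformly from the bounding box of X.
    X = np.asarray(X)
    rng = np.random.default_rng(random_state)
    mins, maxs = X.min(axis=0), X.max(axis=0)
    gaps = np.zeros(max_k - 1)
    for k in range(1, max_k):
        kmeans = KMeans(n_clusters=k, random_state=random_state)
        kmeans.fit(X)
        log_inertia = np.log(kmeans.inertia_)
        # Fit the same model on random reference data with no cluster structure
        ref_log_inertias = []
        for _ in range(n_refs):
            reference = rng.uniform(mins, maxs, size=X.shape)
            ref_kmeans = KMeans(n_clusters=k, random_state=random_state)
            ref_kmeans.fit(reference)
            ref_log_inertias.append(np.log(ref_kmeans.inertia_))
        gaps[k - 1] = np.mean(ref_log_inertias) - log_inertia
    plt.figure(figsize=(8, 6))
    plt.plot(range(1, max_k), gaps, marker='o')
    plt.title('Gap Statistic for Optimal Clusters')
    plt.xlabel('Number of Clusters')
    plt.ylabel('Gap Statistic')
    plt.xticks(range(1, max_k))
    plt.grid(True)
    plt.show()
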
def plot_davies_bouldin(data, max_k=10):
    davies_bouldin_scores = []
    for k in range(2, max_k + 1):
        kmeans = KMeans(n_clusters=k, random_state=42)
        cluster_labels = kmeans.fit_predict(data)
        db_score = davies_bouldin_score(data, cluster_labels)
        davies_bouldin_scores.append(db_score)
    plt.plot(range(2, max_k + 1), davies_bouldin_scores, marker='o')
    plt.title('Davies-Bouldin Index for Optimal Clusters')
    plt.xlabel('Number of Clusters')
    plt.ylabel('Davies-Bouldin Index')
    plt.grid(True)
    plt.show()

def plot_calinski_harabasz(data, max_k=10):
    calinski_harabasz_scores = []
    for k in range(2, max_k + 1):
        kmeans = KMeans(n_clusters=k, random_state=42)
        cluster_labels = kmeans.fit_predict(data)
        ch_score = calinski_harabasz_score(data, cluster_labels)
        calinski_harabasz_scores.append(ch_score)
    plt.plot(range(2, max_k + 1), calinski_harabasz_scores, marker='o')
    plt.title('Calinski-Harabasz Index for Optimal Clusters')
    plt.xlabel('Number of Clusters')
    plt.ylabel('Calinski-Harabasz Index')
    plt.grid(True)
    plt.show()

plot_elbow_curve(embeddings)
plot_silhouette_scores(embeddings)
gap_statistic(embeddings)
plot_davies_bouldin(embeddings)
plot_calinski_harabasz(embeddings)
The graphs generated by these methods do not agree with one another: some recommend too few clusters, others too many, and none gives a clear, concrete answer for the optimal number of clusters. Several factors may explain this lack of clarity.

First, the data itself might be more complex or noisy than anticipated, making it difficult to find well-defined clusters. The relationships between data points may not fit the traditional assumptions of clustering algorithms, such as spherical or equally sized clusters. These methods also rely on metrics or statistical thresholds that may not align with the underlying structure of the data, especially for text or subjective comments, where patterns are less explicit.

Second, textual data is particularly challenging to cluster. Text is highly variable, and ambiguity, synonyms, polysemy (words with multiple meanings), and subjective opinions all make it harder to identify distinct groups. The informal, diverse language found in comments adds further noise and complicates the clustering process.

Lastly, the range of k examined may be too broad for some methods, causing the metrics to flatten out or fluctuate unpredictably.

Together, these points suggest that more refinement, or even alternative approaches, might be needed to capture the true structure of the data. The graphs serve as a starting point, but further investigation and tuning will be necessary to find a reliable and meaningful number of clusters.