Task1.0
In this study, I apply two topic modelling techniques, Latent Dirichlet Allocation (LDA) and Non-negative Matrix Factorization (NMF) from the scikit-learn package, to extract topics from the review texts. Topic modelling identifies the principal themes and subjects that people discuss in their reviews. Once the topics have been extracted, they are visualised to give a clearer picture of the common themes and opinions expressed by the reviewers.
Observations with LDA: the analysis was run with ten topics. The ten topics are distinct and moderately separated from one another. pyLDAvis is a helpful tool for understanding, and explaining to clients, the nuances of the topics produced by topic modelling. Topic 1 is primarily about food and the places where it is eaten rather than a specific cuisine. Topic 2 centres on the restaurant's location, although there is no clear way to pin this topic down. Topics 3, 6 and 9 are clearly related to Mexican cuisine, with words such as salsa, Mexican and taco, while Topic 4 points more towards Indian cuisine.
Observations with NMF: the topics are spread more evenly and are of similar size. Topic 1 focuses on Mexican food, while Topic 2 concerns the location and its appearance. Topics 4 and 6 highlight shortcomings of the restaurants, and Topics 7 and 8 concern ordering and waiting times. Topic 9 covers the positive aspects of the venues, and Topic 10 covers wine and cheese, which appear to be related.



import pandas as pd
import numpy as np
import re
import string
import nltk
from nltk.tokenize import word_tokenize, RegexpTokenizer
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation
import warnings
warnings.filterwarnings('ignore', category=DeprecationWarning)
warnings.filterwarnings('ignore', category=FutureWarning)
warnings.filterwarnings('ignore', category=UserWarning)
import pyLDAvis
import pyLDAvis.lda_model
pyLDAvis.enable_notebook()
-------------------
# Number of rows to import
rows_import = 10000
# Reading JSON files
reviews_text = pd.read_json("../data/01_raw/yelp_academic_dataset_review.json", lines=True, nrows=rows_import)
users = pd.read_json("../data/01_raw/yelp_academic_dataset_user.json", lines=True, nrows=rows_import)
business = pd.read_json("../data/01_raw/yelp_academic_dataset_business.json", lines=True, nrows=rows_import)
# 'categories' can be None in the Yelp data, so guard against it before the membership test
business = business[business['categories'].apply(lambda x: x is not None and ('Restaurants' in x or 'Food' in x))]
# reviews_text.columns
reviews_business = reviews_text.merge(business, on="business_id")
# reviews_business.columns
reviews_business = reviews_business[["text", "votes", "type_y", "name", "stars_x"]]
reviews_business.shape
---------------------
# stemmer = SnowballStemmer('english')
stemmer = PorterStemmer()
# stemmer = LancasterStemmer()
# stemmer = RegexpStemmer('ing$|s$|e$|able$', min=1)
# lemmatizer = WordNetLemmatizer()

def text_reinigen(text):
    """Lowercase, tokenize, drop stop words and non-alphabetic tokens, then stem."""
    stop_words = set(stopwords.words('english'))
    tokens = word_tokenize(text.lower())
    tokens = [w for w in tokens if w.isalpha() and w not in stop_words]
    tokens = [stemmer.stem(w) for w in tokens]
    # tokens = [lemmatizer.lemmatize(w) for w in tokens]
    return ' '.join(tokens)
reviews_business = reviews_business.astype('str').map(str.lower)  # DataFrame.map needs pandas >= 2.1; use applymap on older versions
reviews_business['text'] = reviews_business['text'].str.strip() # Strip whitespace
reviews_business['text'] = reviews_business['text'].str.replace(r'[^a-zA-Z0-9 ]+', '', regex=True) # Remove non-alphanumeric characters
# # ----- Clean the Text -----
reviews_business['text'] = reviews_business['text'].apply(text_reinigen)
print("(Rows, Columns) ", reviews_business.shape)
reviews_business.head(1)
------------------------------
tokenizer = RegexpTokenizer(r'\w+')
vect = CountVectorizer(lowercase=True,
                       stop_words='english',
                       ngram_range=(1, 1),
                       tokenizer=tokenizer.tokenize)
reviews_business_tf = vect.fit_transform(reviews_business["text"])
------------------------------
n_topics = 10
words_per_topic = 30
# Alternative: fit LDA instead of NMF
# model = LatentDirichletAllocation(n_components=n_topics, learning_method='online', random_state=42, max_iter=1)
model = NMF(n_components=n_topics, random_state=42)
model.fit(reviews_business_tf)
# pyLDAvis also accepts the NMF model here, since it only needs components_ and transform()
pyLDAvis.lda_model.prepare(model, reviews_business_tf, vect, mds='tsne')
------------------------------