This question comes up often: how large should the bag of words (the vocabulary) be for LDA or LSA? The short answer is that it depends on what you're doing. If you want to compress data from many dimensions into fewer, then 128 is usually sufficient. If you only want to use the top few words to perform classification or regression, then 32 or 64 might be enough. If neither of those answers works for you, read on!
If a bag-of-words size of 128 is sufficient for LDA, it will generally be sufficient for the downstream ML methods you feed the result into. Whether 128 is enough depends on a few related quantities (a minimal code sketch follows this list):

The number of dimensions you want to compress the data into, i.e. the number of topics or components.

The number of features, which is the number of variables used in the statistical model. In a bag-of-words representation each word is a feature, so if you only care about ten words, your bag-of-words size would be 10.

The number of classes, i.e. the number of different categories or labels you want to predict (e.g., "good" vs "bad"). If there are two classes and each class has five words associated with it (as in our example), that gives 2 * 5 = 10 total words used for modeling purposes (i.e., training).
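As a concrete starting point, here is a minimal sketch of capping the bag of words at 128 terms and fitting LDA on top of it. It assumes scikit-learn (CountVectorizer's max_features and LatentDirichletAllocation); the toy corpus and the choice of two topics are placeholders, not recommendations.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy corpus; substitute your own documents.
docs = [
    "the cat sat on the mat",
    "dogs and cats are good pets",
    "stock prices fell on bad earnings",
    "the market rallied after the earnings report",
]

# Cap the bag of words at 128 terms: only the 128 most frequent terms
# across the corpus are kept as features.
vectorizer = CountVectorizer(max_features=128)
X = vectorizer.fit_transform(docs)            # shape: (n_docs, vocab_size <= 128)

# Compress those word counts into a small number of topics.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)             # shape: (n_docs, 2)

print(X.shape, doc_topics.shape)
```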
You may still be wondering whether a bag-of-words size of 128 is enough. The number of dimensions in which we represent the data is called its "dimensionality". In general, if we have n observations and d features, we can represent them as points in a space with d coordinates. When d is large relative to n (many more variables than observations), we call the data high-dimensional; such a representation can be difficult to interpret intuitively and can lead to overfitting when training machine learning models. Compressing into a smaller number of dimensions helps with both problems.
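To make the n-versus-d picture concrete, here is a small sketch (the numbers are invented) that compresses a wide count matrix down to a handful of coordinates per document, using scikit-learn's TruncatedSVD as an LSA-style reducer.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(0)

n, d = 50, 5000                      # 50 documents, 5000 vocabulary terms (d >> n)
X = rng.poisson(0.2, size=(n, d))    # toy document-term counts

# Represent each document with 16 coordinates instead of 5000.
svd = TruncatedSVD(n_components=16, random_state=0)
X_low = svd.fit_transform(X)

print(X.shape, "->", X_low.shape)    # (50, 5000) -> (50, 16)
```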
LDA stands for Latent Dirichlet Allocation, and it is an unsupervised topic model that is often used as a form of dimensionality reduction for text. It finds patterns (topics) in large collections of documents, and the resulting topic representations can then feed downstream tasks such as anomaly detection or predicting future values.
LDA is a probabilistic model. It treats documents and topics as random variables: each document has a probability distribution over topics, and each topic has a probability distribution over the words in the vocabulary. In other words, you can think of each topic as having its own "bag" of words that it is likely to generate.
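Assuming scikit-learn again, here is a rough sketch of how to look at those distributions after fitting: components_ holds the (unnormalized) topic-word weights, and the transform output holds each document's topic mixture.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "cats and dogs make good pets",
    "my cat chased the neighbour's dog",
    "interest rates and stock prices moved today",
    "the central bank raised interest rates",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)
terms = vectorizer.get_feature_names_out()

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)     # per-document topic mixtures; rows sum to 1

# Normalize components_ rows to get each topic's distribution over words.
topic_word = lda.components_ / lda.components_.sum(axis=1, keepdims=True)

for k, dist in enumerate(topic_word):
    top = np.argsort(dist)[::-1][:5]
    print(f"topic {k}:", ", ".join(terms[top]))
```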
The simplest way to position LDA is against logistic regression (LR). LR models are used extensively in machine learning because they estimate the probability that an observation belongs to one class or another based on its features (inputs), learning one weight per feature. LDA, in contrast, does not assign a single weight to each word; it groups words into topics and describes each document by how much of each topic it contains. That makes it particularly useful for text data, where individual words are noisy and not all words carry equal importance. A common pattern is to use LDA's topic proportions as the input features for a logistic regression classifier, as sketched below.
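Here is a hedged sketch of that pattern, with a made-up four-document corpus and toy labels; the pipeline compresses word counts into two topic proportions and lets logistic regression learn one weight per topic.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

docs = [
    "great product, works as advertised",
    "terrible quality, broke after a day",
    "love it, would buy again",
    "waste of money, very disappointed",
]
labels = [1, 0, 1, 0]                 # 1 = "good", 0 = "bad" (toy labels)

# LDA compresses word counts into topic proportions; logistic regression
# then learns one weight per topic instead of one weight per word.
pipeline = make_pipeline(
    CountVectorizer(max_features=128),
    LatentDirichletAllocation(n_components=2, random_state=0),
    LogisticRegression(),
)
pipeline.fit(docs, labels)
print(pipeline.predict(["awful, do not buy"]))
```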
In this section, we will compare LDA and K-means in terms of flexibility and use cases.
K-means is a clustering algorithm that finds groups of data points that are similar to each other. It can work with any number of clusters, but it doesn't choose that number for you: you have to specify k up front, and each point is assigned to exactly one cluster. LDA is more flexible in this respect, because it assigns each document a mixture of topics rather than a single cluster, and those topic mixtures can be used for classification tasks (such as predicting whether someone will buy something) as well as for clustering-style exploration (finding similar users or documents). A small comparison is sketched below.
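The contrast is easiest to see side by side. This sketch (toy data, scikit-learn assumed) fits both models on the same document-term matrix: K-means returns one cluster id per document, while LDA returns a mixture over topics.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "cats and dogs make good pets",
    "my cat chased the neighbour's dog",
    "stock prices and interest rates fell",
    "the bank raised interest rates again",
]

X = CountVectorizer().fit_transform(docs)

# K-means: every document lands in exactly one of k clusters, and k must
# be chosen up front.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
print("hard assignments:", kmeans.fit_predict(X))

# LDA: every document gets a mixture over topics (soft assignment).
lda = LatentDirichletAllocation(n_components=2, random_state=0)
print("topic mixtures:\n", lda.fit_transform(X).round(2))
```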
So is a bag of 128 words enough? The answer is yes, but only if you have enough data points per feature. If your dataset has 10,000 observations and you keep 128 words as features, you have plenty of observations per word. The size of the corpus (the total number of words) matters less than its density: how many observations you have per feature. In other words, if you have enough data points per feature (or "dimension"), even a small bag will suffice, because it will contain enough information to make good predictions about future items in your dataset.
In summary:
You can try different sizes and see what works best for you. In general, 128 is a good choice: it's not so large that it slows your machine down, but still large enough to capture most of the important information in your data set. If you're working with a very small dataset (say, just one or two hundred distinct words), a smaller bag of 32 or 64 words might serve you better. A sketch of such a comparison follows.
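One rough way to run that comparison is to sweep a few vocabulary sizes and score held-out perplexity (lower is better). The sketch below assumes scikit-learn and uses a toy corpus purely for illustration; on a corpus this small the caps barely bite, so swap in your own documents.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.model_selection import train_test_split

def heldout_perplexity(docs, vocab_size, n_topics=2):
    """Fit LDA with a capped bag of words and score held-out perplexity."""
    train_docs, test_docs = train_test_split(docs, test_size=0.25, random_state=0)
    vectorizer = CountVectorizer(max_features=vocab_size)
    X_train = vectorizer.fit_transform(train_docs)
    X_test = vectorizer.transform(test_docs)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    lda.fit(X_train)
    return lda.perplexity(X_test)        # lower is better

# Toy corpus purely for illustration; swap in your own documents.
docs = [
    "the cats and dogs make good pets", "my cat chased the dog",
    "the stock prices fell sharply today", "the bank raised interest rates",
    "the dogs love long walks", "the interest rates affect stock prices",
    "the cat slept on the mat", "earnings reports moved the market",
]

for size in (32, 64, 128):
    print(size, round(heldout_perplexity(docs, size), 1))
```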
As long as you have enough data points per feature, a bag of this size will suffice. Keep the parameter count in mind, though: if you're working with a 1000-word vocabulary and 128 topics (a common choice), the topic-word matrix alone has 128 * 1000 = 128,000 parameters. That's quite a few!
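You can sanity-check that count after fitting: in scikit-learn the topic-word matrix is stored in components_, whose size is the number of topics times the vocabulary size. A quick sketch on random counts (max_iter=2 just to keep it fast):

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.default_rng(0)
X = rng.poisson(0.05, size=(200, 1000))   # 200 toy documents, 1000-term vocabulary

lda = LatentDirichletAllocation(n_components=128, max_iter=2, random_state=0)
lda.fit(X)

print(lda.components_.shape)   # (128, 1000)
print(lda.components_.size)    # 128 * 1000 = 128000
```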
If you're using LDA with the default settings (i.e., no tuning), the number of iterations required for convergence is determined mainly by two factors: 1) how many documents there are in total; 2) how large the vocabulary is. More documents and a larger vocabulary mean more work per pass and usually more passes before convergence occurs, which can be problematic if resources are limited by memory constraints or by computer speed. Most implementations let you cap the number of passes (for example, the max_iter parameter in scikit-learn).
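If you are on scikit-learn, here is a minimal sketch of keeping that under control: cap the passes with max_iter, check the convergence criterion periodically with evaluate_every, and read back how many passes were actually used from n_iter_.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "cats and dogs make good pets", "my cat chased the dog",
    "stock prices fell sharply today", "the bank raised interest rates",
]
X = CountVectorizer().fit_transform(docs)

lda = LatentDirichletAllocation(
    n_components=2,
    max_iter=50,        # upper bound on passes over the corpus
    evaluate_every=5,   # check the convergence criterion every 5 passes
    random_state=0,
)
lda.fit(X)

# How many passes were actually run before the stopping criterion was hit.
print(lda.n_iter_)
```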
We hope we've convinced you that 128 is enough for most things, and if it isn't, it probably will be soon. Most importantly, this means your data can be compressed into fewer dimensions than ever before, which means less work for computers and humans alike: faster processing times and more accurate predictions. And who doesn't want those things?