

                  9 min read

                  Too Much Data, Too Little Time: A Business Case for Dimensionality Reduction


                  Introduction to Dimensionality Reduction

                  High-Dimensional Data

                  Imagine a spreadsheet with one hundred columns and only ten rows. This is a high-dimensional dataset, in which the number of features (columns) matches or exceeds the number of observations (rows). In the context of data science and machine learning, managing high-dimensional data presents challenges. Models trained on high-dimensional data are difficult to interpret and tend to be computationally expensive and time-consuming to train.
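This imbalance is easy to reproduce. A minimal sketch with NumPy (the data is synthetic, purely for illustration):

```python
import numpy as np

# A toy high-dimensional dataset: 10 observations, 100 features.
rng = np.random.default_rng(42)
X = rng.normal(size=(10, 100))

n_samples, n_features = X.shape
print(n_samples, n_features)    # 10 100
print(n_features >= n_samples)  # True: at least as many columns as rows
```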

                  The Curse of Dimensionality

                  The "curse of dimensionality," a term coined by the American mathematician Richard Bellman in 1961, refers to various problems that arise when organizing, analyzing, or otherwise dealing with data in high-dimensional spaces. Common issues associated with the curse of dimensionality include the following:

                  • Loss of Interpretability: As you add more features to a dataset, the relationships between features become increasingly difficult to understand. Human perception is three-dimensional, so we find it challenging to visualize and interpret interactions within high-dimensional spaces.
                  • High Computational Complexity: Processing high-dimensional data is time-consuming and resource-intensive, as the volume of the data expands exponentially with each added feature.
                  • Model Overfitting: An overfitted model has learned to classify or predict based on noise (irrelevant and random fluctuations in its training data) rather than signal (relevant underlying patterns in its training data). The sheer number of features in high-dimensional data provides a rich ground for a model to learn noise instead of signal.
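The overfitting risk in particular is easy to demonstrate: with more features than observations, an unregularized linear model can fit pure noise perfectly. A hedged sketch using scikit-learn and synthetic data (the numbers are illustrative, not from the article's dataset):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# With more features (100) than observations (10), a linear model can
# fit purely random targets exactly -- it memorizes noise, not signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 100))  # high-dimensional: 100 features, 10 rows
y = rng.normal(size=10)         # pure noise targets

model = LinearRegression().fit(X, y)
print(model.score(X, y))        # ~1.0: a "perfect" fit on noise is overfitting
```

A perfect training score on random targets is exactly the failure mode described above: the model has learned noise, and would generalize no better than chance.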

                  Let us help you with your data strategy →

                  Dimensionality Reduction

                  One solution to the curse of dimensionality is dimensionality reduction — the transformation of data from a high-dimensional space into a lower-dimensional space. Dimensionality reduction techniques aim to reduce the number of features in a dataset without muddying the essential characteristics of the data, which are those that convey useful information. Simply put, these techniques aim to remove the noise from a dataset but preserve the signal.

                  Principal Component Analysis for Dummies

                  Principal component analysis (PCA) is a dimensionality reduction technique that transforms data from a high-dimensional space into a low-dimensional space by condensing the original features into a smaller number of principal components. These principal components are linear combinations of the original features specifically tailored to capture the greatest amount of variance (assumed to correspond to information) in the data. 

                  The first principal component (a.k.a. PC1) captures the maximum amount of variance in the data. The second principal component (a.k.a. PC2) then captures the maximum amount of the variance left unexplained by PC1. Each subsequent principal component captures progressively less variance than its predecessor, decomposing the data into a new basis where the components are ranked according to their variance contribution.
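This variance ordering can be verified directly with scikit-learn's `PCA`; the dataset below is synthetic and purely illustrative:

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data with correlated features, so a few components
# capture most of the variance.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 20)) @ rng.normal(size=(20, 20))

pca = PCA(n_components=5).fit(X)
ratios = pca.explained_variance_ratio_

# Each component explains no more variance than its predecessor.
print(all(ratios[i] >= ratios[i + 1] for i in range(len(ratios) - 1)))  # True
```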

                  The Upfront Computational Cost of PCA

Principal component analysis involves operations that scale with the square of the number of original features being condensed. Consequently, PCA can be computationally intensive and time-consuming, especially when applied to datasets with more than a few thousand original features. For example, applying PCA to our Sephora customer review dataset, with 12,000 original features, takes 35.41 seconds.
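You can measure this upfront cost on your own hardware. The Sephora dataset is not reproduced here, so the matrix below is a random stand-in, and absolute timings will vary by machine:

```python
import time
import numpy as np
from sklearn.decomposition import PCA

# Random stand-in for a document-feature matrix:
# 2,000 documents x 1,000 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 1000))

t0 = time.perf_counter()
pca = PCA(n_components=100).fit(X)
elapsed = time.perf_counter() - t0
print(f"PCA fit: {elapsed:.2f}s on {X.shape[1]} features")
```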


                  The Downstream Benefits of PCA

                  The downstream benefits of PCA often justify the upfront computational expense. After we use PCA to condense our Sephora customer review dataset into one hundred principal components, fitting a logistic regression model to predict whether a review is negative or positive takes 0.08 seconds — versus 0.57 seconds required to fit the same model to the original features. Model training is often an iterative process, so this seemingly small reduction in training time can translate into significant cumulative efficiency gains across hundreds or thousands of iterations. 
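The trade-off can be sketched end to end. The data below is synthetic (random features and labels), so only the relative timings are meaningful, not the accuracies or the article's exact numbers:

```python
import time
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in: 3,000 samples, 1,000 features, binary labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(3000, 1000))
y = rng.integers(0, 2, size=3000)

X_pca = PCA(n_components=100).fit_transform(X)  # upfront cost, paid once

# Fitting on 100 components instead of 1,000 features is typically
# faster -- a saving that compounds over many training iterations.
timings = {}
for name, features in [("original features", X), ("100 components", X_pca)]:
    t0 = time.perf_counter()
    LogisticRegression(max_iter=500).fit(features, y)
    timings[name] = time.perf_counter() - t0
    print(f"{name}: {timings[name]:.2f}s")
```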

                  Note: The accuracy of both models is ~81 percent. This suboptimal performance could stem from the limitations of our TF-IDF matrix, which overlooks semantic meaning, i.e. the contextual significance of word order. In an upcoming blog post, we will explore a solution to this oversight: word embeddings. Word embeddings represent words as dense vectors in a continuous vector space, capturing not only their frequency but also their semantic meaning.

                  Our Sephora dataset is an extremely small fish in the grand scheme of machine learning and data science. Datasets used in many modern applications, from image recognition to natural language processing, can be much larger — often by several orders of magnitude. For example, consider autonomous driving systems like those developed by Tesla and Waymo. Dimensionality reduction techniques like PCA enable these systems to rapidly and continuously process high-dimensional datasets with many millions of original features.

                  TLDR: When to Use PCA

                  In practice, this question is best left to experts. Dimensionality reduction techniques are not one-size-fits-all, requiring domain expertise to determine their suitability for a given scenario. But as a general rule of thumb, PCA is best employed when the goal is to simplify high-dimensional data while preserving as much variance as possible.

However, several factors can render PCA unsuitable for a task or dataset. For instance, PCA is typically ineffective when the assumption that variance equates to information does not hold, and it often fails to capture non-linear relationships between the original features. Moreover, if the time spent applying PCA outweighs the time saved in subsequent model training, it may not be worthwhile. It is therefore essential to evaluate task requirements and dataset characteristics before deciding on PCA as a dimensionality reduction technique, and consulting with domain experts can help ensure the best approach is selected for the situation at hand.


                   


                  About Synaptiq

                  Synaptiq is an AI and data science consultancy based in Portland, Oregon. We collaborate with our clients to develop human-centered products and solutions. We uphold a strong commitment to ethics and innovation. 

                  Contact us if you have a problem to solve, a process to refine, or a question to ask.

                  You can learn more about our story through our past projects, our blog, or our podcast.
