08. Representation



Topic: Representation

Course: GMLC

Date: 19 February 2019 

Professor: Not specified


Resources


Key Points


  • As opposed to traditional programming, in ML the focus is on representation

  • Representation

    • A way developers hone a model by adding and improving its features
  • Feature engineering

    • Transforming raw data into a feature vector
  • In many cases, features must be represented as vectors of real numbers because they are multiplied by the model weights; this is why we do feature engineering

  • Mapping numeric values

    • If a raw-data feature is an integer or floating-point value, it usually doesn’t have to be transformed, since it can already be multiplied by a weight. That doesn’t mean every numeric-looking value is truly numeric, though. For example, we wouldn’t want a postcode to be multiplied by a weight; a postcode is a categorical value (see the sketch below).
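
A minimal sketch of the distinction above, assuming a toy raw record with a numeric rooms field and a categorical postcode field (the field names, values, and postcode vocabulary are made up for illustration):

```python
# Toy illustration: a numeric value can be used directly as a feature,
# while a categorical value such as a postcode must be encoded rather
# than multiplied by a weight as a raw number.
raw_example = {"rooms": 4, "postcode": "10115"}

known_postcodes = ["10115", "10117", "10119"]  # assumed vocabulary

# Numeric feature: passed through as-is (multiplying it by a weight is meaningful).
rooms_feature = float(raw_example["rooms"])

# Categorical feature: one-hot encoded ("10117" is not "slightly more" than
# "10115", so weighting the raw postcode number would be meaningless).
postcode_feature = [1.0 if raw_example["postcode"] == p else 0.0
                    for p in known_postcodes]

feature_vector = [rooms_feature] + postcode_feature
print(feature_vector)  # [4.0, 1.0, 0.0, 0.0]
```
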
  • Mapping categorical values

    • Categorical features must have a discrete set of possible values

    • We use feature engineering to convert strings to feature vectors

    • [“name1”, “name2”, ...] -> [0, 1, 0, 0, 0, ...] - example of encoding “name2” as a feature vector

    • One-hot-encoding:

      • Exactly one value is 1 (for example, a house that is exactly on street name2)
    • Multi-hot-encoding:

      • Multiple values are 1 (for example, a house on the corner of two streets needs both street values set to 1, which is done with multi-hot encoding)
    • Sparse representation

      • In the examples above we needed a discrete set of possible values ([name1, name2, …]). What happens if there are over 1,000,000 possible streets? That’s where sparse representation comes in: only the nonzero values are stored, but a weight is still learned for each possible value, just as with the dense representation (see the sketch below).
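
A minimal sketch of the three encodings discussed above, using a made-up five-street vocabulary; with a real vocabulary of 1,000,000 streets only the sparse form would be practical:

```python
# One-hot, multi-hot, and sparse encodings of street names.
streets = ["name1", "name2", "name3", "name4", "name5"]
index = {s: i for i, s in enumerate(streets)}

def one_hot(street):
    """Dense one-hot vector: exactly one position is 1."""
    vec = [0.0] * len(streets)
    vec[index[street]] = 1.0
    return vec

def multi_hot(street_list):
    """Dense multi-hot vector: several positions can be 1 (e.g., a corner house)."""
    vec = [0.0] * len(streets)
    for s in street_list:
        vec[index[s]] = 1.0
    return vec

def sparse(street_list):
    """Sparse representation: store only the indices of the nonzero entries.
    The model still learns one weight per possible street, exactly as with
    the dense representation."""
    return sorted(index[s] for s in street_list)

print(one_hot("name2"))               # [0.0, 1.0, 0.0, 0.0, 0.0]
print(multi_hot(["name2", "name4"]))  # [0.0, 1.0, 0.0, 1.0, 0.0]
print(sparse(["name2", "name4"]))     # [1, 3]
```
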
  • Qualities of good features:

    • Avoid rarely used discrete feature values

      • unique_house_id: 123456 is a bad feature because that exact value will never occur again, so the model can’t learn anything general by training on it
    • Clear and obvious meanings

      • house_age: 43527 is a case of unclear data storage. Only the programmer knows whether this value is in days, months, or something else; a name like house_age_years makes the meaning obvious
    • “Magic” values

      • quality_rating: -1 might be assigned when no quality rating was supplied. Training directly on such a “magic” value is a mistake. Instead, create a synthetic boolean feature such as has_quality_rating (0 or 1) that tells the model whether the rating was defined (see the sketch after this list)
    • Upstream instability

      • The definition of a feature must not be changed over time
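
A minimal sketch of the magic-value fix described above, assuming a raw quality_rating column where -1 marks a missing rating (filling with the mean is one common choice, not something prescribed by these notes):

```python
# Replace the magic value -1 with (a) an explicit indicator feature and
# (b) a neutral fill for the rating itself, so the model never sees "-1"
# as if it were a real, very low rating.
raw_ratings = [4.5, -1, 3.0, -1, 5.0]

# Boolean indicator: 1.0 if a rating was supplied, 0.0 otherwise.
has_quality_rating = [0.0 if r == -1 else 1.0 for r in raw_ratings]

# Fill missing ratings with the mean of the observed ones.
observed = [r for r in raw_ratings if r != -1]
mean_rating = sum(observed) / len(observed)
quality_rating = [mean_rating if r == -1 else r for r in raw_ratings]

print(has_quality_rating)  # [1.0, 0.0, 1.0, 0.0, 1.0]
print(quality_rating)      # [4.5, 4.1666..., 3.0, 4.1666..., 5.0]
```
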
  • Cleaning Data

    • Scaling feature values

      • Converting floating-point values from their natural range (e.g., 100 to 900) into a standard range (e.g., 0 to 1 or -1 to +1)

      • Helps gradient descent converge more quickly

      • Avoids the “NaN trap”, where a value overflows floating-point precision during training

      • Helps the model learn appropriate weights for each feature

      • Linear mapping ([min value, max value] => [-1, +1])

      • Scaled value = (value - mean) / standard deviation

        • Mean = 100

        • Standard deviation = 20

        • Value = 130

        • Scaled value = (130 - 100) / 20 = 1.5 (see the sketch below)
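
A minimal sketch of both scaling approaches, reusing the numbers from the example above (mean 100, standard deviation 20, value 130) and the 100 to 900 natural range mentioned earlier:

```python
def linear_scale(value, min_value, max_value):
    """Linearly map [min_value, max_value] onto [-1, +1]."""
    return 2.0 * (value - min_value) / (max_value - min_value) - 1.0

def z_score(value, mean, std_dev):
    """Subtract the mean and divide by the standard deviation."""
    return (value - mean) / std_dev

print(z_score(130, mean=100, std_dev=20))               # 1.5
print(linear_scale(130, min_value=100, max_value=900))  # -0.925
```
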

    • Handling extreme outliers

      • Logarithmic scaling

        • roomsPerPerson = log((totalRooms / population) + 1)
      • Clipping values (cap every value above a chosen maximum, e.g., clip roomsPerPerson at 4.0; see the sketch below)
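
A minimal sketch of log scaling and clipping, using the roomsPerPerson formula from the notes; the totalRooms/population numbers and the 4.0 cap are assumptions for illustration:

```python
import math

def log_scaled_rooms_per_person(total_rooms, population):
    """Compress the long tail of the ratio with a log transform
    (the +1 keeps the log defined when the ratio is 0)."""
    return math.log((total_rooms / population) + 1)

def clip(value, max_value=4.0):
    """Cap extreme outliers at a chosen maximum."""
    return min(value, max_value)

print(log_scaled_rooms_per_person(total_rooms=6000, population=1000))  # ~1.95
print(clip(55.0))  # 4.0  (outlier capped)
print(clip(2.5))   # 2.5  (unchanged)
```
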

    • Binning

      • Dividing floating-point values into bins when treating them as a single continuous feature does not make sense (e.g., latitude has no linear relationship with house value)

      • Latitude = 34.0 => LatitudeBin = 33 < latitude <= 34

      • After dividing latitudes into bins, the model can learn a separate, appropriate weight for each latitude bin

      • [0,0,0,0,0,1,0,0,0,0] => one-hot feature vector representation of the latitude bin we want (see the sketch below)
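
A minimal sketch of binning latitudes into 1-degree bins and one-hot encoding the bin; the 32 to 42 degree range is an assumption for illustration:

```python
# Bin latitudes into 1-degree buckets and one-hot encode the bucket so the
# model can learn a separate weight per bin.
BIN_EDGES = list(range(32, 43))  # bins: (32,33], (33,34], ..., (41,42]

def latitude_to_one_hot(latitude):
    num_bins = len(BIN_EDGES) - 1
    vec = [0.0] * num_bins
    for i in range(num_bins):
        if BIN_EDGES[i] < latitude <= BIN_EDGES[i + 1]:
            vec[i] = 1.0
            return vec
    raise ValueError(f"latitude {latitude} is outside the binned range")

print(latitude_to_one_hot(34.0))  # [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
```
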

    • Scrubbing

      • Clearing & Fixing unreliable real-world data

      • Omitted values

      • Duplicates

      • Bad labels

      • Bad feature values

      • Use common sense in general
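
A minimal scrubbing sketch with pandas; the column names, the -1 “missing label” convention, and the plausibility threshold are assumptions for illustration:

```python
import pandas as pd

# Toy raw dataset exhibiting the problems listed above: an omitted value,
# a duplicate row, a bad label, and an implausible feature value.
raw = pd.DataFrame({
    "sqft":  [800, 800, None, 120000, 950],
    "price": [250000, 250000, 300000, 310000, -1],
})

clean = (
    raw.drop_duplicates()         # duplicates
       .dropna(subset=["sqft"])   # omitted values
)
clean = clean[clean["price"] > 0]      # bad labels (-1 used as "unknown" here)
clean = clean[clean["sqft"] < 20000]   # bad feature values (implausible sizes)

print(clean)  # only the fully reliable rows remain
```
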

    • Know your data

      • Verify that data meets your expectations

      • Check other sources

      • Good machine learning relies on good data

Check your understanding


  • Describe what feature engineering is and why it is important

  • Know ways of mapping categorical & numeric values

  • Know the qualities of good data

  • Describe & explain ways of cleaning up your data

Summary of Notes


  • Feature engineering is the process of transforming raw data (categorical strings or numeric values) into feature vectors

  • Checking your data is the key to a good model

  • Real-world data often isn’t perfect and will need manual fixing & salvaging
