Employment Data Imports

We have several employment files from vendors where the data was pulled either from LinkedIn or from a survey, which makes for non-standard employer names, such as 8 different ways to say Fifth Third Bank.  Going through 100,000+ lines in the files and trying to match them to what we have in the system will be extremely time-consuming.  We would love to hear what you have done and what has worked for you when importing this kind of data.  Any help with this is appreciated.

 

Comments

  • Hi Shawn, a couple of things I've learned from having to tackle this before:
    • It helps to create a lookup table (temporary or perhaps permanent) with normalized organization names.  By normalized I mean:
      • Remove punctuation and standardize spacing
      • Remove leading articles ("the", "a", "an")
      • Standardize or remove common abbreviations ("&" -> "and", "llc." -> ", LLC", "corp." -> "corporation", ...)
    • Do this for both the org name and all aliases
    • Run the same normalization against the incoming data 
    I usually do this in C#.  Refine the steps above by running them repeatedly against a test set: each time you add a standardization/normalization step, check the matching results and adjust based on the patterns you see in the matches and misses in the sample set.  That won't get you to 100%, but ideally it gets you more than half.
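A minimal sketch of the normalization steps above, written in Python rather than C# purely for illustration. The function names (`normalize`, `build_lookup`, `match`), the abbreviation map, and the choice to drop legal suffixes such as "LLC" instead of standardizing them are all assumptions made for this example, not anything taken from a particular system.

```python
import re

# Hypothetical abbreviation map; extend it as patterns show up in your misses.
ABBREVIATIONS = {"corp": "corporation", "co": "company", "univ": "university"}
# Assumption: legal suffixes are dropped here rather than standardized.
DROP_TOKENS = {"llc", "inc", "ltd"}
LEADING_ARTICLES = {"the", "a", "an"}


def normalize(name: str) -> str:
    """Return a lowercase, punctuation-free comparison key for an org name."""
    text = name.lower().replace("&", " and ")
    text = text.replace(".", "").replace("'", "")  # "N.A." -> "na"
    text = re.sub(r"[^\w\s]", " ", text)           # other punctuation -> space
    tokens = text.split()                          # also collapses spacing
    if tokens and tokens[0] in LEADING_ARTICLES:
        tokens = tokens[1:]
    tokens = [ABBREVIATIONS.get(t, t) for t in tokens if t not in DROP_TOKENS]
    return " ".join(tokens)


def build_lookup(orgs: dict) -> dict:
    """Map normalized keys to org ids. `orgs` is id -> [name, alias, ...]."""
    lookup = {}
    for org_id, names in orgs.items():
        for name in names:
            # Keep the first id seen if two orgs normalize to the same key.
            lookup.setdefault(normalize(name), org_id)
    return lookup


def match(incoming: list, lookup: dict):
    """Split incoming employer names into matches and misses."""
    matched, missed = {}, []
    for raw in incoming:
        org_id = lookup.get(normalize(raw))
        if org_id is None:
            missed.append(raw)
        else:
            matched[raw] = org_id
    return matched, missed
```

Reviewing the `missed` list after each run is where the refinement loop happens: recurring patterns in the misses suggest the next entry to add to the abbreviation map or the drop list.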

     
  • Hi Matt, thank you for your response. Removing the common
    abbreviations and articles has yielded a good deal more matches.
    Now if only I could do something about the people who can't spell
    where they work.  I appreciate you taking the time.

     

    Thank you,

    Shawn

