Improving Company Data
Our employees need high-quality company data to successfully serve our clients. AlphaSights didn’t have that, so the software engineering team dedicated time in our roadmap to solve that problem and deliver value to our employees and clients.
AlphaSights works with investor, corporate, and consulting clients looking to understand a particular topic or industry, and connects them with the best experts. Our clients come to us requesting to speak to experts when making a business decision (e.g. investing in a company, entering a new market, or helping a client restructure). They use us because they need quick access to relevant experts: the speed and quality of the connections we deliver are therefore key.
A client request to speak to experts is structured as a project internally. Our Client Services team works on projects for our clients and makes up our core user group. To find relevant experts for a project, employees generate a value chain for the industry or topic of interest: who are the key competitors, who are their customers, which companies supply to them, and which companies distribute their products. An expert’s relevant experience is signaled by the companies they’ve worked at. Quickly mapping out relationships between companies means we can deliver quality experts to our clients fast.
We identified a few core problems before we embarked on the project.
We had poor quality company information. We had a database with millions of company names and primary serial keys, and that was it. The companies in our database were entered manually in our internal systems by our employees and experts. There was a huge variation in how users chose to represent companies. For example, Samsung Electronics Co Ltd had been entered as ‘Samsung Electronics’, ‘Samsung Electronics Korea’, ‘Samsung Electronics (삼성디지털플라자)’, etc.
Our company information was not enriched. We didn’t have legal company names, aliases, industry information, geographies, or supplier/competitor relationships. Our employees had to research those connections between companies and industries from scratch every time, losing valuable time in the process.
Even if we had enriched our company information, we couldn’t have done it well. Different name variations for one company existed as separate rows in our database, so entity linking was not possible.
Our lack of enriched company data was preventing us from improving and developing our search and recommendation tooling. This was problematic because search and recommendation is crucial to improving our speed and quality of delivery.
Cleaning our existing company data
We had to keep our existing company data because we had a decade’s worth of experts associated with those companies. Before we could start enriching our company data, we had to clean up our current company database by removing duplicates without impacting existing records.
Accessing enriched company data from third party providers
We knew that we’d enrich our company information over time through multiple third party company data providers (particularly to ensure coverage across all geographies). Whatever service we built to host this enriched company information would have to consume regularly updated data from various sources.
Supporting existing and new internal services
Our existing company data was used primarily by our core internal platform Delivery, which allows employees to manage project requests, find experts, and schedule phone calls. Once we had enriched company data linked to unique company identifiers, other internal services and products would be developed to consume this enriched company data to improve the speed and quality of employees’ work. We therefore had to surface enriched company information in a manner easily consumable by various internal services.
What we built
We built the Companies Gateway (CG), a service that provides enriched company data to other internal AlphaSights services. The service parses incoming company names from our internal application, searches through enriched company data for possible matches, and resolves multiple matches. Exact or multiple matches are then returned to our main internal application.
The CG live application does not need (and actually does not even have access to) our third party provider data for normal operation. We can scale horizontally by adding new sources of third party data with minimal friction. Our end-users access the enriched company information through our internal services, which consume the CG data over an HTTP JSON API. The CG sends notifications via RabbitMQ to our consuming services when its entities are updated or deleted.
Given the scope of the problem we were solving, we encountered a number of challenges that didn’t have an immediate solution. This gave software engineers the freedom to balance execution with more exploratory work.
Matching & Entity Resolution
We faced an initial challenge of matching companies from our low-quality database with companies from our third party provider’s database. This was tricky given the limitations of our existing company data.
The challenge was further compounded because the company names entered in our original database rarely reflected a company’s actual legal name. For example, in our database, ‘BBC’ could have referred to BBC Entrepreneurial Training & Consulting LLC just as much as the British Broadcasting Corporation.
We developed a matching algorithm that goes through a list of matchers. Each matcher attempts to find matching entities. If any matcher returns a single match (an exact match), we exit the algorithm. If a matcher returns multiple matches that cannot be resolved into a single match, we return these candidates to our internal service but do not pick one automatically.
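The control flow described above can be sketched roughly as follows. This is a minimal illustration, not the Companies Gateway code: the names `run_matchers`, `exact_name_matcher`, and `LEGAL_NAMES` are hypothetical, and stopping at the first matcher that returns several candidates is an assumption about how ambiguous results are handled.

```python
from typing import Callable, Dict, List

# A matcher takes a raw company name and returns candidate entity ids.
Matcher = Callable[[str], List[str]]

# Hypothetical stand-in for the enriched company data.
LEGAL_NAMES: Dict[str, List[str]] = {
    "apple inc": ["E1"],
    "ibm": ["E2", "E3"],
}


def exact_name_matcher(name: str) -> List[str]:
    """Toy matcher: case-insensitive lookup of the full name."""
    return LEGAL_NAMES.get(name.strip().lower(), [])


def run_matchers(name: str, matchers: List[Matcher]) -> Dict[str, object]:
    """Try each matcher in order and stop at the first one that finds anything."""
    for matcher in matchers:
        candidates = matcher(name)
        if len(candidates) == 1:
            # A single candidate counts as an exact match: exit early.
            return {"status": "matched", "entities": candidates}
        if len(candidates) > 1:
            # Assumption: ambiguous candidates are handed back unresolved.
            return {"status": "ambiguous", "entities": candidates}
    return {"status": "unmatched", "entities": []}
```

With the toy data above, `run_matchers("Apple Inc", [exact_name_matcher])` reports a match, while `"IBM"` comes back as ambiguous with both candidates attached.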
Over the course of the project, we continuously experimented with new matchers. Matchers that had an impact on our overall match rate were added incrementally to our matching algorithm. We’ve outlined two of these matchers below to illustrate our range of approaches.
If a user included a company’s ticker as part of the company name (e.g. ‘Apple Inc - AAPL’), we extracted the ticker and matched the company using the ticker information we have in our enriched company database. This yields very strong results, but it relies entirely on our users manually providing a ticker.
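As a rough sketch of the ticker idea, the snippet below pulls a trailing ticker out of a name and looks it up. The regular expression, the `TICKER_INDEX` dictionary, and the function names are all assumptions made for illustration; in particular, it only handles the ‘Name - TICKER’ shape shown in the example above.

```python
import re
from typing import Dict, List, Optional

# Hypothetical ticker index; a real service would query the enriched data.
TICKER_INDEX: Dict[str, str] = {"AAPL": "Apple Inc", "MSFT": "Microsoft Corp"}

# Assumed convention: a short upper-case token after a " - " separator
# at the very end of the name, as in 'Apple Inc - AAPL'.
TICKER_SUFFIX = re.compile(r"\s-\s([A-Z]{1,5})\s*$")


def extract_ticker(raw_name: str) -> Optional[str]:
    """Return the trailing ticker if the name follows the assumed convention."""
    found = TICKER_SUFFIX.search(raw_name)
    return found.group(1) if found else None


def ticker_matcher(raw_name: str) -> List[str]:
    """Return the company for a known ticker, or no candidates at all."""
    ticker = extract_ticker(raw_name)
    if ticker is None or ticker not in TICKER_INDEX:
        return []
    return [TICKER_INDEX[ticker]]
```

A name without a ticker, or with a ticker that is not in the index, simply yields no candidates, which leaves the decision to whichever matcher runs next.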
Our name matcher is the most complex matcher. It first searches with the string as given. If no matches are returned, it applies various strategies to modify, clean up, and normalize the string to find a match. For example, our ParenthesizedNameParser takes ‘Apple (Cupertino)’, removes ‘(Cupertino)’, and searches for ‘Apple’. Similarly, LegalIdentifierParser removes all words that match recognized legal company identifiers (e.g. Inc, Ltd, GmbH) from the string and searches using the remainder. Our GeographyParser is pretty nifty: it recognizes country names (in both English and the country’s language), removes the country name from the string, and searches using the remaining string. If multiple matches are returned, it then looks for the corporate entity located in that geography. For example, ‘IBM Italy’ is first searched as ‘IBM’, which yields multiple possible matches given the number of IBM subsidiaries. We then use ‘Italy’ to narrow the results to IBM subsidiaries based in Italy: if this leaves a single result, ‘IBM Italy’ gets matched.
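To make the normalization steps concrete, here is a small sketch of the three parsers as plain functions. It is illustrative only: the word lists, the `ENTITIES` data, and the choice to chain all three steps in one pass are assumptions, and the geography step only recognizes single-word country names.

```python
import re
from typing import Dict, List, Optional, Tuple

# Hypothetical reference data, far smaller than anything real.
LEGAL_IDENTIFIERS = {"inc", "ltd", "gmbh", "llc", "corp", "co"}
COUNTRY_NAMES = {"italy": "IT", "italia": "IT", "germany": "DE", "deutschland": "DE"}
ENTITIES: Dict[str, List[Tuple[str, str]]] = {
    "apple": [("apple-inc", "US")],
    "ibm": [("ibm-corp", "US"), ("ibm-italia", "IT"), ("ibm-de", "DE")],
}


def strip_parenthesized(name: str) -> str:
    """'Apple (Cupertino)' -> 'Apple'."""
    return re.sub(r"\s*\([^)]*\)", "", name).strip()


def strip_legal_identifiers(name: str) -> str:
    """Drop words such as Inc, Ltd, GmbH (trailing dots and commas ignored)."""
    words = [w for w in name.split() if w.strip(".,").lower() not in LEGAL_IDENTIFIERS]
    return " ".join(words)


def split_geography(name: str) -> Tuple[str, Optional[str]]:
    """Remove the first recognized country word and return its code."""
    country: Optional[str] = None
    kept: List[str] = []
    for word in name.split():
        code = COUNTRY_NAMES.get(word.lower())
        if code is not None and country is None:
            country = code
        else:
            kept.append(word)
    return " ".join(kept), country


def search(name: str) -> List[Tuple[str, str]]:
    """Toy lookup returning (entity id, country code) pairs."""
    return ENTITIES.get(name.strip().lower(), [])


def name_matcher(raw_name: str) -> List[str]:
    """Search the raw string first, then retry with a normalized string."""
    hits = search(raw_name)
    if not hits:
        cleaned = strip_legal_identifiers(strip_parenthesized(raw_name))
        remainder, country = split_geography(cleaned)
        hits = search(remainder)
        if country is not None and len(hits) > 1:
            # Narrow to entities in the named country, but only if that
            # leaves exactly one candidate.
            local = [hit for hit in hits if hit[1] == country]
            if len(local) == 1:
                hits = local
    return [entity_id for entity_id, _ in hits]
```

In this sketch, ‘IBM Italy’ resolves to the single Italian entity, while a bare ‘IBM’ still returns every candidate and stays ambiguous.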
We initially targeted a match rate of 50% for the project. The distribution of names in our database had a long tail and a lot of ‘companies’ didn’t even refer to real companies. Through these various experiments we were able to refine our matching algorithm over the course of our project, improve our match rate incrementally, and ultimately achieve a final match rate of 67%.
We knew from the start that we wouldn’t achieve a 100% match rate through our matching algorithm: many companies were represented ambiguously and could potentially match against multiple results, so we chose to leave those unmatched. We wanted to encourage our employees to manually resolve those unmatched companies as part of their normal workflow in the internal system they use every day.
As soon as our first version of the typeahead (developed for our users to manually match) was in production, our first batch of testers told us that it was a pain to select from a list of companies and that they were avoiding manual matching.
We built a small tool to experiment with various heuristics to surface more ‘relevant’ companies. We assigned more weight to company features like entity type (public company, subsidiary, private company), position in a given corporate hierarchy (parent company, ultimate parent company), number of subsidiaries, number of securities associated with a given company, etc. Upon testing, we established that weighting by entity type and then by number of subsidiaries was more likely to yield what our users considered to be the ‘real’ Nike that they were looking for. The screenshot below illustrates how the applied heuristics surface ‘Microsoft Corp’ faster for the user.
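One simple way to express "entity type first, then number of subsidiaries" is a two-part sort key, sketched below. The priority order of the entity types, the `Candidate` fields, and the Acme sample data are all made up for illustration; the source does not say how the weights were actually implemented.

```python
from dataclasses import dataclass
from typing import List

# Assumed priority: lower number sorts first. The real ordering is not stated.
ENTITY_TYPE_PRIORITY = {"public": 0, "private": 1, "subsidiary": 2}


@dataclass
class Candidate:
    name: str
    entity_type: str
    subsidiary_count: int


def rank_candidates(candidates: List[Candidate]) -> List[Candidate]:
    """Order by entity type, then by subsidiary count (largest first)."""
    unknown_rank = len(ENTITY_TYPE_PRIORITY)  # unknown types sort last
    return sorted(
        candidates,
        key=lambda c: (
            ENTITY_TYPE_PRIORITY.get(c.entity_type, unknown_rank),
            -c.subsidiary_count,
        ),
    )


# Made-up typeahead results for the query 'Acme'.
results = [
    Candidate("Acme Retail BV", "subsidiary", 0),
    Candidate("Acme Holdings", "private", 5),
    Candidate("Acme Inc", "public", 120),
    Candidate("Acme Pacific Plc", "public", 3),
]
ranked_names = [c.name for c in rank_candidates(results)]
```

Here the public parent with the most subsidiaries ends up at the top of the list, which is the behaviour testers were asking for in the typeahead.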
Our work on the Companies Gateway had a direct impact on the rest of the business very quickly. Two entirely new internal services already consume information from the Gateway and help our users deliver relevant experts to our clients at increased speed. Our work on company data now underpins new compliance and recommendation tooling we’re building to continue increasing the speed and quality of the service AlphaSights delivers to its clients day in and day out.
Jennifer Juillard-Maniece joined AlphaSights in January of 2014 and serves as a product manager on our Software Engineering team.