@datagica/parse-companies
v0.0.7
Find companies in a text
Parse companies
Find companies in a text. Well, most of the time.
Free data sources used
- French open corporation database (SIRENE)
- http://www.opendata500.com/us/list/
- https://github.com/GovLab/OpenData500
- http://api.corpwatch.org
- https://datahub.io/dataset/corpwatch
Guidelines
It is better to have a small amount of solid data than a lot of garbage, so I try to clean up and verify things by hand (with Google) as much as possible, or use pattern matching to normalize anomalies.
When cleaning up the dataset, there are a couple of things to keep in mind:
Common problems
Examples of anomalies and issues:
- The legal category of companies may or may not be present
- The legal category might have variations (e.g. S.A.S., S.A.S, SAS...)
- It can appear at the beginning or at the end (e.g. "FOOBAR SARL", "SARL FOOBAR")
The solution is to normalize the names whenever possible, and to generate aliases covering most of these cases.
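The normalization and alias generation described above could be sketched roughly as follows. This is only an illustration: `LEGAL_FORMS`, `normalizeName` and `generateAliases` are hypothetical names, not part of the package, and the list of legal forms is deliberately tiny.

```javascript
// Hypothetical helpers, not the actual @datagica/parse-companies code.
// The list of legal forms is an illustrative assumption.
const LEGAL_FORMS = ["SAS", "SARL", "SA"];

// Regex source matching a legal form with optional dots: "S.A.S.", "S.A.S", "SAS".
function legalFormPattern(form) {
  return form.split("").join("\\.?") + "\\.?";
}

// Split a raw name into its bare name and canonical legal form (if any).
function normalizeName(raw) {
  const name = raw.trim().replace(/\s+/g, " ").toUpperCase();
  // Try longer forms first so that "SARL" and "SAS" win over "SA".
  const forms = [...LEGAL_FORMS].sort((a, b) => b.length - a.length);
  for (const form of forms) {
    const p = legalFormPattern(form);
    const prefix = new RegExp("^" + p + "\\s+(.+)$"); // "SARL FOOBAR"
    const suffix = new RegExp("^(.+?)\\s+" + p + "$"); // "FOOBAR SARL"
    const m = name.match(prefix) || name.match(suffix);
    if (m) return { name: m[1], legalForm: form };
  }
  return { name, legalForm: null };
}

// Generate aliases with the legal form before/after the name, dotted or not.
function generateAliases(raw) {
  const { name, legalForm } = normalizeName(raw);
  const aliases = new Set([name]);
  if (legalForm) {
    const dotted = legalForm.split("").join(".") + "."; // "S.A.S."
    for (const variant of [legalForm, dotted]) {
      aliases.add(`${name} ${variant}`);
      aliases.add(`${variant} ${name}`);
    }
  }
  return [...aliases];
}

console.log(generateAliases("SARL FOOBAR"));
// → ["FOOBAR", "FOOBAR SARL", "SARL FOOBAR", "FOOBAR S.A.R.L.", "S.A.R.L. FOOBAR"]
```

Running this offline, once per dataset row, is what produces the pre-generated aliases; a fuller version would need a much longer list of legal forms per country.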
About aliases
You may wonder why we pre-generate aliases, since it could be done at runtime for entries in the dataset and/or for each word found in documents as we go.
It is important to understand that normalizing stuff at runtime is costly for the CPU while disk, memory and network are getting faster and cheaper.
Up to a certain size, it is much faster to just load a big blob in memory than to do additional per-row work (especially since we have 1M+ rows).
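To illustrate the trade-off, here is a minimal sketch of what loading pre-generated aliases into memory can look like. The row shape (`id`, `label`, `aliases`) and the function names are assumptions made for this example, and the two rows are made up.

```javascript
// Hypothetical pre-generated dataset: in practice this would be a large blob
// built offline. The rows and field names here are invented for illustration.
const dataset = [
  { id: "foobar", label: "Foobar", aliases: ["FOOBAR", "FOOBAR SARL", "SARL FOOBAR"] },
  {
    id: "french-company",
    label: "French Company",
    aliases: ["FRENCH COMPANY", "FRENCH COMPANY SAS", "SAS FRENCH COMPANY"],
  },
];

// Build the alias index once, at load time.
function buildIndex(rows) {
  const index = new Map();
  for (const row of rows) {
    for (const alias of row.aliases) index.set(alias, row);
  }
  return index;
}

// At query time: one cheap string cleanup and one Map lookup,
// with no per-row pattern matching.
function lookup(index, candidate) {
  return index.get(candidate.trim().toUpperCase()) || null;
}

const index = buildIndex(dataset);
console.log(lookup(index, "sarl foobar").id); // → "foobar"
console.log(lookup(index, "unknown co")); // → null
```

The cost is memory: every alias becomes a key, so the index grows with the number of variations generated per row.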
Admittedly, that doesn't mean we cannot perform normalization at runtime: @datagica/parse-entities actually does perform basic normalization, and allows one to define custom rules. I haven't felt the need for this step in parse-companies so far, but if you want you can have a look at parse-institution to see how it is done.
Perhaps this technique could be used to solve cases such as "SAS FRENCH COMPANY" vs. "FRENCH COMPANY SAS". Still, it should be used in moderation, as we need to keep a good balance between loading time, CPU and memory.
Cover names
Many businesses have a corporate name that differs from the business name, e.g. the company owning a restaurant might have a different name than the restaurant itself.
In that case, the restaurant name should be in the aliases.
Small businesses
For small businesses, it is better to prepend the category and/or append the street, e.g.:
- "BAR TABAC" => "Bar Tabac, 28 Smith Street, Brooklyn"
- "LE CARREFOUR" => "RESTAURANT LE CARREFOUR", "LE CARREFOUR, 42 RUE DU BLABLA"
- "MARCEL" => "GARAGE MARCEL"
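A rough sketch of how such aliases could be generated is shown below. The helper name `smallBusinessAliases` and the fields `name`, `category` and `street` are assumptions for this example; the real dataset may expose this information differently.

```javascript
// Hypothetical helper; field names (name, category, street) are assumptions.
function smallBusinessAliases({ name, category, street }) {
  const aliases = [];
  // Prepend the category unless the name already starts with it.
  if (category && !name.toUpperCase().startsWith(category.toUpperCase() + " ")) {
    aliases.push(`${category} ${name}`);
  }
  // Append the street to disambiguate very common names.
  if (street) aliases.push(`${name}, ${street}`);
  return aliases;
}

console.log(
  smallBusinessAliases({
    name: "LE CARREFOUR",
    category: "RESTAURANT",
    street: "42 RUE DU BLABLA",
  })
);
// → ["RESTAURANT LE CARREFOUR", "LE CARREFOUR, 42 RUE DU BLABLA"]

console.log(smallBusinessAliases({ name: "MARCEL", category: "GARAGE" }));
// → ["GARAGE MARCEL"]
```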
Hedge fund subsidiaries / family trusts
- they are mostly piggy banks and proxies for money, not "real" businesses with offices, services, customers...
- usually owned by a single person or a hedge fund and located in a tax haven
- they have weird names such as "LITTLE SUNSHINE HEDGE FUND CAIMAN B XXVII" or "JOHN DOE JR FAMILY HOLDING"
For the moment they are of little interest to us when analyzing news reports and CVs, so we delete them when we find one (filtering by name and company size, e.g. 1-person company => delete).
However, if the hedge fund is actually a big thing, we can also delete the subsidiary and only keep the parent company (e.g. "LITTLE SUNSHINE HEDGE FUND CAIMAN B XXVII" => "LITTLE SUNSHINE").
This is not an issue, because if we want to use them in the future, we can still go back to CorpWatch (US) or SIRENE (FR) and download the full listings again.
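The filtering described in this section might look something like the sketch below. The name patterns, the `employees` field and the `KNOWN_PARENTS` list are all assumptions made for illustration; the actual cleanup is partly done by hand and its exact rules are not spelled out here.

```javascript
// Illustrative patterns only; the real filter rules are not given in this README.
const SHELL_NAME = /\b(HEDGE FUND|FAMILY HOLDING|FAMILY TRUST)\b/;

// Hypothetical list of large parent companies worth keeping.
const KNOWN_PARENTS = ["LITTLE SUNSHINE"];

function cleanCompanies(rows) {
  const kept = new Map(); // keyed by name, so a parent is kept only once
  for (const row of rows) {
    const name = row.name.toUpperCase();
    const parent = KNOWN_PARENTS.find((p) => name.startsWith(p + " "));
    // Assumption: a missing employee count is treated as "tiny".
    const tiny = row.employees === undefined || row.employees <= 1;
    if (parent) {
      // Subsidiary of a big fund: keep only the parent company.
      if (!kept.has(parent)) kept.set(parent, { name: parent });
    } else if (SHELL_NAME.test(name) && tiny) {
      // Shell-like name and one-person company: drop it.
      continue;
    } else {
      kept.set(name, row);
    }
  }
  return [...kept.values()];
}

const cleaned = cleanCompanies([
  { name: "LITTLE SUNSHINE HEDGE FUND CAIMAN B XXVII", employees: 1 },
  { name: "JOHN DOE JR FAMILY HOLDING", employees: 1 },
  { name: "GARAGE MARCEL", employees: 3 },
]);
console.log(cleaned.map((c) => c.name));
// → ["LITTLE SUNSHINE", "GARAGE MARCEL"]
```

Because the full listings can always be downloaded again, a filter like this can afford to be aggressive.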