@datatables/type-detector
v1.1.6
Published
Raw data type detection.
Downloads
66
Maintainers
Readme
Type-Detection
This Library provieds raw data type detection for Numbers, HTML, Dates and Strings. It was created with ❤ by the CloudTables team.
Install
Install from npm by using
npm install @datatables/type-detector
Usage
ES3
var TypeDetect = require('@datatables/type-detector');
var detector = new TypeDetect["default"]();
var type = detector.typeDetect(dataArray);
ES6
import TypeDetect from '@datatables/type-detector';
let detector = new TypeDetect();
let type = detector.typeDetect(dataArray);
Where dataArray
is a single dimensional array of data who's type is to be detected.
The constructor takes 2 optional parameters as follows.
| Parameter | Default | Description |
|:---------:|:-------:|:-----------:|
| decimalCharacter
| '.'
| This is used when determining the number of decimal places that a number has. |
| thousandsSeparator
| ','
| This is removed from strings when attempting to determine a number type. |
API
There is a single API method i18n()
.
detector.i18n(langOpts)
The langOpts
parameter is an object that is used to add internationalisation options used to detect dates. The default value of the internal object is as follows.
{
"abbrDays": {
"deDE": /^(mo\.|di\.|mi\.|do\.|fr\.|sa\.|so\.)$/gi,
"en": /^(mon|tue|wed|thu|fri|sat|sun)$/gi,
"esES": /^(lun\.|mar\.|mié\.|jue\.|vie\.|sáb\.|dom\.)$/gi,
"frFR": /^(lun\.|mar\.|mer\.|jeu\.|ven\.|sam\.|dim\.)$/gi
},
"abbrMonths": {
"deDE": /^(jan\.|feb\.|märz\.|apr\.|mai\.|juni\.|juli\.|aug\.|sep\.|okt\.|nov\.|dez\.)$/gi,
"en": /^(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)$/gi,
"esES": /^(ene\.|feb\.|mar\.|abr\.|may\.|jun\.|jul\.|ago\.|sep\.|oct\.|nov\.|dic\.)$/gi,
"frFR": /^(janv\.|févr\.|mars\.|avr\.|mai\.|juin\.|juil\.|août\.|sept\.|oct\.|nov\.|dec\.)$/gi
},
"days": {
"deDE": /(montag|dienstag|mittwoch|donnerstag|freitag|samstag|sonntag)/gi,
"en": /(monday|tuesday|wednesday|thursday|friday|saturday|sunday)/gi,
"esES": /(lunes|martes|miércoles|jueves|viernes|sábado|domingo)/gi,
"frFR": /(lundi|mardi|mercredi|jeudi|vendredi|samedi|dimanche)/gi
},
"months": {
"deDE": /^(januar|februar|märz|april|mai|juni|juli|august|september|oktober|november|dezember)$/gi,
"en": /^(january|february|march|april|may|june|july|august|september|october|november|december)$/gi,
"esES": /^(enero|febrero|marzo|abril|mayo|junio|julio|agosto|septiembre|octubre|noviembre|diciembre)$/gi,
"frFR": /^(janvier|février|mars|avril|mai|juin|juillet|août|septembre|octobre|novembre|décembre)$/gi
},
"postFixes": {
"deDE": /^[0-9]+(st|nd|rd|th)$/gi,
"en": /^[0-9]+(st|nd|rd|th)$/gi,
"esES": /^[0-9]+(st|nd|rd|th)$/gi,
"frFR": /^[0-9]+(st|nd|rd|th)$/gi
}
}
To add internationalisation options for a language(s) an object of this form must be passed in. The values within the RegExps need to be parseable by moment otherwise dates will not be identified correctly. The key of the object value pairs should match the locale for the language that is being added. For example to add english as a language you would make the following call.
detector.i18n({
"abbrDays": {
"en": /^(mon|tue|wed|thu|fri|sat|sun)$/gi,
},
"abbrMonths": {
"en": /^(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)$/gi,
},
"days": {
"en": /(monday|tuesday|wednesday|thursday|friday|saturday|sunday)/gi,
},
"months": {
"en": /^(january|february|march|april|may|june|july|august|september|october|november|december)$/gi,
},
"postFixes": {
"en": /^[0-9]+(st|nd|rd|th)$/gi,
}
});
The method returns the detector instance so that chaining can occur.
Return Format
The following object is returned from the detector.typeDetect(dataArray)
call mentioned above. If a given property is not relevant to the type that has been detected then it is set to null.
{
type: null | string;
format: null | string;
locale: null | string;
prefix: null | string;
postfix: null | string;
dp: null | number;
}
| Property | Description |
|:--------:|:-----------:|
| type
| The type that has been determined for the data. |
| format
| If a date has been detected as the type then this property is set to the moment format of that date. |
| locale
| If a date has been detected then this property is set to store the locale of that date. |
| prefix
| If numeric data has been detected then any common prefix across the entire data set is stored on this property. This excludes any numbers that are at the end of the prefix. |
| postfix
| If numeric data has been detected then any common postfix across the entire data set is stored on this property. This excludes any numbers that are at the start of the postfix. |
| dp
| If numeric data has been detected then the largest number of decimal places present within the data set is stored on this property. |
Detectable Types
The following types can be detected by the Type-Detection library. They are listed here in the order that identification is attempted in.
Excel
CloudTables allows the importing of data from excel sheets. When this occurs the data that is read in has both a value and a format within an object.
[{ value: 1.1, excel: '#.#'}, { value: 2.5, excel: '#.#'}, { value: 3.6, excel: '#.#'}]
The data within this set is numeric, and therefore the type returned will be number. The dp
, prefix
and postfix
properties are also determined using a combination of the value
and excel
properties. This page is useful for understanding the different formats that excel can create.
Excel imports can also include dates, in which case a format is also set in the excel
property. In this case the date detection methods implemented within this library are used as the excel date formats are different to the moment formats that Cloudtables uses.
Strings and can also be determined from excel data.
Numbers
The Type-Detection library is capable of identifying a variety of formats of numbers.
- JS numbers (
100
) - Numeric Strings (
'100'
) - Numeric Strings with a prefix, such ascurrency (
'$100'
) - Numeric Strings with a postfix, such as a unit (
'100 cm'
) - Numeric Strings with both a prefix and a postfix (
'$100 per month'
) - All of the above with decmial places,
'.'
by default (100.5, '100.5', '$100.50', '100.5 cm', '$100.50 per month'
) - All of the above (except JS Numbers) with a thousands separator,
','
by default ('100,000.5', '$100,000.50', '100,000.5 cm', '$100,000.50 per month'
)
There is no limit to the size of prefixes or postfixes.
Numbers
The Type-Detection library is capable of identifying sequences of numbers. To do this all of the numbers in the data must be integer values. The must also not include and prefixes or postfixes, but they can include thousands separators.
- JS numbers (
1
,2
,3
) - Numeric Strings (
'1'
,'2'
,'3'
) - Numeric Strings (
'1,000'
,'2,000'
,'3,000'
)
HTML
To determine HTML the string inputs are checked against the following Regular Expression...
/<(“[^”]*”|'[^’]*’|[^'”>])*>/g
A data set does not need to be solely comprised of html code to get the html type. If there are also strings present then an html type can be returned.
Dates
Moment is used to aid in the detection of dates and their formats. Any date containing the following tokens can be detected, but each token should only be used once in the field.
Tokens
| Token | Description | Example | Extra Notes | |:-----:|:-----------:|:-------:|:-----------:| | YYYY | Full Year | 2020, 2021 | Cannot be used with YY | | YY | Short Year | 20, 21 | Cannot be used with YYYY | | MMMM | Full Month | January, February | Cannot be used with MMM, MM, M | | MMM | Abbreviated Month | Jan, Feb | Cannot be used with MMMM, MM, M | | MM | Month as Number | 01, 02 | Cannot be used with MMMM, MMM, M | | M | Month as Short Number | 1, 2 | Cannot be used with MMMM, MMM, MM | | DD | Day as Number | 01, 02 | Cannot be used with D, Do | | D | Day as Short Number | 1, 2 | Cannot be used with DD, Do | | Do | Day with Postfix | 1st, 2nd | Cannot be used with DD, D | | dddd | Day of the Week | Monday, Tuesday | Cannot be used with ddd | | ddd | Abbreviated Day of the Week | Mon, Tue | Cannot be used with dddd | | HH | 24 Hour | 01, 02, ..., 23 | Cannot be used with H, hh, h. If used with A or a will become h | | H | 24 Hour Short | 1, 2, ..., 23 | Cannot be used with HH, hh, h. If used with A or a will become h | | hh | 12 Hour | 01, 02, ..., 12 | Cannot be used with HH, H, h | | h | 12 Hour Short | 1, 2, ..., 12 | Cannot be used with HH, H, hh | | mm | Minutes | 01, 02, ..., 59 | Cannot be used with m | | m | Minutes Short | 1, 2, ..., 59 | Cannot be used with mm | | ss | Seconds | 01, 02, ..., 59 | Cannot be used with s | | s | Seconds Short | 1, 2, ..., 59 | Cannot be used with ss | | A | AM/PM in Capitals | AM, PM | Cannot be used with HH, H | | a | am/pm in lower case | am, pm | Cannot be used with HH, H |
Some Examples
The date detection segment will always return one of the following.
- A moment format - in this case a date has been succesfully identified with a consistent format.
- A mixed type - in this case multiple different date formats have been identified.
- null - no dates could be identified and other types will be checked.
Consider the following array of data.
['Thursday 1st Oct, 2020 11:12:47 AM','Sunday 17th Oct, 1999 10:56:07 PM','Friday 3rd Jan, 2025 9:02:00 AM']
The moment format that is detected is dddd Do MMM, YYYY h:mm:ss A
, which is correct. Some changes can be made to this data to create ambiguity in the format that is used. If the AM/PM part of the dates is removed as follows...
['Thursday 1st Oct, 2020 11:12:47','Sunday 17th Oct, 1999 10:56:07','Friday 3rd Jan, 2025 9:02:00']
Now the moment format that is returned is dddd Do MMM, YYYY H:mm:ss
, Notice the change from 12 hours (h) to 24 hours (H). In this case either token is valid, H is selected as it allows more flexibility if more data is added in the future.
There are a number of cases such as the above where some ambiguity is present - this is why there isn't a pre-existing type detection library in common use. A common example that is used is...
['10/10/10', '11/11/11', '12/12/12']
There isn't a definitive way to derive a format for these tokens, instead a good guess is taken. The moment format that is produced in this case is M/D/Y
which again is valid, but it could be MM/DD/YY
, YY/MM/DD
, YY/D/M
or a whole variety of combinations of the Day, Month and Year. The best that can be done in situations like these is to take a guess with a format that is valid for the entire set.
When it comes to dates in the above format where certainty is not guaranteed the order of priority is Months > Days > Years.
When it comes to time the order of priority is Hours > Minutes > Seconds.
Locales
Differnet locales can also be detected where the tokens are strings rather than numbers (Do
, dddd
etc.). The languages that are currently integrated are
- English (default)
- German
- French
- Spanish
As with other elements of date detection, where a single locale cannot be determined, one will be selected that can apply to all of the data in the set.
The detection of locales is completed using a series of Regular Expressions. These are iterated over in an attempt to find matches.
Mixed
If multiple different types are detected across the entire dataset then a mixed type is returned. This type will also be returned if there are dates with multiple different formats.
Note. The only exception to the mixed rule is a combination of strings and html. In this case a type of html will be returned.
Strings
If no other type can be detected then the type defaults to string.
Performance
The library has been tested with a data set 10,000 elements in length for each type that can be detected. The response times achieved were around 10-20 ms for most types. The exception to this was complex date formats where the response time increased to around 400 ms.
Pull requests are welcome and encouraged to increase the performance of the library.