virastar

v0.21.0

Published

2 years ago

Cleaning-up Persian Texts!


juvee

persian text

Virastar (ویراستار)

Virastar is a Persian text cleaner.

A JavaScript port of aziz/virastar, with lots of help from ebraminio/persiantools.

see live demo

Install

npm install virastar

Usage

var Virastar = require('virastar');
var virastar = new Virastar();

virastar.cleanup("فارسي را كمی درست تر می نويسيم"); // Outputs: "فارسی را کمی درست‌تر می‌نویسیم"

Browser

<script src="lib/virastar.js"></script>
<script>
  var virastar = new Virastar();
  alert(virastar.cleanup("فارسي را كمی درست تر می نويسيم"));
</script>

Virastar([text] [,options])

text

Type: string

The Persian source string to be cleaned.

options

Type: object

Virastar("سلام 123", {"fix_english_numbers": false}); // Outputs: "سلام 123"

Options and Specifications

Virastar comes with a list of options to control its behavior.
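
To illustrate how per-call options relate to the defaults listed below, here is a minimal sketch. `DEFAULTS` and `resolveOptions` are hypothetical names used only for this example and are not part of the virastar API; the three default values are copied from the option list in this README.

```javascript
// Hypothetical subset of the defaults documented in this README.
var DEFAULTS = {
  fix_english_numbers: true,
  fix_hamzeh_arabic: false,
  remove_diacritics: false
};

// Hypothetical helper: values passed by the caller override the defaults.
function resolveOptions(userOptions) {
  return Object.assign({}, DEFAULTS, userOptions || {});
}

var opts = resolveOptions({ fix_english_numbers: false });
console.log(opts.fix_english_numbers); // false (overridden)
console.log(opts.remove_diacritics);   // false (default kept)
```

Any option that is not passed keeps the default shown in its section.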

`normalize_eol`

default: true

replaces windows end-of-line sequences (\r\n) with unix eol (\n)

`decode_htmlentities`

default: true

converts numeric and selected named html character entities into their original characters

`fix_dashes`

default: true

replaces triple dashes with mdash
replaces double dashes with ndash
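
The two rules can be sketched with plain regular expressions. `fixDashes` is a hypothetical name for this example, and this is not necessarily how the library implements the option. The triple dash has to be handled first, otherwise it would be consumed by the double-dash rule.

```javascript
// Illustrative sketch only; not the library's implementation.
function fixDashes(text) {
  return text
    .replace(/-{3}/g, "\u2014")  // three hyphens become an em dash (U+2014)
    .replace(/-{2}/g, "\u2013"); // two hyphens become an en dash (U+2013)
}

console.log(fixDashes("a---b, 1--2") === "a\u2014b, 1\u20132"); // true
```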

`fix_three_dots`

default: true

removes spaces between dots
replaces three dots with ellipsis character
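
A minimal sketch of these two steps, assuming "spaces" means spaces or tabs. `fixThreeDots` is a hypothetical helper name, not a function exported by the library.

```javascript
// Illustrative sketch only; not the library's implementation.
function fixThreeDots(text) {
  return text
    .replace(/\.[ \t]+(?=\.)/g, ".") // drop spaces between consecutive dots
    .replace(/\.{3}/g, "\u2026");    // three dots become the ellipsis character (U+2026)
}

console.log(fixThreeDots("wait. . .") === "wait\u2026"); // true
```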

`normalize_ellipsis`

default: true

replaces more than one ellipsis with one
replaces (space|tab|zwnj) after ellipsis with one space

`normalize_dates`

default: true

re-orders date parts with slash as delimiter

`fix_english_quotes_pairs`

default: true

replaces english quote pairs (“”) with their persian equivalent («»)

`fix_english_quotes`

default: true

replaces english quote marks with their persian equivalent

`fix_hamzeh`

default: true

replaces ه followed by (space|ZWNJ|lrm) followed by ی with هٔ
replaces ه followed by (space|ZWNJ|lrm|nothing) followed by ء with هٔ
replaces هٓ or single-character ۀ with the standard هٔ

`fix_hamzeh_arabic`

default: false

converts arabic hamzeh ة to هٔ

`cleanup_rlm`

default: true

converts Right-to-left marks followed by persian characters to zero-width non-joiners (ZWNJ)

`cleanup_zwnj`

default: true

converts all soft hyphens (U+00AD) into zwnj
removes more than one zwnj
cleans zwnj after characters that don't connect to the next
cleans zwnj before and after numbers, english words, spaces and punctuation
removes unnecessary zwnj on start/end of each line

`fix_arabic_numbers`

default: true

replaces arabic numbers with their persian equivalent

`fix_english_numbers`

default: true

replaces english numbers with their persian equivalent
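
Both number options come down to a digit mapping. The sketch below handles ASCII digits and Arabic-Indic digits (U+0660 to U+0669) in one pass and maps them to Persian digits (U+06F0 to U+06F9). `toPersianDigits` is a hypothetical name for this example; the library applies the two options separately.

```javascript
// Illustrative sketch only; not the library's implementation.
function toPersianDigits(text) {
  return text.replace(/[0-9\u0660-\u0669]/g, function (d) {
    var code = d.charCodeAt(0);
    // ASCII digits start at U+0030, Arabic-Indic digits at U+0660.
    var value = code <= 0x39 ? code - 0x30 : code - 0x0660;
    return String.fromCharCode(0x06F0 + value);
  });
}

console.log(toPersianDigits("123") === "\u06F1\u06F2\u06F3"); // true
```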

`fix_numeral_symbols`

default: true

replaces english percent signs with their persian equivalent (U+066A)
replaces dots between numbers with the decimal separator (U+066B)
replaces commas between numbers with the thousands separator (U+066C)
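
A sketch of the three replacements, assuming the digits have already been converted to Persian digits and that a percent sign counts only when it directly follows a digit. `fixNumeralSymbols` is a hypothetical name, and the exact matching rules in the library may differ.

```javascript
// Illustrative sketch only; not the library's implementation.
var PERSIAN_DIGIT = "[\\u06F0-\\u06F9]";

function fixNumeralSymbols(text) {
  return text
    // digit followed by % : use the Arabic percent sign (U+066A)
    .replace(new RegExp("(" + PERSIAN_DIGIT + ")%", "g"), "$1\u066A")
    // dot between two digits : decimal separator (U+066B)
    .replace(new RegExp("(" + PERSIAN_DIGIT + ")\\.(?=" + PERSIAN_DIGIT + ")", "g"), "$1\u066B")
    // comma between two digits : thousands separator (U+066C)
    .replace(new RegExp("(" + PERSIAN_DIGIT + "),(?=" + PERSIAN_DIGIT + ")", "g"), "$1\u066C");
}

// "1,234.5%" written with Persian digits
var sample = "\u06F1,\u06F2\u06F3\u06F4.\u06F5%";
console.log(fixNumeralSymbols(sample) === "\u06F1\u066C\u06F2\u06F3\u06F4\u066B\u06F5\u066A"); // true
```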

`fix_misc_non_persian_chars`

default: true

replaces arabic normal/swash kaf with its persian equivalent
replaces arabic/urdu/pushtu/uyghur yeh with its persian equivalent
replaces kurdish he with its persian equivalent
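
The kaf and yeh replacements amount to a character map. The sketch below covers only three code points that are certain (Arabic kaf, swash kaf, Arabic yeh); the library's full list is longer. `CHAR_MAP` and `fixNonPersianChars` are hypothetical names for this example.

```javascript
// Illustrative subset only; the library handles more variants than these.
var CHAR_MAP = {
  "\u0643": "\u06A9", // Arabic kaf   -> Persian keheh
  "\u06AA": "\u06A9", // swash kaf    -> Persian keheh
  "\u064A": "\u06CC"  // Arabic yeh   -> Persian (Farsi) yeh
};

function fixNonPersianChars(text) {
  return text.replace(/[\u0643\u06AA\u064A]/g, function (ch) {
    return CHAR_MAP[ch];
  });
}

console.log(fixNonPersianChars("\u0643\u064A") === "\u06A9\u06CC"); // true
```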

`fix_punctuations`

default: true

replaces , and ; with their persian equivalents (، and ؛)

`fix_question_mark`

default: true

replaces question marks with their persian equivalent (؟)

`fix_perfix_spacing`

default: true

puts zwnj between the word and the prefix:
- mi*, nemi*, bi*
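
A simplified sketch of the idea for the mi* and nemi* prefixes only: when the prefix stands at the start of a word and is followed by a space and a Persian letter, the space is swapped for a zwnj (U+200C). `fixPrefixSpacing` is a hypothetical name; a real implementation needs more context to avoid false positives.

```javascript
// Illustrative sketch only; handles "mi" and "nemi", not "bi".
var ZWNJ = "\u200C";

function fixPrefixSpacing(text) {
  // (start or whitespace)(optional noon + meem + yeh) space, then a Persian letter
  return text.replace(/(^|\s)(\u0646?\u0645\u06CC) (?=[\u0621-\u06CC])/g, "$1$2" + ZWNJ);
}

// "mi nevisim" with a plain space becomes the same word joined by a zwnj
var before = "\u0645\u06CC \u0646\u0648\u06CC\u0633\u06CC\u0645";
var after = "\u0645\u06CC\u200C\u0646\u0648\u06CC\u0633\u06CC\u0645";
console.log(fixPrefixSpacing(before) === after); // true
```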

`fix_suffix_spacing`

default: true

puts zwnj between the word and the suffix:
- *ha, *haye
- *am, *at, *ash, *ei, *eid, *eem, *and, *man, *tan, *shan
- *tar, *tari, *tarin
- *hayee, *hayam, *hayat, *hayash, *hayetan, *hayeman, *hayeshan
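
A simplified sketch for a few of these suffixes (*tar, *tari, *tarin, *ha, *haye): a space between a Persian letter and a standalone suffix is swapped for a zwnj. `fixSuffixSpacing` is a hypothetical name, and the pattern is deliberately narrower than the full list above.

```javascript
// Illustrative sketch only; covers tar/tari/tarin and ha/haye.
function fixSuffixSpacing(text) {
  return text.replace(
    /([\u0621-\u06CC]) (\u062A\u0631(?:\u06CC\u0646?)?|\u0647\u0627\u06CC?)(?=\s|$)/g,
    "$1\u200C$2"
  );
}

// "dorost tar" with a plain space becomes "dorost" + zwnj + "tar"
var spaced = "\u062F\u0631\u0633\u062A \u062A\u0631";
var joined = "\u062F\u0631\u0633\u062A\u200C\u062A\u0631";
console.log(fixSuffixSpacing(spaced) === joined); // true
```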

`fix_suffix_misc`

default: true

replaces ه followed by ئ or ی, and then by ی, with ه‌ای

`fix_spacing_for_braces_and_quotes`

default: true

removes spaces inside, and more than one space outside, (), [], {}, “” and «»

`fix_spacing_for_punctuations`

default: true

removes space before punctuation marks
removes more than one space after punctuation marks, except when followed by new-lines
removes space after colon that separates time parts
removes space after dots in numbers
removes space before some common domain tlds
removes space between question and exclamation marks
removes space between same marks

`fix_diacritics`

default: true

cleans zwnj before diacritic characters
cleans runs of more than one diacritic character
cleans spaces before diacritic characters

`remove_diacritics`

default: false

removes all diacritic characters

`fix_persian_glyphs`

default: true

converts incorrect persian glyphs to standard characters

`fix_misc_spacing`

default: true

removes space before parentheses in misc cases
removes space before braces containing numbers

`cleanup_spacing`

default: true

replaces more than one space with just a single one
cleans whitespace/zwnj between new-lines

`cleanup_line_breaks`

default: true

cleans more than two contiguous line breaks
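
A sketch of the space and line-break collapsing described in the last two options, assuming "cleans more than two contiguous line breaks" means reducing them to exactly two. `cleanupSpacingAndBreaks` is a hypothetical name.

```javascript
// Illustrative sketch only; not the library's implementation.
function cleanupSpacingAndBreaks(text) {
  return text
    .replace(/ {2,}/g, " ")      // runs of spaces become a single space
    .replace(/\n{3,}/g, "\n\n"); // three or more line breaks become two
}

console.log(cleanupSpacingAndBreaks("a   b\n\n\n\nc") === "a b\n\nc"); // true
```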

`cleanup_begin_and_end`

default: true

removes space/tab/zwnj/nbsp from the beginning of the new-lines
removes spaces, tabs, zwnj, direction marks and new lines from the beginning and end of text

markdown

`markdown_normalize_braces`

default: true

removes spaces between [] and () ([text] (link) into [text](link))
removes space between ! and opening brace (! [alt](src) into ![alt](src))
removes spaces inside double (), [], {} ([[ text ]] into [[text]])
removes spaces between double (), [], {} ([[text] ] into [[text]])
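
The first three rules can be sketched as follows, using square brackets only for the "double" case. `normalizeMarkdownBraces` is a hypothetical name, and the patterns are simpler than what a real implementation would need.

```javascript
// Illustrative sketch only; not the library's implementation.
function normalizeMarkdownBraces(text) {
  return text
    .replace(/\][ \t]+\(/g, "](")                   // [text] (link)  -> [text](link)
    .replace(/![ \t]+\[/g, "![")                    // ! [alt](src)   -> ![alt](src)
    .replace(/\[\[\s*(.*?)\s*\]\]/g, "[[$1]]");     // [[ text ]]     -> [[text]]
}

console.log(normalizeMarkdownBraces("[text] (link)")); // [text](link)
console.log(normalizeMarkdownBraces("! [alt](src)"));  // ![alt](src)
console.log(normalizeMarkdownBraces("[[ text ]]"));    // [[text]]
```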

`markdown_normalize_lists`

default: true

removes extra lines between two items on a markdown list beginning with -, * or #

`skip_markdown_ordered_lists_numbers_conversion`

default: false

skips converting english numbers of ordered lists in markdown

aggressive editing

`cleanup_extra_marks`

default: true

replaces more than one exclamation mark with just one
replaces more than one english or persian question mark with just one
re-orders consecutive marks: ?! into !?
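
These three rules map directly onto regular expressions. `cleanupExtraMarks` is a hypothetical name; U+061F is the Persian (Arabic) question mark.

```javascript
// Illustrative sketch only; not the library's implementation.
function cleanupExtraMarks(text) {
  return text
    .replace(/!{2,}/g, "!")                // repeated exclamation marks
    .replace(/\?{2,}/g, "?")               // repeated english question marks
    .replace(/\u061F{2,}/g, "\u061F")      // repeated persian question marks
    .replace(/([?\u061F])!/g, "!$1");      // ?! becomes !?
}

console.log(cleanupExtraMarks("what??!!")); // what!?
```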

`kashidas_as_parenthetic`

default: true

replaces kashidas used as parenthetic dashes with ndash

`cleanup_kashidas`

default: true

converts kashidas between numbers to ndash
removes all kashidas between non-whitespace characters
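
A sketch of the two kashida rules, where kashida is the tatweel character U+0640 and "numbers" is assumed to mean ASCII or Persian digits. `cleanupKashidas` is a hypothetical name. The numeric rule runs first so that the kashida in a range such as 1ـ2 is not simply deleted.

```javascript
// Illustrative sketch only; not the library's implementation.
function cleanupKashidas(text) {
  return text
    // kashida between two digits becomes an en dash (U+2013)
    .replace(/([0-9\u06F0-\u06F9])\u0640+(?=[0-9\u06F0-\u06F9])/g, "$1\u2013")
    // any remaining kashida between non-whitespace characters is removed
    .replace(/(\S)\u0640+(?=\S)/g, "$1");
}

console.log(cleanupKashidas("1\u06402") === "1\u20132"); // true
```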

extras

`preserve_frontmatter`

default: true

preserves frontmatter data in the text

`preserve_HTML`

default: true

preserves all html tags in the text

`preserve_comments`

default: true

preserves all html comments in the text

`preserve_entities`

default: true

preserves all html entities in the text

`preserve_URIs`

default: true

preserves all uri strings in the text

`preserve_brackets`

default: false

preserves strings inside square brackets ([])

`preserve_braces`

default: false

preserves strings inside curly braces ({})

`preserve_nbsps`

default: true

preserves all no-break space entities in the text
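
One common way to implement this kind of preservation is to swap each protected span for a placeholder, run the cleanup, and then restore the originals. The sketch below shows that pattern under stated assumptions; it is not confirmed to be the approach the library takes. `withPreserved` is a hypothetical helper, and the placeholders are private-use characters so that digit or punctuation rules cannot alter them (it assumes the input contains no private-use characters of its own).

```javascript
// Illustrative sketch only; not the library's implementation.
function withPreserved(text, pattern, transform) {
  var saved = [];
  // Replace each protected match with a single private-use character.
  var masked = text.replace(pattern, function (match) {
    saved.push(match);
    return String.fromCharCode(0xE000 + saved.length - 1);
  });
  // Run the cleanup, then put the original spans back.
  return transform(masked).replace(/[\uE000-\uF8FF]/g, function (ch) {
    return saved[ch.charCodeAt(0) - 0xE000];
  });
}

// Upper-casing stands in for a cleanup pass; the URI is left untouched.
var result = withPreserved(
  "see http://a.io/x now",
  /https?:\/\/\S+/g,
  function (s) { return s.toUpperCase(); }
);
console.log(result); // SEE http://a.io/x NOW
```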

License

This software is licensed under the MIT License. View the license.
