INTERACT FORUM

Author Topic: Expression to remove diacritics?  (Read 1522 times)

chrisjj

  • Citizen of the Universe
  • *****
  • Posts: 750
Expression to remove diacritics?
« on: January 16, 2014, 07:43:33 pm »

Is there some way in an expression to remove diacritics? I can find no explicit function. I need a version of Name without diacritics, so that Remove Duplicates treats values that differ only in diacritics as equal.

Thanks.

MrC

  • Citizen of the Universe
  • *****
  • Posts: 10462
  • Your life is short. Give me your money.
Re: Expression to remove diacritics?
« Reply #1 on: January 16, 2014, 08:43:41 pm »

Do you really mean "diacritics" (which are just letters with additional glyph marks), or do you really mean all letters not typical ASCII a-z, A-Z, 0-9, punctuation, etc.?

For example, should the following be accepted or rejected?

    ø  ß  œ Ɣ

To do these types of comparisons, you can normalize the Unicode to a decomposed standard form (where letters with diacritics are split into a base letter plus combining characters, and the combining characters are then stripped), with some additional characters such as ß force-converted to S.  But the more technically correct approach is a level 1 (primary strength) Unicode Collation Algorithm comparison.  Both are well beyond MC's capabilities.

Somewhere I have some code that does this - if you want it, and I can find it, it can be turned into a pscriptor scriptlet so that you could perform this conversion and save the results to an MC field or two, or have it save a comparison result.
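
For anyone reading along, here is a minimal sketch of the normalize-and-strip approach described above, in Python. This is not the code mentioned in this post and it is not something an MC expression can run; the function name strip_diacritics is made up for illustration. It assumes that "diacritic" means a nonspacing combining mark (Unicode category Mn), which suits Latin-script names but would also drop vowel signs in some other scripts.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Return text with nonspacing combining marks removed.

    Hypothetical helper: decompose to NFD so that a letter such as
    e-acute becomes 'e' followed by a combining acute accent, drop
    every character in category Mn, then recompose to NFC.
    """
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return unicodedata.normalize("NFC", kept)


if __name__ == "__main__":
    # Two spellings that should count as duplicates after stripping.
    names = ["Dvo\u0159\u00e1k", "Dvorak"]
    print({strip_diacritics(name) for name in names})
```

Using the stripped string only as a comparison key, and leaving the original Name untouched, keeps the displayed value intact while letting a duplicate check see the two spellings as equal.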
The opinions I express represent my own folly.

chrisjj

  • Citizen of the Universe
  • *****
  • Posts: 750
Re: Expression to remove diacritics?
« Reply #2 on: January 17, 2014, 05:17:52 am »

Do you really mean "diacritics" (which are just letters with additional glyph marks), or do you really mean all letters not typical ASCII a-z, A-Z, 0-9, punctuation, etc.?

I do mean diacritics.

or do you really mean all letters not typical ASCII a-z, A-Z, 0-9, punctuation, etc.

No - I don't want any letters removed.

For example, should the following be accepted or rejected?

    ø  ß  œ Ɣ


Accepted unchanged, since they don't have diacritics.
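
A quick way to check this, sketched in Python on the assumption that "has a diacritic" can be approximated by "has a canonical Unicode decomposition" (the helper name below is hypothetical): the four characters above have no canonical decomposition, so a decompose-and-strip pass would leave them alone, whereas letters such as é and ñ do decompose.

```python
import unicodedata


def has_canonical_decomposition(ch: str) -> bool:
    """True if ch splits into other code points under canonical (NFD) rules.

    unicodedata.decomposition() returns an empty string when there is no
    mapping, and prefixes compatibility-only mappings with a tag such as
    '<compat>', so both of those cases are excluded here.
    """
    mapping = unicodedata.decomposition(ch)
    return bool(mapping) and not mapping.startswith("<")


if __name__ == "__main__":
    # The four letters from the question, then e-acute and n-tilde.
    for ch in "\u00f8\u00df\u0153\u0194\u00e9\u00f1":
        print(ch, has_canonical_decomposition(ch))
```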

To do these types of comparisons, you can normalize the Unicode to a decomposed standard form (where letters with diacritics are split into a base letter plus combining characters, and the combining characters are then stripped), with some additional characters such as ß force-converted to S.  But the more technically correct approach is a level 1 (primary strength) Unicode Collation Algorithm comparison.  Both are well beyond MC's capabilities.

Thanks.

Somewhere I have some code that does this - if you want it, and I can find it, it can be turned into a pscriptor scriptlet so that you could perform this conversion and save the results to an MC field or two, or have it save a comparison result.

I would very much like that - thanks. I'd missed the announcement of pscriptor. This feature looks awesome. I'll try it now. Even more awesome would be the ability to call it from expressions, but I don't currently see any expression-language call-out function that would allow this.

MrC

  • Citizen of the Universe
  • *****
  • Posts: 10462
  • Your life is short. Give me your money.
Re: Expression to remove diacritics?
« Reply #3 on: January 20, 2014, 12:58:43 pm »

Bump.  Did you get started with pscriptor, and if so, have you thought about how you'd like to use it for this problem here?

chrisjj

  • Citizen of the Universe
  • *****
  • Posts: 750
Re: Expression to remove diacritics?
« Reply #4 on: January 21, 2014, 08:41:31 am »

Bump.  Did you get started with pscriptor,

Not yet. I saw the install procedure is something I'll need to take a bit of time over.