Topic: Expression to remove diacritics? (Read 1824 times)

chrisjj · « **on:** January 16, 2014, 07:43:33 pm »

Is there some way in an expression to remove diacritics? I can find no explicit function. I need a version of Name without diacritics, so that Remove Duplicates treats values that differ only in diacritics as equal.

Thanks.

MrC · « **Reply #1 on:** January 16, 2014, 08:43:41 pm »

Do you really mean "diacritics" (which are just letters with additional glyph marks), or do you really mean all letters not typical ASCII a-z, A-Z, 0-9, punctuation, etc.?

For example, should the following be accepted or rejected?

ø ß œ Ɣ

To do these types of comparisons, you can normalize the Unicode to a standard decomposed form (where each letter with a diacritic is split into a base letter plus combining characters, and the combining characters are then stripped), with some additional characters such as ß force-converted to S. But more technically correct is to use a level 1 Unicode Collation Algorithm comparison. Both are well beyond MC's capabilities.
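A minimal sketch of the normalize-and-strip approach, for illustration only. It is written in Python using the standard `unicodedata` module, not in MC's expression language, and the function name `strip_diacritics` is made up for this example. It assumes that "diacritic" means a nonspacing combining mark (Unicode category `Mn`) left over after canonical decomposition, and it deliberately does not force-convert letters such as ß.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Return text with combining diacritical marks removed.

    Hypothetical helper: decompose to NFD, drop nonspacing marks
    (category "Mn"), then recompose to NFC.
    """
    # NFD splits e.g. "é" into "e" + U+0301 (combining acute accent).
    decomposed = unicodedata.normalize("NFD", text)
    # Keep everything except nonspacing combining marks.
    stripped = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    # Recompose whatever is left into the usual precomposed form.
    return unicodedata.normalize("NFC", stripped)


if __name__ == "__main__":
    for name in ["Björk", "Dvořák", "ø ß œ Ɣ"]:
        print(name, "->", strip_diacritics(name))
```

Letters such as ø, ß, œ and Ɣ have no canonical decomposition, so this sketch leaves them unchanged. A level 1 Unicode Collation Algorithm comparison may treat some of them differently, so the two approaches are not interchangeable.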

Somewhere I have some code that does this. If you want it, and I can find it, it can be turned into a pscriptor scriptlet so that you could perform this conversion and save the results to an MC field or two, or have it save a comparison result.

chrisjj · « **Reply #2 on:** January 17, 2014, 05:17:52 am »

Quote from: MrC on January 16, 2014, 08:43:41 pm

Do you really mean "diacritics" (which are just letters with additional glyph marks), or do you really mean all letters not typical ASCII a-z, A-Z, 0-9, punctuation, etc.?

I do mean diacritics.

Quote from: MrC on January 16, 2014, 08:43:41 pm

or do you really mean all letters not typical ASCII a-z, A-Z, 0-9, punctuation, etc.

No - I don't want any letters removed.

Quote from: MrC on January 16, 2014, 08:43:41 pm

For example, should the following be accepted or rejected?

ø ß œ Ɣ

Accepted unchanged, since they don't have diacritics.

Quote from: MrC on January 16, 2014, 08:43:41 pm

To do these types of comparisons, you can normalize the Unicode to a standard decomposed form (where each letter with a diacritic is split into a base letter plus combining characters, and the combining characters are then stripped), with some additional characters such as ß force-converted to S. But more technically correct is to use a level 1 Unicode Collation Algorithm comparison. Both are well beyond MC's capabilities.

Thanks.

Quote from: MrC on January 16, 2014, 08:43:41 pm

Somewhere I have some code that does this. If you want it, and I can find it, it can be turned into a pscriptor scriptlet so that you could perform this conversion and save the results to an MC field or two, or have it save a comparison result.

I would very much like that, thanks. I'd missed the announcement of pscriptor. This feature looks awesome. I'll try it now. Even more awesome would be the ability to call it from expressions, but I don't currently see any expression-language call-out function that would allow this.

MrC · « **Reply #3 on:** January 20, 2014, 12:58:43 pm »

Bump. Did you get started with pscriptor, and if so, have you thought about how you'd like to use it for this problem here?

chrisjj · « **Reply #4 on:** January 21, 2014, 08:41:31 am »

Quote from: MrC on January 20, 2014, 12:58:43 pm

Bump. Did you get started with pscriptor,

Not yet. I saw the install procedure is something I'll need to take a bit of time over.
