<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Writing - Ashwin Gopalsamy</title>
    <link>https://ashwingopalsamy.in/writing/tags/unicode/</link>
    <description>Staff Software Engineer scaling authorization infrastructure at Pismo, Visa.</description>
    <language>en-us</language>
    <lastBuildDate>Sat, 09 Aug 2025 00:00:00 &#43;0000</lastBuildDate>
    
    <atom:link href="https://ashwingopalsamy.in/writing/tags/unicode/feed.xml" rel="self" type="application/rss+xml" />
    
    
    <item>
      <title>Runes, Bytes, and Graphemes in Go</title>
      <link>https://ashwingopalsamy.in/writing/notes/runes-bytes-and-graphemes-in-go/</link>
      <pubDate>Sat, 09 Aug 2025 00:00:00 &#43;0000</pubDate>
      <guid isPermaLink="true">https://ashwingopalsamy.in/writing/notes/runes-bytes-and-graphemes-in-go/</guid>
      <description>&lt;p&gt;I first had to tell runes, bytes, and graphemes apart while handling Tamil names and emoji in a Go web app: a string that &lt;em&gt;looked&lt;/em&gt; short wasn&amp;rsquo;t, and reversing it produced gibberish. The culprit wasn&amp;rsquo;t a flaw in Go; it was my own assumptions about what &amp;ldquo;a character&amp;rdquo; means.&lt;/p&gt;
&lt;p&gt;Let&amp;rsquo;s map the territory precisely.&lt;/p&gt;
&lt;div class=&#34;diagram-container&#34; data-diagram-type=&#34;mermaid&#34;&gt;
  &lt;div class=&#34;mermaid&#34;&gt;
    
graph TD
    G[&#34;Grapheme Cluster&lt;br/&gt;(what users see)&#34;]
    R1[&#34;Rune 1&lt;br/&gt;(code point)&#34;]
    R2[&#34;Rune 2&lt;br/&gt;(combining mark)&#34;]
    B1[&#34;Bytes&lt;br/&gt;(1-4 per rune)&#34;]
    B2[&#34;Bytes&lt;br/&gt;(1-4 per rune)&#34;]
    G --&gt; R1
    G --&gt; R2
    R1 --&gt; B1
    R2 --&gt; B2
    style G fill:#10b981,color:#fff,stroke:none
    style R1 fill:#6366f1,color:#fff,stroke:none
    style R2 fill:#6366f1,color:#fff,stroke:none
    style B1 fill:#64748b,color:#fff,stroke:none
    style B2 fill:#64748b,color:#fff,stroke:none

  &lt;/div&gt;
&lt;/div&gt;

&lt;h2 id=&#34;1-bytes-the-raw-material-go-calls-a-string&#34;&gt;1. Bytes: the raw material Go calls a string&lt;/h2&gt;
&lt;p&gt;A Go string is an immutable sequence of bytes, and text in it is conventionally UTF-8. What we &lt;em&gt;see&lt;/em&gt; on screen isn&amp;rsquo;t what Go stores under the hood.&lt;/p&gt;</description>
      <content:encoded>&lt;p&gt;I first had to tell runes, bytes, and graphemes apart while handling Tamil names and emoji in a Go web app: a string that &lt;em&gt;looked&lt;/em&gt; short wasn&amp;rsquo;t, and reversing it produced gibberish. The culprit wasn&amp;rsquo;t a flaw in Go; it was my own assumptions about what &amp;ldquo;a character&amp;rdquo; means.&lt;/p&gt;
&lt;p&gt;Let&amp;rsquo;s map the territory precisely.&lt;/p&gt;
&lt;div class=&#34;diagram-container&#34; data-diagram-type=&#34;mermaid&#34;&gt;
  &lt;div class=&#34;mermaid&#34;&gt;
    
graph TD
    G[&#34;Grapheme Cluster&lt;br/&gt;(what users see)&#34;]
    R1[&#34;Rune 1&lt;br/&gt;(code point)&#34;]
    R2[&#34;Rune 2&lt;br/&gt;(combining mark)&#34;]
    B1[&#34;Bytes&lt;br/&gt;(1-4 per rune)&#34;]
    B2[&#34;Bytes&lt;br/&gt;(1-4 per rune)&#34;]
    G --&gt; R1
    G --&gt; R2
    R1 --&gt; B1
    R2 --&gt; B2
    style G fill:#10b981,color:#fff,stroke:none
    style R1 fill:#6366f1,color:#fff,stroke:none
    style R2 fill:#6366f1,color:#fff,stroke:none
    style B1 fill:#64748b,color:#fff,stroke:none
    style B2 fill:#64748b,color:#fff,stroke:none

  &lt;/div&gt;
&lt;/div&gt;

&lt;h2 id=&#34;1-bytes-the-raw-material-go-calls-a-string&#34;&gt;1. Bytes: the raw material Go calls a string&lt;/h2&gt;
&lt;p&gt;A Go string is an immutable sequence of bytes, and text in it is conventionally UTF-8. What we &lt;em&gt;see&lt;/em&gt; on screen isn&amp;rsquo;t what Go stores under the hood.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-go&#34; data-lang=&#34;go&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nx&#34;&gt;s&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;:=&lt;/span&gt; &lt;span class=&#34;s&#34;&gt;&amp;#34;வணக்கம்&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nx&#34;&gt;fmt&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;nf&#34;&gt;Println&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;nb&#34;&gt;len&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;nx&#34;&gt;s&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;))&lt;/span&gt; &lt;span class=&#34;c1&#34;&gt;// 21&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;code&gt;len&lt;/code&gt; reports 21 bytes, not the number of visible symbols. Each Tamil code point in this word takes 3 bytes in UTF-8, and even a simple-looking emoji takes several.&lt;/p&gt;
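&lt;p&gt;To see where the 21 comes from, here is a minimal sketch that records how many bytes each code point occupies. The helper name &lt;code&gt;byteWidths&lt;/code&gt; is my own for illustration, not something from the standard library.&lt;/p&gt;

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// byteWidths returns how many bytes each code point of s occupies in UTF-8.
// The name is illustrative; it is not a standard library function.
func byteWidths(s string) []int {
	var widths []int
	for len(s) > 0 {
		// DecodeRuneInString reports the size of the first encoded code point.
		_, size := utf8.DecodeRuneInString(s)
		widths = append(widths, size)
		s = s[size:]
	}
	return widths
}

func main() {
	s := "வணக்கம்"
	fmt.Println(len(s))        // 21
	fmt.Println(byteWidths(s)) // [3 3 3 3 3 3 3]
	fmt.Printf("% x\n", s[:3]) // e0 ae b5, the three bytes of வ (U+0BB5)
}
```

&lt;p&gt;Seven code points at three bytes each gives the 21 that &lt;code&gt;len&lt;/code&gt; reports.&lt;/p&gt;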
&lt;h2 id=&#34;2-runes-unicode-code-points&#34;&gt;2. Runes: Unicode code points&lt;/h2&gt;
&lt;p&gt;Converting a &lt;code&gt;string&lt;/code&gt; to &lt;code&gt;[]rune&lt;/code&gt; gives you code points, but still not what a human perceives.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-go&#34; data-lang=&#34;go&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nx&#34;&gt;rs&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;:=&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;[]&lt;/span&gt;&lt;span class=&#34;nb&#34;&gt;rune&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;nx&#34;&gt;s&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nx&#34;&gt;fmt&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;nf&#34;&gt;Println&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;nb&#34;&gt;len&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;nx&#34;&gt;rs&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;))&lt;/span&gt; &lt;span class=&#34;c1&#34;&gt;// 7&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;That is 7 runes, but a reader sees fewer characters than that: some Tamil graphemes (like &amp;ldquo;க்&amp;rdquo;) are made of two runes, &lt;code&gt;க&lt;/code&gt; + &lt;code&gt;்&lt;/code&gt;.&lt;/p&gt;
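&lt;p&gt;A &lt;code&gt;for range&lt;/code&gt; loop over a string also walks code points, and it yields the byte offset of each one. The sketch below uses a helper I named &lt;code&gt;runeOffsets&lt;/code&gt; (hypothetical, for illustration) to make those offsets visible.&lt;/p&gt;

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// runeOffsets returns the byte offset at which each code point of s starts.
// The name is illustrative only.
func runeOffsets(s string) []int {
	var offsets []int
	for i := range s {
		offsets = append(offsets, i)
	}
	return offsets
}

func main() {
	s := "வணக்கம்"
	fmt.Println(utf8.RuneCountInString(s)) // 7, without allocating a []rune
	fmt.Println(runeOffsets(s))            // [0 3 6 9 12 15 18]
	for i, r := range s {
		// i is a byte offset, r is the code point that starts there.
		fmt.Printf("byte %2d: %U %q\n", i, r, r)
	}
}
```

&lt;p&gt;The two virama code points (U+0BCD) show up as runes of their own, which is why the rune count is higher than the number of characters a reader sees.&lt;/p&gt;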
&lt;h2 id=&#34;3-grapheme-clusters-the-units-users-actually-see&#34;&gt;3. Grapheme clusters: the units users actually see&lt;/h2&gt;
&lt;p&gt;Go&amp;rsquo;s standard library stops at runes. To work with visible characters, you need a grapheme-aware library like &lt;code&gt;github.com/rivo/uniseg&lt;/code&gt;.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-go&#34; data-lang=&#34;go&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;k&#34;&gt;for&lt;/span&gt; &lt;span class=&#34;nx&#34;&gt;gr&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;:=&lt;/span&gt; &lt;span class=&#34;nx&#34;&gt;uniseg&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;nf&#34;&gt;NewGraphemes&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;nx&#34;&gt;s&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;);&lt;/span&gt; &lt;span class=&#34;nx&#34;&gt;gr&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;nf&#34;&gt;Next&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;();&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;nx&#34;&gt;fmt&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;nf&#34;&gt;Printf&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s&#34;&gt;&amp;#34;%q\n&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;nx&#34;&gt;gr&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;nf&#34;&gt;Str&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;())&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;p&#34;&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;That prints what a human reads: &amp;ldquo;வ&amp;rdquo;, &amp;ldquo;ண&amp;rdquo;, &amp;ldquo;க்&amp;rdquo;, &amp;ldquo;க&amp;rdquo;, &amp;ldquo;ம்&amp;rdquo;. An emoji built from several code points, such as a heart followed by a variation selector, likewise comes out as a single unit.&lt;/p&gt;
&lt;div class=&#34;diagram-container&#34; data-diagram-type=&#34;mermaid&#34;&gt;
  &lt;div class=&#34;mermaid&#34;&gt;
    
graph LR
    T[&#34;வணக்கம் (Tamil)&#34;]
    T --&gt; |&#34;len()&#34;| BY[&#34;21 bytes&#34;]
    T --&gt; |&#34;[]rune&#34;| RU[&#34;7 runes&#34;]
    T --&gt; |&#34;uniseg&#34;| GR[&#34;5 graphemes&#34;]
    style T fill:#10b981,color:#fff,stroke:none
    style BY fill:#ef4444,color:#fff,stroke:none
    style RU fill:#f59e0b,color:#fff,stroke:none
    style GR fill:#10b981,color:#fff,stroke:none

  &lt;/div&gt;
&lt;/div&gt;

&lt;h2 id=&#34;why-this-matters&#34;&gt;Why this matters&lt;/h2&gt;
&lt;p&gt;If your app deals with names, chats, or any multilingual text, indexing or slicing by bytes can cut a code point in half. Counting runes helps but can still split what a reader sees as one unit. Grapheme-aware operations match what users actually expect.&lt;/p&gt;
&lt;p&gt;Real bugs I&amp;rsquo;ve seen: Tamil names chopped mid-character, emoji reactions breaking because only one code point was taken.&lt;/p&gt;
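&lt;p&gt;Both failure modes are easy to reproduce. This is a minimal sketch with two hypothetical helpers, &lt;code&gt;truncateBytes&lt;/code&gt; and &lt;code&gt;truncateRunes&lt;/code&gt;, that assume a non-negative &lt;code&gt;n&lt;/code&gt;; neither is how you should truncate user-facing text.&lt;/p&gt;

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// truncateBytes cuts s after n bytes, ignoring code point boundaries.
// Illustrative only; n is assumed to be non-negative.
func truncateBytes(s string, n int) string {
	if n > len(s) {
		n = len(s)
	}
	return s[:n]
}

// truncateRunes cuts s after n code points, ignoring grapheme boundaries.
// Illustrative only; n is assumed to be non-negative.
func truncateRunes(s string, n int) string {
	rs := []rune(s)
	if n > len(rs) {
		n = len(rs)
	}
	return string(rs[:n])
}

func main() {
	s := "வணக்கம்"

	b := truncateBytes(s, 4)
	fmt.Println(utf8.ValidString(b)) // false: the cut lands inside ண

	r := truncateRunes(s, 3)
	// Valid UTF-8, but the third character has lost its virama.
	fmt.Printf("%q %+q\n", r, r)
}
```

&lt;p&gt;The byte cut produces invalid UTF-8; the rune cut produces valid text that no longer says what the user typed.&lt;/p&gt;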
&lt;h2 id=&#34;quick-reference&#34;&gt;Quick reference&lt;/h2&gt;
&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Task&lt;/th&gt;
          &lt;th&gt;Approach&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;Count code points&lt;/td&gt;
          &lt;td&gt;&lt;code&gt;utf8.RuneCountInString(s)&lt;/code&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Count visible units&lt;/td&gt;
          &lt;td&gt;Grapheme iteration (&lt;code&gt;uniseg&lt;/code&gt;)&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Reverse text&lt;/td&gt;
          &lt;td&gt;Parse into graphemes, reverse slice, join&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Slice safely&lt;/td&gt;
          &lt;td&gt;Only use &lt;code&gt;s[i:j]&lt;/code&gt; on grapheme boundaries&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
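&lt;p&gt;As a rough illustration of the &amp;ldquo;reverse text&amp;rdquo; row, here is a standard-library-only sketch. It approximates grapheme clusters by attaching every combining mark to the code point before it, which happens to be enough for the Tamil word used above. It is not a full implementation of Unicode text segmentation (no ZWJ emoji sequences, flags, or Hangul), so treat it as a teaching aid and reach for a library such as &lt;code&gt;uniseg&lt;/code&gt; in real code. The function names are my own.&lt;/p&gt;

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// clusters splits s into approximate grapheme clusters by gluing each
// combining mark (categories Mn, Mc, Me) onto the preceding code point.
// Simplified on purpose: this is not the full UAX #29 algorithm, and it
// assumes s is valid UTF-8.
func clusters(s string) []string {
	var out []string
	for _, r := range s {
		if unicode.IsMark(r) {
			if len(out) > 0 {
				out[len(out)-1] += string(r)
				continue
			}
		}
		out = append(out, string(r))
	}
	return out
}

// reverseClusters reverses s cluster by cluster instead of rune by rune.
func reverseClusters(s string) string {
	cs := clusters(s)
	for i, j := 0, len(cs)-1; j > i; i, j = i+1, j-1 {
		cs[i], cs[j] = cs[j], cs[i]
	}
	return strings.Join(cs, "")
}

func main() {
	s := "வணக்கம்"
	fmt.Printf("%d %q\n", len(clusters(s)), clusters(s)) // 5 clusters
	fmt.Println(reverseClusters(s))
}
```

&lt;p&gt;Because each virama travels with its consonant, the reversed string still renders as whole Tamil characters rather than detached marks.&lt;/p&gt;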
&lt;p&gt;Think about what you intend to manipulate: the raw bytes, the code points, or what a user actually reads on screen, and choose the right level.&lt;/p&gt;
</content:encoded>
      <author>ashwin@ashwingopalsamy.in (Ashwin Gopalsamy)</author>
    </item>
    
    <item>
      <title>Go&#39;s UTF-8 Identifier Limitation</title>
      <link>https://ashwingopalsamy.in/writing/gos-utf-8-identifier-limitation/</link>
      <pubDate>Tue, 12 Nov 2024 00:00:00 &#43;0000</pubDate>
      <guid isPermaLink="true">https://ashwingopalsamy.in/writing/gos-utf-8-identifier-limitation/</guid>
      <description>&lt;p&gt;I&amp;rsquo;ve been exploring Go&amp;rsquo;s UTF-8 support lately and was curious how well it handles non-Latin scripts in source code. This post walks through what I found.&lt;/p&gt;
&lt;h2 id=&#34;go-and-utf-8&#34;&gt;Go and UTF-8&lt;/h2&gt;
&lt;p&gt;Go source files are always UTF-8 encoded. This means you can, in theory, use Unicode characters in your variable names, function names and more.&lt;/p&gt;
&lt;p&gt;For example, in the official Go playground &lt;a href=&#34;https://go.dev/play/&#34; rel=&#34;noopener noreferrer&#34; target=&#34;_blank&#34;&gt;boilerplate code&lt;/a&gt;, you might come across code like this:&lt;/p&gt;</description>
      <content:encoded>&lt;p&gt;I&amp;rsquo;ve been exploring Go&amp;rsquo;s UTF-8 support lately and was curious how well it handles non-Latin scripts in source code. This post walks through what I found.&lt;/p&gt;
&lt;h2 id=&#34;go-and-utf-8&#34;&gt;Go and UTF-8&lt;/h2&gt;
&lt;p&gt;Go source files are always UTF-8 encoded. This means you can, in theory, use Unicode characters in your variable names, function names and more.&lt;/p&gt;
&lt;p&gt;For example, in the official Go playground &lt;a href=&#34;https://go.dev/play/&#34; rel=&#34;noopener noreferrer&#34; target=&#34;_blank&#34;&gt;boilerplate code&lt;/a&gt;, you might come across code like this:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-go&#34; data-lang=&#34;go&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;kn&#34;&gt;package&lt;/span&gt; &lt;span class=&#34;nx&#34;&gt;main&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;s&#34;&gt;&amp;#34;fmt&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;kd&#34;&gt;func&lt;/span&gt; &lt;span class=&#34;nf&#34;&gt;main&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;nx&#34;&gt;消息&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;:=&lt;/span&gt; &lt;span class=&#34;s&#34;&gt;&amp;#34;Hello, World!&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;nx&#34;&gt;fmt&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;nf&#34;&gt;Println&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;nx&#34;&gt;消息&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;p&#34;&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Here, &lt;code&gt;消息&lt;/code&gt; is Chinese for &amp;ldquo;message.&amp;rdquo; Go accepts it without any issues, thanks to its Unicode support. This capability may be one reason Go has gained popularity in countries like China and Japan &amp;ndash; developers can write code using identifiers meaningful in their own languages.&lt;/p&gt;
&lt;p&gt;Writing code with native-language identifiers has a real following in China, and I love it.&lt;/p&gt;
&lt;div class=&#34;diagram-container&#34; data-diagram-type=&#34;mermaid&#34;&gt;
  &lt;div class=&#34;mermaid&#34;&gt;
    
flowchart TD
    A[&#34;Character&#34;] --&gt; B{&#34;Letter or _\n(Lu/Ll/Lt/Lm/Lo)?&#34;}
    B -- Yes --&gt; C[&#34;Valid identifier start&#34;]
    B -- No --&gt; D{&#34;Digit\n(Nd)?&#34;}
    D -- Yes --&gt; E[&#34;Valid continuation only&#34;]
    D -- No --&gt; F[&#34;Invalid in identifiers\n(includes marks Mn/Mc/Me)&#34;]

    style C fill:#10b981,color:#fff,stroke:none
    style E fill:#f59e0b,color:#fff,stroke:none
    style F fill:#ef4444,color:#fff,stroke:none
    style A fill:#64748b,color:#fff,stroke:none
    style B fill:#6366f1,color:#fff,stroke:none
    style D fill:#6366f1,color:#fff,stroke:none

  &lt;/div&gt;
&lt;/div&gt;

&lt;h2 id=&#34;attempting-to-use-tamil-identifiers&#34;&gt;Attempting to Use Tamil Identifiers&lt;/h2&gt;
&lt;p&gt;Naturally, I wanted to try this out with Tamil, my mother tongue. &lt;a href=&#34;https://en.wikipedia.org/wiki/Tamil_language&#34; rel=&#34;noopener noreferrer&#34; target=&#34;_blank&#34;&gt;Tamil&lt;/a&gt;, one of the world&amp;rsquo;s oldest languages, is spoken by over 85 million people globally and uses a non-Latin script quite distinct from widely used scripts like Chinese. While coding in Tamil isn&amp;rsquo;t common even in regions where it&amp;rsquo;s spoken, its unique structure made it an intriguing choice for my experiment with Go&amp;rsquo;s Unicode support.&lt;/p&gt;
&lt;p&gt;Here&amp;rsquo;s a simple example I wrote:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-go&#34; data-lang=&#34;go&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;kn&#34;&gt;package&lt;/span&gt; &lt;span class=&#34;nx&#34;&gt;main&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;s&#34;&gt;&amp;#34;fmt&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;kd&#34;&gt;func&lt;/span&gt; &lt;span class=&#34;nf&#34;&gt;main&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;nx&#34;&gt;எண்ண&lt;/span&gt;&lt;span class=&#34;err&#34;&gt;ி&lt;/span&gt;&lt;span class=&#34;nx&#34;&gt;க்க&lt;/span&gt;&lt;span class=&#34;err&#34;&gt;ை&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;:=&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;42&lt;/span&gt; &lt;span class=&#34;c1&#34;&gt;// &amp;#34;எண்ணிக்கை&amp;#34; means &amp;#34;number&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;nx&#34;&gt;fmt&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;nf&#34;&gt;Println&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s&#34;&gt;&amp;#34;Value:&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;nx&#34;&gt;எண்ண&lt;/span&gt;&lt;span class=&#34;err&#34;&gt;ி&lt;/span&gt;&lt;span class=&#34;nx&#34;&gt;க்க&lt;/span&gt;&lt;span class=&#34;err&#34;&gt;ை&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;p&#34;&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;At first glance, this looks straightforward and should run without errors.&lt;/p&gt;
&lt;p&gt;But when I tried to compile it, I got these errors:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;./prog.go:6:11: invalid character U+0BCD &amp;#39;்&amp;#39; in identifier
./prog.go:6:17: invalid character U+0BBF &amp;#39;ி&amp;#39; in identifier
./prog.go:6:23: invalid character U+0BCD &amp;#39;்&amp;#39; in identifier
./prog.go:6:29: invalid character U+0BC8 &amp;#39;ை&amp;#39; in identifier
./prog.go:7:33: invalid character U+0BCD &amp;#39;்&amp;#39; in identifier
./prog.go:7:39: invalid character U+0BBF &amp;#39;ி&amp;#39; in identifier
./prog.go:7:45: invalid character U+0BCD &amp;#39;்&amp;#39; in identifier
./prog.go:7:51: invalid character U+0BC8 &amp;#39;ை&amp;#39; in identifier
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;understanding-the-issue-with-tamil-combining-marks&#34;&gt;Understanding the Issue with Tamil Combining Marks&lt;/h3&gt;
&lt;p&gt;To understand what&amp;rsquo;s going on, it&amp;rsquo;s essential to know a bit about how Tamil script works.&lt;/p&gt;
&lt;p&gt;Tamil is an &lt;a href=&#34;https://en.wikipedia.org/wiki/Abugida&#34; rel=&#34;noopener noreferrer&#34; target=&#34;_blank&#34;&gt;abugida&lt;/a&gt; &amp;ndash; a writing system where each consonant-vowel sequence is written as a unit. In Unicode, this often involves combining a base consonant character with one or more combining marks that represent vowels or other modifiers.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;For example:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The Tamil letter &lt;code&gt;க&lt;/code&gt; (U+0B95) represents the consonant sound &amp;ldquo;ka&amp;rdquo;&lt;/li&gt;
&lt;li&gt;To represent &amp;ldquo;ki&amp;rdquo; you&amp;rsquo;d combine &lt;code&gt;க&lt;/code&gt; with the vowel sign &lt;code&gt;ி&lt;/code&gt; (U+0BBF), resulting in &lt;code&gt;கி&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;The vowel sign &lt;code&gt;ி&lt;/code&gt; is a &lt;strong&gt;combining mark&lt;/strong&gt;, classified in Unicode as a Spacing Mark (&lt;code&gt;Mc&lt;/code&gt;); the virama &lt;code&gt;்&lt;/code&gt; (U+0BCD) from the errors above is a &lt;a href=&#34;https://www.compart.com/en/unicode/category/Mn&#34; rel=&#34;noopener noreferrer&#34; target=&#34;_blank&#34;&gt;Non-Spacing Mark&lt;/a&gt; (&lt;code&gt;Mn&lt;/code&gt;)&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;heres-where-the-problem-arises&#34;&gt;Here&amp;rsquo;s where the problem arises&lt;/h3&gt;


&lt;div class=&#34;callout callout--info&#34;&gt;
  &lt;div class=&#34;callout-icon&#34;&gt;ℹ&lt;/div&gt;
  &lt;div class=&#34;callout-body&#34;&gt;
    Go&amp;rsquo;s language specification allows Unicode letters in identifiers but excludes combining marks. Specifically, an identifier may contain characters in the &amp;ldquo;Letter&amp;rdquo; categories (&lt;code&gt;Lu&lt;/code&gt;, &lt;code&gt;Ll&lt;/code&gt;, &lt;code&gt;Lt&lt;/code&gt;, &lt;code&gt;Lm&lt;/code&gt;, &lt;code&gt;Lo&lt;/code&gt;), the underscore, and, after the first character, decimal digits (&lt;code&gt;Nd&lt;/code&gt;). Combining marks (categories &lt;code&gt;Mn&lt;/code&gt;, &lt;code&gt;Mc&lt;/code&gt;, &lt;code&gt;Me&lt;/code&gt;) are not allowed anywhere.
  &lt;/div&gt;
&lt;/div&gt;
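&lt;p&gt;You can check this rule without invoking the compiler. The sketch below uses &lt;code&gt;token.IsIdentifier&lt;/code&gt; from the standard &lt;code&gt;go/token&lt;/code&gt; package, plus a small helper of my own, &lt;code&gt;firstInvalid&lt;/code&gt; (hypothetical name), that reports which code point breaks the rule. The helper only models the letter/digit part of the rule; it does not reject keywords or the empty string.&lt;/p&gt;

```go
package main

import (
	"fmt"
	"go/token"
	"unicode"
)

// firstInvalid returns the first code point in name that the identifier
// rule rejects, and false if every code point is acceptable.
// Illustrative helper: it ignores keywords and the empty string.
func firstInvalid(name string) (rune, bool) {
	for i, r := range name {
		switch {
		case unicode.IsLetter(r), r == '_':
			continue // letters and underscore are fine anywhere
		case unicode.IsDigit(r):
			if i > 0 {
				continue // digits are fine after the first character
			}
		}
		return r, true
	}
	return 0, false
}

func main() {
	for _, name := range []string{"消息", "எண", "எண்ணிக்கை"} {
		fmt.Printf("%s: identifier=%v\n", name, token.IsIdentifier(name))
		if r, bad := firstInvalid(name); bad {
			fmt.Printf("  first rejected code point: %U\n", r)
		}
	}
}
```

&lt;p&gt;The first two names pass because they contain only letters. The third is rejected at U+0BCD, the same code point the compiler complained about first.&lt;/p&gt;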

&lt;h3 id=&#34;examples-of-combining-marks-in-tamil&#34;&gt;Examples of Combining Marks in Tamil&lt;/h3&gt;
&lt;p&gt;Let&amp;rsquo;s look at how Tamil characters are formed:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Standalone Consonant:&lt;/strong&gt; &lt;code&gt;க&lt;/code&gt; (U+0B95) &amp;ndash; Allowed in Go identifiers.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Consonant + Vowel Sign:&lt;/strong&gt; &lt;code&gt;கா&lt;/code&gt; (U+0B95 U+0BBE) &amp;ndash; Not allowed because &lt;code&gt;ா&lt;/code&gt; (U+0BBE) is a combining mark (&lt;code&gt;Mc&lt;/code&gt;).&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Consonant + Vowel Sign:&lt;/strong&gt; &lt;code&gt;கி&lt;/code&gt; (U+0B95 U+0BBF) &amp;ndash; Not allowed because &lt;code&gt;ி&lt;/code&gt; (U+0BBF) is a combining mark (&lt;code&gt;Mc&lt;/code&gt;).&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Consonant + Vowel Sign:&lt;/strong&gt; &lt;code&gt;கூ&lt;/code&gt; (U+0B95 U+0BC2) &amp;ndash; Not allowed because &lt;code&gt;ூ&lt;/code&gt; (U+0BC2) is a combining mark (&lt;code&gt;Mc&lt;/code&gt;).&lt;/p&gt;
&lt;p&gt;In the identifier &lt;code&gt;எண்ணிக்கை&lt;/code&gt; (&amp;ldquo;number&amp;rdquo;), the characters include combining marks:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;எ&lt;/code&gt; (U+0B8E) &amp;ndash; Letter, allowed&lt;/li&gt;
&lt;li&gt;&lt;code&gt;ண்&lt;/code&gt; (U+0BA3 U+0BCD) &amp;ndash; Formed by &lt;code&gt;ண&lt;/code&gt; (U+0BA3) and the virama &lt;code&gt;்&lt;/code&gt; (U+0BCD), a combining mark (&lt;code&gt;Mn&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;ண&lt;/code&gt; (U+0BA3) &amp;ndash; Letter, allowed&lt;/li&gt;
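&lt;li&gt;&lt;code&gt;ிக்கை&lt;/code&gt; &amp;ndash; The vowel signs &lt;code&gt;ி&lt;/code&gt; (U+0BBF) and &lt;code&gt;ை&lt;/code&gt; (U+0BC8) and a second virama &lt;code&gt;்&lt;/code&gt; (U+0BCD) are combining marks; the two &lt;code&gt;க&lt;/code&gt; (U+0B95) between them are letters&lt;/li&gt;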
&lt;/ul&gt;
&lt;p&gt;Because these combining marks are not allowed in Go identifiers, the compiler throws errors when it encounters them.&lt;/p&gt;
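&lt;p&gt;To double-check the breakdown above, you can ask the &lt;code&gt;unicode&lt;/code&gt; package for the category of each code point. This is a minimal sketch; &lt;code&gt;category&lt;/code&gt; is a helper name I made up, and it only distinguishes the categories that matter for identifiers.&lt;/p&gt;

```go
package main

import (
	"fmt"
	"unicode"
)

// category names the general category of r as far as the identifier rule
// cares: any letter, a decimal digit, or one of the three mark categories.
// Illustrative helper, not part of the standard library.
func category(r rune) string {
	switch {
	case unicode.IsLetter(r):
		return "L"
	case unicode.Is(unicode.Nd, r):
		return "Nd"
	case unicode.Is(unicode.Mn, r):
		return "Mn"
	case unicode.Is(unicode.Mc, r):
		return "Mc"
	case unicode.Is(unicode.Me, r):
		return "Me"
	}
	return "other"
}

func main() {
	// Print one line per code point of the rejected identifier.
	for _, r := range "எண்ணிக்கை" {
		fmt.Printf("%U %s\n", r, category(r))
	}
}
```

&lt;p&gt;Every line that prints &lt;code&gt;L&lt;/code&gt; is a code point Go accepts; every line that prints a mark category lines up with one of the compiler errors.&lt;/p&gt;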
&lt;div class=&#34;diagram-container&#34; data-diagram-type=&#34;mermaid&#34;&gt;
  &lt;div class=&#34;mermaid&#34;&gt;
    
graph LR
    A[&#34;Tamil syllable&#34;] --&gt; B[&#34;Base consonant\n(Lo: valid)&#34;]
    A --&gt; C[&#34;Vowel sign\n(Mn/Mc: not allowed)&#34;]
    B --&gt; D[&#34;Visually incomplete\nalone&#34;]
    C --&gt; D

    style A fill:#64748b,color:#fff,stroke:none
    style B fill:#10b981,color:#fff,stroke:none
    style C fill:#ef4444,color:#fff,stroke:none
    style D fill:#f59e0b,color:#fff,stroke:none

  &lt;/div&gt;
&lt;/div&gt;

&lt;h2 id=&#34;why-chinese-characters-work-but-tamil-doesnt&#34;&gt;Why Chinese Characters Work but Tamil Doesn&amp;rsquo;t&lt;/h2&gt;
&lt;p&gt;Chinese characters are generally classified under the &amp;ldquo;Letter, Other&amp;rdquo; (&lt;code&gt;Lo&lt;/code&gt;) category in Unicode. They are standalone symbols that don&amp;rsquo;t require combining marks to form complete characters. This is why identifiers like &lt;code&gt;消息&lt;/code&gt; work perfectly in Go.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Practical Implications&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The inability to use combining marks in identifiers has significant implications for scripts like Tamil:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Without combining marks, it&amp;rsquo;s nearly impossible to write meaningful identifiers in Tamil&lt;/li&gt;
&lt;li&gt;Using native scripts can make learning to code more accessible, but these limitations hinder that possibility, particularly for languages that follow abugida-based writing systems&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;whats-wrong-here&#34;&gt;What&amp;rsquo;s wrong here?&lt;/h2&gt;
&lt;p&gt;Actually, nothing really!&lt;/p&gt;
&lt;p&gt;Go&amp;rsquo;s creators primarily aimed for consistent string handling and alignment with modern web standards through UTF-8 support. They didn&amp;rsquo;t &lt;strong&gt;necessarily intend for &amp;ldquo;native-language&amp;rdquo; coding&lt;/strong&gt; in identifiers, especially with scripts requiring combining marks.&lt;/p&gt;
&lt;p&gt;I wanted to see how far we could push Go&amp;rsquo;s non-Latin alphabet support. Although most developers use and prefer English for coding, I thought it would be insightful to explore this aspect of Go&amp;rsquo;s Unicode support.&lt;/p&gt;
</content:encoded>
      <author>ashwin@ashwingopalsamy.in (Ashwin Gopalsamy)</author>
    </item>
    
  </channel>
</rss>
