Create a class to fix garbled text when reading ANSI-encoded TXT files that contain Chinese strings


GitHub repository: https://github.com/github/VisualStudio

What Will I Learn?

  • The basics of MBCS and ANSI encodings.
  • How to set the locale temporarily.
  • How to fix garbled text when reading ANSI-encoded zh-CN files.

Requirements

  • Basic knowledge of C++.
  • Basic knowledge of MBCS and ANSI encodings.
  • Basic knowledge of reading data from files.

Difficulty

  • Intermediate

Tutorial Contents

Issue raised

Some time ago, I was working on a GitHub project in Visual Studio 2015. As part of that work, I needed to read some TXT and XML files from the database. I found that when I used the ReadString function of CStdioFile to read an ANSI-encoded TXT file containing Chinese strings, all of the data came out garbled.

My computer's default language is set to English.


The TXT file in question contains some Chinese strings and is saved with ANSI encoding.


Under these conditions, all of the data read from the file is displayed as garbled text unless it is given special treatment.

Investigate the reason

Reason: after investigating this problem for some time, I realized that the ReadString function of CStdioFile does not correctly read ANSI-encoded Chinese files when the program's locale does not match the file's encoding.

1. ANSI and MBCS

MBCS (Multi-Byte Character Set)

MBCS is a family of encodings, not the name of one particular encoding.

To extend ASCII so that it can represent native languages, different countries and regions defined their own standards. That is why there are several encoding standards such as GB2312, BIG5, and JIS. On Windows, these extended encodings are collectively called ANSI encodings. Because they keep one byte for ASCII characters and use two bytes for most other characters, they are also called MBCS (multibyte character set) encodings.

On a Simplified Chinese system, ANSI means GB2312 (GBK, code page 936). On a Japanese system, ANSI means a JIS encoding (Shift-JIS, code page 932). So on Chinese Windows, saving text as ANSI is all it takes to store it as GB2312 or GBK. Different ANSI encodings are incompatible with each other: when information is exchanged internationally, text in two languages cannot be stored in the same piece of ANSI-encoded text. A major drawback is that the same byte value represents different characters in different encodings, which easily leads to garbled text.
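To make the double-byte idea concrete, here is a minimal sketch that counts the characters in a GBK-encoded byte string. It assumes the GBK (code page 936) convention that any byte in the range 0x81 to 0xFE starts a two-byte character. The function name `count_gbk_chars` is made up for this illustration, and the sketch does not validate trail bytes.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of the tutorial's class): counts characters
// in a GBK / code page 936 byte string. Assumption: a byte in 0x81..0xFE is
// a lead byte and is followed by exactly one trail byte.
std::size_t count_gbk_chars(const std::string& bytes)
{
    std::size_t count = 0;
    std::size_t i = 0;
    while (i < bytes.size())
    {
        const unsigned char b = static_cast<unsigned char>(bytes[i]);
        const bool isLead = (b >= 0x81 && b <= 0xFE);
        // Skip two bytes for a double-byte character when a trail byte
        // exists, otherwise skip one byte.
        i += (isLead && i + 1 < bytes.size()) ? 2 : 1;
        ++count;
    }
    return count;
}
```

For example, the four bytes D6 D0 CE C4 are the GB2312/GBK encoding of 中文, so the function counts 2 characters, while a reader that assumes a single-byte encoding sees 4 unrelated characters.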

2. ReadString of CStdioFile

If you do not switch the program's locale to match the file's encoding before calling the ReadString function of CStdioFile, the text you read is interpreted as if it belonged to a different encoding, and it comes out garbled.

3. Solution: Localization

If you want to keep using CStdioFile to read such files, temporarily set the locale to match the file's encoding while reading. The following two functions set the locale:


char *setlocale(  
   int category,  
   const char *locale   
);  

wchar_t *_wsetlocale(  
   int category,  
   const wchar_t *locale   
);  
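As a rough illustration of how these functions behave, the sketch below wraps the standard `setlocale` in two small helpers. The helper names `current_locale` and `try_set_locale` are invented for this example. The only library call is `std::setlocale` from `<clocale>`, which returns the current locale name when it is passed a null pointer and returns a null pointer when the requested locale is not available.

```cpp
#include <clocale>
#include <string>

// Hypothetical helper: returns the name of the current locale.
// Passing nullptr as the locale only queries; nothing is changed.
std::string current_locale()
{
    const char* name = std::setlocale(LC_ALL, nullptr);
    return name ? std::string(name) : std::string();
}

// Hypothetical helper: tries to switch the whole program to `name`.
// setlocale returns a null pointer on failure and leaves the locale unchanged.
bool try_set_locale(const char* name)
{
    return std::setlocale(LC_ALL, name) != nullptr;
}
```

A call such as `try_set_locale("C")` always succeeds because the "C" locale is guaranteed to exist. Whether a name like `Chinese (Simplified)_China.936` is accepted depends on the platform and the locales installed on it.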

Create a CMBCSLocalEnv class

Based on the analysis above, I wrote a new header file, LocaleEnv.h, to solve the problem.

1. Enumerate common language types

First, create an enumeration of language types so that the locale can be set based on commonly used languages. I've listed only a few common types here; you can add others as needed.

2. Record the original locale

Because this class only sets the locale temporarily, it needs to remember the original locale and restore it when the object is destroyed.


_tsetlocale can also be used to query the current locale. For the definition of _tsetlocale, see c:\Program Files (x86)\Windows Kits\10\Include\10.0.10240.0\ucrt\tchar.h


This function has two parameters: category and locale.

Parameters:

category : Category affected by locale.

locale : Locale specifier.

Return Value

If a valid locale and category are given, returns a pointer to the string associated with the specified locale and category. If the locale or category is not valid, returns a null pointer and the current locale settings of the program are not changed.


Passing NULL as the locale argument changes nothing and simply returns the current setting, so we can call _tsetlocale(LC_ALL, NULL) to get the current locale. The member variable m_OrgMBCSLoc stores a copy of this information.

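class CMBCSLocalEnv
{
public:
    CMBCSLocalEnv(ECurLangType lID)
    {
        m_OrgMBCSLoc = NULL;
#ifdef _MSC_VER
        // NULL only queries the locale. Copy the result, because the CRT
        // may overwrite the returned buffer on the next _tsetlocale call.
        m_OrgMBCSLoc = _tcsdup( _tsetlocale( LC_ALL, NULL ) );
#endif
    }
    ~CMBCSLocalEnv()
    {
    }
private:
    LPTSTR m_OrgMBCSLoc;
};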

3. Conditions for localization settings

A locale name requires three pieces of information:

a. Language code

b. Country code

c. Encoding (code page)

A locale name is built from these parts in the form language_country.codepage, for example Chinese (Simplified)_China.936
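The naming pattern above can be sketched in a few lines. The helper name `make_locale_name` is hypothetical and uses `std::string` only to stay portable; it is not the code the tutorial's class uses.

```cpp
#include <string>

// Hypothetical helper: joins the three parts into the
// "language_country.codepage" form described above.
std::string make_locale_name(const std::string& language,
                             const std::string& country,
                             const std::string& codePage)
{
    return language + "_" + country + "." + codePage;
}
```

Calling `make_locale_name("Chinese (Simplified)", "China", "936")` produces the example name given above, which is the kind of string that is later passed to the locale-setting function.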


To retrieve these pieces of information for a given locale, you can use the GetLocaleInfo function:

int GetLocaleInfo(
  _In_      LCID   Locale,
  _In_      LCTYPE LCType,
  _Out_opt_ LPTSTR lpLCData,
  _In_      int    cchData
);

First parameter: _In_      LCID   Locale,

Locale [in]

Locale identifier for which to retrieve information. 


Commonly used language IDs

The language ID constants are defined in $(WindowsSdkDir)Include\WinNT.h.

You can find some common language IDs at this link: https://blog.csdn.net/TMS_LI/article/details/48000751


The value passed as the Locale parameter is determined from the value of lID. The code is as follows:

LCID mbcsID = 0;
if (lID == LA_ZH_CN_NO_LANG)
{
    mbcsID = LANG_CHINESE_SIMPLIFIED;
}
else if (lID == LA_ZH_TW_NO_LANG)
{
    mbcsID = LANG_CHINESE_TRADITIONAL;
}
else if (lID == LA_EN_US_NO_LANG)
{
    mbcsID = LANG_ENGLISH;
}
else if (lID == LA_UNKNOWN)
{
    return;
}
else
{
    mbcsID = (LCID)lID;
}
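A locale identifier (LCID) combines a primary language, a sublanguage, and a sort order. As a hedged alternative to passing the LANG_* constants directly, the sketch below reproduces the arithmetic of the MAKELANGID and MAKELCID macros with plain integers and maps the enumeration to full identifiers for zh-CN, zh-TW, and en-US. The type `Lcid`, the functions `make_lang_id`, `make_lcid`, and `lcid_for`, and the numeric literals standing in for LANG_CHINESE (0x04), LANG_ENGLISH (0x09), and the SUBLANG_* values are assumptions made so that the example compiles without the Windows SDK. This is not the approach the tutorial's class takes.

```cpp
#include <cstdint>

// Stand-ins for the Windows LCID type and the tutorial's enumeration so that
// the sketch compiles anywhere; on Windows use the SDK definitions instead.
typedef std::uint32_t Lcid;
enum ECurLangType { LA_ZH_CN_NO_LANG, LA_ZH_TW_NO_LANG, LA_EN_US_NO_LANG, LA_UNKNOWN };

// Same arithmetic as the MAKELANGID macro: sublanguage in the upper 6 bits.
inline std::uint16_t make_lang_id(std::uint16_t primary, std::uint16_t sub)
{
    return static_cast<std::uint16_t>((sub << 10) | primary);
}

// Same arithmetic as the MAKELCID macro: sort ID above the language ID.
inline Lcid make_lcid(std::uint16_t langId, std::uint16_t sortId)
{
    return (static_cast<Lcid>(sortId) << 16) | langId;
}

// Hypothetical mapping to full locale identifiers; returns 0 for LA_UNKNOWN.
inline Lcid lcid_for(ECurLangType lang)
{
    switch (lang)
    {
    case LA_ZH_CN_NO_LANG: return make_lcid(make_lang_id(0x04, 0x02), 0); // zh-CN
    case LA_ZH_TW_NO_LANG: return make_lcid(make_lang_id(0x04, 0x01), 0); // zh-TW
    case LA_EN_US_NO_LANG: return make_lcid(make_lang_id(0x09, 0x01), 0); // en-US
    default:               return 0;
    }
}
```

With these definitions, zh-CN maps to 0x0804, zh-TW to 0x0404, and en-US to 0x0409, which are the identifiers usually listed for those locales.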


Second parameter: _In_      LCTYPE LCType,

LCType [in]

The locale information to retrieve. We need the language code, the country code, and the encoding (code page).


We can get these three values with the GetLocaleInfo function. The code is as follows:

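TCHAR cpname[32];     // English language name, e.g. "Chinese (Simplified)"
int nbytes = 0;       // return value of GetLocaleInfo; 0 means the call failed
nbytes = GetLocaleInfo (mbcsID, LOCALE_SENGLANGUAGE, cpname, 32);

TCHAR cpchain1[128];  // English country name, e.g. "China"
nbytes = GetLocaleInfo (mbcsID, LOCALE_SENGCOUNTRY, cpchain1, 128);

TCHAR cpchain[7];     // default ANSI code page as text, e.g. "936"
nbytes = GetLocaleInfo (mbcsID, LOCALE_IDEFAULTANSICODEPAGE, cpchain, 7);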

4. Temporarily set the locale to match the MBCS encoding

Use the _tsetlocale function to temporarily switch the current locale to the desired one.

TCHAR NewLocal[256];
_stprintf_s( NewLocal, 256,  _T("%s_%s.%s"),cpname,cpchain1,cpchain);
_tsetlocale( LC_ALL,  NewLocal );

5. The destructor

Because the locale should only be changed locally and temporarily, the destructor restores the original locale. The code is as follows:

~CMBCSLocalEnv()
{
    if (m_OrgMBCSLoc == NULL)
    {
        return;
    }
#ifdef _MSC_VER
    _tsetlocale( LC_ALL, m_OrgMBCSLoc );
    ASSERT( _tcscmp(m_OrgMBCSLoc, _tsetlocale(LC_ALL, NULL) ) == 0 );
    free(m_OrgMBCSLoc);
#endif
}
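The same save-and-restore idea can be sketched without MFC or TCHAR, using only the standard library. The class name `ScopedLocale` and its `ok()` member are hypothetical and are not part of the tutorial's header; the sketch assumes that `std::setlocale` accepts the name it previously returned when the locale is restored.

```cpp
#include <cassert>
#include <clocale>
#include <string>

// Hypothetical portable counterpart of CMBCSLocalEnv: switches LC_ALL for
// the lifetime of the object and restores the saved locale in the destructor.
class ScopedLocale
{
public:
    explicit ScopedLocale(const char* newLocale)
    {
        // Query first and copy the result: the buffer returned by
        // setlocale may be overwritten by the next call.
        const char* current = std::setlocale(LC_ALL, nullptr);
        m_saved = current ? current : "C";
        m_changed = std::setlocale(LC_ALL, newLocale) != nullptr;
    }

    ~ScopedLocale()
    {
        // Restore the original locale, mirroring the ASSERT in the tutorial.
        const char* restored = std::setlocale(LC_ALL, m_saved.c_str());
        assert(restored != nullptr);
        (void)restored; // avoids an unused-variable warning when NDEBUG is set
    }

    // True if the requested locale was accepted.
    bool ok() const { return m_changed; }

    // Copying would restore the locale twice, so it is disabled.
    ScopedLocale(const ScopedLocale&) = delete;
    ScopedLocale& operator=(const ScopedLocale&) = delete;

private:
    std::string m_saved;
    bool m_changed;
};
```

Holding the saved name in a `std::string` removes the need for the manual `free` call, at the cost of depending on the C++ standard library rather than the TCHAR routines used in the tutorial.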

6. The complete code for this class

Putting these steps together, we can now write the class in full:

#ifndef IC_LOCALE_ENV_H
#define IC_LOCALE_ENV_H
#include <locale.h>
#include <stdlib.h>   // free
#include <tchar.h>    // _tsetlocale, _tcsdup, _tcscmp, _stprintf_s, _T
#include "mbctype.h"
// Assumes an MFC project whose precompiled header already provides
// LCID, GetLocaleInfo, and ASSERT.

enum ECurLangType
{
    LA_ZH_CN_NO_LANG,
    LA_ZH_TW_NO_LANG,
    LA_EN_US_NO_LANG,
    LA_UNKNOWN
};

// Fixes garbled text when using the ReadString function of CStdioFile.
// If an ANSI text file is zh-CN (MBCS) but the program's locale is en-US,
// the text is read incorrectly, so the locale is temporarily set to match
// the MBCS encoding and restored when the object is destroyed.

class CMBCSLocalEnv
{
public:
    CMBCSLocalEnv(ECurLangType lID)
    {
        m_OrgMBCSLoc = NULL;
#ifdef _MSC_VER
        // Remember the current locale so the destructor can restore it.
        m_OrgMBCSLoc = _tcsdup( _tsetlocale( LC_ALL, NULL ) );
        LCID mbcsID = 0;

        if (lID == LA_ZH_CN_NO_LANG)
        {
            mbcsID = LANG_CHINESE_SIMPLIFIED;
        }
        else if (lID == LA_ZH_TW_NO_LANG)
        {
            mbcsID = LANG_CHINESE_TRADITIONAL;
        }
        else if (lID == LA_EN_US_NO_LANG)
        {
            mbcsID = LANG_ENGLISH;
        }
        else if (lID == LA_UNKNOWN)
        {
            return; // keep the current locale
        }

        TCHAR cpname[32];     // English language name

        int nbytes = 0;
        nbytes = GetLocaleInfo (mbcsID, LOCALE_SENGLANGUAGE, cpname, 32);

        TCHAR cpchain1[128];  // English country name
        nbytes = GetLocaleInfo (mbcsID, LOCALE_SENGCOUNTRY, cpchain1, 128);

        TCHAR cpchain[7];     // default ANSI code page
        nbytes = GetLocaleInfo (mbcsID, LOCALE_IDEFAULTANSICODEPAGE, cpchain, 7);
        TCHAR NewLocal[256];

        // Build "language_country.codepage" and switch to that locale.
        _stprintf_s( NewLocal, 256, _T("%s_%s.%s"), cpname, cpchain1, cpchain );
        _tsetlocale( LC_ALL, NewLocal );
#endif
    }

    ~CMBCSLocalEnv()
    {
        if (m_OrgMBCSLoc == NULL)
        {
            return;
        }
#ifdef _MSC_VER
        // Restore the locale saved by the constructor and release the copy.
        _tsetlocale( LC_ALL, m_OrgMBCSLoc );
        ASSERT( _tcscmp(m_OrgMBCSLoc, _tsetlocale(LC_ALL, NULL) ) == 0 );
        free(m_OrgMBCSLoc);
#endif
    }
private:
    LPTSTR m_OrgMBCSLoc;
};
#endif

How to use this class 

Using this class is simple. Create a local CMBCSLocalEnv variable, for example CMBCSLocalEnv env(LA_ZH_CN_NO_LANG);, in the scope that reads the ANSI-encoded TXT file. The locale is switched when the variable is created and restored when it goes out of scope.

With this in place, the data read from the file is correct.

Most importantly, LocaleEnv.h can be reused in other projects simply by including the header, which avoids duplicating code. I hope this tutorial helps you.

Thank you for your attention.
@hushuilan 


Dear Moderator, the spacing between the lines of code in this tutorial is a bit wide, which makes the formatting look slightly off. This may be due to a bug in Steemit's own code block. I have tried my best to adjust the format of the code. Thank you for your understanding.

The tutorial has a very long title; next time, try to use a shorter one. In the next tutorial, explain your code better. However, thank you for your contribution.



Dear portugalcoin, thank you for your valuable advice.
