Dialects

The following zip files contain my programs for converting Big5 text into Mandarin, Cantonese and Hakka syllables.

I am distributing the software as freeware, since the source data was already available on the internet and/or was itself originally released as freeware.

Mandarin 100 kb
Cantonese 99 kb
Hakka 96 kb

Click on the links above to download, and note where you have saved the zip file. Click on the saved zip file, and it will unzip the files into a directory corresponding to mandarin, cantonese, or hakka.

You need to go into the directory and click on the file named dialect-m.exe (for Mandarin), dialect-c.exe (for Cantonese) or dialect-h.exe (for Hakka). Alternatively, you can open a DOS prompt, go to the directory which contains these files, and type in the name of the appropriate exe file. Once the program runs, it will create an HTML file.

Mandarin > m-output.htm
Cantonese > c-output.htm
Hakka > h-output.htm

[Screenshot: the Mandarin program running under DOS.]

All three programs look the same when run from the DOS prompt.

For your own text, just paste Big5 text into the input file called m-input.txt (for Mandarin), c-input.txt (for Cantonese) or h-input.txt (for Hakka), save it, then run dialect-m, dialect-c or dialect-h respectively under DOS. The programs should work on most 32-bit machines.

I've configured the programs to accept a maximum of 5000 lines of input, each around 40 double-byte characters wide (i.e. 40 Big5 characters, or 80 single-byte characters).
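If you want to check an input file against these limits before running an annotator, something like the following Python sketch may help. It is not part of the annotators. The function name check_limits is made up for illustration, and the 80-byte line limit is my reading of "around 40 double-byte characters", so treat it as an assumption.

```python
MAX_LINES = 5000          # maximum number of input lines stated above
MAX_BYTES_PER_LINE = 80   # assumed: 40 Big5 characters at 2 bytes each


def check_limits(data: bytes) -> list:
    """Return a list of human-readable problems found in the raw input bytes."""
    problems = []
    # Split on raw line endings; Big5 trail bytes never equal CR or LF.
    lines = data.splitlines()
    if len(lines) > MAX_LINES:
        problems.append(f"{len(lines)} lines (limit {MAX_LINES})")
    for number, line in enumerate(lines, start=1):
        if len(line) > MAX_BYTES_PER_LINE:
            problems.append(
                f"line {number}: {len(line)} bytes (limit {MAX_BYTES_PER_LINE})"
            )
    return problems
```

For example, reading m-input.txt in binary mode and passing its contents to check_limits gives an empty list when the file is within both limits.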

These programs were written in Fortran 77 and compiled with Guilherme Luiz Lepsch Guedes' Force 2.0 Fortran Compiler and Editor.

Due to my limited programming skills and the restrictions of the Fortran 77 language, I've not been able to handle single-byte characters properly. This affects the output. Limiting the line width is one way to make it easier for users to find these characters, which need to be padded with an extra space. Otherwise, the two-byte characters that follow will not be recognised correctly.

FAQ

Some folks have had some problems with the annotators. What occurs in one occurs in all three, so the following is for Mandarin, Cantonese and Hakka annotators.

Q1. When I click the .exe file for the first time, a black window appears and then disappears. What is wrong?

A1. If you look in the directory where the program resides, the program will have created the output .htm file, and this means that the input file has been converted. Click on the .htm file and it should open in your browser for you to see the results.

Q2. I am getting nothing but question marks.

A2. It is quite likely that the encoding is not actually Big5. If you are copying text from a browser, try selecting Western as the encoding to force the browser not to display the page in Chinese, then copy and paste the characters. However, if the page does not change even though you have chosen the Western encoding, try saving the page first, and editing it by removing the following:

which will occur somewhere near the top of the page.
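As a rough way of checking whether the bytes in an input file are Big5 at all, one option is to try decoding them with Python's built-in big5 codec. This is only a sketch, and the function name looks_like_big5 is made up for illustration. A successful decode does not prove the text is Big5 (plain ASCII and some other encodings can also decode without error), but a failure is a strong sign that the file holds something else, such as UTF-8.

```python
def looks_like_big5(data: bytes) -> bool:
    """Return True if the bytes decode cleanly with Python's big5 codec."""
    try:
        data.decode("big5")
    except UnicodeDecodeError:
        # At least one byte sequence is not valid Big5.
        return False
    return True
```

Read the input file in binary mode and pass its contents to the function. If it returns False, the text needs to be copied or saved again as Big5 before the annotator can use it.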

Q3. I am getting +++ all over the output webpage.

A3. Where there is no pronunciation data for a character, it is shown as +++ in the romanised pronunciation section of the text. You may need to look up the pronunciation of such a character yourself.

Q4. I'm getting a lot of strange characters within the Chinese text.

A4. Each Chinese character in the input file must occupy the space of exactly two single-byte characters for the program to read the input file and produce correct output. This is because Big5 is a two-byte character encoding. The text must not be indented by any space at the beginning of a line. One-byte characters such as full stops, commas, exclamation marks, etc. will also misalign the text. In this case, you should insert one space next to each one-byte character so that the text to the right of it again falls within units of two bytes. You may need to edit and save the input file and rerun the program several times until all the Chinese text is properly aligned. Make sure you have only around 40 double-byte characters per line (the program allows a maximum of 50 double-byte characters, but since you may need to insert spaces to fix a misalignment, it is suggested that 40 double-byte characters should be the limit per line).
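The padding described above can also be done mechanically. The following Python sketch is not part of the annotators. It assumes that any byte from 0x81 upwards starts a two-byte Big5 character and that everything else is a one-byte character, and the function name pad_line is made up for illustration. It works on the raw bytes of one line, without the line ending.

```python
def pad_line(line: bytes) -> bytes:
    """Strip leading indentation and add a space after each one-byte character."""
    line = line.lstrip(b" \t")  # the text must not be indented
    out = bytearray()
    i = 0
    while i < len(line):
        if line[i] >= 0x81 and i + 1 < len(line):
            # Assumed lead byte of a two-byte Big5 character: copy both bytes.
            out += line[i:i + 2]
            i += 2
        else:
            # One-byte character: pad it with a space to fill a two-byte unit.
            out.append(line[i])
            out.append(0x20)
            i += 1
    return bytes(out)
```

Run it only on text that has not already been padded by hand, since every one-byte character, including an existing space, receives an extra space. Padding makes lines longer, so the line width should be checked again afterwards.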

Q5. I can read the romanised text but cannot read the Chinese.

A5. The program output is in the form of an HTML file. Within the browser, you can change the encoding to Chinese Traditional or Big5 to get the double-byte characters to display properly as Chinese characters. The output is left in this form so that you can also copy and paste it into a text editor without having to fiddle about with whatever intermediate encoding the computer uses to hold the character information.

Q6. The program runs and the display disappears. Is there any way of keeping the runtime window displayed?

A6. Yes, you must open a Dos Prompt window. Go to

Start > Accessories > Dos Prompt

Change to the directory where the annotator is. For example, if it is in a directory called "chinese" on your C drive, then for the Mandarin annotator, "dialect-m.exe", type the following:

c:\>
c:\>cd chinese
c:\chinese>
c:\chinese>dialect-m.exe

The program should run and the DOS prompt will still be active. For Cantonese, use "dialect-c.exe", and for Hakka use "dialect-h.exe". You can leave off the .exe extension when running the program.

Q7. May I have another sample of Big5 text to check I am doing things correctly?

A7. Yes, copy and paste the following into the input file. But first, make sure the encoding in your browser is set to Western, and that no Chinese characters are visible.


賀知章



回鄉偶書



少小離家老大回

鄉音無改鬢毛衰

兒童相見不相識

笑問客從何處來





李白



金陵酒肆留別



風吹柳花滿店香

吳姬壓酒喚客嘗

金陵子弟來相送

欲行不行各盡觴

請君試問東流水

別意與之誰短長







李白



贈孟浩然



吾愛孟夫子

風流天下聞

紅顏棄軒冕

白首臥松雲

醉月頻中聖

迷花不事君

高山安可仰

徒此挹清芬
