There are many languages in the world, and with global business, translating between them has become an important requirement. Most web browsers come with a translate option that makes content available to people across the world. There are also many cases where documents must be translated from one language to another.
Most of us are familiar with Google Translate. I have often used it for quick translations of sentences and short texts. Recently I was given the task of translating a large tab-delimited file from German to English. At first I thought it was a simple task that could be completed within a few minutes, but I soon realized how complex it was. There were several challenges.
- The file contained a lot of special characters and symbols.
- The file was large.
- The Translate service limits the number of concurrent connections.
In the end, I followed these steps to translate the file.
If the text in a row is too long for the translation limit, each field has to be translated separately. Otherwise, a complete row can be written out as a single JSON object and translated in one go. In my case, 99% of the rows fit within the Translate limit, and only a few records needed field-level translation.
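The row-versus-field decision can be sketched as follows. This is a minimal illustration, not the exact code I used: the function name `build_requests`, the shape of the JSON object, and the `MAX_CHARS` value of 5000 are all assumptions, so check the actual limit of the service you use.

```python
import json

# Assumed per-request character limit; replace with your service's real limit.
MAX_CHARS = 5000


def build_requests(row_index, fields, max_chars=MAX_CHARS):
    """Return the translation units for one row (hypothetical helper).

    If the whole row fits within the limit when serialized as JSON,
    it is kept as a single unit. Otherwise each field becomes its own unit.
    """
    payload = json.dumps({"index": row_index, "fields": fields}, ensure_ascii=False)
    if len(payload) <= max_chars:
        return [{"index": row_index, "level": "row", "fields": fields}]
    return [
        {"index": row_index, "level": "field", "field_no": i, "fields": [text]}
        for i, text in enumerate(fields)
    ]
```

Keeping the row index in every unit is what makes it possible to put the translated pieces back in the right place later.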
To summarize, the entire file is split by extracting the content of each row and creating small JSON files that hold the row content together with the row index.
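A rough sketch of this splitting step is shown below. The names `split_tsv` and `_write_chunk`, the `chunk_00000.json` naming scheme, and the default of 100 rows per file are assumptions made for illustration; it also assumes the input is UTF-8 encoded and that fields contain no embedded tabs or newlines.

```python
import csv
import json
from pathlib import Path


def split_tsv(tsv_path, out_dir, rows_per_file=100):
    """Split a tab-delimited file into small JSON files, keeping row indexes."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    chunk = []
    with open(tsv_path, newline="", encoding="utf-8") as handle:
        # QUOTE_NONE treats quote characters as ordinary text,
        # which is safer for data full of special characters.
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for index, fields in enumerate(reader):
            chunk.append({"index": index, "fields": fields})
            if len(chunk) == rows_per_file:
                written.append(_write_chunk(chunk, out_dir, len(written)))
                chunk = []
    if chunk:  # write the final, partially filled chunk
        written.append(_write_chunk(chunk, out_dir, len(written)))
    return written


def _write_chunk(chunk, out_dir, number):
    """Write one list of rows to a numbered JSON file and return its path."""
    path = out_dir / f"chunk_{number:05d}.json"
    # ensure_ascii=False keeps umlauts and symbols readable in the output.
    path.write_text(json.dumps(chunk, ensure_ascii=False), encoding="utf-8")
    return path
```

For example, `split_tsv("input.tsv", "chunks", rows_per_file=50)` would write `chunks/chunk_00000.json`, `chunks/chunk_00001.json`, and so on.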
Then create a batch job that reads these JSON files one by one and performs the translation.
This process can be sped up by running multiple batches and using multithreading within each batch. For my requirement, I split the main file into 3000 small files and ran around 10 batches on multiple machines, so each batch processed about 300 files. If the translation of any file fails, move that file to another directory, log the error details, and continue with the next file. At the end, you can examine the failures manually and handle them separately. Finally, merge all the translated small files to generate a consolidated file.
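One batch of this process could look roughly like the sketch below. It is an illustration under several assumptions: the `translate` argument is a placeholder for whatever function calls your translation service (no real Google Translate call is shown), the function and directory names are hypothetical, each JSON file is assumed to hold a list of `{"index": ..., "fields": [...]}` rows, and for brevity every field is translated on its own rather than a whole row at a time.

```python
import json
import logging
import shutil
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def translate_file(path, translate, done_dir, failed_dir):
    """Translate one JSON file; on any error, log it and set the file aside."""
    try:
        rows = json.loads(path.read_text(encoding="utf-8"))
        for row in rows:
            row["fields"] = [translate(text) for text in row["fields"]]
        out = json.dumps(rows, ensure_ascii=False)
        (done_dir / path.name).write_text(out, encoding="utf-8")
        return True
    except Exception:
        logging.exception("Translation failed for %s", path.name)
        shutil.move(str(path), str(failed_dir / path.name))
        return False


def run_batch(in_dir, done_dir, failed_dir, translate, workers=4):
    """Process every JSON file in in_dir with a small thread pool."""
    in_dir, done_dir, failed_dir = Path(in_dir), Path(done_dir), Path(failed_dir)
    done_dir.mkdir(parents=True, exist_ok=True)
    failed_dir.mkdir(parents=True, exist_ok=True)
    files = sorted(in_dir.glob("*.json"))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(
            pool.map(lambda p: translate_file(p, translate, done_dir, failed_dir), files)
        )
    return sum(results), len(results) - sum(results)  # (succeeded, failed)


def merge(done_dir, out_path):
    """Merge translated JSON files back into one tab-delimited file."""
    rows = []
    for path in sorted(Path(done_dir).glob("*.json")):
        rows.extend(json.loads(path.read_text(encoding="utf-8")))
    rows.sort(key=lambda row: row["index"])  # restore the original row order
    with open(out_path, "w", encoding="utf-8", newline="") as handle:
        for row in rows:
            handle.write("\t".join(row["fields"]) + "\n")
    return len(rows)
```

Keep `workers` small so that the number of simultaneous requests stays below the connection limit of the service. Rows from files that ended up in the failed directory are missing from the merged output until those files are fixed and processed again.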
This was a tedious process, but this is how translation works with real-world data.
The following diagram will help you to understand the translation process.
With this approach we can translate any large file using the free version of Google Translate without exceeding the limits. It takes time, but the translation works fine.
We can use the same approach to translate files using AWS Translate as well.
This is a proven approach, and I have used it to translate several complex files with good accuracy.