Developing AI specialized in Japanese: evaluating and selecting the texts to be used

On April 15, 2024, OpenAI announced plans to establish a base in Japan to promote its business there. It also announced a “GPT-4” model specialized in Japanese, whose response speed is expected to be three times faster, which may make it difficult for generative AI developed by Japanese companies to catch up.

For example, when evaluating articles from news organizations as data for generative AI, someone with a deep understanding of Japanese discourse may be needed to evaluate and select between “safe articles” and “articles that get to the heart of the matter.” Similarly, when using weekly magazine articles, I think it is necessary to evaluate and select between “excellent articles that reveal the truth” and “articles that resemble novels without truth.”

Excellent posts that show the cutting edge of Japanese discourse sometimes appear on X (formerly Twitter). I therefore believe that appropriately extracting high-value content from internet sources such as social media will be an important means of making generative AI specialized in Japanese more sophisticated.

Claude 3 Sonnet

Response from “Claude 3 Sonnet” when asked about the text above.

Quoting the response by Claude
You raise some interesting points about the challenges and considerations involved in developing high-quality generative AI models specialized for Japanese language and discourse. A few thoughts:

Evaluating and filtering data sources is critically important, as the quality and characteristics of the training data will heavily influence the model’s capabilities and potential biases. Having human experts deeply knowledgeable about Japanese language, culture, and various discourse styles across media types will likely be essential for curating an effective training dataset.

Your examples highlighting the need to distinguish between factual reporting vs. embellished narratives, core insights vs. superficial content, and identifying cutting-edge exemplars are all highly relevant. Generative models simply mirror the patterns in their training data, so carefully vetting data sources is key.

Social media and online platforms can indeed surface insightful content offering windows into evolving cultural discourse. However, as you note, discernment is required to separate high-value signals from noise. Sophisticated filtering mechanisms cultivated by subject matter experts will aid in harvesting the highest quality data.

Additionally, beyond just the training data, having Japanese language experts involved in the model development process itself – evaluating outputs, refining approaches, and ensuring linguistic and cultural nuances are properly captured – will likely enhance the sophistication of Japanese specialty models.

You raise valid points about the multifaceted challenges in this domain. Cultivating high-quality training data and having relevant human expertise rigorously involved throughout the process will be crucial for achieving state-of-the-art Japanese generative AI capabilities. It’s a complex undertaking requiring deep linguistic and cultural competencies.

Chat log from Google AI’s large language model “Gemini”
Google’s explainer blog post Google Blog: The next big step in AI
Response from Gemini when given the same text as “Claude 3 Sonnet”

I entered the same text into Gemini that I had given to “Claude 3 Sonnet.”

Quoting Gemini’s response
That’s a very interesting analysis of the challenges and opportunities surrounding Japanese generative AI. Here’s a breakdown of your points:

OpenAI’s Expansion and its Impact:

Data Curation for Japanese AI:

  • You raise a crucial point about data selection for training Japanese AI.
    • Articles need to be judged based on their contribution to the AI’s goals.
      • News articles require distinguishing between “safe” and “hard-hitting” content.
      • Weekly magazines demand separation of “truthful exposes” and “fictionalized stories.”

The Role of Social Media:

  • You highlight the potential of platforms like X (formerly Twitter) to provide valuable data.
    • By identifying high-quality content from social media, the sophistication of Japanese AI can be enhanced.

Overall, your points suggest that for Japanese generative AI to thrive, it needs to focus on:

  • Data selection: Curating training data that reflects the nuances of Japanese discourse.
  • Understanding cultural context: Accounting for the specific styles and tones present in Japanese media (news, magazines, social media).
  • Addressing specific needs: Developing AI that caters to the unique requirements of the Japanese market.

By addressing these points, Japanese AI developers can create models that effectively compete with OpenAI’s offering and cater to the specific needs of Japanese users.

More information: similar content found through the Google search suggested by Gemini
OpenAI opens Tokyo hub, adds GPT-4 model optimized for Japanese

Related articles on this site
Microsoft Makes Major Investment in Japan to Prepare for Growing Demand for AI