Using ChatGPT to Tally Short Sentences

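Problem

We need a feature that classifies every question users ask and counts how many questions fall into each category. The counts go into a statistical report that shows how often each kind of question occurs, so we can see which problems users run into most often with our product. That helps us understand what users need and improve our service.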

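Implementation

1. ChatGPT implementation

ChatGPT is an advanced natural language processing model from OpenAI. It generates human-like text and can respond to a wide range of inputs. In this document we use it to count and classify short sentences.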

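First, we ask ChatGPT to analyze a list of questions and identify the most common ones. We define the question list, join it into a single newline-separated block, and send that block as one user message through openai.ChatCompletion.create. Because the call sets stream=True, the response arrives as a sequence of chunks, each carrying a fragment of ChatGPT's reply.

Next, we collect the chunks, extract the content from each one, and put the pieces into a list. Finally, we join the list into a single string, which is ChatGPT's counting and classification result.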
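The collection step described above can be tried offline. The sketch below is only an illustration: join_stream_deltas and fake_chunks are hypothetical names, and the chunk dictionaries are hand-written stand-ins that assume the dict-style chunk layout used in the listing further down, not real API output.

```python
def join_stream_deltas(chunks):
    """Concatenate the 'content' fragments carried by streamed chat chunks."""
    parts = []
    for chunk in chunks:
        delta = chunk["choices"][0]["delta"]
        # A delta may carry only a role, or nothing at all, so a missing
        # 'content' key is treated as an empty string.
        parts.append(delta.get("content", ""))
    return "".join(parts)


# Hand-written stand-ins for streamed chunks (assumed shape, not real output).
fake_chunks = [
    {"choices": [{"delta": {"role": "assistant"}}]},
    {"choices": [{"delta": {"content": "如何清洁"}}]},
    {"choices": [{"delta": {"content": "洗脸仪"}}]},
    {"choices": [{"delta": {}}]},
]

print(join_stream_deltas(fake_chunks))
```

Keeping the join in a small helper like this makes it possible to check the assembly logic without calling the API.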

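In principle, the advantage of this approach is that it uses ChatGPT's strong language abilities to analyze and classify a batch of questions quickly. And because the reply is generated text, it can go straight into a report without further processing, which saves a good deal of work.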


# Written against the legacy openai<1.0 SDK, which exposes openai.ChatCompletion.
import openai

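def get_classify_from_gpt3(conversations):
    """Ask the model for the top 10 most frequent questions in `conversations`."""
    # All questions go to the model as one newline-separated block.
    conversation_text = '\n'.join(conversations)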
    try:
        response = openai.ChatCompletion.create(
            model="gpt-4o",
            messages=[
                {
                    "role": "user",
                    "content": ("Analyze the following list of questions and identify the top 10 most frequently asked ones.\n"
                                " Return only the list of these questions as a text string, each question on a new line, sorted in descending order of their frequency.\n\n"
                                "<Questions>\n"
                                "{}\n"
                                "</Questions>").format(conversation_text)
                }
            ],
            stream=True,
            temperature=0.0,
        )
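        # With stream=True the reply arrives as chunks; keep each chunk's delta.
        collected_messages = []
        for chunk in response:
            chunk_message = chunk['choices'][0]['delta']
            collected_messages.append(chunk_message)

        # Deltas without a 'content' key contribute an empty string.
        full_reply_content = ''.join([m.get('content', '') for m in collected_messages])
        return full_reply_content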
    except Exception as e:
        print(f"Error in get_classify_from_gpt3: {e}")
        return ""


questions = [
 '如何清洁洗脸仪',
 '但是用什么肥皂清洗洗脸仪',
 '可以用沐浴露吗',
 '可以用沐浴露清洗洗脸仪吗',
 '那么应该用洗面奶清洗洗脸仪吗',
 '有人吗？',
 '介绍一下tasty功能吧',
 '我已经有柜子了怎么办',
 '我的柜子是家里的柜子。想买个柜子',
 '你好，店里有卖这个机器的充电器吗？',
 '联系客服人员',
 '我应该怎样护理皮肤才能拥有健康的皮肤？',
 '毛孔粗大的皮肤应该怎样护理',
 '当前的佣金政策',
 'heat洗脸仪可以关闭热模式吗',
 'Skin的洗脸仪在Shopee上有官方店吗',
 '当前洗脸仪在哪些店有售',
 '如何写一篇有效的营销文章',
 '当前的case趋势',
 '你好！我注册了Affiliate来推广和销售Happyskin的产品。我可以问你一些关于这个Affiliate计划的问题吗？',
 '做Happyskin的affiliate，有什么资料可以支持我吗？',
 '如果我向客户推荐了affiliate链接，客户点击了链接，但在页面上看到其他产品并决定购买其他产品，那么我会得到客户购买的产品的佣金吗？',
 '好的！明白了！谢谢！',
 'subscription是什么？',
 '使用这个产品后需要用水清洗吗',
 '使用这个产品后需要用水清洗吗',
 '退换政策',
 'Hiền，我们即将举办一场关于整合营销传播的学术比赛，规模有650+学生，我想邀请贵公司作为赞助商合作。我可以联系相关部门吗？',
 '有美白服务吗？',
 '介绍一下Happyskin',
 '美白服务多少钱',
 '干性皮肤应该用什么产品',
 '干性皮肤的护理流程',
 '介绍一下上述护理流程的产品',
 '干性皮肤和美白皮肤的产品',
 '爽肤水',
 '干性皮肤用的精华液',
 '我打算做广告，品牌HappySkin的关键词会被禁止吗？',
 '怎么获取链接',
 'logo，去哪里下载？',
 '你提供的链接显示404错误，无法访问，还有其他链接吗？',
 '我需要下载logo的链接',
 '退换政策',
 '支持系统',
 '每天使用洗脸仪，我还需要每周为皮肤去角质吗',
 '男士需要去角质吗，教我最简单的在家去角质方法',
 '为孕妇提供一些物理去角质的方法',
 '去角质后需要在皮肤上涂什么？']


The top 10 most frequently asked questions are:

1. 如何清洁洗脸仪 (How to clean the facial cleansing device?)
2. 可以用沐浴露清洗洗脸仪吗 (Can I use body wash to clean the facial cleansing device?)
3. 那么应该用洗面奶清洗洗脸仪吗 (Should I use facial cleanser to clean the facial cleansing device?)
4. 介绍一下tasty功能吧 (Can you introduce the Tasty feature?)
5. 我已经有柜子了怎么办 (What should I do if I already have a cabinet?)
6. 你好，店里有卖这个机器的充电器吗？ (Hello, does the store sell chargers for this machine?)
7. 我应该怎样护理皮肤才能拥有健康的皮肤？ (How should I take care of my skin to have healthy skin?)
8. 毛孔粗大的皮肤应该怎样护理 (How should I take care of skin with enlarged pores?)
9. 当前洗脸仪在哪些店有售 (Where is the current facial cleansing device available for purchase?)
10. 如何写一篇有效的营销文章 (How to write an effective marketing article?)

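Problems:

Inaccurate ranking

The ranking above is not accurate. For example, the fourth-ranked question appears only once in the list, while questions about which products suit dry skin come up several times yet are missing from the ranking. The ranking is clearly unreliable and needs to be adjusted.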
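One way to sanity-check a ranking like this is to compare it with a plain exact-match count. The sketch below is a baseline for verification only, not the method used in this document; exact_frequency_ranking and sample are hypothetical names, and exact matching cannot merge differently worded questions. In the sample list, for instance, "退换政策" and "使用这个产品后需要用水清洗吗" each occur twice verbatim, yet neither appears in the ranking above.

```python
from collections import Counter


def exact_frequency_ranking(questions, top_n=10):
    """Rank questions by exact-match frequency, highest count first."""
    # Strip whitespace and drop empty entries before counting.
    counts = Counter(q.strip() for q in questions if q and q.strip())
    # Counter.most_common lists equal counts in first-encountered order.
    return counts.most_common(top_n)


# A small excerpt of the question list, for illustration.
sample = [
    "退换政策",
    "如何清洁洗脸仪",
    "退换政策",
    "爽肤水",
    "使用这个产品后需要用水清洗吗",
    "使用这个产品后需要用水清洗吗",
]

for question, count in exact_frequency_ranking(sample, top_n=3):
    print(count, question)
```

A count like this gives a lower bound for each question's frequency, which is enough to spot a ranking that puts single-occurrence questions ahead of repeated ones.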

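Inaccurate grouping

The grouping is also inaccurate. For example, the first, second, and third questions should all fall under the same topic, but they are listed separately. The grouping needs further improvement as well.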

2. Prompt tuning


Given the following list of questions within the <Questions> tags, perform the following tasks:

1. Analyze the list to count the frequency of each question.
2. Identify the top 10 most frequently asked questions.
3. Sort these questions in descending order of their frequency.
4. Return the list as a text string, with each question on a new line.
5. Ensure the counting and ranking are accurate.
6. only return the top 10 questions if there are more than 10 unique questions.

Additional Instructions:
- If there are fewer than 10 unique questions, return all questions sorted by frequency.
- In case of ties in frequency, maintain the order of their first appearance in the list.
- If the list is empty, return an appropriate message indicating no questions were provided.

<Questions>
{}
</Questions>
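As a rough illustration of how a template like this gets filled, the sketch below substitutes a question list into the {} placeholder. SHORT_TEMPLATE is a shortened, hypothetical stand-in for the full prompt above, and build_prompt is a hypothetical helper, not part of the original code.

```python
# Shortened, hypothetical stand-in for the full prompt text.
SHORT_TEMPLATE = (
    "Count the frequency of each question.\n\n"
    "<Questions>\n{}\n</Questions>"
)


def build_prompt(questions, template=SHORT_TEMPLATE):
    """Join the non-empty questions with newlines and fill the template."""
    cleaned = [q.strip() for q in questions if q and q.strip()]
    # str.format does not re-parse braces inside the substituted text, so
    # questions that contain '{' or '}' pass through unchanged. Literal
    # braces in the template itself would have to be doubled.
    return template.format("\n".join(cleaned))


print(build_prompt(["退换政策", "  ", "退换政策", "爽肤水"]))
```

Dropping blank entries before formatting keeps empty lines out of the <Questions> block, so the model only sees real questions.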


如何清洁洗脸仪
可以用沐浴露清洗洗脸仪吗
介绍一下tasty功能吧
我已经有柜子了怎么办
联系客服人员
我应该怎样护理皮肤才能拥有健康的皮肤？
毛孔粗大的皮肤应该怎样护理
如何写一篇有效的营销文章
你好！我注册了Affiliate来推广和销售Happyskin的产品。我可以问你一些关于这个Affiliate计划的问题吗？
subscription是什么？

Even with the tuned prompt, the results fell short of our expectations. We need to change the approach further to classify and rank the questions more accurately.

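3. Question clustering + question summarization

ChatGPT may not do its best when asked to classify and summarize questions in the same prompt, because the combined task can be too complex for it to handle well. To address this, we split the work into two steps. First, we cluster all the questions so that similar ones are grouped together. Then, within each cluster, we summarize the questions to distill their key point. This way ChatGPT handles only one task per call, which lets it focus on that task and produce better results.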



import numpy as np
import openai
from langchain.embeddings import OpenAIEmbeddings
from sklearn.cluster import KMeans


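def cluster_texts(texts, n_clusters=None):
    """Embed `texts` and group them with k-means; returns {label: [texts]}."""
    if n_clusters and len(texts) < n_clusters:
        # Fewer texts than requested clusters: fall back to half the text count.
        n_clusters = int(len(texts)/2)
        if n_clusters <= 1:
            n_clusters = 2
    elif n_clusters is None:
        # Default: about three clusters for every five texts.
        n_clusters = int(len(texts) * 3/5.0)
        if n_clusters <= 1:
            n_clusters = 2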
    embed = OpenAIEmbeddings(model="text-embedding-ada-002")
    matrix = embed.embed_documents(texts)
    matrix = np.array(matrix)
    kmeans = KMeans(n_clusters=n_clusters, init="k-means++", random_state=42, n_init='auto')
    kmeans.fit(matrix)
    labels = kmeans.labels_
    clusters = {}
    for i, c in enumerate(labels):
        clusters.setdefault(c, []).append(texts[i])
    return clusters


def summary_texts(texts):
    """Ask the model for one phrase that captures the theme shared by `texts`."""
    texts = "\n".join([f"sentence {i+1}: {text}" for i, text in enumerate(texts)])
    prompt = (
        "Given the similar sentences within <Sentences> tags, distill their shared theme into an elegant phrase. "
        "If a common theme can't be found, output the first question. "
        "The output language should match the input.\n\n"
        "<Sentences>\n{}\n</Sentences>"
    )
    try:
        response = openai.ChatCompletion.create(
            model="gpt-3.5-turbo",
            messages=[
                {
                    "role": "user",
                    "content": prompt.format(texts),
                }
            ],
            stream=True,
            temperature=0.0,
        )
        collected_messages = []
        for chunk in response:
            chunk_message = chunk['choices'][0]['delta']
            collected_messages.append(chunk_message)

        full_reply_content = ''.join([m.get('content', '') for m in collected_messages])
        return full_reply_content
    except Exception as e:
        print(f"Error in summary_texts: {e}")
        return ""
        
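def get_classify(conversations, top_n=10):
    """Cluster the questions, count each cluster, and format the top_n as text."""
    # Count exact duplicates first so each unique question is embedded once.
    classify_dict = {}
    for conversation in conversations:
        classify_dict.setdefault(conversation, 0)
        classify_dict[conversation] += 1
    questions = list(classify_dict.keys())
    # Drop empty strings before clustering.
    questions = [q for q in questions if q]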
    if len(questions) >= 2:
        clusters = cluster_texts(questions)
    else:
        clusters = {0: questions}
    # A cluster's count is the sum of the occurrences of its questions.
    classified = []
    for _, cluster in clusters.items():
        count = 0
        for c in cluster:
            count += classify_dict[c]
        classified.append((cluster, count))
    classified = sorted(classified, key=lambda x: x[1], reverse=True)
    classified = classified[:top_n]
    ret = []
    for q, c in classified:
        if c > 1:
            summary = summary_texts(q)
        else:
            summary = q[0] if q else ""
        ret.append((summary, q, c))
    text = ""
    for i, (summary, q, c) in enumerate(ret):
        examples = u"\n".join([f"  {j+1}) {question}" for j, question in enumerate(q[:4])])
        text += f"{i+1}. {summary} ({c} {'times' if c > 1 else 'time'})\n"
        if c > 1:
            text += f"e.g.\n{examples}\n\n"
        else:
            text += "\n"
    return text


1. 皮肤护理建议 (6 times)
e.g.
  1) 我应该怎样护理皮肤才能拥有健康的皮肤？
  2) 毛孔粗大的皮肤应该怎样护理
  3) 干性皮肤应该用什么产品
  4) 干性皮肤的护理流程

2. 如何清洁洗脸仪 (4 times)
e.g.
  1) 如何清洁洗脸仪
  2) 但是用什么肥皂清洗洗脸仪
  3) 那么应该用洗面奶清洗洗脸仪吗
  4) heat洗脸仪可以关闭热模式吗

3. 关于Happyskin Affiliate计划的咨询 (4 times)
e.g.
  1) 你好！我注册了Affiliate来推广和销售Happyskin的产品。我可以问你一些关于这个Affiliate计划的问题吗？
  2) 做Happyskin的affiliate，有什么资料可以支持我吗？
  3) 介绍一下Happyskin
  4) 我打算做广告，品牌HappySkin的关键词会被禁止吗？

4. 政策条款 (3 times)
e.g.
  1) 当前的佣金政策
  2) 退换政策

5. 是否可以使用沐浴露。 (2 times)
e.g.
  1) 可以用沐浴露吗
  2) 可以用沐浴露清洗洗脸仪吗

6. 如何处理已有的柜子并购买新柜子 (2 times)
e.g.
  1) 我已经有柜子了怎么办
  2) 我的柜子是家里的柜子。想买个柜子

7. Skin的洗脸仪销售渠道 (2 times)
e.g.
  1) Skin的洗脸仪在Shopee上有官方店吗
  2) 当前洗脸仪在哪些店有售

8. 需要用水清洗吗 (2 times)
e.g.
  1) 使用这个产品后需要用水清洗吗

9. 美白服务信息 (2 times)
e.g.
  1) 有美白服务吗？
  2) 美白服务多少钱

10. 护肤产品 (2 times)
e.g.
  1) 干性皮肤和美白皮肤的产品
  2) 爽肤水

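Combining question clustering with question summarization works better than having ChatGPT do the whole job in one prompt, but it still has problems.

For example, the clustering is not accurate enough. Item 5 should belong with item 2, but the context a question carries from the surrounding conversation is not considered during clustering, so the two were not grouped together. Likewise, the fourth example under item 2 does not belong there: the questions are so short that sharing a keyword is enough to put them in the same cluster by mistake.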
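One factor worth checking when the clusters look off is how many clusters were requested. The sketch below isolates the cluster-count rule from cluster_texts so it can be inspected without calling the embedding API; pick_n_clusters is a hypothetical helper that mirrors that rule. With the default, roughly three clusters are requested for every five unique questions, so most clusters hold only one or two questions, which may contribute to related questions (such as items 2 and 5) being split apart.

```python
def pick_n_clusters(n_texts, requested=None):
    """Mirror the cluster-count rule used in cluster_texts."""
    n_clusters = requested
    if n_clusters and n_texts < n_clusters:
        # More clusters requested than texts: use half the text count.
        n_clusters = max(int(n_texts / 2), 2)
    elif n_clusters is None:
        # Default: three fifths of the text count, rounded down.
        n_clusters = max(int(n_texts * 3 / 5.0), 2)
    return n_clusters


for n in (2, 10, 46):
    print(n, "texts ->", pick_n_clusters(n), "clusters")
```

Passing a smaller n_clusters to cluster_texts would produce fewer, broader clusters; whether that improves the grouping for this data is an open question that would need to be tested.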

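Taken as a whole, though, clustering plus summarization is clearly better than the earlier approaches despite these flaws. Handling grouping and summarization separately makes better use of ChatGPT's abilities and gives more accurate results, so it remains a workable solution.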

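Summary

Clustering followed by question summarization works better than asking ChatGPT alone to tally and summarize the questions. Given ChatGPT's current limitations, we have to improve results through strategies like this one. Perhaps before long ChatGPT will be able to handle this kind of task with ease.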