Fine-Tuning an LLM on My Telegram Messages: Lessons Learned [Translated]

November 27, 2023

To most people who know me, I look a lot like a text-processing program: text in, text out, most of the day. So if the inputs and outputs are that simple, could a model replace me? For that, the model would need not just to mimic my writing style but also to know a lot about me. And the best source for that is my Telegram history: I use it every day, and it captures almost everything I think and do.

Approach

The simplest approach would be to extract all my messages, load them into ChatGPT's context, and instruct it to use that information to imitate my style when replying to new messages. But this approach is limited by the context window size, so I would have to preprocess the messages and distill the key content first. To avoid that tedious step, I considered retrieval-augmented generation (RAG) to pull in relevant information on demand. In my experience, though, retrieving from diverse data such as chat histories usually requires supervised fine-tuning of the retrieval model, which I had no desire to do. Fine-tuning was therefore the best option for me: it captures my writing style and absorbs knowledge from all of my messages without my having to pick out what matters.

OpenAI offers fine-tuning, but since I would be working with private messages, I didn't want to rely on any third-party fine-tuning service, so I had to choose a base model. According to the Hugging Face Open LLM Leaderboard, Mistral 7B (link) is one of the top small models (up to 13B parameters), outperforming even Llama 2 13B (link). The next question was whether LoRA would be enough or whether I needed full fine-tuning. Comparative studies [1] [2] suggest that LoRA, while weaker than full fine-tuning in some cases, works well most of the time. For a task like mine (Russian language + chat), however, I found a paper in which researchers instruction-tuned Llama for Chinese, a goal of similar complexity to mine. They found that LoRA fine-tuning of a base model without prior instruction tuning underperformed full fine-tuning, whereas LoRA on a model that had already been instruction-tuned was comparable to full fine-tuning. For me, this meant either full fine-tuning of a base model or LoRA on a model already fine-tuned for Russian chat. Since I couldn't find a model fine-tuned for Russian chat, I decided to try LoRA on Dolphin, a Mistral model fine-tuned for English chat.

So my plan was:

  1. First, try LoRA on Dolphin, a Mistral model already fine-tuned for English chat.
  2. If the quality falls short, try full fine-tuning of Mistral.

Data Preparation

Messaging in apps like Telegram differs from email in one important way: the natural flow of conversation. Messages don't strictly alternate between you and your contact; instead, you often send several messages in a row and the other person answers with several of their own, and the messages tend to be short. To preserve this authentic conversational feel in my data, I did some extra processing.

Using Telegram's built-in export feature, I exported all my chats as JSON. After filtering the messages and grouping them into sessions, I had the data from my last five years on Telegram: 15,789 sessions across 466 chats, averaging 8.51 messages per session. For the data format I chose ChatML. Here is an example session, translated from Russian:

<|im_start|>John Smith
>>> it sucks, keeps hitting the 135-minute time limit
>>> trying to solve it efficiently, but it just won't work<|im_end|>
<|im_start|>Alexander Smirnov
>>> same struggle here
>>> are you still going with the original idea?<|im_end|>
<|im_start|>John Smith
>>> not sure, I think we're thinking along the same lines
>>> just like you said
>>> I'm trying the reversed-strings approach, hoping to find something there
>>> but it doesn't seem to work, because the z-function ruins everything……………<|im_end|>
<|im_start|>Alexander Smirnov
>>> I don't see how the z-function helps here<|im_end|>
<|im_start|>John Smith
>>> I don't really know either, it feels like I'm doing it iteratively anyway, but yes, some strings have to be reversed to build the z-function
>>> it's just a random solution from the discussion
>>> from the discussion<|im_end|>
<|im_start|>Alexander Smirnov
>>> got it<|im_end|>
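To make the grouping step above concrete, here is a minimal sketch of turning a Telegram JSON export into ChatML-style sessions. It assumes the standard layout of Telegram's `result.json` export; the eight-hour session gap and the helper names are illustrative assumptions, not the author's exact pipeline.

```python
import json
from datetime import datetime, timedelta

SESSION_GAP = timedelta(hours=8)  # illustrative threshold for splitting sessions

def flatten_text(text):
    # Telegram exports formatted messages as a list of strings and dicts.
    if isinstance(text, str):
        return text
    return "".join(p if isinstance(p, str) else p.get("text", "") for p in text)

def to_chatml(session):
    # Merge consecutive messages from the same author into one ChatML turn.
    lines, prev = [], None
    for author, text in session:
        if author != prev:
            if prev is not None:
                lines[-1] += "<|im_end|>"
            lines.append(f"<|im_start|>{author}")
            prev = author
        lines.append(f">>> {text}")
    lines[-1] += "<|im_end|>"
    return "\n".join(lines)

def extract_sessions(dump_path):
    with open(dump_path, encoding="utf-8") as f:
        data = json.load(f)
    for chat in data["chats"]["list"]:
        session, last_ts = [], None
        for msg in chat.get("messages", []):
            if msg.get("type") != "message" or not msg.get("from"):
                continue
            ts = datetime.fromisoformat(msg["date"])
            # Start a new session after a long silence.
            if last_ts is not None and ts - last_ts > SESSION_GAP:
                if len(session) > 1:
                    yield to_chatml(session)
                session = []
            session.append((msg["from"], flatten_text(msg["text"])))
            last_ts = ts
        if len(session) > 1:
            yield to_chatml(session)
```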

My data collator computes the loss only on people's responses. Predicting who speaks next is fairly trivial, and we don't want the model to spend capacity on learning that. That's why the parts of a conversation that contribute to the loss are shown in bold.

Notice that the loss is computed not only on my responses but on everyone else's too. This is deliberate: it lets the model role-play not just me, but also the friends I talk to most!
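As a sketch of how such selective loss computation typically works with Hugging Face-style training (an assumption; the post doesn't show its collator), one can set the label of every token that should not be trained on to -100, which PyTorch's cross-entropy loss ignores:

```python
import torch

IGNORE_INDEX = -100  # label value ignored by PyTorch's cross-entropy loss

def mask_non_response_tokens(input_ids: torch.Tensor, response_spans):
    """Return labels that keep the loss only on response tokens.

    `response_spans` is a list of (start, end) token-index pairs marking the
    message bodies; everything else, such as the <|im_start|>Name headers
    that reveal the next speaker, stays masked. Producing these spans during
    tokenization is assumed, not shown.
    """
    labels = torch.full_like(input_ids, IGNORE_INDEX)
    for start, end in response_spans:
        labels[start:end] = input_ids[start:end]
    return labels
```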

Evaluation Plan

I tested the models in two ways. First, the model plays me, and I chat with myself from the perspective of my various friends. Then I chat as myself while the model plays my friends. My opening is always the same two messages: "hey" and "what's up?" (in Russian: "прив" and "как дела?"). The model's role and its generated phrases are highlighted. All conversations were originally held in Russian and can be viewed via the "original" details toggle. For testing I used oobabooga/text-generation-webui.

First, I wanted to see how a generically conversation-tuned Mistral handles the task without any training on my data.


Friend 1 vs Alexander Smirnov

>>> hey
>>> what’s up?
>>> I’m good, what about you?
>>> nothing much
>>> what did you do today?
>>> I worked.
>>> then I did a few tasks in Python and studied data processing libraries.
>>> in the evening I’ll do wrestling training.
>>> have you ever been into sports or do you have any hobbies?
>>> I see
>>> want to watch a movie today?
>>> maybe, if I find the time.
>>> what’s your favorite music? Electronic, rock, or other genres?

Alexander Smirnov vs Friend 1

>>> hey
>>> what’s up?
>>> well, everything is fine so far. what’s new with you?
>>> spent the whole day programming
>>> what did you do today?
>>> finished at work
>>> and did some housework, now I’m resting. are you also resting or still working?
>>> I’m already resting
>>> well, that’s good. can we talk about something interesting or do you have any questions?



Okay, it can form coherent sentences. The biggest problem is that it doesn't understand the context of the conversation, so its replies come out bland and generic. The messages have no distinctive style and feel very basic. Another problem is that its Russian is poor. That's expected: the model is too small to generalize well beyond English. On top of that, the model is overly eager, ending almost every message with a question, which is not how people actually talk in messengers.

Let's try to fix all of that!

LoRA

LoRA offers a training approach that is light on both time and hardware: it trains only about 1% of the total weights. I used a sequence length of 1024 and a batch size of 8. The training took 20GB of VRAM on an RTX 3090 and ran for three epochs in 5.5 hours. I trained on vast.ai, where the GPU cost $0.362 per hour, for a total of about $2 (not counting the time spent on experiments and bug fixing).
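For reference, a setup along these lines can be expressed with the PEFT library; a minimal sketch, assuming common LoRA hyperparameters (the post only specifies the sequence length and batch size, so the rank, alpha, target modules, and repo id below are illustrative):

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Load the chat-tuned Mistral base (repo id assumed; Dolphin variants live
# under several names on the Hugging Face Hub).
model = AutoModelForCausalLM.from_pretrained(
    "ehartford/dolphin-2.1-mistral-7b", torch_dtype="auto"
)

# Rank, alpha, dropout, and target modules are common defaults, not the
# post's exact values.
config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()  # roughly ~1% of all weights, matching the post
```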

Here are the results:


Friend 1 vs Alexander Smirnov

>>> hey
>>> what’s up?
>>> hi) all good
>>> you?)
>>> ok as well
>>> what you were doing today?
>>> had an amazon interview and solved problems
>>> had to solve 3
>>> cool
>>> hard?
>>> don’t know what
>>> hadn’t looked yet
>>> I hadn’t started yet
>>> chatting whole day with the team and meet new ones
>>> how are they?
>>> ok
>>> same coworking as you have))

Friend 2 vs Alexander Smirnov

>>> hey
>>> what’s up?
>>> ok, we show yesterday’s product to searchcom, today we write endpoints for them so that someone else can request our api
>>> cool
>>> what problems do we have now?
>>> no problems
>>> everything is working
>>> I added an option to choose index
>>> so now we can attach several indexes
>>> and separate collections within single index
>>> what we have to do?
>>> test
>>> yes ok
>>> are they always will be sending post-script?

Friend 3 vs Alexander Smirnov

>>> hey
>>> what’s up?
>>> Hi, good)
>>> you?
>>> ok as well
>>> what do you do at the evening?
>>> we have meeting with office collegues at 8 30 msk
>>> after it I can
>>> can what?
>>> haha)
>>> didn’t get it
>>> I don’t know(

Alexander Smirnov vs Friend 1

>>> hey
>>> what’s up?
>>> hi)
>>> seems good
>>> was working, than went to the call with Nadya
>>> now reading normally
>>> you?
>>> good as well
>>> any plans for the evening?
>>> after work wendesday I would love to go running
>>> and then have a lunch
>>> your plans?
>>> I though about watching movie
>>> aa
>>> ooo
>>> to be honest I like when it is raining
>>> when you don’t feel like going anywhere

Alexander Smirnov vs Friend 2

>>> hey
>>> what’s up?
>>> Hi, great!
>>> Sorry will be 20-30 min late
>>> ok
>>> why?
>>> yes, I just got up and forgot to press the «exit sleep» button
>>> my telegram didn’t close on its own
>>> I thought it behaved like this when you logging off the Internet, but no
>>> I still logged off the Internet, but telegram still didn’t respond
>>> and it was strange, because yesterday there was also such a case when I sent you a message and did not receive an answer
>>> today I looked at mine, but not yours, but yours was displayed)
>>> ok
>>> any plans?
>>> yes, we can chat a little
>>> and I can tell you that the method for generating answers worked really well for me
>>> he gives respectable results in training
>>> but not on the test

Alexander Smirnov vs Friend 3

>>> hey
>>> what’s up?
>>> hey, hi
>>> Good, you?)
>>> Haven’t gone to bed yet)0)0)0)
>>> Today I get up and go to a programming course
>>> To tighten up the base, etc.
>>> nice
>>> what did you do yesterday?
>>> Yesterday there was already the same course in which it seemed to me that I did very poorly
>>> Then I wanted to solve hw but I don’t care
>>> Otherwise I’m fucked because I can’t sleep)
>>> Really fucked up
>>> It will probably be better now, since I’m already maximally overtrained
>>> cool man
>>> what are your plans for life in general?
>>> Fuck, get a job interviews first
>>> And if everything is ok, then actually order a computer
>>> Like, I think you should order a computer when it’s more profitable than working at work
>>> But I don’t care

This is a clear improvement. It captures the style of the person it's replying to, and it also picks up on the topics a given pair of people discuss most; for instance, conversations with Friend 2 revolve more around work. But the grammar is still off, and it loses the thread of the conversation quickly. I'm fairly sure LoRA could reach reasonable quality in English, so full fine-tuning might be unnecessary there. But since Russian isn't the model's native language, let's try full fine-tuning next.

Full Fine-Tuning

Full fine-tuning is more challenging because it requires multi-GPU training. Popular approaches are ZeRO & DeepSpeed [3] and FSDP [4], where FSDP is essentially built on ZeRO3 [5]. I went with FSDP.

While building the training pipeline, I drew on the Stanford Alpaca fine-tuning code and Anton Bacaj's Mistral fine-tuning code.

I used FSDP in full-shard mode with half precision, a sequence length of 1024, and a micro batch size of 2, which required 63GB of VRAM on each 80GB A100 GPU. Training for three epochs took just 20 minutes. The VM cost $8.88 per hour, about $3 in total, excluding experiments and debugging.
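For orientation, here is a minimal sketch of that FSDP setup in plain PyTorch: full sharding, half precision, and wrapping at the decoder-layer level. The actual pipeline follows the Alpaca and Bacaj code linked above, so treat this as an illustration, and note it assumes `torchrun` has already started one process per GPU and initialized the process group.

```python
import functools
import torch
import torch.distributed as dist
from torch.distributed.fsdp import (
    FullyShardedDataParallel as FSDP,
    MixedPrecision,
    ShardingStrategy,
)
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy
from transformers import AutoModelForCausalLM
from transformers.models.mistral.modeling_mistral import MistralDecoderLayer

# One GPU per rank, process group already initialized by torchrun (assumed).
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")

# Shard at the transformer-block level.
wrap_policy = functools.partial(
    transformer_auto_wrap_policy,
    transformer_layer_cls={MistralDecoderLayer},
)

model = FSDP(
    model,
    sharding_strategy=ShardingStrategy.FULL_SHARD,  # "full shard" mode from the post
    auto_wrap_policy=wrap_policy,
    mixed_precision=MixedPrecision(                 # half precision, as in the post
        param_dtype=torch.bfloat16,
        reduce_dtype=torch.bfloat16,
        buffer_dtype=torch.bfloat16,
    ),
    device_id=torch.cuda.current_device(),
)
```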

The conversations:


Friend 1 vs Alexander Smirnov

>>> hey
>>> what’s up?
>>> hey) good
>>> you?
>>> good as well
>>> what you were doing today?
>>> played a board game
>>> I’ll go home now
>>> nice
>>> what kind of game?
>>> catan
>>> strategy
>>> very good
>>> who was there?
>>> Max Yavich and Fedya
>>> who won?
>>> Max

Friend 2 vs Alexander Smirnov

>>> hey
>>> what’s up?
>>> hi, good
>>> let’s meet in half an hour?
>>> yep
>>> what did you want?
>>> damn I haven’t finished the script for recalculating models yet
>>> and I need to finish it to send it to them
>>> this recalculation takes a lot of time, so I just want to turn around and start doing it and not waste time on preparation
>>> got it
>>> what is the plan for today?
>>> meet in half an hour to discuss what we are doing
>>> if decided, then I send an application to them and wait for a response
>>> if we haven’t decided, then we continue to work
>>> who do you mean by them?
>>> investors
>>> https://www.ycombinator.com/howtoapply
>>> there are two tests)

Friend 3 vs Alexander Smirnov

>>> hey
>>> what’s up?
>>> hey we’re sitting and solving problems
>>> what kind of problems?
>>> data analysis +-
>>> and for what?
>>> https://yandex.ru/cup/profile/
>>> it seems it just became more interesting
>>> cool
>>> what are you planning this evening?
>>> idk yet
>>> and on the weekend?
>>> apparently I’ll also finish solving problems and get some work done

Alexander Smirnov vs Friend 1

>>> hey
>>> what’s up?
>>> hey)
>>> good
>>> you?
>>> that’s okay too
>>> what did you do today?
>>> well, I worked, of course
>>> and also
>>> met with Alena
>>> ate at the ramen shop with everyone
>>> and went home
>>> by the way, we booked a place for 14 people
>>> for the evening))
>>> cool
>>> and what’s about university?
>>> yeah
>>> studied a lot
>>> in general, every day I want to study more
>>> what is the hardest thing now?
>>> well I don’t know
>>> it just takes a very long time
>>> there are some optimization methods that need to be rewritten
>>> so I think maybe it’s not worth optimizing

Alexander Smirnov vs Friend 2

>>> hey
>>> what’s up?
>>> Hey, good, what about you?
>>> Are you already in Bangkok?
>>> yeah, already here
>>> what are your plans for today?
>>> Well, I would like to work on these pioneer tasks, if you don’t mind, we can do it together
>>> Or you can skip, I’ll finish the livechat tomorrow
>>> and what about live chat?
>>> Damn right now I launched the front on the prod without this shit that generates this thing with a logo, and it swears, says to connect js from google analytics
>>> I understand that this is a fucking cringe, but I still tried to remember where to get it
>>> yeah got it
>>> what will you do on weekends?
>>> We are meeting with friends here before leaving, tomorrow we need to go to the hotel by 6 am to spend the day in the best traditions
>>> There’s another party in the evening
>>> Actually, let’s have a call, I’ll tell you

Alexander Smirnov vs Friend 3

>>> hey
>>> what’s up?
>>> Hi, haven’t you called yet?)
>>> Hey
>>> ok)
>>> I’m going to play Dota now
>>> Join if you want to)
>>> mb a little later)
>>> what are your plans for weekends?
>>> Lot of gyms/flexes/parks
>>> + maybe come to your place
>>> hmm well mb mb)
>>> what’s with your job btw?
>>> Well, I’m waiting for the info
>>> did they say what deadlines are?
>>> He said that there will be a call next week to discuss it
>>> remind me what this position is
>>> Data Scientist
>>> The team is engaged in recommending products and other shit
>>> Creating model
>>> As part of this position, you will have to develop skills in machine learning, statistical data analysis, programming and much more
>>> It is how I understood it

The conversations got livelier and more interesting, although the model can still lose context. Its Russian improved, but errors remain. I believe that for a task-specific fine-tune on limited data like mine, it would help to first do unsupervised fine-tuning on a large corpus of Russian text. Also, adding the names of frequent conversation partners to the vocabulary as dedicated tokens might improve quality further; see the sketch below.
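If one wanted to try that dedicated-token idea, the standard Hugging Face recipe would look roughly like this (the names are placeholders, and the post did not actually implement this):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")

# Register frequent chat partners as single tokens (placeholder names),
# then grow the embedding matrix to cover the new vocabulary entries.
tokenizer.add_tokens(["John Smith", "Alexander Smirnov"])
model.resize_token_embeddings(len(tokenizer))
```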

That said, I don't think full fine-tuning has a decisive advantage over LoRA here. It might be more effective to focus on a single person, computing the loss only on their responses, instead of trying to learn every conversation partner.

Closing Thoughts

Of course, I cherry-picked the results. It's not that most of the model's replies were unusable, but rather that many were overly simple, things like "I'll call you later", "busy", and "ok", which are extremely common in real conversations. Even so, the model is excellent at imitating the style of the person it plays, and it captures the topics commonly discussed between two people. What it clearly lacks is awareness of the conversation's context. Without enough context, answering questions like "hey, what's up?" or "any plans for the weekend?" is inherently hard. Perhaps a system like Rewind, which captures everything the user does on their computer, could help here.

Code

You can find the code for this project, along with instructions on running it on your own Telegram dump, in my GitHub repo. You can also view the training logs on WandB.