Fine-Tuning an LLM on My Telegram Messages: Lessons Learned [Translated]

November 27, 2023

To most people who know me, I look a lot like a text-processing program: text in, text out, most of the day. So if the inputs and outputs are that simple, could a model replace me? For that, the model would need not just to mimic my writing style but also to know a lot about me. And the best source for that is my Telegram history: I use it every day, and it captures almost everything I think and do.

Approach

The simplest approach would be to extract all my messages, load them into ChatGPT's context, and instruct it to use that information to imitate my style when replying to new messages. But this approach is limited by the context window size, so I would have to preprocess the messages and distill the key content first. To avoid that tedious step, I considered retrieval-augmented generation (RAG) to pull in relevant information on demand. In my experience, though, retrieving from diverse data such as chat histories usually requires supervised fine-tuning of the retrieval model, which I had no desire to do. Fine-tuning was therefore the best option for me: it captures my writing style and absorbs knowledge from all of my messages without my having to pick out what matters.

OpenAI offers fine-tuning, but since I would be working with private messages, I didn't want to rely on any third-party fine-tuning service, so I had to choose a base model. According to the Hugging Face Open LLM Leaderboard, Mistral 7B (link) is one of the top small models (up to 13B parameters), outperforming even Llama 2 13B (link). The next question was whether LoRA would be enough or whether I needed full fine-tuning. Comparative studies [1] [2] suggest that LoRA, while weaker than full fine-tuning in some cases, works well most of the time. For a task like mine (Russian language + chat), however, I found a paper in which researchers instruction-tuned Llama for Chinese, a goal of similar complexity to mine. They found that LoRA fine-tuning of a base model without prior instruction tuning underperformed full fine-tuning, whereas LoRA on a model that had already been instruction-tuned was comparable to full fine-tuning. For me, this meant either full fine-tuning of a base model or LoRA on a model already fine-tuned for Russian chat. Since I couldn't find a model fine-tuned for Russian chat, I decided to try LoRA on Dolphin, a Mistral model fine-tuned for English chat.

So my plan was:

  1. First, try LoRA on Dolphin, a Mistral model already fine-tuned for English chat.
  2. If the quality falls short, try full fine-tuning of Mistral.

Data Preparation

Messaging in apps like Telegram differs from email in one important way: the natural flow of conversation. Messages don't strictly alternate between you and your contact; instead, you often send several messages in a row and the other person answers with several of their own, and the messages tend to be short. To preserve this authentic conversational feel in my data, I did some extra processing.

Using Telegram's built-in export feature, I exported all my chats as JSON. After filtering the messages and grouping them into sessions, I had the data from my last five years on Telegram: 15,789 sessions across 466 chats, averaging 8.51 messages per session. For the data format I chose ChatML. Here is an example session, translated from Russian:

<|im_start|>John Smith
>>> it sucks, keeps hitting the 135-minute time limit
>>> trying to solve it efficiently, but it just won't work<|im_end|>
<|im_start|>Alexander Smirnov
>>> same struggle here
>>> are you still going with the original idea?<|im_end|>
<|im_start|>John Smith
>>> not sure, I think we're thinking along the same lines
>>> just like you said
>>> I'm trying the reversed-strings approach, hoping to find something there
>>> but it doesn't seem to work, because the z-function ruins everything……………<|im_end|>
<|im_start|>Alexander Smirnov
>>> I don't see how the z-function helps here<|im_end|>
<|im_start|>John Smith
>>> I don't really know either, it feels like I'm doing it iteratively anyway, but yes, some strings have to be reversed to build the z-function
>>> it's just a random solution from the discussion
>>> from the discussion<|im_end|>
<|im_start|>Alexander Smirnov
>>> got it<|im_end|>
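To make the grouping step above concrete, here is a minimal sketch of turning a Telegram JSON export into ChatML-style sessions. It assumes the standard layout of Telegram's `result.json` export; the eight-hour session gap and the helper names are illustrative assumptions, not the author's exact pipeline.

```python
import json
from datetime import datetime, timedelta

SESSION_GAP = timedelta(hours=8)  # illustrative threshold for splitting sessions

def flatten_text(text):
    # Telegram exports formatted messages as a list of strings and dicts.
    if isinstance(text, str):
        return text
    return "".join(p if isinstance(p, str) else p.get("text", "") for p in text)

def to_chatml(session):
    # Merge consecutive messages from the same author into one ChatML turn.
    lines, prev = [], None
    for author, text in session:
        if author != prev:
            if prev is not None:
                lines[-1] += "<|im_end|>"
            lines.append(f"<|im_start|>{author}")
            prev = author
        lines.append(f">>> {text}")
    lines[-1] += "<|im_end|>"
    return "\n".join(lines)

def extract_sessions(dump_path):
    with open(dump_path, encoding="utf-8") as f:
        data = json.load(f)
    for chat in data["chats"]["list"]:
        session, last_ts = [], None
        for msg in chat.get("messages", []):
            if msg.get("type") != "message" or not msg.get("from"):
                continue
            ts = datetime.fromisoformat(msg["date"])
            # Start a new session after a long silence.
            if last_ts is not None and ts - last_ts > SESSION_GAP:
                if len(session) > 1:
                    yield to_chatml(session)
                session = []
            session.append((msg["from"], flatten_text(msg["text"])))
            last_ts = ts
        if len(session) > 1:
            yield to_chatml(session)
```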

My data collator computes the loss only on people's responses. Predicting who speaks next is fairly trivial, and we don't want the model to spend capacity on learning that. That's why the parts of a conversation that contribute to the loss are shown in bold.

Notice that the loss is computed not only on my responses but on everyone else's too. This is deliberate: it lets the model role-play not just me, but also the friends I talk to most!
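As a sketch of how such selective loss computation typically works with Hugging Face-style training (an assumption; the post doesn't show its collator), one can set the label of every token that should not be trained on to -100, which PyTorch's cross-entropy loss ignores:

```python
import torch

IGNORE_INDEX = -100  # label value ignored by PyTorch's cross-entropy loss

def mask_non_response_tokens(input_ids: torch.Tensor, response_spans):
    """Return labels that keep the loss only on response tokens.

    `response_spans` is a list of (start, end) token-index pairs marking the
    message bodies; everything else, such as the <|im_start|>Name headers
    that reveal the next speaker, stays masked. Producing these spans during
    tokenization is assumed, not shown.
    """
    labels = torch.full_like(input_ids, IGNORE_INDEX)
    for start, end in response_spans:
        labels[start:end] = input_ids[start:end]
    return labels
```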

Evaluation Plan

I tested the models in two ways. First, the model plays me, and I chat with myself from the perspective of my various friends. Then I chat as myself while the model plays my friends. My opening is always the same two messages: "hey" and "what's up?" (in Russian: "прив" and "как дела?"). The model's role and its generated phrases are highlighted. All conversations were originally held in Russian and can be viewed via the "original" details toggle. For testing I used oobabooga/text-generation-webui.

First, I wanted to see how a generically conversation-tuned Mistral handles the task without any training on my data.


Friend 1 vs Alexander Smirnov

>>> hey
>>> what’s up?
>>> I’m good, what about you?
>>> nothing much
>>> what did you do today?
>>> I worked.
>>> then I did a few tasks in Python and studied data processing libraries.
>>> in the evening I’ll do wrestling training.
>>> have you ever been into sports or do you have any hobbies?
>>> I see
>>> want to watch a movie today?
>>> maybe, if I find the time.
>>> what’s your favorite music? Electronic, rock, or other genres?

Alexander Smirnov vs Friend 1

>>> hey
>>> what’s up?
>>> well, everything is fine so far. what’s new with you?
>>> spent the whole day programming
>>> what did you do today?
>>> finished at work
>>> and did some housework, now I’m resting. are you also resting or still working?
>>> I’m already resting
>>> well, that’s good. can we talk about something interesting or do you have any questions?



Okay, it can form coherent sentences. The biggest problem is that it doesn't understand the context of the conversation, so its replies come out bland and generic. The messages have no distinctive style and feel very basic. Another problem is that its Russian is poor. That's expected: the model is too small to generalize well beyond English. On top of that, the model is overly eager, ending almost every message with a question, which is not how people actually talk in messengers.

Let's try to fix all of that!

LoRA

LoRA offers a training approach that is light on both time and hardware: it trains only about 1% of the total weights. I used a sequence length of 1024 and a batch size of 8. The training took 20GB of VRAM on an RTX 3090 and ran for three epochs in 5.5 hours. I trained on vast.ai, where the GPU cost $0.362 per hour, for a total of about $2 (not counting the time spent on experiments and bug fixing).
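For reference, a setup along these lines can be expressed with the PEFT library; a minimal sketch, assuming common LoRA hyperparameters (the post only specifies the sequence length and batch size, so the rank, alpha, target modules, and repo id below are illustrative):

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Load the chat-tuned Mistral base (repo id assumed; Dolphin variants live
# under several names on the Hugging Face Hub).
model = AutoModelForCausalLM.from_pretrained(
    "ehartford/dolphin-2.1-mistral-7b", torch_dtype="auto"
)

# Rank, alpha, dropout, and target modules are common defaults, not the
# post's exact values.
config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()  # roughly ~1% of all weights, matching the post
```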

Here are the results:


Friend 1 vs Alexander Smirnov

>>> hey
>>> what’s up?
>>> hi) all good
>>> you?)
>>> ok as well
>>> what you were doing today?
>>> had an amazon interview and solved problems
>>> had to solve 3
>>> cool
>>> hard?
>>> don’t know what
>>> hadn’t looked yet
>>> I hadn’t started yet
>>> chatting whole day with the team and meet new ones
>>> how are they?
>>> ok
>>> same coworking as you have))

Friend 2 vs Alexander Smirnov

>>> hey
>>> what’s up?
>>> ok, we show yesterday’s product to searchcom, today we write endpoints for them so that someone else can request our api
>>> cool
>>> what problems do we have now?
>>> no problems
>>> everything is working
>>> I added an option to choose index
>>> so now we can attach several indexes
>>> and separate collections within single index
>>> what we have to do?
>>> test
>>> yes ok
>>> are they always will be sending post-script?

Friend 3 vs Alexander Smirnov

>>> hey
>>> what’s up?
>>> Hi, good)
>>> you?
>>> ok as well
>>> what do you do at the evening?
>>> we have meeting with office collegues at 8 30 msk
>>> after it I can
>>> can what?
>>> haha)
>>> didn’t get it
>>> I don’t know(

Alexander Smirnov vs Friend 1

>>> hey
>>> what’s up?
>>> hi)
>>> seems good
>>> was working, than went to the call with Nadya
>>> now reading normally
>>> you?
>>> good as well
>>> any plans for the evening?
>>> after work wendesday I would love to go running
>>> and then have a lunch
>>> your plans?
>>> I though about watching movie
>>> aa
>>> ooo
>>> to be honest I like when it is raining
>>> when you don’t feel like going anywhere

Alexander Smirnov vs Friend 2

>>> hey
>>> what’s up?
>>> Hi, great!
>>> Sorry will be 20-30 min late
>>> ok
>>> why?
>>> yes, I just got up and forgot to press the «exit sleep» button
>>> my telegram didn’t close on its own
>>> I thought it behaved like this when you logging off the Internet, but no
>>> I still logged off the Internet, but telegram still didn’t respond
>>> and it was strange, because yesterday there was also such a case when I sent you a message and did not receive an answer
>>> today I looked at mine, but not yours, but yours was displayed)
>>> ok
>>> any plans?
>>> yes, we can chat a little
>>> and I can tell you that the method for generating answers worked really well for me
>>> he gives respectable results in training
>>> but not on the test

Alexander Smirnov vs Friend 3

>>> hey
>>> what’s up?
>>> hey, hi
>>> Good, you?)
>>> Haven’t gone to bed yet)0)0)0)
>>> Today I get up and go to a programming course
>>> To tighten up the base, etc.
>>> nice
>>> what did you do yesterday?
>>> Yesterday there was already the same course in which it seemed to me that I did very poorly
>>> Then I wanted to solve hw but I don’t care
>>> Otherwise I’m fucked because I can’t sleep)
>>> Really fucked up
>>> It will probably be better now, since I’m already maximally overtrained
>>> cool man
>>> what are your plans for life in general?
>>> Fuck, get a job interviews first
>>> And if everything is ok, then actually order a computer
>>> Like, I think you should order a computer when it’s more profitable than working at work
>>> But I don’t care

This is a clear improvement. It captures the style of the person it's replying to, and it also picks up on the topics a given pair of people discuss most; for instance, conversations with Friend 2 revolve more around work. But the grammar is still off, and it loses the thread of the conversation quickly. I'm fairly sure LoRA could reach reasonable quality in English, so full fine-tuning might be unnecessary there. But since Russian isn't the model's native language, let's try full fine-tuning next.

Full Fine-Tuning

Full fine-tuning is more challenging because it requires multi-GPU training. Popular approaches are ZeRO & DeepSpeed [3] and FSDP [4], where FSDP is essentially built on ZeRO3 [5]. I went with FSDP.

While building the training pipeline, I drew on the Stanford Alpaca fine-tuning code and Anton Bacaj's Mistral fine-tuning code.

I used FSDP in full-shard mode with half precision, a sequence length of 1024, and a micro batch size of 2, which required 63GB of VRAM on each 80GB A100 GPU. Training for three epochs took just 20 minutes. The VM cost $8.88 per hour, about $3 in total, excluding experiments and debugging.
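For orientation, here is a minimal sketch of that FSDP setup in plain PyTorch: full sharding, half precision, and wrapping at the decoder-layer level. The actual pipeline follows the Alpaca and Bacaj code linked above, so treat this as an illustration, and note it assumes `torchrun` has already started one process per GPU and initialized the process group.

```python
import functools
import torch
import torch.distributed as dist
from torch.distributed.fsdp import (
    FullyShardedDataParallel as FSDP,
    MixedPrecision,
    ShardingStrategy,
)
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy
from transformers import AutoModelForCausalLM
from transformers.models.mistral.modeling_mistral import MistralDecoderLayer

# One GPU per rank, process group already initialized by torchrun (assumed).
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")

# Shard at the transformer-block level.
wrap_policy = functools.partial(
    transformer_auto_wrap_policy,
    transformer_layer_cls={MistralDecoderLayer},
)

model = FSDP(
    model,
    sharding_strategy=ShardingStrategy.FULL_SHARD,  # "full shard" mode from the post
    auto_wrap_policy=wrap_policy,
    mixed_precision=MixedPrecision(                 # half precision, as in the post
        param_dtype=torch.bfloat16,
        reduce_dtype=torch.bfloat16,
        buffer_dtype=torch.bfloat16,
    ),
    device_id=torch.cuda.current_device(),
)
```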

The conversations:


Friend 1 vs Alexander Smirnov

>>> hey
>>> what’s up?
>>> hey) good
>>> you?
>>> good as well
>>> what you were doing today?
>>> played a board game
>>> I’ll go home now
>>> nice
>>> what kind of game?
>>> catan
>>> strategy
>>> very good
>>> who was there?
>>> Max Yavich and Fedya
>>> who won?
>>> Max

Friend 2 vs Alexander Smirnov

>>> hey
>>> what’s up?
>>> hi, good
>>> let’s meet in half an hour?
>>> yep
>>> what did you want?
>>> damn I haven’t finished the script for recalculating models yet
>>> and I need to finish it to send it to them
>>> this recalculation takes a lot of time, so I just want to turn around and start doing it and not waste time on preparation
>>> got it
>>> what is the plan for today?
>>> meet in half an hour to discuss what we are doing
>>> if decided, then I send an application to them and wait for a response
>>> if we haven’t decided, then we continue to work
>>> who do you mean by them?
>>> investors
>>> https://www.ycombinator.com/howtoapply
>>> there are two tests)

Friend 3 vs Alexander Smirnov

>>> hey
>>> what’s up?
>>> hey we’re sitting and solving problems
>>> what kind of problems?
>>> data analysis +-
>>> and for what?
>>> https://yandex.ru/cup/profile/
>>> it seems it just became more interesting
>>> cool
>>> what are you planning this evening?
>>> idk yet
>>> and on the weekend?
>>> apparently I’ll also finish solving problems and get some work done

Alexander Smirnov vs Friend 1

>>> hey
>>> what’s up?
>>> hey)
>>> good
>>> you?
>>> that’s okay too
>>> what did you do today?
>>> well, I worked, of course
>>> and also
>>> met with Alena
>>> ate at the ramen shop with everyone
>>> and went home
>>> by the way, we booked a place for 14 people
>>> for the evening))
>>> cool
>>> and what’s about university?
>>> yeah
>>> studied a lot
>>> in general, every day I want to study more
>>> what is the hardest thing now?
>>> well I don’t know
>>> it just takes a very long time
>>> there are some optimization methods that need to be rewritten
>>> so I think maybe it’s not worth optimizing

Alexander Smirnov vs Friend 2

>>> hey
>>> what’s up?
>>> Hey, good, what about you?
>>> Are you already in Bangkok?
>>> yeah, already here
>>> what are your plans for today?
>>> Well, I would like to work on these pioneer tasks, if you don’t mind, we can do it together
>>> Or you can skip, I’ll finish the livechat tomorrow
>>> and what about live chat?
>>> Damn right now I launched the front on the prod without this shit that generates this thing with a logo, and it swears, says to connect js from google analytics
>>> I understand that this is a fucking cringe, but I still tried to remember where to get it
>>> yeah got it
>>> what will you do on weekends?
>>> We are meeting with friends here before leaving, tomorrow we need to go to the hotel by 6 am to spend the day in the best traditions
>>> There’s another party in the evening
>>> Actually, let’s have a call, I’ll tell you

Alexander Smirnov vs Friend 3

>>> hey
>>> what’s up?
>>> Hi, haven’t you called yet?)
>>> Hey
>>> ok)
>>> I’m going to play Dota now
>>> Join if you want to)
>>> mb a little later)
>>> what are your plans for weekends?
>>> Lot of gyms/flexes/parks
>>> + maybe come to your place
>>> hmm well mb mb)
>>> what’s with your job btw?
>>> Well, I’m waiting for the info
>>> did they say what deadlines are?
>>> He said that there will be a call next week to discuss it
>>> remind me what this position is
>>> Data Scientist
>>> The team is engaged in recommending products and other shit
>>> Creating model
>>> As part of this position, you will have to develop skills in machine learning, statistical data analysis, programming and much more
>>> It is how I understood it

The conversations got livelier and more interesting, although the model can still lose context. Its Russian improved, but errors remain. I believe that for a task-specific fine-tune on limited data like mine, it would help to first do unsupervised fine-tuning on a large corpus of Russian text. Also, adding the names of frequent conversation partners to the vocabulary as dedicated tokens might improve quality further; see the sketch below.
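If one wanted to try that dedicated-token idea, the standard Hugging Face recipe would look roughly like this (the names are placeholders, and the post did not actually implement this):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")

# Register frequent chat partners as single tokens (placeholder names),
# then grow the embedding matrix to cover the new vocabulary entries.
tokenizer.add_tokens(["John Smith", "Alexander Smirnov"])
model.resize_token_embeddings(len(tokenizer))
```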

That said, I don't think full fine-tuning has a decisive advantage over LoRA here. It might be more effective to focus on a single person, computing the loss only on their responses, instead of trying to learn every conversation partner.

Closing Thoughts

Of course, I cherry-picked the results. It's not that most of the model's replies were unusable, but rather that many were overly simple, things like "I'll call you later", "busy", and "ok", which are extremely common in real conversations. Even so, the model is excellent at imitating the style of the person it plays, and it captures the topics commonly discussed between two people. What it clearly lacks is awareness of the conversation's context. Without enough context, answering questions like "hey, what's up?" or "any plans for the weekend?" is inherently hard. Perhaps a system like Rewind, which captures everything the user does on their computer, could help here.

Code

You can find the code for this project, along with instructions on running it on your own Telegram dump, in my GitHub repo. You can also view the training logs on WandB.