[论文翻译]BLOOM: 一个 176B 参数的开放访问多语言大语言模型


原文地址:https://arxiv.org/pdf/2211.05100


BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

BLOOM: 一个 176B 参数的开放访问多语言大语言模型

BigScience Workshop∗

BigScience 研讨会∗

Major Contributors

主要贡献者

Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilić, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François Yvon, Matthias Gallé, Jonathan Tow, Alexander M. Rush, Stella Biderman, Albert Webson, Pawan Sasanka Ammannamanchi, Thomas Wang, Benoît Sagot, Niklas Muennighoff, Albert Villanova del Moral, Olatunji Ruwase, Rachel Bawden, Stas Bekman, Angelina McMillan-Major, Thomas Wolf, Iz Beltagy, Huu Nguyen, Lucile Saulnier, Samson Tan, Pedro Ortiz Suarez, Victor Sanh, Hugo Laurençon, Yacine Jernite, Julien Launay, Margaret Mitchell, Colin Raffel

Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilić, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François Yvon, Matthias Gallé, Jonathan Tow, Alexander M. Rush, Stella Biderman, Albert Webson, Pawan Sasanka Ammannamanchi, Thomas Wang, Benoît Sagot, Niklas Muennighoff, Albert Villanova del Moral, Olatunji Ruwase, Rachel Bawden, Stas Bekman, Angelina McMillan-Major, Thomas Wolf, Iz Beltagy, Huu Nguyen, Lucile Saulnier, Samson Tan, Pedro Ortiz Suarez, Victor Sanh, Hugo Laurençon, Yacine Jernite, Julien Launay, Margaret Mitchell, Colin Raffel

Dataset

数据集

Aaron Gokaslan, Adi Simhi, Aitor Soroa, Albert Villanova del Moral, Alexandra Sasha Luccioni, Alham Fikri Aji, Amit Alfassy, Angelina McMillan-Major, Anna Rogers, Ariel Kreisberg Nitzav, Canwen Xu, Chenghao Mou, Chris Emezue, Christopher Akiki, Christopher Klamm, Colin Leong, Colin Raffel, Daniel van Strien, David Ifeoluwa Adelani, Dragomir Radev, Eduardo González Ponferrada, Efrat Levkovizh, Ethan Kim, Eyal Bar Natan, Francesco De Toni, Gérard Dupont, Germán Kruszewski, Giada Pistilli, Hady Elsahar, Hamza Benyamina, Hieu Tran, Hugo Laurençon, Huu Nguyen, Ian Yu, Idris Abdulmumin, Isaac Johnson, Itziar Gonzalez-Dios, Javier de la Rosa, Jenny Chim, Jesse Dodge, Jian Zhu, Jonathan Chang, Jörg Frohberg, Joseph Tobing, Joydeep Bhattacharjee, Khalid Almubarak, Kimbo Chen, Kyle Lo, Leandro Von Werra, Leon Weber, Long Phan, Loubna Ben allal, Lucile Saulnier, Ludovic Tanguy, Manan Dey, Manuel Romero Muñoz, Maraim Masoud, Margaret Mitchell, María Grandury, Mario Šaško, Max Huang, Maximin Coavoux, Mayank Singh, Mike Tian-Jian Jiang, Minh Chien Vu, Mohammad A. Jauhar, Mustafa Ghaleb, Nishant Subramani, Nora Kassner, Nurulaqilla Khamis, Olivier Nguyen, Omar Espejel, Ona de Gibert, Paulo Villegas, Pawan Sasanka Ammannamanchi, Pedro Ortiz Suarez, Peter Henderson, Pierre Colombo, Priscilla Amuok, Quentin Lhoest, Rheza Harliman, Rishi Bommasani, Roberto Luis López, Roman Castagné, Rui Ribeiro, Salomey Osei, Sampo Pyysalo, Samson Tan, Sebastian Nagel, Shamik Bose, Shamsuddeen Hassan Muhammad, Shanya Sharma, Shayne Longpre, Somaieh Nikpoor, Stanislav Silberberg, Stella Biderman, Suhas Pai, Suzana Ilić, Sydney Zink, Teven Le Scao, Thomas Wang, Tiago Timponi Torrent, Timo Schick, Tristan Thrush, Valentin Danchev, Vassilina Nikoulina, Veronika Laippala, Violette Lepercq, Vrinda Prabhu, Yacine Jernite, Zaid Alyafeai, Zeerak Talat

Aaron Gokaslan, Adi Simhi, Aitor Soroa, Albert Villanova del Moral, Alexandra Sasha Luccioni, Alham Fikri Aji, Amit Alfassy, Angelina McMillan-Major, Anna Rogers, Ariel Kreisberg Nitzav, Canwen Xu, Chenghao Mou, Chris Emezue, Christopher Akiki, Christopher Klamm, Colin Leong, Colin Raffel, Daniel van Strien, David Ifeoluwa Adelani, Dragomir Radev, Eduardo González Ponferrada, Efrat Levkovizh, Ethan Kim, Eyal Bar Natan, Francesco De Toni, Gérard Dupont, Germán Kruszewski, Giada Pistilli, Hady Elsahar, Hamza Benyamina, Hieu Tran, Hugo Laurençon, Huu Nguyen, Ian Yu, Idris Abdulmumin, Isaac Johnson, Itziar Gonzalez-Dios, Javier de la Rosa, Jenny Chim, Jesse Dodge, Jian Zhu, Jonathan Chang, Jörg Frohberg, Joseph Tobing, Joydeep Bhattacharjee, Khalid Almubarak, Kimbo Chen, Kyle Lo, Leandro Von Werra, Leon Weber, Long Phan, Loubna Ben allal, Lucile Saulnier, Ludovic Tanguy, Manan Dey, Manuel Romero Muñoz, Maraim Masoud, Margaret Mitchell, María Grandury, Mario Šaško, Max Huang, Maximin Coavoux, Mayank Singh, Mike Tian-Jian Jiang, Minh Chien Vu, Mohammad A. Jauhar, Mustafa Ghaleb, Nishant Subramani, Nora Kassner, Nurulaqilla Khamis, Olivier Nguyen, Omar Espejel, Ona de Gibert, Paulo Villegas, Pawan Sasanka Ammannamanchi, Pedro Ortiz Suarez, Peter Henderson, Pierre Colombo, Priscilla Amuok, Quentin Lhoest, Rheza Harliman, Rishi Bommasani, Roberto Luis López, Roman Castagné, Rui Ribeiro, Salomey Osei, Sampo Pyysalo, Samson Tan, Sebastian Nagel, Shamik Bose, Shamsuddeen Hassan Muhammad, Shanya Sharma, Shayne Longpre, Somaieh Nikpoor, Stanislav Silberberg, Stella Biderman, Suhas Pai, Suzana Ilić, Sydney Zink, Teven Le Scao, Thomas Wang, Tiago Timponi Torrent, Timo Schick, Tristan Thrush, Valentin Danchev, Vassilina Nikoulina, Veronika Laippala, Violette Lepercq, Vrinda Prabhu, Yacine Jernite, Zaid Alyafeai, Zeerak Talat

Tokenization

分词 (Tokenization)

Arun Raja, Benjamin Heinzerling, Benoît Sagot, Chenglei Si, Colin Raffel, Davut Emre Taşar, Elizabeth Salesky, Lucile Saulnier, Manan Dey, Matthias Gallé, Pedro Ortiz Suarez, Roman Castagné, Sabrina J. Mielke, Samson Tan, Teven Le Scao, Thomas Wang, Wilson Y. Lee, Zaid Alyafeai

Arun Raja, Benjamin Heinzerling, Benoît Sagot, Chenglei Si, Colin Raffel, Davut Emre Taşar, Elizabeth Salesky, Lucile Saulnier, Manan Dey, Matthias Gallé, Pedro Ortiz Suarez, Roman Castagné, Sabrina J. Mielke, Samson Tan, Teven Le Scao, Thomas Wang, Wilson Y. Lee, Zaid Alyafeai

Prompt Engineering

提示工程 (Prompt Engineering)

Abheesht Sharma, Albert Webson, Alexander M. Rush, Alham Fikri Aji, Andrea Santilli, Antoine Chaffin, Arnaud Stiegler, Arun Raja, Canwen Xu, Colin Raffel, Debajyoti Datta, Dragomir Radev, Eliza Szczechla, Gunjan Chhablani, Han Wang, Harshit Pandey, Hendrik Strobelt, Jason Alan Fries, Jonathan Chang, Jos Rozen, Khalid Almubarak, Leo Gao, Lintang Sutawika, M Saiful Bari, Maged S. Al-shaibani, Manan Dey, Matteo Manica, Mike Tian-Jian Jiang, Nihal Nayak, Niklas Muennighoff, Rachel Bawden, Ryan Teehan, Samuel Albanie, Shanya Sharma, Sheng Shen, Srulik Ben-David, Stella Biderman, Stephen H. Bach, Taewoon Kim, Tali Bers, Teven Le Scao, Thibault Fevry, Thomas Wang, Thomas Wolf, Trishala Neeraj, Urmish Thakker, Victor Sanh, Vikas Raunak,

Abheesht Sharma, Albert Webson, Alexander M. Rush, Alham Fikri Aji, Andrea Santilli, Antoine Chaffin, Arnaud Stiegler, Arun Raja, Canwen Xu, Colin Raffel, Debajyoti Datta, Dragomir Radev, Eliza Szczechla, Gunjan Chhablani, Han Wang, Harshit Pandey, Hendrik Strobelt, Jason Alan Fries, Jonathan Chang, Jos Rozen, Khalid Almubarak, Leo Gao, Lintang Sutawika, M Saiful Bari, Maged S. Al-shaibani, Manan Dey, Matteo Manica, Mike Tian-Jian Jiang, Nihal Nayak, Niklas Muennighoff, Rachel Bawden, Ryan Teehan, Samuel Albanie, Shanya Sharma, Sheng Shen, Srulik Ben-David, Stella Biderman, Stephen H. Bach, Taewoon Kim, Tali Bers, Teven Le Scao, Thibault Fevry, Thomas Wang, Thomas Wolf, Trishala Neeraj, Urmish Thakker, Victor Sanh, Vikas Raunak,

Xiangru Tang, Zaid Alyafeai, Zheng-Xin Yong, Zhiqing Sun, Shaked Brody, Yallow Uri, Hadar Tojarieh

Xiangru Tang, Zaid Alyafeai, Zheng-Xin Yong, Zhiqing Sun, Shaked Brody, Yallow Uri, Hadar Tojarieh

Architecture and Objective

架构和目标

Adam Roberts, Colin Raffel, Daniel Hesslow, Hady Elsahar, Hyung Won Chung, Iz Beltagy, Jaesung Tae, Jason Phang, Julien Launay, Lintang Sutawika, Lucile Saulnier, M Saiful Bari, Niklas Muennighoff, Ofir Press, Sheng Shen, Stas Bekman, Stella Biderman, Teven Le Scao, Thomas Wang, Vassilina Nikoulina, Victor Sanh, Zheng-Xin Yong

Adam Roberts, Colin Raffel, Daniel Hesslow, Hady Elsahar, Hyung Won Chung, Iz Beltagy, Jaesung Tae, Jason Phang, Julien Launay, Lintang Sutawika, Lucile Saulnier, M Saiful Bari, Niklas Muennighoff, Ofir Press, Sheng Shen, Stas Bekman, Stella Biderman, Teven Le Scao, Thomas Wang, Vassilina Nikoulina, Victor Sanh, Zheng-Xin Yong

Engineering

工程技术

Conglong Li, Deepak Narayanan, Hatim Bourfoune, Jared Casper, Jeff Rasley, Max Ryabinin, Mayank Mishra, Minjia Zhang, Mohammad Shoeybi, Myriam Peyrounette, Nicolas Patry, Niklas Muennighoff, Nouamane Tazi, Olatunji Ruwase, Omar Sanseviero, Patrick von Platen, Pierre Cornette, Pierre François Lavallée, Rémi Lacroix, Samyam Rajbhandari, Sanchit Gandhi, Shaden Smith, Stas Bekman, Stéphane Requena, Suraj Patil, Teven Le Scao, Thomas Wang, Tim Dettmers

Conglong Li, Deepak Narayanan, Hatim Bourfoune, Jared Casper, Jeff Rasley, Max Ryabinin, Mayank Mishra, Minjia Zhang, Mohammad Shoeybi, Myriam Peyrounette, Nicolas Patry, Niklas Muennighoff, Nouamane Tazi, Olatunji Ruwase, Omar Sanseviero, Patrick von Platen, Pierre Cornette, Pierre François Lavallée, Rémi Lacroix, Samyam Rajbhandari, Sanchit Gandhi, Shaden Smith, Stas Bekman, Stéphane Requena, Suraj Patil, Teven Le Scao, Thomas Wang, Tim Dettmers

Evaluation and Interpretability

评估与可解释性

Ahmed Baruwa, Albert Webson, Alexandra Sasha Luccioni, Alham Fikri Aji, Amanpreet Singh, Anastasia Cheveleva, Anne-Laure Ligozat, Arjun Subramonian, Aurélie Névéol, Charles Lovering, Dan Garrette, Deepak Tunuguntla, Dragomir Radev, Ehud Reiter, Ekaterina Taktasheva, Ekaterina Voloshina, Eli Bogdanov, Ellie Pavlick, François Yvon, Genta Indra Winata, Hailey Schoelkopf, Jaesung Tae, Jan-Christoph Kalo, Jekaterina Novikova, Jessica Zosa Forde, Jordan Clive, Jungo Kasai, Ken Kawamura, Khalid Almubarak, Liam Hazan, Lintang Sutawika, Manan Dey, Maraim Masoud, Margaret Mitchell, Marine Carpuat, Miruna Clinciu, Najoung Kim, Newton Cheng, Niklas Muennighoff, Oleg Serikov, Omer Antverg, Oskar van der Wal, Pawan Sasanka Ammannamanchi, Pierre Colombo, Rachel Bawden, Rui Zhang, Ruochen Zhang, Samson Tan, Sebastian Gehrmann, Shachar Mirkin, Shani Pais, Shanya Sharma, Shayne Longpre, Stella Biderman, Tatiana Shavrina, Thomas Scialom, Tian Yun, Tomasz Limisiewicz, Urmish Thakker, Vassilina Nikoulina, Verena Rieser, Vikas Raunak, Vitaly Protasov, Vladislav Mikhailov, Wilson Y. Lee, Yada Pruksachatkun, Yonatan Belinkov, Zachary Bamberger, Zdeněk Kasner, Zeerak Talat, Zheng-Xin Yong

Ahmed Baruwa, Albert Webson, Alexandra Sasha Luccioni, Alham Fikri Aji, Amanpreet Singh, Anastasia Cheveleva, Anne-Laure Ligozat, Arjun Subramonian, Aurélie Névéol, Charles Lovering, Dan Garrette, Deepak Tunuguntla, Dragomir Radev, Ehud Reiter, Ekaterina Taktasheva, Ekaterina Voloshina, Eli Bogdanov, Ellie Pavlick, François Yvon, Genta Indra Winata, Hailey Schoelkopf, Jaesung Tae, Jan-Christoph Kalo, Jekaterina Novikova, Jessica Zosa Forde, Jordan Clive, Jungo Kasai, Ken Kawamura, Khalid Almubarak, Liam Hazan, Lintang Sutawika, Manan Dey, Maraim Masoud, Margaret Mitchell, Marine Carpuat, Miruna Clinciu, Najoung Kim, Newton Cheng, Niklas Muennighoff, Oleg Serikov, Omer Antverg, Oskar van der Wal, Pawan Sasanka Ammannamanchi, Pierre Colombo, Rachel Bawden, Rui Zhang, Ruochen Zhang, Samson Tan, Sebastian Gehrmann, Shachar Mirkin, Shani Pais, Shanya Sharma, Shayne Longpre, Stella Biderman, Tatiana Shavrina, Thomas Scialom, Tian Yun, Tomasz Limisiewicz, Urmish Thakker, Vassilina Nikoulina, Verena Rieser, Vikas Raunak, Vitaly Protasov, Vladislav Mikhailov, Wilson Y. Lee, Yada Pruksachatkun, Yonatan Belinkov, Zachary Bamberger, Zdeněk Kasner, Zeerak Talat, Zheng-Xin Yong

Broader Impacts

更广泛的影响

Aaron Gokaslan, Alexandra Sasha Luccioni, Alham Fikri Aji, Alice Rueda, Amanda Pestana, Amir Feizpour, Ammar Khan, Amy Faranak, Ana Santos, Angelina McMillan-Major, Anthony Hevia, Antigona Unldreaj, Arash Aghagol, Arezoo Abdollahi, Aycha Tammour, Azadeh Haji Hosseini, Bahareh Behroozi, Benjamin Ajibade, Bharat Saxena, Carlos Muñoz Ferrandis, Chenghao Mou, Minh Chien Vu, Christopher Akiki, Daniel McDuff, Danish Contractor, David Ifeoluwa Adelani, David Lansky, Davis David, Douwe Kiela, Duong A. Nguyen, Edward Tan, Emi Baylor, Ezinwanne Ozoani, Fatima Mirza, Frankline Ononiwu, Gérard Dupont, Giada Pistilli, Habib Rezanejad, Hessie Jones, Huu Nguyen, Ian Yu, Indrani Bhattacharya, Irene Solaiman, Irina Sedenko, Isar Nejadgholi, Jaesung Tae, Jenny Chim, Jesse Dodge, Jesse Passmore, Josh Seltzer, Julien Launay, Julio Bonis Sanz, Khalid Almubarak, Livia Dutra, Long Phan, Mairon Samagaio, Manan Dey, Maraim Masoud, Margaret Mitchell, Margot Mieskes, Marissa Gerchick, Martha Akinlolu, Michael McKenna, Mike Qiu, Muhammed Ghauri, Mykola Burynok, Nafis Abrar, Nazneen Rajani, Niklas Muennighoff, Nishant Subramani, Nour Elkott, Nour Fahmy, Olanrewaju Samuel, Olivier Nguyen, Paulo Villegas, Pawan Sasanka Ammannamanchi, Priscilla Amuok, Ran An, Rasmus Kromann, Ryan Hao, Samira Alizadeh, Sarmad Shubber, Shanya Sharma, Shayne Longpre, Silas Wang, Somaieh Nikpoor, Sourav Roy, Stas Bekman, Stella Biderman, Suhas Pai, Suzana Ilić, Sylvain Viguier, Teven Le Scao, Thanh Le, Tobi Oyebade, Trieu Le, Tristan Thrush, Yacine Jernite, Yoyo Yang, Zach Nguyen, Zeerak Talat, ZhengXin Yong

Aaron Gokaslan, Alexandra Sasha Luccioni, Alham Fikri Aji, Alice Rueda, Amanda Pestana, Amir Feizpour, Ammar Khan, Amy Faranak, Ana Santos, Angelina McMillan-Major, Anthony Hevia, Antigona Unldreaj, Arash Aghagol, Arezoo Abdollahi, Aycha Tammour, Azadeh Haji Hosseini, Bahareh Behroozi, Benjamin Ajibade, Bharat Saxena, Carlos Muñoz Ferrandis, Chenghao Mou, Minh Chien Vu, Christopher Akiki, Daniel McDuff, Danish Contractor, David Ifeoluwa Adelani, David Lansky, Davis David, Douwe Kiela, Duong A. Nguyen, Edward Tan, Emi Baylor, Ezinwanne Ozoani, Fatima Mirza, Frankline Ononiwu, Gérard Dupont, Giada Pistilli, Habib Rezanejad, Hessie Jones, Huu Nguyen, Ian Yu, Indrani Bhattacharya, Irene Solaiman, Irina Sedenko, Isar Nejadgholi, Jaesung Tae, Jenny Chim, Jesse Dodge, Jesse Passmore, Josh Seltzer, Julien Launay, Julio Bonis Sanz, Khalid Almubarak, Livia Dutra, Long Phan, Mairon Samagaio, Manan Dey, Maraim Masoud, Margaret Mitchell, Margot Mieskes, Marissa Gerchick, Martha Akinlolu, Michael McKenna, Mike Qiu, Muhammed Ghauri, Mykola Burynok, Nafis Abrar, Nazneen Rajani, Niklas Muennighoff, Nishant Subramani, Nour Elkott, Nour Fahmy, Olanrewaju Samuel, Olivier Nguyen, Paulo Villegas, Pawan Sasanka Ammannamanchi, Priscilla Amuok, Ran An, Rasmus Kromann, Ryan Hao, Samira Alizadeh, Sarmad Shubber, Shanya Sharma, Shayne Longpre, Silas Wang, Somaieh Nikpoor, Sourav Roy, Stas Bekman, Stella Biderman, Suhas Pai, Suzana Ilić, Sylvain Viguier, Teven Le Scao, Thanh Le, Tobi Oyebade, Trieu Le, Tristan Thrush, Yacine Jernite, Yoyo Yang, Zach Nguyen, Zeerak Talat, ZhengXin Yong


Applications

应用

Abhinav Ramesh Kashyap, Albert Villanova del Moral, Alfredo Palasciano, Alison Callahan, Anima Shukla, Antonio Miranda-Escalada, Ayush Singh, Benjamin Beilharz, Bo Wang, Caio Brito, Carlos

Abhinav Ramesh Kashyap,Albert Villanova del Moral,Alfredo Palasciano,Alison Callahan,Anima Shukla,Antonio Miranda-Escalada,Ayush Singh,Benjamin Beilharz,Bo Wang,Caio Brito,Carlos

Muñoz Ferrandis, Chenxi Zhou, Chirag Jain, Christopher Akiki, Chuxin Xu, Clémentine Fourrier, Daniel León Periñán, Daniel Molano, Daniel van Strien, Danish Contractor, David Lansky, Debajyoti Datta, Dian Yu, Enrique Manjavacas, Fabio Barth, Florian Fuhrimann, Francesco De Toni, Gabriel Altay, Giyaseddin Bayrak, Gully Burns, Helena U. Vrabec, Imane Bello, Ishani Dash, Jason Alan Fries, Javier de la Rosa, Jenny Chim, Jihyun Kang, John Giorgi, Jonas Golde, Jose David Posada, Karthik Rangasai Sivaraman, Leon Weber, Lokesh Bulchandani, Lu Liu, Luisa Shinzato, Madeleine Hahn de Bykhovetz, Maiko Takeuchi, Marc Pàmies, Maria A Castillo, Marianna Nezhurina, Mario Sänger, Matthias Samwald, Michael Cullan, Michael Weinberg, Michiel De Wolf, Mina Mihaljcic, Minh Chien Vu, Minna Liu, Moritz Freidank, Myungsun Kang, Natasha Seelam, Nathan Dahlberg, Nicholas Michio Broad, Nikolaus Muellner, Pascale Fung, Patrick Haller, Ramya Chandrasekhar, Renata Eisenberg, Robert Martin, Rodrigo Canalli, Rosaline Su, Ruisi Su, Samuel Cahyawijaya, Samuele Garda, Shamik Bose, Shlok S Deshmukh, Shubhanshu Mishra, Sid Kiblawi, Simon Ott, Sinee Sang-aroonsiri, Srishti Kumar, Stefan Schweter, Stella Biderman, Stephen H. Bach, Sushil Bharati, Tanmay Laud, Théo Gigant, Tomoya Kainuma, Trishala Neeraj, Wojciech Kusa, Yanis Labrak, Yash Shailesh Bajaj, Yash Venkatraman, Yifan Xu, Yingxin Xu, Yu Xu, Zhe Tan, Zhongli Xie, Zifan Ye

Muñoz Ferrandis, Chenxi Zhou, Chirag Jain, Christopher Akiki, Chuxin Xu, Clémentine Fourrier, Daniel León Periñán, Daniel Molano, Daniel van Strien, Danish Contractor, David Lansky, Debajyoti Datta, Dian Yu, Enrique Manjavacas, Fabio Barth, Florian Fuhrimann, Francesco De Toni, Gabriel Altay, Giyaseddin Bayrak, Gully Burns, Helena U. Vrabec, Imane Bello, Ishani Dash, Jason Alan Fries, Javier de la Rosa, Jenny Chim, Jihyun Kang, John Giorgi, Jonas Golde, Jose David Posada, Karthik Rangasai Sivaraman, Leon Weber, Lokesh Bulchandani, Lu Liu, Luisa Shinzato, Madeleine Hahn de Bykhovetz, Maiko Takeuchi, Marc Pàmies, Maria A Castillo, Marianna Nezhurina, Mario Sänger, Matthias Samwald, Michael Cullan, Michael Weinberg, Michiel De Wolf, Mina Mihaljcic, Minh Chien Vu, Minna Liu, Moritz Freidank, Myungsun Kang, Natasha Seelam, Nathan Dahlberg, Nicholas Michio Broad, Nikolaus Muellner, Pascale Fung, Patrick Haller, Ramya Chandrasekhar, Renata Eisenberg, Robert Martin, Rodrigo Canalli, Rosaline Su, Ruisi Su, Samuel Cahyawijaya, Samuele Garda, Shamik Bose, Shlok S Deshmukh, Shubhanshu Mishra, Sid Kiblawi, Simon Ott, Sinee Sang-aroonsiri, Srishti Kumar, Stefan Schweter, Stella Biderman, Stephen H. Bach, Sushil Bharati, Tanmay Laud, Théo Gigant, Tomoya Kainuma, Trishala Neeraj, Wojciech Kusa, Yanis Labrak, Yash Shailesh Bajaj, Yash Venkatraman, Yifan Xu, Yingxin Xu, Yu Xu, Zhe Tan, Zhongli Xie, Zifan Ye

Organization

组织结构

Angela Fan, Christopher Akiki, Douwe Kiela, Giada Pistilli, Margot Mieskes, Mathilde Bras, Matthias Gallé, Suzana Ilić, Yacine Jernite, Younes Belkada, Thomas Wolf

Angela Fan, Christopher Akiki, Douwe Kiela, Giada Pistilli, Margot Mieskes, Mathilde Bras, Matthias Gallé, Suzana Ilić, Yacine Jernite, Younes Belkada, Thomas Wolf

Abstract

摘要

Large language models (LLMs) have been shown to be able to perform new tasks based on a few demonstrations or natural language instructions. While these capabilities have led to widespread adoption, most LLMs are developed by resource-rich organizations and are frequently kept from the public. As a step towards democratizing this powerful technology, we present BLOOM, a 176B-parameter open-access language model designed and built thanks to a collaboration of hundreds of researchers. BLOOM is a decoder-only Transformer language model that was trained on the ROOTS corpus, a dataset comprising hundreds of sources in 46 natural and 13 programming languages (59 in total). We find that BLOOM achieves competitive performance on a wide variety of benchmarks, with stronger results after undergoing multitask prompted finetuning. To facilitate future research and applications using LLMs, we publicly release our models and code under the Responsible AI License.1

大语言模型 (LLM) 已被证明能够根据少量示例或自然语言指令执行新任务。尽管这些能力推动了其广泛采用,但大多数大语言模型由资源雄厚的组织开发,且经常不向公众开放。作为推动这项强大技术民主化的一步,我们推出了 BLOOM,一个 176B 参数的开放访问语言模型,由数百名研究人员合作设计和构建。BLOOM 是一个仅解码器 (decoder-only) 的 Transformer 语言模型,在 ROOTS 语料库上训练而成,该语料库包含 46 种自然语言和 13 种编程语言(共 59 种)的数百个数据来源。我们发现,BLOOM 在各类基准测试中表现出有竞争力的性能,并在经过多任务提示微调后取得更好的结果。为了促进未来基于大语言模型的研究和应用,我们在负责任 AI 许可证 (Responsible AI License) 下公开发布我们的模型和代码。

Keywords: Language models, collaborative research

关键词:语言模型,协作研究

1. Introduction

1. 引言

Pretrained language models have become a cornerstone of modern natural language processing (NLP) pipelines because they often produce better performance from smaller quantities of labeled data. The development of ELMo (Peters et al., 2018), ULMFiT (Howard and Ruder, 2018), GPT (Radford et al., 2018), and BERT (Devlin et al., 2019) led to the widespread use of pretrained models as an initialization for finetuning on downstream tasks. The subsequent finding that pretrained language models can perform useful tasks without any additional training (Radford et al., 2019; Brown et al., 2020) further demonstrated their utility. In addition, the empirical observation that a language model’s performance tends to increase as the model is made larger—sometimes predictably (Hestness et al., 2017; Kaplan et al., 2020; Hoffmann et al., 2022) and sometimes suddenly (Wei et al., 2022)—has led to a trend of increasing scale (Zeng et al., 2021; Rae et al., 2021; Smith et al., 2022; Chowdhery et al., 2022). Apart from environmental concerns (Strubell et al., 2019; Lacoste et al., 2019; Schwartz et al., 2020), the costs of training large language models (LLMs) are only affordable for well-resourced organizations. Furthermore, until recently, most LLMs were not publicly released. As a result, the majority of the research community has been excluded from the development of LLMs. This exclusion has had concrete consequences; for example, most LLMs are primarily trained on English-language text (with notable exceptions in Chinese and Korean, e.g. Wang et al., 2021; Zeng et al., 2021; Kim et al., 2021).

预训练语言模型已成为现代自然语言处理 (NLP) 流程的基石,因为它们通常能从较少的标注数据中产生更好的性能。ELMo (Peters 等, 2018),ULMFiT (Howard 和 Ruder, 2018),GPT (Radford 等, 2018),和 BERT (Devlin 等, 2019) 的发展促使预训练模型作为微调下游任务的初始化被广泛使用。随后发现预训练语言模型可以在没有任何额外训练的情况下执行有用的任务 (Radford 等, 2019; Brown 等, 2020),这进一步证明了它们的实用性。此外,实证观察表明,随着模型规模的增大,语言模型的性能往往增加——有时是可以预测的 (Hestness 等, 2017; Kaplan 等, 2020; Hoffmann 等, 2022),有时则是突然的 (Wei 等, 2022)——这导致了规模增大的趋势 (Zeng 等, 2021; Rae 等, 2021; Smith 等, 2022; Chowdhery 等, 2022)。除了环境方面的担忧 (Strubell 等, 2019; Lacoste 等, 2019; Schwartz 等, 2020),训练大语言模型 (LLMs) 的成本只有资源充足的组织才能承担。此外,直到最近,大多数大语言模型都没有公开发布。因此,大部分研究社区被排除在大语言模型的发展之外。这种排除产生了具体的影响;例如,大多数大语言模型主要是在英语文本上进行训练的(中文和韩文有显著例外,如 Wang 等, 2021; Zeng 等, 2021; Kim 等, 2021)。

To address these issues, we present the BigScience Large Open-science Open-access Multilingual Language Model (BLOOM, BigScience Workshop, 2022). BLOOM is a 176 billion parameter language model trained on 46 natural languages and 13 programming languages that was developed and released by a collaboration of hundreds of researchers. The compute for training BLOOM was provided through a French public grant from GENCI and IDRIS, leveraging IDRIS’ Jean Zay supercomputer. To build BLOOM, we undertook a thorough design process for each of its components, including the training dataset (Section 3.1), model architecture and training objective (Section 3.2), and engineering strategy for distributed learning (Section 3.4). We also performed an analysis of the model’s capabilities (Section 4). Our overall aim is not only to publicly release a large-scale multilingual language model with performance comparable to recently developed systems, but also to document the coordinated process that went into its development (Section 2.2). The purpose of this paper is to provide a high-level overview of these design steps while referencing the individual reports we produced over the course of developing BLOOM.

为了解决这些问题,我们推出了 BigScience 大型开放科学开放获取多语言语言模型 (BigScience Large Open-science Open-access Multilingual Language Model,即 BLOOM, BigScience Workshop, 2022)。BLOOM 是一个包含 1760 亿个参数的语言模型,训练数据涵盖了 46 种自然语言和 13 种编程语言,由数百名研究人员合作开发并发布。训练 BLOOM 所需的计算资源由法国公共资助机构 GENCI 和 IDRIS 提供,利用了 IDRIS 的 Jean Zay 超级计算机。为了构建 BLOOM,我们对其每个组件进行了详尽的设计过程,包括训练数据集 (第 3.1 节),模型架构和训练目标 (第 3.2 节),以及分布式学习的工程策略 (第 3.4 节)。我们还对模型的能力进行了分析 (第 4 节)。我们的总体目标不仅是公开发布一个性能与最近开发的系统相当的大规模多语言语言模型,还要记录其开发过程中协调一致的工作流程 (第 2.2 节)。本文的目的是提供这些设计步骤的高层次概述,并引用我们在开发 BLOOM 过程中产生的各个报告。

2. Background

2. 背景

Before describing the BLOOM model itself, in this section we provide necessary background on LLMs as well as an organizational overview of the BigScience effort.

在描述 BLOOM 模型之前,本节我们将提供关于大语言模型 (LLM) 的必要背景,以及对 BigScience 项目的组织概述。

2.1 Language Modeling

2.1 语言建模

Language modeling refers to the task of modeling the probability of a sequence of tokens in a text (Shannon, 1948), where a token is a unit of text (e.g. word, subword, character or byte, etc., as discussed by Mielke et al., 2021). In this work (and in most current applications of language modeling) we model the joint probability of tokens in a text as:

语言建模是指对文本中一系列 Token 出现的概率进行建模的任务 (Shannon, 1948),其中 Token 是文本的一个单位(例如词、子词、字符或字节等,如 Mielke 等人于 2021 年所讨论的)。在本工作中(以及在当前大多数语言建模的应用中),我们将文本中 Token 的联合概率建模为:


$$
p(\boldsymbol{x})=p(x_{1},\dots,x_{T})=\prod_{t=1}^{T}p(x_{t}|\boldsymbol{x}_{<t})
$$

$$
p(\boldsymbol{x}) = p(x_{1}, \dots, x_{T}) = \prod_{t=1}^{T} p(x_{t} | \boldsymbol{x}_{<t})
$$

where $\boldsymbol{x}$ is a sequence of tokens, $x_{t}$ is the $t^{\mathrm{th}}$ token, and $\boldsymbol{x}_{<t}$ is the sequence of tokens preceding $x_{t}$. This approach is referred to as autoregressive language modeling and can be seen as iteratively predicting the probability of the next token.

其中,$\boldsymbol{x}$ 是一个 Token 序列,$x_{t}$ 是第 $t$ 个 Token,$\boldsymbol{x}_{<t}$ 是位于 $x_{t}$ 之前的 Token 序列。这种方法被称为自回归语言建模 (autoregressive language modeling),可以看作是迭代地预测下一个 Token 的概率。
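
为帮助理解上式的链式分解,下面给出一个极简的 Python 示意(并非论文的原始实现):假设我们已有一个能返回"给定前文时下一个 Token 概率分布"的函数 `next_token_probs`(此处用写死的玩具均匀分布代替),则整个序列的对数概率可以按上式逐个 Token 累加得到。

```python
import math

# 玩具示例:一个很小的词表与固定的"下一个 Token"条件分布(仅作演示,并非真实语言模型)
VOCAB = ["<bos>", "the", "cat", "sat", "<eos>"]

def next_token_probs(prefix):
    """给定前文 prefix,返回下一个 Token 的概率分布(此处为均匀分布的玩具实现)。"""
    return {tok: 1.0 / len(VOCAB) for tok in VOCAB}

def sequence_log_prob(tokens):
    """按链式法则 p(x) = ∏ p(x_t | x_<t) 计算序列的对数概率。"""
    log_prob = 0.0
    for t, token in enumerate(tokens):
        probs = next_token_probs(tokens[:t])   # 条件于前缀 x_<t
        log_prob += math.log(probs[token])     # 累加 log p(x_t | x_<t)
    return log_prob

print(sequence_log_prob(["the", "cat", "sat"]))  # 即 3 * log(1/5)
```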

Early Language Models Language models have a long history of application in NLP. Early language models (such as those developed by Shannon, 1948) were primarily $n$ -gram models that estimate the probability of a length $n$ sequence of tokens in accordance with the number of times it appears in a training corpus. In practice, $n$ -gram models face two major issues: first, they grow exponentially in size as $n$ is increased; and second, they have no direct way of producing a probability for a sequence of tokens that does not appear in their training data. Advances on these problems enabled $n$ -gram models to see widespread use across most areas of NLP (Goodman, 2001).

早期语言模型

语言模型在自然语言处理 (NLP) 中有着悠久的应用历史。早期的语言模型(例如 Shannon 在 1948 年开发的模型)主要是 $n$-gram 模型,这些模型根据长度为 $n$ 的 Token 序列在训练语料库中出现的次数来估计其概率。实际上,$n$-gram 模型面临两个主要问题:首先,随着 $n$ 的增加,模型的大小呈指数级增长;其次,它们没有直接的方法为未出现在训练数据中的 Token 序列生成概率。针对这些问题的改进使得 $n$-gram 模型能够在大多数 NLP 领域得到广泛应用 (Goodman, 2001)。
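
下面用一个基于计数的二元语法 (bigram) 概率估计来示意 $n$-gram 模型的做法(假设性示例,非任何论文实现),并用加一平滑粗略缓解"训练数据中未出现的序列概率为零"这一问题:

```python
from collections import Counter

corpus = "the cat sat on the mat the cat ate".split()

# 统计 unigram 与 bigram 的出现次数
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus[:-1], corpus[1:]))
vocab_size = len(unigrams)

def bigram_prob(prev, word):
    """P(word | prev) 的加一平滑估计:(count(prev, word) + 1) / (count(prev) + |V|)。"""
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)

print(bigram_prob("the", "cat"))  # 语料中出现过的组合,概率较高
print(bigram_prob("the", "ate"))  # 未出现过的组合,经平滑后仍有非零概率
```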

Neural Language Models An alternative to $n$-gram models, first proposed by Miikkulainen and Dyer (1991) and Schmidhuber and Heil (1996) and later popularized by Bengio et al. (2000), is to use a neural network to estimate the probability of the next token given prior tokens. While early work used feed-forward networks with a fixed-length history window, Mikolov et al. (2010); Sutskever et al. (2011); Graves (2013) proposed to use recurrent neural networks instead and found that this significantly improved performance. More recently, language models based on the Transformer architecture (Vaswani et al., 2017) were shown to be more effective than recurrent neural networks (Radford et al., 2018; Al-Rfou et al., 2019; Kaplan et al., 2020). Consequently, the Transformer has become the de facto choice for language models.

神经语言模型

一种替代 $n$-gram 模型的方法,最早由 Miikkulainen 和 Dyer (1991) 以及 Schmidhuber 和 Heil (1996) 提出,后来由 Bengio 等人 (2000) 推广普及,即使用神经网络来估计给定先前 Token 的情况下下一个 Token 的概率。虽然早期的工作使用具有固定长度历史窗口的前馈网络,但 Mikolov 等人 (2010)、Sutskever 等人 (2011)、Graves (2013) 提议改用循环神经网络,并发现这显著提高了性能。最近的研究表明,基于 Transformer 架构 (Vaswani 等人, 2017) 的语言模型比循环神经网络更有效 (Radford 等人, 2018; Al-Rfou 等人, 2019; Kaplan 等人, 2020)。因此,Transformer 已成为语言模型事实上的标准选择。

Transfer Learning In tandem with advances in language modeling using neural networks, NLP pipelines have increasingly adopted the framework of transfer learning. In transfer learning, the parameters of a model are first pretrained on a data-rich task before being finetuned on a downstream task. A historically common approach to obtaining pretrained parameters were word vectors (Mikolov et al., 2013) trained so that the dot product of co-occurring word vectors is large. However, subsequent work by Peters et al. (2018); Howard and Ruder (2018); Radford et al. (2018); Devlin et al. (2019) showed that the framework of Collobert et al. (2011), where the entire model is pretrained before being finetuned, can attain stronger performance. In particular, Radford et al. (2018); Devlin et al. (2019) demonstrated strong results using pretrained Transformer language models, prompting work on progressively better models (Liu et al., 2019; Yang et al., 2019; Lewis et al., 2020; Raffel et al., 2020; Zhang et al., 2019, etc.).

迁移学习

随着使用神经网络进行语言建模的进步,自然语言处理 (NLP) 流水线越来越多地采用了迁移学习的框架。在迁移学习中,模型的参数首先在一个数据丰富的任务上进行预训练,然后再在下游任务上进行微调。历史上常见的获取预训练参数的方法是词向量 (Mikolov et al., 2013),这些词向量经过训练,使得共现词向量的点积较大。然而,Peters 等人 (2018);Howard 和 Ruder (2018);Radford 等人 (2018);Devlin 等人 (2019) 的后续工作表明,Collobert 等人 (2011) 提出的框架,即整个模型先预训练再微调,可以获得更强的性能。特别是,Radford 等人 (2018);Devlin 等人 (2019) 展示了使用预训练的 Transformer 语言模型取得的强劲结果,这促使人们开发出越来越好的模型 (Liu et al., 2019; Yang et al., 2019; Lewis et al., 2020; Raffel et al., 2020; Zhang et al., 2019, 等)。

Few- and Zero-Shot Learning While finetuning a pretrained model remains an effective way of attaining high performance with limited labeled data, a parallel line of work has demonstrated that pretrained language models can be induced to perform tasks without any subsequent training. After Vinyals and Le (2015) observed limited task-performing behavior in a neural dialog model, Radford et al. (2019) later demonstrated that Transformer-based language models trained on text scraped from the web could perform various tasks to varying degrees. Notably, Radford et al. (2019) found that performance improved with model scale, inspiring work to characterize (Kaplan et al., 2020; Hoffmann et al., 2022) and exploit (Shoeybi et al., 2019; Brown et al., 2020; Smith et al., 2022; Chowdhery et al., 2022; Rae et al., 2021; Wang et al., 2021; Zeng et al., 2021; Zhang et al., 2022) the benefits of scale. A major factor in the success of this approach is the way that task-specific examples are formatted when fed into the model. Brown et al. (2020) popularized the idea of designing “prompts” that provide natural-language descriptions of the task and also allow inputting a few demonstrations of input-output behavior.

少样本和零样本学习

尽管微调预训练模型仍然是在有限标注数据下获得高性能的有效方法,但并行的研究方向表明,预训练语言模型可以在没有任何后续训练的情况下执行任务。在 Vinyals 和 Le (2015) 观察到神经对话模型中有限的任务执行行为后,Radford 等人 (2019) 后来证明了基于 Transformer 的语言模型可以通过从网络抓取的文本进行训练,以不同程度执行各种任务。值得注意的是,Radford 等人 (2019) 发现性能随着模型规模的增大而提高,这激发了研究工作去描述 (Kaplan et al., 2020; Hoffmann et al., 2022) 和利用 (Shoeybi et al., 2019; Brown et al., 2020; Smith et al., 2022; Chowdhery et al., 2022; Rae et al., 2021; Wang et al., 2021; Zeng et al., 2021; Zhang et al., 2022) 规模带来的好处。这种方法成功的一个重要因素是特定任务示例在输入模型时的格式化方式。Brown 等人 (2020) 推广了设计“提示词 (prompts)”的想法,这些提示词提供了任务的自然语言描述,并允许输入一些输入-输出行为的演示。
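
下面用一个简化的 Python 片段示意"自然语言任务描述 + 少量演示 (demonstrations)"的常见拼接方式(示例任务与样本均为假设,仅说明输入格式,并非论文所用的评测代码):

```python
# 用自然语言任务描述 + 少量输入-输出演示 + 待预测样本,拼成一个少样本提示
task_description = "Classify the sentiment of each review as Positive or Negative."
demonstrations = [
    ("I loved this movie, it was fantastic!", "Positive"),
    ("The plot was boring and far too long.", "Negative"),
]
query = "The acting was superb and the ending moved me to tears."

prompt_lines = [task_description, ""]
for text, label in demonstrations:
    prompt_lines.append(f"Review: {text}\nSentiment: {label}\n")
prompt_lines.append(f"Review: {query}\nSentiment:")  # 留空,让模型补全答案
prompt = "\n".join(prompt_lines)

print(prompt)
# 实际使用时,将 prompt 送入语言模型(零样本则省去 demonstrations),
# 取其生成的下一个或若干个 Token 作为预测结果。
```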

Social Limitations of LLM Development While the continued increase in the size of large language models has resulted in improvements across a wide range of tasks, it has also exacerbated issues with their development and use (Bender et al., 2021). The computational expense of large models also prohibits the majority of the research community from participating in their development, evaluation and routine use. Moreover, the computational costs have also led to concerns about the carbon footprint stemming from the training and use of large language models (Strubell et al., 2019; Lacoste et al., 2019; Schwartz et al., 2020; Bannour et al., 2021), and existing carbon footprint studies have likely under-estimated emissions (Bannour et al., 2021). Contributing to an increase in the global carbon footprint exacerbates climate change which most severely affects already-marginalized communities (Westra and Lawson, 2001). Furthermore, the concentration of resources within a handful of (typically industrial) institutions with primarily technical expertise hinders prospects for an inclusive, collaborative, and reliable governance of the technology. First, public narratives about the technology that are driven by industry actors can lead to inflated expectations about its suitability for use (Brennen, 2018; Brennen et al., 2022), leading to misaligned research and policy priorities (Raji et al., 2022) and potentially dire consequences in e.g. medical applications (Wong et al., 2021). Second, in a world mediated by technology, choices at all stages of its development end up shaping people’s lives in a way that can be most closely compared to regulations (Winner, 1977, 2017), albeit without the same explicit consultation of stakeholders in the process. When the development efforts are guided by prioritizing internal definitions of performance over their impact on society, the values of the developers come to be emphasized over those of the direct and indirect users (Birhane et al., 2022). Despite the substantial social dangers in allowing this technology to be developed unilaterally by corporations, EleutherAI (Phang et al., 2022) was the only non-corporate entity outside of China that was developing large language models before the BigScience Workshop was convened.

大语言模型发展的社会局限性

尽管大语言模型的规模持续增加,在广泛的任务中带来了改进,但也加剧了其开发和使用中的问题 (Bender et al., 2021)。大模型的计算成本也使大部分研究社区无法参与其开发、评估和常规使用。此外,计算成本还引发了对大语言模型训练和使用过程中碳足迹的担忧 (Strubell et al., 2019; Lacoste et al., 2019; Schwartz et al., 2020; Bannour et al., 2021),而现有的碳足迹研究可能低估了排放量 (Bannour et al., 2021)。增加全球碳足迹会加剧气候变化,这对已经边缘化的社区影响最为严重 (Westra 和 Lawson, 2001)。此外,资源集中在少数(通常是工业界的)以技术专长为主的机构中,阻碍了对该技术进行包容、协作和可靠治理的前景。首先,由行业参与者主导的技术公共叙事可能导致对其适用性的期望过高 (Brennen, 2018; Brennen et al., 2022),导致研究和政策重点错位 (Raji et al., 2022),并在例如医疗应用中产生潜在的严重后果 (Wong et al., 2021)。其次,在一个由技术中介的世界中,其开发各个阶段的选择最终以类似于法规的方式塑造人们的生活 (Winner, 1977, 2017),但在这一过程中并没有对利益相关者进行同样明确的征询。当开发工作优先考虑内部的性能定义而非其对社会的影响时,开发者的价值观便会凌驾于直接和间接用户的价值观之上 (Birhane et al., 2022)。尽管让这种技术由公司单方面开发存在巨大的社会风险,但在 BigScience Workshop 召开之前,EleutherAI (Phang et al., 2022) 是中国以外唯一一个正在开发大语言模型的非企业实体。

2.2 BigScience

2.2 大科学 (BigScience)

Participants BLOOM’s development was coordinated by BigScience, an open research collaboration whose goal was the public release of an LLM. The project started after being awarded by GENCI a compute grant on its Jean Zay supercomputer at IDRIS/CNRS. It was initially built around a concerted effort from Hugging Face and the French NLP community (the “founding members”), and quickly opened up to grow into a broader international collaboration to support its aims of linguistic, geographical, and scientific diversity. In the end, over 1200 people registered as participants in BigScience and were given access to its communication channels. They had background not only in machine learning and computer science, but also linguistics, statistics, socio-cultural anthropology, philosophy, law, and other fields. Of those, hundreds of individuals have directly contributed to one of the project’s released artifacts. While the largest number of participants ultimately originated from the US, 38 countries were represented.

BLOOM 的开发由 BigScience 协调,BigScience 是一个开放的研究合作项目,其目标是公开发布一个大语言模型 (LLM)。该项目在获得 GENCI 授予的 Jean Zay 超级计算机(位于 IDRIS/CNRS)的计算资源后启动。最初,该项目由 Hugging Face 和法国 NLP 社区(“创始成员”)共同努力构建,并迅速扩展,成长为更广泛的国际合作,以支持其语言、地理和科学多样性目标。最终,超过 1200 人注册成为 BigScience 的参与者,并获得了其通信渠道的访问权限。这些参与者的背景不仅包括机器学习和计算机科学,还包括语言学、统计学、社会文化人类学、哲学、法律等领域。其中,数百人直接为项目发布的成果做出了贡献。尽管最终来自美国的参与者最多,但共有 38 个国家的代表参与了该项目。

Organization The set of related research questions tackled by the BigScience effort was reflected in the project’s organization into working groups. Each working group comprised several participants with various levels of involvement, including chairs whose role was to self-organize around a specific aspect of the overall project. Importantly, participants were encouraged to join more than one working group in order to share experiences and information, which resulted in the set of 30 working groups presented in Figure 1. Most of the working groups focused on tasks directly linked to the development of BLOOM. In addition, a few groups focused on the evaluation of LLMs and dataset development in specific domains, such as biomedical texts (Fries et al., 2022b) and historical texts (De Toni et al., 2022). A larger overview of the motivations behind this initiative, its history and some of the lessons learned can be found in Akiki et al. (2022).

组织结构

大科学 (BigScience) 努力解决的一系列相关研究问题反映在项目按工作小组划分的组织结构中。每个工作小组由若干参与者组成,参与程度各不相同,包括负责围绕项目某个特定方面自我组织的主席。重要的是,鼓励参与者加入多个工作小组以分享经验和信息,这导致了如图 1 所示的 30 个工作小组的形成。大多数工作小组专注于与 BLOOM 开发直接相关的任务。此外,少数小组专注于特定领域的 LLM 评估和数据集开发,例如生物医学文本 (Fries et al., 2022b) 和历史文本 (De Toni et al., 2022)。关于这一倡议背后的动机、历史以及一些经验教训的更广泛概述可以在 Akiki et al. (2022) 中找到。



Figure 1: Organization of BigScience working groups.


图 1: BigScience 工作组的组织结构。

Ethical Considerations within BigScience In order to acknowledge and start addressing social limitations of LLM development within BigScience, the workshop relied on a collaboratively designed Ethical Charter$^2$ and original research on applicable regulations in jurisdictions outside of the US$^3$ to guide its choices throughout the project. In particular, the charter emphasizes values of inclusivity and diversity, openness and reproducibility, and responsibility in various aspects of the organization (Akiki et al., 2022). Each of these values are showcased in different ways in the dataset curation (Section 3.1), modeling (Section 3.2), engineering (Section 3.4), evaluation (Section 4), and other social impact (throughout) aspects of the project.

BigScience 中的伦理考量

为了承认并开始解决 BigScience 内大语言模型开发的社会局限性,研讨会依赖于共同设计的《伦理宪章》$^2$ 和关于美国以外司法管辖区适用法规的原始研究 $^3$ 来指导其在整个项目中的选择。特别是,《宪章》强调了包容性和多样性、开放性和可重复性以及责任感等价值观,这些价值观体现在组织的各个方面(Akiki 等,2022)。这些价值观以不同方式在数据集管理(第 3.1 节)、建模(第 3.2 节)、工程(第 3.4 节)、评估(第 4 节)和其他社会影响(贯穿全文)方面得到体现。

3. BLOOM

3. BLOOM


In this section, we document the design of BLOOM, including its training dataset (Section 3.1), architecture (Section 3.2), tokenizer (Section 3.3), computing infrastructure (Section 3.4), and training hyperparameters (Section 3.5).

在本节中,我们记录了 BLOOM 的设计,包括其训练数据集 (第 3.1 节)、架构 (第 3.2 节)、分词器 (第 3.3 节)、计算基础设施 (第 3.4 节) 和训练超参数 (第 3.5 节)。

3.1 Training Dataset

3.1 训练数据集

BLOOM was trained on the ROOTS corpus (Laurençon et al., 2022), a composite collection of 498 Hugging Face datasets (Lhoest et al., 2021) amounting to 1.61 terabytes of text that span 46 natural languages and 13 programming languages. A high-level overview of this dataset can be seen in Figure 3, while a detailed itemized list of every language along with its linguistic genus, family and macroarea is presented in Table 1. Beyond the corpus itself, the process resulted in the development and release of a number of organizational and technical tools, including those illustrated in Figure 2. The rest of this section will contextualize these efforts by providing a brief summary of the steps taken to compile the corpus. For more detailed documentation of the overall dataset curation process and its outcomes, we refer the reader to Laurençon et al. (2022).

BLOOM 在 ROOTS 语料库 (Laurençon 等, 2022) 上进行了训练,该语料库是由 498 个 Hugging Face 数据集 (Lhoest 等, 2021) 组成的综合集合,总文本量为 1.61 太字节,涵盖了 46 种自然语言和 13 种编程语言。可以在图 3 中看到该数据集的高层次概述,而表 1 则列出了每种语言及其语言学属、族和大区的详细清单。除了语料库本身,这一过程还开发和发布了若干组织和技术工具,包括那些在图 2 中展示的工具。本节其余部分将通过简要总结编译语料库所采取的步骤来对这些努力进行背景说明。对于整个数据集整理过程及其结果的更详细文档,我们请读者参阅 Laurençon 等 (2022)。

Table 1: Linguistic makeup of the ROOTS corpus.

表 1: ROOTS 语料库的语言构成。

语言 ISO-639-3 catalog-ref 语族 (Genus) 语系 (Family) 宏观区域 大小(字节)
Akan aka ak Kwa 尼日尔-刚果语系 非洲 701,554
Arabic arb ar 半闪语族 阿非罗-亚细亚语系 欧亚大陆 74,854,900,600
Assamese asm as 印地语族 印欧语系 欧亚大陆 291,522,098
Bambara bam bm 西部曼德语族 曼德语系 非洲 391,747
Basque eus eu 巴斯克语族 巴斯克语系 欧亚大陆 2,360,470,848
Bengali ben bn 印地语族 印欧语系 欧亚大陆 18,606,823,104
Catalan cat ca 罗曼语族 印欧语系 欧亚大陆 17,792,493,289
Chichewa nya ny 班图语族 尼日尔-刚果语系 非洲 1,187,405
chiShona sna sn 班图语族 尼日尔-刚果语系 非洲 6,638,639
Chitumbuka tum tum 班图语族 尼日尔-刚果语系 非洲 170,360
English eng en 日耳曼语族 印欧语系 欧亚大陆 484,953,009,124
Fon fon fon Kwa 尼日尔-刚果语系 非洲 2,478,546
French fra fr 罗曼语族 印欧语系 欧亚大陆 208,242,620,434
Gujarati guj gu 印地语族 印欧语系 欧亚大陆 1,199,986,460
Hindi hin hi 印地语族 印欧语系 欧亚大陆 24,622,119,985
Igbo ibo ig 伊博语族 尼日尔-刚果语系 非洲 14,078,521
Indonesian ind id 马来-巽他语族 南岛语系 巴布亚尼西亚 (Papunesia) 19,972,325,222
isiXhosa xho xh 班图语族 尼日尔-刚果语系 非洲 14,304,074
isiZulu zul zu 班图语族 尼日尔-刚果语系 非洲 8,511,561
Kannada kan kn 南部达罗毗荼语族 达罗毗荼语系 欧亚大陆 2,098,453,560
Kikuyu kik ki 班图语族 尼日尔-刚果语系 非洲 359,615
Kinyarwanda kin rw 班图语族 尼日尔-刚果语系 非洲 40,428,299
Kirundi run rn 班图语族 尼日尔-刚果语系 非洲 3,272,550
Lingala lin ln 班图语族 尼日尔-刚果语系 非洲 1,650,804
Luganda lug lg 班图语族 尼日尔-刚果语系 非洲 4,568,367
Malayalam mal ml 南部达罗毗荼语族 达罗毗荼语系 欧亚大陆 3,662,571,498
Marathi mar mr 印地语族 印欧语系 欧亚大陆 1,775,483,122
Nepali nep ne 印地语族 印欧语系 欧亚大陆 2,551,307,393
Northern Sotho nso nso 班图语族 尼日尔-刚果语系 非洲 1,764,506
Odia ori or 印地语族 印欧语系 欧亚大陆 1,157,100,133
Portuguese por pt 罗曼语族 印欧语系 欧亚大陆 79,277,543,375
Punjabi pan pa 印地语族 印欧语系 欧亚大陆 1,572,109,752
Sesotho sot st 班图语族 尼日尔-刚果语系 非洲 751,034
Setswana tsn tn 班图语族 尼日尔-刚果语系 非洲 1,502,200
Simplified Chinese zhs 汉语族 汉藏语系 欧亚大陆 261,019,433,892
Spanish spa es 罗曼语族 印欧语系 欧亚大陆 175,098,365,045
Swahili swh sw 班图语族 尼日尔-刚果语系 非洲 236,482,543
Tamil tam ta 南部达罗毗荼语族 达罗毗荼语系 欧亚大陆 7,989,206,220
Telugu tel te 南中部达罗毗荼语族 达罗毗荼语系 欧亚大陆 2,993,407,159
Traditional Chinese zht 汉语族 汉藏语系 欧亚大陆 762,489,150
Twi twi tw Kwa 尼日尔-刚果语系 非洲 1,265,041
Urdu urd ur 印地语族 印欧语系 欧亚大陆 2,781,329,959
Vietnamese vie vi 越南-姆昂语族 南亚语系 欧亚大陆 43,709,279,959
Wolof wol wo 沃洛夫语族 尼日尔-刚果语系 非洲 3,606,973
Xitsonga tso ts 班图语族 尼日尔-刚果语系 非洲 707,634
Yoruba yor yo Defoid 尼日尔-刚果语系 非洲 89,695,835
Programming Languages 174,700,245,772

Motivation The disconnect between developers and (in)voluntary users of the technology mentioned in Section 2 is particularly apparent in the curation of the datasets that have supported recent large-scale machine learning projects, where intentional “Data work” is generally under-valued (Sambasivan et al., 2021). In the context of LLMs, this tendency is exemplified by a range of heuristics-based filtering approaches that prioritize getting as much “high-quality” data for as little cost as possible over engaging with the needs—and rights—of data subjects, where quality is commonly defined as maximizing performance on downstream tasks while occasionally removing content deemed offensive by the developers.

动机

第 2 节提到的技术的开发者与(非)自愿用户之间的脱节在支持近期大规模机器学习项目的数据集整理中尤为明显,其中有意的“数据工作”通常被低估 (Sambasivan et al., 2021)。在大语言模型 (LLM) 的背景下,这种倾向体现在一系列基于启发式的过滤方法上,这些方法优先考虑以尽可能低的成本获取尽可能多的“高质量”数据,而不是关注数据主体的需求和权利,这里的质量通常被定义为最大化下游任务的性能,同时偶尔移除开发人员认为具有冒犯性的内容。

While these approaches do yield terabytes of data with comparatively little human effort, compounding biases of the source material (such as Common Crawl dumps) with those of the filtering method often leads to negative outcomes for marginalized populations. In one case, the use of a block list to remove “pornographic” text was shown to also suppress LGBTQ+ and African American English (AAE) text from a corpus (Dodge et al., 2021). In another, using Reddit outgoing links as an indicator of quality for a seed corpus (Radford et al., 2019) leads to trained models that implicitly prioritize US-centric views in their outputs (Johnson et al., 2022). In yet another project, a filtering approach that relied on a machine learning image-text alignment model was shown to exacerbate its biases in the created multimodal dataset (Birhane et al., 2021). In addition, this abstractive approach to data curation leads to corpora that are difficult to meaningfully document and govern after the fact, as the provenance and authorship of individual items is usually lost in the process (although works such as Gao et al. (2020) that prioritize compilations of previously documented individual sources over crawled data provide a step towards addressing these issues (Biderman et al., 2022)).

虽然这些方法确实可以用相对较少的人力生成数以太字节计的数据,但源材料(如 Common Crawl 数据转储)的累积偏差与过滤方法的偏差相结合,往往会对边缘化群体产生负面影响。在一个案例中,使用阻止列表移除“色情”文本被证明也会抑制语料库中的 LGBTQ+ 和非裔美式英语 (AAE) 文本 (Dodge 等, 2021)。在另一个案例中,使用 Reddit 外部链接作为种子语料库质量的指标 (Radford 等, 2019),导致训练出的模型在其输出中隐含地优先考虑以美国为中心的观点 (Johnson 等, 2022)。在另一个项目中,依赖于机器学习图像-文本对齐模型的过滤方法被证明会加剧其在创建的多模态数据集中存在的偏差 (Birhane 等, 2021)。此外,这种抽象的数据管理方法使得语料库难以在事后有意义地记录和治理,因为单个项目的来源和作者身份通常在这个过程中丢失(尽管像 Gao 等 (2020) 这样的工作通过优先编译先前已记录的单个来源而不是爬取的数据来解决这些问题 (Biderman 等, 2022))。

In the context of the BigScience workshop, and in accordance with its Ethical Charter,4 we aimed to prioritize human involvement, local expertise, and language expertise in our data curation and documentation process, as outlined in the following sections.

在 BigScience 工作坊的背景下,并根据其伦理宪章,我们力求在数据整理和文档编制过程中优先考虑人类参与、本地专业知识和语言专业知识,如以下各节所述。

3.1.1 Data Governance

3.1.1 数据治理

Large text corpora comprise text about and created by people: the data subjects. Different people and institutions might legally “own” that data, making them data rights-holders. As machine learning developers gather and collate that data into ever-larger datasets to support training larger models, it becomes increasingly important to develop new ways of accounting for the interests of all parties involved – developers, data subjects, and rights-holders alike.

大型文本语料库包含由人编写和关于人的文本:数据主体。不同的人和机构可能在法律上“拥有”这些数据,使他们成为数据权利持有人。随着机器学习开发者收集和整理这些数据到越来越大的数据集中以支持训练更大的模型,为所有相关方(开发者、数据主体和权利持有人)的利益开发新的考量方式变得越来越重要。

The BigScience effort aimed to address these needs through a multidisciplinary lens combining technical, legal, and sociological expertise. The group focused on two main interrelated goals at two different time scales: the design of a structure for long-term international data governance that prioritizes the agency of the data rights-holders, and concrete recommendations for handling the data used directly in the BigScience project. Progress on the first goal is presented in the work of Jernite et al. (2022), which further motivates the needs and requirements of data governance, and outlines the structure needed for a network of data custodians, rights-holders, and other parties to appropriately govern shared data. The interactions between these actors are designed to account for the privacy, intellectual property, and user rights of the data and algorithm subjects in a way that aims to prioritize local knowledge and expression of guiding values. In particular, this approach relies on structured agreements between data providers and data hosts $^{5}$ that specify what the data may be used for.

BigScience 计划旨在通过结合技术、法律和社会学专业知识的多学科视角来满足这些需求。该小组专注于两个不同时间尺度上的两个主要且相互关联的目标:设计一个以数据权利持有人的自主权为优先的长期国际数据治理结构,以及为处理直接用于 BigScience 项目的数据提供具体建议。关于第一个目标的进展在 Jernite 等人 (2022) 的工作中有所介绍,这进一步强调了数据治理的需求和要求,并概述了数据托管方、权利持有人和其他相关方适当管理共享数据所需的结构。这些参与者之间的互动旨在考虑数据和算法主体的隐私、知识产权和用户权利,力求优先考虑本地知识和指导价值观的表达。特别是,这种方法依赖于数据提供者和数据托管方之间达成的结构性协议 (5) ,以明确规定数据的使用范围。

While we were not able to fully establish an international organization in the comparatively short time between the project start and model training, we worked on integrating lessons from this effort (and conversely adapting it to the practical concerns we were experiencing) in the following main ways: (i) we sought explicit permission to use the data from specific providers within the context of BigScience whenever possible (such as for the AI2$^6$-managed S2ORC corpus of Lo et al. (2020) or articles from the French newspaper Le Monde7); (ii) we kept individual sources separate until the final stages of preprocessing to maintain traceability and handle each according to the needs of its specific context; and (iii) we adopted a composite release approach for the various data sources that make up the overall corpus to foster reproducibility and follow-up research while respecting these source-dependent needs. Resources to visualize and access the ROOTS corpus can be found on the Hugging Face Hub organization “BigScience Data”.8 The organization hosts several demos (or “Spaces”) that can be used to gain insights into the full corpus, as well as direct access to the 223 (out of 498) components that we are able to distribute taking into account their licensing status, privacy risks, and agreements with their original custodians. Finally, since we understand that future investigation into the BLOOM models may require full access to the entire corpus, we are also inviting researchers with a relevant research project in mind to join ongoing efforts to analyze the data through a sign-up form.9

虽然我们未能在项目启动和模型训练之间相对较短的时间内完全建立一个国际组织,但我们致力于将此次努力的经验教训(以及反过来适应我们所经历的实际问题)以以下主要方式整合:(i) 我们尽可能寻求特定提供者在 BigScience 范围内使用数据的明确许可(例如,AI2 管理的 Lo 等人 (2020) 的 S2ORC 语料库或来自法国报纸 Le Monde 的文章);(ii) 我们在预处理的最后阶段之前保持各个来源的独立性,以确保可追溯性,并根据其具体需求进行处理;(iii) 我们采用了一种综合发布方法,针对构成整个语料库的各种数据源,以促进可重复性和后续研究,同时尊重这些来源依赖的需求。可以在 Hugging Face Hub 组织 “BigScience Data” 上找到用于可视化和访问 ROOTS 语料库的资源。该组织托管了多个演示(或“Spaces”),可以用来深入了解整个语料库,以及直接访问我们能够分发的 223 个组件(共 498 个),这考虑到了它们的许可状态、隐私风险以及与原始保管人的协议。最后,由于我们理解未来对 BLOOM 模型的研究可能需要完全访问整个语料库,因此我们也邀请有相关研究项目的研究人员通过注册表单加入正在进行的数据分析工作。
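
作为示意,下面给出通过 Hugging Face `datasets` 库访问 "BigScience Data" 组织下某个已公开组件的大致方式。其中数据集名称 `bigscience-data/roots_zh_wikipedia`、`train` 切分以及 `text` 字段均为假设的占位符,实际可用的组件名称、字段与访问条件请以该组织页面为准:

```python
from datasets import load_dataset

# 占位符名称:实际组件名请在 Hugging Face Hub 的 "BigScience Data" 组织页面查询;
# 部分组件可能需要先同意使用条款或登录后才能访问。
dataset_name = "bigscience-data/roots_zh_wikipedia"  # 假设的示例名称

ds = load_dataset(dataset_name, split="train", streaming=True)  # 流式读取,避免整体下载
for i, example in enumerate(ds):
    print(example["text"][:200])  # 假设该组件包含 "text" 字段
    if i >= 2:
        break
```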

3.1.2 Data Sources

3.1.2 数据来源

Given a strategy for data governance, the next step was to determine the composition of the training corpus. This stage was driven by several goals, which sometimes had inherent tensions. Some of those tensions included building a language model that was accessible to as many people as possible around the world while only including languages for which we had enough expertise to curate a dataset of comparable scale (and to a lesser extent composition) to previous efforts while improving the standards of documentation and respect for data and algorithm subject rights.

在确定了数据治理策略后,下一步是确定训练语料库的组成。这一阶段由几个目标驱动,这些目标之间有时存在内在的张力。其中一些张力包括:一方面要构建一个让世界上尽可能多的人都能使用的语言模型,另一方面又只纳入那些我们拥有足够专业知识、能够整理出在规模上(以及在较小程度上在组成上)可与以往工作相当的数据集的语言,同时还要提高文档标准,并更好地尊重数据主体和算法主体的权利。

Language Choices These considerations led us to an incremental process for choosing which languages were to be included in the corpus. We started with a list of eight of the world’s largest languages by number of speakers for which we did active outreach in the early stages of the project to invite fluent speakers to join the data efforts. Then, on the recommendation of language communities (Nekoto et al., 2020) we expanded Swahili in the original selection to the category of Niger-Congo languages, and Hindi and Urdu to

语言选择

这些考虑因素促使我们采取了一个逐步的过程来选择要包含在语料库中的语言。我们从按使用人数最多的八种世界主要语言开始,在项目初期积极联系这些语言的流利使用者,邀请他们加入数据工作。然后,根据语言社区的建议 (Nekoto et al., 2020),我们将原始选择中的斯瓦希里语扩展到尼日尔-刚果语系,并将印地语和乌尔都语扩展到

Indic languages (Kunchukuttan et al., 2020). Finally, we proposed that any group of 3 or more participants fluent in an additional language could add it to the supported list if they would commit to selecting sources and guiding processing choices in the language in order to avoid common issues with corpora selected through automatic language identification without specific language expertise (Caswell et al., 2022).

印度语族语言 (Indic languages) (Kunchukuttan 等, 2020)。最后,我们建议任何有 3 名或更多参与者熟练掌握某一额外语言的小组可以将其添加到支持的语言列表中,前提是他们承诺选择该语言的来源并指导处理决策,以避免在没有特定语言专业知识的情况下通过自动语言识别选择语料库所导致的常见问题 (Caswell 等, 2022)。

Source Selection The biggest part of the corpus was curated by workshop participants and research collectives who collectively compiled the “BigScience Catalogue”: a large list of processed and non-processed sources covering a wide range of languages. This took the form of hackathons that were co-organized by communities such as Machine Learning Tokyo, Masakhane, and LatinX in AI (McMillan-Major et al., 2022). Complementary to those efforts, other working group participants compiled language-specific resources such as the Arabic-focused Masader repository (Alyafeai et al., 2021; Altaher et al., 2022). A total of 252 sources were identified through this bottom-up approach, with at least 21 sources per language category. Additionally, in order to increase the geographic coverage of some of our Spanish, Chinese, French, and English sources, participants identified locally relevant websites in their language to be added to the corpus via pseudo crawl, a method to obtain those websites from a Common Crawl snapshot.

源选择

语料库的最大部分由工作坊参与者和研究集体共同策划,他们共同编制了“BigScience Catalogue”:一个涵盖广泛语言的大型已处理和未处理来源列表。这以黑客松的形式呈现,由 Machine Learning Tokyo、Masakhane 和 LatinX in AI (McMillan-Major 等, 2022) 等社区共同组织。除此之外,其他工作组参与者还编制了特定语言的资源,例如专注于阿拉伯语的 Masader 仓库 (Alyafeai 等, 2021; Altaher 等, 2022)。通过这种自下而上的方法,共确定了 252 个来源,每种语言类别至少有 21 个来源。此外,为了增加我们一些西班牙语、中文、法语和英语来源的地理覆盖范围,参与者识别了其语言中的本地相关网站,并通过伪爬取的方法将这些网站添加到语料库中,该方法是从 Common Crawl 快照中获取这些网站。

GitHub Code The catalogue was further complemented with a dataset of programming languages collected from the GitHub data collection on Google’s BigQuery,$^{10}$ which was then deduplicated of exact matches. The choice of languages to include mirrored the design choices introduced by Li et al. (2022) to train the AlphaCode model.

GitHub 代码

该目录进一步补充了一个从 Google BigQuery 上的 GitHub 数据集合中收集的编程语言数据集,并对完全匹配的重复项进行了去重。所纳入编程语言的选择沿用了 Li 等人 (2022) 训练 AlphaCode 模型时引入的设计选择。
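
"去除完全匹配的重复项"在实现上通常可以通过对文件内容取哈希并维护一个"已见集合"来完成。下面是与这一思路对应的极简示意(假设性示例,并非 BigScience 实际使用的工具):

```python
import hashlib

def exact_dedup(documents):
    """按内容的 SHA-256 哈希去除完全相同的文档,保留首次出现的版本。"""
    seen = set()
    unique_docs = []
    for doc in documents:
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique_docs.append(doc)
    return unique_docs

files = ["print('hello')\n", "print('hello')\n", "def f():\n    return 1\n"]
print(len(exact_dedup(files)))  # 2:完全相同的文件只保留一份
```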

OSCAR Both in an effort not to diverge from the standard research practice of using the Web as a source of pretraining data (Radford et al., 2018; Raffel et al., 2020), and also to satisfy the data volume needs of our compute budget given the size of BLOOM, we further sourced data from OSCAR version 21.09, corresponding to the February 2021 snapshot of the Common Crawl (Ortiz Suárez et al., 2019; Abadji et al., 2021), which ended up constituting 38% of the corpus.

为了不偏离使用网络作为预训练数据来源的标准研究实践 (Radford et al., 2018; Raffel et al., 2020),同时也为了满足我们计算预算下的数据量需求,考虑到 BLOOM 的规模,我们进一步从 OSCAR 21.09 版本获取了数据,该版本对应于 2021 年 2 月的 Common Crawl 快照 (Ortiz Suárez et al., 2019; Abadji et al., 2021),最终构成了语料库的 38%。

3.1.3 Data Preprocessing

3.1.3 数据预处理

After the sources had been identified, data processing involved several steps to handle multiple aspects of data curation. An overarching view of the processing pipeline used to build ROOTS can be seen in Figure 2. All tools developed in the process are available on GitHub.11

在确定了数据来源后,数据处理涉及多个步骤以处理数据管理的各个方面。可以在图 2 中看到构建 ROOTS 的整体视图和处理管道。所有在该过程中开发的工具均托管在 GitHub 上 [11]。

Obtaining the Source Data The first step involved obtaining the data for all of the text data sources identified in Section 3.1.2, which consisted of a combination of downloading and extracting the text field from a variety of NLP datasets in various formats (including e.g. question answering, summarization, or dialogue datasets), scraping and processing large amounts of PDF files from archives (e.g. the French repository of scientific articles12), and extracting and preprocessing text from 192 website entries from the catalogue and another geographically diverse set of 456 websites selected by data working group members. The latter required the development of new tools to extract text from the HTML in the Common Crawl WARC files, which we made available on the main data preparation repository.13 We were able to find and extract usable text data from all URLs present in 539 of the websites.

获取源数据

第一步涉及获取第 3.1.2 节中确定的所有文本数据源的数据,这包括从各种格式的自然语言处理 (NLP) 数据集中下载和提取文本字段(包括例如问答、摘要或对话数据集),从档案中抓取和处理大量 PDF 文件(例如法国科学文章存储库12),以及从目录中的 192 个网站条目和由数据工作组成员选择的另一组地理分布广泛的 456 个网站中提取和预处理文本。后者需要开发新工具以从 Common Crawl WARC 文件中的 HTML 提取文本,这些工具我们已发布在主要的数据准备仓库中。13 我们能够从其中 539 个网站上出现的所有 URL 找到并提取可用的文本数据。
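
下面给出一个从 Common Crawl WARC 文件中提取 HTML 正文文本的极简示意(使用 `warcio` 与 `beautifulsoup4`,仅为思路演示,并非 BigScience 实际发布的提取工具;文件路径与站点列表均为假设的占位符):

```python
from warcio.archiveiterator import ArchiveIterator
from bs4 import BeautifulSoup

def extract_texts(warc_path, target_hosts):
    """遍历 WARC 归档,对命中目标站点的 HTML 响应做简单的正文抽取。"""
    texts = []
    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue
            url = record.rec_headers.get_header("WARC-Target-URI") or ""
            if not any(host in url for host in target_hosts):
                continue
            html = record.content_stream().read()
            soup = BeautifulSoup(html, "html.parser")
            for tag in soup(["script", "style"]):  # 去掉脚本与样式标签
                tag.decompose()
            texts.append(soup.get_text(separator="\n", strip=True))
    return texts

# 用法示意(路径与站点列表为占位符)
# texts = extract_texts("CC-MAIN-example-00000.warc.gz", ["example.org"])
```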


Figure 2: Creation Pipeline of the ROOTS Corpus. The purple-colored sourcing stage of the pipeline and the yellow-colored processing stage are described respectively in Section 3.1.2 and Section 3.1.3.

图 2: ROOTS 语料库的创建流程。管道中紫色的来源阶段和黄色的处理阶段分别在第 3.1.2 节和第 3.1.3 节中描述。

“Quality” filtering: Text Produced by Humans for Humans After obtaining the text, we found that most of the sources contained some amount of text that was not natural language, for example preprocessing errors, SEO pages, or spam (including pornographic spam). In order to filter non-natural language, we defined a set of quality indicators, where high-quality text is defined as “written by humans for humans”, without distinction of content (as we wanted content selection to exclusively be the domain of the more accountable human source selection) or a priori judgments of grammaticality. The full list of indicators are described in (Laurençon et al., 2022). Importantly, the indicators were adapted to the needs of each of the sources in two main ways. First, their parameters such as the thresholds and supporting term lists were selected individually for each language by fluent speakers. Second, we manually went through each individual source to identify which indicators were most likely to identify non-natural language. Both processes were supported by tools to visualize their impact.14,15

质量过滤:由人类为人类生成的文本

在获取文本后,我们发现大多数来源包含了一定量的非自然语言文本,例如预处理错误、SEO页面或垃圾信息(包括色情垃圾信息)。为了过滤非自然语言,我们定义了一组质量指标,其中高质量文本被定义为“由人类为人类撰写”,不区分内容(因为我们希望内容选择完全由更具责任的人类来源选择来决定)或对语法性的先验判断。完整的指标列表在 (Laurençon et al., 2022) 中有描述。重要的是,这些指标以两种主要方式适应每个来源的需求。首先,其参数如阈值和支持术语列表由流利的语言使用者为每种语言单独选择。其次,我们手动检查每个单独的来源,以识别最有可能识别非自然语言的指标。这两个过程都得到了可视化工具的支持。14,15
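
为了说明"质量指标 + 按语言调参"的思路,下面给出一个极简的启发式过滤示意。其中的指标和阈值均为假设的示例值,实际使用的指标列表与各语言参数见 Laurençon 等 (2022) 及相关工具仓库:

```python
def looks_like_natural_language(text, stopwords, min_words=10,
                                max_symbol_ratio=0.3, min_stopword_ratio=0.05):
    """用几个简单指标粗略判断文本是否"由人为人撰写"(阈值为示例值,应按语言单独调整)。"""
    words = text.split()
    if len(words) < min_words:                          # 过短的片段
        return False
    symbols = sum(1 for ch in text if not ch.isalnum() and not ch.isspace())
    if symbols / max(len(text), 1) > max_symbol_ratio:  # 符号比例过高(乱码、SEO 垃圾等)
        return False
    stopword_hits = sum(1 for w in words if w.lower() in stopwords)
    if stopword_hits / len(words) < min_stopword_ratio:  # 常用虚词过少,可能不是自然语言
        return False
    return True

english_stopwords = {"the", "a", "and", "of", "to", "in", "is", "it"}
print(looks_like_natural_language("The cat sat on the mat and looked at the door.", english_stopwords))
print(looks_like_natural_language("$$$ click here !!! buy now $$$", english_stopwords))
```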


Figure 3: Graphical overview of the ROOTS corpus. Left: A treemap plot of the language families of all 46 natural languages where surface is proportional to the number of bytes. Indo-European and Sino-Tibetan families overwhelm the plot with a combined total of 1321.89 GB. The thin orange surface represents 18GB of Indonesian data and the green rectangle 0.4GB constituting the Niger-Congo language family subset. Right: A waffle plot of the distribution of the 13 programming languages by size, where one square represents approximately 200MB.

图 3: ROOTS 语料库的图形概览。左:所有 46 种自然语言的语言家族的树形图,其中面积与字节数成正比。印欧语系和汉藏语系在图中占据主导地位,总计 1321.89 GB。细橙色区域代表 18 GB 的印尼语数据,绿色矩形代表构成尼日尔-刚果语系子集的 0.4 GB。右:13 种编程语言按大小分布的华夫饼图,其中每个方块大约代表 200 MB。

Deduplication and Privacy Redaction Finally, we removed near-duplicate documents with two deduplication steps and redacted Personal Identifiable Information (such as social security numbers) that we could identify from the OSCAR version of the corpus—as it was deemed to be the source that presented the highest privacy risks, prompting us to apply regex-based redaction even in cases where the expressions had some false positives.

去重和隐私遮蔽

最后,我们通过两个去重步骤移除了几乎重复的文档,并从 OSCAR 版本的语料库中遮蔽了可识别的个人身份信息(如社会安全号码)——因为该来源被认为存在最高的隐私风险,促使我们即使在某些表达式存在一些误报的情况下也应用了基于正则表达式的遮蔽。
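A minimal sketch of the regex-based redaction step is shown below; the SSN-style pattern and the placeholder are illustrative stand-ins, not the expressions actually applied to OSCAR.

下面是基于正则表达式进行遮蔽的最小示意;其中的 SSN 样式模式和占位符仅为示例,并非实际作用于 OSCAR 的表达式。

```python
import re

# Illustrative only: a simplified US-SSN-style pattern replaced with a placeholder.
# The real pipeline used several such expressions and accepted some false positives.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def redact_pii(text: str, placeholder: str = "<REDACTED>") -> str:
    """Replace matches of the (illustrative) PII pattern with a placeholder."""
    return SSN_PATTERN.sub(placeholder, text)

print(redact_pii("Contact: 123-45-6789"))  # -> "Contact: <REDACTED>"
```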

3.1.4 Prompted Datasets

3.1.4 提示数据集

Figure 4: Language distribution of the prompted dataset xP3 closely follows ROOTS.

图 4: 提示数据集 xP3 的语言分布紧密跟随 ROOTS。

Multitask prompted finetuning (also referred to as instruction tuning) involves finetuning a pretrained language model on a training mixture composed of a large set of different tasks specified through natural language prompts. T0 (Sanh et al., 2022) (developed as part of BigScience) demonstrated that language models finetuned on a multitask mixture of prompted datasets have strong zero-shot task generalization abilities. Moreover, T0 was shown to outperform language models that are an order of magnitude larger but did not undergo such finetuning. Motivated by these results, we explored using existing natural language datasets to carry out multitask prompted finetuning.

多任务提示微调(也称为指令微调)涉及在由大量通过自然语言提示指定的不同任务组成的训练混合数据上对预训练语言模型进行微调。T0 (Sanh et al., 2022) (作为 BigScience 的一部分开发)表明,通过对多任务提示数据集混合进行微调的语言模型具有强大的零样本任务泛化能力。此外,T0 被证明优于那些规模大一个数量级但未经历此类微调的语言模型。受这些结果的启发,我们探索了使用现有的自然语言数据集来进行多任务提示微调。

T0 was trained on a subset of the Public Pool of Prompts (P3), a collection of prompts for various existing and open-source English natural language datasets. This collection of prompts was created through a series of hackathons involving BigScience collaborators, where hackathon participants wrote a total of 2000+ prompts for 170+ datasets. Datasets in P3 cover a variety of natural language tasks including sentiment analysis, question answering, and natural language inference, and exclude harmful content or non-natural language such as programming languages. PromptSource (Bach et al., 2022),$^{16}$ an open-source toolkit (also developed as part of BigScience), facilitated creating, sharing and using natural language prompts. Full details of the collection process are given in (Sanh et al., 2022; Bach et al., 2022).

T0 在 Public Pool of Prompts (P3) 的一个子集上进行了训练,P3 是为各种现有的开源英文自然语言数据集编写的提示集合。这个提示集合是通过一系列涉及 BigScience 合作者的黑客松活动创建的,在这些活动中,参与者总共为 170 多个数据集编写了 2000 多个提示。P3 中的数据集涵盖了多种自然语言任务,包括情感分析、问答和自然语言推理,并排除了有害内容或非自然语言(如编程语言)。Prompt Source (Bach et al., 2022),一个开源工具包(也是作为 BigScience 的一部分开发的),促进了自然语言提示的创建、共享和使用。收集过程的详细信息见 (Sanh et al., 2022; Bach et al., 2022)。

After pretraining BLOOM, we applied the same massively multitask finetuning recipe to equip BLOOM with multilingual zero-shot task generalization abilities. We refer to the resulting models as BLOOMZ. To train BLOOMZ, we extended P3 to include new datasets in languages other than English and new tasks, such as translation. This resulted in xP3, a collection of prompts for 83 datasets covering 46 languages and 16 tasks. As highlighted in Figure 4, xP3 mirrors the language distribution of ROOTS. Tasks in xP3 are both crosslingual (e.g. translation) and monolingual (e.g. summarization, question answering). We used PromptSource to collect these prompts, adding additional metadata to the prompts, such as input and target languages. To study the importance of multilingual prompts, we also machine-translated English prompts in xP3 to the respective dataset languages to produce a collection called xP3mt. Further details on the prompt collection for xP3 and xP3mt are given in Muennighoff et al. (2022b).

在预训练 BLOOM 之后,我们应用了相同的大量多任务微调方法,使 BLOOM 具备多语言零样本任务泛化能力。我们将这些模型称为 BLOOMZ。为了训练 BLOOMZ,我们扩展了 P3,以包含非英语的新数据集和新任务,例如翻译。这导致了 xP3 的产生,它是一个包含 46 种语言和 16 个任务的 83 个数据集的提示集合。如图 4 所示,xP3 模拟了 ROOTS 的语言分布。xP3 中的任务既包括跨语言的(例如翻译),也包括单语言的(例如摘要生成、问答)。我们使用 Prompt Source 收集这些提示,并添加了额外的元数据,例如输入和目标语言。为了研究多语言提示的重要性,我们还将 xP3 中的英文提示机器翻译成相应数据集的语言,以生成一个名为 xP3mt 的集合。关于 xP3 和 xP3mt 提示收集的更多细节,请参见 Muennighoff et al. (2022b)。

3.2 Model Architecture

3.2 模型架构

This section discusses our design methodology and the architecture of the BLOOM model. In-depth studies and experiments can be found in Le Scao et al. (2022) and Wang et al. (2022a). We first review our design methodology, then motivate our choice of training a causal decoder-only model. Finally, we justify the ways that our model architecture deviates from standard practice.

本节讨论我们的设计方法论和 BLOOM 模型的架构。深入的研究和实验可以在 Le Scao 等 (2022) 和 Wang 等 (2022a) 中找到。我们首先回顾我们的设计方法论,然后阐述我们选择训练因果仅解码器 (causal decoder-only) 模型的原因。最后,我们说明我们的模型架构与标准做法的不同之处。

3.2.1 Design Methodology

3.2.1 设计方法论

The design space of possible architectures is immense, making exhaustive exploration impossible. One option would be to exactly replicate the architecture of an existing large language model. On the other hand, a great deal of work on improving existing architectures has seen relatively little adoption (Narang et al., 2021); adopting some of these recommended practices could yield a significantly better model. We take a middle ground and focus on model families that have been shown to scale well, and that have reasonable support in publicly available tools and codebases. We ablate components and hyperparameters of the models, seeking to make the best use of our final compute budget.

可能的架构设计空间非常巨大,使得穷尽探索变得不可能。一种选择是完全复制现有大语言模型的架构。另一方面,尽管有许多工作致力于改进现有架构,但这些工作得到的采用相对较少(Narang et al., 2021);采用其中一些推荐的做法可能会显著提升模型性能。我们采取中间立场,专注于那些已被证明具有良好扩展性的模型家族,并且在公开可用的工具和代码库中有合理支持。我们对模型的组件和超参数进行消融研究,力求在最终计算预算内获得最佳效果。

Experimental Design for Ablations One of the main draws of LLMs has been their ability to perform tasks in a “zero/few-shot” way: large enough models can perform novel tasks simply from in-context instructions and examples (Radford et al., 2019), without dedicated training on supervised samples. Accordingly, and because finetuning a 100B+ model is unwieldy, we focused our evaluation of architectural decisions on zero-shot generalization, and do not consider transfer learning. Specifically, we measured zero-shot performance on diverse aggregates of tasks: 29 tasks from the EleutherAI Language Model Evaluation Harness (EAI-Eval, Gao et al. (2021)), and 9 tasks from the evaluation set of T0 (T0-Eval, Sanh et al. (2022)). There is significant overlap between the two: only one task from T0-Eval (StoryCloze) is not in EAI-Eval, although all prompts between the two are different. See Le Scao et al. (2022) for a detailed list of tasks and baselines. We also note that our task aggregates share 17 of the 31 tasks of the evaluation of GPT-3 (Brown et al., 2020).

实验设计用于消融研究

大语言模型的主要吸引力之一在于它们能够以“零样本/少样本”的方式执行任务:足够大的模型仅通过上下文指令和示例就能完成新任务(Radford et al., 2019),而无需在监督样本上进行专门训练。因此,且由于微调一个100B+的模型非常不便,我们专注于评估架构决策的零样本泛化能力,不考虑迁移学习。具体来说,我们在多样化的任务集合上测量了零样本性能:包括来自EleutherAI语言模型评估工具包 (EAI-Eval, Gao et al. (2021)) 的29个任务,以及来自T0评估集 (T0-Eval, Sanh et al. (2022)) 的9个任务。这两个集合之间存在显著重叠:只有T0-Eval中的一个任务(StoryCloze)不在EAI-Eval中,尽管两者之间的所有提示都是不同的。有关任务和基线的详细列表,请参见 Le Scao et al. (2022)。我们还注意到,我们的任务集合与GPT-3评估(Brown et al., 2020)中的31个任务共享了17个任务。

We conducted our ablation experiments using smaller models. We used the 6.7B parameter scale for the pretraining objective ablations (Wang et al., 2022a) and the 1.3B scale for the rest, including position embeddings, activations, and layer normalization (Le Scao et al., 2022). Recently, Dettmers et al. (2022) identified a phase transition for models larger than 6.7B, in which the emergence of “outlier features” is observed. This questions whether results obtained at the 1.3B scale should be assumed to extrapolate to our final model size.

我们使用较小的模型进行了消融实验。我们使用 67 亿参数规模进行预训练目标的消融实验 (Wang et al., 2022a),并使用 13 亿参数规模进行其余部分的实验,包括位置嵌入、激活函数和层归一化 (Le Scao et al., 2022)。最近,Dettmers 等人 (2022) 发现对于超过 67 亿参数的模型存在一个相变,在此过程中观察到了“异常特征”的出现。这使得我们质疑在 13 亿参数规模上获得的结果是否可以外推到我们的最终模型规模。

Out-of-scope Architectures We did not consider mixture-of-experts (MoE) (Shazeer et al., 2017), due to a lack of widely used GPU-based codebases suitable for training them at scale. Similarly, we also did not consider state-space models (Gu et al., 2020). At the time of the design of BLOOM, they consistently underperformed in natural language tasks (Gu et al., 2021). Both of these approaches are promising, and have now demonstrated competitive results – at large scales for MoE (Fedus et al., 2022; Srivastava et al., 2022), and at smaller scale for state-space models with H3 (Fu et al., 2023).

超出范围的架构

由于缺乏广泛使用、适合大规模训练的基于 GPU 的代码库,我们没有考虑专家混合模型 (Mixture-of-Experts, MoE) (Shazeer et al., 2017)。同样,我们也没有考虑状态空间模型 (state-space models) (Gu et al., 2020):在设计 BLOOM 时,它们在自然语言任务中的表现一直不佳 (Gu et al., 2021)。这两种方法都很有前景,并且如今已展示出有竞争力的结果:MoE 在大规模下 (Fedus et al., 2022; Srivastava et al., 2022),状态空间模型则在较小规模下通过 H3 (Fu et al., 2023)。

3.2.2 Architecture and Pretraining Objective

3.2.2 架构和预训练目标 (Architecture and Pretraining Objective)

Although most modern language models are based on the Transformer architecture, there are significant deviations between architectural implementations. Notably, while the original Transformer is based on an encoder-decoder architecture, many popular models have opted for encoder-only (e.g. BERT, (Devlin et al., 2019)) or decoder-only (e.g. GPT, (Radford et al., 2018)) approaches. Currently, all state-of-the-art language models over 100 billion parameters are causal decoder-only models (Brown et al., 2020; Rae et al., 2021; Chowdhery et al., 2022). This is in opposition to the findings of Raffel et al. (2020), in which encoder-decoder models significantly outperform decoder-only models for transfer learning.

虽然大多数现代大语言模型都基于 Transformer 架构,但它们的架构实现存在显著差异。值得注意的是,原始的 Transformer 基于编码器-解码器架构,而许多流行的模型选择了仅编码器(例如 BERT [(Devlin et al., 2019)])或仅解码器(例如 GPT [(Radford et al., 2018)])的方法。目前,所有超过 1000 亿参数的最先进大语言模型都是因果仅解码器模型 [(Brown et al., 2020); (Rae et al., 2021); (Chowdhery et al., 2022)]。这与 Raffel 等人 (2020) 的研究结果相反,在他们的研究中,编码器-解码器模型在迁移学习方面显著优于仅解码器模型。

Prior to our work, the literature was lacking a systematic evaluation of the zero-shot generalization capabilities of different architectures and pretraining objectives. We explored this question in Wang et al. (2022a) where we evaluated encoder-decoder and decoder-only architectures and their interactions with causal, prefix, and masked language modeling pretraining objectives. Our results show that immediately after pretraining, causal decoder-only models performed best – validating the choice of state-of-the-art LLMs. Furthermore, they can be more efficiently adapted after pretraining to a non-causal architecture and objective – an approach which has been further explored and confirmed by Tay et al. (2022).

在我们的工作之前,文献中缺乏对不同架构和预训练目标的零样本泛化能力的系统性评估。我们在 Wang 等人 (2022a) 中探讨了这个问题,评估了编码器-解码器和仅解码器架构及其与因果、前缀和掩码语言建模预训练目标的交互。我们的结果显示,在预训练刚结束时,因果仅解码器模型表现最佳,这验证了最先进大语言模型的选择。此外,这些模型可以在预训练后更高效地适配为非因果架构和目标,这一方法已由 Tay 等人 (2022) 进一步探索并确认。

3.2.3 Modeling Details

3.2.3 模型细节

Beyond choosing an architecture and pretraining objective, a number of changes to the original Transformer architecture have been proposed. For example, alternative positional embedding schemes (Su et al., 2021; Press et al., 2021) or novel activation functions (Shazeer, 2020). We thus performed a series of experiments to evaluate the benefit of each of these modifications for a causal decoder-only model in Le Scao et al. (2022). We adopted two architectural deviations in BLOOM:

除了选择架构和预训练目标外,研究者们还对原始 Transformer 架构提出了多项修改,例如替代的位置嵌入方案 (Su et al., 2021; Press et al., 2021) 或新型激活函数 (Shazeer, 2020)。因此,我们在 Le Scao et al. (2022) 中进行了一系列实验,以评估每项修改对因果仅解码器模型的收益。我们在 BLOOM 中采用了两处与标准架构的偏离:

ALiBi Positional Embeddings Instead of adding positional information to the embedding layer, ALiBi directly attenuates the attention scores based on how far away the keys and queries are (Press et al., 2021). Although ALiBi was initially motivated by its ability to extrapolate to longer sequences, we found it also led to smoother training and better downstream performance even at the original sequence length – outperforming both learned (Vaswani et al., 2017) and rotary (Su et al., 2021) embeddings.

ALiBi 位置嵌入

ALiBi 并不在嵌入层中加入位置信息,而是根据键和查询之间的距离直接衰减注意力分数 (Press et al., 2021)。虽然 ALiBi 最初的动机是其对更长序列的外推能力,但我们发现它即使在原始序列长度下也能带来更平滑的训练和更好的下游性能,优于可学习的位置嵌入 (Vaswani et al., 2017) 和旋转位置嵌入 (Su et al., 2021)。

Embedding LayerNorm In preliminary experiments training a 104B-parameter model, we experimented with an additional layer normalization immediately after the embedding layer – as recommended by the bitsandbytes$^{17}$ library (Dettmers et al., 2022) with its Stable Embedding layer. We found this significantly improved training stability. Even though we also found it penalizes zero-shot generalization in Le Scao et al. (2022), we train BLOOM with an additional layer normalization after the first embedding layer to avoid training instabilities. Note the preliminary 104B experiments were conducted in float16, while the final training was in bfloat16. Since then, float16 has been attributed as being responsible for many of the observed instabilities in training LLMs (Zhang et al., 2022; Zeng et al., 2022). It is possible that bfloat16 alleviates the need for the embedding LayerNorm.

嵌入层归一化 (Embedding LayerNorm) 在训练一个 104B 参数模型的初步实验中,我们尝试在嵌入层之后立即添加一个额外的层归一化 (layer normalization),这是 bitsandbytes$^{17}$ 库 (Dettmers et al., 2022) 及其 Stable Embedding 层所推荐的做法。我们发现这显著提高了训练的稳定性。尽管我们在 Le Scao et al. (2022) 中也发现它会损害零样本泛化能力,但为了避免训练不稳定,我们仍在第一个嵌入层之后添加了一个层归一化来训练 BLOOM。需要注意的是,初步的 104B 实验是在 float16 下进行的,而最终训练是在 bfloat16 下进行的。此后,float16 被认为是训练大语言模型时许多已观察到的不稳定现象的原因 (Zhang et al., 2022; Zeng et al., 2022)。bfloat16 有可能减轻对嵌入层 LayerNorm 的需求。
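A minimal PyTorch sketch of the idea, assuming a plain nn.Embedding followed by nn.LayerNorm; it mirrors the stabilizing trick described above, not the actual Megatron-DeepSpeed implementation.

下面是该思路的一个最小 PyTorch 草图:在普通的 nn.Embedding 之后接一个 nn.LayerNorm;它只是对上述稳定性技巧的示意,并非 Megatron-DeepSpeed 的实际实现。

```python
import torch
import torch.nn as nn

class EmbeddingWithLayerNorm(nn.Module):
    """Token embedding followed by an extra LayerNorm, as a stabilization sketch."""
    def __init__(self, vocab_size: int, hidden_size: int):
        super().__init__()
        self.word_embeddings = nn.Embedding(vocab_size, hidden_size)
        self.embedding_layernorm = nn.LayerNorm(hidden_size)

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        # Normalize the embedding outputs before they enter the Transformer stack.
        return self.embedding_layernorm(self.word_embeddings(input_ids))

layer = EmbeddingWithLayerNorm(vocab_size=1000, hidden_size=64)
print(layer(torch.tensor([[1, 2, 3]])).shape)  # torch.Size([1, 3, 64])
```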

We represent the full architecture of BLOOM in Figure 5 for reference.

我们在图 5 中给出了 BLOOM 的完整架构以供参考。


Figure 5: The BLOOM architecture. The $k_{head}$ slope parameters for ALiBi are taken as $2^{\frac{-8i}{n}}$ with $n$ the number of heads and $i\in\{1,2,\ldots,n\}$.

图 5: BLOOM 架构。ALiBi 的 $k_{head}$ 斜率参数取为 $2^{\frac{-8i}{n}}$,其中 $n$ 是注意力头的数量,$i\in\{1,2,\ldots,n\}$。
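The sketch below computes the head-specific slopes from the formula in the caption and builds the additive attention bias; it is an illustration of the ALiBi idea, not BLOOM's production attention kernel.

下面的草图按图注中的公式计算每个注意力头的斜率,并构造加性的注意力偏置;它只是 ALiBi 思路的示意,并非 BLOOM 实际使用的注意力内核。

```python
import torch

def alibi_slopes(n_heads: int) -> torch.Tensor:
    # Head-specific slopes following the caption: 2^(-8i/n) for i = 1..n.
    return torch.tensor([2.0 ** (-8.0 * i / n_heads) for i in range(1, n_heads + 1)])

def alibi_bias(n_heads: int, seq_len: int) -> torch.Tensor:
    # Additive bias = slope * (key_position - query_position). Keys further in the
    # past get a more negative bias; future positions are removed by the causal mask.
    slopes = alibi_slopes(n_heads)                       # (heads,)
    positions = torch.arange(seq_len)
    distance = positions[None, :] - positions[:, None]   # (query, key)
    return slopes[:, None, None] * distance[None, :, :]  # (heads, query, key)

print(alibi_slopes(4))      # tensor([0.2500, 0.0625, 0.0156, 0.0039])
print(alibi_bias(4, 3)[0])  # bias matrix for the first head
```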

3.3 Tokenization

3.3 分词 (Tokenization)


The design decisions when training a tokenizer are often neglected in favour of “default” settings (Mielke et al., 2021). For instance, OPT (Zhang et al., 2022) and GPT-3 (Brown et al., 2020) both use GPT-2’s tokenizer, trained for English. This can be justified by the fact that evaluating the impact of a particular choice on the downstream performance of the model is constrained by the large computational costs of training. However, the diverse nature of BLOOM’s training data requires careful design choices to ensure that the tokenizer encodes sentences in a lossless manner.

在训练分词器时的设计决策往往被忽视,人们倾向于使用“默认”设置 (Mielke et al., 2021)。例如,OPT (Zhang et al., 2022) 和 GPT-3 (Brown et al., 2020) 都使用了为英语训练的 GPT-2 分词器。这一做法的理由是:评估某个特定选择对模型下游性能的影响,会受到训练所需巨大计算成本的限制。然而,BLOOM 训练数据的多样性要求我们进行仔细的设计选择,以确保分词器能够无损地编码句子。

Validation We use the fertility (Ács, 2019) of our tokenizer compared to existing monolingual tokenizers as a metric for sanity checks. Fertility is defined as the number of subwords created per word or per dataset by the tokenizer, which we measured using subsets of Universal Dependencies 2.9 (Nivre et al., 2017) and OSCAR (Ortiz Suárez et al., 2019) in the languages of interest. A very high fertility on a language compared to a monolingual tokenizer may indicate a degradation on the downstream multilingual performance of the model (Rust et al., 2021). Our goal was to not degrade the fertility on each language by more than 10 percentage points when comparing our multilingual tokenizer with monolingual tokenizers in corresponding languages. For all experiments, the Hugging Face Tokenizers library (Moi et al., 2019) was used to design and train the tested tokenizers.

验证

我们使用分词器的 fertility(每个词被切分出的子词数量)(Ács, 2019) 与现有单语分词器进行比较,作为合理性检查的度量。fertility 定义为分词器在每个词或每个数据集上生成的子词数量,我们在感兴趣的语言上使用 Universal Dependencies 2.9 (Nivre 等, 2017) 和 OSCAR (Ortiz Suárez 等, 2019) 的子集进行测量。某种语言上的 fertility 如果远高于单语分词器,可能预示模型在下游多语言任务上的性能下降 (Rust 等, 2021)。我们的目标是:在将我们的多语言分词器与相应语言的单语分词器比较时,每种语言的 fertility 恶化不超过 10 个百分点。在所有实验中,我们使用 Hugging Face Tokenizers 库 (Moi 等, 2019) 来设计和训练所测试的分词器。
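A minimal sketch of the fertility metric used for these sanity checks is given below; the toy character-level "tokenizer" stands in for a real one.

下面给出用于上述合理性检查的 fertility 指标的最小示意;其中按字符切分的“分词器”仅为占位示例。

```python
def fertility(encode, words):
    """Average number of subwords the tokenizer produces per word; 1.0 means every
    word stays whole, higher values mean more fragmentation. A sketch of the
    sanity-check metric, not the exact evaluation code."""
    return sum(len(encode(w)) for w in words) / len(words)

# Toy stand-in "tokenizer" that splits every word into characters.
words = ["bonjour", "le", "monde"]
print(round(fertility(lambda w: list(w), words), 2))  # 4.67 subwords per word
```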

Tokenizer fr en es zh hi ar
单语 1.30 1.15 1.12 1.50 1.07 1.16
BLOOM 1.17 (-11%) 1.15 (0%) 1.16 (+3%) 1.58 (+5%) 1.18 (+6%) 1.34 (+13%)

Table 2: Fertilities obtained on Universal Dependencies treebanks on languages with existing monolingual tokenizers. The monolingual tokenizers we used were the ones from CamemBERT (Martin et al., 2020), GPT-2 (Radford et al., 2019), DeepESP/gpt2-spanish, bert-base-chinese, monsoon-nlp/hindi-bert and Arabic BERT (Safaya et al., 2020), all available on the Hugging Face Hub.

表 2: 在具有现有单语分词器的语言的 Universal Dependencies 依存树库上获得的 fertility。我们使用的单语分词器来自 CamemBERT (Martin et al., 2020)、GPT-2 (Radford et al., 2019)、DeepESP/gpt2-spanish、bert-base-chinese、monsoon-nlp/hindi-bert 和 Arabic BERT (Safaya et al., 2020),所有这些都可以在 Hugging Face Hub 上获取。

Tokenizer Training Data We initially used a non-deduplicated subset of ROOTS. However, a qualitative study on the vocabulary of the tokenizer revealed issues in its training data. For instance, in earlier versions of the tokenizer, we found entire URLs stored as tokens, caused by several documents containing a high number of duplicates. These issues motivated us to remove duplicated lines in the tokenizer training data. We then applied the same sampling ratios per language as for the training data.

分词器训练数据

我们最初使用了 ROOTS 的一个未去重子集。然而,对分词器词汇表的定性研究揭示了其训练数据中存在一些问题。例如,在较早版本的分词器中,我们发现由于多个文档包含大量重复项,导致整个 URL 作为 Token 存储。这些问题促使我们去除了分词器训练数据中的重复行。然后,我们按照与训练数据相同的语言采样比例进行了处理。

Vocabulary Size A large vocabulary size reduces the risk of over-segmenting some sentences, especially for low-resource languages. We conducted validation experiments using 150k and 250k vocabulary sizes to make comparisons with existing multilingual modeling literature easier (Conneau et al., 2020; Xue et al., 2021). We ultimately settled for a vocabulary of 250k tokens to reach our initial fertility objective compared to monolingual tokenizers. Since the vocabulary size determines the embedding matrix size, it also had to be divisible by 128 for GPU efficiency reasons and by 4 to be able to use Tensor Parallelism. We used a final size of 250,680 vocabulary items with 200 tokens reserved for possible future applications such as removing private information using placeholder tokens.

词汇量大小

较大的词汇量可以降低过度切分某些句子的风险,特别是对于低资源语言。我们使用 150k 和 250k 的词汇量进行了验证实验,以便与现有的多语言建模文献进行比较 (Conneau et al., 2020; Xue et al., 2021)。最终,我们选择了 250k Token 的词汇量,以达到相对单语分词器设定的初始 fertility 目标。由于词汇量大小决定了嵌入矩阵的大小,出于 GPU 效率的考虑它必须能被 128 整除,并且为了使用张量并行 (Tensor Parallelism) 还必须能被 4 整除。我们最终使用的词汇量为 250,680 个词条,其中预留了 200 个 Token 用于未来可能的应用,例如使用占位符 Token 去除隐私信息。

Byte-level BPE The tokenizer is a learned subword tokenizer trained using the Byte Pair Encoding (BPE) algorithm introduced by Gage (1994). In order not to lose information during tokenization, the tokenizer creates merges starting from bytes as the smallest units instead of characters (Radford et al., 2019). This way, tokenization never results in unknown tokens because all 256 bytes can be contained in the vocabulary of the tokenizer. In addition, Byte-level BPE maximizes vocabulary sharing between languages (Wang et al., 2020).

字节级 BPE

分词器是一个基于 Byte Pair Encoding (BPE) 算法训练的子词分词器,该算法由 Gage (1994) 提出。为了在分词过程中不丢失信息,分词器从字节作为最小单位开始创建合并,而不是字符 (Radford et al., 2019)。这样,分词永远不会产生未知的 Token,因为所有 256 个字节都可以包含在分词器的词汇表中。此外,字节级 BPE 最大化了不同语言之间的词汇共享 (Wang et al., 2020)。
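The sketch below sets up a generic byte-level BPE with the Hugging Face tokenizers library to show why unknown tokens never occur: the 256 byte symbols form the base alphabet. The corpus, vocabulary size and settings are illustrative, not those used to train BLOOM's tokenizer.

下面用 Hugging Face tokenizers 库搭建一个通用的字节级 BPE,以说明为何不会出现未知 Token:256 个字节符号构成了基础字母表。语料、词汇量等设置仅为示例,并非训练 BLOOM 分词器时的配置。

```python
from tokenizers import Tokenizer, pre_tokenizers, decoders, trainers
from tokenizers.models import BPE

# Generic byte-level BPE setup (illustrative settings, tiny toy corpus).
tokenizer = Tokenizer(BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()
trainer = trainers.BpeTrainer(
    vocab_size=500,
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),  # all 256 byte symbols
)
tokenizer.train_from_iterator(["hello world", "héllo wörld", "你好,世界"], trainer=trainer)

# An unseen script still tokenizes without unknown tokens: it falls back to bytes.
enc = tokenizer.encode("नमस्ते")
print(enc.tokens)
print(tokenizer.decode(enc.ids))  # round-trips back to the original string
```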

Normalization Upstream of the BPE tokenization algorithm, no normalization of the text was performed in order to have the most general model possible. In all cases, we observed that adding unicode normalization such as NFKC did not reduce the fertility by more than $0.8\%$ on all the languages considered, but came at the cost of making the model less general; for example, causing $2^{2}$ and 22 to be encoded in the same way.

归一化 (Normalization) 在 BPE 分词算法的上游,我们没有对文本执行任何归一化,以使模型尽可能通用。在所有情况下,我们观察到添加 NFKC 等 Unicode 归一化在所有考虑的语言上使 fertility 的降低都不超过 $0.8\%$,但代价是模型变得不那么通用;例如,它会导致 $2^{2}$ 和 22 被编码成相同的形式。
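The effect mentioned above can be checked directly with Python's unicodedata module: NFKC folds the superscript "²" into a plain "2".

上述现象可以直接用 Python 的 unicodedata 模块验证:NFKC 会把上标“²”归一化为普通的“2”。

```python
import unicodedata

# NFKC folds compatibility characters: the superscript two (U+00B2) becomes "2",
# so "2²" and "22" would no longer be distinguishable after normalization.
print(unicodedata.normalize("NFKC", "2\u00b2"))          # '22'
print(unicodedata.normalize("NFKC", "2\u00b2") == "22")  # True
```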

Pre-tokenizer Our pre-tokenization has two goals: producing a first division of the text (usually using whitespaces and punctuation) and restricting the maximum length of sequences of tokens produced by the BPE algorithm. The pre-tokenization rule used was the following regex: “ ?[^(\S|[.,!?…。，、।۔،])]+”,$^{18}$ which splits words apart while preserving all the characters and in particular the sequences of spaces and line breaks that are crucial for programming languages. We do not use English-centric splits common in other tokenizers (e.g. splitting around ’nt or ’ll). We also didn’t use splits on numbers and digits, which caused issues in Arabic and code.

预分词器

我们的预分词有两个目标:对文本进行初步划分(通常基于空白符和标点),并限制 BPE 算法所生成 Token 序列的最大长度。所使用的预分词规则是以下正则表达式:“ ?[^(\S|[.,!?…。，、।۔،])]+”,$^{18}$ 它在保留所有字符的同时将单词分开,特别是保留了对编程语言至关重要的空格和换行序列。我们不使用其他分词器中常见的以英语为中心的切分方式(例如围绕 ’nt 或 ’ll 进行切分)。我们也没有按数字和数位进行切分,这类切分在阿拉伯语和代码中会引起问题。
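The sketch below uses a simplified stand-in pattern to illustrate the behaviour of such a pre-tokenizer: every character, including runs of spaces and newlines, is preserved. It is not the exact regex used for BLOOM.

下面的草图用一个简化的替代模式来说明这类预分词器的行为:包括连续空格和换行在内的所有字符都被保留。它并不是 BLOOM 实际使用的正则表达式。

```python
import re

# Simplified, illustrative stand-in for the pre-tokenization step: it splits text
# into word, whitespace, and punctuation pieces without dropping any character.
PIECES = re.compile(r"\s+|\w+|[^\w\s]+", re.UNICODE)

def pre_tokenize(text: str):
    pieces = PIECES.findall(text)
    assert "".join(pieces) == text  # lossless: whitespace runs and newlines survive
    return pieces

print(pre_tokenize("def f(x):\n    return x + 1"))
```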

3.4 Engineering

3.4 工程

3.4.1 Hardware

3.4.1 硬件

The model was trained on Jean Zay,$^{19}$ a French government-funded supercomputer owned by GENCI and operated at IDRIS, the national computing center for the French National Center for Scientific Research (CNRS). Training BLOOM took about 3.5 months to complete and consumed 1,082,990 compute hours. Training was conducted on 48 nodes, each having 8 NVIDIA A100 80GB GPUs (a total of 384 GPUs); due to possible hardware failures during training, we also maintained a reserve of 4 spare nodes. The nodes were equipped with 2x AMD EPYC 7543 32-Core CPUs and 512 GB of RAM, while the storage was handled by a mix of full flash and hard disk drives using a Spectrum Scale (GPFS) parallel file system shared between all nodes and users of the supercomputer. 4 NVLink GPU-to-GPU interconnects per node enabled intra-node communications while 4 Omni-Path 100 Gbps links per node, arranged in an enhanced hypercube 8D global topology, were used for inter-node communications.

该模型在 Jean Zay 上进行了训练,这是一台由法国政府资助的超级计算机,归 GENCI 所有,并由 IDRIS(法国国家科学研究中心 (CNRS) 的国家级计算中心)运营。训练 BLOOM 大约耗时 3.5 个月完成,消耗了 1,082,990 个计算小时。训练是在 48 个节点上进行的,每个节点配备 8 个 NVIDIA A100 80GB GPU(总计 384 个 GPU);由于训练期间可能出现硬件故障,我们还维护了 4 个备用节点。每个节点配备了 2 个 AMD EPYC 7543 32 核 CPU 和 512 GB 内存,存储则由全闪存和硬盘驱动器混合使用 Spectrum Scale (GPFS) 并行文件系统处理,该文件系统在所有节点和超级计算机用户之间共享。每个节点有 4 个 NVLink GPU 到 GPU 互连用于节点内通信,而每个节点有 4 个 Omni-Path 100 Gbps 链接,以增强的 8D 超立方体全局拓扑结构排列,用于节点间通信。

3.4.2 Framework

3.4.2 框架

BLOOM was trained using Megatron-DeepSpeed$^{20}$ (Smith et al., 2022), a framework for large-scale distributed training. It consists of two parts: Megatron-LM$^{21}$ (Shoeybi et al., 2019) provides the Transformer implementation, tensor parallelism, and data loading primitives, whereas DeepSpeed$^{22}$ (Rasley et al., 2020) provides the ZeRO optimizer, model pipelining, and general distributed training components. This framework allows us to train efficiently with 3D parallelism (Narayanan et al., 2021, shown in Figure 6), a fusion of three complementary approaches to distributed training. These approaches are described below:

BLOOM 使用 Megatron-DeepSpeed (Smith et al., 2022) 进行训练,这是一个大规模分布式训练框架。它由两部分组成:Megatron-LM (Shoeybi et al., 2019) 提供 Transformer 实现、张量并行和数据加载原语,而 DeepSpeed (Rasley et al., 2020) 提供 ZeRO 优化器、模型管道和通用分布式训练组件。这个框架使我们能够通过三维并行 (Narayanan et al., 2021, 如图 6 所示) 高效地进行训练,这是三种互补的分布式训练方法的融合。这些方法在下文中有详细描述:



Figure 6: DP $+$ PP $+$ TP combination leads to 3D parallelism.

图 6: DP + PP + TP 组合构成 3D 并行。

Data parallelism (DP) replicates the model multiple times, with each replica placed on a different device and fed a slice of the data. The processing is done in parallel and all model replicas are synchronized at the end of each training step.

数据并行 (Data parallelism, DP) 多次复制模型,每个副本放置在不同的设备上,并输入数据的一个切片。处理是并行进行的,并且所有模型副本在每个训练步骤结束时进行同步。

Tensor parallelism (TP) partitions individual layers of the model across multiple devices. This way, instead of having the whole activation or gradient tensor reside on a single GPU, we place shards of this tensor on separate GPUs. This technique is sometimes called horizontal parallelism or intra-layer model parallelism.

张量并行 (Tensor parallelism, TP) 将模型的各个层分区到多个设备上。这样,而不是将整个激活或梯度张量放在单个 GPU 上,我们将这个张量的分片放在不同的 GPU 上。这种技术有时被称为水平并行或层内模型并行。

Pipeline parallelism (PP) splits up the model’s layers across multiple GPUs, so that only a fraction of the layers of the model are placed on each GPU. This is sometimes called vertical parallelism.

管道并行 (Pipeline parallelism, PP) 将模型的层分布在多个 GPU 上,使得每个 GPU 上只放置模型的一部分层。这有时被称为垂直并行。

Finally, the Zero Redundancy Optimizer (ZeRO; Rajbhandari et al., 2020) allows different processes to only hold a fraction of the data (parameters, gradients, and optimizer states) required for a training step. We used ZeRO stage 1, meaning that only the optimizer states are sharded in this manner.

最后,零冗余优化器 (ZeRO; Rajbhandari et al., 2020) 允许不同的进程仅持有一个训练步骤所需数据(参数、梯度和优化器状态)的一部分。我们使用了 ZeRO 第 1 阶段,这意味着只有优化器状态以这种方式进行分片。
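A minimal sketch of a DeepSpeed configuration enabling ZeRO stage 1 is shown below; the batch-size and precision fields are placeholders rather than the values used for BLOOM.

下面是启用 ZeRO 第 1 阶段的 DeepSpeed 配置的最小示意;其中批量大小和精度等字段只是占位值,并非 BLOOM 实际使用的配置。

```python
import json

# ZeRO stage 1 shards only the optimizer states across data-parallel processes.
ds_config = {
    "train_micro_batch_size_per_gpu": 2,   # placeholder value
    "gradient_accumulation_steps": 1,      # placeholder value
    "bf16": {"enabled": True},
    "zero_optimization": {"stage": 1},
}
print(json.dumps(ds_config, indent=2))
```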

The four components described above are combined together to allow scaling to hundreds of GPUs with extremely high GPU utilization. We were able to achieve 156 TFLOPs in our fastest configuration with A100 GPUs, attaining our objective of half of the theoretical peak performance of 312 TFLOPs (in float32 or bfloat16).

上述描述的四个组件结合在一起,使得能够扩展到数百个 GPU,并实现极高的 GPU 利用率。我们能够在最快的配置中使用 A100 GPU 达到 156 TFLOPs,实现了我们理论峰值性能 312 TFLOPs(在 float32 或 bfloat16)一半的目标。
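To make the combination concrete, the sketch below factors the 384 training GPUs across the three parallelism axes; the TP/PP/DP degrees are illustrative and not necessarily BLOOM's exact layout.

为了直观展示这种组合,下面的草图把 384 块训练 GPU 分解到三个并行维度上;其中的 TP/PP/DP 取值仅为示意,不一定是 BLOOM 的实际布局。

```python
# Sketch: factor the 384 training GPUs into the three parallelism axes.
n_gpus = 48 * 8          # 48 nodes x 8 A100s
tensor_parallel = 4      # each layer sharded over 4 GPUs (fast intra-node NVLink)
pipeline_parallel = 12   # the layer stack split into 12 pipeline stages
data_parallel = n_gpus // (tensor_parallel * pipeline_parallel)

assert tensor_parallel * pipeline_parallel * data_parallel == n_gpus
print(f"TP={tensor_parallel}, PP={pipeline_parallel}, DP={data_parallel}")  # DP=8
```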

3.4.3 Floating Point Format

3.4.3 浮点格式

In earlier experiments with 104B-parameter models on NVIDIA V100 GPUs, we observed numerical instabilities that caused irreversible training divergences. We hypothesize that these instabilities stem from our initial use of IEEE float16 — a 16-bit floating point format with a very limited dynamic range that can cause overflows. The NVIDIA A100 GPUs that we ultimately had access to support the bfloat16 format (Wang and Kanwar, 2019; Kalamkar et al., 2019), which has the same dynamic range as float32. On the other hand, bfloat16 still has much lower precision, which motivated our use of mixed-precision training (Micikevicius et al., 2018). This technique performs certain precision-sensitive operations such as gradient accumulation and softmax in float32 precision and the rest of operations in lower precision, allowing us to achieve a balance of high performance and training stability. Ultimately, we performed final training in bfloat16 mixed precision, which proved to solve the instability problem (in line with previous observation by Smith et al., 2022).

在早期使用 104B 参数模型在 NVIDIA V100 GPU 上的实验中,我们观察到数值不稳定问题,导致不可逆的训练发散。我们假设这些不稳定问题源于最初使用的 IEEE float16 —— 一种具有非常有限动态范围的 16 位浮点格式,可能会导致溢出。最终我们使用的 NVIDIA A100 GPU 支持 bfloat16 格式 (Wang 和 Kanwar, 2019; Kalamkar 等, 2019),其动态范围与 float32 相同。另一方面,bfloat16 的精度仍然较低,这促使我们采用混合精度训练 (Micikevicius 等, 2018)。这种技术在 float32 精度下执行某些对精度敏感的操作(如梯度累积和 softmax),而其余操作则在较低精度下进行,使我们能够在高性能和训练稳定性之间取得平衡。最终,我们在 bfloat16 混合精度下进行了最终训练,证明解决了不稳定问题(与 Smith 等, 2022 的先前观察一致)。
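A small PyTorch sketch of the pattern, assuming autocast with bfloat16 and an explicit float32 cast for the precision-sensitive softmax; it illustrates the idea rather than the Megatron-DeepSpeed training loop.

下面是该模式的一个小型 PyTorch 草图:假定使用 bfloat16 的 autocast,并将对精度敏感的 softmax 显式转换为 float32;它只是思路示意,并非 Megatron-DeepSpeed 的训练循环。

```python
import torch

# Matrix multiplies run in bfloat16 under autocast, while a precision-sensitive
# reduction (softmax) is done in float32 by casting explicitly.
x = torch.randn(4, 8)
w = torch.randn(8, 8)
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    scores = x @ w                                   # bfloat16 matmul
    probs = torch.softmax(scores.float(), dim=-1)    # softmax kept in float32
print(scores.dtype, probs.dtype)  # torch.bfloat16 torch.float32
```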

3.4.4 Fused CUDA Kernels

3.4.4 融合 CUDA 内核

In general, GPUs cannot retrieve data to perform computations on and perform these computations at the same time. Moreover, the compute performance of modern GPUs is much higher than the speed of memory transfer required for every operation (often called a kernel in GPU programming). Kernel fusion (Wu et al., 2012) is an approach for optimizing GPU-based computations by performing several consecutive operations in only one kernel call. This approach offers a way to minimize data transfers: intermediary results stay in the GPU register instead of being copied into VRAM, saving overhead.

一般来说,GPU 无法在读取待计算数据的同时执行这些计算。此外,现代 GPU 的计算性能远高于每次操作(在 GPU 编程中通常称为一个内核 (kernel))所需的内存传输速度。内核融合 (Kernel fusion) [Wu et al., 2012] 是一种通过在一次内核调用中执行多个连续操作来优化 GPU 计算的方法。这种方法提供了一种最小化数据传输的方式:中间结果保留在 GPU 寄存器中,而不是复制到 VRAM 中,从而节省了开销。

We used several custom fused CUDA kernels provided by Megatron-LM. First, we used an optimized kernel to perform LayerNorm, as well as kernels to fuse various combinations of the scaling, masking, and softmax operations. The addition of a bias term is also fused with the GeLU activation using the JIT functionality of PyTorch. As an example consequence of the use of fused kernels, adding the bias term in the GeLU operation adds no additional time, as the operation is memory-bound: the additional computation is negligible compared to data transfers between GPU VRAM and registers, so fusing both operations essentially halves their runtime.

我们使用了 Megatron-LM 提供的多个自定义融合 CUDA 内核。首先,我们使用了一个优化的内核来执行 LayerNorm,以及用于融合各种缩放、掩码和 softmax 操作组合的内核。偏置项的添加也通过 PyTorch 的 JIT 功能与 GeLU 激活函数融合在一起。作为使用融合内核的一个示例结果,在 GeLU 操作中添加偏置项不会增加额外的时间,因为该操作受内存带宽限制 (memory-bound):与 GPU 显存 (VRAM) 和寄存器之间的数据传输相比,额外的计算可以忽略不计,因此融合这两个操作基本上将它们的运行时间减半。
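The sketch below, modeled on the Megatron-style fused bias-GeLU, uses TorchScript to fuse the bias addition into the (tanh-approximated) GeLU; it illustrates the principle, not the custom CUDA kernels themselves.

下面的草图仿照 Megatron 风格的融合 bias-GeLU,用 TorchScript 将偏置加法融合进 (tanh 近似的) GeLU;它用于说明原理,并非那些自定义 CUDA 内核本身。

```python
import torch

@torch.jit.script
def bias_gelu(bias: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    # Fusing the bias addition into the tanh-approximated GeLU lets TorchScript
    # emit one elementwise kernel instead of two memory-bound ones.
    x = y + bias
    return x * 0.5 * (1.0 + torch.tanh(0.79788456 * x * (1.0 + 0.044715 * x * x)))

y = torch.randn(2, 8)
bias = torch.zeros(8)
print(bias_gelu(bias, y).shape)  # torch.Size([2, 8])
```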

3.4.5 Additional Challenges

3.4.5 额外挑战

Scaling to 384 GPUs required two final changes: disabling asynchronous CUDA kernel launches (for ease of debugging and to prevent deadlocks) and splitting parameter groups into smaller subgroups (to avoid excessive CPU memory allocations).

扩展到 384 个 GPU 需要进行两项最终更改:禁用异步 CUDA 内核启动(以便于调试和防止死锁)和将参数组拆分为更小的子组(以避免过多的 CPU 内存分配)。

During training, we faced issues with hardware failures: on average, 1–2 GPU failures occurred each week. As backup nodes were available and automatically used, and checkpoints were saved every three hours, this did not affect training throughput significantly. A PyTorch deadlock bug in the data loader and disk space issues led to 5–10h downtimes. Given the relative sparsity of engineering issues, and since there was only one loss spike, which the model swiftly recovered from, human intervention was less necessary than in comparable projects (Zhang et al., 2022). Full details of our experience with training BLOOM and a detailed report of all issues we faced are publicly available.$^{23}$

在训练期间,我们遇到了硬件故障问题:平均每周发生 1–2 次 GPU 故障。由于备用节点可用并自动使用,并且每三小时保存一次检查点,这并未显著影响训练吞吐量。PyTorch 数据加载器中的死锁错误和磁盘空间问题导致了 5–10 小时的停机时间。鉴于工程问题相对较少,而且只有一个损失峰值,模型迅速从中恢复,因此与类似项目相比 (Zhang et al., 2022),人工干预的需求较少。我们训练 BLOOM 的全部详细信息以及遇到的所有问题的详细报告均公开可用。$^{23}$

3.5 Training

3.5 训练

Hyperparameter (↓) BLOOM-560M BLOOM-1.1B BLOOM-1.7B BLOOM-3B BLOOM-7.1B BLOOM

架构超参数 (Architecture hyperparameters)
参数量 559M 1,065M 1,722M 3,003M 7,069M 176,247M
精度 float16 float16 float16 float16 float16 bfloat16
层数 24 24 24 30 30 70
隐藏层维度 1024 1536 2048 2560 4096 14336
注意力头数 16 16 16 32 32 112
词汇量大小 250,680 250,680 250,680 250,680 250,680 250,680
序列长度 2048 2048 2048 2048 2048 2048
激活函数 GELU GELU GELU GELU GELU GELU
位置编码 ALiBi ALiBi ALiBi ALiBi ALiBi ALiBi
共享嵌入 True True True True True True

预训练超参数 (Pretraining hyperparameters)
全局批量大小 256 256 512 512 512 2048
学习率 3.0e-4 2.5e-4 2e-4 1.6e-4 1.2e-4 6e-5
总 Token 数 341B 341B 341B 341B 341B 366B
预热 Token 数 375M 375M 375M 375M 375M 375M
衰减 Token 数 410B 410B 410B 410B 410B 410B
衰减方式 cosine cosine cosine cosine cosine cosine
最小学习率 1e-5 1e-5 1e-5 1e-5 1e-5 6e-6
Adam (β1, β2) (0.9, 0.95) (0.9, 0.95) (0.9, 0.95) (0.9, 0.95) (0.9, 0.95) (0.9, 0.95)
权重衰减 1e-1 1e-1 1e-1 1e-1 1e-1 1e-1
梯度裁剪 1.0 1.0 1.0 1.0 1.0 1.0

多任务微调超参数 (Multitask finetuning hyperparameters)
全局批量大小 1024 1024 2048 2048 2048 2048
学习率 2.0e-5 2.0e-5 2.0e-5 2.0e-5 2.0e-5 2.0e-5
总 Token 数 13B 13B 13B 13B 13B 13B
预热 Token 数 0 0 0 0 0 0
衰减方式 constant constant constant constant constant constant
权重衰减 1e-4 1e-4 1e-4 1e-4 1e-4 1e-4

Table 3: BLOOM & BLOOMZ Training Hyperparameters.

表 3: BLOOM 与 BLOOMZ 训练超参数。

Pretrained Models We train six size variants of BLOOM with respective hyperparameters detailed in Table 3. Architecture and training hyperparameters come from our experimental results (Le Scao et al., 2022) and prior work on training large language models (Brown et al., 2020; Kaplan et al., 2020). Model depth and width for the non-176B models roughly follow previous literature (Brown et al., 2020; Zhang et al., 2022), deviating for 3B and 7.1B only in order to fit the models more easily on our training setup. Embedding parameter sizes are larger for BLOOM owing to the larger multilingual vocabulary, but scaling literature discounts embedding operations (Kaplan et al., 2020). During the development process at the 104B parameters scale, we experimented with different values of Adam $\beta$ parameters, weight decay and gradient clipping to target stability, but did not find it helpful. For all models, we use a cosine learning rate decay schedule (Loshchilov and Hutter, 2016) over 410B tokens, taken as an upper bound for the length of training if compute permitted, and warmup for 375M tokens. We use weight decay, gradient clipping, and no dropout. The ROOTS dataset contains around 341 billion tokens of text, so we aimed to train all models for the equivalent amount of tokens. However, in light of revised scaling laws published during training (Hoffmann et al., 2022), we decided to train the large models for an additional 25 billion tokens on repeated data. As warmup tokens + decay tokens were larger than the total number of tokens, the end of learning rate decay was never reached.

预训练模型

我们训练了六个不同规模的 BLOOM 模型,各自的超参数详见表 3。架构和训练超参数来自我们的实验结果 (Le Scao et al., 2022) 和先前关于训练大语言模型的工作 (Brown et al., 2020; Kaplan et al., 2020)。对于非 176B 的模型,模型深度和宽度大致遵循之前的文献 (Brown et al., 2020; Zhang et al., 2022),仅 3B 和 7.1B 有所偏离,以便更容易地适配我们的训练设置。由于更大的多语言词汇表,BLOOM 的嵌入参数规模更大,但缩放定律相关文献通常不计入嵌入运算 (Kaplan et al., 2020)。在 104B 参数规模的开发过程中,我们尝试了不同的 Adam $\beta$ 参数、权重衰减和梯度裁剪值以提高稳定性,但未发现有帮助。对于所有模型,我们在 410B Token 上使用余弦学习率衰减计划 (Loshchilov and Hutter, 2016)(作为计算资源允许时训练长度的上限),并在前 375M Token 上进行预热。我们使用权重衰减和梯度裁剪,不使用 dropout。ROOTS 数据集包含大约 3410 亿个文本 Token,因此我们原计划将所有模型训练到等量的 Token 数。然而,鉴于训练期间发布的修订后缩放定律 (Hoffmann et al., 2022),我们决定让大模型在重复数据上额外训练 250 亿个 Token。由于预热 Token 数加上衰减 Token 数大于总训练 Token 数,学习率衰减从未到达终点。
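A sketch of the token-based warmup plus cosine decay schedule, plugging in the BLOOM-176B values from Table 3; the exact bookkeeping in Megatron-DeepSpeed may differ.

下面是基于 Token 数的预热加余弦衰减调度的示意,代入了表 3 中 BLOOM-176B 的取值;Megatron-DeepSpeed 中的具体实现细节可能略有不同。

```python
import math

def lr_at(tokens: float, peak_lr: float = 6e-5, min_lr: float = 6e-6,
          warmup_tokens: float = 375e6, decay_tokens: float = 410e9) -> float:
    """Token-based warmup followed by cosine decay (sketch with 176B values)."""
    if tokens < warmup_tokens:
        return peak_lr * tokens / warmup_tokens
    progress = min((tokens - warmup_tokens) / (decay_tokens - warmup_tokens), 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Training stopped near 366B tokens, well before the 410B decay end point.
for t in (100e6, 375e6, 180e9, 366e9):
    print(f"{t/1e9:7.1f}B tokens -> lr {lr_at(t):.2e}")
```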

Multitask Finetuning Finetuned BLOOMZ models (Muennighoff et al., 2022b) maintain the same architecture hyperparameters as BLOOM models. The finetuning hyperparameters are loosely based on T0 (Sanh et al., 2022) and FLAN (Wei et al., 2021). Learning rates are determined by doubling the minimum learning rate of the respective pretrained model and then rounding. Global batch sizes are multiplied by four for small variants to increase throughput. While the models are finetuned for 13 billion tokens, the best checkpoint is chosen according to a separate validation set. We found performance to plateau after 1–6 billion tokens of finetuning.

多任务微调 经过微调的 BLOOMZ 模型 (Muennighoff et al., 2022b) 保持与 BLOOM 模型相同的架构超参数。微调超参数大致基于 T0 (Sanh et al., 2022) 和 FLAN (Wei et al., 2021)。学习率通过将相应预训练模型的最小学习率加倍然后取整来确定。对于小型变体,全局批量大小乘以四以提高吞吐量。尽管模型总共微调了 130 亿个 Token,但最佳检查点是根据单独的验证集选择的。我们发现,在微调约 10 亿到 60 亿个 Token 后性能趋于平稳。

Contrastive Finetuning We also perform contrastive finetuning of the 1.3 and 7.1 billion parameter BLOOM models using the SGPT Bi-Encoder recipe (Muennighoff, 2022) to train models that produce high-quality text embeddings. We created SGPT-BLOOM-7.1B-msmarco$^{24}$ geared towards multilingual information retrieval and SGPT-BLOOM-1.7B-nli$^{25}$ for multilingual semantic textual similarity (STS). However, recent benchmarking has found these models to also generalize to various other embedding tasks, such as bitext mining, reranking or feature extraction for downstream classification (Muennighoff et al., 2022a).

对比微调

我们还使用 SGPT Bi-Encoder 方法 (Muennighoff, 2022) 对 13 亿和 71 亿参数的 BLOOM 模型进行对比微调,以训练能生成高质量文本嵌入的模型。我们创建了面向多语言信息检索的 SGPT-BLOOM-7.1B-msmarco$^{24}$,以及用于多语言语义文本相似度 (STS) 的 SGPT-BLOOM-1.7B-nli$^{25}$。然而,最近的基准测试发现这些模型还可以泛化到其他各种嵌入任务,如双语文本挖掘 (bitext mining)、重排序或下游分类的特征提取 (Muennighoff et al., 2022a)。

3.5.1 Carbon Footprint

3.5.1 碳足迹

While most attempts to estimate the carbon footprint of language models have shed light on the emissions produced due to energy consumed during model training (e.g. Patterson et al., 2021; Strubell et al., 2019), other sources of emissions are also important to consider. In our efforts to estimate the carbon emissions of BLOOM, we were inspired by the Life Cycle Assessment (LCA) approach (Klöpffer, 1997) and aimed to consider aspects such as

虽然大多数估算语言模型碳足迹的尝试都揭示了模型训练期间因能源消耗产生的排放(例如 Patterson 等,2021;Strubell 等,2019),但其他排放源也同样重要。在我们估算 BLOOM 碳排放的努力中,我们受到生命周期评估 (LCA) 方法 (Klöpffer, 1997) 的启发,并旨在考虑诸如


the emissions of equipment manufacturing, intermediate model training, and deployment. According to our estimates, the carbon emissions from BLOOM training add up to approximately 81 tons of CO₂eq, of which 14% were generated by the equipment manufacturing process (11 tons), 30% by the energy consumed during training (25 tons) and 55% by idle consumption of the equipment and computing cluster used for training (45 tons).

设备制造、中间模型训练和部署的排放。根据我们的估算,BLOOM 训练的碳排放总量约为 81 吨 CO2 当量,其中 14% 来自设备制造过程 (11 吨),30% 来自训练期间消耗的能源 (25 吨),55% 来自用于训练的设备和计算集群的空闲消耗 (45 吨)。

模型名称 参数数量 能耗 CO2 当量排放
GPT-3 175B 1,287 MWh 502 吨
Gopher 280B 1,066 MWh 352 吨
OPT 175B 324 MWh 70 吨
BLOOM 176B 433 MWh 25 吨

Table 4: Comparison of carbon emissions between BLOOM and similar LLMs. Numbers in italics have been inferred based on data provided in the papers describing the models.

表 4: BLOOM 与类似的大语言模型的碳排放对比。斜体数字是根据描述这些模型的论文提供的数据推断得出的。

Comparing the carbon emissions of BLOOM training to other similar models (see Table 4) reveals that while the energy consumption of BLOOM is slightly higher than OPT (Zhang et al., 2022) (433 MWh compared to OPT’s 324 MWh), its emissions are approximately 2/3 less (25 tons versus 70 tons). This is thanks to the low carbon intensity of the energy grid used for training BLOOM, which emits 57 gCO₂eq/kWh, compared to 231 gCO₂eq/kWh for the grid used for OPT training. Specifically, France’s national energy grid (which is used by Jean Zay) is largely powered by nuclear energy, which is low-carbon compared to grids powered by energy sources such as coal and natural gas. While the sustainability of nuclear energy is debated, it is one of the least carbon-intensive sources of energy that is currently available. Both BLOOM and OPT incurred significantly less carbon emissions than GPT-3 (as reported by Patterson et al., 2021), which can be attributed to several factors including more efficient hardware as well as less carbon-intensive energy sources.

将 BLOOM 训练的碳排放与其他类似模型进行比较(见表 4),结果显示虽然 BLOOM 的能耗略高于 OPT (Zhang et al., 2022) (433 MWh 对比 OPT 的 324 MWh),但其排放量大约减少了 2/3 (25 吨对比 70 吨)。这得益于用于训练 BLOOM 的能源网络的低碳强度,其碳排放为 57 gCO2eq/kWh,而用于 OPT 训练的能源网络为 231 gCO2eq/kWh。具体来说,法国的国家能源网络(由 Jean Zay 使用)主要由核能供电,相比煤炭和天然气等能源来源供电的电网,核能是低碳的。尽管核能的可持续性存在争议,但它目前是碳强度最低的能源之一。BLOOM 和 OPT 的碳排放显著低于 GPT-3(根据 (Patterson et al., 2021) 的报告),这可以归因于更高效的硬件以及碳强度较低的能源来源等因素。
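As a worked check of the grid-intensity comparison, the snippet below multiplies the quoted energy figures by the quoted grid intensities; the OPT product only approximates the 70-ton figure reported in Table 4.

下面的代码用文中引用的能耗乘以电网碳强度做一个简单核算;OPT 的乘积只是对表 4 中 70 吨这一数字的近似。

```python
# Energy in MWh, grid intensity in gCO2eq/kWh, as quoted in the text.
def tonnes_co2eq(energy_mwh: float, grams_per_kwh: float) -> float:
    return energy_mwh * 1_000 * grams_per_kwh / 1_000_000  # kWh * g/kWh -> tonnes

print(round(tonnes_co2eq(433, 57), 1))   # ~24.7 t, matching the "25 tons" for BLOOM
print(round(tonnes_co2eq(324, 231), 1))  # ~74.8 t, in the ballpark of OPT's 70 tons
```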

We also pursued further exploration of the carbon footprint of (1) the computation carried out on Jean Zay within the scope of the BigScience workshop, and (2) running the BLOOM model API in real time. In terms of the footprint of the totality of the computation, we estimate that the final BLOOM training represents approximately 37% of the overall emissions, with other processes such as intermediate training runs and model evaluation adding up to the other 63%. This is slightly less than the estimate made by the authors of the OPT paper, who stated that the total carbon footprint of their model is roughly 2 times higher due to experimentation, baselines and ablation (Zhang et al., 2022). Our ongoing exploration of the carbon emissions of the BLOOM API has estimated that the real-time deployment of the model on a GCP instance with 16 GPUs running in the us-central1 region results in approximately 20 kg of CO₂eq emitted per day of deployment (or 0.83 kg per hour). This figure is not representative of all deployment use-cases, and will vary depending on the hardware used as well as the specifics of model implementation (e.g. whether batching is used) and the number of requests the model receives. Further information regarding BLOOM’s carbon footprint can be found in Luccioni et al. (2022).

我们还进一步探索了以下两方面的碳足迹:(1) 在 BigScience workshop 范围内于 Jean Zay 上进行的全部计算,以及 (2) 实时运行 BLOOM 模型 API。就全部计算的碳足迹而言,我们估计最终的 BLOOM 训练约占总排放量的 37%,其他过程(如中间训练和模型评估)合计占剩余的 63%。这略低于 OPT 论文作者的估计:他们指出,由于实验、基线和消融研究,其模型的总碳足迹大约是最终训练的 2 倍 (Zhang et al., 2022)。我们对 BLOOM API 碳排放的持续探索表明,在 us-central1 区域使用配有 16 个 GPU 的 GCP 实例实时部署该模型,每天大约会排放 20 公斤的 CO₂ 当量(或每小时 0.83 公斤)。这一数字并不代表所有部署场景,具体数值将取决于所用硬件、模型实现的细节(例如是否使用批处理)以及模型收到的请求数量。有关 BLOOM 碳足迹的更多信息,请参见 Luccioni et al. (2022)。

3.6 Release

3.6 发布

Openness has been central to the development of BLOOM and we wanted to ensure it is easily available for the community to use. As such, we worked on producing documentation as a Model Card (Mitchell et al., 2019) and a new license addressing specific goals of the project.

开放性一直是 BLOOM 开发的核心,我们希望确保社区能够方便地使用它。为此,我们制作了模型卡片 (Model Card) (Mitchell et al., 2019) 形式的文档,并制定了一个针对本项目具体目标的新许可证。

Model Card Following best practices for releasing machine learning models, the BLOOM model has been released along with a detailed Model Card $^{26}$ (Mitchell et al., 2019) describing its technical specifications, details on training, intended-use, out-of-scope uses as well as the model’s limitations. Participants across working groups worked together to produce the final Model Card and similar cards for each checkpoint. The work was collaborative, primarily composed “live” by thinking through and discussing each section, then further dividing into subsections based on the categorizations and distinctions participants naturally ended up creating throughout discussions.

遵循发布机器学习模型的最佳实践,BLOOM 模型已随附详细的模型卡片 (Model Card)$^{26}$ (Mitchell et al., 2019) 一同发布,其中描述了模型的技术规格、训练细节、预期用途、超出范围的用途以及模型的局限性。各工作组的参与者共同合作,完成了最终的模型卡片以及每个检查点的类似卡片。这项工作以协作方式完成,主要通过“实时”地思考和讨论每个部分来撰写,然后再根据参与者在讨论中自然形成的分类和区分进一步划分为子部分。

Licensing Being mindful of the potentially harmful use-cases that BLOOM could enable, we chose to strike a balance between unrestricted open-access and responsible-use by including behavioral-use clauses (Contractor et al., 2022) to limit the application of the model towards potentially harmful use-cases. Such clauses are routinely being included in a growing class of “Responsible AI Licenses (RAIL)” $^{27}$ that the community has been adopting when releasing their models. $^{28}$ A distinguishing aspect of the RAIL license developed for BLOOM is that it separates licensing of the “source code” and “model”, as referenced by its trained parameters. It further includes detailed definitions of “use” and “derived works” of the model to ensure that anticipated downstream use by prompting, finetuning, distillation, use of logits and probability distributions are explicitly identified. The license contains 13 behavioral-use restrictions that have been identified based on the intended uses and limitations described in the BLOOM Model Card, as well as the BigScience ethical charter. The license offers the model at no charge and users are free to use the model as long as they comply with the terms (including usage restrictions). The source code for BLOOM has been made available under an Apache 2.0 open source license.

考虑到 BLOOM 可能被用于的潜在有害用例,我们选择在无限制开放访问和负责任使用之间取得平衡,通过加入行为使用条款 (Contractor et al., 2022) 来限制模型被应用于潜在有害用例。这类条款正越来越多地被纳入社区在发布模型时采用的“负责任 AI 许可证 (RAIL)”$^{27}$ 中。$^{28}$ 为 BLOOM 开发的 RAIL 许可证的一个显著特点是,它将“源代码”和“模型”(指其训练得到的参数)的许可区分开来。它还进一步包含了对模型“使用”和“衍生作品”的详细定义,以确保明确涵盖通过提示、微调、蒸馏、使用 logits 和概率分布等预期的下游用途。许可证包含 13 项行为使用限制,这些限制是根据 BLOOM 模型卡片中描述的预期用途和局限性以及 BigScience 伦理宪章确定的。该许可证免费提供模型,用户只要遵守条款(包括使用限制)即可自由使用。BLOOM 的源代码已在 Apache 2.0 开源许可证下提供。

4. Evaluation

4. 评估

Our evaluations focus on zero-shot and few-shot settings. Our goal is to present an accurate picture of how BLOOM compares to existing LLMs in settings that most realistically reflect the way the models are likely to be used in practice. Because of the scale of these models, prompt-based adaptation and few-shot “in-context learning” are currently more common than finetuning. Thus, we report results on a range of tasks – SuperGLUE (Section 4.2), machine translation (Section 4.3), summarization (Section 4.4) – and languages in zero-shot and one-shot prompt-based settings, as well as after multitask finetuning (Section 4.7). We also perform code generation (Section 4.5), use BLOOM-derived text embeddings for representation tasks (Section 4.8) and interpret BLOOM’s generalization abilities from the perspective of multilingual probing (Section 4.9).

我们的评估集中在零样本和少样本设置。我们的目标是呈现一个准确的画面,展示 BLOOM 在最能反映模型实际使用方式的环境中与现有大语言模型相比的表现。由于这些模型的规模庞大,基于提示的适应和少样本“上下文学习”目前比微调更为常见。因此,我们在一系列任务上报告结果 - SuperGLUE 4.2、机器翻译 4.3、摘要 4.4 - 以及在零样本和单样本提示设置下的多种语言,并在多任务微调后(第 4.7 节)。我们还进行代码生成 4.5,使用由 BLOOM 导出的文本嵌入进行表示任务 4.8,并从多语言探测的角度解释 BLOOM 的泛化能力(第 4.9 节)。

4.1 Experimental Design

4.1 实验设计

4.1.1 Prompts

4.1.1 提示词 (Prompts)

Based on recent research on the impact of prompting on language model performance, we decided to build a language model evaluation suite that allowed us to vary both the basic task data as well as the prompting that is used to contextualize the task. Our prompts were developed prior to BLOOM’s release, and did not undergo any a priori refinement using models. That is, the prompts we use in our evaluation are ones that humans believed were a reasonable way to solicit the desired task behavior from a language model. Our goal for designing prompts in this way is to simulate realistic zero-shot or one-shot results that a new user could expect from BLOOM. This is in contrast to presenting best-case performances that might result from multiple rounds of trial-and-error on prompt design. We choose to report the former because the latter is harder to reproduce systematically, is arguably a less representative picture of how the model works in the average setting, and is not representative of true zero-shot learning where no labeled data is available.

基于最近关于提示对语言模型性能影响的研究,我们决定构建一个语言模型评估套件,使我们能够同时改变基本任务数据以及用于情境化任务的提示。我们的提示是在 BLOOM 发布之前开发的,并未经过任何模型的事先优化。也就是说,我们在评估中使用的提示是人类认为可以合理地从语言模型中引出所需任务行为的方式。我们以这种方式设计提示的目标是模拟新用户可以从 BLOOM 中期望的真实零样本或单样本结果。这与展示通过多次试错优化提示设计后得到的最佳性能形成对比。我们选择报告前者,因为后者更难系统性地重现,且在平均情况下并不能代表模型的工作方式,也不符合真正的零样本学习,在这种情况下没有标注数据可用。

We generate multiple prompts per task using prompt source (Bach et al., 2022). We follow the procedure used by Sanh et al. (2022), in which prompt generation is crowdsourced, and thus we see substantial variety in length and style across prompts. To improve quality and clarity, multiple peer reviews were performed on each prompt for artifacts and consistency.

我们为每个任务使用 PromptSource (Bach et al., 2022) 生成多个提示。我们遵循 Sanh 等人 (2022) 使用的流程,其中提示的编写是众包完成的,因此提示在长度和风格上表现出显著的多样性。为了提高质量和清晰度,每个提示都经过多轮同行评审,以排查瑕疵并确保一致性。

Table 5 shows examples of the resulting prompts used for the WMT’14 task. We also generate prompts for many tasks that are not included in this paper due to resource constraints. All of our prompts for all tasks (both those analyzed in the paper and those not yet analyzed) are publicly available.$^{29}$

表 5 显示了用于 WMT’14 任务的所得提示示例。由于资源限制,我们还为许多未包含在本文中的任务生成了提示。所有任务的所有提示(包括本文中分析的任务和尚未分析的任务)均公开可用。$^{29}$


Table 5: Four prompts for the WMT’14 dataset (Bojar et al., 2014) for MT evaluation. Above, “L1” and “L2” are replaced with language names (e.g. “Bengali” and “Russian”).

表 5: WMT’14 数据集 (Bojar et al., 2014) 的四个提示用于机器翻译 (MT) 评估。上面,“L1” 和 “L2” 被替换为语言名称(例如,“Bengali” 和 “Russian”)。

提示名称 提示 目标
a_good_translation-source+target Given the following source text: [source sentence], a good L2 translation is: [target sentence]
gpt3-target What is the L2 translation of the sentence: [source sentence]? [target sentence]
version-target If the original version says [source sentence]; then the L2 version should say: [target sentence]
xglm-source+target L1: [source sentence] = L2: [target sentence]

4.1.2 Infrastructure

4.1.2 基础设施

Our framework extends EleutherAI’s Language Model Evaluation Harness (Gao et al., 2021) by integrating it with the prompt source (Bach et al., 2022) library described in Section 3.1.4. We release our Prompted Language Model Evaluation Harness as an open source library for people to use. We use this framework in order to run the experiments and aggregate results.

我们的框架扩展了 EleutherAI 的语言模型评估工具 (Gao et al., 2021),通过将其与第 3.1.4 节中描述的提示源 (Bach et al., 2022) 库集成。我们发布了我们的提示式语言模型评估工具作为一个开源库供人们使用。我们使用这个框架来运行实验并汇总结果。

SuperGLUE We use a subset of the SuperGLUE (Wang et al., 2019) evaluation suite of classification tasks, specifically: Ax-b, Ax-g, BoolQ, CB, WiC, WSC, and RTE tasks. We excluded the remaining tasks because they require an order of magnitude more compute to run than all of these tasks we consider combined. These tasks are English-only, and are thus included to facilitate comparison with prior work, which has primarily focused on English-only models. We also note that performance on these tasks has not yet been widely reported using zero- and one-shot prompt-based setting. T0 (Sanh et al., 2022) is the first exception, but that model is instruction-tuned and thus not directly comparable to models like BLOOM and OPT. For each task, we select a random sample of five prompts from PromptSource and evaluate all models on that set of prompts. As with other prompting tasks in Evaluation Harness (Gao et al., 2021), the prediction of a model for a given prompt is measured using the maximum log likelihood among a set of specified candidate label strings associated with the prompt.

SuperGLUE

我们使用 SuperGLUE (Wang et al., 2019) 评估套件中的一小部分分类任务,具体包括:Ax-b、Ax-g、BoolQ、CB、WiC、WSC 和 RTE 任务。我们排除了其余任务,因为它们所需的计算量比我们考虑的所有这些任务的总和还要高出一个数量级。这些任务仅限于英语,并因此被纳入以促进与先前工作的比较,这些工作主要集中在仅限英语的大语言模型上。我们还注意到,在零样本和单样本提示设置下,这些任务的表现尚未得到广泛报告。T0 (Sanh et al., 2022) 是第一个例外,但该模型是经过指令微调的,因此不能直接与 BLOOM 和 OPT 等模型进行比较。对于每个任务,我们从提示源中随机选择五个提示,并在该组提示上评估所有模型。与 Evaluation Harness (Gao et al., 2021) 中的其他提示任务一样,给定提示的模型预测是通过测量一组指定候选标签字符串中的最大对数似然来完成的。
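The sketch below illustrates this scoring rule; score_continuation is a hypothetical stand-in for a real model call that would sum the log-probabilities of the candidate label given the prompt.

下面的草图演示了这一打分规则;其中 score_continuation 是一个假想的占位函数,真实实现中应由模型对给定提示下候选标签的对数概率求和。

```python
# Rank-classification sketch: score each candidate label string by its
# log-likelihood given the prompt and take the argmax as the prediction.
def pick_label(score_continuation, prompt: str, candidates: list) -> str:
    scores = {cand: score_continuation(prompt, cand) for cand in candidates}
    return max(scores, key=scores.get)

# Toy scorer for demonstration only: prefers the shortest candidate.
toy_scorer = lambda prompt, cand: -len(cand)
print(pick_label(toy_scorer, "Is the sky blue? Answer:", [" yes", " no", " maybe"]))  # ' no'
```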

Machine Translation (MT) We evaluate BLOOM on three datasets (using ISO-639-1 codes to refer to languages): WMT14 en $\leftrightarrow$ fr and en $\leftrightarrow$ hi (Bojar et al., 2014), Flores-101 (Goyal et al., 2022) and DiaBLa (Bawden et al., 2020). We evaluate using the sacrebleu (Post, 2018) implementation of BLEU (Papineni et al., 2002), using default tokenisation for WMT and DiaBLa and spm-flores-101 for Flores.$^{30}$ We use greedy decoding with generation proceeding until the EOS token, or additionally \n###\n for the 1-shot case. The maximum generation length was set per dataset to be in line with what is typically used in the literature; specifically, 64 tokens for WMT14 and 512 tokens for Flores-101 and DiaBLa. Task-specific experimental design details are below.

机器翻译 (MT) 我们在三个数据集上评估 BLOOM(使用 ISO-639-1 代码指代语言):WMT14 en $\leftrightarrow$ fr 和 en $\leftrightarrow$ hi (Bojar 等, 2014)、Flores-101 (Goyal 等, 2022) 和 DiaBLa (Bawden 等, 2020)。我们使用 sacrebleu (Post, 2018) 实现的 BLEU (Papineni 等, 2002) 进行评估,对 WMT 和 DiaBLa 使用默认分词,对 Flores 使用 spm-flores-101。$^{30}$ 我们使用贪婪解码,生成持续到 EOS Token 为止;在 1-shot 情况下还会在 \n###\n 处停止。最大生成长度按数据集设置,以符合文献中的常用做法;具体来说,WMT14 为 64 个 Token,Flores-101 和 DiaBLa 为 512 个 Token。特定任务的实验设计细节如下。
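A minimal sketch of the scoring step with sacrebleu is shown below; generation, stop-sequence handling and the spm-flores-101 tokenization for Flores-101 are omitted, and the sentences are toy examples.

下面给出使用 sacrebleu 进行打分这一步的最小示意;生成过程、停止序列处理以及 Flores-101 所用的 spm-flores-101 分词在此省略,句子也只是示例。

```python
import sacrebleu

# Corpus-level BLEU with sacrebleu's default tokenization (toy sentences).
hypotheses = ["Le chat est assis sur le tapis.", "Il fait beau aujourd'hui."]
references = [["Le chat est assis sur le tapis.", "Le temps est tres beau aujourd'hui."]]
bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(round(bleu.score, 1))
```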

Summarization We evaluate summarization on the WikiLingua (Ladhak et al., 2020) dataset. WikiLingua is a multilingual summarization dataset comprising WikiHow article and step-by-step summary pairs. Pairs are aligned across multiple languages, with translation of source and summary further reviewed by an international translation team. One-shot conditional natural language generation has typically not been reported by models with size comparable to BLOOM. PaLM (Chowdhery et al., 2022) is the first exception, and reports scores on WikiLingua; however, only the model’s ability to summarize in English was examined (→ en). By contrast, we opted to test BLOOM’s inherent multilingual ability by assessing the abstractive summarization in the source language (e.g. vi → vi). We focus on the nine languages (Arabic, English, Spanish, French, Hindi, Indonesian, Portuguese, Vietnamese and Chinese) which were amongst those targeted as part of the BigScience effort.

摘要 我们在 WikiLingua (Ladhak 等, 2020) 数据集上评估摘要生成性能。WikiLingua 是一个多语言摘要数据集,包含 WikiHow 文章与逐步摘要的配对。这些配对在多种语言之间进行了对齐,源文本和摘要的翻译还经过国际翻译团队的进一步审校。规模与 BLOOM 相当的模型通常没有报告过单样本条件自然语言生成的结果。PaLM (Chowdhery 等, 2022) 是第一个例外,并在 WikiLingua 上报告了分数;然而,其只考察了模型用英语进行摘要的能力 (→ en)。相比之下,我们选择通过评估源语言内的生成式摘要 (abstractive summarization)(例如 vi → vi)来测试 BLOOM 固有的多语言能力。我们专注于九种语言(阿拉伯语、英语、西班牙语、法语、印地语、印尼语、葡萄牙语、越南语和中文),它们都属于 BigScience 计划的目标语言。

Natural language generation is notoriously challenging to evaluate, with multilingual generation compounding this challenge due to a lack of metric support. Following the suggestions by Gehrmann et al. (2022b), we report ROUGE-2, ROUGE-L (Lin, 2004),$^{31}$ and Levenshtein distance. One important modification to ROUGE is using the SentencePiece tokenizer (Kudo and Richardson, 2018) built from the Flores-101 dataset (Goyal et al., 2022).

自然语言生成的评估一直非常具有挑战性,而多语言生成由于缺乏相应的评测指标支持,使这一挑战更为突出。根据 Gehrmann 等人 (2022b) 的建议,我们报告 ROUGE-2、ROUGE-L (Lin, 2004)$^{31}$ 和 Levenshtein 距离。对 ROUGE 的一个重要修改是使用基于 Flores-101 数据集 (Goyal 等, 2022) 构建的 SentencePiece 分词器 (Kudo 和 Richardson, 2018)。

A naive approach would use a tokenizer based on English, but using a multilingual tokenizer improves the capacity to measure the fidelity of multilingual generations. To minimize inference time of the model we use the subsamples from the updated GEM benchmark (Gehrmann et al., 2022a) (3000 uniformly sampled test examples). The authors note that there is minimal difference when comparing model performance between the subsamples and the full test sets. For decoding and generation, we use the same procedure as described above for MT.

一种简单的方法是使用基于英语的分词器,但使用多语言分词器可以更好地衡量多语言生成的保真度。为了最小化模型的推理时间,我们使用了更新后的 GEM 基准 (Gehrmann et al., 2022a) 的子样本(均匀采样的 3000 个测试示例)。作者指出,在子样本与完整测试集之间比较模型性能时差异很小。对于解码和生成,我们使用与上文机器翻译相同的流程。

4.1.4 Baseline Models

4.1.4 基线模型 (Baseline Models)

We use the following baseline models where appropriate (e.g. in settings where they support the language of the evaluation dataset):

我们在适当的情况下使用以下基线模型(例如,在它们支持评估数据集语言的设置中):

4.2 SuperGLUE

4.2 SuperGLUE

Figure 7 shows zero- and one-shot performance on SuperGLUE. In both settings, on entailment tasks (BoolQ and CB), performance is well above random chance for BLOOM, T0, OPT, and GPT-J. On other tasks, while the best prompts do better, the average performance across prompts hovers around chance, suggesting that the success of individual prompts is primarily statistical variation. There is some signal for BLOOM in the diagnostic (Ax-b and Ax-g) datasets. The exception is the T0 model, which shows strong performance. However, this model is finetuned in the multitask setting (similar to BLOOMZ, see Section 4.7) in order to improve performance in zero-shot prompting settings, and thus is not directly comparable to the other models shown here.

图 7 显示了 SuperGLUE 上的零样本和单样本性能。在两种设置中,对于蕴含任务 (BoolQ 和 CB),BLOOM、T0、OPT 和 GPT-J 的表现都远高于随机水平。在其他任务上,尽管最佳提示的效果较好,但各提示的平均表现接近随机水平,这表明个别提示的成功主要是统计波动。在诊断数据集 (Ax-b 和 Ax-g) 上,BLOOM 表现出一些信号。例外的是 T0 模型,它表现出强劲的性能。然而,该模型是在多任务设置中微调过的(类似于 BLOOMZ,见第 4.7 节),以提高零样本提示设置中的性能,因此与这里展示的其他模型不可直接比较。

As models go from zero-shot to one-shot, variability is reduced across all prompts and models and performance slightly and inconsistently increases. Notably, BLOOM sees more of an increase in performance than comparable models when going from zero-shot to one-shot, as it is generally behind OPT in the zero-shot setting but matches or improves on it in the one-shot setting, even though it has only partly been trained on English. This may be because a multilingual language model gains more certainty in the language of input and output with a longer context.

当模型从零样本过渡到单样本时,所有提示和模型的变异性都有所减少,性能略有提升但并不一致。值得注意的是,BLOOM 在从零样本到单样本时的性能提升比同类模型更明显:在零样本设置下它通常落后于 OPT,但在单样本设置下能够追平或超越 OPT,尽管其训练数据中只有一部分是英语。这可能是因为多语言的大语言模型在更长的上下文中,对输入和输出语言的确定性更高。


Figure 7: Performance of various LLMs on a subset of tasks from the SuperGLUE benchmark in zero- and one-shot prompt-based settings.

图 7: 各种大语言模型 (LLM) 在 SuperGLUE 基准子集任务上、零样本和单样本提示设置下的性能。

We perform an additional analysis comparing BLOOM models across model sizes. As a baseline, we also measure the average one-shot accuracy of OPT models of similar sizes (350M parameters to 175B parameters).32 Figure 8 shows the accuracy of each prompt on each task across model scales. Both OPT and BLOOM model families improve very slightly with scale, with only models over 2 billion parameters showing signal, and there is no consistent difference between families across all tasks. In the 1-shot setting, BLOOM-176B is ahead of OPT-175B on Ax-b, CB, WSC and WiC, and matches it on the other tasks, suggesting that multilinguality does not limit performance of BLOOM on English-only tasks in the zero-shot setting.

我们进行了额外的分析,比较不同规模的 BLOOM 模型。作为基准,我们还测量了类似规模的 OPT 模型 (350M 参数到 175B 参数) 的平均单次 (one-shot) 准确率。图 8 显示了每个提示在每个任务上的准确率随模型规模的变化。OPT 和 BLOOM 模型家族的性能随着规模的增加略有提升,只有超过 20 亿参数的模型显示出信号,并且在所有任务中,两个模型家族之间没有一致的差异。在单次 (1-shot) 设置下,BLOOM-176B 在 Ax-b、CB、WSC 和 WiC 上优于 OPT-175B,在其他任务上与之相当,这表明多语言性并不会限制 BLOOM 在零样本设置下的英文任务表现。


Figure 8: Comparison of the scaling of BLOOM versus OPT on each SuperGLUE one-shot task. Each point represents the average accuracy of a model within the BLOOM or OPT family of models on one of the five task prompts. The number of parameters on the x-axis is presented in log-scale.

图 8: BLOOM 与 OPT 在每个 SuperGLUE 单样本任务上的扩展性对比。每个点代表 BLOOM 或 OPT 模型系列中的一个模型在五个任务提示之一上的平均准确率。x 轴上的参数数量以对数尺度表示。

4.3 Machine Translation

4.3 机器翻译

In addition to the results presented here, a more detailed analysis of BLOOM’s MT quality can be found in (Bawden and Yvon, 2023).

除了这里展示的结果外,还可以在 (Bawden and Yvon, 2023) 中找到对 BLOOM 的机器翻译质量更详细的分析。

4.3.1 WMT

4.3.1 WMT


WMT results for BLOOM-176B in the zero-shot and 1-shot setting are given in Table 6. The best prompts tend to be the more verbose ones; the “version-target” prompt is consistently better and the “gpt3-target” and “xglm-source+target” prompts have very poor performance, especially for zero-shot. In the one-shot setting, BLOOM can, with the right prompt, perform competent translation, although it is behind dedicated (supervised) models such as M2M-100 (43.8 BLEU for English $\rightarrow$ French and 40.4 for French $\rightarrow$ English, compared to 34.2 and 35.4 BLEU for BLOOM). The two major problems observed, particularly in the zero-shot setting, are (i) over-generation and (ii) not producing the correct language (an obvious prerequisite for a good translation). Both of these aspects are greatly improved as the number of few-shot examples is increased.

BLOOM-176B 在零样本和 1-shot 设置下的 WMT 结果见表 6。最佳的提示往往是更详细的:“version-target”提示始终表现更好,而“gpt3-target”和“xglm-source+target”提示的表现非常差,尤其是在零样本情况下。在 1-shot 设置下,只要提示得当,BLOOM 可以完成相当不错的翻译,尽管它落后于专门的(监督)模型,如 M2M-100(英语到法语的 BLEU 为 43.8,法语到英语为 40.4,而 BLOOM 分别为 34.2 和 35.4)。观察到的两个主要问题(特别是在零样本设置下)是 (i) 过度生成和 (ii) 没有生成正确的语言(这是良好翻译的明显前提)。随着少样本示例数量的增加,这两个方面都得到了显著改善。
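作为说明,下面是一个构造 1-shot 翻译提示并用 sacreBLEU 打分的简化示意。提示措辞只是对“较详细模板”这一思路的近似,并非论文表 5 中的确切模板;`generate` 为任意推理接口的占位符。

```python
# 示意:1-shot 翻译提示构造与 BLEU 评估(提示措辞为近似,非论文表 5 的原始模板)
import sacrebleu

def one_shot_prompt(example_src, example_tgt, src, src_lang="French", tgt_lang="English"):
    # 较“啰嗦”的模板:先给出一对示例,再要求翻译新句子
    return (
        f"{src_lang}: {example_src}\n{tgt_lang}: {example_tgt}\n"
        f"{src_lang}: {src}\n{tgt_lang}:"
    )

def corpus_bleu(hypotheses, references):
    # sacreBLEU 的语料级 BLEU;references 按“参考集合的列表”组织
    return sacrebleu.corpus_bleu(hypotheses, [references]).score

# 用法示意(译文应由模型生成,此处省略推理调用):
# prompt = one_shot_prompt("Bonjour le monde.", "Hello world.", "Je suis étudiant.")
# hypothesis = generate(prompt)  # 占位:取决于所用的推理框架
```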

Table 6: WMT’14 zero- and one-shot results (BLEU) for BLOOM-176B. The prompts used are described in Table 5.

表 6: BLOOM-176B 在 WMT’14 上的零样本和单样本结果 (BLEU)。所用提示见表 5。

| 提示 | en→fr (0-shot) | en→fr (1-shot) | fr→en (0-shot) | fr→en (1-shot) | en→hi (0-shot) | en→hi (1-shot) | hi→en (0-shot) | hi→en (1-shot) |
|---|---|---|---|---|---|---|---|---|
| a_good_translation-source+target | 15.38 | 36.39 | 14.15 | 36.56 | 1.90 | 14.49 | 10.19 | 24.60 |
| gpt3-target | 7.90 | 32.55 | 12.73 | 33.14 | 0.26 | 6.51 | 0.66 | 9.98 |
| version-target | 21.96 | 34.22 | 26.79 | 35.42 | 1.96 | 13.95 | 11.48 | 25.80 |
| xglm-source+target | 14.91 | 27.83 | 15.52 | 34.51 | 6.80 | 13.62 | 12.05 | 25.04 |

4.3.2 DiaBLa

4.3.2 DiaBLa

| 1-shot 上下文 | 截断 | en→fr BLEU | en→fr COMET | fr→en BLEU | fr→en COMET |
|---|---|---|---|---|---|
| Rand. | ✗ | 5.7 | 0.342 | 12.1 | 0.614 |
| Rand. | ✓ | 37.6 | **0.634** | 41.4 | **0.757** |
| Prev. | ✗ | 6.1 | 0.328 | 12.3 | 0.617 |
| Prev. | ✓ | **38.5** | 0.614 | **41.6** | 0.751 |

Table 7: DiaBLa 1-shot results (BLEU and COMET) for the “xglm-source+target” prompt when using the previous or a random sentence as the 1-shot example (with and without truncation of outputs). In bold the best results for each direction.

表 7: 使用前一句或随机句子作为 1-shot 示例时(带与不带输出截断),“xglm-source+target”提示的 DiaBLa 1-shot 结果 (BLEU 和 COMET)。每个方向的最佳结果以粗体标出。

Table 7 shows results testing the use of linguistic context with DiaBLa, a parallel dataset of informal bilingual dialogues. In a 1-shot context and using the “xglm-source+target” prompt, we compare the effect of using a random test set example as the 1-shot example versus using the previous dialogue utterance. In light of the over generation issues seen and in order to evaluate the quality of the prediction independently of over generation, we report results for both original outputs and after applying a custom truncation function. $^{33}$ The automatic results are inconclusive, with little difference between scores (BLEU scores are higher for previous context but COMET scores are lower). Despite these results, there is evidence in the predictions themselves that the model is able to use the context of the 1-shot example to make translation choices. See (Bawden and Yvon, 2023) for examples and further analysis.

表 7 显示了使用 DiaBLa(一个非正式双语对话的平行数据集)测试语言上下文的结果。在 1-shot 上下文中,使用“xglm-source+target”提示,我们比较了使用随机测试集示例作为 1-shot 示例与使用前一个对话话语的效果。鉴于观察到的过度生成问题,并为了在不受过度生成影响的情况下评估预测质量,我们报告了原始输出和应用自定义截断函数后的结果。$^{33}$ 自动评估结果并不明确,分数之间差异很小(使用前一上下文时 BLEU 分数更高,但 COMET 分数更低)。尽管如此,预测本身提供了证据,表明模型能够利用 1-shot 示例的上下文做出翻译选择。示例与进一步分析参见 (Bawden and Yvon, 2023)。
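论文未给出该截断函数的具体实现,下面是按“只保留第一行输出”这一思路写的假设性示意:模型在给出译文后常会继续生成虚构的新对话轮次,在第一个换行处截断即可剔除这部分内容。

```python
# 假设性示意:截断过度生成的输出,只保留第一行译文
def truncate_output(generation: str) -> str:
    first_line = generation.strip().split("\n")[0]
    return first_line.strip()

print(truncate_output("Hello world.\nFrench: Bonjour à tous.\nEnglish: Hello everyone."))
# 输出: "Hello world."
```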

(a) Low-resource languages

| Src↓ Trg→ | 模型 | en | bn | hi | sw | yo |
|---|---|---|---|---|---|---|
| en | BLOOM | — | 24.6 | 27.2 | 20.5 | 2.6 |
| en | M2M | — | 23.0 | 28.1 | 26.9 | 2.2 |
| bn | BLOOM | 29.9 | — | 16.3 | — | — |
| bn | M2M | 22.9 | — | 21.8 | — | — |
| hi | BLOOM | 35.1 | 23.8 | — | — | — |
| hi | M2M | 27.9 | 21.8 | — | — | — |
| sw | BLOOM | 37.4 | — | — | — | 1.3 |
| sw | M2M | 30.4 | — | — | — | 1.3 |
| yo | BLOOM | 4.1 | — | — | 0.9 | — |
| yo | M2M | 4.2 | — | — | 1.9 | — |

(a) 低资源语言

(b) Romance languages

| Src↓ Trg→ | 模型 | ca | es | fr | gl | it | pt |
|---|---|---|---|---|---|---|---|
| ca | BLOOM | — | 28.9 | 33.8 | 19.2 | 19.8 | 33.0 |
| ca | M2M | — | 25.2 | 35.1 | 33.4 | 25.5 | 35.2 |
| es | BLOOM | 31.2 | — | 24.8 | 23.3 | 16.5 | 29.1 |
| es | M2M | 23.1 | — | 29.3 | 27.5 | 23.9 | 28.1 |
| fr | BLOOM | 37.2 | 27.5 | — | 24.9 | 24.0 | 38.9 |
| fr | M2M | 28.7 | 25.6 | — | 32.8 | 28.6 | 37.8 |
| gl | BLOOM | 37.5 | 27.1 | 33.8 | — | 18.3 | 32.2 |
| gl | M2M | 30.1 | 27.6 | 37.1 | — | 26.9 | 34.8 |
| it | BLOOM | 31.0 | 25.4 | 31.4 | 20.2 | — | 29.2 |
| it | M2M | 25.2 | 29.2 | 34.4 | 29.2 | — | 31.5 |
| pt | BLOOM | 39.6 | 28.1 | 40.3 | 27.1 | 20.1 | — |
| pt | M2M | 30.7 | 26.9 | 40.2 | 33.8 | 28.1 | — |

(b) 罗曼语系

(c) High-resource language pairs.

| Src↓ Trg→ | 模型 | ar | en | es | fr | zh |
|---|---|---|---|---|---|---|
| ar | BLOOM | — | 40.3 | 23.3 | 33.1 | 17.7 |
| ar | M2M | — | 25.5 | 16.7 | 25.7 | 13.1 |
| ar | AlexaTM | — | 41.8 | 23.2 | 35.5 | — |
| en | BLOOM | 28.2 | — | 29.4 | 45.0 | 26.7 |
| en | M2M | 17.9 | — | 25.6 | 42.0 | 19.3 |
| en | AlexaTM | 32.0 | — | 31.0 | 50.7 | — |
| es | BLOOM | 18.8 | 32.7 | — | 24.8 | 20.9 |
| es | M2M | 12.1 | 25.1 | — | 29.3 | 14.9 |
| es | AlexaTM | 20.8 | 34.6 | — | 33.4 | — |
| fr | BLOOM | 23.4 | 45.6 | 27.5 | — | 23.2 |
| fr | M2M | 15.4 | 37.2 | 25.6 | — | 17.6 |
| fr | AlexaTM | 24.7 | 47.1 | 26.3 | — | — |
| zh | BLOOM | 15.0 | 30.5 | 20.5 | 26.0 | — |
| zh | M2M | 11.55 | 20.9 | 16.9 | 24.3 | — |
| zh | AlexaTM | — | — | — | — | — |

(c) 高资源语言对。

(d) High $\rightarrow$ mid-resource language pairs.

| Src↓ Trg→ | 模型 | en | fr | hi | id | vi |
|---|---|---|---|---|---|---|
| en | BLOOM | — | 45.0 | 27.2 | 39.0 | 28.5 |
| en | M2M | — | 42.0 | 28.1 | 37.3 | 35.1 |
| fr | BLOOM | 45.6 | — | 18.5 | 31.4 | 32.8 |
| fr | M2M | 37.2 | — | 22.9 | 29.1 | 30.3 |
| hi | BLOOM | 35.1 | 27.6 | — | — | — |
| hi | M2M | 27.9 | 25.9 | — | — | — |
| id | BLOOM | 43.2 | 30.4 | — | — | — |
| id | M2M | 33.7 | 30.8 | — | — | — |
| vi | BLOOM | 38.7 | 26.8 | — | — | — |
| vi | M2M | 29.5 | 25.8 | — | — | — |

(d) 高资源 $\rightarrow$ 中等资源语言对。

Table 8: 1-shot MT results (spBLEU) on the Flores-101 devtest set.

表 8: 在 Flores-101 devtest 集上的 1-shot 机器翻译结果 (spBLEU)。

4.3.3 Flores

4.3.3 Flores

In the 1-shot setting, we test several language directions in the Flores-101 (Goyal et al., 2022) devtest set using the “xglm-source+target” prompt (Lin et al., 2021). The 1-shot example is randomly taken from the dev set. We separate out results for low-resource language pairs (Table 8a), between related languages of the Romance language family (Table 8b), high-resource language pairs (Table 8c) and high-to-mid-resource language pairs (Table 8d).

在 1-shot 设置中,我们使用 “xglm-source+target” 提示 (Lin et al., 2021) 在 Flores-101 (Goyal et al., 2022) devtest 集中测试了几种语言方向。1-shot 示例是从 dev 集中随机选取的。我们将低资源语言对 (表 8a)、罗曼语系相关语言之间的结果 (表 8b)、高资源语言对 (表 8c) 和高到中资源语言对 (表 8d) 的结果分开列出。

Languages are classified as low-, mid- and high-resource depending on their representation in ROOTS. We compare to supervised results from the M2M-100 model (Fan et al., 2021) with 615M parameters, for which scores are computed by Goyal et al. (2022). Additionally, we compare to 32-shot AlexaTM results for high-resource language pairs (Soltan et al., 2022). Results are good across the board for both translation between high-resource languages and from high- to mid-resource languages, suggesting BLOOM’s good multilingual capacity, even across scripts (here between Latin (or extended Latin), Chinese, Arabic and Devanagari scripts). Compared to the supervised M2M-100 model, results are often comparable and sometimes better in this 1-shot setting, and results are comparable in many cases to those of AlexaTM (even though AlexaTM results are for 32-shot).

语言根据在 ROOTS 中的资源量被分类为低资源、中资源和高资源。我们将结果与 M2M-100 模型 (Fan et al., 2021) 的监督结果进行比较,该模型有 615M 参数,其得分由 Goyal 等人 (2022) 计算。此外,我们还与高资源语言对的 32-shot AlexaTM 结果进行了比较 (Soltan et al., 2022)。对于高资源语言之间的翻译以及从高资源到中资源语言的翻译,结果都很好,这表明 BLOOM 具有良好的多语言能力,即使是在不同脚本之间(例如拉丁文(或扩展拉丁文)、中文、阿拉伯文和天城文)。与监督的 M2M-100 模型相比,在这个 1-shot 设置下,结果通常相当,有时甚至更好,并且在许多情况下与 AlexaTM 的结果相当(尽管 AlexaTM 的结果是基于 32-shot)。

The translation quality for many of the low-resource languages is good, comparable to or even slightly better than the supervised M2M model. However, results are very poor between Swahili and Yoruba, languages that are present but under-represented in BLOOM’s training data (<50k tokens each). This contrasts with the results for translation between Romance (and therefore related) languages, where results are good across the board, including for translation from Galician (glg), a language not included in the training data, but which shares many similarities with the other Romance languages, in particular with Portuguese (por). This however does question BLOOM’s quality on those under-represented low-resource languages included in training.

许多低资源语言的翻译质量很好,可与监督的 M2M 模型相媲美,甚至略胜一筹。然而,斯瓦希里语和约鲁巴语之间的翻译结果非常差,这两种语言在 BLOOM 的训练数据中虽然存在但代表性不足(每种少于 5 万个 token)。这与罗曼语系(因此彼此相关)语言之间的翻译结果形成鲜明对比:后者的结果普遍良好,包括从加利西亚语 (glg) 出发的翻译,尽管该语言未包含在训练数据中,但它与其他罗曼语系语言有许多相似之处,尤其是与葡萄牙语 (por) 相似。不过,这也让人质疑 BLOOM 在那些包含于训练数据中但代表性不足的低资源语言上的质量。

4.4 Summarization

4.4 总结

Figure 9 shows one-shot results for BLOOM models alongside OPT-175B for comparison. Each point represents a per-prompt score. The key takeaways are that BLOOM attains higher performance on multilingual summarization than OPT and that performance increases as the parameter count of the model increases. We suspect this is due to BLOOM’s multilingual-focused training.

图 9 展示了 BLOOM 模型与 OPT-175B 的单样本 (one-shot) 结果对比。每个点代表单个提示的得分。主要结论是:BLOOM 在多语言总结上优于 OPT,并且性能随着模型参数量的增加而提升。我们推测这得益于 BLOOM 以多语言为中心的训练。

As discussed in Section 4.1, we report ROUGE-2 scores for the sake of comparability with prior work, and because there is a lack of alternatives for generation evaluation. However, we qualitatively observe that in many cases, the ROUGE-2 score understates the quality of the summaries generated by the systems.

如第 4.1 节所述,为了与先前的工作具有可比性,并且由于生成评估缺乏替代方案,我们报告了 ROUGE-2 分数。然而,我们定性地观察到,在许多情况下,ROUGE-2 分数低估了系统生成的摘要的质量。

4.5 Code Generation

4.5 代码生成

The BLOOM pretraining corpus, ROOTS, consists of around 11% code. In Table 9, we report benchmarking results of BLOOM on HumanEval (Chen et al., 2021). We find the performance of pretrained BLOOM models to be similar to that of the similar-sized GPT models trained on the Pile (Gao et al., 2020). The Pile contains English data and around 13% code (GitHub + Stack Exchange), which is similar to the code data sources and proportions in ROOTS. The Codex models, which have solely been finetuned on code, are significantly stronger than other models. Multitask finetuned BLOOMZ models do not improve significantly over BLOOM models. We hypothesize this is due to the finetuning dataset, xP3, not containing significant amounts of pure code completion. Rather, xP3 contains code-related tasks, such as estimating the time complexity of a given Python code snippet. Additional analysis is provided in Muennighoff et al. (2022b).

BLOOM 的预训练语料库 ROOTS 包含大约 11% 的代码。在表 9 中,我们报告了 BLOOM 在 HumanEval (Chen et al., 2021) 上的基准测试结果。我们发现预训练的 BLOOM 模型的性能与在 Pile (Gao et al., 2020) 上训练的类似规模的 GPT 模型相似。Pile 包含英文数据和大约 13% 的代码(GitHub + Stack Exchange),这与 ROOTS 中的代码数据来源和比例相似。仅在代码上微调的 Codex 模型比其他模型显著更强。多任务微调的 BLOOMZ 模型并没有显著优于 BLOOM 模型。我们推测这是由于微调数据集 xP3 不包含大量的纯代码补全任务。相反,xP3 包含与代码相关的任务,例如估计给定 Python 代码片段的时间复杂度。更多分析见 Muennighoff et al. (2022b)。
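表 9 中的 Pass@k 采用 Chen et al. (2021) 提出的无偏估计:对每道题采样 n 个补全、其中 c 个通过单元测试时,pass@k = 1 − C(n−c, k)/C(n, k),再对所有题目取平均。下面是该估计量的一个简短实现示意(变量与数值仅为演示):

```python
# Pass@k 的无偏估计 (Chen et al., 2021):pass@k = 1 - C(n-c, k) / C(n, k)
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """n: 每题采样的补全数;c: 通过单元测试的补全数;k: 评估预算。"""
    if n - c < k:
        return 1.0
    # 数值稳定的写法:1 - prod_{i=n-c+1}^{n} (1 - k / i)
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# 例:某题采样 n=200 个补全,其中 c=30 个通过
print(round(pass_at_k(200, 30, 1), 4))    # 0.15
print(round(pass_at_k(200, 30, 100), 4))  # 接近 1.0
```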


Figure 9: WikiLingua One-shot Results. Each plot represents a different language with per-prompt ROUGE-2 F-measure scores.

图 9: WikiLingua 单样本 (one-shot) 结果。每个子图代表一种语言,显示每个提示的 ROUGE-2 F 值分数。

4.6 HELM benchmark

4.6 HELM 基准测试

For completeness, we reproduce here evaluations from the HELM benchmark (Liang et al., 2022), which ran 5-shot evaluations of a variety of language models on English-only tasks. Despite the multilingual training, BLOOM is roughly on par in accuracy with previous-generation English-only models, such as GPT3-davinci v1 and J1-Grande v1, but behind more recent monolingual models such as InstructGPT davinci v2, Turing NLG v2, Anthropic-LM v4-s3, or OPT. Like other large language models of this size, it is not very well calibrated, but quite robust. Finally, on this benchmark, it is one of the best models for fairness, slightly more toxic than average in English, and average for bias.

为完整起见,我们在此重现 HELM 基准测试 (Liang et al., 2022) 的评估结果,该基准在纯英语任务上对多种语言模型进行了 5-shot 评估。尽管进行了多语言训练,BLOOM 在准确性上与上一代纯英语模型(如 GPT3-davinci v1 和 J1-Grande v1)大致相当,但落后于更新的单语模型,例如 InstructGPT davinci v2、Turing NLG v2、Anthropic-LM v4-s3 或 OPT。与其他同等规模的大语言模型一样,它的校准效果不太好,但相当稳健。最后,在此基准测试中,它是公平性最好的模型之一,在英语中的毒性略高于平均水平,偏见程度则处于平均水平。

4.7 Multitask Finetuning

4.7 多任务微调

Building on recent work on multitask finetuning (Sanh et al., 2022; Wei et al., 2021; Wang et al., 2022a) we explore using multilingual multitask finetuning to improve the zero-shot performance of the BLOOM model. We conducted multilingual multitask finetuning of BLOOM models using the xP3 corpus outlined in Section 3.1.4. We find that zero-shot performance significantly increases. In Figure 11, we compare the zero-shot performance of pretrained BLOOM and XGLM models with multitask finetuned BLOOMZ, T0 and mTk-Instruct (Wang et al., 2022b). BLOOM and XGLM performances are near the random baselines of 33% for NLI (XNLI) and 50% for coreference resolution (XWinograd) and sentence completion (XCOPA and XStoryCloze).

基于最近关于多任务微调的工作 (Sanh et al., 2022; Wei et al., 2021; Wang et al., 2022a),我们探索使用多语言多任务微调来提高 BLOOM 模型的零样本性能。我们使用第 3.1.4 节中概述的 xP3 语料库对 BLOOM 模型进行了多语言多任务微调,发现零样本性能显著提升。在图 11 中,我们将预训练的 BLOOM 和 XGLM 模型与多任务微调后的 BLOOMZ、T0 和 mTk-Instruct (Wang et al., 2022b) 的零样本性能进行了比较。BLOOM 和 XGLM 的性能接近随机基线:NLI (XNLI) 为 33%,共指消解 (XWinograd) 以及句子补全 (XCOPA 和 XStoryCloze) 为 50%。

| 模型 | Pass@1 | Pass@10 | Pass@100 |
|---|---|---|---|
| GPT-Neo 1.3B | 4.79% | 7.47% | 16.30% |
| GPT-Neo 2.7B | 6.41% | 11.27% | 21.37% |
| GPT-J 6B | 11.62% | 15.74% | 27.74% |
| GPT-NeoX 20B | 15.4% | 25.6% | 41.2% |
| Codex-300M | 13.17% | 20.37% | 36.27% |
| Codex-679M | 16.22% | 25.7% | 40.95% |
| Codex-2.5B | 21.36% | 35.42% | 59.5% |
| Codex-12B | 28.81% | 46.81% | 72.31% |
| BLOOM-560M | 0.82% | 3.02% | 5.91% |
| BLOOM-1.1B | 2.48% | 5.93% | 9.62% |
| BLOOM-1.7B | 4.03% | 7.45% | 12.75% |
| BLOOM-3B | 6.48% | 11.35% | 20.43% |
| BLOOM-7.1B | 7.73% | 17.38% | 29.47% |
| BLOOM | 15.52% | 32.20% | 55.45% |
| BLOOMZ-560M | 2.18% | 4.11% | 9.00% |
| BLOOMZ-1.1B | 2.63% | 6.22% | 11.68% |
| BLOOMZ-1.7B | 4.38% | 8.73% | 16.09% |
| BLOOMZ-3B | 6.29% | 11.94% | 19.06% |
| BLOOMZ-7.1B | 8.06% | 15.03% | 27.49% |
| BLOOMZ | 12.06% | 26.53% | 48.44% |

Table 9: Performance on HumanEval (Chen et al., 2021). Non-BLOOM results come from prior work (Chen et al., 2021; Fried et al., 2022). The Codex model is a language model that was finetuned on code, while the GPT models (Black et al.; Wang and Komatsuzaki, 2021; Black et al., 2022) are trained on a mix of code and text like BLOOM.

表 9: HumanEval 上的性能 (Chen et al., 2021)。非 BLOOM 结果来自先前的工作 (Chen et al., 2021; Fried et al., 2022)。Codex 模型是经过代码微调的语言模型,而 GPT 模型 (Black et al.; Wang and Komatsuzaki, 2021; Black et al., 2022) 则像 BLOOM 一样在代码和文本混合数据上训练。

After going through multilingual multitask finetuning (BLOOMZ), zero-shot performance significantly improves on the depicted held-out tasks. Despite also being multitask finetuned, T0 performs badly on the multilingual datasets shown due to it being a monolingual English model. Additional results provided in Muennighoff et al. (2022b), however, show that models finetuned on xP3 also outperform T0 on English datasets when controlling for size and architecture. This is likely due to T0’s finetuning dataset (P3) containing less diverse datasets and prompts than xP3. Multitask finetuning performance has been shown to correlate with the amount of datasets and prompts (Chung et al., 2022).

经过多语言多任务微调 (BLOOMZ) 后,零样本性能在所示的保留任务上显著提高。尽管同样经过多任务微调,但由于 T0 是单语英语模型,它在所示的多语言数据集上表现不佳。不过,Muennighoff 等 (2022b) 提供的额外结果显示,在控制规模和架构的情况下,基于 xP3 微调的模型在英语数据集上同样优于 T0。这可能是因为 T0 的微调数据集 (P3) 所包含的数据集与提示的多样性不如 xP3。多任务微调的性能已被证明与数据集和提示的数量相关 (Chung et al., 2022)。

4.8 Embeddings

4.8 嵌入 (Embeddings)

In Section 3.5, we have outlined the contrastive finetuning procedure for creating SGPT-BLOOM text embedding models. In Table 10, we report benchmarking results on two multilingual datasets from the Massive Text Embedding Benchmark (MTEB, Muennighoff et al., 2022a). We find that SGPT-BLOOM-7.1B-msmarco$^{36}$ provides state-of-the-art performance on several classification and semantic textual similarity splits. However, with 7.1 billion parameters it is an order of magnitude larger than models like the displayed multilingual MiniLM$^{37}$ and MPNet$^{38}$. SGPT-BLOOM-1.7B-nli$^{39}$ performs significantly worse, likely due to having fewer parameters and a shorter finetuning (NLI is a much smaller dataset than MS-MARCO). Apart from the BLOOM models, ST5-XL$^{40}$ is the largest model with 1.2 billion parameters. However, as an English-only model, its performance on non-English languages lags far behind.

在 3.5 节中,我们概述了用于创建 SGPT-BLOOM 文本嵌入模型的对比微调过程。在表 10 中,我们报告了在大规模文本嵌入基准 (MTEB, Muennighoff et al., 2022a) 的两个多语言数据集上的基准测试结果。我们发现 SGPT-BLOOM-7.1B-msmarco (36) 在多个分类和语义文本相似度子集上提供了最先进的性能。然而,它拥有 71 亿个参数,比表中展示的多语言 MiniLM (37) 和 MPNet (38) 等模型大一个数量级。SGPT-BLOOM-1.7B-nli (39) 的表现明显较差,可能是因为参数较少且微调时间较短(NLI 数据集比 MS-MARCO 小得多)。除 BLOOM 模型外,ST5-XL (40) 是最大的模型,拥有 12 亿个参数。然而,作为一个仅支持英语的模型,它在非英语语言上的表现远远落后。
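下面是一个简化示意,说明此类嵌入评测的基本流程:对句对做均值池化得到句向量,计算余弦相似度,再与人工打分求斯皮尔曼相关(即 STS22 的评测方式)。注意这并非 SGPT 的官方池化方法(SGPT 使用位置加权的均值池化等细节),模型名与人工打分均为占位。

```python
# 简化示意:句向量(均值池化)+ 余弦相似度 + 斯皮尔曼相关(非 SGPT 官方实现)
import torch
from scipy.stats import spearmanr
from transformers import AutoModel, AutoTokenizer

model_name = "bigscience/bloom-560m"  # 占位:实际评测使用的是 SGPT-BLOOM 系列嵌入模型
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

def embed(sentences):
    batch = tok(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state      # [batch, seq, dim]
    mask = batch["attention_mask"].unsqueeze(-1)       # 屏蔽 padding 位置
    return (hidden * mask).sum(1) / mask.sum(1)        # 均值池化

s1 = ["A man is playing guitar.", "Il pleut beaucoup.", "我喜欢这本书。"]
s2 = ["Someone plays an instrument.", "The weather is sunny.", "这本书很有趣。"]
gold = [4.2, 0.8, 3.9]  # 占位的人工相似度打分

cos = torch.nn.functional.cosine_similarity(embed(s1), embed(s2))
print(spearmanr(cos.tolist(), gold).correlation)
```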


Figure 10: Results for a wide variety of language models on the 5-shot HELM benchmark. Taken from Liang et al. (2022)

图 10: 各种语言模型在 5-shot HELM 基准上的结果。摘自 Liang et al. (2022)


Figure 11: BLOOMZ zero-shot task generalization. Five untuned prompts are evaluated for each dataset and plotted. T0 is monolingual (English) while other models are multilingual. T0 performance may be hurt by its inability to tokenize some non-English texts.

图 11: BLOOMZ 零样本任务泛化。每个数据集评估了五个未调优的提示并绘制了图表。T0 是单语言(英语)的,而其他模型是多语言的。T0 的性能可能因其无法对某些非英语文本进行分词而受到影响。

4.9 Multilingual Probing

4.9 多语言探针分析

Probing has emerged as a significant evaluation paradigm to analyze and interpret the inner workings of LLMs (Ettinger et al., 2016; Adi et al., 2017; Belinkov et al., 2017; Hupkes et al., 2018; Tenney et al., 2018; Belinkov and Glass, 2019; Teehan et al., 2022), although it comes with certain shortcomings (Belinkov, 2022). Examination of the LLM embeddings can help shed light on the generalizing abilities of the model apart from its training objective loss or downstream task evaluation, which is especially beneficial for examining languages lacking annotated datasets or benchmarks.

探针方法已发展成为一种重要的评估范式,用于分析和解释大语言模型 (LLM) 的内部运作机制 (Ettinger et al., 2016; Adi et al., 2017; Belinkov et al., 2017; Hupkes et al., 2018; Tenney et al., 2018; Belinkov 和 Glass, 2019; Teehan et al., 2022),尽管它存在某些不足 (Belinkov, 2022)。对大语言模型的嵌入进行检查,可以在训练目标损失或下游任务评估之外帮助揭示模型的泛化能力,这对于检查缺乏标注数据集或基准的语言尤其有益。

Table 10: Performance of BLOOM models finetuned for sentence embeddings on classification and STS datasets from MTEB (Muennighoff et al., 2022b).

在 MASSIVE (FitzGerald et al., 2022) 上的嵌入分类表现(准确率):

| 语言 | ST5-XL | LASER2 | MiniLM-L12$^{34}$ | MPNet$^{35}$ | LaBSE | SGPT-BLOOM-1.7B | SGPT-BLOOM-7.1B |
|---|---|---|---|---|---|---|---|
| 阿拉伯语 (ar) | 4.18 | 37.16 | 51.43 | 45.14 | 50.86 | 54.59 | 59.25 |
| 孟加拉语 (bn) | 2.60 | 42.51 | 48.79 | 35.34 | 58.22 | 57.76 | 61.59 |
| 英语 (en) | 72.09 | 47.91 | 69.32 | 66.84 | 61.46 | 66.69 | 69.67 |
| 西班牙语 (es) | 57.97 | 45.44 | 64.43 | 59.66 | 58.32 | 61.77 | 66.35 |
| 法语 (fr) | 60.99 | 46.13 | 64.82 | 60.25 | 60.47 | 64.58 | 66.95 |
| 印地语 (hi) | 3.02 | 40.20 | 62.77 | 58.37 | 59.40 | 60.74 | 63.54 |
| 印尼语 (id) | 41.53 | 45.81 | 65.43 | 59.85 | 61.12 | 60.07 | 64.06 |
| 卡纳达语 (kn) | 2.79 | 4.32 | 50.63 | 40.98 | 56.24 | 48.56 | 53.54 |
| 马拉雅拉姆语 (ml) | 2.98 | 41.33 | 54.34 | 42.41 | 57.91 | 55.10 | 58.27 |
| 葡萄牙语 (pt) | 57.95 | 48.55 | 64.89 | 61.27 | 60.16 | 62.52 | 66.69 |
| 斯瓦希里语 (sw) | 30.60 | 31.89 | 31.95 | 29.57 | 51.62 | 43.90 | 49.81 |
| 泰米尔语 (ta) | 1.79 | 29.63 | 50.17 | 36.77 | 55.04 | 52.66 | 56.40 |
| 泰卢固语 (te) | 2.26 | 36.03 | 52.82 | 40.72 | 58.32 | 49.32 | 54.71 |
| 乌尔都语 (ur) | 2.70 | 26.11 | 56.37 | 52.80 | 56.70 | 51.00 | 56.75 |
| 越南语 (vi) | 21.47 | 44.33 | 59.68 | 56.61 | 56.67 | 59.85 | 64.53 |

在 STS22 (Madabushi et al., 2022) 上的语义文本相似度(余弦相似度的斯皮尔曼相关系数):

| 语言 | ST5-XL | LASER2 | MiniLM-L12$^{34}$ | MPNet$^{35}$ | LaBSE | SGPT-BLOOM-1.7B | SGPT-BLOOM-7.1B |
|---|---|---|---|---|---|---|---|
| 阿拉伯语 (ar) | 29.60 | 42.57 | 52.19 | 46.20 | 57.67 | 48.64 | 58.67 |
| 英语 (en) | 64.32 | 39.76 | 63.06 | 61.72 | 60.97 | 61.45 | 66.13 |
| 西班牙语 (es) | 58.16 | 54.92 | 59.91 | 56.56 | 63.18 | 61.81 | 65.41 |
| 法语 (fr) | 77.49 | 58.61 | 74.30 | 70.55 | 77.95 | 73.18 | 80.38 |
| 中文 (zh) | 33.55 | 49.41 | 61.75 | 58.75 | 63.02 | 58.53 | 66.78 |

表 10: BLOOM 模型在 MTEB (Muennighoff et al., 2022b) 的分类和 STS 数据集上微调后的表现。


4.9.1 Method

4.9.1 方法

For interpreting BLOOM’s multilingual generalizing abilities, we utilize the “Universal Probing” framework$^{42}$ for systematic probing analysis in 104 languages and 80 morphosyntactic features (Serikov et al., 2022). The framework provides SentEval-style (Conneau et al., 2018) probing setup and datasets for each language available in Universal Dependencies (UD; Nivre et al., 2016). We consider the following 17 languages from 7 language families present in BLOOM’s pretraining corpus (Section 3.1) and UD treebanks: Arabic (Afro-Asiatic), Bambara (Mande), Basque (language isolate), Bengali, Catalan, English, French, Hindi, Marathi, Portuguese, Spanish, Urdu (Indo-European), Chinese (Sino-Tibetan), Indonesian (Austronesian), Tamil (Dravidian), Wolof, Yoruba (Niger-Congo). Our setup covers 38 morphosyntactic features in total, which represent language-specific linguistic information. We provide a dataset sample in Table 11.

为了解释 BLOOM 的多语言泛化能力,我们利用“通用探测 (Universal Probing)”框架⁴² 进行系统性的探测分析,涵盖 104 种语言和 80 个形态句法特征 (Serikov 等, 2022)。该框架为 Universal Dependencies (UD; Nivre 等, 2016) 中每种可用的语言提供了 SentEval 风格 (Conneau 等, 2018) 的探测设置和数据集。我们考虑 BLOOM 的预训练语料库(第 3.1 节)和 UD 树库中均包含的、来自 7 个语系的以下 17 种语言:阿拉伯语(闪含语系)、班巴拉语(曼德语系)、巴斯克语(孤立语言)、孟加拉语、加泰罗尼亚语、英语、法语、印地语、马拉地语、葡萄牙语、西班牙语、乌尔都语(印欧语系)、中文(汉藏语系)、印度尼西亚语(南岛语系)、泰米尔语(德拉威语系)、沃洛夫语、约鲁巴语(尼日尔-刚果语系)。我们的设置总共涵盖 38 个形态句法特征,这些特征代表了特定语言的语言学信息。表 11 给出了一个数据集样本。

The probing procedure is conducted as follows. First, we compute pooled representations of the input sentence at each layer of the 1.7B-parameter BLOOM variant (“BLOOM 1B7”) and BLOOM (with 176B parameters). Second, we train a binary logistic regression classifier to predict the presence of a morphosyntactic feature in the sentence. Logistic regression is chosen due to its higher selectivity as opposed to non-linear probing classifiers (Hewitt and Liang, 2019). We use the original UD training, validation, and test splits here. Third, the probing performance is evaluated by the weighted $F_{1}$ score due to target class imbalance for most probing tasks. The results are averaged across three runs with different random seeds.

探针程序如下进行。首先,我们计算输入句子在 1.7B 参数的 BLOOM 变体(“BLOOM 1B7”)和 BLOOM(176B 参数)每一层的池化表示。其次,我们训练一个二元逻辑回归分类器来预测句子中是否存在某一形态句法特征。选择逻辑回归是因为相比非线性探针分类器,它具有更高的选择性 (Hewitt 和 Liang, 2019)。这里使用 UD 官方的训练、验证和测试划分。第三,由于大多数探针任务的目标类别不平衡,探针性能通过加权 $F_{1}$ 分数进行评估。结果取三次不同随机种子运行的平均值。
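下面是探针流程第二、三步的最小示意:在(假设已经取得的)某一层句向量上训练二元逻辑回归分类器,并用加权 F1 评估;其中随机生成的 X、y 仅为占位,真实实验使用 UD 官方划分并对三个随机种子取平均。

```python
# 探针分类器示意:逻辑回归 + 加权 F1(数据为占位的随机数)
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(2000, 2048)), rng.integers(0, 2, 2000)
X_test, y_test = rng.normal(size=(500, 2048)), rng.integers(0, 2, 500)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
pred = clf.predict(X_test)

# 因目标类别不平衡,报告加权 F1
print(f1_score(y_test, pred, average="weighted"))
```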

Table 11: Examples of the Number task in English and Spanish. The subject number indicator is highlighted in bold. The task is to predict if the sentence includes a singular subject number (upper sentence) and a plural subject number (bottom sentence).

表 11: 英语和西班牙语中数字任务的示例。主语数量指示器以粗体突出显示。任务是预测句子是否包含单数主语数量(上句)和复数主语数量(下句)。

| 语言 | 标签 | 句子 |
|---|---|---|
| English | Sing | The scheme makes money through sponsorship and advertising |
| English | Plur | Still, there are questions left unanswered |
| Spanish | Sing | Eligio no ir tras un tercer periodo en el siguiente ciclo de elecciones |
| Spanish | Plur | Todavia quedan preguntas sin responder |

Baselines We compare the probing performance with random guessing and logistic regression classifiers trained on the following TF-IDF features (Salton and Yang, 1973): word unigrams, character N-grams, BPE$^{43}$ token N-grams, and SentencePiece$^{44}$ (SP; Kudo and Richardson, 2018) token N-grams. We use the N-gram range $\in$ [1; 4] and limit the TF-IDF vocabularies to top-250k features.

基线
我们将探测性能与随机猜测以及在以下 TF-IDF 特征 (Salton 和 Yang, 1973) 上训练的逻辑回归分类器进行比较:词 unigram、字符 N-gram、BPE (Byte Pair Encoding) token N-gram,以及 SentencePiece (SP; Kudo 和 Richardson, 2018) token N-gram。我们使用 N-gram 范围 ∈ [1; 4],并将 TF-IDF 词表限制为前 25 万个特征。
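作为参考,下面用 scikit-learn 给出字符与词 N-gram 两个 TF-IDF 基线的示意(BPE 和 SentencePiece token N-gram 基线做法类似,只需先用相应分词器把句子切成 token 序列);示例文本与标签均为占位。

```python
# TF-IDF 基线示意:N-gram 范围 [1, 4],词表上限 250k 特征
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = [
    "The scheme makes money through sponsorship and advertising",
    "Still, there are questions left unanswered",
]
labels = [0, 1]  # 占位标签,例如 单数/复数 主语

char_baseline = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(1, 4), max_features=250_000),
    LogisticRegression(max_iter=1000),
)
word_baseline = make_pipeline(
    TfidfVectorizer(analyzer="word", ngram_range=(1, 4), max_features=250_000),
    LogisticRegression(max_iter=1000),
)
char_baseline.fit(texts, labels)
word_baseline.fit(texts, labels)
```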

Correlation We run statistical tests to analyze correlations between the probing performance and linguistic, dataset, and model configuration criteria (see the code sketch after the list below):

相关性 我们运行统计检验,分析探测性能与语言学、数据集和模型配置标准之间的相关性(相应统计检验的调用示例见下方列表之后的代码示意):

• Language script: the results are divided into two groups by the language script – Latin and others (Devanagari, Tamil, and Arabic). Here, we use the non-parametric test Mann-Whitney U (Mann and Whitney, 1947).

• 语言脚本:结果按书写文字分为两组 – 拉丁文字与其他文字(天城文、泰米尔文、阿拉伯文)。这里,我们使用非参数检验 Mann-Whitney U (Mann 和 Whitney, 1947)。

• Language family: the results are divided into 7 groups by the language family. We apply the ANOVA to analyze the variance between the groups.

• 语系:结果按语系分为 7 组。我们应用方差分析 (ANOVA) 来分析组间的差异。

• Probing and pretraining dataset size: we run the Pearson correlation coefficient test (Pearson, 1895) to compute correlation between the probing performance and these data configuration criteria.

• 探针测试和预训练数据集大小:我们运行 Pearson 相关系数检验 (Pearson, 1895) 来计算探针测试性能与这些数据配置标准之间的相关性。

• Effect of the model size: the results are divided into two groups by the BLOOM version. Here, we use the Mann-Whitney U test to see if there is a correlation between the number of parameters and the probing results.

• 模型大小的影响:结果根据 BLOOM 版本分为两组。这里,我们使用 Mann-Whitney U 检验来查看参数数量与探测结果之间是否存在相关性。
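下面给出上述统计检验在 SciPy 中的调用示意(数据为占位的随机数,仅演示接口):

```python
# 统计检验调用示意:Mann-Whitney U、ANOVA、Pearson 相关(数据为占位)
import numpy as np
from scipy.stats import f_oneway, mannwhitneyu, pearsonr

rng = np.random.default_rng(0)

# 语言脚本:拉丁文字 vs 其他文字 的探针得分,Mann-Whitney U 检验
latin_scores = rng.uniform(0.4, 0.8, size=10)
other_scores = rng.uniform(0.4, 0.8, size=7)
print(mannwhitneyu(latin_scores, other_scores))

# 语系:对各语系分组做 ANOVA(此处仅示意 3 组)
print(f_oneway(rng.uniform(size=5), rng.uniform(size=5), rng.uniform(size=5)))

# 数据集规模:与探针得分求 Pearson 相关
sizes = rng.uniform(1e3, 1e6, size=17)
scores = rng.uniform(0.4, 0.8, size=17)
print(pearsonr(sizes, scores))
```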

| 语言 | BLOOM-1B7 | BLOOM | Random | TF-IDF (Char) | TF-IDF (Word) | TF-IDF (BPE) | TF-IDF (SP) |
|---|---|---|---|---|---|---|---|
| 阿拉伯语 | 0.66 ±0.27 | 0.64 ±0.27 | 0.49 ±0.013 | 0.41 ±0.44 | 0.4 ±0.44 | 0.41 ±0.44 | 0.41 ±0.44 |
| 班巴拉语 | 0.64 ±0.16 | 0.59 ±0.16 | 0.45 ±0.1 | 0.52 ±0.46 | 0.45 ±0.47 | 0.48 ±0.49 | 0.49 ±0.49 |
| 巴斯克语 | 0.68 ±0.19 | 0.62 ±0.19 | 0.49 ±0.03 | 0.41 ±0.43 | 0.44 ±0.46 | 0.48 ±0.44 | 0.41 ±0.46 |
| 孟加拉语 | 0.42 ±0.15 | 0.45 ±0.12 | 0.35 ±0.23 | 0.63 ±0.48 | 0.37 ±0.44 | 0.41 ±0.32 | 0.76 ±0.28 |
| 加泰罗尼亚语 | 0.65 ±0.25 | 0.61 ±0.26 | 0.34 ±0.01 | 0.24 ±0.38 | 0.24 ±0.39 | 0.24 ±0.39 | 0.24 ±0.39 |
| 中文 | 0.66 ±0.25 | 0.50 ±0.21 | 0.55 ±0.03 | 0.03 ±0.05 | 0.11 ±0.28 | 0.04 ±0.06 | 0.03 ±0.05 |
| 英语 | 0.57 ±0.24 | 0.57 ±0.24 | 0.43 ±0.03 | 0.45 ±0.43 | 0.46 ±0.43 | 0.45 ±0.43 | 0.44 ±0.44 |
| 法语 | 0.61 ±0.23 | 0.57 ±0.22 | 0.44 ±0.02 | 0.32 ±0.43 | 0.32 ±0.43 | 0.32 ±0.43 | 0.33 ±0.44 |
| 印地语 | 0.63 ±0.23 | 0.6 ±0.25 | 0.48 ±0.03 | 0.53 ±0.46 | 0.55 ±0.47 | 0.53 ±0.46 | 0.53 ±0.46 |
| 印尼语 | 0.65 ±0.27 | 0.6 ±0.27 | 0.48 ±0.05 | 0.41 ±0.46 | 0.43 ±0.45 | 0.41 ±0.46 | 0.45 ±0.45 |
| 马拉地语 | 0.57 ±0.25 | 0.48 ±0.24 | 0.32 ±0.09 | 0.44 ±0.47 | 0.46 ±0.46 | 0.44 ±0.47 | 0.44 ±0.47 |
| 葡萄牙语 | 0.67 ±0.23 | 0.63 ±0.26 | 0.4 ±0.03 | 0.48 ±0.48 | 0.49 ±0.48 | 0.48 ±0.48 | 0.48 ±0.48 |
| 西班牙语 | 0.66 ±0.24 | 0.65 ±0.24 | 0.42 ±0.02 | 0.35 ±0.42 | 0.35 ±0.44 | 0.36 ±0.44 | 0.36 ±0.43 |
| 泰米尔语 | 0.57 ±0.25 | 0.51 ±0.27 | 0.43 ±0.05 | 0.51 ±0.44 | 0.53 ±0.44 | 0.5 ±0.44 | 0.5 ±0.44 |
| 乌尔都语 | 0.75 ±0.21 | 0.70 ±0.24 | 0.43 ±0.02 | 0.39 ±0.48 | 0.39 ±0.47 | 0.39 ±0.48 | 0.39 ±0.48 |
| 沃洛夫语 | 0.51 ±0.32 | 0.47 ±0.32 | 0.41 ±0.02 | 0.26 ±0.39 | 0.25 ±0.39 | 0.3 ±0.43 | 0.27 ±0.39 |
| 约鲁巴语 | 0.48 ±0.07 | 0.36 ±0.07 | 0.43 ±0.06 | 0.33 ±0.45 | 0.09 ±0.05 | 0.16 ±0.11 | 0.09 ±0.05 |

Table 12: Probing performance ($F_{1}$ averaged by layers) of the BLOOM-based classifiers and count-based baselines. The results are averaged over probing tasks, and three experiment runs within each language. Standard deviation is determined by the results along the language tasks.

表 12: 基于 BLOOM 的分类器和基于计数的基线在探针任务中的表现 (按层平均的 $F_{1}$ )。结果是针对每种语言内的探针任务和三次实验运行取平均值。标准差由语言任务的结果确定。


Figure 12: Probing classifiers’ results by language and task category. White squares denote that the morphosyntactic category is not represented in the language.

图 12: 按语言和任务类别划分的探测分类器结果。白色方块表示该形态句法类别在该语言中未被表示。

4.9.2 Results

4.9.2 结果

Probing Table 12 presents the results of probing experiments averaged over the probing tasks and experiment runs within each language. The overall pattern is that BLOOM-1B7 performs on par or better than BLOOM, and both LLMs outperform the count-based baselines. In particular, the LLMs achieve more robust performance on Arabic, Basque, and Indo-European languages (e.g., Catalan, French, Hindi, Portuguese, Spanish, and Urdu), while Bengali, Wolof, and Yoruba receive the lowest scores. We attribute this behavior to the transfer abilities: BLOOM infers linguistic properties better for the closely related languages that comprise a significant amount of data. For example, the performance on any Romance language is better than in English, and the results in Indic languages are close to those in high-resource languages.

探测 表 12 展示了对每种语言内的探测任务和实验运行取平均后的探测实验结果。总体模式是 BLOOM-1B7 的表现与 BLOOM 相当或更好,且这两种大语言模型 (LLM) 均优于基于计数的基线模型。特别是,大语言模型在阿拉伯语、巴斯克语和印欧语系语言(例如加泰罗尼亚语、法语、印地语、葡萄牙语、西班牙语和乌尔都语)上表现更为稳健,而孟加拉语、沃洛夫语和约鲁巴语得分最低。我们将这种行为归因于迁移能力:对于在训练数据中占比较大且彼此密切相关的语言,BLOOM 能更好地推断其语言属性。例如,任何罗曼语系语言上的表现都优于英语,而印度语系语言上的结果接近高资源语言。

Table 13: Results of statistical tests and correlation analysis between probing performance and linguistic, dataset, and model configuration criteria.

表 13: 探测性能与语言、数据集和模型配置标准之间的统计检验和相关性分析结果。

| 标准 | 模型 | 检验 | p-值 |
|---|---|---|---|
| 语言脚本 | BLOOM / BLOOM-1B7 | Mann-Whitney U | 0.72 / 0.13 |
| 语系 | BLOOM / BLOOM-1B7 | ANOVA | <0.01 / <0.01 |
| 探测数据集大小 | BLOOM / BLOOM-1B7 | Pearson | 0.63 / 0.02 |
| 预训练数据集大小 | BLOOM / BLOOM-1B7 | Pearson | 0.46 / <0.01 |
| 版本之间的差异 | BLOOM & BLOOM-1B7 | Mann-Whitney U | <0.01 |

Figure 12 presents the language-wise probing performance results for morphosyntactic features represented in at least 5 languages. The probing performance of both LLMs is similar despite the difference in size. We find that the LLMs infer Mood and Person well with no regard for language. Number, NumType (numeral type), and Voice are moderately inferred in most languages. The models generally show worse quality in the other categories, indicating that they do not encode such morphological information. A possible explanation for this difference in performance is the diversity of possible values of these categories. For example, Mood and Person share similar values across the presented languages, while the set of Case values is highly dependent on the language.

图 12 展示了在至少 5 种语言中出现的形态句法特征按语言划分的探测性能结果。尽管两种大语言模型 (LLM) 的规模不同,但它们的探测性能相似。我们发现大语言模型在推断语气 (Mood) 和人称 (Person) 方面表现良好,不受语言影响。数 (Number)、数词类型 (NumType) 和语态 (Voice) 在大多数语言中被中等程度地推断出来。模型在其他类别中的表现通常较差,表明它们没有编码这些形态学信息。这种性能差异的一个可能解释是这些类别可能取值的多样性。例如,语气和人称在所展示的语言中具有相似的取值,而格 (Case) 的取值集合则高度依赖于语言。

Correlation The correlation analysis results support conclusions on the probing performance and reveal contributing factors (see Table 13). Both models show similar results on the languages with different language scripts. Results of BLOOM-1B7 are highly correlated with language family, probing dataset size, and pretraining dataset size. According to the results of the Mann-Whitney U test, BLOOM-1B7 shows significantly better results ($p<0.01$) than BLOOM. However, BLOOM shows more stable performance on different languages in spite of the amount of data it has seen during pretraining. This might indicate the better generalization abilities of the model with more parameters.

相关性 相关性分析结果支持了关于探测性能的结论,并揭示了影响因素(见表 13)。两个模型在使用不同书写文字的语言上表现出相似的结果。BLOOM-1B7 的结果与语系、探测数据集大小和预训练数据集大小高度相关。根据 Mann-Whitney U 检验的结果,BLOOM-1B7 显著优于 BLOOM (p < 0.01)。然而,无论预训练期间所见数据量多少,BLOOM 在不同语言上的表现都更为稳定。这可能表明参数更多的模型具有更好的泛化能力。

Discussion It should be noted that the following questions remain for further research:

讨论 应该注意以下问题仍有待进一步研究:

  1. Generalizing abilities. BLOOM-1B7 is leading in the average performance of morphosyntactic feature classification for the languages in Table 12. The BLOOM results are lower, which can be interpreted as a worse grammatical generalization over the aforementioned languages. However, BLOOM-1B7’s probing correlation results with factors like pretraining dataset size are more prominent, which makes it potentially less generalizing on the under-resourced languages than the bigger version.
     泛化能力。BLOOM-1B7 在表 12 中各语言的形态句法特征分类平均性能上处于领先地位。BLOOM 的结果较低,可以解释为其在上述语言上的语法泛化较差。然而,BLOOM-1B7 的探测结果与预训练数据集大小等因素的相关性更为显著,这使得它在资源不足的语言上的泛化能力可能不如更大的版本。
  2. Multilingual abilities. A separate research interest implies considering languages that are not explicitly included in the pretraining corpus of the models. Expanding the set of languages for probing will allow for a typological interpretation and a deeper analysis of the most learnable and hard-to-learn linguistic features on a more considerable scope.
     多语言能力。一个独立的研究兴趣意味着考虑未明确包含在模型预训练语料库中的语言。扩大探测语言的范围将允许进行类型学解释,并对更广泛范围内最容易学习和最难学习的语言特征进行更深入的分析。
  3. Under-resourced language evaluation. The under-resourced languages of the Indic and Niger-Congo families, included in the pretraining corpus in smaller shares, represent a separate subject for future probing. We also plan to investigate the results of high-resourced and under-resourced languages to reveal possible linguistic insights in these two groups.
     资源不足的语言评估。在预训练语料库中所占比例较小的印度语系和尼日尔-刚果语系的资源不足语言,为未来的探究提供了单独的主题。我们还计划研究资源充足和资源不足语言的结果,以揭示这两组语言中可能存在的语言学见解。
  4. Different layers and training dynamics. The analysis has focused on averaged representations of all layers and at the end of training. Analyzing different layers may reveal how morphosyntactic representations are built during processing. Similarly, investigating how properties are acquired over the course of pretraining (Choshen et al., 2022; Zhang et al., 2021; Voloshina et al., 2022) is a viable direction for research.
     不同层和训练动态。分析主要集中在所有层的平均表示以及训练结束时的情况。分析不同层可能会揭示形态句法表示在处理过程中是如何构建的。同样,研究属性在预训练过程中的获取方式 (Choshen et al., 2022; Zhang et al., 2021; Voloshina et al., 2022) 也是可行的研究方向。

4.10 Bias

4.10 偏差

As a preliminary study into the biases learned by BLOOM, we present evaluation on the multilingual CrowS-Pairs dataset, which combines a revised version of the CrowS-Pairs dataset developed by Nangia et al. (2020) together with the French version of CrowS-Pairs introduced by Névéol et al. (2022). One challenge of this evaluation was to adapt a dataset originally intended for masked language models to autoregressive language models such as BLOOM. CrowS-Pairs relies on minimal pairs to compare a stereotyped statement and a non-stereotyped statement (e.g. “Women can’t drive.” is a gender stereotype while “Men can’t drive” is not). The two statements differ only by the social category targeted by the stereotype, and that social category is present in the stereotyped statement and not in the non-stereotyped statement. The evaluation aims at assessing systematic preference of models for stereotyped statements. The original “metric score” compared the pseudo-log-likelihood of the sentences in a pair to determine which sentence received a higher score from a masked language model. Prompts were designed to require the model to select one of the statements based on the “likely” and “realistic” nature of the situations described.

作为对 BLOOM 学习到的偏见的初步研究,我们展示了在多语言 CrowS-Pairs 数据集上的评估结果,该数据集结合了 Nangia 等人 (2020) 开发的修订版 CrowS-Pairs 数据集以及 Névéol 等人 (2022) 引入的法语版 CrowS-Pairs 数据集。这一评估的一个挑战是将原本为掩码语言模型设计的数据集改编为适用于自回归语言模型(如 BLOOM)。CrowS-Pairs 依赖于最小对来比较刻板印象陈述和非刻板印象陈述(例如,“女性不能开车”是一种性别刻板印象,而“男性不能开车”则不是)。这两个陈述仅在刻板印象所针对的社会类别上有所不同,并且该社会类别仅出现在刻板印象陈述中,而非刻板印象陈述中没有。评估旨在衡量模型对刻板印象陈述的系统性偏好。原始的“度量分数”通过比较句子对中的伪对数似然性来确定哪个句子从掩码语言模型获得了更高的分数。提示被设计为要求模型根据描述情况的“可能”和“现实”性质选择其中一个陈述。
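下面的示意说明“用自回归语言模型为一对最小对立句打分并比较”的基本思路:分别计算刻板句与非刻板句在模型下的(近似)总对数似然,看模型是否系统性地偏好刻板句。注意,这只是一个简化做法;论文实际采用的是基于提示、让模型在两种情形中做选择的评测方式,模型名亦为占位。

```python
# 简化示意:比较一对句子在自回归语言模型下的对数似然(非论文的提示式评测)
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "bigscience/bloom-560m"  # 占位:任何自回归语言模型均可
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def sentence_logprob(sentence: str) -> float:
    ids = tok(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        # labels=input_ids 时返回逐 token 的平均负对数似然,乘以 token 数可近似整句总对数似然
        loss = model(ids, labels=ids).loss
    return -loss.item() * ids.size(1)

pair = ("Women can't drive.", "Men can't drive.")  # 刻板句 vs 非刻板句
scores = [sentence_logprob(s) for s in pair]
print("model prefers the stereotyped statement:", scores[0] > scores[1])
```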

Figure 13 shows that BLOOM’s overall prompt accuracy was close to .50, which suggests an overall absence of bias. We note that the scores in English and French are very close, suggesting similar overall behavior of the model on both languages. We also show results for monolingual autoregressive models — GPT-Neo (Black et al.) and GPT-FR (Simoulin and Crabbé, 2021) for English and French, respectively.

图 13 显示 BLOOM 的整体提示准确率接近 0.50,这表明总体上不存在偏差。我们注意到英语和法语的分数非常接近,表明该模型在这两种语言上的整体行为相似。我们还展示了单语自回归模型的结果:英语的 GPT-Neo (Black et al.) 和法语的 GPT-FR (Simoulin 和 Crabbé, 2021)。

Table 14 presents the results per bias type in the CrowS-Pairs dataset. The results are quite homogeneous over the categories, which contrasts with previous studies on masked language models, which suggested models were prone to bias in specific categories, which differed between models tested. Nonetheless, accuracy significantly differs from 50 (T-test, $p<.05$) overall for both languages, as well as for a number of bias categories, as shown by asterisks in the table.

表 14 按偏见类型展示了 CrowS-Pairs 数据集上的结果。各类别的结果相当均匀,这与之前关于掩码语言模型的研究形成对比:那些研究表明模型容易在特定类别中产生偏见,且不同被测模型的偏见类别各不相同。尽管如此,如表中星号所示,准确率在两种语言的总体水平上以及多个偏见类别上均与 50 存在显著差异 (T 检验, p<.05)。


Figure 13: Overall accuracy of BLOOM on crowS-Pairs per prompt for English and French. Results on the two smallest BLOOM models and monolingual GPT models of comparable size are also shown.

图 13: BLOOM 在 crowS-Pairs 上的总体准确率,按提示语分别针对英语和法语。还显示了两个最小的 BLOOM 模型和单语 GPT 模型(大小相当)的结果。



| 偏见类型 | 条目数 | 英语 | 法语 |
|---|---|---|---|
| 种族/肤色 | 460 | 50.05 | 50.48* |
| 性别 | 321 | 51.17* | 51.24* |
| 社会经济地位 | 196 | 51.05* | 52.22* |
| 国籍 | 253 | 49.25* | 48.49* |
| 宗教 | 115 | 53.82* | 53.01* |
| 年龄 | 90 | 49.35 | 50.13 |
| 性取向 | 91 | 50.00 | 49.9 |
| 外貌 | 72 | 48.20 | 49.67 |
| 残疾 | 66 | 48.49* | 49.16* |
| 其他 | 13 | 50.18 | 42.1* |
| 总计 | 1,677 | 49.78* | 50.61* |

Table 14: BLOOM accuracy results on crowS-Pairs bias categories averaged over eight runs for English and French. Significance for the one sample T-test $(p<.05)$ is indicated with *.

表 14: BLOOM 在 crowS-Pairs 偏见类别上的准确率结果,英语和法语的八次运行平均值。单样本 T 检验的显著性 (p<.05) 用 * 表示。


Limitations Blodgett et al. (2021) discuss validity issues with the original CrowS-Pairs corpus. The CrowS-Pairs version used here differs from the original by addressing some of the issues pointed out by Blodgett et al. (2021) and by constructing 200 additional sentence pairs based on stereotypes collected from French speakers. In a recent evaluation of bias in masked language models in English and French, results obtained on the revised dataset were not significantly different from those obtained on the original dataset (Névéol et al., 2022).

局限性 Blodgett 等 (2021) 讨论了原始 CrowS-Pairs 语料库的有效性问题。这里使用的 CrowS-Pairs 版本与原始版本不同:它解决了 Blodgett 等 (2021) 指出的部分问题,并基于从法语使用者收集的刻板印象构建了 200 个额外的句子对。在最近一项针对英语和法语掩码语言模型偏差的评估中,在修订后数据集上获得的结果与在原始数据集上获得的结果没有显著差异 (Névéol 等, 2022)。

However, its original validation does not naturally apply here, and comparison to other CrowS-Pairs results is more difficult. For a stronger assessment of bias, results obtained with CrowS-Pairs should be compared with other measures of bias, and also assessed for all languages in the model. However, as noted by Talat et al. (2022), very little material (corpora, measures) is available for multilingual bias assessment.

然而,其原始验证方法并不自然适用于此,与其他 CrowS-Pairs 结果的比较也更加困难。为了更有力地评估偏差,使用 CrowS-Pairs 获得的结果应与其他偏差度量进行比较,并且还应评估模型中所有语言的情况。但是,正如 Talat 等人 (2022) 所指出的,用于多语言偏差评估的材料(语料库、度量)非常少。

Although our examinations suggest a limited presence of bias in the model, they cannot cover the breadth of possible usage scenarios. One such scenario where models may have a larger impact is on linguistic diversity and language variation encountered. As the training resources for BLOOM are carefully curated, they may also capture some language variations to a larger degree than other models. This also impacts the ability of trained models to equitably represent different variations. Such differences can aid in the propagation and legitimization of some language variants over others. Our evaluation of biases in the model is further limited to the situations, languages and language variants that are covered by multilingual CrowS-Pairs. We therefore expect a distinction between our findings using CrowS-Pairs and wider model use (for a more detailed exploration of such differences, see Raji et al., 2021).

尽管我们的检查表明模型中存在的偏见有限,但这些检查无法涵盖所有可能的使用场景。其中一个模型可能产生更大影响的场景,是其所遇到的语言多样性及语言变体。由于 BLOOM 的训练资源经过精心筛选,它们可能比其他模型更大程度地捕捉到某些语言变体,这也会影响训练出的模型公平呈现不同变体的能力。这些差异可能有助于某些语言变体相对于其他变体得到传播与合法化。我们对模型偏见的评估还仅限于多语言 CrowS-Pairs 所涵盖的情形、语言和语言变体。因此,我们预计基于 CrowS-Pairs 的发现与更广泛的模型使用之间会存在差异(有关此类差异的更详细探讨,请参阅 Raji et al., 2021)。

5. Conclusion

5. 结论

In this work, we present BLOOM, a 176B-parameter open-access multilingual language model. BLOOM was created by BigScience, a collaboration of hundreds of researchers, and was trained on the French government-funded Jean Zay supercomputer for 3.5 months. In this paper, we chronicled the development of BLOOM, from the creation of its training dataset ROOTS to the design of its architecture and tokenizer. We also discuss evaluation results of BLOOM and other large language models, finding it has competitive performance that improves after multitask finetuning.

在本工作中,我们介绍了 BLOOM,一个 176B 参数的开放访问多语言大语言模型。BLOOM 由 BigScience 创建,这是一个数百名研究人员的合作项目,并在法国政府资助的 Jean Zay 超级计算机上训练了 3.5 个月。在本文中,我们记录了 BLOOM 的开发过程,从其训练数据集 ROOTS 的创建到其架构和分词器的设计。我们还讨论了 BLOOM 和其他大语言模型的评估结果,发现它具有竞争力的性能,并在多任务微调后有所提升。

We hope that the release of a powerful multilingual language model unlocks new applications and research directions for large language models. Further, we hope that documenting our experience will help the machine learning research community organize new large-scale collaborative projects similar to BigScience. Besides enabling results that are impossible for any individual research group to achieve, this form of organization will also allow more people with different backgrounds to share their ideas and participate in the development of major advances in the field.

我们希望强大的多语言大语言模型的发布能够解锁新的应用场景和研究方向。进一步地,我们希望记录我们的经验能够帮助机器学习研究社区组织类似 BigScience 的新大规模协作项目。除了使任何单个研究团队无法实现的结果成为可能之外,这种组织形式还将让具有不同背景的更多人分享他们的想法并参与领域内重大进展的开发。

6. Contributions

6. 贡献

Authors are assigned to each authorship category according to which aspects of the project they contributed to. Many authors appear under multiple categories because they contributed to the project in more than one way. Author order in all categories is alphabetical by first name, except for “Major Contributors” where authors are shuffled randomly apart from Teven Le Scao, who is intentionally listed first and “Organization” where Thomas Wolf is intentionally listed last. A description of each category follows. For finer-grained contribution details, please see the papers mentioned under each category.

根据每位作者对项目不同方面的贡献,将他们分配到相应的作者类别中。许多作者出现在多个类别中,因为他们以多种方式为项目做出了贡献。所有类别中的作者顺序均按名字首字母排序,但“主要贡献者”类别除外,该类别中除 Teven Le Scao 故意排在首位外,其他作者顺序随机排列;“组织”类别中 Thomas Wolf 故意排在最后。以下是每个类别的描述。如需了解更详细的贡献信息,请参阅每个类别下提到的论文。


Major Contributors lists individuals without whom BLOOM would not have happened and/or who spent more than 20% of their time on the BigScience effort as a whole.

主要贡献者名单列出了没有他们 BLOOM 就不会发生和/或花费超过 20% 的时间在整体 BigScience 项目上的个人。


Acknowledgments

致谢

The BigScience Workshop was granted access to the HPC resources of the Institut du développement et des ressources en informatique scientifique (IDRIS) du Centre national de la recherche scientifique (CNRS) under the allocation 2021-A0101012475 made by the Grand équipement national de calcul intensif (GENCI). Model training ran on the Jean-Zay supercomputer of GENCI at IDRIS, and we thank the IDR