Building a Deep-Learning-Powered (Baby) Name Generator
By Dale Markowitz · February 6, 2020
What can Wikipedia biographies and Deep Neural Networks tell us about what’s in a name?
When I was young, I always hated being named Dale. This is mostly because my primary image of what Dales looked like was shaped by Dale Gribble from King of the Hill, and also Dale Earnhardt Jr., the NASCAR driver.
Neither of these Dales fit my aspirational self-image. On the contrary, I wanted to be named Sailor Moon.
I did not like that my name was “androgynous” — 14 male Dales are born for every one female Dale. When I asked my parents about this, their rationale was:
A. Women with androgynous names are potentially more successful.
B. Their hipster friends had just named their daughter Dale, and it was so cute!
To their credit, as an adult, I sure do feel I’ve benefited from pretending to be a man (or at least not outright denying it) on my resume, on GitHub, in my email signature, or even here on Medium.
But sexism aside, what if there really is something to nominative determinism — the idea that people tend to take on jobs or lifestyles that fit their names?¹ And if your name does have some impact on the life you lead, what a responsibility it must be to choose a name for a whole human person. I wouldn’t want to leave that responsibility to taste or chance or trends. Of course not — I’d turn to deep learning (duh!).
In this post, I’ll show you how I used machine learning to build a baby name generator (or predictor, more accurately) that takes a description of a (future) human and returns a name, i.e.:
My child will be born in New Jersey. She will grow up to be a software developer at Google who likes biking and coffee runs.
Given a bio, the model will return a set of names, sorted by probability:
Name: linda Score: 0.04895663261413574
Name: kathleen Score: 0.0423438735306263
Name: suzanne Score: 0.03537878766655922
Name: catherine Score: 0.030525485053658485
...
So in theory I should’ve been a Linda, but at this point, I’m really quite attached to Dale.
If you want to try this model out yourself, take a look here. Now you definitely shouldn’t put much weight into these predictions, because a. they’re biased and b. they’re about as scientific as a horoscope. But still — wouldn’t it be cool to have the first baby named by an AI?
The Dataset
Although I wanted to create a name generator, what I really ended up building was a name predictor. I figured I would find a bunch of descriptions of people (biographies), block out their names, and build a model that would predict what those (blocked out) names should be.
Happily, I found just that kind of dataset here, in a GitHub repo called wikipedia-biography-dataset by David Grangier. The dataset contains the first paragraph of 728,321 biographies from Wikipedia, as well as various metadata.
Naturally, there’s a selection bias when it comes to who gets a biography on Wikipedia (according to The Lily, only 15% of bios on Wikipedia are of women, and I assume the same could be said for non-white people). Plus, the names of people with biographies on Wikipedia will tend to skew older, since many more famous people were born over the past 500 years than over the past 30 years.
To account for this, and because I wanted my name generator to yield names that are popular today, I downloaded the census’s most popular baby names and cut down my Wikipedia dataset to include only people with census-popular names. I also considered only names for which I had at least 50 biographies. This left me with 764 names, the majority of them male.
The most popular name in my dataset was “John,” which corresponded to 10,092 Wikipedia bios (shocker!), followed by William, David, James, George, and the rest of the biblical-male-name docket. The least popular names that still cleared my 50-bio cutoff, like Clark, Logan, and Cedric, had exactly 50 bios each. To account for this massive skew, I downsampled my dataset one more time, randomly selecting at most 100 biographies for each name.
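If you’re curious what that filtering-and-downsampling step could look like, here’s a minimal Python sketch. The data layout, function name, and constants are my assumptions for illustration, not the repo’s actual format:

```python
import random
from collections import defaultdict

MIN_BIOS = 50   # keep only names with at least 50 biographies
MAX_BIOS = 100  # downsample so no single name (e.g. John) dominates

def filter_and_downsample(bios, census_names):
    """bios: iterable of (first_name, bio_text) pairs;
    census_names: set of popular lowercase first names."""
    by_name = defaultdict(list)
    for name, text in bios:
        if name.lower() in census_names:
            by_name[name.lower()].append(text)
    # Drop rare names, then randomly keep at most MAX_BIOS bios per name.
    return {
        name: random.sample(texts, min(len(texts), MAX_BIOS))
        for name, texts in by_name.items()
        if len(texts) >= MIN_BIOS
    }
```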
Training a Model
Once I had my data sample, I decided to train a model that, given the text of the first paragraph of a Wikipedia biography, would predict the name of the person that bio was about.
If it’s been a while since you’ve read a Wikipedia biography, they usually start something like this:
Dale Alvin Gribble is a fictional character in the Fox animated series King of the Hill, voiced by Johnny Hardwick (Stephen Root, who voices Bill, and actor Daniel Stern had both originally auditioned for the role). He is the creator of the revolutionary “Pocket Sand” defense mechanism, an exterminator, bounty hunter, owner of Daletech, chain smoker, gun fanatic, and paranoid believer of almost all conspiracy theories and urban legends.
Because I didn’t want my model to be able to “cheat,” I replaced all instances of the person’s first and last name with a blank: “___”. So the bio above becomes:
___ Alvin ___ is a fictional character in the Fox animated series…
This is the input data to my model, and its corresponding output label is “Dale.”
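Here’s a quick sketch of what that masking can look like, assuming the dataset’s metadata supplies each subject’s first and last name (the regex handling here is illustrative):

```python
import re

def mask_name(bio_text, first_name, last_name):
    """Blank out the subject's first and last name so the model
    can't simply read the answer off the page."""
    for part in (first_name, last_name):
        # \b word boundaries avoid clobbering substrings of other words.
        bio_text = re.sub(rf"\b{re.escape(part)}\b", "___", bio_text)
    return bio_text

print(mask_name("Dale Alvin Gribble is a fictional character", "Dale", "Gribble"))
# ___ Alvin ___ is a fictional character
```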
Once I prepared my dataset, I set out to build a deep learning language model. There were lots of different ways I could have done this (here’s one example in TensorFlow), but I opted to use AutoML Natural Language, a code-free way to build deep neural networks that analyze text.
I uploaded my dataset into AutoML, which automatically split it into 36,497 training examples, 4,570 validation examples, and 4,570 test examples:
Even though I attempted to remove first and last names, a couple of middle names slipped in! To train a model, I navigated to the “Train” tab and clicked “Start Training.” Around four hours later, training was done.
So how well did the name generator model do?
If you’ve built models before, you know the go-to metrics for evaluating quality are usually precision and recall (if you’re not familiar with these terms or need a refresher, check out this nice interactive demo my colleague Zack Akil built to explain them!). In this case, my model had a precision of 65.7% and a recall of 2%.
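If you want to compute those metrics yourself, scikit-learn makes it a couple of lines. This is a generic multi-class sketch with made-up toy labels, not AutoML’s internal evaluation code:

```python
from sklearn.metrics import precision_score, recall_score

# Toy true-vs-predicted names (made up for illustration).
y_true = ["ahmad", "ahmad", "alec", "linda", "linda"]
y_pred = ["ahmad", "ahmed", "alexander", "linda", "kathleen"]

# Macro averaging weights every name class equally, which matters
# when some names have many more bios than others.
p = precision_score(y_true, y_pred, average="macro", zero_division=0)
r = recall_score(y_true, y_pred, average="macro", zero_division=0)
print(f"precision={p:.2f} recall={r:.2f}")
```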
But in the case of our name generator model, these metrics aren’t really that telling, because the data is very noisy: there is no “right answer” to what a person should be named based on his or her life story. Names are largely arbitrary, which means no model can make really excellent predictions.
My goal was not to build a model that could predict a person’s name with 100% accuracy. I just wanted to build a model that understood something about names and how they work.
One way to dig deeper into what a model’s learned is to look at a table called a confusion matrix, which indicates what types of errors a model makes. It’s a useful way to debug or do a quick sanity check.
In the “Evaluate” tab, AutoML provides a confusion matrix. Here’s a tiny corner of it (cut off because I had sooo many names in the dataset):
In this table, the row headers are the True labels and the column headers are the Predicted labels. The rows indicate what a person’s name actually was, and the columns indicate what the model predicted it was.

So, for example, take a look at the row labeled “ahmad.” You’ll see a light blue box labeled “13%.” This means that, of all the bios of people named Ahmad in our dataset, 13% were correctly labeled “ahmad” by the model. Meanwhile, looking one box to the right, 25% of bios of people named Ahmad were (incorrectly) labeled “ahmed.” Another 13% of people named Ahmad were incorrectly labeled “alec.”
Although these are technically incorrect labels, they tell me that the model has probably learned something about naming, because “ahmed” is very close to “ahmad.” Same thing for people named Alec. The model labeled Alecs as “alexander” 25% of the time, but by my read, “alec” and “alexander” are awfully close names.
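If you’d rather build that kind of confusion matrix outside AutoML, scikit-learn can produce the same row-normalized view. Again, the labels below are made up for illustration:

```python
from sklearn.metrics import confusion_matrix

y_true = ["ahmad", "ahmad", "ahmad", "alec", "alec"]
y_pred = ["ahmad", "ahmed", "alec", "alexander", "alec"]
names = sorted(set(y_true) | set(y_pred))

# normalize="true" gives row percentages: rows are true names,
# columns are predicted names, matching AutoML's layout.
cm = confusion_matrix(y_true, y_pred, labels=names, normalize="true")
for true_name, row in zip(names, cm):
    print(true_name, {n: round(v, 2) for n, v in zip(names, row)})
```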
Running Sanity Checks
Next I decided to see if my model understood basic statistical rules about naming. For example, if I described someone as a “she,” would the model predict a female name, versus a male name for “he”?
For the sentence “She likes to eat,” the top predicted names were “Frances,” “Dorothy,” and “Nina,” followed by a handful of other female names. Seems like a good sign.
For the sentence “He likes to eat,” the top names were “Gilbert,” “Eugene,” and “Elmer.” So it seems the model understands some concept of gender.
Next, I thought I’d test whether it was able to understand how geography played into names. Here are some sentences I tested and the model’s predictions:
“He was born in New Jersey” — Gilbert
“She was born in New Jersey” — Frances
“He was born in Mexico.” — Armando
“She was born in Mexico” — Irene
“He was born in France.” — Gilbert
“She was born in France.” — Edith
“He was born in Japan” — Gilbert
“She was born in Japan” — Frances
I was pretty unimpressed with the model’s ability to understand regionally popular names. The model seemed especially bad at understanding which names are popular in Asian countries, and in those cases tended to just return the same small set of names (e.g., Gilbert and Frances). This tells me I didn’t have enough global variety in my training dataset.
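For what it’s worth, these probes are easy to script. Here’s a tiny harness sketch; run_probes and fake_predict are my own placeholder names, since the real prediction call depends on how you deploy the model (e.g., an AutoML prediction endpoint). The same harness works for the fairness probes below:

```python
from typing import Callable, List, Tuple

# A predictor maps a bio sentence to (name, score) pairs, best first.
Predictor = Callable[[str], List[Tuple[str, float]]]

def run_probes(predict: Predictor, probes: List[str]) -> None:
    """Print the top predicted name for each probe sentence."""
    for probe in probes:
        top_name, score = predict(probe)[0]
        print(f"{probe!r} -> {top_name} ({score:.3f})")

# Stand-in predictor for illustration; in practice this would wrap
# whatever serves the trained model.
fake_predict: Predictor = lambda text: [("gilbert", 0.12), ("frances", 0.09)]

run_probes(fake_predict, [
    "He was born in New Jersey",
    "She was born in New Jersey",
    "He was born in Mexico.",
    "She was born in Mexico",
])
```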
Model Bias
Finally, I thought I’d test for one last thing. If you’ve read at all about model fairness, you might have heard that it’s easy to accidentally build a biased model (racist, sexist, ageist, and so on), especially if your training dataset isn’t reflective of the population you’re building that model for. I mentioned before that there’s a skew in who gets a biography on Wikipedia, so I already expected to have more men than women in my dataset.
I also expected that this model, reflecting the data it was trained on, would have learned gender bias — that computer programmers are male and nurses are female. Let’s see if I’m right:
“They will be a computer programmer.” — Joseph
“They will be a nurse.” — Frances
“They will be a doctor.” — Albert
“They will be an astronaut.” — Raymond
“They will be a novelist.” — Robert
“They will be a parent.” — Jose
“They will be a model.” — Betty
Well, it seems the model did learn traditional gender roles when it comes to profession, the only surprise (to me, at least) being that “parent” was predicted to have a male name (“Jose”) rather than a female one.
So evidently this model has learned something about the way people are named, but not exactly what I’d hoped it would. Guess I’m back to square one when it comes to choosing a name for my future progeny…Dale Jr.?
¹ Probably there isn’t, and this is about as scientific as horoscopes. But still, fun to think about!