
Building a Deep-Learning-Powered (Baby) Name Generator

By Dale Markowitz · February 6, 2020

What can Wikipedia biographies and Deep Neural Networks tell us about what’s in a name?

When I was young, I always hated being named Dale. This is mostly because my primary image of what Dales looked like was shaped by Dale Gribble from King of the Hill, and also Dale Earnhardt Jr., the NASCAR driver.

Image credits: Dale Gribble, Dale Earnhardt Jr.

Neither of these Dales fit my aspirational self-image. On the contrary, I wanted to be named Sailor Moon.

I did not like that my name was “androgynous” — 14 male Dales are born for every one female Dale. When I asked my parents about this, their rationale was:

A. Women with androgynous names are potentially more successful.

B. Their hipster friends just named their daughter Dale and it was just so cute!

To their credit, as an adult, I sure do feel I’ve benefited from pretending to be a man (or not outright denying it) on my resume, on GitHub, in my email signature, or even here on Medium.

But sexism aside, what if there really is something to nominative determinism — the idea that people tend to take on jobs or lifestyles that fit their names?¹ And if your name does have some impact on the life you lead, what a responsibility it must be to choose a name for a whole human person. I wouldn’t want to leave that responsibility to taste or chance or trends. Of course not — I’d turn to deep learning (duh!).

In this post, I’ll show you how I used machine learning to build a baby name generator (or predictor, more accurately) that takes a description of a (future) human and returns a name. For example:

My child will be born in New Jersey. She will grow up to be a software developer at Google who likes biking and coffee runs.

Given a bio, the model will return a set of names, sorted by probability:

Name: linda Score: 0.04895663261413574  
Name: kathleen Score: 0.0423438735306263  
Name: suzanne Score: 0.03537878766655922  
Name: catherine Score: 0.030525485053658485  
...

So in theory I should’ve been a Linda, but at this point, I’m really quite attached to Dale.

If you want to try this model out yourself, take a look here. Now you definitely shouldn’t put much weight into these predictions, because a. they’re biased and b. they’re about as scientific as a horoscope. But still — wouldn’t it be cool to have the first baby named by an AI?

The Dataset

Although I wanted to create a name generator, what I really ended up building was a name predictor. I figured I would find a bunch of descriptions of people (biographies), block out their names, and build a model that would predict what those (blocked out) names should be.

Happily, I found just that kind of dataset in a GitHub repo called wikipedia-biography-dataset by David Grangier. The dataset contains the first paragraph of 728,321 biographies from Wikipedia, as well as various metadata.

Naturally, there’s a selection bias when it comes to who gets a biography on Wikipedia (according to The Lily, only 15% of bios on Wikipedia are of women, and I assume the same could be said for non-white people). Plus, the names of people with biographies on Wikipedia will tend to skew older, since many more famous people were born over the past 500 years than over the past 30 years.

To account for this, and because I wanted my name generator to yield names that are popular today, I downloaded the census’s most popular baby names and cut down my Wikipedia dataset to only include people with census-popular names. I also only considered names for which I had at least 50 biographies. This left me with 764 names, majority male.

The most popular name in my dataset was “John,” which corresponded to 10,092 Wikipedia bios (shocker!), followed by William, David, James, George, and the rest of the biblical-male-name docket. The least popular names (that I still had 50 examples of) were Clark, Logan, Cedric, and a couple more, with 50 counts each. To account for this massive skew, I downsampled my dataset one more time, randomly selecting 100 biographies for each name.
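
The filtering and downsampling steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used for the post: the function name `filter_and_downsample`, the `(first_name, text)` input format, and the choice to keep every biography for names with fewer than 100 are all hypothetical.

```python
import random
from collections import defaultdict


def filter_and_downsample(bios, popular_names, min_count=50, per_name=100, seed=0):
    """Return a roughly balanced list of (name, text) pairs.

    bios: iterable of (first_name, text) pairs (assumed input format).
    popular_names: names taken from the census list.
    """
    popular = {name.lower() for name in popular_names}

    # Group biographies by lowercased first name, keeping census-popular names only.
    by_name = defaultdict(list)
    for first_name, text in bios:
        key = first_name.lower()
        if key in popular:
            by_name[key].append(text)

    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    sample = []
    for name, texts in sorted(by_name.items()):
        if len(texts) < min_count:
            continue  # drop names with too few biographies
        # Assumption: names with fewer than per_name bios keep all of them.
        chosen = rng.sample(texts, min(per_name, len(texts)))
        sample.extend((name, text) for text in chosen)
    return sample
```

With these defaults, a name like John is capped at 100 biographies, while a name with exactly 50 keeps all 50.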

Training a Model

Once I had my data sample, I decided to train a model that, given the text of the first paragraph of a Wikipedia biography, would predict the name of the person that bio was about.

If it’s been a while since you’ve read a Wikipedia biography, they usually start something like this:

Dale Alvin Gribble is a fictional character in the Fox animated series King of the Hill,[2] voiced by Johnny Hardwick (Stephen Root, who voices Bill, and actor Daniel Stern had both originally auditioned for the role). He is the creator of the revolutionary “Pocket Sand” defense mechanism, an exterminator, bounty hunter, owner of Daletech, chain smoker, gun fanatic, and paranoid believer of almost all conspiracy theories and urban legends.

Because I didn’t want my model to be able to “cheat,” I replaced all instances of the person’s first and last name with a blank: “___”. So the bio above becomes:

___ Alvin ___ is a fictional character in the Fox animated series…

This is the input data to my model, and its corresponding output label is “Dale.”
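
As a rough sketch of that masking step (the helper `mask_name` is hypothetical, and the post does not say how the replacement was actually done), a regular expression with word boundaries is one way to blank out the first and last name:

```python
import re


def mask_name(bio, first_name, last_name, blank="___"):
    """Replace whole-word occurrences of the first and last name with a blank."""
    pattern = re.compile(
        r"\b(?:%s|%s)\b" % (re.escape(first_name), re.escape(last_name)),
        flags=re.IGNORECASE,
    )
    return pattern.sub(blank, bio)


masked = mask_name(
    "Dale Alvin Gribble is a fictional character in the Fox animated series",
    "Dale",
    "Gribble",
)
# The label for this training example would be "Dale".
```

Because the pattern only matches whole words, a token such as “Daletech” is left intact, so a little name information can still leak through, much like the middle names do.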

Once I prepared my dataset, I set out to build a deep learning language model. There were lots of different ways I could have done this (for example, in TensorFlow), but I opted to use AutoML Natural Language, a code-free way to build deep neural networks that analyze text.

I uploaded my dataset into AutoML, which automatically split it into 36,497 training examples, 4,570 validation examples, and 4,570 test examples:

Even though I attempted to remove first and last names, a couple of middle names slipped in!

To train a model, I navigated to the “Train” tab and clicked “Start Training.” Around four hours later, training was done.

So how well did the name generator model do?

If you’ve built models before, you know the go-to metrics for evaluating quality are usually precision and recall (if you’re not familiar with these terms or need a refresher, check out this nice interactive demo my colleague Zack Akil built to explain them!). In this case, my model had a precision of 65.7% and a recall of 2%.
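
To see how precision can be high while recall is tiny, here is a minimal sketch. It assumes the model only “commits” to its top name when that name’s score clears a threshold; the function `precision_recall`, the 0.5 threshold, and the toy examples are illustrative, and AutoML’s exact computation may differ.

```python
def precision_recall(examples, threshold=0.5):
    """examples: list of (true_name, predicted_name, score) tuples."""
    # Only predictions whose score clears the threshold count as made.
    made = [(true, pred) for true, pred, score in examples if score >= threshold]
    correct = sum(1 for true, pred in made if true == pred)

    precision = correct / len(made) if made else 0.0        # correct / predictions made
    recall = correct / len(examples) if examples else 0.0   # correct / all examples
    return precision, recall


toy = [
    ("dale", "dale", 0.9),    # confident and right
    ("john", "linda", 0.8),   # confident and wrong
    ("john", "john", 0.2),    # right, but below the threshold
    ("mary", "mary", 0.1),    # right, but below the threshold
]
precision, recall = precision_recall(toy)
```

A model whose scores rarely clear the threshold makes few predictions, so it can be right most of the time when it does commit while still recovering only a small share of the labels.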

But for our name generator, these metrics aren’t really that telling, because the data is very noisy: there is no “right answer” to what a person should be named based on his or her life story. Names are largely arbitrary, which means no model can make really excellent predictions.

My goal was not to build a model that could predict a person’s name with 100% accuracy. I just wanted to build a model that understood something about names and how they work.

One way to dig deeper into what a model’s learned is to look at a table called a [confusion matrix](https://www.geeksforgeeks.org/confu