Abstract
Names carry rich connotations, and personal names usually show a strong gender distinction. We can often guess whether a stranger is male or female from the name alone, and the accuracy is high. This paper studies gender recognition from names. It begins with English names, looks for regularities step by step, and uses a variety of machine learning algorithms to search for a better model.
Introduction
As an important part of social groups, people's names have far-reaching significance in different fields. In recent years, many scholars have studied the automatic recognition of personal names in different kinds of text, such as named entity recognition in electronic medical records based on Conditional Random Fields (CRF), and even in ancient historical texts. Besides automatic recognition of personal names in general, some studies focus on multi-ethnic scripts such as Tibetan and Mongolian, and some target English and Hindi. Among these studies, automatically identifying gender from names is an active area, and previous work has covered Indian names, Tibetan names, and others. Approaches differ according to the structure of Chinese and Tibetan names. For example, Tibetan studies mainly focus on syllables. Chinese names have rich cultural connotations and profound meaning, embodying thousands of years of cultural accumulation and the wisdom and spirit of the nation. A person's name usually has a certain meaning, and names usually show a strong gender distinction. We can often guess from a stranger's name whether they are male or female, and the accuracy is usually high. This paper focuses on gender recognition from names. It starts from English gender recognition, proceeds step by step, and draws a conclusion by comparing various methods. The second part covers related work, and the third part introduces the models used in the experiment. The fourth part describes the experimental process and analyzes the results. The last part draws a conclusion, summarizes the pros and cons, and outlines future work.
Related Work
The English name data in this paper comes from the NLTK (Natural Language Toolkit) library and contains 7944 entries. NLTK is one of the most popular and widely used libraries in the field of Natural Language Processing (NLP).
Naive Bayes Classifier
The Naive Bayes (NB) classifier is a generative model. Whether to choose a generative or a discriminative model depends mainly on whether the joint distribution is required. If the conditional independence assumption (a strict condition) holds, the Naive Bayes classifier converges faster than a discriminative model such as logistic regression, so it needs less training data and is not sensitive to missing data. Even when the conditional independence assumption does not hold, the NB classifier still performs well in practice. Its main drawback is that it cannot learn interactions between features.
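As a rough illustration of the classifier described above, the following is a minimal Naive Bayes sketch over categorical features. The toy names and labels, the `last_letter` feature, the function names, and the add-one smoothing are all assumptions made for this example. They are not the paper's data or implementation.

```python
import math
from collections import Counter, defaultdict

def train_nb(samples):
    """samples: list of (feature_dict, label) pairs."""
    label_counts = Counter()
    feature_counts = defaultdict(Counter)  # (label, feature name) -> value counts
    values = defaultdict(set)              # feature name -> values seen in training
    for feats, label in samples:
        label_counts[label] += 1
        for name, value in feats.items():
            feature_counts[(label, name)][value] += 1
            values[name].add(value)
    return label_counts, feature_counts, values

def predict_nb(model, feats):
    """Return the label with the highest log posterior under the NB assumption."""
    label_counts, feature_counts, values = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in label_counts.items():
        score = math.log(count / total)  # log prior
        for name, value in feats.items():
            seen = feature_counts[(label, name)][value]
            # Add-one smoothing; the extra +1 reserves mass for unseen values.
            score += math.log((seen + 1) / (count + len(values[name]) + 1))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy data, for illustration only.
toy = [("Anna", "female"), ("Maria", "female"), ("Julia", "female"),
       ("John", "male"), ("Peter", "male"), ("Mark", "male")]
model = train_nb([({"last_letter": n[-1].lower()}, g) for n, g in toy])
print(predict_nb(model, {"last_letter": "a"}))
```

Each feature contributes an independent log-probability term to the score, which is exactly why the model cannot capture interactions between features.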
Maximum Entropy Classifier
To estimate the state of a random variable accurately, we usually maximize entropy: the maximum entropy principle holds that, among all probability models consistent with what is known, the one with the largest entropy is the best. In other words, given the known knowledge, the most reasonable inference about an unknown distribution is the most uncertain, or most random, one that still agrees with that knowledge. The principle is to acknowledge what is known and to make no assumptions and hold no prejudices about what is unknown. For example, if you are asked for the probability of each face when a die is thrown, you will say that every face is equally likely, with probability 1/6. Because nothing is known about this die, assuming equal probabilities is the most reasonable choice.
From the perspective of information theory, the greatest uncertainty is retained, that is to say, the maximum entropy. Therefore, the maximum entropy principle can also be expressed as selecting the model with maximum entropy from the set of models satisfying the constraints. The formula is as follows:

$$p^{*} = \arg\max_{p \in P} H(p), \qquad H(p) = -\sum_{x \in X} p(x) \log p(x)$$

where P = {p | p is a probability distribution on X that satisfies the constraints}. In a feature (x, y), y is the information that needs to be determined and x is the context information.
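The die example can be checked numerically. The sketch below computes Shannon entropy in bits and compares a uniform die with a loaded one. The `loaded` distribution is a made-up example, not something taken from the paper.

```python
import math

def entropy(p):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(x * math.log2(x) for x in p if x > 0)

uniform = [1 / 6] * 6                    # no knowledge beyond "six faces"
loaded = [0.5, 0.1, 0.1, 0.1, 0.1, 0.1]  # hypothetical biased die

print(entropy(uniform))  # equals log2(6), the maximum for six outcomes
print(entropy(loaded))   # strictly smaller than the uniform case
```

With no constraint other than that the probabilities sum to one, the uniform distribution attains the largest entropy, which matches the intuition of assigning 1/6 to each face.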
Experiments and Results Analysis
English names have a distinctive structure. Each word in a name is independent, and an English given name is a single word, so the word itself is the unit of study. Words are made up of letters, and the position of each letter has a great influence on the word. Vowels and consonants affect words differently, and their frequency of use differs between male and female names. We can start from the last letter, then gradually expand from the last two letters to the last five, look for regularities, and use accuracy to determine which feature works best. A full name is made up of several words. Many studies consider both the word and its position, but this article takes into account the frequency with which words are used in men's or women's names.
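The suffix-based features described above could be extracted along the following lines. This is a minimal sketch: the function name, the feature keys, and the `ends_in_vowel` flag are illustrative choices, not the paper's exact feature set.

```python
def suffix_features(name, max_len=5):
    """Build features from the last 1..max_len letters of a name."""
    name = name.strip().lower()
    if not name:
        return {}
    # Vowel/consonant ending, since the two behave differently in names.
    feats = {"ends_in_vowel": name[-1] in "aeiou"}
    for k in range(1, max_len + 1):
        if len(name) >= k:  # skip suffixes longer than the name itself
            feats[f"suffix_{k}"] = name[-k:]
    return feats

print(suffix_features("Anna", max_len=3))
```

Varying `max_len` from 1 to 5 and comparing the resulting accuracies is one way to find which suffix length works best.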
Because the population is huge, the same words are very likely to appear in many different names, and such words cannot be used to distinguish gender. In the gender recognition model for English names, names are very complex, so choosing the right data is very important. Surnames and given names combine in many ways, giving names of two, three, four, or more words: a single surname plus one word, a single surname plus two words, a compound surname plus one word, a compound surname plus two words, the father's surname plus the mother's surname plus one word, the father's surname plus the mother's surname plus two words, and so on. When extracting features, we take the last word of a two-word name and the last two words of a three- or four-word name. In addition, a neutral name list and a surname list should be established. Neutral names are used frequently by both men and women, so we need to build a separate model for them. When we encounter a neutral word, we make no gender judgment from it and rely on the other word instead. If both words are neutral, the name is listed separately and the neutral-name model is used for training and validation.
That is to say, we count the frequency of neutral words, judge whether these names are more likely to belong to men or women, and tag them accordingly. First, we extract features according to the method above and divide the data into a training set and a test set. Then four classifier models are trained on the data. Finally, we compute the accuracy and compare the results. Accuracy is defined as the ratio of the number of samples correctly classified by the classifier to the total number of samples in a given test set:

$$\text{Accuracy} = \frac{\text{number of correctly classified samples}}{\text{total number of samples}}$$

Overall, the gender recognition model for Chinese names still needs some improvement to meet the expected goals of the experiment. Multiple algorithms can be combined into an ensemble.
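The evaluation procedure described above (split the data, classify, measure accuracy) can be sketched as follows. The toy names, the 80/20 split, the fixed seed, and the simple `last_letter_rule` stand-in classifier are assumptions for illustration only. The paper does not specify its split or classifiers at this level of detail.

```python
import random

def split(samples, test_ratio=0.2, seed=0):
    """Shuffle a copy of the samples and split into (train, test)."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]

def accuracy(predict, test_set):
    """Correctly classified samples divided by total samples."""
    correct = sum(1 for x, label in test_set if predict(x) == label)
    return correct / len(test_set)

def last_letter_rule(name):
    """Hypothetical stand-in classifier: names ending in 'a' are female."""
    return "female" if name.lower().endswith("a") else "male"

# Hypothetical toy data, for illustration only.
data = [("Anna", "female"), ("Maria", "female"), ("Julia", "female"),
        ("Emma", "female"), ("Alice", "female"), ("John", "male"),
        ("Peter", "male"), ("Mark", "male"), ("Luke", "male"),
        ("Joshua", "male")]

train, test = split(data)
print(len(train), len(test))
print(accuracy(last_letter_rule, data))
```

Any trained classifier can be passed in place of `last_letter_rule`, so the same `accuracy` function can compare several models on one test set.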
Conclusion
In this study, we found that English names follow certain regularities. As a symbol, a person's name should be highly recognizable. Daily experience and naming habits show that names are often related to gender. The different algorithms we used produced different results. Because English names are composed of letters, we obtain the best results when we extract the last three letters as features. At present, named entity recognition is still a difficult and active problem in information processing. The processing of unlisted words often fails to meet practical needs, and whether a person's name appears is a key factor in the accuracy of unlisted word recognition. In the gender identification of names, Support Vector classifiers and Naive Bayes classifiers work well. In the next step, we will select more precise data, build a more complete stop word list, and add deep learning methods as improvements. Many name-giving programs are available on the Internet, and automatic gender recognition can help them judge whether a name is suitable for men or women.