SoFunction
Updated on 2024-11-13

Python Machine Learning NLP: Basic Operations of Named Entity Recognition

Overview

Today we begin our journey into Natural Language Processing (NLP). Natural Language Processing allows us to process, understand, and utilize human language, bridging the gap between machine language and human language.

Named entities

A named entity is an entity with a specific meaning in an NLP task, such as a person's name, a place name, an organization name, or another proper noun. For example:

  • Luke Rawlence is a person
  • Aiimi and University of Lincoln are organizations
  • Milton Keynes is a place

HMM

A Hidden Markov Model (HMM) describes a Markov process whose parameters are hidden, that is, not directly observed.
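As an illustration of how hidden states can be recovered from observations, here is a minimal sketch of Viterbi decoding for a toy HMM. The two states, the two observation types, and all probabilities are made-up values for illustration, and `viterbi` is a hypothetical helper name rather than a function from any library.

```python
import numpy as np

def viterbi(obs, start_p, trans_p, emit_p):
    """Return the most likely hidden-state path for an observation sequence."""
    n_states = trans_p.shape[0]
    T = len(obs)
    # Work in log space to avoid underflow on long sequences.
    log_delta = np.log(start_p) + np.log(emit_p[:, obs[0]])
    backptr = np.zeros((T, n_states), dtype=int)
    for t in range(1, T):
        # scores[i, j]: best log-probability of being in state i at t-1, then moving to j
        scores = log_delta[:, None] + np.log(trans_p)
        backptr[t] = scores.argmax(axis=0)
        log_delta = scores.max(axis=0) + np.log(emit_p[:, obs[t]])
    # Follow the back-pointers from the best final state.
    path = [int(log_delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(backptr[t, path[-1]]))
    return path[::-1]

# Toy model (assumed values): hidden states 0 = outside, 1 = inside an entity;
# observations 0 = lowercase word, 1 = capitalized word.
start_p = np.array([0.8, 0.2])
trans_p = np.array([[0.7, 0.3],
                    [0.4, 0.6]])
emit_p = np.array([[0.9, 0.1],
                   [0.2, 0.8]])

path = viterbi([0, 1, 1, 0], start_p, trans_p, emit_p)
print(path)  # [0, 1, 1, 0]
```

The hidden states are never observed directly; only the observation sequence is given, and the path is inferred from the start, transition, and emission probabilities.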

Random field

A random field consists of two elements: sites and a phase space. When each site is assigned a random value from the phase space according to some distribution, the whole is called a random field. For example, a site is like a plot of farmland, and the phase space is like the set of crops that can be planted. Planting different crops in different plots is like assigning a value from the phase space to each site of the random field. The random field is then the pattern of crops across all the plots.

Markov random field

A Markov random field is a special kind of random field. If the crop planted in any plot depends only on the crops in its neighboring plots, and not on plots further away, then the collection of plots is a Markov random field.
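The neighbor-only dependence can be sketched on a small grid. The example below assumes a Potts-style potential, in which each neighbor holding the same crop adds `beta` to a value's score. That potential, the 3x3 grid, and the helper names `neighbors` and `conditional` are illustrative assumptions, not something the article specifies.

```python
import numpy as np

def neighbors(grid, r, c):
    """Values of the up/down/left/right neighbors of site (r, c)."""
    rows, cols = grid.shape
    out = []
    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        rr, cc = r + dr, c + dc
        if 0 <= rr < rows and 0 <= cc < cols:
            out.append(int(grid[rr, cc]))
    return out

def conditional(grid, r, c, n_values, beta=1.0):
    """P(value at (r, c) | neighbors) under an assumed Potts-style potential."""
    nb = neighbors(grid, r, c)
    # Each neighbor with the same value adds `beta` to that value's score.
    scores = np.array([beta * sum(v == n for n in nb) for v in range(n_values)])
    weights = np.exp(scores)
    return weights / weights.sum()

# 3x3 farm with crops coded 0, 1, 2 (hypothetical labels).
farm = np.array([[0, 1, 2],
                 [1, 0, 1],
                 [2, 1, 0]])
p = conditional(farm, 1, 1, n_values=3)

# Changing a plot that is not a neighbor of (1, 1) leaves the distribution unchanged.
farm_far = farm.copy()
farm_far[0, 0] = 2
p_far = conditional(farm_far, 1, 1, n_values=3)
print(p, np.allclose(p, p_far))
```

Because all four neighbors of the center plot hold crop 1, that crop gets the highest conditional probability, and the corner plot has no influence on it.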

CRF

A Conditional Random Field (CRF) is a Markov random field over a random variable Y, conditioned on a random variable X. It models the conditional probability of one set of variables given another set of variables, and is often used in sequence labeling problems.

The formula is as follows.
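For reference, one common way to write the linear-chain CRF used in sequence labeling is given below. The notation is chosen here and may differ from the form the article originally displayed: $s_t(y_t)$ is the score of label $y_t$ at position $t$ computed from the input $x$, and $A_{i,j}$ is the score of moving from label $i$ to label $j$, which corresponds to the `transitions` weight in the code further down.

```latex
P(y \mid x) = \frac{1}{Z(x)} \exp\left( \sum_{t=1}^{T} s_t(y_t) + \sum_{t=2}^{T} A_{y_{t-1},\, y_t} \right),
\qquad
Z(x) = \sum_{y'} \exp\left( \sum_{t=1}^{T} s_t(y'_t) + \sum_{t=2}^{T} A_{y'_{t-1},\, y'_t} \right)
```

Here $Z(x)$ sums over every possible label sequence $y'$, so the probabilities of all label sequences for a given input add up to 1.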

Named entity recognition in practice

Dataset

The dataset we will use is a medical named-entity dataset. Each line holds one character and its label, separated by a tab.

crf

import tensorflow as tf
import tensorflow.keras.backend as K
import tensorflow.keras.layers as L
from tensorflow_addons.text import crf_log_likelihood, crf_decode


class CRF(L.Layer):
    def __init__(self,
                 output_dim,
                 sparse_target=True,
                 **kwargs):
        """
        Args:
            output_dim (int): the number of labels to tag each temporal input.
            sparse_target (bool): whether the ground-truth label is represented in one-hot.
        Input shape:
            (batch_size, sentence length, output_dim)
        Output shape:
            (batch_size, sentence length, output_dim)
        """
        super(CRF, self).__init__(**kwargs)
        self.output_dim = int(output_dim)
        self.sparse_target = sparse_target
        self.input_spec = L.InputSpec(min_ndim=3)
        self.supports_masking = False
        self.sequence_lengths = None
        self.transitions = None  # transition score matrix, created in build()

    def build(self, input_shape):
        assert len(input_shape) == 3
        f_shape = tf.TensorShape(input_shape)
        input_spec = L.InputSpec(min_ndim=3, axes={-1: f_shape[-1]})

        if f_shape[-1] is None:
            raise ValueError('The last dimension of the inputs to `CRF` '
                             'should be defined. Found `None`.')
        if f_shape[-1] != self.output_dim:
            raise ValueError('The last dimension of the input shape must be equal to output'
                             ' shape. Use a linear layer if needed.')
        self.input_spec = input_spec
        self.transitions = self.add_weight(name='transitions',
                                           shape=[self.output_dim, self.output_dim],
                                           initializer='glorot_uniform',
                                           trainable=True)
        self.built = True

    def compute_mask(self, inputs, mask=None):
        # Just pass the received mask from previous layer, to the next layer or
        # manipulate it if this layer changes the shape of the input
        return mask

    def call(self, inputs, sequence_lengths=None, training=None, **kwargs):
        sequences = tf.convert_to_tensor(inputs, dtype=self.dtype)
        if sequence_lengths is not None:
            assert len(sequence_lengths.shape) == 2
            assert tf.convert_to_tensor(sequence_lengths).dtype == 'int32'
            seq_len_shape = tf.convert_to_tensor(sequence_lengths).get_shape().as_list()
            assert seq_len_shape[1] == 1
            self.sequence_lengths = K.flatten(sequence_lengths)
        else:
            # No lengths given: treat every sequence as full length
            self.sequence_lengths = tf.ones(tf.shape(inputs)[0], dtype=tf.int32) * (
                tf.shape(inputs)[1]
            )

        # Viterbi decoding with the learned transition matrix
        viterbi_sequence, _ = crf_decode(sequences,
                                         self.transitions,
                                         self.sequence_lengths)
        output = K.one_hot(viterbi_sequence, self.output_dim)
        return K.in_train_phase(sequences, output)

    @property
    def loss(self):
        def crf_loss(y_true, y_pred):
            y_pred = tf.convert_to_tensor(y_pred, dtype=self.dtype)
            log_likelihood, self.transitions = crf_log_likelihood(
                y_pred,
                tf.cast(K.argmax(y_true), dtype=tf.int32) if self.sparse_target else y_true,
                self.sequence_lengths,
                transition_params=self.transitions,
            )
            return tf.reduce_mean(-log_likelihood)
        return crf_loss

    @property
    def accuracy(self):
        def viterbi_accuracy(y_true, y_pred):
            # -1e10 to avoid zero at sum(mask)
            mask = K.cast(
                K.all(K.greater(y_pred, -1e10), axis=2), K.floatx())
            shape = tf.shape(y_pred)
            sequence_lengths = tf.ones(shape[0], dtype=tf.int32) * (shape[1])
            y_pred, _ = crf_decode(y_pred, self.transitions, sequence_lengths)
            if self.sparse_target:
                y_true = K.argmax(y_true, 2)
            y_pred = K.cast(y_pred, 'int32')
            y_true = K.cast(y_true, 'int32')
            corrects = K.cast(K.equal(y_true, y_pred), K.floatx())
            return K.sum(corrects * mask) / K.sum(mask)
        return viterbi_accuracy

    def compute_output_shape(self, input_shape):
        tf.TensorShape(input_shape).assert_has_rank(3)
        return input_shape[:2] + (self.output_dim,)

    def get_config(self):
        config = {
            'output_dim': self.output_dim,
            'sparse_target': self.sparse_target,
            'supports_masking': self.supports_masking,
            'transitions': K.eval(self.transitions)
        }
        base_config = super(CRF, self).get_config()
        return dict(base_config, **config)
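To make the role of the `transitions` matrix more concrete, here is a minimal NumPy sketch of the quantity that `crf_log_likelihood` is used for in the class above: the log-probability of one tag sequence under a linear-chain CRF. It is not the tensorflow_addons implementation. The function names are hypothetical, the sizes are toy values, padding and sequence lengths are ignored, and `transitions[i, j]` is assumed to score tag `i` followed by tag `j`.

```python
import itertools
import numpy as np

def sequence_score(unary, transitions, tags):
    """Unnormalized score: per-token scores plus tag-to-tag transition scores."""
    score = unary[0, tags[0]]
    for t in range(1, len(tags)):
        score += transitions[tags[t - 1], tags[t]] + unary[t, tags[t]]
    return score

def log_partition(unary, transitions):
    """log Z computed with the forward algorithm in log space."""
    alpha = unary[0]
    for t in range(1, unary.shape[0]):
        # alpha_new[j] = logsumexp_i(alpha[i] + transitions[i, j]) + unary[t, j]
        scores = alpha[:, None] + transitions
        m = scores.max(axis=0)
        alpha = m + np.log(np.exp(scores - m).sum(axis=0)) + unary[t]
    m = alpha.max()
    return m + np.log(np.exp(alpha - m).sum())

def log_likelihood(unary, transitions, tags):
    """Log-probability of `tags` given the scores."""
    return sequence_score(unary, transitions, tags) - log_partition(unary, transitions)

rng = np.random.default_rng(0)
unary = rng.normal(size=(4, 3))        # 4 tokens, 3 tags (toy sizes)
transitions = rng.normal(size=(3, 3))  # assumed layout: [from_tag, to_tag]

# Brute-force check: the probabilities of all 3**4 tag sequences should sum to 1.
total = sum(np.exp(log_likelihood(unary, transitions, tags))
            for tags in itertools.product(range(3), repeat=4))
print(total)
```

The forward recursion avoids enumerating every tag sequence, which is what keeps training tractable for longer sentences; the brute-force sum is only there as a sanity check on a tiny example.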

preprocessing

import numpy as np
import tensorflow as tf

def build_data():
    """
    Getting data
    :return: the data as [characters, labels] pairs, and a dictionary of all characters
    """

    # Storing data
    datas = []

    # Store x
    sample_x = []

    # Store y
    sample_y = []

    # Store words
    vocabs = {'UNK'}

    # Traverse
    for line in open("data/", encoding="utf-8"):

        # Split
        line = line.rstrip().split('\t')

        # Take out characters
        char = line[0]

        # If the character is empty, skip
        if not char:
            continue

        # Fetch the label corresponding to the character
        cate = line[-1]

        # append
        sample_x.append(char)
        sample_y.append(cate)
        vocabs.add(char)

        # Sentence-ending punctuation (full-width or ASCII) marks the end of a sample
        if char in ['。', '?', '!', '!', '?']:
            datas.append([sample_x, sample_y])

            # Clear
            sample_x = []
            sample_y = []

    # Convert the set to a dictionary that maps each character to an integer index
    word_dict = {wd: index for index, wd in enumerate(list(vocabs))}

    print("vocab_size:", len(word_dict))


    return datas, word_dict


def modify_data():

    # Getting data
    datas, word_dict = build_data()
    X, y = zip(*datas)
    print(X[:5])
    print(y[:5])

    # tokenizer
    tokenizer = tf.keras.preprocessing.text.Tokenizer()
    tokenizer.fit_on_texts(word_dict)
    X_train = tokenizer.texts_to_sequences(X)

    # Padding
    X_train = tf.keras.preprocessing.sequence.pad_sequences(X_train, 150)
    print(X_train[:5])

    class_dict = {
        'O': 0,
        'TREATMENT-I': 1,
        'TREATMENT-B': 2,
        'BODY-B': 3,
        'BODY-I': 4,
        'SIGNS-I': 5,
        'SIGNS-B': 6,
        'CHECK-B': 7,
        'CHECK-I': 8,
        'DISEASE-I': 9,
        'DISEASE-B': 10
    }

    # tokenize
    X_train = [[word_dict[char] for char in data[0]] for data in datas]
    y_train = [[class_dict[label] for label in data[1]] for data in datas]
    print(X_train[:5])
    print(y_train[:5])

    # padding
    X_train = tf.keras.preprocessing.sequence.pad_sequences(X_train, 150)
    y_train = tf.keras.preprocessing.sequence.pad_sequences(y_train, 150)
    y_train = np.expand_dims(y_train, 2)


    # ndarray
    X_train = np.asarray(X_train)
    y_train = np.asarray(y_train)
    print(X_train.shape)
    print(y_train.shape)

    return X_train, y_train

if __name__ == '__main__':
    modify_data()
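The padding step fixes every sample to 150 positions. By default, Keras `pad_sequences` pads and truncates at the front of each sequence, which is why the printed arrays in the output begin with runs of zeros. The NumPy function below is a small sketch of that default behavior for illustration; `pad_pre` is a hypothetical name, not part of Keras.

```python
import numpy as np

def pad_pre(sequences, maxlen, value=0):
    """Left-pad (and left-truncate) integer sequences to a fixed length."""
    out = np.full((len(sequences), maxlen), value, dtype=np.int32)
    for i, seq in enumerate(sequences):
        if not seq:
            continue  # an empty sequence stays all padding
        trunc = seq[-maxlen:]         # keep only the last `maxlen` items
        out[i, -len(trunc):] = trunc  # right-align; padding fills the left
    return out

padded = pad_pre([[5, 6, 7], [1, 2, 3, 4, 5, 6]], maxlen=4)
print(padded)
# [[0 5 6 7]
#  [3 4 5 6]]
```

Note that in the preprocessing code above, 0 is also a valid index in `word_dict` (which is built with `enumerate` starting at 0) and the index of `'O'` in `class_dict`, so padded positions cannot be told apart from those values by inspection alone.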

main program

import tensorflow as tf
from pre_processing import modify_data
from crf import CRF

# Hyperparameters
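EPOCHS = 10  # Number of training epochs
BATCH_SIZE = 64  # Number of samples per batch
learning_rate = 0.00003  # Learning rate
VOCAB_SIZE = 1759 + 1
# Assumption: the optimizer and loss class names are missing from the source;
# Adam and CategoricalCrossentropy are plausible stand-ins, not confirmed
optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate)  # Optimizer
loss = tf.keras.losses.CategoricalCrossentropy()  # Loss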


def main():

    # Getting data
    X_train, y_train = modify_data()

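    # Layer types follow the model summary printed in the output below.
    # The final Dense and CRF layers use 1 unit, matching that summary;
    # note that class_dict in the preprocessing step defines 11 labels.
    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(VOCAB_SIZE, 300),
        tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(128, dropout=0.5, recurrent_dropout=0.5, return_sequences=True)),
        tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64, dropout=0.5, recurrent_dropout=0.5, return_sequences=True)),
        tf.keras.layers.TimeDistributed(tf.keras.layers.Dense(1)),
        CRF(1, sparse_target=True)
    ])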


    # Compile
    model.compile(optimizer=optimizer, loss=loss, metrics=["accuracy"])

    # Summary
    model.build([None, 150])
    print(model.summary())

    # Save
    checkpoint = tf.keras.callbacks.ModelCheckpoint(
        "../model/model.h5", monitor='val_loss',
        verbose=1, save_best_only=True, mode='min',
        save_weights_only=True
    )

    # Training
    model.fit(X_train, y_train, validation_split=0.2, epochs=EPOCHS, batch_size=BATCH_SIZE, callbacks=[checkpoint])

if __name__ == '__main__':
    main()
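The `"accuracy"` metric passed to `compile` counts every position, including the padded ones, so a high value can mostly reflect padding and `O` labels. As a rough sketch of an alternative, the NumPy function below computes token accuracy over real positions only, assuming left-padded batches as produced above. `masked_accuracy` and the `lengths` argument are hypothetical names introduced for this example.

```python
import numpy as np

def masked_accuracy(y_true, y_pred, lengths):
    """Token accuracy over real (non-padded) positions of left-padded batches."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    maxlen = y_true.shape[1]
    # A position is real when it lies within the last `length` positions of its row.
    positions = np.arange(maxlen)[None, :]
    mask = positions >= (maxlen - np.asarray(lengths))[:, None]
    correct = (y_true == y_pred) & mask
    return correct.sum() / mask.sum()

# Two toy label rows of length 4; the real lengths are 2 and 3.
y_true = [[0, 0, 6, 5], [0, 3, 4, 0]]
y_pred = [[0, 0, 6, 0], [0, 3, 4, 0]]

acc_all = float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))
acc_masked = float(masked_accuracy(y_true, y_pred, lengths=[2, 3]))
print(acc_all, acc_masked)  # 0.875 0.8
```

In this toy case the unmasked accuracy is higher because the padded positions are trivially "correct".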

Output.

vocab_size: 1759
(['≠≠,', 'male', ',', 'surname Shuang', 'stupa (abbr. loanword from Sanskrit tapo)', 'bundled straw in which silkworms spin cocoons', 'man', ',', 'trump card (in card games)', 'reason', 'sound of sighing', 'suppress', '、', 'stop (doing sth)', 'sputum', '1', 'classifier for individual things or people, general, catch-all classifier', 'moon', ',', 'plus', 'repetition', '3', 'sky', ',', 'whip or thrash', 'twitch', '1', 'substandard', 'sentence-final interrogative particle', '2', '0', '1', '6', 'surname Nian', '1', '2', 'moon', '0', '8', 'date', '0', '7', ':', '0', '0', 'in order to', '1', '、', 'lungs', 'anti-inflammation', '2', '、', 'whip or thrash', 'twitch', 'deal with', 'surname Zha', 'in care of (used on address line after name)', 'confirm or agree with', 'institution', '。'], ['suffix forming noun from adjective, corresponding -ness or -ity', 'love dearly', 'thoroughly', '1', 'surname Nian', 'in care of (used on address line after name)', 'confirm or agree with', 'institution', '。'], [',', 'male', ',', '4', 'year (of crop harvests)', ',', 'river', 'be defeated (classical)', 'leave out', 'due to', 'favor', 'city', 'surname Shuang', 'river and county in Hebei Province', 'surname Ou', 'narrate', 'fence', 'small thing', 'township (PRC administrative unit)', 'narrate', 'fence', 'small thing', 'village', 'man', ',', 'trump card (in card games)', 'reason', '"', 'sound of sighing', 'suppress', '、', 'sound of sighing', 'sputum', ',', 'comrade', 'show (one's feeling)', 'heat up', '6', 'sky', '"', 'sentence-final interrogative particle', '2', '0', '1', '6', 'surname Nian', '1', '2', 'moon', '1', '3', 'date', '1', '1', ':', '4', '7', 'in order to', 'classifier for rod-shaped objects, e.g. 
pens, guns; for army divisions; for songs', 'qi', 'take care (of)', 'lungs', 'anti-inflammation', 'in care of (used on address line after name)', 'confirm or agree with', 'institution', '。'], ['2', 'surname Nian', 'puffed (swollen)', 'bladder', 'make', 'ulceration', 'mouth', 'go out', 'urinate', '1', 'surname Nian', 'sentence-final interrogative particle', '2', '0', '1', '7', '-', '-', '0', '2', '-', '-', '0', '6', 'in care of (used on address line after name)', 'confirm or agree with', 'institution', '。'], [';', 'n', 'b', 's', 'p', ';', 'noun prefix denoting function or status', 'eastern bean goose', 'wild goose', 'women', '5', '9', 'year (of crop harvests)', 'afterwards', 'take a wife', ' ', 'the Han dynasty (206 BC-220 AD)', 'ethnicity', ' ', 'river', 'be defeated (classical)', 'due to', 'favor', 'surname Shuang', 'river and county in Hebei Province', 'surname Ou', 'man', ',', 'appear', '(suffix indicating firmness, steadiness, or coming a halt)', 'electronic', 'factory', 'classifier for families or businesses e.g. shops, companies', 'be born in the year of (one of the 12 animals)', 'institution', ',', 'trump card (in card games)', 'reason', 'shoulder (responsibilities etc)', 'glaze', 'classifier for works of literature, films, cars or land line telephones', 'love dearly', 'thoroughly', '1', '0', '(following numerical value) or more', 'surname Nian', ',', 'plus', 'repetition', '2', 'classifier for individual things or people, general, catch-all classifier', 'moon', 'sentence-final interrogative particle', '2', '0', '1', '6', '-', '0', '1', '-', '1', '8', ' ', '9', ':', '1', '9', 'in care of (used on address line after name)', 'confirm or agree with', 'institution', '。'])
(['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'SIGNS-B', 'SIGNS-I', 'O', 'SIGNS-B', 'SIGNS-I', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'SIGNS-B', 'SIGNS-I', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'DISEASE-B', 'DISEASE-I', 'O', 'O', 'SIGNS-B', 'SIGNS-I', 'O', 'O', 'O', 'O', 'O', 'O'], ['O', 'SIGNS-B', 'SIGNS-I', 'O', 'O', 'O', 'O', 'O', 'O'], ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'SIGNS-B', 'SIGNS-I', 'O', 'SIGNS-B', 'SIGNS-I', 'O', 'O', 'SIGNS-B', 'SIGNS-I', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'DISEASE-B', 'DISEASE-I', 'DISEASE-I', 'DISEASE-I', 'DISEASE-I', 'O', 'O', 'O', 'O'], ['O', 'O', 'BODY-B', 'BODY-I', 'BODY-I', 'BODY-I', 'BODY-I', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O'], ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'BODY-B', 'BODY-I', 'BODY-I', 'SIGNS-B', 'SIGNS-I', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O'])
[[ 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 880 1182 602 698 1530 1630 1457
602 31 878 1388 124 1211 225 346 456 267 1430 602 542 677
796 272 602 238 1251 456 1170 1268 577 46 456 1056 1641 456
577 1430 46 699 853 46 1231 46 46 1152 456 1211 797 1323
577 1211 238 1251 591 1364 1133 513 282 1232]
[ 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 1514 1259 709 456 1641 1133 513 282 1232]
[ 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 602 1182 602 1090 959 602 1155 1708 882 426 1426 1561
698 1242 908 174 1445 1334 229 174 1445 1334 1199 1457 602 31
878 1388 124 1211 1388 346 602 216 767 371 1056 272 1268 577
46 456 1056 1641 456 577 1430 456 796 853 456 456 1090 1231
1152 1455 669 1322 797 1323 1133 513 282 1232]
[ 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
577 1641 1584 734 1643 1126 186 896 967 456 1641 1268 577 46
456 1231 46 577 46 1056 1133 513 282 1232]
[ 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 1398 7 14 16 103 290 1491 1483 1024 1531 959 1081 559
845 114 1155 1708 426 1426 698 1242 908 1457 602 583 188 1575
1379 1337 326 282 602 31 878 1439 885 1520 1259 709 456 46
1625 1641 602 542 677 577 267 1430 1268 577 46 456 1056 46
456 456 699 1531 456 1531 1133 513 282 1232]]
[[891, 1203, 604, 702, 1562, 1665, 1486, 604, 11, 889, 1413, 110, 1233, 213, 337, 453, 255, 1457, 604, 542, 681, 803, 260, 604, 226, 1275, 453, 1190, 1292, 579, 26, 453, 1072, 1676, 453, 579, 1457, 26, 703, 864, 26, 1255, 1465, 26, 26, 1172, 453, 1233, 804, 1347, 579, 1233, 226, 1275, 593, 1388, 1153, 512, 270, 1256], [1546, 1283, 713, 453, 1676, 1153, 512, 270, 1256], [604, 1203, 604, 1108, 971, 604, 1175, 1745, 893, 421, 1451, 1594, 702, 1266, 919, 160, 1473, 1358, 217, 160, 1473, 1358, 1221, 1486, 604, 11, 889, 1127, 1413, 110, 1233, 1413, 337, 604, 204, 772, 362, 1072, 260, 1127, 1292, 579, 26, 453, 1072, 1676, 453, 579, 1457, 453, 803, 864, 453, 453, 1465, 1108, 1255, 1172, 1484, 673, 1346, 804, 1347, 1153, 512, 270, 1256], [579, 1676, 1618, 738, 1678, 1145, 173, 907, 979, 453, 1676, 1292, 579, 26, 453, 1255, 1495, 1495, 26, 579, 1495, 1495, 26, 1072, 1153, 512, 270, 1256], [369, 1423, 811, 1730, 986, 369, 88, 278, 1522, 1514, 1039, 1563, 971, 1099, 560, 1234, 855, 100, 1234, 1175, 1745, 421, 1451, 702, 1266, 919, 1486, 604, 585, 175, 1609, 1403, 1361, 317, 270, 604, 11, 889, 1467, 896, 1552, 1283, 713, 453, 26, 1660, 1676, 604, 542, 681, 579, 255, 1457, 1292, 579, 26, 453, 1072, 1495, 26, 453, 1495, 453, 703, 1234, 1563, 1465, 453, 1563, 1153, 512, 270, 1256]]
[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 6, 5, 0, 6, 5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 6, 5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 10, 9, 0, 0, 6, 5, 0, 0, 0, 0, 0, 0], [0, 6, 5, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 6, 5, 0, 6, 5, 0, 0, 6, 5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 10, 9, 9, 9, 9, 0, 0, 0, 0], [0, 0, 3, 4, 4, 4, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 4, 4, 6, 5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
(7836, 150)
(7836, 150, 1)

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
embedding (Embedding)        (None, None, 300)         528000
_________________________________________________________________
bidirectional (Bidirectional (None, None, 256)         439296
_________________________________________________________________
bidirectional_1 (Bidirection (None, None, 128)         164352
_________________________________________________________________
time_distributed (TimeDistri (None, None, 1)           129
_________________________________________________________________
crf (CRF)                    (None, None, 1)           1
=================================================================
Total params: 1,131,778
Trainable params: 1,131,778
Non-trainable params: 0
_________________________________________________________________
None
2021-11-23 00:31:29.846318: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)
Epoch 1/10
10/98 [==>...........................] - ETA: 7:52 - loss: 5.2686e-08 - accuracy: 0.9232
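The label sequences printed above use tags of the form `TYPE-B` (first character of an entity), `TYPE-I` (continuation), and `O` (outside any entity). To turn a tag sequence back into entities, consecutive tags can be grouped into spans. The function below is one possible sketch; `extract_entities` is a hypothetical helper, and the characters in the example are placeholders rather than dataset content.

```python
def extract_entities(chars, tags):
    """Group characters into (entity_type, text) spans from TYPE-B / TYPE-I tags."""
    entities = []
    current_type, current_chars = None, []
    for ch, tag in zip(chars, tags):
        if tag.endswith('-B'):
            # A B-tag always starts a new span, closing any open one first.
            if current_chars:
                entities.append((current_type, ''.join(current_chars)))
            current_type, current_chars = tag[:-2], [ch]
        elif tag.endswith('-I') and current_type == tag[:-2]:
            current_chars.append(ch)
        else:
            # 'O' or an I-tag of a different type closes any open span.
            if current_chars:
                entities.append((current_type, ''.join(current_chars)))
            current_type, current_chars = None, []
    if current_chars:
        entities.append((current_type, ''.join(current_chars)))
    return entities

chars = list('abcdefg')  # placeholder characters, not from the dataset
tags = ['O', 'SIGNS-B', 'SIGNS-I', 'O', 'BODY-B', 'BODY-I', 'BODY-I']
entities = extract_entities(chars, tags)
print(entities)  # [('SIGNS', 'bc'), ('BODY', 'efg')]
```

An I-tag that does not follow a matching B-tag is dropped here; other conventions treat it as the start of a new span, so the choice should match how the model's predictions are evaluated.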
