
Filice, S. (2016). Learning relations between short texts [10.58015/filice-simone_phd2016-04].

Learning relations between short texts

FILICE, SIMONE
2016-04-01

Abstract

A core problem in Machine Learning (ML) is the definition of meaningful representations of input objects that provide the learning algorithm with enough information to estimate an accurate model for the target task. In Natural Language Learning, the traditional approach consists of modeling input texts as feature vectors in which each value encodes some linguistic aspect, e.g., lexical information, syntax, or semantics. Feature-based systems can reach state-of-the-art results on several Natural Language Processing (NLP) tasks; however, defining an expressive feature set is usually a very expensive operation that requires deep knowledge of the linguistic phenomena characterizing a particular task. Furthermore, feature sets developed for a specific problem are usually not valid for a new task, and fail to adapt to different languages or domains. The problem of modeling input data is even more acute when the input examples are not individual objects but pairs of objects. How can the linguistic patterns characterizing a valid answer to a given question be automatically discovered? How can rewriting rules in paraphrasing be learned? How can the semantic and syntactic relations in textual entailment be automatically captured? Kernel methods are an elegant and efficient alternative to the feature-based approach: instead of trying to design a synthetic feature space, kernels can operate directly on structured data, implicitly generating an extremely large set of features. For instance, tree kernels compute the similarity between two sentences by evaluating the tree fragments shared by their syntactic parse trees. This operation corresponds to a dot product in the implicit feature space of all possible tree fragments. The dimensionality of this space is extremely large, and operating directly on it is not viable.
This thesis proposes a kernel-based learning framework for efficiently and effectively tackling NLP tasks that require determining whether a particular semantic relation holds between two texts. Pairs of texts will be modeled using expressive structured representations, and novel kernels operating on such representations will be described. By exploring inter-pair relations, the learning framework is able to automatically induce complex pairwise patterns, such as rewriting or entailment rules. A detailed empirical evaluation will be conducted on the tasks of Paraphrase Identification, in which the problem is assessing whether two sentences convey the same information, and Recognizing Textual Entailment, where the task is deciding whether a text implies a hypothesis. Furthermore, Answer Selection in Community Question Answering will be investigated, showing how the proposed framework can learn complex question-answering patterns. The state-of-the-art results achieved on these three rather different NLP problems will demonstrate the flexibility and generalization capability of the proposed methods.
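To make the fragment-counting idea above concrete, the toy sketch below restricts the fragment space to complete subtrees and counts shared fragments explicitly by serializing each subtree. This is only an illustration of the principle: real tree kernels (such as the subset tree kernel used in this line of work) cover a far richer fragment space and compute the dot product implicitly via recursive dynamic programming, never enumerating fragments. The tree encoding and all names are illustrative, not taken from the thesis.

```python
from collections import Counter

def serialize(tree):
    """Serialize a (label, children) tree; a leaf has an empty child list."""
    label, children = tree
    if not children:
        return label
    return "(" + label + " " + " ".join(serialize(c) for c in children) + ")"

def all_subtrees(tree):
    """List the serializations of every complete subtree rooted in the tree."""
    out = [serialize(tree)]
    for child in tree[1]:
        out.extend(all_subtrees(child))
    return out

def subtree_kernel(t1, t2):
    """Count matching complete subtrees: a dot product in fragment space."""
    c1, c2 = Counter(all_subtrees(t1)), Counter(all_subtrees(t2))
    return sum(c1[s] * c2[s] for s in c1 if s in c2)

# Two toy parse trees differing only in one noun.
t1 = ("S", [("NP", [("D", [("the", [])]), ("N", [("cat", [])])]),
            ("VP", [("V", [("sleeps", [])])])])
t2 = ("S", [("NP", [("D", [("the", [])]), ("N", [("dog", [])])]),
            ("VP", [("V", [("sleeps", [])])])])

print(subtree_kernel(t1, t2))  # -> 5 shared fragments, e.g. "(VP (V sleeps))"
```

Enumerating fragments like this grows quickly with tree size, which is precisely why practical tree kernels evaluate the same dot product with a recursion over node pairs instead of an explicit feature space.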
apr-2016
2015/2016
Computer Science, Control and Geoinformation
27.
Telecommunications; Learning Relations; Short Texts
Settore ING-INF/03 - TELECOMUNICAZIONI
Settore IINF-03/A - Telecomunicazioni
English
Doctoral thesis
Files in this item:

phdthesis.pdf
License: Copyright of the authors
Size: 2.64 MB
Format: Adobe PDF
Access: authorized users only (a copy can be requested)

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/2108/203176