{"id":2416,"date":"2021-06-08T08:46:52","date_gmt":"2021-06-08T07:46:52","guid":{"rendered":"https:\/\/www.jurecuhalev.com\/blog\/?p=2416"},"modified":"2021-08-17T13:48:01","modified_gmt":"2021-08-17T12:48:01","slug":"matej-martinc","status":"publish","type":"post","link":"https:\/\/www.jurecuhalev.com\/blog\/matej-martinc\/","title":{"rendered":"Matej Martinc explains Natural Language Processing"},"content":{"rendered":"\n<p class=\"has-light-brown-background-color has-background\">In Meaningful work interviews I talk to people about their area of work and expertise to better understand what they do and why it matters to them.<\/p>\n\n\n\n<p><a href=\"https:\/\/www.linkedin.com\/in\/matej-martinc-091b81b7\/\">Matej Martinc<\/a> is a Ph.D. researcher at <a href=\"https:\/\/kt.ijs.si\/\">\u201cJo\u017eef Stefan&#8221; Institute in the Department of Knowledge Technologies<\/a> where he develops new approaches to analyzing written text. He explained to me the basics of Natural Language Processing (NLP), why neural networks are amazing, and how one gets started with all of this. In the second half, he shared how he ended up in Computer Science with a Philosophy degree and why working for companies like Google is not something that interests him.<\/p>\n\n\n\n<p><strong>How do people introduce you?<\/strong><\/p>\n\n\n\n<p>They introduce me as a researcher at the IJS institute. I\u2019m in the last year of my Ph.D. thesis research. I\u2019m mostly working on Natural Language Processing (NLP). NLP is a big field and I\u2019m currently exploring several different areas.<\/p>\n\n\n\n<p>I initially started by automatically profiling text authors by their style of writing &#8211; we can detect their age, gender, and psychological properties. I also worked on automatic identification of text readability. 
We\u2019ve also created a system to detect Alzheimer&#8217;s disease based on patients&#8217; writing.<\/p>\n\n\n\n<p>Lately, I\u2019ve been working on automatic keyword extraction and detecting political bias in word usage in media articles. I\u2019m also contributing to research on semantic change &#8211; how word usage changes through time.<\/p>\n\n\n\n<div class=\"wp-block-group\"><div class=\"wp-block-group__inner-container is-layout-flow wp-block-group-is-layout-flow\">\n<div class=\"wp-block-group\"><div class=\"wp-block-group__inner-container is-layout-flow wp-block-group-is-layout-flow\">\n<div class=\"wp-block-group has-background\" style=\"background-color:#f0fbff\"><div class=\"wp-block-group__inner-container is-layout-flow wp-block-group-is-layout-flow\">\n<p><em>Research that Matej references throughout this interview. I encourage you to read these papers, as they\u2019re written in very clear language.<\/em><\/p>\n\n\n\n<p><a href=\"https:\/\/www.aclweb.org\/anthology\/2021.naacl-main.369.pdf\"><strong>Scalable and Interpretable Semantic Change Detection<\/strong><\/a><\/p>\n\n\n\n<p><em>[..] We propose a novel scalable method for word usage change detection that offers large gains in processing time and significant memory savings while offering the same interpretability and better performance than unscalable methods. We demonstrate the applicability of the proposed method by analyzing a large corpus of news articles about COVID-19<\/em><\/p>\n\n\n\n<p><a href=\"https:\/\/www.mdpi.com\/2076-3417\/10\/17\/5993\"><strong>Zero-Shot Learning for Cross-Lingual News Sentiment Classification<\/strong><\/a><\/p>\n\n\n\n<p><em>In this paper, we address the task of zero-shot cross-lingual news sentiment classification. 
Given the annotated dataset of positive, neutral, and negative news in Slovene, the aim is to develop a news classification system that assigns the sentiment category not only to Slovene news, but to news in another language without any training data required. [..]<\/em><\/p>\n\n\n\n<p><a href=\"https:\/\/www.aclweb.org\/anthology\/2021.hackashop-1.17.pdf\"><strong>Automatic sentiment and viewpoint analysis of Slovenian news corpus on the topic of LGBTIQ+<\/strong><\/a><\/p>\n\n\n\n<p><em>We conduct automatic sentiment and viewpoint analysis of the newly created Slovenian news corpus containing articles related to the topic of LGBTIQ+ by employing the state-of-the-art news sentiment classifier and a system for semantic change detection. The focus is on the differences in reporting between quality news media with long tradition and news media with financial and political connections to SDS, a Slovene right-wing political party. The results suggest that political affiliation of the media can affect the sentiment distribution of articles and the framing of specific LGBTIQ+ topics, such as same-sex marriage.<\/em><\/p>\n<\/div><\/div>\n<\/div><\/div>\n<\/div><\/div>\n\n\n\n<p><strong>Can you start by explaining some background about NLP (Natural Language Processing)?<\/strong><\/p>\n\n\n\n<p>As a first step, it\u2019s good to consider how classification techniques such as SVM (support vector machine) classifiers and decision trees work. Very broadly speaking, they operate on a set of manually crafted features extracted from the dataset that you train your model on. Examples of such features would be: &#8220;number of words in a document&#8221; or a &#8220;bag of words&#8221; model, where you put all the words into \u201ca bag\u201d and a classifier learns which words from this bag appear in different documents. 
If you have a dataset of documents for which you know to which class they belong (e.g., a class can be the gender of the author who wrote a specific document), you can train your model on this dataset and then use it to classify new documents based on how similar they are to the documents the model was trained on. The limitation of this approach is that these statistical features do not really take semantic relations between words into account, since they are based on simple frequency-based statistics.<\/p>\n\n\n\n<p>About 10 years ago a different approach was invented using neural networks. What neural networks allow you to do is to work with unstructured datasets, because you don\u2019t need to define these features (i.e., classification rules) in advance. You train them by inputting sequences of words, and the network learns on its own how often a given word appears close to another word in a sequence. The information on each word is gathered in a special layer of this neural network, called an <em>embedding layer<\/em>, which is basically a vector representation that encodes how a specific word relates to other words.<\/p>\n\n\n\n<p>What\u2019s interesting is that synonyms have very similar vector representations. This allows you to extract relations between words.<\/p>\n\n\n\n<p>An example of that would be trying to answer: \u201cParis in relation to France\u201d is the same as \u201cBerlin in relation to (what?)\u201d. To solve this question you can take the embedding of France, subtract the embedding of Paris, and add the embedding of Berlin &#8211; the embedding nearest to the result is the answer: Germany. This was a big revolution in the field as it allows us to operationalize relations in the context of languages. 
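As a toy illustration of this embedding arithmetic, here is a minimal sketch with hand-made 2-D vectors standing in for real trained embeddings; the words, dimensions, and numbers are all invented for the example:

```python
import numpy as np

# Hand-made 2-D "embeddings" (not real trained vectors): dimension 0
# roughly encodes "is a capital city", dimension 1 "which country".
vectors = {
    "france":  np.array([0.0, 1.0]),
    "paris":   np.array([5.0, 1.0]),
    "germany": np.array([0.0, 2.0]),
    "berlin":  np.array([5.0, 2.0]),
    "italy":   np.array([0.0, 3.0]),
    "rome":    np.array([5.0, 3.0]),
}

def analogy(a, b, c):
    """Solve 'a is to b as c is to ?' via vec(b) - vec(a) + vec(c),
    returning the nearest word that is not one of the inputs."""
    target = vectors[b] - vectors[a] + vectors[c]
    candidates = {w: v for w, v in vectors.items() if w not in (a, b, c)}
    return min(candidates, key=lambda w: np.linalg.norm(candidates[w] - target))

print(analogy("paris", "france", "berlin"))  # -> germany
```

Real embeddings have hundreds of learned dimensions, but the nearest-neighbour lookup over `b - a + c` works the same way.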
The second revolution came when they invented transfer learning, a procedure employed, for example, in the <a href=\"https:\/\/en.wikipedia.org\/wiki\/BERT_(language_model)\">BERT neural network<\/a> that was trained on BookCorpus with 800 million words and English Wikipedia with 2500 million words.<\/p>\n\n\n\n<p>In this procedure, the first thing you want to do is to train a language model. You want the model to predict the next word in a given sequence of words. You can also mask words in a given text and train the neural network to fill the gaps with the correct words. What implicitly happens in such training is that the neural network learns about semantic relations between words. So if you\u2019re doing this on a large corpus of texts (like the billions of words used for BERT) you get a model that you can use on a wide variety of general tasks. Because nobody had to label the data for this training, the resulting model is unsupervised.<\/p>\n\n\n\n<p><strong>Are you working with special pre-trained datasets?<\/strong><\/p>\n\n\n\n<p>I\u2019m now mostly working with unsupervised methods similar to the BERT model. So what we do is take that kind of model and do additional fine-tuning on a smaller training set that makes it better suited for a specific research task. This approach allowed us to do all of the research that I\u2019m referencing here.<\/p>\n\n\n\n<p>A different research area that doesn\u2019t require additional training is to employ clustering on the embeddings of these neural networks. You can take a corpus of text from the 1960s and another one from the 2000s. We can then compare how the usage of specific words (via their embeddings) compares between these two collections of texts. 
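A rough sketch of that comparison, using simple co-occurrence counts in place of neural embeddings; the two one-sentence "corpora" are invented for the example:

```python
from collections import Counter
import math

def context_vector(word, corpus, window=2):
    """Count which words appear within `window` tokens of `word`."""
    counts = Counter()
    for sentence in corpus:
        tokens = sentence.lower().split()
        for i, tok in enumerate(tokens):
            if tok == word:
                neighbours = tokens[max(0, i - window):i + window + 1]
                counts.update(t for t in neighbours if t != word)
    return counts

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Invented mini-corpora standing in for the two time periods.
corpus_old = ["the mouse ran across the field"]
corpus_new = ["click the mouse button to open the file"]

# Higher score = larger shift in the word's typical contexts.
change_score = 1.0 - cosine(context_vector("mouse", corpus_old),
                            context_vector("mouse", corpus_new))
```

The neural version replaces the count vectors with embeddings, but the idea of scoring change by comparing a word's representations across periods is the same.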
That\u2019s essentially how we can study how the semantic meaning of words has changed in our culture.<\/p>\n\n\n\n<p>Modern neural networks can also produce an embedding for each usage of a word, meaning that words with more than one meaning have more than one embedding. This allows you to differentiate between Apple (the company) and apple (the fruit). We used this approach when studying how different words connected to COVID changed through time. We generated embeddings for each word appearance in the corpus of news about COVID and clustered these word occurrences into distinct word usages. Two interesting terms that we identified were <em>diamond<\/em> and <em>strain<\/em>. For <em>strain<\/em>, you can see the shift from an epidemiological usage (virus strain) to a more economic usage in later months (financial strain).<\/p>\n\n\n\n<p>What we showed with our research is that you can detect changes even across short (monthly) time periods. There\u2019s a limit to how accurately we can identify the difference. It\u2019s often hard even for humans to decide how to label such data. 
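A minimal sketch of that clustering step, with hand-made 2-D occurrence vectors standing in for contextual BERT embeddings and a tiny k-means in place of the clustering method from the paper; the vectors are invented for the example:

```python
import numpy as np

# Hand-made vectors for six occurrences of "strain" (stand-ins for
# contextual embeddings): first three lean epidemiological, last three
# financial.
occurrences = np.array([
    [0.9, 0.1], [1.0, 0.0], [0.8, 0.2],   # e.g. "new strain of the virus"
    [0.1, 0.9], [0.0, 1.0], [0.2, 0.8],   # e.g. "financial strain on firms"
])

def kmeans(points, k=2, iters=20, seed=0):
    """Tiny k-means: returns a cluster label for every occurrence."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = np.argmin(dists, axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return labels

# Occurrences of the same usage end up in the same cluster; tracking the
# relative sizes of these clusters per month gives the usage-shift signal.
labels = kmeans(occurrences)
```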
We can usually get close to human performance by using our unsupervised methods.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"426\" src=\"https:\/\/www.jurecuhalev.com\/blog\/wp-content\/uploads\/2021\/06\/image-1024x426.png\" alt=\"\" class=\"wp-image-2417\" srcset=\"https:\/\/www.jurecuhalev.com\/blog\/wp-content\/uploads\/2021\/06\/image-1024x426.png 1024w, https:\/\/www.jurecuhalev.com\/blog\/wp-content\/uploads\/2021\/06\/image-550x229.png 550w, https:\/\/www.jurecuhalev.com\/blog\/wp-content\/uploads\/2021\/06\/image-768x319.png 768w, https:\/\/www.jurecuhalev.com\/blog\/wp-content\/uploads\/2021\/06\/image-1536x638.png 1536w, https:\/\/www.jurecuhalev.com\/blog\/wp-content\/uploads\/2021\/06\/image.png 1600w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"419\" src=\"https:\/\/www.jurecuhalev.com\/blog\/wp-content\/uploads\/2021\/06\/image-1-1024x419.png\" alt=\"\" class=\"wp-image-2418\" srcset=\"https:\/\/www.jurecuhalev.com\/blog\/wp-content\/uploads\/2021\/06\/image-1-1024x419.png 1024w, https:\/\/www.jurecuhalev.com\/blog\/wp-content\/uploads\/2021\/06\/image-1-550x225.png 550w, https:\/\/www.jurecuhalev.com\/blog\/wp-content\/uploads\/2021\/06\/image-1-768x314.png 768w, https:\/\/www.jurecuhalev.com\/blog\/wp-content\/uploads\/2021\/06\/image-1-1536x629.png 1536w, https:\/\/www.jurecuhalev.com\/blog\/wp-content\/uploads\/2021\/06\/image-1.png 1600w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption>(both figures are from the paper <a href=\"https:\/\/www.aclweb.org\/anthology\/2021.naacl-main.369.pdf\"><strong>Scalable and Interpretable Semantic Change Detection<\/strong><\/a>)<\/figcaption><\/figure>\n\n\n\n<p><strong>Does this work for non-English languages?<\/strong><\/p>\n\n\n\n<p>You can use the same technology with a 
non-English language, and we\u2019re successfully using it with the Slovenian language. In our viewpoint analysis of Slovenian news reporting, we\u2019ve discovered a difference in how the word <em>deep<\/em> is used in different contexts, mostly because of the <em>deep state<\/em>, which became a popular topic in certain publications.<\/p>\n\n\n\n<p>For our LGBTIQ+ research, we can show that certain media outlets avoid using the word <em>marriage<\/em> in the context of LGBTIQ+ reporting and replace it with terms like <em>domestic partnership<\/em>. They also avoid discussing LGBTIQ+ relationships in the context of terms such as <em>family<\/em>. We can detect the political leaning of a media outlet based on how it writes about these topics.<\/p>\n\n\n\n<p>We have just started this research on the Slovenian language, so we expect that we\u2019ll have much more to show later in the year.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"465\" src=\"https:\/\/www.jurecuhalev.com\/blog\/wp-content\/uploads\/2021\/06\/image-2-1024x465.png\" alt=\"\" class=\"wp-image-2419\" srcset=\"https:\/\/www.jurecuhalev.com\/blog\/wp-content\/uploads\/2021\/06\/image-2-1024x465.png 1024w, https:\/\/www.jurecuhalev.com\/blog\/wp-content\/uploads\/2021\/06\/image-2-550x250.png 550w, https:\/\/www.jurecuhalev.com\/blog\/wp-content\/uploads\/2021\/06\/image-2-768x349.png 768w, https:\/\/www.jurecuhalev.com\/blog\/wp-content\/uploads\/2021\/06\/image-2.png 1234w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption>(figure is from the paper <a href=\"https:\/\/www.aclweb.org\/anthology\/2021.hackashop-1.17.pdf\">Automatic sentiment and viewpoint analysis of Slovenian news corpus on the topic of LGBTIQ+<\/a>)<\/figcaption><\/figure>\n\n\n\n<p><strong>So far you\u2019ve talked about analysis and understanding of texts. 
What other research are you doing?<\/strong><\/p>\n\n\n\n<p>We\u2019re working on models for generating texts as part of the <a href=\"http:\/\/embeddia.eu\/\">Embeddia project<\/a>. The output of this research also works with the Slovenian language.<\/p>\n\n\n\n<p>We\u2019re also investigating whether we can transfer embeddings between languages. We have a special version of the BERT neural network that has been trained on Wikipedias in 100+ different languages. What we\u2019ve found out is that you can take a corpus of texts in the English language, train the model on it to, for example, detect the gender of the author, and then use that same model to predict the gender of the author of some Slovenian text. This approach is called zero-shot transfer.<\/p>\n\n\n\n<p><strong>How approachable is all this research and knowledge? Do I need a Ph.D. to be able to understand and use your research?<\/strong><\/p>\n\n\n\n<p>It takes students of our graduate school about a year to become productive in this field. The biggest initial hurdle is that you need to learn how to work with neural networks.<\/p>\n\n\n\n<p>The good thing is that we now have very approachable libraries in this field. I\u2019m a big fan of <a href=\"https:\/\/pytorch.org\/\">PyTorch<\/a> as it\u2019s well integrated with the Python ecosystem. There\u2019s also <a href=\"https:\/\/www.tensorflow.org\/\">TensorFlow<\/a>, which is more popular in industry than in research; I found it harder to use and to debug for the type of work we\u2019re doing. With PyTorch it takes our students about a month or two to understand the basics.<\/p>\n\n\n\n<p>In our context, it\u2019s not just about using the existing neural networks and methods. 
Understanding the science side of our field, and learning how to contribute by independently writing and publishing papers, usually takes about two years.<\/p>\n\n\n\n<p><strong>How easy is it to use your research in \u2018real-world\u2019 applications?<\/strong><\/p>\n\n\n\n<p>We have some international media companies that are using our research in the area of automatic keyword extraction from text. We\u2019re helping them with additional tweaking of our models.<\/p>\n\n\n\n<p>Overall, we try to publish everything that we do under open access licenses, with code and datasets publicly available.<\/p>\n\n\n\n<p>What we don\u2019t do is maintain our work as production code. That\u2019s beyond the scope of research and we don\u2019t have funding to do it. It\u2019s also very time-consuming and it doesn\u2019t help us with our future research. That\u2019s also what I like about scientific research: we get to invent things and we don\u2019t need to maintain and integrate them. We can shift our focus to the next research question.<\/p>\n\n\n\n<p>So in practice, all of our research is available to you, but you\u2019ll need to do the engineering work to integrate it with your product.<\/p>\n\n\n\n<p><strong>Let\u2019s shift a bit to your story and how you got into this research. How did you get here?<\/strong><\/p>\n\n\n\n<p>I first graduated in philosophy and sociology in 2011, at a time when Slovenia was still recovering from the financial crisis. While I considered a Ph.D. in philosophy, I decided that there are not many jobs for philosophers. That\u2019s why I enrolled in a Computer Science degree, which offered better job prospects.<\/p>\n\n\n\n<p>During my Computer Science studies, I was also working in different IT startups. I quickly realized that you don\u2019t have a lot of freedom in such an environment. 
Software engineering was too constrained for me in terms of what kind of work I could do.<\/p>\n\n\n\n<p>After I graduated, I took the opportunity to do an Erasmus exchange and went to a university in Spain. In that academic environment, I found the opposite approach. I received a dataset, a very loose description of a problem, and complete freedom to decide how I was going to approach and solve it.<\/p>\n\n\n\n<p>When I returned to Slovenia, I applied to a few different laboratories inside IJS to see if I could continue with academic research. I got a few offers and accepted one from the laboratory where I\u2019m working today.<\/p>\n\n\n\n<p>I also decided to focus on NLP and language technologies as I\u2019m still interested in doing philosophical and sociological research. Currently, I have the freedom to explore these topics in my research field without too many constraints. I\u2019m also really enjoying all the conferences and travel that come with it. Due to the fast-changing nature of my field, all the cutting-edge research is presented at conferences; publishing in journals is just too slow. It takes over a year to publish a paper, but there&#8217;s groundbreaking research almost monthly.<\/p>\n\n\n\n<p><strong>How do you see research done at FAANG (Facebook, Amazon, Apple, Netflix, Google) companies? We know that they\u2019re investing a large amount of money into this field and have large research teams.<\/strong><\/p>\n\n\n\n<p>They\u2019re doing a lot of good research. At the same time, they\u2019re also often relying on access to a large amount of hardware resources that we don\u2019t have. This can be both a blessing and a curse. At the moment I don\u2019t see their research being that much better than the findings coming out of universities. 
Universities are also more incentivized to develop new optimization techniques, as they can&#8217;t rely on brute-force hardware for their research.<\/p>\n\n\n\n<p><strong>Are you considering working for a FAANG company after your Ph.D.?<\/strong><\/p>\n\n\n\n<p>Not really. I already have a lot of freedom in my research and I can get funding to explore the areas that interest me. If I worked inside a FAANG company, I would need to start at the bottom of the hierarchy and would also be limited by their research agenda.<\/p>\n\n\n\n<p>I also really like living in Slovenia and I don\u2019t want to relocate to another country. At the same time, I&#8217;m excited about potential study\/research exchanges as I enjoy collaborating with researchers at foreign institutions.<\/p>\n\n\n\n<p><strong>What are some good resources to follow in your field?<\/strong><\/p>\n\n\n\n<p>You can follow the current state of the art at:<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li>Proceedings from the Annual Meeting of the Association for Computational Linguistics (including the European and North American editions), collected at <a href=\"https:\/\/www.aclweb.org\/anthology\/\">https:\/\/www.aclweb.org\/anthology\/<\/a><\/li><li>The Neural Information Processing Systems (NeurIPS) conference: <a href=\"https:\/\/nips.cc\/\">https:\/\/nips.cc\/<\/a><\/li><\/ul>\n\n\n\n<p>Papers describing paradigm shifts in the field of NLP:<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li>The discovery of embeddings: <a href=\"https:\/\/arxiv.org\/abs\/1301.3781\">Efficient Estimation of Word Representations in Vector Space<\/a><\/li><li>A new type of neural network (the transformer) that is now widely used in NLP: <a href=\"https:\/\/arxiv.org\/abs\/1706.03762\">Attention Is All You Need<\/a><\/li><li>Unsupervised language model pretraining and transfer learning: <a href=\"https:\/\/arxiv.org\/abs\/1810.04805\">BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding<\/a><\/li><\/ul>\n\n\n\n<div 
class=\"wp-block-group has-yellow-background-color has-background\"><div class=\"wp-block-group__inner-container is-layout-flow wp-block-group-is-layout-flow\">\n<p><em><strong>What I learned from talking with Matej<\/strong><\/em><\/p>\n\n\n\n<ul class=\"wp-block-list\"><li>Recognizing what kind of work makes you happy allows you to choose your job or clients so that you get to do more of such work.<\/li><li>Natural Language Processing is a very approachable technology, not something that only big companies can use.<\/li><li>There are many opportunities to bring research findings into industry, though it requires expertise and connections in both fields.<\/li><li>These technologies now also work for the Slovenian language.<\/li><\/ul>\n<\/div><\/div>\n","protected":false},"excerpt":{"rendered":"<p>Matej Martinc is a Ph.D. researcher at \u201cJo\u017eef Stefan&#8221; Institute in the Department of Knowledge Technologies where he develops new approaches to analyzing written text. He explained to me the basics of Natural Language Processing (NLP), why neural networks are amazing, and how one gets started with all of this. 
In the second half, he shared how he ended up in Computer Science with a Philosophy degree and why working for companies like Google is not something that interests him.<\/p>\n","protected":false},"author":1,"featured_media":2421,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[943],"tags":[],"class_list":["post-2416","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-meaningful-work"],"acf":[],"jetpack_featured_media_url":"https:\/\/www.jurecuhalev.com\/blog\/wp-content\/uploads\/2021\/06\/09-Matej-Martinc.png","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/www.jurecuhalev.com\/blog\/wp-json\/wp\/v2\/posts\/2416","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.jurecuhalev.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.jurecuhalev.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.jurecuhalev.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.jurecuhalev.com\/blog\/wp-json\/wp\/v2\/comments?post=2416"}],"version-history":[{"count":6,"href":"https:\/\/www.jurecuhalev.com\/blog\/wp-json\/wp\/v2\/posts\/2416\/revisions"}],"predecessor-version":[{"id":2544,"href":"https:\/\/www.jurecuhalev.com\/blog\/wp-json\/wp\/v2\/posts\/2416\/revisions\/2544"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.jurecuhalev.com\/blog\/wp-json\/wp\/v2\/media\/2421"}],"wp:attachment":[{"href":"https:\/\/www.jurecuhalev.com\/blog\/wp-json\/wp\/v2\/media?parent=2416"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.jurecuhalev.com\/blog\/wp-json\/wp\/v2\/categories?post=2416"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.jurecuhalev.com\/blog\/wp-json\/wp\/v2\/tags?post=2416"}],"curies":[{"
name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}