Use of Natural Language Processing Techniques in Suicide Research

It is estimated that each year many people, most of whom are teenagers and young adults, die by suicide worldwide. Suicide receives special attention, with many countries developing national strategies for prevention. Because much medical information is available as text, countering the growing trend of suicide in communities requires analyzing various textual resources, such as patient records, information on the web and questionnaires. To this end, this study systematically reviews recent studies on the use of natural language processing (NLP) techniques in research on the health of people who have died by suicide or are at risk of it. After an electronic search of the PubMed and ScienceDirect databases and screening of the articles by two reviewers, 21 articles met the inclusion criteria. The review shows that, provided a suitable data set is available, natural language processing techniques are well suited to various types of suicide-related research.

and community [4]. Moreover, it differs across communities, which is why the issue has drawn so much attention worldwide. Substantial efforts are usually made at the national level to prevent suicide from occurring or spreading [10].
In its first Mental Health Action Plan report in 2014 [11], WHO maintained that better data in hospital registries are key to reducing the rate of suicide [4]. So far, reliance on old methods of public health monitoring has made it hard to investigate issues related to suicide [4]. It is therefore essential to use cost-effective tools and strategies to collect and interpret the data required for suicide prevention [4]. Researchers believe that suicidal thoughts often lead to the act itself, which is why it is essential to identify those who have such thoughts [1]. Computational methods such as NLP can use Electronic Health Record (EHR) data to identify those prone to suicide [4]. Other key textual sources that help to predict suicide are notes, interviews and questionnaires completed by people who later died by suicide. Suicide notes help to reveal their motivations and thoughts [5]. NLP-based tools can likewise be applied to identify those at risk of suicide or under mental pressure and to prevent undesirable consequences [4].
NLP deals with extracting structured data from free text, that is, text that does not follow a particular format. Knowledge-based NLP algorithms rely on terminologies and rules developed by experts. In recent years, however, the availability of annotated clinical texts has drawn researchers towards machine learning (ML) algorithms, which are more flexible and less time-consuming to develop. Knowledge-based algorithms nevertheless remain more suitable for solving particular problems. A third approach combines the benefits of the first two and is accordingly known as the hybrid method [12,13]. NLP data extraction methods are less costly and less time-consuming, which has helped them to achieve much in different health-related domains [4,13].
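To make the knowledge-based approach concrete, the following is a minimal sketch of rule-based extraction from a free-text note. The lexicon, the concept names and the negation cues are hypothetical placeholders chosen for illustration; they are not taken from any of the reviewed systems, and a real system would rely on expert-curated terminologies.

```python
import re

# Hypothetical expert-curated lexicon: concept -> surface patterns (illustrative only).
LEXICON = {
    "suicidal_ideation": [r"suicidal ideation", r"thoughts? of suicide"],
    "self_harm": [r"self[- ]harm", r"overdose"],
}

# Simple negation rule: a cue word shortly before the term, within the same sentence.
NEGATION = re.compile(r"\b(no|denies|denied|without)\b[^.]{0,30}$", re.IGNORECASE)


def extract_concepts(note: str) -> dict:
    """Return {concept: 'present' | 'negated'} for each concept mentioned in the note."""
    found = {}
    for concept, patterns in LEXICON.items():
        for pattern in patterns:
            for match in re.finditer(pattern, note, re.IGNORECASE):
                preceding = note[:match.start()]
                status = "negated" if NEGATION.search(preceding) else "present"
                # An affirmed mention anywhere in the note outranks a negated one.
                if found.get(concept) != "present":
                    found[concept] = status
    return found


note = "Patient denies suicidal ideation. History of overdose in 2015."
print(extract_concepts(note))
```

The output is a small structured record derived from unstructured text. An ML-based alternative would learn such patterns from annotated notes instead of relying on hand-written rules, and a hybrid method would combine the two.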
The present study aimed to systematically review investigations from the past ten years that used NLP techniques on issues related to suicide. The remainder of this paper is organized in four sections. The second section describes the method used to search for and select the articles. The third section presents the results, and the fourth discusses the findings in the light of the related literature. The final section summarizes the findings and makes suggestions for further research.

2-Method
This paper is a systematic review, a design chosen to ensure that the search and retrieval process is accurate and comprehensive. Articles published in the last ten years in the PubMed and ScienceDirect electronic databases were searched using various combinations of related keywords. Tables 1 and 2 show the search query used in this study and the inclusion/exclusion criteria, respectively.

Table 2. Inclusion and exclusion criteria
Inclusion criteria: articles that used NLP to analyse texts; articles published in English; articles published during the last ten years.
Exclusion criteria: studies for which the full text was not available; newspapers, reviews, letters to the editor, workshops, posters, short reports, books and theses; articles written in a language other than English.

A total of 39 papers were identified in the first stage using the search query (Table 1). Duplicate articles were removed automatically using EndNote, and the result was checked manually. Based on the inclusion and exclusion criteria, two reviewers independently screened the titles, abstracts and then full texts of the articles identified through the search, with discrepancies resolved by consensus. Finally, 21 publications were included on the basis of the eligibility criteria. The agreement between the two reviewers was measured with the kappa (κ) statistic, which was equal to 0.81 (P<0.001). Figure 1 shows the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) flowchart of study selection.
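As an illustration of how an inter-reviewer agreement figure of this kind can be obtained, the sketch below computes Cohen's kappa for two raters. It assumes that Cohen's kappa is the statistic meant, which the text does not state explicitly, and the screening decisions in the example are invented purely for demonstration; they are not the reviewers' actual decisions.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items on which the raters gave the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Invented screening decisions for ten articles (illustrative only).
reviewer_1 = ["include"] * 4 + ["exclude"] * 6
reviewer_2 = ["include"] * 3 + ["exclude"] * 7
print(round(cohens_kappa(reviewer_1, reviewer_2), 2))
```

Kappa corrects the raw agreement rate for the agreement expected by chance, which is why it is preferred over simple percentage agreement when the two categories are unevenly distributed.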

3-Results
As previously mentioned in the introduction, the selected articles reviewed here were analysed from various aspects including the purpose of research, data set, setting, methodology and the main results.
As already described, suicide notes are mostly affective in content and can offer insight into the writer's state of mind. Some studies therefore addressed this issue [9,14]. Those who attempt suicide and are taken to a hospital emergency unit are likely to attempt suicide again. Notes written by those who die by suicide differ widely from notes written by those who only simulate doing so. The study in [9] aimed to classify completer and simulated suicide notes. To this end, three experts searched the content for all emotional concepts (anger, depression, affection, etc.) according to an ontology and annotated them. Five experts then divided these annotations into two groups, genuine and simulated. These texts were then used to train machine learning (ML) algorithms on the linguistic and emotional characteristics of the notes. The results showed that ML algorithms could distinguish notes by people simulating suicide from notes by those who really intended it. ML algorithms also proved useful for the prospective clinical analysis of mentally ill people who not only intend suicide but also think of homicide. The authors of the other paper [14] examined how well different ML algorithms performed compared with humans (practising mental health professionals and psychiatry physician trainees) who were asked to distinguish between elicited and genuine suicide notes.
Sentiment analysis (SA) has recently become a major area of research in computational linguistics. Emotion recognition is a type of SA used to detect emotional fragments in free text [15]. The NLP challenge organized by Informatics for Integrating Biology and the Bedside (i2b2) in 2011 dealt with SA of suicide notes [16]. Many researchers participated in this challenge, and the results were published in several papers [3,5,15,17-23].
The challenge provided researchers with 900 notes written by people who died by suicide. The content of 600 notes was annotated according to 15 emotion classes such as fear, sin and anger [16].
As biomedical relations are highly involved in biological processes, the study of interactions within biology has attracted research interest. Many attempts have been made to extract biological relationships such as protein-protein interactions or gene-disease associations. Knowing how genetic factors are involved in the occurrence of diseases can help to develop new techniques for prevention, diagnosis and treatment. According to the related literature, genetics is a risk factor for suicide. Most biomedical facts are available in free text such as articles, so text mining techniques can be used to extract the required data. Compared with other diseases, gene-suicide associations are hard to establish experimentally. Few samples in the existing databases contain information about suicide and the relevant genes, so fully supervised methods are not feasible. Quan [10] therefore used unsupervised and semi-supervised methods to find gene-suicide associations.
People with depression are often prone to suicide and are frequently treated in primary care. Close monitoring of suicidal patients in primary care, and of how they behave, can therefore reduce the rate of suicide. Existing EHR data can contribute to this monitoring; however, the EHR rarely records how often a patient has attempted suicide. In [24], the content of the Patient Health Questionnaire (PHQ-9) regularly completed by suicidal patients and of the history of present illness (HPI) notes field in the EHR was analysed and compared with the diagnostic codes recorded in the EHR.
Correct and comprehensive data in the EHR facilitate many kinds of investigation. The paper [2] explored whether an automatic EHR analysis system could yield more precise annual estimates of the national suicide rate than manual analysis. To this end, all records were first coded by an automatic system based on the Unified Medical Language System (UMLS). Seven ML algorithms were then used to classify the reason (suicide vs. non-suicide) why people were taken to the hospital emergency unit.
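The two-step pipeline described here (coding each record into concepts, then classifying it) can be sketched with a small multinomial naive Bayes classifier. Naive Bayes is used only as a familiar example, since the text does not name the seven algorithms, and the concept codes and training records below are invented stand-ins rather than real UMLS identifiers or data from [2].

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(records, labels):
    """records: list of concept-code lists; labels: parallel list of class names."""
    class_counts = Counter(labels)
    code_counts = defaultdict(Counter)  # class -> frequency of each code
    vocabulary = set()
    for codes, label in zip(records, labels):
        code_counts[label].update(codes)
        vocabulary.update(codes)
    return class_counts, code_counts, vocabulary


def predict(model, codes):
    """Return the class with the highest log-posterior under add-one smoothing."""
    class_counts, code_counts, vocabulary = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total)  # class prior
        denominator = sum(code_counts[label].values()) + len(vocabulary)
        for code in codes:
            if code in vocabulary:  # codes never seen in training are ignored
                score += math.log((code_counts[label][code] + 1) / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented, already-coded emergency records (hypothetical concept names).
records = [["overdose", "self_harm"], ["laceration", "self_harm"],
           ["fracture", "fall"], ["chest_pain"]]
labels = ["suicide", "suicide", "non-suicide", "non-suicide"]
model = train_naive_bayes(records, labels)
print(predict(model, ["overdose"]), predict(model, ["fall", "fracture"]))
```

In practice the coded records would come from the automatic UMLS-based step, and several classifiers would be compared on held-out data rather than on a toy example.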
Symptoms of depression can range from temporary sadness to suicide. Shyness, embarrassment and the social stigma of suicide dissuade people from asking for help. Current forms of social media, especially internet forums and weblogs, offer the chance to discuss feelings in a safe and confidential environment. However, most members are not capable of judging the severity of their depression and their need for treatment. Online diagnosis, the target of Karmen's study [25], is a strategy for finding those at risk, facilitating the right and timely intervention and improving public health. Similar investigations using web-based data have been conducted in Korea. For instance, a Korean study [1] explored the search queries on web-based social media commonly used by adolescents and young adults, and analysed not only the risk of suicide but also its relation to the monthly employment rate, the rental prices index, the youth suicide rate and the number of bullying victims. Woo et al. [26] analysed Twitter content to explore social behaviour in response to the sinking of the Sewol ferry. Disasters often have a tremendous influence on mental health. However, most investigations only explore the effects of disasters on those directly affected, e.g. victims or their families, and few studies have addressed the broad effect of disasters on public mental health. The data available on social media can help to analyse public behaviour and estimate the rate of suicide after a disaster occurs.
As there is a high risk of suicide after discharge from hospital, a system was proposed by McCoy [27] to reduce the rate of post-discharge suicide. NLP was initially used in this system to extract the positive and negative (suicide-related) terms. Then, a regression analysis was performed to estimate suicide risk for each individual. The results were also compared to what was recorded in patients' follow-up records as the cause of death.
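The two stages described here, term extraction followed by regression, can be sketched as follows. The word lists, the logistic form of the model and the coefficient values are all assumptions made for illustration; in a real study the coefficients would be fitted to outcome data, and the text does not specify which regression model McCoy [27] used.

```python
import math
import re

# Hypothetical valence lexicons (illustrative only, not the terms used in [27]).
POSITIVE_TERMS = {"glad", "pleasant", "lovely"}
NEGATIVE_TERMS = {"gloomy", "hopeless", "unfortunate"}


def valence_counts(note):
    """Count positive and negative terms in a free-text note."""
    tokens = re.findall(r"[a-z]+", note.lower())
    positive = sum(token in POSITIVE_TERMS for token in tokens)
    negative = sum(token in NEGATIVE_TERMS for token in tokens)
    return positive, negative


def risk_score(note, intercept=-2.0, w_positive=-0.5, w_negative=0.8):
    """Logistic risk estimate; the default coefficients are placeholders, not fitted values."""
    positive, negative = valence_counts(note)
    z = intercept + w_positive * positive + w_negative * negative
    return 1.0 / (1.0 + math.exp(-z))


note = "Patient feels hopeless and gloomy, though glad to go home."
print(valence_counts(note), round(risk_score(note), 3))
```

With these placeholder signs, more negative terms raise the estimated risk and more positive terms lower it; the estimates would then be compared with the outcomes recorded at follow-up.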
Another system was developed by Cook [4] to predict suicide and severe mental symptoms in adults recently discharged from a hospital emergency unit or a mental clinic. To this end, 1453 young adults (>18 years) who had survived a suicide attempt and had just been discharged from an emergency unit or a mental clinic were sent, within a year, text messages containing a link to a structured questionnaire. The questionnaire covered sleep, well-being and anger. There was also one unstructured question: "How do you feel today?" Regression analysis was applied to the structured questionnaire, while NLP was used to analyse the answers to the unstructured question. The two sets of results were then compared.
Although the rate of suicide and self-harm is significantly lower during pregnancy, women who already suffer from mental disorders become more prone to suicide. However, few studies have addressed this issue and the risk factors involved are unknown. In [8], NLP was used to extract information from the medical records of pregnant women with mental disorders, namely history of domestic violence, substance abuse, suicidal ideation, and the location and method of self-harm. Statistical analysis, in particular regression analysis, was then performed to explore the prevalence of these injuries, their locations and methods. The main question was whether women who self-harmed during pregnancy differed from the others in demographic and clinical features. A summary of the articles explored is presented in Table 3.

Table 3. Summary of the articles explored

Anderson [24], US
Purpose: To estimate the use of diagnostic codes in EHRs to document suicidal ideation and attempt among patients seen in primary care.
Dataset: 32385 patient records
Method: Rule-based
Results: Agreement between the ICD-9 code for suicidal ideation and suicidal ideation documented in the clinician notes field, item 9 of the 9-item Patient Health Questionnaire (PHQ-9), or the logical disjunction of them was low (K-statistic = 0.036, 0.068 and 0.04 respectively).

Karmen [25], Germany
Purpose: Detection of depression symptoms and their frequency using NLP methods
Dataset: 1304 posts extracted from an internet forum
Method: Pattern matching
Results: Average F-measure = 79%, average recall = 75%, average precision = 84%

Korean study [1]
Results: The largest total effect was observed in the grade pressure to depression to suicide risk. A lower employment rate, a higher rental prices index, and more victims of bullying, but not monthly youth suicide rate, were associated with increased online search of suicide-related words.

Woo [26], South Korea
Purpose: To explore how the public mood in Korea changed following the Sewol disaster using Twitter data
Dataset: Daily Twitter posts (3 years before and 2 months after the Sewol disaster)
Method: Time series analysis
Results: In Korea, a human-made disaster can lead to an immediate increase in the suicidal preoccupation of the general public.

Taylor [8], UK
Purpose: Investigating the prevalence and correlates of self-harm in pregnant women with psychotic disorders
Dataset: 420 women, 33 with (group 1) and 387 without (group 2) a recorded self-harm event in their index pregnancy
Method: Regression
Results: Self-harm in pregnancy was associated with younger age, a history of child abuse or domestic violence, current (i.e. during pregnancy) domestic violence, a history of self-harm in the 2 years preceding pregnancy, substance misuse, smoking, non-affective disorder, acute admissions in the 2 years preceding pregnancy and stopping or switching rather than continuing a maintenance medication in the first trimester of pregnancy. The majority of self-harm events took place at home (73.1%) and through overdoses (38.5%). There were no differences in age, ethnicity, diagnosis and admission rate between the two groups.

Suicide ideation (heightened psychiatric symptoms) prediction:

4-Discussion
Although most people who die by suicide have previously visited doctors or specialists, their suicide is hard to predict [27]. Those who survive a suicide attempt are at high risk of making a new attempt (12-30%) or completing suicide (1-3%) within the following year [2]. An analysis of suicide notes can partly help to understand their state of mind [3,5,9,14,15,18-23]. Even in the absence of notes, questionnaires can be used to explore their emotions [4,24]. Notes and questionnaires are both free text with no particular format, so NLP techniques can be used for data extraction and have proved effective in doing so. In addition, online platforms are commonly used today to express and exchange feelings openly. Adolescents and young adults often create online content and publish it on social networks and websites. To prevent suicide, the risks need to be analysed at an early stage, with the help of the reliable and supportive people to whom those at risk turn. Because the volume of data on social networks is large, those who wish to access and investigate them (e.g. suicide prevention agencies or health domain experts) cannot monitor them manually and consistently, so automatic systems are needed. IT-based solutions have succeeded in identifying these behaviours and have managed to query the data confidentially so as to protect privacy [17]. NLP-based models have proved effective in analysing the large amounts of publicly available data produced on social media (such as Facebook and Twitter) [1,25,26].
To work effectively, NLP-based systems require comprehensive and precise data in patients' records. As mentioned in the previous section, if the data are sufficiently comprehensive, epidemiologic research can be conducted [2] and certain target groups can be monitored [8].
Unfortunately, because the existing texts differ in language, structure and type, it cannot be concluded which NLP method was the most efficient. However, standardized datasets such as the one developed by i2b2 can provide a basis for comparing different algorithms.

5-Conclusion
The present study was a systematic review of recent articles that used NLP to investigate suicide. It analysed the target articles from several aspects. Medicine is replete with textual data. Given the high rate of suicide worldwide, systems capable of processing textual data can help to identify trends, people at risk and the risk factors involved, and can facilitate the right intervention to prevent or reduce the risk of suicide.