Named-Entity Recognition (NER)
The character-modification preprocessing step improves the quality of the full text. In the next step, we analyze the unstructured text to identify important details, including the names of locations, persons, organizations, events, and concepts.
What is NER?
NER is an information extraction task that locates named entities in text and classifies them into predefined categories. With NER, we can extract the recognized entities and their associated information from the API’s responses. As before, we use the ChatGPT-3.5 API for this task.
Few-Shot Learning – A Useful Prompt Engineering Technique
In this task, we use few-shot learning, a popular prompt engineering technique. The motivation is simple. Consider two scenarios: an interviewee is given a question (1) together with a sample input and output as an example, or (2) without any further information. We expect the interviewee in the first scenario to give an answer that is better aligned with the interviewer's expectations. The same applies to LLMs: providing a sample input-output pair in addition to the original prompt has been shown to help LLMs produce more satisfactory answers. The drawback is the additional cost incurred by the extra input tokens.
Since the entity relationships generated here are vital for building the knowledge graph in the next step, we need the relationships extracted from the text to be of high quality. We observe that one-shot learning (a single example) is sufficient: the majority of the output is in the correct format and aligns with our expectations.
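The structure of a one-shot prompt can be sketched as follows. This is a minimal illustration, not the code used in this project: the function name build_one_shot_prompt and the placeholder strings are assumptions, and the section labels simply mirror the wording of the prompt shown in this section.

```python
def build_one_shot_prompt(instruction: str, example_text: str,
                          example_output: str, article: str) -> str:
    """Assemble a one-shot prompt: task description, one worked example, new input."""
    return "\n".join([
        instruction.strip(),
        "For example,",
        "Example text:",
        example_text.strip(),
        "The format of the answer is (and only contains):",
        "Output:" + example_output.strip(),
        "End of example.",
        "Now, perform the same for the following news article.",
        article.strip(),
        "Output:",  # the model is expected to continue from here
    ])


# Placeholder values (hypothetical); the real instruction and example are much longer.
prompt = build_one_shot_prompt(
    instruction="Please perform named entity recognition on the following news article text.",
    example_text="<example article>",
    example_output='{"nodes": [], "relationships": []}',
    article="<new article>",
)
print(prompt)
```

Every extra example lengthens the prompt, which is where the additional input-token cost mentioned above comes from.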
To achieve NER, the following prompt is given to ChatGPT:
Text 2: Please perform named entity recognition on the following news article text and identify the relationships between the entities. Specifically, I’m interested in relationships involving Person, Organization, Location and Event that are relevant to the context of the news article. Output the entities as new nodes created in neo4j using Cypher. For each relationship, please describe the nature of the relationship using the Cypher query language as well using verbs or verb phrases in english. For example,
Example text:
政府推動的大型盛事之一、香港設計中心策展的「Chubby Hearts Hong Kong」在2月14日情人節啟動。由著名設計師Anya Hindmarch構思的直徑12米巨型紅色Chubby Hearts,率先於中環皇后像廣場花園飄浮,而直徑3米的Chubby Hearts也於另外三個地方「快閃」展示。
活動會一直舉行至2月24日正月十五「中國情人節」元宵節。期間「大心」、直徑12米的Chubby Hearts會長駐中環皇后像廣場花園,而「細心」、直徑3米的Chubby Hearts則會每日在不同地方「快閃」飄浮,供市民打卡。
The format of the answer is (and only contains):
Output:{
"nodes":[
"CREATE (anya:Person {name: 'Anya Hindmarch', role: '著名設計師'})",
"CREATE (chubby:Event {name: 'Chubby Hearts Hong Kong'})",
"CREATE (香港政府:Organization {name: '香港政府'})",
"CREATE (香港設計中心:Organization {name: '香港設計中心'})",
"CREATE (皇后像廣場花園:Location {name:'皇后像廣場花園'})",
"CREATE (中環:Location {name:'中環'})"
],
"relationships": [
"CREATE (chubby)-[:LOCATED_IN]->(皇后像廣場花園)",
"CREATE (皇后像廣場花園)-[:LOCATED_IN]->(中環)",
"CREATE (香港政府)-[:PROMOTED]->(chubby)",
"CREATE (anya)-[:DESIGNED]->(chubby)",
"CREATE (香港設計中心)-[:ORGANIZED]->(chubby)"
]
}
End of example.
Now, perform the same for the following news article. Try to be aggresive and it is always better to include all proper nouns as entities.
Output:
In the above prompt, an example news article and its expected output are given. The output consists of Cypher statements; Cypher is the query language of Neo4j, a graph database that is also useful for graph visualization.
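A reply in this format can be turned into Cypher with a few lines of Python. The sketch below is an assumption about how such post-processing might look, not the project's actual code: parse_ner_reply and to_single_query are hypothetical helper names, and the sample reply is an abbreviated version of the example in the prompt. It assumes the reply contains exactly one well-formed JSON object.

```python
import json


def parse_ner_reply(reply: str) -> tuple[list[str], list[str]]:
    """Extract the node and relationship Cypher statements from a model reply."""
    # Keep only the outermost {...}; the reply may start with "Output:" or other text.
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in reply")
    data = json.loads(reply[start:end + 1])
    return list(data.get("nodes", [])), list(data.get("relationships", []))


def to_single_query(nodes: list[str], relationships: list[str]) -> str:
    """Join all statements into one query so variables such as `anya` stay in scope."""
    return "\n".join(nodes + relationships)


# Abbreviated sample reply, based on the example in the prompt.
reply = '''Output:{
"nodes": [
"CREATE (anya:Person {name: 'Anya Hindmarch'})",
"CREATE (chubby:Event {name: 'Chubby Hearts Hong Kong'})"
],
"relationships": [
"CREATE (anya)-[:DESIGNED]->(chubby)"
]
}'''
nodes, relationships = parse_ner_reply(reply)
query = to_single_query(nodes, relationships)
print(query)
```

The statements are joined into a single query because a Cypher variable such as anya or chubby is only visible within the query that defines it; running the relationship statements separately would create new, unconnected nodes instead of linking the existing ones.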
An example of the result of NER is shown below:
Suppose we have the example text (after text modification):
友誼之窗
袁秀蘭:女,現讀某校中學,因感自己見識淺陋,願和各地男女青年結為筆友,兀相交流知識。我愛好的是:文藝、電影、雜誌、各地風景相片。通訊處:香港英皇道木星街十二號地下。
李健興、楊洋、彭藝:我們都是戲劇藝術的愛好者,擬徵求相同愛好的男女青年為友,共同研究。通訊處:香港軒尼詩道一六四號四樓源慶祥轉。
陳育德:我愛好與各地筆友通訊,及交換郵票、書籍、雜誌、風景相片等。請多來信賜教,信到必覆,決不食言。通訊處:香港中環利源東街廿一號三樓林君轉交。
The result of NER will be (simplified for the sake of presentation):
Person: 袁秀蘭, 李健興, 楊洋, 彭藝, 陳育德
Location: 香港, 英皇道木星街十二號地下, 軒尼詩道一六四號四樓源慶祥轉
Relationships (* modified for visualization):
(袁秀蘭)-[:RESIDES_IN]->(英皇道木星街十二號地下)
(李健興, 楊洋, 彭藝)-[:RESIDES_IN]->(軒尼詩道一六四號四樓源慶祥轉)
(陳育德)-[:RESIDES_IN]->(中環利源東街廿一號三樓林君轉)
(袁秀蘭)-[:INTERESTED_IN]->(文藝, 電影, 雜誌, 各地風景相片)
(李健興, 楊洋, 彭藝)-[:INTERESTED_IN]->(戲劇藝術)
(陳育德)-[:INTERESTED_IN]->(通訊, 交換郵票, 書籍, 雜誌, 風景相片)
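A simplified "Label: names" listing like the one above can be derived from the raw node statements. The following is a rough sketch under stated assumptions: summarize_nodes and NODE_PATTERN are hypothetical names, the regular expression only handles statements shaped like those in the prompt's example, and the input reuses the node statements from that example rather than real model output.

```python
import re
from collections import defaultdict

# Matches statements of the form: CREATE (var:Label {name: 'Some Name', ...})
NODE_PATTERN = re.compile(r"CREATE \((\w+):(\w+) \{name:\s*'([^']*)'")


def summarize_nodes(statements: list[str]) -> dict[str, list[str]]:
    """Group entity names by label, e.g. {"Person": [...], "Location": [...]}."""
    summary: dict[str, list[str]] = defaultdict(list)
    for statement in statements:
        match = NODE_PATTERN.search(statement)
        if match:  # statements that do not fit the assumed shape are skipped
            _, label, name = match.groups()
            summary[label].append(name)
    return dict(summary)


# Node statements taken from the one-shot example in the prompt.
nodes = [
    "CREATE (anya:Person {name: 'Anya Hindmarch', role: '著名設計師'})",
    "CREATE (chubby:Event {name: 'Chubby Hearts Hong Kong'})",
    "CREATE (香港政府:Organization {name: '香港政府'})",
    "CREATE (香港設計中心:Organization {name: '香港設計中心'})",
    "CREATE (皇后像廣場花園:Location {name:'皇后像廣場花園'})",
    "CREATE (中環:Location {name:'中環'})",
]
summary = summarize_nodes(nodes)
for label, names in summary.items():
    print(f"{label}: {', '.join(names)}")
```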
Code
The code is provided as follows: