Sentiment Analysis
Sentiment Value
Sentiment Value is the score a sentence receives from sentiment analysis. It lies in [-1, 1] and represents the attitude the author expresses in that sentence. Because of technical requirements, the minimum unit of sentiment analysis is one sentence.
Target: Produce a diagram of sentiment changes over 1950-1985.
As a politically oriented newspaper, 《天文臺》 was not satisfied with simply reporting the news. Its chief editor, 陳孝威, often published his own articles to state his attitude toward various matters and to predict future events. So, by analysing the attitude of 《天文臺》 toward different objects and observing how the sentiment changes, we can find (in some ways) how media in HK regarded the same object in different periods.
Preprocessing
We need to put the data into a format suitable for sentiment analysis. First, put the content of each year into one text file. Second, split the whole document into sentences. Third, run word segmentation to find the words contained in each sentence (each word is tagged with its part of speech for later use).
Because we want to find the attitude towards an “Object”, we select only the words with a noun tag.
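The second and third steps can be sketched as follows. This is a minimal illustration, not the project's actual preprocessing code: the set of sentence-ending marks, the helper names `split_sentences` and `nouns_only`, and the convention that noun tags start with "n" are all assumptions, and the segmenter that produces the (word, tag) pairs is not shown.

```python
import re

# Assumed set of sentence-ending marks (full-width and ASCII).
_SENTENCE = re.compile(r"[^。！？!?]+[。！？!?]?")


def split_sentences(text):
    """Split text into sentences, keeping the closing mark with each one."""
    return [s.strip() for s in _SENTENCE.findall(text) if s.strip()]


def nouns_only(tagged_words):
    """Keep words whose tag starts with 'n' (assumed noun-tag convention).

    tagged_words is a list of (word, tag) pairs produced by a segmenter.
    """
    return [word for word, tag in tagged_words if tag.startswith("n")]


if __name__ == "__main__":
    print(split_sentences("美國支持台灣。局勢如何？"))
    # Hypothetical segmenter output for the first sentence.
    print(nouns_only([("美國", "ns"), ("支持", "v"), ("台灣", "ns")]))
```

Keeping the closing mark attached to each sentence means the original text can be rebuilt by joining the pieces, which helps when checking the split by hand.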
Get Sentiment value
The minimum unit of the analysis is a sentence, but we want sentiment values for words, so we first computed the sentiment value of every sentence. Then we assigned the value of each sentence to all the nouns it contains. In this way, we obtained a file containing all the nouns and their corresponding sentiment values. Next, we computed summary statistics. There are many ways to do that; to preserve as much information as possible, we kept the totals and counts of positive and negative occurrences separately and got the following result:
The items are:
Object | Word_Frequency | Total_Positive_value | Appear_as_Positive | Total_Negative_value | Appear_as_Negative
美國 | 667 | 259.66 | 367 | -187.28 | 300
台灣 | 472 | 165.06 | 247 | -141.23 | 225
Table-1
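The aggregation that produces rows shaped like Table-1 can be sketched as below. It is an illustration under stated assumptions rather than the project's code: the function name `summarise` and the input shape (one `(sentiment_value, nouns)` pair per sentence) are hypothetical, and counting a value of exactly 0 as positive is an assumption the source does not settle.

```python
from collections import defaultdict


def summarise(scored_sentences):
    """Aggregate sentence scores per noun.

    scored_sentences: iterable of (sentiment_value, nouns) pairs.
    Returns rows in Table-1 column order:
    [object, word_frequency, total_pos, n_pos, total_neg, n_neg],
    sorted by word frequency, highest first.
    """
    # Per noun: [total_pos, n_pos, total_neg, n_neg]
    stats = defaultdict(lambda: [0.0, 0, 0.0, 0])
    for value, nouns in scored_sentences:
        for noun in nouns:
            entry = stats[noun]
            if value >= 0:  # assumption: zero is counted as positive
                entry[0] += value
                entry[1] += 1
            else:
                entry[2] += value
                entry[3] += 1
    rows = [
        [noun, e[1] + e[3], round(e[0], 2), e[1], round(e[2], 2), e[3]]
        for noun, e in stats.items()
    ]
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows


if __name__ == "__main__":
    demo = [(0.5, ["美國", "台灣"]), (-0.25, ["美國"]), (0.25, ["美國"])]
    for row in summarise(demo):
        print(row)
```

Storing totals and counts separately, instead of a single average, is what later lets the chart show both how often a word appears on each side and how strong the sentiment is.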
Then we combined the data from all years and generated a bar chart such as Figure-1. Each bar has two parts: the top orange part represents how often the word appears in a positive context, and the bottom blue part represents how often it appears in a negative context. The colour of the year label on the X-axis depends on the gap between the two totals: if Total_Positive_value - |Total_Negative_value| > 10 the label is red; if it is < -10 the label is blue; otherwise it remains black. You may also notice that the colour of the bars differs between years. That is because the shade is determined by Total_Positive_value / Appear_as_Positive (for the orange part) or Total_Negative_value / Appear_as_Negative (for the blue part): the closer the result is to 1 or -1, the darker the colour.
Figure-1 Sentiment Change Analysis of “美國”
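The two colouring rules described for Figure-1 can be sketched as small helpers. These are hypothetical functions written for illustration, not the `getPos` and `getNeg` helpers used by the project, whose implementation is not shown here. The sketch assumes the negative total is stored as a negative number, as in Table-1, and that a "darker" bar means blending from white towards a base colour.

```python
def tick_colour(total_pos, total_neg, threshold=10):
    """Colour of a year label from the net sentiment of that year.

    total_neg is assumed to be zero or negative, so the sum is the gap
    between the positive total and the size of the negative total.
    """
    net = total_pos + total_neg
    if net > threshold:
        return "#ff0000"  # clearly positive year: red
    if net < -threshold:
        return "#0000ff"  # clearly negative year: blue
    return "#000000"      # no clear tendency: black


def intensity(total, count):
    """Mean sentiment magnitude in [0, 1]; 0 if the word never appears."""
    if count == 0:
        return 0.0
    return min(abs(total / count), 1.0)


def shade(rgb, strength):
    """Blend from white (strength 0) to the base colour rgb (strength 1).

    rgb is a tuple of floats in [0, 1], a form Matplotlib accepts as a colour.
    """
    return tuple(1.0 - strength * (1.0 - c) for c in rgb)


if __name__ == "__main__":
    # Values for 美國 taken from Table-1.
    print(tick_colour(259.66, -187.28))
    print(shade((1.0, 0.5, 0.0), intensity(259.66, 367)))
```

Guarding against a zero count matters because a word may be absent in some years, in which case the average is undefined.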
Code
The code that computes the sentiment value is shown in the “Methodology” part; here we show the code that generates the bar chart.
The draw(word) function draws the bar chart for the input “word”. words[year] contains all the word information of a given year in Table-1 format.
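import matplotlib.pyplot as plt
import numpy as np

# years, words, package, getNeg and getPos are defined elsewhere in the project.


def draw(word):
    pos_sc = []  # Total_Positive_value per year
    neg_sc = []  # Total_Negative_value per year
    pos = []     # Appear_as_Positive per year
    neg = []     # Appear_as_Negative per year
    for year in years:
        # Take the first row of this year whose object contains the word.
        thing = []
        for item in words[year]:
            if word in item[0]:
                thing = item
                break
        if thing == []:
            # The word does not appear in this year: draw an empty bar.
            pos.append(0)
            pos_sc.append(0)
            neg.append(0)
            neg_sc.append(0)
            continue
        pos.append(thing[3])
        pos_sc.append(thing[2])
        neg.append(thing[5])
        neg_sc.append(thing[4])

    x = np.arange(len(years))
    plt.rcParams['font.sans-serif'] = ['SimHei']  # font that can render Chinese text
    plt.figure(figsize=(10, 8))
    # Negative counts at the bottom, positive counts stacked on top.
    plt.bar(x, neg, alpha=0.5, label='Negative', color=getNeg(neg_sc))
    plt.bar(x, pos, alpha=0.5, bottom=neg, label='Positive', color=getPos(pos_sc))
    plt.legend(loc='upper center')
    plt.xticks(x, years, fontsize=8)

    # Colour each year label by the gap between the positive and negative sides
    # (scores weighted by their counts).
    tick_colors = []
    for i in range(len(years)):
        tmp = neg_sc[i] * neg[i] + pos_sc[i] * pos[i]
        if tmp > 10:
            tick_colors.append('#ff0000')
        elif tmp < -10:
            tick_colors.append('#0000ff')
        else:
            tick_colors.append('#000000')
    for ticklabel, color in zip(plt.gca().get_xticklabels(), tick_colors):
        ticklabel.set_color(color)

    plt.xlabel("Years")
    plt.ylabel("Sentiment: Negative - Positive")
    # Print each count in the middle of its bar segment.
    for i in range(len(years)):
        plt.text(i, neg[i] / 2, str(neg[i]), ha='center', fontsize=8)
        plt.text(i, neg[i] + pos[i] / 2, str(pos[i]), ha='center', fontsize=8)
    plt.title(word)
    plt.grid(axis='y')
    # change this to save this in different loc
    plt.savefig('./Setiment Change/' + package + '/' + word + '.png')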