The Endlessly Painful Road of Python and R

After I finished scraping data from the Bilibili website with Python last week, I began to think about how to visualize it. I searched a lot and also talked with Professor Cairo and Clay to get professional advice. The truth is, professional advice means more and more work.

In the beginning, I wanted to use a word cloud to visualize the frequency of the keywords in the bottom comments. However, I felt a bubble chart might be a better choice. In a word cloud, the higher the frequency, the bigger the word’s font size. But a longer word takes up more space, so it can look bigger than a shorter word even when they are in the same font size. For example, “The Twelve Kingdoms” is longer than “myself”, which might influence people’s judgment. In a bubble chart, the larger the bubble, the higher the frequency, so it’s easy to compare frequencies without the distraction of keyword length.

“The Twelve Kingdoms” looks bigger than “myself”, even if they are in the same font size.
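To make the comparison concrete, here is a minimal Python sketch of how keyword counts could be turned into bubble sizes. The keyword list is made up for illustration (it is not the scraped data), and `bubble_radius` and its `scale` parameter are hypothetical names. The radius grows with the square root of the count, so the bubble’s area, not its radius, is proportional to the frequency, and the length of the word plays no role.

```python
import math
from collections import Counter

# Made-up keywords for illustration only; not the real scraped comments.
keywords = ["myself", "The Twelve Kingdoms", "myself", "Youko",
            "myself", "The Twelve Kingdoms", "Youko", "myself"]

def bubble_radius(count, scale=10.0):
    """Hypothetical helper: radius chosen so that area grows linearly with count."""
    return scale * math.sqrt(count)

# Count how often each keyword appears.
frequencies = Counter(keywords)

# Map every keyword to a bubble radius based only on its count.
radii = {word: bubble_radius(n) for word, n in frequencies.items()}

for word, n in frequencies.most_common():
    print(f"{word!r}: count={n}, radius={radii[word]:.1f}")
```

In this toy data, “The Twelve Kingdoms” and “Youko” have the same count, so they get the same radius even though one label is much longer than the other.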

After choosing the bubble chart, I searched for some examples. I loved the force-directed bubble charts: the bubbles look like they drift slightly towards each other under gravitational force. The force-directed tree bubble chart is also fun, and its advantage is that it shows the hierarchy of the data. But I had no idea how to find the hierarchy in my data, so I considered manually categorizing the bubbles into different groups, including characters, specific nouns (Qi, Shitsudou), strong emotions and so forth.

My original proposal was full of ambition. I wanted to scrape data from 3 different sources:

Different bubble charts (force-directed, force-directed tree, packed)

At that point, I thought I just needed to make those bubbles appear, add detailed tooltips to explain them, and I would be done. However, Professor Cairo told me I needed to show the proximity of those keywords. When he said “proximity”, my face was w(゚Д゚)w ??? “Proximity, what’s that?”

And he smiled at me: “You need to show the relationship among those bubbles, instead of letting them fly around each other randomly. Find a way to do that.”

Okay, I would do that. With this question in mind, I began to search for how to find the relationship between words. I quickly found some ways to do that. They are:

Since my data doesn’t share the same beginning of a sentence, as the project “why do cats & dogs…” does, I felt the relationship between my keywords is not parent and child, so hierarchical clustering might not be the right choice. At the same time, I don’t want to just measure document similarity. For example, “Youko” and “Kirin” are not similar, but they are highly related to each other because they are the hero and heroine. So I asked my classmate Deb, who did a wonderful project about Anne Sexton’s poems last semester, for help. She recommended that I try n-grams (a concept from Natural Language Processing).

Common bigrams in Jane Austen’s novels by n-grams in R
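I’m learning this in R, but the idea behind bigrams is simple enough to sketch in Python too. The snippet below is only an illustration under my own assumptions: the two sample comments are invented, the tokenizer just lowercases and keeps runs of English letters, and `tokenize` and `bigrams` are hypothetical helpers. It counts how often two words appear right next to each other, which is one way to express the proximity between keywords.

```python
import re
from collections import Counter

# Invented sample comments for illustration; not the real scraped data.
comments = [
    "Youko and the Kirin travel together",
    "the Kirin chose Youko and the Kirin stayed",
]

def tokenize(text):
    """Naive tokenizer (an assumption): lowercase, keep runs of letters a-z."""
    return re.findall(r"[a-z]+", text.lower())

def bigrams(tokens):
    """Hypothetical helper: pair every word with the word that follows it."""
    return list(zip(tokens, tokens[1:]))

bigram_counts = Counter()
for comment in comments:
    # Count pairs one comment at a time so no pair spans two comments.
    bigram_counts.update(bigrams(tokenize(comment)))

# Show the most frequent adjacent word pairs.
for pair, n in bigram_counts.most_common(3):
    print(pair, n)
```

Pairs that show up often, such as ("the", "kirin") in this toy example, could then be drawn closer together or linked in the chart. Comments written in Chinese would need a proper word segmenter in place of this simple regex.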

That’s where I am now: learning n-grams in R. Luckily, I’m not lagging behind. Deb also encouraged me, saying this is the most difficult part of the text analysis and that things will go much faster after it.

That’s it. Keep working;)