text-graph is a data visualization of a network graph built from 18,000 texts, with weighted edges representing the degree of sentiment between named entities.
This is a port of a write-up for a project I made at 6Sense Data Hack in San Francisco. It won Judges' Favorite Hack, with winnings roughly worth my round-trip flight ticket (AUS - SFO), which is why I said "a little bit of profit." It was also featured on their company blog.
For a lot of data science-oriented tasks, the Anaconda distribution of Python is awesome. I used PyCharm for static type-checking and for quickly pulling up docstrings, since I was working with a lot of unfamiliar libraries.
I used cElementTree from the xml Python lib to get contact names, text, etc.
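The parsing code itself isn't in this write-up, but here's a minimal sketch of what it could look like, assuming an SMS-backup-style XML where each sms element carries its fields as attributes. The tag and attribute names are guesses, not the project's actual schema, and TextMessage is the class defined further down.

import xml.etree.cElementTree as ET

def get_texts_sketch(fn):
    """Sketch of the XML parsing. Tag/attribute names ('sms', '_id',
    'date', 'type', 'body', 'contact_name') are assumptions."""
    msgs = []
    for sms in ET.parse(fn).getroot().iter('sms'):
        msgs.append(TextMessage(msg_id=sms.get('_id'),
                                posix=sms.get('date'),
                                sent=sms.get('type') == '2',  # assumed: '2' means sent
                                body=sms.get('body'),
                                sender=sms.get('contact_name')))
    return msgs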
The corpus itself is a thin object over all the contacts:

class TextCorpus(object):
"""It's a graph."""
def __init__(self):
self.contacts = get_contact_objs()
self.contact_names = names.values()
self.global_adj_list = [item for sublist in
map(lambda x: getattr(x, 'adj_list'), self.contacts)
for item in sublist]
self.global_pair_map = dict(zip(map(lambda x: getattr(x, 'name'), self.contacts),
map(lambda x: getattr(x, 'adj_list'), self.contacts)))
self.global_pair_flat_list = map(lambda x: getattr(x, 'adj_list'), self.contacts)
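Hypothetically, that flattened edge list is what the visualization layer would consume; dumping it as JSON (the filename here is made up) is a one-liner:

import json

# Hypothetical export of all (contact, mention) edges for the graph view.
corpus = TextCorpus()
with open('edges.json', 'w') as out:
    json.dump(corpus.global_adj_list, out)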
from collections import Counter


class Contact(object):
    """A Contact obj reprs one of the people I've contacted."""

    def __init__(self, fn):
        self.name = get_name_from_fn(fn)
        self.messages = get_texts(fn, False)
        self.messages_str_list = [msg.body for msg in self.messages]
        # All contact names mentioned across this contact's messages, repeats included.
        self.links = get_mentions_msg_list(self.messages_str_list)
        self.unweighted_links = tuple(set(self.links))
        self.weighted_links = Counter(self.links)
        # One (self, mention) edge per unique person this contact mentions.
        self.adj_list = [(self.name, mention) for mention in set(self.links)]
        self.sentiment_links = get_sentiment_link_dict(self.weighted_links, self.name)
        self.avg_sentiment = (0.5 if not self.sentiment_links else
                              float(sum(self.sentiment_links.values())) / len(self.sentiment_links))
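As a quick hypothetical usage (the file path is made up): weighted_links is a Counter of mention frequencies, so each contact's weighted edges fall out directly.

# Hypothetical usage; the path is made up.
contact = Contact('texts/texts/alice.xml')
weighted_edges = [(contact.name, other, count)
                  for other, count in contact.weighted_links.items()]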
There's definitely a privacy concern here, since these are texts between my friends and me. So, for each Contact in the TextCorpus, grab their TextMessage objs. If a contact name appears in the body of a text, take a hex digest of the name and humanize the hash.
import hashlib

import humanhash


class TextMessage(object):
    """A TextMessage obj's attrs correspond to the XML tags I care about."""

    def __init__(self, msg_id, posix, sent, body, sender):
        self.msg_id = msg_id
        self.posix = posix
        self.sent = sent
        # Empty bodies get a harmless placeholder; real ones are anonymized.
        self.body = 'neutral' if body is None else anonymize(body.lower())
        self.mentions = get_mentions(self.body)
        self.sender = sender if not self.sent else 'rainier'


def anonymize(msg):
    """Anonymizes a single text body.

    e.g. 'Alice: Bob is awesome' -> 'zero-phone-quantum-anchorage: waffle-panda-theory-rushmore is awesome'
    """
    # `names` maps each real (lowercased) name to its humanized hash.
    for orig, hmhashed in names.iteritems():
        msg = msg.replace(orig, hmhashed)
    return msg


def digest_and_humanize(word):
    """MD5 the word and render the hex digest as four memorable words."""
    m = hashlib.md5()
    m.update(word)
    return humanhash.humanize(m.hexdigest())
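The names dict that anonymize() iterates over isn't shown in the write-up; one plausible construction, assuming contact_names holds the real lowercased names pulled from the XML:

# Assumed construction of the `names` mapping: real name -> humanized hash.
names = {name: digest_and_humanize(name) for name in contact_names}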
In the interest of time at the hackathon, I used the pre-trained models behind Indico's high-quality sentiment API on my short texts. They report about 93% accuracy on the IMDB movie review corpus.
Aside: There are lots of papers and tutorials on sentiment analysis algorithms using machine learning. Some 'not really machine learning' methods involve using a list of 'happy' words such as 'awesome' and 'great' to bump the sentiment score up (a toy version is sketched below). You could also use classifiers such as Naïve Bayes or topic modeling techniques. But anyway...
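To make that aside concrete, here's a toy word-list scorer. The word lists and the scaling are made up; this is not what the project used.

# Toy word-list sentiment: count happy vs. sad words, squash into [0, 1].
HAPPY = {'awesome', 'great', 'love'}
SAD = {'terrible', 'awful', 'hate'}

def word_list_sentiment(text):
    words = text.lower().split()
    score = sum(w in HAPPY for w in words) - sum(w in SAD for w in words)
    return 0.5 + 0.5 * score / float(max(len(words), 1))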
Since I can only hit the Indico API 100,000 times for free, I had to find a way to store the API call results persistently, and also to unit test against them. I could have used a database, but for now pandas has good tools for parsing CSVs.
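For example, reading the cached results back for tests might look like this. The column names are my assumption, mirroring the writerow order in the batch job below.

import pandas as pd

# Assumed column names, mirroring the writerow() order below.
SENTIMENT_COLS = ['msg_id', 'posix', 'sent', 'body', 'mentions', 'sentiment']

def load_cached_sentiments(path='sentiments.csv'):
    return pd.read_csv(path, names=SENTIMENT_COLS)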
import csv

from indicoio import sentiment_hq  # indico.io's Python client; import omitted in the original

from texts.config import INDICO_API_KEY


def indico_batch_sentiment():
    """A ONE-OFF method to call the indico.io API to HQ batch sentiment 18192 texts.

    Stores it as sentiments.csv in the working dir.
    """
    with open('sentiments.csv', 'wb') as f:
        texts = []
        writer = csv.writer(f)
        with open('texts/filenames.txt', 'r') as filenames:
            fn_list = ['texts/texts/' + line.strip() for line in filenames]
        for fn in fn_list:
            texts.append(get_texts(fn))  # returns a list of TextMessage objs
        texts = [item for sublist in texts for item in sublist]
        with open('indico_sentiment_hq_errors.txt', 'w') as error_log:
            for text in texts:
                sentiment_result = None
                try:
                    sentiment_result = sentiment_hq(text.body.encode(), api_key=INDICO_API_KEY)
                except BaseException as e:
                    error_log.write(str(e))
                finally:
                    # Write the row even if the API call failed, so errored
                    # texts just get None in the sentiment column.
                    writer.writerow([unicode(s).encode('utf-8') for s in
                                     [text.msg_id, text.posix, repr(text.sent),
                                      text.body, repr(text.mentions), sentiment_result]])
I'm mixing pandas and the csv module here; the caching could probably be factored into the sentiment code somehow... The code is here. Even though I haven't touched it in a while, I'm always dreaming of things to add on to it.