rainier ababao reads and writes

Mining 18000 texts for fun and no profit

Well, maybe a little bit of profit.

text-graph is a data visualization of a network graph based on 18,000 texts, with weighted edges representing the degree of sentiment between named entities.

This is a port of a write-up for a project I made at 6Sense Data Hack in San Francisco. This won Judges’ Favorite Hack and my round-trip flight ticket’s (AUS - SFO) worth, which is why I said a little bit of profit. It was also featured on their company blog.


For a lot of data science-oriented tasks, the Anaconda distribution of Python will be awesome. I used PyCharm for static type-checking and quick access to docstrings, since I was using a lot of unfamiliar libraries.

Data wrangling

  • Grab texts from Go SMS Pro (an alternative text-messaging app) backups (XML format)
  • Parse XML tree using cElementTree from the xml Python lib to get contact names, text, etc.
  • Create in-memory data structures in Python using data from the XML tree:
class TextCorpus(object):
    """It's a graph."""

    def __init__(self):
        self.contacts = get_contact_objs()
        self.contact_names = names.values()
        # One adjacency list of (name, mention) pairs per contact.
        # Despite the name, this one is a list of lists, not flattened.
        self.global_pair_flat_list = [c.adjacency_list for c in self.contacts]
        # Every edge in the graph, flattened into a single list.
        self.global_adj_list = [pair for adj in self.global_pair_flat_list for pair in adj]
        # Contact name -> that contact's adjacency list.
        self.global_pair_map = dict((c.name, c.adjacency_list) for c in self.contacts)

from collections import Counter

class Contact(object):
    """A Contact obj reprs one of the people I've contacted."""

    def __init__(self, fn):
        self.name = get_name_from_fn(fn)
        self.messages = get_texts(fn, False)
        self.messages_str_list = [msg.body for msg in self.messages]
        # Every mention of another contact, duplicates included.
        self.links = get_mentions_msg_list(self.messages_str_list)
        self.unweighted_links = tuple(set(self.links))
        self.weighted_links = Counter(self.links)
        # The second argument is missing from the original listing;
        # unweighted_links is assumed here.
        self.adjacency_list = [(self.name, x) for x in self.unweighted_links]
        self.sentiment_links = get_sentiment_link_dict(self.weighted_links, self.name)
        # Default to a neutral 0.5 when there are no links to average over.
        self.avg_sentiment = (0.5 if len(self.sentiment_links) == 0
                              else float(sum(self.sentiment_links.values())) / len(self.sentiment_links))
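
The XML parsing step from the list above isn't shown, so here is a minimal sketch of what it could look like. The tag names (sms, id, date, sent, address, body) and the parse_messages helper are assumptions made up for illustration; the real Go SMS Pro backup format may differ. It is written for Python 3, where xml.etree.ElementTree replaces cElementTree.

```python
import xml.etree.ElementTree as ET  # cElementTree on Python 2 installs

# Hypothetical backup snippet: the real Go SMS Pro tag names may differ.
SAMPLE_XML = """
<backup>
  <sms>
    <id>1</id>
    <date>1420070400</date>
    <sent>true</sent>
    <address>Alice</address>
    <body>see you at the hackathon</body>
  </sms>
  <sms>
    <id>2</id>
    <date>1420070460</date>
    <sent>false</sent>
    <address>Alice</address>
    <body></body>
  </sms>
</backup>
"""


def parse_messages(xml_text):
    """Return one dict per <sms> element, keeping only the fields of interest."""
    root = ET.fromstring(xml_text)
    messages = []
    for sms in root.iter('sms'):
        messages.append({
            'msg_id': sms.findtext('id'),
            'posix': int(sms.findtext('date')),
            'sent': sms.findtext('sent') == 'true',
            'sender': sms.findtext('address'),
            # findtext returns '' for an empty <body>; map that to None.
            'body': sms.findtext('body') or None,
        })
    return messages


for message in parse_messages(SAMPLE_XML):
    print(message)
```

Each dict carries the same fields that a TextMessage takes in its constructor, so the two could be wired together with a few lines of glue.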

Anonymizing data

There’s definitely a concern for privacy since these are texts between my friends…

For each Contact in TextCorpus, grab their TextMessage objs. If a contact name is in the body of the text, perform a hex digest of the name and humanize the hash.

import humanhash
import hashlib

class TextMessage(object):
    """A TextMessage obj's attrs correspond to the XML tags I care about."""

    def __init__(self, msg_id, posix, sent, body, sender):
        self.msg_id = msg_id
        self.posix = posix
        self.sent = sent
        self.body = 'neutral' if body is None else anonymize(body.lower())
        self.mentions = get_mentions(self.body)
        self.sender = sender if self.sent is False else 'rainier'

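def anonymize(msg):
    """Anonymizes a single text body.

    e.g. 'Alice: Bob is awesome' -> 'zero-phone-quantum-anchorage: waffle-panda-theory-rushmore is awesome'
    """
    # names maps each original contact name to its humanized hash.
    for orig, hmhashed in names.iteritems():
        msg = msg.replace(orig, hmhashed)
    return msg

def digest_and_humanize(word):
    # Feed the name into the hash; an empty md5() gives every name the same digest.
    m = hashlib.md5(word.encode('utf-8'))
    return humanhash.humanize(m.hexdigest())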
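
The names lookup that anonymize relies on isn't shown above. Here is a rough sketch of how such a mapping could be built, assuming it goes from lowercased contact names to pseudonyms. To keep it runnable without third-party packages, toy_humanize is a made-up stand-in for humanhash.humanize (it does not reproduce humanhash's algorithm or word list), and build_name_map and WORDS are hypothetical names too. Written for Python 3.

```python
import hashlib

# Stand-in word list; the humanhash library ships its own, much longer one.
WORDS = ['anchor', 'panda', 'quantum', 'waffle', 'zero', 'phone', 'theory', 'rush']


def toy_humanize(hexdigest, words=4):
    """Turn a hex digest into hyphen-joined words (not humanhash's algorithm)."""
    raw = bytes.fromhex(hexdigest)
    chunk = len(raw) // words
    picked = []
    for i in range(words):
        # Sum each chunk of bytes and use the total to pick a word.
        total = sum(raw[i * chunk:(i + 1) * chunk])
        picked.append(WORDS[total % len(WORDS)])
    return '-'.join(picked)


def build_name_map(contact_names):
    """Map each lowercased contact name to a stable pseudonym."""
    name_map = {}
    for name in contact_names:
        key = name.lower()
        digest = hashlib.md5(key.encode('utf-8')).hexdigest()
        name_map[key] = toy_humanize(digest)
    return name_map


def anonymize(msg, name_map):
    """Replace every known name in an already-lowercased message body."""
    for orig, pseudonym in name_map.items():
        msg = msg.replace(orig, pseudonym)
    return msg


names = build_name_map(['Alice', 'Bob'])
print(anonymize('alice: bob is awesome', names))
```

Because the pseudonym is derived from a hash of the name, the same contact gets the same label on every run, which keeps the graph's nodes stable.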

Sentiment analysis

In the interest of time at the hackathon, I used pre-trained models from Indico’s high-quality sentiment API on short texts. They report about a 93% accuracy rate on the IMDB movie review corpus.

Aside: There are lots of papers and tutorials on sentiment analysis algorithms using machine learning. Some ‘not really machine learning’ methods involve using a list of ‘happy’ words such as ‘awesome’ and ‘great’ to increase the sentiment. I could also have used classifiers such as Naïve Bayes, or topic modeling techniques. But anyway…
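
To make the word-list idea concrete, here is a toy sketch of a ‘not really machine learning’ scorer. The word lists and the lexicon_sentiment function are invented for illustration and have nothing to do with Indico's models; the 0.5 neutral value mirrors the default used for avg_sentiment earlier. Written for Python 3.

```python
# Tiny hand-made word lists, purely illustrative.
HAPPY_WORDS = {'awesome', 'great', 'love', 'good'}
SAD_WORDS = {'awful', 'terrible', 'hate', 'bad'}


def lexicon_sentiment(text):
    """Score text in [0, 1]: 0.5 is neutral, higher is more positive."""
    # Lowercase, split on whitespace, and trim trailing punctuation.
    tokens = [token.strip('.,!?;:') for token in text.lower().split()]
    happy = sum(token in HAPPY_WORDS for token in tokens)
    sad = sum(token in SAD_WORDS for token in tokens)
    if happy + sad == 0:
        return 0.5  # no sentiment-bearing words found
    return happy / (happy + sad)


print(lexicon_sentiment('The hackathon was awesome, great food!'))  # 1.0
print(lexicon_sentiment('I hate this terrible weather'))            # 0.0
print(lexicon_sentiment('see you at noon'))                         # 0.5
```

A scorer like this has no notion of negation or sarcasm, which is a big part of why a pre-trained model is the more practical choice at a hackathon.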

Since I can only hit the Indico API for free 100,000 times, I had to find a way to store API call results persistently, and a way to unit test as well. I could have used a database, but for now, pandas has good tools for parsing CSVs.

import csv

from indicoio import sentiment_hq  # assumed: the indicoio client package
from texts.config import INDICO_API_KEY

def indico_batch_sentiment():
    """a ONE-OFF method to call the indico.io API to HQ batch sentiment 18192 texts.

    Stores it as sentiments.csv in the working dir.
    """
    with open('sentiments.csv', 'wb') as f:
        texts = []
        writer = csv.writer(f)
        with open('texts/filenames.txt', 'r') as filenames:
            fn_list = ['texts/texts/' + filename.strip() for filename in filenames]
            for fn in fn_list:
                texts.append(get_texts(fn))  # returns a list of TextMessage objects
        # Flatten the per-file lists into a single list of messages.
        texts = [item for sublist in texts for item in sublist]
        with open('indico_sentiment_hq_errors.txt', 'w') as error_log:
            for text in texts:
                sentiment_result = None
                # The try/finally lines and the error_log write are reconstructed;
                # the original listing lost them.
                try:
                    sentiment_result = sentiment_hq(text.body.encode(), api_key=INDICO_API_KEY)
                except BaseException as e:
                    error_log.write('%s: %r\n' % (text.msg_id, e))
                finally:
                    # Always write a row, with sentiment left as None on failure.
                    writer.writerow([unicode(s).encode('utf-8') for s in
                                     [text.msg_id, text.posix, repr(text.sent),
                                      text.body, repr(text.mentions), sentiment_result]])
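
The one-off batch job above stores everything in a single pass. If the job had to be re-run, the same persistence idea could be sketched as a small CSV-backed cache, so that only unseen messages cost an API call. Everything here is an assumption for illustration: load_cache, score_with_cache, the two-column file layout, and fake_score, which stands in for the real API call. It uses the standard library csv module to stay self-contained and is written for Python 3.

```python
import csv
import os
import tempfile


def load_cache(path):
    """Read previously stored results as {msg_id: sentiment}."""
    if not os.path.exists(path):
        return {}
    with open(path, newline='') as f:
        return {row['msg_id']: float(row['sentiment']) for row in csv.DictReader(f)}


def score_with_cache(messages, path, score_fn):
    """Score (msg_id, body) pairs, calling score_fn only for uncached ids."""
    cache = load_cache(path)
    is_new_file = not os.path.exists(path)
    with open(path, 'a', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=['msg_id', 'sentiment'])
        if is_new_file:
            writer.writeheader()
        for msg_id, body in messages:
            if msg_id not in cache:
                cache[msg_id] = score_fn(body)  # the only place a call is spent
                writer.writerow({'msg_id': msg_id, 'sentiment': cache[msg_id]})
    return cache


calls = []


def fake_score(body):
    """Stand-in for the real API call; records each invocation."""
    calls.append(body)
    return 0.9


# Ids are kept as strings so they match what csv reads back from disk.
path = os.path.join(tempfile.mkdtemp(), 'sentiments_cache.csv')
messages = [('1', 'hello'), ('2', 'great hack')]
score_with_cache(messages, path, fake_score)
score_with_cache(messages + [('3', 'new text')], path, fake_score)
print(len(calls))  # 3 calls in total; the second run only scores message '3'
```

Passing the scorer in as a function also helps with unit testing, since a test can swap in a fake like fake_score and never touch the quota.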

Static visualization

Graph viz should be here
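
The weighted edges behind a graph like this could be built by averaging sentiment over each (sender, mention) pair. The sketch below is one guess at that aggregation, not the project's actual get_sentiment_link_dict; average_edge_sentiment and the sample rows are invented for illustration. Written for Python 3.

```python
from collections import defaultdict


def average_edge_sentiment(rows):
    """Collapse (sender, mention, sentiment) rows into one weighted edge per pair."""
    totals = defaultdict(lambda: [0.0, 0])  # (sender, mention) -> [sum, count]
    for sender, mention, sentiment in rows:
        totals[(sender, mention)][0] += sentiment
        totals[(sender, mention)][1] += 1
    return {pair: total / count for pair, (total, count) in totals.items()}


# Made-up rows: who sent the text, who it mentions, and its sentiment score.
rows = [
    ('rainier', 'alice', 0.75),
    ('rainier', 'alice', 0.25),
    ('bob', 'alice', 0.5),
]
edges = average_edge_sentiment(rows)
for (sender, mention), weight in sorted(edges.items()):
    print(sender, '->', mention, weight)
```

The resulting dict maps directly onto a weighted edge list, which is the shape most graph plotting libraries expect.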

Ways to improve

  • I didn’t include emoji 👀, or use a different encoding for them, because they were breaking Python’s standard library csv file I/O. (Of course, I was going back and forth between pandas and csv.) They could be factored into sentiment somehow…
  • I have a contact named Will… maybe y’all could guess what happened then, or what will happen if people use that word…
  • I could have gotten Facebook’s data dump or backed up all my texts from this year as well. But at around 18,200 texts, I think my computer was handling about as much as it could. For a larger-scale operation I could use an AWS instance or something.

The code is here. Even though I haven’t touched it in a while, I’m always dreaming of things to add on to it.