Mining 18000 texts for fun and no profit
December 27, 2015
Well, maybe a little bit of profit.
text-graph is a data visualization of a network graph based on 18,000 texts, with weighted edges representing the degree of sentiment between named entities.
This is a port of a write-up for a project I made at 6Sense Data Hack in San Francisco. This won Judges’ Favorite Hack and my round-trip flight ticket’s (AUS - SFO) worth, which is why I said a little bit of profit. It was also featured on their company blog.
For a lot of data science-oriented tasks, the Anaconda distribution of Python is awesome. I used PyCharm for static type-checking and quickly accessing docstrings, since I was using a lot of unfamiliar libraries.
- Grab texts from Go SMS Pro (an alternative text-messaging app) backups (XML format)
- Parse XML tree using the `xml` Python lib to get contact names, text, etc.
- Create in-memory data structures in Python using data from the XML tree:
```python
from collections import Counter

# Helpers such as get_contact_objs, get_name_from_fn, get_texts,
# get_mentions_msg_list and get_sentiment_link_dict, plus the `names`
# dict (real name -> humanized hash), are defined elsewhere in the project.


class TextCorpus(object):
    """It's a graph."""

    def __init__(self):
        self.contacts = get_contact_objs()
        self.contact_names = list(names.values())
        # Every (contact, mentioned name) edge across all contacts.
        self.global_adj_list = [edge for contact in self.contacts
                                for edge in contact.adjacency_list]
        # Contact name -> that contact's edges.
        self.global_pair_map = {contact.name: contact.adjacency_list
                                for contact in self.contacts}
        # One edge list per contact.
        self.global_pair_flat_list = [contact.adjacency_list
                                      for contact in self.contacts]


class Contact(object):
    """A Contact obj reprs one of the people I've contacted."""

    def __init__(self, fn):
        self.name = get_name_from_fn(fn)
        self.messages = get_texts(fn, False)
        self.messages_str_list = [msg.body for msg in self.messages]
        # Names mentioned in this contact's messages, repeats included.
        self.links = get_mentions_msg_list(self.messages_str_list)
        self.unweighted_links = tuple(set(self.links))
        self.weighted_links = Counter(self.links)
        self.adjacency_list = [(self.name, other) for other in self.unweighted_links]
        self.sentiment_links = get_sentiment_link_dict(self.weighted_links, self.name)
        # Contacts with no sentiment links default to a neutral 0.5.
        self.avg_sentiment = (
            0.5 if len(self.sentiment_links) == 0
            else float(sum(self.sentiment_links.values())) / len(self.sentiment_links)
        )
```
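The parsing and linking steps above can be sketched end to end on a toy backup. This is only an illustration, not the project's code: the XML tag and attribute names (`sms`, `address`, `body`), the sample messages, and the helper names `parse_messages` and `weighted_links` are all assumptions, since the real Go SMS Pro schema is not shown here.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical backup snippet; the real Go SMS Pro schema may differ.
SAMPLE_XML = """
<backup>
  <sms address="alice" sent="false"><body>bob is awesome</body></sms>
  <sms address="alice" sent="false"><body>lunch with bob and carol?</body></sms>
  <sms address="bob" sent="false"><body>carol says hi</body></sms>
</backup>
"""

CONTACT_NAMES = {"alice", "bob", "carol"}


def parse_messages(xml_text):
    """Return (contact, body) pairs from the XML backup."""
    root = ET.fromstring(xml_text)
    return [(sms.get("address"), sms.findtext("body", default=""))
            for sms in root.iter("sms")]


def weighted_links(messages):
    """Count how often each contact mentions another contact by name."""
    links = Counter()
    for contact, body in messages:
        for word in body.lower().split():
            name = word.strip("?!.,")  # drop trailing punctuation
            if name in CONTACT_NAMES and name != contact:
                links[(contact, name)] += 1
    return links


messages = parse_messages(SAMPLE_XML)
edges = weighted_links(messages)
for (source, target), weight in sorted(edges.items()):
    print(source, "->", target, weight)
```

Each key of `edges` is a directed (contact, mentioned name) pair and each value is the mention count, which is roughly the role `weighted_links` and `adjacency_list` play in the `Contact` class.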
There’s definitely a concern for privacy since these are texts between my friends…
For each contact in the `TextCorpus`, grab their `TextMessage` objs. If a contact name is in the body of the text, perform a hex digest of the name and humanize the hash.
```python
import hashlib

import humanhash

# `names` (real name -> humanized hash) and get_mentions are defined
# elsewhere in the project.


class TextMessage(object):
    """A TextMessage obj's attrs correspond to the XML tags I care about."""

    def __init__(self, msg_id, posix, sent, body, sender):
        self.msg_id = msg_id
        self.posix = posix
        self.sent = sent
        # Empty messages get a placeholder body; everything else is
        # lowercased and anonymized.
        self.body = 'neutral' if body is None else anonymize(body.lower())
        self.mentions = get_mentions(self.body)
        self.sender = sender if self.sent is False else 'rainier'


def anonymize(msg):
    """Anonymizes a single text body.

    e.g. 'Alice: Bob is awesome' ->
    'zero-phone-quantum-anchorage: waffle-panda-theory-rushmore is awesome'
    """
    for orig, hmhashed in names.items():
        msg = msg.replace(orig, hmhashed)
    return msg


def digest_and_humanize(word):
    m = hashlib.md5()
    m.update(word)  # on Python 3, pass word.encode('utf-8') instead
    return humanhash.humanize(m.hexdigest())
```
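To make the digest-then-humanize idea concrete without the third-party dependency, here is a small stand-in that uses only the standard library. The 16-word list and the `pseudonym` helper are made up for illustration; the real `humanhash` package uses its own, larger word list, so its output will differ.

```python
import hashlib

# Hypothetical word list, for illustration only.
WORDS = ["apple", "bravo", "cedar", "delta", "ember", "fjord", "grape", "hotel",
         "india", "jelly", "koala", "lemon", "mango", "nacho", "olive", "panda"]


def pseudonym(name, n_words=4):
    """Map a name to a stable, readable alias derived from its MD5 digest."""
    digest = hashlib.md5(name.lower().encode("utf-8")).digest()  # 16 bytes
    chunk = len(digest) // n_words
    picked = []
    for i in range(n_words):
        folded = 0
        for byte in digest[i * chunk:(i + 1) * chunk]:
            folded ^= byte  # XOR-fold each chunk down to a single byte
        picked.append(WORDS[folded % len(WORDS)])
    return "-".join(picked)


alias = pseudonym("Alice")
print(alias)
```

The same name always maps to the same alias, so the graph structure survives anonymization. A vocabulary this small makes collisions between different names more likely than a larger one would.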
In the interest of time at the hackathon, I used pre-trained models from Indico’s high-quality sentiment API on short texts. They report about a 93% accuracy rate on the IMDB movie review corpus.
Aside: There are lots of papers and tutorials on sentiment analysis algorithms using machine learning. Some ‘not really machine learning’ methods involve using a list of ‘happy’ words such as ‘awesome’ and ‘great’ to increase the sentiment. I could also use classifiers such as Naïve Bayes or topic modeling techniques. But anyway…
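As a rough sketch of the word-list approach from the aside (not what the project used, which was the Indico API): the word lists, the `lexicon_sentiment` name, and the 0-to-1 scale with 0.5 as neutral are all assumptions chosen for illustration.

```python
import re

# Tiny hand-made word lists; a real lexicon would be much larger.
HAPPY_WORDS = {"awesome", "great", "love", "fun"}
SAD_WORDS = {"awful", "terrible", "hate", "boring"}


def lexicon_sentiment(text):
    """Score text in [0, 1]; 0.5 means neutral or no known words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    happy = sum(token in HAPPY_WORDS for token in tokens)
    sad = sum(token in SAD_WORDS for token in tokens)
    if happy + sad == 0:
        return 0.5
    return happy / (happy + sad)


print(lexicon_sentiment("Bob is awesome"))
print(lexicon_sentiment("that movie was awful and boring"))
print(lexicon_sentiment("see you at noon"))
```

A scorer like this ignores negation ("not great") and sarcasm, which is part of why a trained model tends to do better on short texts.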
Since I can only hit the Indico API for free 100,000 times, I had to find a way to store API call results persistently, and also to unit test. I could have used a database, but for now, `pandas` has good tools for parsing CSVs.
```python
import csv

from indicoio import sentiment_hq  # import omitted in the original snippet; assumed
from texts.config import INDICO_API_KEY

# get_texts is defined elsewhere in the project.


def indico_batch_sentiment():
    """a ONE-OFF method to call the indico.io API to HQ batch sentiment 18192 texts.

    Stores it as sentiments.csv in the working dir.
    """
    # Python 2 code: csv wants a binary-mode file and byte-string cells.
    with open('sentiments.csv', 'wb') as f:
        texts = []
        writer = csv.writer(f)
        with open('texts/filenames.txt', 'r') as filenames:
            fn_list = ['texts/texts/' + line.strip() for line in filenames]
        for fn in fn_list:
            texts.append(get_texts(fn))  # returns a list of TextMessage objects
        texts = [item for sublist in texts for item in sublist]
        with open('indico_sentiment_hq_errors.txt', 'w') as error_log:
            for text in texts:
                sentiment_result = None
                try:
                    sentiment_result = sentiment_hq(text.body.encode(), api_key=INDICO_API_KEY)
                except Exception as e:
                    # Log the failure; the row is still written with None.
                    error_log.write(str(e) + '\n')
                finally:
                    writer.writerow([unicode(s).encode('utf-8') for s in
                                     [text.msg_id, text.posix, repr(text.sent), text.body,
                                      repr(text.mentions), sentiment_result]])
```
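Once the results are on disk, they can be read back so that no text has to be sent to the API twice. Below is a minimal sketch of that idea using the standard library `csv` module (`pandas.read_csv` would work for the loading step too). It assumes the column order written above; the sample rows, the helper names, and `fake_fetch` (a stand-in for the real API call) are hypothetical.

```python
import csv
import io

# Two sample rows in the column order written above:
# msg_id, posix, sent, body, mentions, sentiment
SAMPLE_CSV = (
    "1,1450000000,True,bob is awesome,['bob'],0.93\n"
    "2,1450000060,False,running late,[],None\n"
)


def load_sentiment_cache(csv_file):
    """Return {msg_id: sentiment}, skipping rows whose API call failed."""
    cache = {}
    for row in csv.reader(csv_file):
        msg_id, sentiment = row[0], row[-1]
        if sentiment != "None":  # the writer stores None when the call raised
            cache[msg_id] = float(sentiment)
    return cache


def cached_sentiment(msg_id, body, cache, fetch):
    """Look up a score, calling `fetch` only on a cache miss."""
    if msg_id not in cache:
        cache[msg_id] = fetch(body)
    return cache[msg_id]


calls = []


def fake_fetch(body):
    """Stand-in for the API; records each call so misses can be counted."""
    calls.append(body)
    return 0.5


cache = load_sentiment_cache(io.StringIO(SAMPLE_CSV))
print(cached_sentiment("1", "bob is awesome", cache, fake_fetch))
print(cached_sentiment("2", "running late", cache, fake_fetch))
print(len(calls))
```

Passing the fetch function in as an argument also makes this easy to unit test without spending any of the API quota.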
Ways to improve
- Didn’t include emoji 👀 or use a different encoding for them because it was breaking Python’s standard library csv file IO. Of course, I was going back and forth between csvfile. Could be factored into sentiment somehow…
- I have a contact named Will… maybe y’all could guess what happened then, or what will happen if people use that word…
- I could have gotten Facebook’s data dump or backed up all my texts from this year as well. But I think about 18,200 texts was as much as my computer could comfortably handle. For a larger-scale operation I could use an AWS instance or something.
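One possible way around the "Will" problem is to match contact names only as capitalized whole words instead of doing a plain substring replace. This is a sketch under assumptions, not a fix that exists in the project: `anonymize_names` and the `ALIASES` table are hypothetical, and the approach only works if names are replaced before the text is lowercased.

```python
import re

# Hypothetical name -> alias table; real aliases would come from the hashing step.
ALIASES = {
    "Will": "waffle-panda-theory-rushmore",
    "Alice": "zero-phone-quantum-anchorage",
}


def anonymize_names(msg, aliases):
    """Replace capitalized whole-word contact names, leaving ordinary words alone."""
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, aliases)) + r")\b")
    return pattern.sub(lambda match: aliases[match.group(1)], msg)


print(anonymize_names("Will says he will be late, Alice", ALIASES))
```

The match is case-sensitive, so the verb "will" and longer words such as "willing" are left untouched. It is still a heuristic: a sentence that starts with the verb ("Will you come?") would be replaced too.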
The code is here. Even though I haven’t touched it in a while, I’m always dreaming of things to add on to it.