rainier ababao reads and writes

General category

Learning about LIDAR (WIP)

This post is largely a work in progress. I've learned a lot that hasn't been captured here yet; I just need to find more time to make my thoughts clearer in this format.

There are some tools and concepts I want to pick up in order to manage self-driving car data. In this document, I track the resources I'm using to build up the knowledge I currently have. To be honest, I don't have a very strong physics or computer graphics background, so my selection might reflect that.

What is LIDAR? (high-level)

Industry players/Self-driving car tech

From what I've read so far, it seems like the major industry players are:

  • Velodyne (LIDAR)
  • Quanergy (LIDAR)
  • Waymo (Alphabet) (everything)
  • Uber ATG (software, maybe everything?)
  • Lyft (software, maybe everything?)
  • Cruise (acquired by GM) (everything but especially software?)

The up-and-coming ones seem to be:

  • Drive.ai (the whole car)
  • Voyage (the whole car)
  • Zoox (the whole car)
  • Comma.ai (self-driving car software)
  • Luminar (LIDAR)

If I were seriously trying to be a strategist, a competitive landscape map would be apt here. But it's simply nice to be aware of the players in the space, both to roughly forecast who might be successful and to inform a decision about joining one of these companies.

The nice thing about all of these new companies attempting to be “thought leaders” in the space is that they publish so much accessible material about the technology.

There are also some startups making it easier to construct self-driving car data sets, such as Scale.

LIDAR point cloud data format

Object classification (assumes some ML experience already)

Language interoperability

A self-driving car is one piece of a very data-intensive system (the other parts probably being the cloud, on-prem servers, etc.). I have a gut feeling that language interoperability is a powerful tool for writing data-intensive systems quickly. If it's true that some engineers (particularly in the research and scientific communities) are more productive writing Python, then there's a benefit to investing in tools that make Python go faster. Things like Boost.Python come to mind…

This is a pretty cool blog post by one of the TAs of a computational biology class I took a few years ago. It discusses writing a Python module that wraps C++ code, including unit tests: Building and testing a hybrid Python/C++ package
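To make that concrete for future-me: the Python-facing side of such a hybrid package can be unit-tested like any other module. A minimal sketch, assuming a hypothetical C++-backed extension called fastcloud exposing a count_points_in_box function (both names invented here for illustration):

import unittest

import fastcloud  # hypothetical C++-backed extension module


class TestFastCloud(unittest.TestCase):
    def test_matches_pure_python_reference(self):
        points = [(0.1, 0.2, 0.3), (0.9, 0.9, 0.9), (1.1, 0.0, 0.0)]
        # A slow pure-Python implementation we trust...
        expected = sum(1 for p in points if all(0.0 <= c <= 1.0 for c in p))
        # ...should agree with the fast C++ version exposed to Python.
        self.assertEqual(fastcloud.count_points_in_box(points, lo=0.0, hi=1.0), expected)


if __name__ == '__main__':
    unittest.main()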

More to come.

Perco-koan

This week I messed up on a question whose solution involved a BFS over an element's left, right, and bottom neighbors in a grid. This post exists to make sure I don't mess it up again.

This requires writing a function to get the neighbor coordinates of an element in a grid, like this…

grid = [[0, 0, 0, 1, 0, 0],
        [0, 0, 1, 'X', 1, 0],
        [0, 0, 0, 1, 0, 0]]

# let's say we want the neighbors of the coordinate where 'X' is
get_neighbors(grid, 1, 3)  # -> [(1, 2), (2, 3), (1, 4)]

The part where I messed up is my version of a koan (that was the word, Ben!) that I learned from competitive programming class (thanks Arnav, though I doubt you'll ever read this). This is the incorrect version of a function that I should have really committed to memory by now…

def get_neighbors(grid, visited, i, j):
    dx = [-1, 0, 1]
    dy = [0, -1, 0]
    return [(i+x, j+y) for x, y in zip(dx, dy)
            if legal(grid, i+x, j+y)
            and grid[i+x][j+y] == PASS  # problem-dependent
            and (i+x, j+y) not in visited]

Do you see it??

There are a few things wrong with it, all stemming from my forgetting that Cartesian coordinates != the convention for multi-dimensional array indexing. When I'm coding it and narrating it in my head, like "Yeah, the change in x will be -1 and the change in y will be 0 if you go left, and x 0, y -1 if you go down", it makes total sense, but it's wrong. For one, the index associated with y is i, not j, so that was one mistake. And applying a -1 to it doesn't go down - with row-major indexing, where row indices grow downward, that examines the neighbor above the input coordinate.

This is the correct way to get the left, right, and bottom neighbors:

def get_neighbors(grid, visited, i, j):
    di = [0, 1, 0]
    dj = [-1, 0, 1]
    return [(i+y, j+x) for y, x in zip(di, dj)
            if legal(grid, i+y, j+x)
            and grid[i+y][j+x] == PASS  # problem-dependent
            and (i+y, j+x) not in visited]

Do you see the difference? I flipped which index each of the x-y coordinates is associated with (di pairs with rows/y, dj with columns/x), and changed the -1 to a +1 so that the bottom neighbor is one row further down.
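For completeness, the legal helper used in both versions is just a bounds check - presumably something like:

def legal(grid, i, j):
    # True iff (i, j) is inside the grid
    return 0 <= i < len(grid) and 0 <= j < len(grid[0])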

Another interesting problem came up today, which involved computing a series of sums over K*K sub-matrices of an N*M matrix. Here's a candidate for a new personal koan:

def subsum(grid, i1, i2, j1, j2):
    psum = 0
    for i in range(i1, i2):
        psum += sum(grid[i][j1:j2])
    return psum

for i in range(len(grid)-K+1):
    for j in range(len(grid[0])-K+1):
        subsum(grid, i, i+K, j, j+K) # do something with this

Do you think there is a better way to do this? Please email me, so I can get a job 😅
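One likely answer, sketched here for future reference, is the standard 2D prefix-sum (summed-area table) trick: pay O(N*M) once up front, and every K*K window sum afterwards is O(1) instead of O(K*K).

def prefix_sums(grid):
    """P[i][j] = sum of grid[0:i][0:j] (summed-area table)."""
    n, m = len(grid), len(grid[0])
    P = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n):
        for j in range(m):
            P[i+1][j+1] = grid[i][j] + P[i][j+1] + P[i+1][j] - P[i][j]
    return P

def subsum_fast(P, i1, i2, j1, j2):
    """Sum of grid[i1:i2][j1:j2] in O(1), same slicing convention as subsum above."""
    return P[i2][j2] - P[i1][j2] - P[i2][j1] + P[i1][j1]

# Same driver loop as before, but each window sum is now O(1):
# P = prefix_sums(grid)
# for i in range(len(grid) - K + 1):
#     for j in range(len(grid[0]) - K + 1):
#         subsum_fast(P, i, i + K, j, j + K)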

Going to London for four months

In about 52 hours I’ll get into Heathrow Airport in the UK. Starting next week, I’ll be an engineering intern at Desktop Genetics, a hot bioinformatics startup in London focused on a machine learning and data engineering SaaS for CRISPR/Cas9 gene-editing.

I started out as a biochemistry major in college and discovered my love for genetics; later, as a computer science major, I found a love for thinking in abstractions and, with it, my potential as a software engineer. I don't think I could have picked a more exciting opportunity than interning at a company that specializes in the intersection of my first and second loves.

I’m incredibly grateful and lucky for all of the people and experiences I’ve had so far that have led me to this point, and can’t wait to write code that can make an impact on the world.

Living in a foreign country on a continent I've never been to will also be an adventure.

See you soon, London.

The Silicon Valley Internship

Clover

I interned at Clover. Facts about them:

  • Point-of-sale (POS) platform startup that launched in 2012 and was acquired later that year by First Data, the world's largest credit card processing company.
  • First Data is in charge of sales and support, and Clover engineers the hardware and software for new POS devices.
  • There’s a “Founder’s Clause” in the acquisition contract that would make First Data pay a large fine if they interfered with Clover’s operations. So it still operates and feels like a startup.
  • Their devices run a custom fork of Android. They provide an app store and ecosystem with a public API and SDK for developers to build business-related apps (e.g. for tabling, employee/inventory management, etc.).
  • Over 100 people work at the Sunnyvale HQ as of this writing, and I saw new engineers and businesspeople get hired every week.
  • There were 11 engineering interns at Clover HQ this summer (8 from UT Austin, owing to some early employees being alumni).

The internship ran for 13 weeks, from May 23rd to August 19th, 2016.

I was on the server team. My project initially involved integrating a time-series database and fixing a metrics collection service written in Python and a tiny bit of Go. Later on, I helped redesign and rewrite it completely in Go - both experiences I learned immensely from. I was also assigned bugs and performance issues in the main server codebase like any other full-time engineer on the team, which let me dust off my Java, learn Netty, get comfortable navigating a large codebase, understand architectural concepts like the difference between monoliths and microservices, and get better at switching tasks quickly.

My favorite part about the server team was how talented everybody had to be. To be a successful server engineer at Clover, you have to be a full-stack generalist. I'm not talking about being able to tie together web app frontends and backends with React, Node.js, and MongoDB, although that's still a marketable skill on its own. You have to pick up parts of the Android and Web ecosystems to see how your changes will affect those end users, know database query optimizations, know where it's smart to introduce caches and fiddle with the JVM, factor security compliance into your implementations (it is a payments company, after all), and apply statistics for performance anomaly detection. If a critical part of the server goes down or experiences significant latency, lots of businesses are potentially affected - the code at Clover matters. But on top of the engineering culture, there are friendly and approachable people everywhere. I didn't feel discouraged from asking anyone questions.

The food and perks were nice. We had swanky corporate housing, free Uber for commuting to work and back, highly-customizable lunch from Forkable and catered dinner, daily. Yummy snacks and drinks like harmless coconut water and matcha green tea Kit Kats. Open bar and weekly happy hours if you’re 21+.

Clover was a very good place to intern - I had fun, felt that I had provided tangible value to the company, and learned a ton, which should be the goal of an internship. They are growing and hiring rapidly, and their internship program can only get better every year. I recommend it.

Why you should try a smaller company at least once

I'm a strong believer that software engineering internships should be about trying as many experiences as you can, and growing. Especially at small-to-midsize startups, there's the unique possibility of witnessing the extremely fast growth of a company. You'll get to watch (or help 😀) engineers wrestle with scaling issues, listen in on product design meetings, overhear businesspeople (who aren't on a different floor yet) perform pitches, participate in a company's early all-hands meetings, and so much more.

I've been privileged to intern at two differently-sized startups: a 10-person one and a 110-person one. At each, I've had the chance to report to and get my code reviewed by very senior engineers, and to push features to production that likely had a much higher relative impact (or, as my friends in management consulting would call it, "value add") than if I had worked at a larger corporation. Finally, I've gained experience with technologies that are easy to scale and iterate with, and that will build the future, including Go, Python, and JavaScript. When I asked some engineers at Clover why they prefer working there over bigger shops, the answers came down to huge technical growth, the bandwidth available to take ownership of things, and impact.

I’m not saying you shouldn’t work at a larger company - I hope I’ll get to try it out one day and see which experiences optimize for my growth and personal happiness at whatever stage of life I’m in. But a smaller company is something, I believe, you should try at least once.

Silicon Valley

I went to a few networking events. I couldn’t intern in Silicon Valley and not network - it’s Silicon Valley. Tweet or DM @rainieratx if you missed me at one of them!

  • KPCB Diversity Mixer
  • Accel Backyard Bash
  • Keen IO Party
  • HackCon IV (in Colorado)
  • FutureForce Friendfest
  • Google SF Insider Night
  • Medallia Game Night
  • Explore Pinterest
  • Uber Open House

I also visited some friends at Facebook and Twitch HQ. I was surrounded by brilliant engineers, product designers, and thought leaders of the entrepreneurial spirit everywhere. Lots of people were passionate about stuff outside of work or school, and I loved the culture of building the future.

The areas surrounding Sunnyvale were nice. I hung around Palo Alto, Mountain View, and Santa Clara whenever I chilled in South Bay. Visited Half Moon Bay, Santa Cruz, and Yosemite. Went to San Francisco nearly every weekend via Caltrain. Asian food, including sushi and bubble tea, was 10x better there than in Austin. Iced mint mojitos from Philz were amazing, but California hasn't seemed to figure out breakfast burritos or queso yet 😬.

I’m glad I tried it out, and I wouldn’t mind coming back to the Bay Area next summer.

What did you like about your summer? Questions about my internship?
Tweet or DM @rainieratx!

Mining 18000 texts for fun and no profit

Well, maybe a little bit of profit.

text-graph is a data visualization of a network graph based on 18,000 texts, with weighted edges representing the degree of sentiment between named entities.

This is a port of a write-up for a project I made at 6Sense Data Hack in San Francisco. It won Judges' Favorite Hack, plus the value of my round-trip flight (AUS - SFO), which is why I said a little bit of profit. It was also featured on their company blog.

Setup

For a lot of data science-oriented tasks, the Anaconda distribution of Python is awesome. I used PyCharm for static type-checking and for quickly accessing docstrings, since I was using a lot of unfamiliar libraries.

Data wrangling

  • Grab texts from Go SMS Pro (an alternative text-messaging app) backups (XML format)
  • Parse the XML tree using cElementTree from Python's xml lib to get contact names, text, etc. (a rough parsing sketch follows the class definitions below)
  • Create in-memory data structures in Python using data from the XML tree:
# Counter comes from the standard library; the other helpers (get_contact_objs,
# get_texts, get_name_from_fn, get_mentions_msg_list, get_sentiment_link_dict,
# and the names dict) live elsewhere in the project.
from collections import Counter


class TextCorpus(object):
    """It's a graph."""

    def __init__(self):
        self.contacts = get_contact_objs()
        self.contact_names = names.values()
        self.global_adj_list = [pair for contact in self.contacts
                                for pair in contact.adj_list]
        self.global_pair_map = dict(zip([contact.name for contact in self.contacts],
                                        [contact.adj_list for contact in self.contacts]))
        self.global_pair_flat_list = [contact.adj_list for contact in self.contacts]


class Contact(object):
    """A Contact obj reprs one of the people I've contacted."""

    def __init__(self, fn):
        self.name = get_name_from_fn(fn)
        self.messages = get_texts(fn, False)
        self.messages_as_str_list = [msg.body for msg in self.messages]
        self.links = get_mentions_msg_list(self.messages_as_str_list)
        self.unweighted_links = tuple(set(self.links))
        self.weighted_links = Counter(self.links)
        self.adj_list = [(self.name, mention) for mention in set(self.links)]
        self.sentiment_links = get_sentiment_link_dict(self.weighted_links, self.name)
        self.avg_sentiment = (0.5 if len(self.sentiment_links) == 0
                              else float(sum(self.sentiment_links.values())) / len(self.sentiment_links))
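The XML-parsing bullet above might look roughly like the following. I'm not reproducing the exact Go SMS Pro backup schema here, so the element and attribute names (sms, body, date) are placeholders for illustration:

import xml.etree.cElementTree as ET

def parse_backup_sketch(fn):
    """Rough sketch only: pull message bodies/timestamps out of one backup file."""
    tree = ET.parse(fn)
    messages = []
    # assuming each message is an <sms> element with 'body' and 'date' attributes
    for sms in tree.getroot().iter('sms'):
        messages.append({'body': sms.get('body'), 'date': sms.get('date')})
    return messages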

Anonymizing data

There's definitely a privacy concern, since these are texts between me and my friends…

For each Contact in TextCorpus, grab their TextMessage objs. If a contact name is in the body of the text, perform a hex digest of the name and humanize the hash.

import humanhash
import hashlib

class TextMessage(object):
    """A TextMessage obj's attrs correspond to the XML tags I care about."""

    def __init__(self, msg_id, posix, sent, body, sender):
        self.msg_id = msg_id
        self.posix = posix
        self.sent = sent
        self.body = 'neutral' if body is None else anonymize(body.lower())
        self.mentions = get_mentions(self.body)
        self.sender = sender if self.sent is False else 'rainier'

def anonymize(msg):
    """Anonymizes a single text body.
    e.g. 'Alice: Bob is awesome' -> 'zero-phone-quantum-anchorage: waffle-panda-theory-rushmore is awesome'
    """
    for orig, hmhashed in names.iteritems():
        msg = msg.replace(orig, hmhashed)
    return msg

def digest_and_humanize(word):
    m = hashlib.md5()
    m.update(word)
    return humanhash.humanize(m.hexdigest())
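The names mapping used by anonymize (and by TextCorpus earlier) isn't shown in this excerpt; presumably it's just each contact name mapped to its humanized hash, something like:

# contact_names comes from the parsed backups elsewhere in the project
names = {name: digest_and_humanize(name) for name in contact_names}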

Sentiment analysis

In the interest of time at the hackathon, I used pre-trained models from Indico's high-quality sentiment API on short texts. They report about a 93% accuracy rate on the IMDB movie review corpus.

Aside: There are lots of papers and tutorials on sentiment analysis algorithms using machine learning. Some 'not really machine learning' methods involve using a list of 'happy' words such as 'awesome' and 'great' to increase the sentiment score. You could also use classifiers such as Naïve Bayes or topic modeling techniques. But anyway…
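(A toy version of the word-list idea, just to make it concrete - this isn't what I actually used:)

HAPPY = {'awesome', 'great', 'love', 'yay'}
SAD = {'terrible', 'awful', 'hate', 'ugh'}

def word_list_sentiment(text):
    """Crude score in [0, 1]: 0.5 is neutral; happy words push it up, sad words down."""
    words = text.lower().split()
    score = 0.5 + 0.1 * sum(w in HAPPY for w in words) - 0.1 * sum(w in SAD for w in words)
    return max(0.0, min(1.0, score))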

Since I can only hit the Indico API 100,000 times for free, I had to find a way to store API call results persistently (and to unit test against them). I could have used a database, but for now a CSV works, and pandas has good tools for parsing CSVs.

import csv

from indicoio import sentiment_hq  # assuming Indico's indicoio Python client

from texts.config import INDICO_API_KEY

def indico_batch_sentiment():
    """a ONE-OFF method to call the indico.io API to HQ batch sentiment 18192 texts.

    Stores it as sentiments.csv in the working dir.
    """
    with open('sentiments.csv', 'wb') as f:
        texts = []
        writer = csv.writer(f)
        with open('texts/filenames.txt', 'r') as filenames:
            fn_list = map(str.strip, [filename for filename in filenames])
            fn_list = map(lambda x: 'texts/texts/' + x, fn_list)
            for fn in fn_list:
                texts.append(get_texts(fn))  # get_texts returns a list of TextMessage objects
        texts = [item for sublist in texts for item in sublist]
        with open('indico_sentiment_hq_errors.txt', 'w') as error_log:
            for text in texts:
                sentiment_result = None
                try:
                    sentiment_result = sentiment_hq(text.body.encode(), api_key=INDICO_API_KEY)
                except BaseException as e:
                    error_log.write(str(e))
                finally:
                    writer.writerow([unicode(s).encode('utf-8') for s in
                                     [text.msg_id, text.posix, repr(text.sent),
                                      text.body, repr(text.mentions), sentiment_result]])
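Reading the cached results back in for analysis is then a pandas one-liner (the column names here are my own, matching the writerow order above):

import pandas as pd

sentiments = pd.read_csv('sentiments.csv',
                         names=['msg_id', 'posix', 'sent', 'body', 'mentions', 'sentiment'])
print(sentiments['sentiment'].describe())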

Static visualization

Graph viz should be here

Ways to improve

  • Didn't include emoji 👀 or use a different encoding for them, because they were breaking Python's standard library csv file IO. Of course, I was going back and forth between pandas and the csv module. They could be factored into sentiment somehow…
  • I have a contact named Will… maybe y’all could guess what happened then, or what will happen if people use that word…
  • I could have pulled in Facebook's data dump or backed up all my texts from this year as well, but at ~18,200 texts I was already near the limit of what my computer could comfortably handle. For a larger-scale operation I could use an AWS instance or something.

The code is here. Even though I haven’t touched it in a while, I’m always dreaming of things to add on to it.

Seeing if this script worked

Automates writing test content with correct metadata for Pelican. This is the partial code for it, which can also be found here:

import sys
import time
import os

os.chdir("/Users/rainierababao/cs/rainier/content/")
words = sys.argv[1:]
with open("{}.md".format('-'.join(words)), "w") as beak:
    beak.write("Title: {}\n".format(' '.join(words)))
    beak.write("Date: {}\n".format(time.strftime("%Y-%m-%d")))
    beak.write("Category: General\n\n")
    beak.write("Automates writing test content...")
    # ...turtles all the way down :)
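Assuming the script is saved as something like new_post.py (its filename isn't shown above), running python new_post.py seeing if this script worked would drop a seeing-if-this-script-worked.md stub with today's date into the content directory - which is presumably where this post's title came from.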

And I ran this to automate publishing it, dubbed fly-pelican:

source activate blog
cd /Users/rainierababao/cs/rainier
pelican content -o output -s pelicanconf.py
ghp-import output
git push git@github.com:rainiera/rainiera.github.io.git gh-pages:master
echo "Published!"

blog is just a conda env I use for site development.

Made with ❤️ and automation in Austin, TX

Hello World!

#include<stdio.h>

int main() {
    printf("Hello World!");
    return 0;
}