Data Science Thesis (or: Why I took a job in Data Science)
AI is restructuring the software and information industries, which are worth trillions of dollars. This is a consequential shift.
AI models, the brain of an AI system, have three components at a basic level:
Model architecture
Model weights and biases
Training data sets
Model architecture isn’t defensible. AI researchers to date have been able to intuit architecture with minimal information.
Model weights and biases aren’t defensible. They can be reverse engineered or fine-tuned away, and they are derivative of the training data set and process.
This leaves data sets, which seem to be the only defensible component a business can build a competitive edge with. In fact, training data is already the collision point for AI companies. To my knowledge, it’s where the largest lawsuits and the most innovative minds are focusing.
Data generation is speeding up. By an often-cited estimate, more than 90% of all data was generated within the last 2 years. This will only accelerate with the advent of things like digital twins and synthetic data generation pipelines.
As a result, data management will become a crucial part of every business: refining proprietary data, processing it, and producing intelligence from it.
Within the data world, there are three main disciplines:
Data analytics
Data engineering
Data science
I believe data analytics will be automated by AI. It’s mostly done by MBAs these days anyway (MBAs are basically pre-trained models, good in general but not good at anything specific). Data science will continue to grow in importance, and it’s where all the sexy ML stuff happens, but value creation density will converge on data engineering. Speeding up and sophisticating the ML data pipeline will become a multi-hundred-billion-dollar industry within years. It is already a multibillion-dollar industry.
So that’s where I’m focusing.
Some additional motives for making this decision:
Saving energy. Data centers consume 3% of the world’s energy (that’s a lot). My work will drive down the kilowatt-hours consumed by these servers.
Making the world more efficient. Humanity’s great leaps forward occur when many, many people experience a boost in average productivity. Data is the bottleneck for the personalized ML/DL models that will drastically improve the throughput of knowledge workers. Let’s trigger the next info revolution!
Improving the world’s hardware. Data is how real-world things are represented in digital space. As digital space becomes more dynamic and digital-to-physical interfaces and production pipelines become seamless, AI will drive an era of real-world innovation. Innovation is moving from bits back to atoms.