One should be able to infer the mission of a company from its data infrastructure. By reviewing the data infrastructure, you can see what inputs and outputs matter to a company, who is using the data, and how it’s used.
At Kaspien, we use data-centric processes to inform our business strategies. These processes, such as ideation, hypothesis testing, and refining operations, all depend on a data platform that is both deep and wide. The depth of our data enables more nuanced and precise insights, while the breadth enables us to understand the big picture and identify trends and anomalies.
Together, a deep and broad data platform maximizes our insights, empowering us and our partners to grow smart. Leveraging data, we can identify the biggest opportunities and gravest threats early, allowing all our business segments to successfully adapt to ever-changing marketplaces. As a result, we empower our partners and drive success on Amazon, Walmart, and beyond.
Our teams make decisions based on the best available data. We rely on hourly updates of our data for three things:
To achieve these essential functions, our data infrastructure must be able to collect, transform, and load data into systems where the data can be further tailored to the specific needs of each team.
For example, our Finance, Data, and Marketing teams all utilize similar data, but each team needs the data structured in very different ways. Our Finance teams ingest 10 million rows of data each day to enable precise capital allocation. Our Data team processes 600 million rows of data to build and refine internal software and algorithms. Our Marketing team pores over 2 million rows of data each day to assess campaign performance and make strategic adjustments. To meet the needs of each team, we pump billions of data points through our platform every day.
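As a rough illustration of one shared dataset feeding differently structured views, here is a minimal Python sketch. The field names, sample rows, and the `finance_view` and `marketing_view` helpers are hypothetical and chosen only for illustration; they are not our production schema or code.

```python
from collections import defaultdict

# Hypothetical raw rows; the field names are illustrative assumptions.
RAW_ROWS = [
    {"sku": "A1", "date": "2021-01-01", "revenue": 30.0, "ad_spend": 5.0},
    {"sku": "A1", "date": "2021-01-02", "revenue": 20.0, "ad_spend": 4.0},
    {"sku": "B2", "date": "2021-01-01", "revenue": 15.0, "ad_spend": 0.0},
]


def finance_view(rows):
    """Structure the rows for finance: total revenue per day."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["date"]] += row["revenue"]
    return dict(totals)


def marketing_view(rows):
    """Structure the same rows for marketing: revenue per ad dollar per SKU.

    SKUs with no ad spend are skipped to avoid dividing by zero.
    """
    revenue = defaultdict(float)
    spend = defaultdict(float)
    for row in rows:
        revenue[row["sku"]] += row["revenue"]
        spend[row["sku"]] += row["ad_spend"]
    return {sku: revenue[sku] / spend[sku] for sku in revenue if spend[sku] > 0}


daily_revenue = finance_view(RAW_ROWS)
return_on_ad_spend = marketing_view(RAW_ROWS)
```

Both views read the same rows but aggregate along different keys, which is the sense in which each team needs the data "structured in very different ways."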
Our data platform has seen meteoric growth over the past few years. Currently, our entire system processes just over 1 billion data points per week, a growth of 40% over the past 2 years.
While we now enjoy countless benefits of a robust data platform, our data operations were not always so streamlined. We had to adapt to the growing number of business initiatives and data requirements. We grew our engineering and data teams to build our platform and continue to invest in our data infrastructure today. The tools we use give us the ability to acquire, update, and distribute data with ease. Our platform is amenable to adding new data sources, managing prioritization queues, and delivering real-time data.
In short, our data platform is the foundation on which the rest of our services are built.
We provide dozens of ecommerce services that cover the full gamut of running and optimizing an online business, including inventory and supply chain management, brand protection, digital marketing, creative services, and tax compliance. We have also developed a suite of software services, including self-service options for ad management, seller and price tracking, and Amazon seller reimbursements. All our software and services are improved by leveraging our data platform, and those benefits are shared by all of our partners, fueling their success.
Our primary analytics data pipeline (and associated workflow) is orchestrated via Apache Airflow. We’ll go into further detail about how we use Airflow in a later blog post (see here and here for some good introductory material). Airflow allows us to build in diagnostics and tests and to monitor the progress of all our systems, which include applications and databases.
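To give a feel for what an orchestrator does, here is a small sketch that uses only Python's standard library rather than Airflow's own API. It runs tasks in dependency order and stops at the first failure, so a data check can gate the load step. The task names and the `run_pipeline` helper are hypothetical and do not reflect our actual Airflow configuration.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_pipeline(tasks, dependencies):
    """Run tasks in dependency order and stop at the first failure.

    tasks: maps a task name to a zero-argument callable.
    dependencies: maps a task name to the set of tasks that must finish first.
    Returns the names of the tasks that completed, in execution order.
    """
    completed = []
    for name in TopologicalSorter(dependencies).static_order():
        try:
            tasks[name]()
        except Exception as exc:
            # A failed task (for example, a data check) halts downstream work.
            print(f"task {name!r} failed: {exc}")
            break
        completed.append(name)
    return completed


# Hypothetical tasks that only record that they ran.
log = []
tasks = {
    "collect": lambda: log.append("collect"),
    "transform": lambda: log.append("transform"),
    "check_row_counts": lambda: log.append("check_row_counts"),
    "load": lambda: log.append("load"),
}
dependencies = {
    "collect": set(),
    "transform": {"collect"},
    "check_row_counts": {"transform"},
    "load": {"check_row_counts"},
}

completed = run_pipeline(tasks, dependencies)
```

A real orchestrator adds scheduling, retries, and monitoring on top of this ordering logic; the sketch only shows the dependency and gating idea.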
Our analytics data is largely warehoused in AWS S3, RDS, and Redshift, as well as MongoDB and Snowflake. We use these various warehouses for the different types of data that we need to collate for our analytical purposes, which vary from team to team.
Our data pipeline starts with the systems we developed that gather data from marketplace APIs, third-party data sources, and our web-scraping systems. We process 3 million data points per hour from these systems. We then prepare our data to be collated and structured to ensure data integrity and maximum utility for our teams.
Many of our experiments and exploratory analyses depend on systematic data collection to ensure high-resolution and high-quality data. Our rich datasets, which cover pricing, demand, and product metadata, among many others, have grown by 15% month-over-month over the past year.
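To put the month-over-month figure in perspective, the short calculation below compounds a constant 15% monthly rate over 12 months. It assumes the rate held steady every month, which is a simplification; the `compound_growth` helper is illustrative only.

```python
def compound_growth(rate_per_period, periods):
    """Return the total growth multiple after compounding a constant rate."""
    return (1 + rate_per_period) ** periods


# 15% month-over-month, compounded over 12 months.
yearly_multiple = compound_growth(0.15, 12)
print(f"Growth multiple after 12 months: {yearly_multiple:.2f}x")
```

Under that assumption, the datasets would be a little over five times their size from a year earlier.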
At this point the data are stored in their final form in warehouses for our ‘end-user’ processes to consume, such as internal dashboards, purchasing applications, or inventory management systems. But for the Data Team, the data in these warehouses are not final. Rather, the warehouses serve as staging areas from which we further transform the data and load them into our own analytics databases. From these databases, we prototype forecasting models, respond to real-time research needs, or explore new aspects of our data.
Ultimately, the insights derived from our data platform help our partners and clients grow. Our diverse portfolio of data, combined with our partners’ portfolios, creates synergies we can harness.
For example, while our partners may have sales and marketing data for their own portfolio of products, we can use our data to simulate and forecast sales. For marketing efficiencies, we have data related to pay-per-click ads, search engine optimization, and retargeting going back years for hundreds of thousands of products. Our Business Intelligence teams can stand up dashboards, personalized for each partner, that monitor the key performance indicators we know are important to track. Our historical data on shipping lead times and product warehousing cycles help our partners manage inventory and capital allocation. And last but certainly not least, our Data Science team uses these data to train AI models for our core business and partners.
Data is the core of what we do. It enables us to make improvements on the micro and macro levels for our partners, fueling their growth to new heights. If you want to leverage our data, software, or services, get in touch through our contact form.