Big Data describes a new generation of technologies and architectures that efficiently extract value from very large volumes of data through high-velocity capture, discovery, storage, and analysis.
Real Time Streaming
Real-time streaming processes large data streams at or near real time, with the goal of running “continuous queries” against them to produce real-time analytics. CCI has worked with many open source frameworks in this area to deliver solutions for its clients. Specifically, it has experience with the following commonly used open source streaming frameworks:
Apache Spark Streaming runs on Apache Spark to provide processing and analysis of both real-time and historical data. It uses the DStream, a discretized stream, to represent a continuous stream of data, and it enables high-volume, scalable, and fault-tolerant processing of live data streams. Spark also ships with tools for machine learning and graph processing on these streams.
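For illustration, the following is a minimal Spark Streaming sketch in Java (not drawn from a CCI engagement): it builds a DStream from a socket source and counts words per micro-batch. The host, port, and batch interval are placeholder values.

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

import scala.Tuple2;

public class StreamingWordCount {
    public static void main(String[] args) throws InterruptedException {
        // Local mode for demonstration; at least two threads are needed
        // so the receiver does not starve the processing tasks.
        SparkConf conf = new SparkConf()
                .setAppName("StreamingWordCount")
                .setMaster("local[2]");

        // Each 5-second micro-batch becomes one RDD in the DStream.
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(5));

        // DStream of text lines read from a socket source (placeholder host/port).
        JavaReceiverInputDStream<String> lines = jssc.socketTextStream("localhost", 9999);

        // Count occurrences of each word within every micro-batch.
        lines.flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
             .mapToPair(word -> new Tuple2<>(word, 1))
             .reduceByKey(Integer::sum)
             .print();

        jssc.start();
        jssc.awaitTermination();
    }
}
```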
Apache Storm provides event collection and processing at massive scale. It offers real-time processing and includes batch support through Trident, a high-level abstraction for real-time computation over Storm. It supports building stream-processing logic in multiple languages and guarantees delivery of data.
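A minimal Storm topology sketch follows, assuming the org.apache.storm Java API. TestWordSpout ships with Storm and stands in here for a real event source such as a Kafka spout; the parallelism hints and topology name are placeholders.

```java
import java.util.HashMap;
import java.util.Map;

import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.testing.TestWordSpout;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class WordCountTopology {

    // Bolt that keeps a running count per word; state is held per task.
    public static class CountBolt extends BaseBasicBolt {
        private final Map<String, Integer> counts = new HashMap<>();

        @Override
        public void execute(Tuple input, BasicOutputCollector collector) {
            String word = input.getStringByField("word");
            int count = counts.merge(word, 1, Integer::sum);
            collector.emit(new Values(word, count));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("word", "count"));
        }
    }

    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();

        // TestWordSpout emits random words, standing in for a real event source.
        builder.setSpout("words", new TestWordSpout(), 2);

        // fieldsGrouping routes every occurrence of a word to the same bolt task.
        builder.setBolt("counter", new CountBolt(), 4)
               .fieldsGrouping("words", new Fields("word"));

        // Local mode for development; StormSubmitter deploys to a real cluster.
        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("word-count", new Config(), builder.createTopology());
        Thread.sleep(30_000);
        cluster.shutdown();
    }
}
```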
Apache Flink is a distributed stream processing engine that provides accurate results, even for out-of-order or delayed data. It maintains exactly-once application state, yielding a robust streaming solution that is stateful and can seamlessly recover from failures. It performs well at large scale, with high-volume, low-latency characteristics.
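As a sketch against the Flink 1.x DataStream API (the socket source, host/port, and window size are placeholders), the following counts words in 10-second windows; the checkpointing call is what backs the exactly-once state guarantee described above.

```java
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;

public class FlinkWordCount {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Periodic checkpoints (every 10s) back Flink's exactly-once state guarantee.
        env.enableCheckpointing(10_000);

        env.socketTextStream("localhost", 9999)
           .flatMap((String line, Collector<Tuple2<String, Integer>> out) -> {
               for (String word : line.split("\\s+")) {
                   out.collect(Tuple2.of(word, 1));
               }
           })
           .returns(Types.TUPLE(Types.STRING, Types.INT)) // lambdas erase generics; declare types
           .keyBy(t -> t.f0)                              // partition state by word
           .window(TumblingProcessingTimeWindows.of(Time.seconds(10)))
           .sum(1)
           .print();

        env.execute("FlinkWordCount");
    }
}
```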
Kafka Streams simplifies application development by building on the Kafka producer and consumer libraries and leveraging the native capabilities of Kafka to offer data parallelism, distributed coordination, fault tolerance, and operational simplicity. It is best suited for streaming applications that already use Kafka in their data pipeline.
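A minimal Kafka Streams sketch is shown below; the topic names, application ID, and broker address are placeholders. It reads one topic, filters the records, and writes the matches to another topic.

```java
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Produced;

public class EventFilterApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "event-filter"); // consumer group / state prefix
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();

        // Read from one topic, keep only error events, write to another;
        // "events" and "error-events" are placeholder topic names.
        KStream<String, String> events = builder.stream("events");
        events.filter((key, value) -> value.contains("ERROR"))
              .to("error-events", Produced.with(Serdes.String(), Serdes.String()));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();

        // Close cleanly on shutdown so local state is flushed.
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```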
Batch Processing
CCI has used “Big Data” infrastructure to solve a major telecommunications company’s data-processing bottleneck for reporting. Call-center tracking data had grown exponentially, pushing up against a performance limitation that caused a 48-hour delay in the availability of data to external applications in the data warehouse staging systems. Scale issues in ETL and in the central data stores were forcing the organization either to compromise on data requirements, reducing both the scope and granularity of the data collected, or to commit sizable amounts of capital to expand infrastructure with diminishing returns on performance.
CCI delivered a fully integrated, highly scalable, and low-cost ETL solution based on the Hadoop “Big Data” infrastructure to support their growing data needs. The system ingests data from the workflow database, transforms it, stores it in a high-volume data store, and exports it to various data marts in time for the generation of business reports. It runs on low-cost commodity hardware in clustered instances within the corporate cloud and is projected to provide a linear growth path with minimal performance impact.
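The transform stage of such a pipeline can be sketched as a classic Hadoop MapReduce job. This is an illustrative example only, not CCI’s production code: it assumes tab-delimited call-center records with a call center ID in the first field, and the input and output paths are placeholders supplied on the command line.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CallRecordTransform {

    // Map: parse a raw record and emit it keyed by call center ID
    // (assumed to be the first tab-delimited field).
    public static class ParseMapper extends Mapper<Object, Text, Text, Text> {
        @Override
        protected void map(Object key, Text value, Context ctx)
                throws java.io.IOException, InterruptedException {
            String[] fields = value.toString().split("\t");
            ctx.write(new Text(fields[0]), value);
        }
    }

    // Reduce: aggregate records per call center into staging-format rows
    // (here, a simple count per key).
    public static class AggregateReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context ctx)
                throws java.io.IOException, InterruptedException {
            long count = 0;
            for (Text v : values) {
                count++;
            }
            ctx.write(key, new Text(Long.toString(count)));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "call-record-transform");
        job.setJarByClass(CallRecordTransform.class);
        job.setMapperClass(ParseMapper.class);
        job.setReducerClass(AggregateReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS drop zone (placeholder)
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // staging output (placeholder)
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```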