Commit 9089fbd: firehose moving from GitLab to GitHub

38 files changed (+1,193,046 / -0 lines)

README.md

# What is a firehose?

> The firehose API is a steady stream of all available data from a source in real time – a giant spigot that delivers data to any number of subscribers at a time. The stream is constant, delivering new and updated data as it happens. The amount of data in the firehose can vary with spikes and lows, but the data keeps flowing through the firehose until it is crunched. Once crunched, that data can be visualized, published, graphed – really anything you want to do with it, all in real time.

## How can we build a firehose?

```[Data] + [Queue] + [Streaming] = Firehose```

1. Data
   * Weather and temperature data
   * Stock quote prices
   * Public transportation time and location data
   * RSS and blog feeds
   * Multiplayer game player position and state
   * Internet of Things sensor network data

2. Queue server support
   * ActiveMQ
   * Amazon SQS
   * Apache Kafka
   * RabbitMQ

3. Streaming server support
   * Amazon Kinesis
   * Apache Spark
   * Apache Storm
   * Google Dataflow

### Our use case: process 1 million stock market records

* [Bombay Stock Exchange historical stock price data](http://www.bseindia.com/markets/equity/EQReports/StockPrcHistori.aspx?scripcode=512289&flag=sp&Submit=G)
* Apache Kafka
* Apache Spark

## Kafka Setup

#### Download Kafka on your machine: [download Kafka](https://kafka.apache.org/quickstart)

### Setting up multiple Kafka brokers

* First, make a config file for each of the brokers:

```
$ cp config/server.properties config/server-1.properties
$ cp config/server.properties config/server-2.properties
$ cp config/server.properties config/server-3.properties
```

* Now edit these new files and set the following properties:

```
config/server-1.properties:
broker.id=1
listeners=PLAINTEXT://:9093
log.dir=/tmp/kafka-logs-1

config/server-2.properties:
broker.id=2
listeners=PLAINTEXT://:9094
log.dir=/tmp/kafka-logs-2

config/server-3.properties:
broker.id=3
listeners=PLAINTEXT://:9095
log.dir=/tmp/kafka-logs-3
```

### Start ZooKeeper

```
$ bin/zookeeper-server-start.sh config/zookeeper.properties
```

### Start the brokers

Run each broker in its own terminal, or pass `-daemon` to run them in the background:

```
$ bin/kafka-server-start.sh config/server-1.properties

$ bin/kafka-server-start.sh config/server-2.properties

$ bin/kafka-server-start.sh config/server-3.properties
```
82+
83+
### create a new topic with a replication factor of three:
84+
85+
```
86+
bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 3 --partitions 1 --topic test-firehose
87+
```
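
To confirm the topic is replicated across all three brokers, describe it (standard Kafka CLI, as in the quickstart):

```
$ bin/kafka-topics.sh --describe --zookeeper localhost:2181 --topic test-firehose
```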

### Kafka producer properties

* Update the Kafka producer properties in [application.yml](firehose/src/main/resources/application.yml):

```
bootstrap servers: localhost:9093,localhost:9094,localhost:9095
```
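
For reference, here is a minimal sketch of the producing side using the `kafka-clients` API declared in `pom.xml`. The topic name comes from the setup above; the class name and the sample stock-price payload are illustrative, not the project's actual code:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class FirehoseProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Broker list from application.yml
        props.put("bootstrap.servers", "localhost:9093,localhost:9094,localhost:9095");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        Producer<String, String> producer = new KafkaProducer<>(props);
        // Hypothetical payload: one stock-price record as a CSV line, keyed by scrip code
        producer.send(new ProducerRecord<>("test-firehose", "512289", "2017-08-01,512289,105.20"));
        producer.close();
    }
}
```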

### Spark consumer properties

* Update the Spark consumer properties in [application.yml](firehose/src/main/resources/application.yml):

```
bootstrap servers: localhost:9093,localhost:9094,localhost:9095
zookeeper connect: localhost:2181
```
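
A minimal consumer sketch, assuming the receiver-based `spark-streaming-kafka-0-8` API declared in `pom.xml`; the consumer group id, `local[*]` master, and class name are illustrative:

```java
import java.util.Collections;
import java.util.Map;

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Duration;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka.KafkaUtils;

public class FirehoseConsumerSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("firehose").setMaster("local[*]");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, new Duration(15000));

        // Receiver-based stream: ZooKeeper quorum from application.yml,
        // one receiver thread for the test-firehose topic
        Map<String, Integer> topics = Collections.singletonMap("test-firehose", 1);
        JavaPairDStream<String, String> stream =
                KafkaUtils.createStream(jssc, "localhost:2181", "firehose-group", topics);

        // Print each stock-price record value as a smoke test
        stream.map(record -> record._2()).print();

        jssc.start();
        jssc.awaitTermination();
    }
}
```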

### Stock market data location

* Update the data location in [application.yml](firehose/src/main/resources/application.yml):

```
data location: /Users/rohitkumar/Work/code-brusher/firehose/src/main/resources/data
```
116+
117+
### How to run the firehose
118+
119+
120+
```
121+
$ mvn spring-boot:run
122+
123+
```
124+
125+
### See Spark UI
126+
127+
* [spark-web-ui](http://localhost:4040) for the firehose job stats.
128+
129+
130+

## Firehose statistics on my machine: 1 million stock price records processed

#### 1 million stock price records processed in `4 min 16 sec`

* Machine: MacBook Pro, 2.7 GHz dual-core Intel Core i5, 8 GB 1867 MHz DDR3 RAM, 128 GB SSD

![Spark one million](firehose/src/main/resources/spark_stats/Spark-1.png "Spark UI")
![Spark scheduling delay](firehose/src/main/resources/spark_stats/spark-2.png "Spark UI")
![Spark batch status](firehose/src/main/resources/spark_stats/spark-3.png "Spark UI")
![Complete data processed](firehose/src/main/resources/spark_stats/spark-4.png "Spark UI")

### Performance tuning Spark Streaming

#### Batch interval parameter

Start with an intuitive batch interval, say 5 or 10 seconds. If your overall processing time is less than the batch interval, the application is stable. In my case a 15-second interval suited my processing and gave the best performance.

```
JavaStreamingContext streamingContext = new JavaStreamingContext(sparkContext, new Duration(15000));
```

#### ConcurrentJobs parameter

By default the number of concurrent jobs is 1, which means only one job is active at a time; until it finishes, other jobs are queued up even if resources are available and idle. This parameter is intentionally not documented in the Spark docs because it can sometimes cause weird behaviour, as Spark Streaming creator Tathagata Das discussed in this useful [thread](http://stackoverflow.com/questions/23528006/how-jobs-are-assigned-to-executors-in-spark-streaming). Tune it accordingly, keeping the side effects in mind.

Running concurrent jobs brings down the overall processing time and scheduling delay, even if a batch takes slightly more processing time than the batch interval.

In my case:

```
"spark.streaming.concurrentJobs", "1" - scheduling delay around 3.43 seconds, processing time 9.8 seconds
"spark.streaming.concurrentJobs", "2" - improved: scheduling delay 1 millisecond, processing time 8.8 seconds
```
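
A short sketch of where this knob is set; the property key is the real one discussed above, while `setAppName` and the value are from the experiment:

```java
import org.apache.spark.SparkConf;

// Allow two streaming jobs to run at once (undocumented knob, use with care)
SparkConf conf = new SparkConf()
        .setAppName("firehose")
        .set("spark.streaming.concurrentJobs", "2");
```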

#### Backpressure parameter

Spark provides a very powerful feature called backpressure. With this property enabled, Spark Streaming tells Kafka to slow down the rate of sending messages when processing time exceeds the batch interval and the scheduling delay starts increasing. It is helpful when there is a sudden surge in data flow, and it is a must-have property in production to avoid overburdening the application. However, it should be disabled during development and staging; otherwise we cannot test the maximum load our application can, and should, handle.

```
set("spark.streaming.backpressure.enabled", "true")
```
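
Putting the tuning knobs together, a hedged sketch of the final configuration; `spark.streaming.receiver.maxRate` is an additional standard property that caps records per second per receiver as a safety net alongside backpressure, and the values shown are only examples:

```java
import org.apache.spark.SparkConf;

SparkConf conf = new SparkConf()
        .setAppName("firehose")
        .set("spark.streaming.concurrentJobs", "2")
        // throttle the ingest rate automatically when batches fall behind
        .set("spark.streaming.backpressure.enabled", "true")
        // optional hard cap (records/sec per receiver); illustrative value
        .set("spark.streaming.receiver.maxRate", "10000");
```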

pom.xml

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <packaging>jar</packaging>

    <groupId>firehose</groupId>
    <artifactId>data.firehose</artifactId>
    <version>1.0-SNAPSHOT</version>

    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <maven.compiler.source>1.8</maven.compiler.source>
        <maven.compiler.target>1.8</maven.compiler.target>
    </properties>

    <dependencies>

        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-web</artifactId>
            <version>1.5.5.RELEASE</version>
        </dependency>

        <dependency>
            <groupId>org.apache.kafka</groupId>
            <artifactId>kafka_2.11</artifactId>
            <version>0.8.2.1</version>
        </dependency>
        <dependency>
            <groupId>org.apache.kafka</groupId>
            <artifactId>kafka-clients</artifactId>
            <version>0.8.2.0</version>
        </dependency>

        <!-- kafka -->
        <!--<dependency>
            <groupId>org.apache.kafka</groupId>
            <artifactId>kafka_2.11</artifactId>
            <version>0.10.0.0</version>
        </dependency>

        <dependency>
            <groupId>org.apache.kafka</groupId>
            <artifactId>kafka-clients</artifactId>
            <version>0.10.0.0</version>
        </dependency>-->

        <!-- spark -->

        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.11</artifactId>
            <version>2.2.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming_2.11</artifactId>
            <version>2.2.0</version>
        </dependency>

        <!--<dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
            <version>2.2.0</version>
        </dependency>-->

        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming-kafka-0-8_2.11</artifactId>
            <version>2.2.0</version>
        </dependency>

        <dependency>
            <groupId>org.apache.avro</groupId>
            <artifactId>avro</artifactId>
            <version>1.8.0</version>
        </dependency>

        <dependency>
            <groupId>com.twitter</groupId>
            <artifactId>bijection-avro_2.10</artifactId>
            <version>0.9.2</version>
        </dependency>

        <dependency>
            <groupId>org.apache.poi</groupId>
            <artifactId>poi</artifactId>
            <version>3.16</version>
        </dependency>

        <dependency>
            <groupId>org.apache.poi</groupId>
            <artifactId>poi-ooxml</artifactId>
            <version>3.16</version>
        </dependency>

        <dependency>
            <groupId>com.googlecode.json-simple</groupId>
            <artifactId>json-simple</artifactId>
            <version>1.1.1</version>
        </dependency>

        <dependency>
            <groupId>net.sf.supercsv</groupId>
            <artifactId>super-csv</artifactId>
            <version>2.4.0</version>
        </dependency>

        <!-- Jackson dependencies for generating a JSON schema from a domain object -->
        <dependency>
            <groupId>com.fasterxml.jackson.module</groupId>
            <artifactId>jackson-module-jsonSchema</artifactId>
            <version>2.6.5</version>
        </dependency>
        <dependency>
            <groupId>com.fasterxml.jackson.core</groupId>
            <artifactId>jackson-core</artifactId>
            <version>2.6.5</version>
        </dependency>
        <dependency>
            <groupId>com.fasterxml.jackson.core</groupId>
            <artifactId>jackson-databind</artifactId>
            <version>2.6.5</version>
        </dependency>

        <!-- Cassandra -->
        <dependency>
            <groupId>com.datastax.spark</groupId>
            <artifactId>spark-cassandra-connector_2.10</artifactId>
            <version>2.0.3</version>
        </dependency>

    </dependencies>

    <build>
        <pluginManagement>
            <plugins>
                <plugin>
                    <groupId>org.springframework.boot</groupId>
                    <artifactId>spring-boot-maven-plugin</artifactId>
                    <version>1.5.5.RELEASE</version>
                    <executions>
                        <execution>
                            <goals>
                                <goal>repackage</goal>
                            </goals>
                            <configuration>
                                <executable>true</executable>
                            </configuration>
                        </execution>
                    </executions>
                </plugin>
            </plugins>
        </pluginManagement>
    </build>

</project>
