Documents
GCHQ Analytic Cloud Challenges
Sep. 25 2015 — 9:30a.m.
GCHQ Analytic Cloud Challenges Innovation Lead for Data, Analytics & Visualisation Engineering This information is exempt from Disclosure under the Freedom of Information Act 2000 and may be subject to exemption under other UK information legislation. Refer disclosure requests to GCHQ on 14/5/2012 UK TOP SECRET STRAP 1 1
Approximately 50 Billion Events Per Day from 250+ x10G bearers Other 9% Volume by type Target Detection Identifier: Presence 35% Comms 5% Host Refererer 20% Chelt 33% Web Visits 31% Bude 67% • Scale Target for May 2012 is approximately 320 x 10G with a capacity for 100 BEPD 26/3/2012 Ref: 18171507 Volume by site • Typical retention is six months, sometimes reduced to cope with volume increase UK TOP SECRET STRAP 1 2
lJl'L SECRET TDB Events PG (JV-1 - Events High Level Concept Graphic Purpose A high level view can GCHQ event (unselected meta-data} capabilities and uses. Mail Cinder Files armed Events vents eaten Analytic. i: ADODP Clusters] Extract nd ?ltering Analytic velopment, Ad- he data mining Query Asses vie Dat pert {Interface};r 'v'AlL [web Lil] I Git-13 Hieh Created: rs'ommzizezna Medined: meme? 15:21:32 Owner: - crap This infnm'ialinn is lemm d'meeure uni'jer [he Freedem Ui' infennesen nuzeeu end may be sub-3911' to medium under other UK infnn'nalien .Refet disclosure raqueste In ism-la en SECRET
Query Focused Dataset What are QFDs • A database designed for answering a single (or few) analytic question • Initially adopted from research, engineering tasked to mature and scale ASAP • The engineering has been incremental based on priority • Library of components and design patterns developed • Scale using appropriately sized database instances, federation using service tiers • Deployment of new instances and updates streamlined • Corporate support • New QFDs still being produced by engineering and others using the standard components • Some now provided interactive access to datasets generated by HADOOPS batch analytics 26/3/2012 Ref: 18171507 UK TOP SECRET STRAP 1 QFD Facts and Figures • Approximately 100 QFD instances deployed for 16 different QFDs + more for UI etc • Each with events (or results) appropriate for the questions they answer • Hardware shared depending on QFD, all storage is shared • Driven by the need for a flexible platform at large scale with low overall cost • Redhat Enterprise Linux, Oracle DB (DB needs are simple) • Latest generation HP BL456c G7 blades, 24 core, 64GB RAM, EVA storage ~200TB usable • Most QFD instances have 70TB usable capacity • Total storage 15PB raw, 11PB usable 4
A selection of the major QFDs Name Description Questions Answered Physical Bude+Chelt AUTOASSOC Bulk unselected TDI-TDI correlations with confidence scores. What other TDIs belong to your target ? What technologies your target is using ? 2+1 instances, each 50-70TB storage Evolved Mutant Broth Identify when certain TDIs appear in traffic which indicate target usage and their location. Telephony and C2C data provide a converged view. Where has my target been? What kind of communications devices has my target been using? 10+5 instances, each 70TB storage Hard Assoc Provide strongly correlated selectors for both C2C and Telephony traffic taken from TDIs appearing in the same packet Are there any alternative C2C or Telephony selectors for my target? 3+2 instances, each 70TB storage HRMap Host-referrer relationships - information about how people get to websites, including links followed and direct accesses. How do people get to my website of interest and where do they go to next? What websites have been visited from a given IP? 5+3 instances, each 70TB storage KARMA POLICE Which TDIs have been seen at approximately the same time, and from the same computer, as visits to websites. Which websites your target visits, and when/where those visits occurred. Who visits suspicious websites, and when/where those visits occurred. Which other websites are visited by people who visit a suspicious website. Which IP address and web browser were being used by your target when they visited a website. 11+7 instances, each 70TB storage, 3+1 correlator instances SOCIAL ANTHROPOID Converged comms events allowing you to see who your targets have communicated with via phone, over the internet, or using converged channels (e.g. sending emails from a phone or making voice calls over the internet). What communications your target is engaged in. Who has your target been communicating with. What communications have occurred using a particular locator (IP address, cell tower, etc). 6+3 instances, each 70TB storage 26/3/2012 Ref: 18171507 UK TOP SECRET STRAP 1 5
HADOOP Clusters Facts & Figures Usage • Three batch analytic clusters (HADOOP clouds) • Each contains unique, unselected events, reference and results data • Redhat Enterprise Linux, Apache HADOOP (Map Reduce) • Each with ~900 data nodes, plus UI/ingest/egress servers • Each node 8 core, 64GB RAM, 6x1TB • Each 6PB raw storage, 1.5PB for events (3x replication and results) • The three clouds could, in theory, store 24 Trillion events @140 BEPD and 6 months retention period • 200 capability developer accounts • Many more users able to access applications • Guiding light > 150 users • First Contact, Cloudy Cobra > 50 • Jobs during working hours • Operationally ad-hoc (experimental, search, one time use) • Development of new capability • Schedules Jobs run overnight • Sustained, operational applications • Increasing expectation for fully supported, automated jobs such as Rumour Mill 26/3/2012 Ref: 18171507 UK TOP SECRET STRAP 1 6
Physical Overview Data Node Data Node Data Node Application Node Application Node (QFD) (QFD) Analyst External Data Sources Ingest Node Ingest Node Master Node Master Node Hadoop Cluster Innovation Node Innovation Node (QFD) (QFD) Capability Developer Private Cloud • Data Nodes • HADOOP Master Nodes (Job Track & Name Node) Edge Nodes connect to Computer Hall Network • Ingest Nodes • Application Nodes (provide user interface – web) • Innovation Nodes (cap dev login to servers) 26/3/2012 Ref: 18171507 UK TOP SECRET STRAP 1 7
Data Ingest Ingest Design Goals Data Formats • Store all data received, i.e. don’t risk losing important fields by normalising to single file format. • Partition by directory structure • Partition by data type to simplify common queries. • Partition by ingest date to allow incremental analytics. • Partition by security compartment. • Horizontal scaling by adding new hardware. 26/3/2012 Ref: 18171507 • • • • • • • TLV – Tag Length Value Comma Delimited Fix Position Single Line & Multi-Line Multiple Variations Actor Action Many sources and many types UK TOP SECRET STRAP 1 8
Analytic use of data Security Silver Library • Data access controlled across 2 dimensions: • Usability: Operational (standard events) or Controlled (special purpose e.g. test) • Classification: Multiple buckets Open (TSS2), a few compartments, general CIOs Sensitive • Capability developer access restricted by file/directory permissions • User facing applications apply fine grained security on per record basis • Abstraction layer to separate analytic development from storage dependencies • Pluggable parsers for new formats • Record parsers for event data held in HDFS • Applications still need to know field labels (function calls) • Applications often add additional abstractions • Input and Output handlers to read and write data based on type • Utility functions to aid development • Simple search and filter routines • Events Product Centre maintained 26/3/2012 Ref: 18171507 UK TOP SECRET STRAP 1 9
Example Analytics Every Assoc •User /machine correlations from C2C presence Every Police •User /machine website visits Every Creature •User /machine search terms Every eAD •User /machine electronic attack patterns Every Cipher •User /machine cipher events Grey Fox / Silver Fox •Country level summary of where identifier observed First Contact •1 & 2 hop contact chains between seeds and targets Cloudy Cobra •Glorified grep driven by GUI – find events that contain user search term 26/3/2012 Ref: 18171507 Guiding Light •MI information types/volumes of traffic on bearer Golden Axe •Generated list of suspected clone mobile phones (IMEI grey list) Tribal Carnem •Uses Radius logs to identify & collect activity for IP session Public Anemone •Geolocation based on web-based map searches Epic Fail •Identifies careless use of TOR networks Sterling Moth •IP summarisation tool using c2C presence events Foghorn •Find non-targets using targets machines And many more…. UK TOP SECRET STRAP 1 10
Rich Visualisation • Desktop applications for rich visualisation and analysis - Eclipse RCP • LOOKING GLASS client platform, FIRE ENGINE question-based federated access to events and reference data sources • Pilots – NGCC Contact chaining with C2C/converged event data – QFD Federator – NG Geo: Converged geo events analysis • Next stage to learn from pilots and build the tool for everyone - ALPHA CENTAURI 26/3/2012 Ref: 18171507 UK TOP SECRET STRAP 1 11
Rumour Mill – Push Analytics • Rumour Mill is a dashboard that will: – Enable analysts to prioritise new work as it arrives from customers by easily finding out "what does GCHQ already know" – Enable analysts to monitor existing work to spot when something happens that would change their priorities • • • 26/3/2012 Ref: 18171507 First level results, list of questions against identifiers. Simple yes/no answers Click on a yes to the second level drills into the detail Many questions are derived from cloud based analytics run each day against the current identifier list UK TOP SECRET STRAP 1 12
Sharing & Collaboration Other SIA and foreign partners Data (bulk & query) and technology exchange Two major components: • Web user interfaces (VAIL) on GCHQ servers but accessible from the partner site. Interactive query of QFDs. Allow exposure of GCHQ tradecraft. • Brokering services. Sustained access for interactive query of GCHQ data integrated into partner tools. 26/3/2012 Ref: 18171507 UK TOP SECRET STRAP 1 13
Future Options for Event Processing
Key Challenges Affordable, continuing scale of our capabilities • All dimensions, cost, power, cooling, space, storage, bandwidth, processing etc • As we share with more partners, the access demands increase • Enable more delivery by collaborating with others Enabling our analysts to cope with complexity and volume of data • Need to streamline understood workflows to save analysts time • Need to push data to analysts, sometimes with low latency • Enable next level analysis, behavioural, pattern of life, noticing change, predictions Complexity and pace of analytic development • Suitably skilled people are hard to find in-house and externally • Must be able to cope with mixed maturity capabilities • Workload separation, processing contention, cross site analytics Agility, resilience and maturity of platforms 26/3/2012 Ref: 18156927 • Keep pace with development, respond to community demands • Understand and match the evolving business expectations of maturity • Support appropriately, agreed strategy for resilience UK TOP SECRET STRAP 1 15
Approach to challenges Deploy more of the same while maturing Explore new technology Know the value of our data Understand usage of capabilities Stay flexible and increase collaboration 26/3/2012 Ref: 18156927 • Suitably refreshed, will answer immediate scaling challenge • Reduce support burden and streamline new capability development • Implement improvements to existing architecture • Non-relational, distributed databases; QFD consolidation & convergence • Research initiatives (A)SEM/HAKIM/STREAMING, and commercial offerings • Collaboration opportunities CLOUDBASE/ACCUMULO • Don’t generate/ingest data we don’t use, filter or de-dupe upstream • Be smart about retention, be smart about need for interactive availability • Distil the raw data to generate rich information sets for analysts • Automate the simple workflows with summaries and push analytics • Optimise capabilities for their use and access load • Mature, provide resilience and support as appropriate • Enable collaboration and lower the entry bar for capability developers • Stay vendor neutral, continue to use open standards, open software • Enhance the benefit for both research & engineering of working together UK TOP SECRET STRAP 1 16
Understand usage of capabilities Use Case Class Mapping Large proportion of queries for most of the analysts Known Target, Unknown Query Small proportion of queries for some of the analysts • Developed jointly between GCHQ and NSA to understand – the benefits of our current capabilities – where respective strengths and weaknesses exist Unknown Target, Unknown Query Target Based Discovery Anomaly Detection Target-based Development Target Tracking Known Target, Known Query 26/3/2012 Ref: 18156927 • Provides a clear set of drivers for architectural evolution Behaviour-based Discovery Unknown Target, Known Query UK TOP SECRET STRAP 1 – Missing capabilities – Suboptimal use of capabilities – Opportunities for collaboration and reuse 17
Existing coverage of use cases Known Target, Known Query • Streaming, Rumour Mill and QFD capabilities (possibly assisted by cloud analytic) • Could select data subset based on target, could pre-calculate results • Large usage and automation suggests need for optimised capability Known Target, Unknown Query • Not possible with existing QFD capabilities • Needs new analytic or more indexes on target selected data • Large usage and needs suggest a new capability Unknown Target, Known Query • Core QFD capability, possibly populated by cloud analytic • Full unselected data set required but what level of interactive query? • Is it acceptable that older data be made available non-interactively? Unknown Target, Unknown Query • Use batch analytic platform for low level search or new analytic • Could a heavily indexed store provide a responsive capability? • Possibly only over a subset of data - recent data only? 26/3/2012 Ref: 18156927 UK TOP SECRET STRAP 1 18
Value assessment What measures of value? • Duplicated over time or across bearers • Value decreases with age • Value decreases after processing (SPAM, summaries) • Use by analysts and analytics What actions to take? • Do filtering upstream or don’t generated • Investigate use of streaming capabilities for filtering or sampling • Discard immediately after first stage processing • Age off and discard more selectively • Periodically evaluate data types to understand usage Need to agree possibilities with operations and experiment 26/3/2012 Ref: 18156927 UK TOP SECRET STRAP 1 19
Technology candidates QFD consolidation and HAKIM Existing QFD drawbacks • Some duplication between QFDs (~10%) • Need new QFDs to answer new questions Need a consolidated database with multiple indexes and flexible additions • HAKIM is a research prototype to do just that • Unification of data, associated data kept together • Quick and flexible addition of new data types and indexes • Scalable and cost effective Other candidates exist and some HAKIM components could be replaced • Oracle DB or distributed, non-relational database? • Convergence with HADOOP stack HBASE/ACCUMULO Engineering is working with research to develop to the next stage 26/3/2012 Ref: 18156927 UK TOP SECRET STRAP 1 20
Technology candidates The “skinny” cloud (Laurel) What is it? • A batch analytic HADOOP cluster • Contains all the data but for a short retention period What would it be used for? • An ideal place to do incremental analytics • Answers the cross-site analytic problem • Could be dedicated to sustained usage • Provides some resilience by duplicating recent data What are the challenges? • Can we transfer and ingest all the data Engineering could potential build this using on-order kit 26/3/2012 Ref: 18156927 UK TOP SECRET STRAP 1 21
Technology candidates The GCHQ “core” (Hardy) What is it? • A HADOOP cluster with Map/Reduce and interactive query/analytics capabilities • Probably using CLOUDBASE/ACCUMULO and reusing NSA knowledge What would it be used for? • A place for data and summaries promoted from the bulk stores • Known Target, Know Query and some Known Target, Unknown Query • Optimised for major use case and suitable for data sharing • Provides some resilience by duplicating important data What are the challenges? • GCHQ has limited expertise with CLOUDBASE/ACCUMULO this technology • The promotion analytics and criteria are not developed Engineering could potential build this using on-order kit 26/3/2012 Ref: 18156927 UK TOP SECRET STRAP 1 22
UK TCIPSEERET TDB Events PG Stir-1 - Skinny cloud, GCHQ Core and QFD consolidation Input Buffer All Bude Even All l[Ihelt Er Buds Speci?c Event Types Bull: Events Batch Analytic Cluster 1 heltenham Sile Daily new Events for current selectors All Evenls for new selectors lery Ftes ponss Push daily wanes NSA Site All Events for new selectors All Events for nevrI selectors Purpose Shows possible hey.r architecture for GCHQ Events. New components: All Bude Even Input Buffer All Bude Events Bull: Events Batch Analytic muster 1 Bull Event: Batch Analytic Cluster 2 Elude Site Skinny cloud - all the data from both sites but held for less time. A place to do daily, incremental analytics on bulk events. Potentially. only sustained analytics. Also provides a usable copy of the data for a shon time period giving some resilience. GCHQ I{lore 'v'aluable events are selected SA 3L GCHQ seeds} by analytics and copied into the core. The core is not primarily a place for batch analytics, rather an interactive query repository. In addition to the ray.r eventsr summaries are held and shipped to NBA for integration into their tools. All the important events are held for the normal retention period providing resilience. Consolidated DFD - The next stage for QFDs. lower storage, more ?exible and more esin scaled are all Sit-1 Resource interaction Speci?cation Created: ?reassurance-l in mm disclaer under rm Freedom drinrodnarjm menus and may beenbiectip underethlr urc legislation. Referdiaclcsure requests to ecr-io or possible bene?ts. Modified: 15:09:22 Owner: UK