[Owasp-appsensor-project] GSoC 2016 Trend Monitoring Analysis Engine

Timothy Sum Hon Mun timothy22000 at gmail.com
Tue Mar 15 18:26:49 UTC 2016


Dear John,

Sorry for the late reply; I have been busy with the last few weeks of my
placement. :) Hope you are doing well. My responses regarding GSoC are inline.

Side note: I notice that there are 3 new issues open regarding AppSensor
(elasticsearch, kafka, and mongo query), and I am wondering whether anybody is
working on any of them. I am keen to help out with the Kafka one (although I
might only start work on it in April/May, once I am done with my placement bar
the report). I have read up on version 0.9, its new consumer API, and its
security features.

I suppose the scope of that issue covers upgrading and ensuring nothing breaks,
with the new consumer API and the new security features being a separate
issue? We can discuss it on the issue ticket itself.
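
Just to show what I mean by the new consumer API, here is a rough sketch
(only an illustration: the topic name, group id, and SSL settings are
invented, not taken from the AppSensor code) of a 0.9-style consumer with
SSL enabled:

import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class NewConsumerSketch {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9093");
        props.put("group.id", "appsensor-analysis"); // hypothetical group id
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        // 0.9 security features: encrypt the connection to the broker
        props.put("security.protocol", "SSL");
        props.put("ssl.truststore.location", "/path/to/truststore.jks"); // placeholder
        props.put("ssl.truststore.password", "changeit");                // placeholder

        KafkaConsumer<String, String> consumer =
                new KafkaConsumer<String, String>(props);
        consumer.subscribe(Collections.singletonList("appsensor-events")); // hypothetical topic

        try {
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(100);
                for (ConsumerRecord<String, String> record : records) {
                    System.out.println(record.value());
                }
            }
        } finally {
            consumer.close();
        }
    }
}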

Kind Regards,
Tim

On Tue, Mar 8, 2016 at 5:27 PM, John Melton <jtmelton at gmail.com> wrote:

> Responses inline.
>
> On Tue, Mar 8, 2016 at 4:33 AM, Timothy Sum Hon Mun <
> timothy22000 at gmail.com> wrote:
>
>> Hi John,
>>
>> Thanks for getting back to me. It was good hearing back from you. I've
>> replied to you inline below.
>>
>> Besides that, I made a pull request for some minor changes and a test
>> that I added for AppSensor as a first contribution:
>> https://github.com/jtmelton/appsensor/pull/38
>>
>>
> Fantastic. I'll take a look at that later today!
>
>
>> Thanks again!
>>
>> Best Regards,
>> Tim
>>
>> On Mon, Mar 7, 2016 at 4:37 AM, John Melton <jtmelton at gmail.com> wrote:
>>
>>> Tim,
>>>
>>> Hi, and thanks so much for your email. I've responded with specific
>>> comments inline below.
>>>
>>> Thanks,
>>> John
>>>
>>> On Sun, Mar 6, 2016 at 1:58 PM, Timothy Sum Hon Mun <
>>> timothy22000 at gmail.com> wrote:
>>>
>>>> Hi all,
>>>>
>>>> Firstly, congratulations on OWASP being accepted for GSoC 2016!!
>>>>
>>>> My name is Timothy Sum and I am from Malaysia. I am currently a final
>>>> year MSc Computer Science student at the University of Kent in the UK.
>>>> I have experience with Java, JavaScript, Python, Node.js, MongoDB, AWS,
>>>> Jenkins, Git workflows, Dropwizard, Logstash, Apache Spark (MSc
>>>> dissertation) and more. I am always keen to learn new technologies and
>>>> try things outside my comfort zone!
>>>>
>>>> I am currently on my placement (where I gained most of my experience),
>>>> which concludes on 31 March 2016. I will be working full time on
>>>> weekdays before then, so I will do my research about the project and
>>>> prepare my proposal, typically at night or during the weekends. After
>>>> my placement finishes, I will be able to commit completely to GSoC by
>>>> researching, learning, and experimenting to fill the gaps in my
>>>> knowledge during April, even before the community bonding period. I
>>>> will have a written report about my placement due in June 2016, but I
>>>> can do that while coding over the summer!
>>>>
>>>> I only stumbled upon GSoC 3 days ago and have been looking through the
>>>> project list to decide which project I should go for. This will be my
>>>> first time contributing to an open source project and I am very excited
>>>> about it, as I get to learn from a mentor and contribute at the same
>>>> time. :) I am also happy to have regular Skype/Hangouts discussions
>>>> with mentors about my progress.
>>>>
>>>
>>> Yes, skype/hangouts is the normal way we communicate. I generally aim
>>> for meetings 2-3 times a week so we can make sure we're making forward
>>> progress and then use email in between meetings for specific questions.
>>>
>>
>>>
>>>>
>>>> I am interested in the Trend Monitoring Analysis Engine project for
>>>> OWASP AppSensor and would be excited to work on it. I do not have a
>>>> background in application security or intrusion detection, but I am
>>>> highly interested in learning about it. So far, I have:
>>>>
>>>
>>> Fantastic. Honestly, a background in spark / machine learning will be
>>> more important.
>>>
>>
>> Cool! I did a module in data mining for my MSc that should come in handy
>> (I learned about machine learning algorithms such as decision trees). I
>> used Spark for the first time during my dissertation to implement a
>> classification algorithm. I did not get to use Spark's machine learning
>> library, but my past experience should hopefully make the transition easier.
>>
>>>
>>>
>>>>
>>>> i) Read Chapters 3 and 4 of the OWASP guide briefly to understand the
>>>> approach behind AppSensor, its high-level architecture (detection and
>>>> response units), and its patterns (Event, EventManager,
>>>> EventAnalysisEngine, and so on).
>>>>
>>>> ii) Managed to get a demo running locally as per the AppSensor Demo
>>>> Setup guide (
>>>> https://github.com/jtmelton/appsensor/blob/master/sample-apps/DemoSetup.md).
>>>> Hit a small bump with a Mongo test failing during mvn install, but got
>>>> it to work in the end. Went through part of the codebase while doing this.
>>>>
>>>> iii) Researched trend monitoring analysis techniques. Based on my
>>>> understanding so far, trend analysis falls under anomaly detection, but
>>>> feel free to correct me (I will expand on this in the section below). It
>>>> would be great if you could recommend additional papers/books so I can
>>>> learn more about this topic.
>>>>
>>>> Did a first pass on two papers that cover general topics in IDS:
>>>>
>>>> http://galaxy.cs.lamar.edu/~bsun/seminar/example_papers/IDS_taxonomy.pdf
>>>>
>>>> http://www.ijcset.net/docs/Volumes/volume2issue4/ijcset2012020419.pdf
>>>>
>>>>
>>> There is not much literature specific to application intrusion
>>> detection. The concept is roughly based on network IDS systems. It is
>>> mostly transferring those concepts to the application layer, and looking
>>> for activity that is not possible (or is much harder) to detect at the
>>> network layer, but is possible (or much easier) at the application layer.
>>>
>>
>> Interesting, I will probably do some reading to get a better overview
>> of IDS in general.
>>
>>>
>>>
>>>> Currently, I have given it some thought and my high-level understanding
>>>> of the expected deliverables is:
>>>>
>>>> i) A trend monitoring analysis engine - extend the analysis-engines
>>>> package and add tests. Depending on which implementation strategy is
>>>> used, it seems that I would have to record the "normal" behaviour
>>>> pattern of a system and then trigger a response if the application
>>>> behaves outside the norm, as defined by the trending rules.
>>>>
>>>
>>> I think of 2 possible approaches:
>>> - *simple trending engine* - this would be an implementation that would
>>> essentially do some simple counting. An example here might be that we have
>>> seen the occurrence of detection point ABC go up 500% in the last hour over
>>> the "normal" usage. This would likely be pretty straightforward, and could
>>> use something like a time series database to track the metadata, and do
>>> some very fast analysis.
>>>
>>
>> I looked up time series databases to learn more about them, as I have
>> not worked with one before.
>>
>>
>> http://stackoverflow.com/questions/8816429/is-there-a-powerful-database-system-for-time-series-data
>>
>> I notice that we have an implementation that integrates with InfluxDB in
>> the appsensor-integration-influxdb package.
>>
>> If I were to build the simple trending engine, I would have to extend the
>> current implementation so that events written to InfluxDB can be read
>> back, allowing me to do the counting and analysis needed to decide whether
>> the activity is unusual. This assumes that I will be using InfluxDB, of
>> course. What is your opinion?
>>
>
> Yes, that's the basic idea. There are several that you could use. I don't
> really care that much about the implementation (tool) to be honest, but
> rather the idea. We can provide 1 implementation, then add implementations
> for specific tools if people would like one that we don't already cover.
>
>
>>
>>>
>>> - *machine learning engine* - this is a more complex implementation.
>>> This would involve creating a ML style engine that would allow for various
>>> types of analysis. An example might be noticing a shift in the composition
>>> of HTTP verb usage for a given time period. If you decide to go this route,
>>> I think you'll want to be very specific with the types of analysis you want
>>> to provide, and focus on doing great documentation about how to build rules
>>> based on training data and the algorithm selection process.
>>>
>>
>> This is a really interesting idea! I did some research to get an idea of
>> what needs to be done using Spark as a base. Ideas and questions below:
>>
>> i) Idea 1: There has been some work on using Spark and Cassandra (as a
>> time series DB, even though it is a key-value store) for data analysis.
>> In relation to AppSensor, I would have to integrate Spark (probably as
>> part of the analysis engine) for its machine learning library and
>> implement a storage provider for Cassandra before wiring them together. I
>> would also have to design a schema for the time series data stored in
>> Cassandra. This seems like quite a lot of work for the duration of the
>> project, but I would be able to leverage some existing work.
>>
>>
>> http://www.slideshare.net/patrickmcfadin/apache-cassandra-apache-spark-for-time-series-data
>>
>> ii) Idea 2: Implement a simple trending analysis engine as the main
>> project work (related to the question about the simple trending approach
>> below) and finish the 3 deliverables. Then build an ML engine using
>> Spark, which will involve wiring it to the time series DB used in the
>> simple trending approach. This way I do not have to implement a separate
>> store for the ML analysis engine, but the challenge probably lies in
>> working out how to connect them together.
>>
>>
> Both of these ideas are good, honestly. I'd focus on which one you think
> you can accomplish in the 3-month time frame. We don't want you to be able
> to finish the project in 2 weeks, but we also don't want it to take a year.
>
>
>> iii) Question 1: What do you mean by being specific about the types of
>> analysis I am providing and the algorithm selection? From what I
>> understand, it is either:
>> - For example, if we have 2 cases, measuring the shift in the composition
>> of HTTP verbs and the number of API calls to an endpoint, I would
>> implement it such that one algorithm checks the composition of HTTP verbs
>> and another algorithm checks the number of API calls. I guess some
>> research needs to be done to decide which algorithm is suitable for which
>> use case/scenario/event.
>>
>> - Implement a wide variety of algorithms for the analysis engine and then
>> let the user decide which algorithm to use for all events, or for each event.
>>
>> I am leaning towards the simple trending approach for now, taking the
>> available time into account, although I would really like to give the
>> machine learning approach a go. Feedback and answers to the questions
>> above will help me scope out the amount of work required for the machine
>> learning approach, especially (ii). :D
>>
>
> What I meant by the "specific type of analysis" comment is around machine
> learning. For machine learning, you have to decide which algorithm (or
> family of algorithms) to use to solve a particular problem. We can
> certainly use spark-ml or some other library to give us those algorithms,
> but in order to make it useful to our users, we'll have to write some code
> to integrate those algorithms with the types of problems we want to solve.
> If we're trying to solve a problem that requires "k nearest neighbors",
> then we'll have to write some code that uses that. My point was that we
> don't want to solve _every_ problem. We want to essentially document the
> process: 1) decide what problem you want to solve, 2) pick best algorithm,
> 3) implement algorithm, 4) use training dataset, 5) turn on analysis. In
> that workflow, we are not going to implement _all_ the different types of
> analysis you could do over the summer of code. I just want us to pick a few
> problems to solve, and document the process so that our users can do the
> same thing themselves to build new types of analysis.
>
>
I did some additional research on Spark and InfluxDB; it looks like there are
open issues to improve the integration between Spark and InfluxDB, especially
for querying large amounts of data from InfluxDB:
https://github.com/influxdata/influxdb/issues/3276
https://groups.google.com/forum/#!msg/influxdb/P9BEMslIQ1Q/ydrylZ3dDAAJ

Thanks to your feedback, I will probably go with the ML implementation. I
read up on the detection points to decide on the problem I want to solve and
the algorithms I will use; here is my idea so far:

i) Problem: Rate of login attempts / Speed of application use / Change in
usage of same transaction for the website.

ii) Algorithm: Random Forest, Decision Trees / SVM / K-NN

iii) Implement the algorithm using an existing library (JavaML, Spark ML or
quickml (http://quickml.org/)) for the chosen problem.

iv) Training dataset: how will we get the training dataset for the
particular problem? I suppose I will create it by modifying an existing
dataset to replicate a behaviour.

What do you think? As advised, I will focus on implementing one problem and
one algorithm for it, and can probably keep the other problems as
nice-to-haves if I have additional time. The above uses InfluxDB to track the
data and applies an ML algorithm on retrieval to identify abnormal behaviour;
a rough sketch of what I mean is below.
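
To illustrate (this is only a sketch, not working AppSensor code: the class
name, feature choices, and numbers are all made up, and the InfluxDB
retrieval is reduced to a comment), training and applying a Random Forest
with Spark's MLlib could look roughly like this:

import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.mllib.linalg.Vectors;
import org.apache.spark.mllib.regression.LabeledPoint;
import org.apache.spark.mllib.tree.RandomForest;
import org.apache.spark.mllib.tree.model.RandomForestModel;

public class LoginRateTrendSketch {

    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("appsensor-trend-ml-sketch")
                .setMaster("local[*]");
        JavaSparkContext jsc = new JavaSparkContext(conf);

        // Hand-built training data: features are (login attempts per minute,
        // failed-login ratio); label 0.0 = normal, 1.0 = abnormal. In the
        // real engine these rows would be aggregated from the events already
        // stored via the InfluxDB integration.
        JavaRDD<LabeledPoint> training = jsc.parallelize(Arrays.asList(
                new LabeledPoint(0.0, Vectors.dense(3.0, 0.10)),
                new LabeledPoint(0.0, Vectors.dense(5.0, 0.15)),
                new LabeledPoint(0.0, Vectors.dense(4.0, 0.05)),
                new LabeledPoint(1.0, Vectors.dense(40.0, 0.90)),
                new LabeledPoint(1.0, Vectors.dense(55.0, 0.80))));

        // No categorical features; train a small forest of 20 trees.
        Map<Integer, Integer> categoricalFeatures = new HashMap<Integer, Integer>();
        RandomForestModel model = RandomForest.trainClassifier(
                training, 2, categoricalFeatures, 20, "auto", "gini", 5, 32, 12345);

        // Classify the current observation window; 1.0 would trigger a response.
        double prediction = model.predict(Vectors.dense(48.0, 0.85));
        System.out.println("abnormal? " + (prediction == 1.0));

        jsc.stop();
    }
}

The same skeleton should apply to the other detection points; only the
feature extraction and the labelling of the training data would change.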


>>>
>>>>
>>>> ii) Associated configuration mechanism to specify the trending
>>>> rules/policy - extend the configuration mode package and create the
>>>> respective XML and XSD configuration for the Trend Monitoring analysis
>>>> engine.
>>>>
>>>> iii) A small, full sample demo application showing usage of the trend
>>>> monitoring feature - built on the existing demo application?
>>>>
>>>>
>>> Yes, these would be the 3 basic outputs for that project, along with the
>>> associated documentation. Additionally, I would say that we should produce
>>> a small number of rules. That will be necessary for the demo application
>>> anyways, but we can use those rules as examples for the community. As for
>>> the demo application, it's very small and trivial. We actually have a user
>>> who built a demo application for a talk about appsensor that is likely a
>>> much better fit (
>>> https://github.com/dschadow/ApplicationIntrusionDetection)
>>>
>>
>> Agreed about the rules bit. I took a look at the demo application linked
>> above and it looks great; I will refer to it when working on the demo
>> application part. I have used Dropwizard to build web apps, but I have not
>> worked with Spring before (only a little DI) and will have to read about it.
>>
>
> The Spring parts should be pretty straightforward, and I (and others) can
> help you there if you need anything. You don't need to know much Spring at
> all for this project.
>

Got it. I will keep that in mind.

>
>
>>
>>>
>>>> It would be great if the mentor/team could give me feedback on my ideas
>>>> and things to read to expand my knowledge in this domain. If there is
>>>> any task that you would like me to complete, I am eager to do it and
>>>> will find time at night or on the weekends to complete it.
>>>>
>>>
>>> I think what I'd be most interested in is if you could let us know which
>>> approach (simple trending, machine learning) you would prefer to take when
>>> building the analysis engine. Beyond that, I think your skillset looks well
>>> suited to the project.
>>>
>>>
>>>>
>>>> I would also like to start preparing my project proposal so that I can
>>>> share it with the mailing list for feedback, as this will be my first
>>>> time applying for GSoC and I will need all the help I can get!!
>>>>
>>>
>>> Sounds great. I think your notes in this email are a very solid start.
>>> To build a good proposal, I think the most important thing to do is scope
>>> the work. Try to build a detailed plan (ie. what task(s) you will
>>> accomplish each week). After that, we can review it and make suggestions
>>> about whether or not we think you should try to do more or less work, and
>>> what parts may be tricky. It will also help us know which mentor(s) to
>>> bring onto the project.
>>>
>>>
>>  I will build up my plan as I scope out the work for the two approaches
>> and will definitely share it as soon as it is ready.
>>
>
> Perfect.
>

Depending on the response to my suggestion above, I will share a draft of my
proposal tonight or tomorrow night so that I will have time to tweak it over
the coming weekend. :D

>
>

>
>>
>>
>>>
>>>> Thanks for your time; I look forward to your feedback/replies. This
>>>> young padawan needs guidance. :D
>>>>
>>>>
>>>>
>>> Thank you!
>>>
>>>
>>>> I have also started a topic in the OWASP GSoC group.
>>>>
>>>>
>>>> https://groups.google.com/forum/?fromgroups#!topic/owasp-gsoc/59vAa402jXo
>>>>
>>>>
>>>> Kind Regards,
>>>>
>>>> Tim
>>>>
>>>>
>>>> _______________________________________________
>>>> Owasp-appsensor-project mailing list
>>>> Owasp-appsensor-project at lists.owasp.org
>>>> https://lists.owasp.org/mailman/listinfo/owasp-appsensor-project
>>>>
>>>>
>>>
>>
>