[Owasp-appsensor-project] GSoC 2016 Trend Monitoring Analysis Engine

John Melton jtmelton at gmail.com
Wed Mar 16 02:32:40 UTC 2016


Tim,
Responses inline.
Thanks,
John

On Tue, Mar 15, 2016 at 2:26 PM, Timothy Sum Hon Mun <timothy22000 at gmail.com> wrote:

> Dear John,
>
> Sorry for the late reply. I have been busy with the last few weeks of my
> placement. :) Hope you are doing well. Responses inline with regard to GSoC.
>
> Side note: I noticed that there are 3 new issues open for appsensor
> (elasticsearch, kafka, and mongo query), and I am wondering if anybody is
> working on any of them. I am keen to help out with the kafka one (although
> I might start work on it in April/May, when I am done with my placement
> bar the report). I have read up on version 0.9, its new consumer API, and
> its security features.
>

Yep, that's correct. I'm currently working on elasticsearch and mongo.
Kafka is not yet assigned, and if you want to tackle it, that'd be great.


> I suppose the scope of that issue covers upgrading, ensuring nothing
> breaks and the new consumer API with the new security features being a
> separate issue? We can discuss on the issue ticket itself.
>

Yes, that's the main idea. We want to move to the new consumer API and
enable support for the new security features. We'll need to work with the
community to see what the needs are. My preference is to _require_ using
the new security features, but there are likely still many deployments
on versions lower than 0.9, so we may enable security by default and have a
temporary way to disable it. After a few months, we should remove that
opt-out to ensure folks are using kafka properly.
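
To make the "secure by default, with a temporary opt-out" idea a bit more concrete, here is a minimal sketch, not a settled design. The `security.protocol` key and its `SSL` / `PLAINTEXT` values are standard Kafka client settings from 0.9 onward; the class name and the `appsensor.kafka.insecure` flag are hypothetical and exist only for illustration.

```java
import java.util.Properties;

// Hypothetical helper (not part of AppSensor): builds Kafka client
// properties that are secure unless the user explicitly opts out.
final class KafkaSecurityDefaults {

    // Hypothetical opt-out flag; not an existing AppSensor or Kafka setting.
    static final String INSECURE_FLAG = "appsensor.kafka.insecure";

    // Standard Kafka client configuration key (0.9+).
    static final String SECURITY_PROTOCOL = "security.protocol";

    static Properties clientProperties(Properties userConfig) {
        Properties props = new Properties();
        props.putAll(userConfig);

        boolean insecure =
            Boolean.parseBoolean(userConfig.getProperty(INSECURE_FLAG, "false"));
        String requested = userConfig.getProperty(SECURITY_PROTOCOL);

        if (insecure) {
            // Temporary escape hatch for deployments that cannot use security yet.
            props.setProperty(SECURITY_PROTOCOL, "PLAINTEXT");
        } else if (requested == null || requested.equals("PLAINTEXT")) {
            // Default: upgrade to SSL when nothing secure was requested.
            props.setProperty(SECURITY_PROTOCOL, "SSL");
        }

        // The opt-out flag is ours, so it is not passed on to the Kafka client.
        props.remove(INSECURE_FLAG);
        return props;
    }
}
```

Removing the opt-out later would then mean deleting the `insecure` branch, so that any attempt to run with `PLAINTEXT` is upgraded to `SSL`.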


>
> Kind Regards,
> Tim
>
> On Tue, Mar 8, 2016 at 5:27 PM, John Melton <jtmelton at gmail.com> wrote:
>
>> Responses inline.
>>
>> On Tue, Mar 8, 2016 at 4:33 AM, Timothy Sum Hon Mun <
>> timothy22000 at gmail.com> wrote:
>>
>>> Hi John,
>>>
>>> Thanks for getting back to me. It was good hearing back from you. I've
>>> replied to you inline below.
>>>
>>> Besides that, I made a pull request with some minor changes and tests that
>>> I added for appsensor as a first contribution:
>>> https://github.com/jtmelton/appsensor/pull/38
>>>
>>>
>> Fantastic. I'll take a look at that later today!
>>
>>
>>> Thanks again!
>>>
>>> Best Regards,
>>> Tim
>>>
>>> On Mon, Mar 7, 2016 at 4:37 AM, John Melton <jtmelton at gmail.com> wrote:
>>>
>>>> Tim,
>>>>
>>>> Hi, and thanks so much for your email. I've responded with specific
>>>> comments inline below.
>>>>
>>>> Thanks,
>>>> John
>>>>
>>>> On Sun, Mar 6, 2016 at 1:58 PM, Timothy Sum Hon Mun <
>>>> timothy22000 at gmail.com> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> Firstly, congratulations on OWASP being accepted for GSoC 2016!!
>>>>>
>>>>> My name is Timothy Sum and I am from Malaysia. I am currently a final
>>>>> year MSc Computer Science student at the University of Kent in the UK.
>>>>> I have experience in Java, Javascript, Python, Node.js, MongoDB, AWS,
>>>>> Jenkins, Git workflow, Dropwizard, Logstash, Apache Spark (MSc
>>>>> dissertation) and others. I am always keen to learn new technologies and
>>>>> try things outside my comfort zone!
>>>>>
>>>>> I am currently undergoing my placement (where I gained most of my
>>>>> experience from) which will be concluded on the 31st March 2016. I will be
>>>>> working full time on the weekdays before then. Therefore, I will do my
>>>>> research about the project and prepare my proposal typically at night or
>>>>> during the weekends. After my placement finishes, I will be able to
>>>>> completely commit to GSoC by researching, learning and experimenting about
>>>>> gaps in my knowledge during April even before the community bonding period.
>>>>> I’ll have a written report about my placement that is due in June
>>>>> 2016 but I can do that while coding over the summer!
>>>>>
>>>>> I stumbled upon GSoC just 3 days ago and have been looking
>>>>> through the project list to decide which project I should go for. This will
>>>>> be my first time contributing to an open source project and I am very hyped
>>>>> up about it as I get to learn from a mentor and contribute at the same
>>>>> time. :) I also do not mind having skype/hangout discussions with mentors
>>>>> regularly to discuss my progress.
>>>>>
>>>>
>>>> Yes, skype/hangouts is the normal way we communicate. I generally aim
>>>> for meetings 2-3 times a week so we can make sure we're making forward
>>>> progress and then use email in between meetings for specific questions.
>>>>
>>>
>>>>
>>>>>
>>>>> I am interested in the Trend Monitoring Analysis Engine project for
>>>>> OWASP AppSensor and would be excited if I can work on it. I do not
>>>>> have a background in application security and intrusion detection but am
>>>>> highly interested in learning about it. So far, I have:
>>>>>
>>>>
>>>> Fantastic. Honestly, a background in spark / machine learning will be
>>>> more important.
>>>>
>>>
>>> Cool! I did a module in data mining for my MSc that would come in handy
>>> (learned about machine learning algos like decision trees etc).  I used
>>> Spark for the first time during my dissertation to implement a
>>> classification algorithm. I did not get to use Spark's machine learning
>>> library but my past experience would hopefully make the transition easier.
>>>
>>>>
>>>>
>>>>>
>>>>> i) Read Chapter 3 and Chapter 4 of the OWASP guide briefly and
>>>>> understood the approach behind AppSensor, its high level architecture
>>>>> (detection and response unit), its pattern (Event, EventManager,
>>>>> EventAnalysisEngine and so on)
>>>>>
>>>>> ii) Managed to get a demo running locally as per the AppSensor Demo
>>>>> Setup guide (
>>>>> https://github.com/jtmelton/appsensor/blob/master/sample-apps/DemoSetup.md).
>>>>> Had a little bump with a mongo test failing when doing mvn install but got
>>>>> it to work in the end. Went through part of the codebase while doing this.
>>>>>
>>>>> iii) Researched trend monitoring analysis techniques. It seems that
>>>>> trend analysis falls under anomaly detection based on my understanding so
>>>>> far but feel free to correct me (will expand in the section below). It
>>>>> would be great if you could recommend additional papers/books to read to
>>>>> learn more on this topic.
>>>>>
>>>>> Did a first pass on two papers that cover general topics in IDS:
>>>>>
>>>>>
>>>>> http://galaxy.cs.lamar.edu/~bsun/seminar/example_papers/IDS_taxonomy.pdf
>>>>>
>>>>> http://www.ijcset.net/docs/Volumes/volume2issue4/ijcset2012020419.pdf
>>>>>
>>>>>
>>>> There is not much literature specific to application intrusion
>>>> detection. The concept is roughly based on network IDS systems. It is
>>>> mostly transferring those concepts to the application layer, and looking
>>>> for activity that is not possible (or is much harder) to detect at the
>>>> network layer, but is possible (or much easier) at the application layer.
>>>>
>>>
>>>  Interesting, I will probably do some reading to get a better overview
>>> of IDS in general.
>>>
>>>>
>>>>
>>>>> Currently, I have given it some thought and my high level
>>>>> understanding of the expected deliverables is:
>>>>>
>>>>> i)  A trend monitoring analysis engine - Extend the analysis-engines
>>>>> package and add tests. Depending on which implementation strategies to use,
>>>>> it seems that I would have to record the “normal” behaviour pattern of a
>>>>> system and then trigger a response if the application behaves out of the
>>>>> norm which will be defined by the trending rules.
>>>>>
>>>>
>>>> I think of 2 possible approaches:
>>>> - *simple trending engine* - this would be an implementation that
>>>> would essentially do some simple counting. An example here might be that we
>>>> have seen the occurrence of detection point ABC go up 500% in the last hour
>>>> over the "normal" usage. This would likely be pretty straightforward, and
>>>> could use something like a time series database to track the metadata, and
>>>> do some very fast analysis.
>>>>
>>>
>>> I read up on time series databases to learn more about them, as I have
>>> not worked with one.
>>>
>>>
>>> http://stackoverflow.com/questions/8816429/is-there-a-powerful-database-system-for-time-series-data
>>>
>>> I noticed that we have an implementation to integrate with influxdb in the
>>> package appsensor-integration-influxdb.
>>>
>>> If I were to do the simple trending engine, I would have to extend the
>>> current implementation so that I can retrieve the events written to it
>>> and then do the counting and analysis to check whether the activity is
>>> unusual. This is assuming that I will be using influxDB, of course. What
>>> is your opinion?
>>>
>>
>> Yes, that's the basic idea. There are several that you could use. I don't
>> really care that much about the implementation (tool) to be honest, but
>> rather the idea. We can provide 1 implementation, then add implementations
>> for specific tools if people would like one that we don't already cover.
>>
>>
>>>
>>>>
>>>> - *machine learning engine* - this is a more complex implementation.
>>>> This would involve creating a ML style engine that would allow for various
>>>> types of analysis. An example might be noticing a shift in the composition
>>>> of HTTP verb usage for a given time period. If you decide to go this route,
>>>> I think you'll want to be very specific with the types of analysis you want
>>>> to provide, and focus on doing great documentation about how to build rules
>>>> based on training data and the algorithm selection process.
>>>>
>>>
>>>  This is a really interesting idea! I did some research in order to
>>> get an idea of what needs to be done using Spark as a base. Idea and
>>> questions below:
>>>
>>> i) Idea 1: There has been some work on using spark and cassandra (as a
>>> time series db even though it's a k-v store) for data analysis. In relation
>>> to appsensor, I would have to integrate Spark (probably as part of the
>>> analysis engine) for its machine learning library and implement a storage
>>> provider for cassandra prior to wiring them together. I will have to design
>>> a schema for the time series data storage inside cassandra as well. This
>>> seems quite a lot of work for the duration of the project but I'll be able
>>> to leverage some existing work.
>>>
>>>
>>> http://www.slideshare.net/patrickmcfadin/apache-cassandra-apache-spark-for-time-series-data
>>>
>>> ii) Idea 2: Implement a simple trending analysis engine as the main
>>> project work (related to my question about the simple trending approach)
>>> and finish the 3 deliverables. Then build an ML engine using Spark for
>>> machine learning, which will involve wiring it to the time series db used
>>> in the simple trending approach. This way I don't have to implement a
>>> separate store for the ML analysis engine, but the challenge probably lies
>>> in working out how to connect them together.
>>>
>>>
>> Both of these ideas are good, honestly. I'd focus on which one you think
>> you can accomplish in the 3-month time frame. We don't want you to be able
>> to finish the project in 2 weeks, but we also don't want it to take a year.
>>
>>
>>> iii) Question 1: What do you mean by being specific about the types of
>>> analysis that I am providing and the algorithm selection? From what I
>>> understand, it's either:
>>> - For example, if we have 2 cases: measure shift in composition of HTTP
>>> verb and number of API calls to an endpoint. I would implement it such that
>>> I will use one algorithm for checking composition of HTTP verb and another
>>> algorithm for number of API calls. I guess some research needs to be done
>>> to decide which algorithm would be suitable for which use
>>> case/scenario/event.
>>>
>>> - Implement a wide variety of algorithms for the analysis engine and then
>>> let the user decide which algorithm to use for all events or for each event.
>>>
>>> I am leaning towards the simple trending approach for now, taking
>>> time into account, although I would really like to give the machine
>>> learning approach a go. Feedback and answers to the questions above will
>>> help me scope out the amount of work required for the machine learning
>>> approach, especially (ii). :D
>>>
>>
>> What I meant by the "specific type of analysis" comment is around machine
>> learning. For machine learning, you have to decide which algorithm (or
>> family of algorithms) to use to solve a particular problem. We can
>> certainly use spark-ml or some other library to give us those algorithms,
>> but in order to make it useful to our users, we'll have to write some code
>> to integrate those algorithms with the types of problems we want to solve.
>> If we're trying to solve a problem that requires "k nearest neighbors",
>> then we'll have to write some code that uses that. My point was that we
>> don't want to solve _every_ problem. We want to essentially document the
>> process: 1) decide what problem you want to solve, 2) pick best algorithm,
>> 3) implement algorithm, 4) use training dataset, 5) turn on analysis. In
>> that workflow, we are not going to implement _all_ the different types of
>> analysis you could do over the summer of code. I just want us to pick a few
>> problems to solve, and document the process so that our users can do the
>> same thing themselves to build new types of analysis.
>>
>>
> I did some additional research on Spark and influxDB. It looks like there
> are issues open to improve integration between Spark and influxDB,
> especially for querying large amounts of data from influxDB.
> https://github.com/influxdata/influxdb/issues/3276
> https://groups.google.com/forum/#!msg/influxdb/P9BEMslIQ1Q/ydrylZ3dDAAJ
>
> Thanks to your feedback, I will probably decide on doing the ML
> implementation. I read up on detection points to decide on the problem I
> want to solve and the algorithms that I will use. Here is my idea so far:
>
> i) Problem: Rate of login attempts / Speed of application use / Change in
> usage of same transaction for the website.
>
> ii) Algorithm: Random Forest, Decision Trees / SVM / K-NN
>
> iii) Implementing algorithm using existing library (JavaML, sparkML or
> quickml(http://quickml.org/)) with regards to the problem.
>
> iv) Training dataset: how will we get the training dataset for the
> particular problem? I suppose I will be creating it by modifying an
> existing dataset to replicate a behaviour.
>
> What do you think? I will focus on implementing one problem and an
> algorithm for it as advised, and can probably put the other problems down as
> nice-to-haves if I have additional time. The above uses influxDB to track
> the data and applies the ML algorithm upon retrieval to identify abnormal
> behaviours.
>

This sounds great. Getting a data set could be a difficult task. Given the
sensitivity of the data, folks are not comfortable sharing their appsensor
dataset. Having said that, I've had decent luck just using web server logs
for many tasks. Those are pretty easy to get if you ask around. As for data
storage, feel free to pick the best tool. If that's influxdb, great. If
it's cassandra, mongo, riak, etc. ... that's fine too. I would pick
something reasonably common and well-known so you'll have good examples to
work from and so we have reasonable confidence it will work pretty well.
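
As a rough illustration of how web server logs could stand in for an appsensor dataset, the sketch below counts requests per hour and HTTP verb and then applies the kind of simple "up 500% over normal" check discussed earlier in this thread. It assumes Common Log Format input; the class and method names are hypothetical, and a real engine would read the baseline from whichever data store is chosen.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: hourly request counts from access logs plus a
// percentage-increase check against a baseline.
final class AccessLogTrend {

    // Captures the timestamp down to the hour (e.g. 10/Oct/2000:13) and the
    // HTTP verb from a Common Log Format line.
    private static final Pattern LINE = Pattern.compile(
        "\\[(\\d{2}/\\w{3}/\\d{4}:\\d{2})[^\\]]*\\] \"([A-Z]+) ");

    // Counts requests per "<hour> <VERB>" bucket; unparseable lines are skipped.
    static Map<String, Integer> countByHourAndVerb(Iterable<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            Matcher m = LINE.matcher(line);
            if (m.find()) {
                counts.merge(m.group(1) + " " + m.group(2), 1, Integer::sum);
            }
        }
        return counts;
    }

    // Percentage increase of current over baseline (60 vs 10 is a 500% increase).
    static double percentIncrease(double baseline, double current) {
        if (baseline <= 0) {
            throw new IllegalArgumentException("baseline must be positive");
        }
        return (current - baseline) / baseline * 100.0;
    }

    // True when the increase meets or exceeds the configured threshold.
    static boolean exceedsTrend(double baseline, double current,
                                double thresholdPercent) {
        return percentIncrease(baseline, current) >= thresholdPercent;
    }
}
```

The same per-bucket counts could also feed a model that looks at shifts in the mix of HTTP verbs rather than a fixed threshold.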


>
>
>>>>
>>>>>
>>>>> ii)  Associated configuration mechanism to specify the trending
>>>>> rules/policy - Extend the configuration mode package, create
>>>>> respective xml and xsd configuration for the Trend Monitoring analysis
>>>>> engine.
>>>>>
>>>>> iii) A small full sample demo application showing usage of the trend
>>>>> monitoring feature. - Built on the existing demo application?
>>>>>
>>>>>
>>>> Yes, these would be the 3 basic outputs for that project, along with
>>>> the associated documentation. Additionally, I would say that we should
>>>> produce a small number of rules. That will be necessary for the demo
>>>> application anyways, but we can use those rules as examples for the
>>>> community. As for the demo application, it's very small and trivial. We
>>>> actually have a user who built a demo application for a talk about
>>>> appsensor that is likely a much better fit (
>>>> https://github.com/dschadow/ApplicationIntrusionDetection)
>>>>
>>>
>>> Agreed about the rules bit. I took a look at the demo application built
>>> above and it looks great, will refer to it when working on the demo
>>> application part. I've used Dropwizard to build web apps but I haven't
>>> worked with Spring (only a little on DI) before and will have to read about it.
>>>
>>
>> The Spring parts should be pretty straightforward, and I (and others) can
>> help you there if you need anything. You don't need to know much Spring at
>> all for this project.
>>
>
> Got it. I will keep that in mind.
>
>>
>>
>>>
>>>>
>>>>> It would be great if the mentor/team can give me feedback on my ideas
>>>>> and things to read to expand my knowledge in this domain. If there is any
>>>>> task that you would like me to complete, I am eager to do it and will find
>>>>> time at night or the weekends to complete it.
>>>>>
>>>>
>>>> I think what I'd be most interested in is if you could let us know
>>>> which approach (simple trending, machine learning) you would prefer to take
>>>> when building the analysis engine. Beyond that, I think your skillset looks
>>>> well suited to the project.
>>>>
>>>>
>>>>>
>>>>> I would also like to start preparing my project proposal to be able to
>>>>> share with the mailing list to get feedback as this will be my first time
>>>>> applying for GSoC and I will need all the help I can get!!
>>>>>
>>>>
>>>> Sounds great. I think your notes in this email are a very solid start.
>>>> To build a good proposal, I think the most important thing to do is scope
>>>> the work. Try to build a detailed plan (ie. what task(s) you will
>>>> accomplish each week). After that, we can review it and make suggestions
>>>> about whether or not we think you should try to do more or less work, and
>>>> what parts may be tricky. It will also help us know which mentor(s) to
>>>> bring onto the project.
>>>>
>>>>
>>>  I will build up my plan as I scope out the work for the two approaches
>>> and will definitely share it as soon as it is ready.
>>>
>>
>> Perfect.
>>
>
> Depending on the response to my suggestion above, I will share a draft of
> my proposal tonight or tomorrow night so that I'll have time to tweak
> it over the coming weekend. :D
>

Perfect, thank you.


>
>>
>
>>
>>>
>>>
>>>>
>>>>> Thanks for your time. I look forward to your feedback/replies. This
>>>>> young padawan needs guidance. :D
>>>>>
>>>>>
>>>>>
>>>> Thank you!
>>>>
>>>>
>>>>> I have also started a topic in the OWASP GSoC group.
>>>>>
>>>>>
>>>>> https://groups.google.com/forum/?fromgroups#!topic/owasp-gsoc/59vAa402jXo
>>>>>
>>>>>
>>>>> Kind Regards,
>>>>>
>>>>> Tim
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> Owasp-appsensor-project mailing list
>>>>> Owasp-appsensor-project at lists.owasp.org
>>>>> https://lists.owasp.org/mailman/listinfo/owasp-appsensor-project
>>>>>
>>>>>
>>>>
>>>
>>
>