[Owasp-appsensor-project] GSoC 2016 Trend Monitoring Analysis Engine

John Melton jtmelton at gmail.com
Mon Mar 21 04:18:33 UTC 2016


Sounds great - thanks Tim!

On Mon, Mar 21, 2016 at 12:10 AM, Timothy Sum Hon Mun <
timothy22000 at gmail.com> wrote:

> Dear John,
>
> Thanks for your feedback and support so far. With regards to the kafka
> ticket, I will keep in mind what you said. I'll update you if I do start
> dabbling in it or create a PR.
>
> Best Regards,
> Tim
>
> On Wed, Mar 16, 2016 at 2:32 AM, John Melton <jtmelton at gmail.com> wrote:
>
>> Tim,
>> Responses inline.
>> Thanks,
>> John
>>
>> On Tue, Mar 15, 2016 at 2:26 PM, Timothy Sum Hon Mun <
>> timothy22000 at gmail.com> wrote:
>>
>>> Dear John,
>>>
>>> Sorry for the late reply. I have been busy with the last few weeks of my
>>> placement. :) Hope you are doing well. Responses inline with regards to GSoC.
>>>
>>> Side note: I notice that there are 3 new issues open with regards to
>>> appsensor (elasticsearch, kafka, and mongo query) and I am wondering if
>>> anybody is working on any of them? I am keen to help out with the kafka one
>>> (although I might start work on it in April/May when I am done with my
>>> placement bar the report). I read up on version 0.9, its new consumer API
>>> and security features.
>>>
>>
>> Yep, that's correct. I'm currently working on elasticsearch and mongo.
>> Kafka is not yet assigned, and if you want to tackle it, that'd be great.
>>
>>
>>> I suppose the scope of that issue covers upgrading, ensuring nothing
>>> breaks and the new consumer API with the new security features being a
>>> separate issue? We can discuss on the issue ticket itself.
>>>
>>
>> Yes, that's the main idea. We want to move to the new consumer api and
>> enable support for the new security features. We'll need to work with the
>> community to see what the needs are. My preference is to _require_ using
>> the new security features, but there are likely still many deployments
>> lower than 0.9, so we may enable security by default, and have a temporary
>> way to disable it. After a few months, we should remove that to ensure
>> folks are using kafka properly.
>>
>>
>>>
>>> Kind Regards,
>>> Tim
>>>
>>> On Tue, Mar 8, 2016 at 5:27 PM, John Melton <jtmelton at gmail.com> wrote:
>>>
>>>> Responses inline.
>>>>
>>>> On Tue, Mar 8, 2016 at 4:33 AM, Timothy Sum Hon Mun <
>>>> timothy22000 at gmail.com> wrote:
>>>>
>>>>> Hi John,
>>>>>
>>>>> Thanks for getting back to me. It was good hearing back from you. I've
>>>>> replied to you inline below.
>>>>>
>>>>> Besides that, I made a pull request for some minor changes and test
>>>>> that I added for appsensor as a first contribution:
>>>>> https://github.com/jtmelton/appsensor/pull/38
>>>>>
>>>>>
>>>> Fantastic. I'll take a look at that later today!
>>>>
>>>>
>>>>> Thanks again!
>>>>>
>>>>> Best Regards,
>>>>> Tim
>>>>>
>>>>> On Mon, Mar 7, 2016 at 4:37 AM, John Melton <jtmelton at gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Tim,
>>>>>>
>>>>>> Hi, and thanks so much for your email. I've responded with specific
>>>>>> comments inline below.
>>>>>>
>>>>>> Thanks,
>>>>>> John
>>>>>>
>>>>>> On Sun, Mar 6, 2016 at 1:58 PM, Timothy Sum Hon Mun <
>>>>>> timothy22000 at gmail.com> wrote:
>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> Firstly, congratulations on OWASP being accepted for GSoC 2016!!
>>>>>>>
>>>>>>> My name is Timothy Sum and I am from Malaysia. I am currently a
>>>>>>> final year MSc Computer Science student studying at the University of Kent
>>>>>>> in the UK. I have experience in Java, JavaScript, Python, Node.js, MongoDB,
>>>>>>> AWS, Jenkins, Git workflow, Dropwizard, Logstash, Apache Spark (MSc
>>>>>>> dissertation) and others. I am always keen to learn new technologies and
>>>>>>> try things outside my comfort zone!
>>>>>>>
>>>>>>> I am currently undergoing my placement (where I gained most of my
>>>>>>> experience), which will conclude on the 31st March 2016. I will be
>>>>>>> working full time on weekdays before then. Therefore, I will do my
>>>>>>> research about the project and prepare my proposal at night or
>>>>>>> during the weekends. After my placement finishes, I will be able to
>>>>>>> completely commit to GSoC by researching, learning and experimenting to
>>>>>>> fill gaps in my knowledge during April, even before the community bonding
>>>>>>> period. I’ll have a written report about my placement that is due in June
>>>>>>> 2016 but I can do that while coding over the summer!
>>>>>>>
>>>>>>> I stumbled upon GSoC 3 days ago and have been looking
>>>>>>> through the project list to decide which project I should go for. This will
>>>>>>> be my first time contributing to an open source project and I am very hyped
>>>>>>> up about it as I get to learn from a mentor and contribute at the same
>>>>>>> time. :) I also do not mind having skype/hangout discussions with mentors
>>>>>>> regularly to discuss my progress.
>>>>>>>
>>>>>>
>>>>>> Yes, skype/hangouts is the normal way we communicate. I generally aim
>>>>>> for meetings 2-3 times a week so we can make sure we're making forward
>>>>>> progress and then use email in between meetings for specific questions.
>>>>>>
>>>>>
>>>>>>
>>>>>>>
>>>>>>> I am interested in the Trend Monitoring Analysis Engine project for
>>>>>>> OWASP AppSensor and would be excited if I can work on it. I do not
>>>>>>> have a background in application security and intrusion detection but am
>>>>>>> highly interested in learning about it. So far, I have:
>>>>>>>
>>>>>>
>>>>>> Fantastic. Honestly, a background in spark / machine learning will be
>>>>>> more important.
>>>>>>
>>>>>
>>>>> Cool! I did a module in data mining for my MSc that would come in
>>>>> handy (learned about machine learning algos like decision trees etc).  I
>>>>> used Spark for the first time during my dissertation to implement a
>>>>> classification algorithm. I did not get to use Spark's machine learning
>>>>> library but my past experience would hopefully make the transition easier.
>>>>>
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> i) Read Chapter 3 and Chapter 4 of the OWASP guide briefly and
>>>>>>> understood the approach behind AppSensor, its high level architecture
>>>>>>> (detection and response unit), its pattern (Event, EventManager,
>>>>>>> EventAnalysisEngine and so on)
>>>>>>>
>>>>>>> ii) Managed to get a demo running locally as per the AppSensor Demo
>>>>>>> Setup guide (
>>>>>>> https://github.com/jtmelton/appsensor/blob/master/sample-apps/DemoSetup.md).
>>>>>>> Had a little bump with a mongo test failing when doing mvn install but got
>>>>>>> it to work in the end. Went through part of the codebase while doing this.
>>>>>>>
>>>>>>> iii) Researched trend monitoring analysis techniques. It seems that
>>>>>>> trend analysis falls under anomaly detection based on my understanding so
>>>>>>> far but feel free to correct me (will expand in the section below). It
>>>>>>> would be great if you could recommend additional papers/books to read to
>>>>>>> learn more on this topic.
>>>>>>>
>>>>>>> Did a first pass on two papers that cover general topics in IDS:
>>>>>>>
>>>>>>>
>>>>>>> http://galaxy.cs.lamar.edu/~bsun/seminar/example_papers/IDS_taxonomy.pdf
>>>>>>>
>>>>>>> http://www.ijcset.net/docs/Volumes/volume2issue4/ijcset2012020419.pdf
>>>>>>>
>>>>>>>
>>>>>> There is not much literature specific to application intrusion
>>>>>> detection. The concept is roughly based on network IDS systems. It is
>>>>>> mostly transferring those concepts to the application layer, and looking
>>>>>> for activity that is not possible (or is much harder) to detect at the
>>>>>> network layer, but is possible (or much easier) at the application layer.
>>>>>>
>>>>>
>>>>> Interesting, I will probably do some reading to get a better
>>>>> overview of IDS in general.
>>>>>
>>>>>>
>>>>>>
>>>>>>> I have given it some thought and my high level
>>>>>>> understanding of the expected deliverables is:
>>>>>>>
>>>>>>> i)  A trend monitoring analysis engine - Extend the
>>>>>>> analysis-engines package and add tests. Depending on which implementation
>>>>>>> strategy is used, it seems that I would have to record the “normal”
>>>>>>> behaviour pattern of a system and then trigger a response if the
>>>>>>> application behaves out of the norm, as defined by the trending
>>>>>>> rules.
>>>>>>>
>>>>>>
>>>>>> I think of 2 possible approaches:
>>>>>> - *simple trending engine* - this would be an implementation that
>>>>>> would essentially do some simple counting. An example here might be that we
>>>>>> have seen the occurrence of detection point ABC go up 500% in the last hour
>>>>>> over the "normal" usage. This would likely be pretty straightforward, and
>>>>>> could use something like a time series database to track the metadata, and
>>>>>> do some very fast analysis.
>>>>>>
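A minimal sketch of the simple counting approach described above, assuming
events arrive as (timestamp, detection point label) pairs and that a "normal"
hourly rate per label is already known. The function names, the data shapes,
and the reading of "go up 500%" as a 500% increase over the baseline are
illustrative assumptions; none of this is AppSensor's actual API.

```python
from collections import Counter
from datetime import datetime, timedelta


def percent_increase(baseline_per_window, current_count):
    """Percentage increase of current_count over the baseline count."""
    if baseline_per_window <= 0:
        raise ValueError("baseline must be positive")
    return (current_count - baseline_per_window) / baseline_per_window * 100.0


def find_trending(events, now, baselines, threshold_pct=500.0,
                  window=timedelta(hours=1)):
    """Return {label: percent increase} for labels at or above the threshold.

    events    -- iterable of (timestamp, label) pairs (hypothetical shape)
    baselines -- {label: normal number of events per window} (assumed known)
    """
    window_start = now - window
    # Count only the events that fall inside the most recent window.
    counts = Counter(label for ts, label in events if window_start < ts <= now)
    trending = {}
    for label, count in counts.items():
        baseline = baselines.get(label)
        if not baseline:
            continue  # no "normal" rate recorded for this label, so skip it
        increase = percent_increase(baseline, count)
        if increase >= threshold_pct:
            trending[label] = increase
    return trending
```

In a real engine the counts would more likely come from a query against the
time series database than from an in-memory list.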
>>>>>
>>>>> I looked up time series databases to learn more about them, as I
>>>>> have not worked with one.
>>>>>
>>>>>
>>>>> http://stackoverflow.com/questions/8816429/is-there-a-powerful-database-system-for-time-series-data
>>>>>
>>>>> I notice that we have an implementation to integrate with influxdb in
>>>>> the package appsensor-integration-influxdb.
>>>>>
>>>>> If I were to do the simple trending engine, I would have to extend
>>>>> the current implementation so that I can retrieve the events written to
>>>>> it in order to conduct the counting and analysis and check whether the
>>>>> activity is unusual. This is assuming that I will be using influxDB, of
>>>>> course. What are your opinions?
>>>>>
>>>>
>>>> Yes, that's the basic idea. There are several that you could use. I
>>>> don't really care that much about the implementation (tool) to be honest,
>>>> but rather the idea. We can provide 1 implementation, then add
>>>> implementations for specific tools if people would like one that we don't
>>>> already cover.
>>>>
>>>>
>>>>>
>>>>>>
>>>>>> - *machine learning engine* - this is a more complex implementation.
>>>>>> This would involve creating a ML style engine that would allow for various
>>>>>> types of analysis. An example might be noticing a shift in the composition
>>>>>> of HTTP verb usage for a given time period. If you decide to go this route,
>>>>>> I think you'll want to be very specific with the types of analysis you want
>>>>>> to provide, and focus on doing great documentation about how to build rules
>>>>>> based on training data and the algorithm selection process.
>>>>>>
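To make the HTTP verb example above concrete, here is a minimal sketch that
compares the verb mix of a baseline period with that of a current period. It
uses total variation distance as a simple statistical stand-in rather than a
trained machine learning model. The function names and the 0.2 threshold are
illustrative assumptions.

```python
from collections import Counter


def verb_distribution(verbs):
    """Map each HTTP verb to its share of the requests in one time period."""
    counts = Counter(v.upper() for v in verbs)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no requests in this period")
    return {verb: n / total for verb, n in counts.items()}


def total_variation_distance(p, q):
    """Half the sum of absolute differences between two distributions.

    0.0 means identical composition, 1.0 means no overlap at all.
    """
    verbs = set(p) | set(q)
    return 0.5 * sum(abs(p.get(v, 0.0) - q.get(v, 0.0)) for v in verbs)


def composition_shifted(baseline_verbs, current_verbs, threshold=0.2):
    """Return (shifted, distance); the threshold is an arbitrary assumption."""
    distance = total_variation_distance(
        verb_distribution(baseline_verbs),
        verb_distribution(current_verbs),
    )
    return distance > threshold, distance
```

A learned model could replace the fixed threshold once training data is
available; the distance itself would then become one input feature.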
>>>>>
>>>>> This is a really interesting idea! I did some research in order to
>>>>> get an idea of what needs to be done using Spark as a base. Ideas and
>>>>> questions below:
>>>>>
>>>>> i) Idea 1: There has been some work on using spark and cassandra (as a
>>>>> time series db even though it's a k-v store) for data analysis. In relation
>>>>> to appsensor, I would have to integrate Spark (probably as part of the
>>>>> analysis engine) for its machine learning library and implement a storage
>>>>> provider for cassandra prior to wiring them together. I will have to design
>>>>> a schema for the time series data storage inside cassandra as well. This
>>>>> seems like quite a lot of work for the duration of the project but I'll be
>>>>> able to leverage some existing work.
>>>>>
>>>>>
>>>>> http://www.slideshare.net/patrickmcfadin/apache-cassandra-apache-spark-for-time-series-data
>>>>>
>>>>> ii) Idea 2: Implement a simple trending analysis engine as the main
>>>>> project work (related to my question under the simple trending approach)
>>>>> and finish the 3 deliverables. Then build an ML engine using Spark for
>>>>> machine learning, which will involve wiring it to the time series db used
>>>>> in the simple trending approach. This way I don't have to implement a
>>>>> separate store for the ML analysis engine, but the challenge probably lies
>>>>> in working out how to connect them together.
>>>>>
>>>>>
>>>> Both of these ideas are good, honestly. I'd focus on which one you
>>>> think you can accomplish in the 3-month time frame. We don't want you to be
>>>> able to finish the project in 2 weeks, but we also don't want it to take a
>>>> year.
>>>>
>>>>
>>>>> iii) Question 1: What do you mean by being specific about the types of
>>>>> analysis that I am providing and the algorithm selection? From what I
>>>>> understand, it's either:
>>>>> - For example, if we have 2 cases (measuring the shift in composition of
>>>>> HTTP verbs and the number of API calls to an endpoint), I would use one
>>>>> algorithm for checking the composition of HTTP verbs and another
>>>>> algorithm for the number of API calls. I guess some research needs to
>>>>> be done to decide which algorithm would be suitable for which use
>>>>> case/scenario/event.
>>>>>
>>>>> - Implement a wide variety of algorithms for the analysis engine and then
>>>>> let the user decide which algorithm to use for all events or for each event.
>>>>>
>>>>> I am leaning towards the simple trending approach for now, taking time
>>>>> into account, although I would really like to give the machine learning
>>>>> approach a go. Feedback and answers to the questions above will help me
>>>>> scope out the amount of work required for the machine learning approach,
>>>>> especially (ii). :D
>>>>>
>>>>
>>>> What I meant by the "specific type of analysis" comment is around
>>>> machine learning. For machine learning, you have to decide which algorithm
>>>> (or family of algorithms) to use to solve a particular problem. We can
>>>> certainly use spark-ml or some other library to give us those algorithms,
>>>> but in order to make it useful to our users, we'll have to write some code
>>>> to integrate those algorithms with the types of problems we want to solve.
>>>> If we're trying to solve a problem that requires "k nearest neighbors",
>>>> then we'll have to write some code that uses that. My point was that we
>>>> don't want to solve _every_ problem. We want to essentially document the
>>>> process: 1) decide what problem you want to solve, 2) pick best algorithm,
>>>> 3) implement algorithm, 4) use training dataset, 5) turn on analysis. In
>>>> that workflow, we are not going to implement _all_ the different types of
>>>> analysis you could do over the summer of code. I just want us to pick a few
>>>> problems to solve, and document the process so that our users can do the
>>>> same thing themselves to build new types of analysis.
>>>>
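The five-step workflow above can be sketched end to end with a tiny, pure
Python "k nearest neighbors" classifier. This is only an illustration under
stated assumptions: the problem (flagging unusual login activity), the two
features, the labels, and the training values are all made up, and a real
implementation would more likely use spark-ml or a similar library.

```python
import math
from collections import Counter

# Steps 1 and 2: the (hypothetical) problem is classifying login activity,
# and k-NN is the chosen algorithm.


def knn_predict(training, point, k=3):
    """Step 3: label `point` by majority vote among its k nearest neighbors.

    training -- list of (features, label) pairs; features are numeric tuples
    """
    if not 1 <= k <= len(training):
        raise ValueError("k must be between 1 and the training set size")
    nearest = sorted(training, key=lambda item: math.dist(item[0], point))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Step 4: an invented training set of
# (login attempts per minute, fraction of failed logins) -> label.
# Features are left unscaled for brevity; real data should be normalized.
TRAINING = [
    ((2, 0.0), "normal"), ((3, 0.1), "normal"),
    ((1, 0.0), "normal"), ((4, 0.2), "normal"),
    ((40, 0.9), "abnormal"), ((55, 0.95), "abnormal"),
    ((60, 1.0), "abnormal"), ((35, 0.8), "abnormal"),
]


def analyze(observation):
    """Step 5: run the analysis on a new observation."""
    return knn_predict(TRAINING, observation, k=3)
```

Each new type of analysis would repeat the same steps with its own features,
algorithm, and training data.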
>>>>
>>> I did some additional research on Spark and influxDB. It looks like there
>>> are issues open to improve integration between Spark and influxDB, especially
>>> for querying large amounts of data from influxDB.
>>> https://github.com/influxdata/influxdb/issues/3276
>>> https://groups.google.com/forum/#!msg/influxdb/P9BEMslIQ1Q/ydrylZ3dDAAJ
>>>
>>> Thanks to your feedback, I will probably decide on doing the ML
>>> implementation. I read up on detection points to decide on the problem I
>>> want to solve and the algorithms that I will use. Here is my idea so far:
>>>
>>> i) Problem: Rate of login attempts / Speed of application use / Change
>>> in usage of same transaction for the website.
>>>
>>> ii) Algorithm: Random Forest, Decision Trees / SVM / K-NN
>>>
>>> iii) Implementing algorithm using existing library (JavaML, sparkML or
>>> quickml(http://quickml.org/)) with regards to the problem.
>>>
>>> iv) Training dataset: how will we get the training dataset for the
>>> particular problem? I suppose I will be creating it by modifying an
>>> existing dataset to replicate a behaviour.
>>>
>>> What do you think? I will focus on implementing one problem and an
>>> algorithm for it as advised and can probably put the other problems down
>>> as nice-to-haves if I have additional time. The above uses influxDB to
>>> track the data and an ML algorithm upon retrieval to identify abnormal
>>> behaviours.
>>>
>>
>> This sounds great. Getting a data set could be a difficult task. Given
>> the sensitivity of the data, folks are not comfortable sharing their
>> appsensor dataset. Having said that, I've had decent luck just using web
>> server logs for many tasks. Those are pretty easy to get if you ask around.
>> As for data storage, feel free to pick the best tool. If that's influxdb,
>> great. If it's cassandra, mongo, riak, etc. ... that's fine too. I would
>> pick something reasonably common and well-known so you'll have good
>> examples to work from and so we have reasonable confidence it will work
>> pretty well.
>>
>>
>>>
>>>
>>>>>>
>>>>>>>
>>>>>>> ii)  Associated configuration mechanism to specify the trending
>>>>>>> rules/policy - Extend the configuration mode package, create
>>>>>>> respective xml and xsd configuration for the Trend Monitoring analysis
>>>>>>> engine.
>>>>>>>
>>>>>>> iii) A small full sample demo application showing usage of the
>>>>>>> trend monitoring feature. - Built on the existing demo application?
>>>>>>>
>>>>>>>
>>>>>> Yes, these would be the 3 basic outputs for that project, along with
>>>>>> the associated documentation. Additionally, I would say that we should
>>>>>> produce a small number of rules. That will be necessary for the demo
>>>>>> application anyways, but we can use those rules as examples for the
>>>>>> community. As for the demo application, it's very small and trivial. We
>>>>>> actually have a user who built a demo application for a talk about
>>>>>> appsensor that is likely a much better fit (
>>>>>> https://github.com/dschadow/ApplicationIntrusionDetection)
>>>>>>
>>>>>
>>>>> Agreed about the rules bit. I took a look at the demo application
>>>>> built above and it looks great, will refer to it when working on the demo
>>>>> application part. I've used Dropwizard to build web apps but I haven't
>>>>> worked with Spring (only a little on DI) before and will have to read about it.
>>>>>
>>>>
>>>> The Spring parts should be pretty straightforward, and I (and others)
>>>> can help you there if you need anything. You don't need to know much Spring
>>>> at all for this project.
>>>>
>>>
>>> Got it. I will keep that in mind.
>>>
>>>>
>>>>
>>>>>
>>>>>>
>>>>>>> It would be great if the mentor/team can give me feedback on my
>>>>>>> ideas and things to read to expand my knowledge in this domain. If there is
>>>>>>> any task that you would like me to complete, I am eager to do it and will
>>>>>>> find time at night or the weekends to complete it.
>>>>>>>
>>>>>>
>>>>>> I think what I'd be most interested in is if you could let us know
>>>>>> which approach (simple trending, machine learning) you would prefer to take
>>>>>> when building the analysis engine. Beyond that, I think your skillset looks
>>>>>> well suited to the project.
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> I would also like to start preparing my project proposal to be able
>>>>>>> to share with the mailing list to get feedback as this will be my first
>>>>>>> time applying for GSoC and I will need all the help I can get!!
>>>>>>>
>>>>>>
>>>>>> Sounds great. I think your notes in this email are a very solid
>>>>>> start. To build a good proposal, I think the most important thing to do is
>>>>>> scope the work. Try to build a detailed plan (ie. what task(s) you will
>>>>>> accomplish each week). After that, we can review it and make suggestions
>>>>>> about whether or not we think you should try to do more or less work, and
>>>>>> what parts may be tricky. It will also help us know which mentor(s) to
>>>>>> bring onto the project.
>>>>>>
>>>>>>
>>>>>  I will build up my plan as I scope out the work for the two
>>>>> approaches and will definitely share it as soon as it is ready.
>>>>>
>>>>
>>>> Perfect.
>>>>
>>>
>>> I will share a draft of my proposal tonight or tomorrow night, depending on
>>> the response to my suggestion above, so that I'll have time to tweak it over
>>> the coming weekend. :D
>>>
>>
>> Perfect, thank you.
>>
>>
>>>
>>>>
>>>
>>>>
>>>>>
>>>>>
>>>>>>
>>>>>>> Thanks for your time. I look forward to your feedback/replies.
>>>>>>> This young padawan needs guidance. :D
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>> Thank you!
>>>>>>
>>>>>>
>>>>>>> I have also started a topic in the OWASP GSoC group.
>>>>>>>
>>>>>>>
>>>>>>> https://groups.google.com/forum/?fromgroups#!topic/owasp-gsoc/59vAa402jXo
>>>>>>>
>>>>>>>
>>>>>>> Kind Regards,
>>>>>>>
>>>>>>> Tim
>>>>>>>
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> Owasp-appsensor-project mailing list
>>>>>>> Owasp-appsensor-project at lists.owasp.org
>>>>>>> https://lists.owasp.org/mailman/listinfo/owasp-appsensor-project
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>