[Owasp-appsensor-project] GSoC 2016 Trend Monitoring Analysis Engine

Timothy Sum Hon Mun timothy22000 at gmail.com
Mon Mar 21 04:10:45 UTC 2016


Dear John,

Thanks for your feedback and support so far. With regards to the kafka
ticket, I will keep in mind what you said. I'll update you if I do start
dabbling in it or create a PR.

Best Regards,
Tim

On Wed, Mar 16, 2016 at 2:32 AM, John Melton <jtmelton at gmail.com> wrote:

> Tim,
> Responses inline.
> Thanks,
> John
>
> On Tue, Mar 15, 2016 at 2:26 PM, Timothy Sum Hon Mun <
> timothy22000 at gmail.com> wrote:
>
>> Dear John,
>>
>> Sorry for the late reply. Been busy for my last few weeks of placement.
>> :) Hope you are doing well. Responses inline with regards to GSoC.
>>
>> Side note: I noticed that there are 3 new issues open for appsensor
>> (elasticsearch, kafka, and mongo query) and I am wondering whether anybody
>> is working on any of them. I am keen to help out with the kafka one
>> (although I might start work on it in April/May when I am done with my
>> placement bar the report). I have read up on version 0.9, its new consumer
>> API and its security features.
>>
>
> Yep, that's correct. I'm currently working on elasticsearch and mongo.
> Kafka is not yet assigned, and if you want to tackle it, that'd be great.
>
>
>> I suppose the scope of that issue covers upgrading, ensuring nothing
>> breaks and the new consumer API with the new security features being a
>> separate issue? We can discuss on the issue ticket itself.
>>
>
> Yes, that's the main idea. We want to move to the new consumer api and
> enable support for the new security features. We'll need to work with the
> community to see what the needs are. My preference is to _require_ using
> the new security features, but there are likely still many deployments
> lower than 0.9, so we may enable security by default, and have a temporary
> way to disable it. After a few months, we should remove that to ensure
> folks are using kafka properly.
>
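One way to picture the "security on by default, with a temporary way to disable it" idea is the minimal sketch below. The function name and its parameters are hypothetical and are not part of AppSensor; the property keys (`bootstrap.servers`, `group.id`, `security.protocol`, `ssl.truststore.location`) are the standard Kafka client settings, collected in a plain dictionary purely for illustration.

```python
def build_consumer_config(bootstrap_servers, group_id, truststore_path=None,
                          allow_insecure=False):
    """Return Kafka client properties that use SSL unless explicitly disabled.

    Hypothetical helper: the name and arguments are assumptions for this sketch.
    """
    config = {"bootstrap.servers": bootstrap_servers, "group.id": group_id}
    if allow_insecure:
        # Temporary escape hatch, intended to be removed in a later release.
        config["security.protocol"] = "PLAINTEXT"
        return config
    if truststore_path is None:
        raise ValueError("truststore_path is required unless allow_insecure=True")
    config["security.protocol"] = "SSL"
    config["ssl.truststore.location"] = truststore_path
    return config


secure = build_consumer_config("localhost:9092", "appsensor",
                               truststore_path="/path/to/truststore.jks")
legacy = build_consumer_config("localhost:9092", "appsensor", allow_insecure=True)
```

Making the insecure path an explicit opt-in keeps the default safe, and dropping the `allow_insecure` flag later removes the escape hatch without touching the secure path.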
>
>>
>> Kind Regards,
>> Tim
>>
>> On Tue, Mar 8, 2016 at 5:27 PM, John Melton <jtmelton at gmail.com> wrote:
>>
>>> Responses inline.
>>>
>>> On Tue, Mar 8, 2016 at 4:33 AM, Timothy Sum Hon Mun <
>>> timothy22000 at gmail.com> wrote:
>>>
>>>> Hi John,
>>>>
>>>> Thanks for getting back to me. It was good hearing back from you. I've
>>>> replied to you inline below.
>>>>
>>>> Besides that, I made a pull request for some minor changes and test
>>>> that I added for appsensor as a first contribution:
>>>> https://github.com/jtmelton/appsensor/pull/38
>>>>
>>>>
>>> Fantastic. I'll take a look at that later today!
>>>
>>>
>>>> Thanks again!
>>>>
>>>> Best Regards,
>>>> Tim
>>>>
>>>> On Mon, Mar 7, 2016 at 4:37 AM, John Melton <jtmelton at gmail.com> wrote:
>>>>
>>>>> Tim,
>>>>>
>>>>> Hi, and thanks so much for your email. I've responded with specific
>>>>> comments inline below.
>>>>>
>>>>> Thanks,
>>>>> John
>>>>>
>>>>> On Sun, Mar 6, 2016 at 1:58 PM, Timothy Sum Hon Mun <
>>>>> timothy22000 at gmail.com> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> Firstly, congratulations on OWASP being accepted for GSoC 2016!!
>>>>>>
>>>>>> My name is Timothy Sum and I am from Malaysia. I am currently a final
>>>>>> year MSc Computer Science student studying at University of Kent in the UK.
>>>>>> I have experience in Java, JavaScript, Python, Node.js, MongoDB, AWS,
>>>>>> Jenkins, Git workflow, Dropwizard, Logstash, Apache Spark (MSc
>>>>>> dissertation) and others. I am always keen to learn new technologies and
>>>>>> try things outside my comfort zone!
>>>>>>
>>>>>> I am currently undergoing my placement (where I gained most of my
>>>>>> experience from) which will be concluded on the 31st March 2016. I will be
>>>>>> working full time on the weekdays before then. Therefore, I will do my
>>>>>> research about the project and prepare my proposal typically at night or
>>>>>> during the weekends. After my placement finishes, I will be able to
>>>>>> completely commit to GSoC by researching, learning and experimenting about
>>>>>> gaps in my knowledge during April even before the community bonding period.
>>>>>> I’ll have a written report about my placement that is due in June
>>>>>> 2016, but I can do that while coding over the summer!
>>>>>>
>>>>>> I only stumbled upon GSoC 3 days ago and have been looking
>>>>>> through the project list to decide which project I should go for. This will
>>>>>> be my first time contributing to an open source project and I am very hyped
>>>>>> up about it as I get to learn from a mentor and contribute at the same
>>>>>> time. :) I also do not mind having regular Skype/Hangouts calls with
>>>>>> mentors to discuss my progress.
>>>>>>
>>>>>
>>>>> Yes, skype/hangouts is the normal way we communicate. I generally aim
>>>>> for meetings 2-3 times a week so we can make sure we're making forward
>>>>> progress and then use email in between meetings for specific questions.
>>>>>
>>>>
>>>>>
>>>>>>
>>>>>> I am interested in the Trend Monitoring Analysis Engine project for
>>>>>> OWASP AppSensor and would be excited to work on it. I do not
>>>>>> have a background in application security and intrusion detection but am
>>>>>> highly interested in learning about it. So far, I have:
>>>>>>
>>>>>
>>>>> Fantastic. Honestly, a background in spark / machine learning will be
>>>>> more important.
>>>>>
>>>>
>>>> Cool! I did a module in data mining for my MSc that would come in handy
>>>> (learned about machine learning algos like decision trees etc).  I used
>>>> Spark for the first time during my dissertation to implement a
>>>> classification algorithm. I did not get to use Spark's machine learning
>>>> library but my past experience would hopefully make the transition easier.
>>>>
>>>>>
>>>>>
>>>>>>
>>>>>> i) Read Chapter 3 and Chapter 4 of the OWASP guide briefly to
>>>>>> understand the approach behind AppSensor, its high-level architecture
>>>>>> (detection and response unit) and its patterns (Event, EventManager,
>>>>>> EventAnalysisEngine and so on).
>>>>>>
>>>>>> ii) Managed to get a demo running locally as per the AppSensor Demo
>>>>>> Setup guide (
>>>>>> https://github.com/jtmelton/appsensor/blob/master/sample-apps/DemoSetup.md).
>>>>>> Had a little bump with a mongo test failing when doing mvn install but got
>>>>>> it to work in the end. Went through part of the codebase while doing this.
>>>>>>
>>>>>> iii) Researched trend monitoring analysis techniques. It seems that
>>>>>> trend analysis falls under anomaly detection based on my understanding so
>>>>>> far, but feel free to correct me (will expand in the section below). It
>>>>>> would be great if you could recommend additional papers/books to read to
>>>>>> learn more on this topic.
>>>>>>
>>>>>> Did a first pass on two papers that cover general topics in IDS:
>>>>>>
>>>>>>
>>>>>> http://galaxy.cs.lamar.edu/~bsun/seminar/example_papers/IDS_taxonomy.pdf
>>>>>>
>>>>>> http://www.ijcset.net/docs/Volumes/volume2issue4/ijcset2012020419.pdf
>>>>>>
>>>>>>
>>>>> There is not much literature specific to application intrusion
>>>>> detection. The concept is roughly based on network IDS systems. It is
>>>>> mostly transferring those concepts to the application layer, and looking
>>>>> for activity that is not possible (or is much harder) to detect at the
>>>>> network layer, but is possible (or much easier) at the application layer.
>>>>>
>>>>
>>>>  Interesting, I will probably do some reading to get a better overview
>>>> of IDS in general.
>>>>
>>>>>
>>>>>
>>>>>> Currently, I have given it some thought and my high-level
>>>>>> understanding of the expected deliverables is:
>>>>>>
>>>>>> i)  A trend monitoring analysis engine - Extend the analysis-engines
>>>>>> package and add tests. Depending on the implementation strategy chosen,
>>>>>> it seems that I would have to record the “normal” behaviour pattern of a
>>>>>> system and then trigger a response if the application behaves out of the
>>>>>> norm, as defined by the trending rules.
>>>>>>
>>>>>
>>>>> I think of 2 possible approaches:
>>>>> - *simple trending engine* - this would be an implementation that
>>>>> would essentially do some simple counting. An example here might be that we
>>>>> have seen the occurrence of detection point ABC go up 500% in the last hour
>>>>> over the "normal" usage. This would likely be pretty straightforward, and
>>>>> could use something like a time series database to track the metadata, and
>>>>> do some very fast analysis.
>>>>>
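The counting idea above can be sketched roughly as follows. This is a minimal illustration and not AppSensor code: the class name, the rolling-average baseline and the hourly bucketing are all assumptions, and an in-memory window stands in for the time series database mentioned in the reply.

```python
from collections import deque


class SimpleTrendDetector:
    """Flags an hourly count that far exceeds the recent average (hypothetical sketch)."""

    def __init__(self, history_hours=24, threshold_pct=500.0):
        self.threshold_pct = threshold_pct
        # Counts for the previous hours; the oldest entry drops off automatically.
        self.history = deque(maxlen=history_hours)

    def observe(self, count):
        """Record one hour's count; return the percent increase if it trips the threshold."""
        increase = None
        if self.history:
            baseline = sum(self.history) / len(self.history)
            if baseline > 0:
                pct = (count - baseline) / baseline * 100.0
                if pct >= self.threshold_pct:
                    increase = pct
        self.history.append(count)
        return increase


detector = SimpleTrendDetector(history_hours=3, threshold_pct=500.0)
# Three quiet hours for one detection point, then a spike.
results = [detector.observe(c) for c in [10, 10, 10, 60]]
```

Note that in this sketch the spike itself is added to the history, so it raises the baseline for the following hours; a real engine might prefer to exclude flagged intervals from the "normal" baseline.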
>>>>
>>>> I looked up time series databases to learn more about them, as I have
>>>> not worked with one before.
>>>>
>>>>
>>>> http://stackoverflow.com/questions/8816429/is-there-a-powerful-database-system-for-time-series-data
>>>>
>>>> I notice that we have an implementation to integrate with influxdb in
>>>> the package appsensor-integration-influxdb.
>>>>
>>>> If I were to do the simple trending engine, I would have to extend the
>>>> current implementation so that I can retrieve the events written to it,
>>>> then do the counting and analysis to decide whether the activity is
>>>> unusual. This is assuming that I will be using influxDB, of course. What
>>>> is your opinion?
>>>>
>>>
>>> Yes, that's the basic idea. There are several that you could use. I
>>> don't really care that much about the implementation (tool) to be honest,
>>> but rather the idea. We can provide 1 implementation, then add
>>> implementations for specific tools if people would like one that we don't
>>> already cover.
>>>
>>>
>>>>
>>>>>
>>>>> - *machine learning engine* - this is a more complex implementation.
>>>>> This would involve creating a ML style engine that would allow for various
>>>>> types of analysis. An example might be noticing a shift in the composition
>>>>> of HTTP verb usage for a given time period. If you decide to go this route,
>>>>> I think you'll want to be very specific with the types of analysis you want
>>>>> to provide, and focus on doing great documentation about how to build rules
>>>>> based on training data and the algorithm selection process.
>>>>>
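As a rough illustration of the HTTP verb example, the sketch below compares the verb mix of a baseline period with that of the current period using total variation distance. That metric is a choice made here for simplicity and is not something the thread prescribes; the function names and sample data are hypothetical.

```python
from collections import Counter


def verb_shares(verbs):
    """Map each HTTP verb to its share of all requests in the period."""
    counts = Counter(verbs)
    total = sum(counts.values())
    return {verb: n / total for verb, n in counts.items()}


def composition_shift(baseline_verbs, current_verbs):
    """Total variation distance between two verb mixes (0 = identical, 1 = disjoint)."""
    base = verb_shares(baseline_verbs)
    cur = verb_shares(current_verbs)
    verbs = set(base) | set(cur)
    return 0.5 * sum(abs(base.get(v, 0.0) - cur.get(v, 0.0)) for v in verbs)


# Hypothetical traffic: DELETE requests appear out of nowhere in the current period.
baseline = ["GET"] * 80 + ["POST"] * 20
current = ["GET"] * 40 + ["POST"] * 20 + ["DELETE"] * 40
shift = composition_shift(baseline, current)
```

A rule could then fire when `shift` crosses a threshold learned from training data, which is where the documentation on rule building and algorithm selection would come in.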
>>>>
>>>>  This is a really interesting idea! I did some research in order to
>>>> get an idea of what needs to be done using Spark as a base. Ideas and
>>>> questions below:
>>>>
>>>> i) Idea 1: There has been some work on using spark and cassandra (as a
>>>> time series db even though it's a k-v store) for data analysis. In relation
>>>> to appsensor, I would have to integrate Spark (probably as part of the
>>>> analysis engine) for its machine learning library and implement a storage
>>>> provider for cassandra before wiring them together. I will have to design
>>>> a schema for the time series data storage inside cassandra as well. This
>>>> seems like quite a lot of work for the duration of the project, but I'll
>>>> be able to leverage some existing work.
>>>>
>>>>
>>>> http://www.slideshare.net/patrickmcfadin/apache-cassandra-apache-spark-for-time-series-data
>>>>
>>>> ii) Idea 2: Implement a simple trending analysis engine as the main
>>>> project work (related to my question under the simple trending approach)
>>>> and finish the 3 deliverables. Then build an ML engine using Spark for
>>>> machine learning, wiring it to the time series db used in the simple
>>>> trending approach. This way I don't have to implement a separate store
>>>> for the ML analysis engine, but the challenge probably lies in working out
>>>> how to connect them together.
>>>>
>>>>
>>> Both of these ideas are good, honestly. I'd focus on which one you think
>>> you can accomplish in the 3-month time frame. We don't want you to be able
>>> to finish the project in 2 weeks, but we also don't want it to take a year.
>>>
>>>
>>>> iii) Question 1: What do you mean by being specific about the type of
>>>> analysis that I am providing and the algorithm selection? From what I
>>>> understand, it's either:
>>>> - For example, if we have 2 cases: measuring the shift in composition of
>>>> HTTP verbs and the number of API calls to an endpoint. I would implement
>>>> it such that I use one algorithm for checking the composition of HTTP
>>>> verbs and another algorithm for the number of API calls. I guess some
>>>> research needs to be done to decide which algorithm would be suitable for
>>>> which use case/scenario/event.
>>>>
>>>> - Implement a wide variety of algorithms for the analysis engine and then
>>>> let the user decide which algorithm to use for all events or for each
>>>> event.
>>>>
>>>> I am leaning towards the simple trending approach for now, taking time
>>>> into account, although I would really like to give the machine learning
>>>> approach a go. Feedback and answers to the questions above will help me
>>>> scope out the amount of work required for the machine learning approach,
>>>> especially (ii). :D
>>>>
>>>
>>> What I meant by the "specific type of analysis" comment is around
>>> machine learning. For machine learning, you have to decide which algorithm
>>> (or family of algorithms) to use to solve a particular problem. We can
>>> certainly use spark-ml or some other library to give us those algorithms,
>>> but in order to make it useful to our users, we'll have to write some code
>>> to integrate those algorithms with the types of problems we want to solve.
>>> If we're trying to solve a problem that requires "k nearest neighbors",
>>> then we'll have to write some code that uses that. My point was that we
>>> don't want to solve _every_ problem. We want to essentially document the
>>> process: 1) decide what problem you want to solve, 2) pick best algorithm,
>>> 3) implement algorithm, 4) use training dataset, 5) turn on analysis. In
>>> that workflow, we are not going to implement _all_ the different types of
>>> analysis you could do over the summer of code. I just want us to pick a few
>>> problems to solve, and document the process so that our users can do the
>>> same thing themselves to build new types of analysis.
>>>
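The five-step workflow described above might look something like this for a single problem, using the "k nearest neighbors" case as the example. Everything here is a hypothetical sketch: the features, the hand-made training set and the labels are invented for illustration, and a library such as spark-ml would replace the hand-written classifier in practice.

```python
import math
from collections import Counter


def knn_classify(training, point, k=3):
    """Label `point` by majority vote among its k nearest training samples."""
    nearest = sorted(training, key=lambda sample: math.dist(sample[0], point))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Steps 1-2: problem = unusual login activity, algorithm = k nearest neighbours.
# Step 4: a made-up training set of
# (logins per minute, distinct usernames tried) -> label.
training = [
    ((2, 1), "normal"),
    ((3, 1), "normal"),
    ((4, 2), "normal"),
    ((40, 25), "suspicious"),
    ((55, 30), "suspicious"),
    ((60, 45), "suspicious"),
]

# Step 5: "turn on analysis" by classifying newly observed activity.
label_low = knn_classify(training, (5, 1))
label_high = knn_classify(training, (50, 28))
```

Documenting one or two problems end to end in this way would give users a template for building their own analyses with a different algorithm or data set.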
>>>
>> I did some additional research on Spark and influxDB. It looks like there
>> are issues open to improve integration between Spark and influxDB,
>> especially for querying large amounts of data from influxDB.
>> https://github.com/influxdata/influxdb/issues/3276
>> https://groups.google.com/forum/#!msg/influxdb/P9BEMslIQ1Q/ydrylZ3dDAAJ
>>
>> Thanks to your feedback, I will probably decide on doing the ML
>> implementation. I read up on detection points to decide on the problem I
>> want to solve and the algorithms that I will use. Here is my idea so far:
>>
>> i) Problem: Rate of login attempts / Speed of application use / Change in
>> usage of same transaction for the website.
>>
>> ii) Algorithm: Random Forest, Decision Trees / SVM / K-NN
>>
>> iii) Implementing algorithm using existing library (JavaML, sparkML or
>> quickml(http://quickml.org/)) with regards to the problem.
>>
>> iv) Training dataset: how will we get the training dataset for the
>> particular problem? I suppose I will be creating it by modifying an
>> existing dataset to replicate a behaviour.
>>
>> What do you think? I will focus on implementing one problem and an
>> algorithm for it as advised, and can probably put the other problems down as
>> nice-to-haves if I have additional time. The above uses influxDB to track
>> the data and an ML algorithm upon retrieval to identify abnormal
>> behaviours.
>>
>
> This sounds great. Getting a data set could be a difficult task. Given the
> sensitivity of the data, folks are not comfortable sharing their appsensor
> dataset. Having said that, I've had decent luck just using web server logs
> for many tasks. Those are pretty easy to get if you ask around. As for data
> storage, feel free to pick the best tool. If that's influxdb, great. If
> it's cassandra, mongo, riak, etc. ... that's fine too. I would pick
> something reasonably common and well-known so you'll have good examples to
> work from and so we have reasonable confidence it will work pretty well.
>
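To show how web server logs could stand in for an AppSensor data set, here is a minimal sketch that turns access-log lines into per-client, per-minute request counts. It assumes the logs use the Common Log Format; the regular expression, the function name and the sample lines are illustrative only.

```python
import re
from collections import Counter
from datetime import datetime

# Assumes Common Log Format: host ident user [time] "method path protocol" status size
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+)[^"]*" (\d{3}) ')


def requests_per_minute(lines):
    """Count requests per (client IP, minute), skipping lines that do not parse."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue
        ip, timestamp, _method, _path, _status = match.groups()
        when = datetime.strptime(timestamp, "%d/%b/%Y:%H:%M:%S %z")
        counts[(ip, when.strftime("%Y-%m-%d %H:%M"))] += 1
    return counts


# Made-up sample lines using documentation-reserved IP addresses.
sample = [
    '203.0.113.7 - - [15/Mar/2016:14:26:01 +0000] "POST /login HTTP/1.1" 401 512',
    '203.0.113.7 - - [15/Mar/2016:14:26:09 +0000] "POST /login HTTP/1.1" 401 512',
    '198.51.100.4 - - [15/Mar/2016:14:27:30 +0000] "GET /index.html HTTP/1.1" 200 1024',
]
counts = requests_per_minute(sample)
```

Counts like these could be written to whichever store is chosen (influxdb, cassandra, mongo and so on) and used as training input.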
>
>>
>>
>>>>>
>>>>>>
>>>>>> ii)  Associated configuration mechanism to specify the trending
>>>>>> rules/policy - Extend the configuration mode package, create
>>>>>> respective xml and xsd configuration for the Trend Monitoring analysis
>>>>>> engine.
>>>>>>
>>>>>> iii) A small full sample demo application showing usage of the trend
>>>>>> monitoring feature. - Built on the existing demo application?
>>>>>>
>>>>>>
>>>>> Yes, these would be the 3 basic outputs for that project, along with
>>>>> the associated documentation. Additionally, I would say that we should
>>>>> produce a small number of rules. That will be necessary for the demo
>>>>> application anyways, but we can use those rules as examples for the
>>>>> community. As for the demo application, it's very small and trivial. We
>>>>> actually have a user who built a demo application for a talk about
>>>>> appsensor that is likely a much better fit (
>>>>> https://github.com/dschadow/ApplicationIntrusionDetection)
>>>>>
>>>>
>>>> Agreed about the rules bit. I took a look at the demo application built
>>>> above and it looks great; I will refer to it when working on the demo
>>>> application part. I've used Dropwizard to build web apps, but I haven't
>>>> worked with Spring (only a little on DI) before and will have to read
>>>> about it.
>>>>
>>>
>>> The Spring parts should be pretty straightforward, and I (and others)
>>> can help you there if you need anything. You don't need to know much Spring
>>> at all for this project.
>>>
>>
>> Got it. I will keep that in mind.
>>
>>>
>>>
>>>>
>>>>>
>>>>>> It would be great if the mentor/team can give me feedback on my ideas
>>>>>> and things to read to expand my knowledge in this domain. If there is any
>>>>>> task that you would like me to complete, I am eager to do it and will find
>>>>>> time at night or the weekends to complete it.
>>>>>>
>>>>>
>>>>> I think what I'd be most interested in is if you could let us know
>>>>> which approach (simple trending, machine learning) you would prefer to take
>>>>> when building the analysis engine. Beyond that, I think your skillset looks
>>>>> well suited to the project.
>>>>>
>>>>>
>>>>>>
>>>>>> I would also like to start preparing my project proposal to be able
>>>>>> to share with the mailing list to get feedback as this will be my first
>>>>>> time applying for GSoC and I will need all the help I can get!!
>>>>>>
>>>>>
>>>>> Sounds great. I think your notes in this email are a very solid start.
>>>>> To build a good proposal, I think the most important thing to do is scope
>>>>> the work. Try to build a detailed plan (ie. what task(s) you will
>>>>> accomplish each week). After that, we can review it and make suggestions
>>>>> about whether or not we think you should try to do more or less work, and
>>>>> what parts may be tricky. It will also help us know which mentor(s) to
>>>>> bring onto the project.
>>>>>
>>>>>
>>>>  I will build up my plan as I scope out the work for the two approaches
>>>> and will definitely share it as soon as it is ready.
>>>>
>>>
>>> Perfect.
>>>
>>
>> Depending on the response to my suggestion above, I will share a draft of
>> my proposal tonight or tomorrow night so that I'll have time to tweak it
>> over the coming weekend. :D
>>
>
> Perfect, thank you.
>
>
>>
>>>
>>
>>>
>>>>
>>>>
>>>>>
>>>>>> Thanks for your time; I look forward to your feedback and replies. This
>>>>>> young padawan needs guidance. :D
>>>>>>
>>>>>>
>>>>>>
>>>>> Thank you!
>>>>>
>>>>>
>>>>>> I have also started a topic in the OWASP GSoC group.
>>>>>>
>>>>>>
>>>>>> https://groups.google.com/forum/?fromgroups#!topic/owasp-gsoc/59vAa402jXo
>>>>>>
>>>>>>
>>>>>> Kind Regards,
>>>>>>
>>>>>> Tim
>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> Owasp-appsensor-project mailing list
>>>>>> Owasp-appsensor-project at lists.owasp.org
>>>>>> https://lists.owasp.org/mailman/listinfo/owasp-appsensor-project
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>