Big Data on BigInsights: Answers to your queries
We recently conducted a joint webinar with IBM, "Big Data on BigInsights", on 23rd Feb 2012. Attendees raised many questions in the Q&A session, and we thought it would be a good idea to answer them on the blog and share the answers with everyone.
If you want to watch the webinar, here’s the link to the recording
Q1. On BigInsights, you mentioned large-scale indexing. Can you expand on it? How are these indexes created or maintained? Do I need to install additional software?
BigInsights comes with a module called BigIndex that allows you to build an index over unstructured data and to query the data once the index is built.
BigIndex is built on top of Apache Lucene.
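To make the idea of building an index and then querying it concrete, here is a minimal sketch of an inverted index, the core data structure behind Lucene-style text search. This is not the BigIndex or Lucene API; the function names (`tokenize`, `build_index`, `search`) and the sample documents are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return sorted ids of documents containing every query term (AND)."""
    terms = tokenize(query)
    if not terms:
        return []
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return sorted(result)


# Hypothetical sample documents, for illustration only.
docs = {
    1: "Quarterly sales report for the storage team",
    2: "Meeting notes: sales forecast and hiring",
    3: "Storage cluster maintenance window",
}
index = build_index(docs)
print(search(index, "sales"))
print(search(index, "storage sales"))
```

The index is built once and then answers each query with set lookups instead of rescanning every document, which is why indexing pays off on large volumes of unstructured text.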
Q2. Can you please elaborate on the use of Hadoop in Persistent's product?
The email analytics solution is built on BigInsights (which contains Apache Hadoop). The solution consists of an email connector that loads data directly from an email server, a text analytics pipeline that analyzes the email data, and finally a connector/data loader that transfers the results into either an OLAP solution or a search index for easy retrieval.
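The three stages described above can be sketched as a small single-process pipeline. This is only an illustration of the data flow, not Persistent's implementation: the function names, the sample messages, and the simple term counting that stands in for the text analytics stage are all assumptions, and a real deployment would run these stages on Hadoop rather than in one Python process.

```python
from collections import Counter
from email import message_from_string

# Hypothetical raw messages standing in for what an email connector would fetch.
RAW_EMAILS = [
    "From: alice@example.com\nSubject: Invoice overdue\n\n"
    "The invoice is overdue. Please pay the invoice.",
    "From: bob@example.com\nSubject: Lunch\n\nLunch on Friday?",
]


def load_emails(raw_messages):
    """Stage 1 (connector): parse raw text into email message objects."""
    return [message_from_string(raw) for raw in raw_messages]


def analyze(message):
    """Stage 2 (text analytics): extract sender, subject, and term counts."""
    body = message.get_payload()
    words = [w.strip(".,?!").lower() for w in body.split()]
    return {
        "sender": message["From"],
        "subject": message["Subject"],
        "terms": Counter(w for w in words if w),
    }


def load_results(records):
    """Stage 3 (loader): index subjects by term for easy retrieval."""
    by_term = {}
    for record in records:
        for term in record["terms"]:
            by_term.setdefault(term, []).append(record["subject"])
    return by_term


records = [analyze(m) for m in load_emails(RAW_EMAILS)]
search_index = load_results(records)
print(search_index["invoice"])
```

Keeping each stage behind its own function mirrors the connector / analytics / loader split, so the final stage could target an OLAP store instead of a search index without touching the other two.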
Q3. Will you provide a session on InfoSphere BigInsights?
Additional information about IBM BigInsights is available at https://www.ibm.com/support/home/
Q4. What is the role of Hive in big data analysis? Does the velocity of data have a direct impact on big data systems?
Hive provides a SQL-like interface to the data stored on Hadoop, which makes that data easy to access. Please note that SQL support is limited and only the basic SQL commands work.
As for velocity, big data systems (Hadoop in particular) are designed to handle a large volume of inserts, and in that sense they are superior to some other analytics systems.
Q5. Does the solution/product allow data and/or metadata to be exchanged between the various systems? Or does the customer have to do that manually?
We are not sure which systems you are referring to, but Hadoop currently does not include any high-level metadata management software. It does, of course, manage metadata about data blocks and so on, but that is typically not relevant when working with other systems.
Q6. Hive itself provides a SQL interface, so why use your system?
The SQL supported in Hive is very minimal; only basic operators are supported. Also, SQL is declarative by nature, which makes writing advanced analytics difficult. Finally, the performance of Hive is not optimal.
Q7. Can you walk us through how the user interacts with the Big Data platform? I understand the process of aggregating unstructured data. I don't understand the process of asking questions of the system.
There are several ways of querying the Hadoop platform, listed here from most difficult to easiest:
1. Write Java code on top of the map/reduce Java APIs. This requires you to implement your logic (queries) by extending the map/reduce methods. It is the most difficult approach, but you have full control and the most room to optimize.
2. Write queries in higher-level languages such as Jaql or Pig. These languages convert the code you write into map/reduce functions and execute them on the platform. This frees you from thinking in map/reduce terms, but you have less control, and in some cases the generated map/reduce code may not be the most efficient.
3. Use Apache Hive and implement your logic as a series of SQL queries. This is similar in spirit to Jaql/Pig, except that SQL is well known, so there is no new language to learn. However, SQL support is limited and indexing support is still being implemented. You will have to spend some time setting this up.
4. Use higher-level tools like IBM BigSheets, or an offering from Datameer, that give you a visual environment (like a spreadsheet) to visualize and process your data. You are further removed from the map/reduce layer on Hadoop, so performance problems are harder to debug, and functionality is limited to what the products provide, although UDFs of some kind are available.
Essentially, you will have to choose the method that works best for the questions you want to ask.
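For readers unfamiliar with what "implementing your logic in map/reduce" means in the first option, here is a minimal single-process sketch of the map, shuffle, and reduce steps for a word count. It is written in Python for brevity and does not use the Hadoop Java API; the function names and the sample input are assumptions for illustration. On a real cluster the framework performs the shuffle and distributes the map and reduce calls across machines.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reducer: sum the counts collected for one word."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle, and reduce in a single process."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


# Hypothetical input lines, for illustration only.
lines = ["big data on hadoop", "hadoop stores big data"]
counts = run_job(lines)
print(counts)
```

The higher-level options in the list generate code of this shape on your behalf, which is where both the convenience and the loss of control come from.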
Q8. Is Open Data (public data in the US, Europe and other places) a new opportunity for democratic BigInsights? Do you have experience with use cases for it?
Yes, this is definitely a new opportunity. With the Big Data platform being touted as an enterprise data repository, I think it would be an ideal place to integrate enterprise data with public datasets.
Use cases: There are several available; a specific one is available from IBM (http://www-03.ibm.com/press/us/en/pressrelease/35737.wss). It describes how Vestas used available weather and tidal data to improve turbine placement and optimize energy output.
Q9. Were the use cases presented here solved by IBM and Persistent Systems in collaboration?
The email analytics use case was built by Persistent. We have collaborated with IBM on some of the other use cases mentioned in the presentation.
Q10. What is there for the business users?
Please see the answer to Q7 about interacting with the Hadoop system.
The tooling around Big Data is still evolving, so it is not very straightforward for business users to interact with Big Data systems. However, this is an area where a lot of vendors are focusing, so things are bound to improve.
At the same time, I think we will see vertical-specific Big Data offerings targeted at the business users of a given domain. We should see a lot of those in the next six months to a year.
Q11. Can you share some Big Data architecture slides?
Persistent’s Email Analytics Architecture
Q12. How does Rational tooling support development of Big Data applications?
From an internal IBM perspective, we are reaching out to the various groups, both within the software group and within STG, our hardware division, to ensure interoperability. Different products are available today: we have interoperability with DB2, and other products such as SPSS and Cognos will be coming down the road as we move forward. So different products and tooling are available, and you should make sure you have a clear understanding of which specific products work in this portfolio. At the end of the day, this is a general statement of direction, without pre-announcing that anything is available: the intent is to keep reaching out so that we achieve that interoperability.
(Answered by: Vish Vishwanath – Senior Vice President, BI & Analytics, Persistent Systems
Anand Ghalsasi – Associate Vice President, Sales, Persistent Systems
Mukund Deshpande – Associate Vice President, Operations, BI & Analytics, Persistent Systems)