Ticket #806 (closed Enhancement: worksforme)
optimization: allow much faster bulk imports (like dm4-wdtk) into dm4
Reported by: | Malte | Owned by: | jri |
---|---|---|---|
Priority: | Blocker | Milestone: | Release 4.7 |
Component: | DeepaMehta Standard Distribution | Version: | 4.5 |
Keywords: | | Cc: | Malte, JuergeN, dgf |
Complexity: | 3 | Area: | Performance |
Module: | | | |
Description
Currently I am working on analysing and processing certain parts of the over 15 million entities in some Wikidata JSON dumps. More specifically, I am focusing on a specific geo-vocabulary: matching persons and institutions to cities and countries, as well as matching cities and countries to public OSM data and various statistical identifiers.
Transforming (mostly storing/writing) just 5000 (minimally equipped) topics from these millions of items currently takes about 10 minutes. In the same amount of time I could process all 15 million items (just without storing/writing them into DM4).
Under these circumstances it does not make much sense to continue writing a generic importer, because I will never have the time to test it.
Is there a way we could speed up bulk imports of many millions of topics and associations?
Change History
comment:3 (in reply to comment:1) Changed 9 years ago by jri
Replying to Malte:
Hey, it seems like an upgrade to Neo4j 2.2 could facilitate optimizations in this direction:
Here (http://neo4j.com/whats-new-in-neo4j-2-2/) it is said that Neo4j 2.2.0 delivers "100 times faster write operations" than previous versions and "can write up to 1 million nodes per second".
At the moment we can't upgrade to Neo4j 2.2 as we don't want DM to lose JDK 6 compatibility. To my knowledge the latest JDK 6 compatible Neo4j version is 1.8.3. DM currently uses Neo4j 1.8.1.
I am aware, as you once mentioned, that significantly reducing the "time-to-write" (millions of nodes) with deepamehta4 might have less to do with upgrading to a faster Neo4j version and more with our current transaction implementation. Is that right?
No. DM transactions are Neo4j transactions (provided the dm4-storage-neo4j module is used, which is the only DM storage implementation so far).
The way to go for speeding up bulk imports is grouping several inserts into one transaction. Let's say, create a new transaction for every 1000 topics or so. You can do this in your plugin.
If you create too many transactions, e.g. when you wrap each createTopic() call into its own transaction, the transaction processing overhead accumulates significantly.
On the other hand, if you create too few transactions, e.g. when you wrap the entire import of 1 million topics into a single transaction, the transaction log becomes huge, resulting in an increasingly slow import (I guess because traversing the transaction log becomes inefficient and/or because of memory shortage).
So, you have to experiment a little to find a suitable size for your import groups.
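A minimal sketch of such an experiment (an editorial illustration, not from the ticket): it times the same import with a few candidate bulk sizes. The transaction API (beginTx()/success()/finish() on the dms service) follows the snippet shown below in comment:7; importOneItem() is a hypothetical placeholder for your actual createTopic() work.

```java
// Hedged sketch: time the same import with a few candidate bulk sizes to find
// a sweet spot. Assumes the same class context as comment:7's MyImport, i.e.
// a "dms" service field; importOneItem() stands in for one createTopic() call.
void benchmarkBulkSizes(int totalItems) {
    for (int bulkSize : new int[] {100, 1000, 10000}) {
        long start = System.currentTimeMillis();
        DeepaMehtaTransaction tx = null;
        for (int i = 0; i < totalItems; i++) {
            if (i % bulkSize == 0) {
                if (tx != null) { tx.success(); tx.finish(); }  // commit previous group
                tx = dms.beginTx();                             // start next group
            }
            importOneItem(i);  // hypothetical placeholder for the actual import work
        }
        if (tx != null) { tx.success(); tx.finish(); }          // commit the final group
        System.out.println("bulk size " + bulkSize + ": "
            + (System.currentTimeMillis() - start) + " ms");
    }
}
```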
comment:6 Changed 9 years ago by Malte
Hi, I only recently saw your comments.
I just implemented your tip (grouped transactions) but have yet to investigate the concrete gains of skipping many thousands of (previously created) transactions.
Thank you very much for your answer!
comment:7 Changed 9 years ago by jri
Replying to Malte:
[...] skipping many thousands of (previously created) transactions.
You don't actually "skip" transactions.
You must create a transaction only for every 1000th or so "import item" call and hold that transaction in an instance variable:
```java
class MyImport {

    int BULK_SIZE = 1000;
    int counter = 0;
    DeepaMehtaTransaction tx = null;

    // called many times
    void importItem() {
        if ((counter % BULK_SIZE) == 0) {
            // commit current tx if any
            if (tx != null) {
                tx.success();
                tx.finish();
            }
            // create new tx
            tx = dms.beginTx();
        }
        // do import
        // ...
        counter++;
    }

    // called once
    void endImport() {
        // commit final bulk
        tx.success();
        tx.finish();
    }
}
```
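One caveat worth adding (an editorial note, not from the ticket): if importItem() throws between commits, the open transaction is never finished. Assuming DeepaMehtaTransaction follows the Neo4j 1.x convention, where finish() without a prior success() rolls the transaction back, an abort path could look like this:

```java
// Hedged addition, assuming Neo4j 1.x transaction semantics: finish() without
// success() discards the pending changes of the current group.
void abortImport() {
    if (tx != null) {
        tx.finish();  // no success() call, so the open group is rolled back
        tx = null;
    }
}
```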
comment:8 Changed 9 years ago by Malte
Thanks for your comment, and yes, sure, if I "skipped" all the transactions my code would not work at all. I used "skipping" in the sense of "omitting 5999 out of 6000".
In another case, I have a similarly severe time issue when performing write operations:
In the Kiezatlas case I must now delete 1300+ reasonably complex composite topics in a migration. Possibly in one or more migrations; I cannot know that yet.
The current benchmark (on an i5 at 1.6 GHz) is:
Deleting one composite topic takes about 15 to 45 seconds.
So, assuming a stable rate of about 30 seconds per topic, my final migration will run approximately 10 hours (1300 × 30 s ≈ 39,000 s) just to delete all these 1300+ topics.
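For what it's worth, the transaction-grouping advice from comment:3 should apply to such a deletion migration as well, though whether it recovers much of the per-topic cost is untested. A hedged sketch; dms.deleteTopic(id) is an assumed service call, to be adapted to the actual DM4 API:

```java
// Hedged sketch: group the migration's deletions the same way as the imports.
// Assumes the same "dms" service field as above; dms.deleteTopic(id) is an
// assumption, substitute the real DM4 delete call.
void deleteInGroups(java.util.List<Long> topicIds) {
    final int BULK_SIZE = 100;
    DeepaMehtaTransaction tx = null;
    int counter = 0;
    for (long id : topicIds) {
        if (counter % BULK_SIZE == 0) {
            if (tx != null) { tx.success(); tx.finish(); }  // commit previous group
            tx = dms.beginTx();                             // start a new group
        }
        dms.deleteTopic(id);  // assumed API call
        counter++;
    }
    if (tx != null) { tx.success(); tx.finish(); }          // commit the final group
}
```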
Despite (of course) not being able to be 100% sure whether a Neo4j upgrade to 2.2 will bring the promised performance gain to DM4: what are the current reasons to stick with JDK 6 compatibility?
Would be great if you could share your thoughts with us on this topic, too.
comment:9 Changed 9 years ago by Malte
To follow up on my question about JDK 6 compatibility: is it because Java 7 and 8 need to be considered as effectively quite different programming languages than Java 6 (as documented here: http://docs.oracle.com/javase/8/docs/technotes/guides/language/enhancements.html), so that taking advantage of them has to be treated as a major core upgrade task? Or is it more because of the work needed on dependency management?
Regarding our dependencies: for Neo4j Spatial, a 2.2.0-compatible release already exists.
I am not sure whether deepamehta4 depends on other/more Neo4j plugins which may not yet be ready for version 2.2.x.
Anyway, I just wanted to share this news here and kindly ask again for some perspective on this matter. Thanks a lot!
Cheers!