General purpose samples

Data analytics samples demonstrate common data loading and management patterns for MinusOneDB. Learn how to load large amounts of data as quickly as possible, or how to resize stores as your usage grows.

Load Sample Data

Load 50 million Reddit comments in an hour.

Setup Time: 15 minutes

Data Load Time: 40 minutes

This Python sample shows how to load 50 million Reddit comments into a MinusOneDB environment as quickly as possible. It illustrates common MinusOneDB patterns for maximizing loading performance that you can apply when loading your own data. Larger datasets load the same way: with a corresponding increase in provisioned hardware, any dataset can be loaded predictably in the same amount of time. Once loaded, explore the sample Reddit dataset using the m1 client or our reporting application.

Prerequisites

Step 1: Download and unzip the reddit loading sample.

Step 2: Set up the m1 client if you have not already done so.

Step 3: You will need a MinusOneDB environment:

  • If you do not have a MinusOneDB account, reach out and we can help you set one up.

  • If you have a MinusOneDB account but have not set up an environment yet, create one using the minimal.json environment template file available in your download.

  • If you already have your target environment follow the instructions below to run the sample.

Note: Because this sample uses more than one data processing server, it will not work with our trial environments. Contact [email protected] if you want to try this before becoming a MinusOneDB customer.

Step 4: Install and configure Python if you have not already done so.

Instructions

For reference, here is a copy of ops.json.sample

{
    "server"   : "https://ops.minusonedb.com",
    "username" : "<username>",
    "password" : "<password>"
}

For reference, here is a copy of env.json.sample

{
    "server"   : "https://envName-accountName.minusonedb.com",
    "username" : "<username>",
    "password" : "<password>"
}

Your environment key can be found using the following: m1 ops env/list -account <accountId>
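As a sketch of how a loading script might consume the two config files above: both share the same three fields, so a single validation helper covers them. The `parse_config` helper below is hypothetical (not part of the sample download); the field names follow the ops.json.sample and env.json.sample files shown above.

```python
import json

# Fields that both ops.json and env.json are expected to contain,
# per the sample files above.
REQUIRED = {"server", "username", "password"}

def parse_config(text, name):
    """Hypothetical helper: parse a config file's JSON text and verify
    that the required fields are present."""
    cfg = json.loads(text)
    missing = REQUIRED - cfg.keys()
    if missing:
        raise ValueError(f"{name} is missing fields: {sorted(missing)}")
    return cfg

ops = parse_config(
    '{"server": "https://ops.minusonedb.com", '
    '"username": "alice", "password": "secret"}',
    "ops.json",
)
assert ops["server"] == "https://ops.minusonedb.com"
```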

Step 1: Navigate to the bulkload-reddit/src sample directory.

Step 2: Copy ops.json.sample to ops.json and update ops.json with your ops.minusonedb.com username and password.

Step 3: Copy env.json.sample to env.json and update your environment name, username, and password.

Step 4: Run python3 bulkload-sample.py ops.json env.json

The script will print out progress as it performs the following actions:

  • Rescales your environment to increase the number of available data processing servers.

  • Creates a high-performance data store to increase data throughput.

  • Publishes a number of gzip compressed jsonl files that contain the raw reddit data.

  • Creates a backup of the store after all files have been published.

  • Creates a standard data store.

  • Restores the created backup to the newly created store.

  • Destroys the high-performance data store and drops the associated configuration.
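The publish step works on gzip-compressed JSONL files, one JSON record per line. A minimal, self-contained sketch of writing and reading such a file locally (the record fields here are illustrative, not the real Reddit comment schema, and this does not call any MinusOneDB API):

```python
import gzip
import json

# Write a tiny gzip-compressed JSONL file in the same shape as the
# files the sample publishes. Fields are illustrative only.
records = [
    {"subreddit": "askscience", "body": "Why is the sky blue?", "score": 42},
    {"subreddit": "askscience", "body": "Rayleigh scattering.", "score": 99},
]
with gzip.open("comments.jsonl.gz", "wt", encoding="utf-8") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")

# Read it back one record at a time, as a loader would.
with gzip.open("comments.jsonl.gz", "rt", encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]

print(len(loaded))  # 2
```

Streaming line by line keeps memory flat no matter how large each file is, which matters when the files hold millions of comments.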

See the number of Reddit comments:

m1 <envname> query -store index -q "*"
curl https://<envname>.minusonedb.com/query \
-d "store=index&q=*" \
-H "m1-auth-token: $myToken"

See the top 10 most popular subreddits:

m1 <envname> query -store index -q "*" -json '{
    "facet": {
        "by_score": {
            "field": "subreddit",
            "type": "terms",
            "sort" : "count desc",
            "limit" : 10
        }
    }
}'
curl https://<envname>.minusonedb.com/query -d 'store=index&q=*&json={
    "facet": {
        "by_score": {
            "field": "subreddit",
            "type": "terms",
            "sort" : "count desc",
            "limit" : 10
        }
    }
}' -H "m1-auth-token: $myToken"
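The same facet request can be assembled programmatically. This sketch only builds the form body that the curl example above sends; it stops short of posting it, since the endpoint and auth token are placeholders for your own environment.

```python
import json
from urllib.parse import urlencode

# Facet definition matching the query above: top 10 subreddits by count.
facet = {
    "facet": {
        "by_score": {
            "field": "subreddit",
            "type": "terms",
            "sort": "count desc",
            "limit": 10,
        }
    }
}

# Form body for: POST https://<envname>.minusonedb.com/query
# (send with the m1-auth-token header, as in the curl example).
body = urlencode({"store": "index", "q": "*", "json": json.dumps(facet)})
print(body[:40])
```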

You can now explore the Reddit data with the sample queries below or with our reporting application.


© 2021-2026 MinusOne, Inc.