Thursday, June 28, 2012

Running an R script on AWS EMR


  • Install Amazon EMR Command Line Interface
    1. Install Ruby (1.8 or later)
    2. Download and unzip CLI (http://aws.amazon.com/developertools/2264)
    3. Configure credentials.json
                     {
                        "access_id": "AWS Access Key ID",
                        "private_key":"AWS Secret Access Key",
                        "keypair": "EC2 keypair name",
                        "key-pair-file":"pem location",
                        "log_uri":"s3n://log-location",
                        "region":"us-east-1"
                     }

  • Job Flow Essentials
    1. Creating a Job Flow (./elastic-mapreduce --create --alive)
    2. Listing all Job Flows (./elastic-mapreduce --list)
    3. Retrieving information about a specific Job Flow (./elastic-mapreduce --describe --jobflow ID)
    4. Adding a step using default parameter values to a Job Flow (./elastic-mapreduce -j ID --stream)
    5. Terminating a Job Flow (./elastic-mapreduce --terminate ID)
    6. Listing all active Job Flows (./elastic-mapreduce --list --active)
  • Streaming Job Flow
                ./elastic-mapreduce --create --stream \
                                     --mapper s3n://[mapper-location] \
                                     --input s3n://[input-location] \
                                     --output s3n://[output-location] \
                                     --reducer s3n://[reducer-location]
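
A streaming step pipes each input line to the mapper's standard input and reads tab-separated key/value lines from its standard output; the reducer then receives those lines sorted by key. The following is a minimal sketch of that contract, written in Python rather than R and using a word count as a stand-in job. The function names map_lines and reduce_pairs are hypothetical and not part of EMR; an R mapper and reducer would follow the same stdin/stdout convention.

```python
from itertools import groupby


def map_lines(lines):
    """Mapper: emit one 'word<TAB>1' line for every word in the input."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_pairs(lines):
    """Reducer: sum the counts per key. The input must be sorted by key,
    which is what the framework guarantees between the two phases."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{key}\t{sum(int(count) for _, count in group)}"


# Local simulation of map -> sort -> reduce on two input lines.
mapped = sorted(map_lines(["a b a", "b c"]))
for line in reduce_pairs(mapped):
    print(line)
```

In a real job each function would live in its own script that reads sys.stdin and prints to standard output, and the script locations would be passed to --mapper and --reducer.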


Tuesday, June 19, 2012

geohash adjacent codes

An algorithm for finding the neighbors of a geohash code.

  1. Base32: 0123456789bcdefghjkmnpqrstuvwxyz
  2. Neighbors (direction type)
    1. right even: bc01fg45238967deuvhjyznpkmstqrwx
    2. left even: 238967debc01fg45kmstqrwxuvhjyznp
    3. top even: p0r21436x8zb9dcf5h7kjnmqesgutwvy
    4. bottom even: 14365h7k9dcfesgujnmqp0r2twvyx8zb
    5. right odd= top even (p0r21436x8zb9dcf5h7kjnmqesgutwvy)
    6. left odd= bottom even (14365h7k9dcfesgujnmqp0r2twvyx8zb)
    7. top odd= right even (bc01fg45238967deuvhjyznpkmstqrwx)
    8. bottom odd= left even (238967debc01fg45kmstqrwxuvhjyznp)
  3. Borders (direction type)
    1. right even: bcfguvyz
    2. left even: 0145hjnp
    3. top even: przx
    4. bottom even: 028b
    5. right odd= top even (przx)
    6. left odd= bottom even (028b)
    7. top odd= right even (bcfguvyz)
    8. bottom odd= left even (0145hjnp)
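  4. function calculateAdjacent(srcHashCode, direction) {
              // Normalize case so lookups match the lowercase tables above.
              srcHashCode = srcHashCode.toLowerCase();
              var lastCharacter = srcHashCode.charAt(srcHashCode.length - 1);
              // Odd-length codes use the "odd" tables, even-length codes the "even" tables.
              var type = (srcHashCode.length % 2) ? 'odd' : 'even';
              var base = srcHashCode.substring(0, srcHashCode.length - 1);

              // On a border cell the neighbor lies in the adjacent parent cell.
              // The base !== '' check stops the recursion at the edge of the grid.
              if (Borders[direction][type].indexOf(lastCharacter) != -1 && base !== '')
                       base = calculateAdjacent(base, direction);

              return base + BASE32[Neighbors[direction][type].indexOf(lastCharacter)];
     }

 Ex. find the right neighbor of a geohash code a
       calculateAdjacent(a, 'right');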
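
For readers who want to try the lookup tables directly, here is a minimal runnable sketch of the same algorithm in Python. The names calculate_adjacent, NEIGHBORS and BORDERS are chosen for this sketch; the tables are the ones listed above, and the behavior at the edge of the grid (wrapping around instead of recursing) is an assumption of this sketch.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

# Only the "even" tables are written out; the "odd" tables reuse them.
NEIGHBORS = {
    "right": {"even": "bc01fg45238967deuvhjyznpkmstqrwx"},
    "left": {"even": "238967debc01fg45kmstqrwxuvhjyznp"},
    "top": {"even": "p0r21436x8zb9dcf5h7kjnmqesgutwvy"},
    "bottom": {"even": "14365h7k9dcfesgujnmqp0r2twvyx8zb"},
}
BORDERS = {
    "right": {"even": "bcfguvyz"},
    "left": {"even": "0145hjnp"},
    "top": {"even": "przx"},
    "bottom": {"even": "028b"},
}
# Each "odd" table is the "even" table of another direction.
for odd_dir, even_dir in [("right", "top"), ("left", "bottom"),
                          ("top", "right"), ("bottom", "left")]:
    NEIGHBORS[odd_dir]["odd"] = NEIGHBORS[even_dir]["even"]
    BORDERS[odd_dir]["odd"] = BORDERS[even_dir]["even"]


def calculate_adjacent(geohash, direction):
    """Return the geohash of the same length next to `geohash`."""
    geohash = geohash.lower()
    if not geohash:
        raise ValueError("geohash must not be empty")
    last, parent = geohash[-1], geohash[:-1]
    kind = "odd" if len(geohash) % 2 else "even"
    # On a border cell the neighbor belongs to the adjacent parent cell.
    # With no parent left, the lookup below simply wraps around the grid.
    if last in BORDERS[direction][kind] and parent:
        parent = calculate_adjacent(parent, direction)
    return parent + BASE32[NEIGHBORS[direction][kind].index(last)]


print(calculate_adjacent("s", "right"))   # t
print(calculate_adjacent("ez", "right"))  # sp (crosses into the parent cell s)
```

Calling the function four times, once per direction, and then again on the left and right results with top and bottom, gives all eight surrounding cells.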

Tuesday, June 5, 2012

speed up st_within query in postgresql

1. Create index
CREATE INDEX idx_tablename_columnname ON tablename USING GIST(columnname);

This creates a GiST spatial index on the geometry column [columnname] of [tablename].  According to the PostGIS manual, the index stores the bounding box (bbox) of each geometry, so a query can cheaply discard candidates whose boxes do not match before it runs the exact geometric test.
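
To illustrate why a bounding box helps, the sketch below runs a cheap box check first and only falls back to the exact point-in-polygon test when the box matches. It is a conceptual illustration in Python, not how PostGIS is implemented; the function names are hypothetical, and the exact test is a simple ray-casting routine that ignores holes and boundary cases.

```python
def bbox(polygon):
    """Bounding box (min_x, min_y, max_x, max_y) of a list of (x, y) vertices."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return min(xs), min(ys), max(xs), max(ys)


def in_bbox(point, box):
    """Cheap test: four comparisons, independent of the vertex count."""
    x, y = point
    min_x, min_y, max_x, max_y = box
    return min_x <= x <= max_x and min_y <= y <= max_y


def in_polygon(point, polygon):
    """Exact test by ray casting; the cost grows with the number of vertices."""
    x, y = point
    inside = False
    for (x1, y1), (x2, y2) in zip(polygon, polygon[1:] + polygon[:1]):
        if (y1 > y) != (y2 > y):  # the edge crosses the horizontal ray
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def within(point, polygon, box):
    """Run the exact test only for points that pass the box filter."""
    return in_bbox(point, box) and in_polygon(point, polygon)


triangle = [(0, 0), (4, 0), (0, 4)]
box = bbox(triangle)  # computed once, like an index entry
print(within((1, 1), triangle, box))  # True
print(within((5, 5), triangle, box))  # False, rejected by the box alone
```
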

2. Cluster
CLUSTER tablename USING idx_tablename_columnname;

After the index is created, cluster the table on it so that rows that are close together in the index are stored close together on disk.

3. Optional: simplify the geometries.
SELECT ST_NPOINTS(geom_column) AS npoints FROM tablename ORDER BY npoints DESC LIMIT 25;

SELECT ST_SIMPLIFY(geom_column, number_scale) AS simpgeom FROM tablename;

Check the number of points in the 25 largest geometries.  If they have too many points, ST_Within queries will be slow.  If possible, try simplifying the geometries in the table; simplifying reduces the number of points in each geometry.
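
The PostGIS documentation describes ST_Simplify as using the Douglas-Peucker algorithm. As a rough illustration of what the tolerance argument does, here is a minimal Python sketch of that algorithm for a list of (x, y) points. The function names are hypothetical and the details may differ from the PostGIS implementation.

```python
import math


def point_segment_distance(p, a, b):
    """Distance from point p to the segment a-b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return math.hypot(px - ax, py - ay)
    # Projection of p onto the segment, clamped to its end points.
    t = ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def simplify(points, tolerance):
    """Drop points that lie within `tolerance` of the simplified line."""
    if len(points) < 3:
        return list(points)
    first, last = points[0], points[-1]
    # Find the interior point farthest from the segment first-last.
    index, farthest = max(
        enumerate(point_segment_distance(p, first, last) for p in points[1:-1]),
        key=lambda item: item[1],
    )
    index += 1  # enumerate started at the second point
    if farthest <= tolerance:
        return [first, last]
    # Keep the farthest point and simplify both halves around it.
    left = simplify(points[: index + 1], tolerance)
    right = simplify(points[index:], tolerance)
    return left[:-1] + right


line = [(0, 0), (1, 0.1), (2, 0), (3, 5), (4, 0)]
print(simplify(line, 0.5))  # [(0, 0), (2, 0), (3, 5), (4, 0)]
```

A larger tolerance removes more points, which speeds up the exact test but moves region boundaries, so points near a border may end up in a different region.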

4. Test
EXPLAIN UPDATE locationtable SET columnname = (
SELECT regiontable.columnname FROM regiontable
WHERE ST_WITHIN(locationtable.geom, regiontable.geom)
);

EXPLAIN shows the execution plan that the planner chose for the query, including whether an index is used.  From the plan you can judge whether the query is efficient enough.

5. Dissolve multi-polygon to polygon