dd: 2012

Tuesday, August 14, 2012

JMS - glassfish

JMS allows applications to create, send, receive, and read messages. Using JMS, application can communicate with others loosely coupled, asynchronous, and reliable.

It is useful in the following use cases:
1) inventory component can send a message to factory component when inventory is low.
2) factory component can send a message to parts component to get parts assembled.
3) parts component can send message to inventory or supplied component to get parts ready.
4) factory and parts component can send message to accounting component to update their budget.
5) business component can update their info to sales force.

JMS architecture
1) A JMS provider (provided by Glassfish JEE container)
2) JMS clients (any JEE application component can be JMS client)
3) Messages (objects that communicate information beween JMS clients)
4) Administered objects (Connection factory and destinations)

Messaging domains
1) point-to-point domain contains queues, senders and receivers. each message has only one consumer. senders and receivers have no time-dependency. receivers will acknowledge the success of message processing.

2) publish-subscribe domain contains subscriber, publisher and topics. consumers subscribe to the topics they interested in. each message has multiple consumers. consumers and publishers are time-dependency relationships. consumers can only consume messages published after consumers' subscription.

Programming model
1) administered objects (provided by Glassfish): connection factory and destination.
2) connection factory( an object that clients used to connect to a jms provider)
Ex.
@Resource(lookup = "/jms/ConnectionFactory")
private static ConnectionFactory connectionFactory;

3) destination
Ex.
@Resource(looku="/jms/Queue")
private static Queue queue;

4) connections
Ex.
Connection connection = connectionFactory.createConnection();

5) sessions: sessions are used to create producers, consumers, messages, and queue browsers.
Ex.
Session session = connection.createSession(transactionType, session.Auto_acknowledge);

6) message producers
Ex.
MessageProducer producer = session.createProducer(dest);

7) message receivers
Ex.
MessageConsumer consumer = session.createConsumer(dest);

8) message listener
Ex.
Listener myListener = new Listener();
consumer.setMessageListener(myListerner);

9) messages (header, property and body)
MessageType: TextMessage (String object), MapMessage(name-value pairs), BytesMessage(undefined bytes), StreamMessage(primitive value stream), ObjectMessage (serializable object), Message(without body).

10) Queue Browser
Ex. QueueBrowser browser = session.createBrowser(queue);

Thursday, June 28, 2012

Running R script on AWS EMS

Install Amazon EMR Command Line Interface

Install Ruby (1.8 up)
Download and unzip CLI (http://aws.amazon.com/developertools/2264)
Configure credential.json

{
"access_id": "AWS Access Key ID",
"private_key":"AWS Secret Access Key",
"keypair": "EC2 keypair name",
"key-pair-file":"pem location",
"log_uri":"s3n://log-location",
"region":"us-east-1"
}

Job Flow Essentials

Creating a Job Flow (./elastic-mapreduce --create --alive)
Listing all Job Flow (./elastic-mapreduce --list
Retrieving information about a specific Job Flow (./elastic-mapreduce --describe --jobflow ID)
Adding a step using default parameter values to a Job Flow (./elastic-mapreduce -j ID --stream)
Terminating a Job Flow (./elastic-mapreduce --terminate ID)
Listing all active Job Flows (./elastic-mapreduce --list --active)

Streaming Job Flow

./elastic-mapreduce --create --stream \

--mapper s3n://[mapper-location]

--input s3n://[input-location]

--output s3n://[output-location]

--reducer s3n://[reducer-location]

Tuesday, June 19, 2012

geohash adjancent codes

An algorithm to find out neighbors of a geohash code.

Base32: 123456789bcdefghjkmnpqrstuvwxyyz
Neighbors (direction type)

right even: bc01fg45238967deuvhjyznpkmstqrwx
left even: 238967debc01fg45kmstqrwxuvhjyznp
top even: p0r21436x8zb9dcf5h7kjnmqesgutwvy
bottom even: 14365h7k9dcfesgujnmqp0r2twvyx8zb
righ odd= top even (p0r21436x8zb9dcf5h7kjnmqesgutwvy)
left odd= bottom even (14365h7k9dcfesgujnmqp0r2twvyx8zb)
top odd= right even (bc01fg45238967deuvhjyznpkmstqrwx)
bottom odd= left even (238967debc01fg45kmstqrwxuvhjyznp)

Borders (direction type)

right even: bcfguvyz
left even: 0145hjnp
top even: przx
bottom even: 028b
right odd= top even (przx)
left odd= bottom even (028b)
top odd= right even (bcfguvyz)
bottom odd= left even (0145hjnp)

function calculateAdjancent(String srcHashCode, String direction)

srcHashCode = srcHashCode.toLowerCase();

char lastCharacter=srcHashCode.charAt(srcHashCode.length-1);

String type=(srcHashCode.length%2)?'odd': 'even':

String base=srcHashCode.subString(0,srcHashCode.length-1);

if(Borders[direction][type].indexOf(lastCharacter) != -1)

base=calculateAdjancent(base,direction);

return base+BASE32[Neighbors[direction][type].indexOf(lastCharacter)];

Ex. find a's right neighbor

calculateAdjancent(a,right);

Tuesday, June 5, 2012

speed up st_within query in postgresql

1. Create index
CREATE INDEX idx_tablename_columnname ON tablename USING GIST(columnname);

This will create a spatial index for the geometry column [columnname] in the [tablename]. According to postgis manual, it creates the bbox for each geometry in the table to speed up the query.

2. Cluster
CLUSTER

After index created, cluster the table to arrange the similar data in to the same disk space.

3. Optional: simplify the geometries.
SELECT ST_NPOINTS(geom_column) AS npoints FROM tablename ORDER BY npoints DESC LIMIT 25;

SELECT ST_SIMPLIFY(geom_column, number_scale) AS simpgeom FROM tablename;

Check the number of points in the 25 biggest geometries. If they are too big, the speed of st_within query will be slow. You can try simplify the geometries in the table if possible. Simplifying means reduce the points in geometries.

4. Test
EXPLAIN UPDATE locationtable SET columnname = (
SELECT columnname FROM regiontable
WHERE ST_WITHIN(geom.locationtable , geom.regiontable)
)

Explain breaks down the query into plan and evaluate if an index is needed or not. Through the EXPLAIN, you can find out if the query is efficient enough.

5. Dissolve multi-polygon to polygon

Wednesday, May 30, 2012

R Postgresql

Reading from PostgreSQL DB

Install RPostgreSQL
:library(RPostgreSQL)
Open a db connection using the dbConnect function
:con<-dbConnect(PostgreSQL(),user="xxx", password="xxx",host="localhost", client.flag=CLIENT_MULTI_RESULTS)
Use dbGetQuery to initiate a SELECT and return the result sets
:sql<-"SELECT * FROM table WHERE name='xxx'"
:rows<- dbGetQuery(con,sql)
Use dbDisconnect to terminate the connection
:dbDisconnect(con)

Monday, May 28, 2012

tiger geocoder 2011

Settings

Ubuntu 12.04
Postgresql 9.1
Postgis 2.0
Tiger 2011 Data

Installing Postgresql + postgis2.0

sudo apt-get install python-software-properties
sudo apt-add-repository ppa:sharpie/for-science
sudo apt-add-repository ppa:sharpie/postgis-nightly
sudo apt-get update
sudo apt-get install postgresql-9.1-postgis
sudo apt-get install build-essential postgresql-9.1 postgresql-server-dev-9.1 libxml2-dev proj libjson0-dev xsltproc docbook-xsl docbook-mathml gettext postgresql-contrib-9.1 pgadmin3

Updating gdal and geos libraries

sudo apt-add-repository ppa:olivier-berten/geo
sudo apt-get update
sudo apt-get install libgdal-dev
sudo apt-get install g++ ruby ruby1.8-dev swig swig2.0 unzip
wget http://download.osgeo.org/geos/geos-3.3.3.tar.bz2
tar xvfj geos-3.3.3.tar.bz2
cd geos-3.3.3
./configure --enable-ruby --prefix=/usr
make
sudo make install

Building postgis2.0

wget http://postgis.refractions.net/download/postgis-2.0.0.tar.gz
tar zxvf postgis-2.0.0.tar.gz
cd postgis-2.0.0
./configure --with-gui
make
sudo make install
sudo ldconfig
sudo make comments-install
sudo ln -sf /usr/share/postgresql-common/pg_wrapper /usr/local/bin/shp2pgsql
sudo ln -sf /usr/share/postgresql-common/pg_wrapper /usr/local/bin/pgsql2shp
sudo ln -sf /usr/share/postgresql-common/pg_wrapper /usr/local/bin/raster2pgsql

Building spatial db template

createdb postgis2
psql -d postgis2
psql>create extension postgis;
psql>create extension postgis_topology;

Building geocoder

createdb -T postgis2 geocoder
wget http://postgis.refractions.net/download/postgis-2.1.0SVN.tar.gz
tar zxvf postgis-2.1SVN.tar.gz
EDIT postgis-2.1SVN/extra/tiger-geocoder/tiger2011/tiger_loader.sql
export PGBIN=/usr/bin
export PGPORT=5432
export PGHOST=localhost
export PGUSER=postgres
export PGPASSWORD=XXX
export PGDATABASE=geocoder
PSQL=${PGBIN}/psql
SHP2PGSQL=${PGBIN}/shp2pgsql
EDIT postgis-2.1SVN/extra/create_geocode.sh
PGPORT=5432
PGHOST=localhost
PGUSER=postgres
PGPASSWORD=XXX
THEDB=geocoder
PGSQL_CMD=/usr/bin/psql
PGCONTRIB=/usr/share/postgresql/9.1/contrib
//unmark the following two lines
${PGSQL_CMD} -d "${THEDB}" -c "CREATE EXTENSION fuzzystrmatch.sql"
${PGSQL_CMD} -d "${THEDB}" -c "ALTER DATABASE ${THEDB} SET search_path=public,tiger;"
sh ./create_geocode.sh

Downloading tiger data

sudo mkdir -p /gisdata/temp
sudo chown -R postgres:postgres /gisdata
psql -d geocoder
psql>\a
psql>\t
psql>\o nation.sh
psql>select loader_generate_nation_script('sh');
psql>\q
sh ./nation.sh
psql -d geocoder
psql>\a
psql>\t
psql>\o all_states.sh
psql>select loader_generate_script(ARRAY['AK','AR','AS',...],'sh');
psql>\q
sh ./all_states.sh
psql -d geocoder
psql>select install_missing_indexes();

Testing geocoder

psql -d geocoder
get geocode of "27 Waterview Drive, Shelton, CT 06484)psql>select g.rating,ST_X(geomout) AS lon,ST_Y(geomout) AS lat, (addy).* from geocode('27 Waterview Drive, Shelton, CT 06484') AS g;
get reverse-geocode of point(-73.0811694362766 41.2931353732512)psql>select pprint_addy(r.addy[1]) as st1, pprint_addy(r.addy[2]) as st2, pprint_addy(r.addy[3]) as st3, array_to_string(r.street,',') as cross_streets from reverse_geocode(st_geomfromtext('POINT(-73.0811694362766 41.2931353732512)',4269),true) as r;

Thursday, May 24, 2012

Maven

Why maven?
Ans: you want to manage jars in your projects. you want to manage dependencies and versions in your projects. you want to manage your project structures. you want to manage the building processes.

How setup maven
1. export M2_home=/directory/to/maven
export PATH=/directory/to/maven:${PATH}

How to use

Wednesday, May 23, 2012

R Spatial

1. Load library
library(rgdal)
library(maptools)
library(sp)
2. setwd("directory") #set default working directory

3.data<-read.table()

4. str(data)

5. coordinates(data) "LAt+Lon"

6. str(data)

7. proje4string(data) = CRS("+init=epsg:4326")

8. spplot()

XX: install rgdal in MacOS.
install.packages('rgdal',repos="http://www.stats.ox.ac.uk/pub/RWin")

Geohashing

Geohash function has been included in Postgis 2.0. To generate a geohash code, one can use "st_geohash(geom_column, [integer 1-20])".

After inserting/updating geohash code, remember to create b-tree index for the geohash codes.

Thursday, May 17, 2012

Algorithms in Mahout

From IBM:
1. Logistic Regression (Stochastic Gradient Descent)
2. Hidden Markov Models (HMM)
3. Dirichlet Clustering
4. Spectral Clustering
5. Minhash Clustering
6. Numerous Recommender Improvements
7. Collections

JVM tuning

Key JVM

-Xmx -Xmx2G maximum size of java heap.
-Xms -Xms1G minimum size of java heap.
-server/client defines type of jvm
-d32/-d64 specify the type of JVM
-XX:+NewRatio -XX:+NewRatio=12 only 1/12 of the heap will be reserved for short-lived objects.
-XX:+UseParallelGC use multi-cores or processor to do garbage collection in parallel or not.
-XX:+UseParallelOldGC

Add new Harddrive to ubuntu

Situation: the old /home directory is too small. I want to add a new hard drive and mount it as /home.

1. sudo fdisk -l to check whether ubuntu detects the new harddrive and identify the new harddrive identity (the one without mounting point, ex. sdb)

2. sudo fdisk /dev/sdb

press 'n' to create new partition.

press 'p' to add primary partition.

press '1' to name partition number.

enter the first and the last section

press 'L' to list available file sysetem type

choice '83' (linux)

press 'w' to write to the table and exit

3. sudo partprobe /dev/sdb

4. sudo mkfs /dev/sdb1 -t ext4

5. sudo mkdir /mnt/sdb1

6. sudo chmod 777 /mnt/sdb1

7. sudo mount /dev/sdb1 /mnt/sdb1 -t ext4

8. sudo cp -R /home /mnt/sdb1

9. sudo vi /etc/fstab

add "/dev/sdb1 /home ext4 defaults 0 0 to the file, save and quit

10. backup the home folder to other places, then delete all in /home

11. sudo mount -a

12. reboot and done.

Tuesday, May 8, 2012

PostGIS examples

From http://postgis.refractions.net/pipermail/postgis-users/2007-July/016358.html

SQL Commands for the Demonstration

Simple Spatial SQL

 create table points ( pt geometry, name varchar );
 insert into points values ( 'POINT(0 0)', 'Origin' );
 insert into points values ( 'POINT(5 0)', 'X Axis' );
 insert into points values ( 'POINT(0 5)', 'Y Axis' );
 select name, AsText(pt), ST_Distance(pt, 'POINT(5 5)') from points;
 drop table points;


Loading Shape Files

 cd \postgis-workshop\data
 pg_setenv
 shp2pgsql -i -s 3005 bc_pubs.shp bc_pubs > bc_pubs.sql
 notepad bc_pubs.sql
 pg_shpsql
 psql -U postgres -f bc_data.sql -d postgis
   
Creating Spatial Indexes

 create index bc_roads_gidx on bc_roads using gist ( the_geom );
 create index bc_pubs_gidx on bc_pubs using gist ( the_geom );
 create index bc_voting_areas_gidx on bc_voting_areas using gist ( the_geom );
 create index bc_municipality_gidx on bc_municipality using gist ( the_geom );
 create index bc_hospitals_gidx on bc_hospitals using gist ( the_geom );

Using Spatial Indexes

 -- Unindexed query
 select gid, name from bc_roads where _ST_Crosses(the_geom, ST_GeomFromText('LINESTRING(1220446 477473,1220417 477559)', 3005));
 -- Indexed query
 select gid, name from bc_roads where ST_Crosses(the_geom, ST_GeomFromText('LINESTRING(1220446 477473,1220417 477559)', 3005));

Indexes and Query Plans

 -- Unindexed query plan
 explain select gid, name from bc_roads where _ST_Crosses(the_geom, ST_GeomFromText('LINESTRING(1220446 477473,1220417 477559)', 3005));
 -- Indexed query plan
 explain select gid, name from bc_roads where ST_Crosses(the_geom, ST_GeomFromText('LINESTRING(1220446 477473,1220417 477559)', 3005));

PostgreSQL Optimization

  open postgresql.conf

Spatial Analysis in SQL

 -- What is the total length of all roads in the province, in kilometers?
 select sum(ST_Length(the_geom))/1000 as km_roads from bc_roads;

 -- How large is the city of Prince George, in hectares?
 select ST_Area(the_geom)/10000 as hectares from bc_municipality where name = 'PRINCE GEORGE';

 -- What is the largest municipality in the province, by area?
 select name, ST_Area(the_geom)/10000 as hectares from bc_municipality order by hectares desc limit 1;

Data Integrity

 -- How many invalid voting area polygons?
 select count(*) from bc_voting_areas where not ST_IsValid(the_geom);

Distance Queries

 -- Where is the Tabor Arms pub?
 select ST_AsText(the_geom) from bc_pubs where name ilike 'Tabor Arms%';

 -- How many Unity Party supporters live within 2km of Tabor Arms pub?
 select sum(upbc) as unity_voters from bc_voting_areas where ST_DWithin(the_geom, ST_GeomFromText('POINT(1209385 996204)', 3005), 2000);



Spatial Joins

 -- Find all pubs located within 250 meters of a hospital.
 select h.name, p.name from bc_hospitals h, bc_pubs p where ST_DWithin(h.the_geom, p.the_geom, 250);

 -- Summarize the 2000 provincial election results by municipality.
 select m.name, sum(v.ndp) as ndp, sum(v.lib) as liberal, sum(v.gp) as green, sum(v.upbc) as unity, sum(v.vtotal) as total from bc_voting_areas v, bc_municipality m where ST_Intersects(v.the_geom, m.the_geom) group by m.name order by m.name;

Overlays 

 -- Clip the voting areas by the Prince George boundary.
 create table pg_voting_areas as select ST_Intersection(v.the_geom, m.the_geom) as intersection_geom, ST_Area(v.the_geom) as va_area, v.*, m.name from bc_voting_areas v, bc_municipality m where ST_Intersects(v.the_geom, m.the_geom) and m.name = 'PRINCE GEORGE';

 -- What is the area of the clipped features?
 select sum(ST_Area(intersection_geom)) from pg_voting_areas;

 -- How does that compare to the clipping polygon? (Should be the same.)
 select ST_Area(the_geom) from bc_municipality where name = 'PRINCE GEORGE';

Coordinate Projection

 -- What is the proj4 definition of srid 3005?
 select proj4text from spatial_ref_sys where srid=3005;

 -- What are the native coordinate of the first road in the roads table?
 select ST_AsText(the_geom) from bc_roads limit 1;

 -- What do those coordinates look like in lon/lat?
 select ST_AsText(ST_Transform(the_geom,4326)) from bc_roads limit 1;

Exercises

 -- What is the perimeter of the municipality of Vancouver?
 select ST_Perimeter(the_geom) from bc_municipality where name = 'VANCOUVER';
 
 -- What is the total area of all voting areas in hectares?
 select sum(ST_Area(the_geom))/10000 as hectares from bc_voting_areas;
 
 -- What is the total area of all voting areas with more than 100 voters in them?
 select sum(ST_Area(the_geom))/10000 as hectares from bc_voting_areas where vtotal > 100;
 
 -- What is the length in kilometers of all roads named Douglas St?
 select sum(ST_Length(the_geom))/1000 as kilometers from bc_roads where name = 'Douglas St';

Advanced Exercises

 -- What is the length in kilometers of Douglas St in Victoria?
 select sum(ST_Length(r.the_geom))/1000 as kilometers from bc_roads r, bc_municipality m where ST_Contains(m.the_geom, r.the_geom) and r.name = 'Douglas St' and m.name = 'VICTORIA';
 
 -- What two pubs have the most Green Party supporters within 500 meters of them?
 select p.name, p.city, sum(v.gp) as greens from bc_pubs p, bc_voting_areas v where ST_DWithin(p.the_geom, v.the_geom, 500) group by p.name, p.city order by greens desc limit 2;

 -- What is the latitude of the most southerly hospital in the province?
 select ST_Y(ST_Transform(the_geom,4326)) as latitude from bc_hospitals order by latitude asc limit 1;

 -- What were the percentage NDP and Liberal vote within the city limits of Prince George in the 2000 provincial election?
 select 100*sum(v.ndp)/sum(v.vtotal) as ndp, 100*sum(v.lib)/sum(v.vtotal) as liberal from bc_voting_areas v, bc_municipality m where ST_Contains(m.the_geom, v.the_geom) and m.name = 'PRINCE GEORGE';

 -- What is the largest voting area polygon that has a hole?
 select gid, id, ST_Area(the_geom) as area from bc_voting_areas where ST_NRings(the_geom) > 1 order by area desc limit 1;

 -- How many NDP voters live within 50 meters of Simcoe St in Victoria?
 select sum(v.ndp) as ndp from bc_voting_areas v, bc_municipality m, bc_roads r where ST_Contains(m.the_geom, r.the_geom) and r.name = 'Simcoe St' and m.name = 'VICTORIA' and ST_DWithin(r.the_geom, v.the_geom, 50);

Wednesday, May 2, 2012

Hadoop Demo

Create MapReduce application:

a. specify input and output locations

Input/Output => <key,value>

key and value should implement "Writable", and key should implement "WritableComparable".

b. specify map and reduce functions

c. job client submit the job and configuration to JobTracker.

Example1: WordCount

public class WordCount{

public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable{

private final static IntWritable one = new IntWritable(1); //value

private Text word = new Text();//use word as key

public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException{

String line = value.toString();

StringTokenizer tokenizer = new StringTokenizer(line);

while(tokenizer.hasMoreTokens()){

word.set(tokenizer.nextToken()); //set key

output.collect(word,one); //value will always be one

}

public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable>{

public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException{

int sum=0

//for values with the same key, sum up the value (1) and output <key, sum>

while(values.hasNext()){

sum+=values.next().get();

}

output.collect(key,new IntWritable(sum));

}

public static void main(String[] args) throws Exception{

JobConf conf = new JobConf(WordCount.class);

conf.setJobName("wordcount");

conf.setOutputKeyClass(Text.class); //word

conf.setOutputValueClass(IntWritable.class);//sum

conf.setMapperClass(Map.class);

//conf.setOutputKeyComparatorClass(Comparator.class)); //control the grouping of output

conf.setCombinerClass(Reduce.class);

conf.setReducerClass(Reduce.class);

conf.setInputFormat(TextInputFormat.class);//textfile

conf.setOutputFormat(TextOutputFormat.class);//textfile

FileInputFormat.setInputPaths(conf, new Path(args[0]));//location of input file

FileOutputFormat.setOutputPath(conf, new Path(args[1]));//location of output file

JobClient.run(conf);

}

PostgreSQL+PostGIS cheat sheet

1. Distinct Count
solution 1:
select count(*) from (select distinct * from table where column='') as dis

2. Check redundant tuple
soluction 1 (self join)
SELECTp.place_id, p.href,s.place_id,s.href
FROM place p
INNER JOIN place s ON p.href=s.href
WHERE p.place_id &s s.place_id
ORDER BY p.place_id, s.place_id;

3. Materialized View (http://tech.jonathangardner.net/wiki/PostgreSQL/Materialized_Views)
a. snapshot (view definition is never changed)
1) create a table to store view info
CREATE TABLE matviews(
mv_name NAME NOT NULL PRIMARY KEY,
b_name NAME NOT NULL,
last_refresh TIMESTAMP WITH TIME ZONE
);

2)create a function to insert a row in Mat_VIEWS and create a materialized view
CREATE OR REPLACE FUNCTION create_matview(NAME,NAME)
RETURNS VOID
SECURITY DEFINER
LANGUAGE plpgsql AS '
DECLARE
matview ALIAS FOR $1;
view_name ALIAS FOR $2;
entry matviews%ROWTYPE;
BEGIN
SELECT * INTO entry FROM matviews WHERE mv_name=matview;

IF FOUND THEN
RAISE EXCEPTION ''Materialized view ''''%'''' already exists.'', matview;
END IF;

EXECUTE ''REVOKE ALL ON '' || view_name || ''FROM PUBLIC'';
EXECUTE "GRANT SELECT ON '' || view_name || '' TO PUBLIC'';
EXECUTE ''CREATE TABLE '' || matview|| AS SELECT * FROM ''||view_name;
EXECUTE ''REVOKE ALL ON '' || matview || ' FROM PUBLIC'';
EXECUTE ''GRANT SELECT ON '' || matview || '' TO PUBLIC'';

INSERT INTO matviews (mv_name, b_name,last_refresh)
VALUES(matview, view_name, CURRENT_TIMESTAMP);

RETURN;
END
';
3) create a function to drop a materialized view
CREATE OR REPLACE FUNCTION drop_matview(NAME) RETURN VOID
SECURITY DEFINER
LANGUAGE plpgsql AS '
DECLARE
matview ALIAS FOR $1;
entry matviews%ROWTYPE;
BEGIN
SELECT * INTO entry FROM matview WHERE mv_name=matview;
IF NOT FOUND THEN
RAISE EXCEPTION ''Materialized view % does not exist.'',matview;
END IF;
EXECUTE ''DROP TABLE'' || matview;
DELETE FROM matviews WHERE mv_name=matview;
RETURN;
END
';

4) create a function to refresh a materialized view
CREATE OR REPLACE FUNCTION refresh_matview(NAME) RETURNS VOID
SECURITY DEFINER
LANGUAGE plpgsql AS '
DECLARE
matview ALIAS $1;
entry matviews%ROWTYP;
BEGIN
SELECT * INTO entry FROM matviews WHERE mv_name=matview;
IF NOT FOUND THEN
RAISE EXCEPTION ''Materialized view % does not exist.'',matview;
END IF;
EXECUTE '' DELETE FROM '' || matview;
EXECUTE ''INSERT INTO ''||matview || '' SELECT * FROM '' || entry.b_name;
UPDATE matviews
SET last_refresh = CURRENT_TIMESTAMP
WHERE mv_name=matview;
RETURN;
END

';

b. eager (update whenever the base view is changed)
4. delete existing duplicate rows in a table

Before execute the following statement, make sure you have enough buffers for the session (refer to cheat 5 and 6).

solution1
DELETE FROM table USING table ta
WHERE table.key=ta.key AND table.id>ta.id;

solution2
DELETE FROM table
WHERE ctid NOT IN
(SELECT MAX(s.ctid) FROM table s GROUP BY s.column_has_be_distinct);

solution3
CREATE OR REPLACE FUNCTION remove_duplicates(text,text) RETURN VOID AS $$
DECLARE
tablename ALIAS FOR $1;
duplicate_column ALIAS FRO $2;
BEGIN
EXECUTE 'CREATE TEMPORARY TABLE TABLE_DISTINCT_' ||tablename||'
AS (SELECT DISTINCT ON ('||duplicate_column||') * FROM ' ||tablename||
ORDER BY '||duplicate_column||' ASC);';
EXECUTE 'DELETE FROM ' || tablename||';';
EXECUTE 'INSERT INTO ' || tablename||' (SELECT * FROM _DISTINCT_' || tablename||');';
EXECUTE 'DROP TABLE_DISTINCT_'|| tablename||';';
RETURN;
END;
$$LANGUAGE plpgsql;

solution4
CREATE TEMPORARY TABLE t_temp ON COMMIT DROP AS
SELECT DISTINCT ON ('column_name') * FROM table;
TRUNCATE table;
INSERT INTO table SELECT * FROM t_temp;

5. find out the size of table
solution:
SELECT pg_size_pretty(pg_relation_size('tablename'));

6. set temporary buffers
solution:
SET temp_buffers =' XXXMB';

7. delete table rows which some columns in that table (a) don't exist in the other table (b).
solution:
DELETE FROM a WHERE a.column_name NOT IN (SELECT column_name FROM b);

8. Backup
Solution1:
pg_dump
EX. login db at localhost as postgres and dump database poi2 into poi.backup.gz
pg_dump -u postgres -h localhost poi2 | gzip>poi.backup.gz

Solution2:
file system level backup
Ex.
tar -cf poi2.backup.tar /usr/local/pgsql/data
PS. The database server must be shutdown before doing file system level backup and restore.

Solutioin3:
Continuous archiving
EX.
a. enable wal in postgresql.conf
wal_level=archive
archive_mode=on
archive_command='test ! -f /mnt/server/archivedir/%f && cp %p /mnt/server/archivedir/%f'
b.SELECT pg_start_backup('label',ture);
c. perform file system level backup, like tar
d. SELECT pg_stop_backup();

touch /var/lib/pgsql/backup_in_progress
psql -c "select pg_start_backup('hot_backup');"
tar -cf /var/lib/pgsql/backup.tar /var/lib/pgsql/data/
psql -c "select pg_stop_backup();"
rm /var/lib/pgsql/backup_in_progress
tar -rf /var/lib/pgsql/backup.tar /var/lib/pgsql/archive/

Solution4:
pg_basebackup
EX.
a. enable local replicate permission for posgres @localhost
b. enable max_wal_senders=1
c. pg_basebackup -U postgres -D /backup/data -F tar -z -l label

9. PostGIS import shape files into one table

Solution 1:

shp2pgsql -s 915 -a[-c] -g geom shapfile.shp schema.table | psql -U username -W dbname

-s: srid, -g: geometry_column_name

10. Test if geometry column is valid
Solution 1:
select ST_IsValid(geom) from table;
Solution 2:
select ST_IsValidDetail(geom) from table;
Solution 3:
select ST_IsEmpty(geom) from table;
11. Test the geometry type
Solution 1:
select ST_GeometryType(geom) from table;

12. Mapping points in table a to a polygon in a table b
Solution1:
select * from table1
where st_within(geom, polygon);

13. get a boundary from a collection of geometries
Solution 1:
select st_extent(geom) from table;

14. Performance tuning:
Install pgtune (apt-get update; apt-get install pgtune) -> pgtune -i postgresql.conf -o psql.conf

create dedicate index for query (considering the order of fields, and the join condition)
Use multi-fields index
the order of fields in multi-fields index (A, B, C can be used to find AB, but not BC)
Vacuum the tables after delete, update, insert, or create index.
Order of fields and order of joint have a big impact on performance.
Explicit casting: "where x=3" v.s. "where x=cast(3 as smallint)", "abs(dest) = cast(111 as smallint)" v.s. "dest =111 or dest=-111" have huge different.
set enable_seqscan=no
Avoid nested loop, merge and sort is faster.
Use temp table
Add sequence in query to monitor the progress. select nextval('seq_progress'); select currval('seq_progress');
Enable stats_start_collector and stats_row_level to get auto vacuum
Partial index: create index index_name on table1(column_name) where column_nam like 'America%'
Specify sort_memory, temp_buffers for each connections or query.

Wednesday, April 25, 2012

Monday, April 23, 2012

MongoDB cheatsheet

Import Data
//import SourceFile (csv file) into DataBaseName CollectionName with firstline as header
mongoimport -d DataBaseName -c CollectionName --type csv --file SourceFile --headerline

//import file,xxx.txt into foo in Collection bar without headerinfo(have to specify the fieldnames)
mongoimport -d foo -c bar --type csv --file xxx.txt -f id,timestamp,latitude,longitude,place

mongoimport -d foo -c bar --type tsv --file xxx.txt -f id,timestamp,latitude,longitude,place

Create Index

db.[collection-name].ensuerIndex({[tag-name]: 1})

Create Spatial Index

db.[collection-name].ensureIndex({[tag-name]:"2d"})

Python SimpleHTTPServer

Type the following in the direction which you want to share.

python -m SimpleHTTPServer 9999

or

python -m http.server 8001

//this command create a simple http server on local port 9999 using SimpleHTTPServer module.

Then other users can get the files in the directory using a browser.

Monday, April 16, 2012

Spatial data mining procedures

Procedure
1. import json files
2. build index
3. build foreign keys
4. build tiles
5. build neighborhood index (orignal oid, target oid, distance, direction, topology)
6. spatial data mining tools

Platforms
1. PostgreSQL + postgis (open source solution)
2. MongoDB
3. MS-SQL
4. Hadoop

Input
1. Place_dump_US
2. Checkin

Output
1. OpenLayers + Geoserver (visualization)
2. Patterns (representation)

Sunday, April 15, 2012

Hadoop Clusters Setup

Pre-setup
1. install jdk/jre1.6 or up
2. install ssh
a. create master sshkey
ssh-keygen -t dsa -P ""
cat id_dsa.pub>>authorized_keys
b. copy master public key to slaves
scp id_dsa.pub slaveN:~/.ssh/master.pub
c. add master pub key to authorized_keys
cat master.pub>>authorized_keys
d. from master, ssh to slaveN and check if a passphrase is needed.
3. edit /etc/hosts & /etc/hostname

Setup
1. setup env.xml (export JAVA_HOME)

2. core-site (specify name node and jobtracker) --for master & slaves

fs.default.name
hdfs://master

3. hdfs-site.xml (data node) --for master & slave

dfs.name.dir
/home/hduser/hddata/name

dfs.data.dir
/home/hduser/hddata/data

4. mapred-site.xml (jobtracker) --for master & slaves

mapred.job.tracker
master:54311

5. list all slaves to conf/slaves --for master/jobtracker only

6. chmod g-w to all data and name directories

** start-dfs.sh will consult slaves on name-node and start all data nodes on slaves.
** start-mapred.sh will consult salves on job-tracker-node and start all task-trackers on slaves.

Startup
1. execute "hadoop namenode -format" on name node site

2. execute "start-dfs.sh" on name node site

3. execute "start-mapred.sh" on job tracker site

Shutdown
1. execute "stop-mapred.sh" on job tracker site

2. execute "stop-dfs.sh" on name node site

Friday, March 30, 2012

Argument list is too long

Situation: when I try to ls or rm a "big" folder, mac terminal responds "Argument list is too long".

Solution1: use xargs
ex.
ls chunk*.geojson|xargs rm
find . -type f -name chunk*.geojson | xargs rm
//result: doesn't work in Mac OS

Solution 2: perl
ex.
perl -e 'for(<./chunk*.geojson>){unlink}'
//this one works in mac os.

Solusion 3: shellscript