dd: May 2012

Wednesday, May 30, 2012

R Postgresql

Reading from PostgreSQL DB

Install RPostgreSQL
:library(RPostgreSQL)
Open a db connection using the dbConnect function
:con<-dbConnect(PostgreSQL(),user="xxx", password="xxx",host="localhost", client.flag=CLIENT_MULTI_RESULTS)
Use dbGetQuery to initiate a SELECT and return the result sets
:sql<-"SELECT * FROM table WHERE name='xxx'"
:rows<- dbGetQuery(con,sql)
Use dbDisconnect to terminate the connection
:dbDisconnect(con)

Monday, May 28, 2012

tiger geocoder 2011

Settings

Ubuntu 12.04
Postgresql 9.1
Postgis 2.0
Tiger 2011 Data

Installing Postgresql + postgis2.0

sudo apt-get install python-software-properties
sudo apt-add-repository ppa:sharpie/for-science
sudo apt-add-repository ppa:sharpie/postgis-nightly
sudo apt-get update
sudo apt-get install postgresql-9.1-postgis
sudo apt-get install build-essential postgresql-9.1 postgresql-server-dev-9.1 libxml2-dev proj libjson0-dev xsltproc docbook-xsl docbook-mathml gettext postgresql-contrib-9.1 pgadmin3

Updating gdal and geos libraries

sudo apt-add-repository ppa:olivier-berten/geo
sudo apt-get update
sudo apt-get install libgdal-dev
sudo apt-get install g++ ruby ruby1.8-dev swig swig2.0 unzip
wget http://download.osgeo.org/geos/geos-3.3.3.tar.bz2
tar xvfj geos-3.3.3.tar.bz2
cd geos-3.3.3
./configure --enable-ruby --prefix=/usr
make
sudo make install

Building postgis2.0

wget http://postgis.refractions.net/download/postgis-2.0.0.tar.gz
tar zxvf postgis-2.0.0.tar.gz
cd postgis-2.0.0
./configure --with-gui
make
sudo make install
sudo ldconfig
sudo make comments-install
sudo ln -sf /usr/share/postgresql-common/pg_wrapper /usr/local/bin/shp2pgsql
sudo ln -sf /usr/share/postgresql-common/pg_wrapper /usr/local/bin/pgsql2shp
sudo ln -sf /usr/share/postgresql-common/pg_wrapper /usr/local/bin/raster2pgsql

Building spatial db template

createdb postgis2
psql -d postgis2
psql>create extension postgis;
psql>create extension postgis_topology;

Building geocoder

createdb -T postgis2 geocoder
wget http://postgis.refractions.net/download/postgis-2.1.0SVN.tar.gz
tar zxvf postgis-2.1SVN.tar.gz
EDIT postgis-2.1SVN/extra/tiger-geocoder/tiger2011/tiger_loader.sql
export PGBIN=/usr/bin
export PGPORT=5432
export PGHOST=localhost
export PGUSER=postgres
export PGPASSWORD=XXX
export PGDATABASE=geocoder
PSQL=${PGBIN}/psql
SHP2PGSQL=${PGBIN}/shp2pgsql
EDIT postgis-2.1SVN/extra/create_geocode.sh
PGPORT=5432
PGHOST=localhost
PGUSER=postgres
PGPASSWORD=XXX
THEDB=geocoder
PGSQL_CMD=/usr/bin/psql
PGCONTRIB=/usr/share/postgresql/9.1/contrib
//unmark the following two lines
${PGSQL_CMD} -d "${THEDB}" -c "CREATE EXTENSION fuzzystrmatch.sql"
${PGSQL_CMD} -d "${THEDB}" -c "ALTER DATABASE ${THEDB} SET search_path=public,tiger;"
sh ./create_geocode.sh

Downloading tiger data

sudo mkdir -p /gisdata/temp
sudo chown -R postgres:postgres /gisdata
psql -d geocoder
psql>\a
psql>\t
psql>\o nation.sh
psql>select loader_generate_nation_script('sh');
psql>\q
sh ./nation.sh
psql -d geocoder
psql>\a
psql>\t
psql>\o all_states.sh
psql>select loader_generate_script(ARRAY['AK','AR','AS',...],'sh');
psql>\q
sh ./all_states.sh
psql -d geocoder
psql>select install_missing_indexes();

Testing geocoder

psql -d geocoder
get geocode of "27 Waterview Drive, Shelton, CT 06484)psql>select g.rating,ST_X(geomout) AS lon,ST_Y(geomout) AS lat, (addy).* from geocode('27 Waterview Drive, Shelton, CT 06484') AS g;
get reverse-geocode of point(-73.0811694362766 41.2931353732512)psql>select pprint_addy(r.addy[1]) as st1, pprint_addy(r.addy[2]) as st2, pprint_addy(r.addy[3]) as st3, array_to_string(r.street,',') as cross_streets from reverse_geocode(st_geomfromtext('POINT(-73.0811694362766 41.2931353732512)',4269),true) as r;

Thursday, May 24, 2012

Why maven?
Ans: you want to manage jars in your projects. you want to manage dependencies and versions in your projects. you want to manage your project structures. you want to manage the building processes.

How setup maven
1. export M2_home=/directory/to/maven
export PATH=/directory/to/maven:${PATH}

How to use

Wednesday, May 23, 2012

R Spatial

1. Load library
library(rgdal)
library(maptools)
library(sp)
2. setwd("directory") #set default working directory

3.data<-read.table()

4. str(data)

5. coordinates(data) "LAt+Lon"

6. str(data)

7. proje4string(data) = CRS("+init=epsg:4326")

8. spplot()

XX: install rgdal in MacOS.
install.packages('rgdal',repos="http://www.stats.ox.ac.uk/pub/RWin")

Geohashing

Geohash function has been included in Postgis 2.0. To generate a geohash code, one can use "st_geohash(geom_column, [integer 1-20])".

After inserting/updating geohash code, remember to create b-tree index for the geohash codes.

Thursday, May 17, 2012

Algorithms in Mahout

From IBM:
1. Logistic Regression (Stochastic Gradient Descent)
2. Hidden Markov Models (HMM)
3. Dirichlet Clustering
4. Spectral Clustering
5. Minhash Clustering
6. Numerous Recommender Improvements
7. Collections

JVM tuning

Key JVM

-Xmx -Xmx2G maximum size of java heap.
-Xms -Xms1G minimum size of java heap.
-server/client defines type of jvm
-d32/-d64 specify the type of JVM
-XX:+NewRatio -XX:+NewRatio=12 only 1/12 of the heap will be reserved for short-lived objects.
-XX:+UseParallelGC use multi-cores or processor to do garbage collection in parallel or not.
-XX:+UseParallelOldGC

Add new Harddrive to ubuntu

Situation: the old /home directory is too small. I want to add a new hard drive and mount it as /home.

1. sudo fdisk -l to check whether ubuntu detects the new harddrive and identify the new harddrive identity (the one without mounting point, ex. sdb)

2. sudo fdisk /dev/sdb

press 'n' to create new partition.

press 'p' to add primary partition.

press '1' to name partition number.

enter the first and the last section

press 'L' to list available file sysetem type

choice '83' (linux)

press 'w' to write to the table and exit

3. sudo partprobe /dev/sdb

4. sudo mkfs /dev/sdb1 -t ext4

5. sudo mkdir /mnt/sdb1

6. sudo chmod 777 /mnt/sdb1

7. sudo mount /dev/sdb1 /mnt/sdb1 -t ext4

8. sudo cp -R /home /mnt/sdb1

9. sudo vi /etc/fstab

add "/dev/sdb1 /home ext4 defaults 0 0 to the file, save and quit

10. backup the home folder to other places, then delete all in /home

11. sudo mount -a

12. reboot and done.

Tuesday, May 8, 2012

PostGIS examples

From http://postgis.refractions.net/pipermail/postgis-users/2007-July/016358.html

SQL Commands for the Demonstration

Simple Spatial SQL

 create table points ( pt geometry, name varchar );
 insert into points values ( 'POINT(0 0)', 'Origin' );
 insert into points values ( 'POINT(5 0)', 'X Axis' );
 insert into points values ( 'POINT(0 5)', 'Y Axis' );
 select name, AsText(pt), ST_Distance(pt, 'POINT(5 5)') from points;
 drop table points;


Loading Shape Files

 cd \postgis-workshop\data
 pg_setenv
 shp2pgsql -i -s 3005 bc_pubs.shp bc_pubs > bc_pubs.sql
 notepad bc_pubs.sql
 pg_shpsql
 psql -U postgres -f bc_data.sql -d postgis
   
Creating Spatial Indexes

 create index bc_roads_gidx on bc_roads using gist ( the_geom );
 create index bc_pubs_gidx on bc_pubs using gist ( the_geom );
 create index bc_voting_areas_gidx on bc_voting_areas using gist ( the_geom );
 create index bc_municipality_gidx on bc_municipality using gist ( the_geom );
 create index bc_hospitals_gidx on bc_hospitals using gist ( the_geom );

Using Spatial Indexes

 -- Unindexed query
 select gid, name from bc_roads where _ST_Crosses(the_geom, ST_GeomFromText('LINESTRING(1220446 477473,1220417 477559)', 3005));
 -- Indexed query
 select gid, name from bc_roads where ST_Crosses(the_geom, ST_GeomFromText('LINESTRING(1220446 477473,1220417 477559)', 3005));

Indexes and Query Plans

 -- Unindexed query plan
 explain select gid, name from bc_roads where _ST_Crosses(the_geom, ST_GeomFromText('LINESTRING(1220446 477473,1220417 477559)', 3005));
 -- Indexed query plan
 explain select gid, name from bc_roads where ST_Crosses(the_geom, ST_GeomFromText('LINESTRING(1220446 477473,1220417 477559)', 3005));

PostgreSQL Optimization

  open postgresql.conf

Spatial Analysis in SQL

 -- What is the total length of all roads in the province, in kilometers?
 select sum(ST_Length(the_geom))/1000 as km_roads from bc_roads;

 -- How large is the city of Prince George, in hectares?
 select ST_Area(the_geom)/10000 as hectares from bc_municipality where name = 'PRINCE GEORGE';

 -- What is the largest municipality in the province, by area?
 select name, ST_Area(the_geom)/10000 as hectares from bc_municipality order by hectares desc limit 1;

Data Integrity

 -- How many invalid voting area polygons?
 select count(*) from bc_voting_areas where not ST_IsValid(the_geom);

Distance Queries

 -- Where is the Tabor Arms pub?
 select ST_AsText(the_geom) from bc_pubs where name ilike 'Tabor Arms%';

 -- How many Unity Party supporters live within 2km of Tabor Arms pub?
 select sum(upbc) as unity_voters from bc_voting_areas where ST_DWithin(the_geom, ST_GeomFromText('POINT(1209385 996204)', 3005), 2000);



Spatial Joins

 -- Find all pubs located within 250 meters of a hospital.
 select h.name, p.name from bc_hospitals h, bc_pubs p where ST_DWithin(h.the_geom, p.the_geom, 250);

 -- Summarize the 2000 provincial election results by municipality.
 select m.name, sum(v.ndp) as ndp, sum(v.lib) as liberal, sum(v.gp) as green, sum(v.upbc) as unity, sum(v.vtotal) as total from bc_voting_areas v, bc_municipality m where ST_Intersects(v.the_geom, m.the_geom) group by m.name order by m.name;

Overlays 

 -- Clip the voting areas by the Prince George boundary.
 create table pg_voting_areas as select ST_Intersection(v.the_geom, m.the_geom) as intersection_geom, ST_Area(v.the_geom) as va_area, v.*, m.name from bc_voting_areas v, bc_municipality m where ST_Intersects(v.the_geom, m.the_geom) and m.name = 'PRINCE GEORGE';

 -- What is the area of the clipped features?
 select sum(ST_Area(intersection_geom)) from pg_voting_areas;

 -- How does that compare to the clipping polygon? (Should be the same.)
 select ST_Area(the_geom) from bc_municipality where name = 'PRINCE GEORGE';

Coordinate Projection

 -- What is the proj4 definition of srid 3005?
 select proj4text from spatial_ref_sys where srid=3005;

 -- What are the native coordinate of the first road in the roads table?
 select ST_AsText(the_geom) from bc_roads limit 1;

 -- What do those coordinates look like in lon/lat?
 select ST_AsText(ST_Transform(the_geom,4326)) from bc_roads limit 1;

Exercises

 -- What is the perimeter of the municipality of Vancouver?
 select ST_Perimeter(the_geom) from bc_municipality where name = 'VANCOUVER';
 
 -- What is the total area of all voting areas in hectares?
 select sum(ST_Area(the_geom))/10000 as hectares from bc_voting_areas;
 
 -- What is the total area of all voting areas with more than 100 voters in them?
 select sum(ST_Area(the_geom))/10000 as hectares from bc_voting_areas where vtotal > 100;
 
 -- What is the length in kilometers of all roads named Douglas St?
 select sum(ST_Length(the_geom))/1000 as kilometers from bc_roads where name = 'Douglas St';

Advanced Exercises

 -- What is the length in kilometers of Douglas St in Victoria?
 select sum(ST_Length(r.the_geom))/1000 as kilometers from bc_roads r, bc_municipality m where ST_Contains(m.the_geom, r.the_geom) and r.name = 'Douglas St' and m.name = 'VICTORIA';
 
 -- What two pubs have the most Green Party supporters within 500 meters of them?
 select p.name, p.city, sum(v.gp) as greens from bc_pubs p, bc_voting_areas v where ST_DWithin(p.the_geom, v.the_geom, 500) group by p.name, p.city order by greens desc limit 2;

 -- What is the latitude of the most southerly hospital in the province?
 select ST_Y(ST_Transform(the_geom,4326)) as latitude from bc_hospitals order by latitude asc limit 1;

 -- What were the percentage NDP and Liberal vote within the city limits of Prince George in the 2000 provincial election?
 select 100*sum(v.ndp)/sum(v.vtotal) as ndp, 100*sum(v.lib)/sum(v.vtotal) as liberal from bc_voting_areas v, bc_municipality m where ST_Contains(m.the_geom, v.the_geom) and m.name = 'PRINCE GEORGE';

 -- What is the largest voting area polygon that has a hole?
 select gid, id, ST_Area(the_geom) as area from bc_voting_areas where ST_NRings(the_geom) > 1 order by area desc limit 1;

 -- How many NDP voters live within 50 meters of Simcoe St in Victoria?
 select sum(v.ndp) as ndp from bc_voting_areas v, bc_municipality m, bc_roads r where ST_Contains(m.the_geom, r.the_geom) and r.name = 'Simcoe St' and m.name = 'VICTORIA' and ST_DWithin(r.the_geom, v.the_geom, 50);

Wednesday, May 2, 2012

Hadoop Demo

Create MapReduce application:

a. specify input and output locations

Input/Output => <key,value>

key and value should implement "Writable", and key should implement "WritableComparable".

b. specify map and reduce functions

c. job client submit the job and configuration to JobTracker.

Example1: WordCount

public class WordCount{

public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable{

private final static IntWritable one = new IntWritable(1); //value

private Text word = new Text();//use word as key

public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException{

String line = value.toString();

StringTokenizer tokenizer = new StringTokenizer(line);

while(tokenizer.hasMoreTokens()){

word.set(tokenizer.nextToken()); //set key

output.collect(word,one); //value will always be one

}

public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable>{

public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException{

int sum=0

//for values with the same key, sum up the value (1) and output <key, sum>

while(values.hasNext()){

sum+=values.next().get();

}

output.collect(key,new IntWritable(sum));

}

public static void main(String[] args) throws Exception{

JobConf conf = new JobConf(WordCount.class);

conf.setJobName("wordcount");

conf.setOutputKeyClass(Text.class); //word

conf.setOutputValueClass(IntWritable.class);//sum

conf.setMapperClass(Map.class);

//conf.setOutputKeyComparatorClass(Comparator.class)); //control the grouping of output

conf.setCombinerClass(Reduce.class);

conf.setReducerClass(Reduce.class);

conf.setInputFormat(TextInputFormat.class);//textfile

conf.setOutputFormat(TextOutputFormat.class);//textfile

FileInputFormat.setInputPaths(conf, new Path(args[0]));//location of input file

FileOutputFormat.setOutputPath(conf, new Path(args[1]));//location of output file

JobClient.run(conf);

}

PostgreSQL+PostGIS cheat sheet

1. Distinct Count
solution 1:
select count(*) from (select distinct * from table where column='') as dis

2. Check redundant tuple
soluction 1 (self join)
SELECTp.place_id, p.href,s.place_id,s.href
FROM place p
INNER JOIN place s ON p.href=s.href
WHERE p.place_id &s s.place_id
ORDER BY p.place_id, s.place_id;

3. Materialized View (http://tech.jonathangardner.net/wiki/PostgreSQL/Materialized_Views)
a. snapshot (view definition is never changed)
1) create a table to store view info
CREATE TABLE matviews(
mv_name NAME NOT NULL PRIMARY KEY,
b_name NAME NOT NULL,
last_refresh TIMESTAMP WITH TIME ZONE
);

2)create a function to insert a row in Mat_VIEWS and create a materialized view
CREATE OR REPLACE FUNCTION create_matview(NAME,NAME)
RETURNS VOID
SECURITY DEFINER
LANGUAGE plpgsql AS '
DECLARE
matview ALIAS FOR $1;
view_name ALIAS FOR $2;
entry matviews%ROWTYPE;
BEGIN
SELECT * INTO entry FROM matviews WHERE mv_name=matview;

IF FOUND THEN
RAISE EXCEPTION ''Materialized view ''''%'''' already exists.'', matview;
END IF;

EXECUTE ''REVOKE ALL ON '' || view_name || ''FROM PUBLIC'';
EXECUTE "GRANT SELECT ON '' || view_name || '' TO PUBLIC'';
EXECUTE ''CREATE TABLE '' || matview|| AS SELECT * FROM ''||view_name;
EXECUTE ''REVOKE ALL ON '' || matview || ' FROM PUBLIC'';
EXECUTE ''GRANT SELECT ON '' || matview || '' TO PUBLIC'';

INSERT INTO matviews (mv_name, b_name,last_refresh)
VALUES(matview, view_name, CURRENT_TIMESTAMP);

RETURN;
END
';
3) create a function to drop a materialized view
CREATE OR REPLACE FUNCTION drop_matview(NAME) RETURN VOID
SECURITY DEFINER
LANGUAGE plpgsql AS '
DECLARE
matview ALIAS FOR $1;
entry matviews%ROWTYPE;
BEGIN
SELECT * INTO entry FROM matview WHERE mv_name=matview;
IF NOT FOUND THEN
RAISE EXCEPTION ''Materialized view % does not exist.'',matview;
END IF;
EXECUTE ''DROP TABLE'' || matview;
DELETE FROM matviews WHERE mv_name=matview;
RETURN;
END
';

4) create a function to refresh a materialized view
CREATE OR REPLACE FUNCTION refresh_matview(NAME) RETURNS VOID
SECURITY DEFINER
LANGUAGE plpgsql AS '
DECLARE
matview ALIAS $1;
entry matviews%ROWTYP;
BEGIN
SELECT * INTO entry FROM matviews WHERE mv_name=matview;
IF NOT FOUND THEN
RAISE EXCEPTION ''Materialized view % does not exist.'',matview;
END IF;
EXECUTE '' DELETE FROM '' || matview;
EXECUTE ''INSERT INTO ''||matview || '' SELECT * FROM '' || entry.b_name;
UPDATE matviews
SET last_refresh = CURRENT_TIMESTAMP
WHERE mv_name=matview;
RETURN;
END

';

b. eager (update whenever the base view is changed)
4. delete existing duplicate rows in a table

Before execute the following statement, make sure you have enough buffers for the session (refer to cheat 5 and 6).

solution1
DELETE FROM table USING table ta
WHERE table.key=ta.key AND table.id>ta.id;

solution2
DELETE FROM table
WHERE ctid NOT IN
(SELECT MAX(s.ctid) FROM table s GROUP BY s.column_has_be_distinct);

solution3
CREATE OR REPLACE FUNCTION remove_duplicates(text,text) RETURN VOID AS $$
DECLARE
tablename ALIAS FOR $1;
duplicate_column ALIAS FRO $2;
BEGIN
EXECUTE 'CREATE TEMPORARY TABLE TABLE_DISTINCT_' ||tablename||'
AS (SELECT DISTINCT ON ('||duplicate_column||') * FROM ' ||tablename||
ORDER BY '||duplicate_column||' ASC);';
EXECUTE 'DELETE FROM ' || tablename||';';
EXECUTE 'INSERT INTO ' || tablename||' (SELECT * FROM _DISTINCT_' || tablename||');';
EXECUTE 'DROP TABLE_DISTINCT_'|| tablename||';';
RETURN;
END;
$$LANGUAGE plpgsql;

solution4
CREATE TEMPORARY TABLE t_temp ON COMMIT DROP AS
SELECT DISTINCT ON ('column_name') * FROM table;
TRUNCATE table;
INSERT INTO table SELECT * FROM t_temp;

5. find out the size of table
solution:
SELECT pg_size_pretty(pg_relation_size('tablename'));

6. set temporary buffers
solution:
SET temp_buffers =' XXXMB';

7. delete table rows which some columns in that table (a) don't exist in the other table (b).
solution:
DELETE FROM a WHERE a.column_name NOT IN (SELECT column_name FROM b);

8. Backup
Solution1:
pg_dump
EX. login db at localhost as postgres and dump database poi2 into poi.backup.gz
pg_dump -u postgres -h localhost poi2 | gzip>poi.backup.gz

Solution2:
file system level backup
Ex.
tar -cf poi2.backup.tar /usr/local/pgsql/data
PS. The database server must be shutdown before doing file system level backup and restore.

Solutioin3:
Continuous archiving
EX.
a. enable wal in postgresql.conf
wal_level=archive
archive_mode=on
archive_command='test ! -f /mnt/server/archivedir/%f && cp %p /mnt/server/archivedir/%f'
b.SELECT pg_start_backup('label',ture);
c. perform file system level backup, like tar
d. SELECT pg_stop_backup();

touch /var/lib/pgsql/backup_in_progress
psql -c "select pg_start_backup('hot_backup');"
tar -cf /var/lib/pgsql/backup.tar /var/lib/pgsql/data/
psql -c "select pg_stop_backup();"
rm /var/lib/pgsql/backup_in_progress
tar -rf /var/lib/pgsql/backup.tar /var/lib/pgsql/archive/

Solution4:
pg_basebackup
EX.
a. enable local replicate permission for posgres @localhost
b. enable max_wal_senders=1
c. pg_basebackup -U postgres -D /backup/data -F tar -z -l label

9. PostGIS import shape files into one table

Solution 1:

shp2pgsql -s 915 -a[-c] -g geom shapfile.shp schema.table | psql -U username -W dbname

-s: srid, -g: geometry_column_name

10. Test if geometry column is valid
Solution 1:
select ST_IsValid(geom) from table;
Solution 2:
select ST_IsValidDetail(geom) from table;
Solution 3:
select ST_IsEmpty(geom) from table;
11. Test the geometry type
Solution 1:
select ST_GeometryType(geom) from table;

12. Mapping points in table a to a polygon in a table b
Solution1:
select * from table1
where st_within(geom, polygon);

13. get a boundary from a collection of geometries
Solution 1:
select st_extent(geom) from table;

14. Performance tuning:
Install pgtune (apt-get update; apt-get install pgtune) -> pgtune -i postgresql.conf -o psql.conf

create dedicate index for query (considering the order of fields, and the join condition)
Use multi-fields index
the order of fields in multi-fields index (A, B, C can be used to find AB, but not BC)
Vacuum the tables after delete, update, insert, or create index.
Order of fields and order of joint have a big impact on performance.
Explicit casting: "where x=3" v.s. "where x=cast(3 as smallint)", "abs(dest) = cast(111 as smallint)" v.s. "dest =111 or dest=-111" have huge different.
set enable_seqscan=no
Avoid nested loop, merge and sort is faster.
Use temp table
Add sequence in query to monitor the progress. select nextval('seq_progress'); select currval('seq_progress');
Enable stats_start_collector and stats_row_level to get auto vacuum
Partial index: create index index_name on table1(column_name) where column_nam like 'America%'
Specify sort_memory, temp_buffers for each connections or query.

dd