JMS allows applications to create, send, receive, and read messages. Using JMS, application can communicate with others loosely coupled, asynchronous, and reliable.
It is useful in the following use cases:
1) inventory component can send a message to factory component when inventory is low.
2) factory component can send a message to parts component to get parts assembled.
3) parts component can send message to inventory or supplied component to get parts ready.
4) factory and parts component can send message to accounting component to update their budget.
5) business component can update their info to sales force.
JMS architecture
1) A JMS provider (provided by Glassfish JEE container)
2) JMS clients (any JEE application component can be JMS client)
3) Messages (objects that communicate information beween JMS clients)
4) Administered objects (Connection factory and destinations)
Messaging domains
1) point-to-point domain contains queues, senders and receivers. each message has only one consumer. senders and receivers have no time-dependency. receivers will acknowledge the success of message processing.
2) publish-subscribe domain contains subscriber, publisher and topics. consumers subscribe to the topics they interested in. each message has multiple consumers. consumers and publishers are time-dependency relationships. consumers can only consume messages published after consumers' subscription.
Programming model
1) administered objects (provided by Glassfish): connection factory and destination.
2) connection factory( an object that clients used to connect to a jms provider)
Ex.
@Resource(lookup = "/jms/ConnectionFactory")
private static ConnectionFactory connectionFactory;
3) destination
Ex.
@Resource(looku="/jms/Queue")
private static Queue queue;
4) connections
Ex.
Connection connection = connectionFactory.createConnection();
5) sessions: sessions are used to create producers, consumers, messages, and queue browsers.
Ex.
Session session = connection.createSession(transactionType, session.Auto_acknowledge);
6) message producers
Ex.
MessageProducer producer = session.createProducer(dest);
7) message receivers
Ex.
MessageConsumer consumer = session.createConsumer(dest);
8) message listener
Ex.
Listener myListener = new Listener();
consumer.setMessageListener(myListerner);
9) messages (header, property and body)
MessageType: TextMessage (String object), MapMessage(name-value pairs), BytesMessage(undefined bytes), StreamMessage(primitive value stream), ObjectMessage (serializable object), Message(without body).
10) Queue Browser
Ex. QueueBrowser browser = session.createBrowser(queue);
Tuesday, August 14, 2012
Thursday, June 28, 2012
Running R script on AWS EMS
- Install Amazon EMR Command Line Interface
- Install Ruby (1.8 up)
- Download and unzip CLI (http://aws.amazon.com/developertools/2264)
- Configure credential.json
"access_id": "AWS Access Key ID",
"private_key":"AWS Secret Access Key",
"keypair": "EC2 keypair name",
"key-pair-file":"pem location",
"log_uri":"s3n://log-location",
"region":"us-east-1"
}
- Job Flow Essentials
- Creating a Job Flow (./elastic-mapreduce --create --alive)
- Listing all Job Flow (./elastic-mapreduce --list
- Retrieving information about a specific Job Flow (./elastic-mapreduce --describe --jobflow ID)
- Adding a step using default parameter values to a Job Flow (./elastic-mapreduce -j ID --stream)
- Terminating a Job Flow (./elastic-mapreduce --terminate ID)
- Listing all active Job Flows (./elastic-mapreduce --list --active)
- Streaming Job Flow
./elastic-mapreduce --create --stream \
--mapper s3n://[mapper-location]
--input s3n://[input-location]
--output s3n://[output-location]
--reducer s3n://[reducer-location]
Tuesday, June 19, 2012
geohash adjancent codes
An algorithm to find out neighbors of a geohash code.
- Base32: 123456789bcdefghjkmnpqrstuvwxyyz
- Neighbors (direction type)
- right even: bc01fg45238967deuvhjyznpkmstqrwx
- left even: 238967debc01fg45kmstqrwxuvhjyznp
- top even: p0r21436x8zb9dcf5h7kjnmqesgutwvy
- bottom even: 14365h7k9dcfesgujnmqp0r2twvyx8zb
- righ odd= top even (p0r21436x8zb9dcf5h7kjnmqesgutwvy)
- left odd= bottom even (14365h7k9dcfesgujnmqp0r2twvyx8zb)
- top odd= right even (bc01fg45238967deuvhjyznpkmstqrwx)
- bottom odd= left even (238967debc01fg45kmstqrwxuvhjyznp)
- Borders (direction type)
- right even: bcfguvyz
- left even: 0145hjnp
- top even: przx
- bottom even: 028b
- right odd= top even (przx)
- left odd= bottom even (028b)
- top odd= right even (bcfguvyz)
- bottom odd= left even (0145hjnp)
- function calculateAdjancent(String srcHashCode, String direction)
char lastCharacter=srcHashCode.charAt(srcHashCode.length-1);
String type=(srcHashCode.length%2)?'odd': 'even':
String base=srcHashCode.subString(0,srcHashCode.length-1);
if(Borders[direction][type].indexOf(lastCharacter) != -1)
base=calculateAdjancent(base,direction);
return base+BASE32[Neighbors[direction][type].indexOf(lastCharacter)];
Ex. find a's right neighbor
calculateAdjancent(a,right);
Tuesday, June 5, 2012
speed up st_within query in postgresql
1. Create index
CREATE INDEX idx_tablename_columnname ON tablename USING GIST(columnname);
This will create a spatial index for the geometry column [columnname] in the [tablename]. According to postgis manual, it creates the bbox for each geometry in the table to speed up the query.
2. Cluster
CLUSTER
After index created, cluster the table to arrange the similar data in to the same disk space.
3. Optional: simplify the geometries.
SELECT ST_NPOINTS(geom_column) AS npoints FROM tablename ORDER BY npoints DESC LIMIT 25;
SELECT ST_SIMPLIFY(geom_column, number_scale) AS simpgeom FROM tablename;
Check the number of points in the 25 biggest geometries. If they are too big, the speed of st_within query will be slow. You can try simplify the geometries in the table if possible. Simplifying means reduce the points in geometries.
4. Test
EXPLAIN UPDATE locationtable SET columnname = (
SELECT columnname FROM regiontable
WHERE ST_WITHIN(geom.locationtable , geom.regiontable)
)
Explain breaks down the query into plan and evaluate if an index is needed or not. Through the EXPLAIN, you can find out if the query is efficient enough.
5. Dissolve multi-polygon to polygon
CREATE INDEX idx_tablename_columnname ON tablename USING GIST(columnname);
This will create a spatial index for the geometry column [columnname] in the [tablename]. According to postgis manual, it creates the bbox for each geometry in the table to speed up the query.
2. Cluster
CLUSTER
After index created, cluster the table to arrange the similar data in to the same disk space.
3. Optional: simplify the geometries.
SELECT ST_NPOINTS(geom_column) AS npoints FROM tablename ORDER BY npoints DESC LIMIT 25;
SELECT ST_SIMPLIFY(geom_column, number_scale) AS simpgeom FROM tablename;
Check the number of points in the 25 biggest geometries. If they are too big, the speed of st_within query will be slow. You can try simplify the geometries in the table if possible. Simplifying means reduce the points in geometries.
4. Test
EXPLAIN UPDATE locationtable SET columnname = (
SELECT columnname FROM regiontable
WHERE ST_WITHIN(geom.locationtable , geom.regiontable)
)
Explain breaks down the query into plan and evaluate if an index is needed or not. Through the EXPLAIN, you can find out if the query is efficient enough.
5. Dissolve multi-polygon to polygon
Wednesday, May 30, 2012
R Postgresql
Reading from PostgreSQL DB
- Install RPostgreSQL
:library(RPostgreSQL) - Open a db connection using the dbConnect function
:con<-dbConnect(PostgreSQL(),user="xxx", password="xxx",host="localhost", client.flag=CLIENT_MULTI_RESULTS) - Use dbGetQuery to initiate a SELECT and return the result sets
:sql<-"SELECT * FROM table WHERE name='xxx'"
:rows<- dbGetQuery(con,sql) - Use dbDisconnect to terminate the connection
:dbDisconnect(con)
Monday, May 28, 2012
tiger geocoder 2011
Settings
- Ubuntu 12.04
- Postgresql 9.1
- Postgis 2.0
- Tiger 2011 Data
- sudo apt-get install python-software-properties
- sudo apt-add-repository ppa:sharpie/for-science
- sudo apt-add-repository ppa:sharpie/postgis-nightly
- sudo apt-get update
- sudo apt-get install postgresql-9.1-postgis
- sudo apt-get install build-essential postgresql-9.1 postgresql-server-dev-9.1 libxml2-dev proj libjson0-dev xsltproc docbook-xsl docbook-mathml gettext postgresql-contrib-9.1 pgadmin3
- sudo apt-add-repository ppa:olivier-berten/geo
- sudo apt-get update
- sudo apt-get install libgdal-dev
- sudo apt-get install g++ ruby ruby1.8-dev swig swig2.0 unzip
- wget http://download.osgeo.org/geos/geos-3.3.3.tar.bz2
- tar xvfj geos-3.3.3.tar.bz2
- cd geos-3.3.3
- ./configure --enable-ruby --prefix=/usr
- make
- sudo make install
- wget http://postgis.refractions.net/download/postgis-2.0.0.tar.gz
- tar zxvf postgis-2.0.0.tar.gz
- cd postgis-2.0.0
- ./configure --with-gui
- make
- sudo make install
- sudo ldconfig
- sudo make comments-install
- sudo ln -sf /usr/share/postgresql-common/pg_wrapper /usr/local/bin/shp2pgsql
- sudo ln -sf /usr/share/postgresql-common/pg_wrapper /usr/local/bin/pgsql2shp
- sudo ln -sf /usr/share/postgresql-common/pg_wrapper /usr/local/bin/raster2pgsql
- createdb postgis2
- psql -d postgis2
- psql>create extension postgis;
- psql>create extension postgis_topology;
- createdb -T postgis2 geocoder
- wget http://postgis.refractions.net/download/postgis-2.1.0SVN.tar.gz
- tar zxvf postgis-2.1SVN.tar.gz
- EDIT postgis-2.1SVN/extra/tiger-geocoder/tiger2011/tiger_loader.sql
export PGBIN=/usr/bin
export PGPORT=5432
export PGHOST=localhost
export PGUSER=postgres
export PGPASSWORD=XXX
export PGDATABASE=geocoder
PSQL=${PGBIN}/psql
SHP2PGSQL=${PGBIN}/shp2pgsql - EDIT postgis-2.1SVN/extra/create_geocode.sh
PGPORT=5432
PGHOST=localhost
PGUSER=postgres
PGPASSWORD=XXX
THEDB=geocoder
PGSQL_CMD=/usr/bin/psql
PGCONTRIB=/usr/share/postgresql/9.1/contrib
//unmark the following two lines
${PGSQL_CMD} -d "${THEDB}" -c "CREATE EXTENSION fuzzystrmatch.sql"
${PGSQL_CMD} -d "${THEDB}" -c "ALTER DATABASE ${THEDB} SET search_path=public,tiger;" - sh ./create_geocode.sh
- sudo mkdir -p /gisdata/temp
- sudo chown -R postgres:postgres /gisdata
- psql -d geocoder
- psql>\a
- psql>\t
- psql>\o nation.sh
- psql>select loader_generate_nation_script('sh');
- psql>\q
- sh ./nation.sh
- psql -d geocoder
- psql>\a
- psql>\t
- psql>\o all_states.sh
- psql>select loader_generate_script(ARRAY['AK','AR','AS',...],'sh');
- psql>\q
- sh ./all_states.sh
- psql -d geocoder
- psql>select install_missing_indexes();
- psql -d geocoder
- get geocode of "27 Waterview Drive, Shelton, CT 06484)psql>select g.rating,ST_X(geomout) AS lon,ST_Y(geomout) AS lat, (addy).* from geocode('27 Waterview Drive, Shelton, CT 06484') AS g;
- get reverse-geocode of point(-73.0811694362766 41.2931353732512)psql>select pprint_addy(r.addy[1]) as st1, pprint_addy(r.addy[2]) as st2, pprint_addy(r.addy[3]) as st3, array_to_string(r.street,',') as cross_streets from reverse_geocode(st_geomfromtext('POINT(-73.0811694362766 41.2931353732512)',4269),true) as r;
Labels:
geocode,
postgis 2.0,
reverse geocode,
tiger 2011
Thursday, May 24, 2012
Maven
Why maven?
Ans: you want to manage jars in your projects. you want to manage dependencies and versions in your projects. you want to manage your project structures. you want to manage the building processes.
How setup maven
1. export M2_home=/directory/to/maven
export PATH=/directory/to/maven:${PATH}
How to use
Ans: you want to manage jars in your projects. you want to manage dependencies and versions in your projects. you want to manage your project structures. you want to manage the building processes.
How setup maven
1. export M2_home=/directory/to/maven
export PATH=/directory/to/maven:${PATH}
How to use
Wednesday, May 23, 2012
R Spatial
1. Load library
library(rgdal)
library(maptools)
library(sp)
2. setwd("directory") #set default working directory
3.data<-read.table()
4. str(data)
5. coordinates(data) "LAt+Lon"
6. str(data)
7. proje4string(data) = CRS("+init=epsg:4326")
8. spplot()
XX: install rgdal in MacOS.
install.packages('rgdal',repos="http://www.stats.ox.ac.uk/pub/RWin")
library(rgdal)
library(maptools)
library(sp)
2. setwd("directory") #set default working directory
3.data<-read.table()
4. str(data)
5. coordinates(data) "LAt+Lon"
6. str(data)
7. proje4string(data) = CRS("+init=epsg:4326")
8. spplot()
XX: install rgdal in MacOS.
install.packages('rgdal',repos="http://www.stats.ox.ac.uk/pub/RWin")
Geohashing
Geohash function has been included in Postgis 2.0. To generate a geohash code, one can use "st_geohash(geom_column, [integer 1-20])".
After inserting/updating geohash code, remember to create b-tree index for the geohash codes.
Thursday, May 17, 2012
Algorithms in Mahout
From IBM:
1. Logistic Regression (Stochastic Gradient Descent)
2. Hidden Markov Models (HMM)
3. Dirichlet Clustering
4. Spectral Clustering
5. Minhash Clustering
6. Numerous Recommender Improvements
7. Collections
1. Logistic Regression (Stochastic Gradient Descent)
2. Hidden Markov Models (HMM)
3. Dirichlet Clustering
4. Spectral Clustering
5. Minhash Clustering
6. Numerous Recommender Improvements
7. Collections
JVM tuning
Key JVM
-Xmx -Xmx2G maximum size of java heap.
-Xms -Xms1G minimum size of java heap.
-server/client defines type of jvm
-d32/-d64 specify the type of JVM
-XX:+NewRatio -XX:+NewRatio=12 only 1/12 of the heap will be reserved for short-lived objects.
-XX:+UseParallelGC use multi-cores or processor to do garbage collection in parallel or not.
-XX:+UseParallelOldGC
-Xmx -Xmx2G maximum size of java heap.
-Xms -Xms1G minimum size of java heap.
-server/client defines type of jvm
-d32/-d64 specify the type of JVM
-XX:+NewRatio -XX:+NewRatio=12 only 1/12 of the heap will be reserved for short-lived objects.
-XX:+UseParallelGC use multi-cores or processor to do garbage collection in parallel or not.
-XX:+UseParallelOldGC
Add new Harddrive to ubuntu
Situation: the old /home directory is too small. I want to add a new hard drive and mount it as /home.
1. sudo fdisk -l to check whether ubuntu detects the new harddrive and identify the new harddrive identity (the one without mounting point, ex. sdb)
2. sudo fdisk /dev/sdb
press 'n' to create new partition.
press 'p' to add primary partition.
press '1' to name partition number.
enter the first and the last section
press 'L' to list available file sysetem type
choice '83' (linux)
press 'w' to write to the table and exit
3. sudo partprobe /dev/sdb
4. sudo mkfs /dev/sdb1 -t ext4
5. sudo mkdir /mnt/sdb1
6. sudo chmod 777 /mnt/sdb1
7. sudo mount /dev/sdb1 /mnt/sdb1 -t ext4
8. sudo cp -R /home /mnt/sdb1
9. sudo vi /etc/fstab
add "/dev/sdb1 /home ext4 defaults 0 0 to the file, save and quit
10. backup the home folder to other places, then delete all in /home
11. sudo mount -a
12. reboot and done.
Tuesday, May 8, 2012
PostGIS examples
From http://postgis.refractions.net/pipermail/postgis-users/2007-July/016358.html
SQL Commands for the Demonstration
Simple Spatial SQL
create table points ( pt geometry, name varchar );
insert into points values ( 'POINT(0 0)', 'Origin' );
insert into points values ( 'POINT(5 0)', 'X Axis' );
insert into points values ( 'POINT(0 5)', 'Y Axis' );
select name, AsText(pt), ST_Distance(pt, 'POINT(5 5)') from points;
drop table points;
Loading Shape Files
cd \postgis-workshop\data
pg_setenv
shp2pgsql -i -s 3005 bc_pubs.shp bc_pubs > bc_pubs.sql
notepad bc_pubs.sql
pg_shpsql
psql -U postgres -f bc_data.sql -d postgis
Creating Spatial Indexes
create index bc_roads_gidx on bc_roads using gist ( the_geom );
create index bc_pubs_gidx on bc_pubs using gist ( the_geom );
create index bc_voting_areas_gidx on bc_voting_areas using gist ( the_geom );
create index bc_municipality_gidx on bc_municipality using gist ( the_geom );
create index bc_hospitals_gidx on bc_hospitals using gist ( the_geom );
Using Spatial Indexes
-- Unindexed query
select gid, name from bc_roads where _ST_Crosses(the_geom, ST_GeomFromText('LINESTRING(1220446 477473,1220417 477559)', 3005));
-- Indexed query
select gid, name from bc_roads where ST_Crosses(the_geom, ST_GeomFromText('LINESTRING(1220446 477473,1220417 477559)', 3005));
Indexes and Query Plans
-- Unindexed query plan
explain select gid, name from bc_roads where _ST_Crosses(the_geom, ST_GeomFromText('LINESTRING(1220446 477473,1220417 477559)', 3005));
-- Indexed query plan
explain select gid, name from bc_roads where ST_Crosses(the_geom, ST_GeomFromText('LINESTRING(1220446 477473,1220417 477559)', 3005));
PostgreSQL Optimization
open postgresql.conf
Spatial Analysis in SQL
-- What is the total length of all roads in the province, in kilometers?
select sum(ST_Length(the_geom))/1000 as km_roads from bc_roads;
-- How large is the city of Prince George, in hectares?
select ST_Area(the_geom)/10000 as hectares from bc_municipality where name = 'PRINCE GEORGE';
-- What is the largest municipality in the province, by area?
select name, ST_Area(the_geom)/10000 as hectares from bc_municipality order by hectares desc limit 1;
Data Integrity
-- How many invalid voting area polygons?
select count(*) from bc_voting_areas where not ST_IsValid(the_geom);
Distance Queries
-- Where is the Tabor Arms pub?
select ST_AsText(the_geom) from bc_pubs where name ilike 'Tabor Arms%';
-- How many Unity Party supporters live within 2km of Tabor Arms pub?
select sum(upbc) as unity_voters from bc_voting_areas where ST_DWithin(the_geom, ST_GeomFromText('POINT(1209385 996204)', 3005), 2000);
Spatial Joins
-- Find all pubs located within 250 meters of a hospital.
select h.name, p.name from bc_hospitals h, bc_pubs p where ST_DWithin(h.the_geom, p.the_geom, 250);
-- Summarize the 2000 provincial election results by municipality.
select m.name, sum(v.ndp) as ndp, sum(v.lib) as liberal, sum(v.gp) as green, sum(v.upbc) as unity, sum(v.vtotal) as total from bc_voting_areas v, bc_municipality m where ST_Intersects(v.the_geom, m.the_geom) group by m.name order by m.name;
Overlays
-- Clip the voting areas by the Prince George boundary.
create table pg_voting_areas as select ST_Intersection(v.the_geom, m.the_geom) as intersection_geom, ST_Area(v.the_geom) as va_area, v.*, m.name from bc_voting_areas v, bc_municipality m where ST_Intersects(v.the_geom, m.the_geom) and m.name = 'PRINCE GEORGE';
-- What is the area of the clipped features?
select sum(ST_Area(intersection_geom)) from pg_voting_areas;
-- How does that compare to the clipping polygon? (Should be the same.)
select ST_Area(the_geom) from bc_municipality where name = 'PRINCE GEORGE';
Coordinate Projection
-- What is the proj4 definition of srid 3005?
select proj4text from spatial_ref_sys where srid=3005;
-- What are the native coordinate of the first road in the roads table?
select ST_AsText(the_geom) from bc_roads limit 1;
-- What do those coordinates look like in lon/lat?
select ST_AsText(ST_Transform(the_geom,4326)) from bc_roads limit 1;
Exercises
-- What is the perimeter of the municipality of Vancouver?
select ST_Perimeter(the_geom) from bc_municipality where name = 'VANCOUVER';
-- What is the total area of all voting areas in hectares?
select sum(ST_Area(the_geom))/10000 as hectares from bc_voting_areas;
-- What is the total area of all voting areas with more than 100 voters in them?
select sum(ST_Area(the_geom))/10000 as hectares from bc_voting_areas where vtotal > 100;
-- What is the length in kilometers of all roads named Douglas St?
select sum(ST_Length(the_geom))/1000 as kilometers from bc_roads where name = 'Douglas St';
Advanced Exercises
-- What is the length in kilometers of Douglas St in Victoria?
select sum(ST_Length(r.the_geom))/1000 as kilometers from bc_roads r, bc_municipality m where ST_Contains(m.the_geom, r.the_geom) and r.name = 'Douglas St' and m.name = 'VICTORIA';
-- What two pubs have the most Green Party supporters within 500 meters of them?
select p.name, p.city, sum(v.gp) as greens from bc_pubs p, bc_voting_areas v where ST_DWithin(p.the_geom, v.the_geom, 500) group by p.name, p.city order by greens desc limit 2;
-- What is the latitude of the most southerly hospital in the province?
select ST_Y(ST_Transform(the_geom,4326)) as latitude from bc_hospitals order by latitude asc limit 1;
-- What were the percentage NDP and Liberal vote within the city limits of Prince George in the 2000 provincial election?
select 100*sum(v.ndp)/sum(v.vtotal) as ndp, 100*sum(v.lib)/sum(v.vtotal) as liberal from bc_voting_areas v, bc_municipality m where ST_Contains(m.the_geom, v.the_geom) and m.name = 'PRINCE GEORGE';
-- What is the largest voting area polygon that has a hole?
select gid, id, ST_Area(the_geom) as area from bc_voting_areas where ST_NRings(the_geom) > 1 order by area desc limit 1;
-- How many NDP voters live within 50 meters of Simcoe St in Victoria?
select sum(v.ndp) as ndp from bc_voting_areas v, bc_municipality m, bc_roads r where ST_Contains(m.the_geom, r.the_geom) and r.name = 'Simcoe St' and m.name = 'VICTORIA' and ST_DWithin(r.the_geom, v.the_geom, 50);
Wednesday, May 2, 2012
Hadoop Demo
Create MapReduce application:
a. specify input and output locations
Input/Output => <key,value>
key and value should implement "Writable", and key should implement "WritableComparable".
b. specify map and reduce functions
c. job client submit the job and configuration to JobTracker.
Example1: WordCount
public class WordCount{
public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable{
private final static IntWritable one = new IntWritable(1); //value
private Text word = new Text();//use word as key
public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException{
String line = value.toString();
StringTokenizer tokenizer = new StringTokenizer(line);
while(tokenizer.hasMoreTokens()){
word.set(tokenizer.nextToken()); //set key
output.collect(word,one); //value will always be one
}
}
}
public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable>{
public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException{
int sum=0
//for values with the same key, sum up the value (1) and output <key, sum>
while(values.hasNext()){
sum+=values.next().get();
}
output.collect(key,new IntWritable(sum));
}
}
public static void main(String[] args) throws Exception{
JobConf conf = new JobConf(WordCount.class);
conf.setJobName("wordcount");
conf.setOutputKeyClass(Text.class); //word
conf.setOutputValueClass(IntWritable.class);//sum
conf.setMapperClass(Map.class);
//conf.setOutputKeyComparatorClass(Comparator.class)); //control the grouping of output
conf.setCombinerClass(Reduce.class);
conf.setReducerClass(Reduce.class);
conf.setInputFormat(TextInputFormat.class);//textfile
conf.setOutputFormat(TextOutputFormat.class);//textfile
FileInputFormat.setInputPaths(conf, new Path(args[0]));//location of input file
FileOutputFormat.setOutputPath(conf, new Path(args[1]));//location of output file
JobClient.run(conf);
}
}
PostgreSQL+PostGIS cheat sheet
1. Distinct Count
solution 1:
select count(*) from (select distinct * from table where column='') as dis
2. Check redundant tuple
soluction 1 (self join)
SELECTp.place_id, p.href,s.place_id,s.href
FROM place p
INNER JOIN place s ON p.href=s.href
WHERE p.place_id &s s.place_id
ORDER BY p.place_id, s.place_id;
3. Materialized View (http://tech.jonathangardner.net/wiki/PostgreSQL/Materialized_Views)
a. snapshot (view definition is never changed)
1) create a table to store view info
CREATE TABLE matviews(
mv_name NAME NOT NULL PRIMARY KEY,
b_name NAME NOT NULL,
last_refresh TIMESTAMP WITH TIME ZONE
);
2)create a function to insert a row in Mat_VIEWS and create a materialized view
CREATE OR REPLACE FUNCTION create_matview(NAME,NAME)
RETURNS VOID
SECURITY DEFINER
LANGUAGE plpgsql AS '
DECLARE
matview ALIAS FOR $1;
view_name ALIAS FOR $2;
entry matviews%ROWTYPE;
BEGIN
SELECT * INTO entry FROM matviews WHERE mv_name=matview;
IF FOUND THEN
RAISE EXCEPTION ''Materialized view ''''%'''' already exists.'', matview;
END IF;
EXECUTE ''REVOKE ALL ON '' || view_name || ''FROM PUBLIC'';
EXECUTE "GRANT SELECT ON '' || view_name || '' TO PUBLIC'';
EXECUTE ''CREATE TABLE '' || matview|| AS SELECT * FROM ''||view_name;
EXECUTE ''REVOKE ALL ON '' || matview || ' FROM PUBLIC'';
EXECUTE ''GRANT SELECT ON '' || matview || '' TO PUBLIC'';
INSERT INTO matviews (mv_name, b_name,last_refresh)
VALUES(matview, view_name, CURRENT_TIMESTAMP);
RETURN;
END
';
3) create a function to drop a materialized view
CREATE OR REPLACE FUNCTION drop_matview(NAME) RETURN VOID
SECURITY DEFINER
LANGUAGE plpgsql AS '
DECLARE
matview ALIAS FOR $1;
entry matviews%ROWTYPE;
BEGIN
SELECT * INTO entry FROM matview WHERE mv_name=matview;
IF NOT FOUND THEN
RAISE EXCEPTION ''Materialized view % does not exist.'',matview;
END IF;
EXECUTE ''DROP TABLE'' || matview;
DELETE FROM matviews WHERE mv_name=matview;
RETURN;
END
';
4) create a function to refresh a materialized view
CREATE OR REPLACE FUNCTION refresh_matview(NAME) RETURNS VOID
SECURITY DEFINER
LANGUAGE plpgsql AS '
DECLARE
matview ALIAS $1;
entry matviews%ROWTYP;
BEGIN
SELECT * INTO entry FROM matviews WHERE mv_name=matview;
IF NOT FOUND THEN
RAISE EXCEPTION ''Materialized view % does not exist.'',matview;
END IF;
EXECUTE '' DELETE FROM '' || matview;
EXECUTE ''INSERT INTO ''||matview || '' SELECT * FROM '' || entry.b_name;
UPDATE matviews
SET last_refresh = CURRENT_TIMESTAMP
WHERE mv_name=matview;
RETURN;
END
';
b. eager (update whenever the base view is changed)
4. delete existing duplicate rows in a table
Before execute the following statement, make sure you have enough buffers for the session (refer to cheat 5 and 6).
solution1
DELETE FROM table USING table ta
WHERE table.key=ta.key AND table.id>ta.id;
solution2
DELETE FROM table
WHERE ctid NOT IN
(SELECT MAX(s.ctid) FROM table s GROUP BY s.column_has_be_distinct);
solution3
CREATE OR REPLACE FUNCTION remove_duplicates(text,text) RETURN VOID AS $$
DECLARE
tablename ALIAS FOR $1;
duplicate_column ALIAS FRO $2;
BEGIN
EXECUTE 'CREATE TEMPORARY TABLE TABLE_DISTINCT_' ||tablename||'
AS (SELECT DISTINCT ON ('||duplicate_column||') * FROM ' ||tablename||
ORDER BY '||duplicate_column||' ASC);';
EXECUTE 'DELETE FROM ' || tablename||';';
EXECUTE 'INSERT INTO ' || tablename||' (SELECT * FROM _DISTINCT_' || tablename||');';
EXECUTE 'DROP TABLE_DISTINCT_'|| tablename||';';
RETURN;
END;
$$LANGUAGE plpgsql;
solution4
CREATE TEMPORARY TABLE t_temp ON COMMIT DROP AS
SELECT DISTINCT ON ('column_name') * FROM table;
TRUNCATE table;
INSERT INTO table SELECT * FROM t_temp;
5. find out the size of table
solution:
SELECT pg_size_pretty(pg_relation_size('tablename'));
6. set temporary buffers
solution:
SET temp_buffers =' XXXMB';
7. delete table rows which some columns in that table (a) don't exist in the other table (b).
solution:
DELETE FROM a WHERE a.column_name NOT IN (SELECT column_name FROM b);
8. Backup
Solution1:
pg_dump
EX. login db at localhost as postgres and dump database poi2 into poi.backup.gz
pg_dump -u postgres -h localhost poi2 | gzip>poi.backup.gz
Solution2:
file system level backup
Ex.
tar -cf poi2.backup.tar /usr/local/pgsql/data
PS. The database server must be shutdown before doing file system level backup and restore.
Solutioin3:
Continuous archiving
EX.
a. enable wal in postgresql.conf
wal_level=archive
archive_mode=on
archive_command='test ! -f /mnt/server/archivedir/%f && cp %p /mnt/server/archivedir/%f'
b.SELECT pg_start_backup('label',ture);
c. perform file system level backup, like tar
d. SELECT pg_stop_backup();
Solution4:
pg_basebackup
EX.
a. enable local replicate permission for posgres @localhost
b. enable max_wal_senders=1
c. pg_basebackup -U postgres -D /backup/data -F tar -z -l label
9. PostGIS import shape files into one table
10. Test if geometry column is valid
Solution 1:
select ST_IsValid(geom) from table;
Solution 2:
select ST_IsValidDetail(geom) from table;
Solution 3:
select ST_IsEmpty(geom) from table;
11. Test the geometry type
Solution 1:
select ST_GeometryType(geom) from table;
12. Mapping points in table a to a polygon in a table b
Solution1:
select * from table1
where st_within(geom, polygon);
13. get a boundary from a collection of geometries
Solution 1:
select st_extent(geom) from table;
14. Performance tuning:
Install pgtune (apt-get update; apt-get install pgtune) -> pgtune -i postgresql.conf -o psql.conf
create dedicate index for query (considering the order of fields, and the join condition)
Use multi-fields index
the order of fields in multi-fields index (A, B, C can be used to find AB, but not BC)
Vacuum the tables after delete, update, insert, or create index.
Order of fields and order of joint have a big impact on performance.
Explicit casting: "where x=3" v.s. "where x=cast(3 as smallint)", "abs(dest) = cast(111 as smallint)" v.s. "dest =111 or dest=-111" have huge different.
set enable_seqscan=no
Avoid nested loop, merge and sort is faster.
Use temp table
Add sequence in query to monitor the progress. select nextval('seq_progress'); select currval('seq_progress');
Enable stats_start_collector and stats_row_level to get auto vacuum
Partial index: create index index_name on table1(column_name) where column_nam like 'America%'
Specify sort_memory, temp_buffers for each connections or query.
solution 1:
select count(*) from (select distinct * from table where column='') as dis
2. Check redundant tuple
soluction 1 (self join)
SELECTp.place_id, p.href,s.place_id,s.href
FROM place p
INNER JOIN place s ON p.href=s.href
WHERE p.place_id &s s.place_id
ORDER BY p.place_id, s.place_id;
3. Materialized View (http://tech.jonathangardner.net/wiki/PostgreSQL/Materialized_Views)
a. snapshot (view definition is never changed)
1) create a table to store view info
CREATE TABLE matviews(
mv_name NAME NOT NULL PRIMARY KEY,
b_name NAME NOT NULL,
last_refresh TIMESTAMP WITH TIME ZONE
);
2)create a function to insert a row in Mat_VIEWS and create a materialized view
CREATE OR REPLACE FUNCTION create_matview(NAME,NAME)
RETURNS VOID
SECURITY DEFINER
LANGUAGE plpgsql AS '
DECLARE
matview ALIAS FOR $1;
view_name ALIAS FOR $2;
entry matviews%ROWTYPE;
BEGIN
SELECT * INTO entry FROM matviews WHERE mv_name=matview;
IF FOUND THEN
RAISE EXCEPTION ''Materialized view ''''%'''' already exists.'', matview;
END IF;
EXECUTE ''REVOKE ALL ON '' || view_name || ''FROM PUBLIC'';
EXECUTE "GRANT SELECT ON '' || view_name || '' TO PUBLIC'';
EXECUTE ''CREATE TABLE '' || matview|| AS SELECT * FROM ''||view_name;
EXECUTE ''REVOKE ALL ON '' || matview || ' FROM PUBLIC'';
EXECUTE ''GRANT SELECT ON '' || matview || '' TO PUBLIC'';
INSERT INTO matviews (mv_name, b_name,last_refresh)
VALUES(matview, view_name, CURRENT_TIMESTAMP);
RETURN;
END
';
3) create a function to drop a materialized view
CREATE OR REPLACE FUNCTION drop_matview(NAME) RETURN VOID
SECURITY DEFINER
LANGUAGE plpgsql AS '
DECLARE
matview ALIAS FOR $1;
entry matviews%ROWTYPE;
BEGIN
SELECT * INTO entry FROM matview WHERE mv_name=matview;
IF NOT FOUND THEN
RAISE EXCEPTION ''Materialized view % does not exist.'',matview;
END IF;
EXECUTE ''DROP TABLE'' || matview;
DELETE FROM matviews WHERE mv_name=matview;
RETURN;
END
';
4) create a function to refresh a materialized view
CREATE OR REPLACE FUNCTION refresh_matview(NAME) RETURNS VOID
SECURITY DEFINER
LANGUAGE plpgsql AS '
DECLARE
matview ALIAS $1;
entry matviews%ROWTYP;
BEGIN
SELECT * INTO entry FROM matviews WHERE mv_name=matview;
IF NOT FOUND THEN
RAISE EXCEPTION ''Materialized view % does not exist.'',matview;
END IF;
EXECUTE '' DELETE FROM '' || matview;
EXECUTE ''INSERT INTO ''||matview || '' SELECT * FROM '' || entry.b_name;
UPDATE matviews
SET last_refresh = CURRENT_TIMESTAMP
WHERE mv_name=matview;
RETURN;
END
';
b. eager (update whenever the base view is changed)
4. delete existing duplicate rows in a table
Before execute the following statement, make sure you have enough buffers for the session (refer to cheat 5 and 6).
solution1
DELETE FROM table USING table ta
WHERE table.key=ta.key AND table.id>ta.id;
solution2
DELETE FROM table
WHERE ctid NOT IN
(SELECT MAX(s.ctid) FROM table s GROUP BY s.column_has_be_distinct);
solution3
CREATE OR REPLACE FUNCTION remove_duplicates(text,text) RETURN VOID AS $$
DECLARE
tablename ALIAS FOR $1;
duplicate_column ALIAS FRO $2;
BEGIN
EXECUTE 'CREATE TEMPORARY TABLE TABLE_DISTINCT_' ||tablename||'
AS (SELECT DISTINCT ON ('||duplicate_column||') * FROM ' ||tablename||
ORDER BY '||duplicate_column||' ASC);';
EXECUTE 'DELETE FROM ' || tablename||';';
EXECUTE 'INSERT INTO ' || tablename||' (SELECT * FROM _DISTINCT_' || tablename||');';
EXECUTE 'DROP TABLE_DISTINCT_'|| tablename||';';
RETURN;
END;
$$LANGUAGE plpgsql;
solution4
CREATE TEMPORARY TABLE t_temp ON COMMIT DROP AS
SELECT DISTINCT ON ('column_name') * FROM table;
TRUNCATE table;
INSERT INTO table SELECT * FROM t_temp;
5. find out the size of table
solution:
SELECT pg_size_pretty(pg_relation_size('tablename'));
6. set temporary buffers
solution:
SET temp_buffers =' XXXMB';
7. delete table rows which some columns in that table (a) don't exist in the other table (b).
solution:
DELETE FROM a WHERE a.column_name NOT IN (SELECT column_name FROM b);
8. Backup
Solution1:
pg_dump
EX. login db at localhost as postgres and dump database poi2 into poi.backup.gz
pg_dump -u postgres -h localhost poi2 | gzip>poi.backup.gz
Solution2:
file system level backup
Ex.
tar -cf poi2.backup.tar /usr/local/pgsql/data
PS. The database server must be shutdown before doing file system level backup and restore.
Solutioin3:
Continuous archiving
EX.
a. enable wal in postgresql.conf
wal_level=archive
archive_mode=on
archive_command='test ! -f /mnt/server/archivedir/%f && cp %p /mnt/server/archivedir/%f'
b.SELECT pg_start_backup('label',ture);
c. perform file system level backup, like tar
d. SELECT pg_stop_backup();
touch /var/lib/pgsql/backup_in_progress
psql -c "select pg_start_backup('hot_backup');"
tar -cf /var/lib/pgsql/backup.tar /var/lib/pgsql/data/
psql -c "select pg_stop_backup();"
rm /var/lib/pgsql/backup_in_progress
tar -rf /var/lib/pgsql/backup.tar /var/lib/pgsql/archive/
pg_basebackup
EX.
a. enable local replicate permission for posgres @localhost
b. enable max_wal_senders=1
c. pg_basebackup -U postgres -D /backup/data -F tar -z -l label
9. PostGIS import shape files into one table
Solution 1:
shp2pgsql -s 915 -a[-c] -g geom shapfile.shp schema.table | psql -U username -W dbname
-s: srid, -g: geometry_column_name10. Test if geometry column is valid
Solution 1:
select ST_IsValid(geom) from table;
Solution 2:
select ST_IsValidDetail(geom) from table;
Solution 3:
select ST_IsEmpty(geom) from table;
11. Test the geometry type
Solution 1:
select ST_GeometryType(geom) from table;
12. Mapping points in table a to a polygon in a table b
Solution1:
select * from table1
where st_within(geom, polygon);
13. get a boundary from a collection of geometries
Solution 1:
select st_extent(geom) from table;
14. Performance tuning:
Install pgtune (apt-get update; apt-get install pgtune) -> pgtune -i postgresql.conf -o psql.conf
create dedicate index for query (considering the order of fields, and the join condition)
Use multi-fields index
the order of fields in multi-fields index (A, B, C can be used to find AB, but not BC)
Vacuum the tables after delete, update, insert, or create index.
Order of fields and order of joint have a big impact on performance.
Explicit casting: "where x=3" v.s. "where x=cast(3 as smallint)", "abs(dest) = cast(111 as smallint)" v.s. "dest =111 or dest=-111" have huge different.
set enable_seqscan=no
Avoid nested loop, merge and sort is faster.
Use temp table
Add sequence in query to monitor the progress. select nextval('seq_progress'); select currval('seq_progress');
Enable stats_start_collector and stats_row_level to get auto vacuum
Partial index: create index index_name on table1(column_name) where column_nam like 'America%'
Specify sort_memory, temp_buffers for each connections or query.
Wednesday, April 25, 2012
Python split file based on json tag
'''A script to split places_dump_US.geojson into small files based on the json tag, province'''
'''script start'''
import json
def writeState(m,state='none'):
out_name='places_dump_US.'+state+'.geojson';
tg=open(out_name,'a',-1)
print(json.dumps(m),file=tg)
tg.close
sf=open('places_dump_US.geojson','r',-1)
count=0;
for line in sf:
try:
p=json.loads(line)
cstate = p.get('properties').get('province')
writeState(p,cstate)
print(count)
count+=1
except:
writeState(p)
print(count)
count+=1
sf.close
'''script end'''
Monday, April 23, 2012
MongoDB cheatsheet
Import Data
//import SourceFile (csv file) into DataBaseName CollectionName with firstline as header
mongoimport -d DataBaseName -c CollectionName --type csv --file SourceFile --headerline
//import file,xxx.txt into foo in Collection bar without headerinfo(have to specify the fieldnames)
mongoimport -d foo -c bar --type csv --file xxx.txt -f id,timestamp,latitude,longitude,place
mongoimport -d foo -c bar --type tsv --file xxx.txt -f id,timestamp,latitude,longitude,place
Create Index
db.[collection-name].ensuerIndex({[tag-name]: 1})
Create Spatial Index
db.[collection-name].ensureIndex({[tag-name]:"2d"})
//import SourceFile (csv file) into DataBaseName CollectionName with firstline as header
mongoimport -d DataBaseName -c CollectionName --type csv --file SourceFile --headerline
//import file,xxx.txt into foo in Collection bar without headerinfo(have to specify the fieldnames)
mongoimport -d foo -c bar --type csv --file xxx.txt -f id,timestamp,latitude,longitude,place
mongoimport -d foo -c bar --type tsv --file xxx.txt -f id,timestamp,latitude,longitude,place
db.[collection-name].ensuerIndex({[tag-name]: 1})
Create Spatial Index
db.[collection-name].ensureIndex({[tag-name]:"2d"})
Python SimpleHTTPServer
Type the following in the direction which you want to share.
python -m SimpleHTTPServer 9999
or
python -m http.server 8001
//this command create a simple http server on local port 9999 using SimpleHTTPServer module.
Then other users can get the files in the directory using a browser.
python -m SimpleHTTPServer 9999
or
python -m http.server 8001
//this command create a simple http server on local port 9999 using SimpleHTTPServer module.
Then other users can get the files in the directory using a browser.
Monday, April 16, 2012
Spatial data mining procedures
Procedure
1. import json files
2. build index
3. build foreign keys
4. build tiles
5. build neighborhood index (orignal oid, target oid, distance, direction, topology)
6. spatial data mining tools
Platforms
1. PostgreSQL + postgis (open source solution)
2. MongoDB
3. MS-SQL
4. Hadoop
Input
1. Place_dump_US
2. Checkin
Output
1. OpenLayers + Geoserver (visualization)
2. Patterns (representation)
1. import json files
2. build index
3. build foreign keys
4. build tiles
5. build neighborhood index (orignal oid, target oid, distance, direction, topology)
6. spatial data mining tools
Platforms
1. PostgreSQL + postgis (open source solution)
2. MongoDB
3. MS-SQL
4. Hadoop
Input
1. Place_dump_US
2. Checkin
Output
1. OpenLayers + Geoserver (visualization)
2. Patterns (representation)
Sunday, April 15, 2012
Hadoop Clusters Setup
Pre-setup
1. install jdk/jre1.6 or up
2. install ssh
a. create master sshkey
ssh-keygen -t dsa -P ""
cat id_dsa.pub>>authorized_keys
b. copy master public key to slaves
scp id_dsa.pub slaveN:~/.ssh/master.pub
c. add master pub key to authorized_keys
cat master.pub>>authorized_keys
d. from master, ssh to slaveN and check if a passphrase is needed.
3. edit /etc/hosts & /etc/hostname
Setup
1. setup env.xml (export JAVA_HOME)
2. core-site (specify name node and jobtracker) --for master & slaves
fs.default.name
hdfs://master
3. hdfs-site.xml (data node) --for master & slave
dfs.name.dir
/home/hduser/hddata/name
dfs.data.dir
/home/hduser/hddata/data
4. mapred-site.xml (jobtracker) --for master & slaves
mapred.job.tracker
master:54311
5. list all slaves to conf/slaves --for master/jobtracker only
6. chmod g-w to all data and name directories
** start-dfs.sh will consult slaves on name-node and start all data nodes on slaves.
** start-mapred.sh will consult salves on job-tracker-node and start all task-trackers on slaves.
Startup
1. execute "hadoop namenode -format" on name node site
2. execute "start-dfs.sh" on name node site
3. execute "start-mapred.sh" on job tracker site
Shutdown
1. execute "stop-mapred.sh" on job tracker site
2. execute "stop-dfs.sh" on name node site
1. install jdk/jre1.6 or up
2. install ssh
a. create master sshkey
ssh-keygen -t dsa -P ""
cat id_dsa.pub>>authorized_keys
b. copy master public key to slaves
scp id_dsa.pub slaveN:~/.ssh/master.pub
c. add master pub key to authorized_keys
cat master.pub>>authorized_keys
d. from master, ssh to slaveN and check if a passphrase is needed.
3. edit /etc/hosts & /etc/hostname
Setup
1. setup env.xml (export JAVA_HOME)
2. core-site (specify name node and jobtracker) --for master & slaves
3. hdfs-site.xml (data node) --for master & slave
4. mapred-site.xml (jobtracker) --for master & slaves
5. list all slaves to conf/slaves --for master/jobtracker only
6. chmod g-w to all data and name directories
** start-dfs.sh will consult slaves on name-node and start all data nodes on slaves.
** start-mapred.sh will consult salves on job-tracker-node and start all task-trackers on slaves.
Startup
1. execute "hadoop namenode -format" on name node site
2. execute "start-dfs.sh" on name node site
3. execute "start-mapred.sh" on job tracker site
Shutdown
1. execute "stop-mapred.sh" on job tracker site
2. execute "stop-dfs.sh" on name node site
Friday, March 30, 2012
Argument list is too long
Situation: when I try to ls or rm a "big" folder, mac terminal responds "Argument list is too long".
Solution1: use xargs
ex.
ls chunk*.geojson|xargs rm
find . -type f -name chunk*.geojson | xargs rm
//result: doesn't work in Mac OS
Solution 2: perl
ex.
perl -e 'for(<./chunk*.geojson>){unlink}'
//this one works in mac os.
Solusion 3: shellscript
Solution1: use xargs
ex.
ls chunk*.geojson|xargs rm
find . -type f -name chunk*.geojson | xargs rm
//result: doesn't work in Mac OS
Solution 2: perl
ex.
perl -e 'for(<./chunk*.geojson>){unlink}'
//this one works in mac os.
Solusion 3: shellscript
Subscribe to:
Posts (Atom)