- Install RPostgreSQL
:library(RPostgreSQL) - Open a db connection using the dbConnect function
:con<-dbConnect(PostgreSQL(),user="xxx", password="xxx",host="localhost", client.flag=CLIENT_MULTI_RESULTS) - Use dbGetQuery to initiate a SELECT and return the result sets
:sql<-"SELECT * FROM table WHERE name='xxx'"
:rows<- dbGetQuery(con,sql) - Use dbDisconnect to terminate the connection
:dbDisconnect(con)
Wednesday, May 30, 2012
R Postgresql
Reading from PostgreSQL DB
Monday, May 28, 2012
tiger geocoder 2011
Settings
- Ubuntu 12.04
- Postgresql 9.1
- Postgis 2.0
- Tiger 2011 Data
- sudo apt-get install python-software-properties
- sudo apt-add-repository ppa:sharpie/for-science
- sudo apt-add-repository ppa:sharpie/postgis-nightly
- sudo apt-get update
- sudo apt-get install postgresql-9.1-postgis
- sudo apt-get install build-essential postgresql-9.1 postgresql-server-dev-9.1 libxml2-dev proj libjson0-dev xsltproc docbook-xsl docbook-mathml gettext postgresql-contrib-9.1 pgadmin3
- sudo apt-add-repository ppa:olivier-berten/geo
- sudo apt-get update
- sudo apt-get install libgdal-dev
- sudo apt-get install g++ ruby ruby1.8-dev swig swig2.0 unzip
- wget http://download.osgeo.org/geos/geos-3.3.3.tar.bz2
- tar xvfj geos-3.3.3.tar.bz2
- cd geos-3.3.3
- ./configure --enable-ruby --prefix=/usr
- make
- sudo make install
- wget http://postgis.refractions.net/download/postgis-2.0.0.tar.gz
- tar zxvf postgis-2.0.0.tar.gz
- cd postgis-2.0.0
- ./configure --with-gui
- make
- sudo make install
- sudo ldconfig
- sudo make comments-install
- sudo ln -sf /usr/share/postgresql-common/pg_wrapper /usr/local/bin/shp2pgsql
- sudo ln -sf /usr/share/postgresql-common/pg_wrapper /usr/local/bin/pgsql2shp
- sudo ln -sf /usr/share/postgresql-common/pg_wrapper /usr/local/bin/raster2pgsql
- createdb postgis2
- psql -d postgis2
- psql>create extension postgis;
- psql>create extension postgis_topology;
- createdb -T postgis2 geocoder
- wget http://postgis.refractions.net/download/postgis-2.1.0SVN.tar.gz
- tar zxvf postgis-2.1SVN.tar.gz
- EDIT postgis-2.1SVN/extra/tiger-geocoder/tiger2011/tiger_loader.sql
export PGBIN=/usr/bin
export PGPORT=5432
export PGHOST=localhost
export PGUSER=postgres
export PGPASSWORD=XXX
export PGDATABASE=geocoder
PSQL=${PGBIN}/psql
SHP2PGSQL=${PGBIN}/shp2pgsql - EDIT postgis-2.1SVN/extra/create_geocode.sh
PGPORT=5432
PGHOST=localhost
PGUSER=postgres
PGPASSWORD=XXX
THEDB=geocoder
PGSQL_CMD=/usr/bin/psql
PGCONTRIB=/usr/share/postgresql/9.1/contrib
//unmark the following two lines
${PGSQL_CMD} -d "${THEDB}" -c "CREATE EXTENSION fuzzystrmatch.sql"
${PGSQL_CMD} -d "${THEDB}" -c "ALTER DATABASE ${THEDB} SET search_path=public,tiger;" - sh ./create_geocode.sh
- sudo mkdir -p /gisdata/temp
- sudo chown -R postgres:postgres /gisdata
- psql -d geocoder
- psql>\a
- psql>\t
- psql>\o nation.sh
- psql>select loader_generate_nation_script('sh');
- psql>\q
- sh ./nation.sh
- psql -d geocoder
- psql>\a
- psql>\t
- psql>\o all_states.sh
- psql>select loader_generate_script(ARRAY['AK','AR','AS',...],'sh');
- psql>\q
- sh ./all_states.sh
- psql -d geocoder
- psql>select install_missing_indexes();
- psql -d geocoder
- get geocode of "27 Waterview Drive, Shelton, CT 06484)psql>select g.rating,ST_X(geomout) AS lon,ST_Y(geomout) AS lat, (addy).* from geocode('27 Waterview Drive, Shelton, CT 06484') AS g;
- get reverse-geocode of point(-73.0811694362766 41.2931353732512)psql>select pprint_addy(r.addy[1]) as st1, pprint_addy(r.addy[2]) as st2, pprint_addy(r.addy[3]) as st3, array_to_string(r.street,',') as cross_streets from reverse_geocode(st_geomfromtext('POINT(-73.0811694362766 41.2931353732512)',4269),true) as r;
Labels:
geocode,
postgis 2.0,
reverse geocode,
tiger 2011
Thursday, May 24, 2012
Maven
Why maven?
Ans: you want to manage jars in your projects. you want to manage dependencies and versions in your projects. you want to manage your project structures. you want to manage the building processes.
How setup maven
1. export M2_home=/directory/to/maven
export PATH=/directory/to/maven:${PATH}
How to use
Ans: you want to manage jars in your projects. you want to manage dependencies and versions in your projects. you want to manage your project structures. you want to manage the building processes.
How setup maven
1. export M2_home=/directory/to/maven
export PATH=/directory/to/maven:${PATH}
How to use
Wednesday, May 23, 2012
R Spatial
1. Load library
library(rgdal)
library(maptools)
library(sp)
2. setwd("directory") #set default working directory
3.data<-read.table()
4. str(data)
5. coordinates(data) "LAt+Lon"
6. str(data)
7. proje4string(data) = CRS("+init=epsg:4326")
8. spplot()
XX: install rgdal in MacOS.
install.packages('rgdal',repos="http://www.stats.ox.ac.uk/pub/RWin")
library(rgdal)
library(maptools)
library(sp)
2. setwd("directory") #set default working directory
3.data<-read.table()
4. str(data)
5. coordinates(data) "LAt+Lon"
6. str(data)
7. proje4string(data) = CRS("+init=epsg:4326")
8. spplot()
XX: install rgdal in MacOS.
install.packages('rgdal',repos="http://www.stats.ox.ac.uk/pub/RWin")
Geohashing
Geohash function has been included in Postgis 2.0. To generate a geohash code, one can use "st_geohash(geom_column, [integer 1-20])".
After inserting/updating geohash code, remember to create b-tree index for the geohash codes.
Thursday, May 17, 2012
Algorithms in Mahout
From IBM:
1. Logistic Regression (Stochastic Gradient Descent)
2. Hidden Markov Models (HMM)
3. Dirichlet Clustering
4. Spectral Clustering
5. Minhash Clustering
6. Numerous Recommender Improvements
7. Collections
1. Logistic Regression (Stochastic Gradient Descent)
2. Hidden Markov Models (HMM)
3. Dirichlet Clustering
4. Spectral Clustering
5. Minhash Clustering
6. Numerous Recommender Improvements
7. Collections
JVM tuning
Key JVM
-Xmx -Xmx2G maximum size of java heap.
-Xms -Xms1G minimum size of java heap.
-server/client defines type of jvm
-d32/-d64 specify the type of JVM
-XX:+NewRatio -XX:+NewRatio=12 only 1/12 of the heap will be reserved for short-lived objects.
-XX:+UseParallelGC use multi-cores or processor to do garbage collection in parallel or not.
-XX:+UseParallelOldGC
-Xmx -Xmx2G maximum size of java heap.
-Xms -Xms1G minimum size of java heap.
-server/client defines type of jvm
-d32/-d64 specify the type of JVM
-XX:+NewRatio -XX:+NewRatio=12 only 1/12 of the heap will be reserved for short-lived objects.
-XX:+UseParallelGC use multi-cores or processor to do garbage collection in parallel or not.
-XX:+UseParallelOldGC
Add new Harddrive to ubuntu
Situation: the old /home directory is too small. I want to add a new hard drive and mount it as /home.
1. sudo fdisk -l to check whether ubuntu detects the new harddrive and identify the new harddrive identity (the one without mounting point, ex. sdb)
2. sudo fdisk /dev/sdb
press 'n' to create new partition.
press 'p' to add primary partition.
press '1' to name partition number.
enter the first and the last section
press 'L' to list available file sysetem type
choice '83' (linux)
press 'w' to write to the table and exit
3. sudo partprobe /dev/sdb
4. sudo mkfs /dev/sdb1 -t ext4
5. sudo mkdir /mnt/sdb1
6. sudo chmod 777 /mnt/sdb1
7. sudo mount /dev/sdb1 /mnt/sdb1 -t ext4
8. sudo cp -R /home /mnt/sdb1
9. sudo vi /etc/fstab
add "/dev/sdb1 /home ext4 defaults 0 0 to the file, save and quit
10. backup the home folder to other places, then delete all in /home
11. sudo mount -a
12. reboot and done.
Tuesday, May 8, 2012
PostGIS examples
From http://postgis.refractions.net/pipermail/postgis-users/2007-July/016358.html
SQL Commands for the Demonstration
Simple Spatial SQL
create table points ( pt geometry, name varchar );
insert into points values ( 'POINT(0 0)', 'Origin' );
insert into points values ( 'POINT(5 0)', 'X Axis' );
insert into points values ( 'POINT(0 5)', 'Y Axis' );
select name, AsText(pt), ST_Distance(pt, 'POINT(5 5)') from points;
drop table points;
Loading Shape Files
cd \postgis-workshop\data
pg_setenv
shp2pgsql -i -s 3005 bc_pubs.shp bc_pubs > bc_pubs.sql
notepad bc_pubs.sql
pg_shpsql
psql -U postgres -f bc_data.sql -d postgis
Creating Spatial Indexes
create index bc_roads_gidx on bc_roads using gist ( the_geom );
create index bc_pubs_gidx on bc_pubs using gist ( the_geom );
create index bc_voting_areas_gidx on bc_voting_areas using gist ( the_geom );
create index bc_municipality_gidx on bc_municipality using gist ( the_geom );
create index bc_hospitals_gidx on bc_hospitals using gist ( the_geom );
Using Spatial Indexes
-- Unindexed query
select gid, name from bc_roads where _ST_Crosses(the_geom, ST_GeomFromText('LINESTRING(1220446 477473,1220417 477559)', 3005));
-- Indexed query
select gid, name from bc_roads where ST_Crosses(the_geom, ST_GeomFromText('LINESTRING(1220446 477473,1220417 477559)', 3005));
Indexes and Query Plans
-- Unindexed query plan
explain select gid, name from bc_roads where _ST_Crosses(the_geom, ST_GeomFromText('LINESTRING(1220446 477473,1220417 477559)', 3005));
-- Indexed query plan
explain select gid, name from bc_roads where ST_Crosses(the_geom, ST_GeomFromText('LINESTRING(1220446 477473,1220417 477559)', 3005));
PostgreSQL Optimization
open postgresql.conf
Spatial Analysis in SQL
-- What is the total length of all roads in the province, in kilometers?
select sum(ST_Length(the_geom))/1000 as km_roads from bc_roads;
-- How large is the city of Prince George, in hectares?
select ST_Area(the_geom)/10000 as hectares from bc_municipality where name = 'PRINCE GEORGE';
-- What is the largest municipality in the province, by area?
select name, ST_Area(the_geom)/10000 as hectares from bc_municipality order by hectares desc limit 1;
Data Integrity
-- How many invalid voting area polygons?
select count(*) from bc_voting_areas where not ST_IsValid(the_geom);
Distance Queries
-- Where is the Tabor Arms pub?
select ST_AsText(the_geom) from bc_pubs where name ilike 'Tabor Arms%';
-- How many Unity Party supporters live within 2km of Tabor Arms pub?
select sum(upbc) as unity_voters from bc_voting_areas where ST_DWithin(the_geom, ST_GeomFromText('POINT(1209385 996204)', 3005), 2000);
Spatial Joins
-- Find all pubs located within 250 meters of a hospital.
select h.name, p.name from bc_hospitals h, bc_pubs p where ST_DWithin(h.the_geom, p.the_geom, 250);
-- Summarize the 2000 provincial election results by municipality.
select m.name, sum(v.ndp) as ndp, sum(v.lib) as liberal, sum(v.gp) as green, sum(v.upbc) as unity, sum(v.vtotal) as total from bc_voting_areas v, bc_municipality m where ST_Intersects(v.the_geom, m.the_geom) group by m.name order by m.name;
Overlays
-- Clip the voting areas by the Prince George boundary.
create table pg_voting_areas as select ST_Intersection(v.the_geom, m.the_geom) as intersection_geom, ST_Area(v.the_geom) as va_area, v.*, m.name from bc_voting_areas v, bc_municipality m where ST_Intersects(v.the_geom, m.the_geom) and m.name = 'PRINCE GEORGE';
-- What is the area of the clipped features?
select sum(ST_Area(intersection_geom)) from pg_voting_areas;
-- How does that compare to the clipping polygon? (Should be the same.)
select ST_Area(the_geom) from bc_municipality where name = 'PRINCE GEORGE';
Coordinate Projection
-- What is the proj4 definition of srid 3005?
select proj4text from spatial_ref_sys where srid=3005;
-- What are the native coordinate of the first road in the roads table?
select ST_AsText(the_geom) from bc_roads limit 1;
-- What do those coordinates look like in lon/lat?
select ST_AsText(ST_Transform(the_geom,4326)) from bc_roads limit 1;
Exercises
-- What is the perimeter of the municipality of Vancouver?
select ST_Perimeter(the_geom) from bc_municipality where name = 'VANCOUVER';
-- What is the total area of all voting areas in hectares?
select sum(ST_Area(the_geom))/10000 as hectares from bc_voting_areas;
-- What is the total area of all voting areas with more than 100 voters in them?
select sum(ST_Area(the_geom))/10000 as hectares from bc_voting_areas where vtotal > 100;
-- What is the length in kilometers of all roads named Douglas St?
select sum(ST_Length(the_geom))/1000 as kilometers from bc_roads where name = 'Douglas St';
Advanced Exercises
-- What is the length in kilometers of Douglas St in Victoria?
select sum(ST_Length(r.the_geom))/1000 as kilometers from bc_roads r, bc_municipality m where ST_Contains(m.the_geom, r.the_geom) and r.name = 'Douglas St' and m.name = 'VICTORIA';
-- What two pubs have the most Green Party supporters within 500 meters of them?
select p.name, p.city, sum(v.gp) as greens from bc_pubs p, bc_voting_areas v where ST_DWithin(p.the_geom, v.the_geom, 500) group by p.name, p.city order by greens desc limit 2;
-- What is the latitude of the most southerly hospital in the province?
select ST_Y(ST_Transform(the_geom,4326)) as latitude from bc_hospitals order by latitude asc limit 1;
-- What were the percentage NDP and Liberal vote within the city limits of Prince George in the 2000 provincial election?
select 100*sum(v.ndp)/sum(v.vtotal) as ndp, 100*sum(v.lib)/sum(v.vtotal) as liberal from bc_voting_areas v, bc_municipality m where ST_Contains(m.the_geom, v.the_geom) and m.name = 'PRINCE GEORGE';
-- What is the largest voting area polygon that has a hole?
select gid, id, ST_Area(the_geom) as area from bc_voting_areas where ST_NRings(the_geom) > 1 order by area desc limit 1;
-- How many NDP voters live within 50 meters of Simcoe St in Victoria?
select sum(v.ndp) as ndp from bc_voting_areas v, bc_municipality m, bc_roads r where ST_Contains(m.the_geom, r.the_geom) and r.name = 'Simcoe St' and m.name = 'VICTORIA' and ST_DWithin(r.the_geom, v.the_geom, 50);
Wednesday, May 2, 2012
Hadoop Demo
Create MapReduce application:
a. specify input and output locations
Input/Output => <key,value>
key and value should implement "Writable", and key should implement "WritableComparable".
b. specify map and reduce functions
c. job client submit the job and configuration to JobTracker.
Example1: WordCount
public class WordCount{
public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable{
private final static IntWritable one = new IntWritable(1); //value
private Text word = new Text();//use word as key
public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException{
String line = value.toString();
StringTokenizer tokenizer = new StringTokenizer(line);
while(tokenizer.hasMoreTokens()){
word.set(tokenizer.nextToken()); //set key
output.collect(word,one); //value will always be one
}
}
}
public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable>{
public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException{
int sum=0
//for values with the same key, sum up the value (1) and output <key, sum>
while(values.hasNext()){
sum+=values.next().get();
}
output.collect(key,new IntWritable(sum));
}
}
public static void main(String[] args) throws Exception{
JobConf conf = new JobConf(WordCount.class);
conf.setJobName("wordcount");
conf.setOutputKeyClass(Text.class); //word
conf.setOutputValueClass(IntWritable.class);//sum
conf.setMapperClass(Map.class);
//conf.setOutputKeyComparatorClass(Comparator.class)); //control the grouping of output
conf.setCombinerClass(Reduce.class);
conf.setReducerClass(Reduce.class);
conf.setInputFormat(TextInputFormat.class);//textfile
conf.setOutputFormat(TextOutputFormat.class);//textfile
FileInputFormat.setInputPaths(conf, new Path(args[0]));//location of input file
FileOutputFormat.setOutputPath(conf, new Path(args[1]));//location of output file
JobClient.run(conf);
}
}
PostgreSQL+PostGIS cheat sheet
1. Distinct Count
solution 1:
select count(*) from (select distinct * from table where column='') as dis
2. Check redundant tuple
soluction 1 (self join)
SELECTp.place_id, p.href,s.place_id,s.href
FROM place p
INNER JOIN place s ON p.href=s.href
WHERE p.place_id &s s.place_id
ORDER BY p.place_id, s.place_id;
3. Materialized View (http://tech.jonathangardner.net/wiki/PostgreSQL/Materialized_Views)
a. snapshot (view definition is never changed)
1) create a table to store view info
CREATE TABLE matviews(
mv_name NAME NOT NULL PRIMARY KEY,
b_name NAME NOT NULL,
last_refresh TIMESTAMP WITH TIME ZONE
);
2)create a function to insert a row in Mat_VIEWS and create a materialized view
CREATE OR REPLACE FUNCTION create_matview(NAME,NAME)
RETURNS VOID
SECURITY DEFINER
LANGUAGE plpgsql AS '
DECLARE
matview ALIAS FOR $1;
view_name ALIAS FOR $2;
entry matviews%ROWTYPE;
BEGIN
SELECT * INTO entry FROM matviews WHERE mv_name=matview;
IF FOUND THEN
RAISE EXCEPTION ''Materialized view ''''%'''' already exists.'', matview;
END IF;
EXECUTE ''REVOKE ALL ON '' || view_name || ''FROM PUBLIC'';
EXECUTE "GRANT SELECT ON '' || view_name || '' TO PUBLIC'';
EXECUTE ''CREATE TABLE '' || matview|| AS SELECT * FROM ''||view_name;
EXECUTE ''REVOKE ALL ON '' || matview || ' FROM PUBLIC'';
EXECUTE ''GRANT SELECT ON '' || matview || '' TO PUBLIC'';
INSERT INTO matviews (mv_name, b_name,last_refresh)
VALUES(matview, view_name, CURRENT_TIMESTAMP);
RETURN;
END
';
3) create a function to drop a materialized view
CREATE OR REPLACE FUNCTION drop_matview(NAME) RETURN VOID
SECURITY DEFINER
LANGUAGE plpgsql AS '
DECLARE
matview ALIAS FOR $1;
entry matviews%ROWTYPE;
BEGIN
SELECT * INTO entry FROM matview WHERE mv_name=matview;
IF NOT FOUND THEN
RAISE EXCEPTION ''Materialized view % does not exist.'',matview;
END IF;
EXECUTE ''DROP TABLE'' || matview;
DELETE FROM matviews WHERE mv_name=matview;
RETURN;
END
';
4) create a function to refresh a materialized view
CREATE OR REPLACE FUNCTION refresh_matview(NAME) RETURNS VOID
SECURITY DEFINER
LANGUAGE plpgsql AS '
DECLARE
matview ALIAS $1;
entry matviews%ROWTYP;
BEGIN
SELECT * INTO entry FROM matviews WHERE mv_name=matview;
IF NOT FOUND THEN
RAISE EXCEPTION ''Materialized view % does not exist.'',matview;
END IF;
EXECUTE '' DELETE FROM '' || matview;
EXECUTE ''INSERT INTO ''||matview || '' SELECT * FROM '' || entry.b_name;
UPDATE matviews
SET last_refresh = CURRENT_TIMESTAMP
WHERE mv_name=matview;
RETURN;
END
';
b. eager (update whenever the base view is changed)
4. delete existing duplicate rows in a table
Before execute the following statement, make sure you have enough buffers for the session (refer to cheat 5 and 6).
solution1
DELETE FROM table USING table ta
WHERE table.key=ta.key AND table.id>ta.id;
solution2
DELETE FROM table
WHERE ctid NOT IN
(SELECT MAX(s.ctid) FROM table s GROUP BY s.column_has_be_distinct);
solution3
CREATE OR REPLACE FUNCTION remove_duplicates(text,text) RETURN VOID AS $$
DECLARE
tablename ALIAS FOR $1;
duplicate_column ALIAS FRO $2;
BEGIN
EXECUTE 'CREATE TEMPORARY TABLE TABLE_DISTINCT_' ||tablename||'
AS (SELECT DISTINCT ON ('||duplicate_column||') * FROM ' ||tablename||
ORDER BY '||duplicate_column||' ASC);';
EXECUTE 'DELETE FROM ' || tablename||';';
EXECUTE 'INSERT INTO ' || tablename||' (SELECT * FROM _DISTINCT_' || tablename||');';
EXECUTE 'DROP TABLE_DISTINCT_'|| tablename||';';
RETURN;
END;
$$LANGUAGE plpgsql;
solution4
CREATE TEMPORARY TABLE t_temp ON COMMIT DROP AS
SELECT DISTINCT ON ('column_name') * FROM table;
TRUNCATE table;
INSERT INTO table SELECT * FROM t_temp;
5. find out the size of table
solution:
SELECT pg_size_pretty(pg_relation_size('tablename'));
6. set temporary buffers
solution:
SET temp_buffers =' XXXMB';
7. delete table rows which some columns in that table (a) don't exist in the other table (b).
solution:
DELETE FROM a WHERE a.column_name NOT IN (SELECT column_name FROM b);
8. Backup
Solution1:
pg_dump
EX. login db at localhost as postgres and dump database poi2 into poi.backup.gz
pg_dump -u postgres -h localhost poi2 | gzip>poi.backup.gz
Solution2:
file system level backup
Ex.
tar -cf poi2.backup.tar /usr/local/pgsql/data
PS. The database server must be shutdown before doing file system level backup and restore.
Solutioin3:
Continuous archiving
EX.
a. enable wal in postgresql.conf
wal_level=archive
archive_mode=on
archive_command='test ! -f /mnt/server/archivedir/%f && cp %p /mnt/server/archivedir/%f'
b.SELECT pg_start_backup('label',ture);
c. perform file system level backup, like tar
d. SELECT pg_stop_backup();
Solution4:
pg_basebackup
EX.
a. enable local replicate permission for posgres @localhost
b. enable max_wal_senders=1
c. pg_basebackup -U postgres -D /backup/data -F tar -z -l label
9. PostGIS import shape files into one table
10. Test if geometry column is valid
Solution 1:
select ST_IsValid(geom) from table;
Solution 2:
select ST_IsValidDetail(geom) from table;
Solution 3:
select ST_IsEmpty(geom) from table;
11. Test the geometry type
Solution 1:
select ST_GeometryType(geom) from table;
12. Mapping points in table a to a polygon in a table b
Solution1:
select * from table1
where st_within(geom, polygon);
13. get a boundary from a collection of geometries
Solution 1:
select st_extent(geom) from table;
14. Performance tuning:
Install pgtune (apt-get update; apt-get install pgtune) -> pgtune -i postgresql.conf -o psql.conf
create dedicate index for query (considering the order of fields, and the join condition)
Use multi-fields index
the order of fields in multi-fields index (A, B, C can be used to find AB, but not BC)
Vacuum the tables after delete, update, insert, or create index.
Order of fields and order of joint have a big impact on performance.
Explicit casting: "where x=3" v.s. "where x=cast(3 as smallint)", "abs(dest) = cast(111 as smallint)" v.s. "dest =111 or dest=-111" have huge different.
set enable_seqscan=no
Avoid nested loop, merge and sort is faster.
Use temp table
Add sequence in query to monitor the progress. select nextval('seq_progress'); select currval('seq_progress');
Enable stats_start_collector and stats_row_level to get auto vacuum
Partial index: create index index_name on table1(column_name) where column_nam like 'America%'
Specify sort_memory, temp_buffers for each connections or query.
solution 1:
select count(*) from (select distinct * from table where column='') as dis
2. Check redundant tuple
soluction 1 (self join)
SELECTp.place_id, p.href,s.place_id,s.href
FROM place p
INNER JOIN place s ON p.href=s.href
WHERE p.place_id &s s.place_id
ORDER BY p.place_id, s.place_id;
3. Materialized View (http://tech.jonathangardner.net/wiki/PostgreSQL/Materialized_Views)
a. snapshot (view definition is never changed)
1) create a table to store view info
CREATE TABLE matviews(
mv_name NAME NOT NULL PRIMARY KEY,
b_name NAME NOT NULL,
last_refresh TIMESTAMP WITH TIME ZONE
);
2)create a function to insert a row in Mat_VIEWS and create a materialized view
CREATE OR REPLACE FUNCTION create_matview(NAME,NAME)
RETURNS VOID
SECURITY DEFINER
LANGUAGE plpgsql AS '
DECLARE
matview ALIAS FOR $1;
view_name ALIAS FOR $2;
entry matviews%ROWTYPE;
BEGIN
SELECT * INTO entry FROM matviews WHERE mv_name=matview;
IF FOUND THEN
RAISE EXCEPTION ''Materialized view ''''%'''' already exists.'', matview;
END IF;
EXECUTE ''REVOKE ALL ON '' || view_name || ''FROM PUBLIC'';
EXECUTE "GRANT SELECT ON '' || view_name || '' TO PUBLIC'';
EXECUTE ''CREATE TABLE '' || matview|| AS SELECT * FROM ''||view_name;
EXECUTE ''REVOKE ALL ON '' || matview || ' FROM PUBLIC'';
EXECUTE ''GRANT SELECT ON '' || matview || '' TO PUBLIC'';
INSERT INTO matviews (mv_name, b_name,last_refresh)
VALUES(matview, view_name, CURRENT_TIMESTAMP);
RETURN;
END
';
3) create a function to drop a materialized view
CREATE OR REPLACE FUNCTION drop_matview(NAME) RETURN VOID
SECURITY DEFINER
LANGUAGE plpgsql AS '
DECLARE
matview ALIAS FOR $1;
entry matviews%ROWTYPE;
BEGIN
SELECT * INTO entry FROM matview WHERE mv_name=matview;
IF NOT FOUND THEN
RAISE EXCEPTION ''Materialized view % does not exist.'',matview;
END IF;
EXECUTE ''DROP TABLE'' || matview;
DELETE FROM matviews WHERE mv_name=matview;
RETURN;
END
';
4) create a function to refresh a materialized view
CREATE OR REPLACE FUNCTION refresh_matview(NAME) RETURNS VOID
SECURITY DEFINER
LANGUAGE plpgsql AS '
DECLARE
matview ALIAS $1;
entry matviews%ROWTYP;
BEGIN
SELECT * INTO entry FROM matviews WHERE mv_name=matview;
IF NOT FOUND THEN
RAISE EXCEPTION ''Materialized view % does not exist.'',matview;
END IF;
EXECUTE '' DELETE FROM '' || matview;
EXECUTE ''INSERT INTO ''||matview || '' SELECT * FROM '' || entry.b_name;
UPDATE matviews
SET last_refresh = CURRENT_TIMESTAMP
WHERE mv_name=matview;
RETURN;
END
';
b. eager (update whenever the base view is changed)
4. delete existing duplicate rows in a table
Before execute the following statement, make sure you have enough buffers for the session (refer to cheat 5 and 6).
solution1
DELETE FROM table USING table ta
WHERE table.key=ta.key AND table.id>ta.id;
solution2
DELETE FROM table
WHERE ctid NOT IN
(SELECT MAX(s.ctid) FROM table s GROUP BY s.column_has_be_distinct);
solution3
CREATE OR REPLACE FUNCTION remove_duplicates(text,text) RETURN VOID AS $$
DECLARE
tablename ALIAS FOR $1;
duplicate_column ALIAS FRO $2;
BEGIN
EXECUTE 'CREATE TEMPORARY TABLE TABLE_DISTINCT_' ||tablename||'
AS (SELECT DISTINCT ON ('||duplicate_column||') * FROM ' ||tablename||
ORDER BY '||duplicate_column||' ASC);';
EXECUTE 'DELETE FROM ' || tablename||';';
EXECUTE 'INSERT INTO ' || tablename||' (SELECT * FROM _DISTINCT_' || tablename||');';
EXECUTE 'DROP TABLE_DISTINCT_'|| tablename||';';
RETURN;
END;
$$LANGUAGE plpgsql;
solution4
CREATE TEMPORARY TABLE t_temp ON COMMIT DROP AS
SELECT DISTINCT ON ('column_name') * FROM table;
TRUNCATE table;
INSERT INTO table SELECT * FROM t_temp;
5. find out the size of table
solution:
SELECT pg_size_pretty(pg_relation_size('tablename'));
6. set temporary buffers
solution:
SET temp_buffers =' XXXMB';
7. delete table rows which some columns in that table (a) don't exist in the other table (b).
solution:
DELETE FROM a WHERE a.column_name NOT IN (SELECT column_name FROM b);
8. Backup
Solution1:
pg_dump
EX. login db at localhost as postgres and dump database poi2 into poi.backup.gz
pg_dump -u postgres -h localhost poi2 | gzip>poi.backup.gz
Solution2:
file system level backup
Ex.
tar -cf poi2.backup.tar /usr/local/pgsql/data
PS. The database server must be shutdown before doing file system level backup and restore.
Solutioin3:
Continuous archiving
EX.
a. enable wal in postgresql.conf
wal_level=archive
archive_mode=on
archive_command='test ! -f /mnt/server/archivedir/%f && cp %p /mnt/server/archivedir/%f'
b.SELECT pg_start_backup('label',ture);
c. perform file system level backup, like tar
d. SELECT pg_stop_backup();
touch /var/lib/pgsql/backup_in_progress
psql -c "select pg_start_backup('hot_backup');"
tar -cf /var/lib/pgsql/backup.tar /var/lib/pgsql/data/
psql -c "select pg_stop_backup();"
rm /var/lib/pgsql/backup_in_progress
tar -rf /var/lib/pgsql/backup.tar /var/lib/pgsql/archive/
pg_basebackup
EX.
a. enable local replicate permission for posgres @localhost
b. enable max_wal_senders=1
c. pg_basebackup -U postgres -D /backup/data -F tar -z -l label
9. PostGIS import shape files into one table
Solution 1:
shp2pgsql -s 915 -a[-c] -g geom shapfile.shp schema.table | psql -U username -W dbname
-s: srid, -g: geometry_column_name10. Test if geometry column is valid
Solution 1:
select ST_IsValid(geom) from table;
Solution 2:
select ST_IsValidDetail(geom) from table;
Solution 3:
select ST_IsEmpty(geom) from table;
11. Test the geometry type
Solution 1:
select ST_GeometryType(geom) from table;
12. Mapping points in table a to a polygon in a table b
Solution1:
select * from table1
where st_within(geom, polygon);
13. get a boundary from a collection of geometries
Solution 1:
select st_extent(geom) from table;
14. Performance tuning:
Install pgtune (apt-get update; apt-get install pgtune) -> pgtune -i postgresql.conf -o psql.conf
create dedicate index for query (considering the order of fields, and the join condition)
Use multi-fields index
the order of fields in multi-fields index (A, B, C can be used to find AB, but not BC)
Vacuum the tables after delete, update, insert, or create index.
Order of fields and order of joint have a big impact on performance.
Explicit casting: "where x=3" v.s. "where x=cast(3 as smallint)", "abs(dest) = cast(111 as smallint)" v.s. "dest =111 or dest=-111" have huge different.
set enable_seqscan=no
Avoid nested loop, merge and sort is faster.
Use temp table
Add sequence in query to monitor the progress. select nextval('seq_progress'); select currval('seq_progress');
Enable stats_start_collector and stats_row_level to get auto vacuum
Partial index: create index index_name on table1(column_name) where column_nam like 'America%'
Specify sort_memory, temp_buffers for each connections or query.
Subscribe to:
Posts (Atom)